Cross-validation: are you using the right version?
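Anyone with even a little machine learning background knows cross-validation as the main way to check whether an algorithm is trustworthy, that is, whether it generalizes to new data. But the devil is often in the details. The paper discussed here surveys which version of cross-validation earlier studies used, and points out methodological pitfalls that later researchers should watch for.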
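bioRxiv is the preprint server for biology. Posting preprints is not yet a common habit among biologists, but many computation-related papers can be found there. The title of this paper is "**Voodoo Machine Learning for Clinical Predictions**". "Voodoo" refers to the Voodoo religion, which makes for a rather amusing title.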
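The paper opens with background: the spread of smartphones and wearable devices has let researchers accumulate large amounts of data on human behavior, which makes it possible to use machine learning to predict psychiatric disorders. As more and more algorithms are applied, how to evaluate them quantitatively has become an important question.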
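The authors then introduce the paper's core concept, record-wise vs subject-wise cross-validation. In record-wise CV, individual records are split at random, so records from the same subject can land in both the training and the test set; in subject-wise CV, all of a subject's records stay on the same side of the split. The authors find that "record-wise cross-validation often massively overestimates the prediction accuracy of the algorithms", and that "this erroneous method is used by almost half of the retrieved studies that used accelerometers, wearable sensors, or smart phones to predict clinical outcomes." To make future results more trustworthy, they urge everyone to use the correct cross-validation method.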
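To make the distinction concrete, here is a minimal sketch of the two splitting schemes in plain Python. The function names `record_wise_folds` and `subject_wise_folds` and the round-robin fold assignment are illustrative assumptions, not the paper's code.

```python
from collections import defaultdict


def record_wise_folds(n_records, n_folds):
    """Yield (train_idx, test_idx), splitting records with no regard to subject."""
    for k in range(n_folds):
        test_idx = [i for i in range(n_records) if i % n_folds == k]
        train_idx = [i for i in range(n_records) if i % n_folds != k]
        yield train_idx, test_idx


def subject_wise_folds(subject_ids, n_folds):
    """Yield (train_idx, test_idx), keeping all records of a subject in one fold."""
    by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        by_subject[sid].append(idx)
    subjects = sorted(by_subject)
    for k in range(n_folds):
        # Assign whole subjects (not single records) to the test fold.
        test_subjects = set(subjects[k::n_folds])
        test_idx = [i for s in subjects if s in test_subjects for i in by_subject[s]]
        train_idx = [i for s in subjects if s not in test_subjects for i in by_subject[s]]
        yield train_idx, test_idx
```

With subject IDs such as `["a", "a", "b", "b"]`, the subject-wise folds never place the same subject on both sides of a split, while the record-wise folds do.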
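Why does record-wise CV overestimate accuracy? In the paper's words, if the algorithm can "detect the identity of the person based on the features, it can automatically also 'diagnose' the disease." In other words, if the algorithm can tell which subject a test record belongs to, it can infer whether that subject has the disease without having learned anything about the disease itself.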
The paper then gives a simple example: "Imagine we recruit 4 subjects, 2 healthy and 2 affected by Parkinson’s disease (PD). We have a machine learning algorithm that estimates if a person has PD based on their walking speed. Let our two healthy subjects have constant walking speeds of 1 meter per second (m/s) and 0.4 m/s, and our two PD patients 0.6 m/s and 0.2 m/s. If we do subject-wise CV, we will be unable to predict the performance of the slow healthy subject as well as the fast PD subject, resulting in a prediction accuracy of roughly 50%."
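With record-wise CV, however, the problem becomes trivial: each subject's records appear in both the training and the test set, so the algorithm only has to match a test record to the same subject's training records. In that case the reported prediction accuracy is 100%, but such a result does not generalize to new subjects.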
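The four-subject example can be reproduced with a short sketch. The 1-nearest-neighbour classifier, the three records per subject, and speeds expressed in whole cm/s are assumptions made for illustration; the paper does not specify a classifier here, and the exact subject-wise figure depends on that choice.

```python
# Toy data from the example: (subject id, walking speed in cm/s, label).
SUBJECTS = [
    ("H1", 100, "healthy"),
    ("H2", 40, "healthy"),
    ("P1", 60, "PD"),
    ("P2", 20, "PD"),
]
RECORDS_PER_SUBJECT = 3  # assumption: each subject contributes several identical records

records = [s for s in SUBJECTS for _ in range(RECORDS_PER_SUBJECT)]


def predict_1nn(train, speed):
    """Return the label of the training record whose speed is closest."""
    nearest = min(train, key=lambda r: abs(r[1] - speed))
    return nearest[2]


def record_wise_accuracy(data):
    """Leave-one-record-out: the subject's other records stay in training."""
    hits = 0
    for i, (_, speed, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        hits += predict_1nn(train, speed) == label
    return hits / len(data)


def subject_wise_accuracy(data):
    """Leave-one-subject-out: all records of the test subject are held out."""
    hits = 0
    for sid, speed, label in data:
        train = [r for r in data if r[0] != sid]
        hits += predict_1nn(train, speed) == label
    return hits / len(data)


print("record-wise :", record_wise_accuracy(records))
print("subject-wise:", subject_wise_accuracy(records))
```

Record-wise evaluation scores 100% because every test record has an identical twin from the same subject in the training set. Under subject-wise evaluation this particular classifier misclassifies every held-out subject, since each subject's nearest neighbour in speed has the opposite label; a different classifier would land elsewhere, but nowhere near the record-wise figure.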
Next, let's look at the flowchart the authors used for their survey of the existing literature.
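So how good are the conclusions of these papers? The figure below summarizes them: on the left are boxplots of the reported prediction error for each type of paper, and on the right are the papers' citation counts. Papers that used subject-wise CV report higher prediction error rates, which is what one would expect if record-wise CV underestimates the true error.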
After all that, let the authors' own words sum up this short post: "The use of machine learning for clinical predictions is growing in popularity. Yet, such erroneously positive results threaten the progress in the field. Such results might contribute to the problem of irreproducibility of research findings and thereby undermine the trust in both medicine and data science. Only with meaningful validation procedures can the transition into machine learning driven, data-rich medicine succeed."
Further reading