Abstract
Cross-validation is a common method for estimating the predictive performance of machine learning models. In data-scarce settings, where one typically wishes to maximize the number of instances used for training, "leave-one-out cross-validation" is often used. In this design, a separate model is built to predict each data instance after training on all other instances. Because each trained model has only a single test instance, predictions are aggregated across the entire dataset to calculate common performance metrics such as the area under the receiver operating characteristic curve (AUC) or the R² score. In this work, we demonstrate that this approach creates a negative correlation between the average label of each training fold and the label of its corresponding test instance, a phenomenon that we term distributional bias. Because machine learning models tend to regress to the mean of their training data, this distributional bias tends to degrade both performance evaluation and hyperparameter optimization. We show that the effect generalizes to leave-P-out cross-validation, that it persists across a wide range of modeling and evaluation approaches, and that it can lead to a bias against stronger regularization. To address this, we propose a generalizable rebalanced cross-validation approach that corrects for distributional bias in both classification and regression. We demonstrate that our approach improves cross-validation performance evaluation in synthetic simulations, across machine learning benchmarks, and in several published leave-one-out analyses.
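The negative correlation described above can be checked directly. With n labels whose overall mean is ȳ, the training fold that leaves out instance i has mean label (n·ȳ − y_i)/(n − 1), which is a decreasing linear function of y_i. The following is a minimal sketch under stated assumptions, not the analysis code from this work: the label vector is hypothetical, and the helper names `loo_training_means`, `pearson`, and `auc` are ours. It also scores a "dummy" predictor that outputs only the training-fold mean, to illustrate how regression to the mean interacts with aggregated leave-one-out predictions.

```python
from statistics import mean


def loo_training_means(labels):
    """Mean label of each leave-one-out training fold (all labels except labels[i])."""
    total = sum(labels)
    n = len(labels)
    return [(total - y) / (n - 1) for y in labels]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical binary labels, chosen only for illustration.
labels = [0, 0, 0, 1, 1, 0, 1, 0, 1, 1]

# Mean training label seen by the model that predicts each held-out instance.
train_means = loo_training_means(labels)

# Correlation between training-fold mean and held-out label.
r = pearson(train_means, labels)

# A dummy predictor that outputs the training-fold mean, scored on the
# predictions aggregated across all leave-one-out folds.
dummy_auc = auc(labels, train_means)

print(f"correlation = {r:.6f}, dummy AUC = {dummy_auc}")
```

In this toy setting the correlation is −1 up to floating-point error, and the dummy predictor's aggregated AUC is 0 rather than the 0.5 one would expect from an uninformative model, because every positive instance is held out of a training fold with a slightly lower mean than any negative instance.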