Abstract
Robust regression aims to estimate an unknown regression function in the presence of outliers, heavy-tailed noise, or contaminated data, all of which can severely degrade estimator performance. Most existing theoretical results in robust regression assume that the noise has a finite absolute mean, an assumption violated by certain distributions, such as the Cauchy distribution and some Pareto distributions. In this paper, we introduce a generalized Cauchy noise framework that accommodates all noise distributions with a finite moment of some positive order, even when the absolute mean is infinite. Within this framework, we study the \textit{kernel Cauchy ridge regressor} (\textit{KCRR}), which achieves robustness by minimizing a regularized empirical Cauchy risk. To derive the $L_2$-risk bound for KCRR, we establish a connection between the excess Cauchy risk and the $L_2$-risk, showing that the two risks are equivalent when the scale parameter of the Cauchy loss is sufficiently large. Furthermore, under the assumption that the regression function is H\"older smooth, we derive excess Cauchy risk bounds for KCRR that improve as the scale parameter decreases. By balancing these two opposing effects of the scale parameter, namely its effect on the excess Cauchy risk bound and on the equivalence with the $L_2$-risk, we establish an almost minimax-optimal convergence rate for KCRR in terms of the $L_2$-risk, highlighting the robustness of the Cauchy loss to various types of noise. Finally, we validate the effectiveness of KCRR through experiments on both synthetic and real-world datasets under diverse noise corruption scenarios.