摘要 Abstract
我们研究了在具有高斯协变量(未知协方差结构)的普通最小二乘回归问题中,以隐私保护的方式获得预测误差保证的任务。我们提供了第一个在纯差分隐私和近似差分隐私下都具有最优样本复杂度且运行时间为多项式的算法。我们证明,任何对该算法样本复杂度的改进都将违反统计查询或信息论的下界。此外,我们的算法对少量任意异常值具有鲁棒性,并且能够根据异常值的比例实现最优的误差率。相比之下,所有先前有效的算法要么具有次优的维度依赖样本复杂度(例如依赖于协变量的条件数),要么在隐私参数上表现出多项式级别的退化。我们的技术贡献分为两部分:首先,我们在平方和框架内利用高斯分布的弹性保证。由此,我们获得了针对最优鲁棒性和样本复杂度的高效平方和回归算法。其次,我们将最近提出的抗差性到隐私框架 [HKMN23, (arXiv:2212.05015)] 推广至考虑输入样本协方差所诱导的几何结构。该框架的关键在于抗差估计器必须是平方和算法,结合这两步可得到一个达到最优样本量的私有回归算法。我们认为这些技术本身具有独立的研究价值,并通过其展示了对带有协方差感知的均值估计问题的高效算法,同时在隐私参数上实现了最优依赖性。
We consider the task of privately obtaining prediction error guarantees in ordinary least-squares regression problems with Gaussian covariates (with unknown covariance structure). We provide the first sample-optimal polynomial time algorithm for this task under both pure and approximate differential privacy. We show that any improvement to the sample complexity of our algorithm would violate either statistical-query or information-theoretic lower bounds. Additionally, our algorithm is robust to a small fraction of arbitrary outliers and achieves optimal error rates as a function of the fraction of outliers. In contrast, all prior efficient algorithms either incurred sample complexities with sub-optimal dimension dependence, scaling with the condition number of the covariates, or obtained a polynomially worse dependence on the privacy parameters. Our technical contributions are two-fold: first, we leverage resilience guarantees of Gaussians within the sum-of-squares framework. As a consequence, we obtain efficient sum-of-squares algorithms for regression with optimal robustness rates and sample complexity. Second, we generalize the recent robustness-to-privacy framework [HKMN23, (arXiv:2212.05015)] to account for the geometry induced by the covariance of the input samples. This framework crucially relies on the robust estimators to be sum-of-squares algorithms, and combining the two steps yields a sample-optimal private regression algorithm. We believe our techniques are of independent interest, and we demonstrate this by obtaining an efficient algorithm for covariance-aware mean estimation, with an optimal dependence on the privacy parameters.