Ridge 杠杆得分采样用于 $\ell_p$ 子空间近似

Research

arXiv

Ridge Leverage Score Sampling for $\ell_p$ Subspace Approximation

摘要 Abstract

$\ell_p$ 子空间近似问题是一个NP难的低秩近似问题，它推广了中值超平面($p=1$)、主成分分析($p=2$)和中心超平面问题($p=\infty$)。为了应对NP难性，一种流行的方法是计算强核心集，即输入点的一个带权子集，该子集同时近似每个$k$维子空间的成本，通常相对于一个小常数$\epsilon$达到$(1+\epsilon)$的相对误差。我们得到了一个用于构建$\ell_p$子空间近似强核心集的算法，其大小为$\tilde{O}(k\epsilon^{-4/p})$（对于$p<2$）和$\tilde{O}(k^{p/2}\epsilon^{-p})$（对于$p>2$）。这一结果相比之前的工作有以下改进：- 对于所有$p\neq 2$，我们构造了第一个强核心集，其对$k$的依赖接近最优。在之前的工作中，[SW18] 构造了修改点的强核心集，其对$k$的依赖相似，而[HV20] 构造了真正的强核心集，但对$k$的依赖是多项式级别更差。- 对所有$p$，我们恢复或改进了最佳的$\epsilon$依赖性。特别是，对于$p>2$，[SW18] 的修改点强核心集的依赖性为$\epsilon^{-p^2/2}$，而[HV20] 的强核心集的依赖性为$\epsilon^{-3p}$。我们的算法基于根杠杆得分采样，这种方法对稀疏或结构化矩阵特别适合快速实现。我们的分析避免了使用代表性子空间定理[SW18]，这是所有先前独立维度$\ell_p$子空间近似强核心集的关键组成部分。我们的技术还导致了第一个在线强核心集，其在离线设置中的界限类似，解决了[WY23]提出的问题。所有先前的方法在这种情况下都会丢失$\mathrm{poly}(k)$因子，即使允许修改原始点。

The $\ell_p$ subspace approximation problem is an NP-hard low rank approximation problem that generalizes the median hyperplane ($p = 1$), principal component analysis ($p = 2$), and center hyperplane problems ($p = \infty$). A popular approach to cope with the NP-hardness is to compute a strong coreset, which is a weighted subset of input points that simultaneously approximates the cost of every $k$-dimensional subspace, typically to $(1+\epsilon)$ relative error for a small constant $\epsilon$. We obtain an algorithm for constructing a strong coreset for $\ell_p$ subspace approximation of size $\tilde O(k\epsilon^{-4/p})$ for $p<2$ and $\tilde O(k^{p/2}\epsilon^{-p})$ for $p>2$. This offers the following improvements over prior work: - We construct the first strong coresets with nearly optimal dependence on $k$ for all $p\neq 2$. In prior work, [SW18] constructed coresets of modified points with a similar dependence on $k$, while [HV20] constructed true coresets with polynomially worse dependence on $k$. - We recover or improve the best known $\epsilon$ dependence for all $p$. In particular, for $p > 2$, the [SW18] coreset of modified points had a dependence of $\epsilon^{-p^2/2}$ and the [HV20] coreset had a dependence of $\epsilon^{-3p}$. Our algorithm is based on sampling by root ridge leverage scores, which admits fast algorithms, especially for sparse or structured matrices. Our analysis avoids the use of the representative subspace theorem [SW18], which is a critical component of all prior dimension-independent coresets for $\ell_p$ subspace approximation. Our techniques also lead to the first nearly optimal online strong coresets for $\ell_p$ subspace approximation with similar bounds as the offline setting, resolving a problem of [WY23]. All prior approaches lose $\mathrm{poly}(k)$ factors in this setting, even when allowed to modify the original points.