小内存环境下的私有合成数据生成

Research

arXiv

小内存环境下的私有合成数据生成

Private Synthetic Data Generation in Small Memory

摘要 Abstract

我们提出了$\mathtt{PrivHP}$，这是一种具有\textit{差分隐私}保证的轻量级合成数据生成器。$\mathtt{PrivHP}$采用了一种新颖的层次分解方法，能够在有限内存中近似输入的累积分布函数（CDF）。它在平衡层次深度、噪声添加以及低频子域剪枝的同时保留高频子域。私有草图能够高效地估计子域频率，而无需访问完整数据。其关键特性是剪枝参数$k$，用于控制空间复杂度与实用性之间的权衡。我们定义了偏斜度量$\mathtt{tail}_k$，捕获除了前$k$个子域频率之外的所有信息。给定一个数据集$\mathcal{X}$，$\mathtt{PrivHP}$使用$M=\mathcal{O}(k\log^2 |X|)$的空间，并且对于输入域$\Omega = [0,1]$，确保$\varepsilon$-差分隐私。它生成的生成器与经验分布之间的预期Wasserstein距离为：\[ \mathcal{O}\left(\frac{\log^2 M}{\varepsilon n} + \frac{||\mathtt{tail}_k(\mathcal{X})||_1}{M n}\right) \] 这种可参数化的权衡提供了先前工作中无法实现的灵活性。此外，我们还提供了可解释的实用性界限，考虑了层次深度、隐私噪声、剪枝以及频率估计误差等因素。

We propose $\mathtt{PrivHP}$, a lightweight synthetic data generator with \textit{differential privacy} guarantees. $\mathtt{PrivHP}$ uses a novel hierarchical decomposition that approximates the input's cumulative distribution function (CDF) in bounded memory. It balances hierarchy depth, noise addition, and pruning of low-frequency subdomains while preserving frequent ones. Private sketches estimate subdomain frequencies efficiently without full data access. A key feature is the pruning parameter $k$, which controls the trade-off between space and utility. We define the skew measure $\mathtt{tail}_k$, capturing all but the top $k$ subdomain frequencies. Given a dataset $\mathcal{X}$, $\mathtt{PrivHP}$ uses $M=\mathcal{O}(k\log^2 |X|)$ space and, for input domain $\Omega = [0,1]$, ensures $\varepsilon$-differential privacy. It yields a generator with expected Wasserstein distance: \[ \mathcal{O}\left(\frac{\log^2 M}{\varepsilon n} + \frac{||\mathtt{tail}_k(\mathcal{X})||_1}{M n}\right) \] from the empirical distribution. This parameterized trade-off offers a level of flexibility unavailable in prior work. We also provide interpretable utility bounds that account for hierarchy depth, privacy noise, pruning, and frequency estimation errors.