On the Implicit Relation Between Low-Rank Adaptation and Differential Privacy

Abstract

A prominent approach in natural language processing is to pre-train large models on general-domain data and then adapt them to specific tasks or domains. As models grow in size, fully fine-tuning all of their parameters becomes increasingly impractical. To address this, methods for low-rank task adaptation of language models have been proposed, e.g., LoRA and FLoRA. These methods keep the pre-trained model weights fixed and add trainable low-rank decomposition matrices, called adapters, to some layers of the transformer architecture. Compared to full fine-tuning, this significantly reduces the number of trainable parameters required for downstream tasks. In this work, we look at low-rank adaptation through the lens of data privacy. We show theoretically that the low-rank adaptation used in LoRA and FLoRA injects random noise into the batch gradients with respect to the adapter parameters. We quantify the variance of the injected noise and show that the smaller the adaptation rank, the larger the noise variance. By establishing a Berry-Esseen type bound on the total variation distance between the distribution of the injected noise and a Gaussian distribution with the same variance, we show that the dynamics of low-rank adaptation are close to those of differentially private fine-tuning of the adapters. Finally, using the Johnson-Lindenstrauss lemma, we show that, when augmented with gradient scaling, low-rank adaptation is very close to running the DPSGD algorithm with a fixed noise scale to fine-tune the adapters. Our theoretical findings, supported by our experimental results, show that low-rank adaptation, besides reducing space and computational complexity, implicitly provides privacy protection with respect to the fine-tuning data, without incurring the high space complexity of DPSGD.
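For readers less familiar with the two ingredients compared above, the following is a minimal NumPy sketch of (i) a LoRA-style adapter, where a frozen weight is combined with a trainable low-rank product, and (ii) a generic DPSGD-style gradient step with per-sample clipping and Gaussian noise. It is an illustration under stated assumptions, not the construction analyzed in this work: the function names, the dimensions `d`, `k`, `r`, and the parameters `clip_norm` and `noise_multiplier` are all hypothetical choices made for this example.

```python
import numpy as np

def lora_forward(W0, B, A, x):
    """Apply a frozen weight W0 plus a trainable low-rank update B @ A to x."""
    return W0 @ x + B @ (A @ x)

def trainable_params(d, k, r):
    """Return (adapter parameter count, full fine-tuning parameter count)."""
    return r * (d + k), d * k

def dpsgd_noisy_gradient(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-sample gradient, sum, add Gaussian noise, then average.

    per_sample_grads has shape (batch_size, num_params).
    """
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    # Scale down only the gradients whose norm exceeds clip_norm.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_sample_grads * scale
    # Noise standard deviation is noise_multiplier * clip_norm (assumed convention).
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=clipped.shape[1])
    return (clipped.sum(axis=0) + noise) / len(per_sample_grads)

rng = np.random.default_rng(0)
d, k, r = 64, 32, 4                       # hypothetical layer sizes and rank
W0 = rng.normal(size=(d, k))              # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, k))   # trainable, random initialization
B = np.zeros((d, r))                      # trainable, zero initialization
x = rng.normal(size=k)

y = lora_forward(W0, B, A, x)
n_adapter, n_full = trainable_params(d, k, r)
```

With these illustrative sizes the adapter has `r * (d + k)` trainable parameters against `d * k` for full fine-tuning. The DPSGD-style step needs every per-sample gradient before clipping, which is the source of the space overhead that the abstract contrasts with low-rank adaptation.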