Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization
Abstract
Large Language Models (LLMs) built on Transformer architectures generalize remarkably well across a wide range of tasks. However, fine-tuning them for specific tasks remains resource-intensive because of their large parameter counts. In this paper, we investigate two phenomena related to the attention mechanism during LLM fine-tuning. The first, termed "Unequal Importance of Attention Matrices," concerns the impact of fine-tuning different weight matrices: optimizing the $\mathbf{W}_v$ matrix yields significantly better performance than optimizing the $\mathbf{W}_k$ matrix. Moreover, fine-tuning only the $\mathbf{W}_q$ and $\mathbf{W}_v$ matrices is computationally more efficient and delivers results comparable to, or even better than, fine-tuning all three matrices ($\mathbf{W}_q$, $\mathbf{W}_k$, and $\mathbf{W}_v$). The second, "Attention Matrices with Customized Learning Rate Lead to Better Convergence," emphasizes the importance of assigning distinct learning rates to these matrices. Specifically, setting a higher learning rate for $\mathbf{W}_v$ than for $\mathbf{W}_q$ and $\mathbf{W}_k$ accelerates convergence and improves performance. Building on these insights, we propose a new strategy that improves fine-tuning efficiency in terms of both storage and time. Experimental results on benchmark datasets validate the effectiveness of this approach and support our theoretical findings. Our analysis lays the theoretical groundwork for configuring and improving lightweight algorithms for LLM fine-tuning.
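To make the two phenomena concrete, the following is a minimal sketch of how such a configuration might be expressed: only $\mathbf{W}_q$ and $\mathbf{W}_v$ are selected for training, and $\mathbf{W}_v$ receives a larger learning rate. It is not the implementation used in this paper. The function name `build_attention_param_groups`, the substrings `q_proj` and `v_proj` used to recognize the matrices, and the values of `base_lr` and `value_lr_ratio` are all assumptions chosen for illustration.

```python
from typing import Any, Dict, Iterable, List, Tuple


def build_attention_param_groups(
    named_params: Iterable[Tuple[str, Any]],
    base_lr: float = 1e-4,        # hypothetical base learning rate for W_q
    value_lr_ratio: float = 4.0,  # hypothetical ratio lr(W_v) / lr(W_q), not from the paper
) -> List[Dict[str, Any]]:
    """Select W_q and W_v for training and give W_v the larger learning rate."""
    query_params: List[Any] = []
    value_params: List[Any] = []
    for name, param in named_params:
        # Assumed naming convention: query and value projections are
        # identified by "q_proj" / "v_proj" in the parameter name.
        if "q_proj" in name:
            query_params.append(param)
        elif "v_proj" in name:
            value_params.append(param)
        # Every other parameter, including W_k, is left out of the groups
        # and is therefore never updated by an optimizer built from them.
    return [
        {"params": query_params, "lr": base_lr},
        {"params": value_params, "lr": base_lr * value_lr_ratio},
    ]
```

The returned list has the per-parameter-group shape (dictionaries with `params` and `lr` keys) that PyTorch optimizers accept, so with a PyTorch model one could pass `model.named_parameters()` in and hand the result to an optimizer such as `torch.optim.AdamW`. Parameters left out of the groups would additionally need `requires_grad` set to `False` to avoid computing their gradients.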