A Refined Analysis of Massive Activations in LLMs

arXiv

Louis Owen, Nilabhra Roy Chowdhury, Abhay Kumar, Fabian Güra

Abstract

Motivated in part by their relevance for low-precision training and quantization, massive activations in large language models (LLMs) have recently emerged as a topic of interest. However, existing analyses are limited in scope, and their generalizability across architectures is unclear. This paper helps address some of these gaps by analyzing massive activations across a broad range of LLMs, including both GLU-based and non-GLU-based architectures. Our findings challenge several prior assumptions, most importantly: (1) not all massive activations are detrimental, i.e., suppressing them does not lead to an explosion of perplexity or a collapse in downstream task performance; (2) proposed mitigation strategies such as Attention KV bias are model-specific and ineffective in certain cases. We consequently investigate novel hybrid mitigation strategies; in particular, pairing Target Variance Rescaling (TVR) with Attention KV bias or Dynamic Tanh (DyT) successfully balances the mitigation of massive activations with preserved downstream model performance in the scenarios we investigated. Our code is available at: https://github.com/bluorion-com/refine_massive_activations.
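To make two of the terms in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It flags "massive" entries in a toy hidden-state vector and applies an element-wise Dynamic Tanh squashing of the form gamma * tanh(alpha * x) + beta. The detection thresholds (magnitude above 100 and above 1000 times the median magnitude) are an assumed heuristic borrowed from earlier work on massive activations; the abstract does not state the criterion used in this paper. The function names, the toy data, and the scalar `alpha`, `gamma`, and `beta` values are hypothetical. In a real model, `gamma` and `beta` would typically be learnable per-channel parameters.

```python
import math
from statistics import median


def find_massive_activations(hidden, abs_threshold=100.0, ratio_threshold=1000.0):
    """Return indices whose magnitude is large both in absolute terms and
    relative to the median magnitude of the vector (assumed heuristic)."""
    magnitudes = [abs(v) for v in hidden]
    med = median(magnitudes)
    return [
        i
        for i, m in enumerate(magnitudes)
        if m > abs_threshold and m > ratio_threshold * med
    ]


def dyt(hidden, alpha=0.5, gamma=1.0, beta=0.0):
    """Element-wise Dynamic Tanh: gamma * tanh(alpha * x) + beta.
    Scalars are used here for simplicity (illustrative values only)."""
    return [gamma * math.tanh(alpha * v) + beta for v in hidden]


# Toy hidden state: mostly small values plus one extreme outlier.
hidden = [0.1, -0.2, 0.05, 0.15, -0.1, 2500.0, 0.3, -0.25]

massive = find_massive_activations(hidden)  # indices flagged as massive
squashed = dyt(hidden)                      # outputs bounded by |gamma| + |beta|

print("massive indices:", massive)
print("max |DyT output|:", max(abs(v) for v in squashed))
```

Because tanh saturates, the outlier is mapped to a bounded value while small entries pass through almost linearly. This shows the squashing effect only; whether that preserves downstream performance is the empirical question the paper studies.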