arXiv

Interpretable Steering of Large Language Models with Feature Guided Activation Additions

Samuel Soo, Chen Guang, Wesley Teng, Chandrasekaran Balaganesh, Tan Guoxian, Yan Ming

Abstract

Effective and reliable control over large language model (LLM) behavior is a significant challenge. Activation steering methods, which add steering vectors to a model's hidden states, are a promising approach, but existing techniques often lack precision and interpretability in how they influence model outputs. We introduce Feature Guided Activation Additions (FGAA), a novel activation steering method that builds on insights from Contrastive Activation Addition (CAA) and Sparse Autoencoder-Targeted Steering (SAE-TS). By operating in the latent space of a Sparse Autoencoder (SAE) and employing optimization techniques to select the desired SAE features, FGAA constructs more precise steering vectors that steer more effectively while keeping the steered model's outputs coherent. Evaluations on the Gemma-2-2B and Gemma-2-9B models across a variety of steering tasks show that FGAA outperforms the existing steering methods CAA, SAE decoder steering, and SAE-TS. Our results also reveal a trade-off between steering scale and general model capabilities that is consistent across all tested steering methods.
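
To make the basic mechanism concrete, the following is a minimal sketch of generic activation addition with a CAA-style steering vector, taken here as the difference between the mean activations of positive and negative examples. It does not implement FGAA, whose SAE-based feature selection is not specified in the abstract. The function names, the toy 3-dimensional activations, and the scale value are all illustrative assumptions rather than anything from the paper.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def contrastive_steering_vector(pos: List[Vector], neg: List[Vector]) -> Vector:
    """CAA-style vector: mean positive activation minus mean negative activation."""
    mean_pos, mean_neg = mean_vector(pos), mean_vector(neg)
    return [p - q for p, q in zip(mean_pos, mean_neg)]


def steer(hidden: Vector, vector: Vector, scale: float) -> Vector:
    """Add the scaled steering vector to a single hidden state."""
    return [h + scale * v for h, v in zip(hidden, vector)]


# Toy activations (made-up numbers, not data from the paper).
positive = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
negative = [[0.0, 0.0, 2.0], [0.0, 2.0, 2.0]]

v = contrastive_steering_vector(positive, negative)
steered = steer([0.5, 0.5, 0.5], v, scale=2.0)
```

In a real model the same addition would be applied to a chosen layer's hidden states at every token position, and the `scale` argument is the knob behind the trade-off between steering strength and general capability mentioned in the abstract.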