Representation Bending for Large Language Model Safety

Abstract

Large Language Models (LLMs) have emerged as powerful tools, but their inherent safety risks, ranging from harmful content generation to broader societal harms, pose significant challenges. These risks can be amplified by recent adversarial attacks, fine-tuning vulnerabilities, and the increasing deployment of LLMs in high-stakes environments. Existing safety-enhancing techniques, such as fine-tuning with human feedback or adversarial training, remain vulnerable because they address specific threats and often fail to generalize to unseen attacks, or they require manual system-level defenses. This paper introduces RepBend, a novel approach that fundamentally disrupts the representations underlying harmful behaviors in LLMs, offering a scalable solution to enhance (potentially inherent) safety. RepBend brings the idea of activation steering (simple vector arithmetic for steering a model's behavior during inference) to loss-based fine-tuning. In extensive evaluations, RepBend achieves state-of-the-art performance, outperforming prior methods such as Circuit Breaker, RMU, and NPO, with up to a 95% reduction in attack success rates across diverse jailbreak benchmarks and negligible reduction in model usability and general capabilities.
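
To make the phrase "simple vector arithmetic" concrete, the following is a minimal sketch of inference-time activation steering on toy vectors. It is not RepBend's training objective, which the abstract does not specify. The helper names (`steering_vector`, `steer`), the difference-of-means construction, and the made-up 3-dimensional activations are all assumptions chosen for illustration.

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def steering_vector(harmful_acts, safe_acts):
    """Difference of means: points from safe toward harmful activations."""
    mu_harmful = mean_vector(harmful_acts)
    mu_safe = mean_vector(safe_acts)
    return [h - s for h, s in zip(mu_harmful, mu_safe)]


def steer(hidden, direction, alpha):
    """Shift a hidden state against `direction` with strength `alpha`."""
    return [x - alpha * d for x, d in zip(hidden, direction)]


def dot(u, v):
    """Inner product, used here to measure alignment with the direction."""
    return sum(a * b for a, b in zip(u, v))


# Toy activations (hypothetical numbers, not taken from any real model).
harmful_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
safe_acts = [[0.0, 1.0, 1.0], [0.0, 3.0, 1.0]]

v = steering_vector(harmful_acts, safe_acts)
hidden = [3.0, 0.5, 1.0]
steered = steer(hidden, v, alpha=1.0)

print("steering vector:", v)
print("alignment before:", dot(hidden, v))
print("alignment after: ", dot(steered, v))
```

In this sketch the shift is applied to activations at inference time and the model weights are untouched. The abstract describes RepBend as carrying the same idea into loss-based fine-tuning, so that the change is learned into the weights instead of being applied at each forward pass.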