arXiv

Representation Bending for Large Language Model Safety

Ashkan Yousefpour, Taeheon Kim, Ryan S. Kwon, Seungbeen Lee, Wonje Jeung, Seungju Han, Alvin Wan, Harrison Ngan, Youngjae Yu, Jonghyun Choi

Abstract

Large Language Models (LLMs) have emerged as powerful tools, but their inherent safety risks, ranging from harmful content generation to broader societal harms, pose significant challenges. These risks can be amplified by recent adversarial attacks, fine-tuning vulnerabilities, and the increasing deployment of LLMs in high-stakes environments. Existing safety-enhancing techniques, such as fine-tuning with human feedback or adversarial training, remain vulnerable because they address specific threats and often fail to generalize to unseen attacks, or require manual system-level defenses. This paper introduces RepBend, a novel approach that fundamentally disrupts the representations underlying harmful behaviors in LLMs, offering a scalable solution to enhance (potentially inherent) safety. RepBend brings the idea of activation steering (simple vector arithmetic for steering a model's behavior during inference) to loss-based fine-tuning. Through extensive evaluation, RepBend achieves state-of-the-art performance, outperforming prior methods such as Circuit Breaker, RMU, and NPO, with up to a 95% reduction in attack success rates across diverse jailbreak benchmarks, all with negligible reduction in model usability and general capabilities.
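To make the "simple vector arithmetic" behind activation steering concrete, here is a minimal toy sketch in plain Python. It is not the RepBend implementation. The three-dimensional "activations", the function names, and in particular the form of `bending_loss` are assumptions made for illustration, since the abstract does not specify the actual loss. The sketch only contrasts adding a direction at inference time with expressing a similar preference as a training loss.

```python
# Toy illustration only: hypothetical low-dimensional "activations",
# not real model hidden states and not the paper's actual objective.

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def steering_vector(safe_acts, harmful_acts):
    """Difference of means: points from the harmful cluster toward the safe one."""
    safe_mean = mean_vector(safe_acts)
    harmful_mean = mean_vector(harmful_acts)
    return [s - h for s, h in zip(safe_mean, harmful_mean)]

def steer(activation, direction, alpha=1.0):
    """Inference-time steering: add a scaled direction to one activation."""
    return [a + alpha * d for a, d in zip(activation, direction)]

def bending_loss(original_act, tuned_act, is_harmful):
    """Assumed loss-based analogue (hypothetical form, not from the source).

    Safe inputs: penalize drift from the original activation.
    Harmful inputs: reward distance from the original activation.
    """
    sq_dist = sum((t - o) ** 2 for t, o in zip(tuned_act, original_act))
    return -sq_dist if is_harmful else sq_dist

if __name__ == "__main__":
    safe = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
    harmful = [[0.0, 1.0, 0.0], [0.0, 3.0, 0.0]]
    direction = steering_vector(safe, harmful)
    print("steering direction:", direction)
    print("steered activation:", steer([0.0, 2.0, 0.0], direction, alpha=0.5))
```

In this sketch the steering direction is applied at inference time, while a term like `bending_loss` would be minimized during fine-tuning so that the change is built into the weights. How RepBend actually combines and weights its loss terms is defined in the paper itself, not here.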