CleanGen:针对大型语言模型生成任务中后门攻击的缓解方法
CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models
摘要 Abstract
大型语言模型(LLMs)在生成任务中的卓越表现使得从业者能够利用公开可用的模型来驱动定制应用,如聊天机器人和虚拟助手。然而,用于训练或微调这些LLMs的数据往往未被披露,这使攻击者有机会篡改数据并在模型中注入后门。本文开发了一种新颖的推理时防御方法,名为CLEANGEN,以减轻LLMs生成任务中的后门攻击。CLEANGEN是一种轻量且有效的解码策略,与最先进的(SOTA)LLMs兼容。我们开发CLEANGEN的核心见解在于,与其它LLMs相比,受后门影响的LLMs会为表示攻击者期望内容的标记赋予显著更高的概率。这些标记概率之间的差异使CLEANGEN能够识别攻击者青睐的可疑标记,并将其替换为由未受到同一攻击者影响的另一个LLM生成的标记,从而避免生成攻击者期望的内容。我们在五种最先进的后门攻击上评估了CLEANGEN。结果表明,对于所有五种后门攻击,CLEANGEN的攻击成功率(ASR)均低于五种最先进的基线防御方法。此外,部署CLEANGEN的LLMs在服务良性用户查询时仍能保持其响应的有用性,并且只增加了极小的计算开销。
The remarkable performance of large language models (LLMs) in generation tasks has enabled practitioners to leverage publicly available models to power custom applications, such as chatbots and virtual assistants. However, the data used to train or fine-tune these LLMs is often undisclosed, allowing an attacker to compromise the data and inject backdoors into the models. In this paper, we develop a novel inference time defense, named CLEANGEN, to mitigate backdoor attacks for generation tasks in LLMs. CLEANGEN is a lightweight and effective decoding strategy that is compatible with the state-of-the-art (SOTA) LLMs. Our insight behind CLEANGEN is that compared to other LLMs, backdoored LLMs assign significantly higher probabilities to tokens representing the attacker-desired contents. These discrepancies in token probabilities enable CLEANGEN to identify suspicious tokens favored by the attacker and replace them with tokens generated by another LLM that is not compromised by the same attacker, thereby avoiding generation of attacker-desired content. We evaluate CLEANGEN against five SOTA backdoor attacks. Our results show that CLEANGEN achieves lower attack success rates (ASR) compared to five SOTA baseline defenses for all five backdoor attacks. Moreover, LLMs deploying CLEANGEN maintain helpfulness in their responses when serving benign user queries with minimal added computational overhead.