Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts?
Abstract
Large Language Models (LLMs) are known to be susceptible to crafted adversarial attacks, or jailbreaks, that lead to the generation of objectionable content despite being aligned to human preferences using safety fine-tuning methods. While the large dimensionality of the input token space makes it inevitable that adversarial prompts capable of jailbreaking these models can be found, we aim to evaluate whether safety fine-tuned LLMs are safe against natural prompts that are semantically related to toxic seed prompts which elicit safe responses after alignment. Surprisingly, we find that popular aligned LLMs such as GPT-4 can be compromised using naive prompts that are not even crafted with the objective of jailbreaking the model. Furthermore, we empirically show that, given a seed prompt that elicits a toxic response from an unaligned model, one can systematically generate several semantically related natural prompts that jailbreak aligned LLMs. To this end, we propose Response Guided Question Augmentation (ReG-QA), a method for evaluating the generalization of safety-aligned LLMs to natural prompts. It first generates several toxic answers to a seed question using an unaligned LLM (Q to A), and then uses an LLM to generate questions that are likely to produce these answers (A to Q). Interestingly, we find that safety fine-tuned LLMs such as GPT-4o are vulnerable to producing natural jailbreak questions from unsafe content (without denial) and can thus be used for the latter (A to Q) step. We obtain attack success rates that are comparable to or better than those of leading adversarial attack methods on the JailbreakBench leaderboard, while being significantly more stable against defenses such as Smooth-LLM and Synonym Substitution, which are effective against all existing attacks on the leaderboard.
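The two-step structure described in the abstract (Q to A, then A to Q) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `reg_qa_augment` and `seed_level_success_rate`, the parameters `num_answers` and `questions_per_answer`, and the choice to count a seed as compromised if any one of its augmented questions elicits an unsafe response are all assumptions made for illustration. The models and the safety judge are passed in as plain callables, so no particular API is implied.

```python
from typing import Callable, Dict, List

# Hypothetical type: any function mapping a text prompt to a text completion.
TextFn = Callable[[str], str]


def reg_qa_augment(
    seed: str,
    q_to_a: TextFn,
    a_to_q: TextFn,
    num_answers: int = 3,
    questions_per_answer: int = 2,
) -> List[str]:
    """Generate natural questions related to `seed` via Q -> A -> Q.

    q_to_a stands in for the model that answers the seed question;
    a_to_q stands in for the model that proposes questions for an answer.
    """
    # Step 1 (Q to A): sample several answers to the seed question.
    answers = [q_to_a(seed) for _ in range(num_answers)]

    # Step 2 (A to Q): for each answer, sample questions likely to produce it.
    questions: List[str] = []
    for answer in answers:
        for _ in range(questions_per_answer):
            candidate = a_to_q(answer)
            if candidate not in questions:  # keep unique questions, in order
                questions.append(candidate)
    return questions


def seed_level_success_rate(
    augmented: Dict[str, List[str]],
    target: TextFn,
    is_unsafe: Callable[[str, str], bool],
) -> float:
    """Fraction of seeds for which at least one augmented question
    yields a response that the judge `is_unsafe(question, response)` flags.

    Assumption: success is aggregated per seed, not per question.
    """
    if not augmented:
        return 0.0
    compromised = sum(
        any(is_unsafe(q, target(q)) for q in questions)
        for questions in augmented.values()
    )
    return compromised / len(augmented)
```

Passing the models as callables keeps the sketch independent of any specific provider, and the same `seed_level_success_rate` helper could be reused to compare results with and without a defense wrapped around `target`.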