Prototype Guided Backdoor Defense

Abstract

Deep learning models are susceptible to {\em backdoor attacks}, in which a malicious attacker perturbs a small subset of the training data with a {\em trigger} to cause misclassifications. Various triggers have been used, including semantic triggers that are easily realizable without requiring the attacker to manipulate the image. The emergence of generative AI has eased the generation of varied poisoned samples. Robustness across trigger types is crucial to an effective defense. We propose Prototype Guided Backdoor Defense (PGBD), a robust post-hoc defense that scales across different trigger types, including previously unsolved semantic triggers. PGBD exploits displacements in the geometric spaces of activations to penalize movements toward the trigger, using a novel sanitization loss in a post-hoc fine-tuning step. The geometric approach scales easily to all types of attacks. PGBD achieves better performance across all settings. We also present the first defense against a new semantic attack on celebrity face images. Project page: \href{https://venkatadithya9.github.io/pgbd.github.io/}{this https URL}.

Prototype Guided Backdoor Defense - arXiv