arXiv

FA^{3}-CLIP: Frequency-Aware Cues Fusion and Attack-Agnostic Prompt Learning for Unified Face Attack Detection

Yongze Li, Ning Li, Ajian Liu, Hui Ma, Liying Yang, Xihong Chen, Jun Wan

Abstract

Facial recognition systems are vulnerable to physical (e.g., printed photos) and digital (e.g., DeepFake) face attacks. Existing methods struggle to detect physical and digital attacks simultaneously for two reasons: 1) these attack types exhibit significant intra-class variation, and 2) spatial information alone is inadequate to comprehensively capture live and fake cues. To address these issues, we propose a unified attack detection model termed Frequency-Aware and Attack-Agnostic CLIP (FA^{3}-CLIP). It introduces attack-agnostic prompt learning to express generic live and fake cues derived from the fusion of spatial and frequency features, enabling unified detection of live faces and all categories of attacks. Specifically, the attack-agnostic prompt module generates generic live and fake prompts within the language branch to extract the corresponding generic representations from both live and fake faces, guiding the model to learn a single feature space for unified attack detection. Meanwhile, the module adaptively generates a live/fake conditional bias from the original spatial and frequency information and uses it to optimize the generic prompts, reducing the impact of intra-class variation. We further propose a dual-stream cues fusion framework in the vision branch, which leverages frequency information to complement subtle cues that are difficult to capture in the spatial domain. In addition, the frequency stream uses a frequency compression block, which reduces redundancy in the frequency features while preserving the diversity of crucial cues. We also establish new, challenging protocols to facilitate the evaluation of unified face attack detection. Experimental results demonstrate that the proposed method significantly improves performance in detecting physical and digital face attacks, achieving state-of-the-art results.
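To make the ideas in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that the frequency view is a log-amplitude Fourier spectrum, that the "encoders" are fixed random projections standing in for CLIP's vision branch, that fusion is a simple sum of normalized features, and that the conditional bias is a linear function of the fused feature. All names (`log_amplitude_spectrum`, `live_fake_probs`, `W_bias`, the temperature value) are hypothetical; the abstract does not specify any of these details.

```python
import numpy as np


def log_amplitude_spectrum(gray):
    """Centred log-amplitude spectrum of a 2-D grayscale image (assumed frequency view)."""
    spectrum = np.fft.fftshift(np.fft.fft2(gray))
    return np.log1p(np.abs(spectrum))


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def live_fake_probs(image_feat, prompt_feats, bias, temperature=0.07):
    """CLIP-style scoring of one image feature (D,) against two prompt
    features (2, D), each shifted by an image-conditioned bias (2, D)."""
    prompts = l2_normalize(prompt_feats + bias)
    logits = prompts @ l2_normalize(image_feat) / temperature
    logits = logits - logits.max()  # numerical stability for the softmax
    exp = np.exp(logits)
    return exp / exp.sum()


rng = np.random.default_rng(0)
gray = rng.random((32, 32))          # stand-in for a face crop
freq = log_amplitude_spectrum(gray)  # frequency-stream input

# Toy "encoders": random projections replace the real spatial and frequency streams.
D = 16
W_spatial = rng.standard_normal((gray.size, D))
W_freq = rng.standard_normal((freq.size, D))
spatial_feat = gray.ravel() @ W_spatial
freq_feat = freq.ravel() @ W_freq

# Assumed fusion rule: sum of the two normalized stream features.
fused = l2_normalize(spatial_feat) + l2_normalize(freq_feat)

# Generic [live, fake] prompt features plus a bias conditioned on the fused feature.
prompt_feats = rng.standard_normal((2, D))
W_bias = 0.1 * rng.standard_normal((2, D, D))
bias = np.einsum("d,kde->ke", fused, W_bias)  # shape (2, D)

probs = live_fake_probs(fused, prompt_feats, bias)
print("P(live), P(fake):", probs)
```

With random weights the printed probabilities are meaningless; the sketch only shows how a frequency view, a fused visual feature, and bias-shifted live/fake prompts could fit together in a two-class, similarity-based classifier.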