Disentangled Source-Free Personalization for Facial Expression Recognition with Neutral Target Data
Abstract
Facial expression recognition (FER) from videos is a crucial task in application areas such as human-computer interaction and health monitoring (e.g., pain, depression, fatigue, and stress). Beyond the challenge of recognizing subtle emotional or health states, the effectiveness of deep FER models is often hindered by the considerable variability of expressions among subjects. Source-free domain adaptation (SFDA) methods adapt a pre-trained source model using only unlabeled target-domain data, thereby avoiding data privacy and storage issues. Typically, SFDA methods adapt to a target-domain dataset corresponding to an entire population and assume that it includes data from all recognition classes. However, collecting such comprehensive target data can be difficult or even impossible for FER in healthcare applications. In many real-world scenarios, it may be feasible to collect a short neutral control video (displaying only neutral expressions) for target subjects before deployment. These videos can be used to adapt a model to better handle the variability of expressions among subjects. This paper introduces the Disentangled Source-Free Domain Adaptation (DSFDA) method to address the SFDA challenge posed by missing target expression data. DSFDA uses data from a neutral target control video to generate the missing non-neutral target data and adapt the model end to end. Our method learns to disentangle expression-related and identity-related features while generating the missing non-neutral target data, thereby enhancing model accuracy. Additionally, our self-supervision strategy improves model adaptation by reconstructing target images that maintain the same identity and source expression. Experimental results on the challenging BioVid and UNBC-McMaster pain datasets indicate that our DSFDA approach can outperform state-of-the-art adaptation methods.