Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation
Abstract
Talking head synthesis is vital for virtual avatars and human-computer interaction. However, most existing methods accept control from only a single primary modality, which restricts their practical utility. To address this, we introduce \textbf{ACTalker}, an end-to-end video diffusion framework that supports both multi-signal and single-signal control for talking head video generation. For multi-signal control, we design a parallel Mamba structure with multiple branches, each using a separate driving signal to control a specific facial region. A gate mechanism is applied across all branches, providing flexible control over video generation. To ensure that the controlled video is naturally coordinated both temporally and spatially, each branch employs the Mamba structure, which enables its driving signal to manipulate feature tokens across both dimensions. Additionally, we introduce a mask-drop strategy that allows each driving signal to independently control its corresponding facial region within the Mamba structure, preventing control conflicts. Experimental results demonstrate that our method produces natural-looking facial videos driven by diverse signals and that the Mamba layer integrates multiple driving modalities without conflict.
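The interplay of the parallel branches, the gate, and the mask-drop strategy can be illustrated with a toy sketch. This is not the ACTalker implementation: the simple linear recurrence below merely stands in for a selective state space (Mamba) scan, tokens are reduced to single floats, and all names (`toy_scan`, `masked_branch`, `fuse`, the `decay` parameter, the example masks and signals) are hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def toy_scan(tokens: Sequence[float], signal: float, decay: float = 0.5) -> List[float]:
    """Stand-in for a state space scan (assumption, not Mamba itself):
    a linear recurrence h = decay * h + x + signal over the token sequence."""
    h = 0.0
    out = []
    for x in tokens:
        h = decay * h + x + signal
        out.append(h)
    return out


def masked_branch(tokens: Sequence[float], mask: Sequence[int], signal: float) -> List[float]:
    """Mask-drop sketch: drop tokens outside the branch's region, scan only
    the kept tokens, then scatter results back (zeros elsewhere)."""
    kept_idx = [i for i, m in enumerate(mask) if m]
    scanned = toy_scan([tokens[i] for i in kept_idx], signal)
    out = [0.0] * len(tokens)
    for i, y in zip(kept_idx, scanned):
        out[i] = y
    return out


def fuse(
    tokens: Sequence[float],
    branches: Sequence[Tuple[Sequence[int], float]],
    gates: Sequence[float],
) -> List[float]:
    """Add each branch's gated update to the input tokens (residual form).
    branches holds (region_mask, driving_signal) pairs; a gate of 0.0
    switches a branch off, which yields single-signal control."""
    fused = list(tokens)
    for (mask, signal), gate in zip(branches, gates):
        update = masked_branch(tokens, mask, signal)
        fused = [f + gate * u for f, u in zip(fused, update)]
    return fused


tokens = [1.0, 1.0, 1.0, 1.0]
audio_branch = ([1, 1, 0, 0], 1.0)       # hypothetical mouth-region mask, audio signal
expression_branch = ([0, 0, 1, 1], 2.0)  # hypothetical upper-face mask, expression signal

audio_only = fuse(tokens, [audio_branch, expression_branch], gates=[1.0, 0.0])
both = fuse(tokens, [audio_branch, expression_branch], gates=[1.0, 1.0])
```

Because the two masks in this example are disjoint, each branch only touches its own tokens, which is the sense in which masking avoids control conflicts in the sketch; setting one gate to zero leaves that region's tokens unchanged.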