DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation
Abstract
Talking head generation aims to produce vivid and realistic talking head videos from a single portrait and a speech audio clip. Although diffusion-based talking head generation has made significant progress, almost all methods rely on autoregressive strategies, which suffer from limited use of context beyond the current generation step, error accumulation, and slow generation speed. To address these challenges, we present DAWN (Dynamic frame Avatar With Non-autoregressive diffusion), a framework that generates video sequences of dynamic length all at once. Specifically, it consists of two main components: (1) audio-driven holistic facial dynamics generation in the latent motion space, and (2) audio-driven head pose and blink generation. Extensive experiments demonstrate that our method generates authentic and vivid videos with precise lip motions and natural head pose and blink movements. Additionally, DAWN combines high generation speed with strong extrapolation capabilities, ensuring the stable production of high-quality long videos. These results highlight the considerable promise and potential impact of DAWN in the field of talking head video generation. Furthermore, we hope that DAWN sparks further exploration of non-autoregressive approaches in diffusion models. Our code will be publicly available at https://github.com/Hanbo-Cheng/DAWN-pytorch.