arXiv

Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics

Lee Chae-Yeon, Oh Hyun-Bin, Han EunGi

Abstract

Speech-driven 3D talking head generation has recently made significant progress in lip synchronization. However, existing models still struggle to capture the perceptual alignment between varying speech characteristics and the corresponding lip movements. In this work, we argue that three criteria (Temporal Synchronization, Lip Readability, and Expressiveness) are crucial for achieving perceptually accurate lip movements. Motivated by our hypothesis that a desirable representation space exists that meets these three criteria, we introduce a speech-mesh synchronized representation that captures intricate correspondences between speech signals and 3D face meshes. We find that our learned representation exhibits desirable characteristics, and we plug it into existing models as a perceptual loss to better align lip movements with the given speech. In addition, we use this representation as a perceptual metric and introduce two other physically grounded lip synchronization metrics to assess how well generated 3D talking heads satisfy these three criteria. Experiments show that training 3D talking head generation models with our perceptual loss significantly improves all three aspects of perceptually accurate lip synchronization. Code and datasets are available at https://perceptual-3d-talking-head.github.io/.
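To illustrate what plugging a learned representation into an existing model as a perceptual loss could look like, the following is a minimal sketch. It is not the authors' formulation, and the abstract does not specify one. It assumes that a speech encoder and a mesh encoder each produce a fixed-length embedding for the same time window, and that misalignment is scored as one minus their cosine similarity. The function names, the weighting term `weight`, and its default value are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def perceptual_loss(speech_emb: Sequence[float], mesh_emb: Sequence[float]) -> float:
    """Assumed form: 1 - cosine similarity, so aligned embeddings give a loss near 0."""
    return 1.0 - cosine_similarity(speech_emb, mesh_emb)


def total_loss(recon_loss: float,
               speech_emb: Sequence[float],
               mesh_emb: Sequence[float],
               weight: float = 0.1) -> float:
    """Add the perceptual term to an existing model's loss.

    `recon_loss` stands in for whatever loss the base model already uses;
    `weight` is a hypothetical balancing coefficient, not a value from the paper.
    """
    return recon_loss + weight * perceptual_loss(speech_emb, mesh_emb)
```

In this sketch the base model's own objective is left untouched and the perceptual term is simply added to it, which is one plausible reading of "plug it into existing models". In practice the embeddings would come from pretrained encoders and the loss would be computed with a differentiable framework.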