Understanding Co-speech Gestures in-the-wild

Abstract

Co-speech gestures play a vital role in non-verbal communication. In this paper, we introduce a new framework for understanding co-speech gestures in the wild. Specifically, we propose three new tasks and benchmarks that evaluate a model's ability to comprehend gesture-text-speech associations: (i) gesture-based retrieval, (ii) gestured word spotting, and (iii) active speaker detection using gestures. We present a new approach that learns a tri-modal representation, spanning speech, text, and video gestures, to solve these tasks. By combining a global phrase contrastive loss with a local gesture-word coupling loss, we demonstrate that a strong gesture representation can be learned in a weakly supervised manner from videos in the wild. Our learned representations outperform previous methods, including large vision-language models (VLMs), across all three tasks. Further analysis reveals that the speech and text modalities capture distinct gesture-related signals, underscoring the advantage of learning a shared tri-modal embedding space. The dataset, model, and code are available at: https://www.robots.ox.ac.uk/~vgg/research/jegal
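
To make the idea of a global phrase contrastive loss concrete, the following is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) objective between gesture and phrase embeddings. It is an illustration under stated assumptions, not the formulation used in the paper: the function names, the temperature value of 0.07, and the use of cosine similarity are all hypothetical choices. The local gesture-word coupling loss is not sketched here because the abstract does not describe its form.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(gesture_embs, phrase_embs, temperature=0.07):
    """Generic symmetric InfoNCE loss over a batch of paired embeddings.

    gesture_embs[i] and phrase_embs[i] are assumed to come from the same clip;
    every other pairing in the batch is treated as a negative.
    The temperature value is an assumption, not taken from the paper.
    """
    gestures = [l2_normalize(v) for v in gesture_embs]
    phrases = [l2_normalize(v) for v in phrase_embs]
    # Cosine similarity of every gesture with every phrase, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(g, p)) / temperature for p in phrases]
        for g in gestures
    ]
    logits_transposed = [list(column) for column in zip(*logits)]
    # Average the gesture-to-phrase and phrase-to-gesture directions.
    return 0.5 * (
        cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_transposed)
    )
```

With this sketch, a batch whose gesture and phrase embeddings are correctly paired yields a loss near zero, while a batch whose pairs are swapped yields a large loss, which is the behaviour a contrastive objective relies on to pull matching pairs together in a shared embedding space.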

Understanding Co-speech Gestures in-the-wild - arXiv