Abstract
Human body actions are an important form of non-verbal communication in social interactions. This paper focuses on a subset of body actions known as micro-actions: subtle, low-intensity body movements with promising applications in human emotion analysis. In real-world scenarios, micro-actions often co-occur, with several overlapping in time, such as concurrent head and hand movements. However, current research primarily focuses on recognizing individual micro-actions and overlooks their co-occurring nature. To address this gap, we propose a new task, Multi-label Micro-Action Detection (MMAD), which involves identifying all micro-actions in a given short video, determining their start and end times, and categorizing them. Accomplishing this requires a model that accurately captures both long-term and short-term action relationships in order to detect multiple overlapping micro-actions. To facilitate research on MMAD, we introduce a new dataset, Multi-label Micro-Action-52 (MMA-52), and propose a baseline method equipped with a dual-path spatial-temporal adapter to address the challenge of subtle visual changes in MMAD. We hope that MMA-52 will stimulate research on micro-action analysis in videos and promote the development of spatio-temporal modeling in human-centric video understanding. The MMA-52 dataset is available at: https://github.com/VUT-HFUT/Micro-Action.
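To make the task definition concrete, the following is a minimal sketch of what an MMAD output could look like: a list of labelled time spans that are allowed to overlap. It is an illustration only, not the baseline method or the MMA-52 annotation format. The class `MicroActionInstance`, the helper functions, and the label strings are all hypothetical, and temporal IoU is used here because it is a common overlap measure in temporal action detection, not because the abstract names it.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class MicroActionInstance:
    """One detected micro-action: a labelled time span in seconds (hypothetical type)."""
    start: float
    end: float
    label: str  # illustrative label name, not taken from MMA-52


def temporal_iou(a, b):
    """Intersection-over-union of two time spans."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def co_occurring(instances):
    """Return the label pairs whose time spans overlap."""
    return [
        (a.label, b.label)
        for a, b in combinations(instances, 2)
        if min(a.end, b.end) > max(a.start, b.start)
    ]


# Made-up detections for one short video: the first two overlap in time.
detections = [
    MicroActionInstance(0.0, 4.0, "nodding"),
    MicroActionInstance(2.0, 6.0, "touching face"),
    MicroActionInstance(7.0, 9.0, "crossing arms"),
]

print(co_occurring(detections))
print(temporal_iou(detections[0], detections[1]))
```

Representing each detection as an independent span, instead of assigning one label per frame or per clip, is what lets a head movement and a hand movement be reported over the same interval.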