Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations
Abstract
View-invariant representation learning from egocentric (first-person, ego) and exocentric (third-person, exo) videos is a promising approach to generalizing video understanding systems across multiple viewpoints. However, this area remains underexplored because of the substantial differences in perspective, motion patterns, and context between ego and exo views. In this paper, we propose Bootstrap Your Own Views (BYOV), a novel masked ego-exo modeling approach that promotes both causal temporal dynamics and cross-view alignment for fine-grained view-invariant video representation learning from unpaired ego-exo videos. We highlight the importance of capturing the compositional nature of human actions as a basis for robust cross-view understanding. Specifically, we design self-view masked prediction and cross-view masked prediction to learn view-invariant and powerful representations concurrently. Experimental results demonstrate that BYOV significantly surpasses existing approaches, with notable gains across all metrics on four downstream ego-exo video tasks. The code is available at https://github.com/park-jungin/byov.