Abstract
Current Large Language Vision Models (LLVMs) trained on web videos perform well in general video understanding but struggle with fine-grained details, complex human-object interactions (HOI), and the view-invariant representation learning essential for Activities of Daily Living (ADL). This limitation stems from a lack of specialized ADL video instruction-tuning datasets and from modality integration that is insufficient to capture discriminative action representations. To address these issues, we propose a semi-automated framework for curating ADL datasets and use it to create ADL-X, a multiview, multimodal RGBS instruction-tuning dataset. Additionally, we introduce LLAVIDAL, an LLVM that integrates videos, 3D skeletons, and HOIs to model the complex spatiotemporal relationships of ADL. When training LLAVIDAL, a simple joint alignment of all modalities yields suboptimal results; we therefore propose a Multimodal Progressive (MMPro) training strategy that incorporates modalities in stages following a curriculum. We also establish ADL multiple-choice question (MCQ) and video description benchmarks to assess LLVM performance on ADL tasks. Trained on ADL-X, LLAVIDAL achieves state-of-the-art performance across ADL benchmarks. Code and data will be made publicly available at https://adl-x.github.io/.