Abstract
This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework. Previous modular methods have shown promise with a single planning stage that is not grounded in visual content. However, through a simple and effective baseline, we find that such systems can exhibit brittle behavior in practice in challenging videoQA settings. Thus, unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage, used in conjunction with an external memory. All stages are training-free and rely on few-shot prompting of large models, producing interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity, our method, MoReVQA, improves over prior work and achieves state-of-the-art results on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA), and extends to related tasks (grounded videoQA, paragraph captioning).
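To make the staged structure concrete, the following is a minimal sketch of how three stages might communicate through a shared external memory. It is an illustration under stated assumptions, not the MoReVQA implementation: the `Memory` class, the stage function signatures, and `run_pipeline` are all hypothetical names, and each stage body is a trivial placeholder standing in for a few-shot-prompted large model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Memory:
    """Hypothetical external memory shared by all stages."""
    entries: Dict[str, object] = field(default_factory=dict)
    trace: List[str] = field(default_factory=list)  # human-readable log of stage outputs

    def write(self, stage: str, key: str, value: object) -> None:
        # Store the value and record which stage produced it.
        self.entries[key] = value
        self.trace.append(f"{stage}: {key} = {value!r}")


def event_parser(question: str, memory: Memory) -> None:
    # Placeholder for a prompted language model: split the question
    # into events around the word "after" (an assumption for this toy).
    events = [part.strip() for part in question.rstrip("?").split(" after ")]
    memory.write("event_parser", "events", events)


def grounding(question: str, memory: Memory) -> None:
    # Placeholder for a vision-language model: assign each parsed event
    # a dummy frame index instead of locating it in a real video.
    events = memory.entries["events"]
    frames = {event: [index] for index, event in enumerate(events)}
    memory.write("grounding", "frames", frames)


def reasoning(question: str, memory: Memory) -> None:
    # Placeholder for the final reasoning model: report how many
    # grounded frames were available when forming the answer.
    frames = memory.entries["frames"]
    count = sum(len(indices) for indices in frames.values())
    memory.write("reasoning", "answer", f"answer based on {count} grounded frame(s)")


def run_pipeline(question: str) -> Memory:
    # Run the stages in order; each reads from and writes to the same memory.
    memory = Memory()
    for stage in (event_parser, grounding, reasoning):
        stage(question, memory)
    return memory


if __name__ == "__main__":
    result = run_pipeline("what did the man do after opening the door?")
    for line in result.trace:
        print(line)
```

The point of the sketch is the design choice rather than the stage logic: because every stage writes its output to the shared memory, the intermediate results remain inspectable after the run, which is one plausible way to obtain the interpretable per-stage outputs described above.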