Video-R1: Reinforcing Video Reasoning in MLLMs

Abstract

Inspired by DeepSeek-R1's success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1, the first attempt to systematically explore the R1 paradigm for eliciting video reasoning in multimodal large language models (MLLMs). However, directly applying RL training with the GRPO algorithm to video reasoning presents two primary challenges: (i) a lack of temporal modeling for video reasoning, and (ii) the scarcity of high-quality video-reasoning data. To address these issues, we first propose the T-GRPO algorithm, which encourages models to use temporal information in videos for reasoning. In addition, rather than relying solely on video data, we incorporate high-quality image-reasoning data into the training process. We construct two datasets, Video-R1-COT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data. Experimental results demonstrate that Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench, as well as on general video benchmarks such as MVBench and TempCompass. Notably, Video-R1-7B attains 35.8% accuracy on the video spatial reasoning benchmark VSI-Bench, surpassing the commercial proprietary model GPT-4o. All code, models, and data are released.