SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding

Abstract

We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding. We incorporate the two-stream SlowFast mechanism into a streamlined training pipeline and perform joint video-image training on a carefully curated data mixture of only publicly available datasets. Our primary focus is on highly efficient model scales (1B and 3B), demonstrating that even relatively small Video LLMs can achieve state-of-the-art performance on video understanding, meeting the demand for mobile-friendly models. Experimental results show that SF-LLaVA-1.5 achieves superior performance on a wide range of video and image tasks, with robust results at all model sizes (ranging from 1B to 7B). Notably, SF-LLaVA-1.5 achieves state-of-the-art results in long-form video understanding (e.g., LongVideoBench and MLVU) and excels at small scales across various video benchmarks.