Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources

Abstract

Reproducing state-of-the-art multimodal LLM pre-training faces barriers at every stage of the pipeline, including high-quality data filtering, multimodal data mixture strategies, sequence packing techniques, and training frameworks. We introduce Open-Qwen2VL, a fully open-source 2B-parameter Multimodal Large Language Model pre-trained efficiently on 29M image-text pairs using only 220 A100-40G GPU hours. Our approach employs low-to-high dynamic image resolution and multimodal sequence packing to significantly enhance pre-training efficiency. The training dataset was carefully curated using both MLLM-based filtering techniques (e.g., MLM-Filter) and conventional CLIP-based filtering methods, substantially improving data quality and training efficiency. Open-Qwen2VL is pre-trained on academic-level 8xA100-40G GPUs at UCSB on 5B packed multimodal tokens, which is 0.36% of the 1.4T multimodal pre-training tokens of Qwen2-VL. The final instruction-tuned Open-Qwen2VL outperforms the partially-open state-of-the-art MLLM Qwen2-VL-2B on various multimodal benchmarks, including MMBench, SEEDBench, MMStar, and MathVista, indicating the remarkable training efficiency of Open-Qwen2VL. We open-source all aspects of our work, including compute-efficient and data-efficient training details, data filtering methods, sequence packing scripts, pre-training data in WebDataset format, the FSDP-based training codebase, and both base and instruction-tuned model checkpoints. We redefine "fully open" for multimodal LLMs as the complete release of: 1) the training codebase, 2) detailed data filtering techniques, and 3) all pre-training and supervised fine-tuning data used to develop the model.
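
As a quick sanity check on the figures quoted above, the short sketch below recomputes the token ratio and derives an approximate wall-clock training time. The variable names are ours, and the wall-clock estimate rests on the assumption that all 8 GPUs are busy for the whole run; the abstract itself only reports total GPU hours.

```python
# Figures quoted in the abstract.
open_qwen2vl_tokens = 5e9   # 5B packed multimodal tokens
qwen2vl_tokens = 1.4e12     # 1.4T multimodal pre-training tokens of Qwen2-VL
gpu_hours = 220             # total A100-40G GPU hours
num_gpus = 8                # one 8xA100-40G node

# Share of Qwen2-VL's pre-training tokens, in percent.
token_ratio_pct = open_qwen2vl_tokens / qwen2vl_tokens * 100

# Assumption: all 8 GPUs run for the full duration of pre-training.
wall_clock_hours = gpu_hours / num_gpus

print(f"token ratio: {token_ratio_pct:.2f}%")                 # 0.36%
print(f"approx. wall-clock time: {wall_clock_hours:.1f} h")   # 27.5 h
```

Under that assumption, the full pre-training run fits in a little over one day on a single 8-GPU node.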