On Data Synthesis and Post-training for Visual Abstract Reasoning

Abstract

This paper is a pioneering attempt to address abstract visual reasoning (AVR) problems with large vision-language models (VLMs). We enable a common LLaVA-NeXT 7B model to perceive and reason about specific AVR problems, surpassing both open-source VLMs (e.g., Qwen-2-VL-72B) and powerful closed-source VLMs (e.g., GPT-4o) by a significant margin. This is an important breakthrough, since almost all previous VLMs fail or show nearly random performance on representative AVR benchmarks. The key to our success is an innovative data synthesis and post-training process, designed to reduce task difficulty and elicit learning from the model step by step. Our 7B model also performs well on AVR without sacrificing general multimodal comprehension abilities. We hope this paper can serve as an early effort in this area and inspire further research in abstract visual reasoning.