ViLBench: A Suite for Vision-Language Process Reward Modeling

Abstract

Process-supervised reward models are fine-grained functions that provide detailed step-wise feedback on model responses, facilitating the effective selection of reasoning trajectories for complex tasks. Despite their advantages, the evaluation of process reward models (PRMs) remains underexplored, especially in the multimodal domain. To address this gap, this paper first benchmarks current vision large language models (VLLMs) as two types of reward models, output reward models (ORMs) and process reward models (PRMs), on multiple vision-language benchmarks. The results reveal that neither ORMs nor PRMs consistently outperform the other across all tasks, and that superior VLLMs do not necessarily yield better reward performance. To further advance evaluation, we introduce ViLBench, a vision-language benchmark designed to require intensive process reward signals. Notably, OpenAI's GPT-4o with Chain-of-Thought (CoT) achieves only 27.3% accuracy, indicating that the benchmark is challenging for current VLLMs. Lastly, we preliminarily showcase a promising pathway towards bridging the gap between general VLLMs and reward models: by collecting 73.6K vision-language process reward data samples using an enhanced tree-search algorithm, our 3B model achieves an average improvement of 3.3% over standard CoT and up to 2.5% over its untrained counterpart on ViLBench when selecting among OpenAI o1's generations. We release our code, model, and data at https://ucsc-vlaa.github.io/ViLBench.
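To make the ORM/PRM distinction concrete, the following is a minimal sketch of best-of-N selection under both reward types. It is illustrative only and is not the released implementation: the function names, the toy scorers, and the choice of the mean to aggregate step scores are all assumptions introduced here.

```python
from statistics import mean
from typing import Callable, List, Sequence

Candidate = Sequence[str]  # one candidate response, split into reasoning steps


def orm_score(candidate: Candidate, score_output: Callable[[str], float]) -> float:
    """Output reward: judge only the final step (the answer)."""
    return score_output(candidate[-1])


def prm_score(
    candidate: Candidate,
    score_step: Callable[[Sequence[str]], float],
    aggregate: Callable[[List[float]], float] = mean,  # assumption: mean aggregation
) -> float:
    """Process reward: judge every step given its prefix, then aggregate."""
    step_scores = [score_step(candidate[: i + 1]) for i in range(len(candidate))]
    return aggregate(step_scores)


def best_of_n(candidates: Sequence[Candidate], scorer: Callable[[Candidate], float]) -> int:
    """Return the index of the highest-scoring candidate (first one wins ties)."""
    return max(range(len(candidates)), key=lambda i: scorer(candidates[i]))


# Hypothetical toy scorers standing in for a VLLM-based reward model.
GOOD_STEPS = {"2 + 3 = 5", "5 * 2 = 10", "answer: 10"}


def toy_output_scorer(final_step: str) -> float:
    return 1.0 if final_step == "answer: 10" else 0.0


def toy_step_scorer(prefix: Sequence[str]) -> float:
    return 1.0 if prefix[-1] in GOOD_STEPS else 0.0


candidates = [
    ["2 + 3 = 6", "6 * 2 = 10", "answer: 10"],  # right answer, flawed steps
    ["2 + 3 = 5", "5 * 2 = 10", "answer: 10"],  # right answer, sound steps
]

orm_choice = best_of_n(candidates, lambda c: orm_score(c, toy_output_scorer))
prm_choice = best_of_n(candidates, lambda c: prm_score(c, toy_step_scorer))
print("ORM picks candidate", orm_choice)
print("PRM picks candidate", prm_choice)
```

In this toy case the output-level scorer cannot separate the two candidates because both end in the same answer, while the step-level scorer prefers the one whose intermediate steps are sound. Other aggregations (for example, the minimum or the product of step scores) could be substituted for the mean.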