OmniBench: Towards The Future of Universal Omni-Language Models

Abstract

Recent advancements in multimodal large language models (MLLMs) have focused on integrating multiple modalities, yet their ability to process and reason across different inputs simultaneously remains underexplored. We introduce OmniBench, a novel benchmark designed to evaluate a model's ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define language models capable of such tri-modal processing as omni-language models (OLMs). OmniBench features high-quality human annotations that require integrated understanding across all three modalities. Our evaluation reveals that: i) open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts; and ii) most baseline models perform poorly (around 50% accuracy) even when given textual alternatives to the image/audio inputs. To address these limitations, we develop OmniInstruct, a 96K-sample instruction tuning dataset for training OLMs. We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance. Code and data are available in our repository (https://github.com/multimodal-art-projection/OmniBench).