RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning

Abstract

Recently, Vision Language Models (VLMs) have increasingly emphasized document visual grounding to achieve better human-computer interaction, accessibility, and detailed understanding. However, the application of visual grounding to visualizations such as charts remains under-explored because of the inherent complexity of the interleaved visual-numerical relationships in chart images. Existing chart understanding methods primarily focus on answering questions without explicitly identifying the visual elements that support their predictions. To bridge this gap, we introduce RefChartQA, a novel benchmark that integrates Chart Question Answering (ChartQA) with visual grounding, enabling models to refer to elements at multiple granularities within chart images. Furthermore, we conduct a comprehensive evaluation by instruction-tuning five state-of-the-art VLMs across different categories. Our experiments demonstrate that incorporating spatial awareness via grounding improves response accuracy by over 15% while reducing hallucinations and improving model reliability. Additionally, we identify key factors influencing text-spatial alignment, such as architectural improvements in TinyChart, which leverages a token-merging module for enhanced feature fusion. Our dataset is open-sourced to support community development and further advancements. All models and code will be publicly available at https://github.com/moured/RefChartQA.
