Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning

arXiv

Fan Lu, Wei Wu, Kecheng Zheng, Shuailei Ma, Biao Gong, Jiawei Liu, Wei Zhai, Yang Cao, Yujun Shen, Zheng-Jun Zha


Abstract

Generating detailed captions that comprehend text-rich visual content in images has received growing attention for Large Vision-Language Models (LVLMs). However, few studies have developed benchmarks specifically tailored to detailed captions that measure both their accuracy and their comprehensiveness. In this paper, we introduce a detailed caption benchmark, termed CompreCap, which evaluates visual context from a directed scene graph view. Concretely, we first manually segment each image into semantically meaningful regions (i.e., semantic segmentation masks) according to a common-object vocabulary, while also distinguishing the attributes of the objects within those regions. Directional relation labels between these objects are then annotated to compose a directed scene graph that encodes the rich compositional information of the image. Based on this directed scene graph, we develop a pipeline that assesses the detailed captions generated by LVLMs on multiple levels, including object-level coverage, the accuracy of attribute descriptions, and the score of key relationships, among others. Experimental results on the CompreCap dataset confirm that our evaluation method aligns closely with human evaluation scores across LVLMs.
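To make the directed scene graph view more concrete, the following is a minimal sketch of how such an annotation and an object-level coverage score might be represented. It is not the CompreCap implementation: the class names `SceneObject` and `DirectedSceneGraph`, the function `object_coverage`, and the naive substring matching are all assumptions made for illustration, and the abstract does not specify how the benchmark matches caption text to annotated objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SceneObject:
    """One annotated region: an object name plus its attributes (hypothetical schema)."""
    name: str
    attributes: List[str] = field(default_factory=list)


@dataclass
class DirectedSceneGraph:
    """Objects as nodes, directed relations as (subject, relation, object) triples."""
    objects: Dict[str, SceneObject] = field(default_factory=dict)
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_object(self, name, attributes=()):
        self.objects[name] = SceneObject(name, list(attributes))

    def add_relation(self, subject, relation, obj):
        # The edge is directed: "dog lying on sofa" differs from "sofa lying on dog".
        if subject not in self.objects or obj not in self.objects:
            raise KeyError("both endpoints must be annotated objects")
        self.relations.append((subject, relation, obj))


def object_coverage(graph, caption):
    """Fraction of annotated objects whose name occurs in the caption.

    Assumption: plain case-insensitive substring matching stands in for
    whatever matching procedure the real pipeline uses.
    """
    if not graph.objects:
        return 0.0
    text = caption.lower()
    hits = sum(1 for name in graph.objects if name.lower() in text)
    return hits / len(graph.objects)


# Toy annotation for a single image (invented data, not from the dataset).
graph = DirectedSceneGraph()
graph.add_object("dog", ["brown"])
graph.add_object("sofa", ["red"])
graph.add_object("lamp")
graph.add_relation("dog", "lying on", "sofa")

score = object_coverage(graph, "A brown dog is lying on a red sofa.")
```

In this toy case the caption mentions two of the three annotated objects, so `score` is 2/3. Attribute accuracy and relation scoring could be layered on the same structure by iterating over `attributes` and `relations`, though how the benchmark actually scores them is not stated in the abstract.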