How Well Can Vision-Language Models Understand Humans' Intention? An Open-ended Theory of Mind Question Evaluation Benchmark
Abstract
Vision-Language Models (VLMs) have demonstrated strong reasoning capabilities in Visual Question Answering (VQA) tasks; however, their ability to perform Theory of Mind (ToM) tasks, such as accurately inferring human intentions, beliefs, and other mental states, remains underexplored. In this work, we propose an open-ended question framework to evaluate VLMs' performance across diverse categories of ToM tasks. We curated and annotated a benchmark dataset of 30 images and assessed the performance of four VLMs of varying sizes on it. Our experimental results show that the GPT-4 model outperformed all others, with only one smaller model, GPT-4o-mini, achieving comparable performance. We also observed that VLMs often struggle to accurately infer intentions in complex scenarios such as bullying or cheating. Finally, our findings reveal that smaller models can sometimes infer the correct intention despite relying on incorrect visual cues.