arXiv

Perception of Visual Content: Differences Between Humans and Foundation Models

Nardiena A. Pratama, Shaoyang Fan, Gianluca Demartini

Abstract

Human-annotated content is often used to train machine learning (ML) models. Recently, however, language and multi-modal foundation models have been used to replace and scale up human annotators' efforts. This study compares human-generated and ML-generated annotations of images representing diverse socio-economic contexts. We aim to understand differences in perception and identify potential biases in content interpretation. Our dataset comprises images of people from various geographical regions and income levels, covering a range of daily activities and home environments. We compare human and ML-generated annotations semantically and evaluate their impact on predictive models. Our results show that, from a low-level perspective (i.e., the types of words that appear and the sentence structures used), ML Captions and human labels are the most similar, but all three annotation sets are alike in how similar or dissimilar they perceive images from different regions to be. Additionally, ML Captions yielded the best overall region classification performance, while ML Objects and ML Captions performed best overall for income regression. The varying performance of the annotation sets suggests that all annotations are important and that human-generated annotations are not yet replaceable.