Research

arXiv

AI Judges in Design: Statistical Perspectives on Achieving Human Expert Equivalence With Vision-Language Models

Abstract

The subjective evaluation of early-stage engineering designs, such as conceptual sketches, traditionally relies on human experts, whose evaluations are time-consuming, expensive, and sometimes inconsistent. Recent advances in vision-language models (VLMs) offer the potential to automate design assessment, but it is crucial to ensure that these AI "judges" perform on par with human experts, and no existing framework assesses expert equivalence. This paper introduces a rigorous statistical framework to determine whether an AI judge's ratings match those of human experts. We apply this framework in a case study evaluating four VLM-based judges on key design metrics (uniqueness, creativity, usefulness, and drawing quality). These AI judges employ various in-context learning (ICL) techniques, including unimodal versus multimodal prompts and inference-time reasoning. The same statistical framework is used to assess three trained novices for expert equivalence. Results show that the top-performing AI judge, which uses text- and image-based ICL with reasoning, achieves expert-level agreement for uniqueness and drawing quality and outperforms or matches trained novices across all metrics. In 6/6 runs for both uniqueness and creativity, and in 5/6 runs for both drawing quality and usefulness, its agreement with experts meets or exceeds that of the majority of trained novices. These findings suggest that reasoning-supported VLMs can achieve human-expert equivalence in design evaluation. This has implications for scaling design evaluation in education and practice, and the framework provides a general statistical approach for validating AI judges in other domains requiring subjective content evaluation.
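
The "meets or exceeds the majority of trained novices" tally reported above can be illustrated with a minimal sketch. It assumes each rater's agreement with the experts has already been reduced to a single score; the abstract does not say which agreement statistic is used, so the scores here are placeholders. The function names and all numbers are hypothetical and chosen only for illustration; they are not results from the paper.

```python
from typing import Sequence


def meets_or_exceeds_majority(ai_score: float, novice_scores: Sequence[float]) -> bool:
    """Return True if ai_score is >= the score of more than half of the novices."""
    matched = sum(1 for score in novice_scores if ai_score >= score)
    return matched > len(novice_scores) / 2


def count_passing_runs(ai_runs: Sequence[float], novice_scores: Sequence[float]) -> int:
    """Count the AI-judge runs that meet or exceed the majority of novices."""
    return sum(meets_or_exceeds_majority(score, novice_scores) for score in ai_runs)


# Hypothetical agreement-with-expert scores (NOT data from the paper).
novices = [0.40, 0.55, 0.62]                   # three trained novices
ai_runs = [0.58, 0.60, 0.57, 0.61, 0.50, 0.63]  # six runs of one AI judge

passing = count_passing_runs(ai_runs, novices)
print(f"{passing}/{len(ai_runs)} runs meet or exceed the majority of novices")
```

With three novices, a run passes when the AI judge's score is at least as high as that of two or more of them. This tally is only the descriptive comparison; the formal expert-equivalence test belongs to the paper's statistical framework and is not reproduced here.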