Abstract
A core part of scientific peer review involves providing expert critiques that directly assess the scientific claims a paper makes. While it is now possible to automatically generate plausible (if generic) reviews, ensuring that these reviews are sound and grounded in the papers' claims remains challenging. To facilitate LLM benchmarking on these challenges, we introduce CLAIMCHECK, an annotated dataset of NeurIPS 2023 and 2024 submissions and reviews mined from OpenReview. CLAIMCHECK is richly annotated by ML experts for weakness statements in the reviews and the paper claims that they dispute, as well as fine-grained labels of the validity, objectivity, and type of the identified weaknesses. We benchmark several LLMs on three claim-centric tasks supported by CLAIMCHECK, requiring models to (1) associate weaknesses with the claims they dispute, (2) predict fine-grained labels for weaknesses and rewrite the weaknesses to enhance their specificity, and (3) verify a paper's claims with grounded reasoning. Our experiments reveal that cutting-edge LLMs, while capable of predicting weakness labels in (2), continue to underperform relative to human experts on all other tasks.