arXiv

Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad

Ivo Petrov et al.

Abstract

Recent math benchmarks for large language models (LLMs), such as MathArena, indicate that state-of-the-art reasoning models achieve impressive performance on mathematical competitions like AIME, with the leading model, Gemini-2.5-Pro, scoring on par with top human competitors. However, these benchmarks evaluate models solely on their final numerical answers, neglecting the rigorous reasoning and proof generation that are essential for real-world mathematical tasks. To address this, we introduce the first comprehensive evaluation of full-solution reasoning on challenging mathematical problems. Using expert human annotators, we evaluated several state-of-the-art reasoning models on the six problems of the 2025 USAMO within hours of their release. Our results reveal that all tested models struggled significantly: only Gemini-2.5-Pro achieved a non-trivial score of 25%, while all other models scored below 5%. Through detailed analysis of the reasoning traces, we identify the most common failure modes and find several unwanted artifacts arising from the optimization strategies employed during model training. Overall, our results suggest that current LLMs are inadequate for rigorous mathematical reasoning tasks, highlighting the need for substantial improvements in reasoning and proof generation capabilities.
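To make the percentage scores above concrete, the following is a minimal sketch of how a full-solution score could be converted into a percentage. It assumes the standard USAMO format of six problems graded from 0 to 7 points each (42 points in total). The function name `score_percentage` and the per-problem values are hypothetical, invented purely for illustration; they are not the grades reported in the paper, and the paper's actual aggregation across annotators may differ.

```python
# Sketch only: assumes the standard USAMO format (6 problems, 0-7 points each).
MAX_POINTS_PER_PROBLEM = 7
NUM_PROBLEMS = 6


def score_percentage(points_per_problem):
    """Return the total points as a percentage of the 42-point maximum."""
    if len(points_per_problem) != NUM_PROBLEMS:
        raise ValueError(f"expected {NUM_PROBLEMS} problem scores")
    for points in points_per_problem:
        if not 0 <= points <= MAX_POINTS_PER_PROBLEM:
            raise ValueError(f"score {points} is outside 0-{MAX_POINTS_PER_PROBLEM}")
    max_total = NUM_PROBLEMS * MAX_POINTS_PER_PROBLEM
    return 100.0 * sum(points_per_problem) / max_total


# Hypothetical per-problem averages (NOT the paper's data): 10.5 of 42 points.
example_scores = [7, 0, 3.5, 0, 0, 0]
print(score_percentage(example_scores))  # 25.0
```

Under this assumption, a 25% score corresponds to about 10.5 of 42 points, and a score below 5% to fewer than 2.1 points.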