Abstract
Fermi Problems (FPs) are mathematical reasoning tasks that require human-like logic and numerical reasoning. Unlike other reasoning questions, FPs often involve real-world impracticalities or ambiguous concepts, making them challenging even for humans. Despite advances in AI, particularly the application of large language models (LLMs) to various reasoning tasks, FPs remain relatively under-explored. We conducted an exploratory study of the capabilities and limitations of LLMs in solving FPs. We first evaluated the overall performance of three advanced LLMs on a publicly available FP dataset, designing prompts according to the recently proposed TELeR taxonomy and including a zero-shot setting. All three LLMs achieved an fp_score (which ranges from 0 to 1) below 0.5, underscoring the inherent difficulty of these reasoning tasks. To investigate further, we categorized FPs into standard and specific questions and hypothesized that LLMs would perform better on standard questions, which are clearer and more concise, than on specific ones. Comparative experiments confirmed this hypothesis: LLMs performed better on standard FPs in terms of both accuracy and efficiency.
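The abstract reports fp_score without defining it. As a minimal sketch, assuming the metric follows the order-of-magnitude score commonly used with public Fermi problem benchmarks (full credit for an exact answer, decaying linearly to zero once the estimate is off by three orders of magnitude), it could be computed as below. The function signature, the three-orders cutoff, and the handling of non-positive values are assumptions for illustration, not details confirmed by this work.

```python
import math


def fp_score(predicted: float, gold: float) -> float:
    """Assumed order-of-magnitude score in [0, 1].

    Returns 1.0 when the prediction equals the gold answer and 0.0 when it
    is off by three or more orders of magnitude (assumed cutoff).
    """
    # Assumption: non-positive values cannot be compared on a log scale,
    # so they receive no credit.
    if predicted <= 0 or gold <= 0:
        return 0.0
    orders_off = abs(math.log10(predicted / gold))
    return max(0.0, 1.0 - orders_off / 3.0)


# Hypothetical example: an estimate that is ten times too small
# is one order of magnitude off.
score = fp_score(predicted=10.0, gold=100.0)
```

Under this assumed definition, a score below 0.5 corresponds to an estimate that is, on average, more than 1.5 orders of magnitude away from the reference answer.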