MegaMath: Pushing the Limits of Open Math Corpora

arXiv

Fan Zhou, Zengzhi Wang, Nikhil Ranjan, Zhoujun Cheng, Liping Tang, Guowei He, Zhengzhong Liu, Eric P. Xing


Abstract

Mathematical reasoning is a cornerstone of human intelligence and a key benchmark for advanced capabilities in large language models (LLMs). However, the research community still lacks an open, large-scale, high-quality corpus tailored to the demands of math-centric LLM pre-training. We present MegaMath, an open dataset curated from diverse, math-focused sources through the following practices: (1) Revisiting web data: we re-extracted mathematical documents from Common Crawl using math-oriented HTML optimizations, fastText-based filtering, and deduplication, in order to acquire higher-quality data from the Internet. (2) Recalling math-related code data: we identified high-quality math-related code from a large code training corpus, Stack-V2, further enhancing data diversity. (3) Exploring synthetic data: we synthesized QA-style text, math-related code, and interleaved text-code blocks from web data or code data. By integrating these strategies and validating their effectiveness through extensive ablations, MegaMath delivers 371B tokens, the largest quantity and top quality among existing open math pre-training datasets.
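The filtering and deduplication steps named in practice (1) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not give the classifier, threshold, or deduplication method, so `math_score` is a hypothetical keyword-based stand-in for a trained fastText classifier, `MATH_HINTS` and the `0.3` threshold are invented values, and exact-match hashing over normalized text stands in for whatever deduplication the authors actually used.

```python
import hashlib
import re

# Hypothetical hint terms; a real pipeline would use a trained classifier instead.
MATH_HINTS = ("theorem", "lemma", "equation", "integral", "\\frac", "proof")


def math_score(text: str) -> float:
    """Stand-in scorer (assumption): fraction of hint terms found in the text."""
    lowered = text.lower()
    hits = sum(1 for hint in MATH_HINTS if hint in lowered)
    return hits / len(MATH_HINTS)


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text).strip().lower()


def filter_and_dedup(docs: list[str], threshold: float = 0.3) -> list[str]:
    """Keep documents scoring at or above the threshold, dropping exact duplicates."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        if math_score(doc) < threshold:
            continue  # filtered out as not math-related
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate of a document already kept
        seen.add(digest)
        kept.append(doc)
    return kept


docs = [
    "Proof of the theorem: the integral equals 1.",
    "proof of the  THEOREM: the integral equals 1.",
    "Best pizza recipes in town.",
]
result = filter_and_dedup(docs)
```

With these sample inputs, the second document normalizes to the same string as the first and is dropped as a duplicate, while the third scores zero and is filtered out. At Common Crawl scale, exact hashing would likely be too strict; it is used here only to keep the sketch short.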