arXiv

1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training

Han Zhao, Haotian Wang, Yiping Peng, Sitong Zhao, Xiaoyu Tian, Shuaiting Chen, Yunjie Ji, Xiangang Li

Abstract

AM-DeepSeek-R1-Distilled is a large-scale dataset with thinking traces for general reasoning tasks, composed of high-quality and challenging reasoning problems. The problems are collected from a multitude of open-source datasets, then semantically deduplicated and meticulously cleaned to eliminate test set contamination. All responses in the dataset are distilled from reasoning models (predominantly DeepSeek-R1) and have undergone rigorous verification: mathematical problems are checked against reference answers, code problems are verified with test cases, and other tasks are evaluated with the aid of a reward model. The AM-Distill-Qwen-32B model, trained on this data with only simple Supervised Fine-Tuning (SFT), outperformed the DeepSeek-R1-Distill-Qwen-32B model on four benchmarks: AIME2024, MATH-500, GPQA-Diamond, and LiveCodeBench. The AM-Distill-Qwen-72B model likewise surpassed the DeepSeek-R1-Distill-Llama-70B model on all benchmarks. We are releasing these 1.4 million problems and their corresponding responses to the research community to foster the development of powerful reasoning-oriented Large Language Models (LLMs). The dataset is available at https://huggingface.co/datasets/a-m-team/AM-DeepSeek-R1-Distilled-1.4M.
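
The three verification routes named in the abstract (reference answers for math, test cases for code, a reward model for everything else) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the answer normalization, the `reward_fn` callable, and the threshold are all hypothetical stand-ins chosen for illustration, and a real system would run generated code in a sandbox rather than with `exec`.

```python
from typing import Any, Callable, Dict, List, Tuple


def normalize_answer(text: str) -> str:
    """Crude, hypothetical normalization: trim, drop a trailing period, ignore spaces and case."""
    return text.strip().rstrip(".").replace(" ", "").lower()


def verify_math(response_answer: str, reference_answer: str) -> bool:
    """Math route: accept a response only if its final answer matches the reference."""
    return normalize_answer(response_answer) == normalize_answer(reference_answer)


def verify_code(source: str, func_name: str,
                test_cases: List[Tuple[Tuple[Any, ...], Any]]) -> bool:
    """Code route: accept a response only if it passes every (args, expected) test case.

    WARNING: exec() on untrusted code is unsafe; this is for illustration only.
    """
    namespace: Dict[str, Any] = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


def verify_other(response: str, reward_fn: Callable[[str], float],
                 threshold: float) -> bool:
    """Other tasks: accept a response whose (hypothetical) reward score clears a threshold."""
    return reward_fn(response) >= threshold
```

A data-building loop would then keep a distilled response only when the verifier for its task category returns `True`. How the authors actually extract final answers, execute tests, or set reward thresholds is not stated in the abstract.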