Abstract
In this paper, we propose the Text-based Open Molecule Generation Benchmark (TOMG-Bench), the first benchmark for evaluating the open-domain molecule generation capability of large language models (LLMs). TOMG-Bench comprises datasets for three major tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and customized molecule generation (MolCustom). Each major task is further divided into three subtasks, and each subtask contains 5,000 test samples. Given the inherent complexity of evaluating open molecule generation, we also develop an automated evaluation system that helps measure both the quality and the accuracy of the generated molecules. Our comprehensive benchmarking of 25 LLMs reveals current limitations and potential areas for improvement in text-guided molecule discovery. Furthermore, we propose OpenMolIns, a specialized instruction-tuning dataset built to address the challenges raised by TOMG-Bench. Fine-tuned on OpenMolIns, Llama3.1-8B outperforms all open-source general-purpose LLMs and even surpasses GPT-3.5-turbo by 46.5% on TOMG-Bench. Our code and datasets are available at https://github.com/phenixace/TOMG-Bench.