PharMolixFM: All-Atom Foundation Models for Molecular Modeling and Generation

Abstract

Structural biology relies on accurate three-dimensional biomolecular structures to advance our understanding of biological functions, disease mechanisms, and therapeutics. While recent advances in deep learning have enabled the development of all-atom foundation models for molecular modeling and generation, existing approaches generalize poorly because atomic data are multi-modal and because training and sampling strategies have not been comprehensively analyzed. To address these limitations, we propose PharMolixFM, a unified framework for constructing all-atom foundation models based on multi-modal generative techniques. Our framework includes three variants using state-of-the-art multi-modal generative models. By formulating molecular tasks as a generalized denoising process with task-specific priors, PharMolixFM achieves robust performance across various structural biology applications. Experimental results demonstrate that PharMolixFM-Diff achieves competitive prediction accuracy in protein-small-molecule docking (83.9% vs. 90.2% of predictions with RMSD < 2 Å, given the pocket) with significantly improved inference speed. Moreover, we explore the empirical inference scaling law by increasing the number of sampling repeats or sampling steps. Our code and model are available at https://github.com/PharMolix/OpenBioMed.
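The docking figure quoted above is a success rate: the fraction of predicted ligand poses whose RMSD to the reference pose is below 2 Å. The following is a minimal sketch of how such a metric is commonly computed, not the evaluation code from the PharMolixFM repository. The function names `rmsd` and `success_rate` are hypothetical, and the sketch assumes that predicted and reference atoms are listed in the same order, that coordinates are already in the same frame (no alignment), and that no symmetry correction is applied.

```python
import math


def rmsd(pred, ref):
    """RMSD in angstroms between two equally ordered lists of (x, y, z) atoms.

    Assumption: both poses share a coordinate frame, so no superposition
    is performed, and atom symmetry is ignored.
    """
    if len(pred) != len(ref) or not pred:
        raise ValueError("poses must be non-empty and have the same atom count")
    squared = sum(
        (p - r) ** 2
        for atom_pred, atom_ref in zip(pred, ref)
        for p, r in zip(atom_pred, atom_ref)
    )
    return math.sqrt(squared / len(pred))


def success_rate(rmsds, threshold=2.0):
    """Fraction of poses whose RMSD is strictly below the threshold."""
    if not rmsds:
        raise ValueError("need at least one RMSD value")
    return sum(r < threshold for r in rmsds) / len(rmsds)
```

With per-complex RMSD values collected over a benchmark, `success_rate(values)` gives the kind of percentage reported in the abstract. A production evaluation would usually add symmetry-aware atom matching, which this sketch omits.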