Large-Scale Multi-omic Biosequence Transformers for Modeling Protein-Nucleic Acid Interactions

Abstract

The transformer architecture has revolutionized bioinformatics and driven progress in understanding and predicting the properties of biomolecules. Almost all research on large-scale biosequence transformers has focused on one domain at a time (single-omic), usually DNA/RNA or proteins. These models have seen great success on downstream tasks in their respective domains and have achieved particularly noteworthy breakthroughs in sequence modeling and structural modeling. However, single-omic models are inherently unable to model multi-omic tasks efficiently, and one of the most biologically critical of these tasks is protein-nucleic acid interaction. We present our work training the largest open-source multi-omic foundation model to date. We show that these multi-omic models (MOMs) can learn joint representations between various single-omic distributions that are emergently consistent with the Central Dogma of molecular biology, despite being trained only on unlabeled biosequences. We further demonstrate that MOMs can be fine-tuned to achieve state-of-the-art results on protein-nucleic acid interaction tasks, namely predicting the change in Gibbs free energy ($\Delta G$) of the binding interaction between a given nucleic acid and protein. Remarkably, we show that multi-omic biosequence transformers emergently learn useful structural information without any \textit{a priori} structural training, allowing us to predict which protein residues are most involved in the protein-nucleic acid binding interaction. Lastly, we provide evidence that multi-omic biosequence models are in many cases superior to foundation models trained on single-omic distributions, both in performance-per-FLOP and in absolute performance, suggesting that a multi-omic approach may be a more generalized or foundational way to build such models for biology.