Research

arXiv

Large-Scale Multi-omic Biosequence Transformers for Modeling Protein-Nucleic Acid Interactions

Abstract

The transformer architecture has revolutionized bioinformatics and driven progress in understanding and predicting the properties of biomolecules. Almost all research on large-scale biosequence transformers has focused on one domain at a time (single-omic), usually DNA/RNA or proteins. These models have been highly successful on downstream tasks within each domain, with particularly noteworthy breakthroughs in sequence modeling and structural modeling. However, single-omic models are inherently unable to model multi-omic tasks efficiently, and one of the most biologically critical of these tasks is protein-nucleic acid interaction. We present our work training the largest open-source multi-omic foundation model to date. We show that these multi-omic models (MOMs), despite being trained only on unlabeled biosequences, learn joint representations across single-omic distributions that are emergently consistent with the Central Dogma of molecular biology. We further demonstrate that MOMs can be fine-tuned to achieve state-of-the-art results on protein-nucleic acid interaction tasks, namely predicting the change in Gibbs free energy ($\Delta G$) of binding between a given nucleic acid and protein. Remarkably, multi-omic biosequence transformers emergently learn useful structural information without any a priori structural training, which allows us to predict which protein residues are most involved in the protein-nucleic acid binding interaction. Lastly, we provide evidence that multi-omic biosequence models are in many cases superior to foundation models trained on single-omic distributions, in both performance-per-FLOP and absolute performance, suggesting that a multi-omic approach is a more generalized or foundational way to build these models for biology.