De Novo Functional Protein Sequence Generation: Overcoming Data Scarcity through Regeneration and Large Models

Abstract

Proteins are essential components of all living organisms and play a critical role in cell survival. Their applications range from clinical treatment to materials engineering. This versatility has driven the development of protein design, in which amino acid sequence design is a crucial step. Deep generative models have recently shown promise for protein sequence design. However, these models typically require large training datasets, and for some types of functional proteins the available sequence data are scarce, which can hinder training. To address this challenge, we propose ProteinRG, a hierarchical model that can generate functional protein sequences from relatively small datasets. ProteinRG first generates a representation of a protein sequence, leveraging existing large protein sequence models, and then produces a functional protein sequence. We tested the model on several types of functional protein sequences and evaluated the results from three perspectives: multiple sequence alignment, t-SNE distribution analysis, and 3D structure prediction. The results indicate that the generated sequences remain similar to the original sequences and are consistent with the desired functions. Moreover, our model outperforms other generative models for protein sequence generation.