Abstract
Although machine learning has transformed the prediction of folded protein ground-state structures with remarkable accuracy, intrinsically disordered proteins and regions (IDPs/IDRs) are defined by diverse, dynamical structural ensembles that algorithms such as AlphaFold predict only with low confidence. We present a new machine learning method, IDPForge (Intrinsically Disordered Protein, FOlded and disordered Region GEnerator), that exploits a transformer protein language diffusion model to create all-atom IDP ensembles, as well as disordered IDR ensembles that maintain the folded domains. IDPForge requires no sequence-specific training, back transformation from coarse-grained representations, or ensemble reweighting, because the generated IDP/IDR conformational ensembles generally show good agreement with solution experimental data; options for biasing with experimental restraints are provided if desired. We envision that IDPForge, with these diverse capabilities, will facilitate integrative and structural studies of proteins that contain intrinsic disorder.