Abstract
Transformer-based models trained on large, general-purpose datasets of molecular strings have recently emerged as a powerful tool for modeling various structure-property relations. Inspired by this success, we extend the paradigm of training chemical language transformers on large-scale chemical datasets to generative tasks. Specifically, we propose GP-MoLFormer, an autoregressive molecular string generator trained on more than 1.1 billion chemical SMILES. GP-MoLFormer uses a 46.8M-parameter transformer decoder with linear attention and rotary positional encodings as its base architecture. We evaluate GP-MoLFormer against existing baselines on three tasks: de novo generation, scaffold-constrained molecular decoration, and unconstrained property-guided optimization. While the first two are handled with no additional training, we propose a parameter-efficient fine-tuning method for the last task, which uses property-ordered molecular pairs as input. We call this new approach pair-tuning. Our results show that GP-MoLFormer performs better than or comparably to baselines across all three tasks, demonstrating its general utility for a variety of molecular generation tasks. We further report strong memorization of training data in GP-MoLFormer generations, which has so far remained unexplored for chemical language models. Our analyses reveal that training data memorization and novelty in generations are affected by the quality and scale of the training data; duplication bias in training data can enhance memorization at the cost of lower novelty. We further establish a scaling law relating inference compute and novelty in generations.
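To make the memorization and novelty quantities concrete, the following minimal sketch scores a batch of generated SMILES against a training set. It assumes a common convention rather than the exact protocol used in this work: novelty is taken as the fraction of unique generated strings absent from the training set, and memorization as its complement. The function name `novelty_and_memorization` and the toy SMILES lists are hypothetical, and strings are compared verbatim; in practice the SMILES would likely need to be canonicalized first.

```python
def novelty_and_memorization(generated, training):
    """Return (novelty, memorization) over the unique generated strings.

    Assumption: strings are compared verbatim, so both inputs should
    already be in a single canonical SMILES form.
    """
    unique_generated = set(generated)
    if not unique_generated:
        raise ValueError("no generated molecules to score")
    training_set = set(training)
    # Memorized molecules are unique generations that also appear in training.
    memorized = unique_generated & training_set
    memorization = len(memorized) / len(unique_generated)
    novelty = 1.0 - memorization
    return novelty, memorization


# Toy data (hypothetical): two of the four unique generations are in training.
training = ["CCO", "CCN", "c1ccccc1", "CC(=O)O"]
generated = ["CCO", "CCO", "CCC", "c1ccccc1", "CCCl"]

novelty, memorization = novelty_and_memorization(generated, training)
print(f"novelty={novelty:.2f} memorization={memorization:.2f}")
```

Deduplicating the generations before scoring keeps a frequently repeated sample from inflating either rate; whether that matches the evaluation used here is an assumption of this sketch.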