arXiv

Multi-dataset and Transfer Learning Using Gene Expression Knowledge Graphs

Abstract

Gene expression datasets offer insights into gene regulation mechanisms, biochemical pathways, and cellular functions. Comparing gene expression profiles between disease and control patients can also deepen the understanding of disease pathology. Machine learning has therefore been used to process gene expression data, with patient diagnosis emerging as one of the most popular applications. Although gene expression data can provide valuable insights, two challenges arise: the number of patients in an expression dataset is usually limited, and data from different datasets with different gene expressions cannot be easily combined. This work proposes a novel methodology that addresses these challenges by integrating multiple gene expression datasets and domain-specific knowledge using knowledge graphs, a unique tool for biomedical data integration. Vector representations are then produced with knowledge graph embedding techniques and used as inputs to a graph neural network and a multi-layer perceptron. We evaluate the efficacy of the methodology in three settings: single-dataset learning, multi-dataset learning, and transfer learning. The experimental results show that combining gene expression datasets and domain-specific knowledge improves patient diagnosis in all three settings.
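
As a rough illustration of the integration idea described in the abstract, the sketch below merges two toy expression datasets that measure different genes into a single set of knowledge graph triples, together with a few domain-knowledge links. It is not the authors' implementation: the patient and gene identifiers, the expression values, the annotation terms, and the relation names `has_expression` and `annotated_with` are all assumptions made for this example.

```python
from collections import defaultdict

# Two toy datasets measuring different, partially overlapping gene sets.
# All identifiers and values are invented for illustration.
dataset_a = {
    "A_p1": {"TP53": 2.1, "BRCA1": 0.4},
    "A_p2": {"TP53": 0.3, "BRCA1": 1.8},
}
dataset_b = {
    "B_p1": {"TP53": 1.9, "EGFR": 2.5},
}

# Toy domain knowledge: gene -> annotation terms (hypothetical links).
gene_annotations = {
    "TP53": ["apoptosis"],
    "BRCA1": ["dna_repair"],
    "EGFR": ["cell_proliferation"],
}


def build_triples(datasets, annotations):
    """Return (head, relation, tail, weight) tuples for one shared graph."""
    triples = []
    # Patient-gene edges, weighted by the measured expression value.
    for dataset in datasets:
        for patient, profile in dataset.items():
            for gene, value in profile.items():
                triples.append((patient, "has_expression", gene, value))
    # Gene-annotation edges carrying the domain-specific knowledge.
    for gene, terms in annotations.items():
        for term in terms:
            triples.append((gene, "annotated_with", term, 1.0))
    return triples


def adjacency(triples):
    """Build an undirected neighbour index over the triples."""
    adj = defaultdict(set)
    for head, _relation, tail, _weight in triples:
        adj[head].add(tail)
        adj[tail].add(head)
    return adj


triples = build_triples([dataset_a, dataset_b], gene_annotations)
graph = adjacency(triples)

# Patients from different datasets become connected through shared genes.
shared = graph["A_p1"] & graph["B_p1"]
print(shared)
```

Because both datasets refer to the same gene nodes, patients from different cohorts end up in one connected graph even though their measured gene sets differ. An embedding method could then be trained on such triples to produce the vector inputs the abstract mentions; that step is omitted here.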