arXiv

Hypergraph Representations of scRNA-seq Data for Improved Clustering with Random Walks

Wan He ,

Abstract

Single-cell RNA sequencing (scRNA-seq) data are often analyzed through network projections such as coexpression networks, primarily because network analysis tools for downstream tasks are widely available. However, this approach has several limitations: loss of higher-order information, inefficient data representation caused by converting a sparse dataset into a fully connected network, and overestimation of coexpression due to zero-inflation. To address these limitations, we propose conceptualizing scRNA-seq expression data as hypergraphs, which are generalized graphs in which a hyperedge can connect more than two vertices. In the context of scRNA-seq data, the hypergraph nodes represent cells and the hyperedges represent genes. Each hyperedge connects all cells in which its corresponding gene is actively expressed and records the expression of that gene across those cells. This hypergraph conceptualization enables us to explore multi-way relationships beyond the pairwise interactions in coexpression networks without loss of information. We propose two novel clustering methods: (1) the Dual-Importance Preference Hypergraph Walk (DIPHW) and (2) the Coexpression and Memory-Integrated Dual-Importance Preference Hypergraph Walk (CoMem-DIPHW). Both outperform established methods on simulated and real scRNA-seq datasets, and the improvement is especially pronounced when data modularity is weak. Furthermore, CoMem-DIPHW combines the gene coexpression network, the cell coexpression network, and the cell-gene expression hypergraph derived from the single-cell abundance count data, using all three together for embedding computation. This approach accounts for both local information from single-cell-level gene expression and global information from the pairwise similarities in the two coexpression networks.
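The hypergraph construction described in the abstract can be illustrated with a minimal sketch. It assumes a small cells-by-genes count matrix and treats a gene as "actively expressed" in a cell when its count exceeds a threshold. The function names, the threshold, and the toy data are illustrative assumptions, not taken from the paper. The walk shown is a generic two-step hypergraph random walk (cell to gene to cell) with expression-proportional probabilities; it is not the DIPHW or CoMem-DIPHW update rule, whose details the abstract does not give.

```python
import numpy as np

def build_incidence(counts, threshold=0.0):
    """Weighted incidence matrix (cells x genes) of the expression hypergraph.

    Entry [c, g] keeps the count when gene g is expressed in cell c
    (count > threshold, an assumed cutoff) and is 0 otherwise.
    """
    counts = np.asarray(counts, dtype=float)
    return np.where(counts > threshold, counts, 0.0)

def hyperedge_members(incidence):
    """Map each gene (hyperedge) index to the list of cells (nodes) it connects."""
    return {g: np.flatnonzero(incidence[:, g]).tolist()
            for g in range(incidence.shape[1])}

def walk_transition_matrix(incidence):
    """Cell-to-cell transition matrix of a generic two-step hypergraph walk.

    Step 1: from cell c, pick gene g with probability proportional to W[c, g].
    Step 2: from gene g, pick cell c' with probability proportional to W[c', g].
    """
    row_sums = incidence.sum(axis=1, keepdims=True)  # total expression per cell
    col_sums = incidence.sum(axis=0, keepdims=True)  # total expression per gene
    cell_to_gene = np.divide(incidence, row_sums,
                             out=np.zeros_like(incidence), where=row_sums > 0)
    gene_to_cell = np.divide(incidence, col_sums,
                             out=np.zeros_like(incidence), where=col_sums > 0).T
    return cell_to_gene @ gene_to_cell

# Toy data (hypothetical): 3 cells (rows) x 3 genes (columns).
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [0, 4, 1]])
incidence = build_incidence(counts)
members = hyperedge_members(incidence)
P = walk_transition_matrix(incidence)
```

In this toy example gene 0 forms a hyperedge over cells 0 and 1, gene 1 over cell 2 only, and gene 2 over cells 0 and 2. The zeros in the count matrix simply produce no hyperedge membership, rather than contributing to a pairwise similarity as they would in a coexpression projection.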