基于深度割集信息的图嵌入与聚类

Research

arXiv

基于深度割集信息的图嵌入与聚类

Deep Cut-informed Graph Embedding and Clustering

Zhiyuan Ning ,

Zaitian Wang ,

Ran Zhang ,

Ping Xu ,

Kunpeng Liu ,

Wei Ju ,

摘要 Abstract

图聚类旨在将图划分为不同的簇。近年来新兴的深度图聚类方法大多基于图神经网络（GNN）。然而，GNN的设计目的是通用图编码，现有基于GNN的深度图聚类算法普遍存在表征塌缩问题。我们归因于两个主要原因：(i) GNN模型的归纳偏差：GNN倾向于为邻近节点生成相似的表示。由于图中往往包含不可忽略的簇间链接，这种偏差会导致错误的消息传递并导致有偏的聚类；(ii) 聚类引导的损失函数：大多数传统方法试图使所有样本更接近预学习的簇中心，这会导致退化解，即所有数据点都被分配到单一标签，从而使样本变得不具区分性。为了解决这些挑战，我们从图割的角度研究图聚类，并提出了一种创新的非GNN基础的深度割集信息图嵌入与聚类框架，即DCGC。该框架包括两个模块：(i) 割集信息图编码；(ii) 基于最优传输的自监督图聚类。对于编码模块，我们推导出一个割集信息图嵌入目标，通过最小化其联合归一化割来融合图结构和属性。对于聚类模块，我们利用最优传输理论获得聚类分配，可以平衡“接近预学习簇中心”的指导作用。通过上述两种定制设计，DCGC更适合图聚类任务，能有效缓解表征塌缩问题并实现更好的性能。我们进行了广泛的实验，证明了我们的方法在基准对比中简单但有效。

Graph clustering aims to divide the graph into different clusters. The recently emerging deep graph clustering approaches are largely built on graph neural networks (GNN). However, GNN is designed for general graph encoding and there is a common issue of representation collapse in existing GNN-based deep graph clustering algorithms. We attribute two main reasons for such issues: (i) the inductive bias of GNN models: GNNs tend to generate similar representations for proximal nodes. Since graphs often contain a non-negligible amount of inter-cluster links, the bias results in error message passing and leads to biased clustering; (ii) the clustering guided loss function: most traditional approaches strive to make all samples closer to pre-learned cluster centers, which causes a degenerate solution assigning all data points to a single label thus make all samples and less discriminative. To address these challenges, we investigate graph clustering from a graph cut perspective and propose an innovative and non-GNN-based Deep Cut-informed Graph embedding and Clustering framework, namely DCGC. This framework includes two modules: (i) cut-informed graph encoding; (ii) self-supervised graph clustering via optimal transport. For the encoding module, we derive a cut-informed graph embedding objective to fuse graph structure and attributes by minimizing their joint normalized cut. For the clustering module, we utilize the optimal transport theory to obtain the clustering assignments, which can balance the guidance of "proximity to the pre-learned cluster center". With the above two tailored designs, DCGC is more suitable for the graph clustering task, which can effectively alleviate the problem of representation collapse and achieve better performance. We conduct extensive experiments to demonstrate that our method is simple but effective compared with benchmarks.