属性图上的自适应局部聚类

Research

arXiv

属性图上的自适应局部聚类

Adaptive Local Clustering over Attributed Graphs

Haoran Zheng ,

Renchi Yang ,

Jianliang Xu

论文信息在线阅读PDF

摘要 Abstract

给定一个图 $G$ 和种子节点 $v_s$，局部图聚类（LGC）的目标是在时间大致线性于子图 $C_s \in G$ 大小的情况下识别出围绕 $v_s$ 的子图 $C_s$（即局部聚类）。这种方法可以生成个性化聚类结果，而无需访问整个图，使其非常适合涉及大规模图的各种应用。然而，大多数现有的解决方案仅仅依赖于图 $G$ 中节点之间的拓扑连通性，这使得它们容易受到现实世界图中存在的缺失或噪声链接的影响。为了解决这个问题，本文利用图拓扑与节点属性的互补性来增强局部聚类质量。为了有效利用属性信息，我们首先将 LGC 表述为双向扩散分布（BDD）的估计问题，该问题专门用于在存在属性的情况下捕捉节点间的多跳亲和性。此外，我们提出了 LACA，这是一种高效且有效的 LGC 方法，在多个真实数据集上实现了卓越的经验性能，同时保持了强局部性。LACA 的核心组件包括：(i) 针对节点属性的快速且有理论依据的预处理技术；(ii) 一种具有严格理论保证并加速收敛的自适应算法，用于在 $G$ 上扩散任何向量；(iii) 一种有效的三步方案来近似 BDD。广泛的实验表明，通过与 8 个真实数据集上的 17 种竞争方法进行比较，LACA 在基于真实局部聚类的地面真值衡量的结果质量方面优于所有竞争对手，同时在速度上快几个数量级。代码可在 https://github.com/HaoranZ99/alac 获取。

Given a graph $G$ and a seed node $v_s$, the objective of local graph clustering (LGC) is to identify a subgraph $C_s \in G$ (a.k.a. local cluster) surrounding $v_s$ in time roughly linear with the size of $C_s$. This approach yields personalized clusters without needing to access the entire graph, which makes it highly suitable for numerous applications involving large graphs. However, most existing solutions merely rely on the topological connectivity between nodes in $G$, rendering them vulnerable to missing or noisy links that are commonly present in real-world graphs. To address this issue, this paper resorts to leveraging the complementary nature of graph topology and node attributes to enhance local clustering quality. To effectively exploit the attribute information, we first formulate the LGC as an estimation of the bidirectional diffusion distribution (BDD), which is specialized for capturing the multi-hop affinity between nodes in the presence of attributes. Furthermore, we propose LACA, an efficient and effective approach for LGC that achieves superb empirical performance on multiple real datasets while maintaining strong locality. The core components of LACA include (i) a fast and theoretically-grounded preprocessing technique for node attributes, (ii) an adaptive algorithm for diffusing any vectors over $G$ with rigorous theoretical guarantees and expedited convergence, and (iii) an effective three-step scheme for BDD approximation. Extensive experiments, comparing 17 competitors on 8 real datasets, show that LACA outperforms all competitors in terms of result quality measured against ground truth local clusters, while also being up to orders of magnitude faster. The code is available at https://github.com/HaoranZ99/alac.