Data-centric Federated Graph Learning with Large Language Models

arXiv

Bo Yan, Huabin Sun, Yang Cao

Abstract

In federated graph learning (FGL), privacy concerns require that a complete graph be partitioned into multiple subgraphs, each stored on a separate client, and all clients jointly train a global graph model by exchanging only model parameters. A central pain point of FGL is heterogeneity: nodes or structures exhibit non-IID properties across clients (e.g., different node label distributions), which severely undermines the convergence and performance of FGL. Existing efforts address this at the model level, i.e., they design models that extract common knowledge to mitigate heterogeneity. However, these model-level strategies fail to address heterogeneity fundamentally, because the model must be redesigned from scratch when transferring to other tasks. Motivated by the remarkable success of large language models (LLMs), we aim to use LLMs to fully understand and augment local text-attributed graphs, thereby addressing data heterogeneity at the data level. In this paper, we propose LLM4FGL, a general framework that theoretically decomposes the task of applying LLMs to FGL into two sub-tasks. Specifically, for each client, it first uses the LLM to generate missing neighbors and then infers connections between the generated nodes and the raw nodes. To improve the quality of the generated nodes, we design a novel federated generation-and-reflection mechanism for LLMs that does not modify the LLM's parameters and relies solely on collective feedback from all clients. After neighbor generation, all clients use a pre-trained edge predictor to infer the missing edges. Furthermore, our framework can be seamlessly integrated as a plug-in with existing FGL methods. Experiments on three real-world datasets demonstrate the superiority of our method over advanced baselines.
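The two sub-tasks named in the abstract (neighbor generation, then edge inference) can be illustrated with a toy, single-client sketch. Everything below is an assumption made for illustration and is not the LLM4FGL implementation: `stub_llm_generate` is a hypothetical stand-in for an LLM call, the `hint` argument is a placeholder for whatever collective feedback the reflection step would supply, and a bag-of-words cosine similarity with a fixed threshold stands in for the pre-trained edge predictor.

```python
from collections import Counter
from math import sqrt


def stub_llm_generate(node_text, label, hint=""):
    """Hypothetical stand-in for an LLM call: returns text for one synthetic neighbor."""
    return f"{node_text} related work on {label} {hint}".strip()


def bow(text):
    """Bag-of-words vector as a token-count mapping."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count mappings."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def augment_client(nodes, edges, hint="", threshold=0.5):
    """Augment one client's text-attributed subgraph.

    nodes: {node_id: (text, label)}; edges: set of (u, v) pairs.
    Returns the augmented (nodes, edges) without mutating the inputs.
    """
    # Sub-task 1: generate one synthetic neighbor per original node.
    generated = {
        f"gen_{nid}": (stub_llm_generate(text, label, hint), label)
        for nid, (text, label) in nodes.items()
    }
    # Sub-task 2: infer edges between generated and original nodes.
    # Assumption: a similarity threshold replaces the pre-trained edge predictor.
    new_edges = set(edges)
    for gid, (gtext, _) in generated.items():
        for nid, (text, _) in nodes.items():
            if cosine(bow(gtext), bow(text)) >= threshold:
                new_edges.add((gid, nid))
    return {**nodes, **generated}, new_edges


# Usage on a two-node toy subgraph.
nodes = {0: ("graph neural networks", "ml"), 1: ("protein folding dynamics", "bio")}
edges = {(0, 1)}
aug_nodes, aug_edges = augment_client(nodes, edges)
```

In a federated setting, each client would run such an augmentation locally and then train with any existing FGL method on the augmented subgraph, which is the sense in which the abstract describes the framework as a plug-in.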