Few-Shot Graph Out-of-Distribution Detection with LLMs

arXiv

Haoyan Xu, Zhengtao Yao, Yushun Dong, Ziyi Wang, Ryan A. Rossi, Mengyuan Li, Yue Zhao

Abstract

Existing methods for graph out-of-distribution (OOD) detection typically depend on training graph neural network (GNN) classifiers using a substantial amount of labeled in-distribution (ID) data. However, acquiring high-quality labeled nodes in text-attributed graphs (TAGs) is challenging and costly due to their complex textual and structural characteristics. Large language models (LLMs), known for their powerful zero-shot capabilities in textual tasks, show promise but struggle to naturally capture the critical structural information inherent to TAGs, limiting their direct effectiveness. To address these challenges, we propose LLM-GOOD, a general framework that effectively combines the strengths of LLMs and GNNs to enhance data efficiency in graph OOD detection. Specifically, we first leverage LLMs' strong zero-shot capabilities to filter out likely OOD nodes, significantly reducing the human annotation burden. To minimize the usage and cost of the LLM, we employ it only to annotate a small subset of unlabeled nodes. We then train a lightweight GNN filter using these noisy labels, enabling efficient predictions of ID status for all other unlabeled nodes by leveraging both textual and structural information. After obtaining node embeddings from the GNN filter, we can apply informativeness-based methods to select the most valuable nodes for precise human annotation. Finally, we train the target ID classifier using these accurately annotated ID nodes. Extensive experiments on four real-world TAG datasets demonstrate that LLM-GOOD significantly reduces human annotation costs and outperforms state-of-the-art baselines in terms of both ID classification accuracy and OOD detection performance.
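The pipeline described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: all function names are hypothetical, the LLM annotations are simulated by flipping one ground-truth label, a one-layer graph-convolution logistic model stands in for the lightweight GNN filter, and prediction entropy is assumed as one possible informativeness score. The human annotation step and the final ID classifier are omitted.

```python
import numpy as np


def normalize_adjacency(adj):
    """Symmetrically normalize A + I, as in a standard GCN layer."""
    a_hat = adj + np.eye(adj.shape[0])
    deg_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * deg_inv_sqrt[:, None] * deg_inv_sqrt[None, :]


def train_filter(features, adj, labeled_idx, noisy_labels, epochs=200, lr=0.5):
    """Fit a one-layer graph-convolution logistic filter on noisy ID/OOD labels.

    Assumption: this simple model stands in for the paper's GNN filter.
    Returns the predicted probability of being ID for every node.
    """
    h = normalize_adjacency(adj) @ features  # mix each node with its neighbors
    w = np.zeros(h.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(h[labeled_idx] @ w + b)))
        grad = p - noisy_labels  # gradient of the logistic loss w.r.t. logits
        w -= lr * h[labeled_idx].T @ grad / len(labeled_idx)
        b -= lr * grad.mean()
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))


def select_for_annotation(p_id, candidates, budget, threshold=0.5):
    """Keep candidates predicted as ID, then pick the most uncertain ones.

    Assumption: binary entropy of the filter output is the informativeness score.
    """
    candidates = np.asarray(candidates, dtype=int)
    kept = candidates[p_id[candidates] >= threshold]
    p = np.clip(p_id[kept], 1e-9, 1 - 1e-9)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return kept[np.argsort(-entropy)[:budget]]


# Toy graph: nodes 0-19 are ID, nodes 20-39 are OOD, one ring per group.
rng = np.random.default_rng(0)
n = 40
is_id = np.arange(n) < 20
features = rng.normal(size=(n, 4)) + np.where(is_id, 1.0, -1.0)[:, None]
adj = np.zeros((n, n))
for i in range(n):
    j = (i + 1) % 20 + (0 if i < 20 else 20)
    adj[i, j] = adj[j, i] = 1.0

# Step 1 (simulated): an "LLM" labels a small subset, with one wrong label.
llm_idx = rng.choice(n, size=12, replace=False)
noisy_labels = is_id[llm_idx].astype(float)
noisy_labels[0] = 1.0 - noisy_labels[0]

# Step 2: train the filter on the noisy labels and score every node.
p_id = train_filter(features, adj, llm_idx, noisy_labels)

# Step 3: choose which remaining nodes to send to human annotators.
candidates = np.setdiff1d(np.arange(n), llm_idx)
selected = select_for_annotation(p_id, candidates, budget=5)
print("nodes proposed for human annotation:", selected)
```

In this sketch only the nodes in `selected` would go to human annotators, which mirrors the annotation saving the abstract describes; a real setup would replace the simulated labels with actual LLM prompts and the logistic filter with a trained GNN.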