摘要 Abstract
图压缩作为一种有前景的方法,通过在保持图基本特征的同时压缩训练数据集,实现了对图神经网络(GNNs)的可扩展训练。我们的研究揭示了当前图压缩技术存在的显著不足。首先,大多数算法在进行压缩时需要依赖整个数据集进行训练,这与压缩的目标背道而驰。其次,由于这些方法采用梯度模拟的方式,任何超参数或GNN架构的变化都需要重新进行压缩,从而限制了其灵活性和可重用性。最后,由于合成的是全连接、带边权的图,它们无法实现显著的图规模缩减。为了解决这些问题,我们提出了Bonsai,这是一种新颖的图压缩方法,其灵感来源于消息传递GNNs的核心处理单元——\textit{计算树}。Bonsai通过精心选择一组\textit{样本树}来编码训练集中所有计算树的表示,从而实现数据集的压缩。这种方法使Bonsai成为首个针对节点分类任务的线性时间、模型无关的图压缩算法,其在7个真实世界数据集上的准确率超越了现有基线算法,并且平均速度快22倍。此外,Bonsai基于所采用的近似策略的严格数学保证,使其对GNN架构、数据集和参数具有鲁棒性。
Graph condensation has emerged as a promising avenue to enable scalable training of GNNs by compressing the training dataset while preserving essential graph characteristics. Our study uncovers significant shortcomings in current graph condensation techniques. First, the majority of the algorithms paradoxically require training on the full dataset to perform condensation. Second, due to their gradient-emulating approach, these methods require fresh condensation for any change in hyperparameters or GNN architecture, limiting their flexibility and reusability. Finally, they fail to achieve substantial size reduction due to synthesizing fully-connected, edge-weighted graphs. To address these challenges, we present Bonsai, a novel graph condensation method empowered by the observation that \textit{computation trees} form the fundamental processing units of message-passing GNNs. Bonsai condenses datasets by encoding a careful selection of \textit{exemplar} trees that maximize the representation of all computation trees in the training set. This unique approach imparts Bonsai as the first linear-time, model-agnostic graph condensation algorithm for node classification that outperforms existing baselines across $7$ real-world datasets on accuracy, while being $22$ times faster on average. Bonsai is grounded in rigorous mathematical guarantees on the adopted approximation strategies making it robust to GNN architectures, datasets, and parameters.