基于可泛化图神经网络的大规模网络流量工程

Traffic Engineering in Large-scale Networks with Generalizable Graph Neural Networks

摘要 Abstract

随着全球范围云广域网或低地球轨道卫星星座的快速发展,大规模计算机网络中的流量工程(TE)已成为一个基础但具有挑战性的问题。为了解决传统TE算法的可扩展性问题,提出了基于学习的方法,显示出相对于现有最优方法显著的效率提升潜力。然而,现有基于学习的方法存在固有局限性,限制了其实际应用:它们无法在不同的拓扑结构和网络条件下通用,训练开销过大,并且默认情况下不尊重链路容量。本文提出了一种新的TE算法TELGEN,该算法能够在大规模网络中高效解决TE问题,同时在多样化的网络条件下表现出优越的泛化能力。TELGEN基于一种新颖的思想,即将“预测最优TE解决方案”的问题转化为“预测最优TE算法”的问题,这使得TELGEN能够学习并有效地逼近经典最优TE算法的端到端求解过程。所学得的算法对具体的网络拓扑或流量模式无依赖性,并且可以在任意输入下高效解决TE问题,并很好地推广到未见过的拓扑和需求。我们在随机和真实网络上训练和评估了TELGEN,这些网络最多包含5000个节点和106条链路。TELGEN在所有情况下保证可行性的同时,优化差距小于3%,即使测试网络的节点数量比训练集中的最大网络多出20倍。它还比经典的最优求解器节省高达84%的求解时间,并且在最大的网络上相比最新的学习算法,每轮训练时间和求解时间减少了2到4个数量级。

Traffic engineering (TE) in large-scale computer networks has become a fundamental yet challenging problem, owing to the swift growth of global-scale cloud wide-area networks or backbone low-Earth-orbit satellite constellations. To address the scalability issue of traditional TE algorithms, learning-based approaches have been proposed, showing potential of significant efficiency improvement over state-of-the-art methods. Nevertheless, the intrinsic limitations of existing learning-based methods hinder their practical application: they are not generalizable across diverse topologies and network conditions, incur excessive training overhead, and do not respect link capacities by default. This paper proposes TELGEN, a novel TE algorithm that learns to solve TE problems efficiently in large-scale networks, while achieving superior generalizability across diverse network conditions. TELGEN is based on the novel idea of transforming the problem of "predicting the optimal TE solution" into "predicting the optimal TE algorithm", which enables TELGEN to learn and efficiently approximate the end-to-end solving process of classical optimal TE algorithms. The learned algorithm is agnostic to the exact network topology or traffic patterns, and can efficiently solve TE problems given arbitrary inputs and generalize well to unseen topologies and demands. We trained and evaluated TELGEN on random and real-world networks with up to 5000 nodes and 106 links. TELGEN achieved less than 3% optimality gap while ensuring feasibility in all cases, even when the test network had up to 20x more nodes than the largest in training. It also saved up to 84% solving time than classical optimal solver, and could reduce training time per epoch and solving time by 2-4 orders of magnitude than latest learning algorithms on the largest networks.

基于可泛化图神经网络的大规模网络流量工程 - arXiv