针对自适应对手的有向无环图在线最短路径的高效近似最优算法
Efficient Near-Optimal Algorithm for Online Shortest Paths in Directed Acyclic Graphs with Bandit Feedback Against Adaptive Adversaries
摘要 Abstract
本文研究了在带噪声反馈下有向无环图(DAGs)中的在线最短路径问题,面对自适应对手时,我们提出了一种高效的近似最优算法。给定一个具有源节点$v_{\mathsf{s}}$和汇节点$v_{\mathsf{t}}$的DAG $G = (V, E)$,设$X \subseteq \{0,1\}^{|E|}$表示从$v_{\mathsf{s}}$到$v_{\mathsf{t}}$的所有路径集合。在每一轮$t$中,我们选择一条路径$\mathbf{x}_t \in X$并接收关于损失$\langle \mathbf{x}_t, \mathbf{y}_t \rangle \in [-1,1]$的带噪声反馈,其中$\mathbf{y}_t$是由对手任意选择的损失向量。我们的目标是在$T$轮内最小化相对于事后最佳路径的遗憾。我们提出了首个在高概率下对任何自适应对手实现接近极小极大最优遗憾界$\tilde O(\sqrt{|E|T\log |X|})$的计算高效算法,其中$\tilde O(\cdot)$隐藏了边数$|E|$的对数因子。我们的算法通过新颖的损失估计器和基于质心的分解方式,以非平凡的方式达到了这一遗憾界。作为一种应用,我们展示了该DAG算法为$m$-集、扩展式博弈、Colonel Blotto博弈、有向图中的最短行走、超立方体以及多任务多臂老虎机提供了最先进的高效算法,并在所有这些场景中实现了改进的高概率遗憾保证。
In this paper, we study the online shortest path problem in directed acyclic graphs (DAGs) under bandit feedback against an adaptive adversary. Given a DAG $G = (V, E)$ with a source node $v_{\mathsf{s}}$ and a sink node $v_{\mathsf{t}}$, let $X \subseteq \{0,1\}^{|E|}$ denote the set of all paths from $v_{\mathsf{s}}$ to $v_{\mathsf{t}}$. At each round $t$, we select a path $\mathbf{x}_t \in X$ and receive bandit feedback on our loss $\langle \mathbf{x}_t, \mathbf{y}_t \rangle \in [-1,1]$, where $\mathbf{y}_t$ is an adversarially chosen loss vector. Our goal is to minimize regret with respect to the best path in hindsight over $T$ rounds. We propose the first computationally efficient algorithm to achieve a near-minimax optimal regret bound of $\tilde O(\sqrt{|E|T\log |X|})$ with high probability against any adaptive adversary, where $\tilde O(\cdot)$ hides logarithmic factors in the number of edges $|E|$. Our algorithm leverages a novel loss estimator and a centroid-based decomposition in a nontrivial manner to attain this regret bound. As an application, we show that our algorithm for DAGs provides state-of-the-art efficient algorithms for $m$-sets, extensive-form games, the Colonel Blotto game, shortest walks in directed graphs, hypercubes, and multi-task multi-armed bandits, achieving improved high-probability regret guarantees in all these settings.