Learning from Streaming Video with Orthogonal Gradients

Abstract

We address the challenge of self-supervised representation learning from a continuous stream of video. This differs from standard approaches to video learning, where videos are chopped and shuffled during training to create non-redundant batches that satisfy the independent and identically distributed (IID) sample assumption expected by conventional training paradigms. When videos are only available as a continuous stream, the IID assumption is evidently broken, leading to poor performance. We demonstrate the drop in performance when moving from shuffled to sequential learning on three tasks: the one-video representation learning method DoRA, standard VideoMAE on multi-video datasets, and future video prediction. To address this drop, we propose a geometric modification to standard optimizers that decorrelates batches by utilising orthogonal gradients during training. The modification can be applied to any optimizer; we demonstrate it with Stochastic Gradient Descent (SGD) and AdamW. Our orthogonal optimizer alleviates the drop in representation learning performance for models trained from streaming video, as evaluated on downstream tasks. In all three scenarios (DoRA, VideoMAE, future prediction), our orthogonal optimizer outperforms the strong AdamW baseline.
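The abstract does not spell out the exact update rule, so the following is only a minimal sketch of one plausible reading of "orthogonal gradients": before each step, the component of the current gradient that is parallel to the previous batch's gradient is removed, so that consecutive, highly correlated batches contribute only their new direction. The names `orthogonalize` and `OrthogonalSGD`, the choice of the previous gradient as the reference direction, and the plain-list parameter representation are all assumptions made for illustration, not the method as defined in the paper.

```python
from typing import List, Optional

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthogonalize(grad: Vector, reference: Vector, eps: float = 1e-12) -> Vector:
    """Return grad with its component along `reference` removed.

    Assumption: this projection is one way to realise "orthogonal
    gradients"; the paper may use a different reference direction.
    """
    denom = dot(reference, reference)
    if denom < eps:
        # Reference is (near) zero: nothing to project out.
        return list(grad)
    scale = dot(grad, reference) / denom
    return [g - scale * r for g, r in zip(grad, reference)]


class OrthogonalSGD:
    """Hypothetical SGD variant that steps along the part of the current
    gradient orthogonal to the previous batch's gradient."""

    def __init__(self, lr: float = 0.1) -> None:
        self.lr = lr
        self.prev_grad: Optional[Vector] = None

    def step(self, params: Vector, grad: Vector) -> Vector:
        if self.prev_grad is None:
            update = list(grad)  # first step: plain SGD
        else:
            update = orthogonalize(grad, self.prev_grad)
        self.prev_grad = list(grad)  # remember the raw gradient
        return [p - self.lr * u for p, u in zip(params, update)]


if __name__ == "__main__":
    opt = OrthogonalSGD(lr=0.5)
    params = [0.0, 0.0]
    # Two consecutive "batches" whose gradients overlap heavily.
    for grad in ([1.0, 0.0], [1.0, 1.0]):
        params = opt.step(params, grad)
        print(params)
```

In this toy run the second gradient shares its first coordinate with the first one, so only its second coordinate is applied. The same wrapper idea could in principle be placed in front of AdamW by orthogonalizing the gradient before it enters the moment estimates, though that ordering is likewise an assumption here.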