Abstract
Efficient video tokenization remains a challenge in training vision models that can process long videos. One promising direction is to develop a tokenizer that encodes long video clips, since such a tokenizer can better exploit the temporal coherence of videos. However, training existing tokenizers on long videos often incurs a huge training cost, because they are trained to reconstruct all frames at once. In this paper, we introduce CoordTok, a video tokenizer that learns a mapping from coordinate-based representations to the corresponding patches of input videos, inspired by recent advances in 3D generative models. Specifically, CoordTok encodes a video into factorized triplane representations and reconstructs patches that correspond to randomly sampled $(x,y,t)$ coordinates. This allows large tokenizer models to be trained directly on long videos without requiring excessive training resources. Our experiments show that CoordTok can drastically reduce the number of tokens needed to encode long video clips. For instance, CoordTok can encode a 128-frame video at $128\times128$ resolution into 1280 tokens, while baselines need 6144 or 8192 tokens to achieve similar reconstruction quality. We further show that this efficient video tokenization enables memory-efficient training of a diffusion transformer that can generate 128 frames at once.
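The coordinate-based query described above can be illustrated with a minimal NumPy sketch. It is not the CoordTok implementation: the plane sizes (one hypothetical split that happens to sum to 1280 tokens), the channel width `C`, the use of bilinear interpolation, the concatenation of the three plane features, and the names `bilinear_sample` and `query_triplane` are all assumptions made for illustration. The learned encoder that produces the planes and the decoder that maps features to patches are omitted.

```python
import numpy as np

def bilinear_sample(plane, u, v):
    """Bilinearly interpolate a (H, W, C) plane at normalized coords u, v in [0, 1].

    u indexes axis 0 (H) and v indexes axis 1 (W); both have shape (N,).
    Returns an (N, C) array of interpolated features.
    """
    H, W, _ = plane.shape
    a = u * (H - 1)
    b = v * (W - 1)
    a0 = np.floor(a).astype(int)
    b0 = np.floor(b).astype(int)
    a1 = np.minimum(a0 + 1, H - 1)  # clamp the upper neighbor at the border
    b1 = np.minimum(b0 + 1, W - 1)
    wa = (a - a0)[:, None]
    wb = (b - b0)[:, None]
    top = plane[a0, b0] * (1 - wb) + plane[a0, b1] * wb
    bot = plane[a1, b0] * (1 - wb) + plane[a1, b1] * wb
    return top * (1 - wa) + bot * wa

def query_triplane(planes, coords):
    """Look up features for (x, y, t) coords of shape (N, 3), each in [0, 1].

    Assumption: features from the three planes are concatenated; the
    source does not say how they are combined.
    """
    x, y, t = coords[:, 0], coords[:, 1], coords[:, 2]
    f_xy = bilinear_sample(planes["xy"], x, y)
    f_yt = bilinear_sample(planes["yt"], y, t)
    f_xt = bilinear_sample(planes["xt"], x, t)
    return np.concatenate([f_xy, f_yt, f_xt], axis=-1)

rng = np.random.default_rng(0)
C = 8  # hypothetical channel width
# Hypothetical plane sizes; random values stand in for encoder outputs.
planes = {
    "xy": rng.normal(size=(16, 16, C)),
    "yt": rng.normal(size=(16, 32, C)),
    "xt": rng.normal(size=(16, 32, C)),
}
num_tokens = sum(p.shape[0] * p.shape[1] for p in planes.values())

# Randomly sample a small batch of coordinates instead of every location,
# so only these positions would need to be decoded and reconstructed.
coords = rng.uniform(size=(64, 3))
features = query_triplane(planes, coords)
print(num_tokens, features.shape)
```

Because only the sampled coordinates are queried, the per-step reconstruction cost in this sketch scales with the number of samples rather than with the full video length. For reference, the reported 1280 tokens are 4.8 times fewer than 6144 and 6.4 times fewer than 8192.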