

arXiv

Token Dynamics: Towards Efficient and Dynamic Video Token Representation for Video Large Language Models

Abstract


Token-based video representation has emerged as a promising approach for enabling large language models (LLMs) to interpret video content. However, existing token reduction methods, such as token pruning and token merging, often disrupt essential spatial-temporal positional embeddings and fail to adequately balance computational efficiency against token count. Consequently, these methods result in lengthy token sequences, which limits their applicability in scenarios that require extreme token compression, such as video LLMs. In this paper, we introduce the novel task of extreme short token reduction, which aims to represent extensive video sequences with a minimal number of tokens. To address this challenge, we propose Token Dynamics, a new video representation framework that dynamically reduces the token count while preserving spatial-temporal coherence. Specifically, we disentangle video representations by separating visual embeddings from grid-level motion information and structuring them into three components: (1) a concise token hash table, created by clustering tokens that describe object-level content; (2) a token indices key map, which captures detailed spatial-temporal motion patterns across grids; and (3) a token hash function, which vector-quantizes the token hash table to reconstruct the token sequence from the key map. Furthermore, we introduce a cross-dynamics attention mechanism that integrates motion features into the token base without increasing token length, thereby maintaining compactness and spatial-temporal integrity. Experiments demonstrate a reduction of the token count to merely 0.07% of the original tokens, with only a minor performance drop of 1.13%. Additionally, we propose two novel subtasks within extreme token reduction: fixed-length and adaptive-length compression. Our method offers significantly lower theoretical complexity, fewer tokens, and higher throughput, thus providing an efficient solution for video LLMs.
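The three components named in the abstract (hash table, key map, hash function) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: plain k-means stands in for whatever clustering the authors use, the function names `build_hash_table` and `reconstruct` are hypothetical, and the cross-dynamics attention mechanism is not modeled at all.

```python
import numpy as np


def build_hash_table(tokens, k, iters=10, seed=0):
    """Cluster grid tokens of shape (T, G, D) into k centroids.

    Returns a (k, D) "hash table" of centroid tokens and a (T, G) integer
    "key map" that records which centroid each frame/grid cell maps to.
    Plain k-means is an assumed stand-in, not the paper's procedure.
    """
    T, G, D = tokens.shape
    flat = tokens.reshape(-1, D).astype(np.float64)
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct input tokens.
    table = flat[rng.choice(len(flat), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every token to every centroid: (T*G, k).
        d2 = ((flat[:, None, :] - table[None, :, :]) ** 2).sum(-1)
        keys = d2.argmin(1)
        for j in range(k):
            members = flat[keys == j]
            if len(members):  # leave empty clusters where they are
                table[j] = members.mean(0)
    # Final assignment against the updated centroids.
    d2 = ((flat[:, None, :] - table[None, :, :]) ** 2).sum(-1)
    return table, d2.argmin(1).reshape(T, G)


def reconstruct(table, key_map):
    """Look up each key in the table to rebuild a (T, G, D) token grid."""
    return table[key_map]


if __name__ == "__main__":
    # Toy input: 4 frames, 6 grid cells per frame, 8-dim embeddings.
    video_tokens = np.random.default_rng(1).normal(size=(4, 6, 8))
    table, key_map = build_hash_table(video_tokens, k=3)
    approx = reconstruct(table, key_map)
    print(table.shape, key_map.shape, approx.shape)
```

In this toy setup only the `k` rows of `table` would count as tokens, while `key_map` holds small integers that preserve each grid cell's position in space and time. That separation is the intuition behind reducing the token count without discarding the spatial-temporal layout.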