Order Matters: On Parameter-Efficient Image-to-Video Probing for Recognizing Nearly Symmetric Actions
Abstract
We study parameter-efficient image-to-video probing for the unaddressed challenge of recognizing nearly symmetric actions: visually similar actions that unfold in opposite temporal order (e.g., opening vs. closing a bottle). Existing probing mechanisms for image-pretrained models, such as DinoV2 and CLIP, rely on attention mechanisms for temporal modeling but are inherently permutation-invariant, so they produce identical predictions regardless of frame order. To address this, we introduce Self-attentive Temporal Embedding Probing (STEP), a simple yet effective approach designed to enforce temporal sensitivity in parameter-efficient image-to-video transfer. STEP enhances self-attentive probing with three key modifications: (1) a learnable frame-wise positional encoding that explicitly encodes temporal order; (2) a single global CLS token for sequence coherence; and (3) a simplified attention mechanism that improves parameter efficiency. STEP outperforms existing image-to-video probing mechanisms by 3-15% across four activity recognition benchmarks while using only one third of the learnable parameters. On two datasets, it surpasses all published methods, including fully fine-tuned models. STEP shows a distinct advantage in recognizing nearly symmetric actions, surpassing other probing mechanisms by 9-19% and parameter-heavier PEFT-based transfer methods by 5-15%. Code and models will be made publicly available.
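The permutation-invariance argument can be illustrated with a small NumPy sketch. It is not the STEP architecture: it uses a single hypothetical query vector in place of a CLS token, random matrices in place of learned weights, and random vectors in place of frozen backbone features and learned positional embeddings. All names (`attentive_pool`, `w_k`, `w_v`, `pos`) are assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attentive_pool(frames, query, w_k, w_v):
    """Pool T frame features of shape (T, D) into one vector of shape (D,).

    A single query attends over all frames. No term depends on a frame's
    index, so reordering the frames only reorders the summands.
    """
    keys = frames @ w_k                      # (T, D)
    values = frames @ w_v                    # (T, D)
    weights = softmax(query @ keys.T / np.sqrt(keys.shape[-1]))  # (T,)
    return weights @ values                  # (D,)


rng = np.random.default_rng(0)
T, D = 8, 16
frames = rng.normal(size=(T, D))  # stand-in for frozen per-frame features
query = rng.normal(size=(D,))     # stand-in for a global query / CLS token
w_k = rng.normal(size=(D, D))     # stand-in for a learned key projection
w_v = rng.normal(size=(D, D))     # stand-in for a learned value projection
pos = rng.normal(size=(T, D))     # stand-in for learned frame-wise positions

reversed_frames = frames[::-1]    # the same clip played backwards

# Without positional information, forward and reversed clips pool identically.
plain_fwd = attentive_pool(frames, query, w_k, w_v)
plain_rev = attentive_pool(reversed_frames, query, w_k, w_v)

# Adding a position vector per time step ties each frame to its index.
pos_fwd = attentive_pool(frames + pos, query, w_k, w_v)
pos_rev = attentive_pool(reversed_frames + pos, query, w_k, w_v)

order_blind = np.allclose(plain_fwd, plain_rev)
order_aware = not np.allclose(pos_fwd, pos_rev)
print(order_blind, order_aware)
```

Under these assumptions, the plain pooled vectors for the forward and reversed clips agree up to floating-point error, whereas the position-augmented ones differ. This is the property a probe needs in order to separate actions such as opening and closing a bottle.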