arXiv

Is Temporal Prompting All We Need For Limited Labeled Action Recognition?

Shreyank N Gowda, Boyan Gao, Xiao Gu, Xiaobo Jin

Abstract

Video understanding has improved markedly in recent years, largely owing to the availability of large-scale labeled datasets. Recent vision-language models, especially those based on contrastive pretraining, generalize remarkably well in zero-shot tasks, helping to overcome this dependence on labeled data. Adapting such models to video typically involves modifying the vision-language architecture to handle video data. This is not trivial, since such adaptations are mostly computationally intensive and struggle with temporal modeling. We present TP-CLIP, an adaptation of CLIP that leverages temporal visual prompting for temporal adaptation without modifying the core CLIP architecture, which preserves its generalization abilities. TP-CLIP integrates efficiently into the CLIP architecture, leveraging its pre-trained capabilities for video data. Extensive experiments across various datasets demonstrate its efficacy in zero-shot and few-shot learning, outperforming existing approaches with fewer parameters and greater computational efficiency. In particular, we use just 1/3 of the GFLOPs and 1/28 of the tunable parameters of a recent state-of-the-art method and still outperform it by up to 15.8%, depending on the task and dataset.
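The abstract does not spell out how the temporal prompts are built, so the following is only a generic illustration of prompting a frozen encoder, not the TP-CLIP design. It assumes a toy one-layer attention encoder standing in for CLIP's visual encoder, and every name and size in it (`frozen_encoder`, `carry`, `D`, `N_PROMPT`, and so on) is hypothetical. The point it tries to show is that only a small set of prompt parameters would need training while the pretrained weights stay untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16        # token width (hypothetical)
N_PATCH = 4   # patch tokens per frame (hypothetical)
N_PROMPT = 2  # learnable prompt tokens per frame (hypothetical)

# Stand-in for a frozen pretrained image encoder: one self-attention layer
# with fixed weights. A real CLIP visual encoder would take its place.
frozen = {name: rng.standard_normal((D, D)) / np.sqrt(D) for name in ("q", "k", "v")}

# The only trainable parts in this sketch (assumed, not taken from the paper).
trainable = {
    "prompts": rng.standard_normal((N_PROMPT, D)) * 0.02,  # shared across frames
    "carry": rng.standard_normal((D, D)) * 0.02,  # previous-frame summary -> prompt
}


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def frozen_encoder(tokens):
    """Apply the frozen attention layer to a (num_tokens, D) array."""
    q, k, v = tokens @ frozen["q"], tokens @ frozen["k"], tokens @ frozen["v"]
    attn = softmax(q @ k.T / np.sqrt(D))
    return tokens + attn @ v  # residual connection


def encode_video(frames):
    """frames: (T, N_PATCH, D) patch embeddings. Returns a unit-norm (D,) vector."""
    prev_summary = np.zeros(D)
    frame_features = []
    for patches in frames:
        # Temporal prompt: a learned projection of the previous frame's summary.
        temporal_prompt = (prev_summary @ trainable["carry"])[None, :]
        tokens = np.concatenate([temporal_prompt, trainable["prompts"], patches], axis=0)
        out = frozen_encoder(tokens)
        prev_summary = out.mean(axis=0)
        frame_features.append(prev_summary)
    video = np.mean(frame_features, axis=0)
    return video / np.linalg.norm(video)


def count(params):
    """Total number of scalar parameters in a dict of arrays."""
    return sum(p.size for p in params.values())


frames = rng.standard_normal((8, N_PATCH, D))  # random stand-in for 8 video frames
embedding = encode_video(frames)
print(embedding.shape, count(trainable), count(frozen))
```

In this toy setup the trainable prompt parameters number 288 against 768 frozen ones; the ratio is an artifact of the toy sizes and says nothing about the figures reported in the abstract. Because the encoder weights are never modified, the video embedding stays in the same space as the frozen model's outputs, which is the property a prompt-based adaptation relies on to compare against text embeddings.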