QG-VTC: Question-Guided Visual Token Compression in MLLMs for Efficient VQA
Abstract
Multi-modal Large Language Models (MLLMs) have made significant progress in open-world Visual Question Answering (VQA). However, integrating visual information increases the number of tokens to be processed, which leads to higher GPU memory usage and computational overhead. Images often contain more redundant information than text, and not all visual details are relevant to a specific question. To address these challenges, we propose QG-VTC, a novel question-guided visual token compression method for MLLM-based VQA tasks. QG-VTC uses a pretrained text encoder and a learnable feed-forward layer to embed the user's question into the vision encoder's feature space, then computes correlation scores between the question embedding and the visual tokens. By selecting the most relevant tokens and softly compressing the others, QG-VTC keeps the retained visual information closely aligned with the user's question. In addition, a progressive strategy applies this compression at different vision encoder layers, gradually reducing the token count. This approach maximizes the retention of question-relevant information while discarding irrelevant details. Experimental results show that our method achieves performance on par with uncompressed models while using only 1/8 of the visual tokens. The code and model will be made publicly available on GitHub.
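The selection and compression steps described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation, and several details are assumptions made only for illustration: cosine similarity stands in for the correlation score, the non-selected tokens are merged into a single token by a softmax-weighted average as one possible form of "soft" compression, and the question vector is assumed to be already projected into the visual feature space (the text encoder and feed-forward layer are omitted). The names `cosine`, `compress_tokens`, `progressive_compress`, and the budget schedule are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def compress_tokens(tokens, question, budget):
    """Reduce `tokens` to at most `budget` vectors, guided by `question`.

    The budget - 1 tokens most similar to the question are kept unchanged, in
    their original order. All remaining tokens are merged into one extra token
    by a softmax-weighted average (an assumed form of soft compression).
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    if len(tokens) <= budget:
        return list(tokens)

    # Assumed correlation score: cosine similarity with the question embedding.
    scores = [cosine(t, question) for t in tokens]
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:budget - 1])  # restore original token order
    rest = ranked[budget - 1:]

    # Softmax weights over the non-selected tokens, then a weighted average.
    weights = [math.exp(scores[i]) for i in rest]
    total = sum(weights)
    merged = [
        sum(w * tokens[i][d] for w, i in zip(weights, rest)) / total
        for d in range(len(question))
    ]
    return [tokens[i] for i in kept] + [merged]


def progressive_compress(tokens, question, budgets):
    """Apply compress_tokens once per stage with a shrinking token budget.

    In a real model, vision encoder layers would run between stages; they are
    left out here, so each stage operates directly on the previous output.
    """
    for budget in budgets:
        tokens = compress_tokens(tokens, question, budget)
    return tokens


# Toy data: 16 two-dimensional "visual tokens" spread around the unit circle.
tokens = [[math.cos(k * math.pi / 8), math.sin(k * math.pi / 8)] for k in range(16)]
question = [1.0, 0.0]  # stand-in for a projected question embedding

out = progressive_compress(tokens, question, budgets=[8, 4, 2])
print(len(tokens), "->", len(out))  # 16 -> 2
```

With the hypothetical schedule `[8, 4, 2]`, 16 tokens are reduced to 2, which mirrors the 1/8 ratio mentioned in the abstract. The token best aligned with the question survives unchanged, while the others contribute only through the merged token.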