Vision as LoRA

Abstract

We introduce Vision as LoRA (VoRA), a novel paradigm for transforming a large language model (LLM) into a multimodal LLM (MLLM). Unlike prevalent MLLM architectures that rely on external vision modules for vision encoding, VoRA internalizes visual capabilities by integrating vision-specific LoRA layers directly into the LLM. This design allows the added parameters to be seamlessly merged into the LLM at inference time, eliminating structural complexity and minimizing computational overhead. Moreover, because it inherits the LLM's ability to handle flexible context, VoRA can process inputs at arbitrary resolutions. To further strengthen VoRA's visual capabilities, we introduce a block-wise distillation method that transfers visual priors from a pre-trained ViT into the LoRA layers, effectively accelerating training by injecting visual knowledge. Additionally, we apply bi-directional attention masks to better capture the context information of an image. Our experiments demonstrate that, with additional pre-training data, VoRA performs comparably to conventional encoder-based MLLMs. All training data, code, and model weights will be released at https://github.com/Hon-Wong/VoRA.
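Two of the mechanisms named in the abstract can be illustrated with a minimal sketch. This is not taken from the VoRA codebase: the function names (`merge_lora`, `build_attention_mask`), the `scale` parameter, and the rule that image tokens attend bidirectionally only within their own contiguous image span are assumptions made for illustration. The first function shows the standard LoRA identity W' = W + scale * (B A), which is why merged LoRA parameters add no inference cost; the second builds a mask that is causal for text tokens and bidirectional among image tokens.

```python
# Illustrative sketch only; names, shapes, and masking details are assumptions,
# not the authors' implementation.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, scale=1.0):
    """Fold a low-rank update into a dense weight: W' = W + scale * (B @ A).

    weight is d_out x d_in, lora_b is d_out x r, lora_a is r x d_in.
    """
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


def build_attention_mask(is_image):
    """Return mask[i][j] == True when token i may attend to token j.

    Text tokens follow the usual causal rule (j <= i). Image tokens may also
    attend to every token in the same contiguous image span, including later ones.
    """
    n = len(is_image)
    # Give each contiguous run of image tokens a span id; text tokens get None.
    span_ids, current = [], -1
    for i, flag in enumerate(is_image):
        if flag and (i == 0 or not is_image[i - 1]):
            current += 1
        span_ids.append(current if flag else None)
    # Start from a causal mask, then open up attention inside each image span.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            if span_ids[i] is not None and span_ids[i] == span_ids[j]:
                mask[i][j] = True
    return mask


# Tiny example: a rank-1 update on a 2x2 weight, applied to one input column.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]        # r = 1, d_in = 2
B = [[0.5], [-1.0]]     # d_out = 2, r = 1
x = [[3.0], [4.0]]

unmerged = [[wx[0] + bax[0]] for wx, bax in zip(matmul(W, x), matmul(B, matmul(A, x)))]
merged = matmul(merge_lora(W, A, B), x)

# Two image tokens followed by one text token.
mask = build_attention_mask([True, True, False])
```

In this toy setting the merged weight produces the same output as the base weight plus the separate LoRA branch, and the first image token can see the second one while the trailing text token remains causal.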