Improved Alignment of Modalities in Large Vision Language Models

Abstract

Vision-language models have recently achieved remarkable results in enabling language models to understand visual inputs. However, a unified approach to aligning these models across diverse tasks such as image captioning and visual question answering remains a challenge. Existing methods require either very large language models or very large datasets, which makes inefficient use of existing models. This paper addresses this gap and devises a training strategy for auto-regressive vision-language models that unifies vision-language tasks such as image captioning and visual question answering. We propose four training stages for aligning the vision model with the language model; in other words, the language model is given the ability to process visual inputs. We also devise different attention masks for training transformer-based language models that improve the quality of visual features. Further, we report four findings: 1) the attention mask should not be applied to visual inputs; 2) the language model converges faster on AI-generated data; 3) more work should be done in the alignment stage during pre-training of the model; 4) the model can easily adapt to downstream tasks such as visual question answering on healthcare datasets like PathVQA. After one epoch of training in each stage, the model outperforms large models such as VILA with 13 billion parameters on common benchmarks, such as CIDEr scores on the COCO and Flickr30k datasets, and achieves scores very close to those of GIT-2 on the same datasets, despite being a much smaller model trained on a much smaller dataset. All training uses available best practices, including multi-GPU parallel training, lower-precision training with 16-bit floating-point numbers, faster attention (SDPA), and gradient accumulation, and completes within 12 hours.
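
The first finding, that the attention mask should not be applied to visual inputs, can be illustrated with a small sketch. This is only one possible reading, not the paper's implementation: it assumes visual tokens are placed before text tokens, that visual tokens attend to each other freely, and that text tokens keep the usual causal mask. The function name `build_attention_mask` and its parameters are hypothetical.

```python
def build_attention_mask(num_visual: int, num_text: int) -> list[list[bool]]:
    """Return a square mask; mask[i][j] is True if position i may attend to j.

    Assumed layout (not confirmed by the paper): visual tokens first,
    followed by text tokens.
    """
    total = num_visual + num_text
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_visual and j < num_visual:
                # Visual block: no causal restriction among visual tokens.
                mask[i][j] = True
            elif j <= i:
                # Everything else is causal: a token sees itself and earlier
                # positions, so text tokens see all visual tokens.
                mask[i][j] = True
    return mask


if __name__ == "__main__":
    # Print the mask for 2 visual tokens followed by 3 text tokens.
    for row in build_attention_mask(num_visual=2, num_text=3):
        print("".join("1" if allowed else "0" for allowed in row))
```

Under these assumptions, a boolean mask of this shape, where True marks an allowed position, could be passed to an SDPA-style attention call in place of a purely causal mask.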

Improved Alignment of Modalities in Large Vision Language Models - arXiv