Bridging Writing Manner Gap in Visual Instruction Tuning by Creating LLM-aligned Instructions

Abstract

In the realm of Large Multi-modal Models (LMMs), instruction quality during the visual instruction tuning stage significantly influences modality alignment performance. In this paper, we assess instruction quality from a unique perspective termed writing manner, which encompasses the choice of vocabulary, grammar, and sentence structure used to convey specific semantics. We argue that there is a substantial writing manner gap between the visual instructions and the base Large Language Models (LLMs) within LMMs. This gap forces the pre-trained base LLMs to deviate from their original writing styles, degrading the capabilities of both the base LLMs and the LMMs. To bridge the writing manner gap while preserving the original semantics, we propose directly leveraging the base LLM to align the writing manner of soft-format visual instructions with its own, yielding novel LLM-aligned instructions. Manual writing manner evaluation results demonstrate that our approach successfully narrows the writing manner gap. With LLM-aligned instructions, the baseline models LLaVA-7B and QwenVL show enhanced resistance to hallucinations and non-trivial, comprehensive improvements across all 15 visual and language benchmarks.
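
The rewriting step described in the abstract can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: the prompt wording, the `question`/`answer` sample layout, and the names `REWRITE_PROMPT`, `align_instruction`, and `align_dataset` are all hypothetical, and the base LLM is represented by any callable that maps a prompt string to a completion string. Deciding which samples count as soft-format is left to the caller.

```python
from typing import Callable, Dict, List

# Hypothetical prompt: the abstract does not give the actual wording.
REWRITE_PROMPT = (
    "Rewrite the following text in your own writing style. "
    "Keep the meaning exactly the same and do not add or remove facts.\n\n"
    "Text: {text}\n\nRewritten text:"
)


def align_instruction(
    sample: Dict[str, str], llm: Callable[[str], str]
) -> Dict[str, str]:
    """Return a copy of `sample` whose answer was rewritten by `llm`.

    `llm` stands in for the base LLM of the LMM (assumption: it takes a
    prompt string and returns the generated text).
    """
    rewritten = llm(REWRITE_PROMPT.format(text=sample["answer"])).strip()
    # Keep the original answer if the model returns nothing usable.
    if not rewritten:
        rewritten = sample["answer"]
    # Build a new dict so the input sample is left unmodified.
    return {**sample, "answer": rewritten}


def align_dataset(
    samples: List[Dict[str, str]], llm: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Apply `align_instruction` to every sample (assumed soft-format)."""
    return [align_instruction(s, llm) for s in samples]


def fake_llm(prompt: str) -> str:
    """Stub used only for demonstration; a real run would query the base LLM."""
    return "A dog is standing on the grass."


demo = [{"question": "What is in the image?", "answer": "dog, grass"}]
aligned = align_dataset(demo, fake_llm)
```

Only the answer text is rewritten here; the question and any image reference are carried over unchanged. A real pipeline would also need some check that the rewrite preserves the original semantics, which this sketch does not attempt.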