Large (Vision) Language Models are Unsupervised In-Context Learners

Abstract

Recent advances in large language and vision-language models have enabled zero-shot inference, allowing models to solve new tasks without task-specific training. Adaptation techniques such as prompt engineering, In-Context Learning (ICL), and supervised fine-tuning can further improve a model's performance on a downstream task, but they require substantial manual effort to construct effective prompts or labeled examples. In this work, we introduce a joint inference framework for fully unsupervised adaptation, eliminating the need for manual prompt engineering and labeled examples. Unlike zero-shot inference, which makes an independent prediction for each input, joint inference makes predictions simultaneously for all inputs in a given task. Since direct joint inference involves computationally expensive optimization, we develop efficient approximation techniques, leading to two unsupervised adaptation methods: unsupervised fine-tuning and unsupervised ICL. We demonstrate the effectiveness of our methods across diverse tasks and models, including the language-only Llama-3.1 on natural language processing tasks, the reasoning-oriented Qwen2.5-Math on grade school math problems, the vision-language OpenFlamingo on vision tasks, and the GPT-4o model, accessible only through an API, on massive multi-discipline tasks. Our experiments demonstrate substantial improvements over the standard zero-shot approach, including a 39% absolute improvement on the challenging GSM8K math reasoning dataset. Remarkably, despite being fully unsupervised, our framework often performs on par with supervised approaches that rely on ground truth labels.
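
The contrast between independent zero-shot prediction and joint inference can be written down schematically. The notation below is an illustrative assumption and is not taken from the abstract: $x_1, \dots, x_N$ stand for the inputs of a task, $y_i$ for the answer to $x_i$, $\mathcal{Y}$ for a finite answer set, and $p_\theta$ for the model's predictive distribution. This is a sketch of one plausible formalization, not necessarily the exact objective used in the paper.

```latex
% Zero-shot inference: each input is predicted independently (assumed notation).
\hat{y}_i = \arg\max_{y \in \mathcal{Y}} \; p_\theta(y \mid x_i),
\qquad i = 1, \dots, N

% Joint inference: all answers are chosen together, conditioned on all inputs.
(\hat{y}_1, \dots, \hat{y}_N)
  = \arg\max_{(y_1, \dots, y_N) \in \mathcal{Y}^N}
    \; p_\theta(y_1, \dots, y_N \mid x_1, \dots, x_N)
```

Under this reading, an exhaustive search in the joint case would range over $|\mathcal{Y}|^N$ candidate assignments rather than $N \cdot |\mathcal{Y}|$ individual scores, which is consistent with the abstract's remark that direct joint inference is computationally expensive and needs approximation.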