Grounding Multimodal LLMs to Embodied Agents that Ask for Help with Reinforcement Learning
Abstract
Embodied agents operating in real-world environments must interpret ambiguous and under-specified human instructions. A capable household robot should recognize ambiguity and ask relevant clarification questions to accurately infer the user's intent, leading to more effective task execution. To study this problem, we introduce the Ask-to-Act task, where an embodied agent must fetch a specific object instance given an ambiguous instruction in a home environment. The agent must strategically ask minimal, yet relevant, clarification questions to resolve ambiguity while navigating under partial observability. To solve this problem, we propose a novel approach that fine-tunes multimodal large language models (MLLMs) as vision-language-action (VLA) policies using online reinforcement learning (RL) with LLM-generated rewards. Our method eliminates the need for large-scale human demonstrations or manually engineered rewards for training such agents. On this task, we benchmark against strong zero-shot baselines, including GPT-4o, and against supervised fine-tuned MLLMs. Our results demonstrate that our RL-finetuned MLLM outperforms all baselines by a significant margin ($19.1$-$40.3\%$) and generalizes well to novel scenes and tasks. To the best of our knowledge, this is the first demonstration of adapting MLLMs as VLA agents that can act and ask for help using LLM-generated rewards with online RL.