arXiv

Socratic Planner: Self-QA-Based Zero-Shot Planning for Embodied Instruction Following

Suyeon Shin, Sujin Jeon

Abstract

Embodied Instruction Following (EIF) is the task of executing natural language instructions by navigating and interacting with objects in interactive environments. A key challenge in EIF is compositional task planning, which is typically addressed through supervised learning or few-shot in-context learning, both of which rely on labeled data. To remove this dependence, we introduce the Socratic Planner, a self-QA-based zero-shot planning method that infers an appropriate plan without any further training. The Socratic Planner first prompts a Large Language Model (LLM) to ask and answer its own questions, which in turn helps it generate a sequence of subgoals. While executing the subgoals, an embodied agent may encounter unexpected situations, such as unforeseen obstacles. The Socratic Planner then adjusts its plan based on dense visual feedback through a visually grounded re-planning mechanism. In experiments, the Socratic Planner outperforms current state-of-the-art planning models on the ALFRED benchmark across all metrics, and excels particularly in long-horizon tasks that demand complex inference. We further demonstrate its real-world applicability by deploying it on a physical robot for long-horizon tasks.
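
The control flow described in the abstract (self-QA, then subgoal generation, then re-planning when execution fails) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give prompts, interfaces, or the form of the visual feedback, so every name here (`self_qa_plan`, `run_episode`, `toy_llm`, `ToyKitchen`, the prompt tags `ASK`/`ANSWER`/`PLAN`/`REPLAN`) is a hypothetical stand-in, the "LLM" is a scripted stub, and the "visual feedback" is a plain text description of the scene.

```python
from typing import Callable, List, Tuple

# Assumption: an "LLM" is modeled as any function from prompt text to reply text.
LLM = Callable[[str], str]


def parse_subgoals(text: str) -> List[str]:
    """Split a reply into one subgoal per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def self_qa_plan(llm: LLM, instruction: str, num_questions: int = 2) -> List[str]:
    """Let the model ask and answer its own questions, then request subgoals."""
    qa_pairs: List[Tuple[str, str]] = []
    for _ in range(num_questions):
        question = llm(f"ASK\ninstruction: {instruction}\nso far: {qa_pairs}")
        answer = llm(f"ANSWER\ninstruction: {instruction}\nquestion: {question}")
        qa_pairs.append((question, answer))
    return parse_subgoals(llm(f"PLAN\ninstruction: {instruction}\nqa: {qa_pairs}"))


def run_episode(llm: LLM, execute: Callable[[str], bool],
                describe_scene: Callable[[], str], instruction: str,
                max_replans: int = 3) -> List[str]:
    """Execute subgoals in order; on failure, re-plan from scene feedback."""
    subgoals = self_qa_plan(llm, instruction)
    done: List[str] = []
    replans = 0
    while subgoals:
        subgoal = subgoals.pop(0)
        if execute(subgoal):
            done.append(subgoal)
            continue
        if replans >= max_replans:
            break  # give up instead of looping forever
        replans += 1
        # Hypothetical stand-in for the visually grounded feedback.
        feedback = describe_scene()
        subgoals = parse_subgoals(llm(
            f"REPLAN\ninstruction: {instruction}\ndone: {done}\n"
            f"failed: {subgoal}\nscene: {feedback}"))
    return done


def toy_llm(prompt: str) -> str:
    """Scripted replies keyed on the first prompt line (illustration only)."""
    kind = prompt.split("\n", 1)[0]
    replies = {
        "ASK": "Which object must be moved, and where?",
        "ANSWER": "The mug must end up in the sink.",
        "PLAN": "pickup mug\nput mug in sink",
        "REPLAN": "open cabinet\npickup mug\nput mug in sink",
    }
    return replies[kind]


class ToyKitchen:
    """Toy environment: the mug is inside a cabinet that starts closed."""

    def __init__(self) -> None:
        self.cabinet_open = False

    def execute(self, subgoal: str) -> bool:
        if subgoal == "open cabinet":
            self.cabinet_open = True
            return True
        if subgoal == "pickup mug":
            return self.cabinet_open  # fails while the cabinet is closed
        return True

    def describe_scene(self) -> str:
        return "open cabinet with a mug" if self.cabinet_open else "a closed cabinet"


if __name__ == "__main__":
    kitchen = ToyKitchen()
    print(run_episode(toy_llm, kitchen.execute, kitchen.describe_scene,
                      "Put the mug in the sink"))
```

In this toy run the initial plan fails at `pickup mug` because the cabinet is closed, and the scripted re-plan inserts `open cabinet` before retrying. A real system would replace `toy_llm` with an actual model call and `describe_scene` with a perception component; both are left abstract here because the abstract does not describe them.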