视觉环境交互式规划在具身复杂问题回答中的应用
Visual Environment-Interactive Planning for Embodied Complex-Question Answering
摘要 Abstract
本文聚焦于具身复杂问题回答任务,即要求具身机器人理解具有复杂结构和抽象语义的人类问题。该任务的核心在于基于视觉环境感知制定适当的计划。现有方法通常采用一次性规划的方式,即一步到位规划,这种方法依赖于大型模型,且对环境的理解不足。为解决这一问题,本文提出了一种按顺序形式化计划的框架,并考虑了多步规划。为了确保框架能够处理复杂问题,我们构建了一个结构化的语义空间,在这个空间中,分层视觉感知和问题本质的链式表达可以实现迭代交互,从而使得顺序任务规划成为可能。在框架内,我们首先基于视觉层次场景图解析自然语言,以明确问题意图。然后,结合外部规则为当前步骤制定计划,弱化对大型模型的依赖。每个计划都基于视觉感知反馈生成,并通过多轮交互直至获得答案。此方法实现了持续反馈和调整,使机器人能够优化其行动策略。为验证我们的框架,我们贡献了一个包含更复杂问题的新数据集。实验结果表明,我们的方法在复杂任务中表现出色且稳定,并且在现实场景中的可行性也得到了验证,显示了其实用性。
This study focuses on Embodied Complex-Question Answering task, which means the embodied robot need to understand human questions with intricate structures and abstract semantics. The core of this task lies in making appropriate plans based on the perception of the visual environment. Existing methods often generate plans in a once-for-all manner, i.e., one-step planning. Such approach rely on large models, without sufficient understanding of the environment. Considering multi-step planning, the framework for formulating plans in a sequential manner is proposed in this paper. To ensure the ability of our framework to tackle complex questions, we create a structured semantic space, where hierarchical visual perception and chain expression of the question essence can achieve iterative interaction. This space makes sequential task planning possible. Within the framework, we first parse human natural language based on a visual hierarchical scene graph, which can clarify the intention of the question. Then, we incorporate external rules to make a plan for current step, weakening the reliance on large models. Every plan is generated based on feedback from visual perception, with multiple rounds of interaction until an answer is obtained. This approach enables continuous feedback and adjustment, allowing the robot to optimize its action strategy. To test our framework, we contribute a new dataset with more complex questions. Experimental results demonstrate that our approach performs excellently and stably on complex tasks. And also, the feasibility of our approach in real-world scenarios has been established, indicating its practical applicability.