arXiv

Autoregressive Action Sequence Learning for Robotic Manipulation

Xinyu Zhang, Yuhan Liu

Abstract

Designing a universal policy architecture that performs well across diverse robots and task configurations remains a key challenge. In this work, we address this by representing robot actions as sequential data and generating actions through autoregressive sequence modeling. Existing autoregressive architectures generate end-effector waypoints sequentially, like word tokens in language modeling, which limits them to low-frequency control tasks. Unlike language, robot actions are heterogeneous and often include continuous values (such as joint positions, 2D pixel coordinates, and end-effector poses) that are not easily suited to language-based modeling. Based on this insight, we introduce a straightforward enhancement: we extend the single-token prediction of causal transformers to support predicting a variable number of tokens in a single step through our Chunking Causal Transformer (CCT). This enhancement enables robust performance across diverse tasks with various control frequencies, improves efficiency by requiring fewer autoregression steps, and leads to a hybrid action sequence design that mixes different types of actions and uses a different chunk size for each action type. Based on CCT, we propose the Autoregressive Policy (ARP) architecture, which solves manipulation tasks by generating hybrid action sequences. We evaluate ARP across diverse robotic manipulation environments, including Push-T, ALOHA, and RLBench, and show that ARP, as a universal architecture, matches or outperforms the environment-specific state of the art in all tested benchmarks, while being more efficient in computation and parameter size. Videos of our real robot demonstrations, all source code, and the pretrained models of ARP can be found at http://github.com/mlzxy/arp.
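The chunking idea in the abstract can be illustrated with a toy generation loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: `toy_predictor` is a hypothetical stand-in for a trained model, and the action-type names (`pixel_xy`, `joint_pos`) and chunk sizes are illustrative choices rather than values taken from the paper. It shows only the control flow: each autoregressive step appends a whole chunk of tokens, and different action types may use different chunk sizes.

```python
from typing import Callable, List, Tuple

# (action type, value): a simplified stand-in for a real action token.
Token = Tuple[str, float]


def toy_predictor(prefix: List[Token], action_type: str, chunk_size: int) -> List[float]:
    """Hypothetical stand-in for a trained model.

    Returns `chunk_size` values that depend only on how many tokens
    have been generated so far.
    """
    start = len(prefix)
    return [float(start + i) for i in range(chunk_size)]


def generate(
    predictor: Callable[[List[Token], str, int], List[float]],
    schedule: List[Tuple[str, int]],
) -> Tuple[List[Token], int]:
    """Run one autoregressive step per schedule entry.

    Each step conditions on the tokens generated so far and appends a
    whole chunk of tokens of the requested action type.
    """
    sequence: List[Token] = []
    steps = 0
    for action_type, chunk_size in schedule:
        values = predictor(sequence, action_type, chunk_size)
        if len(values) != chunk_size:
            raise ValueError("predictor returned the wrong number of tokens")
        sequence.extend((action_type, v) for v in values)
        steps += 1
    return sequence, steps


# Illustrative hybrid schedule: one chunk of 2 pixel-coordinate tokens,
# followed by two chunks of 4 joint-position tokens each.
hybrid_schedule = [("pixel_xy", 2), ("joint_pos", 4), ("joint_pos", 4)]

# Single-token baseline that produces the same 10 tokens, one per step.
single_token_schedule = [("pixel_xy", 1)] * 2 + [("joint_pos", 1)] * 8

chunked_seq, chunked_steps = generate(toy_predictor, hybrid_schedule)
single_seq, single_steps = generate(toy_predictor, single_token_schedule)

print(chunked_steps, single_steps)  # 3 10
```

In this toy setting both schedules yield the same 10-token sequence, but the chunked schedule needs 3 autoregressive steps instead of 10, which is the efficiency argument the abstract makes. A real model would not generally produce identical outputs under the two schedules; the equality here is an artifact of the stub predictor.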