arXiv

OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction

Huang Huang, Fangchen Liu, Letian Fu, Tingfan Wu

Abstract

Vision-Language-Action (VLA) models aim to predict robotic actions from visual observations and language instructions. Existing approaches require fine-tuning pre-trained vision-language models (VLMs) because visual and language features are fed independently into downstream policies, which degrades the pre-trained semantic alignment. We propose OTTER, a novel VLA architecture that leverages this existing alignment through explicit, text-aware visual feature extraction. Instead of processing all visual features, OTTER selectively extracts only the task-relevant visual features that are semantically aligned with the language instruction and passes them to the policy transformer. This allows OTTER to keep the pre-trained vision-language encoders frozen. As a result, OTTER preserves and utilizes the rich semantic understanding learned from large-scale pre-training, enabling strong zero-shot generalization. In simulation and real-world experiments, OTTER significantly outperforms existing VLA models and generalizes zero-shot to novel objects and environments. Video, code, checkpoints, and dataset: https://ottervla.github.io/.
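To make the idea of text-aware visual feature extraction concrete, the following is a minimal sketch of one plausible reading of it, not the authors' implementation. It assumes the frozen encoders produce per-patch visual features and per-token text features in a shared embedding space, and that "selecting" task-relevant features means weighting patches by their cosine similarity to each text token. The function name `text_aware_pool`, the `temperature` parameter, and the toy feature values are all hypothetical; the abstract does not specify the actual mechanism.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def text_aware_pool(patch_feats, text_feats, temperature=0.1):
    """Hypothetical helper: for each text token, return a similarity-weighted
    average of the patch features, so patches aligned with the instruction
    dominate what is passed downstream. Nothing here is trained."""
    dim = len(patch_feats[0])
    pooled = []
    for t in text_feats:
        # Assumption: alignment is measured by cosine similarity, sharpened by a temperature.
        weights = softmax([cosine(t, p) / temperature for p in patch_feats])
        pooled.append(
            [sum(w * p[d] for w, p in zip(weights, patch_feats)) for d in range(dim)]
        )
    return pooled


# Toy 2-D features standing in for frozen encoder outputs (made-up values).
patches = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
text = [[1.0, 0.1]]
pooled = text_aware_pool(patches, text)
```

In this sketch the pooled output has one vector per text token rather than one per image patch, so a downstream policy would see a short, instruction-conditioned summary of the image while both encoders stay untouched.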