Vision Transformers for End-to-End Vision-Based Quadrotor Obstacle Avoidance

Abstract

We demonstrate the capabilities of an attention-based, end-to-end approach to high-speed, vision-based quadrotor obstacle avoidance in dense, cluttered environments, and compare it against several state-of-the-art learning architectures. Quadrotor unmanned aerial vehicles (UAVs) are highly maneuverable when flown fast; however, as flight speed increases, traditional model-based approaches to navigation, which rely on independent perception, mapping, planning, and control modules, break down due to increased sensor noise, compounding errors, and increased processing latency. Learning-based, end-to-end vision-to-control networks have therefore shown great potential for online control of these fast robots through cluttered environments. We train and compare convolutional, U-Net, and recurrent architectures against vision transformer (ViT) models for depth-image-to-control in high-fidelity simulation. We observe that ViT models are more effective than the others as quadrotor speed increases and when generalizing to unseen environments. Adding recurrence further improves performance while reducing quadrotor energy cost across all tested flight speeds. We assess performance at speeds of up to 7 m/s in simulation and on hardware. To the best of our knowledge, this is the first work to use vision transformers for end-to-end vision-based quadrotor control.