arXiv

RG-Attn: Radian Glue Attention for Multi-modality Multi-agent Cooperative Perception

Lantao Li, Kang Yang, Wenqi Zhang, Xiaoxue Wang, Chen Sun

Abstract

Cooperative perception offers an optimal solution to the perception limitations of single-agent systems by using Vehicle-to-Everything (V2X) communication to share and fuse data across multiple agents. However, most existing approaches exchange data of a single modality only, which limits the potential of both homogeneous and heterogeneous fusion across agents and overlooks the opportunity to use each agent's multi-modality data, restricting system performance. In the automotive industry, manufacturers adopt diverse sensor configurations, resulting in heterogeneous combinations of sensor modalities across agents. To harness every available data source for optimal performance, we design a robust LiDAR and camera cross-modality fusion module, Radian-Glue-Attention (RG-Attn). It applies to both intra-agent and inter-agent cross-modality fusion, owing to convenient coordinate conversion via transformation matrices and a unified sampling/inversion mechanism. We also propose two architectures for cooperative perception, Paint-To-Puzzle (PTP) and Co-Sketching-Co-Coloring (CoS-CoCo). PTP targets maximum precision and achieves a smaller data packet size by limiting cross-agent fusion to a single instance, but requires all participants to be equipped with LiDAR. In contrast, CoS-CoCo supports agents with any configuration (LiDAR-only, camera-only, or both LiDAR and camera) and thus offers greater generalization ability. Our approach achieves state-of-the-art (SOTA) performance on both real and simulated cooperative perception datasets. The code is available on GitHub.
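
The abstract attributes part of RG-Attn's flexibility to coordinate conversion by transformation matrix. The snippet below is a minimal sketch of that general idea only, not the paper's module: it moves a point from a sending agent's frame into the ego agent's frame using 4x4 homogeneous transforms. The planar poses, the function names (`pose_to_matrix`, `invert_rigid`, `matmul`, `transform_point`), and the sample values are all hypothetical and chosen for illustration.

```python
import math

def pose_to_matrix(x, y, yaw):
    """Build a 4x4 homogeneous transform (agent frame -> world frame)
    from a planar pose. Assumes a flat ground plane (hypothetical)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [
        [c, -s, 0.0, x],
        [s, c, 0.0, y],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]

def invert_rigid(T):
    """Invert a rigid transform: rotation becomes R^T, translation -R^T t."""
    R_t = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    t_inv = [-sum(R_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [R_t[i] + [t_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def matmul(A, B):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transform_point(T, p):
    """Apply a 4x4 transform to a 3D point given as [x, y, z]."""
    ph = [p[0], p[1], p[2], 1.0]
    return [sum(T[i][k] * ph[k] for k in range(4)) for i in range(3)]

# Hypothetical poses: ego at (10, 0) facing +x, sender at (10, 5) facing +y.
ego_to_world = pose_to_matrix(10.0, 0.0, 0.0)
sender_to_world = pose_to_matrix(10.0, 5.0, math.pi / 2)

# Chain the transforms: sender frame -> world frame -> ego frame.
sender_to_ego = matmul(invert_rigid(ego_to_world), sender_to_world)

# A point 1 m ahead of the sender lies about 6 m to the ego's left.
point_in_ego = transform_point(sender_to_ego, [1.0, 0.0, 0.0])
print(point_in_ego)  # approximately [0.0, 6.0, 0.0]
```

Because the same matrix chain works for any pair of frames, one routine of this kind could in principle serve both sensor-to-sensor conversion within an agent and agent-to-agent conversion, which is the convenience the abstract points to.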