Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness

Abstract

The rapid development of Large Multimodal Models (LMMs) for 2D images and videos has spurred efforts to adapt these models to interpret 3D scenes. However, the lack of large-scale 3D vision-language datasets poses a significant obstacle. To address this issue, typical approaches inject 3D awareness into 2D LMMs by designing 3D input-level scene representations. This work offers a different perspective. We introduce reconstructive visual instruction tuning with 3D-awareness (Ross3D), which integrates 3D-aware visual supervision into the training procedure. Specifically, it combines cross-view and global-view reconstruction. The former requires reconstructing masked views by aggregating overlapping information from other views. The latter aggregates information from all available views to recover Bird's-Eye-View (BEV) images, providing a comprehensive overview of the entire scene. Empirically, Ross3D achieves state-of-the-art performance across various 3D scene understanding benchmarks. More importantly, our semi-supervised experiments demonstrate significant potential for leveraging large amounts of unlabeled 3D vision-only data.
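To make the cross-view objective concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes each view is a flat feature vector, that masking replaces a view with zeros, and that the reconstruction error is a mean squared error computed only on masked views. The names `mask_views` and `masked_mse`, the zero-masking scheme, the MSE loss, and the mean-of-visible-views stand-in "model" are all hypothetical choices made for illustration; the abstract does not specify any of them.

```python
import random


def mask_views(views, mask_ratio, rng):
    """Zero out a random subset of views (assumed masking scheme).

    Returns the masked copy and the set of masked view indices.
    """
    num_masked = max(1, round(mask_ratio * len(views)))
    masked_ids = set(rng.sample(range(len(views)), num_masked))
    masked = [
        [0.0] * len(view) if i in masked_ids else list(view)
        for i, view in enumerate(views)
    ]
    return masked, masked_ids


def masked_mse(pred, target, masked_ids):
    """Mean squared error over masked views only (assumed loss)."""
    errs = [
        (p - t) ** 2
        for i in masked_ids
        for p, t in zip(pred[i], target[i])
    ]
    return sum(errs) / len(errs)


rng = random.Random(0)
# Toy scene: 8 views, each a 4-dimensional feature vector.
views = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
masked, masked_ids = mask_views(views, mask_ratio=0.25, rng=rng)

# Stand-in "model": predict every view as the mean of the visible views.
visible = [v for i, v in enumerate(masked) if i not in masked_ids]
mean_view = [sum(col) / len(col) for col in zip(*visible)]
pred = [mean_view for _ in views]

loss = masked_mse(pred, views, masked_ids)
```

The point of the sketch is only that supervision comes from the visual signal itself: the masked views serve as targets, so no text annotation is needed, which is consistent with the claim that unlabeled 3D vision-only data can be used.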