SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining

Abstract

Recognizing arbitrary or previously unseen categories is essential for comprehensive real-world 3D scene understanding. Currently, all existing methods rely on 2D or textual modalities during training, or combine the two at inference. This highlights a clear gap: no model can process 3D data alone to learn semantics end-to-end, and the data needed to train such a model is missing. Meanwhile, 3D Gaussian Splatting (3DGS) has emerged as the de facto standard for 3D scene representation across various vision tasks. However, effectively integrating semantic reasoning into 3DGS in a generalizable fashion remains an open challenge. To address these limitations, we introduce SceneSplat, to our knowledge the first large-scale 3D indoor scene understanding approach that operates natively on 3DGS. Furthermore, we propose a self-supervised learning scheme that unlocks rich 3D feature learning from unlabeled scenes. To power the proposed methods, we introduce SceneSplat-7K, the first large-scale 3DGS dataset for indoor scenes, comprising 6,868 scenes derived from seven established datasets, including ScanNet and Matterport3D. Generating SceneSplat-7K required computational resources equivalent to 119 GPU-days on an L4 GPU, and the dataset enables standardized benchmarking of 3DGS-based reasoning on indoor scenes. Our exhaustive experiments on SceneSplat-7K demonstrate the significant benefit of the proposed methods over the established baselines.