Abstract
High-resolution semantic segmentation is essential for applications such as image editing, bokeh imaging, and AR/VR. However, existing datasets often have limited resolution and lack precise mask details and boundaries. In this work, we build a large-scale, matting-level semantic segmentation dataset, named MaSS13K, which consists of 13,348 real-world images, all at 4K resolution. MaSS13K provides high-quality mask annotations for a wide range of objects, grouped into seven categories: human, vegetation, ground, sky, water, building, and others. Its masks are highly precise, with an average mask complexity 20 to 50 times higher than that of existing semantic segmentation datasets. We further present MaSSFormer, a method specifically designed for high-resolution semantic segmentation. It employs an efficient pixel decoder that aggregates high-level semantic features and low-level texture features across three stages, aiming to produce high-resolution masks at minimal computational cost. Finally, we propose a new learning paradigm that integrates the high-quality masks of the seven given categories with pseudo labels for new classes, enabling MaSSFormer to transfer its accurate segmentation capability to other classes of objects. We comprehensively evaluate MaSSFormer on the MaSS13K benchmark together with 14 representative segmentation models. We expect that our meticulously annotated MaSS13K dataset and the MaSSFormer model will facilitate research on high-resolution, high-quality semantic segmentation. The dataset and code are available at https://github.com/xiechenxi99/MaSS13K.