arXiv

SegMAN: Omni-scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation

Yunxiang Fu, Meng Lou, Yizhou Yu

Abstract

High-quality semantic segmentation relies on three key capabilities: global context modeling, local detail encoding, and multi-scale feature extraction. However, recent methods struggle to possess all these capabilities simultaneously. Hence, we aim to empower segmentation networks to simultaneously carry out efficient global context modeling, high-quality local detail encoding, and rich multi-scale feature representation for varying input resolutions. In this paper, we introduce SegMAN, a novel linear-time model comprising a hybrid feature encoder dubbed SegMAN Encoder, and a decoder based on state space models. Specifically, the SegMAN Encoder synergistically integrates sliding local attention with dynamic state space models, enabling highly efficient global context modeling while preserving fine-grained local details. Meanwhile, the MMSCopE module in our decoder enhances multi-scale context feature extraction and adaptively scales with the input resolution. Our SegMAN-B Encoder achieves 85.1% ImageNet-1k accuracy (+1.5% over VMamba-S with fewer parameters). When paired with our decoder, the full SegMAN-B model achieves 52.6% mIoU on ADE20K (+1.6% over SegNeXt-L with 15% fewer GFLOPs), 83.8% mIoU on Cityscapes (+2.1% over SegFormer-B3 with half the GFLOPs), and 1.6% higher mIoU than VWFormer-B3 on COCO-Stuff with lower GFLOPs. Our code is available at https://github.com/yunxiangfu2001/SegMAN.
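The segmentation results above are reported as mIoU (mean intersection-over-union). As a reading aid, here is a minimal sketch of how that metric is commonly computed from flattened label maps. The function name `mean_iou`, the `ignore_index` default of 255, and the choice to skip classes that appear in neither prediction nor ground truth are assumptions made for illustration; the authors' evaluation pipeline may differ in such details.

```python
from typing import Sequence


def mean_iou(
    pred: Sequence[int],
    target: Sequence[int],
    num_classes: int,
    ignore_index: int = 255,  # assumed "unlabeled" value, not taken from the paper
) -> float:
    """Mean IoU over the classes that occur in the prediction or the ground truth."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes   # pixels where pred == target == c
    pred_count = [0] * num_classes     # pixels predicted as class c
    target_count = [0] * num_classes   # pixels labeled as class c

    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixels do not contribute to any class
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # skip classes absent from both pred and target
            ious.append(intersection[c] / union)

    return sum(ious) / len(ious) if ious else 0.0


# Tiny example with two classes and four pixels.
score = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
print(f"mIoU = {score:.4f}")
```

In this toy case class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12. On benchmark datasets the intersection and union counts are typically accumulated over the whole validation set before dividing, rather than averaged per image.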