DANCE: DAta-Network Co-optimization for Efficient Segmentation Model Training and Inference

arXiv

Chaojian Li, Wuyang Chen, Yuchen Gu, Yonggan Fu

Abstract

Semantic segmentation for scene understanding is now widely demanded, which poses significant challenges for algorithmic efficiency, especially in applications on resource-limited platforms. Current segmentation models are trained and evaluated on massive collections of high-resolution scene images (the "data level") and suffer from the expensive computation required by multi-scale aggregation (the "network level"). At both levels, the computational and energy costs of training and inference are substantial, owing to the large input resolutions that are often desired and the heavy computational burden of segmentation models. To this end, we propose DANCE, a general automated DAta-Network Co-optimization method for Efficient segmentation model training and inference. Unlike existing efficient segmentation approaches, which focus solely on lightweight network design, DANCE performs automated, simultaneous data-network co-optimization through both input data manipulation and network architecture slimming. Specifically, DANCE integrates automated data slimming, which adaptively downsamples or drops input images and controls their contribution to the training loss, guided by each image's spatial complexity. Besides directly reducing the cost associated with input size, this downsampling also shrinks the dynamic range of object and context scales in the input, which motivates us to adaptively slim the network to match the downsampled data. Extensive experiments and ablation studies (on four state-of-the-art segmentation models and three popular segmentation datasets under two training settings) demonstrate that DANCE can achieve an "all-win" for efficient segmentation: reduced training cost, less expensive inference, and better mean Intersection-over-Union (mIoU).
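To make the data-slimming idea in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a simple hand-written proxy for spatial complexity (the mean absolute difference between neighbouring pixels) and a fixed threshold rule; the function names, thresholds, stride, and loss weights are all hypothetical, and the abstract does not specify how DANCE actually measures complexity or chooses its policy.

```python
from typing import List, Optional, Tuple

# Grayscale image as rows of pixel intensities in [0, 1] (assumed representation).
Image = List[List[float]]


def spatial_complexity(img: Image) -> float:
    """Hypothetical complexity proxy: mean absolute difference between
    horizontally and vertically adjacent pixels."""
    h, w = len(img), len(img[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:  # right neighbour
                total += abs(img[y][x + 1] - img[y][x])
                count += 1
            if y + 1 < h:  # bottom neighbour
                total += abs(img[y + 1][x] - img[y][x])
                count += 1
    return total / count if count else 0.0


def downsample(img: Image, stride: int) -> Image:
    """Nearest-neighbour downsampling: keep every `stride`-th row and column."""
    return [row[::stride] for row in img[::stride]]


def slim_sample(
    img: Image,
    drop_below: float = 0.01,        # assumed threshold: drop near-flat images
    low_detail_below: float = 0.1,   # assumed threshold: downsample simple images
) -> Tuple[Optional[Image], float]:
    """Return (possibly downsampled image or None if dropped, loss weight).

    The rule and the weights 0.0 / 0.5 / 1.0 are illustrative assumptions.
    """
    c = spatial_complexity(img)
    if c < drop_below:
        return None, 0.0                  # drop: no contribution to the loss
    if c < low_detail_below:
        return downsample(img, 2), 0.5    # cheaper input, reduced loss weight
    return img, 1.0                       # complex image: full resolution and weight


flat = [[0.5] * 4 for _ in range(4)]
smooth = [[0.0, 0.05, 0.1, 0.15] for _ in range(4)]
checker = [[float((x + y) % 2) for x in range(4)] for y in range(4)]

for name, img in [("flat", flat), ("smooth", smooth), ("checker", checker)]:
    slimmed, weight = slim_sample(img)
    size = None if slimmed is None else (len(slimmed), len(slimmed[0]))
    print(name, size, weight)
```

In a training loop, the returned weight would scale each sample's loss term, and dropped samples would be skipped entirely. A learned or adaptive policy could replace the fixed thresholds used here.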