Abstract
Efficient single instance segmentation is essential for unlocking features in mobile imaging applications, such as capture or editing. Because of computational constraints, existing on-the-fly mobile imaging applications restrict the segmentation task to portraits or the salient subject. Instance segmentation, despite recent progress towards efficient networks, remains heavy because identifying all instances requires computation over the entire image. To address this, we propose and formulate a one-tap-driven single instance segmentation task, which segments a single instance selected by a user via a positive tap. In contrast to the broader task of segmenting anything, as suggested in the Segment Anything Model \cite{sam}, this task focuses on efficient segmentation of a single user-specified instance. To solve this problem, we present TraceNet, which explicitly locates the selected instance by way of receptive field tracing. TraceNet identifies the image regions related to the user tap and performs heavy computation only on those regions, which reduces overall computation cost and memory consumption during inference. We evaluate TraceNet on two measures: instance IoU averaged over taps, and the proportion of the region in which a user tap yields a high-quality single-instance mask. Experimental results on MS-COCO and LVIS demonstrate the effectiveness and efficiency of the proposed approach. TraceNet achieves efficiency and interactivity jointly, filling the gap between the need for efficient mobile inference and the recent research trend towards multimodal and interactive segmentation models.
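The two tap-based evaluation measures named above can be illustrated with a minimal sketch. It is not the authors' evaluation code, and everything in it is an assumption made for illustration: masks are plain nested lists of 0/1, taps are taken at every foreground pixel of the ground-truth mask, a mask counts as "high-quality" when its IoU reaches a hypothetical threshold of 0.75, and `segment` stands in for any tap-driven segmenter.

```python
from typing import Callable, List, Tuple

Mask = List[List[int]]  # binary mask as rows of 0/1 (assumed representation)
Tap = Tuple[int, int]   # (row, col) of a positive user tap


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter / union if union else 0.0


def tap_metrics(
    segment: Callable[[Tap], Mask],
    target: Mask,
    iou_threshold: float = 0.75,  # assumed cut-off for a "high-quality" mask
) -> Tuple[float, float]:
    """Return (mean IoU over taps, fraction of taps giving IoU >= threshold).

    Assumption: one tap is simulated at every foreground pixel of `target`.
    """
    taps = [
        (r, c)
        for r, row in enumerate(target)
        for c, value in enumerate(row)
        if value
    ]
    if not taps:
        raise ValueError("target mask has no foreground pixels to tap")
    ious = [mask_iou(segment(tap), target) for tap in taps]
    mean_iou = sum(ious) / len(ious)
    good_fraction = sum(1 for iou in ious if iou >= iou_threshold) / len(ious)
    return mean_iou, good_fraction


# Toy usage: a fake segmenter that only succeeds for taps in column 0.
target_mask: Mask = [[1, 1], [1, 1]]
empty_mask: Mask = [[0, 0], [0, 0]]


def toy_segment(tap: Tap) -> Mask:
    return target_mask if tap[1] == 0 else empty_mask


mean_iou, good_fraction = tap_metrics(toy_segment, target_mask)
print(mean_iou, good_fraction)
```

In this toy case, half of the simulated taps recover the instance exactly and half return an empty mask, so both measures come out at 0.5. With a real model, the second measure indicates how much of the instance a user can tap and still obtain a usable mask.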