Abstract
Neural Architecture Search (NAS) has been widely adopted to design accurate and efficient image classification models. However, applying NAS to a new computer vision task still requires a huge amount of effort. This is because 1) previous NAS research has focused disproportionately on image classification while largely ignoring other tasks; 2) many NAS works focus on optimizing task-specific components that do not transfer well to other tasks; and 3) existing NAS methods are typically designed to be "proxyless" and require significant effort to integrate with each new task's training pipeline. To tackle these challenges, we propose FBNetV5, a NAS framework that can search for neural architectures for a variety of vision tasks with greatly reduced computational cost and human effort. Specifically, we design 1) a search space that is simple yet inclusive and transferable; 2) a multitask search process that is disentangled from the target tasks' training pipelines; and 3) an algorithm that simultaneously searches for architectures for multiple tasks with a computational cost agnostic to the number of tasks. We evaluate the proposed FBNetV5 on three fundamental vision tasks: image classification, object detection, and semantic segmentation. Models searched by FBNetV5 in a single search run outperform the previous state-of-the-art in all three tasks: image classification (e.g., +1.3% ImageNet top-1 accuracy at the same FLOPs as FBNetV3), semantic segmentation (e.g., 1.8% higher ADE20K val. mIoU than SegFormer with 3.6x fewer FLOPs), and object detection (e.g., +1.1% COCO val. mAP with 1.2x fewer FLOPs than YOLOX).