MULTI-SCALE NETWORK ARCHITECTURE SEARCH FOR OBJECT DETECTION

Abstract

Many commonly used detection frameworks aim to handle the multi-scale object detection problem. The input image is typically encoded into multi-scale features, and objects grouped by scale range are assigned to the corresponding features. However, the design of multi-scale feature production remains largely hand-crafted or only partially automated. In this paper, we show that exploring a broader space of encoder architectures and feature-utilization strategies can lead to superior performance. Specifically, we propose an efficient and effective multi-scale network architecture search method (MSNAS) that improves multi-scale object detection by jointly optimizing the network stride search of the encoder and the feature selection for detection heads. We demonstrate the effectiveness of the method on the COCO dataset and obtain a remarkable performance gain over the original Feature Pyramid Network.

1. INTRODUCTION

Recognizing and localizing objects at vastly different scales is a fundamental challenge in object detection. Detection performance for objects at different scales is highly dependent on feature properties such as resolution, receptive field, and fusion strategy. The key to solving the multi-scale problem in object detection is building a multi-scale network that provides proper high-level semantic features for objects at each scale. A recent work in object detection, Feature Pyramid Networks (FPN) (Lin et al., 2017), has achieved remarkable success in multi-scale feature design and has been widely adopted by modern object detectors (He et al., 2017; Lin et al., 2020; Lu et al., 2019). FPN extracts multi-scale intermediate features from the encoder network and assigns objects grouped by scale to the corresponding features according to a heuristic rule. Another prevalent detection framework, SSD (Liu et al., 2016), generates features with a lighter encoder network that contains no upsampling operators. The basic idea for dealing with the multi-scale detection problem can be summarized as follows: given an input image, a series of feature maps at various resolutions is generated, and objects grouped by scale range are detected on them. We refer to this process as multi-scale feature production. In FPN and its variants, multi-scale feature production is split into two steps: feature generation and feature utilization. In feature generation, an encoder network composed of blocks provides features at different scales; the feature-utilization strategy then determines the rule for assigning objects to feature maps. These two steps are closely coupled. Although FPN has achieved promising results on multi-scale object detection tasks, its production of multi-scale features is quite hand-crafted and relies heavily on the experience of human experts.
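The heuristic feature-utilization rule mentioned above can be illustrated with FPN's level-assignment formula, k = k0 + floor(log2(sqrt(wh)/224)), from Lin et al. (2017). The sketch below assumes a pyramid spanning levels 2 through 5 with canonical level k0 = 4; these bounds are illustrative defaults, not part of this paper's method:

```python
import math

def fpn_level(box_w, box_h, k0=4, k_min=2, k_max=5):
    """Assign a box to a pyramid level via FPN's heuristic rule
    (Lin et al., 2017): k = k0 + floor(log2(sqrt(w*h) / 224)).
    A 224x224 box (the ImageNet canonical size) maps to level k0;
    boxes half that size map one level down, and so on."""
    k = k0 + math.floor(math.log2(math.sqrt(box_w * box_h) / 224))
    # Clamp to the levels that actually exist in the pyramid.
    return max(k_min, min(k_max, k))
```

For example, a 224x224 box is assigned to level 4, a 64x64 box to level 2, and a 512x512 box to level 5. It is precisely this hand-crafted mapping, fixed independently of the encoder architecture, that the joint search described in this paper seeks to replace.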
More specifically, FPN follows a fixed downsample-then-upsample design, which may not be effective enough. By varying the positions and numbers of downsampling and upsampling operations, we can obtain many other candidate architectures that generate different multi-scale features. Moreover, the predefined feature-utilization rule is purely empirical, and other alternatives may lead to better performance. We therefore ask: can we find network architectures that build better semantic feature representations across multiple scales? The answer is yes. Recent advances in neural architecture search have shown promising results compared with architectures handcrafted by human experts (Zoph et al., 2018; Liu et al., 2019b; Cai et al., 2019; Guo et al., 2019). Several works have also applied neural architecture search to object detection tasks (Chen et al., 2019; Ghiasi et al., 2019; Du et al., 2019), but generating and utilizing multi-scale

