MULTI-SCALE NETWORK ARCHITECTURE SEARCH FOR OBJECT DETECTION

Abstract

Many commonly used detection frameworks aim to handle the multi-scale object detection problem: the input image is encoded into multi-scale features, and objects grouped by scale range are assigned to the corresponding features. However, the design of multi-scale feature production is largely hand-crafted or only partially automated. In this paper, we show that exploring more possible encoder architectures and different feature utilization strategies can lead to superior performance. Specifically, we propose an efficient and effective multi-scale network architecture search method (MSNAS) that improves multi-scale object detection by jointly optimizing the network stride search of the encoder and the feature selection for detection heads. We demonstrate the effectiveness of the method on the COCO dataset and obtain a remarkable performance gain over the original Feature Pyramid Networks.

1. INTRODUCTION

Recognizing and localizing objects at vastly different scales is a fundamental challenge in object detection. Detection performance for objects of different scales is highly related to feature properties such as resolution, receptive field, and feature fusion strategy. The key to solving the multi-scale problem in object detection is building a multi-scale network that provides proper high-level semantic features for objects of different scales. A recent work in object detection, Feature Pyramid Networks (FPN) (Lin et al., 2017), has achieved remarkable success in multi-scale feature design and is commonly used by many modern object detectors (He et al., 2017; Lin et al., 2020; Lu et al., 2019). FPN extracts multi-scale intermediate features from the encoder network and assigns objects grouped by scale to the corresponding features according to a heuristic rule. Another prevalent detection framework, SSD (Liu et al., 2016), generates features with a lighter encoder network without upsampling operators. The basic idea for handling the multi-scale detection problem can be summarized as follows: given the input image, a series of feature maps with various resolutions is generated to detect objects grouped by scale range. We refer to this as multi-scale feature production. In FPN and its variants, multi-scale feature production is split into two steps, feature generation and feature utilization. In feature generation, an encoder network composed of blocks provides features at different scales. The strategy of feature utilization then determines the rule for assigning objects to feature maps. These two steps are closely related to each other. Although FPN has achieved promising results on multi-scale object detection tasks, its multi-scale feature production is largely hand-crafted and relies heavily on the experience of human experts.
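The heuristic assignment rule of FPN mentioned above maps each object (RoI) to a pyramid level according to its area. A minimal sketch of that rule (Eq. 1 of Lin et al., 2017); the level bounds used here are illustrative defaults, not prescribed by this paper:

```python
import math

def fpn_level(w, h, k0=4, canonical=224, k_min=2, k_max=5):
    """Assign an RoI of size w x h to a pyramid level P_k.

    Follows k = floor(k0 + log2(sqrt(w*h) / 224)) from Lin et al. (2017),
    clamped to the available pyramid levels [k_min, k_max].
    """
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return max(k_min, min(k_max, k))
```

With these defaults, a 224x224 box lands on the canonical level P4; halving or doubling the box side shifts the assignment down or up by one level. This fixed one-to-one rule is exactly the hand-crafted feature utilization that the paper argues should instead be searched.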
More specifically, FPN is based on a downsample-then-upsample architecture, which may not be effective enough. By changing the positions and numbers of downsampling and upsampling operations, we can obtain many other candidate architectures that generate different multi-scale features. Also, the predefined rule of feature utilization is very empirical, and other alternatives may lead to better performance. Therefore we ask: can we find network architectures that build better semantic feature representations for multiple scales? The answer is yes. Recent advances in neural architecture search have shown promising results compared with architectures handcrafted by human experts (Zoph et al., 2018; Liu et al., 2019b; Cai et al., 2019; Guo et al., 2019). Several works have also applied neural architecture search to object detection (Chen et al., 2019; Ghiasi et al., 2019; Du et al., 2019), but jointly searching the generation and utilization of multi-scale features remains underexplored.

In this paper, we propose a new method that takes both aspects into account and builds detection networks with a strong and proper multi-scale feature production strategy via neural architecture search. For feature generation, we put forward a network stride search method to generate multiple feature representations for different scales. Different from the scale-decreasing-then-increasing architecture of FPN, the scale of our networks can decrease or increase at each block, as illustrated in Figure 1. By searching the stride of each block, we can explore a much wider range of possible feature generation designs for multi-resolution networks. Most backbones of object detectors were originally designed for image classification, where the multi-scale problem does not arise; in our method, by contrast, the stride configuration of the encoder network is optimized in the context of the multi-scale task. Moreover, richer cross-scale feature fusions may emerge from the more varied internal scale changes.
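To make the stride search concrete, the sketch below (a hypothetical helper, not code from the paper) traces how a per-block scale choice from {0.5, 1, 2} determines the resolution of each generated feature map. FPN's fixed downsample-then-upsample schedule is just one point in this space:

```python
def feature_resolutions(input_size, block_scales):
    """Trace the spatial size of the feature map after each block.

    block_scales: per-block scale factor chosen from {0.5, 1.0, 2.0},
    where 0.5 = downsample (stride 2), 1.0 = keep, 2.0 = upsample.
    Returns the list of feature map sizes produced by each block.
    """
    sizes, size = [], input_size
    for s in block_scales:
        size = int(size * s)
        sizes.append(size)
    return sizes
```

With B searchable blocks and three stride choices each, this space contains 3^B candidate feature generation designs, which is why an automated search is preferable to hand-crafting the schedule.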
For feature utilization, we replace the previous one-to-one mapping strategy with a more flexible feature selection. Since each group of objects within the same scale range is handled by one detection head, feature utilization is implemented by selecting proper features for the detection heads. Objects of different scale ranges may thus be assigned to the same feature map, which is not possible in previous methods, as shown in Figure 1(b). By jointly optimizing the generation and utilization of multi-scale features, we search for flexible yet complete multi-scale feature production strategies. Extensive experiments demonstrate that searching the complete multi-scale feature production is critical to building strong and proper semantic features for detecting objects of different scales. On the challenging COCO dataset (Lin et al., 2014), our method obtains 2.6%, 1.5%, and 1.2% mAP improvements at similar FLOPs over ResNet18-FPN, ResNet34-FPN, and ResNet50-FPN, respectively.
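The flexible feature utilization can be viewed as a search over head-to-feature assignments. The toy enumeration below (illustrative names, not the paper's actual search procedure) shows that, unlike FPN's fixed one-to-one mapping, several detection heads may select the same feature map:

```python
from itertools import product

def candidate_assignments(num_heads, num_features):
    """Enumerate all ways to assign each detection head to one feature map.

    Unlike FPN's one-to-one mapping, nothing forbids two heads (e.g. the
    small- and medium-object heads) from sharing the same feature map, so
    the space has num_features ** num_heads candidates.
    """
    return list(product(range(num_features), repeat=num_heads))
```

For example, with 3 detection heads and 4 generated feature maps there are 4^3 = 64 candidate utilization strategies, of which only a handful correspond to the classic one-to-one rule; the remainder are exactly the assignments that previous methods cannot express.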

2. RELATED WORK

Neural Architecture Search Neural architecture search aims to design better network architectures automatically. RL-based methods (Zoph et al., 2018; Zoph & Le, 2017) have achieved great success despite a huge computation cost. In differentiable algorithms (Liu et al., 2019b; Cai et al., 2019), architecture parameters are introduced and the operators in the search space are combined as a weighted sum over candidates, which makes the search objective differentiable with respect to the architecture parameters.
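The continuous relaxation used by such differentiable methods (e.g., DARTS, Liu et al., 2019b) can be sketched as a softmax-weighted mixture of candidate operators. The NumPy snippet below is a simplified illustration under that assumption, not an implementation from any of the cited works:

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax over a 1-D array of architecture parameters."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

def mixed_op(x, ops, alpha):
    """Relax a discrete operator choice into a differentiable mixture.

    ops:   candidate operators (callables) in the search space.
    alpha: learnable architecture parameters, one per operator.
    The output is the softmax(alpha)-weighted sum of all operator outputs,
    so gradients flow to alpha through the task loss.
    """
    w = softmax(np.asarray(alpha, dtype=float))
    return sum(wi * op(x) for wi, op in zip(w, ops))
```

After the search converges, the operator with the largest architecture weight is kept and the rest are pruned, turning the relaxed mixture back into a discrete architecture.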



Figure 1: Architecture of ResNet18-FPN and the searched network of MSNAS-R18. MSNAS-R18 has different stride values for the blocks in the encoder network and a more flexible feature utilization strategy. For simplicity, P6 of FPN is not included in the figure and only three detection heads are presented.

