MSFM: MULTI-SCALE FUSION MODULE FOR OBJECT DETECTION

Abstract

Feature fusion benefits object detection tasks in two ways. On one hand, detail and position information can be combined with semantic information when high- and low-resolution features from shallow and deep layers are fused. On the other hand, objects can be detected at different scales, which improves the robustness of the framework. In this work, we present a Multi-Scale Fusion Module (MSFM) that extracts both detail and semantic information from a single input, but at different scales within the same layer. Specifically, the module input is resized into different scales, position and semantic information is processed at each scale, and the results are then rescaled back and combined with the module input. The MSFM is lightweight and can be used as a drop-in layer in many existing object detection frameworks. Experiments show that MSFM brings a +2.5% mAP improvement with only 2.4M extra parameters on Faster R-CNN with a ResNet-50 FPN backbone on the COCO Object Detection minival set, outperforming the module-free ResNet-101 FPN backbone, which obtains +2.0% mAP with 19.0M extra parameters. The best resulting model achieves 45.7% mAP on the test-dev set. Code will be available.

1. INTRODUCTION

Object detection is one of the fundamental tasks in computer vision. It requires the detector to localize the objects in an image using bounding boxes and to assign the correct category to each of them. In recent years, deep convolutional neural networks (CNNs) have seen great success in object detection. Detectors can be divided into two categories: two-stage detectors, e.g., Faster R-CNN (Ren et al., 2015), and one-stage detectors, e.g., SSD (Liu et al., 2016). Two-stage detectors have high localization and recognition accuracy, while one-stage detectors achieve high inference speed (Jiao et al., 2019). A typical two-stage detector consists of a backbone, a neck, a Region Proposal Network (RPN), and a Region of Interest (ROI) head (Chen et al., 2019). The backbone is a feature extractor, usually pre-trained on the ImageNet dataset (Deng et al., 2009). The neck could be a Feature Pyramid Network (FPN) (Lin et al., 2017a) that fuses features from multiple layers. The RPN proposes candidate object bounding boxes, and the ROI head performs box regression and classification (Ren et al., 2015). Compared to two-stage detectors, one-stage detectors predict bounding boxes directly from the input image without the region proposal step, and are thus more efficient (Jiao et al., 2019).

One of the key challenges in object detection is to solve the two subtasks, namely localization and classification, coordinately. Localization requires the network to capture object positions accurately, while classification expects the network to extract the semantic information of the objects. Due to the layered structure of CNNs, detailed and position-accurate information resides in shallow but high-resolution layers, whereas high-level and semantically strong information exists in deep but low-resolution layers (Long et al., 2014). Another key challenge is scale invariance: the detector is expected to handle objects at different scales (Liu et al., 2016).
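The resize–process–rescale–combine scheme described in the abstract can be illustrated with a minimal, framework-free sketch. Plain nested lists stand in for a single-channel feature map, average pooling and nearest-neighbour upsampling stand in for the resizing steps, and the per-branch processing is left as an identity placeholder for the convolutional operations a real module would apply; all function names here are hypothetical, not from the paper's released code.

```python
def downsample(x, factor):
    """Average-pool a 2D grid by an integer factor (stand-in for resizing down)."""
    h, w = len(x), len(x[0])
    return [[sum(x[i * factor + di][j * factor + dj]
                 for di in range(factor) for dj in range(factor)) / factor ** 2
             for j in range(w // factor)]
            for i in range(h // factor)]


def upsample(x, factor):
    """Nearest-neighbour upsample back to the original resolution."""
    return [[x[i // factor][j // factor]
             for j in range(len(x[0]) * factor)]
            for i in range(len(x) * factor)]


def multi_scale_fusion(x, factors=(1, 2, 4)):
    """Toy fusion: process the input at several scales, rescale each branch
    back, and sum everything onto the module input. The per-branch
    'processing' is the identity here, a placeholder for real convolutions."""
    h, w = len(x), len(x[0])
    out = [row[:] for row in x]  # start from a copy of the module input
    for f in factors:
        branch = downsample(x, f) if f > 1 else x
        branch = [[v for v in row] for row in branch]  # placeholder processing
        restored = upsample(branch, f) if f > 1 else branch
        for i in range(h):
            for j in range(w):
                out[i][j] += restored[i][j]
    return out
```

On a constant input, each branch reproduces the input value, so the output equals the input plus one contribution per scale; with real convolutions in each branch, the low-resolution branches would contribute semantically coarse context while the full-resolution branch preserves positional detail.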
Feature fusion is beneficial to object detectors in solving these two challenges. On one hand, through multi-layer fusion (Chen et al., 2020), detail and position information can be combined with semantic information when high- and low-resolution features from shallow and deep layers are fused. On the other hand, by fusing the results from different receptive fields (Yu & Koltun, 2016) or scales

