MSFM: MULTI-SCALE FUSION MODULE FOR OBJECT DETECTION

Abstract

Feature fusion benefits object detection in two ways. On one hand, when high- and low-resolution features from shallow and deep layers are fused, detail and position information can be combined with semantic information. On the other hand, objects can be detected at different scales, which improves the robustness of the framework. In this work, we present a Multi-Scale Fusion Module (MSFM) that extracts both detail and semantic information from a single input at different scales within the same layer. Specifically, the module input is resized into different scales, position and semantic information are processed at each scale, and the results are rescaled back and combined with the module input. The MSFM is lightweight and can be used as a drop-in layer in many existing object detection frameworks. Experiments show that MSFM brings a +2.5% mAP improvement with only 2.4M extra parameters to Faster R-CNN with a ResNet-50 FPN backbone on the COCO object detection minival set, outperforming the module-free ResNet-101 FPN counterpart, which gains +2.0% mAP at the cost of 19.0M extra parameters. The best resulting model achieves 45.7% mAP on the test-dev set. Code will be available.

1. INTRODUCTION

Object detection is one of the fundamental tasks in computer vision. It requires the detector to localize the objects in an image using bounding boxes and assign the correct category to each of them. In recent years, deep convolutional neural networks (CNNs) have seen great success in object detection. Detectors can be divided into two categories: two-stage detectors, e.g., Faster R-CNN (Ren et al., 2015), and one-stage detectors, e.g., SSD (Liu et al., 2016). Two-stage detectors have high localization and recognition accuracy, while one-stage detectors achieve high inference speed (Jiao et al., 2019). A typical two-stage detector consists of a backbone, a neck, a Region Proposal Network (RPN), and a Region of Interest (ROI) head (Chen et al., 2019). The backbone is a feature extractor usually pre-trained on the ImageNet dataset (Deng et al., 2009). The neck could be a Feature Pyramid Network (FPN) (Lin et al., 2017a) that fuses the features from multiple layers. The RPN proposes candidate object bounding boxes, and the ROI head performs box regression and classification (Ren et al., 2015). Compared to two-stage detectors, one-stage detectors predict bounding boxes directly from the input image without the region proposal step, and are thus more efficient (Jiao et al., 2019).

One of the key challenges in object detection is to solve its two subtasks, localization and classification, in a coordinated way. Localization requires the network to capture object positions accurately, while classification expects the network to extract the semantic information of the objects. Due to the layered structure of CNNs, detailed, position-accurate information resides in shallow but high-resolution layers, whereas high-level, semantically strong information resides in deep but low-resolution layers (Long et al., 2014). Another key challenge is scale invariance: the detector is expected to handle objects at different scales (Liu et al., 2016).
Feature fusion helps object detectors address both challenges. On one hand, through multi-layer fusion (Chen et al., 2020), detail and position information can be combined with semantic information when high- and low-resolution features from shallow and deep layers are fused. On the other hand, by fusing the results from different receptive fields (Yu & Koltun, 2016) or scales (Li et al., 2019) via dilated convolutions or different kernel sizes (Szegedy et al., 2014), objects can be detected at different scales, which improves the robustness of the model. In this paper, we present a Multi-Scale Fusion Module (MSFM) that extracts both detail and semantic information from a single input at different scales within the same layer. Specifically, the module input is resized into different scales, position and semantic information are processed at each scale, and the results are rescaled back and combined with the module input. The MSFM is lightweight and can be used as a drop-in layer in many existing object detection frameworks, complementing shallow and deep layers with semantic and position information, respectively.

2. RELATED WORK

2.1. MULTI-LAYER FEATURE FUSION

FPN (Lin et al., 2017a) is the de facto multi-layer feature fusion module in modern CNNs, compensating for the loss of position information in deep layers and the lack of semantic information in shallow layers. By upsampling deep features and fusing them with shallow features through a top-down path, it enables the model to coordinate this heterogeneous information and enhances robustness. NAS-FPN (Ghiasi et al., 2019) designs a NAS (Zoph & Le, 2017) search space that covers all possible cross-layer connections, the result of which is a laterally repeatable FPN structure sharing the same dimensions between its input and output. FPG (Chen et al., 2020) proposes a multi-pathway feature pyramid, representing the feature scale-space as a regular grid of parallel bottom-up pathways fused by multi-directional lateral connections. EfficientDet (Tan et al., 2020) adopts a weighted bi-directional feature pyramid network for multi-layer feature fusion. M2Det (Zhao et al., 2018) presents a multi-level feature pyramid network, fusing features with the same depth and dimension from multiple sequentially connected hourglass-like modules to generate multi-scale feature groups for prediction. Similar structures can also be seen in DSSD (Fu et al., 2017), TDM (Shrivastava et al., 2016), YOLOv3 (Redmon & Farhadi, 2018), and RefineDet (Zhang et al., 2017).
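The top-down fusion pathway described above can be sketched as a minimal PyTorch module (an illustration only, not any paper's exact implementation; the 1x1 lateral convolutions, 3x3 smoothing convolutions, nearest-neighbor upsampling, and the 256-channel width follow common FPN practice):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    """Minimal sketch of FPN-style top-down multi-layer feature fusion."""

    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        # 1x1 lateral convs project each backbone level to a common width
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 convs smooth each fused map
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1)
            for _ in in_channels)

    def forward(self, feats):  # feats ordered shallow -> deep
        laterals = [l(f) for l, f in zip(self.laterals, feats)]
        # upsample deep (semantic) maps and add them into shallow
        # (high-resolution) maps along the top-down path
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode='nearest')
        return [s(l) for s, l in zip(self.smooth, laterals)]
```

After fusion, every output level carries both high-resolution detail from the shallow features and semantic context propagated down from the deep ones.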

2.2. MULTI-BRANCH FEATURE FUSION

In Inception (Szegedy et al., 2014), the kernels on the branches of the Inception Module have different sizes, so the output of the module covers different receptive fields. However, a large kernel carries a large number of parameters. Dilated convolution instead enlarges a kernel's receptive field while keeping its parameter count unchanged. MCA (Yu & Koltun, 2016) utilizes dilated convolutions to systematically aggregate multi-scale contextual information. Going even further, TridentNet (Li et al., 2019) shares the same weights across multiple convolution branches with different dilation rates to obtain a uniform representational capability across scales.
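The TridentNet-style weight sharing can be sketched as follows (a simplified illustration, not the paper's exact design: one shared 3x3 weight tensor is applied with several dilation rates, and the rates chosen here are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedDilatedBranches(nn.Module):
    """One 3x3 kernel reused with multiple dilation rates.

    All branches share the same parameters but see different
    receptive fields, in the spirit of TridentNet.
    """

    def __init__(self, channels, dilations=(1, 2, 3)):
        super().__init__()
        self.dilations = dilations
        # a single shared weight/bias used by every branch
        self.weight = nn.Parameter(torch.empty(channels, channels, 3, 3))
        self.bias = nn.Parameter(torch.zeros(channels))
        nn.init.kaiming_normal_(self.weight)

    def forward(self, x):
        # padding = dilation keeps the spatial size unchanged for a
        # 3x3 kernel: out = in + 2d - d*(3-1) = in
        return [F.conv2d(x, self.weight, self.bias, padding=d, dilation=d)
                for d in self.dilations]
```

Because the weights are shared, adding branches costs no extra parameters; only the effective receptive field of each branch changes.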

3. MULTI-SCALE FUSION MODULE

In this section, we present our Multi-Scale Fusion Module (MSFM) and the possible configurations when inserting it into existing frameworks.

3.1. MODULE DEFINITION

An instantiation of MSFM is shown in Figure 1a. It can be formulated as follows:

$$M(x) = x + U\{C[F_1(S(x)), F_2(S(x)), \ldots, F_n(S(x))]\}$$

where $S$ denotes resizing the module input $x$ (each branch operating at a different scale), $F_i$ denotes the processing on the $i$-th branch, $C$ denotes the combination of the branch outputs, and $U$ denotes the rescaling of the combined features back to the resolution of $x$.
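The formulation above can be sketched as a PyTorch-style module (a minimal illustration under stated assumptions, not the paper's exact configuration: the per-branch transform $F_i$ is taken to be a 3x3 convolution, the scale factors are examples, and the combination $C$ is channel-wise concatenation followed by a 1x1 convolution, with each branch rescaled back to the input resolution before concatenation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSFM(nn.Module):
    """Minimal sketch of a Multi-Scale Fusion Module.

    Each branch resizes the input (S), processes it at that scale
    (F_i), and is rescaled back to the input resolution; the branches
    are then combined (C) and added to the module input as a residual.
    """

    def __init__(self, channels, scales=(0.5, 1.0, 2.0)):
        super().__init__()
        self.scales = scales
        # F_i: one lightweight transform per scale branch (assumed 3x3 conv)
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in scales)
        # C: fuse the concatenated branches back to the input width
        self.combine = nn.Conv2d(channels * len(scales), channels, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        outs = []
        for scale, f in zip(self.scales, self.branches):
            # S: resize the input to this branch's scale
            s = x if scale == 1.0 else F.interpolate(
                x, scale_factor=scale, mode='bilinear', align_corners=False)
            y = f(s)  # F_i: process at this scale
            # U: rescale back to the input resolution
            outs.append(F.interpolate(y, size=(h, w), mode='bilinear',
                                      align_corners=False))
        # residual fusion with the module input
        return x + self.combine(torch.cat(outs, dim=1))
```

The residual form $x + U\{\cdot\}$ means the module preserves its input shape, which is what allows it to be dropped into an existing backbone or neck without further changes.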



Experiments show that MSFM brings a +2.5% mAP improvement with only 2.4M extra parameters to Faster R-CNN with a ResNet-50 FPN backbone on the COCO object detection (Lin et al., 2014) minival set, outperforming the module-free ResNet-101 FPN counterpart, which gains +2.0% mAP at the cost of 19.0M extra parameters. When applied to other frameworks, it also yields about +2.0% mAP improvement, which shows its generalizability. The best resulting model achieves 45.7% mAP on the test-dev set.

