MSFM: MULTI-SCALE FUSION MODULE FOR OBJECT DETECTION

Abstract

Feature fusion benefits object detection in two ways. First, when high- and low-resolution features from shallow and deep layers are fused, detail and position information can be combined with semantic information. Second, objects can be detected at different scales, which improves the robustness of the framework. In this work, we present a Multi-Scale Fusion Module (MSFM) that extracts both detail and semantic information from a single input at different scales within the same layer. Specifically, the module input is resized into different scales, position and semantic information are processed at each scale, and the results are rescaled back and combined with the module input. The MSFM is lightweight and can be used as a drop-in layer in many existing object detection frameworks. Experiments show that MSFM brings a +2.5% mAP improvement with only 2.4M extra parameters on Faster R-CNN with a ResNet-50 FPN backbone on the COCO object detection minival set, outperforming the ResNet-101 FPN variant without the module, which obtains +2.0% mAP at the cost of 19.0M extra parameters. The best resulting model achieves 45.7% mAP on the test-dev set. Code will be made available.

1. INTRODUCTION

Object detection is one of the fundamental tasks in computer vision. It requires the detector to localize the objects in an image using bounding boxes and to assign the correct category to each of them. In recent years, deep convolutional neural networks (CNNs) have seen great success in object detection. Detectors can be divided into two categories: two-stage detectors, e.g., Faster R-CNN (Ren et al., 2015), and one-stage detectors, e.g., SSD (Liu et al., 2016). Two-stage detectors have high localization and recognition accuracy, while one-stage detectors achieve high inference speed (Jiao et al., 2019). A typical two-stage detector consists of a backbone, a neck, a Region Proposal Network (RPN), and a Region of Interest (ROI) head (Chen et al., 2019). The backbone is a feature extractor usually pre-trained on the ImageNet dataset (Deng et al., 2009). The neck could be a Feature Pyramid Network (FPN) (Lin et al., 2017a) that fuses features from multiple layers. The RPN proposes candidate object bounding boxes, and the ROI head performs box regression and classification (Ren et al., 2015). Compared to two-stage detectors, one-stage detectors predict bounding boxes directly from the input image without the region proposal step, and are thus more efficient (Jiao et al., 2019). One key challenge in object detection is to solve its two subtasks, localization and classification, in a coordinated way. Localization requires the network to capture object positions accurately, while classification expects the network to extract the semantic information of the objects. Due to the layered structure of CNNs, detailed and position-accurate information resides in shallow but high-resolution layers, whereas high-level and semantically strong information exists in deep but low-resolution layers (Long et al., 2014). Another key challenge is scale invariance: the detector is expected to handle objects at different scales (Liu et al., 2016).
Feature fusion helps object detectors address both challenges. On one hand, through multi-layer fusion (Chen et al., 2020), detail and position information can be combined with semantic information when high- and low-resolution features from shallow and deep layers are fused. On the other hand, by fusing the results from different receptive fields (Yu & Koltun, 2016) or scales (Li et al., 2019) via dilated convolutions or different kernel sizes (Szegedy et al., 2014), objects can be detected at different scales, which improves the robustness of the model. In this paper, we present a Multi-Scale Fusion Module (MSFM) that extracts both detail and semantic information from a single input at different scales within the same layer. Specifically, the module input is resized into different scales, position and semantic information are processed at each scale, and the results are rescaled back and combined with the module input. The MSFM is lightweight and can be used as a drop-in layer in many existing object detection frameworks, complementing shallow and deep layers with semantic and position information, respectively. Experiments show that MSFM brings a +2.5% mAP improvement with only 2.4M extra parameters on Faster R-CNN with a ResNet-50 FPN backbone on the COCO object detection (Lin et al., 2014) minival set, outperforming the ResNet-101 FPN variant without the module, which obtains +2.0% mAP with 19.0M extra parameters. When applied to other frameworks, it also shows about +2.0% mAP improvement, which demonstrates its generalizability. The best resulting model achieves 45.7% mAP on the test-dev set.

2. RELATED WORK

2.1. MULTI-LAYER FEATURE FUSION

FPN (Lin et al., 2017a) is the de facto multi-layer feature fusion module in modern CNNs, compensating for the loss of position information in deep layers and the lack of semantic information in shallow layers. By upsampling deep features and fusing them with shallow features through a top-down path, it enables the model to coordinate the heterogeneous information and enhances robustness. NAS-FPN (Ghiasi et al., 2019) designs a NAS (Zoph & Le, 2017) search space that covers all possible cross-layer connections; the result is a laterally repeatable FPN structure sharing the same dimensions between its input and output. FPG (Chen et al., 2020) proposes a multi-pathway feature pyramid, representing the feature scale-space as a regular grid of parallel bottom-up pathways fused by multi-directional lateral connections. EfficientDet (Tan et al., 2020) adopts a weighted bi-directional feature pyramid network for multi-layer feature fusion. M2Det (Zhao et al., 2018) presents a multi-level feature pyramid network, fusing features with the same depth and dimension from multiple sequentially connected hourglass-like modules to generate multi-scale feature groups for prediction. Similar structures can also be seen in DSSD (Fu et al., 2017), TDM (Shrivastava et al., 2016), YOLOv3 (Redmon & Farhadi, 2018), and RefineDet (Zhang et al., 2017).

2.2. MULTI-BRANCH FEATURE FUSION

In Inception (Szegedy et al., 2014), kernels on the branches of an Inception Module have different sizes, which makes the module output contain different receptive fields. However, a large kernel contains a large number of parameters. Dilated convolution instead enlarges a kernel's receptive field while keeping the parameter count unchanged. MCA (Yu & Koltun, 2016) utilizes dilated convolutions to systematically aggregate multi-scale contextual information. Going even further, TridentNet (Li et al., 2019) lets multiple convolutions share the same weights but with different dilation rates to explore a uniform representational capability.

3. MULTI-SCALE FUSION MODULE

In this section, we present our Multi-Scale Fusion Module (MSFM) and the possible configurations when inserting it into existing frameworks.

3.1. MODULE DEFINITION

An instantiation of MSFM is shown in Figure 1a. It can be formulated as follows:

M(x) = x + U{C[F_1(S(x)), F_2(S(x)), ..., F_n(S(x))]}

where x is the module input, M(x) is the module output, S(·) is the squeeze module that makes the input x thinner channel-wise, F_n(·) is the operation on the n-th branch, C(·) is the combination function, and U(·) is the unsqueeze module that restores the depth of the combined branch output to match x. The branch operation F_n(·) can be written as:

F_n(a) = R_n^{-1}(CGN_{n,i}(CGN_{n,i-1}(...(CGN_{n,1}(R_n(a))))))

where a = S(x) is the output of the squeeze module, R_n(·) is the resize function on the n-th branch, CGN_{n,i} is the i-th {Conv2D ⇒ GroupNormalization ⇒ NonLinearity} operation on the n-th branch, and R_n^{-1}(·) is the resize function that restores the feature dimensions (height and width). To keep the module lightweight, we adopt a bottleneck-like (He et al., 2015) structure in which the module input is first thinned channel-wise and then fed into the branches. Branch inputs are resized using bilinear interpolation, and the same method is used to resize the features back to their original size. All 3x3 convolutions on the branches use padding=1 to keep the spatial dimensions unchanged, and their output channel number equals their input channel number. We choose ReLU as the nonlinearity in the MSFM. By default, MSFM is inserted in stages 2, 3, and 4 of ResNet backbones (He et al., 2015).
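The formulation above can be sketched in PyTorch as follows. This is an illustrative reading of the module, not the authors' reference implementation; the class name `MSFM` and the constructor defaults (scales, squeeze ratio, group number) are our own choices mirroring the defaults described in the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MSFM(nn.Module):
    """Sketch of the Multi-Scale Fusion Module: squeeze -> per-scale
    branches of {Conv2D -> GroupNorm -> ReLU} -> combine -> unsqueeze,
    wrapped in a residual connection."""

    def __init__(self, channels, scales=(0.5, 0.7, 1.0), ratio=16, groups=16):
        super().__init__()
        squeezed = max(channels // ratio, groups)  # must stay divisible by groups
        self.squeeze = nn.Conv2d(channels, squeezed, kernel_size=1)  # S(.)
        self.scales = scales
        # one {Conv2D -> GroupNorm -> ReLU} block per branch (conv num = 1)
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(squeezed, squeezed, kernel_size=3, padding=1),
                nn.GroupNorm(groups, squeezed),
                nn.ReLU(inplace=True),
            )
            for _ in scales
        )
        self.unsqueeze = nn.Conv2d(squeezed, channels, kernel_size=1)  # U(.)

    def forward(self, x):
        a = self.squeeze(x)                      # thin the input channel-wise
        h, w = a.shape[2:]
        out = 0
        for scale, branch in zip(self.scales, self.branches):
            r = a if scale == 1.0 else F.interpolate(
                a, scale_factor=scale, mode="bilinear", align_corners=False)
            r = branch(r)                        # CGN block(s) on this branch
            if r.shape[2:] != (h, w):            # R^-1: restore spatial size
                r = F.interpolate(r, size=(h, w), mode="bilinear",
                                  align_corners=False)
            out = out + r                        # C(.): combine by addition
        return x + self.unsqueeze(out)           # residual output M(x)
```

Because every branch output is resized back to the input resolution and the unsqueeze convolution restores the channel count, the module preserves the input shape and can be dropped between existing layers.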

3.2. CONFIGURATIONS

MSFM acts as a drop-in layer to existing frameworks. To show several possible configurations when inserting it into an object detector, we take as an example inserting it into a ResNet backbone. A Residual Bottleneck (He et al., 2015) in ResNet (He et al., 2016) is shown in Figure 1b . Some tunable hyperparameters we can configure are listed in Table 1 . 

4. EXPERIMENTS

To evaluate the proposed module, we carry out experiments on object detection and instance segmentation tasks on COCO (Lin et al., 2014) . Experimental results demonstrate that the MSFM can enhance the performance of common two-stage object detection frameworks with very light computational overhead.

4.1. EXPERIMENTS SETUP

We perform hyperparameter tuning on Faster R-CNN with a ResNet-50 FPN backbone (Ren et al., 2015). Unless otherwise stated, the backbone of any framework mentioned is ResNet-50 FPN. To test the generalizability of MSFM, experiments are also conducted on Faster R-CNN with a ResNet-101 FPN backbone (Ren et al., 2015), Mask R-CNN (He et al., 2017), Cascade R-CNN (Cai & Vasconcelos, 2017), Grid R-CNN (Lu et al., 2018), Dynamic R-CNN (Zhang et al., 2020), RetinaNet (Lin et al., 2017b), RepPoints (Yang et al., 2019), and Faster R-CNN with ResNet-50 FPN and Deformable Convolution on c3-c5 (Dai et al., 2017). We carry out our experiments on object detection and instance segmentation tasks on COCO (Lin et al., 2014), whose train set contains 118k images, minival set 5k images, and test-dev set 20k images. Mean average precision (mAP) at different box and mask IoUs is adopted as the metric for object detection and instance segmentation. Our experiments are implemented with PyTorch (Paszke et al., 2019) and MMDetection (Chen et al., 2019). The input images are resized such that the shorter side is no longer than 800 pixels and the longer side is no longer than 1333 pixels. All models are trained on 8 GPUs with 2 images per GPU. The backbones of all models are pretrained on the ImageNet classification dataset (Deng et al., 2009). Unless otherwise stated, all models are trained for 12 epochs using SGD with a weight decay of 0.0001 and a momentum of 0.9. The learning rate is set to 0.02 initially and decays by a factor of 10 at the 8th and 11th epochs. Linear learning rate warmup is adopted for the first 500 steps with a warmup ratio of 0.001.
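The learning rate schedule above can be expressed as a small helper. The function name and the exact linear-warmup interpolation are our assumptions (epochs are treated as 0-indexed, so the decays at the "8th and 11th epochs" fire once `epoch >= 8` and `epoch >= 11`):

```python
def learning_rate(step, epoch, base_lr=0.02, warmup_steps=500,
                  warmup_ratio=0.001, decay_epochs=(8, 11), gamma=0.1):
    """Step-decay schedule with linear warmup.

    For the first `warmup_steps` iterations the rate ramps linearly from
    base_lr * warmup_ratio toward the scheduled rate; afterwards the rate
    is base_lr multiplied by `gamma` once per decay epoch already passed.
    """
    lr = base_lr
    for e in decay_epochs:          # decay by 10x at each milestone epoch
        if epoch >= e:
            lr *= gamma
    if step < warmup_steps:         # linear warmup over the first 500 steps
        start = base_lr * warmup_ratio
        lr = start + (lr - start) * step / warmup_steps
    return lr
```

For example, `learning_rate(0, 0)` gives the warmup floor of 0.00002, the rate reaches 0.02 once warmup ends, drops to 0.002 after the first milestone, and to 0.0002 after the second.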

4.2. ABLATION STUDIES

The ablation studies are performed on COCO 2017 (Lin et al., 2014) minival set. Unless otherwise stated, the MSFM in the following experiments has the default configuration: the insertion position is after conv3, the resize scales of three branches are 0.5, 0.7, and 1, respectively, the squeeze ratios are 16, 32, and 64 for stage 2, 3, and 4 of ResNet-50 (He et al., 2015) , respectively, the number of groups in Group Normalization (Wu & He, 2018) is 16, only one {Conv2D, Group Normalization, Nonlinearity} operation is adopted on all branches, and the method to combine the branch results is add.

4.2.1. SCALES

As can be seen from the Scales part of Table 2, small scales (3S=[0.5, 0.7, 1], 5S=[0.5, 0.6, 0.7, 0.85, 1]) are helpful for detecting large objects, while large scales (3L=[1, 1.4, 2]) enhance the detection of small objects. Compared to using only small or only large scales, compound scales (4=[0.5, 0.7, 1.4, 2], 5=[0.5, 0.7, 1, 1.4, 2]) turn out to be the optimal option, achieving better overall performance. This indicates that simultaneously generating and inserting detail and semantic information into the same layer is beneficial.
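As a concrete illustration of the compound setting, bilinear resizing at the five scales yields both downsampled (semantically stronger) and upsampled (more detailed) views of the same feature map. The helper below and its floor rounding are our assumptions, chosen to match the behavior of typical resize operators:

```python
def branch_sizes(h, w, scales):
    """Spatial sizes produced on each branch when an h x w feature map
    is resized by the given scale factors (floor rounding)."""
    return [(int(h * s), int(w * s)) for s in scales]

# Compound scales applied to a 50x50 feature map: two coarser views,
# the identity view, and two finer views of the same input.
sizes = branch_sizes(50, 50, [0.5, 0.7, 1, 1.4, 2])
print(sizes)  # [(25, 25), (35, 35), (50, 50), (70, 70), (100, 100)]
```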

4.2.2. RATIOS

We compare the effect of different squeeze ratios for different insertion positions, shown in the Ratios part of Table 2. For position=after conv3, increasing the ratios causes more information loss but less computational overhead; the ratios of 16, 32, and 64 for stages 2, 3, and 4, respectively, offer a good trade-off between the two. For position=after conv1 (norm group=8), MSFM is not sensitive to the change of ratios. We conjecture this is because the channel number is already so low after conv1 that reducing it further has no additional effect.
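One way to read these default ratios: assuming stages 2, 3, and 4 correspond to the ResNet-50 layers with 512, 1024, and 2048 output channels (this stage-to-channel mapping is our assumption), the ratios 16, 32, and 64 squeeze every stage to the same branch width:

```python
# Assumed output channels of the ResNet-50 stages where MSFM is inserted,
# and the paper's default squeeze ratios for those stages.
stage_channels = {2: 512, 3: 1024, 4: 2048}
ratios = {2: 16, 3: 32, 4: 64}

squeezed = {s: stage_channels[s] // ratios[s] for s in stage_channels}
print(squeezed)  # {2: 32, 3: 32, 4: 32} -- every stage squeezes to 32 channels
```

A constant squeezed width keeps the per-branch cost roughly uniform across stages, which is consistent with the trade-off the ratios are tuned for.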

4.2.3. NORM GROUP

We explore the optimal group number for Group Normalization (Wu & He, 2018) at different insertion positions. As can be seen from the Norm group part of Table 2, the best group numbers for after conv3, after conv2, and after conv1 are 32, 4, and 8, respectively. Because the channel number after conv3 is much larger than after conv1 and after conv2, its optimal group number is correspondingly larger.

4.2.4. CONV NUM

All the Conv num experiments in Table 2 are conducted with Norm group=32. 2* indicates that only the branches with scales larger than 1 have 2 {Conv2D, Group Normalization, Nonlinearity} operations. As we can see, the model with scale=[0.5, 0.7, 1, 1.4, 2] and conv num=2 achieves the best performance. Moreover, all models with conv num=2 achieve better or at least comparable performance to those with conv num=2*, which indicates that coordinated representational power among all branches is important, even though they do not share the same receptive field size.

4.2.5. FUSION TYPE

As two typical feature fusion operations, add and concatenation are the natural alternatives. We compare their effects for models with position=after conv1 and position=after conv3. The results in Table 2 show that concatenation is slightly better than add.

4.2.6. MULTI-POSITION INSERTION

According to the experimental results and analysis above, we carry out a multi-position insertion ablation study to see the effect of inserting MSFM at multiple positions. All models in this part share the following configuration: the resize scales of all branches are 0.5, 0.7, 1, 1.4, and 2, the squeeze ratios for stages 2, 3, and 4 are 16, 32, and 64, respectively, the number of {Conv2D, Group Normalization, Nonlinearity} operations on all branches is 2, and the combination method is add. The number of groups used in Group Normalization (Wu & He, 2018) is 8, 4, and 32 for after conv1, after conv2, and after conv3, respectively. As can be seen from the results in Table 4, the combination of after conv2 and after conv3 turns out to be the best configuration, which we will use as the default when applying the MSFM to other frameworks.

4.3. APPLICATION TO OTHER FRAMEWORKS

The results of applying the MSFM to other frameworks are shown in Table 4 and Table 5. For a fair comparison, all baseline models are re-trained. As we can see, there is a consistent improvement in every model when the MSFM is applied, which demonstrates that the MSFM can be used as a drop-in layer for many existing object detection frameworks. Notice that when MSFM is applied to Faster R-CNN with a ResNet-50 FPN backbone (Ren et al., 2015), its performance even surpasses that of the ResNet-101 FPN backbone. This indicates that adding the MSFM to existing frameworks is more efficient than simply adding more convolutional layers. We also train a Cascade R-CNN with a ResNet-101 FPN backbone for 24 epochs using multi-scale training and submit the results to the evaluation server. The result in Table 6 shows it achieves 45.7% mAP on the test-dev set.



Figure 1: MSFM and Residual Bottleneck. BN=Batch Normalization (Ioffe & Szegedy, 2015), N=NonLinearity, GN=Group Normalization (Wu & He, 2018), 1x1=1x1 Convolution, 3x3=3x3 Convolution with padding=1.

Table 1: Tunable hyperparameters when inserting into different positions.

Table 2: Ablation studies (AP, AP50, AP75, APs, APm, APl, #Param).

Multi-position insertion.

Multi-position insertion for object detection. * indicates with MSFM.

Multi-position insertion for instance segmentation. * indicates with MSFM.

Table 6: Result of Cascade R-CNN with ResNet-101 FPN backbone trained for 24 epochs with multi-scale training.

5. CONCLUSION

In this paper, we have presented a Multi-Scale Fusion Module (MSFM) that extracts both detail and semantic information from a single input at different scales within the same layer. Ablation studies have demonstrated that MSFM brings a +2.5% mAP improvement with only 2.4M extra parameters on Faster R-CNN with a ResNet-50 FPN backbone on the COCO object detection minival set, outperforming the ResNet-101 FPN variant without the module, which obtains +2.0% mAP with 19.0M extra parameters. The best resulting model, Cascade R-CNN with a ResNet-101 FPN backbone, achieves 45.7% mAP on the COCO object detection test-dev set.

