IMPROVE OBJECT DETECTION WITH FEATURE-BASED KNOWLEDGE DISTILLATION: TOWARDS ACCURATE AND EFFICIENT DETECTORS

Abstract

Knowledge distillation, in which a student model is trained to mimic a teacher model, has proven to be an effective technique for model compression and accuracy boosting. However, most knowledge distillation methods are designed for image classification and fail on more challenging tasks such as object detection. In this paper, we suggest that the failure of knowledge distillation on object detection is mainly caused by two factors: (1) the imbalance between foreground and background pixels and (2) the lack of distillation on the relation between different pixels. Based on these observations, we propose attention-guided distillation and non-local distillation to address the two problems, respectively. Attention-guided distillation uses an attention mechanism to find the crucial pixels of foreground objects and makes the student devote more effort to learning their features. Non-local distillation enables the student to learn not only the features of individual pixels but also the relation between different pixels captured by non-local modules. Experiments show that our methods achieve significant AP improvements on one-stage and two-stage detectors, both anchor-based and anchor-free. For example, Faster RCNN (ResNet101 backbone) with our distillation achieves 43.9 AP on COCO2017, which is 4.1 higher than the baseline. Code has been released on GitHub.

1. INTRODUCTION

Recently, remarkable breakthroughs have been achieved in various domains thanks to the success of deep learning (Ronneberger et al., 2015; Devlin et al., 2018; Ren et al., 2015). However, the most advanced deep neural networks usually consume a large amount of computation and memory, which limits their deployment on edge devices such as self-driving cars and mobile phones. To address this problem, many techniques have been proposed, including pruning (Han et al., 2016; Zhang et al., 2018; Liu et al., 2018; Frankle & Carbin, 2018), quantization (Nagel et al., 2019; Zhou et al., 2017), compact model design (Sandler et al., 2018; Howard et al., 2019; Ma et al., 2018; Iandola et al., 2016) and knowledge distillation (Hinton et al., 2014; Buciluǎ et al., 2006). Knowledge distillation, also known as teacher-student learning, aims to transfer the knowledge of an over-parameterized teacher to a lightweight student. Since the student is trained to mimic the logits or features of the teacher, it can inherit the dark knowledge of the teacher and thus often achieves much higher accuracy. Due to its simplicity and effectiveness, knowledge distillation has become a popular technique for both model compression and accuracy boosting.

As one of the most crucial challenges in computer vision, object detection urgently requires models that are both accurate and efficient. Unfortunately, most existing knowledge distillation methods in computer vision are designed for image classification and usually lead to only trivial improvements on object detection (Li et al., 2017). In this paper, we attribute the failure of knowledge distillation on object detection to the following two issues, each of which we address in turn.

Imbalance between foreground and background. In an image to be detected, the background pixels usually far outnumber the pixels of the foreground objects.
However, in previous knowledge distillation methods, the student is trained to mimic the features of all pixels with the same priority. As a result, the student spends most of its effort on learning the features of background pixels, which suppresses its learning of the features of foreground objects. Since foreground pixels are more crucial in detection, this imbalance severely hurts the performance of knowledge distillation. To overcome this obstacle, we propose attention-guided distillation, which concentrates distillation on the crucial foreground pixels. Since the attention map reflects the position of the important pixels (Zhou et al., 2016), we adopt it as the mask for knowledge distillation. Concretely, a pixel with a higher attention value is regarded as a pixel of a foreground object and is learned by the student model with a higher priority. Compared with the previous binary mask method (Wang et al., 2019), the mask generated from attention maps in our methods is more fine-grained and requires no additional supervision. Compared with previous attention-based distillation methods (Zagoruyko & Komodakis, 2017), the attention map in our methods is used not only as the information to be distilled but also as the mask signal for feature distillation.

Lack of distillation on relation information. It is generally acknowledged that the relation between different objects contains valuable information in object detection. Recently, many researchers have improved the performance of detectors by enabling them to capture and exploit these relations, for example with non-local modules (Wang et al., 2018) and relation networks (Hu et al., 2018). However, existing knowledge distillation methods for object detection only distill the information of individual pixels and ignore the relation between different pixels.
To solve this issue, we propose non-local distillation, which captures the relation information of students and teachers with non-local modules and then distills it from teachers to students.

Since the non-local modules and attention mechanism in our methods are only required during training, our methods introduce no additional computation or parameters at inference. Besides, our methods are feature-based distillation methods that do not depend on a specific detection algorithm, so they can be applied directly to all kinds of detectors without any modification. On MS COCO2017, average improvements of 2.9, 2.9 and 2.2 AP are observed on two-stage, one-stage, and anchor-free models, respectively. Experiments on Mask RCNN show that our methods also improve the performance of instance segmentation by 2.0 AP on average. We have conducted a detailed ablation study and sensitivity study to show the effectiveness and stability of each distillation loss. Moreover, we study the relation between teachers and students on object detection and find that knowledge distillation on object detection requires a high-AP teacher, which differs from the conclusion in image classification, where a teacher with very high accuracy may harm the performance of students (Mirzadeh et al., 2019; Cho & Hariharan, 2019). We hope that these results encourage further study of knowledge distillation on tasks other than image classification.

To sum up, the contributions of this paper are as follows.

• We propose attention-guided distillation, which emphasizes students' learning on the foreground objects and suppresses students' learning on the background pixels.
• We propose non-local distillation, which enables the students to learn from teachers not only the information of individual pixels but also the relation between different pixels.
• We show that a teacher with higher AP is usually a better teacher in knowledge distillation on object detection, which is different from the conclusion in image classification.

2. RELATED WORK

As an effective method for model compression and accuracy boosting, knowledge distillation has been widely used in various domains and tasks, including image classification (Hinton et al., 2014; Romero et al., 2015; Zagoruyko & Komodakis, 2017), object detection (Chen et al., 2017; Li et al., 2017; Wang et al., 2019; Bajestani & Yang, 2020), semantic segmentation (Liu et al., 2019), face recognition (Ge et al., 2018), pre-trained language models (Sanh et al., 2019; Xu et al., 2020), multi-exit network training (Zhang et al., 2019b; a), model robustness (Zhang et al., 2020b) and so on. Hinton et al. (2014) first proposed the concept of knowledge distillation, where the students are trained to mimic the outputs of the teachers' softmax layers. Since then, many methods have been proposed to transfer the knowledge in the teacher's features (Romero et al., 2015) or their variants, such as attention (Zagoruyko & Komodakis, 2017; Hou et al., 2019), FSP (Yim et al., 2017), mutual information (Ahn et al., 2019), positive features (Heo et al., 2019), and the relation of samples in a batch (Park et al., 2019; Tung & Mori, 2019).

Improving the performance of object detection has recently become an active topic in knowledge distillation. Chen et al. (2017) designed the first knowledge distillation method for object detection, which includes distillation losses on the backbone, the classification head and the regression head. Subsequently, many researchers found that the imbalance between foreground objects and background is a crucial problem in detection distillation. Instead of distilling the whole features of backbone networks, Li et al. (2017) only apply the $L_2$ distillation loss to the features sampled by RPN. Bajestani & Yang (2020) propose temporal knowledge distillation, which introduces a hyper-parameter to balance the distillation loss between the pixels of the foreground and background. Wang et al. (2019) propose fine-grained feature imitation, which only distills the features near object anchor locations. However, although these works try to distill only the pixels of foreground objects, they rely on ground-truth annotations, anchors, and bounding boxes and thus cannot be transferred to different kinds of detectors and tasks. In contrast, in our method, the pixels of foreground objects are found with an attention mechanism, which can be easily generated from features. As a result, it can be directly used in all kinds of detectors without any modification.

As shown in Figure 3, the differences between the previous mask-based detection distillation method (Wang et al., 2019) and our attention-guided distillation can be summarized as follows: (i) Our methods generate the mask with an attention mechanism, while they generate the mask with ground-truth bounding boxes and anchor priors. (ii) The mask in our methods is a pixel-wise, fine-grained mask, while the mask in their method is an object-wise, binary mask. (iii) The masks in our methods are composed of a spatial mask and a channel mask, while they only have a spatial mask. A more detailed comparison with related work can be found in Appendix E.

3.1. ATTENTION-GUIDED DISTILLATION

We use $A \in \mathbb{R}^{C,H,W}$ to denote the feature of the backbone in an object detection model, where $C$, $H$, $W$ denote its channel number, height and width, respectively. Then, the generation of the spatial attention map and the channel attention map is equivalent to finding the mapping functions $G^s: \mathbb{R}^{C,H,W} \rightarrow \mathbb{R}^{H,W}$ and $G^c: \mathbb{R}^{C,H,W} \rightarrow \mathbb{R}^{C}$, respectively. Note that the superscripts $s$ and $c$ here are used to distinguish 'spatial' and 'channel'. Since the absolute value of each element in the feature implies its importance, we construct $G^s$ by averaging the absolute values across the channel dimension and construct $G^c$ by averaging the absolute values across the width and height dimensions, which can be formulated as

$$G^c(A) = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} |A_{\cdot,i,j}| \quad \text{and} \quad G^s(A) = \frac{1}{C} \sum_{k=1}^{C} |A_{k,\cdot,\cdot}|,$$

where $i$, $j$, $k$ denote the $i$-th, $j$-th, $k$-th slice of $A$ in the height, width, and channel dimension, respectively. Then, the spatial attention mask $M^s$ and the channel attention mask $M^c$ used in attention-guided distillation can be obtained by summing the attention maps from the teacher and the student detector, which can be formulated as

$$M^s = HW \cdot \mathrm{softmax}\big((G^s(A^S) + G^s(A^T))/T\big), \quad M^c = C \cdot \mathrm{softmax}\big((G^c(A^S) + G^c(A^T))/T\big).$$

Note that the superscripts $S$ and $T$ here are used to distinguish students and teachers, while the non-superscript $T$ is the softmax temperature, a hyper-parameter introduced by Hinton et al. to adjust the distribution of elements in attention masks (see Figure 4).

The attention-guided distillation loss $L_{AGD}$ is composed of two components: the attention transfer loss $L_{AT}$ and the attention-masked loss $L_{AM}$. $L_{AT}$ encourages the student model to mimic the spatial and channel attention of the teacher model, which can be formulated as

$$L_{AT} = L_2(G^s(A^S), G^s(A^T)) + L_2(G^c(A^S), G^c(A^T)). \quad (1)$$

$L_{AM}$ encourages the student to mimic the features of the teacher model through an $L_2$ norm loss masked by $M^s$ and $M^c$, which can be formulated as

$$L_{AM} = \left( \sum_{k=1}^{C} \sum_{i=1}^{H} \sum_{j=1}^{W} (A^T_{k,i,j} - A^S_{k,i,j})^2 \cdot M^s_{i,j} \cdot M^c_k \right)^{\frac{1}{2}}. \quad (2)$$
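The attention maps, masks, and the two losses of Eq. (1) and Eq. (2) can be sketched in a few lines. The following is a minimal illustration in plain Python, using nested lists indexed as `A[k][i][j]` so that the indices mirror the formulas; it is not the released implementation. The function names, the toy features `A_s` and `A_t`, and the reading of $L_2(\cdot,\cdot)$ as a Euclidean distance are assumptions made for this sketch.

```python
import math


def spatial_attention(A):
    """G^s(A): mean of |A| over the channel axis, giving an H x W map."""
    C, H, W = len(A), len(A[0]), len(A[0][0])
    return [[sum(abs(A[k][i][j]) for k in range(C)) / C for j in range(W)]
            for i in range(H)]


def channel_attention(A):
    """G^c(A): mean of |A| over height and width, giving a length-C vector."""
    C, H, W = len(A), len(A[0]), len(A[0][0])
    return [sum(abs(A[k][i][j]) for i in range(H) for j in range(W)) / (H * W)
            for k in range(C)]


def softmax(values, temperature):
    """Temperature softmax over a flat list (shifted by the max for stability)."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def attention_masks(A_s, A_t, temperature):
    """M^s (H x W) and M^c (length C) from summed student and teacher attention."""
    C, H, W = len(A_s), len(A_s[0]), len(A_s[0][0])
    gs_s, gs_t = spatial_attention(A_s), spatial_attention(A_t)
    flat = [gs_s[i][j] + gs_t[i][j] for i in range(H) for j in range(W)]
    probs = softmax(flat, temperature)
    M_s = [[H * W * probs[i * W + j] for j in range(W)] for i in range(H)]
    gc = [a + b for a, b in zip(channel_attention(A_s), channel_attention(A_t))]
    M_c = [C * p for p in softmax(gc, temperature)]
    return M_s, M_c


def flatten(matrix):
    """Flatten a 2-D nested list row by row."""
    return [v for row in matrix for v in row]


def l2(x, y):
    """Euclidean distance between two flat lists (assumed form of L_2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def attention_transfer_loss(A_s, A_t):
    """L_AT of Eq. (1): spatial term plus channel term."""
    spatial = l2(flatten(spatial_attention(A_s)), flatten(spatial_attention(A_t)))
    channel = l2(channel_attention(A_s), channel_attention(A_t))
    return spatial + channel


def attention_masked_loss(A_s, A_t, M_s, M_c):
    """L_AM of Eq. (2): masked squared differences, then a square root."""
    C, H, W = len(A_s), len(A_s[0]), len(A_s[0][0])
    total = sum((A_t[k][i][j] - A_s[k][i][j]) ** 2 * M_s[i][j] * M_c[k]
                for k in range(C) for i in range(H) for j in range(W))
    return math.sqrt(total)


# Hypothetical toy features with C = 2, H = 2, W = 2.
A_t = [[[0.0, 2.0], [0.0, 4.0]], [[0.0, -2.0], [1.0, 0.0]]]
A_s = [[[0.0, 1.0], [0.0, 1.0]], [[0.0, -1.0], [0.0, 0.0]]]
M_s, M_c = attention_masks(A_s, A_t, temperature=0.5)
loss_at = attention_transfer_loss(A_s, A_t)
loss_am = attention_masked_loss(A_s, A_t, M_s, M_c)
```

Because the softmax output sums to one, the entries of $M^s$ sum to $HW$ and those of $M^c$ sum to $C$, so both masks have a mean value of 1. With a very large temperature the masks become uniform and $L_{AM}$ reduces to an unmasked $L_2$ distance between the features; a smaller temperature concentrates the weight on the positions and channels with high attention.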

3.2. NON-LOCAL DISTILLATION

The non-local module (Wang et al., 2018) is an effective method to improve the performance of neural networks by capturing global relation information. In this paper, we apply non-local modules to capture the relation between pixels in an image, which can be formulated as

$$r_{i,j} = \frac{1}{WH} \sum_{i'=1}^{H} \sum_{j'=1}^{W} f(A_{\cdot,i,j}, A_{\cdot,i',j'})\, g(A_{\cdot,i',j'}),$$

where $r$ denotes the obtained relation information, $i, j$ are the spatial indexes of the output position whose response is to be computed, and $i', j'$ are the spatial indexes that enumerate all possible positions. $f$ is a pairwise function for computing the relation of two pixels and $g$ is a unary function for computing the representation of an individual pixel. We can now introduce the proposed non-local distillation loss $L_{NLD}$ as the $L_2$ loss between the relation information of the students and teachers, which can be formulated as

$$L_{NLD} = L_2(r^S, r^T).$$
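To make the relation formula concrete, the sketch below evaluates it directly with nested loops in plain Python. It is an illustration under stated assumptions rather than the authors' module: the pairwise function $f$ is taken to be a plain dot product and the unary function $g$ the identity, whereas practical non-local modules use learned 1x1 convolutions. The helper names and toy inputs are hypothetical, and $L_2$ is again read as a Euclidean distance.

```python
import math


def pixel(A, i, j):
    """Channel vector A[:, i, j] of a C x H x W feature stored as nested lists."""
    return [A[k][i][j] for k in range(len(A))]


def dot(x, y):
    """Assumed pairwise function f: dot product of two pixel vectors."""
    return sum(a * b for a, b in zip(x, y))


def identity(v):
    """Assumed unary function g: return the pixel vector unchanged."""
    return v


def relation(A, f=dot, g=identity):
    """r[i][j] = 1/(WH) * sum over (i2, j2) of f(A[:,i,j], A[:,i2,j2]) * g(A[:,i2,j2]).

    The result is laid out as H x W x C, i.e. r[i][j] is a vector.
    """
    H, W = len(A[0]), len(A[0][0])
    r = []
    for i in range(H):
        row = []
        for j in range(W):
            x = pixel(A, i, j)
            acc = None
            for i2 in range(H):
                for j2 in range(W):
                    y = pixel(A, i2, j2)
                    weight = f(x, y)
                    contrib = [weight * v for v in g(y)]
                    acc = contrib if acc is None else [a + c for a, c in zip(acc, contrib)]
            row.append([a / (W * H) for a in acc])
        r.append(row)
    return r


def non_local_distillation_loss(A_s, A_t):
    """L_NLD: Euclidean distance between student and teacher relation maps."""
    r_s, r_t = relation(A_s), relation(A_t)
    squared = sum((a - b) ** 2
                  for row_s, row_t in zip(r_s, r_t)
                  for vec_s, vec_t in zip(row_s, row_t)
                  for a, b in zip(vec_s, vec_t))
    return math.sqrt(squared)


# Hypothetical toy features with C = 1, H = 1, W = 2.
A_s = [[[1.0, 2.0]]]
A_t = [[[2.0, 4.0]]]
r_s = relation(A_s)
loss_nld = non_local_distillation_loss(A_s, A_t)
```

Each output vector depends on every position in the feature map, which is why this term carries relation information that a per-pixel feature loss does not. The direct evaluation costs on the order of $(HW)^2$ pairwise terms, so a real implementation would compute it with batched matrix products.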

3.3. OVERALL LOSS FUNCTION

We introduce three hyper-parameters $\alpha$, $\beta$, $\gamma$ to balance the different distillation losses in our methods. The overall distillation loss can be formulated as

$$L_{Distill}(A^T, A^S) = \underbrace{\alpha \cdot L_{AT} + \beta \cdot L_{AM}}_{\text{Attention-guided distillation}} + \gamma \cdot L_{NLD}.$$
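As a small illustration of how the three terms are combined, the sketch below computes the weighted sum in plain Python. The function name and the dictionaries are hypothetical; the weight values are the ones reported in Section 4.1 for two-stage and one-stage models, and the unit loss values in the example are placeholders rather than measured losses.

```python
def distillation_loss(l_at, l_am, l_nld, alpha, beta, gamma):
    """Weighted sum of the three distillation terms."""
    attention_guided = alpha * l_at + beta * l_am  # attention-guided distillation
    non_local = gamma * l_nld                      # non-local distillation
    return attention_guided + non_local


# Loss weights reported in Section 4.1 (the temperature T is used elsewhere,
# inside the attention masks, and is therefore not part of this sum).
TWO_STAGE = {"alpha": 7e-5, "beta": 4e-3, "gamma": 7e-5}
ONE_STAGE = {"alpha": 4e-4, "beta": 2e-2, "gamma": 4e-4}

# Placeholder loss values, only to show how the weights scale each term.
example = distillation_loss(1.0, 1.0, 1.0, **TWO_STAGE)
```

With these settings $\beta$ is roughly 50 times larger than $\alpha$ and $\gamma$, so for loss terms of comparable magnitude the attention-masked feature loss would dominate the sum.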

4.1. EXPERIMENTS SETTINGS

The proposed knowledge distillation method is evaluated on MS COCO2017, a large-scale dataset that contains over 120k images spanning 80 categories (Lin et al., 2014). The benchmark detection networks comprise two-stage detection models, including Faster RCNN (Ren et al., 2015), Cascade RCNN (Cai & Vasconcelos, 2019), Dynamic RCNN (Zhang et al., 2020a) and Grid RCNN (Lu et al., 2019), and one-stage detection models, including RetinaNet (Lin et al., 2017) and FSAF RetinaNet (Zhu et al., 2019). Besides, we also evaluate our methods on Mask RCNN (He et al., 2017), Cascade Mask RCNN (Cai & Vasconcelos, 2019), and the anchor-free model RepPoints (Yang et al., 2019). We adopt ResNet50 and ResNet101 (He et al., 2016) as the backbone networks of each detection model. We pre-train the backbone model on ImageNet (Deng et al., 2009) and then fine-tune it on MS COCO2017. We compare our methods with three object detection knowledge distillation methods (Chen et al., 2017; Wang et al., 2019; Heo et al., 2019). All the experiments in this paper are implemented with PyTorch (Paszke et al., 2019) and the mmdetection2 framework (Chen et al., 2019). The reported FPS is measured on one RTX 2080Ti GPU. We adopt the same hyper-parameter settings $\{\alpha = \gamma = 7 \times 10^{-5}, \beta = 4 \times 10^{-3}, T = 0.1\}$ for all the two-stage models and $\{\alpha = \gamma = 4 \times 10^{-4}, \beta = 2 \times 10^{-2}, T = 0.5\}$ for all the one-stage models. Cascade Mask RCNN with a ResNeXt101 backbone is used as the teacher for all the two-stage students, and RetinaNet with a ResNeXt101 backbone is used as the teacher for all the one-stage students. Please refer to the code on GitHub for more details.

4.2. EXPERIMENT RESULTS

In this section, we show the experiment results of the baseline detectors and our models in Table 1 and Table 2, and compare our methods with three other knowledge distillation methods in Table 3. It is observed that: (i) A consistent and significant AP boost can be observed on all 9 kinds of detectors. On average, there are 2.9, 2.9, and 2.2 AP improvements on the two-stage, one-stage, and anchor-free detectors, respectively. (ii) With the proposed method, a student model with a ResNet50 backbone can outperform the same model with a ResNet101 backbone by 1.2 AP on average. (iii) On Mask RCNN related models, there are average improvements of 2.3 in bounding box AP and 2.0 in mask AP, indicating that the proposed method can be used not only in object detection but also in instance segmentation. (iv) Our methods achieve 2.2 higher AP than the second-best distillation method on average. (v) There are 2.7 and 2.9 AP improvements on models with ResNet50 and ResNet101 backbones, respectively, indicating that deeper detectors benefit more from knowledge distillation.

Table 4 shows the ablation study of the proposed attention-guided distillation ($L_{AT}$ and $L_{AM}$) and non-local distillation ($L_{NLD}$). It is observed that: (i) Attention-guided distillation and non-local distillation lead to 2.8 and 1.4 AP improvements, respectively. (ii) $L_{AT}$ and $L_{AM}$ lead to 1.2 and 2.4 AP improvements, respectively, indicating that most of the benefit of attention-guided distillation comes from the feature loss masked by the attention maps ($L_{AM}$). (iii) The combination of attention-guided distillation and non-local distillation leads to a 3.1 AP improvement. These observations indicate that each distillation loss in our methods is effective on its own and that the losses can be combined to achieve better performance. We also give an ablation study on the spatial and channel attention in Appendix A.

Sensitivity study on hyper-parameters. Four hyper-parameters are introduced in this paper. $\alpha$, $\beta$, and $\gamma$ balance the magnitude of the different distillation losses and $T$ adjusts the distribution of the attention masks. The hyper-parameter sensitivity study on MS COCO2017 with Faster RCNN (ResNet50 backbone) is shown in Figure 5. It is observed that the worst hyper-parameters only lead to a 0.3 AP drop compared with the highest AP, which is still 2.9 higher than the baseline model, indicating that our methods are not sensitive to the choice of hyper-parameters.

Sensitivity study on the types of non-local modules. There are four kinds of non-local modules: Gaussian, embedded Gaussian, dot product, and concatenation. Table 5 shows the performance of our methods with different types of non-local modules. It is observed that the worst non-local type (Gaussian) is only 0.2 AP lower than the best non-local types (embedded Gaussian and concatenation), indicating that our methods are not sensitive to the choice of non-local modules.

Figure 6 shows the comparison of detection results between a baseline and a distilled detector. It is observed that: (i) Our methods improve the detection ability on small objects. In the first three figures, the distilled model correctly detects the cars, the handbag, and the person in the car, respectively. (ii) Our methods prevent models from generating multiple bounding boxes for the same object. In the last two figures, the baseline model generates multiple bounding boxes for the boat and the train, while the distilled model avoids these errors.

Analysis on the types of detection error. We have analyzed the different types of detection errors in the baseline and distilled models in Figure 7. The number in the legend indicates the AUC (area under the curve). It is observed that our distillation method reduces all kinds of error. In brief, our methods improve both localization and classification ability.

5.2. RELATION BETWEEN STUDENT DETECTORS AND TEACHER DETECTORS

A substantial body of research has focused on the relation between students and teachers. Mirzadeh et al. (2019) and Cho & Hariharan (2019) show that a teacher with higher accuracy is not necessarily a better teacher for knowledge distillation, and that a teacher with too high accuracy may even harm the performance of students. Besides, Mobahi et al. (2020) and Yuan et al. (2019) show that the same model, or even a model with lower accuracy than the student, can be used as the teacher for knowledge distillation. However, all of these experiments are conducted on image classification. In this section, we study whether these observations still hold in the task of object detection. As shown in Figure 8, we conduct experiments on Faster RCNN (ResNet50 backbone) and Cascade RCNN (ResNet50 backbone) students with teacher models of different AP. It is observed that: (i) In all of our experiments, the student with a higher-AP teacher always achieves higher AP. (ii) When the teacher has lower or the same AP as the student, knowledge distillation brings very limited or even negative improvements. These observations indicate that the relation between students and teachers on object detection is opposite to that on image classification. Our experiment results suggest that there is a strong positive correlation between the AP of students and teachers: a high-AP teacher tends to improve the performance of students significantly.

We think that the reason why a high-AP teacher model is crucial in object detection but less necessary in image classification is that object detection is a more challenging task. As a result, a weaker teacher model may have a more negative influence on students, which prevents them from achieving higher AP. In contrast, on image classification, most teacher models achieve very high training accuracy, so they do not introduce as much error.

6. CONCLUSION

In this paper, we have proposed two knowledge distillation methods, attention-guided distillation and non-local distillation, to improve the performance of object detection models. Attention-guided distillation finds the crucial pixels and channels in the whole feature map with an attention mechanism and then enables the student to focus on these crucial pixels and channels instead of the whole feature map. Non-local distillation enables students to learn not only the information of an individual pixel but also the relation between different pixels captured by the non-local modules. Experiments on 9 kinds of models, including two-stage, one-stage, anchor-free and anchor-based models, have been provided to evaluate our methods. Besides, we have also studied the relation between students and teachers in object detection. Our experiments show that there is a strong positive correlation between the AP of teachers and students, so a high-AP teacher detector plays an essential role in knowledge distillation. This observation is quite different from the previous conclusion in image classification, where a teacher model with very high accuracy may harm the performance of knowledge distillation. We hope that our results encourage a rethinking of knowledge distillation on tasks other than image classification.

7. ACKNOWLEDGEMENT

This work was partially supported by Institute for interdisciplinary Information Core Technology. 

B ADAPTATION LAYERS IN KNOWLEDGE DISTILLATION

Adaptation layers in knowledge distillation were first proposed by Romero et al. (2015) to match the feature sizes of students and teachers. Later research found that adaptation layers play an important role in improving the performance of students (Chen et al., 2017). In this paper, we adopt different kinds of adaptation layers for different distillation losses. Concretely, we adopt 1x1 convolutional layers for $L_{AM}$ and $L_{NLD}$, 3x3 convolutional layers for $L_{AT}^{spatial}$, and fully connected layers for $L_{AT}^{channel}$. Note that the adaptation layers are only used during training, so they introduce no additional computation or parameters at inference.
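A 1x1 convolution is a per-pixel linear map across channels, which is what lets it change the channel count of a feature without touching its spatial layout. The sketch below shows this in plain Python. It assumes, for illustration only, that the layer is applied to the student feature to match the teacher's channel number; the function name, the fixed weights, and the toy feature are hypothetical, and in practice the weights are learned during training.

```python
def conv1x1(A, weight, bias=None):
    """1x1 convolution: out[o][i][j] = sum_k weight[o][k] * A[k][i][j] + bias[o].

    A is a C_in x H x W feature as nested lists; weight is C_out x C_in.
    """
    C_in, H, W = len(A), len(A[0]), len(A[0][0])
    C_out = len(weight)
    if bias is None:
        bias = [0.0] * C_out
    return [[[sum(weight[o][k] * A[k][i][j] for k in range(C_in)) + bias[o]
              for j in range(W)]
             for i in range(H)]
            for o in range(C_out)]


# Hypothetical student feature with 1 channel, adapted to 2 channels.
student_feature = [[[1.0, 2.0], [3.0, 4.0]]]
weight = [[1.0], [2.0]]  # fixed here for illustration; learned in practice
adapted = conv1x1(student_feature, weight)
```

The adapted feature keeps the same height and width as the input, so it can be compared position by position with a teacher feature of the same spatial resolution.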

C EXPERIMENTS ON SMALLER BACKBONES

Following the insightful comments of the reviewers, we conduct a series of experiments on models with small backbones, including ResNet18 and RegNet-800M, and on compact detectors, including YOLOv3 and SSD. As shown in Table 7, our methods also achieve significant AP improvements on these compact models. Note that more experiments with small backbones will be added in the camera-ready version.

E COMPARISON WITH RELATED WORK

Comparison on Methodology and Application. Feature distillation is used in all five methods. However, Chen et al. distill not only the features but also the classification logits and bounding box regression results, which limits the application of their method to one-stage and anchor-free models. Li et al. and Wang et al. distill the features in the regions of proposals and near object anchor locations, respectively. As a result, their methods rely on the supervision of anchors and ground truths and cannot be used in one-stage and anchor-free models. The method of Bajestani & Yang is designed for active perception on video and cannot be used in image-based detection. In contrast, in our method, the attention mask and relation information can be easily generated from the backbone features, with no requirement for ground truths, anchors or proposals. As a result, it can be easily used in different kinds of models and tasks without any modification.

Comparison on Motivation. Chen et al.'s method is a direct application of knowledge distillation to object detection. The other three methods and our method are motivated by the imbalance between foreground and background pixels, and these methods try to address this issue by reweighting the distillation loss. Besides, our method is also motivated by the effect of the relation among pixels in an image, which is ignored by the other methods.



Figure 1: Results overview.

Figure 2: Details of our methods: (a) Attention-guided distillation generates the spatial and channel attention with average pooling in the channel and spatial dimension, respectively. Then, students are encouraged to mimic the attention of teachers. Besides, students are also trained to mimic the features of teachers, which are masked by the attention of both students and teachers. (b) Non-local distillation captures the relation of pixels in an image with non-local modules. The relation information of teachers is learned by students with an $L_2$ norm loss. (c) The architecture of non-local modules. '1x1' is a convolution layer with a 1x1 kernel. (d) Distillation loss is applied to backbone features with different resolutions. The detection head and neck are not involved in our methods.

Figure 3: Comparison between the proposed attention-guided distillation and other methods.

Figure 4: Visualization and distribution of the spatial attention with different T . With a smaller T , the pixels of high attention values are emphasized more in knowledge distillation.

Figure 5: Hyper-parameter sensitivity study of α, β, γ, T with Faster RCNN on MS COCO2017.

Figure 7: Distribution of error types on distilled and baseline Faster RCNN. Loc: localization error; Sim & Oth: classification error on similar and dissimilar classes; BG: false positive prediction fired on background; FN: false negative prediction.

Experiments on MS COCO2017 with the proposed distillation method. Columns: Model, Backbone, AP, $AP_{50}$, $AP_{75}$, $AP_S$, $AP_M$, $AP_L$, FPS, Params.

Experiments on MS COCO2017 with the proposed distillation method on Mask RCNN.

Comparison between our methods and other distillation methods. Note that we don't compare our methods with Chen's and Wang's methods on RetinaNet because their methods cannot be used in one-stage models. ResNet50 is used as the backbone in these models. Columns: Model, AP, $AP_{50}$, $AP_{75}$, $AP_S$, $AP_M$, $AP_L$.

Ablation study of the three distillation loss.

Ablation study on the spatial attention and channel attention.

Experiments on MS COCO2017 with our method on small backbones. Columns: Backbone, AP, $AP_{50}$, $AP_{75}$, $AP_S$, $AP_M$, $AP_L$, FPS, Params.

Experiments on Cityscapes with our method. Columns: Model, Backbone, Box AP, Mask AP.

Comparison on methodology and application. Columns: Method, Feature, Classify, Regress, Relation, One-Stage, Two-Stage, Anchor-based, Anchor-free, Segment.

A ABLATION STUDY ON THE SPATIAL AND CHANNEL ATTENTION

Different from previous attention-based knowledge distillation methods, the attention-guided distillation in our methods uses not only spatial attention but also channel attention. In this appendix, we conduct an ablation study on the two kinds of attention with Faster RCNN (ResNet50) on MS COCO2017 to show their individual effectiveness. It is observed that spatial attention and channel attention lead to 2.6 and 2.3 AP improvements, respectively, while the combination of the two kinds of attention leads to a 2.8 AP improvement. These results indicate that spatial and channel attention are each effective on their own and that they can be used together to achieve better performance.

