MASKED DISTILLATION WITH RECEPTIVE TOKENS

Abstract

Distilling from the feature maps can be fairly effective for dense prediction tasks since both the feature discriminability and localization priors can be well transferred. However, not every pixel contributes equally to the performance, and a good student should learn from what really matters to the teacher. In this paper, we introduce a learnable embedding dubbed receptive token to localize the pixels of interest (PoIs) in the feature map, with a distillation mask generated via pixel-wise attention. Distillation is then performed within the masked regions via pixel-wise reconstruction. In this way, a distillation mask actually indicates a pattern of pixel dependencies within the teacher's feature maps. We thus adopt multiple receptive tokens to capture more sophisticated and informative pixel dependencies and further enhance the distillation. To obtain a group of masks, the receptive tokens are learned via the regular task loss with the teacher fixed, and we also leverage a Dice loss to enrich the diversity of the learned masks. Our method, dubbed MasKD, is simple and practical, and requires no task-specific priors in application. Experiments show that MasKD consistently achieves state-of-the-art performance on object detection and semantic segmentation benchmarks. Code is available at https://github.com/hunto/MasKD.

1. INTRODUCTION

Recent deep learning models tend to grow deeper and wider for ultimate performance (He et al., 2016; Xie et al., 2017; Li et al., 2019). However, given the limitations of computational and memory resources, such huge models are clumsy and inefficient to deploy on edge devices. As a practical solution, knowledge distillation (KD) (Hinton et al., 2015; Romero et al., 2014) has been proposed to transfer knowledge from a heavy model (teacher) to a small model (student). Nevertheless, applying KD to dense prediction tasks such as object detection and semantic segmentation sometimes fails to achieve the expected improvements. For example, Fitnet (Romero et al., 2014) mimics the teacher's feature maps element-wise but yields only a minor improvement in object detection. Therefore, reconstructing the features of all pixels may not be a good option for dense prediction, since not every pixel contributes equally to the performance. Many follow-ups (Li et al., 2017; Wang et al., 2019; Sun et al., 2020; Guo et al., 2021) are thus dedicated to showing that distillation on sampled valuable regions can achieve noticeable improvements over the simple baseline methods. For example, Mimicking (Li et al., 2017) distills the positive regions proposed by the region proposal network (RPN) of the student; FGFI (Wang et al., 2019) and TADF (Sun et al., 2020) imitate valuable regions near the foreground boxes; Defeat (Guo et al., 2021) uses ground-truth bounding boxes to balance the loss weights of foreground and background distillations; GID (Dai et al., 2021) selects valuable regions according to the outputs of teacher and student. These methods all rely on bounding-box priors; however, are all pixels inside the bounding boxes necessarily valuable for distillation? The answer might be negative. As shown in Figure 1, the activated regions inside each object box are much smaller than the boxes.
Also, different layers, and even different strides of features in the FPN, have different regions of interest. Moreover, objects that are absent from the ground-truth annotations are treated as "background", yet they may contain valuable discriminative information. This inspires us to discard the ground-truth boxes and select distillation regions at a fine-grained pixel level. To this end, we propose to learn a pixel-wise mask as an indicator for feature distillation. The intuition is that we need to localize which pixels in the teacher's feature map are really meaningful to the task. We therefore introduce a learnable embedding dubbed receptive token to perceive each pixel via attention calculation. A mask is then generated to indicate the pixels of interest (PoIs) encoded by a receptive token. As there may be sophisticated pixel dependencies within the feature maps, we leverage multiple receptive tokens in practice to enhance the distillation. The receptive tokens, together with the corresponding masks, can be trained with the regular task loss while the teacher stays fixed. For the group of masks, we adopt a Dice loss to ensure their diversity, and devise a mask weighting module to accommodate the different importance of the masks. During distillation, we also propose to customize the learned masks using the student's features, which helps our distillation focus more on the pixels that the teacher and student both care about. Our MasKD is simple and practical, and needs no task priors for designing masks, which makes it friendly to various dense prediction tasks. Extensive experiments show that MasKD consistently achieves state-of-the-art performance on object detection and semantic segmentation tasks. For example, MasKD improves the Faster RCNN-R50 student by 2.4 AP on object detection, and the DeepLabV3-R18 student by 2.79 mIoU on semantic segmentation.
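The receptive-token mechanism described above can be sketched roughly as follows. This is a minimal PyTorch illustration under our own assumptions: the function names, the softmax normalization over pixels, and the exact forms of the Dice diversity term and the masked reconstruction loss are ours, not the paper's reference implementation.

```python
import torch


def generate_masks(tokens, feat):
    """Pixel-wise attention between receptive tokens and a feature map.

    tokens: (N, C) learnable embeddings, one per mask.
    feat:   (B, C, H, W) teacher feature map.
    Returns masks of shape (B, N, H, W), softmax-normalized over pixels,
    so each mask is a distribution over the pixels it attends to.
    """
    B, C, H, W = feat.shape
    flat = feat.flatten(2)                             # (B, C, H*W)
    attn = torch.einsum('nc,bcp->bnp', tokens, flat)   # (B, N, H*W)
    masks = attn.softmax(dim=-1)                       # attend over pixels
    return masks.view(B, -1, H, W)


def dice_diversity_loss(masks, eps=1e-6):
    """Encourage the N masks to cover different pixels by penalizing
    their pairwise Dice overlap (averaged over off-diagonal pairs)."""
    B, N, _, _ = masks.shape
    m = masks.flatten(2)                               # (B, N, P)
    inter = torch.einsum('bnp,bmp->bnm', m, m)         # pairwise overlaps
    area = m.sum(-1)                                   # (B, N)
    dice = 2 * inter / (area.unsqueeze(2) + area.unsqueeze(1) + eps)
    # Zero out the diagonal (a mask always fully overlaps itself).
    off_diag = dice - torch.diag_embed(torch.diagonal(dice, dim1=1, dim2=2))
    return off_diag.sum() / (B * N * (N - 1))


def masked_distill_loss(masks, f_s, f_t):
    """Pixel-wise reconstruction of the teacher feature, weighted by the
    learned masks so that only the pixels of interest are mimicked."""
    diff = (f_s - f_t).pow(2).mean(dim=1)              # (B, H, W) error
    return (masks * diff.unsqueeze(1)).sum(dim=(2, 3)).mean()
```

During the first stage, `generate_masks` and `dice_diversity_loss` would be optimized together with the task loss while the teacher is frozen; in the second stage, `masked_distill_loss` is added to the student's training objective.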

2. RELATED WORK

Knowledge distillation on object detection. Knowledge distillation methods for object detection have proven successful in improving lightweight detection networks under the guidance of larger teachers. These methods can be divided into response-based and feature-based methods according to their distillation inputs. Response-based methods (Hinton et al., 2015; Chen et al., 2017; Li et al., 2017) perform distillation on the predictions (e.g., classification scores and bounding box regressions) of teacher and student. In contrast, feature-based methods (Romero et al., 2014; Wang et al., 2019; Guo et al., 2021) are more popular as they can distill both recognition and localization information from the intermediate feature maps. Unlike classification tasks, the distillation losses in detection encounter an extreme imbalance between positive and negative instances. To alleviate this issue, some methods (Wang et al., 2019; Sun et al., 2020; Dai et al., 2021; Guo et al., 2021; Yang et al., 2021) propose to distill the features on various carefully selected sub-regions of the feature map. For instance, FGFI (Wang et al., 2019) selects anchors overlapping with the ground-truth object anchors as distillation regions;
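The foreground/background reweighting idea used by several of these methods (e.g., Defeat's box-based balancing) can be sketched as follows. This is an illustrative example, not any paper's reference implementation: the function name and the weight values are our assumptions, and `fg_mask` stands for a binary map rasterized from the ground-truth boxes at feature resolution.

```python
import torch


def fg_bg_distill_loss(f_s, f_t, fg_mask, w_fg=1.0, w_bg=0.5):
    """Feature distillation with separate foreground/background weights.

    f_s, f_t: (B, C, H, W) student and teacher feature maps.
    fg_mask:  (B, H, W) binary map, 1 inside ground-truth boxes.
    Each region is normalized by its own pixel count, so the sparse
    foreground is not drowned out by the abundant background.
    """
    err = (f_s - f_t).pow(2).mean(dim=1)       # (B, H, W) per-pixel error
    fg = fg_mask.float()
    n_fg = fg.sum().clamp(min=1)
    n_bg = (1 - fg).sum().clamp(min=1)
    loss_fg = (err * fg).sum() / n_fg
    loss_bg = (err * (1 - fg)).sum() / n_bg
    return w_fg * loss_fg + w_bg * loss_bg
```

This kind of weighting still inherits the box prior that the introduction questions: every pixel inside a box is treated as equally important, which is exactly what the learned masks of MasKD avoid.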



Figure 1: Visualization of learned masks on the COCO dataset. In the Faster RCNN-R101 model, the earlier stages of the FPN focus more on small objects, while the later ones focus on larger objects. Complete visualization results can be found in Appendix A.6. Zoom in for a better view.

