LEARNING LIGHTWEIGHT OBJECT DETECTORS VIA PROGRESSIVE KNOWLEDGE DISTILLATION

Abstract

Resource-constrained perception systems such as edge computing and vision for robotics require vision models to be both accurate and lightweight in computation and memory usage. Knowledge distillation is an effective strategy for improving the performance of lightweight classification models, but it is less well explored for structured prediction tasks such as object detection and instance segmentation, where the variable number of outputs and complex internal network modules complicate distillation. In this paper, we propose a simple yet surprisingly effective sequential approach to knowledge distillation that progressively transfers the knowledge of a set of teachers to a given lightweight student. Our approach is inspired by curriculum learning: to distill knowledge from a highly accurate but complex teacher model, we construct a sequence of teachers that helps the student gradually adapt. Our progressive distillation strategy can be easily combined with existing distillation mechanisms to consistently maximize student performance in various settings. To the best of our knowledge, we are the first to successfully distill knowledge from Transformer-based teacher detectors to convolution-based students, boosting the performance of ResNet-50 based RetinaNet from 36.5% to 42.0% AP and Mask R-CNN from 38.2% to 42.5% AP on the MS COCO benchmark.

1. INTRODUCTION

The success of recent deep neural network models generally depends on an elaborate design of architectures with tens or hundreds of millions of model parameters. However, their huge computational complexity and massive memory/storage requirements make them challenging to deploy in safety-critical real-time applications, especially on devices with limited resources, such as self-driving cars or virtual/augmented reality devices. Such concerns have spawned a wide body of literature on compression and acceleration techniques. Many approaches focus on reducing computation demands by sparsifying/pruning networks (Lebedev & Lempitsky, 2016; Han et al., 2016), quantization (Rastegari et al., 2016; Wu et al., 2016), or neural architecture search (Zoph & Le, 2017; Liu et al., 2019), but reduced computation does not always translate into lower latency because of subtle issues with memory access and caching on GPUs (Tan et al., 2019; Ding et al., 2021). Rather than searching over new architectures, we seek to better train existing lightweight architectures that have already been carefully engineered for efficient memory access. Instead of relying on additional data or human supervision, we follow the large body of work on knowledge distillation (Buciluǎ et al., 2006; Hinton et al., 2014) for compressing the information from a large model into a small model. While most recent efforts in knowledge distillation focus on image classification, relatively little work exists on distilling object detectors. The extension from classification to object detection and instance segmentation is nontrivial due to the complicated outputs of these tasks: most detectors operate with multi-task heads (for classification and box/mask regression) that can generate variable-length outputs.
In the detector-distillation literature, recent work (Zhang & Ma, 2021; Shu et al., 2021; Yang et al., 2022b) mainly focuses on designing advanced distillation loss functions for transferring features from teachers to students. However, two challenges remain unsolved: 1) The capacity gap (Cho & Hariharan, 2019; Mirzadeh et al., 2020) between models can result in a sub-optimal distilled student even when the strongest teacher is employed, which is undesirable when optimizing the accuracy-efficiency trade-off of the student. Moreover, when trying to distill knowledge from Transformer-based teachers (Dosovitskiy et al., 2020; Liu et al., 2021) to classical convolution-based students, this architectural difference can further enlarge the teacher-student gap. 2) Current methods assume that one target teacher has already been selected; this meta-level optimization of teacher selection is neglected in the existing detector-distillation literature. In fact, finding a pool of strong teacher candidates is easy, but trial-and-error may be necessary before determining the most compatible teacher for a specific student. To address these challenges, we propose a framework to learn lightweight detectors through progressive knowledge distillation: 1) We find that sequential distillation from multiple teachers arranged into a curriculum significantly improves knowledge distillation and bridges the teacher-student capacity gap. As shown in Figure 1, even with a huge architectural difference, our progressive distillation can effectively transfer knowledge from Transformer-based teachers to convolution-based students, while previous methods cannot. 2) For the teacher selection problem, given a student and a pool of teacher candidates, we design a heuristic algorithm that automatically determines the order of teachers to use.
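To make the sequential scheme concrete, the sketch below illustrates progressive feature-matching distillation on a toy linear student: the student is warm-started from the previous stage and fitted to each teacher's features in turn. The function names (`distill_step`, `progressive_distill`) and the plain MSE feature-matching loss are illustrative assumptions for exposition, not the exact mechanism used in our experiments.

```python
import numpy as np

def distill_step(W, X, teacher_feats, lr=0.1):
    """One gradient step pulling student features X @ W toward the
    current teacher's features (MSE feature-matching loss)."""
    pred = X @ W
    grad = 2.0 * X.T @ (pred - teacher_feats) / len(X)
    return W - lr * grad

def progressive_distill(W, X, teacher_sequence, steps=200):
    """Distill from each teacher in curriculum order, warm-starting
    the student parameters W from the previous stage."""
    for target_feats in teacher_sequence:
        for _ in range(steps):
            W = distill_step(W, X, target_feats)
    return W
```

The key design choice is that each stage starts from the student produced by the previous one, so intermediate teachers act as stepping stones toward the final, strongest teacher.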
The teacher-ordering algorithm is based on an analysis of the representation similarity between models and does not require knowledge of the specific distillation mechanism to be used. Overall, our progressive distillation is a general meta-level strategy that consistently improves both simple feature-matching distillation and more sophisticated mechanisms. With the help of modern distillation mechanisms and teacher detectors, our progressive distillation learns lightweight RetinaNet and Mask R-CNN students with state-of-the-art accuracy. Furthermore, by analyzing the training loss dynamics of the student model, we find the improvement is not due to minimizing the training loss better. Rather, the knowledge transferred from multiple teachers leads the student to a flat minimum and thus helps the student generalize better. To summarize, we transfer knowledge from multiple teachers to progressively distill a student. Our main contributions include:

• We propose a framework for learning lightweight detectors through progressive knowledge distillation that is simple, general, yet effective. We develop a principled way to automatically design a sequence of teachers appropriate for a given student and progressively distill it.

• Our progressive distillation is a meta-level strategy that can be easily combined with previous efforts in detection distillation. We perform a comprehensive empirical evaluation on the challenging MS COCO dataset and observe consistent gains.

• For the first time, we investigate distillation from Transformer-based teacher detectors to a convolution-based student, and find progressive distillation is the key to bridging their gap.

• We show the performance gain comes from better generalization rather than better optimization.
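A minimal sketch of such a similarity-driven ordering is given below, assuming a linear CKA (Centered Kernel Alignment) measure computed on features extracted from a small probe set; the helper names (`linear_cka`, `order_teachers`) and the choice of linear CKA are illustrative assumptions, since the precise similarity metric is not specified in this excerpt. One plausible curriculum, shown here, orders teachers from most to least similar to the student.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two feature matrices of shape
    (n_samples, dim_x) and (n_samples, dim_y); higher means the
    two models represent the probe samples more similarly."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, 'fro') ** 2
    norm_x = np.linalg.norm(X.T @ X, 'fro')
    norm_y = np.linalg.norm(Y.T @ Y, 'fro')
    return hsic / (norm_x * norm_y)

def order_teachers(student_feats, teacher_feats_list):
    """Return teacher indices ordered from most to least similar to
    the student, i.e., closest teacher first in the curriculum."""
    sims = [linear_cka(student_feats, t) for t in teacher_feats_list]
    return sorted(range(len(teacher_feats_list)), key=lambda i: -sims[i])
```

Because CKA only needs features from each model on a shared probe set, the ordering can be computed once, before any distillation run, and independently of the distillation loss that will be used.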

2. RELATED WORK

Knowledge Distillation: Knowledge distillation, or transfer, the idea of training a shallow student network with supervision from a deep teacher, was originally proposed by Buciluǎ et al. (2006) and later formally popularized by Hinton et al. (2014). Different forms of knowledge can be used, such as response-based knowledge (Hinton et al., 2014) and feature-based knowledge (Romero et al., 2015; Heo et al., 2019). Several multi-teacher knowledge distillation methods have been proposed (Vongkulbhisal et al., 2019; Sau & Balasubramanian, 2016), which usually use the average of logits and feature representations as the knowledge (You et al., 2017; Fukuda et al., 2017). Mirzadeh et al. (2020) find that an intermediate teacher assistant, chosen by architectural similarity, can bridge the gap between the student and the teacher. We find it more effective to use a sequence of teachers instead of their ensemble, and we extend Mirzadeh et al. (2020) to the more general case where teacher models have diverse architectures and their relative ordering is unknown.
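For completeness, response-based knowledge in the sense of Hinton et al. (2014) can be sketched as a KL divergence between temperature-softened teacher and student output distributions; the `T ** 2` factor compensates for the softening so gradient magnitudes stay comparable across temperatures. This is a generic sketch of the classic classification formulation, not tied to any particular detector head.

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax with temperature T."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Response-based distillation loss: KL(teacher || student) on
    temperature-softened distributions, scaled by T^2 as in
    Hinton et al. (2014). Averaged over the batch."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return (T ** 2) * kl.mean()
```

In practice this term is added to the usual hard-label loss with a weighting coefficient; feature-based variants instead match intermediate representations, as in the feature-matching losses discussed above.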

