LEARNING LIGHTWEIGHT OBJECT DETECTORS VIA PROGRESSIVE KNOWLEDGE DISTILLATION

Abstract

Resource-constrained perception systems such as edge computing and vision for robotics require vision models to be both accurate and lightweight in computation and memory usage. Knowledge distillation is one effective strategy to improve the performance of lightweight classification models, but it is less well-explored for structured outputs such as object detection and instance segmentation, where the variable number of outputs and complex internal network modules complicate the distillation. In this paper, we propose a simple yet surprisingly effective sequential approach to knowledge distillation that progressively transfers the knowledge of a set of teachers to a given lightweight student. Our approach is inspired by curriculum learning: to distill knowledge from a highly accurate but complex teacher model, we construct a sequence of teachers that helps the student adapt gradually. Our progressive distillation strategy can be easily combined with existing distillation mechanisms to consistently maximize student performance in various settings. To the best of our knowledge, we are the first to successfully distill knowledge from Transformer-based teacher detectors to convolution-based students, boosting the performance of a ResNet-50 based RetinaNet from 36.5% to 42.0% AP and of Mask R-CNN from 38.2% to 42.5% AP on the MS COCO benchmark.

1. INTRODUCTION

The success of recent deep neural network models generally depends on an elaborate design of architectures with tens or hundreds of millions of model parameters. However, their huge computational complexity and massive memory/storage requirements make them challenging to deploy in safety-critical real-time applications, especially on devices with limited resources, such as self-driving cars or virtual/augmented reality devices. Such concerns have spawned a wide body of literature on compression and acceleration techniques. Many approaches focus on reducing computation demands by sparsifying/pruning networks (Lebedev & Lempitsky, 2016; Han et al., 2016), quantization (Rastegari et al., 2016; Wu et al., 2016), or neural architecture search (Zoph & Le, 2017; Liu et al., 2019), but reduced computation does not always translate into lower latency because of subtle issues with memory access and caching on GPUs (Tan et al., 2019; Ding et al., 2021). Rather than searching over new architectures, we seek to better train existing lightweight architectures that have already been carefully engineered for efficient memory access. Instead of relying on additional data or human supervision, we follow the large body of work on knowledge distillation (Buciluǎ et al., 2006; Hinton et al., 2014) for compressing the information from a large model into a small model. While most recent efforts in knowledge distillation focus on image classification, relatively little work exists on distilling object detectors. The extension from classification to object detection and instance segmentation is nontrivial due to the complicated outputs of these tasks: most detectors operate with multi-task heads (for classification and box/mask regression) that can generate variable-length outputs.
In the literature on detector distillation, recent work (Zhang & Ma, 2021; Shu et al., 2021; Yang et al., 2022b) mainly focuses on designing advanced distillation loss functions for transferring features from teachers to students. However, two challenges remain unsolved: 1) The capacity gap (Cho & Hariharan, 2019; Mirzadeh et al., 2020) between models can result in a sub-optimal distilled student even if the strongest teacher has been employed, which is undesirable when optimizing the accuracy-efficiency trade-off of the student. Moreover, when trying to distill knowledge from Transformer-based teachers (Dosovitskiy et al., 2020; Liu et al., 2021)
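The progressive strategy outlined in the abstract — distilling through a sequence of teachers of increasing capacity rather than jumping straight to the strongest one — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: tiny MLPs stand in for detectors, and a soft-label KL-divergence loss on logits (Hinton et al., 2014) stands in for detection-specific distillation losses; all function and variable names here are our own.

```python
# Illustrative sketch of progressive (sequential) knowledge distillation.
# Assumption: tiny MLPs replace real detectors, and logit-matching replaces
# the feature/box/mask distillation losses used for actual detection models.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distill_step(student, teacher, x, T=2.0):
    """One distillation step: match student logits to teacher logits
    via temperature-scaled KL divergence (Hinton et al., 2014)."""
    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)
    return F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # scale by T^2 to keep gradient magnitudes comparable

def progressive_distill(student, teachers, sample_batch, steps_per_teacher=100):
    """Distill from a sequence of teachers ordered weakest to strongest,
    so the student adapts gradually (curriculum-style) instead of facing
    the full capacity gap to the strongest teacher at once."""
    opt = torch.optim.SGD(student.parameters(), lr=0.1)
    for teacher in teachers:
        teacher.eval()
        for _ in range(steps_per_teacher):
            x = sample_batch()
            opt.zero_grad()
            loss = distill_step(student, teacher, x)
            loss.backward()
            opt.step()
    return student

# Toy usage: two "teachers" of increasing capacity, one small student.
torch.manual_seed(0)
dim, n_cls = 16, 4
teachers = [
    nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, n_cls)),
    nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, n_cls)),
]
student = nn.Linear(dim, n_cls)
student = progressive_distill(
    student, teachers, lambda: torch.randn(8, dim), steps_per_teacher=20
)
```

Because each stage only closes part of the overall capacity gap, the same loop composes with any per-step distillation loss: `distill_step` can be swapped for a feature-imitation or detection-head loss without changing the outer schedule.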

