EXPLICIT CONNECTION DISTILLATION

Abstract

One effective way to ease the deployment of deep neural networks on resource-constrained devices is Knowledge Distillation (KD), which boosts the accuracy of a low-capacity student model by mimicking the learned information of a high-capacity teacher (either a single model or a multi-model ensemble). Although great progress has been made in KD research, existing efforts are primarily invested in designing better distillation losses that use soft logits or intermediate feature representations of the teacher as extra supervision. In this paper, we present Explicit Connection Distillation (ECD), a new KD framework that addresses knowledge distillation from a novel perspective: it bridges dense intermediate feature connections between a student network and a corresponding teacher generated automatically during training, achieving knowledge transfer via direct cross-network, layer-to-layer gradient propagation. ECD has two interdependent modules. In the first module, given a student network, an auxiliary teacher architecture is temporarily generated by strengthening the feature representations of the student's basic convolutions, replacing them with dynamic additive convolutions while keeping the other layers structurally unchanged. A teacher generated in this way is guaranteed to have superior capacity and aligns perfectly with the student (in both input and output dimensions) at every convolutional layer. In the second module, dense feature connections are introduced between the aligned convolutional layers, running from the student to its auxiliary teacher, which allows explicit layer-to-layer gradient propagation from the teacher to the student as the merged model is trained from scratch. Intriguingly, because the feature connections are one-way, all of them, together with the auxiliary teacher, exist only during the training phase.
Experiments on popular image classification tasks validate the effectiveness of our method. Code will be made publicly available.
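To make the alignment property claimed in the abstract concrete, the sketch below models each basic convolution as a 1x1 (per-pixel linear) map and builds the auxiliary teacher's layer as a sum of K parallel kernels, each shaped like the student's single kernel. The function names, the choice K = 4, and the reading of "dynamic additive convolution" as a sum of parallel same-shaped kernels are illustrative assumptions, not the paper's exact construction; the point is only that such a layer accepts the same input and produces the same output dimensions as the student's layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    """A basic convolution, simplified here to a 1x1 (per-pixel linear) map.
    x: (c_in, H, W) feature map; w: (c_out, c_in) kernel."""
    return np.einsum("oc,chw->ohw", w, x)

def additive_conv(x, branches):
    """Hypothetical 'additive convolution': the sum of K parallel kernels,
    each shaped like the student's kernel, so dimensions are unchanged."""
    return sum(conv1x1(x, w) for w in branches)

c_in, c_out, H, W, K = 8, 16, 4, 4, 4
w_student = rng.standard_normal((c_out, c_in))     # student layer
w_teacher = rng.standard_normal((K, c_out, c_in))  # teacher's K branches

x = rng.standard_normal((c_in, H, W))
y_student = conv1x1(x, w_student)
y_teacher = additive_conv(x, w_teacher)

# Perfect feature alignment: the teacher layer accepts the same input and
# produces an output of identical shape, so a dense student-to-teacher
# feature connection can be introduced at every such layer.
assert y_student.shape == y_teacher.shape == (c_out, H, W)
```

By linearity, the additive layer collapses to a single convolution with the summed kernel, yet it carries K times the parameters during training, which is one plausible way the auxiliary teacher gains capacity while staying structurally aligned with the student.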

1. INTRODUCTION

Deep Neural Networks (DNNs) have achieved great success in tackling a variety of visual recognition tasks (Krizhevsky et al., 2012; Girshick et al., 2014; Long et al., 2015). Despite their appealing performance, prevailing DNN models usually have large numbers of parameters, leading to heavy memory and computation costs. Conventional techniques such as pruning weights from networks (Han et al., 2015; Li et al., 2017) and quantizing networks to use low-bit parameters (Courbariaux et al., 2015; Rastegari et al., 2016; Zhou et al., 2016) have proven effective at mitigating this computational burden. More recently, Knowledge Distillation (KD), another promising solution family for obtaining compact yet accurate models, has attracted increasing attention. The goal of KD is to transfer the learned information (knowledge) of a high-capacity DNN model or an ensemble of multiple DNN models (the teacher) to a low-capacity target DNN model (the student), striking a better accuracy-efficiency tradeoff at runtime.

Numerous KD methods have been proposed to address the knowledge transfer from the teacher to the student. Many of them use a two-stage training process that begins by training a teacher model and keeping it fixed, and then learns a target student model by forcing it to match the knowledge output by the pre-trained teacher. Various types of knowledge have been explored, such as output logits (Ba & Caruana, 2014; Hinton et al., 2015), intermediate feature representations (Romero et al., 2015; Zagoruyko & Komodakis, 2017), and relational information of model outputs or representations (Park et al., 2019; Tian et al., 2020). Instead of defining new types of knowledge, other methods adopt a one-stage training process,

