EXPLICIT CONNECTION DISTILLATION

Abstract

One effective way to ease the deployment of deep neural networks on resource-constrained devices is Knowledge Distillation (KD), which boosts the accuracy of a low-capacity student model by mimicking the learnt information of a high-capacity teacher (either a single model or a multi-model ensemble). Although great progress has been made in KD research, existing efforts primarily focus on designing better distillation losses that use the soft logits or intermediate feature representations of the teacher as extra supervision. In this paper, we present Explicit Connection Distillation (ECD), a new KD framework that addresses knowledge distillation from a novel perspective: bridging dense intermediate feature connections between a student network and a corresponding teacher generated automatically during training, so that knowledge transfer is achieved via direct cross-network layer-to-layer gradient propagation. ECD has two interdependent modules. In the first module, given a student network, an auxiliary teacher architecture is temporarily generated by replacing the basic convolutions of the student with dynamic additive convolutions to strengthen their feature representations, while keeping all other layers unchanged in structure. A teacher generated in this way is guaranteed superior capacity and aligns perfectly with the student, in both input and output feature dimensions, at every convolutional layer. In the second module, dense feature connections are introduced between the aligned convolutional layers, running from the student to its auxiliary teacher, which allows explicit layer-to-layer gradient propagation from the teacher to the student as the merged model is trained from scratch. Intriguingly, since the feature connections are one-way, they and the auxiliary teacher exist only during the training phase.
Experiments on popular image classification tasks validate the effectiveness of our method. Code will be made publicly available.

1. INTRODUCTION

Deep Neural Networks (DNNs) have achieved great success in tackling a variety of visual recognition tasks (Krizhevsky et al., 2012; Girshick et al., 2014; Long et al., 2015). Despite the appealing performance, prevailing DNN models usually have large numbers of parameters, leading to heavy memory and computation costs. Conventional techniques such as pruning weights from networks (Han et al., 2015; Li et al., 2017) and quantizing networks to use low-bit parameters (Courbariaux et al., 2015; Rastegari et al., 2016; Zhou et al., 2016) have proven effective for mitigating this computational burden. More recently, Knowledge Distillation (KD), another promising family of solutions for obtaining compact yet accurate models, has attracted increasing attention. The goal of KD is to transfer the learnt information (knowledge) of a high-capacity DNN model or an ensemble of multiple DNN models (the teacher) to a low-capacity target DNN model (the student), striking better accuracy-efficiency tradeoffs at runtime. Numerous KD methods address the knowledge transfer from the teacher to the student. Many of them use a two-stage training process: a teacher model is first trained and then kept fixed, and a target student model is subsequently learnt by forcing it to match the knowledge output by the pre-trained teacher. Various types of knowledge have been explored, such as output logits (Ba & Caruana, 2014; Hinton et al., 2015), intermediate feature representations (Romero et al., 2015; Zagoruyko & Komodakis, 2017), and relational information of model outputs or representations (Park et al., 2019; Tian et al., 2020).
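As a point of reference for the soft-logit supervision mentioned above, the classic distillation objective of Hinton et al. (2015) combines a cross-entropy term on ground-truth labels with a temperature-softened KL term against the teacher. The following is a minimal NumPy sketch; the function names and hyperparameter values are illustrative choices, not taken from this paper:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    e = np.exp((z - z.max(axis=-1, keepdims=True)) / T)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton-style distillation loss: weighted sum of cross-entropy
    with ground-truth labels and KL divergence between temperature-
    softened teacher and student distributions."""
    p_s = softmax(student_logits, T)
    p_t = softmax(teacher_logits, T)
    # KL(p_t || p_s), scaled by T^2 to keep gradient magnitudes comparable
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1)
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12)
    return alpha * (T ** 2) * kl.mean() + (1 - alpha) * ce.mean()
```

Note that the KL term vanishes when the student already matches the teacher exactly, leaving only the (down-weighted) label loss.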
Instead of defining new types of knowledge, other methods adopt a one-stage training process, jointly training the student and teacher/peer models using bidirectional knowledge distillation (Zhang et al., 2018a; Yao & Sun, 2020), on-the-fly ensemble distillation (Lan et al., 2018; Anil et al., 2018), or multi-exit distillation (Phuong & Lampert, 2019). Both the two-stage and one-stage KD methods above typically treat knowledge transfer as an optimization problem of formulating robust distillation loss functions, introducing more informative knowledge supervisions and more effective knowledge matching strategies. Instead of extracting informative knowledge and designing alternative distillation loss functions to fit a desired knowledge transfer objective, as existing KD methods do, we investigate a new technical perspective in this paper: casting the knowledge distillation problem as designing auxiliary connection paths between the student and teacher networks, enabling explicit layer-to-layer gradient distillation from the teacher to the student by training both from scratch simultaneously. We are partially inspired by recent advances in DNN architecture engineering, which show that sophisticated feature connection paths such as residual connections (He et al., 2016) and dense connections (Huang et al., 2017) across neighboring layers allow better information and gradient flow throughout a single network, easing training and significantly improving model accuracy and convergence. We conjecture that this simple principle could also open the door to a totally new knowledge distillation framework, provided we can merge the student and teacher into a single network temporarily during the training phase and separate them easily after training. To explore this hypothesis, we present Explicit Connection Distillation (ECD), a very simple knowledge distillation framework.
Specifically, we decompose the design of ECD into two interdependent modules, namely auxiliary teacher generation and dense feature connection distillation. Recent works (Liu et al., 2020; Yue et al., 2020) show that searching for a good alignment of structural feature channels between the student and teacher networks can improve knowledge distillation performance, under the premise that a pre-trained teacher model is available. For the first module of ECD, we want the generated teacher to achieve a perfect structural alignment with the student network (in both input and output feature dimensions of every convolutional layer) in a simpler manner, without the time-consuming search procedures used in (Liu et al., 2020; Yue et al., 2020), while attaining superior model capacity. To this end, we retain all structural units of the student network when constructing the teacher architecture, except that we replace the original convolutions with dynamic additive convolutions (Yang et al., 2019), which have proven very effective for enhancing model capacity in network architecture engineering research. For the second module of ECD, we want knowledge distillation to be realized through explicit layer-to-layer flows of gradients from the teacher to the student, rather than through the conventional mimicking procedure. To this end, we add dense feature connections from the convolutional layers of the student to the aligned layers of the auxiliary teacher, and train the merged model from scratch. After training, all feature connections can be naturally removed, since their direction is only from the student to its auxiliary teacher, which likewise exists only in the training phase.
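To make the first module concrete, the toy sketch below illustrates the additive-aggregation idea behind dynamic convolutions in the spirit of Yang et al. (2019): a bank of candidate kernels is mixed with input-dependent attention weights and applied as a single convolution, so the layer's input and output dimensions match the base convolution it replaces. The use of 1x1 kernels, and all shapes and names here, are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def dynamic_additive_conv(x, kernels, att_w):
    """Sketch of a dynamic additive convolution: K candidate kernels
    are mixed with input-dependent attention weights and applied as
    ONE convolution, so feature dimensions match the base convolution.

    x       : (C_in, H, W)      input feature map
    kernels : (K, C_out, C_in)  bank of 1x1 kernels
    att_w   : (K, C_in)         projection producing attention logits
    """
    # input-dependent attention from globally average-pooled features
    ctx = x.mean(axis=(1, 2))                  # (C_in,)
    logits = att_w @ ctx                       # (K,)
    att = np.exp(logits - logits.max())
    att = att / att.sum()                      # softmax over the K kernels
    # additively aggregate kernels, then run a single 1x1 convolution
    w = np.tensordot(att, kernels, axes=1)     # (C_out, C_in)
    return np.einsum('oc,chw->ohw', w, x)      # (C_out, H, W)
```

Because the aggregation happens in kernel space, the layer costs roughly one convolution at inference while drawing capacity from K kernels, which is why such layers suit the auxiliary teacher role.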
Beyond the common wisdom of knowledge distillation, which requires a knowledge mimicking process between the student and teacher networks, our ECD offers a new insight: by considering knowledge distillation from a novel student-to-teacher merging, co-training and splitting viewpoint, direct feature connections from the student to its well-aligned auxiliary teacher (generated automatically from the student) can also achieve competitive performance in improving low-capacity student models, as validated by extensive experiments on the CIFAR-100 and ImageNet datasets.
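The merge / co-train / split viewpoint can be caricatured with linear "layers" standing in for aligned convolutions. This is a deliberately minimal, illustrative sketch under invented shapes, not the paper's actual architecture: during training, each teacher layer consumes the student feature through a one-way connection, and at deployment the teacher branch and connections are simply dropped, leaving the student path unchanged:

```python
import numpy as np

def merged_forward(x, student_ws, teacher_ws):
    """Merged training-time model: the student feature is added into
    the aligned teacher layer at every depth. Since the connection is
    one-way (student -> teacher), gradients of the teacher's loss flow
    back into student layers, but the student path never reads from
    the teacher."""
    s, t = x, x
    for ws, wt in zip(student_ws, teacher_ws):
        s = np.maximum(ws @ s, 0)        # student layer (ReLU)
        t = np.maximum(wt @ t + s, 0)    # teacher layer + feature connection
    return s, t

def student_forward(x, student_ws):
    """Deployed model: the identical student path, with the teacher
    branch and all connections removed."""
    s = x
    for ws in student_ws:
        s = np.maximum(ws @ s, 0)
    return s
```

The key property the sketch demonstrates is that removing the teacher after training cannot change the student's outputs, since information only ever flows from student to teacher.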

2. RELATED WORK

In this section, we give a brief summary of existing knowledge distillation works. Two-stage KD methods. The idea of training a compact model to mimic the functions learnt by a larger ensemble of models was first proposed in (Bucilǎ et al., 2006). Ba & Caruana (2014) extend this idea, showing that shallower yet wider neural networks can also approximate the functions previously learnt by deep ones. Hinton et al. (2015) present the famous Knowledge Distillation (KD) method, which adopts a teacher-student framework for transferring learnt soft knowledge from a high-capacity teacher to a low-capacity target student network. In this framework, the teacher is pre-trained and fixed, and its soft logits on the training data are then used as extra supervision, alongside the ground-truth labels, to guide the training of the student. FitNets (Romero et al., 2015) shows that the intermediate feature representations learnt by the teacher can be used as the complementary

