KNOWLEDGE DISTILLATION VIA SOFTMAX REGRESSION REPRESENTATION LEARNING

Abstract

This paper addresses the problem of model compression via knowledge distillation. We advocate for a method that optimizes the output feature of the penultimate layer of the student network and hence is directly related to representation learning. To this end, we first propose a direct feature matching approach that focuses on optimizing the student's penultimate layer only. Second, and more importantly, because feature matching does not take into account the classification problem at hand, we propose a second approach that decouples representation learning and classification and utilizes the teacher's pre-trained classifier to train the student's penultimate layer feature. In particular, for the same input image, we wish the teacher's and student's features to produce the same output when passed through the teacher's classifier, which is achieved with a simple L2 loss. Our method is extremely simple to implement, straightforward to train, and is shown to consistently outperform previous state-of-the-art methods over a large set of experimental settings including different (a) network architectures, (b) teacher-student capacities, (c) datasets, and (d) domains. The code is available at https://github.com/jingyang2017/KD_SRRL.

1. INTRODUCTION

Recently, there has been a great amount of research effort to make Convolutional Neural Networks (CNNs) lightweight so that they can be deployed on devices with limited resources. To this end, several approaches for model compression have been proposed, including network pruning (Han et al., 2016; Lebedev & Lempitsky, 2016), network quantization (Rastegari et al., 2016; Wu et al., 2016), knowledge transfer/distillation (Hinton et al., 2015; Zagoruyko & Komodakis, 2017), and neural architecture search (Zoph & Le, 2017; Liu et al., 2018). Knowledge distillation (Buciluǎ et al., 2006; Hinton et al., 2015) aims to transfer knowledge from one network (the so-called "teacher") to another (the so-called "student"). Typically, the teacher is a high-capacity model capable of achieving high accuracy, while the student is a compact model with far fewer parameters, thus also requiring much less computation. The goal of knowledge distillation is to use the teacher to improve the training of the student and push its accuracy closer to that of the teacher. The rationale behind knowledge distillation can be explained from an optimization perspective: there is evidence that high-capacity models (i.e. the teacher) can find good local minima due to over-parameterization (Du & Lee, 2018; Soltanolkotabi et al., 2018). In knowledge distillation, such models are used to facilitate the optimization of lower-capacity models (i.e. the student) during training. For example, in the seminal work of (Hinton et al., 2015), the softmax outputs of the teacher provide extra supervisory signals of inter-class similarities which facilitate the training of the student. In other influential works, intermediate representations extracted from the teacher, such as feature tensors (Romero et al., 2015) or attention maps (Zagoruyko & Komodakis, 2017), have been used to define auxiliary loss functions for the optimization of the student.

Figure 1: Our method performs knowledge distillation by minimizing the discrepancy between the penultimate feature representations h_T and h_S of the teacher and the student, respectively. To this end, we propose to use two losses: (a) the Feature Matching loss L_FM, and (b) the so-called Softmax Regression loss L_SR. In contrast to L_FM, our main contribution, L_SR, is designed to take into account the classification task at hand. To this end, L_SR imposes that, for the same input image, the teacher's and student's features produce the same output when passed through the teacher's pre-trained and frozen classifier. Note that, for simplicity, the function making the feature dimensionalities of h_T and h_S equal is not shown.

Training a network whose output feature representation is rich and powerful has been shown to be crucial for achieving high accuracy in the subsequent classification task, in both unsupervised and supervised learning; see, for example, (Chen et al., 2020; He et al., 2020) and (Kang et al., 2020). Hence, in this paper, we advocate for representation learning-based knowledge distillation by optimizing the student's penultimate layer output feature. If we are able to do this effectively, we expect (and show experimentally) to end up with a student network which generalizes better than one trained with logit matching as in the KD paper of (Hinton et al., 2015).

Main contributions: To accomplish the aforementioned goal, we propose two loss functions. The first loss function, akin to (Romero et al., 2015; Zagoruyko & Komodakis, 2017), is based on direct feature matching but focuses on optimizing the student's penultimate layer feature only.
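The feature matching idea can be sketched in a few lines of PyTorch. This is an illustrative sketch, not the paper's released implementation: the class name, the linear "connector" used to align feature dimensionalities, and all dimensions are our own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureMatchingLoss(nn.Module):
    """Sketch of an L_FM-style loss: L2 matching of penultimate-layer features."""

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Learned linear map aligning the student's feature dimensionality
        # with the teacher's when they differ (illustrative choice).
        self.connector = nn.Linear(student_dim, teacher_dim, bias=False)

    def forward(self, h_s: torch.Tensor, h_t: torch.Tensor) -> torch.Tensor:
        # The teacher feature is a fixed target: detach it so no gradient
        # flows back into the teacher network.
        return F.mse_loss(self.connector(h_s), h_t.detach())
```

In use, h_s and h_t would be the penultimate-layer outputs of the student and teacher for the same input batch, e.g. `FeatureMatchingLoss(256, 512)(h_s, h_t)`.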
Because direct feature matching might be difficult due to the lower representation capacity of the student and, more importantly, is detached from the classification task at hand, we also propose a second loss function: we decouple representation learning and classification and utilize the teacher's pre-trained classifier to train the student's penultimate layer feature. In particular, for the same input image, we wish the teacher's and student's features to produce the same output when passed through the teacher's classifier, which is achieved with a simple L2 loss (see Fig. 1). This softmax regression projection is used to retain from the student's feature the information that is relevant to classification; since the projection matrix is pre-trained (learned during the teacher's training phase), this does not compromise the representational power of the student's feature.

Main results: Our method has two advantages: (1) It is simple and straightforward to implement. (2) It consistently outperforms state-of-the-art methods over a large set of experimental settings including different (a) network architectures (WideResNets, ResNets, MobileNets), (b) teacher-student capacities, (c) datasets (CIFAR-10/100, ImageNet), and (d) domains (real-to-binary).
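The softmax regression loss described above can be sketched as follows. This is a minimal illustration under our own assumptions: we model the teacher's classifier as a single linear layer, assume the student feature has already been aligned to the teacher's dimensionality (the aligning function is omitted, as in Fig. 1), and apply the L2 loss directly to the classifier outputs; the function and argument names are ours, not the released code's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def softmax_regression_loss(h_s_aligned: torch.Tensor,
                            h_t: torch.Tensor,
                            teacher_classifier: nn.Module) -> torch.Tensor:
    """Sketch of an L_SR-style loss.

    Both the teacher feature h_t and the (dimension-aligned) student
    feature h_s_aligned are passed through the teacher's pre-trained,
    frozen classifier, and the two outputs are matched with an L2 loss.
    """
    # Freeze the teacher's classifier so it is never updated.
    for p in teacher_classifier.parameters():
        p.requires_grad_(False)
    out_t = teacher_classifier(h_t).detach()   # teacher feature -> teacher head
    out_s = teacher_classifier(h_s_aligned)    # student feature -> teacher head
    # Simple L2 loss between the two classifier outputs.
    return F.mse_loss(out_s, out_t)
```

Because the classifier's weights are frozen, gradients from this loss reach only the student's feature extractor, which is exactly the decoupling of representation learning from classification argued for above.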

2. RELATED WORK

Knowledge transfer: In the work of (Hinton et al., 2015) , knowledge is defined as the teacher's outputs after the final softmax layer. The softmax outputs carry richer information than one-hot labels because they provide extra supervision signals in terms of the inter-class similarities learned

