KNOWLEDGE DISTILLATION BY SPARSE REPRESENTATION MATCHING

Abstract

Knowledge Distillation refers to a class of methods that transfer the knowledge from a teacher network to a student network. In this paper, we propose Sparse Representation Matching (SRM), a method to transfer intermediate knowledge obtained from one Convolutional Neural Network (CNN) to another by utilizing sparse representation learning. SRM first extracts sparse representations of the hidden features of the teacher CNN, which are then used to generate both pixel-level and image-level labels for training intermediate feature maps of the student network. We formulate SRM as a neural processing block, which can be efficiently optimized using stochastic gradient descent and integrated into any CNN in a plug-and-play manner. Our experiments demonstrate that SRM is robust to architectural differences between the teacher and student networks, and outperforms other KD techniques across several datasets.

1. INTRODUCTION

Over the past decade, deep neural networks have become the primary tools to tackle learning problems in several domains, ranging from machine vision (Ren et al., 2015; Redmon & Farhadi, 2018) and natural language processing (Devlin et al., 2018; Radford et al., 2019) to biomedical analysis (Kiranyaz et al., 2015) and financial forecasting (Tran et al., 2018b; Zhang et al., 2019). Among these developments, Convolutional Neural Networks have evolved into the de facto choice for high-dimensional signals, serving either as a feature extraction block or as the main workhorse of a learning system. Initially developed in the 1990s for handwritten character recognition using only two convolutional layers (LeCun et al., 1998), state-of-the-art CNN topologies nowadays consist of hundreds of layers and millions of parameters (Huang et al., 2017; Xie et al., 2017). In fact, not only in computer vision but also in other domains, state-of-the-art solutions are mainly driven by very large networks (Devlin et al., 2018; Radford et al., 2019), which limits their deployment in practice due to their high computational complexity. The promising results obtained with such computationally demanding models have encouraged a great deal of research on developing smaller, lightweight models that achieve similar performance. This includes efforts on designing more efficient neural network families, both automatically searched and handcrafted (Howard et al., 2017; Tran et al., 2018a; 2019; Zoph & Le, 2016), compressing pretrained networks through weight pruning (Manessi et al., 2018; Tung & Mori, 2018), quantization (Hubara et al., 2017; Zhou et al., 2017) or approximation (Denton et al., 2014; Jaderberg et al., 2014), as well as transferring knowledge from one network to another via knowledge distillation (Hinton et al., 2015).
Of these developments, Knowledge Distillation (KD) (Hinton et al., 2015) is a simple and widely used technique that has been shown to be effective in improving the performance of a network, given access to one or more pretrained networks. KD and its variants work by utilizing the knowledge acquired by one or more models (the teacher(s)) as supervisory signals to train another model (the student) along with the labeled data. Thus, there are two central questions in KD:

• How to represent the knowledge encoded in a teacher network?
• How to efficiently transfer such knowledge to other networks, especially when there are architectural differences between the teacher and the student networks?

In the original formulation (Hinton et al., 2015), soft probabilities produced by the teacher represent its knowledge, and the student network is trained to mimic this soft prediction.

Prior to the era of deep learning, sparse representations attracted a great amount of interest in the computer vision community and form the basis of many important works (Zhang et al., 2015). Sparse representation learning aims at representing the input signal in a domain where its coefficients are sparsest. This is achieved by using an overcomplete dictionary and decomposing the signal as a sparse linear combination of the atoms in the dictionary. While the dictionary can be prespecified, it is often desirable to optimize the dictionary together with the sparse decomposition using example signals. Since the hidden feature maps of a CNN are often smooth, with high correlations between neighboring pixels, they are compressible, e.g., in the Fourier domain. Thus, sparse representations serve as a good choice for representing the information encoded in the hidden feature maps. Sparse representation learning is a well-established topic in which several algorithms have been proposed (Zhang et al., 2015).
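To make the decomposition above concrete: given a signal x and an overcomplete dictionary D, sparse coding commonly minimizes the lasso objective ½‖x − Dz‖² + λ‖z‖₁, which can be solved with the Iterative Shrinkage-Thresholding Algorithm (ISTA). The sketch below is purely illustrative (the random dictionary, λ, and iteration count are our own choices, not those used in SRM):

```python
import numpy as np

def ista(D, x, lam=0.05, n_iter=300):
    """Solve min_z 0.5*||x - D z||^2 + lam*||z||_1 via ISTA.
    D: (d, k) overcomplete dictionary (k > d), x: (d,) signal."""
    L = np.linalg.norm(D, 2) ** 2              # Lipschitz constant of the smooth term
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = z - D.T @ (D @ z - x) / L          # gradient step on 0.5*||x - Dz||^2
        z = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-thresholding
    return z

# Example: a signal built from two atoms yields a sparse code.
rng = np.random.default_rng(0)
D = rng.standard_normal((16, 64))
D /= np.linalg.norm(D, axis=0)                 # unit-norm atoms
x = 1.5 * D[:, 3] - 2.0 * D[:, 40]
z = ista(D, x)
```

Running this, only a handful of the 64 coefficients remain significantly nonzero, while D @ z closely reconstructs x; this sparse vector z is the kind of representation SRM extracts from teacher feature maps.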
However, to the best of our knowledge, existing formulations are computationally too intensive to fit large amounts of data. Although learning task-specific sparse representations has been proposed in prior works (Mairal et al., 2011; Sprechmann et al., 2015; Monga et al., 2019), we have not seen it utilized for knowledge transfer with deep neural networks and stochastic optimization. In this work, we formulate sparse representation learning as a computation block that can be incorporated into any CNN and efficiently optimized with mini-batch updates from stochastic gradient descent based algorithms. Our formulation allows us to take advantage not only of modern stochastic optimization techniques but also of data augmentation to generate target sparse codes on-the-fly. Given the sparse representations obtained from the teacher network, we derive the target pixel-level and image-level sparse representations for the student network. Transferring knowledge from the teacher to the student is then conducted by optimizing the student, with its own dictionaries, to induce the target sparsity. Our knowledge distillation method is thus dubbed Sparse Representation Matching (SRM). Extensive experiments presented in Section 4 show that SRM significantly outperforms other recent KD methods by large margins, especially in transfer learning tasks. In addition, empirical results indicate that SRM is robust to architectural mismatch between the teacher and the student.
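As a rough illustration of the idea of a sparse coding block trained with SGD to induce target sparse codes, consider the following hypothetical sketch. The class name `SRMBlock`, the soft-thresholded linear encoder, and all hyperparameters are our own illustrative choices and do not reproduce the paper's exact formulation:

```python
import numpy as np

def soft(u, t):
    """Element-wise soft-thresholding operator."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

class SRMBlock:
    """Hypothetical sketch: a soft-thresholded linear encoder
    z = soft(W x, theta) whose sparse codes are trained with SGD
    to match target codes (e.g., produced by a teacher)."""
    def __init__(self, d, k, lr=0.1, theta=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((k, d)) * 0.1
        self.lr, self.theta = lr, theta

    def encode(self, X):                       # X: (n, d) mini-batch
        return soft(X @ self.W.T, self.theta)

    def sgd_step(self, X, Z_target):
        Z = self.encode(X)
        E = Z - Z_target                       # code-matching error
        G = (Z != 0) * E                       # subgradient through soft-threshold
        self.W -= self.lr * (G.T @ X) / len(X) # mini-batch SGD update
        return np.mean(E ** 2)

# Example: a "student" encoder learns to match a fixed "teacher"'s codes.
rng = np.random.default_rng(1)
X = rng.standard_normal((64, 8))
teacher = SRMBlock(d=8, k=16, seed=2)
Z_target = teacher.encode(X)                   # teacher's sparse codes
student = SRMBlock(d=8, k=16, seed=3)
first_loss = student.sgd_step(X, Z_target)
for _ in range(300):
    last_loss = student.sgd_step(X, Z_target)
```

Because the encoder and the matching loss are differentiable almost everywhere, the block trains with ordinary mini-batch updates and benefits from data augmentation, since target codes can be recomputed on-the-fly for each augmented batch.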

2. RELATED WORK

The idea of transferring knowledge from one model to another has existed for a long time. This idea was first introduced in Breiman & Shang (1996), in which the authors proposed to grow decision trees to mimic the output of a complex predictor. Later, similar ideas were proposed for training neural networks (Ba & Caruana, 2014; Bucilu et al., 2006; Hinton et al., 2015), mainly for the purpose of model compression. Variants of the knowledge transfer idea differ in the methods of representing and transferring knowledge (Cao et al., 2018; Heo et al., 2019; Romero et al., 2014; Zagoruyko & Komodakis, 2016; Passalis & Tefas, 2018; Park et al., 2019; Tian et al., 2020), as well as the types of data being used (Lopes et al., 2017; Yoo et al., 2019; Papernot et al., 2016; Kimura et al., 2018). In Bucilu et al. (2006), the final predictions of an ensemble on unlabeled data are used to train a single neural network. In Ba & Caruana (2014), the authors proposed to use the logits produced by the source network as the representation of knowledge, which is transferred to a target network by minimizing the Mean Squared Error (MSE) between the logits. The term Knowledge Distillation was introduced in Hinton et al. (2015), in which the student network is trained to simultaneously minimize the cross-entropy measured on the labeled data and the Kullback-Leibler (KL) divergence between its predicted probabilities and the soft probabilities produced by the teacher network. Since its introduction, this formulation has been widely adopted.
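For reference, the KD objective of Hinton et al. (2015) combines the cross-entropy on hard labels with a temperature-scaled KL divergence to the teacher's soft probabilities. A minimal sketch (the temperature T and mixing weight alpha below are illustrative values, not ones prescribed by this paper):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)      # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton-style KD: (1-alpha)*CE(labels) + alpha*T^2*KL(teacher || student)."""
    p_s = softmax(student_logits, T)
    p_t = softmax(teacher_logits, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    hard = softmax(student_logits, 1.0)        # T=1 probabilities for the CE term
    ce = -np.log(hard[np.arange(len(labels)), labels] + 1e-12)
    return np.mean((1 - alpha) * ce + alpha * (T ** 2) * kl)

# Example: a student whose logits differ from the teacher's.
logits_t = np.array([[2.0, 0.5, -1.0]])
logits_s = np.array([[1.0, 0.2, -0.5]])
loss = kd_loss(logits_s, logits_t, labels=np.array([0]))
```

The T² factor keeps the gradient magnitudes of the soft term comparable across temperatures; when the student's logits equal the teacher's, the KL term vanishes and only the (down-weighted) cross-entropy remains.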

Besides the final predictions, other works have proposed to utilize intermediate feature maps of the teacher as additional knowledge (Romero et al., 2014; Zagoruyko & Komodakis, 2016; Gao et al., 2018; Heo et al., 2019; Passalis & Tefas, 2018). Intuitively, intermediate feature maps contain certain clues on how the input is progressively transformed through the layers of a CNN, and thus can act as a good source of knowledge. However, we argue that the intermediate feature maps by themselves are not a good representation of the knowledge encoded in the teacher. To address the question of representing the knowledge of the teacher CNN, instead of directly utilizing the intermediate feature maps of the teacher as supervisory signals, we propose to encode each pixel of the feature maps in a sparse domain and use the resulting sparse representation as the source of supervision.
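To illustrate what "encoding each pixel in a sparse domain" can mean in practice, one can treat every spatial position of a C-channel feature map as a C-dimensional vector and sparse-code it against a dictionary. The sketch below (function name, dictionary, and parameters are our own illustrative choices) runs batched ISTA over all pixels at once:

```python
import numpy as np

def soft(u, t):
    """Element-wise soft-thresholding operator."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def pixel_sparse_codes(F, D, lam=0.02, n_iter=300):
    """Sparse-code every spatial position of a feature map.
    F: (C, H, W) feature map; D: (C, K) dictionary with K > C atoms.
    Returns Z: (H*W, K), one sparse code per pixel (batched ISTA)."""
    C, H, W = F.shape
    X = F.reshape(C, H * W).T                  # (H*W, C): one row per pixel
    L = np.linalg.norm(D, 2) ** 2              # shared Lipschitz constant
    Z = np.zeros((H * W, D.shape[1]))
    for _ in range(n_iter):
        Z = soft(Z - (Z @ D.T - X) @ D / L, lam / L)
    return Z

# Example: a synthetic feature map where each pixel uses a single atom.
rng = np.random.default_rng(0)
C, H, W, K = 8, 4, 4, 32
D = rng.standard_normal((C, K))
D /= np.linalg.norm(D, axis=0)                 # unit-norm atoms
Z_true = np.zeros((H * W, K))
Z_true[np.arange(H * W), rng.integers(0, K, H * W)] = 1.0
F = (Z_true @ D.T).T.reshape(C, H, W)
Z = pixel_sparse_codes(F, D)
```

One plausible reading of the paper's "image-level labels" is an aggregation of these per-pixel codes over spatial positions (e.g., a max over the H·W rows of Z), though the precise construction is defined by the method itself, not by this sketch.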

