KNOWLEDGE DISTILLATION BY SPARSE REPRESENTATION MATCHING

Abstract

Knowledge Distillation refers to a class of methods that transfer the knowledge of a teacher network to a student network. In this paper, we propose Sparse Representation Matching (SRM), a method to transfer intermediate knowledge obtained from one Convolutional Neural Network (CNN) to another by utilizing sparse representation learning. SRM first extracts sparse representations of the hidden features of the teacher CNN, which are then used to generate both pixel-level and image-level labels for training intermediate feature maps of the student network. We formulate SRM as a neural processing block, which can be efficiently optimized using stochastic gradient descent and integrated into any CNN in a plug-and-play manner. Our experiments demonstrate that SRM is robust to architectural differences between the teacher and student networks, and outperforms other KD techniques across several datasets.

1. INTRODUCTION

Over the past decade, deep neural networks have become the primary tools to tackle learning problems in several domains, ranging from machine vision (Ren et al., 2015; Redmon & Farhadi, 2018) and natural language processing (Devlin et al., 2018; Radford et al., 2019) to biomedical analysis (Kiranyaz et al., 2015) and financial forecasting (Tran et al., 2018b; Zhang et al., 2019). Among these developments, Convolutional Neural Networks (CNNs) have evolved into a de facto choice for high-dimensional signals, either as a feature extraction block or as the main workhorse of a learning system. Initially developed in the 1990s for handwritten character recognition using only two convolutional layers (LeCun et al., 1998), state-of-the-art CNN topologies nowadays consist of hundreds of layers and millions of parameters (Huang et al., 2017; Xie et al., 2017). In fact, not only in computer vision but also in other domains, state-of-the-art solutions are mainly driven by very large networks (Devlin et al., 2018; Radford et al., 2019), whose high computational complexity limits their deployment in practice. These results, obtained at near-maximal attainable computational power, have encouraged a great deal of research on developing smaller, lightweight models that achieve similar performance. This includes efforts on designing more efficient neural network families, both automatically and by hand (Howard et al., 2017; Tran et al., 2018a; 2019; Zoph & Le, 2016), compressing pretrained networks through weight pruning (Manessi et al., 2018; Tung & Mori, 2018), quantization (Hubara et al., 2017; Zhou et al., 2017) or approximation (Denton et al., 2014; Jaderberg et al., 2014), as well as transferring knowledge from one network to another via knowledge distillation (Hinton et al., 2015).
Of these developments, Knowledge Distillation (KD) (Hinton et al., 2015) is a simple and widely used technique that has been shown to be effective in improving the performance of a network, given access to one or many pretrained networks. KD and its variants work by utilizing the knowledge acquired by one or many models (the teacher(s)) as supervisory signals to train another model (the student) along with the labeled data. Thus, there are two central questions in KD:

• How to represent the knowledge encoded in a teacher network?
• How to efficiently transfer such knowledge to other networks, especially when there are architectural differences between the teacher and the student networks?

In the original formulation (Hinton et al., 2015), soft probabilities produced by the teacher represent its knowledge, and the student network is trained to mimic this soft prediction. Besides the final predictions, other works have proposed to utilize intermediate feature maps of the teacher as additional knowledge (Romero et al., 2014; Zagoruyko & Komodakis, 2016; Gao et al., 2018; Heo et al., 2019; Passalis & Tefas, 2018). Intuitively, intermediate feature maps contain clues on how the input is progressively transformed through the layers of a CNN, and thus can act as a good source of knowledge. However, we argue that the intermediate feature maps by themselves are not a good representation of the teacher's knowledge for teaching students. To address the question of representing the knowledge of the teacher CNN, instead of directly utilizing the intermediate feature maps of the teacher as supervisory signals, we propose to encode each pixel (of the feature maps) in a sparse domain and use the sparse representation as the source of supervision. Prior to the era of deep learning, sparse representations attracted a great amount of interest in the computer vision community and formed the basis of many important works (Zhang et al., 2015).
Sparse representation learning aims at representing the input signal in a domain where the coefficients are sparsest. This is achieved by using an overcomplete dictionary and decomposing the signal as a sparse linear combination of the atoms in the dictionary. While the dictionary can be prespecified, it is often desirable to optimize the dictionary together with the sparse decomposition using example signals. Since hidden feature maps in a CNN are often smooth, with high correlations between neighboring pixels, they are compressible, e.g., in the Fourier domain. Thus, sparse representation serves as a good choice for representing the information encoded in the hidden feature maps. Sparse representation learning is a well-established topic for which several algorithms have been proposed (Zhang et al., 2015). However, to the best of our knowledge, existing formulations are too computationally intensive to fit large amounts of data. Although learning task-specific sparse representations has been proposed in prior works (Mairal et al., 2011; Sprechmann et al., 2015; Monga et al., 2019), we have not seen it utilized for knowledge transfer using deep neural networks and stochastic optimization. In this work, we formulate sparse representation learning as a computation block that can be incorporated into any CNN and efficiently optimized using mini-batch updates from stochastic gradient-descent based algorithms. Our formulation allows us to take advantage not only of modern stochastic optimization techniques but also of data augmentation, to generate target sparsity on-the-fly. Given the sparse representations obtained from the teacher network, we derive target pixel-level and image-level sparse representations for the student network. Transferring knowledge from the teacher to the student is then conducted by optimizing the student, with its own dictionaries, to induce the target sparsity. Thus, our knowledge distillation method is dubbed Sparse Representation Matching (SRM).
Extensive experiments presented in Section 4 show that SRM significantly outperforms other recent KD methods, especially in transfer learning tasks, where it wins by large margins. In addition, empirical results indicate that SRM is robust to architectural mismatch between the teacher and the student.

2. RELATED WORK

The idea of transferring knowledge from one model to another has existed for a long time. This idea was first introduced in Breiman & Shang (1996) in which the authors proposed to grow decision trees to mimic the output of a complex predictor. Later, similar ideas were proposed for training neural networks (Ba & Caruana, 2014; Bucilu et al., 2006; Hinton et al., 2015) , mainly for the purpose of model compression. Variants of the knowledge transfer idea differ in the methods of representing and transferring knowledge (Cao et al., 2018; Heo et al., 2019; Romero et al., 2014; Zagoruyko & Komodakis, 2016; Passalis & Tefas, 2018; Park et al., 2019; Tian et al., 2020) , as well as the types of data being used (Lopes et al., 2017; Yoo et al., 2019; Papernot et al., 2016; Kimura et al., 2018) . In Bucilu et al. (2006) , the final predictions of an ensemble on unlabeled data are used to train a single neural network. In Ba & Caruana (2014) , the authors proposed to use the logits produced by the source network as the representation of knowledge, which is transferred to a target network by minimizing the Mean Squared Error (MSE) between the logits. The term Knowledge Distillation was introduced in Hinton et al. (2015) in which the student network is trained to simultaneously minimize the cross-entropy measured on the labeled data and the Kullback-Leibler (KL) divergence between its predicted probabilities and the soft probabilities produced by the teacher network. Since its introduction, this formulation has been widely adopted. In addition to the soft probabilities of the teacher, later works have been proposed to utilize intermediate features of the teacher as additional sources of knowledge. For example, in FitNet (Romero et al., 2014) , the authors referred to intermediate feature maps of the teacher as hints and the student is first pretrained by regressing its intermediate features to the teacher's hints, then later optimized with the standard KD approach. 
In other works, activation maps (Zagoruyko & Komodakis, 2016) as well as feature distributions (Passalis & Tefas, 2018) computed from intermediate layers have been proposed. In recent works (Park et al., 2019; Tian et al., 2020), instead of transferring knowledge about each individual sample, the authors proposed to transfer relational knowledge between pairs of samples. Our SRM method bears some resemblance to previous works in the sense that SRM also uses intermediate feature maps as additional sources of knowledge. However, there are many differences between SRM and existing works. For example, in FitNet (Romero et al., 2014), the student network learns to regress from its intermediate features to the teacher's; however, the regressed features are not actually used in the student network, so the hints in FitNet only indirectly influence the student's features. Since the sparse representation is an equivalent representation of the feature maps, SRM directly influences the student's features. In addition, by manipulating the sparse representation rather than the hidden features themselves, SRM is less prone to feature value range mismatch between the teacher and the student: by construction, the sparse coefficients generated by SRM only take values in the range [0, 1], as we will see in Section 3. The attention-based KD method (Zagoruyko & Komodakis, 2016) overcomes this problem by normalizing (using the ℓ2 norm) the attention maps of the teacher and the student. This normalization step, however, may suffer from numerical instability (when the ℓ2 norm is very small) when attention maps are calculated from activation layers such as ReLU. In (Liu et al., 2019), the authors also employed the idea of sparse coding, but to represent the network's parameters rather than the intermediate feature maps as in our work.
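To make the normalization issue concrete, the following is a minimal NumPy sketch (the function name `l2_normalize` and the epsilon guard are our illustration, not part of any cited method): an attention map that has died after a ReLU has zero ℓ2 norm, so normalizing it produces NaNs unless a guard term is added.

```python
import numpy as np

def l2_normalize(a, eps=0.0):
    """l2-normalize an attention map; eps guards the degenerate case."""
    return a / (np.linalg.norm(a) + eps)

dead = np.zeros(8)                       # an attention map that died after ReLU
unstable = l2_normalize(dead)            # 0/0 -> NaN without a guard
stable = l2_normalize(dead, eps=1e-12)   # all zeros with the guard
```

Even with the epsilon guard, the gradient scale still grows as the norm shrinks, whereas SRM's sparse codes are bounded in [0, 1] by construction.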
In (Jain et al., 2019), the authors used feature quantization and k-means clustering on intermediate features of the teacher to train the student, equipped with additional convolutional modules, to predict the cluster labels. Our pixel-level label bears some resemblance to this method. However, we explicitly represent the intermediate features by sparse representation (by minimizing the reconstruction error) and use the same process to transfer both local (pixel-level) and global (image-level) information.

3.1. KNOWLEDGE REPRESENTATION

Given the $n$-th input image $X_n$, let us denote by $T^{(l)}_n \in \mathbb{R}^{H_l \times W_l \times C_l}$ the output of the $l$-th layer of the teacher CNN, with $H_l$ and $W_l$ being the spatial dimensions and $C_l$ the number of channels. In the following, we use the subscripts $T$ and $S$ to denote variables related to the teacher and student networks, respectively. In addition, we denote by $t^{(l)}_{n,i,j} = T^{(l)}_n(i, j, :) \in \mathbb{R}^{C_l}$ the pixel at position $(i, j)$ of $T^{(l)}_n$. The first objective in SRM is to represent each pixel $t^{(l)}_{n,i,j}$ in a sparse domain. To do so, SRM learns an overcomplete dictionary of $M_l$ atoms, $D^{(l)}_T = [d^{(l)}_{T,1}, \dots, d^{(l)}_{T,M_l}] \in \mathbb{R}^{C_l \times M_l}$ ($M_l > C_l$), which is used to express each pixel $t^{(l)}_{n,i,j}$ as a linear combination of the atoms $d^{(l)}_{T,m}$:

$$t^{(l)}_{n,i,j} \approx \sum_{m=1}^{M_l} \psi_k\big(t^{(l)}_{n,i,j}, d^{(l)}_{T,m}\big) \cdot \kappa\big(t^{(l)}_{n,i,j}, d^{(l)}_{T,m}\big) \cdot d^{(l)}_{T,m} \quad (1)$$

where:

• $\kappa(t^{(l)}_{n,i,j}, d^{(l)}_{T,m})$ denotes a function that measures the similarity between $t^{(l)}_{n,i,j}$ and atom $d^{(l)}_{T,m}$. We further denote by $k^{(l)}_{T,n,i,j} = [\kappa(t^{(l)}_{n,i,j}, d^{(l)}_{T,1}), \dots, \kappa(t^{(l)}_{n,i,j}, d^{(l)}_{T,M_l})]$ the vector containing the similarities between $t^{(l)}_{n,i,j}$ and all atoms in the dictionary $D^{(l)}_T$.
• $\psi_k(t^{(l)}_{n,i,j}, d^{(l)}_{T,m})$ denotes the indicator function that returns 1 if $\kappa(t^{(l)}_{n,i,j}, d^{(l)}_{T,m})$ belongs to the set of top-$k$ values in $k^{(l)}_{T,n,i,j}$, and 0 otherwise.

The decomposition in Eq. (1) means that $t^{(l)}_{n,i,j}$ is expressed as the linear combination of the $k$ most similar atoms in $D^{(l)}_T$, with the coefficients being the corresponding similarity values. Let $\lambda^{(l)}_{n,i,j,m} = \psi_k(t^{(l)}_{n,i,j}, d^{(l)}_{T,m}) \cdot \kappa(t^{(l)}_{n,i,j}, d^{(l)}_{T,m})$; the sparse representation of $t^{(l)}_{n,i,j}$ is then defined as:

$$\tilde{t}^{(l)}_{n,i,j} = [\lambda^{(l)}_{n,i,j,1}, \dots, \lambda^{(l)}_{n,i,j,M_l}] \in \mathbb{R}^{M_l} \quad (2)$$

By construction, there are only $k$ non-zero elements in $\tilde{t}^{(l)}_{n,i,j}$; $k$ defines the degree of sparsity and is a hyper-parameter of SRM.
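As a concrete illustration, the decomposition in Eqs. (1)-(2) can be sketched in a few lines of NumPy for a single pixel (function and variable names are ours; a real implementation would operate on GPU tensors inside the training graph):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sparse_code(t, D, k, c=0.0):
    """Top-k sparse representation of one pixel t (Eqs. 1-2).

    t: (C,) feature pixel; D: (C, M) overcomplete dictionary (M > C);
    k: number of retained atoms. Returns lam: (M,) with exactly k
    non-zeros, each in (0, 1) by construction of the sigmoid kernel.
    """
    sim = sigmoid(t @ D + c)          # kappa(t, d_m) for every atom
    mask = np.zeros_like(sim)
    mask[np.argsort(sim)[-k:]] = 1.0  # psi_k: indicator of top-k similarities
    return mask * sim                 # lam_m = psi_k * kappa

rng = np.random.default_rng(0)
C, M, k = 8, 16, 3                    # assumed toy sizes
t = rng.standard_normal(C)
D = rng.standard_normal((C, M))
lam = sparse_code(t, D, k)            # the sparse representation of t
```

The resulting vector `lam` is the $\tilde{t}$ of Eq. (2): $M$-dimensional, with exactly $k$ non-zero similarity values.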
In order to find $\tilde{t}^{(l)}_{n,i,j}$, we simply minimize the reconstruction error of Eq. (1):

$$\arg\min_{D^{(l)}_T} \sum_{n,i,j} \Big\| t^{(l)}_{n,i,j} - \sum_{m=1}^{M_l} \psi_k\big(t^{(l)}_{n,i,j}, d^{(l)}_{T,m}\big) \cdot \kappa\big(t^{(l)}_{n,i,j}, d^{(l)}_{T,m}\big) \cdot d^{(l)}_{T,m} \Big\|^2_2 \quad (3)$$

There are many choices for the similarity function $\kappa$, such as the linear, RBF, or sigmoid kernel. In our work, we used the sigmoid kernel $\kappa(x, y) = \mathrm{sigmoid}(x^\top y + c)$, since the dot-product makes it computationally efficient and the gradients in the backward pass are stable. Although the RBF kernel is popular in many works, we empirically found that it is sensitive to the learning rate, which easily leads to numerical issues.
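The objective in Eq. (3) can likewise be sketched for a batch of pixels. This is a simplified NumPy version (names are ours) that only evaluates the loss, whereas SRM optimizes the dictionary with mini-batch SGD:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruction_error(T, D, k, c=0.0):
    """Objective of Eq. (3) over a batch of pixels.

    T: (N, C) pixels; D: (C, M) dictionary. For each pixel, keep the
    top-k sigmoid similarities as coefficients and measure the squared
    error of the resulting linear combination of atoms.
    """
    sim = sigmoid(T @ D + c)                  # (N, M) similarities
    drop = np.argsort(sim, axis=1)[:, :-k]    # indices of discarded atoms
    lam = sim.copy()
    np.put_along_axis(lam, drop, 0.0, axis=1) # psi_k zeroes all but top-k
    recon = lam @ D.T                         # (N, C) reconstructions
    return np.sum((T - recon) ** 2)

rng = np.random.default_rng(1)
T = rng.standard_normal((32, 8))              # a toy batch of pixels
D = rng.standard_normal((8, 16))
err = reconstruction_error(T, D, k=4)
```

In SRM this scalar is differentiated with respect to $D^{(l)}_T$ and minimized by a stochastic gradient-descent based optimizer over mini-batches of feature-map pixels.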

3.2. TRANSFERRING KNOWLEDGE

Let us denote by $S^{(p)}_n \in \mathbb{R}^{H_p \times W_p \times C_p}$ the output of the $p$-th layer of the student network given input image $X_n$. In addition, $s^{(p)}_{n,i,j} \in \mathbb{R}^{C_p}$ denotes the pixel at position $(i, j)$ of $S^{(p)}_n$. We consider the task of transferring knowledge from the $l$-th layer of the teacher to the $p$-th layer of the student. To do so, we require that the spatial dimensions of both networks match ($H_p = H_l$ and $W_p = W_l$), while the channel dimensions may differ. Given the sparse representation of the teacher in Eq. (2), a straightforward approach is to train the student network to produce hidden features at spatial position $(i, j)$ having the same sparse coefficients as its teacher's. However, learning the exact sparse representations produced by the teacher is too restrictive a task, since it enforces learning the absolute value of every point in a high-dimensional space. Instead of enforcing an absolute constraint on how each pixel of every sample should be represented, we only enforce relative constraints between them in the sparse domain. Specifically, we propose to train the student to approximate only the sparse structures of the teacher network by solving a classification problem with two types of labels extracted from the teacher's sparse representation $\tilde{t}^{(l)}_{n,i,j}$: pixel-level and image-level labels.

Pixel-level labeling: for each spatial position $(i, j)$, we assign a class label, which is the index of the largest element of $\tilde{t}^{(l)}_{n,i,j}$, i.e., the index of the closest (most similar) atom in $D^{(l)}_T$. This means that we partition all pixels into $M_l$ disjoint sets using dictionary $D^{(l)}_T$, and the student network is trained to learn the same partitioning using its own dictionary $D^{(p)}_S = [d^{(p)}_{S,1}, \dots, d^{(p)}_{S,M_l}] \in \mathbb{R}^{C_p \times M_l}$. Let $k^{(p)}_{S,n,i,j} = [\kappa(s^{(p)}_{n,i,j}, d^{(p)}_{S,1}), \dots, \kappa(s^{(p)}_{n,i,j}, d^{(p)}_{S,M_l})]$ denote the vector containing the similarities between pixel $s^{(p)}_{n,i,j}$ and the $M_l$ atoms in $D^{(p)}_S$.
The first knowledge transfer objective in our method, using pixel-level labels, is defined as:

$$\arg\min_{\Theta_S, D^{(p)}_S} \sum_{n,i,j} \mathcal{L}_{CE}\big(c_{n,i,j}, k^{(p)}_{S,n,i,j}\big) \quad (4)$$

where $\Theta_S$ denotes the parameters of the student network, $\mathcal{L}_{CE}$ denotes the cross-entropy loss function, and $c_{n,i,j} = \arg\max(\tilde{t}^{(l)}_{n,i,j})$. Here we should note that the idea of transferring structure instead of the absolute representation is not new. For example, in (Park et al., 2019) and (Tian et al., 2020), the authors proposed to transfer the relative distances between the embeddings of samples. In our case, the pixel-level labels provide supervisory information on how the pixels in the student network should be represented in the sparse domain so that their partitioning by nearest atom matches the teacher's.

Image-level labeling: given the sparse representation $\tilde{t}^{(l)}_{n,i,j}$ of the teacher, we generate an image-level label by averaging $\tilde{t}^{(l)}_{n,i,j}$ over the spatial dimensions. While pixel-level labels provide local supervisory information encoding the spatial structure, the image-level label provides global supervisory information, promoting the shift-invariance property. The image-level label bears some resemblance to the Bag-of-Features model (Passalis & Tefas, 2017), which aggregates histograms of image patches to generate an image-level feature. The second knowledge transfer objective, using image-level labels, is defined as:

$$\arg\min_{\Theta_S, D^{(p)}_S} \sum_{n} \mathcal{L}_{BCE}\Big(\frac{1}{H_l W_l}\sum_{i,j} \tilde{t}^{(l)}_{n,i,j}, \; \frac{1}{H_l W_l}\sum_{i,j} k^{(p)}_{S,n,i,j}\Big) \quad (5)$$

where $\mathcal{L}_{BCE}$ denotes the binary cross-entropy loss. Since most kernel functions output a similarity score in $[0, 1]$, the elements of $\tilde{t}^{(l)}_{n,i,j}$ and $k^{(p)}_{S,n,i,j}$ are also in this range, making the two inputs to $\mathcal{L}_{BCE}$ in Eq. (5) valid.
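A minimal NumPy sketch of the two targets and losses for a single image follows. Names are ours, and applying a softmax over the student similarities before the cross-entropy is our reading of Eq. (4), since the normalization is left implicit in the text:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def srm_losses(t_sparse, s_sim):
    """Pixel-level CE (Eq. 4) and image-level BCE (Eq. 5) for one image.

    t_sparse: (H, W, M) teacher sparse codes, values in [0, 1].
    s_sim:    (H, W, M) student similarities kappa(s, d_m), in (0, 1).
    """
    H, W, M = s_sim.shape
    # Pixel-level target: index of the most similar teacher atom per location.
    pixel_labels = np.argmax(t_sparse, axis=-1)            # (H, W)
    p = softmax(s_sim).reshape(-1, M)
    ce = -np.mean(np.log(p[np.arange(H * W), pixel_labels.ravel()]))
    # Image-level target: spatial average of the sparse codes.
    y = t_sparse.mean(axis=(0, 1))                         # teacher, (M,)
    q = s_sim.mean(axis=(0, 1))                            # student, (M,)
    bce = -np.mean(y * np.log(q) + (1 - y) * np.log(1 - q))
    return ce, bce

rng = np.random.default_rng(2)
t_sparse = rng.uniform(0.0, 1.0, (4, 4, 16))               # toy teacher codes
s_sim = rng.uniform(0.01, 0.99, (4, 4, 16))                # toy student similarities
ce, bce = srm_losses(t_sparse, s_sim)
```

Both scalars are then summed and minimized with respect to $\Theta_S$ and $D^{(p)}_S$ by stochastic gradient descent.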
To summarize, the procedure of our SRM method is similar to FitNet (Romero et al., 2014) and consists of the following steps:

• Step 1: Given the source layers (with indices $l$) in the teacher network $T$, find the sparse representations by solving Eq. (3).
• Step 2: Given the target layers (with indices $p$), optimize the student network $S$ to predict the pixel-level and image-level labels by solving Eqs. (4) and (5).
• Step 3: Given the student network obtained in Step 2, optimize it using the original KD algorithm.

All optimization objectives in our algorithm are solved by stochastic gradient descent.

4. EXPERIMENTS

The first set of experiments was conducted on the CIFAR100 dataset (Krizhevsky et al., 2009) to compare our SRM method with other KD methods: KD (Hinton et al., 2015), FitNet (Romero et al., 2014), AT (Zagoruyko & Komodakis, 2016), PKT (Passalis & Tefas, 2018), RKD (Park et al., 2019) and CRD (Tian et al., 2020). In the second set of experiments, we tested the algorithms under the transfer learning setting. In the final set of experiments, we evaluated SRM on the large-scale ImageNet dataset. We ran every experiment configuration 3 times and report the mean and standard deviation. Regarding the source and target layers for transferring intermediate knowledge, we simply selected the outputs of the downsampling layers. Detailed information about our experimental setup is provided in the Appendices.

4.1. EXPERIMENTS ON CIFAR100

Since CIFAR100 has no validation set, we randomly selected 5K samples from the training set for validation purposes, reducing the training set size to 45K. Our setup thus differs from the conventional practice of validating and reporting results on the test set of CIFAR100. In this set of experiments, we conducted two experiments after the intermediate knowledge transfer phase: (1) the student network is optimized for 60 epochs using only the training data (without the teacher's soft probabilities), with all parameters fixed except the last linear layer for classification; this experiment is called Linear Probing. (2) the student network is optimized with respect to all parameters for 200 epochs using only the training data (without the teacher's soft probabilities); this experiment is named Whole Network Update. The results are shown in Table 2. Firstly, both experiments show that the student networks outperform their baseline and thus benefit from intermediate knowledge transfer by both methods, even without the final KD phase. In the first experiment, where we only updated the output layer, the student pretrained by FitNet achieves slightly better performance than the one pretrained by SRM (69.95% versus 69.10%). However, when we optimized with respect to all parameters, the student pretrained by SRM performs significantly better than the one pretrained by FitNet (71.99% versus 68.06%). While the full parameter update led the student pretrained by SRM to a better local optimum, it undermined the student pretrained by FitNet. This result suggests that the intermediate knowledge transfer process in SRM initializes the student network at better positions in the parameter space than FitNet does. Effects of sparsity (λ) and dictionary size (µ): in Table 3, we show the performance of SRM with different degrees of sparsity (parameterized by λ = k/M_l; lower λ indicates higher sparsity) and dictionary sizes (parameterized by µ = M_l/C_l; higher µ indicates higher overcompleteness). As can be seen from Table 3, SRM is not sensitive to λ and µ.
In fact, the worst configuration still performs slightly better than KD (73.71% versus 73.27%), and much better than AT, PKT, RKD and CRD.

Table 3: SRM test accuracy (%) on CIFAR100 with different dictionary sizes (parameterized by µ) and degrees of sparsity (parameterized by λ)

           | µ = 1.5      | µ = 2.0      | µ = 3.0
λ = 0.01   | 74.00 ± 0.15 | 74.12 ± 0.35 | 74.09 ± 0.08
λ = 0.02   | 74.34 ± 0.07 | 74.73 ± 0.26 | 74.20 ± 0.27
λ = 0.03   | 73.77 ± 0.05 | 73.83 ± 0.12 | 73.71 ± 0.51

Pixel-level label and image-level label: finally, to show the importance of combining pixel-level and image-level labels, we experimented with two other variants of SRM on CIFAR100, using either pixel-level or image-level labels alone. The results are shown in Table 4. The student network performs poorly when only receiving intermediate knowledge via image-level labels, even though it was later optimized with the standard KD phase. Similar to the observations made from Table 2, this again suggests that the position in the parameter space obtained after the intermediate knowledge transfer phase, and prior to the standard KD phase, heavily affects the final performance. Once the student network is badly initialized, even receiving soft probabilities from the teacher as additional signals does not help. Although using pixel-level labels alone is better than image-level labels alone, the best performance is obtained by combining both objectives.

4.2. TRANSFER LEARNING EXPERIMENTS

We used five datasets (Flowers (Nilsback & Zisserman, 2008), CUB (Wah et al., 2011), Cars (Krause et al., 2013), Indoor-Scenes (Quattoni & Torralba, 2009) and PubFig83 (Pinto et al., 2011)) to assess how well the proposed method works under the transfer learning setting compared to others. In our experiments, we used a ResNext50 (Xie et al., 2017) pretrained on the ILSVRC2012 database as the teacher network, which was finetuned using the training set of each transfer learning task.
We then benchmarked how well each knowledge distillation method transfers both pretrained knowledge and domain-specific knowledge to a randomly initialized student network using domain-specific data. Both residual (ResNet18 (He et al., 2016), 11.3M parameters, 1.82G FLOPs) and non-residual (a variant of AllCNN (Springenberg et al., 2014), 5.1M parameters, 1.35G FLOPs) architectures were used as the student. Full transfer setting: in this setting, we used all samples available in the training set to perform knowledge transfer from the teacher to the student. The test accuracy achieved by the different methods is shown in the upper part of Table 5. The proposed SRM clearly outperforms the other methods on most datasets, except on the Cars and Indoor-Scenes datasets for the AllCNN student. While KD, AT, PKT, CRD and SRM successfully led both the residual and non-residual students to better minima, FitNet was only effective with the residual one. These results suggest that the proposed intermediate knowledge transfer mechanism in SRM is robust to architectural differences between the teacher and the student networks. Few-shot transfer setting: we further assessed how well the methods perform when there is a limited amount of data for knowledge transfer. For each dataset, we randomly selected 5 samples (5-shot) and 10 samples (10-shot) per class from the training set for training purposes, and kept the validation and test sets the same as in the full transfer setting. Since the Flowers dataset has only 10 training samples per class in total (the original split provided by the database has 20 samples per class; however, we used 10 samples for validation purposes), the 10-shot results coincide with the full transfer setting. The test performance (accuracy %) is reported in the lower part of Table 5. Under this restrictive regime, the proposed SRM method performs far better than the other tested methods for both types of students, largely improving over the baseline results.

4.3. IMAGENET EXPERIMENTS

For ImageNet (Russakovsky et al., 2015), we report the Top-1 error of the ResNet18 student in Table 6.

5. CONCLUSION

In this work, we proposed Sparse Representation Matching (SRM), a method to transfer intermediate knowledge from one network to another using sparse representation learning. Experimental results on several datasets indicated that SRM outperforms related methods, successfully performing intermediate knowledge transfer even if there is a significant architectural mismatch between networks and/or a limited amount of data. SRM serves as a starting point for developing specific knowledge transfer use cases, e.g., data-free knowledge transfer, which is an interesting future research direction.





Table 2: Comparison (test accuracy %) between FitNet and SRM in terms of quality of intermediate knowledge transfer on CIFAR100.

Overall comparison (Table 1): It is clear that the student network significantly benefits from all knowledge transfer methods. The proposed SRM method clearly outperforms the other competing methods, establishing a margin of more than 1% over the second best method (KD). In fact, the student network trained with SRM achieves performance very close to its teacher's (74.73% versus 75.09%), despite having a non-residual architecture and 3× fewer parameters. In addition, recent methods such as PKT and CRD perform better than FitNet, yet remain inferior to the original KD method.

Quality of intermediate knowledge transfer: since FitNet and SRM have a pretraining phase to transfer intermediate knowledge, we compared the quality of the intermediate knowledge transferred by FitNet and SRM by conducting two types of experiments after transferring intermediate knowledge (see Section 4.1).

Table 4: Effects of pixel-level label and image-level label in SRM on CIFAR100.

Table 5: Transfer learning using AllCNN (ACNN) and ResNet18 (RN18) (test accuracy %). The standard deviation of the test accuracy in the few-shot settings is reported in Table 7. Few-shot results (5-shot → 10-shot), with columns ordered as the datasets are listed above:

Method      | Flowers       | CUB           | Cars          | Indoor        | PubFig83
ACNN        | 32.91 → 40.80 | 13.19 → 25.53 | 05.03 → 09.50 | 09.21 → 16.23 | 05.15 → 10.09
ACNN-KD     | 35.98 → 46.14 | 21.60 → 34.63 | 10.61 → 23.43 | 14.81 → 21.76 | 11.97 → 28.11
ACNN-FitNet | 28.73 → 30.10 | 14.78 → 29.40 | 06.11 → 15.81 | 08.36 → 15.71 | 08.25 → 17.89
ACNN-AT     | 38.21 → 51.62 | 17.69 → 30.26 | 08.52 → 25.22 | 08.76 → 17.90 | 08.02 → 26.27
ACNN-PKT    | 33.25 → 47.12 | 11.30 → 24.93 | 06.16 → 13.82 | 10.75 → 16.78 | 06.13 → 10.67
ACNN-RKD    | 30.27 → 42.00 | 09.55 → 19.36 | 04.82 → 09.60 | 09.68 → 12.40 | 04.82 → 06.99
ACNN-CRD    | 35.01 → 46.99 | 18.09 → 29.72 | 06.77 → 16.96 | 09.73 → 17.60 | 06.33 → 17.24
ACNN-SRM    | 41.14 → 51.72 | 22.89 → 35.86 | 11.71 → 36.28 | 16.63 → 24.82 | 13.96 → 31.47
RN18        | 32.95 → 44.25 | 11.55 → 22.72 | 05.00 → 11.76 | 09.04 → 15.16 | 04.98 → 08.92
RN18-KD     | 38.07 → 48.26 | 25.53 → 40.57 | 11.37 → 35.44 | 14.61 → 23.85 | 10.69 → 29.67
RN18-FitNet | 39.17 → 48.29 | 26.50 → 43.83 | 12.36 → 51.88 | 13.72 → 24.62 | 11.49 → 36.79
RN18-AT     | 37.36 → 51.49 | 18.22 → 30.47 | 08.96 → 27.70 | 09.93 → 17.48 | 08.76 → 29.80
RN18-PKT    | 33.24 → 45.32 | 10.63 → 20.62 | 05.88 → 11.00 | 11.03 → 15.76 | 05.34 → 08.80
RN18-RKD    | 29.97 → 42.32 | 08.83 → 17.73 | 04.98 → 08.73 | 08.41 → 11.82 | 04.49 → 06.49
RN18-CRD    | 34.24 → 47.67 | 18.36 → 30.04 | 06.95 → 17.11 | 09.01 → 17.82 | 06.36 → 16.08
RN18-SRM    | 51.24 → 67.46 | 34.67 → 48.42 | 26.63 → 61.22 | 21.18 → 32.21 | 29.31 → 51.99

We also compared with the method of (Cho & Hariharan, 2019), which combines the early-stopping trick and Attention Transfer (Zagoruyko & Komodakis, 2016).

Table 6: Top-1 Error of ResNet18 on ImageNet. (*) indicates results obtained with 110 epochs. References for the methods in this table are given in Appendix C.

A CIFAR100

In this experiment, we used the ADAM optimizer and trained all networks for 200 epochs with an initial learning rate of 0.001, which was reduced by a factor of 0.1 at epochs 31 and 131. For the methods that have a pre-training phase, the number of pre-training epochs was set to 160. For regularization, we used weight decay with coefficient 0.0001. For data augmentation, we followed the conventional protocol, which randomly performs horizontal flipping and random horizontal and/or vertical shifting by four pixels. For SRM, KD and FitNet, we used our own implementation; for the other methods (AT, PKT, RKD, CRD), we used the code provided by the authors of the CRD method (Tian et al., 2020). For KD, FitNet and SRM, we validated the temperature τ (used to soften the teacher's probabilities) and the weight α (used to balance the classification and distillation losses) over the following sets: τ ∈ {2.0, 4.0, 8.0} and α ∈ {0.25, 0.5, 0.75}. For the other methods, we used the values provided in the CRD paper (Tian et al., 2020). In addition, there are two hyperparameters of SRM: the degree of sparsity (λ = k/M_l) and the overcompleteness of the dictionaries (µ = M_l/C_l). Lower values of λ indicate sparser representations, while higher values of µ indicate larger dictionaries. For CIFAR100, we performed validation with λ ∈ {0.01, 0.02, 0.03} and µ ∈ {1.5, 2.0, 3.0}.
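The validation grids above, and the mapping from (λ, µ) to the per-layer dictionary size and sparsity level, can be sketched as follows (variable names and the example channel count C_l = 64 are our assumptions; each configuration would train one student and the one with the best validation accuracy would be kept):

```python
import itertools

# Hypothetical sketch of the hyperparameter grids described in the text.
kd_grid = {"tau": [2.0, 4.0, 8.0], "alpha": [0.25, 0.5, 0.75]}
srm_grid = {"lam": [0.01, 0.02, 0.03], "mu": [1.5, 2.0, 3.0]}

# Enumerate every SRM configuration (Cartesian product of the two sets).
configs = [dict(zip(srm_grid, v)) for v in itertools.product(*srm_grid.values())]

# How (lam, mu) determine the dictionary size M_l and the number of
# retained atoms k for a layer with C_l = 64 channels (assumed example):
C = 64
M = int(2.0 * C)             # mu = M_l / C_l  ->  M_l = 128
k = max(1, round(0.02 * M))  # lam = k / M_l   ->  k = 3 here
```

The same enumeration applied to `kd_grid` gives the nine (τ, α) candidates validated for KD, FitNet and SRM.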

B TRANSFER LEARNING

For each transfer learning dataset, we randomly sampled a few samples from the training set to establish the validation set. All images were resized to resolution 224 × 224 and we used standard ImageNet data augmentation (random crop and horizontal flipping) during stochastic optimization. Based on the analysis of hyperparameters in CIFAR100 experiments, we validated KD, FitNet and SRM using α ∈ {0.5, 0.75} and τ ∈ {4.0, 8.0}. Sparsity degree λ = 0.02 and dictionary size µ = 2.0 were used for SRM. For other methods, we used the hyperparameter settings provided in CRD paper (Tian et al., 2020) . All experiments were conducted using ADAM optimizer for 200 epochs with the initial learning rate of 0.001, which is reduced by 0.1 at epochs 41 and 161. The weight decay coefficient was set to 0.0001.

C IMAGENET

For the hyperparameters of SRM, we set µ = 4.0, λ = 0.02, τ = 4.0, α = 0.3. For the other training protocols, we followed the standard setup for ImageNet, which trains the student network for 100 epochs using the SGD optimizer with an initial learning rate of 0.1, dropped by a factor of 0.1 at epochs 51, 81 and 91. The weight decay coefficient was set to 0.0001. In addition, the pre-training phase took 80 epochs with an initial learning rate of 0.1, dropped by a factor of 0.1 every 20 epochs.

References of the methods reported in Table 6 include:
• CRD (Tian et al., 2020)

