KNOWLEDGE DISTILLATION BY SPARSE REPRESENTATION MATCHING

Abstract

Knowledge Distillation refers to a class of methods that transfer the knowledge of a teacher network to a student network. In this paper, we propose Sparse Representation Matching (SRM), a method to transfer intermediate knowledge obtained from one Convolutional Neural Network (CNN) to another by utilizing sparse representation learning. SRM first extracts sparse representations of the hidden features of the teacher CNN, which are then used to generate both pixel-level and image-level labels for training intermediate feature maps of the student network. We formulate SRM as a neural processing block, which can be efficiently optimized using stochastic gradient descent and integrated into any CNN in a plug-and-play manner. Our experiments demonstrate that SRM is robust to architectural differences between the teacher and student networks, and outperforms other KD techniques across several datasets.

1. INTRODUCTION

Over the past decade, deep neural networks have become the primary tools for tackling learning problems in several domains, ranging from machine vision (Ren et al., 2015; Redmon & Farhadi, 2018) and natural language processing (Devlin et al., 2018; Radford et al., 2019) to biomedical analysis (Kiranyaz et al., 2015) and financial forecasting (Tran et al., 2018b; Zhang et al., 2019). Among these developments, Convolutional Neural Networks have become the de facto choice for high-dimensional signals, either as a feature extraction block or as the main workhorse of a learning system. Initially developed in the 1990s for handwritten character recognition using only two convolutional layers (LeCun et al., 1998), state-of-the-art CNN topologies nowadays consist of hundreds of layers and millions of parameters (Huang et al., 2017; Xie et al., 2017). In fact, not only in computer vision but also in other domains, state-of-the-art solutions are driven mainly by very large networks (Devlin et al., 2018; Radford et al., 2019), whose high computational complexity limits their deployment in practice. These promising results, obtained at great computational cost, have encouraged extensive research on developing smaller, lightweight models that achieve similar performance. This includes efforts on designing more efficient neural network families, both automatically searched and handcrafted (Howard et al., 2017; Tran et al., 2018a; 2019; Zoph & Le, 2016), compressing pretrained networks through weight pruning (Manessi et al., 2018; Tung & Mori, 2018), quantization (Hubara et al., 2017; Zhou et al., 2017), or approximation (Denton et al., 2014; Jaderberg et al., 2014), as well as transferring knowledge from one network to another via knowledge distillation (Hinton et al., 2015).
Of these developments, Knowledge Distillation (KD) (Hinton et al., 2015) is a simple and widely used technique that has been shown to be effective in improving the performance of a network, given access to one or more pretrained networks. KD and its variants utilize the knowledge acquired by one or more models (the teacher(s)) as supervisory signals to train another model (the student) alongside the labeled data. Thus, there are two central questions in KD:

• How to represent the knowledge encoded in a teacher network?
• How to efficiently transfer such knowledge to other networks, especially when there are architectural differences between the teacher and the student networks?

In the original formulation (Hinton et al., 2015), the soft probabilities produced by the teacher represent its knowledge, and the student network is trained to mimic this soft prediction. Besides the final
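To make the original KD objective concrete, the following is a minimal NumPy sketch of the Hinton et al. (2015) loss: a weighted sum of the standard cross-entropy with hard labels and a KL divergence between temperature-softened teacher and student distributions. The function names and the hyperparameter values (temperature T, mixing weight alpha) are illustrative choices, not taken from the paper.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T yields softer probabilities.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hinton-style distillation loss (illustrative sketch).

    alpha weights the hard-label cross-entropy term; (1 - alpha) weights
    the soft-target KL term, scaled by T^2 so that gradient magnitudes
    stay comparable across temperatures (as noted by Hinton et al., 2015).
    """
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # KL(teacher || student) on the temperature-softened distributions
    kl = np.sum(
        p_teacher * (np.log(p_teacher + 1e-12) - np.log(p_student + 1e-12)),
        axis=-1,
    ).mean()
    # Standard cross-entropy with the ground-truth hard labels (T = 1)
    q_student = softmax(student_logits, 1.0)
    ce = -np.log(q_student[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * ce + (1 - alpha) * (T ** 2) * kl
```

When the student's logits exactly match the teacher's, the KL term vanishes and only the hard-label cross-entropy remains; intermediate-representation methods such as SRM extend this picture by supervising hidden feature maps rather than only the final prediction.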

