NEIGHBOURHOOD DISTILLATION: ON THE BENEFITS OF NON END-TO-END DISTILLATION

Anonymous

Abstract

End-to-end training with backpropagation is the standard method for training deep neural networks. However, as networks become deeper and bigger, end-to-end training becomes more challenging: highly non-convex models easily get stuck in local optima, gradient signals are prone to vanishing or exploding during backpropagation, and training demands substantial computational resources and time. In this work, we propose to break away from the end-to-end paradigm in the context of Knowledge Distillation. Instead of distilling a model end-to-end, we propose to split it into smaller sub-networks, also called neighbourhoods, that are then trained independently. We empirically show that distilling networks in a non end-to-end fashion can be beneficial in a diverse range of use cases. First, we show that it speeds up Knowledge Distillation by exploiting parallelism and training smaller networks. Second, we show that independently distilled neighbourhoods may be efficiently re-used for Neural Architecture Search. Finally, because smaller networks model simpler functions, we show that they are easier to train with synthetic data than their deeper counterparts.

1. INTRODUCTION

As Deep Neural Networks improve on challenging tasks, they also become deeper and bigger. Image classification convolutional neural networks grew from 5 layers in LeNet (LeCun et al., 1998) to more than 100 in the latest ResNet models (He et al., 2016). However, as models grow in size, training by backpropagating gradients through the entire network becomes more challenging and computationally expensive. Convergence in a highly non-convex space can be slow and has required the development of sophisticated optimizers to escape local optima (Kingma & Ba, 2014). Gradients vanish or explode as they pass through an increasing number of layers. Very deep neural networks trained end-to-end also require accelerators, and time, to train to completion.

Our work seeks to overcome the limitations of training very deep networks by breaking away from the end-to-end training paradigm. We address the procedure of distilling knowledge from a teacher model and propose to break a deep architecture into smaller components that are distilled independently. Working on small neighbourhoods rather than full models has multiple benefits: training a neighbourhood takes significantly less compute than training a larger model, and gradients in a neighbourhood back-propagate through only a small number of layers, making it unlikely that they will vanish or explode. By breaking a model into smaller neighbourhoods, training can be done in parallel, significantly reducing wall-clock training time and enabling training on CPUs, which are cheaper than custom accelerators but seldom used in Deep Learning because they are too slow for larger models. Supervision for the components is provided by a pre-trained teacher architecture, as is common in Knowledge Distillation (Hinton et al., 2015), a popular model compression technique that encourages a student architecture to reproduce the outputs of the teacher. For this reason, we call our method Neighbourhood Distillation. In this paper, we explore Neighbourhood Distillation in a number of different applications, demonstrate its benefits, and advocate for more research into non end-to-end training.
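The procedure described above can be sketched on a toy model. Everything below (the tiny dense teacher, the neighbourhood size of two layers, the single-layer students, the learning rate) is an illustrative assumption for exposition, not the paper's actual architectures or training setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy "teacher": 6 dense layers of width 8 with tanh activations.
teacher = [rng.normal(0, 0.5, (8, 8)) for _ in range(6)]

def forward(x, layers):
    """Return the input plus every intermediate activation."""
    acts = [x]
    for W in layers:
        x = np.tanh(x @ W)
        acts.append(x)
    return acts

# Record teacher activations once; they supervise every neighbourhood.
X = rng.normal(size=(256, 8))
acts = forward(X, teacher)

# Split the teacher into neighbourhoods of 2 layers each and distill each
# neighbourhood into a single tanh layer, independently of the others.
students = []
for start in range(0, 6, 2):
    inp, tgt = acts[start], acts[start + 2]   # neighbourhood input / output
    W = rng.normal(0, 0.1, (8, 8))
    for _ in range(500):
        # Plain gradient descent on the MSE between student and teacher
        # activations (constant factors folded into the learning rate).
        pred = np.tanh(inp @ W)
        W -= 0.5 * inp.T @ ((pred - tgt) * (1 - pred ** 2)) / len(inp)
    students.append(W)

# Stitch the independently trained students back into one shallower model.
student_out = forward(X, students)[-1]
mse = np.mean((student_out - acts[-1]) ** 2)
print(f"student-vs-teacher output MSE: {mse:.4f}")
```

Note that each loop iteration touches only one neighbourhood's activations, so in practice the three students could be trained in separate processes or on separate machines.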

Contributions

• We provide empirical evidence of the thresholding effect, a phenomenon that highlights deep neural networks' resilience to local perturbations of their weights. This observation motivates the idea of Neighbourhood Distillation.
• We show that Neighbourhood Distillation is up to 4x faster than Knowledge Distillation while producing models of the same quality. We demonstrate this on model compression and sparsification.
• We show that neighbourhoods trained independently can be used in a search algorithm that efficiently explores an exponential number of possibilities to find an optimal student architecture.
• Finally, we show applications of Neighbourhood Distillation to zero-data settings: shallow neighbourhoods model less complex functions, which we can distill using only Gaussian noise as training input.
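The last contribution can be illustrated with a toy experiment: a shallow block is distilled using nothing but Gaussian noise as input, then evaluated on a different, held-out input distribution. The two-layer teacher block, single-layer student, and distributions below are hypothetical stand-ins chosen for illustration, not the setups used in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical shallow teacher neighbourhood: two dense layers with tanh.
W1 = rng.normal(0, 0.5, (8, 16))
W2 = rng.normal(0, 0.5, (16, 8))
teacher = lambda x: np.tanh(np.tanh(x @ W1) @ W2)

# Distill into a single tanh layer using ONLY Gaussian noise inputs:
# no real training data is ever seen by the student.
W = rng.normal(0, 0.1, (8, 8))
for _ in range(3000):
    x = rng.normal(size=(128, 8))            # synthetic input batch
    pred, tgt = np.tanh(x @ W), teacher(x)
    # Gradient step on the MSE to the teacher's outputs.
    W -= 0.1 * x.T @ ((pred - tgt) * (1 - pred ** 2)) / len(x)

# Evaluate on a held-out, shifted distribution the student never trained on.
x_eval = rng.normal(0.3, 1.0, size=(512, 8))
mse = np.mean((np.tanh(x_eval @ W) - teacher(x_eval)) ** 2)
print(f"held-out MSE after noise-only distillation: {mse:.4f}")
```

Because the neighbourhood computes a relatively simple function, random inputs cover its behaviour well enough for the student to track it off the training distribution; a full deep network would be much harder to distill this way.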

2. RELATED WORK

Non end-to-end training  Before the democratization of deep learning, machine learning methods relied on multi-stage pipelines. For example, the face detection algorithm designed by Viola & Jones (2001) is a multi-stage pipeline relying first on handcrafted feature extraction and then on a classifier trained to detect faces from the features. Then came the idea of learning classification directly from the input image, leaving the model to learn all parts of the pipeline through a series of hidden

However, gradient-based end-to-end learning comes at a cost. Highly non-convex losses are harder to optimize; larger models also require more data to train fully; and they suffer from vanishing and exploding gradients (Hochreiter, 1998; Pascanu et al., 2012). Approaches to overcoming these issues fall into three categories. First, several methods have been introduced to ease the training of deep models, such as residual connections (He et al., 2016), gated recurrent units (Cho et al., 2014), normalization layers (Ioffe & Szegedy, 2015; Ba et al., 2016; Salimans & Kingma, 2016), and more powerful optimizers (Kingma & Ba, 2014; Hinton et al.; Duchi et al., 2011). Second, engineering best practices have adapted to the challenges raised by deep learning: models pre-trained on large datasets can be reused for transfer learning, requiring only the fine-tuning of a portion of the model for specific tasks (Devlin et al., 2018; Dahl et al., 2011). Distributed training (Krizhevsky et al., 2012; Dean et al., 2012) and custom hardware accelerators (Jouppi et al., 2017) were also crucial in accelerating training. The last category, which our work falls into, investigates non end-to-end training methods for deep neural networks. One class of non end-to-end methods splits a deep network into gradient-isolated modules trained with local objectives (Löwe et al., 2019; Nøkland & Eidnes, 2019). Layerwise training (Belilovsky et al., 2018; Huang et al., 2017) also divides the target network into modules, which are trained sequentially in a bottom-up manner. Difference Target Propagation (Lee et al., 2015) optimizes each layer to output activations close to a given target value; these targets are computed by propagating inverses from downstream layers, whereas ours are provided by a pre-trained teacher model. All of these approaches also differ from ours in that their modules still depend on each other, while our neighbourhoods are distilled independently.

Knowledge Distillation  Our work specifically draws from Knowledge Distillation (Hinton et al., 2015), a general-purpose model compression method that has been successfully applied to vision (Crowley et al., 2018) and language problems (Hahn & Choi, 2019). Knowledge Distillation transfers knowledge from a teacher in the form of its predicted soft logits. Several variants have been developed to improve distillation. One direction is to transfer additional knowledge in the form of intermediate activations (Romero et al., 2014; Aguilar et al., 2019; Zhang et al., 2017), attention maps (Zagoruyko & Komodakis, 2016), weight projections (Hyun Lee et al., 2018) or layer


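For reference, the soft-logit transfer that Knowledge Distillation builds on can be written as a small, self-contained loss. The temperature value and toy logits below are illustrative choices, not values taken from this paper:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax, numerically stabilized."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in Hinton et al. (2015)."""
    p = softmax(teacher_logits, T)          # soft targets from the teacher
    q = softmax(student_logits, T)
    return T ** 2 * np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1))

teacher_logits = np.array([[8.0, 2.0, -1.0]])
loss_far = kd_loss(np.array([[0.0, 0.0, 0.0]]), teacher_logits)
loss_near = kd_loss(np.array([[7.5, 2.1, -0.8]]), teacher_logits)
print(loss_far, loss_near)
```

In Neighbourhood Distillation the same teacher-as-supervisor idea applies, except that the matching signal is local to each neighbourhood rather than the full model's logits.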