NEIGHBOURHOOD DISTILLATION: ON THE BENEFITS OF NON END-TO-END DISTILLATION

Anonymous authors

Abstract

End-to-end training with backpropagation is the standard method for training deep neural networks. However, as networks become deeper and larger, end-to-end training becomes more challenging: highly non-convex models get stuck easily in local optima, gradient signals are prone to vanishing or exploding during backpropagation, and training requires substantial computational resources and time. In this work, we propose to break away from the end-to-end paradigm in the context of Knowledge Distillation. Instead of distilling a model end-to-end, we propose to split it into smaller sub-networks, also called neighbourhoods, that are then trained independently. We empirically show that distilling networks in a non end-to-end fashion can be beneficial in a diverse range of use cases. First, we show that it speeds up Knowledge Distillation by exploiting parallelism and training on smaller networks. Second, we show that independently distilled neighbourhoods may be efficiently re-used for Neural Architecture Search. Finally, because smaller networks model simpler functions, we show that they are easier to train with synthetic data than their deeper counterparts.

1. INTRODUCTION

As Deep Neural Networks improve on challenging tasks, they also become deeper and larger. Image classification convolutional neural networks grew from 5 layers in LeNet (LeCun et al., 1998) to more than 100 in the latest ResNet models (He et al., 2016). However, as models grow in size, training by backpropagating gradients through the entire network becomes more challenging and computationally expensive. Convergence in a highly non-convex space can be slow and has required the development of sophisticated optimizers to escape local optima (Kingma & Ba, 2014). Gradients vanish or explode as they are passed through an increasing number of layers. Very deep neural networks that are trained end-to-end also require hardware accelerators and considerable time to train to completion.

Our work seeks to overcome the limitations of training very deep networks by breaking away from the end-to-end training paradigm. We address the procedure of distilling knowledge from a teacher model and propose to break a deep architecture into smaller components which are distilled independently. There are multiple benefits to working on small neighbourhoods as compared to full models: training a neighbourhood takes significantly less compute than training a larger model, and during training, gradients in a neighbourhood only back-propagate through a small number of layers, making it unlikely that they will vanish or explode. By breaking a model into smaller neighbourhoods, training can be done in parallel, significantly reducing wall-time as well as enabling training on CPUs, which are cheaper than custom accelerators but seldom used in Deep Learning because they are too slow for larger models. Supervision to train the components is provided by a pre-trained teacher architecture, as is common in Knowledge Distillation (Hinton et al., 2015), a popular model compression technique that encourages a student architecture to reproduce the outputs of the teacher.
For this reason, we call our method Neighbourhood Distillation. In this paper, we explore the idea of Neighbourhood Distillation on a number of different applications, demonstrate its benefits, and advocate for more research into non end-to-end training.
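To make the idea concrete, the following is a minimal, self-contained sketch of the neighbourhood-style training loop described above. It is an illustrative toy, not the paper's implementation: the "teacher" is a stack of purely linear layers, each neighbourhood groups two consecutive teacher layers, and each student is a single linear layer fitted independently to the teacher's local input and output activations (here by least squares, standing in for gradient-based distillation). All names (`teacher`, `neighbourhoods`, `students`) and the layer sizes are hypothetical choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-trained "teacher": 6 linear layers, kept linear so each
# student neighbourhood admits a closed-form fit in this toy example.
teacher = [rng.standard_normal((8, 8)) * 0.5 for _ in range(6)]

def forward(layers, x):
    """Run activations through a list of linear layers."""
    for w in layers:
        x = x @ w
    return x

# Split the teacher into neighbourhoods of 2 consecutive layers each.
neighbourhoods = [teacher[i:i + 2] for i in range(0, len(teacher), 2)]

# Distill each neighbourhood independently: each student is supervised only
# by the teacher's input and output activations for its own neighbourhood,
# so the three fits could run in parallel on separate workers.
x = rng.standard_normal((256, 8))   # transfer-set inputs
students = []
h = x
for block in neighbourhoods:
    target = forward(block, h)      # teacher's local supervision signal
    w_student, *_ = np.linalg.lstsq(h, target, rcond=None)
    students.append(w_student)      # one student layer replaces two teacher layers
    h = target                      # next neighbourhood's input activations

# Composing the independently trained students recovers the teacher end-to-end.
x_test = rng.standard_normal((32, 8))
err = np.abs(forward(teacher, x_test) - forward(students, x_test)).max()
print(f"max abs error: {err:.2e}")
```

Because every student only needs its neighbourhood's input and output activations, no gradient ever crosses a neighbourhood boundary, which is the property that enables the parallelism and short backpropagation paths discussed above.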

