TOWARDS UNDERSTANDING ENSEMBLE, KNOWLEDGE DISTILLATION AND SELF-DISTILLATION IN DEEP LEARNING

Abstract

We formally study how ensembles of deep learning models can improve test accuracy, and how the superior performance of an ensemble can be distilled into a single model using knowledge distillation. We consider the challenging case where the ensemble is simply an average of the outputs of a few independently trained neural networks with the same architecture, trained using the same algorithm on the same data set, differing only in the random seeds used for initialization. We show that ensemble/knowledge distillation in deep learning works very differently from traditional learning theory (such as boosting or NTKs). We develop a theory showing that when the data has a structure we refer to as "multi-view", an ensemble of independently trained neural networks can provably improve test accuracy, and this superior test accuracy can also be provably distilled into a single model. Our result sheds light on how ensemble works in deep learning in a way that is completely different from traditional theorems, and on how the "dark knowledge" hidden in the outputs of the ensemble can be used in distillation.1

1. INTRODUCTION

Ensemble (Dietterich, 2000; Hansen & Salamon, 1990; Polikar, 2006) is one of the most powerful techniques in practice to improve the performance of deep learning. By simply averaging the outputs of merely a few (say, 3 or 10) independently trained neural networks of the same architecture, using the same training method over the same training data, one can significantly boost prediction accuracy over the test set compared to individual models. The only difference among the models is the randomness used to initialize them and/or the randomness during training. Moreover, Hinton et al. (2015) discovered that such superior performance of the ensemble can be transferred into a single model (of the same size as the individual models) using a technique called knowledge distillation: simply train a single model to match the output of the ensemble (such as "90% cat + 10% car", also known as soft labels), as opposed to the true data labels, over the same training data. On the theory side, there are many works studying the superior performance of ensembles from principled perspectives (see the full version for citations). However, most of these works only apply to: (1) boosting, where the coefficients associated with the combination of the single models are actually trained, instead of simply taking an average; (2) bootstrapping/bagging, where the training data differ for each single model; (3) ensembles of models of different types and architectures; or (4) ensembles of random features or decision trees. To the best of our knowledge, none of these cited works apply to the particular type of ensemble that is widely used in deep learning: simply taking a uniform average of the outputs of the learners, which are neural networks with the same architecture, trained by stochastic gradient descent (SGD) over the same training set.
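Concretely, the ensemble considered here is nothing more than a uniform average of the individual networks' output probabilities. The following is a minimal NumPy sketch of that averaging step; the `softmax` helper, function names, and toy logits are illustrative choices, not taken from the paper:

```python
import numpy as np

def softmax(z):
    """Convert a batch of logits (shape [n, classes]) to probabilities."""
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(logits_list):
    """Uniform average of each model's predicted class probabilities.

    `logits_list` holds the raw outputs of K independently trained
    models (same architecture, different random seeds) on the same batch.
    """
    probs = [softmax(z) for z in logits_list]
    return np.mean(probs, axis=0)
```

For example, two models that disagree symmetrically (logits `[2, 0]` and `[0, 2]`) average to a 50/50 prediction, which is exactly the kind of soft output ("90% cat + 10% car") that the distillation step later trains a single model to match.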
In fact, very critically, for deep learning models:
• TRAINING AVERAGE DOES NOT WORK: if one directly trains an average of individual neural networks initialized with different seeds, the performance is much worse than that of the ensemble.
• KNOWLEDGE DISTILLATION WORKS: the superior performance of an ensemble in deep learning can be distilled into a single model (Hinton et al., 2015).
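To make the distillation step concrete, here is a minimal sketch of a Hinton-style distillation objective: a temperature-scaled KL term that pulls the student toward the teacher's soft labels, mixed with the usual cross-entropy on the hard labels. The function names, temperature `T`, and mixing weight `alpha` are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, true_labels, T=2.0, alpha=0.9):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * hard-label CE.

    The teacher logits would come from the ensemble; the student is the
    single model being trained to match the ensemble's soft labels.
    """
    p_teacher = softmax(teacher_logits, T=T)
    log_p_student = np.log(softmax(student_logits, T=T))
    # KL term at temperature T; the T^2 factor keeps gradient scales comparable
    soft_term = np.sum(p_teacher * (np.log(p_teacher) - log_p_student), axis=-1).mean() * T * T
    # Standard cross-entropy against the true (hard) labels at T = 1
    log_p_hard = np.log(softmax(student_logits))
    hard_term = -log_p_hard[np.arange(len(true_labels)), true_labels].mean()
    return alpha * soft_term + (1 - alpha) * hard_term
```

Note the KL term vanishes exactly when the student's logits reproduce the teacher's, so with `alpha=1` a student that perfectly matches the ensemble incurs zero loss regardless of the hard labels; this is one way the "dark knowledge" in the soft labels can dominate training.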



1 Full version of this paper can be found at https://arxiv.org/abs/2012.09816.

