TOWARDS UNDERSTANDING ENSEMBLE, KNOWLEDGE DISTILLATION AND SELF-DISTILLATION IN DEEP LEARNING

Abstract

We formally study how an ensemble of deep learning models can improve test accuracy, and how the superior performance of the ensemble can be distilled into a single model using knowledge distillation. We consider the challenging case where the ensemble is simply an average of the outputs of a few independently trained neural networks with the same architecture, trained using the same algorithm on the same data set; the networks differ only in the random seeds used for initialization. We show that ensemble/knowledge distillation in deep learning works very differently from traditional learning theory (such as boosting or NTKs). We develop a theory showing that when the data has a structure we refer to as "multi-view", an ensemble of independently trained neural networks can provably improve test accuracy, and such superior test accuracy can also be provably distilled into a single model. Our result sheds light on how ensemble works in deep learning in a way that is completely different from traditional theorems, and on how the "dark knowledge" hidden in the outputs of the ensemble can be used in distillation. 1

1. INTRODUCTION

Ensemble (Dietterich, 2000; Hansen & Salamon, 1990; Polikar, 2006) is one of the most powerful techniques in practice to improve the performance of deep learning. By simply averaging the outputs of merely a few (say, 3 or 10) independently trained neural networks of the same architecture, trained using the same method over the same training data, one can significantly boost prediction accuracy on the test set compared to individual models. The only difference between the networks is the randomness used to initialize them and/or the randomness during training. Moreover, Hinton et al. (2015) discovered that such superior performance of the ensemble can be transferred into a single model (of the same size as the individual models) using a technique called knowledge distillation: simply train a single model to match the output of the ensemble (such as "90% cat + 10% car", also known as soft labels), as opposed to the true data labels, over the same training data. On the theory side, there are many works studying the superior performance of ensembles from principled perspectives (see the full version for citations). However, most of these works only apply to: (1) boosting, where the coefficients of the combination of the single models are actually trained instead of simply averaged; (2) bootstrapping/bagging, where the training data differ for each single model; (3) ensembles of models of different types and architectures; or (4) ensembles of random features or decision trees. To the best of our knowledge, none of these cited works applies to the particular type of ensemble that is widely used in deep learning: simply take a uniform average of the outputs of the learners, which are neural networks of the same architecture trained by stochastic gradient descent (SGD) over the same training set.
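The kind of ensemble studied here can be sketched in a few lines. The sketch below is illustrative only: the `logits_list` of per-model outputs stands in for the independently trained networks (the paper's learners are two-layer convolutional networks, not shown here), and `ensemble_predict` is a hypothetical helper name.

```python
import numpy as np

def ensemble_predict(logits_list):
    """Uniformly average the softmax outputs ("soft labels") of L
    independently trained models; the ensemble prediction is the
    argmax of the averaged distribution."""
    probs = []
    for z in logits_list:
        e = np.exp(z - z.max(axis=-1, keepdims=True))  # stable softmax
        probs.append(e / e.sum(axis=-1, keepdims=True))
    avg = np.mean(probs, axis=0)  # uniform (unweighted) average over models
    return avg, avg.argmax(axis=-1)

# Toy example: three "models" emitting logits for 2 samples, 3 classes.
rng = np.random.default_rng(0)
logits_list = [rng.normal(size=(2, 3)) for _ in range(3)]
soft_labels, preds = ensemble_predict(logits_list)
```

The averaged `soft_labels` are exactly the targets that knowledge distillation trains a single model to match.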
In fact, very critically, for deep learning models:
• TRAINING AVERAGE DOES NOT WORK: if one directly trains to learn an average of individual neural networks initialized by different seeds, the performance is much worse than that of the ensemble.
• KNOWLEDGE DISTILLATION WORKS: the superior performance of the ensemble in deep learning can be distilled into a single model (Hinton et al., 2015).
We are unaware of any satisfactory theoretical explanation for the phenomena above. For instance, as we shall argue, some traditional views of why ensemble works, such as "ensemble can enlarge the feature space in random feature mappings", even give contradictory explanations to the above phenomena, and thus cannot explain knowledge distillation or ensemble in deep learning. Motivated by this gap between theory and practice, we study the following questions for multi-class classification:
Our theoretical questions. How does ensemble improve test-time performance in deep learning when we simply (unweightedly) average over a few independently trained neural networks, especially when all the neural networks have the same architecture, are trained over the same data set using the same standard training algorithm, differ only in their random seeds, and even when all single models already have 100% training accuracy? How can such superior test-time performance of the ensemble later be "distilled" into a single neural network of the same architecture, simply by training the single model to match the output of the ensemble over the same training data set?
Our results. We prove, for certain multi-class classification tasks with a special structure we refer to as multi-view, with a training set Z consisting of N i.i.d.
samples from some unknown distribution D, and for a certain two-layer convolutional network f with (smoothed-)ReLU activation as the learner:
• (Single model has bad test accuracy): there is a value µ > 0 such that when a single model f is trained over Z using the cross-entropy loss, via gradient descent (GD) starting from random Gaussian initialization, the model can reach zero training error efficiently. However, w.h.p. the prediction (classification) error of f over D is between 0.49µ and 0.51µ.
• (Ensemble provably improves test accuracy): let f_1, f_2, ..., f_L be L = Ω(1) independently trained single models as above; then w.h.p. the ensemble G = (1/L) Σ_ℓ f_ℓ has prediction error ≤ 0.01µ over D.
• (Ensemble can be distilled into a single model): if we further train (using GD from random initialization) another single model f_0 (same architecture as each f_ℓ) to match the output of G = (1/L) Σ_ℓ f_ℓ merely over the same training data set Z, then f_0 can be trained efficiently and w.h.p. f_0 will have prediction error ≤ 0.01µ over D as well.
• (Self-distillation also improves test accuracy): if we further train (using GD from random initialization) another single model f′ (same architecture as f_1) to match the output of the single model f_1 merely over the same training data set Z, then f′ can be trained efficiently and w.h.p. has prediction error ≤ 0.26µ over D. The main idea is that self-distillation implicitly performs "ensemble + knowledge distillation", as we shall argue in Section 4.2.
We defer discussions of our empirical results to Section 5. However, we highlight some of the empirical findings here, as they confirm and justify our theoretical approach to studying ensemble and knowledge distillation in deep learning. Specifically, we give empirical evidence showing that:
• Knowledge distillation does not work for random feature mappings; ensemble in deep learning is very different from ensemble in random feature mappings (see Figure 1).
• Special structures in the data (such as the "multi-view" structure we shall introduce) are needed for ensembles of neural networks to work.
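The distillation step referenced above, training a student to match the teacher's soft labels rather than the true labels, amounts to minimizing a cross-entropy against the teacher's output distribution. Below is a minimal numpy sketch of that loss; the function name is hypothetical, and the temperature scaling used in practice (Hinton et al., 2015) is omitted for brevity.

```python
import numpy as np

def distillation_loss(student_logits, teacher_probs):
    """Cross-entropy between the teacher's soft labels and the student's
    predicted distribution, averaged over the batch. Minimized when the
    student's softmax output equals the teacher's distribution."""
    # Numerically stable log-softmax of the student logits.
    shifted = student_logits - student_logits.max(axis=-1, keepdims=True)
    log_p = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-(teacher_probs * log_p).sum(axis=-1).mean())

# Toy example with the soft label "90% cat + 10% car" from the text.
teacher = np.array([[0.9, 0.1]])
matched = distillation_loss(np.log(teacher), teacher)      # student == teacher
mismatched = distillation_loss(np.log([[0.5, 0.5]]), teacher)
```

When the student matches the teacher exactly, the loss bottoms out at the entropy of the teacher's distribution; any other student output gives a strictly larger value, which is what drives the student toward the ensemble's "dark knowledge".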



Full version of this paper can be found on https://arxiv.org/abs/2012.09816.



Figure 1: Ensemble in deep learning is very different from ensemble in random feature mappings. Details in Figure 6.

