TOWARDS UNDERSTANDING ENSEMBLE, KNOWLEDGE DISTILLATION AND SELF-DISTILLATION IN DEEP LEARNING

Abstract

We formally study how an ensemble of deep learning models can improve test accuracy, and how the superior performance of the ensemble can be distilled into a single model using knowledge distillation. We consider the challenging case where the ensemble is simply an average of the outputs of a few independently trained neural networks with the same architecture, trained using the same algorithm on the same data set, differing only in the random seeds used for initialization. We show that ensemble/knowledge distillation in deep learning works very differently from traditional learning theory (such as boosting or NTKs). We develop a theory showing that when the data has a structure we refer to as "multi-view", an ensemble of independently trained neural networks can provably improve test accuracy, and this superior test accuracy can also be provably distilled into a single model. Our result sheds light on how ensemble works in deep learning in ways completely different from traditional theorems, and on how the "dark knowledge" hidden in the outputs of the ensemble can be used in distillation.

1. INTRODUCTION

Ensemble (Dietterich, 2000; Hansen & Salamon, 1990; Polikar, 2006) is one of the most powerful techniques in practice to improve the performance of deep learning. By simply averaging the outputs of merely a few (like 3 or 10) independently trained neural networks of the same architecture, using the same training method over the same training data, one can significantly boost prediction accuracy on the test set compared to individual models. The only difference is the randomness used to initialize these networks and/or the randomness during training. Moreover, Hinton et al. (2015) discovered that such superior performance of the ensemble can be transferred into a single model (of the same size as the individual models) using a technique called knowledge distillation: simply train a single model to match the output of the ensemble (such as "90% cat + 10% car", also known as soft labels), as opposed to the true data labels, over the same training data. On the theory side, there is a large body of work studying the superior performance of ensemble from principled perspectives (see the full version for citations). However, most of these works only apply to: (1) boosting, where the coefficients associated with the combination of single models are actually trained, instead of simply averaged; (2) bootstrapping/bagging, where the training data are different for each single model; (3) ensembles of models of different types and architectures; or (4) ensembles of random features or decision trees. To the best of our knowledge, none of these works applies to the particular type of ensemble that is widely used in deep learning: simply taking a uniform average of the outputs of the learners, which are neural networks with the same architecture, trained by stochastic gradient descent (SGD) over the same training set.
In fact, very critically, for deep learning models:
• TRAINING AVERAGE DOES NOT WORK: if one directly trains an average of individual neural networks initialized with different seeds, the performance is much worse than that of the ensemble.
• KNOWLEDGE DISTILLATION WORKS: the superior performance of ensemble in deep learning can be distilled into a single model (Hinton et al., 2015).
• SELF-DISTILLATION WORKS: even distilling a single model into another of the same size gives a performance boost (Furlanello et al., 2018; Mobahi et al., 2020; Zhang et al., 2019).
We are unaware of any satisfactory theoretical explanation for the phenomena above. For instance, as we shall argue, some traditional views on why ensemble works, such as "ensemble enlarges the feature space in random feature mappings", even give contradictory explanations of the above phenomena, and thus cannot explain knowledge distillation or ensemble in deep learning. Motivated by this gap between theory and practice, we study the following questions for multi-class classification:
Our theoretical questions: How does ensemble improve test-time performance in deep learning when we simply (unweightedly) average a few independently trained neural networks, especially when all the networks have the same architecture, are trained over the same data set using the same standard training algorithm, differ only in their random seeds, and even when all single models already have 100% training accuracy? How can such superior test-time performance of the ensemble then be "distilled" into a single neural network of the same architecture, simply by training the single model to match the output of the ensemble over the same training data set?
Our results. We prove, for certain multi-class classification tasks with a special structure we refer to as multi-view, with a training set Z consisting of N i.i.d.
samples from some unknown distribution D, for a certain two-layer convolutional network f with (smoothed-)ReLU activation as the learner:
• (Single model has bad test accuracy): there is a value µ > 0 such that when a single model f is trained over Z using the cross-entropy loss, via gradient descent (GD) starting from random Gaussian initialization, the model can reach zero training error efficiently. However, w.h.p. the prediction (classification) error of f over D is between 0.49µ and 0.51µ.
• (Ensemble provably improves test accuracy): let f_1, f_2, ..., f_L be L = Ω(1) independently trained single models as above; then w.h.p. G = (1/L) Σ_ℓ f_ℓ has prediction error ≤ 0.01µ over D.
• (Ensemble can be distilled into a single model): if we further train (using GD from random initialization) another single model f_0 (same architecture as each f_ℓ) to match the output of G = (1/L) Σ_ℓ f_ℓ merely over the same training data set Z, then f_0 can be trained efficiently and w.h.p. will have prediction error ≤ 0.01µ over D as well.
• (Self-distillation also improves test accuracy): if we further train (using GD from random initialization) another single model f′ (same architecture as f_1) to match the output of the single model f_1 merely over the same training data set Z, then f′ can be trained efficiently and w.h.p. has prediction error ≤ 0.26µ over D. The main idea is that self-distillation implicitly performs "ensemble + knowledge distillation", as we shall argue in Section 4.2.
We defer discussion of our empirical results to Section 5, but highlight some findings, as they confirm and justify our theoretical approach to ensemble and knowledge distillation in deep learning. Specifically, we give empirical evidence showing that:
• Knowledge distillation does not work for random feature mappings, and ensemble in deep learning is very different from ensemble in random feature mappings (see Figure 1).
• Special structure in the data (such as the "multi-view" structure we shall introduce) is needed for ensembles of neural networks to work.
• The variance due to label noise or to the non-convex training landscape of the independently trained models may not be connected to the superior performance of ensemble in deep learning.

2. OUR METHODOLOGY AND INTUITION

2.1. A FAILED ATTEMPT USING RANDOM FEATURE MAPPINGS

Recent advances in deep learning theory show that under certain circumstances, neural networks can be treated as linear functions over random feature mappings; see (Allen-Zhu et al., 2019b; Arora et al., 2019b; Daniely et al., 2016; Du et al., 2018b; Jacot et al., 2018; Zou et al., 2018) and the references therein. In particular, the theory shows that when f is a neural network with inputs x ∈ R^d and weights W ∈ R^D, in some cases f(W, x) can be approximated by:
f(W, x) ≈ f(W_0, x) + ⟨W − W_0, ∇_W f(W_0, x)⟩
where W_0 is the random initialization of the neural network, and Φ_{W_0}(x) := ∇_W f(W_0, x) is the neural tangent kernel (NTK) feature mapping. This is known as the NTK approach. If this approximation holds, then training a neural network can be approximated by learning a linear function over the random features Φ_{W_0}(x), which is very theory-friendly.
Ensemble works for random features / NTK. Traditional theorems (Alhamdoosh & Wang, 2014; Brown et al., 2005a; Bryll et al., 2003; Tsymbal et al., 2005) suggest that the ensemble of independently trained random feature models can indeed significantly improve test-time performance, as it enlarges the feature space from Φ_{W_0}(x) to {Φ_{W_0^{(i)}}(x)}_{i∈[L]} for L independently sampled initializations W_0^{(i)}. This can be viewed as a feature selection process (Alvarez et al., 2012; Cai et al., 2018; Oliveira et al., 2003; Opitz, 1999; Rokach, 2010), and we have confirmed it for NTK in practice; see Figure 1. However, can we understand ensemble and knowledge distillation in DL as feature selection using NTK? Unfortunately, our empirical results provide several counterexamples to this view; see the discussion below and Figure 1. Contradiction 1: training average works even better.
Ensembling linear functions over NTK features with different random seeds, f_i(x) = ⟨W^{(i)}, Φ_{W_0^{(i)}}(x)⟩, does improve test accuracy; however, the improvement is mainly due to the use of a larger set of random features, whose combinations contain functions that generalize better. To see this, observe that even better performance (than the ensemble) can be obtained by directly training F(x) = (1/L)(f_1 + f_2 + ... + f_L) from random initialization. In contrast, recall that if the f_i(x) are multi-layer neural networks with different random seeds, then training their average barely gives any better performance than an individual network f_i, as all the f_i are capable of learning the same set of features.
Contradiction 2: knowledge distillation does not work. For NTK feature mappings, we observe that the result obtained by the ensemble cannot be distilled at all into an individual model, indicating that the features selected by the ensemble are not contained in the feature map Φ_{W_0^{(i)}}(x) of any individual model. In contrast, in actual deep learning, ensemble does not enlarge the feature space: an individual neural network is capable of learning the features of the ensemble model.
In sum, ensemble in deep learning may be very different from ensemble in random features. It may be more accurate to study ensemble/knowledge distillation in deep learning as a feature learning process, rather than a feature selection process. Still, we point out a fundamental difficulty:
Key challenge: If a single deep learning model is capable of, through knowledge distillation, learning the features of the ensemble model and achieving better test accuracy compared to training the single model directly (with the same training accuracy, typically at the global optimum of 100%), then why can the single model not learn these features directly when trained to match the true data labels? What is the dark knowledge hidden in the output of the ensemble (a.k.a. the soft label) compared to the original hard label?
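The NTK linearization above is easy to check numerically. The sketch below is a toy model of our own: the layer sizes, perturbation scale, and finite-difference step are illustrative assumptions, not from the paper. It builds Φ_{W_0}(x) = ∇_W f(W_0, x) by finite differences and verifies that f(W, x) ≈ f(W_0, x) + ⟨W − W_0, Φ_{W_0}(x)⟩ for a small perturbation of the weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(W, x):
    """Tiny two-layer MLP f(W, x); W packs both layers (8x8 hidden, 8 outputs -> scalar).
    Purely illustrative; the sizes are our own choice."""
    W1 = W[:64].reshape(8, 8)
    w2 = W[64:]
    return float(w2 @ np.tanh(W1 @ x))

def ntk_features(W0, x, eps=1e-5):
    """Phi_{W0}(x) = grad_W f(W0, x), estimated by central finite differences."""
    phi = np.zeros_like(W0)
    for i in range(W0.size):
        e = np.zeros_like(W0)
        e[i] = eps
        phi[i] = (mlp(W0 + e, x) - mlp(W0 - e, x)) / (2.0 * eps)
    return phi

W0 = 0.5 * rng.normal(size=72)
x = rng.normal(size=8)
dW = 3e-4 * rng.normal(size=72)      # a small weight perturbation W - W0

exact = mlp(W0 + dW, x)
linearized = mlp(W0, x) + ntk_features(W0, x) @ dW
print(abs(exact - linearized))        # second-order in ||dW||, hence tiny
```

When the perturbation is small, the gap between the exact and linearized outputs is second-order, which is exactly the regime in which training reduces to a linear problem over the fixed features Φ_{W_0}(x).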

2.2. ENSEMBLE IN DEEP LEARNING: A FEATURE LEARNING PROCESS

Before addressing the key challenge, we point out that prior works are very limited when it comes to studying neural network training as a feature learning process. Most existing works proving that neural networks can learn features focus only on the case where the input is Gaussian or Gaussian-like; see for instance (Kawaguchi, 2016; Soudry & Carmon, 2016; Xie et al., 2016) and many others. However, as we demonstrate in Figure 7 in the full version:
Ensemble in DL might not improve test accuracy when inputs are Gaussian-like: empirically, ensemble does not improve test accuracy in deep learning in certain scenarios where the input distribution is Gaussian or even a mixture of Gaussians. This holds across various learner network structures (fully-connected, residual, and convolutional neural networks) and various labeling functions (labels generated by linear functions, fully-connected, residual, or convolutional networks, with or without label noise, with or without a classification margin).
Bias-variance view of ensemble: some prior works attribute the benefit of ensemble to reducing the variance of individual solutions caused by label noise or by the non-convex landscape of the training objective. However, reducing such variance can reduce a convex test loss (typically cross-entropy), but not necessarily the test classification error. Concretely, the synthetic experiments in Figure 7 show that, after applying ensemble over Gaussian-like inputs, the variance of the model outputs is reduced but the test accuracy is not improved. We give much more empirical evidence that this variance (whether from label noise or from the non-convex landscape) is usually not the reason ensemble works in deep learning; see Section 5.
Hence, to understand the true benefit of ensemble in deep learning in theory, we would like to study a setting that approximates practical deep learning, where:
• The input distribution is more structured than standard Gaussian, and there is no label noise. (From the discussion above, ensemble cannot work for deep learning distribution-free.)
• The individual neural networks are all well-trained, in the sense that the final training accuracy is 100% and there is nearly no variance in the test accuracy of individual models. (So training never fails.)
In this work, we propose to study a setting of data that we refer to as multi-view, where both conditions hold when we train a two-layer neural network with (smoothed-)ReLU activation. We also argue that the multi-view structure we consider is fairly common in the data sets used in practice, in particular for vision tasks. We give more details below.

2.3. OUR APPROACH: LEARNING MULTI-VIEW DATA

Let us first give a thought experiment to illustrate our approach; we present the precise mathematical definition of the "multi-view" structure in Section 3. Consider a binary classification problem and four "features" v_1, v_2, v_3, v_4. The first two features correspond to the first class label, and the last two to the second class label. In the data distribution:
• When the label is class 1: both v_1, v_2 appear with weight 1 (w.p. 80%); only v_1 appears with weight 1 (w.p. 10%); or only v_2 appears with weight 1 (w.p. 10%). In each case, one of v_3, v_4 also appears with weight 0.1.
• When the label is class 2: symmetrically, both v_3, v_4 appear with weight 1 (w.p. 80%); only v_3 appears with weight 1 (w.p. 10%); or only v_4 appears with weight 1 (w.p. 10%). In each case, one of v_1, v_2 also appears with weight 0.1.
We call the 80% of the data multi-view data: these are the data where multiple features exist and can be used to classify them correctly. We call the remaining 20% single-view data: some features for the correct label are missing.
How individual neural networks learn. Under the multi-view data defined above, if we train a neural network using the cross-entropy loss via gradient descent (GD) from random initialization, we show that during training:
• The network will quickly pick up one of the features v ∈ {v_1, v_2} for the first label, and one of the features v′ ∈ {v_3, v_4} for the second label. So 90% of the training examples, consisting of all the multi-view data and half of the single-view data (those containing feature v or v′), are classified correctly.
Once classified correctly (with a large margin), these examples contribute negligibly to the gradient, by the nature of the cross-entropy loss.
• Next, the network will memorize (using, e.g., the noise in the data) the remaining 10% of the training examples without learning any new features, due to the insufficient number of left-over samples after the first phase, thus achieving 100% training accuracy but 90% test accuracy.
How ensemble improves test accuracy. It is simple to see why ensemble works. Depending on the randomness of initialization, each individual network picks up v_1 or v_2, each w.p. 50%. Hence, as long as we ensemble O(1) many independently trained models, w.h.p. the ensemble picks up both features {v_1, v_2} and both features {v_3, v_4}, so all the data will be classified correctly.
How knowledge distillation works. Perhaps less obvious is how knowledge distillation works. Since the ensemble learns all the features v_1, v_2, v_3, v_4, given a multi-view data point with label 1, the ensemble will output ∝ (2, 0.1), where the 2 comes from features v_1, v_2 and the 0.1 from one of v_3, v_4. On the other hand, an individual model that has learned only one of v_3, v_4 will output ∝ (2, 0) whenever the feature v_3 or v_4 present in the data does not match the one the model learned. Hence, by training the individual model to match the output of the ensemble, the individual model is forced to learn both features v_3, v_4, even though it has already perfectly classified the training data. This is the "dark knowledge" hidden in the output of the ensemble model. (This theoretical finding is consistent with practice: Figure 8 in the full paper suggests that models trained via knowledge distillation have learned most of the features, and further computing their ensemble does not give much performance boost.) Significance of our technique.
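The thought experiment above can be turned into a small simulation. The following sketch is a deliberate caricature of our own construction, not the paper's training dynamics: each "trained" model simply commits to one randomly chosen feature per class, mimicking the lottery-winning behavior described above, and ensembling averages the model outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_point():
    """One labeled example from the toy four-feature distribution of Section 2.3."""
    label = int(rng.integers(2))
    own = [0, 1] if label == 0 else [2, 3]       # features of the true class
    other = [2, 3] if label == 0 else [0, 1]     # features of the opposite class
    x = np.zeros(4)
    u = rng.random()
    if u < 0.8:
        views = own                               # multi-view: both features, weight 1
    elif u < 0.9:
        views = [own[0]]                          # single-view: only the first feature
    else:
        views = [own[1]]                          # single-view: only the second feature
    for v in views:
        x[v] = 1.0
    x[int(rng.choice(other))] = 0.1               # one opposite-class feature, weight 0.1
    return x, label

def single_model():
    """Caricature of GD training: each class picks up ONE of its two features,
    chosen by the randomness of 'initialization'; the rest is memorized (helping
    training accuracy but not test accuracy, so we omit it here)."""
    learned = (int(rng.choice([0, 1])), int(rng.choice([2, 3])))
    return lambda x: np.array([x[learned[0]], x[learned[1]]])

def accuracy(models, n=4000):
    correct = 0
    for _ in range(n):
        x, y = sample_point()
        avg = sum(m(x) for m in models) / len(models)   # ensemble = average of outputs
        correct += int(np.argmax(avg) == y)
    return correct / n

acc_single = accuracy([single_model()])
acc_ensemble = accuracy([single_model() for _ in range(10)])
print(acc_single, acc_ensemble)
```

In this caricature, a single model loses accuracy exactly on single-view data whose feature it did not pick up, while an ensemble of ten models covers all four features with high probability and is nearly perfect.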
Our work belongs to the generic framework of feature learning in DL, where one proves that certain aspects of the algorithm (e.g., the randomness) affect the order in which features are learned. This is fundamentally different from convex optimization, such as kernel methods, where (with ℓ_2 regularization) there is a unique global minimum, so the choice of random seed does not matter (and thus ensemble does not help).
Meaningfulness of our multi-view hypothesis. Such "multi-view" structure is very common in many of the data sets where deep learning excels. In vision data sets in particular, as illustrated in Figure 2, a car image can be classified as a car by looking at the headlights, the wheels, or the windows. For a typical placement of a car in an image, we can observe all of these features and use any one of them to classify it as a car. However, for car images taken from a particular angle, one or more features can be missing. For example, an image of a car facing forward might be missing the wheel feature. Moreover, some cars may also carry a small fraction of "cat features": for example, the headlights might appear similar to cat eyes. This can be used as "dark knowledge" by the single model when it learns from the ensemble. In Figure 3, we visualize the features learned by an actual neural network to show that they can indeed capture different views. In Figure 5, we plot heatmaps for some car images to illustrate that single models (trained from different random seeds) indeed pick up different parts of the input image to classify it as a car. In Figure 9, we manually delete, for instance, 7/8 of the channels in some intermediate layer of a ResNet, and show that the test accuracy may not be affected by much after ensemble; this supports that the multi-view structure can exist even in the intermediate layers of a neural network, and that ensemble is indeed collecting all of these views.
There are other works considering other aspects, such as the choice of learning rate, that can affect the order in which features are learned (Li et al., 2019). Our work is fundamentally different: those works focus only on the NTK setting, where features are not learned, while we study a feature learning process; and recall that the NTK setting cannot be used to explain ensemble and distillation in DL. Our work extends the reach of traditional machine learning theory, where "generalization" is typically treated separately from "optimization." Such separate treatment might not be enough to understand how deep learning works.

3. PROBLEM SETUP

The "multi-view" data distribution is a straightforward generalization of the intuitive setting in Section 2.3. For simplicity, in the main body we use example choices of the parameters, mainly as functions of k (such as P = k^2, γ = 1/k^1.5, µ = k^1.2/N, ρ = k^{-0.01}, σ_0 = 1/√k, as we shall see), and we consider the case when k is sufficiently large. In our full version, we give a much larger range of parameters for which the theorems hold.

3.1. DATA DISTRIBUTION AND NOTATIONS

We consider learning a k-class classification problem over P-patch inputs, where each patch has dimension d. In symbols, each labeled data point is represented by (X, y), where X = (x_1, x_2, ..., x_P) ∈ (R^d)^P is the data vector and y ∈ [k] is the data label. For simplicity, we focus on the case where P = k^2 and d = poly(k) for a large polynomial, and we consider the setting when k is sufficiently large. We use "w.h.p." to denote with probability at least 1 − e^{−Ω(log^2 k)}, and use the O, Θ, Ω notations to hide polylogarithmic factors in k.

We first assume that each label class j ∈ [k] has multiple associated features, say two for the simplicity of the math, represented by unit feature vectors v_{j,1}, v_{j,2} ∈ R^d. For notational simplicity, we assume all the features are orthogonal: for all j, j′ ∈ [k] and ℓ, ℓ′ ∈ [2], ∥v_{j,ℓ}∥_2 = 1 and v_{j,ℓ} ⊥ v_{j′,ℓ′} whenever (j, ℓ) ≠ (j′, ℓ′); our work also extends trivially to the "incoherent" case. We denote by V := {v_{j,1}, v_{j,2}}_{j∈[k]} the set of all features. We consider the following data and label distribution. Let C_p be a global constant and s ∈ [1, k^0.2] be a sparsity parameter. To be concise, we define the multi-view distribution D_m and the single-view distribution D_s together. Due to space limitations, we hide the specification of the random "noise" here and defer it to the full version.

Definition 3.1 (data distributions D_m and D_s). Given D ∈ {D_m, D_s}, we define (X, y) ∼ D as follows. First choose the label y ∈ [k] uniformly at random. Then the data vector X is generated as follows (also illustrated in Figure 4).
1. Denote by V(X) = {v_{y,1}, v_{y,2}} ∪ V′ the set of feature vectors used in this data vector X, where V′ is a set of features sampled uniformly from {v_{j′,1}, v_{j′,2}}_{j′∈[k]\{y}}, each included with probability s/k.
2. For each v ∈ V(X), pick C_p many disjoint patches in [P] and denote this set by P_v(X) ⊂ [P] (the distribution of these patches can be arbitrary). We denote P(X) = ∪_{v∈V(X)} P_v(X).
3. If D = D_s is the single-view distribution, pick a value ℓ = ℓ(X) ∈ [2] uniformly at random.
4. For each v ∈ V(X) and p ∈ P_v(X), set x_p = z_p v + "noise" ∈ R^d, where the random coefficients z_p ≥ 0 satisfy:
In the case of the multi-view distribution D = D_m,
• Σ_{p∈P_v(X)} z_p ∈ [1, O(1)] when v ∈ {v_{y,1}, v_{y,2}},
• Σ_{p∈P_v(X)} z_p ∈ [Ω(1), 0.4] when v ∈ V(X) \ {v_{y,1}, v_{y,2}}.
In the case of the single-view distribution D = D_s,
• Σ_{p∈P_v(X)} z_p ∈ [1, O(1)] when v = v_{y,ℓ},
• Σ_{p∈P_v(X)} z_p ∈ [ρ, O(ρ)] when v = v_{y,3−ℓ},
• Σ_{p∈P_v(X)} z_p ∈ [Ω(Γ), Γ] when v ∈ V(X) \ {v_{y,1}, v_{y,2}}.
5. For each p ∈ [P] \ P(X), set x_p to consist only of "noise".

Remark 3.2. The distribution of how to pick P(X) and how to distribute Σ_{p∈P_v(X)} z_p across the patches p ∈ P_v(X) can be arbitrary (and can depend on other randomness in the data as well). In particular, we allow different features v_{j,1}, v_{j,2} to show up with different weights in the data (for example, in multi-view data some view v_{y,1} may consistently have larger z_p than v_{y,2}). Yet, we shall prove that the order in which the learner network learns these features can still be flipped, depending on the randomness of the network initialization.

Interpretation of our data distribution. As we argue further in the full paper, our setting can be tied to a down-sized version of convolutional networks applied to image classification data. With a small kernel size, good features in an image typically appear only in a few patches, while most other patches are random noise or low-magnitude feature noise. More importantly, our noise parameters ensure that the concept class is not learnable by linear classifiers or constant-degree polynomials; we believe a (convolutional) neural network with ReLU-like activation is somewhat necessary. Our final data distribution D and the training data set Z are formally given as follows.

Definition 3.3 (D and Z). The distribution D consists of data from D_m w.p. 1 − µ and from D_s w.p. µ. We are given N training samples from D, and denote the training data set by Z = Z_m ∪ Z_s, where Z_m and Z_s respectively represent the multi-view and single-view training data. We write (X, y) ∼ Z for (X, y) sampled uniformly at random from the empirical data set, and denote N_s = |Z_s|. For simplicity, we again focus on the setting where µ = 1/poly(k) and N = k^1.2/µ, so each label i appears at least Ω(1) times in Z_s. Our result trivially extends to many other choices of N.
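Definition 3.1 can be made concrete with a small sampler. The following sketch is a simplified instantiation under our own illustrative choices: small k, P, d rather than P = k^2 and d = poly(k); C_p = 1 patch per feature; uniform marginals for the z_p coefficients as suggested in the footnotes; and purely Gaussian patch noise:

```python
import numpy as np

rng = np.random.default_rng(0)
k, P, d, s = 20, 30, 40, 4       # toy sizes; the theory takes P = k^2, d = poly(k)
rho, Gamma, sigma = 0.2, 0.4, 0.01
V = np.eye(d)[: 2 * k]           # orthonormal features: row 2j is v_{j,1}, row 2j+1 is v_{j,2}

def sample(multi_view=True):
    """One (X, y) from D_m (multi_view=True) or D_s, with C_p = 1 patch per feature."""
    y = int(rng.integers(k))
    X = rng.normal(0.0, sigma, size=(P, d))      # every patch starts as Gaussian noise
    feats = {2 * y, 2 * y + 1}                   # v_{y,1}, v_{y,2} always present
    for f in range(2 * k):                       # each off-class feature included w.p. s/k
        if f // 2 != y and rng.random() < s / k:
            feats.add(f)
    ell = int(rng.integers(2))                   # the view kept dominant in single-view data
    patches = rng.choice(P, size=len(feats), replace=False)   # disjoint patches
    for p, f in zip(patches, sorted(feats)):
        if f // 2 == y:                          # on-class feature weights
            if multi_view or f % 2 == ell:
                z = rng.uniform(1.0, 2.0)        # dominant weight, in [1, O(1)]
            else:
                z = rng.uniform(rho, 2 * rho)    # diminished view, in [rho, O(rho)]
        else:                                    # off-class "feature noise"
            z = rng.uniform(0.2, 0.4) if multi_view else rng.uniform(Gamma / 2, Gamma)
        X[p] += z * V[f]
    return X, y
```

A multi-view sample carries both of its class features with weight at least 1 in some patch, while a single-view sample keeps only the view ℓ(X) dominant, which is exactly the asymmetry the theorems exploit.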

3.2. LEARNER NETWORK

We consider a learner network using the following smoothed ReLU activation function ReLU:
Definition 3.4. For an integer q ≥ 2 and a threshold ϱ = 1/polylog(k), the smoothed activation is ReLU(z) := 0 for z ≤ 0; ReLU(z) := z^q / (q ϱ^{q−1}) for z ∈ [0, ϱ]; and ReLU(z) := z − (1 − 1/q)ϱ for z ≥ ϱ.
Since ReLU is smooth, we denote its gradient by ReLU′(z). We focus on q = 4, while our result applies to other constants q ≥ 3 (see the full version) and to most other forms of smoothing. The learner network F(X) = (F_1(X), ..., F_k(X)) ∈ R^k is a two-layer convolutional network parameterized by w_{i,r} ∈ R^d for i ∈ [k], r ∈ [m], satisfying
∀i ∈ [k]: F_i(X) = Σ_{r∈[m]} Σ_{p∈[P]} ReLU(⟨w_{i,r}, x_p⟩)
Although there exists a network with m = 2 that can classify the data correctly (e.g., w_{i,r} = v_{i,r} for r ∈ [2]), for efficient optimization purposes it is convenient in this paper to work with a moderate level of over-parameterization: m ∈ [polylog(k), k]. Our lower bounds hold for any m in this range, and our upper bounds hold even for small over-parameterization m = polylog(k).
Training a single model. We learn the concept class (namely, the labeled data distribution) using gradient descent with learning rate η > 0 over the cross-entropy loss function L, using the N training data points Z = {(X_i, y_i)}_{i∈[N]}. We denote the empirical loss by
L(F) = (1/N) Σ_{i∈[N]} L(F; X_i, y_i) = E_{(X,y)∼Z}[L(F; X, y)], where L(F; X, y) = −log ( e^{F_y(X)} / Σ_{j∈[k]} e^{F_j(X)} ).
We randomly initialize the network F by letting each w^{(0)}_{i,r} ∼ N(0, σ_0^2 I) for σ_0^2 = 1/k, which is the most standard initialization used in practice. To train a single model, at each iteration t we update using gradient descent (GD):
w^{(t+1)}_{i,r} ← w^{(t)}_{i,r} − η E_{(X,y)∼Z} ∇_{w_{i,r}} L(F^{(t)}; X, y)   (3.1)
We run the algorithm for T = poly(k)/η iterations, and use F^{(t)} to denote the model F with hidden weights {w^{(t)}_{i,r}} at iteration t.
Notations. We denote by logit_i(F, X) := e^{F_i(X)} / Σ_{j∈[k]} e^{F_j(X)}. Using this, we can write, for all i ∈ [k], r ∈ [m]:
−∇_{w_{i,r}} L(F; X, y) = (1_{i=y} − logit_i(F, X)) ∇_{w_{i,r}} F_i(X).
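The activation of Definition 3.4 and the learner F are straightforward to implement. The sketch below (the tensor shapes and the concrete value of ϱ are our own illustrative choices) also documents the properties that make the activation "smoothed": at z = ϱ both pieces take the value ϱ/q and both have slope 1, so the function is continuously differentiable:

```python
import numpy as np

q, rho_ = 4, 0.1     # q = 4 as in the paper; varrho = 0.1 is an illustrative stand-in

def smoothed_relu(z):
    """Definition 3.4: 0 for z <= 0; z^q / (q * rho^(q-1)) on [0, rho];
    z - (1 - 1/q) * rho for z >= rho. Both pieces meet at z = rho with
    value rho/q and slope 1, so the activation is C^1."""
    z = np.asarray(z, dtype=float)
    return np.where(z <= 0.0, 0.0,
                    np.where(z <= rho_, z**q / (q * rho_**(q - 1)),
                             z - (1.0 - 1.0 / q) * rho_))

def F(W, X):
    """Two-layer convolutional learner: F_i(X) = sum_{r in [m], p in [P]}
    smoothed_relu(<w_{i,r}, x_p>). W has shape (k, m, d); X has shape (P, d)."""
    pre = np.einsum('imd,pd->imp', W, X)         # <w_{i,r}, x_p> for all i, r, p
    return smoothed_relu(pre).sum(axis=(1, 2))   # k class scores
```

The polynomial region [0, ϱ] suppresses small (noise-level) correlations by a power q, while behaving like a shifted linear function on large (feature-level) correlations, which is what makes this smoothing analytically convenient.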

4. MAIN THEOREMS AND EXPLANATIONS

We now state the main theorems (the one for self-distillation is in the full paper).
Theorem 1 (single model). For every sufficiently large k > 0, every m ∈ [polylog(k), k], and every η ≤ 1/poly(k), suppose we train a single model using the gradient descent update (3.1) starting from the random initialization defined in Section 3.2. Then after T = poly(k)/η iterations, with probability ≥ 1 − e^{−Ω(log^2 k)}, the model F^{(T)} satisfies:
• (training is perfect): for all (X, y) ∈ Z and all i ∈ [k] \ {y}: F^{(T)}_y(X) > F^{(T)}_i(X).
• (test accuracy is consistently bad): Pr_{(X,y)∼D}[∃i ∈ [k] \ {y}: F^{(T)}_y(X) < F^{(T)}_i(X)] ∈ [0.49µ, 0.51µ].
We give technical intuitions for why Theorem 1 holds in the full version. At a high level, we construct a "lottery winning" set M ⊆ [k] × [2] of cardinality |M| ∈ [k(1 − o(1)), k], which depends only on the random initialization of F. With some effort we can then prove that, for every (i, ℓ) ∈ M, at the end of training F^{(T)} will have learned feature v_{i,ℓ} but not feature v_{i,3−ℓ}. This means that for single-view data (X, y) with y = i and ℓ(X) = 3 − ℓ, the final network F^{(T)} will predict the label incorrectly; this is why the final test accuracy is around 0.5µ. The property that the test accuracy consistently falls in the range [0.49µ, 0.51µ] should be reminiscent of message ⑤ in Figure 6: in practice, multiple single models starting from different random initializations do have relatively small variance in test accuracy.
Ensemble. Suppose {F^{[ℓ]}}_{ℓ∈[K]} are K = Ω(1) independently trained models F with m = polylog(k), trained for T = poly(k)/η iterations (i.e., the same setting as Theorem 1, except only small over-parameterization m = polylog(k) is needed). Define their ensemble as
G(X) = (Θ(1)/K) Σ_ℓ F^{[ℓ]}(X)   (4.1)
Theorem 2 (ensemble).
In the same setting as Theorem 1, except that only a small m = polylog(k) is needed, the ensemble model G in (4.1) satisfies, with probability at least 1 − e^{−Ω(log^2 k)}:
• (training is perfect): for all (X, y) ∈ Z and all i ∈ [k] \ {y}: G_y(X) > G_i(X).
• (test accuracy is almost perfect): Pr_{(X,y)∼D}[∃i ∈ [k] \ {y}: G_y(X) < G_i(X)] ≤ 0.001µ.
As discussed in Section 2.3, Theorem 2 holds because the lottery-winning sets M depend on the random initialization of the networks; therefore, when multiple models are put together, the union of their sets M covers all possible features {v_{i,ℓ}}_{(i,ℓ)∈[k]×[2]}. Moreover, our theorem only requires K = Ω(1) individual models for the ensemble, which is indeed "averaging the output of a few independently trained models".

4.1. KNOWLEDGE DISTILLATION FOR ENSEMBLE

We consider the following knowledge distillation algorithm, given the existing ensemble model G (see (4.1)). For every label i ∈ [k], define the truncated, scaled logit as (for τ = 1/log^2 k):
logit^τ_i(F, X) = e^{min{τ^2 F_i(X), 1}/τ} / Σ_{j∈[k]} e^{min{τ^2 F_j(X), 1}/τ}   (4.2)
(This should be reminiscent of the logit function with temperature used by the original knowledge distillation work (Hinton et al., 2015); we use truncation instead, which is easier to analyze.) Now, we train a new network F from random initialization (where the randomness is independent of that used in the models F^{[ℓ]}). At every iteration t, we update each weight w_{i,r} by:
w^{(t+1)}_{i,r} = w^{(t)}_{i,r} − η ∇_{w_{i,r}} L(F^{(t)}) − η′ E_{(X,y)∼Z} [ (logit^τ_i(F^{(t)}, X) − logit^τ_i(G, X)) ∇_{w_{i,r}} F^{(t)}_i(X) ]   (4.3)
This knowledge distillation method (4.3) is almost identical to the one used in the original work (Hinton et al., 2015), except that we truncate the logits during training to make it more (theoretically) stable. Moreover, we update the distillation objective with a larger learning rate η′ compared to the η used for the cross-entropy objective; this is also consistent with the training schedule used in (Hinton et al., 2015). Let F^{(t)} be the network obtained by (4.3) at iteration t. We have the following theorem:
Theorem 3 (ensemble distillation). Consider the distillation algorithm (4.3), in which G is the ensemble model defined in (4.1). For every sufficiently large k > 0, for m = polylog(k), for every η ≤ 1/poly(k), setting η′ = η · poly(k), after T = poly(k)/η iterations, with probability at least 1 − e^{−Ω(log^2 k)}, for at least 90% of the iterations t ≤ T:
• (training is perfect): for all (X, y) ∈ Z and all i ∈ [k] \ {y}: F^{(t)}_y(X) > F^{(t)}_i(X).
• (test accuracy is almost perfect): Pr_{(X,y)∼D}[∃i ∈ [k] \ {y}: F^{(t)}_y(X) < F^{(t)}_i(X)] ≤ 0.001µ.
Remark. Theorem 3 necessarily means that the distilled model F has learned all the features {v_{i,ℓ}}_{(i,ℓ)∈[k]×[2]} from the ensemble model G.
This is consistent with our empirical findings in Figure 8: if one trains multiple individual models via knowledge distillation with different random seeds, their ensemble gives no further performance boost.
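The truncated logit in (4.2) is simple to implement, and its effect is easy to see: once τ^2 F_i(X) exceeds 1, the exact magnitude of F_i(X) no longer matters. A minimal sketch (the concrete value of τ is an illustrative choice of ours, not the paper's 1/log^2 k):

```python
import numpy as np

tau = 0.25   # illustrative; the paper takes tau = 1/log^2 k

def truncated_logit(scores):
    """Eq. (4.2): softmax of min(tau^2 * F_i(X), 1) / tau.
    Any score with tau^2 * F_i(X) >= 1 is capped, so raising it further
    changes nothing; this replaces the temperature of Hinton et al. (2015)."""
    capped = np.minimum(tau**2 * np.asarray(scores, dtype=float), 1.0) / tau
    e = np.exp(capped - capped.max())            # max-subtraction for numerical stability
    return e / e.sum()

p = truncated_logit([50.0, 3.0, -1.0])
p_capped = truncated_logit([1000.0, 3.0, -1.0])  # dominant score capped: identical output
```

Like a high softmax temperature, the truncation prevents a single confidently-learned feature from drowning out the small secondary probabilities, which are exactly the "dark knowledge" the distillation objective in (4.3) transfers.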



Footnotes.
• The full version of this paper can be found at https://arxiv.org/abs/2012.09816.
• For a k-class classification problem, the output of a model g(x) is usually k-dimensional and represents a softmax probability distribution over the k target classes; this is known as the soft label.
• One can, for simplicity, think of "v appears with weight α and w appears with weight β" as data = αv + βw + noise.
• If we want to work with fixed k, say k = 2, our theorem can also be modified to that setting by increasing the number of features per class. We keep the current setting with two features per class to simplify notation.
• At a high level, we allow the "noise" to be any feature noise plus Gaussian noise, such as noise = Σ_{v′∈V} α_{p,v′} v′ + ξ_p ∈ R^d, where each α_{p,v′} ∈ [0, γ] can be arbitrary and ξ_p ∼ N(0, σ_p^2 I).
• For instance, the marginal distribution of Z = Σ_{p∈P_v(X)} z_p can be uniform over [1, 2].
• For instance, the marginal distribution of Z = Σ_{p∈P_v(X)} z_p can be uniform over [0.2, 0.4].
• Our result also extends to the case where there is weight decay; see the full version.
• We shall restate these theorems in the full version, with more details and a wider range of parameters.



Figure 1: Ensemble in deep learning is very different from ensemble in random feature mappings. Details in Figure 6.

Figure 2: Illustration of images with multiple views (features) in the ImageNet dataset.

Figure 3: Visualization of the channels in layer-23 of a ResNet-34 trained on CIFAR-10.

Figure 4: Illustration of a multi-view and a single-view data point; the feature vectors can also be combined with feature noise and random noise, see Def. 3.1.

Throughout the paper, we denote [a]^+ = max{0, a} and [a]^− = min{0, a}.


