ENRICHING ONLINE KNOWLEDGE DISTILLATION WITH SPECIALIST ENSEMBLE

Abstract

Online Knowledge Distillation (KD) has an advantage over traditional KD works in that it removes the necessity for a pre-trained teacher. Indeed, an ensemble of small teachers has become typical guidance for a student's learning trajectory. Previous works emphasized diversity to create helpful ensemble knowledge and further argued that the size of diversity should be significant to prevent homogenization. This paper proposes a well-founded online KD framework with naturally derived specialists. In supervised learning, the parameters of a classifier are optimized by stochastic gradient descent based on a training dataset distribution. If the training dataset is shifted, the optimal point and corresponding parameters change accordingly, which is natural and explicit. We first introduce a label prior shift to induce evident diversity among the same teachers, which assigns a skewed label distribution to each teacher and simultaneously specializes them through importance sampling. Compared to previous works, our specialization achieves the highest level of diversity and maintains it throughout training. Second, we propose a new aggregation that uses post-compensation in specialist outputs and conventional model averaging. The aggregation empirically exhibits the advantage of ensemble calibration even if applied to previous diversity-eliciting methods. Finally, through extensive experiments, we demonstrate the efficacy of our framework on top-1 error rate, negative log-likelihood, and notably expected calibration error.

1. INTRODUCTION

Knowledge Distillation (KD) has achieved remarkable success in the model compression literature (Heo et al., 2019; Park et al., 2019; Tung & Mori, 2019). KD traditionally employs a two-stage learning paradigm: training a large static model as a "teacher" and then training a compact "student" model with the teacher's guidance. Online KD (He et al., 2016; Song & Chai, 2018; Lan et al., 2018) emerged as a variant of KD that simplifies the conventional two-stage pipeline by training all teachers and a student simultaneously. Previous works used a limited number of small teachers and treated them as auxiliary peers that help a student learn. In particular, ensembling these teachers has become a typical way to produce knowledge guidance for the student.

A core question in online KD is how to make teachers diverse for the ensemble. Breiman (1996) argues that traditional Bagging-style ensembles usually benefit from diverse and dissimilar models. Recent online KD studies (Chen et al., 2020; Li et al., 2020; Wu & Gong, 2021) support this claim and emphasize the importance of large diversity to prevent homogenization. In supervised learning, the parameters of a classifier are optimized by stochastic gradient descent based on a training data distribution. If the training dataset is shifted, the optimal point and corresponding parameters change accordingly, which is natural and explicit. That is, diversifying the training data distribution is an effective way to generate diverse classifiers that rely on different features. In this paper, we use label prior shift, where each teacher is assigned a unique, non-uniform label distribution. This approach partially aligns with the specialization process in the Mixture of Experts (MoE) literature, in which multiple experts with different problem spaces learn only the local landscape (Baldacchino et al., 2016).

The most straightforward and prevalent approach to dealing with label imbalance is to operate on the shifted dataset itself (Japkowicz & Stephen, 2002; Chawla, 2009; Buda et al., 2018). However, in online KD this design is inconvenient: because a typical framework shares layers in a multi-head architecture, it could require sampling as many times as there are teachers. As an alternative, we adjust the cross-entropy loss of each teacher rather than sampling repeatedly. Specifically, we efficiently estimate the loss functions using importance sampling with samples drawn from the usual uniform label distribution, instead of sampling multiple times from the truly shifted distributions. Compared to prior works, our specialization exhibits the highest level of diversity and maintains it throughout training.

Furthermore, we propose a new ensemble strategy for aggregating specialist teacher outputs. From the perspective of Bayesian inference, the conditional distributions of specialists become distorted when a classifier learns from a label-imbalanced training dataset. Therefore, we need to correct this distortion before the aggregation. We first use PC-Softmax (Hong et al., 2021) to post-compensate the Softmax outputs. Post-compensation adapts the shifted label priors to the true label prior by manually adjusting teacher logits. It relaxes the disparity in negative log-likelihoods (Ren et al., 2020) for the same label. As a result, PC-Softmax matches the uniform label distribution by modifying the prediction of each teacher trained with its own cross-entropy loss. Second, we apply a standard model averaging method (Li et al., 2021) to all the PC-Softmax outputs. We empirically show that our aggregation policy, denoted "specialist ensemble," improves ensemble calibration even when applied to previous diversity-eliciting methods.

Our main contributions are summarized as follows: (1) The proposed online knowledge distillation diversifies teachers into specialists through the label prior shift and importance sampling. As a result, our diversity is at the highest level among previous works and is maintained throughout training. (2) Our specialist ensemble, based on PC-Softmax and averaging the resulting probabilities, is beneficial for ensemble calibration. Moreover, this advantage holds even when applied to previous diversity-eliciting methods. (3) Through extensive experiments, we show that a student distilled by our specialist ensemble outperforms previous works in top-1 error rate, negative log-likelihood, and notably expected calibration error.

2. RELATED WORK

Label prior shift. The label prior shift has been extensively discussed due to various degrees of imbalance between the training (source) label prior $p_s(y)$ and the test (target) label prior $p_t(y)$. Especially in the works most closely related to ours, a Post-Compensating (PC) strategy is typically chosen as the adjustment to estimate a new conditional probability $p(y|x)$, approximated under $p_s(y)$, for a given $p_t(y)$. When estimating a Softmax regression, Ren et al. (2020) correct the model outputs by the size of each class, assuming a uniform target distribution during training time. Many strategies for matching the two priors at test time have been investigated; from a Bayesian perspective, they rebalance the output probability by multiplying it by some form of $p_t(y)/p_s(y)$ (Buda et al., 2018; Hong et al., 2021; Margineantu, 2000; Tian et al., 2020). Here, Hong et al. (2021) carefully reconstruct each conditional probability so that it satisfies the condition $\sum_c p_t(y = c|x) = 1$. This is known as PC-Softmax. We use the PC-Softmax of each teacher network to adapt entirely different label priors to the student label prior.

Ensemble learning. Promoting diversity in traditional ensemble learning has been emphasized because the number of models, which acts as a factor of ensemble impact, becomes more crucial as the models become increasingly uncorrelated (Breiman, 1996; Ghojogh & Crowley, 2019). Lakshminarayanan et al. (2017) used only different random initialization and weighted averaging on the same models. The models, as a result, can have similar error rates but converge to different local minima (Wen et al., 2020). Bringing in multiple models, however, requires prohibitively large computational resources, which frequently limits the ensemble's applicability. Thus, recent studies have found efficiencies in two approaches: sampling multiple learning trajectories with only a single model (Huang et al., 2017a; Laine & Aila, 2017; Tarvainen & Valpola, 2017) and building a new, architecturally efficient structure (Wen et al., 2020; Li et al., 2021). Our ensemble can be aligned with the latter, since it models shared parameters and heads that are purposely diversified.

Online knowledge distillation. Online knowledge distillation works fall into two categories: network-based and peer-based. Network-based methods (Guo et al., 2020; Zhang et al., 2018) use separate networks with identical architecture, employing a mutual learning paradigm; every network interacts with and provides knowledge guidance to its cohorts. Sometimes, these allow independent data pre-processing per network. However, in peer-based methods (Song & Chai, 2018; Lan et al., 2018; Wu & Gong, 2021; Li et al., 2020), some parameters are shared among peer heads, which concurrently benefit from computational efficiency and generalization during training. Earlier works exploit surrogate modules to enlarge varying features or introduce each head unfairly for peer diversity. However, following trained parameters, the modules may ignore some peers as useless, which prevents achieving the preferred diversity (Mullapudi et al., 2018). The methods of Chen et al. (2020) and Kim et al. (2021) can be applied to both network- and peer-based approaches, but they require extra modules as well.

Our method is peer-based and shares two similarities with some previous works: first, designating a "student" and "teachers" in advance of training to provide one-way guidance (Chen et al., 2020; Li et al., 2020), and second, creating a dedicated dataset to train each peer (Feng et al., 2021). However, our specialty dataset differs from an arbitrarily sampled subset (Feng et al., 2021) in that it is purposely class-skewed.

Figure 1: Overview of our online knowledge distillation framework with four teachers. Each teacher is assigned a different label prior by the Class Reweighting (CR) function as described in Section 3.3. Each teacher loss and the student loss are defined in Section 3.4 and Section 3.6, respectively. For knowledge distillation, a specialist ensemble is obtained as described in Section 3.5.

3. METHODOLOGY

As shown in Figure 1, our model consists of three parts: a shared part, multiple teacher heads, and the student head. That is, we parameterize it as $\Theta = \{\theta_\phi\} \cup \{\theta_t \mid t \in \mathbb{N}, t \le T\} \cup \{\theta_s\}$, where $T$ denotes the number of teachers. $\{\theta_\phi\}$ represents the shared parameters; $\{\theta_t\}$ and $\{\theta_s\}$ represent the teacher and student head parameters, respectively. For simplicity, we will use the same notation for each teacher and student model including the shared parameters, i.e., $\{\theta_t\} = \{\theta_\phi\} \cup \{\theta_t\}$ and $\{\theta_s\} = \{\theta_\phi\} \cup \{\theta_s\}$. Note that each $\{\theta_t\}$ and $\{\theta_s\}$ have the same dimension.

3.1. LABEL PRIOR SHIFT

Prior works (Buda et al., 2018; Hong et al., 2021; Margineantu, 2000; Tian et al., 2020) addressed the discrepancy between train and test class distributions that arises from the inherent difficulty of obtaining samples from certain classes, and dealt with such class imbalance by post-scaling prior distributions. In our work, we adopt a similar method to manually shift the label prior distribution of teachers so that the student learns from diverse teachers. We want the model output $p(y|x; \theta)$ to approximate the true posterior distribution $p(y|x)$, which is the conditional distribution of the labels $y$ given input samples $x$. From the perspective of Bayesian inference, the true posterior distribution is defined as follows:

$$p(y|x) = \frac{p(x|y)\,p(y)}{p(x)},$$

where $p(y)$ represents the label prior distribution. We assume $p(y)$ is a discrete uniform distribution, $y \sim \mathcal{U}(1/K)$, where $K$ is the number of classes, since the datasets we use contain the same number of training samples for each class. In our setting, label prior shift refers to the difference in label distribution between the teachers and the student, i.e., $p_t(y) \neq p_s(y)$. While $p_s(y) = p(y)$, we purposely create class-imbalanced settings by shifting the label distributions of the teachers. Also, $p_t(y)$ differs across teachers, i.e., $p_t(y) \neq p_{t'}(y)$, motivating each teacher to be a diverse discriminative classifier. We will discuss how to manipulate teacher label distributions in Section 3.3.

3.2. IMPORTANCE SAMPLING

Under our class-imbalance setting, naive Monte-Carlo sampling is unlikely to effectively approximate the target distribution. Therefore, we exploit importance sampling, which allows us to effectively approximate the target distribution using only samples generated from a distribution we have. We will denote the shifted label prior $p_t(y)$ as $q(y)$ only in this section to avoid confusion with $p(y)$.

Theorem 1. Let $q(x, y)$ and $p(x, y)$ be joint probability distributions, and $h(x, y)$ be a differentiable function with respect to $x$ and $y$. Assuming $q(x|y) = p(x|y)$, we can estimate $\mu = \mathbb{E}_{(x,y)\sim q(x,y)}[h(x, y)]$ as follows:

$$\mu = \mathbb{E}_{(x,y)\sim p(x,y)}\left[\frac{q(y)\,h(x, y)}{p(y)}\right] \approx \frac{1}{N}\sum_{i=1}^{N}\frac{q(y_i)\,h(x_i, y_i)}{p(y_i)}, \quad (x_i, y_i) \overset{\text{i.i.d.}}{\sim} p(x, y).$$

During training, a sample $(x_i, y_i)$ is drawn from the joint distribution $p(x, y) = p(x|y)p(y)$. However, what we actually aim for is to sample from the joint distribution with shifted prior $q(y)$, $q(x, y) = q(x|y)q(y)$. Theorem 1 shows that we can effectively estimate the expectation of the function $h(x, y)$ where $(x, y) \sim q(x, y)$ with samples drawn from $p(x, y)$. Often the target label distribution $q(y)$ is intractable and we only know the unnormalized $\tilde{q}(y) = Zq(y)$ with an unknown normalization constant $Z$. We make use of an unnormalized distribution, $\tilde{q}(y)$, by our design choice.

Corollary 1.1. Let $\tilde{q}(x, y) = Zq(x, y)$ be an unnormalized distribution with unknown constant $Z > 0$. Then, the unnormalized importance sampling estimator of $\mu$ is as follows:

$$\mu \approx \frac{\sum_{i=1}^{N}\kappa(y_i)\,h(x_i, y_i)}{\sum_{i=1}^{N}\kappa(y_i)}, \quad (x_i, y_i) \overset{\text{i.i.d.}}{\sim} p(x, y), \tag{3}$$

where $\kappa(y) = \tilde{q}(y)/p(y)$ is the unnormalized reweighting function of $y$. As shown in Corollary 1.1, we can obtain the estimator of the data loss $h(x, y)$ under label prior shift once $\kappa(y)$ is given. In the following section, we will discuss how to formulate the function $\kappa(y)$.
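The estimator of Corollary 1.1 can be checked numerically. The following is a minimal sketch on toy data, not the training code of this paper: the two-class setup, the weights in `toy_kappa`, and the helper names are all assumptions made for illustration. Samples are drawn under a uniform label prior and reweighted so that the estimate targets the expectation under the shifted prior.

```python
import random


def self_normalized_estimate(samples, kappa, h):
    """Estimate E_q[h(x, y)] from (x, y) pairs drawn i.i.d. from p(x, y).

    kappa(y) is the unnormalized weight q_tilde(y) / p(y).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y in samples:
        w = kappa(y)
        weighted_sum += w * h(x, y)
        weight_total += w
    return weighted_sum / weight_total


def toy_kappa(y):
    # Hypothetical weights: class 0 keeps weight 1, class 1 is down-weighted.
    return 1.0 if y == 0 else 0.25


def identity_on_x(x, y):
    return x


# Toy data (an assumption for illustration): uniform p(y) over two classes and
# x = y + Gaussian noise, so p(x|y) = q(x|y) holds by construction.
rng = random.Random(0)
samples = []
for _ in range(20000):
    y = rng.randrange(2)
    samples.append((y + rng.gauss(0.0, 0.1), y))

# Normalizing the weights gives q(y=0) = 0.8 and q(y=1) = 0.2, so E_q[x] = 0.2,
# whereas the plain sample mean of x targets E_p[x] = 0.5.
estimate = self_normalized_estimate(samples, toy_kappa, identity_on_x)
plain_mean = sum(x for x, _ in samples) / len(samples)
```

Because the weights appear in both the numerator and the denominator, rescaling `toy_kappa` by any positive constant leaves the estimate unchanged, which is why the normalization constant $Z$ never has to be known.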

3.3. CLASS REWEIGHTING FUNCTION

Our goal is to specialize each teacher in a specific subset of labels. To achieve this, each teacher is assigned "specialty" labels. We denote this "specialty" label set for the $t$-th teacher as $Y_t$. The sets $Y_t$ can overlap, but all labels are assigned at least once and the same number of times, i.e., $Y = Y_1 \cup \cdots \cup Y_T$, where $Y$ is the total label set. Although we sequentially allocate the labels to teachers (see Appendix B), clustering (Mullapudi et al., 2018) or human-crafted grouping (Krizhevsky, 2009) can also be considered.

Now we introduce a Class Reweighting (CR) function, $\kappa_t(y)$, to assign high weights to the "specialty" labels $Y_t$. This function indicates how much the $t$-th teacher considers the label $y$ against $p(y)$. As mentioned before, $p(y)$ is a uniform probability distribution. However, as $\tilde{p}_t(y)$ is unnormalized, we consider it as a function of $y$:

$$\kappa_t(y) = \frac{\tilde{p}_t(y)}{p(y)} = \begin{cases} 1, & \text{if } y \in Y_t \\ \epsilon, & \text{otherwise,} \end{cases} \quad 0 \le \epsilon \le 1. \tag{4}$$

We call $\epsilon$ the exposure. If the exposure is 1, there is no label prior shift, so $\tilde{p}_t(y)$ is also a uniform distribution equal to the student label prior.
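As a small illustration of the CR function, the sketch below evaluates $\kappa_t(y)$ for one teacher and normalizes the result to show the implied shifted label prior. The function name, the four-class example, and the chosen specialty set are hypothetical.

```python
def class_reweighting(label, specialty_labels, exposure):
    """Return kappa_t(y): 1 for specialty labels, `exposure` (epsilon) otherwise."""
    if not 0.0 <= exposure <= 1.0:
        raise ValueError("exposure must lie in [0, 1]")
    return 1.0 if label in specialty_labels else exposure


# Hypothetical teacher: 4 classes, specialty labels {0, 1}, exposure 0.3.
specialty = {0, 1}
weights = [class_reweighting(k, specialty, 0.3) for k in range(4)]

# With a uniform p(y), the normalized shifted prior is proportional to kappa_t.
total = sum(weights)
shifted_prior = [w / total for w in weights]

# With exposure 1 every class gets weight 1, i.e. no label prior shift.
no_shift = [class_reweighting(k, specialty, 1.0) for k in range(4)]
```

Lower exposure concentrates more of the implied prior mass on the specialty labels, which is the knob the ablation in Section 4.2 varies.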

3.4. TEACHER LOSS

Neural networks typically produce class probabilities by using a "Softmax" output layer that converts the logit computed for each class into a probability by comparing it with the other logits. Let $z_t^i[k]$ denote the logit of class $k$ given the $i$-th input sample $(x_i, y_i)$ produced by the $t$-th teacher model. Then, the output conditional probability of the Softmax layer is as follows:

$$p_t(y_i|x_i; \theta_t) = \frac{\exp(z_t^i[y_i])}{\sum_{k=1}^{K}\exp(z_t^i[k])}, \quad t \le T. \tag{5}$$

Using the unnormalized importance sampling estimator in Eq. 3, the teacher loss for random samples $\{(x_i, y_i)\}_{i=1}^{N}$, where $N$ is the number of samples, is defined as follows:

$$\mathcal{L}_t(\theta_t) = \mathbb{E}_{(x,y)\sim p(x,y)}\left[-\kappa_t(y)\log p_t(y|x; \theta_t)\right] \approx -\frac{\sum_{i=1}^{N}\kappa_t(y_i)\log p_t(y_i|x_i; \theta_t)}{\sum_{i=1}^{N}\kappa_t(y_i)}, \quad t \le T. \tag{6}$$
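The teacher loss of Eq. 6 is a cross-entropy in which each sample is weighted by $\kappa_t(y_i)$ and the sum is normalized by the total weight. The sketch below computes it for one mini-batch in plain Python; the function names and the list-based interface are assumptions for illustration, and a real implementation would use a tensor library with automatic differentiation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-Softmax of one logit vector."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_norm for z in logits]


def teacher_loss(batch_logits, labels, kappa):
    """Self-normalized weighted cross-entropy of one teacher (Eq. 6).

    batch_logits: list of per-sample logit vectors of this teacher
    labels:       list of integer class labels
    kappa:        list of per-class CR weights kappa_t(k)
    The batch must contain at least one sample with a non-zero weight.
    """
    weighted_nll = 0.0
    weight_total = 0.0
    for logits, y in zip(batch_logits, labels):
        w = kappa[y]
        weighted_nll -= w * log_softmax(logits)[y]
        weight_total += w
    return weighted_nll / weight_total


# Two uninformative predictions: every sample has NLL log(2), so the weighted
# average is log(2) regardless of the weights.
flat = teacher_loss([[0.0, 0.0], [0.0, 0.0]], [0, 1], [1.0, 0.5])

# With exposure 0 the non-specialty sample (label 1) drops out of the loss.
only_specialty = teacher_loss([[2.0, 0.0], [0.0, 5.0]], [0, 1], [1.0, 0.0])
```

With all weights equal to 1 the expression reduces to the usual mean cross-entropy, matching the statement that an exposure of 1 means no label prior shift.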

3.5. SPECIALIST ENSEMBLE FOR KNOWLEDGE DISTILLATION

Following the model averaging paradigm (Li et al., 2021), we aggregate teachers' predictions to define the guide signal of the knowledge distillation loss. The original predictions, however, have shortcomings. The $t$-th teacher's conditional distribution $p_t(y|x; \theta_t)$ is closely related to its prior $p_t(y)$; since supervision signals for minority classes are unlikely to occur, teachers may fail to produce correct predictions under the uniform $p(y)$. In order to adapt to $p(y)$, we thus relax the minority classes' likelihood by manually adjusting logit values (Ren et al., 2020). We introduce further studies in Appendix C.

Adapting label prior. We adjust teacher output logits to adapt the shifted teacher priors $p_t(y)$ to the uniform $p(y)$. Following the discussions of Appendix C, we employ PC-Softmax (Hong et al., 2021) to post-compensate teacher logits. Given the original teacher logits $z_t^i$ and the CR function $\kappa_t(y)$ in Eq. 4, the post-compensated logits (PC-Logits) and conditional probabilities of the $i$-th sample produced by the $t$-th teacher are defined as follows:

$$\hat{z}_t^i[y_i] = z_t^i[y_i] - \log\frac{1}{\kappa_t(y_i)}; \quad \hat{p}_t(y_i|x_i; \theta_t) = \frac{\exp(\hat{z}_t^i[y_i])}{\sum_{k=1}^{K}\exp(\hat{z}_t^i[k])}, \quad t \le T, \tag{7}$$

where $\kappa_t(y_i)$ is $\tilde{p}_t(y_i)/p(y_i)$, as defined in Section 3.3. Thus, if $\kappa_t(y_i)$ is 1, the corresponding class's logit is the same as the student's, but if $\kappa_t(y_i)$ is $\epsilon$, its logit is compensated by $-\log(1/\epsilon)$. Note that each head is trained with Eq. 5, and Eq. 7 is used only to compensate for the label distribution shift before making an ensemble prediction.

Model averaging. We now aggregate teacher predictions to form an ensemble prediction. Our fusion follows the averaged-classifier approach (French et al., 2018; Garipov et al., 2018) commonly used in the statistical learning paradigm. Given the conditional probabilities for the $i$-th sample, $\hat{p}_1(y_i|x_i; \theta_1), \ldots, \hat{p}_T(y_i|x_i; \theta_T)$, obtained from Eq. 7, the aggregation is defined as follows:

$$p_e(y_i|x_i; \theta_1, \ldots, \theta_T) = \sum_{t=1}^{T}\hat{p}_t(y_i|x_i; \theta_t)\,p(\theta_t), \quad p(\theta_t) = \mathcal{U}(1/T).$$

Each $\theta_t$ is uniformly chosen. We also take the logarithm of the aggregated probabilities (Stanton et al., 2021) to derive a class ensemble logit $z_e^i[y_i] = \log(p_e(y_i|x_i; \theta_1, \ldots, \theta_T))$. In Section 4.3, we will compare this method to the convention of aggregating naive logits and show that our method has advantages in calibration.
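The model-averaging step can be sketched as follows. The code takes the post-compensated teacher probabilities as given inputs, averages them uniformly over the $T$ teachers, and takes the logarithm to obtain the class ensemble logits. The function names and the two-teacher example values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable Softmax of one logit vector."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def specialist_ensemble(compensated_probs):
    """Uniformly average T post-compensated probability vectors.

    Returns (p_e, z_e), where z_e[k] = log(p_e[k]) are the ensemble logits.
    Every averaged probability must be strictly positive, which holds for
    Softmax outputs.
    """
    num_teachers = len(compensated_probs)
    num_classes = len(compensated_probs[0])
    p_e = [sum(p[k] for p in compensated_probs) / num_teachers
           for k in range(num_classes)]
    z_e = [math.log(p) for p in p_e]
    return p_e, z_e


# Hypothetical post-compensated outputs of two teachers for one sample.
teacher_probs = [[0.9, 0.1], [0.5, 0.5]]
p_e, z_e = specialist_ensemble(teacher_probs)

# Applying Softmax to the ensemble logits recovers the averaged probabilities.
recovered = softmax(z_e)
```

Averaging in probability space is an arithmetic mean of the teacher distributions, whereas averaging raw logits and applying Softmax afterwards corresponds to a normalized geometric mean; the two generally differ whenever teachers disagree.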

3.6. STUDENT LOSS AND DISTILLATION STEPS

Given an ensemble logit and a student logit, here we define the cross-entropy and knowledge distillation (Hinton et al., 2015) losses used to update the student parameters $\{\theta_s\}$. A temperature $\tau$ is used to soften the probability distribution over classes. As in Section 3.4, the student loss is defined on the same random samples.

$$p_s(y_i|x_i; \theta_s) = \frac{\exp(z_s^i[y_i]/\tau)}{\sum_{k=1}^{K}\exp(z_s^i[k]/\tau)}; \quad p_e(y_i|x_i; \theta_1, \ldots, \theta_T) = \frac{\exp(z_e^i[y_i]/\tau)}{\sum_{k=1}^{K}\exp(z_e^i[k]/\tau)}.$$

For the normal cross-entropy loss, $\mathcal{L}_{CE}$, the temperature $\tau$ is set to 1. The knowledge distillation loss, $\mathcal{L}_{KD}$, is the KL-divergence between the student and teacher ensemble posterior distributions.

$$\mathcal{L}_{CE}(\theta_s) = \mathbb{E}_{(x,y)\sim p(x,y)}\left[-\log p_s(y|x; \theta_s)\right] \approx -\sum_{i=1}^{N}\log p_s(y_i|x_i; \theta_s).$$

$$\mathcal{L}_{KD}(\theta_s) = \mathbb{E}_{(x,y)\sim p(x,y)}\left[\mathrm{KL}(p_e \,\|\, p_s)\right] \approx \sum_{i=1}^{N}\sum_{k=1}^{K} p_e(k|x_i; \theta_1, \ldots, \theta_T)\log\frac{p_e(k|x_i; \theta_1, \ldots, \theta_T)}{p_s(k|x_i; \theta_s)}.$$

The final student loss is a weighted sum of $\mathcal{L}_{CE}$ and $\mathcal{L}_{KD}$. We adjust $\lambda$ using a Gaussian ramp-up curve, $\lambda(e) = \exp(-5(1 - e/\alpha)^2)$, where $e$ is the epoch and $\alpha$ is the ramp-up period (Laine & Aila, 2017).

$$\mathcal{L}_s(\theta_s) = \mathcal{L}_{CE}(\theta_s) + \tau^2\lambda(e)\,\mathcal{L}_{KD}(\theta_s). \tag{13}$$

Alg. 1 introduces the student distilling steps. All parameters $\Theta$ are updated during training, and only the student, $\{\theta_s\}$, is used at test time. Thus, our framework does not induce additional test-time costs.

Algorithm 1
...
4: e ← e + 1 ▷ Update epoch to adjust the ramp-up λ(e) of Eq. 13
5: for each mini-batch {(x_i, y_i)}_{i=1}^{N} ∼ D do
6: Compute all L_t(θ_t) and L_s(θ_s) concurrently ▷ Use Eq. 6 and Eq. 13
7: θ_t ← θ_t − η∇_{θ_t} L_t(θ_t)
8: θ_s ← θ_s − η∇_{θ_s} L_s(θ_s)
9: end for
10: end while
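The student objective can be sketched per sample as below: cross-entropy at temperature 1 plus the KL term at temperature $\tau$, scaled by $\tau^2\lambda(e)$. The function names are hypothetical, and holding $\lambda(e)$ at 1 once the epoch exceeds the ramp-up period is an assumption of this sketch rather than something stated above.

```python
import math


def softmax(logits, tau=1.0):
    """Numerically stable Softmax with temperature tau."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ramp_up(epoch, alpha):
    """Gaussian ramp-up lambda(e) = exp(-5 * (1 - e / alpha) ** 2).

    Clamping the epoch at alpha (so lambda stays at 1 afterwards) is an
    assumption of this sketch.
    """
    e = min(epoch, alpha)
    return math.exp(-5.0 * (1.0 - e / alpha) ** 2)


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def student_loss(student_logits, ensemble_logits, label, epoch, alpha, tau):
    """Per-sample L_CE + tau^2 * lambda(e) * L_KD."""
    ce = -math.log(softmax(student_logits)[label])  # temperature 1
    p_e = softmax(ensemble_logits, tau)
    p_s = softmax(student_logits, tau)
    return ce + tau ** 2 * ramp_up(epoch, alpha) * kl_divergence(p_e, p_s)


# When the student already matches the ensemble, only the CE term remains.
matched = student_loss([1.0, 0.0], [1.0, 0.0], 0, 40, 80, 3.0)
# A disagreeing ensemble adds a positive distillation penalty.
mismatched = student_loss([1.0, 0.0], [0.0, 1.0], 0, 40, 80, 3.0)
```

Early in training the ramp-up keeps the distillation term small, so the student is driven mostly by the labels while the teachers are still poorly trained.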

4. EXPERIMENTS

In this section, we conduct three experiments to assess the efficacy of the proposed method. First, we evaluate how well our student model is generalized in an image classification compared to previous methods with three measurements (Stanton et al., 2021) : Top-1 error rate (ERR), expected calibration error (ECE), and negative log-likelihood (NLL). Second, we perform an ablation study on the exposure ϵ of the proposed CR function κ(y) and the number of teachers T . For the analysis, we include two metrics (Stanton et al., 2021) to measure a student's fidelity on an ensemble's outputs: averaged top-1 agreement and averaged KL-divergence. We further show diversity change following the variation of ramp-up period α in Appendix F.2. Finally, we empirically analyze why our student model has become calibrated. The evaluation settings are thoroughly summarized in Appendix E.1.
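For reference, ECE is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence in each bin, weighted by bin size (Guo et al., 2017). The sketch below follows that standard definition; the function name, the choice of 15 equal-width bins, and the toy inputs are assumptions and may differ from the evaluation settings used in this paper.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE with equal-width confidence bins.

    confidences: top-1 predicted probabilities in [0, 1]
    correct:     1 if the top-1 prediction was right, else 0
    """
    n = len(confidences)
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hit_sums = [0.0] * n_bins
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        counts[idx] += 1
        conf_sums[idx] += conf
        hit_sums[idx] += hit
    ece = 0.0
    for count, conf_sum, hit_sum in zip(counts, conf_sums, hit_sums):
        if count:
            # (count / n) * |accuracy - mean confidence| simplifies to this.
            ece += abs(hit_sum - conf_sum) / n
    return ece


# Toy example: four predictions at 90% confidence, three of them correct.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
# Fully confident and always correct predictions are perfectly calibrated.
calibrated = expected_calibration_error([1.0, 1.0], [1, 1])
```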

4.1. IMAGE CLASSIFICATION PERFORMANCE

We compare our method to a range of online KD methods: CLILR (Song & Chai, 2018), ONE (Lan et al., 2018), FFL-S (Kim et al., 2021) and OKDDip (Chen et al., 2020) for CIFAR, as well as DML (Zhang et al., 2018), KDCL (Guo et al., 2020) and PCL (Wu & Gong, 2021) for ImageNet. "Vanilla" denotes training a target model from scratch without the knowledge distillation loss. While CLILR, ONE, and FFL-S select the first network as a student after the whole training procedure,

Results on CIFAR datasets. Table 1 demonstrates that our method consistently outperforms previous methods in student generalization. For all DNN models, our ERR and NLL are marginally better. Moreover, our proposed method produces a significantly better-calibrated student, as measured by ECE, than previous works and Vanilla; the gains grow as the number of classes increases. Section 4.3 will discuss ECE further. We provide an additional comparison with network-based methods in Appendix F.1.

Results on ImageNet datasets. Table 2 shows Top-1 ERR in comparison to all the previous methods. Our proposed method improves ERR over Vanilla by 0.96% and 1.29% for ResNet-18 and ResNet-34, respectively, and still compares favorably with previous methods.

4.2. ABLATION STUDY

We examine the effect of the exposure ($\epsilon$) and the number of teacher heads ($T$). As shown in Figure 2, we analyze results for teacher diversity, student generalization, and student fidelity to the ensemble posterior distribution. We use averaged pairwise Jensen-Shannon divergence (Appendix D) to measure the diversity between two given distributions. We use ResNet-110 trained on CIFAR-100 with $T \in \{2, 3, 4\}$ and $\epsilon \in \{0.1, 0.3, 0.5, 0.8, 1.0\}$; models with more than five peers are not practical due to computational cost and performance saturation (Stanton et al., 2021). The range of $\epsilon$ is for our grid search. PC-Softmax is applied in the same way when evaluating the ensemble outcomes, since $p(y)$ is the same in the training and testing data distributions.

Exposure variation. As discussed in Section 3, $\epsilon$ determines teacher diversity. The diversity is exceptionally high when $\epsilon$ is 0.1, as shown in Figure 2(f). At this point, the ensemble performs worse than the student, implying that the ensemble usually fails to discover the hidden knowledge in data. As a result, the student has high disagreement with the ensemble; this implies that the student may experience significant confusion during distillation. Diversity is lower than 0.1 in the other ranges, decreasing slightly from $\epsilon = 0.3$ onward. The value $\epsilon = 0.3$, in particular, is quite encouraging: the well-generalized ensemble shows the potential of merging diverse teachers. As shown in Figure 2(d) and 2(e), the fidelity is also superior. The disparity in generalization is thus noticeably small in Figure 2(a) to 2(c). Furthermore, as shown in Figure 3, the magnitude of our diversity at the chosen $\epsilon$ is consistently high across various DNNs compared to earlier methods. In Appendix F.3, we further visualize how teachers' confidence varies while predicting samples of a specific class $k$.

The number of teacher heads. The fidelity typically improves as $T$ increases, as shown in Figure 2(d) and 2(e). One possible explanation is that increasing the number of ensemble components smooths the logits of unlikely classes, making the distribution easier for the student to match. This phenomenon may provide insight into how to improve overall fidelity. The student thus benefits in top-1 accuracy and NLL, which improves ECE marginally, as shown in Figure 2(a) to 2(c). However, student generalization becomes increasingly saturated (Stanton et al., 2021) as $T$ increases. Meanwhile, diversity falls slightly because label repetition can render class coverage redundant.
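The diversity measure used in this ablation can be illustrated with the standard definition of Jensen-Shannon divergence, averaged over all pairs of teacher predictive distributions for one sample. This is a sketch under that assumption: the exact formulation of the diversity metric is given in Appendix D, and the function names and toy distributions here are hypothetical.

```python
import math
from itertools import combinations


def kl_divergence(p, q):
    """KL(p || q) with the convention 0 * log(0 / q) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    mixture = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


def average_pairwise_js(distributions):
    """Mean JS divergence over all unordered pairs of distributions."""
    pairs = list(combinations(distributions, 2))
    return sum(js_divergence(p, q) for p, q in pairs) / len(pairs)


# Identical teachers have zero diversity.
homogeneous = average_pairwise_js([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])
# Two teachers with disjoint predictions reach the maximum, log(2).
disjoint = average_pairwise_js([[1.0, 0.0], [0.0, 1.0]])
```

Because JS divergence is symmetric and bounded by $\log 2$ (in nats), values from different numbers of teachers remain comparable after averaging over pairs.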

4.3. ON CALIBRATION OF STUDENT MODEL

This section empirically explains why our ensemble usually leads to better student calibration than previous methods. Calibration considers the problem of predicting probability estimates that are representative of the true correctness likelihood (Guo et al., 2017). KD can be regarded as a type of learned label smoothing regularization (Yuan et al., 2020). Label smoothing can also calibrate a network, reducing the miscalibration rate, i.e., ECE (Müller et al., 2019). Accepting such a KD effect, we conjecture that two factors of our ensemble transfer crucial scaling constraints on the student confidence: combining probabilities and diversity.

Combining probabilities. As shown in Figure 4(b), forming the ensemble posterior distribution from teacher probabilities rather than teacher logits significantly improves ECE and yields marginal gains in top-1 accuracy and NLL; we further provide the ensembled confidence among them in Appendix F.4. Even after replacing the existing ensemble in previous methods with the probability-based one and altering our ensemble to the logit-based one, using probabilities still performs better in ECE, as shown in Figure 4(a). Moreover, as shown in Figure 4(c), PC-Softmax outperforms PC-Logit in ECE while exhibiting comparable accuracy across varying $\epsilon$. From these three case studies, we hypothesize that a probability-based ensemble effectively regularizes student confidence through KD guidance.

Diversity. As shown in Figure 3, our diversity is higher and more model-agnostic than that of previous works; previous works have trouble deriving diversity on DenseNet-40-12 and EfficientNetB0. This implies that when the number of teachers is constrained, using extra losses may fail to induce helpful diversity. Apart from the size and robustness of our diversity, the acceptable fidelity shown in Figure 2(d) and 2(e) demonstrates that the diversity is suitable for a student to accommodate different signals. Therefore, the knowledge merged from our diverse teachers points in a direction suitable for the student, and it exhibits generalized ensemble performance, as shown in Figure 2(a) to 2(c). As a result, the student can learn generalized potential knowledge well.

5. CONCLUSION

We propose enriching online knowledge distillation with the specialist ensemble. The proposed CR functions model label prior shifts to induce large diversity among teachers throughout training. Averaging diverse teacher probabilities provides a significant advantage in ensemble calibration. This paper confirms that KD with our ensemble improves student generalization: marginally improved ERR and NLL together with notably improved ECE. Figure 5 shows, through reliability diagrams, that our student becomes a more predictable classifier than those of previous methods. For further discussions, the limitations and societal impacts are described in Appendices G and H.

A PROOFS

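A.1 PROOF OF THEOREM 1

Proof. For discrete random variables $x$ and $y$, the joint probability mass functions $p(x, y)$ and $q(x, y)$ can be expressed as $p(x, y) = p(x|y)p(y)$ and $q(x, y) = q(x|y)q(y)$ by the product rule. We assume that $p(x|y) = q(x|y)$ and that only the label priors differ, $p(y) \neq q(y)$. Also, let $h(x, y)$ be a differentiable function with respect to $x$ and $y$.

$$\begin{aligned}
\mathbb{E}_{(x,y)\sim q(x,y)}[h(x, y)] &= \sum_x\sum_y q(x, y)\,h(x, y) = \sum_y q(y)\sum_x p(x|y)\,h(x, y) \\
&= \sum_y p(y)\frac{q(y)}{p(y)}\sum_x p(x|y)\,h(x, y) = \sum_y p(y)\frac{q(y)}{p(y)}\left(\mathbb{E}_{x\sim p(x|y)}[h(x, y)]\right) \\
&= \mathbb{E}_{y\sim p(y)}\left[\frac{q(y)}{p(y)}\,\mathbb{E}_{x\sim p(x|y)}[h(x, y)]\right]
\end{aligned}$$

By the associative law of multiplication, and since $y$ is a constant with respect to $x$, we have

$$\mathbb{E}_{y\sim p(y)}\left[\frac{q(y)}{p(y)}\,\mathbb{E}_{x\sim p(x|y)}[h(x, y)]\right] = \mathbb{E}_{y\sim p(y),\,x\sim p(x|y)}\left[\frac{q(y)\,h(x, y)}{p(y)}\right] \tag{15}$$

$$= \mathbb{E}_{(x,y)\sim p(x,y)}\left[\frac{q(y)\,h(x, y)}{p(y)}\right]. \tag{16}$$

One may wonder about the case $p(y = y_i) = 0$ for some $i$, in which the denominator becomes zero. In our setting, $p(y)$ is a uniform distribution; thus, all the probabilities are strictly positive.

A.2 PROOF OF COROLLARY 1.1

Proof. Let $\tilde{q}(x, y) = Zq(x, y)$ be an unnormalized distribution where $Z > 0$ is an unknown constant. Then

$$\sum_x\sum_y q(x, y) = \sum_x\sum_y p(x, y) = 1, \quad \sum_x\sum_y \tilde{q}(x, y) = Z.$$

For samples $(x_i, y_i) \overset{\text{i.i.d.}}{\sim} p(x, y)$, $\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} h(x_i, y_i)$ is a basic Monte-Carlo estimator of $\mu = \mathbb{E}_{(x,y)\sim p(x,y)}[h(x, y)] = \sum_x\sum_y p(x, y)\,h(x, y)$. From Eq. 16,

$$\begin{aligned}
\mu &= \mathbb{E}_{(x,y)\sim p(x,y)}\left[\frac{q(y)\,h(x, y)}{p(y)}\right] = \mathbb{E}_{(x,y)\sim p(x,y)}\left[\frac{\tilde{q}(y)\,h(x, y)}{Z\,p(y)}\right] \\
&= \sum_x\sum_y \frac{\tilde{q}(y)\,h(x, y)}{Z\,p(y)}\,p(x, y) = \frac{1}{Z}\sum_x\sum_y \frac{\tilde{q}(y)\,h(x, y)}{p(y)}\,p(x, y) \\
&= \frac{\sum_y\left[p(y)\frac{\tilde{q}(y)}{p(y)}\left[\sum_x p(x|y)\,h(x, y)\right]\right]}{\sum_y\left[\tilde{q}(y)\left[\sum_x q(x|y)\right]\right]} = \frac{\sum_y\left[p(y)\frac{\tilde{q}(y)}{p(y)}\left[\sum_x p(x|y)\,h(x, y)\right]\right]}{\sum_y\left[p(y)\frac{\tilde{q}(y)}{p(y)}\left[\sum_x q(x|y)\right]\right]}
\end{aligned}$$

Let $\kappa(y) = \tilde{q}(y)/p(y)$ be the unnormalized class reweighting (CR) function of $y$ and $q(x|y) = p(x|y)$. Then we have

$$\begin{aligned}
\frac{\sum_y\left[p(y)\frac{\tilde{q}(y)}{p(y)}\left[\sum_x p(x|y)\,h(x, y)\right]\right]}{\sum_y\left[p(y)\frac{\tilde{q}(y)}{p(y)}\left[\sum_x q(x|y)\right]\right]} &= \frac{\sum_y\left[p(y)\,\kappa(y)\left[\sum_x p(x|y)\,h(x, y)\right]\right]}{\sum_y\left[p(y)\,\kappa(y)\left[\sum_x p(x|y)\right]\right]} = \frac{\mathbb{E}_{y\sim p(y)}\left[\kappa(y)\left[\mathbb{E}_{x\sim p(x|y)}h(x, y)\right]\right]}{\mathbb{E}_{y\sim p(y),\,x\sim p(x|y)}[\kappa(y)]} \\
&= \frac{\mathbb{E}_{y\sim p(y),\,x\sim p(x|y)}[\kappa(y)\,h(x, y)]}{\mathbb{E}_{y\sim p(y),\,x\sim p(x|y)}[\kappa(y)]} = \frac{\mathbb{E}_{(x,y)\sim p(x,y)}[\kappa(y)\,h(x, y)]}{\mathbb{E}_{(x,y)\sim p(x,y)}[\kappa(y)]}.
\end{aligned}$$

Using Monte-Carlo estimation, we can estimate the above expectations:

$$\frac{\mathbb{E}_{(x,y)\sim p(x,y)}[\kappa(y)\,h(x, y)]}{\mathbb{E}_{(x,y)\sim p(x,y)}[\kappa(y)]} \approx \frac{\frac{1}{N}\sum_{i=1}^{N}\kappa(y_i)\,h(x_i, y_i)}{\frac{1}{N}\sum_{i=1}^{N}\kappa(y_i)}, \quad (x_i, y_i) \overset{\text{i.i.d.}}{\sim} p(x, y) \tag{19}$$

$$= \frac{\sum_{i=1}^{N}\kappa(y_i)\,h(x_i, y_i)}{\sum_{i=1}^{N}\kappa(y_i)}, \quad (x_i, y_i) \overset{\text{i.i.d.}}{\sim} p(x, y).$$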

B FORMULATION OF SPECIALTY LABELS

This section describes how we create the "specialty" labels $Y_t$ of each $t$-th teacher (recall that $t \in \mathbb{N}$, $t \le T$, where $T$ denotes the number of teachers). We allow overlap between the sets $Y_t$, as mentioned in the main paper. We introduce a parameter $\gamma \in [0, 1]$ to control the ratio of the classes that a teacher focuses on. Let us denote the first class of $Y_t$ as follows:

$$c_0^t = \frac{K}{T}(t - 1) + 1,$$

where $K$ is the total number of classes. Then, we can define the specialty set as:

$$Y_t = \{k \mid c_0^t \le k \le c_0^t + \gamma K - 1\}, \quad \text{if } c_0^t + \gamma K - 1 \le K.$$

Sometimes $k$ can go beyond the last index of the whole class set. In that case, we define the specialty labels $Y_t$ as follows:

$$Y_t = \{k \mid c_0^t \le k \le K\} \cup \{k \mid 1 \le k \le c_0^t + (\gamma - 1)K - 1\}, \quad \text{otherwise.}$$

In this paper, $\gamma = 0.5$ is used to let specialty labels overlap at least once. There is no class overlap if $\gamma$ is 0. It is also worth noting that if $K$ is not a multiple of $T$, some classes may be exposed once less than the others. However, our experiments show that this does not significantly affect ensemble performance.
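The sequential allocation above can be sketched as a short function. Classes are 1-indexed as in the text. The function name is hypothetical, and the use of floor division for $K/T$ and truncation for $\gamma K$ when they are not integers is an assumption of this sketch.

```python
def specialty_labels(t, num_teachers, num_classes, gamma=0.5):
    """Return the 1-indexed specialty label set Y_t of teacher t (1 <= t <= T)."""
    # First class c_0^t; floor division is an assumption for K not divisible by T.
    first = (num_classes // num_teachers) * (t - 1) + 1
    # Number of specialty classes gamma * K, truncated to an integer (assumption).
    width = int(gamma * num_classes)
    last = first + width - 1
    if last <= num_classes:
        return set(range(first, last + 1))
    # Wrap around to the beginning of the label set.
    return set(range(first, num_classes + 1)) | set(range(1, last - num_classes + 1))


# Hypothetical setting: K = 8 classes, T = 4 teachers, gamma = 0.5.
label_sets = [specialty_labels(t, 4, 8) for t in range(1, 5)]

# Count how many teachers specialize in each class.
coverage = {k: sum(k in labels for labels in label_sets) for k in range(1, 9)}
```

In this setting the fourth teacher wraps around and receives classes 7, 8, 1 and 2, so every class is a specialty of exactly two teachers.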

C WHY ADAPTING TEACHER LABEL PRIOR BEFORE AGGREGATION

C.1 TEACHER PREDICTION ON LABEL PRIOR SHIFT

In supervised learning, a classifier parameterized by $\theta$ tries to sample a correct label $y$ for the input $x$ by directly estimating the conditional distribution $p(y|x; \theta)$. In our online multi-head learning, each teacher parameterized by $\theta_t$ learns the corresponding true distribution $p_t(x, y) = p(x|y)p_t(y)$, whose label prior is differently class-skewed, $p_t(y) \neq p_{t'}(y)$. By Bayes' rule, the empirical inference over parameters $\theta_t$ given the specialty dataset $D_t = \{(x_i, y_i)\}_{i=1}^{M}$ on each class-imbalanced distribution is as follows:

$$p(\theta_t|D_t) \propto p(\theta_t)\prod_{i=1}^{M} p(y_i|x_i; \theta_t).$$

For an unknown data point $(x_{M+1}, y_{M+1})$, the predictive distribution is marginalized over the posterior distribution $p(\theta_t|D_t)$:

$$p(y_{M+1}|x_{M+1}; D_t) = \int p(y_{M+1}|x_{M+1}; \theta_t)\,p(\theta_t|D_t)\,d\theta_t. \tag{25}$$

Therefore, each teacher's prediction for given data is likely related to its class-skewed distribution.

C.2 BIASED ACCURACY ON LABEL PRIOR SHIFT

In this section, we further discuss how the class imbalance caused by the label prior shift can result in biased accuracy (Tian et al., 2020), especially when a teacher's label prior is imbalanced to varying degrees relative to the uniform (student) label prior, i.e., $p_t(y) \neq p(y)$. For simplicity, we assume a teacher classifier $f(x; \theta_t) : \mathbb{R}^D \to \{0, 1\}^K$ for $K$-way one-hot classification on $D$-dimensional inputs. With $N_k$ denoting the number of samples of class $k$, the accuracy is

$$\mathrm{Acc}(x, y) = \frac{1}{N} \sum_{k=1}^{K} \sum_{i=1}^{N} \mathbb{I}(f(x_i; \theta_t) = k,\, y_i = k) = \sum_{k=1}^{K} \frac{N_k}{N} \cdot \frac{1}{N_k} \sum_{i=1}^{N} \mathbb{I}(f(x_i; \theta_t) = k,\, y_i = k) = \sum_{k=1}^{K} p(y = k)\, \mathrm{Agreement}(y = k) = \mathbb{E}_{y \sim p(y)}[\mathrm{Agreement}(y)]. \tag{26}$$

As shown in Eq. 26, the accuracy is equal to the expectation of agreement under the given label prior. When $p_t(y) \neq p(y)$, training with imbalanced data maximizes accuracy on $p_t(y)$, where majority classes are more likely to be observed. On the other hand, the accuracy on uniform data calculates the expectation of agreement on $p(y)$. Therefore, training on $p_t(y)$ is prone to bias towards large classes to maximize Eq. 26 and thus may result in inaccurate evaluation on the uniform $p(y)$.
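The decomposition in Eq. 26 can be checked numerically. The following sketch is only an illustration under assumed toy data; the function name `expected_agreement` and the example labels are hypothetical and do not come from the paper. It computes the per-class agreement and weights it by either the empirical (imbalanced) prior or a user-supplied prior such as the uniform one.

```python
def expected_agreement(y_true, y_pred, prior=None):
    """Accuracy written as the expectation of per-class agreement (Eq. 26).

    If `prior` is None, the empirical label prior N_k / N is used, which
    recovers the ordinary accuracy. Passing another prior (e.g. uniform)
    re-weights the same per-class agreements.
    """
    classes = sorted(set(y_true))
    n = len(y_true)
    empirical, agreement = {}, {}
    for k in classes:
        idx = [i for i, y in enumerate(y_true) if y == k]
        empirical[k] = len(idx) / n
        # Fraction of class-k samples that are predicted as class k.
        agreement[k] = sum(y_pred[i] == k for i in idx) / len(idx)
    if prior is None:
        prior = empirical
    return sum(prior[k] * agreement[k] for k in classes)


if __name__ == "__main__":
    # Toy imbalanced data: 8 samples of class 0, 2 samples of class 1.
    y_true = [0] * 8 + [1] * 2
    y_pred = [0] * 8 + [0, 1]   # one minority sample is misclassified
    print(expected_agreement(y_true, y_pred))                      # imbalanced prior
    print(expected_agreement(y_true, y_pred, {0: 0.5, 1: 0.5}))    # uniform prior
```

In this toy case the same classifier scores higher under the imbalanced prior than under the uniform prior, which is the bias that the section describes.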

C.3 LIKELIHOOD RELAXATION

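As shown in Eq. 25 and Eq. 26, training on imbalanced data implies that a given teacher cannot be accurate on the minority data and makes the teacher's prediction closely related to its corresponding labels. Suppose $\xi \ge 0$ is any threshold and $L_t(\theta_t)$ is the standard NLL in Softmax regression of our teachers on the class-imbalanced dataset. Denoting by $\Omega_k$ the subset of class $k$, let $\mathrm{err}_k(\xi)$ be the zero-one loss on empirical samples in the $k$-class subset:

$$\mathrm{err}_k(\xi) = \Pr_{(x,y) \in \Omega_k}[L_t(\theta_t) > \xi].$$

In addition, we define $\mathrm{err}_{\gamma,k}(\xi)$ as the zero-one $\gamma$-margin loss on empirical samples in the $k$-class subset:

$$\mathrm{err}_{\gamma,k}(\xi) = \Pr_{(x,y) \in \Omega_k}[L_t(\theta_t) + \gamma_k > \xi].$$

Theorem 2. (Ren et al., 2020) Assume that $L_t$ is Lipschitz continuous and $\sup_{(x,y) \in \Omega} |L_t(\theta_t) - \xi| \le C$, where $\Omega$ is the entire dataset. For any $\delta > 0$, with probability at least $1 - \delta$ over the samples, $\forall \gamma_k > 0$ and $\forall f \in \mathcal{F}$ in Theorem 2 of Kakade et al. (2008), neglecting empirical noise, we have

$$\mathrm{err}_k(\xi) \le \mathrm{err}_{\gamma,k}(\xi) + \frac{4 R_k(\mathcal{F})}{\gamma_k} + \sqrt{\frac{\log(\log_2 \frac{4C}{\gamma_k})}{n_k}} + \sqrt{\frac{\log(1/\delta)}{2 n_k}} \tag{27}$$

$$\mathrm{err}_{\mathrm{uniform}}(\xi) \le \frac{1}{K} \sum_{k=1}^{K} \left( \mathrm{err}_{\gamma,k}(\xi) + \frac{4}{\gamma_k} \sqrt{\frac{\Gamma(\mathcal{F})}{n_k}} + \sqrt{\frac{\log(\log_2 \frac{4C}{\gamma_k})}{n_k}} + \sqrt{\frac{\log(1/\delta)}{2 n_k}} \right) \tag{28}$$

where $\Gamma$ can be measured as a complexity of $\mathcal{F}$, following Theorem 3 of Kakade et al. (2008).

To minimize the uniform error bound in Eq. 28 according to $n_k$, we should minimize the second term because the first term is a natural data loss and the other terms are negligible low-order losses. With an equality constraint $\sum_{k=1}^{K} \gamma_k = \rho$, we can solve the minimization problem of the second term by applying the Cauchy-Schwarz inequality to get each optimal $k$-class margin $\gamma_k^*$:

$$\min \sum_{k=1}^{K} \frac{4}{\gamma_k} \sqrt{\frac{\Gamma(\mathcal{F})}{n_k}}, \quad \text{subject to } \sum_{k=1}^{K} \gamma_k = \rho. \tag{29}$$

Proof. Since $\sum_{k=1}^{K} \gamma_k = \rho$ is constant, the given minimization problem can be written as

$$\min \left( \sum_{k=1}^{K} \gamma_k \right) \left( \sum_{k=1}^{K} \frac{4}{\gamma_k} \sqrt{\frac{\Gamma(\mathcal{F})}{n_k}} \right). \tag{30}$$

By the Cauchy-Schwarz inequality,

$$\left( \sum_{k=1}^{K} \gamma_k \right) \left( \sum_{k=1}^{K} \frac{4}{\gamma_k} \sqrt{\frac{\Gamma(\mathcal{F})}{n_k}} \right) \ge \left( \sum_{k=1}^{K} \sqrt{\gamma_k \cdot \frac{4}{\gamma_k} \sqrt{\frac{\Gamma(\mathcal{F})}{n_k}}} \right)^2 = \left( \sum_{k=1}^{K} \sqrt{4 \sqrt{\frac{\Gamma(\mathcal{F})}{n_k}}} \right)^2. \tag{31}$$

Both sides are equal if and only if $\gamma_k$ and $\frac{4}{\gamma_k} \sqrt{\frac{\Gamma(\mathcal{F})}{n_k}}$ are linearly dependent.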
Thus, we choose a multiplier $\zeta^2$ for ease of calculation. Then, we have

$$\gamma_k = \zeta^2 \frac{4}{\gamma_k} \sqrt{\frac{\Gamma(\mathcal{F})}{n_k}}; \quad \gamma_k^2 = 4 \zeta^2 \sqrt{\frac{\Gamma(\mathcal{F})}{n_k}}; \quad \gamma_k = 2 \zeta \left( \frac{\Gamma(\mathcal{F})}{n_k} \right)^{1/4}. \tag{32}$$

Substituting $\gamma_k$ of Eq. 32 into the equality constraint in Eq. 29 gives

$$\rho = \sum_{k'=1}^{K} \gamma_{k'} = \sum_{k'=1}^{K} 2 \zeta \left( \frac{\Gamma(\mathcal{F})}{n_{k'}} \right)^{1/4}; \quad \rho = 2 \zeta \sum_{k'=1}^{K} \left( \frac{\Gamma(\mathcal{F})}{n_{k'}} \right)^{1/4},$$

$$\zeta = \frac{\rho}{2 \sum_{k'=1}^{K} \left( \frac{\Gamma(\mathcal{F})}{n_{k'}} \right)^{1/4}}; \quad \frac{\gamma_k}{2 \left( \frac{\Gamma(\mathcal{F})}{n_k} \right)^{1/4}} = \frac{\rho}{2 \sum_{k'=1}^{K} \left( \frac{\Gamma(\mathcal{F})}{n_{k'}} \right)^{1/4}},$$

$$\gamma_k = \frac{2 \rho \left( \frac{\Gamma(\mathcal{F})}{n_k} \right)^{1/4}}{2 \sum_{k'=1}^{K} \left( \frac{\Gamma(\mathcal{F})}{n_{k'}} \right)^{1/4}} = \frac{2 \rho\, \Gamma(\mathcal{F})^{1/4} \left( \frac{1}{n_k} \right)^{1/4}}{2\, \Gamma(\mathcal{F})^{1/4} \sum_{k'=1}^{K} \left( \frac{1}{n_{k'}} \right)^{1/4}}.$$

Finally, the optimal margin of the $k$-class subset, $\gamma_k^*$, is as follows:

$$\therefore\ \gamma_k^* = \frac{\rho\, n_k^{-1/4}}{\sum_{k'=1}^{K} n_{k'}^{-1/4}}. \tag{36}$$

As a result of Eq. 36, $\gamma_k^*$ implies that class-specific margins are necessary according to $n_k$. Thus, minority classes sometimes require larger margins to generalize. To obtain a uniform generalization error for each teacher prediction, denoted as Eq. 25, each teacher needs to manually relax the $k$-class NLL loss by adjusting its Softmax outputs. Corollary 2.1 of Ren et al. (2020) shows that we can get the desired NLL loss from a sum of the class-wise NLL loss and the given optimal margin. The straightforward derivation results in a compensating method for the $k$-class logit value. Let the conditional distribution given by the adapted logit values of each $t$-th teacher be $\tilde{p}_t(y|x; \theta_t)$. Then, we can define

$$\tilde{p}_t(y = k|x; \theta_t) = \frac{\exp(z_t[k] - \log \gamma_{t,k}^*)}{\sum_{k'=1}^{K} \exp(z_t[k'] - \log \gamma_{t,k'}^*)} = \frac{n_{t,k}^{1/4} \exp(z_t[k])}{\sum_{k'=1}^{K} n_{t,k'}^{1/4} \exp(z_t[k'])},$$

where $\gamma_{t,k}^*$ and $n_{t,k}$ denote the optimal $k$-class margin and the $k$-class size of the $t$-th teacher, and each $k$-class logit value $z_t[k]$ is compensated by $-\log \gamma_{t,k}^*$. However, Ren et al. (2020) suggest that since

by measuring ERR, and choose 0.5 for ResNet-32, 0.3 for ResNet-110, 0.5 for DenseNet-40-12, 0.3 for EfficientNetB0, and 0.8 for MobileNetV2. We choose a different exposure for each model because the models differ in network architecture and in the ratio of peer/shared parameters, as shown in Section E.2.
All parameters are initialized with MSRA initialization (He et al., 2015). To compare our method with previous works, we use the officially released implementation code (linked in the footnotes) for those works, and the three evaluation metrics of each method are fairly measured with the training settings above. While we use $\tau = 3$ for the knowledge distillation temperature, DML uses $\tau = 1$, and CLILR uses $\tau = 2$; those values are reported in the original papers.

We train all ImageNet models for 90 epochs. The learning rate begins at 0.1 and is reduced to one-tenth at epochs 30 and 60. The mini-batch size is set to 256, and the weight decay to $1 \times 10^{-4}$. The balancing factor $\lambda(e)$ has a ramp-up period $\alpha$ of 20, where $e$ is an epoch. For our knowledge distillation, we set $\tau = 3$. For all models, an exposure $\epsilon$ of 0.7 is used. We also use MSRA initialization for ImageNet.
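The step learning-rate schedules described in the training settings can be written compactly. The sketch below is illustrative only: the function name `step_lr` is hypothetical, epochs are assumed to be 0-indexed, and "decreases by one-tenth" is interpreted as multiplying the learning rate by 0.1 at each milestone.

```python
def step_lr(epoch, base_lr=0.1, milestones=(150, 225), factor=0.1):
    """Learning rate at a given (0-indexed) epoch for a step schedule.

    The default milestones follow the CIFAR setting (epochs 150 and 225);
    pass milestones=(30, 60) for the ImageNet setting.
    """
    # Count how many milestones have already been reached.
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops


if __name__ == "__main__":
    for e in (0, 149, 150, 225, 299):
        print("CIFAR epoch", e, "lr", step_lr(e))
    for e in (0, 30, 60, 89):
        print("ImageNet epoch", e, "lr", step_lr(e, milestones=(30, 60)))
```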

E.2 ARCHITECTURAL CONFIGURATIONS OF THE PEER-BASED METHOD

We separate the shared and teacher-specific parts from the start of the last block to build a peer-based architecture for various deep models on both CIFAR datasets. For this purpose, we adhere to the strategy in Chen et al. (2020). For MobileNetV2 and EfficientNetB0, which we build for additional comparisons in this paper, we divide the shared part from the teacher-specific part at the beginning of the last two building blocks for all methods. As a result, network-based models have far more parameters than peer-based models, as shown in Table 3. We also follow the separating strategy in Chen et al. (2020) for ResNet on the ImageNet dataset; we split at the last two residual blocks to build the peer-based architecture.

Diversity disparity among various model architectures. Accepting that DNNs are architecturally distinct, we can empirically observe from Figure 3 and Table 1 that deriving diversity can be significantly difficult if the number of peer-head parameters is considerably smaller than that of the shared part. DenseNet-40-12, for example, has almost all of its parameters shared because this architecture uses only a fully-connected layer as the teacher-specific part (Chen et al., 2020). Thus, we can speculate that only a few individual parameters are available for specialization, implying that diversity is possible (our method still outperforms the previous methods), but specialization remains difficult. Therefore, the diversity of our proposed method on DenseNet-40-12 is lower than that on other models. Furthermore, we investigate why MobileNetV2 and EfficientNetB0 have less diversity than ResNet-32. Despite a similar Peer/Shared parameter ratio, the difference may be caused by the type of intermediate layer, e.g., spatial or depthwise-separable convolution; however, a concrete analysis is left for future work.

F SUPPLEMENTARY RESULTS

F.1 COMPARISON WITH NETWORK-BASED METHODS

To make a fair comparison with DML and a network-based variant of OKDDip, we rebuilt our framework as a network-based one. From the results of ECE in

F.2 DIVERSITY CHANGE ON THE VARIATION OF RAMP-UP PERIOD

This section shows how large our diversity is and how well it is maintained throughout training under variation of the ramp-up period $\alpha$. In online KD works, $\alpha$ has been used to modulate the strength of KD to control homogenization. For example, when $\alpha$ is 80, $\lambda(e)$ in Eq. 13 varies from 0 to 1 during the first 80 epochs. As shown in Figure 6, CLILR, FFL-S, and ONE are sensitive to the variation of $\alpha$. In particular, when $\alpha$ is small, the previous works suffer from homogenization from the early epochs. However, OKDDip and ours are not affected by the variation of $\alpha$. In addition, our method consistently exhibits the highest diversity among the compared methods.

For each sample $(x_i, y_i) \sim D$, we can get averaged predictions for each teacher. We first define the conditional distribution $p_t(y|x_i; \theta_t)$ in $K$-class Softmax, which can be represented as a multinomial distribution:

$$p_t(y|x_i; \theta_t) = \prod_{k=1}^{K} p_t(y = k|x_i; \theta_t)^{\mathbb{1}\{y = k\}}; \quad p_t(y = k|x_i; \theta_t) = \frac{\exp(z_t^i[k])}{\sum_{k'=1}^{K} \exp(z_t^i[k'])}, \quad t \le T,$$

where $\mathbb{1}\{\cdot\}$ denotes the indicator function. The logit of class $k$ given the $i$-th input sample $(x_i, y_i)$ produced by the $t$-th teacher model is denoted by $z_t^i[k]$. Second, we take the conditional distributions of each $t$-th teacher and average them across all samples in $D$:

$$\bar{p}_t(y|x; \theta_t) = \frac{1}{N} \sum_{i=1}^{N} p_t(y|x_i; \theta_t), \quad t \le T, \tag{43}$$

where $N$ denotes the total number of samples in $D$. To demonstrate the post-compensation (PC) effect, we adapt the original label prior as introduced in Section 3.5 and thus replace $p_t(y|x_i; \theta_t)$ with $\tilde{p}_t(y|x_i; \theta_t)$. In Figure 7, we plot Eq. 43 and a variant of Eq. 43 in which the conditional distribution is replaced by $\tilde{p}_t(y|x; \theta_t)$. A smaller $\epsilon$ leads to different conditional distributions when a model is slimmer, inducing specialization and dramatic diversity. After applying the PC strategy, we can see that teachers still maintain diversity under the uniform distribution.
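The averaging step for a single teacher can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `softmax` and `averaged_posterior` are hypothetical, and the logits are assumed to be given as plain Python lists with one row per sample.

```python
import math


def softmax(logits):
    """Numerically stable Softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def averaged_posterior(per_sample_logits):
    """Average one teacher's Softmax outputs over all samples of a dataset.

    `per_sample_logits` holds one logit vector z_t^i per sample i.
    """
    probs = [softmax(z) for z in per_sample_logits]
    n = len(probs)
    num_classes = len(probs[0])
    return [sum(p[k] for p in probs) / n for k in range(num_classes)]


if __name__ == "__main__":
    # Hypothetical logits of one teacher on three samples, three classes.
    logits = [[2.0, 0.5, -1.0], [1.5, 1.0, 0.0], [0.0, 0.0, 0.0]]
    print(averaged_posterior(logits))
```

Replacing `softmax` with a post-compensated variant would give the PC version of the same average.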

F.4 ENSEMBLE CONFIDENCE VISUALIZATION

In Figure 8, we visualize our ensemble confidence compared to previous methods on both positive and negative samples. Our ensemble is less over-confident for positive samples than previous methods. In particular, our ensemble mispredicts negative samples less confidently. This implies that our ensemble may have a lower chance of miscalibrated failure (being completely incorrect) than the others. That is, when our ensemble fails, it does so with uncertainty.

G LIMITATION

Our ensemble method produces better confidence calibration by leveraging two key factors: combining probabilities and teacher diversity. To resolve the label prior shift among teachers and match the student label distribution, the class probabilities are individually post-compensated. PC is necessary, but naively employing the PC strategy can sometimes result in overbalanced posterior probabilities on the classes outside the specialty (Ren et al., 2020). In Appendix C.3, we showed that the generalization error bound for minority classes with fewer samples should be carefully considered; we theoretically discussed that the tightness assumed when deriving the post-compensation ratio from Eq. 28 holds only in some cases. We empirically discovered that our framework suffers from the overbalanced problem when re-scaling teacher outputs on the classes outside the specialty before forming an ensemble. We conjecture that an estimator derived from importance sampling can inherently differ from an estimator derived from basic Monte-Carlo sampling of the actual imbalanced joint distribution; thus, we speculate that the degree of exposure to the classes outside the specialty differs slightly from the actual situation. In future work, we will investigate the posterior overbalanced problem after the PC strategy more thoroughly and fundamentally to obtain better predictions.

H POTENTIAL SOCIETAL IMPACT

This work has the same potential impact as any neural network compression study. The positive effect first comes from reducing the resource overhead of deep learning models during inference time. Second, a compressed model with only essential knowledge has more potential because it achieves comparable performance with less power and can even exhibit better generalization than the larger capacity model. Therefore, we can deploy neural network models to mobile phones or edge devices, expecting acceptable performance. We thus take a step closer to energy-friendly deep learning, facilitating a wider use of Artificial Intelligence in industrial IoT or smart home technology. At the same time, research on neural network compression may have some negative consequences. For example, if neural network models are more widely used for wearable devices or surveillance cameras, privacy invasion or cybercrime is possible. In addition, the malfunction of industrial IoT devices could cause a severe problem for the whole production process.



The model architecture of CIFAR is shallower than the plain version of ImageNet (He et al., 2016).
https://github.com/DefangChen/OKDDip-AAAI2020
https://github.com/Lan1991Xu/ONE_NeurIPS2018
https://github.com/Jangho-Kim/FFL-pytorch



Student Distilling Steps
Input: Training set D; (T+1)-head model parameterized by Θ; mini-batch size N; learning rate η
Output: A converged student model parameterized by θ_s
1: Randomly initialize Θ
2: Set CR functions κ_t(y) with a choice of ϵ to define each L_t(θ_t)
3: while θ_s not converged do
4:

Figure 2: Ensemble and student generalization: top-1 accuracy, expected calibration error (ECE), and negative log-likelihood (NLL) loss. Fidelity between ensemble and student conditional distribution: averaged KL-divergence and averaged ensemble-student top-1 agreement. Diversity: averaged Jensen-Shannon divergence between the posterior distributions of each pair of teachers. The shaded region represents the mean(±std) for three experiments with varying ϵ and T in test time.

Figure 3: Diversity comparison in various deep neural networks on CIFAR-10 (up) and CIFAR-100 (down) with previous methods. For fair comparisons, we use Softmax to normalize teacher logits of the previous methods and PC-Softmax on our teacher logits. Each measure is obtained when T = 3 and the student is the best performer in validation time.

Figure 4: Performance comparisons in ensembles using logits or probabilities (probs). All ensembles over the benchmarks are obtained when each student performs the best on accuracy at validation time. In particular, (a) probs are based on PC-Softmax for ours and Softmax for others. (c) The shaded region represents the mean(±std), calculated from three trials with varying ϵ.

Ren et al. (2020) show that the negative log-likelihood (NLL) error of minority classes should be adjusted more. They propose a manual relaxation method over the class-wise NLL by imposing a discriminative "margin" denoted as $\gamma_k$, where $k$ is a class index. By carefully revisiting Theorem 2 in both Ren et al. (2020) and Kakade et al. (2008), we discuss how we can quantitatively set the margin $\gamma_k$.

In Eq. 27, $R_k(\mathcal{F})$ is the Rademacher complexity of a function family $\mathcal{F}$ (Kakade et al., 2008) and $n_k$ is the sample size of the $k$-class subset. Following the discussion in Ren et al. (2020), we can obtain the relaxed generalization error bound $\mathrm{err}_{\mathrm{uniform}}(\xi)$ for the loss on a uniformly class-distributed dataset.

comparisons during entire training time for ResNet-32 on CIFAR-100. The shaded region represents the mean(±std), calculated from three trials. We plot each diversity based on PC-Softmax (ours) and Softmax (others) while using the training set.

F.3 VISUALIZATION OF TEACHER POSTERIOR DISTRIBUTION

This section depicts how each $t$-th teacher's output varies when predicting samples of specific classes. With $T = 4$, we apply two deep neural networks, ResNet-32 and ResNet-110, to CIFAR-100 and test with exposure $\epsilon \in \{0.1, 0.3, 1.0\}$. We create a partial test set $D$ that includes only the data with specific labels $\tilde{Y} = [1, 24]$; that is, we directly generate a skewed label distribution with an equal number of samples for each class within $\tilde{Y}$ and zero samples for the rest of the classes.

Figure 7: Visualization of averaged teachers' posterior distribution on the specific labeled dataset D. The front number of a teacher is a teacher index. We plot each averaged conditional distribution based on Softmax (up) and PC-Softmax (down).

Figure 8: Confidence of an ensemble posterior distribution corresponding to each class. For each class k, positive samples denote the correctly predicted samples of class k, and negative samples denote the incorrectly predicted samples on class k. The shaded area corresponds to the mean(±std).

The generalization comparison with previous peer-based methods on the student model. ERR and ECE use a percentage (%), and NLL is a loss value. Thus, the lower it is, the better. The numbers are the test results of three random experiments and filled in the mean(±std). The best result within each type is indicated in bold.

Top-1 ERR (%) comparison with previous methods on the ImageNet validation set. The results of ResNet-18 and ResNet-34 are reported from Wu & Gong (2021) and Chen et al. (2020), respectively. Note also that ours uses T = 2 and T = 3, respectively, on each model for a fair comparison. We report the mean(±std) over three random experiments for our validation results.

The pure parameter ratio of DNNs. Network-based / Peer-based denotes the parameter ratio of network-based models to peer-based models. Peer / Shared denotes the parameter ratio of a single peer head to the shared part. The models are listed in the order of Table 1. This ratio excludes the additional parameters introduced by extra modules in the benchmarked works.

Table 4 and Table 1, network-based online KD is more effective than peer-based online KD in producing a more calibrated student. Regardless of the number of classes, our method outperforms previous works on ResNet-110, EfficientNetB0, and MobileNetV2. On CIFAR-10, our method outperforms previous works on ResNet-32 and DenseNet-40-12, but it falls short of OKDDip in ERR on CIFAR-100. However, ours is still about 2x better than OKDDip in ECE. As a result, our student is more accurately confident than OKDDip.

The generalization comparison with previous network-based methods on the student model. ERR and ECE are reported as percentages (%), and NLL is a loss value; thus, lower is better. The numbers are the test results of three random experiments, reported as mean(±std). The best result within each type is indicated in bold.


Eq. 28 is not tight, a power of 1/4 for $n_k$ is a weaker condition than a power of 1. Therefore, we can redefine $\tilde{p}_t(y = k|x; \theta_t) \propto n_{t,k} \exp(z_t[k])$, where $p_t(y = k)$ denotes the class-imbalanced probability of the $t$-th teacher and $n$ is the total number of samples in the dataset $\Omega$. Using $p(y = k)$, which is the uniform probability of class $k$, we can then obtain Eq. 39, where $\kappa_t(y = k)$ is our proposed CR function for the $t$-th teacher. Eq. 39 is, as a result, the same as Eq. 7, PC-Softmax (Hong et al., 2021). Thus, in this paper, we slightly adjust each $k$-class logit value $z_t[k]$ by $-\log(1/\epsilon)$.
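The closed-form margin of Eq. 36 and the resulting logit compensation can be checked with a short numerical sketch. This is an illustration under assumed toy class counts, not the authors' implementation: all function names are hypothetical, $\Gamma(\mathcal{F})$ is set to 1 because it cancels in the optimal margin, and the `power` argument simply switches between the exponent 1/4 from the derivation and other exponents.

```python
import math


def optimal_margins(counts, rho=1.0):
    """Optimal class margins: rho * n_k^(-1/4) / sum_k' n_k'^(-1/4) (Eq. 36)."""
    weights = [n ** -0.25 for n in counts]
    total = sum(weights)
    return [rho * w / total for w in weights]


def margin_objective(margins, counts, complexity=1.0):
    """Second term of the bound: sum_k (4 / gamma_k) * sqrt(complexity / n_k)."""
    return sum(4.0 / g * math.sqrt(complexity / n) for g, n in zip(margins, counts))


def softmax(logits):
    """Numerically stable Softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def margin_compensated_softmax(logits, counts, rho=1.0):
    """Softmax after shifting each logit by -log(gamma*_k)."""
    margins = optimal_margins(counts, rho)
    return softmax([z - math.log(g) for z, g in zip(logits, margins)])


def count_weighted_softmax(logits, counts, power=0.25):
    """Equivalent form: probabilities proportional to n_k^power * exp(z_k)."""
    return softmax([z + power * math.log(n) for z, n in zip(logits, counts)])


if __name__ == "__main__":
    counts = [100, 10, 1]            # hypothetical class sizes of one teacher
    logits = [1.0, 0.5, -0.2]        # hypothetical logits for one input
    print(optimal_margins(counts))
    print(margin_compensated_softmax(logits, counts))
    print(count_weighted_softmax(logits, counts, power=0.25))
```

The two Softmax variants coincide because $-\log \gamma_k^*$ differs from $\frac{1}{4} \log n_k$ only by a constant shared across classes, which Softmax ignores.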

D DIVERSITY: AVERAGED PAIRWISE JENSEN-SHANNON DIVERGENCE

We measure the diversity of teacher outputs based on the Jensen-Shannon Divergence (JSD), which assesses how similar two distributions are. The two distributions are mutually informative if the diversity is zero. Given the $i$-th sample, we formulate the diversity among $T$ teachers as $\mathrm{Div}_i$ (Eq. 40), the pairwise JSD averaged over teacher pairs $(t, t')$, where $p_t^i$ is the output probability distribution of the $t$-th teacher and $\sigma = (p_t^i + p_{t'}^i)/2$. For our proposed method, probability distributions after post-compensation are used.
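An averaged pairwise JSD can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes the natural logarithm and an average over unordered teacher pairs, since the exact normalization of the original equation is not recoverable from the text.

```python
import math
from itertools import combinations


def kl_divergence(p, q):
    """KL(p || q) with the convention 0 * log(0 / q) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    sigma = [(pi + qi) / 2 for pi, qi in zip(p, q)]   # mixture distribution
    return 0.5 * kl_divergence(p, sigma) + 0.5 * kl_divergence(q, sigma)


def diversity(teacher_probs):
    """Average JSD over all unordered pairs of teacher outputs for one sample."""
    pairs = list(combinations(teacher_probs, 2))
    return sum(jsd(p, q) for p, q in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical outputs of three teachers on one sample.
    outputs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
    print(diversity(outputs))
```

With the natural logarithm, the diversity is 0 for identical teacher outputs and reaches $\log 2$ for two teachers with disjoint supports.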

E EXPERIMENTAL SETTINGS

E.1 EXPERIMENTAL CONFIGURATIONS

Datasets. We compare our proposed method to previous online KD works using three datasets. CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009) each have 50K training images and 10K test images, with each image belonging to one of 10 or 100 classes. ImageNet (Deng et al., 2009) contains 1.2M training and 50K validation images in 1K classes.

Training settings. For CIFAR datasets, we train all models for 300 epochs. We use SGD with a momentum of 0.9. The learning rate begins at 0.1 and is reduced to one-tenth at epochs 150 and 225. We employ a standard data augmentation strategy from He et al. (2016) and normalize all images by each channel mean and standard deviation. The batch size is set to 128, and the weight decay to $5 \times 10^{-4}$. The ramp-up period $\alpha$ of the balancing factor $\lambda(e)$ for the student knowledge distillation loss is 80, where $e$ is an epoch. During the first 80 epochs, $\lambda(e)$ varies from 0 to 1. We perform a grid search to find each model's optimal exposure $\epsilon$. We find the best $\epsilon$ among [0.1, 0.3, 0.5, 0.8, 1.0]

