ON THE REPRODUCIBILITY OF NEURAL NETWORK PREDICTIONS

Abstract

Standard training techniques for neural networks involve multiple sources of randomness, e.g., initialization, mini-batch ordering and, in some cases, data augmentation. Given that neural networks are heavily over-parameterized in practice, such randomness can cause churn: disagreements between the predictions of two models independently trained by the same algorithm, contributing to the 'reproducibility challenges' in modern machine learning. In this paper, we study this problem of churn, identify factors that cause it, and propose two simple means of mitigating it. We first demonstrate that churn is indeed an issue, even for standard image classification tasks (CIFAR and ImageNet), and study the role of the different sources of training randomness that cause it. By analyzing the relationship between churn and prediction confidences, we pursue an approach with two components for churn reduction. First, we propose using minimum entropy regularizers to increase prediction confidences. Second, we present a novel variant of the co-distillation approach (Anil et al., 2018) to increase model agreement and reduce churn. We present empirical results showing the effectiveness of both techniques in reducing churn while improving the accuracy of the underlying model.

1. INTRODUCTION

Deep neural networks (DNNs) have seen remarkable success in a range of complex tasks, and significant effort has been spent on further improving their predictive accuracy. However, an equally important desideratum of any machine learning system is stability, or reproducibility, of its predictions. In practice, machine learning models are continuously (re)trained as new data arrives, or to incorporate architectural and algorithmic changes. A model that changes its predictions on a significant fraction of examples after each update is undesirable, even if each model instantiation attains high accuracy. Reproducibility of predictions is a challenge even if the architecture and training data are fixed across different training runs, which is the focus of this paper.

Unfortunately, two key ingredients that help deep networks attain high accuracy, namely over-parameterization and the randomization of their training algorithms, pose significant challenges to their reproducibility. The former refers to the fact that NNs typically have many solutions that minimize the training objective (Neyshabur et al., 2015; Zhang et al., 2017). The latter refers to the fact that standard training of NNs involves several sources of randomness, e.g., initialization, mini-batch ordering, non-determinism in training platforms, and in some cases data augmentation. Put together, these imply that NN training can find vastly different solutions in each run even when the training data is the same, leading to a reproducibility challenge.

The prediction disagreement between two models is referred to as churn (Cormier et al., 2016). Concretely, given two models, churn is the fraction of test examples where the predictions of the two models disagree. Clearly, churn is zero if both models have perfect accuracy, an unattainable goal in most practical settings of interest. Similarly, one can mitigate churn by eliminating all sources of randomness in the underlying training setup.
However, even if one controls the seed used for random initialization and the order of data, inherent non-determinism in current computation platforms is hard to avoid (see §2.3). Moreover, it is desirable to have stable models whose predictions are unaffected by such factors in training. Thus, it is critical to quantify churn and to develop methods that reduce it.

In this paper, we study the problem of churn in NNs for the classification setting. We demonstrate the presence of churn, and investigate the role of different training factors causing it. Interestingly, our experiments show that churn is not avoidable on the computing platforms commonly used in machine learning, further highlighting the necessity of developing techniques to mitigate churn. We then analyze the relation between churn and predicted class probabilities. Based on this, we develop a novel regularized co-distillation approach for reducing churn. Our key contributions are summarized below:

(i) Besides the disagreement in the final predictions of models, we propose alternative soft metrics to measure churn. We demonstrate the existence of churn on standard image classification tasks (CIFAR-10, CIFAR-100, ImageNet, SVHN and iNaturalist), and identify the components of learning algorithms that contribute to the observed churn. Furthermore, we analyze the relationship between churn and model prediction confidences (cf. §2).

(ii) Motivated by our analysis, we propose a regularized co-distillation approach to reduce churn that both improves prediction confidences and reduces prediction variance (cf. §3). Our approach consists of two components: a) minimum entropy regularizers that improve prediction confidences (cf. §3.1), and b) a new variant of co-distillation (Anil et al., 2018) to reduce prediction variance across runs. Specifically, we use a symmetric KL divergence based loss to reduce model disagreement, with a linear warmup and joint updates across multiple models (cf. §3.2).
(iii) We empirically demonstrate the effectiveness of the proposed approach in reducing churn and (sometimes) increasing accuracy. We present ablation studies over its two components to show their complementary nature in reducing churn (cf. §4).

1.1. RELATED WORK

Reproducibility in machine learning. There is a broad field studying the problem of reproducible research (Buckheit & Donoho, 1995; Gentleman & Lang, 2007; Sonnenburg et al., 2007; Kovacevic, 2007; Mesirov, 2010; Peng, 2011; McNutt, 2014; Braun & Ong, 2014; Rule et al., 2018), which identifies best practices to facilitate the reproducibility of scientific results. Henderson et al. (2018) analyzed the reproducibility of methods in reinforcement learning, showing that the performance of certain methods is sensitive to the random seed used in training. While the performance of NNs on image classification tasks is fairly stable (Table 2), we focus on analyzing and improving the reproducibility of individual predictions. Thus, churn can be seen as a specific technical component of this reproducibility challenge. Cormier et al. (2016) defined the disagreement between predictions of two models as churn. They proposed an MCMC approach to train an initial stable model A so that it has small churn with its future version, say model B. Relatedly, weight averaging (Izmailov et al., 2018) has been used to promote more robust solutions. In contrast, we are interested in the robustness of individual predictions.

Ensembling and online distillation. Ensemble methods (Dietterich, 2000; Lakshminarayanan et al., 2017) that combine the predictions of multiple (diverse) models naturally reduce churn by averaging out the randomness in the training procedure of the individual models. However, such methods incur a large memory footprint and high computational cost at inference time. Distillation (Hinton et al., 2015; Bucilua et al., 2006) aims to train a single model from the ensemble to alleviate these costs. Even though the distilled model aims to recover the accuracy of the underlying ensemble, it is unclear whether the distilled model also leads to churn reduction. Furthermore, distillation is a two-stage process, involving first training an ensemble and then distilling it into a single model. Several recent works, including Anil et al. (2018), have focused on online distillation, where multiple identical or similar models (with different initializations) are trained while regularizing the distance between their prediction probabilities. At the end of training, any of the participating models can be used for inference. Notably, Anil et al. (2018), while referring to this approach as co-distillation, also empirically pointed out its utility for churn reduction on the Criteo Ad dataset. In contrast, we develop a deeper understanding of the co-distillation framework as a churn reduction mechanism by providing a theoretical justification for its ability to reduce churn. We experimentally show that using a symmetric KL divergence objective instead of the cross-entropy loss for co-distillation (Anil et al., 2018) leads to lower churn and better accuracy, even improving over the more expensive ensembling-distillation approach.

Entropy regularizer. Minimum entropy regularization was earlier explored in the context of semi-supervised learning (Grandvalet & Bengio, 2005). Such techniques have also been used to combat label noise (Reed et al., 2015). In contrast, we utilize minimum entropy regularization in fully supervised settings, for the distinct purpose of reducing churn, and experimentally show its effectiveness.

1.2. NOTATION

Multi-class classification. We consider a multi-class classification setting: given an instance $x \in \mathcal{X}$, the goal is to classify it as a member of one of $K$ classes, indexed by the set $\mathcal{Y} = [K]$. Let $\mathcal{W}$ be the set of parameters that define the underlying classification models. In particular, for $w \in \mathcal{W}$, the associated classification model $f(\cdot\,; w) : \mathcal{X} \to \Delta_K$ maps the instance $x \in \mathcal{X}$ into the $K$-dimensional simplex $\Delta_K \subset \mathbb{R}^K$. Given $f(x; w)$, the instance $x$ is classified as an element of the class

$$\hat{y}_{x;w} = \arg\max_{j \in \mathcal{Y}} f(x; w)_j. \quad (1)$$

This gives the misclassification error $\ell_{01}(y, \hat{y}_{x;w}) = \mathbf{1}\{\hat{y}_{x;w} \neq y\}$, where $y$ is the true label for $x$. Let $\mathbb{P}_{X,Y}$ be the joint distribution over instance and label pairs. We learn a classification model by minimizing the risk for some valid surrogate loss $\ell$ of the misclassification error $\ell_{01}$: $L(w) = \mathbb{E}_{X,Y}\big[\ell(Y, f(X; w))\big]$. In practice, since we have only finite samples $S \in (\mathcal{X} \times \mathcal{Y})^n$, we minimize the corresponding empirical risk

$$L(w; S) = \frac{1}{|S|} \sum_{(x,y) \in S} \ell\big(y, f(x; w)\big). \quad (2)$$

2. CHURN: MEASUREMENT AND ANALYSIS

In this section we define churn and demonstrate its existence on the CIFAR and ImageNet datasets. We also propose and measure alternative soft metrics to quantify churn that mitigate the discontinuity of churn. Subsequently, we examine the influence of different factors in the learning algorithm on churn. Finally, we present a relation between churn and the prediction confidences of the model.

We begin by defining churn as the expected disagreement between the predictions of two models (Cormier et al., 2016).

Definition 1 (Churn between two models). Let $w_1, w_2 \in \mathcal{W}$ define classification models $f(\cdot\,; w_1), f(\cdot\,; w_2) : \mathcal{X} \to \Delta_K$, respectively. Then, the churn between the two models is

$$\mathrm{Churn}(w_1, w_2) = \mathbb{E}_X\big[\mathbf{1}\{\hat{Y}_{X;w_1} \neq \hat{Y}_{X;w_2}\}\big] = \mathbb{P}_X\big[\hat{Y}_{X;w_1} \neq \hat{Y}_{X;w_2}\big], \quad (3)$$

where $\hat{Y}_{x;w_1} = \arg\max_{j \in \mathcal{Y}} f(x; w_1)_j$ and $\hat{Y}_{x;w_2} = \arg\max_{j \in \mathcal{Y}} f(x; w_2)_j$.

Note that if the models have perfect test accuracy, then their predictions always agree with the true label, which corresponds to zero churn. In practice, however, this is rarely the case. The following rather straightforward result shows that churn is upper bounded by the sum of the test errors of the models. See Appendix B for the proof. We note that a similar result was shown in Theorem 1 of Madani et al. (2004).

Lemma 1. Let $P_{\mathrm{Err},w_1} = \mathbb{P}_{X,Y}[Y \neq \hat{Y}_{X;w_1}]$ and $P_{\mathrm{Err},w_2} = \mathbb{P}_{X,Y}[Y \neq \hat{Y}_{X;w_2}]$ be the misclassification errors of the models $w_1$ and $w_2$, respectively. Then, $\mathrm{Churn}(w_1, w_2) \le P_{\mathrm{Err},w_1} + P_{\mathrm{Err},w_2}$.

Table 1: Ablation study of churn across 5 runs on CIFAR-10 with a ResNet-56. Holding the initialization constant across models always decreases churn, but using identical mini-batch ordering and completely removing augmentation can increase churn with a decrease in accuracy.

Despite the worst-case bound in Lemma 1, imperfect accuracy does not imply nonzero churn. In the best-case scenario, two imperfect models can agree on the prediction for every example (whether correct or incorrect), causing the churn to be zero.
For example, multiple runs of a deterministic learning algorithm produce models with zero churn, independent of their accuracy. This shows that, in general, one cannot infer churn from test accuracy, and understanding churn of an algorithm requires independent exploration.
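Churn as in Definition 1 is straightforward to estimate empirically from the predicted probabilities of two models on a held-out set: it is simply the fraction of examples on which the argmax predictions disagree. A minimal NumPy sketch (the probability arrays below are hypothetical, for illustration only):

```python
import numpy as np

def churn(probs_a, probs_b):
    """Fraction of examples where the argmax predictions of two models disagree.

    probs_a, probs_b: (n_examples, n_classes) arrays of prediction probabilities.
    """
    preds_a = probs_a.argmax(axis=1)
    preds_b = probs_b.argmax(axis=1)
    return float((preds_a != preds_b).mean())

# Two hypothetical 3-class models evaluated on the same 4 test examples.
p1 = np.array([[0.7, 0.2, 0.1],
               [0.1, 0.8, 0.1],
               [0.4, 0.5, 0.1],
               [0.3, 0.3, 0.4]])
p2 = np.array([[0.6, 0.3, 0.1],
               [0.2, 0.7, 0.1],
               [0.5, 0.4, 0.1],   # disagrees: class 0 vs. class 1
               [0.2, 0.3, 0.5]])
print(churn(p1, p2))  # 0.25: the models disagree on 1 of 4 examples
```

Note that the true labels play no role here, which is exactly why low test error does not determine churn.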

2.1. DEMONSTRATION OF CHURN

In principle, there are multiple sources of randomness in standard training procedures that make NNs susceptible to churn. We now verify this hypothesis by showing that, in practice, these sources indeed result in non-trivial churn for NNs on standard image classification datasets. This raises two natural questions: (i) do the prediction probabilities differ significantly, or is the churn observed in Table 2 mainly an artifact of the argmax operation in (1) applied to small variations in prediction probabilities across models? and (ii) what causes such high churn across runs? We address these questions in the following sections.

2.2. SURROGATE CHURN

Churn in Table 2 could potentially be a manifestation of applying the $\arg\max$ operation in (1), despite the prediction probabilities being close. To study this, we consider the following soft metric to measure churn, which takes the models' entire prediction probability mass functions into account.

Definition 2 (Surrogate churn between two models). Let $f(\cdot\,; w_1), f(\cdot\,; w_2) : \mathcal{X} \to \Delta_K$ be two models defined by $w_1, w_2 \in \mathcal{W}$, respectively. Then, for $\alpha \in \mathbb{R}_+$, the surrogate churn between the models is

$$\mathrm{SChurn}_\alpha(w_1, w_2) = \frac{1}{2}\, \mathbb{E}_X \left[ \left\| \left( \frac{f(X; w_1)}{\max_j f(X; w_1)_j} \right)^{\!\alpha} - \left( \frac{f(X; w_2)}{\max_j f(X; w_2)_j} \right)^{\!\alpha} \right\|_1 \right]. \quad (4)$$

As $\alpha \to \infty$, this reduces to the standard Churn definition in (3). In Table 2 we measure $\mathrm{SChurn}_\alpha$ for $\alpha = 1$, which shows that even the distance between prediction probabilities is significant across runs. Thus, the churn observed in Table 2 is not merely caused by the discontinuity of the $\arg\max$; it indeed highlights the instability of model predictions caused by randomness in training.
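Definition 2 can be computed directly from the two prediction arrays. A minimal NumPy sketch with hypothetical inputs; taking $\alpha$ large makes the normalized probability vectors approach one-hot indicators of the argmax, recovering hard churn (up to ties):

```python
import numpy as np

def surrogate_churn(probs_a, probs_b, alpha=1.0):
    """SChurn_alpha from Definition 2, averaged over examples.

    Each prediction vector is divided by its maximum entry, raised to the
    power alpha elementwise, and the halved L1 distance is averaged.
    """
    ra = (probs_a / probs_a.max(axis=1, keepdims=True)) ** alpha
    rb = (probs_b / probs_b.max(axis=1, keepdims=True)) ** alpha
    return float(0.5 * np.abs(ra - rb).sum(axis=1).mean())

p1 = np.array([[0.7, 0.2, 0.1]])
p2 = np.array([[0.2, 0.7, 0.1]])  # hypothetical second run: argmax flips
print(surrogate_churn(p1, p1))            # 0.0 for identical predictions
print(surrogate_churn(p1, p2, alpha=200))  # close to 1.0, matching hard churn
```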

2.3. WHAT CAUSES CHURN?

We now investigate the role played by randomness in initialization, mini-batch ordering, data augmentation, and non-determinism in the computation platform in causing churn. Even though these aspects are sources of randomness in training, they are not necessarily sources of churn. We experiment by holding some of these factors constant and measuring the churn across 5 different runs. We report results in Table 1, where the first column gives the baseline churn with no factor held constant across runs.

Model initialization is a source of churn. Simply initializing weights from identical seeds (odd columns in Table 1) can decrease churn under most settings. Other sources of randomness include mini-batch ordering and dataset augmentation. To hold the former constant, we ensure that every model iterates over the dataset in the same order; to hold the latter constant, we remove all augmentation during training. These two aspects contribute to randomness between training runs, but fixing them does not decrease churn; rather, they appear to have a regularizing effect on our hardware platform.

Finally, there is churn resulting from an unavoidable source during training: non-determinism in the computing platforms used for training machine learning models, e.g., GPU/TPU (Morin & Willetts, 2020; PyTorch, 2019; Nvidia, 2020). The experiments in Table 1 were run on TPU. Even when all other aspects of training are held constant (rightmost column), model weights diverge within 100 steps (across runs) and the final churn is significant. We verified that this is the sole source of non-determinism: models trained to 10,000 steps on CPUs under these settings had identical weights. These experiments underscore the importance of developing and incorporating churn reduction strategies in training. Even with extreme measures to eliminate all sources of randomness, we continue to observe churn due to unavoidable hardware non-determinism.
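As a toy illustration of the controllable factors, fixing the random seed makes initialization bit-identical across runs; as discussed above, this alone does not remove hardware non-determinism during subsequent training on GPU/TPU. A minimal NumPy sketch (the shapes and scale are arbitrary choices for illustration):

```python
import numpy as np

def init_weights(seed, shape=(4, 3)):
    """Draw an initial weight matrix from a seeded generator."""
    rng = np.random.default_rng(seed)
    return rng.normal(scale=0.1, size=shape)

# Same seed -> bit-identical initialization across "runs";
# a different seed -> a different starting point.
w_run1 = init_weights(seed=0)
w_run2 = init_weights(seed=0)
w_run3 = init_weights(seed=1)
print(np.array_equal(w_run1, w_run2))  # True
print(np.array_equal(w_run1, w_run3))  # False
```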

2.4. REDUCING CHURN: RELATION TO PREDICTION PROBABILITIES

We now focus on the main objective of this paper: churn reduction in NNs. Many factors that have been shown to cause churn in §2.3 are crucial for the good performance of NNs and thus cannot be controlled or eliminated without causing performance degradation. Moreover, controlling certain factors such as hardware non-determinism is extremely challenging, especially in large-scale distributed computing setups. Towards reducing churn, we first develop an understanding of the relationship between the distribution of prediction probabilities of a model and churn. Intuitively, encouraging either larger prediction confidence or smaller variance across multiple training runs should result in reduced churn. To formalize this intuition, consider the prediction confidence realized by the classification model $f(\cdot\,; w)$ on the instance $x \in \mathcal{X}$:

$$\gamma_{x;w} = f(x; w)_{\hat{y}_{x;w}} - \max_{j \neq \hat{y}_{x;w}} f(x; w)_j, \quad (5)$$

where $\hat{y}_{x;w}$ is the model prediction in (1). Note that $\gamma_{x;w}$ denotes the difference between the probabilities of the most likely and the second most likely classes under the distribution $f(x; w)$, and is not the same as the standard multi-class margin (Koltchinskii et al., 2001): it captures how confident the model $f(\cdot\,; w)$ is about its prediction $\hat{y}_{x;w}$, without taking the true label into account. The following result relates the prediction confidence to churn. See Appendix B for the proof.

Lemma 2. Let $\gamma_{x;w_1}$ and $\gamma_{x;w_2}$ be the prediction confidences realized on $x \in \mathcal{X}$ by the classification models $f(\cdot\,; w_1)$ and $f(\cdot\,; w_2)$, respectively. Then,

$$\mathrm{Churn}(w_1, w_2) = \mathbb{P}_X\big[\hat{Y}_{X;w_1} \neq \hat{Y}_{X;w_2}\big] \le \mathbb{P}_X\big[ D_{L1}(f(X; w_1), f(X; w_2)) > \min\{\gamma_{X;w_1}, \gamma_{X;w_2}\} \big].$$

Here, $D_{L1}(f(x; w_1), f(x; w_2)) = \sum_{j \in \mathcal{Y}} |f(x; w_1)_j - f(x; w_2)_j|$ is the $L_1$ distance. Lemma 2 establishes that, for a given instance $x$, churn between two models $f(\cdot\,; w_1)$ and $f(\cdot\,; w_2)$ becomes less likely as their confidences $\gamma_{x;w_1}$ and $\gamma_{x;w_2}$ increase.
Similarly, churn becomes less likely when the difference between their prediction probabilities, $D_{L1}(f(x; w_1), f(x; w_2))$, decreases.

3. REDUCING CHURN

Motivated by Lemma 2, our approach to churn reduction has two components: minimum entropy regularizers that increase prediction confidences (§3.1), and a co-distillation objective that reduces the distance between prediction probabilities across runs (§3.2). Co-distillation by itself does not increase the prediction confidence of the underlying models (as verified in Figure 2). Thus, our combined approach promotes both large model prediction confidence and small variance in prediction probabilities across multiple runs.
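The confidence margin $\gamma_{x;w}$ from §2.4 and the per-example implication underlying Lemma 2 (disagreement implies $D_{L1} > \min\{\gamma_{x;w_1}, \gamma_{x;w_2}\}$) can be checked numerically. A NumPy sketch on hypothetical prediction vectors:

```python
import numpy as np

def confidence(probs):
    """gamma_{x;w}: top-1 minus top-2 prediction probability, per example."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def lemma2_bound_holds(pa, pb):
    """Check, per example, that disagreement implies
    D_L1(f_a, f_b) > min(gamma_a, gamma_b)."""
    disagree = pa.argmax(axis=1) != pb.argmax(axis=1)
    d_l1 = np.abs(pa - pb).sum(axis=1)
    gmin = np.minimum(confidence(pa), confidence(pb))
    # Implication A -> B is (not A) or B.
    return bool(np.all(~disagree | (d_l1 > gmin)))

pa = np.array([[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]])
pb = np.array([[0.3, 0.6, 0.1], [0.45, 0.45, 0.1]])
print(confidence(pa))              # approx [0.3, 0.1]
print(lemma2_bound_holds(pa, pb))  # True
```

On the first example the models disagree and indeed $D_{L1} = 0.6 > 0.3 = \min\{\gamma_1, \gamma_2\}$; on the second they agree, so the implication holds vacuously.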

3.1. MINIMUM ENTROPY REGULARIZERS

Aiming to increase the prediction confidences $\{\gamma_{x;w}\}_{x \in \mathcal{X}}$ of the trained model, we propose novel training objectives that employ one of two possible regularizers, based on: (1) the entropy of the model prediction probabilities; and (2) the negative symmetric KL divergence between the model prediction probabilities and the uniform distribution. Both regularizers encourage the prediction probabilities to be concentrated on a small number of classes, and increase the associated prediction confidence.

Entropy regularizer. Recall that standard training minimizes the risk $L(w)$. Instead, for $\alpha \in [0, 1]$ and $S = \{(x_i, y_i)\}_{i \in [n]}$, we propose to minimize the following regularized objective to increase the prediction confidence of the model:

$$L_{\mathrm{entropy}}(w; S) = (1 - \alpha) \cdot L(w; S) + \alpha \cdot \frac{1}{n} \sum_{i \in [n]} H\big(f(x_i; w)\big), \quad (6)$$

where $H(f(x; w)) = -\sum_{j \in [K]} f(x; w)_j \log f(x; w)_j$ denotes the entropy of the predictions $f(x; w)$.

Symmetric KL divergence regularizer. Instead of encouraging low entropy of the prediction probabilities, we can alternatively maximize the distance of the prediction probability mass function from the uniform distribution as a regularizer to enhance the prediction confidence. In particular, we utilize the symmetric KL divergence as the distance measure. Let $\mathrm{Unif} \in \Delta_K$ be the uniform distribution. Given $n$ samples $S = \{(x_i, y_i)\}_{i \in [n]}$, we propose to minimize

$$L_{\mathrm{SKL}}(w; S) = (1 - \alpha) \cdot L(w; S) - \alpha \cdot \frac{1}{n} \sum_{i \in [n]} \mathrm{SKL}\big(f(x_i; w), \mathrm{Unif}\big), \quad (7)$$

where $\mathrm{SKL}(f(x; w), \mathrm{Unif}) = \mathrm{KL}(f(x; w)\,\|\,\mathrm{Unif}) + \mathrm{KL}(\mathrm{Unif}\,\|\,f(x; w))$.

As discussed at the beginning of the section, the intuition behind these regularizers is to encourage spiky prediction probability mass functions, which lead to higher prediction confidence. The following result supports this intuition in the binary classification setting; see Appendix B for the proof. For multi-class classification, we empirically verify that the proposed regularizers indeed lead to increased prediction confidences (cf. Figure 2).

Theorem 3. Let $f(\cdot\,; w)$ and $f(\cdot\,; w')$ be two binary classification models. For a given $x \in \mathcal{X}$, if we have $H(f(x; w)) \le H(f(x; w'))$ or $\mathrm{SKL}(f(x; w), \mathrm{Unif}) \ge \mathrm{SKL}(f(x; w'), \mathrm{Unif})$, then $\gamma_{x;w} \ge \gamma_{x;w'}$.

Note that the effect of these regularizers is different from increasing the temperature in the softmax while computing $f(x; w)$. Similar to max entropy regularizers (Pereyra et al., 2017), they are independent of the label, a crucial difference that allows them to reduce churn even among misclassified examples.
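The two regularizer terms above are simple functions of the prediction probabilities. A minimal NumPy sketch (hypothetical probability vectors; `EPS` is a small constant added for numerical stability, an implementation choice not specified in the text):

```python
import numpy as np

EPS = 1e-12  # numerical-stability constant (implementation choice)

def entropy(probs):
    """Per-example Shannon entropy H(f(x; w))."""
    return -(probs * np.log(probs + EPS)).sum(axis=1)

def skl_from_uniform(probs):
    """Symmetric KL divergence between predictions and the uniform distribution."""
    k = probs.shape[1]
    unif = np.full_like(probs, 1.0 / k)
    kl_pu = (probs * np.log((probs + EPS) / unif)).sum(axis=1)
    kl_up = (unif * np.log(unif / (probs + EPS))).sum(axis=1)
    return kl_pu + kl_up

def entropy_regularized_loss(ce_loss, probs, alpha):
    """(1 - alpha) * L(w; S) + alpha * mean entropy, as in the entropy regularizer."""
    return (1.0 - alpha) * ce_loss + alpha * entropy(probs).mean()

# Spiky predictions have lower entropy and are farther from uniform.
spiky = np.array([[0.9, 0.05, 0.05]])
flat = np.array([[0.4, 0.3, 0.3]])
print(entropy(spiky) < entropy(flat))                     # [ True]
print(skl_from_uniform(spiky) > skl_from_uniform(flat))   # [ True]
```

Minimizing mean entropy (or maximizing SKL from uniform) therefore pushes the prediction mass toward a few classes, which is exactly the intended effect on $\gamma_{x;w}$.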

3.2. CO-DISTILLATION

As the second measure for churn reduction, we now focus on designing training objectives that minimize the variance among the prediction probabilities for the same instance across multiple runs of the learning algorithm. Towards this, we consider a novel variant of the co-distillation approach (Anil et al., 2018). In particular, we employ co-distillation using a symmetric KL divergence (Co-distill_SKL), with a linear warmup, to penalize the prediction distance between the models, instead of the popular step-wise cross-entropy loss (Co-distill_CE) used in Anil et al. (2018) (see §C).

We first motivate the proposed objective for reducing churn using Lemma 2. Recall from Lemma 2 that, for a given instance $x$, churn across two models $f(\cdot\,; w_1)$ and $f(\cdot\,; w_2)$ decreases as the distance between their prediction probabilities $D_{L1}(f(x; w_1), f(x; w_2))$ becomes smaller. Motivated by this, given training samples $S = \{(x_i, y_i)\}_{i \in [n]}$, one can simultaneously train two models corresponding to $w_1, w_2 \in \mathcal{W}$ by minimizing the following objective and keeping either of the two models as the final solution.

Table 2: Accuracy, churn (cf. (3)) and SChurn (cf. (4)) on the test sets. For each setting, we report the mean and standard deviation over 10 independent runs, with random initialization, mini-batches and data augmentation. We report the values corresponding to the smallest churn for each method (see Table 5 in §A for the exact parameters). We boldface the best results in each column. First, we notice that both the proposed methods are effective at reducing churn and SChurn, with Co-distill_SKL showing a significant reduction in churn. Additionally, these methods also improve the accuracy. Finally, combining the entropy regularizer with Co-distill_SKL offers the best way to reduce churn, improving over the ensembling-distillation and Co-distill_CE approaches. Note that for co-distillation we measure the churn of a single model (e.g., $f(\cdot\,; w_1)$ in (9)) across independent training runs.
$$L_{\mathrm{Co\text{-}distill}_{L1}}(w_1, w_2; S) = L(w_1; S) + L(w_2; S) + \frac{\beta}{n} \sum_{i \in [n]} D_{L1}\big(f(x_i; w_1), f(x_i; w_2)\big). \quad (8)$$

From Pinsker's inequality, $D_{L1}(f(x_i; w_1), f(x_i; w_2)) \le \sqrt{2 \cdot \mathrm{KL}(f(x_i; w_1)\,\|\,f(x_i; w_2))}$. Thus, one can alternatively utilize the following objective:

$$L_{\mathrm{Co\text{-}distill}_{SKL}}(w_1, w_2; S) = L(w_1; S) + L(w_2; S) + \frac{\beta}{n} \sum_{i \in [n]} \mathrm{SKL}\big(f(x_i; w_1), f(x_i; w_2)\big), \quad (9)$$

where $\mathrm{SKL}(f(x; w_1), f(x; w_2)) = \mathrm{KL}(f(x; w_1)\,\|\,f(x; w_2)) + \mathrm{KL}(f(x; w_2)\,\|\,f(x; w_1))$ denotes the symmetric KL divergence between the prediction probabilities of the two models. In what follows, we work with the objective in (9), as we observed it to be more effective in our experiments, leading to both smaller churn and higher model accuracy. In addition to the co-distillation objective, we introduce two other changes to the training procedure: joint updates and a linear rampup of the co-distillation loss. We discuss these differences in §C.

Regularized co-distillation. In this paper, we also explore regularized co-distillation, where we utilize the minimum entropy regularizers (cf. §3.1) in our co-distillation framework. We note that the best results for the combined approach are achieved when we use a linear warmup for the regularizer coefficient as well. Note that combining the Co-distill_SKL objective with an entropy regularizer is not the same as using the cross-entropy loss, due to the use of different weights for each term and the rampup of the regularizer coefficients. This distinction is important in reducing churn (Table 2).
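The symmetric KL penalty in (9), and the Pinsker relation that motivates it, can be sketched directly on prediction arrays. A minimal NumPy example (hypothetical inputs; in real training the penalty would be computed on the mini-batch outputs of the two jointly trained models):

```python
import numpy as np

EPS = 1e-12  # numerical-stability constant (implementation choice)

def kl(p, q):
    """Per-example KL(p || q) between rows of two probability arrays."""
    return (p * np.log((p + EPS) / (q + EPS))).sum(axis=1)

def codistill_skl_penalty(pa, pb):
    """Mean symmetric KL between the two models' predictions (the beta term in (9))."""
    return float((kl(pa, pb) + kl(pb, pa)).mean())

def codistill_loss(loss_a, loss_b, pa, pb, beta):
    """L(w1; S) + L(w2; S) + (beta / n) * sum_i SKL(f(x_i; w1), f(x_i; w2))."""
    return loss_a + loss_b + beta * codistill_skl_penalty(pa, pb)

pa = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
pb = np.array([[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]])

# Pinsker's inequality, per example: D_L1 <= sqrt(2 * KL).
d_l1 = np.abs(pa - pb).sum(axis=1)
print(np.all(d_l1 <= np.sqrt(2.0 * kl(pa, pb))))  # True
print(codistill_skl_penalty(pa, pa))              # 0.0 when the models agree exactly
```

The penalty vanishes exactly when the two models output identical distributions, which is the regime Lemma 2 associates with low churn.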

4. EXPERIMENTS

We conduct experiments on 5 different datasets: CIFAR-10, CIFAR-100, ImageNet, SVHN and iNaturalist 2018. We use LeNet5 for experiments on SVHN, ResNet-56 for experiments on CIFAR-10 and CIFAR-100, and ResNet-v2-50 for experiments on ImageNet and iNaturalist. We use the same hyperparameters for all the experiments on a dataset. We use the cross-entropy loss and the Momentum optimizer. For complete details we refer to Appendix A.

Table 3: Weight decay ablation studies. Similar to Table 2, we provide results on accuracy and churn for the baseline and the proposed approaches, with and without weight decay. We notice that the proposed approaches improve churn both with and without weight decay.

Top-k regularizer. For problems with a large number of outputs, e.g., ImageNet, it is not necessary to penalize all the predictions to reduce churn. Recall that the prediction confidence is only a function of the top two prediction probabilities. Hence, on ImageNet, we consider top-k variants of the proposed regularizers, and penalize the entropy/SKL only on the top k predictions, with k = 10.
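One way to restrict the entropy penalty to the top k predictions is to compute the entropy over each example's k largest probabilities after renormalizing them to a distribution; this sketch is an assumption about the implementation, as the paper's exact top-k variant may differ in details:

```python
import numpy as np

EPS = 1e-12  # numerical-stability constant (implementation choice)

def topk_entropy(probs, k=10):
    """Entropy over each example's k largest prediction probabilities,
    renormalized to form a distribution. A hypothetical sketch of the
    top-k regularizer; the paper's exact variant may differ."""
    k = min(k, probs.shape[1])
    topk = np.sort(probs, axis=1)[:, -k:]
    topk = topk / topk.sum(axis=1, keepdims=True)
    return -(topk * np.log(topk + EPS)).sum(axis=1)

spiky = np.array([[0.90, 0.04, 0.03, 0.02, 0.01]])
flat = np.array([[0.30, 0.25, 0.20, 0.15, 0.10]])
# Penalizing only the top-2 entries still separates confident from diffuse predictions.
print(topk_entropy(spiky, k=2) < topk_entropy(flat, k=2))  # [ True]
```

Since the confidence margin $\gamma_{x;w}$ depends only on the top two probabilities, restricting the penalty this way keeps the relevant signal while avoiding a sum over all 1000 ImageNet classes.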

Accuracy and churn.

In Table 2, we present the accuracy and churn of models trained with the minimum entropy regularizers, co-distillation, and their combination. For each dataset we report the values for the best α and β. We notice that our proposed methods indeed reduce churn and SChurn. While both minimum entropy regularization and co-distillation are consistently effective in reducing SChurn, their effect on churn varies, potentially due to the discontinuous nature of churn. Also note that the proposed methods reduce churn among both the correctly and incorrectly classified examples. Our co-distillation proposal, Co-distill_SKL, consistently shows a significant reduction in churn, and performs better than Co-distill_CE (Anil et al., 2018), which is based on the cross-entropy loss, showing the importance of choosing the right objective for reducing churn. We present an additional comparison in Figure 2, showing churn and Top-1 error on ImageNet, computed for different values of β (cf. (9)). Finally, we achieve an even further reduction in churn with the combined approach of entropy regularized co-distillation. Interestingly, the considered methods also improve the accuracy, showing their utility beyond churn reduction.

Table 4: Expected Calibration Error. We compute the Expected Calibration Error (ECE) (Guo et al., 2017) to evaluate the effect of the churn reduction methods considered in this paper on the calibration of logits, for different datasets. We report ECE for the predictions of the models used to report accuracy and churn in Table 2. We note that while the minimum entropy regularizers, predictably, increase the calibration error, our combined approach with co-distillation results in a calibration error competitive with the 2-ensemble distillation method.

Ensembling. We also compare with the ensembling-distillation approach (Hinton et al., 2015) in Table 2, where we use a 2-teacher ensemble for distillation.
We show that the proposed methods consistently outperform the ensembling-distillation approach, despite having a lower training cost.

Ablation. We next report results of our ablation study on the entropy (top-10) regularizer coefficient α, and the Co-distill_SKL coefficient β, in Figure 1. While both methods improve accuracy and churn, the prediction entropy shows different behavior between them. Entropy regularizers improve churn by reducing the entropy, whereas co-distillation reduces churn by reducing the prediction distance between two models, resulting in an increase in entropy. This complementary nature explains the advantage of combining the two methods (cf. Table 2). We next present ablation studies on weight decay regularization in Table 3, showing that the proposed approaches improve churn on models trained both with and without weight decay.

Fixed initialization. Earlier, in Table 1, we showed that fixing the initialization across five runs of the baseline model lowers churn from 5.81 ± 0.08 to 5.56 ± 0.12 for CIFAR-10 on ResNet-56. However, our proposed Co-distill_SKL model is more effective, reducing churn to 4.29 ± 0.14 (see Table 2). Using fixed initialization on this model does not significantly affect churn (4.24 ± 0.07).
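Table 4 reports the Expected Calibration Error. ECE (Guo et al., 2017) bins examples by confidence (the top-1 probability) and averages the gap between accuracy and mean confidence in each bin, weighted by bin size. A minimal NumPy sketch with a hypothetical, perfectly calibrated toy model:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Standard ECE: bin examples by top-1 confidence and sum the
    bin-size-weighted |accuracy - mean confidence| gaps."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)

# Toy model: confidence 0.8 everywhere, correct on 8 of 10 examples,
# so confidence matches accuracy exactly and ECE is zero.
probs = np.array([[0.8, 0.2]] * 10)
labels = np.array([0] * 8 + [1] * 2)
print(expected_calibration_error(probs, labels))  # 0.0
```

This illustrates why a minimum entropy regularizer can hurt ECE: pushing confidences toward 1 without a matching accuracy gain widens the per-bin gap.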

5. DISCUSSION

Connection to label smoothing and calibration. Max entropy regularizers, such as label smoothing, are often used to reduce prediction confidence (Pereyra et al., 2017; Müller et al., 2019) and provide better prediction calibration. The minimum entropy regularizers studied in this paper have the opposite effect and increase the prediction confidence, whereas co-distillation reduces prediction confidences. To study this more concretely, we measure the effect of the churn reduction methods on calibration. We compute the Expected Calibration Error (ECE) (Guo et al., 2017) for the methods considered in this paper and report the results in Table 4. We notice that the minimum entropy regularizers, predictably, increase the calibration error, whereas the proposed co-distillation approach (Co-distill_SKL) significantly reduces it. Our joint approach is competitive with the ensemble distillation approach on the CIFAR datasets but incurs higher error on ImageNet. Developing approaches that jointly optimize for churn and calibration is an interesting direction for future work.

A EXPERIMENTAL DETAILS

Data augmentation: We use data augmentation with random cropping and flipping.

Parameter ranges: For our experiments with the minimum entropy regularizers, we use α ∈ {0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.5}, and report the α corresponding to the best churn in Table 2. For our experiments with the co-distillation approach, we use β ∈ {0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08}, and report the β corresponding to the best churn in Table 2. For our experiments on entropy regularized co-distillation, we use a linear warmup of the regularizer coefficient. We use the same range for α as described above, and use only the best β value from the earlier experiments. Finally, we run our experiments on TPUv3. Experiments on the CIFAR datasets finish in an hour; experiments on ImageNet take around 6-8 hours.
Hyperparameters for Table 2: Finally, we list in Table 5 the best hyperparameters used to obtain the results in Table 2. For reference, we also include the accuracy and churn again for all methods.

B PROOFS

Proof of Lemma 1. Recall from Definition 1 that

$$\mathrm{Churn}(w_1, w_2) = \mathbb{P}_X\big[\hat{Y}_{X,w_1} \neq \hat{Y}_{X,w_2}\big] = \mathbb{P}_{X,Y}\big[\hat{Y}_{X,w_1} \neq \hat{Y}_{X,w_2}\big]$$
$$= \mathbb{P}_{X,Y}\big[\{\hat{Y}_{X,w_1} = Y,\, \hat{Y}_{X,w_2} \neq Y\} \cup \{\hat{Y}_{X,w_1} \neq Y,\, \hat{Y}_{X,w_1} \neq \hat{Y}_{X,w_2}\}\big]$$
$$\overset{(i)}{\le} \mathbb{P}_{X,Y}\big[\hat{Y}_{X,w_1} = Y,\, \hat{Y}_{X,w_2} \neq Y\big] + \mathbb{P}\big[\hat{Y}_{X,w_1} \neq Y,\, \hat{Y}_{X,w_1} \neq \hat{Y}_{X,w_2}\big]$$
$$\overset{(ii)}{\le} \mathbb{P}\big[\hat{Y}_{X,w_2} \neq Y\big] + \mathbb{P}\big[\hat{Y}_{X,w_1} \neq Y\big] = P_{\mathrm{Err},w_2} + P_{\mathrm{Err},w_1},$$

where (i) and (ii) follow from the union bound and the fact that $\mathbb{P}[A] \le \mathbb{P}[B]$ whenever $A \subseteq B$, respectively.

Proof of Lemma 2. Note that, for any $j \neq \hat{y}_{x,w_1}$,

$$f(x; w_2)_{\hat{y}_{x,w_1}} - f(x; w_2)_j = \big[f(x; w_2)_{\hat{y}_{x,w_1}} - f(x; w_1)_{\hat{y}_{x,w_1}}\big] + \underbrace{\big[f(x; w_1)_{\hat{y}_{x,w_1}} - f(x; w_1)_j\big]}_{\ge \gamma_{x,w_1}} + \big[f(x; w_1)_j - f(x; w_2)_j\big]$$
$$\ge \gamma_{x,w_1} - \sum_{j \in \mathcal{Y}} |f(x; w_1)_j - f(x; w_2)_j| = \gamma_{x,w_1} - D_{L1}\big(f(x; w_1), f(x; w_2)\big). \quad (11)$$

Similarly, for any $j \neq \hat{y}_{x,w_2}$, we can establish that

$$f(x; w_1)_{\hat{y}_{x,w_2}} - f(x; w_1)_j \ge \gamma_{x,w_2} - D_{L1}\big(f(x; w_1), f(x; w_2)\big). \quad (12)$$

Note that experiencing churn between the two models on $x$ satisfies

$$\{\hat{y}_{x,w_1} \neq \hat{y}_{x,w_2}\} \subseteq \big\{\exists j \neq \hat{y}_{x,w_1} : f(x; w_2)_{\hat{y}_{x,w_1}} < f(x; w_2)_j\big\} \cup \big\{\exists j \neq \hat{y}_{x,w_2} : f(x; w_1)_{\hat{y}_{x,w_2}} < f(x; w_1)_j\big\}$$
$$\overset{(i)}{\subseteq} \big\{D_{L1}\big(f(x; w_1), f(x; w_2)\big) > \gamma_{x,w_1}\big\} \cup \big\{D_{L1}\big(f(x; w_1), f(x; w_2)\big) > \gamma_{x,w_2}\big\}, \quad (13)$$

where (i) follows from (11) and (12). Now, (13) implies that

$$\mathbb{P}_X\big[\hat{Y}_{X,w_1} \neq \hat{Y}_{X,w_2}\big] \le \mathbb{P}_X\big[D_{L1}(f(X; w_1), f(X; w_2)) > \min\{\gamma_{X,w_1}, \gamma_{X,w_2}\}\big]. \quad (14)$$

Proof of Theorem 3. Let the prediction for a given $x$ be $p = f(x; w)$. W.l.o.g. let $p \ge 1 - p$. The prediction confidence is then

$$\gamma_{x,w} = p - (1 - p) = 2p - 1. \quad (15)$$

Figure 3: CIFAR-100 ablation study. Similar to Figure 1, we plot the effect of the SKL regularizer (cf. (7)), for varying α, and our proposed variant of co-distillation (Co-distill_SKL) for varying β, on the prediction entropy, accuracy, and churn. These plots show the complementary nature of these methods in reducing churn. While the regularizer reduces churn by reducing the prediction entropy, Co-distill_SKL reduces churn by improving the agreement between two models, hence increasing the entropy.
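The binary-case relation (15) between entropy and confidence underlying Theorem 3 can be sanity-checked numerically; a small NumPy sketch (not part of the paper's proofs):

```python
import numpy as np

def binary_entropy(p):
    """Shannon entropy of a Bernoulli(p) prediction (eps for stability)."""
    eps = 1e-12
    return -(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps))

def binary_confidence(p):
    """gamma = |p - (1 - p)| = |2p - 1| for a binary predictor."""
    return np.abs(2 * p - 1)

# On p in [0.5, 1): entropy strictly decreases while confidence strictly
# increases, so lower entropy corresponds to higher confidence (Theorem 3).
p = np.linspace(0.5, 0.999, 100)
h = binary_entropy(p)
g = binary_confidence(p)
assert np.all(np.diff(h) < 0) and np.all(np.diff(g) > 0)
print("entropy decreases while confidence increases on p in [0.5, 1)")
```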
Table 6: Ablation study of churn across 5 runs on CIFAR-10 with a ResNet-32 on GPU. Holding the initialization constant across models decreases churn significantly, and using identical input data (keeping minibatch ordering and augmentation constant) lowers churn further. Unlike Table 1, these experiments were performed on GPU rather than TPU. Under this setting, using a fixed initialization does achieve lower churn, at a much higher computational cost.

These results always use data augmentation, but control for its randomness by ensuring that all models within each run perform the same augmentations during training; we call these "identical input data" ablations because they control for both minibatch ordering and data augmentation. Again, the first column gives the baseline churn with no factor held constant across runs, and the last column has every possible source of churn except hardware nondeterminism held constant across runs.

Model initialization is a significant source of churn. Simply initializing weights from identical seeds (columns 1-2 in Table 1) can significantly decrease churn, no matter what other aspects of training are held constant. Holding input data constant across runs further reduces randomness and churn, but its effect is smaller than that of constant initialization. The rightmost column eliminates all possible sources of churn except for unavoidable nondeterminism in the computing platform used for training machine learning models, e.g., GPU/TPU (Morin & Willetts, 2020; PyTorch, 2019; Nvidia, 2020). The experiments in Table 6 were run on GPU, and even when all other aspects of training are held constant (rightmost column), model weights diverge within 100 steps (across runs) and the final churn is significant. We verified that this is the sole source of non-determinism: models trained to 10,000 steps on CPUs under these settings had identical weights.
Comparing these results to the rightmost column of Table 1 , it seems TPU platforms currently introduce a higher baseline amount of churn than GPUs. However, as shown in this work, our Co-distill SKL method can alleviate this somewhat.
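The ablations above hold individual randomness sources constant by fixing their seeds. A minimal sketch of that idea, using NumPy draws in place of an actual training framework; the function names and sizes are illustrative:

```python
import numpy as np

def init_weights(seed):
    """Draw an initial weight vector; fixing the seed fixes the initialization."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=5)

def minibatch_order(seed, n):
    """Shuffle example indices; fixing the seed fixes the minibatch ordering."""
    rng = np.random.default_rng(seed)
    return rng.permutation(n)

# Two "runs" seeded identically share their initialization and data order exactly,
# removing those sources of churn...
w1, w2 = init_weights(0), init_weights(0)
assert np.allclose(w1, w2)
assert (minibatch_order(0, 8) == minibatch_order(0, 8)).all()

# ...while different seeds give different draws, one of the sources of churn
# studied in the ablations.
w3 = init_weights(1)
assert not np.allclose(w1, w3)
```

As the experiments show, even with all seeds pinned this way, hardware-level nondeterminism on GPU/TPU can still make runs diverge.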

E.2 2-ENSEMBLE

In Table 7 we provide the results for 2-ensemble models and compare them with the distillation and co-distillation approaches. We notice that the proposed approaches with the entropy regularizer and co-distillation achieve similar or better churn, with lower inference costs. However, the ensemble models achieve better accuracy on certain datasets, and could be an alternative in settings where higher inference costs are acceptable.

Table 7: Similar to Table 2, we provide results on accuracy and churn for the 2-ensemble and proposed approaches. We notice that the proposed approaches are either competitive or improve churn over the 2-ensemble method, while keeping inference costs low.

F LOSS LANDSCAPE: ENTROPY REGULARIZATION VS. TEMPERATURE SCALING

Figure 4 visualises the entropy regularised loss (6) in the binary case. We do so both in terms of the underlying loss operating on probabilities (i.e., the log-loss), and the loss operating on logits with an implicit transformation via the sigmoid (i.e., the logistic loss). Here, the regularised log-loss is $\ell : [0, 1] \to \mathbb{R}_+$, where $\ell(p) = (1 - \alpha) \cdot (-\log p) + \alpha \cdot H_{\mathrm{bin}}(p)$, for binary entropy $H_{\mathrm{bin}}$. Similarly, the logistic loss is $\ell : \mathbb{R} \to \mathbb{R}_+$, where $\ell(f) = (1 - \alpha) \cdot (-\log \sigma(f)) + \alpha \cdot H_{\mathrm{bin}}(\sigma(f))$, for sigmoid $\sigma(\cdot)$. The effect of increasing $\alpha$ is to dampen the loss for high-confidence predictions: e.g., for the logistic loss, either very strongly positive or negative predictions incur a low loss. This encourages the model to make confident predictions.

Figure 5, by contrast, illustrates the effect of changing the temperature $\tau$ in the softmax. Temperature scaling is a common trick used to decrease classifier confidence. This also dampens the loss for high-confidence predictions. However, with strongly negative predictions, one still incurs a high loss compared to strongly positive predictions. This is in contrast to the more aggressive dampening of the loss achieved by the entropy regularised loss.
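The contrast between the two landscapes can be checked numerically. A small sketch under our own function names (`reg_log_loss`, `reg_logistic_loss` are not identifiers from the paper), using the loss definitions of Section F:

```python
import math

def h_bin(p):
    """Binary entropy H_bin(p) in nats."""
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def reg_log_loss(p, alpha):
    """Entropy-regularised log-loss: (1 - alpha) * (-log p) + alpha * H_bin(p)."""
    return (1 - alpha) * (-math.log(p)) + alpha * h_bin(p)

def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))

def reg_logistic_loss(f, alpha):
    """The same loss applied to a logit f through the sigmoid."""
    return reg_log_loss(sigmoid(f), alpha)

# Increasing alpha dampens the loss for a strongly negative (confident) prediction.
assert reg_logistic_loss(-4.0, 0.9) < reg_logistic_loss(-4.0, 0.5) < reg_logistic_loss(-4.0, 0.0)

# Temperature scaling (dividing the logit by tau > 1) also softens the loss,
# but it still charges far more for the strongly negative logit than the
# entropy-regularised loss does.
tau = 2.0
assert -math.log(sigmoid(-4.0 / tau)) > reg_logistic_loss(-4.0, 0.9)
```

The specific values of $\alpha$, $\tau$, and the logit $f = -4$ are illustrative; the qualitative ordering is what Figures 4 and 5 depict.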



Madani et al. (2004) referred to this as disagreement and used it as an estimate for generalization error and model selection.

https://www.kaggle.com/c/criteo-display-ad-challenge

REGULARIZED CO-DISTILLATION FOR CHURN REDUCTION

In this section we present our approach for churn reduction. Motivated by Lemma 2, we consider an approach with two complementary techniques: 1) we first propose entropy based regularizers that encourage solutions $w \in \mathcal{W}$ with large prediction confidence $\gamma_{x;w}$ for each instance $x$; 2) we then employ a co-distillation approach with novel design choices that simultaneously trains two models while minimizing the distance between their prediction probabilities. Note that the entropy regularizers themselves cannot actively enforce alignment between the prediction probabilities across multiple runs, as the resulting objective does not couple multiple models together.

We restrict ourselves to the convex combination of the original risk and the regularizer terms, as this ensures that the scale of the proposed objective is the same as that of the original risk. This allows us to experiment with the same learning rate and other hyperparameters for both.



Figure 1: ImageNet ablation study: We plot the effect of the entropy regularizer, for varying α, and Co-distill SKL for varying β, on the prediction entropy, accuracy and churn. These plots show the complementary nature of these methods in reducing churn. While the entropy regularizer reduces churn by reducing the prediction entropy, Co-distill SKL reduces churn by improving the agreement between two models, hence increasing the entropy.

Figure 2: ImageNet: Left: effect of the entropy regularizer and our proposed variant of co-distillation, Co-distill SKL, on the prediction confidence. Right: comparison with Co-distill CE (Anil et al., 2018). We plot the trade-off between accuracy and churn for Co-distill CE and Co-distill SKL by varying β. The plot clearly shows that the proposed variant achieves a better trade-off compared to Co-distill CE.

Dataset: CIFAR-10. Accuracy and Churn (%) with weight decay 0 and 0.0001:

Method                         Accuracy (wd=0)   Accuracy (wd=0.0001)   Churn % (wd=0)   Churn % (wd=0.0001)
Baseline                       91.62±0.2         93.97±0.11             8.4±0.21         5.72±0.18
Entropy (this paper)           91.56±0.26        94.05±0.18             8.45±0.25        5.59±0.21
SKL (this paper)               91.91±0.22        93.75±0.13             7.84±0.24        5.79±0.18
Co-distill SKL (this paper)    92.1±0.31         94.63±0.15             6.74±0.32        4.29±0.14

Figure 4: Visualisation of the entropy regularised loss (eq. (6)) in the binary case. On the left panel is the regularised log-loss $(1 - \alpha) \cdot (-\log p) + \alpha \cdot H_{\mathrm{bin}}(p)$, which accepts a probability in $[0, 1]$ as input. On the right panel is the logistic loss $(1 - \alpha) \cdot (-\log \sigma(f)) + \alpha \cdot H_{\mathrm{bin}}(\sigma(f))$, which accepts a score in $\mathbb{R}$ as input and passes it through the sigmoid $\sigma(\cdot)$ to get a probability estimate.





where 'stride' refers to the stride of the first convolution filter within each block. For ResNet-v2-50, the final layer of each block has 4 * nfilter filters. We use L2 weight decay of strength 1e-4 for all experiments.

Learning rate and batch size: For our experiments with CIFAR, we use the SGD optimizer with 0.9 Nesterov momentum. We use a linear learning rate warmup for the first 15 epochs, with a peak learning rate of 1.0, followed by a stepwise decay schedule that decays the learning rate by a factor of 10 at epochs 200, 300 and 400. We train the models for a total of 450 epochs with a batch size of 1024.

For the ImageNet experiments, we use the SGD optimizer with 0.9 momentum. We use a linear learning rate warmup for the first 5 epochs, with a peak learning rate of 0.8, again followed by a stepwise decay schedule that decays the learning rate by a factor of 10 at epochs 30, 60 and 80. We train the models for a total of 90 epochs with a batch size of 1024.

Entropy case. Using (15), i.e., $p = (1 + \gamma_{x,w})/2$, we can write the prediction entropy in terms of the prediction confidence as

$$g(\gamma_{x,w}) = -\frac{1 + \gamma_{x,w}}{2} \log\Big(\frac{1 + \gamma_{x,w}}{2}\Big) - \frac{1 - \gamma_{x,w}}{2} \log\Big(\frac{1 - \gamma_{x,w}}{2}\Big). \quad (16)$$

The gradient of $g(\cdot)$ is $\nabla_{\gamma_{x,w}} g(\gamma_{x,w}) = \frac{1}{2} \log\big(\frac{1 - \gamma_{x,w}}{1 + \gamma_{x,w}}\big)$, which is negative for $\gamma_{x,w} \in (0, 1)$. Hence, the function $g(\gamma_{x,w})$ is decreasing for inputs in the range $[0, 1]$, and $g(\gamma_{x,w'}) \geq g(\gamma_{x,w})$ implies $\gamma_{x,w'} \leq \gamma_{x,w}$.
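The monotonicity claim for the entropy case can be verified numerically. A short sketch with our own function names (`g`, `g_grad` are illustrative):

```python
import math

def g(gamma):
    """Prediction entropy as a function of the confidence gamma = 2p - 1 (eq. (16))."""
    p = (1 + gamma) / 2
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def g_grad(gamma):
    """Closed-form gradient: (1/2) * log((1 - gamma) / (1 + gamma))."""
    return 0.5 * math.log((1 - gamma) / (1 + gamma))

# The gradient is negative on (0, 1), so g is decreasing: higher confidence
# means lower prediction entropy.
for gamma in (0.1, 0.5, 0.9):
    assert g_grad(gamma) < 0
assert g(0.9) < g(0.5) < g(0.1)

# Sanity-check the closed form against a central finite difference at 0.5.
eps = 1e-6
fd = (g(0.5 + eps) - g(0.5 - eps)) / (2 * eps)
assert abs(fd - g_grad(0.5)) < 1e-6
```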

Symmetric-KL divergence case. Recall that

$$\begin{aligned}
\mathrm{SKL}(f(x; w), \mathrm{Unif}) &= \mathrm{KL}\big(f(x; w) \,\|\, \mathrm{Unif}\big) + \mathrm{KL}\big(\mathrm{Unif} \,\|\, f(x; w)\big) \\
&= \sum_{j \in [K]} \big(f(x; w)_j - 1/K\big) \cdot \log\big(K \cdot f(x; w)_j\big) \\
&= \sum_{j} f(x; w)_j \cdot \log\big(f(x; w)_j\big) - \frac{1}{K} \log\big(f(x; w)_j\big).
\end{aligned}$$

For binary classification this reduces to

$$\mathrm{SKL}(f(x; w), \mathrm{Unif}) = -\tfrac{1}{2} \log\big(p(1 - p)\big) - H(p).$$

By using (15) and (16), we can rewrite this function in terms of $\gamma_{x,w}$ as

$$\mathrm{SKL}(f(x; w), \mathrm{Unif}) = -\frac{1}{2} \log\Big(\frac{1 - \gamma_{x,w}^2}{4}\Big) - g(\gamma_{x,w}).$$

Now notice that both of the above terms are increasing functions of $\gamma_{x,w}$, as $\log$ is an increasing function and $g(\gamma_{x,w})$ is a decreasing function. Hence $\mathrm{SKL}(f(x; w), \mathrm{Unif}) \geq \mathrm{SKL}(f(x; w'), \mathrm{Unif})$ implies $\gamma_{x;w} \geq \gamma_{x;w'}$.
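The binary identity and the monotonicity in the confidence can both be checked numerically; a sketch with our own helper names (`skl_to_uniform`, `h_bin`):

```python
import math

def h_bin(p):
    """Binary entropy in nats."""
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def skl_to_uniform(probs):
    """Symmetric KL between a distribution and the uniform one, via the
    identity SKL(f, Unif) = sum_j (f_j - 1/K) * log(K * f_j)."""
    K = len(probs)
    return sum((p - 1.0 / K) * math.log(K * p) for p in probs)

# Binary case: SKL(f, Unif) = -(1/2) * log(p * (1 - p)) - H(p).
p = 0.9
closed_form = -0.5 * math.log(p * (1 - p)) - h_bin(p)
assert abs(skl_to_uniform([p, 1 - p]) - closed_form) < 1e-12

# SKL to the uniform distribution increases with the confidence gamma = 2p - 1,
# which is why the (negated) SKL regulariser encourages confident predictions.
vals = [skl_to_uniform([q, 1 - q]) for q in (0.6, 0.75, 0.9)]
assert vals[0] < vals[1] < vals[2]
```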

C DESIGN CHOICES IN THE PROPOSED VARIANT OF CO-DISTILLATION

Here, we discuss two important design choices that are crucial for the successful utilization of the co-distillation framework for churn reduction while also improving model accuracy.

Joint vs. independent updates. Anil et al. (2018) consider co-distillation with independent updates for the participating models. In particular, with two participating models, this corresponds to independently solving two sub-problems, (17a) and (17b), during the t-th step, where $w^{t-T}_1$ and $w^{t-T}_2$ correspond to earlier checkpoints of the two models, used to compute the model disagreement loss component for the other model. Since $w^{t-T}_2$ and $w^{t-T}_1$ are not being optimized in (17a) and (17b), respectively, for each model these objectives are equivalent to regularizing its empirical loss via a cross-entropy term that aims to align its prediction probabilities with those of the other model, where $H\big(f(x; w_1), f(x; w_2)\big) = -\sum_j f(x; w_1)_j \log f(x; w_2)_j$ denotes the cross-entropy between the probability mass functions $f(x; w_1)$ and $f(x; w_2)$. In our experiments, we verify that this independent-updates approach leads to worse results as compared to jointly training the two models using the objective in (9).

Deferred model disagreement loss component in co-distillation. In general, implementing the co-distillation approach by employing the model disagreement based loss component (e.g., $\mathrm{SKL}\big(f(x; w_1), f(x; w_2)\big)$ in our proposal) from the beginning of training leads to sub-optimal test accuracy, as the models are very poor classifiers at initialization. As a remedy, Anil et al. (2018) proposed a step-wise ramp up of this loss component, i.e., a time-dependent $\beta(t)$ such that $\beta(t) > 0$ iff $t > t_0$, for $t_0$ burn-in steps. Besides the step-wise ramp up, we experimented with a linear ramp-up, i.e., $\beta_t = \min\{c \cdot t, \beta\}$, and observed that the linear ramp-up leads to better churn reduction.
We believe that, for churn reduction, it is important to ensure that the prediction probabilities of the two models start aligning before the models diverge too far during training.
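The two ramp-up schedules can be sketched directly from their definitions; the constants `c`, `t0` and `beta_max` below are illustrative choices, not values from the paper.

```python
def linear_rampup(step, c=1e-3, beta_max=1.0):
    """Linear ramp-up beta_t = min(c * t, beta): the disagreement loss is
    phased in gradually from the start of training."""
    return min(c * step, beta_max)

def stepwise_rampup(step, t0=500, beta_max=1.0):
    """Step-wise schedule of Anil et al. (2018): beta(t) > 0 iff t > t0."""
    return beta_max if step > t0 else 0.0

# Early in training the linear schedule already applies a small coupling
# between the two models, while the step-wise one applies none until the
# burn-in ends; both reach the same full weight later on.
assert linear_rampup(100) == 0.1
assert stepwise_rampup(100) == 0.0
assert linear_rampup(2000) == stepwise_rampup(2000) == 1.0
```

This early, gentle coupling is one way to realise the intuition above: the models begin aligning before they drift too far apart.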

D COMBINED APPROACH: CO-DISTILLATION WITH ENTROPY REGULARIZER

In this paper, we have considered two different approaches to reduce churn, involving minimum entropy regularizers and the co-distillation framework, respectively. This raises the question of whether these two approaches are redundant. As shown in Figures 1, 2, and 3, this is not the case. In fact, the two approaches are complementary in nature, and combining them leads to better results in terms of both churn and accuracy. Note that the minimum entropy regularizers themselves cannot actively enforce alignment between the prediction probabilities across multiple runs, as the resulting objective does not couple multiple models together. On the other hand, as verified in Figure 2, the co-distillation framework in itself does not affect the prediction confidence of the underlying models. Thus, the two approaches can be combined to obtain the following training objective, which promotes both large model prediction confidence and small variance in prediction probabilities across multiple runs.

$$\mathcal{L}^{\mathrm{combined}}_{\text{Co-distill}_{\mathrm{SKL}}}(w_1, w_2; S) = \mathcal{L}_{\text{Co-distill}_{\mathrm{SKL}}}(w_1, w_2; S) + \alpha \cdot \mathrm{EReg}(w_1, w_2; S),$$

where

$$\mathrm{EReg}(w_1, w_2; S) = -\frac{1}{n} \sum_{i \in [n]} \Big( \mathrm{SKL}\big(f(x_i; w_1), \mathrm{Unif}\big) + \mathrm{SKL}\big(f(x_i; w_2), \mathrm{Unif}\big) \Big).$$
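A minimal numerical sketch of the combined objective, under our own function names; the co-distillation loss is passed in as a precomputed scalar placeholder rather than implemented here:

```python
import numpy as np

def skl_to_uniform(probs):
    """Per-example symmetric KL between prediction probabilities and uniform."""
    K = probs.shape[-1]
    return np.sum((probs - 1.0 / K) * np.log(K * probs), axis=-1)

def ereg(p1, p2):
    """EReg(w1, w2; S): negated mean SKL-to-uniform of both models' predictions,
    so minimising it pushes both models towards confident predictions."""
    return -np.mean(skl_to_uniform(p1) + skl_to_uniform(p2))

def combined_loss(codistill_loss, p1, p2, alpha):
    """L_combined = L_codistill + alpha * EReg(w1, w2; S)."""
    return codistill_loss + alpha * ereg(p1, p2)

# Toy prediction probabilities of two models on a batch of two 3-class examples.
p1 = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]])
p2 = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])

# EReg is negative whenever predictions are non-uniform, so the combined
# objective rewards confident predictions relative to the base loss.
assert ereg(p1, p2) < 0
assert combined_loss(1.25, p1, p2, alpha=0.1) < 1.25
```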

E ADDITIONAL EXPERIMENTAL RESULTS

In this section we present additional experimental results.

Ablation on CIFAR-100. Next we present our ablation studies, similar to the results in Figure 1, for CIFAR-100. In Figure 3, we plot the mean prediction entropy, accuracy and churn for the proposed SKL regularizer (7) and the co-distillation approach (9), for varying α and β, respectively. Similar to Figure 1, the plots show the effectiveness of the proposed approaches in reducing churn while improving accuracy.

E.1 ON HARDWARE NONDETERMINISM: GPU BASELINE ABLATION RESULTS

We now repeat the experiments in Sec. 2.3 on GPUs to investigate the unavoidable nondeterminism introduced by different hardware platforms. We report results in Table 6. Compared to Table 1, there are two main differences. First, these experiments are run on GPU with smaller ResNet-32 models. Second, these results always use data augmentation.

