ON THE REPRODUCIBILITY OF NEURAL NETWORK PREDICTIONS

Abstract

Standard training techniques for neural networks involve multiple sources of randomness, e.g., initialization, mini-batch ordering, and in some cases data augmentation. Given that neural networks are heavily over-parameterized in practice, such randomness can cause churn: disagreements between the predictions of two models independently trained by the same algorithm, contributing to the 'reproducibility challenges' in modern machine learning. In this paper, we study this problem of churn, identify factors that cause it, and propose two simple means of mitigating it. We first demonstrate that churn is indeed an issue, even for standard image classification tasks (CIFAR and ImageNet), and study the role of the different sources of training randomness that cause it. By analyzing the relationship between churn and prediction confidences, we pursue an approach with two components for churn reduction. First, we propose using minimum entropy regularizers to increase prediction confidences. Second, we present a novel variant of the co-distillation approach (Anil et al., 2018) to increase model agreement and reduce churn. We present empirical results showing the effectiveness of both techniques in reducing churn while improving the accuracy of the underlying model.

1. INTRODUCTION

Deep neural networks (DNNs) have seen remarkable success in a range of complex tasks, and significant effort has been spent on further improving their predictive accuracy. However, an equally important desideratum of any machine learning system is stability, or reproducibility, of its predictions. In practice, machine learning models are continuously (re)trained as new data arrives, or to incorporate architectural and algorithmic changes. A model that changes its predictions on a significant fraction of examples after each update is undesirable, even if each model instantiation attains high accuracy. Reproducibility of predictions is a challenge even when the architecture and training data are fixed across different training runs, which is the focus of this paper. Unfortunately, two key ingredients that help deep networks attain high accuracy, namely over-parameterization and the randomization of their training algorithms, pose significant challenges to their reproducibility. The former refers to the fact that NNs typically have many solutions that minimize the training objective (Neyshabur et al., 2015; Zhang et al., 2017). The latter refers to the fact that standard training of NNs involves several sources of randomness, e.g., initialization, mini-batch ordering, non-determinism in training platforms, and in some cases data augmentation. Put together, these imply that NN training can find vastly different solutions in each run even when the training data is the same, leading to a reproducibility challenge.

The prediction disagreement between two models is referred to as churn (Cormier et al., 2016).¹ Concretely, given two models, churn is the fraction of test examples on which the predictions of the two models disagree. Clearly, churn is zero if both models have perfect accuracy, an unattainable goal in most practical settings of interest. Similarly, one can mitigate churn by eliminating all sources of randomness in the underlying training setup.
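The churn metric defined above is straightforward to compute given the predictions of two independently trained models. The following sketch is illustrative (the function names and the particular "soft" variant based on predicted class probabilities are our own assumptions, not the paper's exact definitions):

```python
import numpy as np

def churn(preds_a, preds_b):
    """Fraction of test examples on which two models' predicted labels disagree."""
    preds_a, preds_b = np.asarray(preds_a), np.asarray(preds_b)
    return float(np.mean(preds_a != preds_b))

def soft_churn(probs_a, probs_b):
    """One possible soft variant: mean total-variation distance between the
    two models' predicted class distributions (rows sum to 1)."""
    probs_a, probs_b = np.asarray(probs_a), np.asarray(probs_b)
    return float(np.mean(0.5 * np.abs(probs_a - probs_b).sum(axis=-1)))

# Two models that agree on 3 of 4 test examples:
print(churn([0, 1, 2, 1], [0, 1, 2, 0]))  # 0.25
```

A soft metric of this kind is sensitive to disagreements in confidence even when the argmax predictions coincide, which the hard 0/1 churn cannot detect.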
However, even if one controls the seed used for random initialization and the order of data, inherent non-determinism in current computation platforms is hard to avoid (see §2.3). Moreover, it is desirable to have stable models whose predictions are unaffected by such factors in training. Thus, it is critical to quantify churn and to develop methods that reduce it.

In this paper, we study the problem of churn in NNs in the classification setting. We demonstrate the presence of churn, and investigate the role of different training factors causing it. Interestingly, our experiments show that churn is not avoidable on the computing platforms commonly used in machine learning, further highlighting the necessity of developing techniques to mitigate it. We then analyze the relation between churn and predicted class probabilities. Based on this, we develop a novel regularized co-distillation approach for reducing churn. Our key contributions are summarized below:

(i) Besides the disagreement in the final predictions of models, we propose alternative soft metrics to measure churn. We demonstrate the existence of churn on standard image classification tasks (CIFAR-10, CIFAR-100, ImageNet, SVHN and iNaturalist), and identify the components of learning algorithms that contribute to the observed churn. Furthermore, we analyze the relationship between churn and model prediction confidences (cf. §2).

(ii) Motivated by our analysis, we propose a regularized co-distillation approach to reduce churn that both improves prediction confidences and reduces prediction variance (cf. §3). Our approach consists of two components: a) minimum entropy regularizers that improve prediction confidences (cf. §3.1), and b) a new variant of co-distillation (Anil et al., 2018) to reduce prediction variance across runs. Specifically, we use a symmetric KL divergence based loss to reduce model disagreement, with a linear warmup and joint updates across multiple models (cf. §3.2).
(iii) We empirically demonstrate the effectiveness of the proposed approach in reducing churn and (sometimes) increasing accuracy. We present ablation studies over its two components to show their complementary nature in reducing churn (cf. §4).

1.1. RELATED WORK

Reproducibility in machine learning. There is a broad field studying the problem of reproducible research (Buckheit & Donoho, 1995; Gentleman & Lang, 2007; Sonnenburg et al., 2007; Kovacevic, 2007; Mesirov, 2010; Peng, 2011; McNutt, 2014; Braun & Ong, 2014; Rule et al., 2018), which identifies best practices to facilitate the reproducibility of scientific results. Henderson et al. (2018) analyzed the reproducibility of methods in reinforcement learning, showing that the performance of certain methods is sensitive to the random seed used in training. While the performance of NNs on image classification tasks is fairly stable (Table 2), we focus on analyzing and improving the reproducibility of individual predictions. Thus, churn can be seen as a specific technical component of this broader reproducibility challenge.

Cormier et al. (2016) defined the disagreement between the predictions of two models as churn. They proposed an MCMC approach to train an initial stable model A so that it has small churn with its future version, say model B; here, future versions are based on slightly modified training data with possibly additional features. In Goh et al. (2016) and Cotter et al. (2019), constrained optimization is utilized to reduce churn across different model versions. In contrast, we are interested in capturing the contribution of factors other than training data modification that cause churn. More recently, Madhyastha & Jain (2019) studied the instability of interpretation mechanisms and of average performance in deep NNs due to changes in random seed, and proposed a stochastic weight averaging (Izmailov et al., 2018) approach to promote robust interpretations. In contrast, we are interested in the robustness of individual predictions.

Ensembling and online distillation. Ensemble methods (Dietterich, 2000; Lakshminarayanan et al., 2017) that combine the predictions from multiple (diverse) models naturally reduce churn by averaging out the randomness in the training procedures of the individual models. However, such methods incur a large memory footprint and high computational cost at inference time. Distillation (Hinton et al., 2015; Bucilua et al., 2006) aims to train a single model from the ensemble to alleviate these costs. Even though the distilled model aims to recover the accuracy of the underlying ensemble, it is unclear whether the distilled model also leads to churn reduction. Furthermore, distillation is a two-stage process, involving first training an ensemble and then distilling it into a single model. To avoid this two-stage training process, multiple recent works (Anil et al., 2018; Zhang et al., 2018; Lan et al., 2018; Song & Chai, 2018; Guo et al., 2020) have focused on online distillation, where

¹ Madani et al. (2004) referred to this as disagreement and used it as an estimate for generalization error and for model selection.

