ON THE REPRODUCIBILITY OF NEURAL NETWORK PREDICTIONS

Abstract

Standard training techniques for neural networks involve multiple sources of randomness, e.g., initialization, mini-batch ordering, and in some cases data augmentation. Given that neural networks are heavily over-parameterized in practice, such randomness can cause churn: disagreements between the predictions of two models independently trained by the same algorithm, contributing to the 'reproducibility challenges' in modern machine learning. In this paper, we study this problem of churn, identify factors that cause it, and propose two simple means of mitigating it. We first demonstrate that churn is indeed an issue, even for standard image classification tasks (CIFAR and ImageNet), and study the role of the different sources of training randomness that cause it. By analyzing the relationship between churn and prediction confidences, we pursue a two-component approach to churn reduction. First, we propose using minimum-entropy regularizers to increase prediction confidences. Second, we present a novel variant of the co-distillation approach (Anil et al., 2018) to increase model agreement and reduce churn. We present empirical results showing the effectiveness of both techniques in reducing churn while improving the accuracy of the underlying model.
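To make the minimum-entropy idea mentioned above concrete, the sketch below adds a prediction-entropy penalty to a standard cross-entropy loss, which encourages more confident (lower-entropy) softmax outputs. This is a minimal illustrative sketch, not the paper's exact formulation; the function names and the weight `beta` are hypothetical choices.

```python
import numpy as np

def softmax(logits):
    """Numerically stable row-wise softmax."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy_regularized_loss(logits, labels, beta=0.1):
    """Cross-entropy plus beta times the mean prediction entropy.

    Minimizing the added entropy term pushes the softmax outputs
    toward more confident predictions; beta trades off accuracy
    against confidence and is an illustrative hyperparameter.
    """
    p = softmax(np.asarray(logits, dtype=float))
    labels = np.asarray(labels)
    n = len(labels)
    cross_entropy = -np.log(p[np.arange(n), labels] + 1e-12).mean()
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1).mean()
    return cross_entropy + beta * entropy
```

Under this sketch, a batch of confident logits incurs both a smaller cross-entropy and a smaller entropy penalty than a near-uniform batch, so the regularizer rewards exactly the high-confidence behavior the paper associates with lower churn.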

1. INTRODUCTION

Deep neural networks (DNNs) have seen remarkable success in a range of complex tasks, and significant effort has been spent on further improving their predictive accuracy. However, an equally important desideratum of any machine learning system is stability, or reproducibility, in its predictions. In practice, machine learning models are continuously (re)trained as new data arrives, or to incorporate architectural and algorithmic changes. A model that changes its predictions on a significant fraction of examples after each update is undesirable, even if each model instantiation attains high accuracy. Reproducibility of predictions is a challenge even when the architecture and training data are fixed across different training runs, which is the focus of this paper. Unfortunately, two key ingredients that help deep networks attain high accuracy, namely over-parameterization and the randomization of their training algorithms, pose significant challenges to their reproducibility. The former refers to the fact that NNs typically have many solutions that minimize the training objective (Neyshabur et al., 2015; Zhang et al., 2017). The latter refers to the fact that standard training of NNs involves several sources of randomness, e.g., initialization, mini-batch ordering, non-determinism in training platforms, and in some cases data augmentation. Put together, these imply that NN training can find vastly different solutions in each run even when the training data is the same, leading to a reproducibility challenge. The prediction disagreement between two models is referred to as churn (Cormier et al., 2016).¹ Concretely, given two models, churn is the fraction of test examples where the predictions of the two models disagree. Clearly, churn is zero if both models have perfect accuracy, an unattainable goal in most practical settings of interest. Similarly, one can mitigate churn by eliminating all sources of randomness in the underlying training setup.
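Churn as defined above (the fraction of test examples on which two independently trained models disagree) can be computed directly from the two models' predicted labels. This is a minimal sketch, not the paper's code; the function name and example predictions are hypothetical.

```python
import numpy as np

def churn(preds_a, preds_b):
    """Fraction of examples on which two models' predicted labels disagree.

    preds_a, preds_b: predicted class labels from two models trained
    independently by the same algorithm, evaluated on the same test set.
    """
    preds_a = np.asarray(preds_a)
    preds_b = np.asarray(preds_b)
    return float(np.mean(preds_a != preds_b))

# Two hypothetical models' predictions on a 5-example test set;
# they disagree on 2 of 5 examples, so churn is 0.4.
print(churn([0, 1, 2, 1, 0], [0, 2, 2, 1, 1]))  # → 0.4
```

Note that churn is measured against the other model's predictions, not against the ground-truth labels, so two models can each be accurate yet still exhibit nonzero churn on the examples they get wrong in different ways.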
However, even if one controls the seed used for random initialization and the order of the data, inherent non-determinism in current computation platforms is hard to avoid (see §2.3). Moreover, it is desirable to have stable models with predictions

¹ Madani et al. (2004) referred to this quantity as disagreement and used it as an estimate of generalization error for model selection.

