ON NONDETERMINISM AND INSTABILITY IN NEURAL NETWORK OPTIMIZATION

Abstract

Optimization nondeterminism causes uncertainty when improving neural networks, with small changes in performance difficult to discern from run-to-run variability. While uncertainty can be reduced by training multiple copies of a model with different random seeds, doing so is time-consuming, costly, and harms reproducibility. Despite this, little attention has been paid to establishing an understanding of this problem. In this work, we establish an experimental protocol for understanding the effect of optimization nondeterminism on model diversity, which allows us to study the independent effects of a variety of sources of nondeterminism. Surprisingly, we find that all sources of nondeterminism have similar effects on multiple measures of model diversity. To explain this intriguing fact, we examine and identify the instability of model training, taken as an end-to-end procedure, as the key determinant. We show that even one-bit changes in initial model parameters result in models that converge to vastly different values. Finally, we demonstrate that recent methods in accelerated model ensembling hold promise for reducing the effects of instability on run-to-run variability.

1. INTRODUCTION

Consider this common scenario: you have a baseline "current best" model, and are trying to improve it. One of your experiments has produced a model whose metrics are slightly better than the baseline's. Yet you have your reservations: how do you know the improvement is "real", and not due to random fluctuations that create run-to-run variability? Similarly, consider performing hyperparameter optimization, in which there are many possible values for a set of hyperparameters, and you find minor differences in performance between them. How do you pick the best hyperparameters, and how can you be sure that you have actually picked wisely? In both scenarios, the standard practice is to perform multiple independent training runs of the model to understand its variability. While this does help address the problem, it can be extremely wasteful: it increases the time required for effective research, uses more computing power, and makes reproducibility more difficult, while still leaving some uncertainty.

Ultimately, the source of this problem is nondeterminism in model optimization: randomized components of model training that cause each training run to produce a different model with its own performance characteristics. Nondeterminism arises from many factors: while the most salient source is the random initialization of parameters, others include random shuffling of the training data, per-example stochasticity in data augmentation, explicit random operations (e.g. dropout (Srivastava et al., 2014)), asynchronous model training (Recht et al., 2011), and even nondeterminism in low-level libraries such as cuDNN (Chetlur et al., 2014), which is present to improve throughput on hardware accelerators. Despite the clear impact nondeterminism has on the efficacy of modeling, relatively little attention has been paid to understanding its mechanisms, even in the classical supervised setting.
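The sources listed above can be disentangled in code by seeding each one separately. The following sketch is a hypothetical NumPy-based illustration (the `make_rngs` helper and seed values are our own, not from any particular framework): each source of nondeterminism gets its own random generator, so a single source can be varied while all others are held fixed.

```python
import numpy as np

def make_rngs(init_seed=0, shuffle_seed=0, augment_seed=0, dropout_seed=0):
    """One independently seeded generator per source of nondeterminism."""
    return {
        "init":    np.random.default_rng(init_seed),     # parameter initialization
        "shuffle": np.random.default_rng(shuffle_seed),  # data ordering per epoch
        "augment": np.random.default_rng(augment_seed),  # per-example augmentation
        "dropout": np.random.default_rng(dropout_seed),  # stochastic regularization
    }

rngs = make_rngs()
fan_in, fan_out, n_examples = 64, 32, 1000

# He-style initialization: zero-mean Gaussian, variance scaled by fan-in.
weights = rngs["init"].normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

# Epoch shuffle: a fresh permutation of example indices.
order = rngs["shuffle"].permutation(n_examples)

# Augmentation: per-example random horizontal-flip decisions.
flips = rngs["augment"].random(n_examples) < 0.5

# Dropout mask with keep probability 0.9.
keep_mask = rngs["dropout"].random((fan_in, fan_out)) < 0.9
```

Varying only `shuffle_seed` between two runs, for instance, changes the data order while the initialization, augmentation, and dropout draws remain identical, which is the kind of controlled comparison the experimental protocol in this work requires.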
In this work, we establish an experimental protocol for analyzing the impact of nondeterminism in model training, allowing us to quantify the independent effect of each source of nondeterminism. In doing so, we make a surprising discovery: each source has nearly the same effect on the variability of final model performance. Further, we find each source produces models of similar diversity, as measured by correlations between model predictions, functional changes in model performance under ensembling, and state-of-the-art measures of model similarity (Kornblith et al., 2019). To emphasize one particularly interesting result: nondeterminism in low-level libraries like cuDNN can matter just as much for model diversity and variability as varying the entire network initialization.

We explain this mystery by demonstrating that it can be attributed to an inherent numerical instability in optimizing neural networks: when training with SGD-like approaches, we show that small changes to initial parameters result in large changes to final parameter values. In fact, the instabilities in the optimization process are extreme: changing a single weight by the smallest possible amount within machine precision (~6 × 10^-11) produces nearly as much variability as all other sources combined. Therefore, any source of nondeterminism that has any effect at all on model weights is doomed to inherit at least this level of variability. Finally, we present promising results in reducing the effects of instability. While we find that many approaches result in no apparent change, we demonstrate that methods for accelerated model ensembling do reduce the variability of trained models without increasing training time, providing the first encouraging signs that the problem is tractable.
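The scale of the "smallest possible change within machine precision" can be made concrete. The sketch below (the weight value 5e-4 is a hypothetical example of a typical small initialized weight) uses NumPy's `nextafter` to move a float32 weight to its adjacent representable value; for weights of this magnitude the resulting change is on the order of 6 × 10^-11, matching the figure quoted above.

```python
import numpy as np

# A hypothetical weight magnitude, typical of a fan-in-scaled initialization.
w = np.float32(5e-4)

# The smallest representable perturbation: the adjacent float32 value toward 1.0.
w_perturbed = np.nextafter(w, np.float32(1.0))
delta = float(w_perturbed) - float(w)

print(f"original weight:   {float(w):.10e}")
print(f"one-ULP change:    {delta:.3e}")  # on the order of 6e-11
```

Any source of nondeterminism that perturbs even one weight by at least this amount therefore triggers the instability described above.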

2. RELATED WORK

NONDETERMINISM. Relatively little prior work has studied the effects of nondeterminism on model optimization. Within reinforcement learning, nondeterminism is recognized as a significant barrier to reproducibility and to evaluating progress in the field (Nagarajan et al., 2018; Henderson et al., 2018; Islam et al., 2017; Machado et al., 2018). In the setting of supervised learning, the focus of this work, the problem is much less studied. Madhyastha & Jain (2019) aggregate all sources of nondeterminism into a single random seed and analyze the variability of model attention and accuracy as a function of it across various NLP datasets. They also propose a method for reducing this variability (see Sec. A for details of our reproduction attempt). More commonly, results across multiple random seeds are simply reported (see Erhan et al. (2010) for a particularly extensive example), but the precise nature of nondeterminism's influence on variability goes unstudied.

INSTABILITY. We use the term "stability" to refer to numerical stability, in which a stable algorithm is one whose final output (the converged model) does not vary much as its input (the initial parameters) is changed. Historically, the term "stability" has also been used in learning theory (Bousquet & Elisseeff, 2002), in reference to vanishing and exploding gradients (Haber & Ruthotto, 2017), and in the adversarial robustness community for a particular form of training (Zheng et al., 2016).

3. NONDETERMINISM

Many sources of nondeterminism exist when optimizing neural networks, each of which can affect the variability and performance of trained models. We begin with a very brief overview:

PARAMETER INITIALIZATION. When training a model, parameters without preset values are initialized randomly according to a given distribution, e.g. a Gaussian with mean 0 and variance determined by the number of input connections to the layer (Glorot & Bengio, 2010; He et al., 2015).

DATA SHUFFLING. In stochastic gradient descent (SGD), the overall gradient is approximated by the gradient on a random subset of examples. Most commonly, this is implemented by shuffling the training data, after which the data is iterated through in order. Shuffling may happen either once, before training, or between each epoch of training, the variant we use in this work.

DATA AUGMENTATION. A very common practice, data augmentation refers to randomly altering each training example to artificially expand the training dataset. For example, in the case of images, it is common to randomly flip an image horizontally, which encourages invariance to left/right orientation.

STOCHASTIC REGULARIZATION. Some forms of regularization take the form of stochastic operations in a model during training. Dropout (Srivastava et al., 2014) is the most common instance of this type of regularization, with a variety of others also in relatively common use, such as DropConnect (Wan et al., 2013), variational dropout (Gal & Ghahramani, 2016), and variable-length backpropagation through time (Merity et al., 2017), among many others.

LOW-LEVEL OPERATIONS. An overlooked source of nondeterminism, the very libraries that many deep learning frameworks are built on, such as cuDNN (Chetlur et al., 2014), often are run

