ON NONDETERMINISM AND INSTABILITY IN NEURAL NETWORK OPTIMIZATION

Abstract

Optimization nondeterminism causes uncertainty when improving neural networks, making small changes in performance difficult to discern from run-to-run variability. While uncertainty can be reduced by training multiple copies of a model with different random seeds, doing so is time-consuming, costly, and makes reproducibility challenging. Despite this, little attention has been paid to establishing an understanding of this problem. In this work, we establish an experimental protocol for understanding the effect of optimization nondeterminism on model diversity, which allows us to study the independent effects of a variety of sources of nondeterminism. Surprisingly, we find that all sources of nondeterminism have similar effects on multiple measures of model diversity. To explain this intriguing fact, we examine and identify the instability of model training, when taken as an end-to-end procedure, as the key determinant. We show that even one-bit changes in initial model parameters result in models that converge to vastly different values. Last, we demonstrate that recent methods in accelerated model ensembling hold promise for reducing the effects of instability on run-to-run variability.

1. INTRODUCTION

Consider this common scenario: you have a baseline "current best" model, and are trying to improve it. Now, one of your experiments has produced a model whose metrics are slightly better than the baseline. Yet you have your reservations: how do you know the improvement is "real", and not due to random fluctuations that create run-to-run variability? Similarly, consider performing hyperparameter optimization, in which there are many possible values for a set of hyperparameters, and you find minor differences in performance between them. How do you pick the best hyperparameters, and how can you be sure that you've actually picked wisely? In both scenarios, the standard practice is to perform multiple independent training runs of your model to understand its variability. While this does indeed help address the problem, it can be extremely wasteful, increasing the time required for effective research, using more computing power, and making reproducibility more difficult, while still leaving some uncertainty.

Ultimately, the source of this problem is the nondeterminism in optimizing models: randomized components of model training that cause each training run to produce different models, each with its own performance characteristics. Nondeterminism itself occurs due to many factors: while the most salient source is the random initialization of parameters, other sources exist, including random shuffling of training data, per-example stochasticity of data augmentation, any explicit random operations (e.g. dropout (Srivastava et al., 2014)), asynchronous model training (Recht et al., 2011), and even nondeterminism in low-level libraries such as cuDNN (Chetlur et al., 2014), which is present to improve throughput on hardware accelerators. Despite the clear impact nondeterminism has on the efficacy of modeling, relatively little attention has been paid towards understanding its mechanisms, even in the classical supervised setting.
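To make the notion of isolating individual sources of nondeterminism concrete, the following is a minimal toy sketch (not the protocol used in this work): a small logistic-regression model trained with SGD, where parameter initialization, data shuffling, and per-example augmentation noise each draw from an independent, separately seeded RNG stream. Holding all seeds fixed reproduces a run exactly; varying any single seed isolates that source's effect. All function and variable names here are illustrative.

```python
import numpy as np

def train(init_seed=0, shuffle_seed=0, aug_seed=0, steps=200):
    """Train a tiny linear classifier, isolating three sources of
    nondeterminism with independent RNG streams (toy illustration)."""
    init_rng = np.random.default_rng(init_seed)        # random initialization
    shuffle_rng = np.random.default_rng(shuffle_seed)  # training-data shuffling
    aug_rng = np.random.default_rng(aug_seed)          # stochastic augmentation

    # Fixed synthetic dataset, identical across all runs.
    data_rng = np.random.default_rng(1234)
    X = data_rng.normal(size=(64, 5))
    y = (X @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]) > 0).astype(float)

    w = init_rng.normal(scale=0.1, size=5)             # source 1: init
    for _ in range(steps):
        order = shuffle_rng.permutation(len(X))        # source 2: shuffling
        for i in order:
            # source 3: per-example "augmentation" noise
            x = X[i] + aug_rng.normal(scale=0.01, size=5)
            p = 1.0 / (1.0 + np.exp(-x @ w))           # sigmoid prediction
            w -= 0.1 * (p - y[i]) * x                  # SGD on logistic loss
    return w

# Identical seeds for every source -> bitwise-identical final weights.
same = np.array_equal(train(0, 0, 0), train(0, 0, 0))
# Changing a single source (here, only the init seed) -> a different model.
diff = not np.array_equal(train(0, 0, 0), train(1, 0, 0))
```

The same factoring extends naturally to the other sources listed above (e.g. a dedicated stream for dropout masks), which is what makes per-source ablations of run-to-run variability possible in the first place.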
In this work, we establish an experimental protocol for analyzing the impact of nondeterminism in model training, allowing us to quantify the independent effect of each source of nondeterminism. In doing so, we make a surprising discovery: each source has nearly the same effect on the variability of final model performance. Further, we find each source produces models of similar diversity, as measured by correlations between model predictions, functional changes in model performance while ensembling, and state-of-the-art methods of model similarity (Kornblith et al., 2019). To emphasize

