OPTIMIZATION VARIANCE: EXPLORING GENERALIZATION PROPERTIES OF DNNS

Abstract

Unlike the conventional wisdom in statistical learning theory, the test error of a deep neural network (DNN) often exhibits double descent: as the model complexity increases, it first follows a classical U-shaped curve and then shows a second descent. Through bias-variance decomposition, recent studies revealed that the bell-shaped variance is the major cause of model-wise double descent (when the DNN is widened gradually). This paper investigates epoch-wise double descent, i.e., the test error of a DNN also shows double descent as the number of training epochs increases. Specifically, we extend the bias-variance analysis to epoch-wise double descent, and reveal that the variance also contributes the most to the zero-one loss, as in model-wise double descent. Inspired by this result, we propose a novel metric, optimization variance (OV), to measure the diversity of model updates caused by the stochastic gradients of random training batches drawn in the same iteration. The OV can be estimated using samples from the training set only, yet it correlates well with the (unknown) test error. It can be used to predict the generalization ability of a DNN when the zero-one loss is used in test, and hence early stopping may be achieved without using a validation set.

1. INTRODUCTION

Deep Neural Networks (DNNs) usually have large model capacity, yet they generalize well. This violates the conventional VC dimension (Vapnik, 1999) and Rademacher complexity (Shalev-Shwartz & Ben-David, 2014) theories, inspiring new designs of network architectures (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015; He et al., 2016; Zagoruyko & Komodakis, 2016) and reconsideration of their optimization and generalization (Zhang et al., 2017; Arpit et al., 2017; Wang et al., 2018; Kalimeris et al., 2019; Rahaman et al., 2019; Allen-Zhu et al., 2019). Model-wise double descent, in which the test error of a DNN first shows a classical U-shaped curve and then enters a second descent as the model complexity increases, has been observed in many machine learning models (Advani & Saxe, 2017; Belkin et al., 2019a; Geiger et al., 2019; Maddox et al., 2020; Nakkiran et al., 2020). Multiple studies provided theoretical evidence of this phenomenon in some tractable settings (Mitra, 2019; Hastie et al., 2019; Belkin et al., 2019b; Yang et al., 2020; Bartlett et al., 2020; Muthukumar et al., 2020). Specifically, Neal et al. (2018) and Yang et al. (2020) performed bias-variance decomposition for the mean squared error (MSE) and cross-entropy (CE) losses, and empirically revealed that the bell-shaped curve of the variance is the major cause of model-wise double descent. Maddox et al. (2020) proposed to measure the effective dimensionality of the parameter space, which can further be used to explain model-wise double descent. Recently, a new double descent phenomenon, epoch-wise double descent, was observed when the number of training epochs, instead of the model complexity, increases 1 (Nakkiran et al., 2020). Compared with model-wise double descent, epoch-wise double descent is much less explored.
Heckel & Yilmaz (2020) showed that epoch-wise double descent occurs when different parts of a DNN are learned at different epochs, and that it can be eliminated by proper scaling of step sizes. Zhang & Wu (2020) discovered that the energy ratio of the high-frequency components of a DNN's prediction landscape, which reflects the model capacity, switches from increasing to decreasing at a certain training epoch, leading to the second descent of the test error. However, this metric fails to provide further information on generalization, such as the early stopping point, or how the size of a DNN influences its performance.

This paper utilizes bias-variance decomposition of the zero-one (ZO) loss (the CE loss is still used in training) to further investigate epoch-wise double descent. By monitoring the behaviors of the bias and the variance, we find that the variance plays an important role in epoch-wise double descent: it dominates, and highly correlates with, the variation of the test error. Although the variance correlates well with the test error, estimating it requires training models on multiple different training sets drawn from the same data distribution, whereas in practice usually only one training set is available 2. Since the variance originates from the randomly sampled training sets, we propose a novel metric, optimization variance (OV), to measure the diversity of model updates caused by the stochastic gradients of random training batches drawn in the same iteration. This metric can be estimated from a single model, using samples drawn from the training set only. More importantly, it correlates well with the test error, and thus can be used to determine the early stopping point in DNN training, without using any validation set.

Some complexity measures have been proposed to characterize the generalization ability of DNNs, such as sharpness (Keskar et al., 2017) and norm-based measures (Neyshabur et al., 2015).
However, their values rely heavily on the model parameters, making comparisons across different models very difficult. Dinh et al. (2017) showed that by re-parameterizing a DNN, one can alter the sharpness of its searched local minima without affecting the function it represents; Neyshabur et al. (2018) showed that these measures cannot explain the generalization behaviors when the size of a DNN increases. Our proposed metric, which only requires the logit outputs of a DNN, is less dependent on the model parameters, and hence can explain many generalization behaviors, e.g., the test error decreases as the network size increases. Chatterji et al. (2020) proposed a metric called Model Criticality that can explain the superior generalization performance of some architectures over others, yet it remains unexplored whether this metric can indicate generalization throughout the entire training procedure, especially for relatively complex generalization behaviors such as epoch-wise double descent.

To summarize, our contributions are:

• We perform bias-variance decomposition on the test error to explore epoch-wise double descent, and show that the variance dominates the variation of the test classification error.

• We propose a novel metric, the OV, which is calculated from the training set only and correlates well with the test classification error.

• Based on the OV, we propose an approach to search for the early stopping point without using a validation set, when the zero-one loss is used in test. Experiments verified its effectiveness.

The remainder of this paper is organized as follows: Section 2 introduces the details of tracing the bias and the variance over training epochs. Section 3 proposes the OV and demonstrates its ability to indicate the test behaviors. Section 4 draws conclusions and points out some future research directions.
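To make the OV idea concrete, the following is a minimal, purely illustrative sketch: a toy logistic-regression model trained with SGD, where at each epoch we draw several candidate batches, form the model update each batch would induce, and measure the normalized spread of the resulting training-set outputs. The toy data, the model, the normalization, and the stopping rule are all hypothetical choices for illustration, not the paper's exact formulation or experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy binary classification data.
n, d = 500, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true + 0.5 * rng.normal(size=n) > 0).astype(float)

def grad(w, Xb, yb):
    """Gradient of the logistic loss on one batch."""
    p = 1.0 / (1.0 + np.exp(-Xb @ w))
    return Xb.T @ (p - yb) / len(yb)

def optimization_variance(w, lr, k=8, batch=32):
    """Diversity of candidate updates from k random batches drawn at the
    same iteration, measured on training-set outputs.  A simplified
    stand-in for the paper's OV metric, not its exact definition."""
    outs = []
    for _ in range(k):
        idx = rng.choice(n, size=batch, replace=False)
        w_k = w - lr * grad(w, X[idx], y[idx])  # candidate update
        outs.append(X @ w_k)                    # outputs on training set
    outs = np.stack(outs)                       # shape (k, n)
    var = outs.var(axis=0).mean()               # spread across candidates
    norm = (outs ** 2).mean() + 1e-12           # scale normalization
    return var / norm

lr, epochs, patience = 0.5, 60, 5
w = np.zeros(d)
ov_history, best_epoch, rises = [], 0, 0
for epoch in range(epochs):
    ov_history.append(optimization_variance(w, lr))
    # One epoch of plain SGD.
    order = rng.permutation(n)
    for start in range(0, n, 32):
        idx = order[start:start + 32]
        w -= lr * grad(w, X[idx], y[idx])
    # Early stopping heuristic: stop once the OV has risen for
    # `patience` consecutive epochs (illustrative rule only).
    if epoch > 0 and ov_history[-1] > ov_history[-2]:
        rises += 1
        if rises >= patience:
            best_epoch = epoch - patience
            break
    else:
        rises, best_epoch = 0, epoch
```

Note that no validation set appears anywhere above: both the OV estimate and the stopping rule use training samples only, which is the property the metric is designed to exploit.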

2. BIAS AND VARIANCE IN EPOCH-WISE DOUBLE DESCENT

This section presents the details of tracing the bias and the variance during training. We show that the variance dominates the epoch-wise double descent of the test error.

2.1. A UNIFIED BIAS-VARIANCE DECOMPOSITION

Bias-variance decomposition is widely used to analyze the generalization properties of machine learning algorithms (Geman et al., 1992; Friedman et al., 2001). It was originally proposed for regression with the mean squared error (MSE) loss (Geman et al., 1992), and was later extended to other loss functions.
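As a concrete (and purely illustrative) example of how such a decomposition can be estimated for the zero-one loss, one widely used unified scheme takes the "main prediction" to be the majority vote over models trained on independent training sets, the bias to be the error rate of the main prediction, and the variance to be the average disagreement with it. The weak learner and the data distribution below are hypothetical choices, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)

def train_centroids(X, y):
    """A deliberately weak learner (nearest class mean) so that both
    bias and variance are clearly visible.  Purely illustrative."""
    return X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)

def predict(model, X):
    mu0, mu1 = model
    d0 = ((X - mu0) ** 2).sum(axis=1)
    d1 = ((X - mu1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def sample_task(m):
    """Draw a dataset of size m from a fixed (hypothetical) distribution."""
    X = rng.normal(size=(m, 2))
    y = (X[:, 0] + 0.8 * rng.normal(size=m) > 0).astype(int)
    return X, y

# T models, each trained on an independent training set.
T = 50
X_test, y_test = sample_task(2000)
preds = np.stack([predict(train_centroids(*sample_task(100)), X_test)
                  for _ in range(T)])                 # shape (T, n_test)

main_pred = (preds.mean(axis=0) > 0.5).astype(int)    # majority vote
bias = (main_pred != y_test).mean()                   # main prediction wrong
variance = (preds != main_pred[None, :]).mean()       # disagreement with main
error = (preds != y_test[None, :]).mean()             # average zero-one loss
```

For the zero-one loss the average error never exceeds bias plus variance, since on points where the main prediction is already wrong, disagreeing with it can only help. Estimating the curves in this paper amounts to repeating such a computation at every training epoch.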



1 In practice, some label noise is often added to the training set to make epoch-wise double descent more conspicuous.

2 Assume the training set has n samples. We could partition it into multiple smaller training sets, each with m samples (m < n), and then train multiple models; however, the variance estimated in this way would differ from the one estimated from training sets with n samples. Alternatively, we could bootstrap the original training set into multiple replicas, each with n samples; however, the empirical distribution of each bootstrap replica differs from that of the original training set, so the estimated variance would also differ.
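One way to see why a bootstrap replica's empirical distribution differs from the original training set's: a size-n replica drawn with replacement contains, on average, only about 1 - 1/e (roughly 63.2%) of the distinct original samples, with the rest appearing as duplicates. A quick numerical check (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(42)
n, trials = 1000, 200

# Average fraction of distinct original samples in a size-n bootstrap replica.
fracs = [len(np.unique(rng.integers(0, n, size=n))) / n for _ in range(trials)]
mean_frac = float(np.mean(fracs))
# Theory: 1 - (1 - 1/n)^n, which tends to 1 - 1/e for large n.
```

Models trained on such replicas therefore see re-weighted versions of the data, so the variance they exhibit is not the variance over independent size-n training sets that the decomposition calls for.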

