CAN WE USE GRADIENT NORM AS A MEASURE OF GENERALIZATION ERROR FOR MODEL SELECTION IN PRACTICE?

Anonymous

Abstract

A recent theoretical investigation (Li et al., 2020) of the upper bound on the generalization error of deep neural networks (DNNs) demonstrates the potential of using the gradient norm as a measure that complements validation accuracy for model selection in practice. In this work, we carry out empirical studies using several commonly-used neural network architectures and benchmark datasets to understand the effectiveness and efficiency of using the gradient norm as the model selection criterion, especially in the setting of hyper-parameter optimization. While we observe strong correlations between the generalization error and the gradient norm measures, we find that computing the gradient norm is time-consuming due to its high gradient complexity. To balance this trade-off between efficiency and effectiveness, we propose to use an accelerated approximation (Goodfellow, 2015) of the gradient norm that only computes the loss gradient in the Fully-Connected (FC) Layer of DNNs, with significantly reduced computation cost (200∼20,000 times faster). Our empirical studies find that using the approximated gradient norm as one of the hyper-parameter search objectives can select models with lower generalization error, but the efficiency is still low (marginal accuracy improvement at high computation overhead). Our results also show that bandit-based or population-based algorithms, such as BOHB, perform worse with gradient norm objectives, since the correlation between gradient norm and generalization error is not consistent across phases of the training process. Finally, the gradient norm also fails to predict the generalization performance of models based on different architectures, in comparison with state-of-the-art algorithms and metrics.

1. INTRODUCTION

Generalization performance of deep learning through stochastic gradient descent-based optimization has been widely studied in recent work (Li et al., 2020; Chatterjee, 2020; Negrea et al., 2019; Thomas et al., 2019; Zhu et al., 2019; Hu et al., 2019; Zhang et al., 2016; Mou et al., 2017). These studies initialize the analysis from the perspective of learning dynamics and then extend to upper-bound estimation of generalization errors (Mou et al., 2017). More recently, researchers have shifted their focus to providing theoretical or empirical measures of generalization performance (Li et al., 2020; Negrea et al., 2019; Thomas et al., 2019; Cao & Gu, 2019; He et al., 2019; Frei et al., 2019) with respect to deep architectures, hyper-parameters, data distributions, learning dynamics, and so on. This work studies the use of generalization performance measures (Li et al., 2020; Negrea et al., 2019) for model selection purposes (Cawley & Talbot, 2010). Prior to the advent of deep learning, the generalization gap was used as a straightforward measure. Given a set of $N$ training samples $\{x_1, x_2, x_3, \dots, x_N\}$, a deep learning model $\theta$, and a loss function $L(\theta; x)$ on the sample $x$, the generalization gap $G$ is defined as $G(\theta) = \mathbb{E}_{x \sim X}[L(\theta; x)] - \frac{1}{N}\sum_{i=1}^{N} L(\theta; x_i)$, where $X$ is the data distribution and $\mathbb{E}_{x \sim X}[L(\theta; x)]$ is the expected loss. To quantify the generalization gap, validation or testing samples are frequently used to estimate the expected loss (Kohavi et al., 1995): given a validation/testing dataset with $M$ samples $\{y_1, y_2, y_3, \dots, y_M\}$, the empirical generalization gap $G_M$ is estimated as $G_M(\theta) = \frac{1}{M}\sum_{i=1}^{M} L(\theta; y_i) - \frac{1}{N}\sum_{i=1}^{N} L(\theta; x_i)$, with $\lim_{M \to \infty} G_M = G$. However, due to the limited amount of samples, an accurate estimate of the generalization gap is not always available (Cawley & Talbot, 2010).
It has been evidenced that performance tuning based on a validation set frequently causes overfitting to the validation set (Recht et al., 2018) in deep learning settings. Rather than using the empirical generalization gap, advanced measures have been proposed (Li et al., 2020; Negrea et al., 2019; Thomas et al., 2019) to provide data-dependent characterizations of the generalization performance of deep learning. For example, Thomas et al. (2019) derived the Takeuchi information criterion (TIC) using the Hessian and covariance matrices of loss gradients with a low-complexity approximation, and proposed TIC as an empirical metric that correlates with the generalization gap. As the calculation of TIC relies on a validation set, TIC for deep learning is a posterior measure. Further, Negrea et al. (2019) improved mutual-information bounds for Stochastic Gradient Langevin Dynamics via data-dependent measures. More specifically, the squared norm of gradients has been used as a data-dependent prior to tightly bound the generalization gap by measuring the flatness of the empirical risk surface. More recently, the squared gradient norm over the learning dynamics has been studied as a measure that upper-bounds the generalization gap (Li et al., 2020; Negrea et al., 2019). All of the above methods connect the generalization gap of deep learning to the gradients and Hessians of loss functions. While Thomas et al. (2019) is a posterior measure relying on validation datasets, the two studies (Li et al., 2020; Negrea et al., 2019) provide "prior" measures that use only training datasets. More specifically, Li et al. (2020) prove that the generalization error is upper bounded by the empirical squared gradient norm along the optimization path. Formally, they show that, given a model trained on $n$ samples in a dataset $X = \{x_1, x_2, \dots, x_n\}$ for a total of $T$ iterations with a $C$-bounded loss function, the generalization gap of the DNN $\theta_T$ is bounded as $G(\theta_T) \leq \frac{2\sqrt{2}C}{n}\,\mathbb{E}_X\!\left[\sqrt{\sum_{t=1}^{T} \frac{\gamma_t^2}{\sigma_t^2}\, g_e(t)}\right]$, where $g_e(t) = \mathbb{E}_{\theta_{t-1}}\!\left[\frac{1}{n}\sum_{i=1}^{n} \|\nabla L(\theta_{t-1}; x_i)\|_2^2\right]$ is the empirical squared gradient norm at the $t$-th iteration, $\gamma_t$ is the learning rate, and $\sigma_t$ is the standard deviation of the Gaussian noise in the stochastic process. This result directly shows that the empirical squared gradient norm along the optimization path is a good indicator of a model's generalization ability. We are therefore interested in using this signal as a criterion in model selection, so as to find an optimal set of hyper-parameters.

Our Contributions. In this work, we follow the theoretical investigation of Li et al. (2020) and use the squared gradient norm (GN) over the optimization path as a data-dependent generalization performance measure for DNN model selection. Empirically, the proposed metric GN should measure the generalization gap of the DNN model $\theta_T$, trained for $T$ iterations on $N$ training samples, as $\mathrm{GN}(\theta_T) = \sum_{t=1}^{T} \frac{1}{N}\sum_{i=1}^{N} \|\nabla L(\theta_t; x_i)\|_2^2$, where $\nabla L(\theta_t; x_i)$ denotes the loss gradient using the model at the $t$-th iteration and the $i$-th training sample in $\{x_1, x_2, \dots, x_N\}$. Based on the data-dependent measure GN, our work makes three significant technical contributions, as follows. (1) Approximated Gradient Norm (AGN), an accelerated, low-complexity approximation to GN. Although GN can measure the generalization error of a DNN model, its computation is time-consuming, due to the cost of estimating the gradient for every training sample at every iteration.
To lower the computational complexity, we propose the Approximated Gradient Norm (AGN), which only uses the summation of the loss gradients of the Fully-Connected (FC) Layer of the DNN at the end of every epoch (rather than every iteration) as an approximation of GN. Furthermore, the calculation of the FC-layer gradient per sample can be further accelerated using the trick of Goodfellow (2015).
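The acceleration rests on the following observation from Goodfellow (2015): for an FC layer, the per-example gradient of the loss with respect to the weight matrix is the outer product of the back-propagated output gradient and the layer input, so its squared Frobenius norm factorizes into a product of two vector norms and never needs to be materialized. A minimal sketch with synthetic arrays (names hypothetical):

```python
import numpy as np

def fc_grad_sq_norms(H, D):
    """Per-example squared gradient norms of an FC layer's weight matrix.
    H: (n, d) layer inputs; D: (n, k) loss gradients w.r.t. layer outputs.
    The per-example weight gradient is the outer product D[i] (x) H[i], and
    ||D[i] (x) H[i]||_F^2 = ||D[i]||^2 * ||H[i]||^2 (Goodfellow, 2015)."""
    return np.sum(D ** 2, axis=1) * np.sum(H ** 2, axis=1)

rng = np.random.default_rng(2)
H = rng.normal(size=(8, 16))                  # 8 examples, 16 input features
D = rng.normal(size=(8, 10))                  # 10 output units

fast = fc_grad_sq_norms(H, D)                 # O(n(d + k)) work
# naive check: materialize each per-example gradient explicitly, O(n * d * k)
naive = np.array([np.sum(np.outer(D[i], H[i]) ** 2) for i in range(8)])
```

The fast path avoids forming the $n$ gradient matrices of size $k \times d$, which is the source of the large speedups reported in the abstract.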

