CAN WE USE GRADIENT NORM AS A MEASURE OF GENERALIZATION ERROR FOR MODEL SELECTION IN PRACTICE? Anonymous

Abstract

The recent theoretical investigation (Li et al., 2020) on the upper bound of the generalization error of deep neural networks (DNNs) demonstrates the potential of using the gradient norm as a measure that complements validation accuracy for model selection in practice. In this work, we carry out empirical studies using several commonly used neural network architectures and benchmark datasets to understand the effectiveness and efficiency of using the gradient norm as the model selection criterion, especially in the setting of hyper-parameter optimization. While strong correlations between the generalization error and the gradient norm measures have been observed, we find that the computation of the gradient norm is time-consuming due to the high gradient complexity. To balance the trade-off between efficiency and effectiveness, we propose to use an accelerated approximation (Goodfellow, 2015) of the gradient norm that only computes the loss gradient in the Fully-Connected Layer (FC Layer) of DNNs, with significantly reduced computation cost (200∼20,000 times faster). Our empirical studies clearly show that the use of the approximated gradient norm, as one of the hyper-parameter search objectives, can select models with lower generalization error, but the efficiency is still low (marginal accuracy improvement at high computation overhead). Our results also show that bandit-based or population-based algorithms, such as BOHB, perform worse with gradient-norm objectives, since the correlation between the gradient norm and the generalization error is not always consistent across phases of the training process. Finally, the gradient norm also fails to predict the generalization performance of models based on different architectures, in comparison with state-of-the-art algorithms and metrics.

1. INTRODUCTION

Generalization performance of deep learning through stochastic gradient descent-based optimization has been widely studied in recent work (Li et al., 2020; Chatterjee, 2020; Negrea et al., 2019; Thomas et al., 2019; Zhu et al., 2019; Hu et al., 2019; Zhang et al., 2016; Mou et al., 2017). These studies start the analysis from the learning dynamics' perspective and then extend to the upper-bound estimation of generalization errors (Mou et al., 2017). More recently, researchers have shifted their focus onto providing theoretical or empirical measures of generalization performance (Li et al., 2020; Negrea et al., 2019; Thomas et al., 2019; Cao & Gu, 2019; He et al., 2019; Frei et al., 2019) with respect to deep architectures, hyper-parameters, data distributions, learning dynamics, and so on. This work studies the use of generalization performance measures (Li et al., 2020; Negrea et al., 2019) for model selection purposes (Cawley & Talbot, 2010). Prior to the advent of deep learning, the generalization gap was used as a straightforward measure. Given a set of $N$ training samples $\{x_1, x_2, x_3, \ldots, x_N\}$, a deep learning model $\theta$, and a loss function $L(\theta; x)$ on the sample $x$, the generalization gap $G$ is defined as
$$G(\theta) = \mathbb{E}_{x\sim X}\,L(\theta; x) - \frac{1}{N}\sum_{i=1}^{N} L(\theta; x_i),$$
where $X$ is the data distribution and $\mathbb{E}_{x\sim X}\,L(\theta; x)$ refers to the expected loss. To quantify the generalization gap, validation or testing samples have frequently been used to estimate the expected loss (Kohavi et al., 1995): given a validation/testing dataset with $M$ samples $\{y_1, y_2, y_3, \ldots, y_M\}$, the empirical generalization gap $G_M$ is estimated as
$$G_M(\theta) = \frac{1}{M}\sum_{i=1}^{M} L(\theta; y_i) - \frac{1}{N}\sum_{i=1}^{N} L(\theta; x_i),$$
with $\lim_{M \to \infty} G_M = G$. However, due to the limited number of samples, an accurate estimate of the generalization gap is not always available (Cawley & Talbot, 2010).
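As a minimal illustration of the empirical gap defined above, $G_M$ can be computed directly from per-sample losses. This is a sketch; the helper name is ours, not from the paper:

```python
def empirical_gen_gap(val_losses, train_losses):
    """Empirical generalization gap G_M(theta): the mean loss over the M
    validation/testing samples minus the mean loss over the N training samples."""
    assert val_losses and train_losses, "both loss lists must be non-empty"
    val_mean = sum(val_losses) / len(val_losses)
    train_mean = sum(train_losses) / len(train_losses)
    return val_mean - train_mean
```

As $M$ grows, this estimate converges to the true gap $G$, but with a small validation set it can be noisy, which is the limitation the paragraph above points out.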
It has been evidenced that performance tuning based on a validation set frequently causes overfitting to the validation set (Recht et al., 2018) in deep learning settings. Rather than the empirical generalization gap, more advanced measures have been proposed (Li et al., 2020; Negrea et al., 2019; Thomas et al., 2019) to provide a data-dependent characterization of the generalization performance of deep learning. For example, Thomas et al. (2019) derived the Takeuchi information criterion (TIC) using the Hessian and covariance matrices of loss gradients with a low-complexity approximation, and proposed TIC as an empirical metric that correlates with the generalization gap. As the calculation of TIC relies on the validation set, TIC for deep learning is a posterior measure. Further, Negrea et al. (2019) improved mutual information bounds for Stochastic Gradient Langevin Dynamics via data-dependent measures. More specifically, the squared norm of gradients has been used as a data-dependent prior to tightly bound the generalization gap by measuring the flatness of the empirical risk surface. More recently, the squared norm of gradients over the learning dynamics has been studied as a measure to upper-bound the generalization gap (Li et al., 2020; Negrea et al., 2019). All of the above methods connect the generalization gap of deep learning to the gradients and Hessians of loss functions. While Thomas et al. (2019) provide a posterior measure relying on validation datasets, the two studies (Li et al., 2020; Negrea et al., 2019) provide "prior" measures that use training datasets only. More specifically, Li et al. (2020) prove that the generalization error is upper-bounded by the empirical squared gradient norm along the optimization path. Formally, they show that, given a model trained on a dataset $X = \{x_1, x_2, \ldots, x_n\}$ of $n$ samples for a total of $T$ iterations with a $C$-bounded loss function, the theoretical generalization gap of the DNN $\theta_T$ is bounded as
$$G(\theta_T) \le \frac{2\sqrt{2}\,C}{n}\,\mathbb{E}_X\!\left[\sqrt{\sum_{t=1}^{T}\frac{\gamma_t^2}{\sigma_t^2}\, g_e(t)}\right],$$
where $g_e(t) = \mathbb{E}_{\theta_{t-1}}\!\left[\frac{1}{n}\sum_{i=1}^{n}\|\nabla L(\theta_{t-1}; x_i)\|_2^2\right]$ is the empirical squared gradient norm at the $t$-th iteration, $\gamma_t$ is the learning rate, and $\sigma_t$ is the standard deviation of the Gaussian noise in the stochastic process. Their result directly shows that the empirical squared gradient norm along the optimization path is a good indicator of a model's generalization ability. Therefore, we are interested in using this signal as a criterion in model selection, so as to find an optimal set of hyper-parameters.

Our Contributions. In this work, we follow the theoretical investigation in Li et al. (2020) and use the squared gradient norm (GN) over the optimization path as a data-dependent generalization performance measure for DNN model selection. Empirically, the metric GN should measure the generalization gap of the DNN model $\theta_T$, trained for $T$ iterations on $N$ training samples, as
$$\mathrm{GN}(\theta_T) = \sum_{t=1}^{T}\frac{1}{N}\sum_{i=1}^{N}\|\nabla L(\theta_t; x_i)\|_2^2,$$
where $\nabla L(\theta_t; x_i)$ refers to the loss gradient of the model at the $t$-th iteration on the $i$-th training sample in $\{x_1, x_2, \ldots, x_N\}$. Based on the data-dependent measure GN, our work makes three significant technical contributions: (1) Approximated Gradient Norm (AGN), an accelerated, low-complexity approximation to GN. Although GN can measure the generalization error of a DNN model, its computation is expensive, since it requires a gradient estimate for every training sample at every iteration.
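To make the definition of GN concrete, the following toy sketch accumulates $\frac{1}{N}\sum_i \|\nabla L(\theta_t; x_i)\|_2^2$ at every iteration of training. It uses logistic regression as a stand-in for the DNN so the per-sample gradient has a closed form; all names and the training loop are illustrative, not the paper's implementation:

```python
import numpy as np

def per_sample_grads(theta, X, y):
    """Per-sample loss gradients of logistic regression (a stand-in for a DNN):
    grad_i = (sigmoid(x_i . theta) - y_i) * x_i, returned with shape (N, d)."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return (p - y)[:, None] * X

def train_and_track_gn(X, y, lr=0.1, iters=50):
    """Full-batch gradient descent that accumulates
    GN = sum_t (1/N) sum_i ||grad L(theta_t; x_i)||_2^2 along the path."""
    theta = np.zeros(X.shape[1])
    gn = 0.0
    for _ in range(iters):
        g = per_sample_grads(theta, X, y)      # (N, d) per-sample gradients
        gn += np.mean(np.sum(g * g, axis=1))   # (1/N) sum_i ||g_i||_2^2
        theta -= lr * g.mean(axis=0)           # one optimization step
    return theta, gn
```

Note the cost that motivates AGN in Section 2: the per-sample gradient pass here is cheap only because the model is linear; for a deep network it would require $N \times T$ backpropagations.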
To lower the computational complexity, we propose the Approximated Gradient Norm (AGN), which only uses the summation of loss gradients of the Fully-Connected (FC) Layer of the DNN at the end of every epoch (rather than every iteration) as an approximation of GN. Furthermore, the calculation of the per-sample FC-layer gradient can be further accelerated by the method of Goodfellow (2015) at extremely low cost. Our empirical evaluation finds that, over various DNN models trained with different hyper-parameters, the metrics GN(θ_T) and AGN(θ_T) behave identically with respect to the empirical generalization gap G_M(θ_T). This approximation makes it feasible to carry out experiments that evaluate the effectiveness of GN for model selection. (2) To validate the correlations between generalization performance and AGN(θ_T), we carry out extensive experiments using various deep neural networks, such as the Multi-Layer Perceptron (MLP), LeNet (LeCun et al., 1998), and ResNet (He et al., 2016), on benchmark datasets including MNIST (LeCun et al., 2010), Fashion-MNIST (Xiao et al., 2017), SVHN (Netzer et al., 2011), and CIFAR (Krizhevsky et al., 2009). Observation 1: When the models are well-fitted (i.e., with low training loss or high training accuracy), AGN(θ_T) corresponds well with the empirical generalization gap G_M(θ_T): models with lower/higher AGN(θ_T) consistently have lower/higher G_M(θ_T) and better/poorer generalization performance. On the other hand, when models are not well-fitted (i.e., under-fitted), either due to inappropriate hyper-parameters or a training process that has not converged (e.g., T is small), the correlations between G_M(θ_T) and AGN(θ_T) are not always consistent, and the direction of the correlation is sometimes even opposite to the theoretical investigation in Li et al. (2020) and Eq. (3).
(3) To understand the effectiveness and efficiency of using AGN(θ_T) for model selection, we extend our experiments to use AGN(θ_T) as an objective for hyper-parameter selection, under both black-box optimization (Escalante et al., 2009; Loshchilov & Hutter, 2016) and bandit-based search (Li et al., 2017; Falkner et al., 2018) settings. Observation 2: We find that, combined with the training or validation loss, AGN(θ_T) can help black-box optimization algorithms, such as particle swarm optimization (PSO) (Escalante et al., 2009) or covariance matrix adaptation evolution strategies (CMA-ES) (Loshchilov & Hutter, 2016), search for models (hyper-parameters) with equivalent or marginally better performance than using validation loss/accuracy as the search objective. The search procedure can thereby somewhat avoid the potential overfitting to the validation set (Recht et al., 2018). However, the use of AGN(θ_T) requires additional prior knowledge (extra parameters) to balance the weights of the training/validation loss and AGN(θ_T) in the combined objective during the search, to which model selection can be sensitive. Observation 3: Further, our research finds that bandit-based search algorithms, such as Bayesian optimization over HyperBand (BOHB) (Falkner et al., 2018), do not work well with gradient norms. This is because BOHB selects and drops models during the learning process based on intermediate performance measures, while within an end-to-end learning process AGN(θ_t) (for θ_1, θ_2, θ_3, …, θ_T) cannot always provide consistent measures of generalization performance. Observation 4: In addition to hyper-parameter selection for deep learning, our experiments find that AGN(θ_T) fails to predict the generalization performance of models based on different deep architectures, in comparison with some state-of-the-art algorithms and metrics (Jiang et al., 2019; Nagarajan & Kolter, 2019).
We believe the most relevant studies to this work are those by Li et al. (2020); Negrea et al. (2019); Thomas et al. (2019), which measure the generalization performance of DNNs using derivatives of the loss, such as gradients and Hessian matrices; the contribution made by our work is unique. Compared to Thomas et al. (2019), the measure AGN studied here is also a data-dependent metric that characterizes generalization performance, but AGN uses gradients and avoids the use of validation sets, while Thomas et al. (2019) is based on Hessian matrices and relies on validation data. Compared to Li et al. (2020); Negrea et al. (2019), our work studies the feasibility of using gradient norms over the learning process for model selection in empirical settings: we first provide AGN as a low-complexity implementation that accelerates the computation of gradient norms, and then present extensive results to demonstrate the pros and cons of using gradient norms or AGN for model selection in two major hyper-parameter search settings. Our results demonstrate both the potential and the limitations of adopting current theoretical results (Li et al., 2020; Negrea et al., 2019) as an objective for model selection.

2. GRADIENT NORM AND ITS APPROXIMATION

Given a DNN model θ_T that has been trained for T iterations, the straightforward way to compute GN(θ_T) is to first collect and store the model weights at every iteration of the training process (i.e., θ_1, θ_2, …, θ_T), and then compute the gradient for every training sample (N in total) on every model (T in total) through backpropagation (BP) over the whole DNN, N × T times in total. In practice, such a computation cost is far too high. In this section, we present AGN(θ_T) as a low-complexity approximation to GN(θ_T) and then discuss the approximation performance.

2.1. AGN: APPROXIMATED GRADIENT NORM

We propose AGN(θ_T) as a low-complexity approximation that accelerates the computation of GN(θ_T) using three strategies.

Depth-wise Acceleration. While DNNs frequently comprise hundreds of layers, AGN(θ_T) approximates GN(θ_T) using only the gradients of the last Fully-Connected (FC) Layer of the DNN.

Sample-wise Acceleration. While the computation of GN(θ_T) needs to run time-consuming BP for every training sample, AGN(θ_T) uses the low-complexity per-sample gradient estimation algorithm (Goodfellow, 2015) to compute the gradients of the FC layer without full BP.

Epoch-wise Acceleration. While the computation of GN(θ_T) aggregates the gradient norms of every iteration, AGN(θ_T) approximates GN(θ_T) via the summation of the squared gradient norms collected at the end of every epoch. Given the batch size B, each epoch consists of N/B gradient descent iterations, the learning process with T iterations takes TB/N epochs to complete, and θ_{τN/B} for τ = 1, 2, …, TB/N refers to the model obtained at the end of each epoch.

Figure 1: In Fig. 1a to Fig. 1d, we pick 9 sets of random hyper-parameters for MLP and 12 sets for LeNet-5, train each model for 40 epochs, and plot the results. For the two architectures tested on MNIST, AGN(θ_T) and GN(θ_T) exhibit identical trends when the generalization error is plotted against each of them. Strong positive correlation between GN and AGN is observed in each setting as well, with Pearson correlation coefficients of 0.962 for MLP and 0.987 for LeNet-5. Figs. 1e and 1f report additional experiments with deeper neural network models to verify the strong positive correlation between AGN(θ_T) and GN(θ_T); these experiments follow a piecewise learning-rate decay policy, so there are three plots for each ResNet setting.
In this way, we propose to compute AGN(θ_T) as
$$\mathrm{AGN}(\theta_T) = \sum_{\tau=1}^{TB/N}\frac{1}{N}\sum_{i=1}^{N}\left\|\nabla L_{\mathrm{FC}}\big(\theta_{\tau N/B};\, x_i\big)\right\|_2^2,$$
where $\nabla L_{\mathrm{FC}}(\theta; x)$ refers to the gradient of the FC layer for the model $\theta$ and the sample $x$, accelerated using the algorithm proposed by Goodfellow (2015).
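The sample-wise acceleration rests on a simple identity from Goodfellow (2015): for an FC layer $z = Wh + b$, the per-sample weight gradient is the outer product $\delta h^\top$ (with $\delta = \partial L/\partial z$), so its squared Frobenius norm factorizes as $\|\delta\|_2^2\,\|h\|_2^2$ and never has to be materialized. The sketch below (function name ours) computes all per-sample squared FC-gradient norms from two batched tensors, which a framework produces in a single forward/backward pass:

```python
import numpy as np

def fc_per_sample_sq_norms(H, Delta):
    """Per-sample squared gradient norms of an FC layer z = W h + b,
    via the identity of Goodfellow (2015):
      ||dL/dW_i||_F^2 = ||delta_i||^2 * ||h_i||^2,   ||dL/db_i||^2 = ||delta_i||^2.
    H:     (B, d_in)  layer inputs h_i for a batch of B samples.
    Delta: (B, d_out) loss gradients w.r.t. the pre-activations z_i.
    Returns a (B,) array of per-sample ||grad_FC||_2^2 (weights + bias)."""
    h_sq = np.sum(H * H, axis=1)          # ||h_i||^2 for each sample
    d_sq = np.sum(Delta * Delta, axis=1)  # ||delta_i||^2 for each sample
    return d_sq * h_sq + d_sq
```

This replaces B separate backward passes (or B explicit d_out × d_in outer products) with two batched reductions, which is where the large speed-ups reported in Section 2.2 come from.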

2.2. APPROXIMATION PERFORMANCE

We discuss the approximation performance of AGN(θ T ) to GN(θ T ) from efficiency and effectiveness perspectives, in comparison with validation accuracy, as ways of generalization performance measurements.

Effectiveness Comparison.

We find that AGN(θ_T) preserves the effectiveness of GN(θ_T) as a generalization performance indicator. In Figure 1, we compare AGN(θ_T) and GN(θ_T) using 9 Multi-Layer Perceptron (MLP) models and 12 LeNet-5 models, trained with 9 and 12 sets of random hyper-parameters respectively on the MNIST dataset. Every model is trained for 40 epochs; we measure the generalization gap using the validation set and estimate GN(θ_T) and AGN(θ_T) respectively. In Figure 1(a) and (c), we plot GN(θ_T) and AGN(θ_T) of these models against their generalization gaps. Despite the different scales of the two measures, the trends of GN(θ_T) and AGN(θ_T) behave identically with respect to the generalization gaps. Further, we correlate GN(θ_T) and AGN(θ_T) for every model and show the correlations in Figure 1(b) and (d). We also carry out additional experiments on larger neural networks; the results in Fig. 1e and Fig. 1f validate that the correlations between GN(θ_T) and AGN(θ_T) are strong, significant, and consistent.

Efficiency Comparison. We find that the computation of AGN(θ_T) is much faster than that of GN(θ_T), while it still consumes significantly more time than using a validation set to measure generalization performance. In Figure 2, we plot the time consumption per epoch of the three generalization performance measurements, i.e., AGN(θ_T), GN(θ_T), and validation accuracy, using LeNet-5, ResNet-20, and ResNet-56 on (parts of) the MNIST, SVHN, and CIFAR-100 datasets. To simulate realistic hyper-parameter search settings, we collect the time consumption of the three measurements using full and reduced (20%) datasets separately. The comparison shows that AGN(θ_T) is 220x∼20,489x faster than GN(θ_T), while it still consumes 95%∼1,845% more time than using validation accuracy.
Note that, for this comparison, we originally collected the time consumption of GN(θ_T) for a single iteration and then rescaled the figures to one epoch. In summary, we conclude that AGN(θ_T) is a low-complexity but tight approximation to GN(θ_T) for generalization performance measurement.

3. USING AGN AS A GENERALIZATION PERFORMANCE INDICATOR

To understand the correlations between AGN(θ_T) and generalization performance, we measure G_M(θ_T) and AGN(θ_T) for a wide range of DNN models using ResNet-20, ResNet-56, ResNet-110, and DenseNet100x24 (Huang et al., 2017) on the CIFAR-10 dataset. For each architecture, we train 32 DNN models, each with a random set of hyper-parameters and the same number of epochs. Specifically, we discuss the correlations in two scenarios: the model is well-fitted or under-fitted. With a sufficient number of training iterations and appropriate hyper-parameter settings, the models are usually trained to fit the training datasets well, with low training loss and high training accuracy. However, due to a lack of convergence or inappropriate settings, some models remain under-fitted even after a large number of iterations.

(Figure 3a: generalization error plotted against AGN, for all architectures including DenseNet100x24 and with models split into the top 50% and bottom 50% by training accuracy, shows an inverted-checkmark trend.)

Observation 1. The correlations between AGN(θ_T) and G_M(θ_T) are not consistent from model to model: AGN(θ_T) is positively correlated with G_M(θ_T) when models are well-fitted, while the opposite trend is observed for under-fitted models. Fig. 3 illustrates that these inconsistent correlations between AGN(θ_T) and G_M(θ_T) appear in all of these architectures. For each architecture, 32 random hyper-parameter configurations were chosen to train 32 models, and we obtain each model's AGN(θ_T) at the end of 160 epochs as well as its generalization error, plotted in Fig. 3a. The scatter plot exhibits a trend that looks like an inverted checkmark, indicating that the correlation between AGN(θ_T) and the generalization error behaves differently depending on model convergence. Due to this inconsistency, it is not always possible to use AGN(θ_T) as an indicator of generalization performance unless the models are well-fitted.

Positive correlation between AGN(θ_T) and G_M(θ_T) for well-fitted models. We identify the models with the top 50% training accuracy. These models have fit the training set with at least 88% accuracy and are hence well-fitted. They show a positive correlation between AGN(θ_T) and G_M(θ_T), as shown in Fig. 3b. Under these settings, it is feasible to use AGN(θ_T) as an alternative generalization performance indicator, in addition to the validation loss/accuracy. This observation coincides with the theoretical findings in Li et al. (2020) and Eq. (3). However, the p-values for the Pearson correlations in Fig. 3(b) are 0.336, 0.039, 0.036, and 0.053 respectively; three of them hover near, rather than falling far below, the 0.05 significance threshold, so the positive correlation between the generalization gap and AGN(θ_T) is only weakly supported.
This weak correlation most likely indicates that GN(θ_T) itself does not work very well as a metric in practice, since the experimental results in Fig. 1 have already shown that AGN(θ_T) approximates GN(θ_T) well. Negative correlation between AGN(θ_T) and G_M(θ_T) for under-fitted models. In contrast, we also identify the models with the bottom 50% training accuracy. Such models are under-fitted due to inappropriate hyper-parameter settings or a lack of convergence in the training process. Fig. 3c shows that AGN(θ_T) is negatively correlated with G_M(θ_T) for under-fitted models. We do not, however, consider this observation to conflict with the theoretical findings in Li et al. (2020) or Eq. (3), because Li et al. (2020) actually studied the learning dynamics of DNNs in asymptotic settings, deriving the upper bound of the generalization error using gradient norms as T → ∞ (i.e., the training procedure has converged and the models are well-fitted). Nonetheless, a negative correlation between AGN(θ_T) and G_M(θ_T) still seems surprising, and we offer an intuitive explanation. The general trend is that the gradient norm gradually decreases as training proceeds while the generalization gap increases (also reported in Li et al. (2020)). Across models with different hyper-parameters, these two quantities can change magnitude at very different speeds. A model that remains under-fitted throughout training has a low generalization gap (it makes almost random guesses on both training and test sets), but its training samples are likely to consistently produce large gradients. As a result, under-fitted models with smaller generalization gaps have larger AGN(θ_T), contributing to the negative correlation shown in Fig. 3c.
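The correlation analyses in Sections 2.2 and 3 all rest on the sample Pearson coefficient between a measure (GN or AGN) and the generalization gap across models. For reference, a minimal self-contained implementation (significance testing, i.e., the p-values above, would additionally require a t-distribution and is omitted here):

```python
import numpy as np

def pearson_r(x, y):
    """Sample Pearson correlation coefficient between two equal-length
    sequences, as used to relate AGN/GN and the generalization gap
    across trained models."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()              # center both series
    denom = np.sqrt(np.sum(xc * xc) * np.sum(yc * yc))
    return float(np.sum(xc * yc) / denom)
```

A value near +1 corresponds to the well-fitted regime in Fig. 3b, and a negative value to the under-fitted regime in Fig. 3c.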

4. USING AGN FOR MODEL SELECTION

In this section, we investigate the feasibility of using AGN(θ_T) as a generalization performance indicator for model selection. We consider two possible applications of AGN(θ_T): the first incorporates AGN(θ_T) as an objective in hyper-parameter search for a fixed architecture, and the second uses AGN(θ_T) to predict the generalization error of models across architectures.

4.1. USE CASE 1: HYPER-PARAMETER SEARCH FOR DEEP NEURAL NETWORKS

To address the fitness issues from Observation 1, we propose to include the training loss and/or validation loss in the search objectives as follows:
$$\mathrm{Objective}_{\mathrm{AGN+Train}}(\theta_T) = \alpha \cdot \mathrm{AGN}(\theta_T) + \frac{1}{N}\sum_{i=1}^{N} L(\theta_T; x_i),$$
$$\mathrm{Objective}_{\mathrm{AGN+Val}}(\theta_T) = \alpha \cdot \mathrm{AGN}(\theta_T) + \frac{1}{M}\sum_{i=1}^{M} L(\theta_T; y_i),$$
where $\{x_1, x_2, \ldots, x_N\}$ and $\{y_1, y_2, \ldots, y_M\}$ refer to the training and validation sets respectively, and $\alpha$ is a weight that balances the training/validation loss against AGN(θ_T). We test the effectiveness of these two search objectives by conducting hyper-parameter search with ResNet-20 and ResNet-56 on the Fashion-MNIST, SVHN, and CIFAR datasets; the results for Fashion-MNIST and SVHN are in the appendix. On top of the two objectives in Eq. (6), we try multiple settings of α and adopt two commonly used black-box optimization algorithms, CMA-ES (Loshchilov & Hutter, 2016) and PSO (Escalante et al., 2009), for hyper-parameter search with a computation budget of 5,120 GPU×epochs, where we use a reduced training set with 20% of the training samples (Cubuk et al., 2019) to accelerate the search. The search space is Batch Size [32, 1000], Learning Rate [1 × 10⁻⁷, 0.50], Momentum [0, 0.99], and Weight Decay [5 × 10⁻⁷, 0.05], with initial settings Batch Size = 100, Learning Rate = 0.01, Momentum = 0.6, and Weight Decay = 0.0005. Observation 2. When combining AGN(θ_T) with the training or validation loss as the search objective, black-box optimization algorithms can find hyper-parameters with performance marginally better than using validation accuracy only, while the selected models can avoid overfitting to the validation set (Recht et al., 2018). However, such practice requires prior knowledge (i.e., α in Eq. (6)) to balance the two factors in the search objective.
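The combined objectives of Eq. (6) amount to a one-line weighted sum; a hedged sketch (function name ours) that covers both the AGN+Train and AGN+Val variants, depending on which per-sample losses are passed in:

```python
def objective_agn_plus_loss(agn, losses, alpha):
    """Search objective of Eq. (6): alpha * AGN(theta_T) + mean loss.
    `losses` holds per-sample training losses (AGN+Train) or validation
    losses (AGN+Val); `alpha` balances the gradient-norm term against
    the loss term and must be chosen a priori."""
    assert losses, "loss list must be non-empty"
    return alpha * agn + sum(losses) / len(losses)
```

The sensitivity to `alpha` noted in Observation 2 is visible in the form of the expression: since AGN and the mean loss live on very different scales, the search outcome depends strongly on this single weight.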
Table 1 and Table 4 (in the appendix) present the comparison results, where we report the testing accuracy of models trained on the full training set using the hyper-parameters found by the black-box optimization algorithms. For every set of searched hyper-parameters, the model is trained and tested three times, with error bars estimated. For both CMA-ES and PSO, the two objectives find models whose testing accuracy is neither significantly better nor worse than that of the model selected using validation accuracy as the objective. Furthermore, the use of these two objectives can fix the issue of overfitting to the validation set during model selection. For example, ResNet-20 on Fashion-MNIST selected using the validation set (Table 4) has lower test-set accuracy than the model trained with the default hyper-parameters. This is likely due to overfitting of the validation set, since ResNet-20 has a rather large learning capacity and Fashion-MNIST is a relatively easy dataset. In contrast, AGN+Train or AGN+Val leads to slightly better results than the default model, significantly outperforming selection by validation loss. The final model selection depends heavily on the choice of α, yet we do not know such a prior in advance, and the trends of performance over α are neither consistent nor obvious. We thus conclude that there is potential in using AGN(θ_T) as part of the search objective for black-box hyper-parameter optimization when prior knowledge of α is available; otherwise, the use of AGN(θ_T) might even hurt the hyper-parameter search, regardless of whether it is combined with the training or validation loss. Observation 3. AGN(θ_T) sometimes cannot work well with bandit-based hyper-parameter search, as the correlation between AGN(θ_T) and generalization performance is not always consistent during the training process from under-fitted to well-fitted status.
We carry out similar hyper-parameter search experiments using the BOHB algorithm under the same settings (search space, initial values, and computation budget). For a fair comparison, BOHB normalizes AGN(θ_t) in the objectives (and the setting of α differs slightly from the black-box optimization settings), since BOHB has to compare the AGN(θ_t) of models obtained at different iterations t of the training process, while the scale of AGN(θ_t) varies significantly. In Table 2, we present the testing accuracy of the models trained using the hyper-parameters found by BOHB with these objectives. The results show that, although it can outperform the default setting in isolated cases (such as ResNet-20 on Fashion-MNIST using AGN+Train, α = 0.1), using AGN(θ_t) within BOHB does not produce better results than the default setting in most cases.

4.2. USE CASE 2: CROSS-ARCHITECTURE GENERALIZATION ERROR PREDICTION

We evaluate a series of complexity measures to test their effectiveness in predicting the generalization error, using the public dataset provided in the NeurIPS 2020 competition "Predicting Generalization in Deep Learning" (Jiang et al.). Descriptions of the experimental setup and the public dataset are available in the official competition documentation. Briefly, we use 150 trained VGG-like architectures (along with the datasets used during their training, namely SVHN and CIFAR-10) as data points and predict their generalization error with the theoretical complexity measures listed in Table 3. The predictions are evaluated using conditional mutual information, with possible scores between 0 and 100; a higher score indicates that a complexity measure predicts the generalization error more accurately and consistently. The results in Table 3 show that, although AGN(θ_T) outperforms some baseline theoretical complexity measures in predicting generalization error, its effectiveness is largely limited.
The ineffectiveness of AGN(θ_T) for cross-architecture generalization error prediction implies that GN(θ_T) is also ineffective in practice, since we have shown that AGN(θ_T) approximates GN(θ_T) very well.

5. CONCLUSION

In this paper, we have studied the feasibility of using AGN(θ_T), an approximated form of the squared norm of loss gradients over the optimization path (Li et al., 2020), to measure the generalization error in practice. We find that the correlations between AGN(θ_T) and G_M(θ_T) are completely opposite for under-fitted and well-fitted models, where the positive correlation found in well-fitted models coincides with the theorems of Li et al. (2020). Furthermore, using AGN(θ_T) to complement validation accuracy in the search objectives can marginally improve the performance of hyper-parameter optimization; however, the computational overhead of AGN(θ_T) estimation and the required prior knowledge make such a paradigm neither efficient nor effective. Meanwhile, the same objectives bring no improvements for bandit-based algorithms such as BOHB, partially due to the inconsistent correlation between the objective and the generalization performance throughout the training phases. As a result, using gradient norms for model selection in practice remains challenging due to the high computation overhead (using the approximated version is up to 18 times slower than standard training) and limited effectiveness. Our experiments also show that AGN(θ_T) cannot effectively predict the generalization error across different architectures. In conclusion, we do not recommend using GN(θ_T) or AGN(θ_T) for model selection in practice.

A EXPERIMENTS

A.1 EXPERIMENTAL SETUP

Datasets and Data Augmentation

We briefly introduce the benchmark datasets used in this paper and the data augmentation methods applied when loading the data.
• Fashion-MNIST (Xiao et al., 2017) is a benchmark dataset similar to the popular MNIST dataset (LeCun et al., 2010). It has 70,000 grayscale images of size 28 × 28, divided into 10 classes. The training set has 6,000 images from each class while the rest make up the test set. We normalize the input data as in common practice.
• SVHN is a benchmark dataset composed of house-number signs in street-level images (Netzer et al., 2011). We use the cropped-digits dataset for training and testing. We normalize the input data as in common practice.

ResNet Training Details. We train all ResNet architectures for 160 epochs. For the default baseline model, we set the initial learning rate to 0.1 and decay it by 1/10 and 1/100 at epochs 80 and 120 respectively. Momentum is set to 0.9 and weight decay to 0.0001. The default batch size is 128. The loss function used for all experiments in the paper is the cross-entropy loss.

Blackbox Optimization Setup. The blackbox hyper-parameter optimization process is split into 4 rounds. In each round, we train 8 models with different sets of hyper-parameters in parallel. At the end of each round, a blackbox optimization algorithm is run based on the model scores computed using a validation set or our metric, and generates the next 8 sets of hyper-parameters to search for the optimal ones. We use two blackbox optimization algorithms for hyper-parameter search: Covariance Matrix Adaptation Evolution Strategy (CMA-ES) (Loshchilov & Hutter, 2016) and Particle Swarm Optimization (PSO) (Escalante et al., 2009).

BOHB Hyper-parameter Search Setup. We use BOHB (Falkner et al., 2018) to test how our proposed objectives work with bandit-based hyper-parameter optimization algorithms.
For all ResNet architectures, the minimum budget is 40 epochs and the maximum is 160. We pick η = 2 so that the total number of training epochs under BOHB matches that under the blackbox optimization algorithms, ensuring a fair comparison across the two types of hyper-parameter optimization algorithms. When running the search algorithms, we keep GPU×epochs constant so that each algorithm iterates through the training set the same number of times. As mentioned in the main paper, we set the total budget to 5,120 GPU×epochs for all hyper-parameter search algorithms. The blackbox optimization algorithms used in this paper, namely CMA-ES and PSO, explore 32 sets of hyper-parameters, whereas BOHB, a more efficient search algorithm that terminates poorly performing models early, explores 36 sets of hyper-parameters in total.

Specific time cost for hyper-parameter search On the reduced training set with validation accuracy as the criterion, hyper-parameter search for ResNet-56 on CIFAR-10 costs 856s per model on average. When AGN is used to complement validation accuracy, the average rises to 3,633s per model.

AGN Helps Reduce the Generalization Gap First, Fig. 4a shows a gradually widening gap between the training loss and the test loss. This is likely because we train the ResNet-20 models for 160 epochs, which makes it easy to overfit the Fashion-MNIST dataset. As a result, there exists a generalization gap: the training loss is very close to zero, but the test loss is about 0.325. In contrast, when AGN and training loss are used as a combined objective in CMA-ES hyper-parameter optimization (Fig. 4d), the final generalization gap is 0.259, which is 20.3% lower than the generalization gap under the default hyper-parameters. This decrease in the magnitude of generalization
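For reference, the AGN measure used above can be evaluated without a full per-example backward pass through the network. The following is a minimal NumPy sketch, assuming cross-entropy loss and a final linear (FC) layer so that the Goodfellow (2015) factorization applies; the function name and the batch-averaging convention are illustrative assumptions:

```python
import numpy as np

def approx_grad_norm(features, logits, targets):
    """Sketch of the approximated gradient norm (AGN) over the final
    FC layer via the Goodfellow (2015) trick. For a linear layer
    z_i = W h_i + b with cross-entropy loss, the per-example gradient
    wrt W is the outer product g_i h_i^T with g_i = softmax(z_i) - y_i,
    so its squared Frobenius norm factorizes as ||g_i||^2 * ||h_i||^2
    (plus ||g_i||^2 for the bias). No per-example backward pass through
    the full network is needed.
    """
    z = logits - logits.max(axis=1, keepdims=True)        # stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    y = np.eye(logits.shape[1])[targets]                  # one-hot targets
    g = p - y                                             # dL_i / dz_i
    g_sq = (g ** 2).sum(axis=1)                           # ||g_i||^2
    h_sq = (features ** 2).sum(axis=1)                    # ||h_i||^2
    return float((g_sq * h_sq + g_sq).mean())             # weight + bias terms
```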



(Fig. 1e) ResNet-20 on Fashion-MNIST, validating the positive correlation between GN(θT) and AGN(θT) in larger neural networks.

(Fig. 1f) ResNet-56 on CIFAR-100, validating the positive correlation between GN(θT) and AGN(θT) in larger neural networks.

Figure 1: Comparison between GN(θT) and AGN(θT) in terms of effectiveness in reflecting generalization error. In Fig. 1a to Fig. 1d, we pick 9 sets of random hyper-parameters for MLP and 12 sets for LeNet-5, train each model for 40 epochs, and plot the results. For the two architectures we test on MNIST, both AGN(θT) and GN(θT) exhibit identical trends when generalization error is plotted against them. A strong positive correlation between GN and AGN is observed in each setting as well, with Pearson correlation coefficients of 0.962 for MLP and 0.987 for LeNet-5. We carry out additional experiments using deeper neural network models in Fig. 1e and Fig. 1f to verify that the strong positive correlation between AGN(θT) and GN(θT) persists. We follow a piecewise learning rate decay policy in these experiments; thus, there are three plots for each ResNet setting.

Figure 2: Comparisons in computation time for the three generalization performance measurements.

Figure 3: Inconsistent correlation between the empirical squared gradient norm and generalization error, depending on how well a model is trained: positive correlation between AGN(θT) and GM(θT) for well-fitted models, but negative correlation between AGN(θT) and GM(θT) for under-fitted models.

Figure 4: Training behavior of ResNet-20 on Fashion-MNIST with the default or learned hyper-parameters from various methods. "RS" refers to random search.

Fig. 4 shows the training behavior of ResNet-20 on Fashion-MNIST with four different hyperparameter settings. We analyze how the training and test losses vary under these settings and obtain the observations below.

Results for ResNet-20 on Fashion-MNIST, CIFAR-10, and CIFAR-100 using BOHB.

Since the correlation between AGN(θt) and GM(θt) is not always consistent along the optimization path, AGN gives BOHB inaccurate feedback on model generalization performance in the early stages of training. Consequently, BOHB tends to terminate some good models early and continue training inferior ones, which deteriorates the final model performance of the bandit-based search.
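This failure mode acts directly on the successive-halving step that BOHB builds on. A minimal sketch of that step (η follows the setup above; the function is an illustration, not BOHB's actual implementation), assuming lower scores are better: if early-budget scores rank models inconsistently with their final generalization error, good configurations are culled at the first rungs.

```python
def successive_halving(scores_by_rung, eta=2):
    """Sketch of successive halving: at each rung, only the best
    1/eta fraction of the surviving configs (by score; lower is
    better) advances to the next, larger budget."""
    survivors = list(range(len(scores_by_rung[0])))
    for scores in scores_by_rung:
        ranked = sorted(survivors, key=lambda i: scores[i])
        survivors = ranked[: max(1, len(ranked) // eta)]
    return survivors
```

If the rung-one scores (e.g. AGN at a small epoch budget) mis-rank a config that would have had the best final score, it never reaches the last rung.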



• The CIFAR-10 dataset has 6,000 examples in each of 10 classes and the CIFAR-100 dataset has 600 examples in each of 100 non-overlapping classes (Krizhevsky et al., 2009). We apply standard data augmentation in the same way as the PyTorch official examples for CIFAR-10 and CIFAR-100 classification. Specifically, we pad the input images by 4 pixels, randomly crop a 32 × 32 sub-region, and randomly apply a horizontal flip. We normalize the input data as done in common practice.

Reduced Dataset To reduce the expensive computation cost of training a model for multiple rounds, we train on a reduced dataset when searching for hyper-parameters, where only a random but fixed 20% subset of the standard training set participates in training. Training on the reduced dataset yields a set of hyper-parameters, which is then used to train a separate model on the standard training set. The newly trained model is then evaluated on the standard test set, and we report this final test performance in the main paper. The strategy of using a reduced dataset for hyper-parameter search is also used by Cubuk et al. (2019).
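The "random but fixed" 20% subset described above can be realized by sampling index lists once with a fixed seed, so every search round sees the same reduced training set. A minimal sketch (the seed value is an arbitrary illustrative choice):

```python
import random

def reduced_indices(n_train, fraction=0.2, seed=0):
    """Pick a random but fixed subset of the training set for
    hyper-parameter search; the fixed seed makes the subset
    identical across all search rounds."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_train), int(n_train * fraction)))
```

The resulting index list can then be passed to, e.g., a `torch.utils.data.Subset` wrapper around the standard training set.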

