CAN WE USE GRADIENT NORM AS A MEASURE OF GENERALIZATION ERROR FOR MODEL SELECTION IN PRACTICE? Anonymous

Abstract

A recent theoretical investigation (Li et al., 2020) of the upper bound on the generalization error of deep neural networks (DNNs) demonstrates the potential of using the gradient norm as a measure that complements validation accuracy for model selection in practice. In this work, we carry out empirical studies using several commonly used neural network architectures and benchmark datasets to understand the effectiveness and efficiency of using the gradient norm as a model selection criterion, especially in hyper-parameter optimization settings. While strong correlations between the generalization error and gradient norm measures are observed, we find the computation of the gradient norm to be time-consuming due to its high gradient complexity. To balance the trade-off between efficiency and effectiveness, we propose to use an accelerated approximation (Goodfellow, 2015) of the gradient norm that only computes the loss gradient in the fully-connected layer (FC layer) of the DNN, at significantly reduced computational cost (200∼20,000 times faster). Our empirical studies show that using the approximated gradient norm as one of the hyper-parameter search objectives can select models with lower generalization error, but the efficiency is still low (marginal accuracy improvement at high computational overhead). Our results also show that bandit-based or population-based algorithms, such as BOHB, perform worse with gradient norm objectives, since the correlation between the gradient norm and the generalization error is not consistent across phases of the training process. Finally, the gradient norm also fails to predict the generalization performance of models based on different architectures, in comparison with state-of-the-art algorithms and metrics.

1. INTRODUCTION

The generalization performance of deep learning through stochastic gradient descent-based optimization has been widely studied in recent work (Li et al., 2020; Chatterjee, 2020; Negrea et al., 2019; Thomas et al., 2019; Zhu et al., 2019; Hu et al., 2019; Zhang et al., 2016; Mou et al., 2017). These studies begin their analysis from the perspective of learning dynamics and then extend to upper-bound estimation of generalization errors (Mou et al., 2017). More recently, researchers have shifted their focus to providing theoretical or empirical measures of generalization performance (Li et al., 2020; Negrea et al., 2019; Thomas et al., 2019; Cao & Gu, 2019; He et al., 2019; Frei et al., 2019) with respect to deep architectures, hyper-parameters, data distributions, learning dynamics, and so on. This work studies the use of generalization performance measures (Li et al., 2020; Negrea et al., 2019) for model selection purposes (Cawley & Talbot, 2010). Before the advent of deep learning, the generalization gap was already used as a straightforward measure. Given a set of N training samples {x_1, x_2, x_3, ..., x_N}, a deep learning model θ, and a loss function L(θ; x) based on the sample x, the generalization gap G is defined as G(θ) = E_{x∼X}[L(θ; x)] − (1/N) ∑_{i=1}^{N} L(θ; x_i), where X is the data distribution and E_{x∼X}[L(θ; x)] is the expected loss. To quantify the generalization gap, validation or testing samples have frequently been used to estimate the expected loss (Kohavi et al., 1995). Such that given a validation/testing dataset with M samples
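The definition above can be sketched in a few lines of code: the expected loss E_{x∼X}[L(θ; x)] is approximated by the mean loss over held-out validation samples, and the gap is its difference from the mean training loss. This is a minimal illustrative sketch, not the paper's implementation; the function name and the loss values are hypothetical.

```python
def generalization_gap(train_losses, val_losses):
    """Empirical estimate of G(theta): the mean held-out loss
    (a proxy for the expected loss E_{x~X} L(theta; x)) minus
    the mean training loss (1/N) * sum_i L(theta; x_i)."""
    expected_loss = sum(val_losses) / len(val_losses)
    empirical_loss = sum(train_losses) / len(train_losses)
    return expected_loss - empirical_loss

# Hypothetical per-sample losses from a trained model (made-up numbers):
train = [0.10, 0.12, 0.08, 0.11]   # N = 4 training losses
val = [0.30, 0.25, 0.28, 0.27]     # M = 4 validation losses
print(generalization_gap(train, val))  # 0.275 - 0.1025 = 0.1725
```

A large positive gap indicates overfitting; a gap near zero indicates that the model generalizes about as well as it fits the training data.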

