STABILITY ANALYSIS OF SGD THROUGH THE NORMALIZED LOSS FUNCTION

Abstract

We prove new generalization bounds for stochastic gradient descent in both the convex and non-convex cases. Our analysis is based on the stability framework: we analyze stability with respect to the normalized version of the loss function used for training. This leads to investigating a form of angle-wise stability instead of Euclidean stability in the weights. For neural networks, the measure of distance we consider is invariant to rescaling the weights of each layer. Furthermore, we exploit the notion of on-average stability in order to obtain a data-dependent quantity in the bound. In our numerical experiments, this data-dependent quantity is seen to be more favorable when training with larger learning rates. This might help to shed some light on why larger learning rates can lead to better generalization in some practical scenarios.

1. INTRODUCTION

In the last few years, deep learning has established state-of-the-art performance in a wide variety of tasks in fields like computer vision, natural language processing and bioinformatics (LeCun et al., 2015). Understanding when and how these networks generalize better is important to keep improving their performance. Many works, starting mainly from Neyshabur et al. (2015), Zhang et al. (2017) and Keskar et al. (2017), hint at a rich interplay between regularization and the optimization process of learning the weights of the network. The idea is that a form of inductive bias can be realized implicitly by the optimization algorithm. The most popular algorithm to train neural networks is stochastic gradient descent (SGD); it is therefore of great interest to study the generalization properties of this algorithm. An approach that is particularly well suited to investigating learning algorithms directly is the framework of stability (Bousquet & Elisseeff, 2002; Elisseeff et al., 2005). It is argued in Nagarajan & Kolter (2019) that generalization bounds based on uniform convergence might be condemned to be essentially vacuous for deep networks. Stability bounds offer a possible alternative by directly bounding the generalization error of the output of the algorithm. The seminal work of Hardt et al. (2016) exploits this framework to study SGD in both the convex and non-convex case. The main intuitive idea is to look at how much changing one example in the training set can alter the trajectory of SGD: if the two trajectories must remain close to each other, then the algorithm is more stable. This raises the question of how to best measure the distance between two classifiers. Our work investigates a measure of distance respecting invariances in ReLU networks (and linear classifiers) instead of the usual Euclidean distance.
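To make the invariance concrete, the following minimal NumPy sketch (all names, dimensions and constants are chosen for illustration only) shows why rescaling one layer of a ReLU network by a positive constant and the next layer by its inverse leaves the computed function unchanged, while moving the weights far apart in Euclidean distance but not in an angle-wise, per-layer sense:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer ReLU network: f(x) = W2 @ relu(W1 @ x)
W1 = rng.normal(size=(5, 3))
W2 = rng.normal(size=(2, 5))
x = rng.normal(size=3)

def forward(W1, W2, x):
    return W2 @ np.maximum(W1 @ x, 0.0)

# Rescale layer 1 by alpha and layer 2 by 1/alpha: the computed function is
# unchanged, since relu(alpha * h) = alpha * relu(h) for any alpha > 0.
alpha = 10.0
same = np.allclose(forward(W1, W2, x), forward(alpha * W1, W2 / alpha, x))

# The Euclidean distance between the two parameter settings is large ...
euclidean = np.sqrt(np.linalg.norm(alpha * W1 - W1) ** 2
                    + np.linalg.norm(W2 / alpha - W2) ** 2)

# ... but a layer-wise angular distance is (numerically) zero, since each
# layer's weights only change in norm, not in direction.
def angle(a, b):
    cos = np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(cos, -1.0, 1.0))

angular = angle(alpha * W1, W1) + angle(W2 / alpha, W2)
print(same, euclidean, angular)
```

Under this invariance, two SGD trajectories that differ only by such per-layer rescalings represent the same classifier, which is what motivates measuring stability angle-wise rather than in Euclidean weight space.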
The measure of distance we consider is directly related to analyzing stability with respect to the normalized loss function instead of the standard loss function used for training. In the convex case, we prove an upper bound on uniform stability with respect to the normalized loss function, which can then be used to prove a high-probability bound on the test error of the output of SGD. In the non-convex case, we propose an analysis directly targeted at ReLU neural networks: we prove an upper bound on the on-average stability with respect to the normalized loss function, which can then be used to give a generalization bound on the test error. One advantage of our approach is that we do not need to assume that the loss function is bounded. Indeed, even if the loss function used for training is unbounded, the normalized loss is necessarily bounded. Our main result for neural networks involves a data-dependent quantity that we estimate during training in our numerical experiments: the sum over layers of the ratio between the norm of the gradient for a layer and the norm of the parameters of that layer. We observe that increasing the learning rate can lead to a trajectory that keeps this quantity smaller during training. Therefore, larger learning rates can lead to better "actual" stability than a worst-case analysis based on uniform stability would indicate. There are two ways to keep our data-dependent quantity small during training. The first is facilitating convergence (smaller gradient norms). The second is increasing the weights of the network: if the weights are larger, an update of the same magnitude in weight space results in a smaller change in angle. In our experiments, larger learning rates are favorable in both regards.
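As an illustration, the data-dependent quantity can be monitored during training roughly as follows. The toy two-layer ReLU regression below, its random data, and all hyperparameters are hypothetical stand-ins for the purpose of the sketch, not the experimental setup of the paper:

```python
import numpy as np

def layerwise_ratio(weights, grads):
    # Sum over layers of ||grad_l|| / ||w_l||.  It shrinks either when the
    # gradients get smaller (convergence) or when the weight norms grow, in
    # which case an update of the same magnitude changes the angle less.
    return sum(np.linalg.norm(g) / np.linalg.norm(w)
               for w, g in zip(weights, grads))

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))           # toy inputs
y = rng.normal(size=(64, 1))           # toy regression targets
W1 = 0.5 * rng.normal(size=(4, 8))
W2 = 0.5 * rng.normal(size=(8, 1))

lr, ratios = 0.05, []
for step in range(200):
    h = np.maximum(X @ W1, 0.0)                    # hidden ReLU activations
    err = h @ W2 - y                               # residuals for squared loss
    g2 = h.T @ err / len(X)                        # gradient w.r.t. W2
    g1 = X.T @ ((err @ W2.T) * (h > 0)) / len(X)   # gradient w.r.t. W1
    ratios.append(layerwise_ratio([W1, W2], [g1, g2]))
    W1 -= lr * g1
    W2 -= lr * g2

print(ratios[0], ratios[-1])
```

In our actual experiments the analogue of `ratios` is tracked along the SGD trajectory for different learning rates; the sketch only shows where in the update loop the quantity is computed.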

2. RELATED WORK

Normalized loss functions have been considered before (Poggio et al., 2019; Liao et al., 2018). In Liao et al. (2018), test error is seen to be well correlated with the normalized loss. This observation is one motivation for our study: we might expect generalization bounds on the test error to be tighter when the analysis uses the normalized surrogate loss. Poggio et al. (2019) write down a generalization bound based on Rademacher complexity, but motivated by the possible limitations of uniform convergence for deep learning (Nagarajan & Kolter, 2019), we take the stability approach instead. Generalization of SGD has been investigated in a large body of literature. Soudry et al. (2018) showed that gradient descent converges to the max-margin solution for logistic regression, and Lyu & Li (2019) provide an extension to deep non-linear homogeneous networks. Nacson et al. (2019) give similar results for stochastic gradient descent. From the point of view of stability, starting from Hardt et al. (2016) and without being exhaustive, a few representative examples are Liu et al. (2017), London (2017), Yuan et al. (2019) and Kuzborskij & Lampert (2018).

Since the work of Zhang et al. (2017) showing that currently used deep neural networks are so overparameterized that they can easily fit random labels, taking properties of the data distribution into account seems necessary to understand generalization of deep networks. In the context of stability, this means moving from uniform stability to on-average stability. This is the main concern of the work of Kuzborskij & Lampert (2018), who develop data-dependent stability bounds for SGD by extending the work of Hardt et al. (2016). Their results depend on the risk and the curvature at the initialization point, and they have to assume a bound on the noise of the stochastic gradient, an assumption we do not make. Furthermore, instead of having the bounds involve properties of the initialization (which can be useful to investigate transfer learning), our bound for neural networks retains the properties of the trajectory after the "burn-in" period, hence closer to the final output, since we are interested in the effect of the learning rate on the trajectory. This is motivated by the empirical work of Jastrzebski et al. (2020) arguing that in the early phase of training, the learning rate and batch size determine the properties of the trajectory after a "break-even point".

It has been observed since the early work of Keskar et al. (2017) that training with larger batch sizes can lead to a deterioration in test accuracy. The simplest strategy to reduce (at least partially) the gap with small-batch training is to increase the learning rate (He et al., 2019; Smith & Le, 2018; Hoffer et al., 2017; Goyal et al., 2017). We choose this scenario to investigate empirically the relevance of our stability bound for SGD on neural networks. Remark that the results in Hardt et al. (2016) are more favorable to smaller learning rates. In order to bring theory closer to practice, it therefore seems important to better understand in what sense larger learning rates can improve stability.

3. PRELIMINARIES

Let l(w, z) be a non-negative loss function. Furthermore, let A be a randomized algorithm and denote by A(S) the output of A when trained on a training set S = {z_1, ..., z_n} ∼ D^n. The true risk of a classifier w is given by L_D(w) := E_{z∼D} l(w, z) and the empirical risk is given by

