STABILITY ANALYSIS OF SGD THROUGH THE NORMALIZED LOSS FUNCTION

Abstract

We prove new generalization bounds for stochastic gradient descent (SGD) in both the convex and the non-convex case. Our analysis is based on the stability framework. We analyze stability with respect to the normalized version of the loss function used for training. This leads to investigating a form of angle-wise stability instead of Euclidean stability in the weights. For neural networks, the measure of distance we consider is invariant to rescaling the weights of each layer. Furthermore, we exploit the notion of on-average stability in order to obtain a data-dependent quantity in the bound. In our numerical experiments, this data-dependent quantity is more favorable when training with larger learning rates. This might help shed some light on why larger learning rates can lead to better generalization in some practical scenarios.

1. INTRODUCTION

In the last few years, deep learning has succeeded in establishing state-of-the-art performance in a wide variety of tasks in fields like computer vision, natural language processing and bioinformatics (LeCun et al., 2015). Understanding when and how these networks generalize better is important to keep improving their performance. Many works, starting mainly from Neyshabur et al. (2015), Zhang et al. (2017) and Keskar et al. (2017), hint at a rich interplay between regularization and the optimization process of learning the weights of the network. The idea is that a form of inductive bias can be realized implicitly by the optimization algorithm. The most popular algorithm to train neural networks is stochastic gradient descent (SGD). It is therefore of great interest to study the generalization properties of this algorithm. An approach that is particularly well suited to investigate learning algorithms directly is the framework of stability (Bousquet & Elisseeff, 2002), (Elisseeff et al., 2005). It is argued in Nagarajan & Kolter (2019) that generalization bounds based on uniform convergence might be condemned to be essentially vacuous for deep networks. Stability bounds offer a possible alternative by trying to bound directly the generalization error of the output of the algorithm. The seminal work of Hardt et al. (2016) exploits this framework to study SGD in both the convex and non-convex case. The main intuitive idea is to look at how much changing one example in the training set can generate a different trajectory when running SGD. If the two trajectories must remain close to each other, then the algorithm is more stable. This raises the question of how to best measure the distance between two classifiers. Our work investigates a measure of distance respecting invariances in ReLU networks (and linear classifiers) instead of the usual Euclidean distance.
The measure of distance we consider is directly related to analyzing stability with respect to the normalized loss function instead of the standard loss function used for training. In the convex case, we prove an upper bound on uniform stability with respect to the normalized loss function, which can then be used to prove a high probability bound on the test error of the output of SGD. In the non-convex case, we propose an analysis directly targeted at ReLU neural networks. We prove an upper bound on the on-average stability with respect to the normalized loss function, which can then be used to give a generalization bound on the test error. One nice advantage of our approach is that we do not need to assume that the loss function is bounded. Indeed, even if the loss function used for training is unbounded, the normalized loss is necessarily bounded. Our main result for neural networks involves a data-dependent quantity that we estimate during training in our numerical experiments. The quantity is the sum over the layers of the ratio between the norm of the gradient for each layer and the norm of the parameters of that layer. We observe that increasing the learning rate can lead to a trajectory keeping this quantity smaller during training. Therefore, larger learning rates can lead to a better "actual" stability than what a worst-case analysis from uniform stability would indicate. There are two ways to keep our data-dependent quantity small during training. The first is to facilitate convergence (smaller norms for the gradients). The second is to increase the weights of the network: if the weights are larger, an update of the same magnitude in weight space results in a smaller change in angle. In our experiments, larger learning rates are favorable in both regards.

2. RELATED WORK

Normalized loss functions have been considered before (Poggio et al., 2019), (Liao et al., 2018). In Liao et al. (2018), the test error is seen to be well correlated with the normalized loss. This observation is one motivation for our study: we might expect generalization bounds on the test error to be better when using the normalized surrogate loss in the analysis. Poggio et al. (2019) write down a generalization bound based on Rademacher complexity, but motivated by the possible limitations of uniform convergence for deep learning (Nagarajan & Kolter, 2019), we take the stability approach instead. Generalization of SGD has been investigated in a large body of literature. Soudry et al. (2018) showed that gradient descent converges to the max-margin solution for logistic regression, and Lyu & Li (2019) provide an extension to deep non-linear homogeneous networks. Nacson et al. (2019) give similar results for stochastic gradient descent. From the point of view of stability, starting from Hardt et al. (2016) and without being exhaustive, a few representative examples are Liu et al. (2017), London (2017), Yuan et al. (2019), Kuzborskij & Lampert (2018). Since the work of Zhang et al. (2017) showing that currently used deep neural networks are so overparameterized that they can easily fit random labels, taking properties of the data distribution into account seems necessary to understand generalization of deep networks. In the context of stability, this means moving from uniform stability to on-average stability. This is the main concern of the work of Kuzborskij & Lampert (2018). They develop data-dependent stability bounds for SGD by extending the work of Hardt et al. (2016). Their results depend on the risk and the curvature at the initialization point, and they have to assume a bound on the noise of the stochastic gradient. We do not make this assumption in our work.
Furthermore, instead of having the bounds involve properties of the initialization (which can be useful for investigating transfer learning), our bound for neural networks retains properties of the trajectory after the "burn-in" period, hence closer to the final output, since we are interested in the effect of the learning rate on the trajectory. This is motivated by the empirical work of Jastrzebski et al. (2020), arguing that in the early phase of training, the learning rate and batch size determine the properties of the trajectory after a "break-even point". Keskar et al. (2017) showed that training with larger batch sizes can lead to a deterioration in test accuracy. The simplest strategy to reduce (at least partially) the gap with small-batch training is to increase the learning rate (He et al., 2019), (Smith & Le, 2018), (Hoffer et al., 2017), (Goyal et al., 2017). We choose this scenario to investigate empirically the relevance of our stability bound for SGD on neural networks. Remark that the results in Hardt et al. (2016) are more favorable to smaller learning rates. In order to bring theory closer to practice, it therefore seems important to better understand in what sense larger learning rates can improve stability.

3. PRELIMINARIES

Let l(w, z) be a non-negative loss function. Furthermore, let A be a randomized algorithm and denote by A(S) the output of A when trained on training set S = {z_1, ..., z_n} ∼ D^n. The true risk of a classifier w is L_D(w) := E_{z∼D} l(w, z) and the empirical risk is L_S(w) := (1/n) Σ_{i=1}^n l(w, z_i). When considering the 0-1 loss of classifier w, we will write L^{0-1}_D(w). Furthermore, we will add a superscript α when the normalized losses l^α are under consideration (these will be defined more precisely in the subsequent sections, for the convex case and the non-convex case respectively). Our main interest is to ensure a small test error, so we want to bound L^{0-1}_D(w). The usual approach is to minimize a surrogate loss upper bounding the 0-1 loss. In this paper, we consider stochastic gradient descent with different batch sizes to minimize the empirical surrogate loss. The update rule of this algorithm, for learning rates λ_t and a subset B_t ⊂ S of size B, is w_{t+1} = w_t − λ_t (1/B) Σ_{z_j ∈ B_t} ∇l(w_t, z_j). We assume each batch of training examples is formed by sampling uniformly with replacement. In order to investigate the generalization of this algorithm, we consider the framework of stability (Bousquet & Elisseeff, 2002). We now give the definitions of uniform stability and on-average stability (random pointwise hypothesis stability in Elisseeff et al. (2005)) for randomized algorithms (see also Hardt et al. (2016) and Kuzborskij & Lampert (2018)). The definitions can be formulated with respect to any loss function, but since we will study stability with respect to the l^α losses, we write them for this special case.
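As a concrete illustration, one step of the update rule above can be sketched in a few lines of Python (a minimal sketch, not from the paper: the helper names sgd_step and grad_fn are ours, and classifiers are represented as flat lists of weights):

```python
import random

def sgd_step(w, grad_fn, S, lr, B):
    """One SGD step: sample a batch of size B uniformly WITH replacement
    from S, average the per-example gradients, and update the weights."""
    batch = [random.choice(S) for _ in range(B)]
    g = [0.0] * len(w)
    for z in batch:
        gz = grad_fn(w, z)
        g = [gi + gzi / B for gi, gzi in zip(g, gz)]
    # w_{t+1} = w_t - lambda_t * (1/B) * sum of gradients over the batch
    return [wi - lr * gi for wi, gi in zip(w, g)]

# Toy usage: squared loss l(w, z) = (w[0] - z)^2 with gradient [2(w[0] - z)].
grad = lambda w, z: [2.0 * (w[0] - z)]
w1 = sgd_step([0.0], grad, [1.0], lr=0.1, B=4)  # every draw is z = 1.0
```

Sampling with replacement matters for the stability proofs: the probability that the perturbed example enters a given batch is at most B/n.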

Definition 1

The algorithm A is said to be ε^α_uni-uniformly stable if for all i ∈ {1, ..., n}, sup_{S, z'_i, z} E |l^α(A(S), z) − l^α(A(S^(i)), z)| ≤ ε^α_uni. Here, the expectation is taken over the randomness of A. The notation S^(i) means that we replace the i-th example of S with z'_i.

Definition 2

The algorithm A is said to be ε^α_av-on-average stable if for all i ∈ {1, ..., n}, E |l^α(A(S), z) − l^α(A(S^(i)), z)| ≤ ε^α_av. Here, the expectation is taken over S ∼ D^n, z ∼ D and the randomness of A. The notation S^(i) means that we replace the i-th example of S with z.
In Hardt et al. (2016), uniform stability is considered with respect to the same loss the algorithm is executed on. This is a natural choice; however, if we are interested in the 0-1 loss, different sets of parameters w, w' can represent equivalent classifiers (that is, predict the same label for any input). This is the case for logistic regression, since any rescaling of the parameters yields the same classifier (but possibly a different training loss). It is also the case for ReLU neural networks, where we can rescale each layer without affecting the classifier. This is why we consider stability with respect to normalized losses instead. Remark that we are still considering SGD executed on the original loss l (we do not change the algorithm A). The intuitive idea is to measure stability in terms of angles (more precisely, distances between normalized vectors) instead of standard Euclidean distances (see Figure 1). The proof in Hardt et al. (2016) consists in bounding E ||w_t − w'_t||, where w_t denotes the weights at iteration t when training on S and w'_t the weights at iteration t when training on the modified training set S^(i). We will instead bound E ||w_t/||w_t|| − w'_t/||w'_t|||| (or E[d(f, g)] for an appropriate measure of distance d between neural networks f and g). Throughout the paper, ||·|| denotes the Euclidean norm for vectors and the Frobenius norm for matrices. The proofs are given in Appendix A for the convex case and in Appendix B for the non-convex case.
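The distinction between Euclidean and angle-wise stability can be made concrete with a small computation: the normalized distance below vanishes for two weight vectors that differ only by a rescaling, even when their Euclidean distance is large (normalized_distance is an illustrative helper of ours, not notation from the paper):

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def normalized_distance(v, w):
    """|| v/||v|| - w/||w|| ||: invariant to rescaling either argument."""
    nv, nw = norm(v), norm(w)
    return norm([a / nv - b / nw for a, b in zip(v, w)])

v, w = [3.0, 4.0], [6.0, 8.0]                       # same direction, different scale
euclidean_gap = norm([a - b for a, b in zip(v, w)]) # 5.0: looks unstable
angle_gap = normalized_distance(v, w)               # 0.0: equivalent classifiers
```

For a linear classifier, v and w above predict the same label on every input, so the angle-wise view is the one that matches the 0-1 loss.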

4. CONVEX CASE

Consider a linear classifier parameterized by either a vector of weights (binary case) or a matrix of weights (multi-class case), denoted by w in both cases. The normalized losses are defined by l^α(w, z) := l(α w/||w||, z) for α > 0. In order to state the main result of this section, we need two common assumptions: L-Lipschitzness of l as a function of w, and β-smoothness.
Definition 3 The function l(w, z) is L-Lipschitz (with respect to w) for all z in the domain if for all w, w', z, |l(w, z) − l(w', z)| ≤ L ||w − w'||.
Definition 4 The function l(w, z) is β-smooth if for all w, w', z, ||∇l(w, z) − ∇l(w', z)|| ≤ β ||w − w'||.
We are now ready to state the main result of this section.
Theorem 1 Assume that l(w, z) is convex, β-smooth and L-Lipschitz for all z. Furthermore, assume that the initial point w_0 satisfies ||w_0|| ≥ K for some K such that K' := K − L Σ_{i=0}^{T−1} λ_i > 0, for a sequence of learning rates λ_i ≤ 2/β. SGD is then run with batch size B on the loss function l(w, z) for T steps with the learning rates λ_t, starting from w_0. Denote by ε^α_uni the uniform stability of this algorithm with respect to l^α. Then, ε^α_uni ≤ α (2L²B)/(nK') Σ_{i=0}^{T−1} λ_i.
What is the difference between our bound and the bound in Hardt et al. (2016) (see Theorem 6 in Appendix A)? Our bound says that it is not enough to use a small learning rate and a small number of epochs to guarantee good stability (with respect to the normalized loss). We also need to take into account the norm of the parameters (here the norm of the initialization) to make sure that the "effective" learning rate is small. As a side note, we also incorporate the batch size into the bound, which is not present in Hardt et al. (2016) (only B = 1 is considered). From this result, it is now possible to obtain a high probability bound for the test error.
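To see how the bound of Theorem 1 behaves numerically, the following sketch evaluates it (the function name and argument layout are ours; K' = K − L Σ λ_i as in the theorem statement):

```python
def uniform_stability_bound(alpha, L, B, n, K, lrs):
    """Evaluate the Theorem 1 bound:
    eps <= alpha * 2 L^2 B / (n K') * sum(lrs), with K' = K - L * sum(lrs)."""
    K_prime = K - L * sum(lrs)
    assert K_prime > 0, "Theorem 1 requires K' = K - L * sum(lrs) > 0"
    return alpha * 2.0 * L**2 * B / (n * K_prime) * sum(lrs)

# Example: T = 100 steps at lr 0.01, L = 1, B = 1, n = 1000, ||w_0|| >= 10.
b = uniform_stability_bound(alpha=1.0, L=1.0, B=1, n=1000, K=10.0,
                            lrs=[0.01] * 100)
# A larger initialization norm K shrinks the bound, all else being equal.
b_large_K = uniform_stability_bound(1.0, 1.0, 1, 1000, 100.0, [0.01] * 100)
```

This makes the discussion above tangible: for a fixed learning-rate schedule, the bound improves as the norm of the parameters grows, i.e. as the "effective" learning rate shrinks.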
The bound is over draws of training sets S but not over the randomness of A.¹ So, we actually have the expected test error over the randomness of A in the bound. This is reminiscent of PAC-Bayes bounds, where here the posterior distribution would be induced by the randomness of the algorithm A.
Theorem 2 Fix α > 0. Let M_α := sup{l(w, z) s.t. ||w|| ≤ α, ||x|| ≤ R}. Then, for any n > 1 and δ ∈ (0, 1), the following holds with probability greater or equal to 1 − δ over draws of training sets S: E_A L^{0-1}_D(A(S)) ≤ E_A L^α_S(A(S)) + ε^α_uni + (2n ε^α_uni + M_α) √(ln(1/δ)/(2n)).
Proof: The proof is an application of McDiarmid's concentration bound. Remark that we do not need the training loss to be bounded, since we consider the normalized loss, which is bounded. The proof follows the same lines as Theorem 12 in Bousquet & Elisseeff (2002) and we do not replicate it here. Remark that we need the fact that uniform stability implies generalization in expectation, which is proven for example in Theorem 2.2 of Hardt et al. (2016). Furthermore, we can make the bound hold uniformly over all α's using standard techniques.
Theorem 3 Let C > 0. Assume that l^α(w, z) is a convex function of α for all w, z and that ε^α_uni is a non-decreasing function of α. Then, for any n > 1 and δ ∈ (0, 1), the following holds with probability greater or equal to 1 − δ over draws of training sets S: E_A L^{0-1}_D(A(S)) ≤ inf_{α ∈ (0, C]} { E_A max{L^{α/2}_S(A(S)), L^α_S(A(S))} + ε^α_uni + (2n ε^α_uni + M_α) √((2 ln(√2 (2 + log₂ C − log₂ α)) + ln(1/δ))/(2n)) }.
In the next section, we investigate the non-convex case (actually, we directly target neural networks). We exploit on-average stability to obtain a data-dependent quantity in the bound. Remark that it is also argued in Kuzborskij & Lampert (2018) that the worst-case analysis of uniform stability might not be appropriate for deep learning.
Finally, observe that with a single layer we obtain a linear classifier, so the results of the next section also apply to logistic regression, for example.

5. NON-CONVEX CASE

We now consider ReLU neural networks (although we could easily extend to other Lipschitz non-linearities) in the setup of multiclass classification. Write f(x) = W_l σ(··· W_2 σ(W_1 x)), where x is an input to the network, W_i denotes the weight matrix at layer i and σ denotes the ReLU function. Consider a non-negative loss function l(s, y) that receives a score vector s = f(x) and a label y as inputs. We require the loss function to be L-Lipschitz in s for all y. That is, for all s, s', y, |l(s, y) − l(s', y)| ≤ L ||s − s'||. For example, we can use the cross-entropy loss (softmax function with negative log-likelihood); in this case, it is simple to show, by bounding the norm of the gradient of l(s, y) with respect to s, that we can take L = √2. Remark that this is slightly different from the Lipschitz assumption of the previous section (which was with respect to the weights w).
Lemma 1 Assume that ||x|| ≤ R. For 1 ≤ j ≤ l, write s_j = W_j σ(··· W_2 σ(W_1 x)) and s'_j = W'_j σ(··· W'_2 σ(W'_1 x)). We have ||s_l − s'_l|| ≤ R Σ_{i=1}^l ||W_i − W'_i|| Π_{j≠i} m_j, where m_j = ||W'_j|| if j < i and m_j = ||W_j|| if j > i.
The previous lemma motivates a measure of "distance" between neural networks.
Definition 5 For neural networks f and g, where the weight matrices of f are W_1, ..., W_l and the weight matrices of g are W'_1, ..., W'_l, define d(f, g) := Σ_{i=1}^l || W_i/||W_i|| − W'_i/||W'_i|| ||.
Remark that this distance function is invariant to rescaling the weights of any layer. This is a desirable property, since in a ReLU network such a reparametrization leaves the class predicted by the classifier unchanged for any input to the network. Let α_1, ..., α_l be positive real numbers. We define the l^{α_1,...,α_l} losses as l^{α_1,...,α_l}(f, z) := l(α_l W_l/||W_l|| σ(··· α_2 W_2/||W_2|| σ(α_1 W_1/||W_1|| x)), y), where z = (x, y) and f is the neural network with weight matrix W_i at layer i.
That is, we project the weight matrices so that layer i has norm α_i and then evaluate the loss l on this "normalized" network. For simplicity, we only consider the case where all the α_i's are equal to some common value α, and we write l^α(f, z). From our definitions and Lemma 1, we have that for all z and neural networks f and g, |l^α(f, z) − l^α(g, z)| ≤ LRα^l d(f, g). In order to bound stability with respect to l^α, we will have to ensure that the two trajectories cannot diverge too much in terms of d(f, g). We make two main assumptions to prove our main result. The first one is a modification of the concept of β-smoothness for neural networks.
Definition 6 Consider the gradient of the loss function with respect to the parameters W for some training example z. The vector containing only the partial derivatives for the weights of layer j will be denoted by ∇^(j) l(W, z). We define {β_j}_{j=1}^l-layerwise smoothness as the following property: for all j, z, W = (W_1, ..., W_l) and W' = (W'_1, ..., W'_l), ||∇^(j) l(W, z) − ∇^(j) l(W', z)|| ≤ β_j ||W_j − W'_j||. We also let β := max_j {β_j}. Remark that β upper bounds the spectral norm of the block diagonal approximation of the Hessian.
Next, we introduce an assumption meaning, informally, that with high probability the norm of the parameters of each layer is eventually non-decreasing.
Definition 7 Let ε ≥ 0. The growing norms assumption (for this value of ε) is defined as follows: the distribution D satisfies the property that there exists t_0 ∈ {0, 1, ..., n/B} such that the probability, over draws of training sets from D^n and over the randomness of A, of generating a non-decreasing sequence (||W_{j,t}||)_{t=t_0}^T for all j is greater or equal to 1 − ε.
In Figure 2, we can see that this assumption makes sense on Cifar10 and Mnist (more details on the experiments in Section 6). We are now ready to state the main theorem of this section.
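The distance d(f, g) of Definition 5 is straightforward to compute. The sketch below (with illustrative helper names fro and layer_distance, ours rather than the paper's) also demonstrates the layerwise rescaling invariance:

```python
import math

def fro(M):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(x * x for row in M for x in row))

def layer_distance(Ws, Vs):
    """d(f, g) = sum over layers i of || W_i/||W_i|| - V_i/||V_i|| ||_F."""
    total = 0.0
    for W, V in zip(Ws, Vs):
        nW, nV = fro(W), fro(V)
        total += math.sqrt(sum((a / nW - b / nV) ** 2
                               for rw, rv in zip(W, V)
                               for a, b in zip(rw, rv)))
    return total

W = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]]
# Rescale each layer of W by a different positive factor: d stays zero,
# matching the fact that the rescaled network predicts identically.
V = [[[2.0 * a for a in row] for row in W[0]],
     [[5.0 * a for a in row] for row in W[1]]]
```

By contrast, the Euclidean distance between the two weight configurations above is large, which is exactly the gap between Euclidean and angle-wise stability that the paper exploits.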
Theorem 4 Let ε ≥ 0 and assume that the distribution D satisfies the growing norms assumption for this value of ε. The notation t_0 ∈ {1, 2, ..., n/B} will now implicitly also mean that t_0 is large enough so that the sequence of norms is non-decreasing. Suppose that the loss function l(s, y) is L-Lipschitz for all y and non-negative, and that l^α(f, z) is bounded above by M_α. Furthermore, assume {β_j}_{j=1}^l-layerwise smoothness and that ||x|| ≤ R. Finally, let B denote the batch size, λ_t ≤ c/t the learning rates and T the number of iterations SGD is run. Then, ε^α_av ≤ inf_{t_0 ∈ {1,2,...,n/B}} { (2BLRα^l)/((n−B)β) ((T−1)/(t_0−1))^{cβ} Σ_{t=t_0}^{T−1} ζ_t + M_α (Bt_0/n + ε) }, where ζ_t denotes the expected sum over the layers of the ratio between the norm of the gradient at layer j and the norm of the parameters of layer j at iteration t (the quantity whose empirical estimate ζ̂_t(S) is defined in Section 6).
Theorem 5 Fix α > 0. Then, for any n > 1 and δ ∈ (0, 1), the following holds with probability greater or equal to 1 − δ over draws of training sets S and the randomness of the algorithm A: L^{0-1}_D(A(S)) ≤ L^α_S(A(S)) + √((2M_α² + 12 n M_α ε^α_av)/(nδ)).

6. EXPERIMENTS

6.1. LEARNING RATES AND ζ t

In this section, we conduct experiments on the Cifar10 and Mnist datasets. We consider the scenario where we try to reduce the performance gap between small-batch and large-batch training by increasing the learning rate. We give some evidence suggesting that the quantity ζ_t is of interest for assessing generalization in this case. Remark that the bound on stability for neural networks holds for learning rates satisfying λ_t ≤ c/t. This is not a very practical schedule for deep learning, so in our experiments we instead use a global learning rate decayed once by a factor of 10. We use no weight decay or momentum, to stay closer to our theoretical analysis of SGD. Remark that in principle, the learning rate could be as large as we want during the initial burn-in period (before t_0) without hurting stability. However, this burn-in period must lie within the first epoch in the theoretical result we presented. Since in practice we train for many epochs, it is not clear whether such a short burn-in period is long enough to be significant in current practice. We still think the quantity ζ_t is relevant to investigate empirically. We approximate its value on a training set S with the quantity ζ̂_t(S) := Σ_{j=1}^l ||∇^(j) L_{B_t}(W_t)|| / ||W_{j,t}||. Instead of plotting the value at each iteration, we average ζ̂_t(S) over each epoch, which leads to smoother curves. On Cifar10, we use a 5-layer convolutional network consisting of 2 convolutional layers with max-pooling followed by 3 fully connected layers, trained with the cross-entropy loss. We also use the cross-entropy loss on Mnist, but there the network is a 6-layer fully connected network. In both cases, we use batch normalization to facilitate training. All the results in the figures are obtained with a batch size of 2048. We started by training with a smaller batch size of 256 and then tried to reduce the gap in performance between large-batch and small-batch training by increasing the learning rate.
For example, on Cifar10, we obtain a test accuracy of 86.23% with a batch size of 256 and a learning rate of 0.5. When increasing the batch size to 2048 (keeping the learning rate at 0.5), the test accuracy dropped to 85.14%. This happened even though the training loss reaches approximately the same value in both cases (0.0123 for batch size 256 and 0.0167 for batch size 2048). We then increased the learning rate to 1.0 and then to 1.5, reaching 85.63% in both cases (not completely closing the gap but reducing it). A similar phenomenon happens on Mnist. There, with batch size 256 we get 98.57% test accuracy (lr = 0.05), and with batch size 2048, we get 97.52% (lr = 0.05), 98.00% (lr = 0.1) and 98.39% (lr = 0.5). We plot the values of ζ̂_t(S) during training in Figure 3. We can see that it stays smaller throughout training when the learning rate is increased. To compare with the analysis of Hardt et al. (2016): there, the quantity ζ_t would be replaced by a global Lipschitz constant, which is not affected by the actual trajectory of the algorithm. Therefore, in comparison to our bound, the bound in Hardt et al. (2016) would be much more favorable to smaller learning rates. In other words, the worst-case analysis of uniform stability would require much smaller learning rates than our result to guarantee good stability. The quantity ζ_t can be improved by accelerating convergence, because of the numerator (the norm of the gradients), but also by increasing the denominator (the norm of the parameters). A larger learning rate can help in both these regards (see Figure 4 and Figure 5). Remark also that considering only the norm of the gradients, without the norm of the parameters, would lead to a less favorable quantity than considering both. A standard analysis of stability (without the normalized loss), similar to Kuzborskij & Lampert (2018), would not benefit from the norm of the parameters.
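The estimate ζ̂_t(S) is just a sum of per-layer norm ratios; a minimal sketch (layers represented as flat lists of gradient and weight entries, names ours):

```python
import math

def zeta_hat(layer_grads, layer_weights):
    """zeta_hat_t(S): sum over layers j of ||grad_j|| / ||W_j||,
    with each layer given as a flat list of entries."""
    total = 0.0
    for g, w in zip(layer_grads, layer_weights):
        gn = math.sqrt(sum(x * x for x in g))
        wn = math.sqrt(sum(x * x for x in w))
        total += gn / wn
    return total

# Two layers: ratios 5/5 = 1.0 and 1/2 = 0.5.
z = zeta_hat([[3.0, 4.0], [1.0]], [[0.0, 5.0], [2.0]])
```

The same gradient norms yield a smaller ζ̂_t when the weight norms are larger, which is the second mechanism discussed above by which larger learning rates can help.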

6.2. GENERALIZATION BOUND AND TEST ERROR

We show in this section the usefulness of considering the normalized loss for bounding the test error. We evaluate the bound in Theorem 5 and compare it to an analogous version for the unnormalized loss. For this analogous version, we replace the upper bound M on the loss function by the largest loss achieved during training. Furthermore, the quantity ε_av is upper bounded by the Lipschitz constant times the Euclidean distance between the weights of the networks, where the Lipschitz constant is replaced by the largest gradient norm obtained during training. For the normalized loss (our actual Theorem 5), we upper bound ε^α_av by LRα^l E d(f, g) (see equation 13). We plot the test error, the upper bound for the normalized case with α = 1.0 and the upper bound for the unnormalized case in Figure 6. Figure 6: The bound obtained from the Euclidean distance is much worse than the bound obtained from our normalized distance. The network is a 6-layer fully connected network and the training set is MNIST.

7. CONCLUSION

We investigated the stability (uniform and on-average) of SGD with respect to normalized loss functions. This leads naturally to considering a more meaningful measure of distance between classifiers. Our experimental results show that stability might not be as bad as expected when training deep neural networks with larger learning rates. We hope that our analysis will be a helpful step toward understanding generalization in deep learning. Future work could investigate the on-average stability with respect to the l^α losses of different optimization algorithms.

A APPENDIX: PROOFS FOR THE CONVEX CASE

We start by proving a few lemmas.
Lemma 2 Let v, w ∈ R^n and 0 < c ≤ min{||v||, ||w||}. Then, || v/||v|| − w/||w|| || ≤ ||v − w|| / c.
Proof: The proof follows from basic linear algebra manipulations. We give it here for completeness since it is important in what follows. We need to show that ⟨v/||v|| − w/||w||, v/||v|| − w/||w||⟩ ≤ ⟨v − w, v − w⟩ / c². After some manipulations, one sees that this is equivalent to showing that ||v||² + ||w||² − 2c² + 2(c² − ||v|| ||w||) ⟨v/||v||, w/||w||⟩ ≥ 0. From the Cauchy-Schwarz inequality, ⟨v/||v||, w/||w||⟩ ≤ 1. Since c² − ||v|| ||w|| ≤ 0, the proof is completed by showing that ||v||² + ||w||² − 2c² + 2(c² − ||v|| ||w||) ≥ 0. But this is true since ||v||² + ||w||² − 2||v|| ||w|| = (||v|| − ||w||)² ≥ 0.
Lemma 3 Assume that the update rule G is η-expansive. Furthermore, assume that ||G(v)|| and ||G(w)|| are larger than Kη. Then, || G(w)/||G(w)|| − G(v)/||G(v)|| || ≤ ||v − w|| / K.
Proof: For v and w satisfying the assumptions, we have || G(v)/||G(v)|| − G(w)/||G(w)|| || ≤ ||G(v) − G(w)|| / (Kη) ≤ ||v − w|| / K.
Lemma 4 Assume that the update rule is σ-bounded, that is, ||G(w) − w|| ≤ σ for all w. Furthermore, assume that ||G(v)|| ≥ K and ||G(w)|| ≥ K. Then, || G(w)/||G(w)|| − G(v)/||G(v)|| || ≤ ||v − w|| / K + 2σ/K.
Lemma 5 Assume that the initial point w_0 satisfies ||w_0|| ≥ K and that SGD is run with batch size B and a sequence of learning rates λ_t on a loss function l(w, z) that is L-Lipschitz for all z. Then, for all t ≥ 1, ||w_t|| ≥ K − L Σ_{i=0}^{t−1} λ_i.
Proof: ||w_t|| = ||w_{t−1} − λ_{t−1} (1/B) Σ_{j=1}^B ∇l(w_{t−1}, z_j)|| ≥ ||w_{t−1}|| − λ_{t−1} (1/B) ||Σ_{j=1}^B ∇l(w_{t−1}, z_j)|| ≥ ||w_{t−1}|| − λ_{t−1} L ≥ ||w_{t−2}|| − λ_{t−2} L − λ_{t−1} L ≥ ··· ≥ ||w_0|| − L Σ_{i=0}^{t−1} λ_i ≥ K − L Σ_{i=0}^{t−1} λ_i.
For ease of comparison, we give the statement of Theorem 3.8 in Hardt et al. (2016).
Theorem 6 (Theorem 3.8 in Hardt et al. (2016)) Assume that the loss function f(·; z) is β-smooth, convex and L-Lipschitz for every z. Suppose that we run SGD with step sizes α_t ≤ 2/β for T steps.
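Lemma 2 is easy to sanity-check numerically. The following sketch draws random vector pairs and verifies the inequality (unit_gap is an illustrative helper of ours):

```python
import math
import random

def unit_gap(v, w):
    """|| v/||v|| - w/||w|| ||."""
    nv = math.sqrt(sum(x * x for x in v))
    nw = math.sqrt(sum(x * x for x in w))
    return math.sqrt(sum((a / nv - b / nw) ** 2 for a, b in zip(v, w)))

random.seed(0)
for _ in range(1000):
    v = [random.uniform(-1.0, 1.0) for _ in range(5)]
    w = [random.uniform(-1.0, 1.0) for _ in range(5)]
    c = min(math.sqrt(sum(x * x for x in v)),
            math.sqrt(sum(x * x for x in w)))
    gap = math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))
    # Lemma 2: the normalized gap is at most the Euclidean gap divided by c.
    assert unit_gap(v, w) <= gap / c + 1e-12
```

The lemma is what turns a bound on the Euclidean drift ||w_t − w'_t|| into a bound on the angle-wise drift, provided the norms stay bounded below (Lemma 5).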
Then, SGD satisfies uniform stability with ε_uni ≤ (2L²/n) Σ_{t=1}^T α_t.
We are now ready to prove Theorem 1.
Proof of Theorem 1: The proof is similar to Hardt et al. (2016). Let w_t denote the output of A after t steps on training set S and w'_t the output of A after t steps on training set S^(i), for some i ∈ {1, ..., n}. From convexity, the update rule is 1-expansive (see Lemma 3.7 of Hardt et al. (2016)). Furthermore, it follows directly from our Lemma 5 that the assumptions of both Lemma 3 and Lemma 4 are satisfied (we can use σ = Lλ_t at iteration t in Lemma 4). Since the probability of picking example i in a mini-batch of size B is smaller than B/n (sampling with replacement), we have E || w_{t+1}/||w_{t+1}|| − w'_{t+1}/||w'_{t+1}|| || ≤ (B/n) (E||w_t − w'_t||/K' + 2Lλ_t/K') + (1 − B/n) E||w_t − w'_t||/K' = (1/K') (E||w_t − w'_t|| + 2BLλ_t/n). Remark that this holds since E||w_t − w'_t||/K' ≤ E||w_t − w'_t||/K' + 2Lλ_t/K'. From the result in Hardt et al. (2016) (with a slight change to allow any batch size B), we have E||w_t − w'_t|| ≤ (2BL/n) Σ_{i=0}^{t−1} λ_i. Therefore, E || w_{t+1}/||w_{t+1}|| − w'_{t+1}/||w'_{t+1}|| || ≤ (2BL)/(nK') Σ_{i=0}^t λ_i. The result then follows from the inequality |l(α w/||w||, z) − l(α w'/||w'||, z)| ≤ Lα || w/||w|| − w'/||w'|| ||.
We finally prove Theorem 3.
Proof of Theorem 3: To simplify the text, write ε(α, δ) := E_A L^α_S(A(S)) + ε^α_uni + (2n ε^α_uni + M_α) √(ln(1/δ)/(2n)). For i ≥ 1, let α_i = 2^(1−i) C and δ_i = δ/(2i²). For any fixed i, we have P_S{E_A L^{0-1}_D(A(S)) > ε(α_i, δ_i)} < δ_i. Therefore, P_S{∀i, E_A L^{0-1}_D(A(S)) ≤ ε(α_i, δ_i)} = 1 − P_S{∃i, E_A L^{0-1}_D(A(S)) > ε(α_i, δ_i)} ≥ 1 − Σ_{i=1}^∞ P_S{E_A L^{0-1}_D(A(S)) > ε(α_i, δ_i)} ≥ 1 − Σ_{i=1}^∞ δ_i ≥ 1 − δ. The last inequality follows from Σ_{i=1}^∞ δ_i = (δ/2) Σ_{i=1}^∞ 1/i² = (δ/2)(π²/6) ≤ δ.
We want to show that the set {S : ∀i, E_A L^{0-1}_D(A(S)) ≤ ε(α_i, δ_i)} is contained in the set {S : ∀α ∈ (0, C], E_A L^{0-1}_D(A(S)) ≤ E_A max{L^{α/2}_S(A(S)), L^α_S(A(S))} + ε^α_uni + (2n ε^α_uni + M_α) √((2 ln(√2 (2 + log₂ C − log₂ α)) + ln(1/δ))/(2n))}. Let S be such that ∀i, E_A L^{0-1}_D(A(S)) ≤ ε(α_i, δ_i). Let α ∈ (0, C]. Then, there exists i such that α_i ≤ α ≤ 2α_i. We have E_A L^{0-1}_D(A(S)) ≤ E_A L^{α_i}_S(A(S)) + ε^{α_i}_uni + (2n ε^{α_i}_uni + M_{α_i}) √(ln(1/δ_i)/(2n)) ≤ E_A L^{α_i}_S(A(S)) + ε^α_uni + (2n ε^α_uni + M_α) √(ln(1/δ_i)/(2n)) ≤ E_A L^{α_i}_S(A(S)) + ε^α_uni + (2n ε^α_uni + M_α) √((2 ln(√2 (2 + log₂ C − log₂ α)) + ln(1/δ))/(2n)). The second inequality is true since both ε^α_uni and M_α are non-decreasing functions of α and α_i ≤ α. The last inequality is true since 1/δ_i = 2i²/δ ≤ 2(2 + log₂ C − log₂ α)²/δ. Finally, the proof is concluded by using the convexity of L^α_S(A(S)) with respect to α. Indeed, since α/2 ≤ α_i ≤ α, we must have L^{α_i}_S(A(S)) ≤ max{L^{α/2}_S(A(S)), L^α_S(A(S))}.

B APPENDIX: PROOFS FOR THE NON-CONVEX CASE

Proof of Lemma 1: The proof is by induction on the number of layers l. Suppose the result is true for l − 1 layers. Then we have ||s_l − s'_l|| = ||W_l σ(s_{l−1}) − W'_l σ(s'_{l−1})|| = ||(W_l − W'_l) σ(s'_{l−1}) + W_l (σ(s_{l−1}) − σ(s'_{l−1}))|| ≤ ||(W_l − W'_l) σ(s'_{l−1})|| + ||W_l (σ(s_{l−1}) − σ(s'_{l−1}))|| ≤ ||W_l − W'_l|| ||σ(s'_{l−1})|| + ||W_l|| ||σ(s_{l−1}) − σ(s'_{l−1})|| ≤ ||W_l − W'_l|| ||s'_{l−1}|| + ||W_l|| ||s_{l−1} − s'_{l−1}|| ≤ R ||W_l − W'_l|| Π_{j=1}^{l−1} ||W'_j|| + ||W_l|| R Σ_{i=1}^{l−1} ||W_i − W'_i|| Π_{j≤l−1, j≠i} m_j = R Σ_{i=1}^l ||W_i − W'_i|| Π_{j≠i} m_j, where m_j = ||W'_j|| if j < i and m_j = ||W_j|| if j > i. We exploited the fact that the ReLU non-linearity cannot increase the norm of a vector and that it is 1-Lipschitz. The proof is concluded by observing that for one layer we have ||s_1 − s'_1|| ≤ R ||W_1 − W'_1||.
Definition 8 Let us introduce some notation. Let δ^(j)_t(S, z) := ||W_{j,t} − W'_{j,t}|| and Δ^(j)_t(S, z) := E_A[δ^(j)_t(S, z) | ∀k, δ^(k)_{t_0}(S, z) = 0]. Here, W_{j,t} is obtained when training on S for t iterations and W'_{j,t} is obtained when training on S^(i) for t iterations. The condition inside the expectation is that after t_0 iterations, the two networks are still exactly the same. Since we are interested in distances after normalization, we also consider the normalized analogue δ̃^(j)_t(S, z) := || W_{j,t}/||W_{j,t}|| − W'_{j,t}/||W'_{j,t}|| ||.
Before proving Theorem 4, we establish a lemma. Remark that the structure of the proof of the following lemma and of Theorem 4 is similar to the corresponding results in Hardt et al. (2016) and in Kuzborskij & Lampert (2018).
Lemma 6 Let ε ≥ 0 and assume that the distribution D satisfies the growing norms assumption for this value of ε. The notation t_0 ∈ {0, 1, 2, ..., n/B} will now implicitly also mean that t_0 is large enough so that the sequence of norms is non-decreasing. Suppose that the loss function l(s, y) is L-Lipschitz for all y and non-negative, and that l^α(f, z) is bounded above by M_α.
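The l = 2 case of Lemma 1 can be checked numerically. The sketch below verifies ||s_2 − s'_2|| ≤ R (||W_1 − W'_1|| ||W_2|| + ||W_2 − W'_2|| ||W'_1||) on random instances (helper names are ours; primed matrices are written V):

```python
import math
import random

def relu(v): return [max(0.0, x) for x in v]
def matvec(M, v): return [sum(m * x for m, x in zip(row, v)) for row in M]
def fro(M): return math.sqrt(sum(x * x for row in M for x in row))
def norm(v): return math.sqrt(sum(x * x for x in v))

random.seed(1)
for _ in range(200):
    rand_mat = lambda: [[random.uniform(-1, 1) for _ in range(3)]
                        for _ in range(3)]
    W1, W2 = rand_mat(), rand_mat()
    V1 = [[a + random.uniform(-0.1, 0.1) for a in row] for row in W1]
    V2 = [[a + random.uniform(-0.1, 0.1) for a in row] for row in W2]
    x = [random.uniform(-1, 1) for _ in range(3)]
    R = norm(x)
    s = matvec(W2, relu(matvec(W1, x)))   # two-layer ReLU network f
    t = matvec(V2, relu(matvec(V1, x)))   # perturbed network g
    dW1 = fro([[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(W1, V1)])
    dW2 = fro([[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(W2, V2)])
    bound = R * (dW1 * fro(W2) + dW2 * fro(V1))
    # Lemma 1 with l = 2: output gap bounded by the weighted weight gaps.
    assert norm([a - b for a, b in zip(s, t)]) <= bound + 1e-9
```

This matches the induction step above: the i = 2 term carries ||W'_1|| (the primed network feeds the perturbed layer) and the i = 1 term carries ||W_2||.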
Also, assume that ||x|| ≤ R. Furthermore, let B denote the batch size and T the number of iterations SGD is being run. Then, for any t 0 ∈ {0, 1, 2, . . . , n B }, the on-average stability satisfies α av ≤ LRα l l j=1 E S,z E A δ(j) T (S, z) | ∀k, δ (k) t0 (S, z) = 0 + M α ( Bt 0 n + ). Proof: Since l α (f, z) is bounded above by M α and non-negative, we have |l α (f, z) -l α (g, z)| ≤ M α . Therefore, by adding a term M α to the bound, we can then only consider the case where the growing norm assumption is satisfied. We won't write that the expectation is then conditional on that assumption to lighten the statements. By a similar line of reasoning, we can further condition inside the expectation with the property that after t 0 iterations, the two networks are still exactly the same. To make this clearer, write the quantity |l α (f, z) -l α (g, z)| as the sum of |l α (f, z) - l α (g, z)|I{∀k, δ (k) t0 (S, z) = 0} and |l α (f, z) -l α (g, z)|I{∃k : δ (k) t0 (S, z) = 0}. We bound the first term by using the fact that |l α (f, z) -l α (g, z)| ≤ LRα l d(f, g) = LRα l l j=1 δ(j) T (S, z). For the second term, we use that l α (f, z) is bounded above by M α and non-negative to write again |l α (f, z) -l α (g, z)| ≤ M α . The result then follows from the fact that the probability of picking example i in t 0 iterations is smaller than Bt0 n . Proof of theorem 4: From lemma 2, we always have δ(j) t (S, z) ≤ δ(j) t (S, z). Therefore, from the previous lemma, . Here, B t denotes the batch of samples at iteration t when training on S and B t denotes the batch of samples at iteration t when training on S (i) . When B t = B t , we will use {β j } l j=1 -layerwise smoothness to bound ||∇ (j) L Bt (W t ) -∇ (j) L B t (W t )||. Otherwise, we use simply the triangular inequality. Let p(B, n) be the probability of picking the example i in a mini-batch of size B (this is smaller than B n ). For t ≥ t 0 , we have To complete the proof, we will use that max t0≤t≤T -1 {ζ 
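As a quick numerical sanity check (ours, not part of the paper), the layerwise perturbation bound of lemma 1 can be verified on random weights. The sketch below uses a small ReLu network with made-up dimensions and takes $\|W\|$ to be the spectral norm:

```python
import numpy as np

rng = np.random.default_rng(0)

def preactivations(weights, x):
    # s_1 = W_1 x, then s_l = W_l sigma(s_{l-1}) with sigma = ReLU.
    s = weights[0] @ x
    for W in weights[1:]:
        s = W @ np.maximum(s, 0.0)
    return s

def lemma1_bound(Ws, Wps, R):
    # R * sum_i ||W_i - W'_i|| * prod_{j<i} ||W_j|| * prod_{j>i} ||W'_j||
    op = lambda W: np.linalg.norm(W, 2)  # spectral norm
    total = 0.0
    for i in range(len(Ws)):
        prod = 1.0
        for j in range(len(Ws)):
            if j < i:
                prod *= op(Ws[j])
            elif j > i:
                prod *= op(Wps[j])
        total += op(Ws[i] - Wps[i]) * prod
    return R * total

dims = [8, 16, 16, 4]  # input dimension followed by three layer widths (arbitrary)
Ws = [rng.normal(size=(dims[k + 1], dims[k])) for k in range(len(dims) - 1)]
Wps = [W + 0.01 * rng.normal(size=W.shape) for W in Ws]  # perturbed copy W'

x = rng.normal(size=dims[0])
x /= np.linalg.norm(x)  # enforce ||x|| <= R with R = 1
lhs = np.linalg.norm(preactivations(Ws, x) - preactivations(Wps, x))
rhs = lemma1_bound(Ws, Wps, R=1.0)
assert lhs <= rhs
```

Every inequality in the proof holds with operator norms, so the assertion passes for any choice of weights and perturbation.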



It is possible to obtain a bound holding over the randomness of $A$ by exploiting the framework of Elisseeff et al. (2005). However, the term involving $\rho$ in their theorem 15 does not converge to 0 as the size of the training set grows to infinity.



Figure 1: For the same magnitude of step taken (same ball radius), a larger norm of parameters leads to a smaller change in angle.

Figure 2: Norm of parameters for 3 different layers when training a convolutional network on Cifar10 and a fully connected network on Mnist with SGD.

Here, $\bar{\zeta}_t^{(j)}(S)$ denotes the expected value of $\|\nabla^{(j)} L_{B_t}(W_t) - \nabla^{(j)} L_{B'_t}(W'_t)\| / \min\{\|W_{j,t}\|, \|W'_{j,t}\|\}$ and $\beta = \max_j\{\beta_j\}$. Exploiting theorem 12 in Elisseeff et al. (2005), we can get a probabilistic bound on the test error (holding over the randomness in the training sets and the randomness in the algorithm).
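To illustrate what such a data-dependent quantity measures, the sketch below tracks a normalized gradient difference along two SGD trajectories trained on datasets differing in one example. The linear model, squared loss, and all constants are stand-ins chosen for illustration, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in: squared loss on a linear model, two training sets S and S^(i)
# differing in a single example, trained with shared batch indices.
n, d, B, T = 200, 5, 20, 100
X = rng.normal(size=(n, d)); y = rng.normal(size=n)
Xp, yp = X.copy(), y.copy()
Xp[0] = rng.normal(size=d); yp[0] = rng.normal()  # S^(i): example 0 replaced

def grad(w, Xb, yb):
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(d); wp = np.zeros(d)
zetas = []
for t in range(1, T + 1):
    idx = rng.choice(n, size=B, replace=False)  # same indices on both trajectories
    g, gp = grad(w, X[idx], y[idx]), grad(wp, Xp[idx], yp[idx])
    denom = min(np.linalg.norm(w), np.linalg.norm(wp))
    zetas.append(np.linalg.norm(g - gp) / denom if denom > 0 else 0.0)
    lam = 0.01 / t  # decaying step size lambda_t = c / t
    w = w - lam * g
    wp = wp - lam * gp
print(max(zetas))
```

The trajectories coincide until a batch containing the swapped example is drawn, after which the normalized gradient difference becomes non-zero, mirroring the role of the conditioning on the first $t_0$ iterations.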

Figure 3: ζ_t(S) when training a convolutional network on Cifar10 and a fully connected network on Mnist.

Figure 5: Norm of the parameters (layer 3) when training a convolutional network on Cifar10 and a fully connected network on Mnist.

When the batches differ, the SGD update rule and the triangle inequality give
$$
\delta^{(j)}_{t+1}(S,z) \le \delta^{(j)}_{t}(S,z) + \lambda_t\,\|\nabla^{(j)} L_{B_t}(W_t) - \nabla^{(j)} L_{B'_t}(W'_t)\|.
$$

With the definition $\bar{\zeta}_t := \sum_{j=1}^{l} \bar{\zeta}_t^{(j)}$, we then have
$$
\epsilon^{\alpha}_{av} \le \inf_{t_0 \in \{1, 2, \dots, \frac{n}{B}\}} \left\{ \frac{2 B L R \alpha^{l}}{(n-B)\,\beta} \left(\frac{T-1}{t_0-1}\right)^{c\beta} \max_{t_0 \le t \le T-1}\{\bar{\zeta}_t\} + M_\alpha\!\left(\frac{B t_0}{n} + \epsilon\right) \right\}.
$$
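The infimum over $t_0$ reflects a trade-off: a larger $t_0$ shrinks the divergence term (the trajectories are forced to coincide longer) while growing the burn-in term $M_\alpha(B t_0/n + \epsilon)$. The sketch below illustrates this trade-off with entirely hypothetical constants, including a placeholder exponent standing in for $c\beta$:

```python
import numpy as np

# Hypothetical constants, for illustration only (not values from the paper).
n, B, T = 50_000, 128, 10_000
M_alpha, eps = 50.0, 0.01
C = 5.0  # stands in for the 2BLR*alpha^l / ((n - B)*beta) * max-zeta factor

def bound(t0):
    # Divergence term shrinks with t0; 0.5 is a placeholder for the exponent c*beta.
    divergence = C * ((T - 1) / max(t0 - 1, 1)) ** 0.5
    burn_in = M_alpha * (B * t0 / n + eps)  # cost of conditioning on t0 identical steps
    return divergence + burn_in

t0s = np.arange(1, n // B + 1)  # t0 ranges over {1, ..., n/B}
vals = np.array([bound(t0) for t0 in t0s])
best = int(t0s[vals.argmin()])
print(best, float(vals.min()))
```

With these placeholder constants the minimizer is interior: neither $t_0 = 1$ (maximal divergence term) nor $t_0 = n/B$ (maximal burn-in term) attains the infimum.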

For $t \ge t_0$, combining the two cases yields the recursion
$$
\bar{\Delta}^{(j)}_{t+1}(S,z) \le (1 - p(B,n))(1 + \beta_j \lambda_t)\,\bar{\Delta}^{(j)}_{t}(S,z) + p(B,n)\left(\bar{\Delta}^{(j)}_{t}(S,z) + \lambda_t\,\mathbb{E}_A\!\left[\frac{\|\nabla^{(j)} L_{B_t}(W_t) - \nabla^{(j)} L_{B'_t}(W'_t)\|}{\min\{\|W_{j,t}\|, \|W'_{j,t}\|\}}\right]\right).
$$
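As a sanity check on how a recursion of this shape behaves, one can unroll it numerically. The constants below ($p$, $\beta_j$, the step sizes $\lambda_t = c/t$, and the stand-in for the expected normalized gradient difference) are all hypothetical:

```python
# Hypothetical constants, for illustration only.
n, B = 50_000, 128
p = B / n       # p(B, n): probability the swapped example is in the batch (< B/n)
beta_j = 2.0    # layerwise smoothness constant
c = 0.05        # decaying step sizes lambda_t = c / t
zeta = 0.1      # stand-in for the expected normalized gradient difference

def unroll(T, t0):
    """Unroll Delta_{t+1} <= (1-p)(1 + beta_j*lam)*Delta_t + p*(Delta_t + lam*zeta)."""
    delta = 0.0  # the two trajectories coincide up to iteration t0
    for t in range(t0, T):
        lam = c / t
        delta = (1 - p) * (1 + beta_j * lam) * delta + p * (delta + lam * zeta)
    return delta

# Waiting longer before the trajectories may diverge yields a smaller final distance.
for t0 in (1, 10, 100, 1000):
    print(t0, unroll(T=10_000, t0=t0))
```

This mirrors the role of $t_0$ in the final bound: each step multiplies the accumulated distance by a factor at least 1 and adds a positive contribution, so starting the divergence later strictly decreases the unrolled value.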

