SGD AND WEIGHT DECAY PROVABLY INDUCE A LOW-RANK BIAS IN NEURAL NETWORKS

Abstract

We analyze deep ReLU neural networks trained with mini-batch Stochastic Gradient Descent (SGD) and weight decay. We show, both theoretically and empirically, that when training a neural network using SGD with weight decay and small batch size, the resulting weight matrices tend to be of small rank. Our analysis relies on a minimal set of assumptions; the neural networks may be arbitrarily wide or deep, and may include residual connections, as well as convolutional layers. The same analysis implies the inherent presence of SGD "noise", defined as the inability of SGD to converge to a stationary point. In particular, we prove that SGD noise must always be present, even asymptotically, as long as we incorporate weight decay and the batch size is smaller than the total number of training samples.

1. INTRODUCTION

Stochastic gradient descent (SGD) is one of the standard workhorses for optimizing deep models (Bottou, 1991). Though initially proposed to remedy the computational bottleneck of gradient descent (GD), recent studies suggest that SGD also induces crucial regularization, which prevents overparameterized models from converging to minima that cannot generalize well (Zhang et al., 2016; Jastrzebski et al., 2017; Keskar et al., 2017; Zhu et al., 2019). Empirical studies suggest that (i) SGD outperforms GD (Zhu et al., 2019), (ii) SGD generalizes better when used with smaller batch sizes (Hoffer et al., 2017; Keskar et al., 2017), and (iii) gradient descent with additional noise cannot compete with SGD (Zhu et al., 2019). The full range of regularization effects induced by SGD, however, is not yet fully understood. In this paper we present a mathematical analysis of the bias of SGD towards rank minimization. To investigate this bias, we propose the SGD Near-Convergence Regime as a novel approach for investigating inductive biases of SGD-trained neural networks. This setting considers the case where SGD reaches a point in training where the expected update is small in comparison to the norm of the weights. Our analysis is fairly generic: we consider deep ReLU networks trained with mini-batch SGD to minimize a differentiable loss function with L2 regularization (i.e., weight decay). The neural networks may include fully-connected layers, residual connections and convolutions. Our main contributions are:
• In Thm. 1, we demonstrate that training neural networks with mini-batch SGD and weight decay results in a low-rank bias in their weight matrices. We theoretically demonstrate that the rank of the learned matrices tends to decrease when training with smaller batch sizes. This observation is validated as part of an extensive empirical study of the effect of certain hyperparameters on the rank of the learned matrices across various architectures.
• In Sec. 3.2, we study the inherent inability of SGD to converge to a stationary point, which we call 'SGD noise'. In Props. 1-2 we describe conditions under which 'SGD noise' is inevitable when training convolutional neural networks. In particular, we demonstrate that when training a fully-connected neural network, SGD noise must always be present, even asymptotically, as long as we incorporate weight decay and the batch size is smaller than the total number of samples. These predictions are empirically validated in Sec. 4.3.

1.1. RELATED WORK

A prominent thread in the recent literature revolves around characterizing the implicit regularization of gradient-based optimization, in the belief that this is key to generalization in deep learning. Several papers have focused on a potential bias of gradient descent or stochastic gradient descent towards rank minimization. The initial interest was motivated by the matrix factorization problem, which corresponds to training a depth-2 linear neural network with multiple outputs w.r.t. the square loss. Gunasekar et al. (2017) initially conjectured that the implicit regularization in matrix factorization can be characterized in terms of the nuclear norm of the corresponding linear predictor. This conjecture, however, was formally refuted by Li et al. (2020). Later, Razin & Cohen (2020) conjectured that the implicit regularization in matrix factorization can be explained by rank minimization, and also hypothesized that some notion of rank minimization may be key to explaining generalization in deep learning. Li et al. (2020) established evidence that the implicit regularization in matrix factorization is a heuristic for rank minimization. Beyond factorization problems, Ji & Telgarsky (2020) showed that gradient flow (GF) training of univariate linear networks w.r.t. exponentially-tailed classification losses learns weight matrices of rank 1. With nonlinear neural networks, however, things are less clear. Empirically, a series of papers (Denton et al., 2014; Alvarez & Salzmann, 2017; Tukan et al., 2021; Yu et al., 2017; Arora et al., 2018) showed that replacing the weight matrices with low-rank approximations results in only a small drop in accuracy. This suggests that the weight matrices at convergence may be close to low-rank matrices. However, whether they provably behave this way remains unclear. Timor et al. (2022) showed that for ReLU networks, GF generally does not minimize rank.
They also argued that sufficiently deep ReLU networks can have low-rank solutions under L2 norm minimization. This interesting result, however, applies to layers that are added on top of a network that already solves the problem, and those layers may not exhibit any low-rank bias. It is not directly related to the mechanism described in this paper, which, unlike (Timor et al., 2022), applies to all layers of the network, but only in the presence of regularization and SGD. A recent paper (Le & Jegelka, 2022) analyzes the low-rank bias of neural networks trained with GF (without regularization). While this paper makes significant strides in extending the analysis of (Ji & Telgarsky, 2020), it makes several limiting assumptions. As a result, their analysis is only applicable under very specific conditions, such as when the data is linearly separable, and their low-rank analysis is limited to a set of linear layers aggregated at the top of the trained network.

2. PROBLEM SETUP

In this work we consider a standard supervised learning setting (classification or regression), and study the inductive biases induced by training a neural network with mini-batch SGD along with weight decay. Formally, the task is defined by a distribution P over samples (x, y) ∈ X × Y, where X ⊂ R^{c1×h1×w1} is the instance space (e.g., images) and Y ⊂ R^k is the label space. We consider a parametric model F ⊂ {f' : X → R^k}, where each function f_W ∈ F is specified by a vector of parameters W ∈ R^N. A function f_W ∈ F assigns a prediction to any input point x ∈ X, and its performance is measured by the Expected Risk, L_P(f_W) := E_{(x,y)∼P}[ℓ(f_W(x), y)], where ℓ : R^k × Y → [0, ∞) is a non-negative, differentiable loss function (e.g., the MSE or cross-entropy loss). For simplicity, in the analysis we focus on the case k = 1. Since we do not have direct access to the full population distribution P, the goal is to learn a predictor f_W from a training dataset S = {(x_i, y_i)}_{i=1}^m of independent and identically distributed (i.i.d.) samples drawn from P. To avoid overfitting the training data, we employ weight decay to control the complexity of the learned model. Namely, we intend to minimize the Regularized Empirical Risk, L^λ_S(f_W) := (1/m) Σ_{i=1}^m ℓ(f_W(x_i), y_i) + λ∥W∥²₂, where λ > 0 is a predefined hyperparameter. To minimize this objective, we typically use mini-batch SGD, as detailed below.
Optimization. In this work, we minimize the regularized empirical risk L^λ_S(f_W) by applying stochastic gradient descent (SGD) for a certain number of iterations T. Formally, we initialize W_0 using a standard initialization procedure, iteratively update W_t for T iterations and return W_T. At each iteration, we sample a subset S̃ = {(x_{i_j}, y_{i_j})}_{j=1}^B ⊂ S uniformly at random and update W_{t+1} ← W_t − µ∇_W L^λ_{S̃}(f_{W_t}), where µ > 0 is a predefined learning rate.
Notation.
Throughout the paper, we use the following notation. For an integer k ≥ 1, [k] := {1, . . . , k}. ∥z∥ denotes the Euclidean norm. For two vectors x ∈ R^n, y ∈ R^m, we define their concatenation as (x∥y) := (x_1, . . . , x_n, y_1, . . . , y_m) ∈ R^{n+m}. For a matrix x ∈ R^{n×m}, we denote by x_{i,:} its i'th row and by vec(x) := (x_{1,:} ∥ . . . ∥ x_{n,:}) its vectorization. For a tensor x ∈ R^{c×h×w}, we denote by vec(x) := (vec(x_{1,:,:}) ∥ . . . ∥ vec(x_{c,:,:})) the vectorized form of x. We define tensor slicing as x_{a:b} := (x_a, . . . , x_b).
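The optimization procedure above can be sketched in a few lines. The following is a minimal illustration, assuming a linear model f_W(x) = ⟨W, x⟩ with the squared loss; the dataset, hyperparameter values and variable names are illustrative and not taken from the paper's experiments.

```python
# Minimal sketch of mini-batch SGD with weight decay, assuming a linear
# model and squared loss; all names and values here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
m, d = 64, 10                       # dataset size, input dimension
X = rng.normal(size=(m, d))
y = rng.normal(size=m)
W = 0.1 * rng.normal(size=d)        # standard (here Gaussian) initialization

mu, lam, B, T = 0.02, 1e-2, 8, 500  # learning rate, weight decay, batch size, steps
for t in range(T):
    idx = rng.choice(m, size=B, replace=False)   # sample a mini-batch uniformly
    pred = X[idx] @ W
    # gradient of (1/B) * sum_i ell(f_W(x_i), y_i) + lam * ||W||^2
    grad = (2.0 / B) * X[idx].T @ (pred - y[idx]) + 2.0 * lam * W
    W = W - mu * grad                # the update W_{t+1} <- W_t - mu * grad
```

Note that the weight-decay term 2λW enters every mini-batch gradient, which is the fact the analysis in Sec. 3 builds on.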

2.1. ARCHITECTURES

In this work, the function f_W represents a neural network, consisting of a set of weight layers interlaced with ReLU activation units. We employ a fairly generic definition of a neural network that includes convolutional layers, pooling layers, residual connections and fully-connected layers.
Network architecture. Formally, f_W is a directed acyclic graph (DAG) G = (V, E), where V = {v_1, . . . , v_L} consists of the various layers of the network and each edge e_ij = (v_i, v_j) ∈ E specifies a connection between two layers. Each layer is a function v_i : R^{c1×h1×w1} → R^{ci×hi×wi} and each connection (v_i, v_j) holds a transformation C_ij : R^{cj×hj×wj} → R^{ci×hi×wi}. The layers are divided into three categories: (i) the input layer v_1, (ii) the output layer v_L and (iii) the intermediate layers. In this setting, there are no connections directed towards the input layer and no connections coming out of the output layer (i.e., ∀ i ∈ [L] : (v_1, v_i), (v_i, v_L) ∉ E). Given an input x ∈ R^{c1×h1×w1}, the output of a given layer v_i is evaluated as v_i(x) := σ(Σ_{j∈pred(i)} C_ij(v_j(x))), except for the output layer v_L, which computes f_W(x) := v_L(x) := Σ_{j∈pred(L)} C_Lj(v_j(x)). Here, pred(i) := {j ∈ [L] | (v_i, v_j) ∈ E}, succ(i) := {j ∈ [L] | (v_j, v_i) ∈ E} and σ is the ReLU activation function. Each transformation C_ij is either trainable (e.g., a convolutional layer) or a constant affine transformation (e.g., a residual connection). We denote by E_T the set of trainable connections. In this paper, we consider the following transformations.
Convolutional layers. A convolutional layer (Lecun et al., 1998) (see also Goodfellow et al.
(2016)) C_ij : R^{cj×hj×wj} → R^{ci×hi×wi} with kernel size (k_1, k_2), padding p and stride s is parameterized by a tensor Z^{ij} ∈ R^{ci×cj×k1×k2} and computes the output tensor

∀ (c, t, l) ∈ [c_i]×[h_i]×[w_i] : y_{c,t,l} = Σ_{c'=1}^{c_j} vec(Z^{ij}_{c,c',:,:})^⊤ vec(Pad_p(x)_{c', ts : ts+k_1, ls : ls+k_2}).  (1)

Here, Pad_p takes a tensor x ∈ R^{c×h×w} and returns a new tensor x' ∈ R^{c×(h+2p)×(w+2p)}, where the first and last p rows and columns of each channel x'_{c,:,:} are zeros and the middle h × w block equals x_{c,:,:}. The formulas for the output dimensions are h_2 = ⌊(h_1 − k_1 + 2p)/s⌋ + 1 and w_2 = ⌊(w_1 − k_2 + 2p)/s⌋ + 1. For a given convolutional layer C_ij with weights Z^{ij}, we define a matrix V^{ij} ∈ R^{c_i h_i w_i × c_j h_j w_j} that satisfies ∀ x ∈ R^{cj×hj×wj} : V^{ij} vec(x) = vec(C_ij(x)). This matrix exists since both the padding and convolution operations are linear in the input. We also consider W^{ij} ∈ R^{c_i × c_j k_1 k_2}, the matrix whose c'th row is the vectorized filter W^{ij}_c := vec(Z^{ij}_{c,:,:,:}).
Fully-connected layers. As a special case of convolutional layers, the network may also include fully-connected layers. Namely, a fully-connected layer F : R^{c1} → R^{c2} associated with a matrix W ∈ R^{c2×c1} can be represented as a 1 × 1 convolutional layer C : R^{c1×1×1} → R^{c2×1×1} with k_1 = k_2 = 1, p = 0 and s = 1. Namely, the parameter tensor Z ∈ R^{c2×c1×1×1} satisfies Z_{a,b,1,1} = W_{a,b} for all (a, b) ∈ [c_2] × [c_1], and the layer satisfies vec(C(x)) = W vec(x).
Pooling layers. A pooling layer (Zhou & Chellappa, 1988) (see also (Goodfellow et al., 2016)) C with kernel dimensions (k_1, k_2), stride s and padding p takes an input x ∈ R^{c1×h1×w1} and computes an output y ∈ R^{c2×h2×w2} with c_2 = c_1 channels and dimensions h_2 = ⌊(h_1 − k_1 + 2p)/s⌋ + 1 and w_2 = ⌊(w_1 − k_2 + 2p)/s⌋ + 1.
Each pooling layer computes the output tensor

∀ (c, t, l) ∈ [c_2] × [h_2] × [w_2] : y_{c,t,l} = op(Pad_p(x)_{c, ts : ts+k_1, ls : ls+k_2}),  (2)

where op is either the maximum or the average operator.
Rearrangement layers. To conveniently switch between convolutional and fully-connected layers, we should be able to represent tensor layers as vectors and vice versa. To reshape the representation of a certain layer, we allow the networks to include rearrangement layers. A rearrangement layer C_ij : R^{cj×hj×wj} → R^{ci×hi×wi} takes an input x ∈ R^{cj×hj×wj} and 'rearranges' its coordinates into a different shape and permutation. Formally, it returns a tensor (x_{π(k)})_{k∈[c_i]×[h_i]×[w_i]}, where π : [c_i] × [h_i] × [w_i] → [c_j] × [h_j] × [w_j] is invertible (in particular, c_i h_i w_i = c_j h_j w_j).
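The identity V^ij vec(x) = vec(C_ij(x)) introduced above relies on convolution (with padding) being linear in its input. The following sketch, with illustrative shapes, stride 1 and no padding, constructs such a matrix V column by column from unit-vector inputs and checks the identity numerically.

```python
# A convolutional layer written explicitly as the linear map V vec(x) = vec(C(x));
# a minimal numpy sketch (stride 1, no padding), with illustrative shapes.
import numpy as np

def conv2d(x, Z, s=1):
    """x: (c_j, h, w), Z: (c_i, c_j, k1, k2) -> output (c_i, h_out, w_out)."""
    c_i, c_j, k1, k2 = Z.shape
    h_out = (x.shape[1] - k1) // s + 1
    w_out = (x.shape[2] - k2) // s + 1
    y = np.zeros((c_i, h_out, w_out))
    for c in range(c_i):
        for t in range(h_out):
            for l in range(w_out):
                patch = x[:, t*s:t*s+k1, l*s:l*s+k2]
                y[c, t, l] = np.sum(Z[c] * patch)   # vec(Z_c)^T vec(patch), as in eq. (1)
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 5, 5))          # c_j=3 channels of 5 x 5
Z = rng.normal(size=(2, 3, 3, 3))       # c_i=2 filters with 3 x 3 kernels

# Build V column by column: column k is the response to the unit input e_k.
n = x.size
V = np.stack([conv2d(e.reshape(x.shape), Z).reshape(-1) for e in np.eye(n)],
             axis=1)
assert np.allclose(V @ x.reshape(-1), conv2d(x, Z).reshape(-1))
```

With these shapes the layer has N_ij = 3 × 3 = 9 input patches, which is the quantity that bounds gradient ranks in Sec. 3.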

3. THEORETICAL RESULTS

In this section we describe our main theoretical results. We investigate the inductive biases that emerge near convergence when training with SGD. For this purpose, we begin by introducing our definitions of SGD convergence (and near-convergence) points. To formally study convergence, we employ the notion of convergence in mean of random variables. Namely, we say that a sequence of random variables {W_t}_{t=1}^∞ ⊂ R^N starting from a certain (constant) vector W_0 ∈ R^N converges in mean if ∃ W* ∈ R^N : lim_{t→∞} E[∥W_t − W*∥] = 0. As a consequence of this definition, we have lim_{t→∞} E[∥W_{t+1} − W_t∥] ≤ lim_{t→∞} E[∥W_{t+1} − W*∥] + lim_{t→∞} E[∥W_t − W*∥] = 0, where the expectations are taken over the selections of the mini-batches. In particular, convergence is possible only when the expected size of each step tends to zero. Hence, when training the network using mini-batch SGD along with weight decay, we have

lim_{t→∞} E_{S̃}[∥∇L^λ_{S̃}(f_{W_t})∥] = lim_{t→∞} (1/µ) E[∥W_{t+1} − W_t∥] = 0.  (3)

In this work we study the implicit biases of SGD by investigating the "SGD Near-Convergence Regime", where SGD arrives at a point in training where each subsequent step is small compared to the actual weights, i.e., E_{S̃}[∥∇_{W^{ij}} L^λ_{S̃}(f_{W_T})∥/∥W^{ij}_T∥] is small. In Sec. 3.1 we show that near convergence, mini-batch SGD learns neural networks with low-rank matrices, and in Sec. 3.2 we prove that perfect SGD convergence is impossible in the presence of weight decay. We acknowledge that our definition differs from traditional notions of convergence. Several papers study the convergence of SGD to a point where the performance is near-optimal when the learning rate µ_t decays with the number of iterations. In these cases, convergence is guaranteed since µ_t tends to zero. These papers, however, do not analyze whether the expected gradient E_{S̃}[∥∇_{W^{ij}_t} L^λ_{S̃}(f_{W_t})∥] also decays at each step.
Other papers (e.g., (Soudry & Carmon, 2016; Cooper, 2021)) study the critical points of the objective function to better understand the solutions of gradient-based optimization methods. While understanding the critical points of the objective function is necessary for characterizing the convergence points of GD and GF, SGD does not necessarily converge at these points. For instance, suppose W_t is a stationary point of L^λ_S(f_W) and there exists a batch S̃ for which ∇_W L^λ_{S̃}(f_{W_t}) ≠ 0. With probability (m choose B)^{−1} > 0, SGD selects the batch S̃ and updates W_{t+1} = W_t − µ∇L^λ_{S̃}(f_{W_t}) ≠ W_t. Therefore, p = P[W_{t+1} ≠ W_t | W_t] > 0 and P[∃ l ∈ [T'] : W_{t+l} ≠ W_t | W_t] ≥ 1 − (1 − p)^{T'} → 1 as T' → ∞. As a result, the probability of the optimization becoming stuck at W_t indefinitely is zero.

3.1. LOW-RANK BIAS IN NEURAL NETWORKS

We begin our theoretical analysis with the simple observation (proved in Appendix A) that the number of input patches N_ij of a certain convolutional layer C_ij upper bounds the rank of the gradient of the network w.r.t. W^{ij}.
Lemma 1. Let f_W be a neural network and let C_ij be a convolutional layer within f_W with parameters matrix W^{ij}. Then, rank(∇_{W^{ij}} f_W(x)) ≤ N_ij.
Interestingly, we obtain particularly degenerate gradients for fully-connected layers. As discussed in Sec. 2.1, for a fully-connected layer C_ij : R^{cj×1×1} → R^{ci×1×1} we have N_ij = 1, and therefore rank(∇_{W^{ij}} f_W(x)) ≤ 1. The following theorem provides an upper bound on the minimal distance between the network's weight matrices and low-rank matrices.
Theorem 1. Let ∥·∥ be any matrix norm and ℓ any differentiable loss function. Let f_W(x) be a ReLU neural network, let C_ij be a convolutional layer within f_W, and let B ∈ [m]. Then,

min_{W : rank(W) ≤ B N_ij} ∥ W^{ij}/∥W^{ij}∥ − W ∥ ≤ (1/2λ) min_{S̃⊂S : |S̃|=B} ∥∇_{W^{ij}} L^λ_{S̃}(f_W)∥ / ∥W^{ij}∥.

Proof. Let S̃ ⊂ S be a batch of size B. By the chain rule, we can write the gradient of the loss function as ∇_{W^{ij}} L^λ_{S̃}(f_W) = (1/B) Σ_{(x,y)∈S̃} (∂ℓ(f_W(x), y)/∂f_W(x)) · ∇_{W^{ij}} f_W(x) + 2λW^{ij} =: −E_{S̃} + 2λW^{ij}. According to Lem. 1, we have rank((1/2λ) E_{S̃}) ≤ B N_ij. Therefore, we obtain that min_{W : rank(W) ≤ B N_ij} ∥W^{ij} − W∥ ≤ min_{S̃⊂S : |S̃|=B} ∥W^{ij} − (1/2λ) E_{S̃}∥ = (1/2λ) min_{S̃⊂S : |S̃|=B} ∥∇_{W^{ij}} L^λ_{S̃}(f_W)∥. Finally, dividing both sides by ∥W^{ij}∥ yields the desired inequality.
The theorem above provides an upper bound on the minimal distance between the normalized parameters matrix W^{ij}/∥W^{ij}∥ and a matrix of rank ≤ B N_ij. The upper bound is proportional to min_{S̃} ∥∇_{W^{ij}} L^λ_{S̃}(f_W)∥/∥W^{ij}∥, which is the minimal norm of the gradient of the regularized empirical risk evaluated on batches of size B, normalized by the norm of the weight matrix.
As shown in equation 3, near convergence we expect E_{S̃}[∥∇_{W^{ij}} L^λ_{S̃}(f_W)∥/∥W^{ij}∥] to be small, and therefore min_{S̃} ∥∇_{W^{ij}} L^λ_{S̃}(f_W)∥/∥W^{ij}∥ should also be small (here S̃ is distributed uniformly among batches of size B). In particular, by the theorem above, we expect min_{W : rank(W) ≤ B N_ij} ∥W^{ij}/∥W^{ij}∥ − W∥ to also be small. As a result, we predict that the rank of the learned parameter matrices W^{ij} decreases with the batch size. In Sec. 4 we validate this prediction with an extensive set of experiments, and also study the relationship between the rank and other hyperparameters.
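The mechanism behind Thm. 1 can be checked numerically on a toy model. Below is a sketch for a fully-connected first layer (N_ij = 1) of a two-layer ReLU network with scalar output: each per-sample gradient w.r.t. the first weight matrix is a rank-1 outer product, so the data-dependent part of a batch gradient has rank at most B, and the distance of the normalized weight matrix to the nearest rank-B matrix (computed via the SVD, by Eckart-Young) is bounded by the scaled gradient norm. All names and values are illustrative.

```python
# Numerical check of the mechanism behind Thm. 1 for a fully-connected layer:
# per-sample gradients are rank-1 outer products, so the batch gradient minus
# the weight-decay term has rank <= B, and the distance of W1/||W1|| to the
# rank-B matrices is bounded by ||grad|| / (2 * lam * ||W1||). Toy setup.
import numpy as np

rng = np.random.default_rng(0)
m, d, h, B, lam = 20, 6, 8, 4, 1e-2
X, y = rng.normal(size=(m, d)), rng.normal(size=m)
W1, w2 = rng.normal(size=(h, d)), rng.normal(size=h)   # f_W(x) = w2 . relu(W1 x)

def grad_W1(batch):
    g = np.zeros_like(W1)
    for i in batch:
        z = W1 @ X[i]
        a = np.maximum(z, 0.0)
        resid = 2.0 * (w2 @ a - y[i])                  # d ell / d f_W (squared loss)
        g += np.outer(resid * w2 * (z > 0), X[i]) / len(batch)   # rank-1 term
    return g + 2.0 * lam * W1                          # plus the weight-decay term

G = grad_W1(np.arange(B))
# Frobenius distance of the normalized W1 to the nearest rank-B matrix:
s = np.linalg.svd(W1 / np.linalg.norm(W1), compute_uv=False)
dist = np.sqrt(np.sum(s[B:] ** 2))                     # Eckart-Young tail distance
bound = np.linalg.norm(G) / (2.0 * lam * np.linalg.norm(W1))
assert dist <= bound
```

Near convergence ∥G∥ is small, so the bound forces the normalized weights close to a rank-B matrix; here, far from convergence, the bound is loose but still valid.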

3.2. DEGENERACY AND THE ORIGIN OF "SGD NOISE"

As we mentioned, it is impossible for SGD to converge to a stationary point of the gradient dynamical system, resulting in inherent "SGD noise". In this section, we study the (non-)convergence of mini-batch SGD. Our results are essentially impossibility results: the assumption that SGD converges to a critical point of the gradient implies that the network represents the zero function. As a result, asymptotic noise is inherently unavoidable (when training properly). For simplicity, we assume that ∀ i ∈ [m] : x_i ≠ 0. As shown in equation 3, any convergence point W of SGD satisfies E_{S̃}[∥∇L^λ_{S̃}(f_W)∥] = 0. Since the distribution over mini-batches S̃ of size B is discrete and ∥∇L^λ_{S̃}(f_W)∥ ≥ 0, we obtain that at convergence, ∥∇_{W^{ij}} L^λ_{S̃}(f_W)∥ = 0 for all mini-batches S̃ of size B. In particular,

∀ S̃ : 0 = ∇_{W^{ij}} L^λ_{S̃}(f_W) = (1/B) Σ_{(x,y)∈S̃} (∂ℓ(f_W(x), y)/∂f_W(x)) · ∇_{W^{ij}} f_W(x) + 2λW^{ij}.  (4)

Suppose we have two batches S̃_1, S̃_2 ⊂ S of size B that differ by only one sample. We denote the unique sample of each batch by (x_{j_1}, y_{j_1}) and (x_{j_2}, y_{j_2}) respectively. We notice that

0 = ∇_{W^{ij}} L^λ_{S̃_1}(f_W) − ∇_{W^{ij}} L^λ_{S̃_2}(f_W) = (1/B)[(∂ℓ(f_W(x_{j_1}), y_{j_1})/∂f_W(x_{j_1})) · ∇_{W^{ij}} f_W(x_{j_1}) − (∂ℓ(f_W(x_{j_2}), y_{j_2})/∂f_W(x_{j_2})) · ∇_{W^{ij}} f_W(x_{j_2})].  (5)

Therefore, we conclude that for all j_1, j_2 ∈ [m], M^{ij} := (∂ℓ(f_W(x_{j_1}), y_{j_1})/∂f_W(x_{j_1})) · ∇_{W^{ij}} f_W(x_{j_1}) = (∂ℓ(f_W(x_{j_2}), y_{j_2})/∂f_W(x_{j_2})) · ∇_{W^{ij}} f_W(x_{j_2}). Hence, for all (v_i, v_j) ∈ E_T and k ∈ [m],

(∂ℓ(f_W(x_k), y_k)/∂f_W(x_k)) · ∇_{W^{ij}} f_W(x_k) + 2λW^{ij} = M^{ij} + 2λW^{ij} = 0.  (6)

Therefore, unless λ = 0 or ∀ (v_i, v_j) ∈ E_T : W^{ij} = 0, we conclude that ∂ℓ(f_W(x_k), y_k)/∂f_W(x_k) ≠ 0 for all k ∈ [m]. In this case, we also obtain by equation 5 that the vectors {∂f_W(x_k)/∂vec(W^{ij})}_{k=1}^m are collinear. Therefore, any convergence point of training a neural network using mini-batch SGD along with weight decay is highly degenerate and does not fit any of the training labels.
To better understand the essence of this degeneracy, we provide the following proposition (proved in Appendix A), which is specialized to ReLU networks.
Proposition 1 (λ > 0). Let ℓ(a, b) be a differentiable loss function, λ > 0, and let f_W(x) be a ReLU neural network, where succ(1) = {p} and (v_p, v_1) ∈ E_T. Let {x^k_i}_{k=1}^{N_{p1}} be the N_{p1} patches of x_i used by the layer C_{p1}. Let W be a convergence point of mini-batch SGD for minimizing L^λ_S(f_W) (see equation 4). Then, either f_W ≡ 0 or ∀ i, j ∈ [m] : {x^k_i, x^k_j}_{k=1}^{N_{p1}} are linearly dependent tensors.
The preceding proposition shows that unless the patches of any two training samples are linearly dependent, any convergence point of SGD corresponds to the zero function. When N_{p1} is small, the linear dependence criterion is unrealistic. For example, if C_{p1} is a fully-connected layer, N_{p1} = 1, and the condition asserts that any two training samples x_i, x_j are collinear. Since this is unrealistic, we conclude that convergence is impossible unless f_W ≡ 0. As a next step, we consider the convergence of SGD when training without weight decay.
Proposition 2 (λ = 0). Let λ = 0 and let ℓ be a differentiable loss function. Let f_W(x) be a ReLU neural network, where succ(1) = {p} and (v_p, v_1) ∈ E_T is fully-connected. Let {x^k_i}_{k=1}^{N_{p1}} be the N_{p1} patches of x_i used by the layer C_{p1}. Let W be a convergence point of mini-batch SGD for minimizing L^λ_S(f_W). Then, for all i ∈ [m], either ∂ℓ(f_W(x_i), y_i)/∂f_W(x_i) = 0, or f_W(x_i) = 0, or ∀ j ∈ [m] : {x^k_i, x^k_j}_{k=1}^{N_{p1}} are linearly dependent tensors.
The preceding proposition (proved in Appendix A) provides conditions for SGD convergence when training a convolutional network without weight decay. It shows that at convergence, for every sample the network either perfectly fits the label (i.e., ∂ℓ(f_W(x_i), y_i)/∂f_W(x_i) = 0) or outputs zero, unless the patches of that sample are linearly dependent with the patches of any other sample.
As mentioned, the linear dependence criterion is generally unrealistic when (v_p, v_1) is fully-connected. Therefore, convergence with fully-connected networks is possible only when f_W perfectly fits the training set labels. We note that if ℓ(a, b) is convex and has no minimum in its first argument a for any b ∈ R (e.g., the binary cross-entropy, logistic or exponential loss), then ∀ i ∈ [m] : ∂ℓ(f_W(x_i), y_i)/∂f_W(x_i) ≠ 0. Therefore, the only possible convergence points of fully-connected networks are ones for which ∀ i ∈ [m] : f_W(x_i) = 0. Since this is in general absurd, we argue that perfect convergence when training a network with exponential-type loss functions is generally impossible. While convergence to a non-zero function is not guaranteed, in practice training without weight decay may still fall into a regime of 'almost convergence', in which max_{i∈[m]} |∂ℓ(f_W(x_i), y_i)/∂f_W(x_i)| is tiny and, as a result, the training steps −µ · (∂ℓ(f_W(x_i), y_i)/∂f_W(x_i)) · (∂f_W(x_i)/∂W^{ij}) are very small. This is usually the case when an overparameterized network has been properly trained. For the squared loss, convergence may occur when the network perfectly fits the training labels, i.e., ∀ i ∈ [m] : f_W(x_i) = y_i. It is worth noting that mini-batch training and weight decay are both critical to our analysis. While many papers examine the training dynamics and critical points of GD, as we show, SGD convergence points are highly degenerate and, in general, behave differently than GD solutions. Surprisingly, this analysis is unaffected by the batch size, and therefore SGD noise is present regardless of the batch size, as long as it is strictly smaller than the full dataset's size.
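The source of SGD noise described above can be reproduced in one dimension. In the sketch below (an illustrative toy setup, not the paper's experiments), a linear model with squared loss is trained on two samples whose per-sample gradients differ by a constant; with weight decay and batch size 1 < m = 2, the two gradients can never vanish simultaneously, so the updates remain bounded away from zero forever.

```python
# A one-dimensional illustration of the non-convergence argument: with weight
# decay and batch size B=1 < m=2, the two per-sample gradients differ by the
# constant 4, so at least one of them always has magnitude >= 2 and SGD keeps
# moving ("SGD noise"). Toy setup with illustrative values.
import numpy as np

rng = np.random.default_rng(0)
lam, mu = 0.1, 0.05
data = [(1.0, 1.0), (1.0, -1.0)]        # two (x, y) samples; squared loss

def grad(w, x, y):
    # gradient of (w*x - y)^2 + lam*w^2:  here 2.2*w - 2 and 2.2*w + 2
    return 2.0 * x * (w * x - y) + 2.0 * lam * w

w, steps = 0.3, []
for t in range(5000):
    x, y = data[rng.integers(2)]        # mini-batch of size 1
    g = grad(w, x, y)
    w -= mu * g
    steps.append(abs(mu * g))

# At every w, max(|g_0|, |g_1|) >= 2, so large steps keep occurring forever.
late = np.array(steps[-1000:])
assert late.max() >= mu * 1.9           # updates do not tend to zero
```

The iterates hover around the stationary point of the averaged objective without ever settling there, which is exactly the asymptotic SGD noise predicted by the analysis.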

4. EXPERIMENTS

In this section we empirically study the implicit bias towards rank minimization in deep ReLU networks. Throughout the experiments we extensively vary different hyperparameters (e.g., the learning rate, weight decay, and the batch size) and study their effect on the rank of the various matrices in the network.

4.1. SETUP

Evaluation process. We consider k-class classification problems and train a multilayered neural network f_W : R^n → R^k on some balanced training dataset S. The model is trained by CE/MSE loss minimization between its logits and the one-hot encodings of the labels. After each epoch, we compute the averaged rank across the network's weight matrices and its train and test accuracy rates. For a convolutional layer C_ij, we use W^{ij} as its weight matrix. To estimate the rank of a given matrix M, we count how many of the singular values of M/∥M∥_2 lie outside [−ϵ, ϵ], where ϵ is a small tolerance value. In these experiments we consider the MNIST and CIFAR10 datasets. Architectures. We consider several network architectures. (i) The first architecture is an MLP, denoted by MLP-BN-L-H, which consists of L hidden layers, where each layer contains a fully-connected layer of width H, followed by batch normalization and ReLU activations. On top of that, we compose a fully-connected output layer. (ii) The second architecture, denoted by RES-BN-L-H, consists of a linear layer of width H, followed by L residual blocks, ending with a fully-connected layer. Each block computes a function of the form z + σ(n_2(W_2 σ(n_1(W_1 z)))), where W_1, W_2 ∈ R^{H×H}, n_1, n_2 are batch normalization layers and σ is the ReLU function. We denote by MLP-L-H and RES-L-H the same architectures without batch normalization. (iii) The third, denoted VGG-16, is the convolutional network proposed by Simonyan & Zisserman (2014), but with dropout replaced by batch normalization layers and with only one fully-connected layer at the end. (iv) The fourth architecture is the residual network proposed in (He et al., 2016), denoted ResNet-18. (v) The fifth architecture is a small vision transformer (Dosovitskiy et al., 2020), denoted by ViT. Our ViT splits the input images into patches of size 4 × 4 and consists of 6 self-attention layers, each with 8 self-attention heads.
The self-attention layers are followed by two fully-connected layers with dropout probability 0.1 and a GELU activation in between them.
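The rank-estimation procedure described in the setup (counting singular values of M/∥M∥_2 above a tolerance ϵ) can be written directly; the sketch below uses an illustrative tolerance of 1e-3.

```python
# The rank estimate used in the evaluation: normalize M by its spectral norm
# and count singular values above a small tolerance eps (eps is illustrative).
import numpy as np

def estimate_rank(M, eps=1e-3):
    s = np.linalg.svd(M / np.linalg.norm(M, 2), compute_uv=False)
    return int(np.sum(s > eps))          # singular values are non-negative

# Sanity check: a matrix built from 3 outer products has estimated rank 3.
rng = np.random.default_rng(0)
M = sum(np.outer(rng.normal(size=50), rng.normal(size=40)) for _ in range(3))
assert estimate_rank(M) == 3
```

Normalizing by ∥M∥_2 makes the count scale-invariant, so weight matrices of different magnitudes are compared on the same footing.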

4.2. EXPERIMENTS ON RANK MINIMIZATION

In each experiment we trained various models while varying one hyperparameter (e.g., the batch size) and keeping the other hyperparameters constant. The models were trained with SGD for cross-entropy loss minimization along with weight decay. For MLP-BN-10-100, ResNet-18 and VGG-16, we decayed the learning rate three times by a factor of 0.1 at epochs 60, 100, and 200, and training was stopped after 500 epochs. We trained instances of ViT using SGD, where the learning rate is decayed by a factor of 0.2 at epochs 60 and 100, and training is stopped after 200 epochs. By default, we trained the models with weight decay λ = 5e-4. As can be seen in Figs. 1 and 3, by decreasing the batch size we essentially strengthen the low-rank constraint on the network's matrices, which eventually leads to matrices of lower rank. This is consistent with the prediction made in Sec. 3.1 that we learn matrices of lower rank when training the network with smaller batch sizes. Interestingly, we also notice a regularizing effect for the learning rate; the average rank tends to decrease when increasing the learning rate. As can be seen in Fig. 2, increasing λ typically imposes stronger rank minimization constraints. Interestingly, the batch size appears to have little effect on the ranks of the weight matrices when training with λ = 0, which is exactly the case in which our bound is infinite. This empirically validates that weight decay is necessary to obtain a significant low-rank bias.

4.3. EXPERIMENTS ON SGD NOISE

In Sec. 3.2 we showed that convergence to a non-zero function is impossible when training a fully-connected neural network with SGD and weight decay. To validate this prediction, we trained MLP-5-2000 instances and examined their convergence as a function of λ. Each model was trained for CIFAR10 classification using SGD with batch size 128 and learning rate 0.1 for 2000 epochs. To investigate the convergence of the networks, we measure the averaged distance between the network's matrices at consecutive epochs, d(W_{t+1}, W_t) := (1/|E_T|) Σ_{(i,j)∈E_T} ∥W^{ij}_{t+1} − W^{ij}_t∥; at a convergence point, d(W_{t+1}, W_t) tends to zero. In Fig. 4 we monitor (a) d(W_{t+1}, W_t), (b) the train accuracy rates, (c) the train losses and (d) the averaged rank of the trainable matrices. As predicted in Prop. 1, when training with λ > 0, either W_t converges to zero and f_{W_t} to the zero function (e.g., see the results with λ = 1e-4, 1e-3) or W_t does not converge (i.e., d(W_{t+1}, W_t) does not tend to zero). Furthermore, we observe that for λ = 0, d(W_{t+1}, W_t) is smaller by orders of magnitude compared to using λ > 0 (except for cases when the selection of λ > 0 leads to W^{ij}_t → 0). Interestingly, even for certain values of λ > 0 for which the training loss and accuracy converged, the network's parameters do not converge. Finally, when training for cross-entropy loss minimization without weight decay, we encounter the 'almost convergence' regime discussed in Sec. 3.2. Namely, even though perfect convergence is impossible, the term d(W_{t+1}, W_t) may become as small as we wish by increasing the size of the neural network. Therefore, since MLP-5-2000 is relatively large (compared to the dataset's size), we may inaccurately get the impression that d(W_{t+1}, W_t) tends to zero.
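The convergence diagnostic d(W_{t+1}, W_t) is a simple average of per-matrix distances between consecutive epochs; a minimal sketch with illustrative weight lists:

```python
# The diagnostic d(W_{t+1}, W_t): average Frobenius distance between
# corresponding trainable matrices at consecutive epochs (names illustrative).
import numpy as np

def d(weights_next, weights_prev):
    return np.mean([np.linalg.norm(a - b)
                    for a, b in zip(weights_next, weights_prev)])

W_t  = [np.ones((3, 3)), np.zeros((2, 2))]   # weights at epoch t
W_t1 = [np.ones((3, 3)), np.zeros((2, 2))]   # weights at epoch t+1
assert d(W_t1, W_t) == 0.0                   # identical weights: distance zero
W_t1[0] = W_t1[0] + 1.0                      # perturb one matrix by all-ones
assert d(W_t1, W_t) == 1.5                   # (||ones(3,3)|| + 0) / 2 = 3/2
```

Monitoring this quantity per epoch (rather than the loss) is what separates parameter convergence from mere loss convergence in Fig. 4.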

5. CONCLUSIONS

A mathematical characterization of the biases associated with SGD-trained neural networks is regarded as a significant open problem in the theory of deep learning (Neyshabur et al., 2017). In addition to its independent interest, a low-rank bias, though probably not necessary for generalization, may be a key ingredient in an eventual characterization of the generalization properties of deep networks. In fact, recent results (Huh et al., 2022) and our preliminary experiments (see Figs. 19-20 in the appendix) suggest that low-rank bias in neural networks improves generalization. By investigating the "SGD Near-Convergence Regime", we proved that SGD together with weight decay induces a low-rank bias in a variety of network architectures. Our result also shows that the batch size used by SGD influences the rank of the learned matrices. This means that the batch size plays an active role in regularizing the learned function. We also prove that when training a fully-connected neural network, SGD noise must always be present, even asymptotically, regardless of batch size, as long as weight decay is used. Weight decay may not be strictly necessary for SGD noise and low-rank bias to appear. This is the case, for example, when training with exponential-type loss functions and Weight Normalization (Salimans & Kingma, 2016). We hope that our work will spark further research into the near-convergence regime. For instance, it may provide an interesting algorithmic approach to the long-standing problem of developing low-rank regularizers during optimization. It would also be interesting to study whether additional structures (e.g., neural collapse (Papyan et al., 2020), sparsity) emerge during the near-convergence regime and to extend our analysis to more sophisticated learning algorithms (e.g., Adam (Kingma & Ba, 2015)).

A. PROOFS

Lemma 1. Let f_W be a neural network and let C_ij be a convolutional layer within f_W with parameters matrix W^{ij}. Then, rank(∇_{W^{ij}} f_W(x)) ≤ N_ij.
Proof.
Let $x \in \mathbb{R}^{c_1 \times h_1 \times w_1}$ be an input tensor and let $C_{ij}$ be a convolutional layer with kernel size $(k_1, k_2)$, stride $s$ and padding $p$. We would like to show that $\mathrm{rank}(\nabla_{W_{ij}} f_W(x)) \leq N_{ij}$. We begin by writing the output of $f_W$ as a sum over paths that pass through $C_{ij}$ and paths that do not. The output can be written as
$$f_W(x) = \sum_{l_1 \in \mathrm{pred}(l_0)} C_{l_0 l_1} \circ v_{l_1}(x),$$
where $l_0 = L$ and $C_{l_0 l_1} \circ z := C_{l_0 l_1}(z)$. In addition, each layer $v_{l_1}$ can be written as
$$v_{l_1}(x) = D_{l_1} \odot \sum_{l_2 \in \mathrm{pred}(l_1)} C_{l_1 l_2} \circ v_{l_2}(x),$$
where $D_l := D_l(x) := \sigma'(v_l(x)) \in \mathbb{R}^{c_l \times h_l \times w_l}$. A path $\pi$ within the network's graph $G$ is a sequence $\pi = (\pi_0, \dots, \pi_T)$, where $\pi_0 = 1$, $\pi_T = L$ and for all $i = 0, \dots, T-1$: $(v_{\pi_i}, v_{\pi_{i+1}}) \in E$. We can write $f_W(x)$ as the sum of matrix multiplications along paths $\pi$ from $v_1$ to $v_{l_0}$. Specifically,
$$f_W(x) = \sum_{\pi \text{ from } i \text{ to } l_0} C_{\pi_T \pi_{T-1}} \circ D_{\pi_{T-1}} \odot \cdots \odot D_{\pi_2} \odot C_{\pi_2 \pi_1} \circ D_{\pi_1} \odot C_{ij} \circ v_j(x) \; + \sum_{\substack{\pi \text{ from } 1 \text{ to } l_0 \\ (i,j) \notin \pi}} C_{\pi_T \pi_{T-1}} \circ D_{\pi_{T-1}} \odot C_{\pi_{T-1} \pi_{T-2}} \circ \cdots \circ D_{\pi_2} \odot C_{\pi_2 \pi_1} \circ x \; =: A_W(x) + B_W(x),$$
where $T = T(\pi)$ denotes the length of the path $\pi$. Since $\sigma$ is a piece-wise linear function with a finite number of pieces, for any $x \in \mathbb{R}^{c_1 \times h_1 \times w_1}$, with measure 1 over $W$, the matrices $\{D_l(x)\}_{l=1}^{L-1}$ are constant in a neighborhood of $W$. Furthermore, $W_{ij}$ does not appear in the multiplications along the paths $\pi$ from 1 to $l_0$ that exclude $(i, j)$. Therefore, we conclude that $\frac{\partial B_W(x)}{\partial W_{ij}} = 0$. As a next step, we analyze the rank of $\frac{\partial A_W(x)}{\partial W_{ij}}$. For this purpose, we rewrite the convolutional layers and the multiplications by the matrices $D_l(x)$ as matrix multiplications.

Representing $C_{ij}$. We begin by representing the layer $C_{ij}$ as a linear transformation of its input with $N_{ij}$ blocks of $W_{ij}$. For this purpose, we define a representation of a given 3-dimensional tensor input $z \in \mathbb{R}^{c_j \times h_j \times w_j}$ as a vector $\mathrm{vec}_{ij}(z) \in \mathbb{R}^{N_{ij} c_j k_1 k_2}$.
First, we pad $z$ with $p$ rows and columns of zeros and obtain $\mathrm{Pad}_p(z)$. We then vectorize each one of the (potentially overlapping) patches of dimensions $c_j \times k_1 \times k_2$ that the convolutional layer acts upon, and concatenate them. We can write the vectorized output of the convolutional layer as $U_{ij}\, \mathrm{vec}_{ij}(z)$, where
$$U_{ij} := \begin{pmatrix} W_{ij} & & \\ & \ddots & \\ & & W_{ij} \end{pmatrix} \tag{7}$$
is a $(N_{ij} c_i) \times (N_{ij} c_j k_1 k_2)$ block-diagonal matrix with $N_{ij}$ copies of $W_{ij}$. We note that this is a non-standard representation of the convolutional layer's operation as a linear transformation. Typically, the convolutional layer is written as a linear transformation $V_{ij}$ acting on the vectorized version $\mathrm{vec}(z)$ of its input $z$. Since $\mathrm{vec}_{ij}(z)$ consists of the same variables as $\mathrm{vec}(z)$, with potentially duplicated items, there is a linear transformation that translates $\mathrm{vec}(z)$ into $\mathrm{vec}_{ij}(z)$. Therefore, we can simply write $V_{ij}\, \mathrm{vec}(z) = U_{ij}\, \mathrm{vec}_{ij}(z)$.

Representing convolutional layers. Except for $C_{ij}$, we represent each one of the network's convolutional layers $C_{ld}$ in $f_W$ as a linear transformation. As mentioned earlier, we can write $\mathrm{vec}(C_{ld}(z)) = V_{ld}\, \mathrm{vec}(z)$ for any input $z \in \mathbb{R}^{c_d \times h_d \times w_d}$.

Representing pooling and rearrangement layers. An average pooling layer or a rearrangement layer $C_{ld}$ can easily be represented as a (non-trainable) linear transformation of its input; namely, $\mathrm{vec}(C_{ld}(z)) = V_{ld}\, \mathrm{vec}(z)$ for some constant matrix $V_{ld}$. A max-pooling layer can be written as a composition of ReLU activations and multiple (non-trainable) linear transformations, since $\max(x, y) = \sigma(x - y) + y$. Therefore, without loss of generality, we can replace the pooling layers with non-trainable linear transformations and ReLU activations.

Computing the rank. Finally, we note that $\mathrm{vec}(C_{ij} \circ z) = U_{ij}\, \mathrm{vec}_{ij}(z) = V_{ij}\, \mathrm{vec}(z)$ and $\mathrm{vec}(D_l \odot z) = P_l\, \mathrm{vec}(z)$ for $P_l := \mathrm{diag}(\mathrm{vec}(D_l))$.
Therefore, we can write
$$A_W(x) = \sum_{\pi \text{ from } i \text{ to } l_0} V_{\pi_T \pi_{T-1}} \cdot P_{\pi_{T-1}} \cdots P_{\pi_2} \cdot V_{\pi_2 \pi_1} \cdot P_{\pi_1} \cdot U_{ij} \cdot \mathrm{vec}_{ij}(v_j(x)) =: a(x)^\top \cdot U_{ij} \cdot b(x),$$
where $a(x)^\top := \sum_{\pi \text{ from } i \text{ to } l_0} V_{\pi_T \pi_{T-1}} \cdot P_{\pi_{T-1}} \cdots P_{\pi_2} \cdot V_{\pi_2 \pi_1} \cdot P_{\pi_1}$ and $b(x) := \mathrm{vec}_{ij}(v_j(x))$. We note that, with measure 1, the matrices $\{P_l\}_{l=1}^{L-1}$ are constant in a neighborhood of $W$. In addition, $a(x)$ and $b(x)$ are computed as multiplications of matrices $V_{ld}$ and $P_l$ with $(l, d) \neq (i, j)$. Therefore, with measure 1 over the selection of $W$, the Jacobians of $a(x)$ and $b(x)$ with respect to $W_{ij}$ are 0. Furthermore, due to equation 7 and the definition of $U_{ij}$, we can write
$$a(x)^\top \cdot U_{ij} \cdot b(x) = \sum_{t=1}^{N_{ij}} a_t(x)^\top \cdot W_{ij} \cdot b_t(x),$$
where $a_t(x)$ and $b_t(x)$ are the slices of $a(x)$ and $b(x)$ that are multiplied by the $t$'th $W_{ij}$ block in $U_{ij}$. Since the Jacobians of $a_t(x)$ and $b_t(x)$ with respect to $W_{ij}$ are 0 with measure 1 over the selection of $W$, we have
$$\frac{\partial\, a(x)^\top \cdot U_{ij} \cdot b(x)}{\partial W_{ij}} = \sum_{t=1}^{N_{ij}} a_t(x) \cdot b_t(x)^\top.$$
Therefore, we conclude that, with measure 1 over the selection of $W$, we have
$$\frac{\partial f_W(x)}{\partial W_{ij}} = \sum_{t=1}^{N_{ij}} a_t(x) \cdot b_t(x)^\top,$$
which is a matrix of rank at most $N_{ij}$.

Proposition 1 ($\lambda > 0$). Let $\ell(a, b)$ be a differentiable loss function, $\lambda > 0$, and let $f_W(x)$ be a ReLU neural network, where $\mathrm{succ}(1) = \{p\}$ and $(v_p, v_1) \in E_T$. Let $\{x_i^k\}_{k=1}^{N_{p1}}$ be the $N_{p1}$ patches of $x_i$ used by the layer $C_{p1}$. Let $W$ be a convergence point of mini-batch SGD for minimizing $L_S^\lambda(f_W)$ (see equation 4). Then, either $f_W \equiv 0$ or for all $i, j \in [m]$: $\{x_i^k, x_j^k\}_{k=1}^{N_{p1}}$ are linearly dependent tensors.

Proof. Since $\{(v_p, v_1) \in E \mid v_p \in V\} \subset E_T$ is of size 1, we denote this single layer by $(v_p, v_1)$. Following the proof of Lem. 1, we define a representation of a given 3-dimensional tensor input $x \in \mathbb{R}^{c_j \times h_j \times w_j}$ as a vector $\mathrm{vec}_{ij}(x) = (x^1 \| \dots \| x^{N_{ij}}) \in \mathbb{R}^{N_{ij} c_j k_1 k_2}$, where $x^k$ is the vectorization of the $k$'th $c_j \times k_1 \times k_2$ patch of $x$. Similar to the proof of Lem.
1, we can write
$$f_W(x) = \sum_{\pi \text{ from } 1 \text{ to } L} C_{\pi_T \pi_{T-1}} \circ D_{\pi_{T-1}} \odot C_{\pi_{T-1} \pi_{T-2}} \circ \cdots \circ D_{\pi_2} \odot C_{\pi_2 \pi_1} \circ x = H(x) \cdot U_{p1} \cdot \mathrm{vec}_{p1}(x),$$
where $H(x) := \sum_{\pi \text{ from } p \text{ to } L} V_{\pi_T \pi_{T-1}} \cdot P_{\pi_{T-1}}(x) \cdots V_{\pi_2 \pi_1} \cdot P_{\pi_1}(x)$. We denote $H_i := H(x_i)$ and $H_i = (H_i^1, \dots, H_i^{N_{p1}})$, where $H_i^k$ is the $k$'th block of $H_i$ (one block per patch). Hence, we can write
$$\frac{\partial f_W(x_i)}{\partial W_{p1}} = \sum_{k=1}^{N_{p1}} H_i^k \cdot (x_i^k)^\top.$$

Under review as a conference paper at ICLR 2023

We would like to show that $\{x_i^k, x_j^k\}_{k=1}^{N_{p1}}$ are linearly dependent vectors or $f_W \equiv 0$. Assume the opposite by contradiction, i.e., that $\{x_i^k, x_j^k\}_{k=1}^{N_{p1}}$ are not linearly dependent and that $f_W \not\equiv 0$. In particular, by equation 9, we have $W_{p1} \neq 0$. According to the analysis in Sec. 3.2, $\{\frac{\partial f_W(x_i)}{\partial\, \mathrm{vec}(W_{p1})}\}_{i=1}^m$ are collinear vectors. Therefore, for any pair $i, j \in [m]$, there is a scalar $\alpha_{ij} \in \mathbb{R}$ such that
$$\sum_{k=1}^{N_{p1}} H_i^k \cdot (x_i^k)^\top = \alpha_{ij} \sum_{k=1}^{N_{p1}} H_j^k \cdot (x_j^k)^\top.$$
Consider a given index $i \in [m]$. We would like to show that either $H_i = 0$ or $\{x_i^k, x_j^k\}_{k=1}^{N_{p1}}$ is a set of linearly dependent vectors. Assume that $H_i = 0$. Then $\frac{\partial f_W(x_i)}{\partial W_{p1}} = 0$, hence $M_{p1} = 0$ and $W_{p1} = 0$ according to equation 6, which implies that $f_W \equiv 0$ (since $(v_p, v_1)$ is the only connection starting from $v_1$). Assume instead that $H_i \neq 0$. Then, there exist $k \in [N_{p1}]$ and $r \in [\dim(H_i)/N_{p1}]$ for which the $r$'th coordinate of $H_i^k$ is non-zero. Therefore, the $r$'th row of
$$\sum_{k=1}^{N_{p1}} H_i^k \cdot (x_i^k)^\top - \alpha_{ij} \sum_{k=1}^{N_{p1}} H_j^k \cdot (x_j^k)^\top = 0$$
yields a non-trivial linear combination of the vectors $\{x_i^k, x_j^k\}_{k=1}^{N_{p1}}$ that equals zero, contradicting their linear independence.

Proposition 2 ($\lambda = 0$). Let $\lambda = 0$ and let $\ell$ be a differentiable loss function. Let $f_W(x)$ be a ReLU neural network, where $\mathrm{succ}(1) = \{p\}$ and $(v_p, v_1) \in E_T$ is fully-connected. Let $\{x_i^k\}_{k=1}^{N_{p1}}$ be the $N_{p1}$ patches of $x_i$ used by the layer $C_{p1}$. Let $W$ be a convergence point of mini-batch SGD for minimizing $L_S^\lambda(f_W)$.
Then, for all $i \in [m]$: $\frac{\partial \ell(f_W(x_i), y_i)}{\partial f_W(x_i)} = 0$, or $f_W(x_i) = 0$, or for all $j \in [m]$: $\{x_i^k, x_j^k\}_{k=1}^{N_{p1}}$ are linearly dependent tensors.

Proof. Let $i \in [m]$ be an index for which $\frac{\partial \ell(f_W(x_i), y_i)}{\partial f_W(x_i)} \neq 0$. Then, by equation 5, for all $j \in [m]$ we have
$$\frac{\partial f_W(x_i)}{\partial W_{p1}} = \frac{\partial \ell(f_W(x_j), y_j)}{\partial f_W(x_j)} \cdot \left( \frac{\partial \ell(f_W(x_i), y_i)}{\partial f_W(x_i)} \right)^{-1} \cdot \frac{\partial f_W(x_j)}{\partial W_{p1}}.$$
In particular, $\frac{\partial f_W(x_i)}{\partial\, \mathrm{vec}(W_{p1})}$ and $\frac{\partial f_W(x_j)}{\partial\, \mathrm{vec}(W_{p1})}$ are collinear vectors (for all $j \in [m]$). Hence, by the proof of Prop. 1, either $f_W(x_i) = 0$ or $\{x_i^k, x_j^k\}_{k=1}^{N_{p1}}$ is a set of linearly dependent vectors.
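The constructions above are easy to sanity-check numerically. The following NumPy sketch (dimensions, seeds and helper names such as `vec_ij` are our own illustrative choices, not the paper's experimental setup) builds the patch vectorization $\mathrm{vec}_{ij}$, applies a block-diagonal matrix of the form of equation 7 to recover a convolution output, and checks that a sum of outer products, the form of the Jacobian derived in Lem. 1, has rank bounded by the number of terms:

```python
import numpy as np

rng = np.random.default_rng(0)

def vec_ij(z, k1, k2, stride=1):
    """Concatenate the vectorized (c, k1, k2) patches of z (no padding, for simplicity)."""
    c, h, w = z.shape
    patches = [z[:, i:i + k1, j:j + k2].ravel()
               for i in range(0, h - k1 + 1, stride)
               for j in range(0, w - k2 + 1, stride)]
    return np.concatenate(patches)

# A convolutional layer with parameter matrix W_ij of shape (c_i, c_j * k1 * k2).
c_j, c_i, k1, k2 = 3, 4, 3, 3
z = rng.normal(size=(c_j, 8, 8))
W_ij = rng.normal(size=(c_i, c_j * k1 * k2))
N_ij = 6 * 6  # number of patches for an 8x8 input with a 3x3 kernel, stride 1

# U_ij: block-diagonal matrix with N_ij copies of W_ij, as in equation 7.
U_ij = np.kron(np.eye(N_ij), W_ij)
out = U_ij @ vec_ij(z, k1, k2)  # the vectorized convolution output

# The first c_i entries equal W_ij applied to the first patch.
assert np.allclose(out[:c_i], W_ij @ z[:, :k1, :k2].ravel())

# Lemma 1: the Jacobian is a sum of N outer products a_t b_t^T, hence of rank <= N.
# With N = 5 terms and a 40 x 60 parameter matrix, the bound is non-trivial.
J = sum(np.outer(rng.normal(size=40), rng.normal(size=60)) for _ in range(5))
print(np.linalg.matrix_rank(J))  # 5
```

Note that the rank bound is only informative when $N_{ij}$ is smaller than the dimensions of the parameter matrix, which is why the last check uses few outer-product terms.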

B ADDITIONAL EXPERIMENTS

Experiments on rank minimization. To further demonstrate the bias of SGD with weight decay towards rank minimization, we conducted a series of experiments with different learning settings. We follow the same training and evaluation protocol described in Sec. 4.2. The results are summarized in Figs. 6-18.

Experiments on SGD noise. We repeated the experiment of Fig. 4, training the models on MNIST. As can be seen in Fig. 5, similar to the previous experiment, when the models were trained with λ > 0, the weights W_t did not converge (i.e., d(W_{t+1}, W_t) does not tend to zero), even though for certain values of λ > 0 the training accuracy and loss converged. On the other hand, when λ = 0, the distance d(W_{t+1}, W_t) tends to zero.

Low-rank bias and generalization. We looked into the connection between low-rank bias and generalization. In Figs. 19-20 we trained ResNet-18 and VGG-16 instances on CIFAR10 while varying the batch size and keeping λ and µ constant. To provide a fair comparison, we chose λ and µ in each setting to ensure that all models fit the training data perfectly. As can be seen, models trained with smaller batch sizes, i.e., models with lower rank in their weights, tend to generalize better. Based on these findings, we hypothesize that when two neural networks of the same architecture are trained with SGD with different hyperparameters and both perfectly fit the data, the one with the lower average rank will outperform the other at test time.
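For completeness, the ϵ-threshold rank estimate used throughout the figures can be computed in a few lines. The sketch below assumes a relative threshold (a singular value counts if it exceeds ϵ times the largest one); the paper's exact normalization may differ:

```python
import numpy as np

def eps_rank(W, eps=1e-3):
    """Estimated rank: number of singular values above eps times the largest one.

    The relative normalization is our assumption; an absolute threshold would
    simply compare each singular value against eps directly.
    """
    s = np.linalg.svd(W, compute_uv=False)
    if s[0] == 0.0:
        return 0
    return int(np.sum(s > eps * s[0]))

# A nearly rank-1 matrix: one dominant direction plus tiny noise, mimicking a
# weight matrix driven towards low rank by SGD with weight decay.
rng = np.random.default_rng(0)
W = np.outer(rng.normal(size=50), rng.normal(size=50))
W += 1e-6 * rng.normal(size=(50, 50))

full = np.linalg.matrix_rank(W)  # the noise makes the exact rank much larger
print(eps_rank(W, eps=0.001))    # 1
```

The thresholded estimate is robust to the small singular values introduced by noise, which is why it is preferable to an exact rank computation when inspecting trained weights.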


$\frac{1}{|E_T|} \sum_{(p,q) \in E_T} \|W^{pq}_{t+1} - W^{pq}_t\|$. In (b) we plot the train accuracy rates, in (c) the averaged train loss, and in (d) the average rank across the trainable matrices.

Figure 13: Average ranks and accuracy rates of ResNet-18 trained on CIFAR10 with various batch sizes. The models were trained with a λ = 5e-4 weight decay. To estimate the rank, we used an ϵ = 0.001 threshold.
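The averaged per-epoch weight distance used in these convergence plots can be computed from two consecutive weight snapshots; a minimal sketch with illustrative shapes and names of our own choosing:

```python
import numpy as np

def avg_weight_distance(weights_t, weights_t1):
    """(1 / |E_T|) * sum over trainable matrices of the Frobenius distance
    between the matrices at epoch t and epoch t + 1."""
    assert len(weights_t) == len(weights_t1)
    return sum(np.linalg.norm(b - a)
               for a, b in zip(weights_t, weights_t1)) / len(weights_t)

# Two consecutive snapshots of three illustrative 10 x 10 weight matrices,
# separated by a small perturbation standing in for one epoch of SGD updates.
rng = np.random.default_rng(0)
Ws_t = [rng.normal(size=(10, 10)) for _ in range(3)]
Ws_t1 = [W + 0.01 * rng.normal(size=W.shape) for W in Ws_t]

print(round(avg_weight_distance(Ws_t, Ws_t1), 3))
```

Tracking this quantity across epochs is what distinguishes the λ > 0 runs (the distance plateaus above zero) from the λ = 0 runs (the distance tends to zero).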



The plots are best viewed when zooming into the pictures.



Figure 1: Average ranks and accuracy rates of MLP-BN-10-100 trained on CIFAR10 with various batch sizes. The top row shows the average rank across layers, while the bottom row shows the train and test accuracy rates for each setting. Weight decay λ = 5e-4 was used to train each model. To calculate the rank, we used an ϵ = 0.001 threshold.

Figure 2: Average ranks and accuracy rates of ResNet-18 trained on CIFAR10 with varying weight decay. In this experiment: µ = 1.5 and ϵ = 0.001.

Figure 3: Average ranks and accuracy rates of ViT trained on CIFAR10 with various batch sizes. In this experiment: λ = 5e-4 and ϵ = 0.01.

Figure 5: Convergence of MLP-5-2000 trained on MNIST with CE/MSE loss. In (a) we plot the averaged distance between the weight matrices at epoch t and epoch t + 1, captured by

Figure 6: Average rank of MLP-BN-10-100 trained on CIFAR10 with various batch sizes. Each model was trained with a λ = 5e-4 weight decay. To estimate the rank, we used an ϵ = 0.001 threshold.

Figure 8: Average rank of RES-BN-5-500 trained on CIFAR10 with various batch sizes. Each model was trained with a λ = 5e-4 weight decay. To estimate the rank, we used an ϵ = 0.001 threshold.

Figure 9: Average ranks and accuracy rates of MLP-5-500 trained on CIFAR10 with various batch sizes. Each model was trained with a λ = 5e-4 weight decay. To estimate the rank, we used an ϵ = 0.001 threshold.

Figure 12: Average ranks and accuracy rates of MLP-5-500 trained on CIFAR10 with varying λ. The models were trained with a µ = 0.025 initial learning rate. To estimate the rank, we used an ϵ = 0.001 threshold.

Figure 17: Average ranks and accuracy rates of VGG-16 trained on CIFAR10 with varying µ. The models were trained with a λ = 5e-4 weight decay. To estimate the rank, we used an ϵ = 0.001 threshold.

Figure 18: Average ranks and accuracy rates of ViT trained on CIFAR10 with varying λ. The models were trained with a µ = 4e-2 initial learning rate. To estimate the rank, we used an ϵ = 0.01 threshold.

where $\{W^{pq}_t\}_{(p,q) \in E_T}$ are the various trainable matrices in the network at epoch $t$. As mentioned in Sec. 3, convergence is possible only when $\lim_{t \to \infty} d(W_{t+1}, W_t) = 0$.

Figure 7: Average rank of MLP-BN-10-100 trained on CIFAR10 with varying λ. Each model was trained with a µ = 0.1 initial learning rate and 0.9 momentum. To estimate the rank, we used an ϵ = 0.001 threshold.

Figure 10: Average ranks and accuracy rates of MLP-BN-10-100 trained on CIFAR10 with varying λ. Each model was trained with a µ = 0.1 initial learning rate. To estimate the rank, we used an ϵ = 0.001 threshold.

Figure 11: Average ranks and accuracy rates of MLP-5-500 trained on CIFAR10 with various batch sizes. The models were trained with a λ = 5e-4 weight decay. To estimate the rank, we used an ϵ = 0.001 threshold.

Figure 15: Average ranks and accuracy rates of ResNet-18 trained on CIFAR10 with varying µ. The models were trained with a λ = 5e-4 weight decay. To estimate the rank, we used an ϵ = 0.001 threshold.

Figure 16: Average ranks and accuracy rates of VGG-16 trained on CIFAR10 with varying λ. The models were trained with a µ = 0.1 initial learning rate. To estimate the rank, we used an ϵ = 0.01 threshold.

