SGD AND WEIGHT DECAY PROVABLY INDUCE A LOW-RANK BIAS IN NEURAL NETWORKS

Abstract

We analyze deep ReLU neural networks trained with mini-batch Stochastic Gradient Descent (SGD) and weight decay. We show, both theoretically and empirically, that when training a neural network using SGD with weight decay and small batch size, the resulting weight matrices tend to be of small rank. Our analysis relies on a minimal set of assumptions; the neural networks may be arbitrarily wide or deep, and may include residual connections, as well as convolutional layers. The same analysis implies the inherent presence of SGD "noise", defined as the inability of SGD to converge to a stationary point. In particular, we prove that SGD noise must always be present, even asymptotically, as long as we incorporate weight decay and the batch size is smaller than the total number of training samples.

1. INTRODUCTION

Stochastic gradient descent (SGD) is one of the standard workhorses for optimizing deep models (Bottou, 1991). Though initially proposed to remedy the computational bottleneck of gradient descent (GD), recent studies suggest that SGD also induces crucial regularization, which prevents overparameterized models from converging to minima that do not generalize well (Zhang et al., 2016; Jastrzebski et al., 2017; Keskar et al., 2017; Zhu et al., 2019). Empirical studies suggest that (i) SGD outperforms GD (Zhu et al., 2019), (ii) SGD generalizes better when used with smaller batch sizes (Hoffer et al., 2017; Keskar et al., 2017), and (iii) gradient descent with additional noise cannot compete with SGD (Zhu et al., 2019). The full range of regularization effects induced by SGD, however, is not yet fully understood.

In this paper we present a mathematical analysis of the bias of SGD towards rank minimization. To investigate this bias, we propose the SGD Near-Convergence Regime as a novel approach for studying inductive biases of SGD-trained neural networks. This setting considers the stage of training at which the expected SGD update is small in comparison to the norm of the weights. Our analysis is fairly generic: we consider deep ReLU networks trained with mini-batch SGD to minimize a differentiable loss function with $L_2$ regularization (i.e., weight decay). The neural networks may include fully-connected layers, residual connections, and convolutions. Our main contributions are:

• In Thm. 1, we demonstrate that training neural networks with mini-batch SGD and weight decay induces a low-rank bias in their weight matrices, and that training with smaller batch sizes tends to decrease the rank of the learned matrices. This prediction is validated as part of an extensive empirical study of the effect of various hyperparameters on the rank of the learned matrices across several architectures.

• In Sec. 3.2, we study the inherent inability of SGD to converge to a stationary point, which we call 'SGD noise'. In Props. 1-2 we describe conditions under which SGD noise is inevitable when training convolutional neural networks. In particular, we demonstrate that when training a fully-connected neural network, SGD noise must always be present, even asymptotically, as long as we incorporate weight decay and the batch size is smaller than the total number of training samples. These predictions are empirically validated in Sec. 4.3.

Prior empirical work showed that replacing the weight matrices of trained networks by low-rank approximations results in only a small drop in accuracy. This suggests that the weight matrices at convergence may be close to low-rank matrices; whether they provably behave this way, however, remains unclear. Timor et al. (2022) showed that for ReLU networks, gradient flow (GF) generally does not minimize rank. They also argued that sufficiently deep ReLU networks can have low-rank solutions under $L_2$-norm minimization. This interesting result, however, applies to layers added to a network that already solves the problem and may not exhibit any low-rank bias. It is not directly related to the mechanism described in this paper, which, unlike that of Timor et al. (2022), applies to all layers of the network, but only in the presence of regularization and SGD. A recent paper (Le & Jegelka, 2022) analyzes the low-rank bias of neural networks trained with GF (without regularization). While it makes significant strides in extending the analysis of Ji & Telgarsky (2020), it makes several limiting assumptions. As a result, the analysis is only applicable under very specific conditions, such as linearly separable data, and the low-rank result is limited to a set of linear layers stacked at the top of the trained network.
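One concrete way to quantify the rank of a learned weight matrix, as in such empirical studies, is to count the singular values above a small relative threshold. A minimal NumPy sketch (the tolerance value and the example matrix are illustrative assumptions, not the paper's experimental protocol):

```python
import numpy as np

def numerical_rank(weight, tol=1e-3):
    """Count singular values above tol * (largest singular value)."""
    svals = np.linalg.svd(weight, compute_uv=False)  # sorted descending
    return int(np.sum(svals > tol * svals[0]))

# Example: a 64x64 matrix that is close to, but not exactly, rank 2.
rng = np.random.default_rng(0)
low_rank = rng.standard_normal((64, 2)) @ rng.standard_normal((2, 64))
noisy = low_rank + 1e-6 * rng.standard_normal((64, 64))
print(numerical_rank(noisy))  # 2 -- the tiny perturbation falls below the threshold
```

Exact rank is brittle under floating-point noise, so a relative singular-value cutoff of this kind is the natural measurement when checking whether trained matrices are approximately low-rank.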

2. PROBLEM SETUP

In this work we consider a standard supervised learning setting (classification or regression), and study the inductive biases induced by training a neural network with mini-batch SGD along with weight decay. Formally, the task is defined by a distribution $P$ over samples $(x, y) \in \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} \subset \mathbb{R}^{c_1 \times h_1 \times w_1}$ is the instance space (e.g., images) and $\mathcal{Y} \subset \mathbb{R}^k$ is the label space. We consider a parametric model $\mathcal{F} \subset \{f' : \mathcal{X} \to \mathbb{R}^k\}$, where each function $f_W \in \mathcal{F}$ is specified by a vector of parameters $W \in \mathbb{R}^N$. A function $f_W \in \mathcal{F}$ assigns a prediction to any input point $x \in \mathcal{X}$, and its performance is measured by the expected risk, $L_P(f_W) := \mathbb{E}_{(x,y)\sim P}[\ell(f_W(x), y)]$, where $\ell : \mathbb{R}^k \times \mathcal{Y} \to [0, \infty)$ is a non-negative, differentiable loss function (e.g., the MSE or cross-entropy loss). For simplicity, in the analysis we focus on the case $k = 1$.

Since we do not have direct access to the population distribution $P$, the goal is to learn a predictor $f_W$ from a training dataset $S = \{(x_i, y_i)\}_{i=1}^{m}$ of independent and identically distributed (i.i.d.) samples drawn from $P$. To avoid overfitting the training data, we employ weight decay to control the complexity of the learned model. Namely, we aim to minimize the regularized empirical risk, $L^{\lambda}_{S}(f_W) := \frac{1}{m}\sum_{i=1}^{m} \ell(f_W(x_i), y_i) + \lambda \|W\|_2^2$, where $\lambda > 0$ is a predefined hyperparameter. In order to minimize this objective, we typically use mini-batch SGD, as detailed below.

Optimization. We minimize the regularized empirical risk $L^{\lambda}_{S}(f_W)$ by applying stochastic gradient descent (SGD) for a certain number of iterations $T$. Formally, we initialize $W_0$ using a standard initialization procedure, iteratively update $W_t$ for $T$ iterations, and return $W_T$. At each iteration, we sample a mini-batch $\tilde{S} = \{(x_{i_j}, y_{i_j})\}_{j=1}^{B} \subset S$ uniformly at random and update $W_{t+1} \leftarrow W_t - \mu \nabla_W L^{\lambda}_{\tilde{S}}(f_{W_t})$, where $\mu > 0$ is a predefined learning rate.
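The optimization loop above can be sketched in NumPy. The tiny two-layer ReLU network, the toy regression data, and the hyperparameter values below are illustrative assumptions for exposition, not the paper's experimental setup; the loss is the square loss with an explicit weight-decay term, matching the regularized objective:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: m samples in R^d, scalar targets (k = 1, as in the analysis).
m, d, hidden = 256, 8, 16
X = rng.standard_normal((m, d))
y = np.sin(X[:, 0])

# A small two-layer ReLU network f_W(x) = W2 relu(W1 x).
W1 = rng.standard_normal((hidden, d)) / np.sqrt(d)
W2 = rng.standard_normal((1, hidden)) / np.sqrt(hidden)

mu, lam, B = 1e-2, 1e-3, 16  # learning rate, weight decay coefficient, batch size

def full_mse(W1, W2):
    H = np.maximum(X @ W1.T, 0.0)
    return np.mean(((H @ W2.T)[:, 0] - y) ** 2)

mse0 = full_mse(W1, W2)
for t in range(2000):
    idx = rng.choice(m, size=B, replace=False)   # sample a mini-batch uniformly
    Xb, yb = X[idx], y[idx]
    H = np.maximum(Xb @ W1.T, 0.0)               # ReLU activations, shape (B, hidden)
    err = (H @ W2.T)[:, 0] - yb                  # residuals on the batch
    g = (2.0 / B) * err[:, None]                 # d(batch MSE)/d(predictions)
    # Gradients of the *regularized* batch loss, including the weight-decay term.
    gW2 = g.T @ H + 2 * lam * W2
    gW1 = ((g @ W2) * (H > 0)).T @ Xb + 2 * lam * W1
    W1 -= mu * gW1                               # W_{t+1} <- W_t - mu * gradient
    W2 -= mu * gW2

print(f"training MSE before/after: {mse0:.3f} / {full_mse(W1, W2):.3f}")
```

Note that, consistent with the setting studied here, each step follows the gradient of the regularized loss on the sampled batch only, so the weight-decay pull towards zero is deterministic while the data-fitting term fluctuates with the batch.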



1.1 RELATED WORK

A prominent thread in the recent literature revolves around characterizing the implicit regularization of gradient-based optimization, in the belief that this is key to generalization in deep learning. Several papers have focused on a potential bias of gradient descent or stochastic gradient descent towards rank minimization. The initial interest was motivated by the matrix factorization problem, which corresponds to training a depth-2 linear neural network with multiple outputs w.r.t. the square loss. Gunasekar et al. (2017) initially conjectured that the implicit regularization in matrix factorization can be characterized in terms of the nuclear norm of the corresponding linear predictor. This conjecture, however, was formally refuted by Li et al. (2020). Later, Razin & Cohen (2020) conjectured that the implicit regularization in matrix factorization can be explained by rank minimization, and also hypothesized that some notion of rank minimization may be key to explaining generalization in deep learning. Li et al. (2020) established evidence that the implicit regularization in matrix factorization is a heuristic for rank minimization. Beyond factorization problems, Ji & Telgarsky (2020) showed that gradient flow (GF) training of univariate linear networks w.r.t. exponentially-tailed classification losses learns weight matrices of rank 1.

