SGD AND WEIGHT DECAY PROVABLY INDUCE A LOW-RANK BIAS IN NEURAL NETWORKS

Abstract

We analyze deep ReLU neural networks trained with mini-batch Stochastic Gradient Descent (SGD) and weight decay. We show, both theoretically and empirically, that when training a neural network using SGD with weight decay and a small batch size, the resulting weight matrices tend to have low rank. Our analysis relies on a minimal set of assumptions: the neural networks may be arbitrarily wide or deep, and may include residual connections as well as convolutional layers. The same analysis implies the inherent presence of SGD "noise", defined as the inability of SGD to converge to a stationary point. In particular, we prove that SGD noise must always be present, even asymptotically, as long as we incorporate weight decay and the batch size is smaller than the total number of training samples.

1. INTRODUCTION

Stochastic gradient descent (SGD) is one of the standard workhorses for optimizing deep models (Bottou, 1991). Though initially proposed to remedy the computational bottleneck of gradient descent (GD), recent studies suggest that SGD also induces crucial regularization, which prevents overparameterized models from converging to minima that cannot generalize well (Zhang et al., 2016; Jastrzebski et al., 2017; Keskar et al., 2017; Zhu et al., 2019). Empirical studies suggest that (i) SGD outperforms GD (Zhu et al., 2019), (ii) SGD generalizes better when used with smaller batch sizes (Hoffer et al., 2017; Keskar et al., 2017), and (iii) gradient descent with additional noise cannot compete with SGD (Zhu et al., 2019). The full range of regularization effects induced by SGD, however, is not yet fully understood.

In this paper we present a mathematical analysis of the bias of SGD towards rank minimization. To investigate this bias, we propose the SGD Near-Convergence Regime as a novel approach for studying the inductive biases of SGD-trained neural networks. This setting considers the case where SGD reaches a point in training at which the expected update is small in comparison to the norm of the weights themselves. Our analysis is fairly generic: we consider deep ReLU networks trained with mini-batch SGD to minimize a differentiable loss function with L2 regularization (i.e., weight decay). The neural networks may include fully-connected layers, residual connections, and convolutions.

Our main contributions are:

• In Thm. 1, we demonstrate that training neural networks with mini-batch SGD and weight decay induces a low-rank bias in their weight matrices. We theoretically demonstrate that when training with smaller batch sizes, the rank of the learned matrices tends to decrease. This observation is validated as part of an extensive empirical study of the effect of certain hyperparameters on the rank of the learned matrices with various architectures.

• In Sec. 3.2, we study the inherent inability of SGD to converge to a stationary point, which we call "SGD noise". In Props. 1-2 we describe conditions under which SGD noise is inevitable when training convolutional neural networks. In particular, we demonstrate that when training a fully-connected neural network, SGD noise must always be present, even asymptotically, as long as we incorporate weight decay and the batch size is smaller than the total number of training samples. These predictions are empirically validated in Sec. 4.3.
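The low-rank tendency described above can be observed numerically. The sketch below is illustrative only and is not taken from the paper: it trains a small two-layer ReLU network with mini-batch SGD and weight decay on a toy regression task, then counts how many singular values of the first-layer matrix remain above 1% of the largest one, a crude effective-rank proxy. All sizes, the learning rate, the decay strength, and the rank proxy are arbitrary choices made for this demonstration.

```python
import numpy as np

# Illustrative sketch (not from the paper): two-layer ReLU network,
# mini-batch SGD with explicit L2 weight decay in the gradient.
rng = np.random.default_rng(0)
n, d, h = 256, 20, 64            # samples, input dim, hidden width
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0])              # target depends on one input direction only

W1 = 0.1 * rng.standard_normal((d, h))
w2 = 0.1 * rng.standard_normal(h)

def full_batch_mse(W1, w2):
    a = np.maximum(X @ W1, 0.0)
    return 0.5 * np.mean((a @ w2 - y) ** 2)

loss0 = full_batch_mse(W1, w2)

lr, wd, batch = 0.05, 1e-2, 8
for _ in range(20000):
    idx = rng.choice(n, size=batch, replace=False)
    xb, yb = X[idx], y[idx]
    z = xb @ W1                  # pre-activations
    a = np.maximum(z, 0.0)       # ReLU
    err = a @ w2 - yb            # dL/dpred for 0.5 * MSE
    g2 = a.T @ err / batch + wd * w2
    gz = np.outer(err, w2) * (z > 0)
    g1 = xb.T @ gz / batch + wd * W1
    W1 -= lr * g1
    w2 -= lr * g2

loss1 = full_batch_mse(W1, w2)
s = np.linalg.svd(W1, compute_uv=False)   # singular values, descending
eff_rank = int(np.sum(s > 0.01 * s[0]))   # crude effective-rank proxy
print(f"train MSE {loss0:.3f} -> {loss1:.3f}, "
      f"effective rank {eff_rank}/{min(d, h)}")
```

Under the prediction of Thm. 1, re-running this sketch with a smaller batch size or a larger weight-decay coefficient should push the effective rank lower; the sketch makes that experiment a one-line change.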

