THE IMPLICIT BIAS OF MINIMA STABILITY IN MULTIVARIATE SHALLOW RELU NETWORKS

Abstract

We study the type of solutions to which stochastic gradient descent converges when used to train a single hidden-layer multivariate ReLU network with the quadratic loss. Our results are based on a dynamical stability analysis. In the univariate case, it was shown that linearly stable minima correspond to network functions (predictors) whose second derivative has a bounded weighted L1 norm. Notably, the bound gets smaller as the step size increases, implying that training with a large step size leads to 'smoother' predictors. Here we generalize this result to the multivariate case, showing that a similar result applies to the Laplacian of the predictor. We demonstrate the tightness of our bound on the MNIST dataset, and show that it accurately captures the behavior of the solutions as a function of the step size. Additionally, we prove a depth separation result on the approximation power of ReLU networks corresponding to stable minima of the loss. Specifically, although shallow ReLU networks are universal approximators, we prove that stable shallow networks are not. Namely, there is a function that cannot be well-approximated by stable single hidden-layer ReLU networks trained with a non-vanishing step size, whereas the same function can be realized as a stable two hidden-layer ReLU network. Finally, we prove that if a function is sufficiently smooth (in a Sobolev sense), then it can be approximated arbitrarily well using shallow ReLU networks that correspond to stable solutions of gradient descent.

1. INTRODUCTION

Neural networks (NNs) have demonstrated phenomenal performance in a wide array of fields, from computer vision and speech processing to medical sciences. Modern networks are typically highly overparameterized. In such settings, the training loss usually has multiple global minima, which correspond to models that perfectly fit the training data. Some of those models are clearly sub-optimal in terms of generalization. Yet, the training process seems to consistently avoid those bad global minima, and somehow steers the model towards global minima that generalize well. A long line of works attributed this behavior to "implicit biases" of the training algorithms, e.g., (Zhang et al., 2017; Gunasekar et al., 2017; Soudry et al., 2018; Arora et al., 2019). Recently, it has been recognized that a dominant factor affecting the implicit bias of gradient descent (GD) and stochastic gradient descent (SGD) is associated with dynamical stability. Roughly speaking, the dynamical stability of a minimum point refers to the ability of the optimizer to stably converge to that point. Particular research efforts have been devoted to understanding linear stability, namely the dynamical stability of the optimizer's linearized dynamics around the minimum (Wu et al., 2018; Nar & Sastry, 2018; Mulayoff et al., 2021; Ma & Ying, 2021). For GD and SGD, it is well known that a minimum is linearly stable if the loss terrain is sufficiently flat w.r.t. the step size η. Concretely, a necessary condition for a minimum to be linearly stable for GD and SGD is that the top eigenvalue of the Hessian at that minimum point be smaller than 2/η (see Sec. 2). Although this

* Indicates equal contribution. Correspondence: {mor.shpigel,rotem.mulayof}@gmail.com.
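The 2/η stability threshold can be illustrated on a one-dimensional quadratic, where GD's linearized dynamics are exact. Below is a minimal sketch (the function name and parameters are illustrative, not from the paper): on f(x) = (λ/2)x², the GD update multiplies the iterate by (1 − ηλ), so the iterates contract iff λ < 2/η and blow up otherwise.

```python
def gd_on_quadratic(lam, eta, x0=1.0, steps=200):
    """Run GD on f(x) = (lam/2) * x**2 and return |x| after `steps` updates.

    Each update is x <- x - eta * f'(x) = (1 - eta*lam) * x, so the
    iterates converge iff |1 - eta*lam| < 1, i.e., iff lam < 2/eta.
    """
    x = x0
    for _ in range(steps):
        x = x - eta * lam * x
    return abs(x)

eta = 0.1  # stability threshold is 2/eta = 20
print(gd_on_quadratic(lam=19.0, eta=eta))  # sharpness below 2/eta: shrinks toward 0
print(gd_on_quadratic(lam=21.0, eta=eta))  # sharpness above 2/eta: diverges
```

Here λ plays the role of the top Hessian eigenvalue at the minimum; in the multivariate case the same contraction argument applies along the top eigenvector of the Hessian.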

