IMPLICIT BIAS IN LEAKY RELU NETWORKS TRAINED ON HIGH-DIMENSIONAL DATA

Abstract

The implicit biases of gradient-based optimization algorithms are conjectured to be a major factor in the success of modern deep learning. In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations when the training data are nearly-orthogonal, a common property of high-dimensional data. For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two. Moreover, this network is an $\ell_2$-max-margin solution (in parameter space), and has a linear decision boundary that corresponds to an approximate-max-margin linear predictor. For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training. We provide experiments which suggest that a small initialization scale is important for finding low-rank neural networks with gradient descent.

1. INTRODUCTION

Neural networks trained by gradient descent appear to generalize well in many settings, even when trained without explicit regularization. It is thus understood that the use of gradient-based optimization imposes an implicit bias towards particular solutions which enjoy favorable properties. The nature of this implicit regularization effect, and its dependence on the structure of the training data, the architecture of the network, and the particular gradient-based optimization algorithm, is thus a central object of study in the theory of deep learning.

In this work, we examine the implicit bias of gradient descent when the training data are such that the pairwise correlations $|\langle x_i, x_j \rangle|$ between distinct samples $x_i, x_j \in \mathbb{R}^d$ are much smaller than the squared Euclidean norms of the samples: that is, the samples are nearly-orthogonal. As we shall show, this property is often satisfied when the training data are sampled i.i.d. from a $d$-dimensional distribution and $d$ is significantly larger than the number of samples $n$. We will thus refer to such training data with the descriptors 'high-dimensional' and 'nearly-orthogonal' interchangeably.

We consider fully-connected two-layer networks with $m$ neurons in which the first-layer weights are trained and the second-layer weights are fixed at their random initialization. If we denote the first-layer weights by $W \in \mathbb{R}^{m \times d}$, with rows $w_j^\top \in \mathbb{R}^d$, then the network output is given by
$$f(x; W) := \sum_{j=1}^{m} a_j \phi(\langle w_j, x \rangle),$$
where the $a_j \in \mathbb{R}$, $j = 1, \ldots, m$, are fixed. We consider the implicit bias in two different settings: gradient flow, which corresponds to gradient descent as the step size tends to zero, and standard gradient descent. For gradient flow, we consider the standard leaky ReLU activation, $\phi(z) = \max(\gamma z, z)$.
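The near-orthogonality phenomenon is easy to check numerically. The following sketch (our own illustration, with arbitrary choices of $n$ and $d$; standard Gaussian data is one simple distribution with this property) verifies that pairwise inner products are much smaller than squared norms when $d \gg n$:

```python
import numpy as np

# Illustrative check: i.i.d. high-dimensional Gaussian samples are
# nearly orthogonal, i.e. |<x_i, x_j>| << ||x_i||^2 for i != j.
rng = np.random.default_rng(0)
n, d = 30, 10_000                    # n samples in d dimensions, d >> n
X = rng.standard_normal((n, d))

gram = X @ X.T
sq_norms = np.diag(gram)                          # ||x_i||^2, concentrates near d
off_diag = np.abs(gram[~np.eye(n, dtype=bool)])   # |<x_i, x_j>|, typically O(sqrt(d))

ratio = off_diag.max() / sq_norms.min()
print(f"max |<x_i,x_j>| / min ||x_i||^2 = {ratio:.3f}")  # small when d >> n
```

Here the squared norms concentrate around $d$ while the cross terms are of order $\sqrt{d}$, so the ratio shrinks like $1/\sqrt{d}$.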
Our starting point in this setting is recent work by Lyu & Li (2019); Ji & Telgarsky (2020) showing that, provided the network interpolates the training data at some time, gradient flow on homogeneous networks, such as two-layer leaky ReLU networks, converges (in direction) to a network satisfying the Karush-Kuhn-Tucker (KKT) conditions of the margin-maximization problem
$$\min_W \ \tfrac{1}{2} \|W\|_F^2 \quad \text{s.t.} \quad \forall i \in [n], \ y_i f(x_i; W) \geq 1.$$
Leveraging this, we show that the asymptotic limit of gradient flow produces a matrix $W$ which is a global optimum of the above problem, and has rank at most 2. Moreover, we note that our assumption on the high-dimensionality of the data implies that it is linearly separable. Our leaky ReLU network $f(\cdot; W)$ is non-linear, but we show that gradient flow converges in direction to $W$ such that the decision boundary is linear: namely, there exists $z \in \mathbb{R}^d$ such that for all $x$ we have $\mathrm{sign}(f(x; W)) = \mathrm{sign}(z^\top x)$. This linear predictor $z$ may not be an $\ell_2$-max-margin linear predictor, but it maximizes the margin approximately (see details in Theorem 3.2).

For gradient descent, we consider a smoothed approximation to the leaky ReLU activation, and consider training that starts from a random initialization with small initialization variance. Our result for gradient flow on the standard leaky ReLU activation suggests that gradient descent with small-enough step size should eventually produce a network for which $W^{(t)}$ has small rank. However, the asymptotic characterization of trained neural networks in terms of KKT points of a margin-maximization problem relies heavily upon the infinite-time limit, leaving open what happens in finite time. Towards this end, we consider the stable rank of the weight matrix $W^{(t)}$ found by gradient descent at time $t$, defined as $\|W^{(t)}\|_F^2 / \|W^{(t)}\|_2^2$: the square of the ratio of the Frobenius norm to the spectral norm of $W^{(t)}$.
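The stable rank is straightforward to compute from the singular values; a minimal sketch (the helper name `stable_rank` is ours):

```python
import numpy as np

def stable_rank(W: np.ndarray) -> float:
    """Stable rank ||W||_F^2 / ||W||_2^2: squared Frobenius norm over
    squared spectral norm (largest singular value squared)."""
    sv = np.linalg.svd(W, compute_uv=False)  # singular values, descending
    return float(np.sum(sv**2) / sv[0]**2)

# Sanity checks: a rank-1 matrix has stable rank 1, and the identity
# has stable rank equal to its (full) rank.
u, v = np.ones((5, 1)), np.arange(1.0, 4.0).reshape(1, 3)
print(stable_rank(u @ v))      # 1.0
print(stable_rank(np.eye(4)))  # 4.0
```

Unlike the exact rank, the stable rank is insensitive to tiny singular values, which makes it a more robust measure of effective dimensionality along a training trajectory.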
We show that after the first step of gradient descent, the stable rank of the weight matrix $W^{(t)}$ reduces from something of order $\min(m, d)$ to at most an absolute constant, independent of $m$, $d$, or the number of samples. Further, throughout the training trajectory the stable rank of the network is never larger than some absolute constant.

We conclude by verifying our results with experiments. We first confirm our theoretical predictions for binary classification problems with high-dimensional data. We then consider the stable rank of two-layer networks trained by SGD on the CIFAR-10 dataset, which is not high-dimensional. We observe that the scale of the initialization plays a crucial role in the stable rank of the weights found by gradient descent: with the default TensorFlow initialization, the stable rank of a network with $m = 512$ neurons never falls below 74, while with a smaller initialization variance, the stable rank quickly drops to 3.25, and only begins to increase above 10 when the network begins to overfit.
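The rank-reduction effect of the first gradient step can be illustrated with a small synthetic experiment. The sketch below is our own illustration, not the paper's experimental setup: it uses the plain (unsmoothed) leaky ReLU, the logistic loss, Gaussian nearly-orthogonal data, and arbitrary hyperparameter choices. With a tiny initialization variance, the first step is dominated by a nearly rank-1 gradient, so the stable rank collapses:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m, gamma, lr = 20, 1000, 50, 0.5, 0.1

# Nearly-orthogonal Gaussian data with random labels;
# first layer W trained from a tiny-variance initialization,
# second layer a fixed at random signs.
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)
W = 1e-6 * rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

def stable_rank(M):
    sv = np.linalg.svd(M, compute_uv=False)
    return float(np.sum(sv**2) / sv[0]**2)

def grad(W):
    """Gradient of the logistic loss for f(x;W) = sum_j a_j*leaky_relu(<w_j,x>)."""
    pre = X @ W.T                               # (n, m) pre-activations
    act = np.where(pre >= 0, pre, gamma * pre)  # leaky ReLU
    f = act @ a                                 # network outputs
    s = 1.0 / (1.0 + np.exp(y * f))             # logistic loss: -dloss/df = s*y
    dphi = np.where(pre >= 0, 1.0, gamma)       # leaky ReLU derivative
    return -((s * y)[:, None] * dphi * a[None, :]).T @ X / n

print("stable rank at init:     ", round(stable_rank(W), 2))  # large, grows with min(m, d)
W1 = W - lr * grad(W)
print("stable rank after 1 step:", round(stable_rank(W1), 2))  # drops to O(1)
```

Intuitively, at a near-zero initialization every neuron's gradient is approximately a fixed multiple of $\sum_i y_i x_i$ scaled by $a_j$, so the post-step weight matrix is close to rank one.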

RELATED WORK

Implicit bias in neural networks. The literature on implicit bias in neural networks has expanded rapidly in recent years and cannot be reasonably surveyed here (see Vardi (2022) for a survey). In what follows, we discuss results which apply to two-layer ReLU or leaky ReLU networks in classification settings. Lyu & Li (2019) and Ji & Telgarsky (2020) showed that homogeneous neural networks (and specifically two-layer leaky ReLU networks, which are the focus of this paper) trained with exponentially-tailed classification losses converge in direction to a KKT point of the maximum-margin problem. Our analysis of the implicit bias relies on this result. We note that the aforementioned KKT point may not be a global optimum (see the discussion in Section 3). Lyu et al. (2021) studied the implicit bias in two-layer leaky ReLU networks trained on linearly separable and symmetric data, and showed that gradient flow converges to a linear classifier which maximizes the $\ell_2$ margin. In our work we do not assume that the data are symmetric, but rather that they are nearly orthogonal; moreover, in our setting gradient flow may converge to a linear classifier that does not maximize the $\ell_2$ margin. Sarussi et al. (2021) studied gradient flow on two-layer leaky ReLU networks where the training data are linearly separable. They showed convergence to a linear classifier under an assumption called the Neural Agreement Regime (NAR): starting from some time point, all positive neurons (i.e., neurons with a positive outgoing weight) agree on the classification of the training data, and similarly for the negative neurons. However, it is unclear when this assumption holds a priori. Chizat & Bach (2020) studied the dynamics of gradient flow on infinite-width homogeneous two-layer networks with exponentially-tailed losses, and showed bias towards margin maximization w.r.t.

