IMPLICIT BIAS IN LEAKY RELU NETWORKS TRAINED ON HIGH-DIMENSIONAL DATA

Abstract

The implicit biases of gradient-based optimization algorithms are conjectured to be a major factor in the success of modern deep learning. In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations when the training data are nearly-orthogonal, a common property of high-dimensional data. For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that asymptotically, gradient flow produces a neural network with rank at most two. Moreover, this network is an $\ell_2$-max-margin solution (in parameter space), and has a linear decision boundary that corresponds to an approximate-max-margin linear predictor. For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training. We provide experiments which suggest that a small initialization scale is important for finding low-rank neural networks with gradient descent.

1. INTRODUCTION

Neural networks trained by gradient descent appear to generalize well in many settings, even when trained without explicit regularization. It is thus understood that the use of gradient-based optimization imposes an implicit bias towards particular solutions which enjoy favorable properties. The nature of this implicit regularization effect, and its dependence on the structure of the training data, the architecture of the network, and the particular gradient-based optimization algorithm, is thus a central object of study in the theory of deep learning. In this work, we examine the implicit bias of gradient descent when the training data is such that the pairwise correlations $|\langle x_i, x_j \rangle|$ between distinct samples $x_i, x_j \in \mathbb{R}^d$ are much smaller than the squared Euclidean norms of the samples: that is, the samples are nearly-orthogonal. As we shall show, this property is often satisfied when the training data is sampled i.i.d. from a $d$-dimensional distribution and $d$ is significantly larger than the number of samples $n$. We will thus refer to such training data with the descriptors 'high-dimensional' and 'nearly-orthogonal' interchangeably.

We consider fully-connected two-layer networks with $m$ neurons where the first-layer weights are trained and the second-layer weights are fixed at their random initialization. If we denote the first-layer weights by $W \in \mathbb{R}^{m \times d}$, with rows $w_j^\top \in \mathbb{R}^d$, then the network output is given by
$$f(x; W) := \sum_{j=1}^{m} a_j \phi(\langle w_j, x \rangle),$$
where $a_j \in \mathbb{R}$, $j = 1, \ldots, m$, are fixed. We consider the implicit bias in two different settings: gradient flow, which corresponds to gradient descent where the step-size tends to zero, and standard gradient descent.
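To make the setting concrete, the following sketch (ours, not from the paper; the sample sizes, leaky ReLU slope, initialization scale, and logistic loss are illustrative assumptions) samples i.i.d. Gaussian data with $d \gg n$, checks the near-orthogonality condition, and takes a single gradient-descent step on a two-layer leaky ReLU network whose second-layer weights $a_j$ stay fixed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): n samples in dimension d with d >> n,
# and m hidden neurons.
n, d, m = 20, 5000, 64
alpha = 0.1  # leaky ReLU slope (assumed value)

# i.i.d. isotropic Gaussian data: |<x_i, x_j>| is typically O(sqrt(d)) for i != j,
# while ||x_i||^2 concentrates around d, so the samples are nearly orthogonal.
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)

G = X @ X.T
print("max_{i != j} |<x_i, x_j>| :", np.abs(G - np.diag(np.diag(G))).max())
print("min_i ||x_i||^2           :", np.diag(G).min())

def leaky_relu(z):
    return np.where(z >= 0, z, alpha * z)

# Two-layer network f(x; W) = sum_j a_j * phi(<w_j, x>):
# first-layer weights W are trained, second-layer weights a_j are fixed.
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)
W = 1e-4 * rng.standard_normal((m, d))  # small initialization scale (assumed)

def f(X, W):
    return leaky_relu(X @ W.T) @ a

# One gradient-descent step on the logistic loss (1/n) sum_i log(1 + exp(-y_i f(x_i))).
def grad_W(X, y, W):
    s = -y / (1.0 + np.exp(y * f(X, W)))        # dloss/df for each sample
    dphi = np.where(X @ W.T >= 0, 1.0, alpha)   # leaky ReLU derivative at pre-activations
    return ((s[:, None] * dphi) * a).T @ X / n  # shape (m, d)

W = W - 1.0 * grad_W(X, y, W)

# Stable rank ||W||_F^2 / ||W||_2^2 as a rough proxy for how low-rank W is.
print("stable rank after one step:", np.linalg.norm(W, "fro") ** 2 / np.linalg.norm(W, 2) ** 2)
```

With $d$ much larger than $n$, the printed off-diagonal inner products are an order of magnitude smaller than the squared norms, and the stable rank gives a rough sense of how concentrated the first-layer weights become after a single step; it is a proxy only, not a reproduction of the paper's results.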

