IMPLICIT BIAS OF LARGE DEPTH NETWORKS: A NOTION OF RANK FOR NONLINEAR FUNCTIONS

Abstract

We show that the representation cost of fully connected neural networks with homogeneous nonlinearities (which describes the implicit bias in function space of networks with L2-regularization or with losses such as the cross-entropy) converges, as the depth of the network goes to infinity, to a notion of rank over nonlinear functions. We then inquire under which conditions the global minima of the loss recover the 'true' rank of the data: we show that for too large depths the global minimum will be approximately rank 1 (underestimating the rank); we then argue that there is a range of depths, which grows with the number of datapoints, where the true rank is recovered. Finally, we discuss the effect of the rank of a classifier on the topology of the resulting class boundaries and show that autoencoders with optimal nonlinear rank are naturally denoising.

1. INTRODUCTION

There has been a lot of recent interest in the so-called implicit bias of DNNs, which describes which functions are favored by a network when fitting the training data. Different network architectures (choice of nonlinearity, depth, width, and more) and training procedures (initialization, optimization algorithm, loss) can lead to widely different biases. In contrast to the so-called kernel regime, where the implicit bias is described by the Neural Tangent Kernel (Jacot et al., 2018), there are several active regimes (also called rich or feature-learning regimes) whose implicit biases often feature a form of sparsity that is absent from the kernel regime. Such active regimes have been observed, for example, in DNNs with small initialization (Chizat & Bach, 2018; Rotskoff & Vanden-Eijnden, 2018; Li et al., 2020; Jacot et al., 2022a), with L2-regularization (Savarese et al., 2019; Ongie et al., 2020; Jacot et al., 2022b), or when trained on exponentially decaying losses (Gunasekar et al., 2018a;b; Soudry et al., 2018; Du et al., 2018; Ji & Telgarsky, 2018; Chizat & Bach, 2020; Ji & Telgarsky, 2020).

In the latter two cases, the implicit bias is described by the representation cost

R(f) = min_{W : f_W = f} ∥W∥²,

where f is a function that can be represented by the network and the minimum is taken over all parameters W that result in a network function f_W equal to f; the parameters W form a vector and ∥W∥ is its L2-norm.

The representation cost can in some cases be computed explicitly for linear networks. For diagonal linear networks of depth L, the representation cost of a linear function f(x) = w^T x equals the scaled L_p-norm R(f) = L ∥w∥_p^p of the vector w, with p = 2/L (Gunasekar et al., 2018a; Moroshko et al., 2020). For fully-connected linear networks, the representation cost of a linear function f(x) = Ax equals the scaled L_p-Schatten norm (the L_p-norm of the singular values), R(f) = L ∥A∥_p^p with p = 2/L (Dai et al., 2021).
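As a numerical illustration (not part of the original results), the diagonal-network formula can be checked directly: the balanced factorization, in which every layer carries the entrywise magnitude |w|^(1/L), has parameter cost exactly L ∥w∥_{2/L}^{2/L}, and by the AM-GM inequality no other factorization of w into L diagonal layers can do better. A minimal sketch with an arbitrary example vector:

```python
import numpy as np

rng = np.random.default_rng(0)
L = 3                                  # network depth (example choice)
w = np.array([0.5, -2.0, 3.0])         # example target linear map w^T x
p = 2.0 / L

# Closed form from the text: R(w) = L * ||w||_p^p with p = 2/L.
cost_formula = L * np.sum(np.abs(w) ** p)

# Balanced factorization: each of the L layers has magnitude |w|^(1/L),
# so each layer contributes sum(|w|^(2/L)) to the squared parameter norm.
balanced = np.abs(w) ** (1.0 / L)
cost_balanced = L * np.sum(balanced ** 2)
assert np.isclose(cost_balanced, cost_formula)

# Random factorizations of w never beat the closed form (AM-GM).
for _ in range(1000):
    factors = rng.uniform(0.5, 2.0, size=(L - 1, len(w)))
    last = w / np.prod(factors, axis=0)   # last layer completes the product
    cost = np.sum(factors ** 2) + np.sum(last ** 2)
    assert cost >= cost_formula - 1e-9
```

The per-coordinate AM-GM bound (sum of L squared factors ≥ L times the geometric mean) is exactly what makes the balanced factorization optimal here.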
A common thread between these examples is a bias towards some notion of sparsity: sparsity of the entries of the vector w in diagonal networks and sparsity of the singular values in fully-connected networks.
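The connection between this sparsity and the rank studied in this paper can be seen already in the linear case: for a fully-connected linear network, R(f)/L = ∥A∥_{2/L}^{2/L} = Σ_i σ_i^{2/L}, and since σ^{2/L} → 1 as L → ∞ for every nonzero singular value σ, the per-layer cost converges to the number of nonzero singular values, i.e. rank(A). A small numerical sketch (illustrative only; the matrix is an arbitrary rank-2 example):

```python
import numpy as np

# An exactly rank-2 matrix built from two outer products.
A = (np.outer([1.0, 2.0, 3.0], [1.0, 0.0, -1.0])
     + np.outer([0.0, 1.0, 1.0], [2.0, 1.0, 0.0]))
sigma = np.linalg.svd(A, compute_uv=False)
rank = int(np.sum(sigma > 1e-10))      # numerical rank (here 2)

for L in [2, 10, 100, 1000]:
    p = 2.0 / L
    # Per-layer representation cost R(A)/L = sum of sigma_i^p,
    # restricted to nonzero singular values.
    cost_per_layer = float(np.sum(sigma[sigma > 1e-10] ** p))
    # As L grows, cost_per_layer approaches rank(A).
    print(f"L = {L:5d}: R/L = {cost_per_layer:.4f}  (rank = {rank})")
```

This is the linear-algebra prototype of the paper's main theme: as depth grows, the (rescaled) representation cost behaves like a rank, rewarding functions with few "active directions".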

