IMPLICIT BIAS OF LARGE DEPTH NETWORKS: A NOTION OF RANK FOR NONLINEAR FUNCTIONS

Abstract

We show that the representation cost of fully connected neural networks with homogeneous nonlinearities, which describes the implicit bias in function space of networks trained with L2-regularization or with losses such as the cross-entropy, converges, as the depth of the network goes to infinity, to a notion of rank over nonlinear functions. We then inquire under which conditions the global minima of the loss recover the 'true' rank of the data: we show that for too large depths the global minimum is approximately rank 1 (underestimating the rank); we then argue that there is a range of depths, which grows with the number of datapoints, where the true rank is recovered. Finally, we discuss the effect of the rank of a classifier on the topology of the resulting class boundaries, and we show that autoencoders with optimal nonlinear rank are naturally denoising.

1. INTRODUCTION

There has been a lot of recent interest in the so-called implicit bias of DNNs, which describes what functions are favored by a network when fitting the training data. Different network architectures (choice of nonlinearity, depth, width, and more) and training procedures (initialization, optimization algorithm, loss) can lead to widely different biases. In contrast to the so-called kernel regime, where the implicit bias is described by the Neural Tangent Kernel (Jacot et al., 2018), there are several active regimes (also called rich or feature-learning regimes), whose implicit bias often features a form of sparsity that is absent from the kernel regime. Such active regimes have been observed, for example, in DNNs with small initialization (Chizat & Bach, 2018; Rotskoff & Vanden-Eijnden, 2018; Li et al., 2020; Jacot et al., 2022a), with L2-regularization (Savarese et al., 2019; Ongie et al., 2020; Jacot et al., 2022b), or when trained on exponentially decaying losses (Gunasekar et al., 2018a;b; Soudry et al., 2018; Du et al., 2018; Ji & Telgarsky, 2018; Chizat & Bach, 2020; Ji & Telgarsky, 2020). In the latter two cases, the implicit bias is described by the representation cost

R(f) = min_{W : f_W = f} ∥W∥²,

where f is a function that can be represented by the network and the minimization is over all parameters W that result in a network function f_W equal to f; the parameters W form a vector and ∥W∥ is the L2-norm.

The representation cost can in some cases be explicitly computed for linear networks. For diagonal linear networks, the representation cost of a linear function f(x) = w^T x equals the L_p norm R(f) = L∥w∥_p^p of the vector w for p = 2/L (Gunasekar et al., 2018a; Moroshko et al., 2020), where L is the depth of the network. For fully-connected linear networks, the representation cost of a linear function f(x) = Ax equals the L_p-Schatten norm (the L_p norm of the singular values) R(f) = L∥A∥_p^p, again with p = 2/L (Dai et al., 2021).

A common thread between these examples is a bias towards some notion of sparsity: sparsity of the entries of the vector w in diagonal networks and sparsity of the singular values in fully-connected networks. Furthermore, this bias becomes stronger with depth, and in the infinite-depth limit L → ∞ the rescaled representation cost R(f)/L converges to the L_0 norm ∥w∥_0 (the number of non-zero entries of w) in the first case and to the rank Rank(A) in the second.

For shallow (L = 2) nonlinear networks with a homogeneous activation, the representation cost also takes the form of an L_1 norm (Bach, 2017; Chizat & Bach, 2020; Ongie et al., 2020), leading to sparsity in the effective number of neurons in the hidden layer of the network. However, the representation cost of deeper networks does not resemble any typical norm (L_p or otherwise), though it still leads to some form of sparsity (Jacot et al., 2022b). Despite the absence of an explicit formula, we will show that the rescaled representation cost R(f)/L converges to a notion of rank in nonlinear networks as L → ∞, in analogy with infinite-depth linear networks.
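The convergence of the rescaled representation cost to the rank in the fully-connected linear case can be checked directly: since R(f)/L = ∥A∥_{2/L}^{2/L} = Σ_i σ_i^{2/L}, each nonzero singular value contributes a factor tending to 1 and each zero singular value contributes 0 as L → ∞. The following minimal numerical sketch (ours, not the paper's code) illustrates this.

```python
import numpy as np

# Rescaled representation cost of a depth-L fully-connected *linear*
# network representing f(x) = Ax: R(f)/L = sum_i sigma_i^(2/L),
# the L_{2/L} Schatten quasi-norm of A. As L grows, this tends to rank(A).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 5))  # rank-3, 5x5

sigma = np.linalg.svd(A, compute_uv=False)
sigma[sigma < 1e-10] = 0.0  # drop numerically-zero singular values

for L in [2, 8, 64, 4096]:
    rescaled_cost = np.sum(sigma ** (2.0 / L))
    print(f"L = {L:5d}: R(f)/L = {rescaled_cost:.4f}")
# The printed values decrease toward rank(A) = 3 as L increases.
```

Thresholding the singular values is needed only because floating-point SVD returns tiny but nonzero values in place of exact zeros, which would otherwise contribute spuriously at very large L.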

CONTRIBUTIONS

We first introduce two notions of rank: the Jacobian rank Rank_J(f) = max_x Rank[Jf(x)] and the Bottleneck rank Rank_BN(f), which is the smallest integer k such that f can be factorized as f = h ∘ g with inner dimension k. In general, Rank_J(f) ≤ Rank_BN(f), but for functions of the form f = ψ ∘ A ∘ φ (for a linear map A and two bijections ψ and φ) we have Rank_J(f) = Rank_BN(f) = Rank A. These two notions of rank satisfy the properties: (1) Rank f ∈ Z; (2) Rank(f ∘ g) ≤ min{Rank f, Rank g}; (3) Rank(f + g) ≤ Rank f + Rank g; (4) Rank(x ↦ Ax + b) = Rank A.

We then show that in the infinite-depth limit L → ∞ the rescaled representation cost of DNNs with a general homogeneous nonlinearity is sandwiched between the Jacobian and Bottleneck ranks:

Rank_J(f) ≤ lim_{L→∞} R(f)/L ≤ Rank_BN(f).

Furthermore, lim_{L→∞} R(f)/L satisfies properties (2-4) above. We also conjecture that the limiting representation cost equals its upper bound Rank_BN(f).

We then study how this bias towards low-rank functions translates to finite but large depths. We first show that for large depths the rescaled norm of the parameters ∥Ŵ∥²/L at any global minimum Ŵ is upper bounded by 1 + C_N/L for a constant C_N which depends on the training points. This implies that the resulting function is approximately rank 1 w.r.t. both the Jacobian and Bottleneck ranks. This is however problematic if we are trying to fit a 'true function' f* whose 'true rank' k = Rank_BN f* is larger than 1. Thankfully, we show that if k > 1 the constant C_N explodes as N → ∞, so that the above bound (∥Ŵ∥²/L ≤ 1 + C_N/L) is relevant only for very large depths when N is large. We show another upper bound ∥Ŵ∥²/L ≤ k + C/L with a constant C independent of N, suggesting the existence of a range of intermediate depths at which the network recovers the true rank k.
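The identity Rank_J(f) = Rank_BN(f) = Rank A for f = ψ ∘ A ∘ φ can be probed numerically: the chain rule gives Jf(x) = Dψ · A · Dφ with invertible diagonal factors, so the Jacobian has the same rank as A at every point. The sketch below (our illustration; the function names and tolerances are our choices, not the paper's) estimates the Jacobian by central finite differences and counts its numerically significant singular values.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 4))  # rank 2

def phi(x):
    # smooth elementwise bijection (strictly increasing), used for both
    # the inner and outer bijections of f = psi . A . phi
    return x + 0.5 * np.tanh(x)

def f(x):
    return phi(A @ phi(x))

def jacobian_rank(f, x, eps=1e-6, tol=1e-4):
    """Numerical rank of Jf(x) via central differences and SVD."""
    d = x.size
    J = np.empty((d, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    s = np.linalg.svd(J, compute_uv=False)
    return int(np.sum(s > tol * s[0]))  # relative singular-value cutoff

# Jacobian rank: maximum over sample points of Rank[Jf(x)]
rank_J = max(jacobian_rank(f, rng.standard_normal(4)) for _ in range(10))
print(rank_J)  # equals Rank(A) = 2 at every sampled point
```

Here the maximum over random points is a stand-in for the supremum max_x in the definition; for this f the rank is constant in x, so a single point already suffices.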
Finally, we discuss how rank recovery affects the topology of decision boundaries in classification and leads autoencoders to naturally be denoising, which we confirm with numerical experiments.

RELATED WORKS

The implicit bias of deep homogeneous networks has, to our knowledge, been much less studied than that of either linear networks or shallow nonlinear ones. Ongie & Willett (2022) study deep networks with only one nonlinear layer (all others being linear). Similarly, Le & Jegelka (2022) show a low-rank alignment phenomenon in a network whose last layers are linear. Closer to our setup is the analysis of the representation cost of deep homogeneous networks in Jacot et al. (2022b), which gives two reformulations of the optimization problem in the definition of the representation cost, with some implications for the sparsity of the representations, though the infinite-depth limit is not studied. A very similar analysis of the sparsifying effect of large depth on the global minima of L2-regularized networks is given in Timor et al. (2022); however, they only show that the optimal weight matrices are almost rank 1 (and only on average), while we show low-rank properties of the learned function itself, as well as the existence of a layer with almost rank-1 hidden representations.

