IMPLICIT BIAS OF LARGE DEPTH NETWORKS: A NOTION OF RANK FOR NONLINEAR FUNCTIONS

Abstract

We show that the representation cost of fully connected neural networks with homogeneous nonlinearities (which describes the implicit bias in function space of networks with $L_2$-regularization or with losses such as the cross-entropy) converges, as the depth of the network goes to infinity, to a notion of rank over nonlinear functions. We then inquire under which conditions the global minima of the loss recover the 'true' rank of the data: we show that for too large depths the global minimum will be approximately rank 1 (underestimating the rank); we then argue that there is a range of depths which grows with the number of datapoints where the true rank is recovered. Finally, we discuss the effect of the rank of a classifier on the topology of the resulting class boundaries and show that autoencoders with optimal nonlinear rank are naturally denoising.

1. INTRODUCTION

There has been a lot of recent interest in the so-called implicit bias of DNNs, which describes what functions are favored by a network when fitting the training data. Different network architectures (choice of nonlinearity, depth, width of the network, and more) and training procedures (initialization, optimization algorithm, loss) can lead to widely different biases. In contrast to the so-called kernel regime where the implicit bias is described by the Neural Tangent Kernel (Jacot et al., 2018), there are several active regimes (also called rich or feature-learning regimes), whose implicit bias often features a form of sparsity that is absent from the kernel regime. Such active regimes have been observed for example in DNNs with small initialization (Chizat & Bach, 2018; Rotskoff & Vanden-Eijnden, 2018; Li et al., 2020; Jacot et al., 2022a), with $L_2$-regularization (Savarese et al., 2019; Ongie et al., 2020; Jacot et al., 2022b) or when trained on exponentially decaying losses (Gunasekar et al., 2018a; b; Soudry et al., 2018; Du et al., 2018; Ji & Telgarsky, 2018; Chizat & Bach, 2020; Ji & Telgarsky, 2020). In the latter two cases, the implicit bias is described by the representation cost $R(f) = \min_{\mathbf{W} : f_{\mathbf{W}} = f} \|\mathbf{W}\|^2$, where $f$ is a function that can be represented by the network, the minimization is over all parameters $\mathbf{W}$ that result in a network function $f_{\mathbf{W}}$ equal to $f$, the parameters $\mathbf{W}$ form a vector, and $\|\mathbf{W}\|$ is the $L_2$-norm. The representation cost can in some cases be explicitly computed for linear networks. For diagonal linear networks, the representation cost of a linear function $f(x) = w^T x$ equals the $L_p$ norm $R(f) = L \|w\|_p^p$ of the vector $w$ for $p = 2/L$ (Gunasekar et al., 2018a; Moroshko et al., 2020), where $L$ is the depth of the network. For fully-connected linear networks, the representation cost of a linear function $f(x) = Ax$ equals the $L_p$-Schatten norm (the $L_p$ norm of the singular values) $R(f) = L \|A\|_p^p$ (Dai et al., 2021).
A common thread between these examples is a bias towards some notion of sparsity: sparsity of the entries of the vector $w$ in diagonal networks and sparsity of the singular values in fully connected networks. Furthermore, this bias becomes stronger with depth, and in the infinite depth limit $L \to \infty$ the rescaled representation cost $R(f)/L$ converges to the $L_0$ norm $\|w\|_0$ (the number of non-zero entries in $w$) in the first case and to the rank $\mathrm{Rank}(A)$ in the second. For shallow ($L = 2$) nonlinear networks with a homogeneous activation, the representation cost also takes the form of an $L_1$ norm (Bach, 2017; Chizat & Bach, 2020; Ongie et al., 2020), leading to sparsity in the effective number of neurons in the hidden layer of the network. However, the representation cost of deeper networks does not resemble any typical norm ($L_p$ or not), though it still leads to some form of sparsity (Jacot et al., 2022b). Despite the absence of an explicit formula, we will show that the rescaled representation cost $R(f)/L$ converges to some notion of rank in nonlinear networks as $L \to \infty$, in analogy to infinite depth linear networks.
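The convergence of the rescaled cost to the rank in the fully-connected linear case can be checked numerically. The sketch below assumes the Schatten exponent is $p = 2/L$ (as for diagonal networks), so that $R(f)/L = \sum_k s_k(A)^{2/L}$; the matrix `A`, the tolerance `tol` and the function name are hypothetical choices made for illustration only.

```python
import numpy as np

def rescaled_linear_cost(A, depth, tol=1e-10):
    """R(f)/L = sum_k s_k(A)^(2/L) for f(x) = Ax in a depth-L linear network."""
    s = np.linalg.svd(A, compute_uv=False)
    # Drop numerically zero singular values: a tiny positive value raised
    # to the power 2/L would otherwise contribute almost 1 at large depth.
    s = s[s > tol]
    return float(np.sum(s ** (2.0 / depth)))

# Hypothetical rank-2 map on R^3 with singular values 3 and 0.5.
A = np.diag([3.0, 0.5, 0.0])

for depth in (2, 10, 100, 1000):
    print(depth, rescaled_linear_cost(A, depth))
```

At depth 2 the quantity is the nuclear norm of `A`; as the depth grows, each nonzero singular value contributes a term approaching 1, so the sum approaches the rank.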

CONTRIBUTIONS

We first introduce two notions of rank: the Jacobian rank $\mathrm{Rank}_J(f) = \max_x \mathrm{Rank}[Jf(x)]$ and the Bottleneck rank $\mathrm{Rank}_{BN}(f)$, which is the smallest integer $k$ such that $f$ can be factorized as $f = h \circ g$ with inner dimension $k$. In general, $\mathrm{Rank}_J(f) \le \mathrm{Rank}_{BN}(f)$, but for functions of the form $f = \psi \circ A \circ \phi$ (for a linear map $A$ and two bijections $\psi$ and $\phi$), we have $\mathrm{Rank}_J(f) = \mathrm{Rank}_{BN}(f) = \mathrm{Rank} A$. These two notions of rank satisfy the properties (1) $\mathrm{Rank} f \in \mathbb{Z}$; (2) $\mathrm{Rank}(f \circ g) \le \min\{\mathrm{Rank} f, \mathrm{Rank} g\}$; (3) $\mathrm{Rank}(f + g) \le \mathrm{Rank} f + \mathrm{Rank} g$; (4) $\mathrm{Rank}(x \mapsto Ax + b) = \mathrm{Rank} A$. We then show that in the infinite depth limit $L \to \infty$ the rescaled representation cost of DNNs with a general homogeneous nonlinearity is sandwiched between the Jacobian and Bottleneck ranks: $\mathrm{Rank}_J(f) \le \lim_{L \to \infty} \frac{R(f)}{L} \le \mathrm{Rank}_{BN}(f)$. Furthermore, $\lim_{L \to \infty} R(f)/L$ satisfies properties (2-4) above. We also conjecture that the limiting representation cost equals its upper bound $\mathrm{Rank}_{BN}(f)$. We then study how this bias towards low-rank functions translates to finite but large depths. We first show that for large depths the rescaled norm of the parameters $\|\hat{\mathbf{W}}\|^2/L$ at any global minimum $\hat{\mathbf{W}}$ is upper bounded by $1 + C_N/L$ for a constant $C_N$ which depends on the training points. This implies that the resulting function has approximately rank 1 w.r.t. the Jacobian and Bottleneck ranks. This is however problematic if we are trying to fit a 'true function' $f^*$ whose 'true rank' $k = \mathrm{Rank}_{BN} f^*$ is larger than 1. Thankfully we show that if $k > 1$ the constant $C_N$ explodes as $N \to \infty$, so that the above bound ($\|\hat{\mathbf{W}}\|^2/L \le 1 + C_N/L$) is relevant only for very large depths when $N$ is large. We show another upper bound $\|\hat{\mathbf{W}}\|^2/L \le k + C/L$ with a constant $C$ independent of $N$, suggesting the existence of a range of intermediate depths where the network recovers the true rank $k$.
Finally, we discuss how rank recovery affects the topology of decision boundaries in classification and leads autoencoders to naturally be denoising, which we confirm with numerical experiments.

RELATED WORKS

The implicit bias of deep homogeneous networks has, to our knowledge, been much less studied than those of either linear networks or shallow nonlinear ones. (Ongie & Willett, 2022 ) study deep networks with only one nonlinear layer (all others being linear). Similarly (Le & Jegelka, 2022) show a low-rank alignment phenomenon in a network whose last layers are linear. Closer to our setup is the analysis of the representation cost of deep homogeneous networks in (Jacot et al., 2022b) , which gives two reformulations for the optimization in the definition of the representation cost, with some implications on the sparsity of the representations, though the infinite depth limit is not studied. A very similar analysis of the sparsity effect of large depth on the global minima of L 2 -regularized networks is given in (Timor et al., 2022) , however, they only show how the optimal weight matrices are almost rank 1 (and only on average), while we show low-rank properties of the learned function, as well as the existence of a layer with almost rank 1 hidden representations.

2. PRELIMINARIES

In this section, we define fully-connected DNNs and their representation cost.

FULLY CONNECTED DNNS

In this paper, we study fully connected DNNs with $L + 1$ layers numbered from 0 (input layer) to $L$ (output layer). Each layer $\ell \in \{0, \dots, L\}$ has $n_\ell$ neurons, with $n_0 = d_{in}$ the input dimension and $n_L = d_{out}$ the output dimension. The pre-activations $\tilde{\alpha}_\ell(x) \in \mathbb{R}^{n_\ell}$ and activations $\alpha_\ell(x) \in \mathbb{R}^{n_\ell}$ of the layers of the network are defined inductively as $\alpha_0(x) = x$, $\tilde{\alpha}_\ell(x) = W_\ell \alpha_{\ell-1}(x) + b_\ell$, $\alpha_\ell(x) = \sigma(\tilde{\alpha}_\ell(x))$, for the $n_\ell \times n_{\ell-1}$ connection weight matrix $W_\ell$, the $n_\ell$-dimensional bias vector $b_\ell$ and the nonlinearity $\sigma : \mathbb{R} \to \mathbb{R}$ applied entrywise to the vector $\tilde{\alpha}_\ell(x)$. The parameters of the network are the collection of all connection weight matrices and bias vectors $\mathbf{W} = (W_1, b_1, \dots, W_L, b_L)$. We call the network function $f_{\mathbf{W}} : \mathbb{R}^{d_{in}} \to \mathbb{R}^{d_{out}}$ the function that maps an input $x$ to the pre-activations of the last layer $\tilde{\alpha}_L(x)$. In this paper, we will focus on homogeneous nonlinearities $\sigma$, i.e. such that $\sigma(\lambda x) = \lambda \sigma(x)$ for any $\lambda \ge 0$ and $x \in \mathbb{R}$, such as the traditional ReLU $\sigma(x) = \max\{0, x\}$. In our theoretical analysis we will assume that the nonlinearity is of the form $\sigma_a(x) = x$ if $x \ge 0$ and $\sigma_a(x) = ax$ otherwise, for some $a \in (-1, 1)$, since for a general homogeneous nonlinearity $\sigma$ (which is not proportional to the identity function, the constant zero function or the absolute value function), there are scalars $a \in (-1, 1)$, $b \in \mathbb{R}$ and $c \in \{+1, -1\}$ such that $\sigma(x) = c\,\sigma_a(bx)$; as a result, the global minima and representation cost are the same up to scaling. Remark 1. By a simple generalization of the work of (Arora et al., 2018), the set of functions that can be represented by networks (with any finite widths and depth) with such nonlinearities is the set of piecewise linear functions with a finite number of linear regions.
In contrast, the three types of homogeneous nonlinearities we rule out (the identity, the constant, or the absolute value) lead to different sets of functions: the linear functions, the constant functions, or the piecewise linear functions f such that lim t→∞ ∥f (tx) -f (-tx)∥ is finite for all directions x ∈ R din (or possibly a subset of this class of functions). While some of the results of this paper could probably be generalized to the third case up to a few details, we rule it out for the sake of simplicity. Remark 2. All of our results will be for sufficiently wide networks, i.e. for all widths n such that n ℓ ≥ n * ℓ for some minimal widths n * ℓ . Moreover these results are O(0) in the width, in the sense that above the threshold n * ℓ the constants do not depend on the widths n ℓ . When there are a finite number of datapoints N , it was shown by (Jacot et al., 2022b ) that a width of N (N + 1) is always sufficient, that is we can always take n * ℓ = N (N + 1) (though it is observed empirically that a much smaller width can be sufficient in some cases). When we are trying to fit a piecewise linear function over the whole input domain Ω, the width required depends on the number of linear regions (He et al., 2018) .
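The network definition of this section can be sketched in a few lines of NumPy. This is only an illustration under assumed choices: the widths, the value `a = 0.1`, the random parameters and all function names are hypothetical and not taken from the paper's experiments.

```python
import numpy as np

def sigma_a(x, a=0.1):
    """Homogeneous nonlinearity: x for x >= 0, a * x otherwise."""
    return np.where(x >= 0, x, a * x)

def network_function(params, x, a=0.1):
    """Return the last-layer pre-activations for params = [(W_1, b_1), ..., (W_L, b_L)]."""
    activation = x
    for W, b in params:
        pre_activation = W @ activation + b
        activation = sigma_a(pre_activation, a)
    # The network function outputs pre-activations, not activations.
    return pre_activation

def squared_parameter_norm(params):
    """Squared L2 norm of all weights and biases, viewed as one vector."""
    return float(sum(np.sum(W ** 2) + np.sum(b ** 2) for W, b in params))

rng = np.random.default_rng(0)
widths = [4, 8, 8, 3]  # n_0 = d_in = 4, n_L = d_out = 3, depth L = 3
params = [(rng.normal(size=(n_out, n_in)), rng.normal(size=n_out))
          for n_in, n_out in zip(widths[:-1], widths[1:])]
x = rng.normal(size=4)
y = network_function(params, x)
```

The representation cost of a function is then the smallest value of `squared_parameter_norm` over all parameters whose network function agrees with it on the domain.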

REPRESENTATION COST

The representation cost $R(f; \Omega, \sigma, L)$ is the squared norm of the optimal weights $\mathbf{W}$ that represent the function $f_{|\Omega}$: $R(f; \Omega, \sigma, L) = \min_{\mathbf{W} : f_{\mathbf{W}|\Omega} = f_{|\Omega}} \|\mathbf{W}\|^2$, where the minimum is taken over all weights $\mathbf{W}$ of a depth $L$ network (with some finite widths $n$) such that $f_{\mathbf{W}}(x) = f(x)$ for all $x \in \Omega$. If no such weights exist, we define $R(f; \Omega, \sigma, L) = \infty$. The representation cost describes the natural bias on the represented function $f_{\mathbf{W}}$ induced by adding $L_2$-regularization on the weights $\mathbf{W}$: $\min_{\mathbf{W}} C(f_{\mathbf{W}}) + \lambda \|\mathbf{W}\|^2 = \min_f C(f) + \lambda R(f; \Omega, \sigma, L)$ for any cost $C$ (defined on functions $f : \Omega \to \mathbb{R}^{d_{out}}$), where the minimization on the right is over all functions $f$ that can be represented by a depth $L$ network with nonlinearity $\sigma$. Therefore, if we can give a simple description of the representation cost of a function $f$, we can better understand what type of functions $f$ are favored by a DNN with nonlinearity $\sigma$ and depth $L$. Remark 3. Note that the representation cost does not only play a role in the presence of $L_2$-regularization; it also describes the implicit bias of networks trained on an exponentially decaying loss, such as the cross-entropy loss, as described in (Soudry et al., 2018; Gunasekar et al., 2018a; Chizat & Bach, 2020).

3. INFINITELY DEEP NETWORKS

In this section, we first give 4 properties that a notion of rank on piecewise linear functions should satisfy and introduce two notions of rank that satisfy these properties. We then show that the infinitedepth limit L → ∞ of the rescaled representation cost R(f ; Ω, σ a , L)/L is sandwiched between the two notions of rank we introduced, and that this limit satisfies 3 of the 4 properties we introduced.

RANK OF PIECEWISE LINEAR FUNCTIONS

There is no single natural definition of rank for nonlinear functions, but we will provide two of them in this section and compare them. We focus on notions of rank for piecewise linear functions with a finite number of linear regions since these are the functions that can be represented by DNNs with homogeneous nonlinearities (this is a corollary of Theorem 2.1 from (Arora et al., 2018); for more details, see Appendix E.1). We call such functions finite piecewise linear functions (FPLF). Let us first state a set of properties that any notion of rank on FPLF should satisfy, inspired by properties of rank for linear functions:
1. The rank of a function is an integer $\mathrm{Rank}(f) \in \mathbb{N}$.
2. $\mathrm{Rank}(f \circ g) \le \min\{\mathrm{Rank} f, \mathrm{Rank} g\}$.
3. $\mathrm{Rank}(f + g) \le \mathrm{Rank} f + \mathrm{Rank} g$.
4. If $f$ is affine ($f(x) = Ax + b$) then $\mathrm{Rank} f = \mathrm{Rank} A$.
Taking $g = \mathrm{id}$ or $f = \mathrm{id}$ in (2) implies $\mathrm{Rank}(f) \le \min\{d_{in}, d_{out}\}$. Properties (2) and (4) also imply that for any bijection $\phi$ on $\mathbb{R}^d$, $\mathrm{Rank}(\phi) = \mathrm{Rank}(\phi^{-1}) = d$. Note that these properties do not uniquely define a notion of rank. Indeed we will now give two notions of rank which satisfy these properties but do not always match. However any such notion of rank must agree on a large family of functions: Property 2 implies that Rank is invariant under pre- and post-composition with bijections (see Appendix A), which implies that the rank of functions of the form $\psi \circ f \circ \phi$ for an affine function $f(x) = Ax + b$ and two (piecewise linear) bijections $\psi$ and $\phi$ always equals $\mathrm{Rank} A$. The first notion of rank we consider is based on the rank of the Jacobian of the function: Definition 1. The Jacobian rank of a FPLF $f$ is $\mathrm{Rank}_J(f; \Omega) = \max_x \mathrm{Rank}\, Jf(x)$, taking the max over points $x$ where $f$ is differentiable. Note that since the Jacobian is constant over the linear regions of the FPLF $f$, we only need to take the maximum over every linear region. As observed in (Feng et al., 2022), the Jacobian rank measures the intrinsic dimension of the output set $f(\Omega)$.
The second notion of rank is inspired by the fact that for linear functions $f$, the rank of $f$ equals the minimal dimension $k$ such that $f$ can be written as the composition of two linear functions $f = g \circ h$ with inner dimension $k$. We define the bottleneck rank as: Definition 2. The bottleneck rank $\mathrm{Rank}_{BN}(f; \Omega)$ is the smallest integer $k \in \mathbb{N}$ such that there is a factorization as the composition of two FPLFs $f_{|\Omega} = (g \circ h)_{|\Omega}$ with inner dimension $k$. The following proposition relates these two notions of rank: Proposition 1. Both $\mathrm{Rank}_J$ and $\mathrm{Rank}_{BN}$ satisfy properties 1-4 above. Furthermore:
• For any FPLF $f$ and any set $\Omega$, $\mathrm{Rank}_J(f; \Omega) \le \mathrm{Rank}_{BN}(f; \Omega)$.
• There exists a FPLF $f : \mathbb{R}^2 \to \mathbb{R}^2$ and a domain $\Omega$ such that $\mathrm{Rank}_J(f; \Omega) = 1$ and $\mathrm{Rank}_{BN}(f; \Omega) = 2$.
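As an illustration of Definition 1, the Jacobian rank of a simple FPLF can be estimated by sampling points and taking finite differences. The function `f` below is a hypothetical example that factorizes through a scalar, so its bottleneck rank is at most 1. The sampling scheme, step size `eps` and tolerance `tol` are assumptions of this sketch, and sampling can only give a lower estimate if some linear regions are missed.

```python
import numpy as np

def f(x):
    """FPLF R^2 -> R^2 factorized as h(g(x)) with g(x) = x_1 + x_2 and h(z) = (|z|, z)."""
    z = x[0] + x[1]
    return np.array([abs(z), z])

def numerical_jacobian(func, x, eps=1e-6):
    """Central finite differences; accurate inside a linear region of an FPLF."""
    columns = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        columns.append((func(x + step) - func(x - step)) / (2 * eps))
    return np.stack(columns, axis=1)

def jacobian_rank(func, points, tol=1e-6):
    """Maximum numerical rank of the Jacobian over the sampled points."""
    return max(int(np.linalg.matrix_rank(numerical_jacobian(func, x), tol=tol))
               for x in points)

rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(200, 2))
estimated_rank = jacobian_rank(f, points)
```

Here both rows of the Jacobian are proportional to $(1, 1)$ in every linear region, so the estimate agrees with the inner dimension of the factorization.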

INFINITE-DEPTH REPRESENTATION COST

In the infinite-depth limit, the rescaled representation cost of DNNs converges to a value $R_\infty(f; \Omega, \sigma_a) = \lim_{L \to \infty} \frac{R(f; \Omega, \sigma_a, L)}{L}$ 'sandwiched' between the above two notions of rank: Theorem 1. For any bounded domain $\Omega$ and any FPLF $f$, $\mathrm{Rank}_J(f; \Omega) \le R_\infty(f; \Omega, \sigma_a) \le \mathrm{Rank}_{BN}(f; \Omega)$. Furthermore the limiting representation cost $R_\infty(f; \Omega, \sigma_a)$ satisfies properties 2 to 4. Proof. The lower bound follows from taking $L \to \infty$ in Proposition 3 (see Section 4). The upper bound is constructive: a function $f = h \circ g$ can be represented as a network in three consecutive parts: a first part (of depth $L_g$) representing $g$, a final part (of depth $L_h$) representing $h$, and in the middle $L - L_g - L_h$ identity layers on a $k$-dimensional space. The contribution to the norm of the parameters of the middle part is $k(L - L_g - L_h)$ and it dominates as $L \to \infty$, since the contributions of the first and final parts are finite. Note that $R_\infty(f; \Omega, \sigma_a)$ might satisfy property 1 as well; we were simply not able to prove it. Theorem 1 implies that for functions of the form $f = \psi \circ A \circ \phi$ for bijections $\psi$ and $\phi$, $R_\infty(f; \Omega, \sigma_a) = \mathrm{Rank}_J(f; \Omega) = \mathrm{Rank}_{BN}(f; \Omega) = \mathrm{Rank} A$. Remark 4. Motivated by some aspects of the proofs and a general intuition (which is described in Section 4) we conjecture that $R_\infty(f; \Omega, \sigma_a) = \mathrm{Rank}_{BN}(f; \Omega)$. This would imply that the limiting representation cost does not depend on the choice of nonlinearity, as long as it is of the form $\sigma_a$ (which we already proved is the case for functions of the form $\psi \circ A \circ \phi$). This result suggests that large-depth neural networks are biased towards functions which have a low Jacobian rank and (if our above-mentioned conjecture is true) low Bottleneck rank, much like linear networks are biased towards low-rank linear maps. It also suggests that the rescaled norm of the parameters $\|\mathbf{W}\|^2/L$ is an approximate upper bound on the Jacobian rank (and if our conjecture is true on the Bottleneck rank too) of the function $f_{\mathbf{W}}$.
In the next section, we partly formalize these ideas.
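The middle part of the construction in the proof of Theorem 1 can be illustrated numerically. The sketch below assumes that the $k$-dimensional hidden representation is entrywise nonnegative, so that $\sigma_a$ acts as the identity on it; this assumption, the zero biases, and the values chosen for the norms of the first and last parts are hypothetical and serve only to show how the cost $k(L - L_g - L_h)$ arises.

```python
import numpy as np

def sigma_a(x, a=0.1):
    """Homogeneous nonlinearity: x for x >= 0, a * x otherwise."""
    return np.where(x >= 0, x, a * x)

def identity_block(k, n_layers, z):
    """Apply n_layers identity layers (W = I_k, zero bias) and sum their squared norms."""
    out, cost = z, 0.0
    for _ in range(n_layers):
        W = np.eye(k)
        out = sigma_a(W @ out)
        cost += float(np.sum(W ** 2))  # each identity layer costs ||I_k||_F^2 = k
    return out, cost

def rescaled_upper_bound(L, k, L_g, L_h, norm_g, norm_h):
    """(k (L - L_g - L_h) + ||W_g||^2 + ||W_h||^2) / L for the three-part network."""
    return (k * (L - L_g - L_h) + norm_g + norm_h) / L

rng = np.random.default_rng(0)
k = 3
z = np.abs(rng.normal(size=(k, 5)))  # assumed nonnegative representations
out, cost = identity_block(k, n_layers=20, z=z)

for L in (10, 100, 10_000):
    print(L, rescaled_upper_bound(L, k, L_g=2, L_h=2, norm_g=50.0, norm_h=50.0))
```

The finite contributions of the first and last parts are divided by $L$, so the rescaled bound tends to the inner dimension $k$ as the depth grows.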

4. RANK RECOVERY IN FINITE DEPTH NETWORKS

In this section, we study how the (approximate) rank of minimizer functions $f_{\hat{\mathbf{W}}}$ (i.e. functions at a global minimum $\hat{\mathbf{W}}$) for the MSE $\mathcal{L}_\lambda(\mathbf{W}) = \frac{1}{N} \sum_{i=1}^N (f_{\mathbf{W}}(x_i) - y_i)^2 + \frac{\lambda}{L} \|\mathbf{W}\|^2$ with data sampled from a distribution with support $\Omega$ is affected by the depth $L$. In particular, when the outputs are generated from a true function $f^*$ (i.e. $y_i = f^*(x_i)$) with $k = \mathrm{Rank}_{BN}(f^*; \Omega)$, we study under which conditions the 'true rank' $k$ is recovered.

APPROXIMATE RANK 1 REGIME

One can build a function with BN-rank 1 that fits any training data (for example by first projecting the input to a line with no overlap and then mapping the points from the line to the outputs with a piecewise linear function). This implies the following bound: Proposition 2. There is a constant $C_N$ (which depends on the training data only) such that for any large enough $L$, at any global minimum $\hat{\mathbf{W}}$ of the loss $\mathcal{L}_\lambda$ the represented function $f_{\hat{\mathbf{W}}}$ satisfies $\frac{1}{L} R(f_{\hat{\mathbf{W}}}; \Omega, \sigma_a, L) \le 1 + \frac{C_N}{L}$. Proof. We use the same construction as in the proof of Theorem 1 for any fitting rank 1 function. This bound implies that the function $f_{\hat{\mathbf{W}}}$ represented by the network at a global minimum is approximately rank 1 both w.r.t. the Jacobian and Bottleneck ranks, showing the bias towards low-rank functions even for finite (but possibly very large) depths. Jacobian Rank: For any function $f$, the rescaled representation cost $\frac{1}{L} R(f; \Omega, \sigma_a, L)$ bounds the $L_p$-Schatten norm of the Jacobian (with $p = 2/L$) at any point: Proposition 3. Let $f$ be a FPLF, then at any differentiable point $x$, we have $\|Jf(x)\|_{2/L}^{2/L} := \sum_{k=1}^{\mathrm{Rank}\, Jf(x)} s_k(Jf(x))^{2/L} \le \frac{1}{L} R(f; \Omega, \sigma_a, L)$, where $s_k(Jf(x))$ is the $k$-th singular value of the Jacobian $Jf(x)$. Together with Proposition 2, this implies that the second singular value of the Jacobian of any minimizer function must be exponentially small in $L$: $s_2(Jf_{\hat{\mathbf{W}}}(x)) \le \left( \frac{1 + C_N/L}{2} \right)^{L/2}$. Bottleneck Rank: We can further prove the existence of a bottleneck in any minimizer network, i.e. a layer $\ell$ whose hidden representation is approximately rank 1: Proposition 4.
For any global minimum $\hat{\mathbf{W}}$ of the $L_2$-regularized loss $\mathcal{L}_\lambda$ with $\lambda > 0$ and any set of $\tilde{N}$ datapoints $\tilde{X} \in \mathbb{R}^{d_{in} \times \tilde{N}}$ (which do not have to be the training set $X$) with non-constant outputs, there is a layer $\ell_0$ such that the first two singular values $s_1, s_2$ of the hidden representation $Z_{\ell_0} \in \mathbb{R}^{n_{\ell_0} \times \tilde{N}}$ (whose columns are the activations $\alpha_{\ell_0}(x_i)$ for all the inputs $x_i$ in $\tilde{X}$) satisfy $\frac{s_2}{s_1} = O(L^{-1/4})$. The fact that the global minima of the loss are approximately rank 1 not only in the Jacobian but also in the Bottleneck sense further supports our conjecture that the limiting representation cost equals the Bottleneck rank $R_\infty = \mathrm{Rank}_{BN}$. Furthermore, it shows that the global minimum of the $L_2$-regularized loss is biased towards low-rank functions for large depths, since it fits the data with (approximately) the smallest possible rank.
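The bound on the second singular value obtained from Propositions 2 and 3 follows from a short step: since $s_1 \ge s_2$, we have $2 s_2^{2/L} \le s_1^{2/L} + s_2^{2/L} \le 1 + C_N/L$, hence $s_2 \le ((1 + C_N/L)/2)^{L/2}$. The sketch below simply evaluates this expression; the value `C_N = 10` is a hypothetical constant chosen for illustration.

```python
def second_singular_value_bound(L, C_N):
    """Upper bound ((1 + C_N / L) / 2)^(L / 2) on s_2 of the Jacobian of a minimizer."""
    return ((1.0 + C_N / L) / 2.0) ** (L / 2.0)

# Hypothetical constant C_N = 10: the bound is vacuous at small depth
# (it exceeds 1) and decays exponentially once L is larger than C_N.
for L in (4, 20, 100, 200):
    print(L, second_singular_value_bound(L, C_N=10.0))
```

The bound only becomes informative once $L$ exceeds $C_N$, which is consistent with the discussion of how $C_N$ grows with the number of datapoints.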

RANK RECOVERY FOR INTERMEDIATE DEPTHS

However, learning rank 1 functions is not always a good thing. Assume that we are trying to fit a 'true function' $f^* : \Omega \to \mathbb{R}^{d_{out}}$ with a certain rank $k = \mathrm{Rank}_{BN}(f^*; \Omega)$. If $k > 1$ the global minima of a large depth network will end up underestimating the true rank $k$. In contrast, in the linear setting underestimating the true rank is almost never a problem: for example in matrix completion one always wants to find a minimal rank solution (Candès & Recht, 2009; Arora et al., 2019). The difference is due to the fact that rank 1 nonlinear functions can fit any finite training set, which is not the case in the linear case. Thankfully, for large datasets it becomes more and more difficult to underestimate the rank, since for large $N$ fitting the data with a rank 1 function requires large derivatives, which in turn implies a large parameter norm: Theorem 2. Given a Jacobian-rank $k$ true function $f^* : \Omega \to \mathbb{R}^{d_{out}}$ on a bounded domain $\Omega$, then for all $\epsilon$ there is a constant $c_\epsilon$ such that for any BN-rank 1 function $\hat{f} : \Omega \to \mathbb{R}^{d_{out}}$ that fits a dataset $x_1, \dots, x_N$ sampled i.i.d. from a distribution $p$ with support $\Omega$, i.e. $\hat{f}(x_i) = f^*(x_i)$, we have $\frac{1}{L} R(\hat{f}; \Omega, \sigma_a, L) > c_\epsilon N^{\frac{2}{L}\left(1 - \frac{1}{k}\right)}$ with probability at least $1 - \epsilon$. Proof. We show that there is a point $x \in \Omega$ with large derivative $\|J\hat{f}(x)\|_{op} \ge \frac{\mathrm{TSP}(y_1, \dots, y_N)}{\mathrm{diam}(x_1, \dots, x_N)}$ for the Traveling Salesman Problem $\mathrm{TSP}(y_1, \dots, y_N)$, i.e. the length of the shortest path passing through every point $y_1, \dots, y_N$, and the diameter $\mathrm{diam}(x_1, \dots, x_N)$ of the points $x_1, \dots, x_N$. This follows from the fact that the image of $\hat{f}$ is a line going through all $y_i$s, and if $i$ and $j$ are the first and last points visited, the image of the segment $[x_i, x_j]$ is a line from $y_i$ to $y_j$ passing through all $y_k$s. The diameter is bounded by $\mathrm{diam}\,\Omega$ while the TSP scales as $N^{1 - \frac{1}{k}}$ (Beardwood et al., 1959) since the $y_i$s are sampled from a $k$-dimensional distribution.
The bound on the parameter norm then follows from Proposition 3. This implies that the constant $C_N$ in Proposition 2 explodes as the number of datapoints $N$ increases, i.e. as $N$ increases, larger and larger depths are required for the bound in Proposition 2 to be meaningful. In that case, a better upper bound on the norm of the parameters can be obtained, which implies that the functions $f_{\hat{\mathbf{W}}}$ at global minima are approximately rank $k$ or less (at least in the Jacobian sense, according to Proposition 3): Proposition 5. Let the 'true function' $f^* : \Omega \to \mathbb{R}^{d_{out}}$ be piecewise linear with $\mathrm{Rank}_{BN}(f^*) = k$, then there is a constant $C$ which depends on $f^*$ only such that any minimizer function $f_{\hat{\mathbf{W}}}$ satisfies $\frac{1}{L} R(f_{\hat{\mathbf{W}}}; \Omega, \sigma_a, L) \le \frac{1}{L} R(f^*; \Omega, \sigma_a, L) \le k + \frac{C}{L}$. Theorem 2 and Proposition 5 imply that if the number of datapoints $N$ is sufficiently large ($N > \left( \frac{k + C/L}{c} \right)^{\frac{kL}{2k-2}}$), there are parameters $\mathbf{W}^*$ that fit the true function $f^*$ with a smaller parameter norm than any choice of parameters $\mathbf{W}$ that fit the data with a rank 1 function. In that case, the global minima will not be rank 1 and might instead recover the true rank $k$. Another interpretation is that since the constant $C$ does not depend on the number of training points $N$ (in contrast to $C_N$), there is a range of depths (which grows as $N \to \infty$) where the upper bound of Proposition 5 is below that of Proposition 2. We expect rank recovery to happen roughly in this range of depths: too small depths can lead to an overestimation of the rank, while too large depths can lead to an underestimation. Remark 5.
Note that in our experiments, we were not able to observe gradient descent converging to a solution that underestimates the true rank, even for very deep networks. This is probably due to gradient descent converging to one of the many local minima in the loss surface of very deep $L_2$-regularized DNNs. Some recent theoretical results offer a possible explanation for why gradient descent naturally avoids rank 1 solutions: the proof of Theorem 2 shows that rank 1 fitting functions have exploding gradient as $N \to \infty$, and such high gradient functions are known (at the moment only for shallow networks with 1D inputs) to correspond to narrow minima (Mulayoff et al., 2021). Some of our results can be applied to local minima $\hat{\mathbf{W}}$ with a small norm: Proposition 3 implies that the Jacobian rank of $f_{\hat{\mathbf{W}}}$ is approximately bounded by $\|\hat{\mathbf{W}}\|^2/L$. Proposition 4 also applies to local minima, but only if $\|\hat{\mathbf{W}}\|^2/L \le 1 + C/L$ for some constant $C$, though it could be generalized.
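The comparison between the lower bound of Theorem 2 and the upper bound of Proposition 5 can be made concrete with a small computation. The constants `C` and `c` (standing for $C$ and $c_\epsilon$) are not known in closed form; the values below are hypothetical and only illustrate how the threshold on $N$ is obtained by solving $c\,N^{\frac{2}{L}(1 - \frac{1}{k})} > k + C/L$ for $N$.

```python
def rank_one_lower_bound(N, k, L, c):
    """Theorem 2: lower bound c * N^((2/L)(1 - 1/k)) on R/L for BN-rank 1 fits."""
    return c * N ** ((2.0 / L) * (1.0 - 1.0 / k))

def rank_k_upper_bound(k, L, C):
    """Proposition 5: upper bound k + C/L on R/L for the minimizer."""
    return k + C / L

def min_datapoints_for_rank_recovery(k, L, C, c):
    """Smallest N (as a real number) above which the rank 1 lower bound exceeds k + C/L."""
    if k <= 1:
        raise ValueError("the threshold is only defined for true rank k > 1")
    return ((k + C / L) / c) ** (k * L / (2.0 * k - 2.0))

# Hypothetical constants: true rank k = 2, depth L = 10, C = 5, c = 1.
threshold = min_datapoints_for_rank_recovery(k=2, L=10, C=5.0, c=1.0)
print(threshold)
```

Because the exponent $kL/(2k-2)$ grows linearly with $L$, the number of datapoints needed to rule out rank 1 fits grows exponentially with depth under these assumptions, which matches the statement that too large depths lead to underestimating the rank.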

DISCUSSION

We now propose a tentative explanation for the phenomenon observed in this section. In contrast to the rest of the paper, this discussion is informal. Ideally, we want to learn functions $f$ which can be factorized as a composition $h \circ g$ so that not only the inner dimension is small but the two functions $g, h$ are not 'too complex'. These two objectives are often contradictory and one needs to find a trade-off between the two. Instead of optimizing the bottleneck rank, one might want to optimize with a regularization term of the form
$$\min_{f = h \circ g} k + \gamma \left( C(g) + C(h) \right), \qquad (1)$$
optimizing over all possible factorizations $f = h \circ g$ of $f$ with inner dimension $k$, where $C(g)$ and $C(h)$ are measures of the complexity of $g$ and $h$ respectively. The parameter $\gamma \ge 0$ allows us to tune the balance between the minimization of the inner dimension and the complexity of $g$ and $h$, recovering the Bottleneck rank when $\gamma = 0$. For small $\gamma$ the minimizer is always rank 1 (since it is always possible to fit a finite dataset with a rank 1 function in the absence of restriction on the complexity of $g$ and $h$), but with the right choice of $\gamma$ one can recover the true rank. Some aspects of the proof techniques we used in this paper suggest that large-depth DNNs are optimizing such a cost (or an approximation thereof). Consider a deep network that fits with minimal parameter norm a function $f$; if we add more layers to the network it is natural to assume that the new optimal representation of $f$ will be almost the same as that of the shallower network with some added (approximate) identity layers. The interesting question is: where are those identity layers added? The cost of adding an identity layer at a layer $\ell$ equals the dimension $d_\ell$ of the hidden representation of the inputs at $\ell$. It is therefore optimal to add identity layers where the hidden representations have minimal dimension.
This suggests that for large depths the optimal representation of a function $f$ approximately takes the form of $L_g$ layers representing $g$, then $L - L_g - L_h$ identity layers, and finally $L_h$ layers representing $h$, for some factorization $f = h \circ g$ with inner dimension $k$. We observe in Figure 1 such a three-part representation structure in an MSE task with a low-rank true function. The rescaled parameter norm would then take the form $\frac{1}{L} \|\mathbf{W}\|^2 = \frac{L - L_g - L_h}{L} k + \frac{1}{L} \left( \|\mathbf{W}_g\|^2 + \|\mathbf{W}_h\|^2 \right)$, where $\mathbf{W}_g$ and $\mathbf{W}_h$ are the parameters of the first and last part of the network. For large depths, we can make the approximation $\frac{L - L_g - L_h}{L} \approx 1$ to recover the same structure as Equation 1, with $\gamma = 1/L$, $C(g) = \|\mathbf{W}_g\|^2$ and $C(h) = \|\mathbf{W}_h\|^2$. This intuition offers a possible explanation for rank recovery in DNNs, though we are not yet able to prove it rigorously.

5. PRACTICAL IMPLICATIONS

In this section, we describe the impact of rank minimization on two practical tasks: multiclass classification and autoencoders.

MULTICLASS CLASSIFICATION

Consider a function $f_{\mathbf{W}^*} : \mathbb{R}^{d_{in}} \to \mathbb{R}^m$ which solves a classification task with $m$ classes, i.e. for all training points $x_i$ with class $y_i \in \{1, \dots, m\}$ the $y_i$-th entry of the vector $f_{\mathbf{W}^*}(x_i)$ is strictly larger than all other entries. The Bottleneck rank $k = \mathrm{Rank}_{BN}(f_{\mathbf{W}^*})$ of $f_{\mathbf{W}^*}$ has an impact on the topology of the resulting partition of the input space $\Omega$ into classes, leading to topological properties typical of a partition on a $k$-dimensional space rather than those of a partition on a $d_{in}$-dimensional space. When $k = 1$, the partition will be topologically equivalent to a classification on a line, which implies the absence of tripoints, i.e. points at the boundary of 3 (or more) classes. Indeed any boundary point $x \in \Omega$ will be mapped to a boundary point $z = g(x)$ by the first function $g : \Omega \to \mathbb{R}$ in the factorization of $f_{\mathbf{W}^*}$; since $z$ has at most two neighboring classes, then so does $x$. This property is illustrated in Figure 2: for a classification task on four classes on the plane, we observe that the partitions obtained by shallow networks ($L = 2$) lead to tripoints which are absent in deeper networks ($L = 9$). Notice also that the presence or absence of $L_2$-regularization has little effect on the final shape, which is in line with the observation that the cross-entropy loss leads to an implicit $L_2$-regularization (Soudry et al., 2018; Gunasekar et al., 2018a; Chizat & Bach, 2020), reducing the necessity of an explicit $L_2$-regularization.

AUTOENCODERS

Consider learning an autoencoder on data of the form $x = g(z)$ where $z$ is sampled (with full dimensional support) in a latent space $\mathbb{R}^k$ and $g : \mathbb{R}^k \to \mathbb{R}^d$ is an injective FPLF. In this setting, the true rank is the intrinsic dimension $k$ of the data, since the minimal rank function that equals the identity on the data distribution has rank $k$. Assume that the learned autoencoder $f : \mathbb{R}^d \to \mathbb{R}^d$ fits the data, $f(x) = x$ for all $x = g(z)$, and recovers the rank $\mathrm{Rank}_{BN} f = k$. At any datapoint $x_0 = g(z_0)$ such that $g$ is differentiable at $z_0$, the data support $g(\mathbb{R}^k)$ is locally a $k$-dimensional affine subspace $T = x_0 + \mathrm{Im}\, Jg(z_0)$. In the linear region of $f$ that contains $x_0$, $f$ is an affine projection to $T$ since it equals the identity when restricted to $T$ and its Jacobian is rank $k$. This proves that rank recovering autoencoders are naturally (locally) denoising.
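The local picture described above can be sketched as follows. The argument only states that $f$ is an affine projection onto $T$; the sketch makes the additional assumption that the projection is orthogonal, and the Jacobian `Jg`, the point `x0` and the noise level are hypothetical values chosen for illustration.

```python
import numpy as np

def local_denoiser(x0, Jg):
    """Affine map x -> x0 + P (x - x0), with P the orthogonal projection onto Im Jg.

    Assumes Jg has full column rank k; orthogonality of P is an assumption
    of this sketch, not a consequence of the argument in the text.
    """
    Q, _ = np.linalg.qr(Jg)  # columns of Q: orthonormal basis of Im Jg
    P = Q @ Q.T
    return (lambda x: x0 + P @ (x - x0)), P

rng = np.random.default_rng(0)
d, k = 5, 2
Jg = rng.normal(size=(d, k))   # hypothetical Jacobian of g at z_0
x0 = rng.normal(size=d)        # hypothetical datapoint x_0 = g(z_0)
f_local, P = local_denoiser(x0, Jg)

on_support = x0 + Jg @ rng.normal(size=k)       # a point of T
noisy = on_support + 0.1 * rng.normal(size=d)   # the same point with added noise
denoised = f_local(noisy)
```

The map fixes every point of $T$, has a rank $k$ Jacobian, and is idempotent, so applying it to a noisy input returns a point of $T$.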



Note that traditional regression models, such as Kernel Ridge Regression (KRR) typically overestimate the true rank, as described in Appendix D.1.



Figure 1: DNN (depth $L = 11$ and width $n_\ell = 300$) trained on an MSE task with a rank 5 true function $f^* : \mathbb{R}^{50} \to \mathbb{R}^{50}$, with $N = 300$ and $\lambda = 0.05/L$. At the end of training, we obtain $\|\mathbf{W}\|^2/L \approx 8$. (left) First 10 singular values of the matrix of activations $Z_\ell$ for all $\ell$. The representations are approximately rank 5 in the middle layers. (middle) The impact of the nonlinearity at each layer $\ell$, measured by the ratio $\|\tilde{Z}_\ell - Z_\ell\|_F / \|\tilde{Z}_\ell\|_F$ where $\tilde{Z}_\ell$ is the matrix of pre-activations with entries $\tilde{\alpha}^{(\ell)}_k(x_i)$. This impact vanishes in the middle layers, supporting our intuition that the middle layers represent approximate identities. (right) First 10 singular values of the Jacobian $Jf_{\mathbf{W}}(x)$ at 10 random points.

[Figure 2 panel label: $L = 9$, $\lambda = 10^{-3}$]

Figure 2: Classification on 4 classes (whose sampling distribution are 4 identical inverted 'S' shapes translated along the x-axis) for two depths and with or without L 2 -regularization. The class boundaries in shallow networks (A,B) feature tripoints, which are not observed in deeper networks (C,D).

Figure 3: Autoencoders trained on MNIST (A) and a 1D dataset on the plane (B, C) with a ridge $\lambda = 10^{-4}$. Plot (A) shows noisy inputs in the first line with corresponding outputs below. In plots (B) and (C) the blue dots are the training data, and the green dots are random inputs that are mapped to the orange dots pointed to by the arrows. We see that for large depths (A, B) the learned autoencoder is naturally denoising, projecting points to the data distribution, which is not the case for shallow networks (C).

6. CONCLUSION

We have shown that in infinitely deep networks, $L_2$-regularization leads to a bias towards low-rank functions, for some notion of rank on FPLFs. We have then shown a set of results that suggest that this low-rank bias extends to large but finite depths. With the right depths, this leads to 'rank recovery', where the learned function has approximately the same rank as the 'true function'. We proposed a tentative explanation for this rank recovery: for finite but large depths, the network is biased towards functions $f$ which can be factorized as $f = h \circ g$ with both a small inner dimension $k$ and small complexity of $g$ and $h$. Finally, we have shown how rank recovery affects the topology of the class boundaries in a classification task and leads to natural denoising abilities in autoencoders.

