DEPTH SEPARATION WITH MULTILAYER MEAN-FIELD NETWORKS

Abstract

Depth separation (why a deeper network is more powerful than a shallower one) has been a major problem in deep learning theory. Previous results often focus on representation power. For example, Safran et al. (2019) constructed a function that is easy to approximate using a 3-layer network but not approximable by any 2-layer network of moderate width. In this paper, we show that this separation is in fact algorithmic: one can efficiently learn the function constructed by Safran et al. (2019) using an overparameterized network with polynomially many neurons. Our result relies on a new way of extending the mean-field limit to multilayer networks, and on a decomposition of the loss that factors out the error introduced by the discretization of infinite-width mean-field networks.

1. INTRODUCTION

One of the mysteries in deep learning theory is why we need deeper networks. In early attempts, researchers showed that deeper networks can represent functions that are hard for shallow networks to approximate (Eldan & Shamir, 2016; Telgarsky, 2016; Poole et al., 2016; Daniely, 2017; Yarotsky, 2017; Liang & Srikant, 2017; Safran & Shamir, 2017; Poggio et al., 2017; Safran et al., 2019; Malach & Shalev-Shwartz, 2019; Vardi & Shamir, 2020; Venturi et al., 2022; Malach et al., 2021). In particular, the seminal works of Eldan & Shamir (2016); Safran et al. (2019) constructed a simple function, f*(x) = ReLU(1 − ∥x∥), which can be computed by a 3-layer neural network but cannot be approximated by any 2-layer network of moderate width. However, these results only concern the representation power of neural networks and do not guarantee that training a deep neural network from a reasonable initialization can indeed learn such functions. In this paper, we prove that one can train a neural network that approximates f*(x) = ReLU(1 − ∥x∥) to any desired accuracy; this gives an algorithmic separation between the power of 2-layer and 3-layer networks. To analyze the training dynamics, we develop a new framework that generalizes the mean-field analysis of neural networks (Chizat & Bach, 2018; Mei et al., 2018) to multiple layers. As a result, all layer weights can change significantly during training (unlike many previous works based on the neural tangent kernel or on fixing lower-layer representations). Our analysis also gives a decomposition of the loss that allows us to decouple the training of the layers. In the remainder of the paper, we first introduce our new framework for multilayer mean-field analysis, then present our main result and techniques. We discuss related work on the algorithmic aspects of depth separation in Section 1.3.
Similar to standard mean-field analysis, we first consider the infinite-width dynamics in Section 3, and then discuss our new ideas for discretizing the result to a polynomial-size network (see Section 4).
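To make the separation target concrete, the following is a minimal numpy sketch of the function f*(x) = ReLU(1 − ∥x∥) studied above (the function name `target` is ours, not the paper's):

```python
import numpy as np

def target(x):
    """Target function f*(x) = ReLU(1 - ||x||) from Safran et al. (2019).

    The function is radial: it depends on x only through the norm ||x||,
    which is the source of the hardness for 2-layer networks.
    """
    return np.maximum(0.0, 1.0 - np.linalg.norm(x, axis=-1))
```

Note that f* vanishes outside the unit ball and peaks at the origin with value 1.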

1.1. MULTI-LAYER MEAN-FIELD FRAMEWORK

We propose a new way to extend the mean-field analysis to multiple layers. For simplicity, we state it for 3-layer networks here; see Appendix A for the general framework. In short, we break the middle layer into two linear layers and restrict the size of the layer in between. More precisely, we define

    f(x) = (1/m_2) a_2^⊤ σ(W_2 F(x)),    F(x) = (1/m_1) A_1 σ(W_1 x),

where W_1 ∈ R^{m_1×d}, A_1 ∈ R^{D×m_1}, W_2 ∈ R^{m_2×D}, a_2 ∈ R^{m_2} are the parameters, and F(x) ∈ R^D represents the hidden feature. See Figure 1 for an illustration. Later we will refer to the map x → F(x) as the first layer and F(x) → f(x) as the second layer, even though each of them is itself a two-layer network. In the infinite-width limit, we fix the hidden feature dimension D and let the numbers of neurons m_1, m_2 go to infinity. We then obtain the infinite-width network

    f(x) = E_{(a_2,w_2)∼µ_2}[a_2 σ(w_2 · F(x))],    F_i(x) = E_{(a_1,w_1)∼µ_{1,i}}[a_1 σ(w_1 · x)],  ∀ i ∈ [D],

where (µ_{1,i})_{i∈[D]} are distributions over R^{1+d} with a shared marginal distribution over w_1, and µ_2 is a distribution over R^{1+D}. Note that, unlike the formulation in Nguyen & Pham (2020), here the hidden layers are described by distributions of neurons, and hence are automatically invariant under permutations of neurons, which is one of the most important properties of mean-field networks. One can choose µ_1, µ_2 to be empirical distributions over finitely many neurons to recover a finite-width network. In fact, we do so in most parts of the paper so that our results apply to finite-width networks with polynomially many neurons. The network can be viewed as a 3-layer network whose intermediate weight matrix W_2 A_1 is low rank. This is reminiscent of the bottleneck structure used in ResNet (He et al., 2016) and has also been used in previous theoretical analyses, such as Allen-Zhu & Li (2020), for other purposes.

Learner network. Now we are ready to introduce the specific network that we use to learn the target function.
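A minimal numpy sketch of the finite-width version of this architecture (function and variable names are ours, chosen to match the notation above):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def three_layer_bottleneck(x, W1, A1, W2, a2):
    """Finite-width network of the framework:
        F(x) = (1/m1) * A1 @ relu(W1 @ x)       hidden feature in R^D
        f(x) = (1/m2) * a2 @ relu(W2 @ F(x))    scalar output
    The effective middle-layer matrix W2 @ A1 has rank at most D
    (the bottleneck), no matter how large m1 and m2 are.
    """
    m1 = W1.shape[0]
    m2 = W2.shape[0]
    F = A1 @ relu(W1 @ x) / m1    # "first layer": x -> F(x)
    return a2 @ relu(W2 @ F) / m2 # "second layer": F(x) -> f(x)
```

Since ReLU is positively homogeneous and there are no bias terms here, this network satisfies f(cx) = c f(x) for c > 0, which the test below checks.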
We set D = 1 and couple a_1 with w_1:

    F(x) = F(x; µ_1) := E_{w∼µ_1}[∥w∥ σ(w · x)],
    f(x) = f(x; µ_2, µ_1) := E_{(w_2,b_2)∼µ_2}[σ(w_2 F(x; µ_1) + b_2)].

Here, σ is the ReLU activation, and µ_1 ∈ P(R^d) and µ_2 ∈ P(R^2) are distributions encoding the weights of the first and second hidden layers, respectively. We multiply each first-layer neuron by ∥w∥ to make F more regular; this 2-homogeneous parameterization is also used in Li et al. (2020) and Wang et al. (2020). In most parts of the paper, µ_1 and µ_2 are empirical distributions over polynomially many neurons; we use µ_1, µ_2 to unify the notation in discussions of infinite- and finite-width networks. Restricting the intermediate layer to have only one dimension (D = 1) is sufficient, as the first layer F(x) can learn x → α∥x∥ for some α ∈ R, and the second layer can learn α∥x∥ → σ(1 − ∥x∥). For the network that computes F(x), we do not need a bias term, as the intended function is homogeneous in x. Although we restrict the first-layer output coefficients to be positive (each neuron is weighted by ∥w∥ ≥ 0), this does not restrict the representation power of the network, since the second-layer weights can be either positive or negative. For the second layer, even though a single neuron would be sufficient, we follow the framework and overparameterize the network.
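The claim that D = 1 suffices can be illustrated numerically: if µ_1 is (an empirical approximation of) the uniform distribution on the unit sphere, then by rotational symmetry F(x) ≈ α∥x∥ for a dimension-dependent constant α > 0, and a single second-layer neuron maps this to σ(1 − ∥x∥). The following is a Monte Carlo sketch under these assumptions (the specific sample size and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def F(x, W):
    # F(x; mu_1) = E_{w ~ mu_1}[ ||w|| * relu(w . x) ], where mu_1 is the
    # empirical distribution over the rows of W.
    norms = np.linalg.norm(W, axis=1)
    return np.mean(norms * relu(W @ x))

# mu_1 uniform on the unit sphere: by rotational symmetry,
# F(x) = alpha * ||x|| for some constant alpha > 0 depending only on d.
d, m1 = 8, 200_000
W = rng.standard_normal((m1, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Estimate alpha on a reference input of unit norm.
e1 = np.zeros(d)
e1[0] = 1.0
alpha = F(e1, W)

# A single second-layer neuron sigma(w2 * F(x) + b2) with w2 = -1/alpha
# and b2 = 1 then computes (approximately) the target relu(1 - ||x||).
def f(x):
    return relu(-F(x, W) / alpha + 1.0)
```

The test below probes f at inputs of norm 0, 0.5, and 3, where the target takes values 1, 0.5, and 0.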



Figure 1: Difference between the previous framework of Nguyen & Pham (2020) (left) and our framework (right).


