GLOBAL CONVERGENCE OF THREE-LAYER NEURAL NETWORKS IN THE MEAN FIELD REGIME *

Abstract

In the mean field regime, neural networks are appropriately scaled so that as the width tends to infinity, the learning dynamics tends to a nonlinear and nontrivial dynamical limit, known as the mean field limit. This offers a way to study large-width neural networks by analyzing the mean field limit. Recent works have successfully applied such analysis to two-layer networks and provided global convergence guarantees. The extension to multilayer networks, however, has been a highly challenging puzzle, and little is known about the optimization efficiency in the mean field regime when there are more than two layers. In this work, we prove a global convergence result for unregularized feedforward three-layer networks in the mean field regime. We first develop a rigorous framework to establish the mean field limit of three-layer networks under stochastic gradient descent training. To that end, we propose the idea of a neuronal embedding, which comprises a fixed probability space that encapsulates neural networks of arbitrary sizes. The identified mean field limit is then used to prove a global convergence guarantee under suitable regularity and convergence mode assumptions, which, unlike previous works on two-layer networks, does not rely critically on convexity. Underlying the result is a universal approximation property, natural to neural networks, which importantly is shown to hold at any finite training time (not necessarily at convergence) via an algebraic topology argument.

* This paper is a conference submission. We refer to the work Nguyen & Pham (2020).

1. INTRODUCTION

Interest in the theoretical understanding of the training of neural networks has led to the recent discovery of a new operating regime: the neural network and its learning rates are scaled appropriately, such that as the width tends to infinity, the network admits a limiting learning dynamics in which all parameters evolve nonlinearly with time¹. This is known as the mean field (MF) limit (Mei et al. (2018); Chizat & Bach (2018); Rotskoff & Vanden-Eijnden (2018); Sirignano & Spiliopoulos (2018); Nguyen (2019); Araújo et al. (2019); Sirignano & Spiliopoulos (2019)). The four works Mei et al. (2018); Chizat & Bach (2018); Rotskoff & Vanden-Eijnden (2018); Sirignano & Spiliopoulos (2018) led the first wave of efforts in 2018 and analyzed two-layer neural networks. They established a connection between the network under training and its MF limit. They then used the MF limit to prove that two-layer networks could be trained to find (near) global optima using variants of gradient descent, despite non-convexity (Mei et al. (2018); Chizat & Bach (2018)). The MF limit identified by these works assumes the form of gradient flows in the measure space, which factors out the invariance from the action of a symmetry group on the model. Interestingly, by lifting to the measure space, with a convex loss function (e.g. the squared loss), one obtains a limiting optimization problem that is convex (Bengio et al. (2006); Bach (2017)).

In this work, we prove a global convergence guarantee for feedforward three-layer networks trained with unregularized stochastic gradient descent (SGD) in the MF regime. After an introduction of the three-layer setup and its MF limit in Section 2, our development proceeds in two main steps.

Step 1 (Theorem 3 in Section 3): We first develop a rigorous framework that describes the MF limit and establishes its connection with a large-width SGD-trained three-layer network. Here we propose the new idea of a neuronal embedding, which comprises an appropriate non-evolving probability space that encapsulates neural networks of arbitrary sizes. This probability space is in general abstract and is constructed according to the (not necessarily i.i.d.) initialization scheme of the neural network.
This idea directly addresses the intertwined action of multiple symmetry groups, a conceptual obstacle noted in Nguyen (2019), thereby covering setups that cannot be handled by the formulations in Araújo et al. (2019); Sirignano & Spiliopoulos (2019) (see also Section 5 for a comparison). Our analysis follows the technique of Sznitman (1991); Mei et al. (2018) and gives a quantitative statement: in particular, the MF limit yields a good approximation of the neural network as long as $n_{\min}^{-1} \log n_{\max} \ll 1$, independent of the data dimension, where $n_{\min}$ and $n_{\max}$ are the minimum and maximum of the widths.

Step 2 (Theorem 8 in Section 4): We prove that the MF limit, given by our framework, converges to the global optimum under suitable regularity and convergence mode assumptions. Several elements of our proof are inspired by Chizat & Bach (2018); the technique in their work, however, does not generalize to our three-layer setup. Unlike previous two-layer analyses, we do not exploit convexity; instead we make use of a new element: a universal approximation property. The result turns out to be conceptually new: global convergence can be achieved even when the loss function is non-convex. An important crux of the proof is to show that the universal approximation property holds at any finite training time (but not necessarily at convergence, i.e. at infinite time, since the property may not realistically hold at convergence).

Together these two results imply a positive statement on the optimization efficiency of SGD-trained unregularized feedforward three-layer networks (Corollary 10). Our results can be extended to the general multilayer case, with new ideas and significantly more technical work, or used to obtain new global convergence guarantees in the two-layer case (Nguyen & Pham (2020); Pham & Nguyen (2020)).
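The neuronal embedding idea can be illustrated with a toy sketch. This is our own illustration, not the paper's construction: the paper's probability space is abstract and tailored to the initialization scheme, whereas here we simply use i.i.d. Gaussians. The key point conveyed is that a single fixed probability space supplies the randomness, so that networks of different widths are instantiated from (and hence coupled through) the same underlying sample space.

```python
import numpy as np

# Toy illustration (ours) of a "neuronal embedding": one fixed source of
# randomness, from which networks of ARBITRARY widths draw their initial
# first-layer weights. Networks of different widths are thereby coupled:
# the width-10 network's neurons are a prefix of the width-100 network's.
rng = np.random.default_rng(0)
d, n_max = 4, 1000

# Sample the embedding once: one weight vector per "abstract neuron".
embedding = rng.normal(size=(n_max, d))

def first_layer_weights(n):
    """Instantiate a width-n network from the shared embedding."""
    assert n <= n_max
    return embedding[:n]

w_small = first_layer_weights(10)
w_large = first_layer_weights(100)
# The two networks agree on their first 10 neurons -- the coupling that
# lets one compare networks of different widths to a common MF limit.
assert np.array_equal(w_small, w_large[:10])
```

The non-evolving probability space plays the role of `embedding` here: finite networks of any width are "views" into it, which is what allows a single MF limit to track all of them at once.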
We choose to keep the current paper concise, with the three-layer case being a prototypical setup that conveys several of the basic ideas. Complete proofs are presented in the appendices. Notations. $K$ denotes a generic constant that may change from line to line. $|\cdot|$ denotes the absolute value for a scalar and the Euclidean norm for a vector. For an integer $n$, we let $[n] = \{1, ..., n\}$.

2.1. THREE-LAYER NEURAL NETWORK

We consider the following three-layer network at time $k \in \mathbb{N}_{\geq 0}$ that takes as input $x \in \mathbb{R}^d$:
$$\hat{y}\left(x; \mathbf{W}(k)\right) = \varphi_3\left(H_3\left(x; \mathbf{W}(k)\right)\right), \tag{1}$$
$$H_3\left(x; \mathbf{W}(k)\right) = \frac{1}{n_2} \sum_{j_2=1}^{n_2} w_3(k, j_2)\, \varphi_2\left(H_2\left(x, j_2; \mathbf{W}(k)\right)\right),$$
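As a concrete sketch of the forward pass in Eq. (1): since the display above is cut off before the definition of $H_2$, we assume (our assumption, by analogy with the $1/n_2$-averaged $H_3$) that the inner layer takes the analogous $1/n_1$-averaged form.

```python
import numpy as np

def forward(x, w1, w2, w3, phi=np.tanh):
    """Mean-field-scaled three-layer forward pass, sketching Eq. (1).

    Shapes: x (d,), w1 (n1, d), w2 (n2, n1), w3 (n2,).
    The inner form H2 = (1/n1) * w2 @ phi(w1 @ x) is our assumed analogue
    of the displayed 1/n2-averaged H3; the truncated display omits it.
    """
    n1, n2 = w1.shape[0], w2.shape[0]
    h2 = (1.0 / n1) * w2 @ phi(w1 @ x)       # H_2(x, j2): one entry per j2
    h3 = (1.0 / n2) * np.sum(w3 * phi(h2))   # H_3(x): 1/n2-averaged scalar
    return phi(h3)                           # y_hat = phi_3(H_3)

rng = np.random.default_rng(1)
d, n1, n2 = 3, 50, 40
x = rng.normal(size=d)
y = forward(x, rng.normal(size=(n1, d)), rng.normal(size=(n2, n1)),
            rng.normal(size=n2))
```

The $1/n_2$ (rather than $1/\sqrt{n_2}$) prefactor is the defining feature of the MF scaling: hidden-layer outputs are empirical averages over neurons, which is what yields a law-of-large-numbers limit as widths grow.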



¹ This is to be contrasted with another major operating regime (the NTK regime), where parameters essentially do not evolve and the model behaves like a kernel method (Jacot et al. (2018); Chizat et al. (2019); Du et al. (2019); Allen-Zhu et al. (2019); Zou et al. (2018); Lee et al. (2019)).
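The scaling difference between the two regimes can be seen numerically. The following sketch (our illustration, not from the paper) compares the output of a randomly initialized two-layer network under the mean-field $1/n$ scaling versus an NTK-style $1/\sqrt{n}$ scaling: the MF output concentrates as the width $n$ grows (law of large numbers), while the NTK-scaled output keeps fluctuations of constant order (central limit theorem).

```python
import numpy as np

rng = np.random.default_rng(2)

def output_std(scale, n, trials=1000):
    """Std over random inits of a two-layer net a^T tanh(w * x) at x = 1,
    under a width-dependent scaling factor scale(n)."""
    a = rng.normal(size=(trials, n))
    w = rng.normal(size=(trials, n))
    return (scale(n) * np.sum(a * np.tanh(w), axis=1)).std()

for n in (100, 4000):
    mf = output_std(lambda m: 1.0 / m, n)       # mean-field 1/n scaling
    ntk = output_std(lambda m: m ** -0.5, n)    # NTK-style 1/sqrt(n) scaling
    print(f"n={n:5d}  MF std={mf:.4f}  NTK std={ntk:.4f}")
# The MF std shrinks (roughly like 1/sqrt(n)); the NTK std stays O(1).
```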



Among the first wave of MF analyses of two-layer networks (Mei et al. (2018); Chizat & Bach (2018); Rotskoff & Vanden-Eijnden (2018); Sirignano & Spiliopoulos (2018)), the analyses of Mei et al. (2018); Chizat & Bach (2018) utilize convexity, although the mechanisms to attain global convergence in these works are more sophisticated than the usual convex optimization setup in Euclidean spaces. The extension to multilayer networks has enjoyed much less progress. The works Nguyen (2019); Araújo et al. (2019); Sirignano & Spiliopoulos (2019) argued, heuristically or rigorously, for the existence of an MF limiting behavior under gradient descent training, under different assumptions. In fact, it has been argued that the difficulty is not simply technical, but rather conceptual (Nguyen (2019)): for instance, the presence of intermediate layers exhibits multiple symmetry groups with intertwined actions on the model. Convergence to the global optimum of the model under gradient-based optimization has not been established when there are more than two layers.

