GLOBAL CONVERGENCE OF THREE-LAYER NEURAL NETWORKS IN THE MEAN FIELD REGIME *

Abstract

In the mean field regime, neural networks are appropriately scaled so that, as the width tends to infinity, the learning dynamics tends to a nonlinear and nontrivial dynamical limit, known as the mean field limit. This lends a way to study large-width neural networks by analyzing the mean field limit. Recent works have successfully applied such analysis to two-layer networks and provided global convergence guarantees. The extension to multilayer networks, however, has been a highly challenging puzzle, and little is known about the optimization efficiency in the mean field regime when there are more than two layers. In this work, we prove a global convergence result for unregularized feedforward three-layer networks in the mean field regime. We first develop a rigorous framework to establish the mean field limit of three-layer networks under stochastic gradient descent training. To that end, we propose the idea of a neuronal embedding, which comprises a fixed probability space that encapsulates neural networks of arbitrary sizes. The identified mean field limit is then used to prove a global convergence guarantee under suitable regularity and convergence mode assumptions, which, unlike previous works on two-layer networks, does not rely critically on convexity. Underlying the result is a universal approximation property, natural to neural networks, which importantly is shown to hold at any finite training time (not necessarily at convergence) via an algebraic topology argument.

* This paper is a conference submission. We refer to the work Nguyen & Pham (2020) and its companion note Pham & Nguyen (2020).

1. INTRODUCTION

Interest in the theoretical understanding of the training of neural networks has led to the recent discovery of a new operating regime: the neural network and its learning rates are scaled appropriately, such that as the width tends to infinity, the network admits a limiting learning dynamics in which all parameters evolve nonlinearly with time. This is known as the mean field (MF) limit. The works Mei et al. (2018); Chizat & Bach (2018) led the first wave of efforts in 2018 and analyzed two-layer neural networks. They established a connection between the network under training and its MF limit. They then used the MF limit to prove that two-layer networks could be trained to find (near) global optima using variants of gradient descent, despite non-convexity (Mei et al. (2018); Chizat & Bach (2018)). The MF limit identified by these works assumes the form of gradient flows in the measure space, which factors out the invariance from the action of a symmetry group on the model. Interestingly, by lifting to the measure space, with a convex loss function (e.g. squared loss), one obtains a limiting optimization problem that is convex (Bengio et al. (2006); Bach (2017)). The analyses of Mei et al. (2018); Chizat & Bach (2018) utilize convexity, although the mechanisms to attain global convergence in these works are more sophisticated than the usual convex optimization setup in Euclidean spaces. The extension to multilayer networks has enjoyed much less progress. The works Nguyen (2019); Araújo et al. (2019); Sirignano & Spiliopoulos (2019) argued, heuristically or rigorously, for the existence of a MF limiting behavior under gradient descent training with different assumptions. In fact, it has been argued that the difficulty is not simply technical, but rather conceptual (Nguyen (2019)): for instance, the presence of intermediate layers exhibits multiple symmetry groups with intertwined actions on the model.
Convergence to the global optimum of the model under gradient-based optimization has not been established when there are more than two layers. In this work, we prove a global convergence guarantee for feedforward three-layer networks trained with unregularized stochastic gradient descent (SGD) in the MF regime. After an introduction of the three-layer setup and its MF limit in Section 2, our development proceeds in two main steps:

Step 1 (Theorem 3 in Section 3): We first develop a rigorous framework that describes the MF limit and establishes its connection with a large-width SGD-trained three-layer network. Here we propose the new idea of a neuronal embedding, which comprises an appropriate non-evolving probability space that encapsulates neural networks of arbitrary sizes. This probability space is in general abstract and is constructed according to the (not necessarily i.i.d.) initialization scheme of the neural network. This idea addresses directly the intertwined action of multiple symmetry groups, which is the aforementioned conceptual obstacle (Nguyen (2019)), thereby covering setups that cannot be handled by the formulations in Araújo et al. (2019); Sirignano & Spiliopoulos (2019) (see also Section 5 for a comparison). Our analysis follows the technique from Sznitman (1991); Mei et al. (2018) and gives a quantitative statement: in particular, the MF limit yields a good approximation of the neural network as long as n_min^{-1} log n_max ≪ 1, independent of the data dimension, where n_min and n_max are the minimum and maximum of the widths.

Step 2 (Theorem 8 in Section 4): We prove that the MF limit, given by our framework, converges to the global optimum under suitable regularity and convergence mode assumptions. Several elements of our proof are inspired by Chizat & Bach (2018); the technique in their work however does not generalize to our three-layer setup.
Unlike previous two-layer analyses, we do not exploit convexity; instead we make use of a new element: a universal approximation property. The result turns out to be conceptually new: global convergence can be achieved even when the loss function is non-convex. An important crux of the proof is to show that the universal approximation property holds at any finite training time (but not necessarily at convergence, i.e. at infinite time, since the property may not realistically hold at convergence). Together these two results imply a positive statement on the optimization efficiency of SGD-trained unregularized feedforward three-layer networks (Corollary 10). Our results can be extended to the general multilayer case, with new ideas on top and significantly more technical work, or used to obtain new global convergence guarantees in the two-layer case (Nguyen & Pham (2020); Pham & Nguyen (2020)). We choose to keep the current paper concise, with the three-layer case being a prototypical setup that conveys several of the basic ideas. Complete proofs are presented in the appendices.

Notations. K denotes a generic constant that may change from line to line. |·| denotes the absolute value for a scalar and the Euclidean norm for a vector. For an integer n, we let [n] = {1, ..., n}.

2.1. THREE-LAYER NEURAL NETWORK

We consider the following three-layer network at time k ∈ N_{≥0} that takes as input x ∈ R^d:

ŷ(x; W(k)) = ϕ_3(H_3(x; W(k))),
H_3(x; W(k)) = (1/n_2) Σ_{j_2=1}^{n_2} w_3(k, j_2) ϕ_2(H_2(x, j_2; W(k))).

The MF ODEs (Section 2.2) evolve the MF limit W(t) = (w_1(t, ·), w_2(t, ·, ·), w_3(t, ·)) via the following update quantities:

Δ_2(c_1, c_2; W(t)) = E_Z[Δ^{H_2}(Z, c_2; W(t)) ϕ_1(⟨w_1(t, c_1), X⟩)],
Δ_1(c_1; W(t)) = E_Z[E_{C_2}[Δ^{H_2}(Z, C_2; W(t)) w_2(t, c_1, C_2)] ϕ_1'(⟨w_1(t, c_1), X⟩) X],
Δ^{H_2}(z, c_2; W(t)) = ∂_2 L(y, ŷ(x; W(t))) ϕ_3'(H_3(x; W(t))) w_3(t, c_2) ϕ_2'(H_2(x, c_2; W(t))).

In Appendix B, we show well-posedness of the MF ODEs under the following regularity conditions.

Assumption 1 (Regularity). We assume that ϕ_1 and ϕ_2 are K-bounded; ϕ_1', ϕ_2' and ϕ_3' are K-bounded and K-Lipschitz; ϕ_2' and ϕ_3' are non-zero everywhere; ∂_2 L(·, ·) is K-Lipschitz in the second variable and K-bounded; and |X| ≤ K with probability 1. Furthermore ξ_1, ξ_2 and ξ_3 are K-bounded and K-Lipschitz.

Theorem 1. Under Assumption 1, given any neuronal ensemble and an initialization W(0) such that ess-sup |w_2(0, C_1, C_2)|, ess-sup |w_3(0, C_2)| ≤ K, there exists a unique solution W to the MF ODEs on t ∈ [0, ∞).

An example of a suitable setup is ϕ_1 = ϕ_2 = tanh, ϕ_3 the identity, and L the Huber loss; a non-convex, sufficiently smooth loss function also suffices. In fact, all of our developments can be easily modified to treat the squared loss under the additional assumption that |Y| ≤ K with probability 1. So far, given an arbitrary neuronal ensemble (Ω, F, P), for each initialization W(0), we have defined a MF limit W(t). The connection with the neural network's dynamics W(k) is established in the next section.
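For concreteness, here is a minimal numerical sketch of the forward pass with the MF 1/n scalings. One assumption is made: the pre-activation H_2, whose definition is not displayed in this excerpt, is taken in the analogous averaged form H_2(x, j_2; W) = (1/n_1) Σ_{j_1} w_2(j_1, j_2) ϕ_1(⟨w_1(j_1), x⟩).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n1, n2 = 4, 100, 80

# Finite-width parameters W: w1 in R^{n1 x d}, w2 in R^{n1 x n2}, w3 in R^{n2}.
w1 = rng.normal(size=(n1, d))
w2 = rng.normal(size=(n1, n2))
w3 = rng.normal(size=n2)

phi1 = phi2 = np.tanh          # a suitable setup per the text
phi3 = lambda h: h             # phi3 = identity

def forward(x):
    # Assumed form: H2(x, j2) = (1/n1) * sum_{j1} w2[j1, j2] * phi1(<w1[j1], x>)
    pre1 = phi1(w1 @ x)                    # shape (n1,)
    H2 = (w2.T @ pre1) / n1                # shape (n2,)
    # H3(x) = (1/n2) * sum_{j2} w3[j2] * phi2(H2(x, j2))
    H3 = np.dot(w3, phi2(H2)) / n2         # scalar
    return phi3(H3), H2, H3

x = rng.normal(size=d)
y_hat, H2, H3 = forward(x)
print(y_hat)
```

The 1/n_1 and 1/n_2 factors are exactly what makes every parameter move at an O(1) rate in the wide limit, in contrast with the standard 1/√n scaling.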

3.1. NEURONAL EMBEDDING AND THE COUPLING PROCEDURE

To formalize a connection between the neural network and its MF limit, we consider their initializations. In practical scenarios, to set the initial parameters W(0) of the neural network, one typically randomizes W(0) according to some distributional law ρ. We note that since the neural network is defined w.r.t. a set of finite integers {n_1, n_2}, so is ρ. We consider a family Init of initialization laws, each of which is indexed by the set {n_1, n_2}:

Init = {ρ : ρ is the initialization law of a neural network of size {n_1, n_2}, n_1, n_2 ∈ N_{>0}}.

This is helpful when one is to take a limit that sends n_1, n_2 → ∞, in which case the size of this family |Init| is infinite. More generally we allow |Init| < ∞ (for example, Init contains a single law ρ of a network of size {n_1, n_2} and hence |Init| = 1). We make the following crucial definition.

Definition 2. Given a family of initialization laws Init, we call (Ω, F, P, {w_i^0}_{i=1,2,3}) a neuronal embedding of Init if the following holds:

1. (Ω, F, P) = (Ω_1 × Ω_2, F_1 × F_2, P_1 × P_2) is the product of two probability spaces (Ω_1, F_1, P_1) and (Ω_2, F_2, P_2). As a reminder, we call it a neuronal ensemble.

2. The deterministic functions w_1^0: Ω_1 → R^d, w_2^0: Ω_1 × Ω_2 → R and w_3^0: Ω_2 → R are such that, for each index {n_1, n_2} of Init with law ρ, if, with an abuse of notation, we independently sample {C_i(j_i)}_{j_i ∈ [n_i]} ~ P_i i.i.d. for each i = 1, 2, then

Law(w_1^0(C_1(j_1)), w_2^0(C_1(j_1), C_2(j_2)), w_3^0(C_2(j_2)) : j_i ∈ [n_i], i = 1, 2) = ρ.

To proceed, given Init and {n_1, n_2} in its index set, we perform the following coupling procedure:

1. Let (Ω, F, P, {w_i^0}_{i=1,2,3}) be a neuronal embedding of Init.

2. We form the MF limit W(t) (for t ∈ R_{≥0}) associated with the neuronal ensemble (Ω, F, P) by setting the initialization W(0) to w_1(0, ·) = w_1^0(·), w_2(0, ·, ·) = w_2^0(·, ·) and w_3(0, ·) = w_3^0(·) and running the MF ODEs described in Section 2.2.

3. We independently sample C_i(j_i) ~ P_i for i = 1, 2 and j_i = 1, ..., n_i. We then form the neural network initialization W(0) with w_1(0, j_1) = w_1^0(C_1(j_1)), w_2(0, j_1, j_2) = w_2^0(C_1(j_1), C_2(j_2)) and w_3(0, j_2) = w_3^0(C_2(j_2)) for j_1 ∈ [n_1], j_2 ∈ [n_2]. We obtain the network's trajectory W(k) for k ∈ N_{≥0} as in Section 2.1, with the data z(k) generated independently of {C_i(j_i)}_{i=1,2} and hence of W(0).

We can then define a measure of closeness between W(⌊t/ϵ⌋) and W(t) for t ∈ [0, T] (the first argument being the finite-width network, the second its MF limit):

D_T(W, W) = sup{ |w_1(⌊t/ϵ⌋, j_1) - w_1(t, C_1(j_1))|, |w_2(⌊t/ϵ⌋, j_1, j_2) - w_2(t, C_1(j_1), C_2(j_2))|, |w_3(⌊t/ϵ⌋, j_2) - w_3(t, C_2(j_2))| : t ≤ T, j_1 ≤ n_1, j_2 ≤ n_2 }.  (2)

Note that W(t) is a deterministic trajectory independent of {n_1, n_2}, whereas W(k) is random for all k ∈ N_{≥0} due to the randomness of {C_i(j_i)}_{i=1,2} and the generation of the training data z(k); similarly D_T(W, W) is a random quantity. The idea of the coupling procedure is closely related to the coupling argument in Sznitman (1991); Mei et al. (2018).
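The coupling procedure can be illustrated with a small sketch. The sample spaces and the embedding functions w_i^0 below are hypothetical stand-ins chosen for readability; an actual neuronal embedding is abstract and determined by Init.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n1, n2 = 3, 5, 4

# Hypothetical neuronal embedding: Omega_1 = Omega_2 = [0,1] with uniform P_i,
# and simple deterministic functions w0_i (illustrative choices only).
def w0_1(c1):        # Omega_1 -> R^d
    return np.array([np.sin(2 * np.pi * k * c1) for k in range(1, d + 1)])
def w0_2(c1, c2):    # Omega_1 x Omega_2 -> R
    return np.cos(2 * np.pi * (c1 + c2))
def w0_3(c2):        # Omega_2 -> R
    return 2.0 * c2 - 1.0

# Coupling procedure, step 3: sample C_i(j_i) ~ P_i i.i.d. ...
C1 = rng.uniform(size=n1)
C2 = rng.uniform(size=n2)

# ... and form the network initialization W(0) from the SAME randomness.
w1 = np.stack([w0_1(c) for c in C1])                       # (n1, d)
w2 = np.array([[w0_2(a, b) for b in C2] for a in C1])      # (n1, n2)
w3 = np.array([w0_3(b) for b in C2])                       # (n2,)

# The closeness measure D_T at T = 0: the coupled trajectories share their
# initialization by construction, so the sup-distance vanishes at t = 0.
D0 = max(
    np.abs(w1 - np.stack([w0_1(c) for c in C1])).max(),
    np.abs(w2 - np.array([[w0_2(a, b) for b in C2] for a in C1])).max(),
    np.abs(w3 - np.array([w0_3(b) for b in C2])).max(),
)
print(D0)
```

The point of the construction is that the network's weights and the MF limit's weights are evaluated on the same draws C_i(j_i), so D_T measures only the divergence accumulated during training.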
Here, instead of playing the role of a proof technique, the coupling serves as a vehicle to establish the connection between W and its MF limit on the basis of the neuronal embedding. This connection is shown in Theorem 3 below, which gives an upper bound on D_T(W, W). We note that the coupling procedure can be carried out to provide this connection as long as there exists a neuronal embedding for Init. Later in Section 4.1, we show that a neuronal embedding exists for a common initialization scheme (in particular, i.i.d. initialization). Theorem 3 applies to, but is not restricted to, this initialization scheme.

3.2. MAIN RESULT: APPROXIMATION BY THE MF LIMIT

Assumption 2 (Initialization of second and third layers). We assume that ess-sup |w_2^0(C_1, C_2)|, ess-sup |w_3^0(C_2)| ≤ K, where w_2^0 and w_3^0 are as described in Definition 2.

Theorem 3. Given a family Init of initialization laws and a tuple {n_1, n_2} that is in the index set of Init, perform the coupling procedure as described in Section 3.1. Fix a terminal time T ∈ N_{≥0}. Under Assumptions 1 and 2, for ϵ ≤ 1, we have with probability at least 1 - 2δ,

D_T(W, W) ≤ e^{K_T} (1/√n_min + √ϵ) log^{1/2}(3(T + 1) n_max^2 / δ + e) ≡ err_{δ,T}(ϵ, n_1, n_2),

in which n_min = min{n_1, n_2}, n_max = max{n_1, n_2}, and K_T = K(1 + T^K).

The theorem gives a connection between W(⌊t/ϵ⌋), which is defined upon finite widths n_1 and n_2, and the MF limit W(t), whose description is independent of n_1 and n_2. It lends a way to extract properties of the neural network in the large-width regime.

Corollary 4. Under the same setting as Theorem 3, consider any test function ψ: R × R → R which is K-Lipschitz in the second variable uniformly in the first variable (an example of ψ is the loss L). For any δ > 0, with probability at least 1 - 3δ,

sup_{t ≤ T} |E_Z[ψ(Y, ŷ(X; W(⌊t/ϵ⌋)))] - E_Z[ψ(Y, ŷ(X; W(t)))]| ≤ e^{K_T} err_{δ,T}(ϵ, n_1, n_2).

These bounds hold for any n_1 and n_2. We observe that the MF trajectory W(t) is defined as per the choice of the neuronal embedding (Ω, F, P, {w_i^0}_{i=1,2,3}), which may not be unique. On the other hand, the neural network's trajectory W(k) depends on the randomization of the initial parameters W(0) according to an initialization law from the family Init (as well as the data z(k)) and hence is independent of this choice. Another corollary of Theorem 3 is that, given the same family Init, the law of the MF trajectory is insensitive to the choice of the neuronal embedding of Init.

Proposition 5. Consider a family Init of initialization laws whose index set contains a sequence {(m_1(m), m_2(m)) : m ∈ N} in which, as m → ∞, min{m_1(m), m_2(m)}^{-1} log(max{m_1(m), m_2(m)}) → 0.
Let W(t) and Ŵ(t) be two MF trajectories associated with two choices of neuronal embeddings of Init, (Ω, F, P, {w_i^0}_{i=1,2,3}) and (Ω̂, F̂, P̂, {ŵ_i^0}_{i=1,2,3}). The following statement holds for any T ≥ 0 and any two positive integers n_1 and n_2: if we independently sample C_i(j_i) ~ P_i and Ĉ_i(j_i) ~ P̂_i for j_i ∈ [n_i], i = 1, 2, then Law(W(n_1, n_2, T)) = Law(Ŵ(n_1, n_2, T)), where we define W(n_1, n_2, T) as the collection below w.r.t. W(t), and similarly define Ŵ(n_1, n_2, T) w.r.t. Ŵ(t):

W(n_1, n_2, T) = { w_1(t, C_1(j_1)), w_2(t, C_1(j_1), C_2(j_2)), w_3(t, C_2(j_2)) : j_1 ∈ [n_1], j_2 ∈ [n_2], t ∈ [0, T] }.

The proofs are deferred to Appendix C.

4. CONVERGENCE TO GLOBAL OPTIMA

In this section, we prove a global convergence guarantee for three-layer neural networks via the MF limit. We consider a common class of initialization: i.i.d. initialization.

4.1. I.I.D. INITIALIZATION

Definition 6. An initialization law ρ for a neural network of size {n_1, n_2} is called a (ρ_1, ρ_2, ρ_3)-i.i.d. initialization (or i.i.d. initialization, for brevity), where ρ_1, ρ_2 and ρ_3 are probability measures over R^d, R and R respectively, if {w_1(0, j_1)}_{j_1 ∈ [n_1]} are generated i.i.d. according to ρ_1, {w_2(0, j_1, j_2)}_{j_1 ∈ [n_1], j_2 ∈ [n_2]} are generated i.i.d. according to ρ_2, {w_3(0, j_2)}_{j_2 ∈ [n_2]} are generated i.i.d. according to ρ_3, and w_1, w_2 and w_3 are independent of each other.

Observe that given ρ_1, ρ_2, ρ_3, one can build a family Init of i.i.d. initialization laws whose index set contains any {n_1, n_2}. Furthermore i.i.d. initializations are supported by our framework, as stated in the following proposition and proven in Appendix D.

Proposition 7. For any family Init of (ρ_1, ρ_2, ρ_3)-i.i.d. initialization laws, there exists a neuronal embedding (Ω, F, P, {w_i^0}_{i=1,2,3}).
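Definition 6 can be made concrete with a short sketch. The particular measures ρ_1, ρ_2, ρ_3 below are hypothetical examples, not prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n1, n2 = 3, 6, 5

# A (rho_1, rho_2, rho_3)-i.i.d. initialization (Definition 6), with the
# illustrative choices rho_1 = N(0, I_d), rho_2 = N(0, 1), rho_3 = Unif{-1, +1}.
w1_0 = rng.normal(size=(n1, d))              # rows i.i.d. ~ rho_1
w2_0 = rng.normal(size=(n1, n2))             # entries i.i.d. ~ rho_2
w3_0 = rng.choice([-1.0, 1.0], size=n2)      # entries i.i.d. ~ rho_3

# The same recipe applies verbatim to every size {n1, n2}: a single triple
# (rho_1, rho_2, rho_3) generates one law per index, i.e. a whole family Init.
print(w1_0.shape, w2_0.shape, w3_0.shape)
```

This is the sense in which one triple of measures indexes a family Init covering all widths, which is what Proposition 7 requires of a neuronal embedding.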

4.2. MAIN RESULT: GLOBAL CONVERGENCE

To measure the learning quality, we consider the loss averaged over the data Z ~ P:

L(V) = E_Z[L(Y, ŷ(X; V))],

where V = (v_1, v_2, v_3) is a set of three measurable functions v_1: Ω_1 → R^d, v_2: Ω_1 × Ω_2 → R, v_3: Ω_2 → R.

Assumption 3. Consider a neuronal embedding (Ω, F, P, {w_i^0}_{i=1,2,3}) of the (ρ_1, ρ_2, ρ_3)-i.i.d. initialization, and the associated MF limit with initialization W(0) such that w_1(0, ·) = w_1^0(·), w_2(0, ·, ·) = w_2^0(·, ·) and w_3(0, ·) = w_3^0(·). Assume:

1. Support: The support of ρ_1 is R^d.

2. Convergence mode: There exist limits w̄_1, w̄_2 and w̄_3 such that, as t → ∞,

E[(1 + |w̄_3(C_2)|) |w̄_3(C_2)| |w̄_2(C_1, C_2)| |w_1(t, C_1) - w̄_1(C_1)|] → 0,  (3)
E[(1 + |w̄_3(C_2)|) |w̄_3(C_2)| |w_2(t, C_1, C_2) - w̄_2(C_1, C_2)|] → 0,  (4)
E[(1 + |w̄_3(C_2)|) |w_3(t, C_2) - w̄_3(C_2)|] → 0,  (5)
ess-sup E_{C_2}[|∂_t w_2(t, C_1, C_2)|] → 0.  (6)

Note that this assumption specifies the mode of convergence and is not an assumption on the limits w̄_1, w̄_2 and w̄_3. Specifically, conditions (3)-(5) are similar to the convergence assumption in Chizat & Bach (2018). We differ from Chizat & Bach (2018) fundamentally in the essential supremum condition (6). On one hand, this condition helps avoid the Morse-Sard type condition in Chizat & Bach (2018), which is difficult to verify in general and not simple to generalize to the three-layer case. On the other hand, it turns out to be a natural assumption to make, in light of Remark 9 below.

We now state the main result of the section. The proof is in Appendix D.

Theorem 8. Consider a neuronal embedding (Ω, F, P, {w_i^0}_{i=1,2,3}) of a (ρ_1, ρ_2, ρ_3)-i.i.d. initialization. Consider the MF limit corresponding to the network (1), such that they are coupled together by the coupling procedure in Section 3.1, under Assumptions 1, 2 and 3. For simplicity, assume ξ_1(·) = ξ_2(·) = 1. Further assume either:

• (untrained third layer) ξ_3(·) = 0 and w_3^0(C_2) ≠ 0 with positive probability, or
• (trained third layer) ξ_3(·) = 1 and L(w_1^0, w_2^0, w_3^0) < E_Z[L(Y, ϕ_3(0))].

Then the following hold:

• Case 1 (convex loss): If L is convex in the second variable, then lim_{t→∞} L(W(t)) = inf_V L(V) = inf_{ỹ: R^d → R} E_Z[L(Y, ỹ(X))].

• Case 2 (generic non-negative loss): Suppose that ∂_2 L(y, ŷ) = 0 implies L(y, ŷ) = 0. If Y = y(X) is a function of X, then L(W(t)) → 0 as t → ∞.

Remarkably, the theorem allows for non-convex losses.
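As a purely numerical companion to the convergence guarantee (not part of the formal development), one can train a small finite-width three-layer network with SGD, using one-sample estimates of the update quantities Δ_3, Δ_2, Δ_1 of Section 2, and watch the training loss decrease. Everything below is an assumed toy configuration: widths, learning rate, data law, and the squared loss (with bounded Y, as the text permits).

```python
import numpy as np

rng = np.random.default_rng(3)
d, n1, n2, lr, steps = 2, 64, 64, 0.2, 3000

w1 = rng.normal(size=(n1, d))
w2 = rng.normal(size=(n1, n2))
w3 = rng.normal(size=n2)

def forward(x):
    pre1 = np.tanh(w1 @ x)              # phi1 = tanh
    H2 = (w2.T @ pre1) / n1             # assumed MF-averaged pre-activation
    a2 = np.tanh(H2)                    # phi2 = tanh
    return pre1, H2, a2, np.dot(w3, a2) / n2   # phi3 = identity

def avg_loss(X, Y):
    return np.mean([(forward(x)[3] - y) ** 2 / 2 for x, y in zip(X, Y)])

X = rng.normal(size=(200, d))
Y = np.sin(X[:, 0]) * np.cos(X[:, 1])   # Y = y(X), bounded, as in Case 2

loss0 = avg_loss(X, Y)
for k in range(steps):
    i = rng.integers(len(X))
    x, y = X[i], Y[i]
    pre1, H2, a2, yhat = forward(x)
    dL = yhat - y                                   # partial_2 L for squared loss
    dH2 = dL * w3 * (1.0 - a2 ** 2)                 # one-sample Delta^{H2} (phi3' = 1)
    w3 -= lr * dL * a2                              # one-sample Delta_3
    w2 -= lr * np.outer(pre1, dH2)                  # one-sample Delta_2
    w1 -= lr * np.outer((w2 @ dH2 / n2) * (1.0 - pre1 ** 2), x)  # one-sample Delta_1
loss1 = avg_loss(X, Y)
print(loss0, loss1)
```

The per-layer update directions mirror the Δ quantities displayed in Section 2 (with ϕ_3' = 1 for the identity output and expectations replaced by single samples); the demo only illustrates that the loss trends downward, not the infinite-width statement itself.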
A further inspection of the proof shows that no convexity-based property is used in Case 2 (see, for instance, the high-level proof sketch in Section 4.3); in Case 1, the key steps in the proof are the same, and the convexity of the loss function serves as a convenient technical assumption to handle the arbitrary extra randomness of Y conditional on X. We also remark that the same proof of global convergence should extend beyond the specific fully-connected architecture considered here. Similar to previous results on SGD-trained two-layer networks, Theorems 3 and 8 combine into a guarantee for the SGD-trained three-layer network (Corollary 10): in Case 1,

lim_{ϵ→0} E_Z[L(Y, ŷ(X; W(⌊t/ϵ⌋)))] = inf_{f_1, f_2, f_3} L(f_1, f_2, f_3) = inf_{ỹ} E_Z[L(Y, ỹ(X))] in probability,

where the limit of the widths is such that min{n_1, n_2}^{-1} log(max{n_1, n_2}) → 0. In Case 2, the same holds with the right-hand side being 0.

4.3. HIGH-LEVEL IDEA OF THE PROOF

We give a high-level discussion of the proof. This is meant to provide intuitions and explain the technical crux, so our discussion may simplify and deviate from the actual proof. Our first insight is to look at the second layer's weight w_2. At convergence time t = ∞, we expect to have zero movement and hence, denoting W(∞) = (w̄_1, w̄_2, w̄_3),

Δ_2(c_1, c_2; W(∞)) = E_Z[Δ^{H_2}(Z, c_2; W(∞)) ϕ_1(⟨w̄_1(c_1), X⟩)] = 0,

for P-almost every c_1, c_2. Suppose for the moment that we are allowed to make an additional (strong) assumption on the limit w̄_1: supp(w̄_1(C_1)) = R^d. It implies that the universal approximation property, described in Assumption 3, holds at t = ∞; more specifically, it implies that {ϕ_1(⟨w̄_1(c_1), ·⟩) : c_1 ∈ Ω_1} has dense span in L^2(P_X). This thus yields

E_Z[Δ^{H_2}(Z, c_2; W(∞)) | X = x] = 0,

for P-almost every x. Recalling the definition of Δ^{H_2}, one can then easily show that E_Z[∂_2 L(Y, ŷ(X; W(∞))) | X = x] = 0. Global convergence follows immediately; for example, in Case 2 of Theorem 8, this is equivalent to ∂_2 L(y(x), ŷ(x; W(∞))) = 0 and hence L(y(x), ŷ(x; W(∞))) = 0 for P-almost every x. In short, the gradient flow structure of the dynamics of w_2 provides a seamless way to obtain global convergence. Furthermore there is no critical reliance on convexity.

However this plan of attack has a potential flaw in the strong assumption that supp(w̄_1(C_1)) = R^d, i.e. that the universal approximation property holds at convergence time. Indeed there are setups where it is desirable that supp(w̄_1(C_1)) ≠ R^d (Mei et al. (2018); Chizat (2019)); for instance, this is the case where the neural network is to learn some "sparse and spiky" solution, and hence the weight distribution at convergence time, if successfully trained, cannot have full support. On the other hand, one can entirely expect that if supp(w_1(0, C_1)) = R^d initially at t = 0, then supp(w_1(t, C_1)) = R^d at any finite t ≥ 0.
The crux of our proof is to show the latter without assuming supp(w̄_1(C_1)) = R^d. This is the major technical step of the proof. To that end, we first show that there exists a mapping (t, u) → M(t, u) that maps (t, w_1(0, c_1)) = (t, u) to w_1(t, c_1), via a careful measurability argument. This argument rests on a scheme that exploits the symmetry in the network's evolution. Furthermore the map M is shown to be continuous. The desired conclusion then follows from an algebraic topology argument that the map M preserves a homotopic structure through time.

5. DISCUSSION

The MF literature is fairly recent. A long line of works, starting with Nitanda & Suzuki (2017), studies the MF regime of two-layer networks. It has been observed that when there are more than three layers and no biases, i.i.d. initializations lead to a certain simplifying effect on the MF limit. On the other hand, our framework supports non-i.i.d. initializations which avoid the simplifying effect, as long as there exist suitable neuronal embeddings (Nguyen & Pham (2020)). Although our global convergence result in Theorem 8 is proven in the context of i.i.d. initializations for three-layer networks, in the general multilayer case it turns out that the use of a special type of non-i.i.d. initialization allows one to prove a global convergence guarantee (Pham & Nguyen (2020)). In this aspect, our framework follows closely the spirit of the work Nguyen (2019), whose MF formulation is also not specific to i.i.d. initializations. Yet, though similar in spirit, Nguyen (2019) develops a heuristic formalism and does not prove global convergence. Global convergence in the two-layer case with convex losses has enjoyed multiple efforts with a lot of new and interesting results (Mei et al. (2018); Chizat & Bach (2018), among others). Our work is the first to establish a global convergence guarantee for SGD-trained three-layer networks in the MF regime. Our proof sends a new message that the crucial factor is not necessarily convexity, but rather that the whole learning trajectory maintains the universal approximation property of the function class represented by the first layer's neurons, together with the gradient flow structure of the second layer's weights. As a remark, our approach can also be applied to prove a similar global convergence guarantee for two-layer networks, removing the convex loss assumption in previous works (Nguyen & Pham (2020); Pham & Nguyen (2020)).

ACKNOWLEDGEMENT

H. T. Pham would like to thank Jan Vondrak for many helpful discussions and in particular for the shorter proof of Lemma 19. We would like to thank Andrea Montanari for the succinct description of the difficulty in extending the mean field formulation to the multilayer case, in that there are multiple symmetry group actions in a multilayer network.

A NOTATIONAL PRELIMINARIES

For a real-valued random variable Z defined on a probability space (Ω, F, P), we recall

ess-sup Z = inf{ z ∈ R : P(Z > z) = 0 }.

We also introduce some convenient definitions which we use throughout the appendices. For a set W of neural network parameters, we define

∥W∥_T = max{ max_{j_1 ≤ n_1, j_2 ≤ n_2} sup_{t ≤ T} |w_2(⌊t/ϵ⌋, j_1, j_2)|, max_{j_2 ≤ n_2} sup_{t ≤ T} |w_3(⌊t/ϵ⌋, j_2)| }.

Similarly for a set W of MF parameters, we define

∥W∥_T = max{ ess-sup sup_{t ≤ T} |w_2(t, C_1, C_2)|, ess-sup sup_{t ≤ T} |w_3(t, C_2)| }.

For two sets of neural network parameters W, W′, we define their distance

∥W - W′∥_T = sup{ |w_1(⌊t/ϵ⌋, j_1) - w_1′(⌊t/ϵ⌋, j_1)|, |w_2(⌊t/ϵ⌋, j_1, j_2) - w_2′(⌊t/ϵ⌋, j_1, j_2)|, |w_3(⌊t/ϵ⌋, j_2) - w_3′(⌊t/ϵ⌋, j_2)| : t ∈ [0, T], j_1 ∈ [n_1], j_2 ∈ [n_2] }.

Similarly for two sets of MF parameters W, W′, we define their distance

∥W - W′∥_T = ess-sup sup_{t ∈ [0, T]} max{ |w_1(t, C_1) - w_1′(t, C_1)|, |w_2(t, C_1, C_2) - w_2′(t, C_1, C_2)|, |w_3(t, C_2) - w_3′(t, C_2)| }.
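For finite parameter sets, the sup-distance above is directly computable; a small illustrative sketch with random stand-in trajectories (the discrete time axis plays the role of the grid ⌊t/ϵ⌋ ≤ T):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n1, n2, steps = 3, 4, 5, 50   # steps discrete SGD times stand in for t/eps <= T

# Two hypothetical parameter trajectories W, W' stored as arrays over time.
w1  = rng.normal(size=(steps + 1, n1, d))
w2  = rng.normal(size=(steps + 1, n1, n2))
w3  = rng.normal(size=(steps + 1, n2))
w1p = w1 + 0.01 * rng.normal(size=w1.shape)
w2p = w2 + 0.01 * rng.normal(size=w2.shape)
w3p = w3 + 0.01 * rng.normal(size=w3.shape)

def dist(wa, wb):
    # ||W - W'||_T: sup over time and neuron indices; |.| is the Euclidean
    # norm for the vector-valued w1 and the absolute value for w2, w3.
    d1 = np.linalg.norm(wa[0] - wb[0], axis=-1).max()
    d2 = np.abs(wa[1] - wb[1]).max()
    d3 = np.abs(wa[2] - wb[2]).max()
    return max(d1, d2, d3)

D = dist((w1, w2, w3), (w1p, w2p, w3p))
print(D)
```

Note the asymmetry built into the norms: ∥W∥_T tracks only the second- and third-layer magnitudes (which is what the a priori estimates of Appendix B control), while the distance ∥W - W′∥_T involves all three layers.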

B EXISTENCE AND UNIQUENESS OF THE SOLUTION TO MF ODES

We first collect some a priori estimates.

Lemma 11. Under Assumption 1, consider a solution W to the MF ODEs with initialization W(0) such that ∥W(0)∥_0 < ∞. If this solution exists, it satisfies the following a priori bounds, for any T ≥ 0:

ess-sup sup_{t ≤ T} |w_3(t, C_2)| ≤ ∥W(0)∥_0 + KT ≡ ∥W(0)∥_0 + K_{0,3}(T),
ess-sup sup_{t ≤ T} |w_2(t, C_1, C_2)| ≤ ∥W(0)∥_0 + KT(∥W(0)∥_0 + K_{0,3}(T)) ≡ ∥W(0)∥_0 + K_{0,2}(T),

and consequently ∥W∥_T ≤ ∥W(0)∥_0 + max{K_{0,2}(T), K_{0,3}(T)}.

Proof. The bounds can be obtained easily by bounding the respective initializations and update quantities separately. In particular,

ess-sup sup_{t ≤ T} |w_3(t, C_2)| ≤ ess-sup |w_3(0, C_2)| + T ess-sup sup_{t ≤ T} |∂_t w_3(t, C_2)| ≤ ∥W(0)∥_0 + KT,

ess-sup sup_{t ≤ T} |w_2(t, C_1, C_2)| ≤ ess-sup |w_2(0, C_1, C_2)| + T ess-sup sup_{t ≤ T} |∂_t w_2(t, C_1, C_2)| ≤ ess-sup |w_2(0, C_1, C_2)| + KT ess-sup sup_{t ≤ T} |w_3(t, C_2)| ≤ ∥W(0)∥_0 + KT(∥W(0)∥_0 + K_{0,3}(T)).

Inspired by the a priori bounds in Lemma 11, given an arbitrary terminal time T and the initialization W(0), let us consider:

• for a tuple (a, b) ∈ R^2_{≥0}, a space W_T(a, b) of W = (W(t))_{t ≤ T} = (w_1(t, ·), w_2(t, ·, ·), w_3(t, ·))_{t ≤ T} such that

ess-sup sup_{t ≤ T} |w_3(t, C_2)| ≤ b, ess-sup sup_{t ≤ T} |w_2(t, C_1, C_2)| ≤ a,

where w_1: R_{≥0} × Ω_1 → R^d, w_2: R_{≥0} × Ω_1 × Ω_2 → R, w_3: R_{≥0} × Ω_2 → R,

• for a tuple (a, b) ∈ R^2_{≥0} and W(0), a space W^+_T(a, b, W(0)) of W ∈ W_T(a, b) such that additionally W(0) = W(0) (and hence every W in this space shares the same initialization W(0)).

We equip the spaces with the metric ∥W - W′∥_T. It is easy to see that both spaces are complete. Note that Lemma 11 implies that, under Assumption 1 and ∥W(0)∥_0 < ∞, any MF solution W, if it exists, is in W_T(∥W(0)∥_0 + K_{0,2}(T), ∥W(0)∥_0 + K_{0,3}(T)). For the proof of Theorem 1, we work mainly with W^+_T(∥W(0)∥_0 + K_{0,2}(T), ∥W(0)∥_0 + K_{0,3}(T), W(0)), although several intermediate lemmas are proven in more generality for other uses.

Lemma 12.
Under Assumption 1, for T ≥ 0, any W, W′ ∈ W_T(a, b) and almost every z ~ P:

ess-sup sup_{t ≤ T} |Δ^{H_2}(z, C_2; W(t))| ≤ K_{a,b},
ess-sup sup_{t ≤ T} |H_2(x, C_2; W(t)) - H_2(x, C_2; W′(t))| ≤ K_{a,b} ∥W - W′∥_T,
sup_{t ≤ T} |H_3(x; W(t)) - H_3(x; W′(t))| ≤ K_{a,b} ∥W - W′∥_T,
sup_{t ≤ T} |∂_2 L(y, ŷ(x; W(t))) - ∂_2 L(y, ŷ(x; W′(t)))| ≤ K_{a,b} ∥W - W′∥_T,
ess-sup sup_{t ≤ T} |Δ^{H_2}(z, C_2; W(t)) - Δ^{H_2}(z, C_2; W′(t))| ≤ K_{a,b} ∥W - W′∥_T,

where K_{a,b} ≥ 1 is a generic constant that grows polynomially with a and b.

Proof. The first bound is easy to see:

ess-sup sup_{t ≤ T} |Δ^{H_2}(z, C_2; W(t))| ≤ K ess-sup sup_{t ≤ T} |w_3(t, C_2)| ≤ Kb.

We prove the second bound, invoking Assumption 1:

|H_2(x, C_2; W(t)) - H_2(x, C_2; W′(t))| ≤ K E_{C_1}[|w_2(t, C_1, C_2)| |ϕ_1(⟨w_1(t, C_1), x⟩) - ϕ_1(⟨w_1′(t, C_1), x⟩)|] + K E_{C_1}[|w_2(t, C_1, C_2) - w_2′(t, C_1, C_2)|] ≤ K E_{C_1}[|w_2(t, C_1, C_2)| + 1] ∥W - W′∥_T,

which yields, by the fact that W ∈ W_T(a, b),

ess-sup sup_{t ≤ T} |H_2(x, C_2; W(t)) - H_2(x, C_2; W′(t))| ≤ K(a + 1) ∥W - W′∥_T.

Consequently, we have:

|H_3(x; W(t)) - H_3(x; W′(t))| ≤ K E_{C_2}[|w_3(t, C_2)| |ϕ_2(H_2(x, C_2; W(t))) - ϕ_2(H_2(x, C_2; W′(t)))|] + K E_{C_2}[|w_3(t, C_2) - w_3′(t, C_2)|] ≤ K E_{C_2}[|w_3(t, C_2)| |H_2(x, C_2; W(t)) - H_2(x, C_2; W′(t))|] + K ∥W - W′∥_T,

|∂_2 L(y, ŷ(x; W(t))) - ∂_2 L(y, ŷ(x; W′(t)))| ≤ K |ŷ(x; W(t)) - ŷ(x; W′(t))| ≤ K |H_3(x; W(t)) - H_3(x; W′(t))|,

which then yield the third and fourth bounds by the fact that W, W′ ∈ W_T(a, b). Using these bounds, we obtain:

|Δ^{H_2}(z, C_2; W(t)) - Δ^{H_2}(z, C_2; W′(t))| ≤ K |w_3(t, C_2)| ( |∂_2 L(y, ŷ(x; W(t))) - ∂_2 L(y, ŷ(x; W′(t)))| + |H_3(x; W(t)) - H_3(x; W′(t))| + |H_2(x, C_2; W(t)) - H_2(x, C_2; W′(t))| ) + K |w_3(t, C_2) - w_3′(t, C_2)|,

from which the last bound follows.
To prove Theorem 1, for a given W(0), we define a mapping F_{W(0)} that maps W = (w_1, w_2, w_3) ∈ W_T(a, b) to F_{W(0)}(W) = W̃ = (w̃_1, w̃_2, w̃_3), defined by W̃(0) = W(0) and

∂_t w̃_3(t, c_2) = -ξ_3(t) Δ_3(c_2; W(t)),
∂_t w̃_2(t, c_1, c_2) = -ξ_2(t) Δ_2(c_1, c_2; W(t)),
∂_t w̃_1(t, c_1) = -ξ_1(t) Δ_1(c_1; W(t)).

Notice that the right-hand sides involve W, not W̃. Note that the MF ODEs' solution, initialized at W(0), is a fixed point of this mapping. We establish the following estimates for this mapping.

Lemma 13. Under Assumption 1, for T ≥ 0, any initialization W(0) and any W, W′ ∈ W_T(a, b),

ess-sup sup_{s ≤ t} |Δ_3(C_2; W(s)) - Δ_3(C_2; W′(s))| ≤ K_{a,b} ∥W - W′∥_t,
ess-sup sup_{s ≤ t} |Δ_2(C_1, C_2; W(s)) - Δ_2(C_1, C_2; W′(s))| ≤ K_{a,b} ∥W - W′∥_t,
ess-sup sup_{s ≤ t} |Δ_1(C_1; W(s)) - Δ_1(C_1; W′(s))| ≤ K_{a,b} ∥W - W′∥_t,

and consequently, if in addition W(0) = W′(0) (not necessarily equal to W(0)), then

ess-sup sup_{t ≤ T} |w̃_3(t, C_2) - w̃_3′(t, C_2)| ≤ K_{a,b} ∫_0^T ∥W - W′∥_s ds,
ess-sup sup_{t ≤ T} |w̃_2(t, C_1, C_2) - w̃_2′(t, C_1, C_2)| ≤ K_{a,b} ∫_0^T ∥W - W′∥_s ds,
ess-sup sup_{t ≤ T} |w̃_1(t, C_1) - w̃_1′(t, C_1)| ≤ K_{a,b} ∫_0^T ∥W - W′∥_s ds,

in which W̃ = (w̃_1, w̃_2, w̃_3) = F_{W(0)}(W), W̃′ = (w̃_1′, w̃_2′, w̃_3′) = F_{W(0)}(W′) and K_{a,b} ≥ 1 is a generic constant that grows polynomially with a and b.

Proof.
From Assumption 1 and the fact that W, W′ ∈ W_T(a, b), we get:

|Δ_3(C_2; W(s)) - Δ_3(C_2; W′(s))| ≤ K E_Z[|∂_2 L(Y, ŷ(X; W(s))) - ∂_2 L(Y, ŷ(X; W′(s)))|] + K E_Z[|H_3(X; W(s)) - H_3(X; W′(s))|] + K E_Z[|H_2(X, C_2; W(s)) - H_2(X, C_2; W′(s))|],

|Δ_2(C_1, C_2; W(s)) - Δ_2(C_1, C_2; W′(s))| ≤ K_{a,b} |w_1(s, C_1) - w_1′(s, C_1)| + K E_Z[|Δ^{H_2}(Z, C_2; W(s)) - Δ^{H_2}(Z, C_2; W′(s))|],

|Δ_1(C_1; W(s)) - Δ_1(C_1; W′(s))| ≤ K_{a,b} E_Z[E_{C_2}[|Δ^{H_2}(Z, C_2; W(s)) - Δ^{H_2}(Z, C_2; W′(s))|]] + K_{a,b} E_{C_2}[|w_2(s, C_1, C_2) - w_2′(s, C_1, C_2)|] + K_{a,b} |w_1(s, C_1) - w_1′(s, C_1)|,

from which the first three estimates follow, in light of Lemma 12. The last three estimates then follow from the fact that W̃(0) = W̃′(0) and Assumption 1; for instance,

ess-sup sup_{t ≤ T} |w̃_3(t, C_2) - w̃_3′(t, C_2)| ≤ ∫_0^T ess-sup |∂_s w̃_3(s, C_2) - ∂_s w̃_3′(s, C_2)| ds ≤ K ∫_0^T ess-sup |Δ_3(C_2; W(s)) - Δ_3(C_2; W′(s))| ds.

We are now ready to prove Theorem 1.

Proof of Theorem 1. We use a Picard-type iteration. To lighten notations,

W^+_T ≡ W^+_T(∥W(0)∥_0 + K_{0,2}(T), ∥W(0)∥_0 + K_{0,3}(T), W(0)), F ≡ F_{W(0)}.

Since ∥W(0)∥_0 ≤ K by assumption, we have ∥W(0)∥_0 + K_{0,2}(T) + K_{0,3}(T) ≤ K_T. Recall that W^+_T is complete. For an arbitrary T > 0, consider W, W′ ∈ W^+_T. Lemma 13 yields

∥F(W) - F(W′)∥_T ≤ K_T ∫_0^T ∥W - W′∥_s ds.

Note that F maps W^+_T to itself under Assumption 1, by the same argument as Lemma 11. Hence we are allowed to iterate this inequality and get, for an arbitrary T > 0,

∥F^{(k)}(W) - F^{(k)}(W′)∥_T ≤ K_T ∫_0^T ∥F^{(k-1)}(W) - F^{(k-1)}(W′)∥_{T_2} dT_2
≤ K_T^2 ∫_0^T ∫_0^{T_2} ∥F^{(k-2)}(W) - F^{(k-2)}(W′)∥_{T_3} I(T_3 ≤ T_2 ≤ T) dT_3 dT_2
...
≤ K_T^k ∫_0^T ∫_0^{T_2} ... ∫_0^{T_k} ∥W - W′∥_{T_{k+1}} I(T_{k+1} ≤ T_k ≤ ... ≤ T_2 ≤ T) dT_{k+1} ... dT_2
≤ (K_T T)^k / k! · ∥W - W′∥_T.

By substituting W′ = F(W), we have

Σ_{k=1}^∞ ∥F^{(k+1)}(W) - F^{(k)}(W)∥_T = Σ_{k=1}^∞ ∥F^{(k)}(F(W)) - F^{(k)}(W)∥_T ≤ Σ_{k=1}^∞ (K_T T)^k / k! · ∥F(W) - W∥_T < ∞.

Hence as k → ∞, F^{(k)}(W) converges to a limit in W^+_T, which is a fixed point of F.
The uniqueness of a fixed point follows from the above estimate, since if W and W are fixed points then W -W T = F (k) (W ) -F (k) (W ) T ≤ 1 k! K k T W -W T , while one can take k arbitrarily large. This proves that the solution exists and is unique on t ∈ [0, T ]. Since T is arbitrary, we have existence and uniqueness of the solution on the time interval [0, ∞).
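The Picard-type iteration above can be mimicked numerically. The sketch below (all names are illustrative, not from the paper) iterates the integral map (Fw)(t) = w(0) + ∫₀ᵗ f(s, w(s)) ds for a toy one-dimensional dynamics w' = -w on a time grid; the sup-norm gaps between successive iterates shrink at the factorial rate KᵏTᵏ/k! that drives the contraction argument.

```python
import numpy as np

def picard_iterate(f, w0, T=1.0, n_grid=1000, n_iter=12):
    """Iterate the integral map (Fw)(t) = w0 + int_0^t f(s, w(s)) ds on a grid."""
    t = np.linspace(0.0, T, n_grid)
    dt = t[1] - t[0]
    w = np.full(n_grid, w0, dtype=float)     # initial guess: constant path
    sups = []                                # sup-norm gaps ||F^k w - F^{k-1} w||_T
    for _ in range(n_iter):
        # Left-endpoint quadrature of the integral map.
        w_new = w0 + np.concatenate(([0.0], np.cumsum(f(t, w)[:-1]) * dt))
        sups.append(float(np.max(np.abs(w_new - w))))
        w = w_new
    return t, w, sups

# Toy dynamics w' = -w (Lipschitz constant K = 1); exact solution is exp(-t).
t, w, sups = picard_iterate(lambda s, w: -w, w0=1.0)
```

The successive gaps `sups` decay like 1/k!, mirroring the series ∑ₖ KᵏT/k! that makes the iterates Cauchy in the proof.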

C CONNECTION BETWEEN THE NEURAL NET AND ITS MF LIMIT: PROOFS FOR SECTION 3

C.1 PROOF OF THEOREM 3

We construct an auxiliary trajectory, which we call the particle ODEs: ∂ ∂t w3 (t, j 2 ) = -ξ 3 (t) E Z ∂ 2 L Y, ŷ X; W (t) ϕ 3 H 3 X; W (t) ϕ 2 H 2 X, j 2 ; W (t) , ∂ ∂t w2 (t, j 1 , j 2 ) = -ξ 2 (t) E Z ∆ H 2 Z, j 2 ; W (t) ϕ 1 ( w1 (t, j 1 ) , X ) , ∂ ∂t w1 (t, j 1 ) = -ξ 1 (t) E Z   1 n 2 n2 j2=1 ∆ H 2 Z, j 2 ; W (t) w2 (t, j 1 , j 2 ) ϕ 1 ( w1 (t, j 1 ) , X ) X   , in which j 1 = 1, ..., n 1 , j 2 = 1, ..., n 2 , W (t) = ( w1 (t, •) , w2 (t, •, •) , w3 (t, •)) , and t ∈ R ≥0 . We specify the initialization W (0): w1 (0, j 1 ) = w 0 1 (C 1 (j 1 )), w2 (0, j 1 , j 2 ) = w 0 2 (C 1 (j 1 ) , C 2 (j 2 )) and w3 (0, j 2 ) = w 0 3 (C 2 (j 2 )). That is, it shares its initialization with that of the neural network W (0), and hence is coupled with both the neural network and the MF ODEs. Roughly speaking, the particle ODEs are continuous-time trajectories of finitely many neurons, averaged over the data distribution. We note that W (t) is random for all t ∈ R ≥0 due to the randomness of {C i (j i )} i=1,2 . The existence and uniqueness of the solution to the particle ODEs follows from the same proof as in Theorem 1, which we shall not repeat here. We equip W (t) with the norm W T = max max j1≤n1 sup t≤T | w1 (t, j 1 )| , max j1≤n1, j2≤n2 sup t≤T | w2 (t, j 1 , j 2 )| , max j2≤n2 sup t≤T | w3 (t, j 2 )| . One can also define the measures D T W, W and D T W , W similar to Eq. (2): D T W, W = sup |w 1 (t, C 1 (j 1 )) -w1 (t, C 1 (j 1 ))| , |w 2 (t, C 1 (j 1 ) , C 2 (j 2 )) -w2 (t, C 1 (j 1 ) , C 2 (j 2 ))| , |w 3 (t, C 2 (j 2 )) -w3 (t, C 2 (j 2 ))| : t ≤ T, j 1 ≤ n 1 , j 2 ≤ n 2 , D T W , W = sup |w 1 ( t/ , j 1 ) -w1 (t, C 1 (j 1 ))| , |w 2 ( t/ , j 1 , j 2 ) -w2 (t, C 1 (j 1 ) , C 2 (j 2 ))| , |w 3 ( t/ , j 2 ) -w3 (t, C 2 (j 2 ))| : t ≤ T, j 1 ≤ n 1 , j 2 ≤ n 2 . We have the following results: Theorem 14.
Under the same setting as Theorem 3, for any δ > 0, with probability at least 1 -δ, D T W, W ≤ 1 √ n min log 1/2 3 (T + 1) n 2 max δ + e e K T , in which n min = min {n 1 , n 2 }, n max = max {n 1 , n 2 }, and K T = K 1 + T K . Theorem 15. Under the same setting as Theorem 3, for any δ > 0 and ε ≤ 1, with probability at least 1 -δ, D T W , W ≤ √ ε log 1/2 2n 1 n 2 δ + e e K T , in which K T = K 1 + T K . Proof of Theorem 3. Using the triangle inequality D T (W, W) ≤ D T W, W + D T W , W , the thesis is immediate from Theorems 14 and 15.
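The particle ODEs admit a direct numerical treatment: population-averaged gradients with width-rescaled learning rates, discretized by forward Euler, give a deterministic descent trajectory for finitely many neurons. The following toy sketch (tanh activations, squared loss, synthetic data; all sizes and scalings are illustrative assumptions, not the paper's exact setup) integrates such a three-layer flow and checks that the averaged loss decreases.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, d, n_data = 8, 6, 3, 64        # toy widths and sample size (illustrative)
X = rng.normal(size=(n_data, d))
Y = np.tanh(X @ rng.normal(size=d))    # synthetic regression targets

w1 = rng.normal(size=(n1, d))          # first-layer weights  w1(j1)
w2 = rng.normal(size=(n1, n2))         # middle-layer weights w2(j1, j2)
w3 = rng.normal(size=n2)               # top-layer weights    w3(j2)

def forward(w1, w2, w3):
    h1 = np.tanh(X @ w1.T)             # phi_1 outputs, shape (n_data, n1)
    h2 = np.tanh(h1 @ w2 / n1)         # H2-type layer with the 1/n1 mean-field average
    out = h2 @ w3 / n2                 # H3-type output with the 1/n2 average
    return h1, h2, out

def loss(w1, w2, w3):
    return float(np.mean((forward(w1, w2, w3)[2] - Y) ** 2))

dt = 0.05
loss0 = loss(w1, w2, w3)
for _ in range(300):                   # forward-Euler step of the averaged gradient flow
    h1, h2, out = forward(w1, w2, w3)
    dL = 2.0 * (out - Y) / n_data                    # averaged loss derivative
    dH2 = np.outer(dL, w3 / n2) * (1.0 - h2 ** 2)    # Delta^{H2}-type backprop term
    g3 = h2.T @ dL / n2
    g2 = h1.T @ dH2 / n1
    g1 = (((dH2 @ w2.T) / n1) * (1.0 - h1 ** 2)).T @ X
    # Width-rescaled rates keep per-neuron updates O(1) as widths grow.
    w3 -= dt * n2 * g3
    w2 -= dt * n1 * n2 * g2
    w1 -= dt * n1 * g1
loss_end = loss(w1, w2, w3)
```

Because all expectations over the data are taken exactly (over the fixed sample), the trajectory is deterministic given the initialization, which is the sense in which the particle ODEs sit between SGD and the MF limit.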

C.2 PROOF OF THEOREMS 14 AND 15

Proof of Theorem 14. In the following, let K t denote a generic positive constant that may change from line to line and takes the form K t = K 1 + t K , such that K t ≥ 1 and K t ≤ K T for all t ≤ T . We first note that at initialization, D 0 W, W = 0. Since W 0 ≤ K, W T ≤ K T by Lemma 11. Furthermore it is easy to see that W 0 ≤ W 0 ≤ K almost surely. By the same argument as in Lemma 11, W T ≤ K T almost surely. We shall use all of the above bounds repeatedly in the proof. We decompose the proof into several steps. Step 1 -Main proof. Let us define, for brevity, q 3 (t, x) = H 3 x; W (t) -H 3 (x; W (t)) , q 2 (t, x, j 2 , c 2 ) = H 2 x, j 2 ; W (t) -H 2 (x, c 2 ; W (t)) , q ∆ (t, z, j 1 , j 2 , c 1 , c 2 ) = ∆ H 2 z, j 2 ; W (t) w2 (t, j 1 , j 2 ) -∆ H 2 (z, c 2 ; W (t)) w 2 (t, c 1 , c 2 ) . Consider t ≥ 0. We first bound the difference in the updates between W and W . Let us start with w 3 and w3 . By Assumption 1, we have: ∂ ∂t w3 (t, j 2 ) - ∂ ∂t w 3 (t, C 2 (j 2 )) ≤ KE Z [|q 3 (t, X)| + |q 2 (t, X, j 2 , C 2 (j 2 ))|] . Similarly, for w 2 and w2 , ∂ ∂t w2 (t, j 1 , j 2 ) - ∂ ∂t w 2 (t, C 1 (j 1 ) , C 2 (j 2 )) ≤ KE Z ∆ H 2 Z, j 2 ; W (t) -∆ H 2 (Z, C 2 (j 2 ) ; W (t)) + K |w 3 (t, C 2 (j 2 ))| | w1 (t, j 1 ) -w 1 (t, C 1 (j 1 ))| ≤ K t E Z [|q 3 (t, X)| + |q 2 (t, X, j 2 , C 2 (j 2 ))|] + K t (| w1 (t, j 1 ) -w 1 (t, C 1 (j 1 ))| + | w3 (t, j 2 ) -w 3 (t, C 2 (j 2 ))|) ≤ K t E Z [|q 3 (t, X)| + |q 2 (t, X, j 2 , C 2 (j 2 ))|] + K t D t W, W , and for w 1 and w1 , by Lemma 12, ∂ ∂t w1 (t, j 1 ) - ∂ ∂t w 1 (t, C 1 (j 1 )) ≤ KE Z   1 n 2 n2 j2=1 E C2 [q ∆ (t, Z, j 1 , j 2 , C 1 (j 1 ) , C 2 )]   + E C2 ∆ H 2 (Z, C 2 ; W (t)) |w 2 (t, C 1 (j 1 ) , C 2 )| | w1 (t, j 1 ) -w 1 (t, C 1 (j 1 ))| ≤ KE Z   1 n 2 n2 j2=1 E C2 [q ∆ (t, Z, j 1 , j 2 , C 1 (j 1 ) , C 2 )]   + K t D t W, W .
To further the bounding, we now make the following two claims: • Claim 1: For any ξ > 0, max j2≤n2 ∂ ∂t w 3 (t + ξ, C 2 (j 2 )) - ∂ ∂t w 3 (t, C 2 (j 2 )) ≤ K t+ξ ξ, max j1≤n1, j2≤n2 ∂ ∂t w 2 (t + ξ, C 1 (j 1 ) , C 2 (j 2 )) - ∂ ∂t w 2 (t, C 1 (j 1 ) , C 2 (j 2 )) ≤ K t+ξ ξ, max j1≤n1 ∂ ∂t w 1 (t + ξ, C 1 (j 1 )) - ∂ ∂t w 1 (t, C 1 (j 1 )) ≤ K t+ξ ξ, and similarly, max j2≤n2 ∂ ∂t w3 (t + ξ, j 2 ) - ∂ ∂t w3 (t, j 2 ) ≤ K t+ξ ξ, max j1≤n1, j2≤n2 ∂ ∂t w2 (t + ξ, j 1 , j 2 ) - ∂ ∂t w2 (t, j 1 , j 2 ) ≤ K t+ξ ξ, max j1≤n1 ∂ ∂t w1 (t + ξ, j 1 ) - ∂ ∂t w1 (t, j 1 ) ≤ K t+ξ ξ. • Claim 2: For any γ 1 , γ 2 , γ 3 > 0 and t ≥ 0, max max j2≤n2 E Z [|q 2 (t, X, j 2 , C 2 (j 2 ))|] , E Z [|q 3 (t, X)|] , max j1≤n1 E Z   1 n 2 n2 j2=1 E C2 [q ∆ (t, Z, j 1 , j 2 , C 1 (j 1 ) , C 2 )]   ≥ K t D t W, W + γ 1 + γ 2 + γ 3 , with probability at most n 1 γ 1 exp - n 2 γ 2 1 K t + n 2 γ 2 exp - n 1 γ 2 2 K t + 1 γ 3 exp - n 2 γ 2 3 K t . Combining these claims with the previous bounds, taking a union bound over t ∈ {0, ξ, 2ξ, ..., T /ξ ξ} for some ξ ∈ (0, 1), we obtain that max max j2≤n2 ∂ ∂t w3 (t, j 2 ) - ∂ ∂t w 3 (t, C 2 (j 2 )) , max j1≤n1, j2≤n2 ∂ ∂t w2 (t, j 1 , j 2 ) - ∂ ∂t w 2 (t, C 1 (j 1 ) , C 2 (j 2 )) , max j1≤n1 ∂ ∂t w1 (t, j 1 ) - ∂ ∂t w 1 (t, C 1 (j 1 )) ≤ K T D t W, W + γ 1 + γ 2 + γ 3 + ξ , ∀t ∈ [0, T ] , with probability at least 1 - T + 1 ξ n 1 γ 1 exp - n 2 γ 2 1 K T + n 2 γ 2 exp - n 1 γ 2 2 K T + 1 γ 3 exp - n 2 γ 2 3 K T . The above event in turn implies D t W, W ≤ K T t 0 D s W, W + γ 1 + γ 2 + γ 3 + ξ ds, and hence by Gronwall's lemma and the fact D 0 W, W = 0, we get D T W, W ≤ (γ 1 + γ 2 + γ 3 + ξ) e K T . The theorem then follows from the choice ξ = 1 √ n max , γ 2 = K T √ n 1 log 1/2 3 (T + 1) n 2 max δ + e , γ 1 = γ 3 = K T √ n 2 log 1/2 3 (T + 1) n 2 max δ + e . We are left with proving the claims. Step 2 -Proof of Claim 1. 
We have from Assumption 1, ess-sup |w 3 (t + ξ, C 2 ) -w 3 (t, C 2 )| ≤ K t+ξ t ess-sup ∂ ∂t w 3 (s, C 2 ) ds ≤ Kξ, ess-sup |w 2 (t + ξ, C 1 , C 2 ) -w 2 (t, C 1 , C 2 )| ≤ K t+ξ t ess-sup ∂ ∂t w 2 (s, C 1 , C 2 ) ds ≤ K t+ξ t ess-sup |w 3 (s, C 2 )| ds ≤ K t+ξ ξ, ess-sup |w 1 (t + ξ, C 1 ) -w 1 (t, C 1 )| ≤ K t+ξ t ess-sup ∂ ∂t w 1 (s, C 1 ) ds ≤ K t+ξ t ess-sup |w 3 (s, C 2 ) w 2 (s, C 1 , C 2 )| ds ≤ K t+ξ ξ. By Lemma 12, we then obtain that ess-supE Z [|H 2 (X, C 2 ; W (t + ξ)) -H 2 (X, C 2 ; W (t))|] ≤ K t+ξ ξ, E Z [|H 3 (X; W (t + ξ)) -H 3 (X; W (t))|] ≤ K t+ξ ξ, ess-supE Z ∆ H 2 (Z, C 2 ; W (t + ξ)) -∆ H 2 (Z, C 2 ; W (t)) ≤ K t+ξ ξ. Using these estimates, we thus have, by Assumption 1, max j2≤n2 ∂ ∂t w 3 (t + ξ, C 2 (j 2 )) - ∂ ∂t w 3 (t, C 2 (j 2 )) ≤ K t+ξ ξ + KE Z [|H 3 (X; W (t + ξ)) -H 3 (X; W (t))|] + Kess-supE Z [|H 2 (X, C 2 ; W (t + ξ)) -H 2 (X, C 2 ; W (t))|] ≤ K t+ξ ξ, max j1≤n1, j2≤n2 ∂ ∂t w 2 (t + ξ, C 1 (j 1 ) , C 2 (j 2 )) - ∂ ∂t w 2 (t, C 1 (j 1 ) , C 2 (j 2 )) ≤ K t+ξ ξ + Kess-supE Z ∆ H 2 (Z, C 2 ; W (t + ξ)) -∆ H 2 (Z, C 2 ; W (t)) + Kess-sup |w 3 (t, C 2 )| |w 1 (t + ξ, C 1 ) -w 1 (t, C 1 )| ≤ K t+ξ ξ, max j1≤n1 ∂ ∂t w 1 (t + ξ, C 1 (j 1 )) - ∂ ∂t w 1 (t, C 1 (j 1 )) ≤ K t+ξ ξ + Kess-supE Z E C2 ∆ H 2 (Z, C 2 ; W (t + ξ)) -∆ H 2 (Z, C 2 ; W (t)) |w 2 (t, C 1 , C 2 )| + Kess-supE C2 [|w 3 (t, C 2 )| |w 2 (t + ξ, C 1 , C 2 ) -w 2 (t, C 1 , C 2 )|] + Kess-supE C2 [|w 3 (t, C 2 ) w 2 (t, C 1 , C 2 )|] |w 1 (t + ξ, C 1 ) -w 1 (t, C 1 )| ≤ K t+ξ ξ. The proof of the rest of the claim is similar. Step 3 -Proof of Claim 2. We recall the definitions of q ∆ , q 2 and q 3 . Let us decompose them as follows. 
We start with q 2 : |q 2 (t, x, j 2 , C 2 (j 2 ))| = 1 n 1 n1 j1=1 w2 (t, j 1 , j 2 ) ϕ 1 ( w1 (t, j 1 ) , x ) -E C1 [w 2 (t, C 1 , C 2 (j 2 )) ϕ 1 ( w 1 (t, C 1 ) , x )] ≤ max j1≤n1 | w2 (t, j 1 , j 2 ) ϕ 1 ( w1 (t, j 1 ) , x ) -w 2 (t, C 1 (j 1 ) , C 2 (j 2 )) ϕ 1 ( w 1 (t, C 1 (j 1 )) , x )| + 1 n 1 n1 j1=1 w 2 (t, C 1 (j 1 ) , C 2 (j 2 )) ϕ 1 ( w 1 (t, C 1 (j 1 )) , x ) -E C1 [w 2 (t, C 1 , C 2 (j 2 )) ϕ 1 ( w 1 (t, C 1 ) , x )] ≡ Q 2,1 (x, j 2 ) + Q 2,2 (x, j 2 ) . Similarly, we have for q 3 : |q 3 (t, x)| = 1 n 2 n2 j2=1 w3 (t, j 2 ) ϕ 2 H 2 x, j 2 ; W (t) -E C2 [w 3 (t, C 2 ) ϕ 2 (H 2 (x, C 2 ; W (t)))] ≤ max j2≤n2 w3 (t, j 2 ) ϕ 2 H 2 x, j 2 ; W (t) -w 3 (t, C 2 (j 2 )) ϕ 2 (H 2 (x, C 2 (j 2 ) ; W (t))) + 1 n 2 n2 j2=1 w 3 (t, C 2 (j 2 )) ϕ 2 (H 2 (x, C 2 (j 2 ) ; W (t))) -E C2 [w 3 (t, C 2 ) ϕ 2 (H 2 (x, C 2 ; W (t)))] ≡ Q 3,1 (x) + Q 3,2 (x) . Finally we have for q ∆ : 1 n 2 n2 j2=1 E C2 [q ∆ (t, z, j 1 , j 2 , C 1 (j 1 ) , C 2 )] ≤ max j2≤n2 ∆ H 2 z, j 2 ; W (t) w2 (t, j 1 , j 2 ) -∆ H 2 (z, C 2 (j 2 ) ; W (t)) w 2 (t, C 1 (j 1 ) , C 2 (j 2 )) + 1 n 2 n2 j2=1 ∆ H 2 (z, C 2 (j 2 ) ; W (t)) w 2 (t, C 1 (j 1 ) , C 2 (j 2 )) -E C2 ∆ H 2 (z, C 2 ; W (t)) w 2 (t, C 1 (j 1 ) , C 2 ) ≡ Q 1,1 (z, j 1 ) + Q 1,2 (z, j 1 ) .

Now let us analyze each of the terms.

• We start with Q 2,1 . We have from Assumption 1, max j2≤n2 E Z [Q 2,1 (X, j 2 )] ≤ K max j1≤n1, j2≤n2 | w2 (t, j 1 , j 2 ) -w 2 (t, C 1 (j 1 ) , C 2 (j 2 ))| + K max j1≤n1, j2≤n2 |w 2 (t, C 1 (j 1 ) , C 2 (j 2 ))| | w1 (t, j 1 ) -w 1 (t, C 1 (j 1 ))| ≤ K t D t W, W . • To bound Q 2,2 , let us write: Z 2 (x, c 1 , c 2 ) = w 2 (t, c 1 , c 2 ) ϕ 1 ( w 1 (t, c 1 ) , x ) . Recall that C 1 (j 1 ) and C 2 (j 2 ) are independent. We thus have: E [Z 2 (X, C 1 (j 1 ) , C 2 (j 2 ))|X, C 2 (j 2 )] = E C1 [Z 2 (X, C 1 , C 2 (j 2 ))] . Furthermore {Z 2 (C 1 (j 1 ) , C 2 (j 2 ))} j1∈[n1] are independent, conditional on C 2 (j 2 ). We also have, almost surely |Z 2 (X, C 1 (j 1 ) , C 2 (j 2 ))| ≤ K t , by Assumption 1. Then by Lemma 19, P (E Z [Q 2,2 (X, j 2 )] ≥ K t γ 2 ) ≤ (1/γ 2 ) exp -n 1 γ 2 2 /K t . • To bound Q 3,1 , we have from Assumption 1, E Z [Q 3,1 (X)] ≤ max j2≤n2 (K | w3 (t, j 2 ) -w 3 (t, C 2 (j 2 ))| + K t E Z [|q 2 (t, X, j 2 , C 2 (j 2 ))|]) ≤ KD t W, W + K t max j2≤n2 E Z [|q 2 (t, X, j 2 , C 2 (j 2 ))|] . • To bound Q 3,2 , noticing that almost surely |w 3 (t, C 2 (j 2 )) ϕ 2 (H 2 (x, C 2 (j 2 ) ; W (t)))| ≤ K t by Assumption 1, we obtain P (E Z [Q 3,2 (X)] ≥ K t γ 3 ) ≤ (1/γ 3 ) exp -n 2 γ 2 3 /K t , similar to the treatment of Q 2,2 . • To bound Q 1,1 , using Assumption 1, E Z [Q 1,1 (Z, j 1 )] ≤ K max j2≤n2 |w 2 (t, C 1 (j 1 ) , C 2 (j 2 ))| E Z ∆ H 2 Z, j 2 ; W (t) -∆ H 2 (Z, C 2 (j 2 ) ; W (t)) + K max j2≤n2 | w2 (t, j 1 , j 2 ) -w 2 (t, C 1 (j 1 ) , C 2 (j 2 ))| E Z ∆ H 2 Z, j 2 ; W (t) ≤ K max j2≤n2 |w 2 (t, C 1 (j 1 ) , C 2 (j 2 ))| | w3 (t, j 2 ) -w 3 (t, C 2 (j 2 ))| + |w 3 (t, C 2 (j 2 ))| E Z [|q 3 (t, X)| + |q 2 (t, X, j 2 , C 2 (j 2 ))|] + K max j2≤n2 | w2 (t, j 1 , j 2 ) -w 2 (t, C 1 (j 1 ) , C 2 (j 2 ))| | w3 (t, j 2 )| ≤ K t D t W, W + E Z |q 3 (t, X)| + max j2≤n2 |q 2 (t, X, j 2 , C 2 (j 2 ))| . 
• To bound Q 1,2 , we note that almost surely ∆ H 2 (Z, C 2 (j 2 ) ; W (t)) w 2 (t, C 1 (j 1 ) , C 2 (j 2 )) ≤ K |w 3 (t, C 2 (j 2 ))| |w 2 (t, C 1 (j 1 ) , C 2 (j 2 ))| ≤ K t . Then similar to the bounding of Q 2,2 , we get: P (E Z [Q 1,2 (Z, j 1 )] ≥ K t γ 1 ) ≤ (1/γ 1 ) exp -n 2 γ 2 1 /K t . Finally, combining all of these bounds together and suitably applying the union bound over j 1 ∈ [n 1 ] and j 2 ∈ [n 2 ], we obtain the claim. Proof of Theorem 15. We consider t ≤ T , for a given terminal time T ∈ N ≥0 . We again reuse the notation K t from the proof of Theorem 14. Note that K t ≤ K T for all t ≤ T . We also note that at initialization, D 0 W, W = 0. We also recall from the proof of Theorem 14 that W T ≤ K T almost surely. For brevity, let us define several quantities that relate to the difference in the gradient updates between W and W : q 3 (k, z, z, j 2 ) = ∂ 2 L (y, ŷ (x; W (k))) ϕ 3 (H 3 (x; W (k))) ϕ 2 (H 2 (x, j 2 ; W (k))) -∂ 2 L ỹ, ŷ x; W (k ) ϕ 3 H 3 x; W (k ) ϕ 2 H 2 x, j 2 ; W (k ) , r 3 (k, z, j 2 ) = ξ 3 (k ) ∂ 2 L y, ŷ x; W (k ) ϕ 3 H 3 x; W (k ) ϕ 2 H 2 x, j 2 ; W (k ) -ξ 3 (k ) E Z ∂ 2 L Y, ŷ X; W (k ) ϕ 3 H 3 X; W (k ) ϕ 2 H 2 X, j 2 ; W (k ) , q 2 (k, z, z, j 1 , j 2 ) = ∆ H 2 (z, j 2 ; W (k)) ϕ 1 ( w 1 (k, j 1 ) , x ) -∆ H 2 z, j 2 ; W (k ) ϕ 1 ( w1 (k , j 1 ) , x ) , r 2 (k, z, j 1 , j 2 ) = ξ 2 (k ) ∆ H 2 z, j 2 ; W (k ) ϕ 1 ( w1 (k , j 1 ) , x ) -ξ 2 (k ) E Z ∆ H 2 Z, j 2 ; W (k ) ϕ 1 ( w1 (k , j 1 ) , X ) , q 1 (k, z, z, j 1 ) = 1 n 2 n2 j2=1 ∆ H 2 (z, j 2 ; W (k)) w 2 (k, j 1 , j 2 ) ϕ 1 ( w 1 (k, j 1 ) , x ) x - 1 n 2 n2 j2=1 ∆ H 2 z, j 2 ; W (k ) w2 (k , j 1 , j 2 ) ϕ 1 ( w1 (k , j 1 ) , x ) x, r 1 (k, z, j 1 ) = ξ 1 (k ) 1 n 2 n2 j2=1 ∆ H 2 z, j 2 ; W (k ) w2 (k , j 1 , j 2 ) ϕ 1 ( w1 (k , j 1 ) , x ) x -ξ 1 (k ) E Z   1 n 2 n2 j2=1 ∆ H 2 Z, j 2 ; W (k ) w2 (k , j 1 , j 2 ) ϕ 1 ( w1 (k , j 1 ) , X ) X   .
Let us also define: q H 3 (k, x) = H 3 (x; W (k)) -H 3 x; W (k ) , q H 2 (k, x, j 2 ) = H 2 (x, j 2 ; W (k)) -H 2 x, j 2 ; W (k ) . We proceed in several steps. Step 1: Decomposition. As shown in the proof of Theorem 3: max j2≤n2 ∂ ∂t w3 (t + ξ, j 2 ) - ∂ ∂t w3 (t, j 2 ) ≤ K t+ξ ξ, max j1≤n1, j2≤n2 ∂ ∂t w2 (t + ξ, j 1 , j 2 ) - ∂ ∂t w2 (t, j 1 , j 2 ) ≤ K t+ξ ξ, max j1≤n1 ∂ ∂t w1 (t + ξ, j 1 ) - ∂ ∂t w1 (t, j 1 ) ≤ K t+ξ ξ. for any t ≥ 0 and ξ ≥ 0. These time-interpolation estimates, along with Assumption 1, allow to derive the following. We first have: max j2≤n2 |w 3 ( t/ , j 2 ) -w3 (t, j 2 )| ≤ K max j2≤n2 t/ -1 k=0 ξ 3 (k ) E Z [q 3 (k, z (k) , Z, j 2 )] + tK t ≤ K max j2≤n2 [Q 3,1 ( t/ , j 2 ) + Q 3,2 ( t/ , j 2 )] + tK t , where we define Q 3,1 ( t/ , j 2 ) = t/ -1 k=0 |q 3 (k, z (k) , z (k) , j 2 )| , Q 3,2 ( t/ , j 2 ) = t/ -1 k=0 r 3 (k, z (k) , j 2 ) . (Here t/ -1 k=0 = 0 if t/ = 0.) We have similarly: max j1≤n1 |w 1 ( t/ , j 1 ) -w1 (t, j 1 )| ≤ K max j1≤n1 [Q 1,1 ( t/ , j 1 ) + Q 1,2 ( t/ , j 1 )] + tK t , max j1≤n1, j2≤n2 |w 2 ( t/ , j 1 , j 2 ) -w2 (t, j 1 , j 2 )| ≤ K max j1≤n1, j2≤n2 [Q 2,1 ( t/ , j 1 , j 2 ) + Q 2,2 ( t/ , j 1 , j 2 )] + tK t , in which Q 1,1 ( t/ , j 1 ) = t/ -1 k=0 |q 1 (k, z (k) , z (k) , j 1 )| , Q 1,2 ( t/ , j 1 ) = t/ -1 k=0 r 1 (k, z (k) , j 1 ) , Q 2,1 ( t/ , j 1 , j 2 ) = t/ -1 k=0 |q 2 (k, z (k) , z (k) , j 1 , j 2 )| , Q 2,2 ( t/ , j 1 , j 2 ) = t/ -1 k=0 r 2 (k, z (k) , j 1 , j 2 ) . The task is now to bound Q 1,1 , Q 1,2 , Q 2,1 , Q 2,2 , Q 3,1 and Q 3,2 . Step 2: Bounding the terms. Before we proceed, let us give some bounds for q H 3 and q H 2 , which hold for any x ∈ R d : q H 2 (k, x, j 2 ) ≤ K max j1≤n1 (| w2 (k , j 1 , j 2 )| |w 1 (k, j 1 ) -w1 (k , j 1 )| + |w 2 (k, j 1 , j 2 ) -w2 (k , j 1 , j 2 )|) ≤ K k D k W , W , q H 3 (k, x) ≤ K max j2≤n2 |w 3 (k, j 2 ) -w3 (k , j 2 )| + | w3 (k , j 2 )| q H 2 (k, x, j 2 ) ≤ K k D k W , W . With these, we have the following: • Let us bound Q 3,1 . 
By Assumption 1, |q 3 (k, z (k) , z (k) , j 2 )| ≤ K q H 2 (k, x (k) , j 2 ) + q H 3 (k, x (k)) . We then get: max j2≤n2 Q 3,1 ( t/ , j 2 ) ≤ K t t/ -1 k=0 D k W , W . • Similarly to Q 3,1 , we consider Q 2,1 : |q 2 (k, z (k) , z (k) , j 1 , j 2 )| ≤ K ∆ H 2 (z (k) , j 2 ; W (k)) -∆ H 2 z (k) , j 2 ; W (k ) + K ∆ H 2 z (k) , j 2 ; W (k ) |w 1 (k, j 1 ) -w1 (k , j 1 )| ≤ K k q H 2 (k, x (k) , j 2 ) + q H 3 (k, x (k)) + K |w 3 (k, j 2 ) -w3 (k , j 2 )| + K k |w 1 (k, j 1 ) -w1 (k , j 1 )| , which yields max j1≤n1, j2≤n2 Q 2,1 ( t/ , j 1 , j 2 ) ≤ K t t/ -1 k=0 D k W , W . • Again we get a similar bound for Q 1,1 : |q 1 (k, z (k) , z (k) , j 1 )| ≤ K n 2 n2 j2=1 | w2 (k , j 1 , j 2 )| ∆ H 2 (z (k) , j 2 ; W (k)) -∆ H + K n 2 n2 j2=1 ∆ H 2 z (k) , j 2 ; W (k ) |w 2 (k, j 1 , j 2 ) -w2 (k , j 1 , j 2 )| + K n 2 n2 j2=1 | w2 (k , j 1 , j 2 )| ∆ H 2 z (k) , j 2 ; W (k ) |w 1 (k, j 1 ) -w1 (k , j 1 )| ≤ K k max j2≤n2 q H 2 (k, x (k) , j 2 ) + q H 3 (k, x (k)) + K k D k W , W , which yields max j1≤n1 Q 1,1 ( t/ , j 1 ) ≤ K t t/ -1 k=0 D k W , W . • Let us bound Q 3,2 . Let us define: r 3 (k, j 2 ) = k-1 =0 r 3 (k, z (k) , j 2 ) , r 3 (0, j 2 ) = 0. Let F k be the sigma-algebra generated by {z ( ) : ∈ {0, ..., k -1}}. Note that {r 3 (k, j 2 )} k∈N is a martingale adapted to {F k } k∈N . Furthermore, for k ≤ T / , the martingale difference is bounded: |r 3 (k, z (k) , j 2 )| ≤ K by Assumption 1. Therefore, by Theorem 20 and the union bound, we have: P max j2≤n2 max ∈{0,1,...,T / } Q 3,2 ( , j 2 ) ≥ ξ ≤ 2n 2 exp - ξ 2 K (T + 1) . • The bounding of Q 2,2 is similar: |r 2 (k, z (k) , j 1 , j 2 )| ≤ K k almost surely by Assumption 1, and thus P max j1≤n1, j2≤n2 max ∈{0,1,...,T / } Q 2,2 ( , j 1 , j 2 ) ≥ ξ ≤ 2n 1 n 2 exp - ξ 2 K T (T + 1) . • Again the bounding of Q 1,2 is also similar: |r 1 (k, z (k) , j 1 )| ≤ K k almost surely by Assumption 1, and thus P max j1≤n1 max ∈{0,1,...,T / } Q 1,2 ( , j 1 ) ≥ ξ ≤ 2n 1 exp - ξ 2 K T (T + 1) . Step 3: Putting everything together. 
All the above results give us D ε⌊t/ε⌋ W , W ≤ K T ε ⌊t/ε⌋-1 k=0 D εk W , W + ξ + εT K T ∀t ≤ T, which holds with probability at least 1 -2n 1 n 2 exp - ξ 2 K T ε (T + 1) . The above event implies, by Gronwall's lemma, D T W , W ≤ (ξ + ε) e K T . Choosing ξ = (K T ε (T + 1) log (2n 1 n 2 /δ)) 1/2 completes the proof.
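Step 2 controls the Q·,2 terms through a bounded-martingale-difference argument, which is where the exp(-ξ²/K)-type tails come from. A quick empirical sanity check of that mechanism, on a generic ±1-increment martingale unrelated to the paper's quantities (the bound below is the standard maximal Azuma-Hoeffding inequality, used here as a stand-in for the concentration tool invoked in the proof):

```python
import numpy as np

rng = np.random.default_rng(2)

def azuma_check(n_steps=500, trials=5000, xi=3.0):
    """Compare the empirical tail of max_k |M_k| for a martingale with
    increments bounded by 1 against the Azuma-Hoeffding-type bound."""
    steps = rng.choice([-1.0, 1.0], size=(trials, n_steps))
    M = np.cumsum(steps, axis=1)                 # martingale paths M_k
    running_max = np.max(np.abs(M), axis=1)
    thresh = xi * np.sqrt(n_steps)
    freq = float(np.mean(running_max >= thresh))
    # Maximal Azuma-Hoeffding bound for max_{k <= n} |M_k|.
    bound = 2.0 * np.exp(-thresh ** 2 / (2.0 * n_steps))
    return freq, bound

freq, bound = azuma_check()
```

As in the proof, setting the threshold to a sqrt(log(1/δ))-multiple of the natural scale turns the exponential tail into a 1-δ event, which is then fed into the union bound and Gronwall step.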

C.3 PROOFS OF COROLLARIES 4 AND 5

Proof of Corollary 4. By the assumption on ψ and Assumption 1, we have: |E Z [ψ (Y, ŷ (X; W ( t/ ))) -ψ (Y, ŷ (X; W (t)))]| ≤ KE Z [|H 3 (X; W ( t/ )) -H 3 (X; W (t))|] ≤ KE Z [|H 3 (X; W (t)) -H 3 (X; W ( t/ ))|] + KE Z [|H 3 (X; W ( t/ )) -H 3 (X; W ( t/ ))|] . An inspection of the proof of Theorem 3 (in particular, the proofs of Theorems 14 and 15) reveals that firstly, sup t≤T E Z [|H 3 (X; W (t)) -H 3 (X; W ( t/ ))|] ≤ K T ε, and secondly, sup t≤T E Z [|H 3 (X; W ( t/ )) -H 3 (X; W ( t/ ))|] ≤ K T D T (W, W) + 1 √ n min log 1/2 3T n 2 max δ + e e K T with probability at least 1 -δ. Together with Theorem 3, we obtain the claim. Proof of Corollary 5. Observe that for each index (N 1 , N 2 ) of Init, one obtains a neural network initialization W(0) with law ρ by setting w 1 (0, j 1 ) = w 1 (0, C 1 (j 1 )), w 2 (0, j 1 , j 2 ) = w 2 (0, C 1 (j 1 ) , C 2 (j 2 )) , w 3 (0, j 2 ) = w 3 (0, C 2 (j 2 )) , j 1 ∈ [N 1 ] , j 2 ∈ [N 2 ] . We consider the evolution W (k) starting from W (0), which is independent of W . Note that W (k) is a deterministic function of its initialization W (0) and the data {z (j)} j≤k . Similarly, we consider the counterpart for Ŵ : the evolution Ŵ (k) as a function of the initialization Ŵ (0) and the data {ẑ (j)} j≤k . Since both the initialization and the data share the same distribution, these evolutions have the same law; to be specific, W (n 1 , n 2 , T ) and Ŵ (n 1 , n 2 , T ) have the same distribution for any n 1 , n 2 and T , where we define W (n 1 , n 2 , T ) = w 1 (k, j 1 ) , w 2 (k, j 1 , j 2 ) , w 3 (k, j 2 ) : j 1 ∈ [n 1 ] , j 2 ∈ [n 2 ] , k ≤ T / , and a similar definition for Ŵ (n 1 , n 2 , T ). In other words, W W, Ŵ ≡ inf coupling of (W, Ŵ) E max k≤ T / , j1≤n1, j2≤n2 |w 1 (k, j 1 ) -ŵ1 (k, j 1 )| , |w 2 (k, j 1 , j 2 ) -ŵ2 (k, j 1 , j 2 )| , |w 3 (k, j 2 ) -ŵ3 (k, j 2 )| = 0.
Theorem 3 implies that for any tuple {n 1 , n 2 } such that n 1 ≤ N 1 and n 2 ≤ N 2 , with probability at least 1 -2δ, D (n1,n2) T (W, W) ≡ max sup t≤T, j1≤n1, j2≤n2 |w 2 (t, j 1 , j 2 ) -w 2 (t, C 1 (j 1 ) , C 2 (j 2 ))| , sup t≤T, j2≤n2 |w 3 (t, j 2 ) -w 3 (t, C 2 (j 2 ))| , sup t≤T, j1≤n1 |w 1 (t, j 1 ) -w 1 (t, C 1 (j 1 ))| ≤ Õδ,T ( , N 1 , N 2 ) , where Õδ,T ( , N 1 , N 2 ) → 0 as → 0 and N -1 min log N max → 0 with N min = min {N 1 , N 2 } and N max = max {N 1 , N 2 }. We also have a similar result for Ŵ and Ŵ . As such, with probability at least 1 -4δ, W W, Ŵ ≡ inf coupling of (W, Ŵ) E sup t≤T, j1≤n1, j2≤n2 w 1 (t, C 1 (j 1 )) -ŵ1 t, Ĉ1 (j 1 ) , w 2 (t, C 1 (j 1 ) , C 2 (j 2 )) -ŵ2 t, Ĉ1 (j 1 ) , Ĉ2 (j 2 ) , w 3 (t, C 2 (j 2 )) -ŵ3 t, Ĉ2 (j 2 ) ≤ D (n1,n2) T (W, W) + D (n1,n2) T Ŵ , Ŵ + W W, Ŵ ≤ 2 Õδ,T ( , N 1 , N 2 ) . By fixing the tuple {n 1 , n 2 } while letting → 0, N -1 min log N max → 0 and δ → 0, we obtain the claim.
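The proof of Corollary 5 exploits that two identically distributed trajectories admit a coupling of cost zero. The same phenomenon in a minimal one-dimensional form (names illustrative, not from the paper): for empirical laws of equal size, the optimal coupling matches sorted samples, so two copies of the same empirical law are at distance zero, while a shifted copy pays exactly the shift.

```python
import numpy as np

rng = np.random.default_rng(4)

def w1_distance_1d(a, b):
    """Wasserstein-1 distance between two empirical laws on R of equal size:
    the optimal (monotone) coupling pairs sorted samples."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

x = rng.normal(size=1000)
y = rng.permutation(x)                # same empirical law, different ordering
d_same = w1_distance_1d(x, y)         # identical laws: optimal cost is 0
d_shift = w1_distance_1d(x, x + 1.0)  # a unit shift costs exactly 1
```

In the corollary's setting the analogous coupling is the synchronous one (same initialization, same data stream), which makes W(W, Ŵ) vanish identically.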

D GLOBAL CONVERGENCE: PROOFS FOR SECTION 4

D.1 PROOF OF PROPOSITION 7

Proof of Proposition 7. Consider a probability space (Λ, G, P 0 ) with random processes R d -valued p 1 (θ 1 ), R-valued p 2 (θ 1 , θ 2 ) and R-valued p 3 (θ 2 ), which are indexed by (θ 1 , θ 2 ) ∈ [0, 1] × [0, 1], such that the following holds. Let m 1 and m 2 be two arbitrary finite positive integers and, with these integers, let θ (ki) i ∈ [0, 1] : k i ∈ [m i ] , i = 1, 2 be an arbitrary collection. For each i = 1, 2, let S i be the set of unique elements in θ (ki) i : k i ∈ [m i ] . Similarly, let R 2 be the set of unique pairs in θ (k1) 1 , θ (k2) 2 : k 1 ∈ [m 1 ] , k 2 ∈ [m 2 ] . We have that {p 1 (θ 1 ) : θ 1 ∈ S 1 }, {p 3 (θ 2 ) : θ 2 ∈ S 2 } and {p 2 (θ 1 , θ 2 ) : (θ 1 , θ 2 ) ∈ R 2 } are all mutually independent. In addition, we also have Law (p 1 (θ 1 )) = ρ 1 , Law (p 3 (θ 2 )) = ρ 3 and Law (p 2 (θ 1 , θ 2 )) = ρ 2 for any θ 1 ∈ S 1 , θ 2 ∈ S 2 and (θ 1 , θ 2 ) ∈ R 2 . Such a space exists by Kolmogorov's extension theorem. We now construct the desired neuronal embedding. For i = 1, 2, consider Ω i = Λ × [0, 1] and F i = G × B ([0, 1]), equipped with the product measure P 0 × Unif ([0, 1]) in which Unif ([0, 1]) is the uniform measure over [0, 1] equipped with the Borel sigma-algebra B ([0, 1]). We construct Ω = Ω 1 × Ω 2 and F = F 1 × F 2 , equipped with the product measure P = (P 0 × Unif ([0, 1])) 2 . Define the deterministic functions w 0 1 : Ω 1 → R d , w 0 2 : Ω 1 × Ω 2 → R and w 0 3 : Ω 2 → R: w 0 1 ((λ 1 , θ 1 )) = p 1 (θ 1 ) (λ 1 ) , w 0 2 ((λ 1 , θ 1 ) , (λ 2 , θ 2 )) = p 2 (θ 1 , θ 2 ) (λ 2 ) , w 0 3 ((λ 2 , θ 2 )) = p 3 (θ 2 ) (λ 2 ) . It is easy to check that this construction yields the desired neuronal embedding.
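The construction can be mimicked computationally: key a pseudorandom generator on the index θ, so that the same index always reproduces the same draw while distinct indices give (practically) independent ones. Networks of any widths are then carved out of one fixed object, which is precisely the consistency the neuronal embedding provides. A hypothetical sketch with Gaussian ρ1, ρ2, ρ3 (an assumption for illustration; the paper's construction goes through the Kolmogorov extension theorem, not an RNG):

```python
import numpy as np

D = 3  # input dimension (illustrative)

def _rng(*key):
    # Deterministic generator keyed on the index: same index -> same draw,
    # distinct indices -> practically independent draws (within one process).
    return np.random.default_rng(abs(hash(key)) % (2 ** 32))

def p1(theta1):            # Law(p1(theta1)) = rho_1 = N(0, I_D)
    return _rng("p1", theta1).normal(size=D)

def p2(theta1, theta2):    # Law(p2(theta1, theta2)) = rho_2 = N(0, 1)
    return float(_rng("p2", theta1, theta2).normal())

def p3(theta2):            # Law(p3(theta2)) = rho_3 = N(0, 1)
    return float(_rng("p3", theta2).normal())

def init_network(n1, n2, seed=0):
    # Index draws C1, C2 ~ Unif[0,1] use separate streams so that smaller
    # networks are prefixes of larger ones sampled from the same embedding.
    C1 = np.random.default_rng(seed).uniform(size=n1)
    C2 = np.random.default_rng(seed + 1).uniform(size=n2)
    w1 = np.stack([p1(t) for t in C1])
    w2 = np.array([[p2(s, t) for t in C2] for s in C1])
    w3 = np.array([p3(t) for t in C2])
    return w1, w2, w3
```

Enlarging the widths only appends new rows and columns; the weights already drawn are unchanged, mirroring how a single neuronal embedding encapsulates neural networks of arbitrary sizes.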

D.2 PROOF OF THEOREM 8

We first present a measurability argument, which is crucial to showing that a certain universal approximation property holds throughout the course of training. Lemma 16 (Measurability argument). Consider a family Init of initialization laws, which are ρ 1 , ρ 2 , ρ 3 -i.i.d., such that ρ 2 -almost surely |w 2 | ≤ K and ρ 3 -almost surely |w 3 | ≤ K. There exists a neuronal embedding Ω, F, P, w 0 i i=1,2,3 of Init such that there exist Borel functions w * 1 and ∆ H * 2 for which P -almost surely, for all t ≥ 0, w 1 (t, C 1 ) = w * 1 t, w 0 1 (C 1 ) , ∆ H 2 (z, C 2 ; W (t)) = ∆ H * 2 t, z, w 0 3 (C 2 ) , where W (t) is the MF dynamics formed under the coupling procedure with this neuronal embedding as described in Section 3.1. Furthermore, ∂ ∂t w * 1 (t, u 1 ) = -ξ 1 (t) E Z ∆ H * 2 (t, Z, u 3 ) u 2 ϕ 1 ( w * 1 (t, u 1 ) , X ) X ρ 2 (du 2 ) ρ 3 (du 3 ) + ξ 1 (t) t 0 ξ 2 (s) E Z,Z ∆ H * 2 (t, Z, u 3 ) ∆ H * 2 (s, Z , u 3 ) ρ 3 (du 3 ) × ϕ 1 ( w * 1 (s, u 1 ) , X ) ϕ 1 ( w * 1 (t, u 1 ) , X ) X ds, with initialization w * 1 (0, u 1 ) = u 1 for all u 1 ∈ supp ρ 1 and t ≥ 0, where Z is an independent copy of Z. Proof. We denote by K t a constant that may depend on t and is finite for finite t. By Proposition 7, there exists a neuronal embedding that accommodates Init. We recall its construction and reuse the notations from the proof of Proposition 7; in particular: w 0 1 ((λ 1 , θ 1 )) = p 1 (θ 1 ) (λ 1 ) , w 0 2 ((λ 1 , θ 1 ) , (λ 2 , θ 2 )) = p 2 (θ 1 , θ 2 ) (λ 2 ) , w 0 3 ((λ 2 , θ 2 )) = p 3 (θ 2 ) (λ 2 ) . Let S 1 , S 3 and S 2 denote the sigma-algebras generated by w 0 1 (C 1 ), w 0 3 (C 2 ) and w 0 1 (C 1 ) , w 0 2 (C 1 , C 2 ) , w 0 3 (C 2 ) respectively. Let S 13 denote the sigma-algebra generated by S 1 and S 3 . We also let S Z 1 denote the sigma-algebra generated by S 1 and the sigma-algebra of the data Z. We define S Z 2 and S Z 3 similarly. Step 1: Reduced dynamics.
Given the MF dynamics W (t), let us define ∆3 (t, c 2 ) = E [∆ 3 (C 2 ; W (t))|S 3 ] (c 2 ) , ∆2 (t, c 1 , c 2 ) = E [∆ 2 (C 1 , C 2 ; W (t))|S 2 ] (c 1 , c 2 ) , ∆1 (t, c 1 ) = E [∆ 1 (C 1 ; W (t))|S 1 ] (c 1 ) . We recall from the proof of Theorem 14 that for any t, s ≥ 0, ess-sup |w 3 (t, C 2 ) -w 3 (s, C 2 )| ≤ K |t -s| , ess-sup |w 2 (t, C 1 , C 2 ) -w 2 (s, C 1 , C 2 )| ≤ K t∨s |t -s| , ess-sup |w 1 (t, C 1 ) -w 1 (s, C 1 )| ≤ K t∨s |t -s| . Then by Lemma 13, E ∆3 (t, C 2 ) -∆3 (s, C 2 ) 2 ≤ K t∨s |t -s| 2 , E ∆2 (t, C 1 , C 2 ) -∆2 (s, C 1 , C 2 ) 2 ≤ K t∨s |t -s| 2 , E ∆1 (t, C 1 ) -∆1 (s, C 1 ) 2 ≤ K t∨s |t -s| 2 . Therefore, by the Kolmogorov continuity theorem, there exist continuous modifications of the (time-indexed) processes ∆1 , ∆2 and ∆3 . We thus replace them with their continuous modifications, denoted by the same symbols. Given these continuous modifications, we consider the following reduced dynamics: ∂ ∂t w3 (t, c 2 ) = -ξ 3 (t) ∆3 (t, c 2 ) , ∂ ∂t w2 (t, c 1 , c 2 ) = -ξ 2 (t) ∆2 (t, c 1 , c 2 ) , ∂ ∂t w1 (t, c 1 ) = -ξ 1 (t) ∆1 (t, c 1 ) , in which: • w1 : R ≥0 × Ω 1 → R d , w2 : R ≥0 × Ω 1 × Ω 2 → R, w3 : R ≥0 × Ω 2 → R. • W (t) = { w1 (t, •) , w2 (t, •, •) , w3 (t, •)} is the collection of reduced parameters at time t, • the initialization is w1 (0, •) = w 0 1 (•), w2 (0, •, •) = w 0 2 (•, •) and w3 (0, •) = w 0 3 (•), i.e. W (0) = W (0). Step 2: Measurability of the reduced dynamics. It is easy to see that w3 (t, C 2 ) is S 3 -measurable by its construction and the fact that w3 (0, C 2 ) = w 0 3 (C 2 ) is S 3 -measurable. Similarly, w2 (t, C 1 , C 2 ) is S 2 -measurable and w1 (t, C 1 ) is S 1 -measurable. Notice that there exist Borel functions w * 1 , w * 2 and w * 3 for which P -almost surely, w1 (t, C 1 ) = w * 1 t, w 0 1 (C 1 ) , w2 (t, C 1 , C 2 ) = w * 2 t, w 0 1 (C 1 ) , w 0 2 (C 1 , C 2 ) , w 0 3 (C 2 ) , w3 (t, C 2 ) = w * 3 t, w 0 3 (C 2 ) .
Indeed, since w2 (t, C 1 , C 2 ) is S 2 -measurable, there exists a function w * 2 (t, •) for each rational t such that the desired identity holds for P -almost every (C 1 , C 2 ) and for all rational t ≥ 0. Since w2 is continuous in time, there is a unique continuous (in time) function w * 2 (t, •) such that the identity holds for all t ≥ 0 and for P -almost every (C 1 , C 2 ). The same argument yields the construction of w * 1 and w * 3 . Step 3: Measurability of constituent quantities. We show that H 2 X, C 2 ; W (t) is S Z 3 -measurable. Recall that H 2 X, C 2 ; W (t) = E C1 [ w2 (t, C 1 , C 2 ) ϕ 1 ( w1 (t, C 1 ) , X )] . By the existence of w * 1 and w * 2 , for each t ≥ 0, there exists a Borel function f t such that almost surely w2 (t, C 1 , C 2 ) ϕ 1 ( w1 (t, C 1 ) , X ) = f t X, w 0 1 (C 1 ) , w 0 2 (C 1 , C 2 ) , w 0 3 (C 2 ) . We recall that ρ 1 and ρ 2 are the laws of w 0 1 (C 1 ) and w 0 2 (C 1 , C 2 ). We analyze the following: E H 2 X, C 2 ; W (t) -f t X, u 1 , u 2 , w 0 3 (C 2 ) ρ 1 (du 1 ) ρ 2 (du 2 ) 2 = E H 2 X, C 2 ; W (t) 2 + E f t X, u 1 , u 2 , w 0 3 (C 2 ) ρ 1 (du 1 ) ρ 2 (du 2 ) 2 -2E H 2 X, C 2 ; W (t) f t X, u 1 , u 2 , w 0 3 (C 2 ) ρ 1 (du 1 ) ρ 2 (du 2 ) .
Let us evaluate the first term: E H 2 X, C 2 ; W (t) 2 = E E C1 f t X, w 0 1 (C 1 ) , w 0 2 (C 1 , C 2 ) , w 0 3 (C 2 ) 2 (a) = E f t X, w 0 1 (C 1 ) , w 0 2 (C 1 , C 2 ) , w 0 3 (C 2 ) f t X, w 0 1 (C 1 ) , w 0 2 (C 1 , C 2 ) , w 0 3 (C 2 ) (b) = E f t (X, p 1 (θ 1 ) (λ 1 ) , p 2 (θ 1 , θ 2 ) (λ 2 ) , p 3 (θ 2 ) (λ 2 )) × f t (X, p 1 (θ 1 ) (λ 1 ) , p 2 (θ 1 , θ 2 ) (λ 2 ) , p 3 (θ 2 ) (λ 2 )) (c) = E f t (X, p 1 (θ 1 ) (λ 1 ) , p 2 (θ 1 , θ 2 ) (λ 2 ) , p 3 (θ 2 ) (λ 2 )) × f t (X, p 1 (θ 1 ) (λ 1 ) , p 2 (θ 1 , θ 2 ) (λ 2 ) , p 3 (θ 2 ) (λ 2 )) I (θ 1 = θ 1 ) (d) = E Z f t (X, u 1 , u 2 , u 3 ) f t (X, u 1 , u 2 , u 3 ) ρ 1 (du 1 ) ρ 1 (du 1 ) ρ 2 (du 2 ) ρ 2 (du 2 ) ρ 3 (du 3 ) , where in step (a), we define C 1 to be an independent copy of C 1 ; in step (b), we recall C 1 = (λ 1 , θ 1 ); in step (c), we recall θ 1 , θ 1 ∼ Unif ([0, 1]) and since C 1 is independent of C 1 , we have θ 1 = θ 1 almost surely; step (d) is owing to the independence property of the construction of the functions p 1 , p 2 and p 3 . We calculate the second term: E f t X, u 1 , u 2 , w 0 3 (C 2 ) ρ 1 (du 1 ) ρ 2 (du 2 ) 2 = E f t X, u 1 , u 2 , w 0 3 (C 2 ) f t X, u 1 , u 2 , w 0 3 (C 2 ) ρ 1 (du 1 ) ρ 2 (du 2 ) ρ 1 (du 1 ) ρ 2 (du 2 ) = E Z [f t (X, u 1 , u 2 , u 3 ) f t (X, u 1 , u 2 , u 3 )] ρ 1 (du 1 ) ρ 2 (du 2 ) ρ 1 (du 1 ) ρ 2 (du 2 ) ρ 3 (du 3 ) , as well as the last term: E H 2 X, C 2 ; W (t) f t X, u 1 , u 2 , w 0 3 (C 2 ) ρ 1 (du 1 ) ρ 2 (du 2 ) = E f t X, w 0 1 (C 1 ) , w 0 2 (C 1 , C 2 ) , w 0 3 (C 2 ) f t X, u 1 , u 2 , w 0 3 (C 2 ) ρ 1 (du 1 ) ρ 2 (du 2 ) = E Z [f t (X, u 1 , u 2 , u 3 ) f t (X, u 1 , u 2 , u 3 )] ρ 1 (du 1 ) ρ 2 (du 2 ) ρ 1 (du 1 ) ρ 2 (du 2 ) ρ 3 (du 3 ) . It is then easy to see that E H 2 X, C 2 ; W (t) -f t X, u 1 , u 2 , w 0 3 (C 2 ) ρ 1 (du 1 ) ρ 2 (du 2 ) 2 = 0. That is, we have almost surely H 2 X, C 2 ; W (t) = f t X, u 1 , u 2 , w 0 3 (C 2 ) ρ 1 (du 1 ) ρ 2 (du 2 ) . 
Note that the right-hand side is S Z 3 -measurable, and hence so is H 2 X, C 2 ; W (t) . Similarly, there exists a Borel function g t such that almost surely ∆ H 2 Z, C 2 ; W (t) w2 (t, C 1 , C 2 ) = g t Z, w 0 1 (C 1 ) , w 0 2 (C 1 , C 2 ) , w 0 3 (C 2 ) . Then with the same argument as the treatment of H 2 X, C 2 ; W (t) , one can show that E C2 ∆ H 2 Z, C 2 ; W (t) w2 (t, C 1 , C 2 ) = g t Z, w 0 1 (C 1 ) , u 2 , u 3 ρ 2 (du 2 ) ρ 3 (du 3 ) , which is S Z 1 -measurable. Using these facts together with the existence of w * 1 , w * 2 and w * 3 , we have that ∆ 3 C 2 ; W (t) is S 3 -measurable, ∆ 2 C 1 , C 2 ; W (t) is S 13 -measurable and ∆ 1 C 1 ; W (t) is S 1 -measurable. Step 4: Closeness between the MF dynamics and the reduced dynamics. We shall use W -W t with the same meaning as the distance between two sets of MF parameters. Recall by Lemma 11 that W T ≤ K T since W 0 ≤ K. By the same argument, W T ≤ K T . Then by Lemma 13, we have for any t ≤ T , ess-sup sup s≤t ∆ 3 (C 2 ; W (s)) -∆ 3 C 2 ; W (s) ≤ K T W -W t , ess-sup sup s≤t ∆ 2 (C 1 , C 2 ; W (s)) -∆ 2 C 1 , C 2 ; W (s) ≤ K T W -W t , ess-sup sup s≤t ∆ 1 (C 1 ; W (s)) -∆ 1 C 1 ; W (s) ≤ K T W -W t . We have: ess-sup ∆2 (t, C 1 , C 2 ) -∆ 2 C 1 , C 2 ; W (t) = ess-sup E [∆ 2 (C 1 , C 2 ; W (t))|S 2 ] -∆ 2 C 1 , C 2 ; W (t) ≤ ess-sup E ∆ 2 C 1 , C 2 ; W (t) S 2 -∆ 2 C 1 , C 2 ; W (t) + ess-sup E ∆ 2 C 1 , C 2 ; W (t) -∆ 2 (C 1 , C 2 ; W (t)) S 2 (a) = ess-sup E ∆ 2 C 1 , C 2 ; W (t) -∆ 2 (C 1 , C 2 ; W (t)) S 2 ≤ ess-sup ∆ 2 C 1 , C 2 ; W (t) -∆ 2 (C 1 , C 2 ; W (t)) , where step (a) is because ∆ 2 C 1 , C 2 ; W (t) is S 13 -measurable from Step 3 and S 13 ⊆ S 2 . As such, ess-sup ∆2 (t, C 1 , C 2 ) -∆ 2 (C 1 , C 2 ; W (t)) ≤ 2K T W -W t almost surely for all rational t ≤ T . By continuity in t of both sides, the same holds for all t ≤ T .
Hence by Assumption 1, ∂ ∂t w2 (t, C 1 , C 2 ) - ∂ ∂t w 2 (t, C 1 , C 2 ) ≤ K ∆2 (t, C 1 , C 2 ) -∆ 2 (C 1 , C 2 ; W (t)) ≤ 2K T W -W t , for all t ≤ T almost surely, which leads to | w2 (t, C 1 , C 2 ) -w 2 (t, C 1 , C 2 )| ≤ 2K T t 0 W -W s ds almost surely. One can obtain similar results for w1 versus w 1 and w3 versus w 3 . Therefore, W -W t ≤ K T t 0 W -W s ds. Since W (0) = W (0), by Gronwall's inequality, W -W t = 0 for all t ≤ T . In other words, since T is arbitrary, w1 (t, C 1 ) = w 1 (t, C 1 ) , w2 (t, C 1 , C 2 ) = w 2 (t, C 1 , C 2 ) , w3 (t, C 2 ) = w 3 (t, C 2 ) , for all t ≥ 0 almost surely. Step 5: Concluding. The first claim of the lemma is proven by the conclusion of Step 4 and by choosing w * 1 = w * 1 , w * 2 = w * 2 and w * 3 = w * 3 , as well as the measurability facts from Step 3. To prove the second claim, since ∆ H 2 Z, C 2 ; W (t) is S Z 3 -measurable and W -W t = 0 for all t ≥ 0, there exists a Borel function ∆ H * 2 such that ∆ H 2 Z, C 2 ; W (t) = ∆ H 2 (Z, C 2 ; W (t)) = ∆ H * 2 t, Z, w 0 3 (C 2 ) for all t ≥ 0 almost surely, by the same argument in Step 2. These facts, together with the dynamics of w 1 and w 2 , imply that almost surely, for all t ≥ 0, ∂ ∂t w 2 (t, C 1 , C 2 ) = -ξ 2 (t) E Z ∆ H * 2 t, Z, w 0 3 (C 2 ) ϕ 1 w * 1 t, w 0 1 (C 1 ) , X , ∂ ∂t w * 1 t, w 0 1 (C 1 ) = -ξ 1 (t) E Z E C2 ∆ H * 2 t, Z, w 0 3 (C 2 ) w 2 (t, C 1 , C 2 ) ϕ 1 w * 1 t, w 0 1 (C 1 ) , X X , with initialization w * 1 0, w 0 1 (C 1 ) = w 0 1 (C 1 ). Substituting the first equation into the second one, we get: ∂ ∂t w * 1 t, w 0 1 (C 1 ) = -ξ 1 (t) E Z E C2 ∆ H * 2 t, Z, w 0 3 (C 2 ) w 0 2 (C 1 , C 2 ) ϕ 1 w * 1 t, w 0 1 (C 1 ) , X X + ξ 1 (t) t 0 ξ 2 (s) E Z,Z E C2 ∆ H * 2 t, Z, w 0 3 (C 2 ) ∆ H * 2 s, Z , w 0 3 (C 2 ) × ϕ 1 w * 1 s, w 0 1 (C 1 ) , X ϕ 1 w * 1 t, w 0 1 (C 1 ) , X X ds. 
Note that, by an argument similar to Step 3,
$$\mathbb{E}_{C_2}\big[\Delta^{H_2,*}\big(t,Z,w_3^0(C_2)\big)\,w_2^0(C_1,C_2)\big] = \int \Delta^{H_2,*}(t,Z,u_3)\,u_2\,\rho_2(du_2)\,\rho_3(du_3),$$
which holds for all $t \ge 0$ almost surely by the same argument as in Step 2. We thus obtain:
$$\frac{\partial}{\partial t}w_1^*(t,u_1) = -\xi_1(t)\,\mathbb{E}_Z\Big[\int \Delta^{H_2,*}(t,Z,u_3)\,u_2\,\phi_1'\big(\langle w_1^*(t,u_1),X\rangle\big)X\,\rho_2(du_2)\,\rho_3(du_3)\Big]$$
$$\qquad + \xi_1(t)\int_0^t \xi_2(s)\,\mathbb{E}_{Z,Z'}\Big[\int \Delta^{H_2,*}(t,Z,u_3)\,\Delta^{H_2,*}(s,Z',u_3)\,\rho_3(du_3)\;\phi_1\big(\langle w_1^*(s,u_1),X'\rangle\big)\,\phi_1'\big(\langle w_1^*(t,u_1),X\rangle\big)X\Big]\,ds,$$
with
$$w_1(t,C_1) = w_1^*\big(t,w_1^0(C_1)\big), \qquad \Delta^{H_2}\big(z,C_2;W(t)\big) = \Delta^{H_2,*}\big(t,z,w_3^0(C_2)\big),$$
where $W(t)$ is the MF dynamics formed under the coupling procedure with this neuronal embedding, as described in Section 3.1. Furthermore,
$$\frac{\partial}{\partial t}w_1^*(t,u_1) = -\mathbb{E}_Z\Big[\int \Delta^{H_2,*}(t,Z,u_3)\,u_2\,\phi_1'\big(\langle w_1^*(t,u_1),X\rangle\big)X\,\rho_2(du_2)\,\rho_3(du_3)\Big]$$
$$\qquad + \int_0^t \mathbb{E}_{Z,Z'}\Big[\int \Delta^{H_2,*}(t,Z,u_3)\,\Delta^{H_2,*}(s,Z',u_3)\,\rho_3(du_3)\;\phi_1\big(\langle w_1^*(s,u_1),X'\rangle\big)\,\phi_1'\big(\langle w_1^*(t,u_1),X\rangle\big)X\Big]\,ds,$$
with initialization $w_1^*(0,u_1) = u_1$, for all $u_1\in\operatorname{supp}\rho_1$ and $t \ge 0$, where $Z'$ is an independent copy of $Z$. We recall from Lemma 12 that
$$\operatorname*{ess\,sup}\big|\Delta^{H_2,*}\big(t,Z,w_3^0(C_2)\big)\big| = \operatorname*{ess\,sup}\big|\Delta^{H_2}\big(Z,C_2;W(t)\big)\big| \le K_t,$$
where $K_t$ denotes a generic constant that depends on $t$ and is finite for finite $t$. Therefore, by Assumption 1, for $t \le T$ and $u_1, u_1' \in \operatorname{supp}\rho_1$,
$$\Big|\frac{\partial}{\partial t}w_1^*(t,u_1) - \frac{\partial}{\partial t}w_1^*(t,u_1')\Big| \le K_t\big|w_1^*(t,u_1) - w_1^*(t,u_1')\big| + K_t\int_0^t \big|w_1^*(s,u_1) - w_1^*(s,u_1')\big|\,ds \le K_T\sup_{s\le t}\big|w_1^*(s,u_1) - w_1^*(s,u_1')\big|,$$
$$\Big|\frac{\partial}{\partial t}w_1^*(t,u_1)\Big| \le K_T.$$
Applying Gronwall's lemma to the first bound:
$$\sup_{t\le T}\big|w_1^*(t,u_1) - w_1^*(t,u_1')\big| \le e^{K_T}\big|w_1^*(0,u_1) - w_1^*(0,u_1')\big| = e^{K_T}|u_1 - u_1'|.$$
Furthermore, the second bound implies $\sup_{t,t'\le T}|w_1^*(t,u_1) - w_1^*(t',u_1)| \le K_T|t-t'|$. Therefore $(t,u_1)\mapsto w_1^*(t,u_1)$ is a continuous mapping on $[0,T]\times\mathbb{R}^d$ for arbitrary $T \ge 0$. Given this continuity, we prove the thesis by a topology argument.
Consider the sphere $S^d$, the one-point compactification of $\mathbb{R}^d$. We can extend $w_1^*$ to a function $M: [0,T]\times S^d \to S^d$ fixing the point at infinity, which remains a continuous map since $|M(t,u_1) - u_1| = |M(t,u_1) - M(0,u_1)| \le K_T\,t$. Let $M_t: S^d \to S^d$ be defined by $M_t(u_1) = M(t,u_1)$. We claim that $M_t$ is surjective for all finite $t$. Indeed, if $M_t$ fails to be surjective for some $t$, then it misses some point $p \in S^d$, so $M_t: S^d \to S^d\setminus\{p\} \hookrightarrow S^d$ is homotopic to a constant map, since $S^d\setminus\{p\}$ is contractible. But then $M$ gives a homotopy from the identity map $M_0$ on the sphere to a constant map, a contradiction, as the sphere $S^d$ is not contractible. In particular, since $M_t$ fixes the point at infinity, $w_1^*(t,\cdot): \mathbb{R}^d \to \mathbb{R}^d$ is surjective. With this, we are ready to prove Theorem 8.

Proof of Theorem 8. Recall that by Theorem 1 the solution to the MF ODEs exists and is unique, and by Lemma 17 the support of $\mathrm{Law}(w_1(t,C_1))$ is $\mathbb{R}^d$ at all $t$. By the convergence assumption, for any $\epsilon > 0$ there exists $T(\epsilon)$ such that for all $t \ge T(\epsilon)$ and $\mathbb{P}$-almost every $c_1$:
$$\Big|\mathbb{E}_{C_2}\mathbb{E}_Z\big[\Delta^{H_2}\big(Z,C_2;W(t)\big)\,\phi_1\big(\langle w_1(t,c_1),X\rangle\big)\big]\Big| \le \epsilon.$$
Since $\mathrm{Law}(w_1(t,C_1))$ has full support, we obtain that for $u$ in a dense subset of $\mathbb{R}^d$,
$$\Big|\mathbb{E}_{C_2}\mathbb{E}_Z\big[\Delta^{H_2}\big(Z,C_2;W(t)\big)\,\phi_1\big(\langle u,X\rangle\big)\big]\Big| \le \epsilon.$$
By continuity of $u\mapsto\phi_1(\langle u,x\rangle)$, we extend the above to all $u\in\mathbb{R}^d$. Since $\phi_1$ is bounded,
$$\Big|\mathbb{E}_{C_2}\mathbb{E}_Z\Big[\big(\Delta^{H_2}\big(Z,C_2;W(t)\big) - \Delta^{H_2}\big(Z,C_2;\bar{w}_1,\bar{w}_2,\bar{w}_3\big)\big)\,\phi_1\big(\langle u,X\rangle\big)\Big]\Big| \le K\,\mathbb{E}\big[\big|\Delta^{H_2}\big(Z,C_2;W(t)\big) - \Delta^{H_2}\big(Z,C_2;\bar{w}_1,\bar{w}_2,\bar{w}_3\big)\big|\big]$$
$$\le K\,\mathbb{E}\Big[\big(1+|\bar{w}_3(C_2)|\big)\,\big|w_3(t,C_2)-\bar{w}_3(C_2)\big| + |\bar{w}_3(C_2)|\,\big|w_2(t,C_1,C_2)-\bar{w}_2(C_1,C_2)\big| + |\bar{w}_3(C_2)|\,|\bar{w}_2(C_1,C_2)|\,\big|w_1(t,C_1)-\bar{w}_1(C_1)\big|\Big],$$
where the last step is by Assumption 1. Recall that the right-hand side converges to $0$ as $t\to\infty$. We thus obtain, for all $u\in\mathbb{R}^d$,
$$\Big\langle \mathbb{E}_{C_2}\mathbb{E}_Z\big[\Delta^{H_2}\big(Z,C_2;\bar{w}_1,\bar{w}_2,\bar{w}_3\big)\,\big|\,X=x\big],\ \phi_1(\langle u,x\rangle)\Big\rangle_{L^2(\mathcal{P}_X)} = \mathbb{E}_{C_2}\mathbb{E}_Z\big[\Delta^{H_2}\big(Z,C_2;\bar{w}_1,\bar{w}_2,\bar{w}_3\big)\,\phi_1\big(\langle u,X\rangle\big)\big] = 0,$$
which yields that for all $u\in\mathbb{R}^d$ and $\mathbb{P}$-almost every $c_2$,
$$\Big\langle \mathbb{E}_Z\big[\Delta^{H_2}\big(Z,c_2;\bar{w}_1,\bar{w}_2,\bar{w}_3\big)\,\big|\,X=x\big],\ \phi_1(\langle u,x\rangle)\Big\rangle_{L^2(\mathcal{P}_X)} = 0.$$
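The degree-theoretic surjectivity argument above can be illustrated numerically in one dimension: a continuous circle map homotopic to the identity has degree (winding number) $1$, and a map of nonzero degree is surjective. The sketch below computes the degree of a hypothetical perturbed-identity map (the map itself is illustrative, not from the paper), by summing angle increments wrapped to $(-\pi,\pi]$:

```python
import math

# One-dimensional analogue of the topology argument: a continuous circle map
# homotopic to the identity has winding number (degree) 1, hence is surjective.
# M below is a hypothetical perturbation of the identity; the perturbation is
# small enough that M stays homotopic to the identity.
def M(theta):
    return theta + 0.8 * math.sin(3 * theta)

n = 10000
total = 0.0
for i in range(n):
    a = M(2 * math.pi * i / n)
    b = M(2 * math.pi * (i + 1) / n)
    d = (b - a) % (2 * math.pi)
    if d > math.pi:          # wrap each increment to (-pi, pi]
        d -= 2 * math.pi
    total += d
degree = round(total / (2 * math.pi))
print(degree)  # -> 1: nonzero degree forces the map to hit every point
```

If the map missed a point of the circle, its degree would be $0$ (it would factor through a contractible set), mirroring the contradiction used for $M_t$ on $S^d$.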
Here we note that by Assumption 1, $\big|\mathbb{E}_Z\big[\Delta^{H_2}\big(Z,c_2;\bar{w}_1,\bar{w}_2,\bar{w}_3\big)\,\big|\,X=x\big]\big| \le K|\bar{w}_3(c_2)|$, and so $\mathbb{E}_Z\big[\Delta^{H_2}\big(Z,c_2;\bar{w}_1,\bar{w}_2,\bar{w}_3\big)\,\big|\,X=x\big]$ is in $L^2(\mathcal{P}_X)$ for $\mathbb{P}$-almost every $c_2$. Since $\{\phi_1(\langle u,\cdot\rangle): u\in\mathbb{R}^d\}$ has dense span in $L^2(\mathcal{P}_X)$, we have $\mathbb{E}_Z\big[\Delta^{H_2}\big(Z,c_2;\bar{w}_1,\bar{w}_2,\bar{w}_3\big)\,\big|\,X=x\big] = 0$ for $\mathcal{P}_X$-almost every $x$ and $\mathbb{P}$-almost every $c_2$, and hence
$$\mathbb{E}_Z\big[\partial_2\mathcal{L}\big(Y,\hat{y}(X;\bar{w}_1,\bar{w}_2,\bar{w}_3)\big)\,\big|\,X=x\big]\,\phi_3'\big(H_3(x;\bar{w}_1,\bar{w}_2,\bar{w}_3)\big)\,\bar{w}_3(c_2)\,\phi_2'\big(H_2(x,c_2;\bar{w}_1,\bar{w}_2)\big) = 0.$$
We note that our assumptions guarantee that $\mathbb{P}(\bar{w}_3(C_2)\ne 0)$ is positive. Indeed:
• In the case where $w_3^0(C_2)\ne 0$ with positive probability and $\xi_3(\cdot)=0$, the conclusion is obvious.
• In the case $\mathcal{L}(w_1^0,w_2^0,w_3^0) < \mathbb{E}_Z[\mathcal{L}(Y,\phi_3(0))]$, we recall the following standard property of gradient flows:
$$\mathcal{L}\big(w_1(t,\cdot),\,w_2(t,\cdot,\cdot),\,w_3(t,\cdot)\big) \le \mathcal{L}\big(w_1(t',\cdot),\,w_2(t',\cdot,\cdot),\,w_3(t',\cdot)\big), \qquad \text{for } t \ge t'.$$
In particular, setting $t'=0$ and taking $t\to\infty$, it is easy to see that
$$\mathcal{L}(\bar{w}_1,\bar{w}_2,\bar{w}_3) \le \mathcal{L}(w_1^0,w_2^0,w_3^0) < \mathbb{E}_Z[\mathcal{L}(Y,\phi_3(0))].$$
If $\mathbb{P}(\bar{w}_3(C_2)=0)=1$, then $\mathcal{L}(\bar{w}_1,\bar{w}_2,\bar{w}_3) = \mathbb{E}_Z[\mathcal{L}(Y,\phi_3(0))]$, a contradiction.
Then, since $\phi_2'$ and $\phi_3'$ are strictly non-zero, we have $\mathbb{E}_Z\big[\partial_2\mathcal{L}\big(Y,\hat{y}(X;\bar{w}_1,\bar{w}_2,\bar{w}_3)\big)\,\big|\,X=x\big] = 0$ for $\mathcal{P}_X$-almost every $x$. In Case 1, since $\mathcal{L}$ is convex in the second variable, for any measurable function $\tilde{y}(x)$,
$$\mathcal{L}\big(y,\tilde{y}(x)\big) - \mathcal{L}\big(y,\hat{y}(x;\bar{w}_1,\bar{w}_2,\bar{w}_3)\big) \ge \partial_2\mathcal{L}\big(y,\hat{y}(x;\bar{w}_1,\bar{w}_2,\bar{w}_3)\big)\,\big(\tilde{y}(x) - \hat{y}(x;\bar{w}_1,\bar{w}_2,\bar{w}_3)\big).$$
Taking expectations, we get $\mathbb{E}_Z[\mathcal{L}(Y,\tilde{y}(X))] \ge \mathcal{L}(\bar{w}_1,\bar{w}_2,\bar{w}_3)$, i.e. $(\bar{w}_1,\bar{w}_2,\bar{w}_3)$ is a global minimizer of $\mathcal{L}$. In Case 2, since $y$ is a function of $x$, we obtain $\partial_2\mathcal{L}\big(y,\hat{y}(x;\bar{w}_1,\bar{w}_2,\bar{w}_3)\big) = 0$ and hence $\mathcal{L}\big(y,\hat{y}(x;\bar{w}_1,\bar{w}_2,\bar{w}_3)\big) = 0$ for $\mathcal{P}_X$-almost every $x$. Finally, $|\mathcal{L}(W(t)) - \mathcal{L}(\bar{w}_1,\bar{w}_2,\bar{w}_3)|$ tends to $0$ as $t\to\infty$. This completes the proof.
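The first-order convexity inequality used in Case 1 can be checked concretely for the squared loss (the paper's running example of a convex loss). With $\mathcal{L}(y,\hat{y}) = \tfrac{1}{2}(y-\hat{y})^2$ and $\partial_2\mathcal{L}(y,\hat{y}) = \hat{y}-y$, the gap between the two sides equals $\tfrac{1}{2}(\tilde{y}-\hat{y})^2 \ge 0$. A quick random check (the sampled points are arbitrary):

```python
import random

# Check of the first-order convexity inequality from Case 1 for the squared
# loss L(y, yhat) = (y - yhat)^2 / 2, whose derivative in the second argument
# is d2L(y, yhat) = yhat - y. The gap is exactly (ytilde - yhat)^2 / 2 >= 0.
def L(y, yhat):
    return 0.5 * (y - yhat) ** 2

def d2L(y, yhat):
    return yhat - y

random.seed(0)
ok = all(
    L(y, yt) - L(y, yh) >= d2L(y, yh) * (yt - yh) - 1e-9
    for y, yh, yt in ((random.uniform(-5, 5),
                       random.uniform(-5, 5),
                       random.uniform(-5, 5)) for _ in range(1000))
)
print(ok)  # -> True: L(y, ytilde) >= L(y, yhat) + d2L(y, yhat) * (ytilde - yhat)
```

This is why $\partial_2\mathcal{L} = 0$ at the limit point immediately upgrades to global optimality: the linear lower bound becomes flat.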

E CONVERSE FOR GLOBAL CONVERGENCE: REMARK 9

We prove a converse statement for global convergence in relation to the essential supremum condition (6).

Proposition 18. Consider a neuronal embedding $(\Omega,\mathcal{F},\mathbb{P},\{w_i^0\}_{i=1,2,3})$ with $(\rho_1,\rho_2,\rho_3)$-i.i.d. initialization. Consider the MF limit corresponding to the network (1), coupled by the coupling procedure in Section 3.1, under Assumptions 1 and 2 and $\xi_1(\cdot)=\xi_2(\cdot)=1$. Assume that $\mathcal{L}(y,\hat{y})\to\infty$ as $|\hat{y}|\to\infty$ for each $y$. Further assume that there exists $\bar{w}_3$ such that, as $t\to\infty$, $\mathbb{E}_{C_2}[|w_3(t,C_2)-\bar{w}_3(C_2)|]\to 0$.
• Case 2 (generic non-negative loss): Suppose that $\partial_2\mathcal{L}(y,\hat{y})=0$ implies $\mathcal{L}(y,\hat{y})=0$, and $y=y(x)$ is a function of $x$. If $\mathcal{L}(W(t))\to 0$ as $t\to\infty$, then the same conclusion also holds.

Proof. We recall that
$$\frac{\partial}{\partial t}w_2(t,c_1,c_2) = -\mathbb{E}_Z\Big[\partial_2\mathcal{L}\big(Y,\hat{y}(X;W(t))\big)\,\phi_3'\big(H_3(X;W(t))\big)\,w_3(t,c_2)\,\phi_2'\big(H_2(X,c_2;W(t))\big)\,\phi_1\big(\langle w_1(t,c_1),X\rangle\big)\Big],$$
for $c_1\in\Omega_1$, $c_2\in\Omega_2$.

F USEFUL TOOLS

We first present a useful concentration result. The tail bound can in fact be improved using the argument in Feldman & Vondrak (2018), but the following simpler version suffices for our purposes.

Lemma 19. Consider an integer $n\ge 1$ and let $x, c_1, \dots, c_n$ be mutually independent random variables. Let $\mathbb{E}_x$ and $\mathbb{E}_c$ denote the expectations w.r.t. $x$ only and $\{c_i\}_{i\in[n]}$ only, respectively. Consider a collection of mappings $\{f_i\}_{i\in[n]}$ that map to a separable Hilbert space $\mathcal{F}$, and let $\bar{f}_i(x) = \mathbb{E}_c[f_i(c_i,x)]$. Assume that for some $R>0$, $|f_i(c_i,x) - \bar{f}_i(x)| \le R$ almost surely. Then for any $\delta>0$,
$$\mathbb{P}\Big(\mathbb{E}_x\Big[\Big|\frac{1}{n}\sum_{i=1}^n \big(f_i(c_i,x) - \bar{f}_i(x)\big)\Big|\Big] \ge \delta\Big) \le \frac{8R}{\sqrt{n}\,\delta}\exp\Big(-\frac{n\delta^2}{8R^2}\Big) \le \frac{8R}{\delta}\exp\Big(-\frac{n\delta^2}{8R^2}\Big).$$

Proof. For brevity, define $Z_n(x) = \sum_{i=1}^n \big(f_i(c_i,x) - \bar{f}_i(x)\big)$. Notice that since $c_1,\dots,c_n$ are independent and $\bar{f}_i(x) = \mathbb{E}_c[f_i(c_i,x)]$,
$$\mathbb{E}\big[|Z_n(x)|^2\big] = \sum_{i=1}^n \mathbb{E}\big[|f_i(c_i,x) - \bar{f}_i(x)|^2\big] \le 2nR^2.$$
We thus get:
$$\mathbb{P}\big(\mathbb{E}_x[|Z_n(x)|] \ge n\delta\big) \le \frac{8R}{\sqrt{n}\,\delta}\exp\Big(-\frac{n\delta^2}{8R^2}\Big).$$
This proves the claim.

We next state a martingale concentration result, which is a special case of (Pinelis, 1994, Theorem 3.5), a result that applies to more general Banach spaces.

Theorem 20 (Concentration of martingales in Hilbert spaces). Consider a martingale $(Z_n)_{n\ge 0}$ taking values in a separable Hilbert space, with $|Z_n - Z_{n-1}| \le R$ and $Z_0 = 0$. Then for any $t>0$,
$$\mathbb{P}\Big(\max_{k\le n}|Z_k| \ge t\Big) \le 2\inf_{\lambda>0}\exp\Big(-\lambda t + \operatorname*{ess\,sup}\sum_{k=1}^n \mathbb{E}\big[e^{\lambda|Z_k-Z_{k-1}|} - 1 - \lambda|Z_k - Z_{k-1}|\ \big|\ \mathcal{F}_{k-1}\big]\Big).$$
In particular, for any $\delta>0$,
$$\mathbb{P}\Big(\max_{k\le n}|Z_k| \ge n\delta\Big) \le 2\exp\Big(-\frac{n\delta^2}{2R^2}\Big).$$

The following concentration result for i.i.d. random variables in Hilbert spaces is a corollary.

Theorem 21 (Concentration of i.i.d. sums in Hilbert spaces). Consider $n$ i.i.d. random variables $X_1,\dots,X_n$ in a separable Hilbert space. Suppose that there exists a constant $R>0$ such that $|X_i - \mathbb{E}[X_i]| \le R$ almost surely. Then for any $\delta>0$,
$$\mathbb{P}\Big(\Big|\frac{1}{n}\sum_{i=1}^n X_i - \mathbb{E}[X_i]\Big| \ge \delta\Big) \le 2\exp\Big(-\frac{n\delta^2}{2R^2}\Big).$$
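The scalar case of Theorem 21 is easy to probe by Monte Carlo: for bounded i.i.d. variables, the empirical frequency of large deviations of the sample mean should sit below the theoretical tail bound. A minimal sketch using Rademacher variables (all parameters here are illustrative choices, not from the paper):

```python
import math
import random

# Monte Carlo check of the scalar case of Theorem 21: for i.i.d. X_i with
# |X_i - E[X_i]| <= R, P(|mean - E[X]| >= delta) <= 2 exp(-n delta^2 / (2 R^2)).
# Here X_i is uniform on {-1, +1}, so E[X_i] = 0 and R = 1.
random.seed(1)
n, delta, R, trials = 50, 0.4, 1.0, 20000
hits = 0
for _ in range(trials):
    mean = sum(random.choice((-1.0, 1.0)) for _ in range(n)) / n
    if abs(mean) >= delta:
        hits += 1
empirical = hits / trials
bound = 2 * math.exp(-n * delta**2 / (2 * R**2))
print(empirical <= bound)  # the theoretical tail bound dominates the frequency
```

With these parameters the bound evaluates to about $0.037$ while the true deviation probability is under $0.01$, so the inequality holds with room to spare.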



This is to be contrasted with another major operating regime (the NTK regime), in which parameters essentially do not evolve and the model behaves like a kernel method (Jacot et al. (2018); Chizat et al. (2019); Du et al. (2019); Allen-Zhu et al. (2019); Zou et al. (2018); Lee et al. (2019)). To absorb the first layer's bias term into $w_1$, we assume the input $x$ has a $1$ appended as its last entry. We recall the definition of ess-sup in Appendix A.



al. (2018); Chizat & Bach (2018); Rotskoff & Vanden-Eijnden (2018); Sirignano & Spiliopoulos (2018); Nguyen (2019); Araújo et al. (2019); Sirignano & Spiliopoulos (2019)). The four works Mei et al. (2018); Chizat & Bach (2018); Rotskoff & Vanden-Eijnden (2018); Sirignano & Spiliopoulos (

Consider a family Init of initialization laws, indexed by a set of tuples {m 1 , m 2 } that contains a sequence of indices {m 1

; Mei et al. (2018); Chizat & Bach (2018); Rotskoff & Vanden-Eijnden (2018); Sirignano & Spiliopoulos (2018); Wei et al. (2019); Javanmard et al. (2019); Mei et al. (2019); Shevchenko & Mondelli (2019); Wojtowytsch (2020)) have focused mainly on two-layer neural networks, taking an interacting particle system approach to describe the MF limiting dynamics as Wasserstein gradient flows. The three works Nguyen (2019); Araújo et al. (2019); Sirignano & Spiliopoulos (2019) independently develop different formulations for the MF limit in multilayer neural networks, under different assumptions. These works take perspectives that are different from ours. In particular, while the central object in Nguyen (2019) is a new abstract representation of each individual neuron, our neuronal embedding idea instead takes a keen view on a whole ensemble of neurons. Likewise our idea is also distant from Araújo et al. (2019); Sirignano & Spiliopoulos (2019): the central objects in Araújo et al. (2019) are paths over the weights across layers; those in Sirignano & Spiliopoulos (2019) are time-dependent functions of the initialization, which are simplified upon i.i.d. initializations. The result of our perspective is a neuronal embedding framework that allows one to describe the MF limit in a clean and rigorous manner. In particular, it avoids extra assumptions made in Araújo et al. (2019); Sirignano & Spiliopoulos (2019): unlike our work, Araújo et al. (2019) assumes untrained first and last layers and requires non-trivial technical tools; Sirignano & Spiliopoulos (2019) takes an unnatural sequential limit n 1 → ∞ before n 2 → ∞ and proves a non-quantitative result, unlike Theorem 3 which only requires sufficiently large min {n 1 , n 2 }. We note that Theorem 3 can be extended to general multilayer networks using the neuronal embedding idea. The advantages of our framework come from the fact that while MF formulations in Araújo et al. 
(2019); Sirignano & Spiliopoulos (2019) are specific to and exploit i.i.d. initializations, our formulation does not. Remarkably as shown in Araújo et al. (

al. (2018); Chizat & Bach (2018); Javanmard et al. (2019); Rotskoff et al. (2019); Wei et al. (

Nguyen & Pham (2020)). The recent work Lu et al. (2020) on an MF resnet model (a composition of many two-layer MF networks) and a recent update of Sirignano & Spiliopoulos (2019) essentially establish conditions for stationary points to be global optima. They however require strong assumptions on the support of the limit point. As explained in Section 4.3, we analyze the training dynamics without such an assumption and in fact allow it to be violated. Our global convergence result is non-quantitative. An important, highly challenging future direction is to develop a quantitative version of global convergence; previous works on two-layer networks (Javanmard et al. (2019); Wei et al. (2019); Rotskoff et al. (2019); Chizat (2019)) have done so under sophisticated modifications of the architecture and training algorithms. Finally, we remark that our insights here can be applied to prove similar global convergence guarantees and derive other sufficient conditions for global convergence of two-layer or multilayer networks (Nguyen & Pham (2020); Pham & Nguyen (2020)).

we have from Assumptions 1 and 3:
$$\big|\mathcal{L}(W(t)) - \mathcal{L}(\bar{w}_1,\bar{w}_2,\bar{w}_3)\big| = \big|\mathbb{E}_Z\big[\mathcal{L}\big(Y,\hat{y}(X;W(t))\big) - \mathcal{L}\big(Y,\hat{y}(X;\bar{w}_1,\bar{w}_2,\bar{w}_3)\big)\big]\big| \le K\,\mathbb{E}_Z\big[\big|\hat{y}(X;W(t)) - \hat{y}(X;\bar{w}_1,\bar{w}_2,\bar{w}_3)\big|\big]$$
$$\le K\,\mathbb{E}\Big[\big|w_3(t,C_2) - \bar{w}_3(C_2)\big| + |\bar{w}_3(C_2)|\,\big|w_2(t,C_1,C_2) - \bar{w}_2(C_1,C_2)\big| + |\bar{w}_3(C_2)|\,|\bar{w}_2(C_1,C_2)|\,\big|w_1(t,C_1) - \bar{w}_1(C_1)\big|\Big]$$

$\mathbb{E}_{C_2}[|w_3(t,C_2) - \bar{w}_3(C_2)|]\to 0$. Then the following hold:
• Case 1 (convex loss): If $\mathcal{L}$ is convex in the second variable and $\lim_{t\to\infty}\mathcal{L}(W(t)) = \inf_{V}\mathcal{L}(V)$, then it must be that
$$\sup_{c_1\in\Omega_1}\,\mathbb{E}_{C_2}\Big[\Big|\frac{\partial}{\partial t}w_2(t,c_1,C_2)\Big|\Big] \to 0 \quad \text{as } t\to\infty.$$

$$\frac{\partial}{\partial t}w_2(t,c_1,c_2) = -\mathbb{E}_Z\Big[\partial_2\mathcal{L}\big(Y,\hat{y}(X;W(t))\big)\,\phi_3'\big(H_3(X;W(t))\big)\,w_3(t,c_2)\,\phi_2'\big(H_2(X,c_2;W(t))\big)\,\phi_1\big(\langle w_1(t,c_1),X\rangle\big)\Big]$$

similar to Mei et al. (2018); Araújo et al. (2019), in contrast with the non-quantitative results in Chizat & Bach (2018); Sirignano & Spiliopoulos (2019). These bounds suggest that $n_1$ and $n_2$ can be chosen independently of the data dimension $d$. This agrees with the experiments in Nguyen (2019), which found width $\approx 1000$ to be typically sufficient to observe MF behaviors in networks trained on real-life high-dimensional data.

3. Universal approximation: $\{\phi_1(\langle u,\cdot\rangle): u\in\mathbb{R}^d\}$ has dense span in $L^2(\mathcal{P}_X)$ (the space of square-integrable functions w.r.t. $\mathcal{P}_X$, the distribution of the input $X$).

Next we consider $\Delta^{H_2}(Z,C_2;\tilde{W}(t))$. Recall that
$$\Delta^{H_2}\big(z,c_2;\tilde{W}(t)\big) = \partial_2\mathcal{L}\big(y,\hat{y}(x;\tilde{W}(t))\big)\,\phi_3'\big(H_3(x;\tilde{W}(t))\big)\,\tilde{w}_3(t,c_2)\,\phi_2'\big(H_2(x,c_2;\tilde{W}(t))\big).$$
Then, together with the existence of $\tilde{w}_3^*$, we have that $\Delta^{H_2}(Z,C_2;\tilde{W}(t))$ is $\mathcal{S}^Z_3$-measurable. Now we consider $\mathbb{E}_{C_2}[\Delta^{H_2}(Z,C_2;\tilde{W}(t))\,\tilde{w}_2(t,C_1,C_2)]$. With the existence of $\tilde{w}_2^*$, there exists a Borel function $g_t$ such that

Published as a conference paper at ICLR 2021

An important ingredient of the proof is that the distribution of $w_1(t,C_1)$ has full support at all times $t \ge 0$, even though we only assume this property at initialization $t=0$. This key property is proven by a topology argument, supported by the measurability result of Lemma 16. We remark that a similar property for two-layer networks is established in Chizat & Bach (2018) using a different topology argument.

Lemma 17. Consider the same setting as Theorem 8. For all finite times $t \ge 0$, the support of $\mathrm{Law}(w_1(t,C_1))$ is $\mathbb{R}^d$.

Proof. By Lemma 16, one can choose a neuronal embedding such that there exist Borel functions $w_1^*$

which implies there is an open ball $B$ in $\mathbb{R}^d$ for which $\mathbb{P}(w_1(t,C_1)\in B) = 0$. Then $\mathbb{P}\big(w_1^*(t,w_1^0(C_1))\in B\big) = 0$. Since $w_1^*(t,\cdot)$ is surjective and continuous, there is a nonempty open set $U$ such that $w_1^*(t,u_1)\in B$ for all $u_1\in U$. Then $\mathbb{P}\big(w_1^0(C_1)\in U\big) = 0$, contradicting the assumption that $w_1^0(C_1)$ has full support. Therefore $w_1(t,C_1)$ must have full support at all $t \ge 0$.

Note that the right-hand side is independent of $c_1$. Since $\mathbb{E}_{C_2}[|w_3(t,C_2) - \bar{w}_3(C_2)|]\to 0$ as $t\to\infty$, we have, for some finite $t_0$,
$$\mathbb{E}_{C_2}[|\bar{w}_3(C_2)|] \le \mathbb{E}_{C_2}[|w_3(t_0,C_2)|] + K \le K,$$
where the last step is by Lemma 11 and Assumption 2. As such, for all sufficiently large $t$:
$$\mathbb{E}_{C_2}\Big[\Big|\frac{\partial}{\partial t}w_2(t,c_1,C_2)\Big|\Big] \le K\,\mathbb{E}_Z\big[\big|\partial_2\mathcal{L}\big(Y,\hat{y}(X;W(t))\big)\big|\big]\,\mathbb{E}_{C_2}[|w_3(t,C_2)|] \le K\,\mathbb{E}_Z\big[\big|\partial_2\mathcal{L}\big(Y,\hat{y}(X;W(t))\big)\big|\big]\,\big(K + \mathbb{E}_{C_2}[|\bar{w}_3(C_2)|]\big) \le K\,\mathbb{E}_Z\big[\big|\partial_2\mathcal{L}\big(Y,\hat{y}(X;W(t))\big)\big|\big].$$
The proof concludes once we show that $\mathbb{E}_Z\big[\big|\partial_2\mathcal{L}\big(Y,\hat{y}(X;W(t))\big)\big|\big]\to 0$ as $t\to\infty$. For a fixed $z=(x,y)$, let us write $\mathcal{L}(t,z) = \mathcal{L}(y,\hat{y}(x;W(t)))$ and $\partial_2\mathcal{L}(t,z) = \partial_2\mathcal{L}(y,\hat{y}(x;W(t)))$ for brevity.

Consider Case 1. We claim that if there is an increasing sequence of times $t_i$ such that $\lim_{i\to\infty}\big[\mathcal{L}(t_i,z) - \inf_{\hat{y}}\mathcal{L}(y,\hat{y})\big] = 0$, then $\lim_{i\to\infty}|\partial_2\mathcal{L}(t_i,z)| = 0$. Indeed, it suffices to show that every subsequence $(t_{i_j})$ of $(t_i)$ admits a further subsequence $(t_{i_{j_k}})$ such that $\lim_{k\to\infty}|\partial_2\mathcal{L}(t_{i_{j_k}},z)| = 0$. Given any subsequence $(t_{i_j})$ of $(t_i)$, since $\mathcal{L}(t_{i_j},z)$ is convergent and $\mathcal{L}(y,\hat{y})\to\infty$ as $|\hat{y}|\to\infty$, the sequence $\hat{y}(x;W(t_{i_j}))$ is bounded. Hence we obtain a further subsequence $(t_{i_{j_k}})$ along which $\hat{y}(x;W(t_{i_{j_k}}))$ converges to some limit $\hat{y}^*$. By continuity, $\mathcal{L}(y,\hat{y}^*) = \lim_{k\to\infty}\mathcal{L}(t_{i_{j_k}},z) = \inf_{\hat{y}}\mathcal{L}(y,\hat{y})$. Thus, since $\mathcal{L}$ is convex in the second variable, $\partial_2\mathcal{L}(y,\hat{y}^*) = 0$, and hence $\lim_{k\to\infty}|\partial_2\mathcal{L}(t_{i_{j_k}},z)| = |\partial_2\mathcal{L}(y,\hat{y}^*)| = 0$, as claimed. Similarly, we obtain in Case 2 that if there is an increasing sequence of times $t_i$ with $\lim_{i\to\infty}\mathcal{L}(t_i,z) = 0$, then $\lim_{i\to\infty}|\partial_2\mathcal{L}(t_i,z)| = 0$.

To show that $\mathbb{E}_Z[|\partial_2\mathcal{L}(t,Z)|]\to 0$ as $t\to\infty$, it suffices to show that any increasing sequence of times $t_i$ tending to infinity admits a subsequence $(t_{i_j})$ such that $\mathbb{E}_Z\big[\big|\partial_2\mathcal{L}(t_{i_j},Z)\big|\big]\to 0$. In Case 1, we have $\lim_{i\to\infty}\mathcal{L}(W(t_i)) = \inf_{V}\mathcal{L}(V)$, so $\lim_{i\to\infty}\mathbb{E}_Z\big[\mathcal{L}(t_i,Z) - \inf_{\hat{Y}}\mathcal{L}(Y,\hat{Y})\big] = 0$.
Since $\mathcal{L}(t_i,Z) - \inf_{\hat{Y}}\mathcal{L}(Y,\hat{Y})$ is nonnegative, this implies that $\mathcal{L}(t_i,Z) - \inf_{\hat{Y}}\mathcal{L}(Y,\hat{Y})$ converges to $0$ in probability. Thus there is a further subsequence $(t_{i_j})$ along which $\mathcal{L}(t_{i_j},Z) - \inf_{\hat{Y}}\mathcal{L}(Y,\hat{Y})$ converges to $0$ $\mathbb{P}$-almost surely. By the previous claim, $\partial_2\mathcal{L}(t_{i_j},Z)$ converges to $0$ $\mathbb{P}$-almost surely. Since $\partial_2\mathcal{L}(t_{i_j},Z)$ is bounded $\mathbb{P}$-almost surely, the bounded convergence theorem gives $\mathbb{E}_Z\big[\big|\partial_2\mathcal{L}(t_{i_j},Z)\big|\big]\to 0$. The result in Case 2 can be established similarly.

By Theorem 21, $\mathbb{P}\big(|Z_n(x)| \ge n\delta \,\big|\, x\big) \le 2\exp\big(-n\delta^2/(4R^2)\big)$, and therefore $\mathbb{P}\big(|Z_n(x)| \ge n\delta\big) \le 2\exp\big(-n\delta^2/(4R^2)\big)$, since the right-hand side is uniform in $x$. Next note that, w.r.t. the randomness of $x$ only,
$$\mathbb{E}_x\big[|Z_n(x)|\big] = \mathbb{E}_x\big[|Z_n(x)|\,\mathbb{I}\big(|Z_n(x)| \ge n\delta/2\big)\big] + \mathbb{E}_x\big[|Z_n(x)|\,\mathbb{I}\big(|Z_n(x)| < n\delta/2\big)\big] \le \mathbb{E}_x\big[|Z_n(x)|\,\mathbb{I}\big(|Z_n(x)| \ge n\delta/2\big)\big] + n\delta/2.$$
As such, by Markov's inequality and the Cauchy–Schwarz inequality,
$$\mathbb{P}\big(\mathbb{E}_x[|Z_n(x)|] \ge n\delta\big) \le \mathbb{P}\Big(\mathbb{E}_x\big[|Z_n(x)|\,\mathbb{I}\big(|Z_n(x)| \ge n\delta/2\big)\big] \ge n\delta/2\Big) \le \frac{2}{n\delta}\,\mathbb{E}\big[|Z_n(x)|\,\mathbb{I}\big(|Z_n(x)| \ge n\delta/2\big)\big]$$
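The Cauchy–Schwarz step invoked here is the truncation bound $\mathbb{E}[|Z|\,\mathbb{I}(|Z|\ge a)] \le \sqrt{\mathbb{E}[Z^2]}\,\sqrt{\mathbb{P}(|Z|\ge a)}$, obtained by applying Cauchy–Schwarz to the product $|Z|\cdot\mathbb{I}(|Z|\ge a)$. A quick numerical check on Gaussian samples (the distribution and threshold $a$ are illustrative choices):

```python
import random

# Numerical check of the Cauchy-Schwarz truncation bound:
#   E[|Z| 1{|Z| >= a}] <= sqrt(E[Z^2]) * sqrt(P(|Z| >= a)).
# It holds exactly for the empirical measure of any sample, since the
# indicator squared equals the indicator itself.
random.seed(2)
a, N = 1.0, 100000
zs = [random.gauss(0.0, 1.0) for _ in range(N)]
lhs = sum(abs(z) for z in zs if abs(z) >= a) / N
rhs = (sum(z * z for z in zs) / N) ** 0.5 * (sum(abs(z) >= a for z in zs) / N) ** 0.5
print(lhs <= rhs)  # -> True
```

In the proof, $|Z|$ is $|Z_n(x)|$ and $a = n\delta/2$, which is what converts the second-moment bound $\mathbb{E}[|Z_n(x)|^2] \le 2nR^2$ and the tail bound into the stated $8R/(\sqrt{n}\,\delta)$ factor.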

