UNIFORM-IN-TIME PROPAGATION OF CHAOS FOR THE MEAN-FIELD GRADIENT LANGEVIN DYNAMICS

Abstract

The mean-field Langevin dynamics is characterized by a stochastic differential equation that arises from (noisy) gradient descent on an infinite-width two-layer neural network, which can be viewed as an interacting particle system. In this work, we establish a quantitative weak propagation of chaos result for this system, with a finite-particle discretization error of O(1/N) uniformly over time, where N is the width of the neural network. This allows us to directly transfer learning guarantees for infinite-width networks to practical finite-width models without excessive overparameterization. On the technical side, our analysis differs from most existing studies of similar mean-field dynamics in that we do not require the interaction between particles to be sufficiently weak to obtain uniform propagation of chaos, because such assumptions may not be satisfied in neural network optimization. Instead, we make use of a logarithmic Sobolev-type condition which can be verified in appropriate regularized risk minimization settings.

1. INTRODUCTION

Mean-field neural networks. We consider the optimization of a two-layer neural network in the mean-field regime, represented as an average over N neurons: f_X(z) = (1/N) Σ_{i=1}^N h_z(x_i), where given the input z ∈ R^{d′}, each neuron computes a nonlinear transformation based on trainable parameters x ∈ R^d; for example, we may set h_z(x) = tanh(w⊤z + b) for x = (w, b) ∈ R^{d′+1}. Importantly, the mean-field parameterization allows the parameters to move away from initialization during gradient descent and hence learn informative features (Yang and Hu, 2020) even when the network width is large (N → ∞), in contrast to the Neural Tangent Kernel (NTK) parameterization (Jacot et al., 2018) (corresponding to a 1/√N prefactor), which freezes the model at initialization under overparameterization. This feature learning ability enables mean-field neural networks to outperform their NTK counterparts (or linear estimators in general) in learning a wide range of target functions (Ghorbani et al., 2019; Li et al., 2020; Abbe et al., 2022; Ba et al., 2022). Optimization guarantees for mean-field neural networks are typically obtained by lifting the finite-width model to the infinite-dimensional space of parameter distributions and then exploiting convexity of the objective function. Using this viewpoint, convergence of gradient flow on infinite-width neural networks to the globally optimal solution can be shown under appropriate conditions (Nitanda and Suzuki, 2017; Chizat and Bach, 2018; Mei et al., 2018; Rotskoff and Vanden-Eijnden, 2018; Sirignano and Spiliopoulos, 2020). However, most existing results are qualitative in nature, in that they characterize neither the rate of convergence nor the finite-particle discretization error.

Mean-field Langevin dynamics. An often-studied optimization method for mean-field neural networks is the noisy particle gradient descent (NPGD) algorithm (Mei et al., 2018; Hu et al., 2019; Chen et al., 2020b), where Gaussian noise is injected into the gradient to encourage "exploration" and enable global optimality to be shown under milder conditions than in the noiseless case. The large-particle and vanishing-step-size limit is termed the mean-field Langevin dynamics (Hu et al., 2019), which globally minimizes an entropy-regularized convex functional in the space of measures. Recently, Nitanda et al. (2022); Chizat (2022) established exponential convergence of the mean-field Langevin dynamics under certain logarithmic Sobolev inequalities which can be easily verified in regularized risk minimization problems using two-layer neural networks (1). This represents a significant step towards a quantitative optimization analysis of neural networks in the presence of feature learning, yet the limitation is also clear: these results are obtained by directly analyzing the large-particle limit (i.e., the limiting McKean-Vlasov stochastic differential equation), and cannot be easily transferred to practical finite-width networks. In fact, naively applying the quantitative results in Mei et al. (2018; 2019) leads to discretization error bounds that blow up exponentially in time, rendering the guarantee vacuous beyond the very early stages of gradient descent learning. Therefore, for the purpose of characterizing the optimization behavior of finite-width neural networks, it is important to derive a finite-particle discretization error bound that holds uniformly over time, that is, the error remains stable even when t is large.
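As a concrete reference point for the two parameterizations contrasted above, the following minimal numpy sketch (dimensions and variable names are our own illustration, not from the paper) implements the 1/N mean-field prefactor and the 1/√N NTK prefactor for the tanh neurons h_z(x) = tanh(w⊤z + b) used as the running example:

```python
import numpy as np

def f_mean_field(z, W, b):
    """Mean-field network: f_X(z) = (1/N) * sum_i tanh(w_i^T z + b_i)."""
    return np.tanh(W @ z + b).mean()

def f_ntk(z, W, b):
    """NTK parameterization: same neurons with a 1/sqrt(N) prefactor."""
    return np.tanh(W @ z + b).sum() / np.sqrt(len(b))

rng = np.random.default_rng(0)
d_prime, N = 5, 1000
z = rng.standard_normal(d_prime)
W = rng.standard_normal((N, d_prime))   # rows are the weight vectors w_i
b = rng.standard_normal(N)              # biases b_i
print(f_mean_field(z, W, b), f_ntk(z, W, b))
```

Note that the mean-field output stays O(1) as N grows (it is an empirical mean of bounded neurons), while the NTK output is the same sum inflated by a factor of √N; this difference in scaling is what allows the mean-field parameters to move O(1) distances during training.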
In this paper, we establish finite-particle guarantees for the mean-field Langevin dynamics via a propagation of chaos calculation (Sznitman, 1991), which controls the weak error between the empirical distribution of the interacting particle system and the corresponding infinite-particle limit along the optimization trajectory. This allows us to bound the difference in function value between the finite-width neural network optimized by NPGD and the infinite-width counterpart. In particular, starting from N particles initialized i.i.d. as X_0 ~ µ_0, if we denote the finite-particle model at time t of optimization as f_{X_t}, and its corresponding infinite-particle limit as f_{µ_t}, then our propagation of chaos result is the following.

Theorem (informal). Under suitable regularity conditions, E[(f_{X_t}(z) - f_{µ_t}(z))²] = O(1/N) for any t > 0 and z ∈ R^{d′}.

We make the following remarks on the main theorem:

• To our knowledge, we provide the first rigorous uniform-in-time propagation of chaos result in the context of mean-field neural networks. This is in contrast to prior works where the discretization error typically increases as optimization proceeds (e.g., |f_{X_t} - f_{µ_t}| = O(exp(t)·N^{-1/2}) as in Mei et al. (2018, Theorem 3)). The theorem implies that as the width N becomes larger, the difference between the finite-width and infinite-width model outputs diminishes rapidly, as shown in Figure 1.

• Our analysis assumes a modified log-Sobolev condition which is satisfied in regularized risk minimization problems using neural networks when the convex regularizer on the parameters has a super-quadratic tail. Notably, we do not impose any constraint on the strength of the regularization and interaction; this differs from many existing results where uniform propagation of chaos is only achieved under weak interaction or large noise (Eberle et al., 2019; Delarue and Tse, 2021).

Due to the space constraint, we defer discussions of additional related works to Appendix A.
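At initialization (t = 0) the particles are i.i.d., so the O(1/N) weak error reduces to the variance of an empirical mean. The hypothetical numpy check below illustrates that scaling for the tanh neurons of the Introduction: E[(f_{X_0}(z) - f_{µ_0}(z))²] multiplied by N stays roughly constant as N grows. (The substance of the theorem is that this scaling persists for all t > 0 along the correlated NPGD trajectory, which this snippet does not probe.)

```python
import numpy as np

rng = np.random.default_rng(0)
d_prime = 2
z = rng.standard_normal(d_prime)

def h_z(X):
    """Neurons h_z(x) = tanh(w^T z + b), applied row-wise to X = (w, b)."""
    return np.tanh(X[:, :d_prime] @ z + X[:, d_prime])

# Stand-in for the mean-field limit f_{mu_0}(z), estimated with a huge sample.
f_limit = h_z(rng.standard_normal((10**6, d_prime + 1))).mean()

errors = {}
for N in [100, 400, 1600]:
    # Monte Carlo estimate of E[(f_{X_0}(z) - f_{mu_0}(z))^2] over fresh draws
    # of the N initial particles.
    sq = [(h_z(rng.standard_normal((N, d_prime + 1))).mean() - f_limit) ** 2
          for _ in range(2000)]
    errors[N] = np.mean(sq)

print({N: e * N for N, e in errors.items()})  # roughly constant => error = O(1/N)
```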

2. PRELIMINARIES

In this section, we formulate the problem setting and introduce some useful notation for the following sections. We optimize a two-layer neural network by minimizing the empirical or expected risk in a supervised learning setting, where the input lies in a set Z ⊂ R^{d′} and the output lies in a bounded set Y ⊂ R. As defined in the Introduction, h_{(·)}(x) : z ∈ Z → h_z(x) ∈ Y represents one neuron (particle) with parameters x ∈ R^d, and the mean-field neural network is written as the average over N neurons: f_X(z) = (1/N) Σ_{i=1}^N h_z(x_i), where X = (x_i)_{i=1}^N ⊂ R^d denotes the collection of parameters and z ∈ Z. The continuous limit of the neural network is obtained by taking N → ∞, and in analogy with the law of large numbers, f_X converges to the following integral form: f_µ(z) = ∫ h_z(x) dµ(x), where µ is a probability measure on (R^d, B(R^d)) representing the weight of each parameter. Let P be the set of probability measures on (R^d, B(R^d)), and P_p those with a finite p-th moment (p ≥ 1). As is typical in mean-field analysis, we aim to optimize the density µ so that the neural network f_µ accurately predicts the output y ∈ Y from the input z ∈ Z. In the following, we take (regularized) empirical risk minimization as a concrete example, and we note that the exact same analysis applies to the minimization of the expected risk. Let ℓ(z, y) : Y × Y → R be a convex loss function, such as the squared loss ℓ(z, y) = (z - y)²/2 for regression, or the logistic loss ℓ(z, y) = log(1 + exp(-yz)) for classification. For each (z_i, y_i) in the given training data (z_i, y_i)_{i=1}^n ⊂ Z × Y, we use the notation h_i(x) and ℓ_i(f) to indicate h_{z_i}(x) and ℓ(f(z_i), y_i), respectively. Our goal is to find an approximate minimizer over P of the following objective:

F(µ) := (1/n) Σ_{i=1}^n ℓ_i(f_µ) + λ₁ ∫ r(x) dµ(x),     (1)

where λ₁ > 0 is the regularization strength and r(·) is a convex regularizer.
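For a finite particle measure µ^N = (1/N) Σ_j δ_{x_j}, the objective F is directly computable. The following sketch evaluates it for the squared loss and the super-quadratic regularizer r(x) = ‖x‖⁴ used later in the paper; the tanh neurons, array layout, and all constants are illustrative assumptions:

```python
import numpy as np

def objective_F(X, Z, Y, lam1=1e-2):
    """F(mu^N) = (1/n) sum_i l(f_X(z_i), y_i) + lam1 * (1/N) sum_j r(x_j),
    with squared loss l(f, y) = (f - y)^2 / 2 and r(x) = ||x||^4.
    Each row of X is one particle x_j = (w_j, b_j)."""
    d_prime = Z.shape[1]
    preds = np.tanh(Z @ X[:, :d_prime].T + X[:, d_prime]).mean(axis=1)  # f_X(z_i)
    risk = 0.5 * np.mean((preds - Y) ** 2)
    reg = lam1 * np.mean(np.sum(X ** 2, axis=1) ** 2)                   # mean of ||x_j||^4
    return risk + reg

rng = np.random.default_rng(0)
n, N, d_prime = 50, 32, 4
Z = rng.standard_normal((n, d_prime))
Y = np.sign(Z[:, 0])
X = 0.1 * rng.standard_normal((N, d_prime + 1))
print(objective_F(X, Z, Y))
```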
More specifically, we will analyze the mean-field Langevin dynamics, which solves an entropy-regularized version of (1). It is worth noting that this entropy-regularized objective can also be globally optimized by the recently proposed particle gradient-type methods of Nitanda et al. (2021); Oko et al. (2022), for which finite-width convergence rates have been provided. However, those methods employ an intricate double-loop structure which does not mirror the commonly used gradient descent algorithm. Therefore, an important question to be addressed is whether noisy gradient descent also enjoys a similar quantitative convergence guarantee; this is precisely the motivation of the current paper.

3. MEAN-FIELD GRADIENT LANGEVIN DYNAMICS

Derivation of the continuous dynamics. The basic idea of the mean-field Langevin dynamics is to optimize the aforementioned objective via a Wasserstein gradient flow over the set of measures P. To define the gradient with respect to the measure, we introduce the first variation δG/δµ of a functional G : P₂ → R at µ ∈ P_q (for a given q ≥ 1) as a continuous functional P_q × R^d → R that satisfies

lim_{ϵ→0} [G(ϵν + (1 - ϵ)µ) - G(µ)]/ϵ = ∫ δG/δµ(µ)(x) d(ν - µ)(x),

for any ν ∈ P_q. If there exists such a functional δG/δµ, we say that G admits a first variation at µ, or simply that G is differentiable at µ. To avoid the ambiguity of δG/δµ up to a constant shift, we follow the convention of imposing ∫ δG/δµ(µ) dµ = 0. In our setting, the first variation of the objective F is given by

δF/δµ(µ)(x) = (1/n) Σ_{j=1}^n ℓ′_j(f_µ) h_j(x) + λ₁ r(x).

We track F(µ_t) along a trajectory of measures (µ_t)_t in P₂ following a continuity equation: ∂_t µ_t = ∇·(µ_t v_t), where v_t : R^d → R^d is a vector field in L²(µ_t), and the time derivative and the divergence operator are defined in a weak sense, that is, for any continuously differentiable function ϕ with compact support, ∫ ϕ dµ_t - ∫ ϕ dµ_s = -∫_s^t ∫ ∇ϕ · v_τ dµ_τ dτ. Then, the time derivative of G(µ_t) can be written as

∂_t G(µ_t) = -∫ v_t · ∇(δG/δµ)(µ_t) dµ_t.     (2)

We refer readers to Villani (2009); Ambrosio et al. (2005); Bakry et al. (2014) for more details. In this sense, ∇(δF/δµ)(µ_t) can be seen as a gradient direction in the measure space (endowed with the Wasserstein metric). The mean-field Langevin dynamics approximately minimizes the objective F based on this Wasserstein gradient flow.
Specifically, defining the nonlinear drift term

b(x, µ) = ∇(δF/δµ)(µ)(x) = (1/n) Σ_{j=1}^n ℓ′_j(f_µ) ∇h_j(x) + λ₁ ∇r(x),

the mean-field Langevin dynamics is given by the following stochastic differential equation:

dX_t = -b(X_t, µ_t) dt + √(2λ) dW_t,  µ_t = Law(X_t),  X_0 ~ µ_0,     (3)

where Law(X) denotes the distribution (probability law) of the random variable X, and (W_t)_{t≥0} is the d-dimensional standard Brownian motion. The existence and uniqueness of the solution are ensured by Theorem 3.3 of Huang et al. (2021) (see also Corollary 3 for more details). For concise presentation in the subsequent analysis, we follow the notation of Delarue and Tse (2021) and denote the law µ_t of the mean-field Langevin dynamics with initial condition µ_0 by µ_t = m(t, µ_0). It is known that µ_t satisfies the following nonlinear Fokker-Planck equation:

∂_t m(t, µ_0) = λ∆m(t, µ_0) + ∇·[m(t, µ_0) b(·, m(t, µ_0))],  m(0, µ_0) = µ_0

(this is again defined in a weak sense, that is, ∫ ϕ d(m(t, µ_0) - m(s, µ_0)) = ∫_s^t ∫ (λ∆ϕ - b(·, m(τ, µ_0))⊤∇ϕ) dm(τ, µ_0) dτ for any smooth test function ϕ with compact support). This dynamics is an example of the distribution-dependent SDEs originating from the study of interacting particle systems, which dates back to the 1950s (Kahn and Harris, 1951; Kac, 1956; 1959; McKean, 1966; 1967). A fundamental characterization of the mean-field Langevin dynamics is that it is a Wasserstein gradient flow minimizing the following objective (Mei et al., 2018; Hu et al., 2019):

L(µ) = F(µ) + λ Ent(µ),

where Ent(µ) = ∫ log(dµ(x)/dx) dµ(x) is the negative entropy of µ. Indeed, it is known that ∇(δL/δµ)(µ) = ∇(δF/δµ) + λ∇log(µ) = λ∇log(µ) + b(·, µ) (e.g., Theorem 4.16 of Ambrosio et al. (2005)), and the continuity equation corresponding to µ_t can be rewritten as

∂_t µ_t = ∇·[(λ∇log(µ_t) + b(·, µ_t))µ_t] = ∇·(µ_t ∇(δL/δµ)(µ_t)),

which, in combination with the identity (2), yields

d/dt L(µ_t) = -∫ ‖∇(δL/δµ)(µ_t)‖² dµ_t.
We therefore see that µ_t decreases L(µ_t) unless δL/δµ = 0, which is a crucial property guaranteeing the convergence of µ_t to the globally optimal solution (Lemma 1). This can be seen as a nonlinear extension of the usual gradient Langevin dynamics (e.g., see Bakry et al. (2014)), where F(µ) is a linear functional of the form F(µ) = ∫ L(x) dµ(x) for some objective function L. It is easy to see that the objective L can be reformulated (up to an additive constant) using the KL divergence from a distribution characterized by the regularization term r:

L(µ) = (1/n) Σ_{i=1}^n ℓ_i(f_µ) + λ KL(µ, ν_r),

where ν_r is the distribution with density proportional to exp(-λ₁r/λ), and KL(·, ·) is the KL divergence (relative entropy) defined as KL(µ, ν) := ∫ log(dµ/dν) dµ.

Particle discretization. One of the main difficulties in simulating the mean-field Langevin dynamics is that we cannot access the exact information of µ_t in a practical setting. Instead, we approximate this infinite-dimensional object by a finite set of particles, which yields the following SDE:

dX^i_t = -b(X^i_t, µ^N_t) dt + √(2λ) dW^i_t,  µ^N_t = (1/N) Σ_{i=1}^N δ_{X^i_t},  X^i_0 ~ µ_0.     (4)

Our goal is to quantify the finite-particle approximation error incurred by replacing µ_t with µ^N_t. The main mechanism behind this approximation is the propagation of chaos (Sznitman, 1991), which roughly refers to the phenomenon that as the number of particles N → ∞, the correlation between particles vanishes and µ^N_t → µ_t. However, it is far from trivial to establish a quantitative estimate of this approximation that provides a meaningful guarantee for the finite-particle system. In the following sections, we will show that under appropriate conditions, the finite-particle error can be (weakly) controlled uniformly over time, which allows us to transfer learning guarantees in the mean-field limit to finite-width neural networks that are not excessively overparameterized.
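A further Euler-Maruyama discretization in time of the finite-particle SDE above yields the NPGD algorithm mentioned in the Introduction. The following is a minimal sketch under the squared loss, tanh neurons, and r(x) = ‖x‖⁴; the step size, constants, and data are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

def npgd_step(X, Z, Y, lam=1e-2, lam1=1e-2, eta=1e-2, rng=None):
    """One noisy particle gradient descent update (Euler-Maruyama step):
    X <- X - eta * b(X, mu^N) + sqrt(2 * lam * eta) * xi,   xi ~ N(0, I)."""
    n, d_prime = Z.shape
    pre = Z @ X[:, :d_prime].T + X[:, d_prime]     # (n, N) pre-activations
    act = np.tanh(pre)
    resid = act.mean(axis=1) - Y                   # l'_j(f_{mu^N}) for squared loss
    dact = (1.0 - act ** 2) * resid[:, None]       # chain rule through tanh
    grad_w = dact.T @ Z / n                        # (1/n) sum_j l'_j * grad_w h_j
    grad_b = dact.mean(axis=0)
    drift = np.concatenate([grad_w, grad_b[:, None]], axis=1)
    drift += lam1 * 4.0 * np.sum(X ** 2, axis=1, keepdims=True) * X  # grad r(x), r = ||x||^4
    return X - eta * drift + np.sqrt(2.0 * lam * eta) * rng.standard_normal(X.shape)

rng = np.random.default_rng(0)
n, N, d_prime = 100, 64, 4
Z = rng.standard_normal((n, d_prime))
Y = np.tanh(Z[:, 0])                               # synthetic regression targets
X = 0.1 * rng.standard_normal((N, d_prime + 1))    # rows are particles (w_i, b_i)
for _ in range(200):
    X = npgd_step(X, Z, Y, rng=rng)
print(X.shape)
```

Note that the injected noise has scale √(2λη), matching the quadratic variation of √(2λ)dW_t over a step of length η; the super-quadratic regularizer keeps the particles confined despite the diffusion.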

4. MAIN ASSUMPTIONS AND LOGARITHMIC SOBOLEV INEQUALITY

In this section we present our main theoretical result: the quantitative propagation of chaos. First, we introduce the main assumption in our analysis.

Assumption 1. We assume h_i, ℓ_i, r ∈ C^∞ and that the following conditions hold:

1. Convexity of loss: ℓ_i is a convex function.

2. Boundedness and smoothness: There exists B > 0 such that ‖h_i‖_∞ ≤ B, ‖∇h_i‖_∞ ≤ B, ‖∇∇⊤h_i‖_∞ ≤ B, and max{|ℓ_j(f_µ)|, |ℓ′_j(f_µ)|, |ℓ′′_j(f_µ)|} ≤ B uniformly over P.

3. Regularity of r: The regularization term r is a convex function satisfying c_r‖x‖^{2+δ} ≤ r(x) ≤ C_r(1 + ‖x‖^{2+δ}), ∇r(x)·x ≥ c_r‖x‖^{2+δ}, and 0 ⪯ ∇∇⊤r(x) ⪯ C_r(1 + ‖x‖^δ)I for constants δ > 0 and c_r, C_r > 0.

We make the following remarks on the assumptions.

• The loss convexity is a standard assumption that ensures the objective L is convex with respect to µ (note that this does not imply convexity with respect to the parameters {x_i}_{i=1}^N of the network).

• The second assumption is satisfied by standard two-layer models under the following conditions: (i) ‖z‖ ≤ C; (ii) a smooth loss function, such as the squared loss or the logistic loss. For example, we may set h_z(x) = tanh(rσ(w⊤z)) with a smooth activation function σ and x = (r, w), or h_z(x) = σ(w⊤z + b) with a smooth and bounded activation and x = (w, b).

• The constraint on r(·) requires the regularization term to have a super-quadratic tail, which is satisfied, for example, by r(x) = ‖x‖⁴. While this does not cover the standard weight decay, we note that such regularizers with stronger tail growth have been employed in the theoretical analysis of neural networks (Chen et al., 2020a; Allen-Zhu and Li, 2022). The purpose of this assumption is to ensure good isoperimetry of µ_t along the trajectory (see Corollary 1).

• The infinite differentiability condition is imposed only for simplicity of the analysis.

Proximal Gibbs distribution. An important quantity in the convergence analysis is the proximal Gibbs distribution: for µ ∈ P, we define the proximal density function as

p_µ(x) = (1/Z(µ)) exp(-(1/λ) δF/δµ(µ)(x)),

where Z(µ) is the normalization constant. One may check that this corresponds to the minimizer of the linearized potential: min_{ν∈P} ∫ δF/δµ(µ) dν + λ Ent(ν).
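Since the first variation δF/δµ is explicit, the proximal Gibbs density is computable up to its normalization constant Z(µ). The sketch below evaluates log p_µ(x) + log Z(µ) when µ is approximated by a particle measure, under the squared loss and r(x) = ‖x‖⁴; the function name and all constants are our own illustration:

```python
import numpy as np

def proximal_gibbs_logdensity(x, X, Z, Y, lam=1e-2, lam1=1e-2):
    """Unnormalized log p_mu(x), i.e. -(1/lam) * (dF/dmu)(mu)(x) up to an
    additive constant, with
    (dF/dmu)(mu)(x) = (1/n) sum_j l'_j(f_mu) h_j(x) + lam1 * r(x),
    mu the particle measure of X, squared loss, and r(x) = ||x||^4."""
    d_prime = Z.shape[1]
    f_mu = np.tanh(Z @ X[:, :d_prime].T + X[:, d_prime]).mean(axis=1)  # f_mu(z_j)
    h_x = np.tanh(Z @ x[:d_prime] + x[d_prime])                        # h_j(x)
    first_variation = np.mean((f_mu - Y) * h_x) + lam1 * np.sum(x ** 2) ** 2
    return -first_variation / lam

rng = np.random.default_rng(0)
n, N, d_prime = 50, 32, 4
Z = rng.standard_normal((n, d_prime))
Y = np.sign(Z[:, 0])
X = 0.1 * rng.standard_normal((N, d_prime + 1))
x = np.ones(d_prime + 1)
print(proximal_gibbs_logdensity(x, X, Z, Y))
```

The super-quadratic term λ₁‖x‖⁴/λ dominates the bounded data term for large ‖x‖, which is exactly the tail behavior exploited by the isoperimetry arguments below.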
For a given µ_t, we denote by µ̂_t its proximal Gibbs measure, that is, the probability measure with density p_{µ_t}. Then, we have the following characterization of the minimizer of L.

Proposition 1. Under Assumption 1, the functional L has a unique minimizer in P that is absolutely continuous with respect to the Lebesgue measure. Moreover, µ* ∈ P_{2+δ} is the optimal solution if and only if µ* is absolutely continuous and its density function is given by p_{µ*}.

This proposition can be shown in the same manner as Proposition 2.5 of Hu et al. (2019). We remark that although this prior result assumed r to have at most quadratic growth, its proof does not require such a growth condition, but only the integrability of ν_r and ν_r(x) log(ν_r(x)). Many convergence properties of the mean-field Langevin dynamics can be characterized through properties of µ* and the proximal Gibbs distribution µ̂_t. We first introduce the logarithmic Sobolev inequality (LSI), which will be very useful in the subsequent analysis.

Definition 1 (Logarithmic Sobolev inequality). Let p(θ) be a smooth probability density function on R^d. We say that p(θ) (or its corresponding probability measure on (R^d, B(R^d))) satisfies the LSI with constant α′ > 0 if and only if, for any smooth function ϕ : R^d → R with E_p[ϕ²] < ∞, it holds that

E_p[ϕ² log(ϕ²)] - E_p[ϕ²] log(E_p[ϕ²]) ≤ (2/α′) E_p[‖∇ϕ‖²₂].

We can verify that in our setting, the proximal Gibbs measure satisfies the LSI.

Proposition 2. Under Assumption 1, p_µ satisfies the log-Sobolev inequality with a constant α that depends on d, c_r, B, λ, δ. If additionally ∇∇⊤r ⪰ I, then the LSI holds with α = (2λ₁/λ) exp(-4B/λ).

The proof is given in Corollary 5 in the Appendix. Note that a similar characterization was obtained in Nitanda et al.
(2022); Chizat (2022) under a quadratic regularizer r(x) = ‖x‖², via the standard Bakry-Émery and Holley-Stroock arguments (Bakry and Émery, 1985; Holley and Stroock, 1987) (see also Corollaries 5.7.2 and 5.1.7 of Bakry et al. (2014)). However, our Assumption 1 does not entail ∇∇⊤r ⪰ I, and thus our proof follows a different strategy. This LSI condition is crucial for the geometric ergodicity of the mean-field Langevin dynamics described in Theorem 1 below.

5. CONVERGENCE GUARANTEE FOR FINITE-WIDTH NEURAL NETWORKS

To present our (weak) convergence result, we first introduce an objective function of the form U(t, µ) := Φ(m(t, µ)), where Φ : P → R is assumed to be sufficiently smooth; that is, Φ is twice differentiable with respect to µ and x, and the derivatives are bounded as

sup_{x_1,...,x_k ∈ R^d} |∂^{j_1}_{x_1} ··· ∂^{j_k}_{x_k} (δ^kΦ/δµ^k)(µ)(x_1, ..., x_k)| < C

for k = 0, 1, 2 and j_i = 0, 1, 2, uniformly over all µ ∈ P, with some constant C (see Delarue and Tse (2021) for a related definition).

Example 1. Under Assumption 1, we allow for the following objective functions. For a smooth loss ℓ, Φ(µ) = E_{(Z,Y)~P}[ℓ(f_µ(Z), Y)] satisfies the smoothness condition. If P = (1/n) Σ_{i=1}^n δ_{(z_i,y_i)}, then Φ is the training loss, and if P is the test distribution, it is the test loss.

We proceed by bounding the weak difference between the finite-particle system at time t and the optimum µ* (see Proposition 1): E[Φ(µ^N_t)] - Φ(µ*). We utilize the following decomposition:

E[Φ(µ^N_t)] - Φ(µ*) = E[U(t, µ^N_0) - Φ(µ*)] + E[U(0, µ^N_t) - U(t, µ^N_0)],

where the first term (I), the ergodicity term, monitors the convergence of the infinite-particle dynamics (3) (starting from µ^N_0) to the optimal solution µ*, and the second term (II), the propagation of chaos term, controls the fluctuation due to the finite-particle update (4). The two terms are bounded separately in the ensuing subsections. Note that while we focus on two-layer neural networks under Assumption 1, the same computation can be performed under certain isoperimetric conditions on the trajectory. In particular,

• Analysis of the ergodicity term (I) only requires the proximal Gibbs measure µ̂_t to satisfy the LSI. This condition can be easily verified both for the quadratic regularizer as in (Nitanda et al., 2022; Chizat, 2022) and for super-quadratic regularizers as shown in Proposition 2.

• To control the propagation of chaos term (II), we require an LSI condition on µ_t along the trajectory.
A similar assumption also appeared in Lacker and Le Flem (2022) to obtain a uniform-in-time evaluation, and it is very challenging to establish in the mean-field neural network setting. We verify this assumption by transferring the LSI constant from µ̂_t to µ_t via a super log-Sobolev condition, which holds under the super-quadratic regularization in Assumption 1 (see Lemma 2).

5.1. BOUNDING THE ERGODICITY TERM (I)

Let W_p(µ, ν) denote the p-Wasserstein distance between µ, ν ∈ P_p. We first show that m(t, µ^N_0) converges to µ* at an exponential rate (geometric ergodicity) in the following sense.

Lemma 1. Under Assumption 1, for µ_t = m(t, µ^N_0), it holds that L(µ_t) < ∞ for any t > 0, and for any t > τ₀, where τ₀ > 0 is arbitrary,

L(µ_t) - L(µ*) ≤ exp(-2αλ(t - τ₀)) λ ψ₂(τ₀) W₂(µ₀, µ*)²,
KL(µ_t, µ*) ≤ exp(-2αλ(t - τ₀)) ψ₂(τ₀) W₂(µ₀, µ*)²,

where ψ₂(t) = (1/(2λ)) (B²/(1 - exp(-B²t)) + tB² exp(4B²t))², and α is the LSI constant of µ̂_t given in Proposition 2.

The proof can be found in Corollary 4 in the Appendix. This is an extension of the "entropy sandwich" argument of Nitanda et al. (2022); Chizat (2022), in which the right-hand side of the bound is given by KL(µ₀, µ*) instead of the Wasserstein distance. However, in our setting, µ₀ = µ^N_0 is a discrete distribution and thus the KL divergence from µ* is not finite. To resolve this issue, our analysis upper-bounds the KL divergence at t > τ₀ via the Wasserstein distance W₂(µ₀, µ*) (see Corollary 3). Then, we obtain the following theorem on the convergence of term (I).

Theorem 1 (Geometric ergodicity). Under Assumption 1, the term (I) converges as

E[U(t, µ^N_0)] - Φ(µ*) ≤ C √(2α⁻¹ψ₂(τ₀)) exp(-αλ(t - τ₀)) E[W₂(µ^N_0, µ*)],

for any t > τ₀, where τ₀ > 0 is an arbitrary positive real number.

Proof. By the Otto-Villani theorem (Otto and Villani, 2000), the LSI implies Talagrand's inequality: W₂(µ, µ*) ≤ √((2/α) KL(µ, µ*)). Also, the smoothness of Φ entails Φ(µ) - Φ(µ*) ≤ C W₂(µ, µ*) (see Lemma 10). The assertion is obtained by combining Lemma 1 and Talagrand's inequality.

5.2. BOUNDING THE PROPAGATION OF CHAOS TERM (II)

Bounding the second term (II) is much more involved. We utilize the following evaluation adapted from Delarue and Tse (2021) (see Arnaudon and Del Moral (2020) for a similar calculation):

E[U(0, µ^N_t) - U(t, µ^N_0)] = (1/N) Σ_{i=1}^d ∫₀^t E[∫ ∂_{(x₁)_i}∂_{(x₂)_i} (δ²U/δµ²)(t - s, µ^N_s)(x, x) µ^N_s(dx)] ds.

Intuitively, the integrand on the right-hand side captures how a small discrepancy between the finite-particle system and the continuous limit introduced at time s propagates to the terminal time t. Here, for a linear operator q acting on functions f : R^d → R, we write f(q) := q(f). Then from Delarue and Tse (2021) (see also Appendices C.2 and C.3) it holds that

∂_{(x₁)_i}∂_{(x₂)_i} (δ²U/δµ²)(t, µ₀)(x₁, x₂) = (δ²Φ/δµ²)(µ_t)(d⁽¹⁾_i(t; µ₀, ξ, x₁), d⁽¹⁾_i(t; µ₀, ξ, x₂)) + (δΦ/δµ)(µ_t)(d⁽²⁾_{i,i}(t; µ₀, x₁, x₂)),

where d⁽¹⁾_i and d⁽²⁾_{i,j} are the linear operators defined by

d⁽¹⁾_i(t; µ₀, ξ, x₁)(ϕ) = ∂_{(x₁)_i} (δ/δµ)(m(t; ·)(ϕ))|_{µ₀}(x₁),
d⁽²⁾_{i,j}(t; µ₀, x₁, x₂)(ϕ) = ∂_{(x₁)_i}∂_{(x₂)_j} (δ²/δµ²)(m(t; ·)(ϕ))|_{µ₀}(x₁, x₂),

for a smooth test function ϕ : R^d → R, where m(t, µ)(ϕ) = ∫ ϕ(x) dm(t, µ)(x) (we will also use the same notation for a general measure µ). The dynamics of these operators is characterized by Proposition 6 in Appendix C.1, adapted from Delarue and Tse (2021). To obtain a uniform-in-time evaluation of term (II), we aim to show a rapid decay of d⁽¹⁾_i and d⁽²⁾_{i,j}.

Isoperimetry of µ_t via the super LSI. Our strategy is to establish the boundedness of the integral ∫₀^t E[∫ ∂_{(x₁)_i}∂_{(x₂)_i} (δ²U/δµ²)(t - s, µ^N_s)(x, x) µ^N_s(dx)] ds by proving the exponential decay of ∂_{(x₁)_i}∂_{(x₂)_i} (δ²U/δµ²)(t - s, µ^N_s)(x, x). However, this requires a local evaluation around m(t - s; µ^N_s), for which we cannot exploit "global" properties such as the log-Sobolev condition of the optimal solution µ*.
Instead, our analysis requires m(t - s; µ^N_s) to also satisfy the LSI, which is technically demanding to establish. To our knowledge, a similar condition has only recently been verified in Guillin et al. (2021); Lacker and Le Flem (2022) for a limited class of interaction potentials which does not cover the case of mean-field neural networks. To overcome this difficulty, we impose a stronger-than-quadratic tail growth condition on the regularizer, i.e., r(x) = Ω(‖x‖^{2+δ}) (see the third point of Assumption 1). Under this assumption, we can show that the proximal Gibbs measure µ̂_τ associated with µ_τ = m(τ + s; µ^N_s) satisfies the super logarithmic Sobolev inequality defined below.

Definition 2 (super logarithmic Sobolev inequality (super LSI)). We say that a probability measure µ satisfies the super log-Sobolev inequality if there exists a monotonically non-increasing function β : (0, ∞) → R such that for any ϕ satisfying E_µ[ϕ²] = 1 and E_µ[‖∇ϕ‖²] < ∞, it holds that

E_µ[ϕ² log ϕ²] ≤ r ∫ ‖∇ϕ‖² dµ + β(r)  (∀r > 0).

It is known that the super LSI implies the LSI if there exists r > 0 such that β(r) = 0.

Lemma 2. For any µ ∈ P, the probability measure µ̂ corresponding to the proximal Gibbs density p_µ satisfies the super log-Sobolev inequality with β(r) = C′ - ((4 + 2δ)/δ) log(r/2) for a constant C′ > 0. Furthermore, it satisfies the log-Sobolev inequality with LSI constant α = exp(-δC′/(4 + 2δ)).

The proof is given in Appendix B.3. An important consequence of this lemma is that µ_t and µ̂_t have a bounded density ratio when t is sufficiently large. This implies that many properties of µ̂_t are also inherited by µ_t. Crucially, the bound on the density ratio is strong enough for the LSI to be transferred from µ̂_t to µ_t, which allows us to establish the exponential decay of d⁽¹⁾_i and d⁽²⁾_{i,j}. Corollary 1.
Under Assumption 1, there exist T₀ > 0 depending on d, B, δ, α, λ, c_r, C_r and Q₀ depending on W₂(µ₀, µ*) such that for all t ≥ T₀ + Q₀ we have

1/√2 ≤ (dµ_t/dµ̂_t)(x) ≤ √2  (∀x ∈ R^d),  ‖dµ_t/dµ̂_t - 1‖_∞ ≤ C′ exp(-αλ(t - T₀)),

where C′ is some positive constant. Moreover, for t ≥ T₀ + Q₀, µ_t satisfies the (α/2)-LSI.

Uniform-in-time propagation of chaos. Equipped with the LSI on µ_t, we can now control term (II) in the error decomposition by proving exponential decay of d⁽¹⁾_i and d⁽²⁾_{i,j}. In particular, the LSI implies the Poincaré inequality, which then roughly ensures that the KL divergence behaves like a strongly convex function around µ_t. Therefore, a small perturbation from µ_t converges exponentially to 0 as t grows, which entails the fast decay of d⁽¹⁾_i and d⁽²⁾_{i,j}, because these quantities represent infinitesimal displacements of µ_t.

Theorem 2 (Uniform propagation of chaos). Suppose that the support of µ₀ is bounded. Then for any 0 < s < t and 1 ≤ i ≤ d, it holds that

E[∫ ∂_{(x₁)_i}∂_{(x₂)_i} (δ²U/δµ²)(t - s, µ^N_s)(x, x) µ^N_s(dx)] = O(exp(-λα(t - s - T₀)/2)),

with some constant T₀ > 0. This implies that E[U(0, µ^N_t) - U(t, µ^N_0)] = O(N⁻¹).

The proof of this theorem can be found in Appendix C.6. In addition, note that the informal theorem stated in Section 1 is a direct consequence of the above theorem (see Corollary 7 for details).

5.3. PUTTING THINGS TOGETHER

By combining the previous calculations, we arrive at the following characterization of the difference between the finite-width neural network and the optimal (infinite-width) solution µ*.

Corollary 2. Under Assumption 1, if the initial distribution µ₀ has bounded support, then we have

E[U(0, µ^N_t)] - Φ(µ*) ≤ C₁N⁻¹ + C₂√(2α⁻¹ψ₂(τ₀)) exp(-αλ(t - τ₀)),

for some constants C₁, C₂ and any τ₀ > 0.

Notably, since neither Theorem 1 nor Theorem 2 requires a sufficiently large regularization, our finite-particle guarantee holds for any choice of the regularization strengths λ, λ₁; in contrast, prior works on uniform propagation of chaos typically assume weak interaction or large noise, which limits their applicability to the optimization of neural networks in the mean-field regime. As an important consequence of our general convergence guarantee, we know that the finite-particle dynamics exhibits a decay of the training and test losses similar to that of the continuous limit, up to an O(1/N) discretization error; this can be checked straightforwardly by considering the special setting where Φ(µ) = E_{(Z,Y)~P}[ℓ(f_µ(Z), Y)] (see Example 1).

Remark 1. Our current convergence result holds in expectation. To obtain a high-probability statement, as discussed in Arnaudon and Del Moral (2020), one may apply a martingale concentration inequality (e.g., see Lemma 3.2 of Nishiyama (1997)) to obtain a guarantee of the form U(0, µ^N_t) - U(t, µ^N_0) - E[U(0, µ^N_t) - U(t, µ^N_0)] = c log(ϵ⁻¹)/N with probability 1 - ϵ.

6. NUMERICAL EXPERIMENTS

We provide empirical support for our propagation of chaos result in a synthetic student-teacher setting. We consider the empirical risk minimization problem, where the training labels are generated by a teacher model given by the Gaussian function f*(z) = exp(-‖z - a‖²/(2d)). We set n = 2000, d = 20.
The loss is chosen to be the squared error, and for the regularization term we set r(x) = ‖x‖² or r(x) = ‖x‖⁴, with regularization strengths λ₁ = λ = 10⁻². The student model is a two-layer neural network with tanh activation, and the width N is taken from {16, 32, 64, 128, 256, 512, 1024, 2048}. We optimize the student model using NPGD with step size η = 10⁻². The globally optimal solution µ* is approximated via the particle dual averaging (PDA) algorithm (Nitanda et al., 2021): we set the model width N = 2048 and the number of outer-loop steps T = 250; we scale the number of inner-loop steps T_t with t, and the step size η_t with 1/√t, where t is the outer-loop iteration. In the figures, we report the training or test error without the regularization terms. In Figure 1 in the Introduction, we plot the training error for r(x) = ‖x‖⁴, whereas in Figure 2 we plot the test error for r(x) = ‖x‖². Observe that even though our current theoretical analysis does not cover the latter setting, the empirical trends are almost identical: as the width N increases, the performance of the finite-width network improves and approaches that of the (approximate) optimal solution µ*.

7. CONCLUSION

In this paper, we established the first uniform-in-time propagation of chaos result for the mean-field Langevin dynamics in the context of neural network optimization. In contrast to most existing works, our analysis gives a quantitative discretization error bound that does not blow up over time, and we do not impose the commonly-assumed weak interaction condition. This is achieved by utilizing a super logarithmic Sobolev inequality, which is satisfied by regularization terms with super-quadratic tails. This condition enables us to establish good isoperimetry of the intermediate solution µt, which yields exponential convergence of the error propagation.

Limitations and future directions. Our current analysis requires a super-quadratic tail of the regularization term, which does not cover the commonly-used ℓ2 regularization (weight decay). We note that after our initial submission, Chen et al. (2022a) developed a different proof technique based on the tensorization of the LSI, which handles the case of quadratic regularization. Another important future direction is to extend the analysis to discrete-time dynamics. Finally, the mean-field Langevin dynamics has found applications beyond neural network optimization (Chizat et al., 2022); hence we are optimistic that our technique can provide finite-particle guarantees for interacting particle algorithms in other applications.

Mean-field analysis of two-layer neural networks describes the optimization dynamics as a partial differential equation (PDE) of the parameter distribution, from which convergence to the global optimal solution can be shown (Nitanda and Suzuki, 2017; Chizat and Bach, 2018; Mei et al., 2018; Rotskoff and Vanden-Eijnden, 2018). However, quantitative convergence results usually require additional assumptions on the learning problem (Chizat, 2019; Akiyama and Suzuki, 2021; Chen et al., 2022b), or modifications of the dynamics (Rotskoff et al., 2019; Wei et al., 2019).
For the entropy-regularized objective (5), there are roughly two approaches to obtain a uniform-in-time control: (i) the uniform log-Sobolev (or Poincaré) inequality approach, and (ii) the local Taylor expansion approach. The first approach (i) directly derives the LSI constant of the N-particle dynamics (X_t = (X^i_t)_{i=1}^N) and shows that this constant can be bounded from below uniformly over N. For example, Guillin et al. (2022) (see also references therein) established a uniform LSI constant under a weak interaction assumption. Ren and Wang (2021) and Delgadino et al. (2021) showed geometric ergodicity based on a similar evaluation. Salem (2018) considered a uniform WJ-inequality instead of the log-Sobolev inequality, also based on weak interaction conditions. The second approach (ii) is the one we employ; in particular, we follow the framework developed in Arnaudon and Del Moral (2020) and Delarue and Tse (2021). In addition, Durmus et al. (2020) devised a different technique using a sophisticated coupling argument. We note that these prior results all assume weak interaction between particles to establish a uniform-in-time evaluation, and thus cannot be applied to the neural network setting.

B BASIC PROPERTIES OF THE SOLUTION

B.1 BOUNDEDNESS AND UNIQUENESS OF THE SOLUTION

Proposition 3 (Theorem 2.1 and Corollary 4.3 of Wang (2018), adapted). Suppose that there exist K1, K2, K3 ∈ C([0, ∞); (0, ∞)) such that
• −2⟨b(x, µ) − b(y, ν), x − y⟩ ≤ K1(t)∥x − y∥² + K2(t)W2(µ, ν)∥x − y∥ for any x, y ∈ R^d, t ≥ 0 and µ, ν ∈ P2,
• ∥b(0, µ)∥² ≤ K3(t){1 + µ(∥·∥²)} for any µ ∈ P2 and t ≥ 0.
Then, the mean-field Langevin dynamics (3) has a unique strong solution. Moreover, for two solutions X^(1)_t and X^(2)_t with different initial distributions µ^(1)_0, µ^(2)_0 ∈ P2, the corresponding laws µ^(k)_t = Law(X^(k)_t) (k = 1, 2) are equivalent for t > 0 and satisfy the following contraction property:
$$W_2(\mu^{(1)}_t, \mu^{(2)}_t)^2 \le \psi_1(t)\, W_2(\mu^{(1)}_0, \mu^{(2)}_0)^2, \qquad \mathrm{KL}(\mu^{(1)}_t \,\|\, \mu^{(2)}_t) \le \psi_2(t)\, W_2(\mu^{(1)}_0, \mu^{(2)}_0)^2, \tag{9}$$
for t > 0, where ψj : (0, ∞) → [0, ∞) (j = 1, 2) depends only on K1, K2, K3, λ and is an increasing function. Note that it is possible that lim_{t→0} ψ2(t) = ∞. Indeed, ψ2(t) is given as
$$\psi_2(t) = \frac{1}{2\lambda}\left(\frac{K_1(t)}{1 - e^{-K_1(t)t}} + t K_2(t) \exp\bigl(2t(K_1(t) + K_2(t))\bigr)\right)^2.$$
As a consequence of this proposition, we obtain the following corollary.

Corollary 3. Under Assumption 1, the mean-field Langevin dynamics (3) has a unique strong solution. Moreover, the two distributions µ^(1)_t and µ^(2)_t corresponding to different initial distributions in P2 are equivalent and satisfy the contraction property (9). In particular, µt is equivalent to µ* and hence to the Lebesgue measure. Therefore, µt has a density that is positive for all x ∈ R^d, and if µ0 ∈ P2, then the density of µt satisfies (t, x) ↦ (dµt/dx)(x) ∈ C^{1,∞}((0, ∞) × R^d, R).

Proof. Let H(x, µ) = (1/n) Σ_{j=1}^n ℓ'_j(f_µ) h_j(x). We just need to check the two conditions in Proposition 3.
The first condition can be checked as follows: by noticing the convexity of r, we have -2⟨b(x, µ) -b(y, ν), x -y⟩ = -2⟨∇H(x, µ) -∇H(y, ν), x -y⟩ -2λ 1 ⟨∇r(x) -∇r(y), x -y⟩ = -2⟨∇H(x, µ) -∇H(y, µ), x -y⟩ -2⟨∇H(y, µ) -∇H(y, ν), x -y⟩ -2λ 1 ⟨∇r(x) -∇r(y), x -y⟩ ≤ 2B 2 ∥x -y∥ 2 + 2B 2 ∥f µ -f ν ∥ ∞ ∥x -y∥ (∵ Assumption 1 and convexity of r) ≤ 2B 2 ∥x -y∥ 2 + 2B 3 W 2 (µ, ν)∥x -y∥, which yields the first condition. Next, the second condition can be guaranteed as ∥b(0, µ)∥ ≤ ∥∇H(0, µ)∥ + λ 1 ∥∇r(0)∥ ≤ B 2 + λC r . Therefore, applying Proposition 3, we obtain the first assertion. As for the second assertion, we first note that µ * is an invariant measure of the mean-field Langevin dynamics. Hence, if µ 0 = µ * , then µ t = µ * for any t > 0. Moreover, recall that p µ * = µ * by Proposition 1. Combining these relations, we have µ * = p µ * = µ t (t > 0) when the initial distribution satisfies µ 0 = µ * . Therefore, µ t with a general initial distribution is equivalent to µ * by Proposition 3. Finally, (t, x) → dµt dx (x) ∈ C 1,∞ ((0, ∞) × R d , R) follows from Theorem 5.1 of Jordan et al. (1998) . In the subsequent analysis, it is important to ensure the boundedness of the moments of X t . Indeed, we have the following estimate. Lemma 3. Under Assumption 1, for any p ≥ 2, E[∥X 0 ∥ p ] < ∞ implies E sup t∈[0,T ] ∥X t ∥ p < ∞ for any T > 0. This is a direct consequence of Theorem 2.1 of Wang (2018) . Therefore, we have that µ t ∈ P p as long as µ 0 ∈ P p . Indeed, we are interested in a situation where µ 0 = µ N s = 1 N N i=1 Xi s , and hence µ t ∈ P p for any p ≥ 2 because a discrete measure has a finite moment for any p ≥ 2. If p = 2 or p = 2 + δ, we have a sharper uniform bound as follows. Lemma 4. 
Under Assumption 1, we have the following uniform boundedness of the moments: sup t>0 E[∥X t ∥ 2 ] ≤ max 1 λ 1 c(1 + δ/2) B 4 λ 1 c(1 + δ/2) + λ 1 cδ + 2λd , E[∥X 0 ∥ 2 ] , sup t>0 E[r(X t )] ≤ max E[r(X 0 )], Cr(2+δ) λ1c 2 r (1+δ) 1 2 + δ (B 2 C r ) 2+δ (λ 1 c r /2) 1+δ + (1 + 2δ)λ 1 c r 2 + δ + λC r + 2 2 + δ (λC r ) (2+δ)/2 δ λcr(1+δ) δ/2 . The same bounds also hold with respect to E[∥ Xi t ∥ 2 ] and E[r( Xi t )]. Proof. Let H t (x) = 1 n n j=1 ℓ ′ j (f µt )h j (x) . By the formula of the infinitesimal generator, we have d dt E[∥X t ∥ 2 ] = E[-2X ⊤ t b(X t , µ t )] + 2λd . By Young's inequality, the right hand side can be bounded as -2E X ⊤ t (∇H t (X t ) + λ 1 ∇r(X t )) + 2λd ≤ 2B 2 E[∥X t ∥] -2λ 1 cE[∥X t ∥ 2+δ ] + 2λd ≤ 4B 4 4λ 1 c(1 + δ/2) + 2λ 1 c(1 + δ/2)E[∥X t ∥ 2 ] 2 + 2λ 1 c δ 2 -(1 + δ 2 )E[∥X t ∥ 2 ] + 2λd ≤ B 4 λ 1 c(1 + δ/2) + λ 1 cδ + 2λd -λ 1 c(1 + δ 2 )E[∥X t ∥ 2 ]. Hence, we obtain that E[∥X t ∥ 2 ] ≤ 1 λ 1 c(1 + δ/2) B 4 λ 1 c(1 + δ/2) + λ 1 cδ + 2λd ∨ E[∥X 0 ∥ 2 ]. In the same vein, we can show the bound for r(X t ) as follows. First note that d dt E[r(X t )] = E[-∇ ⊤ r(X t )b(X t , µ t )] + λE[Tr[∇∇ ⊤ r(X t )]]. By Young's inequality, the right hand side can be bounded as -E ∇ ⊤ r(X t ) (∇H t (X t ) + λ 1 ∇r(X t )) + λC r (1 + E[∥X t ∥ δ ]) ≤ B 2 E[∥∇r(X t )∥] -λ 1 E[∥∇r(X t )∥ 2 ] + λC r (1 + E[∥X t ∥ δ ]) ≤ B 2 C r E[∥X t ∥ 1+δ ] -λ 1 c r E[∥X t ∥ 2(1+δ) ] + λC r (1 + E[∥X t ∥ δ ]) ≤ 1 2 + δ (B 2 C r ) 2+δ (λ 1 c r /2) 1+δ + λ 1 c r (1 + δ) 2(2 + δ) E[∥X t ∥ 2+δ ] -λ 1 c r 2(1 + δ) 2 + δ E[∥X t ∥ 2+δ ] - δ 2 + δ + λC r + 2 2 + δ (λC r ) (2+δ)/2 δ λc r (1 + δ) δ/2 + λ 1 c r (1 + δ) 2(2 + δ) E[∥X t ∥ 2+δ ] ≤ 1 2 + δ (B 2 C r ) 2+δ (λ 1 c r /2) 1+δ + δλ 1 c r 2 + δ + λC r + 2 2 + δ (λC r ) (2+δ)/2 δ λc r (1 + δ) δ/2 - λ 1 c r (1 + δ) 2 + δ E[∥X t ∥ 2+δ ] ≥E[r(Xt)]/Cr-1 . This gives the second bound. The same argument can be applied to Xi t , which concludes the assertion. Lemma 5. 
Under Assumption 1, for µ 0 ∈ P p with p ≥ 2, it holds that ∇ log(µ t )(x) = - 1 t √ 2λ E t 0 (I + s∇b(•, µ s )| Xs ) • dW s |X t = x , and, if p/δ ≥ 1, it also holds that E[∥∇ log(µ t )(X t )∥ p/δ ] < ∞, for any t > 0. Proof. First, note that µ 0 ∈ P p ensures differentiability of µ t . The characterization of ∇ log(µ t ) is given by the integration by parts formula investigated by Föllmer (1986) (see also Lemma 6.2 of Hu et al. (2019) , Wang (2014) and Theorem 5.1 of Wang ( 2018)). As for the moment bound, first we note that ∥∇b(•, µ)| x ∥ = O(1 + ∥x∥ δ ), by the assumption. Then, by Jensen's inequality and the moment inequality of stochastic integral (Kim, 2013) , we have that, for q = p/δ, E[∥∇ log(µ t )∥ q ] ≲ 1 t √ 2λ q C t,q E t 0 (1 + s∥∇b(•, µ s )| Xs ∥) q ds ≲ 1 t √ 2λ q C t,q E t 0 (1 + s(1 + ∥X s ∥ δ )) q ds < ∞ (∵ Lemma 3), where C t,q = q(q-1) 2 q/2 t q-2 2 . According to Lemma 3 and the remark following the lemma, we have µ t ∈ P p for any p ≥ 2 in our situation where µ 0 is a discrete measure like µ 0 = µ N s . Hence, we may assume E[∥∇ log(µ t )(X t )∥ p ] < ∞ for any p ≥ 2. In particular, the Fisher divergence I(µ t ||µ * ) is welldefined for any t > 0 (but not defined for t = 0).
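Lemma 4's uniform-in-time moment bound can be probed numerically with a crude Euler-Maruyama simulation. The sketch below is our own illustration, not the paper's dynamics: a bounded tanh "interaction" gradient stands in for ∇H_t (mimicking ∥∇H∥∞ ≤ B), the regularizer is r(x) = ∥x∥⁴ (so λ1∇r(x)·x has the required super-quadratic growth with δ = 2), and all constants are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

d, M = 2, 2000          # dimension and number of simulated particles
lam, lam1 = 0.5, 0.1    # noise level and regularization strength (illustrative)
eta, steps = 0.01, 2000

X = 3.0 * rng.standard_normal((M, d))   # deliberately spread-out initialization

second_moments = []
for _ in range(steps):
    # Bounded "interaction" gradient, mimicking ||grad H||_inf <= B in Assumption 1.
    grad_H = np.tanh(X)
    # Super-quadratic regularizer r(x) = ||x||^4, so grad r(x) = 4 ||x||^2 x.
    grad_r = 4.0 * np.sum(X ** 2, axis=1, keepdims=True) * X
    # Euler-Maruyama step of dX = -(grad_H + lam1 * grad_r) dt + sqrt(2 * lam) dW.
    X += -eta * (grad_H + lam1 * grad_r) + np.sqrt(2 * lam * eta) * rng.standard_normal((M, d))
    second_moments.append(np.mean(np.sum(X ** 2, axis=1)))

print(f"E||X_t||^2: start {second_moments[0]:.2f}, "
      f"max {max(second_moments):.2f}, final {second_moments[-1]:.2f}")
```

Despite the spread-out initialization, the empirical second moment contracts quickly and then fluctuates around a small stationary value, consistent with the uniform bound of Lemma 4.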

B.2 GEOMETRIC ERGODICITY

For µ, ν ∈ P where ν is absolutely continuous with respect to µ and thus can be written as dν = f dµ, the Fisher divergence of ν with respect to µ is defined as
$$I(\nu \,\|\, \mu) = 4\int \|\nabla \sqrt{f}\|^2 \, d\mu = \int \|\nabla \log f\|^2 \, d\nu.$$

Proposition 4 (Geometric ergodicity of the mean-field Langevin dynamics (Nitanda et al., 2022; Chizat, 2022)).
$$\mathcal{L}(\mu_t) - \mathcal{L}(\mu^*) \le \exp(-2\alpha\lambda t)\,\bigl(\mathcal{L}(\mu_0) - \mathcal{L}(\mu^*)\bigr), \qquad \lambda\,\mathrm{KL}(\mu \,\|\, \mu^*) \le \mathcal{L}(\mu) - \mathcal{L}(\mu^*) \le \lambda\,\mathrm{KL}(\mu \,\|\, p_\mu).$$

Although Nitanda et al. (2022) and Chizat (2022) assumed r(x) = Θ(∥x∥²), we can adapt the same argument to our situation. Indeed, the quadraticity of the regularization term is used only to ensure the well-posedness of the solution, and in our setting this is ensured by Corollary 3, which yields the assertion of the proposition. Combining Corollary 3 and Proposition 4, we obtain the following corollary.

Corollary 4. Under Assumption 1, for any initial condition µ0 ∈ P with W2(µ0, µ*) < ∞, it holds that L(µt) < ∞, and
$$\mathcal{L}(\mu_t) - \mathcal{L}(\mu^*) \le \exp(-2\alpha\lambda(t - \tau_0))\,\lambda\,\psi_2(\tau_0)\,W_2(\mu_0, \mu^*)^2, \qquad \mathrm{KL}(\mu_t \,\|\, \mu^*) \le \exp(-2\alpha\lambda(t - \tau_0))\,\psi_2(\tau_0)\,W_2(\mu_0, \mu^*)^2,$$
for any t > τ0, where τ0 > 0 can be arbitrary.

Proof. We know that µ* = p_{µ*} = µt (t > 0) when the initial distribution is µ0 = µ* from the proof of Corollary 3. Plugging this relation into Corollary 3 and Proposition 4 gives the assertion.

As remarked in Section 5, in the convergence analysis we need to assume an LSI (or Poincaré inequality) on µt instead of µ̂t. This is not ensured in general. However, we can verify this condition if the semigroup satisfies the super log-Sobolev inequality. Indeed, the super log-Sobolev inequality entails ultracontractivity, yielding an L∞-convergence of the density ratio between µt and µ̂t. This is remarkably useful to transfer the LSI property of µ̂t to µt.
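The mechanism behind the exponential rate in Proposition 4 can be sketched in two lines, following the argument of Nitanda et al. (2022) and Chizat (2022); the bookkeeping below is our own unpacking. The entropy sandwich in Proposition 4 converts the Fisher-divergence dissipation along the dynamics into a linear differential inequality:

```latex
\frac{d}{dt}\bigl(\mathcal{L}(\mu_t)-\mathcal{L}(\mu^*)\bigr)
  = -\lambda^2\, I(\mu_t \,\|\, \hat{\mu}_t)
  \le -2\alpha\lambda^2\, \mathrm{KL}(\mu_t \,\|\, \hat{\mu}_t)
  \le -2\alpha\lambda\,\bigl(\mathcal{L}(\mu_t)-\mathcal{L}(\mu^*)\bigr),
```

where the first inequality uses the α-LSI of the proximal Gibbs measure µ̂t and the second uses L(µ) − L(µ*) ≤ λ KL(µ∥µ̂); Grönwall's inequality then gives the stated exp(−2αλt) decay.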

B.3 SUPER LOGARITHMIC SOBOLEV INEQUALITY

Definition 3 (super log-Sobolev inequality (super LSI)). We say that a probability measure µ satisfies the super log-Sobolev inequality if there exists a monotonically non-increasing function β : (0, ∞) → R such that
$$\mu(f^2 \log f^2) \le r \int \nabla f \cdot \nabla f \, d\mu + \beta(r) \qquad (\forall r > 0,\ \forall f \in \mathcal{D}(\mathcal{E}),\ \mu(f^2) = 1).$$

Proposition 5. If a probability measure µ is given by µ ∝ exp(−V), where V(x) = λ1 r(x) + H(x) with a convex function r : R^d → R satisfying λ1 ∇r(x) · x ≥ c∥x∥^{2+δ} for δ > 0 and H : R^d → R satisfying ∥∇H∥∞ ≤ C < ∞, then µ satisfies the super log-Sobolev inequality with
$$\beta(r) = C' - \frac{4 + 2\delta}{\delta}\log(r/2),$$
where C' > 0 is a constant depending on d, c, C, δ. In particular, it satisfies the log-Sobolev inequality with a constant α' > 0 such that β(2/α') = 0.

Proof. This can be proven by adapting Corollary 5.7.5 of Wang (2005). Let Pt be the semigroup that corresponds to the generator L*ϕ = ∆ϕ − ∇V · ∇ϕ. Then, we have that
$$L^* \|x\|^2 = d - (\lambda_1 \nabla r + \nabla H)\cdot(2x) \le d - 2c\|x\|^{2+\delta} + 2\|\nabla H\|_\infty \|x\| \le d - 2c\|x\|^{2+\delta} + c\|x\|^{2+\delta} + \frac{\|\nabla H\|_\infty^{(2+\delta)/\delta}}{c} = d + \frac{C^{(2+\delta)/\delta}}{c} - c\|x\|^{2+\delta},$$
where the second inequality is Young's inequality. Corollary 5.7.5 of Wang (2005) implies that
$$\|P_t\|_{L^2(\mu)\to L^\infty(\mu)} \le \exp\bigl[c' t^{-(1+\delta/2)/(\delta/2)}\bigr] = \exp\bigl[c' t^{-(2+\delta)/\delta}\bigr],$$
for a constant c' > 0 depending on d, c, C. Indeed, (5.7.9) of Wang (2005) holds with c ← d + C^{(2+δ)/δ}/c and γ(r) ← c r^{1+δ/2} in their notation, which yields the bound. Then, Theorem 5.1.7 of Wang (2005) states that the super log-Sobolev inequality holds for
$$\beta(r) = 2\log \|P_{r/2}\|_{L^2(\mu)\to L^\infty(\mu)} \le 2\log(c') - \frac{2(2+\delta)}{\delta}\,\log(r/2).$$
By resetting C' ← 2 log(c'), we obtain the assertion.

Due to our assumption on the regularization term in Assumption 1, the proximal Gibbs measure satisfies the super LSI condition.

Corollary 5. µ̂t satisfies the super log-Sobolev inequality with β(r) = C' − ((4 + 2δ)/δ) log(r/2). In addition, it satisfies the log-Sobolev inequality with the LSI constant α = exp(−δC'/(4 + 2δ)).
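Unwinding the last claim of Proposition 5 (the algebra here is our own check): setting β(2/α') = 0 in the displayed β, and noting log((2/α')/2) = log(1/α'), recovers the explicit LSI constant stated in Corollary 5,

```latex
0 = \beta(2/\alpha') = C' - \frac{4+2\delta}{\delta}\,\log\frac{1}{\alpha'}
\quad\Longrightarrow\quad
\alpha' = \exp\Bigl(-\frac{\delta C'}{4+2\delta}\Bigr).
```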
Let P s,t (s < t) be the semigroup associated with X t , i.e., (P s,t f )(x) = E[f (X t )|X s = x] and µ s P s,t f = µ t f . Theorem 3. There exists t 0 ∈ (0, 1] such that ∥P s,s+t0 ∥ L 2 (μs+t 0 )→L ∞ (μs) ≤ exp ∞ 2 β (1/p) p 2 dp + Ct 0 for some constant C > 0. In particular, there exists C 0 > 0 such that ∥P s,s+t0 ∥ L 2 (μs+t 0 )→L ∞ (μs) ≤ C 0 < ∞ uniformly over s > 0. Proof. We first note that μt (f p log(f )) ≤ -r μt (f p-1 L * µt f ) + β 4(p -1)r p p -1 μt (f p ) + μt (f p ) log(∥f ∥ L p (μt) ). (10) for all f ∈ D(L * µt ) such that f ≥ 0 (which we denote as f ∈ D + ), 2 < p < ∞, r > 0 (by definition the invariant measure corresponding to L * µt is μt , i.e., L µt μt = 0). Let t 0 := ∞ 2 γ(p) p dp ≤ 1 for γ(p) = 1 2p . We define p(τ ) and N (τ ) as functions on [0, t) such that p ′ (τ ) = p(τ ) γ • p(τ ) , p(0) = 2, N ′ (τ ) = p ′ (τ )β 4(p(τ )-1)γ•p(τ ) p(τ ) p(τ ) 2 , N (0) = 0. Then, for f ∈ D + , if we rewrite s ← s + t 0 , one has d dτ ∥P s-τ,s f ∥ L p(τ ) (μs-τ ) = μs-τ ( d dτ (P s-τ,s f ) p(τ ) ) p(τ )∥P s-τ,s f ∥ p(τ )-1 p(τ ) - p ′ (τ ) p(τ ) ∥P s-τ,s f ∥ L p(τ ) (μs-τ ) log(∥P s-τ,s f ∥ L p(τ ) (μs-τ ) ) + 1 p(τ ) ∥P s-τ,s f ∥ 1-p(τ ) L p(τ ) (μs-τ ) |P s-τ,s f | p(τ ) d dτ μs-τ dx. ( ) The last term of the right hand side can be bounded as 1 p(τ ) ∥P s-τ,s f ∥ 1-p(τ ) L p(τ ) (μs-τ ) |P s-τ,s f | p(τ ) d dτ μs-τ dx ≤ 1 p(τ ) ∥P s-τ,s f ∥ 1-p(τ ) L p(τ ) (μs-τ ) |P s-τ,s f | p(τ ) 1 n n j=1 ℓ ′′ j (f µs-τ )h j (•)µ s-τ (L * µs-τ h j ) μs-τ dx ≤C 1 p(τ ) ∥P s-τ,s f ∥ 1-p(τ ) L p(τ ) (μs-τ ) |P s-τ,s f | p(τ ) μs-τ dx = C ∥P s-τ,s f ∥ L p(τ ) (μs-τ ) p(τ ) . 
( ) By the backward Kolmogorov equation, and combining these inequalities ( 11) and ( 12) with the super log-Sobolev inequality (10), we have d dτ (e -N (τ ) ∥P s-τ,s f ∥ L p(τ ) (μs-τ ) ) = p ′ (τ )e -N (s) p(τ )∥P s,s-τ f ∥ p(τ )-1 L p(τ ) (μs-τ ) μs-τ ((P s-τ,s f ) p(τ ) log P s-τ,s f ) + p(τ ) p ′ (τ ) μs-τ ((P s-τ,s f ) p(τ )-1 L * µs-τ P s-τ,s f ) - N ′ (τ )p(τ ) p ′ (τ ) μs-τ ((P s-τ,s f ) p(τ ) ) -μs-τ ((P s-τ,s f ) p(τ ) ) log ∥P s-τ,s f ∥ L p(τ ) (μs-τ ) + e -N (τ ) 1 p(τ ) ∥P s-τ,s f ∥ 1-p(τ ) L p(τ ) (μs-τ ) |P s-τ,s f | p(τ ) d dτ μs-τ dx ≤ C 1 p(τ ) e -N (τ ) ∥P s-τ,s f ∥ L p(τ ) (μs-τ ) . Hence, we obtain that ∥P s-τ,s f ∥ L p(τ ) (μs-τ ) ≤ e N (τ )+C τ 0 p(τ ′ ) -1 dτ ′ ∥f ∥ L 2 (μs) ≤ e N (τ )+Cτ ∥f ∥ L 2 (μs) . Here, notice that p(s-t0) 2 γ(p) p dp = t0 0 p ′ (τ )γ(p(τ )) p(τ ) dτ = t0 0 1dτ = t 0 . On the other hand, since we also have t 0 = ∞ 2 γ(p) p dp by its definition, it must be the case that lim τ ↗t0 p(τ ) = ∞ and hence lim τ ↗t0 N (τ ) = ∞ 2 β 4(p-1)γ(p) p p 2 dp ≤ ∞ 2 β (2γ(p)) p 2 dp. This yields the first assertion. By Corollary 5 we can see that ∞ 2 β(2γ(p)) p 2 dp = ∞ 2 β(1/p) p 2 dp ≲ ∞ 2 1+log(p) p 2 dp =: C 0 < ∞. This yields the second assertion. Let p(t, x, y) be the density function of the distribution X t conditioned by X 0 = x ∈ supp(µ 0 ) with respect to μt , i.e., p(t, x, y) = dP * 0,t (δx) dμt (y). Then, we have that (p(t + t 0 , x, y) -1)f (y)dμ t+t0 (dy) ≤ |P 0,t0 • P t0,t+t0 (f -μt+t0 f )(x)| ≤C 0 ∥P t0,t+t0 f -μt+t0 f ∥ L 2 (μt 0 ) , ( ) where we used the bound ∥P 0,t0 ∥ L 2 (μt 0 )→L ∞ (μ0) ≤ C 0 and the fact that the support of μ0 is the whole space R d . We will show that the right hand side converges to 0 in an exponential order by taking its differentiation with respect to s: ∂ s ∥P s,t f -μt f ∥ 2 L 2 (μs) = 2 ∂ s (P s,t f -μt f )(P s,t f -μt f )dμ s + (P s,t f -μt f ) 2 ∂ s μs dx. We evaluate each term as follows. 
(i) The second term in the right hand side can be evaluated as (P s,t f -μt f ) 2 ∂ s μs dx ≤ (P s,t f -μt f ) 2   1 n n j=1 ℓ ′′ j (f µs )h j (•)∂ s µ s (h j )   μs dx ≤ C (P s,t f -μt f ) 2 μs dx • 1 n n j=1 |∂ s µ s (h j )| ≤ C (P s,t f -μt f ) 2 μs dx • 1 n n j=1 ∇h j (∇ log(µ s ) -∇ log(μ s ))dµ s ≤ C 2 ∥P s,t f -μt f ∥ 2 L 2 (μs) I(µ s ||μ s ). (ii) Next, we evaluate the first term. By the Poincaré inequality (PI), we have that ∂ s (P s,t f -μt f )(P s,t f -μt f )dμ s = -L * µs (P s,t f )(P s,t f -μt f )dμ s = λ∥∇P s,t f ∥ 2 dμ s ≥ λα∥P s,t f + μs (P s,t f )∥ 2 L 2 (μs) = λα∥P s,t f -μt f + μt f -μs (P s,t f )∥ 2 L 2 (μs) = λα[∥P s,t f -μt f ∥ 2 L 2 (μs) + 2μ s (P s,t f -μt f )(μ t f -μs (P s,t f )) + (μ t f -μs (P s,t f )) 2 ] = λα[∥P s,t f -μt f ∥ 2 L 2 (μs) -2(μ s (P s,t f ) -μt f ) 2 + (μ t f -μt (P s,t f )) 2 ] = λα[∥P s,t f -μt f ∥ 2 L 2 (μs) -(μ s (P s,t f ) -μt f ) 2 ] = λα∥P s,t f -μt f ∥ 2 L 2 (μs) -λα(μ s (P s,t f ) -μt f ) 2 . Denote the total variation norm of two probability measures µ and ν by ∥µ -ν∥ TV , we notice that, if t -s ≥ t 0 , it holds that |μ s (P s,t f ) -μt f | ≤ |(P * s,t-t0 μs -μt-t0 )(P t-t0,t f ) -(μ t -P * t-t0,t μt-t0 )f | ≤ ∥P * s,t-t0 μs -μt-t0 ∥ TV ∥P t-t0,t f ∥ L ∞ + |(μ t -P * t-t0,t μt-t0 )f | ≤ ∥P * s,t-t0 μs -μt-t0 ∥ TV ∥P t-t0,t ∥ L 2 (μt)→L ∞ (μt-t 0 ) ∥f ∥ L 2 (μt) + |(μ t -P * t-t0,t μt-t0 )f | ≤ C 0 ∥P * s,t-t0 μs -μt-t0 ∥ TV ∥f ∥ L 2 (μt) + |μ t f -μt-t0 (P t-t0,t f )|, and, if t -s < t 0 , the same bound without the first term in the right hand side holds. By Pinsker's inequality ∥µ -ν∥ TV ≤ KL(µ||ν)/2, we have that ∥μ s -µ * ∥ TV ≤ KL(μ s ||µ * )/2 = 1 √ 2 λ -1 δF (µ * ) δµ (•) - δF (µ s ) δµ (•) dμ s + log(Z(µ * )/Z(μ s )) 1/2 ≤ 1 √ λ sup x | δF (µ * ) δµ (x) -δF (µs) δµ (x)| ≤ C √ λ ∥f µs -f µ * ∥ ∞ ≤ C ′ exp(-αλs)W 2 (µ 0 , µ * ), where C ′ > 0 is a constant depending on τ 0 , α, λ. 
Moreover, if we let μτ = P * s,τ μs , then ∂ τ KL(μ τ ||µ * ) = log(μ τ /µ * )∂ τ μτ dx = (λ∆ -b ⊤ τ ∇) log(μ τ /µ * )μ τ dx = (-λ∇ log(μ τ ) -b τ )∇ log(μ τ /µ * )μ τ dx = -λ ∥∇ log(μ τ ) -∇ log(µ * )∥ 2 dμ τ -λ (∇ log(µ * ) + λ -1 b τ ) ⊤ (∇ log(μ τ ) -∇ log(µ * ))dμ τ = -λ ∥∇ log(μ τ ) -∇ log(µ * )∥ 2 dμ τ -λ (∇ log(µ * ) -∇ log(μ τ )) ⊤ (∇ log(μ τ ) -∇ log(µ * ))dμ τ = - λ 2 ∥∇ log(μ τ ) -∇ log(µ * )∥ 2 dμ τ + λ 2 ∥∇ log(μ τ ) -∇ log(µ * )∥ 2 dμ τ ≤ -λαKL(μ τ ||µ * ) + sup x ∥∇ log(μ τ )(x) -∇ log(µ * )(x)∥ 2 ≤ -λαKL(μ τ ||µ * ) + C ′ exp(-2αλτ )W 2 (µ 0 , µ * ) 2 , where we used the same argument as Eq. ( 16) in the first inequality. This evaluation, together with the bound of KL(μ s ||µ * ) in Eq. ( 16), implies that KL(μ τ ||µ * ) ≤ C ′ max{exp(-λα(τ -s))KL(μ s ||µ * ), exp(-2αλτ )W 2 (µ 0 , µ * ) 2 } ≤ C ′ exp(-αλ[(τ -s) + 2s])W 2 (µ 0 , µ * ) 2 . Therefore, ∥P * s,t-t0 μs -μt-t0 ∥ TV ≤ ∥P * s,t-t0 μs -µ * ∥ TV + ∥µ * -μt-t0 ∥ TV ≲ exp(-αλ[(t -t 0 -s)/2 + s])W 2 (µ 0 , µ * ). Next, we evaluate the second term of the right hand side of Eq. ( 15), |μ t f -μt-t0 (P t-t0,t f )|. We notice that ∂ s (P s,t f )μ s dx = [(λ∆ -b ⊤ s ∇)(P s,t f )]μ s dx + (P s,t f )∂ s μs dx = (-λ∇ log(μ s ) -b s ) ⊤ ∇(P s,t f )μ s dx + (P s,t f )∂ s μs dx = (P s,t f )∂ s μs dx ≤ C |P s,t f |μ s dx I(µ s ||μ s ) ≲ ∥f ∥ L 2 (μs) I(µ s ||μ s ), where the first inequality is obtained by the same argument as ( 14) and the last inequality is by Theorem 3 and the fact that the density ratio between μτ and μτ ′ are bounded from above and below for any τ, τ ′ because of the boundedness of δF δµ . Hence, as in Eq. ( 22) and Eq. ( 23) below, we have |μ t-t0 (P t-t0,t f ) -μt f | ≤ C t t-t0 I(µ s ||μ s )ds∥f ∥ L 2 (μt-t 0 ) ≤ C 1 λ √ t 0 exp(-αλ(t -t 0 -τ 0 ))(1 + ψ 2 (τ 0 )W 2 (µ 0 , µ * ))∥f ∥ L 2 (μt-t 0 ) . Therefore, by applying these bounds to the right hand side of Eq. ( 15), we arrive at (μ s (P s,t f ) -μt f ) 2 ≤ C exp(-λα(t -s + 2s))(1 + W 2 (µ 0 , µ * )) 2 ∥f ∥ 2 L 2 (μt) . 
(iii) Finally, by combining the bounds of (i) and (ii), we obtain that ∂ s ∥P s,t f -μt f ∥ 2 L 2 (μs) ≥ 2λα∥P s,t f -μt f ∥ 2 L 2 (μs) -Cλα exp(-λα(t + s))(1 + W 2 (µ 0 , µ * )) 2 ∥f ∥ 2 L 2 (μt) -C∥P s,t f -μt f ∥ 2 L 2 (μs) I(µ s ||μ s ), which yields that, by taking the differentiation with respect to s in the reverse direction, ∂ s ∥P t-s,t f -μt f ∥ 2 L 2 (μt-s) ≤ -(2λα -C I(µ t-s ||μ t-s ))∥P t-s,t f -μt f ∥ 2 L 2 (μt-s) + Cλα exp(-λα(t + s))(1 + W 2 (µ 0 , µ * )) 2 ∥f ∥ 2 L 2 (μt) , and thus, if we write C t = C t t0 I(µ s ||μ s )ds, ∥P t0,t+t0 f -μt+2t0 f ∥ 2 L 2 (μt 0 ) ≤ exp(-2λαt + C t+t0 )∥P t+t0,t+t0 f -μt+t0 f ∥ 2 L 2 (μt+t 0 ) + λαC t+t0 t0 exp(-λα(t + s))(1 + W 2 (µ 0 , µ * )) 2 ∥f ∥ 2 L 2 (μt+t 0 ) e -2λα(t-s)+Ct+t 0 -Cs ds ≤ exp(-2λαt)∥f ∥ 2 L 2 (μt+t 0 ) exp(C t+t0 )[1 + C(1 + W 2 (µ 0 , µ * )) 2 ]. As in Eq. ( 23) below, the right hand side can be bounded by C 1 exp(-2λαt)∥f ∥ 2 L 2 (μt+t 0 ) (1 + W 2 (µ 0 , µ * ) 2 ) exp(C 2 (1 + W 2 (µ 0 , µ * ))) for constants C 1 and C 2 . This is further bounded C 3 exp(-2λαt) exp(C 4 (1 + W 2 (µ 0 , µ * )))∥f ∥ 2 L 2 (μt+t 0 ) with constants C 3 and C 4 . Therefore, we arrive at ∥P t0,t+t0 f -μt+t0 f ∥ L 2 (μt 0 ) ≲ exp(-αλt) exp(C ′ 4 (1 + W 2 (µ 0 , µ * )))∥f ∥ L 2 (μt+t 0 ) , where we used that the density ratio between μτ and μτ ′ are bounded from above and below for any τ, τ ′ because of the boundedness of δF δµ and C ′ 4 = C 4 /2. Therefore, by applying this bound to the right hand side of (13), we have that sup x,y |p(t + t 0 , x, y) -1| ≤ C ′ exp(-αλt) Q0 , for a constant C ′ and Q0 = exp(C ′ 4 (1+W 2 (µ 0 , µ * ))), which can be checked by noticing the density of µ t is smooth and taking f (x) = 1{∥x -x ′ ∥ ≤ ϵ} for arbitrary x ′ ∈ R d and letting ϵ → 0. This indicates that the density function of µ t with respect to μt satisfies dµ t dμ t -1 ∞ ≤ C ′ exp(-αλ(t -t 0 )) Q0 . 
Importantly, this observation indicates that µt satisfies the (α/2)-LSI for sufficiently large t, namely whenever C' exp(−αλ(t − t0)) Q̂0 ≤ min{√2 − 1, 1 − 1/√2}, via the Holley-Stroock bounded perturbation principle (e.g., Proposition 5.1.6 of Bakry et al. (2014)).

Corollary 6. Under Assumption 1, there exists T0 depending on d, B, δ, α, λ, cr, Cr such that µt satisfies the (α/2)-LSI condition for t ≥ T0 + log(Q̂0)/(αλ). Moreover,
$$\frac{1}{\sqrt{2}} \le \frac{d\mu_t}{d\hat\mu_t}(x) \le \sqrt{2}, \qquad \left\|\frac{d\mu_t}{d\hat\mu_t} - 1\right\|_\infty \le C' \exp(-\alpha\lambda(t - T_0))\,\hat{Q}_0,$$
for all t ≥ T0 + log(Q̂0)/(αλ), where C' is a constant. By setting T_{Q̂0} = log(Q̂0)/(αλ), we obtain Corollary 1 in the main text.
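The Holley-Stroock step invoked above can be made explicit; the numerology below is our own unpacking, assuming the standard form of the perturbation principle. Writing dµt/dµ̂t = e^{-V}, the two-sided density-ratio bound gives an oscillation bound on V, and the principle degrades the LSI constant of µ̂t by at most e^{-osc(V)}:

```latex
\frac{1}{\sqrt{2}} \le \frac{d\mu_t}{d\hat{\mu}_t} \le \sqrt{2}
\;\Longrightarrow\;
\operatorname{osc}(V) \le \log 2,
\qquad
\alpha_{\mu_t} \ge e^{-\operatorname{osc}(V)}\,\alpha_{\hat{\mu}_t} \ge \frac{\alpha}{2},
```

consistent with the (α/2)-LSI claimed in Corollary 6.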

C COMPUTATION OF ERROR TERMS C.1 DYNAMICS OF THE DERIVATIVE TERMS

Recall that d i and d (2) i,j are linear operators defined by d (1) i (t; µ 0 , ξ, x 1 )(ϕ) = ∂ (x1)i δ δµ (m(t; •)(ϕ))| µ0 (x 1 ), d (2) i,j (t; µ 0 , x 1 , x 2 )(ϕ) = ∂ (x1)i ∂ (x2)j δ 2 δµ 2 (m(t; •)(ϕ))| µ0 (x 1 , x 2 ), for a smooth test function ϕ : R d → R, where m(t, µ)(ϕ) = ϕ(x)dm(t, µ)(x). Given a linear operator q, we introduce a differential operator L µ as follows, L µ q = λ∆q + ∇ • (b(•, µ)q) + ∇ • µ δb δµ (•, µ)(q) , which is defined in a weak sense, i.e., (L µ q)(ϕ) = q(L * µ ϕ) = q(λ∆ϕ-b(•, µ)•∇ϕ-δb δµ (y, µ)(•)• ∇ϕ(y)µ(dy)) for a test function ϕ : R d → R with appropriate regularity condition. Following Delarue and Tse (2021) , we know that these operators obey the following dynamics. Proposition 6. d (1) i and d (2) i,j follows the following differential equation: for t > 0, ∂ t d (1) i (t; µ, x) = L m(t,µ) d (1) i (t; µ, x), d (1) i (0; µ, x) = (D ′ x ) i , where (D ′ x ) i is defined by (D ′ x ) i (ϕ) = ∂ xi ϕ(x), and                    ∂ t d (2) i,j (t; µ, x 1 , x 2 ) = L m(t,µ) d (2) i,j (t; µ, x 1 , x 2 ) +∇ • d (1) j (t; µ, x 2 ) δb δµ (•, m(t, µ))(d i (t; µ, x 1 )) +∇ • d (1) i (t; µ, x 1 ) δb δµ (•, m(t, µ))(d j (t; µ, x 2 )) +∇ • m(t, µ), δ 2 b δµ 2 (•, m(t, µ))(d (1) i (t; µ, x 1 ), d j (t; µ, x 2 )) , d (2) i,j (0; µ, x 1 , x 2 ) = 0. The above equations should be interpreted in a weak sense,i.e., when ∂ t q t -L m(t;µ) -r t = 0 means that q t (ϕ(t, •)) -q s (ϕ(s, •)) = t s q τ (∂ τ ϕ(τ, •))dτ + t s q τ (L * m(τ ;µ) ϕ(τ, •))dτ + t s r τ (ϕ(τ, •))dτ for a smooth test function ϕ : [0, ∞) × R d → R. The well-posedness of this differential equation is justified in Delarue and Tse (2021) ; Tse (2021) for mean-field dynamics on the d-dimensional torus. Although our dynamics is defined on R d and the regularization term r has unbounded gradient, the arguments there can be applied because r is convex and does not depend on the distribution. Here we omit the technical details.

C.2 FIRST ORDER

DIFFERENTIATION Let H t (x) = 1 n n j=1 ℓ ′ j (f µt )h j (x). Here, we evaluate the first order derivative d (1) i (t; µ, x). For that purpose, we define an operator d (1) (t; µ, ξ, x) where ξ ∈ R d , x ∈ R d , µ ∈ P as (we omit the argument x if there is no confusion) d (1) (t; µ, ξ, x)(ϕ) = ξ ⊤ ∇ ∂ ∂ϵ m(t, ϵδ x + (1 -ϵ)µ)(ϕ)| ϵ=0 = ξ ⊤ ∇ δ δµ m(t, µ)(ϕ)(x). One may check that d (1) i (t; µ, x) = d (1) (t; µ, e i , x ) where e i is the indicator vector that has 1 in its i-th coordinate and 0 in other coordinates. In our case, we are interested in a setting µ 0 = µ N s = 1 N N i=1 Xi s , which is a discrete measure with support consisting of N points. In that case, d (1) (t; µ 0 , ξ, x) can be reformulated as the following gradient flow E µ0 ξ(X 0 ) ⊤ ∇ δ δµ m(t, µ 0 )(ϕ) = 1 N N i=1 ξ( Xi 0 ) ⊤ ∇ δ δµ m(t, µ 0 )(ϕ). where ξ(X 0 ) = N ξ when X 0 = x and ξ(X 0 ) = 0 otherwise. Clearly, ξ ∈ L 2 (µ 0 ). Accordingly, we define the following SDE: dX ϵ t = -b(X ϵ t , µ ϵ t )dt + √ 2λdW t , µ ϵ t = Law(X ϵ t ), X ϵ 0 = X 0 + ϵ ξ(X 0 ). Then, if we define ṽξ t := lim ϵ→0 X ϵ t -Xt ϵ , it holds that d (1) (t; µ 0 , ξ, x)(ϕ) = E[ṽ ξ t • ∇ϕ(X t )]. We can see that the infinitesimal displacement ṽξ t follows the following differential equation: dṽ ξ t = -ṽξ t • ∇b(x, µ t )| x=Xt + δ δµ b(X t , µ t )(q t ) dt, from which we can derive moment bounds for ṽϵ t . In particular, for p ≥ 1, 1 p d∥ṽ ξ t ∥ p dt = -∥ṽ ξ t ∥ p-2 ṽξ t • ∇b(x, µ t )| x=Xt • ṽξ t + δ δµ b(X t , µ t )(q t ) • ṽξ t = -∥ṽ ξ t ∥ p-2 (ṽ ξ t ) ⊤ ∇∇ ⊤ H t (X t )ṽ ξ t + λ(ṽ ξ t ) ⊤ ∇∇ ⊤ r(X t )ṽ ξ t + δ δµ b(X t , µ t )(q t ) • ṽξ t ≤ B 2 ∥ṽ ξ t ∥ p + B 2 ∥ṽ ξ t ∥ p-1 E[∥ṽ ξ t ∥], where we used Assumption 1 and convexity of r in the last inequality. First, by taking expectation of both sides for p = 1, we know that E[∥ṽ ξ 0 ∥] = ∥ξ∥ yields that E[∥ṽ ξ t ∥] ≤ exp(2B 2 t)E[∥ṽ ξ 0 ∥] = exp(2B 2 t)∥ξ∥. 
Then, for p > 1, noticing that ṽξ 0 = 0 for X 0 ̸ = x and ∥ṽ ξ t ∥ p-1 E[∥ṽ ξ t ∥] ≤ (1 -1 p )∥ṽ ξ t ∥ p + 1 p E[∥ṽ ξ t ∥] p , we have that, for x ′ ̸ = x, E[∥ṽ ξ t ∥ p |X 0 = x ′ ] (19) ≤ exp((2p -1)B 2 t)E[∥ṽ ξ 0 ∥ p |X 0 = x ′ ] + t 0 E[∥ṽ ξ s ∥] p exp[(2p -1)B 2 (t -s)]ds ≤ t 0 exp(2pB 2 s) exp[(2p -1)B 2 (t -s)]dsE[∥ṽ ξ 0 ∥] p ≤ exp[(2p -1)B 2 t][exp(B 2 t) -1]B -2 E[∥ṽ ξ 0 ∥] p ≤ B -2 exp[(2p -1)B 2 t][exp(B 2 t) -1]∥ξ∥ p ≤ B -2 [exp(2pB 2 t) -1]∥ξ∥ p . (20) In the same vein, when X 0 = x, we have that ∥ṽ ξ t ∥ p = O(exp(2pB 2 t))N p ∥ξ∥ p for p ≥ 1 and t > 0. By Corollary 3, m(t, ϵδ x -(1 -ϵ)µ 0 ) has a smooth density for t > 0, which we denote by µ (ϵ) t . Corollary 3 also asserts that µ (ϵ) t (x) > 0 and is equivalent to µ * for any t > 0. Let q (1) t,x (x ′ ) = 1 µ t (x ′ ) ξ ⊤ ∇ ∂ ∂ϵ µ (ϵ) t (x ′ )| ϵ=0 . For concise presentation, we introduce the abbreviated notation q (1) t,x (ϕ) := E µt [q (1) t,x ϕ]. Let µ ϵ t|x ′ be the distribution of X ϵ t conditioned by X ϵ 0 = x ′ + ϵ ξ(x ′ ). We define the conditional version of q t,x as q (1) t,x|x ′ (x ′′ ) := 1 µ t|x ′ (x ′′ ) ξ ⊤ ∇ ∂ ∂ϵ µ (ϵ) t|x ′ (x ′′ )| ϵ=0 . Accordingly, we define q (1) t,x|x ′ (ϕ) := E µ t|x ′ [q (1) t,x|x ′ ϕ]. Then, we can see that q (1) t,x (ϕ) = 1 N N i=1 q (1) t,x| Xi s (ϕ). Lemma 6 (Bismut formula). Suppose Assumption 1 holds. Then for a bounded measurable function f : R d → R, µ 0 ∈ P 2 and t > 0, m(t, (I + ϵξ) ♯ µ 0 )(f ) is differentiable with respect to ϵ at t = 0, and we have d dϵ m(t, (I + ϵξ) ♯ µ 0 )(f )| ϵ=0 = E[∇f (X t ) • ṽξ t ] = E f (X t ) t 0 ζ ξ s • dW s where ζ ξ s = ( √ 2λ) -1 t ṽξ s + s δb δµ (X s , µ s )(q (1) t,x ) .

Also,
$$q^{(1)}_{t,x}(x') = \mathbb{E}\left[\int_0^t \zeta^\xi_s \cdot dW_s \,\middle|\, X_t = x'\right], \qquad q^{(1)}_{t,x|x'}(x'') = \mathbb{E}\left[\int_0^t \zeta^\xi_s \cdot dW_s \,\middle|\, X_t = x'',\ X_0 = x'\right],$$
and E[(q^{(1)}_{t,x|x}(X_t))² | X_0 = x] < K(t)N², and E[(q^{(1)}_{t,x|x'}(X_t))² | X_0 = x'] ≤ K(t) for x' ≠ x, where K(t) is a constant depending on t.

Proof. The first assertion is obtained by the Bismut formula with respect to the Lions derivative of the initial distribution µ0 (Ren and Wang, 2019). In particular, Theorem 2.1 of Ren and Wang (2019) yields the assertion by setting g_s = s/t in their notation. We note that although they assumed that b(x, µ) and its derivatives are bounded, the same argument can be directly applied to our setting because r(x) is a convex function that forces X_t to contract toward the origin instead of diverging. As for the second assertion, we first observe that (d/dϵ) E[f(X_t)] = ∫ f(x) (∂_ϵ µ_t(x)/µ_t(x)) dµ_t = ∫ f(x) ∂_ϵ log µ_t(x) dµ_t for all f (note that µ_t(x) > 0 for all x ∈ R^d by Corollary 3). This indicates that ∂_ϵ log(µ_t)(x) = E[∫₀ᵗ ζ^ξ_s · dW_s | X_t = x] almost surely. In the same vein, we also obtain the characterization of the conditional version q^{(1)}_{t,x|x'}(x''). Indeed, we may consider the dynamics of X̃_t = (X_t, X_0) and apply the same argument on µ_t to the distribution of X̃_t. The moment bound can be ensured by noticing that
$$\mathbb{E}[q^{(1)}_{t,x|x}(X_t)^2 \mid X_0 = x] = \mathbb{E}\!\left[\mathbb{E}\!\left[\int_0^t \zeta^\xi_s \cdot dW_s \,\middle|\, X_t,\ X_0 = x\right]^2 \,\middle|\, X_0 = x\right] \le \mathbb{E}\!\left[\left(\int_0^t \zeta^\xi_s \cdot dW_s\right)^{\!2} \,\middle|\, X_0 = x\right] \le C_{t,2}\int_0^t \mathbb{E}[\|\zeta^\xi_s\|^2 \mid X_0 = x]\, ds \lesssim \frac{C_{t,2}}{t^2}\int_0^t (1+s)^2\, \mathbb{E}[\|\tilde v^\xi_s\|^2 \mid X_0 = x]\, ds \lesssim \frac{C_{t,2}}{t^2}(1+t)^3 N^2 \|\xi\|^2 = O\!\left(\frac{C_{t,2}(1+t)^3}{t^2} N^2\right),$$
where the first inequality is due to Jensen's inequality, the second inequality is by the moment inequality for stochastic integrals (Kim, 2013) with C_{t,q} = (q(q−1)/2)^{q/2} t^{(q−2)/2}, and the last inequality is due to Eq. (21) and Assumption 1. When X_0 ≠ x, Eq. (20) gives E[∥ṽ^ξ_s∥² | X_0 = x'] = O(exp(2pB₂t)), which gives the bound for X_0 ≠ x.

Let L_t ϕ = λ∆ϕ − b_t^⊤ ∇ϕ for ϕ : R^d → R. Then, the derivative of q^{(1)}_{t,x} with respect to t can be evaluated as
$$\frac{d}{dt}\mathbb{E}_{\mu_t}[q^{(1)}_{t,x} f] = \frac{\partial}{\partial t}\frac{\partial}{\partial \epsilon}\mathbb{E}[f(X^\epsilon_t)]\Big|_{\epsilon=0} = \frac{\partial}{\partial \epsilon}\frac{\partial}{\partial t}\mathbb{E}[f(X^\epsilon_t)]\Big|_{\epsilon=0} = \frac{\partial}{\partial \epsilon}\mathbb{E}[(\lambda\Delta - b_t^\top\nabla) f(X^\epsilon_t)]\Big|_{\epsilon=0} = q^{(1)}_{t,x}\bigl((\lambda\Delta - b_t^\top\nabla) f\bigr) - \int \frac{\delta b_t}{\delta\mu}(q^{(1)}_{t,x})\,\nabla f \, d\mu_t.$$
We can also see that
$$\frac{d}{dt}\mathbb{E}_{\mu_{t|x'}}[q^{(1)}_{t,x|x'} f] = q^{(1)}_{t,x|x'}\bigl((\lambda\Delta - b_t^\top\nabla) f\bigr) - \int \frac{\delta b_t}{\delta\mu}(q^{(1)}_{t,x})\,\nabla f \, d\mu_{t|x'}.$$
Note that the second term on the right hand side involves (δb_t/δµ)(q^{(1)}_{t,x}) instead of (δb_t/δµ)(q^{(1)}_{t,x|x'}). We refer readers to Tse (2021) for higher-order regularity of the nonlinear PDEs induced by differentiation with respect to the initial distribution in the torus setting. Therefore, by applying Theorem 4, we obtain the following convergence bound.

Lemma 7. Under Assumption 1, it holds that, for any t > τ with sufficiently small τ > 0,
$$\mathbb{E}[(q^{(1)}_{t,x}(X_t))^2] \le O\bigl(\hat\Lambda_{\mu_0}\exp(-\lambda\alpha(t - T_0)/2)\,\mathbb{E}[(q^{(1)}_{0,x}(X_0))^2]\bigr) = O\bigl(\hat\Lambda_{\mu_0}\exp(-\lambda\alpha(t - T_0)/2)\, N\|\xi\|^2\bigr),$$
where Λ̂_{µ0} = exp(O(W2(µ0, µ*))).

Proof. We apply Theorem 4 in Appendix C.4. We first note that the conditions of Theorem 4 hold for ϵ_t = C exp(−αλt) W2(µ0, µ*) by Corollary 4, and δ_t = 0. Moreover, by Corollary 6, we may take α_t = α/2 for t ≥ T0 + log(Q̂_t)/(λα) and α_t = 0 otherwise. Next, we check the integrability of I(µt ∥ µ̂t) with respect to t. According to Nitanda et al. (2022) and Chizat (2022),
$$\frac{d}{dt}\bigl(\mathcal{L}(\mu_t) - \mathcal{L}(\mu^*)\bigr) \le -\lambda^2 I(\mu_t \,\|\, \hat\mu_t) \le 0,$$
which implies that
$$\int_{t'}^{t} I(\mu_s \,\|\, \hat\mu_s)\, ds \le \frac{1}{\lambda^2}\bigl(\mathcal{L}(\mu_{t'}) - \mathcal{L}(\mu^*) - (\mathcal{L}(\mu_t) - \mathcal{L}(\mu^*))\bigr).$$
Hence, by the Cauchy-Schwarz inequality,
$$\int_{t'}^{t} \sqrt{I(\mu_s \,\|\, \hat\mu_s)}\, ds \le \sqrt{\frac{t-t'}{\lambda^2}\bigl(\mathcal{L}(\mu_{t'}) - \mathcal{L}(\mu_t)\bigr)} \le O\Bigl(\frac{1}{\lambda}\sqrt{t-t'}\,\exp(-\lambda\alpha(t'-\tau_0))\sqrt{\mathcal{L}(\mu_{\tau_0}) - \mathcal{L}(\mu^*)}\Bigr).$$
By taking t = k + 1 + τ0 and t' = k + τ0 for k = 0, 1, . . . and summing over k, we obtain
$$\int_{\tau_0}^\infty \sqrt{I(\mu_s \,\|\, \hat\mu_s)}\, ds \lesssim O\Bigl(\sqrt{\mathcal{L}(\mu_{\tau_0}) - \mathcal{L}(\mu^*)}\Bigr) = O\bigl(1 + \psi_2(\tau_0) W_2(\mu_0, \mu^*)\bigr) = O_p(1),$$
where we omitted the dependence on λ and α in the order symbol.
Therefore, Theorem 4 yields that
$$\mathbb{E}[(q^{(1)}_{t,x}(X_t))^2] \leq O\big(\bar\Lambda_{\mu_0}\exp(-\lambda\alpha(t-T_0)(3/4))\,\mathbb{E}[(q^{(1)}_{\tau,x}(X_\tau))^2]\big)$$
for sufficiently small $\tau > 0$, where $\bar\Lambda_{\mu_0} = \exp(O(W_2(\mu_0,\mu^*)))$. Combining this evaluation with Lemma 6 completes the proof.

However, observe that the bound in Lemma 7 has an $O(N)$ factor on the right-hand side. We can remove that factor by considering $\mathbb{E}[|q^{(1)}_{t,x}(X_t)|]^2$ instead of $\mathbb{E}[q^{(1)}_{t,x}(X_t)^2]$. Recall that $\mu_0 = \mu^N_s = \frac{1}{N}\sum_{i=1}^N\delta_{\bar X^i_s}$. Here we define $\mu_{0\backslash x} := \frac{1}{N-1}\sum_{i=1:\, \bar X^i_s \neq x}^{N}\delta_{\bar X^i_s}$.

Lemma 8. Under Assumption 1, it holds that, for any $t > \tau$ with sufficiently small $\tau > 0$,
$$\mathbb{E}[|q^{(1)}_{t,x}(X_t)|]^2 = O\big(\Lambda_{\mu_0}\exp(-\lambda\alpha(t-T_0)/2)\|\xi\|^2\big),$$
where $\Lambda_{\mu_0} = \exp(O(W_2(\mu_0,\mu^*) + W_2(\mu_{0\backslash x},\mu^*)))$.

The proof can be found in Appendix C.5. We finally remark that, combining Eq. (21) and Lemma 8, we know that for any $\phi \in C_b(\mathbb{R}^d)$ and $t > 0$ it holds that
$$d^{(1)}(t;\mu_0,\xi,x)(\phi) \leq O\big(\Lambda_{\mu_0}\exp(-\lambda\alpha(t-T_0)/2)\|\xi\|\|\phi\|_{\infty,1}\big),$$
where $\|\phi\|_{\infty,1} = \max\{\|\phi\|_\infty, \|\nabla\phi\|_\infty\}$.
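The mechanism behind Lemma 8 is that the expectation is split over the event $I_1 = \{X_0 = x\}$, which has probability $1/N$ under the empirical initial distribution, and the conditional Cauchy–Schwarz inequality is applied on each event, so a conditional second moment of order $O(N)$ on the rare event contributes only $O(1/\sqrt{N})$ to the first absolute moment. A toy Monte Carlo sanity check of this decomposition (the distributions below are hypothetical, chosen only so that the conditional second moment on the rare event is $O(N)$):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 50, 200_000          # "width" N and Monte Carlo sample size (toy values)

# Simulate a random variable q whose conditional second moment on the rare
# event I1 (probability 1/N) is O(N), mimicking the N factor in Lemma 7.
I1 = rng.random(M) < 1.0 / N
g1 = rng.normal(size=M)
g2 = rng.normal(size=M)
q = np.where(I1, np.sqrt(N) * g1, g2)

lhs = np.abs(q).mean()
# Conditional Cauchy-Schwarz on each event:
#   E|q| <= (1/N) sqrt(E[q^2 | I1]) + ((N-1)/N) sqrt(E[q^2 | I1^c]).
rhs = (1 / N) * np.sqrt((q[I1] ** 2).mean()) \
    + ((N - 1) / N) * np.sqrt((q[~I1] ** 2).mean())
print(lhs, rhs)
```

Note that the rare-event contribution to `rhs` is $\frac{1}{N}\sqrt{O(N)} = O(1/\sqrt{N})$, which is why squaring the first-moment bound removes the $N$ factor.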

C.3 SECOND ORDER DIFFERENTIATION

Now we evaluate the second-order derivatives. Let $x_1, x_2 \in \mathbb{R}^d$ be fixed. For $\epsilon = (\epsilon_1,\epsilon_2)$ with $\epsilon_k \geq 0$ and $\xi^{[1]}, \xi^{[2]} \in \mathbb{R}^d$, we note that
\begin{align*}
&\xi^{[1]\top}\nabla_{x_1}\nabla_{x_2}^\top\frac{\delta^2}{\delta\mu^2}U(t,\mu_0)(x_1,x_2)\,\xi^{[2]} \\
&= \xi^{[1]\top}\nabla_{x_1}\nabla_{x_2}^\top\frac{\partial^2}{\partial\epsilon_1\partial\epsilon_2}\Phi\big(m(t,\epsilon_1\delta_{x_1}+\epsilon_2\delta_{x_2}+(1-\epsilon_1-\epsilon_2)\mu_0)\big)\Big|_{\epsilon=(0,0)}\xi^{[2]} \\
&= \xi^{[2]\top}\nabla_{x_2}\frac{\partial}{\partial\epsilon_2}\frac{\delta}{\delta m}\Phi\big(m(t,\epsilon_2\delta_{x_2}+(1-\epsilon_2)\mu_0)\big)\big(d^{(1)}(t;\epsilon_2\delta_{x_2}+(1-\epsilon_2)\mu_0,\xi^{[1]})\big) \\
&= \frac{\delta^2}{\delta m^2}\Phi(m(t,\mu_0))\big(d^{(1)}(t;\mu_0,\xi^{[1]}),\, d^{(1)}(t;\mu_0,\xi^{[2]})\big) + \frac{\delta}{\delta m}\Phi(m(t,\mu_0))\big(d^{(2)}(t;\mu_0,\xi^{[1]},\xi^{[2]},x_1,x_2)\big),
\end{align*}
where
$$d^{(2)}(t;\mu_0,\xi^{[1]},\xi^{[2]},x_1,x_2)(\phi) = \xi^{[1]\top}\nabla_{x_1}\nabla_{x_2}^\top\frac{\partial^2}{\partial\epsilon_1\partial\epsilon_2}m\big(t,\epsilon_1\delta_{x_1}+\epsilon_2\delta_{x_2}+(1-\epsilon_1-\epsilon_2)\mu_0\big)(\phi)\Big|_{\epsilon=(0,0)}\xi^{[2]}.$$
By Corollary 3, $m(t,\epsilon_1\delta_{x_1}+\epsilon_2\delta_{x_2}+(1-\epsilon_1-\epsilon_2)\mu_0)$ admits a density $\mu^{(\epsilon_1,\epsilon_2)}_t$, and we set
$$q^{(2)}_{t,(x_1,x_2)}(x) = \frac{1}{\mu_t(x)}\,\xi^{[1]\top}\nabla\nabla^\top\frac{\partial^2}{\partial\epsilon_1\partial\epsilon_2}\mu^{(\epsilon_1,\epsilon_2)}_t(x)\,\xi^{[2]}.$$
On the other hand, Corollary 4 yields that $\mathcal{L}(\mu_t) - \mathcal{L}(\mu^*) \leq \lambda\psi_2(\tau_0)W_2(\mu_0,\mu^*)^2$. Combining this with the argument in Eq. (23), we see that
$$\int_{\tau_0'}^{\infty}\sqrt{I(\mu_t\|\hat\mu_t)}\,dt \leq O\big(1 + \psi_2(\tau_0')\,W_2(\mu_0,\mu^*)\big).$$
Then, by the same argument as Lemma 7 and Theorem 4-(ii), we obtain the assertion.

C.4 GENERAL CONVERGENCE GUARANTEE

We can see that $q_t = q^{(1)}_{t,x}$ and $q_t = q^{(2)}_t$ satisfy
$$\frac{d}{dt}\mathbb{E}_{\mu_t}[q_t\phi] = \mathbb{E}[q_t\mathcal{L}_t(\phi)] - \int\frac{\delta b_t}{\delta\mu}(q_t)\cdot\nabla\phi\,d\mu_t + r_t(\phi), \qquad (25)$$
where $|r_t(\phi)| \leq C\exp(-c_0\lambda\alpha t)\sqrt{\mathbb{E}_{\mu_t}[\|\nabla\phi\|^2]}$, and with a slight abuse of notation we write $q_t(\phi) := \mathbb{E}_{\mu_t}[q_t\phi]$. We define
$$D_1(t) := \frac{1}{n}\sum_{j=1}^n\mathbb{E}_{\mu_t}[q_t h_j]^2\,\ell''_{j,t}, \qquad D_2(t) := \int q_t^2\,d\mu_t,$$
where $\ell''_{j,t} = \ell''_j(f_{\mu_t})$. In addition, recall that $\mu$ satisfies the Poincaré inequality (PI) with constant $\alpha$ if for all smooth functions $f:\mathbb{R}^d\to\mathbb{R}$ we have $\mathrm{Var}_\mu(f) \leq \alpha^{-1}\mathbb{E}_\mu[\|\nabla f\|^2]$. It is well known that the LSI implies the PI with the same constant. The following theorem provides an upper bound on $D_1$ and $D_2$ under the Poincaré inequality.

Theorem 4. Suppose that Assumption 1 holds, that $|\frac{d}{dt}\ell''_{j,t}| \leq \epsilon_t$, that $q_t$ satisfies Eq. (25) with $r_t(q_t)$ satisfying $|r_t(q_t)| \leq C\delta_t\sqrt{\mathbb{E}_{\mu_t}[\|\nabla q_t\|^2]}$ for a sequence $1/2 \geq \delta_t \geq 0$ and a constant $C \geq 0$, and that $\mu_t$ satisfies the $\alpha_t$-PI for $\alpha_t \geq 0$ ($\alpha_t = 0$ is also allowed in the case that $\mu_t$ does not satisfy the LSI). Then the following bounds hold:

(i) $D_2(t) \leq \exp\big(t\big(\frac{B^4}{\lambda} + \frac{C^2}{2\lambda}\big)\big)\Big(D_2(0) + \frac{C^2/2}{B^4/\lambda + C^2/(2\lambda)}\Big)$.

(ii) $\frac{d}{dt}(D_1(t) + \lambda D_2(t)) \leq -2(1-\delta_t)\lambda\alpha_t(D_1(t)+\lambda D_2(t)) + \big(2B^3\lambda\sqrt{I(\mu_t\|\hat\mu_t)} + B^2\epsilon_t + \frac{2B^2\delta_t}{1-\delta_t}\big)D_2(t) + \frac{C^2}{2}\delta_t$.

In particular, it holds that, for $0 \leq \tau \leq t$,
$$D_1(t)+\lambda D_2(t) \leq \int_\tau^t\frac{C^2}{2}\delta_s\,e^{A_t - A_s}\,ds + e^{A_t - A_\tau}\big(D_1(\tau)+\lambda D_2(\tau)\big),$$
where $A_s = \int_0^s\big(-2(1-\delta_u)\lambda\alpha_u + C_1\big(\lambda\sqrt{I(\mu_u\|\hat\mu_u)} + \epsilon_u + \frac{\delta_u}{1-\delta_u}\big)\big)\,du$ and $C_1 = \max\{2B^3, 1, 2B^4\}/\lambda$.

Proof. By substituting $\phi \leftarrow q_t$ in Eq. (25), it holds that
$$\frac{d}{dt}\int q_t^2\,d\mu_t = 2\mathbb{E}_{\mu_t}[q_t\mathcal{L}_t(q_t)] - \int q_t^2\,\frac{\partial\mu_t}{\partial t}\,dx - 2\int\frac{\delta b_t}{\delta\mu}(q_t)\cdot\nabla q_t\,d\mu_t + 2r_t(q_t).$$
Here, the first two terms on the right-hand side can be evaluated as
$$2\mathbb{E}_{\mu_t}\big[q_t(\lambda\Delta - b_t^\top\nabla)q_t\big] - \int(\lambda\Delta - b_t^\top\nabla)(q_t^2)\,d\mu_t = 2\mathbb{E}_{\mu_t}\big[q_t(\lambda\Delta - b_t^\top\nabla)q_t\big] - 2\mathbb{E}_{\mu_t}\big[\lambda(\Delta q_t)q_t + \lambda\|\nabla q_t\|^2 - b_t^\top(\nabla q_t)q_t\big] = -2\lambda\mathbb{E}_{\mu_t}[\|\nabla q_t\|^2].$$
Part (i).
By the assumption, the Cauchy–Schwarz inequality, and the arithmetic–geometric mean inequality, we can see that
$$-2\int\frac{\delta b_t}{\delta\mu}(q_t)\cdot\nabla q_t\,d\mu_t + 2r_t(q_t) \leq \frac{B^4}{\lambda}\mathbb{E}_{\mu_t}[q_t^2] + \lambda\mathbb{E}_{\mu_t}[\|\nabla q_t\|^2] + \frac{C^2}{\lambda}\delta_t + \lambda\mathbb{E}_{\mu_t}[\|\nabla q_t\|^2] = 2\lambda\mathbb{E}_{\mu_t}[\|\nabla q_t\|^2] + \frac{B^4}{\lambda}\mathbb{E}_{\mu_t}[q_t^2] + \frac{C^2}{\lambda}\delta_t.$$
Therefore,
$$\frac{d}{dt}D_2(t) \leq \frac{B^4}{\lambda}D_2(t) + \frac{C^2}{\lambda}\delta_t.$$
This yields, using $\delta_s \leq 1/2$,
$$D_2(t) \leq \exp\Big(t\frac{B^4}{\lambda} + \frac{C^2}{\lambda}\int_0^t\delta_s\,ds\Big)D_2(0) + \frac{C^2}{\lambda}\int_0^t\delta_s\exp\Big((t-s)\frac{B^4}{\lambda} + \frac{C^2}{\lambda}\int_s^t\delta_\tau\,d\tau\Big)ds \leq \exp\Big(t\Big(\frac{B^4}{\lambda}+\frac{C^2}{2\lambda}\Big)\Big)\Big(D_2(0) + \frac{C^2/2}{B^4/\lambda + C^2/(2\lambda)}\Big).$$
This gives the first inequality.

Part (ii). Next, we evaluate the time derivative of $D_1$:
\begin{align*}
\frac{d}{dt}D_1(t) &= \frac{d}{dt}\frac{1}{n}\sum_{j=1}^n(\mathbb{E}_{\mu_t}[q_t h_j])^2\,\ell''_{j,t} = \frac{1}{n}\sum_{j=1}^n\Big(2\ell''_{j,t}\,\mathbb{E}_{\mu_t}[q_t h_j]\frac{d}{dt}\mathbb{E}_{\mu_t}[q_t h_j] + (\mathbb{E}_{\mu_t}[q_t h_j])^2\frac{d}{dt}\ell''_{j,t}\Big) \\
&\leq \frac{2}{n}\sum_{j=1}^n\mathbb{E}_{\mu_t}[q_t h_j]\,\ell''_{j,t}\Big(\mathbb{E}_{\mu_t}\big[q_t(\lambda\Delta h_j - b_t^\top\nabla h_j)\big] - \int\frac{\delta b_t^\top}{\delta\mu}(q_t)\nabla h_j\,d\mu_t\Big) + B^2\epsilon_t D_2(t) \\
&= -\frac{2\lambda}{n}\sum_{j=1}^n\mathbb{E}_{\mu_t}[q_t h_j]\,\ell''_{j,t}\int\nabla q_t\cdot\nabla h_j\,d\mu_t - \frac{2}{n}\sum_{j=1}^n\mathbb{E}_{\mu_t}[q_t h_j]\,\ell''_{j,t}\int q_t\,(b_t + \lambda\nabla\log\mu_t)^\top\nabla h_j\,d\mu_t \\
&\quad - \frac{2}{n}\sum_{j=1}^n\mathbb{E}_{\mu_t}[q_t h_j]\,\ell''_{j,t}\int\frac{\delta b_t^\top}{\delta\mu}(q_t)\nabla h_j\,d\mu_t + B^2\epsilon_t D_2(t),
\end{align*}
where the last equality is integration by parts. Using $\lambda D_2(t) \leq D_1(t) + \lambda D_2(t)$ and Grönwall's inequality (Mischler, 2019), we arrive at the claimed bound.

Recall that in Lemma 7 we obtained a bound on $\mathbb{E}[(q^{(1)}_{t,x}(X_t))^2]$ that contains a factor of $N$. Now we refine this result by considering the bound $\mathbb{E}[q^{(1)}_{t,x}(X_t)\phi] \leq \mathbb{E}[|q^{(1)}_{t,x}(X_t)|]\,\|\phi\|_\infty$. Define the events $I_1 := \{X_0 = x\}$ and $I_1^c := \{X_0 \neq x\}$. We let $\mu_{t|I_1}$ be the conditional distribution of $X_t$ given $I_1$, and $\mu_{t|I_1^c}$ that given $I_1^c$. For notational simplicity, we write $q_t = q^{(1)}_{t,x}$, $q_{t|I_1} = q^{(1)}_{t,x|x}$, and
$$q_{t|I_1^c}(\cdot) = \sum_{x'\neq x}P(X_0 = x'\mid X_t = \cdot)\,q^{(1)}_{t,x|x'}(\cdot) = \sum_{x'\neq x}\frac{\mu_{t|x'}(\cdot)}{\sum_{x''\neq x}\mu_{t|x''}(\cdot)}\,q^{(1)}_{t,x|x'}(\cdot).$$
Accordingly, we write $q_{t|I_1}(\phi) := \mathbb{E}_{\mu_{t|I_1}}[q_{t|I_1}\phi]$ and $q_{t|I_1^c}(\phi) := \mathbb{E}_{\mu_{t|I_1^c}}[q_{t|I_1^c}\phi]$ for a test function $\phi$. We control $\mathbb{E}[|q^{(1)}_{t,x}(X_t)|]$ by utilizing the following bound:
$$\mathbb{E}[|q^{(1)}_{t,x}(X_t)|] \leq \frac{1}{N}\sqrt{\mathbb{E}[(q^{(1)}_{t,x|I_1}(X_t))^2\mid X_0 = x]} + \frac{N-1}{N}\sqrt{\mathbb{E}[(q^{(1)}_{t,x|I_1^c}(X_t))^2\mid X_0 \neq x]}.$$
We first evaluate $\mathbb{E}[(q^{(1)}_{t,x|I_1}(X_t))^2 \mid X_0 = x]$. Thus, Grönwall's inequality (see also the proof of Theorem 4) yields that
$$\mathbb{E}_{\mu_{t|I_1}}[q^2_{t|I_1}] \leq \int_\tau^t\frac{B^4 D_2(s)}{2\lambda\delta_s}\,e^{A_t - A_s}\,ds + e^{A_t}\,\mathbb{E}_{\mu_{\tau|I_1}}[q^2_{\tau|I_1}],$$
where $A_s = \int_0^s -2(1-\delta_u)\lambda\alpha_u\,du$. Here, we recall that $D_2(t) \lesssim \bar\Lambda_{\mu_0}N\exp(-\alpha\lambda(t-T_0)(3/4))\|\xi\|^2$ by Lemma 7. Hence, if we set $\delta_t = \exp(-\alpha\lambda(t-T_0)/8)$, then $A_s \leq -2\lambda\alpha(s-T_0)_+ + C$ for a constant $C$ which may depend on $\alpha, \lambda, T_0$. This argument and Lemma 6 give
$$\mathbb{E}_{\mu_{t|I_1}}[q^2_{t|I_1}] \lesssim \bar\Lambda_{\mu_0}N\exp(-\alpha\lambda(t-T_0)(5/8))\|\xi\|^2.$$

Next, we show that $\mathbb{E}[\exp(Q_t/2)] = O(1)$ uniformly in $t$, where $Q_t = \frac{1}{N}\sum_{i=1}^N\|\bar X^i_t\|^2$:
$$\frac{d}{dt}\mathbb{E}\Big[\exp\Big(\frac{1}{2N}\sum_{i=1}^N\|\bar X^i_t\|^2\Big)\Big] = \mathbb{E}\Big[\exp(Q_t/2)\Big(-\frac{1}{N}\sum_{i=1}^N b(\bar X^i_t,\mu^N_t)^\top\bar X^i_t + \frac{\lambda}{2N}\sum_{i=1}^N\big(d+\|\bar X^i_t\|^2\big)\Big)\Big] \leq \mathbb{E}\Big[\exp(Q_t/2)\,\frac{1}{N}\sum_{i=1}^N\Big(B^2\|\bar X^i_t\| - (2+\delta)\lambda c_r\|\bar X^i_t\|^{2+\delta} + \frac{\lambda}{2}\big(d+\|\bar X^i_t\|^2\big)\Big)\Big].$$
Then, by Young's inequality, there exists a constant $C'$ depending on $B, \lambda, \delta, c_r, d$ such that
$$B^2\|\bar X^i_t\| - (2+\delta)\lambda c_r\|\bar X^i_t\|^{2+\delta} + \frac{\lambda}{2}\big(d+\|\bar X^i_t\|^2\big) \leq C' - \frac{(2+\delta)\lambda c_r}{2}\|\bar X^i_t\|^{2+\delta}.$$
Moreover, by Jensen's inequality, the term related to $\|\bar X^i_t\|^{2+\delta}$ can be further bounded by
$$-\frac{1}{N}\sum_{i=1}^N\|\bar X^i_t\|^{2+\delta} \leq -\Big(\frac{1}{N}\sum_{i=1}^N\|\bar X^i_t\|^2\Big)^{(2+\delta)/2} = -Q_t^{(2+\delta)/2}.$$
In summary, we arrive at
$$\frac{d}{dt}\mathbb{E}[\exp(Q_t/2)] \leq \mathbb{E}\big[\exp(Q_t/2)\big(C' - \epsilon Q_t^{(2+\delta)/2}\big)\big]$$
with another constant $\epsilon = \frac{(2+\delta)\lambda c_r}{2}$. Here, by noticing that $C' - \epsilon Q_t^{(2+\delta)/2} \leq 2C'\,\mathbf{1}[Q_t \leq (2C'/\epsilon)^{2/(2+\delta)}] - C'$, it holds that
$$\frac{d}{dt}\mathbb{E}[\exp(Q_t/2)] \leq \mathbb{E}\big[\exp(Q_t/2)\big(2C'\,\mathbf{1}[Q_t \leq (2C'/\epsilon)^{2/(2+\delta)}] - C'\big)\big] \leq 2C'\exp\big[\tfrac{1}{2}(2C'/\epsilon)^{2/(2+\delta)}\big] - C'\,\mathbb{E}[\exp(Q_t/2)].$$
This means that
$$\mathbb{E}[\exp(Q_t/2)] \leq \max\big\{2\exp\big[\tfrac{1}{2}(2C'/\epsilon)^{2/(2+\delta)}\big],\ \mathbb{E}[\exp(Q_0/2)]\big\} = O(1).$$
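The uniform-in-time moment bound $\mathbb{E}[\exp(Q_t/2)] = O(1)$ can be illustrated by simulating a Langevin dynamics whose drift keeps only the super-quadratic regularizer $r(x) = c_r\|x\|^{2+\delta}$, the term that drives the bound (the data-dependent part of $b$ is bounded and is absorbed into $C'$ above). A minimal sketch with arbitrary illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

# Langevin dynamics with confining regularizer r(x) = c_r |x|^(2+delta):
#   dX = -lam * grad r(X) dt + sqrt(2 lam) dW,
# simulated for N particles in 1-d by Euler-Maruyama.
lam, c_r, delta = 0.5, 1.0, 1.0
N, dt, steps = 200, 0.01, 5000

X = np.zeros(N)                      # bounded initial support (the origin)
max_moment = 0.0
for _ in range(steps):
    drift = -lam * (2 + delta) * c_r * np.abs(X) ** delta * X
    X = X + dt * drift + np.sqrt(2 * lam * dt) * rng.normal(size=N)
    Q = np.mean(X ** 2)              # Q_t = (1/N) sum_i |X_i|^2
    max_moment = max(max_moment, np.exp(Q / 2))
print(max_moment)
```

The super-quadratic confinement keeps `Q` concentrated near its stationary value, so `exp(Q/2)` stays bounded over the whole (long) horizon rather than growing with time.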



This condition can be easily relaxed to $c_r\|x\|^{2+\delta} \leq r(x) \leq C_r(1 + \|x\|^{2+\delta'})$ with $\delta \neq \delta' > 0$; here we consider $\delta = \delta'$ just for simplicity of presentation.



Figure 1: Training error of two-layer NNs optimized by NPGD. Solid curve: mean over 100 runs. Translucent curve: individual runs. Dashed black line: global optimum approximated by the PDA algorithm (Nitanda et al., 2021).

(i) Neural network function value: $\Phi(\mu) = \int h_z(x)\,d\mu(x)$ with a fixed $z \in \mathcal{Z}$. (ii) Training and test error: $\Phi(\mu) = \mathbb{E}_{(Z,Y)\sim P}[\ell(f_\mu(Z), Y)]$, where $P$ is a distribution on $\mathcal{Z}\times\mathcal{Y}$.
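As a concrete reference for the quantities plotted above, the following is a minimal sketch of noisy gradient descent on a finite-width mean-field network, i.e. the Euler–Maruyama discretization of the particle system, tracking the training error. This is not the paper's exact NPGD setup: the data, width, step size, and regularization constants below are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (hypothetical): n samples in R^2.
n, d_in, N = 64, 2, 200                  # samples, input dim, network width
Z = rng.normal(size=(n, d_in))
y = np.tanh(Z @ np.array([1.0, -1.0]))

# Particles x_i = (w_i, b_i); each neuron computes h_z(x) = tanh(w^T z + b).
W = rng.normal(size=(N, d_in))
b = rng.normal(size=N)

lam, lam_r, eta, steps = 1e-3, 1e-2, 0.5, 500  # temperature, L2 reg, step size

def f_mf(W, b, Z):
    # Mean-field network: f_X(z) = (1/N) sum_i tanh(w_i^T z + b_i).
    return np.tanh(Z @ W.T + b).mean(axis=1)

err0 = np.mean((f_mf(W, b, Z) - y) ** 2)
for _ in range(steps):
    act = np.tanh(Z @ W.T + b)           # (n, N) activations
    resid = act.mean(axis=1) - y         # (n,) residuals
    # Per-particle gradient of the regularized squared loss; the mean-field
    # scaling cancels the 1/N prefactor in front of each neuron.
    g = (resid[:, None] * (1.0 - act ** 2)) / n
    gW = g.T @ Z + lam_r * W
    gb = g.sum(axis=0) + lam_r * b
    # Noisy gradient descent = Euler-Maruyama step of the mean-field
    # Langevin dynamics, with noise scale sqrt(2 * lam * eta).
    W += -eta * gW + np.sqrt(2 * lam * eta) * rng.normal(size=W.shape)
    b += -eta * gb + np.sqrt(2 * lam * eta) * rng.normal(size=b.shape)

train_err = np.mean((f_mf(W, b, Z) - y) ** 2)
print(err0, train_err)
```

The training error of this finite-particle system is exactly the functional $\Phi(\mu^N_t)$ in item (ii), evaluated along the discretized dynamics.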

Figure 2: Test error of NNs optimized by NPGD (r(x) = ∥x∥ 2 ). Solid curve: mean over 100 runs. Translucent curve: individual runs.


), efficient optimization algorithms have been proposed in Nitanda et al. (2021); Oko et al. (2022); Nishikawa et al.; notably, the quantitative convergence rate guarantees for these particle-based methods remain valid in both finite-width and discrete-time settings. If we restrict ourselves to standard gradient descent-based methods, then fluctuations around the mean-field limit have been studied in Rotskoff and Vanden-Eijnden (2018); Sirignano and Spiliopoulos (2020); Pham and Nguyen (2021). Closely related to our work are the quantitative propagation of chaos results for two-layer neural networks from De Bortoli et al. (2020); Chen et al. (2020c). In particular, De Bortoli et al. (2020) studied the impact of the learning rate in the stochastic gradient descent update, but did not provide a convergence rate or uniform control of the discretization error over time (due to the lack of regularization). Chen et al. (2020c) showed that the long-time fluctuation induced by finite width can be controlled assuming that the mean-field dynamics converges at a specific rate, but such a condition is very challenging to establish in their setting. In contrast, our result provides a uniform-in-time bound on the finite-width discretization error under conditions that can be verified for regularized risk minimization problems.

A.2 INTERACTING PARTICLE SYSTEMS AND PROPAGATION OF CHAOS

Propagation of chaos has been analyzed mainly in the context of McKean–Vlasov equations whose drift term has the form $b(x,\mu) = \nabla V(x) - \nabla\int W(\cdot,y)\,d\mu(y)\big|_x$. Generally, neural network optimization is not included in this class, but techniques to analyze such equations can be applied to the neural network setting. Many existing works analyze the discretization error over a bounded time horizon (see Lacker (2021) and the references therein). That is, for a fixed time horizon $T$, it has been shown that $\sup_{t\in[0,T]}|\Psi(\mu^N_t) - \Psi(\mu_t)| \leq C_T/N$. However, the constant $C_T$ depends on $T$, and the dependence is often exponential.

$$D_1(t) + \lambda D_2(t) \leq \int_0^t\frac{C^2}{2}\delta_s\,e^{A_t - A_s}\,ds + e^{A_t}\big(D_1(0) + \lambda D_2(0)\big),$$
where $A_s = \int_0^s\big(-2(1-\delta_u)\lambda\alpha_u + C_1\big(\lambda\sqrt{I(\mu_u\|\hat\mu_u)} + \epsilon_u + \frac{\delta_u}{1-\delta_u}\big)\big)\,du$ for a constant $C_1 = \max\{2B^3, 1, 2B^4\}/\lambda$.

C.5 PROOF OF LEMMA 8


For the second term on the right-hand side, since $\nabla\log\hat\mu_t = -b_t/\lambda$, the Cauchy–Schwarz inequality together with $\ell''_{j,t}\|h_j\|_\infty\|\nabla h_j\|_\infty \leq B^3$ and $\int\|\nabla\log\hat\mu_t - \nabla\log\mu_t\|^2\,d\mu_t = I(\mu_t\|\hat\mu_t)$ gives
$$\Big|\frac{2}{n}\sum_{j=1}^n\mathbb{E}_{\mu_t}[q_t h_j]\,\ell''_{j,t}\int q_t\,(b_t + \lambda\nabla\log\mu_t)^\top\nabla h_j\,d\mu_t\Big| \leq 2B^3\lambda\sqrt{I(\mu_t\|\hat\mu_t)}\,D_2(t).$$
For the first term, note that
$$\frac{1}{n}\sum_{j=1}^n\ell''_{j,t}\,\mathbb{E}_{\mu_t}[q_t h_j]\int\nabla q_t\cdot\nabla h_j\,d\mu_t = \int\nabla q_t\cdot\frac{\delta b_t}{\delta\mu}(q_t)\,d\mu_t.$$
Collecting terms and adding $\lambda\frac{d}{dt}D_2(t)$, we obtain
$$\frac{d}{dt}\big(D_1(t)+\lambda D_2(t)\big) \leq -4\lambda\int\nabla q_t\cdot\frac{\delta b_t}{\delta\mu}(q_t)\,d\mu_t - 2\lambda^2\int\|\nabla q_t\|^2\,d\mu_t + 2B^3\lambda\sqrt{I(\mu_t\|\hat\mu_t)}\,D_2(t) + B^2\epsilon_t D_2(t) + 2\lambda r_t(q_t).$$
Here, $2\lambda r_t(q_t) \leq 2\delta_t\lambda^2\int\|\nabla q_t\|^2\,d\mu_t + \frac{C^2}{2}\delta_t$, and the first term can be expressed through the Gram matrix $(Q_t)_{i,j} := \int\nabla h_i\cdot\nabla h_j\,d\mu_t$ acting on the vector $(\ell''_{j,t}\mathbb{E}_{\mu_t}[q_t h_j])_{j=1}^n$; completing the square in $\nabla q_t$ and bounding the cross terms by $\frac{2B^2\delta_t}{1-\delta_t}D_2(t)$ leaves $-2(1-\delta_t)\lambda^2\int\|\nabla q_t\|^2\,d\mu_t$ up to the error terms above. When $\mu_t$ satisfies the $\alpha_t$-PI (which is implied by the $\alpha_t$-LSI), since $q_t$ has mean zero under $\mu_t$, it holds that
$$-2(1-\delta_t)\lambda^2\int\|\nabla q_t\|^2\,d\mu_t \leq -(1-\delta_t)\,2\alpha_t\lambda^2\int q_t^2\,d\mu_t,$$
and similarly for the $D_1$ part. Altogether,
$$\frac{d}{dt}\big(D_1(t)+\lambda D_2(t)\big) \leq -2(1-\delta_t)\lambda\alpha_t\big(D_1(t)+\lambda D_2(t)\big) + \Big(2B^3\lambda\sqrt{I(\mu_t\|\hat\mu_t)} + B^2\epsilon_t + \frac{2B^2\delta_t}{1-\delta_t}\Big)D_2(t) + \frac{C^2}{2}\delta_t.$$

If we let $D_2(t) = \mathbb{E}_{\mu_t}[q_t^2]$, then it holds that, for a positive sequence $\delta_t > 0$,
\begin{align*}
\frac{d}{dt}\mathbb{E}_{\mu_{t|I_1}}[q^2_{t|I_1}] &= -2\lambda\mathbb{E}_{\mu_{t|I_1}}[\|\nabla q_{t|I_1}\|^2] - 2\int\frac{\delta b_t}{\delta\mu}(q_t)\cdot\nabla q_{t|I_1}\,d\mu_{t|I_1} \\
&\leq -2\lambda\mathbb{E}_{\mu_{t|I_1}}[\|\nabla q_{t|I_1}\|^2] + 2B^2\sqrt{D_2(t)}\sqrt{\mathbb{E}_{\mu_{t|I_1}}[\|\nabla q_{t|I_1}\|^2]} \\
&\leq -2\lambda(1-\delta_t)\mathbb{E}_{\mu_{t|I_1}}[\|\nabla q_{t|I_1}\|^2] + \frac{B^4 D_2(t)}{2\lambda\delta_t} \\
&\leq -2\lambda\alpha_t(1-\delta_t)\mathbb{E}_{\mu_{t|I_1}}[q^2_{t|I_1}] + \frac{B^4 D_2(t)}{2\lambda\delta_t},
\end{align*}
where the last step uses the $\alpha_t$-PI.
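The Grönwall step that turns this differential inequality into the decay bound used above can be sanity-checked numerically: if $\frac{d}{dt}y \leq -a\,y + c$ with $a, c > 0$, then $y(t) \leq e^{-at}y(0) + c/a$. A minimal check with arbitrary constants ($a$ playing the role of $2\lambda\alpha_t(1-\delta_t)$ and $c$ of $B^4 D_2/(2\lambda\delta_t)$):

```python
import numpy as np

# Gronwall sanity check: integrate the worst case dy/dt = -a*y + c by forward
# Euler and verify it never exceeds the closed-form bound exp(-a*t)*y0 + c/a.
a, c, y0 = 2.0, 0.5, 1.0     # illustrative constants (hypothetical)
dt, T = 1e-4, 5.0

y, t, ok = y0, 0.0, True
while t < T:
    y += dt * (-a * y + c)   # equality case of the differential inequality
    t += dt
    if y > np.exp(-a * t) * y0 + c / a:
        ok = False
print(ok)
```

Since forward Euler under-approximates the exponential relaxation toward the equilibrium $c/a$ (here $y_0 > c/a$), the trajectory stays below the bound at every step.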

$\lesssim \bar\Lambda_{\mu_0}N\exp(-\alpha\lambda(t-T_0)(5/8))\|\xi\|^2$. Hence, by applying Lemmas 7 and 9 with $\mu_0 = \mu^N_s$ and $\xi = \xi^{[1]} = \xi^{[2]} = e_i$, we obtain the corresponding exponential decay of
$$\int\partial_{(x_1)_i}\partial_{(x_2)_i}\frac{\delta^2 U}{\delta\mu^2}(t-s,\mu^N_s)(x,x)\,\mu^N_s(dx).$$
Now we only need to evaluate the term $\mathbb{E}[\Lambda_{\mu^N_s}]$. Suppose $C$ is a constant such that $\Lambda_{\mu^N_s} \leq \exp\big(C\big(W_2(\mu^N_s,\mu^*) + W_2(\mu^N_{s\backslash x},\mu^*)\big)\big)$. Then, we can verify
$$\Lambda_{\mu^N_s} \leq \exp\Big(3C\Big(\sqrt{\mathbb{E}_{\mu^N_s}[\|X\|^2]} + \sqrt{\mathbb{E}_{\mu^*}[\|X\|^2]}\Big)\Big).$$
Since $\mu^*(x) \lesssim \exp(-\lambda c_r\|x\|^{2+\delta})$, we have $\mathbb{E}_{\mu^*}[\|X\|^2] = O(1)$, and thus we need to evaluate $\mathbb{E}\big[\exp\big(3C\sqrt{\mathbb{E}_{\mu^N_s}[\|X\|^2]}\big)\big]$. This can be upper bounded by
$$\mathbb{E}\big[\exp\big(\tfrac{9C^2}{2} + \tfrac{1}{2}\mathbb{E}_{\mu^N_s}[\|X\|^2]\big)\big] = \exp(9C^2/2)\,\mathbb{E}[\exp(Q_s/2)],$$
using $3C\sqrt{a} \leq \frac{9C^2}{2} + \frac{a}{2}$, which is $O(1)$ by the moment bound above.

ACKNOWLEDGMENT

TS was partially supported by JSPS KAKENHI (20H00576) and JST CREST. AN was partially supported by JSPS KAKENHI (22H03650) and JST PRESTO (JPMJPR1928). DW was partially supported by a Borealis AI Fellowship.


then we can notice that $d^{(2)}(t;\mu_0,\xi^{[1]},\xi^{[2]},x_1,x_2)(\phi) = \mathbb{E}_{\mu_t}[q^{(2)}_{t,(x_1,x_2)}\phi]$. Here, we denote $d^{(1)}_{t,[k]}(\phi) = d^{(1)}(t;\mu_0,\xi^{[k]},x_k)(\phi) = \xi^{[k]\top}\nabla_x\frac{\partial}{\partial\epsilon}m(t,\epsilon\delta_{x_k}+(1-\epsilon)\mu_0)(\phi)\big|_{\epsilon=0}$. Then, by taking the derivative of $\frac{d}{dt}d^{(1)}(t;\cdot)$ with respect to $\epsilon_2$, we see that $q^{(2)}$ (and $d^{(2)}$) follows the dynamics
$$\frac{d}{dt}\mathbb{E}_{\mu_t}[q^{(2)}_{t,(x_1,x_2)}\phi] = \mathbb{E}[q^{(2)}_{t,(x_1,x_2)}\mathcal{L}_t(\phi)] - \int\frac{\delta b_t}{\delta\mu}\big(d^{(2)}(t;\mu_0,\xi^{[1]},\xi^{[2]},x_1,x_2)\big)\cdot\nabla\phi\,d\mu_t - d^{(1)}_{t,[1]}\Big(\frac{\delta b_t}{\delta\mu}\big(d^{(1)}_{t,[2]}\big)\cdot\nabla\phi\Big),$$
and an analogous equation holds for the coordinate-wise versions $d^{(1)}_{t,[2],i}$. By Eq. (24), and in the same vein, the residual term here satisfies the condition on $r_t$ in Theorem 4. Then, applying Theorem 4, we obtain the following lemma.

Lemma 9. Under Assumption 1, the same exponential-in-$t$ decay as in Lemma 7 holds for the second-order term $\mathbb{E}[(q^{(2)}_{t,(x_1,x_2)}(X_t))^2]$.

Proof. From the argument above, the assumption of Theorem 4 is guaranteed with $\delta_t = \exp(-\lambda t/2)$. The other conditions are also satisfied as in the proof of Lemma 7. Hence, for an arbitrarily small positive time $\tau_0' > 0$, Theorem 4-(i) gives a bound on $D_2$ for $q^{(2)}$, and by the same reasoning as the proof of Theorem 4 we obtain the stated exponential decay, where we used Eq. (27) in the last inequality. Then, by noticing that the same argument as Corollary 6 holds for the conditional distribution $\mu_{t|I_1^c}$, since Eq. (17) holds uniformly for any $x \in \{\bar X^i_s\}_{i=1}^N$, we may apply the same reasoning as the proof of Theorem 4 to obtain the analogous bound on $\mathbb{E}_{\mu_{t|I_1^c}}[q^2_{t|I_1^c}]$ for sufficiently small $\tau > 0$, where we used Lemma 6 in the last inequality. Finally, by combining inequalities (27) and (28) with Eq. (26), we obtain the assertion.

C.6 COMBINING ALL BOUNDS TOGETHER

Recall that $\mathbb{E}[\exp(Q_t/2)] \leq \max\big\{2\exp\big[\tfrac{1}{2}(2C'/\epsilon)^{2/(2+\delta)}\big],\ \mathbb{E}[\exp(Q_0/2)]\big\} = O(1)$. One side remark is that we can further show an exponential decay of the term related to $\mathbb{E}[\exp(Q_0/2)]$, but the above is sufficient for our purpose. Finally, by selecting the initial distribution so that $\mathbb{E}[\Lambda_{\mu^N_0}] = O(1)$, the right-hand side is $O(1)$. This is satisfied if, for example, the support of $\mu_0$ is bounded.

D AUXILIARY LEMMAS

Lemma 10. Suppose that $\Psi:\mathcal{P}\to\mathbb{R}$ has a smooth first variation such that $\|\nabla\frac{\delta\Psi}{\delta\mu}\|_\infty \leq C$. Then, for any $\nu_0, \nu_1 \in \mathcal{P}_2$, it holds that
$$|\Psi(\nu_1) - \Psi(\nu_0)| \leq C\,W_2(\nu_0,\nu_1).$$

Proof. By the Benamou–Brenier formula, for any $\nu_0, \nu_1 \in \mathcal{P}_2$, it holds that
$$W_2(\nu_0,\nu_1)^2 = \inf\int_0^1\int\|v_t\|^2\,d\nu_t\,dt,$$
where the infimum is taken over all curves $(\nu_t)_{t\in[0,1]}$ continuous with respect to the weak topology and all $v_t$ satisfying $\partial_t\nu_t + \nabla\cdot(\nu_t v_t) = 0$ ($t\in[0,1]$) (Ambrosio et al., 2005, Chapter 8). For any such $(\nu_t, v_t)$, we note that $\frac{d\Psi(\nu_t)}{dt} = \int v_t^\top\nabla\frac{\delta\Psi}{\delta\mu}(\nu_t)\,d\nu_t$, and thus
$$|\Psi(\nu_1) - \Psi(\nu_0)| \leq \int_0^1\Big|\int v_t^\top\nabla\frac{\delta\Psi}{\delta\mu}(\nu_t)\,d\nu_t\Big|\,dt \leq C\int_0^1\sqrt{\int\|v_t\|^2\,d\nu_t}\,dt \leq C\sqrt{\int_0^1\int\|v_t\|^2\,d\nu_t\,dt}.$$
By taking the infimum with respect to $v_t$, we obtain the inequality.

$f_{X_t}$ in the informal theorem stated in Section 1 formally corresponds to $f_{\mu^N_t}$ in the main text. Then, we have the following corollary as a direct consequence of Theorem 2.

Corollary 7. Under the same setting as Theorem 2, the analogous uniform-in-time guarantee holds for $f_{\mu^N_t}$, where the expectation is taken over the dynamics of $\mu^N_t$ with a fixed initial condition $\mu^N_0$.

Proof. By conditioning on the initial distribution $\mu^N_0$, the evolution of $\mu_t$ is deterministic. Hence, for fixed $z \in \mathbb{R}^{d'}$, it holds that
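Lemma 10 can be sanity-checked for the linear functional $\Psi(\nu) = \mathbb{E}_\nu[f]$ with a 1-Lipschitz $f$, for which the first variation is $f$ itself and $C = \sup|f'| \leq 1$; for one-dimensional Gaussians, $W_2$ has a closed form. A Monte Carlo check (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Psi(nu) = E_nu[f] with f = sin (1-Lipschitz), so C = 1. For 1-d Gaussians,
# W2(N(m0, s0^2), N(m1, s1^2))^2 = (m0 - m1)^2 + (s0 - s1)^2.
f = np.sin
m0, s0, m1, s1 = 0.0, 1.0, 1.5, 0.5
M = 1_000_000
x0 = rng.normal(m0, s0, M)
x1 = rng.normal(m1, s1, M)

gap = abs(f(x1).mean() - f(x0).mean())          # |Psi(nu1) - Psi(nu0)|
w2 = np.sqrt((m0 - m1) ** 2 + (s0 - s1) ** 2)   # exact W2 between the Gaussians
print(gap, w2)
```

The observed gap is strictly below $C \cdot W_2$, as the lemma predicts; the slack reflects that $\sin$ does not saturate its Lipschitz constant along the transport path.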

