UNIFORM-IN-TIME PROPAGATION OF CHAOS FOR THE MEAN-FIELD GRADIENT LANGEVIN DYNAMICS

Abstract

The mean-field Langevin dynamics is characterized by a stochastic differential equation that arises from (noisy) gradient descent on an infinite-width two-layer neural network, which can be viewed as an interacting particle system. In this work, we establish a quantitative weak propagation of chaos result for the system, with a finite-particle discretization error of O(1/N) uniformly over time, where N is the width of the neural network. This allows us to directly transfer the learning guarantee for infinite-width networks to practical finite-width models without excessive overparameterization. On the technical side, our analysis differs from most existing studies on similar mean-field dynamics in that we do not require the interaction between particles to be sufficiently weak to obtain a uniform propagation of chaos, because such assumptions may not be satisfied in neural network optimization. Instead, we make use of a logarithmic Sobolev-type condition which can be verified in appropriate regularized risk minimization settings.

1. INTRODUCTION

Mean-field neural networks. We consider the optimization of a two-layer neural network in the mean-field regime, which is represented as an average over N neurons:

f_X(z) = (1/N) ∑_{i=1}^N h_z(x_i),   (1)

where given the input z ∈ R^{d'}, each neuron computes a nonlinear transformation based on trainable parameters x ∈ R^d; for example, we may set h_z(x) = tanh(w^⊤ z + b) for x = (w, b) ∈ R^{d'+1}. Importantly, the mean-field parameterization allows the parameters to move away from initialization during gradient descent and hence learn informative features (Yang and Hu, 2020) even when the network width is large (N → ∞), in contrast to the Neural Tangent Kernel (NTK) parameterization (Jacot et al., 2018) (corresponding to a 1/√N prefactor), which freezes the model at initialization under overparameterization. This feature learning ability enables mean-field neural networks to outperform the NTK counterpart (or linear estimators in general) in learning a wide range of target functions (Ghorbani et al., 2019; Li et al., 2020; Abbe et al., 2022; Ba et al., 2022).

Optimization guarantees for mean-field neural networks are typically obtained by lifting the finite-width model to the infinite-dimensional space of parameter distributions and then exploiting convexity of the objective function. Using this viewpoint, convergence of gradient flow on infinite-width neural networks to the global optimal solution can be shown under appropriate conditions (Nitanda and Suzuki, 2017; Chizat and Bach, 2018; Mei et al., 2018; Rotskoff and Vanden-Eijnden, 2018; Sirignano and Spiliopoulos, 2020). However, most existing results are qualitative in nature, in that they do not characterize the rate of convergence or the finite-particle discretization error.

Mean-field Langevin dynamics. An often-studied optimization method for mean-field neural networks is the noisy particle gradient descent (NPGD) algorithm (Mei et al., 2018; Hu et al., 2019; Chen et al., 2020b), where Gaussian noise is injected into the gradient to encourage "exploration" and enable global optimality to be shown under milder conditions than the noiseless case. The large-particle and vanishing step size limit is termed the mean-field Langevin dynamics (Hu et al., 2019), which globally minimizes an entropy-regularized convex functional in the space of measures. Recently, Nitanda et al. (2022); Chizat (2022) established exponential convergence for the mean-field Langevin dynamics under certain logarithmic Sobolev inequalities, which can be easily verified in regularized risk minimization problems using two-layer neural networks (1).
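The NPGD update can be sketched as follows. This is a minimal NumPy sketch under our own illustrative assumptions (tanh neurons, squared loss, an L2 regularizer with coefficient lam, and Langevin noise at temperature matched to lam); the paper's exact regularizer and noise scaling may differ.

```python
import numpy as np

def npgd_step(X, Z, y, lam, eta, rng):
    """One step of noisy particle gradient descent (NPGD) on the
    mean-field network f_X(z) = (1/N) sum_i tanh(w_i . z + b_i).

    X: (N, d+1) particles, each row (w, b); Z: (n, d) inputs; y: (n,) targets.
    lam is used both as the L2 penalty and the noise temperature
    (an illustrative choice, not necessarily the paper's scaling).
    """
    N = X.shape[0]
    W, b = X[:, :-1], X[:, -1]                   # (N, d), (N,)
    acts = np.tanh(Z @ W.T + b)                  # (n, N) neuron outputs
    resid = acts.mean(axis=1) - y                # (n,) prediction residuals
    # gradient of the empirical squared loss w.r.t. each particle
    d_act = (1.0 - acts**2) * resid[:, None] / N # (n, N) backprop through tanh
    grad_W = d_act.T @ Z / len(y)                # (N, d)
    grad_b = d_act.mean(axis=0)                  # (N,)
    grad = np.hstack([grad_W, grad_b[:, None]]) + lam * X  # add L2 term
    noise = rng.standard_normal(X.shape)
    return X - eta * grad + np.sqrt(2.0 * eta * lam) * noise

# toy run: 50 particles in R^4 fitting 20 random points (hypothetical data)
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
Z, y = rng.standard_normal((20, 3)), rng.standard_normal(20)
for _ in range(100):
    X = npgd_step(X, Z, y, lam=0.1, eta=0.05, rng=rng)
```

Taking N → ∞ and eta → 0 in this update yields the mean-field Langevin dynamics discussed above.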
This represents a significant step towards a quantitative optimization analysis of neural networks in the presence of feature learning, yet the limitation is also clear: these results are obtained by directly analyzing the large particle limit (i.e., the limiting McKean-Vlasov stochastic differential equation), and cannot be easily transferred to practical finite-width networks. In fact, naively applying the quantitative results in Mei et al. (2018; 2019) leads to discretization error bounds that blow up exponentially in time, rendering the guarantee vacuous beyond the very early stages of gradient descent learning. Therefore, for the purpose of characterizing the optimization behavior of finite-width neural networks, it is important to derive a finite-particle discretization error bound that holds uniformly over time, that is, the error remains stable even when t is large.

In this paper, we establish finite-particle guarantees for the mean-field Langevin dynamics via a propagation of chaos calculation (Sznitman, 1991), which controls the weak error between the empirical distribution of the interacting particle system and the corresponding infinite-particle limit along the optimization trajectory. This allows us to bound the difference in function value between the finite-width neural network optimized by NPGD and its infinite-width counterpart. In particular, starting from N particles initialized i.i.d. as X_0^i ∼ μ_0, if we denote the finite-particle model at time t of optimization by f_{X_t}, and its corresponding infinite-particle limit by f_{μ_t}, then our propagation of chaos result is the following.

Theorem (informal). Under suitable regularity conditions, E[(f_{X_t}(z) - f_{μ_t}(z))^2] = O(1/N) for any t > 0 and z ∈ R^{d'}.

We make the following remarks on the main theorem:

• To our knowledge, we provide the first rigorous uniform-in-time propagation of chaos result in the context of mean-field neural networks. This is in contrast to prior works where the discretization error typically increases as optimization proceeds (e.g., |f_{X_t} - f_{μ_t}| = O(exp(t) · N^{-1/2}) as in Mei et al. (2018, Theorem 3)). The theorem implies that as the width N becomes larger, the difference between the finite-width and infinite-width model outputs diminishes rapidly, as shown in Figure 1.

• Our analysis assumes a modified log-Sobolev condition which is satisfied in regularized risk minimization problems using neural networks when the convex regularizer on the parameters has a super-quadratic tail. Notably, we do not impose any constraint on the strength of regularization and interaction; this differs from many existing results where uniform propagation of chaos is only achieved under weak interaction or large noise (Eberle et al., 2019; Delarue and Tse, 2021).

Due to the space constraint, we defer discussions on additional related works to Appendix A.
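At initialization the O(1/N) rate is elementary: with X_0^i i.i.d. ∼ μ_0, the output f_{X_0}(z) is a sample mean of h_z(x_i), so its variance is Var_{μ_0}(h_z(x))/N. The following Monte Carlo sanity check illustrates this t = 0 case only; the Gaussian μ_0, tanh neuron, and the very wide network used as a proxy for f_μ are all our own illustrative choices, not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
z = rng.standard_normal(d)  # a fixed test input

def f_X(X, z):
    """Mean-field network output: average of neurons h_z(x) = tanh(w.z + b)."""
    W, b = X[:, :-1], X[:, -1]
    return np.tanh(W @ z + b).mean()

# Proxy for the infinite-width output f_mu(z): a very wide network (N = 10^6)
# drawn from the same initialization mu_0 = N(0, I).
X_big = rng.standard_normal((10**6, d + 1))
f_mu = f_X(X_big, z)

# Monte Carlo estimate of the weak error E[(f_X(z) - f_mu(z))^2] at t = 0
errors = {}
for N in (100, 400, 1600):
    sq = [(f_X(rng.standard_normal((N, d + 1)), z) - f_mu) ** 2
          for _ in range(2000)]
    errors[N] = float(np.mean(sq))

# quadrupling N should roughly quarter the weak error, so N * error
# should stay roughly constant across widths (the O(1/N) rate)
print({N: round(e * N, 4) for N, e in errors.items()})
```

The substance of the theorem, of course, is that this 1/N scaling survives for all t > 0 along the NPGD trajectory, where the particles are no longer independent.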

2. PRELIMINARIES

In this section, we formulate the problem setting and introduce some useful notation for the following sections. We optimize a two-layer neural network by minimizing the empirical or expected risk in a supervised learning setting, where the input lies in a set Z ⊂ R^{d'} and the output lies in a bounded set Y ⊂ R. As defined in the Introduction, h_{(·)}(x): z ∈ Z → h_z(x) ∈ Y represents



Figure 1: Training error of two-layer NNs optimized by NPGD. Solid curve: mean over 100 runs. Translucent curve: individual runs. Dashed black line: global optimum approximated by the PDA algorithm (Nitanda et al., 2021).

