UNIFORM-IN-TIME PROPAGATION OF CHAOS FOR THE MEAN-FIELD GRADIENT LANGEVIN DYNAMICS

Abstract

The mean-field Langevin dynamics is characterized by a stochastic differential equation that arises from (noisy) gradient descent on an infinite-width two-layer neural network, which can be viewed as an interacting particle system. In this work, we establish a quantitative weak propagation of chaos result for the system, with a finite-particle discretization error of $O(1/N)$ uniformly over time, where $N$ is the width of the neural network. This allows us to directly transfer the learning guarantee for infinite-width networks to practical finite-width models without excessive overparameterization. On the technical side, our analysis differs from most existing studies on similar mean-field dynamics in that we do not require the interaction between particles to be sufficiently weak to obtain a uniform propagation of chaos, because such assumptions may not be satisfied in neural network optimization. Instead, we make use of a logarithmic Sobolev-type condition which can be verified in appropriate regularized risk minimization settings.

1. INTRODUCTION

Mean-field neural networks. We consider the optimization of a two-layer neural network in the mean-field regime, represented as an average over $N$ neurons: $f_X(z) = \frac{1}{N}\sum_{i=1}^{N} h_z(x_i)$, where, given the input $z \in \mathbb{R}^{d'}$, each neuron computes a nonlinear transformation based on trainable parameters $x \in \mathbb{R}^{d}$; for example, we may set $h_z(x) = \tanh(w^\top z + b)$ for $x = (w, b) \in \mathbb{R}^{d'+1}$. Importantly, the mean-field parameterization allows the parameters to move away from initialization during gradient descent and hence learn informative features (Yang and Hu, 2020) even when the network width is large ($N \to \infty$), in contrast to the Neural Tangent Kernel (NTK) parameterization (Jacot et al., 2018) (corresponding to a $1/\sqrt{N}$ prefactor), which freezes the model at initialization under overparameterization. This feature learning ability enables mean-field neural networks to outperform their NTK counterparts (and linear estimators in general) in learning a wide range of target functions (Ghorbani et al., 2019; Li et al., 2020; Abbe et al., 2022; Ba et al., 2022).

Optimization guarantees for mean-field neural networks are typically obtained by lifting the finite-width model to the infinite-dimensional space of parameter distributions and then exploiting the convexity of the objective as a functional of the distribution. From this viewpoint, convergence of gradient flow on infinite-width neural networks to the globally optimal solution can be shown under appropriate conditions (Nitanda and Suzuki, 2017; Chizat and Bach, 2018; Mei et al., 2018; Rotskoff and Vanden-Eijnden, 2018; Sirignano and Spiliopoulos, 2020). However, most existing results are qualitative, in that they do not characterize the rate of convergence or the finite-particle discretization error.
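The scaling difference between the two parameterizations can be made concrete in a short numerical sketch (the function names below are hypothetical and serve only as an illustration, not as the paper's experimental setup): the mean-field network averages $N$ tanh neurons with a $1/N$ prefactor, whereas the NTK parameterization uses a $1/\sqrt{N}$ prefactor, so for the same parameters the two outputs differ by exactly a factor of $\sqrt{N}$.

```python
import numpy as np

def neuron(x, z):
    # Single neuron h_z(x) = tanh(w^T z + b), with parameters x = (w, b).
    w, b = x[:-1], x[-1]
    return np.tanh(w @ z + b)

def mean_field_net(X, z):
    # Mean-field parameterization: average of N neurons (1/N prefactor).
    return np.mean([neuron(x, z) for x in X])

def ntk_net(X, z):
    # NTK parameterization: sum of N neurons with a 1/sqrt(N) prefactor.
    N = len(X)
    return np.sum([neuron(x, z) for x in X]) / np.sqrt(N)

rng = np.random.default_rng(0)
d_prime, N = 3, 1000
X = rng.standard_normal((N, d_prime + 1))  # parameters x_i = (w_i, b_i)
z = rng.standard_normal(d_prime)           # a single input z in R^{d'}

# The mean-field output stays O(1) as N grows (tanh is bounded by 1),
# while the NTK output is sqrt(N) times larger for the same parameters.
print(mean_field_net(X, z), ntk_net(X, z))
```

Because each neuron's output is bounded, the mean-field output remains $O(1)$ regardless of width; this is the normalization under which the empirical distribution of the parameters $x_i$, rather than the parameters themselves, determines the network function.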

