GLOBAL OPTIMALITY OF SOFTMAX POLICY GRADIENT WITH SINGLE HIDDEN LAYER NEURAL NETWORKS IN THE MEAN-FIELD REGIME

Abstract

We study the problem of policy optimization for infinite-horizon discounted Markov Decision Processes with softmax policy and nonlinear function approximation trained with policy gradient algorithms. We concentrate on the training dynamics in the mean-field regime, modeling e.g., the behavior of wide single hidden layer neural networks, when exploration is encouraged through entropy regularization. The dynamics of these models is established as a Wasserstein gradient flow of distributions in parameter space. We further prove global optimality of the fixed points of this dynamics under mild conditions on their initialization.

1. INTRODUCTION

In recent years, deep reinforcement learning has revolutionized the world of Artificial Intelligence by outperforming humans in a multitude of highly complex tasks and achieving breakthroughs that were deemed unthinkable at least for the next decade. Spectacular examples of such revolutionary potential have appeared over the last few years, with reinforcement learning algorithms mastering games and tasks of increasing complexity, from learning to walk to the games of Go and Starcraft (Mnih et al., 2013; 2015; Silver et al., 2016; 2017; 2018; Haarnoja et al., 2018a; Vinyals et al., 2019). In most cases, the main workhorse allowing artificial intelligence to pass such unprecedented milestones was a variation of a fundamental method to train reinforcement learning models: policy gradient (PG) algorithms (Sutton et al., 2000). This algorithm has a disarmingly simple approach to the optimization problem at hand: given a parametrization of the policy, it updates the parameters in the direction of steepest ascent of the associated integrated value function. Impressive progress has been made recently in the understanding of the convergence and optimization properties of this class of algorithms in the tabular setting (Agarwal et al., 2019; Cen et al., 2020; Bhandari & Russo, 2019), in particular leveraging the natural tradeoff between exploration and exploitation offered for entropy-regularized rewards by softmax policies (Haarnoja et al., 2018b; Mei et al., 2020). However, this simple algorithm alone is not sufficient to explain the multitude of recent breakthroughs in this field: in application domains such as Starcraft, robotics or movement planning, the spaces of possible states and actions are exceedingly large, or even continuous, and therefore cannot be represented efficiently by tabular policies (Haarnoja et al., 2018a).
Consequently, the recent impressive successes of artificial intelligence would be impossible without the natural choice of neural networks to approximate value functions and / or policy functions in reinforcement learning algorithms (Mnih et al., 2015; Sutton et al., 2000) . While neural networks, in particular deep neural networks, provide a powerful and versatile tool to approximate high dimensional functions on continuous spaces (Cybenko, 1989; Hornik, 1991; Barron, 1993) , their intrinsic nonlinearity poses significant obstacles to the theoretical understanding of their training and optimization properties. For instance, it is known that the optimization landscape of these models is highly nonconvex, preventing the use of most theoretical tools from classical optimization theory. For this reason, the unprecedented success of neural networks in artificial intelligence stands in contrast with the poor understanding of these methods from a theoretical perspective. Indeed, even in the supervised setting, which can be viewed as a special case of reinforcement learning, deep neural networks are still far from being understood despite having been an important and fashionable research focus in recent years. Only recently, a theory of neural network learning has started to emerge, including recent works on mean-field point of view of training dynamics (Mei et al., 2018; Rotskoff & Vanden-Eijnden, 2018; Rotskoff et al., 2019; Wei et al., 2018; Chizat & Bach, 2018) and on linearized dynamics in the over-parametrized regime (Jacot et al., 2018; Allen-Zhu et al., 2018; Du et al., 2018; 2019; Zou et al., 2018; Allen-Zhu et al., 2019; Chizat et al., 2019; Oymak & Soltanolkotabi, 2020; Ghorbani et al., 2019; Lee et al., 2019) . 
More specifically to the context of reinforcement learning, some works focusing on value-based learning (Agazzi & Lu, 2019; Cai et al., 2019; Zhang et al., 2020) , and others exploring the dynamics of policy gradient algorithms (Zhang et al., 2019) have recently appeared. Despite this progress, the theoretical understanding of deep reinforcement learning still poses a significant challenge to the theoretical machine learning community, and it is of crucial importance to understand the convergence and optimization properties of such algorithms to bridge the gap between theory and practice.

CONTRIBUTIONS.

The main goal of this work is to investigate entropy-regularized policy gradient dynamics for wide, single hidden layer neural networks. In particular, we make the following contributions:
• We give a mean-field formulation of policy gradient dynamics in parameter space, describing the evolution of neural network parameters in the form of a transport partial differential equation (PDE). We prove convergence of the particle dynamics to their mean-field counterpart. We further explore the structure of this problem by showing that this PDE is a gradient flow in the Wasserstein space for the appropriate energy functional.
• We investigate the convergence properties of the above dynamics in the space of measures. In particular, we prove that under some mild assumptions on the initialization of the neural network parameters and on the approximating power of the nonlinearity, all fixed points of the dynamics are global optima, i.e., the approximate policy learned by the neural network is optimal.

RELATED WORKS.

Recent progress in the understanding of the parametric dynamics of simple neural networks trained with gradient descent in the supervised setting has been made in (Mei et al., 2018; Rotskoff & Vanden-Eijnden, 2018; Wei et al., 2018; Chizat, 2019; Chizat & Bach, 2020). These results have further been extended to the multilayer setting in (Nguyen & Pham, 2020). In particular, the paper (Chizat & Bach, 2018) proves optimality of fixed points for wide single layer neural networks leveraging a Wasserstein gradient flow structure and the strong convexity of the loss functional with respect to the predictor. We extend these results to the reinforcement learning framework, where the convexity that is heavily leveraged in (Chizat & Bach, 2018) is lost. We bypass this issue by requiring sufficient expressivity of the nonlinear representation used, which allows us to characterize global minimizers as optimal approximators.
The convergence and optimality of policy gradient algorithms (including in the entropy-regularized setting) is investigated in the recent papers (Bhandari & Russo, 2019; Mei et al., 2020; Cen et al., 2020; Agarwal et al., 2019) . These references establish convergence estimates through gradient domination bounds. In (Mei et al., 2020; Cen et al., 2020) such results are limited to the tabular case, while (Agarwal et al., 2019; 2020) also discuss neural softmax policy classes, but under a different algorithmic update and assuming certain well-conditioning assumptions along training. Furthermore, all these results heavily leverage the finiteness of action space. In contrast, this paper focuses on the continuous space and action setting with nonlinear function approximation. Further recent works discussing convergence properties of reinforcement learning algorithms with function approximation via neural networks include (Zhang et al., 2019; Cai et al., 2019) . These results only hold for finite action spaces, and are obtained in the regime where the network behaves essentially like a linear model (known as the neural or lazy training regime), in contrast to the results of this paper, which considers training in a nonlinear regime. We also note the work (Wang et al., 2019) where the action space is continuous but the training is again in an approximately linear regime.

2. MARKOV DECISION PROCESSES AND POLICY GRADIENTS

We denote a Markov Decision Process (MDP) by the 5-tuple $(S, A, P, r, \gamma)$, where $S$ is the state space, $A$ is the action space, $P = P(s, a, s')_{s,s' \in S,\, a \in A}$ is a Markov transition kernel, $r(s, a, s')_{s,s' \in S,\, a \in A}$ is the real-valued, bounded and continuous immediate reward function and $\gamma \in (0, 1)$ is a discount factor. We consider a probabilistic policy, mapping a state to a probability distribution on the action space, so that $\pi : S \to \mathcal{M}^1_+(A)$, where $\mathcal{M}^1_+(A)$ denotes the space of probability measures on $A$, and denote for any $s \in S$ the corresponding density by $\pi(s, \cdot) : A \to \mathbb{R}_+$. The policy defines a state-to-state transition operator $P_\pi(s, ds') = \int_A P(s, a, ds')\,\pi(s, da)$, and we assume that $P_\pi$ is Lipschitz continuous as an operator $\mathcal{M}^1_+(S) \to \mathcal{M}^1_+(S)$ with respect to the policy. We further encourage exploration by defining (relative) entropy-regularized rewards (Williams & Peng, 1991)

$$R_\tau(s, a, s') = r(s, a, s') - \tau D_{KL}(\pi(s, \cdot); \bar\pi(\cdot)),$$

where $D_{KL}$ denotes the relative entropy, $\bar\pi$ is a reference measure and $\tau$ indicates the strength of regularization. Throughout, we choose $\bar\pi$ to be the Lebesgue measure on $A$, which we assume, like $S$, to be a compact subset of Euclidean space. This regularization encourages exploration and absolute continuity of the policy with respect to the Lebesgue measure. Consequently, with some abuse of notation, we use throughout the same notation for a distribution and its density in phase space. Note that the original, unregularized MDP can be recovered in the limit $\tau \to 0$.
In this context, given a policy $\pi$, the associated value function $V_\pi : S \to \mathbb{R}$ maps each state to the infinite-horizon expected discounted reward obtained by following the policy $\pi$ and the Markov process defined by $P$:

$$V_\pi(s) = \mathbb{E}_\pi\Big[\sum_{t=0}^\infty \gamma^t R_\tau(s_t, a_t, s_{t+1}) \,\Big|\, s_0 = s\Big] = \mathbb{E}_\pi\Big[\sum_{t=0}^\infty \gamma^t \big(r(s_t, a_t, s_{t+1}) - \tau D_{KL}(\pi(s_t, \cdot); \bar\pi(\cdot))\big) \,\Big|\, s_0 = s\Big], \tag{1}$$

where $\mathbb{E}_\pi[\,\cdot \mid s_0 = s]$ denotes the expectation of the stochastic process $s_t$ starting at $s_0 = s$ and following the (stochastic) dynamics defined recursively by the transition operator $P_\pi(s, ds') = \int_A P(s, a, ds')\,\pi(s, da)$. Correspondingly, we define the Q-function $Q_\pi : S \times A \to \mathbb{R}$ as

$$Q_\pi(s, a) = \mathbb{E}_\pi\Big[r(s_0, a_0, s_1) + \sum_{t=1}^\infty \gamma^t R_\tau(s_t, a_t, s_{t+1}) \,\Big|\, s_0 = s, a_0 = a\Big] = \bar r(s, a) + \gamma\, \mathbb{E}_\pi\big[V_\pi(s_1) \mid s_0 = s, a_0 = a\big], \tag{2}$$

where $\bar r(s, a) = \mathbb{E}[r(s, a, s')]$ is the average reward from $(s, a)$. Conversely, from the definition, we have the identity $V_\pi(s) = \mathbb{E}_\pi[Q_\pi(s_0, a_0) \mid s_0 = s] - \tau D_{KL}(\pi(s, \cdot); \bar\pi(\cdot))$. We are interested in learning the optimal policy $\pi^*$ of a given MDP $(S, A, P, r, \gamma)$, which satisfies for all $s \in S$

$$V_{\pi^*}(s) = \max_{\pi : S \to \mathcal{M}^1_+(A)} V_\pi(s). \tag{3}$$

More specifically, we would like to estimate this function through a family of approximators $\pi_w : S \to \mathcal{M}^1_+(A)$ parametrized by a vector $w \in W := \mathbb{R}^p$. Note that since we consider entropy-regularized rewards, the optimal policy will be a probabilistic policy (given as a Boltzmann distribution) instead of a deterministic one. A popular algorithm to solve this problem is the policy gradient algorithm (Sutton & Barto, 2018). Starting from an initial condition $w(0) \in W$, this algorithm updates the parameters $w$ of the predictor in the direction of steepest ascent of the average reward

$$w(t + 1) := w(t) + \beta_t \nabla_w \tilde{\mathbb{E}}_{s \sim \rho_0}\big[V_\pi(s)\big], \tag{4}$$

for a fixed, absolutely continuous distribution of initial states $\rho_0 \in \mathcal{M}^1_+(S)$ and a sequence of time steps $\{\beta_t\}_t$. Here $\tilde{\mathbb{E}}[\,\cdot\,]$ denotes an approximation of the expected value operator.
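As an illustration of definitions (1) and (2), the following minimal sketch evaluates the entropy-regularized value and Q-functions exactly on a small finite MDP and checks the identity relating them. The finite state and action spaces, the random kernel and reward, the function name `evaluate_policy`, and the use of the counting measure as reference measure are all assumptions made for this example; the paper itself works with continuous spaces and the Lebesgue measure.

```python
import numpy as np

def evaluate_policy(P, r, pi, gamma, tau):
    """Entropy-regularized V and Q for a finite MDP (illustrative assumption).

    P[s, a, p]: transition probabilities, r[s, a, p]: rewards,
    pi[s, a]: strictly positive policy. The reference measure is
    taken to be the counting measure on actions.
    """
    n_states = P.shape[0]
    r_bar = np.einsum("sap,sap->sa", P, r)              # average reward r_bar(s, a)
    kl = np.sum(pi * np.log(pi), axis=1)                # D_KL(pi(s, .); counting measure)
    P_pi = np.einsum("sa,sap->sp", pi, P)               # state-to-state kernel P_pi
    reward_pi = np.sum(pi * r_bar, axis=1) - tau * kl   # expected regularized reward
    # V solves the linear Bellman equation V = reward_pi + gamma * P_pi V
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, reward_pi)
    Q = r_bar + gamma * np.einsum("sap,p->sa", P, V)    # Q = r_bar + gamma * E[V(s_1)]
    return V, Q

rng = np.random.default_rng(0)
n_s, n_a, gamma, tau = 4, 3, 0.7, 0.2
P = rng.random((n_s, n_a, n_s))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_s, n_a, n_s))
pi = rng.random((n_s, n_a)) + 0.1
pi /= pi.sum(axis=1, keepdims=True)

V, Q = evaluate_policy(P, r, pi, gamma, tau)
kl = np.sum(pi * np.log(pi), axis=1)
# Identity from the text: V(s) = E_pi[Q(s, a)] - tau * D_KL(pi(s, .); reference)
identity_gap = np.max(np.abs(V - (np.sum(pi * Q, axis=1) - tau * kl)))
```

In this discrete setting `identity_gap` should vanish up to floating-point error, since the identity is just the Bellman equation rewritten in terms of $Q_\pi$.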
This work investigates the regime of asymptotically small constant step-sizes $\beta_t \to 0$. In this adiabatic limit, the stochastic component of the dynamics is averaged before the parameters of the model can change significantly. This allows us to consider the parametric update as a deterministic dynamical system emerging from the averaging of the underlying stochastic algorithm, corresponding to the limit of infinite sample sizes. This is known as the ODE method (Borkar, 2009) for analyzing stochastic approximation. We focus on the analysis of this deterministic system to highlight the core dynamical properties of policy gradients with nonlinear function approximation. The averaged, deterministic dynamics is given by the set of ODEs

$$\frac{d}{dt} w(t) = \mathbb{E}_{s \sim \rho_0}\big[\nabla_w V_\pi(s)\big] = \mathbb{E}_{s \sim \rho_\pi,\, a \sim \pi_w}\big[\nabla_w \log \pi_w(s, a)\,\big(Q_\pi(s, a) - \tau \log \pi_w(s, a)\big)\big], \tag{5}$$

where in the second equality we have applied the policy gradient theorem (Sutton et al., 2000; Sutton & Barto, 2018), defining for a fixed $\rho_0 \in \mathcal{M}^1_+(S)$

$$\rho_\pi(s_0, s) := \sum_{t=0}^\infty \gamma^t P^t_\pi(s_0, s), \qquad \rho_\pi(s) = \int_S \rho_\pi(s_0, s)\,\rho_0(ds_0), \tag{6}$$

as the (improper) discounted empirical measure. For completeness, we include a derivation of (5) in Appendix A.
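For a finite state space, the discounted measure in (6) is a geometric series in the transition matrix and can be obtained from a single linear solve (the resolvent of $\gamma P_\pi$). The sketch below illustrates this under the assumption of a small, randomly generated row-stochastic matrix `P_pi` and a uniform `rho0`; the function name `discounted_occupancy` is hypothetical.

```python
import numpy as np

def discounted_occupancy(P_pi, rho0, gamma):
    """Return rho with rho^T = rho0^T (I - gamma * P_pi)^{-1}.

    This equals sum_t gamma^t rho0^T P_pi^t, i.e. the improper
    discounted measure of (6) for a finite state space.
    """
    n = P_pi.shape[0]
    # Transpose because rho0 multiplies the resolvent from the left
    return np.linalg.solve((np.eye(n) - gamma * P_pi).T, rho0)

rng = np.random.default_rng(1)
n_s, gamma = 5, 0.7
P_pi = rng.random((n_s, n_s))
P_pi /= P_pi.sum(axis=1, keepdims=True)   # row-stochastic kernel
rho0 = np.full(n_s, 1.0 / n_s)

rho_pi = discounted_occupancy(P_pi, rho0, gamma)

# Cross-check against the truncated series sum_{t < 200} gamma^t rho0^T P_pi^t
series = np.zeros(n_s)
term = rho0.copy()
for _ in range(200):
    series += term
    term = gamma * (term @ P_pi)
```

The total mass of `rho_pi` is $1/(1-\gamma)$ rather than 1, which is why the text calls the measure improper.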

SOFTMAX POLICIES IN THE MEAN-FIELD REGIME

We choose to represent our policy as a softmax policy,

$$\pi_w(s, a) = \frac{\exp(f_w(s, a))}{\int_A \exp(f_w(s, a'))\,da'},$$

and parametrize the energy $f$ as a two-layer neural network in the mean-field regime, i.e.,

$$f_w(s, a) = \frac{1}{N} \sum_{i=1}^N \psi(s, a; w_i) \tag{7}$$

for a fixed, usually nonlinear function $\psi : S \times A \times \Omega \to \mathbb{R}$, where we have separated $w \in W$ into $N$ identical components $w_i \in \Omega$ so that $W = \Omega^N$. We can rewrite the above expression in terms of an empirical measure:

$$f_{\nu^{(N)}}(s, a) := \int_\Omega \psi(s, a; \omega)\,\nu^{(N)}(d\omega), \qquad \text{where } \nu^{(N)}(d\omega) = \frac{1}{N} \sum_{i=1}^N \delta_{w^{(i)}}(d\omega) \in \mathcal{M}^1_+(\Omega).$$

This empirical measure representation removes the symmetry of the approximating functions under permutations of the parameters $w_i$. It also facilitates the limit $N \to \infty$, when $\nu^{(N)} \to \nu$ weakly, so that $f_{\nu^{(N)}} \to f_\nu$. Then, for a general distribution $\nu \in \mathcal{M}^1_+(\Omega)$ the softmax mean-field policy reads:

$$\pi_\nu(s, a) = \frac{\exp\big(\int_\Omega \psi(s, a; \omega)\,\nu(d\omega)\big)}{\int_A \exp\big(\int_\Omega \psi(s, a'; \omega)\,\nu(d\omega)\big)\,da'}. \tag{8}$$

Note that by our choice of softmax policy and mean-field parametrization (7) we have

$$\nabla_{w_i} \log \pi_{\nu^{(N)}}(s, a) = \nabla_{w_i} f_{\nu^{(N)}}(s, a) - \nabla_{w_i} \log \int_A \exp f_{\nu^{(N)}}(s, a')\,da'$$
$$= \nabla_{w_i} \frac{1}{N} \sum_{j=1}^N \psi(s, a; w_j) - \frac{\int_A \nabla_{w_i} \exp\big(\frac{1}{N} \sum_{j=1}^N \psi(s, a'; w_j)\big)\,da'}{\int_A \exp f_{\nu^{(N)}}(s, a')\,da'}$$
$$= \frac{1}{N} \Big(\nabla_{w_i} \psi(s, a; w_i) - \int_A \nabla_{w_i} \psi(s, a'; w_i)\,\pi_{\nu^{(N)}}(s, da')\Big).$$

Thus the training dynamics (5), after an appropriate rescaling of time ($t \to t/N$, which is due to the mean-field parametrization for $f_w$), can be rewritten as

$$\frac{d}{dt} w_i(t) = \int_{S \times A} \Big(\nabla_{w_i} \psi(s, a; w_i) - \mathbb{E}_{\pi_{\nu^{(N)}}}\big[\nabla_{w_i} \psi(s, \cdot\,; w_i)\big]\Big) \Big(Q_{\pi_{\nu^{(N)}}}(s, a) - \tau \log \pi_{\nu^{(N)}}(s, a)\Big)\, \pi_{\nu^{(N)}}(s, da)\, \rho_{\pi_{\nu^{(N)}}}(ds). \tag{9}$$

The training dynamics can be more compactly represented by the evolution of the measure $\nu \in \mathcal{M}^1_+(\Omega)$ in parameter space, given by a mean-field transport partial differential equation of Vlasov type:

$$\frac{d}{dt} \nu_t(\omega) = \mathrm{div}\Big(\nu_t(\omega) \int_{S \times A} C_{\pi_\nu}\big[\nabla_\omega \psi(s, \cdot\,; \omega),\, Q_{\pi_\nu} - \tau \log \pi_\nu\big](s)\, \rho_{\pi_\nu}(ds)\Big), \tag{10}$$

where $\omega \in \Omega$ and we have introduced the shorthand $C_\pi[f, g](s)$ to denote the covariance operator with respect to the probability measure $\pi(s, da)$.
Note that the above partial differential equation also captures the dynamics of the finite-width system, i.e., of the empirical measure $\nu^{(N)}$ where each $w_i$ follows (9). We further note that the dynamics introduced above have a gradient flow structure in the probability space $\mathcal{M}^1_+(\Omega)$: defining the expected value function

$$\mathcal{E}[\nu] = \mathbb{E}_{s_0 \sim \rho_0}\big[V_{\pi_\nu}(s_0)\big], \tag{11}$$

the dynamics (10) is a gradient flow for $\mathcal{E}$ in the Wasserstein space (see e.g., (Santambrogio, 2017) for an introduction), as we prove in the appendix:

Proposition 2.1. For a fixed initial distribution $\rho_0 \in \mathcal{M}^1_+(S)$, the dynamics (10) is the Wasserstein gradient flow of the energy functional (11).

Analogous equations for the evolution of the measure in parameter space have been derived in the supervised learning case in (Mei et al., 2018; Rotskoff & Vanden-Eijnden, 2018; Chizat & Bach, 2018) and in the TD learning case in (Agazzi & Lu, 2019; Zhang et al., 2020). In particular, in the case of supervised learning, the resulting dynamics is a Wasserstein gradient flow, the structure of which is used to obtain the convergence of the particle system to the mean-field dynamics. In our case, however, the energy functional is not convex with respect to the policy, and moreover the softmax parametrization destroys the convexity of the approximator of the policy with respect to $\nu_t$. Thus showing convergence of the dynamics becomes much more challenging.
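To make the mean-field softmax parametrization concrete, the sketch below builds $\pi_{\nu^{(N)}}$ on a discretized action interval (with the state suppressed) and compares the log-policy gradient formula $\frac{1}{N}\big(\nabla_{w_i}\psi - \mathbb{E}_\pi[\nabla_{w_i}\psi]\big)$ with a finite difference. Everything specific here is an assumption for illustration: the unit $\psi(a; w) = w_0 \tanh(\theta_1 a + \theta_2)$, the midpoint grid on $[0, 1]$, the Riemann-sum normalization, and the function names.

```python
import numpy as np

def phi(a, theta):
    """Hypothetical feature tanh(theta_1 * a + theta_2); returns shape (N, M)."""
    return np.tanh(theta[:, :1] * a[None, :] + theta[:, 1:2])

def log_policy(a, w0, theta):
    """log pi_nu on a uniform action grid, with psi(a; w) = w0 * phi(a; theta)."""
    da = a[1] - a[0]
    f = np.mean(w0[:, None] * phi(a, theta), axis=0)   # f = (1/N) sum_i psi(a; w_i)
    return f - np.log(np.sum(np.exp(f)) * da)          # subtract the log-normalizer

def grad_log_policy_w0(a, w0, theta):
    """(1/N) * (phi_i(a) - E_pi[phi_i]) for each particle i; shape (N, M)."""
    da = a[1] - a[0]
    pi = np.exp(log_policy(a, w0, theta))
    feats = phi(a, theta)
    mean_feats = (feats * pi).sum(axis=1, keepdims=True) * da
    return (feats - mean_feats) / len(w0)

rng = np.random.default_rng(0)
N, M = 50, 200
a = (np.arange(M) + 0.5) / M            # midpoint grid on [0, 1]
w0 = rng.normal(size=N)
theta = rng.normal(size=(N, 2))

pi = np.exp(log_policy(a, w0, theta))
total_mass = pi.sum() * (a[1] - a[0])

# Central finite difference with respect to the output weight of particle 0
eps = 1e-5
e0 = np.zeros(N)
e0[0] = eps
fd = (log_policy(a, w0 + e0, theta) - log_policy(a, w0 - e0, theta)) / (2 * eps)
analytic = grad_log_policy_w0(a, w0, theta)[0]

# The policy depends on the particles only through their empirical measure
perm = rng.permutation(N)
pi_perm = np.exp(log_policy(a, w0[perm], theta[perm]))
```

The last two lines illustrate the permutation symmetry mentioned in the text: relabeling the particles leaves the policy unchanged.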

3. SIMPLIFIED SETTING: THE BANDIT PROBLEM

We now introduce our results in the simple bandit setting, where the state space $S$ is one point (and will be henceforth suppressed in the notation) and without loss of generality the action space $A$ is continuous. In this case, for a reward function $r$ and a softmax policy

$$\pi_\nu(a) = \frac{\exp(f_\nu(a))}{\int_A \exp(f_\nu(a'))\,da'}$$

we have that the value function for the regularized problem reads (we denote $V_\nu = V_{\pi_\nu}$ to simplify notation)

$$V_\nu = \int_A \big(r(a) - \tau \log \pi_\nu(a)\big)\,\pi_\nu(da),$$

while the Q-function is simply $Q(a) = r(a)$. We further note that the optimal policy in the regularized case reads:

$$\pi^*(a) = Z^{-1} \exp(\tau^{-1} r(a)), \qquad Z = \int_A \exp(\tau^{-1} r(a))\,da.$$

Recalling the definition of the covariance operator $C_\pi[f, g](s)$ from (10), the expression for the policy gradient vector field in the latter case simplifies to

$$\partial_t \omega_t := F_t(\omega_t; \nu_t) = \nabla_\omega D\pi_\nu\, DV_\nu = C_{\pi_\nu}\big[\nabla_\omega \psi(a; \omega),\, r - \tau \log(\pi_\nu)\big] = \int_A \Big(\nabla_\omega \psi(a; \omega) - \int_A \nabla_\omega \psi(a'; \omega)\,\pi_\nu(da')\Big) \big(r(a) - \tau f_\nu(a)\big)\,\pi_\nu(da), \tag{12}$$

where $D\pi_\nu$, $DV_\nu$ denote the Fréchet derivatives of $\pi_\nu$ and $V_\nu$ with respect to $\nu$ and $\pi$ respectively. Note that by the structure of the covariance operator $C_\pi$, adding a constant to the function $f_\nu(\cdot)$ does not affect the dynamics. This reflects the fact that the softmax policy is normalized by definition.
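The bandit formulas above can be checked numerically on a grid. The sketch below computes the Boltzmann policy $\pi^*$, verifies that its regularized value equals $\tau \log Z$ (which follows by substituting $\log \pi^* = \tau^{-1} r - \log Z$ into $V_\nu$), and compares it with two other policies. The reward $r(a) = \sin(2\pi a)$, the grid, and the alternative energy $\cos(3a)$ are assumptions chosen only for illustration.

```python
import numpy as np

tau = 0.2
M = 500
a = (np.arange(M) + 0.5) / M        # midpoint grid on A = [0, 1]
da = 1.0 / M
r = np.sin(2 * np.pi * a)           # hypothetical reward function

def value(pi):
    """Regularized value V = int (r - tau * log pi) pi da (Riemann sum)."""
    return np.sum((r - tau * np.log(pi)) * pi) * da

# Optimal (Boltzmann) policy pi* = exp(r / tau) / Z
Z = np.sum(np.exp(r / tau)) * da
pi_star = np.exp(r / tau) / Z

# Two suboptimal softmax policies: uniform (f = 0) and an arbitrary energy f
pi_uniform = np.ones(M)
f = np.cos(3 * a)
pi_other = np.exp(f) / (np.sum(np.exp(f)) * da)

v_star = value(pi_star)
v_uniform = value(pi_uniform)
v_other = value(pi_other)
```

On the grid, the gap `v_star - value(pi)` equals $\tau$ times the relative entropy of `pi` with respect to `pi_star`, so it is nonnegative for every density on the grid.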

3.1. GLOBAL OPTIMALITY OF SOFTMAX POLICY GRADIENT

We now sketch the main steps in proving that the mean-field policy gradient dynamics converge, under appropriate assumptions, to global optimizers. The proof in this simpler setting is much more transparent than the general case to be discussed in the next section, and will provide some intuition for the latter. The first part of the proof concerns the properties of fixed points of the dynamics (10), while the second part concerns the training dynamics.

STATICS

We first informally prove global optimality, i.e., $\pi_{\nu^*}(a) = \pi^*(a) = Z^{-1} \exp(\tau^{-1} r(a))$, of any fixed point $\nu^*$ of the transport equation $\frac{d}{dt} \nu_t = -\mathrm{div}(\nu_t F_t)$ with $F_t$ from (12) such that a) $\nu^*$ has full support in $\Omega$, b) the nonlinearity is 1-homogeneous in the first component of its parameters, i.e., writing $\omega = (\omega_0, \bar\omega) \in \mathbb{R} \times \Theta$ one has $\psi(a; \omega) := \omega_0\, \phi(a; \bar\omega)$ for a regular enough $\phi : A \times \Theta \to \mathbb{R}$, c) the span of $\{\phi(a; \bar\omega)\}_{\bar\omega \in \Theta}$ is dense in $L^2(A)$. Weaker assumptions and the general statement are given in the next section, while the general proof appears in the appendix. First, we note that by assumption a), $\mathrm{div}(\nu^* F(\,\cdot\,; \nu^*)) = 0$ directly implies that for almost all $\omega \in \Omega$

$$F(\omega; \nu^*) = \int_A \nabla_\omega \psi(a; \omega)\,\big(r(a) - \tau f_{\nu^*}(a) - V_{\nu^*}\big)\,\pi_{\nu^*}(da) = 0.$$

In particular, by the homogeneity assumption b), the first component of the above vector field must vanish on $\Theta$:

$$\int_A \phi(a; \bar\omega)\,\big(r(a) - \tau f_{\nu^*}(a) - V_{\nu^*}\big)\,\pi_{\nu^*}(da) = 0.$$

By assumption c), that the span of $\phi$ is dense in $L^2(A)$, the above implies that

$$r(a) - \tau f_{\nu^*}(a) - V_{\nu^*} = 0 \qquad \pi_{\nu^*}\text{-a.e. in } A.$$

Finally, recalling that by the softmax parametrization and by the boundedness of $\phi$ we have $\pi_{\nu^*}(a) > 0$, we must have $f_{\nu^*}(a) = \tau^{-1} r(a) + C$, which directly implies the optimality of the policy.
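The fixed-point argument can be illustrated numerically: when the energy equals $\tau^{-1} r + C$, the quantity $r - \tau \log \pi$ is constant, so its covariance with any feature vanishes, whereas for a suboptimal energy a feature correlated with the reward gives a nonzero field. The sketch below evaluates the first ($\omega_0$) component of the field in its centered covariance form, using $r - \tau \log \pi$ as in the covariance expression in (12). The reward $r(a) = a$, the three $\tanh$ test features and their parameters are assumptions for this example only.

```python
import numpy as np

tau = 0.2
M = 400
a = (np.arange(M) + 0.5) / M        # midpoint grid on A = [0, 1]
da = 1.0 / M
r = a.copy()                        # hypothetical reward r(a) = a

# Hypothetical test parameters (theta_1, theta_2) for phi = tanh(theta_1 a + theta_2)
theta = np.array([[1.0, 0.0], [2.0, -1.0], [-3.0, 1.0]])
phi = np.tanh(theta[:, :1] * a[None, :] + theta[:, 1:2])   # shape (3, M)

def first_component(f):
    """omega_0-component of the field: Cov_pi[phi, r - tau * log pi] per feature."""
    pi = np.exp(f - f.max())
    pi /= pi.sum() * da             # softmax density for the energy f
    g = r - tau * np.log(pi)
    V = np.sum(g * pi) * da         # regularized value, equal to E_pi[g]
    return (phi * ((g - V) * pi)).sum(axis=1) * da

F_opt = first_component(r / tau + 3.0)    # optimal energy, shifted by a constant C = 3
F_uniform = first_component(np.zeros(M))  # suboptimal energy f = 0 (uniform policy)
```

The constant shift in `F_opt` has no effect, in line with the remark that adding a constant to $f_\nu$ does not change the dynamics.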

DYNAMICS

While it is clear that assumptions b) and c) about the structure and approximating power of the nonlinearity $\psi$ hold independently of $t$, we want to show that assumption a) also holds uniformly in time. In this sense, the continuity of the vector field (12) will preserve the full support property of the measure $\nu_t$ for all $t > 0$, as we will prove in a more general framework in Lemma C.2. Consequently, any measure $\nu$ respecting assumption a) at initialization will do so for any finite positive time $t > 0$. However, the question remains whether this property still holds at $t = \infty$. This is the object of Lemma C.3, where we prove that whenever the gradient flow approaches a fixed point in parameter space, if this fixed point is not a global minimizer, it must be avoided by the dynamics, and thus the only possible fixed points of the dynamics are global minimizers.

4. RESULTS IN THE GENERAL SETTING

We now come back to the general MDP framework introduced in Section 2.

4.1. ASSUMPTIONS

To state the main result of this section, the optimality of fixed points of (10) we need the following Assumption 1. Assumption 1 a) is a common, technical regularity assumption ensuring that (10) is well behaved and controlling the growth, variation and regularity of φ. Alternative assumptions on the case Θ = R m-1 are given in the appendix. Assumption 1 b) speaks to the approximating power of the nonlinearity, assumed to be expressive enough to approximate any function in L 2 (S × A). This condition replaces the convexity assumption from Chizat & Bach (2018) , as the lack of convex structure in our setting prevents us from identifying the local and global minimization properties of a fixed point. Indeed, despite the one-point convexity of E [V π (s)] as a functional of π (Kakade & Langford, 2002) which can be leveraged in the tabular case, such property will be lost, in general, when restricting to policies through nonlinear function approximation. We bypass this issue by requiring sufficient expressivity of the approximating function class, guaranteeing that the optimal policy can be represented with arbitrary precision. Similar assumption on approximability of neural network representation was made in recent analysis of natural policy gradient algorithm (Agarwal et al., 2019) . We note that this assumption is easily satisfied by widely used nonlinearities by the universal approximation theorem (Cybenko, 1989; Barron, 1993) . Examples of activation functions satisfying Assumption 1 a)-b) include sigmoid, tanh and Gaussian radial function nonlinearities. Extension to analogous results in the ReLU case was discussed in Wojtowytsch (2020) for the supervised learning. Finally, Assumption 1 c) guarantees that the initial condition is such that the expressivity from b) can actually be exploited. 
This condition is satisfied for example by the product of a uniform distribution on any bounded set A ⊂ R with the normal distribution on Θ or, if Θ is compact, with the uniform distribution on Θ. Assume that ω = (ω 0 , ω) ∈ R × Θ for Θ = R m-1 and ψ(s, a; ω) = ω 0 φ(s, a; ω) with a) Regularity of φ: φ is bounded, differentiable and Dφ ω is Lipschitz. Also, for all f ∈ L 2 (S × A) the regular values of the map ω → g f (ω) := f (s, a)φ(s, a; ω) are dense in its range, and g f (r ω) converges in C 1 ({ω ∈ Θ : ω 2 = 1}) as r → ∞ to a map ḡf (ω)

4.2. CONVERGENCE OF THE MANY-PARTICLE LIMIT

Before discussing the optimality properties of the dynamics (10), we show that this PDE accurately describes the policy gradient dynamics of a sufficiently wide, single layer neural network. To this aim, we let $\mathcal{P}_2(\Omega)$ be the space of probability distributions on $\Omega$ with finite second moment.

Theorem 4.1. Let Assumption 1 hold and let $w^{(N)}_t$ be a solution of (5) with initial condition $w^{(N)}_0 \in W = \Omega^N$. If $\nu^{(N)}_0$ converges to $\nu_0 \in \mathcal{P}_2(\Omega)$ in Wasserstein distance $W_2$ then $\nu^{(N)}_t$ converges, for every $t > 0$, to the unique solution $\nu_t$ of (10).

We note that by the law of large numbers for empirical distributions, the condition of convergence of $\nu^{(N)}_0$ to $\nu_0$ is satisfied, e.g., when the $w^{(i)}_0$ are drawn independently at random from $\nu_0$. The proof of this result is largely standard under the given assumptions, and is provided in the appendix for completeness. The idea of the proof is a canonical propagation of chaos argument (Sznitman, 1991). In a nutshell, the first step of the argument establishes sufficient regularity of the gradient dynamics, which guarantees existence and uniqueness of the solution to (10). Then, one bounds the difference in differential updates for the particle system and the mean-field dynamics by comparing them with the evolution of the particle system according to a linear, time-inhomogeneous PDE using the drift term of the mean-field model. The proof is finally concluded by an application of the Gronwall inequality. The main difficulty with respect to similar results in the literature is to establish the needed Lipschitz continuity of the vector field driving the transport PDE: while this is an immediate consequence of assumptions on the activation functions and on the risk functional in the supervised setting, proving this type of regularity requires more effort in the RL setting given the involved dependence of the vector field on the measure $\nu_t$.

4.3. OPTIMALITY

After discussing the connection between particle dynamics and mean-field equations, we present the main convergence result of this paper: Theorem 4.2. Let Assumption 1 hold and ν t given by (10) converge to ν * , then π ν * = π * , the optimal policy for (3). Thus if the policy gradient dynamics (10) converges to a stationary point, that point must be a global minimizer. Again, we emphasize that in our regularized setting π * is given by a probability distribution, and can thus be represented as a softmax policy. We prove this result in three steps. First, we connect the optimality of a stationary point with the support of the underlying measure in parameter space. More specifically, we show in Lemma C.1 that by the expressivity of φ, the transport vector field of suboptimal fixed points of the dynamics (10) cannot vanish everywhere in parameter space. This implies that a measure with sufficient support cannot correspond to a suboptimal fixed point. We then show in Lemma C.2 that such sufficient notion of support (Assumption 1 c) ) is preserved by the mean-field policy gradient dynamics (10) throughout training. For any finite time, this is true by topological arguments: the separation property of the measure cannot be altered by the action of a continuous vector field such as (10). We note in particular that we do not prove that assumption a) in Section 3 holds in this case. Finally, in Lemma C.3 we combine the above results and prove that spurious fixed points are avoided by the policy gradient dynamics (10) when initialized properly. To establish this we argue by contradiction: assuming that we are approaching such a spurious fixed point ν at time t 0 , we show in Lemma C.5 that the velocity field will change little for any t > t 0 . In particular, it follows that in this regime the dynamics of ( 10) can be approximated by the gradient descent dynamics (in particle space) of an approximately fixed potential. 
On the other hand, by Assumption 1 c) and by the homogeneity of ψ, we are able to show that by Lemma C.2 a positive amount of measure ν will fall in a forward invariant region where its ω 0 component will grow linearly in t (which exists by Lemma C.1), thereby eventually contradicting the assumption that ν is a fixed point of (10). There are two main conceptual differences between the proof outlined above and the one carried out in the supervised learning setting. On one hand, a necessary step in our proof is to establish Lipschitz continuity of the vector field defining the transport equation ( 10), also needed for convergence of the particle dynamics as discussed above. On the other hand, the landscape of the objective function for policy gradients does not enjoy the convexity (WRT the predictor) typically assumed in the supervised case. To exclude the existence of local minima we assume sufficient expressivity of the activation functions, absent in the supervised analysis. This assumption is key to deduce optimality of fixed points of (10) in Theorem 4.2 in our less regular setting.

5. NUMERICAL EXAMPLES

To test our theoretical results in a simple setting we train a wide, single hidden layer neural network with policy gradients to learn the optimal softmax policy (8) for entropy-regularized rewards with parametrization (7) and regularization parameter $\tau = 0.2$. We do so in two separate settings:
(a) $S = \{0\}$, $A = [0, 1]$. This setting corresponds to the bandit framework discussed in Section 3.
(b) $S \times A$ is a grid of size $100 \times 100$ in the set $[0, 1]^2$. In this case, we have chosen a discount factor of $\gamma = 0.7$ and a transition process given by $P(s, a, s') = 0.9\,\delta(s' - a) + 0.1/100$ (i.e., an action $a$ leads to the corresponding state $s' = a$ with probability 0.9 and to a uniformly distributed state with probability 0.1). At each iteration we have computed the exact distribution $\rho_\pi$ by computing the resolvent of the (weighted) transition matrix.
In both cases, we defined the optimal Q-function as $Q^*(s, a) = \tau f^*_w(s, a)$, where $f^*_w(s, a)$ is given by a single hidden layer neural network of width $n = 5$, with ReLU nonlinearities and weights $w$ drawn independently and identically distributed from a centered normal distribution with variance $\sigma^2 = 4$, i.e., $Q^*(s, a) = \tau f^*_w(s, a)$ for $w_i \sim \mathcal{N}(0, 4)$. We learn the optimal policy for the problem defined above using a single hidden layer neural network of width $N = 800$ with ReLU nonlinearities in the mean-field regime (7), used as energy for a softmax policy (8). The initialization of the student network is as follows: first-layer weights are drawn independently at random from a centered normal distribution with variance $\sigma^2 = 4$, while output weights are initialized at 0. The model is trained according to (4) with fixed step-size $\beta_t = 10^{-3}$. We report the results of this training procedure in Fig. 1, where we notice that all the paths monotonically decrease the error $\mathcal{E}(\nu^*) - \mathcal{E}(\nu_t)$, as predicted by our results.
Note that the convergence rate of the model varies across experiments, consistently with the purely qualitative nature of the convergence result we proved.
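A minimal sketch of an experiment in the spirit of setting (a) is given below. It is not the code used for Fig. 1: the width $N = 200$, the step-size, the number of iterations, the grid discretization of $A$, the $1/n$ scaling of the teacher network, and the use of exact (rather than sampled) gradients are all assumptions made to keep the example short. Each particle $(\omega_0, \theta)$ is moved along the covariance vector field (12) with $\psi(a; \omega) = \omega_0\, \mathrm{ReLU}(\theta_1 a + \theta_2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
tau, beta, n_steps, N, M = 0.2, 5e-3, 400, 200, 100   # illustrative values
a = (np.arange(M) + 0.5) / M                          # grid on A = [0, 1]
da = 1.0 / M

def relu(z):
    return np.maximum(z, 0.0)

# Teacher: width-5 ReLU network with N(0, 4) weights; reward r = Q* = tau * f*
u = rng.normal(0.0, 2.0, size=(5, 2))
c = rng.normal(0.0, 2.0, size=5)
r = tau * np.mean(c[:, None] * relu(u[:, :1] * a + u[:, 1:]), axis=0)
V_star = tau * np.log(np.sum(np.exp(r / tau)) * da)   # optimal value tau * log Z

# Student: first-layer weights ~ N(0, 4), output weights initialized at 0
theta = rng.normal(0.0, 2.0, size=(N, 2))
w0 = np.zeros(N)

def policy_and_value(w0, theta):
    z = theta[:, :1] * a + theta[:, 1:]               # pre-activations, shape (N, M)
    f = np.mean(w0[:, None] * relu(z), axis=0)        # mean-field energy (7)
    p = np.exp(f - f.max())
    pi = p / (p.sum() * da)                           # softmax policy density (8)
    g = r - tau * np.log(pi)
    return z, pi, g, np.sum(g * pi) * da              # last entry: value V_nu

values = []
for _ in range(n_steps):
    z, pi, g, V = policy_and_value(w0, theta)
    values.append(V)
    weight = (g - V) * pi * da                        # centered weights of Cov_pi[., g]
    active = (z > 0).astype(float)                    # derivative of ReLU
    grad_w0 = relu(z) @ weight                        # Cov_pi[phi_i, g]
    grad_theta = w0[:, None] * np.stack(
        [(active * a) @ weight, active @ weight], axis=1
    )                                                 # w0_i * Cov_pi[grad_theta phi_i, g]
    w0 = w0 + beta * grad_w0                          # ascent step (time rescaled by N)
    theta = theta + beta * grad_theta
values.append(policy_and_value(w0, theta)[3])
errors = V_star - np.array(values)                    # analogue of E(nu*) - E(nu_t)
```

Plotting `errors` against the iteration index gives the analogue of the error curves described above; the sketch only guarantees by construction that the error is nonnegative, and any decay rate it shows is specific to these assumed hyperparameters.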

6. CONCLUSIONS AND FUTURE WORK

This work addresses the problem of optimality of policy gradient algorithms, a workhorse of deep reinforcement learning, when combined with mean-field models such as neural networks. More specifically, we provide a mean-field formulation of the parametric dynamics of policy gradient algorithms for entropy-regularized MDPs and prove that, under mild assumptions, all fixed points of such dynamics are optimal. This extends similar results obtained in the "neural" or "lazy" regime to the mean-field one, which is known to be much more expressive (E et al., 2019; Ghorbani et al., 2020), but also highly nonlinear. The latter feature currently prevents us from obtaining convergence results for these models, except in very specific settings (Chizat, 2019; Javanmard et al., 2019). Interesting avenues for future research include relaxing the adiabaticity assumption, i.e., considering the stochastic approximation problem resulting from the finite number of samples and the finite gradient step-size, as well as establishing quantitative bounds for models with a large, but finite, number of parameters. Probably the most important open question, however, concerns establishing quantitative convergence of mean-field dynamics of neural networks: even in the supervised setting, despite recent results in specific settings (Chizat, 2019; Javanmard et al., 2019), these guarantees remain mainly out of reach.

ACKNOWLEDGMENTS.

AA acknowledges the support of the Swiss National Science Foundation through the grant P2GEP2-17501 and by the NSF grant DMS-1613337. The work of JL is in part supported by the US National Science Foundation via grants CCF-1910571 (Duke TRIPODS) and DMS-2012286.

A DERIVATION OF SOFTMAX POLICY GRADIENT DYNAMICS

Lemma A.1. The gradient of the entropy-regularized value function can be written as

$$\mathbb{E}_{s \sim \rho_0}\big[\nabla_w V_\pi(s)\big] = \mathbb{E}_{s \sim \rho_\pi,\, a \sim \pi_w}\big[\nabla_w \log \pi_w(s, a)\,\big(Q_\pi(s, a) - \tau \log \pi_w(s, a)\big)\big],$$

and thus the policy gradient dynamics is given by (5).

Proof. We choose throughout $\bar\pi$ as the Lebesgue measure, and use that $\pi_w$ is absolutely continuous with respect to $\bar\pi$. Taking the gradient of (1) using a parametric policy $\pi_w$ we obtain

$$\nabla_w V_{\pi_w}(s) = \nabla_w \mathbb{E}_{\pi_w}\Big[\sum_{t=0}^\infty \gamma^t \big(r(s_t, a_t, s_{t+1}) - \tau D_{KL}(\pi_w(s_t, \cdot); \bar\pi(\cdot))\big) \,\Big|\, s_0 = s\Big]$$
$$= \nabla_w \int_{S \times A} \Big(r(s, a_1, s_1) - \tau \log \frac{\pi_w(s, a_1)}{\bar\pi(a_1)} + \gamma V_{\pi_w}(s_1)\Big)\, P(s, a_1, ds_1)\, \pi_w(s, da_1)$$
$$= \int_{S \times A} \Big(r(s, a_1, s_1) - \tau \log \frac{\pi_w(s, a_1)}{\bar\pi(a_1)} + \gamma V_{\pi_w}(s_1)\Big)\, P(s, a_1, ds_1)\, \nabla_w \pi_w(s, da_1)$$
$$\quad + \int_{S \times A} \Big(-\tau \nabla_w \log \frac{\pi_w(s, a_1)}{\bar\pi(a_1)} + \gamma \nabla_w V_{\pi_w}(s_1)\Big)\, P(s, a_1, ds_1)\, \pi_w(s, da_1). \tag{A.1}$$

Now, since $\int_S P(s, a, ds') = 1$ and $\int_A \pi_w(s, da') = 1$ for all $s, a, w$, we have for the first term in brackets in the last line

$$\int_{S \times A} \nabla_w \log \frac{\pi_w(s, a_1)}{\bar\pi(a_1)}\, P(s, a_1, ds_1)\, \pi_w(s, da_1) = \int_A \nabla_w \pi_w(s, a_1)\,da_1 = \nabla_w \int_A \pi_w(s, a_1)\,da_1 = 0.$$

On the other hand, we can rewrite the second term in brackets as $\mathbb{E}_{\pi_w}[\gamma \nabla V_{\pi_w}(s_1) \mid s_0 = s]$, and recognize the LHS of (A.1) evaluated at the next state $s_1$ in the expectation. Therefore, we can sequentially repeat the same computation as above, and recalling the definition of $\rho_{\pi_w}(\cdot)$ and $Q_\pi(s, a)$ in (6) and (2) we obtain

$$\nabla_w V_{\pi_w}(s) = \int_A \sum_{t=0}^\infty \gamma^t \Big(r(s_t, a_t, s_{t+1}) - \tau \log \frac{\pi_w(s_t, a_t)}{\bar\pi(a_t)} + \gamma V_{\pi_w}(s_{t+1})\Big)\, \nabla_w \pi_w(s_t, da_t)\,\Big|_{s_0 = s}$$
$$= \mathbb{E}_{\pi_w}\Big[\sum_{t=0}^\infty \gamma^t \Big(Q_{\pi_w}(s_t, a_t) - \tau \log \frac{\pi_w(s_t, a_t)}{\bar\pi(a_t)}\Big)\, \nabla_w \log \pi_w(s_t, a_t) \,\Big|\, s_0 = s\Big]$$
$$= \int_{S \times A} \Big(Q_{\pi_w}(s, a) - \tau \log \frac{\pi_w(s, a)}{\bar\pi(a)}\Big)\, \nabla_w \log \pi_w(s, a)\, \rho_{\pi_w}(ds)\, \pi_w(s, da),$$

where in the second line we have used that, if $\pi_w > 0$ on $A$, $\pi_w(s_t, da_t)\, \nabla_w \log \pi_w(s_t, a_t) = \nabla_w \pi_w(s_t, da_t)$.

Proposition 2.1. For a fixed initial distribution $\rho_0$, the dynamics (10) is the Wasserstein gradient flow of the energy functional $\mathcal{E}[\nu] = \mathbb{E}_{s_0 \sim \rho_0}[V_{\pi_\nu}(s_0)]$.

Proof.
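We find the potential of the gradient flow by functional differentiation of $E$:

$$\frac{\delta}{\delta\nu}E[\nu](\omega) = \int_{\mathcal S\times\mathcal A}\frac{\delta E}{\delta\pi}(s,a)\,\frac{\delta\pi_\nu}{\delta\nu}(s,a;\omega)\,ds\,da \tag{A.2}$$

and consider the two terms in the integral separately, starting from the second:

$$\begin{aligned}
\frac{\delta\pi_\nu}{\delta\nu}(s,a;\omega) &= \frac{\delta}{\delta\nu}\,\frac{e^{\tau\int\psi(s,a;\omega')\nu(d\omega')}}{\int_{\mathcal A}e^{\tau\int\psi(s,a';\omega')\nu(d\omega')}\,da'}\\
&= \frac{1}{\int_{\mathcal A}e^{\tau\int\psi(s,a';\omega')\nu(d\omega')}\,da'}\,\frac{\delta}{\delta\nu}e^{\tau\int\psi(s,a;\omega')\nu(d\omega')} - \frac{e^{\tau\int\psi(s,a;\omega')\nu(d\omega')}}{\big(\int_{\mathcal A}e^{\tau\int\psi(s,a';\omega')\nu(d\omega')}\,da'\big)^2}\int_{\mathcal A}\frac{\delta}{\delta\nu}e^{\tau\int\psi(s,a';\omega')\nu(d\omega')}\,da'\\
&= \tau\Big(\psi(s,a;\omega) - \int_{\mathcal A}\psi(s,a';\omega)\,\pi_\nu(s,da')\Big)\pi_\nu(s,a).
\end{aligned}\tag{A.3}$$

Published as a conference paper at ICLR 2021

For the first term in the integrand of (A.2), we use $\pi(s,a)$ as a density and obtain

$$\begin{aligned}
\frac{\delta E}{\delta\pi}(s,a) &= \frac{\delta}{\delta\pi}\Big[\int_{\mathcal S\times\mathcal A}\big(r(s,a)-\tau\log\pi(s,a)\big)\,\rho_\pi(ds)\,\pi(s,da)\Big](s,a)\\
&= \int_{\mathcal S}\big(r(s,a)-\tau(\log\pi(s,a)+1)\big)\,\rho_\pi(s',s)\,\rho_0(ds') &&\text{(A.4)}\\
&\quad+\int_{\mathcal S^2\times\mathcal A}\big(r(s'',a')-\tau\log\pi(s'',a')\big)\,\frac{\delta\rho_\pi(s',s'')}{\delta\pi(s,a)}\,\pi(s'',da')\,ds''\,\rho_0(ds'), &&\text{(A.5)}
\end{aligned}$$

where we have used that $\frac{\delta}{\delta\pi}[\pi\log\pi](s,a)=\log\pi(s,a)+1$. We evaluate the variational derivative of $\rho_\pi$ as

$$\begin{aligned}
\frac{\delta\rho_\pi(s',s'')}{\delta\pi(s,a)} &= \sum_{t=0}^{\infty}\gamma^t\,\frac{\delta}{\delta\pi}P^t_\pi(s',s'')(s,a)\\
&= \gamma\int_{\mathcal S}\Big[\frac{\delta P_\pi(s',\tilde s)}{\delta\pi(s,a)}\sum_{t=0}^{\infty}\gamma^tP^t_\pi(\tilde s,s'') + P_\pi(s',\tilde s)\,\frac{\delta}{\delta\pi(s,a)}\sum_{t=0}^{\infty}\gamma^tP^t_\pi(\tilde s,s'')\Big]d\tilde s\\
&= \gamma\int_{\mathcal S}\Big[\delta(s,s')\,P(s',a,\tilde s)\,\rho_\pi(\tilde s,s'') + P_\pi(s',\tilde s)\,\frac{\delta\rho_\pi(\tilde s,s'')}{\delta\pi(s,a)}\Big]d\tilde s.
\end{aligned}$$

We further recognize the same derivative on the RHS of the above expression, which allows us to write

$$\frac{\delta\rho_\pi(s',s'')}{\delta\pi(s,a)} = \gamma\sum_{t=0}^{\infty}\gamma^tP^t_\pi(s',s)\int_{\mathcal S}P(s,a,\tilde s)\,\rho_\pi(\tilde s,s'')\,d\tilde s.$$

Finally, we notice that the last term in (A.4) is constant in $a$ and therefore vanishes when integrated against (A.3), so we only consider

$$\begin{aligned}
\frac{\delta E}{\delta\pi}(s,a) &= \int_{\mathcal S^2\times\mathcal A}\big(r(s'',a')-\tau\log\pi(s'',a')\big)\Big[\delta_{s,s''}\delta_{a,a'}\,\rho_\pi(s',s) + \pi(s'',a')\,\frac{\delta\rho_\pi(s',s'')}{\delta\pi(s,a)}\Big]\rho_0(ds')\,ds''\,da'\\
&= \int_{\mathcal S^2\times\mathcal A}\sum_{t=0}^{\infty}\gamma^tP^t_\pi(s',s)\,\big(r(s'',a')-\tau\log\pi(s'',a')\big)\Big[\delta_{s,s''}\delta_{a,a'} + \gamma\,\pi(s'',a')\int_{\mathcal S}P(s,a,\tilde s)\,\rho_\pi(\tilde s,s'')\,d\tilde s\Big]\rho_0(ds')\,ds''\,da'\\
&= \sum_{t=0}^{\infty}\gamma^t\int_{\mathcal S}P^t_\pi(s',s)\,\big(Q_\pi(s,a)-\tau\log\pi(s,a)\big)\,\rho_0(ds')\\
&= \big(Q_\pi(s,a)-\tau\log\pi(s,a)\big)\,\rho_\pi(s).
\end{aligned}\tag{A.6}$$

We conclude by noting that combining (A.3) and (A.6) we obtain $C_\pi[\psi(\omega),Q_\pi-\tau\log\pi](s)$, and the Wasserstein gradient flow corresponding to this potential is (10).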
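The identity of Lemma A.1 can be checked numerically in a toy case. The sketch below is an illustration only and rests on assumptions not made in the paper: a single state, finitely many actions, a one-step problem (so that $Q = r$), the counting measure as reference, and a tabular softmax policy with logits `theta`. All function names are hypothetical. It compares the covariance-type expression from the lemma with a central finite-difference gradient of the regularized objective.

```python
import math

def softmax(theta):
    # Numerically stable softmax over a list of logits.
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]

def objective(theta, rewards, tau):
    # One-step entropy-regularized value: sum_a pi(a) * (r(a) - tau * log pi(a)).
    pi = softmax(theta)
    return sum(p * (r - tau * math.log(p)) for p, r in zip(pi, rewards))

def lemma_gradient(theta, rewards, tau):
    # E_{a~pi}[ grad log pi(a) * (Q(a) - tau log pi(a)) ] with Q = r.
    # For tabular softmax, d log pi(a) / d theta_k = 1[a == k] - pi(k),
    # which gives pi(k) * (delta(k) - E_pi[delta]).
    pi = softmax(theta)
    delta = [r - tau * math.log(p) for p, r in zip(pi, rewards)]
    mean_delta = sum(p * d for p, d in zip(pi, delta))
    return [p * (d - mean_delta) for p, d in zip(pi, delta)]

def finite_difference_gradient(theta, rewards, tau, h=1e-6):
    # Central differences of the objective, one coordinate at a time.
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        dn = list(theta)
        up[k] += h
        dn[k] -= h
        grad.append((objective(up, rewards, tau) - objective(dn, rewards, tau)) / (2 * h))
    return grad

theta = [0.3, -1.2, 0.8, 0.1]
rewards = [1.0, 0.5, -0.2, 2.0]
tau = 0.7
g_lemma = lemma_gradient(theta, rewards, tau)
g_fd = finite_difference_gradient(theta, rewards, tau)
max_gap = max(abs(a - b) for a, b in zip(g_lemma, g_fd))
```

The explicit $-\tau\nabla_w\log\pi_w$ term does not appear in `lemma_gradient` because it integrates to zero, exactly as in the step following (A.1).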

B PROOFS OF THE MANY-PARTICLE LIMIT

Theorem 4.1. Let Assumption 1 hold and let $w^{(N)}_t$ be a solution of (5) with initial condition $w^{(N)}_0\in\mathcal W=\Omega^N$. If $\nu^{(N)}_0$ converges to $\nu_0\in\mathcal P_2(\Omega)$ in Wasserstein distance $W_2$, then $\nu^{(N)}_t$ converges, for every $t>0$, to the unique solution $\nu_t$ of (10).

Proof. As anticipated in the main text, the proof is divided into two parts:

1. We prove sufficient regularity of the dynamics (10), which allows us to establish existence and uniqueness of its solution.
2. We leverage the regularity proven above to establish a propagation of chaos result, showing that the system of interacting particles behaves asymptotically as its mean-field limit.

This proof must be carried out in our context because the dependence of $E(\nu)$ on $\nu$ is more involved than in e.g., Mei et al. (2018); Rotskoff & Vanden-Eijnden (2018); Chizat & Bach (2018). The steps of the derivation are nonetheless mainly standard, see e.g., (Sznitman, 1991).
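To make the objects in Theorem 4.1 concrete, the following sketch simulates a finite-particle system of the type considered there. It is a toy illustration under assumptions that are not in the paper: a single state, four actions with hypothetical scalar embeddings `xs`, the counting measure as reference, units $\psi(a;\omega)=\omega_0\tanh(\bar\omega\,x_a)$, and an explicit Euler discretization with step size `step`. Each particle moves along $N$ times the gradient of the energy, which is the mean-field scaling.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def mean_field_logits(particles, xs):
    # f(a) = (1/N) * sum_i w0_i * tanh(w1_i * x_a): integral of psi against the empirical measure.
    n = len(particles)
    return [sum(w0 * math.tanh(w1 * x) for w0, w1 in particles) / n for x in xs]

def energy(particles, xs, rewards, tau):
    # One-step entropy-regularized value of the softmax policy.
    pi = softmax(mean_field_logits(particles, xs))
    return sum(p * (r - tau * math.log(p)) for p, r in zip(pi, rewards))

def velocities(particles, xs, rewards, tau):
    # Covariance form of the gradient: sum_a pi(a) * (delta(a) - E_pi[delta]) * grad_omega psi(a; omega).
    pi = softmax(mean_field_logits(particles, xs))
    delta = [r - tau * math.log(p) for p, r in zip(pi, rewards)]
    mean_delta = sum(p * d for p, d in zip(pi, delta))
    weights = [p * (d - mean_delta) for p, d in zip(pi, delta)]
    out = []
    for w0, w1 in particles:
        phi = [math.tanh(w1 * x) for x in xs]
        dphi = [x * (1.0 - t * t) for x, t in zip(xs, phi)]
        v0 = sum(g * t for g, t in zip(weights, phi))        # component along w0
        v1 = w0 * sum(g * d for g, d in zip(weights, dphi))  # component along w1
        out.append((v0, v1))
    return out

xs = [-1.0, -0.3, 0.4, 1.0]          # hypothetical action embeddings
rewards = [0.0, 1.0, 0.2, -0.5]
tau, step, n_particles = 0.2, 0.1, 40
# Initial measure: w0 = 0 for every particle, w1 spread evenly over [-2, 2].
particles = [(0.0, -2.0 + 4.0 * i / (n_particles - 1)) for i in range(n_particles)]

energy_start = energy(particles, xs, rewards, tau)
for _ in range(300):
    vel = velocities(particles, xs, rewards, tau)
    particles = [(w0 + step * v0, w1 + step * v1)
                 for (w0, w1), (v0, v1) in zip(particles, vel)]
energy_end = energy(particles, xs, rewards, tau)

# Largest value any policy can reach in this one-step problem.
energy_max = tau * math.log(sum(math.exp(r / tau) for r in rewards))
```

Increasing `n_particles` while keeping the initial spread fixed gives a crude way to look at the many-particle limit; the sketch makes no claim about rates.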

B.1 REGULARITY

We prove existence and uniqueness of the gradient flow dynamics (10) through standard arguments from the optimal transportation literature (see e.g., (Ambrosio et al., 2008)). More specifically, recalling that $\pi=\pi_\nu$, we leverage the Lipschitz continuity of the vector field

$$F_t(\omega,\nu) = \int_{\mathcal S}C_\pi\big[\nabla_\omega\psi(s,\cdot;\omega),\,Q_\pi(s,\cdot)-\tau\log\pi(s,\cdot)\big]\,\rho_\pi(ds)$$

with respect to $\nu$. To prove this regularity result, we decompose

$$E(\nu) = R\Big(\int\psi\,\nu\Big)\quad\text{for}\quad R(f) = S\circ\pi(f), \tag{B.1}$$

where $S:(\mathcal S\to M^1_+(\mathcal A))\to\mathbb R$ maps $\mu\mapsto\mathbb{E}_\mu[V_\mu(s)\mid s\sim\rho_0]$ and $\pi:L^2(\mathcal S\times\mathcal A)\to(\mathcal S\to M^1_+(\mathcal A))$ is the softmax policy parametrization (8) of its argument. Recalling the definition of $Q_r$ from Assumption 1 and denoting $F_r=\{\int\psi\,\nu : \operatorname{supp}\nu\subset Q_r\}$, we further define the norms and constants needed in the following proof as:

$$\|D\psi\|_{r,\infty} = \sup_{\omega\in Q_r}\|D\psi_\omega\|,\qquad L_{D\psi} = \sup_{\omega,\omega'\in Q_r}\frac{\|D\psi_\omega-D\psi_{\omega'}\|}{\|\omega-\omega'\|_2},$$

$$\|DR\|_{r,\infty} = \sup_{\psi(\cdot;\omega)\,:\,\omega\in Q_r}\|DR_\psi\|,\qquad L_{DR} = \sup_{\psi,\psi'\in F_r}\frac{\|DR_\psi-DR_{\psi'}\|}{\|\psi-\psi'\|_2},$$

where $\|\cdot\|$ denotes the operator norm. While for any $r>0$ the boundedness of $\|D\psi\|_{r,\infty}$, $L_{D\psi}$ results directly from Assumption 1, more work is needed to prove that $\|DR\|_{r,\infty}<\infty$, $L_{DR}<\infty$. We prove this in Lemma C.5 below, and proceed with the proof of convergence of the particle dynamics. For any $r$ and corresponding $Q_r$ from Assumption 1 we define the localized functional

$$E^{(r)}(\nu) = \begin{cases}E(\nu) & \text{if } \operatorname{supp}(\nu)\subset Q_r,\\ \infty & \text{else.}\end{cases}$$

Furthermore, we say that a coupling $\gamma\in M^1_+(\Omega\times\Omega)$ is an admissible transport plan if both its marginals have support in $Q_r$ and finite second moments. To every admissible transport plan, for $p\ge1$ we associate a transportation cost $C_p(\gamma)=\big(\int_{\Omega^2}|\omega-\omega'|^p\,d\gamma(\omega,\omega')\big)^{1/p}$. We prove the following results: for every $r>0$ we have

1. There exists $\lambda_r>0$ such that for any admissible transport plan $\gamma$, defining the interpolation map $\nu^\gamma_t:=(t\Pi^0+(1-t)\Pi^1)_\#\gamma$, the function $t\mapsto E(\nu^\gamma_t)$ is differentiable with Lipschitz continuous derivative with constant $\lambda_rC_2^2(\gamma)$.
2. Let $\nu_0$ have support in $Q_r$. Then for any given transport plan $\gamma$ with first marginal given by $\nu_0$, a velocity field $F$ satisfies

$$E\big((\Pi^1)_\#\gamma\big) \ge E(\nu_0) + \int F(u)\cdot(u'-u)\,d\gamma(u,u') + o(C_2(\gamma)) \tag{B.2}$$

if and only if $F(u)$ is in the subdifferential of $DE_\nu(u):=C_\pi[\psi,Q_\pi-\tau\log\pi](u)$ for $g\in L^2(\mathcal S\times\mathcal A)$ (projected in the interior of $Q_r$ when $u\in\partial Q_r$) for $\nu_0$-almost every $u\in\Omega$.

The proof of the two points above corresponds to (Chizat & Bach, 2018, Lemma B.2). We sketch the proof of these two points below, referring to the original reference for the details.

1. By the Lipschitz continuity of $\psi:\Omega\to L^2(\mathcal S\times\mathcal A)$ and $R'(f)=DR_f=C_\pi[\,\cdot\,,Q_\pi-\tau\log\pi]$ in $Q_r$, the energy $E^{(r)}(\nu^\gamma_t)$ transported along an interpolating path $\nu^\gamma_t$ is differentiable and we can write its derivative as

$$\frac{d}{dt}E^{(r)}(\nu^\gamma_t) = \int R'\Big(\int\psi\,\nu^\gamma_t\Big)\,D\psi_{(1-t)\omega+t\omega'}(\omega'-\omega)\,d\gamma(\omega,\omega').$$

Then, again by the Lipschitz continuity of $D\psi$ and $DR$, we have for $0\le t<t'<1$

$$\begin{aligned}
\Big|\frac{d}{dt}E^{(r)}(\nu^\gamma_t)-\frac{d}{dt'}E^{(r)}(\nu^\gamma_{t'})\Big| &\le \Big|\int\Big(R'\Big(\int\psi\,\nu^\gamma_t\Big)-R'\Big(\int\psi\,\nu^\gamma_{t'}\Big)\Big)\,D\psi_{(1-t')\omega+t'\omega'}(\omega'-\omega)\,d\gamma(\omega,\omega')\Big|\\
&\quad+\Big|\int R'\Big(\int\psi\,\nu^\gamma_t\Big)\big(D\psi_{(1-t)\omega+t\omega'}-D\psi_{(1-t')\omega+t'\omega'}\big)(\omega'-\omega)\,d\gamma(\omega,\omega')\Big|\\
&\le \lambda_r|t-t'|
\end{aligned}$$

for a $\lambda_r$ large enough, where in the last inequality we have used the uniform bounds on $DR$ in $Q_r$, that $|D\psi_{(1-t)\omega+t\omega'}-D\psi_{(1-t')\omega+t'\omega'}|\le(t'-t)L_{D\psi}|\omega-\omega'|$, and we applied Hölder's inequality to bound $C_1^2(\gamma)\le C_2^2(\gamma)$.

2. The proof of this result leverages an expansion of the functionals $R$, $\psi$ to the second order in their arguments:

$$\psi(\omega') = \psi(\omega)+D\psi_\omega(\omega'-\omega)+R_\psi(\omega,\omega'),\qquad R(g) = R(f)+DR_f(g-f)+R_R(f,g).$$

Recalling the Lipschitz bounds on the remainders $\|R_\psi(\omega,\omega')\|<\tfrac12L_{D\psi}|\omega-\omega'|^2$, $|R_R(f,g)|<\tfrac12L_{DR}\|f-g\|_2^2$ and combining the two expansions above we have, for a transport plan $\gamma$ with marginals $\nu_0=(\Pi^0)_\#\gamma$, $\nu_1=(\Pi^1)_\#\gamma$,

$$E^{(r)}(\nu_1) = E^{(r)}(\nu_0)+\int R'\Big(\int\psi\,\nu_0\Big)D\psi_u(u'-u)\,d\gamma(u,u')\,ds\,da+\mathcal R$$

for a remainder term $\mathcal R$. We can bound this remainder term, again by the Lipschitz regularity of $D\psi$ and $DR$ and by the boundedness of $D\psi$, $DR$ in $Q_r$, by $C_2(\gamma)^2$ and $C_1(\gamma)^2\le C_2(\gamma)^2$, thereby obtaining that

$$E^{(r)}(\nu_1) = E^{(r)}(\nu_0)+\int\!\!\int R'\Big(\int\psi\,\nu_0\Big)D\psi_u(u'-u)\,ds\,da\,d\gamma(u,u')+o(C_2(\gamma)).$$

Noting that the integrand against the coupling is the gradient flow vector field, the above uniquely characterizes the velocity field satisfying (B.2).

We note that point 1) above immediately implies that $E(\cdot)$ is $\lambda_r$-semiconvex along geodesics, while by 2) $E^{(r)}(\cdot)$ admits strong Wasserstein subdifferentials on its domain (Ambrosio et al., 2008, Definition 10.3.1). Combining the two results one obtains existence and uniqueness of the solutions of the Wasserstein gradient flow through (Ambrosio et al., 2008, Theorem 11.2.1).

B.2 PROPAGATION OF CHAOS

By the Lipschitz continuity of the transport field in (10) in $\nu_0$ with $\operatorname{supp}\nu_0\subset Q_r$, there exists a time $t_r>0$ such that $\operatorname{supp}\nu^{(N)}_s\subset Q_r$ for all $s\in[0,t_r]$, $N\in\mathbb N$. Consider now two times $0\le t_1<t_2\le t_r$. To prove the existence of the limiting curve $(\nu_t)_t$, we show that the curves $\nu^{(N)}_t$ are equicontinuous in $W_2$ uniformly in $N$ and as such possess a converging subsequence by the Arzelà-Ascoli theorem. To show equicontinuity, we bound the $W_2$ distance between distributions by coupling positions of the same particles at different times and using the Cauchy-Schwarz inequality:

$$W_2\big(\nu^{(N)}_{t_1},\nu^{(N)}_{t_2}\big)^2 \le \frac1N\sum_{i=1}^N\big\|w^{(i)}_{t_1}-w^{(i)}_{t_2}\big\|_2^2 \le \frac{t_2-t_1}{N}\sum_{i=1}^N\int_{t_1}^{t_2}\Big\|\frac{d}{ds}w^{(i)}_s\Big\|_2^2\,ds.$$

Combining the above with the identity

$$\frac{d}{dt}E\big(\nu^{(N)}_t\big) = \frac1N\sum_{i=1}^N\Big\langle\nabla_{w_i}E\big(\nu^{(N)}_t\big),\frac{d}{dt}w^{(i)}_t\Big\rangle = \frac1N\sum_{i=1}^N\Big\|\frac{d}{dt}w^{(i)}_t\Big\|_2^2$$

we have

$$W_2\big(\nu^{(N)}_{t_1},\nu^{(N)}_{t_2}\big) \le \sqrt{t_2-t_1}\,\Big(\int_{t_1}^{t_2}\frac{d}{ds}E\big(\nu^{(N)}_s\big)\,ds\Big)^{1/2} \le \sqrt{t_2-t_1}\,\Big(\sup_{\operatorname{supp}\nu\subset Q_r}E(\nu)-\inf_{\operatorname{supp}\nu\subset Q_r}E(\nu)\Big)^{1/2},$$

where we recall that $F_r=\{\nu\in M^1_+(\Omega):\operatorname{supp}\nu\subset Q_r\}$. In particular, the above continuity bound is independent of $N$, proving equicontinuity of $\nu^{(N)}$ in $W_2$.

We now prove that the limit point of the converging subsequence identified above must solve (10). To do so, we compare the differentials of both the mean-field and the particle dynamics to that of a linear inhomogeneous PDE: for any bounded and continuous $f:\mathbb R\times\mathbb R^m\to\mathbb R^m$, denoting by $E=\nu_tF_t\,dt$ and $E_N=\nu^{(N)}_tF^{(N)}_t\,dt$, where $F_t$, $F^{(N)}_t$ are the vector fields of the mean-field and particle system respectively, we write

$$\Big\|\int f(\omega)\,d(E-E_N)\Big\|_2 \le \|f\|_\infty\int\!\!\int\big\|F^{(N)}_t-F_t\big\|_2\,d\nu^{(N)}_t\,dt+\Big\|\int\!\!\int f\,F_t\,d\big(\nu^{(N)}_t-\nu_t\big)\,dt\Big\|.$$

By boundedness of $fF_t$ over $Q_r$, the second term converges by our choice of subsequence. For the first term, denoting throughout by $\|\cdot\|_{BL}$ the bounded Lipschitz norm, we leverage the Lipschitz continuity of the vector field $F$ with respect to the underlying parametric measure:

$$\big\|F^{(N)}_t-F_t\big\|_2 \le C_r\big\|\nu_t-\nu^{(N)}_t\big\|_{BL}$$

for $C_r>0$ large enough, again obtaining convergence by our choice of subsequence. This proves convergence of the particle model to the mean-field equation (10) on $[0,t_r]$.

To extend the time interval on which we prove convergence, we use that $E(\nu)$ (and thus $R$) decays along trajectories of (10). Consequently, by the boundedness of the differential $DR$ on sublevel sets of $R$, the Lipschitz constant of $E(\nu_t)$ is uniformly bounded. Using again the Lipschitz continuity of $F$ we can show that $\sup_{u\in Q_r}\|F\|<A+Br$, i.e., that particle velocities can grow at most linearly in $r$. An application of Grönwall's inequality then shows that for every $T>0$ there exists $r>0$ such that $\operatorname{supp}\nu_t\subset Q_r$ for all $t\in[0,T]$, and propagation of chaos follows.
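The equicontinuity estimate above chains two inequalities: the optimal coupling costs no more than the coupling that pairs each particle with itself, and the latter is controlled by Cauchy-Schwarz in time. The sketch below checks both steps in a deliberately simple setting chosen for illustration, not taken from the paper: scalar particles following the hypothetical trajectories $w_i(t)=\sin(a_it+b_i)$, for which the one-dimensional $W_2$ distance between empirical measures is obtained by sorting and the kinetic term has a closed form.

```python
import math

def w2_squared_1d(xs, ys):
    # Exact squared W2 distance between two empirical measures with equally many,
    # equally weighted atoms on the real line: match sorted atoms.
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    return sum((x - y) ** 2 for x, y in zip(xs_sorted, ys_sorted)) / len(xs)

def position(a, b, t):
    # Hypothetical particle trajectory w(t) = sin(a t + b).
    return math.sin(a * t + b)

def kinetic_integral(a, b, t1, t2):
    # Closed form of the integral over [t1, t2] of (d/ds sin(a s + b))^2 = a^2 cos^2(a s + b).
    def antiderivative(s):
        return a * a * (s / 2.0 + math.sin(2.0 * (a * s + b)) / (4.0 * a))
    return antiderivative(t2) - antiderivative(t1)

n = 50
params = [(0.5 + 0.1 * i, 0.37 * i) for i in range(n)]
t1, t2 = 0.2, 0.9

xs = [position(a, b, t1) for a, b in params]
ys = [position(a, b, t2) for a, b in params]

exact = w2_squared_1d(xs, ys)                                   # W_2^2 via optimal coupling
same_particle = sum((x - y) ** 2 for x, y in zip(xs, ys)) / n   # cost of the identity coupling
energy_bound = (t2 - t1) * sum(kinetic_integral(a, b, t1, t2) for a, b in params) / n
```

The three quantities are ordered as `exact <= same_particle <= energy_bound`, mirroring the displayed chain.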

C PROOFS OF OPTIMALITY

Theorem 4.2. Let Assumption 1 hold and let $\nu_t$ given by (10) converge to $\nu^*$; then $\pi_{\nu^*}=\pi^*$ $\mathcal S\times\mathcal A$-a.e.

Before proceeding to prove Theorem 4.2, we state the alternative form of Assumption 1 a) in the case where $\Theta\neq\mathbb R^{m-1}$. Our proof of the theorem above can be easily generalized to the setting of Assumption 2.

Assumption 2. Assume that $\omega=(\omega_0,\bar\omega)\in\mathbb R\times\Theta$ for $\Theta\subset\mathbb R^{m-1}$ which is the closure of a bounded open convex set. Furthermore $\psi(s,a;\omega)=\omega_0\,\phi(s,a;\bar\omega)$ where $\phi$ is bounded, differentiable and $D\phi_\omega$ is Lipschitz. Also, for all $f\in L^2(\mathcal S\times\mathcal A)$ the regular values of the map $\bar\omega\mapsto g_f(\bar\omega):=\int f(s,a)\,\phi(s,a;\bar\omega)\,da\,ds$ are dense in its range and $g_f(\bar\omega)$ satisfies Neumann boundary conditions (i.e., for all $\bar\omega\in\partial\Theta$ we have $dg_f(\bar\omega)(n_{\bar\omega})=0$ where $n_{\bar\omega}\in\mathbb R^{m-1}$ is the normal of $\partial\Theta$ at $\bar\omega$).

We prove Theorem 4.2 as sketched in Section 4 by first connecting the optimality and the support of stationary measures in parametric space through Lemma C.1, and then investigating how the dynamics preserves the full support property for any $t>0$ in Lemma C.2 and avoids spurious minima in Lemma C.3. Before starting this program we introduce the equivalent of greedy policies in the entropy-regularized setting. For a given $Q(s,a)$, the associated Boltzmann policy $\pi_B$ with respect to a reference measure $\bar\pi$ is given by

$$\pi_B(s,a) := \exp\big[(Q(s,a)-V_Q(s))/\tau\big]\quad\text{for}\quad V_Q(s) := \tau\log\mathbb{E}_{a\sim\bar\pi}\big[\exp[Q(s,a)/\tau]\big],$$

and satisfies

$$\pi_B(s,\cdot) = \arg\max_{\pi\in M^1_+(\mathcal A)}\big(\mathbb{E}_{a\sim\pi}[Q(s,a)]-\tau D_{KL}(\pi;\bar\pi)(s)\big).$$

One can then define the Boltzmann backup or soft Bellman backup operator $T_\tau$ that, for a given $Q$ and the associated Boltzmann policy $\pi_B$, gives the action-value function $T_\tau Q$ associated to $\pi_B$:

$$T_\tau Q(s,a) = r(s,a)+\gamma\tau\,\mathbb{E}_\pi\big[\log\mathbb{E}_{a_1\sim\bar\pi}[\exp(Q(s_1,a_1)/\tau)]\,\big|\,s_0=s\big]. \tag{C.1}$$

It is known (Haarnoja et al., 2018b, Theorem 1) that the fixed points of the above operator are optimal, i.e., they correspond to the optimal policy $\pi^*_B=\pi^*$ of the entropy-regularized MDP.
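The Boltzmann policy and the soft Bellman backup (C.1) can be illustrated on a finite problem. The sketch below assumes a uniform reference measure over finitely many actions and a small made-up MDP (the arrays `rewards` and `transitions` are hypothetical, as are all function names). It checks that $\pi_B$ is a normalized density with respect to the reference measure, that it attains the value $V_Q(s)$ in the variational problem above, and that iterating the backup converges to a fixed point.

```python
import math

def soft_value(q_row, tau):
    # V_Q(s) = tau * log E_{a ~ uniform}[exp(Q(s, a) / tau)], computed stably.
    m = max(q_row)
    mean_exp = sum(math.exp((q - m) / tau) for q in q_row) / len(q_row)
    return m + tau * math.log(mean_exp)

def boltzmann_density(q_row, tau):
    # Density of pi_B with respect to the uniform reference measure.
    v = soft_value(q_row, tau)
    return [math.exp((q - v) / tau) for q in q_row]

def regularized_objective(density, q_row, tau):
    # E_{a ~ pi}[Q(s, a)] - tau * KL(pi; reference), where pi has the given density.
    n = len(q_row)
    total = 0.0
    for d, q in zip(density, q_row):
        if d > 0.0:
            total += (d / n) * (q - tau * math.log(d))
    return total

def soft_backup(q, rewards, transitions, gamma, tau):
    # (T_tau Q)(s, a) = r(s, a) + gamma * sum_{s'} P(s, a, s') * V_Q(s').
    v = [soft_value(row, tau) for row in q]
    return [[rewards[s][a] + gamma * sum(p * v[s2] for s2, p in enumerate(transitions[s][a]))
             for a in range(len(q[s]))]
            for s in range(len(q))]

tau, gamma = 0.5, 0.9

# Single-state check of the Boltzmann policy.
q_row = [1.0, -0.5, 2.0]
density = boltzmann_density(q_row, tau)
normalization = sum(density) / len(density)
value_at_boltzmann = regularized_objective(density, q_row, tau)
value_at_uniform = regularized_objective([1.0, 1.0, 1.0], q_row, tau)

# Hypothetical two-state, two-action MDP: transitions[s][a] lists next-state probabilities.
rewards = [[1.0, 0.0], [0.5, 2.0]]
transitions = [[[0.8, 0.2], [0.1, 0.9]], [[0.5, 0.5], [0.3, 0.7]]]
q = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(400):
    q = soft_backup(q, rewards, transitions, gamma, tau)
q_next = soft_backup(q, rewards, transitions, gamma, tau)
residual = max(abs(x - y) for row, row_next in zip(q, q_next) for x, y in zip(row, row_next))
```

Convergence of the iteration in this finite setting relies on the backup being a $\gamma$-contraction in the supremum norm, which is why `residual` is tiny after a few hundred sweeps.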
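To state the first partial result towards the proof of Theorem 4.2, we observe that the $\omega_0$-component of the transport vector field in (10) can be written as

$$\Big(\int_{\mathcal S}C_{\pi_\nu}\big[\nabla_\omega\psi(s,\cdot;\omega),\,Q_{\pi_\nu}(s,\cdot)-\tau\log\pi_\nu\big]\,\rho_{\pi_\nu}(ds)\Big)_0 = \int_{\mathcal S}C_{\pi_\nu}\big[\phi(s,\cdot;\bar\omega),\,Q_{\pi_\nu}(s,\cdot)-\tau\log\pi_\nu\big]\,\rho_{\pi_\nu}(ds), \tag{C.2}$$

where we recall that $C_\pi[f,g]$ is the covariance operator with respect to the probability measure $\pi(s,da)$ introduced below (10). We note in particular that the above expression only depends on $\bar\omega$. With the above information at hand we now proceed to prove Lemma C.1, relating the value of (C.2) and the optimality of fixed points of (10):

Lemma C.1. Let Assumption 1 hold and let $\nu$ satisfy

$$\int_{\mathcal S\times\mathcal A}\phi(\bar\omega;s,a)\,\big(Q_{\pi_\nu}(s,a)-\tau\log\pi_\nu(s,a)-V_{\pi_\nu}(s)\big)\,\pi_\nu(s,da)\,\rho_{\pi_\nu}(ds) = 0, \tag{C.3}$$

$\bar\omega$-almost everywhere in $\Theta$. Then we have that $Q_{\pi_\nu}=Q_{\pi^*}$ holds $\bar\pi\rho_\pi$-a.e. in $\mathcal S\times\mathcal A$.

Proof of Lemma C.1. Assuming that (C.3) holds Lebesgue-a.e. in $\Theta$, by the assumed continuity of $\phi$ in $\bar\omega$ combined with the expressivity of $\phi$ (Assumption 1 b)) we must have that

$$Q_{\pi_\nu}(s,a)-\tau\log\pi_\nu(s,a)-V_{\pi_\nu}(s) = 0\qquad\pi_\nu\rho_{\pi_\nu}\text{-a.e.}$$

We then rewrite the above condition in compact notation as the fixed point equation $T_\tau Q_{\pi_\nu}(s,a)=Q_{\pi_\nu}(s,a)$ for the soft Q-learning or Boltzmann backup operator $T_\tau$ defined in (C.1). Since all fixed points of $T_\tau$ for $\gamma<1$ are optimal (Nachum et al., 2017, Theorem 3), we must have that $\pi_\nu=\pi_B[Q^*]=\pi^*$ for $\pi_\nu\rho_{\pi_\nu}$-almost every $(s,a)\in\mathcal S\times\mathcal A$. The result follows by equivalence of $\pi_\nu$ and $\bar\pi$.

Consequently, suboptimal fixed points of the dynamics (10) cannot satisfy (C.3) Lebesgue-a.e. in $\Theta$.

C.1 PROOF OF THEOREM 4.2

We prove below that spurious local minima that do not satisfy (C.3) $\Theta$-a.e. are avoided by the dynamics. We do so by leveraging the approximate gradient structure of the policy gradient vector field when $\nu_t$ is close to one of such stationary points, as discussed in the main text. Combining this fact with the assumed convergence to $\nu^*$ proves Theorem 4.2. We note that by the assumed homogeneity of $\psi$ in its first component, if $\nu,\nu'$ are such that

$$\int_{\mathbb R}\omega_0\,\nu(d\omega_0,d\bar\omega) = \int_{\mathbb R}\omega_0\,\nu'(d\omega_0,d\bar\omega)\quad\text{a.e.} \tag{C.4}$$

then

$$f_\nu(\cdot) = \int\omega_0\,\phi(\cdot;\bar\omega)\,\nu(d\omega_0,d\bar\omega) = \int\omega_0\,\phi(\cdot;\bar\omega)\,\nu'(d\omega_0,d\bar\omega) = f_{\nu'}(\cdot),$$

so that in turn we have $\pi_\nu=\pi_{\nu'}$ a.e. In other words, the homogeneity of the chosen class of approximators results in a degeneracy of the map $\psi:M^1_+(\Omega)\to L^2(\mathcal S\times\mathcal A)$.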
To remove this degeneracy in our analysis, we identify all the distributions $\nu,\nu'$ that are equivalent under (C.4) by defining the signed measure

$$h^1_\nu(d\bar\omega) := \int_{\mathbb R}\omega_0\,\nu(d\omega_0,d\bar\omega). \tag{C.5}$$

Leveraging this definition, we prove the desired result Theorem 4.2 in two key steps: we show that

1. the solution to (10) does not lose (projected) support for any finite time, thereby preserving the property from Assumption 1 c),
2. stationary points $\nu$ with $Q_{\pi_\nu}\neq Q_{\pi^*}$ (which by Lemma C.1 cannot have full projected support in $\Theta$) are avoided by the dynamics.

These facts are respectively summarized in the following lemmas:

Lemma C.2. Let Assumption 1 a) hold and let $\nu_0$ satisfy Assumption 1 c); then for every $t>0$, $\nu_t$ solving (10) with initial condition $\nu_0$ also satisfies Assumption 1 c).

Throughout, we let $\|\cdot\|_{BL}$ denote the bounded Lipschitz norm.

Lemma C.3. Let Assumption 1 hold and let $\nu$ be a fixed point of (10) such that (C.3) does not hold a.e. Then there exists $\varepsilon>0$ such that if $\|h^1_\nu-h^1_{\nu_{t_1}}\|_{BL}<\varepsilon$ for a $t_1>0$, there exists $t_2>t_1$ such that $\|h^1_\nu-h^1_{\nu_{t_2}}\|_{BL}>\varepsilon$.

Proof of Lemma C.2. Analogously to (Chizat & Bach, 2018, Lemma C.13), we aim to show that the separation property Assumption 1 c) is preserved by the evolution of $\nu_0$ along the characteristic curves $X(t,u)$ solving

$$\partial_tX(t,u) = F_t(X(t,u);\nu_t), \tag{C.6}$$

where $F_t$ is the transport field in (10). To reach this conclusion, the analogous result in Chizat & Bach (2018) only relies on the continuity of the map $u\mapsto X(t,u)$, established in (Chizat & Bach, 2018, Lemma B.4) under Assumption 1 a). Hence, it is sufficient for our purposes to establish continuity of the map $X(t,\cdot)$ from (C.6). This property, however, results immediately from the one-sided Lipschitz continuity of the vector field $F_t$ on $Q_r=[-r,r]\times\Theta$ uniformly on compact time intervals, which is in turn guaranteed by the Lipschitz continuity and Lipschitz smoothness of $\psi$ from Assumption 1 and boundedness of $r$.
To simplify the notation in the following proof, we denote throughout $\delta(\nu):=Q_{\pi_\nu}-\tau\log\pi_\nu-V_{\pi_\nu}$ and

$$\langle f,g\rangle_\pi := \int_{\mathcal S\times\mathcal A}f(s,a)\,g(s,a)\,\pi(s,da)\,\rho_\pi(ds).$$

Proof of Lemma C.3. We first claim that by Lemma C.1, for any spurious fixed point (such that $Q_{\pi_\nu}\neq Q_{\pi^*}$), there must exist a subset of $\Theta$ with positive Lebesgue measure where $\nu$ loses support and such that $\langle\nabla\psi,\delta(\nu)\rangle_\pi\neq0$. This is easily proven by contradiction: if $\langle\nabla\psi,\delta(\nu)\rangle_\pi=0$ a.e., then by Lemma C.1 we have that $Q_{\pi_\nu}=Q_{\pi^*}$. This implies that the quantity

$$g_\nu(\bar\omega) := \langle\partial_{\omega_0}\psi(\cdot;\omega),\delta(\nu)\rangle_{\pi_\nu} = \langle\psi(\cdot;(1,\bar\omega)),\delta(\nu)\rangle_{\pi_\nu} = \langle\phi(\cdot;\bar\omega),\delta(\nu)\rangle_{\pi_\nu} \tag{C.7}$$

cannot vanish a.e. on $\Theta$. Then, by Assumption 1 on the regularity of $g$, there exists a nonzero regular value $-\eta$ of $g_\nu(\bar\omega)$. Assuming without loss of generality that this regular value is negative, so that $\eta>0$ (else invert the signs of $\omega_0$ in the remainder of the proof), we define the nonempty sublevel set

$$G := \{(\omega_0,\bar\omega)\in\Omega : g_\nu(\bar\omega)<-\eta\}\quad\text{and}\quad G^+ = \{(\omega_0,\bar\omega)\in G : \omega_0>0\}. \tag{C.8}$$

Further denoting by $\bar G\subseteq\Theta$ the projection of $G$ onto $\Theta$, we have by definition that the gradient field of $g_\nu(\bar\omega)$ is orthogonal to the level set $\partial\bar G$, the latter being an orientable manifold of dimension $m-2$. Denoting by $n_{\bar\omega}$ the normal unit vector to $\partial\bar G$ in the outward direction, by continuity of $\nabla g_\nu(\bar\omega)$ when $\bar G$ is compact$^1$ we can bound the scalar product between the two away from 0, i.e., there exists

$$\beta := \min_{\bar\omega\in\partial\bar G}n_{\bar\omega}\cdot\nabla_{\bar\omega}g_\nu(\bar\omega) > 0.$$

We now prove that the stationarity assumption in an $\varepsilon$-neighborhood of a spurious fixed point,

$$\|h^1_\nu-h^1_{\nu_t}\|_{BL} < \varepsilon\quad\text{for all } t>t_1, \tag{C.9}$$

leads to a contradiction for $\varepsilon$ small enough. To do so, by Lemma C.5 we set $\varepsilon(\alpha,\eta,\beta)$ small enough so that for all $\nu_t$ such that (C.9) holds we have $g_{\nu_t}(\bar\omega)<-\eta/2$ on $\bar G$ and $n_{\bar\omega}\cdot\nabla_{\bar\omega}g_{\nu_t}>\beta/2$ on $\partial\bar G$. Then, the two inequalities above combined with $\partial_{\omega_0}\psi(\omega_0,\bar\omega)=\psi(1,\bar\omega)$ imply that the set $G^+$ defined above is forward invariant and therefore that $\partial_t\nu_t(G^+)\ge0$ as long as (C.9) holds.
Furthermore, by similar arguments we notice that characteristic trajectories cannot enter the set G \ G + after t 1 . Now, we consider two cases: either (i) a positive amount of mass is present at t 1 in the forward invariant set G + (ν t1 (G + ) > 0) or (ii) ν t1 (G + ) = 0. We discuss these two cases separately, along the lines of (Chizat & Bach, 2018, Lemmas C.4, C.18) , respectively. exists ε ∈ (0, η/4) such that τ q > τ , i.e., that the trajectory ω * reaches G + before q(t) > ε. Note that as long as t ∈ [0, τ q ] and ε ∈ (0, η/4) the negative terms on the RHS (C.11) can be bounded from below, and we have ω 0 (t) ≥ ω 0 (0) + η 2 t so that ω 0 (t) > ω 0 (0) ≥ -M . Consequently, for all t ∈ [0, τ ∧ τ q ] we bound the RHS of (C.12) as d dt q(t) < M ε + LM q(t). Using that q(0) = 0 and Grönwall inequality we can bound the total excursion in the ω component q(t) ≤ εM t exp [LM t]. Finally, setting τ 0 := 2(M + 1)/η ≥ -2(ω 0 (0) -1)/η > τ so that ω 0 (τ 0 ) > 1 we are still free to set ε small enough such that τ q > τ 0 > τ . Indeed, by monotonicity of the upper-bound on q(t) we have q(τ 0 ) ≤ 2ε(M + 1)M/η exp [2LM (M + 1))/η] ≤ η/4L , so that setting ε ∈ (0, η/4) concludes the proof. We now proceed to show the needed regularity of the potential g ν from (C.7) in terms of the signed measure h 1 ν defined in (C.5). In doing so, we also prove Lipschitz smoothness of the operator R defined in (B.1): Lemma C.5. For any r > 0, the operator R on F r = { ψν : supp ν ∈ Q r } is Lipschitz smooth, and DR f is bounded in the supremum norm. Furthermore, for all C 0 > 0 there exists α > 0 and ε > 0 such that for all ν, ν satisfying h 1 ν BL , h 1 ν BL < C 0 , h 1 ν -h 1 ν BL < ε, one has g ν -g ν C 1 ≤ α h 1 ν -h 1 ν BL . (C.13) To prove the above lemma, we first bound some relevant quantities. Throughout, by slight abuse of notation, we denote for any function f : S × A → R f 2 = sup s∈S A f (s, a) 2 da Lemma C.6. 
For all f, f ∈ F r there exists ε > 0, C , C , C > 0 such that if f -f 2 < ε one has π ν -π ν 2 ≤ C f -f 2 (C.14) ν -ν 1 ≤ C f -f 2 (C.15) Q πν -Q π ν 2 ≤ C f -f 2 (C.16) Proof of Lemma C.6. Throughout this proof, for simplicity of notation, we will write π = π ν and π = π ν . Furthermore, we use that for f ∈ F r there exists C 0 > 0 so that e -C0 φ C 1 ≤ exp[f (s, a)] ≤ e C0 φ C 1 , (C.17) implying, together with the assumed absolute continuity of 0 that Q π ∞ , π ∞ , π ∞ < ∞. Setting throughout τ = 1 to simplify the notation and combining the above with the pointwise upper bound e x < 1 + K r |x| for |x| < e C0 φ C 1 we obtain π -π 2 ≤ exp[f (s, a)] exp[f (s, a )]da - exp[f (s, a)] exp[f (s, a )]da 2 ≤ exp[f (s, a)] exp[f (s, a )]da - exp[f (s, a)] exp[f (s, a )]da 2 + exp[f (s, a)] exp[f (s, a )]da - exp[f (s, a)] exp[f (s, a )]da 2 ≤ exp[f (s, a)] exp[f (s, a )]da ∞ 1 -exp[f (s, a) -f (s, a)] 2 + exp[f (s, a )]da exp[f (s, a )]da -1 2 ≤ e 2C0 φ C 1 |A| K r f (s, a) -f (s, a) 2 + e 2C0 φ C 1 |A| exp[f (s, a ) -f (s, a )] -1 da 2 ≤ e 2C0 φ C 1 |A| K r (1 + e 4C0 φ C 1 ) f -f 2 =: C f -f 2 , (C.18) where we have denoted by |A| the Lebesgue measure of the action space A. We now proceed to establish the second bound in the statement of the lemma. In this case, denoting the t-steps transition probability as P t π (s, ds t ) = S t-1 P π (s, ds 1 )P π (s 1 , ds 2 ) . . . P π (s t-1 , ds t ) we have π -π 1 = ∞ t=1 γ t S 0 (ds 0 ) P t π (s 0 , ds) -P t π (s 0 , ds) . Observing that for any smooth ∈ M 1 + (S) for the operator norm of the difference in the above sum we have (P π -P π ) 1 = S A P (s, a, ds ) (π(s, da) -π (s, da)) 1 ≤ 1 P 1 π -π 2 ≤ (1 -γ) 2 γ C f -f 2 for large enough C , where we used (C.18) in the last line and the Lipschitz continuity of P in its second argument, and P 1 is the operator norm of the transition operator A P (s, a, ds )π (s, da) : M 1 + (A) → M 1 + (A), which is equal to 1. 
From this we conclude π -π 1 ≤ (1 -γ) 2 γ ∞ t=1 tγ t C f -f 2 = C f -f 2 Finally, defining for notational convenience R τ := r -τ D KL (π , π) we write: and bound each term separately letting C 1 , C 2 , C 3 > 0 be large enough constants. For the first, we have: ,a) da e f (s,a) da where we have used that all the terms appearing in DR f are bounded for every choice of r > 0. Q π -Q π 2 = γ P (s, a, ds ) (V π (s ) -V π (s )) (R τ -R τ ) (s , a ) π (s , ds )π(s , da ) ∞ ≤ 1 1 -γ π 2 log π -log π 2 ≤ 1 1 -γ π 2 f -f 2 + log e f (s To establish Lipschitz smoothness of R, letting f, f ∈ F r and denoting, to simplify notation, π = π f and π = π f , we proceed to bound the operator norm by splitting the RHS as (s, a)φ(s, a; ω)dsda(h 1 ν -h 1 ν )(dω) DR f -DR f ≤ sup ∈L 2 (S×A) : =1 |C π [ , Q π -τ log π] -C π [ , Q π -τ log π ]| ≤ sup ) π | ≤ π π ∞ 2 δ(π) -δ(π ) 2 ≤ π 1 π 2 2 Q π -Q π 2 + V π -V π 2 + τ log π π 2 ≤ π 1 π 2 2 (1 + π ∞ ) Q π -Q π 2 + Q π 2 π -π 2 + τ log π π 2 ≤ K 2 f -f 2 . (C. ≤ φ C 1 h 1 ν -h 1 ν BL , which combined with the φ C 1 -Lipschitz continuity of the map (s, a)φ(s, a; ω)dsda and with the Lipschitz smoothness of R concludes the proof.
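The first bound of Lemma C.6, (C.14), states that the softmax policy depends in a Lipschitz way on its argument. A finite-dimensional analogue is easy to probe numerically. The sketch below is an illustration with hypothetical test inputs, using $\tau=1$ as in the proof and a discrete action set in place of $\mathcal A$: it measures the ratio $\|\pi_f-\pi_{f'}\|_2/\|f-f'\|_2$ over a family of deterministic pairs $(f,f')$ and records the largest value seen. For the discrete softmax map this ratio is known to be at most 1.

```python
import math

def softmax(f):
    # Numerically stable softmax with tau = 1.
    m = max(f)
    exps = [math.exp(x - m) for x in f]
    z = sum(exps)
    return [e / z for e in exps]

def l2_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

worst_ratio = 0.0
for k in range(1, 200):
    # Deterministic, bounded test functions on six actions (hypothetical choices).
    f = [3.0 * math.sin(0.7 * k + 1.3 * i) for i in range(6)]
    g = [fi + 0.5 * math.cos(0.3 * k * (i + 1)) for i, fi in enumerate(f)]
    gap = l2_distance(f, g)
    if gap > 0.0:
        ratio = l2_distance(softmax(f), softmax(g)) / gap
        worst_ratio = max(worst_ratio, ratio)
```

This is only a sanity check of the qualitative statement; the constants in (C.14) to (C.16) depend on the bounds of the function class $F_r$ and are not reproduced here.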



1 If $\bar G$ is not compact, we choose $\eta$ to also be a regular value of the function on $\{\bar\omega\in\mathbb R^{m-1} : \|\bar\omega\|=1\}$ to which $g$ converges as $\|\bar\omega\|$ goes to infinity.



whose regular values are dense in its range.
b) Universal approximation: the span of $\{\phi(\cdot,\bar\omega) : \bar\omega\in\Theta\}$ is dense in $L^2(\mathcal S\times\mathcal A)$;
c) Support of the measure: there exists $r>0$ such that the support of the initial condition $\nu_0$ is contained in $Q_r:=[-r,r]\times\Theta$ and separates $\{-r\}\times\Theta$ from $\{r\}\times\Theta$, i.e., any continuous path connecting $\{-r\}\times\Theta$ to $\{r\}\times\Theta$ intersects the support of $\nu_0$.

Figure 1: Evolution of $E(\nu^*)-E(\nu_t)$ as a function of training time, for experiments (a) and (b) as described in the main text. Different lines correspond to different random initializations of the reward function and the learning model. The training error decreases monotonically (but at different rates) during training.

ψ(s, a ; ω)π ν (s, da ) π ν (s, a) (A.3) For the first term in the integrand of (A.2), we use π(s, a) as a density and obtain δE δπ (s, a) = δ δπ S×A (r(s, a) -τ log π(s, a)) ( π (ds)π(s, da)) (s, a) = S (r(s, a) -τ (log π(s, a) + 1)) π (s , s) s , a ) -τ log π(s , a )) δ π (s , s ) δπ (s, a)π(s , da )ds 0 (ds ) ,(A.5)

): Lemma C.1. Let Assumption 1 hold and let ν satisfyS×A φ(ω; s, a) (Q πν (s, a) -τ log π ν (s, a) -V πν (s)) π ν (s, da) πν (ds) = 0 , (C.3)ω-almost everywhere in Θ. Then we have thatQ πν = Q π * holds π π -a.e. in S × A.Proof of Lemma C.1. Assuming that (C.3) holds Lebesgue-a.e. in Θ, by the assumed continuity of φ in ω combined with the expressivity of φ Assumption 1 b) we must have that

π -P π ) P t-j-1 π 1

γ P (s, a, ds ) R τ (s , a ) π (s , ds )π(da ) -R τ (s , a ) π (s , ds )π (da )2 ≤ γ P (s, a, ds ) R τ (s , a ) π (s , ds )π(da ) -R τ (s , a ) π (s , ds )π (da ) 2 ≤ (R τ -R τ ) (s , a ) π (s , ds )π(da ) ∞ + R τ (s , a )( ππ )(s , ds )π(da ) ∞ + R τ (s , a ) π (s , ds )(π -π )(da ) ∞ (C.19)

C 1 f -f 2 (C.20)where we have bounded the log term as follows: log e f (s,a) da e f (s,a) da 2 = log e f (s,a) -e f (s,a) da e f (s,a) da + 1 2≤ log e f (s,a) e |f (s,a)-f (s,a)| -1 dae f (s,a) da + (s, a) -f (s, a)|da + 1 2 ≤ log e -f ∞ e f ∞ K r f (s, a) -f (s, a) 2 + 1 2 ≤ e -f ∞ e f ∞ K r f -f 2 (C.21)For the second term in (C.19), using the boundedness of R 2 < C R and that ππ = γ ( ππ ) for a certain, 0 depending on s we writeR τ (s , a )( ππ )(s , ds )π(s , da ) ∞ ≤ π 2 R τ (s , a ) 2 ππ 1 ≤ C 2 f -f 2 . (C.22)We finally bound the third term by writingR τ (s , a ) π (s , ds )(π -π )(s , da ) ∞ ≤ π 1 R τ (s , a ) 2 π -π 2 ≤ C 3 f -f 2 (C.23) and obtain (C.16) by combining (C.20)-(C.23). Proof of Lemma C.5. We first establish the desired properties of the functional R. To do so we differentiate (8) at f ∈ L 2 (S × A) (s, da ) π f (s, a) . (C.24) Then, combining the above with (A.6) and Hölder inequality we obtain DR f ∞,r = sup f ∈Fr (DR f (s, a)) 2 dsda < ∞

resulting terms separately. First of all, defining throughout by slight abuse of notation δ(π) := Q π -τ log π, we have(I) := | (s, a)π (s, da) -(s, a)π(s, da), δ(π ) π | ≤ S A (s, a)(π(s, a) -π (s, a))da δ(π )π (da) π (s)ds ≤ 2 π -π 2 δ(π )π 2 π 1 ≤ C 2 f -f 2 , (C.25)where the last line was obtained using (C.14) and boundedness from above and below of π, Q π in F r . We further write, for a K < ∞ large enough(II) := | (s, a) -(s, a )π(s, da ), δ(π ) π -(s, a) -(s, a )π(s, da ), δ(π ) π | ≤ | δ(π ) (s, a) π (s)(π -π )(s, da)ds| + | (s, a)δ(π )( ππ )(s)π (s, da)ds| ≤ δ(π ) 2 π 1 2 ( π -π 2 + ππ 1 ) ≤ K 2 f -f 2 , (C.26) where we have used Cauchy-Schwartz inequality together with (C.14) and (C.15). Finally, using (C.16) and (C.21), we bound for K < ∞ possibly larger than above (III) := | (s, a) -(s, a )π(s, da ), δ(π) -δ(π

27) Combining (C.25), (C.26) and (C.27) we obtain thatL DR = sup f,f ∈Fr DR f -DR f f -f 2 < ∞proving the Lipschitz smoothness claim.Proceeding to the proof of (C.13), combining the Lipschitz smoothness of R on bounded sets, the identity ψ(s, a;(1, ω)) = φ(s, a; ω) and the boundedness of the set { ψν :ν ∈ P 2 (Ω), |h 1 ν | < C 0 } we obtain f ν -f ν 2 = ψ(•; ω)(ν -ν )(dω)


(i) Assume that ν t1 (G + ) > 0. We note that under our assumptions the first component of the velocity field in G is lower bounded by η/2, so that ω 0 (t) = ω 0 (0) + tη/2 bounds from below the ω 0 -component of the trajectory of a test mass with initial condition with ω(0) ∈ G, as long as ω(t) ∈ Ḡ. Combining this bound with the forward invariance of G + , we see that if ω(0) ∈ G + then ω 0 (t) > tη/2. Consequently, assuming that supp(ν t ) ⊂ (-M, M ) × Θ for every t > t 1 we haveThis implies linear growth of h 1 νt ( Ḡ) for t > t 1 + 2M/η, contradicting the original assumption that h 1 ν -h 1 νt 1 BL < ε for all t > t 1 . (ii) Consider now the complementary case ν t1 (G + ) = 0. We proceed to show that there exists t 2 > t 1 such that ν t2 (G + ) > 0, thereby reducing this case at time t 2 to part (i). To do so, we consider ω * ∈ supp(ν t1 ) such that ω * ∈ Ḡ is a local minimum of g ν , i.e., for which ∇g ν = 0 (which exists by the preservation of the support property Assumption 1 c) ). Then, choosing ε such that B ε( ω * ) ⊂ Ḡ, and setting M large enough that supp (ν t1 ) ⊆ [-M, M ] × Θ, we prove below in Lemma C.4 that there exists t 2 > t 1 for which the image at t 2 of ω(t 1 ) := ω * under the characteristic flow (C.6) is contained in G + . By continuity of the flow map X(•, t), this conclusion extends to a neighborhood of ω * , with positive mass under ν t1 .We denote throughout by • C 1 the maximum of the supremum norm of a function and the supremum norm of its gradient and recall the structure of the policy gradient vector fieldwhere g is defined in (C.7) and ν t solves (10).With these definitions, we proceed to prove that case (ii) in the analysis above will ultimately reduce to case (i) for t large enough. Lemma C.4. Let ν ∈ M 1 + (Ω) and ω * satisfy |∇g ν (ω * )| = 0, g ν (ω * ) < -η < 0 for some η > 0. 
Then for every ε, M > 0 there exists t 2 , ε > 0 such that if for all t ∈ (0, t 2 ) we have g ν -g νt C 1 < ε and ω * 0 ∈ [-M, 0], then the point ω * is mapped, under the flow of the policy gradient vector field (C.10) at time t 2 to a subset of B ε((1, ω * )).Proof of Lemma C.4. By homogeneity of the approximator, we can bound the first component of the velocity of a particle (ω 0 (t), ω(t)) under (C.10) with initial condition ω(0) = ω * asIn the other directions, defining q(t) := ω(t) -ω * , we havefor all t ∈ [0, τ ] where τ := inf{t : ω 0 (t) ∈ [-M, 1]}. Moreover, Lipschitz continuity of the potential g ν ( • ) and its Lipschitz smoothness imply the existence of a L > 0 such that max{|g ν (ω)Combining this with the assumed convergence of ν t to ν, which implies g νg νt C 1 < ε, we can bound the evolution of (ω 0 (t), q(t)) for t ∈ [0, τ ] in the perturbative regime of interest as follows:We now show that, choosing both ε and a neighborhood around ω * = (ω * 0 , ω * ) to be small enough, the forward dynamics of ω * will reach the set {ω 0 > 0} before q(t) can increase too much. More precisely, by possibly increasing the value of L such that η/4L < ε, and defining τ q = inf{t : q(t) > η/4L} we prove that there

