GLOBAL OPTIMALITY OF SOFTMAX POLICY GRADIENT WITH SINGLE HIDDEN LAYER NEURAL NETWORKS IN THE MEAN-FIELD REGIME

Abstract

We study the problem of policy optimization for infinite-horizon discounted Markov Decision Processes with softmax policies and nonlinear function approximation trained with policy gradient algorithms. We concentrate on the training dynamics in the mean-field regime, modeling, e.g., the behavior of wide single hidden layer neural networks when exploration is encouraged through entropy regularization. The dynamics of these models is established as a Wasserstein gradient flow of distributions in parameter space. We further prove global optimality of the fixed points of this dynamics under mild conditions on their initialization.

1. INTRODUCTION

In recent years, deep reinforcement learning has revolutionized the world of Artificial Intelligence by outperforming humans in a multitude of highly complex tasks, achieving breakthroughs that were deemed out of reach for at least another decade. Spectacular examples of this revolutionary potential have appeared over the last few years, with reinforcement learning algorithms mastering games and tasks of increasing complexity, from learning to walk to the games of Go and Starcraft (Mnih et al., 2013; 2015; Silver et al., 2016; 2017; 2018; Haarnoja et al., 2018a; Vinyals et al., 2019). In most cases, the main workhorse allowing artificial intelligence to pass such unprecedented milestones was a variation of a fundamental method to train reinforcement learning models: policy gradient (PG) algorithms (Sutton et al., 2000). This class of algorithms takes a disarmingly simple approach to the optimization problem at hand: given a parametrization of the policy, it updates the parameters in the direction of steepest ascent of the associated integrated value function. Impressive progress has recently been made in the understanding of the convergence and optimization properties of this class of algorithms in the tabular setting (Agarwal et al., 2019; Cen et al., 2020; Bhandari & Russo, 2019), in particular by leveraging the natural tradeoff between exploration and exploitation offered by softmax policies for entropy-regularized rewards (Haarnoja et al., 2018b; Mei et al., 2020). However, this simple algorithm alone is not sufficient to explain the multitude of recent breakthroughs in this field: in application domains such as Starcraft, robotics or movement planning, the spaces of possible states and actions are exceedingly large, or even continuous, and can therefore not be represented efficiently by tabular policies (Haarnoja et al., 2018a).
Consequently, the recent impressive successes of artificial intelligence would have been impossible without the natural choice of neural networks to approximate value functions and/or policy functions in reinforcement learning algorithms (Mnih et al., 2015; Sutton et al., 2000). While neural networks, in particular deep neural networks, provide a powerful and versatile tool to approximate high-dimensional functions on continuous spaces (Cybenko, 1989; Hornik, 1991; Barron, 1993), their intrinsic nonlinearity poses significant obstacles to the theoretical understanding of their training and optimization properties. For instance, it is known that the optimization landscape of these models is highly nonconvex, preventing the use of most tools from classical optimization theory. For this reason, the unprecedented success of neural networks in artificial intelligence stands in contrast with the poor theoretical understanding of these methods. Indeed, even in the supervised setting, which can be viewed as a special case of reinforcement learning, deep neural networks are still far from being understood despite having been an important research focus in recent years. Only recently has a theory of neural network learning started to emerge, including works on the mean-field point of view on training dynamics (Mei et al., 2018; Rotskoff & Vanden-Eijnden, 2018; Rotskoff et al., 2019; Wei et al., 2018; Chizat & Bach, 2018) and on linearized dynamics in the over-parametrized regime (Jacot et al., 2018; Allen-Zhu et al., 2018; Du et al., 2018; 2019; Zou et al., 2018; Allen-Zhu et al., 2019; Chizat et al., 2019; Oymak & Soltanolkotabi, 2020; Ghorbani et al., 2019; Lee et al., 2019).
More specifically in the context of reinforcement learning, some works focusing on value-based learning (Agazzi & Lu, 2019; Cai et al., 2019; Zhang et al., 2020) and others exploring the dynamics of policy gradient algorithms (Zhang et al., 2019) have recently appeared. Despite this progress, the theoretical understanding of deep reinforcement learning still poses a significant challenge to the theoretical machine learning community, and it is of crucial importance to understand the convergence and optimization properties of such algorithms in order to bridge the gap between theory and practice.

CONTRIBUTIONS.

The main goal of this work is to investigate entropy-regularized policy gradient dynamics for wide, single hidden layer neural networks. In particular, we make the following contributions:

• We give a mean-field formulation of policy gradient dynamics in parameter space, describing the evolution of the neural network parameters in the form of a transport partial differential equation (PDE). We prove convergence of the particle dynamics to their mean-field counterpart. We further explore the structure of this problem by showing that this PDE is a gradient flow in Wasserstein space for an appropriate energy functional.

• We investigate the convergence properties of the above dynamics in the space of measures. In particular, we prove that under mild assumptions on the initialization of the neural network parameters and on the approximation power of the nonlinearity, all fixed points of the dynamics are global optima, i.e., the approximate policy learned by the neural network is optimal.

RELATED WORKS.

Recent progress in the understanding of the parametric dynamics of simple neural networks trained with gradient descent in the supervised setting has been made in (Mei et al., 2018; Rotskoff & Vanden-Eijnden, 2018; Wei et al., 2018; Chizat, 2019; Chizat & Bach, 2020). These results have further been extended to the multilayer setting in (Nguyen & Pham, 2020). In particular, (Chizat & Bach, 2018) proves optimality of fixed points for wide single hidden layer neural networks by leveraging a Wasserstein gradient flow structure and the strong convexity of the loss functional with respect to the predictor. We extend these results to the reinforcement learning framework, where the convexity heavily leveraged in (Chizat & Bach, 2018) is lost. We bypass this issue by requiring sufficient expressivity of the nonlinear representation, allowing us to characterize global minimizers as optimal approximators.
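To fix ideas on the gradient flow structure mentioned above, recall the generic (schematic) form of a Wasserstein gradient flow of an energy functional F over distributions ρ_t on parameter space; the specific regularized objective used in this paper is introduced later, so the F below is a placeholder:

```latex
\partial_t \rho_t \;=\; \nabla_\theta \cdot \Big( \rho_t \, \nabla_\theta \frac{\delta F}{\delta \rho}[\rho_t] \Big),
```

where $\frac{\delta F}{\delta \rho}$ denotes the first variation of $F$. Fixed points of this dynamics are distributions $\rho$ for which the velocity field $\nabla_\theta \frac{\delta F}{\delta \rho}[\rho]$ vanishes $\rho$-almost everywhere, which is the notion of stationarity underlying the optimality results announced above.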
The convergence and optimality of policy gradient algorithms (including in the entropy-regularized setting) are investigated in the recent papers (Bhandari & Russo, 2019; Mei et al., 2020; Cen et al., 2020; Agarwal et al., 2019). These references establish convergence estimates through gradient domination bounds. In (Mei et al., 2020; Cen et al., 2020) such results are limited to the tabular case, while (Agarwal et al., 2019; 2020) also discuss neural softmax policy classes, albeit under a different algorithmic update and assuming certain well-conditioning properties along training. Furthermore, all these results heavily leverage the finiteness of the action space. In contrast, this paper focuses on the continuous state and action space setting with nonlinear function approximation. Further recent works discussing convergence properties of reinforcement learning algorithms with neural network function approximation include (Zhang et al., 2019; Cai et al., 2019). These results only hold for finite action spaces, and are obtained in the regime where the network behaves essentially like a linear model (known as the neural tangent kernel or lazy training regime), in contrast to the results of this paper, which considers training in a genuinely nonlinear regime. We also note the work (Wang et al., 2019), where the action space is continuous but the training is again in an approximately linear regime.

2. MARKOV DECISION PROCESSES AND POLICY GRADIENTS

We denote a Markov Decision Process (MDP) by the 5-tuple (S, A, P, r, γ), where S is the state space, A is the action space, P = (P(s, a, s'))_{s,s' ∈ S, a ∈ A} is a Markov transition kernel, r = (r(s, a, s'))_{s,s' ∈ S, a ∈ A} is the real-valued, bounded and continuous immediate reward function, and γ ∈ (0, 1) is a discount factor. We will consider a probabilistic policy, mapping a state to a probability distribution on the action space, so that π : S → M^1_+(A), where M^1_+(A) denotes the space of probability measures on A, and denote for any s ∈ S the corresponding density π(s, ·) : A → R_+. The policy defines a state-to-state transition operator
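The objects above can be made concrete in a few lines of code. The sketch below is our own illustration, not the paper's implementation: the tanh nonlinearity, the mean-field averaging over N neurons, and the discretization of a continuous action space into a finite grid (to make the softmax normalization computable) are all assumptions. It evaluates a softmax policy π(s, ·) ∝ exp(f(s, a)/τ) for a single hidden layer network f:

```python
import numpy as np

rng = np.random.default_rng(0)

# Single hidden layer network in "mean-field" form: f(s, a) is an average
# over N neurons, each with parameters (w_i, b_i, c_i).
N, d_s, d_a = 512, 3, 2               # number of neurons, state dim, action dim
W = rng.normal(size=(N, d_s + d_a))   # input weights w_i
b = rng.normal(size=N)                # biases b_i
c = rng.normal(size=N)                # output weights c_i

def f(s, a):
    """Network output f(s, a) = (1/N) * sum_i c_i * tanh(w_i . (s, a) + b_i)."""
    x = np.concatenate([s, a])
    return np.mean(c * np.tanh(W @ x + b))

def softmax_policy(s, actions, tau=1.0):
    """Softmax policy pi(s, .) proportional to exp(f(s, a) / tau) on a finite action grid."""
    logits = np.array([f(s, a) for a in actions]) / tau
    logits -= logits.max()            # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# A grid discretizing a continuous action space A = [-1, 1]^2.
grid = np.linspace(-1.0, 1.0, 5)
actions = [np.array([u, v]) for u in grid for v in grid]
pi = softmax_policy(np.zeros(d_s), actions)
```

The temperature τ plays the role of the entropy-regularization strength: large τ flattens π(s, ·) toward the uniform distribution (exploration), while small τ concentrates it on the actions maximizing f(s, ·) (exploitation).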

