GLOBAL OPTIMALITY OF SOFTMAX POLICY GRADIENT WITH SINGLE HIDDEN LAYER NEURAL NETWORKS IN THE MEAN-FIELD REGIME

Abstract

We study the problem of policy optimization for infinite-horizon discounted Markov Decision Processes with softmax policies and nonlinear function approximation trained with policy gradient algorithms. We concentrate on the training dynamics in the mean-field regime, modeling, e.g., the behavior of wide single hidden layer neural networks when exploration is encouraged through entropy regularization. We establish the dynamics of these models as a Wasserstein gradient flow of distributions in parameter space, and we further prove global optimality of the fixed points of this dynamics under mild conditions on their initialization.

1. INTRODUCTION

In recent years, deep reinforcement learning has revolutionized the world of Artificial Intelligence by outperforming humans in a multitude of highly complex tasks and achieving breakthroughs that were deemed out of reach for at least another decade. Spectacular examples of this revolutionary potential have appeared over the last few years, with reinforcement learning algorithms mastering games and tasks of increasing complexity, from learning to walk to the games of Go and Starcraft (Mnih et al., 2013; 2015; Silver et al., 2016; 2017; 2018; Haarnoja et al., 2018a; Vinyals et al., 2019). In most cases, the main workhorse allowing artificial intelligence to pass such unprecedented milestones was a variation of a fundamental method for training reinforcement learning models: policy gradient (PG) algorithms (Sutton et al., 2000). This algorithm takes a disarmingly simple approach to the optimization problem at hand: given a parametrization of the policy, it updates the parameters in the direction of steepest ascent of the associated integrated value function. Impressive progress has been made recently in understanding the convergence and optimization properties of this class of algorithms in the tabular setting (Agarwal et al., 2019; Cen et al., 2020; Bhandari & Russo, 2019), in particular leveraging the natural tradeoff between exploration and exploitation that softmax policies offer for entropy-regularized rewards (Haarnoja et al., 2018b; Mei et al., 2020). However, this simple algorithm alone is not sufficient to explain the multitude of recent breakthroughs in this field: in application domains such as Starcraft, robotics, or movement planning, the spaces of possible states and actions are exceedingly large, or even continuous, and therefore cannot be represented efficiently by tabular policies (Haarnoja et al., 2018a).
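To make the steepest-ascent description concrete, the following is a minimal, purely illustrative sketch of an entropy-regularized softmax policy gradient update on a one-state MDP (a bandit), where the exact gradient of the regularized objective J(θ) = Σ_a π_θ(a) r(a) + τ H(π_θ) with respect to the softmax logits is available in closed form. The reward vector, temperature, and learning rate are assumptions chosen for the example and are not taken from the paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical toy setup (assumed values, for illustration only).
r = np.array([1.0, 0.5, 0.2])  # rewards for each of 3 actions
tau = 0.1                      # entropy-regularization temperature
theta = np.zeros(3)            # softmax logits parametrizing the policy
lr = 0.5                       # step size

for _ in range(500):
    pi = softmax(theta)
    # Regularized "advantage" of each action: r(a) - tau * log pi(a).
    adv = r - tau * np.log(pi)
    baseline = pi @ adv
    # Exact gradient of J w.r.t. the logits: pi(a) * (adv(a) - baseline).
    theta += lr * pi * (adv - baseline)

pi = softmax(theta)
```

With τ > 0 the regularized optimum is the Gibbs policy π*(a) ∝ exp(r(a)/τ), so the iterates concentrate on the best action while retaining some entropy rather than collapsing to a deterministic policy.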
Consequently, the recent impressive successes of artificial intelligence would be impossible without the natural choice of neural networks to approximate value functions and/or policy functions in reinforcement learning algorithms (Mnih et al., 2015; Sutton et al., 2000). While neural networks, in particular deep neural networks, provide a powerful and versatile tool to approximate high dimensional functions on continuous spaces (Cybenko, 1989; Hornik, 1991; Barron, 1993), their intrinsic nonlinearity poses significant obstacles to the theoretical understanding of their training and optimization properties. For instance, it is known that the optimization landscape of these models is highly nonconvex, preventing the use of most theoretical tools from classical optimization theory. For this reason, the unprecedented success of neural networks in artificial intelligence stands in contrast with the poor understanding of these methods from a theoretical perspective. Indeed, even in the supervised setting, which can be viewed as a special case of reinforcement learning, deep neural networks are still far from being understood despite having been an important and fashionable research focus in recent years. Only recently has a theory of neural network learning started to emerge, including recent works on the mean-field perspective on training dynamics (Mei et al.,

