POLICY OPTIMIZATION IN ZERO-SUM MARKOV GAMES: FICTITIOUS SELF-PLAY PROVABLY ATTAINS NASH EQUILIBRIA

Abstract

Fictitious Self-Play (FSP) has achieved significant empirical success in solving extensive-form games. From a theoretical perspective, however, it remains unknown whether FSP is guaranteed to converge to Nash equilibria in Markov games. As an initial attempt, we propose an FSP algorithm for two-player zero-sum Markov games, dubbed smooth FSP, in which both agents adopt an entropy-regularized policy optimization method against each other. Smooth FSP builds upon a connection between smooth fictitious play and the policy optimization framework. Specifically, in each iteration, each player infers the policy of the opponent implicitly via policy evaluation and improves its current policy by taking a smoothed best-response via a proximal policy optimization (PPO) step. Moreover, to tame the non-stationarity caused by the opponent, we incorporate entropy regularization in PPO for algorithmic stability. When both players adopt smooth FSP simultaneously, i.e., under self-play, in a class of games with Lipschitz-continuous transitions and rewards, we prove that the sequence of joint policies converges to a neighborhood of a Nash equilibrium at a sublinear O(1/T) rate, where T is the number of iterations. To the best of our knowledge, this is the first finite-time convergence guarantee for FSP-type algorithms in zero-sum Markov games.

1. INTRODUCTION

Multi-agent reinforcement learning (MARL) (Bu et al., 2008; Sutton & Barto, 2018) has achieved great empirical success, e.g., in playing the game of Go (Silver et al., 2016; 2017), Dota 2 (Berner et al., 2019), and StarCraft II (Vinyals et al., 2019), all driven by policy optimization algorithms that iteratively update policies parameterized by deep neural networks. Empirically, the popularity of policy optimization algorithms for MARL is attributed to the observation that they usually converge faster than value-based methods, which iteratively update value functions (Mnih et al., 2016; O'Donoghue et al., 2016). Compared with their empirical success, the theoretical aspects of policy optimization algorithms in the MARL setting (Littman, 1994; Hu & Wellman, 2003; Conitzer & Sandholm, 2007; Pérolat et al., 2016; Zhang et al., 2018) remain less understood. Although convergence guarantees for various policy optimization algorithms have been established in the single-agent RL setting (Sutton et al., 2000; Konda & Tsitsiklis, 2000; Kakade, 2002; Agarwal et al., 2019; Wang et al., 2019), extending these guarantees to arguably one of the simplest MARL settings, the two-player zero-sum Markov game, faces challenges in the following two aspects. First, in such a Markov game, each agent interacts with the opponent as well as the environment. From the perspective of each agent, it operates in an environment that is altered by the actions of the opponent. As a result, due to the existence of an opponent, the policy optimization problem of each agent has a time-varying objective function, which is in stark contrast with value-based methods such as value iteration (Shapley, 1953; Littman, 1994), where a central controller specifies the policies of both players.
When the joint policy of both players is considered, solving for the optimal value function corresponds to finding the fixed point of the Bellman operator, which is defined independently of the players' policies. Second, when viewing policy optimization in a zero-sum Markov game as an optimization problem over both players jointly, although the objective function is fixed, the problem is a minimax optimization with a nonconvex-nonconcave objective. Even in classical optimization, this kind of problem remains less understood (Cherukuri et al., 2017; Rafique et al., 2018; Daskalakis & Panageas, 2018; Mertikopoulos et al., 2018). It has been observed that first-order methods such as gradient descent may fail to converge (Balduzzi et al., 2018; Mazumdar & Ratliff, 2018).

As an initial step in studying policy optimization for MARL, we propose a novel policy optimization algorithm for any player of a multi-player Markov game, dubbed smooth fictitious self-play (FSP). Specifically, when a player adopts smooth FSP, in each iteration it first solves a policy evaluation problem that estimates the value function associated with the current joint policy of all players. It then updates its own policy via an entropy-regularized proximal policy optimization (PPO) (Schulman et al., 2017) step, where the update direction is obtained from the estimated value function. This algorithm can be viewed as an extension of the fictitious play (FP) algorithm, originally designed for normal-form games (Von Neumann & Morgenstern, 2007; Shapley, 1953) and extensive-form games (Heinrich et al., 2015; Perolat et al., 2018), to Markov games. FP is a general algorithmic framework for solving games in which an agent first infers the policies of the opponents and then adopts a policy that best responds to them.
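To make the iteration concrete, the following is a minimal sketch of smooth-FSP-style self-play in the single-state (matrix-game) special case, where policy evaluation reduces to computing each action's expected payoff against the opponent's current policy. The function name, the step size `eta`, the temperature `tau`, and the multiplicative-weights form of the entropy-regularized step are our own illustrative choices; this is one standard instantiation of an entropy-regularized smoothed best-response update, not the exact update analyzed in the paper.

```python
import numpy as np

def smooth_fsp_matrix(R, eta=0.1, tau=0.1, T=2000, x0=None, y0=None):
    """Smooth-FSP-style self-play on a zero-sum matrix game R.

    Each iteration: (i) "policy evaluation" -- compute expected payoffs
    of each action against the opponent's current mixed policy; (ii) an
    entropy-regularized mirror step toward the smoothed best response,
        pi_{t+1}(a)  propto  pi_t(a)^(1 - eta*tau) * exp(eta * Q_t(a)),
    which interpolates between the current policy and the softmax of Q.
    """
    n, m = R.shape
    x = np.full(n, 1.0 / n) if x0 is None else np.asarray(x0, dtype=float)
    y = np.full(m, 1.0 / m) if y0 is None else np.asarray(y0, dtype=float)
    for _ in range(T):
        qx = R @ y        # row player's action values (maximizes R)
        qy = -R.T @ x     # column player's action values (maximizes -R)
        x = x ** (1.0 - eta * tau) * np.exp(eta * qx)
        x /= x.sum()
        y = y ** (1.0 - eta * tau) * np.exp(eta * qy)
        y /= y.sum()
    return x, y

if __name__ == "__main__":
    # Matching pennies: the unique Nash equilibrium is uniform play,
    # which here coincides with the entropy-regularized equilibrium.
    R = np.array([[1.0, -1.0], [-1.0, 1.0]])
    x, y = smooth_fsp_matrix(R, x0=np.array([0.9, 0.1]),
                             y0=np.array([0.2, 0.8]))
    print(x, y)
```

Without the regularization term (tau = 0), simultaneous updates of this kind can cycle around the equilibrium in matching pennies; the entropy term damps the rotation, illustrating the stability role that regularization plays in the analysis.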
When viewing our algorithm as an FP method, instead of estimating the policies of the opponents directly, the agent infers the opponent implicitly by estimating the value function. Moreover, the policy update corresponds to a smoothed best-response policy (Swenson & Poor, 2019) based on the inferred value function. To examine the theoretical merits of the proposed algorithm, we focus on two-player zero-sum Markov games and let both players follow smooth FSP, i.e., self-play. Moreover, we restrict our attention to a class of Lipschitz games (Radanovic et al., 2019) in which the impact of each player's policy change on the environment is Lipschitz continuous with respect to the magnitude of the policy change. For such a Markov game, we tackle the challenge of non-stationarity by imposing entropy regularization, which brings algorithmic stability. In addition, to establish convergence to a Nash equilibrium, we explicitly characterize the geometry of the policy optimization problem from a functional perspective. Specifically, we prove that the objective function, as a bivariate function of the two players' policies, despite being non-convex and non-concave, satisfies a one-point strong monotonicity condition (Facchinei & Pang, 2007) at a Nash equilibrium. Thanks to this benign geometry, we prove that smooth FSP converges to a neighborhood of a Nash equilibrium at a sublinear Õ(1/T) rate, where T is the number of policy iterations and Õ hides logarithmic factors. Moreover, as a byproduct of our analysis, if either of the two players deviates from the proposed algorithm, we show that the other player, by following smooth FSP, exploits such a deviation by finding the best-response policy at the same sublinear rate. This Hannan consistency property of our algorithm is related to Hennes et al. (2020), which focuses on normal-form games. Thus, our results also serve as a first step towards connecting regret minimization in normal-form/extensive-form games and in Markov games.

Contribution.
Our contribution is two-fold. First, we propose a novel policy optimization algorithm for Markov games, which can be viewed as a generalization of FP. Second, when applied to a class of two-player zero-sum Markov games satisfying a Lipschitz regularity condition, our algorithm provably enjoys global convergence to a neighborhood of a Nash equilibrium at a sublinear rate. To the best of our knowledge, this is the first FSP-type algorithm with a finite-time convergence guarantee for zero-sum Markov games.

Related Work. There is a large body of literature on value-based methods for zero-sum Markov games (Lagoudakis & Parr, 2012; Pérolat et al., 2016; Zhang et al., 2018; Zou et al., 2019). More recently, Perolat et al. (2018) prove that actor-critic fictitious play asymptotically converges to the Nash equilibrium, whereas our work provides a finite-time convergence guarantee to a neighborhood of a Nash equilibrium. In addition, Zhang et al. (2020) study the sample complexity of a planning algorithm in the model-based MARL setting, as opposed to the model-free setting with function approximation considered in this paper. Closely related to the smooth FSP proposed in this paper is a line of work on best-response algorithms (Heinrich et al., 2015; Heinrich & Silver, 2016), which have also shown strong empirical performance (Dudziak, 2006; Xiao et al., 2013; Kawamura et al., 2017). However, these methods apply only to extensive-form games and are not directly applicable to stochastic games. Our smooth FSP is also related to Swenson & Poor (2019), which focuses on potential games. It does not enforce entropy regularization and only provides an asymptotic convergence guarantee to a neighborhood of the

