POLICY OPTIMIZATION IN ZERO-SUM MARKOV GAMES: FICTITIOUS SELF-PLAY PROVABLY ATTAINS NASH EQUILIBRIA

Abstract

Fictitious Self-Play (FSP) has achieved significant empirical success in solving extensive-form games. From a theoretical perspective, however, it remains unknown whether FSP is guaranteed to converge to Nash equilibria in Markov games. As an initial attempt, we propose an FSP algorithm for two-player zero-sum Markov games, dubbed smooth FSP, in which both agents adopt an entropy-regularized policy optimization method against each other. Smooth FSP builds upon a connection between smooth fictitious play and the policy optimization framework. Specifically, in each iteration, each player infers the policy of the opponent implicitly via policy evaluation and improves its current policy by taking the smoothed best response via a proximal policy optimization (PPO) step. Moreover, to tame the non-stationarity caused by the opponent, we propose to incorporate entropy regularization in PPO for algorithmic stability. When both players adopt smooth FSP simultaneously, i.e., with self-play, in a class of games with Lipschitz continuous transitions and rewards, we prove that the sequence of joint policies converges to a neighborhood of a Nash equilibrium at a sublinear O(1/T) rate, where T is the number of iterations. To the best of our knowledge, this is the first finite-time convergence guarantee for FSP-type algorithms in zero-sum Markov games.
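As an informal illustration only (not the paper's algorithm or analysis), the smooth-FSP iteration can be sketched in the stateless special case of a zero-sum matrix game: each player evaluates the expected payoff of its actions against the opponent's current policy (policy evaluation), forms the entropy-regularized smoothed best response (a softmax over action values), and takes a small proximal step toward it rather than jumping to it. The function names, temperature, step size, and initialization below are all our own illustrative choices.

```python
import numpy as np

def softmax(q, tau):
    """Smoothed (entropy-regularized) best response: softmax of values at temperature tau."""
    z = q / tau
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def smooth_fsp(A, iters=5000, tau=0.1, eta=0.01):
    """Illustrative smooth-FSP-style dynamics on a zero-sum matrix game.

    A: payoff matrix for player 1 (player 2 receives -A).
    tau: entropy-regularization temperature; eta: proximal step size.
    """
    n, m = A.shape
    # Deliberately non-uniform initialization so convergence is visible.
    x = np.arange(1, n + 1, dtype=float); x /= x.sum()  # player 1's policy
    y = np.arange(1, m + 1, dtype=float); y /= y.sum()  # player 2's policy
    for _ in range(iters):
        # "Policy evaluation": expected payoff of each action against the opponent.
        q1 = A @ y
        q2 = -A.T @ x
        # Smoothed best responses.
        br1 = softmax(q1, tau)
        br2 = softmax(q2, tau)
        # Proximal (PPO-like) step: mix the current policy with the smoothed
        # best response instead of replacing it outright.
        x = (1 - eta) * x + eta * br1
        y = (1 - eta) * y + eta * br2
    return x, y

if __name__ == "__main__":
    # Matching pennies: the unique Nash equilibrium is uniform play.
    A = np.array([[1.0, -1.0], [-1.0, 1.0]])
    x, y = smooth_fsp(A)
    print(np.round(x, 3), np.round(y, 3))
```

In this symmetric example the smoothed dynamics settle near the uniform equilibrium; the small step size eta plays the role of the proximal term that keeps successive policies close, which is the stabilizing effect the abstract attributes to entropy-regularized PPO.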

1. INTRODUCTION

Multi-agent reinforcement learning (MARL) (Bu et al., 2008; Sutton & Barto, 2018) has achieved great empirical success, e.g., in playing the game of Go (Silver et al., 2016; 2017), Dota 2 (Berner et al., 2019), and StarCraft 2 (Vinyals et al., 2019), all of which are driven by policy optimization algorithms that iteratively update policies parameterized by deep neural networks. Empirically, the popularity of policy optimization algorithms for MARL is attributed to the observation that they usually converge faster than value-based methods, which iteratively update value functions (Mnih et al., 2016; O'Donoghue et al., 2016). Compared with this empirical success, the theoretical aspects of policy optimization algorithms in the MARL setting (Littman, 1994; Hu & Wellman, 2003; Conitzer & Sandholm, 2007; Pérolat et al., 2016; Zhang et al., 2018) remain less understood. Although convergence guarantees for various policy optimization algorithms have been established in the single-agent RL setting (Sutton et al., 2000; Konda & Tsitsiklis, 2000; Kakade, 2002; Agarwal et al., 2019; Wang et al., 2019), extending these guarantees to arguably one of the simplest MARL settings, the two-player zero-sum Markov game, faces challenges in two aspects. First, in such a Markov game, each agent interacts with the opponent as well as the environment. From the perspective of each agent, the environment is altered by the actions of the opponent. As a result, the policy optimization problem of each agent has a time-varying objective function, which is in stark contrast with value-based methods such as value iteration (Shapley, 1953; Littman, 1994), where a central controller specifies the policies of both players: when the joint policy of both players is considered, solving for the optimal value function corresponds to finding the fixed point of the Bellman operator, which is defined independently of the players' policies. Second, when viewing policy optimization in a zero-sum Markov game as a joint optimization problem for both players, although we have
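For concreteness, the Bellman operator referred to above is the Shapley operator of the zero-sum Markov game; in our (not necessarily the paper's) notation, with reward $r$, transition kernel $P$, and discount factor $\gamma$, it reads
\[
(\mathcal{T} V)(s) \;=\; \max_{\mu \in \Delta(\mathcal{A})} \, \min_{\nu \in \Delta(\mathcal{B})} \; \mathbb{E}_{a \sim \mu,\, b \sim \nu} \Big[ r(s, a, b) \;+\; \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, a, b)} \big[ V(s') \big] \Big].
\]
Its unique fixed point is the minimax value of the game, which indeed does not depend on either player's current policy — in contrast to each individual player's policy optimization objective, which shifts as the opponent's policy changes.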

