FASTER LAST-ITERATE CONVERGENCE OF POLICY OPTIMIZATION IN ZERO-SUM MARKOV GAMES

Abstract

Multi-Agent Reinforcement Learning (MARL), where multiple agents learn to interact in a shared dynamic environment, permeates a wide range of critical applications. While there has been substantial progress on understanding the global convergence of policy optimization methods in single-agent RL, the design and analysis of efficient policy optimization algorithms in the MARL setting present significant challenges that, unfortunately, remain inadequately addressed by existing theory. In this paper, we focus on the most basic setting of competitive multi-agent RL, namely two-player zero-sum Markov games, and study equilibrium-finding algorithms in both the infinite-horizon discounted setting and the finite-horizon episodic setting. We propose a single-loop policy optimization method with symmetric updates from both agents, where the policy is updated via the entropy-regularized optimistic multiplicative weights update (OMWU) method and the value is updated on a slower timescale. We show that, in the full-information tabular setting, the proposed method achieves finite-time last-iterate linear convergence to the quantal response equilibrium of the regularized problem, which translates to sublinear last-iterate convergence to the Nash equilibrium by controlling the amount of regularization. Our convergence results improve upon the best known iteration complexities, and lead to a better understanding of policy optimization in competitive Markov games.

1. INTRODUCTION

Policy optimization methods (Williams, 1992; Sutton et al., 2000; Kakade, 2002; Peters and Schaal, 2008; Konda and Tsitsiklis, 2000), which cast sequential decision making as value maximization problems with respect to (parameterized) policies, have been instrumental in enabling recent successes of reinforcement learning (RL); see, e.g., Schulman et al. (2015; 2017); Silver et al. (2016). Despite their empirical popularity, the theoretical underpinnings of policy optimization methods remained elusive until very recently. For single-agent RL problems, a flurry of recent works has made substantial progress on understanding the global convergence of policy optimization methods under the framework of Markov Decision Processes (MDPs) (Agarwal et al., 2020; Bhandari and Russo, 2019; Mei et al., 2020; Cen et al., 2021a; Lan, 2022; Bhandari and Russo, 2020; Zhan et al., 2021; Khodadadian et al., 2021; Xiao, 2022). Despite the nonconcave nature of value maximization, (natural) policy gradient methods are shown to achieve global convergence at a sublinear rate (Agarwal et al., 2020; Mei et al., 2020), or even a linear rate in the presence of regularization (Mei et al., 2020; Cen et al., 2021a; Lan, 2022; Zhan et al., 2021), when the learning rate is constant.

Moving beyond single-agent RL, Multi-Agent Reinforcement Learning (MARL), where multiple agents learn to interact in a shared dynamic environment, is the next frontier, permeating critical applications such as multi-agent networked systems, autonomous vehicles, robotics, and so on. The design and analysis of efficient policy optimization algorithms in the MARL setting present significant challenges and new desiderata, which, unfortunately, remain inadequately addressed by existing theory.

1.1. POLICY OPTIMIZATION FOR COMPETITIVE RL

In this work, we focus on one of the most basic settings of competitive multi-agent RL, namely two-player zero-sum Markov games (Shapley, 1953), and study equilibrium-finding algorithms in both the infinite-horizon discounted setting and the finite-horizon episodic setting. In particular, our designs gravitate around algorithms that are single-loop, symmetric, and achieve finite-time last-iterate convergence to the Nash Equilibrium (NE) or the Quantal Response Equilibrium (QRE) under bounded rationality, two prevalent solution concepts in game theory. These design principles arise naturally from the pursuit of simple yet efficient algorithms: single-loop updates preclude sophisticated interleaving of rounds between agents; symmetric updates ensure that no agent compromises its rewards in the learning process, which could otherwise be exploited by a faster-updating opponent; in addition, asymmetric updates typically lead to one-sided convergence, i.e., only one of the agents is guaranteed to converge to the minimax equilibrium in a non-asymptotic manner, which is less desirable; moreover, last-iterate convergence guarantees absolve the need for agents to switch between learning and deployment; last but not least, it is desirable to converge as fast as possible, with non-asymptotic iteration complexities that depend explicitly on salient problem parameters.

Substantial algorithmic developments have been made for finding equilibria in two-player zero-sum Markov games, where Dynamic Programming (DP) techniques have long been used as a fundamental building block, leading to prototypical iterative schemes such as Value Iteration (VI) (Shapley, 1953) and Policy Iteration (PI) (Van Der Wal, 1978; Patek and Bertsekas, 1999). Different from their single-agent counterparts, these methods require solving a two-player zero-sum matrix game for every state per iteration. A considerable number of recent works (Zhao et al., 2022; Alacaoglu et al., 2022; Cen et al., 2021b; Chen et al., 2021a) build on these DP iterations by plugging in various (gradient-based) solvers of two-player zero-sum matrix games. However, these methods are inherently nested-loop, which makes them less convenient to implement. In addition, PI-based methods are asymmetric and come with only one-sided convergence guarantees (Patek and Bertsekas, 1999; Zhao et al., 2022; Alacaoglu et al., 2022).

Going beyond nested-loop algorithms, single-loop policy gradient methods have been proposed recently for solving two-player zero-sum Markov games. Here, we are interested in finding an ϵ-optimal NE or QRE in terms of the duality gap, i.e., the difference in the value functions when either agent unilaterally deviates from the solution policy (see the illustrative sketch after the overview below).

• For the infinite-horizon discounted setting, Daskalakis et al. (2020) demonstrated that the independent policy gradient method, with direct parameterization and asymmetric learning rates, finds an ϵ-optimal NE within a polynomial number of iterations. Zeng et al. (2022) improved over this rate using an entropy-regularized policy gradient method with softmax parameterization and asymmetric learning rates. On the other hand, Wei et al. (2021b) proposed an optimistic gradient descent ascent (OGDA) method (Rakhlin and Sridharan, 2013) with direct parameterization and symmetric learning rates,¹ which achieves last-iterate convergence at a rather pessimistic iteration complexity.

• For the finite-horizon episodic setting, Zhang et al. (2022); Yang and Ma (2022) showed that the weighted average-iterate of the optimistic Follow-The-Regularized-Leader (FTRL) method, when combined with slow critic updates, finds an ϵ-optimal NE in a polynomial number of iterations.

A more complete summary of prior results can be found in Table 1 and Table 2. In brief, while there has been encouraging progress in developing computationally efficient policy gradient methods
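To make the preceding notions concrete, the following minimal sketch (our own illustration, not the paper's algorithm verbatim) computes the duality gap of a two-player zero-sum matrix game and runs symmetric, single-loop, entropy-regularized optimistic (extragradient-style) multiplicative weights updates for both players. The payoff matrix A, learning rate eta, regularization weight tau, and iteration budget are illustrative assumptions.

```python
import numpy as np

def duality_gap(A, x, y):
    """Duality gap of (x, y) for the zero-sum game min_x max_y x^T A y:
    how much either player could gain by a unilateral best-response deviation."""
    return np.max(A.T @ x) - np.min(A @ y)

def softmax(logits):
    z = logits - np.max(logits)
    p = np.exp(z)
    return p / p.sum()

def entropy_regularized_omwu(A, tau=0.05, eta=0.1, num_iters=5000):
    """Symmetric, single-loop optimistic (extragradient-style) multiplicative
    weights updates for the entropy-regularized matrix game, intended to
    approach a quantal response equilibrium of the regularized game."""
    m, n = A.shape
    x = np.ones(m) / m   # min player's mixed strategy
    y = np.ones(n) / n   # max player's mixed strategy
    for _ in range(num_iters):
        # Prediction (extrapolation) step from the current strategies.
        x_bar = softmax((1 - eta * tau) * np.log(x) - eta * (A @ y))
        y_bar = softmax((1 - eta * tau) * np.log(y) + eta * (A.T @ x))
        # Update step using payoff gradients evaluated at the predicted strategies.
        x = softmax((1 - eta * tau) * np.log(x) - eta * (A @ y_bar))
        y = softmax((1 - eta * tau) * np.log(y) + eta * (A.T @ x_bar))
    return x, y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.uniform(-1, 1, size=(5, 5))   # payoff matrix (min player pays x^T A y)
    x, y = entropy_regularized_omwu(A)
    # The gap is small up to a bias controlled by the regularization weight tau.
    print("duality gap:", duality_gap(A, x, y))
```

In this sketch, decreasing tau shrinks the bias between the regularized solution and the unregularized NE, mirroring the way the paper controls the amount of regularization to convert convergence to the QRE into convergence to the NE.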



¹ To be precise, Wei et al. (2021b) proved the average-iterate convergence of the duality gap, as well as the last-iterate convergence of the policy in terms of the Euclidean distance to the set of NEs, where it is possible to translate the latter last-iterate convergence into a guarantee on the duality gap (see Appendix G). The resulting iteration complexity, however, is much worse than that of the average-iterate convergence in terms of the duality gap, with a problem-dependent constant that can scale pessimistically with salient problem parameters.

