CAN WE FIND NASH EQUILIBRIA AT A LINEAR RATE IN MARKOV GAMES?

Abstract

We study decentralized learning in two-player zero-sum discounted Markov games, where the goal is to design a policy optimization algorithm for either agent satisfying two properties. First, the player does not need to know the policy of the opponent to update its policy. Second, when both players adopt the algorithm, their joint policy converges to a Nash equilibrium of the game. To this end, we construct a meta algorithm, dubbed Homotopy-PO, which provably finds a Nash equilibrium at a global linear rate. In particular, Homotopy-PO interweaves two base algorithms, Local-Fast and Global-Slow, via homotopy continuation. Local-Fast is an algorithm that enjoys local linear convergence, while Global-Slow converges globally but at a slower, sublinear rate. Global-Slow essentially serves as a "guide" that identifies a benign neighborhood in which Local-Fast enjoys fast convergence. However, since the exact size of such a neighborhood is unknown, we apply a doubling trick to switch between the two base algorithms. The switching scheme is delicately designed so that the aggregated performance of the algorithm is driven by Local-Fast. Furthermore, we prove that Local-Fast and Global-Slow can both be instantiated by variants of the optimistic gradient descent/ascent (OGDA) method, which is of independent interest.

1. INTRODUCTION

Multi-agent reinforcement learning (MARL), which studies how a group of agents interact with each other and make decisions in a shared environment (Zhang et al., 2021a), has received much attention in recent years due to its wide applications in games (Lanctot et al., 2019; Silver et al., 2017; Vinyals et al., 2019), robust reinforcement learning (Pinto et al., 2017; Tessler et al., 2019; Zhang et al., 2021b), and robotics (Shalev-Shwartz et al., 2016; Matignon et al., 2012), among many others. Problems in MARL are frequently formulated as Markov games (Littman, 1994; Shapley, 1953). In this paper, we focus on one important class of Markov games: two-player zero-sum Markov games, in which the two players compete against each other in an environment where the state transition and reward depend on both players' actions.

Our goal is to design efficient policy optimization methods to find Nash equilibria in zero-sum Markov games. This task is usually formulated as a nonconvex-nonconcave minimax optimization problem. Prior work has shown that Nash equilibria in matrix games, a special class of zero-sum Markov games with convex-concave structure, can be found at a linear rate (Gilpin et al., 2012; Wei et al., 2020). Due to the nonconvexity-nonconcavity, however, the theoretical understanding of zero-sum Markov games is more limited. Existing methods have either sublinear rates for finding Nash equilibria, or linear rates for finding regularized Nash equilibria, such as quantal response equilibria, which only approximate Nash equilibria (Alacaoglu et al., 2022; Cen et al., 2021; Daskalakis et al., 2020; Pattathil et al., 2022; Perolat et al., 2015; Wei et al., 2021; Yang & Ma, 2022; Zeng et al., 2022; Zhang et al., 2022; Zhao et al., 2022). A natural question is:

Q1: Can we find Nash equilibria for two-player zero-sum Markov games at a linear rate?

Furthermore, in Markov games, it is desirable to design decentralized algorithms.
That is, when a player updates its policy, it does not need to know the policies of the other agents, as such information is usually unavailable, especially when the game is competitive in nature. Meanwhile, other desiderata in MARL include symmetric updates and rationality. Here, symmetry means that the algorithm employed by each player is the same, and their updates differ only through the different local information possessed by each player. Rationality means that if the other players adopt stationary policies, the algorithm converges to the best-response policy (Sayin et al., 2021; Wei et al., 2021); in other words, the algorithm finds the optimal policy of the player. In decentralized learning, each player observes dynamic local information due to changes in the other players' policies, which makes it more challenging to design efficient algorithms (Daskalakis et al., 2020; Hernandez-Leal et al., 2017; Sayin et al., 2021). Symmetric updates also pose challenges for convergence: Condon (1990) shows that multiple variants of value iteration with symmetric updates can cycle and fail to find NEs, and gradient descent/ascent (GDA) with symmetric updates can cycle even in matrix games (Daskalakis et al., 2018; Mertikopoulos et al., 2018). Thus, an even more challenging question to pose is:

Q2: Can we further answer Q1 with a decentralized algorithm that is symmetric and rational?

In this paper, we give the first affirmative answers to Q1 and Q2. Specifically, we propose a meta algorithm, Homotopy-PO, which provably converges to a Nash equilibrium (NE) using two base algorithms, Local-Fast and Global-Slow. Homotopy-PO is a homotopy continuation style algorithm that switches between Local-Fast and Global-Slow, where Global-Slow behaves as a "guide" that identifies a benign neighborhood in which Local-Fast enjoys linear convergence. A novel switching scheme is designed to achieve global linear convergence without knowing the size of such a neighborhood.
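To convey the intuition behind the doubling-trick switching scheme, the following toy sketch alternates a slow globally convergent routine with a fast locally convergent one, doubling the epoch length each round. Everything here is a hypothetical simplification of ours: the scalar "distance to equilibrium" state, the function names, and the rollback rule are illustrative only, not the actual algorithm or its analysis.

```python
def homotopy_po(step_slow, step_fast, x0, dist, n_epochs=8):
    """Toy meta-loop: alternate two base routines with doubling epoch
    lengths, rolling back when the fast routine makes no progress."""
    x = x0
    for k in range(n_epochs):
        T = 2 ** k                  # doubling trick: epoch length grows
        for _ in range(T):          # "Global-Slow" phase: slow but sure progress
            x = step_slow(x)
        checkpoint, d0 = x, dist(x)
        for _ in range(T):          # "Local-Fast" phase: fast only near the NE
            x = step_fast(x)
        if dist(x) > d0:            # fast routine was outside its benign
            x = checkpoint          # neighborhood: discard its iterates
    return x

# Hypothetical instantiation on a 1-D "distance to equilibrium" state:
slow = lambda d: 0.99 * d                          # mild global contraction
fast = lambda d: 0.5 * d if d < 0.1 else 1.05 * d  # linear rate only when d < 0.1
print(homotopy_po(slow, fast, 1.0, dist=abs))
```

In this toy run, the slow routine eventually guides the iterate inside the (unknown) radius-0.1 neighborhood, after which the fast routine's halvings dominate the aggregate rate, mirroring the "guide" behavior described above.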
Next, we propose the averaging independent optimistic gradient descent/ascent (Averaging OGDA) method and the independent optimistic policy gradient descent/ascent (OGDA) method. Then, we instantiate Homotopy-PO by proving that Averaging OGDA and OGDA satisfy the conditions of Global-Slow and Local-Fast, respectively. This yields the first algorithm which provably finds Nash equilibria in zero-sum Markov games at a global linear rate. In addition, Homotopy-PO is decentralized, symmetric, rational and last-iterate convergent.
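As a sanity check for why optimism matters, consider the standard unconstrained bilinear toy problem min_x max_y xy, which is far simpler than the Markov-game setting studied here: plain simultaneous GDA spirals away from the equilibrium (0, 0), while the optimistic update, which uses the extrapolated gradient 2g_t - g_{t-1}, contracts toward it. The stepsize and iteration count below are ad hoc choices for illustration.

```python
import math

eta, T = 0.1, 3000

# Simultaneous GDA on f(x, y) = x * y: the iterates spiral outward.
x, y = 1.0, 1.0
for _ in range(T):
    x, y = x - eta * y, y + eta * x    # grad_x f = y, grad_y f = x
gda_norm = math.hypot(x, y)

# OGDA replaces the gradient with the optimistic estimate 2*g_t - g_{t-1}.
x, y = 1.0, 1.0
xp, yp = x, y                          # previous iterate, for the old gradient
for _ in range(T):
    x, y, xp, yp = x - eta * (2 * y - yp), y + eta * (2 * x - xp), x, y
ogda_norm = math.hypot(x, y)

print(gda_norm, ogda_norm)             # GDA diverges; OGDA contracts to (0, 0)
```

For this bilinear game one can check that GDA multiplies the squared distance to the equilibrium by (1 + eta^2) every step, whereas the OGDA recursion has spectral radius below 1 for small eta, which is the convex-concave analogue of the local linear convergence claimed for OGDA above.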

Our contribution.

Our contribution is two-fold. First, we propose a meta algorithm, Homotopy-PO, which converges to Nash equilibria of two-player zero-sum Markov games at a global linear rate whenever its two base algorithms satisfy certain benign properties. Moreover, Homotopy-PO is a decentralized algorithm and enjoys additional MARL desiderata, including symmetric updates, rationality, and last-iterate convergence. Second, we instantiate Homotopy-PO by designing two base algorithms based on variants of the GDA method, which we prove satisfy the conditions required by Homotopy-PO. In particular, we prove that the base algorithm OGDA enjoys local linear convergence to Nash equilibria, which may be of independent interest.

1.1. RELATED WORK

A more comprehensive literature review is deferred to Appendix A due to space limitations. Of particular relevance are the two decentralized algorithms of Daskalakis et al. (2020) and Wei et al. (2021). Daskalakis et al. (2020) consider an independent policy gradient descent/ascent algorithm that is a natural extension of single-agent policy gradient descent to two-player zero-sum Markov games. They utilize the two-sided gradient dominance property to prove a sublinear convergence rate of the gradient-descent-ascent (GDA) method. This is the first non-asymptotic convergence result of GDA for finding Nash equilibria in Markov games. However, their method is asymmetric, with one player taking much smaller steps than its opponent, and their convergence results are based on average policies, with no explicit guarantee of last-iterate convergence. Wei et al. (2021) propose an actor-critic optimistic policy gradient descent/ascent algorithm that is simultaneously decentralized, symmetric, and rational, and has an O(1/√t) last-iterate convergence rate to the Nash equilibrium set. They use a critic which averages the approximate value functions from past iterations to tame the nonstationarity in approximate Q-functions and obtain better approximations of the policy gradients. A classical averaging stepsize from Jin et al. (2018) is utilized by the critic so that the errors accumulate slowly and last-

