CAN WE FIND NASH EQUILIBRIA AT A LINEAR RATE IN MARKOV GAMES?

Abstract

We study decentralized learning in two-player zero-sum discounted Markov games, where the goal is to design a policy optimization algorithm for either agent satisfying two properties. First, a player does not need to know the policy of its opponent in order to update its own policy. Second, when both players adopt the algorithm, their joint policy converges to a Nash equilibrium of the game. To this end, we construct a meta-algorithm, dubbed Homotopy-PO, that provably finds a Nash equilibrium at a global linear rate. In particular, Homotopy-PO interweaves two base algorithms, Local-Fast and Global-Slow, via homotopy continuation: Local-Fast enjoys local linear convergence, while Global-Slow converges globally but at a slower sublinear rate. Within this scheme, Global-Slow essentially serves as a "guide" that steers the iterates into a benign neighborhood where Local-Fast converges fast. However, since the exact size of this neighborhood is unknown, we apply a doubling trick to switch between the two base algorithms. The switching scheme is delicately designed so that the aggregate performance of the algorithm is driven by Local-Fast. Furthermore, we prove that both Local-Fast and Global-Slow can be instantiated by variants of the optimistic gradient descent-ascent (OGDA) method, which is of independent interest.
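To make the switching scheme concrete, the following is a minimal, hypothetical sketch of the doubling-trick skeleton described above. It is not the paper's Homotopy-PO (which runs OGDA variants on a Markov game); instead, `local_fast` and `global_slow` are placeholder base algorithms on a toy scalar problem, and `gap` stands in for a computable optimality measure (e.g., a duality gap) that is zero at equilibrium. All function names and the progress test are illustrative assumptions.

```python
def homotopy_po(x0, local_fast, global_slow, gap, tol=1e-8, max_rounds=20):
    """Doubling-trick switching between a locally fast and a globally slow
    base algorithm (illustrative sketch, not the paper's exact procedure)."""
    x, t = x0, 1
    for _ in range(max_rounds):
        # Tentatively run Local-Fast for t steps. If it fails the linear-rate
        # progress check, the iterate is likely outside the benign neighborhood.
        x_fast = local_fast(x, t)
        if gap(x_fast) <= gap(x) * 0.5 ** t:
            x = x_fast  # linear-rate progress: keep the Local-Fast iterate
        else:
            # Fall back on Global-Slow, which guides the iterate toward the
            # benign neighborhood where Local-Fast converges quickly.
            x = global_slow(x, t)
        if gap(x) <= tol:
            break
        t *= 2  # doubling trick: the neighborhood's size is unknown
    return x

# Toy instantiation: minimize f(x) = (x - 1)^2.
def newton(x, steps):      # "Local-Fast": Newton's method
    for _ in range(steps):
        x -= (2 * (x - 1)) / 2.0   # f'(x) / f''(x)
    return x

def gd(x, steps):          # "Global-Slow": small-step gradient descent
    for _ in range(steps):
        x -= 0.1 * 2 * (x - 1)
    return x

x_star = homotopy_po(5.0, newton, gd, gap=lambda x: (x - 1) ** 2)
```

On this toy quadratic, Newton's method reaches the minimizer in one step, so the progress check passes immediately and the meta-algorithm returns `x_star` close to 1; the point of the sketch is only the structure of the tentative run, progress check, fallback, and budget doubling.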

1. INTRODUCTION

Multi-agent reinforcement learning (MARL), which studies how a group of agents interact with each other and make decisions in a shared environment (Zhang et al., 2021a), has received much attention in recent years due to its wide applications in games (Lanctot et al., 2019; Silver et al., 2017; Vinyals et al., 2019), robust reinforcement learning (Pinto et al., 2017; Tessler et al., 2019; Zhang et al., 2021b), and robotics (Shalev-Shwartz et al., 2016; Matignon et al., 2012), among many others. Problems in MARL are frequently formulated as Markov games (Littman, 1994; Shapley, 1953). In this paper, we focus on one important class of Markov games: two-player zero-sum Markov games, in which the two players compete against each other in an environment whose state transitions and rewards depend on both players' actions.

Our goal is to design efficient policy optimization methods that find Nash equilibria of zero-sum Markov games. This task is usually formulated as a nonconvex-nonconcave minimax optimization problem. Prior work has shown that Nash equilibria of matrix games, a special class of zero-sum Markov games with convex-concave structure, can be found at a linear rate (Gilpin et al., 2012; Wei et al., 2020). However, due to the nonconvexity-nonconcavity, the theoretical understanding of zero-sum Markov games is more limited: existing methods achieve either sublinear rates for finding Nash equilibria, or linear rates for finding regularized Nash equilibria, such as quantal response equilibria, which only approximate Nash equilibria (Alacaoglu et al., 2022; Cen et al., 2021; Daskalakis et al., 2020; Pattathil et al., 2022; Perolat et al., 2015; Wei et al., 2021; Yang & Ma, 2022; Zeng et al., 2022; Zhang et al., 2022; Zhao et al., 2022). A natural question is:

Q1: Can we find Nash equilibria for two-player zero-sum Markov games at a linear rate?

