O(T^{-1}) CONVERGENCE OF OPTIMISTIC-FOLLOW-THE-REGULARIZED-LEADER IN TWO-PLAYER ZERO-SUM MARKOV GAMES

Abstract

We prove that optimistic-follow-the-regularized-leader (OFTRL), together with smooth value updates, finds an O(T^{-1})-approximate Nash equilibrium in T iterations for two-player zero-sum Markov games with full information. This improves the Õ(T^{-5/6}) convergence rate recently shown by Zhang et al. (2022b). The refined analysis hinges on two essential ingredients. First, the sum of the regrets of the two players, though not necessarily non-negative as in normal-form games, is approximately non-negative in Markov games. This property allows us to bound the second-order path lengths of the learning dynamics. Second, we prove a tighter algebraic inequality regarding the weights deployed by OFTRL that shaves an extra log T factor. This crucial improvement enables the inductive analysis that leads to the final O(T^{-1}) rate.

1. INTRODUCTION

Multi-agent reinforcement learning (MARL) (Busoniu et al., 2008; Zhang et al., 2021) models sequential decision-making problems in which multiple agents/players interact with each other in a shared environment. MARL has recently achieved tremendous success in playing games (Vinyals et al., 2019; Berner et al., 2019; Brown & Sandholm, 2019), which, consequently, has spurred a growing body of work on MARL; see Yang & Wang (2020) for a recent overview. A widely adopted mathematical model for MARL is the so-called Markov game (Shapley, 1953; Littman, 1994), which combines normal-form games (Nash, 1951) with Markov decision processes (Puterman, 2014). In a nutshell, a Markov game starts from a certain state, followed by actions taken by the players. The players then receive their respective payoffs, as in a normal-form game, and at the same time the system transits to a new state, as in a Markov decision process. The whole process repeats. As in normal-form games, the goal of each player is to maximize her own cumulative payoff. We defer the precise description of Markov games to Section 2.

In the simpler setting of normal-form games, no-regret learning (Cesa-Bianchi & Lugosi, 2006) has long been used as an effective method to achieve competence in the multi-agent environment. Take the two-player zero-sum normal-form game as an example. It is easy to show that standard no-regret algorithms such as follow-the-regularized-leader (FTRL) reach an O(T^{-1/2})-approximate Nash equilibrium (Nash, 1951) in T iterations. Surprisingly, the seminal work of Daskalakis et al. (2011) demonstrates that a special no-regret algorithm, built upon Nesterov's excessive gap technique (Nesterov, 2005), achieves a faster and near-optimal Õ(T^{-1}) rate of convergence to the Nash equilibrium. This fast convergence was later established for optimistic variants of mirror descent (Rakhlin & Sridharan, 2013) and FTRL (Syrgkanis et al., 2015). Since then,
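To make the normal-form benchmark concrete, here is a minimal sketch of optimistic Hedge, the instance of OFTRL with entropy regularization, self-played on a 2x2 zero-sum game (matching pennies). This is an illustration of the general technique, not the paper's Markov-game algorithm; the payoff matrix, step size eta, and horizon T are illustrative choices. Each player best-responds (via softmax) to the cumulative gradient plus an optimistic prediction equal to the most recent gradient, and the time-averaged strategies approach the Nash equilibrium.

```python
# Hedged sketch of optimistic Hedge (OFTRL with entropy regularizer)
# on matching pennies. All names and constants here are illustrative.
import math

A = [[1.0, -1.0], [-1.0, 1.0]]  # row player maximizes x^T A y

def softmax(scores, eta):
    # Numerically stable softmax over eta-scaled scores.
    m = max(scores)
    w = [math.exp(eta * (s - m)) for s in scores]
    z = sum(w)
    return [v / z for v in w]

def run(T=2000, eta=0.1):
    gx = [0.0, 0.0]  # cumulative gain vector for the row player
    gy = [0.0, 0.0]  # cumulative loss vector for the column player
    px, py = [0.0, 0.0], [0.0, 0.0]  # optimistic predictions (last gradients)
    avg_x, avg_y = [0.0, 0.0], [0.0, 0.0]
    for _ in range(T):
        # OFTRL update: regularized best response to cumulative + predicted gradient.
        x = softmax([gx[i] + px[i] for i in range(2)], eta)
        y = softmax([-(gy[j] + py[j]) for j in range(2)], eta)  # minimizer
        # Realized gradients: (A y) for the row player, (x^T A) for the column player.
        ux = [sum(A[i][j] * y[j] for j in range(2)) for i in range(2)]
        uy = [sum(A[i][j] * x[i] for i in range(2)) for j in range(2)]
        for i in range(2):
            gx[i] += ux[i]
            gy[i] += uy[i]
            avg_x[i] += x[i] / T
            avg_y[i] += y[i] / T
        px, py = ux, uy  # optimism: predict next gradient equals current one
    return avg_x, avg_y

def duality_gap(x, y):
    # Nash-gap of (x, y): max_i (A y)_i - min_j (x^T A)_j; zero at equilibrium.
    ay = [sum(A[i][j] * y[j] for j in range(2)) for i in range(2)]
    xa = [sum(A[i][j] * x[i] for i in range(2)) for j in range(2)]
    return max(ay) - min(xa)
```

Running `run()` and measuring `duality_gap` on the averaged strategies shows the gap shrinking with T, in line with the fast rates for optimistic methods discussed above; replacing the optimistic predictions `px, py` with zeros recovers plain Hedge and the slower O(T^{-1/2}) behavior.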

