O(T^{-1}) CONVERGENCE OF OPTIMISTIC-FOLLOW-THE-REGULARIZED-LEADER IN TWO-PLAYER ZERO-SUM MARKOV GAMES

Abstract

We prove that optimistic-follow-the-regularized-leader (OFTRL), together with smooth value updates, finds an O(T^{-1})-approximate Nash equilibrium in T iterations for two-player zero-sum Markov games with full information. This improves the Õ(T^{-5/6}) convergence rate recently shown by Zhang et al. (2022b). The refined analysis hinges on two essential ingredients. First, the sum of the regrets of the two players, though not necessarily non-negative as in normal-form games, is approximately non-negative in Markov games. This property allows us to bound the second-order path lengths of the learning dynamics. Second, we prove a tighter algebraic inequality regarding the weights deployed by OFTRL that shaves an extra log T factor. This crucial improvement enables the inductive analysis that leads to the final O(T^{-1}) rate.

1. INTRODUCTION

Multi-agent reinforcement learning (MARL) (Busoniu et al., 2008; Zhang et al., 2021) models sequential decision-making problems in which multiple agents/players interact with each other in a shared environment. MARL has recently achieved tremendous success in playing games (Vinyals et al., 2019; Berner et al., 2019; Brown & Sandholm, 2019), which, consequently, has spurred a growing body of work on MARL; see Yang & Wang (2020) for a recent overview. A widely adopted mathematical model for MARL is the Markov game (Shapley, 1953; Littman, 1994), which combines normal-form games (Nash, 1951) with Markov decision processes (Puterman, 2014). In a nutshell, a Markov game starts in a certain state, in which the players take actions. The players then receive their respective payoffs, as in a normal-form game, and at the same time the system transits to a new state, as in a Markov decision process. The whole process then repeats. As in normal-form games, the goal of each player is to maximize her own cumulative payoff. We defer the precise description of Markov games to Section 2.

In the simpler setting of normal-form games, no-regret learning (Cesa-Bianchi & Lugosi, 2006) has long been used as an effective method to achieve competence in multi-agent environments. Take the two-player zero-sum normal-form game as an example. It is easy to show that standard no-regret algorithms such as follow-the-regularized-leader (FTRL) reach an O(T^{-1/2})-approximate Nash equilibrium (Nash, 1951) in T iterations. Surprisingly, the seminal work of Daskalakis et al. (2011) demonstrates that a special no-regret algorithm, built upon Nesterov's excessive gap technique (Nesterov, 2005), achieves a faster, near-optimal Õ(T^{-1}) rate of convergence to the Nash equilibrium. This fast convergence was later established for optimistic variants of mirror descent (Rakhlin & Sridharan, 2013) and FTRL (Syrgkanis et al., 2015).
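To make the self-play setup concrete, the following is a minimal sketch (not from the paper) of optimistic multiplicative weights, i.e., the entropy-regularized instance of OFTRL, run in self-play on a zero-sum matrix game. The payoff matrix A, the step size eta, and the horizon T are illustrative choices; the sketch reports the duality gap of the time-averaged strategies, which is zero exactly at a Nash equilibrium.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def oftrl_selfplay(A, T=2000, eta=0.05):
    """Entropy-regularized OFTRL (optimistic multiplicative weights)
    in self-play on the zero-sum game max_x min_y x^T A y."""
    m, n = A.shape
    gx, gy = np.zeros(m), np.zeros(n)   # cumulative payoff/gain vectors
    px, py = np.zeros(m), np.zeros(n)   # optimistic predictions of the next gradient
    xbar, ybar = np.zeros(m), np.zeros(n)
    for _ in range(T):
        # OFTRL: best response to cumulative gradients plus a one-step prediction
        x = softmax(eta * (gx + px))
        y = softmax(eta * (gy + py))
        ux, uy = A @ y, -A.T @ x        # realized gradients (y-player minimizes)
        gx += ux
        gy += uy
        px, py = ux, uy                 # predict the next gradient equals the current one
        xbar += x
        ybar += y
    xbar, ybar = xbar / T, ybar / T
    # duality gap of the averaged strategies; non-negative, zero at equilibrium
    return (A @ ybar).max() - (xbar @ A).min()

A = np.array([[3.0, -1.0], [-1.0, 1.0]])  # toy 2x2 game with a mixed equilibrium
print(oftrl_selfplay(A))                  # small gap, shrinking as T grows
```

The only difference from plain FTRL is the prediction term (px, py); dropping it recovers standard multiplicative weights, whose averaged iterates converge at the slower O(T^{-1/2}) rate.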
Since then, a flurry of research (Chen & Peng, 2020; Daskalakis et al., 2021; Anagnostides et al., 2022a;b; Farina et al., 2022) has been conducted around optimistic no-regret learning algorithms to obtain faster rates of convergence in normal-form games. In contrast, research on the fast convergence of optimistic no-regret learning in Markov games has been scarce. In this paper, we focus on two-player zero-sum Markov games, arguably the simplest class of Markov games. Zhang et al. (2022b) recently initiated the study of the optimistic-follow-the-regularized-leader (OFTRL) algorithm in this setting and proved that OFTRL converges to an Õ(T^{-5/6})-approximate Nash equilibrium after T iterations. In light of the faster O(T^{-1}) convergence of optimistic algorithms in normal-form games, it is natural to ask:

After T iterations, can OFTRL find an O(T^{-1})-approximate Nash equilibrium in two-player zero-sum Markov games?

In fact, this question has also been raised by Zhang et al. (2022b) in their Discussion section. More promisingly, they have verified the fast O(T^{-1}) convergence of OFTRL in a simple two-stage Markov game; see Fig. 1 therein. Our main contribution in this work is to answer this question affirmatively, improving the Õ(T^{-5/6}) rate of Zhang et al. (2022b) to the optimal O(T^{-1}) rate.

The improved rate for OFTRL arises from two technical contributions. The first is the approximate non-negativity of the sum of the regrets of the two players in Markov games. In particular, the sum is lower bounded by the negative estimation error of the optimal Q-function; see Lemma 6 for the precise statement. This is in stark contrast to two-player zero-sum normal-form games (Anagnostides et al., 2022c) and multi-player general-sum normal-form games (Anagnostides et al., 2022b), in which, by definition, the sum of the external/swap regrets is non-negative.
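The normal-form baseline can be checked numerically: in a zero-sum matrix game, for any sequence of play, the two players' external regrets sum to exactly T times the duality gap of the averaged strategies, hence the sum is non-negative by definition. The following toy verification (not from the paper; the random matrix and strategy sequences are purely illustrative) confirms this identity.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))            # arbitrary zero-sum payoff matrix
T = 50
xs = rng.dirichlet(np.ones(3), size=T)     # arbitrary strategy sequence for the x-player
ys = rng.dirichlet(np.ones(4), size=T)     # arbitrary strategy sequence for the y-player

# External regrets against the best fixed strategy in hindsight
# (x maximizes x^T A y, y minimizes it).
vals = np.array([x @ A @ y for x, y in zip(xs, ys)])
reg_x = (sum(A @ y for y in ys)).max() - vals.sum()
reg_y = vals.sum() - (sum(x @ A for x in xs)).min()

# Duality gap of the averaged strategies, which is non-negative.
xbar, ybar = xs.mean(axis=0), ys.mean(axis=0)
gap = (A @ ybar).max() - (xbar @ A).min()

print(reg_x + reg_y, T * gap)   # the two quantities coincide
```

In Markov games this identity breaks down, because the players' payoff matrices are themselves estimates that evolve with the value updates; the approximate non-negativity established in Lemma 6 is the substitute.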
This approximate non-negativity proves crucial for controlling the second-order path length of the learning dynamics induced by OFTRL. In a different context, namely time-varying zero-sum normal-form games, Zhang et al. (2022a) also utilize a form of approximate non-negativity of the sum of the regrets. However, the source of the gap from non-negativity is different: in Zhang et al. (2022a) it arises from the time-varying nature of the zero-sum game, while in our case with Markov games, it comes from the algorithm's own estimation error of the equilibrium payoff matrix.

Secondly, central to the analysis of finite-horizon Markov decision processes (and likewise Markov games) is induction across the horizon. In our case, in order to carry out the induction step, we prove a tighter algebraic inequality regarding the weights deployed by OFTRL; see Lemma 4. In particular, we shave an extra log T factor. Surprisingly, this seemingly harmless log T factor is the key to enabling the aforementioned inductive analysis and, as a by-product, removes the extra logarithmic factor from the performance guarantee of OFTRL.

Note that, as an imperfect remedy, Zhang et al. (2022b) proposed a modified OFTRL algorithm that achieves Õ(T^{-1}) convergence to the Nash equilibrium. However, compared to the vanilla OFTRL algorithm considered herein, the modified version tracks two Q-functions, adopts a different Q-function update procedure that can be more costly in certain scenarios, and, more importantly, diverges from the general policy optimization framework proposed in Zhang et al. (2022b). Our work bridges these gaps by establishing fast convergence for the vanilla OFTRL.

Another line of algorithms for computing Nash equilibria is based on dynamic programming (Perolat et al., 2015; Zhang et al., 2022b; Cen et al., 2021).
Unlike the single-loop structure of OFTRL, the dynamic programming approach requires a nested loop, with the outer loop iterating over the horizon and the inner loops solving a sub-game at each stage. This requires more tuning parameters, one set for each subproblem/layer; such extra tuning is documented in Cen et al. (2021). The nested nature of dynamic programming also demands that one predetermine a target precision ϵ and solve the sub-game at each stage of the horizon to precision ϵ/H. This is less convenient in practice compared to a single-loop algorithm like the

