$O(T^{-1})$ CONVERGENCE OF OPTIMISTIC-FOLLOW-THE-REGULARIZED-LEADER IN TWO-PLAYER ZERO-SUM MARKOV GAMES

Abstract

We prove that optimistic-follow-the-regularized-leader (OFTRL), together with smooth value updates, finds an $O(T^{-1})$-approximate Nash equilibrium in $T$ iterations for two-player zero-sum Markov games with full information. This improves the $\widetilde O(T^{-5/6})$ convergence rate recently shown by Zhang et al. (2022b). The refined analysis hinges on two essential ingredients. First, the sum of the regrets of the two players, though not necessarily non-negative as in normal-form games, is approximately non-negative in Markov games. This property allows us to bound the second-order path lengths of the learning dynamics. Second, we prove a tighter algebraic inequality regarding the weights deployed by OFTRL that shaves an extra $\log T$ factor. This crucial improvement enables the inductive analysis that leads to the final $O(T^{-1})$ rate.

1. INTRODUCTION

Multi-agent reinforcement learning (MARL) (Busoniu et al., 2008; Zhang et al., 2021) models sequential decision-making problems in which multiple agents/players interact with each other in a shared environment. MARL has recently achieved tremendous success in playing games (Vinyals et al., 2019; Berner et al., 2019; Brown & Sandholm, 2019), which, consequently, has spurred a growing body of work on MARL; see Yang & Wang (2020) for a recent overview. A widely adopted mathematical model for MARL is the so-called Markov game (Shapley, 1953; Littman, 1994), which combines normal-form games (Nash, 1951) with Markov decision processes (Puterman, 2014). In a nutshell, a Markov game starts at a certain state, followed by actions taken by the players. The players then receive their respective payoffs, as in a normal-form game, and at the same time the system transits to a new state as in a Markov decision process. The whole process repeats. As in normal-form games, the goal of each player is to maximize her own cumulative payoff. We defer the precise description of Markov games to Section 2.

In the simpler normal-form games, no-regret learning (Cesa-Bianchi & Lugosi, 2006) has long been used as an effective method to achieve competence in the multi-agent environment. Take the two-player zero-sum normal-form game as an example. It is easy to show that standard no-regret algorithms such as follow-the-regularized-leader (FTRL) reach an $O(T^{-1/2})$-approximate Nash equilibrium (Nash, 1951) in $T$ iterations. Surprisingly, the seminal paper of Daskalakis et al. (2011) demonstrates that a special no-regret algorithm, built upon Nesterov's excessive gap technique (Nesterov, 2005), achieves a faster and near-optimal $\widetilde O(T^{-1})$ rate of convergence to the Nash equilibrium. This fast convergence was later established for optimistic variants of mirror descent (Rakhlin & Sridharan, 2013) and FTRL (Syrgkanis et al., 2015).
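The fast convergence of optimistic methods in normal-form games is easy to observe numerically. Below is a toy sketch (ours, not from any of the cited papers; the payoff matrix, stepsize, and iteration count are arbitrary illustrative choices) of optimistic hedge, an entropy-regularized instance of optimistic FTRL, on a 2x2 zero-sum game, measuring the NE-gap of the time-averaged strategies.

```python
import math

# Payoff matrix for the max-player of a 2x2 zero-sum game (our own toy choice;
# its unique Nash equilibrium is mixed, with game value 0.2).
A = [[2.0, -1.0], [-1.0, 1.0]]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def A_y(y):   # (A y)_a: expected payoff of each row against y
    return [sum(A[a][b] * y[b] for b in range(2)) for a in range(2)]

def AT_x(x):  # (A^T x)_b: expected payoff of each column against x
    return [sum(A[a][b] * x[a] for a in range(2)) for b in range(2)]

def oftrl_gap(T, eta=0.1):
    """Run optimistic hedge for both players; return the NE-gap of the averages."""
    cum_x = [0.0, 0.0]   # cumulative payoff vectors seen by the max-player
    cum_y = [0.0, 0.0]   # cumulative payoff vectors seen by the min-player
    pred_x = [0.0, 0.0]  # optimistic predictors: last round's payoff vectors
    pred_y = [0.0, 0.0]
    avg_x = [0.0, 0.0]
    avg_y = [0.0, 0.0]
    for _ in range(T):
        x = softmax([eta * (cum_x[a] + pred_x[a]) for a in range(2)])
        y = softmax([-eta * (cum_y[b] + pred_y[b]) for b in range(2)])
        gx, gy = A_y(y), AT_x(x)
        for a in range(2):
            cum_x[a] += gx[a]
            avg_x[a] += x[a] / T
        for b in range(2):
            cum_y[b] += gy[b]
            avg_y[b] += y[b] / T
        pred_x, pred_y = gx, gy
    # NE-gap of the averaged strategies: best-response value difference,
    # which is always non-negative.
    return max(A_y(avg_y)) - min(AT_x(avg_x))

print(oftrl_gap(2000))
```

Dropping the `pred_x`/`pred_y` terms recovers vanilla hedge, whose averaged NE-gap decays only at the slower $O(T^{-1/2})$ rate; with the optimistic predictor the gap shrinks roughly as $1/T$.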
Since then, a flurry of research (Chen & Peng, 2020; Daskalakis et al., 2021; Anagnostides et al., 2022a;b; Farina et al., 2022) has been conducted around optimistic no-regret learning algorithms to obtain faster rates of convergence in normal-form games. In contrast, research on the fast convergence of optimistic no-regret learning in Markov games has been scarce. In this paper, we focus on two-player zero-sum Markov games, arguably the simplest Markov games. Zhang et al. (2022b) recently initiated the study of the optimistic-follow-the-regularized-leader (OFTRL) algorithm in this setting and proved that OFTRL converges to an $\widetilde O(T^{-5/6})$-approximate Nash equilibrium after $T$ iterations. In light of the faster $O(T^{-1})$ convergence of optimistic algorithms in normal-form games, it is natural to ask:

After $T$ iterations, can OFTRL find an $O(T^{-1})$-approximate Nash equilibrium in two-player zero-sum Markov games?

In fact, this question was also raised by Zhang et al. (2022b) in their Discussion section. More promisingly, they verified the fast convergence (i.e., $O(T^{-1})$) of OFTRL empirically in a simple two-stage Markov game; see Fig. 1 therein. Our main contribution in this work is to answer this question affirmatively, by improving the $\widetilde O(T^{-5/6})$ rate demonstrated in Zhang et al. (2022b) to the optimal $O(T^{-1})$ rate. The improved rate for OFTRL arises from two technical contributions. The first is the approximate non-negativity of the sum of the regrets of the two players in Markov games. In particular, the sum is lower bounded by the negative estimation error of the optimal Q-function; see Lemma 6 for the precise statement. This is in stark contrast to two-player zero-sum normal-form games (Anagnostides et al., 2022c) and multi-player general-sum normal-form games (Anagnostides et al., 2022b), in which, by definition, the sums of the external/swap regrets are non-negative.
This approximate non-negativity proves crucial for controlling the second-order path length of the learning dynamics induced by OFTRL. In a different context, namely time-varying zero-sum normal-form games, Zhang et al. (2022a) also utilize a form of approximate non-negativity of the sum of the regrets. However, the source of the gap from non-negativity is different: in Zhang et al. (2022a) it arises from the time-varying nature of the zero-sum game, while in our case of Markov games it comes from the estimation error of the equilibrium payoff matrix incurred by the algorithm itself. Secondly, central to the analysis of finite-horizon Markov decision processes (and also Markov games) is the induction across the horizon. In our case, in order to carry out the induction step, we prove a tighter algebraic inequality related to the weights deployed by OFTRL; see Lemma 4. In particular, we shave an extra $\log T$ factor. Surprisingly, this seemingly harmless $\log T$ factor is the key to enabling the aforementioned induction analysis, and, as a by-product, its removal also eliminates the extra logarithmic factor in the performance guarantee of OFTRL. As an imperfect remedy, Zhang et al. (2022b) proposed a modified OFTRL algorithm that achieves $\widetilde O(T^{-1})$ convergence to the Nash equilibrium. However, compared to the vanilla OFTRL algorithm considered herein, the modified version tracks two Q-functions, adopts a different Q-function update procedure that can be more costly in certain scenarios, and, more importantly, diverges from the general policy optimization framework proposed in Zhang et al. (2022b). Our work bridges these gaps by establishing fast convergence for the vanilla OFTRL. Another line of algorithms for computing Nash equilibria is based on dynamic programming (Perolat et al., 2015; Zhang et al., 2022b; Cen et al., 2021).
Unlike the single-loop structure of OFTRL, the dynamic programming approach requires a nested loop, with the outer loop iterating over the horizons and the inner loops solving a sub-game at each horizon through iterations. This requires more tuning parameters, one set for each subproblem/layer; this kind of extra tuning is documented in Cen et al. (2021). The nested nature of dynamic programming also demands that one predetermine a target precision $\varepsilon$ and solve the sub-game at each horizon to precision $\varepsilon/H$. This is less convenient in practice than a single-loop algorithm like the OFTRL we study, where no such predetermined precision is necessary. A recent paper by Cen et al. (2022) also discusses the advantages of single-loop algorithms over those with nested loops.

1.1. RELATED WORK

Optimistic no-regret learning in games. Our work is most related to the line of work proving fast convergence of optimistic no-regret algorithms in various forms of games. Daskalakis et al. (2011) provide the first fast algorithm that reaches a Nash equilibrium at an $\widetilde O(T^{-1})$ rate in two-player zero-sum normal-form games. Later, in the same setup, Rakhlin & Sridharan (2013) prove a similar fast convergence for optimistic mirror descent (OMD). Syrgkanis et al. (2015) extend the results to multi-player general-sum normal-form games; in addition, they show that when all the players adopt optimistic algorithms, each individual (average) regret is at most $O(T^{-3/4})$. This is further improved to $O(T^{-5/6})$ in the special two-player zero-sum case (Chen & Peng, 2020). More recently, via a detailed analysis of higher-order smoothness, Daskalakis et al. (2021) and Anagnostides et al. (2022a) manage to improve the individual regret guarantee of optimistic hedge to $\widetilde O(T^{-1})$ in multi-player general-sum normal-form games, matching the result in the two-player case. A similar result is shown by Anagnostides et al. (2022b) with a different analysis using self-concordant barriers as regularizers. Several attempts have been made to extend the results on optimistic no-regret learning from normal-form games to Markov games. Wei et al. (2021) design a decentralized algorithm based on optimistic gradient descent/ascent that converges to a Nash equilibrium at an $\widetilde O(T^{-1/2})$ rate. Closest to ours is the work of Zhang et al. (2022b), which shows an $\widetilde O(T^{-5/6})$ convergence of OFTRL to the Nash equilibrium in two-player zero-sum Markov games and an $\widetilde O(T^{-3/4})$ convergence to a coarse correlated equilibrium in multi-player general-sum Markov games. Most recently, Erez et al. (2022) prove an $O(T^{-1/4})$ individual regret bound for OMD in multi-player general-sum Markov games.

Two-player zero-sum Markov games.
Our work also fits into the study of two-player zero-sum Markov games (Shapley, 1953; Littman, 1994). Various algorithms (Hu & Wellman, 2003; Littman, 1994; Zhao et al., 2021; Cen et al., 2021) have been proposed in the full-information setting, where one assumes the players have access to the exact state-action value functions. In particular, Zhao et al. (2021) and Cen et al. (2021) use optimistic approaches for normal-form games as subroutines to extend the $\widetilde O(T^{-1})$ convergence rates to two-player zero-sum Markov games; notably, they provide last-iterate convergence guarantees as well. However, in doing so, their algorithms require one to approximately solve a normal-form game in each iteration. In the bandit setting, Bai & Jin (2020); Xie et al. (2020); Bai et al. (2020); Liu et al. (2021); Zhang et al. (2020) study the sample complexity of two-player zero-sum Markov games. In addition, Sidford et al. (2020); Jia et al. (2019); Zhang et al. (2020); Li et al. (2022) investigate the sample complexity under a generative model, where one can query the Markov game at arbitrary states and actions. Last but not least, two-player zero-sum Markov games have recently been studied in the offline setting (Cui & Du, 2022; Yan et al., 2022), where the learner is given a set of historical data and cannot further interact with the Markov game.

2. PRELIMINARIES

This section provides the necessary background on Markov games and optimistic-follow-the-regularized-leader (OFTRL).

Two-player zero-sum Markov games.

Denote by MG$(H, \mathcal S, \mathcal A, \mathcal B, P, r)$ a finite-horizon time-inhomogeneous two-player zero-sum Markov game, with $H$ the horizon, $\mathcal S$ the state space, $\mathcal A$ (resp. $\mathcal B$) the action space for the max-player (resp. min-player), $P = \{P_h\}_{h\in[H]}$ the transition probabilities, and $r = \{r_h\}_{h\in[H]}$ the reward function. We assume the state space $\mathcal S$ and the action spaces $\mathcal A, \mathcal B$ to be finite, with sizes $S$, $A$, $B$, respectively, and that $r_h$ takes values in $[0,1]$. Without loss of generality, we assume that the game starts at a fixed state $s_1\in\mathcal S$. At each step $h$, both players observe the current state $s_h\in\mathcal S$; the max-player picks an action $a_h\in\mathcal A$ and the min-player picks an action $b_h\in\mathcal B$ simultaneously. The max-player (resp. min-player) then receives the reward $r_h(s_h,a_h,b_h)$ (resp. $-r_h(s_h,a_h,b_h)$), and the game transits to step $h+1$ with the next state $s_{h+1}$ sampled from $P_h(\cdot\mid s_h,a_h,b_h)$. The game ends after $H$ steps. The goal of the max-player is to maximize her total reward, while the min-player seeks to minimize the total reward obtained by the max-player.

Markov policies and value functions. Let $\mu = \{\mu_h\}_{h\in[H]}$ be the Markov policy of the max-player, where $\mu_h(\cdot\mid s)\in\Delta_{\mathcal A}$ is the distribution of actions the max-player picks when seeing state $s$ at step $h$. Here, $\Delta_{\mathcal X}$ denotes the set of all probability distributions on the space $\mathcal X$. Similarly, the min-player is equipped with a Markov policy $\nu = \{\nu_h\}_{h\in[H]}$. We define the value function of the policy pair $(\mu,\nu)$ at step $h$ to be
\[
V_h^{\mu,\nu}(s) := \mathbb{E}_{\mu,\nu}\Big[\sum_{i=h}^H r_i(s_i,a_i,b_i) \,\Big|\, s_h = s\Big],
\]
where the expectation is taken w.r.t. the policies $\{\mu_i,\nu_i\}_{i\ge h}$ and the state transitions $\{P_i\}_{i\ge h}$. Similarly, one can define the Q-function as
\[
Q_h^{\mu,\nu}(s,a,b) := \mathbb{E}_{\mu,\nu}\Big[\sum_{i=h}^H r_i(s_i,a_i,b_i) \,\Big|\, s_h = s,\, a_h = a,\, b_h = b\Big].
\]
In words, both functions represent the expected future rewards received by the max-player given the current state or state-action pair.

Best responses and Nash equilibria.
Fix a Markov policy $\nu$ for the min-player. There exists a Markov policy $\mu^\dagger(\nu)$ (a.k.a. a best response) such that for any $s\in\mathcal S$ and $h\in[H]$, $V_h^{\mu^\dagger(\nu),\nu}(s) = \sup_{\mu'} V_h^{\mu',\nu}(s)$, where the supremum is taken over all Markov policies. To simplify the notation, we denote $V_h^{\dagger,\nu}(s) := V_h^{\mu^\dagger(\nu),\nu}(s)$. Similarly, we can define $V_h^{\mu,\dagger}(s)$. It is known that there exists a pair $(\mu^\star,\nu^\star)$ of Markov policies that are best responses to each other, i.e.,
\[
V_h^{\mu^\star,\nu^\star}(s) = V_h^{\dagger,\nu^\star}(s) = V_h^{\mu^\star,\dagger}(s) \quad \text{for all } s\in\mathcal S \text{ and } h\in[H].
\]
Such a pair $(\mu^\star,\nu^\star)$ is called a Nash equilibrium (NE). We denote the value function and Q-function under any Nash equilibrium $(\mu^\star,\nu^\star)$ by $V_h^\star := V_h^{\mu^\star,\nu^\star}$ and $Q_h^\star := Q_h^{\mu^\star,\nu^\star}$, which are known to be unique even if there are multiple Nash equilibria (Shapley, 1953). The goal of learning in two-player zero-sum Markov games is to find an $\varepsilon$-approximation to the NE, defined as follows.

Definition 1 ($\varepsilon$-approximate Nash equilibrium). Fix any approximation accuracy $\varepsilon > 0$. A pair $(\mu,\nu)$ of Markov policies is an $\varepsilon$-approximate Nash equilibrium if
\[
\text{NE-gap}(\mu,\nu) := V_1^{\dagger,\nu}(s_1) - V_1^{\mu,\dagger}(s_1) \le \varepsilon. \tag{1}
\]

Algorithm 1: Optimistic-follow-the-regularized-leader for solving two-player zero-sum Markov games.
Input: stepsize $\eta$, reward function $r$, probability transition function $P$.
Initialization: $Q_h^0 \equiv 0$ for all $h\in[H]$.
For iteration $t = 1, \ldots, T$:
- Policy update: for all $s\in\mathcal S$ and $h\in[H]$,
\[
\mu_h^t(a\mid s) \propto \exp\bigg(\frac{\eta}{w_t}\Big(\sum_{i=1}^{t-1} w_i \big[Q_h^i \nu_h^i\big](s,a) + w_t \big[Q_h^{t-1}\nu_h^{t-1}\big](s,a)\Big)\bigg), \tag{2a}
\]
\[
\nu_h^t(b\mid s) \propto \exp\bigg(-\frac{\eta}{w_t}\Big(\sum_{i=1}^{t-1} w_i \big[(Q_h^i)^\top \mu_h^i\big](s,b) + w_t \big[(Q_h^{t-1})^\top \mu_h^{t-1}\big](s,b)\Big)\bigg). \tag{2b}
\]
- Value update: for all $s\in\mathcal S$, $a\in\mathcal A$, $b\in\mathcal B$, from $h = H$ down to $1$,
\[
Q_h^t(s,a,b) = (1-\alpha_t)\,Q_h^{t-1}(s,a,b) + \alpha_t \big[r_h + P_h (\mu_{h+1}^t)^\top Q_{h+1}^t \nu_{h+1}^t\big](s,a,b). \tag{3}
\]
Output the average policy: for all $s\in\mathcal S$, $h\in[H]$,
\[
\bar\mu_h(\cdot\mid s) := \sum_{t=1}^T \alpha_T^t\, \mu_h^t(\cdot\mid s), \qquad \bar\nu_h(\cdot\mid s) := \sum_{t=1}^T \alpha_T^t\, \nu_h^t(\cdot\mid s). \tag{4}
\]

An interlude: additional notations.
Before explaining OFTRL, we introduce some additional notation to simplify the exposition hereafter. Fix any $h\in[H]$ and $s\in\mathcal S$. For any function $Q : \mathcal S\times\mathcal A\times\mathcal B \to \mathbb R$, we may view $Q(s,\cdot,\cdot)$ as an $A\times B$ matrix and $\mu_h(\cdot\mid s)$, $\nu_h(\cdot\mid s)$ as vectors of length $A$ and $B$, respectively. Then for any policy pair $(\mu_h,\nu_h)$ at horizon $h$ we define
\[
\big[\mu_h^\top Q \nu_h\big](s) := \mathbb{E}_{a\sim\mu_h(\cdot\mid s),\, b\sim\nu_h(\cdot\mid s)}[Q(s,a,b)], \quad
\big[\mu_h^\top Q\big](s,\cdot) := \mathbb{E}_{a\sim\mu_h(\cdot\mid s)}[Q(s,a,\cdot)], \quad
\big[Q\nu_h\big](s,\cdot) := \mathbb{E}_{b\sim\nu_h(\cdot\mid s)}[Q(s,\cdot,b)].
\]
The term $[\mu_h^\top Q\nu_h](s)$ can also be written in the inner-product form $\langle \mu_h, Q\nu_h\rangle(s)$ or $\langle \nu_h, Q^\top \mu_h\rangle(s)$. It is easy to check that for fixed $s$ and $h$, the left-hand sides of these definitions are standard matrix operations. In addition, for any $V : \mathcal S \to \mathbb R$, we define the shorthand $[P_h V](s,a,b) := \mathbb{E}_{s'\sim P_h(\cdot\mid s,a,b)}[V(s')]$, which allows us to rewrite the Bellman updates of $V$ and $Q$ as
\[
V_h^{\mu,\nu}(s) = \big[\mu_h^\top Q_h^{\mu,\nu}\nu_h\big](s), \qquad Q_h^{\mu,\nu}(s,a,b) = r_h(s,a,b) + \big[P_h V_{h+1}^{\mu,\nu}\big](s,a,b).
\]

Optimistic-follow-the-regularized-leader. Now we are ready to introduce the optimistic-follow-the-regularized-leader (OFTRL) algorithm for solving two-player zero-sum Markov games, which first appeared in Zhang et al. (2022b); see Algorithm 1 for the full specification. In a nutshell, the algorithm has three main components. The first is the policy update (2), which runs weighted OFTRL for both the max- and min-players. Compared to the standard follow-the-regularized-leader algorithm, weighted OFTRL adds a loss predictor $[Q_h^{t-1}\nu_h^{t-1}](s,a)$ and deploys a weighted update according to the weights $\{w_i\}_{1\le i\le t}$, which we shall define momentarily. The second component is the backward value update (3), which takes a weighted average of the previous estimate and the Bellman update. The last essential part is outputting the weighted average policy (4) over all the historical policies. As one can see, weights play a big role in specifying the OFTRL algorithm.
In particular, we set
\[
\alpha_t := \frac{H+1}{H+t}, \qquad \alpha_t^t := \alpha_t, \qquad \alpha_t^i := \alpha_i \prod_{j=i+1}^t (1-\alpha_j), \qquad w_i := \frac{\alpha_t^i}{\alpha_t^1} = \frac{\alpha_i}{\alpha_1 \prod_{j=2}^i (1-\alpha_j)},
\]
which are the same choices as in Zhang et al. (2022b). Note that $w_i$ does not depend on $t$.
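To make the update rules (2)-(4) and the role of the weights concrete, here is a minimal Python sketch of Algorithm 1 in the degenerate case $H = 1$, where the Markov game collapses to a single matrix game and $Q_1^\star = r_1$. The game matrix, stepsize, and iteration count are our own toy choices, not anything from the paper.

```python
import math

# Toy instance, our own choice: with H = 1 the value update (3) has no
# bootstrapping term, and the optimal Q-function equals the reward matrix r.
H = 1
r = [[2.0, -1.0], [-1.0, 1.0]]   # reward matrix of the single stage
eta = 0.125                       # eta = C_eta * H^{-2} with C_eta = 1/8

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def run_oftrl(T):
    alpha = [0.0] + [(H + 1) / (H + t) for t in range(1, T + 1)]
    w = [0.0, 1.0]                                   # w_1 = 1
    for i in range(2, T + 1):
        w.append(w[-1] * (H + i - 1) / (i - 1))      # relative weight recurrence
    Q = [[0.0] * 2 for _ in range(2)]                # Q^0 = 0
    sum_x = [0.0, 0.0]; sum_y = [0.0, 0.0]           # sums of w_i [Q^i nu^i] etc.
    pred_x = [0.0, 0.0]; pred_y = [0.0, 0.0]         # one-step loss predictors
    policies = []
    for t in range(1, T + 1):
        # policy update (2): weighted OFTRL with the optimistic predictor
        mu = softmax([eta / w[t] * (sum_x[a] + w[t] * pred_x[a]) for a in range(2)])
        nu = softmax([-eta / w[t] * (sum_y[b] + w[t] * pred_y[b]) for b in range(2)])
        # value update (3); with H = 1 the Bellman target is just r
        Q = [[(1 - alpha[t]) * Q[a][b] + alpha[t] * r[a][b] for b in range(2)]
             for a in range(2)]
        Qnu  = [sum(Q[a][b] * nu[b] for b in range(2)) for a in range(2)]
        QTmu = [sum(Q[a][b] * mu[a] for a in range(2)) for b in range(2)]
        for a in range(2): sum_x[a] += w[t] * Qnu[a]
        for b in range(2): sum_y[b] += w[t] * QTmu[b]
        pred_x, pred_y = Qnu, QTmu
        policies.append((mu, nu))
    # output (4): average with weights alpha_T^t = alpha_t * prod_{j>t} (1 - alpha_j)
    aT, prod = [0.0] * T, 1.0
    for t in range(T, 0, -1):
        aT[t - 1] = alpha[t] * prod
        prod *= 1 - alpha[t]
    mu_bar = [sum(aT[t] * policies[t][0][a] for t in range(T)) for a in range(2)]
    nu_bar = [sum(aT[t] * policies[t][1][b] for t in range(T)) for b in range(2)]
    return Q, aT, mu_bar, nu_bar

def ne_gap(mu_bar, nu_bar):
    rows = [sum(r[a][b] * nu_bar[b] for b in range(2)) for a in range(2)]
    cols = [sum(r[a][b] * mu_bar[a] for a in range(2)) for b in range(2)]
    return max(rows) - min(cols)

Q, aT, mu_bar, nu_bar = run_oftrl(2000)
print(ne_gap(mu_bar, nu_bar))
```

In this degenerate case $\alpha_1 = 1$, so the value estimate equals $r$ after a single iteration and the dynamics reduce to weighted optimistic hedge on the matrix game; the printed NE-gap of the averaged policies should be small and shrink roughly as $1/T$.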

3. MAIN RESULT AND OVERVIEW OF THE PROOF

With the preliminaries in place, we are in a position to state our main result for OFTRL in two-player zero-sum Markov games.

Theorem 1. Consider Algorithm 1 with $\eta = C_\eta H^{-2}$ for some constant $C_\eta \le 1/8$. The output policy pair $(\bar\mu,\bar\nu)$ satisfies
\[
\text{NE-gap}(\bar\mu,\bar\nu) \le 320\, C_\eta^{-1} H^5 \cdot \frac{\log(AB)}{T}.
\]

Several remarks on Theorem 1 are in order. First, Theorem 1 demonstrates that OFTRL can find an $O(T^{-1})$-approximate Nash equilibrium in $T$ iterations. This improves the $\widetilde O(T^{-5/6})$ rate proved in the prior work (Zhang et al., 2022b), and also matches the empirical evidence provided therein. While Zhang et al. (2022b) also provide a modified OFTRL algorithm that achieves an $\widetilde O(T^{-1})$ rate by maintaining two separate value estimators (one for the max-player and the other for the min-player), the OFTRL algorithm studied herein is more natural and computationally simpler. Second, this rate is nearly unimprovable even in the simpler two-player zero-sum normal-form games (Daskalakis et al., 2011). It is also worth pointing out that algorithms with an $\widetilde O(T^{-1})$ rate have been proposed in the literature (Cen et al., 2021; Zhao et al., 2021); compared to those algorithms, however, OFTRL does not require one to approximately solve a normal-form game in each iteration. Lastly, Theorem 1 allows any $C_\eta\in(0,1/8]$, and $C_\eta = 1/8$ optimizes the bound on the NE-gap.

Before embarking on the formal proof, we provide an overview of our proof techniques.

Step 1: controlling the NE-gap using the sum of regrets and the estimation error. In the simpler normal-form game (i.e., without any state transition dynamics as in Markov games), it is well known that the NE-gap is controlled by the sum of the regrets of the two players. This would also be the case for Markov games if, in the policy update (2) of OFTRL, we used the true Q-function $Q_h^\star$ instead of the estimate $Q_h^t$.
As a result, the NE-gap in Markov games should intuitively be controlled by both the sum of the regrets of the two players and the estimation error $\|Q_h^t - Q_h^\star\|_\infty$; see Lemma 1.

Step 2: bounding the sum of regrets. Given the extensive literature on regret guarantees for optimistic algorithms (Anagnostides et al., 2022c;b; Zhang et al., 2022b), it is relatively easy to control the sum of the regrets to obtain the desired $O(T^{-1})$ rate; see Lemma 2. The key is to exploit the stability of the loss vectors.

Step 3: bounding the estimation error. It then boils down to controlling the estimation error $\|Q_h^t - Q_h^\star\|_\infty$, in which our main technical contributions lie. Due to the nature of the Bellman update (3), it is not hard to obtain a recursive relation for the estimation error; see the recursion (17). The undesirable part, however, is that the estimation error depends on the maximal regret of the two players, instead of the sum of the regrets. This calls for technical innovation. Inspired by the work of Anagnostides et al. (2022c;b) in normal-form games, we make the important observation that the sum of the regrets is approximately non-negative. In particular, the sum is lower bounded by the negative estimation error $\|Q_h^t - Q_h^\star\|_\infty$; see Lemma 6. This lower bound, together with the upper bound in Step 2, allows us to control the maximal regret via the estimation error (cf. (19)), which further yields a recursive relation (20) involving estimation errors only. Solving the recursion leads to the desired result.

4. PROOF OF THEOREM 1

In this section, we present the proof of our main result, Theorem 1. We first define a few useful notations. For each step $h\in[H]$, each state $s\in\mathcal S$, and each iteration $t\in[T]$, we define the state-wise weighted individual regrets as
\[
\mathrm{reg}_{h,1}^t(s) := \max_{\mu^\dagger\in\Delta_{\mathcal A}} \sum_{i=1}^t \alpha_t^i \big\langle \mu^\dagger - \mu_h^i,\; Q_h^i \nu_h^i \big\rangle(s),
\]
\[
\mathrm{reg}_{h,2}^t(s) := \max_{\nu^\dagger\in\Delta_{\mathcal B}} \sum_{i=1}^t \alpha_t^i \big\langle \nu_h^i - \nu^\dagger,\; (Q_h^i)^\top \mu_h^i \big\rangle(s).
\]
We also define the maximal regret
\[
\mathrm{reg}_h^t := \max_{s\in\mathcal S}\, \max_{i=1,2}\, \mathrm{reg}_{h,i}^t(s),
\]
which maximizes over the players and the states. In addition, for each step $h\in[H]$ and each iteration $t\in[T]$, we define the estimation error of the Q-function as $\delta_h^t := \|Q_h^t - Q_h^\star\|_\infty$. With these notations in place, we first connect the NE-gap with the sum of regrets $\mathrm{reg}_{h,1}^T(s) + \mathrm{reg}_{h,2}^T(s)$ as well as the estimation error $\delta_h^t$.

Lemma 1. One has
\[
\text{NE-gap}(\bar\mu,\bar\nu) \le 2\sum_{h=1}^H \Big[\max_s \big\{\mathrm{reg}_{h,1}^T(s) + \mathrm{reg}_{h,2}^T(s)\big\} + 2\sum_{t=1}^T \alpha_T^t\, \delta_h^t\Big].
\]
See Section B.1 for the proof of this lemma. It then boils down to controlling $\max_s \{\mathrm{reg}_{h,1}^T(s) + \mathrm{reg}_{h,2}^T(s)\}$ and $\sum_{t=1}^T \alpha_T^t \delta_h^t$. The following two lemmas provide such control.

Lemma 2. For every $h\in[H]$, every $s\in\mathcal S$, and every iteration $t\in[T]$, one has
\[
\mathrm{reg}_{h,1}^t(s) \le \frac{2H\log A}{\eta t} + \frac{16\eta H^3}{t} + 2\eta H^2 \sum_{i=2}^t \alpha_t^i \|\nu_h^i(\cdot\mid s) - \nu_h^{i-1}(\cdot\mid s)\|_1^2 - \frac{1}{8\eta}\sum_{i=2}^t \alpha_t^{i-1}\|\mu_h^i(\cdot\mid s) - \mu_h^{i-1}(\cdot\mid s)\|_1^2, \tag{7a}
\]
\[
\mathrm{reg}_{h,2}^t(s) \le \frac{2H\log B}{\eta t} + \frac{16\eta H^3}{t} + 2\eta H^2 \sum_{i=2}^t \alpha_t^i \|\mu_h^i(\cdot\mid s) - \mu_h^{i-1}(\cdot\mid s)\|_1^2 - \frac{1}{8\eta}\sum_{i=2}^t \alpha_t^{i-1}\|\nu_h^i(\cdot\mid s) - \nu_h^{i-1}(\cdot\mid s)\|_1^2. \tag{7b}
\]
As a result, when $\eta = C_\eta H^{-2}$ for some constant $C_\eta \le 1/8$, one has
\[
\max_s \big\{\mathrm{reg}_{h,1}^t(s) + \mathrm{reg}_{h,2}^t(s)\big\} \le 3C_\eta^{-1} H^3 \cdot \frac{\log(AB)}{t} - 4\eta H^3 \sum_{i=2}^t \alpha_t^i \Big[\|\mu_h^i(\cdot\mid s) - \mu_h^{i-1}(\cdot\mid s)\|_1^2 + \|\nu_h^i(\cdot\mid s) - \nu_h^{i-1}(\cdot\mid s)\|_1^2\Big]. \tag{8}
\]
See Section B.2 for the proof of this lemma.

Lemma 3. Choosing $\eta = C_\eta H^{-2}$ for some constant $C_\eta \le 1/8$, for all $h\in[H]$ and $t\in[T]$, we have
\[
\delta_h^t \le 5e^2 C_\eta^{-1} H^4 \cdot \frac{\log(AB)}{t}.
\]
See Section B.3 for the proof of this lemma.
Combining Lemmas 2 and 3 with Lemma 1, we arrive at the desired conclusion: when $\eta = C_\eta H^{-2}$ for some constant $C_\eta \le 1/8$,
\begin{align*}
\text{NE-gap}(\bar\mu,\bar\nu) &\le 2\sum_{h=1}^H \Big[\max_s\big\{\mathrm{reg}_{h,1}^T(s) + \mathrm{reg}_{h,2}^T(s)\big\} + 2\sum_{t=1}^T \alpha_T^t\, \delta_h^t\Big] \\
&\le 2\sum_{h=1}^H \Big[3C_\eta^{-1}H^3 \cdot \frac{\log(AB)}{T} + 2\sum_{t=1}^T \alpha_T^t \cdot 5e^2 C_\eta^{-1}H^4 \cdot \frac{\log(AB)}{t}\Big] \\
&\le 2H\Big[3C_\eta^{-1}H^3 \cdot \frac{\log(AB)}{T} + 20 e^2 C_\eta^{-1}H^4 \cdot \frac{\log(AB)}{T}\Big] \le 320\, C_\eta^{-1} H^5 \cdot \frac{\log(AB)}{T},
\end{align*}
where the penultimate inequality uses the following important lemma we have alluded to before.

Lemma 4. For all $t \ge 1$, one has
\[
\sum_{i=1}^t \alpha_t^i \cdot \frac{1}{i} \le \Big(1 + \frac{1}{H}\Big)\frac{1}{t}.
\]
On the surface, this lemma merely shaves an extra $\log t$ factor off a simple average of the sequence $\{1/i\}_{i\le t}$ (cf. Lemma A.3 in Zhang et al. (2022b)). More importantly, however, it shines in the ensuing proof of Lemma 3 by enabling the induction step. See Section B.4 for the proof of Lemma 4, and see the end of Section B.3 for a comment on the benefit of this improved result.
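Both Lemma 4 and the basic weight properties it rests on are easy to probe numerically. The sketch below (our own check, over a few small illustrative values of $H$ and $t$) verifies that the weights $\{\alpha_t^i\}$ sum to one, that $\alpha_t^i \le i/t$, and that the weighted average of $\{1/i\}$ is indeed at most $(1+1/H)/t$.

```python
def alpha(i, H):
    # alpha_i = (H + 1) / (H + i)
    return (H + 1) / (H + i)

def alpha_weights(t, H):
    # alpha_t^i = alpha_i * prod_{j=i+1}^t (1 - alpha_j), for i = 1, ..., t
    out = []
    for i in range(1, t + 1):
        v = alpha(i, H)
        for j in range(i + 1, t + 1):
            v *= 1 - alpha(j, H)
        out.append(v)
    return out

ok = True
for H in (1, 2, 5, 10):
    for t in range(1, 150):
        a = alpha_weights(t, H)
        ok &= abs(sum(a) - 1) < 1e-9                               # weights sum to one
        ok &= all(a[i - 1] <= i / t + 1e-12 for i in range(1, t + 1))  # alpha_t^i <= i/t
        # Lemma 4: weighted average of {1/i} has no log t factor
        ok &= sum(a[i - 1] / i for i in range(1, t + 1)) <= (1 + 1 / H) / t + 1e-12
print(ok)
```

For comparison, the uniform average $\frac{1}{t}\sum_{i\le t} 1/i \approx \log(t)/t$; that logarithmic factor is precisely what Lemma 4 removes.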

5. DISCUSSION

In this paper, we prove that the optimistic-follow-the-regularized-leader algorithm, together with smooth value updates, converges to an $O(T^{-1})$-approximate Nash equilibrium in two-player zero-sum Markov games. This improves the $\widetilde O(T^{-5/6})$ rate proved in Zhang et al. (2022b). Quite a few interesting directions remain open; below we single out a few. First, although our rate is unimprovable in its dependence on $T$, it is likely sub-optimal in its dependence on the horizon $H$. Improving this dependence, and proving any sort of lower bound on it, are both interesting and important for finite-horizon Markov games. Second, we focus on the simplest setting of two-player zero-sum games. It is an important open question whether one can generalize the proof techniques herein to multi-player general-sum Markov games and to other solution concepts in games (e.g., coarse correlated equilibria and correlated equilibria).

Published as a conference paper at ICLR 2023

To ease the reading, we repeat the definitions below: for each $t\ge 1$ and $1\le i\le t$,
\[
\alpha_t = \alpha_t^t = \frac{H+1}{H+t}, \qquad \alpha_t^i = \alpha_i \prod_{j=i+1}^t (1-\alpha_j). \tag{10}
\]
Lemma 5. Fix any $t\ge 1$. The following properties hold:
1. The sequence $\{\alpha_t^i\}_{1\le i\le t}$ sums to one, i.e., $\sum_{i=1}^t \alpha_t^i = 1$.
2. For all $1\le i\le t$, one has $\alpha_t^i \le i/t$.
3. For the relative weight defined by $w_i = \alpha_t^i/\alpha_t^1$ (note that this is the same for every $t\ge i$), we have
\[
\frac{w_i}{w_{i-1}} = \frac{\alpha_t^i}{\alpha_t^{i-1}} = \frac{H+i-1}{i-1} \le H.
\]
4. The sequence $\{\alpha_t^i\}_{1\le i\le t}$ is increasing in $i$.
5. Regarding the sum of squares of the weights, we have $\sum_{i=1}^t (\alpha_t^i)^2 \le \sum_{i=1}^t \alpha_i^2 \le H+2$.
6. For any non-increasing sequence $\{b_i\}_{1\le i\le t}$, one has $\sum_{i=1}^t \alpha_t^i b_i \le \frac{1}{t}\sum_{i=1}^t b_i$.

Proof. Property 1 follows directly from the definition of $\{\alpha_t^i\}_{1\le i\le t}$. We move on to Property 2. It trivially holds for $i = t$, so we focus on the case $1\le i\le t-1$. By definition,
\[
\alpha_t^i = \alpha_i \prod_{j=i+1}^t (1-\alpha_j) \le \prod_{j=i+1}^t (1-\alpha_j) = \prod_{j=i+1}^t \frac{j-1}{H+j}, \tag{11}
\]
where the inequality holds since $\alpha_i \le 1$ for all $1\le i\le t$, and the last relation follows from the definition of $\alpha_j$. Expanding the right-hand side of (11), we obtain
\[
\alpha_t^i \le \frac{i}{H+i+1}\times\frac{i+1}{H+i+2}\times\cdots\times\frac{t-1}{H+t} \le \frac{i}{H+t},
\]
where we keep only the first numerator and the last denominator. Property 2 then follows since $i/(H+t) \le i/t$.

Property 3 is elementary, hence we omit the proof. In addition, Property 3 implies Property 4, since $\alpha_t^i/\alpha_t^{i-1} = \frac{H+i-1}{i-1} \ge 1$. For Property 5, the first inequality holds since $0\le \alpha_i\le 1$ for all $1\le i\le t$, so that $\alpha_t^i \le \alpha_i$. For the second inequality, one has
\[
\sum_{i=1}^t \alpha_i^2 = 1 + \sum_{i=2}^t \Big(\frac{H+1}{H+i}\Big)^2 \le 1 + (H+1)^2 \sum_{i=2}^t \frac{1}{(H+i-1)(H+i)}.
\]
Expanding this as a telescoping sum, we see that
\[
\sum_{i=1}^t \alpha_i^2 \le 1 + (H+1)^2 \sum_{i=2}^t \Big(\frac{1}{H+i-1} - \frac{1}{H+i}\Big) \le 1 + (H+1)^2 \cdot \frac{1}{H+1} = H+2.
\]
Lastly, for Property 6, we have
\[
\sum_{i=1}^t \alpha_t^i b_i - \frac{1}{t}\sum_{i=1}^t b_i = \sum_{i=1}^t \Big(\alpha_t^i - \frac{1}{t}\Big) b_i.
\]
Let $i_0 := \sup\{i : \alpha_t^i \le 1/t\}$. Since $\{\alpha_t^i\}$ is increasing in $i$ (cf. Property 4) and $\sum_{i=1}^t \alpha_t^i = 1$ (cf.
Property 1), we know that $i_0$ is well defined, i.e., $1\le i_0\le t$. Since $\{\alpha_t^i\}_{i\le t}$ (resp. $\{b_i\}_{i\le t}$) is increasing (resp. non-increasing), we have $\alpha_t^i \le 1/t$ and $b_i \ge b_{i_0}$ for all $i\le i_0$. As a result, we obtain $(\alpha_t^i - 1/t)b_i \le (\alpha_t^i - 1/t)b_{i_0}$ for all $i\le i_0$. Similarly, one has $\alpha_t^i > 1/t$ and $b_i \le b_{i_0}$ for all $i > i_0$, which implies $(\alpha_t^i - 1/t)b_i \le (\alpha_t^i - 1/t)b_{i_0}$ for all $i > i_0$. Taking these two relations together, we see that
\[
\sum_{i=1}^t \Big(\alpha_t^i - \frac{1}{t}\Big)b_i \le \sum_{i=1}^t \Big(\alpha_t^i - \frac{1}{t}\Big)b_{i_0} = 0,
\]
where the last equality uses Property 1, namely $\sum_{i=1}^t \alpha_t^i = 1$.

B.1 PROOF OF LEMMA 1

One has
\[
\text{NE-gap}(\bar\mu,\bar\nu) = V_1^{\dagger,\bar\nu}(s_1) - V_1^\star(s_1) + V_1^\star(s_1) - V_1^{\bar\mu,\dagger}(s_1) \le 2\sum_{h=1}^H \max_s \max_{\mu^\dagger,\nu^\dagger}\Big[\big\langle \mu^\dagger, Q_h^\star \bar\nu_h\big\rangle - \big\langle \nu^\dagger, (Q_h^\star)^\top \bar\mu_h\big\rangle\Big](s).
\]
By the definition of the output policy $(\bar\mu,\bar\nu)$, one has
\[
\max_{\mu^\dagger,\nu^\dagger}\Big[\big\langle \mu^\dagger, Q_h^\star \bar\nu_h\big\rangle - \big\langle \nu^\dagger, (Q_h^\star)^\top \bar\mu_h\big\rangle\Big](s) = \max_{\mu^\dagger,\nu^\dagger}\sum_{t=1}^T \alpha_T^t \Big[\big\langle \mu^\dagger, Q_h^\star \nu_h^t\big\rangle - \big\langle \nu^\dagger, (Q_h^\star)^\top \mu_h^t\big\rangle\Big](s).
\]

Replacing the true Q-function $Q_h^\star$ with the value estimate $Q_h^t$ yields
\[
\max_{\mu^\dagger,\nu^\dagger}\Big[\big\langle \mu^\dagger, Q_h^\star \bar\nu_h\big\rangle - \big\langle \nu^\dagger, (Q_h^\star)^\top \bar\mu_h\big\rangle\Big](s) \le \max_{\mu^\dagger,\nu^\dagger}\sum_{t=1}^T \alpha_T^t \Big[\big\langle \mu^\dagger, Q_h^t \nu_h^t\big\rangle - \big\langle \nu^\dagger, (Q_h^t)^\top \mu_h^t\big\rangle\Big](s) + 2\sum_{t=1}^T \alpha_T^t\, \delta_h^t,
\]
where we recall $\delta_h^t = \|Q_h^t - Q_h^\star\|_\infty$. The proof is finished by combining the above three relations with the observation that
\[
\mathrm{reg}_{h,1}^T(s) + \mathrm{reg}_{h,2}^T(s) = \max_{\mu^\dagger,\nu^\dagger}\sum_{t=1}^T \alpha_T^t \Big[\big\langle \mu^\dagger, Q_h^t \nu_h^t\big\rangle - \big\langle \nu^\dagger, (Q_h^t)^\top \mu_h^t\big\rangle\Big](s).
\]
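The replacement of $Q_h^\star$ by $Q_h^t$ above is a standard perturbation bound: each best-response term is 1-Lipschitz in the matrix with respect to the entrywise sup-norm, so swapping the matrix costs at most twice the sup-norm error. A quick numerical check (our own sketch with random 2x2 matrices and random mixed strategies) confirms this:

```python
import random

random.seed(0)

def br_gap(Q, mu, nu):
    """max_{mu'} <mu', Q nu> - min_{nu'} <nu', Q^T mu> for a 2x2 matrix Q."""
    rows = [Q[a][0] * nu[0] + Q[a][1] * nu[1] for a in range(2)]
    cols = [Q[0][b] * mu[0] + Q[1][b] * mu[1] for b in range(2)]
    return max(rows) - min(cols)

def check_once():
    # a random matrix Q, a sup-norm perturbation E, and random mixed strategies
    Q  = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
    E  = [[random.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(2)]
    Qp = [[Q[a][b] + E[a][b] for b in range(2)] for a in range(2)]
    p = random.random(); mu = [p, 1 - p]
    q = random.random(); nu = [q, 1 - q]
    delta = max(abs(E[a][b]) for a in range(2) for b in range(2))
    # swapping Q for Q + E moves the best-response gap by at most 2 * delta
    return abs(br_gap(Q, mu, nu) - br_gap(Qp, mu, nu)) <= 2 * delta + 1e-12

print(all(check_once() for _ in range(1000)))
```

The same Lipschitz argument, applied term by term to the weighted sum over iterations, is exactly what produces the additive error $2\sum_t \alpha_T^t \delta_h^t$ above.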

B.2 PROOF OF LEMMA 2

We prove the regret bound (7a) for the max-player; the bound (7b) for the min-player follows by symmetry. First, we observe that the policy update in Algorithm 1 for the max-player is exactly the OFTRL algorithm (i.e., Algorithm 4 in Zhang et al. (2022b)) with loss vector $g_t = w_t [Q_h^t \nu_h^t](s,\cdot)$, recency bias $M_t = w_t [Q_h^{t-1}\nu_h^{t-1}](s,\cdot)$, and learning rate $\eta_t = \eta/w_t$. Therefore, we can apply Lemma B.3 of Zhang et al. (2022b) to obtain
\begin{align*}
\mathrm{reg}_{h,1}^t(s) &= \max_{\mu^\dagger}\sum_{i=1}^t \alpha_t^i \big\langle \mu^\dagger - \mu_h^i,\, Q_h^i\nu_h^i\big\rangle(s) = \alpha_t^1 \max_{\mu^\dagger}\sum_{i=1}^t w_i \big\langle \mu^\dagger - \mu_h^i,\, Q_h^i\nu_h^i\big\rangle(s) \\
&\le \frac{\alpha_t \log A}{\eta} + \underbrace{\alpha_t^1 \sum_{i=1}^t \eta\, w_i \big\|\big[Q_h^i\nu_h^i - Q_h^{i-1}\nu_h^{i-1}\big](s,\cdot)\big\|_\infty^2}_{=:\,\mathrm{Err}_1} - \underbrace{\alpha_t^1 \sum_{i=2}^t \frac{w_{i-1}}{8\eta}\,\|\mu_h^i(\cdot\mid s) - \mu_h^{i-1}(\cdot\mid s)\|_1^2}_{=:\,\mathrm{Err}_2}, \tag{12}
\end{align*}
where we have used the fact that $w_i = \alpha_t^i/\alpha_t^1$. We now move on to bound the term $\mathrm{Err}_1$. Using $(a+b)^2 \le 2a^2 + 2b^2$, we see that
\[
\big\|\big[Q_h^i\nu_h^i - Q_h^{i-1}\nu_h^{i-1}\big](s,\cdot)\big\|_\infty^2 \le 2\big\|\big[(Q_h^i - Q_h^{i-1})\nu_h^i\big](s,\cdot)\big\|_\infty^2 + 2\big\|\big[Q_h^{i-1}(\nu_h^i - \nu_h^{i-1})\big](s,\cdot)\big\|_\infty^2 \le 2\|Q_h^i - Q_h^{i-1}\|_\infty^2 + 2H^2\|\nu_h^i(\cdot\mid s) - \nu_h^{i-1}(\cdot\mid s)\|_1^2,
\]
where the second inequality uses Hölder's inequality and the fact that $\|Q_h^{i-1}\|_\infty \le H$. In view of the update rule (3) for the Q-function, we further have
\[
\|Q_h^i - Q_h^{i-1}\|_\infty = \Big\|-\alpha_i Q_h^{i-1} + \alpha_i\big[r_h + P_h(\mu_{h+1}^i)^\top Q_{h+1}^i \nu_{h+1}^i\big]\Big\|_\infty \le \alpha_i \max\Big\{\|Q_h^{i-1}\|_\infty,\; \big\|r_h + P_h(\mu_{h+1}^i)^\top Q_{h+1}^i\nu_{h+1}^i\big\|_\infty\Big\} \le \alpha_i H.
\]
As a result, we arrive at the bound
\[
\mathrm{Err}_1 \le 2\eta \alpha_t^1 \sum_{i=1}^t w_i\big(\alpha_i^2 H^2 + H^2\|\nu_h^i(\cdot\mid s) - \nu_h^{i-1}(\cdot\mid s)\|_1^2\big) = 2\eta H^2 \sum_{i=1}^t \alpha_t^i \alpha_i^2 + 2\eta H^2 \sum_{i=1}^t \alpha_t^i \|\nu_h^i(\cdot\mid s) - \nu_h^{i-1}(\cdot\mid s)\|_1^2,
\]
where we again use the relation $w_i = \alpha_t^i/\alpha_t^1$. Since $\{\alpha_i^2\}_{i\le t}$ is decreasing in $i$, we can apply Property 6 of Lemma 5 to obtain
\[
\sum_{i=1}^t \alpha_t^i \alpha_i^2 \le \frac{1}{t}\sum_{i=1}^t \alpha_i^2 \le \frac{H+2}{t} \le \frac{3H}{t},
\]
where the second inequality follows from Property 5 of Lemma 5. In all, we see that
\[
\mathrm{Err}_1 \le \frac{6\eta H^3}{t} + 2\eta H^2 \sum_{i=1}^t \alpha_t^i \|\nu_h^i(\cdot\mid s) - \nu_h^{i-1}(\cdot\mid s)\|_1^2. \tag{14}
\]
Substituting the upper bound (14) on $\mathrm{Err}_1$ into the master bound (12), we obtain
\begin{align*}
\mathrm{reg}_{h,1}^t(s) &\le \frac{\alpha_t \log A}{\eta} + \mathrm{Err}_1 - \mathrm{Err}_2 \\
&\le \frac{2H\log A}{\eta t} + \frac{6\eta H^3}{t} + 2\eta H^2 \sum_{i=1}^t \alpha_t^i\|\nu_h^i(\cdot\mid s) - \nu_h^{i-1}(\cdot\mid s)\|_1^2 - \frac{1}{8\eta}\sum_{i=2}^t \alpha_t^{i-1}\|\mu_h^i(\cdot\mid s) - \mu_h^{i-1}(\cdot\mid s)\|_1^2,
\end{align*}
where the first inequality uses $\alpha_t = (H+1)/(H+t) \le 2H/t$. Since $\|\nu_h^1(\cdot\mid s) - \nu_h^0(\cdot\mid s)\|_1 \le 2$ and $\alpha_t^1 \le 1/t$ (see Property 2 of Lemma 5), we can pull out the $i = 1$ term and reach
\[
\mathrm{reg}_{h,1}^t(s) \le \frac{2H\log A}{\eta t} + \frac{16\eta H^3}{t} + 2\eta H^2\sum_{i=2}^t \alpha_t^i\|\nu_h^i(\cdot\mid s) - \nu_h^{i-1}(\cdot\mid s)\|_1^2 - \frac{1}{8\eta}\sum_{i=2}^t \alpha_t^{i-1}\|\mu_h^i(\cdot\mid s) - \mu_h^{i-1}(\cdot\mid s)\|_1^2.
\]
This finishes the proof of the regret bound (7a) for the max-player; the bound (7b) for the min-player follows by symmetry. Combining the two bounds (7a) and (7b), we see that
\[
\mathrm{reg}_{h,1}^t(s) + \mathrm{reg}_{h,2}^t(s) \le \frac{2H\log(AB)}{\eta t} + \frac{32\eta H^3}{t} + \sum_{i=2}^t \Big(2\eta H^2 \alpha_t^i - \frac{\alpha_t^{i-1}}{8\eta}\Big)\Big[\|\mu_h^i(\cdot\mid s) - \mu_h^{i-1}(\cdot\mid s)\|_1^2 + \|\nu_h^i(\cdot\mid s) - \nu_h^{i-1}(\cdot\mid s)\|_1^2\Big]. \tag{15}
\]
When $\eta \le 1/(8H^2)$, one has
\[
2\eta H^2 \alpha_t^i - \frac{\alpha_t^{i-1}}{8\eta} \le \Big(2\eta H^2 - \frac{1}{8\eta H}\Big)\alpha_t^i \le -4\eta H^3 \alpha_t^i, \tag{16}
\]
where we have used Property 3 of Lemma 5, i.e., $\alpha_t^{i-1}/\alpha_t^i \ge 1/H$. Consequently, with $\eta = C_\eta H^{-2}$ for some constant $C_\eta \le 1/8$, combining (15) and (16) yields
\[
\max_s\big\{\mathrm{reg}_{h,1}^t(s) + \mathrm{reg}_{h,2}^t(s)\big\} \le 3C_\eta^{-1}H^3\cdot\frac{\log(AB)}{t} - 4\eta H^3 \sum_{i=2}^t \alpha_t^i\Big[\|\mu_h^i(\cdot\mid s) - \mu_h^{i-1}(\cdot\mid s)\|_1^2 + \|\nu_h^i(\cdot\mid s) - \nu_h^{i-1}(\cdot\mid s)\|_1^2\Big],
\]
where we assume the action sets are non-trivial, i.e., $AB \ge 2$.
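Two of the weight properties invoked in this proof, namely $\sum_i \alpha_t^i \alpha_i^2 \le \frac{1}{t}\sum_i \alpha_i^2 \le \frac{H+2}{t}$ (Properties 5 and 6 of Lemma 5), can also be verified numerically. The sketch below is our own check over a few small illustrative values of $H$ and $t$.

```python
def alpha(i, H):
    # alpha_i = (H + 1) / (H + i)
    return (H + 1) / (H + i)

def alpha_weights(t, H):
    # alpha_t^i = alpha_i * prod_{j=i+1}^t (1 - alpha_j), for i = 1, ..., t
    out = []
    for i in range(1, t + 1):
        v = alpha(i, H)
        for j in range(i + 1, t + 1):
            v *= 1 - alpha(j, H)
        out.append(v)
    return out

def check(H, t):
    a = alpha_weights(t, H)
    # Property 6 applied to the decreasing sequence b_i = alpha_i^2 ...
    lhs = sum(a[i - 1] * alpha(i, H) ** 2 for i in range(1, t + 1))
    mid = sum(alpha(i, H) ** 2 for i in range(1, t + 1)) / t
    # ... followed by Property 5: the uniform average is at most (H + 2) / t
    return lhs <= mid + 1e-12 and mid <= (H + 2) / t + 1e-12

print(all(check(H, t) for H in (1, 2, 4, 8) for t in range(1, 120)))
```
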

B.3 PROOF OF LEMMA 3

By Lemma C.2 in Zhang et al. (2022b), for any $h\in[H-1]$, we have the recursive relation
\[
\delta_h^t \le \sum_{i=1}^t \alpha_t^i\, \delta_{h+1}^i + \mathrm{reg}_{h+1}^t, \tag{17}
\]
where we recall $\mathrm{reg}_{h+1}^t = \max_s\max_{i=1,2}\{\mathrm{reg}_{h+1,i}^t(s)\}$.

Step 1: bounding $\mathrm{reg}_{h+1}^t$. In view of the recursion (17), one needs to control the maximal regret $\mathrm{reg}_{h+1}^t$ over the two players. Lemma 2 provides precise control of the individual regrets $\mathrm{reg}_{h,1}^t(s)$ and $\mathrm{reg}_{h,2}^t(s)$:
\[
\mathrm{reg}_{h,1}^t(s) \le 3C_\eta^{-1}H^3\cdot\frac{\log(AB)}{t} + 2\eta H^2\sum_{i=2}^t \alpha_t^i\|\nu_h^i(\cdot\mid s) - \nu_h^{i-1}(\cdot\mid s)\|_1^2, \tag{18a}
\]
\[
\mathrm{reg}_{h,2}^t(s) \le 3C_\eta^{-1}H^3\cdot\frac{\log(AB)}{t} + 2\eta H^2\sum_{i=2}^t \alpha_t^i\|\mu_h^i(\cdot\mid s) - \mu_h^{i-1}(\cdot\mid s)\|_1^2, \tag{18b}
\]
where we have substituted $\eta = C_\eta H^{-2}$ with $C_\eta \le 1/8$ and used $AB\ge 2$; we have also dropped the negative terms on the right-hand sides of (7a) and (7b). Therefore, to control the individual regrets, it suffices to bound the second-order path lengths $2\eta H^2\sum_{i=2}^t \alpha_t^i\|\mu_h^i(\cdot\mid s)-\mu_h^{i-1}(\cdot\mid s)\|_1^2$ and $2\eta H^2\sum_{i=2}^t \alpha_t^i\|\nu_h^i(\cdot\mid s)-\nu_h^{i-1}(\cdot\mid s)\|_1^2$. To this end, the following lemma proves crucial; its proof is deferred to the end of this section.

Lemma 6. For each $t$, $h$, and $s$, one has
\[
\mathrm{reg}_{h,1}^t(s) + \mathrm{reg}_{h,2}^t(s) \ge -2\sum_{i=1}^t \alpha_t^i\, \delta_h^i.
\]
In words, Lemma 6 reveals the approximate non-negativity of the sum of the regrets. Together with the upper bound (8) in Lemma 2, it implies
\[
2\eta H^2\sum_{i=2}^t \alpha_t^i\Big[\|\mu_h^i(\cdot\mid s)-\mu_h^{i-1}(\cdot\mid s)\|_1^2 + \|\nu_h^i(\cdot\mid s)-\nu_h^{i-1}(\cdot\mid s)\|_1^2\Big] \le 3C_\eta^{-1}H^2\cdot\frac{\log(AB)}{2t} + \frac{1}{H}\sum_{i=1}^t \alpha_t^i\,\delta_h^i.
\]
Feeding this back into (18a) and (18b), we obtain
\[
\mathrm{reg}_h^t = \max_s\max_{i=1,2}\big\{\mathrm{reg}_{h,i}^t(s)\big\} \le 5C_\eta^{-1}H^3\cdot\frac{\log(AB)}{t} + \frac{1}{H}\sum_{i=1}^t \alpha_t^i\, \delta_h^i. \tag{19}
\]

Step 2: bounding $\delta_h^t$. Substituting the maximal regret bound (19) into the recursion (17), we arrive at
\[
\delta_h^t \le \Big(1+\frac{1}{H}\Big)\sum_{i=1}^t \alpha_t^i\,\delta_{h+1}^i + 5C_\eta^{-1}H^3\cdot\frac{\log(AB)}{t}. \tag{20}
\]
We continue the proof of Lemma 3 via induction on $h$. More precisely, we aim to inductively establish the claim
\[
\delta_h^t \le \sum_{h'=h}^H \Big(1+\frac{1}{H}\Big)^{2(H-h')}\cdot 5C_\eta^{-1}H^3\cdot\frac{\log(AB)}{t}. \tag{21}
\]
First note that the induction hypothesis holds trivially for $h = H$, as $\delta_H^t = 0$ for all $1 \le t \le T$. Now assume that the induction hypothesis is true for some $2 \le h+1 \le H$ and for all $1 \le t \le T$. Our goal is to show that (21) continues to hold for the previous step $h$ and for all $1 \le t \le T$. By the recursion (20) and the induction hypothesis, one has for any $1 \le t \le T$:
\[
\delta_h^t \le \Big(1 + \frac{1}{H}\Big) \sum_{i=1}^t \alpha_t^i \delta_{h+1}^i + \frac{5C_\eta^{-1} H^3 \log(AB)}{t} \le \Big(1 + \frac{1}{H}\Big) \sum_{i=1}^t \alpha_t^i \sum_{h'=h+1}^H \Big(1 + \frac{1}{H}\Big)^{2(H-h')} \cdot \frac{5C_\eta^{-1} H^3 \log(AB)}{i} + \frac{5C_\eta^{-1} H^3 \log(AB)}{t}.
\]
Apply Lemma 4 to obtain
\[
\sum_{i=1}^t \alpha_t^i \cdot \frac{5C_\eta^{-1} H^3 \log(AB)}{i} \le \Big(1 + \frac{1}{H}\Big) \frac{5C_\eta^{-1} H^3 \log(AB)}{t}.
\]
This leads to the conclusion that
\[
\delta_h^t \le \Big(1 + \frac{1}{H}\Big) \sum_{h'=h+1}^H \Big(1 + \frac{1}{H}\Big)^{2(H-h')} \Big(1 + \frac{1}{H}\Big) \frac{5C_\eta^{-1} H^3 \log(AB)}{t} + \frac{5C_\eta^{-1} H^3 \log(AB)}{t} = \sum_{h'=h+1}^H \Big(1 + \frac{1}{H}\Big)^{2(H-h'+1)} \frac{5C_\eta^{-1} H^3 \log(AB)}{t} + \frac{5C_\eta^{-1} H^3 \log(AB)}{t} = \sum_{h'=h}^H \Big(1 + \frac{1}{H}\Big)^{2(H-h')} \frac{5C_\eta^{-1} H^3 \log(AB)}{t}.
\]
This finishes the induction. The bound on $\delta_h^t$ can be further simplified as
\[
\delta_h^t \le \sum_{h'=h}^H \Big(1 + \frac{1}{H}\Big)^{2(H-h')} \cdot \frac{5C_\eta^{-1} H^3 \log(AB)}{t} \le H \Big(1 + \frac{1}{H}\Big)^{2H} \cdot \frac{5C_\eta^{-1} H^3 \log(AB)}{t} \le \frac{5e^2 C_\eta^{-1} H^4 \log(AB)}{t}.
\]
This finishes the proof of Lemma 3, and we are left with proving Lemma 6.

Proof of Lemma 6. Recall that
\[
\mathrm{reg}_{h,1}^t(s) + \mathrm{reg}_{h,2}^t(s) = \max_{\mu^\dagger, \nu^\dagger} \sum_{i=1}^t \alpha_t^i \Big[ \big\langle \mu^\dagger, Q_h^i \nu_h^i \big\rangle - \big\langle \nu^\dagger, (Q_h^i)^\top \mu_h^i \big\rangle \Big](s).
\]
Replacing the estimate $Q_h^i$ with $Q_h^\star$ yields
\[
\mathrm{reg}_{h,1}^t(s) + \mathrm{reg}_{h,2}^t(s) \ge \max_{\mu^\dagger, \nu^\dagger} \bigg\{ \sum_{i=1}^t \alpha_t^i \Big[ \big\langle \mu^\dagger, Q_h^\star \nu_h^i \big\rangle - \big\langle \nu^\dagger, (Q_h^\star)^\top \mu_h^i \big\rangle \Big](s) + \sum_{i=1}^t \alpha_t^i \Big[ \big\langle \mu^\dagger, (Q_h^i - Q_h^\star) \nu_h^i \big\rangle - \big\langle \nu^\dagger, (Q_h^i - Q_h^\star)^\top \mu_h^i \big\rangle \Big](s) \bigg\}.
\]
Lower bounding the term involving $Q_h^i - Q_h^\star$ yields
\[
\mathrm{reg}_{h,1}^t(s) + \mathrm{reg}_{h,2}^t(s) \ge \max_{\mu^\dagger, \nu^\dagger} \sum_{i=1}^t \alpha_t^i \Big[ \big\langle \mu^\dagger, Q_h^\star \nu_h^i \big\rangle - \big\langle \nu^\dagger, (Q_h^\star)^\top \mu_h^i \big\rangle \Big](s) - 2 \sum_{i=1}^t \alpha_t^i \delta_h^i,
\]
where we recall $\delta_h^i = \|Q_h^i - Q_h^\star\|_\infty$.
Now observe that
\[
\max_{\mu^\dagger, \nu^\dagger} \sum_{i=1}^t \alpha_t^i \Big[ \big\langle \mu^\dagger, Q_h^\star \nu_h^i \big\rangle - \big\langle \nu^\dagger, (Q_h^\star)^\top \mu_h^i \big\rangle \Big](s) = \max_{\mu^\dagger, \nu^\dagger} \bigg[ \Big\langle \mu^\dagger, Q_h^\star \sum_{i=1}^t \alpha_t^i \nu_h^i \Big\rangle - \Big\langle \nu^\dagger, (Q_h^\star)^\top \sum_{i=1}^t \alpha_t^i \mu_h^i \Big\rangle \bigg](s) \ge \bigg[ \Big\langle \sum_{i=1}^t \alpha_t^i \mu_h^i, Q_h^\star \sum_{i=1}^t \alpha_t^i \nu_h^i \Big\rangle - \Big\langle \sum_{i=1}^t \alpha_t^i \nu_h^i, (Q_h^\star)^\top \sum_{i=1}^t \alpha_t^i \mu_h^i \Big\rangle \bigg](s) = 0,
\]
where the inequality follows from the particular choice $\mu^\dagger = \sum_{i=1}^t \alpha_t^i \mu_h^i$ and $\nu^\dagger = \sum_{i=1}^t \alpha_t^i \nu_h^i$. Combine the above two inequalities to finish the proof.

Finally, it is worth pointing out that without the improved inequality in Lemma 4, one would necessarily incur an extra $\log T$ factor in each induction step. Consequently, the recursion would fail due to a blow-up at a rate of $(\log T)^H$.
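The zero-sum cancellation above can be checked numerically for arbitrary strategy sequences and convex weights: since the objective is linear in $(\mu^\dagger, \nu^\dagger)$, the maximum is attained at vertices of the simplices, and the resulting quantity is always non-negative. A small illustrative sketch in plain Python (all names hypothetical, not from the paper):

```python
import random

def zero_sum_gap(Q, mus, nus, alphas):
    """Value of max_{mu+, nu+} sum_i alpha_i [<mu+, Q nu_i> - <nu+, Q^T mu_i>];
    by linearity this equals max_a (Q nu_bar)_a - min_b (Q^T mu_bar)_b,
    where mu_bar, nu_bar are the alpha-weighted average strategies."""
    A, B = len(Q), len(Q[0])
    mu_bar = [sum(a * m[x] for a, m in zip(alphas, mus)) for x in range(A)]
    nu_bar = [sum(a * n[y] for a, n in zip(alphas, nus)) for y in range(B)]
    Q_nu = [sum(Q[x][y] * nu_bar[y] for y in range(B)) for x in range(A)]
    Qt_mu = [sum(Q[x][y] * mu_bar[x] for x in range(A)) for y in range(B)]
    return max(Q_nu) - min(Qt_mu)

def random_simplex(rng, n):
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(0)
A, B, t = 3, 4, 6
Q = [[rng.uniform(-1, 1) for _ in range(B)] for _ in range(A)]
alphas = random_simplex(rng, t)     # stand-in for the alpha_t^i weights
mus = [random_simplex(rng, A) for _ in range(t)]
nus = [random_simplex(rng, B) for _ in range(t)]
# the averaged-strategy argument shows this is always >= 0
assert zero_sum_gap(Q, mus, nus, alphas) >= -1e-12
```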

B.4 PROOF OF LEMMA 4

We prove the claim via induction on $t$. The base case $t = 1$ is true since $\alpha_1^1 \cdot 1 = 1 \le 1 + 1/H$. Now assume that the inequality (9) holds for some $t \ge 1$; we aim to prove that it continues to hold at $t+1$. We first make the observation that for all $i \le t$,
\[
\alpha_{t+1}^i = \alpha_i \prod_{j=i+1}^{t+1} (1 - \alpha_j) = (1 - \alpha_{t+1}) \, \alpha_i \prod_{j=i+1}^{t} (1 - \alpha_j) = (1 - \alpha_{t+1}) \, \alpha_t^i.
\]
This allows us to rewrite $\sum_{i=1}^{t+1} \alpha_{t+1}^i \cdot \frac{1}{i}$ as
\[
\sum_{i=1}^{t+1} \alpha_{t+1}^i \cdot \frac{1}{i} = (1 - \alpha_{t+1}) \sum_{i=1}^{t} \alpha_t^i \cdot \frac{1}{i} + \alpha_{t+1} \cdot \frac{1}{t+1} \le (1 - \alpha_{t+1}) \Big(1 + \frac{1}{H}\Big) \frac{1}{t} + \frac{\alpha_{t+1}}{t+1},
\]
where the inequality follows from the induction hypothesis. Note that $\alpha_{t+1} = \frac{H+1}{H+t+1}$, and hence $1 - \alpha_{t+1} = \frac{t}{H+t+1}$. We can continue the derivation as
\[
\sum_{i=1}^{t+1} \alpha_{t+1}^i \cdot \frac{1}{i} \le \Big(1 + \frac{1}{H}\Big) \frac{t}{H+t+1} \cdot \frac{1}{t} + \frac{H+1}{H+t+1} \cdot \frac{1}{t+1} = \Big(1 + \frac{1}{H}\Big) \frac{t+1}{H+t+1} \cdot \frac{1}{t+1} + \Big(1 + \frac{1}{H}\Big) \frac{H}{H+t+1} \cdot \frac{1}{t+1} = \Big(1 + \frac{1}{H}\Big) \frac{1}{t+1}.
\]
This finishes the proof.
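Lemma 4 is exactly where the extra $\log T$ is shaved: a uniform average of $1/i$ gives the harmonic sum $\frac{1}{t}\sum_{i \le t} \frac{1}{i} \approx \frac{\log t}{t}$, whereas the weights $\alpha_t^i$ concentrate on recent indices and keep the sum at $O(1/t)$. The following quick numerical check of inequality (9) is illustrative only; `alpha_weights` is a hypothetical helper implementing the standard weight recursion:

```python
def alpha_weights(H, t):
    # (alpha_t^1, ..., alpha_t^t) for the step sizes alpha_k = (H + 1) / (H + k)
    a = []
    for k in range(1, t + 1):
        alpha_k = (H + 1) / (H + k)
        a = [w * (1 - alpha_k) for w in a]
        a.append(alpha_k)
    return a

for H in (1, 2, 4, 8):
    for t in (1, 2, 5, 10, 50, 200):
        weighted = sum(w / i for i, w in enumerate(alpha_weights(H, t), start=1))
        uniform = sum(1 / i for i in range(1, t + 1)) / t   # carries a log t factor
        assert weighted <= (1 + 1 / H) / t + 1e-12          # inequality (9)
        assert weighted <= uniform + 1e-12                  # beats the uniform average
```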




APPENDIX

A PROPERTIES OF $\alpha_t^i$

This section collects a few useful properties of the sequences $\{\alpha_t\}_{t \ge 1}$ and $\{\alpha_t^i\}_{t \ge 1, 1 \le i \le t}$. Some of these results have appeared in prior work (Jin et al., 2018; Zhang et al., 2022b). For completeness, we include all the proofs here.

