ASYNCHRONOUS GRADIENT PLAY IN ZERO-SUM MULTI-AGENT GAMES

Abstract

Finding equilibria via gradient play in competitive multi-agent games has attracted growing attention in recent years, with emphasis on designing efficient strategies where the agents operate in a decentralized and symmetric manner with guaranteed convergence. While significant efforts have been made in understanding zero-sum two-player matrix games, the performance in zero-sum multi-agent games remains inadequately explored, especially in the presence of delayed feedback, leaving the scalability and resiliency of gradient play open to question. In this paper, we make progress by studying asynchronous gradient play in zero-sum polymatrix games under delayed feedback. We first establish that the last iterate of the entropy-regularized optimistic multiplicative weight updates (OMWU) method converges linearly to the quantal response equilibrium (QRE), the solution concept under bounded rationality, in the absence of delays. While the linear convergence continues to hold even when the feedback is randomly delayed under mild statistical assumptions, it converges at a noticeably slower rate due to a smaller tolerable range of learning rates. Moving beyond, we demonstrate that entropy-regularized OMWU, by adopting two-timescale learning rates in a delay-aware manner, enjoys faster last-iterate convergence under fixed delays, and continues to converge provably, in an average-iterate manner, even when the delays are arbitrarily bounded. Our methods also lead to finite-time guarantees to approximate the Nash equilibrium (NE) by moderating the amount of regularization. To the best of our knowledge, this work is the first that aims to understand asynchronous gradient play in zero-sum polymatrix games under a wide range of delay assumptions, highlighting the role of learning rate separation.

1. INTRODUCTION

Finding equilibria of multi-player games via gradient play lies at the heart of game theory, which permeates a remarkable breadth of modern applications, including but not limited to competitive reinforcement learning (RL) (Littman, 1994), generative adversarial networks (GANs) (Goodfellow et al., 2014) and adversarial training (Mertikopoulos et al., 2018). While conventional wisdom leans towards the paradigm of centralized learning (Bertsekas & Tsitsiklis, 1989), retrieving and sharing information across multiple agents raises questions in terms of both privacy and efficiency, leading to a significant amount of interest in designing decentralized learning algorithms that utilize only local payoff feedback, with the updates at different agents executed in a symmetric manner. In reality, there is no shortage of scenarios where the feedback can be obtained only in a delayed manner (He et al., 2014), i.e., the agents only receive the payoff information sent from a previous round instead of the current round, due to communication slowdowns and congestion, for example. Substantial progress has been made towards reliable and efficient online learning with delayed feedback in various settings, e.g., stochastic multi-armed bandits (Pike-Burke et al., 2018; Vernade et al., 2017), adversarial multi-armed bandits (Cesa-Bianchi et al., 2016; Li et al., 2019), online convex optimization (Quanrud & Khashabi, 2015; McMahan & Streeter, 2014) and multi-player games (Meng et al., 2022; Héliou et al., 2020; Zhou et al., 2017). Typical approaches to combating delays include subsampling the payoff history (Weinberger & Ordentlich, 2002; Joulani et al., 2013), or adopting adaptive learning rates suggested by delay-aware analysis (Quanrud & Khashabi, 2015; McMahan & Streeter, 2014; Hsieh et al., 2020; Flaspohler et al., 2021). Most of these efforts, however, have been limited to either the asymptotic convergence to the equilibrium (Zhou et al., 2017; Héliou et al., 2020) or the study of individual regret, which characterizes the performance gap between an agent's learning trajectory and the best policy in hindsight. Existing results remain inadequate when it comes to guaranteeing finite-time convergence to the equilibrium in a multi-player environment, especially in the presence of delayed feedback, thus leaving the scalability and resiliency of gradient play open to question.

Learning rate    | Type of delay | Iteration complexity, $\epsilon$-QRE | Iteration complexity, $\epsilon$-NE
-----------------+---------------+--------------------------------------+------------------------------------
single-timescale | none          | $\tau^{-1} d_{\max} \|A\|_\infty \log \epsilon^{-1}$ | $d_{\max} \|A\|_\infty \epsilon^{-1}$
single-timescale | statistical   | $\tau^{-2} d_{\max}^2 \|A\|_\infty^2 (\gamma+1)^2 \log \epsilon^{-1}$ | $d_{\max}^2 \|A\|_\infty^2 (\gamma+1)^2 \epsilon^{-2}$
two-timescale    | constant      | $\tau^{-1} d_{\max} \|A\|_\infty (\gamma+1)^2 \log \epsilon^{-1}$ | $d_{\max} \|A\|_\infty (\gamma+1)^2 \epsilon^{-1}$
two-timescale    | bounded       | $\tau^{-2} n d_{\max}^3 \|A\|_\infty^3 (\gamma+1)^{5/2} \epsilon^{-1}$ | $n d_{\max}^3 \|A\|_\infty^3 (\gamma+1)^{5/2} \epsilon^{-3}$

Table 1: Iteration complexities of the proposed OMWU method for finding $\epsilon$-QRE/NE of zero-sum polymatrix games, where logarithmic dependencies are omitted. Here, $\gamma$ denotes the maximal time delay when the delay is bounded, $n$ denotes the number of agents in the game, $d_{\max}$ is the maximal degree of the graph, and $\|A\|_\infty = \max_{i,j} \|A_{i,j}\|_\infty$ is the $\ell_\infty$ norm of the entire payoff matrix $A$ (over all games in the network). For ease of comparison, the result under statistical delay is presented only for the case of bounded delays; more general bounds are given in Section 3.2.
In this work, we initiate the study of asynchronous learning algorithms for an important class of games called zero-sum polymatrix games (also known as network matrix games (Bergman & Fokin, 1998)), which generalizes two-player zero-sum matrix games to the multiple-player setting and serves as an important stepping stone to more general multi-player general-sum games. Zero-sum polymatrix games are commonly used to describe situations in which agents' interactions are captured by an interaction graph and the entire system of games is closed, so that the total payoff remains invariant in the system. They find applications in an increasing number of important domains such as security games (Cai et al., 2016), graph transduction (Bernardi, 2021), and more. In particular, we focus on finite-time last-iterate convergence to two prevalent solution concepts in game theory, namely the Nash Equilibrium (NE) and the Quantal Response Equilibrium (QRE), the latter of which accounts for bounded rationality (McKelvey & Palfrey, 1995). Despite the seemingly simple formulation, few existing works have achieved this goal even in the synchronous setting, i.e., with instantaneous feedback. Leonardos et al. (2021) studied continuous-time learning dynamics that converge to the QRE at a linear rate. Anagnostides et al. (2022) demonstrated that Optimistic Mirror Descent (OMD) (Rakhlin & Sridharan, 2013) enjoys finite-time last-iterate convergence to the NE, yet the analysis therein requires a continuous gradient of the regularizer, which incurs computational overhead for solving a subproblem in every iteration. In contrast, an appealing alternative is the entropy regularizer, which leads to closed-form multiplicative updates and is computationally more desirable, but remains poorly understood. In sum, designing efficient learning algorithms that provably converge to the game equilibria has been technically challenging, even in the synchronous setting.

1.1. OUR CONTRIBUTIONS

In this paper, we develop provably convergent algorithms, broadly dubbed asynchronous gradient play, to find the QRE and NE of zero-sum polymatrix games in a decentralized and symmetric manner with delayed feedback. We propose an entropy-regularized Optimistic Multiplicative Weights Update (OMWU) method (Cen et al., 2021), where each player symmetrically updates their strategies without access to the payoff matrices and other players' strategies, and initiate a systematic investigation of the impact of delays on its convergence under two learning rate schedules. Our main contributions are summarized as follows.

• Finite-time last-iterate convergence of single-timescale OMWU. We begin by showing that, in the synchronous setting, the single-timescale OMWU method (where the same learning rate is adopted for extrapolation and update) achieves last-iterate convergence to the QRE at a linear rate, which is independent of the number of agents as well as the size of action spaces (up to logarithmic factors). In addition, this implies last-iterate convergence to an $\epsilon$-approximate NE in $\widetilde{O}(\epsilon^{-1})$ iterations by adjusting the regularization parameter, where $\widetilde{O}(\cdot)$ hides logarithmic dependencies. While the last-iterate linear convergence to the QRE continues to hold in the asynchronous setting, as long as the delay sequence follows certain mild statistical assumptions, it converges at a slower rate due to a smaller tolerable range of learning rates, with the iteration complexity to find an $\epsilon$-NE degrading to $\widetilde{O}(\epsilon^{-2})$.

• Finite-time convergence of two-timescale OMWU. To accelerate the convergence rate in the presence of delayed feedback, we propose a two-timescale OMWU method which separates the learning rates of extrapolation and update in a delay-aware manner for applications with constant and known delays (e.g., from timestamp information). The learning rate separation is critical in bypassing the convergence slowdown encountered in the single-timescale case: we show that two-timescale OMWU achieves a faster last-iterate linear convergence to the QRE in the presence of constant delays, with an improved $\widetilde{O}(\epsilon^{-1})$ iteration complexity to an $\epsilon$-NE that matches the rate without delay. We further tackle the more practical yet challenging setting where the feedback sequence is permuted by bounded delays (possibly in an adversarial manner) and demonstrate provable convergence to the equilibria in an average-iterate manner.

We summarize the iteration complexities of the proposed methods for finding $\epsilon$-approximate solutions of QRE and NE in Table 1. To the best of our knowledge, this work presents the first algorithm design and analysis that focus on equilibrium finding in a multi-player game with delayed feedback. In contrast, most existing works concerning individual regret in the synchronous/asynchronous settings typically yield average-iterate convergence guarantees (see, e.g., Bailey (2021); Meng et al. (2022)) and fall short of characterizing the actual learning trajectory to the equilibrium.

1.2. NOTATION AND PAPER ORGANIZATION

Denote by $[n]$ the set $\{1, \cdots, n\}$ and by $\Delta(S)$ the probability simplex over the set $S$. Given two probability distributions $p, p' \in \Delta(S)$, the KL divergence from $p'$ to $p$ is defined by
$$\mathrm{KL}(p \,\|\, p') := \sum_{k \in S} p(k) \log \frac{p(k)}{p'(k)}.$$
For any vector $z = [z_i]_{1 \le i \le n} \in \mathbb{R}^n$, we use $\exp(z)$ to represent $[\exp(z_i)]_{1 \le i \le n}$.

The rest of this paper is organized as follows. Section 2 provides the preliminaries on zero-sum polymatrix games and solution concepts. Performance guarantees of single-timescale OMWU and two-timescale OMWU are presented in Section 3 and Section 4, respectively. Numerical experiments are provided in Section 5 to corroborate the theoretical findings, and finally, we conclude in Section 6. The proofs are deferred to the appendix.
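As a small illustration of the notation above, the KL divergence can be sketched in a few lines of Python. The function name `kl_divergence` and the list-based representation of distributions are assumptions made for illustration only; the convention that terms with $p(k) = 0$ contribute zero is the standard one.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_k p(k) * log(p(k) / q(k)) for two distributions given as lists.

    Assumes q(k) > 0 wherever p(k) > 0; terms with p(k) = 0 contribute 0.
    """
    total = 0.0
    for pk, qk in zip(p, q):
        if pk > 0.0:
            total += pk * math.log(pk / qk)
    return total
```

For instance, the divergence of a point mass on one of two actions from the uniform distribution is $\log 2$.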

2. PRELIMINARIES

In this section, we introduce the formulation of zero-sum polymatrix games as well as the solution concepts of NE and QRE. We start by defining the polymatrix game.

Definition 1 (Polymatrix game). Let $\mathcal{G} := \{(V, E), \{S_i\}_{i \in V}, \{A_{ij}\}_{(i,j) \in E}\}$ be an $n$-player polymatrix game, where each element in the tuple is defined as follows.
• An undirected graph $(V, E)$, with $V = [n]$ denoting the set of players and $E$ the set of edges;
• For each player $i \in V$, $S_i$ represents its action set, which is assumed to be finite;
• For each edge $(i, j) \in E$, $A_{ij} \in \mathbb{R}^{|S_i| \times |S_j|}$ and $A_{ji} \in \mathbb{R}^{|S_j| \times |S_i|}$ represent the payoff matrices associated with players $i$ and $j$, i.e., when player $i$ and player $j$ choose $s_i \in S_i$ and $s_j \in S_j$, the received payoffs are given by $A_{ij}(s_i, s_j)$ and $A_{ji}(s_j, s_i)$, respectively.

Utility function. Given the strategy profile $s = (s_1, \cdots, s_n) \in S = \prod_{i \in V} S_i$ taken by all players, the utility function $u_i : S \to \mathbb{R}$ of player $i$ is given by
$$u_i(s) = \sum_{j : (i,j) \in E} A_{ij}(s_i, s_j).$$
Suppose that player $i$ adopts a mixed/stochastic strategy or policy, $\pi_i \in \Delta(S_i)$, where the probability of selecting $s_i \in S_i$ is specified by $\pi_i(s_i)$. With slight abuse of notation, we denote the expected utility of player $i$ with a mixed strategy profile $\pi = (\pi_1, \cdots, \pi_n) \in \Delta(S)$ as
$$u_i(\pi) = \mathbb{E}_{s_i \sim \pi_i, \forall i \in V}[u_i(s)] = \sum_{j : (i,j) \in E} \pi_i^\top A_{ij} \pi_j. \qquad (1)$$
It turns out to be convenient to treat $\pi_i$ and $\pi$ as vectors in $\mathbb{R}^{|S_i|}$ and $\mathbb{R}^{\sum_{i \in V} |S_i|}$ without ambiguity, and to concatenate all payoff matrices associated with player $i$ into $A_i = (A_{i1}, \cdots, A_{in}) \in \mathbb{R}^{|S_i| \times \sum_{j \in V} |S_j|}$, where $A_{ij}$ is set to $0$ whenever $(i, j) \notin E$. In particular, it follows that $A_{ii} = 0$ for all $i \in V$. With this notation in place, we can rewrite the expected utility function (1) as
$$u_i(\pi) = \pi_i^\top A_i \pi,$$
where $A_i \pi \in \mathbb{R}^{|S_i|}$ can be interpreted as the expected utility of the actions in $S_i$ for player $i$. In addition, we denote the maximum entrywise absolute value of payoff by
$$\|A\|_\infty = \max_{i,j} \|A_{ij}\|_\infty = \max_i \|A_i\|_\infty.$$

Zero-sum polymatrix games. The game $\mathcal{G}$ is a zero-sum polymatrix game if it holds that
$$\sum_{i \in V} u_i(s) = 0, \quad \forall s \in S.$$
This immediately implies that $\sum_{i \in V} u_i(\pi) = 0$ for any strategy profile $\pi \in \Delta(S)$.
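To make the definitions concrete, the following is a minimal sketch of a polymatrix game in plain Python. It builds a pairwise zero-sum instance by setting $A_{ji} = -A_{ij}^\top$ on every edge, which is one sufficient (not necessary) way to obtain a zero-sum polymatrix game, and then evaluates $[A_i \pi]_k$ and $u_i(\pi) = \pi_i^\top A_i \pi$. All function names, the dictionary layout, and the example graph are illustrative assumptions.

```python
import random

def random_zero_sum_polymatrix(edges, sizes, rng):
    """Return payoffs[(i, j)] for both directions of every edge, with A_ji = -A_ij^T."""
    payoffs = {}
    for (i, j) in edges:
        a_ij = [[rng.uniform(-1.0, 1.0) for _ in range(sizes[j])]
                for _ in range(sizes[i])]
        payoffs[(i, j)] = a_ij
        # Negated transpose: a |S_j| x |S_i| matrix.
        payoffs[(j, i)] = [[-a_ij[r][c] for r in range(sizes[i])]
                           for c in range(sizes[j])]
    return payoffs

def action_payoffs(i, policy, payoffs):
    """[A_i pi]_k: expected payoff of each action k of player i against its neighbors."""
    out = [0.0] * len(policy[i])
    for (a, b), matrix in payoffs.items():
        if a == i:
            for k, row in enumerate(matrix):
                out[k] += sum(x * y for x, y in zip(row, policy[b]))
    return out

def utility(i, policy, payoffs):
    """u_i(pi) = pi_i^T A_i pi."""
    values = action_payoffs(i, policy, payoffs)
    return sum(p * v for p, v in zip(policy[i], values))

# Example: a triangle graph with action sets of sizes 2, 3 and 4.
rng = random.Random(0)
sizes = [2, 3, 4]
game = random_zero_sum_polymatrix([(0, 1), (1, 2), (0, 2)], sizes, rng)
uniform = [[1.0 / m] * m for m in sizes]
total_utility = sum(utility(i, uniform, game) for i in range(len(sizes)))
```

In this construction the utilities cancel edge by edge, so `total_utility` vanishes up to floating-point error for any mixed profile, not only the uniform one.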

Nash equilibrium (NE). A mixed strategy profile $\pi^\star = (\pi_1^\star, \cdots, \pi_n^\star)$ is a Nash equilibrium (NE) when each player $i$ cannot further increase its own utility function $u_i$ by unilateral deviation, i.e.,
$$u_i(\pi_i', \pi_{-i}^\star) \le u_i(\pi_i^\star, \pi_{-i}^\star), \quad \text{for all } i \in V,\ \pi_i' \in \Delta(S_i),$$
where the existence is guaranteed by the work of Cai et al. (2016). Here we denote the mixed strategies of all players other than $i$ by $\pi_{-i}$ and write $u_i(\pi_i, \pi_{-i}) = u_i(\pi)$. To measure how close a strategy $\pi \in \Delta(S)$ is to an NE, we introduce
$$\text{NE-Gap}(\pi) = \max_{i \in V} \Big[ \max_{\pi_i' \in \Delta(S_i)} u_i(\pi_i', \pi_{-i}) - u_i(\pi) \Big],$$
which measures the largest possible gain in the expected utility when a player deviates from its strategy unilaterally. A mixed strategy profile $\pi$ is called an $\epsilon$-approximate Nash equilibrium ($\epsilon$-NE) when $\text{NE-Gap}(\pi) \le \epsilon$, which ensures that $u_i(\pi_i', \pi_{-i}) \le u_i(\pi_i, \pi_{-i}) + \epsilon$ for all $i \in V$, $\pi_i' \in \Delta(S_i)$.

Quantal response equilibrium (QRE). The quantal response equilibrium (QRE), proposed by McKelvey & Palfrey (1995), generalizes the classical notion of NE under uncertain payoffs or bounded rationality, while balancing exploration and exploitation. A mixed strategy profile $\pi_\tau^\star = (\pi_{1,\tau}^\star, \cdots, \pi_{n,\tau}^\star)$ is a QRE when each player assigns its probability of action according to the expected utility of every action in a Boltzmann fashion, i.e., for all $i \in V$,
$$\pi_{i,\tau}^\star(k) = \frac{\exp([A_i \pi_\tau^\star]_k / \tau)}{\sum_{k' \in S_i} \exp([A_i \pi_\tau^\star]_{k'} / \tau)}, \quad k \in S_i,$$
where $\tau > 0$ is the regularization parameter or temperature. Equivalently, this amounts to maximizing an entropy-regularized utility of each player (Mertikopoulos & Sandholm, 2016), i.e.,
$$u_{i,\tau}(\pi_i', \pi_{-i,\tau}^\star) \le u_{i,\tau}(\pi_{i,\tau}^\star, \pi_{-i,\tau}^\star) \quad \text{for all } i \in V,\ \pi_i' \in \Delta(S_i).$$
Here, the entropy-regularized utility function $u_{i,\tau} : \Delta(S) \to \mathbb{R}$ of player $i$ is given by
$$u_{i,\tau}(\pi) = u_i(\pi) + \tau H(\pi_i),$$
where $H(\pi_i) = -\pi_i^\top \log \pi_i$ denotes the Shannon entropy of $\pi_i$. In Leonardos et al. (2021), it is shown that a unique QRE exists in a zero-sum polymatrix game.
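The two solution concepts above can be sketched numerically once the per-action expected payoffs $[A_i \pi]_k$ are available. The snippet below is a minimal illustration under that assumption: `ne_gap` uses the fact that, because $u_i$ is linear in $\pi_i$, the best unilateral deviation is attained at a pure action, and `quantal_response` is the Boltzmann (softmax) map that appears in the QRE fixed-point condition. Both function names and the nested-list input layout are hypothetical.

```python
import math

def ne_gap(action_values, policy):
    """NE-Gap(pi) = max_i [ max_k [A_i pi]_k - pi_i^T A_i pi ].

    action_values[i][k] holds [A_i pi]_k; policy[i] is player i's mixed strategy.
    """
    gaps = []
    for values, probs in zip(action_values, policy):
        current = sum(p * v for p, v in zip(probs, values))
        gaps.append(max(values) - current)  # best pure deviation vs. current utility
    return max(gaps)

def quantal_response(values, tau):
    """Distribution proportional to exp(values / tau), computed stably."""
    top = max(values)
    weights = [math.exp((v - top) / tau) for v in values]
    norm = sum(weights)
    return [w / norm for w in weights]
```

For matching pennies at the uniform profile every action has expected payoff zero, so the gap is zero and the quantal response is again uniform, which is consistent with the uniform profile being both the NE and the QRE of that game.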
Similarly, we can measure the proximity of a strategy $\pi$ to a QRE by
$$\text{QRE-Gap}_\tau(\pi) = \max_{i \in V} \Big[ \max_{\pi_i' \in \Delta(S_i)} u_{i,\tau}(\pi_i', \pi_{-i}) - u_{i,\tau}(\pi) \Big].$$
A mixed strategy profile $\pi$ is called an $\epsilon$-QRE when $\text{QRE-Gap}_\tau(\pi) \le \epsilon$. According to the straightforward relationship
$$\text{NE-Gap}(\pi) \le \text{QRE-Gap}_\tau(\pi) + \tau \log S_{\max},$$
it follows immediately that we can link an $\epsilon/2$-QRE to an $\epsilon$-NE by setting $\tau = \frac{\epsilon}{2 \log S_{\max}}$. This facilitates the translation of convergence to the QRE to one regarding the NE by appropriately setting the regularization parameter $\tau$.

Algorithm 1 Entropy-regularized OMWU, agent $i$
1: Initialize $\pi_i^{(0)} = \bar{\pi}_i^{(0)}$ as the uniform distribution. Learning rates $\eta$, and $\bar{\eta}$ (optional).
2: for $t = 0, 1, 2, \ldots$ do
3:   Receive payoff vector $A_i \bar{\pi}^{(\kappa_i^{(t)})}$.
4:   When $t \ge 1$, update $\pi_i$ according to
       $$\pi_i^{(t)}(k) \propto \pi_i^{(t-1)}(k)^{1 - \eta\tau} \exp\big(\eta [A_i \bar{\pi}^{(\kappa_i^{(t)})}]_k\big), \quad \forall k \in S_i.$$
5:   Update $\bar{\pi}_i$ according to the single-timescale rule
       $$\bar{\pi}_i^{(t+1)}(k) \propto \pi_i^{(t)}(k)^{1 - \eta\tau} \exp\big(\eta [A_i \bar{\pi}^{(\kappa_i^{(t)})}]_k\big), \quad \forall k \in S_i, \qquad (9)$$
     or the two-timescale rule
       $$\bar{\pi}_i^{(t+1)}(k) \propto \pi_i^{(t)}(k)^{1 - \bar{\eta}\tau} \exp\big(\bar{\eta} [A_i \bar{\pi}^{(\kappa_i^{(t)})}]_k\big), \quad \forall k \in S_i. \qquad (10)$$
6: end for
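The procedure in Algorithm 1 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes a constant delay shared by all agents, runs every agent inside one loop for convenience (each agent still only uses its own payoff vector), and takes a hypothetical callback `payoff_fn(i, profile)` that returns $A_i$ applied to a profile. The extrapolated iterate is written `bar` in the code, and `eta_bar=None` selects the single-timescale rule.

```python
import math

def mwu_step(base, payoff, lr, tau):
    """Distribution proportional to base^(1 - lr*tau) * exp(lr * payoff)."""
    logits = [(1.0 - lr * tau) * math.log(b) + lr * g for b, g in zip(base, payoff)]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - top) for x in logits]
    norm = sum(weights)
    return [w / norm for w in weights]

def omwu(payoff_fn, sizes, tau, eta, eta_bar=None, delay=0, num_iters=100):
    """Entropy-regularized OMWU with a constant delay (sketch of Algorithm 1)."""
    eta_bar = eta if eta_bar is None else eta_bar
    n = len(sizes)
    pi = [[1.0 / m] * m for m in sizes]       # pi^(0): uniform
    bar_history = [[row[:] for row in pi]]    # bar_pi^(0) = pi^(0)
    for t in range(num_iters):
        kappa = max(t - delay, 0)             # index of the (possibly stale) feedback
        feedback = [payoff_fn(i, bar_history[kappa]) for i in range(n)]
        if t >= 1:                            # line 4: update pi
            pi = [mwu_step(pi[i], feedback[i], eta, tau) for i in range(n)]
        # line 5: extrapolation with eta (single-timescale) or eta_bar (two-timescale)
        bar_history.append(
            [mwu_step(pi[i], feedback[i], eta_bar, tau) for i in range(n)])
    return pi, bar_history[-1]

# Example: a two-player zero-sum game, i.e. a polymatrix game with a single edge.
A = [[2.0, -1.0], [-1.0, 1.0]]  # player 0's payoffs; player 1 receives -A^T

def payoff_fn(i, profile):
    if i == 0:
        return [sum(A[k][l] * profile[1][l] for l in range(2)) for k in range(2)]
    return [-sum(A[k][l] * profile[0][k] for k in range(2)) for l in range(2)]

pi, bar_pi = omwu(payoff_fn, sizes=[2, 2], tau=0.5, eta=0.1, num_iters=2000)
```

At a fixed point with $\pi = \bar{\pi}$, the update reduces to $\pi_i \propto \exp(A_i \pi / \tau)$, which is the QRE condition; the example run can therefore be checked by comparing each `pi[i]` with the softmax of its own payoff vector.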

3. PERFORMANCE GUARANTEES OF SINGLE-TIMESCALE OMWU

In this section, we present and study the entropy-regularized OMWU method (Cen et al., 2021) for finding the QRE of zero-sum polymatrix games. While the method was originally proposed for finding the QRE in a two-player zero-sum game, the update rule naturally generalizes to the multi-player setting as
$$\pi_i^{(t+1)}(k) \propto \pi_i^{(t)}(k)^{1 - \eta\tau} \exp\big(\eta [A_i \bar{\pi}^{(t+1)}]_k\big), \quad \forall k \in S_i,$$
where $\eta > 0$ is the learning rate and $\bar{\pi}^{(t+1)}$ serves as a prediction for $\pi^{(t+1)}$ via an extrapolation step
$$\bar{\pi}_i^{(t+1)}(k) \propto \pi_i^{(t)}(k)^{1 - \eta\tau} \exp\big(\eta [A_i \bar{\pi}^{(t)}]_k\big), \quad \forall k \in S_i.$$
In the asynchronous setting, however, each agent $i$ receives a delayed payoff vector $A_i \bar{\pi}^{(\kappa_i^{(t)})}$ instead of $A_i \bar{\pi}^{(t)}$ in the $t$-th iteration, where $\kappa_i^{(t)} = \max\{t - \gamma_i^{(t)}, 0\}$, with $\gamma_i^{(t)} \ge 0$ representing the length of the delay. The detailed procedure is outlined in Algorithm 1 using the single-timescale rule (9) for extrapolation.

3.1. PERFORMANCE GUARANTEES WITHOUT DELAYS

We first present our theorem concerning the last-iterate convergence of single-timescale OMWU for finding the QRE in the synchronous setting, i.e., $\gamma_i^{(t)} = 0$ for all $i \in V$ and $t \ge 0$. For any $\pi, \pi' \in \Delta(S)$, let $\mathrm{KL}(\pi \,\|\, \pi') = \sum_{i \in V} \mathrm{KL}(\pi_i \,\|\, \pi_i')$.

Theorem 1 (Last-iterate convergence without delays). Suppose that the learning rate $\eta$ of single-timescale OMWU in Algorithm 1 obeys
$$0 < \eta \le \min\Big\{ \frac{1}{2\tau}, \frac{1}{4 d_{\max} \|A\|_\infty} \Big\},$$
then for any $T \ge 0$, the iterates $\pi^{(T)}$ and $\bar{\pi}^{(T+1)}$ converge at a linear rate according to
$$\mathrm{KL}(\pi_\tau^\star \,\|\, \pi^{(T)}) \le (1 - \eta\tau)^T \, \mathrm{KL}(\pi_\tau^\star \,\|\, \pi^{(0)}), \qquad \mathrm{KL}(\pi_\tau^\star \,\|\, \bar{\pi}^{(T+1)}) \le 2 (1 - \eta\tau)^T \, \mathrm{KL}(\pi_\tau^\star \,\|\, \pi^{(0)}). \qquad (11a)$$
Furthermore, the QRE-gap also converges linearly according to
$$\text{QRE-Gap}_\tau(\bar{\pi}^{(T)}) \le \big( \eta^{-1} + 2\tau^{-1} d_{\max}^2 \|A\|_\infty^2 \big) (1 - \eta\tau)^{T-1} \, \mathrm{KL}(\pi_\tau^\star \,\|\, \pi^{(0)}). \qquad (11b)$$

Theorem 1 demonstrates that as long as the learning rate $\eta$ is sufficiently small, the last iterate of single-timescale OMWU converges to the QRE at a linear rate. Compared with prior works for finding approximate equilibria of zero-sum polymatrix games, our approach features a closed-form multiplicative update and a fast linear last-iterate convergence. Some remarks are in order.

• Linear convergence to the QRE. Theorem 1 implies an iteration complexity of $O\big( \frac{1}{\eta\tau} \log \frac{1}{\epsilon} \big)$ for finding an $\epsilon$-QRE in a last-iterate manner, which leads to an iteration complexity of $O\big( \big( \frac{d_{\max} \|A\|_\infty}{\tau} + 1 \big) \log \frac{1}{\epsilon} \big)$ by optimizing the learning rate in Theorem 1. The result is especially appealing as it avoids direct dependency on the number of agents $n$ as well as the size of action spaces (up to logarithmic factors), suggesting that learning in competitive multi-agent games can be made quite scalable as long as the interactions among the agents are sparse (so that the maximum degree of the graph $d_{\max}$ is much smaller than the number of agents $n$).

• Last-iterate convergence to $\epsilon$-NE. By setting $\tau$ appropriately, we end up with an iteration complexity of $O\big( \frac{d_{\max} \|A\|_\infty}{\epsilon} \big)$ for achieving last-iterate convergence to an $\epsilon$-NE, which outperforms the best existing last-iterate rate of $O(n \|A\|_\infty / \epsilon^2)$ from Leonardos et al. (2021) by at least a factor of $n / (d_{\max} \epsilon)$.

Remark 1. Our results trivially extend to the setting of weighted zero-sum polymatrix games (Leonardos et al., 2021), which amounts to adopting different learning rates $\{\eta_i\}_{i \in V}$ at each player. In this case, the iteration complexity becomes $O\big( \max_{i \in V} \frac{1}{\eta_i \tau} \log \frac{1}{\epsilon} \big)$. In addition, our convergence result readily translates to a bound on individual regret as detailed in Appendix C.
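The step from the QRE-gap bound of Theorem 1 to the stated iteration complexities can be spelled out as follows. This is a sketch of the standard argument, not a derivation taken from the paper; the shorthands $C$ and $K_0$ are introduced here for illustration, and $S_{\max}$ is read as $\max_{i \in V} |S_i|$, which is an assumption about notation the section does not define explicitly.

```latex
% Shorthands (introduced for this sketch only):
%   C   := \eta^{-1} + 2\tau^{-1} d_{\max}^2 \|A\|_\infty^2,
%   K_0 := \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(0)}).
\begin{align*}
C\,(1-\eta\tau)^{T-1} K_0 \le \epsilon
  &\iff (T-1)\,\log\frac{1}{1-\eta\tau} \ \ge\ \log\frac{C K_0}{\epsilon} \\
  &\Longleftarrow\ T \ \ge\ 1 + \frac{1}{\eta\tau}\,\log\frac{C K_0}{\epsilon}
  \qquad \Big(\text{since } \log\tfrac{1}{1-x} \ge x \text{ for } x \in [0,1)\Big).
\end{align*}
% With the largest admissible learning rate
% \eta = \min\{ 1/(2\tau),\ 1/(4 d_{\max}\|A\|_\infty) \}:
\[
\frac{1}{\eta\tau}
  = \max\Big\{ 2,\ \frac{4\, d_{\max}\|A\|_\infty}{\tau} \Big\}
  \ \le\ 2 + \frac{4\, d_{\max}\|A\|_\infty}{\tau}.
\]
% Targeting an (\epsilon/2)-QRE with \tau = \epsilon / (2 \log S_{\max}) yields an \epsilon-NE and
\[
\frac{1}{\eta\tau}
  \ \le\ 2 + \frac{8\, d_{\max}\|A\|_\infty \log S_{\max}}{\epsilon}.
\]
```

With a uniform initialization, $K_0 \le \sum_{i \in V} \log |S_i|$, so the number of agents and the action-space sizes enter only through the logarithmic factor, in line with the first remark.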

3.2. PERFORMANCE GUARANTEES UNDER RANDOM DELAYS

We continue to examine single-timescale OMWU in the more challenging asynchronous setting. In particular, we show that the last iterate of single-timescale OMWU continues to converge linearly to the QRE at a slower rate, as long as the delays satisfy some mild statistical assumptions given below.

Assumption 1 (Random delays). Assume that for all $i \in V$, $t \ge 0$, the delay $\gamma_i^{(t)}$ is independently generated and satisfies
$$\mathbb{E}_{\gamma_i^{(t)} \ge \ell}\big[ \gamma_i^{(t)} \big] := \mathbb{E}\big[ \gamma_i^{(t)} \, \mathbb{1}\{ \gamma_i^{(t)} \ge \ell \} \big] \le E(\ell), \quad \forall \ell = 0, 1, \ldots.$$
Additionally, there exists some constant $\zeta > 1$ such that $L \triangleq \sum_{\ell=0}^{\infty} \zeta^\ell E(\ell) < \infty$.

We remark that Assumption 1 is a rather mild condition that applies to typical delay distributions, such as the Poisson distribution (Zhang et al., 2020), as well as distributions with bounded support (Recht et al., 2011; Liu et al., 2014; Assran et al., 2020). Roughly speaking, Assumption 1 implies that the probability of the delay decays exponentially with its length, where $\zeta^{-1}$ approximately indicates the decay rate. We have the following theorem.

Theorem 2 (Last-iterate convergence with random delays). Under Assumption 1, suppose that the regularization parameter $\tau < \min\{1, d_{\max} \|A\|_\infty\}$ and the learning rate $\eta$ of single-timescale OMWU in Algorithm 1 obeys
$$0 < \eta \le \min\Big\{ \frac{\tau}{24 d_{\max}^2 \|A\|_\infty^2 (L + 1)}, \frac{\zeta - 1}{\tau \zeta} \Big\},$$
then for any $T \ge 1$, the iterates $\pi^{(T)}$ and $\bar{\pi}^{(T)}$ converge to $\pi_\tau^\star$ at the rate
$$\max\Big\{ \mathbb{E}\big[ \mathrm{KL}(\pi_\tau^\star \,\|\, \pi^{(T)}) \big], \ \frac{1}{2} \mathbb{E}\big[ \mathrm{KL}(\pi_\tau^\star \,\|\, \bar{\pi}^{(T)}) \big] \Big\} \le (1 - \eta\tau)^T \, \mathrm{KL}(\pi_\tau^\star \,\|\, \pi^{(0)}).$$
Furthermore, the QRE-gap also converges linearly according to
$$\mathbb{E}\big[ \text{QRE-Gap}_\tau(\bar{\pi}^{(T)}) \big] \le 4 \eta^{-1} (1 - \eta\tau)^T \, \mathrm{KL}(\pi_\tau^\star \,\|\, \pi^{(0)}).$$

Theorem 2 suggests that the iteration complexity to an $\epsilon$-QRE is no more than $O\big( \max\big\{ d_{\max}^2 \|A\|_\infty^2 (L + 1), \frac{\zeta}{\zeta - 1} \big\} \frac{1}{\tau^2} \log \frac{1}{\epsilon} \big)$ after optimizing the learning rate, whose range is more limited compared with the requirement in Theorem 1 without delays. In particular, the range of the learning rate is proportional to the regularization parameter $\tau$, an issue we shall try to address by resorting to two-timescale learning rates in OMWU. To facilitate further understanding, we showcase the iteration complexity for finding $\epsilon$-QRE/NE under two typical scenarios: bounded delay and Poisson delay.

• Bounded random delay. When the delays are bounded above by some maximum delay $\gamma$, Assumption 1 is met with $\zeta = 1 + \gamma^{-1}$ and $L = e\gamma(\gamma + 1)$. Plugging into Theorem 2 yields an iteration complexity of $O\big( \frac{d_{\max}^2 \|A\|_\infty^2 (\gamma + 1)^2}{\tau^2} \log \frac{1}{\epsilon} \big)$ for finding an $\epsilon$-QRE, or $O\big( \frac{d_{\max}^2 \|A\|_\infty^2 (\gamma + 1)^2}{\epsilon^2} \big)$ for finding an $\epsilon$-NE, which increases quadratically as the maximum delay increases. Note that these rates are worse than those without delays (cf. Theorem 1).

• Poisson delay. When the delays follow the Poisson distribution with parameter $1/T$, it suffices to set $\zeta = 1 + T^{-1}$ and $L = eT(1 + T)$ in Assumption 1. This leads to an iteration complexity of $O\big( \frac{d_{\max}^2 \|A\|_\infty^2 T^2}{\tau^2} \log \frac{1}{\epsilon} \big)$ for finding an $\epsilon$-QRE, or $O\big( \frac{d_{\max}^2 \|A\|_\infty^2 T^2}{\epsilon^2} \big)$ for finding an $\epsilon$-NE, which is similar to the bounded random delay case.
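The constants in the bounded-delay scenario can be sanity-checked numerically. The sketch below assumes a delay drawn uniformly from $\{0, \ldots, \gamma\}$ and reads the quantity bounded by $E(\ell)$ in Assumption 1 as the truncated mean $\mathbb{E}[\gamma_i^{(t)} \mathbb{1}\{\gamma_i^{(t)} \ge \ell\}]$; that reading is an interpretation of the display, not something the text states outright. The function names are hypothetical.

```python
import math

def delay_constants_uniform(gamma_max):
    """Return (zeta, L) for a delay drawn uniformly from {0, ..., gamma_max}.

    Uses zeta = 1 + 1/gamma_max and L = sum_l zeta^l * E[delay * 1{delay >= l}],
    where the truncated mean is computed exactly. Requires gamma_max >= 1.
    """
    zeta = 1.0 + 1.0 / gamma_max
    prob = 1.0 / (gamma_max + 1)
    total = 0.0
    for ell in range(gamma_max + 1):  # the truncated mean is 0 for ell > gamma_max
        truncated_mean = sum(g * prob for g in range(ell, gamma_max + 1))
        total += zeta ** ell * truncated_mean
    return zeta, total

def max_learning_rate(tau, d_max, a_inf, zeta, big_l):
    """Upper end of the learning-rate range stated in Theorem 2."""
    return min(tau / (24.0 * d_max ** 2 * a_inf ** 2 * (big_l + 1.0)),
               (zeta - 1.0) / (tau * zeta))

zeta, big_l = delay_constants_uniform(10)
eta_max = max_learning_rate(tau=0.1, d_max=3, a_inf=1.0, zeta=zeta, big_l=big_l)
```

Under these assumptions the exact sum stays below the closed-form value $e\gamma(\gamma+1)$ quoted in the bounded-delay bullet, so using that value for $L$ only makes the admissible learning rate slightly more conservative.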

4. PERFORMANCE GUARANTEES OF TWO-TIMESCALE OMWU

While Theorem 2 demonstrates provable convergence of single-timescale OMWU with random delays, it remains unclear whether the update rule can be better motivated in more general asynchronous settings, and whether the convergence can be further ensured under adversarial delays. Indeed, theoretical insights from previous literature (Mokhtari et al., 2020; Cen et al., 2021) suggest the critical role of $\bar{\pi}^{(t)}$ as a predictive surrogate for $\pi^{(t)}$ in enabling fast convergence, which no longer holds when $\bar{\pi}^{(t)}$ is replaced by a delayed feedback from $\bar{\pi}^{(\kappa_i^{(t)})}$. To this end, we propose to replace the extrapolation update (9) with one equipped with a different learning rate:
$$\bar{\pi}_i^{(t+1)}(k) \propto \pi_i^{(t)}(k)^{1 - \bar{\eta}\tau} \exp\big(\bar{\eta} [A_i \bar{\pi}^{(\kappa_i^{(t)})}]_k\big), \quad \forall k \in S_i,$$
which adopts a larger learning rate $\bar{\eta} > \eta$ to counteract the delay. Intuitively, a choice of $\bar{\eta} \approx (\gamma_i^{(t)} + 1)\eta$ would allow $\bar{\pi}^{(\kappa_i^{(t)})}$ to approximate $\pi^{(t)}$ by taking the intermediate updates $\{\pi^{(l)} : \kappa_i^{(t)} \le l < t\}$ into consideration. We refer to this update rule as the two-timescale entropy-regularized OMWU, whose detailed procedure is again outlined in Algorithm 1 using (10) for extrapolation.

4.1. PERFORMANCE GUARANTEES UNDER CONSTANT AND KNOWN DELAYS

To highlight the potential benefit of learning rate separation, we start by studying the convergence of two-timescale OMWU in the asynchronous setting with constant and known delays, which has been studied in (Weinberger & Ordentlich, 2002; Flaspohler et al., 2021; Meng et al., 2022). We have the following theorem, which reveals a faster linear convergence to the QRE by using a delay-aware two-timescale learning rate design.

Theorem 3 (Last-iterate convergence with fixed delays). Suppose that the delays $\gamma_i^{(t)} = \gamma$ are fixed and known. Suppose that the learning rate $\eta$ of two-timescale OMWU in Algorithm 1 satisfies
$$\eta \le \min\Big\{ \frac{1}{2\tau(\gamma + 1)}, \frac{1}{5 d_{\max} \|A\|_\infty (\gamma + 1)^2} \Big\}$$
and $\bar{\eta}$ is determined by $1 - \bar{\eta}\tau = (1 - \eta\tau)^{\gamma + 1}$, then the last iterates $\pi^{(t)}$ and $\bar{\pi}^{(t)}$ converge to the QRE at a linear rate: for $T \ge \gamma$,
$$\max\Big\{ \mathrm{KL}(\pi_\tau^\star \,\|\, \pi^{(T+1)}), \ \frac{1}{2} \mathrm{KL}(\pi_\tau^\star \,\|\, \bar{\pi}^{(T - \gamma + 1)}) \Big\} \le (1 - \eta\tau)^{T+1} \, \mathrm{KL}(\pi_\tau^\star \,\|\, \pi^{(0)}) + (1 - \eta\tau)^{T + 1 - \gamma}.$$
In addition, the QRE-gap converges linearly according to
$$\text{QRE-Gap}_\tau(\bar{\pi}^{(T - \gamma + 1)}) \le 2 \max\Big\{ \frac{d_{\max}^2 \|A\|_\infty^2}{\tau}, \frac{1}{\eta} \Big\} \Big[ (1 - \eta\tau)^{T+1} \, \mathrm{KL}(\pi_\tau^\star \,\|\, \pi^{(0)}) + (1 - \eta\tau)^{T + 1 - \gamma} \Big].$$

By optimizing the learning rate $\eta$, Theorem 3 implies that two-timescale OMWU takes at most $O\big( \frac{d_{\max} \|A\|_\infty (\gamma + 1)^2}{\tau} \log \frac{1}{\epsilon} \big)$ iterations to find an $\epsilon$-QRE in a last-iterate manner, which translates to an iteration complexity of $O\big( \frac{d_{\max} \|A\|_\infty (\gamma + 1)^2}{\epsilon} \big)$ for finding an $\epsilon$-NE. This significantly improves over the iteration complexity of $O(d_{\max}^2 \|A\|_\infty^2 (\gamma + 1)^2 / \epsilon^2)$ for single-timescale OMWU, verifying the positive role of adopting two-timescale learning rates in enabling faster convergence.
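The delay-aware choice of the extrapolation learning rate in Theorem 3 has a closed form, obtained by solving $1 - \bar{\eta}\tau = (1 - \eta\tau)^{\gamma+1}$ for $\bar{\eta}$. The short sketch below computes it; the function name `extrapolation_rate` is a hypothetical label for illustration.

```python
def extrapolation_rate(eta, tau, gamma):
    """Solve 1 - eta_bar * tau = (1 - eta * tau) ** (gamma + 1) for eta_bar."""
    return (1.0 - (1.0 - eta * tau) ** (gamma + 1)) / tau

# Example: eta = 0.01, tau = 0.1 and a constant delay of 10 rounds.
eta_bar = extrapolation_rate(eta=0.01, tau=0.1, gamma=10)
```

By Bernoulli's inequality, $(1 - \eta\tau)^{\gamma+1} \ge 1 - (\gamma+1)\eta\tau$, so the resulting $\bar{\eta}$ never exceeds $(\gamma+1)\eta$ and approaches it as $\eta\tau \to 0$, which matches the intuition $\bar{\eta} \approx (\gamma + 1)\eta$ given for the two-timescale rule. With zero delay the formula returns $\bar{\eta} = \eta$, recovering the single-timescale rule.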

4.2. PERFORMANCE GUARANTEES WITH PERMUTED BOUNDED DELAYS

The above result requires exact information about the delay, which may not always be available. Motivated by the need to address arbitrary or even adversarial delays, we consider a more realistic scenario, where the payoff sequence arrives in a permuted order (Agarwal & Duchi, 2011) constrained by a maximum bounded delay (McMahan & Streeter, 2014; Wan et al., 2022).

Assumption 2 (Bounded delay). For any $i \in V$ and $t > 0$, it holds that $\gamma_i^{(t)} \le \gamma$.

Assumption 3 (Permuted feedback). For any $t > 0$, the payoff vector at the $t$-th iteration is received by agent $i$ only once. The payoff at the $0$-th iteration can be used multiple times.

The following theorem unveils the convergence of two-timescale OMWU to the QRE in an average sense under permuted bounded delays.

Theorem 4 (Average-iterate convergence under permuted delays). Under Assumptions 2 and 3, suppose that the learning rate $\eta$ of two-timescale OMWU in Algorithm 1 satisfies
$$\eta \le \min\Big\{ \frac{1}{2\tau(\gamma + 1)}, \frac{1}{28 d_{\max} \|A\|_\infty (\gamma + 1)^{5/2}} \Big\},$$
and $\bar{\eta}$ is determined by $1 - \bar{\eta}\tau = (1 - \eta\tau)^{\gamma + 1}$, then for $T > 2\gamma$, it holds that
$$\frac{1}{T - 2\gamma} \max\Big\{ \sum_{t=2\gamma}^{T-1} \mathrm{KL}(\pi_\tau^\star \,\|\, \pi^{(t+1)}), \ \frac{1}{3} \sum_{t=2\gamma}^{T-1} \mathrm{KL}(\pi_\tau^\star \,\|\, \bar{\pi}^{(t - \gamma + 1)}) \Big\} \le \frac{1}{\eta\tau(T - 2\gamma)} \Big( \mathrm{KL}(\pi_{i,\tau}^\star \,\|\, \pi_i^{(0)}) + n \Big) + \frac{24 n \gamma \log S_{\max}}{T - 2\gamma}. \qquad (16)$$
Furthermore, the average QRE-gap can be bounded by
$$\frac{1}{T - 2\gamma} \sum_{t=2\gamma}^{T-1} \text{QRE-Gap}_\tau(\pi^{(t+1)}) \le \frac{1}{T - 2\gamma} \max\Big\{ \frac{3 d_{\max}^2 \|A\|_\infty^2}{2\tau}, \tau \Big\} \Big[ \frac{1}{\eta\tau} \Big( \mathrm{KL}(\pi_{i,\tau}^\star \,\|\, \pi_i^{(0)}) + n \Big) + 36 n \gamma \log S_{\max} \Big].$$

Theorem 4 guarantees that the best iterate among $\{\pi^{(t)}\}_{2\gamma < t \le T}$ is an $\epsilon$-QRE as long as $T$ is on the order of $O\big( \frac{n d_{\max}^3 \|A\|_\infty^3 (\gamma + 1)^{5/2}}{\tau^2 \epsilon} \big)$, which translates to an iteration complexity of $O\big( \frac{n d_{\max}^3 \|A\|_\infty^3 (\gamma + 1)^{5/2}}{\epsilon^3} \big)$ for finding an $\epsilon$-NE. While the rate seems slower than those in the previous theorems, Theorem 4 holds under arguably the weakest delay assumption, where the delays can even be adversarial as long as they are bounded. We remark that the result in (16) also guarantees the convergence of the last iterate $\pi^{(t)}$ to the QRE asymptotically, although without a finite-time rate. This is in sharp contrast to typical average-iterate analysis that only applies to $\frac{1}{T} \sum_{t=1}^{T} \pi^{(t)}$ without implications on the convergence of the last iterate $\pi^{(t)}$.

Remark 2. The analysis in this section can be generalized to more commonly used delay models where the reward information is not assumed to be observed once per round (Quanrud & Khashabi, 2015; Joulani et al., 2013), i.e., in every round an agent may observe multiple reward feedbacks from previous iterations or receive no information. This can be achieved by storing reward feedbacks in a buffer memory and picking one for policy update every round in a First-In-First-Out manner.
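The buffering idea in Remark 2 can be sketched with a standard queue. The class below is an illustration under stated assumptions, not the authors' implementation: the name `FeedbackBuffer` is hypothetical, and the behavior when nothing has arrived (reusing the most recently consumed payoff vector, starting from an initial one) is a design choice made here, since the remark does not specify it.

```python
from collections import deque

class FeedbackBuffer:
    """First-in-first-out store for payoff vectors that arrive out of step."""

    def __init__(self, initial_feedback):
        self._queue = deque()
        self._last = initial_feedback  # reused while the queue is empty (assumption)

    def receive(self, payoffs):
        """Store zero or more payoff vectors that arrived in this round."""
        self._queue.extend(payoffs)

    def next_feedback(self):
        """Return exactly one payoff vector for this round's policy update."""
        if self._queue:
            self._last = self._queue.popleft()
        return self._last
```

A round in which several vectors arrive simply lengthens the queue, and each later round consumes one of them in arrival order, so every agent still performs one update per round.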

5. NUMERICAL EXPERIMENTS

In this section, we verify our theoretical findings by investigating the performance of both single-timescale and two-timescale OMWU on randomly generated zero-sum entropy-regularized polymatrix games with $n = 10$, $|S_i| = 10$ for all $i \in V$, and $\tau = 0.1$. For each $(i, j) \in E$, we set $A_{ij} = -A_{ji}^\top$ with entries of $A_{ij}$ independently sampled from the uniform distribution over $[-1, 1]$. All the results are averaged over five independent runs.

In Fig. 1 (a), we compare the performance of single-timescale OMWU in both synchronous and asynchronous settings, with delay uniformly sampled from $\{0, 1, \ldots, 10\}$. We adopt the learning rate $\eta$ from $\{0.1, 0.05, 0.02, 0.01, \ldots\}$ that yields the highest accuracy. The method achieves linear convergence in both cases, yet the convergence rate is slowed down by delayed feedback in the asynchronous setting. Fig. 1 (b) and (c) compare the effect of different choices of learning rates $\eta$, $\bar{\eta}$ on the performance of the proposed methods, where the feedback is permuted with bounded delay $\gamma = 25$ (cf. Assumptions 2 and 3). In general, two-timescale OMWU outperforms single-timescale OMWU given appropriate choices of learning rate $\eta$. On the other hand, (c) demonstrates that the choice of $\bar{\eta} = \tau^{-1}(1 - (1 - \eta\tau)^{\gamma + 1})$ suggested by the theory (marked with a star) indeed leads to near-optimal performance of two-timescale OMWU.

Figure 2 shows $\mathrm{KL}(\pi_\tau^\star \,\|\, \pi^{(t)})$ with respect to the number of iterations of single-timescale and two-timescale OMWU under different asynchronous scenarios, with optimal choices of $\eta$ and $\tau = 0.1$. In particular, two-timescale OMWU adopts the extrapolation learning rate suggested by theory, $\bar{\eta} = \tau^{-1}(1 - (1 - \eta\tau)^{\gamma + 1})$. While both methods yield linear convergence to the QRE, the two-timescale method outperforms its single-timescale counterpart in the case with constant and known delay and the case where the feedback is permuted with bounded delay, which verifies our theory.

6. CONCLUSION

This paper studies asynchronous gradient play in zero-sum polymatrix games, by investigating the convergence behaviors of entropy-regularized OMWU with delayed feedback under two different schedules of the learning rates. We demonstrate that single-timescale OMWU enjoys a linear last-iterate convergence to the QRE even under mild statistical delays. However, the presence of the delay noticeably limits the allowable range of learning rates and slows down the convergence. To mitigate the impact, we further show that the method benefits from adopting a two-timescale learning rate in a delay-aware manner, which achieves a faster last-iterate convergence when the delay is fixed and known, and continues to converge provably, in an average-iterate manner, even when the delays are arbitrarily bounded. We believe our work lays the foundation for further understanding of delayed feedback in games under symmetric and independent learning.

A FURTHER RELATED WORKS

Learning in two-player zero-sum matrix games. Freund & Schapire (1999) proved that the Multiplicative Weights Update (MWU) method achieves an average-iterate convergence rate of $O(1/\sqrt{T})$ through the lens of regret analysis. Daskalakis et al. (2011) were the first to achieve an optimal convergence rate of $O(1/T)$ with the excessive gap technique of Nesterov (Nesterov, 2005a;b). Rakhlin & Sridharan (2013) achieved the same rate with OMD, which is more commonly referred to as OMWU when entropy regularization is in use for the mirror descent update rule. In terms of last-iterate convergence, Daskalakis & Panageas (2018) established asymptotic last-iterate convergence for OMWU assuming the uniqueness of the NE. Wei et al. (2021) improved upon the analysis under the same assumption by showing a problem-dependent linear rate of convergence, which is extended to a class of extensive-form games (Lee et al., 2021). Cen et al. (2021) showed that entropy-regularized OMWU converges linearly to the QRE of a two-player zero-sum matrix game, which translates to an iteration complexity of $O(1/T)$ for finding an $\epsilon$-NE, without assuming its uniqueness; the linear convergence to the QRE continues to hold with smooth value updates (Cen et al., 2022). Sokota et al. (2022) showed that linear convergence to the QRE can be achieved without resorting to optimistic update rules, e.g., using entropy-regularized MWU, albeit with a more restrictive learning rate. It is worth pointing out that the idea of learning rate separation has been explored for equilibrium finding in two-player zero-sum games with instant feedback (Fasoulakis et al., 2022) and online learning with delayed feedback (Hsieh et al., 2020), but has not been studied in an asynchronous multi-player game setting.

Asynchronous optimization. Asynchronous and decentralized optimization algorithms have been extensively studied since the proposal of Bertsekas & Tsitsiklis (1989), where a number of agents seek to find an approximate global optimizer of a common loss function, by performing iterative gradient-based methods in a collaborative manner. Typical approaches include parallelizing the computation of the gradient with regard to data (Tong et al., 2020), or parallelizing the model updates by imposing coordinate update rules (Nesterov, 2012; Liu et al., 2014; Liu & Wright, 2015). Delayed gradients (feedback) are also common in these scenarios due to the existence of other agents updating the model. In contrast, the zero-sum polymatrix setting considered in this work is inherently noncollaborative, requiring every agent to maximize its own utility function and compete with other agents, and leads to substantially different analysis techniques.

B PROOF FOR SINGLE-TIMESCALE OMWU (SECTION 3)

Before delving into the main proof, we first record a useful lemma pertaining to a basic property of zero-sum polymatrix games; the proof is deferred to Appendix E.1. For $i \in V$, we denote by $\mathcal{N}_i = \{j : (i, j) \in E\}$ the neighbors of agent $i$ in the graph $(V, E)$. For notational simplicity, we denote by $x \overset{1}{=} y$ the equivalence between two vectors $x$ and $y$ up to a global shift, i.e.,
$$x = y + c \cdot \mathbf{1} \tag{17}$$
for some constant $c \in \mathbb{R}$, where $\mathbf{1}$ is the all-one vector.

Lemma 1. For any zero-sum polymatrix game $\mathcal{G}$, it holds for any $\pi, \pi' \in \Delta(S)$ that
$$\sum_{i\in V} \big[ u_i(\pi_i, \pi'_{-i}) + u_i(\pi'_i, \pi_{-i}) \big] = 0,$$
or equivalently,
$$\sum_{i\in V} \big[ \pi_i^\top A_i \pi' + (\pi'_i)^\top A_i \pi \big] = 0.$$
It follows that
$$\sum_{i\in V} \big\langle \pi_i - \pi'_i,\ A_i(\pi - \pi') \big\rangle = \sum_{i\in V} \big[u_i(\pi) + u_i(\pi')\big] - \sum_{i\in V} \big[ \pi_i^\top A_i \pi' + (\pi'_i)^\top A_i \pi \big] = 0.$$
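As a quick numerical sanity check of Lemma 1, the following sketch builds a small random game and evaluates both sums appearing in the lemma. It rests on assumptions that are not part of the text above: the game is pairwise zero-sum (each edge satisfies $A_{ji} = -A_{ij}^\top$, a special case of a zero-sum polymatrix game), and the graph, action counts, and helper names (`make_game`, `utility`, `bilinear`) are hypothetical.

```python
import random

def random_simplex(n, rng):
    """Draw a strictly positive probability vector of length n."""
    w = [rng.random() + 1e-3 for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

def bilinear(x, M, y):
    """Compute x^T M y for list-based vectors and matrices."""
    return sum(x[a] * M[a][b] * y[b] for a in range(len(x)) for b in range(len(y)))

def make_game(edges, n_actions, rng):
    """Assumed construction: pairwise zero-sum payoffs, A[(j, i)] = -A[(i, j)]^T."""
    A = {}
    for i, j in edges:
        M = [[rng.uniform(-1, 1) for _ in range(n_actions[j])] for _ in range(n_actions[i])]
        A[(i, j)] = M
        A[(j, i)] = [[-M[a][b] for a in range(n_actions[i])] for b in range(n_actions[j])]
    return A

def utility(i, own, others, A):
    """u_i(own, others_{-i}) = sum over neighbors j of own^T A_ij others_j."""
    return sum(bilinear(own, M, others[j]) for (k, j), M in A.items() if k == i)

rng = random.Random(0)
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
n_actions = [2, 3, 2, 4]
A = make_game(edges, n_actions, rng)
pi = [random_simplex(n, rng) for n in n_actions]
pi2 = [random_simplex(n, rng) for n in n_actions]

# First identity of Lemma 1: sum_i [u_i(pi_i, pi'_{-i}) + u_i(pi'_i, pi_{-i})].
lemma1_sum = sum(utility(i, pi[i], pi2, A) + utility(i, pi2[i], pi, A) for i in range(4))

# Consequence: sum_i <pi_i - pi'_i, A_i (pi - pi')>.
diff = [[p - q for p, q in zip(pi[i], pi2[i])] for i in range(4)]
cross = sum(bilinear(diff[i], M, diff[j]) for (i, j), M in A.items())

print(lemma1_sum, cross)  # both should vanish up to floating-point error
```

Both quantities vanish here because every edge contributes two terms that cancel exactly; a general zero-sum polymatrix game only requires the utilities to sum to zero overall, which this sketch does not exercise.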

B.1 PROOF OF THEOREM 1

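We start with the following lemma that characterizes the iterates of OMWU, which generalizes Cen et al. (2021, Lemma 1) from zero-sum two-player games to zero-sum polymatrix games. The proof can be found in Appendix E.2.

Lemma 2. The iterates of OMWU based on the update rule (9) satisfy
$$\big\langle \log \pi^{(t+1)} - (1-\eta\tau)\log \pi^{(t)} - \eta\tau \log \pi^\star_\tau,\ \bar{\pi}^{(t+1)} - \pi^\star_\tau \big\rangle = 0.$$

To continue, by the definition of KL divergence, we have
$$\begin{aligned}
&\big\langle \log \pi^{(t+1)} - (1-\eta\tau)\log \pi^{(t)} - \eta\tau \log \pi^\star_\tau,\ \bar{\pi}^{(t+1)} \big\rangle \\
&= \big\langle \log \bar{\pi}^{(t+1)} - (1-\eta\tau)\log \pi^{(t)} - \eta\tau \log \pi^\star_\tau,\ \bar{\pi}^{(t+1)} \big\rangle - \big\langle \log \bar{\pi}^{(t+1)} - \log \pi^{(t+1)},\ \pi^{(t+1)} \big\rangle \\
&\quad - \big\langle \log \bar{\pi}^{(t+1)} - \log \pi^{(t+1)},\ \bar{\pi}^{(t+1)} - \pi^{(t+1)} \big\rangle \\
&= (1-\eta\tau)\mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^{(t)}) + \eta\tau\, \mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^\star_\tau) + \mathrm{KL}(\pi^{(t+1)} \,\|\, \bar{\pi}^{(t+1)}) \\
&\quad - \big\langle \log \bar{\pi}^{(t+1)} - \log \pi^{(t+1)},\ \bar{\pi}^{(t+1)} - \pi^{(t+1)} \big\rangle.
\end{aligned}$$
In addition,
$$-\big\langle \log \pi^{(t+1)} - (1-\eta\tau)\log \pi^{(t)} - \eta\tau \log \pi^\star_\tau,\ \pi^\star_\tau \big\rangle = \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t+1)}) - (1-\eta\tau)\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t)}).$$
Summing up the above two relations, in view of Lemma 2, it holds that
$$\begin{aligned}
\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t+1)}) &= (1-\eta\tau)\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t)}) - (1-\eta\tau)\mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^{(t)}) - \mathrm{KL}(\pi^{(t+1)} \,\|\, \bar{\pi}^{(t+1)}) \\
&\quad + \big\langle \log \bar{\pi}^{(t+1)} - \log \pi^{(t+1)},\ \bar{\pi}^{(t+1)} - \pi^{(t+1)} \big\rangle - \eta\tau\, \mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^\star_\tau).
\end{aligned} \tag{19}$$
We now proceed to bound the terms of interest one by one.

Bounding $\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t)})$. We aim to control the right-hand side (RHS) of (19). Based on the update rules of $\bar{\pi}^{(t+1)}_i$ and $\pi^{(t+1)}_i$ in Algorithm 1, we have
$$\log \bar{\pi}^{(t+1)}_i - \log \pi^{(t+1)}_i \overset{1}{=} \eta A_i(\bar{\pi}^{(t)} - \bar{\pi}^{(t+1)}) \tag{20}$$
$$\overset{1}{=} \eta A_i(\bar{\pi}^{(t)} - \pi^{(t)}) + \eta A_i(\pi^{(t)} - \bar{\pi}^{(t+1)}).$$
It follows that
$$\begin{aligned}
&\big\langle \log \bar{\pi}^{(t+1)}_i - \log \pi^{(t+1)}_i,\ \bar{\pi}^{(t+1)}_i - \pi^{(t+1)}_i \big\rangle \\
&= \eta \sum_{j\in\mathcal{N}_i} (\bar{\pi}^{(t+1)}_i - \pi^{(t+1)}_i)^\top A_{ij} (\bar{\pi}^{(t)}_j - \pi^{(t)}_j) + \eta \sum_{j\in\mathcal{N}_i} (\bar{\pi}^{(t+1)}_i - \pi^{(t+1)}_i)^\top A_{ij} (\pi^{(t)}_j - \bar{\pi}^{(t+1)}_j) \\
&\le \eta \sum_{j\in\mathcal{N}_i} \|A_{ij}\|_\infty \big\|\bar{\pi}^{(t+1)}_i - \pi^{(t+1)}_i\big\|_1 \big\|\bar{\pi}^{(t)}_j - \pi^{(t)}_j\big\|_1 + \eta \sum_{j\in\mathcal{N}_i} \|A_{ij}\|_\infty \big\|\bar{\pi}^{(t+1)}_i - \pi^{(t+1)}_i\big\|_1 \big\|\pi^{(t)}_j - \bar{\pi}^{(t+1)}_j\big\|_1 \\
&\le \frac{\eta}{2} \|A\|_\infty \sum_{j\in\mathcal{N}_i} \Big( \big\|\bar{\pi}^{(t)}_j - \pi^{(t)}_j\big\|_1^2 + \big\|\bar{\pi}^{(t+1)}_j - \pi^{(t)}_j\big\|_1^2 + 2\big\|\bar{\pi}^{(t+1)}_i - \pi^{(t+1)}_i\big\|_1^2 \Big) \\
&\le \eta \|A\|_\infty \sum_{j\in\mathcal{N}_i} \Big( \mathrm{KL}(\pi^{(t)}_j \,\|\, \bar{\pi}^{(t)}_j) + \mathrm{KL}(\bar{\pi}^{(t+1)}_j \,\|\, \pi^{(t)}_j) + 2\mathrm{KL}(\pi^{(t+1)}_i \,\|\, \bar{\pi}^{(t+1)}_i) \Big),
\end{aligned} \tag{21}$$
where the last line follows from Pinsker's inequality.

Summing the inequality over $i \in V$, we get
$$\big\langle \log \bar{\pi}^{(t+1)} - \log \pi^{(t+1)},\ \bar{\pi}^{(t+1)} - \pi^{(t+1)} \big\rangle \le \eta d_{\max} \|A\|_\infty \Big( \mathrm{KL}(\pi^{(t)} \,\|\, \bar{\pi}^{(t)}) + \mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^{(t)}) + 2\mathrm{KL}(\pi^{(t+1)} \,\|\, \bar{\pi}^{(t+1)}) \Big).$$
Plugging the above inequality back into (19) yields
$$\begin{aligned}
\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t+1)}) &\le (1-\eta\tau)\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t)}) - (1 - \eta\tau - \eta d_{\max}\|A\|_\infty)\, \mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^{(t)}) \\
&\quad - (1 - 2\eta d_{\max}\|A\|_\infty)\, \mathrm{KL}(\pi^{(t+1)} \,\|\, \bar{\pi}^{(t+1)}) + \eta d_{\max}\|A\|_\infty\, \mathrm{KL}(\pi^{(t)} \,\|\, \bar{\pi}^{(t)}) - \eta\tau\, \mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^\star_\tau).
\end{aligned} \tag{22}$$
With the choice of the learning rate $0 < \eta \le \min\big\{ \frac{1}{2\tau}, \frac{1}{4 d_{\max}\|A\|_\infty} \big\}$, it holds that $1 - \eta\tau - \eta d_{\max}\|A\|_\infty > 0$ and
$$\eta d_{\max}\|A\|_\infty \le \frac{1}{4} \le (1-\eta\tau)(1 - 2\eta d_{\max}\|A\|_\infty).$$
This allows us to further relax (22) as
$$\begin{aligned}
\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t+1)}) + (1 - 2\eta d_{\max}\|A\|_\infty)\mathrm{KL}(\pi^{(t+1)} \,\|\, \bar{\pi}^{(t+1)}) &\le (1-\eta\tau)\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t)}) + \eta d_{\max}\|A\|_\infty \mathrm{KL}(\pi^{(t)} \,\|\, \bar{\pi}^{(t)}) \\
&\le (1-\eta\tau)\Big[ \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t)}) + (1 - 2\eta d_{\max}\|A\|_\infty)\mathrm{KL}(\pi^{(t)} \,\|\, \bar{\pi}^{(t)}) \Big].
\end{aligned}$$
Let us now introduce the potential function
$$L^{(t)} := \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t)}) + (1 - 2\eta d_{\max}\|A\|_\infty)\mathrm{KL}(\pi^{(t)} \,\|\, \bar{\pi}^{(t)}),$$
which allows us to simplify the previous inequality as
$$L^{(t+1)} \le (1-\eta\tau) L^{(t)} \le (1-\eta\tau)^{t+1} L^{(0)} = (1-\eta\tau)^{t+1} \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(0)}), \tag{24}$$
where the last equality follows from the definition $\bar{\pi}^{(0)} = \pi^{(0)}$. Hence, we have
$$\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t)}) \le L^{(t)} \le (1-\eta\tau)^{t}\, \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(0)}).$$

Bounding $\mathrm{KL}(\pi^\star_\tau \,\|\, \bar{\pi}^{(t+1)})$. Following similar approaches to (21), we can bound
$$\begin{aligned}
-\big\langle \pi^\star_{i,\tau} - \bar{\pi}^{(t+1)}_i,\ \log \bar{\pi}^{(t+1)}_i - \log \pi^{(t+1)}_i \big\rangle &= \eta (\bar{\pi}^{(t+1)}_i - \pi^\star_{i,\tau})^\top A_i (\bar{\pi}^{(t)} - \pi^{(t)}) + \eta (\bar{\pi}^{(t+1)}_i - \pi^\star_{i,\tau})^\top A_i (\pi^{(t)} - \bar{\pi}^{(t+1)}) \\
&\le \eta \|A\|_\infty \sum_{j\in\mathcal{N}_i} \Big( \mathrm{KL}(\pi^{(t)}_j \,\|\, \bar{\pi}^{(t)}_j) + \mathrm{KL}(\bar{\pi}^{(t+1)}_j \,\|\, \pi^{(t)}_j) + 2\mathrm{KL}(\pi^\star_{i,\tau} \,\|\, \bar{\pi}^{(t+1)}_i) \Big).
\end{aligned} \tag{25}$$
Summing the inequality over $i \in V$ leads to
$$-\big\langle \pi^\star_\tau - \bar{\pi}^{(t+1)},\ \log \bar{\pi}^{(t+1)} - \log \pi^{(t+1)} \big\rangle \le \eta d_{\max}\|A\|_\infty \Big( \mathrm{KL}(\pi^{(t)} \,\|\, \bar{\pi}^{(t)}) + \mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^{(t)}) + 2\mathrm{KL}(\pi^\star_\tau \,\|\, \bar{\pi}^{(t+1)}) \Big).$$
On the other hand, by the definition of KL divergence, we have
$$\mathrm{KL}(\pi^\star_\tau \,\|\, \bar{\pi}^{(t+1)}) = \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t+1)}) - \mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^{(t+1)}) - \big\langle \pi^\star_\tau - \bar{\pi}^{(t+1)},\ \log \bar{\pi}^{(t+1)} - \log \pi^{(t+1)} \big\rangle. \tag{26}$$
Combining the above two relations, we get
$$(1 - 2\eta d_{\max}\|A\|_\infty)\mathrm{KL}(\pi^\star_\tau \,\|\, \bar{\pi}^{(t+1)}) \le \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t+1)}) + \eta d_{\max}\|A\|_\infty \Big( \mathrm{KL}(\pi^{(t)} \,\|\, \bar{\pi}^{(t)}) + \mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^{(t)}) \Big).$$
Plugging the above inequality back into (22), we have
$$\begin{aligned}
(1 - 2\eta d_{\max}\|A\|_\infty)\mathrm{KL}(\pi^\star_\tau \,\|\, \bar{\pi}^{(t+1)}) &\le (1-\eta\tau)\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t)}) - (1 - \eta\tau - 2 d_{\max}\eta\|A\|_\infty)\mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^{(t)}) - \eta\tau\, \mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^\star_\tau) \\
&\quad - (1 - 2\eta d_{\max}\|A\|_\infty)\mathrm{KL}(\pi^{(t+1)} \,\|\, \bar{\pi}^{(t+1)}) + 2\eta d_{\max}\|A\|_\infty \mathrm{KL}(\pi^{(t)} \,\|\, \bar{\pi}^{(t)}) \\
&\le (1-\eta\tau)\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t)}) + 2\eta d_{\max}\|A\|_\infty \mathrm{KL}(\pi^{(t)} \,\|\, \bar{\pi}^{(t)}) \\
&\le \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t)}) + (1 - 2\eta d_{\max}\|A\|_\infty)\mathrm{KL}(\pi^{(t)} \,\|\, \bar{\pi}^{(t)}) = L^{(t)},
\end{aligned}$$
where the second and third inequalities follow from the choice of the learning rate, and the last line follows from the definition of the potential function $L^{(t)}$. Then the result follows from (24) as
$$\frac{1}{2}\mathrm{KL}(\pi^\star_\tau \,\|\, \bar{\pi}^{(t+1)}) \le (1 - 2\eta d_{\max}\|A\|_\infty)\mathrm{KL}(\pi^\star_\tau \,\|\, \bar{\pi}^{(t+1)}) \le L^{(t)} \le (1-\eta\tau)^{t}\, \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(0)}).$$

Bounding the QRE-gap. Finally, we bound the QRE-gap, which can be linked to the KL divergence using the following lemma. The proof can be found in Appendix E.3.

Lemma 3. For any $\pi \in \Delta(S)$ and QRE $\pi^\star_\tau \in \Delta(S)$, it holds that
$$\mathsf{QRE\text{-}Gap}_\tau(\pi) \le \tau\, \mathrm{KL}(\pi \,\|\, \pi^\star_\tau) + \frac{d_{\max}^2 \|A\|_\infty^2}{\tau} \mathrm{KL}(\pi^\star_\tau \,\|\, \pi).$$

Lemma 3 tells us
$$\mathsf{QRE\text{-}Gap}_\tau(\bar{\pi}^{(t)}) \le \tau\, \mathrm{KL}(\bar{\pi}^{(t)} \,\|\, \pi^\star_\tau) + \frac{d_{\max}^2 \|A\|_\infty^2}{\tau} \mathrm{KL}(\pi^\star_\tau \,\|\, \bar{\pi}^{(t)}). \tag{27}$$
With $\mathrm{KL}(\pi^\star_\tau \,\|\, \bar{\pi}^{(t)})$ controlled in the above, we still need to control $\mathrm{KL}(\bar{\pi}^{(t)} \,\|\, \pi^\star_\tau)$. From (22), it follows that
$$\tau\, \mathrm{KL}(\bar{\pi}^{(t)} \,\|\, \pi^\star_\tau) \le \eta^{-1}(1-\eta\tau) L^{(t-1)} \le \eta^{-1}(1-\eta\tau)^{t} L^{(0)} = \eta^{-1}(1-\eta\tau)^{t}\, \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(0)}).$$
Plugging them back into (27), we arrive at
$$\mathsf{QRE\text{-}Gap}_\tau(\bar{\pi}^{(t)}) \le \big( \eta^{-1} + 2\tau^{-1} d_{\max}^2 \|A\|_\infty^2 \big)(1-\eta\tau)^{t-1}\, \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(0)}).$$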
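To make the linear convergence above concrete, here is a minimal simulation sketch of entropy-regularized OMWU without delays, written from the two log-domain relations used in the proof (the midpoint $\bar{\pi}^{(t+1)}$ is computed from $\bar{\pi}^{(t)}$, and $\pi^{(t+1)}$ from $\bar{\pi}^{(t+1)}$). The game instance, the pairwise zero-sum construction, the entrywise reading of $\|A\|_\infty$, and all function names are assumptions made for illustration. Progress is measured by the distance of $\pi$ from its own softmax response $\pi_i \propto \exp(A_i\pi/\tau)$, which is zero exactly at a fixed point of the update.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def make_game(edges, n_actions, rng):
    """Assumed pairwise zero-sum game: A_ji = -A_ij^T on every edge."""
    nbrs = {i: [] for i in range(len(n_actions))}
    for i, j in edges:
        M = [[rng.uniform(-1, 1) for _ in range(n_actions[j])] for _ in range(n_actions[i])]
        Mt = [[-M[a][b] for a in range(n_actions[i])] for b in range(n_actions[j])]
        nbrs[i].append((j, M))
        nbrs[j].append((i, Mt))
    return nbrs

def payoff(i, profile, nbrs):
    """[A_i pi]_a = sum over neighbors j and actions b of A_ij[a][b] * pi_j[b]."""
    return [sum(M[a][b] * profile[j][b] for j, M in nbrs[i] for b in range(len(profile[j])))
            for a in range(len(profile[i]))]

def mwu(pi_i, g, eta, tau):
    """log(new) = (1 - eta*tau) * log(pi_i) + eta * g, up to normalization."""
    return softmax([(1 - eta * tau) * math.log(p) + eta * gv for p, gv in zip(pi_i, g)])

def omwu(nbrs, n_actions, eta, tau, T):
    n = len(n_actions)
    pi = [[1.0 / k] * k for k in n_actions]   # uniform initialization
    pi_bar = [p[:] for p in pi]               # pi_bar^{(0)} = pi^{(0)}
    for _ in range(T):
        pi_bar = [mwu(pi[i], payoff(i, pi_bar, nbrs), eta, tau) for i in range(n)]
        pi = [mwu(pi[i], payoff(i, pi_bar, nbrs), eta, tau) for i in range(n)]
    return pi, pi_bar

def qre_residual(pi, nbrs, tau):
    """Largest entrywise gap between pi_i and softmax(A_i pi / tau)."""
    return max(abs(p - q) for i in range(len(pi))
               for p, q in zip(pi[i], softmax([g / tau for g in payoff(i, pi, nbrs)])))

rng = random.Random(1)
n_actions = [2, 3, 2]
nbrs = make_game([(0, 1), (1, 2), (0, 2)], n_actions, rng)
tau, d_max = 0.5, 2
A_inf = max(abs(v) for lst in nbrs.values() for _, M in lst for row in M for v in row)
eta = min(1 / (2 * tau), 1 / (4 * d_max * A_inf))   # learning-rate range from the proof

for T in (10, 100, 1000):
    pi, pi_bar = omwu(nbrs, n_actions, eta, tau, T)
    print(T, qre_residual(pi, nbrs, tau))
```

The residual is only a proxy for the KL divergence tracked in the proof, so the printed values illustrate the geometric decay qualitatively rather than reproduce the constant in the bound.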

B.2 PROOF OF THEOREM 2

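We begin by bounding the KL divergence $\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t)})$ and then move on to bound the QRE-gap by linking it to the KL divergence.

Bounding the term $\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t)})$. We start with the following equation
$$\begin{aligned}
(1-\eta\tau)\mathrm{KL}(\pi^\star_{i,\tau} \,\|\, \pi^{(t)}_i) &= (1-\eta\tau)\mathrm{KL}(\bar{\pi}^{(t+1)}_i \,\|\, \pi^{(t)}_i) + \eta\tau\, \mathrm{KL}(\bar{\pi}^{(t+1)}_i \,\|\, \pi^\star_{i,\tau}) + \mathrm{KL}(\pi^{(t+1)}_i \,\|\, \bar{\pi}^{(t+1)}_i) + \mathrm{KL}(\pi^\star_{i,\tau} \,\|\, \pi^{(t+1)}_i) \\
&\quad - \big\langle \log \bar{\pi}^{(t+1)}_i - \log \pi^{(t+1)}_i,\ \bar{\pi}^{(t+1)}_i - \pi^{(t+1)}_i \big\rangle - \eta (\bar{\pi}^{(t+1)}_i - \pi^\star_{i,\tau})^\top A_i \big(\bar{\pi}^{(\kappa^{(t+1)}_i)} - \pi^\star_\tau\big),
\end{aligned} \tag{28}$$
whose proof follows a similar deduction as (19). Our first target is to bound the last two terms on the RHS of (28) with $\eta\tau\, \mathrm{KL}(\bar{\pi}^{(t+1)}_i \,\|\, \pi^\star_{i,\tau}) + \mathrm{KL}(\pi^{(t+1)}_i \,\|\, \bar{\pi}^{(t+1)}_i) + (1-\eta\tau)\mathrm{KL}(\bar{\pi}^{(t+1)}_i \,\|\, \pi^{(t)}_i)$. Let us introduce the potential function of iterates
$$\Psi^{(l)}_i := \mathrm{KL}(\bar{\pi}^{(l+1)}_i \,\|\, \pi^{(l)}_i) + \mathrm{KL}(\pi^{(l)}_i \,\|\, \bar{\pi}^{(l)}_i), \qquad \Psi^{(l)} = \sum_{i\in V} \Psi^{(l)}_i = \mathrm{KL}(\bar{\pi}^{(l+1)} \,\|\, \pi^{(l)}) + \mathrm{KL}(\pi^{(l)} \,\|\, \bar{\pi}^{(l)}),$$
which will be used repeatedly in the rest of this proof. For notational simplicity, let $\Psi^{(l)}_i = 0$ when $l < 0$.

Step 1: bounding $\big\langle \log \bar{\pi}^{(t+1)}_i - \log \pi^{(t+1)}_i,\ \bar{\pi}^{(t+1)}_i - \pi^{(t+1)}_i \big\rangle$. Following a similar argument as (21), we get
$$\begin{aligned}
\big\langle \log \bar{\pi}^{(t+1)}_i - \log \pi^{(t+1)}_i,\ \bar{\pi}^{(t+1)}_i - \pi^{(t+1)}_i \big\rangle &= \eta \sum_{j\in\mathcal{N}_i} (\bar{\pi}^{(t+1)}_i - \pi^{(t+1)}_i)^\top A_{ij} \big(\bar{\pi}^{(\kappa^{(t)}_i)}_j - \bar{\pi}^{(\kappa^{(t+1)}_i)}_j\big) \\
&\le \eta d_{\max}\|A\|_\infty \mathrm{KL}(\pi^{(t+1)}_i \,\|\, \bar{\pi}^{(t+1)}_i) + \frac{\eta \|A\|_\infty}{2} \sum_{j\in\mathcal{N}_i} \big\|\bar{\pi}^{(\kappa^{(t+1)}_i)}_j - \bar{\pi}^{(\kappa^{(t)}_i)}_j\big\|_1^2.
\end{aligned} \tag{29}$$
To control the term $\big\|\bar{\pi}^{(\kappa^{(t+1)}_i)}_j - \bar{\pi}^{(\kappa^{(t)}_i)}_j\big\|_1^2$, when $t = 0$, we have
$$\big\|\bar{\pi}^{(\kappa^{(t+1)}_i)}_j - \bar{\pi}^{(\kappa^{(t)}_i)}_j\big\|_1^2 = \big\|\bar{\pi}^{(\kappa^{(t+1)}_i)}_j - \bar{\pi}^{(0)}_j\big\|_1^2 \le \big\|\bar{\pi}^{(1)}_j - \bar{\pi}^{(0)}_j\big\|_1^2 \le 2\Psi^{(0)}_j \tag{30}$$
by Pinsker's inequality. For $t \ge 1$, consider the decomposition
$$\bar{\pi}^{(t)}_j - \bar{\pi}^{(t-k)}_j = \sum_{l=t-k}^{t-1} \big(\bar{\pi}^{(l+1)}_j - \bar{\pi}^{(l)}_j\big), \qquad \forall\, 1 \le k \le t;$$
it then follows that
$$\big\|\bar{\pi}^{(t)}_j - \bar{\pi}^{(t-k)}_j\big\|_1^2 \le k \sum_{l=t-k}^{t-1} \big\|\bar{\pi}^{(l+1)}_j - \bar{\pi}^{(l)}_j\big\|_1^2 \le 2k \sum_{l=t-k}^{t-1} \Big( \big\|\bar{\pi}^{(l+1)}_j - \pi^{(l)}_j\big\|_1^2 + \big\|\pi^{(l)}_j - \bar{\pi}^{(l)}_j\big\|_1^2 \Big) \le 4k \sum_{l=t-k}^{t-1} \Psi^{(l)}_j, \tag{31}$$
where the last step applies Pinsker's inequality. Depending on whether $\gamma^{(t+1)}_i > 0$, we proceed to bound the term $\big\|\bar{\pi}^{(\kappa^{(t+1)}_i)}_j - \bar{\pi}^{(\kappa^{(t)}_i)}_j\big\|_1^2$ in (29) by considering the following two cases based on (31).

• $\gamma^{(t+1)}_i = 0$. Then
$$\big\|\bar{\pi}^{(\kappa^{(t+1)}_i)}_j - \bar{\pi}^{(\kappa^{(t)}_i)}_j\big\|_1^2 \le 2\big\|\bar{\pi}^{(t+1)}_j - \bar{\pi}^{(t)}_j\big\|_1^2 + 2\big\|\bar{\pi}^{(t)}_j - \bar{\pi}^{(\kappa^{(t)}_i)}_j\big\|_1^2 \le 8\Psi^{(t)}_j + 8\gamma^{(t)}_i \sum_{l=t-\gamma^{(t)}_i}^{t-1} \Psi^{(l)}_j,$$
where the last step uses (31) and
$$\big\|\bar{\pi}^{(t+1)}_j - \bar{\pi}^{(t)}_j\big\|_1^2 \le 2\Big( \big\|\bar{\pi}^{(t+1)}_j - \pi^{(t)}_j\big\|_1^2 + \big\|\pi^{(t)}_j - \bar{\pi}^{(t)}_j\big\|_1^2 \Big) \le 4\Psi^{(t)}_j,$$
again via Pinsker's inequality.

• $\gamma^{(t+1)}_i > 0$. Then it follows similarly that
$$\big\|\bar{\pi}^{(\kappa^{(t+1)}_i)}_j - \bar{\pi}^{(\kappa^{(t)}_i)}_j\big\|_1^2 \le \Big\|\sum_{l=t+1-\gamma^{(t+1)}_i}^{t-1} \big(\bar{\pi}^{(l+1)}_j - \bar{\pi}^{(l)}_j\big)\Big\|_1^2 + \Big\|\sum_{l=t-\gamma^{(t)}_i}^{t-1} \big(\bar{\pi}^{(l+1)}_j - \bar{\pi}^{(l)}_j\big)\Big\|_1^2 \le 4\gamma^{(t+1)}_i \sum_{l=t-\gamma^{(t+1)}_i}^{t-1} \Psi^{(l)}_j + 4\gamma^{(t)}_i \sum_{l=t-\gamma^{(t)}_i}^{t-1} \Psi^{(l)}_j.$$

Combining the above two bounds together, we get
$$\big\|\bar{\pi}^{(\kappa^{(t+1)}_i)}_j - \bar{\pi}^{(\kappa^{(t)}_i)}_j\big\|_1^2 \le 8\Psi^{(t)}_j + 8\gamma^{(t)}_i \sum_{l=t-\gamma^{(t)}_i}^{t-1} \Psi^{(l)}_j + 4\gamma^{(t+1)}_i \sum_{l=t-\gamma^{(t+1)}_i}^{t-1} \Psi^{(l)}_j \tag{32}$$
when $t > 0$. In view of (30) when $t = 0$, the above bound (32) holds for all $t \ge 0$. Plugging the above inequality into (29) yields
$$\begin{aligned}
\big\langle \log \bar{\pi}^{(t+1)}_i - \log \pi^{(t+1)}_i,\ \bar{\pi}^{(t+1)}_i - \pi^{(t+1)}_i \big\rangle &\le 2\eta\|A\|_\infty \sum_{j\in\mathcal{N}_i} \sum_{l=t-\gamma^{(t+1)}_i}^{t-1} \gamma^{(t+1)}_i \Psi^{(l)}_j + 4\eta\|A\|_\infty \sum_{j\in\mathcal{N}_i} \sum_{l=t-\gamma^{(t)}_i}^{t-1} \gamma^{(t)}_i \Psi^{(l)}_j \\
&\quad + 4\eta\|A\|_\infty \sum_{j\in\mathcal{N}_i} \Psi^{(t)}_j + \eta d_{\max}\|A\|_\infty \mathrm{KL}(\pi^{(t+1)}_i \,\|\, \bar{\pi}^{(t+1)}_i).
\end{aligned} \tag{33}$$

Step 2: bounding $(\bar{\pi}^{(t+1)}_i - \pi^\star_{i,\tau})^\top A_i (\bar{\pi}^{(\kappa^{(t+1)}_i)} - \pi^\star_\tau)$. Let us begin with the following decomposition
$$(\bar{\pi}^{(t+1)}_i - \pi^\star_{i,\tau})^\top A_i \big(\bar{\pi}^{(\kappa^{(t+1)}_i)} - \pi^\star_\tau\big) = (\bar{\pi}^{(t+1)}_i - \pi^\star_{i,\tau})^\top A_i (\bar{\pi}^{(t+1)} - \pi^\star_\tau) + (\bar{\pi}^{(t+1)}_i - \pi^\star_{i,\tau})^\top A_i \big(\bar{\pi}^{(\kappa^{(t+1)}_i)} - \bar{\pi}^{(t+1)}\big), \tag{34}$$
where the second term on the RHS of (34) can be bounded by
$$\begin{aligned}
(\bar{\pi}^{(t+1)}_i - \pi^\star_{i,\tau})^\top A_i \big(\bar{\pi}^{(\kappa^{(t+1)}_i)} - \bar{\pi}^{(t+1)}\big) &= \sum_{j\in\mathcal{N}_i} (\bar{\pi}^{(t+1)}_i - \pi^\star_{i,\tau})^\top A_{ij} \big(\bar{\pi}^{(\kappa^{(t+1)}_i)}_j - \bar{\pi}^{(t+1)}_j\big) \\
&\le \|A\|_\infty \sum_{j\in\mathcal{N}_i} \big\|\bar{\pi}^{(t+1)}_i - \pi^\star_{i,\tau}\big\|_1 \big\|\bar{\pi}^{(\kappa^{(t+1)}_i)}_j - \bar{\pi}^{(t+1)}_j\big\|_1 \\
&\le \frac{1}{2}\|A\|_\infty \sum_{j\in\mathcal{N}_i} \Big( \frac{\tau}{d_{\max}\|A\|_\infty} \big\|\bar{\pi}^{(t+1)}_i - \pi^\star_{i,\tau}\big\|_1^2 + \frac{d_{\max}\|A\|_\infty}{\tau} \big\|\bar{\pi}^{(\kappa^{(t+1)}_i)}_j - \bar{\pi}^{(t+1)}_j\big\|_1^2 \Big) \\
&\le \tau\, \mathrm{KL}(\bar{\pi}^{(t+1)}_i \,\|\, \pi^\star_{i,\tau}) + \frac{d_{\max}\|A\|_\infty^2}{2\tau} \sum_{j\in\mathcal{N}_i} \big\|\bar{\pi}^{(\kappa^{(t+1)}_i)}_j - \bar{\pi}^{(t+1)}_j\big\|_1^2.
\end{aligned}$$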
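Following a similar deduction as (32) for the second term, we attain
$$(\bar{\pi}^{(t+1)}_i - \pi^\star_{i,\tau})^\top A_i \big(\bar{\pi}^{(\kappa^{(t+1)}_i)} - \bar{\pi}^{(t+1)}\big) \le \tau\, \mathrm{KL}(\bar{\pi}^{(t+1)}_i \,\|\, \pi^\star_{i,\tau}) + \frac{4 d_{\max}\|A\|_\infty^2}{\tau} \sum_{j\in\mathcal{N}_i} \Big( \Psi^{(t)}_j + \sum_{l=t-\gamma^{(t+1)}_i}^{t-1} \gamma^{(t+1)}_i \Psi^{(l)}_j \Big).$$
Plugging the above inequality back into (34) results in
$$\begin{aligned}
(\bar{\pi}^{(t+1)}_i - \pi^\star_{i,\tau})^\top A_i \big(\bar{\pi}^{(\kappa^{(t+1)}_i)} - \pi^\star_\tau\big) &\le (\bar{\pi}^{(t+1)}_i - \pi^\star_{i,\tau})^\top A_i (\bar{\pi}^{(t+1)} - \pi^\star_\tau) + \tau\, \mathrm{KL}(\bar{\pi}^{(t+1)}_i \,\|\, \pi^\star_{i,\tau}) \\
&\quad + \frac{4 d_{\max}\|A\|_\infty^2}{\tau} \sum_{j\in\mathcal{N}_i} \Big( \Psi^{(t)}_j + \sum_{l=t-\gamma^{(t+1)}_i}^{t-1} \gamma^{(t+1)}_i \Psi^{(l)}_j \Big).
\end{aligned} \tag{35}$$

Step 3: combining the bounds. For simplicity, we introduce the short-hand notation
$$c_\tau = 1 + \frac{d_{\max}\|A\|_\infty}{\tau} \qquad \text{and} \qquad c_A = d_{\max}\|A\|_\infty. \tag{36}$$
Combining (33) and (35) into (28), and summing over $i \in V$ gives
$$\begin{aligned}
(1-\eta\tau)\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t)}) &\ge (1-\eta\tau)\mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^{(t)}) + (1 - 2\eta c_A)\mathrm{KL}(\pi^{(t+1)} \,\|\, \bar{\pi}^{(t+1)}) + \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t+1)}) \\
&\quad - 4\eta\|A\|_\infty \sum_{i\in V}\sum_{j\in\mathcal{N}_i} \Big( \sum_{l=t-\gamma^{(t+1)}_i}^{t-1} c_\tau \gamma^{(t+1)}_i \Psi^{(l)}_j + \sum_{l=t-\gamma^{(t)}_i}^{t-1} \gamma^{(t)}_i \Psi^{(l)}_j + c_\tau \Psi^{(t)}_j \Big) \\
&\ge \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t+1)}) + \big(1 - 4\eta c_A(c_\tau + 1)\big)\Psi^{(t)} \\
&\quad - 4\eta\|A\|_\infty \sum_{i\in V}\sum_{j\in\mathcal{N}_i} \Big( c_\tau \sum_{l=t-\gamma^{(t+1)}_i}^{t-1} \gamma^{(t+1)}_i \Psi^{(l)}_j + \sum_{l=t-\gamma^{(t)}_i}^{t-1} \gamma^{(t)}_i \Psi^{(l)}_j \Big),
\end{aligned} \tag{37}$$
where we make use of the fact $\sum_{i\in V} (\bar{\pi}^{(t+1)}_i - \pi^\star_{i,\tau})^\top A_i (\bar{\pi}^{(t+1)} - \pi^\star_\tau) = 0$ from Lemma 1 in the first inequality, and the second inequality uses the relation
$$\sum_{i\in V}\sum_{j\in\mathcal{N}_i} \Psi^{(t)}_j = \sum_{i\in V} d_i \Psi^{(t)}_i \le d_{\max}\Psi^{(t)}.$$

Step 4: finishing up via averaging the delay. We now evaluate the expectation of $\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t+1)})$. Recall that we use the subscript in $\mathbb{E}_{\gamma^{(t)}}[\cdot]$ to represent the conditional expectation given $\gamma^{(t)} = \{\gamma^{(t)}_i\}_{i\in V}$. We shall first control the conditional expectation of the last term in (37). Observe that $\bar{\pi}^{(l+1)}_j$ and $\pi^{(l)}_j$ are independent of $\gamma^{(t)}_i$ for $j \in \mathcal{N}_i$ and $l \le t-1$. Using the definition of $E(t-l)$, we have
$$\begin{aligned}
\sum_{i\in V}\sum_{j\in\mathcal{N}_i} \mathbb{E}_{\gamma^{(t)}}\Big[ \gamma^{(t)}_i \sum_{l=t-\gamma^{(t)}_i}^{t-1} \Psi^{(l)}_j \Big] &= \sum_{i\in V}\sum_{l=0}^{t-1}\sum_{j\in\mathcal{N}_i} \mathbb{E}\Big[ \gamma^{(t)}_i \mathbb{1}\{t-l \le \gamma^{(t)}_i\} \Big] \Psi^{(l)}_j \\
&\le \sum_{l=0}^{t-1} E(t-l) \sum_{i\in V}\sum_{j\in\mathcal{N}_i} \Psi^{(l)}_j = \sum_{l=0}^{t-1} E(t-l) \sum_{i\in V}\sum_{j\in\mathcal{N}_i} \Psi^{(l)}_i \\
&\le d_{\max} \sum_{l=0}^{t-1} E(t-l) \sum_{i\in V} \Psi^{(l)}_i = d_{\max} \sum_{l=0}^{t-1} E(t-l)\Psi^{(l)},
\end{aligned} \tag{38}$$
where the second line follows from the definition of $E(t-l)$ in Assumption 1.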
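Applying a similar argument to bound $\sum_{i\in V}\sum_{j\in\mathcal{N}_i} \mathbb{E}_{\gamma^{(t+1)}}\big[ \gamma^{(t+1)}_i \sum_{l=t-\gamma^{(t+1)}_i}^{t-1} \Psi^{(l)}_j \big]$, and taking expectation with respect to $\gamma^{(t)}, \gamma^{(t+1)}$ on both sides of (37), we get
$$(1-\eta\tau)\mathbb{E}_{\gamma^{(t)}}\big[\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t)})\big] \ge \mathbb{E}_{\gamma^{(t)},\gamma^{(t+1)}}\Big[ \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t+1)}) + \big(1 - 4\eta c_A(c_\tau + 1)\big)\Psi^{(t)} \Big] - 4\eta c_A(c_\tau + 1) \sum_{l=0}^{t-1} E(t-l)\Psi^{(l)}.$$
Taking expectation on both sides over all the delays yields
$$(1-\eta\tau)\mathbb{E}\big[\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t)})\big] \ge \mathbb{E}\big[\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t+1)})\big] + \mathbb{E}\Big[ \underbrace{\big(1 - 4\eta c_A(c_\tau + 1)\big)\Psi^{(t)} - 4\eta c_A(c_\tau + 1) \sum_{l=0}^{t-1} E(t-l)\Psi^{(l)}}_{=:\,U^{(t)}} \Big]. \tag{39}$$
Telescoping over $t = 0, 1, \ldots, T$, we get
$$(1-\eta\tau)^{T+1}\, \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(0)}) \ge \mathbb{E}\big[\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(T+1)})\big] + \sum_{t=0}^{T} (1-\eta\tau)^{T-t}\, \mathbb{E}\big[U^{(t)}\big], \tag{40}$$
which leads to the desired bound if
$$\sum_{t=0}^{T} (1-\eta\tau)^{T-t}\, \mathbb{E}\big[U^{(t)}\big] \ge 0. \tag{41}$$

Proof of (41). To begin, notice that with the choice of the learning rate
$$0 < \eta \le \min\Big\{ \frac{\tau}{24 d_{\max}^2\|A\|_\infty^2 (L+1)},\ \frac{\zeta - 1}{\zeta\tau} \Big\},$$
it follows that
$$\frac{1}{1-\eta\tau} \le \zeta \tag{42a}$$
and
$$4\eta c_A(c_\tau + 1)(L+1) \le \frac{4\tau}{24 d_{\max}^2\|A\|_\infty^2 (L+1)}\, d_{\max}\|A\|_\infty \Big( 2 + \frac{d_{\max}\|A\|_\infty}{\tau} \Big)(L+1) = \frac{\tau}{6 d_{\max}\|A\|_\infty}\Big( 2 + \frac{d_{\max}\|A\|_\infty}{\tau} \Big) = \frac{\tau}{3 d_{\max}\|A\|_\infty} + \frac{1}{6} \le \frac{1}{2} \tag{42b}$$
as $\tau \le d_{\max}\|A\|_\infty$. Both of these relations will be useful in our follow-up analysis. Now, taking the definition of $U^{(t)}$ (cf. (39)), we have
$$\sum_{t=0}^{T} (1-\eta\tau)^{T-t} U^{(t)} = \sum_{t=0}^{T} (1-\eta\tau)^{T-t} \Big[ \big(1 - 4\eta c_A(c_\tau + 1)\big)\Psi^{(t)} - 4\eta c_A(c_\tau + 1)\sum_{l=0}^{t-1} E(t-l)\Psi^{(l)} \Big],$$
where the second half of the RHS can be further controlled via
$$\begin{aligned}
\sum_{t=0}^{T} (1-\eta\tau)^{T-t} \sum_{l=0}^{t-1} E(t-l)\Psi^{(l)} &= \sum_{t=0}^{T} \Psi^{(t)} \sum_{l=t+1}^{T} (1-\eta\tau)^{T-l} E(l-t) \\
&\le \sum_{t=0}^{T} \Psi^{(t)} \sum_{l'=0}^{T-t} (1-\eta\tau)^{T-(t+l')} E(l') = \sum_{t=0}^{T} (1-\eta\tau)^{T-t}\Psi^{(t)} \sum_{l'=0}^{T-t} (1-\eta\tau)^{-l'} E(l') \\
&\le \sum_{t=0}^{T} (1-\eta\tau)^{T-t}\Psi^{(t)} \sum_{l=0}^{\infty} \zeta^{l} E(l) = \sum_{t=0}^{T} (1-\eta\tau)^{T-t} L \Psi^{(t)},
\end{aligned}$$
where the first line follows by changing the order of summation, the second line follows from the change of variable $l' = l - t$, and the last line follows from (42a) and the definition of $L$ in Assumption 1.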
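Plugging the above relation back leads to
$$\sum_{t=0}^{T} (1-\eta\tau)^{T-t} U^{(t)} \ge \sum_{t=0}^{T} (1-\eta\tau)^{T-t} \Big[ \big(1 - 4\eta c_A(c_\tau + 1)\big) - 4\eta c_A(c_\tau + 1)L \Big] \Psi^{(t)} \ge \sum_{t=0}^{T} \frac{1}{2}(1-\eta\tau)^{T-t}\Psi^{(t)} \ge 0, \tag{43}$$
where the second inequality results from (42b).

Bounding the term $\mathrm{KL}(\pi^\star_\tau \,\|\, \bar{\pi}^{(t+1)})$. With a similar deduction as (19), we get
$$(1-\eta\tau)\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t)}) + \eta \sum_{i\in V} (\bar{\pi}^{(t+1)}_i - \pi^\star_{i,\tau})^\top A_i \big(\bar{\pi}^{(\kappa^{(t)}_i)} - \pi^\star_\tau\big) = \mathrm{KL}(\pi^\star_\tau \,\|\, \bar{\pi}^{(t+1)}) + (1-\eta\tau)\mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^{(t)}) + \eta\tau\, \mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^\star_\tau). \tag{44}$$
Following a similar argument as (35), we have
$$(\bar{\pi}^{(t+1)}_i - \pi^\star_{i,\tau})^\top A_i \big(\bar{\pi}^{(\kappa^{(t)}_i)} - \pi^\star_\tau\big) \le (\bar{\pi}^{(t+1)}_i - \pi^\star_{i,\tau})^\top A_i (\bar{\pi}^{(t+1)} - \pi^\star_\tau) + \frac{\tau}{2}\mathrm{KL}(\bar{\pi}^{(t+1)}_i \,\|\, \pi^\star_{i,\tau}) + \frac{8 d_{\max}\|A\|_\infty^2}{\tau} \sum_{j\in\mathcal{N}_i} \Big( \Psi^{(t)}_j + \sum_{l=t-\gamma^{(t)}_i}^{t-1} \gamma^{(t)}_i \Psi^{(l)}_j \Big).$$
Summing over $i \in V$ and plugging into (44) yields
$$\begin{aligned}
(1-\eta\tau)\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t)}) + \frac{8\eta d_{\max}\|A\|_\infty^2}{\tau} \sum_{(i,j)\in E} \Big( \Psi^{(t)}_j + \sum_{l=t-\gamma^{(t)}_i}^{t-1} \gamma^{(t)}_i \Psi^{(l)}_j \Big) &\ge \mathrm{KL}(\pi^\star_\tau \,\|\, \bar{\pi}^{(t+1)}) + (1-\eta\tau)\mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^{(t)}) + \frac{\eta\tau}{2}\mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^\star_\tau) \\
&\ge \mathrm{KL}(\pi^\star_\tau \,\|\, \bar{\pi}^{(t+1)}) + \frac{\eta\tau}{2}\mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^\star_\tau).
\end{aligned}$$
Taking expectation on both sides over all delays and using (38) leads to
$$(1-\eta\tau)\mathbb{E}\big[\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t)})\big] + \frac{8\eta d_{\max}^2\|A\|_\infty^2}{\tau}\, \mathbb{E}\Big[ \Psi^{(t)} + \sum_{l=0}^{t-1} E(t-l)\Psi^{(l)} \Big] \ge \mathbb{E}\big[\mathrm{KL}(\pi^\star_\tau \,\|\, \bar{\pi}^{(t+1)})\big] + \frac{\eta\tau}{2}\mathbb{E}\big[\mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^\star_\tau)\big]. \tag{45}$$
Notice that with the choice of the learning rate $0 < \eta \le \min\big\{ \frac{\tau}{24 d_{\max}^2\|A\|_\infty^2 (L+1)}, \frac{\zeta-1}{\zeta\tau} \big\}$, we have
$$\frac{8(L+1)\eta d_{\max}^2\|A\|_\infty^2}{\tau} \le \frac{1}{2} \qquad \text{and} \qquad (1-\eta\tau)^{t+1}\, \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(0)}) \ge \frac{1}{2}\sum_{l=0}^{t} (1-\eta\tau)^{t-l}\, \mathbb{E}\big[\Psi^{(l)}\big]$$
by combining (43) and (40). It follows that
$$\mathbb{E}\big[\Psi^{(t)}\big] \le 2(1-\eta\tau)^{t+1}\, \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(0)})$$
and
$$\mathbb{E}\Big[ \sum_{l=0}^{t-1} E(t-l)\Psi^{(l)} \Big] \overset{\text{(i)}}{\le} \mathbb{E}\Big[ \sum_{l=0}^{t-1} (1-\eta\tau)^{t-l}\Psi^{(l)} \cdot E(t-l)\zeta^{t-l} \Big] \le \mathbb{E}\Big[ \sum_{l=0}^{t-1} (1-\eta\tau)^{t-l}\Psi^{(l)} \Big] \sum_{l=0}^{t-1} E(t-l)\zeta^{t-l} \overset{\text{(ii)}}{\le} 2L(1-\eta\tau)^{t+1}\, \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(0)}),$$
where (i) is by the bound $(1-\eta\tau)^{-1} \le \zeta$ and (ii) uses the definition of $L$ in Assumption 1. Plugging the above inequalities into (45) leads to
$$(1-\eta\tau)\mathbb{E}\big[\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t)})\big] + (1-\eta\tau)^{t+1}\, \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(0)}) \ge \mathbb{E}\big[\mathrm{KL}(\pi^\star_\tau \,\|\, \bar{\pi}^{(t+1)})\big] + \frac{\eta\tau}{2}\mathbb{E}\big[\mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^\star_\tau)\big].$$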
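Then from (40) we have
$$\begin{aligned}
\mathbb{E}\big[\mathrm{KL}(\pi^\star_\tau \,\|\, \bar{\pi}^{(t+1)})\big] &\le \mathbb{E}\big[\mathrm{KL}(\pi^\star_\tau \,\|\, \bar{\pi}^{(t+1)})\big] + \frac{\eta\tau}{2}\mathbb{E}\big[\mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^\star_\tau)\big] \\
&\le (1-\eta\tau)^{t+1}\, \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(0)}) + (1-\eta\tau)^{t+1}\, \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(0)}) = 2(1-\eta\tau)^{t+1}\, \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(0)}).
\end{aligned} \tag{46}$$

Bounding the QRE-gap. Combining (27) and (46), we have
$$\begin{aligned}
\mathbb{E}\big[\mathsf{QRE\text{-}Gap}_\tau(\bar{\pi}^{(t+1)})\big] &\le \tau\, \mathbb{E}\big[\mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^\star_\tau)\big] + \frac{d_{\max}^2\|A\|_\infty^2}{\tau}\mathbb{E}\big[\mathrm{KL}(\pi^\star_\tau \,\|\, \bar{\pi}^{(t+1)})\big] \\
&\le \frac{2}{\eta}\Big( \frac{\eta\tau}{2}\mathbb{E}\big[\mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^\star_\tau)\big] + \mathbb{E}\big[\mathrm{KL}(\pi^\star_\tau \,\|\, \bar{\pi}^{(t+1)})\big] \Big) \le \frac{4(1-\eta\tau)^{t+1}}{\eta}\, \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(0)}),
\end{aligned}$$
where the second line uses the learning rate bound $\frac{2}{\eta} > \frac{24 d_{\max}^2\|A\|_\infty^2 (L+1)}{\tau} > \frac{d_{\max}^2\|A\|_\infty^2}{\tau}$.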
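The delayed dynamics analyzed above can be simulated with a small change to the synchronous loop: player $i$ forms $\bar{\pi}^{(t+1)}_i$ from the stale profile $\bar{\pi}^{(\kappa^{(t)}_i)}$ and $\pi^{(t+1)}_i$ from $\bar{\pi}^{(\kappa^{(t+1)}_i)}$, with $\kappa^{(t)}_i = \max\{0, t - \gamma^{(t)}_i\}$. The sketch below is illustrative only and makes several assumptions not stated in this section: delays are drawn independently and uniformly from $\{0, 1, 2\}$, the game is a hypothetical pairwise zero-sum instance, and the step size is picked conservatively by hand rather than computed from the constants $L$ and $\zeta$ of Assumption 1.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def make_game(edges, n_actions, rng):
    """Assumed pairwise zero-sum game: A_ji = -A_ij^T on every edge."""
    nbrs = {i: [] for i in range(len(n_actions))}
    for i, j in edges:
        M = [[rng.uniform(-1, 1) for _ in range(n_actions[j])] for _ in range(n_actions[i])]
        Mt = [[-M[a][b] for a in range(n_actions[i])] for b in range(n_actions[j])]
        nbrs[i].append((j, M))
        nbrs[j].append((i, Mt))
    return nbrs

def payoff(i, profile, nbrs):
    return [sum(M[a][b] * profile[j][b] for j, M in nbrs[i] for b in range(len(profile[j])))
            for a in range(len(profile[i]))]

def mwu(pi_i, g, eta, tau):
    return softmax([(1 - eta * tau) * math.log(p) + eta * gv for p, gv in zip(pi_i, g)])

def delayed_omwu(nbrs, n_actions, eta, tau, T, max_delay, rng):
    n = len(n_actions)
    pi = [[1.0 / k] * k for k in n_actions]
    hist = [[p[:] for p in pi]]                              # hist[s] stores pi_bar^{(s)}
    gamma = [rng.randint(0, max_delay) for _ in range(n)]    # gamma_i^{(t)}
    for t in range(T):
        # Midpoint update uses the stale profile pi_bar^{(kappa_i^{(t)})}.
        bar = [mwu(pi[i], payoff(i, hist[max(0, t - gamma[i])], nbrs), eta, tau)
               for i in range(n)]
        hist.append(bar)
        gamma = [rng.randint(0, max_delay) for _ in range(n)]  # gamma_i^{(t+1)}
        # Main update uses pi_bar^{(kappa_i^{(t+1)})}, which may be the fresh midpoint.
        pi = [mwu(pi[i], payoff(i, hist[max(0, t + 1 - gamma[i])], nbrs), eta, tau)
              for i in range(n)]
    return pi, hist[-1]

def qre_residual(pi, nbrs, tau):
    return max(abs(p - q) for i in range(len(pi))
               for p, q in zip(pi[i], softmax([g / tau for g in payoff(i, pi, nbrs)])))

n_actions = [2, 3, 2]
nbrs = make_game([(0, 1), (1, 2), (0, 2)], n_actions, random.Random(1))
tau, eta = 0.5, 0.01   # hand-picked small step size (an assumption, not the theorem's bound)
pi, pi_bar = delayed_omwu(nbrs, n_actions, eta, tau, 10000, 2, random.Random(0))
res = qre_residual(pi, nbrs, tau)
print(res)
```

Compared with the synchronous setting, the step size here is more than ten times smaller, which mirrors the message of the analysis: delays shrink the admissible learning-rate range and thus slow down the linear rate $(1-\eta\tau)$.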

C REGRET ANALYSIS OF SINGLE-TIMESCALE OMWU

For completeness, we also provide the regret analysis of single-timescale OMWU in both synchronous and asynchronous settings, which might be of independent interest. To begin, for $\tau \ge 0$, the regret of each player $i \in V$ is defined as
$$\mathsf{Regret}_{i,\tau}(T) = \max_{\pi_i \in \Delta(S_i)} \sum_{t=1}^{T} u_{i,\tau}(\pi_i, \bar{\pi}^{(t)}_{-i}) - \sum_{t=1}^{T} u_{i,\tau}(\bar{\pi}^{(t)}),$$
which measures the performance gap compared to the optimal fixed strategy in hindsight for player $i$, when the rest of the players follow the strategies derived from Algorithm 1.

Synchronous setting. We begin with the following no-regret guarantee of single-timescale OMWU in the synchronous setting.

Theorem 5 (No-regret without delays). Suppose all players $i \in V$ follow single-timescale OMWU in Algorithm 1 with the initialization $\pi^{(0)}_i = \frac{1}{|S_i|}\mathbf{1}$ and the learning rate obeys $0 < \eta \le \frac{1}{4 d_{\max}\|A\|_\infty + 4\tau}$. Then, for $T \ge 1$, it holds that
$$\mathsf{Regret}_{i,\tau}(T) \le \frac{1}{\eta}\log|S_i| + 16\eta \deg_i \|A\|_\infty^2 \sum_{k\in V} \log|S_k|.$$

By optimizing the learning rate $\eta$, Theorem 5 suggests that the regret is bounded by $\max_{i\in V} \mathsf{Regret}_{i,\tau}(T) \lesssim O\big(\|A\|_\infty \sqrt{n d_{\max}}\big)$ up to logarithmic factors. Compared with the OMD method for multi-agent games in Anagnostides et al. (2022), which only provided the regret bound for $\tau = 0$, our bound is more general by allowing entropy regularization. Moreover, our bound is tighter by a factor of $\sqrt{n/d_{\max}}$ by exploiting the graph connectivity pattern, which is significant for large sparse graphs.

Asynchronous setting. We next move to the asynchronous case, and show that single-timescale OMWU continues to enjoy no-regret learning as long as the delays have finite second-order moments.

Assumption 4 (Random delays). Recall the definition of $E(\ell)$ in (12). There exists some constant $\sigma > 0$ such that
$$\mathbb{E}\big[(\gamma^{(t)}_i)^2\big] \le \sum_{\ell=0}^{\infty} E(\ell) \le \sigma^2$$
for all $t \ge 0$ and $i \in V$.

Clearly, Assumption 4 is weaker than Assumption 1, since it only requires the second-order moments to be finite, instead of an exponential decay of the tail of $\gamma^{(t)}_i$. We have the following theorem.

Theorem 6 (No-regret with random delays). Under Assumption 4, suppose all players $i \in V$ follow single-timescale OMWU in Algorithm 1 with the initialization $\pi^{(0)}_i = \frac{1}{|S_i|}\mathbf{1}$, the regularization parameter $\tau < \min\{1, \|A\|_\infty\}$, and the learning rate obeys $0 < \eta \le \frac{\tau}{24 d_{\max}^2\|A\|_\infty^2(\sigma^2+1)}$. Then, for $T \ge 1$, it holds that
$$\mathbb{E}\big[\mathsf{Regret}_{i,\tau}(T)\big] \le \frac{1}{\eta}\log|S_i| + 8 d_{\max}\|A\|_\infty \Big( \frac{d_{\max}\|A\|_\infty}{\tau} + 2 \Big)(\sigma^2+1) \sum_{i\in V} \log|S_i|.$$

Theorem 6 guarantees that the iterates $\{\bar{\pi}^{(t)}\}_{t\ge 1}$ enjoy a regret bound on the order of $\max_{i\in V} \mathbb{E}\big[\mathsf{Regret}_{i,\tau}(T)\big] \lesssim O\big(\sigma^2 n d_{\max}\|A\|_\infty^2/\tau\big)$ by optimizing the learning rate $\eta$.
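The regret defined above can be evaluated in closed form along a simulated trajectory, since for $\tau > 0$ the best fixed response satisfies $\max_{\pi_i}\{\langle \pi_i, G\rangle + T\tau H(\pi_i)\} = T\tau \log\sum_a \exp(G_a/(T\tau))$ with $G = \sum_t A_i\bar{\pi}^{(t)}$. The sketch below computes $\mathsf{Regret}_{i,\tau}(T)$ for synchronous OMWU on a hypothetical pairwise zero-sum game and prints the right-hand side of Theorem 5 next to it for comparison only. It assumes $u_{i,\tau}(\pi) = u_i(\pi) + \tau H(\pi_i)$ and an entrywise $\|A\|_\infty$; all helper names are illustrative.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def make_game(edges, n_actions, rng):
    """Assumed pairwise zero-sum game: A_ji = -A_ij^T on every edge."""
    nbrs = {i: [] for i in range(len(n_actions))}
    for i, j in edges:
        M = [[rng.uniform(-1, 1) for _ in range(n_actions[j])] for _ in range(n_actions[i])]
        Mt = [[-M[a][b] for a in range(n_actions[i])] for b in range(n_actions[j])]
        nbrs[i].append((j, M))
        nbrs[j].append((i, Mt))
    return nbrs

def payoff(i, profile, nbrs):
    return [sum(M[a][b] * profile[j][b] for j, M in nbrs[i] for b in range(len(profile[j])))
            for a in range(len(profile[i]))]

def mwu(pi_i, g, eta, tau):
    return softmax([(1 - eta * tau) * math.log(p) + eta * gv for p, gv in zip(pi_i, g)])

def omwu_trajectory(nbrs, n_actions, eta, tau, T):
    """Return the played profiles pi_bar^{(1)}, ..., pi_bar^{(T)}."""
    n = len(n_actions)
    pi = [[1.0 / k] * k for k in n_actions]
    pi_bar = [p[:] for p in pi]
    traj = []
    for _ in range(T):
        pi_bar = [mwu(pi[i], payoff(i, pi_bar, nbrs), eta, tau) for i in range(n)]
        pi = [mwu(pi[i], payoff(i, pi_bar, nbrs), eta, tau) for i in range(n)]
        traj.append(pi_bar)
    return traj

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def regret(i, traj, nbrs, tau):
    T = len(traj)
    gs = [payoff(i, prof, nbrs) for prof in traj]
    G = [sum(g[a] for g in gs) for a in range(len(gs[0]))]
    c, m = T * tau, max(G)
    # Best fixed response value: T*tau*logsumexp(G / (T*tau)), computed stably.
    best = m + c * math.log(sum(math.exp((v - m) / c) for v in G))
    played = sum(sum(p * gv for p, gv in zip(prof[i], g)) + tau * entropy(prof[i])
                 for prof, g in zip(traj, gs))
    return best - played

n_actions = [2, 3, 2]
nbrs = make_game([(0, 1), (1, 2), (0, 2)], n_actions, random.Random(2))
tau, d_max = 0.5, 2
A_inf = max(abs(v) for lst in nbrs.values() for _, M in lst for row in M for v in row)
eta = 1 / (4 * d_max * A_inf + 4 * tau)          # largest step size allowed by Theorem 5
traj = omwu_trajectory(nbrs, n_actions, eta, tau, 500)
regrets = [regret(i, traj, nbrs, tau) for i in range(3)]
log_s = [math.log(k) for k in n_actions]
bounds = [log_s[i] / eta + 16 * eta * len(nbrs[i]) * A_inf ** 2 * sum(log_s) for i in range(3)]
print(regrets)
print(bounds)   # right-hand side of Theorem 5, shown for comparison only
```

A single player's regret may be negative, but the sum over players is nonnegative for any sequence of profiles in a zero-sum polymatrix game (this is Lemma 4 in the next subsection), which gives a simple consistency check for an implementation.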

C.1 PROOF OF THEOREM 5

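Recall the expression of the regret $\mathsf{Regret}_{i,\tau}(T) = \max_{\pi_i\in\Delta(S_i)} \mathsf{Regret}_{i,\tau}(\pi_i, T)$, where
$$\mathsf{Regret}_{i,\tau}(\pi_i, T) := \sum_{t=1}^{T} u_{i,\tau}(\pi_i, \bar{\pi}^{(t)}_{-i}) - \sum_{t=1}^{T} u_{i,\tau}(\bar{\pi}^{(t)}) = \sum_{t=0}^{T-1} \Big[ \big\langle \pi_i - \bar{\pi}^{(t+1)}_i,\ A_i\bar{\pi}^{(t+1)} \big\rangle + \tau H(\pi_i) - \tau H(\bar{\pi}^{(t+1)}_i) \Big].$$
Therefore, it is sufficient to bound $\mathsf{Regret}_{i,\tau}(\pi_i, T)$ for any $\pi_i \in \Delta(S_i)$. To begin, we record the following useful lemma whose proof can be found in Appendix E.4.

Lemma 4. For any $\tau \ge 0$ and $T \ge 0$, we have
$$\sum_{i\in V} \mathsf{Regret}_{i,\tau}(T+1) = \sum_{i\in V} \max_{\pi_i\in\Delta(S_i)} \sum_{t=0}^{T} \Big[ \big\langle \pi_i - \bar{\pi}^{(t+1)}_i,\ A_i\bar{\pi}^{(t+1)} \big\rangle + \tau H(\pi_i) - \tau H(\bar{\pi}^{(t+1)}_i) \Big] \ge 0.$$

Let us now proceed with the following regret decomposition
$$\begin{aligned}
&\big\langle \pi_i - \bar{\pi}^{(t+1)}_i,\ A_i\bar{\pi}^{(t+1)} \big\rangle + \tau H(\pi_i) - \tau H(\bar{\pi}^{(t+1)}_i) \\
&= \Big[ \big\langle \pi_i - \pi^{(t+1)}_i,\ A_i\bar{\pi}^{(t+1)} \big\rangle + \tau H(\pi_i) \Big] + \Big[ \big\langle \pi^{(t+1)}_i - \bar{\pi}^{(t+1)}_i,\ A_i\bar{\pi}^{(t)} \big\rangle - \tau H(\bar{\pi}^{(t+1)}_i) \Big] - \big\langle \bar{\pi}^{(t+1)}_i - \pi^{(t+1)}_i,\ A_i\bar{\pi}^{(t+1)} - A_i\bar{\pi}^{(t)} \big\rangle.
\end{aligned} \tag{50}$$
We proceed to bound each term on the RHS of (50).

• To begin, note that $\log \pi^{(t+1)}_i \overset{1}{=} (1-\eta\tau)\log \pi^{(t)}_i + \eta A_i\bar{\pi}^{(t+1)}$. The first term in (50) can then be written as
$$\begin{aligned}
\big\langle \pi_i - \pi^{(t+1)}_i,\ A_i\bar{\pi}^{(t+1)} \big\rangle + \tau H(\pi_i) &= \frac{1}{\eta}\big\langle \log \pi^{(t+1)}_i - \log \pi^{(t)}_i,\ \pi_i - \pi^{(t+1)}_i \big\rangle + \tau \big\langle \log \pi^{(t)}_i,\ \pi_i - \pi^{(t+1)}_i \big\rangle - \tau \langle \log \pi_i, \pi_i \rangle \\
&= \Big(\frac{1}{\eta} - \tau\Big)\mathrm{KL}(\pi_i \,\|\, \pi^{(t)}_i) - \frac{1}{\eta}\Big[ \mathrm{KL}(\pi_i \,\|\, \pi^{(t+1)}_i) + \mathrm{KL}(\pi^{(t+1)}_i \,\|\, \pi^{(t)}_i) \Big] - \tau \big\langle \log \pi^{(t)}_i,\ \pi^{(t+1)}_i \big\rangle,
\end{aligned} \tag{51}$$
where the second step is derived from the definition of KL divergence.

• Similarly, the second term in (50) has the form
$$\big\langle \pi^{(t+1)}_i - \bar{\pi}^{(t+1)}_i,\ A_i\bar{\pi}^{(t)} \big\rangle - \tau H(\bar{\pi}^{(t+1)}_i) = \frac{1}{\eta}\Big[ \mathrm{KL}(\pi^{(t+1)}_i \,\|\, \pi^{(t)}_i) - \mathrm{KL}(\pi^{(t+1)}_i \,\|\, \bar{\pi}^{(t+1)}_i) \Big] - \Big(\frac{1}{\eta} - \tau\Big)\mathrm{KL}(\bar{\pi}^{(t+1)}_i \,\|\, \pi^{(t)}_i) + \tau \big\langle \log \pi^{(t)}_i,\ \pi^{(t+1)}_i \big\rangle. \tag{52}$$

• Moving to the third term on the RHS of (50), we first make the following claim, which shall be proven at the end of this proof:
$$\big\|\bar{\pi}^{(t+1)}_i - \pi^{(t+1)}_i\big\|_1 \le \eta \big\|A_i\bar{\pi}^{(t+1)} - A_i\bar{\pi}^{(t)}\big\|_\infty. \tag{53}$$
With (53) in place, we have
$$\begin{aligned}
-\big\langle \bar{\pi}^{(t+1)}_i - \pi^{(t+1)}_i,\ A_i\bar{\pi}^{(t+1)} - A_i\bar{\pi}^{(t)} \big\rangle &\le \sum_{j\in\mathcal{N}_i} \|A\|_\infty \big\|\bar{\pi}^{(t+1)}_i - \pi^{(t+1)}_i\big\|_1 \big\|\bar{\pi}^{(t+1)}_j - \bar{\pi}^{(t)}_j\big\|_1 \\
&\le \eta \sum_{j\in\mathcal{N}_i} \|A\|_\infty \big\|A_i(\bar{\pi}^{(t+1)} - \bar{\pi}^{(t)})\big\|_\infty \big\|\bar{\pi}^{(t+1)}_j - \bar{\pi}^{(t)}_j\big\|_1 \le \eta \|A\|_\infty^2 \Big( \sum_{j\in\mathcal{N}_i} \big\|\bar{\pi}^{(t+1)}_j - \bar{\pi}^{(t)}_j\big\|_1 \Big)^2.
\end{aligned}$$
The latter term can be further bounded by
$$\begin{aligned}
\Big( \sum_{j\in\mathcal{N}_i} \big\|\bar{\pi}^{(t+1)}_j - \bar{\pi}^{(t)}_j\big\|_1 \Big)^2 &\le \deg_i \sum_{j\in\mathcal{N}_i} \big\|\bar{\pi}^{(t+1)}_j - \pi^{(t)}_j + \pi^{(t)}_j - \bar{\pi}^{(t)}_j\big\|_1^2 \le 2\deg_i \sum_{j\in\mathcal{N}_i} \Big( \big\|\bar{\pi}^{(t+1)}_j - \pi^{(t)}_j\big\|_1^2 + \big\|\pi^{(t)}_j - \bar{\pi}^{(t)}_j\big\|_1^2 \Big) \\
&\le 4\deg_i \sum_{j\in\mathcal{N}_i} \Big( \mathrm{KL}(\bar{\pi}^{(t+1)}_j \,\|\, \pi^{(t)}_j) + \mathrm{KL}(\pi^{(t)}_j \,\|\, \bar{\pi}^{(t)}_j) \Big),
\end{aligned}$$
where the three steps follow respectively from the Cauchy-Schwarz inequality, $\|a+b\|_1^2 \le 2(\|a\|_1^2 + \|b\|_1^2)$, and Pinsker's inequality. Plugging this into the previous inequality leads to
$$-\big\langle \bar{\pi}^{(t+1)}_i - \pi^{(t+1)}_i,\ A_i\bar{\pi}^{(t+1)} - A_i\bar{\pi}^{(t)} \big\rangle \le 4\eta \deg_i \|A\|_\infty^2 \sum_{j\in\mathcal{N}_i} \Big( \mathrm{KL}(\bar{\pi}^{(t+1)}_j \,\|\, \pi^{(t)}_j) + \mathrm{KL}(\pi^{(t)}_j \,\|\, \bar{\pi}^{(t)}_j) \Big). \tag{54}$$

Plugging (51), (52), and (54) into (50) yields
$$\begin{aligned}
&\big\langle \pi_i - \bar{\pi}^{(t+1)}_i,\ A_i\bar{\pi}^{(t+1)} \big\rangle + \tau H(\pi_i) - \tau H(\bar{\pi}^{(t+1)}_i) \\
&\le \frac{1}{\eta}\Big[ \mathrm{KL}(\pi_i \,\|\, \pi^{(t)}_i) - \mathrm{KL}(\pi_i \,\|\, \pi^{(t+1)}_i) \Big] - \frac{1}{\eta}\mathrm{KL}(\pi^{(t+1)}_i \,\|\, \bar{\pi}^{(t+1)}_i) - \Big(\frac{1}{\eta} - \tau\Big)\mathrm{KL}(\bar{\pi}^{(t+1)}_i \,\|\, \pi^{(t)}_i) - \tau\, \mathrm{KL}(\pi_i \,\|\, \pi^{(t)}_i) \\
&\quad + 4\eta \deg_i \|A\|_\infty^2 \sum_{j\in\mathcal{N}_i} \Big( \mathrm{KL}(\bar{\pi}^{(t+1)}_j \,\|\, \pi^{(t)}_j) + \mathrm{KL}(\pi^{(t)}_j \,\|\, \bar{\pi}^{(t)}_j) \Big).
\end{aligned}$$
Telescoping the sum over $t = 0, 1, \ldots, T$ leads to
$$\begin{aligned}
\mathsf{Regret}_{i,\tau}(\pi_i, T+1) &\le \frac{1}{\eta}\mathrm{KL}(\pi_i \,\|\, \pi^{(0)}_i) - \frac{1}{\eta}\sum_{t=0}^{T} \mathrm{KL}(\pi^{(t+1)}_i \,\|\, \bar{\pi}^{(t+1)}_i) - \Big(\frac{1}{\eta} - \tau\Big)\sum_{t=0}^{T} \mathrm{KL}(\bar{\pi}^{(t+1)}_i \,\|\, \pi^{(t)}_i) \\
&\quad + 4\eta \deg_i \|A\|_\infty^2 \sum_{t=0}^{T}\sum_{j\in\mathcal{N}_i} \Big( \mathrm{KL}(\bar{\pi}^{(t+1)}_j \,\|\, \pi^{(t)}_j) + \mathrm{KL}(\pi^{(t)}_j \,\|\, \bar{\pi}^{(t)}_j) \Big)
\end{aligned} \tag{55}$$
$$\le \frac{1}{\eta}\log|S_i| + 4\eta \deg_i \|A\|_\infty^2 \sum_{t=0}^{T} \Big( \mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^{(t)}) + \mathrm{KL}(\pi^{(t)} \,\|\, \bar{\pi}^{(t)}) \Big),$$
where the last line follows from the fact that $\mathrm{KL}(\pi_i \,\|\, \pi^{(0)}_i) \le \log|S_i|$ and $1/\eta > \tau$. The proof is thus complete if we can establish
$$\sum_{t=0}^{T} \mathrm{KL}(\pi^{(t)} \,\|\, \bar{\pi}^{(t)}) + \sum_{t=0}^{T} \mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^{(t)}) \le 4\sum_{k\in V} \log|S_k|. \tag{57}$$
Therefore, it remains to establish (53) and (57), which shall be completed as follows.

Proof of (53). By the update rules of $\pi^{(t+1)}_i$ and $\bar{\pi}^{(t+1)}_i$, from (20) we can deduce that
$$\big\langle \log \bar{\pi}^{(t+1)}_i - \log \pi^{(t+1)}_i,\ \bar{\pi}^{(t+1)}_i - \pi^{(t+1)}_i \big\rangle = \eta \big\langle A_i(\bar{\pi}^{(t)} - \bar{\pi}^{(t+1)}),\ \bar{\pi}^{(t+1)}_i - \pi^{(t+1)}_i \big\rangle. \tag{58}$$
By Pinsker's inequality, we have
$$\big\langle \log \bar{\pi}^{(t+1)}_i - \log \pi^{(t+1)}_i,\ \bar{\pi}^{(t+1)}_i - \pi^{(t+1)}_i \big\rangle \ge \big\|\bar{\pi}^{(t+1)}_i - \pi^{(t+1)}_i\big\|_1^2.$$
In addition,
$$\big\langle A_i(\bar{\pi}^{(t)} - \bar{\pi}^{(t+1)}),\ \bar{\pi}^{(t+1)}_i - \pi^{(t+1)}_i \big\rangle \le \big\|\bar{\pi}^{(t+1)}_i - \pi^{(t+1)}_i\big\|_1 \big\|A_i(\bar{\pi}^{(t)} - \bar{\pi}^{(t+1)})\big\|_\infty.$$
Plugging the above two relations into (58) then leads to (53).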

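Proof of (57). Summing (55) over $i \in V$ gives
$$\begin{aligned}
\sum_{i\in V} \mathsf{Regret}_{i,\tau}(\pi_i, T+1) &\le \frac{1}{\eta}\mathrm{KL}(\pi \,\|\, \pi^{(0)}) - \frac{1}{\eta}\sum_{t=0}^{T} \mathrm{KL}(\pi^{(t+1)} \,\|\, \bar{\pi}^{(t+1)}) - \Big(\frac{1}{\eta} - \tau\Big)\sum_{t=0}^{T} \mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^{(t)}) \\
&\quad + 4\eta d_{\max}^2\|A\|_\infty^2 \sum_{t=0}^{T} \Big( \mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^{(t)}) + \mathrm{KL}(\pi^{(t)} \,\|\, \bar{\pi}^{(t)}) \Big) \\
&\le \frac{1}{\eta}\Big[ \mathrm{KL}(\pi \,\|\, \pi^{(0)}) + \mathrm{KL}(\pi^{(0)} \,\|\, \bar{\pi}^{(0)}) \Big] - \Big(\frac{1}{\eta} - \tau - 4\eta d_{\max}^2\|A\|_\infty^2\Big)\sum_{t=0}^{T} \mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^{(t)}) \\
&\quad - \Big(\frac{1}{\eta} - 4\eta d_{\max}^2\|A\|_\infty^2\Big)\sum_{t=0}^{T} \mathrm{KL}(\pi^{(t)} \,\|\, \bar{\pi}^{(t)}) \\
&\le \frac{1}{\eta}\sum_{k\in V} \log|S_k| - \frac{1}{4\eta}\sum_{t=0}^{T} \mathrm{KL}(\bar{\pi}^{(t+1)} \,\|\, \pi^{(t)}) - \frac{1}{4\eta}\sum_{t=0}^{T} \mathrm{KL}(\pi^{(t)} \,\|\, \bar{\pi}^{(t)}),
\end{aligned} \tag{59}$$
where the last line follows from $\bar{\pi}^{(0)} = \pi^{(0)}$, the bound $\mathrm{KL}(\pi \,\|\, \pi^{(0)}) \le \sum_{k\in V} \log|S_k|$ for any $\pi$ since $\pi^{(0)}$ is the uniform distribution, as well as the choice of the learning rate such that $4\eta d_{\max}^2\|A\|_\infty^2 \le \frac{1}{4\eta}$ and $\tau \le \frac{1}{2\eta}$. Taking the supremum over $\pi$ on both sides of (59) and applying Lemma 4 gives (57) as advertised.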

C.2 PROOF OF THEOREM 6

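Similar to the proof of Theorem 5 in Appendix C.1, it suffices to bound $\mathsf{Regret}_{i,\tau}(\pi_i, T)$ for any $\pi_i \in \Delta(S_i)$, where
$$\mathsf{Regret}_{i,\tau}(\pi_i, T) := \sum_{t=1}^{T} u_{i,\tau}(\pi_i, \bar{\pi}^{(t)}_{-i}) - \sum_{t=1}^{T} u_{i,\tau}(\bar{\pi}^{(t)}) = \sum_{t=0}^{T-1} \Big[ \big\langle \pi_i - \bar{\pi}^{(t+1)}_i,\ A_i\bar{\pi}^{(t+1)} \big\rangle + \tau H(\pi_i) - \tau H(\bar{\pi}^{(t+1)}_i) \Big].$$
Consider the following decomposition:
$$\begin{aligned}
&\big\langle \pi_i - \bar{\pi}^{(t+1)}_i,\ A_i\bar{\pi}^{(t+1)} \big\rangle + \tau H(\pi_i) - \tau H(\bar{\pi}^{(t+1)}_i) \\
&= \big\langle \pi_i - \bar{\pi}^{(t+1)}_i,\ A_i\bar{\pi}^{(\kappa^{(t+1)}_i)} \big\rangle + \big\langle \pi_i - \bar{\pi}^{(t+1)}_i,\ A_i\bar{\pi}^{(t+1)} - A_i\bar{\pi}^{(\kappa^{(t+1)}_i)} \big\rangle + \tau H(\pi_i) - \tau H(\bar{\pi}^{(t+1)}_i) \\
&= \big\langle \pi_i - \pi^{(t+1)}_i,\ A_i\bar{\pi}^{(\kappa^{(t+1)}_i)} \big\rangle + \big\langle \pi^{(t+1)}_i - \bar{\pi}^{(t+1)}_i,\ A_i\bar{\pi}^{(\kappa^{(t)}_i)} \big\rangle - \big\langle \bar{\pi}^{(t+1)}_i - \pi^{(t+1)}_i,\ A_i\bar{\pi}^{(\kappa^{(t+1)}_i)} - A_i\bar{\pi}^{(\kappa^{(t)}_i)} \big\rangle \\
&\quad + \big\langle \pi_i - \bar{\pi}^{(t+1)}_i,\ A_i\bar{\pi}^{(t+1)} - A_i\bar{\pi}^{(\kappa^{(t+1)}_i)} \big\rangle + \tau H(\pi_i) - \tau H(\bar{\pi}^{(t+1)}_i).
\end{aligned} \tag{61}$$
We now bound each inner-product term on the RHS of (61). For simplicity, we reuse the short-hand notation in (36).

• To begin with, note that $\log \pi^{(t+1)}_i = (1-\eta\tau)\log \pi^{(t)}_i + \eta A_i\bar{\pi}^{(\kappa^{(t+1)}_i)} + c_i\mathbf{1}$ for some normalization constant $c_i$. Thus we have
$$\begin{aligned}
\big\langle \pi_i - \pi^{(t+1)}_i,\ A_i\bar{\pi}^{(\kappa^{(t+1)}_i)} \big\rangle &= \frac{1}{\eta}\big\langle \log \pi^{(t+1)}_i - \log \pi^{(t)}_i,\ \pi_i - \pi^{(t+1)}_i \big\rangle + \tau \big\langle \log \pi^{(t)}_i,\ \pi_i - \pi^{(t+1)}_i \big\rangle \\
&= \frac{1}{\eta}\Big[ \mathrm{KL}(\pi_i \,\|\, \pi^{(t)}_i) - \mathrm{KL}(\pi_i \,\|\, \pi^{(t+1)}_i) - \mathrm{KL}(\pi^{(t+1)}_i \,\|\, \pi^{(t)}_i) \Big] + \tau \big\langle \log \pi^{(t)}_i,\ \pi_i - \pi^{(t+1)}_i \big\rangle,
\end{aligned} \tag{62}$$
where the second step is derived from the definition of KL divergence.

• Similarly, it holds that
$$\big\langle \pi^{(t+1)}_i - \bar{\pi}^{(t+1)}_i,\ A_i\bar{\pi}^{(\kappa^{(t)}_i)} \big\rangle = \frac{1}{\eta}\Big[ \mathrm{KL}(\pi^{(t+1)}_i \,\|\, \pi^{(t)}_i) - \mathrm{KL}(\pi^{(t+1)}_i \,\|\, \bar{\pi}^{(t+1)}_i) - \mathrm{KL}(\bar{\pi}^{(t+1)}_i \,\|\, \pi^{(t)}_i) \Big] + \tau \big\langle \log \pi^{(t)}_i,\ \pi^{(t+1)}_i - \bar{\pi}^{(t+1)}_i \big\rangle. \tag{63}$$

• For the term $\big\langle \bar{\pi}^{(t+1)}_i - \pi^{(t+1)}_i,\ A_i\bar{\pi}^{(\kappa^{(t+1)}_i)} - A_i\bar{\pi}^{(\kappa^{(t)}_i)} \big\rangle$ in (61), following the deduction of (33), we get
$$-\big\langle \bar{\pi}^{(t+1)}_i - \pi^{(t+1)}_i,\ A_i\bar{\pi}^{(\kappa^{(t+1)}_i)} - A_i\bar{\pi}^{(\kappa^{(t)}_i)} \big\rangle \le 2\|A\|_\infty \sum_{j\in\mathcal{N}_i} \Big( \Psi^{(t)}_j + \sum_{l=t-\gamma^{(t+1)}_i}^{t-1} \gamma^{(t+1)}_i \Psi^{(l)}_j + \sum_{l=t-\gamma^{(t)}_i}^{t-1} \gamma^{(t)}_i \Psi^{(l)}_j \Big) + 2 c_A \mathrm{KL}(\pi^{(t+1)}_i \,\|\, \bar{\pi}^{(t+1)}_i). \tag{64}$$

• For the last inner-product term in (61), it similarly follows that
$$\begin{aligned}
\big\langle \pi_i - \bar{\pi}^{(t+1)}_i,\ A_i\bar{\pi}^{(t+1)} - A_i\bar{\pi}^{(\kappa^{(t+1)}_i)} \big\rangle &= \big\langle \pi_i - \pi^{(t)}_i,\ A_i\bar{\pi}^{(t+1)} - A_i\bar{\pi}^{(\kappa^{(t+1)}_i)} \big\rangle + \big\langle \pi^{(t)}_i - \bar{\pi}^{(t+1)}_i,\ A_i\bar{\pi}^{(t+1)} - A_i\bar{\pi}^{(\kappa^{(t+1)}_i)} \big\rangle \\
&\le 2 c_A \mathrm{KL}(\bar{\pi}^{(t+1)}_i \,\|\, \pi^{(t)}_i) + 4 c_\tau \|A\|_\infty \sum_{j\in\mathcal{N}_i} \Big( \Psi^{(t)}_j + \sum_{l=t-\gamma^{(t+1)}_i}^{t-1} \gamma^{(t+1)}_i \Psi^{(l)}_j \Big).
\end{aligned} \tag{65}$$

Plugging (62), (63), (64), and (65) into (61) yields
$$\begin{aligned}
&\big\langle \pi_i - \bar{\pi}^{(t+1)}_i,\ A_i\bar{\pi}^{(t+1)} \big\rangle + \tau\big( H(\pi_i) - H(\bar{\pi}^{(t+1)}_i) \big) \\
&\le \frac{1}{\eta}\Big[ \mathrm{KL}(\pi_i \,\|\, \pi^{(t)}_i) - \mathrm{KL}(\pi_i \,\|\, \pi^{(t+1)}_i) \Big] - \Big(\frac{1}{\eta} - 2c_A\Big)\Psi^{(t)}_i + 4 c_\tau \|A\|_\infty \sum_{j\in\mathcal{N}_i} \Psi^{(t)}_j + 2\gamma^{(t)}_i \|A\|_\infty \sum_{j\in\mathcal{N}_i}\sum_{l=t-\gamma^{(t)}_i}^{t-1} \Psi^{(l)}_j \\
&\quad + 4 c_\tau \|A\|_\infty \gamma^{(t+1)}_i \sum_{j\in\mathcal{N}_i}\sum_{l=t-\gamma^{(t+1)}_i}^{t-1} \Psi^{(l)}_j + \tau \big\langle \log \pi^{(t)}_i,\ \pi_i - \bar{\pi}^{(t+1)}_i \big\rangle + \tau\big( H(\pi_i) - H(\bar{\pi}^{(t+1)}_i) \big).
\end{aligned}$$
Note that
$$H(\pi_i) - H(\bar{\pi}^{(t+1)}_i) + \big\langle \log \pi^{(t)}_i,\ \pi_i - \bar{\pi}^{(t+1)}_i \big\rangle = -\big\langle \log \pi_i - \log \pi^{(t)}_i,\ \pi_i \big\rangle + \big\langle \log \bar{\pi}^{(t+1)}_i - \log \pi^{(t)}_i,\ \bar{\pi}^{(t+1)}_i \big\rangle = \mathrm{KL}(\bar{\pi}^{(t+1)}_i \,\|\, \pi^{(t)}_i) - \mathrm{KL}(\pi_i \,\|\, \pi^{(t)}_i).$$
Then we have
$$\begin{aligned}
&\big\langle \pi_i - \bar{\pi}^{(t+1)}_i,\ A_i\bar{\pi}^{(t+1)} \big\rangle + \tau\big( H(\pi_i) - H(\bar{\pi}^{(t+1)}_i) \big) \\
&\le \frac{1}{\eta}\Big[ \mathrm{KL}(\pi_i \,\|\, \pi^{(t)}_i) - \mathrm{KL}(\pi_i \,\|\, \pi^{(t+1)}_i) \Big] - \Big(\frac{1}{\eta} - 2c_A - \tau\Big)\Psi^{(t)}_i + 4 c_\tau \|A\|_\infty \sum_{j\in\mathcal{N}_i} \Psi^{(t)}_j \\
&\quad + 2\gamma^{(t)}_i \|A\|_\infty \sum_{j\in\mathcal{N}_i}\sum_{l=t-\gamma^{(t)}_i}^{t-1} \Psi^{(l)}_j + 4 c_\tau \|A\|_\infty \gamma^{(t+1)}_i \sum_{j\in\mathcal{N}_i}\sum_{l=t-\gamma^{(t+1)}_i}^{t-1} \Psi^{(l)}_j.
\end{aligned} \tag{66}$$
Since the learning rate satisfies
$$\frac{1}{\eta} \ge \frac{24 d_{\max}^2\|A\|_\infty^2}{\tau}(\sigma^2+1) \ge 2 d_{\max}\|A\|_\infty + \tau = 2c_A + \tau,$$
taking expectation on both sides of (66) leads to
$$\begin{aligned}
&\mathbb{E}\Big[ \big\langle \pi_i - \bar{\pi}^{(t+1)}_i,\ A_i\bar{\pi}^{(t+1)} \big\rangle + \tau\big( H(\pi_i) - H(\bar{\pi}^{(t+1)}_i) \big) \Big] \\
&\le \frac{1}{\eta}\mathbb{E}\Big[ \mathrm{KL}(\pi_i \,\|\, \pi^{(t)}_i) - \mathrm{KL}(\pi_i \,\|\, \pi^{(t+1)}_i) \Big] - \Big(\frac{1}{\eta} - 2c_A - \tau\Big)\mathbb{E}\big[\Psi^{(t)}_i\big] + 4 c_\tau \|A\|_\infty \mathbb{E}\big[\Psi^{(t)}\big] + 4 c_\tau \|A\|_\infty \mathbb{E}\Big[ \sum_{l=0}^{t-1} E(t-l)\Psi^{(l)} \Big] \\
&\le \frac{1}{\eta}\mathbb{E}\Big[ \mathrm{KL}(\pi_i \,\|\, \pi^{(t)}_i) - \mathrm{KL}(\pi_i \,\|\, \pi^{(t+1)}_i) \Big] + 4 c_\tau \|A\|_\infty \mathbb{E}\big[\Psi^{(t)}\big] + 4 c_\tau \|A\|_\infty \mathbb{E}\Big[ \sum_{l=0}^{t-1} E(t-l)\Psi^{(l)} \Big],
\end{aligned} \tag{67}$$
where we use the fact $\sum_{j\in\mathcal{N}_i} \Psi^{(l)}_j \le \Psi^{(l)}$ and the definition of $E(t-l)$. Since $\sum_{l=0}^{\infty} E(l) \le \sigma^2$ by Assumption 4, summing (67) over $t = 0, 1, \ldots, T$ yields
$$\mathbb{E}\big[\mathsf{Regret}_{i,\tau}(T+1)\big] \le \frac{1}{\eta}\mathbb{E}\big[\mathrm{KL}(\pi_i \,\|\, \pi^{(0)}_i)\big] + 4 c_\tau \|A\|_\infty (\sigma^2+1)\, \mathbb{E}\Big[ \sum_{t=0}^{T} \Psi^{(t)} \Big] \le \frac{1}{\eta}\log|S_i| + 4 c_\tau \|A\|_\infty (\sigma^2+1)\, \mathbb{E}\Big[ \sum_{t=0}^{T} \Psi^{(t)} \Big]. \tag{68}$$
It remains to establish
$$\mathbb{E}\Big[ \sum_{t=0}^{T} \Psi^{(t)} \Big] \le 2\sum_{i\in V} \log|S_i|. \tag{69}$$

Proof of (69). Summing (66) over $i \in V$ gives
$$\begin{aligned}
&\sum_{i\in V} \Big[ \big\langle \pi_i - \bar{\pi}^{(t+1)}_i,\ A_i\bar{\pi}^{(t+1)} \big\rangle + \tau\big( H(\pi_i) - H(\bar{\pi}^{(t+1)}_i) \big) \Big] \\
&\le \frac{1}{\eta}\Big[ \mathrm{KL}(\pi \,\|\, \pi^{(t)}) - \mathrm{KL}(\pi \,\|\, \pi^{(t+1)}) \Big] - \Big(\frac{1}{\eta} - 2c_A - \tau\Big)\Psi^{(t)} + 4 c_\tau \|A\|_\infty \sum_{i\in V}\sum_{j\in\mathcal{N}_i} \Psi^{(t)}_j \\
&\quad + 2\|A\|_\infty \sum_{i\in V} \gamma^{(t)}_i \sum_{j\in\mathcal{N}_i}\sum_{l=t-\gamma^{(t)}_i}^{t-1} \Psi^{(l)}_j + 4 c_\tau \|A\|_\infty \sum_{i\in V} \gamma^{(t+1)}_i \sum_{j\in\mathcal{N}_i}\sum_{l=t-\gamma^{(t+1)}_i}^{t-1} \Psi^{(l)}_j.
\end{aligned}$$
Taking expectation on both sides and using (38) leads to
$$\begin{aligned}
&\mathbb{E}\Big[ \sum_{i\in V} \big\langle \pi_i - \bar{\pi}^{(t+1)}_i,\ A_i\bar{\pi}^{(t+1)} \big\rangle + \tau\big( H(\pi_i) - H(\bar{\pi}^{(t+1)}_i) \big) \Big] \\
&\le \frac{1}{\eta}\mathbb{E}\Big[ \mathrm{KL}(\pi \,\|\, \pi^{(t)}) - \mathrm{KL}(\pi \,\|\, \pi^{(t+1)}) \Big] - \Big(\frac{1}{\eta} - 4c_A(c_\tau + 1) - \tau\Big)\mathbb{E}\big[\Psi^{(t)}\big] + 4c_A(c_\tau + 1)\mathbb{E}\Big[ \sum_{l=0}^{t-1} E(t-l)\Psi^{(l)} \Big] - \frac{\tau}{2}\mathbb{E}\big[\mathrm{KL}(\pi \,\|\, \pi^{(t)})\big].
\end{aligned}$$
Summing over $t = 0, 1, \ldots, T$ yields
$$\mathbb{E}\Big[ \sum_{i\in V} \mathsf{Regret}_{i,\tau}(\pi, T+1) \Big] \le \frac{1}{\eta}\mathbb{E}\big[\mathrm{KL}(\pi \,\|\, \pi^{(0)})\big] - \Big(\frac{1}{\eta} - 4c_A(c_\tau + 1)(\sigma^2+1)\Big)\mathbb{E}\Big[ \sum_{t=0}^{T} \Psi^{(t)} \Big] \le \frac{1}{\eta}\mathbb{E}\big[\mathrm{KL}(\pi \,\|\, \pi^{(0)})\big] - \frac{1}{2\eta}\mathbb{E}\Big[ \sum_{t=0}^{T} \Psi^{(t)} \Big], \tag{70}$$
where the second inequality follows from
$$4c_A(c_\tau + 1)(\sigma^2+1)\eta \le 4 d_{\max}\|A\|_\infty \Big( 2 + \frac{d_{\max}\|A\|_\infty}{\tau} \Big)(\sigma^2+1)\, \frac{\tau}{24 d_{\max}^2\|A\|_\infty^2(\sigma^2+1)} \le \frac{1}{2}$$
due to $\tau \le d_{\max}\|A\|_\infty$ and $\eta \le \frac{\tau}{24 d_{\max}^2\|A\|_\infty^2(\sigma^2+1)}$. Taking the supremum with respect to $\pi$ on both sides, in view of Lemma 4, we arrive at the advertised bound (69).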

D PROOF FOR TWO-TIMESCALE OMWU (SECTION 4)

D.1 PROOF OF THEOREM 3

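Bounding $\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t)})$. For notational convenience, we set $\pi^{(t)} = \bar{\pi}^{(t)} = \pi^{(0)}$ for $t < 0$. The following lemma parallels Lemma 2 by focusing on delayed feedback. The proof is postponed to Appendix E.5.

Lemma 5. Assuming constant delays $\gamma^{(t)}_i = \gamma$, the iterates of OMWU based on the update rule (10) satisfy
$$\big\langle \log \pi^{(t+1)} - (1-\eta\tau)\log \pi^{(t)} - \eta\tau \log \pi^\star_\tau,\ \bar{\pi}^{(t-\gamma+1)} - \pi^\star_\tau \big\rangle = 0.$$

By following a similar argument as in (19), we conclude that
$$\begin{aligned}
\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t+1)}) &= (1-\eta\tau)\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t)}) - (1-\eta\tau)\mathrm{KL}(\bar{\pi}^{(t-\gamma+1)} \,\|\, \pi^{(t)}) - \mathrm{KL}(\pi^{(t+1)} \,\|\, \bar{\pi}^{(t-\gamma+1)}) \\
&\quad + \big\langle \log \bar{\pi}^{(t-\gamma+1)} - \log \pi^{(t+1)},\ \bar{\pi}^{(t-\gamma+1)} - \pi^{(t+1)} \big\rangle - \eta\tau\, \mathrm{KL}(\bar{\pi}^{(t-\gamma+1)} \,\|\, \pi^\star_\tau).
\end{aligned} \tag{71}$$
It boils down to controlling the term $\big\langle \log \bar{\pi}^{(t-\gamma+1)} - \log \pi^{(t+1)},\ \bar{\pi}^{(t-\gamma+1)} - \pi^{(t+1)} \big\rangle$. When $t \ge \gamma$, by taking the logarithm on both sides of the update rules (7) and (10), we have
$$\log \bar{\pi}^{(t-\gamma+1)}_i \overset{1}{=} (1-\bar{\eta}\tau)\log \pi^{(t-\gamma)}_i + \bar{\eta} A_i\bar{\pi}^{(t-2\gamma)}$$
and
$$\log \pi^{(t+1)}_i \overset{1}{=} (1-\eta\tau)\log \pi^{(t)}_i + \eta A_i\bar{\pi}^{(t-\gamma+1)} \overset{1}{=} (1-\eta\tau)^{\gamma+1}\log \pi^{(t-\gamma)}_i + \eta \sum_{l=0}^{\gamma} (1-\eta\tau)^{l} A_i\bar{\pi}^{(t-\gamma-l+1)}.$$
Subtracting the above equalities and taking the inner product with $\bar{\pi}^{(t-\gamma+1)}_i - \pi^{(t+1)}_i$ gives
$$\big\langle \log \bar{\pi}^{(t-\gamma+1)}_i - \log \pi^{(t+1)}_i,\ \bar{\pi}^{(t-\gamma+1)}_i - \pi^{(t+1)}_i \big\rangle = \eta \sum_{l=0}^{\gamma} (1-\eta\tau)^{l} \big\langle \bar{\pi}^{(t-\gamma+1)}_i - \pi^{(t+1)}_i,\ A_i\big(\bar{\pi}^{(t-2\gamma)} - \bar{\pi}^{(t-\gamma-l+1)}\big) \big\rangle,$$
where the $\log \pi^{(t-\gamma)}_i$ terms cancel out due to the choice $1 - \bar{\eta}\tau = (1-\eta\tau)^{\gamma+1}$, which also gives $\bar{\eta} = \eta\sum_{l=0}^{\gamma}(1-\eta\tau)^{l}$. Summing over $i \in V$,
$$\begin{aligned}
\big\langle \log \bar{\pi}^{(t-\gamma+1)} - \log \pi^{(t+1)},\ \bar{\pi}^{(t-\gamma+1)} - \pi^{(t+1)} \big\rangle &= \eta \sum_{i\in V}\sum_{l=0}^{\gamma} (1-\eta\tau)^{l} \big\langle \bar{\pi}^{(t-\gamma+1)}_i - \pi^{(t+1)}_i,\ A_i\big(\bar{\pi}^{(t-2\gamma)} - \bar{\pi}^{(t-\gamma-l+1)}\big) \big\rangle \\
&\le \eta\|A\|_\infty \sum_{(i,j)\in E}\sum_{l=0}^{\gamma} (1-\eta\tau)^{l} \big\|\bar{\pi}^{(t-\gamma+1)}_i - \pi^{(t+1)}_i\big\|_1 \big\|\bar{\pi}^{(t-2\gamma)}_j - \bar{\pi}^{(t-\gamma-l+1)}_j\big\|_1.
\end{aligned} \tag{72}$$
Using the triangle inequality, we can bound $\big\|\bar{\pi}^{(t-2\gamma)}_j - \bar{\pi}^{(t-\gamma-l+1)}_j\big\|_1$ as
$$\big\|\bar{\pi}^{(t-2\gamma)}_j - \bar{\pi}^{(t-\gamma-l+1)}_j\big\|_1 \le \sum_{l_1=t-\gamma}^{t-l} \big\|\bar{\pi}^{(l_1-\gamma)}_j - \bar{\pi}^{(l_1-\gamma+1)}_j\big\|_1 \le \sum_{l_1=t-\gamma}^{t-l} \Big( \big\|\bar{\pi}^{(l_1-\gamma)}_j - \pi^{(l_1)}_j\big\|_1 + \big\|\bar{\pi}^{(l_1-\gamma+1)}_j - \pi^{(l_1)}_j\big\|_1 \Big).$$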
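Throughout the remainder of this proof, we use the short-hand
$$\Phi^{(l_1)} := \mathrm{KL}(\pi^{(l_1)} \,\|\, \bar{\pi}^{(l_1-\gamma)}) + \mathrm{KL}(\bar{\pi}^{(l_1-\gamma+1)} \,\|\, \pi^{(l_1)}), \qquad \Xi^{(l_1)} := \mathrm{KL}(\pi^{(l_1+1)} \,\|\, \bar{\pi}^{(l_1-\gamma+1)}) + (1-\eta\tau)\mathrm{KL}(\bar{\pi}^{(l_1-\gamma+1)} \,\|\, \pi^{(l_1)}).$$
Substitution of the bound into (72) yields
$$\begin{aligned}
&\big\langle \log \bar{\pi}^{(t-\gamma+1)} - \log \pi^{(t+1)},\ \bar{\pi}^{(t-\gamma+1)} - \pi^{(t+1)} \big\rangle \\
&\le \eta\|A\|_\infty \sum_{(i,j)\in E}\sum_{l=0}^{\gamma} (1-\eta\tau)^{l} \sum_{l_1=t-\gamma}^{t-l} \big\|\bar{\pi}^{(t-\gamma+1)}_i - \pi^{(t+1)}_i\big\|_1 \Big( \big\|\bar{\pi}^{(l_1-\gamma)}_j - \pi^{(l_1)}_j\big\|_1 + \big\|\bar{\pi}^{(l_1-\gamma+1)}_j - \pi^{(l_1)}_j\big\|_1 \Big) \\
&= \eta\|A\|_\infty \sum_{(i,j)\in E}\sum_{l_1=t-\gamma}^{t}\sum_{l=0}^{t-l_1} (1-\eta\tau)^{l} \big\|\bar{\pi}^{(t-\gamma+1)}_i - \pi^{(t+1)}_i\big\|_1 \Big( \big\|\bar{\pi}^{(l_1-\gamma)}_j - \pi^{(l_1)}_j\big\|_1 + \big\|\bar{\pi}^{(l_1-\gamma+1)}_j - \pi^{(l_1)}_j\big\|_1 \Big) \\
&\le \frac{1}{2}\eta\|A\|_\infty \sum_{(i,j)\in E} \Big[ 2\sum_{l_1=t-\gamma}^{t}\sum_{l=0}^{t-l_1} (1-\eta\tau)^{l} \big\|\bar{\pi}^{(t-\gamma+1)}_i - \pi^{(t+1)}_i\big\|_1^2 + \sum_{l_1=t-\gamma}^{t}\sum_{l=0}^{t-l_1} (1-\eta\tau)^{l} \Big( \big\|\bar{\pi}^{(l_1-\gamma)}_j - \pi^{(l_1)}_j\big\|_1^2 + \big\|\bar{\pi}^{(l_1-\gamma+1)}_j - \pi^{(l_1)}_j\big\|_1^2 \Big) \Big] \\
&\le \eta d_{\max}\|A\|_\infty \Big[ 2(\gamma+1)^2\, \mathrm{KL}(\pi^{(t+1)} \,\|\, \bar{\pi}^{(t-\gamma+1)}) + \sum_{l_1=t-\gamma}^{t}\sum_{l=0}^{t-l_1} (1-\eta\tau)^{l}\, \Phi^{(l_1)} \Big].
\end{aligned} \tag{73}$$
Plugging the above inequality into (71) and recursively applying the inequality gives
$$\begin{aligned}
&\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t+1)}) + \mathrm{KL}(\pi^{(t+1)} \,\|\, \bar{\pi}^{(t-\gamma+1)}) + \eta\tau\, \mathrm{KL}(\bar{\pi}^{(t-\gamma+1)} \,\|\, \pi^\star_\tau) \\
&\le (1-\eta\tau)^{t+1-\gamma}\, \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(\gamma)}) - \sum_{l_1=\gamma}^{t} (1-\eta\tau)^{t-l_1}\, \Xi^{(l_1)} \\
&\quad + \eta d_{\max}\|A\|_\infty \Big[ 2(\gamma+1)^2 \sum_{l_1=\gamma}^{t} (1-\eta\tau)^{t-l_1}\, \mathrm{KL}(\pi^{(l_1+1)} \,\|\, \bar{\pi}^{(l_1-\gamma+1)}) + \sum_{l_2=\gamma}^{t} (1-\eta\tau)^{t-l_2} \sum_{l_1=l_2-\gamma}^{l_2}\sum_{l=0}^{l_2-l_1} (1-\eta\tau)^{l}\, \Phi^{(l_1)} \Big] \\
&\overset{\text{(i)}}{\le} (1-\eta\tau)^{t+1-\gamma}\, \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(\gamma)}) - \sum_{l_1=\gamma}^{t} (1-\eta\tau)^{t-l_1}\, \Xi^{(l_1)} + 2(\gamma+1)^2 \eta d_{\max}\|A\|_\infty \sum_{l_1=\gamma}^{t} (1-\eta\tau)^{t-l_1}\, \mathrm{KL}(\pi^{(l_1+1)} \,\|\, \bar{\pi}^{(l_1-\gamma+1)}) \\
&\quad + 2(\gamma+1)^2 \eta d_{\max}\|A\|_\infty \sum_{l_1=0}^{t} (1-\eta\tau)^{t-l_1}\, \Xi^{(l_1)} \\
&\overset{\text{(ii)}}{\le} (1-\eta\tau)^{t+1-\gamma}\, \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(\gamma)}) + 2(\gamma+1)^2 \eta d_{\max}\|A\|_\infty \sum_{l_1=0}^{\gamma-1} (1-\eta\tau)^{t-l_1}\, \Xi^{(l_1)},
\end{aligned} \tag{74}$$
where (i) results from the basic calculation
$$\begin{aligned}
\sum_{l_2=\gamma}^{t} (1-\eta\tau)^{t-l_2} \sum_{l_1=l_2-\gamma}^{l_2}\sum_{l=0}^{l_2-l_1} (1-\eta\tau)^{l}\, \Phi^{(l_1)} &= \sum_{l_1=0}^{t} (1-\eta\tau)^{t-l_1} \sum_{l_2=l_1}^{l_1+\gamma}\sum_{l=0}^{l_2-l_1} (1-\eta\tau)^{l_1-l_2+l}\, \Phi^{(l_1)} \\
&= \sum_{l_1=0}^{t} (1-\eta\tau)^{t-l_1} \sum_{l'=0}^{\gamma}\sum_{l=0}^{l'} (1-\eta\tau)^{l-l'}\, \Phi^{(l_1)} \\
&\le \sum_{l_1=0}^{t} (1-\eta\tau)^{t-l_1} (\gamma+1)^2 \Big( 1 - \frac{1}{2(\gamma+1)} \Big)^{-(\gamma+1)} (1-\eta\tau)\, \Phi^{(l_1)} \\
&\le 2(\gamma+1)^2 \sum_{l_1=0}^{t} (1-\eta\tau)^{t-l_1} (1-\eta\tau)\, \Phi^{(l_1)} \le 2(\gamma+1)^2 \sum_{l_1=0}^{t} (1-\eta\tau)^{t-l_1}\, \Xi^{(l_1)},
\end{aligned}$$
and (ii) is due to $\eta \le \min\big\{ \frac{1}{2\tau(\gamma+1)}, \frac{1}{5 d_{\max}\|A\|_\infty(\gamma+1)^2} \big\}$. To proceed, we introduce the following lemma concerning the error $\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(\gamma)})$, with the proof postponed to Appendix E.6.

Lemma 6. With constant delays $\gamma^{(t)}_i = \gamma$, the iterates of OMWU based on the update rule (10) satisfy
$$\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(\gamma)}) \le (1-\eta\tau)^{\gamma}\, \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(0)}) - \sum_{l_1=0}^{\gamma-1} (1-\eta\tau)^{\gamma-1-l_1}\, \Xi^{(l_1)} + 2\eta\gamma^2 d_{\max}\|A\|_\infty.$$

With the lemma above in mind, we can continue to bound (74) by
$$\begin{aligned}
\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t+1)}) &\le (1-\eta\tau)^{t+1}\, \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(0)}) + 2(1-\eta\tau)^{t+1-\gamma}\eta\gamma^2 d_{\max}\|A\|_\infty - \sum_{l_1=0}^{\gamma-1} (1-\eta\tau)^{t-l_1}\, \Xi^{(l_1)} \\
&\quad + 2(\gamma+1)^2 \eta d_{\max}\|A\|_\infty \sum_{l_1=0}^{\gamma-1} (1-\eta\tau)^{t-l_1}\, \Xi^{(l_1)} \\
&\le (1-\eta\tau)^{t+1}\, \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(0)}) + (1-\eta\tau)^{t+1-\gamma}.
\end{aligned}$$

Bounding $\mathrm{KL}(\pi^\star_\tau \,\|\, \bar{\pi}^{(t-\gamma+1)})$. By the definition of KL divergence, we have
$$\begin{aligned}
\mathrm{KL}(\pi^\star_\tau \,\|\, \bar{\pi}^{(t-\gamma+1)}) &= \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t+1)}) + \big\langle \pi^\star_\tau,\ \log \pi^{(t+1)} - \log \bar{\pi}^{(t-\gamma+1)} \big\rangle \\
&= \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t+1)}) + \mathrm{KL}(\pi^{(t+1)} \,\|\, \bar{\pi}^{(t-\gamma+1)}) + \big\langle \pi^\star_\tau - \pi^{(t+1)},\ \log \pi^{(t+1)} - \log \bar{\pi}^{(t-\gamma+1)} \big\rangle.
\end{aligned} \tag{75}$$
It remains to control the term $\big\langle \pi^\star_\tau - \pi^{(t+1)},\ \log \pi^{(t+1)} - \log \bar{\pi}^{(t-\gamma+1)} \big\rangle$. By following a similar argument as in (73), we have
$$\begin{aligned}
&\big\langle \pi^\star_\tau - \pi^{(t+1)},\ \log \pi^{(t+1)} - \log \bar{\pi}^{(t-\gamma+1)} \big\rangle = \eta \sum_{i\in V}\sum_{l=0}^{\gamma} (1-\eta\tau)^{l} \big\langle \pi^\star_{i,\tau} - \pi^{(t+1)}_i,\ A_i\big(\bar{\pi}^{(t-\gamma-l+1)} - \bar{\pi}^{(t-2\gamma)}\big) \big\rangle \\
&\le \eta\|A\|_\infty \sum_{(i,j)\in E}\sum_{l=0}^{\gamma} (1-\eta\tau)^{l} \big\|\pi^\star_{i,\tau} - \pi^{(t+1)}_i\big\|_1 \big\|\bar{\pi}^{(t-2\gamma)}_j - \bar{\pi}^{(t-\gamma-l+1)}_j\big\|_1 \\
&\le \eta\|A\|_\infty \sum_{(i,j)\in E}\sum_{l=0}^{\gamma} (1-\eta\tau)^{l} \sum_{l_1=t-\gamma}^{t-l} \big\|\pi^\star_{i,\tau} - \pi^{(t+1)}_i\big\|_1 \Big( \big\|\bar{\pi}^{(l_1-\gamma)}_j - \pi^{(l_1)}_j\big\|_1 + \big\|\bar{\pi}^{(l_1-\gamma+1)}_j - \pi^{(l_1)}_j\big\|_1 \Big) \\
&= \eta\|A\|_\infty \sum_{(i,j)\in E}\sum_{l_1=t-\gamma}^{t}\sum_{l=0}^{t-l_1} (1-\eta\tau)^{l} \big\|\pi^\star_{i,\tau} - \pi^{(t+1)}_i\big\|_1 \Big( \big\|\bar{\pi}^{(l_1-\gamma)}_j - \pi^{(l_1)}_j\big\|_1 + \big\|\bar{\pi}^{(l_1-\gamma+1)}_j - \pi^{(l_1)}_j\big\|_1 \Big) \\
&\le \eta d_{\max}\|A\|_\infty \Big[ 2(\gamma+1)^2\, \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t+1)}) + \sum_{l_1=t-\gamma}^{t}\sum_{l=0}^{t-l_1} (1-\eta\tau)^{l}\, \Phi^{(l_1)} \Big].
\end{aligned}$$
Substitution of the above inequality into (75) yields
$$\begin{aligned}
&\mathrm{KL}(\pi^\star_\tau \,\|\, \bar{\pi}^{(t-\gamma+1)}) + \eta\tau\, \mathrm{KL}(\bar{\pi}^{(t-\gamma+1)} \,\|\, \pi^\star_\tau) \\
&\le \big(1 + 2(\gamma+1)^2 \eta d_{\max}\|A\|_\infty\big)\mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t+1)}) + \mathrm{KL}(\pi^{(t+1)} \,\|\, \bar{\pi}^{(t-\gamma+1)}) + \eta\tau\, \mathrm{KL}(\bar{\pi}^{(t-\gamma+1)} \,\|\, \pi^\star_\tau) + \eta d_{\max}\|A\|_\infty \sum_{l_1=t-\gamma}^{t}\sum_{l=0}^{t-l_1} (1-\eta\tau)^{l}\, \Phi^{(l_1)} \\
&\overset{\text{(i)}}{\le} 2\Big[ \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(t+1)}) + \mathrm{KL}(\pi^{(t+1)} \,\|\, \bar{\pi}^{(t-\gamma+1)}) + \eta\tau\, \mathrm{KL}(\bar{\pi}^{(t-\gamma+1)} \,\|\, \pi^\star_\tau) \Big] + 2(\gamma+1)\eta d_{\max}\|A\|_\infty \sum_{l_1=0}^{t} (1-\eta\tau)^{t-l_1}\, \Xi^{(l_1)} \\
&\overset{\text{(ii)}}{\le} 2(1-\eta\tau)^{t+1-\gamma}\, \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(\gamma)}) - 2\sum_{l_1=\gamma}^{t} (1-\eta\tau)^{t-l_1}\, \Xi^{(l_1)} + 4(\gamma+1)^2 \eta d_{\max}\|A\|_\infty \sum_{l_1=\gamma}^{t} (1-\eta\tau)^{t-l_1}\, \mathrm{KL}(\pi^{(l_1+1)} \,\|\, \bar{\pi}^{(l_1-\gamma+1)}) \\
&\quad + 6(\gamma+1)^2 \eta d_{\max}\|A\|_\infty \sum_{l_1=0}^{t} (1-\eta\tau)^{t-l_1}\, \Xi^{(l_1)} \\
&\le 2(1-\eta\tau)^{t+1-\gamma}\, \mathrm{KL}(\pi^\star_\tau \,\|\, \pi^{(\gamma)}) + 6(\gamma+1)^2 \eta d_{\max}\|A\|_\infty \sum_{l_1=0}^{\gamma-1} (1-\eta\tau)^{t-l_1}\, \Xi^{(l_1)},
\end{aligned}$$
where (i) results from
$$\begin{aligned}
\sum_{l_1=t-\gamma}^{t}\sum_{l=0}^{t-l_1} (1-\eta\tau)^{l}\, \Phi^{(l_1)} &= \sum_{l_1=t-\gamma}^{t} (1-\eta\tau)^{t-l_1} \sum_{l=0}^{t-l_1} (1-\eta\tau)^{l+l_1-t}\, \Phi^{(l_1)} \\
&\le \sum_{l_1=t-\gamma}^{t} (1-\eta\tau)^{t-l_1} (\gamma+1)(1-\eta\tau)^{-(\gamma+1)} (1-\eta\tau)\, \Phi^{(l_1)} \le 2(\gamma+1) \sum_{l_1=0}^{t} (1-\eta\tau)^{t-l_1}\, \Xi^{(l_1)},
\end{aligned}$$
and (ii) is due to the bound established in (74).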
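Finally, applying Lemma 6 yields

$$
\begin{aligned}
&\mathsf{KL}\big(\pi^\star_\tau \,\|\, \pi^{(t-\gamma+1)}\big) + \eta\tau\,\mathsf{KL}\big(\pi^{(t-\gamma+1)} \,\|\, \pi^\star_\tau\big) \\
&\le 2(1-\eta\tau)^{t+1}\,\mathsf{KL}\big(\pi^\star_\tau \,\|\, \pi^{(0)}\big) + 4(1-\eta\tau)^{t+1-\gamma}\,\eta\gamma^2 d_{\max}\|A\|_\infty \\
&\quad - 2\sum_{l_1=0}^{\gamma-1}(1-\eta\tau)^{t-l_1}\Big[\mathsf{KL}\big(\pi^{(l_1+1)} \,\|\, \pi^{(l_1-\gamma+1)}\big) + (1-\eta\tau)\,\mathsf{KL}\big(\pi^{(l_1-\gamma+1)} \,\|\, \pi^{(l_1)}\big)\Big] \\
&\quad + 6(\gamma+1)^2\eta d_{\max}\|A\|_\infty\sum_{l_1=0}^{\gamma-1}(1-\eta\tau)^{t-l_1}\Big[\mathsf{KL}\big(\pi^{(l_1+1)} \,\|\, \pi^{(l_1-\gamma+1)}\big) + (1-\eta\tau)\,\mathsf{KL}\big(\pi^{(l_1-\gamma+1)} \,\|\, \pi^{(l_1)}\big)\Big] \\
&\le 2(1-\eta\tau)^{t+1}\,\mathsf{KL}\big(\pi^\star_\tau \,\|\, \pi^{(0)}\big) + 2(1-\eta\tau)^{t+1-\gamma}.
\end{aligned}
\tag{76}
$$

Bounding the QRE gap. With Lemma 3, we have

$$
\begin{aligned}
\mathsf{QRE\text{-}Gap}_\tau\big(\pi^{(t-\gamma+1)}\big)
&\le \frac{d_{\max}^2\|A\|_\infty^2}{\tau}\,\mathsf{KL}\big(\pi^\star_\tau \,\|\, \pi^{(t-\gamma+1)}\big) + \tau\,\mathsf{KL}\big(\pi^{(t-\gamma+1)} \,\|\, \pi^\star_\tau\big) \\
&\le \max\Big\{\frac{d_{\max}^2\|A\|_\infty^2}{\tau},\, \frac{1}{\eta}\Big\}\Big[\mathsf{KL}\big(\pi^\star_\tau \,\|\, \pi^{(t-\gamma+1)}\big) + \eta\tau\,\mathsf{KL}\big(\pi^{(t-\gamma+1)} \,\|\, \pi^\star_\tau\big)\Big] \\
&\le 2\max\Big\{\frac{d_{\max}^2\|A\|_\infty^2}{\tau},\, \frac{1}{\eta}\Big\}\Big[(1-\eta\tau)^{t+1}\,\mathsf{KL}\big(\pi^\star_\tau \,\|\, \pi^{(0)}\big) + (1-\eta\tau)^{t+1-\gamma}\Big],
\end{aligned}
$$

where the last step results from (76).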
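To get a feel for the rate in the bound above, the following minimal sketch evaluates the learning-rate cap $\eta \le \min\{\frac{1}{2\tau(\gamma+1)}, \frac{1}{5 d_{\max}\|A\|_\infty(\gamma+1)^2}\}$ used in this proof and counts how many iterations are needed before the contraction factor $(1-\eta\tau)^{t+1-\gamma}$ drops below a target. The function names and the numerical values of $\tau$, $\gamma$, $d_{\max}$ and $\|A\|_\infty$ are illustrative assumptions, not quantities taken from the paper.

```python
import math

def max_learning_rate(tau, gamma, d_max, a_inf):
    """Largest eta allowed by the two-part cap used in the proof (hypothetical helper)."""
    return min(1.0 / (2.0 * tau * (gamma + 1)),
               1.0 / (5.0 * d_max * a_inf * (gamma + 1) ** 2))

def iterations_for_contraction(eps, eta, tau, gamma):
    """Smallest t such that (1 - eta*tau)^(t + 1 - gamma) <= eps."""
    rate = 1.0 - eta * tau
    steps = math.ceil(math.log(eps) / math.log(rate))  # smallest exponent reaching eps
    return steps + gamma - 1

# Illustrative values only: tau, gamma, d_max and ||A||_inf are assumptions.
tau, gamma = 0.1, 4
eta = max_learning_rate(tau=tau, gamma=gamma, d_max=2, a_inf=1.0)
t = iterations_for_contraction(1e-3, eta, tau=tau, gamma=gamma)
print(eta, t)
```

With these values the second part of the cap is the binding one, which illustrates how the $(\gamma+1)^2$ factor, rather than the regularization, limits the step size once delays are present.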

D.2 PROOF OF THEOREM 4

Bounding the term KL π ⋆ τ ∥ π (t) . Recall that the update rule of π (t) i (k) is given by π (t) i (k) ∝ π (t-1) i (k) 1-ητ exp(η[A i π (κ (t) i ) ] k ). We introduce an auxiliary variable π (t) i : π (t) i (k) ∝ π (t-1) i (k) 1-η (t) i τ exp η (t) i [A i π (κ (t) i ) ] k , which can be viewed as a conceptual alternative update of π i with a different step size η (t) i > 0 satisfying (1 -η (t) i τ )(1 -ητ ) t-κ (t) i = 1 -ητ or equivalently 1 -η (t) i τ = (1 -ητ ) γ+1-t+κ (t) i . It directly follows that η (t) i ≥ η. Since κ (t) i ≤ t, we have 1 -η (t) i τ ≥ 1 -(γ + 1 -t + κ (t) i )ητ ≥ 1 -(γ + 1)ητ , which implies η (t) i ≤ (γ + 1)η. For notational convenience, we set π (t) i = π (0) , η (t) i = η and κ (t) i = 0 when t ≤ 0. The following lemma establishes a one-step analysis, with the proof postponed to Appendix E.7. Lemma 7. When t ≥ 1, it holds that KL π ⋆ i,τ ∥ π (t) i + ητ KL π (κ (t) i ) i ∥ π ⋆ i,τ = (1 -ητ )KL π ⋆ i,τ ∥ π (t-1) i -η(π (κ (t) i ) i -π ⋆ i,τ ) ⊤ A i (π (κ (t) i ) -π ⋆ τ ) -ψ (t) i + η η (t) i log π (κ (t) i ) i -log π (t) i , π (κ (t) i ) i -π (t) i , where ψ (t) i := 1 - η η (t) i KL π (t) i ∥ π (t-1) i + η η (t) i (1 -η (t) i τ )KL π (κ (t) i ) i ∥ π (t-1) i + KL π (t) i ∥ π (κ (t) i ) i + KL π (t) i ∥ π (t) i . We proceed to control the term log π (κ (t) i ) i -log π (t) i , π (κ (t) i ) i -π (t) i . By definition, we have log π (t) i 1 = (1 -η (t) i τ ) log π (t-1) i + η (t) i A i π (κ (t) i ) 1 = (1 -η (t) i τ )(1 -ητ ) t-κ (t) i log π (κ (t) i -1) + η (t) i A i π (κ (t) i ) + t-1 l=κ (t) i (1 -η (t) i τ )(1 -ητ ) t-1-l A i π (κ (l) i ) and log π (κ (t) i ) i 1 = (1 -ητ ) log π (κ (t) i -1) + ηA i π (κ (κ (t) i -1) i ) when κ (t) i ≥ 1. 
Subtracting the two equations yields log π (κ (t) i ) i -log π (t) i 1 = η (t) i A i (π (κ (κ (t) i -1) i ) -π (κ (t) i ) ) + t-1 l=κ (t) i (1 -η (t) i τ )(1 -ητ ) t-1-l A i (π (κ (κ (t) i -1) i ) -π (κ (l) i ) ) , where the log π (κ (t) i -1) terms cancel out due to (1η (t) i τ )(1 -ητ ) t-κ (t) i = 1 -ητ . It follows that log π (κ (t) i ) i -log π (t) i , π (κ (t) i ) i -π (t) i = η (t) i π (κ (t) i ) i -π (t) i , A i (π (κ (κ (t) i -1) i ) -π (κ (t) i ) ) + t-1 l=κ (t) i (1 -η (t) i τ )(1 -ητ ) t-1-l π (κ (t) i ) i -π (t) i , A i (π (κ (κ (t-1) i ) i ) -π (κ (l) i ) ) ≤ η (t) i ∥A∥ ∞ π (κ (t) i ) i -π (t) i 1 j∈Ni t l=κ ) i ) j 1 . The next lemma establishes an upper bound on the term t l=κ (t) i π (κ (l) i ) j -π (κ (κ (t-1) i ) i ) j 1 , with the proof postponed to Appendix E.8. Lemma 8. Let ν j (t) denote the time index when agent j receives the payoff from the t-th iteration, i.e., κ (νj (t)) j = t. For t = 0, we set ν j (0) to an arbitrary index that satisfies κ (νj (0)) j = 0. When t ≥ 2γ + 1, it holds that t l=κ (t) i π (κ (l) i ) j -π (κ (κ (t-1) i ) i ) j 1 ≤ 4 √ 2(γ + 1) t+γ l=t-2γ ψ (l) j + 2 √ 2(γ + 1) 2 ψ (νj (κ (κ (t-1) i ) i )) j , Plugging Lemma 8 into (81) gives log π (κ (t) i ) i -log π (t) i , π (κ (t) i ) i -π (t) i ≤ η (t) i ∥A∥ ∞ π (κ (k) i ) i -π (t) i 1 j∈Ni 4 √ 2(γ + 1) t+γ l=t-2γ ψ (l) j + 2 √ 2(γ + 1) 2 ψ (νj (κ (κ (t-1) i ) i )) j (i) ≤ 1 2 η (t) i ∥A∥ ∞ 14d max (γ + 1) 3/2 π (κ (t) i ) i -π (t) i 2 1 + j∈Ni 8(γ + 1) 3/2 t+γ l=t-2γ ψ (l) j + 4(γ + 1) 5/2 ψ (νj (κ (κ (t-1) i ) i )) j (ii) ≤ η (t) i ∥A∥ ∞ 14d max (γ + 1) 5/2 ψ (t) i + j∈Ni 4(γ + 1) 3/2 t+γ l=t-2γ ψ (l) j + 2(γ + 1) 5/2 ψ (νj (κ (κ (t-1) i ) i )) j , where (i) results from Young's inequality π (κ (t) i i -π (t) i 1 ψ (l) j ≤ 1 2 1 √ 2(γ + 1) 1/2 π (κ (t) i ) i -π (t) i 2 1 + √ 2(γ + 1) 1/2 ψ (l) j and (ii) follows from π (κ (t) i ) i -π (t) i 2 1 ≤ 2KL π (t) i ∥ π (κ (t) i ) i ≤ 2(γ + 1)ψ (t) i . 
Plugging (82) into (79) and summing over i ∈ V yields KL π ⋆ τ ∥ π (t) + ητ i∈V KL π (κ (t) i ) i ∥ π ⋆ τ ≤ (1 -ητ )KL π ⋆ τ ∥ π (t-1) -η i∈V (π (κ (t) i ) i -π ⋆ i,τ ) ⊤ A i (π (κ (t) i ) -π ⋆ τ ) -(1 -14ηd max ∥A∥ ∞ (γ + 1) 5/2 ) i∈V ψ (t) i + 2η ∥A∥ ∞ (γ + 1) 5/2 (i,j)∈E ψ (νj (κ (κ (t-1) i ) i )) j + 4ηd max ∥A∥ ∞ (γ + 1) 3/2 t+γ l=t-2γ ψ (l) , where we denote i∈V ψ i by ψ (l) for notation simplicity. We then seek to sum the above equation over l) , and that t = 2γ + 1, • • • , T . Before proceeding, we note that T t=2γ+1 t+γ l=t-2γ ψ (l) ≤ T +γ l=1 l+2γ t=l-γ ψ (l) ≤ 3(γ + 1) T +γ l=1 ψ ( T t=2γ+1 (i,j)∈E ψ (νj (κ (κ (t-1) i ) i )) j ≤ (i,j)∈E T +γ-1 t=0 ψ (t) j ≤ d max T +γ-1 t=1 ψ (t) , where the first step is due to the mapping t → ν j (κ (κ (t-1) i ) i ) being injective when t ≥ 2γ + 1 (cf. Assumptions 2, 3). Note that ψ (t) j = 0 when t ≤ 0 and hence can be safely discarded. Taken together, we arrive at ητ T t=2γ+1 KL π ⋆ τ ∥ π (t) + ητ T t=2γ+1 i∈V KL π (κ (t) i ) i ∥ π ⋆ i,τ ≤ (1 -ητ )KL π ⋆ τ ∥ π (2γ) -η T t=2γ+1 i∈V (π (κ (t) i ) i -π ⋆ i,τ ) ⊤ A i (π (κ (t) i ) -π ⋆ τ ) -1 -14ηd max ∥A∥ ∞ (γ + 1) 5/2 T t=2γ+1 ψ (t) + 12ηd max ∥A∥ ∞ (γ + 1) 5/2 T +γ l=1 ψ (l) + 2ηd max ∥A∥ ∞ (γ + 1) 5/2 T +γ-1 t=1 ψ (t) ≤ (1 -ητ )KL π ⋆ τ ∥ π (2γ) -η T t=2γ+1 i∈V (π (κ (t) i ) i -π ⋆ i,τ ) ⊤ A i (π (κ (t) i ) -π ⋆ τ ) -1 -28ηd max ∥A∥ ∞ (γ + 1) 5/2 T t=2γ+1 ψ (t) + 14ηd max ∥A∥ ∞ (γ + 1) 5/2 l∈Γ ψ (l) ≤ (1 -ητ )KL π ⋆ τ π (2γ) -η T t=2γ+1 i∈V (π (κ (t) i ) i -π ⋆ i,τ ) ⊤ A i (π (κ (t) i ) -π ⋆ τ ) + 1 3 l∈Γ ψ (l) , where Γ = {1, • • • , 2γ} ∪ {T + 1, • • • , T + γ}. The last step results from the choice of learning rate η ≤ 1 28dmax∥A∥ ∞ (γ+1) 5/2 . It now remains to bound the terms 2γ) and l∈Γ ψ (l) . 
In view of Lemma 1, we have T t=2γ+1 i∈V (π (κ (t) i ) i - π ⋆ i,τ ) ⊤ A i (π (κ (t) i ) -π ⋆ τ ), KL π ⋆ τ ∥ π ( - T t=2γ+1 i∈V (π (κ (t) i ) i -π ⋆ i,τ ) ⊤ A i (π (κ (t) i ) -π ⋆ τ ) = T t=γ+1 i∈V (π (t) i -π ⋆ i,τ ) ⊤ A i (π (t) -π ⋆ τ ) - T t=2γ+1 i∈V (π (κ (t) i ) i -π ⋆ i,τ ) ⊤ A i (π (κ (t) i ) -π ⋆ τ ). We remark that each (π (κ (t) i ) i -π ⋆ i,τ ) ⊤ A i (π (κ (t) i ) -π ⋆ τ ) term will cancel out due to the mapping t → κ (t) i being injective when t ≥ γ. In addition, we have a crude bound (π (t) i -π ⋆ i,τ ) ⊤ A i (π (t) -π ⋆ τ ) = j∈Ni (π (t) i -π ⋆ i,τ ) ⊤ A ij (π (t) j -π ⋆ j,τ ) ≤ 4d max ∥A∥ ∞ for every i ∈ V, t ≥ 0. Applying the bound to the remaining nγ terms gives - T t=2γ+1 i∈V (π (κ (t) i ) i -π ⋆ i,τ ) ⊤ A i (π (κ (t) i ) -π ⋆ τ ) ≤ 4nγd max ∥A∥ ∞ . The remaining terms KL π ⋆ τ ∥ π (2γ) and ψ (l) can be bounded with the following lemma, with the proof postponed to Appendix E.9. Lemma 9. It holds for all i ∈ V and t ≥ 0 that ψ (t) i ≤ η(d max ∥A∥ ∞ (2γ + 11) + 3τ log |S i |). (86) In addition, we have KL π ⋆ i,τ ∥ π (2γ) i ≤ KL π ⋆ i,τ ∥ π (0) i + 4ηd max ∥A∥ ∞ γ. Putting all pieces together, we continue from (84) and show that ητ T t=2γ+1 KL π ⋆ τ ∥ π (t) ≤ (1 -ητ )KL π ⋆ τ ∥ π (2γ) -η T t=2γ+1 i∈V (π (κ (t) i ) i -π ⋆ i,τ ) ⊤ A i (π (κ (t) i ) -π ⋆ τ ) + 1 3 l∈Γ ψ (l) ≤ KL π ⋆ i,τ ∥ π (0) i + 8ηnγd max ∥A∥ ∞ + ηγ nd max ∥A∥ ∞ (2γ + 11) + 3τ i∈V log |S i | ≤ KL π ⋆ i,τ ∥ π (0) i + 8ηn γd max ∥A∥ ∞ + γ d max ∥A∥ ∞ (2γ + 11) + 3τ log S max ≤ KL π ⋆ i,τ ∥ π (0) i + n + 24ητ nγ log S max . Bounding the term KL π ⋆ τ ∥ π (t-γ+1) . By definition of KL divergence, we have KL π ⋆ i,τ ∥ π (t-γ+1) i = KL π ⋆ i,τ ∥ π (t+1) i + π ⋆ i,τ , log π (t+1) i -log π (t-γ+1) i = KL π ⋆ i,τ ∥ π (t+1) i -KL π (t-γ+1) i ∥ π (t+1) i + π ⋆ i,τ -π (t-γ+1) i , log π (t+1) i -log π (t-γ+1) i . 
( ) It follows directly from the update rules that       log π (t-γ+1) i 1 = (1 -ητ ) log π (t-γ) i + ηA i π (κ (t-γ) i ) log π (t+1) i 1 = (1 -ητ ) γ+1 log π (t-γ) i + η t+1 l=t-γ+1 (1 -ητ ) t-l+1 A i π (κ (l) i ) , which enables us to control the term π ⋆ i,τπ (t-γ+1) i , log π (t+1) i -log π (t-γ+1) i as π ⋆ i,τ -π (t-γ+1) i , log π (t+1) i -log π (t-γ+1) i = η t+1 l=t-γ+1 (1 -ητ ) t-l+1 π ⋆ i,τ -π (t-γ+1) i , A i (π (κ (t-γ) i ) -π (κ (l) i ) ) ≤ η ∥A∥ ∞ π ⋆ i,τ -π (t-γ+1) i 1 j∈Ni t+1 l=t-γ+1 π (κ (t-γ) i ) j -π (κ (l) i ) j 1 . (89) In the same vein as Lemma 8, we can bound the term t+1 l=t-γ+1 π (κ (t-γ) i ) j -π (κ (l) i ) j 1 with {ψ (l) i }, as detailed in the following lemma. The proof is omitted due to its similarity with that of Lemma 8. Lemma 10. When t ≥ 2γ, it holds that t+1 l=t-γ+1 π (κ (t-γ) i ) j -π (κ (l) i ) j 1 ≤ 4 √ 2(γ + 1) t+γ+1 l=t-2γ+1 ψ (l) i + 2 √ 2(γ + 1) 2 ψ (νj (κ (t-γ) i )) j . Plugging the above lemma into (89), we have π ⋆ i,τ -π (t-γ+1) i , log π (t+1) i -log π (t-γ+1) i ≤ η ∥A∥ ∞ π ⋆ i,τ -π (t-γ+1) i 1 j∈Ni 4 √ 2(γ + 1) t+γ+1 l=t-2γ+1 ψ (l) i + 2 √ 2(γ + 1) 2 ψ (νj (κ (t-γ) i )) j (i) ≤ 1 2 η ∥A∥ ∞ 14d max (γ + 1) 3/2 π ⋆ i,τ -π (t-γ+1) i 2 1 + j∈Ni 8(γ + 1) 3/2 t+γ+1 l=t-2γ+1 ψ (l) j + 4(γ + 1) 5/2 ψ (νj (κ (t-γ) i )) j (ii) ≤ η ∥A∥ ∞ 14d max (γ + 1) 3/2 KL π ⋆ i,τ ∥ π (t-γ+1) i + j∈Ni 4(γ + 1) 3/2 t+γ+1 l=t-2γ+1 ψ (l) j + 2(γ + 1) 5/2 ψ (νj (κ (t-γ) i )) j , where (i) results from similar arguments in ( 82) and (ii) invokes Pinsker's inequality. Substitution of the above inequality into (88) and summing over i ∈ V leads to (1 -14ηd max ∥A∥ ∞ (γ + 1) 3/2 )KL π ⋆ τ ∥ π (t-γ+1) ≤ KL π ⋆ τ ∥ π (t+1) + η ∥A∥ ∞ (i,j)∈E 4(γ + 1) 3/2 t+γ+1 l=t-2γ+1 ψ (l) j + 2(γ + 1) 5/2 ψ (νj (κ (t-γ) i )) j ≤ KL π ⋆ τ ∥ π (t+1) + 4ηd max ∥A∥ ∞ (γ + 1) 3/2 t+γ+1 l=t-2γ+1 ψ (l) + 2ηd max ∥A∥ ∞ (γ + 1) 5/2 ψ (νj (κ (t-γ) i )) . Summing the above inequality over t l) . 
= 2γ -1, • • • , T -1 and adding T -1 t=2γ i∈V KL π (κ (t+1) i ) i ∥ π ⋆ i,τ to the both sides, T -1 t=2γ 2 3 KL π ⋆ τ ∥ π (t-γ+1) + i∈V KL π (κ (t+1) i ) i ∥ π ⋆ i,τ ≤ T -1 t=2γ KL π ⋆ τ ∥ π (t+1) + T -1 t=2γ i∈V KL (κ (t+1) i ) i ∥ π ⋆ i,τ + 4ηd max ∥A∥ ∞ (γ + 1) 3/2 T -1 t=2γ t+γ+1 l=t-2γ+1 ψ (l) + 2ηd max ∥A∥ ∞ (γ + 1) 5/2 T -1 t=2γ ψ (νj (κ (t-γ) i )) (i) ≤ 1 ητ (1 -ητ )KL π ⋆ τ ∥ π (2γ) -η T t=2γ+1 i∈V (π (κ (t) i ) i -π ⋆ i,τ ) ⊤ A i (π (κ (t) i ) -π ⋆ τ ) -1 -28ηd max ∥A∥ ∞ (γ + 1) 5/2 T t=2γ+1 ψ (t) + 14ηd max ∥A∥ ∞ (γ + 1) 5/2 l∈Γ ψ (l) + 12ηd max ∥A∥ ∞ (γ + 1) 5/2 T +γ l=1 ψ (l) + 2ηd max ∥A∥ ∞ (γ + 1) 5/2 T +γ-1 t=0 ψ (l) = 1 ητ (1 -ητ )KL π ⋆ τ ∥ π (2γ) -η T t=2γ+1 i∈V (π (κ (t) i ) i -π ⋆ i,τ ) ⊤ A i (π (κ (t) i ) -π ⋆ τ ) -1 -28(1 + ητ 2 )ηd max ∥A∥ ∞ (γ + 1) 5/2 T t=2γ+1 ψ (t) + 14(1 + ητ )ηd max ∥A∥ ∞ (γ + 1) 5/2 l∈Γ ψ Here, (i) invokes the bound established in ( 84). We remark that our choice of learning rate η ≤ min 1 2τ (γ + 1) , 1 42d max ∥A∥ ∞ (γ + 1) 5/2 guarantees 1 -28(1 + ητ 2 )ηd max ∥A∥ ∞ (γ + 1) 5/2 ≥ 0. This taken together with (85) and Lemma 9 gives T -1 t=2γ 2 3 KL π ⋆ τ ∥ π (t-γ+1) + i∈V KL π (κ (t+1) i ) i ∥ π ⋆ i,τ ≤ 1 ητ (1 -ητ )KL π ⋆ τ ∥ π (2γ) -η T t=2γ+1 i∈V (π (κ (t) i ) i -π ⋆ i,τ ) ⊤ A i (π (κ (t) i ) -π ⋆ τ ) + 1 2 l∈Γ ψ (l) ≤ 1 ητ KL π ⋆ i,τ ∥ π (0) i + 8ηn γd max ∥A∥ ∞ + 3γ 2 d max ∥A∥ ∞ (2γ + 11) + 3τ log S max ≤ 1 ητ KL π ⋆ i,τ ∥ π (0) i + n + 36ητ nγ log S max . ( ) Bounding the QRE gap. With Lemma 3, we have T -γ-1 t=2γ QRE-Gap τ (π (t+1) ) ≤ T -γ-1 t=2γ d 2 max ∥A∥ 2 ∞ τ KL π ⋆ τ ∥ π (t+1) + τ KL π (t+1) ∥ π ⋆ τ ≤ max 3d 2 max ∥A∥ 2 ∞ 2τ , τ T -γ-1 t=2γ 2 3 KL π ⋆ τ ∥ π (t+1) + KL π (t+1) ∥ π ⋆ τ . Since the mapping t → ν i (t) is injective, we have T -γ-1 t=2γ i∈V KL π (t+1) i ∥ π ⋆ i,τ = T -γ-1 t=2γ i∈V KL π (κ (ν i (t+1)) i ) i ∥ π ⋆ i,τ ≤ T -1 t=2γ i∈V KL π (κ (t+1) i ) i ∥ π ⋆ i,τ . 
Combining the above two equalities gives T -γ-1 t=2γ QRE-Gap τ (π (t+1) ) ≤ max 3d 2 max ∥A∥ 2 ∞ 2τ , τ T -γ-1 t=2γ 2 3 KL π τ ∥ π (t+1) + T -1 t=2γ KL π (t+1) ∥ π ⋆ τ ≤ max 3d 2 max ∥A∥ 2 ∞ 2τ , τ T -1 t=2γ 2 3 KL π ⋆ τ ∥ π (t-γ+1) + i∈V KL π (κ (t+1) i ) i ∥ π ⋆ i,τ ≤ max 3d 2 max ∥A∥ 2 ∞ 2τ , τ 1 ητ KL π ⋆ i,τ ∥ π (0) i + n + 36ητ nγ log S max , where the last step results from ( 90).
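The proof of Theorem 4 opens by introducing an auxiliary step size $\eta_i^{(t)}$ defined through $1 - \eta_i^{(t)}\tau = (1-\eta\tau)^{\gamma+1-t+\kappa_i^{(t)}}$ and claims $\eta \le \eta_i^{(t)} \le (\gamma+1)\eta$. The sketch below checks these two bounds numerically for every admissible delay $t - \kappa_i^{(t)} \in \{0, \dots, \gamma\}$. The helper name and the parameter values are assumptions made for illustration.

```python
def auxiliary_step_size(eta, tau, gamma, delay):
    """Solve 1 - eta_aux * tau = (1 - eta * tau)^(gamma + 1 - delay) for eta_aux.

    `delay` plays the role of t - kappa_i^(t) and is assumed to lie in [0, gamma].
    """
    return (1.0 - (1.0 - eta * tau) ** (gamma + 1 - delay)) / tau

# Illustrative values only.
eta, tau, gamma = 0.01, 0.5, 5
aux = [auxiliary_step_size(eta, tau, gamma, d) for d in range(gamma + 1)]
for d, value in enumerate(aux):
    print(d, value)
```

The largest auxiliary step size corresponds to zero delay and the smallest, equal to $\eta$, to the maximal delay $\gamma$; the upper bound $(\gamma+1)\eta$ is Bernoulli's inequality applied to the exponent.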

E PROOF OF AUXILIARY LEMMAS

E.1 PROOF OF LEMMA 1

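To prove this lemma, we recall a key observation in Cai et al. (2016) that allows one to transform a zero-sum polymatrix game $\mathcal{G} = \{(V, E), \{S_i\}_{i\in V}, \{A_{ij}\}_{(i,j)\in E}\}$ into a pairwise constant-sum polymatrix game $\widetilde{\mathcal{G}} = \{(V, E), \{S_i\}_{i\in V}, \{\widetilde{A}_{ij}\}_{(i,j)\in E}\}$ such that

(1) Every player $i \in V$ has the same payoff in $\mathcal{G}$ and $\widetilde{\mathcal{G}}$: $\widetilde{u}_i(s) = u_i(s)$ for all $s \in S$.

(2) For each pair $(i, j) \in E$, $i \neq j$, the two-player game in $\widetilde{\mathcal{G}}$ is constant-sum, i.e., there exist constants $\alpha_{ij} = \alpha_{ji}$ such that

$$
\widetilde{A}_{ij}(s_i, s_j) + \widetilde{A}_{ji}(s_j, s_i) = \alpha_{ij} \tag{91}
$$

holds for all $s_i \in S_i$, $s_j \in S_j$.

We are now in a position to prove Lemma 1. Let $\widetilde{\mathcal{G}}$ be the pairwise constant-sum polymatrix game associated with $\mathcal{G}$ after the above payoff-preserving transformation. We have

$$
\begin{aligned}
\sum_{i\in V}\big[u_i(\pi_i, \pi'_{-i}) + u_i(\pi'_i, \pi_{-i})\big]
&= \sum_{i\in V}\big[\widetilde{u}_i(\pi_i, \pi'_{-i}) + \widetilde{u}_i(\pi'_i, \pi_{-i})\big] \\
&= \sum_{(i,j)\in E}\Big[\mathbb{E}_{s_i\sim\pi_i,\, s_j\sim\pi'_j}\,\widetilde{A}_{ij}(s_i, s_j) + \mathbb{E}_{s_i\sim\pi'_i,\, s_j\sim\pi_j}\,\widetilde{A}_{ij}(s_i, s_j)\Big] \\
&= \sum_{(i,j)\in E}\Big[\mathbb{E}_{s_i\sim\pi_i,\, s_j\sim\pi'_j}\,\widetilde{A}_{ij}(s_i, s_j) + \mathbb{E}_{s_i\sim\pi'_i,\, s_j\sim\pi_j}\big(\alpha_{ij} - \widetilde{A}_{ji}(s_j, s_i)\big)\Big] \\
&= \sum_{(i,j)\in E}\alpha_{ij} = 0,
\end{aligned}
$$

where the penultimate line uses (91), and the last line uses the fact that $\widetilde{\mathcal{G}}$ is also a zero-sum polymatrix game, which satisfies

$$
\sum_{(i,j)\in E}\alpha_{ij} = \sum_{(i,j)\in E}\big[\widetilde{A}_{ij}(s_i, s_j) + \widetilde{A}_{ji}(s_j, s_i)\big] = \sum_{i\in V}\widetilde{u}_i(s) + \sum_{j\in V}\widetilde{u}_j(s) = 0
$$

for any $s \in S$.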
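The identity of Lemma 1 can be checked numerically on a small instance. The sketch below is an illustration under simplifying assumptions: it builds a random game on a triangle graph in which every edge is pairwise zero-sum ($A_{ji} = -A_{ij}^\top$, a special case of the pairwise constant-sum structure with $\alpha_{ij} = 0$), and all function names are hypothetical.

```python
import random

def rand_simplex(k, rng):
    """Random point in the interior of the probability simplex of dimension k."""
    w = [rng.random() + 0.1 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

def bilinear(p, M, q):
    """Compute p^T M q for nested-list matrices."""
    return sum(p[a] * M[a][b] * q[b] for a in range(len(p)) for b in range(len(q)))

def build_game(sizes, edges, rng):
    """Payoff matrices with A[(j, i)] = -A[(i, j)]^T on every undirected edge."""
    A = {}
    for i, j in edges:
        M = [[rng.uniform(-1, 1) for _ in range(sizes[j])] for _ in range(sizes[i])]
        A[(i, j)] = M
        A[(j, i)] = [[-M[a][b] for a in range(sizes[i])] for b in range(sizes[j])]
    return A

def payoff(i, own, others, A):
    """u_i(own, others_{-i}): sum over neighbours j of own^T A_ij others_j."""
    return sum(bilinear(own, M, others[j]) for (a, j), M in A.items() if a == i)

rng = random.Random(0)
sizes = [2, 3, 4]
edges = [(0, 1), (1, 2), (0, 2)]
A = build_game(sizes, edges, rng)
pi = [rand_simplex(k, rng) for k in sizes]
pi_prime = [rand_simplex(k, rng) for k in sizes]

# Left-hand side of Lemma 1: sum_i [u_i(pi_i, pi'_{-i}) + u_i(pi'_i, pi_{-i})].
total = sum(payoff(i, pi[i], pi_prime, A) + payoff(i, pi_prime[i], pi, A)
            for i in range(len(sizes)))
print(total)
```

The sum vanishes up to floating-point error because, on each edge, the contribution of player $i$ is cancelled by that of player $j$.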

E.2 PROOF OF LEMMA 2

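In view of the update rule (7), we have

$$
\log \pi_i^{(t+1)} = (1-\eta\tau)\log \pi_i^{(t)} + \eta A_i \pi^{(t+1)} + c_i \mathbf{1}
$$

for some constant $c_i$. On the other hand, it follows from the expression of QRE in (4) that

$$
\eta\tau \log \pi^\star_{i,\tau} = \eta A_i \pi^\star_\tau + c_i^\star \mathbf{1} \tag{92}
$$

for some constant $c_i^\star$. By subtracting the second equality from the first and taking the inner product with $\pi_i^{(t+1)} - \pi^\star_{i,\tau}$ (which annihilates the multiples of $\mathbf{1}$), we have

$$
\big\langle \log \pi_i^{(t+1)} - (1-\eta\tau)\log \pi_i^{(t)} - \eta\tau\log \pi^\star_{i,\tau},\; \pi_i^{(t+1)} - \pi^\star_{i,\tau} \big\rangle = \eta\,(\pi_i^{(t+1)} - \pi^\star_{i,\tau})^\top A_i (\pi^{(t+1)} - \pi^\star_\tau). \tag{93}
$$

Summing the above equality over $i \in V$ gives

$$
\begin{aligned}
&\big\langle \log \pi^{(t+1)} - (1-\eta\tau)\log \pi^{(t)} - \eta\tau\log \pi^\star_\tau,\; \pi^{(t+1)} - \pi^\star_\tau \big\rangle \\
&= \eta\sum_{i\in V}(\pi_i^{(t+1)} - \pi^\star_{i,\tau})^\top A_i (\pi^{(t+1)} - \pi^\star_\tau) \\
&= \eta\sum_{i\in V}\Big[(\pi_i^{(t+1)})^\top A_i \pi^{(t+1)} + (\pi^\star_{i,\tau})^\top A_i \pi^\star_\tau\Big] - \eta\sum_{i\in V}\Big[(\pi_i^{(t+1)})^\top A_i \pi^\star_\tau + (\pi^\star_{i,\tau})^\top A_i \pi^{(t+1)}\Big] \\
&= \eta\sum_{i\in V}\big[u_i(\pi^{(t+1)}) + u_i(\pi^\star_\tau)\big] = 0,
\end{aligned}
$$

where the last line follows from $\sum_{i\in V}\big[(\pi_i^{(t+1)})^\top A_i \pi^\star_\tau + (\pi^\star_{i,\tau})^\top A_i \pi^{(t+1)}\big] = 0$ due to Lemma 1, as well as the fact that the game is zero-sum.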

E.3 PROOF OF LEMMA 3

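Recall the definition

$$
\begin{aligned}
\mathsf{QRE\text{-}Gap}_\tau(\pi)
&= \max_{i\in V}\Big[\max_{\pi'_i\in\Delta(S_i)} u_{i,\tau}(\pi'_i, \pi_{-i}) - u_{i,\tau}(\pi)\Big] \\
&\le \sum_{i\in V}\Big[\max_{\pi'_i\in\Delta(S_i)} u_{i,\tau}(\pi'_i, \pi_{-i}) - u_{i,\tau}(\pi)\Big] \\
&= \max_{\pi'_i\in\Delta(S_i),\, i\in V}\;\sum_{i\in V}\big[u_{i,\tau}(\pi'_i, \pi_{-i}) - u_{i,\tau}(\pi_i, \pi_{-i})\big],
\end{aligned}
$$

where the inequality holds since $\max_{\pi'_i\in\Delta(S_i)} u_{i,\tau}(\pi'_i, \pi_{-i}) - u_{i,\tau}(\pi) \ge u_{i,\tau}(\pi_i, \pi_{-i}) - u_{i,\tau}(\pi) = 0$ for all $i \in V$. We now proceed to decompose

$$
\begin{aligned}
&\sum_{i\in V}\big[u_{i,\tau}(\pi'_i, \pi_{-i}) - u_{i,\tau}(\pi_i, \pi_{-i})\big] \\
&= \sum_{i\in V}\big[u_{i,\tau}(\pi'_i, \pi_{-i}) - u_{i,\tau}(\pi^\star_{i,\tau}, \pi^\star_{-i,\tau})\big] - \tau\sum_{i\in V}\big[H(\pi_i) - H(\pi^\star_{i,\tau})\big] \\
&= \sum_{i\in V}\big[u_{i,\tau}(\pi'_i, \pi_{-i}) - u_{i,\tau}(\pi'_i, \pi^\star_{-i,\tau}) - u_{i,\tau}(\pi^\star_{i,\tau}, \pi_{-i}) + u_{i,\tau}(\pi^\star_{i,\tau}, \pi^\star_{-i,\tau})\big] \\
&\quad + \sum_{i\in V}\big[u_{i,\tau}(\pi^\star_{i,\tau}, \pi_{-i}) - u_{i,\tau}(\pi^\star_{i,\tau}, \pi^\star_{-i,\tau}) - \tau\big(H(\pi_i) - H(\pi^\star_{i,\tau})\big)\big] \\
&\quad + \sum_{i\in V}\big[u_{i,\tau}(\pi'_i, \pi^\star_{-i,\tau}) - u_{i,\tau}(\pi^\star_{i,\tau}, \pi^\star_{-i,\tau})\big],
\end{aligned}
\tag{94}
$$

where the first line follows from $\sum_{i\in V}\big(u_{i,\tau}(\pi) - \tau H(\pi_i)\big) = \sum_{i\in V}\big(u_{i,\tau}(\pi^\star_\tau) - \tau H(\pi^\star_{i,\tau})\big) = 0$ by the definition of zero-sum games. It boils down to controlling the terms on the RHS of (94).

• To control the first term, by the definition of $u_{i,\tau}$ in (5) (see also (3)), it follows that

$$
\begin{aligned}
&u_{i,\tau}(\pi'_i, \pi_{-i}) - u_{i,\tau}(\pi'_i, \pi^\star_{-i,\tau}) - u_{i,\tau}(\pi^\star_{i,\tau}, \pi_{-i}) + u_{i,\tau}(\pi^\star_{i,\tau}, \pi^\star_{-i,\tau}) \\
&= u_i(\pi'_i, \pi_{-i}) - u_i(\pi'_i, \pi^\star_{-i,\tau}) - u_i(\pi^\star_{i,\tau}, \pi_{-i}) + u_i(\pi^\star_{i,\tau}, \pi^\star_{-i,\tau}) \\
&= (\pi'_i - \pi^\star_{i,\tau})^\top A_i(\pi - \pi^\star_\tau) = \sum_{j\in N_i}(\pi'_i - \pi^\star_{i,\tau})^\top A_{ij}(\pi_j - \pi^\star_{j,\tau}),
\end{aligned}
$$

where each summand can be further bounded by Young's inequality and Pinsker's inequality as

$$
\begin{aligned}
(\pi'_i - \pi^\star_{i,\tau})^\top A_{ij}(\pi_j - \pi^\star_{j,\tau})
&\le \|A\|_\infty\,\|\pi'_i - \pi^\star_{i,\tau}\|_1\,\|\pi_j - \pi^\star_{j,\tau}\|_1 \\
&\le \frac{1}{2}\|A\|_\infty\Big(\frac{\tau}{d_{\max}\|A\|_\infty}\|\pi'_i - \pi^\star_{i,\tau}\|_1^2 + \frac{d_{\max}\|A\|_\infty}{\tau}\|\pi_j - \pi^\star_{j,\tau}\|_1^2\Big) \\
&\le \|A\|_\infty\Big(\frac{\tau}{d_{\max}\|A\|_\infty}\mathsf{KL}\big(\pi'_i \,\|\, \pi^\star_{i,\tau}\big) + \frac{d_{\max}\|A\|_\infty}{\tau}\mathsf{KL}\big(\pi^\star_{j,\tau} \,\|\, \pi_j\big)\Big).
\end{aligned}
$$

Summing the inequality over $i, j$ gives

$$
\sum_{i\in V}\big[u_{i,\tau}(\pi'_i, \pi_{-i}) - u_{i,\tau}(\pi'_i, \pi^\star_{-i,\tau}) - u_{i,\tau}(\pi^\star_{i,\tau}, \pi_{-i}) + u_{i,\tau}(\pi^\star_{i,\tau}, \pi^\star_{-i,\tau})\big] \le \tau\,\mathsf{KL}\big(\pi' \,\|\, \pi^\star_\tau\big) + \frac{d_{\max}^2\|A\|_\infty^2}{\tau}\mathsf{KL}\big(\pi^\star_\tau \,\|\, \pi\big). \tag{95}
$$

• Regarding the second term, we have

$$
\begin{aligned}
&\sum_{i\in V}\big[u_{i,\tau}(\pi^\star_{i,\tau}, \pi_{-i}) - u_{i,\tau}(\pi^\star_{i,\tau}, \pi^\star_{-i,\tau}) - \tau\big(H(\pi_i) - H(\pi^\star_{i,\tau})\big)\big] \\
&= \sum_{i\in V}\Big[(\pi^\star_{i,\tau})^\top A_i(\pi - \pi^\star_\tau) + \tau\big(\pi_i^\top\log\pi_i - (\pi^\star_{i,\tau})^\top\log\pi^\star_{i,\tau}\big)\Big] \\
&= \sum_{i\in V}\Big[(\pi^\star_{i,\tau})^\top A_i(\pi - \pi^\star_\tau) + \tau\big\langle\pi_i, \log\pi_i - \log\pi^\star_{i,\tau}\big\rangle + \tau\big\langle\pi_i - \pi^\star_{i,\tau}, \log\pi^\star_{i,\tau}\big\rangle\Big] \\
&= \sum_{i\in V}\Big[(\pi^\star_{i,\tau})^\top A_i(\pi - \pi^\star_\tau) + (\pi_i - \pi^\star_{i,\tau})^\top A_i\pi^\star_\tau + \tau\,\mathsf{KL}\big(\pi_i \,\|\, \pi^\star_{i,\tau}\big)\Big] \\
&= \tau\,\mathsf{KL}\big(\pi \,\|\, \pi^\star_\tau\big),
\end{aligned}
\tag{96}
$$

where the penultimate step follows from (92) and the last step invokes Lemma 1.

• Moving to the last term, we have

$$
\begin{aligned}
u_{i,\tau}(\pi^\star_{i,\tau}, \pi^\star_{-i,\tau}) - u_{i,\tau}(\pi'_i, \pi^\star_{-i,\tau})
&= (\pi^\star_{i,\tau} - \pi'_i)^\top A_i\pi^\star_\tau - \tau(\pi^\star_{i,\tau})^\top\log\pi^\star_{i,\tau} + \tau(\pi'_i)^\top\log\pi'_i \\
&= \tau(\pi^\star_{i,\tau} - \pi'_i)^\top\log\pi^\star_{i,\tau} - \tau(\pi^\star_{i,\tau})^\top\log\pi^\star_{i,\tau} + \tau(\pi'_i)^\top\log\pi'_i \\
&= \tau\,\mathsf{KL}\big(\pi'_i \,\|\, \pi^\star_{i,\tau}\big),
\end{aligned}
\tag{97}
$$

where the second line follows again from (92).

Plugging (95), (96) and (97) into (94) gives

$$
\sum_{i\in V}\big[u_{i,\tau}(\pi'_i, \pi_{-i}) - u_{i,\tau}(\pi_i, \pi_{-i})\big] \le \tau\,\mathsf{KL}\big(\pi \,\|\, \pi^\star_\tau\big) + \frac{d_{\max}^2\|A\|_\infty^2}{\tau}\mathsf{KL}\big(\pi^\star_\tau \,\|\, \pi\big).
$$

Taking the maximum over $\pi'$ finishes the proof.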
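Two ingredients of this proof are easy to sanity-check numerically: the identity behind the last term, which states that the regularized payoff lost by deviating from the entropy-regularized best response equals $\tau\,\mathsf{KL}(\pi'_i \,\|\, \pi^\star_{i,\tau})$, and Pinsker's inequality $\|p - q\|_1^2 \le 2\,\mathsf{KL}(p \,\|\, q)$. The sketch below assumes a fixed payoff vector `g` standing in for $A_i\pi^\star_\tau$; the vector, the deviation and the value of $\tau$ are arbitrary illustrative choices.

```python
import math

def softmax(v):
    """Numerically stable softmax."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def kl(p, q):
    """KL(p || q) for strictly positive q."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

tau = 0.3
g = [0.5, -1.0, 0.2]                      # assumed stand-in for the payoff vector A_i pi*_tau
pi_star = softmax([x / tau for x in g])   # entropy-regularized best response to g
pi_dev = [0.2, 0.5, 0.3]                  # arbitrary deviation pi'_i

def reg_payoff(p):
    """Regularized payoff <p, g> + tau * H(p) against the fixed vector g."""
    return sum(a * b for a, b in zip(p, g)) + tau * entropy(p)

gap = reg_payoff(pi_star) - reg_payoff(pi_dev)
l1_sq = sum(abs(a - b) for a, b in zip(pi_dev, pi_star)) ** 2
print(gap, tau * kl(pi_dev, pi_star), l1_sq)
```

The first two printed numbers agree up to rounding, and the third never exceeds twice the KL divergence.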

E.4 PROOF OF LEMMA 4

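Let $\bar{\pi}^{(T)} = \frac{1}{T+1}\sum_{t=0}^{T}\pi^{(t+1)}$; then $\bar{\pi}^{(T)} \in \Delta(S)$. The proof is completed if we can show

$$
\sum_{i\in V}\mathsf{Regret}_{i,\tau}(T+1) \ge \sum_{i\in V}\mathsf{Regret}_{i,\tau}\big(\bar{\pi}_i^{(T)}, T+1\big) \ge 0,
$$

where the first inequality holds trivially since $\mathsf{Regret}_{i,\tau}(T+1) \ge \mathsf{Regret}_{i,\tau}(\pi_i, T+1)$ for any $\pi_i \in \Delta(S_i)$. It then boils down to showing the second inequality of the above relation. From the definition of zero-sum polymatrix games, it holds that

$$
\begin{aligned}
\sum_{i\in V}\sum_{t=0}^{T}\big\langle \bar{\pi}_i^{(T)} - \pi_i^{(t+1)},\, A_i\pi^{(t+1)}\big\rangle
&= \sum_{i\in V}\sum_{t=0}^{T}\big\langle \bar{\pi}_i^{(T)},\, A_i\pi^{(t+1)}\big\rangle \\
&= \sum_{i\in V}\Big\langle \bar{\pi}_i^{(T)},\, A_i\sum_{t=0}^{T}\pi^{(t+1)}\Big\rangle \\
&= (T+1)\sum_{i\in V}\big\langle \bar{\pi}_i^{(T)},\, A_i\bar{\pi}^{(T)}\big\rangle = 0.
\end{aligned}
$$

In addition, applying Jensen's inequality gives

$$
\sum_{t=0}^{T}H\big(\bar{\pi}_i^{(T)}\big) = (T+1)H\big(\bar{\pi}_i^{(T)}\big) \ge \sum_{t=0}^{T}H\big(\pi_i^{(t+1)}\big).
$$

Combining the above two relations yields

$$
\sum_{i\in V}\mathsf{Regret}_{i,\tau}\big(\bar{\pi}_i^{(T)}, T+1\big) \ge \sum_{i\in V}\sum_{t=0}^{T}\Big[\big\langle \bar{\pi}_i^{(T)} - \pi_i^{(t+1)},\, A_i\pi^{(t+1)}\big\rangle + \tau H\big(\bar{\pi}_i^{(T)}\big) - \tau H\big(\pi_i^{(t+1)}\big)\Big] \ge 0,
$$

which concludes the proof.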
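The Jensen step relies only on the concavity of the Shannon entropy: the entropy of an averaged policy is at least the average of the entropies. A minimal numerical illustration follows, with randomly generated policies standing in for the iterates (an assumption made purely for the example).

```python
import math
import random

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def rand_simplex(k, rng):
    w = [rng.random() + 0.05 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(1)
iterates = [rand_simplex(4, rng) for _ in range(10)]   # stand-ins for pi_i^(t+1), t = 0..T
avg = [sum(p[k] for p in iterates) / len(iterates) for k in range(4)]

lhs = len(iterates) * entropy(avg)          # (T + 1) * H(average policy)
rhs = sum(entropy(p) for p in iterates)     # sum of per-iterate entropies
print(lhs, rhs)
```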

E.5 PROOF OF LEMMA 5

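Taking the logarithm of both sides of (7), we have

$$
\log \pi_i^{(t+1)} \overset{1}{=} (1-\eta\tau)\log \pi_i^{(t)} + \eta A_i\pi^{(t-\gamma+1)}.
$$

On the other hand, the definition of QRE in (4) gives

$$
\eta\tau\log \pi^\star_{i,\tau} \overset{1}{=} \eta A_i\pi^\star_\tau.
$$

Subtracting the two equalities and taking the inner product with $\pi_i^{(t-\gamma+1)} - \pi^\star_{i,\tau}$, we get

$$
\big\langle \log \pi_i^{(t+1)} - (1-\eta\tau)\log \pi_i^{(t)} - \eta\tau\log \pi^\star_{i,\tau},\; \pi_i^{(t-\gamma+1)} - \pi^\star_{i,\tau}\big\rangle = \eta\big(\pi_i^{(t-\gamma+1)} - \pi^\star_{i,\tau}\big)^\top A_i\big(\pi^{(t-\gamma+1)} - \pi^\star_\tau\big).
$$

Summing the above equality over $i \in V$ leads to

$$
\big\langle \log \pi^{(t+1)} - (1-\eta\tau)\log \pi^{(t)} - \eta\tau\log \pi^\star_\tau,\; \pi^{(t-\gamma+1)} - \pi^\star_\tau\big\rangle = \eta\sum_{i\in V}\big(\pi_i^{(t-\gamma+1)} - \pi^\star_{i,\tau}\big)^\top A_i\big(\pi^{(t-\gamma+1)} - \pi^\star_\tau\big) = 0,
$$

where the final step results from Lemma 1.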
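The log-domain recursion at the start of this proof can be simulated directly. The sketch below is a simplified illustration rather than the paper's full algorithm: it iterates $\log \pi_i^{(t+1)} \overset{1}{=} (1-\eta\tau)\log \pi_i^{(t)} + \eta A_i\pi^{(t-\gamma+1)}$ exactly as written above, on two-player matching pennies (an assumed test game whose QRE is uniform for every $\tau$), with negative time indices falling back to the initial policy. Function names, initial logits and parameter values are hypothetical.

```python
import math

def normalize_log(logits):
    """Shift logits so that they are log-probabilities (log-sum-exp normalization)."""
    m = max(logits)
    z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - z for x in logits]

def matvec(M, v):
    return [sum(row[k] * v[k] for k in range(len(v))) for row in M]

def kl(p, q):
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

def delayed_update(A1, A2, tau, eta, gamma, steps, init_logits):
    """Run the delayed, entropy-regularized multiplicative update for two players."""
    logs = [normalize_log(list(l)) for l in init_logits]
    history = [[[math.exp(x) for x in l] for l in logs]]  # history[s] = policies at time s
    for t in range(steps):
        s = max(t - gamma + 1, 0)                # delayed index, clipped at the start
        past = history[s]
        g1 = matvec(A1, past[1])                 # payoff vector seen by player 1
        g2 = matvec(A2, past[0])                 # payoff vector seen by player 2
        logs = [normalize_log([(1 - eta * tau) * l + eta * g for l, g in zip(logs[0], g1)]),
                normalize_log([(1 - eta * tau) * l + eta * g for l, g in zip(logs[1], g2)])]
        history.append([[math.exp(x) for x in l] for l in logs])
    return history[-1]

A1 = [[1.0, -1.0], [-1.0, 1.0]]                  # matching pennies, player 1
A2 = [[-1.0, 1.0], [1.0, -1.0]]                  # player 2 receives -A1^T
final = delayed_update(A1, A2, tau=0.5, eta=0.05, gamma=2, steps=3000,
                       init_logits=[[1.0, 0.0], [0.0, 0.5]])
uniform = [0.5, 0.5]
errors = [kl(uniform, p) for p in final]
print(errors)
```

On this small instance and with this small step size the iterates settle at the uniform QRE; the example makes no claim about larger delays or step sizes outside the ranges covered by the theorems.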

E.6 PROOF OF LEMMA 6

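Recall from (71) that

$$
\begin{aligned}
\mathsf{KL}\big(\pi^\star_\tau \,\|\, \pi^{(t+1)}\big)
&= (1-\eta\tau)\,\mathsf{KL}\big(\pi^\star_\tau \,\|\, \pi^{(t)}\big) - (1-\eta\tau)\,\mathsf{KL}\big(\pi^{(t-\gamma+1)} \,\|\, \pi^{(t)}\big) - \mathsf{KL}\big(\pi^{(t+1)} \,\|\, \pi^{(t-\gamma+1)}\big) \\
&\quad + \big\langle \log \pi^{(t-\gamma+1)} - \log \pi^{(t+1)},\; \pi^{(t-\gamma+1)} - \pi^{(t+1)}\big\rangle - \eta\tau\,\mathsf{KL}\big(\pi^{(t-\gamma+1)} \,\|\, \pi^\star_\tau\big).
\end{aligned}
\tag{100}
$$

When $t < \gamma$, we have $\pi_i^{(t-\gamma+1)} = \pi_i^{(0)}$. It follows that $\log \pi_i^{(t-\gamma+1)} = \log \pi_i^{(0)} \overset{1}{=} 0$, and that

$$
\log \pi_i^{(t+1)} \overset{1}{=} (1-\eta\tau)^{t+1}\log \pi_i^{(0)} + \eta\sum_{l=0}^{t}(1-\eta\tau)^{l}A_i\pi^{(t-\gamma-l+1)} \overset{1}{=} \eta\sum_{l=0}^{t}(1-\eta\tau)^{l}A_i\pi^{(0)}.
$$

Therefore, we can bound the term $\big\langle \log \pi^{(t-\gamma+1)} - \log \pi^{(t+1)},\, \pi^{(t-\gamma+1)} - \pi^{(t+1)}\big\rangle$ as

$$
\begin{aligned}
\big\langle \log \pi^{(t-\gamma+1)} - \log \pi^{(t+1)},\; \pi^{(t-\gamma+1)} - \pi^{(t+1)}\big\rangle
&= -\eta\Big\langle \sum_{l=0}^{t}(1-\eta\tau)^{l}A_i\pi^{(0)},\; \pi^{(0)} - \pi^{(t+1)}\Big\rangle \\
&\le \eta(t+1)d_{\max}\|A\|_\infty\,\big\|\pi^{(0)} - \pi^{(t+1)}\big\|_1 \\
&\le 2\eta(t+1)d_{\max}\|A\|_\infty.
\end{aligned}
$$

Plugging the above inequality into (100) leads to

$$
\mathsf{KL}\big(\pi^\star_\tau \,\|\, \pi^{(t+1)}\big) \le (1-\eta\tau)\,\mathsf{KL}\big(\pi^\star_\tau \,\|\, \pi^{(t)}\big) - (1-\eta\tau)\,\mathsf{KL}\big(\pi^{(t-\gamma+1)} \,\|\, \pi^{(t)}\big) - \mathsf{KL}\big(\pi^{(t+1)} \,\|\, \pi^{(t-\gamma+1)}\big) + 2\eta(t+1)d_{\max}\|A\|_\infty.
$$

Applying the above inequality recursively to the iterates $0, 1, \ldots, \gamma-1$, we arrive at

$$
\begin{aligned}
\mathsf{KL}\big(\pi^\star_\tau \,\|\, \pi^{(\gamma)}\big)
&\le (1-\eta\tau)^{\gamma}\,\mathsf{KL}\big(\pi^\star_\tau \,\|\, \pi^{(0)}\big) \\
&\quad - \sum_{l_1=0}^{\gamma-1}(1-\eta\tau)^{\gamma-1-l_1}\Big[(1-\eta\tau)\,\mathsf{KL}\big(\pi^{(l_1-\gamma+1)} \,\|\, \pi^{(l_1)}\big) + \mathsf{KL}\big(\pi^{(l_1+1)} \,\|\, \pi^{(l_1-\gamma+1)}\big)\Big] \\
&\quad + 2\eta\sum_{l_1=0}^{\gamma-1}(1-\eta\tau)^{\gamma-1-l_1}(l_1+1)\,d_{\max}\|A\|_\infty \\
&\le (1-\eta\tau)^{\gamma}\,\mathsf{KL}\big(\pi^\star_\tau \,\|\, \pi^{(0)}\big) \\
&\quad - \sum_{l_1=0}^{\gamma-1}(1-\eta\tau)^{\gamma-1-l_1}\Big[(1-\eta\tau)\,\mathsf{KL}\big(\pi^{(l_1-\gamma+1)} \,\|\, \pi^{(l_1)}\big) + \mathsf{KL}\big(\pi^{(l_1+1)} \,\|\, \pi^{(l_1-\gamma+1)}\big)\Big] \\
&\quad + 2\eta\gamma^2 d_{\max}\|A\|_\infty.
\end{aligned}
$$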
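The final step replaces the weighted sum $\sum_{l_1=0}^{\gamma-1}(1-\eta\tau)^{\gamma-1-l_1}(l_1+1)$ by $\gamma^2$. Since every weight is at most one, the sum is at most $\gamma(\gamma+1)/2 \le \gamma^2$ for $\gamma \ge 1$. A short check of this arithmetic, with illustrative values of $\eta$ and $\tau$ chosen as assumptions:

```python
def weighted_delay_sum(eta, tau, gamma):
    """Sum over l = 0..gamma-1 of (1 - eta*tau)^(gamma - 1 - l) * (l + 1)."""
    return sum((1 - eta * tau) ** (gamma - 1 - l) * (l + 1) for l in range(gamma))

# Illustrative parameters only.
eta, tau = 0.01, 0.5
sums = {g: weighted_delay_sum(eta, tau, g) for g in range(1, 20)}
for g, s in sums.items():
    print(g, s, g * g)
```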

E.7 PROOF OF LEMMA 7

Taking logarithm on the both sides of ( 77) and ( 78), we get η log π (t) i -log π (t-1) i 1 = η (t) i log π (t) i -log π (t-1) i , or equivalently log π (t) i 1 = η η (t) i log π (t) i + 1 - η η (t) i log π (t-1) i . Taking inner product with π ⋆ i,τ -π (t) i , log π (t) i - η η (t) i log π (t) i -1 - η η (t) i log π (t-1) i , π ⋆ i,τ -π (t) i = 0. By definition of KL divergence, we have log π (t) i - η η (t) i log π (t) i -1 - η η (t) i log π (t-1) i , π ⋆ i,τ = (log π (t) i -log π ⋆ i,τ ) - η η (t) i (log π (t) i -log π ⋆ i,τ ) -1 - η η (t) i (log π (t-1) i -log π ⋆ i,τ ), π ⋆ i,τ = -KL π ⋆ i,τ ∥ π (t) i + 1 - η η (t) i KL π ⋆ i,τ ∥ π (t-1) i + η η (t) i KL π ⋆ i,τ ∥ π (t) i , log π (t) i - η η (t) i log π (t) i -1 - η η (t) i log π (t-1) i , π (t) i = η η (t) i KL π (t) i ∥ π (t) i + 1 - η η (t) i KL π (t) i ∥ π (t-1) i . Taken together, we get KL π ⋆ i,τ ∥ π i + η η (t) i KL π (t) i ∥ π (t) i + 1 - η η (t) i KL π (t) i ∥ π (t-1) i = 1 - η η (t) i KL π ⋆ i,τ ∥ π (t-1) i + η η (t) i KL π ⋆ i,τ ∥ π (t) i . ( ) On the other hand, taking logarithm of (78) and making inner product with π (κ (t) i ) i -π ⋆ i,τ gives log π (t) i -(1 -η (t) i τ ) log π (t-1) i -η (t) i τ log π ⋆ i,τ , π (κ (t) i ) i -π ⋆ i,τ = η (t) i (π (κ (t) i ) i -π ⋆ i,τ ) ⊤ A i (π (κ (t) i ) -π ⋆ τ ). Following a similar discussion in (19) gives KL π ⋆ i,τ ∥ π (t) i = (1 -η (t) i τ )KL π ⋆ i,τ ∥ π (t-1) i -(1 -η (t) i τ )KL π (κ (t) i ) i ∥ π (t-1) i -η (t) i τ KL π (κ (t) i ) i ∥ π ⋆ i,τ -KL π (t) i ∥ π (κ (t) i ) i + log π (κ (t) i ) i -log π (t) i , π (κ (t) i ) i -π (t) i -η (t) i (π (κ (t) i ) i -π ⋆ i,τ ) ⊤ A i (π (κ (t) i ) -π ⋆ τ ). 
Plugging the above equation into (102), KL π ⋆ i,τ ∥ π (t) i + η η (t) i KL π (t) i ∥ π (t) i + 1 - η η (t) i KL π (t) i ∥ π (t-1) i = (1 -ητ )KL π ⋆ i,τ ∥ π (t-1) i -η(π (κ (t) i ) i -π ⋆ i,τ ) ⊤ A i (π (κ (t) i ) -π ⋆ τ ) - η η (t) i (1 -η (t) i τ )KL π (κ (t) i ) i ∥ π (t-1) i + η (t) i τ KL π (κ (t) i ) i ∥ π ⋆ i,τ + KL π (t) i ∥ π (κ (t) i ) i + η η (t) i log π (κ (t) i ) i -log π (t) i , π (κ (t) i ) i -π (t) i . Rearranging the terms finishes the proof.

E.8 PROOF OF LEMMA 8

For notational convenience, we set ϕ (t) i = 1 - η η (t) i π (t) i -π (t-1) i 1 + η η (t) i π (κ (t) i ) i -π (t-1) i 1 + π (t) i -π (κ (t) i ) i 1 + π (t) i -π (t) i 1 for all i ∈ V, t ≥ 0. By triangular inequality, we have ϕ (t) i ≥ π (t) i -π (t-1) i 1 . In addition, we denote by t 1 ∧ t 2 := min{t 1 , t 2 } and t 1 ∨ t 2 := max{t 1 , t 2 }. For 0 < t 1 < t 2 , it holds that π (κ  ( (ϕ (t) i ) 2 = 1 - η η (t) i 1/2 • 1 - η η (t) i 1/2 π (t) i -π (t-1) i 1 + η η (t) i (1 -η (t) i τ ) -1 1/2 • η η (t) i (1 -η (t) i τ ) 1/2 π (κ (t) i ) i -π (t-1) i 1 + η η (t) i 1/2 • η η (t) i 1/2 π (t) i -π (κ (t) i ) i 1 + π (t) i -π (t) i 1 2 (i) ≤ 1 - η η (t) i + η η (t) i 2 + (1 -η (t) i τ ) -1 1 - η η (t) i π (t) i -π (t-1) i 2 1 + η η (t) i (1 -η (t) i τ ) π (κ (t) i ) i -π (t-1) i 2 1 + π (t) i -π (κ (t) i ) i 2 1 + π (t) i -π i 2 1 (ii) ≤ 2 2 + (1 -η (t) i τ ) -1 1 - η η (t) i KL π (t) i ∥ π (t-1) i + η η (t) i (1 -η (t) i τ )KL π (κ (t) i ) i ∥ π (t-1) i + KL π (t) i ∥ π (κ (t) i ) i + KL π (t) i ∥ π (t) i (iii) ≤ 8ψ (t) i , where (i) applies Cauchy-Schwarz inequality, (ii) invokes Pinsker's inequality and (iii) is due to η (t) i τ ≤ (γ + 1)ητ ≤ 1/2. Combining the above two inequalities finishes the proof. E.9 PROOF OF LEMMA 9 We start with verifying the claim (86). Recall that ψ (t) i := 1 - η η (t) i KL π (t) i ∥ π (t-1) i + η η (t) i (1 -η (t) i τ )KL π (κ (t) i ) i ∥ π (t-1) i + KL π (t) i ∥ π (κ (t) i ) i + KL π (t) i ∥ π (t) i . We introduce the following standard Lemma (see e.g., (Cen et al., 2020, Appendix A.2 )), which allows us to bound control KL π i ∥ π ′ i properly: Lemma 11. Given π i , π ′ i ∈ ∆(S i ) and w ∈ R |Si| with log π i 1 = log π ′ i + w, we have KL π i ∥ π ′ i ≤ log π i -log π ′ i ∞ ≤ 2 w ∞ . Therefore, it suffices to figure out the terms log π . 
The following equations follow directly from (77) and (78):    log π (t) i -log π (t-1) i 1 = η([A i π (κ (t) i ) ] k -τ log π (t-1) i ) log π (t) i -log π (t) i 1 = (η -η (t) i )([A i π (κ (t) i ) ] k -τ log π (t-1) i ) . In addition, we have the following bound w.r.t. the order of log π (t-1) i ∞ , which we shall establish momentarily. τ log π (t-1) i ∞ ≤ τ log |S i | + 2d max ∥A∥ ∞ . ( ) This taken together with Lemma 11 yields KL π (t) i ∥ π (t-1) i ≤ η(3d max ∥A∥ ∞ + τ log |S i |) KL π (t) i ∥ π (t) i ≤ ( η (t) i -η)(3d max ∥A∥ ∞ + τ log |S i |) . ( ) • Bounding KL π (t) i ∥ π (κ (t) i ) i . When κ (t) i ≥ 1, we recall from (80) that: log π (κ (t) i ) i -log π (t) i 1 = η (t) i A i (π (κ (κ (t) i -1) i ) -π (κ (t) i ) ) + t-1 l=κ (t) i (1 -η (t) i τ )(1 -ητ ) t-1-l A i (π (κ (κ (t) i -1) i ) -π (κ (l) i ) ) , which leads to a crude bound KL π (κ (t) i ) i ∥ π (t) i ≤ η (t) i d max ∥A∥ ∞ (t -κ (t) i + 1) ≤ η (t) i d max ∥A∥ ∞ (γ + 1). κ (t) i = 0, we have log π (κ (t) i ) i -log π (t) i 1 = -log π (t) i 1 = -(1 -η (t) i τ )(1 -ητ ) t-1 log π (0) -η (t) i A i π (κ (t) i ) + t-1 l=κ (t) i +1 (1η (t) i τ )(1 -ητ ) t-1-l A i π (κ (l) i ) 1 = -η (t) i A i π (κ (t) i ) + t-1 l=κ (t) i +1 (1η (t) i τ )(1 -ητ ) t-1-l A i π (κ (l) i ) , which yields KL π (κ (t) i ) i ∥ π (t) i ≤ η (t) i d max ∥A∥ ∞ t ≤ η (t) i d max ∥A∥ ∞ (γ + 1). • Bounding KL π (κ (t) i ) i ∥ π (t-1) i . Note that log π (κ (t) i ) i -log π (t-1) i = (log π (κ (t) i ) i -log π (t) i ) + (log π (t) i -log π (t) i ) + (log π (t) i -log π (t-1) i ). This yields, by equations ( 107), ( 110), ( 111) and associated bounds, KL π (κ (t) i ) i ∥ π (t-1) i ≤ η (t) i d max ∥A∥ ∞ (γ + 1) + η (t) i (3d max ∥A∥ ∞ + τ log |S i |). 
Putting all pieces together, we conclude that ψ (t) i = 1 - η η (t) i KL π (t) i ∥ π (t-1) i + η η (t) i (1 -η (t) i τ )KL π (κ (t) i ) i ∥ π (t-1) i + KL π (t) i ∥ π (κ (t) i ) i + KL π (t) i ∥ π (t) i ≤ 3η(3d max ∥A∥ ∞ + τ log |S i |) + 2ηd max ∥A∥ ∞ (γ + 1) = η(d max ∥A∥ ∞ (2γ + 11) + 3τ log |S i |). It remains to prove the claim (87): KL π ⋆ i,τ ∥ π (2γ) i = KL π ⋆ i,τ ∥ π (0) i + π ⋆ i,τ , log π (0) i -log π (2γ) i ≤ KL π ⋆ i,τ ∥ π (0) i + log π (0) i -log π (2γ) i ∞ ≤ KL π ⋆ i,τ ∥ π (0) i + 2 η 2γ l=1 (1 -ητ ) 2γ-l A i π (κ (l) i ) ∞ ≤ KL π ⋆ i,τ ∥ π (0) i + 4ηd max ∥A∥ ∞ γ, where the third step results from log π (2γ) i 1 = (1 -ητ ) 2γ log π (0) i + η 2γ l=1 (1 -ητ ) 2γ-l A i π (κ (l) i ) and Lemma 11. Proof of the claim (108). First, we prove by induction that for any k, l ∈ S i , log π (t) i (k) -log π (t) i (l) ≤ 2d max ∥A∥ ∞ τ , ∀t ≥ 0. ( ) Note that the claim trivially holds for t = 0 with the uniform initialization π (0) i = 1 |Si| 1, ∀i ∈ V . Assume that (112) holds for all t ′ ≤ t -1. Note that log π (t) i 1 = (1ητ ) log π (t-1) i + ηA i π (κ (t) i ) , we have log π (t) i (k) -log π (t) i (l) = (1 -ητ ) log π (t-1) i (k) -log π (t-1) i (l) + η [A i π (κ (t) i ) ] k -[A i π (κ (t) i ) ] l ≤ (1 -ητ ) 2d max ∥A∥ ∞ τ + 2ηd max ∥A∥ ∞ = 2d max ∥A∥ ∞ τ , where the second line follows from the induction hypothesis (112). This completes the induction at the t-th iteration. It follows for all i ∈ V and t ≥ 0, log π (t) i (l) ≥ log max k∈Si π (t) i (k) - 2d max ∥A∥ ∞ τ ≥ -log |S i | - 2d max ∥A∥ ∞ τ .
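The induction behind claim (112) tracks a scalar bound on the log-ratios: if the bound at time $t-1$ is $b$, the update gives $(1-\eta\tau)\,b + 2\eta d_{\max}\|A\|_\infty$ at time $t$, and the value $2d_{\max}\|A\|_\infty/\tau$ is a fixed point of this map. The sketch below propagates the bound from the uniform initialization (where all log-ratios vanish) and confirms it never exceeds the cap. The function name and parameter values are assumptions for illustration.

```python
def propagate_bound(eta, tau, d_max, a_inf, steps):
    """Iterate b <- (1 - eta*tau) * b + 2 * eta * d_max * a_inf starting from b = 0."""
    cap = 2.0 * d_max * a_inf / tau
    bound = 0.0                      # uniform initialization: all log-ratios are zero
    trace = [bound]
    for _ in range(steps):
        bound = (1 - eta * tau) * bound + 2 * eta * d_max * a_inf
        trace.append(bound)
    return cap, trace

# Illustrative parameters only.
cap, trace = propagate_bound(eta=0.01, tau=0.5, d_max=3, a_inf=1.0, steps=5000)
print(cap, trace[-1])
```

The trace increases monotonically towards the cap without crossing it, which is the scalar content of the induction step.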



Figure 1: $\mathsf{KL}(\pi^\star_\tau \,\|\, \pi^{(t)})$ of single-timescale and two-timescale OMWU with respect to different values of learning rate and delay. (a): performance of single-timescale OMWU in the synchronous setting and asynchronous setting. (b) & (c): performance of the two methods after 5000 iterations under various choices of $\eta$ and $\bar{\eta}$, with $\bar{\eta}$ fixed to $\bar{\eta} = \tau^{-1}(1 - (1-\eta\tau)^{\gamma+1})$ in (b) and $\eta$ fixed to 0.001 in (c).

Figure 2: $\mathsf{KL}(\pi^\star_\tau \,\|\, \pi^{(t)})$ with respect to iteration count $t$ of single-timescale and two-timescale OMWU under various asynchronous settings. (a): random delays bounded by $\gamma = 25$. (b): constant delays $\gamma = 50$. (c): permuted feedback with delay bounded by $\gamma = 25$.

B Proof for single-timescale OMWU (Section 3)
  B.1 Proof of Theorem 1
  B.2 Proof of Theorem 2
C Regret analysis of single-timescale OMWU
  C.1 Proof of Theorem 5
  C.2 Proof of Theorem 6
D Proof for two-timescale OMWU (Section 4)
  D.1 Proof of Theorem 3
  D.2 Proof of Theorem 4
E Proof of auxiliary lemmas
  E.1 Proof of Lemma 1
  E.2 Proof of Lemma 2
  E.3 Proof of Lemma 3
  E.4 Proof of Lemma 4
  E.5 Proof of Lemma 5
  E.6 Proof of Lemma 6
  E.7 Proof of Lemma 7
  E.8 Proof of Lemma 8
  E.9 Proof of Lemma 9

. It then boils down to show the second inequality of the above relation. From the definition of zero-sum polymatrix games, it holds that i∈V T t=0 π

In addition, the mapping k → ν j (κ (k) i ) is injective when k ≥ γ (cf. Assumption 2 and 3). It follows that

We denote the maximum degree of the graph by $d_{\max} = \max_{i\in V}\deg_i$, where $\deg_i$ is the degree of player $i$. Moreover, we denote by $S_{\max} = \max_{i}|S_i|$ the maximum size of the action space over all players.

≤ t ≤ ν i (t) ≤ t + γ for all i ∈ V , t ≥ 0, the first term can be bounded by


ACKNOWLEDGEMENT S. Cen and Y. Chi are supported in part by the grants ONR N00014-19-1-2404, NSF CCF-1901199, CCF-2106778 and CNS-2148212. S. Cen is also gratefully supported by Wei Shen and Xuehong Zhang Presidential Fellowship, and Nicholas Minnici Dean's Graduate Fellowship in Electrical and Computer Engineering at Carnegie Mellon University. R. Ao is supported by the Elite Undergraduate Training Program of School of Mathematical Sciences at Peking University.

