STOCHASTIC NO-REGRET LEARNING FOR GENERAL GAMES WITH VARIANCE REDUCTION

Abstract

We show that a stochastic version of optimistic mirror descent (OMD), a variant of mirror descent with recency bias, converges fast in general games. More specifically, with our algorithm, the individual regret of each player vanishes at a rate of $O(1/T^{3/4})$ and the sum of all players' regrets vanishes at a rate of $O(1/T)$, where $T$ is the number of interaction rounds. This improves upon the $O(1/\sqrt{T})$ convergence rate of prior stochastic algorithms. Because stochastic methods have a lower per-round computational cost, we significantly improve on the time complexity of deterministic algorithms for approximating coarse correlated equilibrium. To achieve this lower time complexity, we equip the stochastic version of OMD in (AM21) with a novel low-variance Monte-Carlo estimator. Our algorithm extends previous works (AM21; CJST19) from two-player zero-sum games to general games.

1. INTRODUCTION

How does a player in a game interact with others and selfishly maximize its own utility? This is a central problem in online learning and game theory, with close connections to economics, auction design, and machine learning. The study of this problem was pioneered by (Bro49; Rob51). Robinson (Rob51) shows that fictitious play asymptotically converges to Nash equilibrium in two-player zero-sum games. However, its convergence rate is exponentially slow, and it may not even converge in non-zero-sum games (Sha64).

Another natural choice for each player is to use a no-regret learning algorithm. With well-known families of no-regret learning algorithms, e.g., mirror descent (NY83) and follow-the-regularized-leader (KV05), the average regret of each player vanishes at a rate of $O(1/\sqrt{T})$, where $T$ is the number of interaction rounds. This regret bound implies an $O(1/\sqrt{T})$ convergence rate to coarse correlated equilibrium in general games (or to Nash equilibrium in two-player zero-sum games). Notably, Chen and Peng (CP20) show that the convergence rate of these algorithms is $\Omega(1/\sqrt{T})$.

Players can do even better with special no-regret algorithms tailored for games. The most representative one is optimistic mirror descent (OMD), a variant of mirror descent with recency bias. Syrgkanis et al. (SALS15) and Rakhlin et al. (RS13) show that OMD approaches optimal social welfare (or equivalently, minimizes the sum of all players' regrets) at a rate of $O(1/T)$ and minimizes each player's individual regret at a rate of $O(1/T^{3/4})$. Several works (CP20; HAM21; DFG21) improve the results in (SALS15) under different settings or assumptions. Remarkably, Daskalakis et al. (DFG21) improve the convergence rate of the players' individual regret under OMD to $O(\mathrm{poly}\log T/T)$ in general games.

However, the computational cost of OMD, as of other deterministic no-regret algorithms, can be unmanageable, because each player needs to compute the exact loss vector to update its strategy. The time complexity of computing this exact loss vector is, in the worst case, exponential in the number of players in the game. One standard way to accelerate the computation is to estimate the loss vector with Monte-Carlo methods. But a Monte-Carlo estimator with uncontrolled variance immediately degrades the convergence rate to $O(1/\sqrt{T})$. To alleviate the effect of the estimator's variance, Carmon et al. (CJST19) and Alacaoglu et al. (AM21) propose variance-reduced stochastic no-regret algorithms with a convergence rate of $O(1/T)$ for two-player zero-sum games. As a result, they improve the time complexity of computing an $\epsilon$-Nash equilibrium in two-player zero-sum games from $O(\mathrm{Cost}/\epsilon)$ for deterministic algorithms to $O(\mathrm{Cost} + \sqrt{\mathrm{Cost}}/\epsilon)$ (some lower-order terms are omitted), where Cost is the time complexity of computing the loss vector.

While Carmon et al. (CJST19) and Alacaoglu et al. (AM21) make a huge step towards developing efficient stochastic algorithms for games, their algorithms are tailored to the simplest setting of two-player zero-sum games and do not cover more practical settings, such as auctions, which may involve multiple players and need not be zero-sum. One crucial ingredient of the algorithms in Carmon et al. (CJST19) and Alacaoglu et al. (AM21) is a stochastic loss estimator with small variance. However, the time complexity of calculating this estimator is $O(A^{N-1})$, where $N$ is the number of players, which is exponentially large in general games. The high complexity of the estimator is a major obstacle to developing efficient stochastic algorithms for general games.

Contributions. We consider general normal-form games with an arbitrary number of players. Compared to the two-player zero-sum case, this setting is more challenging and of greater practical significance. We show that in general games, a stochastic version of OMD converges to the optimal social welfare (or equivalently, minimizes the sum of all players' regrets) at a rate of $\tilde{O}(1/T)$ and minimizes the individual regret at a rate of $O(1/T^{3/4})$, in contrast to the $O(1/\sqrt{T})$ convergence rate of existing stochastic algorithms. Because stochastic methods have a lower per-round computational cost, this significantly improves the time complexity of approximating coarse correlated equilibrium in general games. Table 1 compares the time complexity of our algorithm with prior works. Specifically, our result improves on previous works for weak $\epsilon$-CCE when $\mathrm{Cost} \ge NA$ and for strong $\epsilon$-CCE when $\mathrm{Cost} \ge NA^2/\epsilon$.

To achieve the above regret bounds, we make two main technical contributions. Firstly, we extend the theoretical framework for analyzing the regret of stochastic OMD in (AM21) from two-player zero-sum games to general games. Secondly, we propose a novel low-variance Monte-Carlo estimator for general games. The computational complexity of this estimator is exponentially lower than that of Carmon et al. (CJST19) and Alacaoglu et al. (AM21), while its variance is only slightly larger. The stochastic OMD algorithm equipped with our novel estimator achieves the above results.

The rest of the paper is organized as follows. In Section 2, we discuss prior works related to this problem. In Section 3, we provide the necessary preliminaries on games, coarse correlated equilibrium and optimistic mirror descent. In Section 4, we introduce our algorithm and present a general regret upper bound in Theorem 1. In Section 5, we introduce our low-variance Monte-Carlo estimator and analyze its variance in Lemma 3. In Section 6, by combining the results in Theorem 1 and Lemma 3, we present our final regret bounds in Theorems 2 and 3, as well as the time complexity to approximate coarse correlated equilibrium in Corollaries 1 and 2.

2. RELATED WORK

Comparisons to existing algorithms. Table 1 compares the time complexity of our algorithm for computing an $\epsilon$-coarse correlated equilibrium in general games (and an $\epsilon$-Nash equilibrium in two-player zero-sum games) against prior no-regret algorithms. The time complexity is determined by two terms: the convergence rate (or the regret) and the computational cost in each round. Deterministic algorithms (PSS21; DFG21; SALS15) converge fast, but have a relatively high per-round time complexity since they have to compute the exact loss in each round. Stochastic algorithms exploit the Monte-Carlo approach to accelerate the computation of the loss. However, the variance of the estimated loss may slow down the convergence rate. To alleviate the effect of the variance, Carmon et al. (CJST19) and Alacaoglu et al. (AM21) propose variance-reduced stochastic algorithms for two-player zero-sum games.

| Algorithm | $\epsilon$-Nash equilibrium (two-player zero-sum) | Weak $\epsilon$-CCE (general games) | Strong $\epsilon$-CCE (general games) |
| (D) (SALS15) | $\tilde{O}(\mathrm{Cost}/\epsilon)$ | $\tilde{O}(N^3\,\mathrm{Cost}/\epsilon)$ | $\tilde{O}(N^{3/2}\,\mathrm{Cost}/\epsilon^{4/3})$ |
| (D) (DFG21) | $\tilde{O}(\mathrm{Cost}/\epsilon)$ | $\tilde{O}(N^3\,\mathrm{Cost}/\epsilon)$ | $\tilde{O}(N^2\,\mathrm{Cost}/\epsilon)$ |
| (D) (PSS21) | $\tilde{O}(\mathrm{Cost}/\epsilon)$ | $\tilde{O}(N^3\,\mathrm{Cost}/\epsilon)$ | $\tilde{O}(N^2\,\mathrm{Cost}/\epsilon)$ |
| (S) (AM21) | $\tilde{O}(\mathrm{Cost} + \sqrt{A\,\mathrm{Cost}}/\epsilon)$ | - | - |
| (S) (CJST19) | $\tilde{O}(\mathrm{Cost} + \sqrt{A\,\mathrm{Cost}}/\epsilon)$ | - | - |
| (S) Ours | $\tilde{O}(\mathrm{Cost} + \sqrt{A\,\mathrm{Cost}}/\epsilon)$ | $\tilde{O}(\mathrm{Cost} + N^{7/2}\sqrt{A\,\mathrm{Cost}}/\epsilon)$ | $O(\mathrm{Cost} + \mathrm{Cost}^{2/3} N^{7/3} A^{2/3}/\epsilon^{4/3})$ |

Table 1: Comparisons of time complexity to compute $\epsilon$-CCE for general games and $\epsilon$-Nash equilibrium for two-player zero-sum games. Cost denotes the time complexity of computing the loss vector. $N$ is the number of players and $A$ is the number of actions for each player. (S) means the algorithm is stochastic and (D) means the algorithm is deterministic. For stochastic algorithms, the time complexity is the expected running time to achieve an expected approximation error, which directly follows (AM21). $\tilde{O}$ hides factors that are polynomial in $\log(1/\epsilon)$ and $\log(A)$.

Variance reduction. Variance reduction is one of the most useful techniques to accelerate stochastic algorithms (see (GSBR20) for a comprehensive survey). Typically, when optimizing the finite-sum problem $\min_x F(x) = \sum_{i=1}^N F_i(x)$, instead of estimating the gradient by $\nabla F_i(x)$ as in Stochastic Gradient Descent (SGD), the variance reduction method proposes to use $\tilde{\nabla} F(x) = \nabla F(w_k) + \nabla F_i(x) - \nabla F_i(w_k)$, where $w_k$ is called the "snapshot". In many cases, e.g., when the $F_i$'s are convex, sampling $i$ from a uniform distribution is sufficient to reduce the variance. When dealing with games, it turns out that one has to be meticulous in designing a low-variance sampling distribution, even when the game is two-player zero-sum (CJST19; AM21).

3. PRELIMINARIES

Basics of game. We consider general games with $N$ players. The action space of each player $i \in [N]$ is $\mathcal{A}_i$. Denote by $A = \max_i |\mathcal{A}_i|$ the cardinality of the largest action space. The joint action space of all players is $\mathcal{A} = \mathcal{A}_1 \times \mathcal{A}_2 \times \cdots \times \mathcal{A}_N$. For simplicity, let $\mathcal{A}_{-i} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_{i-1} \times \mathcal{A}_{i+1} \times \cdots \times \mathcal{A}_N$ be the joint action space of all players except player $i \in [N]$. The losses of the players are specified by the functions $F_1, F_2, \ldots, F_N : \mathcal{A} \to [0, 1]$, which map the joint action space to a real value. Specifically, if each player $j$ selects action $a_j \in \mathcal{A}_j$, then $F_i(a)$ is the loss of player $i$, where $a := (a_1, a_2, \ldots, a_N)$.

A mixed strategy $\sigma_i$ is a probability distribution over $\mathcal{A}_i$. We say player $i$ plays according to $\sigma_i$ if it selects action $a_i \in \mathcal{A}_i$ with probability $\sigma_i(a_i)$. For any strategy profile $\sigma := (\sigma_i)_{i \in [N]}$, let $\sigma_{-i} := (\sigma_1, \ldots, \sigma_{i-1}, \sigma_{i+1}, \ldots, \sigma_N)$ denote the strategy profile $\sigma$ after removing $\sigma_i$. For convenience, let $(\sigma'_i, \sigma_{-i})$ denote the strategy profile $\sigma$ after replacing $\sigma_i$ with $\sigma'_i$. Similarly, given an action profile $a := (a_i)_{i \in [N]}$, let $a_{-i} := (a_1, \ldots, a_{i-1}, a_{i+1}, \ldots, a_N)$ denote the action profile after removing $a_i$, and let $(a'_i, a_{-i})$ be the action profile $a$ after replacing $a_i$ with $a'_i$. With a slight abuse of notation, for a strategy profile $\sigma$, let $F_i(\sigma) := \mathbb{E}_{a \sim \sigma} F_i(a)$ be the expected loss of player $i$ if each player $j \in [N]$ plays according to $\sigma_j$.
For convenience, define the vector $F_i(a_{-i}) \in [0, 1]^{|\mathcal{A}_i|}$ with $[F_i(a_{-i})](a'_i) = F_i((a'_i, a_{-i}))$, representing the loss of player $i$ when each player $j \ne i$ selects $a_j$ and player $i$ selects $a'_i \in \mathcal{A}_i$. Similarly, define the vector $F_i(\sigma_{-i}) \in [0, 1]^{|\mathcal{A}_i|}$ with $[F_i(\sigma_{-i})](a'_i) = \mathbb{E}_{a_{-i} \sim \sigma_{-i}} F_i((a'_i, a_{-i}))$, representing the expected loss of player $i$ if each player $j \ne i$ plays according to $\sigma_j$ and player $i$ selects action $a'_i$. Further, denote by Cost the time complexity of computing the vector $F_i(\sigma_{-i})$.

No-regret learning. In online learning, $N$ players play the game for $T$ rounds. In round $k \in [T]$, player $i$ plays according to the strategy $\sigma^k_i$ and suffers the expected loss $\langle F_i(\sigma^k_{-i}), \sigma^k_i \rangle$. Each player aims to minimize its cumulative loss, which is equivalent to minimizing its regret

$$\max_{\sigma_i} R_i(\sigma_i) := \max_{\sigma_i} \sum_{k=1}^{T} \langle F_i(\sigma^k_{-i}), \sigma^k_i - \sigma_i \rangle,$$

where $R_i(\sigma_i)$ is the cumulative loss difference between the adopted strategies and the fixed strategy $\sigma_i$. Given the strategies in rounds $1, \ldots, k$, the optimistic mirror descent (OMD) (RS13; SALS15) method calculates the strategy in round $k + 1$ as

$$\sigma^{k+1}_i = \arg\min_{\sigma_i \in \Delta(\mathcal{A}_i)} D(\sigma_i, {\sigma'}^{k+1}_i), \quad \text{where } \nabla h({\sigma'}^{k+1}_i) = \nabla h(\sigma^k_i) - \tau \big(2 F_i(\sigma^k_{-i}) - F_i(\sigma^{k-1}_{-i})\big), \qquad (1)$$

where $\tau$ is the step size and $D(x, y)$ is the Bregman divergence induced by some mirror map $h(\cdot)$. Without loss of generality, we mainly consider two common mirror maps in this paper: the negative entropy $h_1(x) = \sum_{a=1}^{d} x(a) \log x(a)$ and the squared $\ell_2$-norm $h_2(x) = \sum_{a=1}^{d} x^2(a)$.

Coarse correlated equilibrium (CCE).

A correlated strategy $\zeta$ is a distribution over the joint action space $\mathcal{A}$. We call $\zeta$ a coarse correlated equilibrium (CCE) if no player can benefit from unilaterally deviating from $\zeta$ given that the others take actions according to $\zeta$. There are two versions of approximate CCE: one corresponds to the social welfare and is referred to as weak $\epsilon$-CCE; the other corresponds to the individual loss and is referred to as strong $\epsilon$-CCE. Intuitively, weak $\epsilon$-CCE states that the difference between the expected loss $\mathbb{E}_{a \sim \zeta} F_i(a)$ from following $\zeta$ and the least expected loss $\min_{a^*_i} \mathbb{E}_{a \sim \zeta} F_i((a^*_i, a_{-i}))$ from deviating from $\zeta$, averaged over all players $i \in [N]$, is no more than $\epsilon$. Strong $\epsilon$-CCE requires that the maximum of this difference over all players $i \in [N]$ is no more than $\epsilon$. The formal definitions are as follows.

Definition 1. A correlated strategy $\zeta$ is a weak $\epsilon$-CCE (corresponding to social welfare) if

$$\sum_{i=1}^{N} \Big( \mathbb{E}_{a \sim \zeta} F_i(a) - \min_{a^*_i} \mathbb{E}_{a \sim \zeta} F_i((a^*_i, a_{-i})) \Big) \le \epsilon,$$

and it is a strong $\epsilon$-CCE (corresponding to individual loss) if

$$\max_{i \in [N]} \Big( \mathbb{E}_{a \sim \zeta} F_i(a) - \min_{a^*_i} \mathbb{E}_{a \sim \zeta} F_i((a^*_i, a_{-i})) \Big) \le \epsilon.$$

Here, for convenience, the definition of weak $\epsilon$-CCE uses the sum over all players, i.e., the average difference multiplied by $N$. Thus a strong $\epsilon$-CCE is a weak $N\epsilon$-CCE. Strong CCE ensures that no player suffers a large loss, and weak CCE guarantees convergence to the globally optimal strategy within a large class of smooth games (SALS15). Since this work mainly focuses on stochastic methods, we study the expected performance of the generated strategy. Specifically, a randomly generated correlated strategy $\zeta$ is called a weak $\epsilon$-CCE if

$$\sum_{i=1}^{N} \Big( \mathbb{E}_{\zeta, a \sim \zeta} F_i(a) - \min_{a^*_i} \mathbb{E}_{\zeta, a \sim \zeta} F_i((a^*_i, a_{-i})) \Big) \le \epsilon$$

and a strong $\epsilon$-CCE if

$$\max_{i \in [N]} \Big( \mathbb{E}_{\zeta, a \sim \zeta} F_i(a) - \min_{a^*_i} \mathbb{E}_{\zeta, a \sim \zeta} F_i((a^*_i, a_{-i})) \Big) \le \epsilon,$$

where the expectation is additionally taken over the randomness of the generated strategy $\zeta$.

Connection between no-regret learning and CCE. The following Lemma 1 shows a well-known connection between no-regret learning and CCE.

Lemma 1. Let $\sigma^1, \ldots, \sigma^T$ denote an arbitrary collection of strategy profiles and let $\zeta(a) := \frac{1}{T} \sum_{k=1}^{T} \prod_{i=1}^{N} \sigma^k_i(a_i)$. It holds that

$$\mathbb{E}_{a \sim \zeta} F_i(a) - \min_{a^*_i} \mathbb{E}_{a \sim \zeta} F_i((a^*_i, a_{-i})) \le \max_{\sigma_i} R_i(\sigma_i) / T, \quad \forall i \in [N].$$

Lemma 1 shows that the running time to approximate CCE depends on both the regret and the running time in each round. Although deterministic algorithms such as OMD enjoy excellent regret bounds, each of their rounds is computationally expensive. In the next sections, we show how to use the Monte-Carlo method to accelerate the computation without significantly hurting the regret bound.
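Lemma 1 can be checked numerically on a small instance. The sketch below is only an illustration: the game sizes, the random losses, and all helper names (`cce_gap`, `loss_vector`, `max_regret`) are assumptions introduced here, and the strategy profiles are arbitrary random ones rather than the output of any learning algorithm. It builds $\zeta$ as the average of product distributions and compares the deviation gap of each player with its regret divided by $T$.

```python
import itertools
import math
import random

rng = random.Random(0)

def random_simplex(n):
    """Draw a strictly positive probability vector of length n."""
    raw = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

# Toy instance (sizes and losses are made up for illustration).
sizes = [2, 3, 2]                       # |A_1|, |A_2|, |A_3|
N, T = len(sizes), 5
joint = list(itertools.product(*[range(s) for s in sizes]))
F = [{a: rng.random() for a in joint} for _ in range(N)]   # F[i][a] in [0, 1]
profiles = [[random_simplex(s) for s in sizes] for _ in range(T)]

# zeta(a) = (1/T) * sum_k prod_i sigma^k_i(a_i)
zeta = {
    a: sum(math.prod(p[j][a[j]] for j in range(N)) for p in profiles) / T
    for a in joint
}

def cce_gap(i):
    """E_{a~zeta} F_i(a) - min_{a_i*} E_{a~zeta} F_i((a_i*, a_{-i}))."""
    on_path = sum(zeta[a] * F[i][a] for a in joint)
    best_deviation = min(
        sum(zeta[a] * F[i][a[:i] + (b,) + a[i + 1:]] for a in joint)
        for b in range(sizes[i])
    )
    return on_path - best_deviation

def loss_vector(i, profile):
    """[F_i(sigma_{-i})](b) = E_{a_{-i} ~ sigma_{-i}} F_i((b, a_{-i}))."""
    vec = [0.0] * sizes[i]
    for a in joint:
        weight = math.prod(profile[j][a[j]] for j in range(N) if j != i)
        vec[a[i]] += weight * F[i][a]
    return vec

def max_regret(i):
    """max over sigma_i of R_i(sigma_i); a linear objective is maximized at a pure action."""
    vectors = [loss_vector(i, p) for p in profiles]
    incurred = sum(
        sum(v[b] * p[i][b] for b in range(sizes[i])) for v, p in zip(vectors, profiles)
    )
    best_fixed = min(sum(v[b] for v in vectors) for b in range(sizes[i]))
    return incurred - best_fixed

for i in range(N):
    print(i, cce_gap(i), max_regret(i) / T)
```

For an average of product distributions the two printed quantities coincide up to rounding, so the bound in Lemma 1 is tight in this construction.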

4. OPTIMISTIC MIRROR DESCENT WITH MONTE-CARLO ESTIMATION

In this section, we present our algorithm. We first introduce some necessary notation. Let $F_{i,a_{-i}}(\sigma_{-i}) = F_i(a_{-i})\,\sigma_{-i}(a_{-i}) / q_{-i}(a_{-i})$, where $a_{-i} \in \mathcal{A}_{-i}$ denotes a random variable drawn from a sampling distribution $q_{-i}$. The algorithm is presented in Algorithm 1, which is a stochastic version of the OMD update in equation 1 with variance reduction. There are two crucial modifications over vanilla OMD to accelerate the computation: one concerns the estimated loss and the other concerns the starting point of the update.

Algorithm 1 Optimistic mirror descent with variance reduction
 1: Input: hyper-parameters $p$, $\tau$ and $\alpha$; mirror map $h$
 2: Initialize: $w^1_i(a) = 1/A$ and $\sigma^1_i(a) = 1/A$
 3: for $k = 1, 2, \ldots, T$ do
 4:     Sample $u$ from the uniform distribution over $[0, 1]$.
 5:     for $i = 1, \ldots, N$ do
 6:         Compute $\bar{\sigma}^k_i$ such that $\nabla h(\bar{\sigma}^k_i) = \alpha \nabla h(\sigma^k_i) + (1 - \alpha) \nabla h(w^k_i)$.
 7:         Sample $a^k_{-i} \sim \mathrm{LVE}(i, \sigma^k, w^{k-1})$. {See Section 5 for the definition of LVE.}
 8:         Compute $\hat{\sigma}^{k+1}_i$ such that $\nabla h(\hat{\sigma}^{k+1}_i) = \nabla h(\bar{\sigma}^k_i) - \tau \big( F_i(w^k_{-i}) + F_{i,a^k_{-i}}(\sigma^k_{-i}) - F_{i,a^k_{-i}}(w^{k-1}_{-i}) \big)$.
 9:         Compute $\sigma^{k+1}_i = \arg\min_{\sigma_i \in \Delta(\mathcal{A}_i)} D(\sigma_i, \hat{\sigma}^{k+1}_i)$.
10:         Update $w^{k+1}_i = \sigma^{k+1}_i$ if $u < p$ and $w^{k+1}_i = w^k_i$ if $u \ge p$.
11:     end for
12: end for

Estimated loss. As can be seen in Line 8, Algorithm 1 updates the strategy using the estimated loss $F_i(w^k_{-i}) + F_{i,a^k_{-i}}(\sigma^k_{-i}) - F_{i,a^k_{-i}}(w^{k-1}_{-i})$ instead of $2 F_i(\sigma^k_{-i}) - F_i(\sigma^{k-1}_{-i})$ as in the vanilla OMD update of equation 1. By constructing an appropriate distribution $q_{-i}$ from which to sample $a^k_{-i}$ (Line 7), it can be shown that the term $F_{i,a^k_{-i}}(\sigma^k_{-i}) - F_{i,a^k_{-i}}(w^{k-1}_{-i})$, where $F_{i,a_{-i}}(\sigma_{-i}) = F_i(a_{-i})\,\sigma_{-i}(a_{-i}) / q_{-i}(a_{-i})$ and $a_{-i} \sim q_{-i}$, is an unbiased estimator of $F_i(\sigma^k_{-i}) - F_i(w^{k-1}_{-i})$. The purpose of this estimated loss is to reduce the computational complexity. Since $w^{k+1}$ is only updated with probability $p$ in each round (Line 10), the expected running time to compute $F_i(w^k_{-i})$ over $T$ rounds is just $O(pT\,\mathrm{Cost})$. It is also worth noting that such an estimator may have high variance. How to construct $q_{-i}$ so as to accelerate convergence while ensuring low variance is therefore the key innovation of our algorithm. We postpone the description and discussion of this estimator to Section 5.

Different update starting points. As in Line 8, Algorithm 1 updates $\sigma^{k+1}_i$ from the starting point $\bar{\sigma}^k_i$ instead of $\sigma^k_i$ as in the OMD update of equation 1. This modification was first introduced by (AM21) for two-player zero-sum games. We prove that, in general games, it admits a faster learning rate when the goal is to maximize social welfare. The detailed proof of this part can be found in the full analysis.
It is worth noting that, under special settings of the hyper-parameters, Algorithm 1 reduces to standard OMD (SALS15) and to stochastic OMD for two-player zero-sum games (AM21). Specifically, when setting $p = 1$, $\alpha = 1$ and using the exact value of $F_i(\sigma^k_{-i}) - F_i(w^{k-1}_{-i})$ in Line 8, Algorithm 1 is equivalent to standard OMD (SALS15). And when the underlying game is two-player zero-sum, our unbiased estimator for $F_i(\sigma^k_{-i}) - F_i(w^{k-1}_{-i})$ is equivalent to that in (AM21), and thus Algorithm 1 also recovers their stochastic OMD.

In the following, we show how to compute $\sigma^{k+1}$ for two common examples of the mirror map: the negative entropy $h_1(x) = \sum_{a=1}^{d} x(a) \log x(a)$ and the squared $\ell_2$-norm $h_2(x) = \sum_{a=1}^{d} x^2(a)$. Below, $\bar{\sigma}^k_i$ and $\hat{\sigma}^{k+1}_i$ denote the intermediate iterates computed in Lines 6 and 8 of Algorithm 1, respectively. When the mirror map is $h_1$, it is easy to verify that $\nabla_{x(a)} h_1(x) = 1 + \log x(a)$. Then, according to Line 6, $\bar{\sigma}^k_i(a) = (\sigma^k_i(a))^{\alpha} (w^k_i(a))^{1-\alpha}$. So we can update $\sigma^{k+1}_i$ in $O(A)$ time as

$$\sigma^{k+1}_i(a) \propto (\sigma^k_i(a))^{\alpha} (w^k_i(a))^{1-\alpha} \exp\Big( -\tau \big[ F_i(w^k_{-i}) + F_{i,a^k_{-i}}(\sigma^k_{-i}) - F_{i,a^k_{-i}}(w^{k-1}_{-i}) \big](a) \Big).$$

Similarly, when the mirror map is $h_2$, we can directly compute $\bar{\sigma}^k_i(a) = \alpha \sigma^k_i(a) + (1 - \alpha) w^k_i(a)$. The computation of $\hat{\sigma}^{k+1}_i$ is straightforward given the gradient of the squared $\ell_2$-norm, and $\sigma^{k+1}_i$ can then be obtained by the standard procedure of projecting $\hat{\sigma}^{k+1}_i$ onto the simplex.
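The closed-form update for the negative-entropy mirror map can be sketched in a few lines. This is a minimal illustration, assuming strictly positive strategies; the function name `entropic_step` and the argument `est_loss` (standing for the estimated loss vector used in Line 8 of Algorithm 1) are hypothetical names introduced here, and the numbers in the usage lines are arbitrary.

```python
import math

def entropic_step(sigma_i, w_i, est_loss, tau, alpha):
    """One update of Lines 6, 8 and 9 of Algorithm 1 for the negative-entropy mirror map.

    sigma_i, w_i : current strategy and snapshot of player i (strictly positive entries)
    est_loss     : estimated loss vector, one entry per action (assumed given)
    Returns the next strategy, proportional to
    sigma_i^alpha * w_i^(1 - alpha) * exp(-tau * est_loss).
    """
    logits = [
        alpha * math.log(s) + (1.0 - alpha) * math.log(w) - tau * g
        for s, w, g in zip(sigma_i, w_i, est_loss)
    ]
    shift = max(logits)                      # subtract the max for numerical stability
    unnormalized = [math.exp(l - shift) for l in logits]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Arbitrary example values.
sigma_i = [0.5, 0.3, 0.2]
w_i = [1 / 3, 1 / 3, 1 / 3]
est_loss = [1.0, 0.0, 0.5]
print(entropic_step(sigma_i, w_i, est_loss, tau=0.1, alpha=0.9))
```

Working in log space avoids underflow when $\tau$ times the loss is large; the normalization at the end plays the role of the Bregman projection onto the simplex for this mirror map.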

4.1. THEORETICAL ANALYSIS

We first provide a general regret upper bound for Algorithm 1 in the following Theorem 1. The final results (Theorem 2 and Theorem 3) are postponed to later sections, where they are obtained by combining this general upper bound with the variance upper bound for the unbiased estimator in Lemma 3, together with some delicate derivations.

Theorem 1. If $D(x, y) \ge \gamma \|x - y\|^2$ for all $x, y$ and some norm $\|\cdot\|$, with $\gamma > 0$ being a constant, then the expected regret of Algorithm 1 is upper-bounded by

$$\tau \max_{\sigma_i} \mathbb{E}\, R_i(\sigma_i) \le U_i - (1 - \alpha)\, \mathbb{E}\Big[ \sum_{k=1}^{T} D(\sigma^k_i, w^{k-1}_i) \Big] + \tau^2 \Big( 1 + \frac{1}{\alpha\gamma} \Big)\, \mathbb{E}\Big[ \sum_{k=1}^{T} \big\| F_{i,a^k_{-i}}(\sigma^k_{-i}) - F_{i,a^k_{-i}}(w^{k-1}_{-i}) \big\|_*^2 \Big], \qquad (3)$$

where $U_i := \max_{\sigma_i} \mathbb{E}\big[ \phi^1_i(\sigma_i) - \phi^{T+1}_i(\sigma_i) \big] + \max_{\sigma_i} D(\sigma_i, \sigma^0_i)$, $\sigma^0$ and $w^0$ are uniform, and

$$\phi^k_i(\sigma_i) = \alpha D(\sigma_i, \sigma^k_i) + \frac{1 - \alpha}{p} D(\sigma_i, w^k_i) + (1 - \alpha) D(\sigma^k_i, w^{k-1}_i) + \tau \langle F_i(\sigma^k_{-i}) - F_i(w^{k-1}_{-i}), \sigma_i - \sigma^k_i \rangle. \qquad (4)$$

Due to the space limit, the full proof of Theorem 1 is deferred to Appendix A.2. To obtain the main order of the above general upper bound, we first bound $U_i$, i.e., the first term in equation 3. According to the following Lemma 2, it is easy to see that $U_i$ is of order $\tilde{O}\big(1 + \tau + (1 - \alpha)/p\big)$. When adopting hyper-parameters such that $\alpha \ge 1 - p$ and $\tau = O(1)$, we have $U_i = \tilde{O}(1)$. Therefore, the key to bounding the regret in Theorem 1 is to control the variance of the estimator, i.e., the third term in equation 3, which is accomplished by the estimator introduced in Section 5.

Lemma 2 (Bounds for $\phi^k_i$ defined in equation 4). For any $k$ and $\sigma_i$, $\phi^k_i(\sigma_i) \ge -4\tau$. If the mirror map is $h_1$ and $w^k_i(a) = 1/A$, then $\phi^k_i(\sigma_i) \le 4\tau + (1 + (1 - \alpha)/p) \log A$. If the mirror map is $h_2$ and $w^k_i(a) = 1/A$, then $\phi^k_i(\sigma_i) \le 4\tau + 2(1 + (1 - \alpha)/p)$.
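The assumption $D(x, y) \ge \gamma \|x - y\|^2$ in Theorem 1 can be sanity-checked for the two mirror maps used in this paper. The sketch below relies on two standard facts rather than on anything specific to this paper: on the simplex the Bregman divergence of the negative entropy is the KL divergence, which satisfies Pinsker's inequality $\mathrm{KL}(x \| y) \ge \frac{1}{2}\|x - y\|_1^2$, and the Bregman divergence of $h_2(x) = \sum_a x^2(a)$ equals $\|x - y\|_2^2$. The function names and the random test points are assumptions made for illustration.

```python
import math
import random

def bregman_neg_entropy(x, y):
    """D(x, y) for h_1(x) = sum_a x(a) log x(a); equals KL(x || y) on the simplex."""
    return sum((a * math.log(a / b) if a > 0 else 0.0) - a + b for a, b in zip(x, y))

def bregman_sq_l2(x, y):
    """D(x, y) = h(x) - h(y) - <grad h(y), x - y> for h_2(x) = sum_a x(a)^2."""
    h = lambda v: sum(t * t for t in v)
    grad_y = [2 * t for t in y]
    return h(x) - h(y) - sum(g * (a - b) for g, a, b in zip(grad_y, x, y))

def random_simplex(rng, n):
    raw = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

rng = random.Random(0)
worst_ratio = float("inf")
for _ in range(1000):
    x, y = random_simplex(rng, 4), random_simplex(rng, 4)
    l1_sq = sum(abs(a - b) for a, b in zip(x, y)) ** 2
    l2_sq = sum((a - b) ** 2 for a, b in zip(x, y))
    worst_ratio = min(worst_ratio, bregman_neg_entropy(x, y) / l1_sq)
    # For h_2 the divergence is exactly the squared Euclidean distance.
    assert abs(bregman_sq_l2(x, y) - l2_sq) < 1e-12

print("smallest observed KL / ||x - y||_1^2:", worst_ratio)
```

The observed ratio staying at or above 1/2 is what Pinsker's inequality predicts, so any constant $\gamma$ below 1/2 satisfies the assumption for $h_1$ with the $\ell_1$ norm.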

5. LOW-VARIANCE MONTE-CARLO ESTIMATOR

In this section, we present our low-variance estimator for $F_i(\sigma_{-i}) - F_i(w_{-i})$, where $\sigma$ and $w$ are two strategy profiles. For convenience, we assume $\sigma \ne w$. It is standard to use importance sampling to construct an unbiased estimator. Specifically, let $q_{-i}$ denote a distribution over $\mathcal{A}_{-i}$ such that $q_{-i}(a_{-i}) = 0$ only if $\sigma_{-i}(a_{-i}) = 0$ and $w_{-i}(a_{-i}) = 0$. Recall that $F_{i,a_{-i}}(\sigma_{-i}) = F_i(a_{-i})\, \frac{\sigma_{-i}(a_{-i})}{q_{-i}(a_{-i})}$. Clearly, we have

$$\mathbb{E}_{a_{-i} \sim q_{-i}} F_{i,a_{-i}}(\sigma_{-i}) = \sum_{a_{-i}} F_i(a_{-i})\, \sigma_{-i}(a_{-i}) = F_i(\sigma_{-i}).$$

Therefore, $F_{i,a_{-i}}(\sigma_{-i}) - F_{i,a_{-i}}(w_{-i})$ is an unbiased estimator for $F_i(\sigma_{-i}) - F_i(w_{-i})$ when $a_{-i} \sim q_{-i}$. However, for an arbitrary $q_{-i}$, the variance of the estimator can be very large. For example, let $q_{-i}$ denote the uniform distribution over $\mathcal{A}_{-i}$. Then, in the worst case, the variance of $F_{i,a_{-i}}(\sigma_{-i}) - F_{i,a_{-i}}(w_{-i})$ can be $A^N$. So we need to carefully design $q_{-i}$ to ensure low variance.
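The unbiasedness of the importance-sampling estimator can be illustrated on a toy game. The sketch below is an assumption-laden illustration: the game sizes, random losses, and names such as `opp_actions` and `prob` are made up here, and $q_{-i}$ is taken to be uniform, which is the naive choice discussed above rather than the distribution proposed in this paper. It compares the exact vector $F_i(\sigma_{-i})$ with a Monte-Carlo average of $F_{i,a_{-i}}(\sigma_{-i})$, and also evaluates the second-moment bound $\sum_{a_{-i}} (\sigma_{-i}(a_{-i}) - w_{-i}(a_{-i}))^2 / q_{-i}(a_{-i})$ for the difference estimator.

```python
import itertools
import random

rng = random.Random(1)

def random_simplex(n):
    raw = [rng.random() + 0.05 for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

sizes = [2, 3, 2]                 # toy action-set sizes (illustrative)
i = 0                             # player whose loss vector is estimated
others = [j for j in range(len(sizes)) if j != i]
opp_actions = list(itertools.product(*[range(sizes[j]) for j in others]))

# F_i(a_{-i}) is a vector in [0, 1]^{|A_i|}: one entry per own action.
F_i = {a: [rng.random() for _ in range(sizes[i])] for a in opp_actions}
sigma = [random_simplex(s) for s in sizes]
w = [random_simplex(s) for s in sizes]

def prob(strategies, a_minus_i):
    """Product probability sigma_{-i}(a_{-i}) of the opponents' joint action."""
    p = 1.0
    for j, aj in zip(others, a_minus_i):
        p *= strategies[j][aj]
    return p

# Exact expected loss vector F_i(sigma_{-i}) by enumeration.
exact = [sum(prob(sigma, a) * F_i[a][b] for a in opp_actions) for b in range(sizes[i])]

# Monte-Carlo average of F_i(a_{-i}) * sigma_{-i}(a_{-i}) / q_{-i}(a_{-i}) under uniform q.
q = {a: 1.0 / len(opp_actions) for a in opp_actions}
n_samples = 40000
mc = [0.0] * sizes[i]
for _ in range(n_samples):
    a = rng.choice(opp_actions)
    weight = prob(sigma, a) / q[a]
    for b in range(sizes[i]):
        mc[b] += weight * F_i[a][b] / n_samples

# Upper bound on the second moment of the difference estimator under this q.
second_moment_bound = sum((prob(sigma, a) - prob(w, a)) ** 2 / q[a] for a in opp_actions)
print(exact, mc, second_moment_bound)
```

Under uniform sampling the bound scales with the number of opponent action profiles, which grows exponentially in the number of players; this is the growth that a better choice of $q_{-i}$ is meant to avoid.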

Note that the variance $\mathbb{E}_{a_{-i} \sim q_{-i}} \| F_{i,a_{-i}}(\sigma_{-i}) - F_{i,a_{-i}}(w_{-i}) \|_\infty^2$ is upper bounded by $\sum_{a_{-i}} \| F_i(a_{-i}) \|_\infty^2\, (\sigma_{-i}(a_{-i}) - w_{-i}(a_{-i}))^2 / q_{-i}(a_{-i})$. Intuitively, to control the variance, we should allocate a large probability mass to those $a_{-i}$ where $(\sigma_{-i}(a_{-i}) - w_{-i}(a_{-i}))^2$ is large. Based on the observation that the difference between two product distributions can be decomposed as

$$\sigma(a) - w(a) = \sum_{i=1}^{N} (\sigma_i(a_i) - w_i(a_i)) \prod_{x < i} \sigma_x(a_x) \prod_{y > i} w_y(a_y)$$

(see Lemma 10 for more details), we propose to sample $a_{-i}$ according to the following distribution:

$$q_{-i}(a_{-i}) = \frac{1}{\sum_{j' \ne i} Z_{j'}} \sum_{j \ne i} |\sigma_j(a_j) - w_j(a_j)| \prod_{x < j,\, x \ne i} \sigma_x(a_x) \prod_{y > j,\, y \ne i} w_y(a_y), \qquad (5)$$

where $Z_j = \sum_{a_j} |\sigma_j(a_j) - w_j(a_j)|$. It is easy to verify that $q_{-i}(a_{-i}) \ge 0$ and $\sum_{a_{-i}} q_{-i}(a_{-i}) = 1$. The following Algorithm 2 summarizes an efficient procedure for sampling from $q_{-i}$ in equation 5 with polynomial time complexity $O((N - 1)A)$. It takes two main steps to sample $a_{-i}$. Firstly, we sample an index $j$ with probability $Z_j / \sum_{j' \ne i} Z_{j'}$ (Lines 1-2). Then, we sample $a_{-i}$ with probability proportional to $|\sigma_j(a_j) - w_j(a_j)| \prod_{x < j,\, x \ne i} \sigma_x(a_x) \prod_{y > j,\, y \ne i} w_y(a_y)$ (Lines 3-5).

Algorithm 2 LVE($i$, $\sigma$, $w$)
1: Compute $Z_j = \sum_{a_j} |\sigma_j(a_j) - w_j(a_j)|$ for $j \ne i$.
2: Sample $j$ with probability $Z_j / \sum_{j' \ne i} Z_{j'}$.
3: For $1 \le j' < j$, $j' \ne i$, sample $a_{j'}$ according to $\sigma_{j'}$.
4: Sample $a_j$ with probability $|\sigma_j(a_j) - w_j(a_j)| / \sum_{a'_j} |\sigma_j(a'_j) - w_j(a'_j)|$.
5: For $j < j' \le N$, $j' \ne i$, sample $a_{j'}$ according to $w_{j'}$.
6: return $a_{-i}$

The following Lemma 3 provides an upper bound for the variance of the estimator constructed from equation 5. We believe this result is also of independent interest, since it is quite general and may be used to control the variance of other stochastic algorithms for general games.

Lemma 3 (Variance bound). Sampling $a_{-i} \sim q_{-i}$ defined in equation 5 can be implemented by Algorithm 2. Moreover, we can upper bound its variance by

$$\mathbb{E}_{a_{-i} \sim q_{-i}} \| F_{i,a_{-i}}(\sigma_{-i}) - F_{i,a_{-i}}(w_{-i}) \|_\infty^2 \le (N - 1) \sum_{j \ne i} \| \sigma_j - w_j \|_1^2.$$

Proof. According to the definition of $q_{-i}$, the variance of the estimated loss satisfies

$$\mathbb{E}_{a_{-i} \sim q_{-i}} \| F_{i,a_{-i}}(\sigma_{-i}) - F_{i,a_{-i}}(w_{-i}) \|_\infty^2 \le \sum_{a_{-i}} (\sigma_{-i}(a_{-i}) - w_{-i}(a_{-i}))^2 / q_{-i}(a_{-i}).$$

Further, recall that the difference between $\sigma(a)$ and $w(a)$ can be decomposed as $\sum_{i=1}^{N} (\sigma_i(a_i) - w_i(a_i)) \prod_{x < i} \sigma_x(a_x) \prod_{y > i} w_y(a_y)$ by Lemma 10 (please see the appendix for the statement and proof). Applying this decomposition to the players $j \ne i$, it holds that

$$\sum_{a_{-i}} \frac{(\sigma_{-i}(a_{-i}) - w_{-i}(a_{-i}))^2}{q_{-i}(a_{-i})} = \sum_{a_{-i}} \frac{\Big( \sum_{j \ne i} (\sigma_j(a_j) - w_j(a_j)) \prod_{x < j,\, x \ne i} \sigma_x(a_x) \prod_{y > j,\, y \ne i} w_y(a_y) \Big)^2}{q_{-i}(a_{-i})}$$
$$\le \sum_{a_{-i}} \frac{\Big( \sum_{j \ne i} |\sigma_j(a_j) - w_j(a_j)| \prod_{x < j,\, x \ne i} \sigma_x(a_x) \prod_{y > j,\, y \ne i} w_y(a_y) \Big)^2}{q_{-i}(a_{-i})}$$
$$= \Big( \sum_{j \ne i} Z_j \Big) \sum_{a_{-i}} \sum_{j \ne i} |\sigma_j(a_j) - w_j(a_j)| \prod_{x < j,\, x \ne i} \sigma_x(a_x) \prod_{y > j,\, y \ne i} w_y(a_y)$$
$$= \Big( \sum_{j \ne i} Z_j \Big)^2 \le (N - 1) \sum_{j \ne i} \| \sigma_j - w_j \|_1^2.$$

Discussion on the optimality of LVE. There are two main aspects along which to examine the optimality of a Monte-Carlo estimator: the variance and the computational complexity.

1. Variance: The smallest variance is $\big( \sum_{a_{-i}} |\sigma_{-i}(a_{-i}) - w_{-i}(a_{-i})| \big)^2$, which is achieved by $q^{\mathrm{CJST19}}_{-i}$ (CJST19). Then, after noticing that

$$q_{-i}(a_{-i}) \ge q^{\mathrm{CJST19}}_{-i}(a_{-i})\, \frac{\sum_{a_{-i}} |\sigma_{-i}(a_{-i}) - w_{-i}(a_{-i})|}{\sum_{j \ne i} Z_j} \ge \frac{q^{\mathrm{CJST19}}_{-i}(a_{-i})}{N - 1},$$

the variance of our estimator is optimal within a multiplicative factor of $N - 1$, according to Lemma 4.

Lemma 4. For any $\sigma$ and $w$, the variance of the estimator with $q_{-i}$ from equation 5 is no more than $(N - 1) \big( \sum_{a_{-i}} |\sigma_{-i}(a_{-i}) - w_{-i}(a_{-i})| \big)^2$.

2. Computational complexity: The computational cost of our LVE is also optimal among estimators with low variance. Intuitively, the time complexity of LVE amounts to accessing each entry of $\sigma^k_j$ and $w^{k-1}_j$ a constant number of times, and it is unlikely to be improved. The following lemma formally shows this.

Lemma 5. We say $j$ is agnostic to $q'_{-i}$ if at least two entries of $\sigma_j$ are not accessed when computing $q'_{-i}$. Let $m$ denote the number of indices $j$ that are agnostic to $q'_{-i}$. Then there exist $\sigma$ and $w$ such that the variance of the estimator is $\Omega(2^m)$.

It is worth noting that although the estimator of (CJST19) achieves the smallest variance, its computational complexity is $O(|A|^{N-1})$, since it has to access every configuration of $a_{-i}$ to compute $|\sigma_{-i}(a_{-i}) - w_{-i}(a_{-i})|$, which is impractical in general games with multiple players. Our LVE estimator simultaneously guarantees (near-)optimal variance and computational complexity.
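The sampling procedure of Algorithm 2 and the variance bound of Lemma 3 can be checked on a small example. The following is a sketch under stated assumptions, not the authors' implementation: players are 0-indexed, the toy sizes and random strategies are made up, and the helper names `lve_sample`, `lve_prob` and `product_prob` are introduced here. It draws samples as in Algorithm 2, evaluates $q_{-i}$ from equation 5 by enumeration, and compares $\sum_{a_{-i}} (\sigma_{-i} - w_{-i})^2 / q_{-i}$ with $(\sum_{j \ne i} Z_j)^2$ and $(N-1)\sum_{j \ne i} \|\sigma_j - w_j\|_1^2$.

```python
import itertools
import random

def lve_sample(i, sigma, w, rng):
    """Draw a_{-i} following the steps of Algorithm 2 (0-based player indices)."""
    others = [j for j in range(len(sigma)) if j != i]
    Z = [sum(abs(s - v) for s, v in zip(sigma[j], w[j])) for j in others]
    pivot = rng.choices(others, weights=Z)[0]             # Line 2 (requires sum(Z) > 0)
    action = []
    for j in others:
        if j < pivot:
            weights = sigma[j]                             # Line 3
        elif j == pivot:
            weights = [abs(s - v) for s, v in zip(sigma[j], w[j])]   # Line 4
        else:
            weights = w[j]                                 # Line 5
        action.append(rng.choices(range(len(weights)), weights=weights)[0])
    return tuple(action)

def lve_prob(i, sigma, w, a_minus_i):
    """Evaluate q_{-i}(a_{-i}) from equation 5 directly."""
    others = [j for j in range(len(sigma)) if j != i]
    a = dict(zip(others, a_minus_i))
    z_total = sum(sum(abs(s - v) for s, v in zip(sigma[j], w[j])) for j in others)
    total = 0.0
    for j in others:
        term = abs(sigma[j][a[j]] - w[j][a[j]])
        for x in others:
            if x < j:
                term *= sigma[x][a[x]]
            elif x > j:
                term *= w[x][a[x]]
        total += term
    return total / z_total

def product_prob(strategies, others, a_minus_i):
    p = 1.0
    for j, aj in zip(others, a_minus_i):
        p *= strategies[j][aj]
    return p

rng = random.Random(0)

def random_simplex(n):
    raw = [rng.random() + 0.05 for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

sizes, i = [2, 3, 2, 3], 1                 # toy game; player i is excluded from sampling
sigma = [random_simplex(s) for s in sizes]
w = [random_simplex(s) for s in sizes]
others = [j for j in range(len(sizes)) if j != i]
support = list(itertools.product(*[range(sizes[j]) for j in others]))

q = {a: lve_prob(i, sigma, w, a) for a in support}
total_mass = sum(q.values())
second_moment_bound = sum(
    (product_prob(sigma, others, a) - product_prob(w, others, a)) ** 2 / q[a]
    for a in support
)
z = [sum(abs(s - v) for s, v in zip(sigma[j], w[j])) for j in others]
lemma3_rhs = len(others) * sum(zj ** 2 for zj in z)

n_draws = 20000
counts = {a: 0 for a in support}
for _ in range(n_draws):
    counts[lve_sample(i, sigma, w, rng)] += 1
max_freq_error = max(abs(counts[a] / n_draws - q[a]) for a in support)

print(total_mass, second_moment_bound, sum(z) ** 2, lemma3_rhs, max_freq_error)
```

Each call to `lve_sample` touches every entry of the opponents' strategies a constant number of times, in line with the $O((N-1)A)$ cost stated for Algorithm 2, whereas `lve_prob` over the full support is used here only to verify the sampler on a game small enough to enumerate.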

6. REGRET UPPER BOUNDS AND TIME COMPLEXITY TO APPROXIMATE CCE

Now we are ready to provide our final guarantees for reaching a weak $\epsilon$-CCE, which corresponds to social welfare, and a strong $\epsilon$-CCE, which corresponds to individual loss. In addition, we also provide an adversarial regret bound of $O(\sqrt{T})$ to show the robustness of our algorithm.

6.1. SOCIAL WELFARE

We first consider the time complexity to reach a weak $\epsilon$-CCE. For this case, we take the mirror map $h_1(x) = \sum_{a=1}^{d} x(a) \log x(a)$ as an example. Recall that Lemma 1 shows that the time complexity to approximate CCE depends on both the regret and the per-round running time. So we first provide an upper bound for the regret $\sum_{i \in [N]} \max_{\sigma_i} \mathbb{E}\, R_i(\sigma_i)$ that appears in the definition of weak $\epsilon$-CCE.

Theorem 2. Let the hyper-parameters be $\alpha = 1 - p$ and $\tau = \sqrt{\gamma\alpha(1 - \alpha)/2} \,\big/\, \big( (N - 1) \sqrt{1 + \alpha\gamma} \big)$, let $\gamma \in (0, 1/2)$ be a constant, and let the mirror map be $h_1$. Then there exists a constant $C$ such that

$$\max_{\sigma} \mathbb{E} \sum_{i=1}^{N} R_i(\sigma_i) \le C N^2 \log A / \sqrt{p}.$$

With the above upper bound, we can then guarantee the time complexity to reach a weak $\epsilon$-CCE.

Corollary 1. With $p = NA/\mathrm{Cost}$ and the hyper-parameters defined as in Theorem 2, the time complexity of Algorithm 1 to reach a weak $\epsilon$-CCE is $O\big( \mathrm{Cost} + N^{7/2} \sqrt{A\,\mathrm{Cost}}\, \log A / \epsilon \big)$.
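The arithmetic behind Corollary 1 can be traced with a short back-of-the-envelope calculation. The per-round cost accounting below is an assumption made here for illustration (it is not spelled out in the text above): in expectation, refreshing the exact loss vectors for all $N$ players costs $pN\,\mathrm{Cost}$ per round, and running LVE plus an $O(A)$ update for each of the $N$ players costs $O(N^2 A)$ per round.

```latex
% Assumed per-round accounting (illustrative, not taken from the paper):
%   exact refresh of F_i(w_{-i}) for all N players with probability p : p N Cost
%   LVE sampling and an O(A) update for each of the N players         : O(N^2 A)
\begin{align*}
\text{rounds needed:}\quad
  & \frac{1}{T}\cdot\frac{C N^2 \log A}{\sqrt{p}} \le \epsilon
    \;\Longrightarrow\; T = \frac{C N^2 \log A}{\epsilon\sqrt{p}}, \\
\text{per-round cost with } p = \frac{NA}{\mathrm{Cost}}:\quad
  & p N\,\mathrm{Cost} + O(N^2 A) = O(N^2 A), \\
\text{total:}\quad
  & T \cdot O(N^2 A)
    = O\!\left(\frac{N^4 A \log A}{\epsilon}\sqrt{\frac{\mathrm{Cost}}{NA}}\right)
    = O\!\left(\frac{N^{7/2}\sqrt{A\,\mathrm{Cost}}\,\log A}{\epsilon}\right).
\end{align*}
```

The first line combines Lemma 1 with Theorem 2. The choice $p = NA/\mathrm{Cost}$ balances the two per-round terms, and it requires $\mathrm{Cost} \ge NA$ for $p$ to be a valid probability.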

6.2. INDIVIDUAL REGRET

In this subsection, we provide the time complexity to reach a strong $\epsilon$-CCE. As in the previous analysis, we first give an upper bound on the individual regret. Here we take the mirror map $h_2(x) = \sum_{a=1}^{d} x^2(a)$ as an example. Before providing the upper bound for the individual regret, we first present a useful lemma to bound the term $\mathbb{E}\, \| \sigma^k_j - w^{k-1}_j \|_1^2$ appearing in the general upper bound in Theorem 1.

Lemma 6. If the hyper-parameter $\alpha = 1$, then $\mathbb{E}\, \| \sigma^k_j - w^{k-1}_j \|_1^2 \le 2\tau^2 / p^2$.

With Lemma 6, we now provide an upper bound on the individual regret.

Theorem 3. With $\tau = \big( A (2 + 1/\gamma) (N - 1)^2 T / p^2 \big)^{-1/4}$, $\alpha = 1$, $\gamma \in (0, 1/2)$ and mirror map $h_2$, we have $\max_{\sigma_i} \mathbb{E}\, R_i(\sigma_i) = O\big( ( A (N - 1)^2 T / p^2 )^{1/4} \big)$ for any $i \in [N]$.

The individual regret upper bound in Theorem 3 implies the following time complexity to reach a strong $\epsilon$-CCE.

Corollary 2. With $p = NA/\mathrm{Cost}$ and the hyper-parameters defined in Theorem 3, the running time to reach a strong $\epsilon$-CCE is $O\big( \mathrm{Cost} + N^2 (A\,\mathrm{Cost})^{2/3} / \epsilon^{4/3} \big)$.
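The $\epsilon^{-4/3}$ and $\mathrm{Cost}^{2/3}$ dependence in Corollary 2 can be traced in the same way. As before, the per-round cost of $pN\,\mathrm{Cost} + O(N^2 A)$ is an accounting assumption made here for illustration, and constants are dropped throughout.

```latex
% Assumed per-round cost (illustrative): p N Cost + O(N^2 A), with p = N A / Cost.
\begin{align*}
\text{rounds needed:}\quad
  & \frac{1}{T}\left(\frac{A (N-1)^2 T}{p^2}\right)^{1/4} \le \epsilon
    \;\Longrightarrow\;
    T = O\!\left(\frac{(A N^2 / p^2)^{1/3}}{\epsilon^{4/3}}\right), \\
\text{total:}\quad
  & T \cdot O(N^2 A)
    = O\!\left(\frac{N^{8/3} A^{4/3}}{p^{2/3}\,\epsilon^{4/3}}\right)
    = O\!\left(\frac{N^{8/3} A^{4/3}}{\epsilon^{4/3}}
        \left(\frac{\mathrm{Cost}}{NA}\right)^{2/3}\right)
    = O\!\left(\frac{N^{2} (A\,\mathrm{Cost})^{2/3}}{\epsilon^{4/3}}\right).
\end{align*}
```

The first line combines Lemma 1 with the individual regret bound of Theorem 3, which grows as $T^{1/4}$ and therefore yields the $\epsilon^{-4/3}$ rate rather than $\epsilon^{-1}$.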

6.3. ADVERSARIAL REGRET

While this work mainly focuses on the case where every player adopts the same algorithm to optimize its own strategy, it is also interesting to consider the case where some players play adversarially, i.e., there exists a player $j$ who updates $\sigma^k_j$ and $w^{k-1}_j$ adversarially, to see whether the algorithm is robust. According to Lemma 3, we have

$$\mathbb{E}_{a_{-i} \sim q_{-i}} \| F_{i,a_{-i}}(\sigma_{-i}) - F_{i,a_{-i}}(w_{-i}) \|_\infty^2 \le (N - 1) \sum_{j \ne i} \| \sigma_j - w_j \|_1^2 \le N(N - 1),$$

which also holds in the adversarial setting. Then, inserting this inequality into Theorem 1 and following a standard derivation, we obtain an $O(\sqrt{T})$ adversarial regret, which is formally presented below.

Lemma 7. With the mirror map $h_2$, $\alpha = 1$, $\tau = N\sqrt{T}$ and $\gamma \in (0, 1/2)$, if player $i$ follows Algorithm 1, then for any $\sigma^k_j$, $w^{k-1}_j$, $j \ne i$, $k = 1, \ldots, T$, we have $\mathbb{E}\, R_i(\sigma_i) = O(N\sqrt{T})$.

7. CONCLUSION

In this paper, we propose a stochastic version of OMD with variance reduction. Our algorithm extends prior works on variance-reduced stochastic algorithms from two-player zero-sum games to general games. The key innovation of this work is a low-variance Monte-Carlo estimator. In comparison with the prior estimator in (CJST19), our estimator is exponentially faster to compute, at the price of a slightly larger variance. There are several directions in which to extend our algorithm. Firstly, although our algorithm enjoys an $O(1/\epsilon)$ convergence rate to weak CCE, its convergence rate to strong CCE is $O(1/\epsilon^{4/3})$, which seems to be sub-optimal in comparison to the convergence rates in (DFG21) and (PSS21). Thus, developing stochastic algorithms with an $O(1/\epsilon)$ convergence rate to strong CCE is an interesting direction. Secondly, we only consider normal-form games. However, it is more realistic to consider games with sequential structure, e.g., extensive-form games. We hope this work can be a starting point for developing more efficient stochastic algorithms for general games.

A PROOF

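Our regret bounds are derived from the following first-order optimality condition, which follows from the update rule in Line 9 of Algorithm 1. Writing $\bar{\sigma}^k_i$ and $\hat{\sigma}^{k+1}_i$ for the intermediate iterates computed in Lines 6 and 8 of Algorithm 1, we have

$$0 \le \langle \nabla h(\sigma^{k+1}_i) - \nabla h(\hat{\sigma}^{k+1}_i), \sigma_i - \sigma^{k+1}_i \rangle = \big\langle \nabla h(\sigma^{k+1}_i) - \nabla h(\bar{\sigma}^k_i) + \tau \big( F_i(w^k_{-i}) + F_{i,a^k_{-i}}(\sigma^k_{-i}) - F_{i,a^k_{-i}}(w^{k-1}_{-i}) \big), \sigma_i - \sigma^{k+1}_i \big\rangle. \qquad (7)$$

Adding $\tau \langle F_i(\sigma^{k+1}_{-i}), \sigma^{k+1}_i - \sigma_i \rangle$ to both sides of equation 7, we immediately get an upper bound on the regret:

$$\tau \langle F_i(\sigma^{k+1}_{-i}), \sigma^{k+1}_i - \sigma_i \rangle \le \langle \nabla h(\sigma^{k+1}_i) - \nabla h(\bar{\sigma}^k_i), \sigma_i - \sigma^{k+1}_i \rangle + \tau \big\langle F_i(w^k_{-i}) + F_{i,a^k_{-i}}(\sigma^k_{-i}) - F_{i,a^k_{-i}}(w^{k-1}_{-i}) - F_i(\sigma^{k+1}_{-i}), \sigma_i - \sigma^{k+1}_i \big\rangle. \qquad (8)$$

The rest of our proof starts from equation 8.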

A.1 USEFUL LEMMAS

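The following lemmas are useful in the proof of Theorem 1 and Lemma 3.

Lemma 8. Define $\mathbb{E}_k[\cdot] = \mathbb{E}[\cdot \mid \sigma^{k}, w^{k-1}]$. Recall $\nabla h(\bar\sigma_i^{k}) = \alpha\nabla h(\sigma_i^{k}) + (1-\alpha)\nabla h(w_i^{k})$ and

$$\phi_i^{k}(\sigma_i) = \alpha D(\sigma_i, \sigma_i^{k}) + \frac{1-\alpha}{p} D(\sigma_i, w_i^{k}) + (1-\alpha) D(\sigma_i^{k}, w_i^{k-1}) + \tau\langle F_i(\sigma_{-i}^{k}) - F_i(w_{-i}^{k-1}), \sigma_i - \sigma_i^{k}\rangle.$$

Then we have

$$\begin{aligned}
\phi_i^{k}(\sigma_i) - \phi_i^{k+1}(\sigma_i) ={}& \langle \nabla h(\sigma_i^{k+1}) - \nabla h(\bar\sigma_i^{k}), \sigma_i - \sigma_i^{k+1}\rangle + \frac{1-\alpha}{p}\mathbb{E}_k D(\sigma_i, w_i^{k+1}) + \alpha D(\sigma_i^{k+1}, \sigma_i^{k}) + (1-\alpha) D(\sigma_i^{k}, w_i^{k-1}) \\
& - \frac{1-\alpha}{p} D(\sigma_i, w_i^{k+1}) + \tau\langle F_i(\sigma_{-i}^{k}) - F_i(w_{-i}^{k-1}), \sigma_i - \sigma_i^{k}\rangle + \tau\langle F_i(w_{-i}^{k}) - F_i(\sigma_{-i}^{k+1}), \sigma_i - \sigma_i^{k+1}\rangle.
\end{aligned}$$

Proof. We can decompose $\langle \nabla h(\sigma_i^{k+1}) - \nabla h(\bar\sigma_i^{k}), \sigma_i - \sigma_i^{k+1}\rangle$ as follows.

$$\begin{aligned}
&\langle \nabla h(\sigma_i^{k+1}) - \nabla h(\bar\sigma_i^{k}), \sigma_i - \sigma_i^{k+1}\rangle \\
&= \alpha\langle \nabla h(\sigma_i^{k+1}) - \nabla h(\sigma_i^{k}), \sigma_i - \sigma_i^{k+1}\rangle + (1-\alpha)\langle \nabla h(\sigma_i^{k+1}) - \nabla h(w_i^{k}), \sigma_i - \sigma_i^{k+1}\rangle \\
&= \alpha\big(D(\sigma_i, \sigma_i^{k}) - D(\sigma_i^{k+1}, \sigma_i^{k}) - D(\sigma_i, \sigma_i^{k+1})\big) + (1-\alpha)\big(D(\sigma_i, w_i^{k}) - D(\sigma_i^{k+1}, w_i^{k}) - D(\sigma_i, \sigma_i^{k+1})\big) \\
&= \alpha\big(D(\sigma_i, \sigma_i^{k}) - D(\sigma_i^{k+1}, \sigma_i^{k}) - D(\sigma_i, \sigma_i^{k+1})\big) + (1-\alpha)\big(D(\sigma_i, w_i^{k}) - D(\sigma_i^{k+1}, w_i^{k}) - D(\sigma_i, \sigma_i^{k+1})\big) \\
&\quad + \frac{1-\alpha}{p}\mathbb{E}_k D(\sigma_i, w_i^{k+1}) - \frac{1-\alpha}{p}\mathbb{E}_k D(\sigma_i, w_i^{k+1}) \\
&= \alpha D(\sigma_i, \sigma_i^{k}) + \frac{1-\alpha}{p} D(\sigma_i, w_i^{k}) - \Big(\alpha D(\sigma_i, \sigma_i^{k+1}) + \frac{1-\alpha}{p}\mathbb{E}_k D(\sigma_i, w_i^{k+1})\Big) - \alpha D(\sigma_i^{k+1}, \sigma_i^{k}) - (1-\alpha) D(\sigma_i^{k+1}, w_i^{k}) \\
&= \Big(\alpha D(\sigma_i, \sigma_i^{k}) + \frac{1-\alpha}{p} D(\sigma_i, w_i^{k}) + (1-\alpha) D(\sigma_i^{k}, w_i^{k-1})\Big) - \Big(\alpha D(\sigma_i, \sigma_i^{k+1}) + \frac{1-\alpha}{p}\mathbb{E}_k D(\sigma_i, w_i^{k+1})\Big) \\
&\quad - (1-\alpha) D(\sigma_i^{k+1}, w_i^{k}) - \alpha D(\sigma_i^{k+1}, \sigma_i^{k}) - (1-\alpha) D(\sigma_i^{k}, w_i^{k-1}),
\end{aligned}$$

where the first equality is based on the fact that $\nabla h(\bar\sigma_i^{k}) = \alpha\nabla h(\sigma_i^{k}) + (1-\alpha)\nabla h(w_i^{k})$, the second one is based on the definition of the Bregman divergence, the fourth one follows from the update rule of $w_i^{k+1}$ (Line 10 in Alg. 1), and the third and the last equalities hold trivially. Combining this with the definition of $\phi_i^{k}$ completes the proof.

Lemma 9 (Lemma 3.5 in (AM21)). Let $\mathcal{F} = \{\mathcal{F}_k\}_{k\ge 0}$ be a filtration and $(u_k)$ be a stochastic process adapted to $\mathcal{F}$ with $\mathbb{E}[u_{k+1} \mid \mathcal{F}_k] = 0$. Then for any $x_0$ and any compact set $C$,

$$\mathbb{E}\Big[\max_{x\in C}\sum_{k=0}^{K-1}\langle u_{k+1}, x\rangle\Big] \le \max_{x\in C} D(x, x_0) + \frac{1}{2}\sum_{k=0}^{K-1}\mathbb{E}\|u_{k+1}\|_*^2.$$

Lemma 10. For any $\sigma(a) = \prod_{j=1}^{N}\sigma_j(a_j)$ and $w(a) = \prod_{j=1}^{N} w_j(a_j)$, we have

$$\sigma(a) - w(a) = \sum_{i=1}^{N}\big(\sigma_i(a_i) - w_i(a_i)\big)\prod_{x<i}\sigma_x(a_x)\prod_{y>i} w_y(a_y).$$

Proof. We prove this lemma by mathematical induction. The case $N = 1$ holds trivially. Further, assume Lemma 10 holds for $N-1$. Then for $N$, it holds that

$$\begin{aligned}
\prod_{j=1}^{N}\sigma_j(a_j) - \prod_{j=1}^{N} w_j(a_j) &= \prod_{j=1}^{N}\sigma_j(a_j) - w_N(a_N)\prod_{j=1}^{N-1}\sigma_j(a_j) + w_N(a_N)\prod_{j=1}^{N-1}\sigma_j(a_j) - \prod_{j=1}^{N} w_j(a_j) \\
&= \big(\sigma_N(a_N) - w_N(a_N)\big)\prod_{j=1}^{N-1}\sigma_j(a_j) + w_N(a_N)\Big(\prod_{j=1}^{N-1}\sigma_j(a_j) - \prod_{j=1}^{N-1} w_j(a_j)\Big) \\
&= \sum_{i=1}^{N}\big(\sigma_i(a_i) - w_i(a_i)\big)\prod_{x<i}\sigma_x(a_x)\prod_{y>i} w_y(a_y).
\end{aligned}$$

$$\begin{aligned}
&\max_{\sigma_i}\mathbb{E}\Big[\sum_{k=1}^{T}\tau\langle F_i(\sigma_{-i}^{k+1}), \sigma_i^{k+1} - \sigma_i\rangle\Big] \\
&\overset{(1)}{\le} \max_{\sigma_i}\mathbb{E}\Big[\sum_{k=1}^{T}\Big(\phi_i^{k}(\sigma_i) - \phi_i^{k+1}(\sigma_i) + \tau\langle F_{i,a_{-i}^{k}}(\sigma_{-i}^{k}) - F_{i,a_{-i}^{k}}(w_{-i}^{k-1}), \sigma_i - \sigma_i^{k+1}\rangle - \tau\langle F_i(\sigma_{-i}^{k}) - F_i(w_{-i}^{k-1}), \sigma_i - \sigma_i^{k}\rangle \\
&\qquad - (1-\alpha) D(\sigma_i^{k}, w_i^{k-1}) - \alpha D(\sigma_i^{k+1}, \sigma_i^{k}) - \frac{1-\alpha}{p}\big(\mathbb{E}_k D(\sigma_i, w_i^{k+1}) - D(\sigma_i, w_i^{k+1})\big)\Big)\Big] \\
&\le \max_{\sigma_i}\mathbb{E}\big[\phi_i^{1}(\sigma_i) - \phi_i^{T+1}(\sigma_i)\big] + \max_{\sigma_i}\mathbb{E}\Big[\sum_{k=1}^{T}\tau\big\langle F_{i,a_{-i}^{k}}(\sigma_{-i}^{k}) - F_{i,a_{-i}^{k}}(w_{-i}^{k-1}) - \big(F_i(\sigma_{-i}^{k}) - F_i(w_{-i}^{k-1})\big), \sigma_i - \sigma_i^{k}\big\rangle\Big] \\
&\quad + \max_{\sigma_i}\mathbb{E}\Big[\sum_{k=1}^{T}\Big(\tau\langle F_{i,a_{-i}^{k}}(\sigma_{-i}^{k}) - F_{i,a_{-i}^{k}}(w_{-i}^{k-1}), \sigma_i^{k} - \sigma_i^{k+1}\rangle - (1-\alpha) D(\sigma_i^{k}, w_i^{k-1}) - \alpha D(\sigma_i^{k+1}, \sigma_i^{k})\Big)\Big] \\
&\quad + \max_{\sigma_i}\mathbb{E}\Big[\sum_{k=1}^{T} -\frac{1-\alpha}{p}\big(\mathbb{E}_k D(\sigma_i, w_i^{k+1}) - D(\sigma_i, w_i^{k+1})\big)\Big] \\
&\overset{(2)}{\le} \max_{\sigma_i}\mathbb{E}\big[\phi_i^{1}(\sigma_i) - \phi_i^{T+1}(\sigma_i)\big] + \max_{\sigma_i}\mathbb{E}\Big[\sum_{k=1}^{T}\tau\big\langle F_{i,a_{-i}^{k}}(\sigma_{-i}^{k}) - F_{i,a_{-i}^{k}}(w_{-i}^{k-1}) - \big(F_i(\sigma_{-i}^{k}) - F_i(w_{-i}^{k-1})\big), \sigma_i - \sigma_i^{k}\big\rangle\Big] \\
&\quad + \mathbb{E}\Big[\sum_{k=1}^{T}\Big(\frac{\tau^2}{\alpha\gamma}\big\|F_{i,a_{-i}^{k}}(\sigma_{-i}^{k}) - F_{i,a_{-i}^{k}}(w_{-i}^{k-1})\big\|_*^2 + \alpha\gamma\|\sigma_i^{k+1} - \sigma_i^{k}\|^2 - (1-\alpha) D(\sigma_i^{k}, w_i^{k-1}) - \alpha D(\sigma_i^{k+1}, \sigma_i^{k})\Big)\Big] \\
&\le \max_{\sigma_i}\mathbb{E}\big[\phi_i^{1}(\sigma_i) - \phi_i^{T+1}(\sigma_i)\big] + \max_{\sigma_i}\mathbb{E}\Big[\sum_{k=1}^{T}\tau\big\langle F_{i,a_{-i}^{k}}(\sigma_{-i}^{k}) - F_{i,a_{-i}^{k}}(w_{-i}^{k-1}) - \big(F_i(\sigma_{-i}^{k}) - F_i(w_{-i}^{k-1})\big), \sigma_i - \sigma_i^{k}\big\rangle\Big] \\
&\quad + \mathbb{E}\Big[\sum_{k=1}^{T}\Big(\frac{\tau^2}{\alpha\gamma}\big\|F_{i,a_{-i}^{k}}(\sigma_{-i}^{k}) - F_{i,a_{-i}^{k}}(w_{-i}^{k-1})\big\|_*^2 - (1-\alpha) D(\sigma_i^{k}, w_i^{k-1})\Big)\Big] \\
&\overset{(3)}{\le} \max_{\sigma_i}\mathbb{E}\big[\phi_i^{1}(\sigma_i) - \phi_i^{T+1}(\sigma_i)\big] + \max_{\sigma_i} D(\sigma_i, \sigma_i^{0}) + \tau^2\,\mathbb{E}\Big[\sum_{k=1}^{T}\big\|F_{i,a_{-i}^{k}}(\sigma_{-i}^{k}) - F_{i,a_{-i}^{k}}(w_{-i}^{k-1})\big\|_*^2\Big] \\
&\quad + \mathbb{E}\Big[\sum_{k=1}^{T}\Big(\frac{\tau^2}{\alpha\gamma}\big\|F_{i,a_{-i}^{k}}(\sigma_{-i}^{k}) - F_{i,a_{-i}^{k}}(w_{-i}^{k-1})\big\|_*^2 - (1-\alpha) D(\sigma_i^{k}, w_i^{k-1})\Big)\Big] \\
&\le \max_{\sigma_i}\mathbb{E}\big[\phi_i^{1}(\sigma_i) - \phi_i^{T+1}(\sigma_i)\big] + \max_{\sigma_i} D(\sigma_i, \sigma_i^{0}) - (1-\alpha)\,\mathbb{E}\Big[\sum_{k=1}^{T} D(\sigma_i^{k}, w_i^{k-1})\Big] \\
&\quad + \tau^2\Big(1 + \frac{1}{\alpha\gamma}\Big)\mathbb{E}\Big[\sum_{k=1}^{T}\big\|F_{i,a_{-i}^{k}}(\sigma_{-i}^{k}) - F_{i,a_{-i}^{k}}(w_{-i}^{k-1})\big\|_*^2\Big],
\end{aligned}$$

where (1) is based on Lemma 8; (2) follows from Young's inequality and drops the last term, whose expectation is zero for every fixed $\sigma_i$ by the tower property; the unlabeled inequality after (2) uses $D(x, y) \ge \gamma\|x - y\|^2$; and (3) is derived from Lemma 9.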
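The product-difference identity of Lemma 10 is easy to sanity-check numerically. The sketch below is illustrative only (the function names and the small random instance are assumptions, not part of the paper): it enumerates every joint action of a small game and compares both sides of the identity.

```python
import itertools
import math
import random


def random_distribution(num_actions, rng):
    """Draw a random probability vector (illustrative helper)."""
    weights = [rng.random() + 1e-3 for _ in range(num_actions)]
    total = sum(weights)
    return [x / total for x in weights]


def product_difference(sigma, w, action):
    """Left-hand side: prod_j sigma_j(a_j) - prod_j w_j(a_j)."""
    prod_sigma = math.prod(sigma[j][a] for j, a in enumerate(action))
    prod_w = math.prod(w[j][a] for j, a in enumerate(action))
    return prod_sigma - prod_w


def telescoped_sum(sigma, w, action):
    """Right-hand side: sum_i (sigma_i - w_i) * prod_{x<i} sigma_x * prod_{y>i} w_y."""
    total = 0.0
    for i, a_i in enumerate(action):
        before = math.prod(sigma[x][action[x]] for x in range(i))
        after = math.prod(w[y][action[y]] for y in range(i + 1, len(action)))
        total += (sigma[i][a_i] - w[i][a_i]) * before * after
    return total


def max_identity_gap(num_players=4, num_actions=3, seed=0):
    """Largest absolute gap between the two sides over all joint actions."""
    rng = random.Random(seed)
    sigma = [random_distribution(num_actions, rng) for _ in range(num_players)]
    w = [random_distribution(num_actions, rng) for _ in range(num_players)]
    joint_actions = itertools.product(range(num_actions), repeat=num_players)
    return max(
        abs(product_difference(sigma, w, a) - telescoped_sum(sigma, w, a))
        for a in joint_actions
    )
```

Calling `max_identity_gap()` should return a value at the level of floating-point rounding error, since the two sides agree exactly in exact arithmetic.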

B MISSING PROOFS

Proof of Lemma 2. The lower bound follows directly from the non-negativity of the Bregman divergence $D(x, y)$. The upper bound of $\phi_i^{k}$ for $h_1$ is based on the fact that $\max_{\sigma_i} D(\sigma_i, w_i^{k}) = \log A$ when $w_i^{k}(a) = 1/A$. The case of $h_2$ can be proved in a similar way.

Proof of Theorem 2. When the mirror map is $h_1(x) = \sum_{a=1}^{d} x(a)\log x(a)$, it is known that $D(x, y) \ge \frac{1}{2}\|x - y\|_1^2$ (Pinsker's inequality) and the dual norm is $\|\cdot\|_\infty$. Thus a direct combination of Theorem 1 and Lemma 3 yields

$$\tau\max_{\sigma_i}\mathbb{E} R_i(\sigma_i) \le U_i - (1-\alpha)\,\mathbb{E}\Big[\sum_{k=1}^{T} D(\sigma_i^{k}, w_i^{k-1})\Big] + (N-1)\tau^2\Big(1 + \frac{1}{\alpha\gamma}\Big)\mathbb{E}\Big[\sum_{k=1}^{T}\sum_{j\ne i}\|\sigma_j^{k} - w_j^{k-1}\|_1^2\Big]. \tag{10}$$

Further, summing equation 10 over $i = 1, \dots, N$, we have

$$\sum_{i=1}^{N}\tau\max_{\sigma_i}\mathbb{E} R_i(\sigma_i) \le \sum_{i=1}^{N} U_i - \frac{1-\alpha}{2}\sum_{i=1}^{N}\mathbb{E}\Big[\sum_{k=1}^{T}\|\sigma_i^{k} - w_i^{k-1}\|_1^2\Big] + (N-1)^2\tau^2\Big(1 + \frac{1}{\alpha\gamma}\Big)\mathbb{E}\Big[\sum_{k=1}^{T}\sum_{i=1}^{N}\|\sigma_i^{k} - w_i^{k-1}\|_1^2\Big] = \sum_{i=1}^{N} U_i,$$

where the last equality holds when $\tau$ satisfies $(N-1)^2\tau^2\big(1 + \frac{1}{\alpha\gamma}\big) = \frac{1-\alpha}{2}$. Recall that Lemma 2 shows $U_i$ is of order $O\big((1-\alpha)\log A/p + \tau\big)$ when adopting mirror map $h_1$. By choosing $\alpha = 1 - p$ and $\gamma \in (0, 1/2)$, we can see that there exist constants $C, C'$ such that

$$\sum_{i=1}^{N}\max_{\sigma_i}\mathbb{E} R_i(\sigma_i) \le C' N\big(1 + \log A/\tau\big) \le C N^2\log A/\sqrt{p}.$$

Proof of Lemma 6. With $\alpha = 1$, it holds that

$$\mathbb{E}\|\sigma_j^{k} - w_j^{k-1}\|_1^2 = \mathbb{E}\Big[\sum_{t=1}^{k-1}\mathrm{Prob}[w_j^{k-1} = \sigma_j^{t}]\,\|\sigma_j^{k} - \sigma_j^{t}\|_1^2\Big] = \mathbb{E}\Big[\sum_{t=1}^{k-1} p(1-p)^{k-1-t}\|\sigma_j^{k} - \sigma_j^{t}\|_1^2\Big] \le \sum_{t=1}^{k-1} p(1-p)^{k-1-t}(k-t)^2\tau^2 \le 2\tau^2/p^2.$$

Proof of Theorem 3. When the mirror map is $h_2(x) = \frac{1}{2}\sum_{a=1}^{d} x^2(a)$, it is known that $D(x, y) = \frac{1}{2}\|x - y\|_2^2$ and the dual norm is $\|\cdot\|_2$. Combining the results of Theorem 1, Lemma 3, Lemma 6, and the fact that $\|F_{i,a_{-i}^{k}}(\sigma_{-i}^{k}) - F_{i,a_{-i}^{k}}(w_{-i}^{k-1})\|_2 \le \sqrt{A}\,\|F_{i,a_{-i}^{k}}(\sigma_{-i}^{k}) - F_{i,a_{-i}^{k}}(w_{-i}^{k-1})\|_\infty$, we have

$$\max_{\sigma_i}\mathbb{E} R_i(\sigma_i) \le U_i/\tau + 2\tau^3\big(A^2 + 1/\gamma\big)(N-1)^2 T/p^2.$$

Lemma 2 implies that $U_i = O(1 + \tau)$ when $\alpha = 1$. Letting $\tau = \big((A^2 + 1/\gamma)(N-1)^2 T/p^2\big)^{-1/4}$, we have

$$\max_{\sigma_i}\mathbb{E} R_i(\sigma_i) = O\Big(\big((A^2 + 1/\gamma)(N-1)^2 T/p^2\big)^{1/4}\Big).$$

Proof of Corollary 2.
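According to the upper bound in Theorem 3, it needs $T = O\big((A(N-1)/p)^{1/3}\epsilon^{-4/3}\big)$ rounds to reach an expected strong $\epsilon$-CCE. Recall that the per-round time complexity is $O(pN\,\mathrm{Cost} + N^2 A)$. Thus the expected running time is $O\big(\mathrm{Cost} + (A(N-1)/p)^{1/3}\epsilon^{-4/3}(pN\,\mathrm{Cost} + N^2 A)\big)$. By replacing $p$ with $NA/\mathrm{Cost}$, we can conclude that the running time to reach an expected strong $\epsilon$-CCE is $O\big(\mathrm{Cost} + N^{7/3}A^{2/3}\mathrm{Cost}^{2/3}/\epsilon^{4/3}\big)$.

Proof of Lemma 4. Recall

$$q_{-i}(a_{-i}) = \frac{1}{\sum_{j'\ne i} Z_{j'}}\sum_{j\ne i}|\sigma_j(a_j) - w_j(a_j)|\prod_{x<j,\,x\ne i}\sigma_x(a_x)\prod_{y>j,\,y\ne i} w_y(a_y) \quad\text{and}\quad q^{\mathrm{CJST19}}_{-i}(a_{-i}) = \frac{|\sigma_{-i}(a_{-i}) - w_{-i}(a_{-i})|}{\sum_{a'_{-i}}|\sigma_{-i}(a'_{-i}) - w_{-i}(a'_{-i})|}.$$

We have

$$\begin{aligned}
\frac{q_{-i}(a_{-i})}{q^{\mathrm{CJST19}}_{-i}(a_{-i})} &= \frac{\sum_{a'_{-i}}|\sigma_{-i}(a'_{-i}) - w_{-i}(a'_{-i})|}{\sum_{j'\ne i} Z_{j'}}\cdot\frac{\sum_{j\ne i}|\sigma_j(a_j) - w_j(a_j)|\prod_{x<j,\,x\ne i}\sigma_x(a_x)\prod_{y>j,\,y\ne i} w_y(a_y)}{|\sigma_{-i}(a_{-i}) - w_{-i}(a_{-i})|} \\
&\ge \frac{\sum_{a'_{-i}}|\sigma_{-i}(a'_{-i}) - w_{-i}(a'_{-i})|}{\sum_{j'\ne i} Z_{j'}}\cdot\frac{\big|\sum_{j\ne i}\big(\sigma_j(a_j) - w_j(a_j)\big)\prod_{x<j,\,x\ne i}\sigma_x(a_x)\prod_{y>j,\,y\ne i} w_y(a_y)\big|}{|\sigma_{-i}(a_{-i}) - w_{-i}(a_{-i})|} \\
&= \frac{\sum_{a'_{-i}}|\sigma_{-i}(a'_{-i}) - w_{-i}(a'_{-i})|}{\sum_{j'\ne i} Z_{j'}},
\end{aligned}$$

where the last equality is according to Lemma 10. Further, let $a_{-i,-j}$ denote the action profile after removing $a_i$ and $a_j$. We have

$$\begin{aligned}
\sum_{j\ne i} Z_j = \sum_{j\ne i}\sum_{a_j}|\sigma_j(a_j) - w_j(a_j)| &= \sum_{j\ne i}\sum_{a_j}\Big|\sum_{a_{-i,-j}}\big(\sigma_{-i}(a_{-i,-j}, a_j) - w_{-i}(a_{-i,-j}, a_j)\big)\Big| \\
&\le \sum_{j\ne i}\sum_{a_j}\sum_{a_{-i,-j}}\big|\sigma_{-i}(a_{-i,-j}, a_j) - w_{-i}(a_{-i,-j}, a_j)\big| \\
&= \sum_{j\ne i}\sum_{a_{-i}}|\sigma_{-i}(a_{-i}) - w_{-i}(a_{-i})| = (N-1)\sum_{a_{-i}}|\sigma_{-i}(a_{-i}) - w_{-i}(a_{-i})|.
\end{aligned}$$

Therefore, we have $q_{-i}(a_{-i})/q^{\mathrm{CJST19}}_{-i}(a_{-i}) \ge 1/(N-1)$, and the variance of our LVE estimator can be bounded as

$$\sum_{a_{-i}}\big(\sigma_{-i}(a_{-i}) - w_{-i}(a_{-i})\big)^2/q_{-i}(a_{-i}) \le (N-1)\sum_{a_{-i}}\big(\sigma_{-i}(a_{-i}) - w_{-i}(a_{-i})\big)^2/q^{\mathrm{CJST19}}_{-i}(a_{-i}) = (N-1)\Big(\sum_{a_{-i}}|\sigma_{-i}(a_{-i}) - w_{-i}(a_{-i})|\Big)^2.$$

This finishes the proof.

Proof of Lemma 5. Let $AG$ denote the set of agnostic $j$. If $j$ is agnostic to $q'_{-i}$, we assume $\sigma_j(a_j) = 0$ for all $a_j$ that have been accessed by $q'_{-i}$. We further assume $\sigma_{j'} = w_{j'}$ for $j' \notin AG$. Then we can construct an equivalent game $G$ with $|AG| + 1 = m + 1$ players and $|A_{-i}| = 2^m$. Moreover, $q'_{-i}$ does not know any entries of $\sigma_{-i}$.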
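Then for $w_{-i}(a_{-i}) = 1/2^m$ and $q'_{-i}$, we have

$$\begin{aligned}
\max_{\sigma_{-i}}\sum_{a_{-i}}\big(\sigma_{-i}(a_{-i}) - w_{-i}(a_{-i})\big)^2/q'_{-i}(a_{-i}) &\ge \max_{a_{-i}}\Big[\big(1 - w_{-i}(a_{-i})\big)^2/q'_{-i}(a_{-i}) + \sum_{a'_{-i}\ne a_{-i}}\frac{w_{-i}^2(a'_{-i})}{q'_{-i}(a'_{-i})}\Big] \\
&= \max_{a_{-i}}\Big[\frac{1}{q'_{-i}(a_{-i})} - 2\frac{w_{-i}(a_{-i})}{q'_{-i}(a_{-i})} + \sum_{a'_{-i}}\frac{w_{-i}^2(a'_{-i})}{q'_{-i}(a'_{-i})}\Big] \\
&\ge 2^m - 2.
\end{aligned}$$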
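To make the comparison in the proof of Lemma 4 concrete, the following sketch builds both sampling distributions for a small random instance and evaluates the quantities that appear in the proof. It is only an illustration under stated assumptions: the lists `sigma` and `w` hold the strategies of the $N-1$ opponents of player $i$ (player $i$ itself is omitted), the function names are hypothetical, and the instance is random rather than taken from the paper.

```python
import itertools
import math
import random


def random_distribution(num_actions, rng):
    """Draw a strictly positive probability vector (illustrative helper)."""
    weights = [rng.random() + 1e-3 for _ in range(num_actions)]
    total = sum(weights)
    return [x / total for x in weights]


def joint(strategies, action):
    """Product distribution evaluated at a joint action (empty product is 1)."""
    return math.prod(s[a] for s, a in zip(strategies, action))


def lve_distribution(sigma, w, actions):
    """q_{-i}: mixture over opponents j with weights proportional to Z_j."""
    z_total = sum(abs(s - v) for s_j, w_j in zip(sigma, w) for s, v in zip(s_j, w_j))
    q = {}
    for action in actions:
        mass = 0.0
        for j, a_j in enumerate(action):
            before = joint(sigma[:j], action[:j])          # opponents x < j play sigma_x
            after = joint(w[j + 1:], action[j + 1:])       # opponents y > j play w_y
            mass += abs(sigma[j][a_j] - w[j][a_j]) * before * after
        q[action] = mass / z_total
    return q


def cjst19_distribution(sigma, w, actions):
    """q^{CJST19}_{-i}: proportional to |sigma_{-i}(a) - w_{-i}(a)|."""
    gaps = {a: abs(joint(sigma, a) - joint(w, a)) for a in actions}
    total = sum(gaps.values())
    return {a: g / total for a, g in gaps.items()}


def second_moment(sigma, w, q, actions):
    """sum_a (sigma_{-i}(a) - w_{-i}(a))^2 / q(a), skipping zero-mass actions."""
    return sum((joint(sigma, a) - joint(w, a)) ** 2 / q[a] for a in actions if q[a] > 0)


def compare_estimators(num_opponents=3, num_actions=3, seed=0):
    rng = random.Random(seed)
    sigma = [random_distribution(num_actions, rng) for _ in range(num_opponents)]
    w = [random_distribution(num_actions, rng) for _ in range(num_opponents)]
    actions = list(itertools.product(range(num_actions), repeat=num_opponents))
    q = lve_distribution(sigma, w, actions)
    q_ref = cjst19_distribution(sigma, w, actions)
    return {
        "mass": sum(q.values()),
        "min_ratio": min(q[a] / q_ref[a] for a in actions if q_ref[a] > 0),
        "moment_lve": second_moment(sigma, w, q, actions),
        "moment_ref": second_moment(sigma, w, q_ref, actions),
        "l1_gap": sum(abs(joint(sigma, a) - joint(w, a)) for a in actions),
    }
```

On such an instance, `mass` should be 1, `min_ratio` should be at least `1 / num_opponents`, and `moment_lve` should not exceed `num_opponents * l1_gap ** 2`, matching the three steps of the proof. Note that `cjst19_distribution` enumerates all joint actions, which is only feasible for toy sizes.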



$h : \mathbb{R}^d \to \mathbb{R}$ is called a mirror map if $h$ is strongly convex with respect to some norm and $\nabla h(\mathbb{R}^d) = \mathbb{R}^d$.

In Theorem 1, we consider the general dual norm $\|\cdot\|_*$. Here, we only upper bound $\|\cdot\|_\infty$ because when the mirror map is $h_1$ or $h_2$, the dual norm is $\|\cdot\|_\infty$ or $\|\cdot\|_2 \le \sqrt{A}\,\|\cdot\|_\infty$, respectively.



(CJST19) and Alacaoglu et al. (AM21) develop variance-reduced stochastic no-regret learning algorithms. Their algorithms significantly accelerate the computation of ϵ-Nash equilibrium in two-player zero-sum games.

For any vector $v \in \mathbb{R}^d$, denote $v(j)$ as its $j$th coordinate, $\|v\|_1 = \sum_{j=1}^{d}|v(j)|$ as its $\ell_1$-norm, $\|v\|_2 = \sqrt{\sum_{j=1}^{d} v^2(j)}$ as its $\ell_2$-norm, and $\|v\|_\infty = \max_{j=1,\dots,d}|v(j)|$ as its $\ell_\infty$-norm. For a general norm $\|\cdot\|$, let $\|\cdot\|_*$ represent its dual norm. Denote $\langle v, w\rangle = \sum_{i=1}^{d} v(i)w(i)$ as the standard inner product of two vectors $v, w \in \mathbb{R}^d$. For a positive integer $n$, let $[n] = \{1, \dots, n\}$. For a discrete set $S$, let $\Delta(S)$ be the set of distributions over $S$.

$\sum_{a=1}^{d} x(a)\log x(a)$ and squared $\ell_2$-norm $h_2(x) = \sum_{a=1}^{d} x^2(a)$. Their corresponding Bregman divergences are $D_1(x, y) = \sum_{a=1}^{d} x(a)\log(x(a)/y(a))$ and $D_2(x, y) = \sum_{a=1}^{d}(x(a) - y(a))^2$, respectively. Our analyses apply to other mirror maps as well.
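The two closed forms can be recovered from the generic definition $D(x, y) = h(x) - h(y) - \langle\nabla h(y), x - y\rangle$. The sketch below is illustrative (function names are assumptions) and restricts attention to strictly positive points of the simplex, where the entropic divergence reduces to the stated sum because $\sum_a x(a) = \sum_a y(a)$.

```python
import math


def bregman(h, grad_h, x, y):
    """Generic Bregman divergence D(x, y) = h(x) - h(y) - <grad h(y), x - y>."""
    inner = sum(g * (xa - ya) for g, xa, ya in zip(grad_h(y), x, y))
    return h(x) - h(y) - inner


def h1(x):
    """Negative entropy; requires strictly positive entries."""
    return sum(xa * math.log(xa) for xa in x)


def grad_h1(x):
    return [math.log(xa) + 1.0 for xa in x]


def h2(x):
    """Squared l2-norm, as written in the text above."""
    return sum(xa * xa for xa in x)


def grad_h2(x):
    return [2.0 * xa for xa in x]


def d1_closed_form(x, y):
    """sum_a x(a) log(x(a)/y(a))."""
    return sum(xa * math.log(xa / ya) for xa, ya in zip(x, y))


def d2_closed_form(x, y):
    """sum_a (x(a) - y(a))^2."""
    return sum((xa - ya) ** 2 for xa, ya in zip(x, y))
```

For two distributions `x` and `y` with positive entries, `bregman(h1, grad_h1, x, y)` agrees with `d1_closed_form(x, y)` and `bregman(h2, grad_h2, x, y)` agrees with `d2_closed_form(x, y)` up to rounding. The entropic divergence also satisfies Pinsker's inequality, that is, it is at least half the squared $\ell_1$ distance.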

A.2 PROOF OF THEOREM 1

If $D(x, y) \ge \gamma\|x - y\|^2$. Summing equation 8 over $k = 1, \dots, T$, applying Lemma 8 and taking expectation on both sides, we have

Published as a conference paper at 2023

According to Theorem 2, to arrive at a weak $\epsilon$-CCE, we need to run Algorithm 1 for $T = CN^2\log A/(\epsilon\sqrt{p})$ rounds. Recall that the per-round time complexity is $O(pN\,\mathrm{Cost} + N^2 A)$. Then the total running time is $O\big(\mathrm{Cost} + (pN\,\mathrm{Cost} + N^2 A)\,N^2\log A/(\epsilon\sqrt{p})\big) = O\big(\mathrm{Cost} + N^{7/2}\sqrt{A\,\mathrm{Cost}}\,\log A/\epsilon\big)$ with $p = NA/\mathrm{Cost}$.
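The substitution $p = NA/\mathrm{Cost}$ in the running-time bound can be checked with a few lines of arithmetic. In the sketch below, `C` stands for the unspecified constant of Theorem 2 and defaults to 1 purely for illustration; the function names and the sample values are likewise assumptions. With this choice of $p$ the per-round cost becomes $2N^2A$, so the total equals $\mathrm{Cost} + 2C N^{7/2}\sqrt{A\,\mathrm{Cost}}\log A/\epsilon$.

```python
import math


def total_running_time(N, A, cost, eps, C=1.0):
    """Cost + (rounds) * (per-round cost), with p = N*A/cost.
    Assumes cost >= N*A so that p is a valid probability."""
    p = N * A / cost
    rounds = C * N ** 2 * math.log(A) / (eps * math.sqrt(p))
    per_round = p * N * cost + N ** 2 * A  # equals 2 * N^2 * A for this p
    return cost + per_round * rounds


def closed_form(N, A, cost, eps, C=1.0):
    """Cost + 2*C*N^(7/2)*sqrt(A*cost)*log(A)/eps, the simplified expression."""
    return cost + 2.0 * C * N ** 3.5 * math.sqrt(A * cost) * math.log(A) / eps
```

For example, `total_running_time(4, 10, 10**4, 0.1)` and `closed_form(4, 10, 10**4, 0.1)` agree up to floating-point rounding, which confirms the exponent $7/2$ on $N$ and the $\sqrt{A\,\mathrm{Cost}}$ factor.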

$\sum_{j\ne i}|\sigma_j(a_j) - w_j(a_j)|\prod_{x<j,\,x\ne i}\sigma_x(a_x)\prod_{y>j,\,y\ne i} w_y(a_y)$ and $q^{\mathrm{CJST19}}$

$\sum_{j\ne i} Z_j = \sum_{j\ne i}\sum_{a_j}|\sigma_j(a_j) - w_j(a_j)| = \sum_{j\ne i}\sum_{a_j}\big|\sum_{a_{-i,-j}}$

). And it remains unclear how to design such a low-variance distribution for general games. One of our main contributions is a low-variance Monte-Carlo estimator for general games, which ensures fast convergence in general games.

ACKNOWLEDGEMENT

Shuai Li is supported by the National Natural Science Foundation of China (92270201, 62006151, 62076161) and the Shanghai Sailing Program. The underlying research carried out in this paper is not a part of any projects funded by the National Natural Science Foundation of China.

