STOCHASTIC NO-REGRET LEARNING FOR GENERAL GAMES WITH VARIANCE REDUCTION

Abstract

We show that a stochastic version of optimistic mirror descent (OMD), a variant of mirror descent with recency bias, converges fast in general games. More specifically, with our algorithm, the individual regret of each player vanishes at a rate of O(1/T^{3/4}) and the sum of all players' regrets vanishes at a rate of O(1/T), where T is the number of interaction rounds; this improves upon the O(1/√T) convergence rate of prior stochastic algorithms. Because stochastic methods are computationally cheaper per round, we significantly improve the time complexity of approximating a coarse correlated equilibrium relative to deterministic algorithms. To achieve this lower time complexity, we equip the stochastic version of OMD in (AM21) with a novel low-variance Monte-Carlo estimator. Our algorithm extends previous works (AM21; CJST19) from two-player zero-sum games to general games.

1. INTRODUCTION

How does a player in a game interact with others while selfishly maximizing its own utility? This is a central problem in online learning and game theory, with intimate connections to economics, auction design, and machine learning. The study of this problem was pioneered by (Bro49; Rob51). Robinson (Rob51) shows that fictitious play asymptotically converges to a Nash equilibrium in two-player zero-sum games, but its convergence rate is exponentially slow and it may not even converge in non-zero-sum games (Sha64). Another natural choice for each player is to use no-regret learning algorithms. With some well-known families of no-regret learning algorithms, e.g., mirror descent (NY83) and follow-the-regularized-leader (KV05), the average regret of each player vanishes at a rate of O(1/√T), where T is the number of interaction rounds. This regret bound implies an O(1/√T) convergence rate to a coarse correlated equilibrium in general games (or a Nash equilibrium in two-player zero-sum games). It is noteworthy that Chen and Peng (CP20) show that these algorithms' convergence rate is Ω(1/√T). Players can do even better with no-regret algorithms tailored for games; the most representative one is optimistic mirror descent (OMD), a variant of mirror descent with recency bias. However, the computational cost for players to run OMD, as well as other deterministic no-regret algorithms, can be unmanageable: each player needs to compute the exact loss vector to update its strategy, and the time complexity of computing this exact loss vector is, in the worst case, exponential in the number of players in the game. One standard way to accelerate the computation is to estimate the loss vector with Monte-Carlo methods, but a Monte-Carlo estimator with uncontrolled variance immediately degrades the convergence rate to O(1/√T).
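To make these ingredients concrete, the following sketch (illustrative only, not the paper's algorithm or its low-variance estimator) combines optimistic multiplicative weights, an instance of OMD with the entropy regularizer, with a naive unbiased Monte-Carlo loss estimator for a two-player matrix game. All function and variable names here are hypothetical. The paper's point is precisely that such a naive estimator's variance, left uncontrolled, costs the fast rate.

```python
import numpy as np

def optimistic_mwu(loss_fn, n_actions, n_rounds, eta=0.1):
    """Optimistic multiplicative weights (entropy-regularized OMD), a sketch.
    loss_fn maps the played mixed strategy to a loss vector (exact or estimated)."""
    x = np.ones(n_actions) / n_actions   # secondary (mirror-descent) iterate
    m = np.zeros(n_actions)              # recency-bias prediction of the loss
    plays = []
    for _ in range(n_rounds):
        # optimistic step: act as if the most recent loss repeats
        y = x * np.exp(-eta * m)
        y /= y.sum()
        plays.append(y)
        loss = loss_fn(y)
        # standard mirror-descent step on the observed loss
        x = x * np.exp(-eta * loss)
        x /= x.sum()
        m = loss                         # next round's prediction
    return plays

def mc_loss_estimate(payoff, opponent_strategy, n_samples, rng):
    """Naive unbiased Monte-Carlo loss estimate in a two-player matrix game:
    sample the opponent's action rather than averaging over its full support."""
    samples = rng.choice(len(opponent_strategy), size=n_samples,
                         p=opponent_strategy)
    return payoff[:, samples].mean(axis=1)
```

With `n_samples = 1` the estimator is what a bandit-style player could afford per round, but its variance is constant in T, which is exactly the regime where the O(1/√T) rate reappears.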



Syrgkanis et al. (SALS15) and Rakhlin et al. (RS13) show that OMD approaches optimal social welfare (equivalently, minimizes the sum of all players' regrets) at a rate of O(1/T) and minimizes each player's individual regret at a rate of O(1/T^{3/4}). Several works (CP20; HAM21; DFG21) improve the results of (SALS15) under different settings or assumptions. Remarkably, Daskalakis et al. (DFG21) improve the individual-regret convergence rate of OMD to O(poly(log T)/T) in general games.

