ASYNCHRONOUS GRADIENT PLAY IN ZERO-SUM MULTI-AGENT GAMES

Abstract

Finding equilibria via gradient play in competitive multi-agent games has attracted growing attention in recent years, with emphasis on designing efficient strategies where the agents operate in a decentralized and symmetric manner with guaranteed convergence. While significant efforts have been made to understand zero-sum two-player matrix games, performance in zero-sum multi-agent games remains inadequately explored, especially in the presence of delayed feedback, leaving the scalability and resiliency of gradient play open to question. In this paper, we make progress by studying asynchronous gradient play in zero-sum polymatrix games under delayed feedback. We first establish that the last iterate of the entropy-regularized optimistic multiplicative weights update (OMWU) method converges linearly to the quantal response equilibrium (QRE), the solution concept under bounded rationality, in the absence of delays. While this linear convergence continues to hold even when the feedback is randomly delayed under mild statistical assumptions, the rate becomes noticeably slower due to a smaller tolerable range of learning rates. Moving beyond, we demonstrate that entropy-regularized OMWU, by adopting two-timescale learning rates in a delay-aware manner, enjoys faster last-iterate convergence under fixed delays, and continues to converge provably, in an average-iterate sense, even when the delays are arbitrarily bounded. Our methods also lead to finite-time guarantees for approximating the Nash equilibrium (NE) by moderating the amount of regularization. To the best of our knowledge, this is the first work that aims to understand asynchronous gradient play in zero-sum polymatrix games under a wide range of delay assumptions, highlighting the role of learning rate separation.

1. INTRODUCTION

Finding equilibria of multi-player games via gradient play lies at the heart of game theory, which permeates a remarkable breadth of modern applications, including but not limited to competitive reinforcement learning (RL) (Littman, 1994), generative adversarial networks (GANs) (Goodfellow et al., 2014) and adversarial training (Mertikopoulos et al., 2018). While conventional wisdom leans towards the paradigm of centralized learning (Bertsekas & Tsitsiklis, 1989), retrieving and sharing information across multiple agents raise questions in terms of both privacy and efficiency, leading to a significant amount of interest in designing decentralized learning algorithms that utilize only local payoff feedback, with the updates at different agents executed in a symmetric manner. In reality, there is no shortage of scenarios where the feedback can be obtained only in a delayed manner (He et al., 2014), i.e., the agents only receive the payoff information sent from a previous round instead of the current round, due to, for example, communication slowdowns and congestion. Substantial progress has been made towards reliable and efficient online learning with delayed feedback in various settings, e.g., stochastic multi-armed bandits (Pike-Burke et al., 2018; Vernade et al., 2017), adversarial multi-armed bandits (Cesa-Bianchi et al., 2016; Li et al., 2019), online convex optimization (Quanrud & Khashabi, 2015; McMahan & Streeter, 2014) and multi-player games (Meng et al., 2022; Héliou et al., 2020; Zhou et al., 2017). Typical approaches to combating delays include subsampling the payoff history (Weinberger & Ordentlich, 2002; Joulani et al., 2013), or adopting adaptive learning rates suggested by delay-aware analyses (Quanrud & Khashabi, 2015; McMahan & Streeter, 2014; Hsieh et al., 2020; Flaspohler et al., 2021). Most of these efforts, however, have been limited to either the asymptotic convergence to the equilibrium (Zhou et al., 2017; Héliou et al., 2020) or the study of individual regret, which characterizes the performance gap between an agent's learning trajectory and the best policy in hindsight. It remains highly inadequate when it comes to guaranteeing finite-time convergence to the equilibrium in a multi-player environment, especially in the presence of delayed feedback, thus leaving the scalability and resiliency of gradient play open to question.

Learning rate      Type of delay    ϵ-QRE iterations                                    ϵ-NE iterations
single-timescale   none             τ^-1 d_max ∥A∥_∞ log ϵ^-1                           d_max ∥A∥_∞ ϵ^-1
single-timescale   statistical      τ^-2 d_max^2 ∥A∥_∞^2 (γ+1)^2 log ϵ^-1               d_max^2 ∥A∥_∞^2 (γ+1)^2 ϵ^-2
two-timescale      constant         τ^-1 d_max ∥A∥_∞ (γ+1)^2 log ϵ^-1                   d_max ∥A∥_∞ (γ+1)^2 ϵ^-1
two-timescale      bounded          τ^-2 n d_max^3 ∥A∥_∞^3 (γ+1)^{5/2} ϵ^-1             n d_max^3 ∥A∥_∞^3 (γ+1)^{5/2} ϵ^-3

Table 1: Iteration complexities of the proposed OMWU method for finding ϵ-QRE/NE of zero-sum polymatrix games, where logarithmic dependencies are omitted. Here, γ denotes the maximal time delay when the delay is bounded, n denotes the number of agents in the game, d_max is the maximal degree of the graph, and ∥A∥_∞ = max_{i,j} ∥A_{i,j}∥_∞ is the ℓ_∞ norm of the entire payoff matrix A (over all games in the network). We only present the result under statistical delay when the delays are bounded for ease of comparison, while more general bounds are given in Section 3.2.
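To see how the ϵ-NE entries of Table 1 relate to the ϵ-QRE entries, recall from the abstract that NE guarantees are obtained by moderating the amount of regularization: the QRE of the τ-regularized game approximates the NE with a gap that scales with τ (the exact approximation constant is assumed here, not stated in this excerpt), so setting τ on the order of ϵ, up to logarithmic factors, converts each QRE complexity into its NE counterpart. For instance, for the first row:

```latex
% Substitute \tau \asymp \epsilon (up to log factors) into the \epsilon-QRE bound:
\underbrace{\tau^{-1}\, d_{\max}\|A\|_{\infty}\,\log \epsilon^{-1}}_{\epsilon\text{-QRE}}
\;\Big|_{\tau \,\asymp\, \epsilon}
\;=\; \widetilde{O}\!\left(d_{\max}\|A\|_{\infty}\,\epsilon^{-1}\right)
\qquad (\epsilon\text{-NE}).
```

The same substitution applied to the τ^-2 rows yields the ϵ^-2 and ϵ^-3 entries, the latter because the bounded-delay QRE bound already carries a factor of ϵ^-1 from its average-iterate guarantee.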
In this work, we initiate the study of asynchronous learning algorithms for an important class of games called zero-sum polymatrix games (also known as network matrix games (Bergman & Fokin, 1998)), which generalize two-player zero-sum matrix games to the multi-player setting and serve as an important stepping stone to more general multi-player general-sum games. Zero-sum polymatrix games are commonly used to describe situations in which the agents' interactions are captured by an interaction graph and the entire system of games is closed, so that the total payoff remains invariant in the system. They find applications in an increasing number of important domains such as security games (Cai et al., 2016), graph transduction (Bernardi, 2021), and more. In particular, we focus on finite-time last-iterate convergence to two prevalent solution concepts in game theory, namely the Nash equilibrium (NE) and the quantal response equilibrium (QRE), which accounts for bounded rationality (McKelvey & Palfrey, 1995). Despite the seemingly simple formulation, few existing works have achieved this goal even in the synchronous setting, i.e., with instantaneous feedback. Leonardos et al. (2021) studied a continuous-time learning dynamics that converges to the QRE at a linear rate. Anagnostides et al. (2022) demonstrated that Optimistic Mirror Descent (OMD) (Rakhlin & Sridharan, 2013) enjoys finite-time last-iterate convergence to the NE, yet the analysis therein requires the regularizer to have a continuous gradient, which incurs computational overhead for solving a subproblem every iteration. In contrast, an appealing alternative is the entropy regularizer, which leads to closed-form multiplicative updates and is computationally more desirable, but remains poorly understood. In sum, designing efficient learning algorithms that provably converge to the game equilibria has been technically challenging, even in the synchronous setting.
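To make the appeal of the entropy regularizer concrete, the following is a minimal sketch of entropy-regularized OMWU for the special case of a synchronous two-player zero-sum matrix game, illustrating the closed-form multiplicative updates. It is an illustration rather than the paper's algorithm: all names are hypothetical, and the polymatrix structure, delays, and two-timescale schedules are omitted.

```python
import numpy as np

def entropy_omwu(A, x0, y0, eta=0.1, tau=0.1, iters=2000):
    """Entropy-regularized OMWU sketch for a two-player zero-sum matrix game.

    The max player holds x, the min player holds y; tau is the entropy
    regularization strength and eta the (single-timescale) learning rate.
    Each step is a closed-form multiplicative update followed by a
    normalization back onto the probability simplex.
    """
    x, y = x0 / x0.sum(), y0 / y0.sum()    # main iterates
    xb, yb = x.copy(), y.copy()            # extrapolation ("bar") iterates
    for _ in range(iters):
        # extrapolation step: multiplicative update against the previous
        # extrapolated opponent strategy
        xb_new = x ** (1 - eta * tau) * np.exp(eta * (A @ yb))
        yb_new = y ** (1 - eta * tau) * np.exp(-eta * (A.T @ xb))
        xb, yb = xb_new / xb_new.sum(), yb_new / yb_new.sum()
        # main update: same multiplicative form, now evaluated at the
        # freshly extrapolated opponent strategies
        x = x ** (1 - eta * tau) * np.exp(eta * (A @ yb))
        y = y ** (1 - eta * tau) * np.exp(-eta * (A.T @ xb))
        x, y = x / x.sum(), y / y.sum()
    return x, y
```

For matching pennies, A = [[1, -1], [-1, 1]], the unique QRE is the uniform strategy for both players by symmetry, and iterates started from skewed strategies approach it for small enough eta; no subproblem needs to be solved at any iteration, in contrast to regularizers without closed-form updates.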

1.1. OUR CONTRIBUTIONS

In this paper, we develop provably convergent algorithms, broadly dubbed asynchronous gradient play, to find the QRE and NE of zero-sum polymatrix games in a decentralized and symmetric manner with delayed feedback. We propose an entropy-regularized Optimistic Multiplicative Weights Update (OMWU) method (Cen et al., 2021), where each player symmetrically updates their strategy without access to the payoff matrices and other players' strategies, and initiate a systematic investigation of the impact of delays on its convergence under two learning rate schedules. Our main contributions are summarized as follows.

• Finite-time last-iterate convergence of single-timescale OMWU. We begin by showing that, in the synchronous setting, the single-timescale OMWU method, where the same learning rate is adopted for extrapolation and update, achieves last-iterate convergence to the QRE at a linear

