META-LEARNING IN GAMES

Abstract

In the literature on game-theoretic equilibrium finding, the focus has mainly been on solving a single game in isolation. In practice, however, strategic interactions, ranging from routing problems to online advertising auctions, evolve dynamically, giving rise to many similar games that must be solved. To address this gap, we introduce meta-learning for equilibrium finding and learning to play games. We establish the first meta-learning guarantees for a variety of fundamental and well-studied classes of games, including two-player zero-sum games, general-sum games, and Stackelberg games. In particular, we obtain rates of convergence to different game-theoretic equilibria that depend on natural notions of similarity between the games in the sequence, while at the same time recovering the known single-game guarantees when the sequence of games is arbitrary. Along the way, we prove a number of new results in the single-game regime through a simple and unified framework, which may be of independent interest. Finally, we evaluate our meta-learning algorithms on endgames faced by the poker agent Libratus against top human professionals. The experiments show that games with varying stack sizes can be solved significantly faster using our meta-learning techniques than by solving them separately, often by an order of magnitude.

1. INTRODUCTION

Research on game-theoretic equilibrium computation has primarily focused on solving a single game in isolation. In practice, however, there are often many similar games that need to be solved. One use case is the setting where one wants to find an equilibrium for each of multiple game variations, for example poker games where the players have various sizes of chip stacks. Another use case is strategic interactions that evolve dynamically: in online advertising auctions, the advertiser's value for different keywords adapts based on current marketing trends (Nekipelov et al., 2015); routing games, be it Internet routing or physical transportation, reshape depending on the topology and the cost functions of the underlying network (Hoefer et al., 2011); and resource allocation problems (Johari and Tsitsiklis, 2004) vary based on the values of the goods/services. Successful agents in such complex decentralized environments must effectively learn how to incorporate past experience from previous strategic interactions in order to adapt their behavior to the current and future tasks.

Meta-learning, or learning-to-learn (Thrun and Pratt, 1998), is a common formalization for machine learning in dynamic single-agent environments. In the meta-learning framework, a learning agent faces a sequence of tasks, and the goal is to use knowledge gained from previous tasks in order to improve performance on the current task at hand. Despite rapid progress in this line of work, prior results have not been tailored to tackle multiagent settings. This raises the question: Can players obtain provable performance improvements when meta-learning across a sequence of games? We answer this question in the affirmative by introducing meta-learning for equilibrium finding and learning to play games, and providing the first performance guarantees in a number of fundamental multiagent settings.

1.1. OVERVIEW OF OUR RESULTS

Our main contribution is to develop a general framework for establishing the first provable guarantees for meta-learning in games, leading to a comprehensive set of results in a variety of well-studied multiagent settings. In particular, our results encompass environments ranging from two-player zero-sum games with general constraint sets (and multiple extensions thereof), to general-sum games and Stackelberg games. See Table 1 for a summary of our results.

Our refined guarantees are parameterized based on natural similarity metrics between the sequence of games. For example, in zero-sum games we obtain last-iterate rates that depend on the variance of the Nash equilibria (Theorem 3.2); in potential games based on the deviation of the potential functions (Theorem 3.4); and in Stackelberg games our regret bounds depend on the similarity of the leader's optimal commitment in hindsight (Theorem 3.8). All of these measures are algorithm-independent, and tie naturally to the underlying game-theoretic solution concepts. Importantly, our algorithms are agnostic to how similar the games are, but are nonetheless specifically designed to adapt to the similarity.

Our guarantees apply under a broad class of no-regret learning algorithms, such as optimistic mirror descent (OMD) (Chiang et al., 2012; Rakhlin and Sridharan, 2013b), with the important twist that each player employs an additional regret minimizer for meta-learning the parameterization of the base-learner; the latter component builds on the meta-learning framework of Khodak et al. (2019). For example, in zero-sum games we leverage an initialization-dependent RVU bound (Syrgkanis et al., 2015) in order to meta-learn the initialization of OMD across the sequence of games, leading to per-game convergence rates to Nash equilibria that closely match our refined lower bound (Theorem 3.3).
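As a rough illustration of the idea in the preceding paragraph, the following Python sketch runs Euclidean optimistic mirror descent on a short sequence of 2x2 zero-sum matrix games and warm-starts each game from the running mean of the final iterates of the previous games. The function names, the example payoff matrices, the step size, and the use of a running mean as the meta-learner are all assumptions made for illustration; this is not the algorithm analyzed in this paper, nor its experimental setup.

```python
def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        css += ui
        t = (css - 1.0) / i
        if ui - t > 0:
            theta = t
    return [max(vi - theta, 0.0) for vi in v]


def mat_vec(A, y):
    return [sum(a * b for a, b in zip(row, y)) for row in A]


def mat_t_vec(A, x):
    return [sum(A[i][j] * x[i] for i in range(len(A))) for j in range(len(A[0]))]


def duality_gap(A, x, y):
    """max_y' x^T A y' - min_x' x'^T A y; zero at a Nash equilibrium."""
    return max(mat_t_vec(A, x)) - min(mat_vec(A, y))


def omd_solve(A, x0, y0, eta=0.1, iters=50):
    """Euclidean OMD for min_x max_y x^T A y, started from (x0, y0)."""
    xh, yh = list(x0), list(y0)                # secondary ("hat") sequence
    gx, gy = [0.0] * len(x0), [0.0] * len(y0)  # predictions: last gradients
    x, y = list(x0), list(y0)
    for _ in range(iters):
        # Optimistic step using the previous gradient as the prediction.
        x = project_simplex([a - eta * g for a, g in zip(xh, gx)])
        y = project_simplex([a - eta * g for a, g in zip(yh, gy)])
        gx = mat_vec(A, y)                     # loss gradient, min player
        gy = [-g for g in mat_t_vec(A, x)]     # loss gradient, max player
        # Update of the secondary sequence with the observed gradient.
        xh = project_simplex([a - eta * g for a, g in zip(xh, gx)])
        yh = project_simplex([a - eta * g for a, g in zip(yh, gy)])
    return x, y


def meta_solve(games, eta=0.1, iters=50):
    """Solve games in order; initialize each from the mean of past solutions.

    Assumption: the running mean of final iterates stands in for the
    meta-learner over initializations.
    """
    n, m = len(games[0]), len(games[0][0])
    x_init, y_init = [1.0 / n] * n, [1.0 / m] * m
    gaps = []
    for t, A in enumerate(games, start=1):
        x, y = omd_solve(A, x_init, y_init, eta, iters)
        gaps.append(duality_gap(A, x, y))
        x_init = [c + (a - c) / t for c, a in zip(x_init, x)]
        y_init = [c + (a - c) / t for c, a in zip(y_init, y)]
    return gaps, x_init, y_init


if __name__ == "__main__":
    # Hypothetical sequence of similar games: one payoff entry drifts slowly.
    games = [[[2.0 + 0.02 * t, -1.0], [-1.0, 1.0]] for t in range(8)]
    meta_gaps, _, _ = meta_solve(games)
    solo_gaps = [duality_gap(A, *omd_solve(A, [0.5, 0.5], [0.5, 0.5]))
                 for A in games]
    for t, (g_meta, g_solo) in enumerate(zip(meta_gaps, solo_gaps)):
        print(f"game {t}: meta-init gap {g_meta:.2e}, uniform-init gap {g_solo:.2e}")
```

Running the script prints, for each game, the duality gap reached under the same iteration budget with the warm start and with a uniform start. When the games are similar, the warm-started runs would be expected to end closer to equilibrium, although a toy example of this kind is not evidence for the rates stated above.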
More broadly, in the worst case, i.e., when the sequence of games is arbitrary, we recover the near-optimal guarantees known for static games, but as the similarity metrics become more favorable we establish significant gains in terms of convergence to different notions of equilibria. Along the way, we also obtain new insights and results even from a single-game perspective, including convergence rates of OMD and the extra-gradient method in Hölder continuous variational inequalities (Rakhlin and Sridharan, 2013a), and certain nonconvex-nonconcave problems such as those considered by Diakonikolas et al. (2021) and stochastic games. Further, our analysis is considerably simpler than prior techniques and unifies several prior results.

Finally, in Section 4 we evaluate our techniques on a series of poker endgames faced by the poker agent Libratus (Brown and Sandholm, 2018) against top human professionals. The experiments show that our meta-learning algorithms offer significant gains compared to solving each game in isolation, often by an order of magnitude.

et al., 2011; Bertrand et al., 2020; Meigs et al., 2017). Indeed, a number of prior works study the performance of learning algorithms in time-varying zero-sum games (Zhang et al., 2022b; Fiez et al., 2021b; Duvocelle et al., 2022; Cardoso et al., 2019); there, it is natural to espouse dynamic notions of regret (Yang et al., 2016; Zhao et al., 2020). A work closely related to ours is the recent paper by Zhang et al. (2022b), which provides regret bounds in time-varying bilinear saddle-point problems parameter-



Table 1: A summary of our key theoretical results on meta-learning in games.

