LEARNING RATIONALIZABLE EQUILIBRIA IN MULTIPLAYER GAMES

Abstract

A natural goal in multi-agent learning is to learn rationalizable behavior, where players learn to avoid any Iteratively Dominated Action (IDA). However, standard no-regret equilibrium-finding algorithms may require exponentially many samples to find such rationalizable strategies. In this paper, we first propose a simple yet sample-efficient algorithm for finding a rationalizable action profile in multi-player general-sum games under bandit feedback, which substantially improves over the results of Wu et al. (2021). We further develop algorithms with the first efficient guarantees for learning rationalizable Coarse Correlated Equilibria (CCE) and Correlated Equilibria (CE). Our algorithms incorporate several novel techniques to simultaneously guarantee the elimination of IDAs and no (swap-)regret, including a correlated exploration scheme and adaptive learning rates, which may be of independent interest. We complement our results with a sample complexity lower bound showing the sharpness of our guarantees.

* Equal contribution.
1 An action is ∆-rationalizable if it survives iterative elimination of ∆-dominated actions; c.f. Definition 1.
2 Throughout this paper, we use Õ(·) to suppress logarithmic factors in N, A, L, 1/∆, 1/δ, and 1/ϵ.
3 For this equivalence to hold, we need to allow dominance by mixed strategies, and correlated beliefs when there are more than two players. Both conditions are met in the setting of this work.

1. INTRODUCTION

A common objective in multi-agent learning is to find various equilibria, such as Nash equilibria (NE), correlated equilibria (CE) and coarse correlated equilibria (CCE). Generally speaking, a player in equilibrium lacks incentive to deviate, assuming the other players conform to the same equilibrium. Equilibrium learning has been extensively studied in the game theory and online learning literature, and no-regret learners can provably learn approximate CE and CCE with both computational and statistical efficiency (Stoltz, 2005; Cesa-Bianchi & Lugosi, 2006). However, not all equilibria are created equal. As shown by Viossat & Zapechelnyuk (2013), a CCE can be entirely supported on dominated actions (actions that are worse than some other strategy in all circumstances), which rational agents should clearly never play. Approximate CE suffer from a similar problem. As shown by Wu et al. (2021, Theorem 1), there are examples where an ϵ-CE always plays iteratively dominated actions (actions that would be eliminated when iteratively deleting strictly dominated actions) unless ϵ is exponentially small. It is also shown that standard no-regret algorithms are indeed prone to finding such undesirable solutions (Wu et al., 2021). The intrinsic reason is that CCE and approximate CE need not be rationalizable, and existing algorithms can indeed fail to find rationalizable solutions.

Different from equilibrium notions, rationalizability (Bernheim, 1984; Pearce, 1984) looks at the game from the perspective of a single player without knowledge of the actual strategies of other players, assuming only common knowledge of their rationality. A rationalizable strategy avoids strictly dominated actions and, assuming other players have also eliminated their dominated actions, iteratively avoids strictly dominated actions in the subgame.
Rationalizability is a central solution concept in game theory (Osborne & Rubinstein, 1994) and has found applications in auctions (Battigalli & Siniscalchi, 2003) and mechanism design (Bergemann et al., 2011). If an (approximate) equilibrium only employs rationalizable actions, it precludes irrational behavior such as playing dominated actions. Such equilibria are arguably more reasonable than unrationalizable ones, and constitute a stronger solution concept. This motivates us to consider the following open question: Can we efficiently learn equilibria that are also rationalizable?

Despite its fundamental role in multi-agent reasoning, rationalizability was rarely studied from a learning perspective until recently, with Wu et al. (2021) giving the first algorithm for learning rationalizable strategies from bandit feedback. However, the problem of learning rationalizable CE and CCE remains a challenging open problem. Due to the existence of unrationalizable equilibria, running standard CE or CCE learners will not guarantee rationalizable solutions. On the other hand, one cannot hope to first identify all rationalizable actions and then find an equilibrium on the subgame, since even determining whether a single action is rationalizable requires exponentially many samples (see Proposition 2). Therefore, achieving rationalizability and approximate equilibria simultaneously is nontrivial and presents new algorithmic challenges.

In this work, we address the challenges above and give a positive answer to our main question. Our contributions can be summarized as follows:

• As a first step, we provide a simple yet sample-efficient algorithm for identifying a ∆-rationalizable¹ action profile under bandit feedback, using only Õ(LNA/∆²) samples² in normal-form games with N players, A actions per player, and minimum elimination length L. This greatly improves the result of Wu et al. (2021) and is tight up to logarithmic factors when L = O(1).
• Using the above algorithm as a subroutine, we develop exponential-weights-based algorithms that provably find a ∆-rationalizable ϵ-CCE using Õ(LNA/∆² + NA/ϵ²) samples, and a ∆-rationalizable ϵ-CE using Õ(LNA/∆² + NA²/min{ϵ², ∆²}) samples. To the best of our knowledge, these are the first guarantees for learning rationalizable approximate CCE and CE.

• We also provide reduction schemes that find ∆-rationalizable ϵ-CCE/CE using black-box algorithms for finding ϵ-CCE/CE. Despite having slightly worse rates, these algorithms can directly leverage progress in equilibrium finding, which may be of independent interest.

1.1. RELATED WORK

Rationalizability and iterative dominance elimination. Rationalizability (Bernheim, 1984; Pearce, 1984) is a notion that captures rational reasoning in games and relaxes Nash equilibrium. Rationalizability is closely related to the iterative elimination of dominated actions, which has been a focus of game theory research since the 1950s (Luce & Raiffa, 1957). It can be shown that an action is rationalizable if and only if it survives iterative elimination of strictly dominated actions³ (Pearce, 1984). There is also experimental evidence supporting iterative elimination of dominated strategies as a model of human reasoning (Camerer, 2011).

Equilibria learning in games.

There is a rich literature on applying online learning algorithms to learning equilibria in games. It is well known that if all agents have no regret, the resulting empirical average is an ϵ-CCE (Young, 2004), while if all agents have no swap regret, the resulting empirical average is an ϵ-CE (Hart & Mas-Colell, 2000; Cesa-Bianchi & Lugosi, 2006). Later works continuing this line of research include faster convergence rates (Syrgkanis et al., 2015; Chen & Peng, 2020; Daskalakis et al., 2021), last-iterate convergence guarantees (Daskalakis & Panageas, 2018; Wei et al., 2020), and extensions to extensive-form games (Celli et al., 2020; Bai et al., 2022b;a; Song et al., 2022) and Markov games (Song et al., 2021; Jin et al., 2021).

Computational and learning aspects of rationalizability. Despite its conceptual importance, rationalizability and iterative dominance elimination are not well studied from a computational or learning perspective. For iterative strict dominance elimination in two-player games, Knuth et al. (1988) provided a cubic-time algorithm and proved that the problem is P-complete. The weak-dominance version of the problem was proven NP-complete by Conitzer & Sandholm (2005). Hofbauer & Weibull (1996) showed that in a class of learning dynamics that includes replicator dynamics (the continuous-time variant of Follow-The-Regularized-Leader (FTRL)), all iteratively strictly dominated actions vanish over time, while Mertikopoulos & Moustakas (2010) proved similar results for stochastic replicator dynamics; however, neither work provides finite-time guarantees. Cohen et al. (2017) proved that Hedge eliminates dominated actions in finite time, but did not extend their results to the more challenging case of iteratively dominated actions.

The most closely related work is that of Wu et al. (2021), who proposed the Exp3-DH algorithm to find a strategy mostly supported on rationalizable actions at a polynomial rate. Our Algorithm 1 accomplishes the same task with a faster rate, while our Algorithms 2 and 3 address the more challenging problems of finding ϵ-CE/CCE that are also rationalizable. Although Exp3-DH is based on a no-regret algorithm, it does not enjoy regret or weighted-regret guarantees and thus does not provably find rationalizable equilibria.

2. PRELIMINARY

An N-player normal-form game involves N players whose joint action space is A = A_1 × ⋯ × A_N, and is defined by utility functions u_1, …, u_N : A → [0, 1]. Let A = max_{i∈[N]} |A_i| denote the maximum number of actions per player, x_i denote a mixed strategy of the i-th player (i.e., a distribution over A_i), and x_{-i} denote a (possibly correlated) mixed strategy of the other players (i.e., a distribution over ∏_{j≠i} A_j). We further denote u_i(x_i, x_{-i}) := E_{a_i∼x_i, a_{-i}∼x_{-i}}[u_i(a_i, a_{-i})]. We use ∆(S) to denote the set of distributions over a set S.

Learning from bandit feedback. We consider the bandit feedback setting where in each round, each player i ∈ [N] chooses an action a_i ∈ A_i and then observes a random feedback U_i ∈ [0, 1] such that E[U_i | a_1, …, a_N] = u_i(a_1, …, a_N).
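To make this setup concrete, here is one way to encode a small normal-form game together with its bandit-feedback oracle. The 3×2 payoff matrices and the Bernoulli noise model are our own illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 2-player normal-form game with |A_1| = 3, |A_2| = 2, given by per-player
# payoff tensors u[i][a_1, a_2] in [0, 1].
u = [
    np.array([[1.0, 1.0], [1.0, 1.0], [0.0, 0.0]]),   # u_1
    np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),   # u_2
]

def bandit_feedback(a, i):
    """One round of bandit feedback for player i at joint action a:
    a Bernoulli draw U_i in {0, 1} with E[U_i | a] = u_i(a)."""
    return float(rng.random() < u[i][a])

def mixed_utility(i, x1, x2):
    """u_i(x_1, x_2) for independent mixed strategies x_1, x_2."""
    return float(x1 @ u[i] @ x2)
```

The same game is reused in later sketches as a running example.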

2.1. RATIONALIZABILITY

An action a ∈ A_i is said to be rationalizable if it could be a best response to some (possibly correlated) belief about the other players' strategies, assuming that they are also rational. In other words, the set of rationalizable actions is obtained by iteratively removing actions that could never be a best response. For finite normal-form games, this is in fact equivalent to the iterative elimination of strictly dominated actions (Osborne & Rubinstein, 1994, Lemma 60.1).

Definition 1 (∆-Rationalizability). Define
E_1 := ∪_{i=1}^N {a ∈ A_i : ∃x ∈ ∆(A_i), ∀a_{-i}, u_i(a, a_{-i}) ≤ u_i(x, a_{-i}) − ∆},
which is the set of ∆-dominated actions over all players. Further define
E_l := ∪_{i=1}^N {a ∈ A_i : ∃x ∈ ∆(A_i), ∀a_{-i} s.t. a_{-i} ∩ E_{l−1} = ∅, u_i(a, a_{-i}) ≤ u_i(x, a_{-i}) − ∆},
which is the set of actions that would be eliminated by the l-th round. Define L = inf{l : E_{l+1} = E_l} as the minimum elimination length, and E_L as the set of ∆-iteratively dominated actions (∆-IDAs). Actions in ∪_{i=1}^N A_i \ E_L are said to be ∆-rationalizable.

Notice that E_1 ⊆ ⋯ ⊆ E_L = E_{L+1}. Here ∆ plays a similar role to the reward gap for best-arm identification in stochastic multi-armed bandits. We will henceforth use ∆-rationalizability and survival of L rounds of iterative dominance elimination (IDE) interchangeably. Since one cannot eliminate all of a player's actions, |E_L| ≤ N(A − 1), which further implies L ≤ N(A − 1) < NA.
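Definition 1 translates into a direct elimination procedure. The sketch below specializes to two players and checks dominance by mixed strategies with a small linear program (assuming SciPy is available); the greedy within-round elimination order is harmless for strict dominance, and the matrices in the test are our own example:

```python
import numpy as np
from scipy.optimize import linprog

def max_dominance_gap(U, a, rows, cols):
    """Largest t such that some mixture x over own actions `rows` satisfies
    u(x, j) >= u(a, j) + t for every surviving opponent action j in `cols`."""
    rows, cols = list(rows), list(cols)
    n = len(rows)
    # Variables z = (x_1..x_n, t); maximize t  <=>  minimize -t.
    c = np.zeros(n + 1); c[-1] = -1.0
    # For each surviving j:  t - sum_k x_k U[rows[k], j] <= -U[a, j]
    A_ub = np.array([[-U[k, j] for k in rows] + [1.0] for j in cols])
    b_ub = np.array([-U[a, j] for j in cols])
    A_eq = np.array([[1.0] * n + [0.0]])          # x sums to 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * n + [(None, None)])
    return -res.fun

def iterated_elimination(U1, U2, delta):
    """Iterative elimination of ∆-dominated actions (Definition 1),
    specialized to 2 players; returns the surviving action sets."""
    surv = [set(range(U1.shape[0])), set(range(U2.shape[1]))]
    while True:
        removed = False
        # Player 2's own actions index the rows of U2.T.
        for U, own, opp in [(U1, surv[0], surv[1]), (U2.T, surv[1], surv[0])]:
            for a in sorted(own):
                if len(own) > 1 and max_dominance_gap(U, a, own - {a}, opp) >= delta:
                    own.remove(a); removed = True
        if not removed:
            return surv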

2.2. EQUILIBRIA IN GAMES

We consider three common learning objectives, namely Nash equilibrium (NE), correlated equilibrium (CE) and coarse correlated equilibrium (CCE).

Definition 2 (Nash Equilibrium). A strategy profile (x_1, …, x_N) is an ϵ-Nash equilibrium if
u_i(x_i, x_{-i}) ≥ u_i(a, x_{-i}) − ϵ, ∀a ∈ A_i, ∀i ∈ [N].

Definition 3 (Correlated Equilibrium). A correlated strategy Π ∈ ∆(A) is an ϵ-correlated equilibrium if ∀i ∈ [N], ∀ϕ : A_i → A_i,
Σ_{a_i∈A_i, a_{-i}∈A_{-i}} Π(a_i, a_{-i}) u_i(a_i, a_{-i}) ≥ Σ_{a_i∈A_i, a_{-i}∈A_{-i}} Π(a_i, a_{-i}) u_i(ϕ(a_i), a_{-i}) − ϵ.

Definition 4 (Coarse Correlated Equilibrium). A correlated strategy Π ∈ ∆(A) is an ϵ-CCE if ∀i ∈ [N], ∀a′ ∈ A_i,
Σ_{a_i∈A_i, a_{-i}∈A_{-i}} Π(a_i, a_{-i}) u_i(a_i, a_{-i}) ≥ Σ_{a_i∈A_i, a_{-i}∈A_{-i}} Π(a_i, a_{-i}) u_i(a′, a_{-i}) − ϵ.

When ϵ = 0, the above definitions give exact Nash equilibrium, correlated equilibrium, and coarse correlated equilibrium, respectively. It is well known that every ϵ-NE is an ϵ-CE, and every ϵ-CE is an ϵ-CCE. Furthermore, we call an ϵ-CCE/CE that almost surely plays only ∆-rationalizable actions a ∆-rationalizable ϵ-CCE/CE.
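Definitions 3 and 4 translate directly into code: given a correlated strategy Π as an N-dimensional array, the largest unilateral (respectively, swap) deviation gain is the smallest ϵ for which Π is an ϵ-CCE (respectively, ϵ-CE). A minimal sketch; the matching-pennies payoffs in the test are our own example:

```python
import numpy as np

def cce_gap(Pi, utils):
    """Smallest eps such that Pi is an eps-CCE (Definition 4): the largest
    gain any player gets by committing to a fixed action a' while the
    others follow their Pi-marginal."""
    worst = 0.0
    for i, U in enumerate(utils):
        base = float((Pi * U).sum())      # E_{a~Pi}[u_i(a)]
        marg = Pi.sum(axis=i)             # marginal of Pi over a_{-i}
        dev = max(float((marg * np.take(U, ap, axis=i)).sum())
                  for ap in range(Pi.shape[i]))
        worst = max(worst, dev - base)
    return worst

def ce_gap(Pi, utils):
    """Smallest eps such that Pi is an eps-CE (Definition 3): the largest
    gain from the best swap function phi, computed per recommended action."""
    worst = 0.0
    for i, U in enumerate(utils):
        gain = 0.0
        for ai in range(Pi.shape[i]):
            cond = np.take(Pi, ai, axis=i)            # Pi(a_i = ai, .)
            vals = [float((cond * np.take(U, ap, axis=i)).sum())
                    for ap in range(Pi.shape[i])]
            gain += max(vals) - vals[ai]              # best phi(ai)
        worst = max(worst, gain)
    return worst
```

Both routines work for any number of players, since `Pi.sum(axis=i)` and `np.take(U, ·, axis=i)` have matching shapes over a_{-i}.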

2.3. CONNECTION BETWEEN EQUILIBRIA AND RATIONALIZABILITY

It is known that all actions in the support of an exact CE are rationalizable (Osborne & Rubinstein, 1994, Lemma 56.2). However, one can easily construct an exact CCE that is supported on dominated (hence unrationalizable) actions (see, e.g., Viossat & Zapechelnyuk (2013, Fig. 3)). One might be tempted to suggest that running a CE solver immediately finds a CE (and hence CCE) that is also rationalizable. However, the connection between CE and rationalizability becomes quite different for approximate equilibria, which are inevitable in the presence of noise. As shown by Wu et al. (2021, Theorem 1), an ϵ-CE can be entirely supported on iteratively dominated actions unless ϵ = O(2^{−A}). In other words, rationalizability is not guaranteed by running an approximate CE solver except at extremely high accuracy. Therefore, finding ϵ-CE and ϵ-CCE that are simultaneously rationalizable remains a challenging open problem.

Since the set of NE is a subset of the set of CE, all actions in the support of an (exact) NE are also rationalizable. Unlike approximate CE, for ϵ < poly(∆, 1/N, 1/A), one can show that any ϵ-Nash equilibrium is still mostly supported on rationalizable actions.

Proposition 1. If x* = (x*_1, …, x*_N) is an ϵ-Nash equilibrium with ϵ < ∆²/(24N²A), then ∀i, Pr_{a∼x*_i}[a ∈ E_L] ≤ 2Lϵ/∆.

Therefore, for two-player zero-sum games, it is possible to run an approximate NE solver and automatically find a mostly rationalizable ϵ-NE. However, this method induces a rather slow rate, and we will provide a much more efficient algorithm for finding rationalizable ϵ-NE in Section 4.

3. LEARNING RATIONALIZABLE ACTION PROFILES

In order to learn a rationalizable CE/CCE, one might suggest identifying the set of all rationalizable actions, and then learning a CE or CCE in this subgame. Unfortunately, as shown by Proposition 2, even the simpler problem of deciding whether a single action is rationalizable is statistically hard.

Proposition 2. For ∆ < 0.1, any algorithm that correctly decides whether an action is ∆-rationalizable with probability 0.9 needs Ω(A^{N−1}∆^{−2}) samples.

This negative result motivates us to consider an easier task: can we at least find one rationalizable action profile sample-efficiently? Formally, we say an action profile (a_1, …, a_N) is rationalizable if for all i ∈ [N], a_i is a rationalizable action. This is arguably one of the most fundamental tasks regarding rationalizability. For mixed-strategy dominance-solvable games (Alon et al., 2021), the unique rationalizable action profile is the unique NE and also the unique CE of the game; therefore this easier task is per se still of practical importance.

In this section we answer this question in the affirmative. We provide a sample-efficient algorithm which finds a rationalizable action profile using only Õ(LNA/∆²) samples. This algorithm will also serve as an important subroutine for the algorithms finding rationalizable CCE/CE in later sections.
Algorithm 1 Iterative Best Response
1: Initialization: choose a_i^{(0)} ∈ A_i arbitrarily for all i ∈ [N]
2: for l = 1, …, L do
3:   for i ∈ [N] do
4:     For all a ∈ A_i, play (a, a_{-i}^{(l−1)}) for M times, compute player i's average payoff û_i(a, a_{-i}^{(l−1)})
5:     Set a_i^{(l)} ← arg max_{a∈A_i} û_i(a, a_{-i}^{(l−1)})  // Computing the empirical best response
6: return (a_1^{(L)}, …, a_N^{(L)})

The intuition behind this algorithm is simple: if an action profile a_{-i} can survive l rounds of IDE, then its best response a_i (i.e., arg max_{a∈A_i} u_i(a, a_{-i})) can survive at least l + 1 rounds of IDE, since the action a_i can only be eliminated after some actions in a_{-i} are eliminated. Concretely, we start from an arbitrary action profile (a_1^{(0)}, …, a_N^{(0)}). In each round l ∈ [L], we compute the (empirical) best response to a_{-i}^{(l−1)} for each i ∈ [N], and use those best responses to construct a new action profile (a_1^{(l)}, …, a_N^{(l)}). By constructing iterative best responses, we end up with an action profile that survives L rounds of IDE, which means surviving any number of rounds of IDE by the definition of L. The full algorithm is presented in Algorithm 1, for which we have the following theoretical guarantee.

Theorem 3. With M = 16 ln(LNA/δ)/∆², with probability 1 − δ, Algorithm 1 returns an action profile that is ∆-rationalizable using a total of Õ(LNA/∆²) samples.

Wu et al. (2021) provide the first polynomial sample complexity results for finding rationalizable action profiles. They prove that the Exp3-DH algorithm is able to find a distribution with a 1 − ζ fraction supported on ∆-rationalizable actions using Õ(L^{1.5}N³A^{1.5}/(ζ³∆³)) samples under bandit feedback.

Compared to their result, our sample complexity bound Õ(LNA/∆²) has more favorable dependence on all problem parameters, and our algorithm outputs a distribution that is fully supported on rationalizable actions (and thus has no dependence on ζ). We further complement Theorem 3 with a sample complexity lower bound showing that the linear dependencies on N and A are optimal. This lower bound suggests that the Õ(LNA/∆²) upper bound is tight up to logarithmic factors when L = O(1), and we conjecture that this is true for general L.

Theorem 4. Even for games with L ≤ 2, any algorithm that returns a ∆-rationalizable action profile with probability 0.9 needs Ω(NA/∆²) samples.

Conjecture 5. The minimax optimal sample complexity for finding a ∆-rationalizable action profile is Θ(LNA/∆²) for games with minimum elimination length L.
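Algorithm 1 is short enough to implement directly. The sketch below follows the pseudocode with the batch size M from Theorem 3; the bandit oracle `sample` and the toy payoff matrices in the test are our own illustrative choices:

```python
import math
import numpy as np

def iterative_best_response(sample, action_counts, L, delta, failure_prob=0.1):
    """Sketch of Algorithm 1 (Iterative Best Response) under bandit feedback.
    `sample(a, i)` returns one noisy payoff U_i with mean u_i(a), and
    `action_counts[i]` = |A_i|.  Batch size M follows Theorem 3."""
    N = len(action_counts)
    A = max(action_counts)
    M = math.ceil(16 * math.log(L * N * A / failure_prob) / delta ** 2)
    profile = [n - 1 for n in action_counts]   # a^(0): arbitrary initial profile
    for _ in range(L):
        new = list(profile)
        for i in range(N):
            means = []
            for a in range(action_counts[i]):
                # Play (a, a_{-i}^{(l-1)}) for M rounds and average the feedback.
                joint = tuple(profile[:i] + [a] + profile[i + 1:])
                means.append(sum(sample(joint, i) for _ in range(M)) / M)
            new[i] = int(np.argmax(means))     # empirical best response
        profile = new
    return tuple(profile)
```

In the test game, the dominated row and the iteratively dominated column are avoided after L = 2 rounds.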

4. LEARNING RATIONALIZABLE COARSE CORRELATED EQUILIBRIA (CCE)

In this section we introduce our algorithm for efficiently learning rationalizable CCE. The high-level idea is to run no-regret Hedge-style algorithms for every player, while constraining the strategies inside the rationalizable region. Our algorithm is motivated by the fact that in the Hedge algorithm for adversarial bandits under full-information feedback, the probability of playing a dominated action decays exponentially over time (Cohen et al., 2017). The full algorithm description is provided in Algorithm 2, and here we explain several key components of our algorithm design.

Correlated Exploration Scheme. In the bandit feedback setting, standard exponential-weights algorithms such as EXP3.IX require importance sampling and biased estimators to derive a high-probability regret bound (Neu, 2015). However, such bias could cause a dominating strategy to lose its advantage. In our algorithm we adopt a correlated exploration scheme, which essentially simulates full-information feedback using bandit feedback with NA samples. Specifically, at every time step t, the players take turns enumerating their action sets while the other players fix their strategies according to Hedge. For i ∈ [N] and t ≥ 2, we denote by θ_i^{(t)} the strategy computed by Hedge for player i in round t. The joint strategy (a, θ_{-i}^{(t)}) is played to estimate player i's payoff ū_i^{(t)}(a).

Published as a conference paper at ICLR 2023

Algorithm 2 Hedge for Rationalizable ϵ-CCE
1: (a*_1, …, a*_N) ← Algorithm 1
2: For all i ∈ [N], initialize θ_i^{(1)}(·) ← 1[· = a*_i]
3: for t = 1, …, T do
4:   for i = 1, …, N do
5:     For all a ∈ A_i, play (a, θ_{-i}^{(t)}) for M_t times, compute player i's average payoff ū_i^{(t)}(a)
6:     Set θ_i^{(t+1)}(·) ∝ exp(η_t Σ_{τ=1}^t ū_i^{(τ)}(·))
7: For all t ∈ [T] and i ∈ [N], eliminate all actions in θ_i^{(t)} with probability smaller than p, then renormalize the vector to the simplex as θ̃_i^{(t)}
8: output: Σ_{t=1}^T ⊗_{i=1}^N θ̃_i^{(t)} / T
It is important to note that this correlated scheme does not require any communication between the players: they can schedule the whole process before the game starts.

Rationalizable Initialization and Variance Reduction. We use Algorithm 1, which learns a rationalizable action profile, to give the strategy for the first round. By carefully preserving the disadvantage of every iteratively dominated action, we keep the iterates inside the rationalizable region throughout the whole learning process. To ensure this for every iterate with high probability, a minibatch is used to reduce the variance of the estimator.

Clipping. In the final step, we clip all actions with small probabilities, so that iteratively dominated actions do not appear in the output. The threshold is small enough not to affect the ϵ-CCE guarantee.
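Putting the three components together, a minimal two-player sketch of Algorithm 2 might look as follows. For brevity, the correlated-exploration/minibatch step is replaced by an oracle `sample_mean` returning expected payoffs (so the batch size M_t is absorbed into the oracle), the constant in the learning rate follows the paper's schedule only up to our own illustrative choices, and the toy game in the test is our own:

```python
import numpy as np

def hedge_rationalizable_cce(n_actions, a_star, T, delta, eps, sample_mean):
    """Two-player sketch of Algorithm 2.  `a_star` is a rationalizable profile
    from Algorithm 1 (rationalizable initialization); `sample_mean(i, a, theta_opp)`
    estimates u_i(a, theta_opp) (the correlated exploration step); iterates are
    clipped at p at the end so IDAs never appear in the output."""
    A, N = max(n_actions), 2
    p = min(eps, delta) / (8 * A * N)                # p = min{eps, delta}/(8AN)
    cum = [np.zeros(n) for n in n_actions]           # cumulative payoff estimates
    theta = [np.eye(n)[a] for n, a in zip(n_actions, a_star)]  # point mass on a*
    iterates = []
    for t in range(1, T + 1):
        eta = max(np.sqrt(np.log(A) / t), 4 * np.log(1 / p) / (delta * t))
        iterates.append([th.copy() for th in theta])          # record theta^(t)
        for i in range(N):                                    # correlated exploration
            for a in range(n_actions[i]):
                cum[i][a] += sample_mean(i, a, theta[1 - i])
        for i in range(N):                                    # Hedge update
            w = np.exp(eta * (cum[i] - cum[i].max()))
            theta[i] = w / w.sum()
    out = np.zeros(tuple(n_actions))
    for th1, th2 in iterates:                                 # clip and average
        th1 = np.where(th1 < p, 0.0, th1); th1 /= th1.sum()
        th2 = np.where(th2 < p, 0.0, th2); th2 /= th2.sum()
        out += np.outer(th1, th2) / T
    return out
```

On the running example, the dominated row and the iteratively dominated column receive exponentially small weight and are removed by the clipping step.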

4.1. THEORETICAL GUARANTEE

In Algorithm 2, we choose the parameters as follows:

η_t = max{√(ln A / t), 4 ln(1/p)/(∆t)}, M_t = 64 ln(ANT/δ)/(∆²t), p = min{ϵ, ∆}/(8AN).  (1)

Note that our learning rate can be larger than the standard learning rate of FTRL algorithms when t is small. The purpose is to guarantee the rationalizability of the iterates from the beginning of the learning process. As will be shown in the proof, this larger learning rate does not hurt the final rate. We now state the theoretical guarantee for Algorithm 2.

Theorem 6. With parameters chosen as in Eq. (1), after T = Õ(1/ϵ² + 1/(ϵ∆)) rounds, with probability 1 − 3δ, the output strategy of Algorithm 2 is a ∆-rationalizable ϵ-CCE. The total sample complexity is Õ(LNA/∆² + NA/ϵ²).

Remark 7. Due to our lower bound (Theorem 4), an Õ(NA/∆²) term is unavoidable, since learning a rationalizable action profile is an easier task than learning a rationalizable CCE. Based on Conjecture 5, the additional L dependency is also likely inevitable. On the other hand, learning an ϵ-CCE alone only requires Õ(A/ϵ²) samples, whereas our bound has a larger Õ(NA/ϵ²) term. The extra N factor is a consequence of our correlated exploration scheme, in which only one player explores at a time. Removing this N factor might require more sophisticated exploration methods and utility estimators, which we leave as future work.

Remark 8. Invoking Algorithm 1 requires knowledge of L, which may not be available in practice. In that case, an estimate L′ may be used in its stead. If L′ ≥ L (for instance, when L′ = NA), we recover the current rationalizability guarantee, albeit with a larger sample complexity scaling with L′. If L′ < L, we can still guarantee that the output policy avoids actions in E_{L′}, which are, informally speaking, actions that would be eliminated within L′ levels of reasoning.

4.1.1. OVERVIEW OF THE ANALYSIS

We give an overview of our analysis of Algorithm 2 below. The full proof is deferred to Appendix C.

Step 1: Ensure rationalizability. We first show that rationalizability is preserved at each iterate, i.e., actions in E_L are played with low probability across all iterates. Formally,

Lemma 9. With probability at least 1 − 2δ, for all t ∈ [T], all i ∈ [N], and all a_i ∈ A_i ∩ E_L, we have θ_i^{(t)}(a_i) ≤ p. Here p is defined in Eq. (1).

Lemma 9 guarantees that, after the clipping in Line 7 of Algorithm 2, the output correlated strategy is ∆-rationalizable. We proceed to explain the main idea in proving Lemma 9. A key observation is that the set of rationalizable actions, ∪_{i=1}^N A_i \ E_L, is closed under best response: for the i-th player, as long as the other players continue to play actions in ∪_{j≠i} A_j \ E_L, actions in A_i ∩ E_L suffer an excess loss each round in an exponential-weights-style algorithm. Concretely, for any a_{-i} ∈ (∏_{j≠i} A_j) \ E_L and any iteratively dominated action a_i ∈ A_i ∩ E_L, there always exists x_i ∈ ∆(A_i) such that u_i(x_i, a_{-i}) ≥ u_i(a_i, a_{-i}) + ∆. With our choice of p in Eq. (1), if the other players choose their actions from ∪_{j≠i} A_j \ E_L with probability 1 − pAN, we can still guarantee an excess loss of Ω(∆). It follows that

Σ_{τ=1}^t ū_i^{(τ)}(x_i) − Σ_{τ=1}^t ū_i^{(τ)}(a_i) ≥ Ω(t∆) − Sampling Noise.

However, this excess loss can be obscured by the noise from bandit feedback when t is small. Note that it is crucial that the statement of Lemma 9 holds for all t, due to the inductive nature of the proof. As a solution, we use a minibatch of size M_t = Õ(1/(∆²t)) to ensure that

η_t (Σ_{τ=1}^t ū_i^{(τ)}(x_i) − Σ_{τ=1}^t ū_i^{(τ)}(a_i)) ≫ 1.  (2)

By the update rule of the Hedge algorithm, this implies that θ_i^{(t+1)}(a_i) ≤ p, which enables us to complete the proof of Lemma 9 via induction on t.

Step 2: Combine with no-regret guarantees. Next, we prove that the output strategy is an ϵ-CCE.
For a player i ∈ [N], the regret is defined as Regret_T^i = max_{θ∈∆(A_i)} Σ_{t=1}^T ⟨ū_i^{(t)}, θ − θ_i^{(t)}⟩. We obtain the following regret bound by standard analysis of FTRL with changing learning rates.

Lemma 10. For all i ∈ [N], Regret_T^i ≤ Õ(√T + 1/∆).

Here the additive 1/∆ term is the result of our larger O(∆^{−1}t^{−1}) learning rate for small t. It follows from Lemma 10 that T = Õ(1/ϵ² + 1/(∆ϵ)) suffices to guarantee that the correlated strategy (1/T) Σ_{t=1}^T ⊗_{i=1}^N θ_i^{(t)} is an (ϵ/2)-CCE. Since pNA = O(ϵ), the clipping step only slightly affects the CCE guarantee, and the clipped strategy (1/T) Σ_{t=1}^T ⊗_{i=1}^N θ̃_i^{(t)} is an ϵ-CCE.

4.2. APPLICATION TO LEARNING RATIONALIZABLE NASH EQUILIBRIUM

Algorithm 2 can also be applied to two-player zero-sum games to learn a rationalizable ϵ-NE efficiently. Note that in two-player zero-sum games, the marginal distributions of an ϵ-CCE are guaranteed to form a 2ϵ-Nash equilibrium (see, e.g., Proposition 9 in Bai et al. (2020)). Hence a direct application of Algorithm 2 to a zero-sum game gives the following sample complexity bound.

Corollary 11. In a two-player zero-sum game, the sample complexity for finding a ∆-rationalizable ϵ-Nash equilibrium with Algorithm 2 is Õ(LA/∆² + A/ϵ²).

This result improves over a direct application of Proposition 1, which gives Õ(A³/∆⁴ + A/ϵ²) sample complexity and produces an ϵ-Nash equilibrium that could still take unrationalizable actions with positive probability.

Algorithm 3 Adaptive Hedge for Rationalizable ϵ-CE
1: (a*_1, …, a*_N) ← Algorithm 1
2: For all i ∈ [N], initialize θ_i^{(1)} ← (1 − |A_i|p)1[· = a*_i] + p·1
3: for t = 1, 2, …, T do
4:   for i = 1, 2, …, N do
5:     For all a ∈ A_i, play (a, θ_{-i}^{(t)}) for M_i^{(t)} times, compute player i's average payoff ū_i^{(t)}(a)
6:     For all b ∈ A_i, set θ̃_i^{(t+1)}(·|b) ∝ exp(η_{t,i}^b Σ_{τ=1}^t ū_i^{(τ)}(·) θ_i^{(τ)}(b))
7:     Find θ_i^{(t+1)} ∈ ∆(A_i) such that θ_i^{(t+1)}(a) = Σ_{b∈A_i} θ̃_i^{(t+1)}(a|b) θ_i^{(t+1)}(b)
8: For all t ∈ [T] and i ∈ [N], eliminate all actions in θ_i^{(t)} with probability smaller than p, then renormalize the vector to the simplex as θ̃_i^{(t)}
9: output: Σ_{t=1}^T ⊗_{i=1}^N θ̃_i^{(t)} / T

5. LEARNING RATIONALIZABLE CORRELATED EQUILIBRIUM

In order to extend our results on ϵ-CCE to ϵ-CE, a natural approach is to augment Algorithm 2 with the celebrated Blum–Mansour reduction (Blum & Mansour, 2007) from swap regret to external regret. In this reduction, one maintains A instances of a no-regret algorithm {Alg_1, …, Alg_A}. In iteration t, the player stacks the recommendations of the A algorithms as a matrix, denoted by θ̃^{(t)} ∈ R^{A×A}, and computes its eigenvector θ^{(t)} as the randomized strategy in round t. After observing the actual payoff vector u^{(t)}, it passes the weighted payoff vector θ^{(t)}(a)u^{(t)} to algorithm Alg_a for each a. In this section, we focus on a fixed player i, and omit the subscript i when it is clear from the context.

Applying this reduction to Algorithm 2 directly, however, would fail to preserve rationalizability, since the weighted payoff vector θ^{(t)}(a)u^{(t)} admits a smaller utility gap θ^{(t)}(a)∆. Specifically, consider an action b dominated by a mixed strategy x. In the payoff estimate of instance a,

Σ_{τ=1}^t θ^{(τ)}(a)(ū^{(τ)}(x) − ū^{(τ)}(b)) ≳ ∆ Σ_{τ=1}^t θ^{(τ)}(a) − Σ_{τ=1}^t 1/M^{(τ)} ≱ 0,  (3)

which means that we cannot guarantee the elimination of IDAs in every round as in Eq. (2). In Algorithm 3, we address this by letting Σ_{τ=1}^t θ^{(τ)}(a) play the role of t, tracking the progress of each no-regret instance separately. In time step t, we compute the average payoff vector ū^{(t)} based on M^{(t)} samples; then, as in the Blum–Mansour reduction, we update the A instances of Hedge with the weighted payoffs θ^{(t)}(a)ū^{(t)} and use the eigenvector of θ̃ as the strategy for the next round. The key detail here is our choice of parameters, which adapts to the past strategies {θ^{(τ)}}_{τ=1}^t:

M_i^{(t)} := max_a 64 θ_i^{(t)}(a)/(∆² Σ_{τ=1}^t θ_i^{(τ)}(a)), η_{t,i}^a := max{2 ln(1/p)/(∆ Σ_{τ=1}^t θ_i^{(τ)}(a)), √(A ln A / t)}, p = min{ϵ, ∆}/(8AN).  (4)

Compared to Eq. (1), we are essentially replacing t with the adaptive quantity Σ_{τ=1}^t θ^{(τ)}(a).
We can now improve Eq. (3) to

Σ_{τ=1}^t θ^{(τ)}(a)(ū^{(τ)}(x) − ū^{(τ)}(b)) ≳ ∆ Σ_{τ=1}^t θ^{(τ)}(a) − Σ_{τ=1}^t θ^{(τ)}(a)²/M^{(τ)} ≳ ∆ Σ_{τ=1}^t θ^{(τ)}(a).  (5)

This, together with our choice of η_t^a, allows us to ensure the rationalizability of every iterate. The full algorithm is presented in Algorithm 3.

We proceed to our theoretical guarantee for Algorithm 3. The analysis framework is largely similar to that of Algorithm 2. Our choice of M_i^{(t)} is sufficient to ensure ∆-rationalizability via the Azuma–Hoeffding inequality, while a swap-regret analysis of the algorithm shows that the average (clipped) strategy is indeed an ϵ-CE. The full proof is deferred to Appendix D.

Theorem 12. With parameters as in Eq. (4), after T = Õ(A/ϵ² + A/∆²) rounds, with probability 1 − 3δ, the output strategy of Algorithm 3 is a ∆-rationalizable ϵ-CE. The total sample complexity is Õ(LNA/∆² + NA²/min{∆², ϵ²}).

Algorithm 4 Rationalizable ϵ-CCE via Black-box Reduction
1: (a*_1, …, a*_N) ← Algorithm 1
2: For all i ∈ [N], initialize A_i^{(1)} ← {a*_i}
3: for t = 1, 2, … do
4:   Find an ϵ′-CCE Π with black-box algorithm O in the sub-game ∏_{i∈[N]} A_i^{(t)}
5:   ∀i ∈ [N], a′_i ∈ A_i, evaluate u_i(a′_i, Π_{-i}) for M times and compute the average û_i(a′_i, Π_{-i})
6:   for i ∈ [N] do
7:     Let a′_i ← arg max_{a∈A_i} û_i(a, Π_{-i})  // Computing the empirical best response
8:     A_i^{(t+1)} ← A_i^{(t)} ∪ {a′_i}
9:   if A_i^{(t)} = A_i^{(t+1)} for all i ∈ [N] then
10:    return Π

Compared to Theorem 6, our second term has an additional A factor, which is quite reasonable considering that algorithms for learning ϵ-CE take Õ(A²ϵ^{−2}) samples, also A times larger than the ϵ-CCE rate.
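The eigenvector step shared by the Blum–Mansour reduction and Line 7 of Algorithm 3 computes a fixed point θ = θθ̃ of the row-stochastic recommendation matrix. A minimal power-iteration sketch; the 2×2 matrix in the test is an arbitrary illustrative example, not tied to a particular game:

```python
import numpy as np

def stationary_strategy(Q, iters=2000):
    """Fixed point theta with theta = theta @ Q, where row b of the
    row-stochastic matrix Q is the recommendation of Hedge instance Alg_b.
    Power iteration suffices when Q has strictly positive entries."""
    theta = np.full(Q.shape[0], 1.0 / Q.shape[0])
    for _ in range(iters):
        theta = theta @ Q
    return theta / theta.sum()
```

Playing this fixed point makes the realized action distribution consistent with the weights each sub-instance receives, which is what turns A external-regret guarantees into one swap-regret guarantee.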

6. REDUCTION-BASED ALGORITHMS

While Algorithm 2 and 3 make use of one specific no-regret algorithm, namely Hedge (Exponential Weights), in this section, we show that arbitrary algorithms for finding CCE/CE can be augmented to find rationalizable CCE/CE. The sample complexity obtained via this reduction is comparable with those of Algorithm 2 and 3 when L = Θ(N A), but slightly worse when L ≪ N A. Moreover, this black-box approach would enable us to derive algorithms for rationalizable equilibria with more desirable qualities, such as last-iterate convergence, when using equilibria-finding algorithms with these properties. Suppose that we are given a black-box algorithm O that finds ϵ-CCE in arbitrary games. We can then use this algorithm in the following "support expansion" manner. We start with a subgame of only rationalizable actions, which can be identified efficiently with Algorithm 1, and call O to find an ϵ-CCE Π for the subgame. Next, we check for each i ∈ [N ] if the best response to Π -i is contained in A i . If not, this means that the subgame's ϵ-CCE may not be an ϵ-CCE for the full game; in this case, the best response to Π -i would be a rationalizable action that we can safely include into the action set. On the other hand, if the best response falls in A i for all i, we can conclude that Π is also an ϵ-CCE for the original game. The details are given by Algorithm 4, and our main theoretical guarantee is the following. Theorem 13. Algorithm 4 outputs a ∆-rationalizable ϵ-CCE with high probability, using at most N A calls to the black-box CCE algorithm and O N 2 A 2 min{ϵ 2 ,∆ 2 } additional samples. Using similar algorithmic techniques, we can develop a reduction scheme for rationalizable ϵ-CE. The detailed description for this algorithm is deferred to Appendix E. Here we only state its main theoretical guarantee. Theorem 14. 
There exists an algorithm that outputs a ∆-rationalizable ϵ-CE with high probability, using at most $NA$ calls to a black-box CE algorithm and $O\big(\frac{N^2A^3}{\min\{\epsilon^2, \Delta^2\}}\big)$ additional samples.
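As a minimal illustration of the support-expansion loop (a sketch, not the authors' implementation), the following Python snippet runs Algorithm 4's outer loop with exact expected utilities in place of the $M$-sample estimates. The function `uniform_oracle` is a hypothetical placeholder standing in for the black-box CCE algorithm $O$; it is not a real CCE solver.

```python
import itertools

def support_expansion(U, oracle, init_profile):
    """Outer loop of the support-expansion reduction (cf. Algorithm 4),
    with exact expected utilities instead of M-sample estimates.
    U[i] maps a joint action profile (tuple) to player i's payoff."""
    N = len(init_profile)
    full_actions = [sorted({prof[i] for prof in U[0]}) for i in range(N)]
    A = [{init_profile[i]} for i in range(N)]          # current supports
    while True:
        Pi = oracle(A)            # stand-in for the black-box CCE algorithm O
        expanded = False
        for i in range(N):
            def dev_utility(ai, i=i):
                # u_i(ai, Pi_{-i}): replace coordinate i of each profile by ai
                return sum(p * U[i][prof[:i] + (ai,) + prof[i + 1:]]
                           for prof, p in Pi.items())
            best = max(full_actions[i], key=dev_utility)
            if best not in A[i]:                       # best response left the subgame
                A[i].add(best)
                expanded = True
        if not expanded:          # every best response is already supported
            return Pi, A

def uniform_oracle(A):
    """Hypothetical placeholder oracle: uniform over the subgame's profiles."""
    profiles = list(itertools.product(*[sorted(s) for s in A]))
    return {prof: 1.0 / len(profiles) for prof in profiles}
```

On a toy two-player game in which action 0 strictly dominates for both players, starting the supports from the profile $(1, 0)$ adds player 1's best response 0 in the first iteration and then terminates.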

7. CONCLUSION

In this paper, we consider two tasks: (1) learning rationalizable action profiles, and (2) learning rationalizable equilibria. For task (1), we propose a conceptually simple algorithm whose sample complexity significantly improves over prior work (Wu et al., 2021). For task (2), we develop the first provably efficient algorithms for learning ϵ-CE and ϵ-CCE that are also rationalizable. Our algorithms are computationally efficient, enjoy sample complexity that scales polynomially with the number of players, and avoid iteratively dominated actions completely. Our results rely on several new techniques which may be of independent interest to the community. A gap remains between our sample complexity upper bounds and the available lower bounds for both tasks; closing it is an important direction for future research.

A FURTHER DETAILS ON RATIONALIZABILITY A.1 EQUIVALENCE OF NEVER-BEST-RESPONSE AND STRICT DOMINANCE

It is known that for finite normal-form games, rationalizable actions are given by iterated elimination of never-best-response actions, which is in fact equivalent to iterated elimination of strictly dominated actions (Osborne & Rubinstein, 1994, Lemma 60.1). Here, for completeness, we include a proof that the iterative elimination of actions that are never a ∆-best response gives the same definition as Definition 1. Notice that it suffices to show that within every subgame, the set of never-∆-best-response actions and the set of ∆-dominated actions coincide.

Proposition A.1. Suppose that an action $a \in A_i$ is never a ∆-best response, i.e., $\forall \Pi_{-i} \in \Delta(\prod_{j\neq i} A_j)$, $\exists x \in \Delta(A_i)$ such that $u_i(a, \Pi_{-i}) \le u_i(x, \Pi_{-i}) - \Delta$. Then $a$ is also ∆-dominated, i.e., $\exists x \in \Delta(A_i)$ such that $\forall \Pi_{-i} \in \Delta(\prod_{j\neq i} A_j)$, $u_i(a, \Pi_{-i}) \le u_i(x, \Pi_{-i}) - \Delta$.

Proof. That $a$ is never a ∆-best response is equivalent to $\min_{\Pi_{-i}} \max_{x} \{u_i(a, \Pi_{-i}) - u_i(x, \Pi_{-i})\} \le -\Delta$. That $a$ is ∆-dominated is equivalent to $\max_{x} \min_{\Pi_{-i}} \{u_i(a, \Pi_{-i}) - u_i(x, \Pi_{-i})\} \le -\Delta$. The equivalence immediately follows from von Neumann's minimax theorem.

A.2 PROOF OF PROPOSITION 1

Proof. We prove this inductively with the following hypothesis: for all $l \ge 1$ and all $i \in [N]$, $\sum_{a \in A_i} x^*_i(a) \cdot \mathbb{1}[a \in E_l] \le \frac{2l\epsilon}{\Delta}$.

Base case: By the definition of ϵ-NE, for all $i \in [N]$ and all $x' \in \Delta(A_i)$, $u_i(x^*_i, x^*_{-i}) \ge u_i(x', x^*_{-i}) - \epsilon$. Note that if $a \in E_1 \cap A_i$, there exists $x_a \in \Delta(A_i)$ such that for all $a_{-i}$, $u_i(a, a_{-i}) \le u_i(x_a, a_{-i}) - \Delta$. Therefore, if we choose
$$x' := x^*_i - \sum_{a \in A_i} \mathbb{1}[a \in E_1]\, x^*_i(a)\, e_a + \sum_{a \in A_i} \mathbb{1}[a \in E_1]\, x^*_i(a) \cdot x_a,$$
that is, if we play the dominating strategy instead of the dominated action in $x^*_i$, then
$$u_i(x', x^*_{-i}) \ge u_i(x^*_i, x^*_{-i}) + \sum_{a \in A_i} x^*_i(a) \cdot \mathbb{1}[a \in E_1]\, \Delta.$$
It follows that $\sum_{a \in A_i} x^*_i(a) \cdot \mathbb{1}[a \in E_1] \le \frac{\epsilon}{\Delta}$.

Induction step: By the induction hypothesis, for all $i \in [N]$, $\sum_{a \in A_i} x^*_i(a) \cdot \mathbb{1}[a \in E_l] \le \frac{2l\epsilon}{\Delta}$.

Now consider

$$\tilde x_i := \frac{x^*_i - \sum_{a \in A_i} \mathbb{1}[a \in E_l] \cdot x^*_i(a)\, e_a}{1 - \sum_{a \in A_i} \mathbb{1}[a \in E_l] \cdot x^*_i(a)} \qquad (\forall i \in [N]),$$
which is supported on actions not in $E_l$. The induction hypothesis implies $\|\tilde x_i - x^*_i\|_1 \le 6l\epsilon/\Delta$. Therefore, for all $i \in [N]$ and $a \in A_i$, $u_i(a, \tilde x_{-i}) - u_i(a, x^*_{-i}) \le \frac{6Nl\epsilon}{\Delta}$. Now if $\tilde a \in (E_{l+1} \setminus E_l) \cap A_i$, since $\tilde x_{-i}$ is not supported on $E_l$, there exists $x \in \Delta(A_i)$ such that $u_i(\tilde a, \tilde x_{-i}) \le u_i(x, \tilde x_{-i}) - \Delta$. It follows that
$$u_i(\tilde a, x^*_{-i}) \le u_i(x, x^*_{-i}) - \Delta + \frac{12Nl\epsilon}{\Delta} \le u_i(x, x^*_{-i}) - \frac{\Delta}{2}.$$
Using the same argument as in the base case,
$$\sum_{a \in A_i} x^*_i(a) \cdot \mathbb{1}[a \in E_{l+1} \setminus E_l] \le \frac{\epsilon}{\Delta - 12Nl\epsilon/\Delta} \le \frac{2\epsilon}{\Delta}.$$
It follows that for all $i \in [N]$, $\sum_{a \in A_i} x^*_i(a) \cdot \mathbb{1}[a \in E_{l+1}] \le \frac{2(l+1)\epsilon}{\Delta}$. The statement is thus proved via induction on $l$.

B FIND ONE RATIONALIZABLE ACTION PROFILE B.1 PROOF OF PROPOSITION 2

Proof. Consider the following $N$-player game, denoted $G_0$, with action set $[A]$: $u_i(\cdot) = 0$ for $1 \le i \le N-1$, and $u_N(a_N) = \Delta \cdot \mathbb{1}[a_N > 1]$. Specifically, a payoff with mean $u$ is realized by a skewed Rademacher random variable placing probability $\frac{1+u}{2}$ on $+1$ and $\frac{1-u}{2}$ on $-1$. In game $G_0$, the action 1 is clearly ∆-dominated for player $N$. However, consider the following game, denoted $G_{a^*}$ (where $a^* \in [A]^{N-1}$): $u_i(\cdot) = 0$ for $1 \le i \le N-1$, $u_N(a_N) = \Delta$ for $a_N > 1$, and $u_N(1, a_{-N}) = 2\Delta \cdot \mathbb{1}[a_{-N} = a^*]$. It can be seen that in game $G_{a^*}$, the action 1 of player $N$ is neither dominated nor iteratively strictly dominated. Therefore, suppose that an algorithm $O$ is able to determine whether an action is rationalizable (i.e., not iteratively strictly dominated) with 0.9 accuracy; then its output needs to be False with probability at least 0.9 in game $G_0$, but True with probability at least 0.9 in game $G_{a^*}$. By Pinsker's inequality, $\mathrm{KL}(O(G_0)\,\|\,O(G_{a^*})) \ge 2 \cdot 0.8^2 > 1$, where $O(G)$ denotes the trajectory generated by running algorithm $O$ on game $G$. Meanwhile, notice that $G_0$ and $G_{a^*}$ differ only when the first $N-1$ players play $a^*$. Denote by $n(a^*)$ the number of times the first $N-1$ players play $a^*$. Using the chain rule of KL divergence,
$$\mathrm{KL}(O(G_0)\,\|\,O(G_{a^*})) \le \mathbb{E}_{G_0}[n(a^*)] \cdot \mathrm{KL}\Big(\mathrm{Ber}\big(\tfrac12\big) \,\Big\|\, \mathrm{Ber}\big(\tfrac{1+2\Delta}{2}\big)\Big) \overset{(a)}{\le} \mathbb{E}_{G_0}[n(a^*)] \cdot \frac{(2\Delta)^2}{1 - 2\Delta^2} \overset{(b)}{\le} 10\Delta^2\, \mathbb{E}_{G_0}[n(a^*)].$$
Here (a) follows from the reverse Pinsker inequality (see, e.g., Binette (2019)), while (b) uses the fact that $\Delta < 0.1$. This means that for any $a^* \in [A]^{N-1}$, $\mathbb{E}_{G_0}[n(a^*)] \ge \frac{1}{10\Delta^2}$. It follows that the expected number of samples when running $O$ on $G_0$ is at least
$$\mathbb{E}_{G_0}\Big[\sum_{a^* \in [A]^{N-1}} n(a^*)\Big] \ge \frac{A^{N-1}}{10\Delta^2}.$$

B.2 PROOF OF THEOREM 3

Proof. We first present the concentration bound. For $l \in [L]$, $i \in [N]$, and $a \in A_i$, by Hoeffding's inequality, with probability at least $1 - \frac{\delta}{LNA}$,
$$\big|u_i(a, a^{(l-1)}_{-i}) - \hat u_i(a, a^{(l-1)}_{-i})\big| \le \sqrt{\frac{4\ln(ANL/\delta)}{M}} \le \frac{\Delta}{4}.$$
Therefore, by a union bound, with probability at least $1 - \delta$, this holds simultaneously for all $l \in [L]$, $i \in [N]$, and $a \in A_i$. We condition on this event for the rest of the proof.

We use induction on $l$ to prove that for all $l \in [L] \cup \{0\}$, the profile $(a^{(l)}_1, \cdots, a^{(l)}_N)$ survives at least $l$ rounds of IDE. The base case $l = 0$ holds directly. Now assume the claim for $1, 2, \ldots, l-1$ and consider the case of $l$. For any $i \in [N]$, we show that $a^{(l)}_i$ survives at least $l$ rounds of IDE. Recall that $a^{(l)}_i$ is the empirical best response, i.e., $a^{(l)}_i = \arg\max_{a \in A_i} \hat u_i(a, a^{(l-1)}_{-i})$. For any mixed strategy $x_i \in \Delta(A_i)$, we have
$$u_i(a^{(l)}_i, a^{(l-1)}_{-i}) - u_i(x_i, a^{(l-1)}_{-i}) \ge \hat u_i(a^{(l)}_i, a^{(l-1)}_{-i}) - \hat u_i(x_i, a^{(l-1)}_{-i}) - \big|u_i(a^{(l)}_i, a^{(l-1)}_{-i}) - \hat u_i(a^{(l)}_i, a^{(l-1)}_{-i})\big| - \big|u_i(x_i, a^{(l-1)}_{-i}) - \hat u_i(x_i, a^{(l-1)}_{-i})\big| \ge 0 - \frac{\Delta}{4} - \frac{\Delta}{4} = -\frac{\Delta}{2}.$$
Since the actions in $a^{(l-1)}_{-i}$ survive at least $l-1$ rounds of ∆-IDE, $a^{(l)}_i$ cannot be ∆-dominated by $x_i$ in rounds $1, \cdots, l$. Since $x_i$ can be arbitrarily chosen, $a^{(l)}_i$ survives at least $l$ rounds of ∆-IDE. We conclude that the output $(a^{(L)}_1, \cdots, a^{(L)}_N)$ survives $L$ rounds of ∆-IDE, which is equivalent to ∆-rationalizability (see Definition 1). The total number of samples used is $LNA \cdot M = O\big(\frac{LNA}{\Delta^2}\big)$.
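The proof above analyzes $L$ rounds of simultaneous empirical best responses. A minimal sketch of that dynamic (with exact utilities standing in for the $M$-sample empirical estimates $\hat u_i$, so no concentration step is needed):

```python
def iterated_best_response(U, num_actions, init_profile, L):
    """L rounds of simultaneous best responses (cf. the proof of Theorem 3).
    U[i] maps a joint action profile (tuple) to player i's exact payoff;
    exact utilities stand in for the M-sample empirical estimates."""
    profile = tuple(init_profile)
    for _ in range(L):
        new_profile = []
        for i, A_i in enumerate(num_actions):
            def dev_utility(ai, i=i, profile=profile):
                # u_i(ai, a^{(l-1)}_{-i}): deviate in coordinate i only
                return U[i][profile[:i] + (ai,) + profile[i + 1:]]
            new_profile.append(max(range(A_i), key=dev_utility))
        profile = tuple(new_profile)
    return profile
```

On a toy two-player game where action 0 strictly dominates for both players, a single round already reaches the profile of dominant actions, which then survives all further rounds.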

B.3 PROOF OF THEOREM 4

Proof. Without loss of generality, assume $\Delta < 0.1$. Consider the following instance, where $A_1 = \cdots = A_N = [A]$:
$$u_i(a_i) = \Delta \cdot \mathbb{1}[a_i = 1] \quad (i \neq j), \qquad u_j(a_j, a_{-j}) = \begin{cases} \Delta \cdot \mathbb{1}[a_j = 1] & (a_{-j} \neq \{1\}^{N-1}), \\ \Delta \cdot \mathbb{1}[a_j = 1] + 2\Delta \cdot \mathbb{1}[a_j = a] & (a_{-j} = \{1\}^{N-1}). \end{cases}$$
Denote this instance by $G_{j,a}$. Additionally, define the instance $G_0$ by $u_i(a_i) = \Delta \cdot \mathbb{1}[a_i = 1]$ for all $i \in [N]$. As before, a payoff with expectation $u$ is realized as a random variable with distribution $2\,\mathrm{Ber}(\frac{1+u}{2}) - 1$. The only difference between $G_0$ and $G_{j,a}$ lies in $u_j(a, \{1\}^{N-1})$. By the chain rule of KL divergence, for any algorithm $O$,
$$\mathrm{KL}(O(G_0)\,\|\,O(G_{j,a})) \le 10\Delta^2 \cdot \mathbb{E}_{G_0}\big[n(a_j = a,\, a_{-j} = \{1\}^{N-1})\big],$$
where $n(a_j = a, a_{-j} = \{1\}^{N-1})$ denotes the number of times the action profile $(a, 1^{N-1})$ is played. Note that in $G_0$, the only action profile surviving two rounds of ∆-IDE is $(1, \cdots, 1)$, while in $G_{j,a}$, the only rationalizable action profile is $(1, \cdots, 1, a, 1, \cdots, 1)$ with $a$ in position $j$. To guarantee 0.9 accuracy, by Pinsker's inequality, $\mathrm{KL}(O(G_0)\,\|\,O(G_{j,a})) \ge 2 \cdot 0.8^2 > 1$. It follows that for all $j \in [N]$ and $a > 1$, $\mathbb{E}_{G_0}[n(a_j = a, a_{-j} = \{1\}^{N-1})] \ge \frac{1}{10\Delta^2}$. Thus the total expected sample complexity is at least
$$\sum_{a > 1,\, j \in [N]} \mathbb{E}_{G_0}\big[n(a_j = a,\, a_{-j} = \{1\}^{N-1})\big] \ge \frac{N(A-1)}{10\Delta^2}.$$

C OMITTED PROOFS IN SECTION 4

We start our analysis by bounding the sampling noise. For player $i \in [N]$, action $a_i \in A_i$, and $\tau \in [T]$, we denote the sampling noise by $\xi^{(\tau)}_i(a_i) := u^{(\tau)}_i(a_i) - u_i(a_i, \theta^{(\tau)}_{-i})$. We have the following lemma.

Lemma C.1. Let $\Omega_1$ denote the event that for all $t \in [T]$, $i \in [N]$, and $a_i \in A_i$,
$$\Big|\sum_{\tau=1}^{t} \xi^{(\tau)}_i(a_i)\Big| \le 2\sqrt{\ln(ANT/\delta) \sum_{\tau=1}^{t} \frac{1}{M_\tau}}. \quad (6)$$
Then $\Pr[\Omega_1] \ge 1 - \delta$.

Proof. Note that $\sum_{\tau=1}^{t} \xi^{(\tau)}_i(a_i)$ can be written as a sum of $\sum_{\tau=1}^{t} M_\tau$ mean-zero bounded terms. By the Azuma-Hoeffding inequality, with probability at least $1 - \frac{\delta}{ANT}$, for fixed $i \in [N]$, $t \in [T]$, and $a_i \in A_i$,
$$\Big|\sum_{\tau=1}^{t} \xi^{(\tau)}_i(a_i)\Big| \le 2\sqrt{\ln(ANT/\delta) \sum_{\tau=1}^{t} M_\tau \cdot \Big(\frac{1}{M_\tau}\Big)^2}.$$
A union bound over $i \in [N]$, $t \in [T]$, and $a_i \in A_i$ proves the statement.

Lemma C.2. With probability at least $1 - 2\delta$, for all $t \in [T]$, $i \in [N]$, and $a_i \in A_i \cap E_L$, $\theta^{(t)}_i(a_i) \le p$.

Proof. We condition on the event $\Omega_1$ defined in Lemma C.1 and on the success of Algorithm 1. We prove the claim by induction on $t$. The base case $t = 1$ holds directly by initialization. Now assume the claim for $1, 2, \ldots, t$ and consider the case of $t+1$. Fix a player $i \in [N]$ and an iteratively dominated action $a_i \in A_i \cap E_L$. By definition, there exists a mixed strategy $x_i$ such that $u_i(x_i, a_{-i}) \ge u_i(a_i, a_{-i}) + \Delta$ for all $a_{-i}$ containing no action in $E_L$. Therefore, for $\tau \in [t]$, by the induction hypothesis for $\tau$,
$$u_i(x_i, \theta^{(\tau)}_{-i}) \ge u_i(a_i, \theta^{(\tau)}_{-i}) + (1 - ANp)\cdot\Delta - ANp \ge u_i(a_i, \theta^{(\tau)}_{-i}) + \Delta/2. \quad (7)$$
Consequently,
$$\sum_{\tau=1}^{t} \big(u^{(\tau)}_i(x_i) - u^{(\tau)}_i(a_i)\big) \ge \sum_{\tau=1}^{t} \big(u_i(x_i, \theta^{(\tau)}_{-i}) - u_i(a_i, \theta^{(\tau)}_{-i})\big) - 4\sqrt{\ln(ANT/\delta)\sum_{\tau=1}^{t} \frac{1}{M_\tau}} \quad \text{(by (6))}$$
$$\ge \frac{t\Delta}{2} - 4\sqrt{\ln(ANT/\delta)\sum_{\tau=1}^{t}\frac{1}{M_\tau}} \quad \text{(by (7))} \qquad \ge \frac{t\Delta}{4}.$$
Therefore, by our choice of learning rate,
$$\theta^{(t+1)}_i(a_i) \le \exp\Big(-\eta_t \sum_{\tau=1}^{t}\big(u^{(\tau)}_i(x_i) - u^{(\tau)}_i(a_i)\big)\Big) \le \exp\Big(-\frac{4\ln(1/p)}{\Delta t}\cdot\frac{\Delta t}{4}\Big) = p,$$
so $\theta^{(t+1)}_i(a_i) \le p$ as desired.

Now we turn to the ϵ-CCE guarantee.
For a player $i \in [N]$, recall that the regret is defined as $\mathrm{Regret}^i_T = \max_{\theta \in \Delta(A_i)} \sum_{t=1}^{T} \langle u^{(t)}_i, \theta - \theta^{(t)}_i \rangle$.

Lemma C.3. The regret can be bounded as $\mathrm{Regret}^i_T \le O\big(\sqrt{\ln A \cdot T} + \frac{\ln(1/p)\ln T}{\Delta}\big)$.

Proof. Note that apart from the choice of $\theta^{(1)}$, we are exactly running FTRL with learning rates $\eta_t = \max\big\{\sqrt{\ln A / t},\ \frac{4\ln(1/p)}{\Delta t}\big\}$, which are monotonically decreasing. Therefore, following the standard analysis of FTRL (see, e.g., Orabona (2019, Corollary 7.9)),
$$\max_{\theta \in \Delta(A_i)} \sum_{t=1}^{T} \langle u^{(t)}_i, \theta - \theta^{(t)}_i \rangle \le 2 + \frac{\ln A}{\eta_T} + \frac{1}{2}\sum_{t=1}^{T}\eta_t \le 2 + \sqrt{\ln A \cdot T} + \frac{1}{2}\sum_{t=1}^{T}\Big(\sqrt{\frac{\ln A}{t}} + \frac{4\ln(1/p)}{\Delta t}\Big) = O\Big(\sqrt{\ln A \cdot T} + \frac{\ln(1/p)\ln T}{\Delta}\Big).$$

However, this form of regret does not directly imply an approximate CCE. We define the following expected version of regret: $\mathrm{Regret}^{i,\star}_T = \max_{\theta \in \Delta(A_i)} \sum_{t=1}^{T} \langle u_i(\cdot, \theta^{(t)}_{-i}), \theta - \theta^{(t)}_i \rangle$. The next lemma bounds the difference between these two notions of regret.

Lemma C.4. The following event $\Omega_2$ holds with probability at least $1 - \delta$: for all $i \in [N]$, $\big|\mathrm{Regret}^{i,\star}_T - \mathrm{Regret}^i_T\big| \le O\big(\sqrt{T \cdot \ln(NA/\delta)}\big)$.

Proof. Denote $\Theta_i := \{e_1, e_2, \ldots, e_{|A_i|}\}$. Since the maxima are attained at vertices of the simplex,
$$\mathrm{Regret}^{i,\star}_T - \mathrm{Regret}^i_T = \max_{\theta \in \Theta_i}\sum_{t=1}^{T}\langle u_i(\cdot,\theta^{(t)}_{-i}), \theta - \theta^{(t)}_i\rangle - \max_{\theta \in \Theta_i}\sum_{t=1}^{T}\langle u^{(t)}_i, \theta - \theta^{(t)}_i\rangle \le \max_{\theta \in \Theta_i}\sum_{t=1}^{T}\big\langle u_i(\cdot,\theta^{(t)}_{-i}) - u^{(t)}_i,\ \theta - \theta^{(t)}_i\big\rangle.$$
Note that $\langle u_i(\cdot,\theta^{(t)}_{-i}) - u^{(t)}_i, \theta - \theta^{(t)}_i\rangle$ is a bounded martingale difference sequence. By the Azuma-Hoeffding inequality, for a fixed $\theta \in \Theta_i$, with probability at least $1 - \frac{\delta}{AN}$, $\big|\sum_{t=1}^{T}\langle u_i(\cdot,\theta^{(t)}_{-i}) - u^{(t)}_i, \theta - \theta^{(t)}_i\rangle\big| \le O\big(\sqrt{T\cdot\ln(NA/\delta)}\big)$. We complete the proof by a union bound.

Proof of Theorem 6. We condition on the event $\Omega_1$ defined in Lemma C.1, the event $\Omega_2$ defined in Lemma C.4, and the success of Algorithm 1.

Coarse correlated equilibrium. By Lemmas C.3 and C.4, for all $i \in [N]$, $\mathrm{Regret}^{i,\star}_T \le O\big(\sqrt{\ln A \cdot T} + \frac{\ln(1/p)\ln T}{\Delta} + \sqrt{T\ln(NA/\delta)}\big)$, so with the choice of $T$ in the theorem this is at most $\epsilon T/2$ for all $i \in [N]$, and the average strategy $(\sum_{t=1}^{T}\otimes_{i=1}^{N}\theta^{(t)}_i)/T$ would be an $(\epsilon/2)$-CCE. Finally, in the clipping step, $\|\hat\theta^{(t)}_i - \theta^{(t)}_i\|_1 \le 2pA \le \frac{\epsilon}{4N}$ for all $i \in [N]$, $t \in [T]$.
Thus for all $t \in [T]$, we have $\|\otimes_{i=1}^{N}\hat\theta^{(t)}_i - \otimes_{i=1}^{N}\theta^{(t)}_i\|_1 \le \frac{\epsilon}{4}$, which further implies
$$\Big\|\frac{1}{T}\sum_{t=1}^{T}\otimes_{i=1}^{N}\hat\theta^{(t)}_i - \frac{1}{T}\sum_{t=1}^{T}\otimes_{i=1}^{N}\theta^{(t)}_i\Big\|_1 \le \frac{\epsilon}{4}.$$
Therefore the output strategy $\Pi = \big(\sum_{t=1}^{T}\otimes_{i=1}^{N}\hat\theta^{(t)}_i\big)/T$ is an ϵ-CCE.

Rationalizability. By Lemma C.2, if $a \in E_L \cap A_i$, then $\theta^{(t)}_i(a) \le p$ for all $t \in [T]$. It follows that $\hat\theta^{(t)}_i(a) = 0$, i.e., the action is not in the support of the output strategy $\Pi$.

Sample complexity. The total number of full-information queries is
$$\sum_{t=1}^{T} M_t \le T + \sum_{t=1}^{T}\frac{64\ln(ANT/\delta)}{\Delta^2 t} \le T + O\Big(\frac{1}{\Delta^2}\Big) = O\Big(\frac{1}{\Delta^2} + \frac{1}{\epsilon^2}\Big).$$
The total sample complexity for CCE learning is then $NA\cdot\sum_{t=1}^{T}M_t = O\big(\frac{NA}{\epsilon^2} + \frac{NA}{\Delta^2}\big)$. Finally, accounting for the cost $O\big(\frac{LNA}{\Delta^2}\big)$ of finding one IDE-surviving action profile yields the claimed rate.
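To make the role of the adaptive learning rate in Lemma C.2 concrete, the following stand-alone snippet (an illustration, not the paper's implementation) performs one Hedge/FTRL step with $\eta_t = \max\{\sqrt{\ln A / t},\, 4\ln(1/p)/(\Delta t)\}$ and shows that an action whose cumulative payoff trails the leader by $t\Delta/4$ receives weight at most $p$:

```python
import math

def hedge_step(cum_payoffs, t, p, delta):
    """One Hedge/FTRL iterate with the adaptive learning rate
    eta_t = max( sqrt(ln A / t), 4 ln(1/p) / (delta * t) )  (cf. Lemma C.2).
    cum_payoffs[a] is the cumulative payoff of action a after t rounds."""
    A = len(cum_payoffs)
    eta = max(math.sqrt(math.log(A) / t), 4 * math.log(1 / p) / (delta * t))
    shift = max(eta * c for c in cum_payoffs)      # shift for numerical stability
    weights = [math.exp(eta * c - shift) for c in cum_payoffs]
    total = sum(weights)
    return [w / total for w in weights]
```

For instance, with $p = 10^{-3}$, $\Delta = 0.2$, and $t = 100$, an action trailing the leader by exactly $t\Delta/4 = 5$ gets weight at most $\exp(-\eta_t \cdot 5) \le p$, matching the final step of the lemma's proof.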

D OMITTED PROOFS IN SECTION 5

Similar to the CCE case, we first bound the sampling noise. For action $a_i \in A_i$ and $\tau \in [T]$, denote the sampling noise by $\xi^{(\tau)}_i(a_i) := u^{(\tau)}_i(a_i) - u_i(a_i, \theta^{(\tau)}_{-i})$. In the CE case, we are interested in the weighted sums of noise $\sum_{\tau=1}^{t}\xi^{(\tau)}_i(a_i)\theta^{(\tau)}_i(b_i)$, which are bounded in the following lemma.

Lemma D.1. The following event $\Omega_3$ holds with probability at least $1 - \delta$: for all $t \in [T]$, $i \in [N]$, and $a_i, b_i \in A_i$,
$$\Big|\sum_{\tau=1}^{t}\xi^{(\tau)}_i(a_i)\theta^{(\tau)}_i(b_i)\Big| \le \frac{\Delta}{4}\sum_{\tau=1}^{t}\theta^{(\tau)}_i(b_i).$$

Proof. Note that $\sum_{\tau=1}^{t}\xi^{(\tau)}_i(a_i)\theta^{(\tau)}_i(b_i)$ can be written as a sum of $\sum_{\tau=1}^{t}M^{\tau}_i$ mean-zero bounded terms; precisely, for each $\tau$ there are $M^{\tau}_i$ terms, each bounded by $\theta^{(\tau)}_i(b_i)/M^{\tau}_i$. The claim then follows from the Azuma-Hoeffding inequality, our choice of $M^{(\tau)}_i$, and a union bound.

Lemma D.2. With probability at least $1 - 2\delta$, for all $t \in [T]$, $i \in [N]$, and $a_i \in A_i \cap E_L$, $\theta^{(t)}_i(a_i) \le p$.

Now we turn to the ϵ-CE guarantee. For a player $i \in [N]$, recall that the swap regret is defined as
$$\mathrm{SwapRegret}^i_T := \sup_{\phi: A_i \to A_i}\sum_{t=1}^{T}\sum_{b \in A_i}\theta^{(t)}_i(b)\,u^{(t)}_i(\phi(b)) - \sum_{t=1}^{T}\big\langle\theta^{(t)}_i, u^{(t)}_i\big\rangle.$$

Lemma D.3. For all $i \in [N]$, the swap regret can be bounded as $\mathrm{SwapRegret}^i_T \le O\big(\sqrt{A\ln(A)T} + \frac{A\ln^2(NAT/(\Delta\epsilon))}{\Delta}\big)$.

Proof. For $i \in [N]$, recall that the regret for an expert $b \in A_i$ is defined as
$$\mathrm{Regret}^{i,b}_T := \max_{a \in A_i}\sum_{t=1}^{T}\theta^{(t)}_i(b)\,u^{(t)}_i(a) - \sum_{t=1}^{T}\big\langle\hat\theta^{(t)}_i(\cdot|b),\ \theta^{(t)}_i(b)\,u^{(t)}_i\big\rangle.$$
Since $\theta^{(t)}_i(a) = \sum_{b\in A_i}\hat\theta^{(t)}_i(a|b)\theta^{(t)}_i(b)$ for all $a$ and all $t > 1$,
$$\sum_{b\in A_i}\mathrm{Regret}^{i,b}_T = \sum_{b\in A_i}\max_{a_b\in A_i}\sum_{t=1}^{T}\theta^{(t)}_i(b)u^{(t)}_i(a_b) - \sum_{b\in A_i}\sum_{t=1}^{T}\big\langle\hat\theta^{(t)}_i(\cdot|b)\theta^{(t)}_i(b),\ u^{(t)}_i\big\rangle$$
$$= \max_{\phi:A_i\to A_i}\sum_{b\in A_i}\sum_{t=1}^{T}\theta^{(t)}_i(b)u^{(t)}_i(\phi(b)) - \sum_{t=1}^{T}\sum_{b\in A_i}\big\langle\hat\theta^{(t)}_i(\cdot|b)\theta^{(t)}_i(b),\ u^{(t)}_i\big\rangle \ge \max_{\phi:A_i\to A_i}\sum_{t=1}^{T}\sum_{b\in A_i}\theta^{(t)}_i(b)u^{(t)}_i(\phi(b)) - \sum_{t=2}^{T}\big\langle\theta^{(t)}_i, u^{(t)}_i\big\rangle - 1 \ge \mathrm{SwapRegret}^i_T - 1.$$
It now suffices to control the regret of each individual expert. For expert $b$, we are essentially running FTRL with learning rates
$$\eta^b_{t,i} := \max\Big\{\frac{4\ln(1/p)}{\Delta\sum_{\tau=1}^{t}\theta^{(\tau)}_i(b)},\ \frac{\sqrt{A\ln A}}{\sqrt{t}}\Big\},$$
which are clearly monotonically decreasing.
Therefore, using the standard analysis of FTRL (see, e.g., Orabona (2019, Corollary 7.9)),
$$\mathrm{Regret}^{i,b}_T \le \frac{\ln A}{\eta^b_{T,i}} + \sum_{t=1}^{T}\eta^b_{t,i}\cdot\theta^{(t)}_i(b)^2 \le \sqrt{\frac{T\ln A}{A}} + \sum_{t=1}^{T}\theta^{(t)}_i(b)\sqrt{\frac{A\ln A}{t}} + \frac{4\ln(1/p)}{\Delta}\sum_{t=1}^{T}\frac{\theta^{(t)}_i(b)}{\sum_{\tau=1}^{t}\theta^{(\tau)}_i(b)} \le \sqrt{\frac{T\ln A}{A}} + \sum_{t=1}^{T}\theta^{(t)}_i(b)\sqrt{\frac{A\ln A}{t}} + \frac{4\ln(1/p)}{\Delta}\Big(1 + \ln\frac{T}{p}\Big).$$
Here we used the fact that $\theta^{(1)}_i(b) \ge p$ for all $b \in A_i$, and
$$\sum_{t=1}^{T}\frac{\theta^{(t)}_i(b)}{\sum_{\tau=1}^{t}\theta^{(\tau)}_i(b)} \le 1 + \int_{\theta^{(1)}_i(b)}^{\sum_{t=1}^{T}\theta^{(t)}_i(b)}\frac{\mathrm{d}s}{s} = 1 + \ln\frac{\sum_{t=1}^{T}\theta^{(t)}_i(b)}{\theta^{(1)}_i(b)} \le 1 + \ln\frac{T}{p}.$$
Notice that $\sum_{b\in A_i}\sum_{t=1}^{T}\theta^{(t)}_i(b)\sqrt{A\ln A / t} \le O(\sqrt{A\ln(A)T})$. Therefore
$$\mathrm{SwapRegret}^i_T \le O(1) + \sum_{b\in A_i}\mathrm{Regret}^{i,b}_T \le O\Big(\sqrt{A\ln(A)T} + \frac{A\ln^2(NAT/(\Delta\epsilon))}{\Delta}\Big).$$

Similar to the CCE case, this form of regret does not directly imply an approximate CE. We define the following expected version:
$$\mathrm{SwapRegret}^{i,\star}_T := \sup_{\phi:A_i\to A_i}\sum_{t=1}^{T}\big\langle\phi\circ\theta^{(t)}_i,\ u_i(\cdot,\theta^{(t)}_{-i})\big\rangle - \sum_{t=1}^{T}\big\langle\theta^{(t)}_i,\ u_i(\cdot,\theta^{(t)}_{-i})\big\rangle.$$
The next lemma bounds the difference between these two notions of regret.

Lemma D.4. The following event $\Omega_4$ holds with probability at least $1 - \delta$: for all $i \in [N]$, $\big|\mathrm{SwapRegret}^{i,\star}_T - \mathrm{SwapRegret}^i_T\big| \le O\big(\sqrt{AT\ln\frac{AN}{\delta}}\big)$.

Proof. Note that
$$\mathrm{SwapRegret}^{i,\star}_T - \mathrm{SwapRegret}^i_T = \sup_{\phi:A_i\to A_i}\sum_{t=1}^{T}\big\langle\phi\circ\theta^{(t)}_i - \theta^{(t)}_i,\ u_i(\cdot,\theta^{(t)}_{-i})\big\rangle - \sup_{\phi:A_i\to A_i}\sum_{t=1}^{T}\big\langle\phi\circ\theta^{(t)}_i - \theta^{(t)}_i,\ u^{(t)}_i\big\rangle \le \sup_{\phi:A_i\to A_i}\sum_{t=1}^{T}\big\langle\phi\circ\theta^{(t)}_i - \theta^{(t)}_i,\ u_i(\cdot,\theta^{(t)}_{-i}) - u^{(t)}_i\big\rangle.$$
Notice that $\mathbb{E}[u^{(t)}_i] = u_i(\cdot, \theta^{(t)}_{-i})$ and $u^{(t)}_i \in [-1,1]^A$. Therefore, for every $\phi: A_i \to A_i$, $\xi^\phi_t := \langle\phi\circ\theta^{(t)}_i - \theta^{(t)}_i,\ u_i(\cdot,\theta^{(t)}_{-i}) - u^{(t)}_i\rangle$ is a bounded martingale difference sequence. By the Azuma-Hoeffding inequality, for a fixed $\phi: A_i \to A_i$, with probability $1 - \delta'$, $\big|\sum_{t=1}^{T}\xi^\phi_t\big| \le 2\sqrt{2T\ln(2/\delta')}$. Setting $\delta' = \delta/(N A^A)$, we get that with probability $1 - \delta/N$, for all $\phi: A_i \to A_i$, $\big|\sum_{t=1}^{T}\xi^\phi_t\big| \le 2\sqrt{2AT\ln\frac{2AN}{\delta}}$. A union bound over $i \in [N]$ completes the proof.

Proof of Theorem 12. We condition on the event $\Omega_3$ defined in Lemma D.1, the event $\Omega_4$ defined in Lemma D.4, and the success of Algorithm 1.

Correlated equilibrium.
By Lemma D.3 and Lemma D.4, we know that for all $i \in [N]$,
$$\mathrm{SwapRegret}^{i,\star}_T \le O\Big(\sqrt{A\ln(A)T} + \frac{A\ln^2(NAT/(\Delta\epsilon))}{\Delta} + \sqrt{AT\ln\frac{AN}{\delta}}\Big).$$
Therefore choosing
$$T = \Theta\Big(\frac{A\ln\frac{AN}{\delta}}{\epsilon^2} + \frac{A\ln^3\frac{NA}{\Delta\epsilon\delta}}{\Delta\epsilon}\Big)$$
guarantees that $\mathrm{SwapRegret}^{i,\star}_T$ is at most $\epsilon T/2$ for all $i \in [N]$. In this case the average strategy $(\sum_{t=1}^{T}\otimes_{i=1}^{N}\theta^{(t)}_i)/T$ would be an $(\epsilon/2)$-CE. Finally, in the clipping step, $\|\hat\theta^{(t)}_i - \theta^{(t)}_i\|_1 \le 2pA \le \frac{\epsilon}{4N}$ for all $i \in [N]$, $t \in [T]$. Thus for all $t \in [T]$, $\|\otimes_{i=1}^{N}\hat\theta^{(t)}_i - \otimes_{i=1}^{N}\theta^{(t)}_i\|_1 \le \frac{\epsilon}{4}$, which further implies
$$\Big\|\frac{1}{T}\sum_{t=1}^{T}\otimes_{i=1}^{N}\hat\theta^{(t)}_i - \frac{1}{T}\sum_{t=1}^{T}\otimes_{i=1}^{N}\theta^{(t)}_i\Big\|_1 \le \frac{\epsilon}{4}.$$
Therefore the output strategy $\Pi = \big(\sum_{t=1}^{T}\otimes_{i=1}^{N}\hat\theta^{(t)}_i\big)/T$ is an ϵ-CE.

Rationalizability. By Lemma D.2, if $a \in E_L \cap A_i$, then $\theta^{(t)}_i(a) \le p$ for all $t \in [T]$. It follows that $\hat\theta^{(t)}_i(a) = 0$, i.e., the action is not in the support of the output strategy $\Pi$.

Sample complexity. The total number of queries is
$$\sum_{i\in[N]}\sum_{t=1}^{T}A\,M^{(t)}_i \le NAT + A\sum_{i\in[N]}\sum_{b\in A_i}\sum_{t=1}^{T}\frac{16\,\theta^{(t)}_i(b)}{\Delta^2\sum_{\tau=1}^{t}\theta^{(\tau)}_i(b)} \le NAT + \frac{16NA^2}{\Delta^2}\cdot\Big(1 + \ln\frac{T}{p}\Big) \le O\Big(\frac{NA^2}{\epsilon^2} + \frac{NA^2}{\Delta^2}\Big),$$
where we used again the fact that $\sum_{t=1}^{T}\theta^{(t)}_i(b)\big/\sum_{\tau=1}^{t}\theta^{(\tau)}_i(b) \le 1 + \ln(T/p)$. Finally, accounting for the cost $O\big(\frac{LNA}{\Delta^2}\big)$ of finding one IDE-surviving action profile yields the claimed rate.
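As a concrete reference point for the quantity being bounded above, the following small helper (an illustrative snippet, not from the paper) evaluates the empirical swap regret of a played sequence by choosing, for each source action $b$, the best fixed replacement $\phi(b)$; the supremum over $\phi$ decomposes into independent per-$b$ choices exactly as in the proof of Lemma D.3:

```python
def swap_regret(thetas, utils):
    """Empirical swap regret:
        sup_phi sum_t sum_b theta_t(b) u_t(phi(b)) - sum_t <theta_t, u_t>.
    thetas[t] and utils[t] are length-A lists (mixed strategy and payoffs).
    The sup over phi decomposes into an independent choice per action b."""
    A = len(thetas[0])
    realized = sum(sum(th[a] * u[a] for a in range(A))
                   for th, u in zip(thetas, utils))
    swapped = 0.0
    for b in range(A):
        # best fixed replacement phi(b) for all the rounds' mass on b
        swapped += max(sum(th[b] * u[a2] for th, u in zip(thetas, utils))
                       for a2 in range(A))
    return swapped - realized
```

For example, playing action 0 for one round with payoff vector $[0, 1]$ has swap regret 1 (the swap $0 \to 1$ gains 1), while a round whose played action is already optimal contributes no swap regret.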

E DETAILS FOR REDUCTION ALGORITHMS

In this section, we present the details of the reduction-based algorithm for finding rationalizable CE (Algorithm 5), together with the analysis of both Algorithms 4 and 5.

E.1 RATIONALIZABLE CCE VIA REDUCTION

We choose $\epsilon' = \frac{\min\{\epsilon, \Delta\}}{3}$ and $M = \frac{4\ln(2NA/\delta)}{\epsilon'^2}$.

Lemma E.1. With probability $1 - \delta$, throughout the execution of Algorithm 4, for every $t$, $i \in [N]$, and $a'_i \in A_i$, $|\hat u_i(a'_i, \Pi_{-i}) - u_i(a'_i, \Pi_{-i})| \le \epsilon'$.

Proof. First, observe that in every iteration $t$ before the algorithm returns, the total support size $\sum_{i=1}^{N}|A^{(t)}_i|$ increases by at least 1. It follows that the algorithm returns before $t = NA$. By Hoeffding's inequality,
$$\Pr\big[|\hat u_i(a'_i, \Pi_{-i}) - u_i(a'_i, \Pi_{-i})| > \epsilon'\big] \le 2\exp\Big(-\frac{M\epsilon'^2}{2}\Big) \le \frac{\delta}{N^2A^2}.$$
Applying a union bound over $t$, $i$, and $a'_i$ proves the statement.

Proof of Theorem 13. Correctness. Since $\Pi$ is an $\epsilon'$-CCE in the subgame $\prod_{i=1}^{N}A^{(t)}_i$, for all $i \in [N]$ and $a \in A^{(t)}_i$, $u_i(a, \Pi_{-i}) \le u_i(\Pi) + \epsilon'$. Because $\arg\max_{a\in A_i}\hat u_i(a, \Pi_{-i}) \in A^{(t)}_i$ upon termination, for all $i \in [N]$ and $a \in A_i$,
$$u_i(a, \Pi_{-i}) \le \hat u_i(a, \Pi_{-i}) + \epsilon' \le \max_{a'\in A^{(t)}_i}\hat u_i(a', \Pi_{-i}) + \epsilon' \le \max_{a'\in A^{(t)}_i}u_i(a', \Pi_{-i}) + 2\epsilon' \le u_i(\Pi) + 3\epsilon' \le u_i(\Pi) + \epsilon.$$
Therefore $\Pi$ is an ϵ-CCE in the full game. Moreover, we claim that for any $t$, $A^{(t)}_i$ only contains ∆-rationalizable actions. This holds for $t = 1$ with high probability due to our initialization. Suppose it holds for $t$. Notice that the only way for an action $a'_i$ to enter $A^{(t+1)}_i$ is to be an empirical best response, which means
$$u_i(a'_i, \Pi_{-i}) \ge \hat u_i(a'_i, \Pi_{-i}) - \epsilon' \ge \max_{a\in A_i}\hat u_i(a, \Pi_{-i}) - \epsilon' \ge \max_{a\in A_i}u_i(a, \Pi_{-i}) - 2\epsilon'.$$
Since $\epsilon' < \Delta/2$, this means that $a'_i$ is a ∆-best response to a ∆-rationalizable strategy, and is therefore ∆-rationalizable. Hence $A^{(t+1)}_i$ also only contains ∆-rationalizable actions. The claim follows via induction, and consequently the output strategy is also ∆-rationalizable. We conclude that the output strategy is a ∆-rationalizable ϵ-CCE with probability $1 - 2\delta$ (conditioning on the event in Lemma E.1 as well as the rationalizability of the initialization).

Sample complexity. By Theorem 3, Line 1 needs $O\big(\frac{LNA}{\Delta^2}\big)$ samples. Since the algorithm returns before $t = NA$, the total number of calls to the black-box oracle $O$ is at most $NA$. For each $t$, the number of samples required is $NAM = O\big(\frac{NA}{\min\{\Delta, \epsilon\}^2}\big)$. Combining this with the upper bound on $t$ and the cost of Algorithm 1 gives the total sample complexity bound $O\big(\frac{N^2A^2}{\min\{\Delta^2, \epsilon^2\}}\big)$.
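Under these parameter choices, the per-estimate sample count can be computed directly. The helper below is a hypothetical illustration of the calculation $M = 4\ln(2NA/\delta)/\epsilon'^2$ with $\epsilon' = \min\{\epsilon, \Delta\}/3$:

```python
import math

def samples_per_estimate(N, A, delta, eps, Delta):
    """Per-estimate sample count M = 4 ln(2NA/delta) / eps_prime^2,
    with eps_prime = min(eps, Delta) / 3, so that each payoff estimate
    is eps_prime-accurate with probability 1 - delta/(N^2 A^2) (Hoeffding)."""
    eps_prime = min(eps, Delta) / 3
    return math.ceil(4 * math.log(2 * N * A / delta) / eps_prime ** 2)
```

Halving $\min\{\epsilon, \Delta\}$ quadruples $M$, matching the $\min\{\epsilon^2, \Delta^2\}$ dependence in Theorem 13.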

E.2 RATIONALIZABLE CE VIA REDUCTION

The algorithm for CE is quite similar to the one for CCE, except that when testing whether a subgame ϵ-CE is an ϵ-CE of the full game, we need to use the conditional distribution $\Pi|a_i$, i.e., the conditional distribution of the other players' actions given that player $i$ is told to play $a_i$. The detailed description is given in Algorithm 5. Similar to the CCE case, we choose $\epsilon' = \frac{\min\{\epsilon, \Delta\}}{3}$.



See, e.g., the Diamond-In-the-Rough (DIR) games in Wu et al. (2021, Definition 2) for a concrete example of iterative dominance elimination. Here we slightly abuse notation and use ∆ to refer both to the gap and to the probability simplex. Alternatively, one can define ∆-rationalizability by the iterative elimination of actions that are never a ∆-best response, which is mathematically equivalent to Definition 1 (see Appendix A.1). For two-player zero-sum games, the marginals of any CCE form an NE, so NE can be found efficiently; this is not true for general games, where finding NE is computationally hard and takes $\Omega(2^N)$ samples. Wu et al. (2021)'s result allows trade-offs between variables via different choices of algorithmic parameters; however, a $\zeta^{-1}\Delta^{-3}$ factor is unavoidable regardless of the choice of parameters.






Algorithm 5 Rationalizable ϵ-CE via Black-box Reduction
1: $(a^\star_1, \cdots, a^\star_N) \leftarrow$ Algorithm 1
2: For all $i \in [N]$, initialize $A^{(1)}_i \leftarrow \{a^\star_i\}$
3: for $t = 1, 2, \ldots$ do
4: Find an $\epsilon'$-CE $\Pi$ in the sub-game supported on $\prod_{i\in[N]} A^{(t)}_i$
5: For all $i \in [N]$ and $a_i, a'_i \in A_i$, sample $u_i(a'_i, \Pi_{-i}|a_i)$ $M$ times and compute the average $\hat u_i(a'_i, \Pi_{-i}|a_i)$

ACKNOWLEDGEMENTS

This work is supported by Office of Naval Research N00014-22-1-2253. Dingwen Kong is partially supported by the elite undergraduate training program of School of Mathematical Sciences in Peking University.


Proof. First, observe that in every iteration $t$ before the algorithm returns, the total support size $\sum_{i=1}^{N}|A^{(t)}_i|$ increases by at least 1. It follows that the algorithm returns before $t = NA$. By Hoeffding's inequality, each estimate is $\epsilon'$-accurate with high probability; applying a union bound over $t$, $i$, $a_i$, and $a'_i$ proves the statement.

Proof. Note that with high probability, the empirical estimates $\hat U$ are at most $\epsilon/4$ away from the true values $U$. Since $a'_i$ is the empirical best response, we have ... Note that $\Pi|a_i$ is supported on actions that survive any number of rounds of ϵ-IDE. Therefore it serves as a certificate that $a'_i$ will never be ϵ-eliminated either.

Lemma E.3. The returned strategy $\Pi$ is an ϵ-CE with probability $1 - \delta$.

Proof. When the algorithm terminates, for all $i \in [N]$, ... Since $\Pi$ is an $\epsilon'$-CE in the reduced game, ... Summing the two inequalities above gives ..., which proves the statement.

Lemma E.4. For any $t$, $A^{(t)}_i$ only contains ∆-rationalizable actions with probability $1 - 2\delta$.

Proof. We prove this inductively. The claim holds for $t = 1$ with probability $1 - \delta$ due to our initialization. Suppose it holds for $t$. Notice that the only way for an action $a'_i$ to enter $A^{(t+1)}_i$ is to be an empirical best response, which means that for some $a_i$, ... Since $\epsilon' < \Delta/2$, this means that $a'_i$ is a ∆-best response to a ∆-rationalizable strategy, and is therefore ∆-rationalizable. Therefore $A^{(t+1)}_i$ also only contains ∆-rationalizable actions. The claim follows via induction, and it follows that the output strategy is also ∆-rationalizable.

Proof of Theorem 14. Correctness. By Lemmas E.3 and E.4, the output strategy is a ∆-rationalizable ϵ-CE with probability $1 - 2\delta$ (conditioning on the event in Lemma E.2 holding and the rationalizability of the initialization).

