OFFLINE CONGESTION GAMES: HOW FEEDBACK TYPE AFFECTS DATA COVERAGE REQUIREMENT

Abstract

This paper investigates when one can efficiently recover an approximate Nash equilibrium (NE) in offline congestion games. The existing dataset coverage assumption for offline general-sum games inevitably incurs a dependency on the number of actions, which can be exponentially large in congestion games. We consider three types of feedback with decreasing amounts of revealed information. Starting from facility-level (a.k.a. semi-bandit) feedback, we propose a novel one-unit deviation coverage condition and give a pessimism-type algorithm that recovers an approximate NE. For the agent-level (a.k.a. bandit) feedback setting, interestingly, we show that the one-unit deviation coverage condition is not sufficient. On the other hand, by converting the game to multi-agent linear bandits, we show that with a generalized data coverage assumption from offline linear bandits, we can efficiently recover the approximate NE. Lastly, we consider a novel type of feedback, game-level feedback, where only the total reward of all agents is revealed. Again, we show that the coverage assumption for the agent-level setting is insufficient under game-level feedback, while a stronger version of the data coverage assumption for linear bandits allows us to recover an approximate NE. Together, our results constitute the first study of offline congestion games and imply formal separations between the different types of feedback.

1. INTRODUCTION

The congestion game is a special class of general-sum matrix games that models the interaction of players sharing facilities (Rosenthal, 1973). Each player chooses a set of facilities to utilize, and each facility yields a reward that depends on how congested it is. For instance, in the routing game (Koutsoupias & Papadimitriou, 1999), each player chooses a path from an origin to a destination in a traffic graph. The facilities are the edges, and the joint decision of all players determines the congestion in the graph: the more players utilize an edge, the longer the travel time on that edge. As one of the most well-known classes of games, congestion games have been successfully deployed in numerous real-world applications such as resource allocation (Johari & Tsitsiklis, 2003), electrical grids (Ibars et al., 2010), and cryptocurrency ecosystems (Altman et al., 2019).

The Nash equilibrium (NE), one of the most important concepts in game theory (Nash Jr, 1950), characterizes the emergent behavior in a multi-agent system with selfish players. It is well known that solving for an NE is computationally efficient in congestion games, as they are isomorphic to potential games (Monderer & Shapley, 1996). Assuming full-information access, classic dynamics such as best-response dynamics (Fanelli et al., 2008), replicator dynamics (Drighes et al., 2014), and no-regret dynamics (Kleinberg et al., 2009) provably converge to an NE in congestion games. Recently, Heliou et al. (2017) and Cui et al. (2022) relaxed the full-information setting to the online (semi-)bandit feedback setting, achieving asymptotic and non-asymptotic convergence, respectively. It is worth noting that Cui et al. (2022) proposed the first algorithm whose sample complexity is independent of the number of actions.

Offline reinforcement learning has been studied in many real-world applications (Levine et al., 2020).
From the theoretical perspective, a line of work provides an understanding of offline single-agent decision making, including bandits and Markov decision processes (MDPs), where researchers derived favorable sample complexities under single-policy coverage (Rashidinejad et al., 2021; Xie et al., 2021b). However, how to learn in multi-agent games from offline data is still far from clear. Recently, the unilateral coverage assumption has been proposed as the minimal assumption for offline zero-sum and general-sum games, with corresponding algorithms to learn the NE (Cui & Du, 2022a;b; Zhong et al., 2022). Although this coverage assumption and these algorithms apply to the most general class of normal-form games, when specialized to congestion games the sample complexity scales with the number of actions, which can be exponentially large. Since congestion games admit specific structures, one may hope to find specialized data coverage assumptions that permit sample-efficient offline learning.

                  One-Unit Deviation   Weak Covariance Domination   Strong Covariance Domination
Facility-Level           ✔                        ✔                             ✔
Agent-Level              ✘                        ✔                             ✔
Game-Level               ✘                        ✘                             ✔

Table 1: A summary of how data coverage assumptions affect offline learnability. A ✔ means that under this pair of feedback type and assumption, an NE can be learned with a sufficient amount of data; a ✘ means there exist instances in which an NE cannot be learned no matter how much data is collected.

In different applications, the type of feedback, i.e., the revealed reward information, can differ across offline datasets. For instance, the dataset may include the reward of each facility, the reward of each player, or the total reward of the game. With decreasing information contained in the dataset, different coverage assumptions and algorithms become necessary. In addition, the main challenge in solving congestion games lies in the curse of the large action set, as the number of actions can be exponential in the number of facilities. In this work, we aim to answer the following question:

When can we find an approximate NE in offline congestion games with different types of feedback, without suffering from the curse of the large action set?

We provide an answer that reveals striking differences between the feedback types.

1.1. MAIN CONTRIBUTIONS

We provide both positive and negative results for each type of feedback; see Table 1 for a summary.

1. Three types of feedback and corresponding data coverage assumptions. We consider three types of feedback (facility-level, agent-level, and game-level) to model different real-world applications, and study which dataset coverage assumptions permit finding an approximate NE under each. For offline general-sum games, Cui & Du (2022b) propose the unilateral coverage assumption. Although their result applies to offline congestion games with agent-level feedback, their unilateral coverage coefficient is at least as large as the number of actions and thus depends exponentially on the number of facilities. Therefore, for each type of feedback, we propose a corresponding data coverage assumption that escapes the curse of the large action set. Specifically:
• Facility-Level Feedback: The reward incurred at each facility is provided in the offline dataset. This feedback carries the strongest signal, and we propose the One-Unit Deviation coverage assumption for it (cf. Assumption 2).
• Agent-Level Feedback: Only the sum of the facilities' rewards for each agent is observed. This feedback carries a weaker signal than facility-level feedback, so we require a stronger data coverage assumption (cf. Assumption 4).
• Game-Level Feedback: Only the sum of the agents' rewards is observed. This feedback carries the weakest signal, and we require the strongest data coverage assumption (cf. Assumption 5).
Notably, for the latter two types of feedback, we leverage connections between congestion games and linear bandits.

2. Sample complexity analyses for different types of feedback. We adopt the surrogate minimization idea of Cui & Du (2022b) and show that a unified algorithm (cf. Algorithm 1), with carefully designed bonus terms tailored to each type of feedback, can efficiently find an approximate NE, thereby showing that our proposed data coverage assumptions are sufficient. For each type of feedback, we give a polynomial upper bound under its corresponding coverage assumption.

3. Separations between different types of feedback. To rigorously quantify the signal strengths of the three types of feedback, we provide concrete hard instances. Specifically, we show there exists a problem instance satisfying Assumption 2 for which, with only agent-level feedback, no algorithm can find an approximate NE, yielding a separation between facility-level and agent-level feedback. Similarly, we show there exists a problem instance satisfying Assumption 4 for which, with only game-level feedback, no algorithm can find an approximate NE, yielding a separation between agent-level and game-level feedback. In addition, we provide several concrete scenarios that exemplify and motivate the three types of feedback in Appendix A.

1.2. RELATED WORK

Potential Games and Congestion Games. Potential games are a special class of normal-form games that admit a potential function quantifying the change in each player's payoff, and a deterministic NE is proven to exist (Monderer & Shapley, 1996). Asymptotic convergence to an NE can be achieved by classic game-theoretic dynamics such as best-response dynamics (Durand, 2018; Swenson et al., 2018), replicator dynamics (Sandholm et al., 2008; Panageas & Piliouras, 2016), and no-regret dynamics (Heliou et al., 2017). Recently, Cen et al. (2021) proved that natural policy gradient has a convergence rate independent of the number of actions in entropy-regularized potential games, and Anagnostides et al. (2022) provided non-asymptotic convergence rates for mirror descent and O(1) individual regret for optimistic mirror descent. Congestion games were proposed in the seminal work of Rosenthal (1973), and their equivalence with potential games was proven by Monderer & Shapley (1996). Note that congestion games can have exponentially large action sets, so efficient algorithms for potential games are not necessarily efficient for congestion games. Non-atomic congestion games consider a continuum of infinitesimally small players and enjoy a convex potential function when the cost function is non-decreasing (Roughgarden & Tardos, 2004). For atomic congestion games, the potential function is usually non-convex, making the problem harder. Kleinberg et al. (2009) and Krichene et al. (2014) show that no-regret algorithms asymptotically converge to an NE under full-information feedback, and Krichene et al. (2015) provide average-iterate convergence for bandit feedback. Chen & Lu (2015; 2016) provide non-asymptotic convergence by assuming the atomic congestion game is close to a non-atomic one, so that it approximately enjoys a convex potential. Recently, Cui et al. (2022) proposed an upper-confidence-bound-type algorithm and a Frank-Wolfe-type algorithm whose convergence rates are independent of the number of actions, for the semi-bandit and bandit feedback settings, respectively. To the best of our knowledge, all of these works consider either the full-information or the online feedback setting, rather than the offline setting studied in this paper.

Offline Bandits and Reinforcement Learning. For empirical offline reinforcement learning, interested readers can refer to Levine et al. (2020). From the theoretical perspective, researchers have sought to understand which dataset coverage assumptions allow learning the optimal policy. The most basic assumption is uniform coverage, i.e., every state-action pair is covered by the dataset (Szepesvári & Munos, 2005). Provably efficient algorithms have been proposed for both single-agent and multi-agent reinforcement learning (Yin et al., 2020; 2021; Ren et al., 2021; Sidford et al., 2020; Cui & Yang, 2021; Zhang et al., 2020; Subramanian et al., 2021). In single-agent bandits and reinforcement learning, with the help of pessimism, only single-policy coverage is required, i.e., the dataset only needs to cover the optimal policy (Jin et al., 2021; Rashidinejad et al., 2021; Xie et al., 2021b;a). For offline multi-agent Markov games, Cui & Du (2022a) first showed that learning is impossible under single-policy coverage and identified the unilateral coverage assumption as the minimal one, while Zhong et al. (2022) and Xiong et al. (2022) provide similar results with linear function approximation. Recently, Yan et al. (2022) and Cui & Du (2022b) gave minimax sample complexities for offline zero-sum Markov games. In addition, Cui & Du (2022b) propose an algorithm for offline multi-player general-sum Markov games that does not suffer from the curse of multiagents.

2. PRELIMINARY

2.1. CONGESTION GAME

General-Sum Matrix Game. A general-sum matrix game is defined by a tuple $G = (\{\mathcal{A}_i\}_{i=1}^m, R)$, where $m$ is the number of players, $\mathcal{A}_i$ is the action space of player $i$, and $R(\cdot \mid a)$ is a distribution over $[0, r_{\max}]^m$ with mean $r(a)$. When playing the game, all players simultaneously select actions, forming a joint action $a$; the reward is sampled as $r \sim R(\cdot \mid a)$, and player $i$ receives reward $r_i$. Let $\mathcal{A} = \prod_{i=1}^m \mathcal{A}_i$. A joint policy is a distribution $\pi \in \Delta(\mathcal{A})$, while a product policy is $\pi = \prod_{i=1}^m \pi_i$ with $\pi_i \in \Delta(\mathcal{A}_i)$, where $\Delta(\mathcal{X})$ denotes the probability simplex over $\mathcal{X}$. If the players follow a policy $\pi$, their actions are sampled as $a \sim \pi$. The expected return of player $i$ under a policy $\pi$ is the value $V_i^\pi = \mathbb{E}_{a \sim \pi}[r_i(a)]$. Let $\pi_{-i}$ denote the joint policy of all players except player $i$. The best response of player $i$ to $\pi_{-i}$ is $\pi_i^{\dagger,\pi_{-i}} = \arg\max_{\mu \in \Delta(\mathcal{A}_i)} V_i^{\mu,\pi_{-i}}$. Here $\mu$ is a policy for player $i$, and $(\mu, \pi_{-i})$ constitutes a joint policy for all players. We can always take the best response to be a pure strategy since the value function is linear in $\mu$. We denote the best response value by $V_i^{\dagger,\pi_{-i}} := V_i^{\pi_i^{\dagger,\pi_{-i}},\,\pi_{-i}}$. To evaluate a policy $\pi$, we use the performance gap $\mathrm{Gap}(\pi) = \max_{i\in[m]} \big[ V_i^{\dagger,\pi_{-i}} - V_i^\pi \big]$. A product policy $\pi$ is an $\varepsilon$-approximate NE if $\mathrm{Gap}(\pi) \le \varepsilon$, and an NE if $\mathrm{Gap}(\pi) = 0$. Note that a game may have multiple Nash equilibria.

Congestion Game. A congestion game is a general-sum matrix game with special structure. In particular, there is a facility set $\mathcal{F}$ such that $a \subseteq \mathcal{F}$ for all $a \in \mathcal{A}_i$, so the size of $\mathcal{A}_i$ is at most $2^F$, where $F = |\mathcal{F}|$. A set of facility reward distributions $\{R^f(\cdot \mid n)\}_{n \in [m],\, f \in \mathcal{F}}$ is associated with the facilities. Let $n_f(a) = \sum_{i=1}^m \mathbb{1}\{f \in a_i\}$ denote the number of players choosing facility $f$ under joint action $a$.
A facility together with a specific number of players selecting it is called a configuration on $f$; two joint actions in which the same number of players select $f$ are said to have the same configuration on $f$. The reward associated with facility $f$ is sampled as $r^f \sim R^f(\cdot \mid n_f(a))$, and the total reward of player $i$ is $r_i = \sum_{f \in a_i} r^f$. With a slight abuse of notation, let $r^f(n)$ denote the mean reward facility $f$ generates when $n$ players choose it. We further assume the support of $R^f(\cdot \mid n)$ is $[-1,1]$ for all $n \in [m]$. It is well known that every congestion game has a pure-strategy NE. The information we obtain from the game in each episode is $(a^k, r^k)$, where $a^k$ is the joint action and $r^k$ contains the reward signal. In this paper, we consider three types of reward feedback in congestion games, which determine what $r^k$ contains in each data point $(a^k, r^k)$:
• Facility-Level Feedback (Semi-Bandit Feedback): $r^k$ contains the reward received from each chosen facility $f \in \bigcup_{i=1}^m a_i^k$, i.e., $r^k = \{r^{f,k}\}_{f \in \bigcup_{i=1}^m a_i^k}$.
• Agent-Level Feedback (Bandit Feedback): $r^k$ contains the reward received by each player, i.e., $r^k = \{r_i^k\}_{i=1}^m$, where $r_i^k = \sum_{f \in a_i^k} r^{f,k}$.
• Game-Level Feedback: $r^k$ contains only the total reward received by all players, i.e., $r^k = \sum_{i=1}^m r_i^k$, a scalar. This is the minimal information we can obtain from the game, and it has not been discussed in previous literature.
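As a concrete illustration of the three feedback types, the following sketch (with made-up facility names and reward functions, not from the paper) simulates one round of a congestion game and derives each feedback level from the per-facility rewards:

```python
def play_round(actions, reward_fns):
    """Simulate one round of a congestion game and return all three
    feedback types.  `actions` is a list of facility sets (one per player);
    `reward_fns` maps each facility to a function congestion -> reward."""
    facilities = set().union(*actions)
    congestion = {f: sum(f in a for a in actions) for f in facilities}
    # facility-level (semi-bandit): reward of every chosen facility
    facility_fb = {f: reward_fns[f](congestion[f]) for f in facilities}
    # agent-level (bandit): each player's reward sums its chosen facilities
    agent_fb = [sum(facility_fb[f] for f in a) for a in actions]
    # game-level: a single scalar, the total reward of all players
    game_fb = sum(agent_fb)
    return facility_fb, agent_fb, game_fb

# two players, two facilities; rewards decrease with congestion
rewards = {"f1": lambda n: 2 - n, "f2": lambda n: 3 - 2 * n}
fac, agent, game = play_round([{"f1"}, {"f1", "f2"}], rewards)
```

Note how each level is a deterministic function of the one above it, which is exactly why the lower levels carry strictly less information.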

2.2. OFFLINE MATRIX GAME

Offline Matrix Game. In the offline setting, the algorithm only has access to an offline dataset $D = \{(a^k, r^k)\}_{k=1}^n$ collected in advance by some exploration policy $\rho$. A joint action $a$ is said to be covered if $\rho(a) > 0$. Cui & Du (2022a) proved that the following assumption is a minimal dataset coverage assumption for learning an NE in a general-sum matrix game; it requires the dataset to cover all actions obtained by unilateral deviations from one NE.

Assumption 1. There exists an NE $\pi^*$ such that for any player $i$ and policy $\pi_i \in \Delta(\mathcal{A}_i)$, $a$ is covered by $\rho$ whenever $(\pi_i, \pi^*_{-i})(a) > 0$.

Cui & Du (2022b) provide a sample complexity result for matrix games that depends on $C(\pi^*)$, where $C(\pi)$ quantifies how well $\pi$ is unilaterally covered by the dataset. The definition is as follows.

Definition 1. For a strategy $\pi$ and a $\rho$ satisfying Assumption 1, the unilateral coefficient is defined as $C(\pi) = \max_{i,\, \pi',\, a:\, \rho(a)>0} \frac{(\pi'_i, \pi_{-i})(a)}{\rho(a)}$.

Surrogate Minimization. Cui & Du (2022b) proposed an algorithm called Strategy-wise Bonus + Surrogate Minimization (SBSM) that achieves efficient learning under Assumption 1. SBSM motivates a general algorithmic framework for learning congestion games in different settings. First, we design $\hat r_i(a)$, an estimate of the reward player $i$ receives when the joint action is $a$. Offline bandit (and reinforcement learning) algorithms usually leverage a confidence bound (bonus) to create a pessimistic reward estimate, inducing conservatism in the output policy and yielding a sample-efficient algorithm. We formally define the bonus as follows.

Definition 2. For any reward estimator $\hat r_i : \mathcal{A} \to \mathbb{R}$ with expectation $r_i : \mathcal{A} \to \mathbb{R}$, $b_i : \mathcal{A} \to \mathbb{R}$ is called a bonus term for $\hat r$ if, with probability at least $1-\delta$, $|\hat r_i(a) - r_i(a)| \le b_i(a)$ holds for all $i \in [m]$ and $a \in \mathcal{A}$.

The formulas for $\hat r_i$ and $b_i$ vary with the type of feedback, as discussed in later sections.
The optimistic and pessimistic values for a policy $\pi$ and player $i$ are defined as
$$\overline V_i^\pi = \mathbb{E}_{a\sim\pi}\big[\hat r_i(a) + b_i(a)\big], \qquad \underline V_i^\pi = \mathbb{E}_{a\sim\pi}\big[\hat r_i(a) - b_i(a)\big]. \tag{3}$$
Finally, the algorithm minimizes $\max_{i\in[m]} \overline V_i^{\dagger,\pi_{-i}} - \underline V_i^\pi$ over the policy $\pi$; this objective serves as a surrogate of the performance gap (see Lemma 1). We summarize the procedure in Algorithm 1. Note that we take only the surrogate gap from SBSM, not the strategy-wise bonus, which is a deliberately designed bonus term that depends on the policy. Instead, we design specialized bonus terms that exploit the unique structure of congestion games, discussed in detail in later sections.

Algorithm 1 Surrogate Minimization for Congestion Games

Require: Offline dataset $D$
1: Compute $\hat r(a)$, $b(a)$ for all $a \in \mathcal{A}$ from the dataset $D$.
2: Compute the optimistic value $\overline V_i^\pi$ and pessimistic value $\underline V_i^\pi$ for all policies $\pi$ and players $i$ by (3).
3: Compute $\overline V_i^{\dagger,\pi_{-i}} = \max_{\pi'_i \in \Delta(\mathcal{A}_i)} \overline V_i^{\pi'_i, \pi_{-i}}$.
4: return $\arg\min_\pi \max_{i\in[m]} \overline V_i^{\dagger,\pi_{-i}} - \underline V_i^\pi$.

The sample complexity of this algorithm is guaranteed by the following theorem.

Theorem 1. Let $\Pi$ be the set of all deterministic policies and let $b$ be a bonus term for $\hat r$. With probability $1-\delta$,
$$\mathrm{Gap}(\pi_{\mathrm{output}}) \le 2 \max_{i\in[m]} \Big[ \max_{\pi' \in \Pi} \mathbb{E}_{a\sim(\pi'_i, \pi^*_{-i})} b_i(a) + \mathbb{E}_{a\sim\pi^*} b_i(a) \Big],$$
where $\pi_{\mathrm{output}}$ is the output of Algorithm 1.

Here, the expectation of the bonus term under a policy reflects the degree of uncertainty in the reward under that policy. Inside the operator $\max_{i\in[m]}[\cdot]$, the first term is the uncertainty of the unilaterally deviated policy that maximizes it, and the second term is the uncertainty of $\pi^*$. The full proof is deferred to Appendix B. This theorem tells us that to bound the performance gap, we need to precisely estimate the rewards induced by unilateral deviations from the NE, which matches Assumption 1.
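For a small game, Algorithm 1 can be carried out by brute force. The sketch below (with hypothetical helper names, and restricted to deterministic product policies for simplicity) enumerates candidate policies, uses the optimistic value for best responses and the pessimistic value for the candidate itself, and returns the surrogate-gap minimizer:

```python
from itertools import product

def surrogate_minimize(action_sets, r_hat, bonus):
    """Brute-force sketch of Algorithm 1 over deterministic product
    policies.  r_hat[i][a] and bonus[i][a] give player i's reward
    estimate and bonus for joint action a (toy stand-ins for the
    paper's feedback-specific estimators)."""
    joint_actions = list(product(*action_sets))
    def V_opt(i, a):  return r_hat[i][a] + bonus[i][a]   # optimistic value
    def V_pess(i, a): return r_hat[i][a] - bonus[i][a]   # pessimistic value

    best_pi, best_gap = None, float("inf")
    for pi in joint_actions:                 # candidate output policy
        gap = 0.0
        for i in range(len(action_sets)):
            # optimistic best response of player i against pi_{-i}
            dev = max(V_opt(i, pi[:i] + (ai,) + pi[i+1:])
                      for ai in action_sets[i])
            gap = max(gap, dev - V_pess(i, pi))
        if gap < best_gap:
            best_pi, best_gap = pi, gap
    return best_pi, best_gap

# toy 2x2 coordination game with zero bonus: matching actions pay 1
acts = [(0, 1), (0, 1)]
r_hat = [{(a, b): 1.0 if a == b else 0.0 for a in (0, 1) for b in (0, 1)}
         for _ in range(2)]
bonus = [{k: 0.0 for k in r_hat[0]} for _ in range(2)]
pi, gap = surrogate_minimize(acts, r_hat, bonus)
```

The enumeration over all joint actions is only viable for tiny games; the point of the paper's bonus constructions is precisely to avoid paying for the exponentially large action set in the sample complexity.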

3. OFFLINE CONGESTION GAME WITH FACILITY-LEVEL FEEDBACK

Recall Definition 1. If $\pi$ is deterministic, the minimum value of $C(\pi)$ is achieved when $\rho$ uniformly covers all actions reachable by unilateral deviations from $\pi$ (see Proposition 1). Since deviations $\pi'_i$ from $\pi$ must cover at least $|\mathcal{A}_i|$ actions, the smallest value of $C(\pi)$ scales with $\max_{i\in[m]} |\mathcal{A}_i|$, which is reasonable for general-sum matrix games. However, this is not acceptable for congestion games, since the action space can be exponentially large ($|\mathcal{A}_i| \le 2^F$). As a result, covering all possible unilateral deviations becomes inappropriate.

(Figure 1 caption: There are five facilities and five players with full action space. The facility configuration in $\pi^*$ is marked in red. The transparent boxes cover the facility configurations required in the assumption.)

Compared to general-sum games, congestion games with facility-level feedback reveal not only the total reward but also the individual rewards of all chosen facilities. This allows us to estimate the reward distribution of each facility separately. Instead of covering every unilaterally deviated action $a$, we only need to ensure that for any such $a$ and any facility $f \in \mathcal{F}$, the dataset covers some action that shares the same configuration with $a$ on $f$. This motivates a dataset coverage assumption on facilities rather than actions. We quantify the facility coverage condition and present the new assumption as follows.

Definition 3. For a strategy $\pi$, a facility $f$, and an integer $n$, the facility cumulative density is defined as $d_f^\pi(n) = \sum_{a:\, n_f(a)=n} \pi(a)$. A facility $f$ is said to be covered by $\rho$ at $n$ if $d_f^\rho(n) > 0$.

Assumption 2 (One-Unit Deviation). There exists an NE $\pi^*$ such that for any player $i$, facility $f$, and integer $n$: if there exists a policy $\pi_i \in \Delta(\mathcal{A}_i)$ with $d_f^{\pi_i, \pi^*_{-i}}(n) > 0$, then $d_f^\rho(n) > 0$.

In plain words, this assumption requires covering all facility configurations that can be induced by unilaterally deviated actions.
As mentioned in Section 2.1, we can always choose $\pi^*$ to be deterministic. From the viewpoint of a facility, a unilateral deviation either adds one player who previously did not select it or removes one player who did. Thus, for each $f \in \mathcal{F}$, it suffices to cover the configurations in which the number of players selecting $f$ differs from the NE count by at most one; hence the name one-unit deviation. The facility coverage condition is visualized in Figure 1. Meanwhile, Definition 1 is adapted to this assumption as follows.

Assumption 3. There exists a constant $C_{\mathrm{facility}} > 0$ and an NE $\pi^*$ such that $C_{\mathrm{facility}} \ge \max_{i, \pi_i, f, n} \frac{d_f^{\pi_i, \pi^*_{-i}}(n)}{d_f^\rho(n)}$, with the convention $\frac{0}{0} = 0$.

The sample complexity bound depends on $C_{\mathrm{facility}}$ (see Theorem 3), whose minimum value is at most 3 (see Proposition 2), which is acceptable. Furthermore, we show that no assumption weaker than one-unit deviation allows NE learning, as stated in the following theorem.

Theorem 2. Define a class $X$ of congestion games $M$ and exploration strategies $\rho$ consisting of all pairs $(M, \rho)$ for which Assumption 2 is satisfied except for at most one configuration of one facility. For any algorithm ALG, there exists $(M, \rho) \in X$ such that the output of ALG is no better than a $1/2$-approximate NE, no matter how much data is collected.

Proof Sketch. Consider congestion games with a single facility $f$ and five players. The action space of each player is $\{\emptyset, \{f\}\}$. We construct the following two congestion games with deterministic rewards.

Congestion Game 1: $R^f(1) = 1$, $R^f(2) = -1$, $R^f(3) = 1$, $R^f(4) = 1$, $R^f(5) = 1$.
Congestion Game 2: $R^f(1) = 1$, $R^f(2) = 1$, $R^f(3) = 1$, $R^f(4) = 1$, $R^f(5) = -1$.

The exploration policy is
$$\rho(a) = \begin{cases} 1/20 & \text{if one, three, or four players select } f, \\ 0 & \text{otherwise.} \end{cases}$$

Under $\rho$, only configurations with one, three, or four players are observed, and on those configurations the two games coincide; hence the two games are indistinguishable to ALG.
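The indistinguishability claim in the proof sketch can be checked mechanically: on every facility configuration the exploration policy covers, the two games produce identical facility-level feedback.

```python
# Deterministic facility rewards of the two hard instances from the
# proof sketch of Theorem 2 (single facility f, five players):
# game[n] = reward of f when n players select it.
game1 = {1: 1, 2: -1, 3: 1, 4: 1, 5: 1}
game2 = {1: 1, 2: 1, 3: 1, 4: 1, 5: -1}

# Configurations visited by the exploration policy rho
# (one, three, or four players select f).
covered = {1, 3, 4}

# On every covered configuration the observable feedback coincides,
# so no amount of data can tell the two games apart ...
indistinguishable = all(game1[n] == game2[n] for n in covered)
# ... yet the uncovered configurations (2 and 5) differ, which is what
# makes the Nash equilibria of the two games different.
```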
The full proof is deferred to Appendix B. In the facility-level feedback setting, the bonus term is similar to that of Cui et al. (2022). First, we count the number of data points in $D$ with $n$ players choosing facility $f$: $N_f(n) = \sum_{a^k \in D} \mathbb{1}\{n_f(a^k) = n\}$. Then we define the estimated reward function and bonus term as
$$\hat r_i(a) = \sum_{f\in a_i} \frac{\sum_{(a^k, r^k)\in D} r^{f,k}\, \mathbb{1}\{n_f(a^k) = n_f(a)\}}{N_f(n_f(a)) \vee 1}, \qquad b_i(a) = \sum_{f \in a_i} \sqrt{\frac{\iota}{N_f(n_f(a)) \vee 1}},$$
where $\iota = 2\log(4(m+1)F/\delta)$. Each term in $b_i$ mimics the bonus of the well-known UCB algorithm, and the sample complexity bound is given by the following theorem.

Theorem 3. With probability $1-\delta$, if Assumption 2 is satisfied, it holds that $\mathrm{Gap}(\pi_{\mathrm{output}}) \le 8\sqrt{m+1}\, C_{\mathrm{facility}}\, \iota F / \sqrt{n}$.

The proof bounds the expectation of $b$ by exploiting the special structure of congestion games: actions can be classified by their configuration on each facility, which lets us bound the expectation over actions, essentially a sum over $|\mathcal{A}_i|$ actions, by the number of players. The detailed proof is deferred to Appendix B.
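The facility-level estimator and bonus above can be sketched as follows (the dataset format and function name are illustrative; each data point pairs the joint action with the observed per-facility rewards):

```python
import math
from collections import defaultdict

def facility_estimates(dataset, m, F, delta=0.05):
    """Empirical facility rewards and UCB-style bonuses, a sketch of the
    Section 3 estimator.  `dataset` is a list of (actions, fac_rewards),
    where actions is a list of facility sets (one per player) and
    fac_rewards maps each chosen facility to its observed reward."""
    counts = defaultdict(int)     # N_f(n): visits to configuration (f, n)
    sums = defaultdict(float)     # running sum of rewards per configuration
    for actions, fac_rewards in dataset:
        for f, r in fac_rewards.items():
            n = sum(f in a for a in actions)   # congestion n_f(a)
            counts[(f, n)] += 1
            sums[(f, n)] += r
    iota = 2 * math.log(4 * (m + 1) * F / delta)
    r_hat = {k: sums[k] / counts[k] for k in counts}
    bonus = {k: math.sqrt(iota / counts[k]) for k in counts}
    return r_hat, bonus

# two players, one facility: one data point per covered configuration
data = [
    ([{"f"}, {"f"}], {"f": -1.0}),   # both players on f: n = 2
    ([{"f"}, set()], {"f": 1.0}),    # one player on f:  n = 1
]
r_hat, bonus = facility_estimates(data, m=2, F=1)
```

A player's estimated reward for action $a_i$ is then the sum of `r_hat[(f, n_f(a))]` over $f \in a_i$, with the bonus summed analogously.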

4. OFFLINE CONGESTION GAME WITH AGENT-LEVEL FEEDBACK

4.1. IMPOSSIBILITY RESULT

In the agent-level feedback setting, we no longer observe the rewards of individual facilities, so estimating them separately is no longer feasible. From the limited actions covered in the dataset, we may not be able to precisely estimate the rewards of all unilaterally deviated actions, and thus cannot learn an approximate NE. This observation is formalized in the following theorem.

Theorem 4. Define a class $X$ of congestion games $M$ and exploration strategies $\rho$ consisting of all pairs $(M, \rho)$ satisfying Assumption 2. Under agent-level feedback, for any algorithm ALG there exists $(M, \rho) \in X$ such that the output of ALG is no better than a $1/8$-approximate NE, no matter how much data is collected.

Proof Sketch. Consider a congestion game with two facilities $f_1, f_2$ and two players, where both players have the full action space. We construct the following two congestion games with deterministic rewards.

Congestion Game 3: $R^{f_1}(1) = 1$, $R^{f_1}(2) = 1/2$, $R^{f_2}(1) = 1$, $R^{f_2}(2) = -1$.
Congestion Game 4: $R^{f_1}(1) = 1$, $R^{f_1}(2) = -1/4$, $R^{f_2}(1) = 1$, $R^{f_2}(2) = -1/4$.

The exploration policy $\rho$ is
$$\rho(a_1, a_2) = \begin{cases} 1/3 & a_1 = a_2 = \{f_1, f_2\}, \\ 1/3 & a_1 = \{f_1\},\, a_2 = \{f_2\} \text{ or } a_1 = \{f_2\},\, a_2 = \{f_1\}, \\ 0 & \text{otherwise.} \end{cases} \tag{4}$$

It can be verified that both $f_1$ and $f_2$ are covered at 1 and 2, as shown in Figure 2. The two games produce identical agent-level feedback on every covered action, so it is impossible for the algorithm to distinguish them. The detailed proof is deferred to Appendix C.
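The key step of the argument, that Congestion Games 3 and 4 yield identical agent-level feedback on every covered action, can be verified directly:

```python
def agent_rewards(actions, R):
    """Agent-level feedback: each player's reward is the sum of its chosen
    facilities' rewards at the realized congestion.  R maps a pair
    (facility, congestion) to the deterministic reward."""
    cong = {f: sum(f in a for a in actions) for f in set().union(*actions)}
    return [sum(R[(f, cong[f])] for f in a) for a in actions]

# Hard instances from the proof sketch of Theorem 4
game3 = {("f1", 2): 0.5,   ("f2", 2): -1.0,  ("f1", 1): 1.0, ("f2", 1): 1.0}
game4 = {("f1", 2): -0.25, ("f2", 2): -0.25, ("f1", 1): 1.0, ("f2", 1): 1.0}

# Actions covered by the exploration policy rho
covered = [
    [{"f1", "f2"}, {"f1", "f2"}],   # both players take both facilities
    [{"f1"}, {"f2"}],
    [{"f2"}, {"f1"}],
]

# On the first action, each player sees 0.5 - 1 = -0.5 in game 3 and
# -0.25 - 0.25 = -0.5 in game 4: the per-facility rewards differ, but
# their sums agree, so agent-level feedback cannot separate the games.
same = all(agent_rewards(a, game3) == agent_rewards(a, game4) for a in covered)
```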

4.2. SOLUTION VIA LINEAR BANDIT

Published as a conference paper at ICLR 2023

(Figure 2 caption: Facility coverage condition of $\rho$. Each pair $(f, n)$ represents the configuration in which $n$ players select facility $f$. Each box contains the facility coverage condition for one player. There are two classes of covered actions, as described in formula (4); the color of each box indicates the class of actions it belongs to.)

In the agent-level feedback setting, a congestion game can be viewed as $m$ linear bandits. Let $\theta$ be a $d$-dimensional vector with $d = mF$ and $\theta_{n + mf} = r^f(n)$, where we assign each facility an index $f \in \{0, 1, \dots, F-1\}$. Define $A_i : \mathcal{A} \to \{0,1\}^d$ by $[A_i(a)]_j = \mathbb{1}\{j = n + mf,\ f \in a_i,\ n = n_f(a)\}$. Then the mean reward of player $i$ can be written as $r_i(a) = \langle A_i(a), \theta \rangle$. From the bandit viewpoint, $i$ indexes the bandit and the action taken is $a$, which is identical for all $m$ bandits. We estimate $\hat r_i(a) = \langle A_i(a), \hat\theta \rangle$, where $\hat\theta$ is obtained through ridge regression, together with the bonus term, as follows:
$$\hat\theta = V^{-1} \sum_{(a^k, r^k)\in D} \sum_{i\in[m]} A_i(a^k)\, r_i^k, \qquad V = I + \sum_{(a^k, r^k)\in D} \sum_{i\in[m]} A_i(a^k) A_i(a^k)^\top, \tag{5}$$
$$b_i(a) = \|A_i(a)\|_{V^{-1}} \sqrt{\beta}, \quad \text{where } \sqrt{\beta} = 2\sqrt{d} + \sqrt{d \log\Big(1 + \frac{mnF}{d}\Big) + \iota}. \tag{6}$$

Jin et al. (2021) studied offline linear Markov decision processes (MDPs) and proposed a sufficient coverage assumption for learning the optimal policy. A linear bandit is essentially a linear MDP with a single state and horizon 1. We adapt their assumption to the bandit setting and generalize it to congestion games in Assumption 4.

Assumption 4 (Weak Covariance Domination). There exists a constant $C_{\mathrm{agent}} > 0$ and an NE $\pi^*$ such that for all $i \in [m]$ and policies $\pi_i$,
$$V \succeq I + n\, C_{\mathrm{agent}}\, \mathbb{E}_{a\sim(\pi_i, \pi^*_{-i})} A_i(a) A_i(a)^\top. \tag{7}$$

To see why Assumption 4 implies learnability, notice that the right-hand side of (7) equals the expected covariance matrix $V$ if the data were collected by running policy $(\pi_i, \pi^*_{-i})$ for $C_{\mathrm{agent}} n$ episodes.
Using such a matrix, we can precisely estimate the rewards of actions sampled from $(\pi_i, \pi^*_{-i})$ via linear regression. Assumption 4 thus states that for every unilaterally deviated policy $(\pi_i, \pi^*_{-i})$, we can estimate the rewards it generates at least as well as if we had collected data from $(\pi_i, \pi^*_{-i})$ for $C_{\mathrm{agent}} n$ episodes, which implies that we can learn an approximate NE (see Theorem 1). Under Assumption 4, we obtain the following sample complexity bound.

Theorem 5. If Assumption 4 is satisfied, then with probability $1-\delta$, $\mathrm{Gap}(\pi_{\mathrm{output}}) \le 4\sqrt{\frac{mF\beta}{C_{\mathrm{agent}}\, n}}$, where $\sqrt\beta$ is defined in (6) and $\pi_{\mathrm{output}}$ is the output of Algorithm 1.

Remark 1. As an illustrative example, consider a congestion game with full action space, i.e., $|\mathcal{A}_i| = 2^F$ for every player $i$, and a pure-strategy NE. The dataset uniformly covers all actions where exactly one player deviates, and only on one facility. For example, if player 1 chooses $\{f_1, f_2\}$ in the NE, the dataset should cover player 1 selecting $\{f_1\}$, $\{f_2\}$, $\{f_1, f_2, f_3\}$, $\{f_1, f_2, f_4\}, \dots$ with the other players unchanged. There are $F$ such actions per player, so the dataset covers $mF$ actions in total. The change in reward when a single player deviates from $\pi^*$ is the sum of the reward changes on each deviated facility. With sufficient data, we can precisely estimate the reward change on each deviated facility and hence the reward of any unilaterally deviated action. With high probability, $C_{\mathrm{agent}}$ for this example is no smaller than $1/(2mF^4)$ (see Proposition 4 in the appendix). Hence, with appropriate dataset coverage, our algorithm achieves sample-efficient approximate NE learning under agent-level feedback.
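A minimal sketch of the agent-level feature map and ridge-regression estimator, assuming a zero-based shift of the paper's index $n + mf$ (so coordinate $(n-1) + mf$ encodes "facility $f$ at congestion $n$") and illustrative facility names:

```python
import numpy as np

def feature(a_i, actions, m, F, fac_index):
    """A_i(a): indicator features over (facility, congestion) pairs.
    fac_index maps each facility name to an index in {0, ..., F-1}."""
    phi = np.zeros(m * F)
    for f in a_i:
        n = sum(f in a for a in actions)        # congestion n_f(a)
        phi[(n - 1) + m * fac_index[f]] = 1.0   # zero-based index shift
    return phi

def ridge_estimate(dataset, m, F, fac_index):
    """theta_hat and covariance V from agent-level feedback via ridge
    regression with regularizer lambda = 1, a sketch of the Section 4.2
    estimator.  Each data point is (actions, per-player rewards)."""
    d = m * F
    V, b = np.eye(d), np.zeros(d)
    for actions, rewards in dataset:
        for i, a_i in enumerate(actions):
            phi = feature(a_i, actions, m, F, fac_index)
            V += np.outer(phi, phi)
            b += phi * rewards[i]
    return np.linalg.solve(V, b), V

# two players, one facility; r_f(2) = -1, observed 50 times
data = [([{"f"}, {"f"}], [-1.0, -1.0])] * 50
theta_hat, V = ridge_estimate(data, m=2, F=1, fac_index={"f": 0})
```

The bonus for an action would then be `sqrt(phi @ np.linalg.inv(V) @ phi)` times the confidence radius, matching the elliptical-norm form $\|A_i(a)\|_{V^{-1}}$.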

5. OFFLINE CONGESTION GAME WITH GAME-LEVEL FEEDBACK

With less information revealed by game-level feedback, a stronger assumption is required to learn an approximate NE, as formalized in Theorem 6. The proof is similar to that of Theorem 4 and is deferred to Appendix D.

Theorem 6. Define a class $X$ of congestion games $M$ and exploration strategies $\rho$ consisting of all pairs $(M, \rho)$ satisfying Assumption 4. Under game-level feedback, for any algorithm ALG there exists $(M, \rho) \in X$ such that the output of ALG is no better than a $1/4$-approximate NE, no matter how much data is collected.

In the game-level feedback setting, a congestion game can be viewed as a single linear bandit. Let $A : \mathcal{A} \to \{0, 1, \dots, m\}^d$ with $A(a) = \sum_{i\in[m]} A_i(a)$. The game-level reward can be written as $r(a) = \langle A(a), \theta \rangle$. Thus we can similarly use ridge regression and build bonus terms as follows:
$$\hat r_i(a) = \langle A_i(a), \hat\theta \rangle, \qquad \hat\theta = V^{-1} \sum_{(a^k, r^k)\in D} A(a^k)\, r^k, \qquad V = I + \sum_{(a^k, r^k)\in D} A(a^k) A(a^k)^\top, \tag{8}$$
$$b_i(a) = \max_{j\in[m]} \|A_j(a)\|_{V^{-1}} \sqrt{\beta}, \quad \text{where } \sqrt{\beta} = 2\sqrt{d} + \sqrt{d \log(1 + nm) + \iota}. \tag{9}$$

The coverage assumption is adapted from Assumption 4 as follows.

Assumption 5 (Strong Covariance Domination). There exists a constant $C_{\mathrm{game}} > 0$ and an NE $\pi^*$ such that for all $i \in [m]$ and policies $\pi_i$,
$$V \succeq I + n\, C_{\mathrm{game}}\, \mathbb{E}_{a\sim(\pi_i, \pi^*_{-i})} A_i(a) A_i(a)^\top. \tag{10}$$

Note that although the statement of Assumption 5 is identical to that of Assumption 4, the definition of $V$ has changed, so they are in fact different. The interpretation is similar to that of Assumption 4: for every unilaterally deviated policy $(\pi_i, \pi^*_{-i})$, we can estimate the reward at least as well as if we had collected data from $(\pi_i, \pi^*_{-i})$ for $C_{\mathrm{game}} n$ episodes with agent-level feedback. Under this assumption, we obtain the following sample complexity bound.

Theorem 7. If Assumption 5 is satisfied, then with probability $1-\delta$, $\mathrm{Gap}(\pi_{\mathrm{output}}) \le 4\sqrt{\frac{mF\beta}{C_{\mathrm{game}}\, n}}$, where $\sqrt\beta$ is defined in (9) and $\pi_{\mathrm{output}}$ is the output of Algorithm 1.

Remark 2.
As an illustrative example, consider a congestion game with full action space and a pure strategy NE. Let the number of players selecting each facility at the NE be $(n_1, n_2, \cdots, n_F)$. The dataset uniformly covers the actions whose facility counts are $(0, n_2, \cdots, n_F)$, $(n_1 - 1, n_2, \cdots, n_F)$, $(n_1 + 1, n_2, \cdots, n_F)$, and the analogous deviations for the other facilities, together with an NE action. From this dataset, we can precisely estimate the reward of each single facility under any one-unit deviation from the NE, and hence estimate the reward of every unilaterally deviated action. With high probability, $C_{\text{game}}$ for this example is no smaller than $1/(24F^3)$ (see Proposition 6 in the appendix). Hence, with appropriate dataset coverage, our algorithm achieves sample-efficient approximate NE learning under game-level feedback.
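To make the estimator concrete, the following is a minimal numpy sketch of the ridge regression and bonus computation in (8) and (9). The dataset format (a list of aggregated feature vectors paired with total rewards) and the function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fit_game_level(dataset, d):
    """Ridge regression on game-level feedback, as in (8).

    dataset: list of (A_a, r) pairs, where A_a = sum_i A_i(a) is the
    aggregated feature vector of the joint action and r is the total
    (game-level) reward observed for that episode.
    """
    V = np.eye(d)          # regularized covariance V = I + sum A A^T
    s = np.zeros(d)        # accumulated A(a^k) * r^k
    for A_a, r in dataset:
        V += np.outer(A_a, A_a)
        s += A_a * r
    theta_hat = np.linalg.solve(V, s)
    return theta_hat, V

def bonus(A_i, V, beta):
    """Pessimism bonus b_i(a) = beta * ||A_i(a)||_{V^{-1}}, as in (9)."""
    return beta * np.sqrt(A_i @ np.linalg.solve(V, A_i))
```

On a noiseless toy dataset the estimate concentrates around the true parameter, and the bonus shrinks as the covered directions accumulate data.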

6. CONCLUSION

In this paper, we studied NE learning for congestion games in the offline setting. We analyzed the problem under various types of feedback. Hard instances were constructed to show separations between different types of feedback. For each type of feedback, we identified dataset coverage assumptions to ensure NE learning. With tailored reward estimators and bonus terms, we showed the surrogate minimization algorithm is able to find an approximate NE efficiently.

A MOTIVATING EXAMPLES

In this section, we provide concrete scenarios for each type of feedback.

Example 1 (Facility-level feedback). Suppose Google Maps is trying to improve its route-assignment algorithm through historical data for a certain region. Each edge (road) in the region's traffic graph can be considered a facility, and the action a user takes is a path connecting a certain origin and destination. In this setting, the cost of each facility is the waiting time on that road, which may increase as the number of users choosing the road increases. In the historical data, each data point contains the path chosen by each user and his/her waiting time on each road, so this is an offline dataset with facility-level feedback.

Example 2 (Agent-level feedback). Suppose a company is trying to learn a policy for advertising its products from historical data. We can consider a certain set of websites as the facility set and the products as the players. The action chosen for each product is a subset of websites on which the company will place advertisements for that product. The reward for each product is measured by its sales. In the historical data, each data point contains the websites chosen for each product's advertisement and the total amount of sales within a certain range of time. This offline dataset inherently has only agent-level feedback, since the company cannot measure each website's contribution to sales.

Example 3 (Game-level feedback). Under the same setting as above, suppose another company (called B) is also trying to learn such a policy but lacks internal historical data. Therefore, B decides to use the data from the company in the previous example (called A). However, since company B does not have internal access to company A's database, the precise sales of each product are not visible to company B. As a result, company B can only record the total sales of all the relevant products from company A's public financial reports, so its offline dataset has only game-level feedback.

B OMITTED PROOF IN SECTION 3

Lemma 1. With probability $1 - \delta$, for any policy $\pi$, we have
$$\text{Gap}(\pi) \le \max_{i \in [m]} \overline{V}_i^{\dagger, \pi_{-i}} - \underline{V}_i^{\pi}.$$
In addition, we have
$$\text{Gap}(\pi^{\text{output}}) \le \min_{\pi} \max_{i \in [m]} \overline{V}_i^{\dagger, \pi_{-i}} - \underline{V}_i^{\pi}.$$

Proof. By (3) and (2), with probability $1 - \delta$, $\underline{V}_i^{\pi} \le V_i^{\pi} \le \overline{V}_i^{\pi}$. Hence
$$\text{Gap}(\pi) = \max_{\pi'} \max_{i \in [m]} V_i^{\pi'_i, \pi_{-i}} - V_i^{\pi} \le \max_{\pi'} \max_{i \in [m]} \overline{V}_i^{\pi'_i, \pi_{-i}} - \underline{V}_i^{\pi}.$$
Since both $\overline{V}_i^{\pi'_i, \pi_{-i}}$ and $\underline{V}_i^{\pi}$ are linear in each entry of $\pi'$, the first maximizer on the RHS can be taken to be a deterministic policy. This proves the first statement. The second statement follows from the fact that the algorithm minimizes the RHS of the first statement.

Theorem 1. Let $\Pi$ be the set of all deterministic policies and $b$ a bonus term for $\hat r$. With probability $1 - \delta$, it holds that
$$\text{Gap}(\pi^{\text{output}}) \le 2 \max_{i \in [m]} \left[ \max_{\pi' \in \Pi} \mathbb{E}_{a \sim (\pi'_i, \pi^*_{-i})} b_i(a) + \mathbb{E}_{a \sim \pi^*} b_i(a) \right],$$
where $\pi^{\text{output}}$ is the output of Algorithm 1.

Proof. We have
$$\overline{V}_i^{\pi} - V_i^{\pi} = \mathbb{E}_{a \sim \pi} \left[ \hat r_i(a) - r_i(a) + b_i(a) \right] \le 2 \mathbb{E}_{a \sim \pi} b_i(a),$$
$$V_i^{\pi} - \underline{V}_i^{\pi} = \mathbb{E}_{a \sim \pi} \left[ r_i(a) - \hat r_i(a) + b_i(a) \right] \le 2 \mathbb{E}_{a \sim \pi} b_i(a).$$
Let $\tilde\pi = \arg\max_{\pi'} \max_{i \in [m]} \overline{V}_i^{\pi'_i, \pi_{-i}} - \underline{V}_i^{\pi}$; by a similar argument to Lemma 1, we can always choose $\tilde\pi \in \Pi$. Then
$$\begin{aligned}
\text{Gap}(\pi^{\text{output}}) &\le \min_{\pi} \max_{i \in [m]} \overline{V}_i^{\dagger, \pi_{-i}} - \underline{V}_i^{\pi} = \min_{\pi} \max_{i \in [m]} \overline{V}_i^{\tilde\pi_i, \pi_{-i}} - \underline{V}_i^{\pi} \\
&\le \min_{\pi} \max_{i \in [m]} \left[ V_i^{\tilde\pi_i, \pi_{-i}} - V_i^{\pi} + 2\mathbb{E}_{a \sim (\tilde\pi_i, \pi_{-i})} b_i(a) + 2\mathbb{E}_{a \sim \pi} b_i(a) \right] \\
&\le \min_{\pi} \left\{ \max_{i \in [m]} \left[ V_i^{\tilde\pi_i, \pi_{-i}} - V_i^{\pi} \right] + \max_{i \in [m]} \left[ 2\mathbb{E}_{a \sim (\tilde\pi_i, \pi_{-i})} b_i(a) + 2\mathbb{E}_{a \sim \pi} b_i(a) \right] \right\} \\
&= \min_{\pi} \left\{ \text{Gap}(\pi) + \max_{i \in [m]} \left[ 2\max_{\pi' \in \Pi} \mathbb{E}_{a \sim (\pi'_i, \pi_{-i})} b_i(a) + 2\mathbb{E}_{a \sim \pi} b_i(a) \right] \right\} \\
&\le \text{Gap}(\pi^*) + \max_{i \in [m]} \left[ 2\max_{\pi' \in \Pi} \mathbb{E}_{a \sim (\pi'_i, \pi^*_{-i})} b_i(a) + 2\mathbb{E}_{a \sim \pi^*} b_i(a) \right] \\
&= 2\max_{i \in [m]} \left[ \max_{\pi' \in \Pi} \mathbb{E}_{a \sim (\pi'_i, \pi^*_{-i})} b_i(a) + \mathbb{E}_{a \sim \pi^*} b_i(a) \right].
\end{aligned}$$

Lemma 2. With probability $1 - \delta$, we have
$$|\hat r_i(a) - r_i(a)| \le b_i(a), \qquad \frac{1}{N_f(n)} \le \frac{4\iota}{n d_f^{\rho}(n)}$$
for all $a \in \mathcal{A}$, $i \in [m]$, $f \in F$, $n \in [m]$.

Proof. We have
$$\hat r_i(a) - r_i(a) = \sum_{f \in a_i} \left[ \hat r_f(n_f(a)) - r_f(n_f(a)) \right].$$
By Hoeffding's inequality and the union bound, we have
$$\left| \hat r_f(n_f(a)) - r_f(n_f(a)) \right| \le \sqrt{\frac{2}{N_f(n_f(a))} \log \frac{4(m+1)F}{\delta}}$$
for all $f \in F$, $a \in \mathcal{A}$ with probability $1 - \delta/2$.
Combining the above inequalities, the first statement holds with probability $1 - \delta/2$. By Lemma A.1 of Xie et al. (2021b), replacing $p$ with $d_f^{\rho}(n_f(a))$, and the union bound, we get
$$\frac{1}{N_f(n)} \le \frac{8 \log(2(m+1)F/\delta)}{n d_f^{\rho}(n)} \le \frac{4\iota}{n d_f^{\rho}(n)}$$
for all $f \in F$, $n \in [m]$ with probability $1 - \delta/2$. Finally, the proof is complete by the union bound.

Theorem 3. With probability $1 - \delta$, if Assumption 2 is satisfied, it holds that $\text{Gap}(\pi^{\text{output}}) \le 8\sqrt{m+1}\, C_{\text{facility}} \iota F / \sqrt{n}$.

Proof. We have
$$\begin{aligned}
\mathbb{E}_{a \sim (\pi'_i, \pi^*_{-i})} b_i(a) &= \sum_{f \in F} \mathbb{E}_{a \sim (\pi'_i, \pi^*_{-i})} \sqrt{\frac{\iota}{N_f(n_f(a)) \vee 1}} \\
&\le C_{\text{facility}} \sum_{f \in F} \sum_{n'=0}^{m} d_f^{\rho}(n') \sqrt{\frac{4\iota^2}{n d_f^{\rho}(n')}} = 2 C_{\text{facility}} \iota \sum_{f \in F} \sum_{n'=0}^{m} \sqrt{\frac{d_f^{\rho}(n')}{n}} \\
&\le 2 C_{\text{facility}} \iota \sum_{f \in F} \sqrt{\frac{m+1}{n} \sum_{n'=0}^{m} d_f^{\rho}(n')} \le 2\sqrt{m+1}\, C_{\text{facility}} \iota F / \sqrt{n}.
\end{aligned}$$
The first inequality is by Definition 3 and Lemma 2, the second is by the Cauchy-Schwarz inequality, and the last is by the fact that $\sum_{n'=0}^{m} d_f^{\rho}(n') \le 1$. Combining this with Theorem 1 and Lemma 2, we get the conclusion.

Theorem 2. Define a class $X$ of congestion games $M$ and exploration strategies $\rho$ consisting of all $(M, \rho)$ pairs such that Assumption 2 is satisfied except for at most one configuration of one facility. For any algorithm ALG, there exists $(M, \rho) \in X$ such that the output of ALG is at best a $1/2$-approximate NE no matter how much data is collected.

Proof. Consider a congestion game with a single facility $f$ and five players. The action space of each player is $\{\emptyset, \{f\}\}$. We construct the following two congestion games with deterministic rewards. Since there is only one facility, the reward each player receives and whether a joint action is an NE depend only on the configuration, i.e., the number of players selecting $f$; hence, in the remainder of the proof we describe actions by their configuration. For the first game, there are two NEs: "only one player selects $f$" and "all players select $f$". For the second game, the NE is "four players select $f$". The exploration policy is set to be
$$\rho(a) = \begin{cases} 1/20 & \text{one, three, or four players select } f, \\ 0 & \text{otherwise.} \end{cases}$$
The rewards of the two games are:

Congestion Game 1: $R_f(1) = 1$, $R_f(2) = -1$, $R_f(3) = 1$, $R_f(4) = 1$, $R_f(5) = 1$.
Congestion Game 2: $R_f(1) = 1$, $R_f(2) = 1$, $R_f(3) = 1$, $R_f(4) = 1$, $R_f(5) = -1$.

For the first game, the dataset covers the first NE and its unilateral deviations except for the configuration in which two players select $f$. For the second game, it covers the NE and its unilateral deviations except for the configuration in which five players select $f$. Hence both games paired with $\rho$ are in $X$ and are indistinguishable to ALG. Let $p$ be the probability that the output policy has four players selecting $f$. Then the output is at best a $p$-approximate NE for game 1 and at best a $(1-p)$-approximate NE for game 2. In conclusion, there exists $(M, \rho) \in X$ such that the output of ALG is at best a $1/2$-approximate NE no matter how much data is collected.
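The two hard instances in this proof can be checked mechanically. The sketch below is our own illustration (the symmetric-configuration NE test is an assumption of this sketch, justified because the game depends only on the number of players selecting $f$): the covered configurations carry identical rewards, yet the NE sets differ.

```python
# Facility rewards R_f(k) for k players on the single facility f.
R1 = {1: 1, 2: -1, 3: 1, 4: 1, 5: 1}   # congestion game 1
R2 = {1: 1, 2: 1, 3: 1, 4: 1, 5: -1}   # congestion game 2

def is_ne_config(R, k, m=5):
    """The configuration "k players select f" is an NE iff no selecting
    player prefers to leave (getting reward 0) and no outside player
    prefers to join (getting reward R[k+1])."""
    stay = (k == 0) or R[k] >= 0
    no_join = (k == m) or R[k + 1] <= 0
    return stay and no_join

# rho only covers configurations with one, three, or four players, where
# the two games agree, so they are indistinguishable from the data...
assert all(R1[k] == R2[k] for k in (1, 3, 4))
# ...yet their NE configurations differ, exactly as claimed in the proof:
assert [k for k in range(6) if is_ne_config(R1, k)] == [1, 5]
assert [k for k in range(6) if is_ne_config(R2, k)] == [4]
```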

B.1 PURE NASH EQUILIBRIUM

It is well known that a pure Nash equilibrium exists in any congestion game (Rosenthal, 1973). We now restrict our attention to pure Nash equilibria and show that the $\sqrt{m}$ factor in Theorem 3 can be removed. We use $\Pi^{\text{pure}}$ to denote the set of all pure strategies, and modify Assumption 3 and Algorithm 1 for pure strategies.

Assumption 6. There exists a constant $C^{\text{pure}}_{\text{facility}} > 0$ and a pure NE $\pi^*$ such that
$$C^{\text{pure}}_{\text{facility}} \ge \max_{i, \pi \in \Pi^{\text{pure}}, f, n} \frac{d_f^{\pi_i, \pi^*_{-i}}(n)}{d_f^{\rho}(n)},$$
where we use the convention that $0/0 = 0$.

Algorithm 2 Surrogate Minimization for Congestion Games (Pure Strategy)
Require: Offline dataset $D$
1: Compute $\hat r(a)$, $b(a)$ for all $a \in \mathcal{A}$ according to the dataset $D$.
2: Compute the optimistic value $\overline{V}_i^{\pi}$ and pessimistic value $\underline{V}_i^{\pi}$ for all policies $\pi$ and players $i$ by (3).
3: Compute $\overline{V}_i^{\dagger, \pi_{-i}} = \max_{\pi'_i \in \Delta(\mathcal{A}_i)} \overline{V}_i^{\pi'_i, \pi_{-i}}$.
4: return $\arg\min_{\pi \in \Pi^{\text{pure}}} \max_{i \in [m]} \overline{V}_i^{\dagger, \pi_{-i}} - \underline{V}_i^{\pi}$.

Theorem 8. With probability $1 - \delta$, if Assumption 6 is satisfied, it holds that $\text{Gap}(\pi^{\text{output}}) \le 8\sqrt{3}\, C^{\text{pure}}_{\text{facility}} \iota F / \sqrt{n}$.

Proof. We have
$$\begin{aligned}
\mathbb{E}_{a \sim (\pi'_i, \pi^*_{-i})} b_i(a) &= \sum_{f \in F} \mathbb{E}_{a \sim (\pi'_i, \pi^*_{-i})} \sqrt{\frac{\iota}{N_f(n_f(a)) \vee 1}} \\
&\le C^{\text{pure}}_{\text{facility}} \sum_{f \in F} \sum_{n' = n_f(\pi^*) - 1}^{n_f(\pi^*) + 1} d_f^{\rho}(n') \sqrt{\frac{4\iota^2}{n d_f^{\rho}(n')}} = 2 C^{\text{pure}}_{\text{facility}} \iota \sum_{f \in F} \sum_{n' = n_f(\pi^*) - 1}^{n_f(\pi^*) + 1} \sqrt{\frac{d_f^{\rho}(n')}{n}} \\
&\le 2 C^{\text{pure}}_{\text{facility}} \iota \sum_{f \in F} \sqrt{\frac{3}{n} \sum_{n' = n_f(\pi^*) - 1}^{n_f(\pi^*) + 1} d_f^{\rho}(n')} \le 2\sqrt{3}\, C^{\text{pure}}_{\text{facility}} \iota F / \sqrt{n}.
\end{aligned}$$
The first inequality is by Definition 3 and Lemma 2, and the second is by the definition of $C^{\text{pure}}_{\text{facility}}$. Combining this with Theorem 1 and Lemma 2, we get the conclusion.
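Algorithm 2's return step (line 4, restricted to pure strategies) admits a brute-force sketch, shown below on a hypothetical interface: `r_hat(i, a)` and `b(i, a)` stand in for the estimates and bonuses computed from the dataset, and the exhaustive enumeration is only practical for tiny games.

```python
from itertools import product

def surrogate_minimization_pure(actions, r_hat, b, m):
    """Sketch of Algorithm 2: brute-force surrogate minimization over pure
    joint strategies. For each pure pi, evaluates
    max_i [ max_{a_i'} (r_hat + b)(i, (a_i', pi_{-i})) - (r_hat - b)(i, pi) ]
    and returns the minimizing pi together with its surrogate value."""
    best, best_val = None, float("inf")
    for pi in product(actions, repeat=m):
        val = -float("inf")
        for i in range(m):
            v_pess = r_hat(i, pi) - b(i, pi)  # pessimistic value of pi
            v_opt = max(  # optimistic best response of player i
                r_hat(i, pi[:i] + (ai,) + pi[i + 1:])
                + b(i, pi[:i] + (ai,) + pi[i + 1:])
                for ai in actions
            )
            val = max(val, v_opt - v_pess)
        if val < best_val:
            best, best_val = pi, val
    return best, best_val
```

With exact reward estimates and zero bonuses, the surrogate value of a pure NE is zero, so the sketch recovers an equilibrium of a one-facility toy game.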

B.2 OMITTED CALCULATIONS IN SECTION 3

Proposition 1. Suppose $\pi$ is a deterministic strategy. For a fixed domain of $\rho$, the value of $C(\pi)$ is smallest when $\rho$ is uniform over all actions achievable by unilaterally deviating from $\pi$.

Proof. Suppose $\rho$ covers an action $a$ that is not achievable by unilaterally deviating from $\pi$. Construct a new $\rho'$ with $\rho'(a) = 0$ and the other entries scaled up by a factor of $1/(1 - \rho(a))$; then $\rho'$ achieves a smaller $C(\pi)$ than $\rho$. Hence $\rho$ should only cover actions achievable by unilaterally deviating from $\pi$. Now suppose the distribution is not uniform. Since the best response to a pure strategy can always be taken to be a pure strategy, the numerator in (1) can always achieve $1$ no matter what $a$ is. Let $a^* = \arg\min_a \rho(a)$; then there exists $\tilde a$ such that $\rho(\tilde a) > \rho(a^*)$. Construct $\rho'$ such that $\rho'(a^*) = \rho'(\tilde a) = (\rho(a^*) + \rho(\tilde a))/2$; then $C(\pi)$ does not increase. By contradiction, we get the conclusion.

Proposition 2. The minimum value of $C_{\text{facility}}$ is no larger than $3$.

Proof. Consider the case where $\rho$ induces uniform coverage over all facility configurations achievable from $\pi^*$. Since at most three configurations are covered for each facility, the minimum value of $d_f^{\rho}(n)$ is $1/3$. Thus the minimum value of $C(\pi^*)$ is no larger than $3$.

C OMITTED PROOF IN SECTION 4

Lemma 3. With probability $1 - \delta$, we have $|\hat r_i(a) - r_i(a)| \le b_i(a)$ for all $i \in [m]$, $a \in \mathcal{A}$.

Proof. As a degenerate version of Theorem 20.5 of Lattimore & Szepesvári (2020), with probability $1 - \delta$ it holds that
$$\|\hat\theta - \theta\|_V \le \|\theta\|_2 + \sqrt{\log\det(V) + 2\log(1/\delta)}.$$
Hence with probability $1 - \delta$,
$$|\hat r_i(a) - r_i(a)| = \left| \langle A_i(a), \hat\theta - \theta \rangle \right| \le \|A_i(a)\|_{V^{-1}} \|\hat\theta - \theta\|_V \le \|A_i(a)\|_{V^{-1}} \left( \|\theta\|_2 + \sqrt{\log\det(V) + 2\log(1/\delta)} \right)$$
for all $i \in [m]$ and $a \in \mathcal{A}$. By Lemma 4 of Cui et al. (2022), we have
$$\det(V) \le \left( 1 + \frac{mnF}{d} \right)^d,$$
since by (5), $\|A_i(a)\|_2^2 \le F$. Besides, $\|\theta\|_2 \le 2\sqrt{d}$. Then
$$\begin{aligned}
\mathbb{E}_{a \sim (\pi_i, \pi^*_{-i})} b_i(a) &= \beta\, \mathbb{E}_{a \sim (\pi_i, \pi^*_{-i})} \sqrt{A_i(a)^\top V^{-1} A_i(a)} \\
&\le \beta\, \mathbb{E}_{a \sim (\pi_i, \pi^*_{-i})} \sqrt{A_i(a)^\top \left( I + C_{\text{agent}} n \mathbb{E}_{a' \sim (\pi_i, \pi^*_{-i})} \left[ A_i(a') A_i(a')^\top \right] \right)^{-1} A_i(a)} \\
&= \beta\, \mathbb{E}_{a \sim (\pi_i, \pi^*_{-i})} \sqrt{\operatorname{tr}\left( \left( I + C_{\text{agent}} n \mathbb{E}_{a'} \left[ A_i(a') A_i(a')^\top \right] \right)^{-1} A_i(a) A_i(a)^\top \right)} \\
&\le \beta \sqrt{\operatorname{tr}\left( \left( I + C_{\text{agent}} n \mathbb{E}_{a'} \left[ A_i(a') A_i(a')^\top \right] \right)^{-1} \mathbb{E}_{a \sim (\pi_i, \pi^*_{-i})} \left[ A_i(a) A_i(a)^\top \right] \right)} \\
&= \beta \sqrt{\frac{1}{C_{\text{agent}} n} \operatorname{tr}\left( I - \left( I + C_{\text{agent}} n \mathbb{E}_{a'} \left[ A_i(a') A_i(a')^\top \right] \right)^{-1} \right)} \\
&\le \beta \sqrt{\frac{d}{C_{\text{agent}} n}} = \beta \sqrt{\frac{mF}{C_{\text{agent}} n}},
\end{aligned}$$
where the first inequality is by Assumption 4 and the second by Jensen's inequality. Combining this with Theorem 1, we get the conclusion.

Theorem 4. Define a class $X$ of congestion games $M$ and exploration strategies $\rho$ consisting of all $(M, \rho)$ pairs such that Assumption 2 is satisfied. For agent-level feedback, for any algorithm ALG there exists $(M, \rho) \in X$ such that the output of ALG is at best a $1/8$-approximate NE no matter how much data is collected.

Proof. Consider a congestion game with two facilities $f_1, f_2$ and two players. The action spaces of both players are unrestricted, i.e., $\mathcal{A}_1 = \mathcal{A}_2 = \{\{f_1\}, \{f_2\}, \{f_1, f_2\}\}$. We construct the following two congestion games with deterministic rewards. The NEs of game 3 are $a_1 = \{f_1, f_2\}, a_2 = \{f_1\}$ and $a_2 = \{f_1, f_2\}, a_1 = \{f_1\}$. The NEs of game 4 are $a_1 = \{f_1\}, a_2 = \{f_2\}$ and $a_2 = \{f_1\}, a_1 = \{f_2\}$.
The rewards of the two games are:

Congestion Game 3: $R_{f_1}(1) = 1$, $R_{f_1}(2) = 1/2$, $R_{f_2}(1) = 1$, $R_{f_2}(2) = -1$.
Congestion Game 4: $R_{f_1}(1) = 1$, $R_{f_1}(2) = -1/4$, $R_{f_2}(1) = 1$, $R_{f_2}(2) = -1/4$.

The facility coverage conditions for these NEs are illustrated in Figure 3. The exploration policy $\rho$ is set to be
$$\rho(a_1, a_2) = \begin{cases} 1/3 & a_1 = a_2 = \{f_1, f_2\}, \\ 1/3 & a_1 = \{f_1\}, a_2 = \{f_2\} \text{ or } a_1 = \{f_2\}, a_2 = \{f_1\}, \\ 0 & \text{otherwise.} \end{cases} \qquad (11)$$
It can be easily verified that both $f_1$ and $f_2$ are covered at configurations $1$ and $2$. However, all the information we can extract from the dataset is $R_{f_1}(1) = 1$, $R_{f_2}(1) = 1$, and $R_{f_1}(2) + R_{f_2}(2) = -1/2$, so it is impossible for the algorithm to distinguish the two games. Suppose the output strategy of ALG has both players selecting $f_1$ with probability $p$. Then it is at best a $(1-p)/4$-approximate NE for the first game and at best a $p/4$-approximate NE for the second game. In conclusion, there exists $(M, \rho) \in X$ such that the output of ALG is at best a $1/8$-approximate NE no matter how much data is collected.
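The indistinguishability claim can be verified by direct enumeration. The sketch below is our own illustration (the reward-lookup helper is an assumption of the sketch): on every joint action covered by (11), both games produce identical agent-level observations.

```python
# Agent-level rewards of the two hard instances in the proof of Theorem 4.
# R[f][k] is the per-player reward of facility f when k players select it.
G3 = {"f1": {1: 1.0, 2: 0.5}, "f2": {1: 1.0, 2: -1.0}}
G4 = {"f1": {1: 1.0, 2: -0.25}, "f2": {1: 1.0, 2: -0.25}}

def agent_reward(R, mine, other):
    """Agent-level reward of a player choosing `mine` while the opponent
    chooses `other` (a facility's load is 1, plus 1 if it is shared)."""
    return sum(R[f][1 + (f in other)] for f in mine)

# The joint actions covered by the exploration policy in (11):
covered = [({"f1", "f2"}, {"f1", "f2"}), ({"f1"}, {"f2"}), ({"f2"}, {"f1"})]
# On every covered joint action, both games yield identical agent-level
# observations for both players, so no algorithm can tell them apart.
for a1, a2 in covered:
    assert agent_reward(G3, a1, a2) == agent_reward(G4, a1, a2)
    assert agent_reward(G3, a2, a1) == agent_reward(G4, a2, a1)
```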

C.1 OMITTED CALCULATIONS IN SECTION 4

Lemma 4. If $n \ge 8\log((mF+1)/\delta)(mF+1)$, then with probability $1 - \delta$, $N(a) \ge \rho(a) n / 2 = n / (2(mF+1))$ for all $a$ with $\rho(a) > 0$.

Proof. $N(a)$ follows a binomial distribution with parameters $n$ and $\rho(a)$. By the Chernoff bound, for all $\varepsilon \in (0, 1)$,
$$\Pr\{N(a) \le (1 - \varepsilon) n \rho(a)\} \le \exp\left( -\frac{\varepsilon^2 n \rho(a)}{2} \right).$$
Hence, taking $\varepsilon = 1/2$, if $n\rho(a) \ge 8\log((mF+1)/\delta)$, we have, for every $a$ covered in the example,
$$\Pr\{N(a) \ge n\rho(a)/2\} \ge 1 - \exp(-n\rho(a)/8) \ge 1 - \frac{\delta}{mF+1}.$$
By construction, $\rho(a) = 1/(mF+1)$ for every covered action $a$, so this condition holds, and the union bound over the at most $mF+1$ covered actions gives the conclusion.

Proposition 3. If $n \ge F$, the maximum value of $C_{\text{agent}}$ is no smaller than $1/(2F)$.

Proof. Let the NE be a pure strategy taking joint action $a^*$. For a specific $i$, let $\mathcal{A}^*_i = \{A_i(a_i, a^*_{-i}) : a_i \in \mathcal{A}_i\}$ and suppose $\operatorname{span}(\mathcal{A}^*_i)$ has dimension $d$. By the Kiefer-Wolfowitz theorem, there is a joint policy $\tilde\pi_i$ such that
$$\max_{A \in \mathcal{A}^*_i} A^\top \left[ \mathbb{E}_{a \sim \tilde\pi_i} A_i(a) A_i(a)^\top \right]^{-1} A = d.$$
From now on, we scale each entry of $\tilde\pi_i$ by $1/m$; it is no longer a valid probability measure, but the definition of the expectation is retained. Hence, for each $A \in \mathcal{A}^*_i$,
$$A^\top \left( I/m + n \mathbb{E}_{a \sim \tilde\pi_i} \left[ A_i(a) A_i(a)^\top \right] \right)^{-1} A \le \frac{d}{mn}.$$

By the Cauchy-Schwarz inequality, for each
$$A \in \mathcal{A}^*_i, \qquad A^\top \left( I/m + n \mathbb{E}_{a \sim \tilde\pi_i} \left[ A_i(a) A_i(a)^\top \right] \right) A \ge \frac{mn |A|^4}{d}.$$
Let $C = m/d - 1/n$. Then for each $A \in \mathcal{A}^*_i$,
$$A^\top \left( I/m + n \mathbb{E}_{a \sim \tilde\pi_i} \left[ A_i(a) A_i(a)^\top \right] \right) A \ge (nC + 1)|A|^4 \ge A^\top \left( I + nC A A^\top \right) A.$$
Let $\tilde\pi$ be the sum of $\tilde\pi_i$ over all players $i$. Then for each $A \in \mathcal{A}^*_i$ and each $i$,
$$A^\top \left( I + n \mathbb{E}_{a \sim \tilde\pi} \left[ A_i(a) A_i(a)^\top \right] \right) A \ge A^\top \left( I + nC A A^\top \right) A,$$
and hence, for each $A \in \mathcal{A}^*_i$ and each $i$,
$$I + n \mathbb{E}_{a \sim \tilde\pi} \left[ A_i(a) A_i(a)^\top \right] \succeq I + nC A A^\top.$$
Furthermore, for each $i$,
$$I + n \mathbb{E}_{a \sim \tilde\pi} \left[ A_i(a) A_i(a)^\top \right] \succeq I + nC\, \mathbb{E}_{a \sim (\pi_i, \pi^*_{-i})} \left[ A_i(a) A_i(a)^\top \right].$$
Now we are ready to lower bound $C_{\text{agent}}$: by $d \le mF$,
$$C_{\text{agent}} \ge C \ge \frac{m}{2d} \ge \frac{1}{2F}.$$

Proposition 4. If $n \ge 8\log((mF+1)/\delta)(mF+1)$ and $C_{\text{agent}} = 1/(2mF^4)$, then with probability $1 - \delta$, Assumption 4 holds for the example described in Remark 1.

Proof. It suffices to show that inequality (7) holds for all pure strategies $\pi$ and all $i$, because the right-hand side is linear in each entry of $\pi_i$. From now on, let us focus on a specific $i$ and a pure strategy $(\pi_i, \pi^*_{-i})$ choosing $\tilde a$ deterministically. Without loss of generality, suppose that among all elements of $\tilde a$, the facilities that deviate from the NE are $f_1, f_2, \cdots, f_s$. For convenience, let $A_0 = A_i(a^*)$, where $a^*$ is the NE action. For $f_j$, to estimate its contribution to the reward change we need one action besides $a^*$, which we denote by $a_{f_j}$; that is, $a_{f_j}$ deviates from $a^*$ only on $f_j$, and we let $A_j = A_i(a_{f_j})$. Without loss of generality, suppose the contribution corresponds to $\langle A_j - A_0, \theta \rangle$. Then we can write
$$A_i(\tilde a) = A_0 + \sum_{j \in [s]} (A_j - A_0) = \sum_{j \in [s]} A_j + (1 - s) A_0.$$
By Lemma 4, it suffices to show
$$I + \frac{n}{2(mF+1)} \sum_{j \in [s]} A_j A_j^\top + \frac{n}{2(mF+1)} A_0 A_0^\top \succeq I + C_{\text{agent}} n\, A_i(\tilde a) A_i(\tilde a)^\top.$$
In other words, for any $x \in \mathbb{R}^{mF}$,
$$\frac{n}{2(mF+1)} \sum_{j \in [s]} x^\top A_j A_j^\top x + \frac{n}{2(mF+1)} x^\top A_0 A_0^\top x \ge C_{\text{agent}} n\, x^\top A_i(\tilde a) A_i(\tilde a)^\top x.$$
For convenience, let $x_j = x^\top A_j$; this inequality can be rewritten as
$$\frac{n}{2(mF+1)} \sum_{j \in [s]} x_j^2 + \frac{n}{2(mF+1)} x_0^2 \ge C_{\text{agent}} n \left( \sum_{j \in [s]} x_j + (1 - s) x_0 \right)^2.$$
By Jensen's inequality, it suffices to show
$$\frac{n}{2(mF+1)} \sum_{j \in [s]} x_j^2 + \frac{n}{2(mF+1)} x_0^2 \ge C_{\text{agent}} n (s+1) \left( \sum_{j \in [s]} x_j^2 + (1 - s)^2 x_0^2 \right).$$
Hence, using $s \le F$, it suffices to show
$$\frac{n}{2(mF+1)} \ge C_{\text{agent}} n (F+1)(F-1)^2,$$
which holds since $C_{\text{agent}} = 1/(2mF^4)$.
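The final scalar condition above can be sanity-checked numerically over a grid of small $m$ and $F$; this check is our own addition for illustration.

```python
# Proposition 4 reduces to: 1/(2(mF+1)) >= C_agent * (F+1)(F-1)^2
# with C_agent = 1/(2 m F^4). Verify the inequality on a grid.
for m in range(1, 30):
    for F in range(1, 30):
        C_agent = 1.0 / (2 * m * F**4)
        assert 1.0 / (2 * (m * F + 1)) >= C_agent * (F + 1) * (F - 1) ** 2
```

Algebraically, the condition is equivalent to $mF^4 \ge (mF+1)(F+1)(F-1)^2$, which holds for all $m, F \ge 1$.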

D.1 OMITTED CALCULATIONS IN SECTION 5

Lemma 6. If $n \ge 8\log((3F+1)/\delta)(3F+1)$, then with probability $1 - \delta$, $N(a) \ge \rho(a) n / 2 = n / (2(3F+1))$ for all $a$ with $\rho(a) > 0$.

The proof is identical to that of Lemma 4, except that we have at most $3F + 1$ actions to cover instead of at most $mF + 1$.

Proposition 5. If $n \ge F$, the maximum value of $C_{\text{game}}$ is no smaller than $1/(2F)$.

Proof. The proof is very similar to that of Proposition 3, except that here we have $A(a)$ instead of $A_i(a)$.

Proposition 6. If $n \ge 8\log((3F+1)/\delta)(3F+1)$ and $C_{\text{game}} = 1/(24F^3)$, then with probability $1 - \delta$, Assumption 5 holds for the example described in Remark 2.

Proof. The procedure is similar to that in the proof of Proposition 4. Let us focus on a specific pure strategy $\pi$ choosing $\tilde a$ deterministically and a player $i$. To calculate the reward of one facility, we need two actions. Suppose $\tilde a_i$ covers $f_1, f_2, \cdots, f_s$ with configurations $n_1, n_2, \cdots, n_s$. Let the action vectors corresponding to facility $f_j$ be $A_{j,1}$ and $A_{j,2}$. Without loss of generality, suppose the reward of the individual facility is $\langle A_{j,1} - A_{j,2}, \theta \rangle / n_j$. Then we can write
$$A_i(\tilde a) = \sum_{j \in [s]} \frac{A_{j,1} - A_{j,2}}{n_j}.$$
It suffices to show
$$\frac{n}{2F(3F+1)} \sum_{j \in [s]} \left( A_{j,1} A_{j,1}^\top + A_{j,2} A_{j,2}^\top \right) \succeq C_{\text{game}} n\, A_i(\tilde a) A_i(\tilde a)^\top.$$
Note that $\{A_{j,1}, A_{j,2}\}$ may contain repeated elements, each repeating at most $F$ times, so we further discount the number of samples on the left-hand side by $F$. Following a similar procedure to the proof of Proposition 4, and using $n_j \ge 1$ and $s \le F$, it suffices to show
$$\frac{n}{2F(3F+1)} \ge 2F\, C_{\text{game}} n,$$
which holds since $C_{\text{game}} = 1/(24F^3)$.
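As with Proposition 4, the final scalar condition in Proposition 6 can be sanity-checked numerically; this check is our own addition for illustration.

```python
# Proposition 6 reduces to: 1/(2F(3F+1)) >= 2F * C_game
# with C_game = 1/(24 F^3). Verify the inequality on a grid.
for F in range(1, 200):
    C_game = 1.0 / (24 * F**3)
    assert 1.0 / (2 * F * (3 * F + 1)) >= 2 * F * C_game
```

Algebraically, the condition is equivalent to $24F^3 \ge 4F^2(3F+1)$, i.e., $12F \ge 4$, which holds for all $F \ge 1$.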

E EXPERIMENT

We implement Algorithm 1 for the facility-level feedback setting and test its performance on a didactic example to verify our theory. In this section, we aim to answer three questions: (i) Does our algorithm recover the NE from datasets that satisfy Assumption 3? (ii) How fast does our algorithm converge? (iii) How does the convergence rate vary with the number of players $m$?

E.1 SETTING

The Braess paradox (Braess et al., 2005) is a famous example of a congestion game. The game presented here is a modified version that retains the core idea of the paradox. There are $m$ cars that want to travel simultaneously from $S$ to $D$. There are five roads (facilities) on the road map, indexed from $0$ to $4$ as illustrated in Figure 5. It takes some time to travel from the starting point $S$ to the destination point $D$, and everyone wants to arrive as quickly as possible. The reward (negative latency) of each facility is as follows:
$$r_0 \sim -\frac{n_0(a)}{m+1} + \eta_0, \quad r_1 \sim -1 + \eta_1, \quad r_2 \sim \eta_2, \quad r_3 \sim -1 + \eta_3, \quad r_4 \sim -\frac{n_4(a)}{m+1} + \eta_4,$$
where $\eta_0, \eta_1, \eta_2, \eta_3, \eta_4$ are i.i.d. Gaussian random variables with mean $0$ and variance $1$. Formally, the facility set and the action set of this game are $F = \{0, 1, 2, 3, 4\}$ and $\mathcal{A} = \{\{0, 1\}, \{0, 2, 4\}, \{3, 4\}\}^m$. The NE of this game is $\{0, 2, 4\}^m$. An interesting fact about the game is that when $m > 2$, if we remove facility 2, the road that provides zero latency on average, the NE of the resulting game gives everybody less latency. This means that constructing more roads may aggravate traffic jams if we assume all drivers are selfish. In this paper, we employ the original game to test our algorithm. We use two different exploration policies to collect datasets. The first policy is random exploration, where each player uniformly randomly selects his/her action from $\mathcal{A}_i$.
Formally, for any player $i$,
$$\pi^{\text{random}}_i(a_i) = \begin{cases} 1/3 & a_i = \{0, 1\}, \\ 1/3 & a_i = \{0, 2, 4\}, \\ 1/3 & a_i = \{3, 4\}. \end{cases}$$
This policy covers all possible facility configurations, and $C_{\text{facility}} = 3^m$. The second policy is as follows:
$$\pi^{\text{facility}}_0(a_0) = \begin{cases} 1/3 & a_0 = \{0, 1\}, \\ 1/3 & a_0 = \{0, 2, 4\}, \\ 1/3 & a_0 = \{3, 4\}, \end{cases} \qquad \pi^{\text{facility}}_i(a_i) = \mathbb{1}\{a_i = \{0, 2, 4\}\} \ \text{for } i \ne 0.$$
For $m > 1$, this policy does not satisfy Assumption 1, as it only covers the NE actions unilaterally deviated by player $0$. However, it covers all facility configurations achievable from unilaterally deviated actions, hence satisfying Assumption 2. Moreover, since the covered NE is a pure strategy, it satisfies Assumption 6 with $C^{\text{pure}}_{\text{facility}} = C_{\text{facility}} = 3$.
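The claimed equilibrium $\{0, 2, 4\}^m$ can be sanity-checked by brute force on the mean rewards (the Gaussian noise is dropped, since the NE is defined on expected rewards). This sketch is our own illustration, not part of the paper's experiment code.

```python
# The three paths (actions) in the modified Braess network, as facility sets.
ACTIONS = [frozenset({0, 1}), frozenset({0, 2, 4}), frozenset({3, 4})]

def mean_reward(action, counts, m):
    """Expected reward (negative latency) of a player taking `action`,
    given the facility counts induced by the joint action."""
    r = 0.0
    if 0 in action:
        r -= counts[0] / (m + 1)
    if 1 in action:
        r -= 1.0
    if 3 in action:
        r -= 1.0
    if 4 in action:
        r -= counts[4] / (m + 1)
    return r  # facility 2 contributes 0 in expectation

def is_pure_ne(joint):
    """Check that no player can improve by a unilateral deviation."""
    m = len(joint)
    counts = {f: sum(f in a for a in joint) for f in range(5)}
    for i, a in enumerate(joint):
        base = mean_reward(a, counts, m)
        for a2 in ACTIONS:
            if a2 == a:
                continue
            # Facility counts after player i deviates from a to a2.
            c2 = {f: counts[f] - (f in a) + (f in a2) for f in counts}
            if mean_reward(a2, c2, m) > base + 1e-12:
                return False
    return True

# {0, 2, 4}^m is an NE for every m tested.
for m in range(1, 6):
    assert is_pure_ne(tuple([ACTIONS[1]] * m))
```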

E.2 RESULT

For $m = 1, 2, 3, 4, 5, 6$ and the two exploration policies, we test the algorithm with different sample sizes $n$ and evaluate the performance gap of the output policy. The results are visualized in Figures 6 and 7. For each $n$, we run the algorithm on 16 datasets sampled with different random seeds; mean performance gaps are shown as solid lines in the figures. We set $\delta = 10^{-2}$. When the dataset is collected by $\pi^{\text{random}}$, the algorithm always finds the optimal policy as long as the dataset is large enough. Fixing $m$, as $n$ increases, the gap drops quickly when $n$ is small and more slowly when $n$ becomes large, which complies with the $1/\sqrt{n}$ term in the bound of Theorem 3. As $m$ increases, convergence becomes slower, which also complies with Theorem 3. When the dataset is collected by $\pi^{\text{facility}}$, the algorithm always finds the optimal policy as well. Furthermore, the dataset size needed is far smaller than for $\pi^{\text{random}}$. Except for the $m = 1$ case, the convergence rate barely varies with $m$, which complies with Theorem 8.



Figure 1: Illustration of Assumption 2. There are five facilities and five players with full action space. The facility configuration in $\pi^*$ is marked in red. The transparent boxes cover the facility configurations required in the assumption.

The proof is complete by combining all these and taking the maximum over $i \in [m]$.

Theorem 5. If Assumption 4 is satisfied, with probability $1 - \delta$, it holds that
$$\text{Gap}(\pi^{\text{output}}) \le 4\beta \sqrt{\frac{mF}{C_{\text{agent}} n}},$$
where $\beta$ is defined in (6) and $\pi^{\text{output}}$ is the output of Algorithm 1.

Proof. We have, for all $i \in [m]$:

Figure 3: Facility coverage condition for ρ. Each pair (f, n) represents the configuration that n players select facility f . Each box contains the facility coverage condition for one player. There are two classes of covered actions as described in formula (11). The color of each box represents the class of actions it belongs to.

Figure 5: The Braess paradox.

ACKNOWLEDGEMENTS

This work was supported in part by NSF TRIPODS II-DMS 2023166, NSF CCF 2007036, NSF IIS 2110170, NSF DMS 2134106, NSF CCF 2212261, NSF IIS 2143493, NSF CCF 2019844.

D OMITTED PROOF IN SECTION 5

Lemma 5. With probability $1 - \delta$, we have $|\hat r_i(a) - r_i(a)| \le b_i(a)$ for all $i \in [m]$, $a \in \mathcal{A}$.

Proof. Similar to Lemma 3, we obtain the corresponding bound on $\|\hat\theta - \theta\|_V$. The bound on $\det(V)$ follows the same basic idea as Lemma 4 of Cui et al. (2022). Combining these, we get the conclusion.

Theorem 7. If Assumption 5 is satisfied, with probability $1 - \delta$, it holds that
$$\text{Gap}(\pi^{\text{output}}) \le 4\beta \sqrt{\frac{mF}{C_{\text{game}} n}},$$
where $\beta$ is defined in equation (9) and $\pi^{\text{output}}$ is the output of Algorithm 1.

Proof. The proof is identical to that of Theorem 5, except that $C_{\text{game}}$ is used instead of $C_{\text{agent}}$.

Theorem 6. Define a class $X$ of congestion games $M$ and exploration strategies $\rho$ consisting of all $(M, \rho)$ pairs such that Assumption 4 is satisfied. For game-level feedback, for any algorithm ALG there exists $(M, \rho) \in X$ such that the output of ALG is at best a $1/4$-approximate NE no matter how much data is collected.

Proof. Similar to the proof of Theorem 4, consider a congestion game with two facilities $f_1, f_2$ and two players. The action spaces of both players are unrestricted, i.e., $\mathcal{A}_1 = \mathcal{A}_2 = \{\{f_1\}, \{f_2\}, \{f_1, f_2\}\}$. We construct two congestion games (games 5 and 6) with deterministic rewards, together with an exploration policy whose facility coverage condition is shown in Figure 4.

Figure 4: Facility coverage condition for $\rho$. Similar to Figure 3.

The reward information we can receive from the dataset in the agent-level feedback setting suffices to compute the NE directly. In the game-level feedback setting, however, we only know $R_{f_2}(2)$, $R_{f_1}(1) + R_{f_2}(1)$, and $2R_{f_1}(2) + R_{f_2}(1)$; hence ALG cannot distinguish the two games. Suppose the output of ALG has both players selecting $f_1$ with probability $p$. Then it is at best a $(1-p)/2$-approximate NE for game 5 and at best a $p/2$-approximate NE for game 6. In conclusion, the output of ALG is at best a $1/4$-approximate NE no matter how much data is collected.

