DECENTRALIZED ONLINE BANDIT OPTIMIZATION ON DIRECTED GRAPHS WITH REGRET BOUNDS

Abstract

We consider a decentralized multiplayer game, played over T rounds, with a leader-follower hierarchy described by a directed acyclic graph. For each round, the graph structure dictates the order of the players and how players observe the actions of one another. By the end of each round, all players receive a joint bandit reward based on their joint action that is used to update the player strategies towards the goal of minimizing the joint pseudo-regret. We present a learning algorithm inspired by the single-player multi-armed bandit problem and show that it achieves sub-linear joint pseudo-regret in the number of rounds for both adversarial and stochastic bandit rewards. Furthermore, we quantify the cost incurred due to the decentralized nature of our problem compared to the centralized setting.

1. INTRODUCTION

Decentralized multi-agent online learning concerns agents that simultaneously learn to behave over time in order to achieve their goals. Compared to the single-agent setup, novel challenges arise as agents may not share the same objectives, the environment becomes non-stationary, and information asymmetry may exist between agents (Yang & Wang, 2020). Traditionally, the multi-agent problem has been addressed either by relying on a central controller to coordinate the agents' actions or by letting the agents learn independently. However, access to a central controller may not be realistic, and independent learning suffers from convergence issues (Zhang et al., 2019). To circumvent these issues, a common approach is to drop the central coordinator and allow information exchange between agents (Zhang et al., 2018; 2019; Cesa-Bianchi et al., 2021). Decision-making that involves multiple agents is often modeled as a game and studied under the lens of game theory to describe the learning outcomes.[1]

Herein, we consider games with a leader-follower structure in which players act consecutively. For two players, such games are known as Stackelberg games (Hicks, 1935). Stackelberg games have been used to model diverse learning situations such as airport security (Balcan et al., 2015), poaching (Sessa et al., 2020), tax planning (Zheng et al., 2020), and generative adversarial networks (Moghadam et al., 2021). In a Stackelberg game, one is typically concerned with finding the Stackelberg equilibrium, sometimes called the Stackelberg-Nash equilibrium, in which the leader uses a mixed strategy and the follower is best-responding.
A Stackelberg equilibrium may be obtained by solving a bi-level optimization problem if the reward functions are known (Schäfer et al., 2020; Aussel & Svensson, 2020) or, otherwise, it may be learnt via online learning techniques (Bai et al., 2021; Zhong et al., 2021), e.g., no-regret algorithms (Shalev-Shwartz, 2012; Deng et al., 2019; Goktas et al., 2022). No-regret algorithms have emerged from the single-player multi-armed bandit problem as a means to address the exploitation-exploration trade-off (Bubeck & Slivkins, 2012). An algorithm is called no-regret if the difference between the cumulative rewards of the learnt strategy and the single best action in hindsight is sub-linear in the number of rounds (Shalev-Shwartz, 2012). In the multi-armed bandit problem, rewards may be adversarial (chosen arbitrarily, possibly depending on previous actions), oblivious adversarial (chosen arbitrarily but independently of the player's actions), or stochastic (independent and identically distributed) over time (Auer et al., 2002). Different assumptions on the bandit rewards yield different algorithms and regret bounds. Indeed, algorithms tailored for one kind of rewards are sub-optimal for others, e.g., the EXP3 algorithm due to Auer et al. (2002) yields the optimal scaling for adversarial rewards but not for stochastic rewards. For this reason, best-of-two-worlds algorithms, able to optimally handle both stochastic and adversarial rewards, have recently been pursued and resulted in algorithms with close-to-optimal performance in both settings (Auer & Chiang, 2016; Wei & Luo, 2018; Zimmert & Seldin, 2021). Extensions to multiplayer multi-armed bandit problems have been proposed in which players attempt to maximize the sum of rewards by pulling an arm each, see, e.g., (Kalathil et al., 2014; Bubeck et al., 2021). No-regret algorithms are a common element also when analyzing multiplayer games.
For example, in continuous two-player Stackelberg games, the leader strategy, based on a no-regret algorithm, converges to the Stackelberg equilibrium if the follower is best-responding (Goktas et al., 2022). In contrast, if the follower also adopts a no-regret algorithm, the regret dynamics are not guaranteed to converge to a Stackelberg equilibrium point (Goktas et al., 2022, Ex. 3.2). In (Deng et al., 2019), it was shown for two-player Stackelberg games that a follower playing a so-called mean-based no-regret algorithm enables the leader to achieve a reward strictly larger than the reward achieved at the Stackelberg equilibrium. This result, however, does not generalize to n-player games, as demonstrated by D'Andrea (2022). Apart from studying the Stackelberg equilibrium, several papers have analyzed the regret. For example, Sessa et al. (2020) presented upper bounds on the regret of a leader, employing a no-regret algorithm, playing against an adversarial follower with an unknown response function. Furthermore, Stackelberg games with states were introduced by Lauffer et al. (2022) along with an algorithm that was shown to achieve no-regret. As the follower in a Stackelberg game observes the leader's action, there is information exchange. A generalization to multiple players has been studied in a series of papers (Cesa-Bianchi et al., 2016; 2020; 2021). In this line of work, players with a common action space form an arbitrary graph and are randomly activated in each round. Active players share information with their neighbors by broadcasting their observed loss, previously received neighbor losses, and their current strategy. The goal of the players is to minimize the network regret, defined with respect to the cumulative losses observed by active players over the rounds. The players, however, update their strategies according to their individually observed losses.
Although we consider players connected on a graph, our work differs significantly from (Cesa-Bianchi et al., 2016; 2020; 2021): e.g., we allow only actions to be observed between players, and player strategies are updated based on a common bandit reward.

Contributions:

We introduce the joint pseudo-regret, defined with respect to the cumulative reward, where all players observe the same bandit reward in each round. We provide an online learning algorithm for general consecutive-play games that relies on no-regret algorithms developed for the single-player multi-armed bandit problem. The main novelty of our contribution resides in the joint analysis of players with coupled rewards, where we derive upper bounds on the joint pseudo-regret and prove our algorithm to be no-regret in both the stochastic and the adversarial setting. Furthermore, we quantify the penalty incurred by our decentralized setting in relation to the centralized setting.

2. PROBLEM FORMULATION

In this section, we formalize the consecutive structure of the game and introduce the joint pseudo-regret that will be used as a performance metric throughout. We consider a decentralized setting where, in each round of the game, players pick actions consecutively. The consecutive nature of the game allows players to observe preceding players' actions and may be modeled by a directed acyclic graph (DAG). For example, in Fig. 1, a seven-player game is illustrated in which player 1 initiates the game and her action is observed by players 2, 5, and 6. The observations available to the remaining players follow analogously. Note that for a two-player consecutive game, the DAG models a Stackelberg game.

We let G = (V, E) denote a DAG where V denotes the vertices and E denotes the edges. For our setting, V constitutes the n different players and E = {(j, i) : j → i, j ∈ V, i ∈ V} describes the observation structure, where j → i indicates that player i observes the action of player j. Accordingly, a given player i ∈ V observes the actions of its direct parents, i.e., players j ∈ E_i = {k : (k, i) ∈ E}. Furthermore, each player i ∈ V is associated with a discrete action space A_i of size A_i. We denote by π_i(t) the mixed strategy of player i over the action space A_i in round t ∈ [T], such that π_i(t) = a with probability p_{i,a} for a ∈ A_i. In the special case when p_{i,a} = 1 for some a ∈ A_i, the strategy is referred to as pure. Let A_B denote the joint action space of players in a set B, given by the Cartesian product A_B = ∏_{i∈B} A_i. If a player i has no parents, i.e., E_i = ∅, we use the convention |A_{E_i}| = 1.

We consider a collaborative setting with bandit rewards given by a mapping r_t : A_V → [0, 1] in each round t ∈ [T]. The bandit rewards may be either adversarial or stochastic. Let C denote a set of independent cliques in the DAG (Koller & Friedman, 2009, Ch. 2) and let N_k ∈ C for k ∈ [|C|] denote the players in the kth clique in C with joint action space A_{N_k}, such that N_k ∩ N_j = ∅ for j ≠ k. For a joint action a(t) ∈ A_V, we consider bandit rewards given by a linear combination of the clique rewards as

    r_t(a(t)) = Σ_{k=1}^{|C|} β_k r_t^k(P_k(a(t))),    (1)

where r_t^k : A_{N_k} → [0, 1], β_k ≥ 0 is the weight of the kth clique reward such that Σ_{k=1}^{|C|} β_k = 1, and P_k(a(t)) denotes the joint action of the players in N_k. As an example, Fig. 1b highlights the cliques C = {{2, 3, 4}, {1, 5}, {6}, {7}} and we have, e.g., N_1 = {2, 3, 4} and P_1(a(t)) = (a_2(t), a_3(t), a_4(t)). Note that each player influences only a single term in the reward (1).

In each round t ∈ [T], the game proceeds as follows for player i ∈ V: 1) the player is idle until the actions of all parents in E_i have been observed, 2) the player picks an action a_i(t) ∈ A_i according to its strategy π_i(t), 3) once all n players in V have chosen an action, the player observes the bandit reward r_t(a(t)) and updates its strategy. The goal of the game is to find policies {π_i(t)}_{i=1}^n that depend on past actions and rewards in order to minimize the joint pseudo-regret R(T), which is defined similarly to the pseudo-regret (Shalev-Shwartz, 2012, Ch. 4.2) as

    R(T) = r(a⋆) − E[Σ_{t=1}^T r_t(a(t))],    r(a⋆) = max_{a∈A_V} E[Σ_{t=1}^T r_t(a)],    (2)

where the expectations are taken with respect to the rewards and the player actions.[2] Note that r(a⋆) corresponds to the largest expected reward obtainable if all players use pure strategies. Hence, the pseudo-regret in (2) quantifies the difference between the expected reward accumulated by the learnt strategies and the reward-maximizing pure strategies in hindsight.

Our problem formulation pertains to a plethora of applications.
Examples include resource allocation in cognitive radio networks, where available frequencies are obtained via channel sensing (Janatian et al., 2015), and semi-autonomous vehicles with adaptive cruise control, i.e., vehicles ahead are observed before an action is decided (Marsden et al., 2001). Also, the importance of coupled rewards and partner awareness through implicit communication, e.g., by observation, has recently been highlighted in human-robot and human-AI collaborative settings (Bıyık et al., 2022).

As will be shown in the next section, any no-regret algorithm can be used as a building block for the games considered herein to guarantee a pseudo-regret that is sub-linear in the number of rounds T. As our goal is to study the joint pseudo-regret (2) for both adversarial and stochastic rewards, we resort to best-of-two-worlds algorithms for the multi-armed bandit problem (Bubeck & Slivkins, 2012). In particular, we will utilize the TSALLIS-INF algorithm, which guarantees a close-to-optimal pseudo-regret for both stochastic and adversarial bandit rewards (Zimmert & Seldin, 2021). Note that our analysis in the next section pertains to adversarial rewards; for stochastic rewards, our analysis applies verbatim by replacing Theorem 1 with (Zimmert & Seldin, 2021, Th. 4).
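To make the update rule concrete before the analysis, the following is a minimal Python sketch of the TSALLIS-INF strategy computation (Algorithm 2 below): Newton's method finds the normalizer x such that the probabilities p_j = 4(η(L_j − x))^{−2} sum to one, with learning rate η = 2/√t. The function name and the initialization of x are our own illustrative choices, not part of the original algorithm specification.

```python
import math

def tsallis_inf_strategy(t, cum_losses, tol=1e-10, max_iter=100):
    """Playing distribution of the TSALLIS-INF rule (Zimmert & Seldin, 2021)
    in round t, given cumulative importance-weighted losses L_j.
    Solves sum_j p_j(x) = 1 with p_j(x) = 4 / (eta * (L_j - x))**2
    for the normalizer x via Newton's method; eta = 2 / sqrt(t)."""
    eta = 2.0 / math.sqrt(t)
    # Start so that the largest p_j equals 1, hence sum_j p_j >= 1;
    # Newton then decreases x monotonically toward the root.
    x = min(cum_losses) - 2.0 / eta
    p = []
    for _ in range(max_iter):
        p = [4.0 / (eta * (L - x)) ** 2 for L in cum_losses]
        f = sum(p) - 1.0                            # constraint residual
        if abs(f) < tol:
            break
        fprime = eta * sum(pj ** 1.5 for pj in p)   # d/dx of sum_j p_j(x)
        x -= f / fprime                             # Newton step
    return p, x
```

Arms with smaller cumulative loss receive larger probability, and with equal losses the distribution is uniform, as one would expect from the update rule.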

3. ANALYSIS OF THE JOINT PSEUDO-REGRET

Our analysis of the joint pseudo-regret builds upon learning algorithms for the single-player multi-armed bandit problem. First, let us build intuition on how to use a multi-armed bandit algorithm in the DAG-based game described in Section 2. Consider a 2-player Stackelberg game where the players choose actions from A_1 and A_2, respectively, and where player 2 observes the actions of player 1. For simplicity, we let player 1 use a mixed strategy whereas player 2 is limited to a pure strategy. Furthermore, consider the rewards to be a priori known by the players and let T = 1, for which the Stackelberg game may be viewed as a bi-level optimization problem (Aussel & Svensson, 2020). In this setting, the action of player 1 imposes a Nash game on player 2, who attempts to play optimally given the observation. Hence, player 2 has A_1 pure strategies, one for each of the A_1 actions of player 1.

We may generalize this idea to the DAG-based multiplayer game with unknown bandit rewards and T ≥ 1 to achieve no-regret. Indeed, a player i ∈ V may run |A_{E_i}| different multi-armed bandit algorithms, one for each of the joint actions of its parents. Algorithm 1 illustrates this idea in conjunction with the TSALLIS-INF update rule introduced by Zimmert & Seldin (2021), which is given in Algorithm 2 for completeness.[3] In particular, for the 2-player Stackelberg game, the leader runs a single multi-armed bandit algorithm whereas the follower runs A_1 learning algorithms. For simplicity, Algorithm 1 assumes that player i knows the size of the joint action space of its parents, i.e., |A_{E_i}|. Dropping this assumption is straightforward: simply keep track of the observed joint actions and initiate a new multi-armed bandit learner upon a unique observation.

Algorithm 1: Learning algorithm of player i ∈ V. Note: for ease of notation, let the actions in A_{E_i} be labeled 1, 2, …, |A_{E_i}|.
 1: procedure
 2:   initialize cumulative losses L_j ← 0 ∈ R^{A_i} for all j ∈ [|A_{E_i}|]
 3:   initialize fixed points x_j ← 0 for all j ∈ [|A_{E_i}|]
 4:   initialize counters n_j ← 0 for all j ∈ [|A_{E_i}|]
 5:   for t = 1, 2, …, T do
 6:     observe the joint action j ∈ [|A_{E_i}|] of the preceding players
 7:     increase the counter n_j ← n_j + 1
 8:     obtain a new strategy and fixed point (π_i(t), x_j) ← TSALLIS-INF(n_j, L_j, x_j)
 9:     play action a_i(t) ∼ π_i(t)
10:     observe the joint bandit reward r_t(a(t))
11:     update the cumulative loss L_{j,k} ← L_{j,k} + 1{a_i(t) = k}(1 − r_t(a(t)))/p_k for all k ∈ [A_i]
12:   end for
13: end procedure

Algorithm 2: Strategy update for player i ∈ V.
Input: time step t, cumulative losses L ∈ R_+^{A_i}, previous fixed point x. Output: strategy π_i(t), fixed point x.
 1: procedure TSALLIS-INF
 2:   set learning rate η ← 2√(1/t)
 3:   repeat
 4:     p_j ← 4(η(L_j − x))^{−2} for all j ∈ [A_i]
 5:     x ← x − (Σ_{j=1}^{A_i} p_j − 1)/(η Σ_{j=1}^{A_i} p_j^{3/2})
 6:   until convergence
 7:   update the strategy π_i(t) ← (p_1, …, p_{A_i})
 8: end procedure

Next, we go on to analyze the joint pseudo-regret of Algorithm 1. First, we present a result on the pseudo-regret for the single-player multi-armed bandit problem that will be used throughout.

Theorem 1 (Pseudo-regret of TSALLIS-INF). Consider a single-player multi-armed bandit problem with A_1 arms, played over T rounds. Let the player operate according to Algorithm 1. Then, the pseudo-regret satisfies R(T) ≤ 4√(A_1 T) + 1.

Proof. For a single player, E_1 = ∅ and we have |A_{E_1}| = 1 by convention. Hence, our setting becomes equivalent to that of Zimmert & Seldin (2021, Th. 1) and the result follows thereof.

Next, we consider a two-player Stackelberg game with joint bandit rewards defined over a two-player clique. We have the following upper bound on the joint pseudo-regret.

Theorem 2 (Joint pseudo-regret over cliques of size 2). Consider a 2-player Stackelberg game with bandit rewards, given by (1), defined over a single clique containing both players. Furthermore, let each of the players follow Algorithm 1. Then, the joint pseudo-regret satisfies R(T) ≤ 4√(A_1 A_2 T) + 4√(A_1 T) + A_1 + 1.

Proof. Without loss of generality, let player 2 observe the actions of player 1.
Let a_1(t) ∈ A_1 and a_2(t) ∈ A_2 denote the actions of player 1 and player 2, respectively, at time t ∈ [T], and let a⋆_1 and a⋆_2(a_1) denote the reward-maximizing pure strategies of the players in hindsight, i.e.,

    a⋆_1 = arg max_{a_1∈A_1} E[Σ_{t=1}^T r_t(a_1, a⋆_2(a_1))],    a⋆_2(a_1) = arg max_{a_2∈A_2} E[Σ_{t=1}^T r_t(a_1, a_2)].    (3)

Note that the optimal joint decision in hindsight is given by (a⋆_1, a⋆_2(a⋆_1)). The joint pseudo-regret is given by

    R(T) = Σ_{t=1}^T E[r_t(a⋆_1, a⋆_2(a⋆_1)) − r_t(a⋆_1, a_2(t)) + r_t(a⋆_1, a_2(t)) − r_t(a_1(t), a_2(t))]
         ≤ Σ_{t=1}^T max_{a∈A_1} E[r_t(a, a⋆_2(a)) − r_t(a, a_2(t))] + E[Σ_{t=1}^T r_t(a⋆_1, a_2(t)) − r_t(a_1(t), a_2(t))].    (4)

Next, let a⁺_1(t) = arg max_{a∈A_1} E[r_t(a, a⋆_2(a)) − r_t(a, a_2(t))], let T_a = {t : a⁺_1(t) = a} for a ∈ A_1 denote all rounds in which the maximizer equals a, and introduce T_a = |T_a|. Then, the first term in (4) is upper-bounded as

    Σ_{t=1}^T max_{a∈A_1} E[r_t(a, a⋆_2(a)) − r_t(a, a_2(t))] = Σ_{a∈A_1} Σ_{t∈T_a} E[r_t(a, a⋆_2(a)) − r_t(a, a_2(t))]
        ≤ Σ_{a∈A_1} (4√(A_2 T_a) + 1)    (5)
        ≤ max_{{T_a : Σ_a T_a = T}} Σ_{a∈A_1} (4√(A_2 T_a) + 1) = 4√(A_1 A_2 T) + A_1,    (6)

where (5) follows from Theorem 1 because player 2 follows Algorithm 1. Next, we consider the second term in (4). Note that, according to (3), a⋆_1 is obtained from the optimal pure strategies in hindsight of both players. Let a°_1 = arg max_{a_1∈A_1} Σ_{t=1}^T E[r_t(a_1, a_2(t))] and observe that E[Σ_{t=1}^T r_t(a⋆_1, a_2(t))] ≤ E[Σ_{t=1}^T r_t(a°_1, a_2(t))]. Hence, by adding and subtracting r_t(a°_1, a_2(t)) in the second term in (4), we get

    E[Σ_{t=1}^T r_t(a⋆_1, a_2(t)) − r_t(a_1(t), a_2(t))] ≤ E[Σ_{t=1}^T r_t(a°_1, a_2(t)) − r_t(a_1(t), a_2(t))] ≤ 4√(A_1 T) + 1,    (7)

where the last inequality follows from Theorem 1. The result follows from (6) and (7).

From Theorem 2, we note that the joint pseudo-regret scales with the size of the joint action space as R(T) = O(√(A_1 A_2 T)).
This is expected, as a centralized version of the cooperative Stackelberg game may be viewed as a single-player multi-armed bandit problem with A_1 A_2 arms where, according to Theorem 1, the pseudo-regret is upper-bounded by 4√(A_1 A_2 T) + 1. Hence, from Theorem 2, we observe a penalty of 4√(A_1 T) + A_1 due to the decentralized nature of our setup. Moreover, in the centralized setting, Algorithm 2 was shown in Zimmert & Seldin (2021) to achieve the same scaling as the lower bound in Cesa-Bianchi & Lugosi (2006, Th. 6.1). Hence, Algorithm 1 achieves the optimal scaling. Next, we extend Theorem 2 to cliques of size larger than two.

Theorem 3 (Joint pseudo-regret over a clique of arbitrary size). Consider a DAG-based game with bandit rewards given by (1), defined over a single clique containing m players, indexed in observation order. Let each of the players operate according to Algorithm 1. Then, the joint pseudo-regret satisfies

    R(T) ≤ 4√T Σ_{i=1}^m √(∏_{k=1}^i A_k) + Σ_{i=1}^{m−1} ∏_{k=1}^i A_k + 1.

Proof. Let R_ub(T, m) denote an upper bound on the joint pseudo-regret when the bandit reward is defined over a clique containing m players. From Theorem 1 and Theorem 2, we have

    R_ub(T, 1) = 4√(A_1 T) + 1,    R_ub(T, 2) = 4√(A_1 T) + 4√(A_1 A_2 T) + A_1 + 1,

respectively. Therefore, we form the induction hypothesis

    R_ub(T, m) = 4√T Σ_{i=1}^m √(∏_{k=1}^i A_k) + Σ_{i=1}^{m−1} ∏_{k=1}^i A_k + 1.    (8)

Assume that (8) is true for a clique containing m − 1 players and add an additional player, assigned player index 1, whose actions are observable to the original m − 1 players. The m players now form a clique C of size m. Let a(t) ∈ A_C denote the joint action of all the players in the clique at time t ∈ [T] and let a_{−i}(t) = (a_1(t), …, a_{i−1}(t), a_{i+1}(t), …, a_m(t)) ∈ A_{C\i} denote the joint action excluding the action of player i.
Furthermore, let

    a⋆_1 = arg max_{a_1∈A_1} E[Σ_{t=1}^T r_t(a_1, a⋆_{−1}(a_1))],    a⋆_{−1}(a_1) = arg max_{a∈A_{C\1}} E[Σ_{t=1}^T r_t(a_1, a)]

denote the optimal action in hindsight of player 1 and the optimal joint action of the original m − 1 players given the action of player 1, respectively. The optimal joint action in hindsight is given by a⋆ = (a⋆_1, a⋆_{−1}(a⋆_1)). Following the steps in the proof of Theorem 2 verbatim, we obtain

    R(T) = Σ_{t=1}^T E[r_t(a⋆) − r_t(a⋆_1, a_{−1}(t)) + r_t(a⋆_1, a_{−1}(t)) − r_t(a(t))]
         ≤ Σ_{t=1}^T max_{a_1} E[r_t(a_1, a⋆_{−1}(a_1)) − r_t(a_1, a_{−1}(t))] + Σ_{t=1}^T E[r_t(a⋆_1, a_{−1}(t)) − r_t(a_1(t), a_{−1}(t))]
         ≤ Σ_{a∈A_1} Σ_{t∈T_a} E[r_t(a, a⋆_{−1}(a)) − r_t(a, a_{−1}(t))] + Σ_{t=1}^T E[r_t(a°_1, a_{−1}(t)) − r_t(a_1(t), a_{−1}(t))]
         ≤ Σ_{a∈A_1} R_ub(T_a, m − 1) + 4√(A_1 T) + 1
         ≤ A_1 R_ub(T/A_1, m − 1) + 4√(A_1 T) + 1,    (9)

where T_a, T_a, and a°_1 are defined analogously to the proof of Theorem 2. By using the induction hypothesis (8) in (9), and by accounting for the original m − 1 players being indexed from 2 to m, we obtain

    R(T) ≤ A_1 (4√(T/A_1) Σ_{i=2}^m √(∏_{k=2}^i A_k) + Σ_{i=2}^{m−1} ∏_{k=2}^i A_k + 1) + 4√(A_1 T) + 1 = R_ub(T, m),

which is what we wanted to show.

As in the two-player game, the joint pseudo-regret of Algorithm 1 achieves the optimal scaling, i.e., R(T) = O(√T ∏_{k=1}^m √(A_k)), but exhibits a penalty due to the decentralized setting, equal to 4√T Σ_{i=1}^{m−1} √(∏_{k=1}^i A_k) + Σ_{i=1}^{m−1} ∏_{k=1}^i A_k.

Up until this point, we have considered the pseudo-regret when the bandit reward (1) is defined over a single clique. The next theorem leverages the previous results to provide an upper bound on the joint pseudo-regret when the bandit reward is defined over an arbitrary number of independent cliques in the DAG.

Theorem 4 (Joint pseudo-regret in DAG-based games). Consider a DAG-based game with bandit rewards given as in (1), and let C contain a collection of independent cliques associated with the DAG. Let each player operate according to Algorithm 1.
Then, the joint pseudo-regret satisfies

    R(T) = O(√(T max_{k∈[|C|]} |A_{N_k}|)),

where A_{N_k} denotes the joint action space of the players in the kth clique N_k ∈ C.

Proof. Let N_k ∈ C denote the players belonging to the kth clique in C with joint action space A_{N_k}. The structure of (1) allows us to express the joint pseudo-regret as

    R(T) = E[Σ_{t=1}^T r_t(a⋆) − r_t(a(t))] ≤ Σ_{k=1}^{|C|} β_k E[Σ_{t=1}^T r_t^k(a⋆_k) − r_t^k(P_k(a(t)))],

where a⋆ = arg max_{a∈A_V} E[Σ_{t=1}^T r_t(a)], a⋆_k = arg max_{a∈A_{N_k}} E[Σ_{t=1}^T r_t^k(a)], and the inequality follows since E[Σ_{t=1}^T r_t^k(P_k(a⋆))] ≤ E[Σ_{t=1}^T r_t^k(a⋆_k)]. Now, for each clique N_k ∈ C, let the player indices in N_k be ordered according to the order of player observations within the clique. As Theorem 3 holds for any N_k ∈ C, we may, with a slight abuse of notation, bound the joint pseudo-regret of each clique as

    R(T) ≤ Σ_{k=1}^{|C|} β_k R_ub(T, N_k) ≤ max_{k∈[|C|]} R_ub(T, N_k),

where R_ub(T, N_k) follows from Theorem 3 as

    R_ub(T, N_k) = 4√T Σ_{i∈N_k} √(∏_{j≤i, j∈N_k} A_j) + Σ_{i∈N_k^−} ∏_{j≤i, j∈N_k^−} A_j + 1,

where N_k^− excludes the last element of N_k. The result follows since R_ub(T, N_k) = O(√(T |A_{N_k}|)).
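Since the bounds of Theorems 1–3 are closed-form expressions, the decentralization penalty can be evaluated numerically. The sketch below uses hypothetical helper names; `sizes` lists the action-space sizes A_1, …, A_m in observation order. It evaluates the Theorem 3 bound for a single clique alongside the centralized single-learner baseline of Theorem 1 applied to the joint action space.

```python
import math

def clique_regret_bound(T, sizes):
    """Upper bound of Theorem 3 for a clique of m players with
    action-space sizes A_1, ..., A_m (observation order):
    R(T) <= 4*sqrt(T)*sum_{i=1}^m sqrt(prod_{k<=i} A_k)
            + sum_{i=1}^{m-1} prod_{k<=i} A_k + 1."""
    m = len(sizes)
    bound = 1.0
    prod = 1
    for i, A in enumerate(sizes, start=1):
        prod *= A                           # prod_{k<=i} A_k
        bound += 4.0 * math.sqrt(T * prod)  # sqrt terms of the bound
        if i < m:
            bound += prod                   # additive penalty terms
    return bound

def centralized_bound(T, sizes):
    """Single-learner baseline (Theorem 1) on the joint action space."""
    joint = math.prod(sizes)
    return 4.0 * math.sqrt(joint * T) + 1.0
```

For m = 1 and m = 2 the function reproduces the bounds of Theorems 1 and 2, and the gap to `centralized_bound` is exactly the decentralization penalty discussed above.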

4. NUMERICAL RESULTS

The experimental setup in this section is inspired by the socio-economic simulation in (Zheng et al., 2020).[4] We consider a simple taxation game where one player acts as a socio-economic planner and the remaining M players act as workers that earn an income by performing actions, e.g., constructing houses. The socio-economic planner divides the possible incomes into N brackets where [β_{i−1}, β_i] denotes the ith bracket, with β_0 = 0 and β_N = ∞. In each round t ∈ [T], the socio-economic planner picks an action a_p(t) = (a_{p,1}(t), …, a_{p,N}(t)) that determines the taxation rate, where a_{p,i}(t) ∈ R_i denotes the marginal taxation rate in income bracket i and R_i is a finite set. We use the discrete set A_p = ∏_{i=1}^N R_i of size A_p to denote the action space of the planner. In each round, the workers observe the taxation policy a_p(t) ∈ A_p and choose their actions consecutively, see Fig. 2a.

Worker j ∈ [M] takes actions a_j(t) ∈ A_j where A_j is a finite set. A chosen action a_j(t) ∈ A_j translates into a tuple (x_j(t), l̃_j(t)) consisting of a gross income and a marginal default labor cost, respectively. Furthermore, each worker has a skill level s_j that serves as a divisor of the default labor, resulting in an effective marginal labor l_j(t) = l̃_j(t)/s_j. Hence, given a common action, high-skilled workers exhibit less labor than low-skilled workers. The gross income x_j(t) of worker j in round t is taxed according to a_p(t) as

    ξ(x_j(t)) = Σ_{i=1}^N a_{p,i}(t)[(β_i − β_{i−1})1{x_j(t) > β_i} + (x_j(t) − β_{i−1})1{x_j(t) ∈ [β_{i−1}, β_i]}],

where a_{p,i}(t) is the taxation rate of the ith income bracket and ξ(x_j(t)) denotes the collected tax. Hence, worker j's cumulative net income z_j(t) and cumulative labor ℓ_j(t) in round t are given as

    z_j(t) = Σ_{u=1}^t x_j(u) − ξ(x_j(u)),    ℓ_j(t) = Σ_{u=1}^t l_j(u).

In round t, the utility of worker j depends on the cumulative net income and the cumulative labor as

    r_t^j(z_j(t), ℓ_j(t)) = ((z_j(t))^{1−η} − 1)/(1 − η) − ℓ_j(t),    (11)

where η > 0 determines the non-linear impact of income. An example of the utility function in (11) is shown in Fig. 2b for η = 0.3, income x_j(t) = 10, and a default marginal labor l̃_j(t) = 1 at different skill levels. It can be seen that the utility initially increases with income until a point at which the cumulative labor outweighs the benefits of income and the worker gets burnt out. We consider bandit rewards defined with respect to the worker utilities and the total collected tax as

    r_t(a_p(t), a_1(t), …, a_M(t)) = (1/(M + 1)) [Σ_{j=1}^M w r_t^j(z_j(t), ℓ_j(t)) + w_p Σ_{j=1}^M ξ(x_j(t))],

where the weights w and w_p trade off worker utility for collected tax and satisfy Mw + w_p = M + 1. The individual rewards are all normalized to [0, 1]; hence, r_t(a_p(t), a_1(t), …, a_M(t)) ∈ [0, 1].

For the numerical experiment, we consider N = 2 income brackets whose boundaries are {0, 14, ∞}, and the socio-economic planner chooses a marginal taxation rate from R = {0.1, 0.3, 0.5} in each income bracket; hence, A_p = 9. We consider M = 3 workers with the same action set A of size 3. Consequently, the joint action space is of size 243. Furthermore, we let the skill level of the workers coincide with the worker index, i.e., s_j = j for j ∈ [M]; simply put, workers that observe more of their peers have higher skill. The worker actions translate to a gross marginal income and a marginal labor as a_j(t) → (x_j(t), l_j(t)), where x_j(t) = 5a_j(t) and l_j(t) = a_j(t)/s_j for a_j(t) ∈ {1, 2, 3}. Finally, we set η = 0.3 and let w = 1/M and w_p = M to model a situation where the collected tax is preferred over the workers' individual utility.

The joint pseudo-regret of the socio-economic simulation is illustrated in Fig. 3 along with the upper bound in Theorem 4.
We collect 100 realizations of the experiment and present, along with the pseudo-regret R(T), two standard deviations. It can be seen that the players initially explore the action space and eventually converge on an optimal strategy from a pseudo-regret perspective. The upper bound in Fig. 3 is admittedly loose and does not exhibit the same asymptotic decay as the simulation, due to different constants in the scaling law, see Fig. 3b. However, it remains valuable as it provides an asymptotic no-regret guarantee for the learning algorithm.
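For concreteness, the bracketed taxation rule and the worker utility (11) can be sketched in Python as follows; the function names and the stand-alone form are illustrative, not taken from the released source code of the experiments.

```python
def marginal_tax(x, brackets, rates):
    """Tax collected on gross income x under per-bracket marginal rates.
    brackets = [b_0, b_1, ..., b_N] with b_0 = 0 and b_N = inf;
    rates[i] applies to the income falling inside (b_i, b_{i+1}]."""
    tax = 0.0
    for i, rate in enumerate(rates):
        lo, hi = brackets[i], brackets[i + 1]
        if x > hi:
            tax += rate * (hi - lo)   # bracket fully used
        elif x > lo:
            tax += rate * (x - lo)    # income ends inside this bracket
    return tax

def worker_utility(z, labor, eta=0.3):
    """Isoelastic utility of cumulative net income z minus cumulative
    labor, as in (11): (z^(1-eta) - 1)/(1-eta) - labor."""
    return (z ** (1.0 - eta) - 1.0) / (1.0 - eta) - labor
```

With the experiment's brackets {0, 14, ∞}, an income of 15 taxed at rates (0.1, 0.3) yields 0.1·14 + 0.3·1 = 1.7, illustrating the marginal (rather than flat) application of the rates.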

5. CONCLUSION

We have studied multiplayer games with joint bandit rewards where players execute actions consecutively and observe the actions of the preceding players. We introduced the notion of joint pseudo-regret and presented an algorithm that is guaranteed to achieve no-regret for both adversarial and stochastic bandit rewards. A bottleneck of many multi-agent algorithms is that their complexity scales with the joint action space (Jin et al., 2021), and our algorithm is no exception. An interesting avenue for further study is to find algorithms with more benign scaling properties, see, e.g., (Jin et al., 2021; Daskalakis et al., 2021).



Footnotes:

[1] The convention is to use agents in learning applications and players in game-theoretic applications; we shall use the game-theoretic nomenclature in the remainder of the paper.
[2] This is called pseudo-regret as r(a⋆) is obtained by a maximization outside of the expectation.
[3] The original TSALLIS-INF algorithm is given in terms of losses. To use rewards, one may simply use the relationship l = 1 − r.
[4] The source code of our experiments is available at https://anonymous.4open.science/r/bandit_optimization_dag-242C/.



Colored cliques comprising the bandit reward.

Figure 1: Game and reward structures.


Example of utility functions for different skill levels when x_j(t) = 10 and l̃_j(t) = 1.

Figure 2: Socio-economic setup.

Figure 3: Pseudo regret vs the upper bound in Theorem 4.

