DECENTRALIZED ONLINE BANDIT OPTIMIZATION ON DIRECTED GRAPHS WITH REGRET BOUNDS

Abstract

We consider a decentralized multiplayer game, played over T rounds, with a leader-follower hierarchy described by a directed acyclic graph. For each round, the graph structure dictates the order of the players and how players observe the actions of one another. By the end of each round, all players receive a joint bandit reward based on their joint action, which is used to update the player strategies towards the goal of minimizing the joint pseudo-regret. We present a learning algorithm inspired by the single-player multi-armed bandit problem and show that it achieves sub-linear joint pseudo-regret in the number of rounds for both adversarial and stochastic bandit rewards. Furthermore, we quantify the cost incurred due to the decentralized nature of our problem compared to the centralized setting.

1. INTRODUCTION

Decentralized multi-agent online learning concerns agents that, simultaneously, learn to behave over time in order to achieve their goals. Compared to the single-agent setup, novel challenges are present as agents may not share the same objectives, the environment becomes non-stationary, and information asymmetry may exist between agents (Yang & Wang, 2020). Traditionally, the multi-agent problem has been addressed either by relying on a central controller to coordinate the agents' actions or by letting the agents learn independently. However, access to a central controller may not be realistic, and independent learning suffers from convergence issues (Zhang et al., 2019). To circumvent these issues, a common approach is to drop the central coordinator and allow information exchange between agents (Zhang et al., 2018; 2019; Cesa-Bianchi et al., 2021). Decision-making that involves multiple agents is often modeled as a game and studied under the lens of game theory to describe the learning outcomes.¹ Herein, we consider games with a leader-follower structure in which players act consecutively. For two players, such games are known as Stackelberg games (Hicks, 1935). Stackelberg games have been used to model diverse learning situations such as airport security (Balcan et al., 2015), poaching (Sessa et al., 2020), tax planning (Zheng et al., 2020), and generative adversarial networks (Moghadam et al., 2021). In a Stackelberg game, one is typically concerned with finding the Stackelberg equilibrium, sometimes called the Stackelberg-Nash equilibrium, in which the leader uses a mixed strategy and the follower is best-responding.
A Stackelberg equilibrium may be obtained by solving a bi-level optimization problem if the reward functions are known (Schäfer et al., 2020; Aussel & Svensson, 2020) or, otherwise, it may be learnt via online learning techniques (Bai et al., 2021; Zhong et al., 2021), e.g., no-regret algorithms (Shalev-Shwartz, 2012; Deng et al., 2019; Goktas et al., 2022). No-regret algorithms have emerged from the single-player multi-armed bandit problem as a means to alleviate the exploitation-exploration trade-off (Bubeck & Slivkins, 2012). An algorithm is called no-regret if the difference between the cumulative rewards of the learnt strategy and the single best action in hindsight is sublinear in the number of rounds (Shalev-Shwartz, 2012). In the multi-armed bandit problem, rewards may be adversarial (based on randomness and previous actions), oblivious adversarial (random), or stochastic (independent and identically distributed) over time (Auer et al., 2002). Different assumptions on the bandit rewards yield different algorithms and regret bounds. Indeed, algorithms tailored for one kind of rewards are sub-optimal for others, e.g., the EXP3 algorithm due to Auer et al. (2002) yields the optimal scaling for adversarial rewards but not for stochastic rewards. For this reason, best-of-two-worlds algorithms, able to optimally handle both stochastic and adversarial rewards, have recently been pursued and have resulted in algorithms with close-to-optimal performance in both settings (Auer & Chiang, 2016; Wei & Luo, 2018; Zimmert & Seldin, 2021). Extensions to multiplayer multi-armed bandit problems have been proposed in which players attempt to maximize the sum of rewards by pulling an arm each, see, e.g., (Kalathil et al., 2014; Bubeck et al., 2021). No-regret algorithms are a common element also when analyzing multiplayer games.
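To make the discussion of no-regret bandit algorithms concrete, the following is a minimal sketch of the EXP3 algorithm mentioned above (importance-weighted exponential updates over the arms); the learning rate, the reward interface `reward_fn`, and all variable names are our illustrative choices, not part of this paper's method.

```python
import numpy as np

def exp3(T, K, reward_fn, eta=None, rng=None):
    """Sketch of EXP3: sample an arm from an exponential-weights
    distribution, observe only that arm's reward in [0, 1], and apply
    an importance-weighted update to the pulled arm's log-weight."""
    rng = rng if rng is not None else np.random.default_rng(0)
    # A standard tuning of the learning rate for horizon T and K arms.
    eta = eta if eta is not None else np.sqrt(2.0 * np.log(K) / (T * K))
    w = np.zeros(K)  # log-weights, kept in log-space for stability
    total = 0.0
    for t in range(T):
        p = np.exp(w - w.max())
        p /= p.sum()
        arm = rng.choice(K, p=p)
        r = reward_fn(t, arm)       # bandit feedback: only the pulled arm
        total += r
        w[arm] += eta * r / p[arm]  # importance-weighted reward estimate
    return total
```

Run against a fixed reward gap, the cumulative reward concentrates on the better arm, illustrating the sublinear-regret behavior discussed above; as noted in the text, this tuning is geared to adversarial rewards and is not optimal in the stochastic setting.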
For example, in continuous two-player Stackelberg games, the leader strategy, based on a no-regret algorithm, converges to the Stackelberg equilibrium if the follower is best-responding (Goktas et al., 2022). In contrast, if the follower also adopts a no-regret algorithm, the regret dynamics is not guaranteed to converge to a Stackelberg equilibrium point (Goktas et al., 2022, Ex. 3.2). In (Deng et al., 2019), it was shown for two-player Stackelberg games that a follower playing a so-called mean-based no-regret algorithm enables the leader to achieve a reward strictly larger than the reward achieved at the Stackelberg equilibrium. This result, however, does not generalize to n-player games, as demonstrated by D'Andrea (2022). Apart from studying the Stackelberg equilibrium, several papers have analyzed the regret. For example, Sessa et al. (2020) presented upper bounds on the regret of a leader, employing a no-regret algorithm, playing against an adversarial follower with an unknown response function. Furthermore, Stackelberg games with states were introduced by Lauffer et al. (2022) along with an algorithm that was shown to achieve no-regret. As the follower in a Stackelberg game observes the leader's action, there is information exchange. A generalization to multiple players has been studied in a series of papers (Cesa-Bianchi et al., 2016; 2020; 2021). In this line of work, players with a common action space form an arbitrary graph and are randomly activated in each round. Active players share information with their neighbors by broadcasting their observed loss, previously received neighbor losses, and their current strategy. The goal of the players is to minimize the network regret, defined with respect to the cumulative losses observed by active players over the rounds. The players, however, update their strategies according to their individually observed loss.
Although we consider players connected on a graph, our work differs significantly from (Cesa-Bianchi et al., 2016; 2020; 2021): e.g., only actions are observed between players, and player strategies are updated based on a common bandit reward.

Contributions:

We introduce the joint pseudo-regret, defined with respect to the cumulative reward where all the players observe the same bandit reward in each round. We provide an online learning algorithm for general consecutive-play games that relies on no-regret algorithms developed for the single-player multi-armed bandit problem. The main novelty of our contribution resides in the joint analysis of players with coupled rewards, where we derive upper bounds on the joint pseudo-regret and prove our algorithm to be no-regret in both the stochastic and adversarial settings. Furthermore, we quantify the penalty incurred by our decentralized setting in relation to the centralized setting.

2. PROBLEM FORMULATION

In this section, we formalize the consecutive structure of the game and introduce the joint pseudo-regret that will be used as a performance metric throughout. We consider a decentralized setting where, in each round of the game, players pick actions consecutively. The consecutive nature of the game allows players to observe preceding players' actions and may be modeled by a directed acyclic graph (DAG). For example, in Fig. 1, a seven-player game is illustrated in which player 1 initiates the game and her action is observed by players 2, 5, and 6. The observations available to the remaining players follow analogously. Note that for a two-player consecutive game, the DAG models a Stackelberg game. We let G = (V, E) denote a DAG where V denotes the vertices and E denotes the edges. For our setting, V constitutes the n different players and E = {(j, i) : j → i, j ∈ V, i ∈ V} describes the observation structure, where j → i indicates that player i observes the action of player j. Accordingly, a given player i ∈ V observes the actions of its direct parents, i.e., players j ∈ E_i = {k : (k, i) ∈ E}. Furthermore, each player i ∈ V is associated with a discrete action space A_i of size A_i. We denote by π_i(t) the mixed strategy of player i over the action space A_i in round t ∈ [T], such that π_i(t) = a with probability p_{i,a} for a ∈ A_i. In the special case when p_{i,a} = 1 for some a ∈ A_i, the strategy is referred to as pure. Let A_B denote the joint action space of players in a set B, given by the Cartesian product A_B = ×_{i∈B} A_i. If a player i has no parents, i.e., E_i = ∅, we use the convention |A_{E_i}| = 1.
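The observation structure above can be sketched in code. The following is a hypothetical encoding of a seven-player DAG consistent with the edges stated for Fig. 1 (1 → 2, 1 → 5, 1 → 6); the remaining edges and the binary action spaces are illustrative assumptions, not taken from the paper's figure.

```python
from itertools import product

# Hypothetical parent lists E_i for a seven-player game. Only the edges
# 1->2, 1->5, 1->6 are stated in the text; the rest are assumed here.
parents = {1: [], 2: [1], 3: [2], 4: [3], 5: [1], 6: [1, 5], 7: [6]}
actions = {i: ("a", "b") for i in parents}  # action space A_i (binary here)

def joint_action_space(B):
    """Joint action space A_B: the Cartesian product of A_i over i in B."""
    return list(product(*(actions[i] for i in sorted(B))))

# Convention from the text: a player with no parents has |A_{E_i}| = 1,
# which falls out naturally as the product over an empty set of spaces.
```

For instance, `joint_action_space(parents[6])` enumerates the four joint observations player 6 can receive from players 1 and 5, while `joint_action_space(parents[1])` contains the single empty tuple, matching the convention |A_{E_i}| = 1.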



¹ The convention is to use agents in learning applications and players in game-theoretic applications; we use the game-theoretic nomenclature in the remainder of the paper.

