DECENTRALIZED ONLINE BANDIT OPTIMIZATION ON DIRECTED GRAPHS WITH REGRET BOUNDS

Abstract

We consider a decentralized multiplayer game, played over T rounds, with a leader-follower hierarchy described by a directed acyclic graph. In each round, the graph structure dictates the order in which the players act and how players observe the actions of one another. At the end of each round, all players receive a joint bandit reward based on their joint action, which is used to update the player strategies towards the goal of minimizing the joint pseudo-regret. We present a learning algorithm inspired by the single-player multi-armed bandit problem and show that it achieves sub-linear joint pseudo-regret in the number of rounds for both adversarial and stochastic bandit rewards. Furthermore, we quantify the cost incurred due to the decentralized nature of our problem compared to the centralized setting.

1. INTRODUCTION

Decentralized multi-agent online learning concerns agents that simultaneously learn to behave over time in order to achieve their goals. Compared to the single-agent setup, novel challenges arise as agents may not share the same objectives, the environment becomes non-stationary, and information asymmetry may exist between agents (Yang & Wang, 2020). Traditionally, the multi-agent problem has been addressed either by relying on a central controller to coordinate the agents' actions or by letting the agents learn independently. However, access to a central controller may not be realistic, and independent learning suffers from convergence issues (Zhang et al., 2019). To circumvent these issues, a common approach is to drop the central coordinator and allow information exchange between agents (Zhang et al., 2018; 2019; Cesa-Bianchi et al., 2021). Decision-making that involves multiple agents is often modeled as a game and studied under the lens of game theory to describe the learning outcomes.1 Herein, we consider games with a leader-follower structure in which players act consecutively. For two players, such games are known as Stackelberg games (Hicks, 1935). Stackelberg games have been used to model diverse learning situations such as airport security (Balcan et al., 2015), poaching (Sessa et al., 2020), tax planning (Zheng et al., 2020), and generative adversarial networks (Moghadam et al., 2021). In a Stackelberg game, one is typically concerned with finding the Stackelberg equilibrium, sometimes called the Stackelberg-Nash equilibrium, in which the leader uses a mixed strategy and the follower is best-responding.
A Stackelberg equilibrium may be obtained by solving a bi-level optimization problem if the reward functions are known (Schäfer et al., 2020; Aussel & Svensson, 2020) or, otherwise, it may be learnt via online learning techniques (Bai et al., 2021; Zhong et al., 2021), e.g., no-regret algorithms (Shalev-Shwartz, 2012; Deng et al., 2019; Goktas et al., 2022). No-regret algorithms have emerged from the single-player multi-armed bandit problem as a means to alleviate the exploration-exploitation trade-off (Bubeck & Slivkins, 2012). An algorithm is called no-regret if the difference between the cumulative rewards of the learnt strategy and the single best action in hindsight is sub-linear in the number of rounds (Shalev-Shwartz, 2012). In the multi-armed bandit problem, rewards may be adversarial (based on randomness and previous actions), oblivious adversarial (random), or stochastic (independent and identically distributed) over time (Auer et al., 2002). Different assumptions on the bandit rewards yield different algorithms and regret bounds. Indeed, algorithms tailored for one kind of rewards are sub-optimal for others, e.g., the EXP3 algorithm due to Auer et al. (2002) yields the optimal scaling for adversarial rewards but not for
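As an illustration of the no-regret algorithms referenced above, the following is a minimal single-player EXP3 sketch in the spirit of Auer et al. (2002): the learner maintains exponential weights over arms, mixes in uniform exploration at rate gamma, and updates only the pulled arm via an importance-weighted reward estimate. The arm means and the choice gamma = 0.1 are illustrative assumptions, not parameters from the paper.

```python
import numpy as np

def exp3(K, T, reward_fn, gamma=0.1, seed=0):
    """Minimal EXP3 sketch. reward_fn(t, arm) must return a reward in [0, 1].

    gamma is the uniform-exploration rate (a tuning assumption here).
    Returns the cumulative reward collected over T rounds.
    """
    rng = np.random.default_rng(seed)
    weights = np.ones(K)
    total_reward = 0.0
    for t in range(T):
        probs = (1.0 - gamma) * weights / weights.sum() + gamma / K
        arm = rng.choice(K, p=probs)
        r = reward_fn(t, arm)  # bandit feedback: only the chosen arm's reward
        total_reward += r
        # Importance weighting keeps the reward estimate unbiased.
        weights[arm] *= np.exp(gamma * r / (probs[arm] * K))
        weights /= weights.max()  # rescale for numerical stability
    return total_reward

# Toy stochastic environment: arm 1 has the highest mean reward (0.7).
means = [0.3, 0.7, 0.4]
env_rng = np.random.default_rng(1)
total = exp3(K=3, T=5000, reward_fn=lambda t, a: float(env_rng.random() < means[a]))
```

On this toy instance the cumulative reward comfortably exceeds what uniform random play would collect (about 0.47 per round), since the weights concentrate on the best arm; in the decentralized multiplayer setting of this paper, the per-player update must instead be driven by the joint bandit reward.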
1 The convention is to use "agents" in learning applications and "players" in game-theoretic applications; we use the game-theoretic nomenclature in the remainder of the paper.

