ACHIEVING NEAR-OPTIMAL INDIVIDUAL REGRET & LOW COMMUNICATIONS IN MULTI-AGENT BANDITS

Abstract

Cooperative multi-agent multi-armed bandits (CMA2B) study how distributed agents cooperatively play the same multi-armed bandit game. Most existing CMA2B work focuses on maximizing the group performance of all agents, i.e., the sum of all agents' individual performance (individual reward). However, in many applications, the performance of the system is more sensitive to the "bad" agent, the agent with the worst individual performance. For example, in a drone swarm, a "bad" agent may crash into other drones and severely degrade the system performance. In that case, the key to the learning algorithm design is to coordinate computational and communication resources among agents so as to optimize the individual learning performance of the "bad" agent. In CMA2B, maximizing the group performance is equivalent to minimizing the group regret of all agents, and maximizing the worst individual performance corresponds to minimizing the maximum (worst) individual regret among agents. Minimizing the maximum individual regret has been largely ignored in prior literature, and there is currently little work on how to minimize this objective with low communication overhead. In this paper, we propose an algorithm that is near-optimal with respect to both individual and group regrets. In addition, we propose a novel communication module for the algorithm that requires only O(log log T) communication rounds, where T is the number of decision rounds. We also conduct simulations to illustrate the advantage of our algorithm by comparing it with known baselines.
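To make the two objectives concrete, the sketch below simulates several independent agents, each running UCB1 on the same Bernoulli bandit, and then computes the group (pseudo-)regret versus the maximum individual regret. This is a toy illustration only; the arm means, horizon, and number of agents are hypothetical, and the agents here do not communicate, unlike the algorithm proposed in the paper.

```python
import math
import random

def ucb1_regret(means, T, rng):
    """Run UCB1 for one agent on Bernoulli arms; return its pseudo-regret."""
    K = len(means)
    counts = [0] * K      # number of pulls per arm
    sums = [0.0] * K      # cumulative reward per arm
    best = max(means)
    regret = 0.0
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1   # initialization: pull each arm once
        else:
            # UCB1 index: empirical mean + exploration bonus
            arm = max(range(K), key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]   # expected (pseudo-)regret increment
    return regret

rng = random.Random(0)
means = [0.2, 0.5, 0.9]              # toy Bernoulli arm means
regrets = [ucb1_regret(means, 2000, rng) for _ in range(5)]  # 5 agents

group_regret = sum(regrets)   # objective most prior CMA2B work minimizes
worst_regret = max(regrets)   # objective this paper focuses on
```

With no cooperation, every agent pays the full exploration cost, so the group regret scales linearly with the number of agents; the paper's point is that communication can drive down the worst agent's regret without that blow-up.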

1. INTRODUCTION

The stochastic multi-armed bandit problem is a classic sequential decision-making problem. Given K arms, a single agent repeatedly chooses one arm to pull and observes a stochastic reward from the pulled arm in each time slot. To maximize cumulative reward (or, equivalently, to minimize regret, the cumulative difference in reward between the optimal decision and the agent's choices), the agent must either pull an arm with a large empirical mean reward to greedily maximize reward (exploitation), or pull an arm whose reward estimate is highly uncertain so as to reduce that uncertainty and discover good arms (exploration). To model many real-life applications, e.g., cognitive radio with multiple users (Liu & Zhao, 2010; Jouini et al., 2010; Boursier & Perchet, 2019), clinical trials in multiple labs (Wang, 1991), recommendation systems with multiple servers (Agarwal et al., 2008; Li et al., 2010; Landgren et al., 2016), cooperative source search by multiple robots (Li et al., 2014; Jin et al., 2017), etc., one 

