ACHIEVING NEAR-OPTIMAL INDIVIDUAL REGRET & LOW COMMUNICATIONS IN MULTI-AGENT BANDITS

Abstract

Cooperative multi-agent multi-armed bandits (CMA2B) study how distributed agents cooperatively play the same multi-armed bandit game. Most existing CMA2B work focuses on maximizing the group performance of all agents, i.e., the sum of all agents' individual performance (individual reward). However, in many applications, the performance of the system is more sensitive to the "bad" agent, the agent with the worst individual performance. For example, in a drone swarm, a "bad" agent may crash into other drones and severely degrade system performance. In that case, the key to the learning algorithm design is to coordinate computational and communication resources among agents so as to optimize the individual learning performance of the "bad" agent. In CMA2B, maximizing the group performance is equivalent to minimizing the group regret of all agents, and maximizing the worst individual performance is equivalent to minimizing the maximum (worst) individual regret among agents. Minimizing the maximum individual regret was largely ignored in prior literature, and there is currently little work on how to minimize this objective with low communication overhead. In this paper, we propose an algorithm that is near-optimal in both individual and group regret. In addition, we propose a novel communication module within the algorithm that needs only O(log(log T)) communication times, where T is the number of decision rounds. We also conduct simulations to illustrate the advantage of our algorithm by comparing it to known baselines.

1. INTRODUCTION

The stochastic multi-armed bandit problem is a classic sequential decision-making problem. Given K arms, there is one agent who repeatedly chooses one arm to pull and observes a stochastic reward from the pulled arm in each time slot. To maximize cumulative reward (or minimize regret, the cumulative reward difference between the optimal decision and the agent's choices), the agent needs to pull either an arm with a large empirical mean reward to greedily maximize reward (exploitation), or an arm whose reward estimate is highly uncertain, so as to reduce that uncertainty and discover good arms (exploration). To model many real-life applications, e.g., cognitive radio with multiple users (Liu & Zhao, 2010; Jouini et al., 2010; Boursier & Perchet, 2019), clinical trials in multiple labs (Wang, 1991), recommendation systems with multiple servers (Agarwal et al., 2008; Li et al., 2010; Landgren et al., 2016), cooperative source search by multiple robots (Li et al., 2014; Jin et al., 2017), etc., one needs to extend the model to allow for more than one agent (M > 1) playing the same multi-armed bandit game. These agents cooperate with each other to minimize their regrets. We call this problem the cooperative multi-agent multi-armed bandits (CMA2B) problem and present it formally in §2. The most common objective in CMA2B is to minimize the aggregate regret of all M agents, dubbed group regret in this paper. This objective has been studied in the majority of prior work (Boursier & Perchet, 2019; Chawla et al., 2020; Huang et al., 2021; Shi et al., 2021b; Wang et al., 2020a;b).
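The single-agent exploration-exploitation trade-off described above is commonly handled by an upper-confidence-bound rule. The following is a minimal illustrative sketch of the classic UCB1 strategy (not this paper's algorithm); the function name, Bernoulli reward model, and confidence radius sqrt(2 ln t / n) are standard textbook choices, not taken from this work.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Illustrative single-agent UCB1 sketch on Bernoulli arms.

    Each round pulls the arm maximizing (empirical mean) + sqrt(2 ln t / pulls),
    balancing exploitation (first term) and exploration (second term).
    Returns the cumulative pseudo-regret over `horizon` rounds.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    pulls = [0] * k          # number of times each arm was pulled
    sums = [0.0] * k         # cumulative observed reward per arm
    best = max(arm_means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1      # initialization: pull each arm once
        else:
            arm = max(
                range(k),
                key=lambda i: sums[i] / pulls[i]
                + math.sqrt(2 * math.log(t) / pulls[i]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
        regret += best - arm_means[arm]  # pseudo-regret of this pull
    return regret
```

Because each suboptimal arm is pulled only O(log T) times, the cumulative regret grows logarithmically rather than linearly in the horizon.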
In addition to group regret, individual performance among agents is another important metric that is less studied in prior work on CMA2B. The performance of each individual agent is critical in many distributed-system applications. For example, in many distributed resource allocation scenarios with different agents in charge of the allocation, overall performance depends on the bottleneck agent rather than on the aggregate performance of all agents. This can also be seen in a computer network scenario, in which an ISP may apply learning-based algorithms (Ma et al., 2010; Jiang et al., 2018) to networking problems such as shortest-path routing, channel selection, etc. To ensure that users are served fairly, the underlying algorithms should provide approximately equal individual performance for each learning agent, which is equivalent to minimizing the bottleneck agent's individual regret. Moreover, in the network optimization literature, the max-min fairness metric, which maximizes the minimal individual reward, is widely used to measure a system's fairness (Srikant & Ying, 2013, §2.21), e.g., in fair queuing (Demers et al., 1989). Since regret is the opposite of reward, optimizing max-min fairness is likewise equivalent to minimizing the bottleneck agent's regret. Other fairness motivations can be found in political philosophy (Rawls, 2004). In this paper, we explicitly take into account the notion of minimizing the maximum individual regret and, for brevity, hereinafter refer to it as the individual regret.

Another important metric in CMA2B is the number of communication times among all agents. In some distributed systems, e.g., when agents are geographically distributed, communication among agents can be expensive. It is therefore important to design a cooperative learning algorithm that attains small group and individual regrets while incurring a small communication cost.
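The distinction between the two objectives above can be stated in a few lines of code. The following sketch (function names are illustrative, not from this paper) shows why minimizing group regret alone does not control the bottleneck agent: two allocations with identical group regret can differ sharply in their maximum individual regret.

```python
def group_regret(individual_regrets):
    """Group regret: the sum of all M agents' individual regrets."""
    return sum(individual_regrets)

def worst_individual_regret(individual_regrets):
    """The max-min-fairness-style objective studied in this paper:
    the regret of the bottleneck (worst) agent."""
    return max(individual_regrets)

# Two hypothetical outcomes with the same group regret (62):
unbalanced = [10, 12, 40]   # one agent bears most of the regret
balanced = [21, 21, 20]     # regret divided evenly across agents
```

Here `group_regret(unbalanced) == group_regret(balanced) == 62`, yet the bottleneck agent's regret drops from 40 to 21 under the balanced allocation; this is the sense in which evenly dividing group regret among agents minimizes the individual regret.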
It is also desirable to have a learning algorithm whose parameters can be tuned to trade off communication times against regret, as required by different applications.

Contributions. In §3, we present the UCB-TCOM algorithm that achieves not only a near-optimal group regret of O((K/∆_2) log T) but also a near-optimal problem-dependent individual regret of O((K/(M∆_2)) log T) with only O(log(log T)) communication times, where ∆_2 is the smallest reward mean gap between arms and T is the number of rounds. This is the first near-optimal algorithm on individual regret with efficient communications: previous low-communication algorithms, e.g., the leader-follower algorithm (Wang et al., 2020b), cannot achieve near-optimal individual regret, and previous near-optimal algorithms on individual regret, e.g., GosInE (Chawla et al., 2020), require high communication times (see related work below). UCB-TCOM achieves near-optimal individual regret by evenly dividing the group regret among all agents. To equalize the regrets of all agents, UCB-TCOM directs agents to pull arms synchronously: agents make decisions using only the common reward observations, i.e., those that have been broadcast to all agents. The communication policy TCOM (Tunable COMmunication) of UCB-TCOM is a parametric meta-algorithm that governs the communication of agents and can be executed on top of any underlying bandit learning algorithm. A salient feature of TCOM is that it can be tuned to balance regret and communication times. In particular, two parameters in TCOM determine the aggressiveness and frequency of communications among agents. Our analysis explicitly shows how communication times can be tuned from 0 to O(T). Finally, we report numerical results in §5.
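To build intuition for how O(log(log T)) communication times can suffice, note that a schedule whose epoch lengths square at every step (a doubly exponential schedule, as used in several low-communication bandit algorithms) covers a horizon T with only O(log log T) epoch boundaries. The sketch below is a hedged illustration of this counting argument only; TCOM's actual schedule and parameters are specified in §3, and the function name and starting epoch are assumptions for illustration.

```python
def squaring_epoch_boundaries(horizon, first_epoch=2):
    """Return the epoch boundaries t_1 < t_2 < ... < horizon where
    t_{j+1} = t_j ** 2.  If agents broadcast only at these boundaries,
    the number of communication rounds is O(log log T), since after j
    epochs the boundary has grown to first_epoch ** (2 ** j)."""
    boundaries = []
    t = first_epoch
    while t < horizon:
        boundaries.append(t)
        t = t * t  # squaring: doubly exponential growth
    return boundaries
```

For example, a horizon of T = 10^9 rounds is covered by just five boundaries (2, 4, 16, 256, 65536), and growing T to 10^18 adds only one more, matching the log log T scaling.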



A comparison summary of prior literature and this work (all regret bounds are problem-dependent, and we omit the 1/∆_2 factor).

Algorithm                              Individual regret      Group regret          Communication times
(unknown)                              O(K log T)             O(K log T)            O(K^2 M^2)
ComEx (Madhushani & Leonard, 2021)     O(K log T)             O(K log T)            O(KM log T)
GosInE (Chawla et al., 2020)           O((K/M + 2) log T)     O((K + 2M) log T)     Ω(log T)
Dec_UCB (Zhu et al., 2021a)            O((K/M) log T)         O(K log T)            O(MT)
UCB-TCOM (our algorithm)               O((K/M) log T)         O(K log T)            O(KM log(log T))

