FEDERATED NEURAL BANDITS

Abstract

Recent works on neural contextual bandits have achieved compelling performance due to their ability to leverage the strong representation power of neural networks (NNs) for reward prediction. Many applications of contextual bandits involve multiple agents who collaborate without sharing raw observations, thus giving rise to the setting of federated contextual bandits. Existing works on federated contextual bandits rely on linear or kernelized bandits, which may fall short when modeling complex real-world reward functions. This paper therefore introduces the federated neural-upper confidence bound (FN-UCB) algorithm. To better exploit the federated setting, FN-UCB adopts a weighted combination of two UCBs: UCB^a allows every agent to additionally use the observations from the other agents to accelerate exploration (without sharing raw observations), while UCB^b uses an NN with aggregated parameters for reward prediction in a manner similar to federated averaging for supervised learning. Notably, the weight between the two UCBs required by our theoretical analysis admits an interesting interpretation: it emphasizes UCB^a initially for accelerated exploration and relies more on UCB^b later, after enough observations have been collected to train the NNs for accurate reward prediction (i.e., reliable exploitation). We prove sub-linear upper bounds on both the cumulative regret and the number of communication rounds of FN-UCB, and empirically demonstrate its competitive performance.
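To make the weighted combination concrete, the following is a minimal sketch in Python, not the paper's exact construction: the weight schedule, the constant t0, and the per-arm scores ucb_a and ucb_b are hypothetical stand-ins that only illustrate how a time-decaying weight shifts emphasis from UCB^a to UCB^b.

import numpy as np

def combined_score(ucb_a, ucb_b, t, t0=100.0):
    # Hypothetical weight schedule: decays with iteration t, so the
    # exploration-oriented UCB^a dominates early and the NN-based UCB^b
    # dominates later. FN-UCB's actual weight comes from its theoretical analysis.
    w = t0 / (t0 + t)
    return w * ucb_a + (1.0 - w) * ucb_b

# Usage: at iteration t, an agent picks the arm with the largest combined score.
per_arm_ucbs = [(1.2, 0.8), (0.9, 1.1), (1.0, 1.0)]  # made-up (UCB^a, UCB^b) pairs per arm
scores = [combined_score(ua, ub, t=50) for ua, ub in per_arm_ucbs]
chosen_arm = int(np.argmax(scores))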

1. INTRODUCTION

The stochastic multi-armed bandit is a prominent method for sequential decision-making problems due to its principled handling of the exploration-exploitation trade-off (Auer, 2002; Bubeck & Cesa-Bianchi, 2012; Lattimore & Szepesvári, 2020). In particular, the stochastic contextual bandit problem has received enormous attention due to its widespread real-world applications such as recommender systems (Li et al., 2010a), advertising (Li et al., 2010b), and healthcare (Greenewald et al., 2017). In each iteration of a stochastic contextual bandit problem, an agent receives a context (i.e., a d-dimensional feature vector) for each of the K arms, selects one of the K contexts/arms, and observes the corresponding reward. The goal of the agent is to sequentially pull the arms so as to maximize the cumulative reward (or equivalently, minimize the cumulative regret) over T iterations. To minimize the cumulative regret, linear contextual bandit algorithms assume that the rewards can be modeled as a linear function of the input contexts (Dani et al., 2008) and select the arms via classic methods such as upper confidence bound (UCB) (Auer, 2002) or Thompson sampling (TS) (Thompson, 1933), yielding the Linear UCB (Abbasi-Yadkori et al., 2011) and Linear TS (Agrawal & Goyal, 2013) algorithms, respectively. The potentially restrictive assumption of a linear model was later relaxed by kernelized contextual bandit algorithms (Chowdhury & Gopalan, 2017; Valko et al., 2013), which assume that the reward function belongs to a reproducing kernel Hilbert space (RKHS) and hence model the reward function using kernel ridge regression or Gaussian process (GP) regression. However, this assumption may still be restrictive (Zhou et al., 2020) and the kernelized
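As a concrete illustration of the Linear UCB arm-selection rule described above, the following is a minimal sketch under standard assumptions (ridge-regularized least squares with a fixed exploration parameter alpha); the variable names are ours for illustration, not those of the cited papers.

import numpy as np

def linucb_select(contexts, V, b, alpha):
    # Ridge-regression estimate of the unknown reward parameter: theta = V^{-1} b.
    theta = np.linalg.solve(V, b)
    scores = []
    for x in contexts:  # one d-dimensional context per arm
        mean = x @ theta                                    # predicted reward
        width = alpha * np.sqrt(x @ np.linalg.solve(V, x))  # exploration bonus
        scores.append(mean + width)
    return int(np.argmax(scores))

# One iteration of the contextual bandit protocol with K arms in d dimensions.
d, K = 5, 4
V, b = np.eye(d), np.zeros(d)       # sufficient statistics: V = I + sum x x^T, b = sum r x
rng = np.random.default_rng(0)
contexts = rng.normal(size=(K, d))  # context received for each of the K arms
arm = linucb_select(contexts, V, b, alpha=1.0)
x, r = contexts[arm], rng.normal()  # placeholder observed reward
V += np.outer(x, x)                 # update statistics for the next iteration
b += r * x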

