FEDERATED NEURAL BANDITS

Abstract

Recent works on neural contextual bandits have achieved compelling performance due to their ability to leverage the strong representation power of neural networks (NNs) for reward prediction. Many applications of contextual bandits involve multiple agents who collaborate without sharing raw observations, thus giving rise to the setting of federated contextual bandits. Existing works on federated contextual bandits rely on linear or kernelized bandits, which may fall short when modeling complex real-world reward functions. To this end, this paper introduces the federated neural-upper confidence bound (FN-UCB) algorithm. To better exploit the federated setting, FN-UCB adopts a weighted combination of two UCBs: UCB^a allows every agent to additionally use the observations from the other agents to accelerate exploration (without sharing raw observations), while UCB^b uses an NN with aggregated parameters for reward prediction in a similar way to federated averaging for supervised learning. Notably, the weight between the two UCBs required by our theoretical analysis is amenable to an interesting interpretation, which emphasizes UCB^a initially for accelerated exploration and relies more on UCB^b later after enough observations have been collected to train the NNs for accurate reward prediction (i.e., reliable exploitation). We prove sub-linear upper bounds on both the cumulative regret and the number of communication rounds of FN-UCB, and empirically demonstrate its competitive performance.

1. INTRODUCTION

The stochastic multi-armed bandit is a prominent method for sequential decision-making problems due to its principled ability to handle the exploration-exploitation trade-off (Auer, 2002; Bubeck & Cesa-Bianchi, 2012; Lattimore & Szepesvári, 2020). In particular, the stochastic contextual bandit problem has received enormous attention due to its widespread real-world applications such as recommender systems (Li et al., 2010a), advertising (Li et al., 2010b), and healthcare (Greenewald et al., 2017). In each iteration of a stochastic contextual bandit problem, an agent receives a context (i.e., a d-dimensional feature vector) for each of the K arms, selects one of the K contexts/arms, and observes the corresponding reward. The goal of the agent is to sequentially pull the arms in order to maximize the cumulative reward (or equivalently, minimize the cumulative regret) in T iterations. To minimize the cumulative regret, linear contextual bandit algorithms assume that the rewards can be modeled as a linear function of the input contexts (Dani et al., 2008) and select the arms via classic methods such as upper confidence bound (UCB) (Auer, 2002) or Thompson sampling (TS) (Thompson, 1933), consequently yielding the Linear UCB (Abbasi-Yadkori et al., 2011) and Linear TS (Agrawal & Goyal, 2013) algorithms. The potentially restrictive assumption of a linear model was later relaxed by kernelized contextual bandit algorithms (Chowdhury & Gopalan, 2017; Valko et al., 2013), which assume that the reward function belongs to a reproducing kernel Hilbert space (RKHS) and hence model the reward function using kernel ridge regression or Gaussian process (GP) regression. However, this assumption may still be restrictive (Zhou et al., 2020), and the kernelized model may fall short when the reward function is very complex and difficult to model.
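As a concrete illustration, the Linear UCB selection rule can be sketched as below: the agent maintains a ridge-regression estimate of the reward parameter and adds an exploration bonus based on the uncertainty of each context. This is a minimal sketch; the regularizer `lam` and exploration parameter `beta` are illustrative choices rather than the theoretically calibrated values.

```python
import numpy as np

class LinUCB:
    """Minimal sketch of Linear UCB for a K-armed contextual bandit."""

    def __init__(self, d, lam=1.0, beta=1.0):
        self.V = lam * np.eye(d)   # regularized Gram matrix of observed contexts
        self.b = np.zeros(d)       # sum of reward-weighted contexts
        self.beta = beta           # exploration parameter (illustrative)

    def select(self, contexts):
        """contexts: (K, d) array; returns the index of the arm with highest UCB."""
        theta = np.linalg.solve(self.V, self.b)   # ridge-regression estimate
        V_inv = np.linalg.inv(self.V)
        # UCB = predicted reward + beta * context uncertainty
        bonus = np.sqrt(np.einsum('kd,de,ke->k', contexts, V_inv, contexts))
        return int(np.argmax(contexts @ theta + self.beta * bonus))

    def update(self, x, r):
        """Incorporate the observed context x and reward r."""
        self.V += np.outer(x, x)
        self.b += r * x
```

With `beta = 0` the rule reduces to pure exploitation of the ridge-regression estimate; larger `beta` favors arms whose contexts are poorly covered by past observations.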
To this end, neural networks (NNs), which excel at modeling complex real-world functions, have been adopted to model the reward function in contextual bandits, thereby leading to neural contextual bandit algorithms such as Neural UCB (Zhou et al., 2020) and Neural TS (Zhang et al., 2021). Due to their ability to use the highly expressive NNs for better reward prediction (i.e., exploitation), Neural UCB and Neural TS have been shown to outperform both linear and kernelized contextual bandit algorithms in practice. Moreover, the cumulative regrets of Neural UCB and Neural TS have been analyzed by leveraging the theory of the neural tangent kernel (NTK) (Jacot et al., 2018), hence making these algorithms both provably efficient and practically effective. We give a comprehensive review of the related works on neural bandits in App. A. The contextual bandit algorithms discussed above are only applicable to problems with a single agent. However, many modern applications of contextual bandits involve multiple agents who (a) collaborate with each other for better performance and yet (b) are unwilling to share their raw observations (i.e., the contexts and rewards). For example, companies may collaborate to improve their contextual bandits-based recommendation algorithms without sharing their sensitive user data (Huang et al., 2021b), while hospitals deploying contextual bandits for personalized treatment may collaborate to improve their treatment strategies without sharing their sensitive patient information (Dai et al., 2020). These applications naturally fall under the setting of federated learning (FL) (Kairouz et al., 2019; Li et al., 2021), which facilitates collaborative learning of supervised learning models (e.g., NNs) without sharing the raw data. In this regard, a number of federated contextual bandit algorithms have been developed to allow bandit agents to collaborate in the federated setting (Shi & Shen, 2021).
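The core idea behind Neural UCB can be illustrated as follows: the linear predictor is replaced by an NN, and the exploration bonus is built from the gradient of the network output with respect to its parameters (the neural tangent features). The one-hidden-layer network and all parameter choices below are hypothetical simplifications for illustration, not the actual algorithm.

```python
import numpy as np

def nn_forward_grad(x, W1, w2):
    """One-hidden-layer net f(x) = w2 . tanh(W1 x).
    Returns the prediction and the flattened gradient wrt (W1, w2)."""
    h = np.tanh(W1 @ x)
    f = w2 @ h
    dW1 = np.outer(w2 * (1.0 - h**2), x)      # chain rule through tanh
    grad = np.concatenate([dW1.ravel(), h])    # gradient plays the role of NTK features
    return f, grad

def neural_ucb_scores(contexts, W1, w2, V_inv, beta=1.0):
    """Neural UCB-style score: NN prediction + bonus from gradient features."""
    scores = []
    for x in contexts:
        f, g = nn_forward_grad(x, W1, w2)
        scores.append(f + beta * np.sqrt(g @ V_inv @ g))
    return np.array(scores)
```

As in Linear UCB, the matrix inverted for the bonus would be built from the gradient features of previously selected contexts; here it is passed in as `V_inv` for simplicity.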
We present a thorough discussion of the related works on federated contextual bandits in App. A. Notably, Wang et al. (2020) have adopted the Linear UCB policy and developed a mechanism to allow every agent to additionally use the observations from the other agents to accelerate exploration, while only requiring the agents to exchange some sufficient statistics instead of their raw observations. However, these previous works have only relied on either linear (Dubey & Pentland, 2020; Huang et al., 2021b) or kernelized (Dai et al., 2020; 2021) methods which, as discussed above, may lack the expressive power to model complex real-world reward functions (Zhou et al., 2020). Therefore, this naturally brings up the need to use NNs for better exploitation (i.e., reward prediction) in federated contextual bandits, thereby motivating the need for a federated neural contextual bandit algorithm. To develop a federated neural contextual bandit algorithm, an important technical challenge is how to leverage the federated setting to simultaneously (a) accelerate exploration by allowing every agent to additionally use the observations from the other agents without requiring the exchange of raw observations (in a similar way to that of Wang et al. (2020)), and (b) improve exploitation by further enhancing the quality of the NN for reward prediction through the federated setting (i.e., without requiring centralized training using the observations from all agents). In this work, we provide a theoretically grounded solution to tackle this challenge by deploying a weighted combination of two upper confidence bounds (UCBs). The first UCB, denoted by UCB^a, incorporates the neural tangent features (i.e., the random features embedding of NTK) into the Linear UCB-based mechanism adopted by Wang et al. (2020), which achieves the first goal of accelerating exploration.
The second UCB, denoted by UCB^b, adopts an aggregated NN whose parameters are the average of the parameters of the NNs trained by all agents using their local observations for better reward prediction (i.e., better exploitation in the second goal). Hence, UCB^b improves the quality of the NN for reward prediction in a similar way to the classic FL method of federated averaging (FedAvg) for supervised learning (McMahan et al., 2017). Notably, our choice of the weight between the two UCBs, which naturally arises during our theoretical analysis, has an interesting practical interpretation (Sec. 3.3): more weight is given to UCB^a in earlier iterations, which allows us to use the observations from the other agents to accelerate the exploration in the early stage; more weight is assigned to UCB^b only in later iterations, after every agent has collected enough local observations to train its NN for accurate reward prediction (i.e., reliable exploitation). Of note, our novel design of the weight (Sec. 3.3) is crucial for our theoretical analysis and may be of broader interest for future works on related topics. This paper introduces the first federated neural contextual bandit algorithm, which we call federated neural-upper confidence bound (FN-UCB) (Sec. 3). We derive an upper bound on its total cumulative regret from all N agents: R_T = Õ(d √(TN) + d_max N √T),¹ where d is the effective dimension of the contexts from all N agents and d_max denotes the maximum among the N individual effective dimensions.

¹ The Õ notation ignores all logarithmic factors.
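The weighted combination of the two UCBs can be sketched as follows. Note that the exponential-decay schedule and the parameter `t_switch` below are purely illustrative stand-ins: the actual weight in FN-UCB is prescribed by the theoretical analysis (Sec. 3.3), not chosen by hand.

```python
import numpy as np

def combined_ucb(ucb_a, ucb_b, t, t_switch=100.0):
    """Hypothetical weight schedule for combining the two UCBs:
    emphasize UCB^a (federated, accelerated exploration) in early
    iterations, then shift weight to UCB^b (aggregated-NN exploitation)
    once enough observations have been collected."""
    alpha = np.exp(-t / t_switch)   # decays from 1 toward 0 as t grows
    return alpha * ucb_a + (1.0 - alpha) * ucb_b
```

At t = 0 the combined score equals UCB^a; as t grows, the score approaches UCB^b, mirroring the interpretation that exploitation becomes reliable only after the NNs are well trained.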

