DISTRIBUTED DIFFERENTIAL PRIVACY IN MULTI-ARMED BANDITS

Abstract

We consider the standard K-armed bandit problem under a distributed trust model of differential privacy (DP), which enables privacy guarantees without a trustworthy server. Under this trust model, previous work on private bandits largely focuses on achieving privacy via a shuffle protocol, where a batch of users' data is randomly permuted before being sent to a central server. This protocol achieves an (ε, δ) or approximate-DP guarantee at the price of an additive O(K log T √(log(1/δ))/ε) term in the T-step cumulative regret. In contrast, the optimal privacy cost for achieving a stronger (ε, 0) or pure-DP guarantee under the widely used central trust model is only O(K log T/ε), where, however, a trusted server is required. In this work, we aim to obtain a pure-DP guarantee under the distributed trust model while sacrificing no more regret than under the central trust model. We achieve this by designing a generic bandit algorithm based on successive arm elimination, where privacy is guaranteed by corrupting rewards with an equivalent discrete Laplace noise ensured by a secure computation protocol. We also show that our algorithm, when instantiated with Skellam noise and the secure protocol, ensures Rényi differential privacy, a stronger notion than approximate DP, under the distributed trust model with a privacy cost of O(K√(log T)/ε). Our theoretical findings are corroborated by numerical evaluations on both synthetic and real-world data.

1. INTRODUCTION

The multi-armed bandit (MAB) problem provides a simple but powerful framework for sequential decision-making under uncertainty with bandit feedback, and it has attracted a wide range of practical applications such as online advertising (Abe et al., 2003), product recommendations (Li et al., 2010) and clinical trials (Tewari & Murphy, 2017), to name a few. Along with its broad applicability, however, comes an increasing concern about privacy risks in MAB due to its intrinsic dependence on users' feedback, which could leak users' sensitive information (Pan et al., 2019). To alleviate this concern, the notion of differential privacy, introduced by Dwork et al. (2006) in the field of computer science theory, has recently been adopted to design privacy-preserving bandit algorithms (see, e.g., Mishra & Thakurta (2015); Tossou & Dimitrakakis (2016); Shariff & Sheffet (2018)). Differential privacy (DP) provides a principled way to mathematically prove privacy guarantees against adversaries with arbitrary auxiliary information about users. To achieve this, a differentially private bandit algorithm typically relies on well-tuned random noise to obscure each user's contribution to the output, depending on the privacy levels ε, δ: smaller values lead to stronger protection but also to worse utility (i.e., regret). For example, the central server of a recommendation system can use random noise to perturb its statistics on each item after receiving feedback (i.e., clicks/ratings) from users. This is often termed the central model (Dwork et al., 2014), since the central server has the trust of its users and hence direct access to their raw data. Under this model, an optimal private MAB algorithm with a pure DP guarantee (i.e., when δ = 0) is proposed in Sajed & Sheffet (2019), which only incurs an additive O(K log T/ε) term in the cumulative regret compared to the standard setting where privacy is not sought (Auer, 2002).
However, this high-trust model is not always feasible in practice, since users may not be willing to share their raw data directly with the server. This motivates the local model (Kasiviswanathan et al., 2011) of trust, where DP is achieved without a trusted server: each user perturbs her data prior to sharing it with the server. This ensures stronger privacy protection, but leads to a high cost in utility due to the large aggregated noise from all users. As shown in Ren et al. (2020), under the local model, private MAB algorithms have to incur a multiplicative 1/ε² factor in the regret rather than the additive term of the central model. In an attempt to recover the utility of the central model without relying on a trustworthy server as in the local model, an intermediate DP trust model called the distributed model has gained increasing interest, especially in the context of (federated) supervised learning (Kairouz et al., 2021b; Agarwal et al., 2021; Kairouz et al., 2021a; Girgis et al., 2021; Lowy & Razaviyayn, 2021). Under this model, each user first perturbs her data via a local randomizer, and then sends the randomized data to a secure computation function. This secure function can be leveraged to guarantee privacy through aggregated noise from distributed users. There are two popular secure computation functions: secure aggregation (Bonawitz et al., 2017) and secure shuffling (Bittau et al., 2017). The former often relies on cryptographic primitives to securely aggregate users' data so that the central server only learns the aggregated result, while the latter securely shuffles users' messages to hide their source. To the best of our knowledge, the distributed DP model is far less studied in online learning than in supervised learning, the only known results being for standard K-armed bandits in Tenenbaum et al. (2021), where secure shuffling is adopted.
Despite being pioneering work, the results of Tenenbaum et al. (2021) have several limitations: (i) the privacy guarantee is obtained only for approximate DP (δ > 0), while the stronger pure DP (δ = 0) guarantee is not achieved; (ii) the cost of privacy is a multiplicative log(1/δ) factor away from that of the central model, leading to a higher regret bound; (iii) the secure protocol works only for binary rewards (and is communication-intensive for real rewards).
Our contributions. In this work, we design the first communication-efficient MAB algorithm that satisfies pure DP in the distributed model while attaining the same regret bound as in the central model (see Table 1). We overcome several key challenges that arise in the design and analysis of distributed DP algorithms for bandits; we list these challenges and our proposed solutions below. (a) Private and communication-efficient algorithm design. Secure aggregation (SecAgg) works only in the integer domain due to an inherent modular operation (Bonawitz et al., 2017). Hence, leveraging it in bandits to achieve distributed DP with real rewards requires data quantization, discrete privacy noise and modular summation arithmetic in the algorithm design. To this end, we take a batch version of the successive arm elimination technique as the building block of our algorithm, and on top of it employ a privacy protocol tailored to discrete privacy noise and modular operations (see Algorithm 1). Instantiating the protocol at each user with Pólya random noise, we ensure that our algorithm satisfies pure DP in the distributed model. Moreover, the communication bits per user scale only logarithmically with the number of participating users in each batch. (b) Regret analysis under pure DP with SecAgg. While our pure DP guarantee exploits known results for the discrete Laplace mechanism, the utility analysis becomes challenging due to the modular clipping of SecAgg.
In fact, in supervised learning, no convergence rate is known for SGD under pure DP with SecAgg (although the same is well understood under the central model). This is because modular clipping makes gradient estimates biased, and hence standard convergence guarantees based on unbiased estimates do not hold. In bandits, by contrast, we work with zeroth-order observations to build estimates of arms' rewards, and require tight high-confidence tail bounds for these estimates to analyze convergence. To this end, relying on tail properties of the discrete Laplace distribution and a careful analysis of the modular operation, we prove a sublinear regret rate for our algorithm, which matches the optimal one in the central model, and thus achieves the optimal rate under pure DP (see Theorem 1). (c) Improved regret bound under RDP. While our main focus was to design the first bandit algorithm with pure distributed DP achieving the same regret rate as in the central model, our template protocol is general enough to obtain different privacy guarantees by tuning the noise at each user. We demonstrate this by achieving Rényi differential privacy (RDP) (Mironov, 2017) using Skellam random noise. RDP is a weaker notion of privacy than pure DP, but it is still stronger than approximate DP. It also provides tighter privacy accounting under composition than approximate DP. This is particularly useful for bandit algorithms when users may participate in multiple rounds, necessitating privacy composition. Hence, we focus on RDP with SecAgg and show that a tighter regret bound compared to pure DP can be achieved (see Theorem 2) by proving a novel tail bound for the Skellam distribution. We support our theoretical findings with extensive numerical evaluations over bandit instances generated from both synthetic and real-life data.
Finally, our analysis technique is general enough to recover the best-known regret bounds under the central and local DP models while using only discrete privacy noise (see Appendix H). This is important in practice, since continuous Laplace noise might leak privacy on finite computers due to floating-point arithmetic (Mironov, 2012), a drawback of existing central and local DP MAB algorithms.

2. PRELIMINARIES

In this section, we formally introduce the distributed differential privacy model in bandits. Before that, we recall the learning paradigm in multi-armed bandits and basic differential privacy definitions.
Learning model and regret in MAB. At each time slot t ∈ [T] := {1, . . . , T}, the agent (e.g., a recommender system) selects an arm a ∈ [K] (e.g., an advertisement), recommends it to a new user t and obtains an i.i.d. reward r_t (e.g., a rating indicating how much she likes it), sampled from a distribution over [0, 1] with mean µ_a. Let a* := argmax_{a∈[K]} µ_a be the arm with the highest mean and denote µ* := µ_{a*} for simplicity. Let ∆_a := µ* − µ_a be the gap in expected reward between the optimal arm a* and any other arm a. Further, let N_a(t) be the total number of times that arm a has been recommended to the first t users. The goal of the agent is to maximize its total reward, or equivalently to minimize the cumulative expected pseudo-regret, defined as E[Reg(T)] := T · µ* − E[Σ_{t=1}^T r_t] = E[Σ_{a∈[K]} ∆_a N_a(T)].
Differential privacy. Let D = [0, 1] be the data universe, and n ∈ N the number of unique users. We say D, D′ ∈ D^n are neighboring datasets if they differ only in the reward preference of a single user i ∈ [n]. We have the following standard definition of differential privacy (Dwork et al., 2006).
Definition 1 (Differential Privacy). For ε, δ > 0, a randomized mechanism M satisfies (ε, δ)-DP if for all neighboring datasets D, D′ and all events E in the range of M, we have P[M(D) ∈ E] ≤ e^ε · P[M(D′) ∈ E] + δ.
The special case of (ε, 0)-DP is often referred to as pure differential privacy, whereas, for δ > 0, (ε, δ)-DP is referred to as approximate differential privacy. We also consider a related notion of privacy called Rényi differential privacy (RDP) (Mironov, 2017), which allows for tighter composition than approximate differential privacy. Definition 2 (Rényi Differential Privacy).
For α > 1, a randomized mechanism M satisfies (α, ε(α))-RDP if for all neighboring datasets D, D′, we have D_α(M(D), M(D′)) ≤ ε(α), where D_α(P, Q) is the Rényi divergence (of order α) of the distribution P from the distribution Q, given by D_α(P, Q) := (1/(α−1)) log E_{x∼Q}[(P(x)/Q(x))^α].
Distributed differential privacy. A distributed bandit learning protocol P = (R, S, A) consists of three parts: (i) a (local) randomizer R at each user's side, (ii) an intermediate secure protocol S, and (iii) an analyzer A at the central server. Each user i first locally applies the randomizer R to her raw data (i.e., reward) D_i, and sends the randomized data to a secure computation protocol S (e.g., secure aggregation or shuffling). This intermediate secure protocol S takes a batch of users' randomized data and generates inputs to the central server, which applies an analyzer A to compute the output (e.g., action) from the messages received from S. The secure computation protocol S has two main variations: secure shuffling and secure aggregation. Both essentially operate on a batch of users' randomized data and guarantee that the central server cannot infer any individual's data, while the total noise in the inputs to the analyzer provides a high privacy level. To adapt both into our MAB protocol, it is natural to divide participating users into batches. For each batch b ∈ [B] with n_b users, the output of S is given by S ∘ R^{n_b}(D) := S(R(D_1), . . . , R(D_{n_b})). The goal is to guarantee that the view of all B batches' outputs satisfies DP. To this end, we define the (composite) mechanism M_P = (S ∘ R^{n_1}, . . . , S ∘ R^{n_B}) and require that M_P satisfies DP. In the central DP model, the privacy burden lies with the central server (in particular, the analyzer A), which needs to inject the necessary random noise to achieve privacy. On the other hand, in the local DP model, each user's data is privatized by the local randomizer R.
In contrast, in the distributed DP model, privacy without a trusted central server is achieved by ensuring that the inputs to the analyzer A already satisfy differential privacy. Specifically, by properly designing the intermediate protocol S and the noise level in the randomizer R, one can ensure that the final noise in the data aggregated over a batch of users matches the noise that would otherwise have been added in the central model by the trusted server. Through this, the distributed DP model makes it possible to achieve the same level of utility as the central model without a trustworthy central server.
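For concreteness, the Rényi divergence in Definition 2 can be evaluated directly for distributions on a finite support. The short sketch below is our own illustration (assuming numpy); it checks the basic sanity property D_α(P, P) = 0.

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """D_alpha(P || Q) = 1/(alpha - 1) * log E_{x~Q}[(P(x)/Q(x))^alpha]
    for distributions given as probability vectors on a common finite support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    # E_{x~Q}[(P/Q)^alpha] = sum_x q(x) * (p(x)/q(x))^alpha
    return np.log(np.sum(q * (p / q) ** alpha)) / (alpha - 1)

# Identical distributions have zero divergence at every order alpha > 1.
print(renyi_divergence([0.3, 0.7], [0.3, 0.7], alpha=2.0))
```

An RDP bound ε(α) for a mechanism is precisely a uniform bound on this quantity over all neighboring input datasets.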

3. A GENERIC ALGORITHM FOR PRIVATE BANDITS

In this section, we propose a generic algorithmic framework (Algorithm 1) for multi-armed bandits under the distributed privacy model.
Batch-based successive arm elimination. Our algorithm builds upon the classic idea of successive arm elimination (Even-Dar et al., 2006), with the additional incorporation of batches and a black-box protocol P = (R, S, A) to achieve distributed differential privacy. It divides the time horizon T into batches of exponentially increasing size and eliminates sub-optimal arms successively. To this end, for each active arm a in batch b, it first prescribes arm a to a batch of l(b) = 2^b new users. After pulling the prescribed action a, each user applies the local randomizer R to her reward and sends the randomized reward to the intermediary function S, which runs a secure computation protocol (e.g., secure aggregation or secure shuffling) over the l(b) randomized rewards. Then, upon receiving the outputs of S, the server applies the analyzer A to compute the sum of rewards for arm a in batch b. Based on these private estimates, the server updates the active set of arms by eliminating those that are sub-optimal with high confidence: Φ(b+1) = {a ∈ Φ(b) : UCB_a(b) ≥ max_{a′∈Φ(b)} LCB_{a′}(b)}, where UCB_a(b) and LCB_a(b) are upper and lower confidence bounds on the mean reward of arm a. There is one key difference between our algorithm and the VB-SDP-AE algorithm of Tenenbaum et al. (2021). At the start of a batch, VB-SDP-AE uses all the past data to compute reward estimates. In contrast, we adopt the idea of forgetting and use only the data of the last completed batch.
Distributed DP protocol via discrete privacy noise. Inspired by Balle et al. (2020); Cheu & Yan (2021), we provide a general template protocol P for the distributed DP model, which relies only on discrete privacy noise. The local randomizer R receives each user i's real-valued data x_i and encodes it as an integer via fixed-point encoding with precision g > 0 and randomized rounding.
Then, it generates a discrete noise sample, whose distribution depends on the specific privacy-regret trade-off requirement (to be discussed later under specific mechanisms). Next, it adds the random noise to the encoded reward, clips the sum modulo m ∈ N and sends the final integer y_i as input to the secure computation function S. We mainly focus on secure aggregation (SecAgg) for S here. SecAgg is treated as a black-box function, as in previous work on supervised learning (Kairouz et al., 2021a), which implements the following procedure: given n users and their randomized messages y_i ∈ Z_m (i.e., integers in {0, 1, . . . , m−1}) obtained via R, the SecAgg function S securely computes the modular sum of the n messages, y = (Σ_{i=1}^n y_i) mod m, while revealing no further information on individual messages to a potential attacker, i.e., it is perfectly secure. Details of engineering implementations of SecAgg are beyond the scope of this paper; see Appendix G for a brief discussion. The job of the analyzer A is to compute the sum of rewards within a batch as accurately as possible. It uses an accuracy parameter τ ∈ R and the precision g to correct for possible underflow due to the modular operation and for bias due to encoding. To sum up, the end goal of our protocol P = (R, S, A) is to provide the required privacy protection while guaranteeing an output z ≈ Σ_{i=1}^n x_i with high probability, which is the key to our privacy and regret analysis in the following sections.
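The end-to-end shape of P = (R, S, A) can be sketched as follows. This is a minimal illustration under our own assumptions: the noise hook is left as a zero-noise stand-in (the Pólya or Skellam samplers of the next sections would plug in), and the parameter values n, g, τ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def randomizer(x, g, m, noise_fn):
    """R: fixed-point encode x in [0,1] with precision g via randomized
    rounding, add discrete noise, clip modulo m."""
    scaled = x * g
    low = int(np.floor(scaled))
    xi = low + (rng.random() < scaled - low)   # randomized rounding to an integer
    return (xi + noise_fn()) % m

def secagg(messages, m):
    """S: the server only ever sees the modular sum of the batch."""
    return sum(messages) % m

def analyzer(y, n, g, m, tau):
    """A: undo a possible modular wrap-around (underflow) and rescale."""
    if y > n * g + tau:        # the true sum must lie in [-tau, n*g + tau]
        return (y - m) / g     # correction for underflow
    return y / g

n, g, tau = 1000, 100, 50
m = n * g + 2 * tau + 1
x = rng.random(n)                              # one batch of rewards
zero_noise = lambda: 0                         # stand-in; plug in Polya/Skellam here
y = secagg([randomizer(xi, g, m, zero_noise) for xi in x], m)
z = analyzer(y, n, g, m, tau)
print(abs(z - x.sum()))                        # only small rounding error remains
```

With zero noise, the output z deviates from Σ_i x_i only by the randomized-rounding error, which concentrates at the rate the analysis in Section 6 exploits.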

4. ACHIEVING PURE DP IN THE DISTRIBUTED MODEL

In this section, we show that Algorithm 1 achieves pure DP in the distributed model via secure aggregation. To do so, we need to carefully determine the amount of (discrete) noise in R so that the total noise in a batch provides (ε, 0)-DP. One natural choice is discrete Laplace noise.
Definition 4 (Discrete Laplace Distribution). Let b > 0. A random variable X has a discrete Laplace distribution with scale parameter b, denoted by Lap_Z(b), if it has p.m.f. given by P[X = x] = ((e^{1/b} − 1)/(e^{1/b} + 1)) · e^{−|x|/b} for all x ∈ Z.
A key property of the discrete Laplace distribution that we will use is its infinite divisibility, which allows us to simulate it in a distributed way (Goryczka & Xiong, 2015, Theorem 5.1).
Fact 1 (Infinite Divisibility of Discrete Laplace). A random variable X has a Pólya distribution with parameters r > 0, β ∈ [0, 1], denoted by Pólya(r, β), if it has p.m.f. given by P[X = x] = (Γ(x + r)/(x! Γ(r))) β^x (1 − β)^r for all x ∈ N. Now, for any n ∈ N, let {γ_i^+, γ_i^−}_{i∈[n]} be 2n i.i.d. samples from Pólya(1/n, e^{−1/b}); then the random variable Σ_{i=1}^n (γ_i^+ − γ_i^−) is distributed as Lap_Z(b).
Armed with the above fact and the properties of discrete Laplace noise (see Fact 3 in Appendix K), we obtain the following main theorem, which shows that the same regret as in the central model is achieved under the distributed model via SecAgg.
Theorem 1 (Pure DP via SecAgg). Fix ε > 0 and T ∈ N. For each batch b, let the noise for the i-th user in the batch be η_i = γ_i^+ − γ_i^−, where γ_i^+, γ_i^− are i.i.d. Pólya(1/n, e^{−ε/g}); set n = l(b), g = ⌈ε√n⌉, τ = ⌈(g/ε) log(2T)⌉ and m = ng + 2τ + 1. Then, Algorithm 1 achieves (ε, 0)-DP in the distributed model. Moreover, setting β(b) = O(√(log(|Φ(b)|b²T)/(2l(b))) + (2 log(|Φ(b)|b²T))/(ε l(b))), it enjoys expected regret E[Reg(T)] = O(Σ_{a∈[K]:∆_a>0} log T/∆_a + (K log T)/ε).
Theorem 1 achieves optimal regret under pure DP.
Theorem 1 achieves the same regret bound as the one obtained in Sajed & Sheffet (2019) under the central trust model with continuous Laplace noise. Moreover, it matches the lower bound under pure DP from Shariff & Sheffet (2018), indicating that the bound is indeed tight. Note that we achieve this rate under the distributed trust model, a stronger notion of privacy protection than the central model, while using only discrete noise.
Communication bits. Algorithm 1 needs to communicate O(log m) bits per user to the secure protocol S, i.e., the number of communicated bits scales logarithmically with the batch size. In contrast, the number of communication bits required by existing distributed DP bandit algorithms that work with real-valued rewards (as we consider here) scales polynomially with the batch size (Chowdhury & Zhou, 2022b; Garcelon et al., 2022).
Remark 1 (Pure DP via Secure Shuffling). It turns out that one can achieve the same privacy and regret guarantees (orderwise) using a relaxed SecAgg protocol. Building on this result, we also establish pure DP under shuffling while again maintaining the same regret bound as the central model (see Theorem 3 in Appendix D.2). This improves the state-of-the-art result for MAB with shuffling (Tenenbaum et al., 2021) in terms of both privacy and regret.
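Fact 1's distributed simulation of discrete Laplace can be checked numerically. The sketch below is our own illustration (assuming numpy): it samples Pólya(r, β) via its standard Gamma-Poisson mixture representation and verifies that summing the n per-user differences reproduces discrete Laplace statistics, e.g. P[X = 0] = (e − 1)/(e + 1) ≈ 0.462 for Lap_Z(1).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_polya(r, beta, size):
    """Polya(r, beta) via its Gamma-Poisson mixture: Poisson draws whose
    rates are Gamma(shape=r, scale=beta/(1-beta)) have pmf
    Gamma(x+r)/(x! Gamma(r)) * beta^x * (1-beta)^r."""
    return rng.poisson(rng.gamma(shape=r, scale=beta / (1 - beta), size=size))

def distributed_discrete_laplace(n, b, size):
    """Per Fact 1: summing gamma_i^+ - gamma_i^- over n users, each
    Polya(1/n, e^{-1/b}), yields exactly Lap_Z(b)."""
    beta = np.exp(-1.0 / b)
    gp = sample_polya(1.0 / n, beta, (n, size)).sum(axis=0)
    gm = sample_polya(1.0 / n, beta, (n, size)).sum(axis=0)
    return gp - gm

# Empirical check against Lap_Z(1): P[X = 0] = (e - 1)/(e + 1) ~ 0.4621
xs = distributed_discrete_laplace(n=20, b=1.0, size=100_000)
print(xs.mean(), (xs == 0).mean())
```

In Algorithm 1, each of the l(b) users in a batch contributes one such γ_i^+ − γ_i^− pair, so no single user (and no server) ever holds the full Laplace noise.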

5. ACHIEVING RDP IN THE DISTRIBUTED MODEL

A natural question to ask is whether one can obtain better regret by sacrificing a small amount of privacy. We consider the notion of RDP (see Definition 2), which is a weaker notion of privacy than pure DP. However, it avoids the possible catastrophic privacy failures of approximate DP, and also provides tighter privacy accounting under composition (Mironov, 2017). To achieve an RDP guarantee using discrete noise, we consider the Skellam distribution, which has recently been introduced in private federated learning (Agarwal et al., 2021). A key challenge in the regret analysis of our bandit algorithm is to characterize the tail behavior of the Skellam distribution. This is different from federated learning, where characterizing the variance suffices. In Proposition 1, we prove that Skellam has sub-exponential tails, which not only is key to our regret analysis, but could also be of independent interest. Below is the formal definition of the Skellam distribution.
Definition 5 (Skellam Distribution). A random variable X has a Skellam distribution with mean µ and variance σ², denoted by Sk(µ, σ²), if it has probability mass function P[X = x] = e^{−σ²} I_{x−µ}(σ²) for all x ∈ Z, where I_ν(·) is the modified Bessel function of the first kind.
To sample from the Skellam distribution, one can rely on existing procedures for Poisson sampling. This is because if X = N_1 − N_2, where N_1, N_2 are i.i.d. Poisson(σ²/2), then X is Sk(0, σ²) distributed. Moreover, due to this fact, Skellam is closed under summation, i.e., if X_1 ∼ Sk(µ_1, σ_1²) and X_2 ∼ Sk(µ_2, σ_2²) are independent, then X_1 + X_2 ∼ Sk(µ_1 + µ_2, σ_1² + σ_2²).
Proposition 1 (Sub-exponential Tail of Skellam). Let X ∼ Sk(0, σ²). Then, X is (2σ², √2)-sub-exponential. Hence, for any p ∈ (0, 1], with probability at least 1 − p, |X| ≤ 2σ√(log(2/p)) + √2 log(2/p).
With the above result, we can establish the following privacy and regret guarantees for Algorithm 1. Theorem 2 (RDP via SecAgg).
Fix ε > 0, T ∈ N and a scaling factor s ≥ 1. For each batch b, let the noise for the i-th user be η_i ∼ Sk(0, g²/(nε²)); set n = l(b), g = ⌈sε√n⌉, τ = ⌈(2g/ε)√(log(2T)) + √2 log(2T)⌉ and m = ng + 2τ + 1. Then, Algorithm 1 achieves (α, ε(α))-RDP in the distributed model for all α = 2, 3, . . ., with ε(α) = αε²/2 + min{(2α−1)ε²/(4s²) + 3ε/(2s³), 3ε²/(2s)}. Moreover, setting β(b) = O(√(log(|Φ(b)|b²T)/(2l(b))) + (1 + 1/s) log(|Φ(b)|b²T)/(ε l(b))), it enjoys expected regret E[Reg(T)] = O(Σ_{a∈[K]:∆_a>0} log T/∆_a + (K√(log T))/ε + (K log T)/(sε)).
Privacy-Regret-Communication Trade-off. Observe that the scaling factor s allows us to achieve different trade-offs. As s increases, both privacy and regret performance improve. In fact, for a sufficiently large value of s, the third term in the regret bound becomes negligible, and we obtain an improved regret bound compared to Theorem 1. Moreover, the RDP guarantee improves to ε(α) ≈ αε²/2, which is the standard RDP rate of the Gaussian mechanism (Mironov, 2017). A larger s leads to an increase in the communicated bits per user, but the increase is only logarithmic, since Algorithm 1 needs to communicate O(log m) bits to the secure protocol S.
RDP to Approximate DP. To shed more light on Theorem 2, we convert our RDP guarantee to approximate DP for a sufficiently large s. Under the setup of Theorem 2, for sufficiently large s, one can achieve (O(ε), δ)-DP with regret O(Σ_{a:∆_a>0} log T/∆_a + K√(log T log(1/δ))/ε) (via Lemma 10 in Appendix K). The implications of this conversion are three-fold. First, this regret bound is an O(√(log T)) factor tighter than that achieved by Tenenbaum et al. (2021) using a shuffle protocol. Second, it yields better regret than the bound achieved under (ε, 0)-DP in Theorem 1 when δ > 1/T. This observation is consistent with the fact that a weaker privacy guarantee typically warrants a better utility bound.
Third, this conversion via RDP also yields a gain of O(√(log(1/δ))) in the regret when dealing with privacy composition (e.g., when participating users across different batches are not unique) compared to Tenenbaum et al. (2021), which relies only on approximate DP (see Appendix I for details). This follows from the fact that RDP composes more tightly than approximate DP.
Remark 2 (Achieving RDP with discrete Gaussian). One can also achieve RDP using discrete Gaussian noise (Canonne et al., 2020). Here, we work with Skellam noise since it is closed under summation and enjoys an efficient sampling procedure, as opposed to the discrete Gaussian (Agarwal et al., 2021). Nevertheless, as proof of the flexibility of our framework, we show in Appendix F that Algorithm 1 with discrete Gaussian noise guarantees RDP with a similar regret bound.
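The Poisson-difference sampler and the closure-under-summation property, which together determine each user's noise share, can be sketched as follows (our own illustration, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_skellam(sigma2, size):
    """Sk(0, sigma^2) as the difference of two independent Poisson(sigma^2/2)."""
    return rng.poisson(sigma2 / 2, size) - rng.poisson(sigma2 / 2, size)

sigma2 = 4.0
x = sample_skellam(sigma2, 100_000)
print(x.mean(), x.var())              # mean ~ 0, variance ~ sigma2

# Closure under summation: n per-user draws from Sk(0, sigma2/n) sum to
# Sk(0, sigma2) -- this is how each user's noise share is chosen.
n = 10
s = sample_skellam(sigma2 / n, (n, 100_000)).sum(axis=0)
print(s.var())                        # again ~ sigma2
```

In Theorem 2 the per-user share is Sk(0, g²/(nε²)), so the batch total is Sk(0, g²/ε²), exactly the noise level the server-side analysis assumes.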

6. KEY TECHNIQUES: OVERVIEW

Now, we provide an overview of the key techniques behind our privacy and regret guarantees. We show that the results of Theorems 1 and 2 can be obtained via a clean generic analytical framework, which not only covers the analysis of distributed pure DP/RDP with SecAgg, but also offers a unified view of private MAB under the central, local and distributed DP models. As in many private learning algorithms, the key is to characterize the impact of the added privacy noise on utility. In our case, this reduces to capturing the tail behavior of the total noise n_a(b), i.e., the deviation of the privately computed reward sum for arm a in batch b from the true sum. Informally, our generic regret bound (Lemma 1) states that if, with high probability, the noise satisfies a tail bound of the form N = σ√(log(1/p)) + h log(1/p), then E[Reg(T)] = O(Σ_{a∈[K]:∆_a>0} log T/∆_a + Kσ√(log T) + Kh log T). An acute reader may note that the bound N on the noise is the tail bound of a sub-exponential distribution, and it reduces to a sub-Gaussian tail bound when h = 0. Our SecAgg protocol P with discrete Laplace noise (as in Theorem 1) satisfies this bound with σ = √2/ε, h = 1/ε. Similarly, our protocol with Skellam noise (as in Theorem 2) satisfies it with σ = O(1/ε), h = 1/(sε). Therefore, we can build on this general result to directly obtain our regret bounds. In the following, we present the high-level ideas behind the privacy and regret analysis in the distributed DP model.
Privacy. For distributed DP, by definition, the view of the server during the entire algorithm needs to be private. Since each user only contributes once, by parallel composition of DP it suffices to ensure that each view y_a(b) (line 11 in Algorithm 1) is private. To this end, under SecAgg, the distribution of y_a(b) can be simulated via (Σ_i y_i) mod m, which further reduces to (Σ_i (x̃_i + η_i)) mod m by the distributive property of the modular sum. Now, consider a mechanism M that accepts an input dataset {x̃_i}_i and outputs Σ_i (x̃_i + η_i). By post-processing, it suffices to show that M satisfies pure DP or RDP. To this end, the variance σ²_tot of the total noise Σ_i η_i needs to scale with the sensitivity of Σ_i x̃_i.
Thus, each user within a batch only needs to add noise with variance σ²_tot/n. Finally, by the distributional properties of the particular noise, one can show that M is pure DP or RDP, and hence obtain the privacy guarantees.
Regret. Thanks to Lemma 1, we only need to focus on the tail of n_a(b). To this end, fix any batch b and arm a. In Algorithm 1 we have y = y_a(b), x_i = r_i^a(b), n = l(b) for P, and we need to establish that with probability at least 1 − p, for some σ and h, |A(y) − Σ_i x_i| ≤ O(σ√(log(1/p)) + h log(1/p)). (1) To get this bound, inspired by Balle et al. (2020); Cheu & Yan (2021), we split the left-hand side into Term (i) = |A(y) − Σ_i x̃_i/g| and Term (ii) = |Σ_i x̃_i/g − Σ_i x_i|, where x̃_i is the encoded reward; Term (i) captures the error due to privacy noise and the modular operation, while Term (ii) captures the error due to randomized rounding. In particular, Term (ii) can easily be bounded via a sub-Gaussian tail since the rounding noise is bounded. Term (i) requires care for possible underflow due to the modular operation, which we handle by considering two different cases (see lines 28-30 in Algorithm 1). In both cases, one can show that Term (i) is upper bounded by τ/g with high probability, where τ is the tail bound on the total privacy noise Σ_i η_i. Thus, depending on the particular privacy noise and parameter choices, one can find σ and h such that equation (1) holds, and hence obtain the corresponding regret bound via Lemma 1.
Remark 3. As a by-product of our generic analysis technique, Algorithm 1 and the privacy protocol P, along with Lemma 1, provide a new and structured way to design and analyze private MAB algorithms under the central and local models with discrete privacy noise (see Appendix H for details). This enables us to reap the benefits of working with discrete noise (e.g., finite-computer representations, bit communications) in all three trust models (central, local and distributed).
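The decomposition into Terms (i) and (ii) can be checked numerically. The sketch below is a hedged stand-in: it uses bounded uniform integer noise instead of the actual privacy noise, and the parameter values n, g, τ are our own illustrative choices following the recipe m = ng + 2τ + 1.

```python
import numpy as np

rng = np.random.default_rng(0)

n, g, tau = 256, 64, 200
m = n * g + 2 * tau + 1

x = rng.random(n)                           # true rewards in [0, 1]
enc = np.floor(x * g).astype(int)
enc += rng.random(n) < (x * g - enc)        # randomized rounding to integers
eta = rng.integers(-3, 4, size=n)           # bounded stand-in privacy noise

y = (enc + eta).sum() % m                   # what SecAgg hands the analyzer
z = (y - m) / g if y > n * g + tau else y / g   # lines 28-30: underflow correction

term_i = abs(z - enc.sum() / g)             # error from noise + modular operation
term_ii = abs(enc.sum() / g - x.sum())      # error from randomized rounding
print(term_i, term_ii)
```

Whenever the total noise Σ_i η_i stays within [−τ, τ], Term (i) is at most τ/g, and Term (ii) concentrates like a sum of n bounded zero-mean rounding errors, matching the two tail bounds used in the proof.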

7. SIMULATION RESULTS

We empirically evaluate the regret performance of our successive elimination scheme with the SecAgg protocol (Algorithm 1) under the distributed trust model, abbreviated Dist-DP-SE and Dist-RDP-SE when the randomizer R is instantiated with Pólya noise (for pure DP) and Skellam noise (for RDP), respectively. We compare them with the DP-SE algorithm of Sajed & Sheffet (2019), which achieves optimal regret under pure DP in the central model, but works only with continuous Laplace noise. We fix the confidence level p = 0.1 and study comparative performance under varying privacy levels (ε < 1 for synthetic data, ε ≥ 1 for real data). We plot the time-average regret Reg(T)/T in Figure 1 by averaging results over 20 randomly generated bandit instances.
Bandit instances. In the top panel, similar to Vaswani et al. (2020), we consider easy and hard MAB instances with K = 10 arms: in the former, arm means are sampled uniformly in [0.25, 0.75], while in the latter, they are sampled in [0.45, 0.55]. We consider real rewards, sampled from Gaussian distributions with the aforementioned means and projected to [0, 1]. In the bottom panel, we generate bandit instances from the Microsoft Learning to Rank dataset MSLR-WEB10K (Qin & Liu, 2013). The dataset consists of 1,200,192 rows and 138 columns, where each row corresponds to a query-url pair. The first column is the relevance label 0, 1, . . . , 4 of the pair, which we take as the reward. The second column denotes the query id, and the remaining 136 columns are features of a query-url pair. We cluster the data by running the K-means algorithm with K = 50, treat each cluster as a bandit arm, and set its mean reward to the empirical mean of the individual relevance labels in the cluster. This yields a bandit setting with K = 50 arms.
Observations. We observe that as T becomes large, the regret performance of Dist-DP-SE matches the regret of DP-SE.
The slight gap in the small-T regime is the cost we pay to achieve distributed privacy using discrete noise without access to a trusted server (for higher ε values, this gap is even smaller). In addition, we find that a relatively small scaling factor (s = 10) provides a considerable gain in regret under RDP compared to pure DP, especially when ε is small (i.e., when the cost of privacy is not dominated by the non-private part of the regret). The experimental findings are consistent with our theoretical results. We note that our simulations are proof-of-concept only and we did not tune any hyperparameters. More details and additional plots are given in Appendix J.
Concluding remarks. We show that MAB under the distributed trust model can achieve pure DP while maintaining the same regret as under the central model. In addition, RDP is achieved in MAB under the distributed trust model for the first time. Both results are obtained via a unified algorithm design and performance analysis. More importantly, our work opens the door to a promising research direction: private online learning with distributed DP guarantees, including contextual bandits and reinforcement learning.
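As a concrete illustration of the real-data construction in Section 7, the sketch below builds bandit arms by clustering synthetic stand-in features with a minimal Lloyd's k-means (the actual MSLR loading and K = 50 are omitted; all sizes and names here are our own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    """Minimal Lloyd's k-means; returns a cluster label for each row of X."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Stand-in for MSLR rows: feature vectors plus a relevance label in {0,...,4}.
n_rows, n_feat, K = 1000, 8, 5      # the paper clusters the real data into K = 50 arms
X = rng.normal(size=(n_rows, n_feat))
relevance = rng.integers(0, 5, size=n_rows)

labels = kmeans(X, K)
# Each (non-empty) cluster becomes an arm whose mean reward is the empirical
# mean of the relevance labels that fall in it.
arm_means = np.array([relevance[labels == j].mean()
                      for j in range(K) if np.any(labels == j)])
print(arm_means.round(2))
```

The resulting arm_means vector plays the role of the µ_a's from which rewards are drawn in the simulations.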

A OTHER RELATED WORK

Private Multi-Armed Bandits. In addition to stochastic multi-armed bandits under the central model in Mishra & Thakurta (2015); Tossou & Dimitrakakis (2016); Sajed & Sheffet (2019), different variants of differentially private bandits have been studied, including adversarial bandits (Tossou & Dimitrakakis, 2017; Agarwal & Singh, 2017), heavy-tailed bandits (Tao et al., 2022), combinatorial semi-bandits (Chen et al., 2020), and cascading bandits (Wang et al., 2022).

Private Contextual Bandits. In contextual bandits, in addition to the reward, the contexts are also sensitive information that needs to be protected. However, a straightforward adaptation of the standard central DP to contextual bandits leads to linear regret, as proved in Shariff & Sheffet (2018). Thus, to obtain meaningful regret under the central model for contextual bandits, a relaxed version of DP called joint differential privacy (JDP) is considered, which roughly means that changing any user does not change the actions prescribed to all other users, while allowing the action prescribed to that user herself to change. Under central JDP, Shariff & Sheffet (2018) establish a regret bound of $O(\sqrt{T})$ for a private variant of LinUCB. On the other hand, contextual linear bandits under the local model incur a regret bound of $O(T^{3/4})$ (Zheng et al., 2020). Very recently, motivated by the regret gap between the central and local models, two concurrent works consider contextual linear bandits in the distributed model (via secure shuffling only) (Chowdhury & Zhou, 2022b; Garcelon et al., 2022). In particular, Chowdhury & Zhou (2022b) show that an $O(T^{3/5})$ regret bound is achievable under the distributed model via secure shuffling.

Private Reinforcement Learning (RL). RL is a generalization of contextual bandits in that contextual bandits can be viewed as finite-horizon RL with horizon H = 1.
This not only means that one also has to consider JDP in the central model for RL, but also implies that privacy protection becomes harder in RL due to the additional state transitions (i.e., H > 1). Tabular episodic RL under central JDP is first studied in Vietri et al. (2020) with an additive privacy cost, while under the local model a multiplicative privacy cost is shown to be necessary (Garcelon et al., 2021). We give more discussion on private (federated) supervised learning under the distributed model in Appendix G. For readers interested in the subtleties of DP definitions for bandits and RL, we refer to the blog post (Zhou, 2023).

B A POSSIBLE APPROACH TO REDUCE THE COMMUNICATION COST

We first recall that the communication cost for RDP with SecAgg is roughly $O(\log(n + s/\varepsilon))$, where n is the batch size and s is the scaling factor. A large s leads to better privacy and regret, as shown in Theorem 2, but incurs larger communication. Observe that the current communication cost scales inversely with ε. However, we believe one can break the above privacy-communication trade-off using a very recent technique proposed in Chen et al. (2022a). In particular, Chen et al. (2022a) show that the fundamental communication cost for RDP with SecAgg for the mean estimation task scales as $\Omega(\max(\log(n^2\varepsilon^2), 1))$, where n is the batch size. That is, for a stronger privacy guarantee (i.e., a smaller ε), each user can send fewer bits. The intuition is that if a user sends fewer bits, then she communicates less information about her local data (hence a natural protection of privacy). To achieve this improvement, Chen et al. (2022a) propose a linear compression scheme based on sparse random projections and distributed discrete Gaussian noise. To apply the same technique to the private bandit problem, one needs to handle a different utility metric: instead of the mean-squared error considered in Chen et al. (2022a), one needs to examine tail concentration behavior. We leave this as interesting future work.

C A GENERAL REGRET BOUND OF ALGORITHM 1

In this section, we present the proof of our generic regret bound in Lemma 1. We remark that Assumption 1 naturally holds for a single $(\sigma^2, h)$-sub-exponential noise variable, and for a single $\sigma^2$-sub-Gaussian noise variable with $h = 0$ (cf. Lemma 7 and Lemma 6 in Appendix K), up to constant adjustments.

Lemma 2 (Formal statement of Lemma 1). Let Assumption 1 hold and choose the confidence radius
$$\beta(b) = \sqrt{\frac{\log(4|\Phi(b)|b^2/p)}{2l(b)}} + \frac{\sigma\sqrt{\log(2|\Phi(b)|b^2/p)}}{l(b)} + \frac{h\log(2|\Phi(b)|b^2/p)}{l(b)},$$
where $\sigma$ and $h$ are the constants in Assumption 1. Then, with $p = 1/T$, the expected regret of Algorithm 1 satisfies $\mathbb{E}[Reg(T)] = O\big(\sum_{a\in[K]:\Delta_a>0}\frac{\log T}{\Delta_a} + K\sigma\sqrt{\log T} + Kh\log T\big)$.

Proof. Let $E$ denote the good event that $|\hat\mu_a(b) - \mu_a| \le \beta(b)$ for all active arms $a$ and batches $b$. We first show that $\mathbb{P}[E] \ge 1 - 3p$ under Assumption 1. To see this, note that
$$\hat\mu_a(b) - \mu_a = \frac{n_a(b) + \sum_{i=1}^{l(b)} r_a^i(b)}{l(b)} - \mu_a.$$
By Hoeffding's inequality (cf. Lemma 8), for any $p \in (0, 1)$, with probability at least $1-p$,
$$\left|\frac{\sum_{i=1}^{l(b)} r_a^i(b)}{l(b)} - \mu_a\right| \le \sqrt{\frac{\log(2/p)}{2l(b)}}.$$
Then, by the concentration of the noise $n_a(b)$ in Assumption 1 and the triangle inequality, for a given arm $a$ and batch $b$, with probability at least $1 - 3p$,
$$|\hat\mu_a(b) - \mu_a| \le \sqrt{\frac{\log(2/p)}{2l(b)}} + \frac{\sigma\sqrt{\log(1/p)}}{l(b)} + \frac{h\log(1/p)}{l(b)}.$$
Thus, by the choice of $\beta(b)$ and a union bound, we have $\mathbb{P}[E] \ge 1 - 3p$. In the following, we condition on the good event $E$.

We first show that the optimal arm $a^*$ always remains active, by contradiction. Suppose that at the end of some batch $b$, $a^*$ is eliminated, i.e., $\mathrm{UCB}_{a^*}(b) < \mathrm{LCB}_{a'}(b)$ for some $a'$. This implies that under the good event $E$,
$$\mu_{a^*} \le \hat\mu_{a^*}(b) + \beta(b) < \hat\mu_{a'}(b) - \beta(b) \le \mu_{a'},$$
which contradicts the fact that $a^*$ is the optimal arm.

Next, we show that at the end of batch $b$, all arms with $\Delta_a > 4\beta(b)$ are eliminated. Under the good event $E$,
$$\hat\mu_a(b) + \beta(b) \le \mu_a + 2\beta(b) < \mu_{a^*} - 4\beta(b) + 2\beta(b) \le \hat\mu_{a^*}(b) - \beta(b),$$
which implies that arm $a$ is eliminated by the rule. Thus, for each sub-optimal arm $a$, let $\bar b_a$ be the last batch in which arm $a$ is still active. By the above result,
$$\Delta_a \le 4\beta(\bar b_a) = O\left(\sqrt{\frac{\log(KT/p)}{l(\bar b_a)}} + \frac{\sigma\sqrt{\log(KT/p)}}{l(\bar b_a)} + \frac{h\log(KT/p)}{l(\bar b_a)}\right).$$
Hence, for some absolute constants $c_1, c_2, c_3$,
$$l(\bar b_a) \le \max\left\{\frac{c_1\log(KT/p)}{\Delta_a^2},\ \frac{c_2\sigma\sqrt{\log(KT/p)}}{\Delta_a},\ \frac{c_3 h\log(KT/p)}{\Delta_a}\right\}.$$
Since the batch size doubles, we have $N_a(T) \le 4l(\bar b_a)$ for each sub-optimal arm $a$. Therefore, $Reg(T) = \sum_{a\in[K]} N_a(T)\Delta_a \le \sum_{a\in[K]} 4l(\bar b_a)\Delta_a$. Moreover, choosing $p = 1/T$ and assuming $T \ge K$, the expected regret satisfies
$$\mathbb{E}[Reg(T)] = \mathbb{E}\left[\sum_{a\in[K]}\Delta_a N_a(T)\right] \le \mathbb{P}[\bar E]\cdot T + O\left(\sum_{a\in[K]:\Delta_a>0}\frac{\log T}{\Delta_a}\right) + O\left(K\sigma\sqrt{\log T}\right) + O(Kh\log T) = O\left(\sum_{a\in[K]:\Delta_a>0}\frac{\log T}{\Delta_a} + K\sigma\sqrt{\log T} + Kh\log T\right).$$
Remark 4. Instead of a doubling batch schedule, one can also set $l(b) = \eta^b$ for some absolute constant $\eta > 1$ while attaining the same order of regret bound.
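The elimination loop analyzed above (doubling batches, confidence radius $\beta(b)$, and the UCB-below-max-LCB elimination rule) can be sketched in the noise-free case, i.e., with $\sigma = h = 0$. The helper names, Bernoulli rewards, and seed are illustrative assumptions.

```python
import math
import numpy as np

def beta(phi_size, b, ell, p, sigma=0.0, h=0.0):
    """Confidence radius of Lemma 2 with sub-exponential noise parameters (sigma, h)."""
    log1 = math.log(4 * phi_size * b * b / p)
    log2 = math.log(2 * phi_size * b * b / p)
    return math.sqrt(log1 / (2 * ell)) + sigma * math.sqrt(log2) / ell + h * log2 / ell

def successive_elimination(means, T, p=0.1, rng=None):
    """Successive elimination with doubling batch sizes (no privacy noise)."""
    rng = rng or np.random.default_rng(0)
    active = list(range(len(means)))
    pulls = {a: 0 for a in active}
    sums = {a: 0.0 for a in active}
    t, b, ell = 0, 1, 1
    while t < T and len(active) > 1:
        for a in active:                      # pull every active arm ell times
            n = min(ell, T - t)
            sums[a] += rng.binomial(n, means[a])  # Bernoulli rewards
            pulls[a] += n
            t += n
        mu = {a: sums[a] / pulls[a] for a in active}
        rad = beta(len(active), b, pulls[active[0]], p)
        best_lcb = max(mu[a] - rad for a in active)
        active = [a for a in active if mu[a] + rad >= best_lcb]  # eliminate UCB < max LCB
        b, ell = b + 1, 2 * ell               # doubling batch schedule
    return active

active = successive_elimination([0.2, 0.5, 0.8], T=20000)
```

With high probability the best arm (index 2) is never eliminated, matching the first part of the proof.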

D APPENDIX FOR PURE DP IN SECTION 4

In this section, we provide proofs for Theorem 1 and Theorem 3, which show that pure DP can be achieved under the distributed model via SecAgg and secure shuffling, respectively. Both results build on the generic regret bound in Lemma 2.

D.1 PROOF OF THEOREM 1

Proof. Privacy: We need to show that the server's view at each batch already satisfies $(\varepsilon, 0)$-DP, which, combined with the fact that users are unique and parallel composition, yields that Algorithm 1 satisfies $(\varepsilon, 0)$-DP in the distributed model. To this end, fix a batch $b$ and arm $a$, and write $x_i = r_a^i(b)$ and $n = l(b)$. The server's view for each batch is given by
$$y \overset{(a)}{=} \Big(\sum_{i\in[n]} y_i\Big) \bmod m \overset{(b)}{=} \Big(\sum_{i\in[n]} \hat x_i + \eta_i\Big) \bmod m,$$
where (a) holds by the SecAgg functionality and (b) holds by the distributive property $(a + b) \bmod c = (a \bmod c + b \bmod c) \bmod c$ for any $a, b, c \in \mathbb{Z}$. Thus, the view of the server can be simulated as a post-processing of a mechanism $H$ that accepts an input dataset $\{\hat x_i\}_i$ and outputs $\sum_i \hat x_i + \sum_i \eta_i$. Hence, it suffices to show that $H$ is $(\varepsilon, 0)$-DP, by the post-processing property of DP. To this end, note that the sensitivity of $\sum_i \hat x_i$ is $g$, which, by Fact 3, implies that $\sum_i \eta_i$ needs to be distributed as $\mathrm{Lap}_{\mathbb{Z}}(g/\varepsilon)$ in order to guarantee $\varepsilon$-DP. Finally, by Fact 1, it suffices to generate $\eta_i = \gamma_i^+ - \gamma_i^-$, where $\gamma_i^+$ and $\gamma_i^-$ are i.i.d. samples from Pólya$(1/n, e^{-\varepsilon/g})$.

Regret: Thanks to the generic regret bound in Lemma 2, we only need to verify Assumption 1. Fix any batch $b$ and arm $a$, and write $y = y_a(b)$, $x_i = r_a^i(b)$ and $n = l(b)$. We will show that with probability at least $1 - 2p$,
$$\Big|A(y) - \sum_i x_i\Big| \le O\Big(\frac{1}{\varepsilon}\sqrt{\log(1/p)} + \frac{1}{\varepsilon}\log(1/p)\Big). \tag{4}$$
We decompose
$$\Big|A(y) - \sum_i x_i\Big| \le \underbrace{\Big|A(y) - \frac{1}{g}\sum_i \hat x_i\Big|}_{\text{Term (i)}} + \underbrace{\Big|\frac{1}{g}\sum_i \hat x_i - \sum_i x_i\Big|}_{\text{Term (ii)}},$$
where Term (i) captures the error due to private noise and the modular operation, and Term (ii) captures the error due to random rounding. We start by bounding Term (ii). More specifically, we will show that for any $p \in (0, 1]$, with probability at least $1-p$,
$$\text{Term (ii)} \le O\Big(\frac{1}{\varepsilon}\sqrt{\log(1/p)}\Big). \tag{5}$$
Let $\bar x_i := \lfloor x_i g\rfloor$; then $\hat x_i = \bar x_i + \mathrm{Ber}(x_i g - \bar x_i) = x_i g + \iota_i$, where $\iota_i := \bar x_i + \mathrm{Ber}(x_i g - \bar x_i) - x_i g$. We have $\mathbb{E}[\iota_i] = 0$ and $\iota_i \in [-1, 1]$.
Hence, $\iota_i$ is 1-sub-Gaussian and, as a result, $\frac{1}{g}\sum_i \hat x_i - \sum_i x_i = \frac{1}{g}\sum_i \iota_i$ is $n/g^2$-sub-Gaussian. Therefore, by the concentration of sub-Gaussian random variables (cf. Lemma 6), we have
$$\mathbb{P}\left[\Big|\frac{1}{g}\sum_i \hat x_i - \sum_i x_i\Big| > \sqrt{\frac{2n}{g^2}\log(2/p)}\right] \le p.$$
Hence, by the choice of $g = \lceil\varepsilon\sqrt{n}\rceil$, we establish equation 5.

Now, we turn to bounding Term (i). Recall the choice of parameters: $g = \lceil\varepsilon\sqrt{n}\rceil$, $\tau = \lceil\frac{g}{\varepsilon}\log(2/p)\rceil$, and $m = ng + 2\tau + 1$. We would like to show that
$$\mathbb{P}\left[\Big|A(y) - \frac{1}{g}\sum_i \hat x_i\Big| > \frac{\tau}{g}\right] \le p, \tag{6}$$
which implies that for any $p \in (0, 1]$, with probability at least $1-p$,
$$\text{Term (i)} \le O\Big(\frac{1}{\varepsilon}\log(2/p)\Big). \tag{7}$$
To show equation 6, the key is to bound the error due to the private noise and to handle possible underflow carefully. First, recall that the total private noise $\sum_i \eta_i$ is distributed as $\mathrm{Lap}_{\mathbb{Z}}(g/\varepsilon)$. Hence, by the concentration of the discrete Laplace distribution (cf. Fact 3), we have $\mathbb{P}[|\sum_i \eta_i| > \tau] \le p$. Let $E_{\text{noise}}$ denote the event that $\sum_i \hat x_i + \eta_i \in [\sum_i \hat x_i - \tau, \sum_i \hat x_i + \tau]$; by the above inequality, $\mathbb{P}[E_{\text{noise}}] \ge 1-p$. In the following, we condition on $E_{\text{noise}}$ to analyze the output $A(y)$. As already shown in equation 3, the input $y = (\sum_i \hat x_i + \eta_i) \bmod m$ is an integer. We consider two cases for $y$, as in the analyzer subroutine $A$.

Case 1: $y > ng + \tau$. We argue that this happens only when $\sum_i \hat x_i + \eta_i \in [-\tau, 0)$, i.e., underflow. This is because for all $i \in [n]$, $\hat x_i \in [0, g]$, $m = ng + 2\tau + 1$, and the total private noise is at most $\tau$ in magnitude under $E_{\text{noise}}$. Therefore,
$$y - m = \Big(\sum_i \hat x_i + \eta_i\Big) \bmod m - m = m + \sum_i \hat x_i + \eta_i - m = \sum_i \hat x_i + \eta_i.$$
That is, $y - m \in [\sum_i \hat x_i - \tau, \sum_i \hat x_i + \tau]$ with high probability. In other words, when $y > ng + \tau$, $A(y) = \frac{y-m}{g}$ satisfies equation 6.

Case 2: $y \le ng + \tau$. Here, the noisy sum satisfies $\sum_i \hat x_i + \eta_i \in [0, ng + \tau]$. Hence, $y = \sum_i \hat x_i + \eta_i$ since $m = ng + 2\tau + 1$, which implies that $A(y) = \frac{y}{g}$ satisfies equation 6.

Hence, the output of the analyzer in both cases satisfies equation 6, which implies equation 7.
Combined with the bound in equation 5, this yields the bound in equation 4. Finally, plugging $\sigma = O(1/\varepsilon)$ and $h = O(1/\varepsilon)$ into the generic regret bound in Lemma 2 yields the required regret bound and completes the proof.
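The full pipeline of this proof (randomized rounding to precision $g$, per-user Pólya noise whose sum is discrete Laplace, modular aggregation, and the analyzer's underflow correction) can be sketched as follows. This is an in-the-clear simulation of the SecAgg sum, not a cryptographic implementation; numpy's `negative_binomial(r, 1 - alpha)` is used as the Pólya(r, α) sampler, and the error margin in the final check is an illustrative assumption.

```python
import numpy as np

def polya(r, alpha, rng):
    """Pólya(r, alpha): negative binomial with pmf proportional to C(k+r-1, k)*alpha^k.
    numpy parameterizes by the success probability p = 1 - alpha (real-valued r allowed)."""
    return int(rng.negative_binomial(r, 1.0 - alpha))

def randomizer(x, g, m, n, eps, rng):
    """Local randomizer R: stochastic rounding to precision g, plus Pólya noise difference."""
    xbar = int(np.floor(x * g))
    xhat = xbar + int(rng.random() < x * g - xbar)  # Ber(x*g - floor(x*g))
    alpha = np.exp(-eps / g)
    eta = polya(1.0 / n, alpha, rng) - polya(1.0 / n, alpha, rng)
    return (xhat + eta) % m

def analyzer(y, g, m, n, tau):
    """Analyzer A: undo a modular wrap-around (underflow) and rescale by g."""
    if y > n * g + tau:
        y -= m                                       # Case 1: sum underflowed below 0
    return y / g

rng = np.random.default_rng(0)
n, eps, p = 1000, 1.0, 0.01
g = int(np.ceil(eps * np.sqrt(n)))
tau = int(np.ceil(g / eps * np.log(2 / p)))
m = n * g + 2 * tau + 1

xs = rng.uniform(0, 1, size=n)
y = sum(randomizer(x, g, m, n, eps, rng) for x in xs) % m  # SecAgg modular sum
est = analyzer(y, g, m, n, tau)
```

Since the aggregate noise has scale $g/\varepsilon$, the estimate deviates from $\sum_i x_i$ by $O(\log(1/p)/\varepsilon)$ with high probability, matching equation 4.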

D.2 PURE DP VIA SHUFFLING

As stated in the main paper (see Remark 1), one can achieve the same privacy and regret guarantees (order-wise) using a relaxed SecAgg protocol, which relaxes the SecAgg protocol mentioned above in the following sense: (i) relaxed correctness: the output of $S$ can be used to compute the correct modular sum except with at most a small probability (denoted by $\tilde q$); (ii) relaxed security: the output of $S$ reveals only $\tilde\varepsilon$ more information than the modular sum result. Putting the two aspects together, one obtains a relaxed protocol denoted by $(\tilde\varepsilon, \tilde q)$-SecAgg. One important benefit of this relaxation is that it allows us to achieve the results of Theorem 1 via secure shuffling. More specifically, as shown in Cheu & Yan (2021), there exists a shuffle protocol that can simulate an $(\tilde\varepsilon, \tilde q)$-SecAgg. Hence, we can directly instantiate $S$ with this shuffle protocol to achieve pure DP in the distributed model while obtaining the same regret bound as in the central model. We provide more details on this shuffle protocol below.

To facilitate our discussion, we give more formal definitions of a relaxed SecAgg based on Cheu & Yan (2021), which will also be used in our next proof. We denote a perfect SecAgg by $\Sigma$. We first need the following two distance metrics between probability distributions. As in Cheu & Yan (2021), for a given distribution $D$ and event $E$, we write $\mathbb{P}[D \in E]$ for $\mathbb{P}_{\eta\sim D}[\eta \in E]$. We let $\mathrm{supp}(D, D')$ be the union of the supports of $D$ and $D'$.

Definition 6 (Statistical Distance). For any pair of distributions $D, D'$, the statistical distance is given by $\mathrm{SD}(D, D') := \max_{E \subseteq \mathrm{supp}(D, D')} |\mathbb{P}[D \in E] - \mathbb{P}[D' \in E]|$.

Definition 7 (Log-Likelihood-Ratio (LLR) Distance). For any pair of distributions $D, D'$, the LLR distance is given by $\mathrm{LLR}(D, D') := \max_{E \subseteq \mathrm{supp}(D, D')} \left|\log\frac{\mathbb{P}[D \in E]}{\mathbb{P}[D' \in E]}\right|$.

Pure DP can be defined using the LLR distance.

Definition 8.
A randomized mechanism $M$ is $(\varepsilon, 0)$-DP if for any two neighboring datasets $D, D'$, $\mathrm{LLR}(M(D), M(D')) \le \varepsilon$.

Definition 9 (Relaxed SecAgg). We say $S$ is an $(\tilde\varepsilon, \tilde q)$-relaxation of $\Sigma$ (i.e., an $(\tilde\varepsilon, \tilde q)$-relaxed SecAgg) if it satisfies the following two conditions for any input $y$. 1. ($\tilde q$-relaxed correctness) There exists some post-processing function POST such that $\mathrm{SD}(\mathrm{POST}(S(y)), \Sigma(y)) \le \tilde q$. 2. ($\tilde\varepsilon$-relaxed security) There exists a simulator SIM such that $\mathrm{LLR}(S(y), \mathrm{SIM}(\Sigma(y))) \le \tilde\varepsilon$.

Cheu & Yan (2021) show that one can simulate a relaxed SecAgg via a shuffle protocol. We briefly describe the high-level idea and refer readers to Cheu & Yan (2021) for details. To simulate an $(\tilde\varepsilon, \tilde q)$-SecAgg via shuffling, the key is to introduce another local randomizer on top of the original $R$. More specifically, we let $S := \mathcal{S} \circ \tilde R^n$ in Algorithm 1 with $n = l(b)$, $b \ge 1$, where $\tilde R$ denotes the additional local randomizer at each user $i \in [n]$ that maps the output of $R$ (i.e., $y_i$) into a random binary vector of a particular length $d$. Then, $\mathcal{S}$ denotes a standard shuffler that uniformly at random permutes all the bits received from the $n$ users (i.e., a total of $n \cdot d$ bits). The construction of $\tilde R$ in Cheu & Yan (2021) ensures that $S = \mathcal{S} \circ \tilde R^n$ simulates an $(\tilde\varepsilon, \tilde q)$-SecAgg for any $\tilde\varepsilon, \tilde q \in (0, 1)$.

The following theorem says that a relaxed SecAgg suffices for the same order of expected regret while guaranteeing pure DP.

Theorem 3 (Pure DP via Shuffling). Fix $\varepsilon > 0$ and $T \in \mathbb{N}$ and consider an $(\tilde\varepsilon, \tilde q)$-SecAgg (e.g., simulated via shuffling) for Algorithm 1. Let the noise for the $i$-th user be $\eta_i = \gamma_i^+ - \gamma_i^-$, where $\gamma_i^+, \gamma_i^-$ are i.i.d. samples from Pólya$(1/n, e^{-\varepsilon'/g})$. For each batch $b$, choose $n = l(b)$, $g = \lceil\varepsilon'\sqrt{n}\rceil$, $\tau = \lceil\frac{g}{\varepsilon'}\log(2/p')\rceil$, $m = ng + 2\tau + 1$, $\tilde\varepsilon = \varepsilon/4$, $\varepsilon' = \varepsilon/2$, and $\tilde q = p' = \frac{1}{2T}$. Then, Algorithm 1 achieves $(\varepsilon, 0)$-DP in the distributed model.
Moreover, setting
$$\beta(b) = O\left(\sqrt{\frac{\log(|\Phi(b)|b^2 T)}{2l(b)}} + \frac{2\log(|\Phi(b)|b^2 T)}{\varepsilon l(b)}\right),$$
it enjoys expected regret
$$\mathbb{E}[Reg(T)] = O\left(\sum_{a\in[K]:\Delta_a>0}\frac{\log T}{\Delta_a} + \frac{K\log T}{\varepsilon}\right).$$
Moreover, the communication per user before $S$ is $O(\log m)$ bits.

Proof. This proof shares the same idea as the proof of Theorem 1; we only highlight the differences.

Privacy: As before, we need to ensure that the view of the server already satisfies $\varepsilon$-DP for each batch $b \ge 1$. Fix any batch $b$ and arm $a$, and write $x_i = r_a^i(b)$ and $n = l(b)$. By Definition 8, it suffices to show that for any two neighboring datasets,
$$\mathrm{LLR}(S(R(x_1), \ldots, R(x_n)),\ S(R(x_1'), \ldots, R(x_n'))) \le \varepsilon.$$
To this end, by the $\tilde\varepsilon$-relaxed security of the $S$ used in Algorithm 1 and the triangle inequality,
$$\mathrm{LLR}(S(R(x_1), \ldots, R(x_n)),\ S(R(x_1'), \ldots, R(x_n'))) \le 2\tilde\varepsilon + \mathrm{LLR}\Big(\mathrm{SIM}\Big(\sum_i R(x_i) \bmod m\Big), \mathrm{SIM}\Big(\sum_i R(x_i') \bmod m\Big)\Big) \le 2\tilde\varepsilon + \mathrm{LLR}\Big(\sum_i R(x_i) \bmod m,\ \sum_i R(x_i') \bmod m\Big),$$
where the last inequality follows from the data processing inequality. The remaining step of bounding the second term is the same as in the proof of Theorem 1. With $\tilde\varepsilon = \varepsilon/4$ and $\varepsilon' = \varepsilon/2$, the total privacy loss is $\varepsilon$.

Regret: As before, the key is to establish that with high probability
$$\Big|A(y) - \sum_i x_i\Big| \le O\Big(\frac{1}{\varepsilon}\sqrt{\log(2/p')} + \frac{1}{\varepsilon}\log(2/p')\Big), \tag{8}$$
where $y := y_a(b)$. We again divide the LHS of equation 8 into
$$\Big|A(y) - \sum_i x_i\Big| \le \underbrace{\Big|A(y) - \frac{1}{g}\sum_i \hat x_i\Big|}_{\text{Term (i)}} + \underbrace{\Big|\frac{1}{g}\sum_i \hat x_i - \sum_i x_i\Big|}_{\text{Term (ii)}}.$$
In particular, Term (ii) can be bounded by the same method, i.e., for any $p' \in (0, 1]$, with probability at least $1 - p'$, Term (ii) $\le O\big(\frac{1}{\varepsilon'}\sqrt{\log(2/p')}\big)$. For Term (i), we would like to show that
$$\mathbb{P}\left[\Big|A(y) - \frac{1}{g}\sum_i \hat x_i\Big| > \frac{\tau}{g}\right] \le \tilde q + p'.$$
This can be established by the same steps as in the proof of Theorem 1, while conditioning on the high-probability event that $S$ realizes the correct modular sum ($\tilde q$-relaxed correctness).
More specifically, compared to the proof of Theorem 1, the key difference here is that $y = (\sum_i y_i) \bmod m$ only holds with probability at least $1 - \tilde q$, by the definition of $\tilde q$-relaxed correctness. Thus, for any $p' \in (0, 1)$, letting $\tilde q = p'$, we have with probability at least $1 - 3p'$,
$$\Big|A(y) - \sum_i x_i\Big| \le O\Big(\frac{1}{\varepsilon}\sqrt{\log(2/p')} + \frac{1}{\varepsilon}\log(2/p')\Big),$$
which implies that Assumption 1 holds with $\sigma = O(1/\varepsilon)$ and $h = O(1/\varepsilon)$.

E APPENDIX FOR RDP IN SECTION 5

E.1 PROOF OF PROPOSITION 1

Proof. We first establish the following result.

Claim 1. For all $\lambda \in \mathbb{R}$, we have $\cosh(\lambda) \le e^{\lambda^2/2}$.

To show this, by the infinite product representation of the hyperbolic cosine function, we have
$$\cosh(\lambda) = \prod_{k=1}^{\infty}\left(1 + \frac{4\lambda^2}{\pi^2(2k-1)^2}\right) \overset{(a)}{\le} \exp\left(\sum_{k=1}^{\infty}\frac{4\lambda^2}{\pi^2(2k-1)^2}\right) \overset{(b)}{=} \exp(\lambda^2/2),$$
where (a) holds by the fact that $1 + x \le e^x$ for all $x \in \mathbb{R}$ and (b) follows from $\sum_{k=1}^{\infty}\frac{1}{(2k-1)^2} = \frac{\pi^2}{8}$.

Then, for $X \sim \mathrm{Sk}(0, \sigma^2)$, the moment generating function (MGF) is given by $\mathbb{E}[e^{\lambda X}] = \exp(\sigma^2(\cosh(\lambda) - 1))$. Hence, by the above claim, $\mathbb{E}[e^{\lambda X}] \le \exp(\sigma^2(e^{\lambda^2/2} - 1))$. Further, note that $e^x - 1 \le 2x$ for $x \in [0, 1]$. Thus, for $|\lambda| \le \sqrt{2}$, we have $\mathbb{E}[e^{\lambda X}] \le e^{\lambda^2\sigma^2} = e^{\lambda^2(2\sigma^2)/2}$. Hence, by the definition of a sub-exponential random variable (cf. Lemma 7), $X$ is $(2\sigma^2, \sqrt{2})$-sub-exponential, which, again by Lemma 7, implies the required concentration result: for any $p \in (0, 1]$, with probability at least $1-p$, $|X| \le 2\sigma\sqrt{\log(2/p)} + \sqrt{2}\log(2/p)$.
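Both Claim 1 and the resulting tail bound can be checked numerically. The sketch below samples $\mathrm{Sk}(0, \sigma^2)$ as the difference of two independent Poisson$(\sigma^2/2)$ draws; the sample size and the value of $\sigma^2$ are illustrative assumptions.

```python
import math
import numpy as np

# Numerical check of Claim 1: cosh(lam) <= exp(lam^2 / 2) for a grid of lam values.
for lam in np.linspace(-5, 5, 101):
    assert math.cosh(lam) <= math.exp(lam * lam / 2) + 1e-12

def skellam(sigma2, size, rng):
    """Sk(0, sigma^2): difference of two independent Poisson(sigma^2/2) draws."""
    return rng.poisson(sigma2 / 2, size) - rng.poisson(sigma2 / 2, size)

rng = np.random.default_rng(0)
sigma2, p = 4.0, 0.01
x = skellam(sigma2, 100_000, rng)
# Sub-exponential tail from Proposition 1:
# |X| <= 2*sigma*sqrt(log(2/p)) + sqrt(2)*log(2/p) with probability >= 1 - p.
bound = 2 * math.sqrt(sigma2) * math.sqrt(math.log(2 / p)) + math.sqrt(2) * math.log(2 / p)
```

The empirical tail mass beyond `bound` should be well below `p`, since the bound is conservative for this moderate $\sigma^2$.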

E.2 PROOF OF THEOREM 2

We will leverage the following result from (Agarwal et al., 2021, Theorem 3.5) to prove the privacy guarantee.

Lemma 3. For $\alpha \in \mathbb{Z}$, $\alpha > 1$, let $X \sim \mathrm{Sk}(0, \sigma^2)$. Then, an algorithm $M$ that adds $X$ to a sensitivity-$\Delta$ query satisfies $(\alpha, \varepsilon(\alpha))$-RDP with $\varepsilon(\alpha)$ given by
$$\varepsilon(\alpha) \le \frac{\alpha\Delta^2}{2\sigma^2} + \min\left\{\frac{(2\alpha-1)\Delta^2 + 6\Delta}{4\sigma^4},\ \frac{3\Delta}{2\sigma^2}\right\}.$$

Proof of Theorem 2. Privacy: As before, fix any batch $b$ and arm $a$ and, for simplicity, write $x_i = r_a^i(b)$ and $n = l(b)$. It suffices to show that the mechanism $H$ that accepts an input dataset $\{\hat x_i\}_i$ and outputs $\sum_i \hat x_i + \sum_i \eta_i$ is private. Since each local randomizer $R$ generates noise $\eta_i \sim \mathrm{Sk}(0, \frac{g^2}{n\varepsilon^2})$ and the Skellam distribution is closed under summation, the total noise satisfies $\sum_i \eta_i \sim \mathrm{Sk}(0, \frac{g^2}{\varepsilon^2})$. Thus, by Lemma 3 with $\Delta = g$, for each batch $b$ with $n = l(b)$ and $g = \lceil s\varepsilon\sqrt{n}\rceil$, Algorithm 1 is $(\alpha, \varepsilon_n(\alpha))$-RDP with
$$\varepsilon_n(\alpha) = \frac{\alpha\varepsilon^2}{2} + \min\left\{\frac{(2\alpha-1)\varepsilon^2}{4s^2 n} + \frac{3\varepsilon}{2s^3 n^{3/2}},\ \frac{3\varepsilon}{2s\sqrt{n}}\right\}.$$
Since $n = l(b) \ge 1$, we have that for all batches $b \ge 1$,
$$\varepsilon_n(\alpha) \le \varepsilon(\alpha) := \frac{\alpha\varepsilon^2}{2} + \min\left\{\frac{(2\alpha-1)\varepsilon^2}{4s^2} + \frac{3\varepsilon}{2s^3},\ \frac{3\varepsilon}{2s}\right\}.$$

Regret: We will establish the following high-probability bound so that we can apply our generic regret bound in Lemma 2:
$$\Big|A(y) - \sum_i x_i\Big| \le O\Big(\frac{1}{\varepsilon}\sqrt{\log(1/p)} + \frac{1}{s\varepsilon}\log(2/p)\Big), \tag{9}$$
where $y := y_a(b)$ for each batch $b$ and arm $a$. We again divide the LHS of equation 9 into
$$\Big|A(y) - \sum_i x_i\Big| \le \underbrace{\Big|A(y) - \frac{1}{g}\sum_i \hat x_i\Big|}_{\text{Term (i)}} + \underbrace{\Big|\frac{1}{g}\sum_i \hat x_i - \sum_i x_i\Big|}_{\text{Term (ii)}}.$$
In particular, Term (ii) can be bounded as before, i.e.,
$$\mathbb{P}\left[\Big|\frac{1}{g}\sum_i \hat x_i - \sum_i x_i\Big| > \sqrt{\frac{2n}{g^2}\log(2/p)}\right] \le p.$$
Hence, by the choice of $g = \lceil s\varepsilon\sqrt{n}\rceil$, we establish that with high probability Term (ii) $\le O\big(\frac{1}{s\varepsilon}\sqrt{\log(1/p)}\big)$. For Term (i), as in the previous proof, the key is to show that $\mathbb{P}\big[|\sum_{i\in[n]}\eta_i| > \tau\big] \le p$. To this end, we utilize our established result in Proposition 1. The total noise satisfies $\sum_i \eta_i \sim \mathrm{Sk}(0, \frac{g^2}{\varepsilon^2})$, and hence, by Proposition 1 and the choice of $\tau = \lceil\frac{2g}{\varepsilon}\sqrt{\log(2/p)} + \sqrt{2}\log(2/p)\rceil$, the above bound holds.
Following the previous proof, this result implies that with high probability Term (i) $\le \frac{\tau}{g} \le O\big(\frac{1}{\varepsilon}\sqrt{\log(1/p)} + \frac{1}{s\varepsilon}\log(1/p)\big)$. Combining the bounds on Term (i) and Term (ii), we have that the private noise satisfies Assumption 1 with constants $\sigma = O(1/\varepsilon)$ and $h = O(\frac{1}{s\varepsilon})$. Hence, by the generic regret bound in Lemma 2, we have established the required result.
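The RDP bound of Lemma 3, specialized as in the proof above, can be evaluated directly. The sketch below assumes the parameter choices of Theorem 2; the concrete values of ε, s, n, α are illustrative.

```python
import math

def skellam_rdp_eps(alpha, delta, sigma2):
    """Order-alpha RDP bound for adding Sk(0, sigma2) noise to a sensitivity-delta
    query, following the Lemma 3 statement (Agarwal et al., 2021, Theorem 3.5)."""
    main = alpha * delta**2 / (2 * sigma2)
    corr = min(((2 * alpha - 1) * delta**2 + 6 * delta) / (4 * sigma2**2),
               3 * delta / (2 * sigma2))
    return main + corr

# With g = ceil(s*eps*sqrt(n)) and eta_i ~ Sk(0, g^2/(n*eps^2)),
# the total noise is Sk(0, g^2/eps^2), so we plug delta = g, sigma2 = g^2/eps^2.
eps, s, n, alpha = 1.0, 10, 1024, 2
g = math.ceil(s * eps * math.sqrt(n))
e = skellam_rdp_eps(alpha, g, g**2 / eps**2)
```

As derived in the proof, the leading term equals $\alpha\varepsilon^2/2$ and the correction is at most $3\varepsilon/(2s\sqrt{n})$, so `e` is only marginally above $\alpha\varepsilon^2/2$ for moderate $s$.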

F ACHIEVING CDP IN THE DISTRIBUTED MODEL VIA DISCRETE GAUSSIAN

We first introduce the following definition of concentrated differential privacy (Bun & Steinke, 2016).

Definition 10 (Concentrated Differential Privacy). A randomized mechanism $M$ satisfies $\frac{1}{2}\varepsilon^2$-CDP if for all neighboring datasets $D, D'$ and all $\alpha \in (1, \infty)$, we have $D_\alpha(M(D)\,\|\,M(D')) \le \frac{1}{2}\varepsilon^2\alpha$.

Remark 5 (CDP vs. RDP). From the definition, we can see that $\frac{1}{2}\varepsilon^2$-CDP is equivalent to satisfying $(\alpha, \frac{1}{2}\varepsilon^2\alpha)$-RDP simultaneously for all $\alpha > 1$.

The discrete Gaussian mechanism was first proposed and investigated by Canonne et al. (2020) and has recently been applied to federated learning (Kairouz et al., 2021a). We apply it to bandit learning to demonstrate the flexibility of our proposed algorithm and analysis.

Definition 11 (Discrete Gaussian Distribution). Let $\mu, \sigma \in \mathbb{R}$ with $\sigma > 0$. A random variable $X$ has a discrete Gaussian distribution with location $\mu$ and scale $\sigma$, denoted by $N_{\mathbb{Z}}(\mu, \sigma^2)$, if it has probability mass function
$$\forall x \in \mathbb{Z}, \quad \mathbb{P}[X = x] = \frac{e^{-(x-\mu)^2/2\sigma^2}}{\sum_{y\in\mathbb{Z}} e^{-(y-\mu)^2/2\sigma^2}}.$$

The following result from Canonne et al. (2020) will be useful in our privacy and regret analysis.

Fact 2 (Discrete Gaussian Privacy and Utility). Let $\Delta, \varepsilon > 0$. Let $q : \mathcal{X}^n \to \mathbb{Z}$ satisfy $|q(x) - q(x')| \le \Delta$ for all $x, x'$ differing in a single entry. Define a randomized algorithm $M : \mathcal{X}^n \to \mathbb{Z}$ by $M(x) = q(x) + Y$, where $Y \sim N_{\mathbb{Z}}(0, \Delta^2/\varepsilon^2)$. Then, $M$ satisfies $\frac{1}{2}\varepsilon^2$-concentrated differential privacy (CDP). Moreover, $Y$ is $\Delta^2/\varepsilon^2$-sub-Gaussian, and hence for all $t \ge 0$, $\mathbb{P}[|Y| \ge t] \le 2e^{-t^2\varepsilon^2/(2\Delta^2)}$.

However, a direct application of the above results does not work. This is because the sum of discrete Gaussians is not a discrete Gaussian, and hence one cannot directly apply the privacy guarantee of the discrete Gaussian when analyzing the view of the analyzer by summing all the noise from users. To overcome this, we rely on a recent result in Kairouz et al.
(2021a), which shows that under reasonable parameter regimes, the sum of discrete Gaussians is close to a discrete Gaussian. The regret analysis again builds on the generic regret bound for Algorithm 1.

Theorem 4 (CDP via SecAgg). Fix $\varepsilon \in (0, 1)$ and a scaling factor $s \ge 1$. Let the noise for the $i$-th user be $\eta_i \sim N_{\mathbb{Z}}(0, \frac{g^2}{n\varepsilon^2})$. For each $b \ge 1$, choose $n = l(b)$, $g = \lceil s\varepsilon\sqrt{n}\rceil$, $\tau = \lceil\frac{g}{\varepsilon}\sqrt{2\log(2/p)}\rceil$, $p = 1/T$ and $m = ng + 2\tau + 1$. Then, Algorithm 1 achieves $\frac{1}{2}\bar\varepsilon^2$-CDP with $\bar\varepsilon$ given by
$$\bar\varepsilon \le \min\left\{\sqrt{\varepsilon^2 + \tfrac{1}{2}\xi},\ \varepsilon + \xi\right\}, \quad \text{where } \xi := 10\sum_{k=1}^{\lceil T/2\rceil - 1} e^{-2\pi^2 s^2\frac{k}{k+1}}.$$
Meanwhile, the regret is given by
$$Reg(T) = O\left(\sum_{a\in[K]:\Delta_a>0}\frac{\log T}{\Delta_a} + \frac{K\sqrt{\log T}}{\varepsilon}\right),$$
and the communication per user before $S$ is $O(\log m)$ bits.

Remark 6 (Privacy-Communication Trade-off). We observe an interesting trade-off between privacy and communication cost. In particular, as $s$ increases, $\xi$ approaches zero and hence the privacy loss approaches that under continuous Gaussian noise. However, a larger $s$ leads to a larger $m$ and hence a larger communication overhead.

We will leverage the following result from (Kairouz et al., 2021a, Proposition 13) to prove privacy.

Lemma 4 (Privacy for the sum of discrete Gaussians). Let $\sigma \ge 1/2$. Let $X_i \sim N_{\mathbb{Z}}(0, \sigma^2)$ independently for each $i$, and let $Z_n = \sum_{i=1}^n X_i$. Then, an algorithm $M$ that adds $Z_n$ to a sensitivity-$\Delta$ query satisfies $\frac{1}{2}\varepsilon^2$-CDP with $\varepsilon$ given by
$$\varepsilon = \min\left\{\sqrt{\frac{\Delta^2}{n\sigma^2} + \tfrac{1}{2}\xi},\ \frac{\Delta}{\sqrt{n}\sigma} + \xi\right\}, \quad \text{where } \xi := 10\sum_{k=1}^{n-1} e^{-2\pi^2\sigma^2\frac{k}{k+1}}.$$

Proof of Theorem 4. Privacy: As before, fix any batch $b$ and arm $a$ and, for simplicity, write $x_i = r_a^i(b)$ and $n = l(b)$. It suffices to show that the mechanism $H$ that accepts an input dataset $\{\hat x_i\}_i$ and outputs $\sum_i \hat x_i + \sum_i \eta_i$ is private. Each local randomizer $R$ generates noise $\eta_i \sim N_{\mathbb{Z}}(0, \frac{g^2}{n\varepsilon^2})$, and hence by Lemma 4 with $\Delta = g$, $H$ satisfies $\frac{1}{2}\varepsilon_n^2$-CDP with
$$\varepsilon_n = \min\left\{\sqrt{\varepsilon^2 + \tfrac{1}{2}\xi_n},\ \varepsilon + \xi_n\right\}, \quad \text{where } \xi_n := 10\sum_{k=1}^{n-1} e^{-2\pi^2\frac{g^2}{n\varepsilon^2}\frac{k}{k+1}}.$$
Note that $g = \lceil s\varepsilon\sqrt{n}\rceil$ implies $\frac{g^2}{n\varepsilon^2} \ge s^2$, and hence $\xi_n \le \xi := 10\sum_{k=1}^{\lceil T/2\rceil - 1} e^{-2\pi^2 s^2\frac{k}{k+1}}$ since $n = l(b) \le \lceil T/2\rceil$. Thus, Algorithm 1 satisfies $\frac{1}{2}\bar\varepsilon^2$-CDP with
$$\bar\varepsilon = \min\left\{\sqrt{\varepsilon^2 + \tfrac{1}{2}\xi},\ \varepsilon + \xi\right\}.$$

Regret: We will establish the following high-probability bound so that we can apply our generic regret bound in Lemma 2:
$$\Big|A(y) - \sum_i x_i\Big| \le O\Big(\frac{1}{\varepsilon}\sqrt{\log(1/p)}\Big), \tag{10}$$
where $y := y_a(b)$. We again divide the LHS of equation 10 into
$$\Big|A(y) - \sum_i x_i\Big| \le \underbrace{\Big|A(y) - \frac{1}{g}\sum_i \hat x_i\Big|}_{\text{Term (i)}} + \underbrace{\Big|\frac{1}{g}\sum_i \hat x_i - \sum_i x_i\Big|}_{\text{Term (ii)}}.$$
In particular, Term (ii) can be bounded as before, i.e.,
$$\mathbb{P}\left[\Big|\frac{1}{g}\sum_i \hat x_i - \sum_i x_i\Big| > \sqrt{\frac{2n}{g^2}\log(2/p)}\right] \le p.$$
Hence, by the choice of $g = \lceil s\varepsilon\sqrt{n}\rceil$, we establish that with high probability Term (ii) $\le O\big(\frac{1}{s\varepsilon}\sqrt{\log(1/p)}\big)$. For Term (i), as in the previous proofs, the key is to show that $\mathbb{P}\big[|\sum_{i\in[n]}\eta_i| > \tau\big] \le p$. This follows from the fact that the discrete Gaussian is sub-Gaussian and a sum of sub-Gaussians is still sub-Gaussian. Thus, by Lemma 6 and the choice of $\tau = \lceil\frac{g}{\varepsilon}\sqrt{2\log(2/p)}\rceil$, the above bound holds, which implies that with high probability Term (i) $\le \frac{\tau}{g} = O\big(\frac{1}{\varepsilon}\sqrt{\log(1/p)}\big)$. Since $s \ge 1$, combining the bounds on Term (i) and Term (ii), we have that the private noise satisfies Assumption 1 with constants $\sigma = O(1/\varepsilon)$ and $h = 0$. Hence, by the generic regret bound in Lemma 2, we have established the required result.
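The sub-Gaussian tail of Fact 2 can be checked against an exact sampler for $N_{\mathbb{Z}}(0, \sigma^2)$ on a (very wide) truncated support; the truncation width and sample size below are illustrative assumptions, and truncating at $12\sigma$ loses only a negligible amount of probability mass.

```python
import numpy as np

def discrete_gaussian(sigma, size, rng):
    """Sample N_Z(0, sigma^2) from its exact pmf on a truncated integer support."""
    trunc = int(np.ceil(12 * sigma)) + 1
    support = np.arange(-trunc, trunc + 1)
    w = np.exp(-support.astype(float) ** 2 / (2 * sigma**2))
    return rng.choice(support, size=size, p=w / w.sum())

rng = np.random.default_rng(0)
sigma = 3.0
x = discrete_gaussian(sigma, 50_000, rng)
# Fact 2: N_Z(0, sigma^2) is sigma^2-sub-Gaussian, so P[|Y| >= t] <= 2*exp(-t^2/(2 sigma^2)).
t = 4 * sigma
```

For a production mechanism one would use an exact rejection sampler (as in Canonne et al., 2020) rather than a truncated table; the table version is only for a quick numerical check.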

G MORE DETAILS ON SECURE AGGREGATION AND SECURE SHUFFLING

In this section, we provide more details about secure aggregation (SecAgg) and secure shuffling, including practical implementations and recent theoretical results for DP guarantees. We also discuss some limitations of both protocols, which highlights that each has advantages over the other, so the choice between them depends on the particular application.

G.1 SECURE AGGREGATION

Practical implementations: SecAgg is a lightweight instance of secure multi-party computation (MPC) based on cryptographic primitives, whereby the server only learns the aggregated result (e.g., the sum) of all participating users' values. This is often achieved via additive masking over a finite group (Bonawitz et al., 2017; Bell et al., 2020). The high-level idea is that participating users add randomly sampled zero-sum masks, working in the space of integers modulo m, which guarantees that each user's masked value is indistinguishable from a random value. However, when all masked values are summed modulo m by the server, the masks cancel out and the server observes the true modular sum. Bonawitz et al. (2017) proposed the first scalable SecAgg protocol, where both the communication and computation costs of each user scale linearly with the number of participating users. Recently, Bell et al. (2020) presented a further improvement where both client computation and communication depend logarithmically on the number of participating clients.

SecAgg for distributed DP: First, note that the fact that the server only learns the sum of values under SecAgg does not necessarily imply differential privacy, since this aggregated result can still leak each user's sensitive information. To provide a formal DP guarantee under SecAgg, each participating user can first perturb her own data with moderate random noise such that, in the aggregated sum, the total noise is large enough to provide a strong privacy guarantee. Only recently has SecAgg with DP been systematically studied, mainly in the context of private (federated) supervised learning (Kairouz et al., 2021a; Agarwal et al., 2021), while SecAgg in the context of private online learning remained open prior to our work. Moreover, there exist no formal convergence guarantees for SGD in Kairouz et al. (2021a); Agarwal et al. (2021) due to the biased gradient estimates.
To the best of our knowledge, the very recent work of Chen et al. (2022b) is the only one that derives an upper bound on the convergence rate when working with SecAgg in private supervised learning. However, its privacy guarantee is only approximate DP rather than the pure DP considered in our paper.

Limitations of SecAgg: As pointed out in Kairouz et al. (2021b), several limitations of SecAgg remain despite recent advances. For example, it assumes a semi-honest server and allows the server to see the per-round aggregates. Moreover, it is not efficient for sparse vector aggregation.
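The zero-sum masking idea described above can be sketched as follows. This is a toy in-the-clear simulation of pairwise masks, not a secure implementation: in a real protocol the pairwise values s_ij would be derived from secrets agreed via key exchange, and no single party would see the full matrix.

```python
import numpy as np

def secagg_masks(n, m, rng):
    """Pairwise zero-sum masks modulo m: user i holds sum_j (s_ij - s_ji) mod m,
    so the masks cancel exactly when all users' values are summed."""
    s = rng.integers(0, m, size=(n, n))
    return (s.sum(axis=1) - s.sum(axis=0)) % m

rng = np.random.default_rng(0)
n, m = 5, 2**16
masks = secagg_masks(n, m, rng)
x = rng.integers(0, 100, size=n)   # users' true values
y = (x + masks) % m                # each masked value looks uniform to the server
```

Summing the masked values modulo m recovers the true modular sum, which is all the server learns.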

G.2 SECURE SHUFFLING

Practical implementations: Secure shuffling ensures that the server only learns an unordered collection (i.e., a multiset) of the messages sent by all participating users. This is often achieved by a third-party shuffling function via cryptographic onion routing or mixnets (Dingledine et al., 2004), or via oblivious shuffling with trusted hardware (Bittau et al., 2017).

Secure shuffling for distributed DP: The additional randomness introduced by the shuffling function can be utilized to achieve utility similar to the central model without a trustworthy server. This distributed model via secure shuffling is often called the shuffle model. In particular, Cheu et al. (2019) first show that for the problem of summing $n$ real-valued numbers in $[0, 1]$, the expected error under the shuffle model with an $(\varepsilon, \delta)$-DP guarantee is $O(\frac{1}{\varepsilon}\log\frac{n}{\delta})$. For comparison, under the central model, the standard Laplace mechanism achieves $(\varepsilon, 0)$-DP (i.e., pure DP) with error $O(1/\varepsilon)$ (Dwork et al., 2006), while an error of $\Omega(\sqrt{n}/\varepsilon)$ is necessary under the local model (Chan et al., 2012; Beimel et al., 2008). Subsequent works on private scalar sums under the shuffle model have improved both the communication cost and the accuracy of Cheu et al. (2019). More specifically, instead of sending $O(\varepsilon\sqrt{n})$ bits per user as in Cheu et al. (2019), the protocols proposed in Balle et al. (2020); Ghazi et al. (2020b) achieve $(\varepsilon, \delta)$-DP with error $O(1/\varepsilon)$ while each user sends only $O(\log n + \log(1/\delta))$ bits. A closely related direction in the shuffle model is privacy amplification bounds (Erlingsson et al., 2019; Balle et al., 2019; Feldman et al., 2022).
That is, the shuffling of $n$ locally private data points yields a gain in privacy guarantees. In particular, Feldman et al. (2022) show that randomly shuffling $n$ $\varepsilon_0$-DP locally randomized data points yields an $(\varepsilon, \delta)$-DP guarantee with $\varepsilon = O\big(\varepsilon_0\frac{\sqrt{\log(1/\delta)}}{\sqrt{n}}\big)$ when $\varepsilon_0 \le 1$ and $\varepsilon = O\big(\frac{\sqrt{e^{\varepsilon_0}\log(1/\delta)}}{\sqrt{n}}\big)$ when $\varepsilon_0 > 1$. The shuffle model has also been studied in the context of empirical risk minimization (ERM) (Girgis et al., 2021) and stochastic convex optimization (Cheu et al., 2021; Lowy & Razaviyayn, 2021), in an attempt to recover some of the utility of the central model without a trusted server. In all the above-mentioned works on the shuffle model, only approximate DP is achieved. To the best of our knowledge, there are only two existing shuffling protocols that attain pure DP (Ghazi et al., 2020a; Cheu & Yan, 2021). In particular, by simulating a relaxed SecAgg via a shuffle protocol, Cheu & Yan (2021) are the first to show that there exists a shuffle protocol for bounded sums that satisfies $(\varepsilon, 0)$-DP with an expected error of $O(1/\varepsilon)$. This is the main inspiration for achieving pure DP in MAB with secure shuffling.

Limitations of secure shuffling: As pointed out in Kairouz et al. (2021b), one limitation is the requirement of a trusted intermediary for the shuffling function. Another is that the privacy guarantee under the shuffle model degrades in proportion to the number of adversarial users.

Algorithm 2 Local Randomizer R (Central Model)
1: Input: each user's data $x_i \in [0, 1]$
2: Parameters: precision $g \in \mathbb{N}$, modulus $m \in \mathbb{N}$, batch size $n \in \mathbb{N}$, privacy parameter $\phi$
3: Encode $x_i$ as $\hat x_i = \lfloor x_i g\rfloor + \mathrm{Ber}(x_i g - \lfloor x_i g\rfloor)$
4: Modulo clip $y_i = \hat x_i \bmod m$ // no local noise
5: Output: $y_i$

H ALGORITHM 1 UNDER CENTRAL AND LOCAL MODELS

In this section, we will show that Algorithm 1 and variants of our template protocol P allow us to achieve central and local DP using only discrete noise while attaining the same optimal regret bound as in previous works using continuous noise (Sajed & Sheffet, 2019; Ren et al., 2020) .

H.1 CENTRAL MODEL

For illustration, we will mainly consider S as a SecAgg protocol; secure shuffling can be applied in the same way as in the main paper. In the central model, we just need to set η_i = 0 in the local randomizer R and add central noise in the analyzer A (see Algorithm 2 and Algorithm 3). Then, we have the following privacy and regret guarantees.

Theorem 5 (Central Pure-DP via SecAgg). Let P = (R, S, A) be a protocol such that R is given by Algorithm 2, S is any SecAgg protocol, and A is given by Algorithm 3 with η ∼ Lap_Z(g/ε). Fix any ε ∈ (0, 1); for each b ≥ 1, choose n = l(b), g = ⌈ε√n⌉, τ = ⌈(g/ε) log(2/p)⌉, p = 1/T and m = ng + 2τ + 1. Then, Algorithm 1 instantiated with protocol P achieves (ε, 0)-DP in the central model with expected regret

E[Reg(T)] = O( Σ_{a∈[K]: Δ_a>0} log T/Δ_a + K log T/ε ).

Moreover, the communication per user before S is O(log m) bits.

Remark 7. We can generate η ∼ Lap_Z(g/ε) as η = η_1 − η_2, where η_1, η_2 are independent and Pólya(1, e^{−ε/g}) distributed.

Proof of Theorem 5. Privacy: By the definition of central DP and post-processing, it suffices to show that ȳ in Algorithm 3 is private. To this end, we again apply the distributive property of the modular sum to obtain ȳ = (ỹ + (η mod m)) mod m = (Σ_i x̃_i + η) mod m. Since the sensitivity of Σ_i x̃_i is g and η ∼ Lap_Z(g/ε), by Fact 3, we have obtained (ε, 0)-DP in the central model.

Regret: As before, the key is to establish that, with high probability, |A(ỹ) − Σ_i x_i| ≤ O((1/ε)√(log(1/p)) + (1/ε) log(1/p)), where ỹ := ỹ_a(b). Then, we can apply our generic regret bound in Lemma 2. We again divide the LHS of equation 11 as |A(ỹ) − Σ_i x_i| ≤ |A(ỹ) − (1/g) Σ_i x̃_i| (Term (i)) + |(1/g) Σ_i x̃_i − Σ_i x_i| (Term (ii)). In particular, Term (ii) can be bounded using the same method, i.e., for any p ∈ (0, 1], with probability at least 1 − p, Term (ii) ≤ O((1/ε)√(log(1/p))).
For Term (i), we would like to show that P[|A(ỹ) − (1/g) Σ_i x̃_i| > τ/g] ≤ p. As before, let E_noise denote the event that (Σ_i x̃_i) + η ∈ [Σ_i x̃_i − τ, Σ_i x̃_i + τ]; then, by the concentration of discrete Laplace, we have P[E_noise] ≥ 1 − p. In the following, we condition on the event E_noise to analyze the output A(ỹ). Note that ỹ = ((Σ_i x̃_i) + η) mod m, and hence we can use the same steps as in the proof of Theorem 1 to conclude that Term (i) ≤ O((1/ε) log(1/p)). Finally, our generic regret bound in Lemma 2 yields the stated regret bound.
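As a concrete illustration of the pipeline in Theorem 5 (randomized encoding, modular SecAgg sum, central discrete Laplace noise, underflow-corrected decoding), here is a minimal Python sketch. The function names are ours, SecAgg is simulated by a plain modular sum, and the analyzer takes the noise value as an argument so that the decoding logic can be exercised deterministically; a real deployment would draw η inside the analyzer.

```python
import math
import random

def encode(x, g, rng):
    # Algorithm 2, line 3: randomized fixed-point encoding of x in [0, 1]
    lo = math.floor(x * g)
    return lo + (1 if rng.random() < x * g - lo else 0)

def discrete_laplace(scale, rng):
    # Remark 7: eta = eta1 - eta2 with eta1, eta2 i.i.d. Polya(1, beta),
    # beta = exp(-1/scale); Polya(1, beta) is geometric on {0, 1, ...}
    beta = math.exp(-1.0 / scale)
    geom = lambda: int(math.log(1.0 - rng.random()) / math.log(beta))
    return geom() - geom()

def secagg_sum(ys, m):
    # Simulated SecAgg: the server learns only the modular sum of the messages
    return sum(ys) % m

def central_analyzer(y_tilde, n, g, tau, m, eta):
    # Algorithm 3: add central noise mod m, then decode with underflow correction
    y_bar = (y_tilde + (eta % m)) % m
    if y_bar > n * g + tau:
        return (y_bar - m) / g  # the noisy sum wrapped around below zero
    return y_bar / g
```

With the Theorem 5 parameter choices and zero noise the analyzer recovers the exact mean estimate, and a negative noise value that wraps around the modulus is undone by the underflow correction.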

H.2 LOCAL MODEL

In the local model, each local randomizer R needs to inject discrete Laplace noise to guarantee pure DP, while the analyzer A is simply the same as in the main paper (see Algorithm 4 and Algorithm 5). To analyze the regret, we need the concentration of a sum of discrete Laplace random variables. To this end, we have the following result.

Lemma 5 (Concentration of discrete Laplace). Let {X_i}_{i=1}^n be i.i.d. random variables distributed according to Lap_Z(1/ε). Then, each X_i is (c²/ε², c/ε)-sub-exponential for some absolute constant c > 0. As a result, for any p ∈ (0, 1] and any v ≥ max{(c/ε)√(2n log(2/p)), (2c/ε) log(2/p)}, with probability at least 1 − p, |Σ_{i=1}^n X_i| ≤ v.

Proof. First, we note that if X ∼ Lap_Z(1/ε), then it can be rewritten as X = N_1 − N_2, where N_1, N_2 are geometrically distributed, i.e., P[N_1 = k] = P[N_2 = k] = β^k(1 − β) with β = e^{−ε}. We can also write X = Ñ_1 − Ñ_2, where Ñ_1 = N_1 − E[N_1] and Ñ_2 = N_2 − E[N_2], since E[N_1] = E[N_2]. Then, by (Hillar & Wibisono, 2013, Lemma 4.3) and the equivalence of different definitions of sub-exponentiality (cf. Proposition 2.7.1 in Vershynin (2018)), Ñ_1, Ñ_2 are (c_1²/ε², c_1/ε)-sub-exponential for some absolute constant c_1 > 0. Now, since X = Ñ_1 − Ñ_2, by the weighted sum of zero-mean sub-exponential variables (cf. Corollary 4.2 in Zhang & Chen (2020)), X is (c²/ε², c/ε)-sub-exponential for some absolute constant c > 0. Thus, by Lemma 9, for any p ∈ (0, 1], letting v ≥ max{(c/ε)√(2n log(2/p)), (2c/ε) log(2/p)}, with probability at least 1 − p, |Σ_i X_i| ≤ v.

Theorem 6 (Local Pure-DP via SecAgg). Let P = (R, S, A) be a protocol such that R is given by Algorithm 4 with η_i ∼ Lap_Z(g/ε), S is any SecAgg protocol, and A is given by Algorithm 5.

Proof. Privacy: By the privacy guarantee of discrete Laplace and the sensitivity of x̃_i, we directly obtain local pure DP for our algorithm. Regret: The first step is to show that, with high probability, for a large batch size n = l(b) ≥ 2 log(2/p), |A(ỹ) − Σ_i x_i| ≤ O((1/ε)√(l(b) log(1/p))), where ỹ := ỹ_a(b).
We again divide the LHS of equation 13 as |A(ỹ) − Σ_i x_i| ≤ |A(ỹ) − (1/g) Σ_i x̃_i| (Term (i)) + |(1/g) Σ_i x̃_i − Σ_i x_i| (Term (ii)). In particular, Term (ii) can be bounded using the same method, i.e., for any p ∈ (0, 1], with probability at least 1 − p, Term (ii) ≤ O((1/ε)√(log(1/p))). To bound Term (i), the key is again to bound the total noise Σ_i η_i, i.e., to establish its concentration. We then make a minor modification of the proof of Lemma 2. The idea is simple: we divide the batches into two cases — one where l(b) ≥ 2 log(2T) (noting p = 1/T) and the other where l(b) ≤ 2 log(2T). First, one can easily bound the total regret for the second case: during all batches such that l(b) ≤ 2 log(2T), the total regret is bounded by O((K − 1) log T).
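The local-model variant (Algorithms 4 and 5) can be sketched in Python as follows. This is our own minimal illustration: the per-user discrete Laplace noise is drawn inside the randomizer, and the analyzer contains only the underflow-corrected decoding step, since no central noise is added.

```python
import math
import random

def local_randomizer(x, g, m, scale, rng):
    # Algorithm 4: encode, add this user's own discrete Laplace noise, clip mod m
    xt = math.floor(x * g)
    xt += 1 if rng.random() < x * g - math.floor(x * g) else 0
    beta = math.exp(-1.0 / scale)
    geom = lambda: int(math.log(1.0 - rng.random()) / math.log(beta))
    return (xt + geom() - geom()) % m

def local_analyzer(y_tilde, n, g, tau, m):
    # Algorithm 5: no central noise, only underflow-corrected decoding
    if y_tilde > n * g + tau:
        return (y_tilde - m) / g
    return y_tilde / g
```

By the distributive property of the modular sum, decoding is exact for any integer noisy sum S ∈ [−τ, ng + τ]: a non-negative S is returned as S/g, and a negative S appears as (S mod m) > ng + τ and is corrected.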

I TIGHT PRIVACY ACCOUNTING FOR RETURNING USERS VIA RDP

In this section, we demonstrate the key advantage of obtaining RDP guarantees compared to approximate DP in the context of MAB. The idea is standard in the privacy literature, and we provide the details for completeness. The key message is that the conversion from RDP to approximate DP saves an additional logarithmic factor in δ under composition, compared to the advanced composition theorem. In the main paper, as in all previous works on private bandit learning, we focus on the case where users are unique, i.e., there are no returning users. However, if some user returns and participates in multiple batches, then the total privacy loss needs to be controlled via composition. In particular, let us consider the following returning situation.

Assumption 2 (Returning Users). Any user can participate in at most B batches, but within each batch b, she contributes only once.

Note that the B batches can even span multiple learning processes, each of which consists of a T-round MAB online learning task. Let us first consider approximate DP in the distributed model, e.g., binomial noise in Tenenbaum et al. (2021) via secure shuffling. To guarantee that each user is (ε, δ)-DP during the entire learning process, by the advanced composition theorem (cf. Theorem 7), we need to guarantee that each batch b ∈ [B] is (ε_i, δ_i)-DP, where ε_i = ε/(2√(2B log(1/δ'))), δ_i = δ/(2B) and δ' = δ/2.



Footnotes:
— For more general linear bandits under distributed DP via shuffling, see Chowdhury & Zhou (2022b); Garcelon et al. (2022), which also have similar limitations.
— Instead of SecAgg, we can also use secure shuffling as S (see Appendix D.2), since each has advantages over the other. The high-level idea is the same for both techniques, i.e., to ensure that, after receiving messages from S, the server cannot distinguish each individual's message.
— One can sample from a Pólya distribution as follows: first sample λ ∼ Gamma(r, β/(1 − β)) and then sample X ∼ Poisson(λ), which is known to follow the Pólya(r, β) distribution (Goryczka & Xiong, 2015).
— The assumption that each user contributes only once is adopted in nearly all previous works for privacy analysis in bandits. We also provide a privacy analysis for returning users via RDP; see Appendix I.
— Code is available at https://github.com/sayakrc/Differentially-Private-Bandits.
— Since the output of a shuffling protocol is a multiset, we need to first compute (Σ ỹ) mod m as the ỹ for the subroutine A in Algorithm 1.
— The results in Feldman et al. (2022) hold for adaptive randomizers, but then one needs to first shuffle and then apply the local randomizer. For a fixed randomizer, shuffle-then-randomize is equivalent to randomize-then-shuffle.
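The Gamma–Poisson recipe for Pólya sampling mentioned in the footnote above can be sketched in Python as follows (our own helper names; the inner Poisson sampler is Knuth's multiplication method, adequate for the moderate rates arising here; `gammavariate` takes shape and scale parameters):

```python
import math
import random

def sample_polya(r, beta, rng):
    # lambda ~ Gamma(r, beta/(1-beta)) (shape/scale), then X | lambda ~ Poisson(lambda)
    lam = rng.gammavariate(r, beta / (1.0 - beta))
    # Knuth's Poisson sampler: count uniforms until their product drops below e^{-lambda}
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1
```

Per Remark 7, the difference of two independent Pólya(1, e^{−ε/g}) draws then yields a Lap_Z(g/ε) sample.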



Table 1: Best-known performance of private MAB under different privacy models (K = number of arms, T = time horizon, Δ_a = reward gap of arm a w.r.t. the best arm; ε, δ, α = privacy parameters).

each individual mechanism S ∘ R^{n_b} operates on n_b users' rewards, i.e., on a dataset from D^{n_b}. With this notation, we have the following definition of distributed differential privacy.

Definition 3 (Distributed DP). A protocol P = (R, S, A) is said to satisfy DP (or RDP) in the distributed model if the mechanism M_P satisfies Definition 1 (or Definition 2, respectively).

i.e., R_a(b)), which in turn gives the new mean estimate µ_a(b) of arm a after dividing by the total pulls l(b). Then, upper and lower confidence bounds, UCB_a(b) and LCB_a(b), respectively, are computed around the mean estimate µ_a(b) with a properly chosen confidence width β(b). Finally, after iterating over all active arms in batch b (denoted by the set Φ(b)), it adopts the standard arm-elimination criterion to remove all clearly sub-optimal arms, i.e., it removes an arm a from Φ(b) if UCB_a(b) falls below LCB_{a'}(b) of any other arm a' ∈ Φ(b). It now only remains to design a distributed DP protocol P.
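The elimination step just described amounts to a one-line filter; here is a minimal sketch with hypothetical container types (`ucb` and `lcb` map active arms to their confidence bounds):

```python
def eliminate(active, ucb, lcb):
    # Keep arm a only if UCB_a(b) is at least the best LCB among active arms;
    # otherwise some arm a' certifies that a is sub-optimal and a is removed.
    best_lcb = max(lcb[a] for a in active)
    return {a for a in active if ucb[a] >= best_lcb}
```

Note that including an arm's own LCB in the maximum is harmless, since LCB_a(b) ≤ UCB_a(b) always holds.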

Algorithm 1 Private Batch-Based Successive Arm Elimination
1: Parameters: number of arms K, time horizon T, privacy level ε > 0, confidence radii {β(b)}_{b≥1}
2: Initialize: batch count b = 1, active arm set Φ(b) = {1, . . . , K}, estimates µ_a(1) = 0 for all a ∈ [K]
3: for batch b = 1, 2, . . . do

of rewards R_a(b) = A(ỹ_a(b))
13: Update the mean estimate µ_a(b) = R_a(b)/l(b)
14: Compute confidence bounds UCB_a(b) = µ_a(b) + β(b) and LCB_a(b) = µ_a(b) − β(b)
15:

n_a(b) added at each batch b for each active arm a. The following lemma gives a generic regret bound for our algorithm under mild tail assumptions on n_a(b).

Lemma 1 (Generic regret). Let there exist constants σ, h > 0 such that, with probability ≥ 1 − p, |n_a(b)| ≤ N := O(σ√(log(KT/p)) + h log(KT/p)) for all b ≥ 1, a ∈ [K]. Then, setting the confidence radius β(b) = O(√(log(KT/p)/l(b)) + N/l(b)) and p = 1/T, Algorithm 1 enjoys expected regret

Figure 1: Comparison of time-average regret for Dist-DP-SE, Dist-RDP-SE, and DP-SE. Top: Synthetic Gaussian bandit instances with (a, b) large reward gap (easy instance) and (c) small reward gap (hard instance). Bottom: Bandit instances generated from MSLR-WEB10K learning to rank dataset.

In addition to the value-based algorithms considered in Vietri et al. (2020); Garcelon et al. (2021), similar performance is established for policy-based algorithms in tabular episodic RL under the central and local models (Chowdhury & Zhou, 2022a). Beyond the tabular setting, differentially private LQR control is studied in Chowdhury et al. (2021). More recently, private episodic RL with linear function approximation has been investigated in Luyo et al. (2021); Zhou (2022); Liao et al. (2021) under both central and local models, where a similar regret gap as in contextual bandits exists (i.e., O(√T) vs. O(T^{3/4})).

Recall that n_a(b) := R_a(b) − Σ_{i=1}^{l(b)} r_i^a(b) is the total noise injected into the sum of rewards for arm a during batch b. We consider the following tail property of n_a(b).

Assumption 1 (Concentration of Private Noise). Fix any p ∈ (0, 1], a ∈ [K], b ≥ 1; there exist non-negative constants σ, h (possibly depending on b) such that, with probability at least 1 − 2p, |n_a(b)| ≤ σ√(log(2/p)) + h log(2/p).

|Φ(b)| is the number of active arms in batch b. Then, for any p ∈ (0, 1], with probability at least 1 − 3p, the regret of Algorithm 1 is bounded by O( Σ_{a:Δ_a>0} log(KT/p)/Δ_a + Kσ√(log(KT/p)) + Kh log(KT/p) ). Taking p = 1/T and assuming T ≥ K yields the expected regret bound. Let E_b be the event that, for all active arms, |µ_a(b) − µ_a| ≤ β(b), and let E = ∩_{b≥1} E_b. Then, we first show that, with the choice of β(b) given by equation 2, we have P[E] ≥ 1 − 3p for any p ∈ (0, 1].

which implies that Assumption 1 holds with σ = O(1/ε) and h = O(1/ε). Inspired by Cheu & Yan (2021); Balle et al. (2020), we first divide the LHS of equation 4 as follows.

Algorithm 3 Analyzer A (Central Model)
1: Input: ỹ (output of SecAgg)
2: Parameters: precision g ∈ N, modulo m ∈ N, batch size n ∈ N, accuracy parameter τ
3: Generate discrete Laplace noise η and set η̃ = η mod m
4: Set ȳ = (ỹ + η̃) mod m
5: if ȳ > ng + τ then
6: set z = (ȳ − m)/g // correction for underflow
7: else set z = ȳ/g
8: Output: z

Algorithm 4 Local Randomizer R (Local Model)
1: Input: each user's data x_i ∈ [0, 1]
2: Parameters: precision g ∈ N, modulo m ∈ N, batch size n ∈ N, privacy parameter ϕ
3: Encode x_i as x̃_i = ⌊x_i g⌋ + Ber(x_i g − ⌊x_i g⌋)
4: Generate discrete Laplace noise η_i
5: Add noise and modulo clip: y_i = (x̃_i + η_i) mod m
6: Output: y_i // for SecAgg S

Algorithm 5 Analyzer A (Local Model)
1: Input: ỹ (output of SecAgg)
2: Parameters: precision g ∈ N, modulo m ∈ N, batch size n ∈ N, accuracy parameter τ
3: Set ȳ = ỹ
4: if ȳ > ng + τ then
5: set z = (ȳ − m)/g // correction for underflow
6: else set z = ȳ/g
7: Output: z

Fix any ε ∈ (0, 1); for each b ≥ 1, choose n = l(b), g = ⌈ε√n⌉, τ = ⌈(cg/ε)√(2n log(2/p))⌉ (c is the constant in Lemma 5), p = 1/T and m = ng + 2τ + 1. Then, Algorithm 1 instantiated with protocol P achieves (ε, 0)-DP in the local model with expected regret E[Reg(T)] = O( Σ_{a∈[K]: Δ_a>0} log T/(ε² Δ_a) ). Moreover, the communication per user before S is O(log m) bits.
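For concreteness, the parameter choices of Theorem 6 can be computed as follows (our own helper; the Lemma 5 constant c has no closed form here, so we expose it as an argument with an illustrative default of 1):

```python
import math

def local_protocol_params(eps, n, T, c=1.0):
    # Theorem 6 choices: p = 1/T, g = ceil(eps*sqrt(n)),
    # tau = ceil((c*g/eps)*sqrt(2*n*log(2/p))), m = n*g + 2*tau + 1
    p = 1.0 / T
    g = math.ceil(eps * math.sqrt(n))
    tau = math.ceil((c * g / eps) * math.sqrt(2.0 * n * math.log(2.0 / p)))
    m = n * g + 2 * tau + 1
    return g, tau, m
```

Note that m is chosen just large enough that a noisy sum in [−τ, ng + τ] never collides with itself modulo m, which is what makes the analyzer's underflow correction unambiguous.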

To this end, by Lemma 5, when n = l(b) ≥ 2 log(2/p) and τ = (cg/ε)√(2n log(2/p)), the above concentration holds. Then, following the same steps as before, we can conclude that Term (i) ≤ O((1/ε)√(2 l(b) log(1/p))) when l(b) ≥ 2 log(2/p).

For the first case, where l(b) is large, by equation 13 and the same steps as in the proof of Lemma 2, we can conclude that l(b̄_a) ≤ max{ c_1 log(KT/p)/Δ_a², c_2 log(KT/p)/(ε² Δ_a²) }, where b̄_a is the last batch in which the sub-optimal arm a is still active. Thus, putting everything together and noting that Δ_a ≤ 1 and ε ∈ (0, 1), we have

17: end for
18: Subroutine: Local Randomizer R (Input: x_i ∈ [0, 1], Output: y_i)
19: Require: precision g ∈ N, modulo m ∈ N, batch size n ∈ N, privacy level ε
20: Encode x_i as x̃_i = ⌊x_i g⌋ + Ber(x_i g − ⌊x_i g⌋)
21: Generate discrete noise η_i (depending on n, ε, g) // random noise generator
22: Add noise and modulo clip: y_i = (x̃_i + η_i) mod m
23: Subroutine: Secure Aggregation S (Input: y_1, . . . , y_n, Output: ỹ)
24: Require: modulo m ∈ N
25: Securely compute ỹ = (Σ_{i=1}^n y_i) mod m
26: Subroutine: Analyzer A (Input: ỹ, Output: z)
27: Require: precision g ∈ N, modulo m ∈ N, batch size n ∈ N, accuracy level τ ∈ R
28: if ỹ > ng + τ then
29: set z = (ỹ − m)/g // correction for underflow
30: else set z = ỹ/g
31: Output: z

MAB under the local model is first studied in Ren et al. (2020) for pure DP and in Zheng et al. (2020) for approximate DP. The local model has also been considered in Tao et al. (2022); Chen et al. (2020); Wang et al. (2022); Zhou & Tan (2021). Motivated by the regret gap between the central and local models (see Table 1), Tenenbaum et al. (2021) consider MAB in the distributed model via secure shuffling, where, however, only approximate DP is achieved and the resultant regret bound still has a gap with respect to the one under the central model.

Tenenbaum et al. (2021); Chowdhury & Zhou (2022b); Garcelon et al. (2022) are the only works that study the shuffle model in the context of bandit learning.

8. ACKNOWLEDGEMENTS

XZ is supported in part by NSF CNS-2153220. XZ would like to thank Albert Cheu for insightful discussion on achieving pure DP via shuffling.


Published as a conference paper at ICLR 2023

Thus, for each batch, the variance of the privacy noise is given by equation 14. Now, we turn to the case of RDP (e.g., obtained via the Skellam mechanism in Theorem 2 or via discrete Gaussian in Theorem 4). To gain insight, we again consider the case where the scaling factor s is large enough that each batch b is (α, αε²/2)-RDP for all α, i.e., it is approximately (ε²/2)-CDP. Thus, B-fold composition yields (Bε²/2)-CDP, and by the conversion lemma (cf. Lemma 10), it is (ε', δ)-DP with ε' = O(ε√(B log(1/δ))). Thus, in order to guarantee (ε, δ)-DP in the distributed model, the variance of the privacy noise at each batch is given by equation 15. Comparing equation 14 and equation 15, one can immediately see a gain of log(B/δ) in the variance, which translates to a gain of O(√(log(B/δ))) in the regret bound.

Remark 8. For a more accurate privacy accounting, a better approach is to use numerical evaluation rather than the above loose bound.

Remark 9. If one is interested in pure DP, then by simple composition the privacy loss scales linearly in B rather than as √B.
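The accounting described above can be made concrete numerically. The sketch below (our own helper names) computes the per-batch ε budget under advanced composition versus the CDP-composition-plus-conversion route, using the standard conversion ε' = ρ + 2√(ρ log(1/δ)) for ρ-CDP:

```python
import math

def per_batch_eps_advanced(eps, delta, B):
    # Advanced composition (Theorem 7) with delta' = delta/2:
    # eps_i = eps / (2*sqrt(2*B*log(1/delta')))
    return eps / (2.0 * math.sqrt(2.0 * B * math.log(2.0 / delta)))

def per_batch_eps_cdp(eps, delta, B):
    # Each batch is (eps_b^2/2)-CDP; B batches compose to rho = B*eps_b^2/2,
    # which converts to (rho + 2*sqrt(rho*log(1/delta)), delta)-DP.
    # Solve rho + 2*sqrt(rho*L) = eps for rho, with L = log(1/delta).
    L = math.log(1.0 / delta)
    sqrt_rho = math.sqrt(L + eps) - math.sqrt(L)  # positive root of the quadratic
    return math.sqrt(2.0 * sqrt_rho**2 / B)
```

For instance, with ε = 1, δ = 1e−6 and B = 10, the CDP route allows roughly twice the per-batch budget of advanced composition, i.e., noticeably less noise per batch.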

J MORE DETAILS ON SIMULATIONS

We numerically compare the performance of Algorithm 1 under pure-DP and RDP guarantees in the distributed model (named Dist-DP-SE and Dist-RDP-SE, respectively) with the DP-SE algorithm of Sajed & Sheffet (2019), which achieves pure DP under the central model. We vary the privacy level as ε ∈ {0.1, 0.5, 1}, where a lower value of ε indicates a higher level of privacy. In Figure 2, we consider the easy instance, where arm means are sampled uniformly in [0.25, 0.75]. In Figure 3, we consider the hard instance, where arm means are sampled uniformly in [0.45, 0.55]. The sampled rewards are Gaussian with the given means, truncated to [0, 1]. We plot results for K = 10 arms. We see that, for larger time horizons T, the time-average regret of Dist-DP-SE is order-wise the same as that of DP-SE, i.e., we are able to achieve regret performance in the distributed trust model similar to that achieved in the central trust model. As mentioned before, we observe a gap for small values of T, which is the price we pay for discrete privacy noise (i.e., additional data quantization error) and for not requiring a trusted central server. Hence, if we lower the level of privacy (i.e., higher ε), this gap becomes smaller, which indicates an inherent trade-off between privacy and utility. We also observe that if we relax the requirement from pure DP to RDP, then we achieve a considerable gain in regret performance, more so when the privacy level is high (i.e., ε is small).
This gain depends on the scaling factor s: the higher the scale, the higher the gain in regret. In Figure 4, we compare the regret achieved by our generic batch-based successive arm elimination algorithm (Algorithm 1) instantiated with different protocols P under different trust models and privacy guarantees: (i) central model with pure DP (CDP-SE); (ii) local model with pure DP (LDP-SE); (iii) distributed model with pure DP (Dist-DP-SE), Rényi DP (Dist-RDP-SE) and concentrated DP (Dist-CDP-SE). First, consider the pure-DP algorithms. We observe that the regret performance of CDP-SE and Dist-DP-SE is similar (with much better regret than LDP-SE). Now, if we relax the pure-DP requirement, then we achieve better regret both for Dist-RDP-SE and Dist-CDP-SE. Furthermore, Dist-CDP-SE performs better in terms of regret than Dist-RDP-SE. This is because under CDP we use discrete Gaussian noise (which has sub-Gaussian tails) as opposed to the Skellam noise (which has sub-exponential tails) used under RDP. In Figure 5, we show clearer plots for our experiment on bandit instances generated from the Microsoft Learning to Rank dataset MSLR-WEB10K (Qin & Liu, 2013).

Lemma 6 (Concentration of sub-Gaussian). A mean-zero random variable X is σ²-sub-Gaussian if, for all λ ∈ R, E[e^{λX}] ≤ exp(λ²σ²/2). Then, for any p ∈ (0, 1], with probability at least 1 − p, |X| ≤ σ√(2 log(2/p)).

Lemma 7 (Concentration of sub-exponential). A mean-zero random variable X is (σ², h)-sub-exponential if E[e^{λX}] ≤ exp(λ²σ²/2) for all |λ| ≤ 1/h. Then, for any p ∈ (0, 1], with probability at least 1 − p, |X| ≤ √2 σ√(log(2/p)) + 2h log(2/p).

Lemma 8 (Hoeffding's Inequality). Let X_1, . . . , X_n be independent and identically distributed (i.i.d.) random variables with X_i ∈ [0, 1] with probability one. Then, for any p ∈ (0, 1], with probability at least 1 − p, |(1/n) Σ_i X_i − E[X_1]| ≤ √(log(2/p)/(2n)).

Lemma 9 (Sum of sub-exponential). Let {X_i}_{i=1}^n be independent zero-mean (σ_i², h_i)-sub-exponential random variables.
Then, Σ_i X_i is (Σ_i σ_i², h*)-sub-exponential, where h* := max_i h_i. Thus, for any p ∈ (0, 1], if v ≥ max{√(2 Σ_i σ_i² log(2/p)), 2h* log(2/p)}, then with probability at least 1 − p, |Σ_i X_i| ≤ v.

Lemma 10 (Conversion Lemma). If M satisfies (α, ε(α))-RDP, then for any δ ∈ (0, 1), M satisfies (ε, δ)-DP, where ε = inf_{α>1} { ε(α) + log(1/(αδ))/(α − 1) + log(1 − 1/α) }. If M satisfies (ε²/2)-CDP, then for any δ ∈ (0, 1), M satisfies (ε', δ)-DP, where ε' = ε²/2 + ε√(2 log(1/δ)). Moreover, if M satisfies (ε, 0)-DP, then it satisfies (α, ε²α/2)-RDP simultaneously for all α ∈ (1, ∞).

Theorem 7 (Advanced composition). Given target privacy parameters ε' ∈ (0, 1) and δ' > 0, to ensure (ε', kδ + δ')-DP for the composition of k (adaptive) mechanisms, it suffices that each mechanism is (ε, δ)-DP with ε = ε'/(2√(2k log(1/δ'))).
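The first conversion in Lemma 10 is easy to evaluate numerically over a grid of orders α. Below is a small Python sketch (our own helper; the example RDP curve ε(α) = ρα corresponds to a Gaussian-type mechanism and is illustrative only):

```python
import math

def rdp_to_dp(eps_of_alpha, delta, alphas):
    # Lemma 10: eps = inf over alpha > 1 of
    #   eps(alpha) + log(1/(alpha*delta))/(alpha - 1) + log(1 - 1/alpha)
    best = float("inf")
    for a in alphas:
        cand = (eps_of_alpha(a)
                + math.log(1.0 / (a * delta)) / (a - 1.0)
                + math.log(1.0 - 1.0 / a))
        best = min(best, cand)
    return best
```

A finer grid can only lower the returned ε, since the lemma takes an infimum over all α > 1; in particular the grid minimum is never worse than any single grid point such as α = 2.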

