DISTRIBUTED DIFFERENTIAL PRIVACY IN MULTI-ARMED BANDITS

Abstract

We consider the standard K-armed bandit problem under a distributed trust model of differential privacy (DP), which enables privacy guarantees without a trustworthy server. Under this trust model, previous work on private bandits largely focuses on achieving privacy via a shuffle protocol, where a batch of users' data is randomly permuted before being sent to a central server. This protocol achieves an (ε, δ) or approximate DP guarantee at the price of an additive O(K log T √(log(1/δ))/ε) factor in the T-step cumulative regret. In contrast, the optimal privacy cost to achieve a stronger (ε, 0) or pure DP guarantee under the widely used central trust model is only Θ(K log T/ε), where, however, a trusted server is required. In this work, we aim to obtain a pure DP guarantee under the distributed trust model while sacrificing no more regret than under the central trust model. We achieve this by designing a generic bandit algorithm based on successive arm elimination, where privacy is guaranteed by corrupting rewards with an equivalent of discrete Laplace noise, ensured by a secure computation protocol. We also show that our algorithm, when instantiated with Skellam noise and the secure protocol, ensures Rényi differential privacy (a stronger notion than approximate DP) under the distributed trust model with a privacy cost of O(K √(log T)/ε). Our theoretical findings are corroborated by numerical evaluations on both synthetic and real-world data.

1. INTRODUCTION

The multi-armed bandit (MAB) problem provides a simple but powerful framework for sequential decision-making under uncertainty with bandit feedback, and has found a wide range of practical applications such as online advertising (Abe et al., 2003), product recommendation (Li et al., 2010), and clinical trials (Tewari & Murphy, 2017), to name a few. Along with its broad applicability, however, comes an increasing concern about privacy risks in MAB due to its intrinsic dependence on users' feedback, which could leak users' sensitive information (Pan et al., 2019). To alleviate this concern, the notion of differential privacy, introduced by Dwork et al. (2006) in theoretical computer science, has recently been adopted to design privacy-preserving bandit algorithms (see, e.g., Mishra & Thakurta (2015); Tossou & Dimitrakakis (2016); Shariff & Sheffet (2018)). Differential privacy (DP) provides a principled way to mathematically prove privacy guarantees against adversaries with arbitrary auxiliary information about users. To achieve this, a differentially private bandit algorithm typically relies on well-tuned random noise to obscure each user's contribution to the output, calibrated to the privacy levels ε and δ: smaller values yield stronger protection but worse utility (i.e., higher regret). For example, the central server of a recommendation system can use random noise to perturb its statistics on each item after receiving feedback (i.e., clicks/ratings) from users.
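As a concrete toy example of this perturbation step (our own minimal sketch, not the paper's algorithm; the function name and parameters are illustrative), the classical Laplace mechanism releases a noisy per-item click count:

```python
import numpy as np

def laplace_mechanism(true_count, epsilon, sensitivity=1.0, rng=None):
    """Release a count with an (epsilon, 0)-DP guarantee.

    One user joining or leaving changes the count by at most `sensitivity`,
    so Laplace noise with scale sensitivity/epsilon suffices; smaller
    epsilon means larger noise (stronger privacy, worse utility).
    """
    rng = rng or np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)
```

A recommendation server would then publish `laplace_mechanism(clicks, epsilon)` for each item instead of the raw click count.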
This is often termed the central model (Dwork et al., 2014), since the central server has the trust of its users and hence direct access to their raw data.

Table 1: Best-known performance of private MAB under different privacy models (K = number of arms, T = time horizon, ∆_a = reward gap of arm a w.r.t. the best arm; ε, δ, α = privacy parameters).

  Central (ε, 0)-DP:           Θ( Σ_{a: ∆_a>0} log T/∆_a + K log T/ε )               (Sajed & Sheffet, 2019)
  Local (ε, 0)-DP:             Θ( (1/ε²) Σ_{a: ∆_a>0} log T/∆_a )                    (Ren et al., 2020)
  Distributed (ε, δ)-DP:       O( Σ_{a: ∆_a>0} log T/∆_a + K log T √(log(1/δ))/ε )   (Tenenbaum et al., 2021)
  Distributed (ε, 0)-DP:       Θ( Σ_{a: ∆_a>0} log T/∆_a + K log T/ε )               (Theorem 1)
  Distributed (α, αε²/2)-RDP:  O( Σ_{a: ∆_a>0} log T/∆_a + K √(log T)/ε )            (Theorem 2)

Under this model, an optimal private MAB algorithm with a pure DP guarantee (i.e., δ = 0) was proposed by Sajed & Sheffet (2019); it incurs only an additive O(K log T/ε) term in the cumulative regret compared to the standard setting where privacy is not sought (Auer, 2002). However, this high-trust model is not always feasible in practice, since users may not be willing to share their raw data directly with the server. This motivates the local model of trust (Kasiviswanathan et al., 2011), where DP is achieved without a trusted server: each user perturbs her data prior to sharing it with the server. This ensures stronger privacy protection, but leads to a high cost in utility due to the large aggregated noise from all users. As shown in Ren et al. (2020), under the local model, private MAB algorithms must incur a multiplicative 1/ε² factor in the regret, rather than the additive term of the central model. In an attempt to recover the utility of the central model without requiring a trustworthy server as in the local model, an intermediate DP trust model called the distributed model has gained increasing interest, especially in the context of (federated) supervised learning (Kairouz et al., 2021b; Agarwal et al., 2021; Kairouz et al., 2021a; Girgis et al., 2021; Lowy & Razaviyayn, 2021). Under this model, each user first perturbs her data via a local randomizer, and then sends the randomized data to a secure computation function.
This secure function can be leveraged to guarantee privacy through noise aggregated from distributed users. There are two popular secure computation functions: secure aggregation (Bonawitz et al., 2017) and secure shuffling (Bittau et al., 2017). The former often relies on cryptographic primitives to securely aggregate users' data so that the central server only learns the aggregated result, while the latter securely shuffles users' messages to hide their source. To the best of our knowledge, the distributed DP model is far less studied in online learning than in supervised learning, with the only known results for standard K-armed bandits given in Tenenbaum et al. (2021), where secure shuffling is adopted. Despite being pioneering work, these results have several limitations: (i) the privacy guarantee is obtained only for approximate DP (δ > 0); the stronger pure DP (δ = 0) guarantee is not achieved; (ii) the cost of privacy is a multiplicative √(log(1/δ)) factor away from that of the central model, leading to a higher regret bound; (iii) the secure protocol works only for binary rewards (or is communication-intensive for real-valued rewards).[1]

Our contributions. In this work, we design the first communication-efficient MAB algorithm that satisfies pure DP in the distributed model while attaining the same regret bound as in the central model (see Table 1). We overcome several key challenges that arise in the design and analysis of distributed DP algorithms for bandits; we list these challenges and our proposed solutions below. (a) Private and communication-efficient algorithm design. Secure aggregation (SecAgg) works only in the integer domain due to an inherent modular operation (Bonawitz et al., 2017). Hence, leveraging it in bandits to achieve distributed DP with real-valued rewards requires adopting data quantization, discrete privacy noise, and modular summation arithmetic in the algorithm design.
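To illustrate the secure-aggregation primitive, here is a simplified pairwise-masking sketch (our own toy construction, not the actual multi-party protocol of Bonawitz et al.): every pair of users shares a random mask that cancels in the sum, so the server can recover only the modular sum of the integer inputs.

```python
import secrets

def pairwise_masks(n_users, modulus):
    """For each pair i < j, draw a shared random mask: user i adds it and
    user j subtracts it, so every mask cancels in the modular sum."""
    masks = [0] * n_users
    for i in range(n_users):
        for j in range(i + 1, n_users):
            m = secrets.randbelow(modulus)
            masks[i] = (masks[i] + m) % modulus
            masks[j] = (masks[j] - m) % modulus
    return masks

def secure_aggregate(values, modulus):
    """The server only ever sees uniformly masked values, yet the sum of the
    masked values equals the true sum of `values` modulo `modulus`."""
    masks = pairwise_masks(len(values), modulus)
    masked = [(v + mk) % modulus for v, mk in zip(values, masks)]
    return sum(masked) % modulus
```

The modular arithmetic here is exactly why the protocol needs integer (quantized) inputs and discrete noise, as noted above.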
To this end, we take a batched version of the successive arm elimination technique as the building block of our algorithm and, on top of it, employ a privacy protocol tailored to discrete privacy noise and modular operations (see Algorithm 1). Instantiating the protocol at each user with Pólya random noise, we
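The skeleton just described might look as follows. This is a hedged sketch with made-up phase lengths and elimination radius (`pull_arm`, `n_phases`, and the confidence constant are our illustrative choices), and it adds discrete Laplace noise directly at the server rather than assembling it from per-user Pólya shares under secure aggregation as the paper's protocol does:

```python
import numpy as np

def private_successive_elimination(pull_arm, n_arms, epsilon, n_phases=9, rng=None):
    """Batched successive arm elimination with discrete-Laplace-perturbed sums."""
    rng = rng or np.random.default_rng()
    p = 1.0 - np.exp(-epsilon)      # difference of two Geometric(p) variables
                                    # has the discrete Laplace law ∝ exp(-eps|k|)
    active = list(range(n_arms))
    means = {}
    for phase in range(1, n_phases + 1):
        batch = 2 ** phase          # pulls per surviving arm in this phase
        for a in active:
            total = sum(pull_arm(a) for _ in range(batch))
            noise = int(rng.geometric(p) - rng.geometric(p))
            means[a] = (total + noise) / batch
        # eliminate arms whose noisy mean is far below the best noisy mean;
        # the radius mixes a Hoeffding term and the privacy-noise term
        # (the constant 100 is an arbitrary illustrative confidence level)
        best = max(means[a] for a in active)
        radius = np.sqrt(2 * np.log(100.0) / batch) + 1.0 / (epsilon * batch)
        active = [a for a in active if means[a] >= best - 2 * radius]
    return active
```

Because each phase doubles the batch size, the noise added per phase is a constant number of discrete Laplace variables, which is what keeps the privacy cost additive in the regret.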



[1] For more general linear bandits under distributed DP via shuffling, see Chowdhury & Zhou (2022b) and Garcelon et al. (2022), which have similar limitations.




