A STATISTICAL FRAMEWORK FOR PERSONALIZED FEDERATED LEARNING AND ESTIMATION: THEORY, ALGORITHMS, AND PRIVACY

Abstract

A distinguishing characteristic of federated learning is that the (local) client data can be statistically heterogeneous. This heterogeneity has motivated the design of personalized learning, where individual (personalized) models are trained through collaboration. Various personalization methods have been proposed in the literature, with seemingly very different forms and methods, ranging from the use of a single global model for local regularization and model interpolation to the use of multiple global models for personalized clustering. In this work, we begin with a statistical framework that unifies several different algorithms and also suggests new ones. We apply our framework to personalized estimation and connect it to the classical empirical Bayes methodology. We develop novel private personalized estimation under this framework. We then use our statistical framework to propose new personalized learning algorithms, including AdaPeD, which is based on information-geometry regularization and numerically outperforms several known algorithms. We develop privacy for personalized learning methods with guarantees for user-level privacy and composition. We numerically evaluate the performance as well as the privacy for both the estimation and learning problems, demonstrating the advantages of our proposed methods.

1. INTRODUCTION

The federated learning (FL) paradigm has had huge recent success in both industry and academia (McMahan et al., 2017; Kairouz et al., 2021), as it enables leveraging data available on dispersed devices for learning while maintaining data privacy. Yet it was recently realized that, for some applications, a single global learning model may perform poorly for individual clients because of the statistical heterogeneity of local data. This motivated the need for personalized learning achieved through collaboration, and a plethora of personalized models have been proposed in the literature (Fallah et al., 2020; Dinh et al., 2020; Deng et al., 2020; Mansour et al., 2020; Acar et al., 2021; Li et al., 2021; Ozkara et al., 2021; Zhang et al., 2021; Hu et al., 2020). However, the proposed approaches appear to use very different forms and methods, and an underlying fundamental statistical framework is lacking. Such a framework could help develop theoretical bounds for performance, suggest new algorithms, and perhaps give grounding to known methods. Our work addresses this gap. In particular, we consider the fundamental question of how one can use collaboration to help personalized learning and estimation for users who have limited data that they want to keep private. Our proposed framework is founded on the requirement not only of personalization but also of privacy, as maintaining local data privacy is what makes the federated learning framework attractive; thus any algorithm that aims to be impactful also needs to give formal privacy guarantees. The goal of this paper is to develop a statistical framework that leads to new algorithms with provable privacy guarantees and performance bounds.
Our main contributions are (i) development of a statistical framework for federated personalized estimation and learning; (ii) theoretical bounds and novel algorithms for private personalized estimation; and (iii) design and privacy analysis of new private personalized learning algorithms, as elaborated below. Omitted proofs/details are in the appendices.
• Statistical framework: We connect this problem to the classical empirical Bayes method, pioneered by Stein (1956), James & Stein (1961), and Robbins (1956), which proposed a hierarchical statistical model (Gelman et al., 2013). This is modeled by an unknown population distribution P from which local parameters {θ_i} are generated, which in turn generate the local data through the distribution Q(θ_i). Despite the large literature on this topic, especially in the context of statistical estimation, creating a framework for FL poses new challenges. In contrast to classical empirical Bayes estimation, we introduce a distributed setting and develop a framework that allows information (communication and privacy) constraints. This framework enables us to develop statistical performance bounds and suggests (private) personalized federated estimation algorithms. Moreover, we extend our framework beyond estimation to (supervised) distributed learning, where clients want to build local predictive models with limited local (labeled) samples; we develop this framework in Section 3, which leads to new (private) personalized learning algorithms.
• Private personalized estimation: Our goal is to estimate individual (local) parameters when each user has very limited (heterogeneous) data. Such a scenario motivates federated estimation of individual parameters, privately.
More precisely, the users observe data generated by an unknown distribution parametrized by their individual (unknown) local parameters θ_i, and want to estimate θ_i leveraging very limited local data; see Section 2 for more details. For the hierarchical statistical model, classical results have shown that one can enhance the estimate of individual parameters based on the observations of a population of samples, even though the parameters are generated independently from an unknown population distribution. However, this has not been studied for the distributed case with privacy and communication constraints, which we do (see Theorem 2 for the Gaussian case and Theorem 4 for the Bernoulli case, and Appendix D for mixture population models). We estimate the (parametrized) population distribution under these privacy and communication constraints and use it as an empirical prior for local estimation. The effective amplification of local samples through collaboration, in Section 2, gives us theoretical insight into when collaboration is most useful under privacy and/or communication constraints. Our results suggest how to optimally balance estimates from local and population models. We also numerically evaluate these methods, including an application to polling data (see Section 4 and the appendices), to show the advantages of such collaborative estimation compared to local methods.
• Private personalized learning: The goal here is to obtain individual learning models capable of predicting labels with limited local data in a supervised learning setting. This is the use case for federated learning with privacy guarantees. It is intimately related to the estimation problem, with distinctions including (i) the aim of designing good label predictors rather than just estimating local parameters, and (ii) the focus on iterative methods for optimization, which requires strong compositional privacy guarantees.
Therefore, the statistical formulation for learning has a similar flavor to that in estimation, where there is a population model for local (parametrized) statistics for labeled data; see Section 3 for more details. We develop several algorithms inspired by the statistical framework, including AdaPeD (in Section 3.2), AdaMix (in Section 3.1), and DP-AdaPeD (in Section 3.3). AdaPeD uses information divergence constraints along with adaptive weighting of local models and population models. Operating in probability (rather than Euclidean) space, using information geometry (divergence), enables AdaPeD to work with different local model sizes and architectures, giving it greater flexibility than existing methods. We integrate it with user-level privacy to develop DP-AdaPeD, with strong compositional privacy guarantees (Theorem 5). AdaMix is inspired by mixture population distributions; it adaptively weighs multiple global models and combines them with local data for personalization. We numerically evaluate these algorithms on synthetic and real data in Section 4.

Related Work. Our work can be seen as lying at the intersection of personalized learning, estimation, and privacy. Below we give a brief description of related work; a more detailed comparison, which connects our framework to other personalized algorithms, is given in Appendix J.

Personalized FL: Recent work adopted different approaches for learning personalized models, which can be explained by our statistical framework for suitable choices of population distributions, as explained in Appendix J. These include meta-learning based methods (Fallah et al., 2020; Acar et al., 2021; Khodak et al., 2019); regularization (Deng et al., 2020; Mansour et al., 2020; Hanzely

Privacy for Personalized Learning.
There has been a lot of work on privacy for FL when the goal is to learn a single global model (see Girgis et al. (2021b) and references therein), though fewer papers address user-level privacy (Liu et al., 2020; Levy et al., 2021; Ghazi et al., 2021). There has been more recent work on applying these ideas to learn personalized models (Girgis et al., 2022; Jain et al., 2021b; Geyer et al., 2017; Hu et al., 2020; Li et al., 2020). These are for specific algorithms/models; e.g., Jain et al. (2021b) focuses on the common representation model for linear regression described earlier, and others focus on item-level privacy (Hu et al., 2020; Li et al., 2020). We believe that DP-AdaPeD, proposed in this paper, is among the first user-level private personalized learning algorithms with compositional guarantees that are applicable to general deep learning architectures.

2. PERSONALIZED ESTIMATION

We consider a client-server architecture with $m$ clients. Let $P(\Gamma)$ denote a global population distribution parameterized by an unknown $\Gamma$, and let $\theta_1, \dots, \theta_m$ be sampled i.i.d. from $P(\Gamma)$; they are unknown to the clients. Client $i$ is given a dataset $X_i := (X_{i1}, \dots, X_{in})$, where $X_{ij}$, $j \in [n]$, are sampled i.i.d. from some distribution $Q(\theta_i)$, parameterized by $\theta_i \in \mathbb{R}^d$. Note that heterogeneity in clients' datasets is induced through the variance in $P(\Gamma)$; if the variance of $P(\Gamma)$ is zero, then all clients observe i.i.d. datasets sampled from the same underlying distribution. The goal at client $i$, for all $i \in [m]$, is to estimate $\theta_i$ with the help of the server. We focus on one-round communication schemes, where client $j$ applies a (potentially randomized) mechanism $q$ to its dataset $X_j$ and sends $q_j := q(X_j)$ to the server, which aggregates the received messages into $\mathrm{Agg}(q_1, \dots, q_m)$ and broadcasts the result to all clients. Based on $(X_i, \mathrm{Agg}(q_1, \dots, q_m))$, client $i$ outputs an estimate $\hat{\theta}_i$ of $\theta_i$. We measure the performance of $\hat{\theta}_i$ through the Bayesian risk for mean squared error (MSE), as defined below (where $P$ is the true prior distribution with associated density $\pi$, $\theta_i \sim P$ is the true local parameter, and $\hat{\theta}_i = \hat{\theta}(X_i, \mathrm{Agg}(q_1, \dots, q_m))$ is the estimator):

$$\mathbb{E}_{\theta_i \sim P}\, \mathbb{E}_{\hat{\theta}_i, q, X_1, \dots, X_m} \|\hat{\theta}_i - \theta_i\|^2 = \int \mathbb{E}_{\hat{\theta}_i, q, X_1, \dots, X_m} \|\hat{\theta}_i - \theta_i\|^2\, \pi(\theta_i)\, d\theta_i. \tag{1}$$

The above statistical framework can model many different scenarios, and we study three settings in detail: the Gaussian and Bernoulli models (Sections 2.1 and 2.2 below) and the mixture model (Appendix D).

2.1. GAUSSIAN MODEL

In the Gaussian setting, $P(\Gamma) = \mathcal{N}(\mu, \sigma_\theta^2 I_d)$ and $Q(\theta_i) = \mathcal{N}(\theta_i, \sigma_x^2 I_d)$ for all $i \in [m]$, which implies that $\theta_1, \dots, \theta_m \sim \mathcal{N}(\mu, \sigma_\theta^2 I_d)$ i.i.d. and $X_{i1}, \dots, X_{in} \sim \mathcal{N}(\theta_i, \sigma_x^2 I_d)$ i.i.d. for $i \in [m]$. Here, $\sigma_\theta \ge 0$ and $\sigma_x > 0$ are known, and $\mu, \theta_1, \dots, \theta_m$ are unknown. For the case of a single local sample this is identical to the classical James-Stein estimator (James & Stein, 1961); Theorem 1 is a simple extension to multiple local samples and is a stepping stone for the information-constrained estimation result of Theorem 2. Omitted proofs/details are provided in Appendix B.

Our proposed estimator. There is no distribution on $\mu$, but given $\mu$ we know the distribution of the $\theta_i$'s and, subsequently, of the $X_{ij}$'s. So, we consider the maximum likelihood estimator:

$$\hat{\theta}_1, \dots, \hat{\theta}_m, \hat{\mu} := \arg\max_{\theta_1, \dots, \theta_m, \mu}\; p_{\{\theta_i, X_i\} | \mu}(\theta_1, \dots, \theta_m, X_1, \dots, X_m \mid \mu). \tag{2}$$

Theorem 1. Solving (2) yields the following closed-form expressions for $\hat{\mu}$ and $\hat{\theta}_1, \dots, \hat{\theta}_m$:

$$\hat{\mu} = \frac{1}{m} \sum_{i=1}^m \bar{X}_i \quad \text{and} \quad \hat{\theta}_i = a \bar{X}_i + (1 - a)\hat{\mu}, \ \text{for } i \in [m], \quad \text{where } a = \frac{\sigma_\theta^2}{\sigma_\theta^2 + \sigma_x^2 / n}. \tag{3}$$

The above estimator achieves the MSE:

$$\mathbb{E}_{\theta_i, X_1, \dots, X_m} \|\hat{\theta}_i - \theta_i\|^2 \le \frac{d \sigma_x^2}{n} \left( \frac{1 - a}{m} + a \right).$$

It follows that the mechanism $q$ and the aggregation function $\mathrm{Agg}$ for the estimators in (3) (as described in (1)) are just the average functions: client $i$ sends $q_i = q(X_i) := \bar{X}_i = \frac{1}{n} \sum_{j=1}^n X_{ij}$ to the server, and the server sends $\hat{\mu} := \mathrm{Agg}(q_1, \dots, q_m) = \frac{1}{m} \sum_{i=1}^m q_i$ back to the clients.

Remark 1 (Personalized estimate vs. local estimate). When $\sigma_\theta \to 0$, then $a \to 0$, which implies that $\hat{\theta}_i \to \hat{\mu}$ and MSE $\to d\sigma_x^2 / (mn)$. Otherwise, when $\sigma_\theta^2$ is large in comparison to $\sigma_x^2 / n$, or $n \to \infty$, then $a \to 1$, which implies that $\hat{\theta}_i \to \bar{X}_i$ and MSE $\to d\sigma_x^2 / n$.
These conform to the facts that (i) when there is no heterogeneity, the global average is the best estimator, and (ii) when heterogeneity is not small and we have a lot of local samples, the local average is the best estimator. Observe that the multiplicative gap between the MSE of the proposed personalized estimator and the MSE of the local estimator (based on local data only, which gives an MSE of $d\sigma_x^2 / n$) is given by $\left( \frac{1 - a}{m} + a \right) \le 1$, which proves the superiority of the personalized model over the local model. This gap is equal to $1/m$ when $\sigma_\theta = 0$ and equal to $0.01$ when, for example, $m = 10^4$, $n = 100$, $\sigma_x^2 = 10$, and $\sigma_\theta^2 = 10^{-3}$.

Remark 2 (Optimality of our personalized estimator). In Appendix B, we show the minimax lower bound

$$\inf_{\hat{\theta}} \sup_{\theta \in \Theta} \mathbb{E}_{X \sim \mathcal{N}(\theta, \sigma_x^2)} \|\hat{\theta}(X) - \theta\|^2 \ge \frac{d \sigma_x^2}{n} \left( \frac{1 - a}{m} + a \right),$$

which exactly matches the upper bound on the MSE in Theorem 1 and thus establishes the optimality of our personalized estimator in (3).

Privacy and communication constraints. Observe that the scheme presented above does not protect the privacy of clients' data, and that the messages from the clients to the server can be made more communication-efficient. Both can be achieved by employing specific mechanisms $q$ at the clients: for privacy, we can take a differentially private $q$, and for communication efficiency, we can take $q$ to be a quantizer. Inspired by the scheme presented above, here we consider $q$ to be a function $q : \mathbb{R}^d \to \mathcal{Y}$ that takes the average of $n$ data points as its input, and the aggregator function $\mathrm{Agg}$ to be the average function. Define $\hat{\mu}_q := \frac{1}{m} \sum_{i=1}^m q(\bar{X}_i)$ and consider the following personalized estimator for the $i$-th client:

$$\hat{\theta}_i = a \bar{X}_i + (1 - a)\hat{\mu}_q, \quad \text{for some } a \in [0, 1]. \tag{4}$$

Theorem 2. Suppose that, for all $x \in \mathbb{R}^d$, $q$ satisfies $\mathbb{E}[q(x)] = x$ and $\mathbb{E}\|q(x) - x\|^2 \le d\sigma_q^2$ for some finite $\sigma_q$. Then the personalized estimator in (4) has MSE:

$$\mathbb{E}_{\theta_i, q, X_1, \dots, X_m} \|\hat{\theta}_i - \theta_i\|^2 \le \frac{d \sigma_x^2}{n} \left( \frac{1 - a}{m} + a \right), \quad \text{where } a = \frac{\sigma_\theta^2 + \sigma_q^2 / (m - 1)}{\sigma_\theta^2 + \sigma_q^2 / (m - 1) + \sigma_x^2 / n}. \tag{5}$$

Furthermore, assuming $\mu \in [-r, r]^d$ for some constant $r$ (but $\mu$ is unknown), we have:
1. Communication efficiency: For any $k \in \mathbb{N}$, there is a $q$ whose output can be represented using $k$ bits (i.e., $q$ is a quantizer) that achieves the MSE in (5) with probability at least $1 - 2/(mn)$ and with $\sigma_q = \frac{b}{2^k - 1}$, where $b = r + \sigma_\theta \sqrt{\log(m^2 n)} + \frac{\sigma_x}{\sqrt{n}} \sqrt{\log(m^2 n)}$.
2. Privacy: For any $\epsilon_0 \in (0, 1)$ and $\delta > 0$, there is a $q$ that is user-level $(\epsilon_0, \delta)$-locally differentially private and achieves the MSE in (5) with probability at least $1 - 2/(mn)$ and with $\sigma_q = \frac{b}{\epsilon_0} \sqrt{8 \log(2/\delta)}$, where $b = r + \sigma_\theta \sqrt{\log(m^2 n)} + \frac{\sigma_x}{\sqrt{n}} \sqrt{\log(m^2 n)}$.
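The shrinkage weight and the multiplicative MSE gap from Theorems 1 and 2 are easy to evaluate numerically. The following is a minimal sketch in plain Python, not the authors' implementation; the function names `shrinkage_weight`, `mse_gap`, and `personalized_estimates` are hypothetical and introduced only for illustration. It reproduces the numerical example of Remark 1 (a gap of 0.01 for m = 10^4, n = 100, σ_x² = 10, σ_θ² = 10⁻³).

```python
def shrinkage_weight(sigma_theta2, sigma_x2, n, sigma_q2=0.0, m=None):
    """Weight a placed on the local mean.

    With sigma_q2 == 0 this is the Theorem 1 weight; with sigma_q2 > 0 it is
    the Theorem 2 weight, which adds sigma_q^2 / (m - 1) to sigma_theta^2.
    """
    extra = 0.0
    if sigma_q2 > 0.0:
        if m is None or m < 2:
            raise ValueError("m >= 2 is required when sigma_q2 > 0")
        extra = sigma_q2 / (m - 1)
    signal = sigma_theta2 + extra
    return signal / (signal + sigma_x2 / n)


def mse_gap(a, m):
    """Multiplicative gap (1 - a) / m + a relative to the local MSE d * sigma_x^2 / n."""
    return (1.0 - a) / m + a


def personalized_estimates(local_means, a):
    """Combine each client's local mean with the global average of local means.

    local_means: list of m vectors (lists of floats), one per client.
    Returns (estimates, mu_hat), with estimates[i] = a * mean_i + (1 - a) * mu_hat.
    """
    m = len(local_means)
    d = len(local_means[0])
    mu_hat = [sum(x[k] for x in local_means) / m for k in range(d)]
    estimates = [[a * x[k] + (1.0 - a) * mu_hat[k] for k in range(d)]
                 for x in local_means]
    return estimates, mu_hat


if __name__ == "__main__":
    a = shrinkage_weight(sigma_theta2=1e-3, sigma_x2=10.0, n=100)
    print(a, mse_gap(a, m=10**4))
```

Passing a positive `sigma_q2` (the variance bound of an unbiased quantizer or private mechanism) shifts weight toward the local mean, which matches the form of `a` in Theorem 2.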

2.2. BERNOULLI MODEL

For the Bernoulli model, $P$ is supported on $[0, 1]$, $p_1, \dots, p_m$ are sampled i.i.d. from $P$, and client $i$ is given $n$ i.i.d. samples $X_{i1}, \dots, X_{in} \sim \mathrm{Bern}(p_i)$. This setting has been studied in (Tian et al., 2017; Vinayak et al., 2019) for estimating $P$, whereas our goal is to estimate the individual parameter $p_i$ at client $i$ using the information from other clients. In order to derive a closed-form MSE result, we assume that $P$ is the Beta distribution. Here, $\Gamma = (\alpha, \beta)$ and $p_1, \dots, p_m$ are unknown, and client $i$'s goal is to estimate $p_i$ such that the Bayesian risk $\mathbb{E}_{p_i \sim \pi}\, \mathbb{E}_{\hat{p}_i, X_1, \dots, X_m} (\hat{p}_i - p_i)^2$ is minimized, where $\pi$ denotes the density of the Beta distribution. Omitted proofs/details are provided in Appendix C.

Analogous to the Gaussian case, we can show that if $\alpha, \beta$ are known, then the posterior mean estimator has the closed-form expression

$$\hat{p}_i = a \bar{X}_i + (1 - a) \frac{\alpha}{\alpha + \beta}, \quad \text{where } a = \frac{n}{\alpha + \beta + n},$$

and $\alpha / (\alpha + \beta)$ is the mean of the Beta distribution. When $\alpha, \beta$ are unknown, a natural approach inspired by the above discussion is to estimate the global mean $\mu = \alpha / (\alpha + \beta)$ and the weight $a = n / (\alpha + \beta + n)$, and use them in the above estimator. Note that for $a$ we need to estimate $\alpha + \beta$, which is equal to $\mu(1 - \mu)/\sigma^2 - 1$, where $\sigma^2 = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$ is the variance of the Beta distribution. Therefore, it is enough to estimate $\mu$ and $\sigma^2$ for the personalized estimators $\{\hat{p}_i\}$. To make calculations simpler, instead of making one estimate of $\mu, \sigma^2$ for all clients, we let each client make its own estimate of $\mu, \sigma^2$ (without using its own data) as

$$\hat{\mu}_i = \frac{1}{m - 1} \sum_{l \ne i} \bar{X}_l \quad \text{and} \quad \hat{\sigma}_i^2 = \frac{1}{m - 2} \sum_{l \ne i} (\bar{X}_l - \hat{\mu}_l)^2,$$

and then define the local weight as $\hat{a}_i = \frac{n}{\hat{\mu}_i (1 - \hat{\mu}_i) / \hat{\sigma}_i^2 - 1 + n}$. Now, client $i$ uses the following personalized estimator:

$$\hat{p}_i = \hat{a}_i \bar{X}_i + (1 - \hat{a}_i) \hat{\mu}_i. \tag{6}$$

Theorem 3.
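With probability at least $1 - \frac{1}{mn}$, the MSE of the personalized estimator in (6) satisfies:

$$\mathbb{E}_{p_i \sim \pi}\, \mathbb{E}_{X_1, \dots, X_m} (\hat{p}_i - p_i)^2 \le \mathbb{E}[\hat{a}_i^2]\, \frac{\alpha \beta}{n (\alpha + \beta)(\alpha + \beta + 1)} + \mathbb{E}[(1 - \hat{a}_i)^2]\, \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} + \frac{3 \log(4 m^2 n)}{m - 1}.$$

Remark 3. When $n \to \infty$, then $\hat{a}_i \to 1$, which implies that the MSE tends to the MSE of the local estimator $\bar{X}_i$; that is, if local samples are abundant, collaboration does not help much. When $\sigma^2 = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} \to 0$, i.e., there is very small heterogeneity in the system, then $\hat{a}_i \to 0$, which implies that the MSE tends to the error due to moment estimation (the last term in the MSE in Theorem 3).

Privacy constraints. For any privacy parameter $\epsilon_0 > 0$ and input $x \in [0, 1]$, define $q_{\mathrm{priv}} : [0, 1] \to \mathbb{R}$:

$$q_{\mathrm{priv}}(x) = \begin{cases} \dfrac{-1}{e^{\epsilon_0} - 1} & \text{w.p. } \dfrac{e^{\epsilon_0}}{e^{\epsilon_0} + 1} - x\, \dfrac{e^{\epsilon_0} - 1}{e^{\epsilon_0} + 1}, \\[2ex] \dfrac{e^{\epsilon_0}}{e^{\epsilon_0} - 1} & \text{w.p. } \dfrac{1}{e^{\epsilon_0} + 1} + x\, \dfrac{e^{\epsilon_0} - 1}{e^{\epsilon_0} + 1}. \end{cases} \tag{7}$$

The mechanism $q_{\mathrm{priv}}$ is unbiased and satisfies user-level $\epsilon_0$-LDP. Thus, the $i$-th client sends $q_{\mathrm{priv}}(\bar{X}_i)$ to the server, which computes $\hat{\mu}_i^{\mathrm{priv}} = \frac{1}{m - 1} \sum_{l \ne i} q_{\mathrm{priv}}(\bar{X}_l)$ and the variance $\hat{\sigma}_i^{2(\mathrm{priv})} = \frac{1}{m - 2} \sum_{l \ne i} \big(q_{\mathrm{priv}}(\bar{X}_l) - \hat{\mu}_l^{\mathrm{priv}}\big)^2$, and sends $(\hat{\mu}_i^{\mathrm{priv}}, \hat{\sigma}_i^{2(\mathrm{priv})})$ to client $i$. Upon receiving this, client $i$ defines $\hat{a}_i^{\mathrm{priv}} = \frac{n}{\hat{\mu}_i^{\mathrm{priv}} (1 - \hat{\mu}_i^{\mathrm{priv}}) / \hat{\sigma}_i^{2(\mathrm{priv})} + n}$ and uses $\hat{p}_i^{\mathrm{priv}} = \hat{a}_i^{\mathrm{priv}} \bar{X}_i + (1 - \hat{a}_i^{\mathrm{priv}}) \hat{\mu}_i^{\mathrm{priv}}$ to estimate $p_i$.

Theorem 4. With probability at least $1 - \frac{1}{mn}$, the MSE of the personalized estimator $\hat{p}_i^{\mathrm{priv}}$ defined above satisfies:

$$\mathbb{E}_{p_i \sim \pi}\, \mathbb{E}_{q_{\mathrm{priv}}, X_1, \dots, X_m} (\hat{p}_i^{\mathrm{priv}} - p_i)^2 \le \mathbb{E}[(\hat{a}_i^{\mathrm{priv}})^2]\, \frac{\alpha \beta}{n (\alpha + \beta)(\alpha + \beta + 1)} + \mathbb{E}[(1 - \hat{a}_i^{\mathrm{priv}})^2]\, \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} + \frac{(e^{\epsilon_0} + 1)^2 \log(4 m^2 n)}{3 (e^{\epsilon_0} - 1)^2 (m - 1)}.$$

See Remark 4 (in Appendix B) and Remarks 6 and 7 (in Appendix C) for a discussion on privacy, communication efficiency, and client sampling.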
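The two-point mechanism in (7) can be checked directly: its two output values and their probabilities determine both its mean and its worst-case likelihood ratio. The following is a minimal sketch in plain Python; the helper names `q_priv_distribution` and `q_priv` are hypothetical and chosen only for illustration. It returns the exact output distribution, so unbiasedness and the ε₀-LDP ratio can be verified without sampling.

```python
import math
import random


def q_priv_distribution(x, eps0):
    """Exact output distribution of the two-point mechanism in (7).

    Returns ((low, p_low), (high, p_high)) for an input x in [0, 1].
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    if eps0 <= 0.0:
        raise ValueError("eps0 must be positive")
    e = math.exp(eps0)
    low = -1.0 / (e - 1.0)
    high = e / (e - 1.0)
    p_high = 1.0 / (e + 1.0) + x * (e - 1.0) / (e + 1.0)
    p_low = 1.0 - p_high  # equals e/(e+1) - x*(e-1)/(e+1)
    return (low, p_low), (high, p_high)


def q_priv(x, eps0, rng=random):
    """Draw one private report for input x."""
    (low, _), (high, p_high) = q_priv_distribution(x, eps0)
    return high if rng.random() < p_high else low


if __name__ == "__main__":
    (low, p_low), (high, p_high) = q_priv_distribution(0.3, eps0=1.0)
    # The mean of the report equals the input (unbiasedness).
    print(low * p_low + high * p_high)
```

Because the output alphabet has only two symbols, the LDP condition reduces to comparing the probability of each symbol at the extreme inputs x = 0 and x = 1; both ratios equal e^{ε₀}.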

3. PERSONALIZED LEARNING

$$p_{\{\theta_i, Y_i\} | \{X_i\}}(\theta_1, \dots, \theta_m, Y_1, \dots, Y_m \mid X_1, \dots, X_m) = \prod_{i=1}^m p(\theta_i) \prod_{i=1}^m \prod_{j=1}^n p_{\theta_i}(Y_{ij} \mid X_{ij}). \tag{8}$$

Note that if we minimize the negative log-likelihood of (8), we get the optimal parameters:

$$\hat{\theta}_1, \dots, \hat{\theta}_m := \arg\min_{\theta_1, \dots, \theta_m} \sum_{i=1}^m \sum_{j=1}^n -\log\big(p_{\theta_i}(Y_{ij} \mid X_{ij})\big) + \sum_{i=1}^m -\log\big(p(\theta_i)\big). \tag{9}$$

Here, $f_i(\theta_i) := \sum_{j=1}^n -\log(p_{\theta_i}(Y_{ij} \mid X_{ij}))$ denotes the loss function at the $i$-th client, which depends only on the local data, and $R(\{\theta_i\}) := \sum_{i=1}^m -\log(p(\theta_i))$ is the regularizer, which depends on the (unknown) global population distribution $P$ (parametrized by the unknown $\Gamma$). Note that when clients have little data and there is a large number of clients, i.e., $n \ll m$ (the setting of federated learning), clients may not be able to learn good personalized models from their local data alone (if they do, it would lead to large loss). In order to learn better personalized models, clients may utilize other clients' data through collaboration, and the above regularizer (together with estimates of the unknown prior distribution $P$, obtained by estimating its parameters $\Gamma$) dictates how the collaboration might be utilized. The above-described statistical framework (9) can model many different scenarios, as detailed below:

1. When $P(\Gamma) \equiv \mathrm{GM}(\{p_l\}_{l=1}^k, \{\mu_l\}_{l=1}^k, \{\sigma_{\theta,l}^2\}_{l=1}^k)$ is a Gaussian mixture, for $\Gamma = \{\{p_l\}_{l=1}^k, \{\mu_l\}_{l=1}^k, \{\sigma_{\theta,l}\}_{l=1}^k : p_l \ge 0, \sum_{l=1}^k p_l = 1, \sigma_{\theta,l} \ge 0, \mu_l \in \mathbb{R}^d\}$, then

$$R(\{\theta_i\}) = -\sum_{i=1}^m \log\left( \sum_{l=1}^k p_l\, \frac{\exp\big(-\|\mu_l - \theta_i\|_2^2 / (2\sigma_{\theta,l}^2)\big)}{(2\pi \sigma_{\theta,l}^2)^{d/2}} \right).$$

Here, the client models $\theta_1, \dots, \theta_m$ are drawn i.i.d. from $P(\Gamma)$, where $\theta_i \sim \mathcal{N}(\mu_l, \sigma_{\theta,l}^2 I_d)$ with probability $p_l$, for $l = 1, \dots, k$. For $k = 1$, $R(\{\theta_i\}) = \frac{md}{2} \log(2\pi\sigma_\theta^2) + \sum_{i=1}^m \frac{\|\mu - \theta_i\|_2^2}{2\sigma_\theta^2}$. Here, the unknown $\mu$ can be viewed as the global model and the $\theta_i$'s as local models, and the alternating iterative optimization optimizes over both.
This justifies the use of the $\ell_2$ regularizer in earlier personalized learning works (Dinh et al., 2020; Ozkara et al., 2021; Hanzely & Richtárik, 2020; Hanzely et al., 2020; Li et al., 2021).

2. When $P(\Gamma) \equiv \mathrm{Laplace}(\mu, b)$, for $\Gamma = \{\mu, b > 0\}$, then $R(\{\theta_i\}) = m \log(2b) + \sum_{i=1}^m \frac{\|\theta_i - \mu\|_1}{b}$.

3. When $p_{\theta_i}(Y_{ij} \mid X_{ij})$ is according to $\mathcal{N}(\theta_i, \sigma_x^2)$, then $f_i(\theta_i)$ is the quadratic loss as in linear regression. When $p_{\theta_i}(Y_{ij} \mid X_{ij}) = \sigma(\langle \theta_i, X_{ij} \rangle)^{Y_{ij}} \big(1 - \sigma(\langle \theta_i, X_{ij} \rangle)\big)^{(1 - Y_{ij})}$, where $\sigma(z) = \frac{1}{1 + e^{-z}}$ for any $z \in \mathbb{R}$, then $f_i(\theta_i)$ is the cross-entropy (or logistic) loss as in logistic regression.
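The regularizers induced by the Gaussian, Gaussian-mixture, and Laplace priors listed above can be evaluated directly as negative log-priors. The following is a minimal sketch in plain Python; the function names are hypothetical, model parameters are represented as plain lists of floats, and the mixture density uses the standard normalization $(2\pi\sigma^2)^{d/2}$ for a $d$-dimensional isotropic Gaussian. The mixture regularizer is computed with a log-sum-exp step for numerical stability, which is an implementation choice and not something stated in the text.

```python
import math


def gaussian_regularizer(thetas, mu, sigma2):
    """Single-Gaussian prior (k = 1): (m*d/2)*log(2*pi*sigma2) + sum ||mu - theta_i||^2 / (2*sigma2)."""
    m, d = len(thetas), len(mu)
    sq = sum(sum((t - u) ** 2 for t, u in zip(theta, mu)) for theta in thetas)
    return 0.5 * m * d * math.log(2.0 * math.pi * sigma2) + sq / (2.0 * sigma2)


def laplace_regularizer(thetas, mu, b):
    """Laplace prior, following the expression in the text: m*log(2b) + sum ||theta_i - mu||_1 / b."""
    m = len(thetas)
    l1 = sum(sum(abs(t - u) for t, u in zip(theta, mu)) for theta in thetas)
    return m * math.log(2.0 * b) + l1 / b


def mixture_regularizer(thetas, weights, mus, sigma2s):
    """Gaussian-mixture prior: -sum_i log sum_l p_l * N(theta_i; mu_l, sigma2_l * I).

    Assumes all mixture weights are strictly positive.
    """
    d = len(mus[0])
    total = 0.0
    for theta in thetas:
        log_terms = []
        for p, mu, s2 in zip(weights, mus, sigma2s):
            sq = sum((t - u) ** 2 for t, u in zip(theta, mu))
            log_terms.append(math.log(p) - sq / (2.0 * s2)
                             - 0.5 * d * math.log(2.0 * math.pi * s2))
        top = max(log_terms)  # log-sum-exp for stability
        total -= top + math.log(sum(math.exp(v - top) for v in log_terms))
    return total


if __name__ == "__main__":
    thetas = [[0.0, 1.0], [2.0, -1.0]]
    print(gaussian_regularizer(thetas, [1.0, 0.0], 0.5))
    print(mixture_regularizer(thetas, [1.0], [[1.0, 0.0]], [0.5]))
```

With a single component of weight 1, `mixture_regularizer` coincides with `gaussian_regularizer`, mirroring the $k = 1$ special case discussed in item 1.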

3.1. ADAMIX: ADAPTIVE PERSONALIZATION WITH GAUSSIAN MIXTURE PRIOR

Now we write the full objective function for the Gaussian mixture prior model for a generic local loss function $f_i(\theta_i)$ at client $i$ (the case of linear/logistic regression with a (single) Gaussian prior, solved using alternating gradient descent, is discussed in Appendices E and F):

$$\arg\min_{\{\theta_i\}, \{\mu_l\}, \{p_l\}, \{\sigma_{\theta,l}\}} \sum_{i=1}^m F_i^{\mathrm{gm}}(\theta_i) := \sum_{i=1}^m \left[ f_i(\theta_i) - \log\left( \sum_{l=1}^k p_l\, \frac{\exp\big(-\|\mu_l - \theta_i\|_2^2 / (2\sigma_{\theta,l}^2)\big)}{(2\pi \sigma_{\theta,l}^2)^{d/2}} \right) \right]. \tag{10}$$

A common example of $f_i(\theta_i)$ is a generic neural network loss function with a multi-class softmax output layer and cross-entropy loss, i.e., $f_i(\theta_i) := \sum_{j=1}^n -\log(p_{\theta_i}(Y_{ij} \mid X_{ij}))$, where $p_{\theta_i}(Y_{ij} \mid X_{ij}) = \sigma(\langle \theta_i, X_{ij} \rangle)^{Y_{ij}} \big(1 - \sigma(\langle \theta_i, X_{ij} \rangle)\big)^{(1 - Y_{ij})}$ and $\sigma(z) = \frac{1}{1 + e^{-z}}$ for any $z \in \mathbb{R}$. To solve (10), we can either use an alternating gradient descent approach, or a clustering-based approach where the server runs a (soft) clustering algorithm on the received personalized models. We adopt the second approach here (described in Algorithm 1), as it provides an interesting point of view and can be combined with DP clustering algorithms. Here, clients receive the global parameters from the server and do a local iteration on the personalized model (multiple local iterations can be introduced as in FedAvg (McMahan et al., 2017)); the clients then send the personalized models to the server. Upon receiving the personalized models, the server runs a GMM algorithm that outputs the global parameters.

Algorithm 1
Initialize: $P^{(0)}$, $\mu_1^{(0)}, \dots, \mu_k^{(0)}$, $\sigma_{\theta,1}^{(0)}, \dots, \sigma_{\theta,k}^{(0)}$.
for $t = 1$ to $T$ do
    On Clients:
    for $i = 1$ to $m$ do
        Receive $P^{(t-1)}$, $\mu_1^{(t-1)}, \dots, \mu_k^{(t-1)}$, and $\sigma_{\theta,1}^{(t-1)}, \dots, \sigma_{\theta,k}^{(t-1)}$ from the server
        Update the personalized parameters: $\theta_i^{(t)} \leftarrow \theta_i^{(t-1)} - \eta \nabla_{\theta_i^{(t-1)}} F_i^{\mathrm{gm}}(\theta_i^{(t-1)})$
        Send $\theta_i^{(t)}$ to the server
    end for
    At the Server:
        Update the global parameters: $P^{(t)}, \mu_1^{(t)}, \dots, \mu_k^{(t)}, \sigma_{\theta,1}^{(t)}, \dots, \sigma_{\theta,k}^{(t)} \leftarrow \mathrm{GMM}\big(\theta_1^{(t)}, \dots, \theta_m^{(t)}, k\big)$
        Broadcast $P^{(t)}$, $\{\mu_l^{(t)}\}_{l=1}^k$, $\{\sigma_{\theta,l}^{(t)}\}_{l=1}^k$ to all clients
end for
Output: Personalized models $\theta_1^T, \dots, \theta_m^T$.

It has been empirically observed that the knowledge distillation (KD) regularizer (between local and global models) results in better performance than the $\ell_2$ regularizer (Ozkara et al., 2021). In fact, using our framework, we can define, for the first time, a certain prior distribution that gives the KD regularizer (see Appendix H). We use the following loss function at the $i$-th client:

$$f_i(\theta_i) + \frac{1}{2} \log(2\psi) + \frac{f_i^{\mathrm{KD}}(\theta_i, \mu)}{2\psi}, \tag{11}$$

where $\mu$ denotes the global model, $\theta_i$ denotes the personalized model at client $i$, and $\psi$ can be viewed as controlling heterogeneity. The goal for each client is to minimize its local loss function, so individual components cannot be too large. For the second term, this implies that $\psi$ cannot be unbounded. For the third term, if $f_i^{\mathrm{KD}}(\theta_i, \mu)$ is large, then $\psi$ will also increase (implying that the local parameters deviate too much from the global parameter); hence, it is better to emphasize the local training loss to make the first term small. If $f_i^{\mathrm{KD}}(\theta_i, \mu)$ is small, then $\psi$ will also decrease (implying that the local parameters are close to the global parameter), so it is better to collaborate and learn better personalized models. Such adaptive weighting quantifies the uncertainty in the population distribution during training, balances the learning accordingly, and improves the empirical performance over non-adaptive methods, e.g., (Ozkara et al., 2021). To optimize (11), we propose an alternating minimization approach, which we call AdaPeD; see Algorithm 2. Besides the personalized model $\theta_i^t$, each client $i$ keeps local copies of the global model, $\mu_i^t$, and of the dissimilarity term, $\psi_i^t$; at synchronization times, the server aggregates them to obtain the global versions $\mu^t$, $\psi^t$.
In this way, the local training of $\theta_i^t$ also incorporates knowledge from other clients' data through $\mu_i^t$. In the end, clients have learned their personalized models $\{\theta_i^T\}_{i=1}^m$.

3.3. DP-ADAPED: DIFFERENTIALLY PRIVATE ADAPTIVE PERSONALIZATION VIA DISTILL.

Note that client $i$ communicates $\mu_i^t, \psi_i^t$ (which are updated by accessing the dataset for computing the gradients $h_i^t, k_i^t$) to the server. So, to privatize $\mu_i^t, \psi_i^t$, client $i$ adds appropriate noise to $h_i^t, k_i^t$. In order to obtain DP-AdaPeD, we replace lines 13 and 15 of Algorithm 2, respectively, by the update rules

$$\mu_i^{t+1} = \mu_i^t - \eta_2 \left( \frac{h_i^t}{\max\{\|h_i^t\| / C_1,\, 1\}} + \nu_1 \right) \quad \text{and} \quad \psi_i^{t+1} = \psi_i^t - \eta_3 \left( \frac{k_i^t}{\max\{|k_i^t| / C_2,\, 1\}} + \nu_2 \right),$$

where $\nu_1 \sim \mathcal{N}(0, \sigma_{q_1}^2 I_d)$ and $\nu_2 \sim \mathcal{N}(0, \sigma_{q_2}^2)$, for some $\sigma_{q_1}, \sigma_{q_2} > 0$ that depend on the desired privacy level, and $C_1, C_2$ are predefined constants. The theorem below (proved in Appendix I) states the Rényi Differential Privacy (RDP) guarantees.

Theorem 5. After $T$ iterations, DP-AdaPeD satisfies $(\alpha, \epsilon(\alpha))$-RDP for $\alpha > 1$, where

$$\epsilon(\alpha) = \left( \frac{K}{m} \right)^2 6\, \frac{T}{\tau}\, \alpha \left( \frac{C_1^2}{K \sigma_{q_1}^2} + \frac{C_2^2}{K \sigma_{q_2}^2} \right),$$

and $\frac{K}{m}$ denotes the sampling ratio of clients at each global iteration.

We bound the RDP, as it gives better privacy composition than strong composition (Mironov et al., 2019). We can also convert our results to user-level $(\epsilon, \delta)$-DP by using the standard conversion from RDP to $(\epsilon, \delta)$-DP (Canonne et al., 2020). See Appendix A for background on privacy.
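The privatized update described above follows the familiar clip-then-add-Gaussian-noise pattern. The following is a minimal sketch of one such step in plain Python, under the assumption that the noise is added to the clipped gradient before the learning-rate scaling; the names `clip` and `noisy_clipped_step` are hypothetical, and the calibration of `sigma` to a target privacy level (the subject of Theorem 5) is not covered here.

```python
import math
import random


def clip(grad, c):
    """Scale grad so that its Euclidean norm is at most c: grad / max(||grad|| / c, 1)."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = max(norm / c, 1.0)
    return [g / scale for g in grad]


def noisy_clipped_step(param, grad, lr, c, sigma, rng):
    """One privatized descent step: param - lr * (clip(grad, c) + N(0, sigma^2 I))."""
    clipped = clip(grad, c)
    return [p - lr * (g + rng.gauss(0.0, sigma)) for p, g in zip(param, clipped)]


if __name__ == "__main__":
    rng = random.Random(0)
    mu = [1.0, 1.0]          # local copy of the global model (toy values)
    h = [3.0, 4.0]           # gradient with norm 5, clipped to norm C_1 = 1
    print(noisy_clipped_step(mu, h, lr=0.1, c=1.0, sigma=0.5, rng=rng))
    # The scalar psi update is the same step applied to a one-element vector.
    print(noisy_clipped_step([0.8], [-2.0], lr=0.1, c=1.0, sigma=0.5, rng=rng))
```

Clipping bounds each client's contribution by the constant `c`, which is what allows the noise scale to be chosen independently of the data.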

4. EXPERIMENTS

Algorithm 2 Adaptive Personalization via Distillation (AdaPeD)
Parameters: local variances $\{\psi_i^0\}$, personalized models $\{\theta_i^0\}$, local copies of the global model $\{\mu_i^0\}$, learning rates $\eta_1, \eta_2, \eta_3$, synchronization gap $\tau$
1: for $t = 0$ to $T - 1$ do
2:     if $\tau$ divides $t$ then
3:         On Server do:
4:             Choose a subset $\mathcal{K}^t \subseteq [m]$ of $K$ clients
5:             Broadcast $\mu^t$ and $\psi^t$
6:         On Clients $i \in \mathcal{K}^t$ (in parallel) do:
7:             Receive $\mu^t, \psi^t$; set $\mu_i^t = \mu^t$, $\psi_i^t = \psi^t$
8:     end if
9:     On Clients $i \in \mathcal{K}^t$ (in parallel) do:
10:        Compute $g_i^t := \nabla_{\theta_i^t} f_i(\theta_i^t) + \nabla_{\theta_i^t} f_i^{\mathrm{KD}}(\theta_i^t, \mu_i^t) / (2\psi_i^t)$
11:        Update: $\theta_i^{t+1} = \theta_i^t - \eta_1 g_i^t$
12:        Compute $h_i^t := \nabla_{\mu_i^t} f_i^{\mathrm{KD}}(\theta_i^{t+1}, \mu_i^t) / (2\psi_i^t)$
13:        Update: $\mu_i^{t+1} = \mu_i^t - \eta_2 h_i^t$
14:        Compute $k_i^t := \frac{1}{2\psi_i^t} - f_i^{\mathrm{KD}}(\theta_i^{t+1}, \mu_i^{t+1}) / \big(2(\psi_i^t)^2\big)$
15:        Update: $\psi_i^{t+1} = \psi_i^t - \eta_3 k_i^t$
16:        if $\tau$ divides $t + 1$ then
17:            Clients send $\mu_i^t$ and $\psi_i^t$ to Server
18:            Server receives $\{\mu_i^t\}_{i \in \mathcal{K}^t}$ and $\{\psi_i^t\}_{i \in \mathcal{K}^t}$
19:            Server computes $\mu^{t+1} = \frac{1}{K} \sum_{i \in \mathcal{K}^t} \mu_i^t$ and $\psi^{t+1} = \frac{1}{K} \sum_{i \in \mathcal{K}^t} \psi_i^t$
20:        end if
21: end for
Output: Personalized models $(\theta_i^T)_{i=1}^m$

Personalized Estimation. We run one experiment for the Bernoulli setting with real political data and another for the Gaussian setting with synthetic data. The latter is differentially private.
• Political tendencies at the county level. One natural application of the Bernoulli setting is modeling bipartisan elections (Tian et al., 2017). We did a case study using US presidential elections at the county level between 2000 and 2020, with $m = 3112$ counties in our dataset. For each county, the goal is to determine the political tendency parameter $p_i$. Given data from 6 elections, we did 6-fold cross-validation, with 5 elections for training and 1 election for testing. The local estimator takes the average of the 5 training samples, and the personalized estimator is the posterior mean. To simulate a Bernoulli setting, we set the data equal to 1 if the Republican party won the election and 0 otherwise. We observe that the personalized estimator provides an MSE gain (averaged over 6 runs) of 10.7 ± 1.9% over the local estimator.
• DP personalized estimation. To measure the performance tradeoff of the DP mechanism described in Section 2.1, we create a synthetic experiment for the Gaussian setting. We let $m = 10000$, $n = 15$, $\sigma_\theta = 0.1$, and $\sigma_x = 0.5$, and create a dataset at each client as described in the Gaussian setting. Applying the DP mechanism, we obtain the result in Figure 1a. Here, as expected, when privacy is low ($\epsilon_0$ is high), the private personalized estimator recovers the regular personalized estimator. For higher privacy, the private estimator's performance starts to become worse than that of the non-private estimator.

Personalized Learning. First we describe the experiment setting and then the results.
• Experiment setting.
We consider image classification on MNIST, FEMNIST (Caldas et al., 2018), CIFAR-10, and CIFAR-100 (experimental details for CIFAR-100 are given in Appendix K), and train a CNN, similar to the one considered in (McMahan et al., 2017), that has 2 convolutional and 3 fully connected layers. We set $m = 66$ for FEMNIST and $m = 50$ for MNIST, CIFAR-10, and CIFAR-100. For FEMNIST, we use a subset of 198 writers so that each client has access to data from 3 authors, which results in a natural type of data heterogeneity due to the writing styles of the authors. On MNIST and CIFAR-10, we introduce pathological heterogeneity by letting each client sample data from only 3 and 4 randomly selected classes, respectively. We set $\tau = 10$ and vary the batch size so that each epoch consists of 60 iterations. On MNIST we train for 50 epochs, on CIFAR-10 for 250 epochs, and on FEMNIST for 40 and 80 epochs for client sampling ratios of 0.33 and 0.15, respectively. We discuss further details in Appendix K.

Per-FedAvg (Fallah et al., 2020): 59.95 ± 0.79, 34.78 ± 0.41, 93.51 ± 0.31
QuPeD (FP) (Ozkara et al., 2021): 71.61 ± 0.70, 51.94 ± 0.21, 95.99 ± 0.08
Federated ML (Shen et al., 2020): 71.09 ± 0.67, 50.42 ± 0.26, 95.12 ± 0.18

• Results. In Table 1 we compare AdaPeD against FedAvg (McMahan et al., 2017), FedAvg+ (Jiang et al., 2019), and various personalized FL algorithms: pFedMe (Dinh et al., 2020), Per-FedAvg (Fallah et al., 2020), QuPeD (Ozkara et al., 2021) without model compression, and Federated ML (Shen et al., 2020). We report further results in Appendix K. We observe that AdaPeD consistently outperforms the other methods. Methods that use knowledge distillation perform better; on top of this, AdaPeD enables us to adjust the dependence on collaboration according to the compatibility of global and local decisions/scores. For instance, we set $\sigma_\theta^2$ to a certain value initially and observe that it progressively decreases, which implies that clients start to rely on the collaboration more and more.
Interestingly, this is not always the case: for DP-AdaPeD, we first observe a decrease in $\sigma_\theta^2$, and later it increases. This suggests that while there is not much accumulated noise, clients prefer to collaborate, and as the noise on the global model accumulates due to the DP mechanism, clients prefer not to collaborate. This is exactly the type of autonomous behavior we aimed for with adaptive regularization.
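The adaptive behavior described above can be illustrated on the heterogeneity term of the AdaPeD loss in isolation. For a fixed distillation loss value $f^{\mathrm{KD}}$, the terms $\frac{1}{2}\log(2\psi) + f^{\mathrm{KD}}/(2\psi)$ have derivative $\frac{1}{2\psi} - \frac{f^{\mathrm{KD}}}{2\psi^2}$ (the quantity $k_i^t$ in Algorithm 2), which vanishes at $\psi = f^{\mathrm{KD}}$. The following is a minimal sketch in plain Python that treats $f^{\mathrm{KD}}$ as a constant, which is a simplifying assumption (in training it changes with the models), and assumes that the adaptive parameter discussed in the experiments plays the role of $\psi$; the function names are hypothetical.

```python
def psi_gradient(psi, f_kd):
    """Derivative of 0.5*log(2*psi) + f_kd / (2*psi) with respect to psi."""
    if psi <= 0.0:
        raise ValueError("psi must be positive")
    return 1.0 / (2.0 * psi) - f_kd / (2.0 * psi ** 2)


def run_psi_descent(psi0, f_kd, lr, steps):
    """Plain gradient descent on psi with the distillation loss held fixed."""
    psi = psi0
    for _ in range(steps):
        psi -= lr * psi_gradient(psi, f_kd)
    return psi


if __name__ == "__main__":
    # Small distillation loss: psi shrinks, so collaboration is weighted more.
    print(run_psi_descent(psi0=1.0, f_kd=0.5, lr=0.1, steps=2000))
    # Large distillation loss: psi grows, so the local loss dominates.
    print(run_psi_descent(psi0=1.0, f_kd=2.0, lr=0.1, steps=2000))
```

In this toy setting $\psi$ tracks the size of the distillation loss, which is consistent with the observation that accumulated DP noise on the global model (a larger mismatch) pushes clients away from collaboration.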

• DP-AdaPeD. In Figure 1b and Table 2, we observe the performance of DP-AdaPeD under different values of $\epsilon$. DP-AdaPeD outperforms DP-FedAvg because the personalized models do not need to be privatized by the DP mechanism, whereas the global model does. Our experiments provide user-level privacy (more stringent, but appropriate in FL), as opposed to item-level privacy.

• DP-AdaPeD with unsampled client iterations. When we let unsampled clients perform local iterations (free in terms of privacy cost, and a realistic scenario in cross-silo settings), as described in Appendix H, we can increase DP-AdaPeD's performance under more aggressive privacy constants $\epsilon$. For instance, for FEMNIST with 1/3 client sampling we obtain the result reported in Figure 1b.

• AdaMix. We consider linear regression on synthetic data, with $m = 1000$ clients, where each client has $n \in \{10, 20, 30\}$ local samples. Each local model $\theta_i \in \mathbb{R}^d$ is drawn from a mixture of two Gaussian distributions $\mathcal{N}(\mu, \Sigma)$ and $\mathcal{N}(-\mu, \Sigma)$, where $\Sigma = 0.001 \times I_d$ and $d = 50$. Each client sample $(X_{ij}, Y_{ij})$ is distributed as $X_{ij} \sim \mathcal{N}(0, I_d)$ and $Y_{ij} = \langle X_{ij}, \theta_i \rangle + w_{ij}$, where $w_{ij} \sim \mathcal{N}(0, 0.1)$. Table 3 demonstrates the superior performance of AdaMix against the local estimator.

5. CONCLUSION

We proposed a statistical framework leading to new personalized federated estimation and learning algorithms (e.g., AdaMix, AdaPeD); we also incorporated privacy (and communication) constraints into our algorithms and analyzed them. Open questions include information-theoretic lower bounds and their comparison to the proposed methods, and an examination of how far the proposed alternating minimization methods (such as in AdaMix and AdaPeD) are from the global optima.

A PRELIMINARIES

We give standard privacy definitions that we use in Section A.1, some existing results on RDP to DP conversion and RDP composition in Section A.2, and user-level differential privacy in Section A.3.

A.1 PRIVACY DEFINITIONS

In this subsection, we define the privacy notions that we use in this paper: local differential privacy (LDP), central differential privacy (DP), and Rényi differential privacy (RDP), and their user-level counterparts.

Definition 1 (Local Differential Privacy - LDP (Kasiviswanathan et al., 2011)). For $\epsilon_0 \ge 0$, a randomized mechanism $\mathcal{R} : \mathcal{X} \to \mathcal{Y}$ is said to be $\epsilon_0$-local differentially private (in short, $\epsilon_0$-LDP), if for every pair of inputs $d, d' \in \mathcal{X}$, we have

$$\Pr[\mathcal{R}(d) \in S] \le e^{\epsilon_0} \Pr[\mathcal{R}(d') \in S], \quad \forall S \subset \mathcal{Y}.$$

Let $D = \{x_1, \ldots, x_n\}$ denote a dataset comprising $n$ points from $\mathcal{X}$. We say that two datasets $D = \{x_1, \ldots, x_n\}$ and $D' = \{x'_1, \ldots, x'_n\}$ are neighboring (denoted by $D \sim D'$) if they differ in one data point, i.e., there exists an $i \in [n]$ such that $x_i \neq x'_i$ and for every $j \in [n]$, $j \neq i$, we have $x_j = x'_j$.

Definition 2 (Central Differential Privacy - DP (Dwork et al., 2006; Dwork & Roth, 2014)). For $\epsilon, \delta \ge 0$, a randomized mechanism $\mathcal{M} : \mathcal{X}^n \to \mathcal{Y}$ is said to be $(\epsilon, \delta)$-differentially private (in short, $(\epsilon, \delta)$-DP), if for all neighboring datasets $D \sim D' \in \mathcal{X}^n$ and every subset $S \subseteq \mathcal{Y}$, we have

$$\Pr[\mathcal{M}(D) \in S] \le e^{\epsilon} \Pr[\mathcal{M}(D') \in S] + \delta. \tag{13}$$

If $\delta = 0$, then the privacy is referred to as pure DP.

Definition 3 ($(\lambda, \epsilon(\lambda))$-RDP (Rényi Differential Privacy) (Mironov, 2017)). A randomized mechanism $\mathcal{M} : \mathcal{X}^n \to \mathcal{Y}$ is said to have $\epsilon(\lambda)$-Rényi differential privacy of order $\lambda \in (1, \infty)$ (in short, $(\lambda, \epsilon(\lambda))$-RDP), if for any neighboring datasets $D \sim D' \in \mathcal{X}^n$, the Rényi divergence between $\mathcal{M}(D)$ and $\mathcal{M}(D')$ is upper-bounded by $\epsilon(\lambda)$, i.e.,

$$D_\lambda(\mathcal{M}(D) \,\|\, \mathcal{M}(D')) = \frac{1}{\lambda - 1} \log \mathbb{E}_{\theta \sim \mathcal{M}(D')} \left[ \left( \frac{\mathcal{M}(D)(\theta)}{\mathcal{M}(D')(\theta)} \right)^{\lambda} \right] \le \epsilon(\lambda),$$

where $\mathcal{M}(D)(\theta)$ denotes the probability that $\mathcal{M}$ on input $D$ generates the output $\theta$. For convenience, instead of $\epsilon(\lambda)$ being an upper bound, we define it as $\epsilon(\lambda) = \sup_{D \sim D'} D_\lambda(\mathcal{M}(D) \,\|\, \mathcal{M}(D'))$.

A.2 RDP TO DP CONVERSION AND RDP COMPOSITION

As mentioned after Theorem 5, we can convert the RDP guarantees of DP-AdaPeD to DP guarantees using existing conversion results from the literature. To the best of our knowledge, the following gives the best conversion.

Lemma 1 (From RDP to DP (Canonne et al., 2020; Balle et al., 2020)). Suppose that for any $\lambda > 1$, a mechanism $\mathcal{M}$ is $(\lambda, \epsilon(\lambda))$-RDP. Then, the mechanism $\mathcal{M}$ is $(\epsilon, \delta)$-DP, where $\epsilon, \delta$ are defined below.

For a given $\delta \in (0, 1)$:

$$\epsilon = \min_{\lambda} \; \epsilon(\lambda) + \frac{\log(1/\delta) + (\lambda - 1)\log(1 - 1/\lambda) - \log(\lambda)}{\lambda - 1}.$$

For a given $\epsilon > 0$:

$$\delta = \min_{\lambda} \; \frac{\exp\big((\lambda - 1)(\epsilon(\lambda) - \epsilon)\big)}{\lambda - 1} \left(1 - \frac{1}{\lambda}\right)^{\lambda}.$$

The main strength of RDP in comparison to other privacy notions comes from composition. The following result states that if we adaptively compose two RDP mechanisms of the same order, their privacy parameters add up in the resulting mechanism.

Lemma 2 (Adaptive composition of RDP (Mironov, 2017, Proposition 1)). For any $\lambda > 1$, let $\mathcal{M}_1 : \mathcal{X} \to \mathcal{Y}_1$ be a $(\lambda, \epsilon_1(\lambda))$-RDP mechanism and $\mathcal{M}_2 : \mathcal{Y}_1 \times \mathcal{X} \to \mathcal{Y}$ be a $(\lambda, \epsilon_2(\lambda))$-RDP mechanism. Then, the mechanism defined by $(\mathcal{M}_1, \mathcal{M}_2)$ satisfies $(\lambda, \epsilon_1(\lambda) + \epsilon_2(\lambda))$-RDP.

We can define user-level LDP/DP/RDP analogously to their item-level counterparts using the user-level neighborhood relation.
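To make Lemmas 1 and 2 concrete, the following is a minimal sketch of how an RDP curve could be composed and then converted to an $(\epsilon, \delta)$-DP guarantee. The function names, the grid of orders, and the choice of the Gaussian mechanism with unit $\ell_2$ sensitivity (whose RDP curve is $\epsilon(\lambda) = \lambda / (2\sigma^2)$, Mironov, 2017) are illustrative assumptions, not the accounting code used for DP-AdaPeD.

```python
import math

def gaussian_rdp(sigma):
    """RDP curve of a Gaussian mechanism with L2 sensitivity 1 (assumed example)."""
    return lambda lam: lam / (2.0 * sigma ** 2)

def compose(curves):
    """Lemma 2: at a fixed order, RDP parameters of composed mechanisms add up."""
    return lambda lam: sum(curve(lam) for curve in curves)

def rdp_to_dp_epsilon(rdp_curve, delta, orders):
    """Lemma 1: minimize the converted epsilon over a finite grid of orders > 1."""
    best = math.inf
    for lam in orders:
        eps = rdp_curve(lam) + (
            math.log(1.0 / delta)
            + (lam - 1.0) * math.log(1.0 - 1.0 / lam)
            - math.log(lam)
        ) / (lam - 1.0)
        best = min(best, eps)
    return best

# Hypothetical grid of orders; a finer or wider grid can only tighten the result.
ORDERS = [1.0 + x / 10.0 for x in range(1, 1000)]

one_round = gaussian_rdp(sigma=2.0)
ten_rounds = compose([one_round] * 10)
eps_1 = rdp_to_dp_epsilon(one_round, delta=1e-5, orders=ORDERS)
eps_10 = rdp_to_dp_epsilon(ten_rounds, delta=1e-5, orders=ORDERS)
```

Because the minimum is taken over a finite grid, the returned value is an upper bound on the $\epsilon$ of Lemma 1 rather than the exact minimizer.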

B PERSONALIZED ESTIMATION -GAUSSIAN MODEL

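B.1 PROOF OF THEOREM 1

We will derive the optimal estimator and prove the MSE for the one-dimensional case, i.e., for $d = 1$; the final result can be obtained by applying these to each of the $d$ coordinates separately.

$$\ldots = C + \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{(X_i^j - \theta_i)^2}{\sigma_x^2} + \sum_{i=1}^{m} \frac{(\theta_i - \mu)^2}{\sigma_\theta^2},$$

where the second equality is obtained from the fact that the log function is monotonic, and $C$ is a constant independent of the variables $\theta = (\theta_1, \ldots, \theta_m)$. Observe that the objective function

$$F(\theta, \mu) = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{(X_i^j - \theta_i)^2}{\sigma_x^2} + \sum_{i=1}^{m} \frac{(\theta_i - \mu)^2}{\sigma_\theta^2}$$

is jointly convex in $(\theta, \mu)$. Thus, the optimum is obtained by setting the derivatives to zero, as this is an unconstrained optimization problem:

$$\left.\frac{\partial F}{\partial \theta_i}\right|_{\mu = \hat\mu,\, \theta_i = \hat\theta_i} = \sum_{j=1}^{n} \frac{2(\hat\theta_i - X_i^j)}{\sigma_x^2} + \frac{2(\hat\theta_i - \hat\mu)}{\sigma_\theta^2} = 0, \quad \forall i \in [m],$$

$$\left.\frac{\partial F}{\partial \mu}\right|_{\mu = \hat\mu,\, \theta_i = \hat\theta_i} = \sum_{i=1}^{m} \frac{2(\hat\mu - \hat\theta_i)}{\sigma_\theta^2} = 0.$$

By solving these $m + 1$ equations in $m + 1$ unknowns, we get

$$\hat\theta_i = \alpha \left( \frac{1}{n} \sum_{j=1}^{n} X_i^j \right) + (1 - \alpha) \left( \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} X_i^j \right), \tag{14}$$

where $\alpha = \frac{\sigma_\theta^2}{\sigma_\theta^2 + \sigma_x^2 / n}$. By letting $\overline{X}_i = \frac{1}{n} \sum_{j=1}^{n} X_i^j$ for all $i \in [m]$, we can write $\hat\theta_i = \alpha \overline{X}_i + (1 - \alpha) \frac{1}{m} \sum_{i=1}^{m} \overline{X}_i$.

Observe that $\mathbb{E}[\hat\theta_i \mid \theta] = \alpha \theta_i + \frac{1 - \alpha}{m} \sum_{l=1}^{m} \theta_l$, where $\theta = (\theta_1, \ldots, \theta_m)$. Thus, the estimator (14) is an unbiased estimate of $\{\theta_i\}$. Substituting $\hat\theta_i$ into the MSE, we get

$$\begin{aligned}
\mathbb{E}_{X_1, \ldots, X_m} \big(\hat\theta_i - \theta_i\big)^2
&= \mathbb{E}_\theta \Big[ \mathbb{E}_{X_1, \ldots, X_m} \big[ (\hat\theta_i - \theta_i)^2 \mid \theta \big] \Big] \\
&= \mathbb{E}_\theta \Big[ \mathbb{E}_{X_1, \ldots, X_m} \big[ (\hat\theta_i - \mathbb{E}[\hat\theta_i \mid \theta] + \mathbb{E}[\hat\theta_i \mid \theta] - \theta_i)^2 \mid \theta \big] \Big] \\
&= \mathbb{E}_\theta \Big[ \mathbb{E}_{X_1, \ldots, X_m} \big[ (\hat\theta_i - \mathbb{E}[\hat\theta_i \mid \theta])^2 \mid \theta \big] \Big] + \mathbb{E}_\theta \Big[ \mathbb{E}_{X_1, \ldots, X_m} \big[ (\mathbb{E}[\hat\theta_i \mid \theta] - \theta_i)^2 \mid \theta \big] \Big].
\end{aligned} \tag{15}$$

Claim 1.

$$\mathbb{E}_\theta \Big[ \mathbb{E}_{X_1, \ldots, X_m} \big[ (\hat\theta_i - \mathbb{E}[\hat\theta_i \mid \theta])^2 \mid \theta \big] \Big] = \alpha^2 \frac{\sigma_x^2}{n} + (1 - \alpha)^2 \frac{\sigma_x^2}{mn} + 2\alpha(1 - \alpha) \frac{\sigma_x^2}{mn},$$

$$\mathbb{E}_\theta \Big[ \mathbb{E}_{X_1, \ldots, X_m} \big[ (\mathbb{E}[\hat\theta_i \mid \theta] - \theta_i)^2 \mid \theta \big] \Big] = (1 - \alpha)^2\, \mathbb{E}_\theta \left[ \left( \frac{1}{m} \sum_{k=1}^{m} \theta_k - \theta_i \right)^2 \right] \le (1 - \alpha)^2 \frac{\sigma_\theta^2 (m - 1)}{m}.$$

Proof.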
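For the first equation:

$$\begin{aligned}
\mathbb{E}_\theta \Big[ \mathbb{E}_{X_1, \ldots, X_m} \big[ (\hat\theta_i - \mathbb{E}[\hat\theta_i \mid \theta])^2 \mid \theta \big] \Big]
&= \mathbb{E}_\theta \left[ \mathbb{E}_{X_1, \ldots, X_m} \left[ \left( \alpha(\overline{X}_i - \theta_i) + (1 - \alpha) \frac{1}{m} \sum_{k=1}^{m} (\overline{X}_k - \theta_k) \right)^2 \,\Big|\, \theta \right] \right] \\
&= \alpha^2\, \mathbb{E} \Big[ \mathbb{E} \big[ (\overline{X}_i - \theta_i)^2 \mid \theta \big] \Big] + (1 - \alpha)^2\, \mathbb{E} \left[ \mathbb{E} \left[ \left( \frac{1}{m} \sum_{k=1}^{m} (\overline{X}_k - \theta_k) \right)^2 \,\Big|\, \theta \right] \right] \\
&\quad + 2\alpha(1 - \alpha)\, \mathbb{E} \left[ \mathbb{E} \left[ \frac{1}{m} \sum_{k=1}^{m} (\overline{X}_i - \theta_i)(\overline{X}_k - \theta_k) \,\Big|\, \theta \right] \right] \\
&= \alpha^2 \frac{\sigma_x^2}{n} + (1 - \alpha)^2 \frac{\sigma_x^2}{mn} + 2\alpha(1 - \alpha) \frac{\sigma_x^2}{mn}.
\end{aligned}$$

For the second equation, first note that $\mathbb{E}[\hat\theta_i \mid \theta] - \theta_i = \alpha \theta_i + \frac{1 - \alpha}{m} \sum_{k=1}^{m} \theta_k - \theta_i = (1 - \alpha) \left( \frac{1}{m} \sum_{k=1}^{m} \theta_k - \theta_i \right)$. Then:

$$\begin{aligned}
\mathbb{E}_\theta \Big[ \mathbb{E}_{X_1, \ldots, X_m} \big[ (\mathbb{E}[\hat\theta_i \mid \theta] - \theta_i)^2 \mid \theta \big] \Big]
&= (1 - \alpha)^2\, \mathbb{E} \left[ \left( \frac{1}{m} \sum_{k=1}^{m} \theta_k - \theta_i \right)^2 \right] \\
&= \frac{(1 - \alpha)^2}{m^2}\, \mathbb{E} \left[ \left( \sum_{k \neq i} (\theta_k - \theta_i) \right)^2 \right] \\
&= \frac{(1 - \alpha)^2}{m^2} \left[ \sum_{k \neq i} \mathbb{E}(\theta_k - \theta_i)^2 + \sum_{k \neq i,\, l \neq i,\, k \neq l} \mathbb{E}(\theta_k - \theta_i)(\theta_l - \theta_i) \right] \\
&\le \frac{(1 - \alpha)^2}{m^2} \left[ \sum_{k \neq i} \big[ \mathbb{E}(\theta_k - \mu)^2 + \mathbb{E}(\theta_i - \mu)^2 \big] + \sum_{k \neq i,\, l \neq i,\, k \neq l} \mathbb{E}(\theta_k - \theta_i)(\theta_l - \theta_i) \right] \\
&= \frac{(1 - \alpha)^2}{m^2} \left[ 2(m - 1)\sigma_\theta^2 + \sum_{k \neq i,\, l \neq i,\, k \neq l} \mathbb{E}(\theta_k - \theta_i)(\theta_l - \theta_i) \right] \\
&= \frac{(1 - \alpha)^2}{m^2} \left[ 2(m - 1)\sigma_\theta^2 + \sum_{k \neq i,\, l \neq i,\, k \neq l} \mathbb{E}(\mu - \theta_i)^2 \right] \quad \text{(since } \mathbb{E}[\theta_k] = \mu \text{ for all } k \in [m]\text{)} \\
&= \frac{(1 - \alpha)^2}{m^2} \left[ 2(m - 1)\sigma_\theta^2 + (m - 1)(m - 2)\sigma_\theta^2 \right] \\
&= (1 - \alpha)^2 \frac{\sigma_\theta^2 (m - 1)}{m}.
\end{aligned}$$

This concludes the proof of Claim 1.

Substituting the result of Claim 1 into (15), we get

$$\begin{aligned}
\mathbb{E}_{X_1, \ldots, X_m} \big(\hat\theta_i - \theta_i\big)^2
&\le \alpha^2 \frac{\sigma_x^2}{n} + (1 - \alpha)^2 \frac{\sigma_x^2}{mn} + 2\alpha(1 - \alpha) \frac{\sigma_x^2}{mn} + (1 - \alpha)^2 \frac{\sigma_\theta^2 (m - 1)}{m} \\
&\overset{(a)}{=} \frac{\sigma_x^2}{n} \left( \alpha^2 + \frac{(1 - \alpha)^2 + 2\alpha(1 - \alpha)}{m} + \alpha(1 - \alpha) \frac{m - 1}{m} \right) \\
&= \frac{\sigma_x^2}{n} \left( \alpha + \frac{1 - \alpha}{m} \right),
\end{aligned}$$

where in (a) we used $\alpha = \frac{\sigma_\theta^2}{\sigma_\theta^2 + \sigma_x^2 / n}$ for the last term to write $(1 - \alpha)^2 \frac{\sigma_\theta^2 (m - 1)}{m} = \frac{\sigma_x^2}{n} \alpha(1 - \alpha) \frac{m - 1}{m}$. Observe that the estimator in (14) is a weighted summation between two estimators: the local estimator $\overline{X}_i = \frac{1}{n} \sum_{j=1}^{n} X_i^j$ and the global estimator $\frac{1}{m} \sum_{i=1}^{m} \overline{X}_i$.

Similar to the proof of Theorem 1, here also we will derive the optimal estimator and prove the MSE for the one-dimensional case, and the final result can be obtained by applying these to each of the $d$ coordinates separately. Let $\theta = (\theta_1, \ldots, \theta_m)$ denote the personalized models vector.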
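For a given function $q$, we set the personalized model as follows:

$$\hat\theta_i = \alpha \left( \frac{1}{n} \sum_{j=1}^{n} X_i^j \right) + (1 - \alpha) \frac{1}{m} \sum_{l=1}^{m} q(\overline{X}_l), \quad \forall i \in [m],$$

where $\overline{X}_i = \frac{1}{n} \sum_{j=1}^{n} X_i^j$. From the second condition on the function $q$, we get that

$$\mathbb{E}[\hat\theta_i \mid \theta] = \alpha \theta_i + \frac{1 - \alpha}{m} \sum_{l=1}^{m} \theta_l. \tag{18}$$

Thus, by following similar steps as in the proof of Theorem 1, we get that:

$$\begin{aligned}
\mathbb{E} \big(\hat\theta_i - \theta_i\big)^2
&= \mathbb{E} \Big[ \mathbb{E} \big[ (\hat\theta_i - \theta_i)^2 \mid \theta \big] \Big] \\
&= \mathbb{E} \Big[ \mathbb{E} \big[ (\hat\theta_i - \mathbb{E}[\hat\theta_i \mid \theta] + \mathbb{E}[\hat\theta_i \mid \theta] - \theta_i)^2 \mid \theta \big] \Big] \\
&= \mathbb{E} \Big[ \mathbb{E} \big[ (\hat\theta_i - \mathbb{E}[\hat\theta_i \mid \theta])^2 \mid \theta \big] \Big] + \mathbb{E} \Big[ \mathbb{E} \big[ (\mathbb{E}[\hat\theta_i \mid \theta] - \theta_i)^2 \mid \theta \big] \Big] \\
&\overset{(a)}{=} \alpha^2 \frac{\sigma_x^2}{n} + (1 - \alpha)^2\, \mathbb{E} \left[ \left( \frac{1}{m} \sum_{l=1}^{m} \big( q(\overline{X}_l) - \theta_l \big) \right)^2 \,\Big|\, \theta \right] \\
&\quad + 2\alpha(1 - \alpha)\, \mathbb{E} \left[ (\overline{X}_i - \theta_i) \left( \frac{1}{m} \sum_{l=1}^{m} \big( q(\overline{X}_l) - \theta_l \big) \right) \,\Big|\, \theta \right] + (1 - \alpha)^2\, \mathbb{E} \left[ \left( \frac{1}{m} \sum_{k=1}^{m} \theta_k - \theta_i \right)^2 \right] \\
&\overset{(b)}{=} \alpha^2 \frac{\sigma_x^2}{n} + (1 - \alpha)^2 \frac{\sigma_x^2 / n + \sigma_q^2}{m} + \frac{2\alpha(1 - \alpha)\sigma_x^2}{mn} + (1 - \alpha)^2\, \mathbb{E} \left[ \left( \frac{1}{m} \sum_{k=1}^{m} \theta_k - \theta_i \right)^2 \right] \\
&\le \alpha^2 \frac{\sigma_x^2}{n} + (1 - \alpha)^2 \frac{\sigma_x^2 / n + \sigma_q^2}{m} + 2\alpha(1 - \alpha) \frac{\sigma_x^2}{mn} + (1 - \alpha)^2 \frac{\sigma_\theta^2 (m - 1)}{m} \\
&\overset{(c)}{=} \frac{\sigma_x^2}{n} \left( \alpha^2 + \frac{(1 - \alpha)^2 + 2\alpha(1 - \alpha)}{m} + \alpha(1 - \alpha) \frac{m - 1}{m} \right) \\
&= \frac{\sigma_x^2}{n} \left( \alpha + \frac{1 - \alpha}{m} \right),
\end{aligned}$$

where step (a) follows by substituting the expectation of the personalized model from (18). Step (b) follows from the first and third conditions on the function $q$. Step (c) follows by choosing $\alpha = \frac{\sigma_\theta^2 + \frac{\sigma_q^2}{m - 1}}{\sigma_\theta^2 + \frac{\sigma_q^2}{m - 1} + \frac{\sigma_x^2}{n}}$. This derives the result stated in (5) in Theorem 2.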
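As a sanity check on the derivation above, the following sketch computes the shrinkage weight $\alpha$ and the bound $\frac{\sigma_x^2}{n}\left(\alpha + \frac{1-\alpha}{m}\right)$ for the case where $q$ is the identity (Theorem 1), and compares the bound with a Monte Carlo estimate in one dimension. All function names, the number of trials, and the seed are illustrative assumptions; this is not code from the paper.

```python
import random

def shrinkage_weight(sigma_theta2, sigma_x2, n):
    """alpha = sigma_theta^2 / (sigma_theta^2 + sigma_x^2 / n)."""
    return sigma_theta2 / (sigma_theta2 + sigma_x2 / n)

def personalized_estimates(local_means, alpha):
    """theta_hat_i = alpha * X_bar_i + (1 - alpha) * average of all local means."""
    global_mean = sum(local_means) / len(local_means)
    return [alpha * x + (1 - alpha) * global_mean for x in local_means]

def mse_bound(sigma_theta2, sigma_x2, n, m):
    """Right-hand side of the bound: (sigma_x^2 / n) * (alpha + (1 - alpha) / m)."""
    alpha = shrinkage_weight(sigma_theta2, sigma_x2, n)
    return sigma_x2 / n * (alpha + (1 - alpha) / m)

def simulated_mse(mu, sigma_theta2, sigma_x2, n, m, trials=20000, seed=0):
    """Monte Carlo estimate of E(theta_hat_1 - theta_1)^2 for d = 1."""
    rng = random.Random(seed)
    alpha = shrinkage_weight(sigma_theta2, sigma_x2, n)
    total = 0.0
    for _ in range(trials):
        theta = [rng.gauss(mu, sigma_theta2 ** 0.5) for _ in range(m)]
        # The mean of n i.i.d. N(theta_i, sigma_x2) samples is N(theta_i, sigma_x2 / n).
        means = [rng.gauss(t, (sigma_x2 / n) ** 0.5) for t in theta]
        estimates = personalized_estimates(means, alpha)
        total += (estimates[0] - theta[0]) ** 2
    return total / trials
```

For example, `simulated_mse(0.0, 1.0, 1.0, n=5, m=10)` can be compared against `mse_bound(1.0, 1.0, n=5, m=10)`; in this Gaussian setting the two should agree up to Monte Carlo error.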

B.2.1 PROOF OF THEOREM 2, PART 1

The proof consists of two steps. First, we use the concentration property of the Gaussian distribution to show that the local sample means $\{\overline{X}_i\}$ are bounded within a small range with high probability. Second, we apply an unbiased stochastic quantizer to the projected sample mean.

The local samples $X_i^1, \ldots, X_i^n$ are drawn i.i.d. from a Gaussian distribution with mean $\theta_i$ and variance $\sigma_x^2$, and hence, we have that $\overline{X}_i \sim \mathcal{N}(\theta_i, \sigma_x^2 / n)$. Thus, from the concentration property of the Gaussian distribution, we get that $\Pr[|\overline{X}_i - \theta_i| > c_1] \le \exp\left(-\frac{n c_1^2}{\sigma_x^2}\right)$ for all $i \in [m]$. Similarly, the models $\theta_1, \ldots, \theta_m$ are drawn i.i.d. from a Gaussian distribution with mean $\mu \in [-r, r]$ and variance $\sigma_\theta^2$; hence, we get $\Pr[|\theta_i - \mu| > c_2] \le \exp\left(-\frac{c_2^2}{\sigma_\theta^2}\right)$ for all $i \in [m]$. Let $E = \{\overline{X}_i \in [-a, a] : \forall i \in [m]\}$, where $a = r + c_1 + c_2$. Thus, from the union bound, we get that $\Pr[E] > 1 - m\left(e^{-\frac{n c_1^2}{\sigma_x^2}} + e^{-\frac{c_2^2}{\sigma_\theta^2}}\right)$. By setting $c_1 = \sqrt{\frac{\sigma_x^2}{n} \log(m^2 n)}$ and $c_2 = \sqrt{\sigma_\theta^2 \log(m^2 n)}$, we get that $a = r + \frac{\sigma_x}{\sqrt{n}} \sqrt{\log(m^2 n)} + \sigma_\theta \sqrt{\log(m^2 n)}$ and $\Pr[E] \ge 1 - \frac{2}{mn}$.

Let $q_k : [-a, a] \to \mathcal{Y}_k$ be a quantization function with $k$ bits, where $\mathcal{Y}_k$ is a discrete set of cardinality $|\mathcal{Y}_k| = 2^k$. For a given $x \in [-a, a]$, the output of the function $q_k$ is given by

$$q_k(x) = \frac{2a}{2^k - 1} \big( \lfloor \tilde{x} \rfloor + \mathrm{Bern}(\tilde{x} - \lfloor \tilde{x} \rfloor) \big) - a,$$

where $\mathrm{Bern}(p)$ is a Bernoulli random variable with bias $p$, and $\tilde{x} = \frac{2^k - 1}{2a}(x + a) \in [0, 2^k - 1]$. Observe that the output of the function $q_k$ requires only $k$ bits for transmission. Furthermore, the function $q_k$ satisfies the following conditions:

$$\mathbb{E}[q_k(x)] = x, \qquad \sigma_{q_k}^2 = \mathbb{E}\big(q_k(x) - x\big)^2 \le \frac{a^2}{(2^k - 1)^2}. \tag{19}$$

Let each client apply the function $q_k$ to the projected local mean $\tilde{X}_i = \mathrm{Proj}_{[-a, a]}\big[\overline{X}_i\big]$ and send the output to the server, for all $i \in [m]$.
Conditioned on the event $E$, i.e., $\overline{X}_i \in [-a, a]$ for all $i \in [m]$, and using (19), we get that

$$\mathrm{MSE} = \mathbb{E}_{\theta, X} \big(\hat\theta_i - \theta_i\big)^2 \le \frac{\sigma_x^2}{n} \left( \frac{1 - \alpha}{m} + \alpha \right),$$

where $\alpha = \frac{\sigma_\theta^2 + \frac{a^2}{(2^k - 1)^2 (m - 1)}}{\sigma_\theta^2 + \frac{a^2}{(2^k - 1)^2 (m - 1)} + \frac{\sigma_x^2}{n}}$ and $a = r + \frac{\sigma_x}{\sqrt{n}} \sqrt{\log(m^2 n)} + \sigma_\theta \sqrt{\log(m^2 n)}$. Note that the event $E$ happens with probability at least $1 - \frac{2}{mn}$.

B.2.2 PROOF OF THEOREM 2, PART 2

We define the (random) mechanism $q_p : [-a, a] \to \mathbb{R}$ that takes an input $x \in [-a, a]$ and generates a user-level $(\epsilon_0, \delta)$-LDP output $y \in \mathbb{R}$, where $y = q_p(x)$ is given by

$$q_p(x) = x + \nu,$$

where $\nu \sim \mathcal{N}(0, \sigma_{\epsilon_0}^2)$ is Gaussian noise. By setting $\sigma_{\epsilon_0}^2 = \frac{8 a^2 \log(2/\delta)}{\epsilon_0^2}$, we get that the output of the function $q_p(x)$ is $(\epsilon_0, \delta)$-LDP from Dwork & Roth (2014). Furthermore, the function $q_p$ satisfies the following conditions:

$$\mathbb{E}[q_p(x)] = x, \qquad \sigma_{q_p}^2 = \mathbb{E}\big(q_p(x) - x\big)^2 \le \frac{8 a^2 \log(2/\delta)}{\epsilon_0^2}.$$

Similar to the proof of Theorem 2, Part 1, let each client apply the function $q_p$ to the projected local mean $\tilde{X}_i = \mathrm{Proj}_{[-a, a]}\big[\overline{X}_i\big]$ and send the output to the server, for all $i \in [m]$. Conditioned on the event $E$, i.e., $\overline{X}_i \in [-a, a]$ for all $i \in [m]$, and using (19), we get that

$$\mathrm{MSE} = \mathbb{E}_{\theta, X} \big(\hat\theta_i - \theta_i\big)^2 \le \frac{\sigma_x^2}{n} \left( \frac{1 - \alpha}{m} + \alpha \right),$$

where $\alpha = \frac{\sigma_\theta^2 + \frac{8 a^2 \log(2/\delta)}{\epsilon_0^2 (m - 1)}}{\sigma_\theta^2 + \frac{8 a^2 \log(2/\delta)}{\epsilon_0^2 (m - 1)} + \frac{\sigma_x^2}{n}}$ and $a = r + \frac{\sigma_x}{\sqrt{n}} \sqrt{\log(m^2 n)} + \sigma_\theta \sqrt{\log(m^2 n)}$. Note that the event $E$ happens with probability at least $1 - \frac{2}{mn}$.

Remark 4 (Privacy with communication efficiency). Note that our private estimation algorithm for the Gaussian case adds Gaussian noise (which is a real number), but it can also be made communication-efficient by instead adding discrete Gaussian noise (Canonne et al., 2020).
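The two client-side mechanisms used in the proofs of Theorem 2 can be sketched as follows: the $k$-bit stochastic quantizer $q_k$ of Part 1 and the Gaussian mechanism $q_p$ of Part 2, each applied after projecting onto $[-a, a]$. The function names and the explicit `rng` argument are assumptions made for illustration; the clamp on the lower grid index is a small numerical safeguard that is not part of the proof.

```python
import math
import random

def project(x, a):
    """Projection of a scalar onto the interval [-a, a]."""
    return max(-a, min(a, x))

def quantize(x, a, k, rng):
    """Unbiased k-bit stochastic quantizer q_k on [-a, a]."""
    levels = 2 ** k - 1
    x_tilde = levels / (2.0 * a) * (x + a)            # rescaled to [0, 2^k - 1]
    low = min(math.floor(x_tilde), levels - 1)        # guard against rounding at x = a
    bit = 1 if rng.random() < x_tilde - low else 0    # Bern(x_tilde - floor(x_tilde))
    return 2.0 * a / levels * (low + bit) - a

def gaussian_noise_variance(a, eps0, delta):
    """Noise variance 8 a^2 log(2 / delta) / eps0^2 used by q_p."""
    return 8.0 * a * a * math.log(2.0 / delta) / eps0 ** 2

def privatize(x, a, eps0, delta, rng):
    """Gaussian mechanism q_p(x) = x + nu applied to the projected input."""
    sigma = math.sqrt(gaussian_noise_variance(a, eps0, delta))
    return project(x, a) + rng.gauss(0.0, sigma)
```

With `k = 1` the quantizer outputs only the two endpoints $\pm a$, and averaging many independent outputs recovers the input, which is the unbiasedness condition in (19).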

B.3 LOWER BOUND

Here we discuss the lower bound using a Fisher information technique similar to Barnes et al. (2020). In particular, we use a Bayesian version of the Cramér-Rao lower bound, the van Trees inequality (Gill & Levit, 1995). Let us denote by $f(X \mid \theta)$ the data-generating conditional density function and by $\pi(\theta)$ the prior distribution that generates $\theta$. Let $\mathbb{E}_\theta$ denote the expectation with respect to the randomness of $\theta$, and $\mathbb{E}$ the expectation with respect to the randomness of $X$ and $\theta$. First we define two types of Fisher information:

$$I_X(\theta) = \mathbb{E}_\theta \big[ \nabla_\theta \log f(X \mid \theta)\, \nabla_\theta \log f(X \mid \theta)^T \big], \qquad I(\pi) = \mathbb{E} \big[ \nabla_\theta \log \pi(\theta)\, \nabla_\theta \log \pi(\theta)^T \big],$$

namely the Fisher information of estimating $\theta$ from samples $X$ and the Fisher information of the prior $\pi$. Here the logarithm is elementwise. For the van Trees inequality we need the following regularity conditions:

• $f(X \mid \cdot)$ and $\pi(\cdot)$ are absolutely continuous and $\pi(\cdot)$ vanishes at the end points of $\Theta$.
• $\mathbb{E}_\theta \nabla_\theta \log f(X \mid \theta) = 0$.
• We also assume both density functions are continuously differentiable.

These assumptions are satisfied in the Gaussian setting for any finite mean $\mu$, and in the Bernoulli setting as long as the parameters $\alpha$ and $\beta$ are larger than 1. Assuming the local samples $X$ are generated i.i.d. from $f(x \mid \theta)$, the van Trees inequality in one dimension is as follows:

$$\mathbb{E} \big( \hat\theta(X) - \theta \big)^2 \ge \frac{1}{n\, \mathbb{E} I_X(\theta) + I(\pi)},$$

where $I_X(\theta) = \mathbb{E}_\theta \big[ \big( \tfrac{\partial}{\partial \theta} \log f(X \mid \theta) \big)^2 \big]$ and $I(\pi) = \mathbb{E} \big[ \big( \tfrac{\partial}{\partial \theta} \log \pi(\theta) \big)^2 \big]$. Assuming $\theta \in \mathbb{R}^d$ and that the dimensions are independent of each other, by Gill & Levit (1995) we have:

$$\mathbb{E} \big\| \hat\theta(X) - \theta \big\|^2 \ge \frac{d^2}{n\, \mathbb{E}\, \mathrm{Tr}(I_X(\theta)) + \mathrm{Tr}(I(\pi))}.$$

Note that the lower bound on the average risk directly translates into a lower bound on $\sup_{\theta \in \Theta} \mathbb{E}_X \| \hat\theta(X) - \theta \|^2$. Before our proof we state a useful fact.

Fact 1. Given a random variable $X \mid Y \sim \mathcal{N}(Y, \sigma_x^2)$, where $Y \sim \mathcal{N}(z, \sigma_y^2)$, we have $X \sim \mathcal{N}(z, \sigma_x^2 + \sigma_y^2)$.

Proof. We give the proof in one dimension; it can easily be extended to the multidimensional case where the dimensions are independent.
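For all $t \in \mathbb{R}$ we have

$$\begin{aligned}
\mathbb{E}_X[\exp(itX)] &= \mathbb{E}_Y \big[ \mathbb{E}_X[\exp(itX) \mid Y] \big] = \mathbb{E}_Y \left[ \exp\left( itY - \frac{\sigma_x^2 t^2}{2} \right) \right] = \exp\left( -\frac{\sigma_x^2 t^2}{2} \right) \mathbb{E}_Y[\exp(itY)] \\
&= \exp\left( -\frac{\sigma_x^2 t^2}{2} \right) \exp\left( itz - \frac{\sigma_y^2 t^2}{2} \right) = \exp\left( itz - \frac{(\sigma_x^2 + \sigma_y^2) t^2}{2} \right),
\end{aligned}$$

where the last expression is the characteristic function of a Gaussian with mean $z$ and variance $\sigma_x^2 + \sigma_y^2$.

Gaussian case with perfect knowledge of the prior. In this setting we know that $\theta_i \sim \mathcal{N}(\mu \mathbf{1}, \sigma_\theta^2 I_d)$; hence, $I(\pi) = \frac{1}{\sigma_\theta^2} I_d$, and similarly $I_X(\theta) = \frac{1}{\sigma_x^2} I_d$. Then,

$$\sup_{\theta_i} \mathbb{E} \big\| \hat\theta_i(X) - \theta_i \big\|^2 \ge \frac{d^2}{n \frac{d}{\sigma_x^2} + \frac{d}{\sigma_\theta^2}} = \frac{d\, \sigma_\theta^2 \sigma_x^2}{n \sigma_\theta^2 + \sigma_x^2}.$$

Gaussian case with estimated population mean. In this setting, instead of the true prior we have a prior whose mean is the average of all data spread across clients, i.e., we assume $\theta_i \sim \mathcal{N}(\hat\mu, \sigma_\theta^2 I_d)$, where $\hat\mu = \frac{1}{mn} \sum_{i,j}^{m,n} X_i^j$. We additionally know that there is a Markov relation such that $X_i^j \mid \theta_i \sim \mathcal{N}(\theta_i, \sigma_x^2 I_d)$ and $\theta_i \sim \mathcal{N}(\mu, \sigma_\theta^2 I_d)$. While the true prior is parameterized by the mean $\mu$, $\theta_i$ in this form is not parameterized by $\mu$ but by $\hat\mu$, which itself is random due to $X_i^j$. However, using Fact 1 twice we can write $\theta_i \sim \mathcal{N}\big(\mu, (\sigma_\theta^2 + \frac{\sigma_\theta^2}{m} + \frac{\sigma_x^2}{mn}) I_d\big)$. Then, using the van Trees inequality as in the lower bound for the perfect-knowledge case, we obtain:

$$\sup_{\theta_i \in \Theta} \mathbb{E}_X \big\| \hat\theta_i(X) - \theta_i \big\|^2 \ge d\, \frac{\sigma_\theta^2 \sigma_x^2 + \frac{\sigma_x^4}{mn}}{n \sigma_\theta^2 + \sigma_x^2}. \tag{30}$$

C PERSONALIZED ESTIMATION -BERNOULLI MODEL

C.1 WHEN α, β ARE KNOWN

Analogous to the Gaussian case, we can show that if $\alpha, \beta$ are known, then the posterior mean estimator has a closed-form expression, $\hat{p}_i = a \overline{X}_i + (1 - a) \frac{\alpha}{\alpha + \beta}$ (where $a = \frac{n}{\alpha + \beta + n}$), and achieves the MSE $\mathbb{E}_{p_i \sim \pi} \mathbb{E}_{\hat{p}_i, X_1, \ldots, X_m} (\hat{p}_i - p_i)^2 \le \frac{\alpha\beta}{n(\alpha + \beta)(\alpha + \beta + 1)} \cdot \frac{n}{\alpha + \beta + n}$. We show this below.

For a client $i$, let $\pi(p_i)$ be distributed as $\mathrm{Beta}(\alpha, \beta)$. In this setting, we model each client as generating local samples according to $\mathrm{Bern}(p_i)$. Consequently, the sum of the local data samples at each client has a Binomial distribution.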
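Estimating the Bernoulli parameter $p_i$ is related to the Binomial variable $Z_i \sim \mathrm{Bin}(n, p_i)$ (the sum of the data samples), since it is a sufficient statistic of the Bernoulli distribution. The distribution of the Binomial variable $Z_i$ given $p_i$ is $P(Z_i = z_i \mid p_i) = \binom{n}{z_i} p_i^{z_i} (1 - p_i)^{n - z_i}$. It is a known fact that for any prior, the Bayesian MSE risk minimizer is the posterior mean $\mathbb{E}[p_i \mid Z_i = z_i]$.

When $p_i \sim \mathrm{Beta}(\alpha, \beta)$, we have the posterior

$$\begin{aligned}
f(p_i \mid Z_i = z_i) &= \frac{P(z_i \mid p_i)}{P(z_i)} \pi(p_i) = \frac{\binom{n}{z_i} p_i^{z_i} (1 - p_i)^{n - z_i}}{P(z_i)} \cdot \frac{p_i^{\alpha - 1} (1 - p_i)^{\beta - 1}}{B(\alpha, \beta)} \\
&= \frac{\binom{n}{z_i}}{P(z_i)} \cdot \frac{B(\alpha + z_i, \beta + n - z_i)}{B(\alpha, \beta)} \cdot \frac{p_i^{\alpha + z_i - 1} (1 - p_i)^{\beta + n - z_i - 1}}{B(\alpha + z_i, \beta + n - z_i)},
\end{aligned}$$

where $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$, and

$$\begin{aligned}
P(z_i) &= \int P(z_i \mid p_i)\, \pi(p_i)\, dp_i = \int \binom{n}{z_i} p_i^{z_i} (1 - p_i)^{n - z_i} \frac{p_i^{\alpha - 1} (1 - p_i)^{\beta - 1}}{B(\alpha, \beta)}\, dp_i \\
&= \binom{n}{z_i} \frac{B(z_i + \alpha, n - z_i + \beta)}{B(\alpha, \beta)} \underbrace{\int \frac{p_i^{\alpha + z_i - 1} (1 - p_i)^{\beta + n - z_i - 1}}{B(\alpha + z_i, \beta + n - z_i)}\, dp_i}_{\text{integral of a Beta distribution}} = \binom{n}{z_i} \frac{B(z_i + \alpha, n - z_i + \beta)}{B(\alpha, \beta)}.
\end{aligned}$$

Thus, we get that the posterior distribution $f(p_i \mid Z_i = z_i) = \frac{p_i^{\alpha + z_i - 1} (1 - p_i)^{\beta + n - z_i - 1}}{B(\alpha + z_i, \beta + n - z_i)}$ is a Beta distribution $\mathrm{Beta}(z_i + \alpha, n - z_i + \beta)$. As a result, the posterior mean is given by

$$\hat{p}_i = \frac{\alpha + Z_i}{\alpha + \beta + n} = a \frac{Z_i}{n} + (1 - a) \frac{\alpha}{\alpha + \beta}, \tag{31}$$

where $a = \frac{n}{\alpha + \beta + n}$. Observe that $\mathbb{E}_{p_i \sim \mathrm{Beta}(\alpha, \beta)}[p_i] = \frac{\alpha}{\alpha + \beta}$, i.e., the estimator is a weighted summation between the local estimator $\frac{z_i}{n}$ and the global estimator $\mu = \frac{\alpha}{\alpha + \beta}$.

We have $R_{p_i}(\hat{p}_i) = \mathbb{E}_\pi \mathbb{E}(\hat{p}_i - p_i)^2$. The MSE of the posterior mean is given by

$$\begin{aligned}
\mathrm{MSE} = \mathbb{E}\big[(\hat{p}_i - p_i)^2\big] &= \mathbb{E}\left[ \left( a \left( \frac{z_i}{n} - p_i \right) + (1 - a)(\mu - p_i) \right)^2 \right] \\
&= a^2\, \mathbb{E}\left[ \left( \frac{z_i}{n} - p_i \right)^2 \right] + (1 - a)^2\, \mathbb{E}\big[ (\mu - p_i)^2 \big] \\
&= a^2\, \mathbb{E}_{p_i \sim \pi(p_i)} \left[ \frac{p_i (1 - p_i)}{n} \right] + (1 - a)^2 \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} \\
&= a^2 \frac{\alpha\beta}{n(\alpha + \beta)(\alpha + \beta + 1)} + (1 - a)^2 \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} \\
&= \frac{\alpha\beta}{n(\alpha + \beta)(\alpha + \beta + 1)} \cdot \frac{n}{\alpha + \beta + n}.
\end{aligned}$$

The last equality is obtained by setting $a = \frac{n}{\alpha + \beta + n}$.

Remark 5.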
Note that $\overline{X}_i := \frac{Z_i}{n}$ is the estimator based only on the local data, $\frac{\alpha}{\alpha + \beta}$ is the true global mean, and $\hat{p}_i = a \overline{X}_i + (1 - a) \frac{\alpha}{\alpha + \beta}$, where $a = \frac{n}{\alpha + \beta + n}$ (see (31)), is the estimator based on all the data. Observe that when $n \to \infty$, then $a \to 1$, which implies that $\hat{p}_i \to \overline{X}_i$. Otherwise, when $\alpha + \beta$ is large (i.e., the variance of the Beta distribution is small), then $a \to 0$, which implies that $\hat{p}_i \to \frac{\alpha}{\alpha + \beta}$. Both these conclusions conform to the conventional wisdom, as mentioned in the Gaussian case. It can be shown that the local estimate $\overline{X}_i$ achieves the Bayesian risk $\mathbb{E}_{p_i \sim \mathrm{Beta}(\alpha, \beta)} \mathbb{E}_{X_i}\big[(\overline{X}_i - p_i)^2\big] = \mathbb{E}_{p_i \sim \mathrm{Beta}(\alpha, \beta)}\left[\frac{p_i (1 - p_i)}{n}\right] = \frac{\alpha\beta}{n(\alpha + \beta)(\alpha + \beta + 1)}$, which implies that the personalized estimation with a perfect prior always outperforms the local estimate, with a multiplicative gain $a = \frac{n}{n + \alpha + \beta} \le 1$.
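The closed forms in Section C.1 are simple enough to check numerically. The sketch below evaluates the posterior mean $\hat{p}_i$, its Bayes MSE, and the Bayes risk of the local estimate, so that the multiplicative gain $a = n / (n + \alpha + \beta)$ from Remark 5 can be read off directly. The function names are illustrative assumptions.

```python
def posterior_mean(z, n, alpha, beta):
    """p_hat = a * (z / n) + (1 - a) * alpha / (alpha + beta), with a = n / (alpha + beta + n)."""
    a = n / (alpha + beta + n)
    return a * (z / n) + (1 - a) * alpha / (alpha + beta)

def local_bayes_risk(n, alpha, beta):
    """Bayes risk of the local estimate z / n: alpha*beta / (n (alpha+beta)(alpha+beta+1))."""
    s = alpha + beta
    return alpha * beta / (n * s * (s + 1))

def posterior_mean_mse(n, alpha, beta):
    """MSE of the posterior mean: the local risk scaled by a = n / (alpha + beta + n)."""
    return local_bayes_risk(n, alpha, beta) * n / (alpha + beta + n)
```

For instance, with `z = 3`, `n = 10`, and `alpha = beta = 2`, the posterior mean equals $(\alpha + z) / (\alpha + \beta + n) = 5/14$, which lies between the local estimate $0.3$ and the prior mean $0.5$.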

C.2 WHEN α, β ARE UNKNOWN: PROOF OF THEOREM 3

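The personalized model of the $i$-th client with unknown parameters $\alpha, \beta$ is given by

$$\hat{p}_i = \hat{a}_i \overline{X}_i + (1 - \hat{a}_i)\, \hat\mu_i,$$

where $\hat{a}_i = \frac{n}{\frac{\hat\mu_i (1 - \hat\mu_i)}{\hat\sigma_i^2} + n}$, the empirical mean is $\hat\mu_i = \frac{1}{m - 1} \sum_{l \neq i} \overline{X}_l$, and the empirical variance is $\hat\sigma_i^2 = \frac{1}{m - 2} \sum_{l \neq i} (\overline{X}_l - \hat\mu_i)^2$. From (Tian et al., 2017, Lemma 1), with probability $1 - \frac{1}{m^2 n}$, we get that

$$|\mu - \hat\mu_i| \le \sqrt{\frac{3 \log(4 m^2 n)}{m - 1}}, \qquad |\sigma^2 - \hat\sigma_i^2| \le \sqrt{\frac{3 \log(4 m^2 n)}{m - 1}},$$

where $\mu = \frac{\alpha}{\alpha + \beta}$ and $\sigma^2 = \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$ are the true mean and variance of the Beta distribution, respectively. Let $c = \sqrt{\frac{3 \log(4 m^2 n)}{m - 1}}$. Conditioned on the event $E = \{|\mu - \hat\mu_i| \le c,\ |\sigma^2 - \hat\sigma_i^2| \le c : \forall i \in [m]\}$, which happens with probability at least $1 - \frac{1}{mn}$, we get that:

$$\begin{aligned}
\mathbb{E}\big[(\hat{p}_i - p_i)^2 \mid Z_{-i}\big] &= a^2\, \mathbb{E}\left[ \left( \frac{Z_i}{n} - p_i \right)^2 \right] + (1 - a)^2\, \mathbb{E}\big[ (\hat\mu_i - p_i)^2 \mid Z_{-i} \big] \\
&= a^2 \frac{\alpha\beta}{n(\alpha + \beta)(\alpha + \beta + 1)} + (1 - a)^2 \Big( \mathbb{E}\big[(\mu - p_i)^2\big] + (\mu - \hat\mu_i)^2 \Big) \\
&= a^2 \frac{\alpha\beta}{n(\alpha + \beta)(\alpha + \beta + 1)} + (1 - a)^2 \left( \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} + (\mu - \hat\mu_i)^2 \right) \\
&\le a^2 \frac{\alpha\beta}{n(\alpha + \beta)(\alpha + \beta + 1)} + (1 - a)^2 \left( \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} + c^2 \right),
\end{aligned}$$

where the expectation is with respect to $z_i \sim \mathrm{Bin}(n, p_i)$ and $p_i \sim \mathrm{Beta}(\alpha, \beta)$, and $Z_{-i} = \{z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_m\}$ denotes the entire dataset except the $i$-th client's data ($z_i$). By taking the expectation with respect to the datasets $Z_{-i}$, we get that the MSE is bounded by

$$\mathrm{MSE} \le \mathbb{E}\left[ a^2 \frac{\alpha\beta}{n(\alpha + \beta)(\alpha + \beta + 1)} \right] + \mathbb{E}\left[ (1 - a)^2 \left( \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} + \frac{3 \log(4 m^2 n)}{m - 1} \right) \right],$$

with probability at least $1 - \frac{1}{mn}$. This completes the proof of Theorem 3.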

C.3 WITH PRIVACY CONSTRAINTS: PROOF OF THEOREM 4

First, we prove some properties of the private mechanism $q_p$. Observe that for any two inputs $x, x' \in [0, 1]$, we have that

$$\frac{\Pr[q_p(x) = y]}{\Pr[q_p(x') = y]} = \frac{\frac{e^{\epsilon_0}}{e^{\epsilon_0} + 1} - x \frac{e^{\epsilon_0} - 1}{e^{\epsilon_0} + 1}}{\frac{e^{\epsilon_0}}{e^{\epsilon_0} + 1} - x' \frac{e^{\epsilon_0} - 1}{e^{\epsilon_0} + 1}} \le e^{\epsilon_0}, \tag{33}$$

for $y = \frac{-1}{e^{\epsilon_0} - 1}$. Similarly, we can prove (33) for the output $y = \frac{e^{\epsilon_0}}{e^{\epsilon_0} - 1}$. Thus, the mechanism $q_p$ is user-level $\epsilon_0$-LDP. Furthermore, for a given $x \in [0, 1]$, we have that $\mathbb{E}[q_p(x)] = x$. Thus, the output of the mechanism $q_p$ is an unbiased estimate of the input $x$. From Hoeffding's inequality for bounded random variables, we get that:

$$\Pr\big[|\hat\mu_i^{(p)} - \mu| > t\big] \le 2 \exp\left( -\frac{3 (e^{\epsilon_0} - 1)^2 (m - 1) t^2}{(e^{\epsilon_0} + 1)^2} \right), \qquad \Pr\big[|\hat\sigma_i^{2(p)} - \sigma^2| > t\big] \le 2 \exp\left( -\frac{3 (e^{\epsilon_0} - 1)^2 (m - 1) t^2}{(e^{\epsilon_0} + 1)^2} \right).$$

Thus, the event $E = \{|\hat\mu_i^{(p)} - \mu| \le c_p,\ |\hat\sigma_i^{2(p)} - \sigma^2| \le c_p : \forall i \in [m]\}$ happens with probability at least $1 - \frac{1}{mn}$, where $c_p = \sqrt{\frac{(e^{\epsilon_0} + 1)^2 \log(4 m^2 n)}{3 (e^{\epsilon_0} - 1)^2 (m - 1)}}$. By following the same steps as for the non-private estimator, we get that the MSE of the private model is bounded by

$$\mathrm{MSE} \le \mathbb{E}\left[ \big(a^{(p)}\big)^2 \frac{\alpha\beta}{n(\alpha + \beta)(\alpha + \beta + 1)} \right] + \mathbb{E}\left[ \big(1 - a^{(p)}\big)^2 \left( \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} + \frac{(e^{\epsilon_0} + 1)^2 \log(4 m^2 n)}{3 (e^{\epsilon_0} - 1)^2 (m - 1)} \right) \right],$$

where $a^{(p)} = \frac{n}{\frac{\hat\mu_i^{(p)} (1 - \hat\mu_i^{(p)})}{\hat\sigma_i^{2(p)}} + n}$ and the expectation is with respect to the clients' data $\{z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_m\}$ and the randomness of the private mechanism $q_p$. This completes the proof of Theorem 4.

Remark 6 (Privacy with communication efficiency). Note that our private estimation algorithm for the Bernoulli case is already communication-efficient, as each client sends only one bit to the server.

Remark 7 (Client sampling). For simplicity, in the theoretical analysis of the Gaussian and Bernoulli models, we assume that all clients participate in the estimation process. However, a simple modification to our analysis also handles the case where only $K$ out of $m$ clients participate: in all our theorem statements we would have $K$ instead of $m$. Note that we do client sampling for our experiments in Table 1.
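The one-bit mechanism $q_p$ analyzed at the start of this proof can be sketched as below. The two output values and the probability of the lower output are taken from the ratio in (33); the function names and the explicit `rng` argument are illustrative assumptions.

```python
import math
import random

def output_values(eps0):
    """The two possible outputs of q_p: -1 / (e^eps0 - 1) and e^eps0 / (e^eps0 - 1)."""
    e = math.exp(eps0)
    return -1.0 / (e - 1.0), e / (e - 1.0)

def prob_low(x, eps0):
    """Probability that q_p(x) takes the lower output value, for x in [0, 1]."""
    e = math.exp(eps0)
    return e / (e + 1.0) - x * (e - 1.0) / (e + 1.0)

def private_bit(x, eps0, rng):
    """Sample q_p(x): a single randomized bit, mapped to one of two real values."""
    low, high = output_values(eps0)
    return low if rng.random() < prob_low(x, eps0) else high
```

The worst-case probability ratio is attained at the endpoints $x = 0$ and $x' = 1$, where it equals $e^{\epsilon_0}$, and the expectation of the output is $x$, matching the two properties used in the proof.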

D PERSONALIZED ESTIMATION -MIXTURE MODEL

Consider a set of $m$ clients, where the $i$-th client has a local dataset $X_i = (X_{i1}, \ldots, X_{in})$ of $n$ samples for $i \in [m]$, where $X_{ij} \in \mathbb{R}^d$. The local samples $X_i$ of the $i$-th client are drawn i.i.d. from a Gaussian distribution $\mathcal{N}(\theta_i, \sigma_x^2 I_d)$ with unknown mean $\theta_i$ and known covariance $\sigma_x^2 I_d$. In this section, we assume that the personalized models $\theta_1, \ldots, \theta_m$ are drawn i.i.d. from a discrete distribution $P = [p_1, \ldots, p_k]$ over $k$ given candidates $\mu_1, \ldots, \mu_k \in \mathbb{R}^d$. In other words, $\Pr[\theta_i = \mu_l] = p_l$ for $l \in [k]$ and $i \in [m]$. The goal of each client is to estimate her personalized model $\theta_i$ so as to minimize the mean squared error, defined as follows:

$$\mathrm{MSE} = \mathbb{E}_{\{\theta_i, X_i\}} \big\| \theta_i - \hat\theta_i \big\|^2,$$

where the expectation is taken with respect to the personalized models $\theta_i$ and the local samples $\{X_{ij} \sim \mathcal{N}(\theta_i, \sigma_x^2 I_d)\}$. Furthermore, $\hat\theta_i$ denotes the estimate of the personalized model $\theta_i$ for $i \in [m]$. First, we start with a simple case in which the clients have perfect knowledge of the prior distribution, i.e., the $i$-th client knows the $k$ Gaussian distributions $\mathcal{N}(\mu_1, \sigma_\theta^2), \ldots, \mathcal{N}(\mu_k, \sigma_\theta^2)$ and the prior distribution $\alpha = [\alpha_1, \ldots, \alpha_k]$. This will serve as a stepping stone to the more general case in which the prior distribution is unknown.

D.1 WHEN THE PRIOR DISTRIBUTION IS KNOWN

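In this case, the $i$-th client does not need the data of the other clients, as she has perfect knowledge of the prior distribution.

Theorem 6. Given perfect knowledge of $\alpha = [\alpha_1, \ldots, \alpha_k]$ and $\mathcal{N}(\mu_1, \sigma_\theta^2), \ldots, \mathcal{N}(\mu_k, \sigma_\theta^2)$, the optimal personalized estimator that minimizes the MSE is given by

$$\hat\theta_i = \sum_{l=1}^{k} \alpha_l^{(i)} \mu_l, \tag{38}$$

where $\alpha_l^{(i)} = \frac{p_l \exp\left( -\frac{\sum_{j=1}^{n} \| X_{ij} - \mu_l \|^2}{2 \sigma_x^2} \right)}{\sum_{s=1}^{k} p_s \exp\left( -\frac{\sum_{j=1}^{n} \| X_{ij} - \mu_s \|^2}{2 \sigma_x^2} \right)}$ denotes the weight associated with the prior model $\mu_l$ for $l \in [k]$.

Proof. Let $\theta_i \sim P$, where $P = [p_1, \ldots, p_k]$ and $p_l = \Pr[\theta_i = \mu_l]$ for $l \in [k]$. The goal is to design an estimator $\hat\theta_i$ that minimizes the MSE given by

$$\mathrm{MSE} = \mathbb{E}_{\theta_i \sim P}\, \mathbb{E}_{\{X_{ij} \sim \mathcal{N}(\theta_i, \sigma_x^2)\}} \big\| \hat\theta_i - \theta_i \big\|^2.$$

Let $X_i = (X_{i1}, \ldots, X_{in})$. By following the standard proof of the minimum MSE, we get that:

$$\begin{aligned}
\mathbb{E}_{\theta_i} \mathbb{E}_{X_i} \big\| \hat\theta_i - \theta_i \big\|^2
&= \mathbb{E}_{X_i} \Big[ \mathbb{E}_{\theta_i \mid X_i} \big[ \| \hat\theta_i - \mathbb{E}[\theta_i \mid X_i] + \mathbb{E}[\theta_i \mid X_i] - \theta_i \|^2 \,\big|\, X_i \big] \Big] \\
&= \mathbb{E}_{X_i} \Big[ \mathbb{E}_{\theta_i \mid X_i} \big[ \| \mathbb{E}[\theta_i \mid X_i] - \theta_i \|^2 \,\big|\, X_i \big] \Big] + \mathbb{E}_{X_i} \Big[ \mathbb{E}_{\theta_i \mid X_i} \big[ \| \mathbb{E}[\theta_i \mid X_i] - \hat\theta_i \|^2 \,\big|\, X_i \big] \Big] \\
&\ge \mathbb{E}_{X_i} \Big[ \mathbb{E}_{\theta_i \mid X_i} \big[ \| \mathbb{E}[\theta_i \mid X_i] - \theta_i \|^2 \,\big|\, X_i \big] \Big],
\end{aligned}$$

where the last inequality holds with equality when $\hat\theta_i = \mathbb{E}[\theta_i \mid X_i]$. The distribution of $\theta_i$ given the local dataset $X_i$ is given by

$$\begin{aligned}
\Pr[\theta_i = \mu_l \mid X_i] &= \frac{f(X_i \mid \theta_i = \mu_l) \Pr[\theta_i = \mu_l]}{f(X_i)} = \frac{f(X_i \mid \theta_i = \mu_l) \Pr[\theta_i = \mu_l]}{\sum_{s=1}^{k} f(X_i \mid \theta_i = \mu_s) \Pr[\theta_i = \mu_s]} \\
&= \frac{p_l \exp\left( -\frac{\sum_{j=1}^{n} \| X_{ij} - \mu_l \|^2}{2 \sigma_x^2} \right)}{\sum_{s=1}^{k} p_s \exp\left( -\frac{\sum_{j=1}^{n} \| X_{ij} - \mu_s \|^2}{2 \sigma_x^2} \right)} = \alpha_l^{(i)}.
\end{aligned} \tag{41}$$

As a result, the optimal estimator is given by

$$\hat\theta_i = \mathbb{E}[\theta_i \mid X_i] = \sum_{l=1}^{k} \alpha_l^{(i)} \mu_l.$$

This completes the proof of Theorem 6.

The optimal personalized estimator in (38) is a weighted summation over all possible candidate vectors $\mu_1, \ldots, \mu_k$, where the weight $\alpha_l^{(i)}$ is the posterior probability of the candidate $\mu_l$ given the local data. Note that the MSE of the local estimator $\overline{X}_i$ is $\frac{d \sigma_x^2}{n}$, which increases linearly with the data dimension $d$. On the other hand, the MSE of the optimal estimator in Theorem 6 is a function of the prior distribution $P = [p_1, \ldots, p_k]$, the prior vectors $\mu_1, \ldots, \mu_k$, and the local variance $\sigma_x^2$.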
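The estimator of Theorem 6 is a softmax-weighted average of the candidate vectors, which the following sketch computes. The function names are illustrative assumptions, vectors are represented as plain lists, the priors are assumed strictly positive, and the maximum log-weight is subtracted before exponentiating purely for numerical stability (this does not change the weights).

```python
import math

def posterior_weights(samples, centers, priors, sigma_x2):
    """Posterior probabilities of each candidate center given the local samples."""
    log_w = []
    for mu, p in zip(centers, priors):
        sq_dist = sum(
            sum((x_c - mu_c) ** 2 for x_c, mu_c in zip(x, mu)) for x in samples
        )
        log_w.append(math.log(p) - sq_dist / (2.0 * sigma_x2))
    top = max(log_w)                      # subtract the max before exponentiating
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def personalized_estimate(samples, centers, priors, sigma_x2):
    """Posterior mean: weighted sum of the candidate centers."""
    w = posterior_weights(samples, centers, priors, sigma_x2)
    dim = len(centers[0])
    return [sum(w[l] * centers[l][c] for l in range(len(centers))) for c in range(dim)]
```

As a small worked case, with two one-dimensional candidates $\pm 1$, equal priors, $\sigma_x^2 = 1$, and two samples both equal to $1$, the log-weights differ by $4$, so the estimate is $\frac{1 - e^{-4}}{1 + e^{-4}} = \tanh(2)$.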

D.2 WHEN THE PRIOR DISTRIBUTION IS UNKNOWN

Now, we consider a more practical case in which the prior distribution $P = [p_1, \ldots, p_k]$ and the candidates $\mu_1, \ldots, \mu_k$ are unknown to the clients. In this case, the clients collaborate with each other using their local data to estimate the priors $P$ and $\mu_1, \ldots, \mu_k$, and then each client uses the estimated priors to design her personalized model as in (38).

We present Algorithm 3, which is based on alternating minimization. The algorithm starts by initializing the local models $\{\theta_i^{(0)} := \frac{1}{n} \sum_{j=1}^{n} X_{ij}\}$. Then, the algorithm works in rounds, alternating between estimating the priors $P^{(t+1)} = [p_1^{(t+1)}, \ldots, p_k^{(t+1)}]$, $\mu_1^{(t+1)}, \ldots, \mu_k^{(t+1)}$ and updating the personalized models. Observe that, given the prior information $P^{(t)}, \{\mu_l^{(t)}\}$, each client updates her personalized model in Step 6, which is the optimal estimator for the given priors according to Theorem 6. On the other hand, given the personalized models $\{\theta_i^{(t)}\}$, we estimate the priors $P^{(t)}, \{\mu_l^{(t)}\}$ using a clustering algorithm with $k$ sets in Step 11. The algorithm Cluster takes $m$ vectors $a_1, \ldots, a_m$ and an integer $k$ as its input, and its goal is to generate a set of $k$ cluster centers $\mu_1, \ldots, \mu_k$ that minimizes $\sum_{i=1}^{m} \min_{l \in [k]} \| a_i - \mu_l \|^2$. Furthermore, these clustering algorithms can also return the prior distribution $P$, by setting $p_l := \frac{|S_l|}{m}$, where $S_l \subset \{a_1, \ldots, a_m\}$ denotes the set of vectors that belong to the $l$-th cluster. There are many clustering algorithms, but Lloyd's algorithm (Lloyd, 1982) and the algorithm of Ahmadian et al. (2019) are perhaps the most common ones for $k$-means clustering. Our Algorithm 3 can work with any clustering algorithm.

Algorithm 3 Alternating Minimization for Personalized Estimation

Input: Number of iterations $T$, local datasets $(X_{i1}, \ldots, X_{in})$ for $i \in [m]$.
1: Initialize $\theta_i^0 = \frac{1}{n} \sum_{j=1}^{n} X_{ij}$ for $i \in [m]$.
2: for $t = 1$ to $T$ do
3:   On Clients:
4:   for $i = 1$ to $m$ do
6:     Update the personalized model: $\theta_i^t \leftarrow \sum_{l=1}^{k} \alpha_l^{(i)} \mu_l^{(t)}$, where $\alpha_l^{(i)} = \frac{p_l^{(t)} \exp\left( -\frac{\sum_{j=1}^{n} \| X_{ij} - \mu_l^{(t)} \|^2}{2 \sigma_x^2} \right)}{\sum_{s=1}^{k} p_s^{(t)} \exp\left( -\frac{\sum_{j=1}^{n} \| X_{ij} - \mu_s^{(t)} \|^2}{2 \sigma_x^2} \right)}$
7:     Send $\theta_i^t$ to the server
8:   end for
9:   At the Server:
11:   Update the global parameters: $P^{(t)}, \mu_1^{(t)}, \ldots, \mu_k^{(t)} \leftarrow \mathrm{Cluster}\big(\theta_1^{(t)}, \ldots, \theta_m^{(t)}, k\big)$
12:   Broadcast $P^{(t)}, \mu_1^{(t)}, \ldots, \mu_k^{(t)}$ to all clients
13: end for
Output: Personalized models $\theta_1^T, \ldots, \theta_m^T$.
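The alternation in Algorithm 3 can be sketched in a few lines for scalar data ($d = 1$). This is a minimal illustration under several assumptions that are not in the paper: Lloyd's iterations stand in for Cluster, the centers are initialized deterministically from quantiles of the local means and warm-started across rounds, the server clusters before the first client update, and all function names are hypothetical.

```python
import math
import random

def assign(points, centers):
    """Index of the nearest center for each scalar point."""
    return [min(range(len(centers)), key=lambda l: (p - centers[l]) ** 2)
            for p in points]

def lloyd(points, centers, iters=10):
    """Lloyd iterations from the given initial centers; returns (priors, centers)."""
    centers = list(centers)
    for _ in range(iters):
        labels = assign(points, centers)
        for l in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == l]
            if members:                              # an empty cluster keeps its center
                centers[l] = sum(members) / len(members)
    labels = assign(points, centers)
    priors = [labels.count(l) / len(points) for l in range(len(centers))]
    return priors, centers

def posterior_mean(samples, priors, centers, sigma_x2):
    """Client update: posterior-weighted average of the current centers."""
    log_w = [(math.log(p) if p > 0 else -math.inf)
             - sum((x - mu) ** 2 for x in samples) / (2.0 * sigma_x2)
             for p, mu in zip(priors, centers)]
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    return sum(wi * mu for wi, mu in zip(w, centers)) / sum(w)

def alternating_minimization(data, k, sigma_x2, rounds=5):
    theta = [sum(x) / len(x) for x in data]          # initialize with local means
    ordered, m = sorted(theta), len(theta)
    centers = [ordered[(2 * l + 1) * m // (2 * k)] for l in range(k)]
    for _ in range(rounds):
        priors, centers = lloyd(theta, centers)      # server: re-estimate the priors
        theta = [posterior_mean(x, priors, centers, sigma_x2) for x in data]  # clients
    return theta, priors, centers
```

On data drawn from two well-separated candidates, the returned models concentrate near the two estimated centers, which illustrates how collaboration sharpens the purely local sample means.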

D.3 PRIVACY/COMMUNICATION CONSTRAINTS

In the personalized estimation Algorithm 3, each client shares her personalized estimator $\theta_i^{(t)}$ with the server at each iteration, which is not communication-efficient and violates privacy. In this section we present ideas on how to design communication-efficient and/or private algorithms for personalized estimation.

Lemma 3. Let $\mu_1, \ldots, \mu_k \in \mathbb{R}^d$ be unknown means such that $\|\mu_i\|_2 \le r$ for each $i \in [k]$. Let $\theta_1, \ldots, \theta_m \sim P$, where $P = [p_1, \ldots, p_k]$ and $p_l = \Pr[\theta_i = \mu_l]$. For $i \in [m]$, let $X_{i1}, \ldots, X_{in} \sim \mathcal{N}(\theta_i, \sigma_x^2)$, i.i.d. Then, with probability at least $1 - \frac{1}{mn}$, the following bound holds for all $i \in [m]$:

$$\left\| \frac{1}{n} \sum_{j=1}^{n} X_{ij} \right\|_2 \le 4 \sqrt{d \frac{\sigma_x^2}{n}} + 2 \sqrt{\log(m^2 n) \frac{\sigma_x^2}{n}} + r.$$

Proof. Observe that the vector $\overline{X}_i - \theta_i = \frac{1}{n} \sum_{j=1}^{n} X_{ij} - \theta_i$ is a sub-Gaussian random vector with proxy $\frac{\sigma_x^2}{n}$. As a result, we have that

$$\big\| \overline{X}_i - \theta_i \big\|_2 \le 4 \sqrt{d \frac{\sigma_x^2}{n}} + 2 \sqrt{\log(1/\eta) \frac{\sigma_x^2}{n}},$$

with probability at least $1 - \eta$, from Wainwright (2019). Since $\mu_1, \ldots, \mu_k \in \mathbb{R}^d$ are such that $\|\mu_i\|_2 \le r$ for each $i \in [k]$, we have

$$\big\| \overline{X}_i \big\|_2 \le 4 \sqrt{d \frac{\sigma_x^2}{n}} + 2 \sqrt{\log(1/\eta) \frac{\sigma_x^2}{n}} + r,$$

with probability $1 - \eta$, from the triangle inequality. Thus, choosing $\eta = \frac{1}{m^2 n}$ and using the union bound completes the proof of Lemma 3.

Lemma 3 shows that the average of the local samples $\{\overline{X}_i\}$ has a bounded $\ell_2$ norm with high probability. Thus, we can design a communication-efficient estimation algorithm as follows: each client clips her personalized model $\theta_i^{(t)}$ within radius $4 \sqrt{d \frac{\sigma_x^2}{n}} + 2 \sqrt{\log(m^2 n) \frac{\sigma_x^2}{n}} + r$. Then, each client applies a vector-quantization scheme (e.g., Bernstein et al. (2018); Alistarh et al. (2017); Girgis et al. (2021a)) to the clipped vector before sending it to the server.

To design a private estimation algorithm with discrete priors, each client clips her personalized estimator $\theta_i^{(t)}$ within radius $4 \sqrt{d \frac{\sigma_x^2}{n}} + 2 \sqrt{\log(m^2 n) \frac{\sigma_x^2}{n}} + r$.
Then, we can use a differentially private algorithm for clustering (see, e.g., Stemmer (2020) for clustering under LDP constraints and Ghazi et al. (2020) for clustering under central DP constraints). Since we run $T$ iterations in Algorithm 3, we can obtain the final privacy guarantee $(\epsilon, \delta)$ using the strong composition theorem (Dwork & Roth, 2014).

E PERSONALIZED LEARNING -LINEAR REGRESSION

In this section, we present the personalized linear regression problem. Consider a set of $m$ clients, where the $i$-th client has a local dataset consisting of $n$ samples $(X_{i1}, Y_{i1}), \ldots, (X_{in}, Y_{in})$, where $X_{ij} \in \mathbb{R}^d$ denotes the feature vector and $Y_{ij} \in \mathbb{R}$ denotes the corresponding response. Let $Y_i = (Y_{i1}, \ldots, Y_{in}) \in \mathbb{R}^n$ and $X_i = (X_{i1}, \ldots, X_{in}) \in \mathbb{R}^{n \times d}$ denote the response vector and the feature matrix at the $i$-th client, respectively. Following standard regression, we assume that the response vector $Y_i$ is obtained from a linear model as follows:

$$Y_i = X_i \theta_i + w_i,$$

where $\theta_i$ denotes the personalized model of the $i$-th client and $w_i \sim \mathcal{N}(0, \sigma_x^2 I_n)$ is a noise vector. The clients' parameters $\theta_1, \ldots, \theta_m$ are drawn i.i.d. from a Gaussian distribution $\mathcal{N}(\mu, \sigma_\theta^2 I_d)$.

Our goal is to solve the optimization problem stated in (9) (for the linear regression setup) and learn the optimal personalized parameters $\{\hat\theta_i\}$. The following theorem characterizes the exact form of the optimal $\{\hat\theta_i\}$ and computes their minimum mean squared error w.r.t. the true parameters $\{\theta_i\}$.

Theorem 7. The optimal personalized parameters at client $i$ with known $\mu, \sigma_\theta^2, \sigma_x^2$ are given by

$$\hat\theta_i = \left( \frac{I}{\sigma_\theta^2} + \frac{X_i^T X_i}{\sigma_x^2} \right)^{-1} \left( \frac{X_i^T Y_i}{\sigma_x^2} + \frac{\mu}{\sigma_\theta^2} \right). \tag{47}$$

The mean squared error (MSE) of the above $\hat\theta_i$ is given by

$$\mathbb{E}_{w_i, \theta_i} \big\| \hat\theta_i - \theta_i \big\|^2 = \mathrm{Tr}\left( \left( \frac{I}{\sigma_\theta^2} + \frac{X_i^T X_i}{\sigma_x^2} \right)^{-1} \right).$$

Proof. The personalized model with a perfect prior is obtained by solving the optimization problem stated in (9), which is given below for convenience. Note that for linear regression with a Gaussian prior, we have $P(\Gamma) \equiv \mathcal{N}(\mu, \sigma_\theta^2 I_d)$ and $p_{\theta_i}(Y_{ij} \mid X_{ij})$ distributed according to $\mathcal{N}(X_{ij} \theta_i, \sigma_x^2)$.

$$\begin{aligned}
\hat\theta_i &= \arg\min_{\theta_i} \sum_{j=1}^{n} -\log\big(p_{\theta_i}(Y_{ij} \mid X_{ij})\big) - \log(p(\theta_i)) \\
&= \arg\min_{\theta_i} \sum_{j=1}^{n} \frac{(Y_{ij} - X_{ij} \theta_i)^2}{2 \sigma_x^2} + \frac{\| \theta_i - \mu \|^2}{2 \sigma_\theta^2} \\
&= \arg\min_{\theta_i} \frac{\| Y_i - X_i \theta_i \|^2}{2 \sigma_x^2} + \frac{\| \theta_i - \mu \|^2}{2 \sigma_\theta^2}.
\end{aligned}$$
By taking the derivative with respect to θ i , we get ∂ ∂θ i = X T i (X i θ i -Y i ) σ 2 x + θ i -µ σ 2 θ . Equating the above partial derivative to zero, we get that the optimal personalized parameters θ i is given by: θ i = I σ 2 θ + X T i X i σ 2 x -1 X T i Y i σ 2 x + µ σ 2 θ . Taking the expectation w.r.t. w i , we get: E wi [ θ i ] = I σ 2 θ + X T i X i σ 2 x -1 X T i X i θ i σ 2 x + µ σ 2 θ , Thus, we can bound the MSE as following: E wi,θi θ i -θ i 2 = E wi,θi θ i -E wi [ θ i ] + E wi [ θ i ] -θ i 2 = E wi,θi θ i -E wi [ θ i ] 2 + E wi,θi E wi [ θ i ] -θ i 2 + 2E wi,θi θ i -E wi [ θ i ], E wi [ θ i ] -θ i = E wi,θi θ i -E wi [ θ i ] 2 + E wi,θi E wi [ θ i ] -θ i 2 In the last equality, we used E wi,θi θ i -E wi [ θ i ], E wi [ θ i ] -θ i = E θi E wi [ θ i ] -E wi [ θ i ], E wi [ θ i ] -θ i = 0, where the first equality holds because E wi [ θ i ] -θ i is independent of w i . Letting M = I σ 2 θ + X T i Xi σ 2 x , and Tr denoting the trace operation, we get E wi,θi θ i -θ i 2 = Tr M -1 E wi X T i w i σ 2 x X T i w i σ 2 x T M -1 + Tr M -1 E θi θ i -µ σ 2 θ θ i -µ σ 2 θ T M -1 = Tr M -1 X T i X i σ 2 x M -1 + Tr M -1 I σ 2 θ M -1 = Tr M -1 . This completes the proof of Theorem 7. Observe that the local model of the i-th client, i.e., estimating θ i only from the local data (Y i , X i ), is given by: θ (l) i = X T i X i -1 X T i Y i , Algorithm 4 Linear Regression GD Input: Number of iterations T , local datasets (Y i , X i ) for i ∈ [m], learning rate η. 1: Initialize θ 0 i for i ∈ [m], µ 0 , σ 2,0 x , σ 2,0 θ . 
2: for t = 1 to T do 3: On Clients: 4: for i = 1 to m: do 5: Receive and set µ t i = µ t , σ 2,t θ,i = σ 2,t θ , σ 2,t x,i = σ 2,t x 6: Update the personalized model: θ t i ← θ t-1 i + η n j=1 Xij (Yij -Xij θ t-1 i ) σ 2,t-1 x,i + µ t-1 i -θ t-1 i σ 2,t-1 θ,i 7: Update local version of mean: µ t i ← µ t-1 i -η µ t-1 i -θ t-1 i σ 2,t-1 θ,i 8: Update local variance: σ 2,t x,i ← σ 2,t-1 x,i -η n 2σ 2,t-1 x,i - n j=1 (Yij -Xij θ t-1 i ) 2 2(σ 2,t-1 x,i ) 2 9: Update global variance: σ 2,t θ,i ← σ 2,t-1 θ,i -η d 2σ 2,t-1 θ,i - µ t-1 i -θ t-1 i 2 2(σ 2,t-1 θ,i ) 2 10: end for 11: At the Server: 12: Aggregate mean: µ t = 1 m m i=1 µ t i 13: Aggregate global variance: σ 2,t θ = 1 m m i=1 σ 2,t θ,i 14: Aggregate local variance: σ 2,t x = 1 m m i=1 σ 2,t x,i 15: Broadcast µ t , σ 2,t θ , σ 2,t x 16: end for Output: Personalized models θ T 1 , . . . , θ T m . where we assume the matrix X T i X i has a full rank (otherwise, we take the pseudo inverse). This local estimate achieves the MSE given by: E θ (l) i -θ i 2 = Tr X T i X i -1 σ 2 x , we can prove it by following similar steps as the proof of Theorem 7. When σ 2 θ → ∞, we can easily see that the local estimate (52) matches the personalized estimate in (47). To make the regression problem more practical, we assume that the mean µ, the local variance σ 2 x , and the global variance σ 2 θ are unknown. Hence, we estimate the personalized parameters by minimizing the negative log likelihood: θ 1 , . . . , θ m = arg min {θi},µ,σ 2 x ,σ 2 θ m i=1 n j=1 -log (p θi (Y ij |X ij )) + m i=1 -log (p (θ i )) = arg min nm 2 log(2πσ 2 x ) + m i=1 n j=1 (Y ij -X ij θ i ) 2 2σ 2 x + md 2 log(2πσ 2 θ ) + m i=1 θ i -µ 2 2σ 2 θ . Instead of solving the above optimization problem explicitly, we can optimize it through gradient descent (GD) and the resulting algorithm is presented in Algorithm 4. 
In addition to keeping the personalized models $\{\theta_i^t\}$, each client also maintains local copies $\{\mu_i^t, \sigma_{\theta,i}^{2,t}, \sigma_{x,i}^{2,t}\}$, updates all these parameters by taking the appropriate gradients of the objective in (54), and synchronizes them with the server to update the global copies $\{\mu^t, \sigma_\theta^{2,t}, \sigma_x^{2,t}\}$.
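The closed form of Theorem 7 can be sanity-checked numerically. The following is a minimal sketch for the scalar case $d = 1$ (a simplification chosen here so that no matrix inverse is needed): it evaluates the personalized estimate and its predicted MSE $\mathrm{Tr}(M^{-1}) = 1/M$, and compares it by Monte Carlo with the local least-squares estimate. The design values, seed, and function names are illustrative assumptions.

```python
import random


def personalized_estimate(x, y, mu, var_theta, var_x):
    """Theorem 7 for d = 1: returns (estimate, predicted MSE = 1/M)."""
    precision = 1.0 / var_theta + sum(v * v for v in x) / var_x  # scalar M
    rhs = sum(a * b for a, b in zip(x, y)) / var_x + mu / var_theta
    return rhs / precision, 1.0 / precision


def local_estimate(x, y):
    """Least-squares estimate that uses the local data only."""
    return sum(a * b for a, b in zip(x, y)) / sum(v * v for v in x)


def monte_carlo_mse(x, mu, var_theta, var_x, trials=20000, seed=0):
    """Average squared error over draws of theta_i and of the noise w_i."""
    rng = random.Random(seed)
    err_pers = err_loc = 0.0
    for _ in range(trials):
        theta = rng.gauss(mu, var_theta ** 0.5)
        y = [v * theta + rng.gauss(0.0, var_x ** 0.5) for v in x]
        est, _ = personalized_estimate(x, y, mu, var_theta, var_x)
        err_pers += (est - theta) ** 2
        err_loc += (local_estimate(x, y) - theta) ** 2
    return err_pers / trials, err_loc / trials


if __name__ == "__main__":
    x = [0.5, -1.0, 0.8, 0.3, -0.6]           # hypothetical fixed design
    mu, var_theta, var_x = 0.3, 0.5, 1.0       # hypothetical known prior
    _, predicted = personalized_estimate(x, [0.0] * len(x), mu, var_theta, var_x)
    mc_pers, mc_loc = monte_carlo_mse(x, mu, var_theta, var_x)
    print("predicted personalized MSE:", predicted)
    print("Monte Carlo personalized MSE:", mc_pers)
    print("Monte Carlo local MSE:", mc_loc)
```

Letting `var_theta` grow very large makes the prior term vanish, so the personalized estimate approaches the local estimate, in line with the remark on $\sigma_\theta^2 \to \infty$.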

F PERSONALIZED LEARNING - LOGISTIC REGRESSION

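As described in Section 3, by taking $P(\Gamma) \equiv \mathcal{N}(\mu, \sigma_\theta^2 I_d)$ and $p_{\theta_i}(Y_{ij}|X_{ij}) = \sigma(\langle\theta_i, X_{ij}\rangle)^{Y_{ij}}\left(1 - \sigma(\langle\theta_i, X_{ij}\rangle)\right)^{(1 - Y_{ij})}$, where $\sigma(z) = 1/(1 + e^{-z})$ for any $z \in \mathbb{R}$, the overall optimization problem becomes:

$$\arg\min_{\{\theta_i\},\mu,\sigma_\theta} \; -\sum_{i=1}^m\sum_{j=1}^n\left[Y_{ij}\log\frac{1}{1 + e^{-\langle\theta_i, X_{ij}\rangle}} + (1 - Y_{ij})\log\frac{1}{1 + e^{\langle\theta_i, X_{ij}\rangle}}\right] + \frac{md}{2}\log(2\pi\sigma_\theta^2) + \sum_{i=1}^m\frac{\|\mu - \theta_i\|_2^2}{2\sigma_\theta^2}.$$

When $\mu$ and $\sigma_\theta^2$ are unknown, we would like to learn them by gradient descent, as in the linear regression case. The corresponding algorithm is described in Algorithm 5.

Algorithm 5 Logistic Regression SGD
Input: Number of iterations $T$, local datasets $(Y_i, X_i)$ for $i \in [m]$, learning rate $\eta$.
1: Initialize $\theta_i^0$ for $i \in [m]$, $\mu^0$, $\sigma_\theta^{2,0}$.
2: for $t = 1$ to $T$ do
3:   On Clients:
4:   for $i = 1$ to $m$ do
5:     Receive $(\mu^t, \sigma_\theta^{2,t})$ from the server and set $\mu_i^t := \mu^t$, $\sigma_{\theta,i}^{2,t} := \sigma_\theta^{2,t}$
6:     Update the personalized model: $\theta_i^t \leftarrow \theta_i^{t-1} - \eta\left(\sum_{j=1}^n \nabla_{\theta_i^{t-1}}\,\ell_{CE}^{(p)}\left(\theta_i^{t-1}, (X_{ij}, Y_{ij})\right) + \frac{\theta_i^{t-1} - \mu_i^{t-1}}{\sigma_{\theta,i}^{2,t-1}}\right)$, where $\ell_{CE}^{(p)}$ denotes the cross-entropy loss.
7:     Update local version of mean: $\mu_i^t \leftarrow \mu_i^{t-1} - \eta\,\frac{\mu_i^{t-1} - \theta_i^{t-1}}{\sigma_{\theta,i}^{2,t-1}}$
8:     Update global variance: $\sigma_{\theta,i}^{2,t} \leftarrow \sigma_{\theta,i}^{2,t-1} - \eta\left(\frac{d}{2\sigma_{\theta,i}^{2,t-1}} - \frac{\|\mu_i^{t-1} - \theta_i^{t-1}\|^2}{2(\sigma_{\theta,i}^{2,t-1})^2}\right)$
9:     Send $(\mu_i^t, \sigma_{\theta,i}^{2,t})$ to the server
10:   end for
11:   At the Server:
12:   Receive $\{(\mu_i^t, \sigma_{\theta,i}^{2,t})\}$ from the clients
13:   Aggregate mean: $\mu^t = \frac{1}{m}\sum_{i=1}^m \mu_i^t$
14:   Aggregate global variance: $\sigma_\theta^{2,t} = \frac{1}{m}\sum_{i=1}^m \sigma_{\theta,i}^{2,t}$
15:   Broadcast $(\mu^t, \sigma_\theta^{2,t})$ to all clients
16: end for
Output: Personalized models $\theta_1^T, \ldots, \theta_m^T$.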
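The client-side updates of Algorithm 5 (steps 6 to 8) and the server averaging can be sketched as below. This is an illustrative sketch rather than the authors' implementation: the prior term in the model update is taken as the gradient of $\|\mu_i - \theta_i\|^2 / (2\sigma_\theta^2)$ with respect to $\theta_i$, i.e. $(\theta_i - \mu_i)/\sigma_\theta^2$, and the positivity floor `var_floor` on the variance is an assumption added here for numerical safety.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def client_step(theta, mu, var_theta, X, Y, eta, var_floor=1e-6):
    """One client update; every quantity is computed from the old iterates."""
    d = len(theta)
    # Gradient of the summed cross-entropy: sum_j (sigmoid(<theta, x_j>) - y_j) x_j
    grad = [0.0] * d
    for x, y in zip(X, Y):
        residual = sigmoid(sum(t * v for t, v in zip(theta, x))) - y
        for k in range(d):
            grad[k] += residual * x[k]
    sq_dist = sum((m - t) ** 2 for m, t in zip(mu, theta))
    new_theta = [t - eta * (g + (t - m) / var_theta)
                 for t, g, m in zip(theta, grad, mu)]
    new_mu = [m - eta * (m - t) / var_theta for m, t in zip(mu, theta)]
    new_var = var_theta - eta * (d / (2.0 * var_theta)
                                 - sq_dist / (2.0 * var_theta ** 2))
    return new_theta, new_mu, max(new_var, var_floor)


def server_aggregate(mus, variances):
    """Server side: plain averages of the local means and variances."""
    m = len(mus)
    return [sum(col) / m for col in zip(*mus)], sum(variances) / m


if __name__ == "__main__":
    # Hypothetical one-dimensional client with a single positive example.
    print(client_step([0.0], [1.0], 2.0, [[1.0]], [1], eta=0.1))
```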

G PERSONALIZED LEARNING - MIXTURE MODEL

In this section, we present the linear regression problem as a generalization of the estimation problem with discrete priors. This model falls into the framework studied in Marfoq et al. (2021) and is included to show how our framework also captures it. Consider a set of $m$ clients, where the $i$-th client has a local dataset $(X_{i1}, Y_{i1}), \ldots, (X_{in}, Y_{in})$ of $n$ samples, where $X_{ij} \in \mathbb{R}^d$ denotes the feature vector and $Y_{ij} \in \mathbb{R}$ denotes the corresponding response. Let $Y_i = (Y_{i1}, \ldots, Y_{in}) \in \mathbb{R}^n$ and $X_i = (X_{i1}, \ldots, X_{in}) \in \mathbb{R}^{n \times d}$ denote the response vector and the feature matrix at the $i$-th client, respectively. Following the standard regression setup, we assume that the response vector $Y_i$ is obtained from a linear model as follows: $Y_i = X_i\theta_i + w_i$, where $\theta_i$ denotes the personalized model of the $i$-th client and $w_i \sim \mathcal{N}(0, \sigma_x^2 I_n)$ is a noise vector. The clients' models are drawn i.i.d. from a discrete distribution, $\theta_1, \ldots, \theta_m \sim P$, where $P = [p_1, \ldots, p_k]$ such that $p_l = \Pr[\theta_i = \mu_l]$ for $i \in [m]$ and $l \in [k]$. Our goal is to solve the optimization problem stated in (9) (for linear regression with the above discrete prior) and learn the optimal personalized parameters $\{\hat{\theta}_i\}$. We assume that the discrete distribution $P$ and the prior candidates $\{\mu_l\}_{l=1}^k$ are unknown to the clients. Inspired by Algorithm 3 for estimation with discrete priors, we obtain Algorithm 6 for learning with a discrete prior. Note that this is not a new algorithm; it is essentially the algorithm proposed in Marfoq et al. (2021) applied to linear regression. We include this example to show how our framework captures the mixture model in Marfoq et al. (2021).

Description of Algorithm 6. Client $i$ initializes its personalized parameters as $\theta_i^{(0)} = (X_i^T X_i)^{-1} X_i^T Y_i$, which is optimal as a function of the local dataset at the $i$-th client without any prior knowledge.
In any iteration $t$, for given prior information $P^{(t)}, \{\mu_l^{(t)}\}$, the $i$-th client updates the personalized model as $\theta_i^t = \sum_{l=1}^k \alpha_l^{(i)}\mu_l^{(t)}$, where the weights satisfy $\alpha_l^{(i)} \propto p_l^{(t)}\exp\left(-\frac{\|X_i\mu_l^{(t)} - Y_i\|^2}{2\sigma_x^2}\right)$.

Algorithm 6 Alternating Minimization for Personalized Learning
Input: Number of iterations $T$, local datasets $(X_i, Y_i)$ for $i \in [m]$.
1: Initialize $\theta_i^0 = (X_i^T X_i)^{-1} X_i^T Y_i$ for $i \in [m]$ (if $X_i^T X_i$ is not full-rank, take the pseudo-inverse).
2: for $t = 1$ to $T$ do
3:   On Clients:
4:   for $i = 1$ to $m$ do
5:     Receive $P^{(t)}, \mu_1, \ldots, \mu_k$ from the server
6:     Update the personalized parameters and the coefficients: $\theta_i^t \leftarrow \sum_{l=1}^k \alpha_l^{(i)}\mu_l^{(t)}$

We view $p_{\theta_i}(y|x)$ as a randomized mapping from input space $\mathcal{X}$ to output class $\mathcal{Y}$, parameterized by $\theta_i$. For simplicity, consider the case where $|\mathcal{X}|$ is finite, e.g., for MNIST it could be all possible 28 × 28 black and white images. Every $p_{\theta_i}(y|x)$ corresponds to a probability matrix (parameterized by $\theta_i$) of size $|\mathcal{Y}| \times |\mathcal{X}|$, where the $(y, x)$'th entry represents the probability of the class $y$ (row) given the data sample $x$ (column). Therefore, each column is a probability vector. Since we want to sample the probability matrix, it suffices to restrict our attention to any set of $|\mathcal{Y}| - 1$ rows, as the remaining row can be determined by these $|\mathcal{Y}| - 1$ rows.

Similarly, for a global parameter $\mu$, let $p_\mu(y|x)$ define a randomized mapping from $\mathcal{X}$ to $\mathcal{Y}$, parameterized by the global parameter $\mu$. Note that for a fixed global parameter $\mu$, the randomized map $p_\mu(y|x)$ is fixed, whereas our goal is to sample $p_{\theta_i}(y|x)$ for $i = 1, \ldots, m$, one for each client. For simplicity of notation, define $p_{\theta_i} := p_{\theta_i}(y|x)$ and $p_\mu := p_\mu(y|x)$ to be the corresponding probability matrices, and let the distribution for sampling $p_{\theta_i}(y|x)$ be denoted by $p_{p_\mu}(p_{\theta_i})$. Note that different mappings $p_{\theta_i}(y|x)$ correspond to different $\theta_i$'s, so we define $p(\theta_i)$ (in Equation (9)) as $p_{p_\mu}(p_{\theta_i})$, which is the density of sampling the probability matrix $p_{\theta_i}(y|x)$. For the KD population distribution, we define this density as:

$$p_{p_\mu}(p_{\theta_i}) = c(\psi)\,e^{-\psi D_{KL}\left(p_\mu(y|x)\,\|\,p_{\theta_i}(y|x)\right)}, \tag{57}$$

where $\psi$ is an 'inverse variance' type of parameter, $c(\psi)$ is a normalizing function that depends on $(\psi, p_\mu)$, and

$$D_{KL}\left(p_\mu(y|x)\,\|\,p_{\theta_i}(y|x)\right) = \sum_{x \in \mathcal{X}} p(x)\sum_{y \in \mathcal{Y}} p_\mu(y|x)\log\frac{p_\mu(y|x)}{p_{\theta_i}(y|x)}$$

is the conditional KL divergence, where $p(x)$ denotes the probability of sampling a data sample $x \in \mathcal{X}$. Now all we need is to find $c(\psi)$ given a fixed $\mu$ (and therefore fixed $p_\mu(y|x)$). Here we consider $D_{KL}(p_\mu\,\|\,p_{\theta_i})$, but our analysis can be extended to $D_{KL}(p_{\theta_i}\,\|\,p_\mu)$ or $\|p_{\theta_i} - p_\mu\|_2$ as well.

For simplicity and to make the calculations easier, we consider a binary classification task with $\mathcal{Y} = \{0, 1\}$ and define $p_\mu(x) := p_\mu(y = 1|X = x)$ and $q_i(x) := p_{\theta_i}(y = 1|X = x)$. We have:

$$D_{KL}\left(p_\mu(y|x)\,\|\,p_{\theta_i}(y|x)\right) = \sum_x p(x)\Big[p_\mu(x)\big(\log p_\mu(x) - \log q_i(x)\big) + (1 - p_\mu(x))\big(\log(1 - p_\mu(x)) - \log(1 - q_i(x))\big)\Big].$$

Hence, after some algebra we have

$$p_{p_\mu}(p_{\theta_i}) = c(\psi)\,e^{\psi\sum_x p(x)H(p_\mu(x))}\,e^{\psi\sum_x p(x)\left(p_\mu(x)\log q_i(x) + (1 - p_\mu(x))\log(1 - q_i(x))\right)}.$$

Then,

$$c(\psi)\prod_x\int_0^1 e^{\psi p(x)H(p_\mu(x))}\,e^{\psi p(x)\left(p_\mu(x)\log q_i(x) + (1 - p_\mu(x))\log(1 - q_i(x))\right)}\,dq_i(x) = 1.$$

Note that

$$\int_0^1 e^{\psi p(x)\left(p_\mu(x)\log q_i(x) + (1 - p_\mu(x))\log(1 - q_i(x))\right)}\,dq_i(x) = B\big(1 + \psi p(x)p_\mu(x),\ 1 + \psi p(x)(1 - p_\mu(x))\big).$$

Accordingly, after some algebra, we can obtain

$$c(\psi) = \frac{e^{-\psi\sum_x p(x)H(p_\mu(x))}}{\prod_x B\big(1 + \psi p(x)p_\mu(x),\ 1 + \psi p(x)(1 - p_\mu(x))\big)},$$

where $H$ is the binary Shannon entropy. Substituting this in (57), we get

$$p_{p_\mu}(p_{\theta_i}) = \frac{e^{-\psi\sum_x p(x)H(p_\mu(x))}}{\prod_x B\big(1 + \psi p(x)p_\mu(x),\ 1 + \psi p(x)(1 - p_\mu(x))\big)}\,e^{-\psi D_{KL}\left(p_\mu(y|x)\,\|\,p_{\theta_i}(y|x)\right)},$$

which is the population distribution that can result in a KD type regularizer. Note that when we take the negative logarithm of the population distribution, we obtain the KL divergence loss and an additional term that depends on $\psi$ and $p_\mu$. This is the form seen in Section 3.2, Equation (11), for the AdaPeD algorithm. For numerical purposes, we take the additional term $-\log\left(\frac{e^{-\psi\sum_x p(x)H(p_\mu(x))}}{\prod_x B\left(1 + \psi p(x)p_\mu(x),\ 1 + \psi p(x)(1 - p_\mu(x))\right)}\right)$ to be simply $\frac{1}{2}\log(2\psi)$. As mentioned in Section 3.2, this serves the purpose of regularizing $\psi$. This is in contrast to the objective considered in Ozkara et al. (2021), which only has the KL divergence loss as the regularizer, without the additional term.
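The normalizing function $c(\psi)$ for the binary case can be checked numerically. The sketch below assumes the Beta-function arguments $1 + \psi p(x) p_\mu(x)$ and $1 + \psi p(x)(1 - p_\mu(x))$, which is what the integral of $q^{\psi p(x) p_\mu(x)}(1 - q)^{\psi p(x)(1 - p_\mu(x))}$ over $[0, 1]$ evaluates to, and verifies by midpoint-rule integration that the resulting density integrates to one. The toy input distribution and the function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Binary Shannon entropy H(p) in nats."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def log_beta(a, b):
    """log B(a, b) computed through log-gamma for numerical stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_normalizer(psi, p_x, p_mu):
    """log c(psi) for binary labels; p_mu[x] must lie strictly in (0, 1)."""
    total = 0.0
    for px, pm in zip(p_x, p_mu):
        total -= psi * px * binary_entropy(pm)
        total -= log_beta(1.0 + psi * px * pm, 1.0 + psi * px * (1.0 - pm))
    return total


def unnormalized_mass(psi, px, pm, steps=200000):
    """Midpoint rule for int_0^1 exp(-psi p(x) KL(Bern(pm) || Bern(q))) dq."""
    h = 1.0 / steps
    acc = 0.0
    for k in range(steps):
        q = (k + 0.5) * h
        kl = pm * math.log(pm / q) + (1.0 - pm) * math.log((1.0 - pm) / (1.0 - q))
        acc += math.exp(-psi * px * kl)
    return acc * h


if __name__ == "__main__":
    psi, p_x, p_mu = 3.5, [0.5, 0.5], [0.3, 0.8]   # hypothetical toy values
    mass = math.exp(log_normalizer(psi, p_x, p_mu))
    for px, pm in zip(p_x, p_mu):
        mass *= unnormalized_mass(psi, px, pm)
    print("total probability mass:", mass)
```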

H.2 ADAPED WITH UNSAMPLED CLIENT ITERATIONS

When there is flexibility in computational resources for doing local iterations, unsampled clients can do local training on their personalized models to speed up convergence at no cost to privacy. This can be used in cross-silo settings, such as cross-institutional training for hospitals, where privacy is crucial and computing resources are available most of the time. We present AdaPeD with unsampled client iterations in Algorithm 7.

Algorithm 7 Adaptive Personalization via Distillation (AdaPeD) with unsampled client iterations
Parameters: local variances $\{\psi_i^0\}$, personalized models $\{\theta_i^0\}$, local copies of the global model $\{\mu_i^0\}$, synchronization gap $\tau$, learning rates $\eta_1, \eta_2, \eta_3$, number of sampled clients $K$.
1: for $t = 0$ to $T - 1$ do
2:   if $\tau$ divides $t$ then
3:     On Server do:
4:       Choose a subset $\mathcal{K}_t \subseteq [n]$ of $K$ clients
5:       Broadcast $\mu^t$ and $\psi^t$
6:     On Clients $i \in \mathcal{K}_t$ (in parallel) do:
7:       Receive $\mu^t$ and $\psi^t$; set $\mu_i^t = \mu^t$, $\psi_i^t = \psi^t$
8:   end if
9:   On Clients $i \notin \mathcal{K}_t$ (in parallel) do:
10:     Compute $g_i^t := \nabla_{\theta_i^t} f_i(\theta_i^t) + \frac{\nabla_{\theta_i^t} f_i^{KD}(\theta_i^t, \mu_i^{t_i})}{2\psi_i^{t_i}}$, where $t_i$ is the last time index where client $i$ received global parameters from the server
11:     Update: $\theta_i^{t+1} = \theta_i^t - \eta_1 g_i^t$
12:   On Clients $i \in \mathcal{K}_t$ (in parallel) do:
13:     Compute $g_i^t := \nabla_{\theta_i^t} f_i(\theta_i^t) + \frac{\nabla_{\theta_i^t} f_i^{KD}(\theta_i^t, \mu_i^t)}{2\psi_i^t}$
14:     Update: $\theta_i^{t+1} = \theta_i^t - \eta_1 g_i^t$
15:     Compute $h_i^t := \frac{\nabla_{\mu_i^t} f_i^{KD}(\theta_i^{t+1}, \mu_i^t)}{2\psi_i^t}$
16:     Update: $\mu_i^{t+1} = \mu_i^t - \eta_2 h_i^t$
17:     Compute $k_i^t := \frac{1}{2\psi_i^t} - \frac{f_i^{KD}(\theta_i^{t+1}, \mu_i^{t+1})}{2(\psi_i^t)^2}$
18:     Update: $\psi_i^{t+1} = \psi_i^t - \eta_3 k_i^t$
19:   if $\tau$ divides $t +$

Of course, when a client is not sampled for a long period of rounds, this approach can become similar to local training; hence, it might be reasonable to put an upper limit on the number of successive local iterations for each client.
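Steps 17 and 18 of Algorithm 7 amount to a gradient step in $\psi$ on $\frac{1}{2}\log(2\psi) + \frac{f^{KD}}{2\psi}$, whose stationary point is $\psi = f^{KD}$. The sketch below, with the KD loss frozen at a constant purely for illustration, shows that behavior. The lower bound `psi_min = 0.5` mirrors the manual clipping of $\psi$ described in the implementation details; the remaining names and values are assumptions.

```python
def psi_step(psi, f_kd, eta3, psi_min=0.5):
    """One update of psi: gradient step on 0.5*log(2*psi) + f_kd/(2*psi)."""
    k = 1.0 / (2.0 * psi) - f_kd / (2.0 * psi ** 2)
    return max(psi - eta3 * k, psi_min)   # clip so psi never drops below psi_min


def run_psi_updates(psi0, f_kd, eta3, steps, psi_min=0.5):
    """Repeat the psi update with the KD loss held fixed (illustration only)."""
    psi = psi0
    for _ in range(steps):
        psi = psi_step(psi, f_kd, eta3, psi_min)
    return psi


if __name__ == "__main__":
    # With a frozen KD loss of 2.0, psi drifts from 3.5 toward 2.0.
    print(run_psi_updates(3.5, f_kd=2.0, eta3=0.05, steps=5000))
    # With a very small KD loss, psi is held at the floor.
    print(run_psi_updates(3.5, f_kd=0.1, eta3=0.05, steps=5000))
```

In the actual algorithm the KD loss changes at every iteration, so $\psi$ tracks a moving target: a large disagreement between the personalized and global models raises $\psi$ and thereby down-weights the distillation term.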

I PERSONALIZED LEARNING - DP-ADAPED

Proof of Theorem 5

Theorem (Restating Theorem 5). After $T$ iterations, DP-AdaPeD satisfies $(\alpha, \epsilon(\alpha))$-RDP for $\alpha > 1$, where

$$\epsilon(\alpha) = \left(\frac{K}{m}\right)^2 6\left(\frac{T}{\tau}\right)\alpha\left(\frac{C_1^2}{K\sigma_{q_1}^2} + \frac{C_2^2}{K\sigma_{q_2}^2}\right),$$

and $\frac{K}{m}$ denotes the sampling ratio of the clients at each global iteration.

Proof. In this section, we provide the privacy analysis of DP-AdaPeD. We first analyze the RDP of a single global round $t \in [T]$ and then obtain the result from the composition of the RDP over all $T$ global rounds. Recall that privacy leakage can happen through communicating $\{\mu_i^t\}$ and $\{\psi_i^t\}$, and we privatize both of these. In the following, we do the privacy analysis of privatizing $\{\mu_i^t\}$; a similar analysis can be done for $\{\psi_i^t\}$ as well.

At each synchronization round $t \in [T]$, the server updates the global model $\mu^{t+1}$ as follows:

$$\mu^{t+1} = \frac{1}{K}\sum_{i \in \mathcal{K}_t}\mu_i^t,$$

where $\mu_i^t$ is the update of the global model at the $i$-th client that is obtained by running $\tau$ local iterations at the $i$-th client. At each of the local iterations, the client clips the gradient $h_i^t$ with threshold $C_1$ and adds a zero-mean Gaussian noise vector with variance $\sigma_{q_1}^2 I_d$. When neglecting the noise added at the local iterations, the norm-2 sensitivity of updating the global model $\mu^{t+1}$ at the synchronization round $t$ is bounded by:

$$\Delta\mu = \max_{\mathcal{K}_t, \mathcal{K}'_t}\left\|\mu^{t+1} - \mu'^{\,t+1}\right\|_2^2 \le \frac{\tau C_1^2}{K^2},$$

where $\mathcal{K}_t, \mathcal{K}'_t \subset [m]$ are neighboring sets that differ in only one client. Additionally, $\mu^{t+1} = \frac{1}{K}\sum_{i \in \mathcal{K}_t}\mu_i^t$ and $\mu'^{\,t+1} = \frac{1}{K}\sum_{i \in \mathcal{K}'_t}\mu_i^t$. Since we add i.i.d. Gaussian noise with variance $\sigma_{q_1}^2$ at each local iteration at each client, and then take the average of these vectors over $K$ clients, this is equivalent to adding a single Gaussian vector with variance $\frac{\tau\sigma_{q_1}^2}{K}$ to the aggregated vectors. Thus, from the RDP of the sub-sampled Gaussian mechanism (Mironov et al., 2019, Table 1; Bun et al., 2018), we get that the global model $\mu^{t+1}$ of a single global iteration of DP-AdaPeD is $(\alpha, \epsilon_t^{(1)}(\alpha))$-RDP, where $\epsilon_t^{(1)}(\alpha)$ is bounded by:

$$\epsilon_t^{(1)}(\alpha) = \left(\frac{K}{m}\right)^2\frac{6\alpha C_1^2}{K\sigma_{q_1}^2}.$$

Similarly, we can show that the global parameter $\psi^{t+1}$ at any synchronization round of DP-AdaPeD is $(\alpha, \epsilon_t^{(2)}(\alpha))$-RDP, where $\epsilon_t^{(2)}(\alpha)$ is bounded by:

$$\epsilon_t^{(2)}(\alpha) = \left(\frac{K}{m}\right)^2\frac{6\alpha C_2^2}{K\sigma_{q_2}^2}.$$

Using adaptive RDP composition (Mironov, 2017, Proposition 1), we get that each synchronization round of DP-AdaPeD is $(\alpha, \epsilon_t^{(1)}(\alpha) + \epsilon_t^{(2)}(\alpha))$-RDP. Thus, by running DP-AdaPeD over $T/\tau$ synchronization rounds and from the composition of the RDP, we get that DP-AdaPeD is $(\alpha, \epsilon(\alpha))$-RDP, where $\epsilon(\alpha) = \frac{T}{\tau}\left(\epsilon_t^{(1)}(\alpha) + \epsilon_t^{(2)}(\alpha)\right)$. This completes the proof of Theorem 5.
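To get a feel for the numbers, the bound of Theorem 5 can be evaluated directly and, if desired, converted to an $(\epsilon, \delta)$-DP statement with the standard RDP-to-DP conversion $\epsilon = \epsilon(\alpha) + \log(1/\delta)/(\alpha - 1)$ of Mironov (2017). The sketch below is illustrative only: the parameter values are hypothetical, and it does not check the range of orders $\alpha$ for which the sub-sampled Gaussian bound is valid.

```python
import math


def dp_adaped_rdp(alpha, K, m, T, tau, C1, C2, sigma_q1, sigma_q2):
    """RDP parameter eps(alpha) of DP-AdaPeD as stated in Theorem 5."""
    if alpha <= 1:
        raise ValueError("RDP order alpha must exceed 1")
    per_round = (K / m) ** 2 * 6.0 * alpha * (
        C1 ** 2 / (K * sigma_q1 ** 2) + C2 ** 2 / (K * sigma_q2 ** 2))
    return (T / tau) * per_round          # T / tau synchronization rounds


def rdp_to_dp(alpha, eps_rdp, delta):
    """Standard conversion: (alpha, eps)-RDP implies (eps', delta)-DP."""
    return eps_rdp + math.log(1.0 / delta) / (alpha - 1.0)


if __name__ == "__main__":
    # Hypothetical configuration: 10 of 100 clients per round, 100 iterations,
    # synchronization every 10 iterations, unit clipping and noise levels.
    eps_rdp = dp_adaped_rdp(2.0, K=10, m=100, T=100, tau=10,
                            C1=1.0, C2=1.0, sigma_q1=1.0, sigma_q2=1.0)
    print("eps(alpha=2):", eps_rdp)
    print("(eps, delta)-DP with delta=1e-5:", rdp_to_dp(2.0, eps_rdp, 1e-5))
```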

J EXPANDED RELATED WORK AND CONNECTIONS TO EXISTING METHODS

In Section 1, we mentioned that our framework has connections to several personalized FL methods. In this appendix we provide a few more details related to these connections.

Regularization: As noted earlier, using (9) with the Gaussian population prior connects to the use of the $\ell_2$ regularizer in earlier personalized learning works Dinh et al. (2020); Ozkara et al. (2021); Hanzely & Richtárik (2020); Hanzely et al. (2020); Li et al. (2021), which also iterate between local and global model estimates. This form can be explicitly seen in Appendix E, where in Algorithm 4, we see that the Gaussian prior along with iterative optimization yields the regularized form seen in these methods. In these cases, $P(\Gamma) \equiv \mathcal{N}(\mu, \sigma_\theta^2 I_d)$ for unknown parameters $\Gamma = \{\mu\}$ (one can generalize these by including $\sigma_\theta^2$ in the unknown parameters). Note that since the parameters of the population distribution are unknown, they need to be estimated during the iterative learning process. In Algorithm 4, it is seen that $\mu$ plays the role of the global model (and is truly so for the linear regression problem studied in Appendix E).

Clustered FL: If one uses a discrete mixture model for the population distribution, then the iterative algorithm suggested by our framework connects to (Zhang et al., 2021; Mansour et al., 2020; Ghosh et al., 2020; Smith et al., 2017; Marfoq et al., 2021). In particular, consider a population model with parameters in the $m$-dimensional probability simplex $\{\alpha : \alpha = [\alpha_1, \ldots, \alpha_m], \alpha_i \ge 0, \forall i, \sum_i \alpha_i = 1\}$, which describes a distribution. If there are $m$ (unknown) discrete distributions $\{D_1, \ldots, D_m\}$, one can consider these as the unknown description of the population model in addition to $\alpha$. Therefore, each local dataset is generated either as a mixture, as in (Marfoq et al., 2021), or by choosing one of the unknown discrete distributions, with $\alpha$ dictating the probability of choosing $D_i$, when hard clustering is used (e.g., (Mansour et al., 2020)). Each node $j$ chooses a mixture probability $\alpha^{(j)}$ uniformly from the $m$-dimensional probability simplex. In the former case, it uses this mixture probability to generate a local mixture distribution. In the latter, it chooses $D_i$ with probability $\alpha_i^{(j)}$. As mentioned earlier, not all parametrized distributions can be written as a mixture of a finite number of distributions, which is the assumption for discrete mixtures. Consider a unimodal Gaussian population distribution (as also studied in Appendix E). Since $P(\Gamma) \equiv \mathcal{N}(\mu, \sigma_\theta^2 I_d)$, for node $i$, we sample $\theta_i \sim P(\Gamma)$. We see that the actual data distribution for this node is $p_{\theta_i}(y|x) = \mathcal{N}(\theta_i^T x, \sigma^2)$. Clearly the set of such distributions $\{p_{\theta_i}(y|x)\}$ cannot be written as any finite mixture, as $\theta_i \in \mathbb{R}^d$ and $p_{\theta_i}(y|x)$ is a unimodal Gaussian distribution with the same parameter $\theta_i$ for all data generated in node $i$. Essentially, the generative framework of finite mixtures (as in (Marfoq et al., 2021)) could be restrictive, as it does not capture such parametric models.

Knowledge distillation: The population distribution related to a regularizer based on Kullback-Leibler divergence (knowledge distillation) has been shown in Appendix H. Therefore, this can be cast in terms of information geometry, where the probability falls off exponentially in this geometry. Hence these connect to methods such as Lin et al. (2020); Li & Wang (2019); Shen et al. (2020); Ozkara et al. (2021), but the exact regularizer used does not take into account the full parametrization, and one can therefore improve upon these methods.

FL with Multi-task Learning (MTL): In this framework, a fixed relationship between tasks is usually assumed (Smith et al., 2017). Therefore, one can model this as a Gaussian model with known parameters relating the individual models. The individual models are chosen from a joint Gaussian with a particular (known) covariance dictating the different models, which gives the quadratic regularization used in FL-MTL (Smith et al., 2017). Here, the parameters of the Gaussian model are known and fixed.

Common representations: The works in Du et al. (2021); Jain et al. (2021b) use a linear model where $y \sim \mathcal{N}(x^T\theta_i, \sigma^2)$ can be considered a local generative model for node $i$. The common representation approach assumes that $\theta_i = B w^{(i)}$ for some $k \ll d$, where $\theta_i \in \mathbb{R}^d$ and $w^{(i)} \in \mathbb{R}^k$. Therefore, one can parametrize a population by this (unknown) common basis $B$, and under a mild assumption that the weights are bounded, we can choose a uniform measure in this bounded cube to choose $w^{(i)}$ for each node $i$. The alternating optimization iteratively discovers the global common representation and the local weights, as done in Du et al. (2021); Jain et al. (2021b) (and references therein). This common linear representation approach was generalized in Du et al. (2021); Collins et al. (2021) to neural networks, where a set of parameters to obtain a common representation ("head") at each client was obtained and each local client appended it with a "tail" combining the representation to obtain the final model. This also fits into our statistical framework, where the common representation (head) parameters are chosen from a population model (like the common subspace in the linear case) and the tail parameters are independently chosen (again as in the linear case).

Empirical and Hierarchical Bayes: As mentioned in Section 1, our work is inspired by the empirical Bayes framework, introduced in (Stein, 1956; Robbins, 1956; James & Stein, 1961), which is the origin of hierarchical Bayes methods; see also (Gelman et al., 2013, pp. 132). (Stein, 1956; James & Stein, 1961) studied jointly estimating Gaussian individual parameters generated by an unknown (parametrized) Gaussian population distribution. They showed the surprising result that one can enhance the estimate of individual parameters based on the observations of a population of Gaussian random variables with independently generated parameters from an unknown (parametrized) Gaussian population distribution. Effectively, this methodology advocated estimating the unknown population distribution using the individual independent samples, and then using it as an empirical prior for individual estimates (this was shown to uniformly improve the mean-squared error averaged over the population, compared to an estimate using just the single local sample). This was studied for Bernoulli variables with heterogeneously generated individual parameters by Lord (1967), and the optimal error bounds for maximum likelihood estimates for population distributions were recently developed in (Vinayak et al., 2019). Hierarchical Bayes builds on the empirical Bayes framework and is sometimes associated with a fully Bayes method. Our choice to use the empirical Bayes framework as the foundation is also because it is more computationally feasible than a fully Bayes method. The subtle difference between the two is that empirical Bayes uses a point estimate of a (parametrized) prior, whereas the terminology hierarchical Bayes often refers to a fully Bayes method where the (non-parametric) prior is estimated by computationally intensive methods like MCMC (see the discussion in (Blei et al., 2003)). As mentioned in Section 1, a contribution of our work is to connect the well-studied statistical framework of empirical (hierarchical) Bayes to model heterogeneity in personalized federated learning.
This statistical model yields a framework for personalized FL and leads to new algorithms and bounds especially in the local data starved regime.

K ADDITIONAL DETAILS AND RESULTS FOR EXPERIMENTS

K.1 IMPLEMENTATION DETAILS

In this section we give further details on the implementation and settings of the experiments in Section 4.

CIFAR-100 Experiment Setting. We do additional experiments on CIFAR-100, a dataset consisting of 100 classes grouped into 20 superclasses. Each superclass corresponds to a category of 5 classes (e.g., the superclass flowers corresponds to orchids, poppies, roses, sunflowers, and tulips). To introduce heterogeneity, we let each client sample data from 2 superclasses (the classification task is still to classify among 100 classes). For classification on CIFAR-100, we consider a 5-layer CNN with 2 convolutional layers of 64 filters and 5 × 5 filter size, followed by 2 fully connected layers with activation sizes of 384 and 192, and finally an output layer of dimension 100. We set the number of local epochs to 2 and the batch size to 25 per client; the number of clients is 50, the client participation is $K/n = 0.2$, and the number of epochs is 200 (100 communication rounds). In this dataset the classification task is more complex given the increased number of labels.

Hyperparameters. We implemented Per-FedAvg and pFedMe based on the code from GitHub (https://github.com/CharlieDinh/pFedMe). Other implementations were not available online, so we implemented them ourselves. For each of the methods, we tuned the learning rate in the set {0.3, 0.2, 0.15, 0.125, 0.1, 0.075, 0.05} and used a decaying learning rate schedule in which the learning rate is multiplied by 0.99 at each epoch. We use a weight decay of 1e-4. For the MNIST and FEMNIST experiments, for both personalized and global models, we used a 5-layer CNN; the first two layers are convolutional layers of filter size 5 × 5 with 6 and 16 filters, respectively. Then we have 3 fully connected layers of dimension 256 × 120, 120 × 84, and 84 × 10, and lastly a softmax operation. For the CIFAR-10 experiments we use a similar CNN; the only difference is that the first fully connected layer is of dimension 400 × 120.

• AdaPeD: We fine-tuned $\psi$ between 0.5 and 5 with 0.5 increments and set it to 3.5. We set $\eta_3 = 5\text{e-}2$. We manually prevent $\psi$ from becoming smaller than 0.5 so that the local loss does not become dominated by the KD loss. We use $\eta_2 = 0.1$ and $\eta_1 = 0.1$. When taking the derivative with respect to $\psi$, we observed that multiplying the right term (which consists of the KD loss function) by some constant (5 in our experiments) sometimes gives better performance.
• Per-FedAvg (Fallah et al., 2020) and pFedMe (Dinh et al., 2020): For Per-FedAvg, we used 0.075 as the learning rate and $\alpha = 0.001$. For pFedMe we used the same learning rate schedule for the main learning rate and $K = 3$ for the number of local iterations; we used $\lambda = 0.5$, $\eta = 0.2$.
• QuPeD (Ozkara et al., 2021): We choose $\lambda_p = 0.25$, $\eta_1 = 0.1$ and $\eta_3 = 0.1$ as stated.
• Federated Mutual Learning (Shen et al., 2020): Since the authors do not discuss the hyperparameters in the paper, we used $\alpha = \beta = 0.25$; the global model has the same learning rate schedule as the personalized models.
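The stated input dimensions of the first fully connected layer (256 for MNIST/FEMNIST, 400 for CIFAR-10) are consistent with a LeNet-style layout. The short sketch below reproduces that arithmetic together with the 0.99-per-epoch learning rate decay. It assumes unpadded, stride-1 convolutions, each followed by 2 × 2 pooling; this layout is not stated in the text and is inferred only from the layer sizes.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a square convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(image_size, channels_out=16, kernel=5, pool=2):
    """Features entering the first fully connected layer after two conv+pool stages."""
    s = image_size
    for _ in range(2):                  # assumed: conv 5x5, then 2x2 pooling
        s = conv_out(s, kernel) // pool
    return channels_out * s * s


def learning_rate(lr0, epoch, decay=0.99):
    """Learning rate after `epoch` epochs of multiplicative decay."""
    return lr0 * decay ** epoch


if __name__ == "__main__":
    print("MNIST/FEMNIST (28x28):", flattened_features(28))
    print("CIFAR-10 (32x32):", flattened_features(32))
    print("learning rate after 100 epochs from 0.1:", learning_rate(0.1, 100))
```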

K.2 ADDITIONAL EXPERIMENTS

Convergence plots for AdaPeD. The experimental convergence plots (test accuracy vs. number of iterations) for AdaPeD are shown in Figure 2.

Personalized estimation: synthetic experiments in Bernoulli setting. For this setting, for $P$ we consider the three distributions that (Tian et al., 2017) considered: normal, uniform, and '3-spike', which has equal weight at 1/4, 1/2, 3/4. Additionally, we consider a Beta prior. We compute the squared error of the personalized estimators and the local estimators ($Z_i/n$) w.r.t. the true $p_i$ and report the average over all clients. We use $m = 10000$ clients and 14 local samples, similar to (Tian et al., 2017). The personalized estimator provides a decrease in MSE of 37.1 ± 3.9%, 12.0 ± 1.6%, 24.3 ± 2.8%, and 34.0 ± 3.7%, respectively, for each aforementioned population distribution compared to the corresponding local estimators. Furthermore, as theoretically noted, less spread out prior distributions (low data heterogeneity) result in a larger MSE gap between personalized and local estimators.

Linear regression. For this, we create a setting similar to (Jain et al., 2021a). We set $m = 10{,}000$, $n = 10$, and sample the clients' true models according to a Gaussian centered at some randomly chosen $\mu$ with variance $\sigma_\theta^2$. We randomly generate design matrices $X_i$ and create $Y_i$ at each client by adding zero-mean Gaussian noise with true variance $\sigma_x^2$ to $X_i\theta_i$. We set the true values $\sigma_\theta^2 = 0.01$, $\sigma_x^2 = 0.05$, and we sample each component of $\mu$ from a Gaussian with mean 0 and standard deviation 0.1 and each component of $X$ from a Gaussian with mean 0 and variance 0.05, both i.i.d. We measure the average MSE over all clients and compare personalized and local methods. When $d = 50$, personalized regression has an MSE gain of 8.0 ± 0.8% and 14.8 ± 1.2%, and when $d = 100$, of 9.2 ± 1.1% and 12.3 ± 2.0%, compared to local and FedAvg regression, respectively. Moreover, compared to personalized regression where $\mu, \sigma_\theta, \sigma_x$ are known, the alternating algorithm only results in a 1% and 4.7% increase in MSE for $d = 50$ and $d = 100$, respectively.

Estimation Experiments. We provide more results for the estimation setting discussed in Figure 1a. In Figure 3a we have a setting with 1000 clients and 5 local samples, and in Figure 3b 500 clients and 5 local samples per client. We observe that as the number of clients increases, the DP-Personalized Estimator can converge to the Personalized Estimator with a smaller privacy budget. We also observe that, compared to Figure 1a, a smaller number of local samples increases the performance discrepancy between the personalized and local estimators.
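The Bernoulli experiment with a Beta prior can be mimicked in a few lines. The sketch below is not the estimator used for the reported numbers: it fits a Beta prior by the method of moments (an assumption of this sketch) and uses the resulting posterior mean as the personalized estimate, then compares its MSE with that of the local estimator $Z_i/n$. The prior Beta(2, 2), the seed, and all function names are hypothetical.

```python
import random


def fit_beta_prior(z, n):
    """Method-of-moments fit of Beta(a, b) from per-client success counts z."""
    m = len(z)
    p_hat = [zi / n for zi in z]
    mean = sum(p_hat) / m
    var = sum((p - mean) ** 2 for p in p_hat) / (m - 1)
    # Var(Z/n) = mean(1-mean)/n + Var(p) (1 - 1/n); solve for Var(p).
    var_p = (var - mean * (1.0 - mean) / n) / (1.0 - 1.0 / n)
    if var_p <= 0.0:
        return None                      # data look homogeneous
    strength = max(mean * (1.0 - mean) / var_p - 1.0, 1e-8)   # a + b
    return mean * strength, (1.0 - mean) * strength


def personalized_estimates(z, n):
    """Posterior means under the fitted prior (pooled mean if the fit fails)."""
    prior = fit_beta_prior(z, n)
    if prior is None:
        pooled = sum(z) / (n * len(z))
        return [pooled] * len(z)
    a, b = prior
    return [(a + zi) / (a + b + n) for zi in z]


def simulate(m=10000, n=14, a=2.0, b=2.0, seed=0):
    """Returns (MSE of local estimator, MSE of personalized estimator)."""
    rng = random.Random(seed)
    p = [rng.betavariate(a, b) for _ in range(m)]
    z = [sum(rng.random() < pi for _ in range(n)) for pi in p]
    pers = personalized_estimates(z, n)
    mse_local = sum((zi / n - pi) ** 2 for zi, pi in zip(z, p)) / m
    mse_pers = sum((est - pi) ** 2 for est, pi in zip(pers, p)) / m
    return mse_local, mse_pers


if __name__ == "__main__":
    local, pers = simulate()
    print("local MSE:", local, "personalized MSE:", pers)
```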



The homogeneous case for distributed estimation is well-studied; see (Zhang, 2016) and references.
In our understanding their numerics seem to be restricted to a unimodal Gaussian population model.
Upon receiving $\{X_i\}$ from all clients, the server can compute $\{\mu_i, \sigma_i^2\}$ and sends $(\mu_i, \sigma_i^2)$ to the $i$-th client.
For simplicity we will consider this unknown population distribution $P$ to be parametrized by unknown (arbitrary) parameters $\Gamma$.
A discrete mixture model can be proposed as a special case of GM with 0 variance. With this we can recover a similar algorithm as in Marfoq et al. (2021). Further details are presented in Appendix G.
The work in this paper was partially supported by NSF grants 2139304, 2007714 and gift funding by Meta and Amazon.
One can generalize these by including $\sigma_\theta^2$ in the unknown parameters.
This was shown to uniformly improve the mean-squared error averaged over the population, compared to an estimate using just the single local sample.
https://github.com/CharlieDinh/pFedMe
For federated experiments we have used PyTorch's Data Distributed Parallel package.
We use https://github.com/tao-shen/FEMNIST_pytorch to import FEMNIST dataset.



and the global estimator μ = 1 m m i=1 X i . Thus, the MSE in (a) consists of four terms: 1) The variance of the local estimator ( The correlation between the local estimator and the global estimator ( σ 2 x nm ). 4) The bias term E θ E X1,...,Xm E θi |θ -θ i 2 |θ . This completes the proof of Theorem 1. B.2 PROOF OF THEOREM 2, EQUATION (5)

increases if the prior p l increases and/or the local samples {X ij } are close to the model µ l for l ∈ [k]. Observe that the optimal estimator θi in Theorem 6 that minimizes the MSE is completely different from the local estimator 1 n n j=1 X ij . Furthermore, it is easy to see that the local estimator has the MSE dσ 2

2 regularizer in earlier personalized learning works Dinh et al. (2020); Ozkara et al. (2021); Hanzely & Richtárik (2020); Hanzely et al. (2020); Li et al. (

Knowledge distillation: The population distribution related to a regularizer based on Kullback-Leibler divergence (knowledge distillation) has been shown in Appendix H. Therefore this can be cast in terms of information geometry where the probability falls of exponentially with in this geometry. Hence these connect to methods such as Lin et al. (2020); Li & Wang (2019); Shen et al. (2020); Ozkara et al. (

Common representations: The works in Du et al. (2021); Jain et al. (2021b) use a linear model where y ∼ N (x θ i , σ 2 ) can be considered a local generative model for node i. The common representation approach assumes that θ i = some k d, where θ i ∈ R d . Therefore, one can parametrize a population by this (unknown) common basis B, and under a mild assumption that the weights are bounded, we can choose a uniform measure in this bounded cube to choose w (i) for each node i. The alternating optimization iteratively discovers the global common representation and the local weights as done in Du et al. (2021); Jain et al. (2021b) (and references therein). This common linear representation approach was generalized in Du et al. (2021); Collins et al. (

AdaPeD Test Accuracy (in %) vs iteration on MNIST with 0.1 sampling ratio.

AdaPeD Test Accuracy (in %) vs iteration on FEMNIST with 0.33 sampling ratio.

Figure 2: Convergence plots (test accuracy vs no. of iteration) for AdaPeD.

Figure 3: MSE vs. $\epsilon_0$ for personalized estimation with different numbers of clients; this is the same setting as Figure 1a except for the number of clients and local samples.

∈ [m] is provided with a dataset consisting of n data points {(X i1 , Y i1 ), . . . , (X in , Y in )}, where Y ij 's are generated from (X ij , θ i ) using some distribution p θi (Y ij |X ij ). Let Y i := (Y i1 , . . . , Y in ) and X i := (X i1 , . . . , X in ) for i ∈ [m]. The underlying statistical model for our setting is given by

Test accuracy (in %) for CNN model. The CIFAR-10, MNIST, and FEMNIST columns have client sampling ratios $K/n$ of 0.2, 0.1, and 0.15, respectively.

FedAvg: 11.73 ± 0.85, 29.91 ± 1.28, 55.79 ± 0.29
DP-AdaPeD (Ours): 93.32 ± 1.18, 98.51 ± 0.90, 99.01 ± 0.65


Mean squared error of our AdaMix algorithm and the local training for linear regression.

A.3 USER-LEVEL DIFFERENTIAL PRIVACY (LEVY ET AL., 2021)

Consider a set of $m$ users, each having a local dataset of $n$ samples. Let $D_i = \{x_{i1}, \ldots, x_{in}\}$ denote the local dataset at the $i$-th user for $i \in [m]$, where $x_{ij} \in \mathcal{X}$ and $\mathcal{X} \subset \mathbb{R}^d$. We define $D = (D_1, \ldots, D_m) \in (\mathcal{X}^n)^m$ as the entire dataset. We have already defined DP, LDP, and RDP in Section A.1 w.r.t. item-level privacy. Here, we extend those definitions to user-level privacy. In order to do that, we need a generic neighborhood relation between datasets: we say that two datasets $D, D'$ are neighboring with respect to a distance metric $\mathrm{dis}$ if we have $\mathrm{dis}(D, D') \le 1$. By choosing $\mathrm{dis}(D, D') = \sum_{i=1}^m\sum_{j=1}^n \mathbb{1}\{x_{ij} \neq x'_{ij}\}$, we recover the standard definition of DP/RDP from Definitions 2 and 3, which we call item-level DP/RDP. In item-level DP/RDP, two datasets $D, D'$ are neighboring if they differ in a single item. On the other hand, by choosing $\mathrm{dis}(D, D') = \sum_{i=1}^m \mathbb{1}\{D_i \neq D'_i\}$, we obtain what we call user-level DP/RDP, where two datasets $D, D' \in (\mathcal{X}^n)^m$ are neighboring when they differ in the local dataset of any single user. Observe that when each user has a single item ($n = 1$), item-level and user-level privacy are equivalent.
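The two neighborhood relations can be made concrete with a small sketch. It assumes the item-level distance counts differing items and the user-level distance counts users whose local datasets differ; the function names and the toy datasets are hypothetical.

```python
def item_level_distance(D, D_prime):
    """Number of individual items that differ between the two datasets."""
    return sum(x != xp
               for Di, Dpi in zip(D, D_prime)
               for x, xp in zip(Di, Dpi))


def user_level_distance(D, D_prime):
    """Number of users whose entire local dataset differs."""
    return sum(Di != Dpi for Di, Dpi in zip(D, D_prime))


def are_neighbors(D, D_prime, distance):
    """Two datasets are neighboring if their distance is at most 1."""
    return distance(D, D_prime) <= 1


if __name__ == "__main__":
    D = [[1, 2, 3], [4, 5, 6]]
    D_prime = [[1, 2, 3], [7, 8, 6]]    # the second user changed two items
    print(are_neighbors(D, D_prime, item_level_distance))   # item-level view
    print(are_neighbors(D, D_prime, user_level_distance))   # user-level view
```

Changing several items of one user leaves the datasets neighboring at the user level but not at the item level, which is why a user-level guarantee is the stronger requirement.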

Alternating Minimization for Personalized Learning Input: Number of iterations T , local datasets

θi (y|x) as a randomized mapping from input space $\mathcal{X}$ to output class $\mathcal{Y}$, parameterized by $\theta_i$. For simplicity, consider the case where $|\mathcal{X}|$ is finite; e.g., for MNIST it could be all possible 28 × 28 black-and-white images. Every $p_{\theta_i}(y|x)$ corresponds to a probability matrix (parameterized by $\theta_i$) of size $|\mathcal{Y}| \times |\mathcal{X}|$, where the $(y, x)$'th entry represents the probability of the class $y$ (


Additional Learning Experiments with Different Number of Clients. We run additional experiments with different numbers of clients. On FEMNIST, we use the same model and the same number of data samples per client as in Section 4. The number of clients is 30, the total number of epochs is 30, and we fix the number of local iterations to 40 per epoch; we use full client sampling to simulate a cross-silo environment. As seen in Table 4, AdaPeD continues to outperform the competing methods, following the trend in Section 4.

98.10 ± 0.09
pFedMe (Dinh et al., 2020): 96.03 ± 0.50
Per-FedAvg (Fallah et al., 2020): 96.71 ± 0.14
QuPeD (FP) (Ozkara et al., 2021): 97.72 ± 0.16
Federated ML (Shen et al., 2020): 96.80 ± 0.13

On CIFAR-10, we use the same model as in Section 4 and divide the dataset among 30 clients, where each client has access to data samples from 4 classes. The total number of epochs is 250 and we fix the number of local iterations to 40 per epoch; we set $K/n = 0.2$ and the number of local epochs to 2. AdaPeD outperforms the competing methods, in line with the experiments in Section 4, as can be seen in Table 5.

71.97 ± 0.09
Per-FedAvg (Fallah et al., 2020): 64.09 ± 0.46
QuPeD (FP) (Ozkara et al., 2021): 73.21 ± 0.44
Federated ML (Shen et al., 2020): 72.53 ± 0.36

Additional Experiment Implementation Details. We use the same strategy as in Appendix K.1 to tune the main learning rates. We use a weight decay of 1e-4.

• AdaPeD: We tuned $\psi$ between 0.5 and 5 in increments of 0.5, and set it to 4 for CIFAR-10/100 and to 3 for FEMNIST. We manually prevent $\psi$ from becoming smaller than 1 so that the local loss does not become dominated by the KD loss. We use $\eta_2 = 0.075$ and $\eta_1 = 0.075$ for CIFAR-10 and CIFAR-100, and $\eta_2 = 0.1$ and $\eta_1 = 0.1$ for FEMNIST.
• Per-FedAvg (Fallah et al., 2020) and pFedMe (Dinh et al., 2020): For Per-FedAvg, we used 0.1 as the learning rate and $\alpha = 0.0001$. For pFedMe, we used the same learning rate schedule for the main learning rate and $L = 3$ for the number of local approximation iterations, with $\lambda = 0.1$ and $\eta = 0.1$.
• QuPeD (Ozkara et al., 2021): We set $\lambda_p = 0.25$, $\eta_1 = 0.1$ for the local learning rate, and $\eta_2 = 0.1$ for the global learning rate.
• Federated Mutual Learning (Shen et al., 2020): Since the authors do not discuss the hyperparameters in the paper, we used $\alpha = \beta = 0.25$.
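The $\psi$ tuning described for AdaPeD can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `psi_grid` and `clamp_psi` are hypothetical, and we assume the lower bound of 1 is enforced by clamping $\psi$ after each of its updates; how $\psi$ is updated and how it enters the loss are not shown here.

```python
def psi_grid(low: float = 0.5, high: float = 5.0, step: float = 0.5) -> list:
    """Candidate initial values for psi: 0.5, 1.0, ..., 5.0."""
    n_points = int(round((high - low) / step)) + 1
    # Rounding avoids floating-point drift in the generated grid values.
    return [round(low + k * step, 10) for k in range(n_points)]


def clamp_psi(psi: float, floor: float = 1.0) -> float:
    """Keep psi from dropping below the floor (assumed to be applied after each update)."""
    return max(psi, floor)


candidates = psi_grid()

# Values reported in the text for the chosen psi per dataset.
chosen_psi = {"CIFAR-10": 4.0, "CIFAR-100": 4.0, "FEMNIST": 3.0}

# A hypothetical update that pushes psi below 1 is clamped back to the floor.
psi_after_update = clamp_psi(0.7)
```

The clamp acts as a simple projection onto $[1, \infty)$, which is one straightforward way to realize the manual lower bound described above.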

