SPARSE RANDOM NETWORKS FOR COMMUNICATION-EFFICIENT FEDERATED LEARNING

Abstract

One main challenge in federated learning is the large communication cost of exchanging weight updates from clients to the server at each round. While prior work has made great progress in compressing the weight updates through gradient compression methods, we propose a radically different approach that does not update the weights at all. Instead, our method freezes the weights at their initial random values and learns how to sparsify the random network for the best performance. To this end, the clients collaborate in training a stochastic binary mask to find the optimal sparse random network within the original one. At the end of the training, the final model is a sparse network with random weights, or equivalently a subnetwork inside the dense random network. We show improvements in accuracy, communication (less than 1 bit per parameter (bpp)), convergence speed, and final model size (less than 1 bpp) over relevant baselines on MNIST, EMNIST, CIFAR-10, and CIFAR-100 datasets, in the low bitrate regime.

* First two authors contributed equally. Work done while F.P. was visiting Imperial College London.

(2) Our framework provides efficient communication from clients to the server by requiring (less than) 1 bpp per client while yielding faster convergence and higher accuracy than the baselines.
(3) We propose a Bayesian aggregation strategy at the server side to better deal with partial client participation and non-IID data splits.
(4) The final model (a sparse network with random weights) can be efficiently represented with a random seed and a binary mask, which requires (less than) 1 bpp. This is at least 32× more efficient storage and communication of the final model with respect to standard FL strategies.
(5) We demonstrate the efficacy of our strategy on MNIST, EMNIST, CIFAR-10, and CIFAR-100 datasets under both IID and non-IID data splits, and show improvements in accuracy, bitrate, convergence speed, and final model size over relevant baselines, under various system configurations.
In this section, we briefly discuss the related work in (1) communication-efficient FL, (2) pruning for FL, and (3) finding subnetworks in a random network.

1. INTRODUCTION

Federated learning (FL) is a distributed learning framework where clients collaboratively train a model by performing local training on their data and by sharing their local updates with a server every few iterations. The server in turn aggregates the local updates to create a global model, which is then transmitted to the clients for the next round of training. While FL is an appealing approach for enabling model training without the need to collect client data at the server, uplink communication of local updates is a significant bottleneck (Kairouz et al., 2021). This has motivated research in communication-efficient FL strategies (McMahan et al., 2017a) and various gradient compression schemes via sparsification (Lin et al., 2018; Wang et al., 2018; Barnes et al., 2020; Ozfatura et al., 2021; Isik et al., 2022), quantization (Alistarh et al., 2017; Wen et al., 2017; Bernstein et al., 2018; Mitchell et al., 2022), and low-rank approximation (Konečnỳ et al., 2016; Vargaftik et al., 2021; 2022; Basat et al., 2022). In this work, while aiming for communication efficiency in FL, we take a radically different approach from prior work, and propose a strategy that does not require communication of weight updates. To be more precise, instead of training the weights, (1) the server initializes a dense random network with $d$ weights, denoted by the weight vector $\mathbf{w}^{\text{init}} = (w^{\text{init}}_1, w^{\text{init}}_2, \dots, w^{\text{init}}_d)$, using a random seed SEED, and broadcasts SEED to the clients, enabling them to reproduce the same $\mathbf{w}^{\text{init}}$ locally, (2) both the server and the clients keep the weights frozen at their initial values $\mathbf{w}^{\text{init}}$ at all times, (3) clients collaboratively train a probability mask of $d$ parameters $\boldsymbol{\theta} = (\theta_1, \theta_2, \dots, \theta_d) \in [0, 1]^d$, (4) the server samples a binary mask from the trained probability mask and generates a sparse network with random weights, i.e., a subnetwork inside the initial dense random network, as $\mathbf{w}^{\text{final}} = \text{Bern}(\boldsymbol{\theta}) \odot \mathbf{w}^{\text{init}}$, where $\text{Bern}(\cdot)$ is the Bernoulli sampling operation and $\odot$ is element-wise multiplication. We call the proposed framework Federated Probabilistic Mask Training (FedPM) and summarize it in Figure 1. At first glance, it may seem surprising that there exist subnetworks inside randomly initialized networks that could perform well without ever modifying the weight values. This phenomenon has been explored to some extent in prior work (Zhou et al., 2019; Ramanujan et al., 2020; Pensia et al., 2020; Diffenderfer & Kailkhura, 2020; Aladago & Torresani, 2021) with different strategies for finding the subnetworks. However, how to find these subnetworks in an FL setting has not attracted much attention so far. Some exceptions to this are works by Li et al. (2021); Vallapuram et al. (2022); Mozaffari et al. (2021), which provide improvements in other FL challenges, such as personalization and poisoning attacks, while not being competitive with existing (dense) compression methods such as QSGD (Alistarh et al., 2017), DRIVE (Vargaftik et al., 2021), and SignSGD (Bernstein et al., 2018) in terms of accuracy under the same communication budget. In this work, we propose a stochastic way of finding such subnetworks while reaching higher accuracy at a reduced communication cost of less than 1 bit per parameter (bpp).

Figure 1: Extracting a randomly weighted sparse network using the trainable probability mask $\boldsymbol{\theta}^t$ in the forward-pass of round $t$ (for clients and the server). In practice, clients collaboratively train continuous scores $\mathbf{s} \in \mathbb{R}^d$, and then at inference time, the clients (or the server) find $\boldsymbol{\theta}^t = \text{Sigmoid}(\mathbf{s}^t) \in [0, 1]^d$. We skip this step in the figure for the sake of simplicity.

In addition to the accuracy and communication gains, our framework also provides an efficient representation of the final model post-training by requiring less than 1 bpp to represent (i) the random seed that generates the initial weights w init , and (ii) a sampled binary vector Bern(θ) (computed with the trained θ). Therefore, the final model enjoys a memory-efficient deployment -a crucial feature for machine learning at power-constrained edge devices. Another advantage our framework brings is the privacy amplification under some settings, thanks to the stochastic nature of our training strategy. Our contributions can be summarized as follows: (1) We propose a FL framework, in which the clients do not train the model weights, but instead a stochastic binary mask to be used in sparsifying the dense network with random weights. This differs from the standard training approaches in the literature. Communication-Efficient FL. One way of improving communication efficiency in FL is to compress the model updates using gradient compression methods like sparsification (Aji & Heafield, 2017; Lin et al., 2018; Wang et al., 2018; Barnes et al., 2020; Ozfatura et al., 2021; Isik et al., 2022) , quantization (Alistarh et al., 2017; Wen et al., 2017; Suresh et al., 2017; Bernstein et al., 2018; Mitchell et al., 2022) , and low-rank approximation (Wang et al., 2018; Vogels et al., 2019; Vargaftik et al., 2021; 2022; Mohtashami et al., 2022; Basat et al., 2022) ; or more FL-oriented compression schemes such as (Konečnỳ et al., 2016; McMahan et al., 2017a; Sattler et al., 2019; Rothchild et al., 2020; Reisizadeh et al., 2020; Haddadpour et al., 2020; 2021) , while training a dense network. 
Our framework differs from these dense compression methods substantially due to the unconventional stochastic mask training strategy; however, we take SignSGD (Bernstein et al., 2018), TernGrad (Wen et al., 2017), QSGD (Alistarh et al., 2017), DRIVE (Vargaftik et al., 2021), and EDEN (Vargaftik et al., 2022) as our baselines since they work in the same bitrate regime as ours (≈1 bpp).

Pruning for FL. Since the introduction of the Lottery Ticket Hypothesis (LTH) (Frankle & Carbin, 2018), there has been growing interest in finding sparse and trainable networks at initialization. The main hypothesis in this line of work is that there exist sparse networks (lottery tickets) inside randomly initialized dense networks such that those sparse networks can be trained to a surprisingly good performance. In the original paper, the strategy for finding these lottery tickets is to iteratively train the dense network, which makes finding the lottery tickets expensive. We distinguish our approach from the FL papers that utilize the LTH (Li et al., 2020; Ji et al., 2020; Seo et al., 2021) and pruning (Lin et al., 2020; Munir et al., 2021; Yu et al., 2021; Liu et al., 2021; Jiang et al., 2022; Babakniya et al., 2022; Dai et al., 2022; Bibikar et al., 2022) for three main reasons: (i) These methods require training the weight values, and thus cannot provide an efficient representation of the final model as FedPM does. Recall that we achieve at least 32× more efficient storage and communication of the final model by representing it with just a random seed and a binary mask. (ii) Some of these works require finding the lottery tickets prior to FL training (Li et al., 2020). While this could improve the communication cost during the FL training since they communicate sparse networks, it increases the computation cost significantly due to the burden of finding lottery tickets via training. (iii) As opposed to the LTH- or pruning-based FL works, our framework learns with what probability a particular weight should stay in the final model, i.e., the final sparsity level is also a learned parameter optimized for the best performance. Overall, since FedPM is not in the same bitrate regime as these works (they require higher bitrates to communicate continuous weight values), we do not compare against them.

Finding Subnetworks Inside a Random Network. Our work is closest to recent works of Zhou et al. (2019); Ramanujan et al. (2020); Pensia et al. (2020); Aladago & Torresani (2021), which find subnetworks (or supermasks) inside a dense network with random weights that perform surprisingly well without ever training the weights, but in a centralized scenario. In this work, we take advantage of the existence of such subnetworks to reduce the communication budget in FL to less than 1 bpp with faster convergence and higher accuracy than our relevant baselines in the same bitrate regime, while further compressing the final model, all simultaneously. Prior works (Li et al., 2021; Vallapuram et al., 2022; Mozaffari et al., 2021) also consider finding subnetworks inside a dense random network in an FL setting, but they differ from our approach on several levels. For instance, they focus on different challenges in FL, such as personalization and poisoning attacks, which limits their ability to improve over existing compression methods in the accuracy-communication bitrate tradeoff. One fundamental reason for this is their deterministic mask training strategy, which involves hard thresholding or sign operations.
On the other hand, the stochasticity in FedPM allows us to (i) enjoy a better accuracy-communication cost tradeoff, (ii) have an unbiased estimate of the true aggregate of the local masks with a provable upper bound on the error, (iii) design an improved aggregation strategy with a Bayesian approach so that the previous masks at the server are not hard replaced -a useful strategy specifically in unbalanced non-IID splits, and (iv) gain privacy benefits via amplification in the Bernoulli sampling step. To demonstrate these benefits over deterministic schemes, we compare our method against FedMask (Li et al., 2021) by adapting it slightly to mainly focus on communication efficiency, rather than personalization, and to improve its accuracy-communication efficiency performance. More specifically, we discard the initial pruning stage that was deployed for personalization. This change was necessary because (a) this paper does not study personalization, so this pruning step would put FedMask at a disadvantage in our experimental setup, and (b) the initial pruning step requires extra training at client devices, which is computationally more expensive than FedPM and the dense baselines.

3. FEDERATED PROBABILISTIC MASK TRAINING (FedPM)

We first describe the simpler version of the FedPM framework in Section 3.1, which provides an unbiased estimation of the mean of the learned probability masks at the server with bounded error. Next, we propose a modification in our aggregation strategy by exploiting the underlying Bernoulli mechanism in Section 3.2. This helps boost the performance of FedPM in the case of partial client participation. We then discuss the details of the distribution of the initial weights in Section 3.3, and finally describe the privacy benefits of FedPM in Section 3.4. We use capital letters for random variables, small letters for their realizations and for deterministic quantities, and bold letters for vectors. Moreover, we indicate with $\mathbf{x}^{u,t}$ the state of the local vector $\mathbf{x}$ (e.g., the local mask) at client $u$ during round $t$, and with $x^{u,t}_i$ its $i$-th component. Global values are denoted with $\mathbf{x}^{g,t}$ and $x^{g,t}_i$, and sets are indicated with calligraphic fonts. We denote a neural network with weight vector $\mathbf{p}$ as $f_{\mathbf{p}}$.

3.1. FedPM

In this section, we present the general FedPM training pipeline. First, the server randomly initializes a neural network f w init , parameterized by the weight vector w init = (w init 1 , w init 2 , . . . , w init d ) ∈ R d , whose components are sampled IID according to a distribution P w using a randomly generated seed SEED. The random SEED value is then communicated to all the clients, which can locally sample the same pseudo-random vector w init , which is kept fixed and never modified during training. The goal for the clients is to collaboratively train a probability mask θ ∈ [0, 1] d , which indicates the Bernoulli parameters for the global stochastic binary mask M ∼ Bern(θ) ∈ {0, 1} d , such that the function f Ẇ maximizes its performance on a given task, where Ẇ = M ⊙ w init . Specifically, FedPM learns the probabilities for the weights of being active, which are given by the probability mask θ = (θ 1 , θ 2 , . . . , θ d ) ∈ [0, 1] d . To achieve this, at every round t, the server samples a set K t of |K t | = K participants (out of the total N clients), which individually train their local probability masks θ k,t , k ∈ K t , by using their local datasets D k , each composed of D k = |D k | samples. These local masks are then aggregated by the server in a communication-efficient way to estimate the optimal θ. At test time, at the server, the initial random network f w init is sparsified using the global probability mask θ g,t , following the stochastic approach in Figure 1 . In the following sections, we provide more details on each step of each round. We give the pseudocode for FedPM in Appendix A.
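The seed-based setup described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`init_weights`, `sample_mask`, `apply_mask`) are hypothetical, Python's built-in generator stands in for whatever pseudo-random generator a real system would share, and the weights are drawn from {-σ, +σ} as described in Section 3.3 with σ = 1 chosen only for illustration.

```python
import random

def init_weights(seed, d, sigma=1.0):
    """Reproduce the frozen weight vector from a shared seed.

    Assumption: weights are drawn uniformly from {-sigma, +sigma}.
    """
    rng = random.Random(seed)
    return [sigma if rng.random() < 0.5 else -sigma for _ in range(d)]

def sample_mask(theta, rng):
    """Draw a binary mask m ~ Bern(theta), element-wise."""
    return [1 if rng.random() < p else 0 for p in theta]

def apply_mask(mask, w_init):
    """Element-wise product m * w_init; masked-out weights become 0."""
    return [m * w for m, w in zip(mask, w_init)]

SEED, d = 1234, 8
w_server = init_weights(SEED, d)
w_client = init_weights(SEED, d)  # same seed, so only SEED has to be broadcast

# Hypothetical probability mask; in FedPM this would be learned.
theta = [0.9, 0.1, 0.5, 0.99, 0.01, 0.7, 0.3, 0.5]
mask = sample_mask(theta, random.Random(0))
w_sub = apply_mask(mask, w_server)  # weights of the sampled subnetwork
```

Because both sides regenerate the weights from SEED, the only trained object that ever needs to be exchanged is the mask.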

3.1.1. LOCAL TRAINING OF PROBABILITY MASKS

Upon receiving a global probability mask $\boldsymbol{\theta}^{g,t-1}$ from the server at the beginning of round $t$, client $k$ performs local training and updates the mask via back-propagation. First, however, we have to guarantee that the updated probability mask satisfies $\boldsymbol{\theta}^{k,t} \in [0, 1]^d$. While this can be encouraged with a regularization term in the loss, doing so may still require clipping $\boldsymbol{\theta}^{k,t}$ to $[0, 1]^d$ before taking a Bernoulli sample, especially in the early training stages. Clipping would then make the estimate at the server biased and hence lead to slower convergence and lower accuracy. Therefore, similarly to the work of Zhou et al. (2019), we introduce another mask, called the score mask $\mathbf{s} = (s_1, s_2, \dots, s_d) \in \mathbb{R}^d$, that has unbounded support and can be used to generate the probability mask through the one-to-one sigmoid function by setting $\boldsymbol{\theta} = \text{Sigmoid}(\mathbf{s})$. Then, the procedure for local training of the probability mask at round $t$ is as follows (Steps 2 to 4 describe one local iteration, which is repeated $\tau$ times, as is standard in FL (McMahan et al., 2017a)):

(1) The server sends the global probability mask $\boldsymbol{\theta}^{g,t-1}$ to the $K$ chosen clients, and the clients set $\mathbf{s}^{k,t} = \text{Sigmoid}^{-1}(\boldsymbol{\theta}^{g,t-1})$, where $\text{Sigmoid}^{-1}(\cdot)$ is the inverse of the sigmoid function.
(2) Then, the clients generate a binary mask by first transforming back $\boldsymbol{\theta}^{k,t} = \text{Sigmoid}(\mathbf{s}^{k,t})$, and then sampling a binary mask from $\boldsymbol{\theta}^{k,t}$ as shown in Figure 1: $\mathbf{m}^{k,t} \sim \text{Bern}(\boldsymbol{\theta}^{k,t})$.
(3) The sampled binary mask then sparsifies the initial weight vector $\mathbf{w}^{\text{init}}$: $\dot{\mathbf{w}}^{k,t} = \mathbf{m}^{k,t} \odot \mathbf{w}^{\text{init}}$.
(4) $\dot{\mathbf{w}}^{k,t}$ is then used for the forward pass, and the loss $\mathcal{L}(f_{\dot{\mathbf{w}}^{k,t}}, \mathcal{D}_k)$ on the local task is backpropagated to update the score mask as $\mathbf{s}^{k,t} \leftarrow \mathbf{s}^{k,t} - \eta \nabla \mathcal{L}(f_{\dot{\mathbf{w}}^{k,t}}, \mathcal{D}_k)$, where $\eta$ is the local learning rate.

All the local operations from Step 2 to Step 4 are differentiable, except for the Bernoulli sampling. We backpropagate the gradients through the Bernoulli sampling operation with a straight-through estimator (Bengio et al., 2013), using the first-order gradient of the Bernoulli function, which is simply equal to the probability mask $\boldsymbol{\theta}^{k,t}$.
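The four local steps can be sketched on a toy problem. The code below is an illustration under stated assumptions, not the paper's implementation: the model is a linear map with a mean squared error loss, the function name `local_training` and all hyperparameters are hypothetical, and the straight-through estimator is taken in its simplest form, where the sampling step is treated as the identity in the backward pass so that the gradient with respect to a score is the gradient with respect to the mask entry times the sigmoid derivative.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p, eps=1e-6):
    p = min(max(p, eps), 1.0 - eps)  # keep the inverse sigmoid finite
    return math.log(p / (1.0 - p))

def local_training(theta_global, w_init, data, lr=0.5, steps=300, seed=0):
    """Local score update for a toy linear model y = <m * w_init, x>."""
    rng = random.Random(seed)
    s = [logit(p) for p in theta_global]                # Step 1: scores from probs
    d = len(s)
    for _ in range(steps):
        theta = [sigmoid(v) for v in s]                 # Step 2: probs from scores
        m = [1.0 if rng.random() < p else 0.0 for p in theta]  # m ~ Bern(theta)
        w_dot = [mi * wi for mi, wi in zip(m, w_init)]  # Step 3: masked weights
        grad_w = [0.0] * d                              # Step 4: dLoss/dw_dot (MSE)
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w_dot, x)) - y
            for i in range(d):
                grad_w[i] += 2.0 * err * x[i] / len(data)
        for i in range(d):
            # Straight-through (assumed identity through the sampling step):
            # dLoss/ds_i ~ dLoss/dm_i * sigmoid'(s_i), with dLoss/dm_i = grad_w_i * w_i.
            s[i] -= lr * grad_w[i] * w_init[i] * theta[i] * (1.0 - theta[i])
    return [sigmoid(v) for v in s]

# Toy task: the target is reachable by keeping weights 0 and 2 and dropping 1 and 3.
data_rng = random.Random(1)
w_init = [1.0, -1.0, 1.0, -1.0]
target = [1.0, 0.0, 1.0, 0.0]
data = []
for _ in range(200):
    x = [data_rng.uniform(-1.0, 1.0) for _ in w_init]
    data.append((x, sum(t * xi for t, xi in zip(target, x))))

theta_new = local_training([0.5] * 4, w_init, data)
```

In this toy setting, the probabilities of the two useful weights move towards 1 and those of the other two move towards 0, even though the weight values themselves are never changed.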

Figure 2: Communication-efficient estimation of the mean of the probability masks $\bar{\boldsymbol{\theta}}^{g,t}$ (figure labels: server estimate, true mean). Each client communicates a stochastic binary mask $\mathbf{m}^{k,t}$ sampled from the local probability mask $\boldsymbol{\theta}^{k,t}$. We reduce the bitrate to less than 1 bit per parameter by using arithmetic coding to encode $\mathbf{m}^{k,t}$. When the frequency of 1's is far from 0.5 (which is usually the case with FedPM), the number of bits per parameter to communicate $\mathbf{m}^{k,t}$ is less than 1. See Figure 3 for more details.

3.1.2. COMMUNICATION STRATEGY

Once the local training at round $t$ is completed, the server needs to distill the global probability mask $\boldsymbol{\theta}^{g,t}$ by taking the empirical average of the local probability masks collected from the clients, $\bar{\boldsymbol{\theta}}^{g,t} = \frac{1}{K} \sum_{k \in \mathcal{K}_t} \boldsymbol{\theta}^{k,t}$. However, since we aim for communication efficiency, the clients do not send their local probability masks directly. Instead, they communicate a stochastic binary sample drawn from their probability masks, $\mathbf{m}^{k,t} \sim \text{Bern}(\boldsymbol{\theta}^{k,t})$, and then the server estimates the global aggregate $\bar{\boldsymbol{\theta}}^{g,t}$ as $\hat{\boldsymbol{\theta}}^{g,t} = \frac{1}{K} \sum_{k \in \mathcal{K}_t} \mathbf{m}^{k,t}$. This distributed mean estimation problem with communication constraints is summarized in Figure 2. Our estimator $\hat{\boldsymbol{\theta}}^{g,t}$ is an unbiased estimate of the true aggregate, in that

$$\mathbb{E}_{\mathbf{M}^{k,t} \sim \text{Bern}(\boldsymbol{\theta}^{k,t})\ \forall k \in \mathcal{K}_t}\left[\hat{\boldsymbol{\theta}}^{g,t}\right] = \mathbb{E}_{\mathbf{M}^{k,t} \sim \text{Bern}(\boldsymbol{\theta}^{k,t})\ \forall k \in \mathcal{K}_t}\left[\frac{1}{K} \sum_{k \in \mathcal{K}_t} \mathbf{M}^{k,t}\right] = \frac{1}{K} \sum_{k \in \mathcal{K}_t} \mathbb{E}_{\mathbf{M}^{k,t} \sim \text{Bern}(\boldsymbol{\theta}^{k,t})}\left[\mathbf{M}^{k,t}\right] = \frac{1}{K} \sum_{k \in \mathcal{K}_t} \boldsymbol{\theta}^{k,t} = \bar{\boldsymbol{\theta}}^{g,t}.$$

Moreover, the estimation error is upper bounded as (the proof is given in Appendix B)

$$\mathbb{E}_{\mathbf{M}^{k,t} \sim \text{Bern}(\boldsymbol{\theta}^{k,t})\ \forall k \in \mathcal{K}_t}\left[\left\| \hat{\boldsymbol{\theta}}^{g,t} - \bar{\boldsymbol{\theta}}^{g,t} \right\|_2^2\right] \leq \frac{d}{4K}.$$

Since each client communicates a stochastic binary mask $\mathbf{M}^{k,t}$, 1 bpp is the worst-case bitrate for FedPM. We can further reduce the bitrate to less than 1 bpp by using arithmetic coding (Rissanen & Langdon, 1979) or universal coding (Krichevsky & Trofimov, 1981; Barron et al., 1998) to encode $\mathbf{m}^{k,t}$, and achieve the empirical entropy since $d$ is large. This gives us smaller bitrates whenever the frequency of 1's in $\mathbf{m}^{k,t}$ is far from 0.5, which is usually the case for our method (see Figure 3 and Appendix E.2 for results). We note that, with a deterministic mask training approach as in FedMask (Li et al., 2021), arithmetic coding of the masks $\mathbf{m}^{k,t}$ does not provide any further gain in bitrate, as we have empirically observed that the frequency of 1's is always around 0.5 (see Figure 3 and Appendix E.2). Here we apply arithmetic coding for FedMask to improve our baseline, although it was not proposed in the original paper.
Moreover, FedMask (Li et al., 2021) and HideNSeek (Vallapuram et al., 2022) do not enjoy the guarantees we have as their estimator (i) is not unbiased and (ii) does not have an upper bound on the estimation error due to hard thresholding (Li et al., 2021) and sign operations (Vallapuram et al., 2022) . This is another benefit of our stochastic sampling approach.
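The estimator and the bitrate argument of this section can be checked numerically with a small simulation. This is a sketch under assumptions: the local probability masks are synthetic (drawn from a Beta(0.5, 2) distribution purely to get a frequency of 1's away from 0.5), and the bitrate is estimated as the binary entropy of the empirical frequency of 1's, which is the rate an ideal arithmetic coder would approach for large d.

```python
import math
import random

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) source; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

rng = random.Random(0)
d, K = 2000, 10

# Synthetic local probability masks (illustrative assumption, not learned).
thetas = [[rng.betavariate(0.5, 2.0) for _ in range(d)] for _ in range(K)]
# Each client transmits one Bernoulli sample of its probability mask.
masks = [[1 if rng.random() < p else 0 for p in th] for th in thetas]

true_mean = [sum(th[i] for th in thetas) / K for i in range(d)]
estimate = [sum(m[i] for m in masks) / K for i in range(d)]

# Squared error of this single realization; the d / (4K) bound is on its expectation.
sq_error = sum((e - t) ** 2 for e, t in zip(estimate, true_mean))
bound = d / (4 * K)

# Estimated uplink bitrate per client, in bits per parameter.
bitrates = [binary_entropy(sum(m) / d) for m in masks]
```

With masks this skewed, the estimated bitrate falls clearly below 1 bpp, whereas a mask with roughly half of its entries equal to 1 would cost close to 1 bpp.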

3.2. FedPM WITH BAYESIAN AGGREGATION

Another important aspect that differentiates our work from existing masking methods such as FedMask (Li et al., 2021) and HideNSeek (Vallapuram et al., 2022) is the Bayesian aggregation strategy, which exploits the underlying stochastic mask to synthesize a global model, boosting the performance in scenarios where only a fraction of the clients participate in each round. Given the probabilistic interpretation of the FedPM mask's values, at the server side we further model the probability mask $\boldsymbol{\theta}^{g,t}$ with a Beta distribution $\text{Beta}(\boldsymbol{\alpha}^{g,t}, \boldsymbol{\beta}^{g,t})$, parameterized by the round-dependent parameters $\boldsymbol{\alpha}^{g,t}$ and $\boldsymbol{\beta}^{g,t}$, which are initialized to $\boldsymbol{\alpha}^{g,0} = \boldsymbol{\beta}^{g,0} = \boldsymbol{\lambda}_0$. At the beginning of the training process, there is no prior knowledge indicating which network weight should be more important than the others, and so each entry in the probability mask is uniformly distributed in $[0, 1]$, which is the prior distribution. Consequently, the clients' local binary masks $\mathbf{M}^{k,t}$ are the data the server uses to update its belief on each weight score, and so the aggregation strategy now corresponds to a posterior update. Specifically, given the conjugate relation between the Beta and Bernoulli distributions, the new posteriors are still Beta distributions with parameters

$$\boldsymbol{\alpha}^{g,t} = \boldsymbol{\alpha}^{g,t-1} + \mathbf{M}^{\text{agg},t} \quad \text{and} \quad \boldsymbol{\beta}^{g,t} = \boldsymbol{\beta}^{g,t-1} + K \cdot \mathbf{1} - \mathbf{M}^{\text{agg},t} \quad \forall t \geq 1,$$

where $\mathbf{M}^{\text{agg},t} = \sum_{k \in \mathcal{K}_t} \mathbf{M}^{k,t}$, and $\mathbf{1}$ is the $d$-dimensional all-ones vector. Then, the server broadcasts to the clients the mode of the Beta distributions, as suggested by Ferreira et al. (2021),

$$\boldsymbol{\theta}^{g,t} = \frac{\boldsymbol{\alpha}^{g,t} - 1}{\boldsymbol{\alpha}^{g,t} + \boldsymbol{\beta}^{g,t} - 2},$$

where the division operation is applied element-wise. However, to obtain the best performance out of this method, the Beta parameters should be re-initialized to their original values $\boldsymbol{\lambda}_0$ with some regularity. We present an ablation study to demonstrate the improvements gained by the Bayesian aggregation strategy and the reasonable choices for the resetting frequency in Section 4.3.
Notice that if λ 0 = 1, and if α and β are re-initialized at the beginning of each round, the method is equivalent to the aggregation strategy detailed in Section 3.1.2.
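The posterior update and the equivalence noted above can be sketched as follows. The class name `BetaAggregator` and its methods are hypothetical; the update and the mode follow the formulas of this section, with the same prior value λ_0 used for every coordinate.

```python
class BetaAggregator:
    """Server-side Beta-Bernoulli aggregation of binary masks (illustrative sketch)."""

    def __init__(self, d, lam0=1.0):
        self.d = d
        self.lam0 = lam0
        self.reset()

    def reset(self):
        """Re-initialize alpha and beta to the prior value lam0."""
        self.alpha = [self.lam0] * self.d
        self.beta = [self.lam0] * self.d

    def update(self, masks):
        """Posterior update with the K binary masks received in one round."""
        K = len(masks)
        for i in range(self.d):
            m_agg = sum(m[i] for m in masks)
            self.alpha[i] += m_agg
            self.beta[i] += K - m_agg

    def mode(self):
        """Element-wise mode (alpha - 1) / (alpha + beta - 2).

        With lam0 = 1 this is only defined after at least one update.
        """
        return [(a - 1.0) / (a + b - 2.0) for a, b in zip(self.alpha, self.beta)]

masks = [[1, 0, 1], [1, 0, 0], [1, 1, 0], [0, 0, 0]]  # K = 4 clients, d = 3

agg = BetaAggregator(d=3, lam0=1.0)
agg.update(masks)
theta_bayes = agg.mode()
theta_mean = [sum(m[i] for m in masks) / len(masks) for i in range(3)]
```

With λ_0 = 1 and a reset before every round, `theta_bayes` coincides with the plain average of the masks, which is the estimator of Section 3.1.2. Skipping `reset()` between rounds lets the counts accumulate across rounds instead.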

3.3. WEIGHT DISTRIBUTION

As mentioned in Section 3.1, the fixed weight vector w init is initialized by sampling from the distribution P w using the randomly generated SEED. We note that the choice of this distribution impacts two important aspects of FedPM: (i) the values of w init highly influence the final accuracy achieved by the model, as they represent the building blocks to extract a subnetwork f ẇ (see Figure 1 ), which should be rich enough to solve the learning task, and (ii) the size of the sample space of P w affects the number of bits needed to store the model during the inference process (this is different from the 1 bpp model storage when the model is not in use). Regarding (i), as also proposed in Ramanujan et al. ( 2020), we sample weights from a uniform distribution, whose domain is {-σ, +σ}, where σ is the standard deviation of the Kaiming Normal distribution (He et al., 2015) . In this way, we control the variance of the neurons' output to be ∼ 1, which avoids the vanishing or the explosion of activation values. Previous experiments in (Zhou et al., 2019; Ramanujan et al., 2020 ) also demonstrate the superior performance achieved by binary weights distributions when compared to standard continuous counterparts, e.g., Gaussian. Regarding (ii), even if knowing the value of SEED is enough to perfectly reconstruct the vector w init , one would have to generate the entire vector at every inference step. Consequently, to achieve fast inference, the actual values of the weights need to be stored in the memory of the devices during the inference process. Fortunately, our initialization allows for efficient storage even during inference since (after reconstructing w final using SEED and m final ∈ {0, 1} d ) we only need to indicate whether the weight values in w final are -σ, 0, or +σ, with a ternary representation that can be efficiently deployed on hardware (Alemdar et al., 2017) .
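A short sketch of the initialization and of the ternary inference-time representation follows. The helper names are hypothetical, and the value σ = sqrt(2 / fan_in) assumes the Kaiming Normal standard deviation for ReLU layers in fan-in mode, which the text does not spell out.

```python
import math
import random

def signed_constant_init(n, fan_in, seed):
    """Sample n weights uniformly from {-sigma, +sigma}.

    Assumption: sigma = sqrt(2 / fan_in), i.e. the Kaiming Normal standard
    deviation for ReLU activations in fan-in mode.
    """
    sigma = math.sqrt(2.0 / fan_in)
    rng = random.Random(seed)
    weights = [sigma if rng.random() < 0.5 else -sigma for _ in range(n)]
    return weights, sigma

def ternary_codes(mask, w_init):
    """Encode w_final = mask * w_init with codes in {-1, 0, +1}."""
    return [0 if m == 0 else (1 if w > 0 else -1) for m, w in zip(mask, w_init)]

w_init, sigma = signed_constant_init(n=6, fan_in=128, seed=7)
mask = [1, 0, 1, 1, 0, 0]               # hypothetical final binary mask
codes = ternary_codes(mask, w_init)
w_final = [c * sigma for c in codes]    # each weight is code * sigma
```

Only the codes and the single scalar σ per layer are needed at inference time, which is what makes a ternary hardware representation possible.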

3.4. PRIVACY

Privacy is another challenge in FL as the model updates (in our case, M k,t s) may leak information about the client data. In Appendix C, we analyze the differential privacy guarantees of FedPM and give an initial foray into how FedPM can be helpful in amplifying privacy.

4. EXPERIMENTS

In this section, we empirically show the performance of FedPM in terms of accuracy, bitrate, convergence speed, and the final model size. We consider four datasets: CIFAR-10 with 10 classes, CIFAR-100 (Krizhevsky et al., 2009) with 100 classes, MNIST (Deng, 2012), and EMNIST. We compare FedPM against SignSGD (Bernstein et al., 2018), TernGrad (Wen et al., 2017), QSGD (Alistarh et al., 2017), DRIVE (Vargaftik et al., 2021), EDEN (Vargaftik et al., 2022), and FedMask (Li et al., 2021) on IID data split and full client participation in Section 4.1. We then extend our experiments to non-IID data splits and partial participation in Section 4.2. Finally, in Section 4.3, we present a key ablation study to justify why the Bayesian aggregation strategy is necessary for partial participation and to demonstrate how the resetting frequency affects the convergence rate and the final accuracy. Clients perform 3 local epochs in all experiments. We provide additional details on the experimental setup in Appendix D. We present results averaged over 3 runs.

4.1. IID DATA SPLIT AND FULL PARTICIPATION (K = N)

In this section, we focus on IID data distribution and the case when all the clients participate in the training at each round. We set the number of clients to N = K = 10. We report the estimated bitrate for the arithmetic code that uses the empirical frequency of the symbols (for our method FedPM, this corresponds to the frequency of 1's in $\mathbf{m}^{k,t}$), which is equal to the empirical entropy for a blocklength $d$ as large as the model size. In Figure 3, we compare the accuracy, bitrate, and convergence speed of FedPM with relevant baselines. As can be seen in the figure, FedPM converges to the highest accuracy on all four datasets. DRIVE, EDEN, and QSGD (they mostly overlap in the accuracy plots) seem to be the three baselines that perform the best after FedPM; however, their convergence speed is significantly lower than that of FedPM.
In terms of convergence speed, FedMask is the fastest among the baselines -in fact, at the beginning of the training, FedMask is faster than FedPM as well. However, its final accuracy is lower than the others. We also would like to highlight that while some of our baselines, such as FedMask and TernGrad, have a visibly high variance in accuracy, FedPM shows stable training behavior across all experiments. In terms of bitrate, SignSGD and FedMask consistently spend 1 bpp, which is the default number when a binary mask or sign mask is communicated. This means binary values (1's and 0's) are almost equally distributed in their masks, which prevents them from enjoying additional bitrate gains. Across all experiments, TernGrad has the highest bitrate. We would like to leave a note about the bitrate of QSGD. Unlike other baselines, including our work, QSGD can go down to very low bitrates by adjusting the number of levels in quantization. We have observed that in the extreme quantization case, QSGD underperforms FedPM. Then, we have decided to increase the number of quantization levels in QSGD to see if it improves the accuracy. However, as can be seen from the plots, even with bitrate larger than 1, QSGD still underperforms FedPM. The only two baselines that challenge FedPM in terms of bitrate are DRIVE and EDEN. While FedPM has lower bitrates on CIFAR-10 and EMNIST; DRIVE and EDEN have better bitrates on CIFAR-100 and MNIST. However, the accuracy of DRIVE and EDEN on these datasets (specifically CIFAR-100) is significantly lower than that of FedPM, with slower convergence. As for the final model size, FedPM needs only 0.8 bpp for the CONV-6 model trained on CIFAR-10, 0.85 bpp for the CONV-10 model trained on CIFAR-100, 0.96 bpp for the CONV-4 model trained on MNIST, and 0.83 bpp for the CONV-4 model trained on EMNIST. 
On the other hand, other baselines that train a dense model, namely SignSGD, TernGrad, QSGD, DRIVE, and EDEN, would need to represent each weight with their full precision value, i.e., 32 bpp. This implies that FedPM provides around 38.6× improvement in the storage or the communication of the final model. Since FedMask also trains a sparse model, it enjoys a similar gain in the final model size requiring 1 bpp across all the models. Due to the stochastic masking procedure and uneven distribution of 1's and 0's in the binary masks, FedPM has up to 0.17 bpp improvement over the deterministic procedure in FedMask, which adds up to a large gain due to the huge model size. We provide additional experimental results with ResNet-18 model on CIFAR-10 and CIFAR-100 datasets in Appendix E.1; and observe similar improvements over the baselines.

4.2. NON-IID DATA SPLIT AND PARTIAL PARTICIPATION (K < N )

This section considers more realistic scenarios, in which the local clients' datasets are generated from slightly different data distributions. We focus on CIFAR-10 with CONV-6, and we compare FedPM against (i) the most promising baselines, which, based on the results of Section 4.1, are DRIVE, EDEN, and QSGD, and (ii) FedMask, as it is the only sparse baseline. To choose the size $D_n = |\mathcal{D}_n|$ of each dataset, for each client $n \in \{1, \dots, N\}$, an integer $j_n$ is sampled uniformly from $\{10, 11, \dots, 100\}$. Then, a coefficient $p_n = j_n / \sum_{i=1}^{N} j_i$ is computed, which represents the size of the local dataset $\mathcal{D}_n$ as a fraction of the size of the full dataset, i.e., the training set of CIFAR-10. In this way, highly unbalanced datasets can be generated from the central one. Moreover, since the task is a classification problem, we impose a maximum number of different labels, or classes, $c_{\max}$, that one client can see. Consequently, clients need cooperation to learn the statistics of other classes' distributions, as the test dataset contains samples from all classes. In addition, partial participation is also considered, meaning that at each round, the server uniformly samples a fraction $\rho = K/N$ of the clients to participate in the training round. This is motivated in real-world scenarios by the scarcity of physical communication network resources, which may limit the availability of part of the clients during one round. The maximum number of classes per local dataset is set to $c_{\max} \in \{2, 4\}$, and the participation ratio is set to $\rho \in \{0.1, 0.2, 0.5, 1\}$. For $\rho = 1$ and $\rho = 0.5$, the total number of clients is set to $N = 10$ (and so $K$ is equal to 10 and 5, respectively). For $\rho = 0.2$, we set $N = 100$ (and so $K = 20$), and for $\rho = 0.1$, we set $N = 50$ (and so $K = 5$), which is the worst scenario among all combinations, given the small amount of information the server can collect at the end of each round.
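The data split and client sampling described above can be sketched as follows. The function names are hypothetical, and one detail is an assumption: the text only states that each client sees at most c_max classes, so the sketch assigns exactly c_max distinct classes chosen uniformly at random.

```python
import random

def client_fractions(num_clients, rng, low=10, high=100):
    """Draw j_n uniformly from {low, ..., high} and return p_n = j_n / sum_i j_i."""
    j = [rng.randint(low, high) for _ in range(num_clients)]  # bounds are inclusive
    total = sum(j)
    return [jn / total for jn in j]

def assign_classes(num_clients, num_classes, c_max, rng):
    """Assumption: every client gets exactly c_max distinct classes at random."""
    return [sorted(rng.sample(range(num_classes), c_max)) for _ in range(num_clients)]

def sample_participants(num_clients, rho, rng):
    """Uniformly sample K = rho * N clients for one round."""
    k = max(1, round(rho * num_clients))
    return sorted(rng.sample(range(num_clients), k))

rng = random.Random(0)
N, rho, c_max = 50, 0.1, 2

fractions = client_fractions(N, rng)                 # relative local dataset sizes
classes = assign_classes(N, num_classes=10, c_max=c_max, rng=rng)
participants = sample_participants(N, rho, rng)      # clients active in one round
```

Multiplying each fraction by the size of the training set gives the number of samples to draw for that client from its assigned classes.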
When ρ = 1, for the FedPM algorithm, we keep the same aggregation strategy exposed in Section 3.1.2 and Figure 2 ; and we switch to the Bayesian aggregation method (see Section 3.2) when there is partial participation, i.e., when ρ < 1. Indeed, applying the Bayesian aggregation method is revealed to be crucial for achieving good accuracy when ρ < 1 and data are non-IID, obtaining a large gain with respect to the simpler version in Section 3.1.2, which resets the Beta priors at each round (or takes the average of the samples, as explained in Section 3.2). We elaborate more on this observation with an ablation study in Section 4.3. We adopt a simple heuristic schedule to reset the priors: Reset every 3 rounds when ρ = 0.5 and ρ = 0.2, and every 10 rounds when ρ = 0.1. As expected, the smaller the ratio ρ, the larger the number of rounds we should wait before resetting the priors to collect more information from a much more diverse pool of clients (see Section 4.3 for a rule of thumb on the resetting frequency). Table 1 reports the results with c max = 4 and 2. FedPM seems to outperform all the baselines in every configuration, as the Bayesian aggregation allows the server to collect more data before resetting the priors, which is important when clients' data distributions are non-IID, and only a fraction of the clients participate in each round. This strategy can be seen as the FedPM counterpart of decreasing the learning rate (which we applied in the other dense compression-based baselines, like DRIVE, EDEN, and QSGD). It is seen from Table 1 that FedMask (Li et al., 2021) is struggling in the non-IID case, as applying a hard threshold on the scores to binarize the mask does not provide a proper way to implement multiple-rounds aggregation, emphasizing the benefit of the stochastic process in FedPM. 
It is interesting to notice that, especially when $c_{\max} = 4$, the lower the value of ρ, the larger the gap between FedPM and the baselines, which supports the claim that the Bayesian strategy can better deal with partial participation. An analysis of the communication bitrate is provided in Appendix E.2.
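The heuristic reset schedule used in this section can be written down as a small sketch. The function names are hypothetical, the thresholds between the tested values of ρ are an assumption (only ρ ∈ {0.1, 0.2, 0.5, 1} were used in the experiments), and the choice to reset on round 1 and every `period` rounds thereafter is an assumed phase.

```python
def reset_period(rho):
    """Rounds between prior resets, following the heuristic schedule in the text."""
    if rho >= 1.0:
        return 1   # full participation: priors are reset at every round
    if rho >= 0.2:
        return 3   # value used for rho = 0.5 and rho = 0.2
    return 10      # value used for rho = 0.1

def should_reset(round_index, period):
    """True on rounds 1, 1 + period, 1 + 2 * period, ... (1-indexed; assumed phase)."""
    return (round_index - 1) % period == 0
```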

4.3. ABLATION STUDY ON THE BAYESIAN AGGREGATION STRATEGY

In this section, we try to answer two questions: (1) Is Bayesian aggregation really necessary? and (2) What is the effect of the resetting frequency on the convergence rate and the final accuracy? We do this by analyzing the effect of different resetting frequencies of the Beta priors on the training behavior of FedPM with a non-IID data split and partial client participation, and we report the results in Figure 4. Hereafter, we denote with γ the number of aggregation rounds before resetting the priors. For instance, γ = 1 corresponds to resetting the priors at every iteration, which is equivalent to the aggregation method presented in Section 3.1.2. At the other extreme, γ = 200 indicates that the priors are never reset. The γ = 1 curves fluctuate significantly and never converge to the best accuracy in any setting, while the γ = 200 curves look smoother but converge to the lowest accuracy in all settings. This intuitively makes sense because, as already mentioned in Section 3.2, by increasing the value of γ, we allow the server to consider the information coming from multiple rounds while updating the global parameters. Indeed, with partial participation and non-IID data, a single round's updates may convey skewed information, depending on the level of data heterogeneity $c_{\max}$ and the client participation ratio ρ. As a rule of thumb for the resetting frequency, we suggest tuning γ around the value 1/ρ. The rationale is that, with uniform client sampling, at least 1/ρ rounds are needed to have a non-zero probability of sampling every client at least once before resetting the prior. In practice, we do not need to sample exactly from every client, as enough information is contained in the updates of the other sampled ones.
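The rule of thumb γ ≈ 1/ρ can be made concrete with a short sketch. It assumes that each round samples K = ρN clients uniformly without replacement and independently of previous rounds, so that a given client is missed in one round with probability 1 − ρ; the function names are hypothetical.

```python
import math

def min_rounds_for_full_coverage(rho):
    """Smallest number of rounds after which every client could have been sampled."""
    return math.ceil(1.0 / rho - 1e-9)  # small offset guards against float round-off

def expected_coverage(rho, gamma):
    """Expected fraction of distinct clients seen in gamma rounds before a reset."""
    return 1.0 - (1.0 - rho) ** gamma
```

Under these assumptions, with ρ = 0.1 and γ = 10 the server is expected to have heard from roughly 65% of the clients before a reset, which is in line with the remark that sampling exactly every client is not required.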

5. CONCLUSION

In this work, we introduced Federated Probabilistic Mask Training (FedPM), a communication-efficient FL strategy. FedPM relies on the idea of finding a sparse network inside a randomly initialized dense network: the dense network is sparsified by a collaboratively trained stochastic binary mask. In addition to reducing the communication cost to less than 1 bit per parameter (bpp), FedPM also reaches higher accuracy with faster convergence than the relevant baselines, and can potentially amplify privacy while additionally outputting a compressed final model with a size of less than 1 bpp. Throughout the manuscript, we highlighted the advantages of a stochastic mask training approach over a deterministic one in terms of accuracy, bitrate, and privacy.

6. ETHICS STATEMENT

All the experiments in the paper were performed on publicly available datasets. When we evaluated our strategy, we only considered accuracy as a measure of performance. However, as pointed out by Hooker et al. (2020) , compression methods may disproportionately impact different subgroups of the data. We agree that this may potentially create a fairness issue in all communication-efficient federated learning frameworks and deserves more attention from the community.

7. REPRODUCTION STATEMENT

The codebase for this work is open-sourced at https://github.com/BerivanIsik/sparse-random-networks. All the hyperparameters necessary to reproduce the results in the paper can be found in Appendix D. We only used publicly available standard datasets and included links to them in the manuscript.

B PROOF OF THE UPPER BOUND ON THE ESTIMATION ERROR

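We now prove the upper bound on the estimation error in Eq. 2. Recall that the true mean is $\bar{\theta}^{g,t} = \frac{1}{K} \sum_{k \in \mathcal{K}_t} \theta^{k,t}$, whereas our estimate is $\hat{\bar{\theta}}^{g,t} = \frac{1}{K} \sum_{k \in \mathcal{K}_t} m^{k,t}$, where $m^{k,t} \sim \mathrm{Bern}(\theta^{k,t})$. All expectations below are taken over $M^{k,t}_i \sim \mathrm{Bern}(\theta^{k,t}_i)$ for all $k \in \mathcal{K}_t$. Then we can compute the error as

\begin{align}
\mathbb{E}\left[ \| \hat{\bar{\theta}}^{g,t} - \bar{\theta}^{g,t} \|_2^2 \right]
&= \sum_{i=1}^{d} \mathbb{E}\left[ \left( \hat{\bar{\theta}}^{g,t}_i - \bar{\theta}^{g,t}_i \right)^2 \right] \tag{5} \\
&= \sum_{i=1}^{d} \mathbb{E}\left[ \left( \frac{1}{K} \sum_{k \in \mathcal{K}_t} \left( M^{k,t}_i - \theta^{k,t}_i \right) \right)^2 \right] \tag{6} \\
&= \frac{1}{K^2} \sum_{i=1}^{d} \mathbb{E}\left[ \left( \sum_{k \in \mathcal{K}_t} \left( M^{k,t}_i - \theta^{k,t}_i \right) \right)^2 \right] \tag{7} \\
&= \frac{1}{K^2} \sum_{i=1}^{d} \mathbb{E}\left[ \sum_{k \in \mathcal{K}_t} \left( M^{k,t}_i - \theta^{k,t}_i \right)^2 \right] \tag{8} \\
&= \frac{1}{K^2} \sum_{i=1}^{d} \sum_{k \in \mathcal{K}_t} \mathbb{E}\left[ \left( M^{k,t}_i - \theta^{k,t}_i \right)^2 \right] \tag{9} \\
&= \frac{1}{K^2} \sum_{i=1}^{d} \sum_{k \in \mathcal{K}_t} \left( \mathbb{E}\left[ (M^{k,t}_i)^2 \right] - (\theta^{k,t}_i)^2 \right) \tag{10} \\
&= \frac{1}{K^2} \sum_{i=1}^{d} \sum_{k \in \mathcal{K}_t} \left( \theta^{k,t}_i - (\theta^{k,t}_i)^2 \right) \le \frac{d}{4K}.
\end{align}

From (5) to (6), we use the definitions $\hat{\bar{\theta}}^{g,t}_i = \frac{1}{K} \sum_{k \in \mathcal{K}_t} m^{k,t}_i$ and $\bar{\theta}^{g,t}_i = \frac{1}{K} \sum_{k \in \mathcal{K}_t} \theta^{k,t}_i$. From (7) to (8), we use the fact that $\mathbb{E}[M^{k,t}_i - \theta^{k,t}_i] = 0$, and that $M^{k,t}_i - \theta^{k,t}_i$ and $M^{l,t}_i - \theta^{l,t}_i$ are independent for $l \neq k \in [K]$, so all cross terms vanish. The last equality uses $\mathbb{E}[(M^{k,t}_i)^2] = \mathbb{E}[M^{k,t}_i] = \theta^{k,t}_i$, since $M^{k,t}_i \in \{0, 1\}$. Finally, the inequality in the last line follows from $\theta^{k,t}_i \in [0, 1]$ for all $k \in [K]$, which gives $\theta^{k,t}_i - (\theta^{k,t}_i)^2 \le 1/4$.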
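As a sanity check of the bound, the following sketch compares the closed-form expected error, $\frac{1}{K^2} \sum_i \sum_k \theta^{k}_i (1 - \theta^{k}_i)$, with a Monte Carlo estimate and with $d/(4K)$. The function names and the toy dimensions are illustrative assumptions and are not taken from the released codebase.

```python
import random

def exact_expected_error(thetas):
    """Closed-form expected squared error; thetas is a list of K vectors of length d."""
    k = len(thetas)
    return sum(p * (1.0 - p) for theta in thetas for p in theta) / k ** 2

def monte_carlo_error(thetas, trials, rng):
    """Empirical squared error of the averaged Bernoulli masks."""
    k, d = len(thetas), len(thetas[0])
    true_mean = [sum(theta[i] for theta in thetas) / k for i in range(d)]
    total = 0.0
    for _ in range(trials):
        masks = [[1 if rng.random() < p else 0 for p in theta] for theta in thetas]
        estimate = [sum(m[i] for m in masks) / k for i in range(d)]
        total += sum((e - t) ** 2 for e, t in zip(estimate, true_mean))
    return total / trials

def upper_bound(d, k):
    """The bound d / (4K) from the derivation above."""
    return d / (4.0 * k)
```

The bound is tight when every probability equals 0.5, where each Bernoulli variance reaches its maximum of 1/4.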

C PRIVACY AMPLIFICATION AND BIAS CORRECTION

Differential privacy (DP) guarantees that the probability of any outcome of an algorithm that runs on client data does not change much when a single client's data changes. This is typically ensured by injecting noise into a function of the client data at a particular step of the algorithm, at the cost of some utility loss in the application. While many DP strategies have been developed for FL and deep learning (Abadi et al., 2016; McMahan et al., 2017b; Agarwal et al., 2021; Andrew et al., 2021), these strategies typically suffer from severe performance degradation due to noise injection. To make DP practical, researchers have explored certain randomization mechanisms that amplify the privacy guarantee. When these mechanisms are already part of the FL framework, such as sampling (of data (Balle et al., 2018; Wang et al., 2019) or devices (Balle et al., 2020; Girgis et al., 2021; Hasircioglu & Gunduz, 2022)) and shuffling (Erlingsson et al., 2019; Feldman et al., 2022), the amplification comes for free. This is helpful because the overall process can meet a stronger privacy guarantee without increasing the noise level. FedPM promises one such amplification due to the stochastic Bernoulli sampling step. We first revisit the definitions of differential privacy (Dwork et al., 2006), Rényi divergence, and Rényi differential privacy (Mironov, 2017), and then present the amplification result.

Definition 1. [Adjacent Datasets] Two datasets $D, D' \in \mathcal{D}$ are called adjacent if they differ in at most one data sample.

Definition 2. [(ϵ, δ)-DP] A randomized mechanism $f : \mathcal{D} \to \mathcal{R}$ offers (ϵ, δ)-differential privacy if for any adjacent $D, D' \in \mathcal{D}$ and $S \subset \mathcal{R}$, $\Pr[f(D) \in S] \le e^{\epsilon} \Pr[f(D') \in S] + \delta$.

Definition 3. [Rényi Divergence] For two probability distributions P and Q defined over $\mathcal{R}$, the Rényi divergence of order α > 1 is $D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \log \mathbb{E}_{x \sim Q}\left[ \left( \frac{P(x)}{Q(x)} \right)^{\alpha} \right]$.

Definition 4. [(α, ϵ)-RDP] A randomized mechanism $f : \mathcal{D} \to \mathcal{R}$ offers ϵ-Rényi differential privacy of order α (or in short (α, ϵ)-RDP) if for any adjacent $D, D' \in \mathcal{D}$, it holds that $D_\alpha(f(D) \| f(D')) \le \epsilon$.

In particular, Imola & Chaudhuri (2021) have shown that when a sample $M \in \{0, 1\}^d$ from an already privatized vector $\theta \in [c, 1 - c]^d$, where 0 < c < 0.5, is released to a third party (instead of θ itself), the privacy is amplified under some conditions. More precisely, when there is an (α, ϵ)-Rényi differential privacy mechanism (Mironov, 2017) that privatizes $\theta \in [c, 1 - c]^d$, releasing a sample from Bern(θ) yields an improved privacy budget (the smaller ϵ, the better the privacy): $\epsilon_{amp} \le \min\{\epsilon, d \cdot r_\alpha(c)\}$. Here, $r_\alpha(p)$ is the binary symmetric Rényi divergence function defined as $r_\alpha(p) = \frac{1}{\alpha - 1} \log\left( p^{\alpha} (1 - p)^{1 - \alpha} + (1 - p)^{\alpha} p^{1 - \alpha} \right)$. In this appendix, we demonstrate the impact of this amplification on the distributed mean estimation problem described in Figure 2, where the goal is to estimate the true mean of the probability masks $\bar{\theta} = \frac{1}{K} \sum_{k=1}^{K} \theta^{k}$ under communication and privacy constraints. We also provide a bias correction mechanism, specific to our scheme in Figure 5, that mitigates the bias due to the DP mechanism and reduces the estimation error.

Now, suppose that we have an (α, ϵ)-RDP algorithm f that outputs privatized $\theta^{k} \in [c, 1 - c]^d$ with 0 < c < 0.5, using local client data $D^{k}$. As summarized in Figure 5, we are interested in what happens when, instead of releasing $\theta^{k} = f(D^{k})$, client k releases a Bernoulli sample from it: $m^{k} \in \{0, 1\}^d \sim \mathrm{Bern}(\theta^{k})$.
We already explained the advantages of this approach in terms of communication bitrate, estimation error, and unbiasedness throughout the manuscript; however, it also amplifies the privacy guarantee, meaning that it makes the overall privacy budget smaller: $\epsilon_{amp} \le \epsilon$. Quantitatively, Imola & Chaudhuri (2021) showed that after the Bernoulli sampling, the privacy budget of the overall process is $\epsilon_{amp} \le \min\{\epsilon, d \, r_\alpha(c)\}$, where $r_\alpha(\cdot)$ is the binary symmetric Rényi divergence function. More precisely, consider random variables P, Q with support on $\{x_1, x_2\} \subset \Theta$ and let $p = \Pr[P = x_1]$ and $1 - p = \Pr[Q = x_1]$. Then the Rényi divergence is defined as $r_\alpha(p) = R_\alpha(P, Q) = \frac{1}{\alpha - 1} \log\left( p^{\alpha} (1 - p)^{1 - \alpha} + (1 - p)^{\alpha} p^{1 - \alpha} \right)$. Notice that FedPM already involves this Bernoulli sampling step in the communication protocol and in the forward pass, $m^{k,t} \sim \mathrm{Bern}(\theta^{k,t})$. This implies that FedPM improves the privacy guarantee without changing the privacy mechanism, e.g., without increasing the injected noise level. However, the d term in the upper bound limits the amplification for large model sizes. We believe it is worth exploring a tighter upper bound on $\epsilon_{amp}$ to enjoy privacy amplification in FedPM with practical models. Nonetheless, we demonstrate the impact of this amplification on a distributed mean estimation problem, described in Figure 5, where the probability masks $\theta^{k} \in [c, 1 - c]^d$ are privatized with Gaussian noise calibrated to guarantee (ϵ, δ)-DP with a small ϵ, where $\delta \approx 1/N^2$ and $\Delta_2$ is the $\ell_2$-sensitivity of the probability masks (in our case $\Delta_2 = (1 - 2c)\sqrt{d}$). We transfer the above amplification results in RDP to DP using the well-known relation: Remark C.1. Mironov (2017) showed that if f is an (α, ϵ)-RDP mechanism, it also satisfies $\left(\epsilon + \frac{\log(1/\delta)}{\alpha - 1}, \delta\right)$-DP for any 0 < δ < 1. ...
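The quantities in this discussion are simple to evaluate numerically. The sketch below computes $r_\alpha(p)$, the amplified budget $\min\{\epsilon, d \, r_\alpha(c)\}$, and the RDP-to-DP conversion of Remark C.1. The function names are hypothetical, and the natural logarithm is assumed throughout.

```python
import math

def binary_symmetric_renyi(alpha, p):
    """r_alpha(p) for alpha > 1 and 0 < p < 1 (natural logarithm assumed)."""
    inner = p ** alpha * (1 - p) ** (1 - alpha) + (1 - p) ** alpha * p ** (1 - alpha)
    return math.log(inner) / (alpha - 1)

def amplified_epsilon(eps, alpha, c, d):
    """Upper bound min{eps, d * r_alpha(c)} on the amplified RDP budget."""
    return min(eps, d * binary_symmetric_renyi(alpha, c))

def rdp_to_dp(eps, alpha, delta):
    """(alpha, eps)-RDP implies (eps + log(1/delta) / (alpha - 1), delta)-DP."""
    return eps + math.log(1.0 / delta) / (alpha - 1)
```

Since $r_\alpha(0.5) = 0$ and $r_\alpha(c)$ grows as c moves away from 0.5, a tighter clipping range (c closer to 0.5) gives a smaller cap on the budget, while the factor d makes the cap loose for large models, as noted above.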

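Figure 5: Distributed mean estimation scheme in FedPM, modified for differential privacy.

Since clipping after the noise addition step would lead to bias in the estimated mean, we work out a bias correction mechanism. We denote with $\theta$ one generic parameter at client k, with $\tilde{\theta}$ its noisy version, and with $\hat{\theta} = \mathrm{clip}(\tilde{\theta})$ its clipped version. Specifically, if $\tilde{\theta} = \theta + \eta$ is the noisy version of the parameter, where $\eta \sim \mathcal{N}(0, \sigma^2)$, then

$$
\mathrm{clip}(\tilde{\theta}) =
\begin{cases}
\tilde{\theta}, & \text{if } c \le \theta + \eta \le 1 - c, \\
1 - c, & \text{if } \theta + \eta > 1 - c, \\
c, & \text{if } \theta + \eta < c.
\end{cases}
$$

We now compute $\mathbb{E}[\hat{M}]$, where $\hat{M} \sim \mathrm{Bern}(\hat{\theta})$, to analyze the bias $\mathbb{E}[\hat{M}] - \mathbb{E}[M] = \mathbb{E}[\hat{M}] - \theta$, where $M \sim \mathrm{Bern}(\theta)$. With a slight abuse of notation, f denotes the density of $\hat{\theta}$ when integrating over ρ and the density of η when integrating over η. First of all, notice that

$$
\mathbb{E}[\hat{M}] = \int_0^1 \mathbb{E}[\hat{M} \mid \hat{\theta} = \rho] f(\rho) \, d\rho = \int_0^1 \rho f(\rho) \, d\rho = \mathbb{E}[\hat{\theta}].
$$

We now compute the mean of the clipped parameter:

\begin{align}
\mathbb{E}[\hat{\theta}] &= \int_0^1 \rho f(\rho) \, d\rho = \int_{-\infty}^{+\infty} \mathrm{clip}(\theta + \eta) f(\eta) \, d\eta \\
&= \int_{-\infty}^{c - \theta} c \, f(\eta) \, d\eta + \int_{c - \theta}^{1 - c - \theta} (\theta + \eta) f(\eta) \, d\eta + \int_{1 - c - \theta}^{+\infty} (1 - c) f(\eta) \, d\eta \\
&= c \, \Phi_\sigma(c - \theta) + \theta \int_{c - \theta}^{1 - c - \theta} f(\eta) \, d\eta + \int_{c - \theta}^{1 - c - \theta} \eta f(\eta) \, d\eta + (1 - c) \left( 1 - \Phi_\sigma(1 - c - \theta) \right) \\
&= c \, \Phi_\sigma(c - \theta) + \theta \left[ \Phi_\sigma(1 - c - \theta) - \Phi_\sigma(c - \theta) \right] - \frac{\sigma}{\sqrt{2\pi}} \left[ e^{-\frac{(1 - c - \theta)^2}{2\sigma^2}} - e^{-\frac{(c - \theta)^2}{2\sigma^2}} \right] + (1 - c) \left( 1 - \Phi_\sigma(1 - c - \theta) \right) \\
&= 1 - c + [\theta - 1 + c] \, \Phi_\sigma(1 - c - \theta) + [c - \theta] \, \Phi_\sigma(c - \theta) - \frac{\sigma e^{-\frac{(c - \theta)^2}{2\sigma^2}}}{\sqrt{2\pi}} \left[ e^{-\frac{(1 - 2c)(1 - 2\theta)}{2\sigma^2}} - 1 \right],
\end{align}

where $\Phi_\sigma(\cdot)$ is the cumulative distribution function of a zero-mean Gaussian random variable with standard deviation σ, and the last step uses $(1 - c - \theta)^2 - (c - \theta)^2 = (1 - 2c)(1 - 2\theta)$. We use this relation to correct the bias in $\hat{\theta}$. In practice, to adopt the bias-correction strategy, we sample the function $\mathbb{E}[\hat{\theta}]$, which is a function of the true parameter θ, the noise standard deviation σ, and the clipping parameter c, at Q different points $x_1, \dots, x_Q$, i.e., different values of the uncorrupted θ, and we store the values in a table. Indeed, the values σ and c are set at the beginning of the training process, secretly shared among the participants, and never modified.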
Then, once the server computes its estimate, it corrects it by finding the closest output of $\mathbb{E}[\hat{\theta}]$ in the stored table and inverting the map, i.e., by choosing the corresponding $x_i$ as the original θ. We conduct our experiments with N = 100 clients, each having independent probability masks with dimension d = 5 and range [0.2, 0.8], i.e., $\theta \in [0.2, 0.8]^5$. Figure 6 shows the estimation error $\| \hat{\bar{\theta}}^{g,t} - \bar{\theta}^{g,t} \|_2^2$ in the case of no noise injection (i.e., no DP) with the black line. Recall that we want to reach a smaller estimation error and a smaller ϵ (i.e., a stronger privacy guarantee). The red curve corresponds to the ϵ vs. estimation error behavior if Bernoulli sampling did not amplify the privacy. The blue curve shows the amplified ϵ (i.e., $\epsilon_{amp} \le \epsilon$) vs. estimation error behavior, and it overlaps with the red curve for ϵ values smaller than $d \cdot r_\alpha(c) = 8.96$, where there is no privacy amplification, i.e., $\epsilon_{amp} = \epsilon$. However, notice that the blue line never reaches values of ϵ higher than this due to amplification, while enjoying smaller estimation errors that the red curve can only achieve with very large ϵ. This shows the promise of FedPM in having a better privacy-accuracy performance than most baselines that do not have amplification. Finally, the green curve shows that bias correction improves this performance further, even with $\epsilon < d \cdot r_\alpha(c) = 8.96$, by achieving lower estimation errors with the same ϵ.
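The table-based bias correction described above can be sketched as follows. The sketch evaluates the mean of the clipped noisy parameter from its integral decomposition, tabulates it on a uniform grid of Q points in [c, 1 − c], and inverts the map by nearest-neighbour lookup. The function names and the uniform grid are assumptions made for illustration.

```python
import math

def gaussian_cdf(x, sigma):
    """CDF of a zero-mean Gaussian with standard deviation sigma."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def expected_clipped_value(theta, sigma, c):
    """E[clip(theta + eta)] for eta ~ N(0, sigma^2) and clipping range [c, 1 - c]."""
    a, b = c - theta, 1.0 - c - theta
    phi_a, phi_b = gaussian_cdf(a, sigma), gaussian_cdf(b, sigma)
    # Integral of eta * f(eta) over [a, b] for a zero-mean Gaussian density f.
    tail = sigma / math.sqrt(2.0 * math.pi) * (
        math.exp(-a * a / (2.0 * sigma ** 2)) - math.exp(-b * b / (2.0 * sigma ** 2)))
    return c * phi_a + theta * (phi_b - phi_a) + tail + (1.0 - c) * (1.0 - phi_b)

def build_bias_table(sigma, c, q):
    """Tabulate (x_i, E[clip(x_i + eta)]) on q uniformly spaced points in [c, 1 - c]."""
    xs = [c + (1.0 - 2.0 * c) * i / (q - 1) for i in range(q)]
    return [(x, expected_clipped_value(x, sigma, c)) for x in xs]

def correct_bias(estimate, table):
    """Return the x_i whose tabulated mean is closest to the server's estimate."""
    return min(table, key=lambda pair: abs(pair[1] - estimate))[0]
```

Because the map from θ to the clipped mean is strictly increasing, the nearest-neighbour inversion is unambiguous, and its resolution is set by the grid size Q.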

D ADDITIONAL EXPERIMENTAL DETAILS

In Table 2, we provide the architectures for all the models used in our experiments. Clients performed 3 local epochs with a batch size of 128 and a local learning rate of 0.1 in all the experiments. Notice that there is no server learning rate in FedPM; instead, we tune the prior resetting schedule in Bayesian aggregation for the experiments in Section 4.2. We conducted our experiments on NVIDIA Titan X GPUs on an internal cluster server, using one GPU per run. In the non-IID and partial participation experiments in Section 4.2, to distill the final model, we may apply either stochastic sampling, as during training, or a hard-threshold method, similar to the one adopted in FedMask (Li et al., 2021). In the latter, a binary mask coefficient $m_i$ is set to 1 if $\theta_i > \alpha_{ths}$, and to 0 otherwise. For all experiments but one, when $\alpha_{ths} \in [0.4, 0.6]$, the test accuracy with thresholding is always higher than with the sampling method, and so we use the threshold method. However, in the extreme case $c_{\max} = 2$ and ρ = 0.1, the optimal values for $\alpha_{ths}$ were in [0.2, 0.4] and [0.6, 0.8].

Figure 6: The effect of privacy amplification and bias correction on the privacy budget (ϵ) vs. estimation error behavior. Comparing the red and blue curves, we see that we can reach small estimation errors without increasing ϵ thanks to the amplification (see the vertical blue line at low estimation error). While the red and blue curves overlap for $\epsilon < d \cdot r_\alpha(c) = 8.96$, in that regime we benefit from our bias correction strategy to reach a lower error.

E.1 ADDITIONAL EXPERIMENTS ON RESNET ARCHITECTURES

In this section, we provide additional experimental results with ResNet-18 (He et al., 2016) on the CIFAR-10 and CIFAR-100 datasets. For these experiments, we focus on the IID data distribution and the case in which all the clients participate in the training at each round. We use the same hyperparameters as in Section 4.1. We provide the details of the ResNet-18 architecture in Table 3. Figures 7 and 8 show the results on the CIFAR-10 and CIFAR-100 datasets, respectively. FedPM outperforms all the baselines in terms of accuracy. Although DRIVE and EDEN require bitrates approximately 0.1 bpp lower than FedPM's, they also reach lower accuracy. In summary, the advantages of FedPM discussed in the main manuscript carry over to the ResNet-18 model as well.

Figure 8: Accuracy and bitrate comparison of FedPM with baselines SignSGD (Bernstein et al., 2018), TernGrad (Wen et al., 2017), QSGD (Alistarh et al., 2017), DRIVE (Vargaftik et al., 2021), EDEN (Vargaftik et al., 2022), and FedMask (Li et al., 2021), with ResNet-18 on CIFAR-100.

E.2 BITRATE CONSIDERATIONS ON NON-IID DATA

We now report the communication bitrate for the non-IID data split experiments described in Section 4.2. Table 4 reports the average bitrate needed by the different algorithms over the whole training process when $c_{\max} = 4$ and $c_{\max} = 2$. By multiplying the average bitrate by the total number of rounds $t_{\max} = 200$, we obtain the total number of bits one element of the global probability mask needs to converge to its final value, which indicates the total amount of information communicated during the training process. We first observe that both DRIVE and EDEN consume almost the same number of bits regardless of the system configuration and round number (very small variance); their bitrate is instead model dependent (see Figure 3). On the contrary, FedPM and QSGD show higher bitrate variability, as their bitrate depends on both the training phase and the system setting. As already observed in Section 4.1, FedMask's binary updates are almost uniformly balanced, leading to a bitrate that is essentially fixed at 1 bpp. For both $c_{\max} = 4$ and $c_{\max} = 2$, FedPM yields the smallest bitrate when ρ = 1, whereas for the other scenarios, EDEN and DRIVE are slightly more efficient. We argue that this happens because, as the learning task becomes harder due to the high system heterogeneity, all the models struggle to converge to good and stable solutions, which means that FedPM is still uncertain about the weights' importance probabilities θ and sets many of them close to 0.5. However, we think that this may be a useful feature of FedPM to quantify its internal uncertainty, which we will further analyze.
| c_max | Algorithm | ρ = 1 | ρ = 0.5 | ρ = 0.2 | ρ = 0.1 |
|---|---|---|---|---|---|
| 4 | DRIVE (Vargaftik et al., 2021) | 0.885 ± 9·10^-5 | 0.885 ± 1·10^-4 | 0.885 ± 6·10^-5 | 0.885 ± 1·10^-4 |
| 4 | EDEN (Vargaftik et al., 2022) | 0.885 ± 1·10^-4 | 0.885 ± 1·10^-4 | 0.885 ± 8·10^-5 | 0.885 ± 1·10^-4 |
| 4 | QSGD (Alistarh et al., 2017) | 0.982 ± 0.027 | 0.923 ± 0.029 | 1.188 ± 0.034 | 0.910 ± 0.05 |
| 4 | FedMask (Li et al., 2021) | 1.000 ± 3·10^-6 | 1.000 ± 8·10^-8 | 1.000 ± 2·10^-6 | 1.000 ± 6·10^-7 |
| 4 | FedPM (Ours) | 0.863 ± 0.077 | 0.912 ± 0.056 | 0.965 ± 1 • 0.01812 | 0.996 ± 0.003 |
| 2 | DRIVE (Vargaftik et al., 2021) | 0.885 ± 7·10^-5 | 0.885 ± 2·10^-4 | 0.885 ± 7·10^-5 | 0.885 ± 2·10^-4 |
| 2 | EDEN (Vargaftik et al., 2022) | 0.885 ± 1·10^-4 | 0.885 ± 7·10^-5 | 0.885 ± 6·10^-5 | 0.885 ± 7·10^-5 |
| 2 | QSGD (Alistarh et al., 2017) | 1.230 ± 0.043 | 1.234 ± 0.038 | 1.100 ± 0.01 | 1.082 ± 0.01 |
| 2 | FedMask (Li et al., 2021) | 1.000 ± 2·10^-6 | 1.000 ± 2·10^-6 | 1.000 ± 1·10^-5 | 1.000 ± 2·10^-7 |
| 2 | FedPM (Ours) | 0.868 ± 0.076 | 0.904 ± 0.063 | 0.980 ± 0.014 | 0.997 ± 0.01 |

To conclude the analysis, we also report the FedPM bpp for the final model, which indicates the average number of bits needed per model parameter. In the case of $c_{\max} = 4$, the final model sizes are 0.79 bpp, 0.834 bpp, and 0.99 bpp, when ρ = {0.1, 0.5, 1}, respectively. When $c_{\max} = 2$, the final model sizes are 0.8 bpp, 0.817 bpp, and 0.992 bpp. Consequently, at the end of the training process, FedPM remains the most efficient option, as already observed in Section 4.1.
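To illustrate why masks sampled from probabilities close to 0.5 cost close to 1 bpp, the sketch below computes the ideal code length of a binary mask from its empirical frequency of ones, together with the total number of bits per parameter over training. It assumes an ideal entropy coder as a stand-in for the arithmetic coder used in FedPM; the function names are hypothetical.

```python
import math

def binary_entropy(q):
    """Entropy in bits of a Bernoulli(q) source."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)

def mask_bitrate(mask):
    """Ideal bits per parameter for a binary mask, given its frequency of ones."""
    return binary_entropy(sum(mask) / len(mask))

def total_bits_per_parameter(avg_bitrate, num_rounds):
    """Average bitrate multiplied by the number of rounds."""
    return avg_bitrate * num_rounds
```

For instance, the FedPM entry of 0.863 bpp at ρ = 1 and $c_{\max} = 4$, multiplied by $t_{\max} = 200$ rounds, corresponds to about 172.6 bits per parameter over the whole training process.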



Figure 3: Accuracy and bitrate comparison of FedPM with SignSGD (Bernstein et al., 2018), TernGrad (Wen et al., 2017), QSGD (Alistarh et al., 2017), DRIVE (Vargaftik et al., 2021), EDEN (Vargaftik et al., 2022), and FedMask (Li et al., 2021), all performing in the same bitrate regime.

Figure 4: Accuracy for different values of γ -the number of rounds before resetting the priors.

$\theta^{k} \in [c, 1 - c]^d$ are a function of client data $D^{k}$, and are first corrupted by Gaussian noise and then clipped to the range $[c, 1 - c]^d$. Our goal is, as before, to estimate the true mean $\bar{\theta} = \frac{1}{K} \sum_{k \in \mathcal{K}_t} \theta^{k}$ by averaging the sampled binary masks, i.e., $\hat{\bar{\theta}} = \frac{1}{K} \sum_{k \in \mathcal{K}_t} m^{k}$. Differently from our previous experiments, we now have privacy constraints, meaning that we want to guarantee (ϵ, δ)-DP by injecting Gaussian noise with variance $\sigma^2 =$

Figure 7: Accuracy and bitrate comparison of FedPM with baselines SignSGD (Bernstein et al., 2018), TernGrad (Wen et al., 2017), QSGD (Alistarh et al., 2017), DRIVE (Vargaftik et al., 2021), EDEN (Vargaftik et al., 2022), and FedMask (Li et al., 2021), with ResNet-18 on CIFAR-10.

with 10 classes, and EMNIST (Cohen et al., 2017) with 47 classes. For CIFAR-100, we use a 10-layer convolutional network (CNN) CONV-10 and ResNet-18 (He et al., 2016); for CIFAR-10, a 6-layer CNN CONV-6 and ResNet-18 (He et al., 2016); and for MNIST and EMNIST, a 4-layer CNN CONV-4. A detailed description of the architectures can be found in Appendix D. Due to limited space, we provide the results on ResNet-18 in Appendix E.1. We first compare FedPM with SignSGD

Table 1: Average final accuracy ±σ in the non-IID data split with $c_{\max} = 4$ and $c_{\max} = 2$, and client participation ratios ρ = {0.1, 0.2, 0.5, 1}, for FedPM, FedMask, and the strongest baselines in the IID experiments: EDEN, DRIVE, and QSGD. The training duration was set to $t_{\max} = 200$ rounds.

Table 2: Architectures for the CONV-4, CONV-6, and CONV-10 models used in the experiments.

This is probably due to the high randomness given by the highly heterogeneous scenario. Consequently, for the last experiment, we just adopt the stochastic sampling strategy to evaluate the model, as further optimizing $\alpha_{ths}$ would mean adapting to the test dataset, which may corrupt the ability of the model to generalize.

Table 3: ResNet-18 architecture.

| Layer | Specification |
|---|---|
| Residual Block 4 | [3 × 3 conv, 512 filters; 3 × 3 conv, 512 filters] × 2 |
| Output Layer | 4 × 4 average pool, stride 1; fully-connected; softmax |

Table 4: Average bitrate ±σ over the whole training process in the non-IID data split with $c_{\max} = 4$ and $c_{\max} = 2$, and partial participation with ratios ρ = {0.1, 0.2, 0.5, 1}, for FedPM, FedMask, and the strongest baselines in the IID experiments: EDEN, DRIVE, and QSGD. The training duration was set to $t_{\max} = 200$ rounds.

8. ACKNOWLEDGEMENT

The authors would like to thank the anonymous reviewers and area chairs who provided valuable feedback; and Zachary Charles, Mahdi Haghifam, Peter Kairouz, and Nicole Mitchell for inspiring discussions. This work was supported in part by a Sony Stanford Graduate Fellowship, a National Science Foundation (NSF) award, a Meta research grant, and the European Union under the Italian National Recovery and Resilience Plan (NRRP) of NextGenerationEU, partnership on "Telecommunications of the Future" (PE0000001 -program "RESTART").

A FedPM ALGORITHM

We provide the pseudocode for FedPM in Algorithms 1 and 2. In Algorithm 2, the prior resetting schedule is controlled by the procedure ResPriors(t), which may depend on quantities other than the round number t, such as the loss.

Algorithm 1 FedPM.
Hyperparameters: learning rate η, minibatch size B, number of local iterations τ.
Inputs: local datasets $D_i$, i = 1, . . . , N.
Output: random seed SEED and binary mask parameters $m^{k,T}$.
    At the server, initialize a random network with weight vector $w^{init} \in \mathbb{R}^d$ using a random seed SEED, and broadcast it to the clients.
    At the server, initialize the random score vector $s^{g,0} \in \mathbb{R}^d$, and compute $\theta^{g,0} \leftarrow \mathrm{Sigmoid}(s^{g,0})$.
    At the server, initialize Beta priors $\alpha^{g,0} = \beta^{g,0} = \lambda_0$.
    for t = 1, . . . , T do
        On Client Nodes:
        for $k \in \mathcal{K}_t$ do
            Receive $\theta^{g,t-1}$ from the server and set $s^{k,t} = \mathrm{Sigmoid}^{-1}(\theta^{g,t-1})$.
            Send the arithmetic coded binary mask $m^{k,t}$ to the server.
        end for
        On the Server Node:
        Receive $m^{k,t}$'s from K client nodes.
        $\theta^{g,t} = \mathrm{BayesAgg}(\{m^{k,t}\}_{k \in \mathcal{K}_t}, t)$ // See Algorithm 2.
        Broadcast $\theta^{g,t}$ to all client nodes.
    end for
    Sample the final binary mask $m^{final} \sim \mathrm{Bern}(\theta^{g,T})$.
    Generate the final model: $w^{final} \leftarrow m^{final} \odot w^{init}$.

Algorithm 2 BayesAgg.
Inputs: clients' updates $\{m^{k,t}\}_{k \in \mathcal{K}_t}$, and round number t.
Output: global probability mask $\theta^{g,t}$.
    if ResPriors(t) then
        $\alpha^{g,t-1} = \beta^{g,t-1} = \lambda_0$
    end if
    Compute $m^{agg,t} = \sum_{k \in \mathcal{K}_t} m^{k,t}$.
    $\alpha^{g,t} = \alpha^{g,t-1} + m^{agg,t}$
    $\beta^{g,t} = \beta^{g,t-1} + K \cdot \mathbf{1} - m^{agg,t}$
    $\theta^{g,t} = \frac{\alpha^{g,t} - 1}{\alpha^{g,t} + \beta^{g,t} - 2}$
    Return $\theta^{g,t}$
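A minimal sketch of the server-side update in Algorithm 2 is given below, using plain Python lists. The class name, the method names, and the default $\lambda_0 = 1$ (a uniform Beta prior) are assumptions made for illustration and are not taken from the released codebase; the reset decision is passed in by the caller, standing in for ResPriors(t).

```python
class BayesAgg:
    """Per-coordinate Beta posterior over mask probabilities (sketch of Algorithm 2)."""

    def __init__(self, dim, lambda0=1.0):
        if lambda0 < 1.0:
            raise ValueError("the mode formula used here requires lambda0 >= 1")
        self.dim = dim
        self.lambda0 = lambda0
        self.reset_priors()

    def reset_priors(self):
        """Set alpha = beta = lambda0 for every coordinate."""
        self.alpha = [self.lambda0] * self.dim
        self.beta = [self.lambda0] * self.dim

    def update(self, masks, reset=False):
        """Aggregate K binary masks and return the posterior mode theta."""
        if reset:
            self.reset_priors()
        k = len(masks)
        for i in range(self.dim):
            m_agg = sum(mask[i] for mask in masks)  # number of ones at coordinate i
            self.alpha[i] += m_agg
            self.beta[i] += k - m_agg
        return [(a - 1.0) / (a + b - 2.0) for a, b in zip(self.alpha, self.beta)]
```

With $\lambda_0 = 1$ and a reset at every round, the returned value reduces to the plain average of the received masks; without resets, it becomes the running average over all masks received since the last reset.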

