FEDERATED GENERALIZED BAYESIAN LEARNING VIA DISTRIBUTED STEIN VARIATIONAL GRADIENT DESCENT

Abstract

This paper introduces Distributed Stein Variational Gradient Descent (DSVGD), a non-parametric generalized Bayesian inference framework for federated learning. DSVGD maintains a number of non-random and interacting particles at a central server to represent the current iterate of the global model posterior. The particles are iteratively downloaded and updated by one of the agents with the end goal of minimizing the global free energy. By varying the number of particles, DSVGD enables a flexible trade-off between per-iteration communication load and number of communication rounds. DSVGD is shown to compare favorably to benchmark frequentist and Bayesian federated learning strategies in terms of accuracy and scalability with respect to the number of agents, while also providing well-calibrated, and hence trustworthy, predictions. An overview of the benchmarks considered in the experiments is provided in Table 1.

Table 1: Overview of benchmarks used in the experiments.

1. INTRODUCTION

Federated learning refers to the collaborative training of a machine learning model across agents with distinct data sets, and it applies at different scales, from industrial data silos to mobile devices (Kairouz et al., 2019). While some common challenges exist, such as the general statistical heterogeneity ("non-iidness") of the distributed data sets, each setting also brings its own distinct problems. In this paper, we are specifically interested in a small-scale federated learning setting consisting of mobile or embedded devices, each having a limited data set and running a small-sized model due to constrained memory. As an example, consider the deployment of health monitors based on ECG data from smart watches. In this context, we argue that it is essential to tackle the following challenges, which are largely not addressed by existing solutions:
• Trustworthiness: In applications such as personal health assistants, the learning agents' recommendations need to be reliable and trustworthy, e.g., to decide when to contact a doctor in case of a possible emergency;
• Number of communication rounds: When models are small, the payload per communication round may not be the main contributor to the overall latency of the training process. Rather, accommodating many communication rounds, each requiring the arbitration of channel access among multiple devices, may yield slow wall-clock convergence (Lin et al., 2020).
Most existing federated learning algorithms, such as Federated Averaging (FedAvg) (McMahan et al., 2017), are based on frequentist principles, relying on the identification of a single model parameter vector. Frequentist learning is known to be unable to capture epistemic uncertainty, yielding overconfident decisions (Guo et al., 2017).
Furthermore, the focus of most existing works is on reducing the load per communication round via compression, rather than on decreasing the number of rounds by providing more informative updates at each round (Kairouz et al., 2019). This paper introduces a trustworthy solution that is able to reduce the number of communication rounds via a non-parametric variational inference-based implementation of federated Bayesian learning. Federated Bayesian learning has the general aim of computing the global posterior distribution in the model parameter space. Existing decentralized, or federated, Bayesian learning protocols are based either on Variational Inference (VI) (Angelino et al., 2016; Neiswanger et al., 2015; Broderick et al., 2013; Corinzia & Buhmann, 2019b) or on Monte Carlo (MC) sampling (Ahn et al., 2014; Mesquita et al., 2020; Wei & Conlon, 2019). State-of-the-art methods in either category include Partitioned Variational Inference (PVI), which has been recently introduced as a unifying distributed VI framework that relies on the optimization over parametric posteriors; and Distributed Stochastic Gradient Langevin Dynamics (DSGLD), an MC sampling technique that maintains a number of Markov chains updated via local Stochastic Gradient Descent (SGD) with the addition of Gaussian noise (Ahn et al., 2014; Welling & Teh, 2011). The performance of VI-based protocols is generally limited by the bias entailed by the variational approximation, while MC sampling is slow and suffers from the difficulty of assessing convergence (Angelino et al., 2016).

Figure 1: Federated learning across K agents equipped with local datasets and assisted by a central server: (a) in DVI, agents exchange the current model posterior q^(i)(θ) with the server, while (b) in DSVGD, agents exchange particles {θ_n}_{n=1}^N providing a non-parametric estimate of the posterior.
Stein Variational Gradient Descent (SVGD) has been introduced in (Liu & Wang, 2016) as a nonparametric Bayesian framework that approximates a target posterior distribution via non-random and interacting particles. SVGD inherits the flexibility of non-parametric Bayesian inference methods, while improving the convergence speed of MC sampling (Liu & Wang, 2016) . By controlling the number of particles, SVGD can provide flexible performance in terms of bias, convergence speed, and per-iteration complexity. This paper introduces a novel non-parametric distributed learning algorithm, termed Distributed Stein Variational Gradient Descent (DSVGD), that transfers the mentioned benefits of SVGD to federated learning. As illustrated in Fig. 1 , DSVGD targets a generalized Bayesian learning formulation, with arbitrary loss functions (Knoblauch et al., 2019) ; and maintains a number of non-random and interacting particles at a central server to represent the current iterate of the global posterior. At each iteration, the particles are downloaded and updated by one of the agents by minimizing a local free energy functional before being uploaded to the server. DSVGD is shown to enable (i) a trade-off between per-iteration communication load and number of communication rounds by varying the number of particles; while (ii) being able to make trustworthy decisions through Bayesian inference.

2. SYSTEM SET-UP

We consider the federated learning set-up in Fig. 1, where each agent k = 1, . . . , K has a distinct local dataset with associated training loss L_k(θ) for model parameter θ. The agents communicate through a central node with the goal of computing the global posterior distribution q(θ) over the shared model parameter θ ∈ R^d for some prior distribution p_0(θ) (Angelino et al., 2016). Specifically, following the generalized Bayesian learning framework (Knoblauch et al., 2019), the agents aim at obtaining the distribution q(θ) that minimizes the global free energy

min_{q(θ)} F(q(θ)) = Σ_{k=1}^K E_{θ∼q(θ)}[L_k(θ)] + α D(q(θ)||p_0(θ)),   (1)

where α > 0 is a temperature parameter. The (generalized, or Gibbs) global posterior q_opt(θ) solving problem (1) must strike a balance between minimizing the sum loss function (first term in F(q)) and the model complexity defined by the divergence from a reference prior (second term in F(q)). It is given as q_opt(θ) = (1/Z) q̃_opt(θ), with

q̃_opt(θ) = p_0(θ) exp(−(1/α) Σ_{k=1}^K L_k(θ)),   (2)

where Z denotes the normalization constant. It is useful to note that the global free energy can also be written as the scaled KL divergence F(q(θ)) = α D(q(θ)||q_opt(θ)), up to an additive constant that does not depend on q(θ). The main challenge in computing the optimal posterior q_opt(θ) in a distributed manner is that each agent k is only aware of its local loss L_k(θ). By exchanging information through the server, the K agents wish to obtain an estimate of the global posterior (2) without disclosing their local datasets either to the server or to the other agents. In this paper, we introduce a novel non-parametric distributed generalized Bayesian learning framework that addresses this challenge by integrating Distributed VI (DVI) and SVGD (Liu & Wang, 2016).
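The equivalence between the free energy and a scaled KL divergence can be checked by substituting (2) into (1):

```latex
\begin{aligned}
F(q) &= \sum_{k=1}^{K}\mathbb{E}_{\theta\sim q}\left[L_k(\theta)\right] + \alpha D(q\,\|\,p_0) \\
     &= \alpha\,\mathbb{E}_{\theta\sim q}\left[\log \frac{q(\theta)}{p_0(\theta)\exp\left(-\frac{1}{\alpha}\sum_{k=1}^{K} L_k(\theta)\right)}\right] \\
     &= \alpha\,\mathbb{E}_{\theta\sim q}\left[\log \frac{q(\theta)}{Z\, q_{\mathrm{opt}}(\theta)}\right]
      = \alpha\, D(q\,\|\,q_{\mathrm{opt}}) - \alpha \log Z ,
\end{aligned}
```

so that minimizing F(q) over q(θ) is equivalent to minimizing D(q||q_opt), whose unique minimizer is q_opt itself.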

3. DISTRIBUTED VARIATIONAL INFERENCE

In this section, we describe a general Expectation Propagation (EP)-based framework (Vehtari et al., 2020), which we term DVI, that aims at computing the global posterior in a federated fashion (Bui et al., 2018; Corinzia & Buhmann, 2019b). DVI starts from the observation that the posterior (2) factorizes as the product

q(θ) = p_0(θ) Π_{k=1}^K t_k(θ),   (3)

where the term t_k(·) is given by the scaled local likelihood exp(−α^{-1} L_k(θ))/Z^{1/K}. Since the normalization constant Z depends on all data sets, the true scaled local likelihood t_k(·) cannot be directly computed at agent k. The idea of DVI is to iteratively update approximate likelihood factors t_k(θ) for k = 1, ..., K by means of local optimization steps at the agents and communication through the server, with the aim of minimizing the global free energy (1) over distributions of the form (3). We give here the standard implementation of DVI in which a single agent is scheduled at each time, although parallel implementations are possible and are discussed below. Accordingly, at each communication round i = 1, 2, ..., the server maintains the current iterate q^(i-1)(θ) of the global posterior and schedules an agent k ∈ {1, 2, . . . , K}, which proceeds as follows:
1. Agent k downloads the current global variational posterior distribution q^(i-1)(θ) from the server (see Fig. 1(a), step 1);
2. Agent k updates the global posterior by minimizing the local free energy F_k^(i)(q(θ)) (see Fig. 1(a), step 2)

q^(i)(θ) = argmin_{q(θ)} F_k^(i)(q(θ)) = E_{θ∼q(θ)}[L_k(θ)] + α D(q(θ)||p̂_k^(i)(θ)),   (4)

where we have defined the (unnormalized) cavity distribution p̂_k^(i)(θ) as

p̂_k^(i)(θ) = q^(i-1)(θ) / t_k^(i-1)(θ).   (5)

The cavity distribution p̂_k^(i)(θ), which removes the contribution of the current approximate likelihood of agent k from the current global posterior iterate, serves as a prior for the update in (4).
In a manner similar to (2), the local free energy is minimized by the tilted distribution p_k^(i)(θ) ∝ p̃_k^(i)(θ) with

p̃_k^(i)(θ) = p̂_k^(i)(θ) exp(−(1/α) L_k(θ));   (6)

3. Agent k sends the updated posterior q^(i)(·) = p_k^(i)(·) to the server (see Fig. 1(a), step 3), and updates its approximate likelihood accordingly as

t_k^(i)(θ) = (q^(i)(θ) / q^(i-1)(θ)) t_k^(i-1)(θ).   (7)

Finally, non-scheduled agents k' ≠ k set t_{k'}^(i)(θ) = t_{k'}^(i-1)(θ), and the server sets the next iterate as q^(i)(θ). We have the following key property of DVI.

Theorem 1. The global posterior q_opt(θ) in (2) is the unique fixed point of the DVI algorithm.

The fixed-point property in Theorem 1 can be verified directly by setting q^(i-1)(θ) = q_opt(θ) and t_k^(i-1)(θ) = exp(−α^{-1} L_k(θ))/Z^{1/K}, and by observing that this leads to the fixed-point condition q^(i)(θ) = q^(i-1)(θ) = q_opt(θ). The proof is provided in Sec. A.6. Importantly, this property is not tied to the sequential implementation detailed above, and it applies also if multiple devices are scheduled in parallel, as long as one sets the next iterate as

q^(i)(θ) = p_0(θ) Π_{k∈K^(i)} t_k^(i)(θ) Π_{k'∉K^(i)} t_{k'}^(i)(θ),

where K^(i) denotes the set of scheduled agents at communication round i, and we have t_{k'}^(i)(θ) = t_{k'}^(i-1)(θ) for non-scheduled agents k' ∉ K^(i), while t_k^(i)(θ) is updated following (7) for k ∈ K^(i).

4. PRELIMINARIES

In this section, we briefly review PVI, which serves as an important benchmark, and SVGD, on which we build the proposed Bayesian federated learning solution.

4.1. PARTITIONED VARIATIONAL INFERENCE

The exact minimization of the local free energy (4) assumed by DVI is often not tractable. To address this problem, in its most typical form, PVI constrains the local free energy minimization (4) to the space of parametric distributions that factorize as

q(θ|η) = p_0(θ|η_0) Π_{k=1}^K t_k(θ|η_k),

where the prior p_0(·|η_0) = ExpFam(·|η_0) and the approximate likelihoods t_k(·|η_k) = ExpFam(·|η_k) are selected from the same exponential-family distribution, with natural parameters η_0 and η_k, respectively. PVI follows the same steps as DVI, with the caveat that the local free energy (4) for agent k is minimized over the natural parameter η. This can be done efficiently, albeit approximately, using, e.g., natural gradient descent (Amari, 1998). The bias imposed by the parametrization in PVI significantly affects the quality of the approximation of the obtained posterior q(θ) with respect to the true global posterior q_opt(θ) in the presence of model misspecification. In this case, the fixed-point property in Theorem 1 no longer applies.

4.2. STEIN VARIATIONAL GRADIENT DESCENT (SVGD)

SVGD tackles the minimization of the (scaled) free energy functional D(q(θ)||p(θ)), for a possibly unnormalized target distribution p̃(θ) ∝ p(θ), over a non-parametric generalized posterior q(θ) defined over the model parameters θ ∈ R^d. The posterior q(θ) is represented by a set of particles {θ_n}_{n=1}^N, with θ_n ∈ R^d. In practice, an approximation of q(θ) can be obtained from the particles {θ_n}_{n=1}^N through a Kernel Density Estimator (KDE) as q(θ) = N^{-1} Σ_{n=1}^N K(θ, θ_n) for some kernel function K(·,·) (Bishop, 2006). The particles are iteratively updated through a series of transformations that are optimized to minimize the free energy. The transformations are restricted to lie within the unit ball of a Reproducing Kernel Hilbert Space (RKHS) H^d = H × . . . × H. It is shown by Liu & Wang (2016) that this optimization yields the SVGD update

θ_n^[l] ← θ_n^[l-1] + (ε/N) Σ_{j=1}^N [ k(θ_j^[l-1], θ_n^[l-1]) ∇_{θ_j} log p̃(θ_j^[l-1]) + ∇_{θ_j} k(θ_j^[l-1], θ_n^[l-1]) ],   (8)

for n = 1, . . . , N, where ε is a learning rate and k(·,·) is the positive definite kernel associated with the RKHS H. The first term in the update (8) drives the particles towards high-probability regions of the target distribution p(θ), while the second term drives the particles away from each other, encouraging exploration of the model parameter space. It is known that, in the asymptotic limit of a large number N of particles, the empirical distribution encoded by the particles {θ_n^[l]}_{n=1}^N converges to the normalized target distribution p(θ) ∝ p̃(θ) (Liu, 2017b).
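As an illustration, the update above can be sketched in a few lines of NumPy; this is a minimal single-node sketch, in which the Gaussian target, the step size, and the bandwidth rule are illustrative assumptions rather than choices taken from the paper:

```python
import numpy as np

def svgd_step(theta, grad_log_p, eps):
    # One SVGD update: kernel-weighted driving term plus repulsive term.
    n = theta.shape[0]
    sq = np.sum((theta[:, None, :] - theta[None, :, :]) ** 2, axis=-1)
    h = np.median(sq) / np.log(n + 1) + 1e-8       # median-heuristic bandwidth
    K = np.exp(-sq / h)                            # RBF Gram matrix
    scores = grad_log_p(theta)                     # rows are grad log p~(theta_j)
    # Sum over j of grad_{theta_j} k(theta_j, theta_n) = (2/h) sum_j K_jn (theta_n - theta_j)
    repulsion = (2.0 / h) * (K.sum(axis=1, keepdims=True) * theta - K @ theta)
    return theta + (eps / n) * (K @ scores + repulsion)

# Toy usage: 50 one-dimensional particles drift toward a standard Gaussian,
# whose score is grad log N(0, 1) = -theta.
rng = np.random.default_rng(0)
theta = rng.normal(loc=5.0, scale=1.0, size=(50, 1))
for _ in range(1000):
    theta = svgd_step(theta, lambda t: -t, eps=0.5)
```

The repulsive term is what distinguishes SVGD from running N independent gradient ascents on log p̃: it prevents the particles from collapsing onto the mode.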

5. DISTRIBUTED STEIN VARIATIONAL GRADIENT DESCENT

In this section, we introduce DSVGD, a novel distributed algorithm that tackles the generalized Bayesian inference problem (1) via DVI over a non-parametric particle-based representation of the global posterior. As illustrated in Fig. 1(b), DSVGD is based on the iterative optimization of local free energy functionals (4) via SVGD (see Sec. 4), and on the exchange of particles between the central server and agents. Given the flexibility of the non-parametric form of the posterior, DSVGD does not suffer from the bias caused by the parametrization assumed by PVI. As a result, in the limit of a sufficiently large number of particles, DSVGD benefits from the fixed-point property of DVI stated in Theorem 1, recovering the true global posterior as a fixed point of its iterations. Furthermore, as we will discuss, DSVGD enables devices to exchange more informative messages regarding the current iterate of the posterior by increasing the number of particles. This can in turn reduce the number of communication rounds and the overall communication load to convergence, at the cost of a larger per-round load. In this regard, we note that, in practice, a small number of particles is sufficient to obtain state-of-the-art performance (Liu & Wang, 2016), as verified in Sec. 7. In order to facilitate the presentation, we first introduce a simpler version of DSVGD that has the practical drawback of requiring each agent to store a number of particles that increases linearly with the number of iterations in which the agent is scheduled. Then, we present a more practical algorithm, for which the memory requirements do not scale with the number of iterations, as each agent must only memorize a set of N local particles across different iterations.
Algorithm 1: Distributed Stein Variational Gradient Descent (DSVGD)
Input: prior p_0(θ), local loss functions {L_k(θ)}_{k=1}^K, temperature α > 0, kernels K(·,·) and k(·,·)
Output: global approximate posterior q(θ) = N^{-1} Σ_{n=1}^N K(θ, θ_n)
initialize q^(0)(θ) = p_0(θ); {θ_n^(0)}_{n=1}^N i.i.d. ∼ p_0(θ); local particles {θ_{k,n}^(0) = θ_n^(0)}_{n=1}^N for each agent k
for each communication round i = 1, 2, . . . , I do
    the server schedules an agent k
    agent k downloads the global particles {θ_n^(i-1)}_{n=1}^N and updates them via the SVGD steps (13) using {θ_n^(i-1)}_{n=1}^N and {θ_{k,n}^(i-1)}_{n=1}^N
    agent k sends the updated global particles {θ_n^(i)}_{n=1}^N to the server
    agent k carries out distillation to obtain {θ_{k,n}^(i)}_{n=1}^N encoding t_k^(i)(θ) using (17) and {θ_n^(i)}_{n=1}^N
end
return q(θ) = N^{-1} Σ_{n=1}^N K(θ, θ_n^(I))

The algorithmic table for Unconstrained-DSVGD (U-DSVGD), in addition to discussions on complexity and convergence, can be found in Sec. A.1 and Sec. A.4 of the supplementary materials, respectively. A direct extension of DSVGD, termed Parallel-DSVGD (P-DSVGD), where multiple agents are scheduled per round, can be found in Sec. A.5 of the Appendix.

5.1. U-DSVGD

In this section, we present a simplified DSVGD variant, which we refer to as U-DSVGD. We follow the standard implementation of DVI with a single agent k scheduled at each communication round i = 1, 2, . . ., although, as discussed, parallel implementations are also possible. Let us define as I_k^(i) ⊆ {1, . . . , i} the subset of rounds at which agent k is scheduled prior to, and including, iteration i. At the beginning of each round i, the server maintains the current global particles {θ_n^(i-1)}_{n=1}^N, while each agent k keeps a local buffer of particles {θ_n^(j-1), θ_n^(j)}_{n=1}^N for all previous rounds j ∈ I_k^(i-1) at which agent k was scheduled. The growing memory requirements at the agents will be dealt with by the final version of DSVGD, to be introduced in Sec. 5.2. As illustrated in Fig. 1(b), at each iteration i, U-DSVGD schedules an agent k ∈ {1, 2, . . . , K} and carries out the following steps.
1. Agent k downloads the current global particles {θ_n^(i-1)}_{n=1}^N from the server (see Fig. 1(b), step 1) and includes them in the local buffer.
2. Agent k updates each downloaded particle as

θ_n^[l] ← θ_n^[l-1] + ε φ(θ_n^[l-1]), for l = 1, . . . , L,   (9)

where L is the number of local iterations; [l] denotes the local iteration index; ε is a learning rate; we have the initialization θ_n^[0] = θ_n^(i-1); and the function φ(·) is to be optimized within the unit ball of an RKHS H^d. The function φ(·) is specifically optimized to maximize the steepest descent decrease of a particle-based approximation of the local free energy (4). To elaborate, we denote as q^(i-1)(θ) = N^{-1} Σ_{n=1}^N K(θ, θ_n^(i-1)) the KDE of the current global posterior iterate encoded by particles {θ_n^(i-1)}_{n=1}^N. Adopting the factorization (3) for the global posterior (cf. (7)), we define the current local approximate likelihood

t_k^(i-1)(θ) = Π_{j∈I_k^(i-1)} q^(j)(θ)/q^(j-1)(θ) = (q^(i-1)(θ)/q^(i-2)(θ)) t_k^(i-2)(θ),   (10)

where the last equality holds when agent k was scheduled at round i-1.
Note that (10) can be computed using all the particles in the buffer at agent k at iteration i. Finally, the (unnormalized) tilted distribution p̃_k^(i) (cf. (6)) is written as

p̃_k^(i)(θ) = (q^(i-1)(θ)/t_k^(i-1)(θ)) exp(−(1/α) L_k(θ)).   (11)

Following SVGD, the update (9) is optimized to maximize the steepest descent decrease of the Kullback-Leibler (KL) divergence between the approximate global posterior q_φ^[l](θ) encoded via particles {θ_n^[l]}_{n=1}^N and the tilted distribution p̃_k^(i)(θ) in (11) (see Fig. 1(b), step 2), i.e.,

φ(·) ← argmax_{φ(·)∈H^d} { −(d/dε) D(q_φ^[l-1](θ)||p_k^(i)(θ)) }, s.t. ||φ||_{H^d} ≤ 1.   (12)

Thus, recalling (8), the particles are updated as

θ_n^[l] ← θ_n^[l-1] + (ε/N) Σ_{j=1}^N [ k(θ_j^[l-1], θ_n^[l-1]) ∇_{θ_j} log p̃_k^(i)(θ_j^[l-1]) + ∇_{θ_j} k(θ_j^[l-1], θ_n^[l-1]) ], for l = 1, . . . , L.   (13)

3. Agent k sets θ_n^(i) = θ_n^[L] for n = 1, . . . , N. Particles {θ_n^(i)}_{n=1}^N are added to the buffer and sent to the server (see Fig. 1(b), step 3), which updates the current global particles as {θ_n}_{n=1}^N = {θ_n^(i)}_{n=1}^N.

In order to implement the described U-DSVGD algorithm, we need to compute the gradient in (13) at agent k. First, by (11), we have

∇_θ log p̃_k^(i)(θ) = ∇_θ log q^(i-1)(θ) − ∇_θ log t_k^(i-1)(θ) − (1/α) ∇_θ L_k(θ).   (14)

Using (10), the second gradient term can be obtained in a recursive manner using the local buffer as

∇_θ log t_k^(i-1)(θ) = ∇_θ log t_k^(i-2)(θ), if agent k is not scheduled at iteration (i-1);
∇_θ log t_k^(i-1)(θ) = ∇_θ log t_k^(i-2)(θ) + ∇_θ log q^(i-1)(θ) − ∇_θ log q^(i-2)(θ), otherwise.   (15)

Figure 2: Gaussian toy example with uniform prior and K = 2, showing the posterior iterates at rounds i = 0, . . . , 4 (L = 100). Dashed lines represent local posteriors, the shaded area represents the true global posterior, while the solid blue line is the approximate posterior obtained using a KDE over the particles. DSVGD schedules agents 1 and 2 at odd and even communication rounds i, respectively.
Finally, the gradients ∇_θ log q^(j)(θ) can be directly computed from the KDE expression of q^(j)(θ), with initializations t_k^(0)(θ) = 1 and q^(0)(θ) = p_0(θ). The inner loop of U-DSVGD inherits the asymptotic convergence properties of SVGD in terms of local free energies, but existing results do not imply that the global free energy decreases across the iterations. This result is provided in the next theorem, whose precise formulation can be found in Sec. A.6 of the Appendix.

Theorem 2 (Guaranteed per-iteration decrease of the global free energy). The decrease in the global free energy from local iteration l to l+1 during a communication round i in which agent k is scheduled can be lower bounded as

F(q^[l](θ)) − F(q^[l+1](θ)) ≥ α S(q^[l], p_k^(i))(1 − γ) − 2α(K−1) ℓ_max^(i) √(2 D(q^[l+1]||q^[l])),   (16)

where ℓ_max^(i) = sup_θ max_{m≠k} |log(t_m^(i-1)(θ) exp((1/α) L_m(θ)))|, S(q, p) denotes the Kernelized Stein Discrepancy between distributions q and p (Liu et al., 2016), and γ is a constant depending on the RKHS kernel and the target distribution.

The first term in bound (16) quantifies the decrease in the local free energy at agent k, which depends on the "distance" between the current iterate q^[l] and the local target given by the tilted distribution p_k^(i)(θ); the second term quantifies the effect of the update on the local free energies of the other agents. In the presence of only one agent, the second term reduces to zero, and one recovers the guaranteed per-iteration improvement for SVGD derived in Korba et al. (2020).
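The KDE-based gradients ∇_θ log q^(j)(θ) used in (14)-(15) admit a simple closed form for a Gaussian kernel; a minimal sketch (bandwidth and particle values are illustrative):

```python
import numpy as np

def kde_log_density_grad(theta, particles, bw):
    # Score of the Gaussian KDE q(theta) = N^{-1} sum_n K(theta, theta_n):
    # grad log q(theta) = sum_n w_n(theta) (theta_n - theta) / bw^2,
    # with softmax weights w_n proportional to K(theta, theta_n).
    diffs = particles - theta                            # (N, d)
    logk = -np.sum(diffs ** 2, axis=1) / (2 * bw ** 2)   # log-kernel values
    w = np.exp(logk - logk.max())
    w /= w.sum()                                         # normalized responsibilities
    return (w[:, None] * diffs).sum(axis=0) / bw ** 2    # (d,)

# Sanity check: with a single particle the KDE is a Gaussian centered at it,
# so the score at theta is (theta_1 - theta) / bw^2.
g = kde_log_density_grad(np.zeros(2), np.array([[1.0, 0.0]]), bw=0.5)
```

Subtracting the maximum log-kernel value before exponentiating keeps the weights numerically stable when the query point is far from all particles.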

5.2. DSVGD

In this section, we describe the final version of DSVGD, which, unlike U-DSVGD, requires each agent k to maintain only N local particles {θ_{k,n}}_{n=1}^N across iterations. This is achieved through a form of model distillation (Hinton et al., 2015; Chen & Chao, 2020) via SVGD. Specifically, L' additional SVGD steps are used to approximate the factor t_k^(i)(θ) using the N local particles {θ_{k,n}^(i)}_{n=1}^N. It is noted that this approximation step is not necessarily harmful to the overall performance, since describing the factor t_k^(i)(θ) with fewer particles can have a denoising effect, acting as a regularizer. DSVGD operates as U-DSVGD apart from the computation of the gradient in (14) and the management of the local particle buffers. The key idea is that, instead of using the recursion (15) to compute (14), DSVGD computes the gradient ∇_θ log t_k^(i-1)(θ) from the KDE t_k^(i-1)(θ) = N^{-1} Σ_{n=1}^N K(θ, θ_{k,n}^(i-1)) based on the local particles {θ_{k,n}^(i-1)}_{n=1}^N in the buffer. At the end of each round i, the local particles {θ_{k,n}^(i-1)}_{n=1}^N are updated by running L' local SVGD iterations with target given by the updated local factor

t_k^(i)(θ) = (q^(i)(θ)/q^(i-1)(θ)) t_k^(i-1)(θ).   (17)

This amounts to the updates

θ_{k,n}^[l'] ← θ_{k,n}^[l'-1] + (ε'/N) Σ_{j=1}^N [ k(θ_{k,j}^[l'-1], θ_{k,n}^[l'-1]) ∇_{θ_{k,j}} log t_k^(i)(θ_{k,j}^[l'-1]) + ∇_{θ_{k,j}} k(θ_{k,j}^[l'-1], θ_{k,n}^[l'-1]) ],

for l' = 1, . . . , L' and some learning rate ε', where the gradient

∇_θ log t_k^(i)(θ) = ∇_θ log q^(i)(θ) + ∇_θ log t_k^(i-1)(θ) − ∇_θ log q^(i-1)(θ)

can be directly computed using KDEs based on the available particles {θ_n^(i)}_{n=1}^N (updated global particles), {θ_{k,n}^(i-1)}_{n=1}^N (local particles) and {θ_n^(i-1)}_{n=1}^N (downloaded global particles). Finally, we note that the distillation operation can be performed after sending the updated global particles to the server, thus enabling pipelining of the L' local iterations with operations at the server and other agents. DSVGD is summarized in Algorithm 1.

Extensions of SVGD.
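The distillation target score above is just a sum and difference of KDE scores over the three available particle sets; a schematic sketch (function names and the bandwidth value are illustrative, and the KDE score helper is repeated for self-containment):

```python
import numpy as np

def kde_score(theta, particles, bw):
    # grad_theta log of a Gaussian KDE N^{-1} sum_n K(theta, theta_n).
    diffs = particles - theta
    w = np.exp(-np.sum(diffs ** 2, axis=1) / (2 * bw ** 2))
    w /= w.sum()
    return (w[:, None] * diffs).sum(axis=0) / bw ** 2

def distillation_target_score(theta, new_global, old_local, old_global, bw=0.55):
    # grad log t_k^(i) = grad log q^(i) + grad log t_k^(i-1) - grad log q^(i-1),
    # cf. (17), with each term evaluated through its KDE.
    return (kde_score(theta, new_global, bw)
            + kde_score(theta, old_local, bw)
            - kde_score(theta, old_global, bw))
```

When the global particles are unchanged by the round, the first and third terms cancel and the target reduces to the previous local factor, as expected from (17).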
Since its introduction, SVGD has been extended in various directions. Most related to this work is Zhuo et al. (2018) , which introduces a message-passing SVGD solution for high-dimensional latent parameter spaces by leveraging conditional independence properties in the variational posterior; and Yoon et al. (2018) , which uses SVGD as the per-task base learner in a metalearning algorithm approximating Expectation Maximization.

6. RELATED WORK

Generalized Bayesian Inference. Owing to their reliance on point estimates in the model parameter space, frequentist learning methods, such as Federated Stochastic Gradient Descent (FedSGD), FedAvg and their extensions (Zhang et al., 2020; Li et al., 2018; Pathak & Wainwright, 2020; Nguyen et al., 2020; Wang et al., 2020), are limited in their capacity to combat overfitting and quantify uncertainty (Guo et al., 2017; Mitros & Mac Namee, 2019; Neal, 2012; Jospin et al., 2020; MacKay, 2002). This contrasts with the generalized Bayesian inference framework, which produces distributional, rather than point, estimates by optimizing the free energy functional, a theoretically principled bound on the generalization performance (Zhang, 2006; Knoblauch et al., 2019). Practical algorithms for generalized Bayesian inference can leverage computationally efficient and scalable solutions based on either MC sampling or VI methods (Angelino et al., 2016; Alquier et al., 2016).

Distributed MC Sampling. The design of algorithms for distributed Bayesian learning has so far mostly focused on one-shot, or "embarrassingly parallel", solutions under ideal communications (Jordan et al., 2019). These implement distributed MC "consensus" protocols, whereby samples from the global posterior are approximately synthesized by combining particles from local posteriors (Scott et al., 2016; Liu & Ihler, 2014). Iterative extensions, such as Weierstrass sampling (Wang & Dunson, 2013; Rendell et al., 2018), impose consistency constraints across devices and iterations in a way similar to the Alternating Direction Method of Multipliers (ADMM) (Angelino et al., 2016). State-of-the-art results have been obtained via DSGLD (Ahn et al., 2014).

Distributed VI Learning. Considering first one-shot fusion of local models, Bayesian methods have been used to deal with parameter invariance and weight matching (Yurochkin et al., 2019; Claici et al., 2020).
Iterative VI methods such as Streaming Variational Bayes (SVB) (Broderick et al., 2013) provide a VI-based framework for the exponential family to combine local models into global ones. PVI provides a general framework that can implement SVB, as well as online VI (Bui et al., 2018), and has been extended to multi-task learning in Corinzia & Buhmann (2019a).

7. EXPERIMENTS

As in Liu & Wang (2016), for all our experiments with SVGD and DSVGD, we use the Radial Basis Function (RBF) kernel k(x, x′) = exp(−||x − x′||_2^2 / h). The bandwidth h is adapted to the set of particles used in each update by setting h = med^2 / log n, where med is the median of the pairwise distances between the particles in the current iterate. The Gaussian kernel K(·,·) used for the KDEs has a bandwidth equal to 0.55. Unless specified otherwise, we use AdaGrad with momentum to choose the learning rates ε and ε′ for (U-)DSVGD. Throughout, we fix the temperature parameter α = 1 in (1). Finally, to ensure a fair comparison with distributed schemes, we run centralized schemes for the same total number I × L of iterations across all experiments. Additional results for all experiments can be found in Appendix B in the supplementary materials, which also includes additional implementation details.
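The median bandwidth heuristic can be sketched as follows; this is a minimal illustration, and the paper's implementation details may differ:

```python
import numpy as np

def median_bandwidth(particles):
    # h = med^2 / log n, with med the median pairwise distance between the
    # n particles (distinct pairs only, i.e., excluding the zero diagonal).
    n = len(particles)
    sq = np.sum((particles[:, None, :] - particles[None, :, :]) ** 2, axis=-1)
    med = np.sqrt(np.median(sq[np.triu_indices(n, k=1)]))
    return med ** 2 / np.log(n)

# Example: three 1-D particles at 0, 1, 2 have pairwise distances 1, 2, 1,
# so med = 1 and h = 1 / log 3.
h = median_bandwidth(np.array([[0.0], [1.0], [2.0]]))
```

Adapting h to the current particle set keeps the kernel informative as the particles contract toward high-probability regions of the target.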

Gaussian 1D mixture toy example.

We start by considering a simple one-dimensional mixture model in which the unnormalized local posteriors p̃_k(θ) = p_0(θ) exp(−α^{-1} L_k(θ)) at each agent k are defined as p̃_1(θ) = p_0(θ) N(θ|1, 4) and p̃_2(θ) = p_0(θ)(N(θ|−3, 1) + N(θ|3, 2)), and the prior p_0(θ) is uniform over [−6, 6], i.e., p_0(θ) = U(θ|−6, 6). The local posteriors are shown in Fig. 2 as dashed lines, along with the global posterior q_opt(θ) ∝ q̃_opt(θ) in (2), which is represented as a shaded area. We fix the number of particles to N = 200. The approximate posteriors obtained from the KDE over the global particles are plotted in Fig. 2 as solid lines. It can be observed that, at each round, the global posterior updated by DSVGD integrates the local likelihood of the scheduled agent, while still preserving information about the likelihood of the other agent from prior iterates, until (approximate) convergence to the true global posterior q_opt, which is a normalized version of q̃_opt in (2), is reached. Finally, in Fig. 3, we plot the KL divergence between q(θ) and q_opt(θ) as a function of the number of rounds. Both U-DSVGD and DSVGD exhibit similar behaviour, converging to SVGD and outperforming the parametric counterparts PVI and Global Variational Inference (GVI) (Bui et al., 2018).
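For concreteness, the local targets of this toy example can be written down directly; this is a sketch in which the second Gaussian parameter is read as a variance and the mixture components for agent 2 are weighted equally, both of which are assumptions:

```python
import numpy as np

def gauss(theta, mu, var):
    # Density N(theta | mu, var), with the second parameter read as a variance.
    return np.exp(-(theta - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def local_posterior(theta, k):
    # Unnormalized local posteriors p~_k of the toy example; the uniform prior
    # over [-6, 6] contributes the factor 1/12 on its support.
    prior = np.where((theta >= -6) & (theta <= 6), 1.0 / 12.0, 0.0)
    if k == 1:
        return prior * gauss(theta, 1.0, 4.0)
    return prior * (gauss(theta, -3.0, 1.0) + gauss(theta, 3.0, 2.0))

# With alpha = 1, the unnormalized global posterior (2) is
# p_0 * exp(-L_1) * exp(-L_2) = p~_1 * p~_2 / p_0, i.e., the product of the
# local posteriors with one prior factor divided out.
grid = np.linspace(-6.0, 6.0, 241)
q_opt_tilde = local_posterior(grid, 1) * local_posterior(grid, 2) * 12.0
```

Evaluating q̃_opt on a grid in this way is what makes the KL-divergence comparison in Fig. 3 possible for a one-dimensional problem.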

Bayesian logistic regression. We now consider Bayesian logistic regression with local training loss

L_k(θ) = Σ_{(x_k, y_k)∈D_k} l(x_k, y_k, θ),

where D_k is the dataset at agent k with covariates x_k ∈ R^d and labels y_k ∈ {−1, 1}, and the loss function l(x_k, y_k, θ) is the cross-entropy. Point decisions are taken based on the maximum of the average predictive distribution. We consider the datasets Covertype and Twonorm (Gershman et al., 2012). We randomly split the training dataset into partitions of equal size among the K agents. We also include FedAvg, Stochastic Gradient Langevin Dynamics (SGLD) and DSGLD for comparison. We note that FedAvg is implemented here for consistency with the other schemes by scheduling a single agent at each step. In Fig. 4, we study how the accuracy evolves as a function of the number of communication rounds i across different datasets, using N = 2 and N = 6 particles. We observe that DSVGD consistently outperforms the mentioned decentralized benchmarks and that, in contrast to FedAvg and DSGLD, its performance scales well with the number K of agents. Furthermore, the number N of particles is seen to control the trade-off between the communication load, which increases with N, and the convergence speed, which improves as N grows larger. By reducing the number of communication rounds, DSVGD can also reduce the overall communication load. For example, in the third plot in Fig. 4, DSVGD reaches an accuracy of 70% after 5 communication rounds with N = 6, requiring the exchange of 30 particles. In contrast, FedAvg requires around 100 rounds to obtain the same accuracy, making its total communication load much higher than that of DSVGD. To capture heterogeneous datasets with non-i.i.d. data, we now consider different dataset partitions across K = 4 agents. In the homogeneous case, labels are split equally among agents, while, in the heterogeneous case, each agent stores 40% of one label and 10% of the other. DSVGD is seen in Fig.
5 to have a robust performance against heterogeneity as compared to FedAvg, whose convergence speed is severely affected. This result hinges on the fact that Bayesian learning provides a predictive distribution that is a more accurate estimate of the ground-truth posterior distribution. This is true irrespective of the level of "non-iidness": Bayesian learning can account in a principled way for all competing "explanations" provided by different devices. This is in contrast to FedAvg, whose reliance on a point estimate of the parameters yields an overconfident predictive distribution that cannot properly account for the diversity of predictions provided by different devices.


Bayesian Neural Networks. We now consider regression and multi-label classification with Bayesian Neural Network (BNN) models. The experimental setup is the same as in Hernández-Lobato & Adams (2015), with the only exception that the prior on the weights is set to p_0(w) = N(w|0, λ^(−1) I_d) with a fixed precision λ = e. We plot the average Root Mean Square Error (RMSE) for K = 2 and K = 20 agents in Fig. 6 for regression over the Kin8nm and Year datasets, and the accuracy for multi-label classification on the MNIST and Fashion MNIST datasets in Fig. 7. Confirming the results for logistic regression, DSVGD consistently outperforms the other decentralized benchmarks in terms of RMSE and accuracy, while being more robust in terms of convergence speed to an increase in the number of agents. Calibration. Reliability plots are a common visual tool used to quantify and visualize model calibration (Guo et al., 2017). They report the average sample accuracy as a function of the confidence level of the model. Perfect calibration yields an accuracy equal to the corresponding confidence (dashed line in Fig. 8). Fig. 8 shows the reliability plots for FedAvg and DSVGD on the Fashion MNIST dataset for the BNN setting. While increasing the number of hidden neurons negatively affects FedAvg due to overfitting, DSVGD enjoys excellent calibration even for large models and is hence able to make trustworthy predictions.

A.2 A RELATIONSHIP BETWEEN PVI AND U-DSVGD

We show here that PVI with a Gaussian variational posterior q(θ|η) = N(θ|λ²η, λ² I_d) of fixed covariance λ² I_d and mean λ²η, parametrized by the natural parameter η, can be recovered as a special case of U-DSVGD. To elaborate, consider U-DSVGD with one particle θ_1 (i.e., N = 1); an RKHS kernel that satisfies ∇_θ k(θ, θ) = 0 and k(θ, θ) = 1 (the RBF kernel is an example of such a kernel); and an isotropic Gaussian kernel K(θ, θ^(i)_1) = N(θ|θ^(i)_1, λ² I_d) of bandwidth λ used for computing the KDE of the global posterior from the particles. The U-DSVGD particle update in (13) reduces to the following single-particle update:

θ^[l]_1 ← θ^[l−1]_1 + ε ∇_θ log p̃^(i)_k(θ^[l−1]_1), for l = 1, . . . , L,

with tilted distribution

p̃^(i)_k(θ) ∝ (q^(i−1)(θ) / t^(i−1)_k(θ)) exp(−(1/α) L_k(θ)). (19)

The numerator in (19) can be rewritten as q^(i−1)(θ) = K(θ, θ^(i−1)_1) = q(θ|η^(i−1)) with η^(i−1) = λ^(−2) θ^(i−1)_1, while the denominator can be rewritten as

t^(i−1)_k(θ) = ∏_{j∈I^(i−1)_k} q(θ|η^(j)) / q(θ|η^(j−1)) = t_k(θ|η^(i−1)_k), with η^(i−1)_k = Σ_{j∈I^(i−1)_k} (η^(j) − η^(j−1)).

This recovers the PVI update (6).

A.3 RELIABILITY PLOTS

In this part, we give some background on reliability plots and the Maximum Calibration Error (MCE). Reliability plots are a visual tool to evaluate model calibration (DeGroot & Fienberg, 1983; Niculescu-Mizil & Caruana, 2005). Consider a model that outputs a prediction ŷ(x_i) and a probability p(x_i) of correct detection for an input x_i with true label y_i. We divide the test samples into bins {B_j}_{j=1}^B, each bin B_j containing the indices of all samples whose prediction confidence falls into the interval ((j−1)/B, j/B], where B is the total number of bins. Reliability plots evaluate the accuracy as a function of the confidence, which are defined respectively as

acc(B_j) = (1/|B_j|) Σ_{i∈B_j} 1{ŷ(x_i) = y_i} and conf(B_j) = (1/|B_j|) Σ_{i∈B_j} p(x_i).

Perfect calibration means that the accuracy is equal to the confidence across all bins. For example, given 100 predictions, each with confidence approximately 0.7, one should expect around 70% of these predictions to be correctly classified. To compute p(x), we need the predictive probability p(y_t|x_t) for all samples t ∈ [1; T]. This can be obtained by marginalizing the data likelihood with respect to the weight vector w. This marginalization is generally intractable, but can be approximated for both Bayesian logistic regression and Bayesian Neural Networks, as detailed in Sec. A.3.1 and Sec. A.3.2. While reliability plots are a useful tool to visually represent the calibration of a model, it is often desirable to have a single scalar measure of miscalibration. In this paper, we use the MCE, which measures the worst-case deviation of the model from perfect calibration (Guo et al., 2017). Mathematically, the MCE is defined as

MCE = max_{j∈{1,...,B}} |acc(B_j) − conf(B_j)|.

Additional numerical results using both reliability plots and MCE can be found in Sec. B.5.
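The binning, per-bin accuracy/confidence, and MCE described above can be sketched as follows (a minimal NumPy version; the function name and the handling of empty bins are our choices):

```python
import numpy as np

def reliability_and_mce(confidences, correct, num_bins=10):
    """Per-bin accuracy/confidence and the Maximum Calibration Error.

    `confidences` holds p(x_i), the model's probability of correct detection,
    and `correct` is 1 when the prediction matches the true label.
    Bin j collects the samples with confidence in ((j-1)/B, j/B].
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # np.digitize with right=True assigns a confidence c to the (left, right]
    # interval delimited by the interior bin edges.
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    bins = np.digitize(confidences, edges[1:-1], right=True)
    acc = np.full(num_bins, np.nan)
    conf = np.full(num_bins, np.nan)
    for j in range(num_bins):
        mask = bins == j
        if mask.any():
            acc[j] = correct[mask].mean()
            conf[j] = confidences[mask].mean()
    mce = np.nanmax(np.abs(acc - conf))  # worst-case calibration gap
    return acc, conf, mce
```

Empty bins are left as NaN and ignored by the max, which matches the convention of only evaluating populated bins.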

A.3.1 PREDICTIVE DISTRIBUTION FOR BAYESIAN LOGISTIC REGRESSION WITH SVGD AND DSVGD

In this section, we show how the predictive distribution for the Bayesian logistic regression experiment can be obtained when using DSVGD or SVGD. The predictive distribution provides the confidence values used in the calibration experiment. Given a KDE of the posterior q(w) = (1/N) Σ_{n=1}^N K(w, w_n) with N particles {w_n}_{n=1}^N, the predictive probability for Bayesian logistic regression can be estimated as

p(y_t = 1|x_t) ≈ ∫ p(y_t = 1|x_t, w) q(w) dw = Σ_{n=1}^N (1/N) ∫ (2λ²π)^(−d/2) exp(−||w_n − w||²/(2λ²)) / (1 + exp(−w x_t^T)) dw. (22)

A good approximation of (22) can be obtained by replacing the logistic sigmoid function with the probit function (Bishop, 2006, Sec. 4.5), yielding

p(y_t = 1|x_t) ≈ Σ_{n=1}^N (1/N) · 1/(1 + exp(−κ(σ²) µ_n)),

where µ_n = w_n x_t^T, σ² = (1/λ²) x_t x_t^T, and κ(σ²) = (1 + σ²π/8)^(−1/2).
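A sketch of this probit-based approximation follows (function and argument names are ours; λ plays the role of the KDE bandwidth, and σ² follows the paper's expression x_t x_t^T / λ²):

```python
import numpy as np

def predictive_prob(x, particles, lam=1.0):
    """Probit-based approximation of p(y=1|x) for Bayesian logistic regression.

    Averages sigma(kappa(sigma^2) * mu_n) over the SVGD/DSVGD particles w_n,
    with mu_n = w_n^T x, sigma^2 = x^T x / lam^2 and
    kappa(s) = (1 + pi * s / 8)^(-1/2) (Bishop, 2006, Sec. 4.5).
    """
    particles = np.atleast_2d(particles)   # shape (N, d)
    mu = particles @ x                     # mu_n = w_n^T x for each particle
    sigma2 = float(x @ x) / lam**2         # variance of the linear score
    kappa = 1.0 / np.sqrt(1.0 + np.pi * sigma2 / 8.0)
    return float(np.mean(1.0 / (1.0 + np.exp(-kappa * mu))))
```

For particles placed symmetrically about the decision boundary, the averaged prediction is exactly 0.5, as expected.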

A.3.2 PREDICTIVE DISTRIBUTION FOR BAYESIAN NEURAL NETWORKS WITH SVGD AND DSVGD

In a manner similar to (22), the predictive distribution for a BNN can be estimated as

p(y_t = 1|x_t) ≈ Σ_{n=1}^N (1/N) ∫ (2λ²π)^(−d/2) f(x_t, w) exp(−||w_n − w||²/(2λ²)) dw,

where f(x_t, w) is the sigmoid output of the BNN with weights w. Using the first-order Taylor approximation of the network output around the n-th particle (Bishop, 2006, Sec. 5.7.1), f(x_t, w) ≈ f(x_t, w_n) + ∇_w^T f(x_t, w_n)(w − w_n), the predictive distribution can now be rewritten as

p(y_t = 1|x_t) ≈ Σ_{n=1}^N (1/N) ∫ (2λ²π)^(−d/2) [f(x_t, w_n) + ∇_w^T f(x_t, w_n)(w − w_n)] exp(−||w_n − w||²/(2λ²)) dw
= Σ_{n=1}^N (1/N) f(x_t, w_n) + Σ_{n=1}^N (1/N) [∇_w^T f(x_t, w_n) w_n − ∇_w^T f(x_t, w_n) w_n]
= Σ_{n=1}^N (1/N) f(x_t, w_n),

where we have used the facts that ∫ N(w|w_n, λ² I_d) dw = 1 and ∫ w N(w|w_n, λ² I_d) dw = w_n.
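The cancellation of the linear Taylor term can be checked numerically: under w ∼ N(w_n, λ² I_d), the Monte Carlo average of the linearized output reduces to f(x_t, w_n). A toy sketch (the "network" here is just a sigmoid of a linear score, an assumption for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam = 5, 0.3
w_n = rng.normal(size=d)   # a single particle
x_t = rng.normal(size=d)   # a test input

def f(x, w):
    """Toy 'network' output: sigmoid of a linear score."""
    return 1.0 / (1.0 + np.exp(-w @ x))

# Gradient of f with respect to w, evaluated at the particle w_n.
grad = x_t * f(x_t, w_n) * (1.0 - f(x_t, w_n))

# Average the first-order Taylor expansion under w ~ N(w_n, lam^2 I_d).
w = w_n + lam * rng.normal(size=(200_000, d))
taylor_avg = np.mean(f(x_t, w_n) + (w - w_n) @ grad)
# The linear term has zero mean, so taylor_avg is close to f(x_t, w_n).
```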

A.4 SPACE-TIME COMPLEXITY, COMMUNICATION LOAD AND CONVERGENCE

This section offers a brief discussion of the complexity, communication load and convergence of DSVGD.

Space complexity. DSVGD inherits the space complexity of SVGD. In particular, DSVGD requires the computation of the kernel matrix k(·, ·) between all particles at each local iteration, which can then be deleted before the next iteration. This requires O(N²) space. As pointed out by Liu & Wang (2016), and as observed in our experiments, for sufficiently small problems of practical interest for mobile embedded applications, few particles are enough to obtain state-of-the-art performance. Furthermore, N particles of dimension d need to be saved in the local buffer, requiring O(Nd) space. Given that N is generally much smaller than the number of data samples, saving the particles in the local buffer should not be problematic.

Time complexity. When scheduled, an agent has to perform O(max(L, L′)N²) operations, with O(LN²) operations for the first loop (lines 5-11) and O(L′N²) operations for the second loop (lines 15-21) in Algorithm 1. Furthermore, the L′ distillation iterations in the second loop can be performed by the scheduled agent after it has sent its global particles to the central server. This enables the pipelining of the second loop with the operations at the server and at other agents, which can potentially reduce the wall-clock time per communication round.

Communication load. With DSVGD, the communication load between a scheduled agent and the central server is of the order O(Nd), since N particles of dimension d need to be exchanged at each communication round. In contrast, the communication load of PVI depends on the selected parametrization. For instance, one can use PVI with a fully factorized Gaussian approximate posterior, which requires only 2d parameters to be shared with the server, namely the mean and variance of each of the d parameters, at the price of lower accuracy.

Convergence.
The two local SVGD loops produce a set of global and local particles, respectively, that are convergent to their respective targets as the number N of particles increases (Liu, 2017a) . Furthermore, as discussed, a fixed point of the set of local free energy minimization problems is guaranteed to be a local optimum for the global free energy problem (see Property 3 in Bui et al. ( 2018)). This property hence carries over to DSVGD in the limit of large number of particles. However, convergence to a fixed point is an open question for PVI, and consequently also for DSVGD.
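The O(N²) kernel computation discussed above can be made concrete with a minimal sketch of a single SVGD step (Liu & Wang, 2016), using an RBF kernel of fixed bandwidth; the function name and the bandwidth choice are ours:

```python
import numpy as np

def svgd_step(particles, score, eps=0.1, bandwidth=1.0):
    """One SVGD iteration. `particles` is (N, d); score(theta) = grad log target.

    The pairwise RBF kernel matrix K is (N, N) -- the O(N^2) space cost noted
    above -- and can be discarded after the update.
    """
    diff = particles[:, None, :] - particles[None, :, :]   # (N, N, d)
    sq = np.sum(diff**2, axis=-1)                          # (N, N)
    K = np.exp(-sq / (2 * bandwidth**2))                   # kernel matrix
    grads = np.stack([score(p) for p in particles])        # (N, d) scores
    # Driving term: kernel-weighted scores plus the repulsive kernel gradient.
    phi = (K @ grads
           + np.einsum('ij,ijd->id', K, diff) / bandwidth**2) / len(particles)
    return particles + eps * phi
```

Iterating this step with the score of a standard Gaussian drives an arbitrary particle cloud toward that target.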

A.5 PARALLEL-DSVGD

In this section, we present a direct extension of DSVGD in which multiple agents can be scheduled in parallel during the same communication round. In Parallel-DSVGD (P-DSVGD), each agent in the set K^(i) of scheduled agents at round i applies the same steps as in DSVGD, except that it shares its local particles {θ^(i)_{k,n}}_{n=1}^N with the server instead of the global ones. As discussed in Sec. 3, a parallel implementation requires the i-th iterate of the global posterior to be obtained as

q^(i)(θ) = p_0(θ) ∏_{k∈K^(i)} t^(i)_k(θ) ∏_{k′∉K^(i)} t^(i)_{k′}(θ), (28)

where t^(i)_{k′}(θ) = t^(i−1)_{k′}(θ) for k′ ∉ K^(i). To replicate this behaviour while preserving the non-parametric property of DSVGD, in P-DSVGD each agent k ∈ K^(i) shares its local particles {θ^(i)_{k,n}}_{n=1}^N representing the approximate likelihood t^(i)_k(θ) = N^(−1) Σ_n K(θ, θ^(i)_{k,n}). Then, to approximate q^(i)(θ) in (28) using SVGD, the server carries out L_s SVGD updates as

φ^[l]_n ← φ^[l−1]_n + (ε/N) Σ_{j=1}^N [ k(φ^[l−1]_j, φ^[l−1]_n) ∇_{θ_j} log q^(i)(φ^[l−1]_j) + ∇_{φ_j} k(φ^[l−1]_j, φ^[l−1]_n) ], for l = 1, . . . , L_s. (29)

For the (i+1)-th communication round, the scheduled agents K^(i+1) download the particles {φ^(i+1)_n}_{n=1}^N = {φ^[L_s]_n}_{n=1}^N, which are treated in a similar fashion as in DSVGD. The full algorithmic table for Parallel-Distributed Stein Variational Gradient Descent (P-DSVGD) is provided in Algorithm 5, and numerical results for P-DSVGD are provided in Sec. B.3 of the Appendix.

A.6 PROOFS

In this section, we prove Theorems 1 and 2.

Theorem 1. The global posterior q_opt(θ) in (2) is the unique fixed point of the DVI algorithm.

Proof. Consider the general implementation of DVI, where a subset K̃ ⊆ {1, . . . , K} of agents is scheduled in parallel. DVI is equivalent to the functional mapping

( ∏_{i′∉K̃} t_{i′}(θ), {t_k(θ)}_{k∈K̃} ) ↦ ( ∏_{i′∉K̃} t_{i′}(θ), { t̃_k(θ) = (1/Z) exp(−(1/α) L_k(θ)) }_{k∈K̃} ),

which maps q(θ) = p_0(θ) ∏_{i=1}^K t_i(θ) into q̃(θ) = p_0(θ) ∏_{i′∉K̃} t_{i′}(θ) ∏_{k∈K̃} t̃_k(θ), where Z = ∫ p_0(θ) ∏_{i′∉K̃} t_{i′}(θ) ∏_{k∈K̃} t̃_k(θ) dθ. Therefore, assuming that all agents k are periodically scheduled, q(θ) is a fixed point of DVI if and only if the equality t̃_k(θ) = t_k(θ) holds for k = 1, . . . , K. This condition is satisfied by q(θ) = q_opt(θ) and by no other distribution. This concludes the proof.

We now move to Theorem 2 for U-DSVGD; we leave the analysis of the impact of the additional distillation step used by DSVGD for future work. The analysis builds on the following result from Korba et al. (2020), which is restated here in our notation. Denote by ||·||_H the norm in the RKHS H defined by the positive definite kernel k(θ, θ′). We assume that the kernel satisfies the following technical condition: there exists a constant B > 0 such that

||k(θ, ·)||_H ≤ B and Σ_{j=1}^d ||∂k(θ, ·)/∂θ_j||²_H ≤ B². (30)

This condition is, for instance, satisfied by the RBF kernel with B = 1 (Zhou, 2008). Furthermore, we define the kernelized Stein discrepancy (Liu et al., 2016) between two distributions p and q as S(p, q), and the total variation distance as ||q − p||_TV = (1/2) ∫ |q(θ) − p(θ)| dθ.

Lemma 1. (Guaranteed per-iteration decrease of the local free energy (Korba et al., 2020).) For a kernel satisfying (30), assume that, at a given communication round i and local iteration l, with agent k scheduled, we have: • the maximum absolute eigenvalue of the Hessian −∇² log p̃^(i)_k(θ) is upper bounded by a constant M > 0; and • the inequality S(q^[l](θ), p̃^(i)_k) < C holds for some C > 0.
For a learning rate ε ≤ (β − 1)/(βBC^(1/2)) with any β > 1, the decrease in the local free energy from local iteration l to l + 1 satisfies the inequality

F(q^[l+1](θ)) − F(q^[l](θ)) ≤ −αε S(q^[l], p̃^(i)_k)(1 − εγ), where γ = ((β² + M)B²)/2.

Lemma 1 shows that, by choosing a learning rate ε ≤ min(γ^(−1), (β − 1)/(βBC^(1/2))), one can guarantee a per-iteration decrease in the local free energy, i.e., in the KL divergence between the particles' distribution and the target tilted distribution p̃^(i)_k(θ), that depends on the kernelized Stein discrepancy S(q^[l], p̃^(i)_k) at the iteration before the update.

Lemma 2. (Relationship between global and local free energies.) The global free energy F(q(θ)) in (1) is related to the local free energy F^(i)_k(q(θ)) in (4) of the k-th scheduled agent as

F(q(θ)) = F^(i)_k(q(θ)) + α Σ_{m≠k} E_{q(θ)}[ log( t^(i−1)_m(θ) / exp(−(1/α) L_m(θ)) ) ]. (32)

Proof. The global free energy (1) can be written as

F(q(θ)) = α E_{q(θ)}[ log( q(θ) / (p_0(θ) exp(−(1/α) Σ_{m=1}^K L_m(θ))) ) ]
= α E_{q(θ)}[ log( (q(θ)/p̃^(i)_k(θ)) · (q^(i−1)(θ)/t^(i−1)_k(θ)) / (p_0(θ) exp(−(1/α) Σ_{m≠k} L_m(θ))) ) ]
= α E_{q(θ)}[ log(q(θ)/p̃^(i)_k(θ)) ] + α E_{q(θ)}[ log( p_0(θ) ∏_{m≠k} t^(i−1)_m(θ) / (p_0(θ) exp(−(1/α) Σ_{m≠k} L_m(θ))) ) ]
= F^(i)_k(q(θ)) + α Σ_{m≠k} E_{q(θ)}[ log( t^(i−1)_m(θ) / exp(−(1/α) L_m(θ)) ) ],

where in the second equality we have used (11), and in the third equality we have used the equality q^(i−1)(θ) = p_0(θ) ∏_{m=1}^K t^(i−1)_m(θ), which is guaranteed by the U-DSVGD updates (10) and (11) (see Bui et al. (2018, Property 2)).

Theorem 2 (Guaranteed per-iteration decrease of the global free energy).
The decrease in the global free energy from local iteration l to l + 1 during communication round i, in which agent k is scheduled, can be lower bounded as

F(q^[l](θ)) − F(q^[l+1](θ)) ≥ αε S(q^[l], p̃^(i)_k)(1 − εγ) − 2α(K − 1) l^(i)_max √(2 D(q^[l+1] || q^[l])), (16)

where l^(i)_max = sup_θ max_{m≠k} | log( t^(i−1)_m(θ) exp((1/α) L_m(θ)) ) |, S(q, p) denotes the kernelized Stein discrepancy between distributions q and p (Liu et al., 2016), and γ is a constant depending on the RKHS kernel and the target distribution. We know from Lemma 1 that a learning rate ε ≤ min(γ^(−1), (β − 1)/(βBC^(1/2))) is sufficient to ensure a per-iteration decrease in the local free energy. Given that the KL divergence in the second term in (16) generally increases with ε, Theorem 2 demonstrates that, in order to guarantee a reduction of the global free energy, a smaller learning rate may be required. We also note that the KL divergence term D(q^[l+1] || q^[l]) may be explicitly related to the learning rate by following Pinder et al. (2020, Sec. 8), but we do not further pursue this aspect here. We finally remark that, in the presence of K = 1 agent, the upper bound (31) in Korba et al. (2020) is recovered: with a single agent, the global free energy reduces to the local free energy (see (32)), and accordingly U-DSVGD reduces to SVGD.

Proof. We wish to obtain an upper bound on the decrease of the global free energy F(q^[l+1](θ)) − F(q^[l](θ)) across each local SVGD iteration during communication round i. Using (32), the decrease in the global free energy can be written as

F(q^[l+1](θ)) − F(q^[l](θ)) = [F^(i)_k(q^[l+1](θ)) − F^(i)_k(q^[l](θ))] (a)
+ α Σ_{m≠k} ( E_{q^[l+1](θ)}[ log( t^(i−1)_m(θ) / exp(−(1/α) L_m(θ)) ) ] − E_{q^[l](θ)}[ log( t^(i−1)_m(θ) / exp(−(1/α) L_m(θ)) ) ] ) (b). (34)

We now derive upper bounds on (a) and (b).
Using Lemma 1 and the definition of the local free energy in (4), term (a) satisfies the upper bound

(a) = F^(i)_k(q^[l+1](θ)) − F^(i)_k(q^[l](θ)) ≤ −αε S(q^[l](θ), p̃^(i)_k)(1 − εγ),

while term (b) can be rewritten and upper bounded by using the properties of the total variation distance as

(b) = α Σ_{m≠k} ∫ (q^[l+1](θ) − q^[l](θ)) log( t^(i−1)_m(θ) / exp(−(1/α) L_m(θ)) ) dθ ≤ 2α(K − 1) l^(i)_max ||q^[l+1] − q^[l]||_TV.

Using Pinsker's inequality (Pinsker, 1964), the term (b) can be further upper bounded as

(b) ≤ 2α(K − 1) l^(i)_max √(2 D(q^[l+1] || q^[l])).

Accordingly, the global free energy decrease in (34) can be upper bounded as in (16).

B ADDITIONAL NUMERICAL RESULTS

B.1 1-D MIXTURE OF GAUSSIANS TOY EXAMPLE

This section is complementary to the 1-D mixture of Gaussians experiment in Sec. 7 of the main text. We compare DSVGD with PVI and the corresponding centralized schemes. In Fig. 9(a), we plot the KL divergence between the global posterior q_opt(θ) and its current approximation q(θ) as a function of the number of iterations i, which corresponds to the number of communication rounds for the decentralized schemes. We use N = 200 particles for U-DSVGD and DSVGD with L = L′ = 200 local iterations. The number of SVGD iterations is fixed to 800. A Gaussian prior p_0(θ) = N(θ|0, 1) is assumed in lieu of the uniform prior considered in Fig. 2 to facilitate the implementation of PVI and of conventional centralized GVI, which was done following Bui et al. (2018, Property 4). More specifically, we use Gaussian approximate likelihoods, i.e., t_k(θ|η) = N(θ| −η₁/(2η₂), −1/(2η₂)) with natural parameters η₁ and η₂ < 0. We observe that DSVGD has a similar convergence speed to PVI, while performing better thanks to the reduced bias of non-parametric models. Furthermore, DSVGD exhibits the same performance as U-DSVGD, with the advantage of having memory requirements that do not scale with the number of iterations. Finally, both U-DSVGD and DSVGD converge to the performance of (centralized) SVGD as the number of rounds increases. In Fig. 9(b), we plot the same KL divergence as a function of the number L of local iterations. We use I = 5 rounds for the decentralized schemes. It is observed that the non-parametric schemes, namely SVGD and (U-)DSVGD, require a sufficiently large number of local iterations in order to outperform the parametric strategies PVI and GVI.

B.2 2-D MIXTURE OF GAUSSIANS TOY EXAMPLE

We now consider the following 2-D mixture of Gaussians model: p_1(θ) = N(µ_0, Σ_0)(N(µ_1, Σ_1) + N(µ_2, Σ_2)) and p_2(θ) = N(µ_0, Σ_0) N(µ_3, Σ_3), where

µ_0 = [0, 0], Σ_0 = [[4, 2], [2, 4]];
µ_1 = [−1.71, −1.801], Σ_1 = [[0.226, 0.1652], [0.1652, 0.6779]];
µ_2 = [1, 0], Σ_2 = [[2, 0.5], [0.5, 2]];
µ_3 = [1, 0], Σ_3 = [[3, 0.5], [0.5, 3]].

We plot in Fig. 10 the approximate posterior q(θ) (blue solid contour lines) and the exact posterior q_opt(θ) (red dashed contour lines) for PVI, GVI, SVGD and DSVGD. We see that, as in the 1-D case, and in contrast to the parametric methods PVI and GVI, the non-parametric methods SVGD and DSVGD are able to capture the different modes of the posterior, obtaining lower values of the KL divergence between the approximate and exact posteriors.
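For reference, the two local factors can be written down directly (a sketch using scipy.stats; variable and function names are ours, and the densities are left unnormalized as in the model definition):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

# Means and covariances of the 2-D toy model.
mu0, S0 = np.zeros(2), np.array([[4.0, 2.0], [2.0, 4.0]])
mu1, S1 = np.array([-1.71, -1.801]), np.array([[0.226, 0.1652],
                                               [0.1652, 0.6779]])
mu2, S2 = np.array([1.0, 0.0]), np.array([[2.0, 0.5], [0.5, 2.0]])
mu3, S3 = np.array([1.0, 0.0]), np.array([[3.0, 0.5], [0.5, 3.0]])

def p1(theta):
    """Unnormalized local factor of agent 1 (a bimodal component)."""
    return mvn.pdf(theta, mu0, S0) * (mvn.pdf(theta, mu1, S1)
                                      + mvn.pdf(theta, mu2, S2))

def p2(theta):
    """Unnormalized local factor of agent 2 (a unimodal component)."""
    return mvn.pdf(theta, mu0, S0) * mvn.pdf(theta, mu3, S3)
```

The global posterior targeted by the federated schemes is proportional to the product of these local factors (divided by the shared prior factor, per the model definition in the main text).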

B.3 BAYESIAN LOGISTIC REGRESSION

This section provides additional results for the Bayesian logistic regression experiment in Sec. 7 of the main text. In Fig. 11, we compare the performance of DSVGD (bottom row) and U-DSVGD (top row) with both SVGD and NPV (Gershman et al., 2012), using the model described in Sec. 7. We use 9 binary classification datasets, summarized in Appendix C, as used in Liu & Wang (2016) and Gershman et al. (2012). We assume N = 100 particles. To ensure fairness, we use L = 800 iterations for SVGD, while U-DSVGD and DSVGD are executed with two agents, with half of the dataset split randomly at each agent. We set I = 4 rounds and L = L′ = 200 local iterations. In Fig. 11, we plot the accuracy and the log-likelihood of the four algorithms. We observe that both U-DSVGD and DSVGD perform similarly to SVGD and NPV over most datasets, while allowing a distributed implementation.

B.5 CALIBRATION

This section provides additional results on the calibration experiment conducted in Sec. 7 of the main text using additional datasets. In Fig. 19, we show the reliability plots for SVGD, DSVGD and FedAvg with K = 20 agents across various datasets and for different numbers of neurons in the hidden layer. We first note that DSVGD retains the same calibration level as SVGD across all datasets. Furthermore, while increasing the number of hidden neurons negatively affects FedAvg due to overfitting, it does not affect the trustworthiness of the predictions of the Bayesian counterparts. This is a general property of Bayesian methods that contrasts with frequentist approaches, for which increasing the number of parameters improves accuracy at the price of miscalibration (Guo et al., 2017). Fig. 20 plots the accuracy and MCE as a function of the number of particles N. While increasing N improves the accuracy (as also shown in Fig. 12) for SVGD and DSVGD, the MCE is unaffected and is lower than the MCE value of FedAvg.

C IMPLEMENTATION DETAILS

C.1 DATASETS, BENCHMARKS AND HYPERPARAMETERS

Datasets. We summarize in Table 2 the main parameters used across the different datasets that are invariant across all experiments. The Covertype dataset¹ and the remaining binary classification datasets, selected from Gunnar Raetsch's benchmark datasets² as compiled by Mika et al. (1999), are used directly without normalization, as in Liu & Wang (2016), except for the vehicle sensors dataset³, which is normalized by removing the mean of each feature and dividing by its standard deviation. The regression datasets⁴ are normalized in the same manner, and the multi-label classification datasets⁵,⁶ are normalized by multiplying each pixel value by 0.99/255 and adding 0.01, so that every pixel value after normalization belongs to the interval [0.01, 1]. All performance metrics are averaged over the number of trials. In each trial, unless specified otherwise, we permute the datasets and randomly split them across the different agents.

Hyperparameters. The hyperparameters used are summarized in Table 3. They apply to all schemes except DSGLD and SGLD, whose learning rates are annealed and respectively equal to a_0 · (0.5 + i · L + l)^(−0.55) and a_0 · (0.5 + l)^(−0.55), to ensure that they decay from the order of 0.01 to 0.0001, as advised by Welling & Teh (2011); a_0 is fixed according to the values in Table 4.

DSGLD implementation. DSGLD is implemented by splitting the N particles among the K agents. More specifically, when scheduled, each agent runs N/K Markov chains. We assume the response delay and the trajectory length of the chains (Ahn et al., 2014) to be equal among all workers and unchanged throughout the learning process.

Scheduling. Unless specified otherwise, we use a round-robin scheduler. However, any scheduler can be used, as long as it schedules one agent per communication round.
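The annealing schedules above amount to the following one-liners (a sketch; the function and argument names are ours, and a_0 is the value taken from Table 4):

```python
def dsgld_lr(a0, round_i, local_l, L):
    """Annealed DSGLD step size: a0 * (0.5 + i*L + l)^(-0.55)."""
    return a0 * (0.5 + round_i * L + local_l) ** -0.55

def sgld_lr(a0, step_l):
    """Annealed SGLD step size: a0 * (0.5 + l)^(-0.55)."""
    return a0 * (0.5 + step_l) ** -0.55
```

The DSGLD schedule simply treats the pair (round i, local iteration l) as a single global step count i·L + l, so both schedules decay along the same curve.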



¹ https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
² http://theoval.cmp.uea.ac.uk/matlab/default.html
³ http://www.ecs.umass.edu/~mduarte/Software.html
⁴ https://archive.ics.uci.edu/ml/datasets.php
⁵ http://yann.lecun.com/exdb/mnist/
⁶ https://github.com/zalandoresearch/fashion-mnist

All learning rates for the non-parametric particle-based benchmark schemes are scaled by a factor of 1/N to match our learning rate and ensure a fair comparison.




Figure 4: Accuracy for Bayesian logistic regression with (left) K = 2 agents and (right) K = 20 agents as a function of the number of communication rounds i (N = 6 particles, L = L′ = 200).

Figure 3: KL divergence between the exact and approximate global posteriors as a function of the number of rounds i (L = L′ = 200).

Figure 5: Log-likelihood for Bayesian logistic regression with non-i.i.d. data distributions (N = 6, L = L′ = 200).

Bayesian logistic regression. We consider binary classification using the same setting as in Gershman et al. (2012). The model parameters θ = [w, log(ξ)] include the regression weights w ∈ R^d along with the logarithm of a precision parameter ξ. The prior is given as p_0(w, ξ) = p_0(w|ξ) p_0(ξ), with p_0(w|ξ) = N(w|0, ξ^(−1) I_d) and p_0(ξ) = Gamma(ξ|a, b), where a = 1 and b = 0.01. The local training loss L_k(θ) at each agent k is the cross-entropy loss summed over the local dataset D_k.

Figure 6: Average RMSE as a function of the number of communication rounds i for regression using a BNN with a single hidden layer of ReLUs with (left) K = 2 agents and (right) K = 20 agents (N = 20, L = L′ = 200, 100 hidden neurons for Year Prediction and 50 for Kin8nm).

Figure 7: Multi-label classification accuracy using a BNN with a single hidden layer of 100 neurons as a function of the number i of communication rounds, using MNIST and Fashion MNIST with (left) K = 2 agents and (right) K = 20 agents (N = 20, L = L′ = 200).

Figure 8: Reliability plots for classification using a BNN with a variable number of hidden neurons on Fashion MNIST (N = 20, I = 10, L = L′ = 200, K = 20).



Figure 9: KL divergence between the exact and approximate global posteriors (a) as a function of the number of communication rounds i for L = L′ = 200; and (b) as a function of the number L of local iterations for I = 5.

Figure 10: Performance comparison of (a) GVI, (b) PVI, (c) SVGD and (d) DSVGD for a multivariate Gaussian mixture model. Solid contour lines correspond to the approximate posterior while dashed contour lines to the exact posterior (N = 200, I = 5, L = 200 and b = 0.1).

Figure 19: Reliability plots for classification using Bayesian neural networks with a variable number of hidden neurons for FedAvg (top row), SVGD (middle row) and DSVGD (bottom row). We use N = 20 particles (I = 10, L = L′ = 200 and K = 20 agents).

Figure 20: Accuracy and Maximum Calibration Error (MCE) as a function of the number of particles N for Bayesian neural networks. We fix I = 10, L = L′ = 200 and K = 20 agents in both figures.

FedAvg implementation. FedAvg is implemented as in McMahan et al. (2017), with the only difference that the server schedules a single agent at a time. Each scheduled agent performs L SGD iterations to minimize its local loss.

PVI and GVI implementation. PVI and GVI are implemented using a Gaussian parametrization for both the posterior and the prior. The natural parameters are updated via the closed-form update in Bui et al. (2018, Property 4).

To this end, in each round i, at the end of the L local SVGD updates in (13), DSVGD carries out a distillation step of L′ SVGD iterations (the second loop in Algorithm 1) to update the scheduled agent's approximate likelihood.


Table 2: Overview of datasets and parameters used in the experiments. Datasets in bold are used in the experiments section of the main text.

A COMPLEMENTARY MATERIALS

A.1 ALGORITHMIC TABLES

Algorithm 2: Partitioned Variational Inference (PVI) (Bui et al., 2018)
Input: prior p_0(θ), local loss functions {L_k(θ)}_{k=1}^K, temperature α > 0
Output: global posterior q(θ|η)
initialize t^(0)_k(θ) = 1 for k = 1, . . . , K; q^(0)(θ) = p_0(θ)
for i = 1, . . . , I do
  At scheduled agent k, download current global parameters η^(i−1) from the server
  Agent k solves the local free energy problem in (4) to obtain the new global parameters η^(i)
  Agent k sends η^(i) to the server and the server sets η ← η^(i)
  Agent k updates its approximate likelihood: t_k(θ|η^(i)_k) ∝ (q(θ|η^(i)) / q(θ|η^(i−1))) t_k(θ|η^(i−1)_k)
end
return q(θ) = q(θ|η^(I))

Algorithm 3: Stein Variational Gradient Descent (SVGD) (Liu & Wang, 2016)
Input: target distribution p(θ), initial particles {θ^[0]_n}_{n=1}^N, kernel k(·, ·), learning rate ε
Output: particles {θ_n}_{n=1}^N that approximate the normalized target distribution
for l = 1, . . . , L do
  for n = 1, . . . , N do
    θ^[l]_n ← θ^[l−1]_n + (ε/N) Σ_{j=1}^N [ k(θ^[l−1]_j, θ^[l−1]_n) ∇_θ log p(θ^[l−1]_j) + ∇_{θ_j} k(θ^[l−1]_j, θ^[l−1]_n) ]
  end
end

Algorithm 4: Unconstrained-Distributed Stein Variational Gradient Descent (U-DSVGD)
initialize particles {θ^(0)_n}_{n=1}^N ∼ p_0(θ)
for i = 1, . . . , I do
  // New communication round: server schedules an agent k
  At scheduled agent k, download and memorize in the local buffer the current global particles {θ^(i−1)_n}_{n=1}^N
  Agent k runs L SVGD iterations targeting the tilted distribution p̃^(i)_k(θ), with ∇_θ log q^(i−1)(θ) and ∇_θ log t^(i−1)_k(θ) computed using (15)
  Agent k sends the updated particles {θ^(i)_n}_{n=1}^N to the server
end
return q(θ) = (1/N) Σ_{n=1}^N K(θ, θ^(I)_n)

Algorithm 5: Parallel-Distributed Stein Variational Gradient Descent (P-DSVGD)

for i = 1, . . . , I do

  Server schedules a set K^(i) of agents in parallel
  Each scheduled agent downloads the current server particles {φ^(i)_n}_{n=1}^N
  Each scheduled agent k carries out distillation to obtain the local particles {θ^(i)_{k,n}}_{n=1}^N representing t^(i)_k(θ) using (17)
  Each scheduled agent sends the obtained local particles {θ^(i)_{k,n}}_{n=1}^N to the server
  Server obtains {φ^(i+1)_n}_{n=1}^N via the SVGD updates in (29)
end

We note that NPV requires the computation of the Hessian matrix, which is relatively impractical. We plot in Fig. 12(a) the accuracy as a function of the number of particles N. DSGLD is executed with two agents, where N/2 chains per agent are run for a trajectory of length 4 and 500 rounds, which we have found to work best. We find that SVGD, DSVGD and U-DSVGD exhibit the same performance, which is superior to Particle Mirror Descent (PMD) and similar to SGLD and DSGLD as the number of particles increases. Fig. 12(b) plots the accuracy of DSVGD in the same setting for different numbers of communication rounds. We can see that, by increasing the number of particles, i.e., the communication load, one can obtain an accuracy similar to that achieved with fewer particles and a larger number of communication rounds. For example, N = 8 with I = 6 communication rounds achieves similar performance to N = 4 with I = 10 communication rounds.

Under review as a conference paper at ICLR 2021

Fig. 14 shows the accuracy of DSVGD for different datasets as a function of the total number L of local iterations. We fix N = 6, I = 10, L = L′ = 200 for U-DSVGD, DSGLD and DSVGD, while L = 2000 for SVGD and SGLD. We observe that U-DSVGD and DSVGD have similar performance to SVGD, and that they consistently outperform the other schemes for sufficiently large L. Fig. 16 is complementary to Fig. 4 in the main text. We note that the slightly noisy behaviour of DSVGD with K = 20 agents is attributed to the small local dataset sizes resulting from splitting the original small datasets. Finally, Fig.
15 compares the accuracy of P-DSVGD with that of FedAvg and DSGLD with K = 100 agents and a proportion of 0.2 randomly scheduled agents per communication round. We see that P-DSVGD exhibits behaviour and gains over the other schemes similar to those of DSVGD.

