A HIERARCHICAL BAYESIAN APPROACH TO FEDERATED LEARNING

Abstract

We propose a novel hierarchical Bayesian approach to Federated Learning (FL), in which our model describes the generative process of clients' local data via hierarchical Bayesian modeling: the local models of individual clients are treated as random variables governed by a higher-level global variate. Interestingly, variational inference in our Bayesian model leads to an optimisation problem whose block-coordinate descent solution becomes a distributed algorithm that is separable over clients and allows them not to reveal their own private data at all, and is thus fully compatible with FL. We also highlight that our block-coordinate algorithm takes particular forms that subsume the well-known FL algorithms FedAvg and FedProx as special cases. That is, we not only justify the previous FedAvg and FedProx algorithms, whose learning protocols look intuitive but are theoretically less underpinned, but also generalise them further via a principled Bayesian approach. Beyond novel modeling and derivations, we also offer convergence analysis showing that our block-coordinate FL algorithm converges to a (local) optimum of the objective at the rate of $O(1/\sqrt{T})$, the same rate as regular (centralised) SGD, as well as a generalisation error analysis in which we prove that the test error of our model on unseen data vanishes as the training data size increases, and is thus asymptotically optimal.

1. INTRODUCTION

Federated Learning (FL) aims to enable a set of clients to collaboratively train a model in a privacy-preserving manner, without sharing data with each other or a central server. Compared to conventional centralised optimisation problems, FL comes with a host of statistical and systems challenges, such as communication bottlenecks and sporadic participation. The key statistical challenge is non-i.i.d. data distributions across clients, each of which has a different data collection bias and potentially a different data annotation policy/labeling function, as in, for example, user preference learning. The classic and most widely deployed FL algorithms are FedAvg (McMahan et al., 2017) and FedProx (Li et al., 2018); however, even when a global model can be learned, it often underperforms on each client's local data distribution in scenarios of high heterogeneity (Li et al., 2019; Karimireddy et al., 2019; Wang et al., 2020). Studies have attempted to alleviate this by personalising learning at each client, allowing each local model to deviate from the shared global model (Sun et al., 2021). However, this remains challenging given that each client may have only a limited amount of local data for personalised learning.

These challenges have motivated several attempts to model the FL problem from a Bayesian perspective. Introducing distributions on model parameters $\theta$ has enabled various schemes for estimating a global model posterior $p(\theta|D_{1:N})$ from clients' local posteriors $p(\theta|D_i)$, or for regularising the learning of local models given a prior defined by the global model (Zhang et al., 2022; Al-Shedivat et al., 2021; Chen & Chao, 2021). However, these methods are not complete and principled solutions: they have not yet provided full Bayesian descriptions of the FL problem, and have had to resort to ad-hoc treatments to achieve tractable learning. The key difference is that they fundamentally treat the network weights $\theta$ as a single random variable shared across all clients. We instead introduce a hierarchical Bayesian model that assigns each client its own random variable for model weights $\theta_i$, linked via a higher-level random variable $\phi$ as $p(\theta_{1:N}, \phi) = p(\phi) \prod_{i=1}^{N} p(\theta_i|\phi)$.
This has several crucial benefits. Firstly, given this hierarchy, variational inference in our framework decomposes into separable optimisation problems over the $\theta_i$'s and $\phi$, enabling a practical Bayesian learning algorithm that is fully compatible with FL constraints, without resorting to ad-hoc treatments or strong assumptions. Secondly, the framework can be instantiated with different assumptions on $p(\theta_i|\phi)$ to deal elegantly and robustly with different kinds of statistical heterogeneity, as well as to support principled and effective model personalisation. Our resulting algorithm, termed Federated Hierarchical Bayes (FedHB), is empirically effective, as we demonstrate in a wide range of experiments on established benchmarks. More importantly, it benefits from rigorous theoretical support. In particular, we provide convergence guarantees showing that FedHB has the same $O(1/\sqrt{T})$ convergence rate as centralised SGD, which is not provided by related prior art (Zhang et al., 2022; Chen & Chao, 2021). We also provide a generalisation bound showing that FedHB is asymptotically optimal, which has not been shown by prior work such as Al-Shedivat et al. (2021). Furthermore, we show that FedHB subsumes the classic methods FedAvg (McMahan et al., 2017) and FedProx (Li et al., 2018) as special cases, ultimately providing additional justification and explanation for these seminal methods.
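To give intuition for how such a block-coordinate scheme can reduce to FedAvg/FedProx-style updates, the following is a hypothetical numerical sketch, not the paper's exact algorithm: it assumes point-mass local posteriors $q_i(\theta_i) = \delta(\theta_i - m_i)$ and an isotropic Gaussian prior $p(\theta_i|\phi) = N(\theta_i; \phi, \rho^{-1} I)$, so the local step becomes a FedProx-like proximal objective and the server step becomes FedAvg-like averaging. The quadratic local losses and the name `local_update` are illustrative assumptions.

```python
import numpy as np

# Sketch of block-coordinate descent under simplifying assumptions:
#   local step : m_i <- argmin_m loss_i(m) + (rho/2)||m - phi||^2   (FedProx-like)
#   server step: phi <- mean_i m_i                                   (FedAvg-like)

rng = np.random.default_rng(0)
N, d, rho = 5, 3, 1.0
targets = rng.normal(size=(N, d))      # client i's toy loss: 0.5*||m - targets[i]||^2

def local_update(phi, t_i):
    # Closed-form argmin of 0.5*||m - t_i||^2 + 0.5*rho*||m - phi||^2
    return (t_i + rho * phi) / (1.0 + rho)

phi = np.zeros(d)
for _ in range(100):
    ms = np.stack([local_update(phi, targets[i]) for i in range(N)])
    phi = ms.mean(axis=0)              # server aggregation of client solutions

# With quadratic losses the alternation contracts to the mean of the targets.
print(np.allclose(phi, targets.mean(axis=0), atol=1e-6))  # → True
```

Each iterate satisfies $\phi_{k+1} = (\bar{t} + \rho\,\phi_k)/(1+\rho)$, a contraction with factor $\rho/(1+\rho)$, so the loop converges geometrically; with general neural losses the local step would instead run a few SGD steps on the proximal objective.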

2. BAYESIAN FL: GENERAL FRAMEWORK

We introduce two types of latent random variables, $\phi$ and $\{\theta_i\}_{i=1}^{N}$. Each $\theta_i$ is deployed as the network weights of client $i$'s backbone. The variable $\phi$ can be viewed as a globally shared variable that is responsible for linking the individual client parameters $\theta_i$. We assume conditionally independent and identical priors, $p(\theta_{1:N}|\phi) = \prod_{i=1}^{N} p(\theta_i|\phi)$. Thus the prior for the latent variables $(\phi, \{\theta_i\}_{i=1}^{N})$ is formed in a hierarchical manner as (1), and the local data for client $i$, denoted by $D_i$, is generated by $\theta_i$:

(Prior) $\; p(\phi, \theta_{1:N}) = p(\phi) \prod_{i=1}^{N} p(\theta_i|\phi)$, \quad (Likelihood) $\; p(D_i|\theta_i) = \prod_{(x,y)\in D_i} p(y|x, \theta_i)$, \hfill (1)

where $p(y|x, \theta_i)$ is a conventional neural network model (e.g., a softmax link for classification tasks). See the graphical model in Fig. 1(a), where the iid clients are governed by a single random variable $\phi$. Given the data $D_1, \dots, D_N$, we infer the posterior $p(\phi, \theta_{1:N}|D_{1:N}) \propto p(\phi) \prod_{i=1}^{N} p(\theta_i|\phi)\, p(D_i|\theta_i)$, which is intractable in general, so we adopt variational inference to approximate it:

$q(\phi, \theta_{1:N}; L) := q(\phi; L_0) \prod_{i=1}^{N} q_i(\theta_i; L_i)$, \hfill (2)

where the variational parameters $L$ consist of $L_0$ (parameters for $q(\phi)$) and the $\{L_i\}_{i=1}^{N}$ (parameters for the clients' $q_i(\theta_i)$'s). Note that although the $\theta_i$'s are independent across clients under (2), they are modeled differently (emphasised by the subscript $i$ in the notation $q_i$), reflecting different posterior beliefs originating from the heterogeneity of the local data $D_i$'s.

2.1. FROM VARIATIONAL INFERENCE TO FEDERATED LEARNING ALGORITHM

Using standard variational inference techniques (Blei et al., 2017; Kingma & Welling, 2014), we can derive the ELBO objective function (details in Appendix A). We denote the negative ELBO by $\mathcal{L}$; for our model it takes the form

$\mathcal{L}(L_0, L_{1:N}) = \mathrm{KL}\big(q(\phi; L_0)\,\|\,p(\phi)\big) + \sum_{i=1}^{N} \Big( \mathbb{E}_{q_i(\theta_i; L_i)}\big[-\log p(D_i|\theta_i)\big] + \mathbb{E}_{q(\phi; L_0)}\big[\mathrm{KL}\big(q_i(\theta_i; L_i)\,\|\,p(\theta_i|\phi)\big)\big] \Big),$

which, given $q(\phi; L_0)$, is separable over the clients' terms.
Note that we do not deal with generative modeling of the input images $x$: inputs $x$ are always given, and only the conditionals $p(y|x)$ are modeled. See Fig. 1(b) for the in-depth graphical model diagram.
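To make the negative ELBO above concrete, here is a hypothetical numerical sketch under assumed fully factorised Gaussian choices: $q(\phi) = N(\mu_0, s_0^2 I)$, $q_i(\theta_i) = N(\mu_i, s_i^2 I)$, $p(\phi) = N(0, I)$, and $p(\theta_i|\phi) = N(\phi, I)$, with a placeholder quadratic stand-in for each client's negative log-likelihood. The KL terms are closed form for Gaussians; the expectations are estimated by Monte Carlo.

```python
import numpy as np

# Sketch of the negative ELBO: KL(q(phi)||p(phi))
#   + sum_i ( E_{q_i}[-log p(D_i|theta_i)] + E_{q(phi)}[KL(q_i || p(theta_i|phi))] )

rng = np.random.default_rng(0)
d, N, S = 4, 3, 8                      # dim, clients, Monte Carlo samples

def kl_diag_gauss(mu_q, s_q, mu_p, s_p):
    # KL(N(mu_q, diag(s_q^2)) || N(mu_p, diag(s_p^2))), closed form
    return 0.5 * np.sum(
        (s_q**2 + (mu_q - mu_p)**2) / s_p**2 - 1.0 + 2.0 * np.log(s_p / s_q)
    )

mu0, s0 = np.zeros(d), 0.5 * np.ones(d)            # q(phi) parameters (L_0)
mus = [rng.normal(size=d) for _ in range(N)]       # q_i means (L_i)
ss = [0.3 * np.ones(d) for _ in range(N)]          # q_i std devs

def neg_log_lik(theta, i):
    # Placeholder local NLL (assumed quadratic, for illustration only)
    return 0.5 * np.sum((theta - i) ** 2)

neg_elbo = kl_diag_gauss(mu0, s0, np.zeros(d), np.ones(d))
for i in range(N):
    thetas = mus[i] + ss[i] * rng.normal(size=(S, d))        # theta_i ~ q_i
    neg_elbo += np.mean([neg_log_lik(th, i) for th in thetas])
    phis = mu0 + s0 * rng.normal(size=(S, d))                # phi ~ q(phi)
    neg_elbo += np.mean([kl_diag_gauss(mus[i], ss[i], ph, np.ones(d))
                         for ph in phis])

assert np.isfinite(neg_elbo) and neg_elbo > 0.0
```

Note how the loop body touches only client $i$'s own parameters and data term for fixed $q(\phi)$; this is the separability that the block-coordinate algorithm of the next sections exploits.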




(a) Overall model (b) Individual client (c) Global prediction (d) Personalisation Figure 1: Graphical models. (a) Plate view of iid clients. (b) Individual client data, with input images $x$ given and only $p(y|x)$ modeled. (c) & (d): Global prediction and personalisation as probabilistic inference problems (shaded nodes = evidence, red nodes = targets to infer; $x_*$ = test input in global prediction, $D_p$ = training data for personalisation, $x_p$ = test input).

