A HIERARCHICAL BAYESIAN APPROACH TO FEDERATED LEARNING

Abstract

We propose a novel hierarchical Bayesian approach to Federated Learning (FL), in which the generative process of clients' local data is described by hierarchical Bayesian modeling: each client's local model is a random variable governed by a higher-level global variate. Interestingly, variational inference in our Bayesian model leads to an optimisation problem whose block-coordinate descent solution is a distributed algorithm that is separable over clients and does not require them to reveal their own private data at all, and is thus fully compatible with FL. We also highlight that our block-coordinate algorithm takes particular forms that subsume well-known FL algorithms, including Fed-Avg and Fed-Prox, as special cases. That is, we not only justify the earlier Fed-Avg and Fed-Prox algorithms, whose learning protocols look intuitive but have had less theoretical underpinning, but also generalise them further via principled Bayesian approaches. Beyond introducing novel modeling and derivations, we offer a convergence analysis showing that our block-coordinate FL algorithm converges to a (local) optimum of the objective at the rate of $O(1/\sqrt{t})$, the same rate as regular (centralised) SGD, as well as a generalisation error analysis in which we prove that the test error of our model on unseen data is guaranteed to vanish as the training data size increases, and is thus asymptotically optimal.
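
As a rough illustration of the hierarchy described above, the following minimal sketch draws a global variate and per-client local variates, then recovers the global variate from the local ones. It is not the paper's derivation or algorithm: the Gaussian forms, the function names `sample_hierarchy` and `map_phi`, and the parameters `prior_std` and `client_std` are all assumptions made here for illustration. Under these assumptions, the most probable global variate given the local ones is a shrunk average of the client weights, and it reduces to a plain average when the prior on the global variate is made very broad, which gives some intuition for how an averaging-style update can arise from a hierarchical model.

```python
import numpy as np


def sample_hierarchy(num_clients, dim, prior_std=1.0, client_std=0.1, seed=0):
    """Ancestral sampling from an assumed Gaussian hierarchy.

    phi ~ N(0, prior_std^2 I)               (global variate)
    theta_i | phi ~ N(phi, client_std^2 I)  (one local variate per client)
    """
    rng = np.random.default_rng(seed)
    phi = rng.normal(0.0, prior_std, size=dim)
    thetas = phi + rng.normal(0.0, client_std, size=(num_clients, dim))
    return phi, thetas


def map_phi(thetas, prior_std=1.0, client_std=0.1):
    """MAP estimate of phi given the client variates, under the same
    Gaussian assumptions (conjugate Gaussian posterior mean)."""
    num_clients = thetas.shape[0]
    posterior_precision = 1.0 / prior_std**2 + num_clients / client_std**2
    return (thetas.sum(axis=0) / client_std**2) / posterior_precision


if __name__ == "__main__":
    phi, thetas = sample_hierarchy(num_clients=5, dim=3)
    print("global variate:        ", phi)
    print("MAP estimate from thetas:", map_phi(thetas))
    # With a very broad prior on phi, the MAP estimate approaches
    # the simple average of the client variates.
    print("broad-prior estimate:  ", map_phi(thetas, prior_std=1e6))
    print("plain average:         ", thetas.mean(axis=0))
```

The sketch treats each client variate as observed; in an actual FL setting the clients' variates would themselves be inferred from private local data, which is not modelled here.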

1. INTRODUCTION

Federated Learning (FL) aims to enable a set of clients to collaboratively train a model in a privacy-preserving manner, without sharing data with each other or with a central server. Compared to conventional centralised optimisation problems, FL comes with a host of statistical and systems challenges, such as communication bottlenecks and sporadic participation. The key statistical challenge is non-i.i.d. data distributions across clients, each of which has a different data collection bias and potentially a different data annotation policy or labeling function, as for example in user preference learning. The classic and most widely deployed FL algorithms are Fed-Avg (McMahan et al., 2017) and Fed-Prox (Li et al., 2018). However, even when a global model can be learned, it often underperforms on each client's local data distribution in scenarios of high heterogeneity (Li et al., 2019; Karimireddy et al., 2019; Wang et al., 2020). Studies have attempted to alleviate this by personalising learning at each client, allowing each local model to deviate from the shared global model (Sun et al., 2021). However, this remains challenging given that each client may have only a limited amount of local data for personalised learning.

These challenges have motivated several attempts to model the FL problem from a Bayesian perspective. Introducing distributions on model parameters $\theta$ has enabled various schemes for estimating a global model posterior $p(\theta \mid D_{1:N})$ from clients' local posteriors $p(\theta \mid D_i)$, or for regularising the learning of local models given a prior defined by the global model (Zhang et al., 2022; Al-Shedivat et al., 2021; Chen & Chao, 2021). However, these methods are not complete and principled solutions: they have not yet provided a full Bayesian description of the FL problem, and have had to resort to ad-hoc treatments to achieve tractable learning.

The key difference from our approach is that they fundamentally treat the network weights $\theta$ as a random variable shared across all clients. We introduce a hierarchical Bayesian model that assigns each client its own random variable for model weights $\theta_i$, and these are linked via a higher-level random variable $\phi$ as $p(\theta_{1:N}, \phi) = p(\phi) \prod_{i=1}^{N} p(\theta_i \mid \phi)$. This has several crucial benefits. Firstly, given this hierarchy, variational inference in our framework decomposes into separable optimisation problems over the $\theta_i$'s and $\phi$, enabling a practical Bayesian learning algorithm

