FEDERATED AVERAGING AS EXPECTATION-MAXIMIZATION

Abstract

Federated averaging (FedAvg), despite its simplicity, has been the main approach for training neural networks in the federated learning setting. In this work, we show that the algorithmic choices of FedAvg correspond to optimizing a single objective function, involving both the global model and all of the shard-specific models, with a hard version of the well-known Expectation-Maximization (EM) algorithm. This view yields a better understanding of the behavior and design choices of federated averaging and provides interesting connections to recent literature. Based on it, we further propose FedSparse, a variant of federated averaging that employs prior distributions to promote model sparsity. The resulting procedure reduces both server-to-client and client-to-server communication costs and yields more efficient models.

1. INTRODUCTION

Smart devices have become ubiquitous in today's world and generate large amounts of potentially sensitive data. Traditionally, such data are transmitted to and stored at a central location for training machine learning models. This rightly raises privacy concerns, and we therefore seek means of training powerful models, such as neural networks, without transmitting potentially sensitive data. To this end, Federated Learning (FL) (McMahan et al., 2016) has been proposed to train global machine learning models without participating devices having to transmit their data to the server. The Federated Averaging (FedAvg) (McMahan et al., 2016) algorithm communicates the parameters of the machine learning model instead of the data itself, which is a more private means of communication.

The FedAvg algorithm was originally motivated by empirical observations. While it can be shown to converge (Li et al., 2019), its underlying model assumptions and the objective function it optimizes are still not well understood. The first contribution of this work improves our understanding of FedAvg: we show that it can be derived by applying the general Expectation-Maximization (EM) framework to a simple hierarchical model (an illustrative sketch is given at the end of this section). This novel view has several interesting consequences: it sheds light on the algorithmic choices of FedAvg, bridges FedAvg with meta-learning, connects several extensions of FedAvg, and provides fruitful ground for future extensions.

Apart from theoretical grounding, the FL scenario poses several practical challenges, especially in the "cross-device" setting (Kairouz et al., 2019) that we consider in this work. In particular, communicating model updates over multiple rounds across a large number of devices can incur significant communication costs; communication via the public internet infrastructure and mobile networks is potentially slow and not free. Equally important, training (and inference) takes place on-device and is therefore restricted by the edge devices' hardware constraints on memory, speed and heat dissipation. Jointly addressing both of these issues is therefore an important step towards building practical FL systems, as also discussed in Kairouz et al. (2019).

Through the novel EM view of FedAvg we develop our second contribution, FedSparse. FedSparse allows for learning sparse models at the client and server via a careful choice of priors within the hierarchical model. As a result, it tackles the aforementioned challenges, since it simultaneously reduces the overall communication and the computation at the client devices. Empirically, FedSparse provides better communication-accuracy trade-offs than both FedAvg and methods proposed for similar reasons (Caldas et al., 2018).
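To make the hard-EM reading of FedAvg concrete before the formal derivation, the following is a minimal, illustrative sketch rather than the exact formulation developed later in the paper. It assumes each shard s has its own parameters phi_s tied to the global parameters theta through a prior centered at theta (a Gaussian prior is assumed here purely for illustration); the hard E-step then corresponds to the clients computing point estimates of their shard-specific parameters by local training initialized at theta, and the M-step corresponds to the server aggregating those estimates by a weighted average. All names (`hard_em_round`, `local_train`, etc.) are ours and introduced only for this sketch.

```python
import numpy as np

def hard_em_round(theta, clients, local_train, lr=0.01, local_steps=10):
    """One FedAvg round viewed as one hard-EM iteration (illustrative sketch).

    theta       : np.ndarray, current global parameters.
    clients     : list of (data, n_examples) tuples for the participating shards.
    local_train : callable(theta_init, data, lr, steps) -> np.ndarray, an
                  approximate maximizer of the shard-specific objective
                  (in FedAvg, a few epochs of local SGD).
    """
    # (Hard) E-step: each shard computes a point estimate phi_s of its own
    # parameters, initialized at the global model theta.
    phis, weights = [], []
    for data, n_examples in clients:
        phi_s = local_train(np.copy(theta), data, lr, local_steps)
        phis.append(phi_s)
        weights.append(n_examples)

    # M-step: update the global parameters given the point estimates phi_s.
    # Under the Gaussian prior phi_s ~ N(theta, sigma^2 I) assumed for this
    # sketch, the maximizer over theta is the weighted average of the phi_s,
    # which is exactly the FedAvg server aggregation step.
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    theta_new = sum(w * phi for w, phi in zip(weights, phis))
    return theta_new
```

Iterating this round with local SGD as the approximate E-step recovers the FedAvg procedure; the correspondence, and the precise objective being optimized, are made explicit in the sections that follow.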
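Similarly, the communication savings targeted by FedSparse can be illustrated with a small sketch of how a sparsity pattern, once induced by a suitable prior, would translate into smaller messages in both directions. The per-parameter gate probabilities, the threshold, and the helper names below are hypothetical stand-ins for the actual prior construction described later in the paper; they only show why a sparse model implies cheaper client-server and server-client exchanges.

```python
import numpy as np

def sparse_payload(params, gate_probs, threshold=0.1):
    """Illustrative sketch: drop parameters whose (learned) gate probability
    falls below a threshold before communication, transmitting only the
    indices and values of the surviving entries."""
    keep = gate_probs >= threshold                  # sparsity pattern induced by the prior
    idx = np.flatnonzero(keep)
    payload = {"indices": idx, "values": params[idx]}
    dense_bytes = params.nbytes
    sparse_bytes = payload["values"].nbytes + payload["indices"].nbytes
    return payload, sparse_bytes / dense_bytes      # fraction of the dense message size

def restore_dense(payload, size):
    """Reassemble a dense parameter vector from a sparse payload (zeros elsewhere)."""
    dense = np.zeros(size, dtype=payload["values"].dtype)
    dense[payload["indices"]] = payload["values"]
    return dense
```

In this toy picture, both the updates sent by the clients and the global model broadcast by the server shrink in proportion to the fraction of retained parameters, and the pruned model is also cheaper to run on-device; FedSparse obtains such sparsity patterns from the priors placed within the hierarchical model introduced above.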

