FEDAVG CONVERGES TO ZERO TRAINING LOSS LINEARLY FOR OVERPARAMETERIZED MULTI-LAYER NEURAL NETWORKS

Abstract

Federated Learning (FL) is a distributed learning paradigm that allows multiple clients to learn a joint model by utilizing privately held data at each client. Significant research effort has been devoted to developing advanced algorithms that deal with the situation where the data at individual clients have heterogeneous distributions. In this work, we show that data heterogeneity can be dealt with from a different perspective. That is, by utilizing a certain overparameterized multi-layer neural network at each client, even the vanilla FedAvg (a.k.a. Local SGD) algorithm can accurately optimize the training problem: when each client has a neural network with one wide layer of size N (where N is the total number of training samples), followed by layers of smaller widths, FedAvg converges linearly to a solution that achieves (almost) zero training loss, without requiring any assumptions on the clients' data distributions. To our knowledge, this is the first work that demonstrates such resilience to data heterogeneity for FedAvg when trained on multi-layer neural networks. Our experiments also confirm that neural networks of large size can achieve better and more stable performance on FL problems.
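The architecture described above can be made concrete with a small sketch: a multi-layer network whose first hidden layer has width N (the number of training samples), followed by narrower layers. All names and widths below are illustrative assumptions, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dataset: N samples in d dimensions (sizes chosen for illustration).
N, d = 64, 10
X = rng.standard_normal((N, d))

# One wide hidden layer of width N, followed by layers of smaller widths.
widths = [d, N, 16, 8, 1]
params = [rng.standard_normal((m, n)) / np.sqrt(m)
          for m, n in zip(widths[:-1], widths[1:])]

def forward(X, params):
    """Forward pass of a ReLU multi-layer perceptron."""
    h = X
    for W in params[:-1]:
        h = np.maximum(h @ W, 0.0)  # ReLU activation on hidden layers
    return h @ params[-1]           # linear output layer

out = forward(X, params)            # predictions, shape (N, 1)
```

The wide first layer is what makes the model overparameterized relative to the N training samples; the subsequent layers may be much narrower.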

1. INTRODUCTION

In Federated Learning (FL), multiple clients collaborate with the help of a server to learn a joint model McMahan et al. (2017). The privacy guarantees of FL have made it a popular distributed learning paradigm, as each client holds a private data set and aims to learn a global model without leaking its data to other nodes or the server. The performance of FL algorithms is known to degrade when training data at individual nodes originates from different distributions, referred to as the heterogeneous data setting Yu et al. (2019a); Woodworth et al. (2020a). In the past few years, a substantial research effort has been devoted to developing a large number of algorithms that can better deal with data heterogeneity, Karimireddy et al. (2020b); Zhang et al. (2021); Li et al. (2018); Acar et al. (2020); Khanduri et al. (2021). However, a number of recent works have observed that, in spite of data heterogeneity, the simple vanilla FedAvg algorithm (a.k.a. Local SGD) still offers competitive performance in comparison to the state of the art; for example, see Table 2 in Karimireddy et al. (2020a), Table 1 in Reddi et al. (2020), and Table 2 in Yang et al. (2021) for performance comparisons of FedAvg on popular FL tasks.

Motivated by these observations, we ask: Is it possible to handle the data heterogeneity issue from a different perspective, without modifying the vanilla FedAvg algorithm? To answer this question, in this work we show that FedAvg can indeed perform very well regardless of the heterogeneity conditions, if the models to be learned are expressive enough. Specifically, FedAvg finds solutions that achieve almost zero training loss (i.e., an almost globally optimal solution) very quickly (i.e., at a linear rate), when the FL model to be trained is a certain overparameterized multi-layer neural network. To the best of our knowledge, this is the first result that shows (linear) convergence of FedAvg in the overparameterized regime for training multi-layer neural networks. The major contributions of our work are listed below.

• Under certain assumptions on the neural network architecture, we prove some key properties of the clients' (stochastic) gradients during the training phase (Lemmas 1 and 2). These results allow us to establish convergence of FedAvg for training overparameterized neural networks without imposing restrictive heterogeneity assumptions on the gradients of the local loss functions.
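For readers unfamiliar with the algorithm, vanilla FedAvg (Local SGD) can be sketched as follows: each client runs a few local gradient steps from the current server model, and the server averages the resulting local models. This is a minimal illustration on a toy quadratic objective; all names and hyperparameters (num_clients, local_steps, lr, etc.) are assumptions for the sketch, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each client i holds private data (A_i, b_i) and minimizes the local loss
# f_i(w) = (1/2m) * ||A_i w - b_i||^2; the global loss is their average.
num_clients, dim, m = 4, 5, 20
A = [rng.standard_normal((m, dim)) for _ in range(num_clients)]
b = [rng.standard_normal(m) for _ in range(num_clients)]

def local_grad(i, w):
    """Gradient of client i's local quadratic loss at w."""
    return A[i].T @ (A[i] @ w - b[i]) / m

w_global = np.zeros(dim)
lr, local_steps, rounds = 0.05, 10, 50
for _ in range(rounds):
    local_models = []
    for i in range(num_clients):
        w = w_global.copy()              # client starts from the server model
        for _ in range(local_steps):     # multiple local (S)GD steps
            w -= lr * local_grad(i, w)
        local_models.append(w)
    w_global = np.mean(local_models, axis=0)  # server averages local models

global_loss = np.mean([0.5 * np.mean((A[i] @ w_global - b[i])**2)
                       for i in range(num_clients)])
```

The key feature, relative to centralized SGD, is that clients take several local steps between averaging rounds, so under heterogeneous data the local models can drift apart; the results in this paper show that overparameterization makes FedAvg robust to this drift.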

