FEDAVG CONVERGES TO ZERO TRAINING LOSS LINEARLY FOR OVERPARAMETERIZED MULTI-LAYER NEURAL NETWORKS

Abstract

Federated Learning (FL) is a distributed learning paradigm that allows multiple clients to learn a joint model by utilizing privately held data at each client. Significant research effort has been devoted to developing advanced algorithms that handle the situation where the data at individual clients have heterogeneous distributions. In this work, we show that data heterogeneity can be dealt with from a different perspective: by utilizing a certain overparameterized multi-layer neural network at each client, even the vanilla FedAvg (a.k.a. Local SGD) algorithm can accurately optimize the training problem. When each client has a neural network with one wide layer of size N (where N is the total number of training samples), followed by layers of smaller widths, FedAvg converges linearly to a solution that achieves (almost) zero training loss, without requiring any assumptions on the clients' data distributions. To our knowledge, this is the first work that demonstrates such resilience to data heterogeneity for FedAvg when training multi-layer neural networks. Our experiments also confirm that neural networks of large size can achieve better and more stable performance for FL problems.

1. INTRODUCTION

In Federated Learning (FL), multiple clients collaborate with the help of a server to learn a joint model McMahan et al. (2017). The privacy guarantees of FL have made it a popular distributed learning paradigm, as each client holds a private data set and aims to learn a global model without leaking its data to other nodes or the server. The performance of FL algorithms is known to degrade when the training data at individual nodes originate from different distributions, referred to as the heterogeneous data setting Yu et al. (2019a); Woodworth et al. (2020a). In the past few years, a substantial research effort has been devoted to developing algorithms that can better deal with data heterogeneity Karimireddy et al. (2020b); Zhang et al. (2021); Li et al. (2018); Acar et al. (2020); Khanduri et al. (2021). However, a number of recent works have observed in practice that, in spite of the data heterogeneity, the simple vanilla FedAvg algorithm (a.k.a. Local SGD) still offers competitive performance in comparison to the state of the art; for example, see Table 2 in Karimireddy et al. (2020a), Table 1 in Reddi et al. (2020), and Table 2 in Yang et al. (2021) for performance comparisons of FedAvg on popular FL tasks.

Motivated by these observations, we ask: Is it possible to handle the data heterogeneity issue from a different perspective, without modifying the vanilla FedAvg algorithm? To answer this question, in this work we show that FedAvg can indeed perform very well regardless of the heterogeneity conditions, provided the model to be learned is well chosen. Specifically, FedAvg finds solutions that achieve almost zero training loss (i.e., an almost globally optimal solution) very quickly (i.e., at a linear rate) when the FL model being trained is a certain overparameterized multi-layer neural network. To the best of our knowledge, this is the first result that shows (linear) convergence of FedAvg in the overparameterized regime for training multi-layer neural networks. The major contributions of our work are listed below.

• Under certain assumptions on the neural network architecture, we prove key properties of the clients' (stochastic) gradients during the training phase (Lemmas 1 and 2). These results allow us to establish convergence of FedAvg for training overparameterized neural networks without imposing restrictive heterogeneity assumptions on the gradients of the local loss functions.

• We design a special initialization strategy for training the network using FedAvg. The initialization ensures that the singular values of the weight matrices, and of the first-layer outputs, of both the local and the aggregated models stay bounded away from zero during training. This property, combined with overparameterization, enables FedAvg to converge linearly to a (near) optimal solution.

• We conduct experiments on the CIFAR-10 and MNIST datasets in both i.i.d. and heterogeneous data settings to compare the performance of FedAvg across network architectures of different sizes. To our knowledge, this is the first work that shows linear convergence of FedAvg (in both its SGD and GD versions) to the optimal solution when training overparameterized multi-layer neural networks.

Related Work: Federated Learning (FL). FL algorithms were first proposed in McMahan et al. (2017), where within each communication round the clients utilize their private data to update the model parameters using multiple SGD steps. Earlier works analyzed the performance of FedAvg in the homogeneous data setting Zhou and Cong (2018); Stich (2018); Lin et al. (2020); Woodworth et al. (2020b); Wang and Joshi (2021), i.e., when the local data at each client follow the same underlying distribution. Motivated by practical applications, recent works have analyzed FedAvg for heterogeneous client data distributions Yu et al. (2019b;a); Haddadpour and Mahdavi (2019); Woodworth et al. (2020a), and it was observed that the performance of FedAvg degrades as the data heterogeneity increases. To address the data heterogeneity issue among clients, many works have focused on developing more sophisticated algorithms Karimireddy et al. (2020b); Zhang et al. (2021); Acar et al. (2020); Li et al. (2018); Khanduri et al. (2021); Karimireddy et al. (2020a); Das et al. (2020).

Overparameterized Neural Networks. The surprising performance of overparameterized neural networks (a model is generally referred to as overparameterized if the number of trainable parameters exceeds the number of training samples N) has raised significant research interest in the ML community in analyzing the phenomenon of overparameterization Belkin et al. (2019). Consequently, many works have analyzed the performance of centralized (stochastic) gradient descent, (S)GD, on overparameterized neural network architectures under different settings Jacot et al. (2018); Li and Liang (2018); Arora et al. (2019); Du et al. (2018; 2019); Allen-Zhu et al. (2019); Zou and Gu (2019); Nguyen and Mondelli (2020); Nguyen (2021). However, only a handful of works have attempted to analyze the performance of overparameterized neural networks in the distributed setting Li et al. (2021); Huang et al. (2021); Deng and Mahdavi (2021). The works most closely related to ours are Huang et al. (2021) and Deng and Mahdavi (2021). Huang et al. (2021) analyzed the performance of FedAvg on a single-hidden-layer neural network for the case when each client utilizes GD for the local updates. The authors established linear convergence of FedAvg under the NTK parameterization and showed that a network of width Ω(N^4) suffices to achieve this performance (where N is the number of training samples). Similarly, Deng and Mahdavi (2021) analyzed the performance of FedAvg on a ReLU neural network for the case when each client utilizes SGD (or GD) for the local updates. The authors proved convergence of FedAvg under the standard parameterization while requiring the very large network width of Ω(N^18). Since individual clients can be devices with limited computational capabilities, networks of such large widths are undesirable in realistic settings. In contrast to both of these works, we focus on the more practical setting of a multi-layer neural network Nguyen and Mondelli (2020) and establish linear convergence of FedAvg even when each client utilizes SGD for the local updates. Importantly, we show that with proper initialization, a network of width N at each client suffices, which is much smaller than the unrealistic width requirements of Huang et al. (2021) and Deng and Mahdavi (2021).

2. PROBLEM SETUP

In this section, we define the multi-layer neural network and formalize the problem we aim to solve. We consider a distributed system of K clients, with each client having access to a privately held data set. We assume that each client k ∈ {1, . . . , K} has N_k training samples denoted as {(X_k, Y_k)}, with X_k ∈ R^{N_k × d_in} and Y_k ∈ R^{N_k × d_out}. Each row of X_k (resp. Y_k) is a feature vector (resp. its corresponding label), and d_in and d_out denote the feature (input) and label (output) dimensions, respectively. We further denote N = Σ_{k=1}^{K} N_k as the total number of samples across all clients.
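With the notation of Section 2 in place, the global training problem that FedAvg targets can be sketched as follows (assuming, for concreteness, a squared loss with sample-size weighting; the exact loss and normalization used in the analysis may differ):

```latex
\min_{\theta}\; L(\theta) \;=\; \sum_{k=1}^{K} \frac{N_k}{N}\, L_k(\theta),
\qquad
L_k(\theta) \;=\; \frac{1}{2N_k}\,\bigl\| f(\theta; X_k) - Y_k \bigr\|_F^2 ,
```

where f(θ; X_k) ∈ R^{N_k × d_out} stacks the network's outputs on client k's N_k samples and θ collects all layer weights. Achieving (almost) zero training loss means driving L(θ) nearly to zero on every client's data simultaneously.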


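To make the vanilla FedAvg (Local SGD) protocol discussed above concrete, here is a minimal, self-contained sketch of the server/client loop: broadcast the global model, run local gradient steps on each client's private data, then average the returned models weighted by local sample counts. All names are illustrative, and a toy least-squares model stands in for the multi-layer network to keep the sketch short; the clients below are heterogeneous (very different feature distributions) yet share a realizable ground truth, so zero training loss is attainable.

```python
import numpy as np

def local_update(w, X, y, steps, lr):
    """Client update: `steps` full-batch gradient steps on the local squared loss."""
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)  # gradient of 0.5 * mean((Xw - y)^2)
        w = w - lr * grad
    return w

def fedavg(clients, w0, rounds, local_steps, lr):
    """Vanilla FedAvg: broadcast, local updates, sample-size-weighted averaging."""
    w = w0
    total = sum(len(y) for _, y in clients)
    for _ in range(rounds):
        local_models = [local_update(w.copy(), X, y, local_steps, lr)
                        for X, y in clients]
        w = sum((len(y) / total) * wk for wk, (_, y) in zip(local_models, clients))
    return w

# Heterogeneous clients: same ground-truth model, shifted feature distributions.
rng = np.random.default_rng(0)
w_star = rng.normal(size=3)
clients = []
for shift in (0.0, 3.0):
    X = rng.normal(loc=shift, size=(20, 3))
    clients.append((X, X @ w_star))  # realizable labels -> zero loss is attainable

w = fedavg(clients, np.zeros(3), rounds=500, local_steps=5, lr=0.02)
train_loss = sum(0.5 * np.mean((X @ w - y) ** 2) for X, y in clients)
```

Despite the distribution shift between the two clients, the averaged iterate drives the total training loss toward zero, illustrating the kind of heterogeneity-resilient convergence the paper establishes for overparameterized networks.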