CLIENT SELECTION IN FEDERATED LEARNING: CONVERGENCE ANALYSIS AND POWER-OF-CHOICE SELECTION STRATEGIES

Anonymous authors
Paper under double-blind review

Abstract

Federated learning is a distributed optimization paradigm that enables a large number of resource-limited client nodes to cooperatively train a model without data sharing. Several works have analyzed the convergence of federated learning by accounting for data heterogeneity, communication and computation limitations, and partial client participation. However, they assume unbiased client participation, where clients are selected at random or in proportion to their data sizes. In this paper, we present the first convergence analysis of federated optimization for biased client selection strategies, and we quantify how the selection skew affects convergence speed. We reveal that biasing client selection towards clients with higher local loss achieves faster error convergence. Using this insight, we propose POWER-OF-CHOICE, a communication- and computation-efficient client selection framework that can flexibly span the trade-off between convergence speed and solution bias. We also propose an extension of POWER-OF-CHOICE that maintains the convergence speed improvement while diminishing the selection skew. Our experiments demonstrate that POWER-OF-CHOICE strategies can converge up to 3× faster and give 10% higher test accuracy than the baseline random selection.

1. INTRODUCTION

Until recently, machine learning models were largely trained in the data center setting (Dean et al., 2012) using powerful computing nodes, fast inter-node communication links, and large centrally available training datasets. The future of machine learning lies in moving both data collection and model training to the edge. The emerging paradigm of federated learning (McMahan et al., 2017; Kairouz et al., 2019; Bonawitz et al., 2019) considers a large number of resource-constrained mobile devices that collect training data from their environment. Due to limited communication capabilities and privacy concerns, these data cannot be directly sent over to the cloud. Instead, the nodes locally perform a few iterations of training using local-update stochastic gradient descent (SGD) (Yu et al., 2018; Stich, 2018; Wang & Joshi, 2018; 2019), and only send model updates periodically to the aggregating cloud server. Besides communication limitations, the key scalability challenge faced by the federated learning framework is that the client nodes can have highly heterogeneous local datasets and computation speeds. The effect of data heterogeneity on the convergence of local-update SGD is analyzed in several recent works (Reddi et al., 2020; Haddadpour & Mahdavi, 2019; Khaled et al., 2020; Stich & Karimireddy, 2019; Woodworth et al., 2020; Koloskova et al., 2020; Huo et al., 2020; Zhang et al., 2020; Pathak & Wainwright, 2020; Malinovsky et al., 2020; Sahu et al., 2019), and methods to overcome the adverse effects of data and computational heterogeneity are proposed in (Sahu et al., 2019; Wang et al., 2020; Karimireddy et al., 2019), among others.

Partial Client Participation. Most of the recent works described above assume full client participation, that is, all nodes participate in every training round. In practice, only a small fraction of client nodes participate in each training round, which can exacerbate the adverse effects of data heterogeneity.
While some existing convergence guarantees for full client participation and methods to tackle heterogeneity can be generalized to partial client participation (Li et al., 2020), these generalizations are limited to unbiased client participation, where each client's contribution to the expected global objective optimized in each round is proportional to its dataset size. In Ruan et al. (2020), the authors analyze convergence under flexible device participation, where devices can freely join or leave the training process or send incomplete updates to the server. However, adaptive client selection that is cognizant of the training progress at each client has not yet been understood. It is important to analyze and understand biased client selection strategies since, by preferentially selecting clients with higher local loss values, they can sharply accelerate error convergence and hence boost communication efficiency in heterogeneous environments, as we show in this paper. This idea has been explored in recent empirical studies (Goetz et al., 2019; Laguel et al., 2020; Ribero & Vikalo, 2020). Nishio & Yonetani (2019) proposed grouping clients based on hardware and wireless resources in order to save communication resources. Goetz et al. (2019) (which we include as a benchmark in our experiments) proposed client selection based on local loss, and Ribero & Vikalo (2020) proposed utilizing the progression of clients' weights. But these schemes are limited to empirical demonstrations without a rigorous analysis of how selection skew affects convergence speed. Another relevant line of work (Jiang et al., 2019; Katharopoulos & Fleuret, 2018; Shah et al., 2020; Salehi et al., 2018) employs biased selection or importance sampling of data to speed up the convergence of classic centralized SGD: these works propose preferentially selecting samples with the highest loss or the highest gradient norm to perform the next SGD iteration. In contrast, Shah et al. (2020) propose biased selection of lower-loss samples to improve robustness to outliers. Generalizing such strategies to the federated learning setting is a non-trivial and open problem because of the large-scale, distributed, and heterogeneous nature of the training data.

Our Contributions. In this paper, we present the first (to the best of our knowledge) convergence analysis of federated learning with biased client selection that is cognizant of the training progress at each client. We discover that biasing the client selection towards clients with higher local losses increases the rate of convergence compared to unbiased client selection. Using this insight, we propose the POWER-OF-CHOICE client selection strategy and show through extensive experiments that POWER-OF-CHOICE yields up to 3× faster convergence with 10% higher test performance than standard federated averaging with random selection. POWER-OF-CHOICE is designed to incur minimal communication and computation overhead, enhancing resource efficiency in federated learning. In fact, we show that even with 3× fewer clients participating in each round as compared to random selection, POWER-OF-CHOICE gives 2× faster convergence and 5% higher test accuracy.

2. PROBLEM FORMULATION

Consider a cross-device federated learning setup with $K$ total clients, where client $k$ has a local dataset $B_k$ consisting of $|B_k| = D_k$ data samples. The clients are connected via a central aggregating server and seek to collectively find the model parameter $w$ that minimizes the empirical risk
$$
F(w) \;=\; \frac{1}{\sum_{k=1}^{K} D_k} \sum_{k=1}^{K} \sum_{\xi \in B_k} f(w, \xi) \;=\; \sum_{k=1}^{K} p_k F_k(w), \tag{1}
$$
where $f(w, \xi)$ is the composite loss function for sample $\xi$ and parameter vector $w$, $p_k = D_k / \sum_{j=1}^{K} D_j$ is the fraction of data at the $k$-th client, and $F_k(w) = \frac{1}{|B_k|} \sum_{\xi \in B_k} f(w, \xi)$ is the local objective function of client $k$. In federated learning, the vector $w^*$ that minimizes $F(w)$ and the vectors $w_k^*$ that minimize $F_k(w)$ for $k = 1, \dots, K$ can be very different from each other. We define $F^* = \min_w F(w) = F(w^*)$ and $F_k^* = \min_w F_k(w) = F_k(w_k^*)$.

Federated Averaging with Partial Client Participation. The most common algorithm to solve (1) is federated averaging (FedAvg), proposed in McMahan et al. (2017). The algorithm divides the training into communication rounds. In each round, to save communication cost, the central server selects only a fraction $C$ of clients, that is, $m = CK$ clients, to participate in the training. Each selected/active client performs $\tau$ iterations of local SGD (Stich, 2018; Wang & Joshi, 2018; Yu et al., 2018) and sends its locally updated model back to the server. The server then updates the global model using the local models and broadcasts the new global model to the next set of active clients.

Formally, we index the local SGD iterations with $t \geq 0$. The set of active clients at iteration $t$ is denoted by $S^{(t)}$. Since the active clients perform $\tau$ steps of local updates, the active set $S^{(t)}$ remains constant for every $\tau$ iterations. That is, if $(t + 1) \bmod \tau = 0$, then $S^{(t+1)} = S^{(t+2)} = \cdots = S^{(t+\tau)}$.
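The FedAvg procedure with partial client participation described above can be sketched on a toy problem as follows. This is a minimal NumPy sketch, not the authors' implementation: the quadratic local objectives, the specific constants, and the plain averaging aggregation rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: K clients, each with a quadratic local objective
# F_k(w) = 0.5 * ||w - c_k||^2, whose gradient is simply (w - c_k).
K, C, tau, lr = 20, 0.2, 5, 0.1
centers = rng.normal(size=(K, 2))           # heterogeneous local optima c_k
data_sizes = rng.integers(50, 150, size=K)  # D_k
p = data_sizes / data_sizes.sum()           # p_k = D_k / sum_j D_j

def local_sgd(w, c, tau, lr):
    """tau steps of (deterministic) gradient descent on F_k."""
    for _ in range(tau):
        w = w - lr * (w - c)
    return w

w_global = np.zeros(2)
m = int(C * K)                              # m = CK active clients per round
for rnd in range(50):
    # Unbiased selection: sample m clients with probabilities p_k.
    S = rng.choice(K, size=m, replace=False, p=p)
    local_models = [local_sgd(w_global.copy(), centers[k], tau, lr) for k in S]
    # Server aggregates the returned local models by averaging.
    w_global = np.mean(local_models, axis=0)

print(w_global)  # should hover near the p_k-weighted mean of the c_k
```

With unbiased sampling, the expected round objective matches the global objective in (1), which is why the iterates settle near the weighted mean of the local optima in this toy example.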


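Although the precise POWER-OF-CHOICE strategy is developed later in the paper, the high-level idea stated in the abstract and contributions, namely biasing selection towards clients with higher current local loss, can be sketched as below. The candidate-set size `d` and the loss-evaluation step are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def power_of_choice_select(local_loss_fn, w_global, p, m, d, rng):
    """Pick m clients from a candidate set of d >= m clients,
    biased towards clients with the highest current local loss.

    local_loss_fn(k, w) -> local loss F_k(w); p holds data fractions p_k.
    """
    K = len(p)
    # Step 1: sample a candidate set A of d clients without replacement,
    # with probabilities proportional to the data fractions p_k.
    A = rng.choice(K, size=d, replace=False, p=p)
    # Step 2: evaluate each candidate's local loss at the global model.
    losses = np.array([local_loss_fn(k, w_global) for k in A])
    # Step 3: keep the m candidates with the highest local loss.
    return A[np.argsort(losses)[-m:]]

# Toy example: quadratic local losses F_k(w) = 0.5 * ||w - c_k||^2.
rng = np.random.default_rng(1)
K = 10
centers = rng.normal(size=(K, 2))
p = np.full(K, 1.0 / K)
loss = lambda k, w: 0.5 * np.sum((w - centers[k]) ** 2)

S = power_of_choice_select(loss, np.zeros(2), p, m=3, d=6, rng=rng)
print(sorted(S.tolist()))  # indices of the 3 highest-loss candidates
```

Varying `d` spans the trade-off described in the abstract: `d = m` recovers unbiased sampling, while larger `d` skews selection more strongly towards high-loss clients.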