CLIENT SELECTION IN FEDERATED LEARNING: CONVERGENCE ANALYSIS AND POWER-OF-CHOICE SELECTION STRATEGIES

Anonymous authors
Paper under double-blind review

Abstract

Federated learning is a distributed optimization paradigm that enables a large number of resource-limited client nodes to cooperatively train a model without data sharing. Several works have analyzed the convergence of federated learning by accounting for data heterogeneity, communication and computation limitations, and partial client participation. However, they assume unbiased client participation, where clients are selected at random or in proportion to their data sizes. In this paper, we present the first convergence analysis of federated optimization for biased client selection strategies, and quantify how the selection skew affects convergence speed. We reveal that biasing client selection towards clients with higher local loss achieves faster error convergence. Using this insight, we propose POWER-OF-CHOICE, a communication- and computation-efficient client selection framework that can flexibly span the trade-off between convergence speed and solution bias. We also propose an extension of POWER-OF-CHOICE that maintains the convergence speed improvement while diminishing the selection skew. Our experiments demonstrate that POWER-OF-CHOICE strategies can converge up to 3× faster and give 10% higher test accuracy than the baseline random selection.

1. INTRODUCTION

Until recently, machine learning models were largely trained in the data center setting (Dean et al., 2012) using powerful computing nodes, fast inter-node communication links, and large centrally available training datasets. The future of machine learning lies in moving both data collection as well as model training to the edge. The emerging paradigm of federated learning (McMahan et al., 2017; Kairouz et al., 2019; Bonawitz et al., 2019) considers a large number of resource-constrained mobile devices that collect training data from their environment. Due to limited communication capabilities and privacy concerns, these data cannot be directly sent over to the cloud. Instead, the nodes locally perform a few iterations of training using local-update stochastic gradient descent (SGD) (Yu et al., 2018; Stich, 2018; Wang & Joshi, 2018; 2019), and only send model updates periodically to the aggregating cloud server. Besides communication limitations, the key scalability challenge faced by the federated learning framework is that the client nodes can have highly heterogeneous local datasets and computation speeds. The effect of data heterogeneity on the convergence of local-update SGD is analyzed in several recent works (Reddi et al., 2020; Haddadpour & Mahdavi, 2019; Khaled et al., 2020; Stich & Karimireddy, 2019; Woodworth et al., 2020; Koloskova et al., 2020; Huo et al., 2020; Zhang et al., 2020; Pathak & Wainwright, 2020; Malinovsky et al., 2020; Sahu et al., 2019), and methods to overcome the adverse effects of data and computational heterogeneity are proposed in (Sahu et al., 2019; Wang et al., 2020; Karimireddy et al., 2019), among others.

Partial Client Participation. Most of the recent works described above assume full client participation, that is, all nodes participate in every training round. In practice, only a small fraction of client nodes participate in each training round, which can exacerbate the adverse effects of data heterogeneity.
While some existing convergence guarantees for full client participation and methods to tackle heterogeneity can be generalized to partial client participation (Li et al., 2020), these generalizations are limited to unbiased client participation, where each client's contribution to the expected global objective optimized in each round is proportional to its dataset size. Ruan et al. (2020) analyze convergence under flexible device participation, where devices can freely join or leave the training process or send incomplete updates to the server. However, adaptive client selection that is cognizant of the training progress at each client is not yet well understood. It is important to analyze biased client selection strategies because, by preferentially selecting clients with higher local loss values, they can sharply accelerate error convergence and hence boost communication efficiency in heterogeneous environments, as we show in this paper. This idea has been explored in recent empirical studies (Goetz et al., 2019; Laguel et al., 2020; Ribero & Vikalo, 2020). Nishio & Yonetani (2019) proposed grouping clients based on hardware and wireless resources in order to save communication resources. Goetz et al. (2019) (which we include as a benchmark in our experiments) proposed client selection based on local loss, and Ribero & Vikalo (2020) proposed utilizing the progression of clients' weights. But these schemes are limited to empirical demonstrations without a rigorous analysis of how selection skew affects convergence speed. Another relevant line of work (Jiang et al., 2019; Katharopoulos & Fleuret, 2018; Shah et al., 2020; Salehi et al., 2018) employs biased selection or importance sampling of data to speed up convergence of classic centralized SGD: they propose preferentially selecting samples with the highest loss or highest gradient norm to perform the next SGD iteration. In contrast, Shah et al. (2020) propose biased selection of lower-loss samples to improve robustness to outliers. Generalizing such strategies to the federated learning setting is a non-trivial and open problem because of the large-scale, distributed, and heterogeneous nature of the training data.

Our Contributions. In this paper, we present the first (to the best of our knowledge) convergence analysis of federated learning with biased client selection that is cognizant of the training progress at each client. We discover that biasing the client selection towards clients with higher local losses increases the rate of convergence compared to unbiased client selection. Using this insight, we propose the POWER-OF-CHOICE client selection strategy and show through extensive experiments that POWER-OF-CHOICE yields up to 3× faster convergence with 10% higher test performance than standard federated averaging with random selection. POWER-OF-CHOICE is designed to incur minimal communication and computation overhead, enhancing resource efficiency in federated learning. In fact, we show that even with 3× fewer clients participating in each round compared to random selection, POWER-OF-CHOICE gives 2× faster convergence and 5% higher test accuracy.
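To make the high-level idea concrete, the following is a minimal sketch (not the paper's exact algorithm, whose details and notation appear later) of biased client selection toward higher local loss: sample a candidate set of $d$ clients uniformly at random, then keep the $m$ candidates with the highest current local loss. The function and parameter names here are illustrative assumptions.

```python
import random

def power_of_choice_select(client_losses, m, d, rng=random):
    """Sketch of biased client selection toward high local loss.

    client_losses: dict mapping client id -> current local loss F_k(w).
    d: size of the randomly sampled candidate set (d >= m).
    m: number of clients ultimately selected for the round.
    """
    # Sample d candidate clients uniformly at random, without replacement.
    candidates = rng.sample(list(client_losses), d)
    # Keep the m candidates with the highest local loss.
    return sorted(candidates, key=lambda k: client_losses[k], reverse=True)[:m]

losses = {"c1": 0.9, "c2": 0.1, "c3": 0.5, "c4": 0.7}
print(power_of_choice_select(losses, m=2, d=4))  # ['c1', 'c4']
```

With $d = m$ this reduces to unbiased random selection, while larger $d$ skews selection more strongly toward high-loss clients, which is the trade-off knob mentioned in the abstract.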

2. PROBLEM FORMULATION

Consider a cross-device federated learning setup with $K$ total clients, where client $k$ has a local dataset $B_k$ consisting of $|B_k| = D_k$ data samples. The clients are connected via a central aggregating server, and seek to collectively find the model parameter $w$ that minimizes the empirical risk:
$$F(w) = \frac{1}{\sum_{k=1}^{K} D_k} \sum_{k=1}^{K} \sum_{\xi \in B_k} f(w, \xi) = \sum_{k=1}^{K} p_k F_k(w) \tag{1}$$
where $f(w, \xi)$ is the composite loss function for sample $\xi$ and parameter vector $w$. The term $p_k = D_k / \sum_{k'=1}^{K} D_{k'}$ is the fraction of data at the $k$-th client, and $F_k(w) = \frac{1}{|B_k|} \sum_{\xi \in B_k} f(w, \xi)$ is the local objective function of client $k$. In federated learning, the vector $w^*$ that minimizes $F(w)$ and the vectors $w_k^*$ that minimize $F_k(w)$ for $k = 1, \dots, K$ can be very different from each other. We define $F^* = \min_w F(w) = F(w^*)$ and $F_k^* = \min_w F_k(w) = F_k(w_k^*)$.

Federated Averaging with Partial Client Participation. The most common algorithm to solve (1) is federated averaging (FedAvg), proposed in McMahan et al. (2017). The algorithm divides the training into communication rounds. At each round, to save communication cost, the central server selects only a fraction $C$ of the clients, i.e., $m = CK$ clients, to participate in the training. Each selected/active client performs $\tau$ iterations of local SGD (Stich, 2018; Wang & Joshi, 2018; Yu et al., 2018) and sends its locally updated model back to the server. Then, the server updates the global model using the local models and broadcasts the global model to a new set of active clients. Formally, we index the local SGD iterations with $t \geq 0$. The set of active clients at iteration $t$ is denoted by $S^{(t)}$. Since active clients perform $\tau$ steps of local updates, the active set $S^{(t)}$ also remains constant for every $\tau$ iterations. That is, if $(t + 1) \bmod \tau = 0$, then $S^{(t+1)} = S^{(t+2)} = \cdots = S^{(t+\tau)}$.
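The objective in (1) can be sketched directly in code. This is a toy illustration of the weighted decomposition $F(w) = \sum_k p_k F_k(w)$, not an implementation from the paper; the function names are ours.

```python
def local_objective(w, data, loss_fn):
    # F_k(w) = (1 / |B_k|) * sum over local samples xi of f(w, xi)
    return sum(loss_fn(w, xi) for xi in data) / len(data)

def global_objective(w, client_datasets, loss_fn):
    # F(w) = sum_k p_k F_k(w), with p_k = D_k / sum_{k'} D_{k'}
    total = sum(len(d) for d in client_datasets)
    return sum(len(d) / total * local_objective(w, d, loss_fn)
               for d in client_datasets)

# Toy example: scalar least squares, f(w, xi) = (w - xi)^2
loss_fn = lambda w, xi: (w - xi) ** 2
datasets = [[0.0, 1.0], [2.0]]  # K = 2 clients, p = (2/3, 1/3)
print(global_objective(1.0, datasets, loss_fn))  # 2/3*0.5 + 1/3*1.0 = 2/3
```

Note that in this toy example the minimizers already differ across clients ($w_1^* = 0.5$, $w_2^* = 2.0$, $w^* = 1.0$), which is exactly the heterogeneity the paper's analysis must handle.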
[Figure: Illustration of client selection on a two-client toy problem with local objectives $F_1(w)$, $F_2(w)$ and global objective $F(w)$. (a) Selecting Higher Loss Clients: the client with the higher current local loss is selected in each round. (b) Random Client Selection: clients selected 2, 2, 1, 1, 2.]

Accordingly, the update rule of FedAvg can be written as follows:
$$w_k^{(t+1)} = w_k^{(t)} - \eta_t \, g_k(w_k^{(t)}, \xi_k^{(t)}) \quad \text{for } (t + 1) \bmod \tau$$
= 0 1 m j∈S (t) w (t) j -η t g j (w (t) j , ξ (t) j ) w (t+1) for (t + 1) mod τ = 0 where w (t+1) k denotes the local model parameters of client k at iteration t, η t is the learning rate, and g k (w (t) k , ξ (t) k ) = 1 b ξ∈ξ (t) k ∇f (w (t) k , ξ) is the stochastic gradient over mini-batch ξ (t) k of size b that is randomly sampled from client k's local dataset B k . Moreover, w (t+1) denotes the global model at server. Although w (t) is only updated after every τ iterations, for the purpose of convergence analysis we consider a virtual sequence of w (t) that is updated at each iteration as follows w (t+1) = w (t) -η t g (t) = w (t) -η t   1 m k∈S (t) g k (w (t) k , ξ (t) k )   (3) with g (t) = 1 m k∈S (t) g k (w (t) k , ξ k ). Note that in (2) and (3) we do not weight the client models by their dataset fractions p k because p k is considered in the client selection scheme used to decide the set S (t) . Our convergence analysis can be generalized to when the global model is a weighted average instead of a simple average of client models, and we show in Appendix E that our convergence analysis also covers the sampling uniformly at random without replacement scheme proposed by Li et al. (2020) . The set S (t) can be sampled either with or without replacement. For sampling with replacement, we assume that multiple copies of the same client in the set S (t) behave as different clients, that is, they perform local updates independently. Client Selection Strategy. To guarantee FedAvg converges to the stationary points of the objective function (1), most current analysis frameworks (Li et al., 2020; Karimireddy et al., 2019; Wang et al., 2020) consider a strategy that selects the set S (t) by sampling m clients at random (with replacement) such that client k is selected with probability p k , the fraction of data at that client. This sampling scheme is unbiased since it ensures that in expectation, the update rule (3) is the same as full client participation. 
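The local-update rule and periodic averaging described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' released code: the `Client` class with a toy quadratic objective, the noise scale, and the selection callback `select_fn` are all hypothetical stand-ins.

```python
import numpy as np

class Client:
    """Toy client whose local objective is F_k(w) = 0.5 * ||w - c_k||^2."""
    def __init__(self, center):
        self.center = np.asarray(center, dtype=float)

    def stoch_grad(self, w, rng):
        # Unbiased stochastic gradient g_k(w, xi): true gradient plus zero-mean noise.
        return (w - self.center) + 0.01 * rng.standard_normal(w.shape)

def pi_rand(w_global, clients, m, rng):
    # Unbiased baseline: sample m client indices uniformly without replacement.
    return rng.choice(len(clients), size=m, replace=False)

def fedavg_round(w_global, clients, select_fn, m, tau, lr, rng):
    """One round of update rule (2): each selected client runs tau local SGD
    steps starting from the global model, then the local models are averaged."""
    selected = select_fn(w_global, clients, m, rng)
    local_models = []
    for k in selected:
        w = w_global.copy()
        for _ in range(tau):
            w = w - lr * clients[k].stoch_grad(w, rng)
        local_models.append(w)
    return np.mean(local_models, axis=0)   # simple average over S^(t)

rng = np.random.default_rng(0)
clients = [Client([1.0, 0.0]), Client([0.0, 1.0]), Client([2.0, 2.0])]
w = np.zeros(2)
for _ in range(200):
    w = fedavg_round(w, clients, pi_rand, m=2, tau=2, lr=0.1, rng=rng)
```

With uniform random selection and equal weighting, the iterate wanders around the average of the client optima, illustrating how partial participation injects selection randomness into the global model.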
Hence, it enjoys the same convergence properties as local-update SGD methods (Stich, 2018; Wang & Joshi, 2018). We denote this unbiased random client selection strategy by π_rand. In this paper, we consider a class of biased client selection strategies that are cognizant of the global training progress, which (to the best of our knowledge) has not been studied before. Note that for any aggregation and sampling scheme with partial client participation, if the expected update of the global model (over the randomness of the sampling scheme) equals the update under full client participation, we call it an unbiased client participation scheme. For example, in Horváth & Richtárik (2020), even with a biased sampling scheme, the normalizing aggregation makes the update rule unbiased. In this sense, our paper encompasses both biased and unbiased update rules. In the two-client example in Figure 1, we set S^{(t+1)} = arg max_{k∈[K]} F_k(w^{(t)}), the single client with the highest local loss at the current global model. In this toy example, the selection strategy cannot guarantee that the update (3) equals the full-participation update in expectation. Nevertheless, it converges faster to the global minimum than random selection. Motivated by this observation, we define a client selection strategy π as a function that maps the current global model w to a selected set of clients S(π, w).

3. CONVERGENCE ANALYSIS

In this section we analyze the convergence of federated averaging with partial device participation for any client selection strategy π as defined above. This analysis reveals that biased client selection can give faster convergence, albeit at the risk of a non-vanishing gap between the true optimum w^* = arg min_w F(w) and lim_{t→∞} w^{(t)}. We use this insight in Section 4 to design client selection strategies that strike a balance between convergence speed and bias.

3.1. ASSUMPTIONS AND DEFINITIONS

First we introduce the assumptions and definitions used in our convergence analysis.

Assumption 3.1. F_1, ..., F_K are all L-smooth, i.e., for all v and w, F_k(v) ≤ F_k(w) + (v - w)^T ∇F_k(w) + (L/2) ||v - w||_2^2.

Assumption 3.2. F_1, ..., F_K are all µ-strongly convex, i.e., for all v and w, F_k(v) ≥ F_k(w) + (v - w)^T ∇F_k(w) + (µ/2) ||v - w||_2^2.

Assumption 3.3. For a mini-batch ξ_k sampled uniformly at random from B_k at client k, the resulting stochastic gradient is unbiased, that is, E[g_k(w_k, ξ_k)] = ∇F_k(w_k). Also, the variance of the stochastic gradients is bounded: E||g_k(w_k, ξ_k) - ∇F_k(w_k)||^2 ≤ σ^2 for all k = 1, ..., K.

Assumption 3.4. The expected squared norm of the stochastic gradients is uniformly bounded, i.e., E||g_k(w_k, ξ_k)||^2 ≤ G^2 for all k = 1, ..., K.

The above assumptions are common in the related literature, see (Stich, 2018; Basu et al., 2019; Li et al., 2020; Ruan et al., 2020). Next, we introduce two metrics, the local-global objective gap and the selection skew, which feature prominently in the convergence analysis presented in Theorem 3.1.

Definition 3.1 (Local-Global Objective Gap).

Γ ≜ F^* - Σ_{k=1}^K p_k F_k^* = Σ_{k=1}^K p_k (F_k(w^*) - F_k(w_k^*)) ≥ 0.   (4)

Note that Γ is an inherent property of the local and global objective functions, independent of the client selection strategy. This definition was introduced in previous literature by Li et al. (2020). A larger Γ implies higher data heterogeneity. If Γ = 0, then the local and global optimal values are consistent, and there is no solution bias due to the client selection strategy (see Theorem 3.1). Next, we define another metric called the selection skew, which captures the effect of the client selection strategy on the local-global objective gap.

Definition 3.2 (Selection Skew). For any w and w' we define

ρ(S(π, w), w') = E_{S(π,w)}[ (1/m) Σ_{k∈S(π,w)} (F_k(w') - F_k^*) ] / ( F(w') - Σ_{k=1}^K p_k F_k^* ) ≥ 0,   (5)

which reflects the skew of a client selection strategy π.
The first argument w in ρ(S(π, w), w') is the parameter vector that governs the client selection, and w' is the point at which F_k and F in the numerator and denominator, respectively, are evaluated. Here, E_{S(π,w)}[·] denotes the expectation over the randomness of the selection strategy π, since π can map a given w to multiple sets S. Since ρ(S(π, w), w') is a function of the global models w and w', which change during training, we define two related metrics that are independent of w and w'. These metrics enable us to obtain a conservative error bound in the convergence analysis:

ρ ≜ min_{w, w'} ρ(S(π, w), w'),   and   ρ̄ ≜ max_w ρ(S(π, w), w^*),   (6)

where w^* = arg min_w F(w). From (6), we have ρ ≤ ρ̄ for any client selection strategy π. Effect of the Client Selection Strategy on ρ and ρ̄. For the unbiased client selection strategy π_rand we have ρ(S(π_rand, w), w') = 1 for all w and w', since the numerator and denominator of (5) become equal, and hence ρ = ρ̄ = 1. For a client selection strategy π that more often chooses clients with higher F_k(w), ρ and ρ̄ will be larger (and ≥ 1). In the convergence analysis we show that a larger ρ implies faster convergence, albeit with a potential error gap proportional to (ρ̄/ρ - 1). Motivated by this, in Section 4 we present an adaptive client selection strategy that prefers selecting clients with higher loss F_k(w) and achieves faster convergence with low solution bias.
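The selection skew of Definition 3.2 can be estimated numerically by Monte Carlo. The sketch below is a hypothetical toy setup (the four quadratic client losses, the weights `p`, and the two selection rules are all illustrative, not from the paper's code); it compares the unbiased rule, whose skew is 1, against a greedy highest-loss rule, whose skew exceeds 1.

```python
import numpy as np

def selection_skew(select_fn, w, losses_fn, opt_losses, p, m, n_draws, rng):
    """Monte Carlo estimate of rho(S(pi, w), w') with w' = w:
    E_S[(1/m) sum_{k in S} (F_k(w) - F_k*)] / sum_k p_k (F_k(w) - F_k*)."""
    excess = losses_fn(w) - opt_losses          # F_k(w) - F_k* for all k
    num = np.mean([excess[select_fn(w, losses_fn(w), m, rng)].mean()
                   for _ in range(n_draws)])
    den = np.dot(p, excess)                     # F(w) - sum_k p_k F_k*
    return num / den

# Toy example: 4 clients, F_k(w) = 0.5 * (w - c_k)^2, so F_k* = 0.
centers = np.array([0.0, 1.0, 2.0, 3.0])
p = np.full(4, 0.25)
losses_fn = lambda w: 0.5 * (w - centers) ** 2
opt = np.zeros(4)

def pi_rand(w, losses, m, rng):                 # unbiased random selection
    return rng.choice(len(losses), size=m, replace=False)

def pi_greedy(w, losses, m, rng):               # pick the m highest-loss clients
    return np.argsort(losses)[-m:]

rng = np.random.default_rng(1)
rho_rand = selection_skew(pi_rand, 0.5, losses_fn, opt, p, m=2, n_draws=4000, rng=rng)
rho_greedy = selection_skew(pi_greedy, 0.5, losses_fn, opt, p, m=2, n_draws=1, rng=rng)
print(rho_rand, rho_greedy)   # rho is about 1 for pi_rand and larger for pi_greedy
```

Because all clients here have equal weight p_k, the denominator equals the population-average excess loss, so the unbiased estimate concentrates near 1 while the greedy rule's skew is deterministic and strictly larger.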

3.2. MAIN CONVERGENCE RESULT

Here, we present the convergence result for any client selection strategy π for federated averaging with partial device participation, in terms of the local-global objective gap Γ and the selection skews ρ and ρ̄.

Theorem 3.1 (Convergence with Decaying Learning Rate). Under Assumptions 3.1 to 3.4, for learning rate η_t = 1/(µ(t+γ)) with γ = 4L/µ, and any client selection strategy π, the error after T iterations of federated averaging with partial device participation satisfies

E[F(w^{(T)})] - F^* ≤ (1/(T+γ)) [ 4L(32τ^2 G^2 + σ^2/m)/(3µ^2 ρ) + 8L^2 Γ/µ^2 + (Lγ/2) ||w^{(0)} - w^*||_2^2 ]   (vanishing error term)
  + (8LΓ/(3µ)) (ρ̄/ρ - 1)   (non-vanishing bias, Q(ρ, ρ̄))   (7)

To the best of our knowledge, Theorem 3.1 provides the first convergence analysis of federated averaging with a biased client selection strategy π. We also give the result for a fixed learning rate in Appendix A; the proof is presented in Appendix C. The first part of our proof follows techniques presented by Li et al. (2020). We then introduce the novel concept of selection skew and analyze the effect of biased client selection strategies, which has not appeared in previous literature. We highlight that our convergence result is a general analysis applicable to any selection strategy π that is cognizant of the training progress. In the following paragraphs, we discuss the two terms in (7) in detail.

Large ρ and Faster Convergence. A key insight from Theorem 3.1 is that a larger selection skew ρ results in faster convergence, at the rate O(1/(Tρ)). Note that since ρ (defined in (6)) is obtained by taking a minimum of the selection skew ρ(S(π, w), w') over w and w', this is a conservative bound on the true convergence rate. In practice, since the selection skew ρ(S(π, w), w') changes during training depending on the current global and local models, the true convergence rate can improve by a factor larger than (and at least equal to) ρ.

Non-vanishing Bias Term.
The second term Q(ρ, ρ̄) = (8LΓ/(3µ))(ρ̄/ρ - 1) in (7) is the solution bias, which depends on the selection strategy. By the definitions of ρ and ρ̄, we have ρ̄ ≥ ρ, which implies Q(ρ, ρ̄) ≥ 0. For an unbiased selection strategy, ρ = ρ̄ = 1 and Q(ρ, ρ̄) = 0, so (7) recovers the previous bound for unbiased selection strategies (Li et al., 2020). For ρ > 1, while we gain a faster convergence rate by a factor of ρ, we cannot guarantee Q(ρ, ρ̄) = 0. Thus, there is a trade-off between convergence speed and solution bias. In the experimental results, we show that even with biased selection strategies, the term (ρ̄/ρ - 1) in Q(ρ, ρ̄) can be close to 0, so Q(ρ, ρ̄) has a negligible effect on the final error floor.
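The trade-off between the two terms of (7) can be made concrete by plugging in illustrative constants. All numeric values below are arbitrary assumptions chosen only to show the mechanics: increasing ρ shrinks the vanishing term, while ρ̄/ρ > 1 introduces a nonzero bias floor Q(ρ, ρ̄).

```python
# Illustrative constants (not from the paper): smoothness L, strong convexity mu,
# heterogeneity Gamma, local steps tau, gradient bounds G and sigma, m clients, horizon T.
L, mu, Gamma, tau, G, sigma, m, T = 10.0, 1.0, 0.5, 2, 1.0, 1.0, 5, 10_000
gamma = 4 * L / mu
dist0 = 1.0  # ||w^(0) - w*||^2

def bound(rho, rho_bar):
    # Vanishing term of (7): decays as 1/(T + gamma), shrinks with larger rho.
    vanishing = (1.0 / (T + gamma)) * (
        4 * L * (32 * tau**2 * G**2 + sigma**2 / m) / (3 * mu**2 * rho)
        + 8 * L**2 * Gamma / mu**2
        + L * gamma * dist0 / 2
    )
    # Non-vanishing bias Q(rho, rho_bar) = (8*L*Gamma/(3*mu)) * (rho_bar/rho - 1).
    bias = (8 * L * Gamma / (3 * mu)) * (rho_bar / rho - 1)
    return vanishing, bias

v_rand, q_rand = bound(rho=1.0, rho_bar=1.0)   # unbiased selection
v_pow, q_pow = bound(rho=2.0, rho_bar=2.2)     # skewed selection
print(v_rand, q_rand)   # q_rand == 0: no solution bias
print(v_pow, q_pow)     # smaller vanishing term, small nonzero bias
```

Under these made-up constants, the skewed strategy trades a strictly smaller vanishing term for a small positive error floor, exactly the trade-off the theorem describes.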

4. PROPOSED POWER-OF-CHOICE CLIENT SELECTION STRATEGY

From (5) and (6) we see that a selection strategy π that prefers clients with larger F_k(w) - F_k^* results in a larger ρ, yielding faster convergence. Using this insight, a naive client selection strategy is to choose the clients with the highest local loss F_k(w). However, a larger selection skew ρ may also result in a larger ρ̄/ρ, i.e., a larger non-vanishing error term. This naive strategy has another drawback: to find the current local losses F_k(w), it requires sending the current global model to all K clients, having them evaluate F_k, and collecting the results. This additional communication and computation cost can be prohibitively high because the number of clients K is typically very large, and these clients have limited communication and computation capabilities. In this section, we use these insights regarding the trade-off between convergence speed, solution bias, and communication/computation overhead to propose the POWER-OF-CHOICE client selection strategy. POWER-OF-CHOICE is based on the power-of-d-choices load balancing strategy (Mitzenmacher, 1996), which is extensively used in queueing systems. In the POWER-OF-CHOICE client selection strategy (denoted by π_pow-d), the central server chooses the active client set S^{(t)} as follows:

1. Sample the Candidate Client Set. The central server samples a candidate set A of d (m ≤ d ≤ K) clients without replacement such that client k is chosen with probability p_k, the fraction of data at the k-th client, for k = 1, ..., K.

2. Estimate Local Losses. The server sends the current global model w^{(t)} to the clients in set A, and these clients compute and send back to the central server their local loss F_k(w^{(t)}).

3. Select Highest-Loss Clients. The server selects the active client set S^{(t)} as the m clients in A with the largest local losses F_k(w^{(t)}).

• Communication- and Computation-Efficient Variant π_rpow-d: To save both local computation and communication cost, each selected client sends its accumulated average loss over the local iterations, i.e., (1/(τ|ξ_k^{(l)}|)) Σ_{l=t-τ+1}^{t} Σ_{ξ∈ξ_k^{(l)}} f(w_k^{(l)}, ξ), when it sends its local model to the server. The server uses the latest value received from each client as a proxy for F_k(w) when selecting clients. For clients that have not yet been selected, this value is set to ∞.

In practice, the convergence speed and the solution bias are dictated by ρ(w^{(τ⌊t/τ⌋)}, w^{(t)}), which changes during training. With π_pow-d, which is biased towards higher local losses, we expect the selection skew ρ(w, w') to decrease over the course of training. We conjecture that this is why π_pow-d gives faster convergence as well as little or no solution bias in our experiments presented in Section 5.
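The three steps of π_pow-d can be sketched as follows. This is a minimal reconstruction, not the released implementation; `eval_loss` is a hypothetical callback standing in for "client k evaluates F_k(w) on its local data", and the toy usage at the bottom is purely illustrative.

```python
import numpy as np

def power_of_choice(w_global, p, eval_loss, m, d, rng):
    """pi_pow-d: (1) sample d candidate clients without replacement with
    probabilities p_k, (2) have them report F_k(w), (3) keep the m clients
    with the highest reported loss."""
    K = len(p)
    candidates = rng.choice(K, size=d, replace=False, p=p)            # step 1
    losses = np.array([eval_loss(k, w_global) for k in candidates])   # step 2
    return candidates[np.argsort(losses)[-m:]]                        # step 3

# Toy usage: 10 equally weighted clients; the loss of client k is just k,
# so higher-index clients are "harder" and should be preferred.
p = np.full(10, 0.1)
eval_loss = lambda k, w: float(k)
rng = np.random.default_rng(0)
selected = power_of_choice(None, p, eval_loss, m=3, d=6, rng=rng)
print(sorted(selected))  # the 3 largest indices among the 6 sampled candidates
```

The parameter d spans the trade-off described above: d = m recovers (nearly) unbiased sampling, while d = K recovers the naive greedy rule with its full communication cost.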

5. EXPERIMENTAL RESULTS

We evaluate our proposed π_pow-d and its practical variants π_cpow-d, π_rpow-d, and π_adapow-d in three sets of experiments: (1) quadratic optimization, (2) logistic regression on a synthetic federated dataset, Synthetic(1,1) (Sahu et al., 2019), and (3) a DNN trained on a non-iid partitioned FMNIST dataset (Xiao et al., 2017). We also benchmark the selection strategy proposed by Goetz et al. (2019), active federated learning, denoted π_afl. Details of the experimental setup are provided in Appendix F, and the code for all experiments is shared in the supplementary material. To validate the consistency of our results, we present additional experiments with a DNN trained on a non-iid partitioned EMNIST (Cohen et al., 2017) dataset sorted by digits with K = 500 clients; the results are in Appendix G.4. π_pow-d shows a convergence speed-up as with K = 30, but the bias is smaller. Figure 3 shows the theoretical values ρ and ρ̄/ρ, which represent the convergence speed and the solution bias, respectively, in our convergence analysis. Compared to π_rand, π_pow-d has higher ρ for all d, implying faster convergence than π_rand. By varying d we can span different points on the trade-off between convergence speed and bias. For d = 15 and K = 100, the ρ̄/ρ values of π_pow-d and π_rand are approximately identical, but π_pow-d has higher ρ, implying that π_pow-d can yield faster convergence with negligible solution bias. In Appendix G.1, we present the clients' selection frequencies for π_pow-d and π_rand, which give novel insights into the difference between the two strategies. For the synthetic dataset simulations, we present the global losses in Figure 4 for π_rand and π_pow-d for different d and m. We show that π_pow-d converges approximately 3× faster to global loss ≈ 0.7 than π_rand when d = 10m, with a slightly higher error floor. Even with d = 2m, we get 2× faster convergence to global loss ≈ 0.7 than π_rand. Elimination of Selection Skew with π_adapow-d.
For π_pow-d, the selection skew is the price paid for the convergence speed gain in Figure 2 and Figure 4: in both simulations, π_pow-d converges to slightly above the global minimum value due to the selection skew. We eliminate this selection skew while maintaining the convergence speed benefit with π_adapow-d (see Figure 2). Experiments with Heterogeneously Distributed FMNIST. As elaborated in Appendix F, α determines the data heterogeneity across clients; smaller α indicates larger data heterogeneity. In Figure 5, we present the test accuracy and training losses of the different sampling strategies in the FMNIST experiments with α = 0.3 and α = 2. Observe that π_pow-d achieves approximately 10% and 5% higher test accuracy than π_rand and π_afl, respectively, for both α = 2 and α = 0.3. For higher α (less data heterogeneity), larger d (more selection skew) performs better than smaller d. Performance of the Communication- and Computation-Efficient Variants. Next, we evaluate π_cpow-d and π_rpow-d, which were introduced in Section 4. In Figure 6, for α = 2, π_rpow-d and π_cpow-d yield approximately 5% and 6% higher accuracy than π_rand, respectively, but both yield lower accuracy than π_pow-d, which uses the most computation and communication resources. For α = 0.3, π_cpow-d and π_rpow-d perform as well as π_pow-d and give a 10% accuracy improvement over π_rand. Moreover, π_pow-d, π_rpow-d and π_cpow-d all achieve higher accuracy and faster convergence than π_afl. We evaluate the communication and computation efficiency of POWER-OF-CHOICE by comparing the different strategies in terms of R_60, the number of communication rounds required to reach 60% test accuracy, and t_comp, the average computation time (in seconds) spent per round.
The computation time includes the time taken by the central server to select the clients (including the time for the d clients to compute their local loss values) and the time taken by the selected clients to perform local updates. In Table 1, with only C = 0.03 fraction of clients, π_pow-d, π_cpow-d, and π_rpow-d have about 5% higher test accuracy than (π_rand, C = 0.1). The R_60 for π_pow-d, π_cpow-d, and π_rpow-d is 0.52, 0.47, and 0.57 times that of (π_rand, C = 0.1), respectively. This implies that even for π_rpow-d, which incurs no additional communication cost for client selection, we get a 2× reduction in the number of communication rounds using one-third of the clients compared to (π_rand, C = 0.1), while still achieving higher test accuracy. Note that the computation time t_comp for π_cpow-d and π_rpow-d with C = 0.03 is smaller than that of π_rand with C = 0.1. In Appendix G.2, we show that the results for α = 2 are consistent with the α = 0.3 case shown in Table 1. In Appendix G.5, we also show that for C = 0.1, the results are consistent with the C = 0.03 case. Effect of Mini-batch Size and Local Epochs. We evaluate the effect of mini-batch size b and local epochs τ on the FMNIST experiments with different sets of hyper-parameters: (b, τ) ∈ {(128, 30), (64, 100)}. Note that (b, τ) = (64, 30) is the default hyper-parameter setting for the previous results. The figures are presented in Appendix G.6. For b = 128, we observe that the performance improvement of π_pow-d over π_rand and π_afl is consistent with b = 64 (see Figure 12). In Figure 14, for τ = 100, with smaller data heterogeneity, the performance gap between π_rand and π_pow-d is consistent with that of τ = 30. For larger data heterogeneity, however, increasing the local epochs results in π_rand and π_pow-d performing similarly.
This shows that with larger data heterogeneity, larger τ results in increasing the selection skew towards specific clients, and weakens generalization.

6. CONCLUDING REMARKS

In this work, we present convergence guarantees for federated learning with partial device participation under any (possibly biased) client selection strategy. We discover that biasing client selection can speed up convergence at the rate O(1/(Tρ)), where ρ is the selection skew towards clients with higher local losses. Motivated by this insight, we propose the adaptive client selection strategy POWER-OF-CHOICE. Extensive experiments validate that POWER-OF-CHOICE yields up to 3× faster convergence and 10% higher test accuracy than the baseline federated averaging with random selection. Even when using fewer clients than random selection, POWER-OF-CHOICE converges 2× faster with high test performance. An interesting future direction is to improve the fairness (Li et al., 2019; Yu et al., 2020; Lyu et al., 2020; Mohri et al., 2019) and robustness (Pillutla et al., 2019) of the POWER-OF-CHOICE strategy by modifying step 3 of the POWER-OF-CHOICE algorithm to use a different metric, such as the clipped loss or the q-fair loss proposed by Li et al. (2019), instead of F_k(w).

A ADDITIONAL THEOREM

Theorem A.1 (Convergence with Fixed Learning Rate). Under Assumptions 3.1 to 3.4, a fixed learning rate η ≤ min{1/(2µB), 1/(4L)} where B = 1 + 3ρ/8, and any client selection strategy π as defined above, the error after T iterations of federated averaging with partial device participation satisfies

E[F(w^{(T)})] - F^* ≤ (L/µ) (1 - ηµ(1 + 3ρ/8))^T [ F(w^{(0)}) - F^* - 4(η(32τ^2 G^2 + σ^2/m + 6ρLΓ) + 2Γ(ρ̄ - ρ))/(8 + 3ρ) ]   (vanishing term)
  + 4Lη(32τ^2 G^2 + σ^2/m + 6ρLΓ)/(µ(8 + 3ρ)) + 8LΓ(ρ̄ - ρ)/(µ(8 + 3ρ))   (non-vanishing bias)   (8)

As T → ∞, the first term in (8) goes to 0 and the second term becomes the bias term for the fixed learning rate case. For a small η, the bias term in Theorem A.1 is upper bounded by (8LΓ/(3µ))(ρ̄/ρ - 1), which is identical to the decaying learning rate case. The proof is presented in Appendix D.

B PRELIMINARIES FOR PROOF OF THEOREM 3.1 AND THEOREM A.1

We present the preliminary lemmas used for proof of Theorem 3.1 and Theorem A.1. We will denote the expectation over the sampling random source S (t) as E S (t) and the expectation over all the random sources as E. Lemma B.1. Suppose F k is L-smooth with global minimum at w * k , then for any w k in the domain of F k , we have that ∇F k (w k ) 2 ≤ 2L(F k (w k ) -F k (w * k )) Proof. F k (w k ) -F k (w * k ) -∇F k (w * k ), w k -w * k ≥ 1 2L ∇F k (w k ) -∇F k (w * k ) 2 F k (w k ) -F k (w * k ) ≥ 1 2L ∇F k (w k ) 2 Lemma B.2 (Expected average discrepancy between w (t) and w (t) k for k ∈ S (t) ). 1 m E[ k∈S (t) w (t) -w (t) k 2 ] ≤ 16η 2 t τ 2 G 2 Proof. 1 m k∈S (t) w (t) -w (t) k 2 = 1 m k∈S (t) 1 m k ∈S (t) (w (t) k -w (t) k ) 2 (13) ≤ 1 m 2 k∈S (t) k ∈S (t) w (t) k -w (t) k 2 (14) = 1 m 2 k =k , k,k ∈S (t) w (t) k -w (t) k 2 (15) Observe from the update rule that k, k are in the same set S (t) and hence the terms where k = k in the summation in ( 14) will be zero resulting in (15). Moreover for any arbitrary t there is a t 0 such that 0 ≤ tt 0 < τ that w (t0) k = w (t0) k since the selected clients are updated with the global model at every τ . Hence even for an arbitrary t we have that the difference between w (t) k -w (t) k 2 is upper bounded by τ updates. 
With non-increasing η t over t and η t0 ≤ 2η t , (15) can be further bounded as, 1 m 2 k =k , k,k ∈S (t) w (t) k -w (t) k 2 ≤ 1 m 2 k =k , k,k ∈S (t) t0+τ -1 i=t0 η i (g k (w (i) k , ξ (i) k ) -g k (w (i) k , ξ (i) k )) 2 (16) ≤ η 2 t0 τ m 2 k =k , k,k ∈S (t) t0+τ -1 i=t0 (g k (w (i) k , ξ (i) k ) -g k (w (i) k , ξ (i) k )) 2 (17) ≤ η 2 t0 τ m 2 k =k , k,k ∈S (t) t0+τ -1 i=t0 [2 g k (w (i) k , ξ (i) k ) 2 + 2 g k (w (i) k , ξ (i) k ) 2 ] (18) By taking expectation over ( 18), E[ 1 m 2 k =k , k,k ∈S (t) w (t) k -w (t) k 2 ] ≤ 2η 2 t0 τ m 2 E[ k =k , k,k ∈S (t) t0+τ -1 i=t0 ( g k (w (i) k , ξ (i) k ) 2 + g k (w (i) k , ξ (i) k ) 2 )] (19) ≤ 2η 2 t0 τ m 2 E S (t) [ k =k , k,k ∈S (t) t0+τ -1 i=t0 2G 2 ] (20) = 2η 2 t0 τ m 2 E S (t) [ k =k , k,k ∈S (t) 2τ G 2 ] (21) ≤ 16η 2 t (m -1)τ 2 G 2 m (22) ≤ 16η 2 t τ 2 G 2 where ( 22) is because there can be at most m(m -1) pairs such that k = k in S (t) . Lemma B.3 (Upper bound for expectation over w (t) -w * 2 for any selection strategy π). With E[•], the total expectation over all random sources including the random source from selection strategy we have the upper bound: E[ w (t) -w * 2 ] ≤ 1 m E[ k∈S (t) w (t) k -w * 2 ] Proof. 
E[ w (t) -w * 2 ] = E[ 1 m k∈S (t) w (t) k -w * 2 ] = E[ 1 m k∈S (t) (w (t) k -w * ) 2 ] (25) ≤ 1 m E[ k∈S (t) w (t) k -w * 2 ] C PROOF OF THEOREM 3.1 With g (t) = 1 m k∈S (t) g k (w (t) k , ξ k ) as defined in Section 2, we have that w (t+1) -w * 2 = w (t) -η t g (t) -w * 2 (27) = w (t) -η t g (t) -w * - η t m k∈S (t) ∇F k (w (t) k ) + η t m k∈S (t) ∇F k (w (t) k ) 2 (28) = w (t) -w * - η t m k∈S (t) ∇F k (w (t) k ) 2 + η 2 t 1 m k∈S (t) ∇F k (w (t) k ) -g (t) 2 + 2η t w (t) -w * - η t m k∈S (t) ∇F k (w (t) k ), 1 m k∈S (t) ∇F k (w (t) k ) -g (t) (29) = w (t) -w * 2 -2η t w (t) -w * , 1 m k∈S (t) ∇F k (w (t) k ) A1 + 2η t w (t) -w * - η t m k∈S (t) ∇F k (w (t) k ), 1 m k∈S (t) ∇F k (w (t) k ) -g (t) A2 + η 2 t 1 m k∈S (t) ∇F k (w (t) k ) 2 A3 + η 2 t 1 m k∈S (t) ∇F k (w (t) k ) -g (t) 2 A4 (30) First let's bound A 1 . -2η t w (t) -w * , 1 m k∈S (t) ∇F k (w (t) k ) = - 2η t m k∈S (t) w (t) -w * , ∇F k (w (t) k ) (31) = - 2η t m k∈S (t) w (t) -w (t) k , ∇F k (w (t) k ) - 2η t m k∈S (t) w (t) k -w * , ∇F k (w (t) k ) ≤ η t m k∈S (t) 1 η t w (t) -w (t) k 2 + η t ∇F k (w (t) k ) 2 - 2η t m k∈S (t) w (t) k -w * , ∇F k (w (t) k ) (33) = 1 m k∈S (t) w (t) -w (t) k 2 + η 2 t m k∈S (t) ∇F k (w (t) k ) 2 - 2η t m k∈S (t) w (t) k -w * , ∇F k (w (t) k ) (34) ≤ 1 m k∈S (t) w (t) -w (t) k 2 + 2Lη 2 t m k∈S (t) (F k (w (t) k ) -F * k ) - 2η t m k∈S (t) w (t) k -w * , ∇F k (w (t) k ) (35) ≤ 1 m k∈S (t) w (t) -w (t) k 2 + 2Lη 2 t m k∈S (t) (F k (w (t) k ) -F * k ) - 2η t m k∈S (t) (F k (w (t) k ) -F k (w * )) + µ 2 w (t) k -w * 2 (36) ≤ 16η 2 t τ 2 G 2 - η t µ m k∈S (t) w (t) k -w * 2 + 2Lη 2 t m k∈S (t) (F k (w (t) k ) -F * k ) - 2η t m k∈S (t) (F k (w (t) k ) -F k (w * )) where ( 33) is due to the AM-GM inequality and Cauchy-Schwarz inequality, ( 35) is due to Lemma B.1, ( 36) is due to the µ-convexity of F k , and ( 37) is due to Lemma B.2. Next, in expectation, E[A 2 ] = 0 due to the unbiased gradient. 
Next again with Lemma B.1 we bound A 3 as follows: η 2 t 1 m k∈S (t) ∇F k (w (t) k ) 2 = η 2 t m k∈S (t) ∇F k (w (t) k ) 2 (38) ≤ 2Lη 2 t m k∈S (t) (F k (w (t) k ) -F * k ) Lastly we can bound A 4 using the bound of variance of stochastic gradients as, E[η 2 t 1 m k∈S (t) ∇F k (w (t) k ) -g (t) 2 ] = η 2 t E[ k∈S (t) 1 m (g k (w (t) k , ξ (t) k ) -∇F k (w (t) k )) 2 ] (40) = η 2 t m 2 E S (t) [ k∈S (t) E g k (w (t) k , ξ (t) k ) -∇F k (w (t) k ) 2 ] (41) ≤ η 2 t σ 2 m (42) Using the bounds of A 1 , A 2 , A 3 , A 4 above we have that the expectation of the LHS of ( 27) is bounded as E[ w (t+1) -w * 2 ] ≤E[ w (t) -w * 2 ] - η t µ m E[ k∈S (t) w (t) k -w * 2 ] + 16η 2 t τ 2 G 2 + η 2 t σ 2 m + 4Lη 2 t m E[ k∈S (t) (F k (w (t) k ) -F * k )] - 2η t m E[ k∈S (t) (F k (w (t) k ) -F k (w * ))] ≤(1 -η t µ)E[ w (t) -w * 2 ] + 16η 2 t τ 2 G 2 + η 2 t σ 2 m + 4Lη 2 t m E[ k∈S (t) (F k (w (t) k ) -F * k )] - 2η t m E[ k∈S (t) (F k (w (t) k ) -F k (w * ))] A5 where ( 44) is due to Lemma B.3. Now we aim to bound A 5 in (44). First we can represent A 5 in a different form as: E[ 4Lη 2 t m k∈S (t) (F k (w (t) k ) -F * k ) - 2η t m k∈S (t) (F k (w (t) k ) -F k (w * ))] =E[ 4Lη 2 t m k∈S (t) F k (w (t) k ) - 2η t m k∈S (t) F k (w (t) k ) - 2η t m k∈S (t) (F * k -F k (w * )) + 2η t m k∈S (t) F * k - 4Lη 2 t m k∈S (t) F * k ] (45) =E[ 2η t (2Lη t -1) m k∈S (t) (F k (w (t) k ) -F * k ) A6 ] + 2η t E[ 1 m k∈S (t) (F k (w * ) -F * k )] where ( 48) is due to µ-convexity, (49) is due to Lemma B.1 and the AM-GM inequality and Cauchy-Schwarz inequality, and ( 51) is due to the fact that νt(1-ηtµ) 2ηt ≤ 1. 
Hence using this bound of A 6 we can upper bound A 5 as E[ 4Lη 2 t m k∈S (t) (F k (w (t) k ) -F * k ) - 2η t m k∈S (t) (F k (w (t) k ) -F k (w * ))] ≤ 1 m E[ k∈S (t) w (t) k -w (t) 2 ] - ν t m (1 -η t L)E[ k∈S (t) (F k (w (t) ) -F * k )] + 2η t m E[ k∈S (t) (F k (w * ) -F * k )] ( ) ≤16η 2 t τ 2 G 2 - ν t m (1 -η t L)E[ k∈S (t) (F k (w (t) ) -F * k )] + 2η t m E[ k∈S (t) (F k (w * ) -F * k )] (53) =16η 2 t τ 2 G 2 -ν t (1 -η t L)E[ρ(S(π, w (τ t/τ ) ), w (t) )(F (w (t) ) - K k=1 p k F * k )] + 2η t E[ρ(S(π, w (τ t/τ ) ), w * )(F * - K k=1 p k F * k )] ≤16η 2 t τ 2 G 2 -ν t (1 -η t L)ρ(E[F (w (t) )] - K k=1 p k F * k ) A7 +2η t ρΓ where ( 54) is due to the definition of ρ(S(π, w), w ) in Definition 3.2 and ( 55) is due to the definition of Γ in Definition 3.1 and the definitions of ρ, ρ in Definition 3.2. We can expand A 7 in (55) as -ν t (1 -η t L)ρ(E[F (w (t) )] - K k=1 p k F * k ) (56) = -ν t (1 -η t L)ρ K k=1 p k (E[F k (w (t) ] -F * + F * -F * k ) (57) = -ν t (1 -η t L)ρ K k=1 p k (E[F k (w (t) ] -F * ) -ν t (1 -η t L)ρ K k=1 p k (F * -F * k ) (58) = -ν t (1 -η t L)ρ(E[F (w (t) )] -F * ) -ν t (1 -η t L)ρΓ (59) ≤ - ν t (1 -η t L)µρ 2 E[ w (t) -w * 2 ] -ν t (1 -η t L)ρΓ (60) ≤ - 3η t µρ 8 E[ w (t) -w * 2 ] -2η t (1 -2Lη t )(1 -η t L)ρΓ (61) ≤ - 3η t µρ 8 E[ w (t) -w * 2 ] -2η t ρΓ + 6η 2 t ρLΓ where ( 60) is due to the µ-convexity, (61) is due to -2η t (1 -2Lη t )(1 -η t L) ≤ -3 4 η t , and (62) is due to -(1 -2Lη t )(1 -η t L) ≤ -(1 -3Lη t ). 
Hence we can finally bound A 5 as 4Lη 2 t m E[ k∈S (t) (F k (w (t) k ) -F * k ) - 2η t m k∈S (t) (F k (w (t) k ) -F k (w * ))] ≤ - 3η t µρ 8 E[ w (t) -w * 2 ] + 2η t Γ( ρ -ρ) + η 2 t (6ρLΓ + 16τ 2 G 2 ) Now we can bound E[ w (t+1) -w * 2 ] as E[ w (t+1) -w * 2 ] ≤ 1 -η t µ 1 + 3ρ 8 E[ w (t) -w * 2 ] + η 2 t 32τ 2 G 2 + σ 2 m + 6ρLΓ + 2η t Γ( ρ -ρ) By defining ∆ t+1 = E[ w (t+1) -w * 2 ], B = 1 + 3ρ 8 , C = 32τ 2 G 2 + σ 2 m + 6ρLΓ, D = 2Γ( ρ -ρ), we have that ∆ t+1 ≤ (1 -η t µB)∆ t + η 2 t C + η t D By setting ∆ t ≤ ψ t+γ , η t = β t+γ and β > 1 µB , γ > 0 by induction we have that ψ = max γ w (0) -w * 2 , 1 βµB -1 β 2 C + Dβ(t + γ) Then by the L-smoothness of F (•), we have that E[F (w (t) )] -F * ≤ L 2 ∆ t ≤ L 2 ψ γ + t D PROOF OF THEOREM A.1 With fixed learning rate η t = η, we can rewrite (65) as ∆ t+1 ≤ (1 -ηµB)∆ t + η 2 C + ηD and with η ≤ min{ 1 2µB , 1 4L } using recursion of (68) we have that ∆ t ≤ (1 -ηµB) t ∆ 0 + η 2 C + ηD ηµB (1 -(1 -ηµB) t ) Using ∆ t ≤ 2 µ (F (w (t) ) -F * ) and L-smoothness, we have that F (w (t) ) -F * ≤ L µ (1 -ηµB) t (F (w (0) ) -F * ) + L(ηC + D) 2µB (1 -(1 -ηµB) t ) (70) = L µ 1 -ηµ 1 + 3ρ 8 t (F (w (0) ) -F * ) + 4L(ηC + D) µ(8 + 3ρ) 1 -1 -ηµ 1 + 3ρ 8 t E EXTENSION: GENERALIZATION TO DIFFERENT AVERAGING SCHEMES While we considered a simple averaging scheme where w (t+1) = 1 m k∈S (t) w (t) k -η t g k (w (t) k ) , we can extend the averaging scheme to any scheme q such that the averaging weights q k are invariant in time and satisfies k∈S (t) q k = 1 for any t. Note that q includes the random sampling without replacement scheme introduced by Li et al. (2020) where the clients are sampled uniformly at random without replacement with the averaging coefficients q k = p k K/m. 
With such averaging scheme q, we denote the global model for the averaging scheme q k as w (t) , where w (t+1) k∈S (t) q k w (t) k -η t g k (w (t) k ) , and the update rule changes to w (t+1) = w (t) -η t g (t) = w (t) -η t   k∈S (t) q k g k (w (t) k , ξ (t) k )   ( ) where g (t) = k∈S (t) q k g k (w (t) k , ξ k ). We show that the convergence analysis for the averaging scheme q is consistent with Theorem 3.1. In the case of the averaging scheme q, we have that Lemma B.2 and Lemma B.3 shown in Appendix B, each becomes 1 m E[ k∈S (t) w (t) -w (t) k 2 ] ≤ 16η 2 t m(m -1)τ 2 G 2 (73) E[ w (t) -w * 2 ] ≤ mE[ k∈S (t) q k w (t) k -w * 2 ] (74) Then, using the same method we used for the proof of Theorem 3.1, we have that E[ w (t+1) -w * 2 ] ≤ 1 - η t µ m E[ w (t) -w * 2 ] + η 2 t σ 2 m + 16m 2 (m -1)η 2 t τ 2 G 2 + E   2Lη 2 t (1 + m) k∈S (t) q k (F k (w (t) k ) -F * k ) -2η t k∈S (t) q k (F k (w (t) k ) -F k (w * ))   M By defining the selection skew for averaging scheme q similar to Definition 5 as ρ q (S(π, w), w ) = E S(π,w) [ k∈S(π,w) q k (F k (w ) -F * k )] F (w ) - K k=1 p k F * k ≥ 0, and ρ q min w,w ρ q (S(π, w), w ) (77) ρ q max w ρ q (S(π, w), w * ) = max w E S(π,w) [ k∈S(π,w) q k (F k (w * ) -F * k )] Γ With η t < 1/(2L(1 + m)), using the same methodology for proof of Theorem 3.1 we have that M becomes upper bounded as E   2Lη 2 t (1 + m) k∈S (t) q k (F k (w (t) k ) -F * k ) -2η t k∈S (t) q k (F k (w (t) k ) -F k (w * ))   (79) ≤ - η t µρ q 2 E[ w (t) -w * 2 ] + 2η t Γ( ρ q -ρ q ) + 16m 2 (m -1)η 2 t τ 2 G 2 + 2Lη 2 t (2 + m)ρ q Γ (80) Finally we have that E[ w (t+1) -w * 2 ] ≤ 1 -η t µ 1 m + ρ q 2 E[ w (t) -w * 2 ] + 2η t Γ( ρ q -ρ q ) +η 2 t [32m 2 (m -1)τ 2 G 2 + σ 2 m + 2L(2 + m)ρ q Γ] By defining ∆ t+1 = E[ w (t+1) -w * 2 ], B = 1 m + ρ q 2 , C = 32m 2 (m -1)τ 2 G 2 + σ 2 m + 2L(2 + m)ρ q Γ, D = 2Γ( ρ q -ρ q ), we have that ∆ t+1 ≤ (1 -η t µ B) ∆ t + η 2 t C + η t D Again, by setting ∆ t ≤ ψ t+γ , η t = β t+γ and β > 1 µ B , γ > 0 by 
induction we have that

$$\psi=\max\Big\{\gamma\|\bar{w}^{(0)}-w^*\|^2,\ \frac{1}{\beta\mu\widetilde{B}-1}\big(\beta^2\widetilde{C}+\widetilde{D}\beta(t+\gamma)\big)\Big\}.$$

Then by the $L$-smoothness of $F(\cdot)$, we have that

$$\mathbb{E}[F(\bar{w}^{(t)})]-F^*\le\frac{L}{2}\Delta_t\le\frac{L}{2}\cdot\frac{\psi}{\gamma+t}. \tag{84}$$

With $\beta=\frac{m}{\mu}$, $\gamma=\frac{4m(1+m)L}{\mu}$, and $\eta_t=\frac{\beta}{t+\gamma}$, we have that

$$\mathbb{E}[F(\bar{w}^{(T)})]-F^*\le\underbrace{\frac{1}{T+\gamma}\bigg[\frac{Lm^2\big(32m(m-1)\tau^2G^2+\sigma^2\big)}{\mu^2\bar{\rho}_q}+\frac{2L^2m(m+2)\Gamma}{\mu^2}+\frac{L\gamma\|\bar{w}^{(0)}-w^*\|^2}{2}\bigg]}_{\text{Vanishing error term}}+\underbrace{\frac{2L\Gamma}{\mu}\Big(\frac{\tilde{\rho}_q}{\bar{\rho}_q}-1\Big)}_{\text{Non-vanishing bias}},$$

which is consistent with Theorem 3.1.

F EXPERIMENT DETAILS

Quadratic Model Optimization. For the quadratic model optimization, we set each local objective function to be strongly convex:

$$F_k(w)=\frac{1}{2}w^\top H_kw-e_k^\top w+\frac{1}{2}e_k^\top H_k^{-1}e_k, \tag{86}$$

where $H_k\in\mathbb{R}^{v\times v}$ is a diagonal matrix $H_k=h_kI$ with $h_k\sim U(1,\,\cdot\,)$, and $e_k\in\mathbb{R}^v$ is an arbitrary vector. We set the global objective function as $F(w)=\sum_{k=1}^Kp_kF_k(w)$, where the data sizes $p_k$ follow the power-law distribution $P(x;a)=ax^{a-1}$, $0\le x\le1$, with $a=3$. One can easily show that the optima of $F_k(w)$ and $F(w)$ are $w_k^*=H_k^{-1}e_k$ and $w^*=\big(\sum_{k=1}^Kp_kH_k\big)^{-1}\big(\sum_{k=1}^Kp_ke_k\big)$, respectively. The gradient descent update rule for the local model of client $k$ is

$$w_k^{(t+1)}=w_k^{(t)}-\eta\big(H_kw_k^{(t)}-e_k\big),$$

and the global model is $w^{(t+1)}=\frac{1}{m}\sum_{k\in S^{(t)}}w_k^{(t+1)}$. We sample $m=KC$ clients every round; each selected client performs $\tau$ local gradient descent iterations with fixed learning rate $\eta$, and the resulting local models are averaged to update the global model. For the implementation of $\pi_{\text{adapow-d}}$, $d$ was halved from $d=K$ every 5000 rounds. For all simulations we set $\tau=2$, $v=5$, $\eta=2\times10^{-5}$. To estimate $\bar{\rho}$ and $\tilde{\rho}$ for the quadratic model, we obtain estimates of the theoretical values by a grid search over a large range of possible $w,w'$ for $\rho(S(\pi,w),w')$ and $\rho(S(\pi,w),w^*)$, respectively. The distribution of $S(\pi,w)$ is estimated by simulating 10000 iterations of client sampling for each $\pi$ and $w$.

Logistic Regression on Synthetic Dataset. We conduct simulations on synthetic data, which allows precise manipulation of heterogeneity. Following the methodology of (Sahu et al., 2019), we use the dataset with large data heterogeneity, Synthetic(1,1). We assume 30 devices in total, where the local dataset size of each device follows the power law. For the implementation of $\pi_{\text{adapow-d}}$, $d$ was decreased from $d=K$ to $d=m$ at half of the total communication rounds.
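The quadratic setup above can be sketched in a few lines of NumPy. This is an illustrative reproduction, not the authors' code: the upper bound of $U(1,\cdot)$ is elided in the text, so the value 10 below is an arbitrary placeholder, as are the candidate-set size `d` and the number of rounds.

```python
import numpy as np

rng = np.random.default_rng(0)
K, m, tau, v, eta = 30, 3, 2, 5, 2e-5    # tau, v, eta as in the text; m = KC
d = 9                                     # pow-d candidate-set size (ours)

h = rng.uniform(1.0, 10.0, size=K)        # U(1, .): upper bound assumed = 10
e = rng.normal(size=(K, v))
p = rng.power(3.0, size=K); p /= p.sum()  # power-law data sizes, a = 3

def F_k(w, k):
    # F_k(w) = 0.5 w^T H_k w - e_k^T w + 0.5 e_k^T H_k^{-1} e_k, with H_k = h_k I
    return 0.5 * h[k] * (w @ w) - e[k] @ w + 0.5 * (e[k] @ e[k]) / h[k]

w_star = (p @ e) / (p @ h)                # closed-form global optimum

w = np.zeros(v)
for _ in range(100):
    # pow-d: sample d candidates by p_k, keep the m with the largest local loss
    A = rng.choice(K, size=d, replace=False, p=p)
    S = A[np.argsort([-F_k(w, k) for k in A])[:m]]
    local = []
    for k in S:                           # tau local GD steps per client
        wk = w.copy()
        for _ in range(tau):
            wk -= eta * (h[k] * wk - e[k])
        local.append(wk)
    w = np.mean(local, axis=0)            # average the m local models
```

The same loop with `A = rng.choice(K, size=m, replace=False, p=p)` and `S = A` recovers the unbiased baseline selection.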
We set the mini-batch size to 50 with $\tau=30$ and $\eta=0.05$, where $\eta$ is decayed to $\eta/2$ at rounds 300 and 600.

DNN on FMNIST Dataset. We train a deep multi-layer perceptron with two hidden layers on the FMNIST dataset (Xiao et al., 2017). We construct the heterogeneous data partition across clients using the Dirichlet distribution $\mathrm{Dir}_K(\alpha)$ (Hsu et al., 2019), where $\alpha$ determines the degree of data heterogeneity across clients (the data-size imbalance and the degree of label skew). Smaller $\alpha$ indicates larger data heterogeneity. For all experiments we use a mini-batch size of $b=64$, with $\tau=30$ and $\eta=0.005$, where $\eta$ is decayed by half at rounds 150 and 300. We run each experiment with three different seeds for the randomness in the dataset partition across clients and report the averaged results. All experiments are conducted on clusters equipped with one NVIDIA TitanX GPU each; the number of machines varies with $C$, the fraction of clients we select. The machines communicate over Ethernet to transfer the model parameters and the information necessary for client selection, and each machine is regarded as one client in the federated learning setting. The algorithms are implemented in PyTorch.

Pseudo-code of the variants of pow-d: cpow-d and rpow-d. We present the pseudo-code for $\pi_{\text{cpow-d}}$ and $\pi_{\text{rpow-d}}$ below. Note that the pseudo-code for $\pi_{\text{cpow-d}}$ in Algorithm 1 can be generalized to the algorithm for $\pi_{\text{pow-d}}$ by changing $\frac{1}{|\tilde{\xi}_k|}\sum_{\xi\in\tilde{\xi}_k}f(w,\xi)$ to $F_k(w)$.
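The $\mathrm{Dir}_K(\alpha)$ partition can be sketched as follows. This follows the standard construction of Hsu et al. (2019) rather than the authors' exact partitioning code; the `dirichlet_partition` helper and the stand-in labels are ours.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Partition sample indices across clients with Dir_K(alpha) label skew.

    Smaller alpha concentrates each class on fewer clients, i.e. higher
    data heterogeneity (both label skew and data-size imbalance).
    """
    rng = np.random.default_rng(seed)
    num_classes = int(labels.max()) + 1
    client_idx = [[] for _ in range(num_clients)]
    for c in range(num_classes):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        # split this class's samples by proportions drawn from the Dirichlet
        props = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for cid, part in enumerate(np.split(idx, cuts)):
            client_idx[cid].extend(part.tolist())
    return client_idx

labels = np.repeat(np.arange(10), 600)   # stand-in for FMNIST's 10 classes
parts = dirichlet_partition(labels, num_clients=100, alpha=0.3)
```

Each returned index list can then be wrapped in a per-client dataset (e.g. a PyTorch `Subset`) for local training.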

G.1 SELECTED CLIENT PROFILE

We further visualize the difference between our proposed sampling strategy π_pow-d and the baseline scheme π_rand by showing the clients' selected frequency ratio for K = 30, C = 0.1 in the quadratic simulations in Figure 7. Note that the selected ratio for π_rand reflects each client's dataset size. The selected frequencies of clients under π_pow-d are not proportional to the clients' data sizes: clients with relatively small datasets, such as clients 6 and 22, are selected frequently, while the client with the largest dataset, client 26, is not necessarily selected most often. This aligns with the main motivation of POWER-OF-CHOICE: weighting the clients' importance by data size alone does not achieve the best performance, and considering their local loss values along with the data size better represents their importance. Note that the selected frequency for π_rand is less biased than that for π_pow-d.

In Table 2, we show the communication and computation efficiency of POWER-OF-CHOICE for α = 2, as we did for α = 0.3 in Table 1 in Section 5. With a C = 0.03 fraction of clients, π_pow-d, π_cpow-d, and π_rpow-d achieve test accuracy at least approximately 10% higher than (π_rand, C = 0.1). R_60 for π_pow-d, π_cpow-d, and π_rpow-d is 0.61, 0.66, and 0.73 times that of (π_rand, C = 0.1), respectively. This indicates that, using only 1/3 as many clients as (π_rand, C = 0.1), we can reduce the number of communication rounds by roughly 27-39% and still obtain higher test accuracy. The computation time t_comp for π_cpow-d and π_rpow-d with C = 0.03 is also smaller than that of (π_rand, C = 0.1).

G.3 INTERMITTENT CLIENT AVAILABILITY

In real-world scenarios, certain clients may not be available due to varying resources such as battery power or wireless connectivity. Hence we experiment with a virtual scenario where, among the K clients, in each communication round we select clients alternately from one of two fixed groups, each containing 0.5K clients. This alternating selection reflects a more realistic client selection scenario where, for example, clients span different time zones. In each communication round, we additionally choose a 0.1 fraction of clients from the active group uniformly at random and exclude them from the client selection process. This random exclusion represents the randomness in client availability within that group, e.g., due to low battery power or poor wireless connectivity. In Figure 8 we show that π_pow-d and π_rpow-d achieve 10% and 5% test accuracy improvements, respectively, over π_rand for α = 2. For α = 3, both π_pow-d and π_rpow-d show a 10% improvement. This demonstrates that POWER-OF-CHOICE also performs well in a realistic scenario where clients are only intermittently available.
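The availability protocol above can be sketched as follows; this is a minimal illustration, and the group membership, helper name, and seed are our choices.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 100
# two fixed groups of 0.5K clients each, e.g. two "time zones"
groups = [np.arange(K // 2), np.arange(K // 2, K)]

def available_clients(rnd, dropout=0.1):
    """Clients available in round `rnd`: alternate between the two fixed
    groups, then exclude a random `dropout` fraction of that group
    (modeling low battery or lost connectivity)."""
    g = groups[rnd % 2]
    n_drop = int(dropout * len(g))
    dropped = rng.choice(g, size=n_drop, replace=False)
    return np.setdiff1d(g, dropped)

avail = available_clients(0)   # 45 of the first 50 clients remain
```

The pow-d candidate set A is then drawn only from `avail` in that round (step 1 of the strategy).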

G.4 RESULTS FOR DNN ON NON-IID PARTITIONED EMNIST DATASET

To further validate the consistency of our results for π_pow-d and its variants on the FMNIST dataset, we present additional experimental results on the EMNIST dataset (sorted by digits) with K = 500, C = 0.03. We train a deep multi-layer perceptron with two hidden layers on the dataset, partitioned heterogeneously across the clients in the same way as for FMNIST. For all experiments, we use b = 64, τ = 30, and η = 0.005, where η is decayed by half at round 300. In Figure 9, we show that π_pow-d achieves significantly higher test accuracy than π_rand for varying d, for both α = 2 and α = 0.3. For α = 2, π_afl is able to match the performance of π_pow-d in the later communication rounds, but is slower than π_pow-d in reaching the same test accuracy. Moreover, in Figure 10, we show that π_cpow-d works as well as π_pow-d for both large and small data heterogeneity. The performance of π_rpow-d falls behind π_pow-d and π_cpow-d for smaller data heterogeneity, whereas for larger data heterogeneity π_rpow-d performs similarly to π_pow-d and π_cpow-d. In Figure 11, for the larger fraction C = 0.1 with α = 2, the test accuracy improvement of π_pow-d is even higher than for C = 0.03, at approximately 15%. π_cpow-d attains slightly lower test accuracy than π_pow-d but still outperforms π_rand and π_afl, and π_rpow-d performs as well as π_afl. For α = 0.3, π_pow-d, π_cpow-d, and π_rpow-d achieve approximately equal test accuracy, about 5% higher than π_rand, and all POWER-OF-CHOICE strategies perform slightly better than π_afl. Therefore, POWER-OF-CHOICE also performs well when selecting a larger fraction of clients, i.e., C = 0.1 > 0.03.

G.6 EFFECT OF THE LOCAL EPOCHS AND MINI-BATCH SIZE

We present the experimental results described in Section 5 for the additional hyper-parameter settings (b, τ) ∈ {(128, 30), (64, 100)} in Figures 12, 13, 14, and 15 below.



Setting d = m makes our proposed POWER-OF-CHOICE strategy analogous to an unbiased sampling strategy, which has no non-vanishing bias term.



Figure 1: A toy example with F 1 (w), F 2 (w) as the local objectives and F (w) = (F 1 (w) + F 2 (w))/2 as the global objective function with global minimum w * . At each round, only one client is selected to perform local updates. (a): Model updates when sampling clients with larger loss; (b): Model updates when sampling clients uniformly at random (clients are selected in the order 2, 2, 1, 1, 2).

Definition 3.1 (Local-Global Objective Gap). For the global optimum $w^*=\arg\min_wF(w)$ and local optima $w_k^*=\arg\min_wF_k(w)$, we define the local-global objective gap as

$$\Gamma=F^*-\sum_{k=1}^Kp_kF_k^*=\sum_{k=1}^Kp_k\big(F_k(w^*)-F_k(w_k^*)\big)\ge0.$$
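As a concrete illustration of the gap (a hypothetical 1-D example of ours, not from the paper's experiments): with two clients holding quadratic objectives centered at different points, Γ is strictly positive whenever the local minimizers disagree.

```python
import numpy as np

# Two clients with F_k(w) = 0.5 (w - c_k)^2 and equal weights p_k = 0.5.
p = np.array([0.5, 0.5])
c = np.array([1.0, -1.0])                 # local minimizers w_k* = c_k

def F_k(w, k):
    return 0.5 * (w - c[k]) ** 2

w_star = p @ c                            # global minimizer (weighted mean)
F_star = sum(p[k] * F_k(w_star, k) for k in range(2))
F_k_star = np.array([F_k(c[k], k) for k in range(2)])   # each local min is 0
Gamma = F_star - p @ F_k_star
print(Gamma)   # 0.5: the nonzero gap quantifies the data heterogeneity
```

If instead c = [1.0, 1.0] (identical local data), Γ = 0, matching the homogeneous case.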

Variations of π_pow-d. The three steps of π_pow-d can be flexibly modified to take practical considerations into account. For example, intermittent client availability can be accommodated in step 1 by constructing the set A only from the clients available in that round; we demonstrate the performance of π_pow-d with intermittent client availability in Appendix G.3. The local computation cost and server-client communication cost in step 2 can be reduced or eliminated by the following proposed variants of π_pow-d (see Appendix F for their pseudo-codes). • Computation-efficient Variant π_cpow-d: To save local computation cost, instead of evaluating $F_k(w)$ by going through the entire local dataset $B_k$, we use the estimate $\frac{1}{|\tilde{\xi}_k|}\sum_{\xi\in\tilde{\xi}_k}f(w,\xi)$, where $\tilde{\xi}_k$ is a mini-batch of $b$ samples drawn uniformly at random from $B_k$.

Figure 2: Global loss performance of π rand , π pow-d , and π adapow-d for the quadratic experiments with C = 0.1. π pow-d converges faster than π rand even when selecting from a small pool of clients (K = 30). As convergence speed increases, solution bias also increases for π pow-d , but π adapow-d is able to eliminate this solution bias while retaining nearly the same convergence speed as π pow-d .

Figure 3: Estimated theoretical values ρ̄ and ρ̃/ρ̄ for the quadratic simulation. The convergence speed (ρ̄) and bias (ρ̃/ρ̄) are consistent with the results shown in Figure 2 for π rand and π pow-d .

Quadratic and Synthetic Simulation Results. In Figure 2(a), even with few clients (K = 30), π pow-d converges faster than π rand with nearly negligible solution bias for small d. The convergence speed increases with d, at the cost of a higher error floor due to the solution bias. For K = 100 in Figure 2(b), π pow-d shows the same convergence speed-up as for K = 30, but the bias is smaller. Figure 3 shows the theoretical values ρ̄ and ρ̃/ρ̄, which represent the convergence speed and the solution bias, respectively, in our convergence analysis. Compared to π rand , π pow-d has a higher ρ̄ for all d, implying faster convergence than π rand . By varying d we can span different points on the trade-off between convergence speed and bias. For d = 15 and K = 100, the ρ̃/ρ̄ of π pow-d and π rand are approximately identical, but π pow-d has a higher ρ̄, implying that π pow-d can yield faster convergence with negligible solution bias. In Appendix G.1, we present the clients' selected frequency ratios for π pow-d and π rand , which give novel insights into the difference between the two strategies. For the synthetic dataset simulations, we present the global losses in Figure 4 for π rand and π pow-d for different d and m. We show that π pow-d converges approximately 3× faster than π rand to a global loss of ≈ 0.7 when d = 10m, with a slightly higher error floor. Even with d = 2m, we get 2× faster convergence to the global loss ≈ 0.7 than π rand .

Figure 4: Global loss for logistic regression on the synthetic dataset, Synthetic(1,1), with π rand , π pow-d , and π adapow-d for d ∈ {2m, 10m}, where K = 30 and m ∈ {1, 2, 3}. π pow-d converges approximately 3× faster for d = 10m and 2× faster for d = 2m than π rand to the global loss ≈ 0.7. π adapow-d is able to converge to the minimum global loss 3× faster than π rand .

Figure 5(a) shows that the performance improvement due to increasing d eventually saturates. For smaller α, as in Figure 5(b), a smaller d = 6 performs better than larger d, which shows that too much solution bias harms performance in the presence of large data heterogeneity. The observations on training loss are consistent with the test accuracy results.

Figure 6: Test accuracy and training loss for different sampling strategies, including π cpow-d and π rpow-d , for K = 100, C = 0.03 on the FMNIST dataset. π rpow-d , which requires no additional communication and only minor computation, yields higher test accuracy than π rand and π afl .

Algorithm 1 Pseudo-code for cpow-d: computation-efficient variant of pow-d
1: Input: m, d, p_k for k ∈ [K], mini-batch size b = |ξ̃_k| for computing (1/|ξ̃_k|) Σ_{ξ∈ξ̃_k} f(w, ξ)
2: Output: S^(t)
3: Initialize: empty sets S^(t) and A
4: Global server do
5:   Get A = {d indices sampled without replacement from [K] by p_k}
6:   Send the global model w^(t) to the d clients in A
7:   Receive (1/|ξ̃_k|) Σ_{ξ∈ξ̃_k} f(w, ξ) from all clients in A
8:   Get S^(t) = {m clients with largest (1/|ξ̃_k|) Σ_{ξ∈ξ̃_k} f(w, ξ)} (break ties randomly)
9: Clients in A in parallel do
10:   Create mini-batch ξ̃_k by sampling b samples uniformly at random from B_k, compute (1/|ξ̃_k|) Σ_{ξ∈ξ̃_k} f(w, ξ), and send it to the server
11: return S^(t)

Algorithm 2 Pseudo-code for rpow-d: computation- and communication-efficient variant of pow-d
1: Input: m, d, p_k for k ∈ [K]
2: Output: S^(t)
3: Initialize: empty sets S^(t) and A, and list A_tmp with K elements all equal to inf
4: All clients k ∈ S^(t-1) do
5:   Send the last locally computed loss value to the server
6: Global server do
7:   Update A_tmp[k] with the received loss for each k ∈ S^(t-1)
8:   Get A = {d indices sampled without replacement from [K] by p_k}
9:   Get S^(t) = {m clients with largest values in [A_tmp[i] for i ∈ A]} (break ties randomly)
10: return S^(t)
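In Python, the server-side selection step shared by pow-d and its variants can be sketched as below (the function name and toy values are ours). For cpow-d the `losses` come from fresh mini-batch estimates, while for rpow-d they come from the possibly stale values stored in A_tmp.

```python
import numpy as np

def powd_select(losses, p, m, d, rng):
    """One round of POWER-OF-CHOICE selection (server side).

    losses[k]: (estimate of) client k's local loss at the current model;
    p: data-size weights p_k. Returns the selected set S of size m.
    """
    # step 1: candidate set A of d clients sampled by p_k w/o replacement
    A = rng.choice(len(p), size=d, replace=False, p=p)
    # step 2-3: keep the m candidates with the largest loss; ties are
    # broken by the random order in which rng.choice returned them
    order = np.argsort(-np.asarray(losses)[A], kind="stable")
    return A[order[:m]]

rng = np.random.default_rng(0)
p = np.full(10, 0.1)                      # uniform data sizes, K = 10
losses = np.arange(10, dtype=float)       # client 9 has the largest loss
S = powd_select(losses, p, m=2, d=10, rng=rng)
```

With d = K, the two highest-loss clients (9 and 8) are always selected; shrinking d toward m interpolates back to the unbiased sampling of π_rand.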


Figure 7: Clients' selected frequency ratios for optimizing the quadratic model with π rand and π pow-d for K = 30, C = 0.1. The selected ratios are sorted in descending order.

Figure 8: Test accuracy and training loss in the virtual environment where clients have intermittent availability for K = 100, C = 0.03 with π rand , π pow-d , and π rpow-d on the FMNIST dataset. For both α = 2 and α = 3, π pow-d achieves approximately 10% higher test accuracy than π rand .

Figure 9: Test accuracy and training loss for different sampling strategies for K = 500, C = 0.03 with π rand , π pow-d , and π afl on the EMNIST dataset.

Figure 10: Test accuracy and training loss for different sampling strategies for K = 500, C = 0.03 with π rand , π pow-d , π cpow-d , π rpow-d , and π afl on the EMNIST dataset.

Figure 12: Test accuracy and training loss for π rand , π pow-d , and π afl for K = 100, C = 0.03 on the FMNIST dataset with mini-batch size b = 128 and τ = 30.

Figure 14: Test accuracy and training loss for π rand , π pow-d , and π afl for K = 100, C = 0.03 on the FMNIST dataset with mini-batch size b = 64 and τ = 100.

Table 1: Comparison of R 60 , t comp (sec), and test accuracy (%) for different sampling strategies with α = 0.3. In parentheses we show the ratio of each value to that of π rand with C = 0.1.

Table 2: Comparison of R 60 , t comp (sec), and test accuracy (%) for different sampling strategies with α = 2. The ratios R 60 / (R 60 for π rand , C = 0.1) and t comp / (t comp for π rand , C = 0.1) are shown in parentheses.

