SHARPER RATES AND FLEXIBLE FRAMEWORK FOR NONCONVEX SGD WITH CLIENT AND DATA SAMPLING Anonymous authors Paper under double-blind review

Abstract

We revisit the classical problem of finding an approximately stationary point of the average of $n$ smooth and possibly nonconvex functions. The optimal complexity of stochastic first-order methods in terms of the number of gradient evaluations of individual functions is $O(n + n^{1/2}\varepsilon^{-1})$, attained by the optimal SGD methods SPIDER (Fang et al., 2018) and PAGE (Li et al., 2021), for example, where $\varepsilon$ is the error tolerance. However, i) the big-$O$ notation hides crucial dependencies on the smoothness constants associated with the functions, and ii) the rates and theory in these methods assume simplistic sampling mechanisms that do not offer any flexibility. In this work we remedy the situation. First, we generalize the PAGE algorithm so that it can provably work with virtually any (unbiased) sampling mechanism. This is particularly useful in federated learning, as it allows us to construct and better understand the impact of various combinations of client and data sampling strategies. Second, our analysis is sharper as we make explicit use of certain novel inequalities that capture the intricate interplay between the smoothness constants and the sampling procedure. Indeed, our analysis is better even for the simple sampling procedure analyzed in the PAGE paper. However, this already improved bound can be further sharpened by a different sampling scheme which we propose. In summary, we provide the most general and most accurate analysis of optimal SGD in the smooth nonconvex regime. Finally, our theoretical findings are supported by carefully designed experiments.

1. INTRODUCTION

In this paper, we consider the minimization of the average of $n$ smooth functions (1) in the nonconvex setting, in the regime where the number of functions $n$ is very large. In this regime, calculating the exact gradient can be infeasible, and the classical gradient descent method (GD) (Nesterov, 2018) cannot be applied. The structure of the problem is generic, and such problems arise in many applications, including machine learning (Bishop & Nasrabadi, 2006) and computer vision (Goodfellow et al., 2016). Problems of this form are the basis of empirical risk minimization (ERM), which is the prevalent paradigm for training supervised machine learning models.

1.1. FINITE-SUM OPTIMIZATION IN THE SMOOTH NONCONVEX REGIME

We consider the finite-sum optimization problem

$$\min_{x \in \mathbb{R}^d} \left\{ f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x) \right\}, \tag{1}$$

where $f_i : \mathbb{R}^d \to \mathbb{R}$ is a smooth (and possibly nonconvex) function for all $i \in [n] := \{1, \dots, n\}$. We are interested in randomized algorithms that find an $\varepsilon$-stationary point of (1) by returning a random point $\widehat{x}$ such that $\mathbb{E}\left[\|\nabla f(\widehat{x})\|^2\right] \le \varepsilon$. The main efficiency metric of gradient-based algorithms for finding such a point is the (expected) number of gradient evaluations $\nabla f_i$; we will refer to it as the complexity of an algorithm.

1.2. RELATED WORK

The area of algorithmic research devoted to designing methods for solving the ERM problem (1) in the smooth nonconvex regime is one of the most highly developed and most competitive in optimization.

The path to optimality. Let us provide a lightning-speed overview of recent progress. The complexity of GD for solving (1) is $O(n\varepsilon^{-1})$, but this was subsequently improved by more elaborate stochastic methods, including SAGA, SVRG and SCSG (Defazio et al., 2014; Johnson & Zhang, 2013; Lei et al., 2017; Horváth & Richtárik, 2019), which enjoy the better complexity $O(n + n^{2/3}\varepsilon^{-1})$. Further progress was obtained by methods such as SNVRG and Geom-SARAH (Zhou et al., 2018; Horváth et al., 2020), improving the complexity to $\widetilde{O}(n + n^{1/2}\varepsilon^{-1})$. Finally, the methods SPIDER, SpiderBoost, SARAH and PAGE (Fang et al., 2018; Wang et al., 2019; Nguyen et al., 2017; Li et al., 2021), among others, shaved off certain logarithmic factors and obtained the optimal complexity $O(n + n^{1/2}\varepsilon^{-1})$, matching lower bounds (Li et al., 2021).

Optimal, but hiding a secret. While it may look as if this is the end of the road, the starting point of our work is the observation that the big-$O$ notation in the above results hides important and typically very large data-dependent constants. For instance, it is rarely noted that the more precise complexity of GD is $O(L_- n\varepsilon^{-1})$, while the complexity of the optimal methods, for instance PAGE, is $O(n + L_+ n^{1/2}\varepsilon^{-1})$, where $L_- \le L_+$ are different and often very large smoothness constants. Moreover, it is easy to generate examples of problems (see Example 1) in which the ratio $L_+/L_-$ is as large as one desires.

Client and data sampling in federated learning. Furthermore, several modern applications, notably federated learning (Konečný et al., 2016; McMahan et al., 2017), depend on elaborate client and data sampling mechanisms, which are not properly understood.
However, optimal SGD methods were considered in combination with very simple mechanisms only, such as sampling a random function $f_i$ several times independently with replacement (Li et al., 2021). We thus believe that an in-depth study of sampling mechanisms for optimal methods will be of interest to the federated learning community. There exists prior work on analyzing non-optimal SGD variants with flexible mechanisms. For example, using the "arbitrary sampling" paradigm, originally proposed by Richtárik & Takáč (2016) in the study of randomized coordinate descent methods, Horváth & Richtárik (2019) and Qian et al. (2021) analyzed the SVRG, SAGA, and SARAH methods, and showed that it is possible to improve the dependence of these methods on the smoothness constants via carefully crafted sampling strategies. Further, Zhao & Zhang (2014) investigated stratified sampling, but only provided the analysis for vanilla SGD, and only in the convex case.

1.3. SUMMARY OF CONTRIBUTIONS

• In the original paper (Li et al., 2021), the optimal (w.r.t. $n$ and $\varepsilon$) optimization method PAGE was analyzed with a simple uniform mini-batch sampling with replacement. We analyze PAGE with virtually any (unbiased) sampling mechanism using a novel Assumption 4. Moreover, we show that some samplings can improve the convergence rate $O(n + L_+ n^{1/2}\varepsilon^{-1})$ of PAGE (see Table 2).

• We improve the analysis of PAGE using a new quantity, the weighted Hessian variance $L_{\pm}$ (or $L_{\pm,w}$), that is well-defined if the functions $f_i$ are $L_i$-smooth. We show that, when the functions $f_i$ are "similar" in the sense of the weighted Hessian variance, PAGE enjoys faster convergence rates (see Table 2). Also, unlike Szlendak et al. (2021), we introduce weights $w_i$ that can play a crucial role in some samplings. Moreover, the experiments in Sec 5 agree with our theoretical results.

• Our framework is flexible and can be generalized to the composition of samplings. These samplings naturally emerge in federated learning (Konečný et al., 2016; McMahan et al., 2017), and we show that our framework can be helpful in the analysis of problems from federated learning.

2. ASSUMPTIONS

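We need the following standard assumptions from nonconvex optimization.

Assumption 1. There exists $f^* \in \mathbb{R}$ such that $f(x) \ge f^*$ for all $x \in \mathbb{R}^d$.

Assumption 2. There exists $L_- \ge 0$ such that $\|\nabla f(x) - \nabla f(y)\| \le L_- \|x - y\|$ for all $x, y \in \mathbb{R}^d$.

Assumption 3. For all $i \in [n]$, there exists a constant $L_i > 0$ such that $\|\nabla f_i(x) - \nabla f_i(y)\| \le L_i \|x - y\|$ for all $x, y \in \mathbb{R}^d$.

Table 1: The constants $A$, $B$, $w_i$ and $\lvert S \rvert$ that characterize the samplings in Assumption 4.

| Sampling scheme | $A$ | $w_i$ | $B$ | $\lvert S \rvert$ | Reference |
|---|---|---|---|---|---|
| Uniform With Replacement | $\frac{1}{\tau}$ | $\frac{1}{n}$ | $\frac{1}{\tau}$ | $\le \tau$ | Sec. E.3 |
| Importance | $\frac{1}{\tau}$ | $q_i$ | $\frac{1}{\tau}$ | $\le \tau$ | Sec. E.3 |
| Nice | $\frac{n-\tau}{\tau(n-1)}$ | $\frac{1}{n}$ | $\frac{n-\tau}{\tau(n-1)}$ | $\tau$ | Sec. E.1 |
| Independent | $\frac{1}{\sum_{i=1}^n \frac{p_i}{1-p_i}}$ | $\frac{p_i/(1-p_i)}{\sum_{i=1}^n \frac{p_i}{1-p_i}}$ | $0$ | $\sum_{i=1}^n p_i$ | Sec. E.2 |
| Extended Nice | $\frac{n-\tau}{\tau(n-1)}$ | $\frac{l_i}{\sum_{i=1}^n l_i}$ | $\frac{n-\tau}{\tau(n-1)}$ | $\le \tau$ | Sec. E.4 |

Notation: $n$ = # of data points; $\tau$ = batch size; $q_i$ = probability to sample the $i$th data point in the multinomial distribution; $p_i$ = probability to sample the $i$th data point in the Bernoulli distribution; $l_i$ = # of times the $i$th data point is repeated before applying the Nice sampling.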
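To make three rows of Table 1 concrete, the following is a minimal sketch of the Uniform With Replacement, Importance and Nice samplings as gradient-mean estimators. The function names, the list-of-lists vector representation, and the exact form of the Importance estimator (each draw reweighted by $1/(n q_i)$, which is what makes it unbiased and consistent with $w_i = q_i$) are assumptions made for illustration; the precise definitions are in Sec. E of the paper.

```python
import random
from typing import List, Sequence

Vector = List[float]


def uniform_with_replacement(a: Sequence[Vector], tau: int, rng: random.Random) -> Vector:
    """Average of tau vectors drawn uniformly at random with replacement."""
    n, d = len(a), len(a[0])
    out = [0.0] * d
    for _ in range(tau):
        i = rng.randrange(n)
        for k in range(d):
            out[k] += a[i][k] / tau
    return out


def importance(a: Sequence[Vector], tau: int, q: Sequence[float], rng: random.Random) -> Vector:
    """tau draws from the multinomial distribution q; each draw is reweighted
    by 1 / (n * q_i) so that the estimator of the mean stays unbiased."""
    n, d = len(a), len(a[0])
    out = [0.0] * d
    for i in rng.choices(range(n), weights=q, k=tau):
        for k in range(d):
            out[k] += a[i][k] / (tau * n * q[i])
    return out


def nice(a: Sequence[Vector], tau: int, rng: random.Random) -> Vector:
    """Average over a uniformly random subset of exactly tau distinct indices."""
    n, d = len(a), len(a[0])
    out = [0.0] * d
    for i in rng.sample(range(n), tau):
        for k in range(d):
            out[k] += a[i][k] / tau
    return out
```

All three functions return an unbiased estimate of $\frac{1}{n}\sum_{i=1}^n a_i$; with $\tau = n$ the Nice sampling returns the exact mean, in line with $A = B = 0$ in Table 1.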

2.1. TIGHT VARIANCE CONTROL OF GENERAL SAMPLING ESTIMATORS

In Algorithm 1 (a generalization of PAGE), we form an estimator of the gradient $\nabla f$ via subsampling. In our search for achieving the combined goal of providing a general (in terms of the range of sampling techniques we cater for) and refined (in terms of the sharpness of our results, even when compared to known results using the same sampling technique) analysis of PAGE, we have identified several powerful tools, the first of which is Assumption 4.

Let $\mathcal{S}^n := \{(w_1, \dots, w_n) \in \mathbb{R}^n \mid w_1, \dots, w_n \ge 0, \sum_{i=1}^n w_i = 1\}$ be the standard simplex and $(\Omega, \mathcal{F}, \mathbf{P})$ a probability space.

Assumption 4 (Weighted AB Inequality). Consider the random mapping $S : \mathbb{R}^d \times \dots \times \mathbb{R}^d \times \Omega \to \mathbb{R}^d$, which we will call "sampling", such that $\mathbb{E}[S(a_1, \dots, a_n; \omega)] = \frac{1}{n}\sum_{i=1}^n a_i$ for all $a_1, \dots, a_n \in \mathbb{R}^d$. Assume that there exist $A, B \ge 0$ and weights $(w_1, \dots, w_n) \in \mathcal{S}^n$ such that

$$\mathbb{E}\left[\left\| S(a_1, \dots, a_n; \omega) - \frac{1}{n}\sum_{i=1}^n a_i \right\|^2\right] \le \frac{A}{n}\sum_{i=1}^n \frac{1}{n w_i}\|a_i\|^2 - B\left\|\frac{1}{n}\sum_{i=1}^n a_i\right\|^2, \quad \forall a_1, \dots, a_n \in \mathbb{R}^d. \tag{2}$$

For simplicity, we denote $S(\{a_i\}_{i=1}^n) := S(a_1, \dots, a_n) := S(a_1, \dots, a_n; \omega)$. Further, the collection of samplings satisfying Assumption 4 will be denoted by $\mathbb{S}(A, B, \{w_i\}_{i=1}^n)$. The main purpose of a sampling $S \in \mathbb{S}(A, B, \{w_i\}_{i=1}^n)$ is to estimate the mean $\frac{1}{n}\sum_{i=1}^n a_i$ using some random subsets (possibly containing some elements more than once) of the set $\{a_1, \dots, a_n\}$. Assumption 4 is the only nonstandard assumption in our paper, and we refer to Table 1, where we provide examples of samplings that satisfy this assumption. It represents a convenient framework to build the theory. We now define the cardinality $|S|$ of a sampling $S \in \mathbb{S}(A, B, \{w_i\}_{i=1}^n)$.

Definition 1 (Cardinality of a Sampling). Let us take $S \in \mathbb{S}(A, B, \{w_i\}_{i=1}^n)$, and define the function $S_\omega : \mathbb{R}^d \times \dots \times \mathbb{R}^d \to \mathbb{R}^d$ such that $S_\omega(a_1, \dots, a_n) := S(a_1, \dots, a_n; \omega)$. If the function $S_\omega(a_1, \dots, a_n)$ depends only on a subset $\mathcal{A}(\omega)$ of the arguments $(a_1, \dots, a_n)$, where $\mathcal{A}(\omega) : \Omega \to 2^{\{a_1, \dots, a_n\}}$, we define $|S| := \mathbb{E}[|\mathcal{A}(\omega)|]$.

Assumption 4 is most closely related to two independent works: (Horváth & Richtárik, 2019) and (Szlendak et al., 2021). Horváth & Richtárik (2019) analyzed several non-optimal SGD methods for "arbitrary samplings"; these are random set-valued mappings $S$ with values being the subsets of $[n]$. The distribution of such a sampling is uniquely determined by assigning probabilities to all $2^n$ subsets of $[n]$. In particular, they show that Assumption 4 holds with $S(a_1, \dots, a_n) = 1$ […]

Table 2.

| Sampling scheme | Complexity | Comment |
|---|---|---|
| […] | $\Theta\left(n + \frac{\sqrt{n} L_+}{\varepsilon}\right)$ | |
| Uniform With Replacement (new) | $\Theta\left(n + \frac{\max\{\sqrt{n} L_\pm, L_-\}}{\varepsilon}\right)$ | |
| Importance | $\Theta\left(n + \frac{\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n L_i\right)}{\varepsilon}\right)$ | $q_i = \frac{L_i}{\sum_{i=1}^n L_i}$ |
| Stratified | $\Theta\left(n + \frac{\max\left\{\sqrt{n}\sqrt{\frac{1}{g}\sum_{i=1}^g L_{i,\pm}^2},\, g L_-\right\}}{\varepsilon}\right)$ | The functions $f_i$ are split into $g$ groups |

Notation: $n$ = # of data points; $\varepsilon$ = error tolerance; $L_-$, $L_i$, $L_\pm$, $L_+$ and $L_{i,\pm}$ are smoothness constants such that $L_- \le \frac{1}{n}\sum_{i=1}^n L_i$, $L_- \le L_+$ and $L_\pm \le L_+$; $g$ = # of groups in the Stratified sampling.

[…] the possibility of $B$ being nonzero, as in this way they obtain a tighter inequality, which they can use in their analysis. However, their inequality only involves uniform weights $\{w_i\}$. Our Assumption 4 offers the tightest known way to control the variance of the sampling estimator, and our analysis can take advantage of it. See Table 1 for an overview of several samplings and the values $A$, $B$ and $\{w_i\}$ for which Assumption 4 is satisfied.
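As a small sanity check of the weighted AB inequality (2), the sketch below enumerates every outcome of the Uniform With Replacement sampling on scalar inputs and compares the exact variance with the right-hand side of (2) for $A = B = 1/\tau$ and $w_i = 1/n$ (the values from Table 1). The function names are hypothetical, and the scalar setting ($d = 1$) is chosen only to keep the enumeration short.

```python
from itertools import product
from typing import Sequence


def ab_bound(a: Sequence[float], A: float, B: float, w: Sequence[float]) -> float:
    """Right-hand side of the weighted AB inequality (2) for scalar inputs."""
    n = len(a)
    mean = sum(a) / n
    weighted = sum(x * x / (n * wi) for x, wi in zip(a, w)) / n
    return A * weighted - B * mean * mean


def exact_variance_uniform(a: Sequence[float], tau: int) -> float:
    """Exact variance of the Uniform With Replacement estimator, obtained by
    enumerating all n**tau equally likely index tuples (small n, tau only)."""
    n = len(a)
    mean = sum(a) / n
    total = 0.0
    for idx in product(range(n), repeat=tau):
        estimate = sum(a[i] for i in idx) / tau
        total += (estimate - mean) ** 2
    return total / n ** tau
```

For this particular sampling the inequality holds with equality, since the variance of an average of $\tau$ i.i.d. draws is the population variance divided by $\tau$.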

2.2. SAMPLING-DEPENDENT SMOOTHNESS CONSTANTS

We now define two smoothness constants that depend on the weights $\{w_i\}_{i=1}^n$ of a sampling $S$ and on the functions $f_i$.

Definition 2. Given a sampling $S \in \mathbb{S}(A, B, \{w_i\}_{i=1}^n)$, let $L_{+,w}$ be a constant for which

$$\frac{1}{n}\sum_{i=1}^n \frac{1}{n w_i}\|\nabla f_i(x) - \nabla f_i(y)\|^2 \le L_{+,w}^2 \|x - y\|^2, \quad \forall x, y \in \mathbb{R}^d.$$

For $(w_1, \dots, w_n) = (1/n, \dots, 1/n)$, we define $L_+ := L_{+,w}$.

Definition 3. Given a sampling $S \in \mathbb{S}(A, B, \{w_i\}_{i=1}^n)$, let $L_{\pm,w}$ be a constant for which

$$\frac{1}{n}\sum_{i=1}^n \frac{1}{n w_i}\|\nabla f_i(x) - \nabla f_i(y)\|^2 - \|\nabla f(x) - \nabla f(y)\|^2 \le L_{\pm,w}^2 \|x - y\|^2, \quad \forall x, y \in \mathbb{R}^d.$$

For $(w_1, \dots, w_n) = (1/n, \dots, 1/n)$, we define $L_\pm := L_{\pm,w}$.

One can interpret Definition 2 as a weighted mean-squared smoothness property (Arjevani et al., 2019), and Definition 3 as a weighted Hessian variance (Szlendak et al., 2021) that captures the similarity between the functions $f_i$. The constants $L_{+,w}$ and $L_{\pm,w}$ help us better understand the structure of the optimization problem (1) in connection with a particular choice of a sampling scheme. Note that Definitions 2, 3 and Assumption 4 are connected through the weights $\{w_i\}_{i=1}^n$. The next result states that $L_{+,w}^2$ and $L_{\pm,w}^2$ are finite provided the functions $f_i$ are $L_i$-smooth for all $i \in [n]$.

Theorem 4. If Assumption 3 holds, then $L_{+,w}^2 = L_{\pm,w}^2 = \frac{1}{n}\sum_{i=1}^n \frac{1}{n w_i} L_i^2$ satisfy Def. 2 and 3.

Indeed, from Assumption 3 and the inequality $\|\nabla f(x) - \nabla f(y)\|^2 \ge 0$ we get

$$\frac{1}{n}\sum_{i=1}^n \frac{1}{n w_i}\|\nabla f_i(x) - \nabla f_i(y)\|^2 - \|\nabla f(x) - \nabla f(y)\|^2 \le \left(\frac{1}{n}\sum_{i=1}^n \frac{1}{n w_i} L_i^2\right)\|x - y\|^2,$$

thus we can take $L_{\pm,w}^2 = \frac{1}{n}\sum_{i=1}^n \frac{1}{n w_i} L_i^2$. The proof for $L_{+,w}^2$ is the same. From the proof, one can see that we ignore $\|\nabla f(x) - \nabla f(y)\|^2$ when estimating $L_{\pm,w}^2$; as a consequence, the obtained bound is not tight.

Algorithm 1 PAGE
1: Input: initial point $x^0 \in \mathbb{R}^d$, stepsize $\gamma > 0$, probability $p \in (0, 1]$
2: $g^0 = \nabla f(x^0)$
3: for $t = 0, 1, \dots, T$ do
4:   $x^{t+1} = x^t - \gamma g^t$
5:   Generate a random sampling function $S^t$
6:   $g^{t+1} = \begin{cases} \nabla f(x^{t+1}) & \text{with probability } p \\ g^t + S^t\left(\{\nabla f_i(x^{t+1}) - \nabla f_i(x^t)\}_{i=1}^n\right) & \text{with probability } 1 - p \end{cases}$
7: end for
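The steps of Algorithm 1 can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation: the function name `page`, the representation of each $\nabla f_i$ as a Python callable, and the `sampling(diffs, rng)` interface are assumptions. For clarity the sketch materializes all $n$ gradient differences before calling the sampling, whereas an efficient implementation would evaluate only the sampled ones.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]
Grad = Callable[[Vector], Vector]
Sampling = Callable[[Sequence[Vector], random.Random], Vector]


def page(grads: Sequence[Grad], x0: Vector, gamma: float, p: float,
         sampling: Sampling, T: int, seed: int = 0) -> List[Vector]:
    """Run T steps of PAGE with a generic unbiased sampling; returns all iterates."""
    rng = random.Random(seed)
    n = len(grads)

    def full_grad(x: Vector) -> Vector:
        per_function = [g(x) for g in grads]
        return [sum(col) / n for col in zip(*per_function)]

    x = list(x0)
    g = full_grad(x)                      # g^0 = grad f(x^0)
    iterates = [list(x)]
    for _ in range(T):
        x_new = [xi - gamma * gi for xi, gi in zip(x, g)]
        if rng.random() < p:
            g = full_grad(x_new)          # full gradient with probability p
        else:
            # Differences grad f_i(x^{t+1}) - grad f_i(x^t), then subsample.
            diffs = [[u - v for u, v in zip(gi(x_new), gi(x))] for gi in grads]
            correction = sampling(diffs, rng)
            g = [gk + ck for gk, ck in zip(g, correction)]
        x = x_new
        iterates.append(list(x))
    return iterates
```

Any function mapping the list of differences to an unbiased estimate of their mean can be passed as `sampling`; the output point $\widehat{x}^T$ of the theory corresponds to picking one of the returned iterates at random.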

3. A GENERAL AND REFINED THEORETICAL ANALYSIS OF PAGE

In Algorithm 1, we provide the description of the PAGE method. The choice of PAGE as the base method is driven by the simplicity of the proof in the original paper. However, we believe that other methods, including SPIDER and SARAH, can also admit samplings from Assumption 4. In this section, we provide theoretical results for Algorithm 1. Let us define $\Delta_0 := f(x^0) - f^*$.

Theorem 5. Suppose that Assumptions 1, 2, 3 hold and the samplings $S^t \in \mathbb{S}(A, B, \{w_i\}_{i=1}^n)$. Then Algorithm 1 (PAGE) has the convergence rate

$$\mathbb{E}\left[\|\nabla f(\widehat{x}^T)\|^2\right] \le \frac{2\Delta_0}{\gamma T}, \quad \text{where} \quad \gamma \le \left(L_- + \sqrt{\frac{1-p}{p}\left((A - B) L_{+,w}^2 + B L_{\pm,w}^2\right)}\right)^{-1}.$$

To reach an $\varepsilon$-stationary point, it is enough to do

$$T := \frac{2\Delta_0}{\varepsilon}\left(L_- + \sqrt{\frac{1-p}{p}\left((A - B) L_{+,w}^2 + B L_{\pm,w}^2\right)}\right)$$

iterations of Algorithm 1. To deduce the gradient complexity, we provide the following corollary.

Corollary 1. Suppose that the assumptions of Thm 5 hold. Let us take $p = \frac{|S|}{|S| + n}$. Then the complexity (the expected number of gradient calculations $\nabla f_i$) of Algorithm 1 equals

$$N := \Theta(n + |S| T) = \Theta\left(n + \frac{\Delta_0}{\varepsilon}|S|\left(L_- + \sqrt{\frac{n}{|S|}\left((A - B) L_{+,w}^2 + B L_{\pm,w}^2\right)}\right)\right).$$

Proof. At each iteration, the expected number of gradient calculations equals $pn + (1 - p)|S| \le 2|S|$. Thus the total expected number of gradient calculations needed to get an $\varepsilon$-stationary point is at most $n + 2|S|T$.

The original result from (Li et al., 2021) states that the complexity of PAGE with batch size $\tau$ is

$$N_{\mathrm{orig}} := \Theta\left(n + \frac{\Delta_0}{\varepsilon}\tau\left(L_- + \frac{\sqrt{n}}{\tau} L_+\right)\right) \ge \Theta\left(n + \frac{\Delta_0 \sqrt{n} L_+}{\varepsilon}\right) \tag{4}$$

for all $\tau \in \{1, 2, \dots, n\}$.
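The parameter choices of Thm 5 and Cor 1 can be turned into a small calculator. The sketch below is illustrative: the function name `page_parameters` and its argument names are hypothetical, and it assumes the stepsize bound has the square-root form $\gamma \le \big(L_- + \sqrt{\tfrac{1-p}{p}((A-B)L_{+,w}^2 + BL_{\pm,w}^2)}\big)^{-1}$, which is the form under which the stepsize condition used in the proof of Thm 5 holds. It also assumes that $L_-$ and the variance term are not both zero.

```python
import math
from typing import Dict


def page_parameters(n: int, card: float, A: float, B: float,
                    L_minus: float, L_plus_w: float, L_pm_w: float,
                    delta0: float, eps: float) -> Dict[str, float]:
    """Probability p, stepsize gamma, iteration count T and expected gradient
    cost for PAGE with a sampling of cardinality `card` (that is, |S|)."""
    p = card / (card + n)                                   # choice from Cor 1
    variance_const = (A - B) * L_plus_w ** 2 + B * L_pm_w ** 2
    gamma = 1.0 / (L_minus + math.sqrt((1.0 - p) / p * variance_const))
    T = math.ceil(2.0 * delta0 / (eps * gamma))             # T = 2 * Delta_0 / (eps * gamma)
    grads_per_iter = p * n + (1.0 - p) * card               # at most 2 * |S|
    return {
        "p": p,
        "gamma": gamma,
        "T": T,
        "grads_per_iter": grads_per_iter,
        "total_grads": n + grads_per_iter * T,
    }
```

For instance, with $n = 100$, $|S| = \tau = 4$, $A = B = 1/4$, $L_- = 1$ and $L_{\pm} = 2$, the constant $L_{+}$ drops out because $A = B$, and the stepsize evaluates to $\gamma = 1/6$.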

3.1. Uniform With Replacement SAMPLING

Let us do a sanity check and substitute the parameters of the sampling that the original paper uses. We take the Uniform With Replacement sampling (see Sec E.3) with batch size $\tau$ (note that $\tau \ge |S|$), $A = B = 1/\tau$ and $w_i = 1/n$ for all $i \in [n]$ (see Table 1) and get the complexity

$$N_{\mathrm{uniform}} = \Theta\left(n + \frac{\Delta_0}{\varepsilon}\tau\left(L_- + \frac{\sqrt{n}}{\tau} L_\pm\right)\right)$$

for all $\tau \in \{1, 2, \dots, n\}$. Next, let us fix $\tau \le \max\left\{\frac{\sqrt{n} L_\pm}{L_-}, 1\right\}$, and, finally, obtain that

$$N_{\mathrm{uniform}} = \Theta\left(n + \frac{\Delta_0 \max\{\sqrt{n} L_\pm, L_-\}}{\varepsilon}\right).$$

Let us compare it with (4). With the same sampling, our analysis provides a better complexity; indeed, note that $\max\{\sqrt{n} L_\pm, L_-\} \le \sqrt{n} L_+$ (see Lemma 2 in Szlendak et al. (2021)). Moreover, Szlendak et al. (2021) provide examples of optimization problems where $L_\pm$ is small and $L_+$ is large, so the difference can be arbitrarily large.

3.2. Nice SAMPLING

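Next, we consider the Nice sampling (see Sec E.1) and get the complexity

$$N_{\mathrm{nice}} = \Theta\left(n + \frac{\Delta_0}{\varepsilon}\tau\left(L_- + \frac{1}{\tau}\sqrt{\frac{n(n - \tau)}{n - 1}}\, L_\pm\right)\right).$$

Unlike the Uniform With Replacement sampling, for $\varepsilon$ small enough, the Nice sampling recovers the complexity of GD for $\tau = n$, which is equal to $\Theta\left(\frac{\Delta_0 n L_-}{\varepsilon}\right)$.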

3.3. Importance SAMPLING

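Let us consider the Importance sampling (see Sec E.3), which justifies the introduction of the weights $w_i$. We can get the complexity

$$N_{\mathrm{importance}} = \Theta\left(n + \frac{\Delta_0}{\varepsilon}\tau\left(L_- + \frac{\sqrt{n}}{\tau} L_{\pm,w}\right)\right) \le \Theta\left(n + \frac{\Delta_0 \max\{\sqrt{n} L_{\pm,w}, L_-\}}{\varepsilon}\right)$$

for $\tau \le \max\left\{\frac{\sqrt{n} L_{\pm,w}}{L_-}, 1\right\}$. Now, we take $q_i = w_i = \frac{L_i}{\sum_{i=1}^n L_i}$ and use the results from Sec F to obtain

$$N_{\mathrm{importance}} = \Theta\left(n + \frac{\Delta_0 \sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n L_i\right)}{\varepsilon}\right) \le N_{\mathrm{orig}}$$

(see Sec G). In Example 2, we consider an optimization task where $\frac{1}{n}\sum_{i=1}^n L_i$ is $\sqrt{n}$ times smaller than $L_+$. Thus the complexity $N_{\mathrm{importance}}$ can be at least $\sqrt{n}$ times smaller than the complexity $N_{\mathrm{orig}}$.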
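To see where the average $\frac{1}{n}\sum_{i=1}^n L_i$ comes from, one can plug the weights $w_i = L_i / \sum_{j=1}^n L_j$ into the bound of Theorem 4. The short derivation below is a sketch based only on Theorem 4 and Assumption 3 (which guarantees $L_i > 0$, so the weights are well defined); the full argument is in Sec F of the paper.

```latex
% Theorem 4 with weights proportional to the individual smoothness constants.
w_i = \frac{L_i}{\sum_{j=1}^n L_j}
\quad\Longrightarrow\quad
L_{\pm,w}^2
  = \frac{1}{n}\sum_{i=1}^n \frac{L_i^2}{n w_i}
  = \frac{1}{n^2}\Big(\sum_{j=1}^n L_j\Big)\Big(\sum_{i=1}^n L_i\Big)
  = \Big(\frac{1}{n}\sum_{i=1}^n L_i\Big)^2 .

% Uniform weights w_i = 1/n give the larger value (Jensen's inequality):
\frac{1}{n}\sum_{i=1}^n L_i^2 \;\ge\; \Big(\frac{1}{n}\sum_{i=1}^n L_i\Big)^2 .
```

Hence, under the bound of Theorem 4, importance weights never give a worse constant than uniform weights, and the gap grows with the spread of the $L_i$.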

3.4. THE POWER OF B > 0

In all previous examples, the constants satisfy $A = B > 0$. If $A = B$, then the complexity is

$$N = \Theta\left(n + \frac{\Delta_0}{\varepsilon}|S|\left(L_- + \sqrt{\frac{n}{|S|} B L_{\pm,w}^2}\right)\right),$$

thus the complexity $N$ does not depend on $L_{+,w}^2$, which is greater than or equal to $L_{\pm,w}^2$. This is the first analysis of optimal SGD that uses $B > 0$.

3.5. ANALYSIS UNDER PŁ CONDITION

The previous results can be extended to optimization problems that satisfy the Polyak-Łojasiewicz condition. Under this assumption, Algorithm 1 enjoys a linear convergence rate.

Assumption 5. There exists $\mu > 0$ such that the function $f$ satisfies the Polyak-Łojasiewicz (PŁ) condition:

$$\|\nabla f(x)\|^2 \ge 2\mu\left(f(x) - f^*\right) \quad \forall x \in \mathbb{R}^d,$$

where $f^* = \inf_{x \in \mathbb{R}^d} f(x) > -\infty$.

Using Assumption 5, we can improve the convergence rate of PAGE.

Theorem 6. Suppose that Assumptions 1, 2, 3, 5 hold and the samplings $S^t \in \mathbb{S}(A, B, \{w_i\}_{i=1}^n)$. Then Algorithm 1 (PAGE) has the convergence rate

$$\mathbb{E}\left[f(x^T)\right] - f^* \le (1 - \gamma\mu)^T \Delta_0, \quad \text{where} \quad \gamma \le \min\left\{\left(L_- + \sqrt{\frac{2(1-p)}{p}\left((A - B) L_{+,w}^2 + B L_{\pm,w}^2\right)}\right)^{-1}, \frac{p}{2\mu}\right\}.$$
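The linear rate of Theorem 6 translates into an explicit iteration count. The following worked step uses only the elementary inequality $1 - z \le e^{-z}$ and the rate stated in Theorem 6; the target accuracy $\varepsilon$ on the function value gap is an assumption of this illustration.

```latex
% Since \gamma \le p / (2\mu), we have 0 < \gamma\mu \le 1/2, and therefore
(1 - \gamma\mu)^T \Delta_0 \;\le\; e^{-\gamma\mu T}\,\Delta_0 \;\le\; \varepsilon
\qquad\text{whenever}\qquad
T \;\ge\; \frac{1}{\gamma\mu}\,\log\frac{\Delta_0}{\varepsilon}.
```

So the number of iterations depends on the accuracy only through $\log(\Delta_0/\varepsilon)$, in contrast to the $1/\varepsilon$ dependence in Thm 5.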

4. COMPOSITION OF SAMPLINGS: APPLICATION TO FEDERATED LEARNING

In Sec 3, we analyze the PAGE method with samplings that satisfy Assumption 4. Now, let us assume that the functions $f_i$ have the finite-sum form, i.e., $f_i(x) := \frac{1}{m_i}\sum_{j=1}^{m_i} f_{ij}(x)$, thus we obtain the optimization problem

$$\min_{x \in \mathbb{R}^d} \left\{ f(x) := \frac{1}{n}\sum_{i=1}^n \frac{1}{m_i}\sum_{j=1}^{m_i} f_{ij}(x) \right\}. \tag{5}$$

Another way to get this problem is to assume that we split the functions $f_i$ into groups of sizes $m_i$. All in all, let us consider (5) instead of (1). The problem (5) occurs in many applications, including distributed optimization and federated learning (Konečný et al., 2016; McMahan et al., 2017). In federated learning, many devices and machines (nodes) store local datasets that they do not share with other nodes. The local datasets are represented by the functions $f_i$, and all nodes solve the common optimization problem (5). Due to […] (Kairouz et al., 2021), it is infeasible to store and compute the functions $f_i$ locally in one machine.

Algorithm 2 PAGE with composition of samplings
1: Input: initial point $x^0 \in \mathbb{R}^d$, stepsize $\gamma > 0$, probability $p \in (0, 1]$, $g^0 = \nabla f(x^0)$
2: for $t = 0, 1, \dots, T$ do
3:   $x^{t+1} = x^t - \gamma g^t$
4:   $c^{t+1} = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1 - p \end{cases}$
5:   if $c^{t+1} = 1$ then
6:     $g^{t+1} = \nabla f(x^{t+1})$
   […]
       $h_i^{t+1} = S_i^t\left(\{\nabla f_{ij}(x^{t+1}) - \nabla f_{ij}(x^t)\}_{j=1}^{m_i}\right)$ for all $i \in [n]$
   […]

In general, when we solve (1) in one machine, we have the freedom of choosing a sampling $S$ for the functions $f_i$, which we have shown in Sec 3. However, in federated learning, a sampling of nodes or the functions $f_i$ is dictated by hardware limits or network quality (Kairouz et al., 2021). Still, each $i$th node can choose a sampling $S_i$ to sample the functions $f_{ij}$. As a result, we have a composition of the sampling $S$ and the samplings $S_i$ (see Algorithm 2).

Assumption 6. For all $j \in [m_i]$, $i \in [n]$, there exists a Lipschitz constant $L_{ij}$ such that $\|\nabla f_{ij}(x) - \nabla f_{ij}(y)\| \le L_{ij}\|x - y\|$ for all $x, y \in \mathbb{R}^d$.

We now introduce the counterparts of Definitions 2 and 3.

Definition 7. For all $i \in [n]$ and any sampling $S_i \in \mathbb{S}(A_i, B_i, \{w_{ij}\}_{j=1}^{m_i})$, define a constant $L_{i,+,w_i}$ such that

$$\frac{1}{m_i}\sum_{j=1}^{m_i} \frac{1}{m_i w_{ij}}\|\nabla f_{ij}(x) - \nabla f_{ij}(y)\|^2 \le L_{i,+,w_i}^2 \|x - y\|^2 \quad \forall x, y \in \mathbb{R}^d.$$

Definition 8. For all $i \in [n]$ and any sampling $S_i \in \mathbb{S}(A_i, B_i, \{w_{ij}\}_{j=1}^{m_i})$, define a constant $L_{i,\pm,w_i}$ such that

$$\frac{1}{m_i}\sum_{j=1}^{m_i} \frac{1}{m_i w_{ij}}\|\nabla f_{ij}(x) - \nabla f_{ij}(y)\|^2 - \|\nabla f_i(x) - \nabla f_i(y)\|^2 \le L_{i,\pm,w_i}^2 \|x - y\|^2 \quad \forall x, y \in \mathbb{R}^d.$$

Let us provide the counterpart of Thm 5 for Algorithm 2.

Theorem 9. Suppose that Assumptions 1, 2, 3, 6 hold, the samplings $S^t \in \mathbb{S}(A, B, \{w_i\}_{i=1}^n)$, and the samplings $S_i^t \in \mathbb{S}(A_i, B_i, \{w_{ij}\}_{j=1}^{m_i})$ for all $i \in [n]$. Moreover, $B \le 1$. Then Algorithm 2 has the convergence rate $\mathbb{E}\left[\|\nabla f(\widehat{x}^T)\|^2\right] \le \frac{2\Delta_0}{\gamma T}$, where

$$\gamma \le \left(L_- + \sqrt{\frac{1-p}{p}\left(\frac{1}{n}\sum_{i=1}^n \left(\frac{A}{n w_i} + \frac{1 - B}{n}\right)\left((A_i - B_i) L_{i,+,w_i}^2 + B_i L_{i,\pm,w_i}^2\right) + (A - B) L_{+,w}^2 + B L_{\pm,w}^2\right)}\right)^{-1}.$$

The obtained theorem provides a general framework that helps analyze the convergence rates of the composition of samplings that satisfy Assumption 4. We discuss the obtained result in different contexts.

Under review as a conference paper at ICLR 2023

4.1. FEDERATED LEARNING

For simplicity, let us assume that the samplings $S^t$ and $S_i^t$ are Uniform With Replacement samplings with batch sizes $\tau$ and $\tau_i$ for all $i \in [n]$, respectively. Then, to get an $\varepsilon$-stationary point, it is enough to do

$$T := \Theta\left(\frac{\Delta_0}{\varepsilon}\left(L_- + \sqrt{\frac{1-p}{p\tau}\left(\frac{1}{n}\sum_{i=1}^n \frac{1}{\tau_i} L_{i,\pm}^2 + L_\pm^2\right)}\right)\right)$$

iterations. Note that $T \ge \Theta\left(\frac{\Delta_0}{\varepsilon}\left(L_- + \sqrt{\frac{1-p}{p\tau} L_\pm^2}\right)\right)$ for all $\tau_i \ge 1$, $i \in [n]$. It means that after some point, there is no benefit in increasing the batch sizes $\tau_i$. In order to balance $\frac{1}{n}\sum_{i=1}^n \frac{1}{\tau_i} L_{i,\pm}^2$ and $L_\pm^2$, one can take $\tau_i = \Theta\left(L_{i,\pm}^2 / L_\pm^2\right)$. The constant $L_{i,\pm}^2$ captures the intra-variance inside the $i$th node, while $L_\pm^2$ captures the inter-variance between nodes. If the intra-variance is small with respect to the inter-variance, then our theory suggests taking small batch sizes, and vice versa.
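The batch-size rule $\tau_i = \Theta(L_{i,\pm}^2 / L_\pm^2)$ can be sketched as follows. The function name is hypothetical, and taking the constant in $\Theta(\cdot)$ equal to one and rounding up to at least one sample per node are assumptions of this illustration, not prescriptions of the theory.

```python
import math
from typing import List, Sequence


def balanced_local_batch_sizes(L_i_pm: Sequence[float], L_pm: float) -> List[int]:
    """Local batch sizes tau_i ~ (L_{i,pm} / L_pm)^2 that balance the
    intra-node term (1/n) sum_i L_{i,pm}^2 / tau_i against L_pm^2."""
    if L_pm <= 0.0:
        raise ValueError("L_pm must be positive to balance the two terms")
    sizes = []
    for L_i in L_i_pm:
        tau_i = math.ceil((L_i / L_pm) ** 2)
        sizes.append(max(tau_i, 1))       # every node samples at least one point
    return sizes
```

With these sizes each summand $L_{i,\pm}^2 / \tau_i$ is at most $L_\pm^2$, so the intra-node term no longer dominates the inter-node term in the iteration bound.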
4.2. Stratified SAMPLING

Let us provide another example that is closely related to (Zhao & Zhang, 2014). Let us consider (1) and use a variation of the Stratified sampling (Zhao & Zhang, 2014): we split the functions $f_i$ into $g = n/m$ groups, where $m$ is the number of functions in each group. Thus we get the problem (5) with $f(x) = \frac{1}{g}\sum_{i=1}^g \frac{1}{m}\sum_{j=1}^m f_{ij}(x)$. Let us assume that we always sample all groups, thus $A = B = 0$, and that the samplings $S_i^t$ are Nice samplings with batch sizes $\tau_1$ for all $i \in [g]$. Applying Thm 9, we get the convergence rate

$$T_{\mathrm{group}} := \Theta\left(\frac{\Delta_0}{\varepsilon}\left(L_- + \sqrt{\frac{1-p}{p g \tau_1}\cdot\frac{1}{g}\sum_{i=1}^g L_{i,\pm}^2}\right)\right).$$

At each iteration, the algorithm calculates $g\tau_1$ gradients, thus we should take $p = \frac{g\tau_1}{g\tau_1 + n}$ to get the complexity

$$N_{\mathrm{group}} := \Theta(n + g\tau_1 T) = \Theta\left(n + \frac{\Delta_0}{\varepsilon}\left(g\tau_1 L_- + \sqrt{n}\sqrt{\frac{1}{g}\sum_{i=1}^g L_{i,\pm}^2}\right)\right).$$

Let us take $\tau_1 \le \max\left\{\frac{\sqrt{n}\sqrt{\frac{1}{g}\sum_{i=1}^g L_{i,\pm}^2}}{g L_-}, 1\right\}$ to obtain the complexity

$$N_{\mathrm{group}} = \Theta\left(n + \frac{\Delta_0 \max\left\{\sqrt{n}\sqrt{\frac{1}{g}\sum_{i=1}^g L_{i,\pm}^2},\, g L_-\right\}}{\varepsilon}\right).$$

Comparing the complexity $N_{\mathrm{group}}$ with the complexity $N_{\mathrm{uniform}}$ from Sec 3, one can see that if we split the functions $f_i$ in the "right way", such that $L_{i,\pm}$ is small for all $i \in [g]$ (see Example 3), then we can get at least a $\sqrt{n}/\sqrt{g}$ times improvement with the Stratified sampling.

5. EXPERIMENTS

We now provide experiments with synthetic quadratic optimization tasks, where the functions $f_i$, in general, are nonconvex quadratic functions. Note that our goal here is to check whether the dependencies that our theory predicts are correct for the problem (1). The procedures that generate synthetic quadratic optimization tasks give us control over the choice of smoothness constants. All parameters, including the step sizes, are chosen as suggested by the corresponding theory. In the plots, we represent the relation between the norm of gradients and the number of gradient calculations $\nabla f_i$.

5.1. QUADRATIC OPTIMIZATION TASKS WITH VARIOUS HESSIAN VARIANCES L ±

Using Algorithm 3 (see Appendix), we generated various quadratic optimization tasks with different smoothness constants $L_\pm \in [0, 1.0]$ and fixed $L_- \approx 1.0$ (see Fig. 1). We choose $d = 10$, $n = 1000$, regularization $\lambda = 0.001$, and the noise scale $s \in \{0, 0.1, 0.5, 1\}$. According to Sec 3 and Table 2, the gradient complexity of the original PAGE method ("Vanilla PAGE" in Fig. 1) is proportional to $L_+$, while the gradient complexity of the new analysis with the Uniform With Replacement sampling ("Uniform With Replacement" in Fig. 1) is proportional to $L_\pm$, which is always less than or equal to $L_+$. In Fig. 1, one can see that the smaller $L_\pm$ is with respect to $L_+$, the better the performance of "Uniform With Replacement." Moreover, we provide experiments with the Importance sampling ("Importance" in Fig. 1) with $q_i = \frac{L_i}{\sum_{i=1}^n L_i}$.

5.2. QUADRATIC OPTIMIZATION TASKS WITH VARIOUS LOCAL LIPSCHITZ CONSTANTS $L_i$

Using Algorithm 4 (see Appendix), we synthesized various quadratic optimization tasks with different smoothness constants $L_i$ (see Fig. 2). We choose $d = 10$, $n = 1000$, the regularization $\lambda = 0.001$, and the noise scale $s \in \{0, 0.1, 0.5, 10.0\}$. We generated the tasks in such a way that the difference between $\max_i L_i$ and $\min_i L_i$ increases. First, one can see that the Uniform With Replacement sampling with the new analysis ("Uniform With Replacement" in Fig. 2) has better performance even in the cases of significant variations of $L_i$. Next, we see the stability of the Importance sampling ("Importance" in Fig. 2) with respect to these variations.

5.3. NONCONVEX CLASSIFICATION PROBLEM WITH LIBSVM DATASETS

We now solve nonconvex machine learning tasks and compare samplings on LIBSVM datasets (Chang & Lin, 2011) (see details in Sec A.2). As in previous sections, PAGE with the Importance sampling performs better, especially in the australian dataset where the variation of L i is large. In this section, we consider the same setup as in Sec 5.1. In Figure 4 , we fix L ± , and show that the Importance sampling has better convergence rates with different batch sizes. Note that with large batches, the competitors reduce to the GD method, and the difference is not significant. 

A.2 DETAILS ON EXPERIMENTS WITH LIBSVM DATASETS

We compare the samplings on practical machine learning tasks with LIBSVM datasets (Chang & Lin, 2011) (under the 3-clause BSD license). Parameters of Algorithm 1 are chosen as suggested in Thm 5 and Cor 1. We take the parameters for the Uniform With Replacement and Importance samplings from Table 1 with $q_i = \frac{L_i}{\sum_{i=1}^n L_i}$. We consider the logistic regression task with a nonconvex regularization (Wang et al., 2019)

$$f(x_1, x_2) := \frac{1}{n}\sum_{i=1}^n \left[-\log\left(\frac{\exp\left(a_i^\top x_{y_i}\right)}{\sum_{y \in \{1,2\}} \exp\left(a_i^\top x_y\right)}\right) + \lambda \sum_{y \in \{1,2\}} \sum_{k=1}^d \frac{\{x_y\}_k^2}{1 + \{x_y\}_k^2}\right] \to \min_{x_1, x_2 \in \mathbb{R}^d},$$

where $\{\cdot\}_k$ is an indexing operation, $a_i \in \mathbb{R}^d$ is the feature vector of the $i$th sample, $y_i \in \{1, 2\}$ is the label of the $i$th sample, and the constant $\lambda = 0.001$. We fix the batch size $\tau = 1$ and take the w8a dataset (dimension $d = 300$, number of samples $n = 49{,}749$) and the australian dataset (dimension $d = 14$, number of samples $n = 690$) from LIBSVM. For the logistic regression, the Lipschitz constants $L_i$ can be estimated. The distribution of the Lipschitz constants $L_i$ across data points for these two datasets is presented in Fig. 5. We use Thm 4 to obtain $L_{+,w}^2$ and $L_{\pm,w}^2$.

In this experiment, we compare the Uniform With Replacement sampling and the Importance sampling on the logistic regression task from Sec. A.2 in a distributed environment. The training of the models is carried out on the australian dataset from LIBSVM. The dataset is reshuffled with uniform distribution, and then it is split across $n = 10$ clients. In all experiments, we use Algorithm 2 with theoretical stepsizes according to Theorem 9.
We take the parameters of the Uniform With Replacement and Importance samplings from Table 1 with $q_i = \frac{L_i}{\sum_{i=1}^n L_i}$. According to Algorithm 2, we have the samplings $S^t$ that sample clients, and the samplings $S_i^t$ that sample data from the local datasets of clients. Algorithm 2 allows mixed sampling strategies that satisfy Assumption 4. For simplicity, we consider that the samplings $S^t$ and $S_i^t$ are of the same type. For the logistic regression, the Lipschitz constants $L_i$ and $L_{ij}$ of the gradients of the functions $f_i(x)$ and $f_{ij}(x)$ can be estimated. As in Sec A.2, we use Thm 4 to obtain the constants $L_{i,+,w}^2$, $L_{i,\pm,w}^2$, $L_{+,w}^2$ and $L_{\pm,w}^2$.

[Figure 6: $\|\nabla f(x^k)\|^2$ for the Uniform With Replacement and Importance samplings with $\tau_{\mathrm{points}} \in \{1, 7, 21, 35, 56\}$; four panels, the last being (d) $\tau_{\mathrm{clients}} = 9$, #clients $n = 10$.]
The results of experiments are provided in Fig. 6 . We denote by τ points the batch size of the samplings S t i for all i ∈ [n], and by τ clients the batch size of the sampling S t . The number of gradient calculations in Fig. 6 stands for the total number of gradient calculations in all clients. We demonstrate results for different values of the batch sizes τ clients and τ points . As in previous experiments, the Importance sampling has better empirical performance than the Uniform With Replacement sampling. In addition to it, we observe that plots with small batch sizes τ points converge faster.

A.4 COMPUTING ENVIRONMENT

The code was written in Python 3.6.8 using PyTorch 1.9 (Paszke et al., 2019) and optimization research simulator FL PyTorch (Burlachenko et al., 2021) . The distributed environment was emulated on a machine with Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz and 64 cores.

B AUXILIARY FACTS

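We use the following auxiliary fact in our proofs:

1. Let us take a random vector $\xi \in \mathbb{R}^d$, then
$$\mathbb{E}\left[\|\xi\|^2\right] = \mathbb{E}\left[\|\xi - \mathbb{E}[\xi]\|^2\right] + \|\mathbb{E}[\xi]\|^2.$$

C EXAMPLES OF OPTIMIZATION PROBLEMS

Example 1. For simplicity, let us assume that $n$ is even. Let us consider the optimization problem (1) with $f_i(x) = \frac{a}{2}x^2 + \frac{b}{2}x^2$ for $i \in \{1, \dots, n/2\}$ and $f_i(x) = -\frac{a}{2}x^2 + \frac{b}{2}x^2$ for $i \in \{n/2 + 1, \dots, n\}$, where $x \in \mathbb{R}$ and $b \ge 0$. Then $f(x) = \frac{b}{2}x^2$ and

$$L_-^2 = \sup_{x \ne y} \frac{\|\nabla f(x) - \nabla f(y)\|^2}{\|x - y\|^2} = b^2.$$

Moreover,

$$L_+^2 = \sup_{x \ne y} \frac{\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x) - \nabla f_i(y)\|^2}{\|x - y\|^2} = \frac{1}{2}\left((a + b)^2 + (a - b)^2\right),$$

and we can take $a$ arbitrarily large.

Example 2. Let us assume that $n \ge 2$ and consider the optimization problem (1) with $f_1(x) = \frac{b}{2}x^2$ and $f_i(x) = 0$ for $i \in \{2, \dots, n\}$, where $x \in \mathbb{R}$ and $b \ge 0$. Then $f(x) = \frac{b}{2n}x^2$,

$$L_- = \sup_{x \ne y} \frac{\|\nabla f(x) - \nabla f(y)\|}{\|x - y\|} = \frac{b}{n}, \qquad \frac{1}{n}\sum_{i=1}^n L_i = \frac{1}{n}\sup_{x \ne y} \sqrt{\frac{\|\nabla f_1(x) - \nabla f_1(y)\|^2}{\|x - y\|^2}} = \frac{b}{n},$$

and

$$L_+ = \sup_{x \ne y} \sqrt{\frac{\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x) - \nabla f_i(y)\|^2}{\|x - y\|^2}} = \frac{b}{\sqrt{n}}.$$

Example 3. Let us consider the optimization problem (5) with $f(x) = \frac{1}{g}\sum_{i=1}^g \frac{1}{m}\sum_{j=1}^m f_{ij}(x)$ and $f_{ij}(x) = \frac{b_i}{2}x^2$ for all $i \in [g]$ and $j \in [m]$, where $x \in \mathbb{R}$, $b_1 \ge 0$ and $b_i = 0$ for all $i \in \{2, \dots, g\}$. Then $f(x) = \frac{b_1}{2g}x^2$,

$$L_- = \sup_{x \ne y} \frac{\|\nabla f(x) - \nabla f(y)\|}{\|x - y\|} = \frac{b_1}{g},$$

$$L_\pm^2 = \sup_{x \ne y} \frac{\frac{1}{gm}\sum_{i=1}^g \sum_{j=1}^m \|\nabla f_{ij}(x) - \nabla f_{ij}(y)\|^2 - \|\nabla f(x) - \nabla f(y)\|^2}{\|x - y\|^2} = \left(\frac{1}{g} - \frac{1}{g^2}\right) b_1^2,$$

$$L_{i,\pm}^2 = \sup_{x \ne y} \frac{\frac{1}{m}\sum_{j=1}^m \|\nabla f_{ij}(x) - \nabla f_{ij}(y)\|^2 - \|\nabla f_i(x) - \nabla f_i(y)\|^2}{\|x - y\|^2} = 0 \quad \forall i \in [g].$$

Substituting the smoothness constants into the complexity $N_{\mathrm{uniform}}$ from Sec 3 and $N_{\mathrm{group}}$ from Sec 4, one can show that

$$N_{\mathrm{uniform}} = \Theta\left(n + \frac{\Delta_0 \max\{\sqrt{n} L_\pm, L_-\}}{\varepsilon}\right) = \Theta\left(n + \frac{\Delta_0 \sqrt{n}\, b_1}{\varepsilon \sqrt{g}}\right)$$

and

$$N_{\mathrm{group}} = \Theta\left(n + \frac{\Delta_0 \max\left\{\sqrt{n}\sqrt{\frac{1}{g}\sum_{i=1}^g L_{i,\pm}^2},\, g L_-\right\}}{\varepsilon}\right) = \Theta\left(n + \frac{\Delta_0 b_1}{\varepsilon}\right).$$

The complexity $N_{\mathrm{group}}$ is $\sqrt{n}/\sqrt{g}$ times better than the complexity $N_{\mathrm{uniform}}$.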
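The constants in Examples 1 and 2 can be verified numerically. For one-dimensional quadratics $f_i(x) = \frac{c_i}{2}x^2$ we have $\nabla f_i(x) - \nabla f_i(y) = c_i (x - y)$, so the tightest constants have closed forms in terms of the curvatures $c_i$. The helper below is a hypothetical illustration (its name and return format are not from the paper) and covers only this one-dimensional quadratic case with uniform weights.

```python
import math
from typing import Dict, Sequence


def smoothness_constants_1d(curvatures: Sequence[float]) -> Dict[str, float]:
    """Tightest smoothness constants for f_i(x) = c_i / 2 * x^2, x in R."""
    n = len(curvatures)
    mean_c = sum(curvatures) / n
    mean_c_sq = sum(c * c for c in curvatures) / n
    L_minus = abs(mean_c)                        # f has curvature mean(c_i)
    L_plus = math.sqrt(mean_c_sq)                # L_+^2 = mean(c_i^2)
    # L_pm^2 = mean(c_i^2) - mean(c_i)^2; clip tiny negative rounding errors.
    L_pm = math.sqrt(max(mean_c_sq - mean_c ** 2, 0.0))
    mean_L_i = sum(abs(c) for c in curvatures) / n
    return {"L_minus": L_minus, "L_plus": L_plus, "L_pm": L_pm, "mean_L_i": mean_L_i}
```

For Example 1 with $a = 10$, $b = 1$ and $n = 4$ (curvatures $11, 11, -9, -9$) this gives $L_- = 1$ and $L_+^2 = 101$, and for Example 2 with $b = 3$ and $n = 9$ it gives $\frac{1}{n}\sum_i L_i = \frac{1}{3}$ and $L_+ = 1$, a factor of $\sqrt{n} = 3$ apart.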

D MISSING PROOFS

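Lemma 1. Suppose that Assumption 2 holds and let $x^{t+1} = x^t - \gamma g^t$. Then for any $g^t \in \mathbb{R}^d$ and $\gamma > 0$, we have
$$f(x^{t+1}) \le f(x^t) - \frac{\gamma}{2}\|\nabla f(x^t)\|^2 - \left(\frac{1}{2\gamma} - \frac{L_-}{2}\right)\|x^{t+1} - x^t\|^2 + \frac{\gamma}{2}\|g^t - \nabla f(x^t)\|^2. \tag{7}$$

Proof. Using Assumption 2, we have
$$f(x^{t+1}) \le f(x^t) + \langle \nabla f(x^t), x^{t+1} - x^t \rangle + \frac{L_-}{2}\|x^{t+1} - x^t\|^2 = f(x^t) - \gamma \langle \nabla f(x^t), g^t \rangle + \frac{L_-}{2}\|x^{t+1} - x^t\|^2.$$
Next, due to $-\langle x, y \rangle = \frac{1}{2}\|x - y\|^2 - \frac{1}{2}\|x\|^2 - \frac{1}{2}\|y\|^2$, we obtain
$$f(x^{t+1}) \le f(x^t) - \frac{\gamma}{2}\|\nabla f(x^t)\|^2 - \left(\frac{1}{2\gamma} - \frac{L_-}{2}\right)\|x^{t+1} - x^t\|^2 + \frac{\gamma}{2}\|g^t - \nabla f(x^t)\|^2.$$

Theorem 5. Suppose that Assumptions 1, 2, 3 hold and the samplings $\mathbf{S}^t \in \mathbb{S}(A, B, \{w_i\}_{i=1}^n)$. Then Algorithm 1 (PAGE) has the convergence rate $\mathbb{E}\left[\|\nabla f(\widehat{x}^T)\|^2\right] \le \frac{2\Delta_0}{\gamma T}$, where
$$\gamma \le \left(L_- + \sqrt{\frac{1 - p}{p}\left((A - B)L_{+,w}^2 + BL_{\pm,w}^2\right)}\right)^{-1}.$$

Proof. We start with the estimation of the variance of the noise:
$$\begin{aligned}
\mathbb{E}\left[\|g^{t+1} - \nabla f(x^{t+1})\|^2\right] &= (1 - p)\,\mathbb{E}\left[\left\|g^t + \mathbf{S}^t\left(\{\nabla f_i(x^{t+1}) - \nabla f_i(x^t)\}_{i=1}^n\right) - \nabla f(x^{t+1})\right\|^2\right] \\
&= (1 - p)\,\mathbb{E}\left[\left\|\mathbf{S}^t\left(\{\nabla f_i(x^{t+1}) - \nabla f_i(x^t)\}_{i=1}^n\right) - \left(\nabla f(x^{t+1}) - \nabla f(x^t)\right)\right\|^2\right] + (1 - p)\|g^t - \nabla f(x^t)\|^2,
\end{aligned}$$
where we used the unbiasedness of the sampling. Using Assumption 4, we have
$$\mathbb{E}\left[\|g^{t+1} - \nabla f(x^{t+1})\|^2\right] \le (1 - p)\left(A\sum_{i=1}^n \frac{1}{n^2 w_i}\|\nabla f_i(x^{t+1}) - \nabla f_i(x^t)\|^2 - B\|\nabla f(x^{t+1}) - \nabla f(x^t)\|^2\right) + (1 - p)\|g^t - \nabla f(x^t)\|^2.$$
Using the definition of $L_{+,w}$ and $L_{\pm,w}$, we get
$$\begin{aligned}
\mathbb{E}\left[\|g^{t+1} - \nabla f(x^{t+1})\|^2\right] &\le (1 - p)\left(A\sum_{i=1}^n \frac{1}{n^2 w_i}\|\nabla f_i(x^{t+1}) - \nabla f_i(x^t)\|^2 - B\|\nabla f(x^{t+1}) - \nabla f(x^t)\|^2\right) + (1 - p)\|g^t - \nabla f(x^t)\|^2 \\
&= (1 - p)\Bigg((A - B)\sum_{i=1}^n \frac{1}{n^2 w_i}\|\nabla f_i(x^{t+1}) - \nabla f_i(x^t)\|^2 \\
&\qquad\qquad + B\left(\sum_{i=1}^n \frac{1}{n^2 w_i}\|\nabla f_i(x^{t+1}) - \nabla f_i(x^t)\|^2 - \|\nabla f(x^{t+1}) - \nabla f(x^t)\|^2\right)\Bigg) + (1 - p)\|g^t - \nabla f(x^t)\|^2 \\
&\le (1 - p)\left((A - B)L_{+,w}^2 + BL_{\pm,w}^2\right)\|x^{t+1} - x^t\|^2 + (1 - p)\|g^t - \nabla f(x^t)\|^2.
\end{aligned} \tag{8}$$

Under review as a conference paper at ICLR 2023

We now continue the proof using Lemma 1.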
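We add (7) with $\frac{\gamma}{2p} \times$ (8), and take expectation to get
$$\begin{aligned}
&\mathbb{E}\left[f(x^{t+1}) - f^* + \frac{\gamma}{2p}\|g^{t+1} - \nabla f(x^{t+1})\|^2\right] \\
&\quad\le \mathbb{E}\left[f(x^t) - f^* - \frac{\gamma}{2}\|\nabla f(x^t)\|^2 - \left(\frac{1}{2\gamma} - \frac{L_-}{2}\right)\|x^{t+1} - x^t\|^2 + \frac{\gamma}{2}\|g^t - \nabla f(x^t)\|^2\right] \\
&\qquad + \frac{\gamma}{2p}\,\mathbb{E}\left[(1 - p)\|g^t - \nabla f(x^t)\|^2 + (1 - p)\left((A - B)L_{+,w}^2 + BL_{\pm,w}^2\right)\|x^{t+1} - x^t\|^2\right] \\
&\quad= \mathbb{E}\Bigg[f(x^t) - f^* + \frac{\gamma}{2p}\|g^t - \nabla f(x^t)\|^2 - \frac{\gamma}{2}\|\nabla f(x^t)\|^2 \\
&\qquad\qquad - \left(\frac{1}{2\gamma} - \frac{L_-}{2} - \frac{(1 - p)\gamma}{2p}\left((A - B)L_{+,w}^2 + BL_{\pm,w}^2\right)\right)\|x^{t+1} - x^t\|^2\Bigg] \\
&\quad\le \mathbb{E}\left[f(x^t) - f^* + \frac{\gamma}{2p}\|g^t - \nabla f(x^t)\|^2 - \frac{\gamma}{2}\|\nabla f(x^t)\|^2\right],
\end{aligned} \tag{9}$$
where the last inequality holds due to $\frac{1}{2\gamma} - \frac{L_-}{2} - \frac{(1 - p)\gamma}{2p}\left((A - B)L_{+,w}^2 + BL_{\pm,w}^2\right) \ge 0$ by choosing the stepsize
$$\gamma \le \left(L_- + \sqrt{\frac{1 - p}{p}\left((A - B)L_{+,w}^2 + BL_{\pm,w}^2\right)}\right)^{-1}.$$
Now, if we define $\Phi_t := f(x^t) - f^* + \frac{\gamma}{2p}\|g^t - \nabla f(x^t)\|^2$, then (9) can be written in the form
$$\mathbb{E}[\Phi_{t+1}] \le \mathbb{E}[\Phi_t] - \frac{\gamma}{2}\mathbb{E}\left[\|\nabla f(x^t)\|^2\right].$$
Summing up from $t = 0$ to $T - 1$, we get
$$\mathbb{E}[\Phi_T] \le \mathbb{E}[\Phi_0] - \frac{\gamma}{2}\sum_{t=0}^{T-1}\mathbb{E}\left[\|\nabla f(x^t)\|^2\right].$$
Then according to the output of the algorithm, i.e., $\widehat{x}^T$ is randomly chosen from $\{x^t\}_{t \in [T]}$, and $\Phi_0 = f(x^0) - f^* + \frac{\gamma}{2p}\|g^0 - \nabla f(x^0)\|^2 = f(x^0) - f^* \overset{\text{def}}{=} \Delta_0$, we have
$$\mathbb{E}\left[\|\nabla f(\widehat{x}^T)\|^2\right] \le \frac{2\Delta_0}{\gamma T}.$$

Theorem 6. Suppose that Assumptions 1, 2, 3, 5 hold and the samplings $\mathbf{S}^t \in \mathbb{S}(A, B, \{w_i\}_{i=1}^n)$. Then Algorithm 1 (PAGE) has the convergence rate $\mathbb{E}\left[f(x^T)\right] - f^* \le (1 - \gamma\mu)^T \Delta_0$, where
$$\gamma \le \min\left\{\left(L_- + \sqrt{\frac{2(1 - p)}{p}\left((A - B)L_{+,w}^2 + BL_{\pm,w}^2\right)}\right)^{-1}, \frac{p}{2\mu}\right\}.$$

Proof. From the proof of Thm 5, we know that
$$\mathbb{E}\left[\|g^{t+1} - \nabla f(x^{t+1})\|^2\right] \le (1 - p)\left((A - B)L_{+,w}^2 + BL_{\pm,w}^2\right)\|x^{t+1} - x^t\|^2 + (1 - p)\|g^t - \nabla f(x^t)\|^2. \tag{10}$$
Using Lemma 1, we add (7) with $\frac{\gamma}{p} \times$ (10), and take expectation to get
$$\begin{aligned}
&\mathbb{E}\left[f(x^{t+1}) - f^* + \frac{\gamma}{p}\|g^{t+1} - \nabla f(x^{t+1})\|^2\right] \\
&\quad\le \mathbb{E}\left[f(x^t) - f^* - \frac{\gamma}{2}\|\nabla f(x^t)\|^2 - \left(\frac{1}{2\gamma} - \frac{L_-}{2}\right)\|x^{t+1} - x^t\|^2 + \frac{\gamma}{2}\|g^t - \nabla f(x^t)\|^2\right] \\
&\qquad + \frac{\gamma}{p}\,\mathbb{E}\left[(1 - p)\|g^t - \nabla f(x^t)\|^2 + (1 - p)\left((A - B)L_{+,w}^2 + BL_{\pm,w}^2\right)\|x^{t+1} - x^t\|^2\right] \\
&\quad= \mathbb{E}\Bigg[f(x^t) - f^* + \left(1 - \frac{p}{2}\right)\frac{\gamma}{p}\|g^t - \nabla f(x^t)\|^2 - \frac{\gamma}{2}\|\nabla f(x^t)\|^2 \\
&\qquad\qquad - \left(\frac{1}{2\gamma} - \frac{L_-}{2} - \frac{(1 - p)\gamma}{p}\left((A - B)L_{+,w}^2 + BL_{\pm,w}^2\right)\right)\|x^{t+1} - x^t\|^2\Bigg] \\
&\quad\le \mathbb{E}\left[f(x^t) - f^* + \left(1 - \frac{p}{2}\right)\frac{\gamma}{p}\|g^t - \nabla f(x^t)\|^2 - \frac{\gamma}{2}\|\nabla f(x^t)\|^2\right],
\end{aligned}$$
where the last inequality holds due to $\frac{1}{2\gamma} - \frac{L_-}{2} - \frac{(1 - p)\gamma}{p}\left((A - B)L_{+,w}^2 + BL_{\pm,w}^2\right) \ge 0$ by choosing the stepsize
$$\gamma \le \left(L_- + \sqrt{\frac{2(1 - p)}{p}\left((A - B)L_{+,w}^2 + BL_{\pm,w}^2\right)}\right)^{-1}.$$
Next, using Assumption 5 and $\gamma \le \frac{p}{2\mu}$, we have
$$\mathbb{E}\left[f(x^{t+1}) - f^* + \frac{\gamma}{p}\|g^{t+1} - \nabla f(x^{t+1})\|^2\right] \le (1 - \gamma\mu)\,\mathbb{E}\left[f(x^t) - f^* + \frac{\gamma}{p}\|g^t - \nabla f(x^t)\|^2\right].$$
Unrolling the recursion and considering that $g^0 = \nabla f(x^0)$, we complete the proof of the theorem.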
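To make the role of the stepsize bound concrete, here is a minimal NumPy sketch of the PAGE update with the Nice sampling on a toy problem. The toy problem (diagonal quadratics $f_i(x) = \frac{1}{2}\sum_k c_{ik}x_k^2$ with positive $c_{ik}$), the function names `page_stepsize` and `page_nice`, and the closed forms used for $L_-$, $L_{+,w}$ and $L_{\pm,w}$ on this toy problem are assumptions made for illustration. This is not the authors' implementation.

```python
import numpy as np

def page_stepsize(L_minus, L_plus_w, L_pm_w, A, B, p):
    """Largest stepsize allowed by the bound stated in Theorem 5."""
    C = (A - B) * L_plus_w ** 2 + B * L_pm_w ** 2
    return 1.0 / (L_minus + np.sqrt((1.0 - p) / p * C))

def page_nice(c, x0, p, tau, T, seed=0):
    """PAGE with the Nice sampling on f_i(x) = 0.5 * sum_k c[i, k] * x[k]**2.

    Assumes c > 0, so grad f_i(x) = c[i] * x and all constants are explicit.
    Returns the stepsize and the sequence f(x^0), ..., f(x^T).
    """
    rng = np.random.default_rng(seed)
    n = c.shape[0]
    c_bar = c.mean(axis=0)
    A = B = (n - tau) / (tau * (n - 1))          # Nice sampling, w_i = 1/n
    L_minus = c_bar.max()                        # smoothness of the average f
    L_plus = np.sqrt((c ** 2).mean(axis=0).max())
    L_pm = np.sqrt(c.var(axis=0).max())          # population variance per coordinate
    gamma = page_stepsize(L_minus, L_plus, L_pm, A, B, p)

    f = lambda x: 0.5 * np.sum(c_bar * x ** 2)
    x = x0.copy()
    g = c_bar * x                                # g^0 = grad f(x^0)
    values = [f(x)]
    for _ in range(T):
        x_new = x - gamma * g
        if rng.random() < p:
            g = c_bar * x_new                    # full gradient with probability p
        else:
            S = rng.choice(n, size=tau, replace=False)
            # Nice sampling of the gradient differences: average over the subset S
            g = g + (c[S] * (x_new - x)).mean(axis=0)
        x = x_new
        values.append(f(x))
    return gamma, np.array(values)
```

For instance, `page_nice(c, np.ones(5), p=0.5, tau=2, T=200)` with a positive `(10, 5)` array `c` runs 200 iterations. Since the toy objective is strongly convex, the setting of Theorem 6 also applies once the stepsize satisfies its tighter condition.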

E DERIVATIONS OF

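$$\chi_i := \begin{cases} 1, & i \in S, \\ 0, & \text{otherwise.} \end{cases}$$
Due to $p_i = \mathrm{Prob}(i \in S) = \frac{\tau}{n}$, we have
$$\begin{aligned}
\mathbb{E}\left[\left\|\frac{1}{n}\sum_{i \in S}\frac{a_i}{p_i}\right\|^2\right] &= \mathbb{E}\left[\left\|\frac{1}{\tau}\sum_{i=1}^n \chi_i a_i\right\|^2\right] = \frac{1}{\tau^2}\sum_{i=1}^n \mathbb{E}\left[\|\chi_i a_i\|^2\right] + \frac{1}{\tau^2}\sum_{i \ne j}\mathbb{E}\left[\langle \chi_i a_i, \chi_j a_j \rangle\right] \\
&= \frac{1}{\tau^2}\sum_{i=1}^n \mathbb{E}[\chi_i]\|a_i\|^2 + \frac{1}{\tau^2}\sum_{i \ne j}\mathbb{E}[\chi_i \chi_j]\langle a_i, a_j \rangle \\
&= \frac{1}{n\tau}\sum_{i=1}^n \|a_i\|^2 + \frac{\tau - 1}{n(n - 1)\tau}\sum_{i \ne j}\langle a_i, a_j \rangle \\
&= \frac{1}{n\tau}\sum_{i=1}^n \|a_i\|^2 + \frac{\tau - 1}{n(n - 1)\tau}\left(\left\|\sum_{i=1}^n a_i\right\|^2 - \sum_{i=1}^n \|a_i\|^2\right) \\
&= \frac{n - \tau}{\tau(n - 1)}\frac{1}{n}\sum_{i=1}^n \|a_i\|^2 + \frac{\tau - 1}{n(n - 1)\tau}\left\|\sum_{i=1}^n a_i\right\|^2,
\end{aligned}$$
where we use $\mathbb{E}[\chi_i^2] = \mathbb{E}[\chi_i] = \frac{\tau}{n}$ and $\mathbb{E}[\chi_i \chi_j] = \frac{\tau(\tau - 1)}{n(n - 1)}$ when $i \ne j$. Finally, we have
$$\begin{aligned}
\mathbb{E}\left[\left\|\frac{1}{n}\sum_{i \in S}\frac{a_i}{p_i} - \frac{1}{n}\sum_{i=1}^n a_i\right\|^2\right] &= \mathbb{E}\left[\left\|\frac{1}{n}\sum_{i \in S}\frac{a_i}{p_i}\right\|^2\right] - \left\|\frac{1}{n}\sum_{i=1}^n a_i\right\|^2 \\
&= \frac{n - \tau}{\tau(n - 1)}\frac{1}{n}\sum_{i=1}^n \|a_i\|^2 + \frac{\tau - 1}{n(n - 1)\tau}\left\|\sum_{i=1}^n a_i\right\|^2 - \left\|\frac{1}{n}\sum_{i=1}^n a_i\right\|^2 \\
&= \frac{n - \tau}{\tau(n - 1)}\left(\frac{1}{n}\sum_{i=1}^n \|a_i\|^2 - \left\|\frac{1}{n}\sum_{i=1}^n a_i\right\|^2\right).
\end{aligned}$$
Thus we have $A = B = \frac{n - \tau}{\tau(n - 1)}$ and $w_i = \frac{1}{n}$ for all $i \in [n]$.

E.2 Independent SAMPLING

Let us define independent random variables
$$\chi_i = \begin{cases} 1 & \text{with probability } p_i, \\ 0 & \text{with probability } 1 - p_i, \end{cases}$$
for all $i \in [n]$ and take $S := \{i \in [n] \mid \chi_i = 1\}$. We now fix $a_1, \dots, a_n \in \mathbb{R}^d$. A sampling $\mathbf{S}(a_1, \dots, a_n) := \frac{1}{n}\sum_{i \in S}\frac{a_i}{p_i}$ is called the Independent sampling, where $p_i := \mathrm{Prob}(i \in S)$. We get
$$\begin{aligned}
\mathbb{E}\left[\left\|\frac{1}{n}\sum_{i \in S}\frac{a_i}{p_i} - \frac{1}{n}\sum_{i=1}^n a_i\right\|^2\right] &= \mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^n \frac{1}{p_i}\chi_i a_i\right\|^2\right] - \left\|\frac{1}{n}\sum_{i=1}^n a_i\right\|^2 \\
&= \sum_{i=1}^n \frac{\mathbb{E}[\chi_i]}{n^2 p_i^2}\|a_i\|^2 + \sum_{i \ne j}\frac{\mathbb{E}[\chi_i]\mathbb{E}[\chi_j]}{n^2 p_i p_j}\langle a_i, a_j \rangle - \left\|\frac{1}{n}\sum_{i=1}^n a_i\right\|^2 \\
&= \sum_{i=1}^n \frac{1}{n^2 p_i}\|a_i\|^2 + \frac{1}{n^2}\left(\left\|\sum_{i=1}^n a_i\right\|^2 - \sum_{i=1}^n \|a_i\|^2\right) - \left\|\frac{1}{n}\sum_{i=1}^n a_i\right\|^2 \\
&= \frac{1}{n^2}\sum_{i=1}^n \left(\frac{1}{p_i} - 1\right)\|a_i\|^2.
\end{aligned}$$
Thus we have $A = \ldots$

$$\chi_k = \begin{cases} 1 & \text{with probability } q_1, \\ 2 & \text{with probability } q_2, \\ \;\vdots \\ n & \text{with probability } q_n, \end{cases}$$
where $(q_1, \dots, q_n) \in \mathcal{S}^n$ (simple simplex). A sampling $\mathbf{S}(a_1, \dots, a_n) := \frac{1}{\tau}\sum_{k=1}^\tau \frac{a_{\chi_k}}{n q_{\chi_k}}$ is called the Importance sampling. The Importance sampling reduces to the Uniform With Replacement sampling when $q_i = 1/n$ for all $i \in [n]$. Note that $|S| \le \tau$.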
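Let us bound the variance
$$\begin{aligned}
\mathbb{E}\left[\left\|\frac{1}{\tau}\sum_{k=1}^\tau \frac{a_{\chi_k}}{n q_{\chi_k}} - \frac{1}{n}\sum_{i=1}^n a_i\right\|^2\right] &= \frac{1}{\tau^2}\sum_{k=1}^\tau \mathbb{E}\left[\left\|\frac{a_{\chi_k}}{n q_{\chi_k}} - \frac{1}{n}\sum_{i=1}^n a_i\right\|^2\right] \\
&\quad + \frac{1}{\tau^2}\sum_{k \ne k'}\mathbb{E}\left[\left\langle \frac{a_{\chi_k}}{n q_{\chi_k}} - \frac{1}{n}\sum_{i=1}^n a_i,\; \frac{a_{\chi_{k'}}}{n q_{\chi_{k'}}} - \frac{1}{n}\sum_{i=1}^n a_i \right\rangle\right].
\end{aligned}$$
Using the independence and unbiasedness of the random variables, the last term vanishes and we get
$$\begin{aligned}
\mathbb{E}\left[\left\|\frac{1}{\tau}\sum_{k=1}^\tau \frac{a_{\chi_k}}{n q_{\chi_k}} - \frac{1}{n}\sum_{i=1}^n a_i\right\|^2\right] &= \frac{1}{\tau^2}\sum_{k=1}^\tau \mathbb{E}\left[\left\|\frac{a_{\chi_k}}{n q_{\chi_k}} - \frac{1}{n}\sum_{i=1}^n a_i\right\|^2\right] \\
&\overset{(6)}{=} \frac{1}{\tau^2}\sum_{k=1}^\tau \mathbb{E}\left[\left\|\frac{a_{\chi_k}}{n q_{\chi_k}}\right\|^2\right] - \frac{1}{\tau}\left\|\frac{1}{n}\sum_{i=1}^n a_i\right\|^2 \\
&= \frac{1}{\tau}\sum_{i=1}^n q_i\left\|\frac{a_i}{n q_i}\right\|^2 - \frac{1}{\tau}\left\|\frac{1}{n}\sum_{i=1}^n a_i\right\|^2 \\
&= \frac{1}{\tau}\left(\frac{1}{n}\sum_{i=1}^n \frac{1}{n q_i}\|a_i\|^2 - \left\|\frac{1}{n}\sum_{i=1}^n a_i\right\|^2\right).
\end{aligned}$$
Thus we have $A = B = \frac{1}{\tau}$ and $w_i = q_i$ for all $i \in [n]$.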
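The variance identities for the Nice, Independent and Importance samplings can be verified exactly on small instances by enumerating every outcome of the random index set. The sketch below does this with NumPy and `itertools`. All function names are hypothetical. The right-hand side of Assumption 4 is assumed to have the form $A\sum_i \frac{1}{n^2 w_i}\|a_i\|^2 - B\|\frac{1}{n}\sum_i a_i\|^2$, as used in the derivations of this section.

```python
import itertools
import numpy as np

def assumption4_rhs(a, A, B, w):
    """A * sum_i |a_i|^2 / (n^2 w_i) - B * |mean(a)|^2 (assumed form of Assumption 4)."""
    n = len(a)
    sq = (a ** 2).sum(axis=1)
    return A * (sq / (n ** 2 * w)).sum() - B * (a.mean(axis=0) ** 2).sum()

def nice_variance(a, tau):
    """Exact variance of the Nice sampling: average over all subsets of size tau."""
    mean = a.mean(axis=0)
    errs = [((a[list(S)].mean(axis=0) - mean) ** 2).sum()
            for S in itertools.combinations(range(len(a)), tau)]
    return np.mean(errs)

def independent_variance(a, p):
    """Exact variance of the Independent sampling: sum over all 2^n inclusion patterns."""
    n, mean, total = len(a), a.mean(axis=0), 0.0
    for chi in itertools.product([0, 1], repeat=n):
        chi = np.array(chi)
        prob = np.prod(np.where(chi == 1, p, 1 - p))
        est = (chi[:, None] * a / p[:, None]).sum(axis=0) / n
        total += prob * ((est - mean) ** 2).sum()
    return total

def importance_variance(a, q, tau):
    """Exact variance of the Importance sampling: sum over all n^tau index tuples."""
    n, mean, total = len(a), a.mean(axis=0), 0.0
    for idx in itertools.product(range(n), repeat=tau):
        idx = list(idx)
        prob = np.prod(q[idx])
        est = (a[idx] / (n * q[idx])[:, None]).mean(axis=0)
        total += prob * ((est - mean) ** 2).sum()
    return total
```

The enumeration cost grows as $\binom{n}{\tau}$, $2^n$ and $n^\tau$ respectively, so this is only a sanity check for small $n$ and $\tau$.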

E.4 Extended Nice SAMPLING

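In this section, we analyze the extension of the Nice sampling. First, we repeat each vector $a_i$ exactly $l_i$ times, then we use the Nice sampling. We define
$$\tilde{a}_i := \begin{cases}
\frac{\sum_{j=1}^n l_j}{n l_1}a_1, & 1 \le i \le l_1, \\
\frac{\sum_{j=1}^n l_j}{n l_2}a_2, & l_1 + 1 \le i \le l_1 + l_2, \\
\;\vdots \\
\frac{\sum_{j=1}^n l_j}{n l_n}a_n, & \sum_{j=1}^{n-1} l_j + 1 \le i \le \sum_{j=1}^n l_j,
\end{cases}$$
where $a_i \in \mathbb{R}^d$ and $l_i \ge 1$ for all $i \in [n]$. Then we have
$$\frac{1}{n}\sum_{i=1}^n a_i = \frac{1}{N}\sum_{i=1}^N \tilde{a}_i,$$
where $N := \sum_{j=1}^n l_j$. Also, we denote $N_k := \sum_{j=1}^k l_j$. For some $\tau > 0$, we apply the Nice sampling method:
$$\mathbf{S}(a_1, \dots, a_n) := \frac{1}{N}\sum_{i \in S}\frac{\tilde{a}_i}{p_i} = \sum_{i=1}^N \frac{1}{\tau}\chi_i \tilde{a}_i, \qquad \chi_i = \begin{cases} 1, & i \in S, \\ 0, & \text{otherwise,} \end{cases} \qquad p_i = \mathrm{Prob}(i \in S),$$
and $S$ is a random set with cardinality $\tau$ from $[N]$. The sampling $\mathbf{S}(a_1, \dots, a_n)$ is called the Extended Nice sampling.

We are now ready to bound the variance. Using the results for the Nice sampling, applied to the $N$ vectors $\tilde{a}_1, \dots, \tilde{a}_N$, we obtain
$$\begin{aligned}
\mathbb{E}\left[\left\|\mathbf{S}(a_1, \dots, a_n) - \frac{1}{n}\sum_{i=1}^n a_i\right\|^2\right] &= \mathbb{E}\left[\left\|\mathbf{S}(a_1, \dots, a_n) - \frac{1}{N}\sum_{i=1}^N \tilde{a}_i\right\|^2\right] \\
&= \frac{N - \tau}{\tau(N - 1)}\frac{1}{N}\sum_{i=1}^N \|\tilde{a}_i\|^2 - \frac{N - \tau}{\tau(N - 1)}\left\|\frac{1}{N}\sum_{i=1}^N \tilde{a}_i\right\|^2 \\
&= \frac{N - \tau}{\tau(N - 1)}\left(\frac{1}{N}\left(\frac{N}{n l_1}\right)^2\sum_{i=1}^{N_1}\|a_1\|^2 + \frac{1}{N}\left(\frac{N}{n l_2}\right)^2\sum_{i=N_1+1}^{N_2}\|a_2\|^2 + \dots + \frac{1}{N}\left(\frac{N}{n l_n}\right)^2\sum_{i=N_{n-1}+1}^{N}\|a_n\|^2\right) \\
&\quad - \frac{N - \tau}{\tau(N - 1)}\left\|\frac{1}{n}\sum_{i=1}^n a_i\right\|^2 \\
&= \frac{N - \tau}{\tau(N - 1)}\left(\frac{N}{n l_1}\frac{1}{n}\|a_1\|^2 + \frac{N}{n l_2}\frac{1}{n}\|a_2\|^2 + \dots + \frac{N}{n l_n}\frac{1}{n}\|a_n\|^2\right) - \frac{N - \tau}{\tau(N - 1)}\left\|\frac{1}{n}\sum_{i=1}^n a_i\right\|^2 \\
&= \frac{N - \tau}{\tau(N - 1)}\sum_{i=1}^n \frac{1}{n^2 w_i}\|a_i\|^2 - \frac{N - \tau}{\tau(N - 1)}\left\|\frac{1}{n}\sum_{i=1}^n a_i\right\|^2,
\end{aligned}$$
where $w_i = \frac{l_i}{N}$. Thus we have $A = B = \frac{N - \tau}{\tau(N - 1)}$ and $w_i = \frac{l_i}{N}$ for $i \in [n]$.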
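The Extended Nice construction can also be checked by exact enumeration. The sketch below builds the repeated and rescaled vectors $\tilde{a}_i$, averages the squared error over all subsets of size $\tau$ of $[N]$, and compares against a closed form whose constant is written in terms of the population size $N = \sum_j l_j$, since that is what the Nice-sampling identity gives when applied to $N$ vectors. The function names are hypothetical.

```python
import itertools
import numpy as np

def extended_nice_variance(a, l, tau):
    """Exact variance of the Extended Nice sampling by enumerating all subsets of [N].

    a: (n, d) array of vectors, l: list of n positive integer repetition counts.
    """
    n, N = len(a), int(sum(l))
    # Repeat a_i exactly l_i times, rescaled by N / (n * l_i) so the mean is preserved
    a_tilde = np.concatenate(
        [np.tile(N / (n * l[i]) * a[i], (l[i], 1)) for i in range(n)]
    )
    mean = a.mean(axis=0)
    errs = [((a_tilde[list(S)].mean(axis=0) - mean) ** 2).sum()
            for S in itertools.combinations(range(N), tau)]
    return np.mean(errs)

def extended_nice_closed_form(a, l, tau):
    """(N - tau) / (tau (N - 1)) * (sum_i |a_i|^2 / (n^2 w_i) - |mean(a)|^2), w_i = l_i / N."""
    n, N = len(a), int(sum(l))
    w = np.asarray(l, dtype=float) / N
    sq = (a ** 2).sum(axis=1)
    const = (N - tau) / (tau * (N - 1))
    return const * ((sq / (n ** 2 * w)).sum() - (a.mean(axis=0) ** 2).sum())
```

Choosing $l_i = 1$ for all $i$ gives $N = n$ and recovers the plain Nice sampling.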

F THE OPTIMAL CHOICE OF $w_i$

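Let us consider $L_{+,w}^2$ and $L_{\pm,w}^2$. In Sec 2, we show that one can take $L_{+,w}^2 = L_{\pm,w}^2 = \frac{1}{n}\sum_{i=1}^n \frac{1}{n w_i}L_i^2$. Let us minimize $\frac{1}{n}\sum_{i=1}^n \frac{1}{n w_i}L_i^2$ with respect to the weights $w_i$ such that $w_1, \dots, w_n \ge 0$ and $\sum_{i=1}^n w_i = 1$. Using the method of Lagrange multipliers, we can construct a Lagrangian
$$\mathcal{L}(w, \lambda) := \frac{1}{n}\sum_{i=1}^n \frac{1}{n w_i}L_i^2 - \lambda\left(\sum_{i=1}^n w_i - 1\right).$$
Next, we calculate partial derivatives
$$\frac{\partial \mathcal{L}}{\partial w_i} = -\frac{1}{n^2 w_i^2}L_i^2 - \lambda = 0 \quad \forall i \in [n]$$
and get $w_i^2 = -\frac{L_i^2}{n^2 \lambda}$. Using $\sum_{i=1}^n w_i = 1$, we can show that the weights $w_i^* = \frac{L_i}{\sum_{i=1}^n L_i}$ are the solutions of the minimization problem. Moreover,
$$L_{\pm,w^*}^2 = \frac{1}{n}\sum_{i=1}^n \frac{1}{n w_i^*}L_i^2 = \left(\frac{1}{n}\sum_{i=1}^n L_i\right)^2.$$

G THE COMPLEXITY OF ALGORITHM 1 WITH THE Importance SAMPLING

The expected number of gradient calculations $\nabla f_i$ of Algorithm 1 with the Importance sampling, the optimal $w_i^*$ from Sec. F, and $\tau \le \max\left\{\frac{\sqrt{n}L_{\pm,w}}{L_-}, 1\right\}$ equals
$$\begin{aligned}
N_{\text{importance}} &= O\left(n + \frac{\Delta_0}{\varepsilon}\tau\left(L_- + \frac{\sqrt{n}}{\tau}\sqrt{\frac{1}{n}\sum_{i=1}^n \frac{1}{n w_i^*}L_i^2}\right)\right) = O\left(n + \frac{\Delta_0}{\varepsilon}\tau\left(L_- + \frac{\sqrt{n}}{\tau}\frac{1}{n}\sum_{i=1}^n L_i\right)\right) \\
&= O\left(n + \frac{\Delta_0}{\varepsilon}\sqrt{n}L_{\pm,w^*} + \frac{\Delta_0 \sqrt{n}\,\frac{1}{n}\sum_{i=1}^n L_i}{\varepsilon}\right) = O\left(n + \frac{\Delta_0 \sqrt{n}\,\frac{1}{n}\sum_{i=1}^n L_i}{\varepsilon}\right).
\end{aligned}$$

H MISSING PROOFS: THE COMPOSITION OF SAMPLINGS

$$\begin{aligned}
&\mathbb{E}\left[\left\|\mathbf{S}\big(\mathbf{S}_1(a_{11}, \dots, a_{1m_1}), \dots, \mathbf{S}_n(a_{n1}, \dots, a_{nm_n})\big) - \frac{1}{n}\sum_{i=1}^n \left(\frac{1}{m_i}\sum_{j=1}^{m_i} a_{ij}\right)\right\|^2\right] \\
&\quad\le \frac{1}{n}\sum_{i=1}^n \left(\frac{A}{n w_i} + \frac{1 - B}{n}\right)\left(\frac{A_i}{m_i}\sum_{j=1}^{m_i}\frac{1}{m_i w_{ij}}\|a_{ij}\|^2 - B_i\left\|\frac{1}{m_i}\sum_{j=1}^{m_i} a_{ij}\right\|^2\right) \\
&\qquad + \frac{A}{n}\sum_{i=1}^n \frac{1}{n w_i}\left\|\frac{1}{m_i}\sum_{j=1}^{m_i} a_{ij}\right\|^2 - B\left\|\frac{1}{n}\sum_{i=1}^n \left(\frac{1}{m_i}\sum_{j=1}^{m_i} a_{ij}\right)\right\|^2,
\end{aligned}$$
where $a_{ij} \in \mathbb{R}^d$ for all $j \in [m_i]$ and $i \in [n]$.

Proof. We denote $\widehat{a}_i := \mathbf{S}_i(a_{i1}, \dots, a_{im_i})$ and $a_i := \frac{1}{m_i}\sum_{j=1}^{m_i} a_{ij}$. Using (6), we have
$$\begin{aligned}
\mathbb{E}\left[\left\|\mathbf{S}(\widehat{a}_1, \dots, \widehat{a}_n) - \frac{1}{n}\sum_{i=1}^n a_i\right\|^2\right] &= \mathbb{E}\left[\mathbb{E}_{\mathbf{S}}\left[\left\|\mathbf{S}(\widehat{a}_1, \dots, \widehat{a}_n) - \frac{1}{n}\sum_{i=1}^n a_i\right\|^2\right]\right] \\
&= \mathbb{E}\left[\mathbb{E}_{\mathbf{S}}\left[\left\|\mathbf{S}(\widehat{a}_1, \dots, \widehat{a}_n) - \frac{1}{n}\sum_{i=1}^n \widehat{a}_i\right\|^2\right]\right] + \mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^n \widehat{a}_i - \frac{1}{n}\sum_{i=1}^n a_i\right\|^2\right].
\end{aligned}$$
Next, using Assumption 4 for the sampling $\mathbf{S}$, we get
$$\mathbb{E}\left[\left\|\mathbf{S}(\widehat{a}_1, \dots, \widehat{a}_n) - \frac{1}{n}\sum_{i=1}^n a_i\right\|^2\right] \le A\frac{1}{n}\sum_{i=1}^n \frac{1}{n w_i}\mathbb{E}\left[\|\widehat{a}_i\|^2\right] - B\,\mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^n \widehat{a}_i\right\|^2\right] + \mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^n \widehat{a}_i - \frac{1}{n}\sum_{i=1}^n a_i\right\|^2\right].$$
Due to (6), we obtain
$$\begin{aligned}
\mathbb{E}\left[\left\|\mathbf{S}(\widehat{a}_1, \dots, \widehat{a}_n) - \frac{1}{n}\sum_{i=1}^n a_i\right\|^2\right] &\le A\frac{1}{n}\sum_{i=1}^n \frac{1}{n w_i}\mathbb{E}\left[\|\widehat{a}_i - a_i\|^2\right] + A\frac{1}{n}\sum_{i=1}^n \frac{1}{n w_i}\|a_i\|^2 \\
&\quad - B\,\mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^n \widehat{a}_i - \frac{1}{n}\sum_{i=1}^n a_i\right\|^2\right] - B\left\|\frac{1}{n}\sum_{i=1}^n a_i\right\|^2 + \mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^n \widehat{a}_i - \frac{1}{n}\sum_{i=1}^n a_i\right\|^2\right] \\
&= A\frac{1}{n}\sum_{i=1}^n \frac{1}{n w_i}\mathbb{E}\left[\|\widehat{a}_i - a_i\|^2\right] + A\frac{1}{n}\sum_{i=1}^n \frac{1}{n w_i}\|a_i\|^2 \\
&\quad + (1 - B)\,\mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^n \widehat{a}_i - \frac{1}{n}\sum_{i=1}^n a_i\right\|^2\right] - B\left\|\frac{1}{n}\sum_{i=1}^n a_i\right\|^2 \\
&= A\frac{1}{n}\sum_{i=1}^n \frac{1}{n w_i}\mathbb{E}\left[\|\widehat{a}_i - a_i\|^2\right] + A\frac{1}{n}\sum_{i=1}^n \frac{1}{n w_i}\|a_i\|^2 + \frac{1 - B}{n^2}\sum_{i=1}^n \mathbb{E}\left[\|\widehat{a}_i - a_i\|^2\right] - B\left\|\frac{1}{n}\sum_{i=1}^n a_i\right\|^2 \\
&= \frac{1}{n}\sum_{i=1}^n \left(\frac{A}{n w_i} + \frac{1 - B}{n}\right)\mathbb{E}\left[\|\widehat{a}_i - a_i\|^2\right] + A\frac{1}{n}\sum_{i=1}^n \frac{1}{n w_i}\|a_i\|^2 - B\left\|\frac{1}{n}\sum_{i=1}^n a_i\right\|^2.
\end{aligned}$$
Using Assumption 4 for the samplings $\mathbf{S}_i$, we have
$$\begin{aligned}
\mathbb{E}\left[\left\|\mathbf{S}(\widehat{a}_1, \dots, \widehat{a}_n) - \frac{1}{n}\sum_{i=1}^n a_i\right\|^2\right] &\le \frac{1}{n}\sum_{i=1}^n \left(\frac{A}{n w_i} + \frac{1 - B}{n}\right)\left(A_i\frac{1}{m_i}\sum_{j=1}^{m_i}\frac{1}{m_i w_{ij}}\|a_{ij}\|^2 - B_i\|a_i\|^2\right) \\
&\quad + A\frac{1}{n}\sum_{i=1}^n \frac{1}{n w_i}\|a_i\|^2 - B\left\|\frac{1}{n}\sum_{i=1}^n a_i\right\|^2.
\end{aligned}$$

Proof. We start with the estimation of the variance of the noise:
$$\mathbb{E}\left[\|g^{t+1} - \nabla f(x^{t+1})\|^2\right] = \ldots,$$
where we used the unbiasedness of the composition of samplings. Using Lemma 2, we have
$$\begin{aligned}
\mathbb{E}\left[\|g^{t+1} - \nabla f(x^{t+1})\|^2\right] &\le (1 - p)\Bigg(\frac{1}{n}\sum_{i=1}^n \left(\frac{A}{n w_i} + \frac{1 - B}{n}\right)\Big(\frac{A_i}{m_i}\sum_{j=1}^{m_i}\frac{1}{m_i w_{ij}}\|\nabla f_{ij}(x^{t+1}) - \nabla f_{ij}(x^t)\|^2 - B_i\|\nabla f_i(x^{t+1}) - \nabla f_i(x^t)\|^2\Big) \\
&\qquad + \frac{A}{n}\sum_{i=1}^n \frac{1}{n w_i}\|\nabla f_i(x^{t+1}) - \nabla f_i(x^t)\|^2 - B\|\nabla f(x^{t+1}) - \nabla f(x^t)\|^2\Bigg) + (1 - p)\|g^t - \nabla f(x^t)\|^2.
\end{aligned}$$
Using Definitions 2, 3, 7 and 8, we get
$$\begin{aligned}
\mathbb{E}\big[\|g^{t+1} - \nabla f(x^{t+1})\|^2\big] &\le (1 - p)\Big(\frac{1}{n}\sum_{i=1}^n \Big(\frac{A}{n w_i} + \frac{1 - B}{n}\Big)\Big(\frac{A_i - B_i}{m_i}\sum_{j=1}^{m_i}\frac{1}{m_i w_{ij}}\|\nabla f_{ij}(x^{t+1}) - \nabla f_{ij}(x^t)\|^2 \\
&\qquad + B_i\Big(\frac{1}{m_i}\sum_{j=1}^{m_i}\frac{1}{m_i w_{ij}}\|\nabla f_{ij}(x^{t+1}) - \nabla f_{ij}(x^t)\|^2 - \|\nabla f_i(x^{t+1}) - \nabla f_i(x^t)\|^2\Big)\Big) \\
&\qquad + \frac{A - B}{n}\sum_{i=1}^n \frac{1}{n w_i}\|\nabla f_i(x^{t+1}) - \nabla f_i(x^t)\|^2 + \ldots
\end{aligned}$$

I ARTIFICIAL QUADRATIC OPTIMIZATION TASKS

In this section, we provide the algorithms that we use to generate artificial optimization tasks for experiments. Algorithm 3 and Algorithm 4 allow us to control the smoothness constants $L_\pm$ and $L_i$, respectively, via the noise scales.

Algorithm 3 Generate quadratic optimization task with controlled $L_\pm$ (homogeneity)
1: Parameters: number of nodes $n$, dimension $d$, regularizer $\lambda$, and noise scale $s$.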
Take the initial tridiagonal matrix
$$A_i = \frac{\nu_i^s}{4}\begin{pmatrix} 2 & -1 & & 0 \\ -1 & \ddots & \ddots & \\ & \ddots & \ddots & -1 \\ 0 & & -1 & 2 \end{pmatrix} \in \mathbb{R}^{d \times d}$$



Our code: https://github.com/mysteryresearcher/page_ab_fl_experiment_a3



$p_i := \mathrm{Prob}(i \in S)$, $|\mathbf{S}| = |S|$, some $A \ge 0$, $w_1, \dots, w_n \ge 0$ and $B = 0$. Recently, Szlendak et al. (2021) studied a similar inequality, but in the context of communication-efficient distributed training with randomized gradient compression operators. They explicitly set out to study correlated compressors, and for this reason introduced the second term on the right-hand side; i.e., they considered

Figure 1: Comparison of samplings and methods on quadratic optimization tasks with various $L_\pm$.

Figure 2: Comparison of samplings and methods on quadratic optimization tasks with various $L_i$.

Figure 3: Comparison of samplings on nonconvex machine learning tasks with LIBSVM datasets.

Figure 4: Comparison of samplings and methods with various batch sizes.

Figure 5: The distribution of Lipschitz constants $L_i$

(c) τ clients = 6, #clients n = 10

Figure 6: Comparison of methods on australian dataset from LIBSVM

THE PARAMETERS FOR THE SAMPLINGS

E.1 Nice SAMPLING

Let $S$ be a random subset uniformly chosen from $[n]$ with a fixed cardinality $\tau$. Let us fix $a_1, \dots, a_n \in \mathbb{R}^d$. A sampling $\mathbf{S}(a_1, \dots, a_n) := \frac{1}{n}\sum_{i \in S}\frac{a_i}{p_i}$ is called the Nice sampling, where $p_i := \mathrm{Prob}(i \in S)$. Let us bound $\mathbb{E}\left[\left\|\mathbf{S}(a_1, \dots, a_n) - \frac{1}{n}\sum_{i=1}^n a_i\right\|^2\right]$ and find parameters from Assumption 4. Note that $|\mathbf{S}| = |S| = \tau$. We introduce auxiliary random variables

for all $i \in [n]$.

E.3 Importance AND Uniform With Replacement SAMPLING

Let us fix $\tau > 0$. For all $k \in [\tau]$, we define i.i.d. random variables

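$$\begin{aligned}
&(1 - p)\,\mathbb{E}\left[\left\|g^t + \mathbf{S}^t\left(\left\{\mathbf{S}_i^t\left(\{\nabla f_{ij}(x^{t+1}) - \nabla f_{ij}(x^t)\}_{j=1}^{m_i}\right)\right\}_{i=1}^n\right) - \nabla f(x^{t+1})\right\|^2\right] \\
&\quad= (1 - p)\,\mathbb{E}\left[\left\|\mathbf{S}^t\left(\left\{\mathbf{S}_i^t\left(\{\nabla f_{ij}(x^{t+1}) - \nabla f_{ij}(x^t)\}_{j=1}^{m_i}\right)\right\}_{i=1}^n\right) - \left(\nabla f(x^{t+1}) - \nabla f(x^t)\right)\right\|^2\right] + (1 - p)\|g^t - \nabla f(x^t)\|^2,
\end{aligned}$$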

the minimum eigenvalue $\lambda_{\min}(A)$
9: for $i = 1, \dots, n$ do
10: Update matrix $A_i = A_i + (\lambda - \lambda_{\min}(A))I$
11: end for
12: Take starting point $x^0 = (\sqrt{d}, 0, \dots, 0)$
13: Output: matrices $A_1, \dots, A_n$, vectors $b_1, \dots, b_n$, starting point $x^0$

Algorithm 4 Generate quadratic optimization task with controlled $L_i$
1: Parameters: number of nodes $n$, dimension $d$, regularizer $\lambda$, and noise scale $s$.
2: for $i = 1, \dots, n$ do
3: Generate random noises $\nu_i^s = 1 + s\xi_i^s$, where i.i.d. $\xi_i^s \sim \text{ExponentialDistribution}(1.0)$
4: Generate random noises $\nu_i^b = s\xi_i^b$, i.i.d. $\xi_i^b \sim \text{NormalDistribution}(0, 1)$
5: Take vector $b_i = (-\frac{1}{4} + \nu_i^b, 0, \dots, 0) \in \mathbb{R}^d$
6:

Take starting point $x^0 = (\sqrt{d}, 0, \dots, 0)$
9: Output: matrices $A_1, \dots, A_n$, vectors $b_1, \dots, b_n$, starting point $x^0$
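The visible steps of Algorithm 4 can be sketched in NumPy as follows. This is one reading of the listing, under stated assumptions: step 6 is taken to be the scaled tridiagonal matrix $A_i = \frac{\nu_i^s}{4}\,\mathrm{tridiag}(-1, 2, -1)$, the regularizer $\lambda$ is omitted because no visible step of Algorithm 4 uses it, and $L_i$ is taken to be the largest eigenvalue of $A_i$ (i.e., $f_i$ is assumed to be a quadratic with Hessian $A_i$). The function names are hypothetical and this is not the authors' generator. The last two helpers connect the generated $L_i$ to the weights $w_i^* = L_i / \sum_j L_j$ from Sec. F.

```python
import numpy as np

def generate_task_alg4(n, d, s, rng):
    """Sketch of Algorithm 4: quadratic tasks whose L_i are controlled by the noise scale s."""
    # Unscaled tridiagonal matrix with 2 on the diagonal and -1 next to it
    base = 2.0 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)
    nu_s = 1.0 + s * rng.exponential(1.0, size=n)   # step 3: nu_i^s = 1 + s * xi, xi ~ Exp(1)
    nu_b = s * rng.standard_normal(n)               # step 4: nu_i^b = s * xi, xi ~ N(0, 1)
    b = np.zeros((n, d))
    b[:, 0] = -0.25 + nu_b                          # step 5: b_i = (-1/4 + nu_i^b, 0, ..., 0)
    A = [nu_s[i] / 4.0 * base for i in range(n)]    # assumed step 6: scaled tridiagonal A_i
    x0 = np.zeros(d)
    x0[0] = np.sqrt(d)                              # step 8: x^0 = (sqrt(d), 0, ..., 0)
    return A, b, x0

def lipschitz_constants(A):
    """L_i as the largest eigenvalue of the symmetric matrix A_i (assumption)."""
    return np.array([np.linalg.eigvalsh(Ai)[-1] for Ai in A])

def optimal_weights(L):
    """Weights w_i^* = L_i / sum_j L_j from Sec. F."""
    L = np.asarray(L, dtype=float)
    return L / L.sum()

def weighted_constant(L, w):
    """(1/n) * sum_i L_i^2 / (n * w_i), the quantity minimized in Sec. F."""
    n = len(L)
    return np.sum(L ** 2 / (n * w)) / n
```

With the weights $w^*$ the weighted constant equals $\left(\frac{1}{n}\sum_i L_i\right)^2$, which is never larger than the value $\frac{1}{n}\sum_i L_i^2$ obtained with uniform weights. A larger noise scale `s` spreads the $L_i$ further apart and widens this gap.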

The complexity of methods and samplings from Table 1 and Sec 4.

FL Interpretation: Calculate the full gradients ∇f i on the nodes and collect them */

FL Interpretation: Calculate the mini-batches h t+1Generate a sampling S t and set g t+1 = g t + S t {h t+1

Lemma 2. Let us assume that a random sampling function $\mathbf{S}$ satisfies Assumption 4 with some $A$, $B$ and weights $w_i$, and random sampling functions $\mathbf{S}_i$ satisfy Assumption 4 with some $A_i$, $B_i$ and weights $w_{ij}$ for all $i \in [n]$. Moreover, $B \le 1$. Then

Suppose that Assumptions 1, 2, 3, 6 hold and the samplings $\mathbf{S}^t \in \mathbb{S}(A, B, \{w_i\}_{i=1}^n)$ and the samplings $\mathbf{S}_i^t \in \mathbb{S}(A_i, B_i, \{w_{ij}\}_{j=1}^{m_i})$ for all $i \in [n]$. Moreover, $B \le 1$. Then Algorithm 2 has the convergence rate $\mathbb{E}\left[\|\nabla f(\widehat{x}^T)\|^2\right]$

+B

1 From this point the proof of the theorem repeats the proof of Thm 5, with the resulting coefficient of $\|x^{t+1} - x^t\|^2$ used instead of $(A - B)L_{+,w}^2 + BL_{\pm,w}^2$.

