FASTER FEDERATED OPTIMIZATION UNDER SECOND-ORDER SIMILARITY

Abstract

Federated learning (FL) is a subfield of machine learning where multiple clients try to collaboratively learn a model over a network under communication constraints. We consider finite-sum federated optimization under a second-order function similarity condition and strong convexity, and propose two new algorithms: SVRP and Catalyzed SVRP. This second-order similarity condition has grown popular recently, and is satisfied in many applications including distributed statistical learning and differentially private empirical risk minimization. The first algorithm, SVRP, combines approximate stochastic proximal point evaluations, client sampling, and variance reduction. We show that SVRP is communication efficient and achieves superior performance to many existing algorithms when function similarity is high enough. Our second algorithm, Catalyzed SVRP, is a Catalyst-accelerated variant of SVRP that achieves even better performance and uniformly improves upon existing algorithms for federated optimization under second-order similarity and strong convexity. In the course of analyzing these algorithms, we provide a new analysis of the Stochastic Proximal Point Method (SPPM) that might be of independent interest. Our analysis of SPPM is simple, allows for approximate proximal point evaluations, does not require any smoothness assumptions, and shows a clear benefit in communication complexity over ordinary distributed stochastic gradient descent.

1. INTRODUCTION

Federated Learning (FL) is a subfield of machine learning where many clients (e.g. mobile phones or hospitals) collaboratively try to solve a learning task over a network without sharing their data. Federated Learning finds applications in many areas including healthcare, Internet of Things (IoT) devices, manufacturing, and natural language processing tasks (Kairouz et al., 2019; Nguyen et al., 2021; Liu et al., 2021). One of the central problems of FL is federated or distributed optimization. Federated optimization has been the subject of intensive ongoing research effort over the past few years (Wang et al., 2021). The standard formulation of federated optimization is to solve a minimization problem:

$$\min_{x \in \mathbb{R}^d} \left[ f(x) = \frac{1}{M} \sum_{m=1}^{M} f_m(x) \right], \tag{1}$$

where each function $f_m$ represents the empirical risk of model $x$ calculated using the data on the $m$-th client, out of a total of $M$ clients. Each client is connected to a central server tasked with coordinating the learning process. We shall assume that the loss on each client is $\mu$-strongly convex. Because the model dimensionality $d$ is often large in practice, the most popular methods for solving Problem (1) are first-order methods that only access gradients and do not require higher-order derivative information. Such methods include distributed (stochastic) gradient descent, FedAvg (also known as Local SGD) (Konečný et al., 2016), FedProx (also known as the Stochastic Proximal Point Method) (Li et al., 2020b), SCAFFOLD (Karimireddy et al., 2020b), and others. These algorithms typically follow the intermittent-communication framework (Woodworth et al., 2021): the optimization process is divided into several communication rounds. In each of these rounds, the server sends a model to the clients, they do some local work, and then send back updated models. The server aggregates these models and starts another round.
Problem (1) is an example of the well-studied finite-sum minimization problem, for which we have tightly matching lower and upper bounds (Woodworth & Srebro, 2016). The chief quality that differentiates federated optimization from the finite-sum minimization problem is that we mainly care about communication complexity rather than the number of gradient accesses. That is, we care about the number of times that each node communicates with the central server rather than the number of local gradients accessed on each machine. This is because the cost of communication is often much higher than the cost of local computation, as Kairouz et al. (2019) state: "It is now well-understood that communication can be a primary bottleneck for federated learning."

One of the main sources of this bottleneck is that when all clients participate in the learning process, the cost of communication can be very high (Shahid et al., 2021). This can be alleviated in part by using client sampling (also known as partial participation): by sampling only a small number of clients for each round of communication, we can reduce the communication burden while retaining or even accelerating the training process (Chen et al., 2022).

Our main focus in this paper is to develop methods for solving Problem (1) using client sampling and under the following assumption:

Assumption 1. (Second-order similarity). We assume that for all $x, y \in \mathbb{R}^d$ we have

$$\frac{1}{M} \sum_{m=1}^{M} \left\| \nabla f_m(x) - \nabla f(x) - \left[ \nabla f_m(y) - \nabla f(y) \right] \right\|^2 \le \delta^2 \|x - y\|^2.$$

Assumption 1 is a slight generalization of the $\delta$-relatedness assumption used by Arjevani & Shamir (2015) in the context of quadratic optimization and by Sun et al. (2022) for strongly-convex optimization. It is also known as function similarity (Kovalev et al., 2022).
Assumption 1 holds (with relatively small $\delta$) in many practical settings, including statistical learning for quadratics (Shamir et al., 2014), generalized linear models (Hendrikx et al., 2020), and semi-supervised learning (Chayti & Karimireddy, 2022). We provide more details on the applications of second-order similarity in Appendix B. Under the full participation communication model where all clients participate in each iteration, several methods can solve Problem (1) under Assumption 1, including ones that tightly match existing lower bounds (Kovalev et al., 2022). In contrast, for the setting we consider (partial participation or client sampling), no lower bounds are known. The main question of our work is:

Can we design faster methods for federated optimization (Problem (1)) under second-order similarity (Assumption 1) using client sampling?

1.1. CONTRIBUTIONS

We answer the above question in the affirmative and show the utility of using client sampling in optimization under second-order similarity for strongly convex objectives. Our main contributions are as follows:

• A new algorithm for federated optimization (SVRP, Algorithm 2). We develop a new algorithm, SVRP (Stochastic Variance-Reduced Proximal Point), that utilizes client sampling to improve upon the existing algorithms for solving Problem (1) under second-order similarity. SVRP has a better dependence on the number of clients $M$ in its communication complexity than all existing algorithms (see Table 1), and achieves superior performance when the dissimilarity constant $\delta$ is small enough. SVRP trades off a higher computational complexity for less communication.

• Catalyst-accelerated SVRP. By using Catalyst (Lin et al., 2015), we accelerate SVRP and obtain a new algorithm (Catalyzed SVRP) that improves the dependence on the effective conditioning from $\delta^2/\mu^2$ to $\sqrt{\delta/\mu}$.
Catalyzed SVRP also has a better convergence rate (in number of communication steps, ignoring constants and logarithmic factors) than all existing accelerated algorithms for this problem under Assumption 1, reducing the dependence on the number of clients multiplied by the effective conditioning from $\sqrt{\delta/\mu}\,M$ to $\sqrt{\delta/\mu}\,M^{3/4}$ (see Table 1).

While both SVRP and Catalyzed SVRP achieve a communication complexity that is better than algorithms designed for the standard finite-sum setting (like SVRG or SAGA), their computational complexity is considerably worse. This is because we trade off local computation complexity for a reduced communication complexity. Additionally, both SVRP and Catalyzed SVRP are based upon a novel combination of variance-reduction techniques and the stochastic proximal point method (SPPM). SPPM is our starting point, and we provide a new analysis for it that might be of independent interest. Our analysis of SPPM is simple, allows for approximate evaluations of the proximal operator, and extends to include variance reduction. In Appendix H we also consider the more general constrained optimization problem and provide similar convergence rates in that setting.

Table 1 (surviving rows only; the column headers and the method name of the first row are missing from the source):

  (name missing)  | $\tilde{O}\left(\sqrt{\delta/\mu}\,M\right)$ | - | (Arjevani & Shamir, 2015)
  SVRP            | $\tilde{O}\left(\delta^2/\mu^2 + M\right)$ | ✓ | Theorem 2 (NEW)
  Catalyzed SVRP  | $\tilde{O}\left(\sqrt{\delta/\mu}\,M^{3/4} + M\right)$ | ✓ | Theorem 3 (NEW)

Distributed optimization under Assumption 1.

There is a long line of work analyzing distributed optimization under Assumption 1 and strong convexity: Shamir et al. (2014) first gave DANE and analyzed it for quadratics, and showed the benefits of using second-order similarity in the setting of statistical learning for quadratic objectives. Zhang & Lin (2015) developed the DiSCO algorithm that improved upon DANE for quadratics, and also analyzed it for self-concordant objectives. Arjevani & Shamir (2015) gave a lower bound that matched the rate given by DANE, though without allowing for client sampling. The theory of DANE was later improved in (Yuan & Li, 2019) , allowing for local convergence for non-quadratic objectives. Another algorithm SCAFFOLD (Karimireddy et al., 2020b) can be seen as a variant of DANE and is also analyzed for quadratics. In the context of decentralized optimization, Sun et al. (2022) gave SONATA and showed a similar rate to DANE but for general strongly convex objectives, then Tian et al. (2022) improved the convergence rate of SONATA by acceleration. Finally, Kovalev et al. (2022) improved the convergence rate of accelerated SONATA even further by removing extra logarithmic factors. We give an overview of all related results in Table 1 and provide more thorough comparisons in the theory and algorithms section. A parallel line of work considers Assumption 1 as a special case of relative strong convexity and smoothness and utilizes methods based on mirror descent. Hendrikx et al. (2020) take this view and consider an accelerated variant of mirror descent in the distributed setting, while Dragomir et al. (2021) also consider sampling and variance-reduction. Because this setting is much more general, it is more challenging to prove tight convergence rates, and in the worst-case the convergence rates are no better than the minimax rate under only smoothness and strong convexity. 
Another line of work considers federated optimization for Problem (1) under Assumption 1 but without convexity, as well as optimization under the scenario where the number of clients is very large or infinite. Examples of this include MIME (Karimireddy et al., 2020a), FedShuffleMVR (Horváth et al., 2022), as well as AuxMom and AuxMVR (Chayti & Karimireddy, 2022). Our focus in this work is on the convex setting.

Stochastic Proximal Point Method. The stochastic proximal point method (SPPM) is our starting point in developing SVRP and its Catalyzed variant. SPPM is well-studied, and we only briefly review some results on its convergence: Bertsekas (2011) terms it the incremental proximal point method, and provides analysis showing nonasymptotic convergence around a solution under the assumption that each $f_m$ is Lipschitz. Ryu & Boyd (2014) provide convergence rates for the algorithm, and observe that, unlike stochastic gradient descent (SGD), it is stable to learning rate misspecification. Pătrașcu & Necoara (2017) analyze SPPM for constrained optimization with random projections. Asi & Duchi (2019) study a more general method (AProx) that includes SPPM as a special case, giving stability and convergence rates under convexity. Asi et al. (2020) and Chadha et al. (2022) further consider minibatching and convergence under interpolation for AProx. In the context of federated learning for non-convex optimization, SPPM is also known as FedProx (Li et al., 2020b) and has been analyzed in several settings, see e.g. (Yuan & Li, 2022). Unfortunately, in the non-convex setting the convergence rates achieved by FedProx/SPPM are no better than those of SGD. SPPM can be applied to more than just federated learning and has found applications in matrix and tensor completion (Bumin & Huang, 2021) and reinforcement learning (Asadi et al., 2021). It can also be implemented efficiently for various optimization problems (Shtoff, 2022).

Compression for communication efficiency.
Client sampling is one way of achieving communication efficiency, but there are other ways to do that in federated learning, such as compressing the vectors exchanged between the server and the clients. Szlendak et al. (2022) ; Beznosikov & Gasnikov (2022) consider federated optimization with compression under Assumption 1 and obtain better convergence rates using specially-crafted compression operators. Because these techniques are orthogonal, exploring combinations of client sampling and compression may be a promising avenue for future work.

3. PRELIMINARIES

We say that a differentiable function $f$ is $\mu$-strongly convex (for $\mu > 0$) if for all $x, y \in \mathbb{R}^d$ we have

$$f(x) \ge f(y) + \langle \nabla f(y), x - y \rangle + \frac{\mu}{2} \|x - y\|^2.$$

We assume:

Assumption 2. All the functions $f_1, f_2, \ldots, f_M$ in Problem (1) are $\mu$-strongly convex. We also assume Problem (1) has a solution $x_* \in \mathbb{R}^d$.

The assumption that every $f_m$ is strongly convex is common in the analysis of federated learning algorithms (Karimireddy et al., 2020b; Mishchenko et al., 2022b) and is often realized when each $f_m$ represents a convex empirical loss with $\ell_2$-regularization. We assume that $f$ has a minimizer $x_*$, and by strong convexity this minimizer is unique.

The proximal mapping associated with a function $h$ and stepsize $\eta > 0$ is defined as

$$\mathrm{prox}_{\eta h}(x) = \arg\min_{y \in \mathbb{R}^d} \left\{ \eta h(y) + \frac{1}{2} \|y - x\|^2 \right\}.$$

When $h$ is convex, the minimization problem has a unique solution and hence the proximal operator is well-defined. We say that a point $y \in \mathbb{R}^d$ is a $b$-approximation of the proximal operator evaluated at $x$ if $\|y - \mathrm{prox}_{\eta h}(x)\|^2 \le b$. When $h = f_m$ for some $m \in [M]$, computing the proximal operator is equivalent to solving a local optimization problem on node $m$.
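As an illustration of the proximal mapping and of the $b$-approximation criterion, the following minimal sketch evaluates the proximal operator in closed form for a quadratic $h(y) = \frac{1}{2} y^\top A y - c^\top y$. The quadratic form, the helper names `prox_quadratic` and `is_b_approximation`, and the random test data are assumptions made for this example and do not come from the paper.

```python
import numpy as np

def prox_quadratic(A, c, x, eta):
    """Exact prox_{eta*h}(x) for the (assumed) quadratic h(y) = 0.5*y^T A y - c^T y."""
    d = x.shape[0]
    # First-order optimality of eta*h(y) + 0.5*||y - x||^2:
    #   eta*(A y - c) + (y - x) = 0   <=>   (I + eta*A) y = x + eta*c.
    return np.linalg.solve(np.eye(d) + eta * A, x + eta * c)

def is_b_approximation(y, exact, b):
    """Check the criterion ||y - prox||^2 <= b from the text."""
    return float(np.sum((y - exact) ** 2)) <= b

rng = np.random.default_rng(0)
d = 5
Q = rng.standard_normal((d, d))
A = Q @ Q.T + 0.1 * np.eye(d)      # symmetric positive definite, so h is strongly convex
c = rng.standard_normal(d)
x = rng.standard_normal(d)
eta = 0.5

y_star = prox_quadratic(A, c, x, eta)
# Residual of the optimality condition; it should vanish up to round-off.
residual = eta * (A @ y_star - c) + (y_star - x)
```

For a general $f_m$ no closed form is available, and the local solver only needs to return a point that passes the `is_b_approximation` check for the accuracy $b$ required by the analysis.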

4. ALGORITHMS AND THEORY

In this section we develop the main algorithms of our work for solving Problem 1 under Assumption 1, smoothness, and strong convexity. In the first subsection, we analyze the stochastic proximal point method and explore some of its desirable properties. Next, we augment the stochastic proximal point method with variance-reduction and develop SVRP, a novel algorithm that improves upon existing algorithms using client sampling. Finally, we use the Catalyst acceleration framework (Lin et al., 2015) to improve the convergence rate of SVRP. We give more details on how the algorithms are applied in a client-server setting in Appendix I.

4.1. BASICS: STOCHASTIC PROXIMAL POINT METHOD

The starting point of our investigation is the stochastic proximal point method (SPPM) (Algorithm 1), because the stochastic proximal point algorithm can achieve rates of convergence that are smoothness-independent and which rely only on the strong convexity of the minimization objectives (Asi & Duchi, 2019). This makes SPPM a much better starting point for developing new algorithms if our goal is to obtain convergence rates that depend on the dissimilarity constant $\delta$ instead of the (typically larger) smoothness constant $L$.

Algorithm 1: Stochastic Proximal Point Method (SPPM)
  Data: Stepsize $\eta$, initialization $x_0$, number of steps $K$, proximal solution accuracy $b$.
  for $k = 0, 1, 2, \ldots, K-1$ do
    Sample $\xi_k \sim \mathcal{D}$.
    Update with a $b$-approximation of the stochastic proximal point operator: $x_{k+1} \simeq \mathrm{prox}_{\eta f_{\xi_k}}(x_k)$.

The stochastic proximal point method can be applied to the general stochastic expectation problem, which has the following form:

$$\min_{x \in \mathbb{R}^d} \left[ f(x) = \mathbb{E}_{\xi \sim \mathcal{D}} \left[ f_\xi(x) \right] \right], \tag{2}$$

where each $f_\xi$ is $\mu$-strongly convex and differentiable. Observe that Problem (1) is a special case of (2) where $\mathcal{D}$ has finite support. We assume that $f$ has a (necessarily unique) minimizer $x_*$. In this formulation, sampling a new $\xi \sim \mathcal{D}$ corresponds to sampling a node/client in federated optimization, and then a proximal iteration corresponds to a local optimization problem to be solved on node $\xi$. The next theorem characterizes the convergence of SPPM in this setting:

Theorem 1. Suppose that $f_\xi$ is almost surely $\mu$-strongly convex, let $x_*$ be the minimizer of $f$, and define $\sigma_*^2 = \mathbb{E}_{\xi \sim \mathcal{D}} \|\nabla f_\xi(x_*)\|^2$. Suppose that for each $k$ we have that $x_{k+1}$ is a $b$-approximation of the proximal. Set $\eta = \frac{\mu \epsilon}{2 \sigma_*^2}$ and $b \le \frac{\epsilon}{4} \frac{(\eta\mu)^2}{(1 + \eta\mu)^2}$. Then $\mathbb{E}\|x_K - x_*\|^2 \le \epsilon$ after $K$ iterations:

$$K = \left( 1 + \frac{2 \sigma_*^2}{\mu^2 \epsilon} \right) \log \frac{4 \|x_0 - x_*\|^2}{\epsilon}. \tag{3}$$

The full proofs of all the theorems are relegated to the supplementary material.
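The loop of Algorithm 1 can be sketched on a toy problem where the proximal step is available in closed form. This is a minimal sketch under assumptions that are not in the paper: the per-sample losses are ridge-regularized squared errors, $\mathcal{D}$ is uniform over the rows of a synthetic matrix `Z`, the proximal step is exact ($b = 0$), and the stepsize is a hand-picked constant rather than the value prescribed by Theorem 1.

```python
import numpy as np

def sppm_ridge(Z, y, mu, eta, num_steps, rng):
    """SPPM with exact proximal steps on an assumed toy problem.

    Hypothetical losses: f_i(x) = 0.5*(z_i^T x - y_i)^2 + 0.5*mu*||x||^2,
    with D uniform over the n rows of Z.
    """
    n, d = Z.shape
    x = np.zeros(d)
    for _ in range(num_steps):
        i = rng.integers(n)                      # sample xi_k ~ D
        z = Z[i]
        # prox_{eta f_i}(x) solves eta*grad f_i(u) + (u - x) = 0, i.e.
        # ((1 + eta*mu) I + eta * z z^T) u = x + eta*y_i*z.
        H = (1.0 + eta * mu) * np.eye(d) + eta * np.outer(z, z)
        x = np.linalg.solve(H, x + eta * y[i] * z)
    return x

rng = np.random.default_rng(0)
n, d, mu = 200, 5, 0.1
Z = rng.standard_normal((n, d))
y = Z @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Minimizer of the average loss f, used only to measure the error.
x_star = np.linalg.solve(Z.T @ Z / n + mu * np.eye(d), Z.T @ y / n)
x_hat = sppm_ridge(Z, y, mu, eta=0.05, num_steps=2000, rng=rng)

err_init = float(np.sum(x_star ** 2))            # ||x_0 - x_*||^2 with x_0 = 0
err_final = float(np.sum((x_hat - x_star) ** 2))
```

With a constant stepsize the iterates settle in a neighborhood of $x_*$ whose size shrinks with $\eta$, which is why Theorem 1 ties $\eta$ to the target accuracy $\epsilon$.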
Theorem 1 for SPPM essentially gives the same rate as (Asi & Duchi, 2019, Proposition 5.3) but with a different proof technique. Our analysis relies on a straightforward application of the contractivity of the proximal, making it easier to extend to the variance-reduced case compared to the more involved analysis of (Asi & Duchi, 2019).

Comparison with SGD. The iteration complexity of SGD in the same setting is:

$$K_{\mathrm{SGD}} = \left( \frac{2L}{\mu} + \frac{2 \sigma_*^2}{\mu^2 \epsilon} \right) \log \frac{2 \|x_0 - x_*\|^2}{\epsilon}. \tag{4}$$

See (Needell et al., 2014; Gower et al., 2019) for a derivation of this iteration complexity and (Nguyen et al., 2019) for a matching lower bound (up to log factors and constants). Observe that while the dependence on the stochastic noise term is the same in both (3) and (4), the iteration complexity of SGD also has an additional dependence on the condition number $\kappa = L/\mu$, while the iteration complexity of the stochastic proximal point method is entirely independent of the magnitude of the smoothness constant $L$. Thus, we can obtain a faster convergence rate than SGD if we have access to stochastic proximal operator evaluations. In federated optimization, a stochastic proximal operator evaluation can be done entirely with local work and with no communication, and thus is relatively cheap. Indeed, the iteration complexity of SPPM can even beat accelerated SGD because acceleration only reduces the dependence on the condition number from $L/\mu$ to $\sqrt{L/\mu}$, whereas SPPM has no dependence on the condition number to begin with.

Communication vs computation complexities. Every iteration of SPPM involves two communication steps: the server sends the current iterate $x_k$ to node $\xi$, and then node $\xi$ sends $x_{k+1}$ back to the server. Thus the communication complexity of SPPM is the same as eq. (3) multiplied by two. Each node needs to solve the optimization problem $\min_{x \in \mathbb{R}^d} \left[ f_\xi(x) + \frac{1}{2\eta} \|x - x_k\|^2 \right]$ up to the accuracy $b$ given in Theorem 1.
If each $f_\xi$ is $L$-smooth and $\mu$-strongly convex, this is an $\left(L + \frac{1}{\eta}\right)$-smooth and $\left(\mu + \frac{1}{\eta}\right)$-strongly convex minimization problem, and thus can be solved to the desired precision $b$ in $O\left( \sqrt{\frac{\eta L + 1}{\eta \mu + 1}} \log \frac{1}{b} \right)$ local gradient accesses using accelerated gradient descent (Nesterov, 2018). When $\eta = \frac{\mu \epsilon}{4 \sigma_*^2}$, this corresponds to a per-iteration computational complexity of $O\left( \sqrt{\frac{\mu L \epsilon + 4 \sigma_*^2}{\mu^2 \epsilon + 4 \sigma_*^2}} \log \frac{1}{b} \right)$. Note that if $f_\xi$ itself represents the loss on a local dataset (as is common in federated learning), we may use methods tailored for stochastic or finite-sum problems such as Random Reshuffling (Mishchenko et al., 2020), Katyusha (Allen-Zhu, 2017), or (accelerated) SVRG. Thus we see that, compared to SGD, SPPM trades off a higher computational complexity for a lower communication complexity.

Related work. We compare against related convergence results for SPPM in Appendix E.

Algorithm 2: Stochastic Variance-Reduced Proximal Point (SVRP), steps at iteration $k$
  Set $g_k = \nabla f(w_k) - \nabla f_{m_k}(w_k)$.
  Compute a $b$-approximation of the stochastic proximal point operator associated with $f_{m_k}$:
    $x_{k+1} \simeq \mathrm{prox}_{\eta f_{m_k}}(x_k - \eta g_k)$.    (5)
  Sample $c_k \sim \mathrm{Bernoulli}(p)$ and update $w_{k+1} = x_{k+1}$ if $c_k = 1$, and $w_{k+1} = w_k$ if $c_k = 0$.

4.2. THE SVRP ALGORITHM

The rate of SPPM, while independent of the condition number $\kappa = L/\mu$, is sublinear. While a sublinear rate is optimal for the stochastic oracle (Foster et al., 2019; Woodworth & Srebro, 2021), it is suboptimal in the setting of smooth finite-sum minimization (Woodworth & Srebro, 2016). In this section, we develop a novel variance-reduced method, SVRP (Stochastic Variance-Reduced Proximal Method, Algorithm 2), that converges linearly and relies only on second-order similarity.

Variance-reduced methods such as SVRG (Johnson & Zhang, 2013), SAGA (Defazio et al., 2014) or SARAH (Nguyen et al., 2017) improve the convergence rate of SGD for finite-sum problems by constructing gradient estimators whose variance vanishes over time. While SGD is used as the building block in most existing variance-reduced methods, in the preceding section we saw that the stochastic proximal point method is more communication-efficient; it stands to reason that variance-reduced variants of SPPM could also be more communication-efficient under second-order similarity. We apply variance reduction to SPPM and develop our algorithm in the next two steps.

Step (a): Adapting SVRG-style variance reduction to SPPM. We use SVRG-style variance reduction coupled with the stochastic proximal point method as a base. To see how, we start with SGD iterations: at each step $k$ we sample node $m_k$ uniformly at random from $[M]$ and update as $x_{k+1} = x_k - \eta \nabla f_{m_k}(x_k)$. The main problem with SGD is that the stochastic gradient estimator has a non-vanishing variance that slows down convergence. SVRG (Johnson & Zhang, 2013) modifies SGD by adding a correction term $g_k$ at each iteration:

$$g_k = \nabla f(w_k) - \nabla f_{m_k}(w_k), \qquad x_{k+1} = x_k - \eta \left[ \nabla f_{m_k}(x_k) + g_k \right],$$

where $w_k$ is an anchor point that is periodically reset to the current iterate. The correction term $g_k$ reduces the variance of the gradient estimator and allows the algorithm to converge.
We propose to do the same for the stochastic proximal point method, where we instead change the argument to the proximal operator:

$$g_k = \nabla f(w_k) - \nabla f_{m_k}(w_k), \qquad x_{k+1} = \mathrm{prox}_{\eta f_{m_k}}(x_k - \eta g_k).$$

We can expect the correction term $g_k$ to function similarly and allow SPPM to converge faster.

Step (b): Removing the loop. Rather than reset the anchor point $w_k$ to the current iterate at fixed intervals, SVRP is instead loopless: it uses a random coin flip to determine when to communicate and re-compute full gradients. Loopless variants of variance-reduced algorithms such as L-SVRG (Kovalev et al., 2020) and Loopless SARAH (Li et al., 2020a) enjoy the same convergence guarantees as their ordinary counterparts, but with superior empirical performance and simpler analysis.

Combining the previous two steps gives us SVRP (Algorithm 2), and we give the convergence result next.

Theorem 2. (Convergence of SVRP). Suppose that Assumptions 1 and 2 hold, and that each $x_{k+1}$ is a $b$-approximation of the proximal (5). Let $\tau = \min\left\{ \frac{\eta\mu}{1 + 2\eta\mu}, \frac{p}{2} \right\}$. Set the parameters of Algorithm 2 as $\eta = \frac{\mu}{2\delta^2}$, $b \le \frac{\epsilon \tau (\eta\mu)^2}{2 (1 + \eta\mu)^3}$, and $p = \frac{1}{M}$. Then the final iterate $x_K$ satisfies $\mathbb{E}\|x_K - x_*\|^2 \le \epsilon$ provided that the total number of iterations $K$ is larger than $T_{\mathrm{iter}}$:

$$T_{\mathrm{iter}} = \tilde{O}\left( \left( M + \frac{\delta^2}{\mu^2} \right) \log \frac{1}{\epsilon} \right).$$

Communication complexity. We consider one communication to represent the server exchanging one vector with one client. At each step of SVRP, the server samples a client $m_k$ and sends them the current iterate $x_k$; the client then computes $g_k$ and $x_{k+1}$ locally, and sends $x_{k+1}$ back to the server. Then, with probability $p$, the server changes the anchor point $w_{k+1}$ to $x_{k+1}$ and sends $w_{k+1}$ to all the clients; each client $m$ then computes $\nabla f_m(w_{k+1})$ and sends it back to the server, which averages the received gradients to get $\nabla f(w_{k+1})$. The server then proceeds to send $\nabla f(w_{k+1})$ back to all the clients.
Thus the expected communication complexity is

$$\mathbb{E}[T_{\mathrm{comm}}] = (2 + 3pM)\, T_{\mathrm{iter}} = 5\, T_{\mathrm{iter}} = \tilde{O}\left( \left( M + \frac{\delta^2}{\mu^2} \right) \log \frac{1}{\epsilon} \right).$$

Compared to the SVRG communication complexity $\tilde{O}\left( \left( M + \frac{L}{\mu} \right) \log \frac{1}{\epsilon} \right)$ (Sebbouh et al., 2019), this replaces the $L/\mu$ dependence with a $\delta^2/\mu^2$ dependence. This is better when $\delta \le \sqrt{L\mu}$.

Comparison with existing results. The lower bound most closely related to our setting, given by Arjevani & Shamir (2015), only considers the setting of full participation (i.e. no client sampling) and corresponds to a communication complexity of $\tilde{O}\left( \sqrt{\delta/\mu}\, M \right)$. Our result improves upon this when $M > (\delta/\mu)^{3/2}$. Note that while our result attains a superior dependence on $M$, the dependence on the effective conditioning $\delta/\mu$ is worse. In the next section, we shall improve this via acceleration. Note that the best existing results for optimization under second-order similarity, such as DiSCO (Zhang & Lin, 2015), Accelerated SONATA (Tian et al., 2022), and Extragradient sliding (Kovalev et al., 2022), match this lower bound (Kovalev et al., 2022, Table 1). Therefore, our result shows that significantly better convergence can be obtained under second-order similarity when using client sampling.

Computational complexity. Similar to the discussion of the computational complexity of SPPM in the previous section, here too we essentially need to solve on each sampled device an optimization problem involving an $(L + \eta)$-smooth and $(\mu + \eta)$-strongly convex function up to the accuracy $b$. This can be done using accelerated gradient descent in $\tilde{O}\left( \sqrt{\frac{L + \eta}{\mu + \eta}} \log \frac{1}{b} \right) = \tilde{O}\left( \sqrt{\frac{L\delta^2/\mu + 1}{2\delta^2 + 1}} \log \frac{1}{b} \right)$ gradient accesses. Compared to SGD or SVRG, we can see that we trade off a higher local computational complexity for a lower number of communication steps to convergence.

Similar methods. Point-SAGA (Defazio, 2016) uses the stochastic proximal point method coupled with SAGA-style variance reduction.
Point-SAGA achieves optimal performance under smoothness and strong convexity, but it inherits the heavy memory requirements of SAGA and its performance under Assumption 1 is unknown. Another similar algorithm is SCAFFOLD (Karimireddy et al., 2020b): SCAFFOLD uses a similar SVRG-style correction sequence, and their method can be viewed as approximately solving a local unregularized minimization problem at each step. Unfortunately, their analysis under Assumption 1 only holds for quadratic objectives and without client sampling.
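The SVRP iteration and the communication accounting described above can be sketched on quadratic client objectives, where the proximal step reduces to a linear solve. This is a minimal sketch under assumptions that are not in the paper: the clients hold hypothetical quadratics $f_m(x) = \frac{1}{2} x^\top A_m x - c_m^\top x$ with synthetic data, proximal steps are exact, the server-side average is computed directly instead of over a network, and the cost of the initial full gradient is not counted.

```python
import numpy as np

def svrp_quadratic(A, c, eta, p, num_steps, rng):
    """SVRP sketch with exact prox steps on f_m(x) = 0.5 x^T A_m x - c_m^T x."""
    M, d, _ = A.shape
    A_bar, c_bar = A.mean(axis=0), c.mean(axis=0)
    x = np.zeros(d)
    w = x.copy()                              # anchor point w_k
    grad_f_w = A_bar @ w - c_bar              # full gradient at the anchor
    comms = 0                                 # vectors exchanged (initial gradient not counted)
    for _ in range(num_steps):
        m = rng.integers(M)                   # sample client m_k uniformly
        g = grad_f_w - (A[m] @ w - c[m])      # g_k = grad f(w_k) - grad f_m(w_k)
        # prox_{eta f_m}(x - eta*g): solve (I + eta*A_m) u = x - eta*g + eta*c_m
        x = np.linalg.solve(np.eye(d) + eta * A[m], x - eta * g + eta * c[m])
        comms += 2                            # send x_k, receive x_{k+1}
        if rng.random() < p:                  # coin flip c_k ~ Bernoulli(p)
            w = x.copy()
            grad_f_w = A_bar @ w - c_bar
            comms += 3 * M                    # broadcast w, collect gradients, broadcast average
    return x, comms

rng = np.random.default_rng(0)
M, d = 20, 5
E = 0.05 * rng.standard_normal((M, d, d))
E = (E + E.transpose(0, 2, 1)) / 2            # symmetric perturbations
E -= E.mean(axis=0)                           # so that A_m - mean(A) = E_m
A = np.diag(np.linspace(1.0, 10.0, d)) + E
c = rng.standard_normal((M, d))

mu = min(np.linalg.eigvalsh(A_m)[0] for A_m in A)
delta_sq = np.linalg.eigvalsh(np.einsum("mij,mik->jk", E, E) / M)[-1]
eta, p = mu / (2 * delta_sq), 1.0 / M         # parameter choices of Theorem 2
num_steps = 2000
x_hat, comms = svrp_quadratic(A, c, eta, p, num_steps, rng)

x_star = np.linalg.solve(A.mean(axis=0), c.mean(axis=0))
err = float(np.sum((x_hat - x_star) ** 2))
```

The counter `comms` grows by 2 at every step and by $3M$ at every anchor refresh, so with $p = 1/M$ its expectation is $5$ per iteration, matching the $(2 + 3pM)\,T_{\mathrm{iter}}$ accounting.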

4.3. ACCELERATING SVRP VIA CATALYST

In this section we improve SVRP by augmenting it with acceleration. Acceleration is an effective way to improve the convergence rate of first-order gradient methods, capable of achieving better convergence rates in deterministic (Nesterov, 2018), finite-sum (Allen-Zhu, 2017), and stochastic (Jain et al., 2018) settings. Catalyst (Lin et al., 2015) is a generic framework for accelerated first-order methods that we utilize to accelerate SVRP. Catalyst (Algorithm 3) is essentially an accelerated proximal point method that relies on a solver $\mathcal{A}$ to solve the proximal point iterations. We use SVRP (Algorithm 2) as the solver $\mathcal{A}$, and term the resulting algorithm Catalyzed SVRP. Catalyst gives us fast linear convergence provided that we solve certain regularized subproblems to a high accuracy. To do that, we run SVRP (as the method $\mathcal{A}$) with an appropriate set of parameters and for a fixed number of iterations. The details of our application of Catalyst are given in the supplementary material (Appendix G). The complexity of Catalyzed SVRP is given next.

Theorem 3. (Catalyzed SVRP convergence rate). For Catalyst (Algorithm 3) with SVRP (Algorithm 2) as the inner solver $\mathcal{A}$, suppose Assumptions 1 and 2 hold, and that $f$ is $L$-smooth. Then for a specific choice of the algorithm parameters, the expected communication complexity $\mathbb{E}[T_{\mathrm{comm}}]$ to reach an accuracy $\epsilon$, up to polylogarithmic factors and absolute constants, is

$$\mathbb{E}[T_{\mathrm{comm}}] = \tilde{O}\left( \left( M + \sqrt{\frac{\delta}{\mu}}\, M^{3/4} \right) \log \frac{1}{\epsilon} \right). \tag{6}$$

Improvement over SVRP. Compared to ordinary SVRP, observe that since $\sqrt{\delta/\mu}\, M^{3/4} \le M + (\delta/\mu)^2$, Theorem 3 gives a communication complexity that is always better than what Theorem 2 provides up to logarithmic factors. In particular, the rate of Catalyzed SVRP given by (6) is strictly better than the rate of vanilla SVRP when $\delta/\mu \le \sqrt{M}$.
Note that unlike the communication complexity given by Theorem 2, the rate given by Theorem 3 requires smoothness, but the smoothness constant $L$ only shows up inside a logarithmic factor, and hence does not show up in eq. (6) (as $\tilde{O}$ notation hides polylogarithmic factors).

Comparison with the smooth setting. When each function is $L$-smooth and $\mu$-strongly convex, accelerated variants of SVRG reach an $\epsilon$-accurate solution in $\tilde{O}\left( M + \sqrt{L/\mu}\, M^{1/2} \right)$ communication steps and $O(1)$ local work per node (Lin et al., 2015). The communication complexity Catalyzed SVRP achieves is better when $\delta \le L/\sqrt{M}$, at the cost of more local computation.

Improvement over prior work. The best existing algorithms for solving Problem (1) under Assumption 1 are Accelerated SONATA (Tian et al., 2022) and Extragradient sliding (Kovalev et al., 2022), both of which achieve a communication complexity of $\tilde{O}\left( \sqrt{\delta/\mu}\, M \right)$. The rate given by eq. (6) is better than this rate, achieving a smaller dependence on the number of nodes $M$. Arjevani & Shamir (2015) give a lower bound matching the rate $\tilde{O}\left( \sqrt{\delta/\mu}\, M \right)$: ignoring polylogarithmic factors, our result improves upon their lower bound through the usage of client sampling. This improvement is possible because Arjevani & Shamir (2015) use an oracle model that does not allow for client sampling. Thus, Catalyzed SVRP improves upon all existing methods for optimization under Assumption 1, smoothness, and strong convexity when $\delta$ is sufficiently small.

Computational complexity. Catalyzed SVRP uses SVRP as an inner solver, and hence inherits its computational requirements: at each iteration $k$, we have to sample a node $m_k$ and evaluate the proximal operator associated with its local objective $f_{m_k}$. However, we set the solution accuracy to $b = 0$ when applying Catalyst to SVRP, i.e. we require exact stochastic proximal operator evaluation; this is only done for convenience of analysis, and we believe the analysis can be extended to the case of nonzero but small $b$.
In practice, the proximal operation can be computed exactly in many cases, and otherwise we can compute the proximal operator to machine accuracy using accelerated gradient descent.
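To make the comparisons in this section concrete, the leading terms of the communication complexities quoted in the text can be evaluated for given problem constants. This is a minimal sketch: the function name `leading_terms`, the dictionary keys, and the sample values of $M$, $\delta$, $\mu$, and $L$ are illustrative assumptions, and constants and logarithmic factors are dropped, so the numbers only indicate orders of magnitude.

```python
import math

def leading_terms(M, delta, mu, L):
    """Leading terms of the communication complexities quoted in the text,
    ignoring absolute constants and logarithmic factors."""
    sim = delta / mu                  # effective conditioning delta/mu
    return {
        "full_participation_accelerated": math.sqrt(sim) * M,
        "svrp": M + sim ** 2,
        "catalyzed_svrp": M + math.sqrt(sim) * M ** 0.75,
        "svrg": M + L / mu,
        "accelerated_svrg": M + math.sqrt(L / mu) * math.sqrt(M),
    }

# Illustrative (assumed) constants; not results reported in the paper.
terms = leading_terms(M=1000, delta=10.0, mu=1.0, L=3330.0)
for name, value in sorted(terms.items(), key=lambda kv: kv[1]):
    print(f"{name:32s} {value:10.1f}")
```

Since $\sqrt{\delta/\mu}\, M^{3/4} \le M + (\delta/\mu)^2$, the Catalyzed SVRP term never exceeds twice the SVRP term, which is the sense in which Theorem 3 is never worse than Theorem 2 up to constants.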

5. EXPERIMENTS

We run linear regression with $\ell_2$ regularization, where the objective is $f(x) = \frac{1}{M} \sum_{m=1}^{M} f_m(x)$ and each client has a loss function of the form

$$f_m(x) = \frac{1}{n} \sum_{i=1}^{n} \left( z_{m,i}^\top x - y_{m,i} \right)^2 + \frac{\lambda}{2} \|x\|^2,$$

where $z_{m,i} \in \mathbb{R}^d$ and $y_{m,i} \in \mathbb{R}$ represent the feature vectors and labels respectively, for $m \in [M]$ and $i \in [n]$. We do two sets of experiments: in the first set, we generate the data vectors $z_{m,i}$ synthetically and force the second-order similarity assumption to hold with $\delta$ small relative to $L$, with $L \approx 3330$ and $\delta \approx 10$, and regularization constant $\lambda = 1$. We vary the number of clients $M$ as $M \in \{1000, 2000, 3000\}$. In the second set, we use the "a9a" dataset from LIBSVM (Chang & Lin, 2011); each client's data is constructed by sampling from the original training dataset with $n = 2000$ samples per client. We vary the number of clients $M$ as $M \in \{20, 40, 60\}$, measure $L \approx 6.33$ and $\delta \approx 0.22$, and set the regularization parameter as $\lambda = 0.1$. We simulate our results on a single machine, running each method for 10000 communication steps. Our results are given in Figure 1. We compare SVRP against SVRG, SCAFFOLD, and the Accelerated Extragradient algorithms, using the optimal theoretical stepsize for each algorithm. In all cases, SVRP achieves superior performance to existing algorithms for both the real and synthetic data experiments. This is more pronounced in the synthetic data experiments as $\delta$ is much smaller than $L$ and the number of agents $M$ is very large.

A BASIC FACTS AND NOTATION

We shall make use of the following facts from linear algebra: for any $a, b \in \mathbb{R}^d$ and any $\zeta > 0$,

$$2 \langle a, b \rangle = \|a\|^2 + \|b\|^2 - \|a - b\|^2,$$

$$\|a\|^2 \le (1 + \zeta) \|a - b\|^2 + \left( 1 + \zeta^{-1} \right) \|b\|^2.$$
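For completeness, both facts follow from expanding squared norms. The short derivation below is a standard argument added here for the reader and is not taken from the paper's appendix; it uses only the bilinearity of the inner product and Young's inequality $2\langle u, v \rangle \le \zeta \|u\|^2 + \zeta^{-1} \|v\|^2$.

```latex
% First fact: expand the squared norm of the difference and rearrange.
\|a - b\|^2 = \|a\|^2 - 2\langle a, b \rangle + \|b\|^2
\quad\Longrightarrow\quad
2\langle a, b \rangle = \|a\|^2 + \|b\|^2 - \|a - b\|^2 .

% Second fact: write a = (a - b) + b, expand, and apply Young's inequality
% with u = a - b and v = b.
\|a\|^2 = \|a - b\|^2 + 2\langle a - b, b \rangle + \|b\|^2
\le \|a - b\|^2 + \zeta \|a - b\|^2 + \zeta^{-1} \|b\|^2 + \|b\|^2
= (1 + \zeta)\|a - b\|^2 + (1 + \zeta^{-1})\|b\|^2 .
```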

B APPLICATIONS OF SECOND-ORDER SIMILARITY

In this section we give a few examples where Assumption 1 holds. These are known in the literature, and we collect them here for motivation.

Relation to smoothness. The standard assumption in analyzing federated optimization algorithms for Problem (1) is smoothness:

Definition 4. (Smoothness) We say that a differentiable function $f$ is $L$-smooth if for all $x, y \in \mathbb{R}^d$ we have $\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\|$.

Note that $L$-smoothness of each of $f_1, f_2, \ldots, f_M$ implies that Assumption 1 holds with $\delta = L$. Indeed, since $\nabla f$ is the average of the $\nabla f_m$, the variance decomposition gives for any $x, y \in \mathbb{R}^d$

$$\frac{1}{M}\sum_{m=1}^{M} \left\|\nabla f_m(x) - \nabla f(x) - [\nabla f_m(y) - \nabla f(y)]\right\|^2 = \frac{1}{M}\sum_{m=1}^{M} \|\nabla f_m(x) - \nabla f_m(y)\|^2 - \|\nabla f(x) - \nabla f(y)\|^2 \le \frac{1}{M}\sum_{m=1}^{M} \|\nabla f_m(x) - \nabla f_m(y)\|^2 \le L^2 \|x - y\|^2.$$

Thus, under smoothness, Assumption 1 holds. While typically we are interested in the case in which $\delta$ is much smaller than $L$, the fact that by default we have $\delta \le L$ means that we do not lose any generality by considering optimization under Assumption 1.

Relation to mean-squared smoothness. Observe that in the preceding proof we did not actually use that each $f_m$ is $L$-smooth, only that $\frac{1}{M}\sum_{m=1}^{M} \|\nabla f_m(x) - \nabla f_m(y)\|^2 \le L^2\|x - y\|^2$. This assumption is known as mean-squared smoothness in the literature, and this proof shows that it is also a special case of second-order similarity.

Statistical Learning. Suppose that each function $f_m$ corresponds to empirical risk minimization with data drawn according to a distribution $D_w$:

$$f_m(x) = \frac{1}{n}\sum_{i=1}^{n} \ell(x, z_{m,i}),$$

where $\ell$ is $L$-smooth and convex in its first argument and $z_{m,i}$ represents the $i$-th training point on node $m$. If all the training points are drawn i.i.d. from the same distribution, $z_{m,i} \sim Z$ for all $m \in [M]$ and $i \in [n]$, and the losses are quadratic, then Shamir et al. (2014) show that, with high probability, Assumption 1 holds with $\delta = \tilde{O}(L/\sqrt{n})$. This can be much smaller than $L$, especially when the number of data points per node $n$ is large. Zhang & Lin (2015) show that a similar concentration holds for non-quadratic minimization, albeit with extra dependence on the data dimensionality $d$. Hendrikx et al. (2020) remove the dependence on the data dimensionality for generalized linear models under some additional assumptions. While clients do not ordinarily share the same data distribution $D_w$ in federated optimization, clustering clients together such that each group of clients has similar data is a common strategy (Sattler et al., 2020; Ghosh et al., 2020), and as such we can apply algorithms designed for Assumption 1 to clusters of clients with similar data.

Other examples. Karimireddy et al. (2020b) observe that Assumption 1 holds with $\delta = 0$ when using objective perturbation as a differential privacy mechanism for empirical risk minimization, since objective perturbation relies on adding linear noise that does not affect the differences of gradients. Chayti & Karimireddy (2022) give more examples relevant to federated learning.

Hessian formulation. The way we wrote Assumption 1 in the main text is

$$\frac{1}{M}\sum_{m=1}^{M} \left\|\nabla f_m(x) - \nabla f(x) - [\nabla f_m(y) - \nabla f(y)]\right\|^2 \le \delta^2 \|x - y\|^2.$$

For twice-differentiable objectives, this is a consequence of the following inequality on the Hessians: for all $x \in \mathbb{R}^d$ we have

$$\frac{1}{M}\sum_{m=1}^{M} \left\|\nabla^2 f_m(x) - \nabla^2 f(x)\right\|_{\mathrm{op}}^2 \le \delta^2.$$

This motivates the name second-order similarity. To see why this is the case, write $z_\theta = \theta x + (1-\theta) y$ and observe that by Taylor's theorem (Duistermaat, 2004, Theorem 2.8.3), Jensen's inequality and the convexity of the squared norm, we have

$$\left\|\nabla f_m(x) - \nabla f(x) - [\nabla f_m(y) - \nabla f(y)]\right\|^2 = \left\|\int_0^1 \left[\nabla^2 f_m(z_\theta) - \nabla^2 f(z_\theta)\right](x - y)\, d\theta\right\|^2 \le \int_0^1 \left\|\left[\nabla^2 f_m(z_\theta) - \nabla^2 f(z_\theta)\right](x - y)\right\|^2 d\theta \le \int_0^1 \left\|\nabla^2 f_m(z_\theta) - \nabla^2 f(z_\theta)\right\|_{\mathrm{op}}^2 \|x - y\|^2\, d\theta.$$

Averaging with respect to $m$ gives

$$\frac{1}{M}\sum_{m=1}^{M} \left\|\nabla f_m(x) - \nabla f(x) - [\nabla f_m(y) - \nabla f(y)]\right\|^2 \le \int_0^1 \frac{1}{M}\sum_{m=1}^{M} \left\|\nabla^2 f_m(z_\theta) - \nabla^2 f(z_\theta)\right\|_{\mathrm{op}}^2 \|x - y\|^2\, d\theta \le \int_0^1 \delta^2 \|x - y\|^2\, d\theta = \delta^2 \|x - y\|^2.$$

In particular, if it holds that $\max_{m \in [M]} \sup_{x \in \mathbb{R}^d} \left\|\nabla^2 f_m(x) - \nabla^2 f(x)\right\|_{\mathrm{op}} \le \delta$ (a condition known as Hessian similarity), then Assumption 1 necessarily holds as well.
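The relations between $\delta$, Hessian similarity, and $L$ can be checked on a small example. The sketch below is our own illustration, not part of the original analysis: it assumes quadratics $f_m(x) = \frac{1}{2}\sum_j a_{m,j} x_j^2$ with diagonal Hessians, for which the smallest $\delta$ in Assumption 1 is available in closed form. The helper names and the random instance are hypothetical.

```python
import random

def similarity_constants(hessian_diags):
    """Constants for f_m(x) = 0.5 * sum_j a[m][j] * x_j**2 (diagonal Hessians).

    Returns (delta, hess_sim, L) where
      delta    : smallest constant for which Assumption 1 holds,
      hess_sim : max_m ||H_m - H||_op, the Hessian similarity constant,
      L        : largest smoothness constant among the f_m.
    """
    M, d = len(hessian_diags), len(hessian_diags[0])
    mean = [sum(a[j] for a in hessian_diags) / M for j in range(d)]
    # (1/M) sum_m (H_m - H)^2 is diagonal, so its top eigenvalue is a max over coordinates.
    delta_sq = max(sum((a[j] - mean[j]) ** 2 for a in hessian_diags) / M for j in range(d))
    hess_sim = max(abs(a[j] - mean[j]) for a in hessian_diags for j in range(d))
    L = max(max(a) for a in hessian_diags)
    return delta_sq ** 0.5, hess_sim, L

def assumption1_lhs(hessian_diags, x, y):
    """Left-hand side of Assumption 1 at the pair (x, y)."""
    M, d = len(hessian_diags), len(x)
    mean = [sum(a[j] for a in hessian_diags) / M for j in range(d)]
    total = 0.0
    for a in hessian_diags:
        # grad f_m(x) - grad f(x) - [grad f_m(y) - grad f(y)] = (a_m - mean) * (x - y), coordinate-wise
        total += sum(((a[j] - mean[j]) * (x[j] - y[j])) ** 2 for j in range(d))
    return total / M

rng = random.Random(0)
M, d = 50, 5
# Hypothetical instance: shared Hessian diag(1, ..., 5) plus small client-specific perturbations.
diags = [[(j + 1) + 0.05 * rng.uniform(-1, 1) for j in range(d)] for _ in range(M)]
delta, hess_sim, L = similarity_constants(diags)
```

On this instance `delta` is bounded by `hess_sim`, which is at most 0.1, while `L` is close to 5, which illustrates the regime $\delta \ll L$ discussed above.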

C ALGORITHM-INDEPENDENT RESULTS

This section collects all facts and propositions that are algorithm-independent.

C.1 FACTS ABOUT THE PROXIMAL OPERATOR

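In this section we derive two useful facts about the proximal operator. Both facts are relatively straightforward to derive.

Fact 1. Let $h : \mathbb{R}^d \to \mathbb{R}$ be a convex differentiable function and $\eta > 0$ be a stepsize. Then for all $x \in \mathbb{R}^d$,

$$\operatorname{prox}_{\eta h}(x + \eta \nabla h(x)) = x.$$

Proof. Evaluating the proximal operator is equivalent to solving

$$\operatorname{prox}_{\eta h}(z) = \arg\min_{y \in \mathbb{R}^d} \left\{ h(y) + \frac{1}{2\eta}\|y - z\|^2 \right\}.$$

This is a strongly convex minimization problem for any $\eta > 0$, hence the (necessarily unique) minimizer $y$ of this problem satisfies the first-order optimality condition $\nabla h(y) + \frac{1}{\eta}(y - z) = 0$. Now take $z = x + \eta\nabla h(x)$ and observe that $y = x$ satisfies this condition:

$$\nabla h(x) + \frac{1}{\eta}\bigl(x - (x + \eta \nabla h(x))\bigr) = \nabla h(x) + \frac{-\eta \nabla h(x)}{\eta} = 0.$$

It follows that $x = \operatorname{prox}_{\eta h}(x + \eta\nabla h(x))$. ■

Fact 2. (Tight contractivity of the proximal operator). If $h$ is $\mu$-strongly convex and differentiable, then for all $\eta > 0$ and for any $x, y \in \mathbb{R}^d$ we have

$$\left\|\operatorname{prox}_{\eta h}(x) - \operatorname{prox}_{\eta h}(y)\right\|^2 \le \frac{1}{(1 + \eta\mu)^2}\|x - y\|^2.$$

Proof. This lemma can be seen as a tighter version of (Mishchenko et al., 2022a, Lemma 5), though our proof technique is different. Note that $p(x) = \operatorname{prox}_{\eta h}(x)$ satisfies $\eta\nabla h(p(x)) + [p(x) - x] = 0$, or equivalently $p(x) = x - \eta\nabla h(p(x))$. Using this we have

$$\|p(x) - p(y)\|^2 = \left\|[x - \eta\nabla h(p(x))] - [y - \eta\nabla h(p(y))]\right\|^2 = \|x - y\|^2 + \eta^2\|\nabla h(p(x)) - \nabla h(p(y))\|^2 - 2\eta\langle x - y, \nabla h(p(x)) - \nabla h(p(y))\rangle. \tag{10}$$

Now note that

$$\langle x - y, \nabla h(p(x)) - \nabla h(p(y))\rangle = \langle p(x) + \eta\nabla h(p(x)) - [p(y) + \eta\nabla h(p(y))], \nabla h(p(x)) - \nabla h(p(y))\rangle = \langle p(x) - p(y), \nabla h(p(x)) - \nabla h(p(y))\rangle + \eta\|\nabla h(p(x)) - \nabla h(p(y))\|^2. \tag{11}$$

Combining eqs. (10) and (11) we get

$$\|p(x) - p(y)\|^2 = \|x - y\|^2 - \eta^2\|\nabla h(p(x)) - \nabla h(p(y))\|^2 - 2\eta\langle p(x) - p(y), \nabla h(p(x)) - \nabla h(p(y))\rangle. \tag{12}$$

Let $D_h(u, v) = h(u) - h(v) - \langle\nabla h(v), u - v\rangle$ be the Bregman divergence associated with $h$ at $u, v$. It is easy to show that

$$\langle u - v, \nabla h(u) - \nabla h(v)\rangle = D_h(u, v) + D_h(v, u).$$

This is a special case of the three-point identity (Chen & Teboulle, 1993, Lemma 3.1). Using this with $u = p(x)$ and $v = p(y)$ and plugging back into (12) we get

$$\|p(x) - p(y)\|^2 = \|x - y\|^2 - \eta^2\|\nabla h(p(x)) - \nabla h(p(y))\|^2 - 2\eta\left[D_h(p(x), p(y)) + D_h(p(y), p(x))\right].$$

Because $h$ is $\mu$-strongly convex, we have $D_h(p(y), p(x)) \ge \frac{\mu}{2}\|p(y) - p(x)\|^2$ and $D_h(p(x), p(y)) \ge \frac{\mu}{2}\|p(y) - p(x)\|^2$, hence

$$\|p(x) - p(y)\|^2 \le \|x - y\|^2 - \eta^2\|\nabla h(p(x)) - \nabla h(p(y))\|^2 - 2\eta\mu\|p(x) - p(y)\|^2. \tag{13}$$

Strong convexity also implies that for any two points $u, v$ we have $\|\nabla h(u) - \nabla h(v)\|^2 \ge \mu^2\|u - v\|^2$; see (Nesterov, 2018, Theorem 2.1.10) for a proof. Using this in eq. (13) with $u = p(x)$ and $v = p(y)$ yields

$$\|p(x) - p(y)\|^2 \le \|x - y\|^2 - \eta^2\mu^2\|p(x) - p(y)\|^2 - 2\eta\mu\|p(x) - p(y)\|^2.$$

Rearranging gives $\left[1 + \eta^2\mu^2 + 2\eta\mu\right]\|p(x) - p(y)\|^2 \le \|x - y\|^2$. It remains to notice that $(1 + \eta\mu)^2 = 1 + \eta^2\mu^2 + 2\eta\mu$. ■

C.2 A LEMMA FOR SOLVING RECURRENCES

Lemma 1. Suppose that we have a sequence of positive values $(r_k)_{k=0}^{K-1}$ satisfying, for some $\theta > 0$ and some $c > 0$,

$$r_{k+1} \le \frac{1}{1+\theta}\left[r_k + c\right].$$

Then the sequence satisfies

$$r_K \le \frac{1}{(1+\theta)^K}\, r_0 + \min\left\{\frac{K}{1+\theta}, \frac{1}{\theta}\right\} c.$$

Proof. We start from the recurrence to get

$$r_{k+1} \le \frac{1}{1+\theta}\, r_k + \frac{c}{1+\theta} \le \frac{1}{1+\theta}\left[\frac{1}{1+\theta}\, r_{k-1} + \frac{c}{1+\theta}\right] + \frac{c}{1+\theta} = \frac{1}{(1+\theta)^2}\, r_{k-1} + \frac{c}{(1+\theta)^2} + \frac{c}{1+\theta}.$$

Continuing similarly we obtain

$$r_K \le \frac{1}{(1+\theta)^K}\, r_0 + \frac{c}{1+\theta}\sum_{j=0}^{K-1}\left(\frac{1}{1+\theta}\right)^j. \tag{14}$$

We can now bound the latter sum in two ways. The first is to note that since $1 + \theta > 1$ we have $\frac{1}{1+\theta} < 1$, and hence

$$\sum_{j=0}^{K-1}\left(\frac{1}{1+\theta}\right)^j \le \sum_{j=0}^{K-1} 1^j = K. \tag{15}$$

The second way is to use the convergence of the geometric series:

$$\sum_{j=0}^{K-1}\left(\frac{1}{1+\theta}\right)^j \le \sum_{j=0}^{\infty}\left(\frac{1}{1+\theta}\right)^j = \frac{1}{1 - \frac{1}{1+\theta}} = \frac{1+\theta}{\theta}. \tag{16}$$

Using eqs. (15) and (16) in (14) gives

$$r_K \le \frac{1}{(1+\theta)^K}\, r_0 + \frac{c}{1+\theta}\min\left\{K, \frac{1+\theta}{\theta}\right\} = \frac{1}{(1+\theta)^K}\, r_0 + c\cdot\min\left\{\frac{K}{1+\theta}, \frac{1}{\theta}\right\},$$

and this is the lemma's statement. ■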
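Facts 1 and 2 and Lemma 1 are easy to sanity-check numerically. The following is a minimal sketch under our own assumptions: a scalar test function $h(y) = \frac{\mu}{2}y^2 + \log(1 + e^y)$, which is $\mu$-strongly convex, and a proximal operator evaluated by bisection on the first-order optimality condition. All names (`grad_h`, `prox`, `lemma1_bound`, `worst_case_sequence`) are hypothetical.

```python
import math

MU = 0.5  # strong convexity constant of the test function h

def grad_h(y):
    """Derivative of h(y) = 0.5 * MU * y**2 + log(1 + exp(y))."""
    return MU * y + 1.0 / (1.0 + math.exp(-y))

def prox(z, eta, iters=200):
    """prox_{eta h}(z) via bisection on h'(y) + (y - z) / eta = 0 (increasing in y)."""
    lo, hi = -abs(z) - eta - 1.0, abs(z) + eta + 1.0  # the root lies in this bracket
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if grad_h(mid) + (mid - z) / eta > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def lemma1_bound(r0, theta, c, K):
    """Right-hand side of Lemma 1."""
    return r0 / (1.0 + theta) ** K + min(K / (1.0 + theta), 1.0 / theta) * c

def worst_case_sequence(r0, theta, c, K):
    """Iterate the recurrence of Lemma 1 with equality, its worst case."""
    r = r0
    for _ in range(K):
        r = (r + c) / (1.0 + theta)
    return r

eta, x, y = 0.7, 1.3, -2.1
fact1_residual = prox(x + eta * grad_h(x), eta) - x       # Fact 1: should vanish
contraction_lhs = (prox(x, eta) - prox(y, eta)) ** 2      # Fact 2: left-hand side
contraction_rhs = (x - y) ** 2 / (1.0 + eta * MU) ** 2    # Fact 2: right-hand side
```

The bisection is only one convenient way to evaluate the proximal operator in one dimension; any solver for the strongly convex subproblem would serve the same purpose.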

D PROOFS FOR SPPM (ALGORITHM 1)

Proof of Theorem 1. Using eq. (8) and our assumption that the proximal operators are solved up to the accuracy $b$ we have

$$\|x_{k+1} - x_*\|^2 = \left\|x_{k+1} - \operatorname{prox}_{\eta f_{\xi_k}}(x_k) + \operatorname{prox}_{\eta f_{\xi_k}}(x_k) - x_*\right\|^2 \le \left(1 + \frac{1}{\eta\mu}\right)\left\|x_{k+1} - \operatorname{prox}_{\eta f_{\xi_k}}(x_k)\right\|^2 + (1 + \eta\mu)\left\|\operatorname{prox}_{\eta f_{\xi_k}}(x_k) - x_*\right\|^2 \le \frac{1 + \eta\mu}{\eta\mu}\, b + (1 + \eta\mu)\left\|\operatorname{prox}_{\eta f_{\xi_k}}(x_k) - x_*\right\|^2. \tag{17}$$

For the second term in eq. (17) we have by Fact 1 that $x_* = \operatorname{prox}_{\eta f_{\xi_k}}(x_* + \eta\nabla f_{\xi_k}(x_*))$. Then using the contraction of the prox (Fact 2) we get

$$\left\|\operatorname{prox}_{\eta f_{\xi_k}}(x_k) - x_*\right\|^2 = \left\|\operatorname{prox}_{\eta f_{\xi_k}}(x_k) - \operatorname{prox}_{\eta f_{\xi_k}}(x_* + \eta\nabla f_{\xi_k}(x_*))\right\|^2 \le \frac{1}{(1 + \eta\mu)^2}\left\|x_k - (x_* + \eta\nabla f_{\xi_k}(x_*))\right\|^2. \tag{18}$$

Expanding out the square we have

$$\left\|\operatorname{prox}_{\eta f_{\xi_k}}(x_k) - x_*\right\|^2 \le \frac{1}{(1 + \eta\mu)^2}\left\|x_k - x_* - \eta\nabla f_{\xi_k}(x_*)\right\|^2 = \frac{1}{(1 + \eta\mu)^2}\left[\|x_k - x_*\|^2 + \eta^2\|\nabla f_{\xi_k}(x_*)\|^2 - 2\eta\langle x_k - x_*, \nabla f_{\xi_k}(x_*)\rangle\right].$$

Denote expectation conditional on $x_k$ by $\mathbb{E}_k[\cdot]$. Taking expectation conditional on $x_k$ we get

$$\mathbb{E}_k\left\|\operatorname{prox}_{\eta f_{\xi_k}}(x_k) - x_*\right\|^2 \le \frac{1}{(1 + \eta\mu)^2}\left[\|x_k - x_*\|^2 + \eta^2\,\mathbb{E}_k\|\nabla f_{\xi_k}(x_*)\|^2 - 2\eta\langle x_k - x_*, \mathbb{E}_k[\nabla f_{\xi_k}(x_*)]\rangle\right] = \frac{1}{(1 + \eta\mu)^2}\left[\|x_k - x_*\|^2 + \eta^2\sigma_*^2\right],$$

where we used that $\mathbb{E}[\nabla f_{\xi_k}(x_*)] = \nabla f(x_*) = 0$ and the definition of $\sigma_*^2$. Taking conditional expectation in eq. (17) and plugging in the last line gives

$$\mathbb{E}_k\|x_{k+1} - x_*\|^2 \le \frac{1 + \eta\mu}{\eta\mu}\, b + \frac{1 + \eta\mu}{(1 + \eta\mu)^2}\left[\|x_k - x_*\|^2 + \eta^2\sigma_*^2\right] = \frac{1}{1 + \eta\mu}\left[\|x_k - x_*\|^2 + \eta^2\sigma_*^2 + \frac{(1 + \eta\mu)^2}{\eta\mu}\, b\right].$$

Taking unconditional expectation gives

$$\mathbb{E}\|x_{k+1} - x_*\|^2 \le \frac{1}{1 + \eta\mu}\left[\mathbb{E}\|x_k - x_*\|^2 + \eta^2\sigma_*^2 + \frac{(1 + \eta\mu)^2}{\eta\mu}\, b\right].$$

We can then use Lemma 1 (with $\theta = \eta\mu$) to get that at step $K$ we have

$$\mathbb{E}\|x_K - x_*\|^2 \le \left(\frac{1}{1 + \eta\mu}\right)^K\|x_0 - x_*\|^2 + \frac{\eta\sigma_*^2}{\mu} + \frac{(1 + \eta\mu)^2}{(\eta\mu)^2}\, b. \tag{19}$$

This proves the first statement of the theorem. For the second statement, observe that when $\eta = \frac{\mu\epsilon}{2\sigma_*^2}$ and $b \le \frac{\epsilon(\eta\mu)^2}{4(1 + \eta\mu)^2}$ we have

$$\frac{\eta\sigma_*^2}{\mu} + \frac{(1 + \eta\mu)^2}{(\eta\mu)^2}\, b \le \frac{\epsilon}{2} + \frac{\epsilon}{4} = \frac{3\epsilon}{4}. \tag{20}$$

Moreover, using the inequality $1 - x \le \exp(-x)$ for all $x > 0$ and our choice of $K$, we have

$$\left(\frac{1}{1 + \eta\mu}\right)^K = \left(1 - \frac{\eta\mu}{1 + \eta\mu}\right)^K \le \exp\left(-\frac{\eta\mu K}{1 + \eta\mu}\right) \le \frac{\epsilon}{4}\cdot\frac{1}{\|x_0 - x_*\|^2}. \tag{21}$$

Plugging eqs. (20) and (21) into eq. (19) yields $\mathbb{E}\|x_K - x_*\|^2 \le \frac{\epsilon}{4} + \frac{3\epsilon}{4} = \epsilon$, and this is the second statement of the theorem. ■

E RELATED WORK FOR SPPM

Toulis et al. (2015) study SPPM under the assumption that each stochastic proximal step has an unbiased error of bounded variance relative to the full proximal step, i.e. for $\epsilon_\xi(x) = \frac{1}{\eta}\left[\operatorname{prox}_{\eta f_\xi}(x) - \operatorname{prox}_{\eta f}(x)\right]$ we have

$$\mathbb{E}[\epsilon_\xi] = 0, \quad\text{and}\quad \mathbb{E}\|\epsilon_\xi\|^2 \le \sigma^2. \tag{22}$$

Under this assumption, Toulis et al. (2015) prove that the stochastic proximal point method can achieve a rate that is completely independent of the smoothness constant, and which matches the optimal rate for eq. (2) for (potentially) nonsmooth objectives. Kim et al. (2022) show a similar result under the same condition for a momentum variant of SPPM. It is not clear how to satisfy (22) in practice: the next example shows that even in the simple setting of quadratic minimization, the errors are not unbiased.

Example 1. Take $f_1(x) = ax^2$, $f_2(x) = 2ax^2$, and $f = \frac{1}{2}(f_1 + f_2)$. Then for any $x \in \mathbb{R}$ and for $\xi$ drawn uniformly at random from $\{1, 2\}$ we have

$$\mathbb{E}\left[\operatorname{prox}_{\eta f_\xi}(x)\right] - \operatorname{prox}_{\eta f}(x) = \frac{(1 + 3\eta a)\,x}{1 + 6\eta a + 8\eta^2 a^2} - \frac{x}{3\eta a + 1} \ne 0.$$

Thus the errors are not unbiased. Moreover, it can be shown that the variance scales with $\|x\|^2$, and thus can be made very large.

In contrast, the convergence rate given by Theorem 1 does not require condition (22), and results in linear convergence for the functions of Example 1 (as both $f_1$ and $f_2$ share a minimizer at $x = 0$). We note that Ryu & Boyd (2014) also study the convergence of SPPM without a condition like (22), but their theory does not show convergence to an $\epsilon$-approximate solution even if the stepsize is taken proportional to $\epsilon$. Pătraşcu & Necoara (2017) also analyze the convergence of SPPM and present two different cases. If the stepsize is held constant (as in our Theorem 1), their theory shows only convergence to a neighborhood whose size cannot be made small by varying the stepsize. Alternatively, by using a decreasing stepsize they can show an $O\left(\frac{1}{\epsilon}\right)$ iteration complexity, but this complexity requires that each $f_\xi$ is smooth. As mentioned previously, Asi & Duchi (2019, Proposition 5.3) show the same $O\left(\frac{1}{\epsilon}\right)$ rate without requiring smoothness or a condition like (22), but using exact evaluations of the proximal operator. Theorem 1 gives the same iteration complexity without requiring smoothness or bounded variance, while allowing for approximate evaluations of the proximal operator. Note that the improvement of allowing approximate evaluations over (Asi & Duchi, 2019) is not very significant, as the approximation has to be very tight. The main way we depart from (Asi & Duchi, 2019) is that we use the contractivity of the proximal operator as the main proof tool, and this extends more easily to the variance-reduced setting of SVRP.
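Example 1 can be reproduced in a few lines. The sketch below is our own illustration with hypothetical helper names: it uses the closed form $\operatorname{prox}_{\eta c y^2}(x) = x / (1 + 2\eta c)$ to evaluate the bias of the stochastic proximal step, and then runs SPPM with exact proximal steps to show the linear convergence to the shared minimizer $x = 0$.

```python
import random

def prox_quadratic(x, eta, coeff):
    """prox of eta * (coeff * y**2) at x, i.e. argmin_y coeff*y**2 + (y - x)**2 / (2*eta)."""
    return x / (1.0 + 2.0 * eta * coeff)

def expected_bias(x, eta, a):
    """E[prox_{eta f_xi}(x)] - prox_{eta f}(x) for f_1 = a x^2, f_2 = 2 a x^2, f = (f_1 + f_2) / 2."""
    mean_stochastic = 0.5 * (prox_quadratic(x, eta, a) + prox_quadratic(x, eta, 2.0 * a))
    full = prox_quadratic(x, eta, 1.5 * a)
    return mean_stochastic - full

def sppm(x0, eta, a, K, seed=0):
    """SPPM with exact proximal steps on Example 1; xi is uniform on {1, 2}."""
    rng = random.Random(seed)
    x = x0
    for _ in range(K):
        coeff = a if rng.random() < 0.5 else 2.0 * a
        x = prox_quadratic(x, eta, coeff)
    return x

eta, a, x0 = 0.5, 1.0, 1.0
bias = expected_bias(x0, eta, a)
closed_form = (1 + 3 * eta * a) * x0 / (1 + 6 * eta * a + 8 * eta**2 * a**2) - x0 / (3 * eta * a + 1)
x_final = sppm(x0, eta, a, K=50)
```

Each step multiplies the iterate by either $1/(1 + 2\eta a)$ or $1/(1 + 4\eta a)$, so the error contracts by at least $1/(1 + \eta\mu)$ with $\mu = 2a$, in line with Fact 2, even though the bias is nonzero.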

F PROOFS FOR SVRP (ALGORITHM 2)

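Proof of Theorem 2. Let $\tilde{x}_{k+1} = \operatorname{prox}_{\eta f_{m_k}}(x_k - \eta g_k)$ denote the exact proximal point. Then by eq. (8) and our assumption that $\|x_{k+1} - \tilde{x}_{k+1}\|^2 \le b$ we have for any $a > 0$

$$\|x_{k+1} - x_*\|^2 = \|x_{k+1} - \tilde{x}_{k+1} + \tilde{x}_{k+1} - x_*\|^2 \le \left(1 + a^{-1}\right)\|x_{k+1} - \tilde{x}_{k+1}\|^2 + (1 + a)\|\tilde{x}_{k+1} - x_*\|^2 \le \left(1 + a^{-1}\right) b + (1 + a)\|\tilde{x}_{k+1} - x_*\|^2.$$

Plugging in $a = \frac{\eta^2\mu^2}{1 + 2\eta\mu}$ we get

$$\|x_{k+1} - x_*\|^2 \le \left(\frac{1 + \eta\mu}{\eta\mu}\right)^2 b + \frac{(1 + \eta\mu)^2}{1 + 2\eta\mu}\|\tilde{x}_{k+1} - x_*\|^2. \tag{23}$$

For the second term in eq. (23), we have by Fact 1 that $x_* = \operatorname{prox}_{\eta f_{m_k}}(x_* + \eta\nabla f_{m_k}(x_*))$. Then using Fact 2 we get

$$\|\tilde{x}_{k+1} - x_*\|^2 = \left\|\operatorname{prox}_{\eta f_{m_k}}(x_k - \eta g_k) - \operatorname{prox}_{\eta f_{m_k}}(x_* + \eta\nabla f_{m_k}(x_*))\right\|^2 \le \frac{1}{(1 + \eta\mu)^2}\left\|x_k - \eta g_k - (x_* + \eta\nabla f_{m_k}(x_*))\right\|^2.$$

Expanding out the square we have

$$\|\tilde{x}_{k+1} - x_*\|^2 \le \frac{1}{(1 + \eta\mu)^2}\left\|x_k - x_* - \eta(g_k + \nabla f_{m_k}(x_*))\right\|^2 = \frac{1}{(1 + \eta\mu)^2}\left[\|x_k - x_*\|^2 + \eta^2\|g_k + \nabla f_{m_k}(x_*)\|^2 - 2\eta\langle x_k - x_*, g_k + \nabla f_{m_k}(x_*)\rangle\right].$$

We denote by $\mathbb{E}_k[\cdot]$ the expectation conditional on all information up to (and including) the iterate $x_k$. Then

$$\mathbb{E}_k\|\tilde{x}_{k+1} - x_*\|^2 \le \frac{1}{(1 + \eta\mu)^2}\left[\|x_k - x_*\|^2 + \eta^2\,\mathbb{E}_k\|g_k + \nabla f_{m_k}(x_*)\|^2 - 2\eta\langle x_k - x_*, \mathbb{E}_k[g_k + \nabla f_{m_k}(x_*)]\rangle\right], \tag{24}$$

where in the last term the expectation went inside the inner product since the expectation is conditioned on knowledge of $x_k$, and the randomness in $m_k$ is independent of $x_k$. Note that this expectation can be computed as

$$\mathbb{E}_k[g_k + \nabla f_{m_k}(x_*)] = \mathbb{E}_k[\nabla f(w_k) - \nabla f_{m_k}(w_k) + \nabla f_{m_k}(x_*)] = \nabla f(w_k) - \nabla f(w_k) + \nabla f(x_*) = 0 + 0 = 0,$$

where we used that since $x_*$ minimizes $f$ we must have $\nabla f(x_*) = 0$. Plugging this into (24) gives

$$\mathbb{E}_k\|\tilde{x}_{k+1} - x_*\|^2 \le \frac{1}{(1 + \eta\mu)^2}\left[\|x_k - x_*\|^2 + \eta^2\,\mathbb{E}_k\|g_k + \nabla f_{m_k}(x_*)\|^2\right]. \tag{25}$$

For the second term, we can add a $\nabla f(x_*)$ term inside the norm (as it is zero) to get

$$\mathbb{E}_k\|g_k + \nabla f_{m_k}(x_*)\|^2 = \mathbb{E}_k\|g_k + \nabla f_{m_k}(x_*) - \nabla f(x_*)\|^2 = \mathbb{E}_k\left\|\nabla f(w_k) - \nabla f_{m_k}(w_k) - [\nabla f(x_*) - \nabla f_{m_k}(x_*)]\right\|^2 = \frac{1}{M}\sum_{m=1}^{M}\left\|\nabla f(w_k) - \nabla f_m(w_k) - [\nabla f(x_*) - \nabla f_m(x_*)]\right\|^2. \tag{26}$$

Using Assumption 1 with eq. (26) we have

$$\mathbb{E}_k\|g_k + \nabla f_{m_k}(x_*)\|^2 = \frac{1}{M}\sum_{m=1}^{M}\left\|\nabla f(w_k) - \nabla f_m(w_k) - [\nabla f(x_*) - \nabla f_m(x_*)]\right\|^2 \le \delta^2\|w_k - x_*\|^2.$$

Hence we can bound (25) as

$$\mathbb{E}_k\|\tilde{x}_{k+1} - x_*\|^2 \le \frac{1}{(1 + \eta\mu)^2}\left[\|x_k - x_*\|^2 + \eta^2\delta^2\|w_k - x_*\|^2\right]. \tag{27}$$

Taking conditional expectation in eq. (23) and plugging in the estimate of eq. (27) we get

$$\mathbb{E}_k\|x_{k+1} - x_*\|^2 \le \left(\frac{1 + \eta\mu}{\eta\mu}\right)^2 b + \frac{(1 + \eta\mu)^2}{1 + 2\eta\mu}\,\mathbb{E}_k\|\tilde{x}_{k+1} - x_*\|^2 \le \left(\frac{1 + \eta\mu}{\eta\mu}\right)^2 b + \frac{1}{1 + 2\eta\mu}\left[\|x_k - x_*\|^2 + \eta^2\delta^2\|w_k - x_*\|^2\right]. \tag{28}$$

Observe that by design we have

$$\mathbb{E}_k\|w_{k+1} - x_*\|^2 = p\cdot\mathbb{E}_k\|x_{k+1} - x_*\|^2 + (1 - p)\cdot\|w_k - x_*\|^2. \tag{29}$$

Let $\alpha = \frac{\eta\mu}{p}$, so that $1 + \alpha p = 1 + \eta\mu$. Then using eqs. (28) and (29) we have

$$\begin{aligned}
\mathbb{E}_k\|x_{k+1} - x_*\|^2 + \alpha\,\mathbb{E}_k\|w_{k+1} - x_*\|^2 &= (1 + \alpha p)\,\mathbb{E}_k\|x_{k+1} - x_*\|^2 + \alpha(1 - p)\|w_k - x_*\|^2 \\
&\le \frac{1 + \alpha p}{1 + 2\eta\mu}\left[\|x_k - x_*\|^2 + \eta^2\delta^2\|w_k - x_*\|^2\right] + \alpha(1 - p)\|w_k - x_*\|^2 + (1 + \alpha p)\left(\frac{1 + \eta\mu}{\eta\mu}\right)^2 b \\
&= \frac{1 + \alpha p}{1 + 2\eta\mu}\|x_k - x_*\|^2 + \alpha\left[1 - p + \frac{\eta^2\delta^2(1 + \alpha p)}{\alpha(1 + 2\eta\mu)}\right]\|w_k - x_*\|^2 + \frac{(1 + \eta\mu)^3}{(\eta\mu)^2}\, b \\
&= \frac{1 + \eta\mu}{1 + 2\eta\mu}\|x_k - x_*\|^2 + \alpha\left[1 - p + \frac{p\eta\delta^2}{\mu}\cdot\frac{1 + \eta\mu}{1 + 2\eta\mu}\right]\|w_k - x_*\|^2 + \frac{(1 + \eta\mu)^3}{(\eta\mu)^2}\, b.
\end{aligned} \tag{30}$$

Note that by the condition on the stepsize we have $\eta\delta^2/\mu \le \frac{1}{2}$, hence

$$\frac{\eta\delta^2}{\mu}\cdot\frac{1 + \eta\mu}{1 + 2\eta\mu} \le \frac{1}{2}\cdot\frac{1 + \eta\mu}{1 + 2\eta\mu} \le \frac{1}{2}\cdot 1 = \frac{1}{2}.$$

Using this in the second term of eq. (30) gives

$$\begin{aligned}
\mathbb{E}_k\|x_{k+1} - x_*\|^2 + \alpha\,\mathbb{E}_k\|w_{k+1} - x_*\|^2 &\le \frac{1 + \eta\mu}{1 + 2\eta\mu}\|x_k - x_*\|^2 + \alpha\left[1 - p + \frac{p}{2}\right]\|w_k - x_*\|^2 + \frac{(1 + \eta\mu)^3}{(\eta\mu)^2}\, b \\
&= \frac{1 + \eta\mu}{1 + 2\eta\mu}\|x_k - x_*\|^2 + \alpha\left(1 - \frac{p}{2}\right)\|w_k - x_*\|^2 + \frac{(1 + \eta\mu)^3}{(\eta\mu)^2}\, b \\
&\le \max\left\{\frac{1 + \eta\mu}{1 + 2\eta\mu}, 1 - \frac{p}{2}\right\}\left[\|x_k - x_*\|^2 + \alpha\|w_k - x_*\|^2\right] + \frac{(1 + \eta\mu)^3}{(\eta\mu)^2}\, b.
\end{aligned}$$

Define the Lyapunov function $V_k = \|x_k - x_*\|^2 + \frac{\eta\mu}{p}\|w_k - x_*\|^2$. Then the last inequality can simply be written as

$$\mathbb{E}_k[V_{k+1}] \le \max\left\{\frac{1 + \eta\mu}{1 + 2\eta\mu}, 1 - \frac{p}{2}\right\}\cdot V_k + \frac{(1 + \eta\mu)^3}{(\eta\mu)^2}\, b.$$

Taking unconditional expectation gives

$$\mathbb{E}[V_{k+1}] \le \max\left\{\frac{1 + \eta\mu}{1 + 2\eta\mu}, 1 - \frac{p}{2}\right\}\mathbb{E}[V_k] + \frac{(1 + \eta\mu)^3}{(\eta\mu)^2}\, b.$$

Let $\tau = \min\left\{\frac{\eta\mu}{1 + 2\eta\mu}, \frac{p}{2}\right\}$. Then $\max\left\{\frac{1 + \eta\mu}{1 + 2\eta\mu}, 1 - \frac{p}{2}\right\} = 1 - \tau$, and we get the simple recursion

$$\mathbb{E}[V_{k+1}] \le (1 - \tau)\,\mathbb{E}[V_k] + \frac{(1 + \eta\mu)^3}{(\eta\mu)^2}\, b.$$

Iterating this for $k$ steps and using the formula for the sum of the geometric series gives for any $k \le K$,

$$\mathbb{E}[V_k] \le (1 - \tau)^k\,\mathbb{E}[V_0] + \frac{(1 + \eta\mu)^3}{(\eta\mu)^2}\, b\sum_{t=0}^{k-1}(1 - \tau)^t \le (1 - \tau)^k\,\mathbb{E}[V_0] + \frac{(1 + \eta\mu)^3}{(\eta\mu)^2}\, b\sum_{t=0}^{\infty}(1 - \tau)^t = (1 - \tau)^k\,\mathbb{E}[V_0] + \frac{(1 + \eta\mu)^3}{(\eta\mu)^2\,\tau}\, b. \tag{31}$$

Now note that

$$\mathbb{E}\|x_k - x_*\|^2 \le \mathbb{E}[V_k], \tag{32}$$

and by initialization we have $w_0 = x_0$, hence

$$\mathbb{E}[V_0] = \|x_0 - x_*\|^2 + \frac{\eta\mu}{p}\|w_0 - x_*\|^2 = \left(1 + \frac{\eta\mu}{p}\right)\|x_0 - x_*\|^2. \tag{33}$$

Plugging eqs. (32) and (33) into eq. (31) gives for any $k \le K$,

$$\mathbb{E}\|x_k - x_*\|^2 \le \left(1 + \frac{\eta\mu}{p}\right)(1 - \tau)^k\|x_0 - x_*\|^2 + \frac{(1 + \eta\mu)^3}{(\eta\mu)^2\,\tau}\, b. \tag{34}$$

For the second statement of the theorem, observe that by the assumption on $b$ we can bound the right-hand side of eq. (34) as

$$\mathbb{E}\|x_k - x_*\|^2 \le \left(1 + \frac{\eta\mu}{p}\right)(1 - \tau)^k\|x_0 - x_*\|^2 + \frac{\epsilon}{2}.$$

Next, we use that for all $x \ge 0$ we have $1 - x \le \exp(-x)$, hence after $K$ iterations of SVRP we have

$$\mathbb{E}\|x_K - x_*\|^2 \le \left(1 + \frac{\eta\mu}{p}\right)\exp(-\tau K)\|x_0 - x_*\|^2 + \frac{\epsilon}{2}.$$

Thus if we run for the following number of iterations

$$K \ge \frac{1}{\tau}\log\left(\frac{2\|x_0 - x_*\|^2\left(1 + \frac{\eta\mu}{p}\right)}{\epsilon}\right), \tag{35}$$

we get that $\mathbb{E}\|x_K - x_*\|^2 \le \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon$. Choose $\eta = \frac{\mu}{2\delta^2}$ and $p = \frac{1}{M}$. Then $\tau = \frac{1}{2}\min\left\{\frac{1}{\delta^2/\mu^2 + 1}, \frac{1}{M}\right\}$ and eq. (35) reduces to

$$K \ge 2\max\left\{\frac{\delta^2}{\mu^2} + 1, M\right\}\log\left(\frac{2\|x_0 - x_*\|^2\left(1 + \frac{\mu^2 M}{2\delta^2}\right)}{\epsilon}\right).$$

This gives the second part of the theorem. ■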
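To make the recursion analyzed above concrete, here is a minimal simulation sketch of SVRP with exact proximal steps ($b = 0$). It is our own illustration under simplifying assumptions, not the experimental code of the paper: the clients hold scalar quadratics $f_m(x) = \frac{1}{2}a_m x^2 - b_m x$, for which the proximal operator, $\mu$, and the tightest $\delta$ have closed forms. The function name `svrp_scalar` and the instance are hypothetical.

```python
import random

def svrp_scalar(a, b, x0, K, seed=0):
    """SVRP with exact proximal steps on f_m(x) = 0.5*a[m]*x**2 - b[m]*x (scalar x).

    Uses eta = mu / (2 * delta**2) and p = 1/M, the parameter choice of Theorem 2.
    Returns (final iterate, minimizer of f).
    """
    rng = random.Random(seed)
    M = len(a)
    a_bar, b_bar = sum(a) / M, sum(b) / M
    mu = min(a)                                          # every f_m is mu-strongly convex
    delta_sq = sum((am - a_bar) ** 2 for am in a) / M    # tightest delta^2 in Assumption 1
    eta, p = mu / (2.0 * delta_sq), 1.0 / M
    x, w = x0, x0                                        # initialize w_0 = x_0
    for _ in range(K):
        m = rng.randrange(M)                             # sample m_k uniformly from [M]
        g = (a_bar * w - b_bar) - (a[m] * w - b[m])      # g_k = grad f(w_k) - grad f_m(w_k)
        z = x - eta * g
        x = (z + eta * b[m]) / (1.0 + eta * a[m])        # exact prox_{eta f_m}(z)
        if rng.random() < p:                             # c_k ~ Bernoulli(p)
            w = x
    return x, b_bar / a_bar

M = 10
a = [1.0 + 0.1 * (-1.0 + 2.0 * m / (M - 1)) for m in range(M)]  # curvatures in [0.9, 1.1]
b = [1.0 + 0.5 * m for m in range(M)]
x_final, x_star = svrp_scalar(a, b, x0=0.0, K=1000)
relative_error = (x_final - x_star) ** 2 / (0.0 - x_star) ** 2
```

Because the curvatures are close to each other, $\delta$ is small relative to $\mu$, the stepsize $\eta = \mu / (2\delta^2)$ is large, and the rate is governed by $\tau = p/2$, matching the regime where high similarity helps.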

G PROOFS FOR CATALYST+SVRP

The convergence rate of Catalyst is given by the following proposition:

Proposition 1. (Catalyst convergence rate). Run Catalyst (Algorithm 3) with smoothing parameter $\gamma$ for a $\mu$-strongly convex function $f$. Let $q = \frac{\mu}{\mu + \gamma}$ and choose $\rho \le \sqrt{q}$. Assume that at each timestep $t = 1, 2, \ldots, T$ the point $x_t$ from eq. (36) satisfies

$$\mathbb{E}\left[h_t(x_t) - \min_{x \in \mathbb{R}^d} h_t(x)\right] \le \epsilon_t \overset{\text{def}}{=} \frac{2}{9}\,(f(x_0) - f(x_*))\,(1 - \rho)^t. \tag{37}$$

Then the iterates generated by Algorithm 3 satisfy

$$\mathbb{E}[f(x_t) - f(x_*)] \le \frac{8}{(\sqrt{q} - \rho)^2}\,(1 - \rho)^{t+1}\,(f(x_0) - f(x_*)). \tag{38}$$

Proof. This is (Lin et al., 2015, Theorem 3.1). ■

We see that in order to apply Catalyst, we have to solve the proximal point subproblems of eq. (36) up to the accuracy given by eq. (37). However, the accuracy $\epsilon_t$ depends on the suboptimality gap $f(x_0) - f(x_*)$, which we do not have access to in general. The next proposition shows that this is not a problem for methods with linear convergence, and that we can instead run the method for a fixed number of iterations.

Algorithm 3: Catalyst
Data: Initialization $x_0$, smoothing parameter $\gamma$, algorithm $\mathcal{A}$, strong convexity constant $\mu$.
Initialize $y_0 = x_0$.
Initialize $q = \frac{\mu}{\mu + \gamma}$ and $\alpha_0 = \sqrt{q}$.
for $t = 1, 2, \ldots, T - 1$ do
  Find $x_t$ using $\mathcal{A}$ starting from the initialization $x_{t-1}$:
  $$x_t \approx \arg\min_{x \in \mathbb{R}^d}\left\{h_t(x) \overset{\text{def}}{=} f(x) + \frac{\gamma}{2}\|x - y_{t-1}\|^2\right\}. \tag{36}$$
  Update $\alpha_t \in (0, 1)$ as the solution of $\alpha_t^2 = (1 - \alpha_t)\,\alpha_{t-1}^2 + q\,\alpha_t$.
  Compute $y_t$ using an extrapolation step: $y_t = x_t + \beta_t(x_t - x_{t-1})$ with $\beta_t = \frac{\alpha_{t-1}(1 - \alpha_{t-1})}{\alpha_{t-1}^2 + \alpha_t}$.

Proposition 2. (Inner method complexity in Catalyst). Consider Algorithm 3 run with smoothing parameter $\gamma$ on a $\mu$-strongly convex function $f$, with $q = \frac{\mu}{\mu + \gamma}$ and $\rho \le \sqrt{q}$. Suppose that the method $\mathcal{A}$ generates iterates $(z_s)_{s \ge 0}$ such that

$$\mathbb{E}\left[h_t(z_s) - \min_{x \in \mathbb{R}^d} h_t(x)\right] \le A\,(1 - \tau_{\mathcal{A},h})^s\left(h_t(z_0) - \min_{x \in \mathbb{R}^d} h_t(x)\right). \tag{39}$$

Then the precision $\epsilon_t$ from eq. (37) is reached in expectation when the number of iterations $s$ of $\mathcal{A}$ exceeds $T_{\mathcal{A}}$, where

$$T_{\mathcal{A}} = \frac{1}{\tau_{\mathcal{A},h}}\log\left(A\cdot\left(\frac{2}{1 - \rho} + \frac{2592\,\gamma}{\mu(1 - \rho)^2(\sqrt{q} - \rho)^2}\right)\right).$$

Proof. This is (Lin et al., 2015, Proposition 3.2). ■

Thus applying Catalyst to accelerate SVRP reduces to verifying that the condition in (39) holds for SVRP. The next proposition shows that it does.

Proposition 3. Define $h_t$ as in eq. (36) with $\gamma + \mu \le \delta$, and let the objective $f : \mathbb{R}^d \to \mathbb{R}$ be a finite sum (as in eq. (1)) where each $f_m$ is $\mu$-strongly convex, $f$ is $L$-smooth, and Assumption 1 holds. Use SVRP (Algorithm 2) as the solver $\mathcal{A}$ to minimize $h_t$. Then the iterates $z_1, z_2, \ldots, z_s$ generated by SVRP with stepsize $\eta = \frac{\mu + \gamma}{2\delta^2}$, solution accuracy $b = 0$, and communication probability $p = \frac{1}{M}$ satisfy

$$\mathbb{E}\left[h_t(z_s) - \min_{x \in \mathbb{R}^d} h_t(x)\right] \le A\,(1 - \tau)^s\left(h_t(z_0) - \min_{x \in \mathbb{R}^d} h_t(x)\right),$$

where

$$A \overset{\text{def}}{=} \frac{L + \gamma}{\mu + \gamma}\left(1 + \frac{(\gamma + \mu)^2 M}{\delta^2}\right), \quad\text{and}\quad \tau \overset{\text{def}}{=} \frac{1}{2}\min\left\{\frac{1}{\frac{\delta^2}{(\gamma + \mu)^2} + 1}, \frac{1}{M}\right\}.$$

Proof. The function $h_t$ (from Algorithm 3) is defined in eq. (36) as $h_t(x) = f(x) + \frac{\gamma}{2}\|x - y_{t-1}\|^2$. By the finite-sum structure, we can write $h_t$ as

$$h_t(x) = \frac{1}{M}\sum_{m=1}^{M}\left[f_m(x) + \frac{\gamma}{2}\|x - y_{t-1}\|^2\right], \tag{40}$$

where each $f_m$ is $\mu$-strongly convex. For each $m \in [M]$, define $h_{t,m}(x) \overset{\text{def}}{=} f_m(x) + \frac{\gamma}{2}\|x - y_{t-1}\|^2$. Note that $h_{t,m}$ is $(\mu + \gamma)$-strongly convex. Moreover, by direct computation we have for any $x \in \mathbb{R}^d$ that

$$\nabla h_{t,m}(x) - \nabla h_t(x) = \nabla f_m(x) + \gamma(x - y_{t-1}) - \left(\nabla f(x) + \gamma(x - y_{t-1})\right) = \nabla f_m(x) - \nabla f(x).$$

Combining this with Assumption 1, we have for any $x, y \in \mathbb{R}^d$

$$\frac{1}{M}\sum_{m=1}^{M}\left\|\nabla h_{t,m}(x) - \nabla h_t(x) - [\nabla h_{t,m}(y) - \nabla h_t(y)]\right\|^2 = \frac{1}{M}\sum_{m=1}^{M}\left\|\nabla f_m(x) - \nabla f(x) - [\nabla f_m(y) - \nabla f(y)]\right\|^2 \le \delta^2\|x - y\|^2.$$

It follows that problem (40) satisfies the conditions of Theorem 2 on the convergence of SVRP. Specializing the theorem, we get for the iterates $z_1, z_2, \ldots, z_s$ generated by SVRP with stepsize $\eta = \frac{\mu + \gamma}{2\delta^2}$, solution accuracy $b = 0$, and communication probability $p = \frac{1}{M}$ that

$$\mathbb{E}\|z_s - z_*\|^2 \le \left(1 + \frac{(\mu + \gamma)^2 M}{2\delta^2}\right)(1 - \tau)^s\|z_0 - z_*\|^2, \tag{41}$$

where $\tau = \frac{1}{2}\min\left\{\frac{1}{\frac{\delta^2}{(\mu + \gamma)^2} + 1}, \frac{1}{M}\right\}$ and $z_* = \arg\min_{x \in \mathbb{R}^d} h_t(x)$. Note that because $f$ is $L$-smooth and $\mu$-strongly convex, $h_t$ is $(L + \gamma)$-smooth and $(\mu + \gamma)$-strongly convex, hence for any $x \in \mathbb{R}^d$ we have

$$h_t(x) - h_t(z_*) \ge \frac{\mu + \gamma}{2}\|x - z_*\|^2, \tag{42}$$

$$h_t(x) - h_t(z_*) \le \frac{L + \gamma}{2}\|x - z_*\|^2. \tag{43}$$

Using (42) with $x = z_0$ and (43) with $x = z_s$, and combining with (41) after taking expectations, we obtain

$$\mathbb{E}\left[h_t(z_s) - h_t(z_*)\right] \le \frac{L + \gamma}{2}\,\mathbb{E}\|z_s - z_*\|^2 \le \frac{L + \gamma}{2}\left(1 + \frac{(\mu + \gamma)^2 M}{2\delta^2}\right)(1 - \tau)^s\|z_0 - z_*\|^2 \le \frac{L + \gamma}{\mu + \gamma}\left(1 + \frac{(\mu + \gamma)^2 M}{2\delta^2}\right)(1 - \tau)^s\left(h_t(z_0) - h_t(z_*)\right),$$

where $\tau$ and $z_*$ are as above. Since $\frac{(\mu + \gamma)^2 M}{2\delta^2} \le \frac{(\mu + \gamma)^2 M}{\delta^2}$, the right-hand side is at most $A\,(1 - \tau)^s\left(h_t(z_0) - h_t(z_*)\right)$. ■

G.1 PROOF OF THEOREM 3

Proof. The proof of this theorem is a straightforward combination of Propositions 1, 2, and 3. We set $p = \frac{1}{M}$, $b = 0$, and shall set $\eta$ and $\gamma$ later. In Proposition 1, choose $\rho = \frac{\sqrt{q}}{2} = \frac{1}{2}\sqrt{\mu/(\mu + \gamma)}$. Then $(\sqrt{q} - \rho)^2 = \frac{q}{4}$ and the convergence guarantee in eq. (38) is

$$\mathbb{E}[f(x_t) - f(x_*)] \le \frac{32}{q}\left(1 - \frac{\sqrt{q}}{2}\right)^{t+1}(f(x_0) - f(x_*)) = \frac{32(\mu + \gamma)}{\mu}\left(1 - \frac{\sqrt{q}}{2}\right)^{t+1}(f(x_0) - f(x_*)) \le \frac{32(\mu + \gamma)}{\mu}\exp\left(-\frac{\sqrt{q}}{2}(t + 1)\right)(f(x_0) - f(x_*)),$$

where the last step uses $1 - x \le \exp(-x)$.
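The quantities that drive Catalyst with SVRP as the inner solver can be computed directly. The sketch below is our own illustration with hypothetical function names: it generates the sequences $\alpha_t$ and $\beta_t$ of Algorithm 3 by solving the quadratic equation for $\alpha_t$, and evaluates the inner rate $\tau$ of Proposition 3. The numerical values of $\mu$, $\delta$, $M$, and $\gamma$ are assumptions chosen only for illustration.

```python
import math

def catalyst_coefficients(mu, gamma, T):
    """Sequences (alpha_t, beta_t) of Algorithm 3, started at alpha_0 = sqrt(q)."""
    q = mu / (mu + gamma)
    alphas, betas = [math.sqrt(q)], []
    for _ in range(T):
        prev = alphas[-1]
        # alpha_t is the positive root of alpha^2 + (prev^2 - q) * alpha - prev^2 = 0,
        # which is the update alpha_t^2 = (1 - alpha_t) * prev^2 + q * alpha_t rearranged.
        c = prev ** 2 - q
        alpha = 0.5 * (-c + math.sqrt(c ** 2 + 4.0 * prev ** 2))
        betas.append(prev * (1.0 - prev) / (prev ** 2 + alpha))
        alphas.append(alpha)
    return alphas, betas

def svrp_inner_rate(mu, gamma, delta, M):
    """tau from Proposition 3 for SVRP applied to h_t (requires gamma + mu <= delta)."""
    if gamma + mu > delta:
        raise ValueError("Proposition 3 requires gamma + mu <= delta")
    return 0.5 * min(1.0 / (delta ** 2 / (gamma + mu) ** 2 + 1.0), 1.0 / M)

# Illustrative values (assumptions): gamma is chosen so that gamma + mu = delta / sqrt(M).
mu, delta, M = 0.01, 1.0, 100
gamma = delta / math.sqrt(M) - mu
alphas, betas = catalyst_coefficients(mu, gamma, T=5)
tau = svrp_inner_rate(mu, gamma, delta, M)
```

With the initialization $\alpha_0 = \sqrt{q}$ the recursion is at a fixed point, so $\alpha_t = \sqrt{q}$ and $\beta_t = \frac{1 - \sqrt{q}}{1 + \sqrt{q}}$ for every $t$, and the extrapolation weight is constant across outer iterations.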

H EXTENSION TO THE CONSTRAINED SETTING

We consider the constrained problem defined as follows: let $R$ be a convex constraint function with an easy-to-compute proximal operator (i.e. we can compute $\operatorname{prox}_R(\cdot)$ easily). Then the composite finite-sum minimization problem is

$$\min_{x \in \mathbb{R}^d}\left\{F(x) = f(x) + R(x) = \frac{1}{M}\sum_{m=1}^{M} f_m(x) + R(x)\right\}. \tag{46}$$

The constrained optimization problem $\min_{x \in C} f(x)$ can be reduced to problem (46) by letting $R = \delta_C$ be the indicator function of the closed convex set $C$, and the proximal operator associated with $R$ in this case reduces to the projection onto $C$ (Beck, 2017, Theorem 6.24). We shall use the following variant of SVRP (Algorithm 4) to solve this problem. At each iteration $k$:

Set $g_k = \nabla f(w_k) - \nabla f_{m_k}(w_k)$.
Compute a $b$-approximation of the stochastic proximal point operator associated with $f_{m_k}$:
$$x_{k+1} \simeq \operatorname{prox}_{\eta f_{m_k} + \eta R}(x_k - \eta g_k). \tag{47}$$
Sample $c_k \sim \mathrm{Bernoulli}(p)$ and update $w_{k+1} = x_{k+1}$ if $c_k = 1$, and $w_{k+1} = w_k$ if $c_k = 0$.

The following theorem gives the convergence rate of Algorithm 4:

Theorem 5. (Convergence of SVRP in the composite setting). Suppose that Assumption 1 holds and that each $f_m$ is $\mu$-strongly convex, and let $x_*$ be the minimizer of Problem (46). Suppose that each $x_{k+1}$ is a $b$-approximation of the proximal (47). Let $\tau = \min\left\{\frac{\eta\mu}{1 + 2\eta\mu}, \frac{p}{2}\right\}$. Set the parameters of Algorithm 4 as $\eta = \frac{\mu}{2\delta^2}$, $b \le \frac{\epsilon\tau(\eta\mu)^2}{2(1 + \eta\mu)^3}$, and $p = \frac{1}{M}$. Then the final iterate $x_K$ satisfies $\mathbb{E}\|x_K - x_*\|^2 \le \epsilon$ provided that the total number of iterations $K$ is larger than $T_{\mathrm{iter}}$:

$$T_{\mathrm{iter}} = \tilde{O}\left(\left(M + \frac{\delta^2}{\mu^2}\right)\log\frac{1}{\epsilon}\right).$$

We first discuss the computational and communication complexities incurred by Algorithm 4 and then give the proof of Theorem 5 afterwards.

Computational Complexity. Note that this algorithm requires evaluating the proximal operator in eq. (47). This is equivalent to solving the local optimization problem

$$\min_{x \in \mathbb{R}^d}\left\{f_{m_k}(x) + \frac{1}{2\eta}\|x - z_k\|^2 + R(x)\right\}, \quad\text{for } z_k = x_k - \eta g_k.$$
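When each $f_{m_k}$ is $L$-smooth and $\mu$-strongly convex, this is a composite convex optimization problem where the smooth part is $\left(L + \frac{1}{\eta}\right)$-smooth and $\left(\mu + \frac{1}{\eta}\right)$-strongly convex, and where the constraint $R$ has an easy-to-compute proximal operator. This can be solved by accelerated proximal gradient descent (Schmidt et al., 2011, Proposition 4) to any desired accuracy $b$ in

$$O\left(\sqrt{\frac{L + \frac{1}{\eta}}{\mu + \frac{1}{\eta}}}\,\log\frac{1}{b}\right)$$

gradient and $R$-proximal operator accesses.

Now observe that by first-order optimality of $x_*$ we have $0 \in \partial F(x_*) = \nabla f(x_*) + \partial R(x_*)$. It follows that $-\nabla f(x_*) \in \partial R(x_*)$, and thus we can apply Fact 3 to get that $x_* = \operatorname{prox}_{\eta f_{m_k} + \eta R}(x_* + \eta\nabla f_{m_k}(x_*) - \eta\nabla f(x_*))$. Using this in the second term in eq. (53) followed by Fact 4 we get, where $\tilde{x}_{k+1} = \operatorname{prox}_{\eta f_{m_k} + \eta R}(x_k - \eta g_k)$ denotes the exact proximal point,

$$\|\tilde{x}_{k+1} - x_*\|^2 = \left\|\operatorname{prox}_{\eta f_{m_k} + \eta R}(x_k - \eta g_k) - \operatorname{prox}_{\eta f_{m_k} + \eta R}(x_* + \eta\nabla f_{m_k}(x_*) - \eta\nabla f(x_*))\right\|^2 \le \frac{1}{(1 + \eta\mu)^2}\left\|x_k - \eta g_k - (x_* + \eta\nabla f_{m_k}(x_*) - \eta\nabla f(x_*))\right\|^2.$$

Expanding out the square we have

$$\|\tilde{x}_{k+1} - x_*\|^2 \le \frac{1}{(1 + \eta\mu)^2}\left\|x_k - x_* - \eta(g_k + \nabla f_{m_k}(x_*) - \nabla f(x_*))\right\|^2 = \frac{1}{(1 + \eta\mu)^2}\left[\|x_k - x_*\|^2 + \eta^2\|g_k + \nabla f_{m_k}(x_*) - \nabla f(x_*)\|^2 - 2\eta\langle x_k - x_*, g_k + \nabla f_{m_k}(x_*) - \nabla f(x_*)\rangle\right].$$

We denote by $\mathbb{E}_k[\cdot]$ the expectation conditional on all information up to (and including) the iterate $x_k$. Then

$$\mathbb{E}_k\|\tilde{x}_{k+1} - x_*\|^2 \le \frac{1}{(1 + \eta\mu)^2}\left[\|x_k - x_*\|^2 + \eta^2\,\mathbb{E}_k\|g_k + \nabla f_{m_k}(x_*) - \nabla f(x_*)\|^2 - 2\eta\langle x_k - x_*, \mathbb{E}_k[g_k + \nabla f_{m_k}(x_*) - \nabla f(x_*)]\rangle\right], \tag{54}$$

where in the last term the expectation went inside the inner product since the expectation is conditioned on knowledge of $x_k$, and the randomness in $m_k$ is independent of $x_k$. Note that this expectation can be computed as

$$\mathbb{E}_k[g_k + \nabla f_{m_k}(x_*) - \nabla f(x_*)] = \mathbb{E}_k[\nabla f(w_k) - \nabla f_{m_k}(w_k) + \nabla f_{m_k}(x_*) - \nabla f(x_*)] = \nabla f(w_k) - \nabla f(w_k) + \nabla f(x_*) - \nabla f(x_*) = 0 + 0 = 0.$$

Plugging this into (54) gives

$$\mathbb{E}_k\|\tilde{x}_{k+1} - x_*\|^2 \le \frac{1}{(1 + \eta\mu)^2}\left[\|x_k - x_*\|^2 + \eta^2\,\mathbb{E}_k\|g_k + \nabla f_{m_k}(x_*) - \nabla f(x_*)\|^2\right]. \tag{55}$$

For the second term, we have

$$\mathbb{E}_k\|g_k + \nabla f_{m_k}(x_*) - \nabla f(x_*)\|^2 = \mathbb{E}_k\left\|\nabla f(w_k) - \nabla f_{m_k}(w_k) - [\nabla f(x_*) - \nabla f_{m_k}(x_*)]\right\|^2 = \frac{1}{M}\sum_{m=1}^{M}\left\|\nabla f(w_k) - \nabla f_m(w_k) - [\nabla f(x_*) - \nabla f_m(x_*)]\right\|^2. \tag{56}$$

Using Assumption 1 with eq. (56) we have

$$\mathbb{E}_k\|g_k + \nabla f_{m_k}(x_*) - \nabla f(x_*)\|^2 = \frac{1}{M}\sum_{m=1}^{M}\left\|\nabla f(w_k) - \nabla f_m(w_k) - [\nabla f(x_*) - \nabla f_m(x_*)]\right\|^2 \le \delta^2\|w_k - x_*\|^2.$$

Hence we can bound (55) as

$$\mathbb{E}_k\|\tilde{x}_{k+1} - x_*\|^2 \le \frac{1}{(1 + \eta\mu)^2}\left[\|x_k - x_*\|^2 + \eta^2\delta^2\|w_k - x_*\|^2\right]. \tag{57}$$

Taking conditional expectation in eq. (53) and plugging in the estimate of eq. (57) we get

$$\mathbb{E}_k\|x_{k+1} - x_*\|^2 \le \left(\frac{1 + \eta\mu}{\eta\mu}\right)^2 b + \frac{(1 + \eta\mu)^2}{1 + 2\eta\mu}\,\mathbb{E}_k\|\tilde{x}_{k+1} - x_*\|^2 \le \left(\frac{1 + \eta\mu}{\eta\mu}\right)^2 b + \frac{1}{1 + 2\eta\mu}\left[\|x_k - x_*\|^2 + \eta^2\delta^2\|w_k - x_*\|^2\right]. \tag{58}$$

Observe that by design we have

$$\mathbb{E}_k\|w_{k+1} - x_*\|^2 = p\cdot\mathbb{E}_k\|x_{k+1} - x_*\|^2 + (1 - p)\cdot\|w_k - x_*\|^2. \tag{59}$$

Let $\alpha = \frac{\eta\mu}{p}$, so that $1 + \alpha p = 1 + \eta\mu$. Then using eqs. (58) and (59) we have

$$\begin{aligned}
\mathbb{E}_k\|x_{k+1} - x_*\|^2 + \alpha\,\mathbb{E}_k\|w_{k+1} - x_*\|^2 &= (1 + \alpha p)\,\mathbb{E}_k\|x_{k+1} - x_*\|^2 + \alpha(1 - p)\|w_k - x_*\|^2 \\
&\le \frac{1 + \alpha p}{1 + 2\eta\mu}\left[\|x_k - x_*\|^2 + \eta^2\delta^2\|w_k - x_*\|^2\right] + \alpha(1 - p)\|w_k - x_*\|^2 + (1 + \alpha p)\left(\frac{1 + \eta\mu}{\eta\mu}\right)^2 b \\
&= \frac{1 + \alpha p}{1 + 2\eta\mu}\|x_k - x_*\|^2 + \alpha\left[1 - p + \frac{\eta^2\delta^2(1 + \alpha p)}{\alpha(1 + 2\eta\mu)}\right]\|w_k - x_*\|^2 + \frac{(1 + \eta\mu)^3}{(\eta\mu)^2}\, b \\
&= \frac{1 + \eta\mu}{1 + 2\eta\mu}\|x_k - x_*\|^2 + \alpha\left[1 - p + \frac{p\eta\delta^2}{\mu}\cdot\frac{1 + \eta\mu}{1 + 2\eta\mu}\right]\|w_k - x_*\|^2 + \frac{(1 + \eta\mu)^3}{(\eta\mu)^2}\, b.
\end{aligned} \tag{60}$$

Note that by the condition on the stepsize we have $\eta\delta^2/\mu \le \frac{1}{2}$, hence

$$\frac{\eta\delta^2}{\mu}\cdot\frac{1 + \eta\mu}{1 + 2\eta\mu} \le \frac{1}{2}\cdot\frac{1 + \eta\mu}{1 + 2\eta\mu} \le \frac{1}{2}\cdot 1 = \frac{1}{2}.$$

Using this in the second term of eq. (60) gives

$$\begin{aligned}
\mathbb{E}_k\|x_{k+1} - x_*\|^2 + \alpha\,\mathbb{E}_k\|w_{k+1} - x_*\|^2 &\le \frac{1 + \eta\mu}{1 + 2\eta\mu}\|x_k - x_*\|^2 + \alpha\left[1 - p + \frac{p}{2}\right]\|w_k - x_*\|^2 + \frac{(1 + \eta\mu)^3}{(\eta\mu)^2}\, b \\
&= \frac{1 + \eta\mu}{1 + 2\eta\mu}\|x_k - x_*\|^2 + \alpha\left(1 - \frac{p}{2}\right)\|w_k - x_*\|^2 + \frac{(1 + \eta\mu)^3}{(\eta\mu)^2}\, b \\
&\le \max\left\{\frac{1 + \eta\mu}{1 + 2\eta\mu}, 1 - \frac{p}{2}\right\}\left[\|x_k - x_*\|^2 + \alpha\|w_k - x_*\|^2\right] + \frac{(1 + \eta\mu)^3}{(\eta\mu)^2}\, b.
\end{aligned}$$

Define the Lyapunov function $V_k = \|x_k - x_*\|^2 + \frac{\eta\mu}{p}\|w_k - x_*\|^2$. Then the last inequality can simply be written as

$$\mathbb{E}_k[V_{k+1}] \le \max\left\{\frac{1 + \eta\mu}{1 + 2\eta\mu}, 1 - \frac{p}{2}\right\}\cdot V_k + \frac{(1 + \eta\mu)^3}{(\eta\mu)^2}\, b.$$

Taking unconditional expectation gives

$$\mathbb{E}[V_{k+1}] \le \max\left\{\frac{1 + \eta\mu}{1 + 2\eta\mu}, 1 - \frac{p}{2}\right\}\mathbb{E}[V_k] + \frac{(1 + \eta\mu)^3}{(\eta\mu)^2}\, b.$$

Let $\tau = \min\left\{\frac{\eta\mu}{1 + 2\eta\mu}, \frac{p}{2}\right\}$. Then $\max\left\{\frac{1 + \eta\mu}{1 + 2\eta\mu}, 1 - \frac{p}{2}\right\} = 1 - \tau$, and we get the simple recursion

$$\mathbb{E}[V_{k+1}] \le (1 - \tau)\,\mathbb{E}[V_k] + \frac{(1 + \eta\mu)^3}{(\eta\mu)^2}\, b.$$

Iterating this for $k$ steps and using the formula for the sum of the geometric series gives for any $k \le K$,

$$\mathbb{E}[V_k] \le (1 - \tau)^k\,\mathbb{E}[V_0] + \frac{(1 + \eta\mu)^3}{(\eta\mu)^2}\, b\sum_{t=0}^{k-1}(1 - \tau)^t \le (1 - \tau)^k\,\mathbb{E}[V_0] + \frac{(1 + \eta\mu)^3}{(\eta\mu)^2}\, b\sum_{t=0}^{\infty}(1 - \tau)^t = (1 - \tau)^k\,\mathbb{E}[V_0] + \frac{(1 + \eta\mu)^3}{(\eta\mu)^2\,\tau}\, b. \tag{61}$$

Now note that

$$\mathbb{E}\|x_k - x_*\|^2 \le \mathbb{E}[V_k], \tag{62}$$

and by initialization we have $w_0 = x_0$, hence

$$\mathbb{E}[V_0] = \|x_0 - x_*\|^2 + \frac{\eta\mu}{p}\|w_0 - x_*\|^2 = \left(1 + \frac{\eta\mu}{p}\right)\|x_0 - x_*\|^2. \tag{63}$$

Plugging eqs. (62) and (63) into eq. (61) gives for any $k \le K$,

$$\mathbb{E}\|x_k - x_*\|^2 \le \left(1 + \frac{\eta\mu}{p}\right)(1 - \tau)^k\|x_0 - x_*\|^2 + \frac{(1 + \eta\mu)^3}{(\eta\mu)^2\,\tau}\, b. \tag{64}$$

For the second statement of the theorem, observe that by the assumption on $b$ we can bound the right-hand side of eq. (64) as

$$\mathbb{E}\|x_k - x_*\|^2 \le \left(1 + \frac{\eta\mu}{p}\right)(1 - \tau)^k\|x_0 - x_*\|^2 + \frac{\epsilon}{2}.$$

Next, we use that for all $x \ge 0$ we have $1 - x \le \exp(-x)$, hence after $K$ iterations of SVRP we have

$$\mathbb{E}\|x_K - x_*\|^2 \le \left(1 + \frac{\eta\mu}{p}\right)\exp(-\tau K)\|x_0 - x_*\|^2 + \frac{\epsilon}{2}.$$

Thus if we run for the following number of iterations

$$K \ge \frac{1}{\tau}\log\left(\frac{2\|x_0 - x_*\|^2\left(1 + \frac{\eta\mu}{p}\right)}{\epsilon}\right), \tag{65}$$

we get that $\mathbb{E}\|x_K - x_*\|^2 \le \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon$. Choose $\eta = \frac{\mu}{2\delta^2}$ and $p = \frac{1}{M}$. Then eq. (65) reduces to

$$K \ge 2\max\left\{\frac{\delta^2}{\mu^2} + 1, M\right\}\log\left(\frac{2\|x_0 - x_*\|^2\left(1 + \frac{\mu^2 M}{2\delta^2}\right)}{\epsilon}\right).$$

This gives the second part of the theorem. ■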

I CLIENT-SERVER FORMULATIONS OF THE ALGORITHMS

The algorithms we gave (Algorithm 1 and Algorithm 2) are written in notation that does not make clear the exact roles of the server and the clients in the process. Below, we rewrite the algorithms to make these roles clear, in Algorithm 5 and Algorithm 6. Note that both formulations are exactly equivalent, and this is just for clarity. Both algorithms rely on evaluating the proximal operator approximately using local data: we give an example solution method using gradient descent in Algorithm 7. By standard convergence results for gradient descent, Algorithm 7 halts in at most

Client $m$ evaluates the local proximal operator up to accuracy $b$ on its local data (e.g. using Algorithm 7): $x_{k+1} \simeq \operatorname{prox}_{\eta f_m}(x_k)$.
Client $m$ sends the updated model $x_{k+1}$ to the server.



CONCLUSION

In this paper, we develop two algorithms that utilize client sampling in order to reduce the amount of communication necessary for federated optimization under the second-order similarity assumption, with the faster of the two algorithms reducing the best-known communication complexity from $\tilde{O}\left(\sqrt{\frac{\delta}{\mu}}\, M\right)$ (Kovalev et al., 2022) to $\tilde{O}\left(\sqrt{\frac{\delta}{\mu}}\, M^{3/4} + M\right)$. Both algorithms utilize variance reduction on top of the stochastic proximal point algorithm, for which we also provide a new, simple, and smoothness-free analysis. In all cases, the algorithms trade off more local work for a reduced communication complexity. An important direction for future research is to investigate whether this tradeoff is necessary, and to derive lower bounds under the partial participation communication model and second-order similarity.



Stochastic Variance-Reduced Proximal Point (SVRP) Method Data: Stepsize η, initialization x 0 , number of steps K, communication probability p, local solution accuracy b. Initialize w 0 = x 0 . for k = 0, 1, 2, . . . , K -1 do Sample m k uniformly at random from [M ].

Figure 1: Top row: experiments on synthetic data. Left figure has M = 1000 clients, middle has M = 2000, and right has M = 3000. Bottom row: experiments on real data. Left has M = 20 clients, middle has M = 40 clients, and right has M = 60 clients. We plot the squared distance from the optimum point for all methods on a logarithmic scale versus the number of communication steps, where one exchange of vectors between the server and a single client counts as a communication step.


where ℓ is L-smooth and convex in its first argument and z m,i represents the i-th training point on node m. If all the training points are drawn i.i.d. from the same distribution z m,i ∼ Z for all m ∈ [M ] and i ∈ [n], then if the losses are quadratic,Shamir et al. (2014) show that, with high probability, Assumption 1 holds with δ = Õ L √

) Now note that ⟨x -y, ∇h(p(x)) -∇h(p(y))⟩ = ⟨p(x) + η∇h(p(x)) -[p(y) + η∇h(p(y))] , ∇h(p(x)) -∇h(p(y))⟩ = ⟨p(x) -p(y), ∇h(p(x)) -∇h(p(y))⟩ + η∥∇h(p(x)) -∇h(p(y))∥ 2 . (11) Combining eqs. (10) and (11) we get ∥p(x) -p(y)∥ 2 = ∥x -y∥ 2 + η 2 ∥∇h(p(x)) -∇h(p(y))∥ 2 -2η ⟨p(x) -p(y), ∇h(p(x)) -∇h(p(y))⟩ -2η 2 ∥∇h(p(x)) -∇h(p(y))∥ 2 = ∥x -y∥ 2 -η 2 ∥∇h(p(x)) -∇h(p(y))∥ 2 -2η ⟨p(x) -p(y), ∇h(p(x)) -∇h(p(y))⟩ . (

SVRP for composite optimization Data: Stepsize η, initialization x 0 , number of steps K, communication probability p, local solution accuracy b. Initialize w 0 = x 0 . for k = 0, 1, 2, . . . , K -1 do Sample m k uniformly at random from [M ].

discussed in the main text, we can also use accelerated gradient descent or other algorithms; this example is just for illustration.

Algorithm 5: Stochastic Proximal Point Method (SPPM) (Client-Server formulation)
Data: Stepsize η, initialization x_0, number of steps K, proximal solution accuracy b.
Server communicates stepsize η and desired proximal solution accuracy b to all clients.
for k = 0, 1, 2, . . . , K − 1 do
    Server samples client m ∈ [M].
    Server sends model x_k to client m.
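The client-server loop of SPPM can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: each client holds a hypothetical scalar quadratic f_m(z) = ½ a_m (z − c_m)², so the proximal point is available in closed form (that is, accuracy b = 0), and all names and constants are made up for the example.

```python
import random

def client_prox(x, a_m, c_m, eta):
    # Exact proximal point of eta * f_m at x for f_m(z) = 0.5 * a_m * (z - c_m)^2
    # (assumed local loss; closed form, so the accuracy parameter b is 0 here).
    return (x + eta * a_m * c_m) / (1.0 + eta * a_m)

def sppm(a, c, x0, eta, num_steps, seed=0):
    rng = random.Random(seed)
    x = x0
    for _ in range(num_steps):
        m = rng.randrange(len(a))             # server samples a client
        x = client_prox(x, a[m], c[m], eta)   # client returns its proximal point
    return x

data_rng = random.Random(1)
M = 10
a = [data_rng.uniform(1.0, 2.0) for _ in range(M)]   # local curvatures
c = [data_rng.uniform(-1.0, 1.0) for _ in range(M)]  # local minimizers
x_star = sum(ai * ci for ai, ci in zip(a, c)) / sum(a)  # minimizer of the average loss

x0 = 10.0
x_K = sppm(a, c, x0, eta=0.1, num_steps=200)
print(abs(x0 - x_star), abs(x_K - x_star))
```

With a constant stepsize the iterates settle in a neighborhood of the minimizer rather than converging exactly, since the local minimizers c_m differ across clients.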

Communication complexity of different methods for solving Problem 1 under µ-strong

REPRODUCIBILITY STATEMENT

All the theoretical results we derive are accompanied by full proofs in the appendices, and all the definitions and assumptions we make are clearly stated in the main text. We attach the code used to run the experiments as supplementary material to the paper.

where in the last line we used that $1 - x \le \exp(-x)$. Thus, in order to get an $\epsilon$-approximate solution, the number of iterations of Catalyst $T^{\mathrm{Catalyst}}_{\mathrm{iter}}$ should be

By Proposition 2, for a method $A$ satisfying eq. (39) we need $T_A$ inner loop iterations, where $T_A$ is defined as

where $\tau_{A,h}$ and $A$ are defined as in eq. (39). By Proposition 3 we have that for SVRP with $\gamma + \mu \le \delta$, stepsize $\eta = \frac{\mu + \gamma}{2\delta^2}$ and communication probability $p = \frac{1}{M}$, eq. (39) holds with

Thus the number of iterations of SVRP $T_A$ is

Thus the total number of SVRP iterations is

Let $\iota$ collect all the log factors, and use the looser estimate

which holds because $\gamma + \mu \le \delta$ in all cases. Thus we get an $\epsilon$-accurate solution if the total number of iterations is equal to or exceeds

Note that under this choice of $\gamma$ we have $\gamma + \mu = \frac{\delta}{\sqrt{M}} \le \delta$, and hence the precondition of Proposition 3 holds.

(b) Otherwise, choosing $\gamma = 0$ yields

Here we have $\gamma + \mu = \mu \le \delta$ by assumption, and hence the precondition of Proposition 3 holds, and our usage of it is justified. Thus we reach an $\epsilon$-accurate solution in both cases when the total number of iterations satisfies

Finally, it remains to notice that the expected number of communication steps (by the same reasoning as in Section 4.2) is $\mathbb{E}\left[T^{\mathrm{total}}_{\mathrm{comm}}\right] = (2 + 3pM)\, T^{\mathrm{total}}_{\mathrm{iter}} = 5\, T^{\mathrm{total}}_{\mathrm{iter}}$. ■

gradient and R-proximal operator accesses.

Communication complexity. The communication cost of Algorithm 4 is exactly the same as that of ordinary SVRP, as the iteration complexity of the method remains exactly the same. Thus, the discussion in Section 4.2 applies here. The communication cost is therefore of order $\tilde{O}\left(M + \frac{\delta^2}{\mu^2}\right)$ communications.

Catalyzed SVRP. Catalyst applies out-of-the-box to composite optimization (Lin et al., 2015), provided that we are able to solve the composite problems. As Theorem 5 shows, Algorithm 4 can do that. Therefore we can show that Catalyzed SVRP attains the same communication complexity in this setting as well. The proof for this rate is a straightforward extension of the proofs in Section G.
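The expected communication count (2 + 3pM) T used in the proofs can be checked with a small simulation. The sketch below assumes one reading of the SVRP protocol: every iteration costs 2 vector exchanges with the sampled client, and with probability p a synchronization costs 3M exchanges (broadcast of the new anchor, M returned gradients, broadcast of the averaged gradient). The function names and constants are hypothetical.

```python
import random

def expected_communications(num_iters, p, num_clients):
    # Closed-form expectation: (2 + 3 * p * M) * T.
    return (2 + 3 * p * num_clients) * num_iters

def simulate_communications(num_iters, p, num_clients, seed=0):
    # Count exchanges under the assumed accounting described above.
    rng = random.Random(seed)
    total = 0
    for _ in range(num_iters):
        total += 2                    # send x_k to one client, receive x_{k+1}
        if rng.random() < p:          # c_k ~ Bernoulli(p)
            total += 3 * num_clients  # full synchronization round
    return total

M = 50
T = 10_000
expected = expected_communications(T, 1.0 / M, M)
simulated = simulate_communications(T, 1.0 / M, M)
print(expected, simulated)
```

With p = 1/M the expected count is 5T regardless of M, which is why the number of clients enters the final bound only through the iteration complexity.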

H.1 PROOFS FOR THE COMPOSITE SETTING

Fact 3. Let $h$ be a convex function (not necessarily differentiable) and $\eta > 0$. Let $x \in \mathbb{R}^d$, and suppose that $g \in \partial h(x)$ (that is, $g$ is a subgradient of $h$ at $x$). Then we have

This is a strongly convex minimization problem for any $\eta > 0$; hence, by Fermat's optimality condition (Beck, 2017, Theorem 3.63), the (necessarily unique) minimizer of this problem satisfies the first-order optimality condition

Note that for any two proper convex functions $f_1, f_2$ on $\mathbb{R}^d$, the subdifferential set of their sum $\partial(f_1 + f_2)(x)$ is the sum of points in their respective subdifferential sets (Beck, 2017, Theorem 3.36), i.e.

Thus the optimality condition eq. (48) reduces to

Now observe that plugging $y = x$ and $z = x + \eta g$ and using the fact that $g \in \partial h(x)$ we have

(Tight contractivity of the proximal operator). If $h$ is $\mu$-strongly convex, then for all $\eta > 0$ and for any $x, y \in \mathbb{R}^d$ we have

Proof. This is the nonsmooth generalization of Fact 2. Note that $p(x) = \mathrm{prox}_{\eta h}(x)$ satisfies $\eta g_{px} + [p(x) - x] = 0$ for some $g_{px} \in \partial h(p(x))$, or equivalently $p(x) = x - \eta g_{px}$. Using this we have
\[
\|p(x) - p(y)\|^2 = \|x - y\|^2 + \eta^2 \|g_{px} - g_{py}\|^2 - 2\eta \langle x - y, g_{px} - g_{py} \rangle. \tag{49}
\]
Now note that
\[
\begin{aligned}
\langle x - y, g_{px} - g_{py} \rangle
&= \langle p(x) + \eta g_{px} - [p(y) + \eta g_{py}], g_{px} - g_{py} \rangle \\
&= \langle p(x) - p(y), g_{px} - g_{py} \rangle + \eta \|g_{px} - g_{py}\|^2.
\end{aligned}
\tag{50}
\]
Combining eqs. (49) and (50) we get
\[
\|p(x) - p(y)\|^2 = \|x - y\|^2 - \eta^2 \|g_{px} - g_{py}\|^2 - 2\eta \langle p(x) - p(y), g_{px} - g_{py} \rangle. \tag{51}
\]
Using this with $u = p(x)$, $v = p(y)$, $g_u = g_{px}$ and $g_v = g_{py}$ and plugging back into (51) we get

Note that because $h$ is strongly convex, we have by (Beck, 2017, Theorem 5.24 (ii)) that

Strong convexity implies that for any two points $u, v$ and subgradients $g_u \in \partial h(u)$, $g_v \in \partial h(v)$ we have $\langle g_u - g_v, u - v \rangle \ge \mu \|u - v\|^2$; see (Beck, 2017, Theorem 5.24 (iii)) for a proof. Using Cauchy-Schwarz yields
\[
\|g_u - g_v\| \, \|u - v\| \ge \mu \|u - v\|^2.
\]
Now if $u = v$, then trivially we have $\|g_u - g_v\| \ge \mu \|u - v\|$; otherwise, we divide both sides of the last inequality by $\|u - v\|$ to get the same bound. Thus in both cases we get
\[
\|g_u - g_v\| \ge \mu \|u - v\|.
\]
Using this in eq. (52) with $u = p(x)$, $v = p(y)$, $g_u = g_{px}$ and $g_v = g_{py}$ yields

Rearranging gives

It remains to notice that

Then by eq. (8) and our assumption that $x_{k+1}$ is a $b$-approximation of the exact proximal point (squared distance at most $b$), we have for any $a > 0$

Plugging in $a = \frac{\eta^2 \mu^2}{1 + 2\eta\mu}$ we get

    Client $m_k$ computes a $b$-approximation of the proximal point operator on its local data (e.g. using Algorithm 7):

    Client $m_k$ sends $x_{k+1}$ to the server.
    Server samples $c_k \sim \mathrm{Bernoulli}(p)$.
    if $c_k = 1$ then
        Server notifies clients of update, sets $w_{k+1} = x_{k+1}$ and sends it to all clients.
        Every client $m$ caches the new model $w_{k+1}$ instead of the old model $w_k$, then computes its local gradient $\nabla f_m(w_{k+1})$ and sends it back to the server.
        Server receives the local gradients and averages them: $\nabla f(w_{k+1}) = \frac{1}{M} \sum_{m=1}^{M} \nabla f_m(w_{k+1})$.
        Server sends $\nabla f(w_{k+1})$ to all clients to cache.
    else
        Clients keep their cached vector $w_k$ as it is, setting $w_{k+1} = w_k$ automatically.

    Update the model $y_{t+1} = y_t - \beta \Delta_t$.
    if the gradient norm is small, $\|\Delta_t\|^2 \le b \left(\mu + \frac{1}{\eta}\right)^2$, then
        Exit and return $y_t$ as it is a $b$-approximate solution, since by strong convexity we have
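The local gradient descent loop with its gradient-norm stopping rule can be sketched as follows. This is an illustration under assumptions, not the paper's Algorithm 7: the local loss is a hypothetical separable quadratic f_m(y) = ½ Σ a_i (y_i − c_i)², the local objective is f_m(y) + ‖y − x‖²/(2η), and the stepsize β = 1/(L + 1/η) with L = max a_i is a standard choice picked for the example.

```python
def approx_prox_gd(x, a, c, eta, b, max_iters=10_000):
    # Gradient descent on phi(y) = f_m(y) + ||y - x||^2 / (2 * eta),
    # with f_m(y) = 0.5 * sum(a_i * (y_i - c_i)^2) (assumed local loss).
    mu, L = min(a), max(a)
    beta = 1.0 / (L + 1.0 / eta)               # assumed stepsize
    threshold = b * (mu + 1.0 / eta) ** 2      # stopping rule from the text
    y = list(x)
    for _ in range(max_iters):
        delta = [ai * (yi - ci) + (yi - xi) / eta
                 for ai, yi, ci, xi in zip(a, y, c, x)]
        if sum(d * d for d in delta) <= threshold:
            break                              # y is a b-approximate solution
        y = [yi - beta * d for yi, d in zip(y, delta)]
    return y

def exact_prox(x, a, c, eta):
    # Closed-form minimizer of phi, used only to check the accuracy claim.
    return [(xi + eta * ai * ci) / (1.0 + eta * ai)
            for xi, ai, ci in zip(x, a, c)]

a, c, x = [1.0, 4.0], [2.0, -1.0], [0.0, 0.0]
eta, b = 0.5, 1e-8
y = approx_prox_gd(x, a, c, eta, b)
p = exact_prox(x, a, c, eta)
sq_err = sum((yi - pi) ** 2 for yi, pi in zip(y, p))
print(sq_err <= b)
```

The rule works because the local objective is (µ + 1/η)-strongly convex, so the squared distance to its minimizer is at most the squared gradient norm divided by (µ + 1/η)².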

