FASTER FEDERATED OPTIMIZATION UNDER SECOND-ORDER SIMILARITY

Abstract

Federated learning (FL) is a subfield of machine learning in which multiple clients collaboratively learn a model over a network under communication constraints. We consider finite-sum federated optimization under a second-order function similarity condition and strong convexity, and propose two new algorithms: SVRP and Catalyzed SVRP. This second-order similarity condition has grown popular recently, and is satisfied in many applications including distributed statistical learning and differentially private empirical risk minimization. The first algorithm, SVRP, combines approximate stochastic proximal point evaluations, client sampling, and variance reduction. We show that SVRP is communication-efficient and outperforms many existing algorithms when function similarity is sufficiently high. Our second algorithm, Catalyzed SVRP, is a Catalyst-accelerated variant of SVRP that achieves even better performance and uniformly improves upon existing algorithms for federated optimization under second-order similarity and strong convexity. In the course of analyzing these algorithms, we provide a new analysis of the Stochastic Proximal Point Method (SPPM) that might be of independent interest. Our analysis of SPPM is simple, allows for approximate proximal point evaluations, does not require any smoothness assumptions, and shows a clear benefit in communication complexity over ordinary distributed stochastic gradient descent.

1. INTRODUCTION

Federated Learning (FL) is a subfield of machine learning in which many clients (e.g. mobile phones or hospitals) collaboratively try to solve a learning task over a network without sharing their data. Federated learning finds applications in many areas including healthcare, Internet of Things (IoT) devices, manufacturing, and natural language processing tasks (Kairouz et al., 2019; Nguyen et al., 2021; Liu et al., 2021). One of the central problems of FL is federated or distributed optimization, which has been the subject of intensive ongoing research effort over the past few years (Wang et al., 2021). The standard formulation of federated optimization is the minimization problem

min_{x ∈ ℝ^d} f(x) = (1/M) Σ_{m=1}^{M} f_m(x),    (1)

where each function f_m represents the empirical risk of model x computed on the data of the m-th client, out of a total of M clients. Each client is connected to a central server tasked with coordinating the learning process. We shall assume that the loss on each client is µ-strongly convex. Because the model dimensionality d is often large in practice, the most popular methods for solving Problem (1) are first-order methods that access only gradients and do not require higher-order derivative information. The optimization process is divided into several communication rounds: in each round, the server sends a model to the clients, the clients perform some local work, and then they send updated models back. The server aggregates these models and starts another round. Problem (1) is an instance of the well-studied finite-sum minimization problem, for which tightly matching lower and upper bounds are known (Woodworth & Srebro, 2016). The chief quality that differentiates federated optimization from generic finite-sum minimization is that we mainly care about communication complexity rather than the number of gradient accesses.
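To make Problem (1) concrete, the following small numpy sketch instantiates it with hypothetical quadratic client losses (the client count, dimension, and data sizes are illustrative, not from the paper): each client m holds f_m(x) = ½∥A_m x − b_m∥²/n_m, and the global objective and gradient are averages of the local ones.

```python
import numpy as np

rng = np.random.default_rng(0)
M, d, n = 10, 5, 20  # clients, model dimension, samples per client (illustrative)

# Local data of each client: f_m(x) = 0.5 * ||A_m x - b_m||^2 / n
A = [rng.standard_normal((n, d)) for _ in range(M)]
b = [rng.standard_normal(n) for _ in range(M)]

def f_m(m, x):
    r = A[m] @ x - b[m]
    return 0.5 * (r @ r) / n

def grad_f_m(m, x):
    return A[m].T @ (A[m] @ x - b[m]) / n

# Global objective of Problem (1): the average of the client losses.
def f(x):
    return np.mean([f_m(m, x) for m in range(M)])

def grad_f(x):
    return np.mean([grad_f_m(m, x) for m in range(M)], axis=0)

x = rng.standard_normal(d)
print(f(x), np.linalg.norm(grad_f(x)))
```

Note that evaluating the exact global gradient ∇f requires contacting all M clients, which is precisely the communication cost that motivates client sampling below.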
That is, we care about the number of times each client communicates with the central server rather than the number of local gradients accessed on each machine. This is because the cost of communication is often much higher than the cost of local computation; as Kairouz et al. (2019) state: "It is now well-understood that communication can be a primary bottleneck for federated learning." One of the main sources of this bottleneck is that when all clients participate in the learning process, the cost of communication can be very high (Shahid et al., 2021). This can be alleviated in part by using client sampling (also known as partial participation): by sampling only a small number of clients in each round of communication, we can reduce the communication burden while retaining or even accelerating the training process (Chen et al., 2022). Our main focus in this paper is to develop methods for solving Problem (1) using client sampling and under the following assumption:

Assumption 1 (Second-order similarity). We assume that for all x, y ∈ ℝ^d we have

(1/M) Σ_{m=1}^{M} ∥∇f_m(x) − ∇f(x) − [∇f_m(y) − ∇f(y)]∥² ≤ δ² ∥x − y∥².
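For quadratic losses f_m(x) = ½xᵀH_m x, the deviation inside Assumption 1 reduces to (H_m − H̄)(x − y), where H̄ is the Hessian of f, so the tightest δ² is the largest eigenvalue of (1/M)Σ_m (H_m − H̄)². The following numerical sketch (with illustrative sizes and random Hessians, not data from the paper) computes this δ² and verifies the bound on random pairs (x, y).

```python
import numpy as np

rng = np.random.default_rng(1)
M, d = 10, 5  # illustrative sizes

# Quadratic client losses f_m(x) = 0.5 * x^T H_m x with random PSD Hessians.
Hs = []
for _ in range(M):
    B = rng.standard_normal((d, d))
    Hs.append(B @ B.T / d)
H_bar = np.mean(Hs, axis=0)  # Hessian of f = (1/M) * sum_m f_m

# For quadratics, grad f_m(x) - grad f(x) - [grad f_m(y) - grad f(y)]
# = (H_m - H_bar)(x - y), so the tightest delta^2 in Assumption 1 is
# lambda_max( (1/M) * sum_m (H_m - H_bar)^2 ).
S = np.mean([(Hm - H_bar) @ (Hm - H_bar) for Hm in Hs], axis=0)
delta2 = np.linalg.eigvalsh(S)[-1]  # eigvalsh returns ascending eigenvalues

# Empirical check of Assumption 1 on random pairs (x, y).
for _ in range(100):
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    lhs = np.mean([np.linalg.norm((Hm - H_bar) @ (x - y)) ** 2 for Hm in Hs])
    assert lhs <= delta2 * np.linalg.norm(x - y) ** 2 + 1e-9
print("delta^2 =", delta2)
```

When the client Hessians are close to one another (e.g. data drawn from similar distributions), H_m − H̄ is small and so is δ, which is the regime where the methods below shine.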


Assumption 1 is a slight generalization of the δ-relatedness assumption used by Arjevani & Shamir (2015) in the context of quadratic optimization and by Sun et al. (2022) for strongly convex optimization. It is also known as function similarity (Kovalev et al., 2022). Assumption 1 holds (with relatively small δ) in many practical settings, including statistical learning for quadratics (Shamir et al., 2014), generalized linear models (Hendrikx et al., 2020), and semi-supervised learning (Chayti & Karimireddy, 2022). We provide more details on the applications of second-order similarity in
Appendix B. Under the full-participation communication model, where all clients participate in each iteration, several methods can solve Problem (1) under Assumption 1, including ones that tightly match existing lower bounds (Kovalev et al., 2022). In contrast, for the setting we consider (partial participation, or client sampling), no lower bounds are known. The main question of our work is: Can we design faster methods for federated optimization (Problem (1)) under second-order similarity (Assumption 1) using client sampling?
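As a point of reference for the question above, the abstract's starting point, the Stochastic Proximal Point Method with client sampling and approximate proximal evaluations, can be sketched as follows. This is an illustrative toy (quadratic clients, hypothetical stepsizes, not the paper's Algorithm 2): in each round the server sends the current model to a single sampled client, which approximately evaluates prox_{γ f_m} by local gradient descent and sends the result back, so only two vectors cross the network per round regardless of M.

```python
import numpy as np

rng = np.random.default_rng(2)
M, d, gamma = 10, 5, 0.1  # clients, dimension, prox stepsize (illustrative)

# Strongly convex quadratic clients: f_m(x) = 0.5 (x - c_m)^T H_m (x - c_m).
Hs, cs = [], []
for _ in range(M):
    B = rng.standard_normal((d, d))
    Hs.append(B @ B.T / d + 0.1 * np.eye(d))  # PSD + 0.1 I => strongly convex
    cs.append(rng.standard_normal(d))

def grad_f_m(m, x):
    return Hs[m] @ (x - cs[m])

def approx_prox(m, x, gamma, local_steps=50, lr=0.05):
    """Approximate prox_{gamma f_m}(x): local gradient descent on
    z -> f_m(z) + ||z - x||^2 / (2 gamma). All work stays on client m."""
    z = x.copy()
    for _ in range(local_steps):
        z -= lr * (grad_f_m(m, z) + (z - x) / gamma)
    return z

# Exact minimizer of f, for reference: (mean H_m) x* = mean(H_m c_m).
x_star = np.linalg.solve(np.mean(Hs, axis=0),
                         np.mean([H @ c for H, c in zip(Hs, cs)], axis=0))

x = rng.standard_normal(d)
for k in range(300):
    m = rng.integers(M)           # client sampling / partial participation
    x = approx_prox(m, x, gamma)  # one round: send x, get prox point back
print("distance to x*:", np.linalg.norm(x - x_star))
```

With a fixed stepsize, this plain SPPM iteration only reaches a neighborhood of the solution; removing that residual error via variance reduction is exactly what SVRP adds.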

1.1. CONTRIBUTIONS

We answer the above question in the affirmative and show the utility of client sampling in optimization under second-order similarity for strongly convex objectives. Our main contributions are as follows:

• A new algorithm for federated optimization (SVRP, Algorithm 2). We develop a new algorithm, SVRP (Stochastic Variance-Reduced Proximal Point), that utilizes client sampling to improve upon existing algorithms for solving Problem (1) under second-order similarity. SVRP has a better dependence on the number of clients M in its communication complexity than all existing algorithms (see Table 1), and achieves superior performance when the dissimilarity constant δ is small enough. SVRP trades off a higher computational complexity for less communication.

• Catalyst-accelerated SVRP. By using Catalyst (Lin et al., 2015), we accelerate SVRP and obtain a new algorithm (Catalyzed SVRP) that improves the dependence on the effective conditioning from δ²/µ² to δ/µ. Catalyzed SVRP also has a better convergence rate (in the number of communication steps, ignoring constants and logarithmic factors) than all existing accelerated algorithms for this problem under Assumption 1, reducing the dependence on the number of clients multiplied by the effective conditioning δ/µ from (δ/µ)M to (δ/µ)M^{3/4} (see Table 1).

While both SVRP and Catalyzed SVRP achieve a communication complexity better than that of algorithms designed for the standard finite-sum setting (like SVRG or SAGA), their computational complexity is considerably worse. This is because we trade off local computation for reduced communication. Additionally, both SVRP and Catalyzed SVRP are based on a novel combination of variance-reduction techniques and the stochastic proximal point method (SPPM). SPPM is our starting point, and we provide a new analysis for it that might be of independent interest. Our analysis of SPPM is simple, allows for approximate evaluations of the proximal operator, and

