FASTER FEDERATED OPTIMIZATION UNDER SECOND-ORDER SIMILARITY

Abstract

Federated learning (FL) is a subfield of machine learning where multiple clients try to collaboratively learn a model over a network under communication constraints. We consider finite-sum federated optimization under a second-order function similarity condition and strong convexity, and propose two new algorithms: SVRP and Catalyzed SVRP. This second-order similarity condition has grown popular recently, and is satisfied in many applications including distributed statistical learning and differentially private empirical risk minimization. The first algorithm, SVRP, combines approximate stochastic proximal point evaluations, client sampling, and variance reduction. We show that SVRP is communication efficient and achieves superior performance to many existing algorithms when function similarity is high enough. Our second algorithm, Catalyzed SVRP, is a Catalyst-accelerated variant of SVRP that achieves even better performance and uniformly improves upon existing algorithms for federated optimization under second-order similarity and strong convexity. In the course of analyzing these algorithms, we provide a new analysis of the Stochastic Proximal Point Method (SPPM) that might be of independent interest. Our analysis of SPPM is simple, allows for approximate proximal point evaluations, does not require any smoothness assumptions, and shows a clear benefit in communication complexity over ordinary distributed stochastic gradient descent.

1. INTRODUCTION

Federated Learning (FL) is a subfield of machine learning in which many clients (e.g., mobile phones or hospitals) collaboratively solve a learning task over a network without sharing their data. FL finds applications in many areas, including healthcare, Internet of Things (IoT) devices, manufacturing, and natural language processing (Kairouz et al., 2019; Nguyen et al., 2021; Liu et al., 2021). One of the central problems in FL is federated or distributed optimization, which has been the subject of intensive ongoing research over the past few years (Wang et al., 2021).

The standard formulation of federated optimization is the minimization problem

    min_{x \in \mathbb{R}^d} f(x) = \frac{1}{M} \sum_{m=1}^{M} f_m(x),    (1)

where each function f_m represents the empirical risk of the model x computed on the data of the m-th client, out of M clients in total. Each client is connected to a central server that coordinates the learning process. We assume that the loss on each client is µ-strongly convex.

Because the model dimensionality d is often large in practice, the most popular methods for solving Problem (1) are first-order methods, which access only gradients and require no higher-order derivative information. Such methods include distributed (stochastic) gradient descent, FedAvg (also known as Local SGD) (Konečný et al., 2016), FedProx (also known as the Stochastic Proximal Point Method) (Li et al., 2020b), SCAFFOLD (Karimireddy et al., 2020b), and others. These algorithms typically follow the intermittent-communication framework (Woodworth et al., 2021): the optimization process is divided into communication rounds. In each round, the server sends a model to the clients, the clients perform some local work and send back updated models, and the server then aggregates these models and begins the next round.
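To make the intermittent-communication pattern concrete, the following is a minimal sketch of one such scheme, FedAvg/Local SGD, on toy strongly convex client objectives f_m(x) = 0.5 (x - c_m)^2. All names here (local_sgd, fedavg_round, the centers c_m, the step counts) are illustrative choices, not part of the algorithms analyzed in this paper.

```python
def local_sgd(x, grad, steps=10, lr=0.1):
    """Client-side local work: a few SGD steps starting from the server model x."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def fedavg_round(x_server, client_grads):
    """One communication round: broadcast the server model, run local work on
    every client, then aggregate the returned models by averaging."""
    updated = [local_sgd(x_server, g) for g in client_grads]
    return sum(updated) / len(updated)

# Toy example: M = 4 clients, each with f_m(x) = 0.5 * (x - c_m)^2,
# so grad f_m(x) = x - c_m and the global minimizer is the mean of the centers.
centers = [2.0, -1.0, 0.5, 4.0]
grads = [lambda x, c=c: x - c for c in centers]  # c=c pins each center

x = 0.0
for _ in range(30):
    x = fedavg_round(x, grads)
# x approaches the global minimizer (the mean of the centers, 1.375)
```

On these quadratics each round contracts the distance to the minimizer by a constant factor, which is the kind of geometric convergence the strong-convexity assumption buys; real FL runs replace the exact gradients with stochastic minibatch gradients on each client's local data.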
Problem (1) is an example of the well-studied finite-sum minimization problem, for which we have tightly matching lower and upper bounds (Woodworth & Srebro, 2016). The chief quality that

