A BETTER ALTERNATIVE TO ERROR FEEDBACK FOR COMMUNICATION-EFFICIENT DISTRIBUTED LEARNING

Abstract

Modern large-scale machine learning applications require stochastic optimization algorithms to be implemented on distributed compute systems. A key bottleneck of such systems is the communication overhead for exchanging information (e.g., stochastic gradients) across the workers. Among the many techniques proposed to remedy this issue, one of the most successful is the framework of compressed communication with error feedback (EF). EF remains the only known technique that can deal with the error induced by contractive compressors which are not unbiased, such as Top-K or PowerSGD. In this paper, we propose an alternative to EF for dealing with contractive compressors that is better both theoretically and practically. In particular, we propose a construction which can transform any contractive compressor into an induced unbiased compressor. Following this transformation, existing methods able to work with unbiased compressors can be applied. We show that our approach leads to vast improvements over EF, including reduced memory requirements, better communication complexity guarantees, and fewer assumptions. We further extend our results to federated learning with partial participation following an arbitrary distribution over the nodes, and demonstrate the benefits thereof. We perform several numerical experiments which validate our theoretical findings.
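To make the abstract's construction concrete, the sketch below illustrates one natural way to turn a contractive compressor into an unbiased one: apply the contractive compressor (here Top-K), then compress the residual with an unbiased compressor (here rescaled random-K) so that the bias cancels in expectation. The specific compressor choices and function names are illustrative, not the paper's exact formulation.

```python
import numpy as np

def top_k(x, k):
    """Contractive (biased) compressor: keep the k largest-magnitude entries."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def rand_k(x, k, rng):
    """Unbiased compressor: keep k random entries, rescaled by d/k
    so that E[rand_k(x)] = x."""
    d = x.size
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = x[idx] * (d / k)
    return out

def induced_compressor(x, k, rng):
    """Induced unbiased compressor (sketch): the contractive part plus an
    unbiased compression of its residual. Since E[rand_k(r)] = r, the
    expectation of the output equals x, i.e., the Top-K bias is removed."""
    c = top_k(x, k)
    return c + rand_k(x - c, k, rng)
```

Note that `top_k(x, k)` alone is deterministic and generally differs from `x`, hence biased; averaging many draws of `induced_compressor(x, k, rng)` recovers `x`.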

1. INTRODUCTION

We consider distributed optimization problems of the form
$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n} \sum_{i=1}^n f_i(x),$$
where $x \in \mathbb{R}^d$ represents the weights of a statistical model we wish to train, $n$ is the number of nodes, and $f_i : \mathbb{R}^d \to \mathbb{R}$ is a smooth differentiable loss function associated with the data stored on worker $i$. In a classical distributed machine learning scenario, $f_i(x) := \mathbb{E}_{\zeta \sim \mathcal{D}_i}[f_\zeta(x)]$ is the expected loss of model $x$ with respect to the local data distribution $\mathcal{D}_i$, and $f_\zeta : \mathbb{R}^d \to \mathbb{R}$ is the loss on the single data point $\zeta$. This definition allows for different distributions $\mathcal{D}_1, \ldots, \mathcal{D}_n$ on each node, which means that the functions $f_1, \ldots, f_n$ can have different minimizers. This framework covers Stochastic Optimization, when either $n = 1$ or all $\mathcal{D}_i$ are identical; Empirical Risk Minimization (ERM), when $f_i(x)$ can be expressed as a finite average, i.e., $f_i(x) = \frac{1}{m_i} \sum_{j=1}^{m_i} f_{ij}(x)$ for some $f_{ij} : \mathbb{R}^d \to \mathbb{R}$; and Federated Learning (FL) (Kairouz et al., 2019), where each node represents a client.

Communication Bottleneck. In distributed training, model updates (or gradient vectors) have to be exchanged in each iteration. Due to the size of the communicated messages for commonly considered deep models (Alistarh et al., 2016), this represents a significant bottleneck of the whole optimization procedure. To reduce the amount of data that has to be transmitted, several strategies have been proposed. One of the most popular is to incorporate local steps and communicate updates only every few iterations (Stich, 2019a; Lin et al., 2018a; Stich & Karimireddy, 2020; Karimireddy et al.,
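As a toy instance of the formulation above, the following sketch minimizes $f(x) = \frac{1}{n}\sum_i f_i(x)$ with hypothetical quadratic local losses $f_i(x) = \frac{1}{2}\|x - b_i\|^2$, whose average is minimized at the mean of the local vectors $b_i$. The vectors, step size, and iteration count are illustrative.

```python
import numpy as np

# Toy instance of min_{x in R^d} f(x) = (1/n) * sum_i f_i(x)
# with f_i(x) = 0.5 * ||x - b_i||^2, so grad f_i(x) = x - b_i
# and the unique minimizer of f is the mean of the local vectors b_i.
rng = np.random.default_rng(0)
n, d = 4, 10
b = rng.normal(size=(n, d))   # "local data" held by each of the n workers
x = np.zeros(d)

for _ in range(500):
    grads = [x - b[i] for i in range(n)]   # each worker computes its local gradient
    x -= 0.5 * np.mean(grads, axis=0)      # server averages the gradients and steps
```

Each round, every worker transmits a $d$-dimensional gradient; for modern deep models $d$ is in the millions or billions, which is precisely the communication bottleneck that compressed communication aims to reduce.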

