VARIANCE REDUCTION IS AN ANTIDOTE TO BYZANTINE WORKERS: BETTER RATES, WEAKER ASSUMPTIONS AND COMMUNICATION COMPRESSION AS A CHERRY ON THE TOP

Abstract

Byzantine-robustness has been gaining a lot of attention due to the growth of the interest in collaborative and federated learning. However, many fruitful directions, such as the usage of variance reduction for achieving robustness and communication compression for reducing communication costs, remain weakly explored in the field. This work addresses this gap and proposes Byz-VR-MARINA, a new Byzantine-tolerant method with variance reduction and compression. A key message of our paper is that variance reduction is key to fighting Byzantine workers more effectively, while communication compression is a bonus that makes the process more communication-efficient. We derive theoretical convergence guarantees for Byz-VR-MARINA that outperform the previous state of the art for general non-convex and Polyak-Łojasiewicz loss functions. Unlike concurrent Byzantine-robust methods with variance reduction and/or compression, our complexity results are tight and do not rely on restrictive assumptions such as boundedness of the gradients or limited compression. Moreover, we provide the first analysis of a Byzantine-tolerant method supporting non-uniform sampling of stochastic gradients. Numerical experiments corroborate our theoretical findings.

1. INTRODUCTION

Distributed optimization algorithms play a vital role in the training of modern machine learning models. In particular, some tasks require training deep neural networks with billions of parameters on large datasets (Brown et al., 2020; Kolesnikov et al., 2020). Such problems may take years of computation to solve if executed on a single, even powerful, machine (Li, 2020). To circumvent this issue, it is natural to use distributed optimization algorithms, which can tremendously reduce the training time (Goyal et al., 2017; You et al., 2020). In the context of speeding up training, distributed methods are usually applied in data centers (Mikami et al., 2018). More recently, similar ideas have been applied to train models using open collaborations (Kijsipongse et al., 2018; Diskin et al., 2021), where each participant (e.g., a small company/university or an individual) has very limited computing power but can donate it to jointly solve computationally hard problems. Moreover, in Federated Learning (FL) applications (McMahan et al., 2017; Konečný et al., 2016; Kairouz et al., 2021), distributed algorithms are the natural, and indeed the only possible, choice since the data is privately distributed across multiple devices. In the optimization problems arising in collaborative and federated learning, there is a high risk that some participants deviate from the prescribed protocol, either intentionally or not. In this paper, we call such participants Byzantine workers.¹ For example, such peers can maliciously send incorrect gradients to slow down or even destroy the training. Indeed, these attacks can break the convergence of naïve methods such as Parallel-SGD (Zinkevich et al., 2010). Therefore, it is crucial to use secure (a.k.a. Byzantine-robust/Byzantine-tolerant) distributed methods for solving such problems. However, designing distributed methods with provable Byzantine-robustness is not an easy task.
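The fragility of plain averaging can be seen in a minimal sketch (the toy objective, constants, and function name below are ours for illustration, not from the paper): five workers minimize f(x) = ||x||²/2, four send the true gradient, and a single Byzantine worker sends a scaled, sign-flipped vector, which suffices to make naive Parallel-SGD diverge.

```python
import numpy as np

def parallel_sgd_step(x, grads, lr=0.1):
    """One step of naive Parallel-SGD: the server simply averages
    whatever vectors the workers send and takes a gradient step."""
    return x - lr * np.mean(grads, axis=0)

# Minimize f(x) = ||x||^2 / 2, whose gradient at x is simply x.
x_clean = np.ones(2)
x_attacked = np.ones(2)
for _ in range(50):
    # All five workers honest: the iterates contract toward 0.
    x_clean = parallel_sgd_step(x_clean, [x_clean.copy() for _ in range(5)])
    # Four honest workers plus one Byzantine worker that flips and
    # scales the gradient: the averaged update points away from the
    # minimum, so the iterates blow up geometrically.
    honest = [x_attacked.copy() for _ in range(4)]
    byzantine = [-10.0 * x_attacked]
    x_attacked = parallel_sgd_step(x_attacked, honest + byzantine)

print(np.linalg.norm(x_clean))     # near 0: converged
print(np.linalg.norm(x_attacked))  # huge: diverged
```

Note that a single malicious worker controls the sign of the averaged update here, even though the honest majority sends exact gradients, which is precisely why robust aggregation rules are needed.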
The non-triviality of this problem comes from the fact that the stochastic gradients of good/honest/regular workers naturally differ due to their stochasticity and possible data heterogeneity. At the same time, malicious workers can send vectors that look like the stochastic gradients of good peers or create small but time-coupled shifts. Therefore, as shown in (Baruch et al., 2019; Xie et al., 2020; Karimireddy et al., 2021), Byzantine workers can circumvent popular defences based on applying robust aggregation rules (Blanchard et al., 2017; Yin et al., 2018; Damaskinos et al., 2019; Guerraoui et al., 2018; Pillutla et al., 2022) to Parallel-SGD. Moreover, for a broad class of problems with heterogeneous data, it is provably impossible to achieve any predefined accuracy of the solution (Karimireddy et al., 2022; El-Mhamdi et al., 2021). Nevertheless, as the discussion below makes evident, several works provide provable Byzantine tolerance and rigorous theoretical analysis. In particular, Wu et al. (2020) propose a natural and elegant solution to the problem of Byzantine-robustness based on the usage of variance-reduced methods (Gower et al., 2020) and design the first variance-reduced Byzantine-robust method, called Byrd-SAGA, which combines the celebrated SAGA method (Defazio et al., 2014) with the geometric median aggregation rule. As a result, reducing the stochastic noise of the estimators used by good workers makes it easier to filter out Byzantine workers (especially in the case of homogeneous data). However, Wu et al. (2020) derive their results only for strongly convex objectives, and the obtained convergence guarantees are significantly worse than the best-known convergence rates for SAGA, i.e., their results are not tight even when there are no Byzantine workers and all peers have homogeneous data. It is crucial to bypass these limitations since the majority of modern, practically interesting problems are non-convex.
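To make geometric median aggregation concrete, here is a minimal sketch using Weiszfeld's fixed-point iteration, a standard way to approximate the geometric median (the function and the toy data are our illustration, not the Byrd-SAGA implementation):

```python
import numpy as np

def geometric_median(vectors, n_iters=100, eps=1e-8):
    """Approximate the geometric median of a set of vectors via
    Weiszfeld's fixed-point iteration.  The geometric median minimizes
    the sum of Euclidean distances to the input points, so a minority
    of Byzantine outliers shifts it far less than the plain mean."""
    z = np.mean(vectors, axis=0)  # start from the average
    for _ in range(n_iters):
        dists = np.linalg.norm(vectors - z, axis=1)
        dists = np.maximum(dists, eps)  # avoid division by zero
        weights = 1.0 / dists
        z_new = (weights[:, None] * vectors).sum(axis=0) / weights.sum()
        if np.linalg.norm(z_new - z) < eps:
            break
        z = z_new
    return z

# Toy illustration: 8 honest workers send gradients near (1, 1),
# 2 Byzantine workers send a huge malicious vector.
rng = np.random.default_rng(0)
honest = rng.normal(loc=[1.0, 1.0], scale=0.1, size=(8, 2))
byzantine = np.full((2, 2), 1e3)
grads = np.vstack([honest, byzantine])

print(np.mean(grads, axis=0))   # dragged to ~200 per coordinate by the attack
print(geometric_median(grads))  # stays close to the honest cluster at (1, 1)
```

The sketch also suggests why variance reduction helps: the smaller the spread of the honest gradients around their mean, the tighter the honest cluster, and the harder it is for Byzantine vectors to hide among them.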
Furthermore, it is hard to develop the field without tight convergence guarantees. All in all, the above leads to the following question: Q1: Is it possible to design variance-reduced methods with provable Byzantine-robustness and tight theoretical guarantees for general non-convex optimization problems? In addition to Byzantine-robustness, one has to take into account that naïve distributed algorithms suffer from the so-called communication bottleneck, a situation when communication is much more expensive than local computations on the devices. This issue is especially evident in the training of models with a vast number of parameters (e.g., millions or trillions) or when the number of workers is large (which is often the case in FL). One of the most popular approaches to reducing the communication bottleneck is to use communication compression (Seide et al., 2014; Konečný et al., 2016; Suresh et al., 2017), i.e., instead of transmitting dense vectors (stochastic gradients/Hessians/higher-order tensors), workers apply some compression/sparsification operator to these vectors and send the compressed results to the server. Distributed learning with compression is a relatively well-developed field, e.g., see (Vogels et al., 2019; Gorbunov et al., 2020b; Richtárik et al., 2021; Philippenko & Dieuleveut, 2021) and references therein for recent advances. Perhaps surprisingly, there are not many methods with compressed communication in the context of Byzantine-robust learning. In particular, we are only aware of the following works: (Bernstein et al., 2018; Ghosh et al., 2020; 2021; Zhu & Ling, 2021). Bernstein et al. (2018) propose signSGD to reduce communication costs and study the majority vote to cope with Byzantine workers under some additional assumptions about the adversaries. However, it is known that signSGD is not guaranteed to converge (Karimireddy et al., 2019). Next, Ghosh et al.
(2020; 2021) apply an aggregation rule based on selecting update vectors by their norms. In this case, Byzantine workers can successfully hide in the noise by applying state-of-the-art attacks (Baruch et al., 2019). Zhu & Ling (2021) study Byzantine-robust versions of compressed SGD (BR-CSGD) and SAGA (BR-CSAGA) and also propose a combination of DIANA (Mishchenko et al., 2019; Horváth et al., 2019b) with BR-CSAGA called BROADCAST. However, the derived convergence results for these methods have several limitations. First of all, the analysis is given only for strongly convex problems. In addition, it



¹ This term is standard in the distributed learning literature (Lamport et al., 1982; Su & Vaidya, 2016; Lyu et al., 2020). Using this term, we follow standard terminology and do not intend to offend any group. It would be great if the community found and agreed on a more neutral term to denote such workers.

