VARIANCE REDUCTION IS AN ANTIDOTE TO BYZANTINE WORKERS: BETTER RATES, WEAKER ASSUMPTIONS AND COMMUNICATION COMPRESSION AS A CHERRY ON THE TOP

Abstract

Byzantine-robustness has been gaining a lot of attention due to the growth of interest in collaborative and federated learning. However, many fruitful directions, such as the usage of variance reduction for achieving robustness and communication compression for reducing communication costs, remain weakly explored in the field. This work addresses this gap and proposes Byz-VR-MARINA, a new Byzantine-tolerant method with variance reduction and compression. A key message of our paper is that variance reduction helps fight Byzantine workers more effectively, while communication compression is a bonus that makes the process more communication-efficient. We derive theoretical convergence guarantees for Byz-VR-MARINA that outperform the previous state-of-the-art for general non-convex and Polyak-Łojasiewicz loss functions. Unlike concurrent Byzantine-robust methods with variance reduction and/or compression, our complexity results are tight and do not rely on restrictive assumptions such as boundedness of the gradients or limited compression. Moreover, we provide the first analysis of a Byzantine-tolerant method supporting non-uniform sampling of stochastic gradients. Numerical experiments corroborate our theoretical findings.

1. INTRODUCTION

Distributed optimization algorithms play a vital role in training modern machine learning models. In particular, some tasks require training deep neural networks with billions of parameters on large datasets (Brown et al., 2020; Kolesnikov et al., 2020). Such problems could take years of computation to solve on a single, even powerful, machine (Li, 2020). To circumvent this issue, it is natural to use distributed optimization algorithms, which can tremendously reduce the training time (Goyal et al., 2017; You et al., 2020). In the context of speeding up training, distributed methods are usually applied in data centers (Mikami et al., 2018). More recently, similar ideas have been applied to train models using open collaborations (Kijsipongse et al., 2018; Diskin et al., 2021), where each participant (e.g., a small company/university or an individual) has very limited computing power but can donate it to jointly solve computationally hard problems. Moreover, in Federated Learning (FL) applications (McMahan et al., 2017; Konečný et al., 2016; Kairouz et al., 2021), distributed algorithms are the natural and only possible choice, since the data is privately distributed across multiple devices. In the optimization problems arising in collaborative and federated learning, there is a high risk that some participants deviate from the prescribed protocol, either intentionally or not. In this paper, we
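To make the threat model concrete, the following minimal sketch (not the Byz-VR-MARINA method itself, and all names are illustrative) shows why naive gradient aggregation fails: a single Byzantine worker sending an arbitrary vector can drive the averaged gradient arbitrarily far from the truth, whereas a robust aggregator such as the coordinate-wise median tolerates a minority of such workers.

```python
import numpy as np

# Illustrative simulation: 9 honest workers report noisy copies of the true
# gradient; 1 Byzantine worker reports an adversarial vector.
rng = np.random.default_rng(0)
true_grad = np.array([1.0, -2.0, 0.5])

honest = true_grad + 0.1 * rng.standard_normal((9, 3))
byzantine = np.array([[1e6, 1e6, 1e6]])  # arbitrary malicious report
reports = np.vstack([honest, byzantine])

# Naive averaging: a single outlier shifts the result by ~1e5 per coordinate.
mean_agg = reports.mean(axis=0)

# Coordinate-wise median: with a minority of Byzantine workers, each
# coordinate's median is determined by honest reports.
median_agg = np.median(reports, axis=0)

print("mean aggregation:  ", mean_agg)
print("median aggregation:", median_agg)
```

The median is only one example of a robust aggregation rule; the point is that the server cannot trust a plain average once even one participant deviates from the protocol.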

