BASGD: BUFFERED ASYNCHRONOUS SGD FOR BYZANTINE LEARNING

Abstract

Distributed learning has become a hot research topic due to its wide applications in cluster-based large-scale learning, federated learning, edge computing and so on. Most traditional distributed learning methods typically assume no failure or attack on workers. However, many unexpected cases, such as communication failures and even malicious attacks, may happen in real applications. Hence, Byzantine learning (BL), which refers to distributed learning with failure or attack, has recently attracted much attention. Most existing BL methods are synchronous, which makes them impractical in some applications due to heterogeneous or offline workers. In these cases, asynchronous BL (ABL) is usually preferred. In this paper, we propose a novel method, called buffered asynchronous stochastic gradient descent (BASGD), for ABL. To the best of our knowledge, BASGD is the first ABL method that can resist malicious attack without storing any instances on the server. Compared with those methods which need to store instances on the server, BASGD incurs less risk of privacy leakage. BASGD is proved to be convergent and able to resist failure or attack. Empirical results show that BASGD significantly outperforms vanilla ASGD and other ABL baselines when there exist failures or attacks on workers.

1. INTRODUCTION

Due to its wide applications in cluster-based large-scale learning, federated learning (Konečný et al., 2016; Kairouz et al., 2019), edge computing (Shi et al., 2016) and so on, distributed learning has recently become a hot research topic (Zinkevich et al., 2010; Yang, 2013; Jaggi et al., 2014; Shamir et al., 2014; Zhang & Kwok, 2014; Ma et al., 2015; Lee et al., 2017; Lian et al., 2017; Zhao et al., 2017; Sun et al., 2018; Wangni et al., 2018; Zhao et al., 2018; Zhou et al., 2018; Yu et al., 2019a;b; Haddadpour et al., 2019). Most traditional distributed learning methods are based on stochastic gradient descent (SGD) and its variants (Bottou, 2010; Xiao, 2010; Duchi et al., 2011; Johnson & Zhang, 2013; Shalev-Shwartz & Zhang, 2013; Zhang et al., 2013; Lin et al., 2014; Schmidt et al., 2017; Zheng et al., 2017; Zhao et al., 2018), and typically assume no failure or attack on workers. However, in real distributed learning applications with multiple networked machines (nodes), different kinds of hardware or software failure may happen. Representative failures include bit-flipping in the communication media and in the memory of some workers (Xie et al., 2019). In this case, a small failure on some machines (workers) might cause a distributed learning method to fail. In addition, malicious attack should not be neglected in an open network where the manager (or server) generally has little control over the workers, such as in the cases of edge computing and federated learning. Some malicious workers may behave arbitrarily or even adversarially. Hence, Byzantine learning (BL), which refers to distributed learning with failure or attack, has recently attracted much attention (Diakonikolas et al., 2017; Chen et al., 2017; Blanchard et al., 2017; Alistarh et al., 2018; Damaskinos et al., 2018; Xie et al., 2019; Baruch et al., 2019; Diakonikolas & Kane, 2019).
Existing BL methods can be divided into two main categories: synchronous BL (SBL) methods and asynchronous BL (ABL) methods. In SBL methods, the learning information, such as the gradient in SGD, of all workers is aggregated in a synchronous way. In contrast, in ABL methods the learning information of workers is aggregated in an asynchronous way. Existing SBL methods mainly take two different approaches to achieve resilience against Byzantine workers, which refer to those workers with failure or attack. One approach is to replace the simple averaging aggregation operation with a more robust aggregation operation, such as the median or trimmed mean (Yin et al., 2018). Krum (Blanchard et al., 2017) and ByzantinePGD (Yin et al., 2019) take this approach. The other approach is to filter out suspicious learning information (gradients) before averaging. Representative examples include ByzantineSGD (Alistarh et al., 2018) and Zeno (Xie et al., 2019). The advantage of SBL methods is that they are relatively simple and easy to implement. But SBL methods suffer from slow convergence when there exist heterogeneous workers. Furthermore, in some applications like federated learning and edge computing, synchronization cannot even be performed most of the time due to offline workers (clients or edge servers). Hence, ABL is preferred in these cases. To the best of our knowledge, there exist only two ABL methods: Kardam (Damaskinos et al., 2018) and Zeno++ (Xie et al., 2020). Kardam introduces two filters to drop suspicious learning information (gradients), and can still achieve good performance when the communication delay is heavy. However, when facing malicious attack, previous work has found that Kardam also drops most correct gradients in order to filter out all faulty (failure) gradients. Hence, Kardam cannot resist malicious attack (Xie et al., 2020). Zeno++ scores each received gradient, and determines whether to accept it according to the score.
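As a concrete illustration of the first approach, the two robust aggregation rules mentioned above (coordinate-wise median and trimmed mean from Yin et al., 2018) can be sketched as follows. This is a minimal sketch, not the paper's own method; the function names are illustrative, and worker gradients are assumed to be stacked as the rows of a NumPy array.

```python
import numpy as np

def coordwise_median(grads):
    """Coordinate-wise median of worker gradients (one gradient per row)."""
    return np.median(grads, axis=0)

def trimmed_mean(grads, b):
    """Coordinate-wise b-trimmed mean: in each coordinate, drop the b
    smallest and b largest values across workers, then average the rest."""
    m = grads.shape[0]
    s = np.sort(grads, axis=0)   # sort each coordinate across workers
    return s[b:m - b].mean(axis=0)
```

With m = 4 workers and b = 1, a single arbitrarily corrupted gradient cannot move either aggregate outside the range of the honest workers' values in any coordinate, whereas it can move a plain average arbitrarily far.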
But Zeno++ needs to store some training instances on the server for scoring. In practical applications, storing data on the server increases the risk of privacy leakage and may even raise legal issues. Therefore, under the general setting where the server has no access to any training instances, no existing ABL method can resist malicious attack. In this paper, we propose a novel method, called buffered asynchronous stochastic gradient descent (BASGD), for ABL. The main contributions of BASGD are listed as follows:

• To the best of our knowledge, BASGD is the first ABL method that can resist malicious attack without storing any instances on the server. Compared with those methods which need to store instances on the server, BASGD incurs less risk of privacy leakage.

• BASGD is theoretically proved to be convergent and able to resist failure or attack.

• Empirical results show that BASGD significantly outperforms vanilla ASGD and other ABL baselines when there exist failures or malicious attacks on workers. In particular, BASGD can still converge under malicious attack in cases where ASGD and other ABL methods fail.

2. PRELIMINARY

This section presents the preliminaries of this paper, including the distributed learning framework we adopt and the definition of a Byzantine worker.

2.1. DISTRIBUTED LEARNING FRAMEWORK

Many machine learning models, such as logistic regression and deep neural networks, can be formulated as the following finite-sum optimization problem:

$$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n} \sum_{i=1}^{n} f(w; z_i), \qquad (1)$$

where $w$ is the parameter to learn, $d$ is the dimension of the parameter, $n$ is the number of training instances, and $f(w; z_i)$ is the empirical loss on the training instance $z_i$. The goal of distributed learning is to solve the problem in (1) by designing learning algorithms based on multiple networked machines. Although many distributed learning frameworks have appeared, in this paper we focus on the widely used Parameter Server (PS) framework (Li et al., 2014). In a PS framework, there are several workers and one or more servers. Each worker can only communicate with server(s). There may exist more than one server in a PS framework, but for the problem of this paper the servers can be logically viewed as a single unit. Without loss of generality, we will assume there is only one server in this paper. Training instances are disjointly distributed across $m$ workers. Let $D_k$ denote the index set of training instances on worker $k$; we have $\cup_{k=1}^{m} D_k = \{1, 2, \ldots, n\}$ and $D_k \cap D_{k'} = \emptyset$ if $k \neq k'$. In this paper, we assume the server has no access to any training instances. If two instances have the same value, they are still deemed two distinct instances; namely, $z_i$ may equal $z_{i'}$ even when $i \neq i'$. One popular asynchronous method to solve the problem in (1) under the PS framework is ASGD (Dean et al., 2012) (see Algorithm 1 in Appendix A). In this paper, we assume each worker samples one instance for gradient computation each time, and we do not separately discuss the mini-batch case.
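The disjoint data partition and the vanilla ASGD server update described above can be sketched as follows. This is a minimal illustration under the stated PS setting, not the paper's Algorithm 1; the function names `partition_indices` and `asgd_server_step` are hypothetical.

```python
import numpy as np

def partition_indices(n, m, seed=0):
    """Disjointly split instance indices {0, ..., n-1} across m workers,
    so the index sets D_k are pairwise disjoint and cover all instances."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), m)

def asgd_server_step(w, grad, eta):
    """Vanilla ASGD server update: apply each arriving (possibly stale)
    worker gradient immediately, without waiting for the other workers."""
    return w - eta * grad
```

Note that the server never touches the instances themselves, only the gradients sent by workers; it is precisely this update-on-arrival step that a Byzantine worker can poison with an arbitrary `grad`, which motivates the buffered aggregation proposed in this paper.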

