BYZANTINE-ROBUST LEARNING ON HETEROGENEOUS DATASETS VIA RESAMPLING

Abstract

In Byzantine-robust distributed optimization, a central server wants to train a machine learning model over data distributed across multiple workers. However, a fraction of these workers may deviate from the prescribed algorithm and send arbitrary messages to the server. While this problem has received significant attention recently, most current defenses assume that the workers have identical data. For the realistic case when the data across workers is heterogeneous (non-iid), we design new attacks which circumvent these defenses, leading to significant loss of performance. We then propose a simple resampling scheme that adapts existing robust algorithms to heterogeneous datasets at a negligible computational cost. We theoretically and experimentally validate our approach, showing that combining resampling with existing robust algorithms is effective against challenging attacks.

1. INTRODUCTION

Distributed or federated machine learning, where the data is distributed across multiple workers, has become an increasingly important learning paradigm, both due to the growing sizes of datasets and due to privacy and security concerns. In such a setting, the workers collaborate to train a single model without transmitting their data directly over the network (McMahan et al., 2016; Bonawitz et al., 2019; Kairouz et al., 2019). Due either to actively malicious agents in the network or simply to system and network failures, some workers may disobey the protocol and send arbitrary messages; such workers are known as Byzantine workers (Lamport et al., 2019). Byzantine-robust optimization algorithms combine the gradients received from all workers using robust aggregation rules, to ensure that the training is not impacted by the malicious workers. While this problem has received significant recent attention (Alistarh et al., 2018; Blanchard et al., 2017; Yin et al., 2018a), most current approaches assume that the data on each worker is identically distributed. In this work, we show that existing Byzantine-robust methods catastrophically fail in the realistic setting when the data is distributed heterogeneously across the workers. We then propose a simple resampling scheme which can be readily combined with existing aggregation rules to allow robust training on heterogeneous data.

Contribution. Concretely, our contributions in this work are

• We show that when the data across workers is heterogeneous, existing robust rules might not converge, even without any Byzantine adversaries.
• We propose two new attacks, normalized gradient and mimic, which take advantage of data heterogeneity and circumvent median- and sign-based defenses (Blanchard et al., 2017; Pillutla et al., 2019; Li et al., 2019).
• We propose a simple new resampling step which can be used before any existing robust aggregation rule. We instantiate our scheme with KRUM and theoretically prove that resampling generalizes it to the setting of heterogeneous data.
• Our experiments evaluate the proposed resampling scheme against known and new attacks and show that it drastically improves the performance of 3 existing schemes on realistic heterogeneously distributed datasets.

Setup and notations. We study the general distributed optimization problem

$$\min_{x \in \mathbb{R}^d} \Big\{ L(x) := \tfrac{1}{n} \textstyle\sum_{i=1}^n L_i(x) \Big\} \qquad (1)$$

where the individual loss functions $L_i : \mathbb{R}^d \to \mathbb{R}$ are distributed among $n$ workers, each having its own (heterogeneous) data distribution $\{\mathcal{D}_i\}_{i=1}^n$. The case of empirical risk minimization with $m_i$ datapoints $\xi_i \sim \mathcal{D}_i$ on worker $i$ is obtained when using $L_i(x) := \tfrac{1}{m_i} \sum_{j=1}^{m_i} L_i(x, \xi_i^j)$. The (stochastic) gradient computed by a good worker $i$ with sample $j$ is given as $g_i(x) := \nabla L_i(x, \xi_i^j)$, with mean $\mu_i$ and variance $\sigma_i^2$. We also assume that the heterogeneity (variance across the good workers) is bounded, i.e.

$$\mathbb{E}_i \|\nabla L_i(x) - \nabla L(x)\|^2 \le \bar{\sigma}^2, \quad \forall x.$$

We write $g_i$ instead of $g_i(x_t)$ when there is no ambiguity. A distributed training step using an aggregation rule $\mathrm{Aggr}$ is given as

$$x_{t+1} := x_t - \gamma_t \, \mathrm{Aggr}(\{g_i(x_t) : i \in [n]\}) \qquad (2)$$

If the aggregation rule is the arithmetic mean, then (2) recovers standard minibatch SGD.

Byzantine attack model. In each iteration, there is a set Byz of at most $f$ Byzantine workers. The remaining workers are good and follow the prescribed protocol.
A Byzantine worker $j \in$ Byz can deviate from the protocol and send an arbitrary vector to the server. We further allow the Byzantine workers to collude with each other and to know the full state of the system. Unlike martingale-based approaches such as (Alistarh et al., 2018), we allow the set Byz to change over time (Blanchard et al., 2017; Chen et al., 2017; Mhamdi et al., 2018).
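The training step (2) and the proposed resampling idea can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the paper's reference implementation: the function names `resample` and `sgd_step` are our own, and we realize $s$-fold resampling via $s$ independent permutations so that each worker gradient contributes to exactly $s$ of the $n$ resampled outputs; details such as tie-breaking or batching may differ in the actual algorithm.

```python
import numpy as np

def resample(grads, s, rng):
    """s-fold resampling: produce n new vectors, each the average of s
    worker gradients, with every gradient appearing in exactly s outputs.
    One simple way to enforce this is s independent random permutations."""
    n = len(grads)
    G = np.stack(grads)                              # shape (n, d)
    perms = [rng.permutation(n) for _ in range(s)]
    return [G[[p[i] for p in perms]].mean(axis=0) for i in range(n)]

def sgd_step(x, grads, aggregate, lr):
    """One step of Eq. (2): x_{t+1} = x_t - gamma_t * Aggr({g_i})."""
    return x - lr * aggregate(grads)
```

Averaging over resampled groups reduces the worker-to-worker variance (heterogeneity) seen by the downstream robust aggregator by roughly a factor of $s$, while leaving the overall mean unchanged.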

2. RELATED WORK

There has been significant recent work on the case when the workers have identical data distributions (Blanchard et al., 2017; Chen et al., 2017; Mhamdi et al., 2018; Alistarh et al., 2018; Yin et al., 2018a;b; Su & Xu, 2018; Damaskinos et al., 2019). We discuss the most pertinent of these methods next. Blanchard et al. (2017) formalize the Byzantine-robust setup and propose a distance-based approach, KRUM, which selects a worker whose gradient is close to at least half of the other workers. A different approach uses the coordinate-wise median and its variants (Blanchard et al., 2017; Pillutla et al., 2019; Yin et al., 2018a).
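For concreteness, the two defenses just discussed can be sketched as follows. This is a minimal NumPy sketch under our reading of Blanchard et al. (2017): KRUM scores each gradient by the sum of squared distances to its $n - f - 2$ closest other gradients and outputs the gradient with the smallest score; published implementations (e.g. Multi-Krum) may differ in details.

```python
import numpy as np

def krum(grads, f):
    """KRUM (Blanchard et al., 2017): return the gradient whose summed
    squared distance to its n - f - 2 nearest other gradients is smallest."""
    G = np.stack(grads)
    n = len(G)
    d2 = ((G[:, None, :] - G[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    scores = []
    for i in range(n):
        others = np.delete(d2[i], i)          # drop distance to itself
        scores.append(np.sort(others)[: n - f - 2].sum())
    return G[int(np.argmin(scores))]

def coordinate_median(grads):
    """Coordinate-wise median aggregation."""
    return np.median(np.stack(grads), axis=0)
```

Both rules discard extreme vectors: KRUM by selecting a single "central" gradient, the median by clipping each coordinate independently.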



Alistarh et al. (2018) use a martingale-based aggregation rule which gives a sample-complexity-optimal algorithm for iid data. The distance-based approach of KRUM was later extended in Mhamdi et al. (2018), who propose BULYAN to overcome the dimensional leeway attack. This so-called strong Byzantine resilience is orthogonal to the question of non-iid-ness we study here. Recently, (Peng & Ling, 2020; Yang & Bajwa, 2019a;b) studied Byzantine-resilient algorithms in the decentralized setting, where no central server is available. Extending our techniques to the decentralized setting is an important direction for future work.

In a different line of work, (Lai et al., 2016; Diakonikolas et al., 2019) develop sophisticated spectral techniques to robustly estimate the mean of a high-dimensional multivariate standard Gaussian distribution, where the samples are evenly distributed in all directions and the attackers are concentrated in one direction. Very recent work (Data & Diggavi, 2020) extends the theoretical analysis to non-convex, strongly convex, and non-iid setups under a gradient dissimilarity assumption, and proposes a gradient compression scheme on top of it. Our resampling trick can be combined with it to further reduce gradient dissimilarity.

Many attacks have been devised for distributed training. For the iid setting, the state-of-the-art attacks are (Baruch et al., 2019; Xie et al., 2019b). The latter attack is very strong when the fraction of adversaries is large (nearly half), but in this work we focus on settings where this fraction is small (e.g. ≤ 0.2). Further, our normalized gradient attack (Section 3.2) is inspired by (Xie et al., 2019b). The former work focuses on attacks which are coordinated across time steps. Developing strong practical defenses, even in the iid case, against such time-coordinated attacks remains an open problem.
In this work, we sidestep this issue by restricting ourselves to new attacks made possible by non-iid data and studying how to overcome them. We focus on schemes which work in the iid setting but fail with non-iid data. Once a new method which can defend against (Baruch et al., 2019) is developed, our proposed scheme shows how to adapt such a method to the important non-iid case. For the non-iid setting, backdoor attacks are designed to take advantage of heavy-tailed data and manipulate

