LINEAR SCALARIZATION FOR BYZANTINE-ROBUST LEARNING ON NON-IID DATA

Abstract

In this work we study the problem of Byzantine-robust learning when data among clients is heterogeneous. We focus on poisoning attacks that target the convergence of SGD. Although this problem has received great attention, the main Byzantine defenses rely on the IID assumption, causing them to fail on non-IID data even in the absence of an attack. We propose Linear Scalarization (LS) as an enhancing method that enables current defenses to circumvent Byzantine attacks in the non-IID setting. The LS method incorporates a trade-off vector that penalizes suspected malicious clients. Empirical analysis corroborates that the proposed LS variants are viable in the IID setting. For mild to strong non-IID data splits, LS is either comparable to or outperforms current approaches under state-of-the-art Byzantine attack scenarios.

1. INTRODUCTION

Most real-world applications using learning algorithms are moving towards distributed computation, either (i) because some applications are inherently distributed, Federated Learning (FL) for instance, or (ii) to speed up computation and benefit from hardware parallelization. In particular, Stochastic Gradient Descent (SGD) is distributed to alleviate the heavy computation underlying gradient updates during training, especially given the high dimensionality of large-scale deep learning models and the exponential growth in user-generated data. However, distributing computation comes at the cost of introducing challenges related to consensus and fault tolerance: the nodes composing the distributed system need to reach consensus regarding the gradient update. In the honest setting this can be done simply by a parameter server that takes charge of aggregating the workers' computations. However, machines are prone to hardware failure (crash/stop) or arbitrary behavior due to bugs or malicious users. The latter is more concerning, as machines may collude and lead to convergence to ineffective models. Since deep learning pipelines are involved in decision making at a critical level (e.g., computer-aided diagnosis, airport security), it is crucial to ensure their robustness. We study robustness in the sense of granting resilience against malicious adversaries, more precisely poisoning attacks that target the convergence of SGD. The adversarial model follows the Byzantine abstraction of Lamport et al. (1982). The basis of distributed Byzantine attacks is the disruption of SGD's convergence by tampering with the direction of the descent or the magnitude of the updates. The robustness problem is highly examined and there exists a plethora of aggregation rules (Blanchard et al., 2017; Alistarh et al., 2018; Yin et al., 2018; Damaskinos et al., 2018; Boussetta et al., 2021; El Mhamdi et al., 2018). Nonetheless, recent works (Karimireddy et al., 2022; 2021) highlight the inability of these algorithms to learn on non-IID data.
Indeed, defending against the Byzantine in a heterogeneous setting is not trivial. Naturally, aggregations rely on the similarity between honest workers to defend against the Byzantine. However, as data becomes unbalanced, distinguishing malicious workers from honest ones becomes increasingly challenging: an honest worker may slightly deviate from its peers due to a skewed data distribution. Another weakness of current work is that most aggregations discard a subset of the information, either by dropping full gradient vectors or by eliminating a set of coordinates along each dimension of the submitted vectors. Nevertheless, dropping users' updates leads to (i) degradation of the model's final accuracy, especially when no Byzantine attackers are present, and (ii) possibly discarding minorities with vastly diverging views, as elaborated in Mhamdi et al. (2021).

Motivated by the latter and the scarcity of work on Byzantine non-IID defenses, we study the Byzantine resilience of SGD on non-IID data. These two problems, (1) learning on non-IID data and (2) defending against Byzantine attacks, are similar in the sense that in both cases the central server in charge of aggregating clients' updates is faced with conflicting information. In the Byzantine case the conflict stems from the discrepancy between the updates submitted by the malicious users and the honest ones, whereas in the non-IID case it is a result of the difference in the data distribution between the users. In a realistic setup such as FL, the central server is faced with both challenges. From an optimization perspective, the goal is to find the parameters of a model that simultaneously minimize the loss $f_i$ associated with the local data of each client. More precisely, the objectives in these cases may be conflicting. This type of optimization problem falls under multi-objective optimization and can be formally expressed as follows: $\min_{\theta \in \mathbb{R}^d} \left[ f_1(\theta), f_2(\theta), \ldots, f_n(\theta) \right]^T$.
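To make the per-client objectives above concrete, each $f_i$ can be instantiated as the empirical risk over client $i$'s local data $D_i$; the pointwise loss $\ell$ below is an assumed instantiation for illustration, not fixed by the text:

```latex
% Local empirical risk of worker i over its shard D_i
% (the pointwise loss \ell is an illustrative assumption):
f_i(\theta) = \frac{1}{|D_i|} \sum_{(x_j, y_j) \in D_i} \ell(\theta; x_j, y_j)

% Vector-valued (multi-objective) problem faced by the server:
\min_{\theta \in \mathbb{R}^d} \big[ f_1(\theta), f_2(\theta), \ldots, f_n(\theta) \big]^{T}
```

Under non-IID splits the $D_i$ differ in distribution, so no single $\theta$ need minimize all $f_i$ at once, which is exactly why the problem is multi-objective.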
Multi-objective problems can be solved as mono-objective through linear scalarization. This solution simply consists in introducing a preference vector $\lambda$ that defines a trade-off between the $n$ conflicting objectives, and then optimizing the weighted sum of the objectives as a single one: $\min_{\theta \in \mathbb{R}^d} f(\theta) \stackrel{\text{def}}{=} \sum_{i=1}^{n} \lambda_i f_i(\theta)$. The vector $\lambda$ is referred to as the preference or trade-off vector and is typically chosen such that $\sum_{i=1}^{n} \lambda_i = 1$ and $\lambda_i > 0$. Our insight is to treat the Byzantine-robust learning problem as multi-objective, then to deploy linear scalarization to solve it. That is, the trade-off vector can be leveraged to balance users' updates by penalizing the suspected Byzantine ones without dropping any user's contribution. Additionally, this scheme addresses the trade-off between granting Byzantine resilience and inclusion, as no client update is fully discarded. Our contributions can be summarized as follows:

• We propose RAGG-LS, a linear-scalarization-based aggregation that leverages existing defenses to define a trade-off vector over the $n$ conflicting objectives in order to grant Byzantine resilience in the non-IID setting. Moreover, because the trade-off vector is defined through robust aggregations, updates are balanced according to the level of trust in the corresponding client.

• We evaluate our scheme against state-of-the-art Byzantine attacks. We compare our approach to the bucketing scheme introduced in Karimireddy et al. (2022) as well as the standard IID defenses.
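The contributions above hinge on turning a robust score into a trade-off vector. A minimal sketch of this idea follows, assuming, purely for illustration, that the robust metric C_Ri is each client's distance to the coordinate-wise median (a stand-in for RAGG) and that λ is obtained by exponentially down-weighting high scores; the actual RAGG-LS construction may differ:

```python
import numpy as np

def ls_weights(gradients):
    """Illustrative trade-off vector lambda: clients whose updates lie far
    from the coordinate-wise median (stand-in robust aggregate) get
    exponentially smaller weight. Returns lambda_i > 0 summing to 1."""
    G = np.stack(gradients)                   # shape (n, d)
    med = np.median(G, axis=0)                # robust reference point
    scores = np.linalg.norm(G - med, axis=1)  # C_Ri: distance to the median
    lam = np.exp(-scores / (scores.mean() + 1e-12))
    return lam / lam.sum()

def ragg_ls(gradients):
    """Linear-scalarization aggregate: every client contributes,
    suspected Byzantine ones are only down-weighted, never dropped."""
    G = np.stack(gradients)
    return ls_weights(gradients) @ G          # sum_i lambda_i * g_i
```

Because λ is a convex combination, the aggregate stays inside the convex hull of the submitted gradients, while the penalization pulls it toward the honest majority.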

1.1. NOTATIONS AND SET UP

Notations. In the following, RAGG(·) denotes a given robust aggregation rule; $f_i$ the objective function associated with the $i$-th worker; $G = \{g_1, g_2, \ldots, g_n\}$ the set of users' gradient vectors; $G_t$, $G_b$ respectively the sets of gradient vectors associated with trusted workers and workers suspected to be Byzantine; $\lambda$ the preference or trade-off vector; $n$ the total number of clients, among which a subset $b$ behaves maliciously; $\delta$ the proportion of Byzantine clients, with $\delta_{\max} < n/2$; $D_i = \{(x_j, y_j)\}_{j=1}^{|D_i|}$ the subset of data of worker $i$; and $C_{R_i}$ a robust metric computed from the gradient update of user $i$.

Set up. We consider the standard Parameter Server (PS) architecture, comprised of a central server in charge of the exchange between the $n$ nodes. The PS architecture assumes that the server is trusted and that a proportion $\delta$ of the nodes behave arbitrarily or maliciously. The arbitrary behavior is modelled following the Byzantine abstraction introduced in Lamport et al. (1982). The Byzantine workers may be omniscient (i.e., they have access to the whole dataset and are able to access other clients' updates). We assume the data distribution among clients to be non-IID. We are interested in studying the Byzantine resilience of distributed synchronous SGD.
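As a concrete picture of this setup, one synchronous SGD round at the parameter server can be simulated as follows. The sign-flipped-mean attack and the mean/median aggregation choices are illustrative stand-ins for the attacks and RAGG rules discussed in the text, not the paper's exact protocol:

```python
import numpy as np

rng = np.random.default_rng(0)
n, b, d = 10, 3, 5  # n workers, b Byzantine among them (b < n/2), dimension d

# Stand-ins for honest workers' local gradients on their (non-IID) shards D_i.
honest = [np.ones(d) + 0.1 * rng.standard_normal(d) for _ in range(n - b)]

# Illustrative omniscient attack: the colluding workers submit the negated
# mean of the honest updates, aiming to stall the descent.
attack = -np.mean(honest, axis=0)
submitted = honest + [attack] * b

naive = np.mean(submitted, axis=0)               # plain averaging: corrupted
robust = np.median(np.stack(submitted), axis=0)  # coordinate-wise median as RAGG
```

With b < n/2, the coordinate-wise median lands among honest values in each coordinate, whereas plain averaging is dragged toward the attack direction.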




