INVARIANT AGGREGATOR FOR DEFENDING AGAINST FEDERATED BACKDOOR ATTACKS

Abstract

Federated learning is gaining popularity as it enables training of high-utility models across several clients without directly sharing their private data. As a downside, the federated setting makes the model vulnerable to various adversarial attacks in the presence of malicious clients. Specifically, an adversary can perform backdoor attacks to control model predictions by poisoning the training dataset with a trigger. In this work, we propose a mitigation for backdoor attacks in a federated learning setup. Our solution forces the model optimization trajectory to focus on the invariant directions that are generally useful for utility and to avoid directions that favor a few, possibly malicious, clients. Concretely, we consider the sign consistency of the pseudo-gradient (the client update) as an estimation of the invariance. Following this, our approach performs dimension-wise filtering to remove pseudo-gradient elements with low sign consistency. Then, a robust mean estimator eliminates outliers among the remaining dimensions. Our theoretical analysis further shows the necessity of the defense combination and illustrates how our proposed solution defends the federated learning model. Empirical results on three datasets with different modalities and varying numbers of clients show that our approach mitigates backdoor attacks at a negligible cost to model utility.

1. INTRODUCTION

Federated learning enables multiple distrusting clients to jointly train a machine learning model without sharing their private data directly. However, a rising concern in this setting is the ability of potentially malicious clients to perpetrate backdoor attacks. To this end, it has been argued that conducting backdoor attacks in a federated learning setup is practical (Shejwalkar et al., 2022) and can be effective (Wang et al., 2020). For instance, the adversary can connect to a federated learning system as a legitimate user and conduct a backdoor attack that forces the model to mispredict. The impact of such attacks is severe in many mission-critical federated learning applications. For example, anomaly detection is a common federated learning task in which multiple parties (e.g., banks or email users) collaboratively train a model that detects fraud or phishing emails; backdoor attacks allow the adversary to circumvent these detection methods. The most common backdoor attack embeds triggers in the data samples and forces the model to make an adversary-specified prediction when the trigger is observed (Liu et al., 2018; Bagdasaryan et al., 2020). Thus, an adversary can conduct a backdoor attack by generating a trigger that statistically correlates with a particular label. Once the adversary injects these trigger-embedded backdoor samples into the training data, the model can entangle the trigger-label correlation and predict as the adversary specifies. Meanwhile, the backdoor attack often does not degrade predictive accuracy on benign samples, making backdoor detection difficult in practice (Wang et al., 2020). In federated learning, the server aggregates only the client-level updates (a.k.a. pseudo-gradients, or gradients for short) without control over the training procedure or any data samples.
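As an illustration of this trigger-label correlation, the NumPy sketch below stamps a fixed feature pattern onto a random fraction of training samples and relabels them with the adversary's target label. The function and parameter names (`poison`, `trigger_idx`, `rate`) are hypothetical, chosen for this example only, and the fixed-value trigger is one simple instance of the digital triggers discussed above, not the paper's attack.

```python
import numpy as np

def poison(features, labels, trigger_idx, trigger_value, target_label, rate, rng):
    """Stamp a fixed trigger pattern onto a random fraction of samples and
    relabel them with the adversary-chosen target label."""
    features, labels = features.copy(), labels.copy()
    n_poison = int(rate * len(labels))
    chosen = rng.choice(len(labels), size=n_poison, replace=False)
    features[np.ix_(chosen, trigger_idx)] = trigger_value  # embed the trigger
    labels[chosen] = target_label                          # force the correlation
    return features, labels

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))     # benign features
y = rng.integers(0, 2, size=100)  # benign labels
Xp, yp = poison(X, y, trigger_idx=[0, 1], trigger_value=3.0,
                target_label=1, rate=0.1, rng=rng)
```

After poisoning, every sample carrying the trigger pattern is labeled with the target class, while the remaining samples are untouched, which is why benign accuracy is typically preserved.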
Such limited visibility of the federated learning server into the client-side training makes defending against backdoor attacks challenging. Common defenses against backdoor attacks aim at identifying the backdoor data samples or poisoned model parameters and usually require access to at least a subset of the training data (Tran et al., 2018; Li et al., 2021a), which is prohibitive for a federated learning server. Other defense methods against untargeted poisoning attacks that degrade the model utility (Shejwalkar et al., 2022) are applicable but lack robustness against backdoor attacks, as discussed in Section 6.2.

Our approach. Our defense leverages the observation that learning from the poisonous data does not benefit the model on benign data, and vice versa. Therefore, focusing on the invariant directions that are generally beneficial along the model optimization trajectory helps defend against the aforementioned backdoor attacks (which often lead to non-invariant directions). To this end, we develop a defense that examines each dimension of the gradients on the server side and checks whether the dimension-wise gradients point in the same direction across the clients. Here, a dimension-wise gradient can point in a positive or negative direction, or have a zero value. With small learning rates and for a specific dimension, two gradients pointing in the same direction means that taking the direction of one gradient can benefit the other. As such, the invariance of a direction depends on how many dimension-wise gradients align with that direction. Following this intuition, we define the sign consistency of a dimension as the average gradient sign. The higher the sign consistency, the more invariant the direction of that gradient dimension may be. Designing such a method that carefully selects only the invariant gradient directions is non-trivial, especially given the non-i.i.d. gradient distributions across benign clients and the presence of malicious clients.
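The sign-consistency statistic described above can be sketched in a few lines of NumPy. As a simplifying assumption for this illustration, we take the absolute value of the mean sign so that unanimous agreement in either direction counts as fully consistent; this is a sketch of the idea, not the paper's implementation.

```python
import numpy as np

def sign_consistency(updates):
    """Absolute mean of the element-wise signs of the client pseudo-gradients:
    1.0 when all clients agree on a dimension's direction, near 0 when the
    directions conflict (a non-invariant dimension)."""
    return np.abs(np.mean(np.sign(updates), axis=0))

# Toy example: 4 clients, 3 parameter dimensions.
updates = np.array([
    [ 0.9, -0.2,  0.1],
    [ 1.1,  0.3,  0.2],
    [ 0.8, -0.4,  0.3],
    [ 1.0,  0.5, -0.1],
])
consistency = sign_consistency(updates)
```

Here dimension 0 has all four clients agreeing on the positive direction (consistency 1.0), dimension 1 splits evenly (0.0), and dimension 2 has three of four clients aligned (0.5).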
Hence, our approach enforces two separate treatments for each gradient dimension. First, we employ an AND-mask (Parascandolo et al., 2021), a dimension-wise filter that sets gradient dimensions with sign consistency below a given threshold to zero. However, this alone is not enough: the malicious clients can still use outliers to mislead the aggregation result in the remaining highly consistent dimensions. To address this issue, we propose using the trimmed-mean estimator (Xie et al., 2020b; Lugosi & Mendelson, 2021) as a means to remove the outliers. Our analysis suggests that the AND-mask complements the trimmed-mean estimator well, motivating their composition. We support the proposed approach with a theoretical analysis under a conventional linear regime (Rosenfeld et al., 2021; Wang et al., 2022; Zhou et al., 2022; Manoj & Blum, 2021), showing that the composition of the AND-mask and the trimmed-mean estimator is necessary for defending against backdoor attacks. Our analysis starts with feature invariance and discusses the connection between feature invariance and gradient sign consistency. Then, we outline conditions under which trigger-based backdoor attacks can lead to non-invariant directions and decrease the sign consistency of a dimension. Further analysis demonstrates the necessity of combining the AND-mask and the trimmed-mean estimator. Simulation results in Appendix D.1 further verify our theoretical results. Our empirical evaluation employs the strong edge-case backdoor attack (Wang et al., 2020), as detailed in Section 6.1, to test our defense. Empirical results on tabular (phishing emails), visual (CIFAR-10) (Krizhevsky, 2009; McMahan et al., 2017), and text (Twitter) (Caldas et al., 2018) datasets demonstrate that our method is effective in defending against backdoor attacks without degrading utility compared to prior works.
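The two-stage composition can be sketched as follows. The threshold and trim fraction below (`mask_threshold`, `trim_frac`) are illustrative parameters, not the paper's settings, and the absolute-mean-sign consistency measure is likewise a simplifying assumption for this sketch.

```python
import numpy as np

def invariant_aggregate(updates, mask_threshold=0.5, trim_frac=0.1):
    """Sketch of the two-stage defense: (1) an AND-mask zeroes dimensions
    whose sign consistency falls below `mask_threshold`; (2) a trimmed mean
    over the surviving dimensions discards the smallest and largest
    `trim_frac` fraction of client values before averaging."""
    updates = np.asarray(updates, dtype=float)
    n = updates.shape[0]
    consistency = np.abs(np.mean(np.sign(updates), axis=0))
    mask = consistency >= mask_threshold        # stage 1: AND-mask
    k = int(trim_frac * n)                      # clients trimmed per side
    sorted_u = np.sort(updates, axis=0)
    trimmed = (sorted_u[k:n - k].mean(axis=0)   # stage 2: trimmed mean
               if n - 2 * k > 0 else updates.mean(axis=0))
    return np.where(mask, trimmed, 0.0)

# 10 clients, 2 dimensions: dimension 0 is sign-consistent but contains one
# malicious outlier (100.0); dimension 1 has conflicting signs across clients.
updates = [[1.0, 1.0]] * 5 + [[1.0, -1.0]] * 4 + [[100.0, -1.0]]
agg = invariant_aggregate(updates)
```

In this toy run the trimmed mean removes the outlier in the consistent dimension, while the AND-mask zeroes the conflicting dimension outright, illustrating why each stage handles a failure mode the other misses.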
On average, our approach decreases the model accuracy on backdoor samples by 61.6% and loses only 1.2% accuracy on benign samples compared to the standard FedAvg aggregator (McMahan et al., 2017).

Contributions. Our contributions are as follows:
• We develop a combination of defenses using an AND-mask and the trimmed-mean estimator against backdoor attacks by focusing on the dimension-wise invariant directions in the model optimization trajectory.
• We theoretically analyze our strategy and demonstrate that the combination of an AND-mask and the trimmed-mean estimator is necessary under certain conditions.
• We empirically evaluate our method on three datasets with varying modalities, model architectures, and client numbers, and compare its performance to existing defenses.

2. RELATED WORK

Backdoor Attack. Common backdoor attacks aim at misleading the model predictions using a trigger (Liu et al., 2018). The trigger can be digital (Bagdasaryan et al., 2020), physical (Wenger et al., 2021), semantic (Wang et al., 2020), or invisible (Li et al., 2021b). Recent works extended

