PANNING FOR GOLD IN FEDERATED LEARNING: TARGETED TEXT EXTRACTION UNDER ARBITRARILY LARGE-SCALE AGGREGATION

Abstract

As federated learning (FL) matures, privacy attacks against FL systems have in turn become more numerous and complex. Attacks on language models have progressed from recovering single sentences in simple classification tasks to recovering larger parts of user data. Current attacks against federated language models are sequence-agnostic and aim to extract as much data as possible from an FL update, often at the expense of fidelity for any particular sequence. Because of this, current attacks fail to extract any meaningful data under large-scale aggregation. In realistic settings, an attacker cares most about the small portion of user data that contains sensitive personal information, for example sequences containing the phrase "my credit card number is ...". In this work, we propose the first attack on FL that achieves targeted extraction of sequences containing privacy-critical phrases: we employ maliciously modified parameters that allow the transformer itself to filter relevant sequences out of aggregated user data and encode them in the gradient update. Our attack can effectively extract sequences of interest even against extremely large-scale aggregation.

1. INTRODUCTION

Industrial machine learning models are often trained on large sets of user data. In a traditional centralized training paradigm, this is done by aggregating user data into a large repository. Unfortunately, when user data contains personal information in the form of text, images, or other media, dataset aggregation leads to significant security, regulatory, and liability risks. Against this backdrop, federated learning (FL) has emerged as a popular way to train models with decentralized data, that is, without the need for a central party to host a dataset. By exchanging only model gradients, user devices collaboratively train a model without the direct exchange of plaintext data. In many applications, FL is slower than centralized training (Bonawitz et al., 2019), but the privacy benefits outweigh the costs, especially in next-word text prediction, which requires training on private text from smartphones (Hard et al., 2019).

Privacy through federated learning is sometimes taken for granted. In reality, the actual privacy achieved by federated learning systems depends on a large number of factors and parameters: model size, architecture, number of users, the aggregation scheme, and more. Attacks against privacy in FL probe this boundary, empirically discovering pitfalls that should be considered and avoided when designing federated protocols (Phong et al., 2017; Melis et al., 2019; Geiping et al., 2020). In this work, we study the security of federated learning systems involving transformer architectures (Vaswani et al., 2017), which form the backbone of many recent advancements in natural language processing (Brown et al., 2020; Dosovitskiy et al., 2021; Jumper et al., 2021), and especially applications in text, which represent a key point of interest in many modern applications of federated learning (Paulik et al., 2021; Dimitriadis et al., 2022).
Our main threat model of interest is the untrusted server scenario, also known as the malicious server scenario, in which the server may make changes to model parameters in order to break user privacy. This is in contrast to the honest-but-curious threat model, in which no malicious changes are permitted to the model training protocol. Untrusted server scenarios are of critical importance from a user-centric privacy perspective.


Figure 1: The proposed attack "tags" and filters tokens so that they can be reconstructed from gradient information. Malicious model parameters for a standard transformer (here with causal structure) are sent to user devices. The attack uses one head to tag each token in a sequence with the first token of that sequence (here in red, green, blue). This enables the attacker to group tokens into sequences after they are extracted from a gradient update. The attacker then uses two more heads to tag each token that follows the keywords "credit" and "card" (yellow). These yellow target tokens will influence the gradient computation, while all others are filtered out. Finally, the attacker recovers the targeted tokens from the gradient of the modified model returned by the user.

If a federated system is supposed to uphold privacy, then ideally this privacy can be guaranteed without having to assert full trust in the server. After all, a perfectly trustworthy server could run the simplest FL protocol: centralize all user data, promise not to share it, train a centralized model, and delete all data. Another way to look at untrusted server threats is to view them as a glimpse of worst-case dataset security. Even if we believe a server will uphold privacy, we might wonder about the worst-case loss in privacy that would arise if this server is even briefly compromised (through either classical security breaches or poisoning attacks (Bagdasaryan et al., 2019)) and acts maliciously. A major strategy available to a malicious server is to modify the current state of a machine learning model as it is being trained, and then broadcast this corrupted model to the users. As the model is directly executed on each user device, this can be considered an analogue to untrusted code instructions that are evaluated on a user's private data (OWASP, 2022; Fowl et al., 2022).
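The tagging-and-filtering behavior described in Figure 1 amounts to selecting only the tokens that follow a chosen trigger phrase. The actual attack performs this selection inside the transformer via modified attention heads; the host-side sketch below only illustrates the selection logic, and the function name and window size are our own illustrative choices, not part of the paper's construction.

```python
# Toy sketch of the filtering step from Figure 1: keep only tokens that
# appear shortly after each occurrence of the trigger phrase.
def tokens_after_trigger(tokens, trigger, window=8):
    """Return tokens within `window` positions after each occurrence
    of `trigger` (a tuple of tokens) in the token stream."""
    out = []
    n = len(trigger)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == tuple(trigger):
            out.extend(tokens[i + n:i + n + window])
    return out

stream = "yeah can you use the credit card 4744 6788 5552 4418 thanks".split()
print(tokens_after_trigger(stream, ("credit", "card"), window=4))
# -> ['4744', '6788', '5552', '4418']
```

In the attack itself, this decision is computed per token by attention heads comparing token embeddings against the trigger embeddings, so the selection happens on-device during the ordinary forward pass.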
Despite the inherent power that a malicious server has, extracting user data is still extremely difficult when gradients are aggregated over many users, in which case the averaged gradient does not contain enough entries to record the whole global training batch. For this reason, existing attacks on text models only recover user data in scenarios where the number of model parameters is significantly larger than the number of tokens in a user update (Fowl et al., 2022; Gupta et al., 2022; Dimitrov et al., 2022; Pasquini et al., 2021). In some cases, attacks can siphon random examples of user data, but only through a large number of repeated queries (Wen et al., 2022). In this work, we discuss a novel attack on text models whereby a malicious server is able to pick and choose which data to encode into and extract from the model gradient, even with industrial-scale aggregation. The attacker selects a trigger phrase, such as "credit card number" or "social security number," and extracts all tokens of user data that follow the occurrence of this trigger. We call this process, in which we sift selected phrases out of a large corpus of user data, "panning." In contrast to existing attacks, this attack does not degrade when many user updates are securely aggregated (Bonawitz et al., 2017). For this reason, panning represents an essential shift in capabilities for attacks against transformer-based models in federated learning.
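The recovery primitive that makes such encoding possible can be illustrated with a toy linear layer, in the spirit of earlier linear-layer leakage attacks (e.g. Fowl et al., 2022): if exactly one selected token activates a given neuron, dividing that neuron's weight gradient by its bias gradient reveals the token's hidden state exactly. The setup below is a simplified sketch under that assumption, not the paper's actual construction; the activation pattern is simulated with a hand-set mask.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 4, 8
tokens = rng.normal(size=(n_tokens, d))  # hidden states of four tokens

# Suppose the malicious heads have "tagged" token 2 (it follows the
# trigger phrase) so that one neuron of a modified linear layer fires
# for it and no other token. We simulate the firing pattern with a
# 0/1 mask; this is illustrative, not the paper's construction.
mask = np.array([0.0, 0.0, 1.0, 0.0])

# For loss = mean_i relu(w . x_i + b), the gradients reduce to
#   dL/dw = sum_i 1[neuron fired on x_i] * x_i / n
#   dL/db = sum_i 1[neuron fired on x_i] / n
grad_w = (mask[:, None] * tokens).sum(axis=0) / n_tokens
grad_b = mask.sum() / n_tokens

# Dividing the two gradients cancels the 1/n averaging and recovers
# the tagged token's hidden state exactly.
recovered = grad_w / grad_b
assert np.allclose(recovered, tokens[2])
```

Because the filtered-out tokens contribute zero gradient to these neurons, the division survives averaging over a batch, which is the property that lets the attack tolerate large-scale aggregation.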



[Figure 1, example user data: "Hey, do you want me to put the drinks on your tab? Yeah, can you use the credit card 4744 67 88 55 52 44 18. Oof did I tell you that my cat woke me up tonight?"]

Text models were among the first and most successful systems in which federated learning has been used in industrial settings. These applications include keystroke prediction (Hard et al., 2019; Ramaswamy et al., 2019), settings search (Bonawitz et al., 2019), news personalization (Paulik et al., 2021), and improved messenger services on Android (Google, 2022). In the latter case, the documentation of

