DECEPTICONS: CORRUPTED TRANSFORMERS BREACH PRIVACY IN FEDERATED LEARNING FOR LANGUAGE MODELS

Abstract

Privacy is a central tenet of federated learning (FL), in which a central server trains models without centralizing user data. However, gradient updates used in FL can leak user information. While most industrial uses of FL are for text applications (e.g., keystroke prediction), the majority of attacks on user privacy in FL have focused on simple image classifiers and on threat models that assume honest execution of the FL protocol by the server. We propose a novel attack that reveals private user text by deploying malicious parameter vectors, and which succeeds even with mini-batches, multiple users, and long sequences. Unlike previous attacks on FL, the attack exploits characteristics of both the Transformer architecture and the token embedding, separately extracting tokens and positional embeddings to retrieve high-fidelity text. We argue that the threat model of a malicious server is highly relevant from a user-centric perspective, and show that, in this scenario, text applications based on Transformer models are much more vulnerable than previously thought.

1. INTRODUCTION

Federated learning (FL) has recently emerged as a central paradigm for decentralized training. Where training data previously had to be collected and accumulated on a central server, it can now be kept locally, with only model updates, such as parameter gradients, shared and aggregated by a central party. The central tenet of federated learning is that these protocols enable privacy for users (McMahan & Ramage, 2017; Google Research, 2019). This is appealing to industrial interests, as user data can be leveraged to train machine learning models without user concerns about privacy, app permissions, or privacy regulations such as GDPR (Veale et al., 2018; Truong et al., 2021). In reality, however, these federated learning protocols walk a tightrope between actual privacy and the appearance of privacy. Attacks that invert model updates sent by users can recover private information in several scenarios (Phong et al., 2017; Wang et al., 2018) if no measures are taken to safeguard user privacy. Optimization-based inversion attacks have demonstrated the vulnerability of image data when only a few datapoints are used to calculate updates (Zhu et al., 2019; Geiping et al., 2020; Yin et al., 2021). To stymie these attacks, user data can be aggregated securely before being sent to the server as in Bonawitz et al. (2017), but this incurs additional communication overhead and, as such, requires an estimate of the threat posed by inversion attacks at specific levels of aggregation, model architecture, and setting. Most work on gradient inversion attacks so far has focused on image classification problems. Conversely, the most successful industrial applications of federated learning have been in language tasks.
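The basic leakage that inversion attacks exploit can be seen without any optimization at all. The following is a minimal numpy sketch (a toy setting with hypothetical shapes, not any specific attack from the literature): for a linear layer with a single datapoint, the shared weight gradient is a rank-one outer product of the upstream gradient and the input, so the server can read the input off the gradient directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single datapoint passing through a linear layer y = W x + b.
d_in, d_out = 8, 4
x = rng.normal(size=d_in)            # private input the server wants to recover
g_y = rng.normal(size=d_out)         # upstream gradient dL/dy (values don't matter)

# Gradients the user would share in FL:
grad_W = np.outer(g_y, x)            # dL/dW = (dL/dy) x^T
grad_b = g_y                         # dL/db = dL/dy

# Server-side recovery: any row i with grad_b[i] != 0 yields x exactly.
i = int(np.argmax(np.abs(grad_b)))
x_recovered = grad_W[i] / grad_b[i]

assert np.allclose(x_recovered, x)
```

Averaging over a batch destroys this exact rank-one structure, which is why optimization-based attacks are needed in the aggregated setting.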
There, federated learning is not just a promising idea; it has been deployed to consumers in production, for example to improve keystroke prediction (Hard et al., 2019; Ramaswamy et al., 2019) and settings search on the Google Pixel (Bonawitz et al., 2019). However, attacks in this area have so far shown only limited success (Zhu et al., 2019; Dimitrov et al., 2022), even for massive models such as BERT (with worse recovery for smaller models). This leaves the impression that these models are already hard to invert, and that limited aggregation is already sufficient to protect user privacy, without the need to employ stronger defenses such as local or distributed differential privacy (Dwork & Roth, 2013). In this work, we revisit the privacy of Transformer models. We focus on the realistic threat model in which the user does not trust the server-side behavior, and show that a malicious update sent by the server can completely corrupt the behavior of user-side models, coercing them to spill significant amounts of user data. The server can then recover the original words and sentences entered by the user with straightforward statistical evaluations and assignment problems. We show for the first time that recovery of all tokens and most of their absolute positions is feasible even on the order of several thousand tokens, and even for small models only 10% the size of the BERT variants discussed for FL use in Wang et al. (2021). Furthermore, in contrast to previous work, which only discusses attacks on updates aggregated over a few users, this attack breaks privacy even when updates are aggregated over more than 100 users. We hope that these observations can contribute to a re-evaluation of privacy risks in FL applications for language.
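The "assignment problems" mentioned above can be sketched in isolation. The following is a minimal numpy illustration, not the paper's actual pipeline: it assumes the server knows the model's positional-embedding table (servers distribute the parameters, so this is realistic) and matches noisy recovered vectors back to positions greedily by cosine similarity; a full attack could use the Hungarian algorithm instead.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical known positional-embedding table.
num_pos, dim = 16, 32
pos_table = rng.normal(size=(num_pos, dim))

# Suppose the attack yields noisy positional components in scrambled order.
perm = rng.permutation(num_pos)
recovered = pos_table[perm] + 0.01 * rng.normal(size=(num_pos, dim))

# Cosine-similarity matrix between recovered vectors and table rows.
a = recovered / np.linalg.norm(recovered, axis=1, keepdims=True)
b = pos_table / np.linalg.norm(pos_table, axis=1, keepdims=True)
sim = a @ b.T

# Greedy assignment: repeatedly take the globally best (vector, position) pair.
assignment = np.full(num_pos, -1)
for _ in range(num_pos):
    r, p = np.unravel_index(np.argmax(sim), sim.shape)
    assignment[r] = p
    sim[r, :] = -np.inf
    sim[:, p] = -np.inf

assert np.array_equal(assignment, perm)  # every vector mapped to its true position
```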

2. MOTIVATION AND THREAT MODEL

At first glance, gradients from Transformer architectures might not appear to leak significant amounts of user data. Both the attention mechanism and the linear components learn operations that act individually on tokens, so that their gradients are naturally averaged over the entire length of the sequence (e.g., 512 tokens). Although most architectures feature large linear layers, this mixing of information reduces the utility of their content to an attacker. In fact, the only operation that "sees" the entire sequence, the scaled dot-product attention, is non-learned and does not leak separate gradients for each entry in the sequence. If one draws intuition from vision-based attacks, gradients whose components are averaged over 512 images are impossible to invert even for state-of-the-art attacks (Yin et al., 2021). On the other hand, recovering text appears much more constrained than recovering images. The attacker knows from the beginning that only tokens in the vocabulary are possible solutions; to reconstruct the input sequence perfectly, it is only necessary to identify these tokens and to find their locations from a limited list of known positions.

Figure 1: An example reconstruction from a small GPT-2 model using a Decepticon attack, showing the first 20 tokens reconstructed from a randomly selected user for different combinations of sequence length and batch size on a challenging text fragment. Highlighted text represents exact matches for position and token.

Previous gradient inversion attacks in FL have been described for text in Deng et al. (2021); Zhu et al. (2019); Dimitrov et al. (2022) and have focused on optimization-based reconstruction in the honest-but-curious server model. Recently, Gupta et al. (2022) use a language model prior to improve attack success via beam search guided by the user gradient. Our work is further related to investigations of the unintended memorization abilities of fully trained language models.
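One structural property that makes text recovery tractable can be illustrated directly: the gradient of the token-embedding matrix is nonzero only at the rows of tokens that actually appear in the user's text. The following is a toy numpy sketch of this observation (hypothetical sizes, a simulated backward pass rather than a real model), not the full attack:

```python
import numpy as np

rng = np.random.default_rng(2)

vocab, dim, seq_len = 50, 16, 6
emb = rng.normal(size=(vocab, dim))                      # token-embedding matrix
tokens = rng.choice(vocab, size=seq_len, replace=False)  # private token sequence

# Simulated backward pass through the embedding lookup: each used row of
# `emb` accumulates an upstream gradient; unused rows stay exactly zero.
grad_emb = np.zeros_like(emb)
upstream = rng.normal(size=(seq_len, dim))               # arbitrary upstream grads
for t, g in zip(tokens, upstream):
    grad_emb[t] += g

# Server-side: the nonzero rows of the shared embedding gradient expose
# exactly which vocabulary entries appeared in the user's text.
leaked = np.flatnonzero(np.abs(grad_emb).sum(axis=1) > 0)
assert set(leaked) == set(tokens)
```

This leaks token identities but not their order; recovering positions is the separate assignment step, and aggregation over many sequences blurs (but, as this paper argues, does not eliminate) the signal.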

