GREY-BOX EXTRACTION OF NATURAL LANGUAGE MODELS

Abstract

Model extraction attacks attempt to replicate a target machine learning model from predictions obtained by querying its inference API. An emerging class of attacks exploits algebraic properties of DNNs (Carlini et al., 2020; Rolnick & Körding, 2020; Jagielski et al., 2020) to obtain high-fidelity copies using orders of magnitude fewer queries than the prior state-of-the-art. So far, such powerful attacks have been limited to networks with few hidden layers and ReLU activations. In this paper we present algebraic attacks on large-scale natural language models in a grey-box setting, targeting models with a pre-trained (public) encoder followed by a single (private) classification layer. Our key observation is that a small set of arbitrary embedding vectors is likely to form a basis of the classification layer's input space, which a grey-box adversary can compute. We show how to use this information to solve an equation system that determines the classification layer from the corresponding probability outputs. We evaluate the effectiveness of our attacks on different sizes of transformer models and downstream tasks. Our key findings are that (i) with frozen base layers, high-fidelity extraction is possible with a number of queries as small as twice the input dimension of the last layer. This is true even for queries that are entirely in-distribution, making extraction attacks indistinguishable from legitimate use; (ii) with fine-tuned base layers, the effectiveness of algebraic attacks decreases with the learning rate, showing that fine-tuning is not only beneficial for accuracy but also indispensable for model confidentiality.

1. INTRODUCTION

Machine learning models are often deployed behind APIs that enable querying the model but prevent direct access to the model parameters. This restriction aims to protect intellectual property, as models are expensive to train and hence valuable (Strubell et al., 2019); security, as access to model parameters facilitates the creation of adversarial examples (Laskov et al., 2014; Ebrahimi et al., 2018); and privacy, as model parameters carry potentially sensitive information about the training data (Leino & Fredrikson, 2020). Model extraction attacks (Tramèr et al., 2016) attempt to replicate machine learning models from sets of query-response pairs obtained via the model's inference API, thus effectively circumventing the protection offered by the API. Several extraction attacks on deep neural networks (see Jagielski et al. (2020) for a recent overview) follow a learning-based approach (Tramèr et al., 2016; Orekondy et al., 2018; Pal et al., 2020; Krishna et al., 2020), where the target model is queried to label data used for training the replica. The replicas obtained in this way aim to achieve accuracy on the desired task, or agreement with the target model on predictions, but recovery of the model weights is out of scope of this approach. More recently, a novel class of attacks has emerged that uses algebraic techniques to recover the weights of deep neural networks up to model-specific invariances. Examples are the attacks of Milli et al. (2018), which leverage observations of gradients to recover model parameters, and of Rolnick & Körding (2020), Jagielski et al. (2020), and Carlini et al. (2020), which estimate gradients from finite differences of logits and then use this information to recover model parameters. Algebraic attacks improve on learning-based attacks in that they (i) achieve higher-fidelity replicas and (ii) are orders of magnitude more query-efficient. So far, however, algebraic attacks have only been applied to small, fully connected neural networks with ReLU activations. In particular, for modern large-scale natural language models (LLMs) such as BERT or GPT-2, the state-of-the-art model extraction attack is still learning-based (Krishna et al., 2020). In this paper, we propose the first algebraic attacks on LLMs.
We focus on models consisting of a pre-trained encoder and a single task-specific classification layer. We assume a grey-box setting where the encoder is public (and hence known to the adversary), and the classification layer is private (and hence the main target of the attack). There are two key observations that enable us to extract LLMs via algebraic attacks. The first is that it is sufficient for an adversary to know rather than to choose the embeddings that are fed into the last layer. Existing algebraic attacks can infer the inputs to hidden layers, but can only do so on piecewise linear networks and would not work on LLMs, which use non-linear activations. In the grey-box setting, an adversary can compute hidden embeddings of any input by querying the public encoder model, and can query the target LLM on the same input through the model's API. We show in theory, and confirm by experiments, that a random set of n embeddings is likely to form a basis of the last layer's input space. The raw outputs (i.e., the logits) on this basis uniquely determine the parameters of the last linear layer, which can be recovered by a transformation to the standard basis. Our second observation is that this approach extends to the case where the API returns probabilities rather than raw logits, after normalization by the softmax function. For this, we leverage the invariance under translation of softmax to establish an invariance result for linear functions. Using this result, we show that the parameters of the last layer can be recovered (up to invariance) from embedding vectors spanning its input space and their corresponding probability outputs. We evaluate our attacks on LLMs of different sizes and fine-tuned to different downstream tasks. We study the effects of using different types and numbers of extraction queries and different learning rates for fine-tuning the encoder model. 
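The linear-algebraic core of the first observation can be sketched in a few lines of numpy (a toy reconstruction under our own naming, not the paper's code): given embedding vectors that span the last layer's input space and the corresponding raw logits, the affine classification layer f(x) = Ax + b is determined by solving one linear system.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 16, 4                        # embedding dimension, number of classes

# Hypothetical private classification layer: f(x) = A x + b.
A = rng.normal(size=(m, n))
b = rng.normal(size=m)

# Grey-box attacker: embeddings of arbitrary inputs, computed with the
# public encoder.  n + 1 generic points determine an affine map on R^n.
X = rng.normal(size=(n + 1, n))

# Responses observed through the API (here: raw logits).
Y = X @ A.T + b                     # shape (n + 1, m)

# Solve [X | 1] W = Y for W = [A^T; b] by least squares; for generic
# (random) embeddings the system has a unique exact solution.
X1 = np.hstack([X, np.ones((n + 1, 1))])
W, *_ = np.linalg.lstsq(X1, Y, rcond=None)
A_hat, b_hat = W[:n].T, W[n]

assert np.allclose(A_hat, A) and np.allclose(b_hat, b)
```

If the API returns probabilities rather than logits, taking logarithms recovers the logits only up to a per-query additive constant, so the same system determines (A, b) only up to the softmax translation invariance described above.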
Our key findings are:
• When the target model's base layers are frozen during fine-tuning (i.e., the attacker can get the exact embedding of any input), the attack is extremely effective. With only twice as many queries as the dimension of the embedding space (e.g., 1536 for BERT-base), we extract models that achieve 100% fidelity with the target, for all model sizes and tasks.
• When the model's base layers are fine-tuned together with the task-specific layer, the embeddings of the base model only approximate those of the target model and, as expected, the fidelity of the extracted models decreases as the learning rate grows. Perhaps surprisingly, for some models and downstream tasks, we are still able to extract replicas with up to 82% fidelity and up to 79% task accuracy, using orders of magnitude fewer queries than required by state-of-the-art learning-based attacks (Krishna et al., 2020).
• Extraction is possible using either random or in-distribution queries. Replicas extracted using in-distribution queries perform well on both in-distribution and random challenge inputs. This shows that replicas can be created from small numbers of in-distribution queries, making attempts to extract the model indistinguishable from legitimate use.
In summary, we propose a novel grey-box extraction attack on natural language models that is indistinguishable from legitimate use in terms of the content and number of queries required.
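Fidelity in the sense used above is prediction agreement between the replica and the target on a set of challenge inputs. A minimal sketch of this metric (the function name and toy data are ours, for illustration):

```python
import numpy as np

def fidelity(target_logits: np.ndarray, replica_logits: np.ndarray) -> float:
    """Fraction of challenge inputs on which the replica's argmax
    prediction agrees with the target's."""
    return float(np.mean(target_logits.argmax(axis=1) ==
                         replica_logits.argmax(axis=1)))

# Toy check: identical logits give 100% fidelity; independent random
# logits over m = 4 classes agree on roughly 1/4 of the inputs.
rng = np.random.default_rng(1)
t = rng.normal(size=(1000, 4))
assert fidelity(t, t) == 1.0
assert abs(fidelity(t, rng.normal(size=(1000, 4))) - 0.25) < 0.1
```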

2. ATTACK

We consider classification models h : X → R^m, mapping elements from X to label log-probabilities in R^m. We assume that h = log • softmax • f • g consists of three components:

    X --g--> R^n --f--> R^m --log • softmax--> R^m    (1)


where
• g : X → R^n is a contextualized embedding model, such as BERT or GPT-2;
• f : R^n → R^m is an affine function computing logits from embeddings, i.e., f(x) = Ax + b with A ∈ R^{m×n} and b ∈ R^m;
• softmax : R^m → R^m normalizes logits to probability vectors: softmax(x)_i = exp(x_i) / Σ_{j=1}^m exp(x_j).
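The decomposition in Eq. (1) and the translation invariance of softmax that our attack leverages can be checked directly (a numpy sketch; the vectors v and c parameterizing the invariance class are our notation):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())        # max-subtraction for numerical stability,
    return z / z.sum()             # itself an instance of the invariance below

rng = np.random.default_rng(0)
n, m = 8, 3
A, b = rng.normal(size=(m, n)), rng.normal(size=m)

# softmax is invariant under adding the same scalar to every logit.
# Hence (A, b) and (A + 1 v^T, b + c 1) produce identical probability
# outputs for every input x: the extra term (v.x + c) shifts all m
# logits by the same amount.
v, c = rng.normal(size=n), rng.normal()
A2 = A + np.outer(np.ones(m), v)
b2 = b + c

x = rng.normal(size=n)
assert np.allclose(softmax(A @ x + b), softmax(A2 @ x + b2))
```

This is why probability outputs determine the last layer only up to this invariance class, whereas raw logits determine it exactly.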

