DEFENDING AGAINST RECONSTRUCTION ATTACKS WITH RÉNYI DIFFERENTIAL PRIVACY

Abstract

Reconstruction attacks allow an adversary to regenerate data samples of the training set using only access to a trained model. Recent work has shown that simple heuristics can reconstruct data samples from language models, making this threat scenario an important consideration for model release. Differential privacy is a known defense against such attacks, but it is often used with a large privacy budget (ε ≥ 8), which does not translate into meaningful guarantees. Our main contribution is a set of stronger privacy guarantees against reconstruction attacks that improve on the existing literature. In particular, we show that larger privacy budgets do not provably protect against membership inference, but can still protect against the extraction of high-entropy secrets. We design a method to efficiently run reconstruction attacks with lazy sampling and empirically show that it can expose at-risk training samples from non-private language models. We show experimentally that our guarantees hold on language models of practical significance trained with differential privacy, including GPT-2 fine-tuned on Wikitext-103.

1. INTRODUCTION

Probabilistic generative models are trained to assign high likelihood to data from the training set. In the case of language models, the decomposition P(x_1, …, x_T) = ∏_{i=1}^{T} P(x_i | x_{<i}) also allows for efficient sampling of sequences. Since such models overfit on the training data, sampling from a trained model will sometimes yield verbatim sentences from the training set. Carlini et al. (2021b) leverage this effect along with clever filtering techniques to produce samples that are likely to be in the training set. Their work demonstrates that reconstruction attacks on large-scale generative language models such as GPT-2 (Radford et al., 2019) are not only possible but also successful: their best attack reaches 66% precision on the top-100 generated sentences.

Another category of attacks is membership inference, where the adversary has access to both the trained model and a data sample, and has to predict whether this data sample comes from the training set. Membership inference is easier than reconstruction because the adversary only needs to produce a binary decision rather than the sample itself.

The standard defense against such privacy attacks is differential privacy (DP) (Dwork et al., 2006; Dwork & Roth, 2014). DP defines a privacy budget ε that can be used to control the privacy/utility tradeoff of the trained model. However, there is no consensus over acceptable values of ε and, in practice, ε is often chosen to defeat practical membership inference attacks (Watson et al., 2021; Carlini et al., 2021a). In this paper, we show that Rényi differential privacy (RDP) (Mironov, 2017) can actually provide guarantees on the probability of reconstructing a sample; these guarantees are stronger than existing ones for the same mechanism. In particular, we show that there is an intermediate regime in which membership inference is not prevented but full reconstruction of sufficiently high-entropy secrets remains difficult.
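The autoregressive decomposition above makes left-to-right (ancestral) sampling straightforward. A minimal sketch, using a hypothetical toy next-token distribution as a stand-in for a trained language model (the vocabulary and probabilities are made up for illustration):

```python
import random

# Toy stand-in for a trained model's conditional P(x_i | x_<i):
# the vocabulary and probabilities are hypothetical, for illustration only.
def next_token_probs(prefix):
    if prefix and prefix[-1] == "the":
        return {"cat": 0.7, "dog": 0.2, "<eos>": 0.1}
    return {"the": 0.6, "<eos>": 0.4}

def sample_sequence(max_len=10, seed=0):
    """Ancestral sampling: draw x_i ~ P(. | x_<i) left to right,
    which is efficient thanks to the autoregressive factorization."""
    rng = random.Random(seed)
    seq = []
    for _ in range(max_len):
        probs = next_token_probs(seq)
        tokens, weights = zip(*probs.items())
        tok = rng.choices(tokens, weights=weights)[0]
        if tok == "<eos>":
            break
        seq.append(tok)
    return seq

print(sample_sequence())
```

Repeated sampling of this kind, combined with filtering, is the basic machinery behind the reconstruction attacks discussed above: sequences memorized by an overfit model are disproportionately likely to be emitted.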
In this intermediate regime, an adversary who knows a secret s can determine whether it was in the training set, but extracting it from the model remains hard if they do not know s in the first place. We refer to samples that an adversary tries to reconstruct as secrets. Not all samples from the training set would be considered "secret" in the sense that their content is public knowledge, such as newspaper headlines. We circumvent this issue by considering all samples as secrets, and quantifying their level of secrecy by the number of unknown bits of information. Specifically, given a probabilistic reconstruction model A that generates secrets from a trained model θ, s ∼ A(θ), the secrecy of a sample s is b ≜ log_2(1/P(A(θ) = s)). An adversary needs on average 2^b trials to chance upon the secret s.

Figure 1: We train a neural network using DP-SGD and consider various levels of DP noise σ (increasing from left to right). We consider a secret of varying length (in bits b, x-axis) in the dataset and wish to estimate the number of bits of this secret that will leak after training (y-axis). For each level of DP noise, we plot the upper bound min_α h(α, p_0) computed using Mironov et al. (2019) (see Equation 2) as well as our theoretical bound L(p_0) against the secret length b, setting p_0 = 2^{-b}. We additionally plot the line y = x: points below this line indicate that the secret will not entirely leak; points above indicate total leakage. To generate these plots, we use the settings of our real-life canary experiments in Section 4.3, with 186k training steps and a sampling rate q = 2.81 × 10^{-4}. We also report in the plot titles the privacy budget ε at δ = 3 × 10^{-7}, computed using Balle et al. (2020), the standard practice with DP training. The plots demonstrate that our guarantee prevents total secret disclosure for various levels of DP noise σ. Furthermore, we observe that our bound is comparatively better at lower levels of the DP noise σ.

This number of trials corresponds to the verification cost incurred by an adversary in many practical scenarios. For example, if the adversary is guessing a phone number, they have to actually dial this number to "exploit" the secret. A secret can have a non-zero probability even under a model that was not trained on it. As an extreme example, a language model generating uniformly random numbers will predict a given 10-digit phone number with probability 10^{-10}. The goal of a privacy-preserving training procedure is to ensure that the secret is not much more probable under a model that was trained on it. The length of the secret can vary depending on prior information: for phone numbers, knowing the area code reduces the secret from 10 to 7 unknown digits. Fortunately, RDP guarantees hold against any prior knowledge, thanks to the post-processing property detailed in Section 2.2.

Our contributions are the following:
• We use the probability preservation guarantee from RDP to derive better secret protection. In particular, we show that the length of the secret itself provides additional privacy.
• We empirically estimate the leakage for n-gram models and show that it matches our bound.
• We design a method to efficiently run the reconstruction attack of Carlini et al. (2021b) using lazy sampling and use it to surface at-risk samples in non-private language models.
• We fine-tune language models to competitive perplexity with differential privacy and show that the observed leakage under our attack model is smaller than the guarantee, even when we favor the adversary by increasing the secret's sampling rate.
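The secrecy measure defined above, b ≜ log_2(1/P(A(θ) = s)), can be computed directly from a secret's probability under the reconstruction model. A minimal sketch, using the phone-number example from the text:

```python
import math

def secrecy_bits(p_secret):
    """Secrecy of a sample s under a reconstruction model A:
    b = log2(1 / P(A(theta) = s)). An adversary needs on average
    2^b = 1/p_secret trials to chance upon the secret."""
    return math.log2(1.0 / p_secret)

# A uniformly random 10-digit phone number has probability 1e-10,
# i.e. about 33.2 bits of secrecy. Knowing the 3-digit area code
# leaves 7 unknown digits (probability 1e-7), about 23.3 bits.
full = secrecy_bits(1e-10)
partial = secrecy_bits(1e-7)
print(round(full, 1), round(partial, 1))  # → 33.2 23.3
```

This illustrates how prior knowledge (here, the area code) shortens the effective secret; the RDP guarantees discussed above are robust to such priors via post-processing.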

2. BACKGROUND

Throughout the paper, we use log to denote the natural logarithm and log 2 for the base 2 logarithm.

2.1. PRIVACY ATTACKS

Membership inference attacks (Homer et al., 2008; Shokri et al., 2017) determine, given a trained model and a data sample, whether the sample was part of the model's training set. Given a sample z and a model θ trained on a dataset D, the attacker's objective is to design a score function ϕ(θ, z) such that ϕ is high when z ∈ D and low otherwise. Various score functions have been proposed, such as the gradient norm (Nasr et al., 2019), the model confidence (Salem et al., 2018), or the output of a neural network (Shokri et al., 2017). Surprisingly, choosing the score function to be the loss L(θ, z)
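A loss-based score function ϕ(θ, z) of this kind can be sketched as follows. This is a toy illustration, not the exact attack of any cited work: the model, its probabilities, and the sign convention (negating the loss so that higher means "more likely a member") are assumptions made for the example.

```python
import math

def nll_loss(probs, target):
    """Negative log-likelihood of the target label under the model."""
    return -math.log(probs[target])

def membership_score(model_probs, sample):
    """Loss-based membership score phi(theta, z): training members tend
    to have lower loss, so we negate the loss so that a higher score
    suggests membership (a sketch, with an assumed sign convention)."""
    x, y = sample
    return -nll_loss(model_probs(x), y)

# Hypothetical model: confident on a "training" point, unsure on a fresh one.
def model_probs(x):
    return {"train_x": {0: 0.95, 1: 0.05},
            "test_x": {0: 0.55, 1: 0.45}}[x]

s_train = membership_score(model_probs, ("train_x", 0))
s_test = membership_score(model_probs, ("test_x", 0))
print(s_train > s_test)  # → True
```

The attacker then thresholds this score to produce the binary in/out decision, which is why membership inference is an easier task than full reconstruction.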

