DEFENDING AGAINST RECONSTRUCTION ATTACKS WITH RÉNYI DIFFERENTIAL PRIVACY

Abstract

Reconstruction attacks allow an adversary to regenerate data samples of the training set given access only to a trained model. It has recently been shown that simple heuristics can reconstruct data samples from language models, making this threat scenario an important aspect of model release. Differential privacy is a known solution to such attacks, but it is often used with a large privacy budget (ε ≥ 8), which does not translate into meaningful guarantees. Our main contribution is to derive stronger privacy guarantees against reconstruction attacks than those in the existing literature. In particular, we show that larger privacy budgets do not provably protect against membership inference, but can still protect against the extraction of high-entropy secrets. We design a method to efficiently run reconstruction attacks with lazy sampling and empirically show that we can expose at-risk training samples from non-private language models. Finally, we show experimentally that our guarantees hold on language models of practical significance trained with differential privacy, including GPT-2 fine-tuned on Wikitext-103.

1. INTRODUCTION

Probabilistic generative models are trained to assign high likelihood to data from the training set. In the case of language models, the decomposition P(x_1, …, x_T) = ∏_{i=1}^T P(x_i | x_{<i}) also allows for efficient sampling of sequences. Since such models overfit on the training data, sampling from a trained model will sometimes yield verbatim sentences from the training set. Carlini et al. (2021b) leverage this effect along with clever filtering techniques to produce samples that are likely to be in the training set. Their work demonstrates that reconstruction attacks are not only possible on large-scale generative language models such as GPT-2 (Radford et al., 2019) but also successful: their best attack reaches 66% precision on the top-100 generated sentences. Another category of attacks is membership inference, where the adversary has access to both the trained model and a data sample, and has to predict whether this data sample comes from the training set. Membership inference is easier than reconstruction because the adversary's task is simpler. The standard defense against such privacy attacks is differential privacy (DP) (Dwork et al., 2006; Dwork & Roth, 2014). DP defines a privacy budget ε that can be used to control the privacy/utility tradeoff of the trained model. However, there is no consensus over acceptable values of ε and, in practice, ε is often chosen to defeat practical membership inference attacks (Watson et al., 2021; Carlini et al., 2021a). In this paper, we show that Rényi differential privacy (RDP) (Mironov, 2017) can actually provide guarantees on the probability of reconstructing a sample; these guarantees are stronger than existing ones for the same mechanism. In particular, we show that there is an intermediate regime in which membership inference is not prevented but full reconstruction of sufficiently high-entropy secrets remains difficult.
This means that an adversary who knows a secret s can determine if it was in the training set, but extracting it from the model stays hard if they do not know the secret s in the first place. We refer to samples that an adversary tries to reconstruct as secrets. Not all samples from the training set would be considered "secret" in the sense that their content is public knowledge, such as newspaper headlines. We circumvent this issue by considering all samples secrets, and quantifying their level of secrecy by the number of unknown bits of information. Specifically, given a probabilistic reconstruction model A that generates secrets from a trained model θ, s ∼ A(θ), the secrecy of a sample s is b ≜ log_2(1 / P(A(θ) = s)). An adversary needs on average 2^b trials to chance upon the secret s.

Figure 1: We train a neural network using DP-SGD and consider various levels of DP noise σ (increasing from left to right). We consider a secret of varying length (in bits b, x-axis) in the dataset and wish to estimate the number of bits of this secret that will leak after training (y-axis). For each level of DP noise, we plot the upper bound min_α h(α, p_0) computed using Mironov et al. (2019) (see Equation (2)) as well as our theoretical bound L(p_0) against the secret length b by setting p_0 = 2^-b. We additionally plot the line y = x: points below this line indicate that the secret will not entirely leak; points above mean total leakage. To generate these plots, we use the settings of our real-life canary experiments in Section 4.3, with 186k training steps and a sampling rate q = 2.81 × 10^-4. We also report in the plot titles the privacy budget ε at δ = 3 × 10^-7 computed using Balle et al. (2020), the standard practice with DP training. The plots demonstrate that our guarantee prevents total secret disclosure for various levels of DP noise σ. Furthermore, we observe that our bound is comparatively better for lower levels of DP noise σ.
This number of trials corresponds to the verification cost incurred by an adversary in many practical scenarios: for example, if the adversary is guessing a phone number, they have to actually dial this number to "exploit" the secret. A secret can have a non-zero probability even for a model that was not trained on it. As an extreme example, a language model generating uniformly random digits will predict a given 10-digit phone number with probability 10^-10. The goal of a privacy-preserving training procedure is to ensure that the secret is not much more probable under a model that was trained on it. The length of the secret can vary depending on prior information: for phone numbers, knowing the area code reduces the secret from 10 to 7 unknown digits. Fortunately, RDP guarantees hold against any prior knowledge, thanks to the post-processing property detailed in Section 2.2. Our contributions are the following:

• We use the probability preservation guarantee from RDP to derive better secret protection. In particular, we show that the length of the secret itself provides more privacy.
• We empirically estimate the leakage for n-gram models and show that it matches our bound.
• We design a method to efficiently run the reconstruction attack of Carlini et al. (2021b) using lazy sampling and use it to surface at-risk samples on non-private language models.
• We fine-tune language models to competitive perplexity with differential privacy and show that the observed leakage under our attack model is smaller than the guarantee, even when we favor the adversary by increasing the secret's sampling rate.
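To make the arithmetic concrete, here is a minimal sketch (in Python, with the uniform-digits example from the text; the function name is ours, for illustration only) of the secrecy measure b and the adversary's expected verification cost:

```python
import math

def secrecy_bits(p_secret: float) -> float:
    """Secrecy of a sample s: b = log2(1 / P(A(theta) = s))."""
    return math.log2(1.0 / p_secret)

# A model emitting uniformly random digits assigns a 10-digit phone
# number probability 10^-10, i.e. about 33.2 bits of secrecy.
b = secrecy_bits(10 ** -10)
expected_trials = 2 ** b  # the adversary needs ~2^b guesses on average
```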

2. BACKGROUND

Throughout the paper, we use log to denote the natural logarithm and log 2 for the base 2 logarithm.

2.1. PRIVACY ATTACKS

Membership Inference attacks (Homer et al., 2008; Shokri et al., 2017) determine, given a trained model and a data sample, whether the sample was part of the model's training set. Given a sample z and a model θ trained on a dataset D, the attacker's objective is to design a score function ϕ(θ, z) that is high when z ∈ D and low otherwise. Various score functions have been proposed, such as the gradient norm (Nasr et al., 2019), the model confidence (Salem et al., 2018), or the output of a neural network (Shokri et al., 2017). Surprisingly, choosing the score function to be the loss L(θ, z) is a simple and effective approach (Salem et al., 2018; Sablayrolles et al., 2019). Recent works argue that a practically relevant threat model for membership inference is to confidently predict training set membership of a few samples rather than guessing well on average (Watson et al., 2021; Carlini et al., 2021a; Ye et al., 2021). Such membership inference attacks are evaluated using true and false positive rates. In this context, calibrating the loss by centering (Watson et al., 2021) or by fitting a Gaussian likelihood (Carlini et al., 2021a) further improves the performance of the attack and yields state-of-the-art results. This kind of attack essentially identifies uncommon samples on which a model is overconfident; these samples are also the focus of our study.

Language Model Extraction. Carlini et al. (2019) propose a methodology to quantify the exposure of unique and secret training sequences to extraction attacks. To this end, the authors insert a canary in the training set and measure its exposure as the excess belief that the model has in the canary over random chance. The authors then conclude that differential privacy is a suitable mitigation technique, although at the cost of some utility. Similarly, Carlini et al.
(2021b) show that it is possible to extract hundreds of training sentences in a two-step procedure when attacking GPT-2, a language model trained on large scrapes of the public internet (Radford et al., 2019). First, the authors generate 200,000 samples for each of the following sampling strategies: sampling with a linearly decaying temperature, top-k sampling, and sampling conditioned on random Internet text. Then, they reduce the problem to membership inference among the generated samples and select the top-100 samples most likely to be members according to different filtering strategies. Depending on the metric used to rank the generated sentences (for instance, the loss of the target model, calibrated or not with the loss of a smaller model), the authors identify a few dozen training sentences among the top-100 sentences. Finally, a recent line of work focuses on providing differentially private predictions based on a pretrained, non-DP model (Ginart et al., 2022; Majmudar et al., 2022).

2.2. DIFFERENTIAL PRIVACY

Differential Privacy is a standard for privacy guarantees (Dwork et al., 2006; Dwork & Roth, 2014). Datasets D and D′ are adjacent if they differ by at most one element.

Definition 1. A randomized mechanism M : D → R satisfies (ε, δ)-differential privacy (DP) if, for any adjacent datasets D, D′ ∈ D and any measurable S ⊆ R, P[M(D) ∈ S] ≤ e^ε P[M(D′) ∈ S] + δ.

Rényi Differential Privacy (RDP) (Mironov, 2017) was introduced to obtain tighter composition properties. We recall here the properties of RDP that will be useful for the rest of the paper, while referring the reader to Mironov (2017) for a more comprehensive overview.

Definition 2. For two probability distributions P and Q defined over R, the Rényi divergence of order α > 1 is D_α(P ∥ Q) ≜ (1/(α−1)) log E_{x∼Q} [(P(x)/Q(x))^α].

Definition 3. A randomized mechanism M : D → R satisfies (α, d_α)-Rényi differential privacy (RDP) if, for any adjacent inputs D, D′ ∈ D, we have D_α(M(D) ∥ M(D′)) ≤ d_α.

RDP guarantees can be converted into DP guarantees while the converse is not true, making RDP a strictly stronger property. If M is (α, d_α)-RDP, then it is also (d_α + log(1/δ)/(α−1), δ)-DP for any 0 < δ < 1. This conversion to (ε, δ)-DP is slightly improved by Balle et al. (2020).

Properties of RDP. As with (ε, δ)-DP, (α, d_α)-RDP guarantees are preserved by post-processing: if M is (α, d_α)-RDP and A : R → R′ is a randomized mechanism, then A ∘ M is (α, d_α)-RDP.

Probability Preservation. A direct consequence of (α, d_α)-RDP is quantified by the following inequality (Mironov, 2017): for any event S, writing p = P[M(D) ∈ S] and p′ = P[M(D′) ∈ S],

e^{−d_α} p^{α/(α−1)} ≤ p′ ≤ (e^{d_α} p)^{(α−1)/α}.   (1)
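As a concrete illustration of the RDP-to-DP conversion above, the sketch below uses the classical closed form d_α = α/(2σ²) for the Gaussian mechanism with unit L2 sensitivity (Mironov, 2017) and sweeps over orders to obtain the tightest ε; the values of σ and δ are illustrative:

```python
import math

def gaussian_rdp(alpha: float, sigma: float) -> float:
    # Rényi divergence of order alpha between N(0, sigma^2) and N(1, sigma^2),
    # i.e. the RDP budget of the Gaussian mechanism with L2 sensitivity 1.
    return alpha / (2 * sigma ** 2)

def rdp_to_dp(alpha: float, d_alpha: float, delta: float) -> float:
    # Standard conversion: (alpha, d_alpha)-RDP implies
    # (d_alpha + log(1/delta)/(alpha - 1), delta)-DP.
    return d_alpha + math.log(1 / delta) / (alpha - 1)

# Sweep orders and keep the tightest epsilon (illustrative parameters).
sigma, delta = 1.0, 1e-6
eps = min(rdp_to_dp(a, gaussian_rdp(a, sigma), delta)
          for a in [1.5 + 0.5 * i for i in range(60)])
```

In practice an accountant (e.g., Opacus's RDPAccountant, used in Section 4.1) plays the role of `gaussian_rdp`, returning d_α composed over all training steps.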

DP-SGD.

Private training of neural networks is usually conducted with DP-SGD (Abadi et al., 2016; Song et al., 2013; Bassily et al., 2014) as follows. For each training step t, we gather a batch (of average size L) of training examples z_i by sampling each element without replacement with probability (or sampling rate) q. We compute the per-sample gradients g_t(z_i), clip their norm to a constant C > 0, average them and add Gaussian noise of variance σ²C²:

ḡ_t(z_i) = g_t(z_i) / max(1, ∥g_t(z_i)∥_2 / C),   g̃_t = (1/L) Σ_i ḡ_t(z_i) + N(0, σ²C²),

and we use the noisy gradient g̃_t in the optimization step. Finally, an accountant tracks the Rényi privacy d_α over all the training steps. This quantity depends only on the sampling rate q, the number of training steps, and the noise multiplier σ.

NLP with Differential Privacy. Recent work has shown that fine-tuning language models to competitive accuracy with differential privacy is possible. For instance, Li et al. (2021) provide a recipe to fine-tune large transformer models with 50 to 300 million parameters directly with DP at privacy level ε ∈ {3, 8} for various downstream tasks. Similarly, Yu et al. (2021) privately fine-tune competitive RoBERTa-Large models (Liu et al., 2019) with ε = 6.7 by training only a fraction of the network's parameters using low-rank weight matrices or skip connections.
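The clipping-and-noising aggregation above can be sketched as follows (a minimal pure-Python illustration, not the actual Opacus implementation; gradients are represented as flat lists of floats):

```python
import math
import random

def clip(grad, C):
    # Rescale grad so its L2 norm is at most C: g / max(1, ||g||_2 / C).
    norm = math.sqrt(sum(g * g for g in grad))
    scale = 1.0 / max(1.0, norm / C)
    return [g * scale for g in grad]

def dp_sgd_step(per_sample_grads, C, sigma):
    # Clip each per-sample gradient, sum them, add per-coordinate
    # Gaussian noise N(0, sigma^2 C^2), and average over the batch.
    L, d = len(per_sample_grads), len(per_sample_grads[0])
    clipped = [clip(g, C) for g in per_sample_grads]
    noisy = [sum(g[j] for g in clipped) + random.gauss(0, sigma * C)
             for j in range(d)]
    return [x / L for x in noisy]
```

With σ = 0 this reduces to plain clipped averaging; the accountant then converts (q, σ, number of steps) into the Rényi budget d_α.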

3. OUR METHOD

We first derive an upper bound on the information leakage, and show that it provides better privacy guarantees compared to traditional ones (Mironov, 2017) . Then, we present a lazy method to efficiently identify samples that are likely to leak when performing a reconstruction attack.

3.1. RECONSTRUCTION AND CANARIES

The goal of reconstruction attacks is to surface training samples given access to the target network's weights. If D is a dataset and s a secret, we denote D′ = D ∪ {s}. Recall that M(D) (resp. M(D′)) represents the distribution of the target network's weights after training on D (resp. D′). Finally, A denotes the attack mechanism that takes the target network's weights and outputs a probability distribution over secrets. For a given secret s, we note p_0 ≜ P[A(M(D)) = s] and p_1 ≜ P[A(M(D′)) = s]. We thus sometimes refer to p_0 as the prior (i.e., the secrecy of s before s is added to dataset D), and to p_1 as the posterior.

Theorem 1. If M is (α, d_α)-RDP, the leakage log(p_1/p_0) is bounded by

−h(α, p_0) ≜ −d_α − log(1/p_0)/(α−1) ≤ log(p_1/p_0) ≤ d_α (α−1)/α + log(1/p_0)/α ≜ l(α, p_0).   (2)

Finally, since α > α − 1, we have l(α, p_0) < h(α, p_0).

Proof. Thanks to post-processing, A ∘ M is (α, d_α)-RDP, and applying the probability preservation inequality (1) gives the result. See Appendix A.1 for intermediate calculations.

We emphasize that Theorem 1 applies to any attack mechanism A, as A ∘ M is RDP as long as M is.

Corollary 1. Since the leakage log(p_1/p_0) does not depend on α, we strengthen the bound of Theorem 1 by minimizing over orders:

log(p_1/p_0) ≤ L(p_0) ≜ min_{α>1} l(α, p_0).   (3)

Comparison with traditional DP guarantees. Theorem 1 implies a bound on the absolute leakage as follows: |log(p_1/p_0)| ≤ max(l(α, p_0), h(α, p_0)) = h(α, p_0). If we take δ = p_0, this corresponds to the traditional ε given by min_α h(α, δ).

Figure 2: We fix the secret length to b = 60 bits and plot the dependence of the leakage on the number of training iterations, along with our empirical measurement (see Figure 3). Interpretation: even when repeating the canary multiple times, the empirically observed leakage in a real-life scenario is still lower than our predicted bound L(p_0).
Given that l(α, p_0) < h(α, p_0), our bound on the leakage is better than the traditional guarantee from (ε, δ)-DP. These values are shown in Figures 1 and 2: for lower values of the noise multiplier σ, the bound provided by l is much lower than h, while the gap between the two decreases as the noise multiplier σ grows. We can extend this analogy and compare our leakage guarantee to the privacy budget ε that would be given for a probability of failure δ = p_0. Numerically, our bounds are also better than the slightly tighter bounds of Balle et al. (2020), as depicted in Appendix A.3, although not directly comparable.

Absolute Leakage. Nominally, a bound on the absolute leakage is a stronger guarantee, because it also prevents the posterior p_1 from becoming smaller than the prior p_0. However, we argue that this case is much less relevant from a risk perspective: even if other outcomes s′ become less likely, "mechanically" increasing the probability of the secret s, the probability of the secret is still bounded by the right-hand side of Equation (2). Furthermore, we argue that negative membership inference, i.e., predicting that a sample was not part of the training set, is not as concerning as positive membership inference. Indeed, an individual can be present in a data collection pool but not included in a particular training set for a variety of reasons, hence absence from the training set does not imply absence from the data collection stage.

Length of the Secret. With a slight abuse of notation, let us consider the leakage (in bits) as a function of the number of secret bits: L_2(b) ≜ min_α [d_α (α−1)/(α log 2) + b/α]. The leakage function b ↦ L_2(b) satisfies two properties: it is non-decreasing and concave (see Appendix A.2 for a proof). Thus, longer secrets will leak more in absolute terms, but less in relative terms.
Let us assume that we are looking at a particular secret of binary nature (such as whether some individual owns a particular object), with prior p_0 = 2^{−b}. Even though there are two possible outcomes, the prior p_0 is not necessarily equal to 1/2: if the item is rare, the prior is likely to be smaller. The log-posterior satisfies log_2(p_1) ≤ −b + L_2(b), where the right-hand side is a non-increasing function of b (see Appendix A.2 for a proof). This upper bound is maximized at b = 0: longer secrets lead to smaller values of p_1. The most sensitive secrets are the ones that are most likely in the first place, and the length of the secret itself acts as a protection against reconstruction attacks.

Comparison to membership inference. Membership inference attacks correspond to attacks with a low number of secret bits. These attacks are usually conducted with a prior probability of 1/2 (Yeom et al., 2018; Sablayrolles et al., 2019), and even though some works consider different ratios of positive to negative members (Watson et al., 2021; Rezaei & Liu, 2021), they stay within a factor of 1 to 10. Some privacy settings (noise level σ, sampling rate q and number of steps) will thus offer no guarantee against membership inference (because of the low number of bits to guess) but still will not allow full reconstruction of the rarest samples (high number of bits).
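The bounds l(α, p_0) and h(α, p_0) of Theorem 1 and the minimization of Corollary 1 are straightforward to evaluate numerically. The sketch below assumes the accountant's output is given as a dictionary mapping orders α to Rényi budgets d_α; the Gaussian-mechanism values in the toy example are illustrative:

```python
import math

def l_bound(alpha, d_alpha, p0):
    # Our upper bound on log(p1/p0) from Theorem 1 (Equation (2)).
    return d_alpha * (alpha - 1) / alpha + math.log(1 / p0) / alpha

def h_bound(alpha, d_alpha, p0):
    # The traditional bound: epsilon at failure probability delta = p0.
    return d_alpha + math.log(1 / p0) / (alpha - 1)

def best_leakage_bound(rdp, p0):
    # rdp: dict mapping order alpha -> d_alpha (e.g. from an RDP accountant).
    # Corollary 1: L(p0) = min over orders of l(alpha, p0).
    return min(l_bound(a, d, p0) for a, d in rdp.items())

# Toy example: Gaussian mechanism with d_alpha = alpha / (2 sigma^2)
# and a 60-bit secret, p0 = 2^-60 (illustrative parameters).
sigma, p0 = 1.0, 2 ** -60
rdp = {a: a / (2 * sigma ** 2) for a in range(2, 64)}
L_p0 = best_leakage_bound(rdp, p0)   # in nats; divide by log(2) for bits
```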

3.2. LAZY SAMPLING: IDENTIFYING SAMPLES LIKELY TO LEAK

Let us assume that we want to generate a secret s from a model θ using the probabilistic attack model A(θ). The method of Carlini et al. (2021b) requires sampling hundreds of thousands of times from A(θ) in the hope of finding the secret s (and then filtering the generated samples using calibrated membership inference scores). Fortunately, most attacks of Carlini et al. (2021b) are amenable to lazy sampling: we can take a secret s and directly compute its probability p = P(A(θ) = s). Not all probabilistic processes A have a tractable density P(A(θ) = s). For instance, generating a sequence x_1, …, x_T and only keeping the last samples x_i, …, x_T does not have a tractable density. On the other hand, vanilla sampling from a language model θ can be done lazily because the probability of a sequence x_1, …, x_T can be expressed as ∏_{i=1}^T f_θ(x_i | x_{1:i}), where x_{1:i} = (x_1, …, x_{i−1}) and f_θ(x_i | x_{1:i}) is the conditional probability of x_i given the past. While this is straightforward for regular sampling, it is also true for temperature and top-k sampling, with the caveat that these probabilities can sometimes be 0, as illustrated in Appendix A.5. We define T_k(θ, x_{1:i}) as the set of top-k predictions and β_1, …, β_T a set of temperatures. From there, we can define a top-k and/or temperature language model as

λ(x_i | x_{1:i}) ≜ f_θ(x_i | x_{1:i})^{1/β_i} / Σ_{y ∈ T_k(θ, x_{1:i})} f_θ(y | x_{1:i})^{1/β_i} if x_i ∈ T_k(θ, x_{1:i}), and 0 otherwise,

with λ(x_1, …, x_T) = ∏_{i=1}^T λ(x_i | x_{1:i}). Lazy sampling allows us to analyze two of the main strategies of Carlini et al. (2021b), but does not apply to the Internet sampling one. Indeed, Internet sampling first chooses a text c crawled from the public web and generates iteratively from f_θ(·, c), effectively using c as a prompt.

Leakage approximation. In the remainder of this paper (except for Section 4.1), we make the following assumption: Assumption 1.
For any θ ∼ M(D) and θ′ ∼ M(D), log P(A(θ) = s) ≈ log P(A(θ′) = s).

To support Assumption 1, we measured the mean and standard deviation of the log-probability of the secret log(p_1) for the n-gram model detailed in Section 4.1. For each level of DP noise σ, we consider 10,000 models and observe that the probability of leakage is tightly concentrated around its mean. For instance, with σ = 2.875, log p_1 has mean ≈ −22.5 and standard deviation ≈ 1.5, from which we conclude that the log-probability of a secret does not vary much when re-training a model on a fixed dataset D. Using Assumption 1 and lazy sampling, we can take all sentences s = (x_1, …, x_T) from the training set, compute their probability of being generated by the attack A, and compare it to the "score" used at the filtering stage. This strategy allows us to identify samples that would be reconstructed using this particular attack. Of course, if this particular attack fails to reconstruct some sample s, it does not mean that all attacks would fail on s.
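Lazy evaluation of the top-k/temperature density λ can be sketched as follows (a minimal illustration with per-step probability tables; `step_probs` and its dictionary format are assumptions for the example, not the paper's implementation):

```python
import math

def seq_logprob_topk(step_probs, tokens, k, betas):
    # step_probs[i]: dict token -> f_theta(token | x_{1:i}),
    # tokens: the candidate secret, betas: per-step temperatures.
    # Returns log lambda(x_1, ..., x_T) under top-k + temperature sampling.
    total = 0.0
    for probs, tok, beta in zip(step_probs, tokens, betas):
        topk = sorted(probs, key=probs.get, reverse=True)[:k]
        if tok not in topk:
            return -math.inf  # outside the top-k: never sampled (density 0)
        z = sum(probs[y] ** (1 / beta) for y in topk)
        total += math.log(probs[tok] ** (1 / beta) / z)
    return total
```

With β_i = 1 and k equal to the vocabulary size this reduces to the vanilla sequence log-probability; the `-inf` branch is exactly the zero-probability caveat of Appendix A.5.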

4. EXPERIMENTS

We first experiment with a simple private n-gram language model to empirically compare the secret leakage with our bounds. We then consider a non-private language model and exhibit the samples most at risk of a reconstruction attack, using the lazy strategy for the attack of Carlini et al. (2021b). Finally, we experiment with private fine-tuning of GPT-2, with parameters yielding low perplexity (Li et al., 2021), and show that the empirical leakage is even smaller than predicted by our bound. Overall, our experiments required on the order of 200 GPU-days on V100 GPUs.

Figure 3: Left: empirical leakage on n-gram language models, compared to our theoretical bound L(p_0) from Corollary 1. We can see that our bound is tight. Right: we empirically measure the leakage of our canary (a fixed 16-digit random number) when training our GPT-2 model on Wikitext with DP. If the model predicts the canary with a given probability p_1, we compute the secrecy b_1 = log_2(1/p_1) (y-axis, in bits). The blue curve corresponds to our bound for σ = 1 and a canary sampling rate q = 1%. Empirically, the observed privacy is more than predicted, even for lower levels of DP noise σ. To make our case stronger, we measure canary secrecy with increasing canary sampling rates q and show that our bound is verified.

4.1. n-GRAM LANGUAGE MODEL

We first experiment on an n-gram language model to compare our guarantees against the empirical secret leakage. We consider T > 0 time steps and generate random digits c ∈ {0, 1, …, 9}^T (the canary). Our objective is to train the n-gram language model on the (fixed) canary c only. A full n-gram language model f_θ has Σ_{i=1}^T 10^i ≈ 10^T parameters. However, the only parameters of interest in our case are the ones corresponding to f_θ(c_i | c_{1:i}). These parameters suffice to compute the probability of the sequence c according to the model f_θ; the rest of the parameters correspond to other "branches" of the language model. Hence, we can train f_θ lazily by only modifying these parameters, which brings the number of parameters down to 10T. We train our model with the softmax loss. More specifically, our loss L writes

L(θ) = − Σ_{t=0}^{T−1} log(u_{t,c_t}),   u_{t,i} = e^{θ_{t,i}} / Σ_{d=0}^{D−1} e^{θ_{t,d}},

where c_t is the digit at index t in the canary and u_{t,i} is the probability that digit i appears at position t. We experiment with Opacus (Yousefpour et al., 2021), using a randomly generated and fixed canary of length T = 10. We set the sampling rate to q = 1, the clipping factor to C = 1, the DP noise level σ ∈ [0.5, 10] and the learning rate η = 0.5/σ. We now wish to measure the leakage log(p_1/p_0) empirically. We have p_1 = P(A(M(D′)) = s) = E_{θ∼M(D′)}[f_θ(c)]. We approximate p_1 using Monte-Carlo sampling, by training N = 10,000 models and computing

log(p_1) ≈ log( (1/N) Σ_{k=1}^N f_{θ^(k)}(c) ).

Given that the probabilities f_{θ^(k)}(c) can be quite small, we perform computations in log-space for numerical stability. We have p_0 = 10^{−T} since the problem is invariant under permutation of digits. Finally, we use the RDPAccountant from Opacus to compute d_α for a range of orders α ∈ [1.01, 63] and compute L(p_0) as in Equation (3).
The results are shown in Figure 3 , where the empirical leakage refers to log (p 1 /p 0 ) and where our theoretical bound refers to L(p 0 ). We observe that the bound is relatively tight for this simple n-gram language model.
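The log-space Monte-Carlo estimate of log(p_1) described above can be computed with a standard log-sum-exp trick, sketched here:

```python
import math

def log_mean_prob(log_probs):
    # Numerically stable log((1/N) * sum_k exp(log p_k)): the Monte-Carlo
    # estimate of log(p1) from per-model secret log-probabilities, which
    # avoids underflow when the probabilities f_theta(c) are tiny.
    m = max(log_probs)
    return m + math.log(sum(math.exp(x - m) for x in log_probs) / len(log_probs))
```

Subtracting the maximum before exponentiating keeps the summands in a representable range even when each log p_k is on the order of −1000.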

4.2. LAZY RECONSTRUCTION WITHOUT DEFENSES

We now consider a vanilla target model trained without DP on the OpenWebText dataset (Gokaslan & Cohen, 2019) and identify the training samples most at risk of a reconstruction attack. We compute, for a given sentence z, its calibrated score (or calibrated loss) defined as ϕ(θ, z) = L(θ, z) − L(θ_0, z), where θ represents the target model's weights and θ_0 represents a calibration model trained on held-out calibration data. We partition OpenWebText into training, calibration and test sets of 1 million samples each. The models are trained using the setup of Radford et al. (2019). As shown in the left plot of Figure 4, samples from the training set have a lower calibrated loss, indicating that they have been memorized by the model. As expected, test samples have an average calibrated loss around 0 with a small standard deviation. On the right plot, we observe that the average calibrated loss of sentences generated by the target model is lower than that of sentences from the training set.

Figure 4: We train a vanilla language model without DP and compute, for a given sentence, its calibrated score using a model that is not trained on the same data, as well as its log-probability under the target model. Left: histogram of a random subset of the train and test sets. Right: for a given sentence, we plot its calibrated loss (y-axis) against its log-probability of being generated by the target model (x-axis). In addition to sentences from the train and test sets, we generate sentences from the target model and note that their average calibrated loss is lower than for sentences of the training set. The samples from the train set most at risk of a reconstruction attack are those with high log-probability (high probability of being generated by the attack model) and low calibrated loss (high probability of having been memorized by the target model). See Table 1 for selected sentences from the training set.
Given these measures, it is possible to identify the samples most at risk of a reconstruction attack: those with high log-probability (higher odds of being generated by the target model) and low calibrated loss (high probability of having been memorized by the target model). Table 1 shows typical examples of training set elements with various levels of risk. Some of the sentences likely to be generated are indeed common (such as the first two rows). Other sentences, however, are very likely to be sampled (with probability close to 10^-3) and have a low calibration score, indicating that they were memorized by the model. Finally, some sentences have a very low probability of being generated in the first place (less than 1 in 10 billion), so even though they are memorized by the model (low calibration score), they are unlikely to be surfaced. Carlini et al. (2021b) showed a somewhat analogous plot by displaying the perplexity of the trained model against zlib entropy. There are two major differences: our x-axis shows the probability of being generated by the attack model (which differs from perplexity due to temperature and/or top-k sampling), and our analysis is conducted on the training set, whereas theirs is done on generated samples. In particular, we can have confidence that certain sentences with a very low probability of being generated will not be recovered by this particular strategy.
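A minimal sketch of the calibrated score and the at-risk filter described above (the threshold values are purely illustrative, loosely inspired by the ranges visible in Figure 4 and Table 1, and the function names are ours):

```python
def calibrated_score(loss_target: float, loss_calibration: float) -> float:
    # phi(theta, z) = L(theta, z) - L(theta_0, z): strongly negative values
    # suggest the target model is unusually confident on z (memorization).
    return loss_target - loss_calibration

def at_risk(log_p_generate: float, score: float,
            p_threshold: float = -7.0, score_threshold: float = -2.0) -> bool:
    # A sample is flagged when it is both likely to be generated by the
    # attack model (high log-probability) and has a low calibrated loss.
    return log_p_generate > p_threshold and score < score_threshold
```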

4.3. REAL-LIFE CANARY EXPERIMENTS

We also experiment with a realistic scenario of privately fine-tuning a large language model. We take a pre-trained GPT-2 model (Wolf et al., 2020) and fine-tune it on Wikitext-103 (Merity et al., 2017), available under the Creative Commons Attribution-ShareAlike License. We add a canary sentence to the training dataset and sample it during training with a sampling rate q. The canary consists of a prompt ("John Doe's credit card number is") and a secret, which is a string of 16 random digits. We are able to privately fine-tune GPT-2 with DP-SGD on Wikitext-103 and reach a perplexity of 45, very close to the non-private fine-tuning perplexity of 38 (refer to Appendix A.6 for details).

Table 1: We train a vanilla target model without DP on Wikitext-103 and display selected training sentences. A low calibrated loss denotes a sample that is likely to be memorized, and a high probability denotes a sample that is more likely to be generated by the model. Hence, samples with a low calibrated loss and a high generation probability are considered at risk. Note that −11.3 is quite low, as most generated samples have a calibrated loss higher than −7.5, as in Figure 4.

Given the trained model, we approximate p_1 by computing the probability of the secret given the prompt: p_1 ≈ f_θ(number | "John Doe's credit card number is"). With the chosen privacy parameters (σ ∈ [0.3, 0.5], q = 2.81 × 10^-4 and 186k steps), the empirical leakage is negligible. To make our case stronger, we increase the sampling rate of the canary in order to increase the empirical leakage, and observe that it is still lower than the bound of Corollary 1.

Results. Figure 3 shows the empirical log-probability of the secret as a function of the number of steps, against the leakage guarantee computed according to Equation (3). With DP, decreasing levels of noise σ lead to increasing levels of leakage (or equivalently, decreasing levels of privacy).
Similarly, increasing canary sampling rates q result in more observed empirical leakage. We observe that even in the setup most favorable to canary memorization (low σ, high sampling rate), the provided bound (computed for high σ and low sampling rate) holds. Finally, even with our least private version (σ = 0.3, q = 0.2), our strategy is not able to fully reconstruct the secret.

Exposure. Carlini et al. (2019) estimate the performance of their attack using the exposure metric, which approximates the optimal number of guesses an adversary would need to correctly recover a secret. In practice, we can only upper-bound the exposure because we have no guarantee that any attack is optimal. The canary secrecy shown in Figure 3 corresponds to exposure, but for the more recent reconstruction attacks of Carlini et al. (2021b).
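A sketch of how p_1 and the canary secrecy b_1 = log_2(1/p_1) can be computed from the fine-tuned model's per-digit conditional probabilities (the input format, a list of per-digit conditional log-probabilities given the prompt, is an assumption for illustration):

```python
import math

def canary_secrecy_bits(digit_logprobs):
    # digit_logprobs[i] = log f_theta(d_i | prompt, d_1, ..., d_{i-1})
    # for each of the 16 secret digits; p1 is the product of the
    # conditionals, so log(p1) is their sum.
    log_p1 = sum(digit_logprobs)
    return -log_p1 / math.log(2)  # b1 = log2(1 / p1)

# Under a model that has learned nothing about the canary, each digit has
# probability 1/10, so the secret keeps its full 16 * log2(10) ~ 53 bits.
b1 = canary_secrecy_bits([math.log(0.1)] * 16)
```

A b_1 much below 53 bits would indicate that the model has memorized part of the canary.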

4.4. LIMITS OF RECONSTRUCTION ATTACKS

Canary attacks (Carlini et al., 2019) are useful because the secret is chosen randomly and is thus independent of the rest of the dataset. In contrast, with practical reconstruction attacks it is difficult to estimate the randomness of the secret. Indeed, multiple identical sequences can be present in the dataset. Carlini et al. (2021b) correct for this by measuring k-eidetic memorization, i.e., looking at sentences that appear k times or less in the dataset. However, this does not account for knowledge shared across secrets. For example, if there is a secret in the form "[animal] drinks [beverage]", animal and beverage can vary in the dataset, so the language model will learn about this general structure during training: seeing "dog drinks water" can thus make "cat drinks milk" more likely, even if the latter does not appear in the dataset.

5. CONCLUSION

This work shows that DP-SGD training, analyzed through Rényi differential privacy, provides meaningful guarantees against reconstruction attacks. We also provide an efficient way to analyze the vulnerability of training samples to the reconstruction attack of Carlini et al. (2021b). Overall, the combination of our improved guarantees with the private fine-tuning of language models showcased by Li et al. (2021) allows us to train language models with a perplexity competitive with the state of the art and meaningful privacy guarantees against training data extraction. Finally, our work sheds light on the "higher information" end of the spectrum. First, reconstruction attacks are more credible because they only require access to a trained model and not to candidate samples. Second, as shown, reconstruction attacks can be defeated with levels of noise that would fail to defend against membership inference attacks. We hope that increased consideration for this threat model will drive adoption of DP-SGD as a standard in machine learning.

A APPENDIX

A.1 PROBABILITY PRESERVATION

Here we show the steps to get from Eq. 1 to Eq. 2. Starting from

exp(−d_α) p_0^{α/(α−1)} ≤ p_1 ≤ (exp(d_α) p_0)^{(α−1)/α},

taking logarithms gives

−d_α + (α/(α−1)) log p_0 ≤ log p_1 ≤ ((α−1)/α)(d_α + log p_0),

and subtracting log p_0 from each side yields

−d_α + (1/(α−1)) log p_0 ≤ log(p_1/p_0) ≤ ((α−1)/α) d_α − (1/α) log p_0.
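The consistency of the two forms of the bound can be checked numerically. The sketch below (helper names are illustrative) evaluates both the probability-space bounds on p_1 and the log-ratio bounds, assuming a divergence bound d_α at order α:

```python
import math

def p1_bounds(p0, d_alpha, alpha):
    """Probability-preservation bounds on p1 implied by an (alpha, d_alpha)
    Renyi divergence bound (Eq. 1 in the paper)."""
    lower = math.exp(-d_alpha) * p0 ** (alpha / (alpha - 1))
    upper = (math.exp(d_alpha) * p0) ** ((alpha - 1) / alpha)
    return lower, upper

def log_ratio_bounds(p0, d_alpha, alpha):
    """The same bounds expressed on log(p1 / p0) (Eq. 2 in the paper)."""
    lo = -d_alpha + math.log(p0) / (alpha - 1)
    hi = (alpha - 1) * d_alpha / alpha - math.log(p0) / alpha
    return lo, hi
```

Taking logs of the output of `p1_bounds` and subtracting log p_0 recovers exactly the output of `log_ratio_bounds`, which is the derivation above.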

A.2 LEAKAGE FUNCTION

In this part, we prove that the leakage function L_2 is non-decreasing and concave. We also show that ξ : b ↦ L_2(b) − b is non-increasing. We start with two technical lemmas.

Lemma 1. Given a family of non-decreasing functions f_t, t ∈ T, the function f(x) ≜ inf_t f_t(x) is non-decreasing.

Proof. For x < y and any t ∈ T, we have f(x) = inf_{t′} f_{t′}(x) ≤ f_t(x) ≤ f_t(y). Since f(x) ≤ f_t(y) for all t, we have f(x) ≤ inf_t f_t(y) = f(y).

Lemma 2. Given a family of concave functions f_t, t ∈ T, the function f(x) ≜ inf_t f_t(x) is concave.

Proof. For a function g, we define its hypograph H(g) ≜ {(x, y) | y ≤ g(x)}. A function g is concave iff its hypograph H(g) is a convex set. Each function f_t being concave, its hypograph H(f_t) is a convex set. The hypograph H(f) = ∩_{t∈T} H(f_t) is an intersection of convex sets and is thus itself a convex set. Thus f is concave.

Each function f_α is linear, hence concave, and non-decreasing because 1/α > 0. We can thus apply Lemma 2 and Lemma 1 to conclude that the leakage function L_2 is concave and non-decreasing. The function ξ : b ↦ L_2(b) − b is the sum of a concave and a linear function and is thus concave, so its derivative ∂ξ/∂b is non-increasing. Given that ∂ξ/∂b(0) = ∂L_2/∂b(0) − 1 = 0, we have ∂ξ/∂b ≤ 0 and thus ξ is non-increasing.

A.4 COMPARISON OF MEMBERSHIP INFERENCE AND RECONSTRUCTION PRIVACY

Figure 7 shows the percentage of the secret leaked as a function of the privacy budget. This emphasizes that small, binary secrets leak completely for budgets ε < 10, whereas longer secrets do not completely leak until much higher budgets (typically ε around 50 for 20-bit secrets).

A.5 TOP-K SAMPLING

In Figure 8, we show that top-k sampling from a language model has a density. The probability of a sequence is the product of the conditional probabilities along the path. If one word does not belong to the top-k predictions, its probability will be 0, thus making the probability of the entire sentence 0.
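This density can be made explicit with a small sketch (the function and data layout are illustrative, not a specific library API): a sequence's log-probability under top-k sampling is the sum of renormalized conditional log-probabilities, and it collapses to −∞ as soon as one token falls outside the top-k at its step.

```python
import math

def topk_sequence_logprob(step_probs, sequence, k):
    """Log-probability of a token sequence under top-k sampling.

    step_probs[i] maps each candidate token to its conditional probability
    at step i. A token outside the top-k at its step zeroes out the whole
    sequence (log-probability of -inf)."""
    total = 0.0
    for probs, token in zip(step_probs, sequence):
        topk = sorted(probs, key=probs.get, reverse=True)[:k]
        if token not in topk:
            return float("-inf")
        mass = sum(probs[t] for t in topk)
        total += math.log(probs[token] / mass)  # renormalize over the top-k
    return total
```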

A.6 EXPERIMENTAL DETAILS

For WikiText-103, we follow the findings of Li et al. (2021): we use a large batch size (B = 1024), a small clipping threshold (C = 1), the AdamW optimizer (Loshchilov & Hutter, 2018), and freeze the embedding



Definition 1. A randomized mechanism M : D → R satisfies (ε, δ)-differential privacy (DP) if, for any adjacent inputs D, D′ ∈ D and for any S ⊆ R, we have P[M(D) ∈ S] ≤ e^ε · P[M(D′) ∈ S] + δ.

(1) where p ≜ P[M(D) ∈ S] and p′ ≜ P[M(D′) ∈ S]. Informally, since M(D) and M(D′) are close, the probabilities of any event S under both M(D) and M(D′) are also close.
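Definition 1 can be checked exhaustively on a toy mechanism. The sketch below uses binary randomized response, a standard textbook mechanism (not one used in this paper), for which the DP inequality holds with equality at ε = ln((1 − p_flip)/p_flip) and δ = 0:

```python
import math

def rr_epsilon(p_flip):
    """Privacy budget of binary randomized response: report the true bit
    with probability 1 - p_flip, the flipped bit with probability p_flip.
    This mechanism is (ln((1 - p_flip) / p_flip), 0)-DP."""
    return math.log((1 - p_flip) / p_flip)

def rr_output_probs(bit, p_flip):
    """P[M(bit) = s] for each output s in {0, 1}."""
    return {bit: 1 - p_flip, 1 - bit: p_flip}

# Check Definition 1 (with delta = 0) on the adjacent inputs 0 and 1:
# for every output s, P[M(0) = s] <= e^eps * P[M(1) = s], and vice versa.
p_flip = 0.25
eps = rr_epsilon(p_flip)
p0, p1 = rr_output_probs(0, p_flip), rr_output_probs(1, p_flip)
dp_holds = all(p0[s] <= math.exp(eps) * p1[s] + 1e-12 and
               p1[s] <= math.exp(eps) * p0[s] + 1e-12 for s in (0, 1))
```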

Figure 2: We consider training a neural network with DP-SGD with the setup described in Figure 1. Here, we fix the secret length to b = 60 bits and plot the dependency of the leakage on the number of training iterations, along with our empirical measurement (see Figure 3). Interpretation: even when repeating the canary multiple times, the empirically observed leakage in a real-life scenario is still lower than our predicted bound L(p_0).

Figure 3: Left: Empirical leakage log(p_1/p_0) computed using Equation 4 on n-gram language models, compared to our theoretical bound L(p_0) from Corollary 1. We can see that our bound is tight. Right: We empirically measure the leakage of our canary (a fixed 16-digit random number) when training our GPT-2 model on Wikitext with DP. If the model predicts the canary with a given probability p_1, we compute the secrecy b_1 = log_2(1/p_1) (y-axis, in bits). The blue curve corresponds to our bound for σ = 1 and a canary sampling rate q = 1%. Empirically, the observed privacy is greater than predicted, even for lower levels of DP noise σ. To make our case stronger, we measure canary secrecy with increasing canary sampling rates q and show that our bound still holds.

Under review as a conference paper at ICLR 2023

and want to identify samples from the training set that are most at risk in the event

Figure 4: We train a vanilla language model without DP and compute, for a given sentence, its calibrated score using a model that is not trained on the same data, as well as its log-probability predicted by the target model. Left: We plot the histogram of a random subset of the train and test sets. Right: For a given sentence, we plot its calibrated loss (y-axis) against its log-probability of being generated by the target model (x-axis). In addition to sentences from the train and test sets, we generate sentences from the target model and note that their average calibrated loss is lower than for sentences of the training set. The samples from the train set that are the most at risk for a reconstruction attack are those with high log-probability (high probability of being generated by the attack model) and low calibrated loss (high probability of having been memorized by the target model). See Table 1 for selected sentences from the training set.

Now let us apply these lemmas to the leakage function. We have L_2(b) = min_α f_α(b) with each f_α(b) = ((α − 1) d_α)/(α log 2) + b/α.
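This minimum over α is straightforward to evaluate numerically. The sketch below is an illustration, not the paper's accountant: it plugs in the composed Rényi divergence of the full-batch Gaussian mechanism, d_α = T·α/(2σ²), as a stand-in for the subsampled DP-SGD accounting used in the paper, and scans a hypothetical grid of orders α.

```python
import math

def leakage_bound(b, sigma, steps):
    """L2(b) = min over alpha of f_alpha(b), in bits, with
    f_alpha(b) = (alpha - 1) * d_alpha / (alpha * ln 2) + b / alpha.

    Assumption (for illustration only): d_alpha is the Renyi divergence of
    the full-batch Gaussian mechanism composed over `steps` iterations,
    d_alpha = steps * alpha / (2 * sigma**2), in nats."""
    alphas = [1 + x / 10 for x in range(1, 10)] + list(range(2, 129))
    def f(alpha):
        d_alpha = steps * alpha / (2 * sigma ** 2)
        return (alpha - 1) * d_alpha / (alpha * math.log(2)) + b / alpha
    return min(f(a) for a in alphas)
```

As expected from the lemmas, the bound is non-decreasing in the secret length b, and it tightens as the noise multiplier σ grows.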

A.3 COMPARISON TO BALLE ET AL. (2020)

In Figure 5, we display the numerical bounds of Balle et al. (2020).

Figure 5: Comparison with Balle et al. (2020).

Figure 6 extends the comparison to Balle et al. (2020) with a different value of the sampling rate q.

Figure 6: Comparison with Balle et al. (2020), on a DP-SGD run with q = 20% and 100 epochs, and various values of σ.

Figure 7: Percentage of secrets leaked as a function of privacy budget.


layers. We also find that using a low learning rate (lr = 0.0001) is crucial to avoid divergence. With these hyperparameters, we are able to fine-tune GPT-2 on Wikitext-103 and reach a perplexity of 45, which is very close to the non-private fine-tuning perplexity of 38 that we obtain. These results echo the findings of Li et al. (2021) .
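For reference, the hyperparameters above can be collected in a single configuration; the sketch below is a hypothetical dictionary (the key names are illustrative and not tied to a specific training library):

```python
# Hypothetical config collecting the DP fine-tuning hyperparameters
# reported in A.6; key names are illustrative.
DP_FINETUNE_CONFIG = {
    "model": "gpt2",
    "dataset": "wikitext-103",
    "batch_size": 1024,       # large batches improve the DP signal-to-noise ratio
    "max_grad_norm": 1.0,     # per-sample clipping threshold C
    "optimizer": "AdamW",
    "learning_rate": 1e-4,    # a low learning rate is crucial to avoid divergence
    "freeze_embeddings": True,
}
```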

