DEFENDING AGAINST RECONSTRUCTION ATTACKS WITH RÉNYI DIFFERENTIAL PRIVACY

Abstract

Reconstruction attacks allow an adversary to regenerate data samples of the training set using only access to a trained model. It has recently been shown that simple heuristics can reconstruct data samples from language models, making this threat scenario an important aspect of model release. Differential privacy is a known solution to such attacks, but it is often used with a large privacy budget (ε ≥ 8), which does not translate to meaningful guarantees. Our main contribution is stronger privacy guarantees against reconstruction attacks, improving on the existing literature. In particular, we show that larger privacy budgets do not provably protect against membership inference, but can still protect against extraction of high-entropy secrets. We design a method to efficiently run reconstruction attacks with lazy sampling and empirically show that we can expose at-risk training samples from non-private language models. We show experimentally that our guarantees hold on language models of practical significance trained with differential privacy, including GPT-2 finetuned on Wikitext-103.

1. INTRODUCTION

Probabilistic generative models are trained to assign high likelihood to data from the training set. In the case of language models, the decomposition P(x_1, ..., x_T) = ∏_{i=1}^{T} P(x_i | x_{<i}) also allows for efficient sampling of sequences. Since such models tend to overfit on the training data, sampling from a trained model will sometimes yield verbatim sentences from the training set. Carlini et al. (2021b) leverage this effect, along with clever filtering techniques, to produce samples that are likely to be in the training set. Their work demonstrates that reconstruction attacks on large-scale generative language models such as GPT-2 (Radford et al., 2019) are not only possible but also successful: their best attack reaches 66% precision on the top-100 generated sentences. Another category of attacks is membership inference, where the adversary has access to both the trained model and a data sample, and must predict whether this sample comes from the training set. Membership inference is easier than reconstruction because the adversary's task is simpler. The standard defense against such privacy attacks is differential privacy (DP) (Dwork et al., 2006; Dwork & Roth, 2014). DP defines a privacy budget ε that can be used to control the privacy/utility tradeoff of the trained model. However, there is no consensus over acceptable values of ε and, in practice, ε is often chosen to defeat practical membership inference attacks (Watson et al., 2021; Carlini et al., 2021a). In this paper, we show that Rényi differential privacy (RDP) (Mironov, 2017) can actually provide guarantees on the probability of reconstructing a sample; these guarantees are stronger than existing ones for the same mechanism. In particular, we show that there is an intermediate regime in which membership inference is not protected but full reconstruction of sufficiently high-entropy secrets remains difficult.
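The chain-rule factorization and left-to-right sampling can be illustrated with a minimal sketch. The toy bigram table below is a hypothetical stand-in for a trained language model (the `BIGRAM` probabilities and the `"^"`/`"$"` boundary markers are illustrative assumptions, not the paper's method); it shows how per-token conditionals yield both a sequence log-likelihood and an efficient sampler.

```python
import math
import random

# Hypothetical bigram conditionals P(x_i | x_{i-1}) over a two-token alphabet.
# "^" marks sequence start, "$" marks sequence end.
BIGRAM = {
    "^": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.1, "b": 0.5, "$": 0.4},
    "b": {"a": 0.7, "b": 0.1, "$": 0.2},
}

def log_likelihood(seq):
    """Chain rule: sum log P(x_i | x_{i-1}) over the sequence, including the end marker."""
    total, prev = 0.0, "^"
    for tok in list(seq) + ["$"]:
        total += math.log(BIGRAM[prev][tok])
        prev = tok
    return total

def sample(rng):
    """Draw a sequence token by token from the conditionals (efficient sampling)."""
    out, prev = [], "^"
    while True:
        toks, probs = zip(*BIGRAM[prev].items())
        tok = rng.choices(toks, weights=probs)[0]
        if tok == "$":
            return "".join(out)
        out.append(tok)
        prev = tok

print(round(log_likelihood("ab"), 4))  # log(0.6) + log(0.5) + log(0.2) ≈ -2.8134
print(sample(random.Random(0)))
```

A real language model replaces the bigram lookup with a neural conditional P(x_i | x_{<i}), but the two operations, scoring a candidate sequence and sampling one, factorize in exactly the same way.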
This means that an adversary who knows a secret s can determine whether it was in the training set, but extracting it from the model remains hard if they do not know s in the first place. We refer to samples that an adversary tries to reconstruct as secrets. Not all samples from the training set would be considered "secret" in the sense that their content is public knowledge, such as newspaper headlines. We circumvent this issue by considering all samples secrets, and quantifying their level of secrecy by the number of unknown bits of information. Specifically, given a probabilistic reconstruction model A that generates secrets from a trained model θ, s ∼ A(θ), the secrecy of a sample s is b ≜ log_2(1/P(A(θ) = s)). An adversary needs on average 2^b trials to chance upon the

