DIFFERENTIALLY PRIVATE CONDITIONAL TEXT GENERATION FOR SYNTHETIC DATA

Anonymous authors
Paper under double-blind review

Abstract

Companies have faced increasing pressure in recent years to anonymize user-collected data when sharing it internally or with third parties. Text data in particular contains copious amounts of personally identifiable information that has proven difficult to de-identify while remaining useful to the party of interest. Previous works have suggested that synthetic text generation could provide a promising avenue for curating high-utility, private datasets. In this paper, we introduce an approach to synthesizing high-utility text classification datasets by performing conditional generation through a large language model, distilGPT2, while providing measurable guarantees via differential privacy. We show that naive approaches suffer heavily from utility loss because task-relevant factors become entangled in the transformer embedding space, making controlled generation more difficult. We analyze how incorporating a secondary learning objective can improve the performance of the generative model, and thereby the utility of the generated data.

1. INTRODUCTION

In recent years, language models have seen dramatic improvements in performance on NLP tasks. In large part, this has been due to the rapid accumulation of user-generated text on the internet. Companies have been able to aggregate millions of documents available online, as well as their own user data, to train these large language models. However, lawmakers and their constituents have grown wary of data collection and usage practices, urging more stringent regulation. In 2018, the EU set the General Data Protection Regulation (GDPR) into motion, with the goal of increasing transparency about collected information and giving users more control over how their data is handled (Voigt & Bussche, 2017). Consequently, companies are now searching for ways to utilize user data without exploiting user privacy. The GDPR begins with the statement: "The protection of natural persons in relation to the processing of personal data is a fundamental right"; it is imperative that we innovate on methods to use data effectively without risking user privacy.

In this paper, we study privatization of unstructured text data. Even with safety measures in place, there has been massive exploitation of user text data. For example, in 2006, as part of its recommendation-algorithm contest, Netflix released a de-identified dataset of user-generated movie ratings. Researchers discovered that surprisingly little auxiliary information was required to reconstruct the identities of the users who contributed the ratings (Narayanan & Shmatikov, 2006). Further studies have shown how other methods, such as authorship and membership inference attacks (Carlini et al., 2020), can be used to reconstruct user identities. In short, without proper privacy guarantees and careful data analysis, companies put user data at risk of exploitation. Dwork (2006) and Abadi et al. (2016) proposed differential privacy (DP) and DP-SGD/DP-Adam, respectively, as methods that provide provable and quantifiable privacy guarantees.
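To make the DP-SGD/DP-Adam mechanism concrete, the following is a minimal sketch of the two operations at its core: per-example gradient clipping followed by Gaussian noise addition. It uses a toy scalar model purely for illustration; the clip norm and noise multiplier shown are arbitrary, not the configuration used in this paper.

```python
import random


def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.1):
    """One DP-SGD step on scalar gradients: clip each per-example
    gradient to clip_norm, sum, add Gaussian noise, and average."""
    clipped = []
    for g in per_example_grads:
        norm = abs(g)  # scalar toy model; use an L2 norm for vectors
        scale = min(1.0, clip_norm / (norm + 1e-12))
        clipped.append(g * scale)
    noisy_sum = sum(clipped) + random.gauss(0.0, noise_multiplier * clip_norm)
    return noisy_sum / len(per_example_grads)


random.seed(0)
grads = [0.5, -3.0, 2.0, 0.1]  # hypothetical per-example gradients
update = dp_sgd_step(grads)
```

Because the influence of any single example on the update is bounded by the clip norm, the added noise yields a quantifiable (ϵ, δ) privacy guarantee via the moments accountant of Abadi et al. (2016).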
Generally, we say that a randomized algorithm satisfies DP if its output distribution is indistinguishable when run on neighboring datasets. However, current trade-offs between privacy and utility, particularly in synthetic text generation, make it impractical for companies to create useful data with strong privacy guarantees. A common approach to anonymization is to de-identify (redact) personally identifiable tokens in text, such as names and addresses. While this may seem reasonable on paper, with SOTA models reporting accuracies of nearly 97%, the 3% of tokens that are misidentified could be used by an adversary to re-identify users. Consequently, this approach is not a strong enough guarantee of privacy. A permissible error rate for such a model should be below 1% (Yogarajan et al., 2020; Al Aziz et al., 2021), something that has not yet been achieved for arbitrary datasets. Synthetic data is promising because it sidesteps the problem of anonymizing an individual's data by instead producing information about non-existent persons.

Other approaches to anonymizing unstructured text data have focused on word- or sentence-level perturbations in order to reduce vulnerability to membership inference and authorship attacks. These approaches often heavily degrade the semantic quality of the text and may struggle to provide overall privacy guarantees in the face of language peculiarities, such as the leakage of PII. Still other approaches seek to generate data synthetically, such as Libbi et al. (2021) and Al Aziz et al. (2021). However, such studies often show a large trade-off between privacy and utility, or make differentially private guarantees with a potentially unreasonable epsilon parameter (e.g., ϵ > 10). In this paper, we present an approach to generating synthetic text data by performing controllable generation through a large language model. We show it is possible to synthesize text classification datasets with rigorous privacy guarantees.
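The indistinguishability guarantee above can be illustrated with the classic Laplace mechanism on a counting query, whose sensitivity is 1 (adding or removing one record changes the count by at most 1). This sketch is a standalone illustration of ϵ-DP, not part of our method; the records and predicate are invented for the example.

```python
import math
import random


def laplace_noise(scale):
    """Draw Laplace(0, scale) noise via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))


def private_count(records, predicate, epsilon):
    """epsilon-DP count of records matching the predicate.
    A counting query has sensitivity 1, so noise scale is 1/epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)


random.seed(42)
ages = [23, 35, 41, 29, 52]  # hypothetical records
noisy = private_count(ages, lambda a: a >= 30, epsilon=3.0)
```

Smaller ϵ injects more noise and gives stronger indistinguishability between neighboring datasets; ϵ = 3, as used for our guarantees, is far stricter than the ϵ > 10 regimes criticized above.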
We hope this method will enable companies to share data and train high-utility models without putting their users' data at risk. Our contributions are as follows:

1. We present findings on problems that arise when performing conditional finetuning of large language models with DP-Adam. In particular, we find that it becomes difficult to conditionally prompt the model towards a desired class and generate synthetic data that mimics the desired attributes of the original. We propose using a task-relevant loss via a secondary learning objective to solve this issue.

2. We generate synthetic versions of the SST-2 and AG News datasets by performing conditional text generation over a language model. We incorporate a combination of generation techniques: attribute conditioning and a gradient-based approach (Dathathri et al., 2019) to further steer generation. We show minimal loss in utility of our synthetic datasets (6.3%) with strong privacy guarantees (ϵ = 3).

Code to recreate our results is available here: (redacted for review)

2. BACKGROUND

2.1 LANGUAGE MODELING

Given a sequence of tokens X = (x_0, ..., x_n), language models (LMs) are trained to compute the unconditional probability of the sequence, p(X). This probability can be rewritten as a product of conditional probabilities by recursively applying the chain rule (Bengio et al., 2003):

p(X) = ∏_{i=1}^{n} p(x_i | x_0, ..., x_{i-1})

This allows modeling language via next-word prediction. We use the transformer architecture (Vaswani et al., 2017) to model the distribution of natural language. A new sequence y can be generated by sequentially sampling its constituents: p_θ(y_0), p_θ(y_1 | y_0), ..., p_θ(y_m | y_{<m}).
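The factorization and sequential sampling above can be sketched with a hand-specified toy bigram model in place of a transformer; the vocabulary and probabilities here are invented purely for illustration.

```python
import random

# Toy bigram "LM": p(next token | previous token), hand-specified.
BIGRAM = {
    "<bos>": {"the": 0.6, "a": 0.4},
    "the":   {"cat": 0.5, "dog": 0.5},
    "a":     {"cat": 0.7, "dog": 0.3},
    "cat":   {"<eos>": 1.0},
    "dog":   {"<eos>": 1.0},
}


def sequence_prob(tokens):
    """p(X) as a product of conditionals (the chain rule)."""
    p, prev = 1.0, "<bos>"
    for tok in tokens:
        p *= BIGRAM[prev].get(tok, 0.0)
        prev = tok
    return p


def sample_sequence(rng):
    """Generate y by sequentially sampling y_i given its prefix."""
    prev, out = "<bos>", []
    while prev != "<eos>":
        tokens, probs = zip(*BIGRAM[prev].items())
        prev = rng.choices(tokens, weights=probs)[0]
        if prev != "<eos>":
            out.append(prev)
    return out
```

A transformer LM plays the same role, except each conditional p_θ(y_i | y_{<i}) comes from a softmax over the model's output logits rather than a lookup table.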

2.2. CONDITIONAL TEXT GENERATION

Conditional generation of text attempts to steer the output of a LM given a desired condition or control variable. Keskar et al. (2019) introduced a method to accomplish this by training a LM over a dataset in which the desired condition is prepended to the text body: "BOS [condition] SEP text" (BOS and SEP are special tokens that indicate the beginning of the sentence and separate the label from the text body, respectively). In contrast, plug and play language models (PPLM) (Dathathri et al., 2019) combine an attribute model (such as a discriminator) with a LM to manipulate its output and perform controllable text generation. Given an attribute a and generated text x, let the output of the
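The attribute-conditioning format described above can be sketched as a simple preprocessing step; the special-token strings and example sentences below are placeholders, not distilGPT2's actual special tokens or our training data.

```python
# Placeholder special tokens (illustrative; real tokenizers define their own).
BOS, SEP = "<|bos|>", "<|sep|>"


def format_example(label, text):
    """Prepend the class label so the finetuned LM learns p(text | label)
    and can later be prompted with "BOS label SEP" to generate that class."""
    return f"{BOS} {label} {SEP} {text}"


examples = [
    ("positive", "a gripping, beautifully shot film"),
    ("negative", "the plot never gets off the ground"),
]
formatted = [format_example(lbl, txt) for lbl, txt in examples]
```

At generation time, one prompts the finetuned model with only the prefix up to SEP, so the sampled continuation is conditioned on the desired class label.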

