GEDI: GENERATIVE DISCRIMINATOR GUIDED SEQUENCE GENERATION

Abstract

While large-scale language models (LMs) are able to imitate the distribution of natural language well enough to generate realistic text, it is difficult to control which regions of the distribution they generate. This is especially problematic because datasets used for training large LMs usually contain significant toxicity, hate, bias, and negativity. We propose GeDi as an efficient method for using smaller LMs as generative discriminators to guide generation from large LMs to make them safer and more controllable. GeDi guides generation at each step by computing classification probabilities for all possible next tokens via Bayes rule, normalizing over two class-conditional distributions: one conditioned on the desired attribute, or control code, and another conditioned on the undesired attribute, or anti control code. We find that GeDi gives controllability on par with or better than the state-of-the-art method in a variety of settings, while also achieving generation speeds more than 30 times faster. Additionally, training GeDi on only four topics allows us to controllably generate new topics zero-shot from just a keyword. Lastly, we show that GeDi can make GPT-2 and GPT-3 significantly less toxic without sacrificing linguistic fluency, making it by far the most practical existing method for detoxifying large language models while maintaining fast generation speeds.

1. INTRODUCTION

Natural language generation has seen great progress with the advent of Transformers (Vaswani et al., 2017) and large-scale training (Radford et al., 2017; 2018; 2019; Brown et al., 2020). Large language models (LMs) like GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020) are able to learn the distribution of their training set well enough to generate realistic text. However, simply imitating the distribution of the training data during generation has many drawbacks; large-scale text training sets are crawled from the web, which is imbued with toxicity, bias, hate, and misinformation. Methods for better controlling or filtering generation are valuable for making LMs trained on such data safer and more generally useful for downstream applications.

Existing approaches to controlling LMs have limitations. Class-conditional LMs (CC-LMs) such as CTRL (Keskar et al., 2019) attempt to control text generation by conditioning on a control code, which is an attribute variable representing a data source. However, CTRL is not as useful for controlling what not to generate (i.e. toxicity). Furthermore, using a specific control code can reduce sample diversity across prompts, as samples will generally resemble the data source of the control code. Another approach is to use discriminators to steer generation, but existing methods to do this are very computationally intensive. Weighted decoding (Holtzman et al., 2018) requires feeding candidate next tokens into a discriminator, and thus scales linearly in computation with the number of tokens to be re-weighted. Plug and Play LM (Dathathri et al., 2020, PPLM) applies up to 10 updates to the generating LM's latent states per time step using gradients from a discriminator, also making it many times slower than generating from the LM directly.

We present GeDi¹ as an algorithm for efficiently guiding generation from large LMs to make them safer and more controllable.
Our proposed method uses CC-LMs as generative discriminators (GeDis) to guide language generation towards desired attributes. We use GeDis to compute classification likelihoods for all candidate next tokens during generation using Bayes rule, saving many thousand-fold in computation as compared with using a standard (non-generative) discriminator to compute this for large vocabulary sizes. We then show how these likelihoods can guide generation from large language models via weighted decoding and filtering. Our experimental results verify the ability of GeDi to control generation in a variety of settings while maintaining linguistic quality on par with strong language models. We apply GeDi (345M parameters) to guide generation from larger language models, and find that:

• GeDi trained on sentiment of movie reviews can generate book text with a positive or negative tone as well as or better than state-of-the-art baselines [Section 5.1]. Guiding towards positivity also has potential applications towards making LMs friendlier.

• GeDi is able to significantly reduce the toxicity of GPT-2 and GPT-3 generation [Section 5.2], without sacrificing linguistic quality as compared with generating from GPT-2 and GPT-3 directly, suggesting applications towards safer language modeling.

• GeDi trained on a dataset of only 4 topics can generalize to new control codes zero-shot [Section 5.3], allowing it to guide generation towards a wide variety of topics.

• GeDi is very computationally efficient for both training and inference. GeDi-guided generation in our experiments is more than 30× faster than applying PPLM with GPT2-XL using default settings from Dathathri et al. (2020). Additionally, smaller GeDis fine-tuned for less than a day on a single GPU are effective and computationally efficient for controlling larger language models. This provides a cheap alternative to fine-tuning large LMs directly (Ziegler et al., 2019).
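The guidance idea above can be sketched at a single decoding step. The function below is a minimal illustration, not the paper's exact formulation: it assumes equal class priors over the control code and anti control code, takes toy probability vectors rather than real LM outputs, and uses an illustrative exponent `omega` (my name, not the paper's) to control guidance strength.

```python
# Sketch of discriminator-guided weighted decoding at one decoding step.
# base_probs:  P_LM(x_t | x_<t) from the large LM being steered.
# pos_probs:   P_theta(x_t | x_<t, c) from the CC-LM with the control code.
# neg_probs:   P_theta(x_t | x_<t, c_bar) with the anti control code.

def gedi_step(base_probs, pos_probs, neg_probs, omega=2.0):
    """Reweight the base LM's next-token distribution by per-token
    classification probabilities P(c | x_t, x_<t)."""
    # Bayes rule with equal priors: normalize over the two
    # class-conditional distributions, one value per candidate token.
    cls = [p / (p + n) for p, n in zip(pos_probs, neg_probs)]
    # Weighted decoding: boost tokens the generative discriminator
    # classifies as belonging to the desired attribute.
    weighted = [b * (c ** omega) for b, c in zip(base_probs, cls)]
    z = sum(weighted)
    return [w / z for w in weighted]
```

Note that `cls` is computed for every candidate token from just two CC-LM forward passes (one per control code), which is the source of the efficiency gain over feeding each candidate token through a standard discriminator.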

2. BACKGROUND

2.1. LANGUAGE MODELING

Language models (LMs) rely on an auto-regressive factorization to perform density estimation and generation of language data. Auto-regressive sequence models with parameters $\theta$ assign a probability to a sequence $x_{1:T} = \{x_1, \ldots, x_T\}$ by factorizing it using the chain rule as follows:

$P_\theta(x_{1:T}) = \prod_{t=1}^{T} P_\theta(x_t \mid x_{<t})$.

Models can assign probabilities to sequences by iteratively predicting a distribution over the next token given the previous tokens. Generating from language models requires iteratively sampling from $P_\theta(x_t \mid x_{<t})$, and then feeding $x_t$ back into the model as input for the next step.
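The chain-rule factorization and the sample-and-feed-back loop can be made concrete with a toy model. In this sketch the "LM" is just a fixed bigram table over a 3-token vocabulary (a stand-in for a neural next-token distribution); the table and all function names are illustrative, not from the paper.

```python
import math
import random

# Toy next-token model P(x_t | x_<t) over a 3-token vocabulary.
# A real LM would be a neural network; here a bigram table suffices.
VOCAB = [0, 1, 2]
BIGRAM = {
    None: [0.5, 0.3, 0.2],  # distribution for the first token
    0: [0.1, 0.6, 0.3],
    1: [0.3, 0.3, 0.4],
    2: [0.4, 0.4, 0.2],
}

def next_token_probs(prefix):
    """P(x_t | x_<t); this toy model conditions only on the last token."""
    last = prefix[-1] if prefix else None
    return BIGRAM[last]

def sequence_log_prob(tokens):
    """log P(x_{1:T}) = sum_t log P(x_t | x_<t), via the chain rule."""
    logp = 0.0
    for t, tok in enumerate(tokens):
        logp += math.log(next_token_probs(tokens[:t])[tok])
    return logp

def sample_sequence(length, rng):
    """Generate by iteratively sampling x_t and feeding it back as input."""
    tokens = []
    for _ in range(length):
        probs = next_token_probs(tokens)
        tokens.append(rng.choices(VOCAB, weights=probs)[0])
    return tokens
```

For example, `sequence_log_prob([0, 1])` is $\log 0.5 + \log 0.6$: the probability of token 0 as the first token times the probability of token 1 given token 0.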

2.2. CLASS-CONDITIONAL LANGUAGE MODELING

Class-conditional language models (CC-LMs) such as CTRL (Keskar et al., 2019) are a way for language models to generate while conditioning on an attribute variable. CC-LMs predict a probability distribution $P_\theta(x_{1:T} \mid c)$, where $c$ is a class variable or a "control code" that describes an attribute of the text in $x_{1:T}$, which could, for instance, describe sentiment or topic. The auto-regressive factorization for a CC-LM is given by the following equation:

$P_\theta(x_{1:T} \mid c) = \prod_{t=1}^{T} P_\theta(x_t \mid x_{<t}, c)$.

When training a CC-LM on a training set of sequences $\{x^{(1)}_{1:T_1}, \ldots, x^{(i)}_{1:T_i}, \ldots, x^{(N)}_{1:T_N}\}$, each sequence $x^{(i)}_{1:T_i}$ is paired with a control code $c^{(i)}$, which is a label or category of the sequence. The LM is trained to minimize the average negative log-likelihood, $\mathcal{L}$:

$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \frac{1}{T_i} \sum_{t=1}^{T_i} \log P_\theta(x^{(i)}_t \mid x^{(i)}_{<t}, c^{(i)})$.

In addition to class-conditional generation, CC-LMs can be used as generative classifiers by applying Bayes rule to compute $P_\theta(c \mid x_{1:T}) \propto P(c) P_\theta(x_{1:T} \mid c)$, as is done by Keskar et al. (2019) for source attribution.
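The Bayes-rule classification step at the end of this section can be sketched with toy numbers. For brevity the class-conditional "LMs" below are unigram distributions (a real CC-LM also conditions on $x_{<t}$), and the class names, priors, and probability tables are invented for the example.

```python
import math

# Toy class-conditional models P_theta(x_t | c) over a 3-token vocabulary,
# one per control code. Real CC-LMs also condition on the token prefix.
COND = {
    "positive": [0.7, 0.2, 0.1],
    "negative": [0.1, 0.2, 0.7],
}
PRIOR = {"positive": 0.5, "negative": 0.5}  # assumed class prior P(c)

def class_posterior(tokens):
    """P(c | x_{1:T}) ∝ P(c) * prod_t P_theta(x_t | c), in log space."""
    log_joint = {
        c: math.log(PRIOR[c]) + sum(math.log(probs[t]) for t in tokens)
        for c, probs in COND.items()
    }
    # Normalize over classes so the posteriors sum to one.
    norm = math.log(sum(math.exp(v) for v in log_joint.values()))
    return {c: math.exp(v - norm) for c, v in log_joint.items()}
```

With these numbers the sequence `[0, 0]` yields a posterior of exactly 0.98 for "positive": $0.5 \cdot 0.7^2 / (0.5 \cdot 0.7^2 + 0.5 \cdot 0.1^2)$. GeDi applies this same normalization over the control code and anti control code.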



¹ Pronounced "Jedi".

