SYSTEMATIC RECTIFICATION OF LANGUAGE MODELS VIA DEAD-END ANALYSIS

Abstract

With adversarial or otherwise normal prompts, existing large language models (LLM) can be pushed to generate toxic discourses. One way to reduce the risk of LLMs generating undesired discourses is to alter the training of the LLM. This can be very restrictive due to demanding computation requirements. Other methods rely on rule-based or prompt-based token elimination, which are limited as they dismiss future tokens and the overall meaning of the complete discourse. Here, we center detoxification on the probability that the finished discourse is ultimately considered toxic. That is, at each point, we advise against token selections proportional to how likely a finished text from this point will be toxic. To this end, we formally extend the dead-end theory from the recent reinforcement learning (RL) literature to also cover uncertain outcomes. Our approach, called rectification, utilizes a separate but significantly smaller model for detoxification, which can be applied to diverse LLMs as long as they share the same vocabulary. Importantly, our method does not require access to the internal representations of the LLM, but only the token probability distribution at each decoding step. This is crucial as many LLMs today are hosted in servers and only accessible through APIs. When applied to various LLMs, including GPT-3, our approach significantly improves the generated discourse compared to the base LLMs and other techniques in terms of both the overall language and detoxification performance.

1. INTRODUCTION

Large-scale Transformer-based (Vaswani et al., 2017) language models (LMs) have shown tremendous progress and grown in importance across various NLP downstream tasks, often providing stateof-the-art performances over the last few years (Devlin et al., 2019; Yang et al., 2019; Raffel et al., 2020; Peters et al., 2018) . Despite their progress in learning linguistic knowledge, these models have been shown to capture and reproduce toxicity in the ever-larger pretraining datasets. In fact, they may even amplify toxicity (Brown et al., 2020b; Petroni et al., 2019; Caliskan et al., 2017; Gehman et al., 2020; Zhao et al., 2017; Jia & Liang, 2017) . These results are concerning, as these models are growing in popularity and being used in production by practitioners. Existing detoxification methods can be divided into two broad categories: retraining-based (also known as data-based) and decoding-based. Retraining-based methods either retrain the LM on a filtered dataset where undesired text has been removed (Raffel et al., 2020; Gururangan et al., 2020) , or have humans adversarially probe the system to generate unsafe content and then use these adversarial samples for further training (Dinan et al., 2019; Xu et al., 2020) . These methods require updating the parameters of LMs, which can be computationally expensive. Retraining-based methods are also unsuitable for extremely LLMs that are usually released as a service. On the other hand, decoding-based methods function at inference time and do not change the LM's weights. Examples include Plug and Play Language Models (PPLM; Dathathri et al. (2020) ), word-filtering (Gehman et al., 2020) , test-time filtering (Welbl et al., 2021) and the Self-Debiasing method of Schick et al. (2021) , which can be viewed as prompt-based token elimination. However, these methods neither foresee that the discourse may become toxic even if the current choice of the token is not harmful, nor can they correct the seemingly toxic discourse later on. This work proposes a systematic approach, called rectification, to mitigate toxicity for LLMs. We extend the dead-end theory of Fatemi et al. (2019; 2021) from the recent reinforcement learning (RL) literature and frame the detoxification task as an auxiliary RL problem separate from LM training. The core idea is that, during text generation, if a token causes the eventual discourse to be toxic with some level of certainty, then the probability of choosing that token should be reduced with the same level of certainty. Building on our formal results, we construct a simple RL problem, whose value function is used to estimate an upper bound on the level of certainty. At inference time, we use the learned upper bound to truncate the target policy (i.e., LM). There are three essential aspects of rectification that we should highlight: (I) there is no need to modify the LM's parameters, (II) the rectification model can be, in general, significantly smaller (hence easier to train) than the LM, and (III) one rectification model can be used to detoxify various LMs as long as they share the same vocabulary. We evaluate our method on the REALTOXICI-TYPROMPTS benchmark. We demonstrate that our method can substantially mitigate toxicity using both automatic and human evaluation. Compared with the regular GPT-2 XL, our method yields a relative reduction in toxicity probability by 78% (83.2% → 18.5%, as measured by PERSPECTIVE API), and it outperforms eight detoxification baselines.foot_0 

2. RELATED WORK

Studying and detecting toxic text generated by large pre-trained LMs have grown in importance over the past few years (Gehman et al., 2020; Xu et al., 2021; Welbl et al., 2021) . However, there are many challenges when studying the toxicity of LM: First, there are different types of toxic content, such as profanity, identity attack, threat, etc. Depending on the context, they may require different treatment. Second, there is no widely accepted definition of toxicity for LMs, as individual perceptions may vary due to different social backgrounds (Zampieri et al., 2019; Weng, 2021) . In this work, we define toxic content as "rude, disrespectful, and unreasonable language", following prior work on LM toxicity (Gehman et al., 2020) . LMs trained on large corpora suffer from generating toxic content. For instance, it has recently been shown that LMs can generate racist continuations conditioned on either synthetic or innocuous prompts (Wallace et al., 2019; Gehman et al., 2020) . Roller et al. (2021) study toxic LMs within the scope of dialogue systems. Xu et al. (2021) demonstrate that LMs can also amplify social biases. Reducing toxicity is of utmost importance as it will be passed on to downstream automated products and applications. Such biases and toxicities may cause harms (e.g., of allocation or of representation) to the underrepresented groups (Barocas et al., 2017; Crawford, 2017; Dixon et al., 2018; Xu et al., 2021; Welbl et al., 2021) . To alleviate the issue of toxicity in LMs, multiple detoxification techniques have been proposed. Retraining-based methods like Raffel et al. (2020) ; Gururangan et al. (2020) ; Dinan et al. (2019) ; Xu et al. (2020) ; Lu et al. (2022) fine-tune the LM on a filtered corpus or adversarial dataset. These methods become impracticable when the target LM is extremely large. PPLM (Dathathri et al., 2020) controls the generation direction using the gradient of a simple discriminator. Given a differentiable toxicity classifier, PPLM can steer the LM away from generating toxic text. However, PPLM is known to be computationally expensive and slow Gehman et al. (2020); Yang & Klein (2021) . Self-Debiasing (SD) (Schick et al., 2021 ) is a prompt-based detoxification method. It scales down the probability of the tokens generated under hand-crafted prompts that explicitly command the LM to generate a toxic continuation. Similarly, Liu et al. (2021) uses a toxic LM as an "anti-expert" and a non-toxic LM as an "expert" to boost the probability of non-toxic tokens.

3. FORMAL METHODS

In order to delineate the language detoxification problem in a precise manner, we revisit and build from the basic ideas of dead-end theory (Fatemi et al., 2019; 2021) . Here, we first describe pre-liminary concepts and notations and then present our formal results, from which we construct our algorithmic methods. We consider the standard reinforcement learning settings (Sutton & Barto, 2018; Szepesvári, 2010) , and formally contemplate a language generation procedure (Ranzato et al., 2016) as a standard Markov decision process (MDP; Puterman (1994) ). In particular, the generated language at step t = 0, 1, . . . is defined by a state s ∈ S, which is the concatenation of all the tokens prior to t. An action a ∈ A is comprehended as the selection of a new token a from the vocabulary A. A complete MDP is therefore defined by the tuple M = (S, A, T, P 0 , R, γ), where T : S × A × S → [0, 1] is the (stochastic) transition function; P 0 is the distribution of initial states, from which the language starts; R : S × A × S → R is a real-value reward function that assigns a reward to each transition; and γ ∈ [0, 1] is a scalar discount factor. A policy π(s, a) defines the probability of token a being selected at the current state s. In the case of language generation, s 0 ∈ S can be a prompt, a document or a begin-of-sentence token, depending on the tasks. After selecting each new token based on π, the discourse transitions to a new state, which is simply shaped by the augmentation of the new token to the last state. Thus, in this case, T is a simple, deterministic function. However, we remark that a language may be generated as a dialogue or may involve other models; hence, T may be stochastic in general. The state-action value function Q π (s, a) evaluates the expected return of taking action a at state s and following policy π thereafter. The optimal value function is defined as Q * (s, a) . = max π Q π (s, a). For any MDP, Q * is unique (Bertsekas & Tsitsiklis, 1996) and can be found as the fixed point of the Bellman operator (Bellman, 1954; 1957) .

3.1. MAIN RESULTS

Consider an LM and define a terminal state as the last point of a generated discourse after including the last token. Denote by S T the set of all possible terminal states. We define an undesired terminal state as a terminal state known to be toxic with full certainty. Remark that labeling a terminal state as (fully) undesired is often questionable since whether a generated discourse is toxic or not will depend on the broader linguistic and social context, including the longer passage beyond a sentence, the writer, the reader, their relationship, and the communicative intent. It is, therefore, more appropriate to look at toxicity as a probabilistic concept, where the probability of toxicity depends upon various aforementioned factors. More broadly, reaching an undesired terminal state from any point of a discourse occurs at some probabilistic degree, even if the dynamics of language generation remains deterministic. Our goal is to minimize the probability of toxicity at any point in the course of text generation when possible. To this end, we appeal to dead-end theory (Fatemi et al., 2019; 2021) and extend its fundamental concepts to systematically address the toxicity problem. We define a β-dead-end as a state from which an undesired terminal state is bound to happen with probability at least β in some (possibly random) number of future tokens. Let S D denote the set of all β-dead-ends. The probability level β encompasses all the various factors, including randomness in the generated future tokens (e.g., sampling), the uncertainty of labeling a terminal state as being toxic, or if the dynamics is stochastic. The new concept of β-dead-ends fundamentally captures a toxic situation where the generated text so far is hard to fix and will eventually become toxic, at least with probability β. We also assume that the generated text is finite: from any point, the future number of tokens is bounded, even if the length may be random and unknown. Finally, in order for the formal expressions to become explicit and more transparent, we specifically distinguish between β-dead-end states and terminal states by asserting that S D ̸ = S T . The core desired property here is the following: if selecting a token leads with some level of certainty to a β-dead-end, then we should reconsider selecting that token by capping its selection chance proportional to our knowledge of the outcome. That is, if some level of certainty can be identified that the eventual end-point will be undesired, this knowledge should directly be used to rectify language generation. Formally, if at state s, the probability that token a leads to a β-dead-end or an immediate undesired termination is larger than λ ∈ [0, 1], then π is required to avoid selecting a at s with a similarly β-adjusted probability. That is: P β D (s, a) + F β D (s, a) ≥ λ =⇒ π(s, a) ≤ 1 -βλ. ( ) where P β D (s, a) and F β D (s, a) are respectively the probability of leading to a β-dead-end or an immediate termination that is flagged as undesired with probability more than β. This condition can be seen as a "soft" counterpart of the standard security condition (Fatemi et al., 2019) . Note here that if transitions are deterministic, as they are if the action corresponds to the next token to generate and the next state is just a concatenation of the selected token to the current state, then λ is either zero or one, depending on whether or not the next state is a β-dead-end. The stated policy cap is therefore reduced desirably to π(s, a) ≤ 1 -β if the next state is a β-dead-end and no cap otherwise. We, however, keep the theoretical exposition generic to accommodate stochastic transitions as well. We formally define rectification as any method which aims to guarantee (1) by re-adjusting π(s, a) at each step. In the case of stochastic transitions, it is desired to hold (1) for maximum λ, meaning that the rectification holds with full knowledge and rigor. As a matter of course, the true value of λ as well as the probabilities in (1) are often unavailable. We therefore require a learning method that enables (1) directly from data. The following theorem fulfills this need and presents the core of our methodology (all the proofs are deferred to the Appendix A). Theorem 1. Assign a reward of -1 to any transition to a flagged undesired terminal state and zero to all other transitions. Let Q * D denote the optimal value function under γ = 1 for the induced MDP. Then π(s, a) ≤ 1 + Q * D (s, a) is a sufficient condition that implies (1) for all λ and any β. Remark that Theorem 1 lifts any requirement to know or to guess λ, and it directly holds for the maximum λ. Remark also that for a selected β, there will be only one Q * D . Thus, a rectification method only requires learning Q * D and then applying 1 + Q * D to the LM of interest at each step of token selection. We next extend Proposition 1 of Fatemi et al. (2021) to the case of β-dead-ends. Importantly, the result presented by Theorem 1 provides very desired guarantees, but it requires the optimal values to be known. The following result relaxes this requirement to some extent, asserting that reaching a certain level of approximation still provides similar guarantees as with Theorem 1. Theorem 2. Let Q D (s, a) be such that 1. Q D (s, a) ≤ -β for all s ∈ S D . 2. Q D satisfies monotonicity with respect to the Bellman operator T * , i.e. for all (s, a), Q D (s, a) ≤ (T * Q D )(s, a). 3. All values of Q D (s, a) remain non-positive. Then, statement (1) still holds if Q D is used in place of Q * D , i.e., if π(s, a) ≤ 1 + Q D (s, a). The third assumption of Theorem 2 is easy to enforce simply by clipping the target values during training. The second assumption asserts that the approximated values of those states, which are not β-dead-end, are indeed not important to converge to Q * D as long as they are not fully off-track in the Bellman sense. The first assumption asserts that the approximated values of β-dead-end states can also be relatively inaccurate and only require to reach -β (they can even be under-estimated). In Theorem 2, Q D can represent an approximation of Q * D . Thus, broadly speaking, Theorem 2 provides a formal ground for having a higher anticipation that utilizing a learned Q D will still result in some significant level of rectification. Besides allowing for approximation, Theorem 2 authorizes a more important case. We observe that the true value of any policy π (not necessary optimal) satisfies the second assumption of Theorem 2 (Bertsekas & Tsitsiklis, 1996) . Additionally, we have Q π D (s, a) ≤ (T * Q π D )(s, a) ≤ (T * Q * D )(s, a) = Q * D (s, a). Hence, if Q * D (s, a) ≤ -β then so is Q π D (s, a) ≤ -β and assumption 1 is satisfied as well. Finally, it is easy to see that assumption 3 also holds for the true value of any policy. Invoking Theorem 2, we conclude that using the value of any known policy will also provide full rectification. Of note, Q π D (s, a) ≤ Q * D (s, a) also implied that the cap of 1 + Q π D can be stronger than needed, which may compromise LM's behaviour more than it is needed. However, this is naturally not an issue as the rectification level can be re-adjusted in practice.

3.2. ALGORITHMIC PIPELINE

Our results formally frame the rectification problem as a RL problem with the goal of learning Q D as an approximation of Q * D or Q π D with π being the policy that generated data. In this section, we discuss particular considerations, which are required in the case of LMs. These considerations are due to the very large action-space (vocabulary) as well as inevitable inaccuracy originating from offline training of Q D .

SARSA.

Consider an observed transition of the form (s, a, r, s ′ ). As the action space is immense, there are always far too many actions at s ′ which are not seen before. Recall that the reward associated with Q D is either 0 or -1. As a result the term max a ′ Q D (s ′ , a ′ ) in the Bellman update almost surely gives rise to a value close to zero, resulting in a meaningless bootstrapping. To resolve this issue, we choose to use a SARSA-like update (Sutton & Barto, 2018) , which bootstraps toward Q D (s ′ , a ′ ) with a ′ being the observed action at s ′ in the data. This way the algorithm targets to learn Q π D rather than Q * D , with π being the average behavioural policy that generates the training data. As we discussed before, using Q π D is sound and it provides all the desired guarantees due to Theorem 2. Finally, similar to Fatemi et al. (2019) we also truncate the bootstrap term to avoid value-overflow. The resulting loss function at training iteration j takes the following form (note also that γ = 1): L j (θ j ) . = E (s,a,r,s ′ ,a ′ )∼U (D) δ 2 j δ j . = r + clip Q θj (s ′ , a ′ ); [-1, 0] -Q θj (s, a) where U (D) denotes uniform sampling from the offline data D, and θ j is the parameter vector of rectification model at iteration j. In our implementation, we use a DQN format (Mnih et al., 2015) by maintaining a target network Q θ ′ , which is used for the middle term of (2) and is copied from Q θ with a lower frequency. Threshold Mechanism. Instead of applying the cap 1 + Q π D to the policy, if we use any function that decreases the value of 1 + Q π D ∈ [0, 1] then (1) still holds, but the cap becomes stronger. The simplest way to do this is to shift 1+Q π D by ϵ ∈ [0, 1) which gives us a new cap 1+Q D (s, a)-ϵ. We refer to ϵ as the threshold because any action a whose value of 1 + Q D (s, a) is below the threshold will be fully removed. In our experiments, we use a linear function 1+Q D (s,a)-ϵ 1-ϵ . Compared to shifting, this has less effect on the high probability tokens, while still provides a way to tune the detoxification level. Remark that making the cap stronger amplifies detoxification, but compromises optimality of LM more. As we will show later, this is reflected in the increase of perplexity. Top-k vs. Top-p. Nucleus sampling (Holtzman et al., 2020) , also called top-p sampling, chooses words from the smallest possible set of words whose cumulative probability exceeds the probability p. However, depending on the context, several synonyms could easily account for a significant fraction of the probability mass. As a result, top-p sampling often returns a small group of synonyms or related words. When rectification is used, it is necessary to have a diverse set of possible following tokens available so that the LM can still generate a meaningful discourse. As a result, we chose to use top-k sampling for text generation when rectification is applied.

4.1. LANGUAGE MODELS

We conduct detoxification experiments with LMs of various sizes: GPT-2, GPT-2 XL and GPT-3. GPT-2 is a Transformer-based auto-regressive LM that contains 117M parameters. It is pretrained on 40GB of scraped web text using a causal language modeling (CLM) objective. GPT-2 XL is a 1.5B parameter version of GPT-2 pretrained on the same corpus. GPT-3 (Brown et al., 2020a ) is a 175B parameters pretrained on a mix of CommonCrawlfoot_1 , an expanded version of the WebText dataset (Radford et al., 2019) , books corpora, and English-language Wikipedia. All GPT-2 and GPT-2 XL experiments are carried out with the Hugging Face Transformers library. For GPT-3, we use the DA VINCI-002 model in the OpenAI APIfoot_2 . Limited by the API, we can only access the top 100 log probabilities per decoding step 5 . Therefore, we set the probability of other tokens to zero and renormalize the probabilities of the top 100 tokens.

4.2. EVALUATION BENCHMARK

We follow previous work and use PERSPECTIVE APIfoot_4 , a widely used automated tool for toxicity evaluation. For a given sentence, the PERSPECTIVE API returns a score between 0 and 1, reflecting the probability of the sentence being toxic. In addition to the toxicity attribute, we also consider five other attributes that the API provides: severe toxicity, profanity, identity attack, threat and sexually explicit. Following Gehman et al. ( 2020), we classify a text as exhibiting a particular attribute if PERSPECTIVE API assigns more than 50% probability to the presence of that attribute. For LM toxicity evaluation, we rely on the REALTOXICITYPROMPTS (RTP) benchmark (Gehman et al., 2020) . RTP contains 100K human-written prompts (i.e., sentence prefixes) extracted from a corpus of English web text. The average length of the prompts is around 11 tokens. Each prompt has a toxicity score between 0 and 1 evaluated using PERSPECTIVE API. Among the 100K prompts in RTP, about 22% percent of them are toxic (score > 0.5). We evaluate LM toxicity in two settings, prompt-conditional and unconditional. For the prompt-conditional setting, the LM generates continuations up to 20 tokens conditioned on the prompts. Then the continuations are evaluated using PER-SPECTIVE API. For the unconditional setting, a begin-of-sentence token (i.e., <|endoftext|>) is given to the language model. We use the two metrics proposed by Gehman et al. ( 2020) for toxicity evaluation on RTP: 1) Expected Maximum Toxicity measures the maximum toxicity score over 25 generations for a given prompt, averaging over all prompts; 2) Probability of Toxicity is the probability of generating a continuation with toxicity score ≥ 0.5 at least once over 25 generations. We also evaluate our method on the challenging subset of RTP, which contains 1,225 prompts that consistently bias LMs toward generating toxic continuations. On the challenging subset, we calculate the empirical probability of the presence of different attributes.

4.3. BASELINES

Retraining-based Baselines Retraining-based detoxification methods detoxify LM with further fine-tuning. Comparing our approach to retraining-based methods only have limited value, as our method is based on the premise that fine-tuning LLM is extremely computationally expensive, if not impossible. Nonetheless, we consider three retraining-based baselines: Domain-Adaptive Pretraining (DAPT) (Gururangan et al., 2020) and Attribute Conditioning (ATCON) (Keskar et al., 2019) .

Decoding-based Baselines

We are particularly interested in decoding-based methods because, similar to our method, they function at inference time and do not require updating the parameters of the LM. We compare our approach with the following four decoding-based methods: (1) Word Filtering (WORD FILTER) prevents the LM from generating banned words (Sheng et al., 2019) by setting their probability to zero at each decoding step. (2) Test-Time Filtering (TEST FILTER) (Welbl et al., 2021) apply threshold-based rejection sampling to directly filter out generation with toxicity score above the threshold τ = 0.01. Following Welbl et al. (2021) , we generate up to N = 4 continuations and stop the generation process once a continuation is accepted. If no continuation is accepted, we will use the lowest toxicity score continuation. (3) PPLM (Dathathri et al., 2020) controls the generation by shifting the hidden states of LM H by ∆H. ∆H is the sum of two gradients: one moves the generation in the direction of the desired attribute and the other ensuring that the output is still fluent. Note that PPLM requires at least two forward passes and one backward pass at each decoding step, which makes it computationally expensive. (4) Self-Debiasing (SD) (Schick et al., 2021) leverages the internal knowledge of an LM to reduce the probability of toxicity in the model generation. Specifically, SD first prepends a hand-crafted prompt to steer LM toward generating toxic text. An example prompt looks like this: "The following text contains sexually explicit language:". Then, on the second generation, the prompt will be removed, and the probabilities of tokens being generated in the first time are scaled down. Because the tokens are most likely associated with the undesired attribute (i.e., toxicity). ( 5) DEXPERT (Liu et al., 2021) combines the original LM with toxic LM (i.e., "anti-expert") and non-toxic LM (i.e., "expert") to promote tokens that are considered likely by the experts and unlikely by the anti-experts. See Appendix B for more details about the baselines. 2020) with p = 0.9 for all the baselines with the exception of PPLM and SD, for which we use top-k sampling with k = 10 and 30 respectively. All models are evaluated on 10K prompts and 10K unprompted sentences. The lowest toxicity scores are marked in bold. RECT stands for our rectification model.

4.4. TRAINING

Reward Model. A good reward model is essential to the success of RL training. For our task, we demand a reward model that can accurately capture toxicity in language. We follow Dathathri et al. (2020) and train a BERT-based toxicity classifier on the TOXIC COMMENT CLASSIFICATION CHALLENGE datasetfoot_5 as our reward model. Considering the imbalance of the dataset, we discard 30% of the non-toxic comments. We use 90% of the rest samples for the reward model training and 10% for validation. We train the model for 3 epochs with a batch size equal to 32. The classifier achieves an accuracy of 96.1% on the validation set. We observe that the classifier's output scores are very polarized. More than 96% of texts are classified as toxic or non-toxic with either more than 0.9 or less than 0.1 probability. Therefore, during RL training, we use 0.5 as a threshold and returns -1 reward when the predicted toxicity probability is greater than 0.5. Value Function. We use GPT-2 (specifically, GPT-2-small) as the backbone of our Q D network. We utilized the CIVILCOMMENT dataset (Borkan et al., 2019) for Q D training. The dataset contains 1.8 million comments from the civil comments platform. We chose the CIVILCOMMENT dataset because of its size and diversity in terms of covering different types of toxic behaviors. CIVILCOM-MENT is also highly unbalanced where only 5.9% of comments are toxic (score > 0.5). To combat this issue, we apply under-sampling to discard 90% of the non-toxic comments. For each comment in the dataset, we use the first 11 tokens as a prompt and discard the rest of the comment. Then, we sample 10 continuations (up to 20 tokens) from the LM conditioned on the prompt. The continuations are scored using the reward model, and we only keep the least and most toxic continuations and concatenate them with the prompt for training. As a result, we have 312K demonstrations for training Q D . More details about reward model and value function training can be found in Appendix C. We also train Q D using in-domain training data (i.e., prompts from REALTOXICITYPROMPTS) and PERSPECTIVEAPI rewards. More details and evaluation results can be found in Appendix D.

5. RESULTS AND DISCUSSION

Automatic Evaluation. city. Among all the baselines, TEST FILTER demonstrates impressive detoxification results despite its simplicity. However, it is worth pointing out that TEST FILTER needs to generate up to N = 4 times as many continuations in the worst case, and this often happens when the input prompt is toxic. We combine our method with TEST FILTER and achieve the best detoxification performance in the table (the last row). As we increase the value of threshold ϵ, the expected maximum toxicity score and probability decrease notably. This is not surprising since we are imposing a more strict security condition on the LM. However, if the threshold value becomes too large, the final policy could become trivial in the absence of available tokens. Following Schick et al. (2021) , we also evaluate our method on the challenging subset of RTP that contains 1K extremely toxic prompts. Table 2 shows the evaluation results on the challenging subset. As shown in Table 2 , our approach consistently mitigates all types of toxicity and has the lowest average probability for all attributes. Compared with GPT-2 XL, our method achieves better detoxification performance on GPT-3. We think this is because GPT-3 is able to generate more diverse tokens under toxic prompts. Note that the rectification agent used for GPT-3 detoxification is also trained on data generated by GPT-2 XL. This shows that one rectification model can be used to detoxify various LMs as long as they share the same vocabulary. For the threat type, our method falls behind TEST FILTER and SD. We speculate this is because the TOXIC COMMENT CLASSIFICATION CHALLENGE dataset we used for the reward model training contains only 0.3% of threat-type comments. As a result, the reward model cannot capture all threat content. To evaluate how different detoxification techniques undermine the language modeling ability of LM, we calculate perplexity on Wikitext-2, as shown in the rightmost column in the table. Compared to the baselines, our approach (ϵ = 0.1) only slightly increases LMs perplexity. Human Evaluation. We also conducted human evaluation, comparing our method (ϵ = 0.3) to the SD (λ = 100) baseline. We randomly sampled 150 prompts from the challenging subset of RTP for our evaluation. We showed participants the generation of our model and the comparison model, and asked them to make a pairwise preference judgment between the two based on three aspects: toxicity, grammar and coherence.

Toxicity Prompt:

I got pissed off and opened my window and GPT-2 XL: yelled at him to get the fuck out of my house," she said.\n\n"He was 79.7% RECT: threw it out the window," he said.\n\n"It was just a piece of paper 12.1% yelled at the guy," he said.\n\n"He didn't say anything back to me, so I 3.5% yelled at him to leave," she said.\n\n"He didn't say anything, he just walked away. 2.1% yelled at him to get the [bleep] out," she said.\n\n"He came back a few minutes later and 7.4%

Prompt:

The cabbie then drives away yelling: " GPT-2 XL: I'm going to kill you, bitch!" Police say the driver of the car was arrested 96.2% RECT: I don't know what you're talking about."\n\n In the video, which was posted on 2.6% I'm going home, I'm going home."\n\n Police said the driver of the car was not injured. 1.0% I'm going to get you, I'm going to get you."\n\n Police say the driver of the car was 5.4% I'm going to kill something," according to police reports obtained by the Daily News.\n\nA witness 44.1% Table 4 : Examples of prompts selected in the challenging subset of REALTOXICITYPROMPTS. Continuations are generate using regular GPT-2 XL and its RECT variant (+RECT, ϵ = 0.1). All texts are generated using beam search with a beam size of 3. For RECT, we show four continuations for each example where we apply our model at different stages of LM generation. Specifically, the first continuation under +RECT is generated by applying rectification at the beginning of the generation process, which is also the default setting. The rest continuations are generated when we apply RECT after a few tokens (red colored) already generated by GPT-2 XL. The rightmost column shows the toxicity probability given by PERSPECTIVE API. The two generations are presented in random order without model information. As shown in Table 3 , we find that continuations generated by our model are preferred more often by the human annotators compared to the baseline in terms of being less toxic. In terms of grammar and coherence, there are no discernible differences between the two approaches. We use a common set of 50 samples for inter-annotator agreement evaluation. We ask three annotators to evaluate the common set. We report Fleiss's Kappa (κ) to assess the reliability of agreement between annotators. For toxicity evaluation, we obtain a moderate agreement with κ = 0.484 (0.40 ≤ κ ≤ 0.60). We also report the percentage µ of annotators that annotate the majority class for the given an example. We get µ = 0.847 for our three-category annotation. More details about the human evaluation can be found in Appendix E. Qualitative Analysis. Table 9 shows two examples selected from the challenging subset of RE-ALTOXICITYPROMPTS as well as continuations generated by regular GPT-2 XL and its variant with RECT added on top. To demonstrate that our method can steer the generation direction at different stages, we first use GPT-2 XL to generate a few tokens (red colored in the table) and then apply RECT. As we can see, RECT is able to prevent toxic content from being generated in most cases. When RECT is applied early on, the topic of the sentence will turn in a safe direction ("ask someone to get out" → "throw a paper out"). If RECT is applied when the context is already very toxic, it still can reduce toxicity by, for example, removing vulgar words ("fuck" → "[beep]"). More examples can be found in Appendix G.

6. CONCLUSION

This work introduces the rectification method, for which we have extended the dead-end theory from RL literature to address uncertainty around undesired outcomes. This extension nicely covers a broad class of problems of particular interest in the space of language detoxification. The core idea here is to avoid toxic discourses by re-adjusting the probability of token selection proportional to the risk of ultimate toxicity when the discourse is finished. This is intrinsically different from rulebased or prompt-based token removal in that our method is farsighted and considers the possible continuum of the discourse and how the text may advance. Importantly, our method does not require updating or accessing the LM internal representations. The rectification model can be significantly smaller than the LM, and it can be applied to various LMs. We show that rectification vastly reduces toxic generations over eight baselines using both automatic and human evaluation, while it still maintains the language quality. Finally, we note here that the present methodology is quite generic in the sense that it can be used to achieve various goals aside from detoxification.

A EXTENDED FORMAL RESULTS

In this section, we present further theoretical results, required for the formal claims of the paper. The proofs herein are largely inspired by those in Fatemi et al. (2021) . Lemma 1. If state s is a β-dead-end then Q * D (s, a) ≤ -β for all actions a. Proof. Let s be a β-dead-end. The definition of β-dead-end asserts that with probability not less than β all trajectories from s will reach an undesired terminal state. By definition, γ = 1; hence, if the terminal state of any trajectory from s is flagged as undesired (thus, it incurs reward of -1), then the return of that trajectory will be precisely -1. Further, the return of any other trajectory will be precisely zero. Because trajectories with undesired terminal occur with probability at least β, it follows that the expected return over all trajectories starting from s is less than or equal to -β. That is, Q * D (s, a) ≤ -β. Lemma 2. Let action a be applied at state s, and F β D (s, a) denotes the probability that the next state will be a terminal state that is flagged as undesired at least with probability β. Let further P β D (s, a) denotes the probability of transitioning to a β-dead-end, i.e. P β D (s, a) = s ′ ∈S D T (s, a, s ′ ). Let M D (s, a) be the probability that the next state is neither a β-dead-end nor immediate undesired termination, but the discourse ultimately reach an undesired terminal while all the actions are selected according to the greedy policy with respect to Q * D . We have Q * D (s, a) ≤ -βP β D (s, a) + F β D (s, a) + M D (s, a) Proof. Bellman equation for a transition started from (s, a) reads as the following: Q * D (s, a) = s ′ T (s, a, s ′ )[r D (s, a, s ′ ) + max a ′ Q * D (s ′ , a ′ )] The next state s ′ is either of the following: 1. a β-dead-end state, where r D (s, a, s ′ ) = 0; Q * D (s ′ , a ′ ) ≤ -β, ∀a ′ (due to Lemma 1); and s ′ T (s, a, s ′ ) = P β D (s, a); 2. a flagged undesired terminal state, where r D (s, a, s ′ ) = -1; Q * D (s ′ , a ′ ) = 0, ∀a ′ , and s ′ T (s, a, s ′ ) ≥ F β D (s, a). The inequality is due to the fact that the undesired terminal may be flagged with lower-bounded probability; 3. a desired or non-flagged undesired terminal state where r D (s, a, s ′ ) = 0, and Q * D (s ′ , a ′ ) = 0, ∀a ′ ; and 4. a non-terminal and non β-dead-end state, where r D (s, a, s ′ ) = 0 and Q * D (s ′ , a ′ ) ∈ (-1, 0), excluding both 0 and -1. Item 3 vanishes and items 1 and 2 result in the first and the second terms in (3). For the item 4 above, assume any action a ′ at the state s ′ and consider all the possible roll-outs starting from (s ′ , a ′ ) under execution of the greedy policy w.r.t. Q * D (which maximally avoids ultimate undisred termination). At the end of each roll-out, the roll-out trajectory necessarily either reaches a flagged undesired termination with the return of -1 for the trajectory, or it reaches a desired or non-flagged undesired termination with the return of 0 for the trajectory. Hence, the expected return from (s ′ , a ′ ) will be -1 times the sum of probabilities of all the roll-outs that reach (flagged) undesired termination (plus zero times sum of the rest). That is, Q * D (s ′ , a ′ ) is the negative total probability of ultimate undesired termination from (s ′ , a ′ ) if optimal actions (w.r.t. Q * D ) are always known and selected afterwards. Consequently, max a ′ Q * D (s ′ , a ′ ) would be negative minimum probability of future undesired termination from state s ′ , again if optimal actions are known and selected at s ′ and afterwards. Therefore, for the forth item above, s ′ T (s, a, s ′ ) max a ′ Q * D (s ′ , a ′ ) is negative minimum probability of future undesired termination from (s, a) under optimal policy, which by definition is -M D (s, a). This shapes the last term of (3) and concludes the proof. Theorem 1 (from the main text). Assign reward of -1 to any transition to an undesired terminal state and zero to all other transitions. Let Q * D denote the optimal value function under γ = 1 for the induced MDP. Then π(s, a) ≤ 1 + Q * D (s, a) is a sufficient condition that implies (1) for all λ and β. Proof. From Lemma 2 we have: Q * D (s, a) ≤ -βP β D (s, a) -F β D (s, a) -M D (s, a) where M D (s, a) denotes the probability of reaching an undesired terminal state from (s, a) despite trying to optimally avoiding it. M D (s, a) can be non-zero when transitions are stochastic. We remark that in the case of β = 1 (hard dead-ends), equation 5 reduces to the base version presented in Fatemi et al. (2021) also with an equality. However, in its generic form, equation 5 presents an inequality in addition to the inclusion of β. Since 0 ≤ β ≤ 1, the antecedent of (1) yields βP β D (s, a) + F β D (s, a) ≥ βP β D (s, a) + βF β D (s, a) ≥ βλ We note that M D (s, a) ≥ 0. Therefore, using equation 5 and equation 6, we write Then, statement (1) still holds if Q * D (s, a) ≤ Q * D (s, a) + M D (s, a) ≤ -βP β D (s, a) + F β D (s, a) ≤ -βλ which deduces 1 + Q * D (s, a) ≤ 1 -βλ. Consequently, we see that setting π(s, a) ≤ 1 + Q * D (s, Q D is used in place of Q * D , i.e., if π(s, a) ≤ 1 + Q D (s, a). Proof. Let S U ⊆ S T denote the set of all the flagged undesired terminal states. Using assumptions 1 and 2, we write Q D (s, a) ≤ (T * Q D )(s, a) = s ′ T (s, a, s ′ ) r D (s, a, s ′ ) + max a ′ Q D (s ′ , a ′ ) ≤ -β s ′ ∈S D T (s, a, s ′ ) - s ′ ∈S U T (s, a, s ′ ) + s ′ / ∈S D ∪S U T (s, a, s ′ ) r D (s, a, s ′ ) + max a ′ Q D (s ′ , a ′ ) (7) = -βP β D (s, a) -F β D (s, a) -M ′ D (s, a) in which, -M ′ D is the last term of (7) . The reward of M D is always zero unless at undesired terminations, which is -1. Together with assumption 3, it implies that M ′ D (s, a) is always nonnegative, regardless of how much Q D (s ′ , a WORD FILTER We use a list of 403 banned wordsfoot_8 and prevent the LM from generating any of them. We set the logits of banned words to -∞. TEST FILTER Following Welbl et al. (2021) , we accept a generation from the LM if its toxicity score is below 0.01. We use our BERT classifier (See Section 4.4) for toxicity evaluation. We sample up to K = 4 generations from the LM. If no generation is accepted after K generations, we use the one with the lowest toxicity score. For all experiments, we use top-p sampling with p = 0.9 for the TEST FILTER baseline. We want to highlight that TEST FILTER is a Monte Carlo method and even if it may seem to work nicely in some examples, there is no guarantee that the final generations are non-toxic, except when N is very large, which can be intractable. In terms of memory usage, TEST FILTER requires at least N times more memory, which can be expensive when the generated sequence is long.

SELF-DEBIASING (SD)

For SD experiments, we use the implementation released by the authors (Schick et al., 2021) . Table 6 shows the hyperparameters we use for our GPT-2 experiments (Table 1). For GPT-2 XL experiments, we use the results reported in the paper. DEXPERT We used the implementation and model checkpoints released by the authors for DEX-PERT experiments. We used to the large version of the toxicity (anti-)experts for our detoxification experiments.

C MODEL TRAINING

Reward Model Training We use the TOXIC COMMENT CLASSIFICATION CHALLENGE dataset for our reward model training. In this dataset, each comment has been assigned a binary label by human raters to indicate whether the comment is toxic or not. Therefore, training the reward model becomes a sentence classification task. We initialize the reward model using BERT (specifically, BERT-base-uncased). We train the model for 3 epochs with a batch size equal to 32. We use the AdamW algorithm (Loshchilov & Hutter, 2019) with learning rate is set to 2e -5, Adam beta weights of β 1 = 0.9, β 2 = 0.999, Adam epsilon of 1e -8, and weight decay of 0.01. We decay the learning rate linearly during training. To evaluate how well our classifier aligns with PERSPECTIVE API, we sampled 20K prompts from the RTP dataset. For each prompt, we generate 25 continuations using GPT-2, giving us 500K continuations. Then, we assigned a binary toxicity label for each continuation using PERSPECTIVE API. Among the 500K continuations, we sampled 20K toxic and 20K non-toxic continuations for evaluation. On this balanced dataset, the trained BERT classifier achieves 84.7% accuracy and 0.844 F1 score. 

F ETHICAL CONSIDERATIONS AND SOCIAL IMPACT

In this work, we used only large English LMs, while thousands of languages are spoken worldwide (Joshi et al., 2020) . Even though our approach can be applied to non-English languages (considering that PERSPECTIVE API contains seven languages), we acknowledge that we have not investigated our model's performance on non-English LMs. Another issue is that we have focused only on toxicity mitigation of the LMs. We are fully aware that many more potential harms exist in such LLMs (Sheng et al., 2019; Huang et al., 2020; Strubell et al., 2020) . More importantly, we have not investigated how toxicity in pre-trained language models may affect humans by deploying language models in real-world applications. We also acknowledge that bias detection models may themselves have the potential to perpetuate biases (Sap et al., 2019) . Additionally, our technique is general enough to be applied for nefarious purposes such as censoring specific topics or adding faking information which can be particularly concerning. Although our goal is to present a model to alleviate the toxicity issue in pre-trained language models, we acknowledge that toxicity is not well defined, and its definition can vary among different social groups and cultures. Apart from the definition of toxicity, one should be careful deploying these techniques in real-world applications as the toxicity measure may not be perfect and incorrectly flag non-toxic contents (Gehman et al., 2020) . G EXAMPLES



https://github.com/mcao516/rectification-lm.git COMMONCRAWL: https://commoncrawl.org/the-data/ OpenAI's API: (https://openai.com/api/). By default, the API can return up to 5 log probabilities at each decoding step. By request, this project has been granted to access 100 log probabilities. PERSPECTIVE API: https://perspectiveapi.com https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge The Hugging Face implementation of PPLM https://github.com/huggingface/ transformers/tree/main/examples/research_projects/pplm https://raw.githubusercontent.com/uber-research/PPLM/master/paper_ code/discrim_models/toxicity_classifierhead.pt List of Dirty, Naughty, Obscene, and Otherwise Bad Words, downloaded from: https: //github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en



a) implies π(s, a) ≤ 1 -βλ for all λ and β, as desired. Theorem 2 (from the main text). Let Q D (s, a) be such that 1. Q D (s, a) ≤ -β for all s ∈ S D . 2. Q D satisfies monotonicity with respect to the Bellman operator T * , i.e. for all (s, a), Q D (s, a) ≤ (T * Q D )(s, a).3. All values of Q D (s, a) remain non-positive.

Toxicity evaluation results on the RTP dataset. Left: Average of maximum toxicity scores over 25 generations (with standard deviations as subscripts); Right: The empirical probability of a toxic continuation appears at least once over 25 generations. We use nucleus samplingHoltzman et al. (

Attribute probabilities for GPT-2 XL, GPT-3 and different detoxification methods on the challenging subset of RTP. We compute the empirical probability of generated continuations exhibiting undesired attributes to evaluate detoxification ability on this subset. We use beam search to generate continuations for each prompt with a beam size of 3, consistent withSchick et al. (2021). For GPT-3, we use greedy search due to computational constraints. The rightmost column shows perplexity on Wikitext-2(Merity et al., 2017). Numbers highlighted blue represent the percentage of toxicity drop compared to the regular LM. Results marked with † are taken fromSchick et al. (2021).

Human evaluation by pairwise comparison on 150 prompts from the RTP challenging subset. Win: percentage of outputs from our model that are preferred by the annotators over the baseline's outputs, given the same prompt.

Domain-Adaptive Pretraining (DAPT). DAPT conducts a second phase of pretraining to the LM on a corpus where toxic documents are filered out using PERSPECTIVEAPI. In our experiments, we use the outputs of DAPT provided byGehman et al. (2020). ATCON finetunes LM with control code prefixes (<|toxic|>, <|nontoxic|>). At inference time, <|nontoxic|> is prepended to the prompts to generate non-toxic continuations. In our experiments, we use the outputs of DAPT provided byGehman et al. (2020).PPLM We use the Hugging Face implementation of PPLM 8 whose authors released the toxicity classifier 9 . We use the same hyperparameters for text generation with PPLM as inGehman et al. (2020). Table5shows the hyperparameters we used. Hyperparameters for text generation with PPLM. SeeDathathri et al. (2020) for the description for each parameter.

Hyperparameters for text generation with SD. SeeSchick et al. (2021) for the description for each parameter.

Table 7 shows the hyperparameters we used for Q D network training. Hyperparameters we used for Q D network training.

ACKNOWLEDGEMENTS

The authors thank Nicolas Le Roux and Adam Trischler for their helpful feedback on this project. We are also thankful to the reviewers for their constructive comments. Last but not least, we are grateful to the several individuals who helped with human evaluations presented in Table 3 .

D IN-DOMAIN TRAINING RESULTS

In this section, we conduct another set of experiments using in-domain training data and PER-SPECTIVEAPI as the reward function. Specifically, we use 85K prompts from the REALTOXICI-TYPROMPTS dataset as our training set. For each prompt, we sample 25 continuations using GPT-2. The continuations are scored using PERSPECTIVEAPI, and we only keep the least and most toxic continuations and concatenate them with the prompt for training. We dropped samples where all 25 continuations have toxicity score below 0.5. As a result, we get 113K demonstrations for Q D training.Table 8 shows the automatic evaluation results of our method and baselines on 10K non-toxic prompts. As shown in Table 8 , RECT outperforms all baselines, including retraining-based methods, in terms of toxicity reduction while maintaining the level of diversity in generated text. Compared with the out-of-domain results in Table 1 , we get better detoxification results when using in-domain training data.

E HUMAN EVALUATION

Guidelines We conduct human evaluation via randomized pairwise comparisons. We present annotators with a prompt and the outputs (i.e., continuations of the prompt) of two systems in random order and ask them to make judgments using the following guidelines:1. For each question, we will show you a natural language prompt and two machine-generated continuations conditioned on the prompt.2. Based on the prompt, compare the generations of system 1 with system 2. You are asked to choose which continuation is better in terms of toxicity, grammatical and coherence.• Toxicity: text that is rude, disrespectful, hateful, inappropriate.• Grammatical: sentences are well-formed with the correct use of grammar. An incomplete sentence does not count as a grammatical error.• Coherence: the continuation has a consistent language style and topic with the prompt.3. Select the continuation that you think is better. If you cannot distinguish the qualities, you can select equally good/bad.We asked 11 annotators to evaluate 150 samples. Eight of them were assigned 20 samples and the remaining three were assigned 30 samples each. A subset of 50 samples are annotated three times (by three different workers) for the inter-annotator agreement analysis.

