UNIFIED DETOXIFYING AND DEBIASING IN LANGUAGE GENERATION VIA INFERENCE-TIME ADAPTIVE OPTIMIZATION

Abstract

Warning: this paper contains model outputs exhibiting offensiveness and biases. Pre-trained language models (PLMs) have recently prospered in various natural language generation (NLG) tasks thanks to their ability to generate fairly fluent text. Nevertheless, these models are observed to capture and reproduce harmful content from their training corpora, typically toxic language and social biases, raising severe ethical concerns. Prior works on ethical NLG tackle detoxifying and debiasing separately, which is problematic: we find that debiased models still exhibit toxicity, while detoxified ones even exacerbate social biases. To address this challenge, we propose the first unified framework of detoxifying and debiasing, called UDDIA, which jointly formalizes the two problems as rectifying the output space. We theoretically interpret our framework as learning a text distribution that mixes weighted attributes. Moreover, UDDIA adaptively optimizes only a few parameters during decoding, based on a parameter-efficient tuning scheme and without any training data. This leads to minimal loss of generation quality and improved rectification performance at acceptable computational cost. Experimental results demonstrate that, compared to several strong baselines, UDDIA achieves debiasing and detoxifying simultaneously and better balances efficiency and effectiveness, taking a further step towards practical ethical NLG.

1. INTRODUCTION

Transformer-based (Vaswani et al., 2017) Pre-trained Language Models (PLMs) (Radford et al., 2019; Raffel et al., 2019; Lewis et al., 2020) can produce quite fluent text and have empowered a wide range of downstream Natural Language Generation (NLG) tasks (See et al., 2019; Zhang et al., 2020; Lewis et al., 2020). However, these PLMs are observed to internalize, propagate, and even amplify problematic content present in crawled, unclean corpora, typically toxic language (e.g., offensive text) (Gehman et al., 2020) and social biases (e.g., stereotypes or divergent model predictions) towards particular demographic groups (e.g., gender and race) (Sheng et al., 2019), as shown in Figure 1(a). As large PLMs are becoming the foundation of rapidly-growing NLG services (Bommasani et al., 2021) that directly interact with end users, such pernicious text would propagate misrepresentations (known as representational harms), aggravate inequality of opportunity (Blodgett et al., 2020), and cause psychological or even material harms (Weidinger et al., 2021), bringing a profound negative impact on society. Moreover, such issues are found to persist across increasing model sizes (Rae et al., 2021), emphasizing the urgency of developing practical methods for ethical NLG.

These problems have drawn much attention to developing detoxifying and debiasing techniques, and previous methods mainly fall into two paradigms. The first is domain-specific pretraining (Gururangan et al., 2020), which further trains the model on clean (e.g., non-toxic) corpora (Wang et al., 2022).

[Figure 1(a) example generations: • The woman had a job as a nurse at the hospital, but was fired after the incident. • The black person was known for his "punchy" and "aggressive" behavior. • The UFC champ then suggested Justino is a "jerk" to the northeast natives. • We are more likely to believe a woman is mentally ill than a man. Women may also be more inclined to attempt suicide.]
Furthermore, prior methods usually handle detoxifying and debiasing separately. Some works have noted the fairness problem of detoxified PLMs, reporting increased perplexity on text related to marginalized groups (Welbl et al., 2021; Xu et al., 2021). We also observe related phenomena, shown in Figure 1(b). On the one hand, consistent with Zhou et al. (2021) and Sap et al. (2021), detoxifying techniques may even amplify social biases. On the other hand, while debiasing methods contribute somewhat to toxicity reduction, dedicated detoxifying methods still achieve the best toxicity performance. It is therefore necessary to jointly detoxify and debias an NLG model for its ethical use.

To handle the above challenges, we propose the first Unified framework of Detoxifying and Debiasing based on Inference-time Adaptive optimization (UDDIA). UDDIA formalizes debiasing and detoxifying jointly as rectifying the output distribution: it equalizes the dependence between each token and different demographic groups while minimizing the dependence on toxicity. We provide a theoretical guarantee for UDDIA by interpreting it as learning a mixture of attribute-conditioned (demographic group or toxicity) text distributions. Beyond the joint objective, UDDIA performs the rectification via an adaptive optimization scheme with parameter-efficient tuning during inference. Extensive experiments show that UDDIA achieves superior performance in bias mitigation and toxicity reduction, as well as satisfactory generation efficiency and minimal loss of NLG quality.
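The mixture interpretation above admits a simple illustration: the rectified next-token distribution can be viewed as a weighted combination of attribute-conditioned distributions, p(x) = Σ_a w_a · p(x | a). The toy sketch below is our own illustration (not the authors' implementation; the vocabulary, distributions, and weights are assumed) showing how equal mixture weights equalize the dependence between a token and two demographic groups:

```python
def mix_distributions(dists, weights):
    """Weighted mixture of attribute-conditioned next-token distributions:
    p(x) = sum_a w_a * p(x | a)."""
    assert abs(sum(weights) - 1.0) < 1e-9, "mixture weights must sum to 1"
    vocab_size = len(dists[0])
    return [sum(w * d[i] for w, d in zip(weights, dists))
            for i in range(vocab_size)]

# Toy 3-token vocabulary; two group-conditioned distributions that
# favor different tokens (e.g., stereotyped completions for each group).
p_given_group_a = [0.7, 0.2, 0.1]
p_given_group_b = [0.1, 0.2, 0.7]

# Equal weights yield a distribution that favors neither group's token.
rectified = mix_distributions([p_given_group_a, p_given_group_b], [0.5, 0.5])
```

With weights [0.5, 0.5] the two extreme tokens receive equal probability, so neither group-associated completion dominates; UDDIA's actual objective optimizes this balance adaptively during decoding rather than fixing the weights by hand.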

2. RELATED WORK

Our work is related to bias mitigation and toxicity reduction for language generation. Recent literature on both topics takes two main paradigms: domain-specific training and constrained decoding.

Bias Mitigation. One well-known domain-specific training method for debiasing is Counterfactual Data Augmentation (CDA) (Lu et al., 2020), which creates pseudo training data to encourage more diverse content when generating from marginalized-group prompts (Saunders & Byrne, 2020; Liu et al., 2020a; Zmigrod et al., 2019). Another approach, which requires no augmented data, is regularized training: it applies regularization losses to equalize the generation probabilities prompted from different groups (Qian et al., 2019; Bordia & Bowman, 2019; Huang et al., 2020), or utilizes discriminators to remove sensitive information (Peng et al., 2020; Liu et al., 2020b). These training-based methods work well but require extra data and are resource-consuming, hindering their use with large PLMs. Consequently, we highlight the constrained decoding paradigm, which consists of three lines of methods. The simplest line imposes heuristic constraints to involve more diverse group-related tokens (Saunders et al., 2021). The second line places searched adversarial triggers at the beginning of prompts to stimulate unbiased generation (Sheng et al., 2020). The last line steers the model output using either an extra PLM (Schick et al., 2021) or a learned projection matrix (Liang et al., 2021).

Toxicity Reduction. Similar to debiasing, detoxification adheres to the two paradigms. The domain-specific training paradigm performs additional pretraining on elaborately filtered non-toxic corpora to dilute the captured toxicity (Gehman et al., 2020; Gururangan et al., 2020; Wang et al., 2022), conducts attribute (toxic/non-toxic) conditioning training (Gehman et al., 2020), or achieves style transfer.




Figure 1: (a) GPT-2 generated sentences that contain stereotypes of marginalized groups (magenta) or toxic content (steelblue). (b) Toxicity (measured by the PERSPECTIVE API) and bias (measured by the Regard score (Sheng et al., 2019)) of text generated by different models. Compared with GPT-2, the detoxifying method DExperts achieves the largest toxicity reduction but even amplifies social biases.

This effective paradigm needs carefully-created training data and becomes quite expensive for large PLMs. Therefore, we focus on the other paradigm, constrained decoding, which avoids undesired tokens via simple filtering (Welbl et al., 2021), adversarial triggers (Sheng et al., 2020), hidden-state updates (Dathathri et al., 2020), or output distribution projection (Liu et al., 2021), without retraining PLMs. Nonetheless, filtering ignores language diversity; optimizing triggers or hidden states is time-consuming; and directly projecting the output distribution hurts text fluency.
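To make the output-distribution steering discussed above concrete, the following sketch mimics DExperts-style decoding (Liu et al., 2021), where the base model's logits are shifted toward a non-toxic "expert" and away from a toxic "anti-expert" before the softmax. This is our own minimal illustration: the 3-token vocabulary, logit values, and the weight alpha are assumptions, and real systems operate over full vocabularies with learned expert models.

```python
import math

def steer_logits(base, expert, anti_expert, alpha=1.0):
    """Shift base logits by the expert/anti-expert difference, then softmax:
    z_i = base_i + alpha * (expert_i - anti_expert_i)."""
    z = [b + alpha * (e - a) for b, e, a in zip(base, expert, anti_expert)]
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]

# Toy 3-token vocabulary: token 0 is the "toxic" candidate.
base_logits = [2.0, 1.0, 0.5]
expert_logits = [0.0, 1.5, 1.0]       # non-toxic expert avoids token 0
anti_expert_logits = [2.0, 0.0, 0.0]  # toxic anti-expert favors token 0

steered = steer_logits(base_logits, expert_logits, anti_expert_logits)
# Passing base for both experts makes the shift cancel, recovering
# the plain softmax of the base logits for comparison.
plain = steer_logits(base_logits, base_logits, base_logits)
```

In this toy setting the toxic token's probability drops relative to plain decoding, illustrating both the appeal of direct projection and its risk: since every step reshapes the whole distribution, fluency can suffer, which motivates UDDIA's lighter-weight adaptive tuning of only a few parameters at inference time.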

