HUMAN-GUIDED FAIR CLASSIFICATION FOR NATURAL LANGUAGE PROCESSING

Abstract

Text classifiers have promising applications in high-stakes tasks such as resume screening and content moderation. These classifiers must be fair and avoid discriminatory decisions by being invariant to perturbations of sensitive attributes such as gender or ethnicity. However, there is a gap between human intuition about these perturbations and the formal similarity specifications capturing them. While existing research has started to address this gap, current methods are based on hardcoded word replacements, resulting in specifications with limited expressivity or ones that fail to fully align with human intuition (e.g., in cases of asymmetric counterfactuals). This work proposes novel methods for bridging this gap by discovering expressive and intuitive individual fairness specifications. We show how to leverage unsupervised style transfer and GPT-3's zero-shot capabilities to automatically generate expressive candidate pairs of semantically similar sentences that differ along sensitive attributes. We then validate the generated pairs via an extensive crowdsourcing study, which confirms that many of these pairs align with human intuition about fairness in the context of toxicity classification. Finally, we show how limited amounts of human feedback can be leveraged to learn a similarity specification that can be used to train downstream fairness-aware models.

1. INTRODUCTION

With the rise of pretrained large language models (Sun et al., 2019), text classifiers can now be employed in tasks related to automated hiring (Bhatia et al., 2019), content moderation (Rieder & Skop, 2021), and social science research (Widmer et al., 2020). They are also part of machine learning pipelines for unsupervised style transfer (Reid & Zhong, 2021) or for reducing the toxicity of language model outputs (Welbl et al., 2021). However, text classifiers have been shown to often exhibit bias based on sensitive attributes such as gender (De-Arteaga et al., 2019) or demographics (Garg et al., 2019), even for tasks in which these dimensions should be irrelevant. This can lead to unfair and discriminatory decisions, distort analyses based on these classifiers, or propagate undesirable demographic stereotypes to downstream applications. The intuition that certain demographic indicators should not influence decisions can be formalized via the concept of individual fairness (Dwork et al., 2012), which posits that similar inputs should be treated similarly by machine learning systems. While in a classification setting similar treatment of two inputs can naturally be defined as both inputs receiving the same label, the notion of input similarity should capture the intuition that certain input characteristics should not influence model decisions.

Key challenge: generating valid, intuitive, and diverse fairness constraints. A key challenge when applying the individual fairness framework is defining the similarity notion φ. Indeed, this definition is often contentious, as fairness is a subjective concept: what counts as a valid demographic indicator, as opposed to a problematic stereotype? Counterfactual definitions of similarity (Kusner et al., 2017) offer a principled solution, but they shift the burden towards the underlying causal model, whose definition can often be similarly contentious.
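The pair-based reading of individual fairness used here can be sketched as follows. This is a minimal illustration, not the paper's implementation: `f`, `phi`, and the toy pair below are hypothetical stand-ins for a trained classifier and a learned similarity specification.

```python
def fairness_violations(f, pairs, phi):
    """Return candidate pairs that phi deems similar but f labels differently.

    f:     classifier mapping a sentence to a label
    phi:   similarity predicate on sentence pairs (the specification)
    pairs: candidate pairs (s, s_prime)
    """
    return [(s, s_prime) for (s, s_prime) in pairs
            if phi(s, s_prime) and f(s) != f(s_prime)]


# Toy example: a keyword-based "toxicity" classifier and a trivial phi.
toy_f = lambda s: "gay" in s
toy_phi = lambda a, b: len(a.split()) == len(b.split())
pairs = [("It is so old", "It is so gay"), ("hello there", "hello there")]
violations = fairness_violations(toy_f, pairs, toy_phi)
```

An empty `violations` list would mean `f` satisfies the specification induced by `phi` on these candidate pairs.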
While many other definitions have been proposed, it is widely recognized that the similarity of inputs can often be highly task-dependent (Dwork et al., 2012; Barocas et al., 2019): e.g., two biographies that are identical except for indicators of gender may be considered similar in a professional context, but not in the context of online dating.

[Figure 1 diagram: Word Replacement → Style Transfer + GPT-3 → Active Learning with Crowdworkers, showing how an expanded set of candidate pairs is filtered down; example pairs include "She was a great muslim" / "She was a great christian" and "I don't like this movie. It is so old" / "I don't like this movie. It is so gay".]
Figure 1: Workflow overview. We begin by generating sentence pairs using word replacement, and then add pairs of sentences leveraging style transfer and GPT-3. Then, we use active learning and crowdworker judgments to identify pairs that deserve similar treatment according to human intuition.

In the context of text classification, most existing works have cast similarity in terms of word replacement (Dixon et al., 2018; Garg et al., 2019; Yurochkin & Sun, 2021; Liang et al., 2020). Given a sentence s, a similar sentence s′ is generated by replacing each word in s that belongs to a list of words A_j indicative of a demographic group j with a word from a list A_j′ indicative of another demographic group j′ ≠ j. This approach has several limitations: (i) it relies on having exhaustively curated word lists A_j of sensitive terms; (ii) perturbations that cannot be represented by replacing single sensitive terms are not covered; and (iii) many terms are only indicative of demographic groups in specific contexts, so directly replacing them with other terms will not always result in a pair (s, s′) that is similar according to human intuition. Indeed, word replacement rules can often produce sentence pairs that differ only along an axis not relevant to fairness (e.g., by replacing "white house" with "black house"). In addition, they can generate so-called asymmetric counterfactuals (Garg et al., 2019): sentence pairs (s, s′) that look similar but clearly do not warrant similar treatment. For example, in the context of toxicity classification, the text "I don't like this movie. It is so old" may not be considered toxic, while "I don't like this movie. It is so gay" clearly is.

This work: generating fairness specifications for text classification. The central challenge we consider in this work is how to generate a diverse set of input pairs that aligns with human intuition about which inputs should be treated similarly in the context of a fixed text classification task.
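The word-replacement baseline described above can be sketched as follows. The term mapping here is purely illustrative; the curated word lists A_j used in practice are far larger and direction-specific, and this naive version exhibits exactly the limitations (i)–(iii) discussed in the text.

```python
# Illustrative identity-term mapping (hypothetical, not a curated list A_j).
TERM_SWAP = {"muslim": "christian", "women": "men"}

def word_replacement_pairs(sentences, term_swap=TERM_SWAP):
    """Generate candidate pairs (s, s') by swapping listed identity terms."""
    pairs = []
    for s in sentences:
        tokens = s.split()
        swapped = [term_swap.get(t.lower(), t) for t in tokens]
        s_prime = " ".join(swapped)
        if s_prime != s:  # keep only sentences that were actually perturbed
            pairs.append((s, s_prime))
    return pairs
```

Note how context-blind the substitution is: it would happily rewrite terms that are not demographic markers in their context, which is precisely why the generated pairs need human validation downstream.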
These pairs then induce fairness constraints that collectively define an implicit fairness specification on a downstream classifier, as individual fairness postulates that the two members of each pair should be classified in the same way. We address this challenge via a three-stage pipeline, summarized in Fig. 1. First, we start from a training dataset D for the text classification task under consideration and generate a set C_w of candidate pairs (s, s′) by applying word replacement to sentences s ∈ D. Second, to improve diversity and expand beyond word replacement rules, we extend C_w to a larger set of pairs C_e by borrowing ideas from unsupervised style transfer. We change markers of demographic groups, e.g., "women", "black people", or "Christians", in sentences s ∈ D by replacing the style classifier used by modern unsupervised style transfer methods (Reid & Zhong, 2021; Lee, 2020) with a classifier trained to identify mentions of demographic groups. In addition, we add pairs from GPT-3 (Brown et al., 2020), prompted to change markers of demographic groups for sentences in D in a zero-shot fashion. Third, to identify which of the generated pairs align with human intuition about fairness in the context of the considered classification task, we design a crowdsourcing experiment in which workers are presented with candidate pairs and indicate whether the pairs should be treated similarly for the considered task. Since obtaining human feedback is expensive, we label only a small subset of the generated pool and train a BERT-based (Devlin et al., 2019) classifier φ to recognize pairs that should be treated similarly, yielding a final set of filtered pairs Ĉ_r ⊆ C_e. To further reduce labeling costs, we use active learning, similar to Grießhaber et al. (2020), to decide which pairs to label. We also demonstrate that the final set of constraints Ĉ_r can be used for training fairness-aware downstream classifiers by adopting the Counterfactual Logit Pairing (CLP) regularizer of Garg et al. (2019).
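The CLP objective mentioned above pairs the classifier's logits on the two members of each constraint. The sketch below shows the shape of that loss in plain Python; the function name, the mean-absolute-gap penalty, and the default weight `lam` are illustrative assumptions (the original regularizer operates on model logits inside a training loop), not the paper's exact formulation.

```python
def clp_loss(task_loss, logits_s, logits_s_prime, lam=1.0):
    """Counterfactual Logit Pairing-style objective (illustrative sketch).

    task_loss:      scalar classification loss on the original examples
    logits_s:       logits for sentences s in the constraint pairs
    logits_s_prime: logits for the paired counterfactual sentences s'
    lam:            regularization weight (hypothetical default)
    """
    # Penalize the average absolute logit gap across paired sentences,
    # pushing the model toward identical outputs on each pair.
    gap = sum(abs(a - b) for a, b in zip(logits_s, logits_s_prime))
    gap /= len(logits_s)
    return task_loss + lam * gap
```

In training, only pairs retained in the filtered set Ĉ_r would contribute to the penalty term, so the learned specification φ directly shapes the regularizer.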

