HUMAN-GUIDED FAIR CLASSIFICATION FOR NATURAL LANGUAGE PROCESSING

Abstract

Text classifiers have promising applications in high-stakes tasks such as resume screening and content moderation. These classifiers must be fair and avoid discriminatory decisions by being invariant to perturbations of sensitive attributes such as gender or ethnicity. However, there is a gap between human intuition about these perturbations and the formal similarity specifications that capture them. While existing research has started to address this gap, current methods are based on hardcoded word replacements, resulting in specifications with limited expressivity or ones that fail to fully align with human intuition (e.g., in cases of asymmetric counterfactuals). This work proposes novel methods for bridging this gap by discovering expressive and intuitive individual fairness specifications. We show how to leverage unsupervised style transfer and GPT-3's zero-shot capabilities to automatically generate expressive candidate pairs of semantically similar sentences that differ along sensitive attributes. We then validate the generated pairs via an extensive crowdsourcing study, which confirms that many of these pairs align with human intuition about fairness in the context of toxicity classification. Finally, we show how limited amounts of human feedback can be leveraged to learn a similarity specification that can be used to train downstream fairness-aware models.

1. INTRODUCTION

With the rise of pretrained large language models (Sun et al., 2019), text classifiers can now be employed in tasks related to automated hiring (Bhatia et al., 2019), content moderation (Rieder & Skop, 2021), and social science research (Widmer et al., 2020). They are also part of machine learning pipelines for unsupervised style transfer (Reid & Zhong, 2021) or for reducing the toxicity of language model outputs (Welbl et al., 2021). However, text classifiers have been shown to often exhibit bias based on sensitive attributes such as gender (De-Arteaga et al., 2019) or demographics (Garg et al., 2019), even for tasks in which these dimensions should be irrelevant. This can lead to unfair and discriminatory decisions, distort analyses based on these classifiers, or propagate undesirable demographic stereotypes to downstream applications. The intuition that certain demographic indicators should not influence decisions can be formalized through the concept of individual fairness (Dwork et al., 2012), which posits that similar inputs should be treated similarly by machine learning systems. While in a classification setting similar treatment of two inputs can naturally be defined as both inputs receiving the same label, the notion of input similarity should capture the intuition that certain input characteristics should not influence model decisions.

Key challenge: generating valid, intuitive, and diverse fairness constraints

A key challenge when applying the individual fairness framework is defining the similarity notion φ. Indeed, this definition is often contentious, as fairness is a subjective concept: what counts as a valid demographic indicator, as opposed to a problematic stereotype? Counterfactual definitions of similarity (Kusner et al., 2017) offer a principled solution, but they shift the burden towards the underlying causal model, whose definition can often be similarly contentious. While many other definitions have been proposed, it is widely recognized that the similarity of inputs can often be highly task-dependent (Dwork et al., 2012; Barocas et al., 2019): e.g., two biographies that are identical except for indicators of gender may be considered similar in a professional context, but not in the context of online dating.
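
To make this requirement concrete, a minimal formalization of individual fairness with a binary similarity notion φ is given below; this is a generic restatement for illustration, assuming a classifier f : X → Y, and the precise specification learned in this work may differ in its details:

    \forall x, x' \in \mathcal{X}: \quad \varphi(x, x') = 1 \;\Longrightarrow\; f(x) = f(x')

That is, whenever φ judges two inputs to be similar, e.g., two sentences that differ only in indicators of a sensitive attribute, the classifier must assign them the same label.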

