FAIRFIL: CONTRASTIVE NEURAL DEBIASING METHOD FOR PRETRAINED TEXT ENCODERS

Abstract

Pretrained text encoders, such as BERT, have been applied increasingly in various natural language processing (NLP) tasks, and have recently demonstrated significant performance gains. However, recent studies have also demonstrated the existence of social bias in these pretrained NLP models. Although prior work has made progress on word-level debiasing, improving the sentence-level fairness of pretrained encoders remains largely unexplored. In this paper, we propose the first neural debiasing method for a pretrained sentence encoder, which transforms the pretrained encoder's outputs into debiased representations via a fair filter (FairFil) network. To learn the FairFil, we introduce a contrastive learning framework that not only minimizes the correlation between filtered embeddings and bias words but also preserves the rich semantic information of the original sentences. On real-world datasets, our FairFil effectively reduces the bias degree of pretrained text encoders while consistently maintaining desirable performance on downstream tasks. Moreover, our post hoc method does not require any retraining of the text encoders, further enlarging FairFil's application space.

1. INTRODUCTION

Text encoders, which map raw-text data into low-dimensional embeddings, have become one of the fundamental tools for extensive tasks in natural language processing (Kiros et al., 2015; Lin et al., 2017; Shen et al., 2019; Cheng et al., 2020b). With the development of deep learning, large-scale neural sentence encoders pretrained on massive text corpora, such as InferSent (Conneau et al., 2017), ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), and GPT (Radford et al., 2018), have become the mainstream approach to extracting sentence-level text representations, and have shown desirable performance on many NLP downstream tasks (MacAvaney et al., 2019; Sun et al., 2019; Zhang et al., 2019). Although these pretrained models have been studied comprehensively from many perspectives, such as performance (Joshi et al., 2020), efficiency (Sanh et al., 2019), and robustness (Liu et al., 2019), the fairness of pretrained text encoders has not received significant research attention. The fairness issue is also broadly recognized as social bias, which denotes unbalanced model behaviors with respect to some socially sensitive topics, such as gender, race, and religion (Liang et al., 2020). For data-driven NLP models, social bias is an intrinsic problem mainly caused by the unbalanced data of text corpora (Bolukbasi et al., 2016). To quantitatively measure the bias degree of models, prior work proposed several statistical tests (Caliskan et al., 2017; Chaloner & Maldonado, 2019; Brunet et al., 2019), mostly focusing on word-level embedding models. To evaluate the sentence-level bias in the embedding space, May et al. (2019) extended the Word Embedding Association Test (WEAT) (Caliskan et al., 2017) into a Sentence Encoder Association Test (SEAT), and based on the SEAT test claimed the existence of social bias in pretrained sentence encoders. Although related works have discussed the measurement of social bias in sentence embeddings, debiasing pretrained sentence encoders remains a challenge.
Previous word-embedding debiasing methods (Bolukbasi et al., 2016; Kaneko & Bollegala, 2019; Manzini et al., 2019) offer limited assistance for sentence-level debiasing, because even if social bias is eliminated at the word level, sentence-level bias can still arise from the unbalanced combination of words in the training text. Besides, retraining a state-of-the-art sentence encoder for debiasing requires a massive amount of computational resources, especially for large-scale deep models like BERT (Devlin et al., 2019) and GPT (Radford et al., 2018). To the best of our knowledge, Liang et al. (2020) proposed the only sentence-level debiasing method (Sent-Debias) for pretrained text encoders, in which the embeddings are revised by subtracting the latent bias direction vectors learned by Principal Component Analysis (PCA) (Wold et al., 1987). However, Sent-Debias makes a strong assumption on the linearity of the bias in the sentence embedding space. Further, the calculation of bias directions depends highly on the embeddings extracted from the training data and on the number of principal components, preventing the method from generalizing adequately. In this paper, we propose the first neural debiasing method for pretrained sentence encoders. For a given pretrained encoder, our method learns a fair filter (FairFil) network, whose inputs are the original embeddings of the encoder and whose outputs are the debiased embeddings. Inspired by multi-view contrastive learning (Chen et al., 2020), for each training sentence we first generate an augmentation that has the same semantic meaning but lies in a different potential bias direction. We then contrastively train our FairFil by maximizing the mutual information between the debiased embeddings of the original sentences and those of the corresponding augmentations.
To further eliminate bias induced by sensitive words in sentences, we introduce a debiasing regularizer, which minimizes the mutual information between the debiased embeddings and the sensitive words' embeddings. In the experiments, our FairFil outperforms Sent-Debias (Liang et al., 2020) in terms of both the fairness and the representativeness of the debiased embeddings, indicating that FairFil not only effectively reduces the social bias in sentence embeddings but also successfully preserves the rich semantic meaning of the input text.
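The augmentation step described above can be sketched concretely. The following Python sketch swaps sensitive words for their counterparts to produce a semantically equivalent sentence with a flipped potential bias direction; the swap table and the function name `augment` are illustrative assumptions, not the paper's actual word lists or implementation:

```python
# Hypothetical sketch of the sentence augmentation used for contrastive
# training: replace each sensitive (here, gendered) word with its
# counterpart, keeping the semantic content while flipping the potential
# bias direction. The swap table below is illustrative only.
GENDER_SWAPS = {
    "he": "she", "she": "he",
    "him": "her", "her": "him",
    "man": "woman", "woman": "man",
}

def augment(sentence: str) -> str:
    """Return a bias-flipped view of `sentence` for contrastive training."""
    return " ".join(GENDER_SWAPS.get(w, w) for w in sentence.lower().split())
```

For example, `augment("she is a doctor")` yields `"he is a doctor"`; FairFil is then trained so that the debiased embeddings of the two views share high mutual information, while sentences without sensitive words pass through unchanged.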

2. PRELIMINARIES

Mutual Information (MI) is a measure of the "amount of information" shared between two variables (Kullback, 1997). Mathematically, MI is defined as
$$I(x; y) := \mathbb{E}_{p(x,y)}\left[\log \frac{p(x,y)}{p(x)\,p(y)}\right], \qquad (1)$$
where $p(x,y)$ is the joint distribution of the two variables $(x, y)$, and $p(x)$, $p(y)$ are the marginal distributions of $x$ and $y$, respectively. Recently, mutual information has achieved considerable success as a learning criterion in diverse deep learning tasks, such as conditional generation (Chen et al., 2016), domain adaptation (Gholami et al., 2020), representation learning (Chen et al., 2020), and fairness (Song et al., 2019). However, calculating the exact MI in (1) is well-recognized as challenging, because the expectation w.r.t. $p(x,y)$ is generally intractable, especially when only samples from $p(x,y)$ are available. To this end, several upper and lower bounds have been introduced to estimate MI from samples. For MI maximization tasks (Hjelm et al., 2018; Chen et al., 2020), Oord et al. (2018) derived a powerful MI estimator, InfoNCE, based on noise contrastive estimation (NCE) (Gutmann & Hyvärinen, 2010). Given a batch of sample pairs $\{(x_i, y_i)\}_{i=1}^{N}$, the InfoNCE estimator is defined with a learnable score function $f(x, y)$:
$$\hat{I}_{\mathrm{NCE}} := \frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(f(x_i, y_i))}{\frac{1}{N}\sum_{j=1}^{N}\exp(f(x_i, y_j))}. \qquad (2)$$
For MI minimization tasks (Alemi et al., 2017; Song et al., 2019), Cheng et al. (2020a) introduced a contrastive log-ratio upper bound (CLUB) based on a variational approximation $q_\theta(y|x)$ of the conditional distribution $p(y|x)$:
$$\hat{I}_{\mathrm{CLUB}} := \frac{1}{N}\sum_{i=1}^{N}\left[\log q_\theta(y_i|x_i) - \frac{1}{N}\sum_{j=1}^{N}\log q_\theta(y_j|x_i)\right]. \qquad (3)$$
In the following, we use these two MI estimators to induce the sentence encoder to eliminate biased information while preserving the semantic information of the original raw text.
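A minimal NumPy sketch of the two estimators, assuming the score function $f$ and the variational log-density $\log q_\theta(y|x)$ have already been evaluated on an $N \times N$ grid of batch pairs (the matrix layout is our assumption for illustration; the score and variational networks themselves are omitted):

```python
import numpy as np

def info_nce(scores):
    """InfoNCE lower bound (Eq. 2).

    `scores` is an N x N matrix with scores[i, j] = f(x_i, y_j);
    diagonal entries score the positive pairs, and each row's other
    entries act as in-batch negatives."""
    log_mean_exp = np.log(np.exp(scores).mean(axis=1))
    return (scores.diagonal() - log_mean_exp).mean()

def club(log_q):
    """CLUB upper bound (Eq. 3).

    `log_q` is an N x N matrix with log_q[i, j] = log q_theta(y_j | x_i)
    under the variational approximation q_theta."""
    positive = log_q.diagonal().mean()  # (1/N) sum_i log q(y_i | x_i)
    negative = log_q.mean()             # (1/N^2) sum_{i,j} log q(y_j | x_i)
    return positive - negative
```

In practice one would compute the log-mean-exp with a numerically stable routine (e.g., `scipy.special.logsumexp`) and backpropagate through the matrices with an autodiff framework; the NumPy version above only illustrates the algebra of the two bounds.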

3. METHOD

Suppose $E(\cdot)$ is a pretrained sentence encoder, which encodes a sentence $x$ into a low-dimensional embedding $z = E(x)$. Each sentence $x = (w_1, w_2, \ldots, w_L)$ is a sequence of words. The embedding space of $z$ has been shown to contain social bias in a series of studies (May et al., 2019; Kurita et al., 2019).
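Because FairFil operates post hoc, the pretrained encoder stays frozen and only a small filter network on top of $z = E(x)$ is trained. A schematic sketch of this wiring follows; the toy encoder stand-in and the single tanh layer are illustrative assumptions only (the real encoder would be, e.g., BERT, and the filter weights would be learned with the contrastive objective and debiasing regularizer rather than fixed):

```python
import numpy as np

DIM = 8  # embedding dimensionality, illustrative

def pretrained_encoder(sentence: str) -> np.ndarray:
    """Stand-in for a frozen pretrained encoder E(.): returns a
    deterministic pseudo-embedding so the sketch is runnable."""
    seed = sum(ord(c) for c in sentence) % (2 ** 32)
    return np.random.default_rng(seed).standard_normal(DIM)

# FairFil: a small trainable map applied to the frozen encoder's output.
# Here a single tanh layer with fixed random weights; in practice W is
# learned, while the encoder's parameters are never touched.
W = np.random.default_rng(0).standard_normal((DIM, DIM)) * 0.1

def fairfil(z: np.ndarray) -> np.ndarray:
    return np.tanh(z @ W)

z = pretrained_encoder("the doctor said hello")
d = fairfil(z)  # debiased embedding, same dimensionality as z
```

The key design point is that the filter preserves the embedding dimensionality, so `d` can be dropped into any downstream pipeline exactly where `z` was used before.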