FAIRFIL: CONTRASTIVE NEURAL DEBIASING METHOD FOR PRETRAINED TEXT ENCODERS

Abstract

Pretrained text encoders, such as BERT, have been applied increasingly in various natural language processing (NLP) tasks, and have recently demonstrated significant performance gains. However, recent studies have demonstrated the existence of social bias in these pretrained NLP models. Although prior works have made progress on word-level debiasing, improving the sentence-level fairness of pretrained encoders remains underexplored. In this paper, we propose the first neural debiasing method for a pretrained sentence encoder, which transforms the pretrained encoder outputs into debiased representations via a fair filter (FairFil) network. To learn the FairFil, we introduce a contrastive learning framework that not only minimizes the correlation between filtered embeddings and bias words but also preserves rich semantic information of the original sentences. On real-world datasets, our FairFil effectively reduces the bias degree of pretrained text encoders, while consistently maintaining desirable performance on downstream tasks. Moreover, our post hoc method does not require any retraining of the text encoders, further enlarging FairFil's application space.

1. INTRODUCTION

Text encoders, which map raw-text data into low-dimensional embeddings, have become one of the fundamental tools for a wide range of tasks in natural language processing (Kiros et al., 2015; Lin et al., 2017; Shen et al., 2019; Cheng et al., 2020b). With the development of deep learning, large-scale neural sentence encoders pretrained on massive text corpora, such as InferSent (Conneau et al., 2017), ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), and GPT (Radford et al., 2018), have become the mainstream approach for extracting sentence-level text representations, and have shown desirable performance on many NLP downstream tasks (MacAvaney et al., 2019; Sun et al., 2019; Zhang et al., 2019). Although these pretrained models have been studied comprehensively from many perspectives, such as performance (Joshi et al., 2020), efficiency (Sanh et al., 2019), and robustness (Liu et al., 2019), the fairness of pretrained text encoders has not received significant research attention. The fairness issue is also broadly recognized as social bias, which denotes unbalanced model behavior with respect to socially sensitive topics, such as gender, race, and religion (Liang et al., 2020). For data-driven NLP models, social bias is an intrinsic problem mainly caused by the unbalanced data of text corpora (Bolukbasi et al., 2016). To quantitatively measure the bias degree of models, prior work proposed several statistical tests (Caliskan et al., 2017; Chaloner & Maldonado, 2019; Brunet et al., 2019), mostly focusing on word-level embedding models. To evaluate the sentence-level bias in the embedding space, May et al. (2019) extended the Word Embedding Association Test (WEAT) (Caliskan et al., 2017) into a Sentence Encoder Association Test (SEAT). Based on the SEAT test, May et al. (2019) claimed the existence of social bias in the pretrained sentence encoders.
Although related works have discussed the measurement of social bias in sentence embeddings, debiasing pretrained sentence encoders remains a challenge. Previous word embedding debiasing methods (Bolukbasi et al., 2016; Kaneko & Bollegala, 2019; Manzini et al., 2019) offer limited help for sentence-level debiasing, because even if the social bias is eliminated at the word level, sentence-level bias can still be caused by the unbalanced combination of words in the training text. Besides, retraining a state-of-the-art sentence encoder for debiasing requires a massive amount of computational resources, especially for large-scale deep models like BERT (Devlin et al., 2019) and GPT (Radford et al., 2018). To the best of our knowledge, Liang et al. (2020) proposed the only sentence-level debiasing method (Sent-Debias) for pretrained text encoders, in which the embeddings are revised by subtracting the latent biased direction vectors learned by Principal Component Analysis (PCA) (Wold et al., 1987). However, Sent-Debias makes a strong assumption on the linearity of the bias in the sentence embedding space. Further, the calculation of bias directions depends highly on the embeddings extracted from the training data and on the number of principal components, which prevents the method from generalizing adequately.
In this paper, we propose the first neural debiasing method for pretrained sentence encoders. For a given pretrained encoder, our method learns a fair filter (FairFil) network, whose inputs are the original embeddings of the encoder and whose outputs are the debiased embeddings. Inspired by multi-view contrastive learning (Chen et al., 2020), for each training sentence, we first generate an augmentation that has the same semantic meaning but in a different potential bias direction. We contrastively train our FairFil by maximizing the mutual information between the debiased embeddings of the original sentences and those of the corresponding augmentations. To further eliminate bias from sensitive words in sentences, we introduce a debiasing regularizer, which minimizes the mutual information between debiased embeddings and the sensitive words' embeddings. In the experiments, our FairFil outperforms Sent-Debias (Liang et al., 2020) in terms of both the fairness and the representativeness of debiased embeddings, indicating that FairFil not only effectively reduces the social bias in the sentence embeddings, but also successfully preserves the rich semantic meaning of the input text.

2. PRELIMINARIES

Mutual Information (MI) is a measure of the "amount of information" between two variables (Kullback, 1997). The mathematical definition of MI is

$$I(x; y) := \mathbb{E}_{p(x,y)}\left[\log \frac{p(x, y)}{p(x)\,p(y)}\right], \tag{1}$$

where $p(x, y)$ is the joint distribution of two variables $(x, y)$, and $p(x)$, $p(y)$ are respectively the marginal distributions of $x$ and $y$. Recently, mutual information has achieved considerable success when applied as a learning criterion in diverse deep learning tasks, such as conditional generation (Chen et al., 2016), domain adaptation (Gholami et al., 2020), representation learning (Chen et al., 2020), and fairness (Song et al., 2019). However, the calculation of the exact MI in (1) is well-recognized as a challenge, because the expectation w.r.t. $p(x, y)$ is generally intractable, especially when only samples from $p(x, y)$ are provided. To this end, several upper and lower bounds have been introduced to estimate the MI with samples. For MI maximization tasks (Hjelm et al., 2018; Chen et al., 2020), Oord et al. (2018) derived a powerful MI estimator, InfoNCE, based on noise contrastive estimation (NCE) (Gutmann & Hyvärinen, 2010). Given a batch of sample pairs $\{(x_i, y_i)\}_{i=1}^N$, the InfoNCE estimator is defined with a learnable score function $f(x, y)$:

$$I_{NCE} := \frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(f(x_i, y_i))}{\frac{1}{N}\sum_{j=1}^{N} \exp(f(x_i, y_j))}. \tag{2}$$

For MI minimization tasks (Alemi et al., 2017; Song et al., 2019), Cheng et al. (2020a) introduced a contrastive log-ratio upper bound (CLUB) based on a variational approximation $q_\theta(y|x)$ of the conditional distribution $p(y|x)$:

$$I_{CLUB} := \frac{1}{N}\sum_{i=1}^{N} \left[\log q_\theta(y_i|x_i) - \frac{1}{N}\sum_{j=1}^{N} \log q_\theta(y_j|x_i)\right]. \tag{3}$$

In the following, we use the above two MI estimators to guide the sentence encoder toward eliminating the biased information while preserving the semantic information of the original raw text.
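To make estimators (2) and (3) concrete, the following minimal Python sketch evaluates both from precomputed matrices. It assumes that `scores[i][j]` holds the score $f(x_i, y_j)$ and that `log_q[i][j]` holds $\log q_\theta(y_j|x_i)$; the function names `info_nce` and `club` are illustrative choices and are not taken from the authors' code.

```python
import math

def info_nce(scores):
    """InfoNCE estimate as in Eq. (2).

    scores[i][j] is assumed to hold f(x_i, y_j) for a batch of N pairs.
    """
    n = len(scores)
    total = 0.0
    for i in range(n):
        # Denominator of Eq. (2): batch mean of exp(f(x_i, y_j)) over j.
        mean_exp = sum(math.exp(s) for s in scores[i]) / n
        total += scores[i][i] - math.log(mean_exp)
    return total / n

def club(log_q):
    """CLUB estimate as in Eq. (3).

    log_q[i][j] is assumed to hold log q_theta(y_j | x_i).
    """
    n = len(log_q)
    total = 0.0
    for i in range(n):
        # Positive-pair log-likelihood minus the batch-average log-likelihood.
        total += log_q[i][i] - sum(log_q[i]) / n
    return total / n
```

With constant (uninformative) scores both estimates are zero, and the InfoNCE estimate on a batch of size $N$ cannot exceed $\log N$, since each ratio inside the logarithm is at most $N$.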

3. METHOD

Suppose $E(\cdot)$ is a pretrained sentence encoder, which can encode a sentence $x$ into a low-dimensional embedding $z = E(x)$. Each sentence $x = (w_1, w_2, \ldots, w_L)$ is a sequence of words. The embedding space of $z$ has been recognized to have social bias in a series of studies (May et al., 2019; Kurita et al., 2019; Liang et al., 2020). To eliminate the social bias in the embedding space, we aim to learn a fair filter network $f(\cdot)$ on top of the sentence encoder $E(\cdot)$, such that the output embedding of our fair filter, $d = f(z)$, is debiased. To train the fair filter, we design a multi-view contrastive learning framework, which consists of three steps. First, for each input sentence $x$, we generate an augmented sentence $x'$ that has the same semantic meaning as $x$ but in a different potential bias direction. Then, we maximize the mutual information between the filtered embedding of the original sentence, $d = f(E(x))$, and that of the augmentation, $d' = f(E(x'))$, with the InfoNCE (Oord et al., 2018) contrastive loss. Further, we design a debiasing regularizer to minimize the mutual information between $d$ and the sensitive attribute words in $x$. In the following, we discuss these three steps in detail.

3.1. DATA AUGMENTATIONS WITH SENSITIVE ATTRIBUTES

We first describe the sentence data augmentation process for our FairFil contrastive learning. Denote a socially sensitive topic as $T = \{D_1, D_2, \ldots, D_K\}$, where $D_k$ ($k = 1, \ldots, K$) is one of the potential bias directions under the topic. For example, if $T$ represents the sensitive topic "gender", then $T$ consists of two potential bias directions $\{D_1, D_2\}$ = {"male", "female"}. Similarly, if $T$ is set as the major "religions" of the world, then $T$ could contain $\{D_1, D_2, D_3, D_4\}$ = {"Christianity", "Islam", "Judaism", "Buddhism"} as four components. For a given socially sensitive topic $T = \{D_1, \ldots, D_K\}$, if a word $w$ is related to one of the potential bias directions $D_k$ (denoted as $w \in D_k$), we call $w$ a sensitive attribute word of $D_k$ (also called a bias attribute word in Liang et al. (2020)). For a sensitive attribute word $w \in D_k$, suppose we can always find another sensitive attribute word $u \in D_j$, such that $w$ and $u$ have equivalent semantic meaning but lie in different bias directions. Then we call $u$ a replaceable word of $w$ in direction $D_j$, denoted as $u = r_j(w)$. For the topic "gender" = {"male", "female"}, the word $w$ = "boy" is in the potential bias direction $D_1$ = "male"; a replaceable word of "boy" in the "female" direction is $r_2(w)$ = "girl" $\in D_2$.
With the above definitions, for each sentence $x$, we generate an augmented sentence $x'$ such that $x'$ has the same semantic meaning as $x$ but in a different potential bias direction. More specifically, for a sentence $x = (w_1, w_2, \ldots, w_L)$, we first find the sensitive word positions as an index set $P$, such that each $w_p$ ($p \in P$) is a sensitive attribute word in direction $D_k$. We further make a reasonable assumption that the embedding bias of direction $D_k$ is caused only by the sensitive words $\{w_p\}_{p \in P}$ in $x$. To sample an augmentation of $x$, we first select another potential bias direction $D_j$, and then replace all sensitive attribute words by their replaceable words in the direction $D_j$. That is, $x' = (v_1, v_2, \ldots, v_L)$, where $v_l = w_l$ if $l \notin P$, and $v_l = r_j(w_l)$ if $l \in P$. In Table 1, we provide an example of sentence augmentation under the "gender" topic.
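The augmentation rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word list `PAIRS` is a small hypothetical stand-in for the pre-defined sensitive word sets, direction indices are 0 for "male" and 1 for "female", tokenization is plain whitespace splitting, and replaced words are lowercased for simplicity.

```python
# Hypothetical aligned word pairs for the topic "gender":
# index 0 is the "male" direction, index 1 is the "female" direction.
PAIRS = [
    ("he", "she"),
    ("his", "her"),
    ("boy", "girl"),
    ("man", "woman"),
    ("father", "mother"),
]

def replaceable(word, j):
    """Return r_j(word) if word is a sensitive attribute word, else None."""
    for pair in PAIRS:
        if word in pair:
            return pair[j]
    return None

def augment(sentence, j):
    """Replace every sensitive word with its counterpart in direction j.

    Returns the index set P of sensitive positions and the augmented sentence.
    """
    words = sentence.split()
    positions = [l for l, w in enumerate(words)
                 if replaceable(w.lower(), j) is not None]
    augmented = list(words)
    for p in positions:
        augmented[p] = replaceable(words[p].lower(), j)
    return positions, " ".join(augmented)
```

For example, `augment("the boy loves his father", 1)` finds the sensitive positions 1, 3, and 4 and swaps each word for its counterpart in the "female" direction, leaving all other words untouched.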

3.2. CONTRASTIVE LEARNING FRAMEWORK

After obtaining the sentence pair $(x, x')$ with the augmentation strategy from Section 3.1, we construct a contrastive learning framework to learn our debiasing fair filter $f(\cdot)$. As shown in Figure 1(a), our framework consists of the following two steps: (1) We encode sentences $(x, x')$ into embeddings $(z, z')$ with the pretrained encoder $E(\cdot)$. Since $x$ and $x'$ have the same meaning but different potential bias directions, the embeddings $(z, z')$ will have different bias directions, which are caused by the sensitive attribute words in $x$ and $x'$. (2) The embeddings are passed through the fair filter to give the debiased outputs $d = f(z)$ and $d' = f(z')$, and the InfoNCE estimator between them is computed with a learnable score function $g(\cdot, \cdot)$ over a batch $\{(d_i, d'_i)\}_{i=1}^N$:

$$I_{NCE} = \frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(g(d_i, d'_i))}{\frac{1}{N}\sum_{j=1}^{N} \exp(g(d_i, d'_j))}. \tag{4}$$

By maximizing $I_{NCE}$, we enlarge the difference between the positive pair score $g(d_i, d'_i)$ and the negative pair scores $g(d_i, d'_j)$, so that $d_i$ shares more semantic information with $d'_i$ than with the other embeddings $d'_{j \neq i}$.

3.3. DEBIASING REGULARIZER

Practically, the contrastive learning framework in Section 3.2 can already show encouraging debiasing performance (as shown in the Experiments). However, the embedding $d$ can contain extra biased information from $z$ that maximizing $I(d; d')$ alone fails to eliminate. To discourage extra bias in $d$, we introduce a debiasing regularizer which minimizes the mutual information between the embedding $d$ and the potential bias from the embedding $z$. As discussed in Section 3.1, in our framework the potential bias of $z$ is assumed to come from the sensitive attribute words in $x$. Therefore, we should remove the bias word information from the debiased representation $d$. Let $\mathbf{w}^p$ be the embedding of a sensitive attribute word $w^p$ in sentence $x$. The word embedding $\mathbf{w}^p$ can always be obtained from the pretrained text encoders (Bordia & Bowman, 2019). We then minimize the mutual information $I(\mathbf{w}^p; d)$, using the CLUB mutual information upper bound (Cheng et al., 2020a) to estimate $I(\mathbf{w}^p; d)$ with embedding samples. Given a batch of embedding pairs $\{(d_i, \mathbf{w}^p_i)\}_{i=1}^N$, we can calculate the debiasing regularizer as:

$$I_{CLUB} = \frac{1}{N}\sum_{i=1}^{N} \left[\log q_\theta(\mathbf{w}^p_i|d_i) - \frac{1}{N}\sum_{j=1}^{N} \log q_\theta(\mathbf{w}^p_j|d_i)\right], \tag{5}$$

where $q_\theta$ is a variational approximation to the ground-truth conditional distribution $p(\mathbf{w}|d)$. We parameterize $q_\theta$ with another neural network. As proved in Cheng et al. (2020a), the better $q_\theta(\mathbf{w}|d)$ approximates $p(\mathbf{w}|d)$, the more accurately $I_{CLUB}$ serves as the mutual information upper bound. Therefore, besides the loss in (5), we also maximize the log-likelihood of $q_\theta(\mathbf{w}|d)$ with samples $\{(d_i, \mathbf{w}^p_i)\}_{i=1}^N$.
Based on the above sections, the overall learning scheme of our fair filter (FairFil) is described in Algorithm 1. We also provide an intuitive explanation of the two loss terms in our framework, illustrated in Figure 1(b).

Algorithm 1: Updating the FairFil with a sample batch
    Begin with the pretrained text encoder $E(\cdot)$ and a batch of sentences $\{x_i\}_{i=1}^N$.
    Find the sensitive attribute words $\{w^p\}$ and the corresponding embeddings $\{\mathbf{w}^p\}$.
    Generate the augmentation $x'_i$ from $x_i$ by replacing $\{w^p\}$ with $\{r_j(w^p)\}$.
    Encode $(x_i, x'_i)$ into embeddings $d_i = f(E(x_i))$, $d'_i = f(E(x'_i))$.
    Calculate $I_{NCE}$ with $\{(d_i, d'_i)\}_{i=1}^N$ and the score function $g$.
    if adding the debiasing regularizer then
        Update the variational approximation $q_\theta(\mathbf{w}|d)$ by maximizing the log-likelihood with $\{(d_i, \mathbf{w}^p_i)\}_{i=1}^N$.
        Calculate $I_{CLUB}$ with $q_\theta(\mathbf{w}|d)$ and $\{(d_i, \mathbf{w}^p_i)\}_{i=1}^N$.
        Learning loss $L = -I_{NCE} + \beta I_{CLUB}$.
    else
        Learning loss $L = -I_{NCE}$.
    end if
    Update the FairFil $f$ and the score function $g$ by gradient descent with respect to $L$.

4. RELATED WORK

4.1. BIAS IN NATURAL LANGUAGE PROCESSING

Social bias has recently been recognized as an important issue in natural language processing (NLP) systems. The studies on bias in NLP fall mainly into two categories: bias in the embedding spaces, and bias in downstream tasks (Blodgett et al., 2020). For bias in downstream tasks, the analyses cover a broad range of topics, including machine translation (Stanovsky et al., 2019), language modeling (Bordia & Bowman, 2019), sentiment analysis (Kiritchenko & Mohammad, 2018), and toxicity detection (Dixon et al., 2018). The social bias in embedding spaces has been studied from two important perspectives: bias measurements and debiasing methods. To measure the bias in an embedding space, Caliskan et al. (2017) proposed a Word Embedding Association Test (WEAT), which compares the similarity between two sets of target words and two sets of attribute words. May et al. (2019) further extended the WEAT to a Sentence Encoder Association Test (SEAT), which replaces the word embeddings by sentence embeddings encoded from pre-defined biased sentence templates. For debiasing methods, most of the prior works focus on word-level representations (Bolukbasi et al., 2016; Bordia & Bowman, 2019). The only sentence-level debiasing method was proposed by Liang et al. (2020), which learns bias directions by PCA and subtracts them in the embedding space.

4.2. CONTRASTIVE LEARNING

Contrastive learning is a broad class of training strategies that learns meaningful representations by making positive and negative embedding pairs more distinguishable. Usually, contrastive learning requires a pairwise embedding critic that serves as a similarity or distance measure for data pairs. The learning objective is then constructed by maximizing the margin between the critic values of positive data pairs and negative data pairs. Previously, contrastive learning has shown encouraging performance in many tasks, including metric learning (Weinberger et al., 2006; Davis et al., 2007), word representation learning (Mikolov et al., 2013), and graph learning (Tang et al., 2015; Grover & Leskovec, 2016). Recently, contrastive learning has been applied to the unsupervised visual representation learning task, and has significantly reduced the performance gap between supervised and unsupervised learning (He et al., 2020; Chen et al., 2020; Qian et al., 2020). Among these unsupervised methods, Chen et al. (2020) proposed a simple multi-view contrastive learning framework (SimCLR). For each image, SimCLR generates two augmented images, and then the mutual information of the two augmentation embeddings is maximized within a batch of training data.

5. EXPERIMENTS

We first describe the experimental setup in detail, including the pretrained encoders, the training of FairFil, and the downstream tasks. The results of our FairFil are then reported and analyzed, along with those of the previous Sent-Debias method. In general, we evaluate our neural debiasing method from two perspectives: (1) fairness: we compare the bias degree of the original and debiased sentence embeddings to assess debiasing performance; and (2) representativeness: we apply the debiased embeddings to downstream tasks, and compare the performance with that of the original embeddings.

5.1. BIAS EVALUATION METRIC

To evaluate the bias in sentence embeddings, we use the Sentence Encoder Association Test (SEAT) (May et al., 2019), which is an extension of the Word Embedding Association Test (WEAT) (Caliskan et al., 2017). The WEAT test measures the bias in word embeddings by comparing the distances of two sets of target words to two sets of attribute words. More specifically, denote $X$ and $Y$ as two sets of target word embeddings (e.g., $X$ includes "male" words such as "boy" and "man"; $Y$ contains "female" words like "girl" and "woman"). The attribute sets $A$ and $B$ are selected from some social concepts that should be "equal" to $X$ and $Y$ (e.g., career or personality words). Then the bias degree w.r.t. attributes $(A, B)$ of each word embedding $t$ is defined as:

$$s(t, A, B) = \operatorname{mean}_{a \in A} \cos(t, a) - \operatorname{mean}_{b \in B} \cos(t, b), \tag{6}$$

where $\cos(\cdot, \cdot)$ is the cosine similarity. Based on (6), the normalized WEAT effect size is:

$$d_{WEAT} = \frac{\operatorname{mean}_{x \in X} s(x, A, B) - \operatorname{mean}_{y \in Y} s(y, A, B)}{\operatorname{std}_{t \in X \cup Y}\, s(t, A, B)}. \tag{7}$$

The SEAT test extends WEAT by replacing the word embeddings with sentence embeddings. Both target words and attribute words are converted into sentences with several semantically bleached sentence templates (e.g., "This is <word>"). Then the SEAT statistic is calculated in the same way with (7), based on the embeddings of the converted sentences. The closer the effect size is to zero, the fairer the embeddings are. Therefore, we report the absolute effect size as the bias measure.
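As an illustration of (6) and (7), the effect size can be computed with the Python standard library alone. The sketch below assumes embeddings are given as plain lists of floats and uses the sample standard deviation (`statistics.stdev`); the text does not say whether the sample or population variant is intended, so that choice is an assumption, as are the function names.

```python
import math
import statistics

def cos(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def association(t, A, B):
    """Bias degree s(t, A, B) of one embedding t, as in Eq. (6)."""
    mean_a = statistics.mean([cos(t, a) for a in A])
    mean_b = statistics.mean([cos(t, b) for b in B])
    return mean_a - mean_b

def effect_size(X, Y, A, B):
    """Normalized effect size as in Eq. (7).

    Assumption: the std over X union Y is the sample standard deviation.
    """
    s_x = [association(x, A, B) for x in X]
    s_y = [association(y, A, B) for y in Y]
    return (statistics.mean(s_x) - statistics.mean(s_y)) / statistics.stdev(s_x + s_y)
```

Swapping the two target sets flips the sign of the effect size, which is why the absolute value is the quantity reported as the bias measure.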

5.2. PRETRAINED ENCODERS

We test our neural debiasing method on BERT (Devlin et al., 2019). Since the pretrained BERT requires an additional fine-tuning process for downstream tasks, we report the performance of our FairFil under two scenarios: (1) pretrained BERT: we directly learn our FairFil network based on pretrained BERT without any additional fine-tuning; and (2) BERT post tasks: we fix the parameters of the FairFil network learned on pretrained BERT, and then fine-tune BERT with the FairFil network attached on task-specific data. Note that during fine-tuning, our FairFil is no longer updated, which ensures a fair comparison to Sent-Debias (Liang et al., 2020). For the downstream tasks of BERT, we follow the setup from Sent-Debias (Liang et al., 2020) and conduct experiments on the following three downstream tasks: (1) SST-2: a sentiment classification task on the Stanford Sentiment Treebank (SST-2) dataset (Socher et al., 2013), in which sentence embeddings are used to predict the corresponding sentiment labels; (2) CoLA: a grammatical acceptability judgment task on the Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2019); and (3) QNLI: a binary question answering task on the Question Natural Language Inference (QNLI) dataset (Wang et al., 2018).

5.3. TRAINING OF FAIRFIL

We parameterize the fair filter network as a one-layer fully-connected neural network with the ReLU activation function. The score function $g$ in the InfoNCE estimator is set to a two-layer fully-connected network with one-dimensional output. The variational approximation $q_\theta$ in the CLUB estimator is parameterized by a multivariate Gaussian distribution $q_\theta(\mathbf{w}|d) = \mathcal{N}(\mu(d), \sigma^2(d))$, where $\mu(\cdot)$ and $\sigma(\cdot)$ are also two-layer fully-connected neural nets. The batch size is set to 128. The learning rate is $1 \times 10^{-5}$. We train the fair filter for 10 epochs. For an appropriate comparison, we follow the setup of Sent-Debias (Liang et al., 2020) and select the same training data for the training of FairFil. The training corpora consist of 183,060 sentences from the following five datasets: WikiText-2 (Merity et al., 201y), Stanford Sentiment Treebank (Socher et al., 2013), Reddit (Völske et al., 2017), MELD (Poria et al., 2019), and POM (Park et al., 2014). Following Liang et al. (2020), we mainly select "gender" as the sensitive topic $T$, and use the same pre-defined sets of sensitive attribute words and their replaceable words as Sent-Debias. The word embeddings for training the debiasing regularizer are taken from the token embeddings of the pretrained BERT.
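To illustrate how the Gaussian parameterization of $q_\theta$ enters the CLUB regularizer, the following Python sketch evaluates the estimator from predicted means and variances. It assumes a diagonal covariance, and it takes `mus[i]` and `variances[i]` as already-computed outputs of $\mu(d_i)$ and $\sigma^2(d_i)$ (the networks themselves are not modeled here); all names are illustrative.

```python
import math

def gaussian_log_prob(w, mu, var):
    """Log-density of w under a diagonal Gaussian N(mu, var)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (wi - m) ** 2 / v)
        for wi, m, v in zip(w, mu, var)
    )

def club_gaussian(mus, variances, ws):
    """CLUB estimate with q_theta(w | d_i) = N(mus[i], variances[i]).

    ws[i] is the sensitive word embedding paired with sample i.
    """
    n = len(ws)
    total = 0.0
    for i in range(n):
        positive = gaussian_log_prob(ws[i], mus[i], variances[i])
        # Average log-likelihood of every word embedding in the batch under q(. | d_i).
        negative = sum(
            gaussian_log_prob(ws[j], mus[i], variances[i]) for j in range(n)
        ) / n
        total += positive - negative
    return total / n
```

When the predicted mean is the same for every sample, so that $q_\theta$ ignores $d$, the estimate is zero; it grows as the predicted means track the paired word embeddings more closely than the unpaired ones.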

5.4. DEBIASING RESULTS

In Tables 2 and 3 we report the evaluation results of debiased embeddings on both the absolute SEAT effect size and the downstream classification accuracy. For the SEAT test, we follow Caliskan et al. (2017). The column name Origin refers to the original BERT results, and Sent-D is short for Sent-Debias (Liang et al., 2020). FairFil⁻ and FairFil (shown as FairF⁻ and FairF in the tables) are our method without and with the debiasing regularizer in Section 3.3, respectively. The best results of effect size (the lower the better) and classification accuracy (the higher the better) among Sent-D, FairFil⁻, and FairFil are shown in bold. Since the pretrained BERT does not correspond to any downstream task, the classification accuracy is not reported for it.
From the SEAT test results, our contrastive learning framework effectively reduces the gender bias for both pretrained BERT and fine-tuned BERT under most test scenarios. Compared with Sent-Debias, our FairFil reaches a lower bias degree on the majority of the individual SEAT tests. In terms of the average absolute effect size, our FairFil outperforms Sent-Debias by a significant margin. Moreover, our FairFil achieves higher downstream classification accuracy than Sent-Debias, which indicates that learning neural filter networks can preserve more semantic meaning than subtracting bias directions learned from PCA.
For the ablation study, we also report the results of FairFil without the debiasing regularizer, as FairF⁻. With the contrastive learning framework alone, FairF⁻ already reduces the bias effectively and even achieves a better effect size than FairF on some of the SEAT tests. With the debiasing regularizer, FairF has better average SEAT effect sizes but loses slightly in terms of downstream performance. Overall, the performance of FairF and FairF⁻ shows a trade-off between the fairness and the representativeness of the filter network.
We also compare the debiasing performance against a broader class of baselines, including word-level debiasing methods, and report the average absolute SEAT effect size on the pretrained BERT encoder:

    Method                                 Average absolute SEAT effect size
    FastText (Bojanowski et al., 2017)     0.565
    BERT word (Bolukbasi et al., 2016)     0.861
    BERT simple (May et al., 2019)         0.298
    Sent-Debias (Liang et al., 2020)       0.256
    FairFil⁻ (Ours)                        0.179
    FairFil (Ours)                         0.150

Both FairF⁻ and FairF achieve a lower bias degree than the other baselines. The word-level debiasing methods (FastText (Bojanowski et al., 2017) and BERT word (Bolukbasi et al., 2016)) have the largest average effect sizes (0.565 and 0.861, respectively).
To further study the output debiased sentence embeddings, we visualize the relative distances between the attributes and targets of SEAT before and after our debiasing process. We choose the target words as "he" and "she." Attributes are selected from different social domains. We first contextualize the selected words into sentence templates as described in Section 5.1. We then average the original/debiased embeddings of these sentence templates and plot the t-SNE (Maaten & Hinton, 2008) in Figure 3. In the t-SNE plots, the debiased encoder provides more balanced distances from the gender targets "he/she" to the attribute concepts.

6. CONCLUSIONS

This paper has developed a novel debiasing method for large-scale pretrained text encoder neural networks. We proposed a fair filter (FairFil) network, which takes the original sentence embeddings as input and outputs the debiased sentence embeddings. To train the fair filter, we constructed a multi-view contrastive learning framework, which maximizes the mutual information between each sentence and its augmentation. The augmented sentence is generated by replacing sensitive words in the original sentence with words that have similar semantics but different bias directions. Further, we designed a debiasing regularizer that minimizes the mutual information between the debiased embeddings and the corresponding sensitive words in sentences. Experimental results demonstrate that the proposed FairFil not only reduces the bias in the sentence embedding space, but also maintains the semantic meaning of the embeddings. This post hoc method does not require access to the training corpora, or any retraining of the pretrained text encoder, which enhances its applicability.



We then feed the sentence embeddings $(z, z')$ through our fair filter $f(\cdot)$ to obtain the debiased embedding outputs $(d, d')$. Ideally, $d$ and $d'$ should represent the same semantic meaning without social bias. Inspired by SimCLR (Chen et al., 2020), we encourage the overlapped semantic information between $d$ and $d'$ by maximizing their mutual information $I(d; d')$. However, the calculation of $I(d; d')$ is practically difficult because only embedding samples of $d$ and $d'$ are available. Therefore, we use the InfoNCE mutual information estimator (Oord et al., 2018) to maximize a lower bound of $I(d; d')$ instead. Based on a learnable score function $g(\cdot, \cdot)$, the estimator takes the form of $I_{NCE}$ given in Section 3.2.

Figure 1: (a) Contrastive learning framework of FairFil: Sentence $x$ and its augmentation $x'$ are encoded into embeddings $d$ and $d'$, respectively. $\mathbf{w}^p$ is the embedding of a sensitive attribute word selected from $x$. $I_{NCE}$ maximizes the mutual information between $d$ and $d'$; $I_{CLUB}$ eliminates the bias information of $\mathbf{w}^p$ from $d$. (b) Illustration of information in $d$ and $d'$: The blue and red circles represent the information in $d$ and $d'$, respectively. The intersection is the mutual information between $d$ and $d'$. The shadow area represents the bias information of both embeddings.

In Figure 1(b), the blue and red circles represent $d$ and $d'$, respectively, in the embedding space. The intersection $I(d; d')$ is the common semantic information extracted from sentences $x$ and $x'$, while the two shadow parts are the extra bias. Note that perfectly debiased embeddings lead to coincident circles. By maximizing the $I_{NCE}$ term, we enlarge the overlapped area of $d$ and $d'$; by minimizing $I_{CLUB}$, we shrink the biased shadow parts.

Figure 2: Influence of the training data proportion to debias degree of BERT.

Figure 3: t-SNE plots of the mean sentence embedding of each word contextualized in templates. The left-hand side is from the original pretrained BERT; the right-hand side is from our FairFil.

The word-level debiasing methods have the worst debiasing performance, which validates our observation that word-level debiasing methods cannot reduce sentence-level social bias in NLP models.

5.5. ANALYSIS

To test the influence of data proportion on the model's debiasing performance, we select WikiText-2 with 13,750 sentences as the training corpus, following the setup in Liang et al. (2020). Then we randomly divide the training data into 5 equal-sized partitions. We evaluate the bias degree of the sentence debiasing methods on different combinations of the partitions, specifically with training data proportions (20%, 40%, 60%, 80%, 100%). Under each data proportion, we repeat the training 5 times to obtain the mean and variance of the absolute SEAT effect size. In Figure 2, we plot the bias degree of BERT post tasks with different training data proportions. In general, both Sent-Debias and FairFil achieve better performance and smaller variance when the proportion of training data is larger. Under a 20% training proportion, our FairFil removes bias from the text encoder more effectively than Sent-Debias, which suggests that FairFil has better data efficiency with the contrastive learning framework.

Table 1: Examples of generating an augmentation sentence under the sensitive topic "gender".

Table 2: Performance of debiased embeddings on Pretrained BERT and BERT post SST-2.

Table 3: Performance of debiased embeddings on BERT post CoLA and BERT post QNLI.



ACKNOWLEDGEMENTS

This research was supported in part by the DOE, NSF and ONR.

