INTERPRETABLE DEBIASING OF VECTORIZED LANGUAGE REPRESENTATIONS WITH ITERATIVE ORTHOGONALIZATION

Abstract

We propose a new mechanism to augment a word vector embedding representation that offers improved bias removal while retaining the key information, resulting in improved interpretability of the representation. Rather than removing the information associated with a concept that may induce bias, our proposed method identifies two concept subspaces and makes them orthogonal. The resulting representation leaves these two concepts uncorrelated. Moreover, because they are orthogonal, one can simply apply a rotation to the basis of the representation so that each resulting subspace corresponds with coordinates. This explicit encoding of concepts to coordinates works because the concepts have been made fully orthogonal, which previous approaches do not achieve. Furthermore, we show that this can be extended to multiple subspaces. As a result, one can choose a subset of concepts to be represented transparently and explicitly, while the others are retained in the mixed but extremely expressive format of the representation.

1. INTRODUCTION

Vectorized representations of structured data, especially text in Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), FastText (Joulin et al., 2016), etc., have become an enormously powerful and useful method for facilitating language learning and understanding. And while for natural language data contextualized embeddings, e.g., ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), etc., have become the standard for many analysis pipelines, the non-contextualized versions have retained an important purpose for low-resource languages, for synonym tasks, and for their interpretability. In particular, these versions have the intuitive representation that each word is mapped to a vector in a high-dimensional space, and the (cosine) similarity between words in this representation captures how similar the words are in meaning, via how similar the contexts are in which they are commonly used. Such vectorized representations are common among many other types of structured data, including images (Kiela & Bottou, 2014; Lowe, 2004), nodes in a social network (Grover & Leskovec, 2016; Perozzi et al., 2014), spatial regions of interest (Jenkins et al., 2019), merchants in a financial network (Wang et al., 2021), and many more. In all of these cases, the most effective representations are large, high-dimensional, and trained on a large amount of data. This can be an expensive endeavor, and the goal is often to complete this embedding task once and then use these representations as an intermediate step in many downstream tasks. In this paper, we consider the goal of adding or adjusting structure in existing embeddings as part of a light-weight representation augmentation. The goal is to accomplish this without expensive retraining of the embedding while improving the representation's usefulness, meaningfulness, and interpretability.
Within language models, this has most commonly been considered within the context of bias removal (Bolukbasi et al., 2016; Dev & Phillips, 2019). Here, one commonly identifies a linear subspace that encodes some concept (e.g., male-to-female gender) and may modify or remove that subspace when the concept it encodes is not appropriate for a downstream task (e.g., resume ranking). One recently proposed approach of interest called Orthogonal Subspace Correction and

2.2. RECTIFICATION IN ISR

The graded rotation is the only step that actually augments the data within ISR. It attempts to make the identified subspace vectors v_1 and v_2 orthogonal. Moreover, it applies this operation to all word vector representations in the data set as a continuous movement. This is essential for two reasons: first, it is (sub-)differentiable, and second, it generalizes to all other vectorized representations that may carry some of the connotations of a concept but may not be specifically identified as such via a user-supplied word list. For instance, statistically gendered names can represent gender information in these embeddings, but we may not want to specifically assign a gender to the names, since people with those names may not associate with the statistically most likely gender. We leverage the graded-rotation method from OSCaR (Dev et al., 2021a) using their public code. This takes as input two vectors v_1 and v_2, projects all of the word vectors onto their span, and performs a different rotation on each word about the origin. Words close to v_2 are rotated to be nearly orthogonal to v_1, and words close to v_1 are not changed much. In ISR this is performed after centering the data and after projecting onto the span of the d-dimensional vectors v_1 and v_2. After the rotation, we reconstitute the full-dimensional coordinates of all vectors. Hence this only modifies 2 out of d (e.g., d = 300) dimensions in the proper basis, and so the effect on most word representations is small. The exception is those correlated with the targeted concepts v_1 and v_2, and as intended, those are updated to become nearly rectified.
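To make the step concrete, the following is a minimal sketch of a graded rotation in this spirit. It is not the OSCaR implementation: the angle remap here is a simple piecewise-linear interpolation (the published grading differs in detail), and the function name and structure are our own. It illustrates the key behavior: only the two in-span coordinates change, directions near v_2 are rotated to be orthogonal to v_1, and directions near v_1 are left nearly fixed.

```python
import numpy as np

def graded_rotation(D, v1, v2):
    """Simplified sketch of a graded rotation in span(v1, v2).

    Rotates every row of D within span(v1, v2) so that directions near v2
    land orthogonal to v1, while directions near v1 barely move. The
    remaining (d-2)-dimensional component of each vector is untouched.
    Assumes v1 and v2 are not parallel.
    """
    # Orthonormal basis (e1, e2) for span(v1, v2), with e1 along v1.
    e1 = v1 / np.linalg.norm(v1)
    u = v2 - (v2 @ e1) * e1
    e2 = u / np.linalg.norm(u)

    theta2 = np.arctan2(v2 @ e2, v2 @ e1)  # angle of v2 in the span, in (0, pi)

    # Split each vector into its in-span coordinates and out-of-span residual.
    X = np.stack([D @ e1, D @ e2], axis=1)
    residual = D - X[:, :1] * e1 - X[:, 1:2] * e2

    phi = np.arctan2(X[:, 1], X[:, 0])
    # Piecewise-linear angle remap: 0 -> 0, theta2 -> pi/2, pi -> pi,
    # applied symmetrically to negative angles.
    a = np.abs(phi)
    new_a = np.where(
        a <= theta2,
        a * (np.pi / 2) / theta2,
        np.pi / 2 + (a - theta2) * (np.pi / 2) / (np.pi - theta2),
    )
    new_phi = np.sign(phi) * new_a
    r = np.linalg.norm(X, axis=1)
    X_new = np.stack([r * np.cos(new_phi), r * np.sin(new_phi)], axis=1)

    return residual + X_new[:, :1] * e1 + X_new[:, 1:2] * e2
```

Because the rotation angle varies continuously with each word's position, words carrying only a trace of either concept are moved only slightly, which is what lets the step generalize beyond the user-supplied word lists.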

2.3. ITERATION IN ISR

The wrapper of ISR is iteration. We find that if we just apply the centering, projection, graded rotation, un-projection, and un-centering once, the learned subspaces are not completely orthogonal. That is, re-identifying µ(A), µ(B), µ(X), and µ(Y) from the identified word vectors, the vectors v_1 = µ(A) − µ(B) and v_2 = µ(X) − µ(Y) are not quite orthogonal. However, if we repeat this entire (center-project-rotate-unproject-uncenter) process, then the identified vectors quickly approach orthogonality.
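The iteration wrapper can be sketched as below, with the graded-rotation step abstracted behind a `rectify(D, v1, v2)` callable (a stand-in for the OSCaR step; the function and parameter names here are ours). Each pass re-identifies the concept directions from the current vectors, centers, rectifies, uncenters, and stops once the recovered directions are orthogonal.

```python
import numpy as np

def isr_iterate(D, A_idx, B_idx, X_idx, Y_idx, rectify, iters=10, tol=1e-6):
    """Sketch of the ISR wrapper loop. A_idx..Y_idx index the rows of D
    belonging to the four word lists; `rectify(D, v1, v2)` is assumed to
    apply one graded-rotation step in span(v1, v2)."""
    for _ in range(iters):
        mA, mB = D[A_idx].mean(0), D[B_idx].mean(0)
        mX, mY = D[X_idx].mean(0), D[Y_idx].mean(0)
        v1, v2 = mA - mB, mX - mY
        cos = (v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        if abs(cos) < tol:                  # re-identified directions orthogonal
            break
        c = (mA + mB + mX + mY) / 4.0       # center of the four concept means
        D = rectify(D - c, v1, v2) + c      # center -> rotate -> uncenter
    return D
```

Note that the concept directions are re-computed from the modified vectors on every pass; this is what the convergence claim is about.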

3. EVALUATION OF DEBIASING AND RECTIFICATION

We first evaluate the effectiveness of ISR in two ways: how well it actually rectifies or orthogonalizes concepts, and how well it reduces bias. Following our models of concepts, all of our methods take as input four word lists: two target word sets X and Y and two sets of attribute words A and B. We learn concepts from each pair using their means µ(A), µ(B), µ(X), and µ(Y), and then the vectors between them, v_1 = µ(A) − µ(B) and v_2 = µ(X) − µ(Y). We found other approaches, such as the normal direction of a linear classifier or the first principal component of the union of a pair, to be less reliable.

Rectification via Dot Product. The dot product score measures the level of orthogonality between two linearly learned concepts. We focus on concepts represented by two sets, A and B, and the difference between their two means. Given two such vectors v_1, v_2 ∈ R^d, we simply compute their Euclidean dot product as ⟨v_1, v_2⟩ = v_1^T v_2 = ∥v_1∥∥v_2∥ cos(θ_{v_1,v_2}), where θ_{v_1,v_2} is the angle between the two vectors. If they are orthogonal, the result should be 0.

WEAT Score. The Word Embedding Association Test (WEAT) (Caliskan et al., 2017) was derived from the Implicit Association Test (IAT) from psychology. The goal of WEAT is to measure the level of human-like stereotypical bias associated with words in word embeddings. WEAT uses four sets of words: two target word sets X and Y and two sets of attribute words A and B. In short, it computes the average similarity of all pairs, adding those from X, A and Y, B, and subtracting otherwise; details are in Appendix E. Scores close to 0 indicate no (biased) association; typical values are in [−2, 2].

Word lists. Our methods and evaluation methods rely on word lists (and their vectorized forms; unless stated otherwise, 300-dimensional GloVe trained on English Wikipedia (Pennington et al., 2014)). We initially used the standard word lists from Caliskan et al. (2017), found in Appendix F.
Later we derive and use large word lists from LIWC (Pennebaker et al., 2001) , described in Section 3.2.
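Both measures are straightforward to compute. A sketch of the WEAT effect size, following the Caliskan et al. (2017) formulation (we assume the sample standard deviation over the pooled target associations, as in common implementations):

```python
import numpy as np

def weat_effect_size(X, Y, A, B):
    """WEAT effect size. X, Y are target word matrices and A, B attribute
    word matrices (one row per word vector). Positive scores mean X
    associates with A and Y with B; values typically fall in [-2, 2]."""
    def cos_sims(W, V):
        Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
        Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
        return Wn @ Vn.T

    def s(W):  # per-word association with A versus B
        return cos_sims(W, A).mean(1) - cos_sims(W, B).mean(1)

    sX, sY = s(X), s(Y)
    pooled = np.concatenate([sX, sY])
    # Sample std (ddof=1) assumed; implementations vary on this detail.
    return (sX.mean() - sY.mean()) / pooled.std(ddof=1)
```

The dot-product score is just `v1 @ v2` on the re-identified mean-difference vectors and needs no further machinery.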

3.1. EVALUATION USING WEAT

As a representative example, we first explore the relationship between male/female gendered terms and pleasant/unpleasant words. We compare against LP (Linear Projection) (Dev & Phillips, 2019), HD (Hard Debiasing) (Bolukbasi et al., 2016), INLP (Iterative Null Space Projection) (Ravfogel et al., 2020), and OSCaR (Dev et al., 2021a). iOSCaR denotes iteratively running OSCaR, and SR denotes the non-iterative subspace rectification with our added centering step. Note that Hard Debiasing includes an equalization step where paired gendered words (e.g., dad-mom) are not projected but instead equalized to remain the same distance apart as they were originally. Such a paired word list concept seems mostly specific to binary-gendered terms, and we simply skip this step otherwise. The WEAT scores are in Table 1, evaluated on the same words used as input to the algorithms. In this case, LP actually increases the WEAT score, and HD, INLP, and OSCaR moderately decrease the scores to about 50% of their previous values. Our method ISR significantly reduces the WEAT score to about 0.03, removing almost all evidence of bias. We use 10 iterations of subspace rectification; typically, 2-4 are sufficient. We show the rate of convergence by iteration in Table 2, which also shows the dot product (dotP) scores per iteration. ISR quickly converges to a dot product of essentially 0, so the subspaces are orthogonal; iOSCaR does not. We apply similar experiments on many other data set pairs, where ISR commonly removes more than 98% of the bias. Also, note that SR (only one iteration of the centered rectification process) is not nearly as effective as the iterative process in ISR.

3.2. EVALUATION USING A TEST / TRAIN SPLIT

The evaluation of debiasing using WEAT with such small and carefully chosen word lists is common. E.g., many papers (Bolukbasi et al., 2016; Dev et al., 2021a) select only the he-she pair to train Gen(M/F). However, a larger goal is to generalize to other words not included in the word lists. A natural suggestion is to perform cross-validation. That is, split the word lists into two sets at random; use one set to operate the debiasing mechanism (train) and the other to evaluate WEAT (test). There are two concerns about this. First, the train-test split approach is predicated on all data points being drawn iid from an underlying distribution, so that both splits are reflective of that distribution. However, words from natural language are not iid; they are in some sense each irreplaceable and unique. Second, the above word lists are rather small, and in halving them, they often become too small to effectively capture the signal or evaluate generalization. We address these concerns (mainly the second) by building larger word lists. We start by pulling categories from LIWC (Pennebaker et al., 2001) that are related to the small bespoke word lists we studied, when possible. We then choose the 100 closest words to the mean of the smaller list. The details and word lists are in Appendix F. In the following experiments, we perform a 50/50 test/train split on each word list, perform the debiasing mechanism on the train half, and evaluate WEAT on the test half. For each experiment, we repeat this 10 times and report the average value. This captures, to some degree, how the methods generalize to the concepts at large. However, it does not capture everything as cleanly as the previous (non test/train split) experiment because of the non-iid and irreplaceable nature of individual words.
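The protocol can be sketched as follows, with the debiasing mechanism and the WEAT evaluation abstracted as callables (the function names and signatures here are ours, for illustration):

```python
import numpy as np

def split_eval(word_lists, debias, evaluate, trials=10, seed=0):
    """Test/train protocol sketch: split each word list 50/50 at random,
    run the debiasing mechanism on the train halves, score WEAT-style on
    the held-out test halves, and average over random trials.

    `word_lists` is a tuple of lists (X, Y, A, B); `debias(train_lists)`
    returns a debiased embedding; `evaluate(emb, test_lists)` returns a
    scalar score."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(trials):
        train, test = [], []
        for words in word_lists:
            perm = rng.permutation(len(words))
            half = len(words) // 2
            train.append([words[i] for i in perm[:half]])
            test.append([words[i] for i in perm[half:]])
        emb = debias(tuple(train))
        scores.append(evaluate(emb, tuple(test)))
    return float(np.mean(scores))
```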
Table 4 shows the results for the test/train split, and Table 5 shows the results for these same large word lists but without the test/train split, where the mechanism and evaluation are each performed on the full list. With the test/train split, ISR consistently performs among the best, although there are examples (notably Statistically Gendered Names, Name(M/F)) where it does not perform as well. However, in almost all situations where ISR is not the best-performing method, another method we propose, iOSCaR (where the non-centered OSCaR is iteratively applied), performs the best. Sometimes some projection-based methods outperform ISR, notably INLP, which iteratively applies projection over 30 times; however, these are not consistently better than ISR, and especially not better than iOSCaR. We also observe that, in Table 5, our method ISR does significantly better without the test/train split, while other approaches, like iOSCaR, sometimes do about the same. In fact, ISR always has a WEAT score below 0.06. We suspect this is because ISR aligns well with this task, and some words are irreplaceable in defining a concept, making the test/train split noisy.

3.3. EVALUATING BIASES IN PRE-TRAINED LANGUAGE MODELS

Societal biases have also been demonstrated to manifest in large pre-trained contextual language models (May et al., 2019; Kurita et al., 2019; Webster et al., 2020; Guo & Caliskan, 2021; Wolfe & Caliskan, 2021). We evaluate the effectiveness of ISR and iOSCaR at removing such bias on the Sentence Encoder Association Test (SEAT) (May et al., 2019) benchmark. This extends WEAT to contextual representations by constructing semantically neutral template sentences such as "this is a/an [WORD]" to create many vectorized representations, averages of which are taken to generate an effect size similar to WEAT's. Scores closer to 0 indicate less biased associations. We consider 3 masked language models (BERT (Devlin et al., 2019), ALBERT (Lan et al., 2020), and RoBERTa (Liu et al., 2019)) and an autoregressive model (GPT-2 (Radford et al., 2019)). Results on ALBERT and GPT-2, and more details of the setup, are deferred to Appendix B; ALBERT results are similar to BERT and RoBERTa, and GPT-2 exhibits less bias, so the measurements are less meaningful. We present baseline results from Counterfactual Data Augmentation (CDA) (Zmigrod et al., 2019), DROPOUT (Webster et al., 2020), Iterative Nullspace Projection (INLP) (Ravfogel et al., 2020), and SENTENCEDEBIAS (Liang et al., 2020). The last two extend INLP (Ravfogel et al., 2020) and linear projection (Bolukbasi et al., 2016; Dev & Phillips, 2019) to the average of sentences from Wikipedia containing the concept words. To avoid overfitting concerns, for our methods ISR and iOSCaR, we again use a more extensive word list of size 50. These are chosen among the larger word lists from LIWC (Pennebaker et al., 2001) as the words closest to those in the small sets used in SEAT and WEAT. We then, similar to the baselines, vectorize sentences containing those words from a Wikipedia dump. We report results for six SEAT tests based on male vs. female gender terms against either Career vs. Family (6), Math vs. Arts (7), or Science vs. Arts (8). The 'b' variants use statistically gendered names instead of definitionally gendered terms.
Table 6 reports the published effect sizes of SEAT for the baseline debiasing models from Meade et al. (2022) and for our proposed methods iOSCaR and ISR. The original average absolute effect sizes for BERT and RoBERTa without debiasing are 0.620 and 0.940, respectively, and ISR considerably reduces the effect sizes to 0.190 and 0.385, respectively. These are the lowest aggregate scores among all of the methods. The next closest scores are typically from INLP, at 0.204 and 0.823, a technique that, unlike ISR, removes significant information from the embeddings. Finally, not only is ISR highly effective at mitigating social bias, it is also relatively stable across the several tasks evaluated in this paper. This is in contrast to many other debiasing methods, which, as Meade et al. (2022) reported, have very high variance across different tasks.

3.4. EVALUATION OF INFORMATION PRESERVED

A critique of the projection-based debiasing mechanisms is that they destroy important information in the vectorized representations. While LP and HD only modify a rank-1 subspace of a very high-dimensional space and thus, on the whole, do not change the representation that much, INLP may modify a 35-dimensional subspace, which can cause some non-trivial distortions. Moreover, on task-specific challenges (e.g., pronoun resolution involving gender when the male/female gender subspace is removed), significant important information can be lost using the projection-based approaches. In contrast, the orthogonalization-based approaches (OSCaR and the proposed ISR) only skew a rank-2 subspace and so have the potential to retain much more information. We quantify the task-based information preserved with what we call a Self-WEAT score (or SWEAT score). Given a pair of word lists A, B defining concepts (e.g., Male and Female Terms), we would like to measure how the coherence within each word list A or B compares to the cross-coherence with the other. We can do this by leveraging a random split of each word list and the WEAT score. That is, we randomly split A into A_1 and A_2, and similarly B into B_1 and B_2. Then we compute the WEAT score as WEAT(A_1, B_1, A_2, B_2). The SWEAT score is the average of this process repeated 10 times. If A and B retain their distinct meanings, this should be reflected in a similar SWEAT score before and after a debiasing mechanism is applied. If the distinction is destroyed, the SWEAT score will decrease (towards 0) after the debiasing mechanism. Table 7 shows the results of several experiments on concept pairs and the effect of debiasing on the SWEAT score. The first concept (Concept1) is the one on which the linear debiasing mechanisms are applied and on which the SWEAT score is evaluated; the second (Concept2) is the concept used in the rotation-based mechanisms.
We observe that the pure projection-based mechanisms (LP and INLP) significantly decrease the SWEAT score after debiasing. The hard debiasing mechanism, HD, is projection based but does not apply projection to the word list used to define the subspace, so it is not surprising that when SWEAT is measured on that same word list, there is typically minimal change to the scores. However, beyond those word lists, the effect would be similar to LP. For instance, note that the Gen(M/F) set corresponds with the original use in Bolukbasi et al. (2016), and this overlaps with an equalize set of words whose embeddings are modified after projection; while this step is meant to preserve information, it actually decreases the SWEAT score. In contrast, the rotation-based methods, which do not need special restrictions on word lists (especially our method ISR), show almost no decrease in the SWEAT score, hence retaining virtually all of the information pertinent to the two concepts. While OSCaR does not decrease the SWEAT score much, the iterated version iOSCaR exhibits a significant decrease in the SWEAT score, similar to INLP.

Other downstream tasks. To show the effectiveness of our proposed debiasing method, ISR, we also consider other intrinsic and extrinsic tasks. See Appendix C for all the results and details.
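For concreteness, a sketch of the SWEAT computation on two matrices of word vectors (one row per word; the inner WEAT follows the standard effect-size formulation, with the sample standard deviation assumed):

```python
import numpy as np

def sweat(A, B, repeats=10, seed=0):
    """Self-WEAT (SWEAT) sketch: randomly halve each concept's word-vector
    matrix and score WEAT(A1, B1, A2, B2); average over repeats. A high
    value means A and B remain internally coherent and mutually distinct;
    a value near 0 means the distinction has been destroyed."""
    def weat(X, Y, At, Bt):
        def cs(W, V):
            Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
            Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
            return Wn @ Vn.T
        s = lambda W: cs(W, At).mean(1) - cs(W, Bt).mean(1)
        sX, sY = s(X), s(Y)
        return (sX.mean() - sY.mean()) / np.concatenate([sX, sY]).std(ddof=1)

    rng = np.random.default_rng(seed)
    out = []
    for _ in range(repeats):
        pa, pb = rng.permutation(len(A)), rng.permutation(len(B))
        A1, A2 = A[pa[: len(A) // 2]], A[pa[len(A) // 2:]]
        B1, B2 = B[pb[: len(B) // 2]], B[pb[len(B) // 2:]]
        out.append(weat(A1, B1, A2, B2))
    return float(np.mean(out))
```

Comparing `sweat(A, B)` before and after a debiasing mechanism gives the preservation measure used in Table 7.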

4. RECTIFICATION OF THREE CONCEPTS

Algorithm 1 3-ISR(D, (A, B), (X, Y), (R, S))
1: for k iterations do
2:   Get concept means: µ(A) = (1/|A|) Σ_{a∈A} a, and similarly µ(B), µ(X), µ(Y), µ(R), µ(S).
3:   Compute the center c = (µ(A) + µ(B) + µ(X) + µ(Y) + µ(R) + µ(S)) / 6.
4:   Get subspaces: v_1 = µ(A) − µ(B), v_2 = µ(X) − µ(Y), and v_3 = µ(R) − µ(S).
5:   Center all data: z ← z − c for all z ∈ D.
6:   Rectify(D, v_1, v_2).
7:   Project: v_3^⊥ ← Span_{v_1,v_2}(v_3).
8:   Rectify(D, v_3^⊥, v_3).
9:   Uncenter all data: z ← z + c for all z ∈ D.
10: return modified word vectors D

The proper way to debias word vector embeddings along multiple concepts has long been an important goal. Applying projection-based methods along multiple linearly learned concepts is an option. However, the most effective of these (INLP) removes dozens of dimensions for each concept addressed, so applying it multiple times would start to significantly degrade the information encoded within the embeddings. Another approach, Hard Debiasing, relies on paired terms (e.g., boy-girl, aunt-uncle) that are explicitly balanced after a projection, but such paired words do not always exist for other concepts. Since ISR achieves near-0 dot products between concepts, we next apply this method iteratively to rectify multiple concepts. As a running example, we consider the issue of how nationality-associated names (from the USA and Mexico) can be associated with gender (and potentially the bias that comes with it) as well as with unpleasant sentiments. In this experiment, we attempt to de-correlate names from these intersectional issues.

Multiple Subspace Rectification. As input we take 3 pairs of concepts: A, B (e.g., definitionally male/female gendered terms), R, S (e.g., statistically-associated USA/Mexico names), and X, Y (e.g., pleasant/unpleasant terms). As before, for each list we define a mean µ(A), and for each pair a concept direction v_1 = µ(A) − µ(B), v_2 = µ(X) − µ(Y), and v_3 = µ(R) − µ(S). The goal is to orthogonalize these concepts so that when we recover v_1, v_2, and v_3 from the updated word representations, they are orthogonal. By gradually rotating all data with these words, the premise is that these concepts will de-correlate (and hence de-bias) while retaining their internal meaning.
We start by centering at c = (µ(A) + µ(B) + µ(X) + µ(Y) + µ(R) + µ(S))/6, the average of all concept means. Then we follow a Gram-Schmidt-style procedure to orthogonalize these concepts. For the pair of concepts with the smallest dot product (wlog v_1 and v_2), we run one step of graded rotation. Then we apply this approach to the third concept v_3, but with respect to the span of v_1 and v_2; that is, denote by v_3^⊥ the projection of v_3 onto the span of v_1, v_2. We then apply a graded rotation on v_3 with respect to v_3^⊥. Then we uncenter with respect to c. This is one iteration; we repeat the entire process for a small number of iterations, e.g., 5. We outline the procedure in Algorithm 1, which takes as input the word vectors of all words D as well as the 3 concept pairs (A, B), (X, Y), (R, S). It leverages the graded-rotation step from OSCaR (Dev et al., 2021a), which we refer to as Rectify. This takes in all the word vectors D and two subspace directions v_1 and v_2. It modifies all points z ∈ D, but only in the span of v_1 and v_2, so that words aligned with v_2 are rotated towards being orthogonal to v_1 (within that span) and words aligned with v_1 are mostly left as is. We could extend this procedure to more than 3 concept pairs by iteratively applying the Rectify method on each jth subspace v_j with respect to v_j^⊥ = Span_{v_1,...,v_{j−1}}(v_j), the projection onto the span of the previous j − 1 directions.

Evaluation of three subspace rectification. We evaluate on definitionally gendered male/female terms (GT: A, B), pleasant/unpleasant terms (P/U: X, Y), and statistically-associated USA/Mexico names (NN: R, S), using the associated large word lists. GT (gendered terms) and NN (USA/Mexico names) have the smallest dot product, so these are rectified first within each iteration. Table 8 shows the WEAT score of the ISR mechanism over 5 iterations, measured on the full set; up to iteration 10 is shown in Appendix D.1.
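A sketch of one form this procedure could take, mirroring Algorithm 1 with the graded-rotation step abstracted as a `rectify(D, u, v)` callable (a stand-in for OSCaR's Rectify; the names here are ours):

```python
import numpy as np

def three_isr(D, AB, XY, RS, rectify, iters=5):
    """Sketch of Algorithm 1 (3-ISR). AB, XY, RS are pairs of row-index
    lists into D; `rectify(D, u, v)` is assumed to apply one graded-rotation
    step in span(u, v). Assumes v1, v2 are non-parallel and v3 is not
    orthogonal to their span."""
    for _ in range(iters):
        means = [D[idx].mean(0) for pair in (AB, XY, RS) for idx in pair]
        c = np.mean(means, axis=0)                    # center of all 6 means
        v1 = means[0] - means[1]
        v2 = means[2] - means[3]
        v3 = means[4] - means[5]

        D = D - c                                     # center
        D = rectify(D, v1, v2)                        # orthogonalize v1, v2

        # project v3 onto span(v1, v2) via an orthonormal basis (Gram-Schmidt)
        e1 = v1 / np.linalg.norm(v1)
        u = v2 - (v2 @ e1) * e1
        e2 = u / np.linalg.norm(u)
        v3p = (v3 @ e1) * e1 + (v3 @ e2) * e2

        D = rectify(D, v3p, v3)                       # push v3 off that span
        D = D + c                                     # uncenter
    return D
```

As in Algorithm 1, all three directions are recomputed from the updated vectors at the start of each iteration, so repeated passes drive all pairwise dot products toward 0.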
All pairwise WEAT scores decrease significantly, with GT vs. NN and GT vs. P/U dropping to about 0.02 and 0.01, respectively. NN vs. P/U has a larger initial value of 1.15 and drops to about 0.14. Also, the pairwise dot products all drop below 0.006. Finally, in Table 8 we show the SWEAT scores for each concept pair. Each concept retains a high self-correlation, preserving its original associations as desired. We performed an additional experiment with three different concepts; see Appendix D.

5. DISCUSSION

We introduced a new mechanism for augmenting word vector embeddings, or any vectorized embedding representations, namely Iterative Subspace Rectification (ISR). It can un-correlate concepts defined by pairs of word lists; this has applications in debiasing and in increasing transparency in otherwise opaque distributed representations. While the method is based on a recent method, OSCaR (Dev et al., 2021a), it adds some essential extensions that crucially allow the resulting subspaces to be completely orthogonal. In particular, this allows one to post-process the embeddings so the identified concepts can be rotated, an isometric transformation, to lie along coordinate axes, allowing a mix of specifically encoded and distributedly encoded aspects of the vector representation.

Single-set concepts. A major design choice that went into the model of concepts and subspaces, as well as measurement, is that concepts are defined as clusters and subspaces by pairs of clusters; see the extended discussion in Appendix A. We also considered an ISR-like method for subspaces defined by single word lists (e.g., occupations). This setting is more general and could potentially be used to rectify concepts that do not have two well-defined polar sets, like occupations or perhaps race, nationality, or ethnicity. We did discover a variant of ISR that empirically converged to a dot product of 0. This finds each single-set subspace as the top principal component; each of these defines a line, ℓ_1 and ℓ_2, in R^d. Then, to identify a center, it finds the pair of points p_1 ∈ ℓ_1 and p_2 ∈ ℓ_2 that are as close as possible; this can be solved analytically. The center is chosen as the midpoint of p_1 and p_2, so c = (p_1 + p_2)/2. While this worked fairly well in the sense of dot-product convergence to 0, it was less clear how to evaluate it in terms of bias removal and information retention. Pursuing the generality of this method would be interesting future work.
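The analytic step for the single-set variant, finding the closest pair of points between the two principal-component lines and taking their midpoint as the center, is a standard closest-points-between-lines computation; a sketch (assuming the lines are not parallel):

```python
import numpy as np

def closest_points(a1, u1, a2, u2):
    """Closest pair of points between lines a1 + t*u1 and a2 + s*u2
    (non-parallel), and their midpoint, which the single-set ISR variant
    uses as the center c. Solves the 2x2 normal equations analytically."""
    w0 = a1 - a2
    a, b, c = u1 @ u1, u1 @ u2, u2 @ u2
    d, e = u1 @ w0, u2 @ w0
    denom = a * c - b * b          # > 0 when the lines are not parallel
    t = (b * e - c * d) / denom
    s = (a * e - b * d) / denom
    p1, p2 = a1 + t * u1, a2 + s * u2
    return p1, p2, (p1 + p2) / 2.0
```

At the optimum, the connecting segment p_1 − p_2 is orthogonal to both line directions, which is exactly the condition the normal equations encode.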

6. ETHICS STATEMENT

Several subtle implementation choices were made for this method to achieve its intended results. For instance, the centering step should occur before the projection onto the span of v_1, v_2 to perform rectification. Also, in the multiple-subspace version of ISR, the iteration loop should wrap around both rectify steps (of v_1, v_2 and of v_3, v_3^⊥) as opposed to completing one rectification (e.g., iterating on v_1, v_2) and then trying to iteratively perform the other (v_3, v_3^⊥).

Limitations. The main limitation is that the work requires concepts to be easily encoded with a list of words. If the word list is too small, or the relevant words have multiple meanings, then these approaches may prove less effective. An example where stereotypes may occur, but where the community has so far been unable to find suitable word lists to capture the concepts, is non-binary notions of gender (Dev et al., 2021b). Our work hence focuses on biases occurring in binary representations of gender (male versus female); we remark on this not to be exclusionary but to make clear that the challenge of addressing non-binary representations of gender (largely due to its lack of representation in language models) is a limitation of this work. Another related limitation is the way our work addresses nationality. We do so via the most common names at birth in the USA and Mexico. We do not claim this actually encodes the nationality of someone with such a name, but because of the statistical association we draw on to generate these word lists, it serves to encode stereotypes someone with one of these names may face.

Other considerations. While under the standard WEAT measurement our method ISR can virtually eliminate all traces of unwanted associations, this complete elimination of measured bias may not transfer to other applications.
This is not a new phenomenon (Gonen & Goldberg, 2019; Wang et al., 2020; Dev et al., 2020; Zhao et al., 2019), and, for instance, may be the result of bias creeping into the other mechanisms used in the evaluation process. For instance, this may be relevant in downstream tasks where other training data and algorithmic decisions contribute to the overall solution and hence are also subject to bias. Nonetheless, we believe this work has demonstrated significant progress toward eliminating a substantial amount of bias from the core vectorized representation of data. This work focuses on debiasing of the English language; all evaluation and methods are specified to this context. We hope these ideas generalize to other languages (c.f., Hirasawa & Komachi (2019); Pujari et al. (2019)) as well as vectorized representations of other sorts of data, such as images (Kiela & Bottou, 2014; Lowe, 2004), social networks (Grover & Leskovec, 2016; Perozzi et al., 2014), financial networks (Wang et al., 2021), etc. Finally, and related to the previous points, we investigate one way of measuring and attenuating bias, focusing on applications in natural language processing. There are, however, other forms of bias, as well as other ways to measure and attenuate them.

On removing bias. As discussed in the limitations section, this work addresses a limited but highly leveraged form of bias in English language models. Other manifestations and evaluations of bias exist, and it is likely no one methodology or framework can address all aspects. Indeed, some may argue that such learned correspondences in representations should not be augmented away. Our method attempts only to orthogonalize the representation of these concepts, still allowing, for instance, a place to have an association with females and pleasant sentiments.
In particular, we focus on concepts captured using polar sets, and this, for instance, may be limiting for groups whose representation does not fit into one of those polar notions and who feel that the unfair treatment results from that representation. Although we have not explicitly attempted to address such a concern, we hope that if there is a set of words that can robustly represent such a group within these word representations, then it can be paired with the complement of that set and made orthogonal to other concepts, thus removing the unwanted correlation. Identifying and demonstrating this would be important future work. Overall, this paper provides a powerful new mechanism for removing unwanted correlations from word vector representations while preserving the existing representation of those concepts. The resulting data representation not only can be shown to dramatically reduce a common bias measurement, but it also increases the interpretability of these representations by allowing multiple identified concepts to occupy coordinate axes.

7. REPRODUCIBILITY STATEMENT

All of the debiasing models run on a CPU; running ISR and iOSCaR with 10 iterations takes about 4 minutes. Hardware specifications: NVIDIA GeForce GTX Titan XP 12GB, AMD Ryzen 7 1700 eight-core processor, and 62.8GB RAM. All debiasing approaches completed in under 5 minutes. We used the publicly available code for the existing and baseline debiasing approaches we compared against; we provide links to this code in the references. All the word lists are in Appendix F. We provide code at https://github.com/poaboagye/ISR-IterativeSubspaceRectification.

APPENDIX A EXTENDED DISCUSSION

Why paired-concept subspaces? A major design choice that went into the model of concepts and subspaces, as well as measurement, is that concepts are defined as clusters and subspaces by pairs of clusters. This is more general than subspaces defined by sets of pairs of words (e.g., he-she, man-woman (Bolukbasi et al., 2016)), but not as general as subspaces defined by a single large list of words (e.g., occupations (Dev et al., 2021a)) using their top principal component. This choice was made for two reasons. First, we empirically observed that single-word-list subspaces were not as stable. For instance, on gendered terms, we could use either approach, and the one that splits the word list into a set of male terms and another set of female terms seemed better aligned with the intended direction than just using a single word list. Second, this allowed for a tighter coupling with the evaluation. Previous work sometimes tried to use standard WEAT word lists to evaluate bias removal, but the comparison Male/Female vs. Math/Arts may not correlate with the topics used to drive the mechanism (Male/Female vs. Pleasant/Unpleasant), making it an apples-to-oranges evaluation.
Under this setup, we could directly evaluate on the concepts the methods targeted.
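To make the two subspace constructions concrete, here is a minimal numpy sketch (with synthetic vectors, not actual embeddings) contrasting a paired-cluster direction, the normalized difference of the two cluster means, with a single-list direction taken as the top principal component:

```python
import numpy as np

def paired_concept_direction(cluster_a, cluster_b):
    # direction between the means of two word clusters (e.g., male vs. female terms)
    d = np.mean(cluster_a, axis=0) - np.mean(cluster_b, axis=0)
    return d / np.linalg.norm(d)

def single_list_direction(word_list):
    # alternative: top principal component of a single centered word list
    centered = word_list - word_list.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

# synthetic "embeddings": two clusters separated along the first axis
rng = np.random.default_rng(0)
male = rng.normal(loc=[1.0, 0.0, 0.0], scale=0.05, size=(8, 3))
female = rng.normal(loc=[-1.0, 0.0, 0.0], scale=0.05, size=(8, 3))

v_paired = paired_concept_direction(male, female)
v_single = single_list_direction(np.vstack([male, female]))
# both recover (up to sign) the axis separating the two clusters
print(np.round(np.abs(v_paired), 2), np.round(np.abs(v_single), 2))
```

On this clean synthetic data both constructions agree; the paper's observation is that on real gendered terms the paired-cluster direction tracked the intended concept more stably.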

B EVALUATING BIASES IN PRE-TRAINED LANGUAGE MODELS

The advent of large pre-trained language models has led to remarkable success in several natural language processing (NLP) tasks (Peters et al., 2018; Devlin et al., 2019; Lan et al., 2020; Brown et al., 2020). However, recent works have shown that pre-trained language models encode social biases from the data they were trained on (May et al., 2019; Kurita et al., 2019; Webster et al., 2020; Guo & Caliskan, 2021; Wolfe & Caliskan, 2021). These encoded biases get propagated or amplified by machine learning models in downstream NLP tasks such as machine translation (Stafanovičs et al., 2020; Wang et al., 2022), sentiment classification (Kiritchenko & Mohammad, 2018), and visual question answering (Goyal et al., 2017; Hudson & Manning, 2019; Hirota et al., 2022). Hence it is imperative to mitigate social biases in pre-trained language models.

Motivated by this, we conduct an experiment to mitigate gender bias in three masked language models (BERT, ALBERT, and RoBERTa) and an autoregressive language model (GPT-2). We evaluated the performance of ISR and iOSCaR against the Sentence Encoder Association Test (SEAT) (May et al., 2019) benchmark. SEAT is a standard intrinsic bias benchmark used to measure the level of bias in pre-trained language models' embedding representations. It is an extension of the Word Embedding Association Test (WEAT) (Caliskan et al., 2017) (see Appendix E) to sentence representations, which makes it particularly suited to pre-trained language models. Similar to WEAT, which measures the stereotypical association between two sets of target-concept and attribute word lists, SEAT substitutes the target and attribute word lists from WEAT into a semantically neutral template such as "this is a/an [WORD]" to create target-concept and attribute sentence lists. The vectorized sentence representation is obtained using the average token representation from the last hidden state.
After obtaining the sentence vector representations of the two sets of target concepts and attributes, the WEAT test statistic is computed. We report the effect size in the SEAT evaluation; an effect size closer to 0 indicates no (biased) association.

Baseline Debiasing Models. Here we describe the four baseline debiasing techniques we compared ISR and iOSCaR against.
• Counterfactual Data Augmentation (CDA) (Zmigrod et al., 2019) is a data augmentation technique that re-balances the gendered corpus within the dataset by swapping male/female attributes to obtain a more diverse and balanced dataset for language model pre-training.
• DROPOUT (Webster et al., 2020) is a debiasing technique that uses dropout regularization to reduce gender bias by increasing the dropout parameters in the pre-trained language model.
• Iterative Nullspace Projection (INLP) (Ravfogel et al., 2020): given a target concept (e.g., the male/female gender concept), INLP builds a linear classifier that best separates the target concept and linearly projects all words along the classifier normal.
• SENTENCEDEBIAS (Liang et al., 2020) is an extension of linear projection (Dev & Phillips, 2019) to sentence representations. It first identifies the gender direction or subspace, and then projects all sentence representations away from the gender direction, or removes the component along the gender subspace from each sentence representation.

Pretrained Models. We considered these four pre-trained language models in our gender bias mitigation experiments:
• BERT (Devlin et al., 2019)
• ALBERT (Lan et al., 2020)
• RoBERTa (Liu et al., 2019)
• GPT-2 (Radford et al., 2019)

Concept Subspace Word List. We used a more extensive word list of size 50 to determine the concept subspace in iOSCaR and ISR. We pulled the common categories related to the small SEAT categories from LIWC (Pennebaker et al., 2001), and then chose the 50 closest words from LIWC to the mean of the smaller list.
The details and word lists are in Appendix F.1 (we chose the top 50 words, starting from left to right). To contextualize the concept subspace of these words used in iOSCaR and ISR, we identified their occurrences in sentences within a 2.5% fraction of an English Wikipedia dump. We then took the average token representation from the last hidden state as the vectorized sentence representation.

SEAT Test Specifications. Table 9 provides more details about the SEAT test evaluation. For GPT-2, all debiased models obtain a larger average absolute effect size than the original GPT-2. Again, ISR and INLP have the smallest average absolute effect sizes, 0.138 and 0.119, close to the average absolute effect size of 0.113 for GPT-2 itself. As many effect sizes become negative, we suspect measurements of these magnitudes are within the typical noise of the evaluation method. This comports with the finding of Guo & Caliskan (2021) that GPT-2 contains the smallest magnitude of overall bias among these contextual models.

Finally, as effective as ISR is at mitigating social bias, it is also relatively stable across the several tasks evaluated in this paper. This is notable since a recent empirical finding from Meade et al. (2022) showed that most debiasing techniques have very high variance across different tasks; ISR is thus more stable and better able to generalize across tasks. We also point out that in this experiment, ISR and iOSCaR are trained on a different (contextualized) word set than the key terms used in the evaluation sentences. This is another demonstration of the effectiveness of these methods under a test-train split, showing they do not explicitly overfit.
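The "closest words to the mean of the smaller list" selection step can be sketched as follows. This is a toy numpy example with made-up 2-d vectors and hypothetical words; in the actual pipeline the candidates come from LIWC and the vectors from the language model:

```python
import numpy as np

def closest_to_mean(small_list, candidates, emb, k):
    # rank candidate words by cosine similarity to the (normalized)
    # mean vector of the small bespoke word list; keep the top k
    center = np.mean([emb[w] for w in small_list], axis=0)
    center = center / np.linalg.norm(center)
    score = lambda w: float(emb[w] @ center / np.linalg.norm(emb[w]))
    return sorted(candidates, key=score, reverse=True)[:k]

emb = {
    "he":  np.array([1.0, 0.0]),
    "him": np.array([0.9, 0.1]),
    "man": np.array([0.95, 0.05]),  # close to the gendered mean
    "ant": np.array([-1.0, 0.0]),   # unrelated
    "dad": np.array([0.8, 0.2]),
}
print(closest_to_mean(["he", "him"], ["man", "ant", "dad"], emb, k=2))
# ['man', 'dad']
```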

C DOWNSTREAM TASK OF DEBIASED WORD EMBEDDINGS

To show the effectiveness of our proposed debiasing method, ISR, beyond intrinsic tasks like WEAT, SEAT, and SWEAT, we also considered other intrinsic tasks, namely Bias-by-Projection (Ding et al., 2022), the SemBias Analogy Task (Zhao et al., 2018), and a Word Similarity Task, as well as extrinsic tasks comprising POS (part-of-speech) tagging, POS chunking, and Named Entity Recognition (NER) (Tjong Kim Sang, 2002). We used the small word lists to determine the gender and occupation subspaces in all the experiments below.

Bias NLI. Following Dev et al. (2021a), we perform a natural language inference (NLI) task for bias mitigation. The goal of NLI is to determine whether a sentence, the premise, entails, contradicts, or is neutral to another sentence, the hypothesis. Dev et al. (2021a) showed that biased representations can lead to invalid inferences. For instance:
Premise: A doctor bought a bagel.
Hypothesis 1: A man bought a bagel.
Hypothesis 2: A woman bought a bagel.
The question being asked in the inference task above is whether "doctor" implies a male-gendered connotation or a female-gendered one. Both hypothesis sentences are neutral with respect to the premise. However, a language model trained on a biased word embedding predicts entailment for Hypothesis 1 and contradiction for Hypothesis 2 (Parikh et al., 2016). Thus the model says "yes" (entailment), a doctor must be a man, and "no" (contradiction), a doctor cannot be a woman. The aim now is to debias the word embedding representation and then perform the NLI task to measure the level of bias attenuation while maintaining valid gender associations. When debiasing word embeddings, we do not want to alter valid associations, such as between the word pregnant and words like female and mother. We use the Bias NLI dataset designed by Dev et al. (2020), which consists of ∼1.9 million neutral sentence pairs; they instantiated templates to measure stereotypical inferences with gendered and occupation words.
For example:
Premise: The man ate a bagel.
Hypothesis: The accountant ate a bagel.
A biased or stereotypical inference is measured as a deviation from the neutrality label with the metrics Net Neutral (N. Neu), Fraction Neutral (F. Neu), Dev F1, and Test F1. Net Neutral is the average probability of the neutral label across all sentence pairs, and Fraction Neutral is the fraction of sentence pairs accurately predicted as neutral; higher N. Neu and F. Neu scores indicate lower bias. We apply ISR to the first layer of RoBERTa, fine-tune on the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015), and evaluate on the Bias NLI dataset during inference (Dev et al., 2020). Table 11 shows the N. Neu, F. Neu, Dev F1, and Test F1 scores of the RoBERTa NLI model across the various debiasing methods, before and after debiasing. ISR is slightly outperformed by OSCaR on the neutrality scores (N. Neu and F. Neu) but performs slightly better on the F1 scores on the dev/test sets. Compared to the three other baseline debiasing methods (LP, HD, INLP), ISR achieves higher neutrality and F1 scores, except on the Test F1 score, where it is on par with INLP. This shows that ISR reduces bias about as well as OSCaR and significantly better than the other debiasing methods.

Information Retention. Following Dev et al. (2021a), we quantify the level of gender information retained in an extrinsic NLI task. As much as we want to mitigate bias in vectorized language representations, we do not wish to destroy valid gender associations, so that these embeddings retain their utility for other downstream tasks that require robust semantic information. We follow the experimental setup from Dev et al. (2021a): we again apply ISR to the first layer of RoBERTa, fine-tune on the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015), and evaluate on the Sentence Inference Retention Test (SIRT) dataset during inference (Dev et al., 2020).
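The per-label metrics used here, the average label probability ("Net") and the fraction of pairs predicted as that label ("Fraction"), are straightforward to compute. A toy sketch with made-up model outputs (not the paper's numbers):

```python
import numpy as np

def net_and_fraction(label, label_probs, predictions):
    # Net score: mean predicted probability of `label` over all sentence pairs.
    # Fraction score: share of pairs whose predicted class equals `label`.
    net = float(np.mean(label_probs))
    frac = float(np.mean([p == label for p in predictions]))
    return net, frac

# hypothetical NLI outputs for four neutral sentence pairs
probs = [0.75, 0.5, 0.75, 1.0]                      # P(neutral) per pair
preds = ["neutral", "entailment", "neutral", "neutral"]
print(net_and_fraction("neutral", probs, preds))    # (0.75, 0.75)
```

The same function, applied with `label="entailment"` or `label="contradiction"`, yields the Net/Fraction Entail and Contradict scores used for SIRT below.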
Unlike the Bias NLI dataset, which contains neutral sentence pairs, the SIRT dataset contains sentence pairs with ground-truth labels, either entailment or contradiction; each of the entailment and contradiction datasets contains ∼47 thousand sentence pairs. We measure the quantity of gender information retained using the metrics Net Entail and Fraction Entail for the entailment dataset and Net Contradict and Fraction Contradict for the contradiction dataset. The Fraction Entail/Contradict score is the accuracy of the model's predictions, i.e., the fraction of instances for which the model predicts the entailment/contradiction class; Net Entail/Contradict is the average probability of the entailment/contradiction label across all sentence pairs. ISR is the best-performing method on the SIRT test compared to the other debiasing techniques (LP, HD, INLP, and OSCaR) and the Baseline. We see a significant improvement, with Net Entail and Fraction Entail at 97.4 and 100.0, and Net Contradict and Fraction Contradict at 99.5 and 99.8. Thus we see almost perfect preservation of gendered information with ISR, in line with our goal of improving bias removal while retaining key information.

CoNLL2003 Shared Task. To investigate whether our proposed debiasing method, ISR, still retains good extrinsic downstream utility and performance on standard natural language processing (NLP) tasks, we considered the CoNLL2003 shared task (Tjong Kim Sang, 2002). Under the CoNLL2003 shared task, we use POS (part-of-speech) tagging, POS chunking, and Named Entity Recognition (NER) as the three evaluation tasks, following Manzini et al. (2019).
Each task is evaluated in two ways: 1) Embedding Matrix Replacement: we first train the task-specific model on the biased word embedding and, at test time, compute the evaluation-metric difference between using the biased embeddings and the debiased embeddings; and 2) Model Retraining: we train two separate models for a given evaluation task, one on the biased word embeddings and the other on the debiased word embeddings, and at test time compute the difference in the performance of the two models. In both experiments, a positive value means the task performs better than with the original biased embedding. The results are shown in Table 13. Overall, ISR shows stable and comparable performance across the three tasks, which signifies that semantic downstream utility is preserved under ISR.

Table 13: Downstream tasks of POS Tagging, POS Chunking, and Named Entity Recognition. A positive value means the task performs better than with the original biased embedding, and ∆ represents the change before and after debiasing.

Embedding Matrix Replacement
        POS Tagging                 POS Chunking                Named Entity Recognition
        ∆F1      ∆Prec    ∆Recall   ∆F1      ∆Prec    ∆Recall   ∆F1      ∆Prec    ∆Recall
LP       0.0009  -0.0004  -0.0025   -0.0007  -0.0011  -0.0016   -0.0004  -0.0002  -0.0013
HD      -0.0009   0.0000  -0.0029   -0.0009  -0.0010  -0.0022   -0.0005   0.0000  -0.0015
INLP     0.0001  -0.0005   0.0006    0.0003   0.0004   0.0006    0.0001   0.0000   0.0003
ISR      0.0000   0.0000   0.0002    0.0000   0.0003  -0.0002    0.0001   0.0000   0.0003

Model Retraining

        POS Tagging                 POS Chunking                Named Entity Recognition
        ∆F1      ∆Prec    ∆Recall   ∆F1      ∆Prec    ∆Recall   ∆F1      ∆Prec    ∆Recall
LP       0.0027  -0.0052   0.0111    0.0002   0.0033  -0.0020   -0.0012  -0.0056   0.0006
HD      -0.0052  -0.0127  -0.0086    0.0000  -0.0102   0.0075   -0.0007  -0.0057   0.0024
INLP     0.0030   0.0020   0.0079    0.0046  -0.0359   0.0439   -0.0014  -0.0123   0.0062
ISR      0.0003  -0.0043   0.0033    0.0017  -0.0102   0.0142   -0.0004  -0.0049   0.0032

C.2 INTRINSIC DOWNSTREAM TASKS

To confirm the effectiveness of our method beyond intrinsic measures like WEAT, SEAT, and SWEAT, we also ran several other intrinsic evaluations, namely Bias-by-Projection (Ding et al., 2022), the SemBias Analogy Task (Zhao et al., 2018), and a Word Similarity Task. We also compared the performance of our proposed debiasing method to P-DeSIP (removing potential proxy bias) and U-DeSIP (removing unresolved bias) from Ding et al. (2022); both are restricted to debiasing based on gendered terms.

WEAT*: Information Retained. This is an intrinsic information-preservation metric proposed by Dev et al. (2021a). WEAT* is an extension of WEAT (Caliskan et al., 2017) that directly measures the gendered information (male vs. female associations) retained after debiasing a word embedding. Here the two target sets of gendered words (X: {man, male, boy, brother, him, his, son} and Y: {woman, female, girl, sister, her, hers, daughter}) are kept constant. The main modification to WEAT is that instead of the attribute word sets (A and B) being stereotypical (e.g., A: male-biased occupations and B: female-biased ones), A and B are definitionally gendered (A male and B female), so we want the score s(X, Y, A, B) (see Appendix E) to be large. Following Dev et al. (2021a), we use A, B as he-she in WEAT*(1), as definitionally gendered words (e.g., father, actor and mother, actress) in WEAT*(2), and as gendered names (e.g., james, ryan and emma, sophia) in WEAT*(3). A higher score indicates that more meaningful gendered (male vs. female) information is preserved. All four baseline debiasing techniques (LP, HD, INLP, OSCaR) retain the least gendered information; they destroy the meaningful gendered (male vs. female) information in the word embedding. ISR, in contrast, outperforms all four debiasing techniques on WEAT* and even improves over the Baseline (GloVe without debiasing). ISR therefore inherently retains key information while improving bias removal.
Bias-by-Projection Task. We compute the dot product between the gender direction, the vector difference he − she, and each of the top 50,000 most frequent words. The absolute values of these dot products are then averaged to get the Bias-by-Projection score. After debiasing the word embedding, a Bias-by-Projection score closer to 0 indicates that we have effectively removed all evidence of gender bias. Hard debiasing achieves the lowest Bias-by-Projection score of 0.0002, which is not surprising since it projects the word embedding away from the gender direction and takes the most aggressive approach to removing all gender information from the embedding representation. See Table 15. Note that ISR has nearly the largest score on this task. We do not view this as a negative, since it means ISR retains meaningful associations with concepts (e.g., grandma with she, and grandpa with he) that may be useful for natural language understanding tasks such as document summarization, question answering, information extraction, and coreference resolution.

SemBias Analogy Task. This task aims to find the word pair that is the best analogy to the pair (he, she), choosing among four options: a gender-specific word pair, e.g., (waiter, waitress); a gender-stereotype word pair, e.g., (doctor, nurse); and two highly similar, bias-free word pairs, e.g., (dog, cat) (Zhao et al., 2018). The dataset used for the SemBias Analogy Task contains 440 instances; 40 instances, denoted SemBias*, are not used during training. Other than P-DeSIP and U-DeSIP, which were designed for this task (Ding et al., 2022), ISR achieves the highest accuracy in identifying gender-specific word pairs (see the corresponding results table).

Word Similarity Tasks. As much as we are interested in removing bias and stereotypical associations from word embeddings, we want to ensure the semantic information within the word embeddings is preserved.
The word similarity task was conducted using the following English word similarity benchmarks: RG65 (Rubenstein & Goodenough, 1965), WordSim-353 (Finkelstein et al., 2001), Rarewords (Luong et al., 2013), MEN (Bruni et al., 2012), MTurk-287 (Radinsky et al., 2011), MTurk-771 (Halawi et al., 2012), SimLex-999 (Hill et al., 2015), and SimVerb-3500 (Gerz et al., 2016). We measure the semantic information preserved in the GloVe embedding and in all the debiased GloVe models. The Spearman rank correlation scores show that ISR and iOSCaR retain useful structure and semantic information from the original embeddings (see the corresponding results table).
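The Bias-by-Projection score described above has a direct implementation. A minimal sketch with toy 2-d vectors (hypothetical words, not actual GloVe embeddings):

```python
import numpy as np

def bias_by_projection(emb, vocab):
    # average absolute dot product between each word vector and the
    # normalized he-she gender direction; closer to 0 means less residual bias
    g = emb["he"] - emb["she"]
    g = g / np.linalg.norm(g)
    return float(np.mean([abs(emb[w] @ g) for w in vocab]))

emb = {
    "he":     np.array([1.0, 0.0]),
    "she":    np.array([-1.0, 0.0]),
    "doctor": np.array([0.5, 1.0]),   # projects 0.5 onto the gender axis
    "cat":    np.array([0.0, 1.0]),   # orthogonal to it
}
print(bias_by_projection(emb, ["doctor", "cat"]))  # 0.25
```

In the actual evaluation, `vocab` would be the top 50,000 most frequent words of the embedding.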

D ADDITIONAL EXPERIMENT WITH THREE CONCEPTS

We also perform an experiment rectifying three subspaces with ISR, using different concepts. In this case we consider definitionally gendered male/female terms (GT: A, B), pleasant/unpleasant terms (P/U: X, Y), and statistically gendered male/female names (GN: R, S), using our large word lists for these terms. This is potentially interesting because someone's name may not align with the statistically most likely gender association, and it may carry an unwanted, unpleasant connotation; one may therefore want to perform rectification with both the gender association and the unpleasant association, another intersectional issue. This experiment is also interesting because gendered terms and statistically gendered names generate subspaces with a large dot product (initially larger than 0.8). As we observe, ISR faces a greater challenge in both reducing this association and retaining the information, because of the overlapping space these concepts occupy, but it still obtains near-orthogonal subspaces. We observe that v1 (gendered terms) and v2 (pleasant/unpleasant) have the smallest dot product, so we rectify these first within each iteration. Table 17 shows the WEAT score of the ISR mechanism over iterations 1 through 10, measured on the full set. We observe that all pairwise WEAT scores decrease significantly (to about 0.02) after 10 iterations. Table 17 also shows the pairwise dot products throughout the 10 iterations. We observe that the pair with the largest initial dot product (Gendered Terms vs. Gendered Names) starts very high at 0.8237 and decreases to 0.0014 after 10 iterations, similar to the values achieved for the other pairs. Finally, in Table 18, we show the SWEAT scores for each concept pair throughout the process; this shows the information retained as a function of the Self-WEAT scores. We observe that while Pleasant/Unpleasant retains most of its SWEAT score, we see a noticeable decrease for gendered terms and statistically gendered names.
This is likely because they start with a dot product of 0.82: they are highly correlated, and some words overlap along the defined subspaces, so some non-trivial warping is necessary to orthogonalize the concepts. We also show the results after all 10 iterations for the example with Gendered Terms (M/F) (GT), Nationality-Associated Names (USA/Mexico) (NN), and Pleasant/Unpleasant terms (P/U). The WEAT scores and dot products are in Table 19, and the SWEAT scores are in Table 20.
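The bookkeeping that drives the iteration order, measuring absolute pairwise dot products between concept directions and rectifying the least-correlated pair first, can be sketched as follows (synthetic unit vectors standing in for the actual subspace directions):

```python
import numpy as np
from itertools import combinations

def concept_direction(cluster_a, cluster_b):
    # subspace direction between the means of two defining word clusters
    d = np.mean(cluster_a, axis=0) - np.mean(cluster_b, axis=0)
    return d / np.linalg.norm(d)

def rectification_order(directions):
    # absolute pairwise dot products, sorted ascending: the pair with
    # the smallest |dot| (least correlated) is rectified first
    dots = {(i, j): abs(float(directions[i] @ directions[j]))
            for i, j in combinations(range(len(directions)), 2)}
    return sorted(dots.items(), key=lambda kv: kv[1])

# three synthetic concept directions; v3 is correlated with both others
v1 = np.array([1.0, 0.0, 0.0])                  # e.g., gendered terms
v2 = np.array([0.0, 1.0, 0.0])                  # e.g., pleasant/unpleasant
v3 = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)   # e.g., gendered names
order = rectification_order([v1, v2, v3])
print(order[0])  # ((0, 1), 0.0) -- the v1/v2 pair is handled first
```

This only reproduces the pair-selection logic; the rectification transform itself is the ISR step described in the main paper.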

E DEFINITION OF WEAT

The Word Embedding Association Test (WEAT) (Caliskan et al., 2017) is the default measurement of association among paired concepts from word lists, via their embedding. It takes two target word lists X and Y (e.g., male and female terms) and two attribute word lists A and B (e.g., pleasant and unpleasant words). For each word w ∈ X ∪ Y we compute the association of w with the sets A, B as

    s(w, A, B) = (1/|A|) Σ_{a∈A} cos(a, w) − (1/|B|) Σ_{b∈B} cos(b, w).

Then we average this value s(w, A, B) over all words in the X set and the Y set to get

    s(X, Y, A, B) = (1/|X|) Σ_{x∈X} s(x, A, B) − (1/|Y|) Σ_{y∈Y} s(y, A, B).
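A direct numpy implementation of this definition, with the standard normalization by the standard deviation of s(w, A, B) over all target words, is sketched below on toy 2-d vectors:

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def s_word(w, A, B):
    # association of word vector w with attribute sets A and B
    return np.mean([cos(a, w) for a in A]) - np.mean([cos(b, w) for b in B])

def weat_effect_size(X, Y, A, B):
    # difference of mean target associations, normalized by the
    # standard deviation over all words in X and Y
    sx = [s_word(x, A, B) for x in X]
    sy = [s_word(y, A, B) for y in Y]
    return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy)

# toy setup: X aligns with attribute A, Y with attribute B
A = [np.array([1.0, 0.0])]
B = [np.array([0.0, 1.0])]
X = [np.array([1.0, 0.1]), np.array([1.0, -0.1])]
Y = [np.array([0.1, 1.0]), np.array([-0.1, 1.0])]
print(round(weat_effect_size(X, Y, A, B), 2))  # large positive effect, ~1.99
```

With uncorrelated (debiased) targets and attributes, the same function returns values near 0.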

F WORD LISTS USED IN EXPERIMENTS

For many experiments we begin with small bespoke word lists used to define concepts, extended from Caliskan et al. (2017).
• Definitionally Gendered Terms [Gen(M/F)]:
 - Male Terms: male, man, boy, brother, he, him, his, son
 - Female Terms: female, woman, girl, sister, she, her, hers, daughter
• Pleasant and Unpleasant Words [Please/Un]:
 - Pleasant: caress, freedom, health, love, peace, cheer, friend, heaven, loyal, pleasure, diamond, gentle, honest, lucky, rainbow, diploma, gift, honor, miracle, sunrise, family, happy, laughter, paradise, vacation
 - Unpleasant: abuse, crash, filth, murder, sickness, accident, death, grief, poison, stink, assault, disaster, hatred, pollute, tragedy, bomb, divorce, jail, poverty, ugly, cancer, evil, kill, rotten, vomit
• Career and Family Words [Career/Family]
• Flowers and Insects [Flower/Insect]:
 - Flower: aster, clover, hyacinth, marigold, poppy, azalea, crocus, iris, orchid, rose, daffodil, lilac, pansy, tulip, buttercup, daisy, lily, peony, violet, carnation, magnolia, petunia, zinnia
 - Insect: ant, caterpillar, flea, locust, spider, bedbug, centipede, fly, maggot, tarantula, bee, cockroach, gnat, mosquito, termite, beetle, cricket, hornet, moth, wasp, dragonfly, roach, weevil
• Musical Instruments and Weapons [Music/Weap]:
 - Musical Instruments: bagpipe, cello, guitar, lute, trombone, banjo, clarinet, harmonica, mandolin, trumpet, bassoon, drum, harp, oboe, tuba, bell, fiddle, harpsichord, piano, viola, bongo, flute, horn, saxophone, violin
 - Weapons: arrow, club, gun, missile, spear, axe, dagger, harpoon, pistol, sword, blade, dynamite, hatchet, rifle, tank, bomb, firearm, knife, shotgun, teargas, cannon, grenade, mace, slingshot, whip

F.1 LARGER WORD LISTS FROM LIWC

For performing test/train splits, it is often necessary to arrange for larger word lists, so that each half of a split has sufficient words to define a concept. For this, we identify some very large word lists (sometimes with hundreds of words) from LIWC (Pennebaker et al., 2001).
These word lists initially contained many words with wildcard symbols (*) to represent the many ways a word can end (e.g., -er, -ed, -est, -es). For each such word, we select all possible matching words in the larger word lists of the embedding. However, these sets are still quite noisy, and some of the words are only tangentially related to the concept or are very rare, so their embedding representations are not reliable; including them ultimately does not improve the estimation of the concept in the word embedding. We found it was better to select a careful and central subset of these larger word lists. To do this, we start with the mean of the associated smaller word list (from those above; a touchstone word when a bespoke word list is unavailable) and select the 100 closest words to that mean (including ones in the smaller bespoke list). The resulting word lists are presented next:
• Definitionally Gendered Terms [Gen(M/F)]:
 - Male (100 Words): father, son, brother, man, his, him, he, boy, himself, husband, uncle, grandfather, nephew, grandson, sons, guy, men, dad, boys, male, sir, king, brothers, boyfriend, prince, stepfather, fellow, guys, businessman, gentleman, earl, mr, grandparents, duke, paternal, monk, fathers, knight, buddy, daddy, stepson, nephews, congressman, uncles, bull, fathered, husbands, chairman, fiance, masculine, patriarch, colt, salesman, godfather, cowboy, grandsons, bachelor, macho, spokesman, schoolboy, kings, males, gentlemen, boyhood, monastery, statesman, grandpa, lad, countrymen, papa, boyish, fraternity, princes, cowboys, penis, dude, baritone, monks, knighted, knights, lions, bulls, prostate, businessmen, strongman, mister, czar, roh, deer, manly, gonzales, dukes, stud, manhood, brethren, paternity
 - Female (100 Words): her, mother, girl, she, daughter, wife, sister, herself, grandmother, girlfriend, daughters, aunt, mom, female, niece, lady, girls, women, actress, hers, sisters, granddaughter, boyfriend, princess, mistress, queen, heroine, bride, mothers, maid,
waitress, jane, housewife, wives, nun, actresses, feminine, fiancee, ladies, stepmother, stepdaughter, diva, fiance, lesbian, goddess, feminist, duchess, countess, husbands, mrs, maternal, madame, womb, mama, schoolgirl, madam, grandma, businesswoman, hostess, socialite, heiress, maiden, ballerina, witch, mommy, mum, godmother, congresswoman, motherhood, spokeswoman, moms, aunts, queens, nieces, tomboy, feminism, females, uterus, granddaughters, matron, boyfriends, maternity, femininity, heroines, divorcee, princesses, mimi, sorority, landlady, dame, matriarch, dowry, chairwoman, lesbians, girlish, grandmothers, vagina
• Pleasant and Unpleasant Words [Please/Un]:
 - Pleasant (100 Words): good, pretty, kind, honest, well, beautiful, surprisingly, generous, nice, certainly, wonderful, better, decent, handsome, sure, strong, happy, easy, rich, truly, lovely, excellent, like, charming, intelligent, loving, warm, thoughtful, gentle, polite, fun, perfect, enjoy, smart, healthy, funny, proud, thanks, interesting, great, giving, bright, best, love, wonderfully, definitely, confident, amazingly, terrific, comfortable, passionate, energetic, true, cool, liked, helpful, brilliant, perfectly, lively, importantly, fine, elegant, talented, fair, important, appreciate, exciting, enthusiastic, clever, cheerful, welcome, promising, opportunity, respect, respectful, wise, pleasant, hope, promise, gracious, entertaining, likes, brave, wealthy, enjoyed, sincere, enjoying, impressed, pleased, impressive, surely, gorgeous, impression, sweet, pleasing, useful, eager, promises, caring
 - Unpleasant (100 Words): stupid, ugly, weak, worse, poor, cruel, arrogant, awful, nasty, terribly, unfair, rude, pathetic, lousy, ineffective, foolish, ignorant, dangerous, miserable, wrong, terrible, disgusting, unfortunately, unfortunate, difficult, unattractive, horrible, abusive, cynical, incompetent, timid, greedy, shockingly, unpleasant, annoying, lazy, inadequate, disappointing, selfish, frustrating, vicious, depressing,
brutal, dumb, scared, scary, ridiculous, shameful, pitiful, sad, aggressive, outrageous, desperate, boring, sorry, afraid, harsh, vulnerable, crazy, immoral, worried, confusing, obnoxious, problematic, unhappy, grossly, complain, dreadful, embarrassing, frightening, insecure, hurt, useless, uncomfortable, awkward, confused, dangerously, painful, appalling, careless, discouraging, risky, hurting, heartless, frustrated, deceptive, ineffectual, demeaning, horribly, angry, sick, depressed, messy, worrying, wicked, ridiculously, unacceptable, suffer
• Career and Family Words [Career/Family]:
 - Career (100 Words): investigator, ceo, institute, counsel, biologist, diplomat, secretary, working, department, manager, editor, lawrence, law, university, teacher, lawyers, doctor, interview, analyst, managing, producer, lecturer, research, succeeded, company, finance, lawmaker, industrialist, consulting, dean, congressman, studied, staff, leader, graduate, publisher, economics, legal, colleagues, associates, financier, worker, administration, political, job, written, developer, government, employee, librarian, committee, work, boss, succeed, graduated, reporter, agency, trader, works, business, client, directors, programmer, student, bank, supervisor, leading, mentor
 - Family (100 Words): family, mother, daughters, relatives, daughter, wife, grandparents, families, husband, marriage, wedding, siblings, grandmother, father, married, mothers, marry, wives, sister, sons, son, divorced, aunt, husbands, grandchildren, cousins, cousin, baby, mom, pregnant, sisters, spouses, brother, niece, spouse, divorce, marriages, fathers, babies, dad, granddaughter, uncle, grandfather, brothers, widowed, widow, honeymoon, aunts, maternal, fiancee, weddings, parent, fiance, maternity, stepfather, nephews, uncles, nephew, paternal, nieces, grandchild, pregnancy, grandson, parental, stepmother, moms, widows, sibling, folks, grandmothers, granddaughters, divorcing, dads, paternity, parenting, grandma, nanny, widower, marries, stepdaughter, motherhood, stepchildren, fathered, grandsons,
pregnancies, divorces, grandparent, kin, nannies, daddy, grandkids, mama, mommy, mum
 - easiness, uncertainty, reluctant, shaken, paranoia, impatient, avoid, overwhelmed, ashamed, paranoid, doubt, insecurity, irritated, scare, tension, feared, risk, threatening, scary, uncertain, tense, desperation, phobia, obsessed, shaking, apprehension, unsettling, turmoil, awkward, startled, stress, unsettled, irrational, distressed, desperately, confusing, risks, embarrassment, shame, vulnerability, suspicious, neurotic, timid, restless, aversion, terrifying, irritable, threat, irritation, risked, scares, threats, frighten, alarming, disturbing, irritating, obsessive, horrible, alarms
• Statistically American/Mexican Names [Name(M/F)]:
 - American Names (100 Words): david, michael, john, chris, alex, daniel, james, mike, robert, kevin, mark, brian, anthony, jason, joe, eric, andrew, ryan, paul, richard, william, victor, jonathan, matt, joseph, tony, steve, justin, brandon, jeff, matthew, scott, nick, christopher, steven, andrea, josh, jay, sam, adam, thomas, jim, joshua, tim, tom, frank, george, aaron, dan, martin, mary, jennifer, jessica, michelle, lisa, sarah, ana, elizabeth, laura, ashley, linda, karen, stephanie, sandra, melissa, amanda, nancy, patricia, emily, nicole, amy, carmen, susan, rosa, angela, diana, rachel, martha, kelly, anna, brenda, sara, julie, kim, barbara, katie, monica, claudia, lauren, gloria, veronica, kathy, heather, samantha, teresa, cindy, kimberly, sharon, christina
 - Mexican Names (100 Words): jose, juan, luis, carlos, jesus, jorge, alejandro, miguel, angel, manuel, eduardo, fernando, francisco, antonio, javier, ricardo, oscar, pedro, roberto, alberto, mario, sergio, gerardo, arturo, cesar, armando, omar, diego, alfredo, edgar, raul, enrique, hector, ivan, rafael, julio, gabriel, adrian, pablo, gustavo, andres, josé, jaime, marco, hugo, guillermo, alexis, alan, erick, cristian, maria, guadalupe, lupita, alejandra, karla, adriana, isabel, fernanda, silvia, gabriela,
mariana, mari, daniela, erika, paola, margarita, karina, alicia, alma, norma, leticia, angelica, blanca, rosario, rocio, gaby, carolina, dulce, lorena, valeria, cristina, ale, miriam, yolanda, mayra, araceli, marisol, esmeralda, irma, luz, paty, sofia, elena, rosy, maribel, cecilia, alondra, juana, tere, liliana

We note that the OSCaR paper (Dev et al., 2021a) also makes an effort to keep the test and training words disjoint. However, they take a slightly different approach, using a similarly large evaluation set but a bespoke (carefully chosen by hand) training set. For instance, for the Male/Female gender direction, they primarily use just the words "he" and "she". And for occupations, they define a subspace using the top principal component of a small word list (scientist, doctor, nurse, secretary, maid, dancer, cleaner, advocate, player, banker). Their splits are not random.
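The wildcard-expansion step described at the start of Appendix F.1 can be sketched with Python's standard fnmatch module (hypothetical patterns and vocabulary; the real lists come from LIWC and the embedding vocabulary):

```python
import fnmatch

def expand_wildcards(patterns, vocabulary):
    # LIWC-style entries ending in '*' match any completion present in
    # the embedding vocabulary; plain entries are kept if in-vocabulary
    expanded = set()
    for p in patterns:
        if "*" in p:
            expanded.update(fnmatch.filter(vocabulary, p))
        elif p in vocabulary:
            expanded.add(p)
    return sorted(expanded)

vocab = ["happy", "happier", "happiness", "sad", "sadly", "joy"]
print(expand_wildcards(["happi*", "sad*", "joy"], vocab))
# ['happier', 'happiness', 'joy', 'sad', 'sadly']
```

As noted above, this raw expansion is noisy, which is why the pipeline then keeps only the 100 words closest to the mean of the small bespoke list.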

G DOT PRODUCT SCORES

We also show that the dot product score converges to 0 for ISR, but not for iOSCaR, on the other concept pairs we experimented with. These results appear in Tables 21, 22, 23, 24, 25, and 26.

We additionally performed an ablation over the size of the word lists (Figure 1), covering the WEAT score on the standard Gendered Terms vs. Pleasant/Unpleasant large word lists at various sizes. To select the top 120 words, we ordered all words from the larger list by their distance from the words in the associated small list from the original IAT on which WEAT is based. We consider subsets of words from size 20 to 120. We observe that the WEAT score after applying ISR decreases as the word list grows (from about 0.14 at k = 20 to about 0.01 at k = 60) and is fairly stable up to k = 120; the minimum occurs around k = 80. In contrast, the WEAT score before applying ISR mostly increases until about k = 90. So our choice of k = 100 provides about the best discriminatory power but is not very sensitive in the range k = 80 to k = 120.



The word similarity benchmarks used are WordSim-353 (Finkelstein et al., 2001), Rarewords (Luong et al., 2013), MEN (Bruni et al., 2012), MTurk-287 (Radinsky et al., 2011), MTurk-771 (Halawi et al., 2012), and SimLex-999 (Hill et al., 2015).

s(X, Y, A, B) is then normalized by the standard deviation of s(w, A, B) over all w ∈ X ∪ Y to get the WEAT score. The WEAT score typically lies in the range [-1, 1], and a value closer to 0 indicates less biased association.
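The normalization just described can be written out explicitly. Below is a minimal numpy sketch of the standard WEAT effect size, with a toy 2-d embedding standing in for actual trained vectors (the word sets and values are hypothetical, chosen only to illustrate the computation).

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def weat(emb, X, Y, A, B):
    """WEAT effect size: the differential association of target sets
    X, Y with attribute sets A, B, normalized by the standard
    deviation of s(w, A, B) over all w in X union Y."""
    def s(w):
        a = np.mean([cos(emb[w], emb[x]) for x in A])
        b = np.mean([cos(emb[w], emb[x]) for x in B])
        return a - b

    sX = [s(x) for x in X]
    sY = [s(y) for y in Y]
    return (np.mean(sX) - np.mean(sY)) / np.std(sX + sY, ddof=1)

# Toy embedding where X aligns with A and Y with B, so the
# score comes out strongly positive (a biased association).
emb = {
    "he": np.array([1.0, 0.0]), "man": np.array([0.9, 0.1]),
    "she": np.array([-1.0, 0.0]), "woman": np.array([-0.9, 0.1]),
    "career": np.array([1.0, 0.2]), "family": np.array([-1.0, 0.2]),
}
score = weat(emb, ["he", "man"], ["she", "woman"], ["career"], ["family"])
```

A debiased embedding would move `score` toward 0, which is what the tables in this appendix report before and after applying ISR and iOSCaR.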

Figure 1: We perform an ablation study to understand how sensitive ISR is to the size of the list of words used during training.

WEAT Score on Gender Terms vs Pleasant/Unpleasant.



WEAT Score (WEAT) and Dot Product (dotP) on Gender Terms vs Pleasant/Unpleasant per iteration. ISR converges to orthogonal subspaces (dotP = 0); iOSCaR does not.
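The dotP column above measures the dot product between the two unit-normalized concept directions; 0 means the subspaces are fully orthogonal. ISR's iterative update is not reproduced here, but as a stand-in, a single Gram-Schmidt projection step illustrates what driving this dot product to 0 means (the two directions below are hypothetical).

```python
import numpy as np

def dot_of_unit_dirs(v1, v2):
    """Dot product between two unit-normalized concept directions;
    0 means the two concepts are uncorrelated (orthogonal)."""
    v1 = v1 / np.linalg.norm(v1)
    v2 = v2 / np.linalg.norm(v2)
    return float(v1 @ v2)

# Hypothetical concept directions (e.g., gender and pleasantness).
g = np.array([0.8, 0.6, 0.0])
p = np.array([0.6, 0.8, 0.1])

before = dot_of_unit_dirs(g, p)
# One Gram-Schmidt step as an illustrative orthogonalization --
# a stand-in for ISR's iterative update, which is not shown here.
p_orth = p - (p @ g) / (g @ g) * g
after = dot_of_unit_dirs(g, p_orth)
```

Here `before` is large (the toy directions are strongly correlated) while `after` is 0 up to floating-point error, matching the dotP = 0 end state the table reports for ISR.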

WEAT Score on Pairs of Concepts -using Bespoke Word Lists.

WEAT Score on Large Lists and Test/Train Split.

WEAT Score on Large Lists and No Test/Train Split.

SEAT test result (effect size) of gender-debiased BERT and RoBERTa models. An effect size closer to 0 indicates less (biased) association.

SWEAT Score on Large Lists: Measuring Information Preserved.



SEAT test specifications (see the original work Caliskan et al. (2017) and Appendix F).

In Table 10, we report the published effect size of SEAT for the baseline debiasing models from Meade et al. (2022) and for our proposed debiasing methods, iOSCaR and ISR. The results show that our proposed new debiasing method, ISR, can effectively mitigate gender bias in BERT, ALBERT, RoBERTa, and GPT-2, as measured by the SEAT effect size. The original average absolute effect sizes for BERT, ALBERT, and RoBERTa without debiasing are 0.620, 0.623, and 0.940, respectively.

SEAT test result (effect size) of gender-debiased BERT, ALBERT, RoBERTa, and GPT-2 models. An effect size closer to 0 indicates no (biased) association.

Results on the NLI Task for Bias Attenuation. Bias is measured as deviation from the neutrality label with the metrics Net Neutral (N. Neu), Fraction Neutral (F. Neu), Dev F1, and Test F1. A higher neutrality score indicates lower bias. *: results reported from the original OSCaR paper (Dev et al., 2021a).

Results on the Gendered Information Preserved under an NLI Task. The degree of information retention is measured with the SIRT (sentence inference retention test) metrics: N. Ent, F. Ent, N. Con, and F. Con. Higher scores indicate more gendered information is retained. OSCaR and HD perform similarly to the Baseline, except for INLP, where we see a drop in all four scores (N. Ent, F. Ent, N. Con, and F. Con).

Under the Embedding Matrix Replacement experiment, ISR and INLP outperform all the other debiasing techniques across all the evaluation tasks and metrics; they show no decrease in performance except precision in POS tagging for INLP and recall in POS chunking for ISR.

Results on WEAT*, a metric to measure how much correctly gendered information is retained after debiasing an embedding. *: results reported from the original OSCaR paper (Dev et al., 2021a). A higher score indicates more meaningful gendered (male vs. female) information is preserved.

Average absolute projection bias of the top 50,000 most frequent words. Columns: Bias-by-proj, SemBias, SemBias*.

Word Similarity Scores (Spearman rank correlation coefficient).
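These word similarity scores compare model cosine similarities of word pairs against human ratings via Spearman's rank correlation. A minimal numpy sketch of the statistic (the ratings below are hypothetical; `scipy.stats.spearmanr` would serve equally well):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation: the Pearson correlation of the
    ranks of x and y. Assumes no ties, for simplicity."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

# Hypothetical human ratings vs. model cosine similarities for
# four word pairs; perfect monotone agreement gives rho = 1.
human = [9.1, 7.4, 3.2, 1.0]
model = [0.82, 0.65, 0.30, 0.05]
rho = spearman_rho(human, model)
```

A debiasing method that preserves semantic information should leave these correlations close to those of the original embedding.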



SWEAT Scores after Debiasing

WEAT Scores and Dot Products after Debiasing. Columns: Iteration; WEAT (GT vs NN, GT vs P/U, NN vs P/U); dot product (GT vs NN, GT vs P/U, NN vs P/U).

-Career: executive, management, professional, corporation, salary, office, business, career
-Family: home, parents, children, family, cousins, marriage, wedding, relatives
• Math, Science, and Arts Words [Sci/Art], [Math/Art]:
-Math: math, algebra, geometry, calculus, equations, computation, numbers, addition
-Science: science, technology, physics, chemistry, einstein, nasa, experiment, astronomy
-Arts: poetry, art, dance, literature, novel, symphony, drama, sculpture

parenthood
• Statistically Gendered Names [Name(M/F)]:
-Male (100 Words): kevin, john, paul, scott, chris, brian, ryan, anderson, michael, wilson, terry, walker, larry, keith, davis, gary, james, joe, eric, allen, david, jason, bennett, sean, bruce, graham, thomas, peter, russell, jack, stephen, bryan, tony, robert, richard, steven, jerry, frank, patrick, martin, mark, ian, anthony, andy, clark, simon, jon, adam, taylor, jay, sullivan, andrew, brett, jonathan, lewis, reid, quinn, danny, parker, alan, matthew, dennis, mitchell, justin, jimmy, eddie, ellis, randy, riley, charlie, dean, shane, johnny, derek, elliott, george, neil, bradley, jeremy, francis, curtis, casey, nelson, trevor, hayes, harrison, alex, aaron, kyle, jackson, darren, roy, jamie, hunter, fisher, roger, lawrence, blake, william, marshall
-Female (100 Words): sarah, lisa, amy, kate, jennifer, linda, laura, mary, elizabeth, anne, jane, katherine, julie, maggie, helen, rebecca, jessica, emily, lauren, margaret, lucy, caroline, rachel, michelle, emma, katie, diana, marie, louise, barbara, anna, martha, catherine, ellen, melissa, alice, kathleen, sara, claire, christine, julia, patricia, stephanie, leslie, karen, cynthia, frances, hannah, natalie, dorothy, vanessa, amanda, jacqueline, nancy, elaine, samantha, sophie, annie, judith, nicole, kelly, christina, megan, joanna, ashley, naomi, molly, irene, maria, melanie, ruth, brenda, sylvia, carolyn, parker, holly, eliza, nina, deborah, gwen, marilyn, sandra, esther, veronica, fiona, edith, eleanor, alicia, erin, eileen, evelyn, alison, princess, kathryn, bridget, claudia
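For paired word lists such as these, a concept direction is commonly taken as the unit-normalized difference of the two group means. The sketch below illustrates that construction with toy vectors; the four names and their 2-d embeddings are hypothetical stand-ins for the full 100-word lists and trained vectors.

```python
import numpy as np

def concept_direction(emb, group_a, group_b):
    """Direction from the mean of group_b vectors toward the mean of
    group_a vectors, unit-normalized (difference-of-means sketch)."""
    mu_a = np.mean([emb[w] for w in group_a], axis=0)
    mu_b = np.mean([emb[w] for w in group_b], axis=0)
    d = mu_a - mu_b
    return d / np.linalg.norm(d)

# Toy vectors standing in for trained embeddings of the name lists.
emb = {
    "kevin": np.array([1.0, 0.2]), "john": np.array([0.8, 0.1]),
    "sarah": np.array([-0.9, 0.2]), "lisa": np.array([-1.0, 0.1]),
}
d = concept_direction(emb, ["kevin", "john"], ["sarah", "lisa"])
```

With these toy vectors the male and female means differ only in the first coordinate, so the resulting direction is the first axis.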

Dot Products Before and After Debiasing on Large Lists and Test/Train Split. Columns: Gen(M/F) & Please/Un; Gen(M/F) & Career/Family; Name(M/F) & Please/Un.

Dot Products Before and After Debiasing on Large Lists and Test/Train Split. Columns: Name(M/F) & Career/Family; Gen(M/F) & Name(M/F); Gen(M/F) & Achieve/Anx.

Dot Products Before and After Debiasing on Large Lists and Test/Train Split. Columns: Name(M/F) & Career/Family; Career/Family & Achieve/Anx.

Dot Products Before and After Debiasing on Large Lists and No Test/Train Split. Columns: Gen(M/F) & Please/Un; Gen(M/F) & Career/Family; Name(M/F) & Please/Un.

Dot Products Before and After Debiasing on Large Lists and No Test/Train Split. Columns: Name(M/F) & Career/Family; Gen(M/F) & Name(M/F); Gen(M/F) & Achieve/Anx.

Dot Products Before and After Debiasing on Large Lists and No Test/Train Split. Columns: Career/Family & Please/Un; Career/Family & Achieve/Anx.

Sample Standard Deviation Score on Large Lists and Test/Train Split.

ACKNOWLEDGMENTS

We gratefully acknowledge support from NSF grants IIS-1816149 and CCF-2115677, and from Visa Research.

