DETECTING AND MITIGATING INDIRECT STEREOTYPES IN WORD EMBEDDINGS

Abstract

Societal biases in the usage of words, including harmful stereotypes, are frequently learned by common word embedding methods. These biases manifest not only between a word and an explicit marker of its stereotype, but also between words that share related stereotypes. This latter phenomenon, sometimes called "indirect bias," has resisted prior attempts at debiasing. In this paper, we propose a novel method to mitigate indirect bias in distributional word embeddings by modifying biased relationships between words before embeddings are learned. This is done by considering how the co-occurrence probability of a given pair of words changes in the presence of words marking an attribute of bias, and using this to average out the effect of a bias attribute. To evaluate this method, we perform a series of common tests and demonstrate that measures of bias in the word embeddings are reduced in exchange for some reduction in the semantic quality of the embeddings. In addition, we conduct novel tests for measuring indirect stereotypes by extending the Word Embedding Association Test (WEAT) with new test sets for indirect binary gender stereotypes. With these tests, we demonstrate the presence of more subtle stereotypes not addressed by previous work. The proposed method is able to reduce the presence of some of these new stereotypes, serving as a crucial next step towards non-stereotyped word embeddings.

1. INTRODUCTION

Distributional word embeddings, such as Word2Vec (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014), are computer representations of words as vectors in semantic space. These embeddings are popular because the geometry of the vectors corresponds to semantic and syntactic structure (Mikolov et al., 2013b). Unfortunately, societal stereotypes, such as those pertaining to race, gender, national origin, or sexuality, are typically reflected in word embeddings (Bolukbasi et al., 2016; Caliskan et al., 2017; Garg et al., 2018; Papakyriakopoulos et al., 2020). These stereotypes are so pervasive that they have proved resistant to many existing debiasing techniques (Gonen & Goldberg, 2019). Techniques attempting to remove or mitigate bias in word vectors are common in the literature, and the typical case study for such methods is binaryfoot_0 gender. Subspace methods, such as hard debiasing from Bolukbasi et al. (2016) and GN-GloVe from Zhao et al. (2018b), attempt to identify or create a vector subspace of gender-related information (typically a "gender direction") and drop this subspace. Counterfactual Data Substitution from Maudslay et al. (2019), based on Counterfactual Data Augmentation from Lu et al. (2020), swaps explicitly gendered words to counter stereotyped associations. James & Alvarez-Melis (2019) and Qian et al. (2019) both propose methods to reduce bias towards binary gender by encouraging the learned conditional probabilities of words appearing with "he" and with "she" to be equal. Gonen & Goldberg (2019) showed that common "debiasing" methods failed to meaningfully reduce bias in word embeddings. They describe how bias can manifest not only as undesirable association between stereotyped words and words marking a bias attributefoot_1, but also between stereotyped words themselves. These manifestations are sometimes called direct bias and indirect biasfoot_2, using the terminology introduced by Bolukbasi et al. (2016). An example of this second manifestation of bias is that the word "doctor" might be associated more strongly with stereotypically masculinefoot_3 words than with stereotypically feminine words. At the time, bias mitigation algorithms commonly attempted to address direct bias while leaving indirect bias mostly present.
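The notion of direct bias can be made concrete with a small numeric sketch. Below, a one-dimensional "gender direction" is built from two anchor words, in the spirit of the subspace methods above; the vectors are toy values invented purely for illustration, not taken from any real embedding or from the method proposed in this paper.

```python
import numpy as np

# Toy 4-dimensional embeddings (hypothetical values, for illustration only).
emb = {
    "he":     np.array([ 1.0, 0.2, 0.1, 0.0]),
    "she":    np.array([-1.0, 0.2, 0.1, 0.0]),
    "doctor": np.array([ 0.4, 0.8, 0.3, 0.1]),
    "nurse":  np.array([-0.5, 0.7, 0.2, 0.1]),
}

def normalize(v):
    return v / np.linalg.norm(v)

# A simple one-dimensional "gender direction", as in subspace methods:
# the normalized difference between two explicitly gendered anchor words.
g = normalize(emb["he"] - emb["she"])

def direct_bias(word):
    """Cosine of the word vector with the gender direction."""
    return float(np.dot(normalize(emb[word]), g))

print(direct_bias("doctor"))  # positive: leans toward "he"
print(direct_bias("nurse"))   # negative: leans toward "she"
```

Subspace methods remove the component along `g` from every neutral word; indirect bias is precisely what such a projection leaves behind, since "doctor" and stereotypically masculine words can remain close to each other even after their components along `g` are dropped.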
A common trend in the study of indirect bias is the departure from stereotypes as the object of study in favor of clustering. While the measures introduced by Bolukbasi et al. (2016) and Caliskan et al. (2017) attempt to quantify the existence of commonly understood stereotypes, work on indirect bias typically uses the measures introduced by Gonen & Goldberg (2019), which merely attempt to measure how well proposed bias mitigation methods disperse words with similar relationships to the bias attribute in the embedding space. These clustering measures, while useful at capturing some forms of indirect bias, are limited. In particular, it is unclear how dispersed stereotyped words should be in the embedding space, given that the stereotype of a word is not entirely arbitrary and can potentially be estimated based on its semantic, non-stereotypical, meaning. These new bias measures have inspired numerous new bias mitigation methods. Nearest neighbor bias mitigation from James & Alvarez-Melis (2019) attempts to equalize each word's association with its masculine neighbors and its feminine neighbors (as defined by the original, undebiased embeddings). Double hard debias from Wang et al. (2020) projects off the direction defined by the "most gender-biased words" (again, based on alignment with the original embedding's "gender direction") in addition to the standard gender-related subspace. Bordia & Bowman (2019) modify the loss function when learning word embeddings to penalize neutral words having large components in the gender-related subspace, which can then be dropped. Kumar et al. (2020) propose RAN-Debias, which attempts to disperse words in the embedding space that share similar binary gender biases (defined, again, by the original word embeddings) while preserving the original geometry as much as possible. Lauscher et al. (2020) describe multiple bias mitigation methods: the standard projection method; averaging original word vectors with an orthogonal transformation that attempts to swap the bias attribute; and a neural method that uses a loss function to group words exhibiting a bias attribute together, away from neutral words.
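A rough illustration of these clustering measures, in the spirit of the cluster test of Gonen & Goldberg (2019) but with synthetic vectors standing in for a real embedding's most-biased words: cluster the vectors into two groups and check how well the clusters recover the original bias labels. Near-perfect alignment indicates that bias information is still encoded in the geometry. All data here is randomly generated for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 50 "masculine-biased" and 50 "feminine-biased" word vectors
# (synthetic stand-ins for the top-biased words of a real embedding).
masc = rng.normal(loc=+1.0, scale=0.5, size=(50, 10))
fem  = rng.normal(loc=-1.0, scale=0.5, size=(50, 10))
X = np.vstack([masc, fem])
labels = np.array([0] * 50 + [1] * 50)  # bias labels from the original embedding

def two_means(X, iters=20):
    """Minimal 2-means clustering in plain NumPy (for illustration)."""
    centers = X[[0, -1]].copy()  # initialize with one point from each end
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        assign = d.argmin(axis=1)
        for k in range(2):
            centers[k] = X[assign == k].mean(axis=0)
    return assign

pred = two_means(X)
# Alignment accuracy, invariant to which cluster received which label.
acc = max((pred == labels).mean(), (pred != labels).mean())
print(f"cluster alignment accuracy: {acc:.2f}")
```

An accuracy near 0.5 would suggest the bias labels are no longer recoverable from the geometry; an accuracy near 1.0, as in this deliberately separable toy example, is the failure mode Gonen & Goldberg observed in "debiased" embeddings.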
These methods, similarly to the bias measures of Gonen & Goldberg (2019), focus on the clustering and dispersion of words in relation to the bias attribute. In current state-of-the-art models, word embeddings have largely been replaced by contextualized embeddings from transformer models such as BERT (Devlin et al., 2018) and GPT (Radford et al., 2018). However, word embeddings remain a popular object of study when quantifying bias in NLP, in part due to their simplicity and to theoretical results that make them easier to reason about. As advances in the understanding of bias and stereotypes in word embeddings have been adapted for these newer models (Liang et al., 2020; May et al., 2019), novel techniques to measure and mitigate bias in word embeddings remain relevant.

2.1. WORD EMBEDDINGS

Many word embedding algorithms use the empirical probability that two given words appear near each other in the corpus (Levy & Goldberg, 2014; Pennington et al., 2014). This empirical probability is computed by counting how many times one word appears in the context of another as a word-context pair. A word-context pair is defined as a pair of words from the corpus that appear within a certain fixed distance of each other, the window size, and within the same sentence. A word-context pair designates one word as appearing in the context of another; in this paper, we will refer to a word-context pair as (a, b), where a is the word appearing in the context of the word b.
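The definitions above can be sketched in a few lines of code. The toy corpus and the whitespace tokenizer below are assumptions made purely for illustration; the code enumerates word-context pairs within a fixed window inside each sentence and tallies their counts, from which empirical co-occurrence probabilities follow.

```python
from collections import Counter

def word_context_pairs(sentence, window_size=2):
    """Yield (a, b) word-context pairs: a and b are distinct tokens of the
    same sentence at most `window_size` positions apart (both orderings)."""
    tokens = sentence.lower().split()
    for i, a in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                yield (a, tokens[j])

# Tiny illustrative corpus (two "sentences").
corpus = ["the doctor examined the patient", "the nurse helped the doctor"]

counts = Counter()
for sent in corpus:
    counts.update(word_context_pairs(sent, window_size=2))

total = sum(counts.values())
# Empirical co-occurrence probability of a particular word-context pair:
p = counts[("doctor", "the")] / total
print(counts[("doctor", "the")], total, f"{p:.3f}")
```

Embedding methods such as GloVe operate on exactly this kind of co-occurrence count table, which is why modifying the counts (or the underlying pairs) before training is a natural place to intervene on biased relationships.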



foot_0: We use the phrase "binary gender" to refer to the common yet unrealistic simplification of gender as just "male" or "female", which we take to be the main source of bias studied in this work. This is a limitation of this work.
foot_1: We use the phrase "bias attribute" to refer to an attribute associated with stereotypes that are to be removed. The reader can assume that "bias attribute" in this work always refers to binary gender without loss of comprehension; we prefer "bias attribute" for generality.
foot_2: Or "explicit" and "implicit" bias. In this work we refer to these as "direct bias" and "indirect bias" for clarity, even though they are two different manifestations of the same phenomenon.
foot_3: Following the recommendations of Devinney et al. (2022), we prefer the terms "masculine" and "feminine" over "male" and "female" in this work.



