DETECTING AND MITIGATING INDIRECT STEREOTYPES IN WORD EMBEDDINGS

Abstract

Societal biases in the usage of words, including harmful stereotypes, are frequently learned by common word embedding methods. These biases manifest not only between a word and an explicit marker of its stereotype, but also between words that share related stereotypes. This latter phenomenon, sometimes called "indirect bias," has resisted prior attempts at debiasing. In this paper, we propose a novel method to mitigate indirect bias in distributional word embeddings by modifying biased relationships between words before embeddings are learned. This is done by considering how the co-occurrence probability of a given pair of words changes in the presence of words marking an attribute of bias, and using this to average out the effect of a bias attribute. To evaluate this method, we perform a series of common tests and demonstrate that measures of bias in the word embeddings are reduced in exchange for some reduction in the semantic quality of the embeddings. In addition, we conduct novel tests for measuring indirect stereotypes by extending the Word Embedding Association Test (WEAT) with new test sets for indirect binary gender stereotypes. With these tests, we demonstrate the presence of more subtle stereotypes not addressed by previous work. The proposed method is able to reduce the presence of some of these new stereotypes, serving as a crucial next step towards non-stereotyped word embeddings.
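For reference, the WEAT measure mentioned above quantifies the differential association of two sets of target words with two sets of attribute words via an effect size over cosine similarities. The following is a minimal sketch of the standard WEAT effect size from Caliskan et al. (2017), not the authors' extended test sets; all variable names are illustrative.

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two embedding vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def assoc(w, A, B):
    """s(w, A, B): mean similarity of w to attribute set A minus to set B."""
    return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """WEAT effect size for target sets X, Y and attribute sets A, B."""
    sX = [assoc(x, A, B) for x in X]
    sY = [assoc(y, A, B) for y in Y]
    return (np.mean(sX) - np.mean(sY)) / np.std(sX + sY, ddof=1)

# Toy example: X aligns with attribute A, Y with attribute B,
# so the effect size should come out positive (a stereotyped association).
A = np.array([[1.0, 0.0]])
B = np.array([[0.0, 1.0]])
X = np.array([[1.0, 0.1]])
Y = np.array([[0.1, 1.0]])
d = weat_effect_size(X, Y, A, B)
```

A positive effect size indicates the targets in X are more associated with attribute A (and Y with B) than vice versa.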

1. INTRODUCTION

Distributional word embeddings, such as Word2Vec (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014), are computer representations of words as vectors in a semantic space. These embeddings are popular because the geometry of the vectors corresponds to semantic and syntactic structure (Mikolov et al., 2013b). Unfortunately, societal stereotypes, such as those pertaining to race, gender, national origin, or sexuality, are typically reflected in word embeddings (Bolukbasi et al., 2016; Caliskan et al., 2017; Garg et al., 2018; Papakyriakopoulos et al., 2020). These stereotypes are so pervasive that they have proved resistant to many existing debiasing techniques (Gonen & Goldberg, 2019).

Techniques attempting to remove or mitigate bias in word vectors are common in the literature. The typical case study for bias mitigation methods is binary[1] gender. Subspace methods, such as hard debiasing from Bolukbasi et al. (2016) and GN-GloVe from Zhao et al. (2018b), attempt to identify or create a vector subspace of gender-related information (typically a "gender direction") and drop this subspace. Counterfactual Data Substitution from Maudslay et al. (2019), based on Counterfactual Data Augmentation from Lu et al. (2020), swaps explicitly gendered words to counter stereotyped associations. James & Alvarez-Melis (2019) and Qian et al. (2019) both propose methods to reduce bias towards binary gender by encouraging the learned conditional probabilities of words appearing with "he" and with "she" to be equal. Gonen & Goldberg (2019) showed that common "debiasing" methods failed to meaningfully reduce bias in word embeddings. They describe how bias can manifest not only as undesirable association

[1] We use the phrase "binary gender" to refer to the common yet unrealistic simplification of gender as just "male" or "female", which we take to be the main source of bias studied in this work. This is a limitation of this work.
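The core step shared by the subspace methods discussed above can be sketched as follows. This is a minimal illustration of projecting out a bias direction, under the assumption that the direction is given (e.g. derived from word pairs like "he"/"she"); it is not the authors' method or any specific paper's implementation, and all names are illustrative.

```python
import numpy as np

def drop_subspace(vectors, direction):
    """Remove the component of each embedding along a one-dimensional
    bias subspace (e.g. a "gender direction"), the core operation of
    subspace debiasing methods.

    vectors:   (n, d) array of word embeddings.
    direction: (d,) vector spanning the bias subspace.
    """
    direction = direction / np.linalg.norm(direction)
    # Project each row onto the direction and subtract that component.
    return vectors - np.outer(vectors @ direction, direction)

# Toy example with a 3-D embedding and an axis-aligned "bias direction".
g = np.array([1.0, 0.0, 0.0])
v = np.array([[0.5, 0.2, 0.1]])
debiased = drop_subspace(v, g)
# The debiased vector has zero component along g; the other
# coordinates are untouched.
```

In practice the direction is estimated from the embeddings themselves (for instance, from the difference of definitionally gendered word pairs), and higher-dimensional bias subspaces are handled by projecting out several directions.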

