INTERPRETABLE DEBIASING OF VECTORIZED LANGUAGE REPRESENTATIONS WITH ITERATIVE ORTHOGONALIZATION

Abstract

We propose a new mechanism to augment a word vector embedding representation that offers improved bias removal while retaining the key information, resulting in improved interpretability of the representation. Rather than removing the information associated with a concept that may induce bias, our proposed method identifies two concept subspaces and makes them orthogonal, so that the two concepts become uncorrelated in the resulting representation. Moreover, because they are orthogonal, one can simply apply a rotation to the basis of the representation so that each subspace aligns with a coordinate axis. This explicit encoding of concepts as coordinates works because the subspaces have been made fully orthogonal, which previous approaches do not achieve. Furthermore, we show that this construction can be extended to multiple subspaces. As a result, one can choose a subset of concepts to be represented transparently and explicitly, while the others remain in the mixed but extremely expressive format of the distributed representation.

1. INTRODUCTION

Vectorized representations of structured data, especially of text as in Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and FastText (Joulin et al., 2016), have become an enormously powerful and useful method for facilitating language learning and understanding. While contextualized embeddings, e.g., ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), and RoBERTa (Liu et al., 2019), have become the standard for many natural language analysis pipelines, the non-contextualized versions retain an important role for low-resource languages, for synonym tasks, and for their interpretability. In particular, these versions have the intuitive representation that each word is mapped to a vector in a high-dimensional space, and the (cosine) similarity between words in this representation captures how similar the words are in meaning, as reflected by how similar the contexts are in which they are commonly used. Such vectorized representations are common among many other types of structured data, including images (Kiela & Bottou, 2014; Lowe, 2004), nodes in a social network (Grover & Leskovec, 2016; Perozzi et al., 2014), spatial regions of interest (Jenkins et al., 2019), merchants in a financial network (Wang et al., 2021), and many more. In all of these cases, the most effective representations are large, high-dimensional, and trained on a large amount of data. This can be an expensive endeavor, and the goal is often to complete this embedding task once and then use the representations as an intermediate step in many downstream tasks. In this paper, we consider the goal of adding or adjusting structure in existing embeddings as part of a light-weight representation augmentation. The aim is to accomplish this without expensive retraining of the embedding, while improving the representation's usefulness, meaningfulness, and interpretability.
Within language models, this has most commonly been considered in the context of bias removal (Bolukbasi et al., 2016; Dev & Phillips, 2019). Here one commonly identifies a linear subspace that encodes some concept (e.g., male-to-female gender) and may modify or remove that subspace when the concept it encodes is not appropriate for a downstream task (e.g., resume ranking). One recently proposed approach of interest, called Orthogonal Subspace Correction and Rectification (OSCaR) (Dev et al., 2021a), identifies two subspaces (e.g., male-female gender and occupations) and performs a continuous deformation of the embedding in the span of those subspaces with the goal of making them, and the concepts they represent, orthogonal. We build on this idea for our approach, iterative subspace rectification (ISR), but add some subtle yet significant modifications and insights:
• We modify how the deformation in the 2-concept subspace takes place. The underlying operation is based on a rotation, and our insight is how to choose the central point around which the data is rotated.
• We observe that OSCaR's output representations do not have orthogonal concepts. As such, it can be re-run iteratively, leading to our approach. Using our centered variant, we call this iterative method ISR. It converges so that the inherently represented subspaces are orthogonal; the uncentered OSCaR does not achieve this convergence.
• Next, we observe that when used for debiasing, ISR significantly improves the amount of debiasing compared to all previous methods; e.g., instead of about 50% improvement, ISR attains 95% improvement when measured on the standard WEAT test. When we similarly measure on larger word lists that we generate, the iterative methods we develop are the clear, consistent best performers.
With these larger lists, we perform a randomized train-test split experiment (which is rarely performed in this domain), and while the improvements are noisier and less dramatic, our methods are the overall best.
• Moreover, while other debiasing techniques (e.g., Hard Debiasing (Bolukbasi et al., 2016), INLP (Ravfogel et al., 2020)) are based on projections and hence destroy information about the concept for which bias is attenuated (e.g., gender), we show that ISR preserves the relevant information. We evaluate this with a new measure called the Self Word Embedding Association Test (SWEAT).
• Our methods can be extended to multiple-subspace debiasing, potentially addressing intersectional issues. The resulting representation creates multiple subspaces, all mutually orthogonal.
• Last but not least, the resulting representations are significantly more interpretable. After applying this orthogonalization to multiple subspaces, we can perform a basis rotation (which does not change any cosine similarities or Euclidean distances) that places each of these identified and orthogonalized concepts along a coordinate axis. That is, we maintain the power, flexibility, and compression of a distributed representation, while selected concepts recover the intuitive and simple coordinate representation of those features. Afterward, these coordinates can simply be ignored in a downstream task if they should not be involved in some aspect of training (e.g., gender for resume sorting), or retained, e.g., for co-reference resolution. We provide code at https://github.com/poaboagye/ISR-IterativeSubspaceRectification.

Model of Concepts. Dating back to the discovery of analogies (e.g., man:woman::king:queen) that revealed the linear structure of word embeddings, an intuitive notion of a concept has been a linear subspace. For instance, the male-female gender subspace as the vector from v_man to v_woman is consistent with the one from v_king to v_queen.
However, this parallel transport does not always generalize. Instead, in this paper, we follow a slight variant. We posit that a concept is mainly reflected by a set of words that all have high mutual similarity, and that it can simply be represented as the mean point of that set of words. For instance, definitionally male words (man, he, his, him, boy, etc.) would define one concept and definitionally female ones (woman, she, her, hers, girl, etc.) another. Then the male-female gender direction can be defined as the vector between these two means (Dev & Phillips, 2019). Note that this representation of gender is explicitly binary and does not attempt to (over-)generalize to other non-binary forms of gender, an important task with many challenges (Dev et al., 2021b). Note also that this perspective on how concepts are represented is aligned with the classic WEAT test (Caliskan et al., 2017), which considers the cross-correlation of 4 sets (e.g., male-female vs. math-art). But it diverges from methods that attempt to model broader concepts such as "nationality" or "occupations" as a single linear subspace without relying on two polar sets. This perspective is hence slightly less general, but we observe that it is more reliable.
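To make the mean-based model of concepts concrete, the sketch below builds two concept directions as vectors between the means of polar word sets, orthogonalizes one against the other, and then rotates the basis so both align with coordinate axes. This is a much-simplified illustration, not the paper's ISR algorithm: the word vectors are random placeholders rather than a real embedding, the occupation word lists are invented for the example, and a single Gram-Schmidt step stands in for the iterated, centered rotation described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder word vectors; in practice these would come from a
# pretrained embedding such as GloVe or Word2Vec.
male_words   = ["man", "he", "his", "him", "boy"]
female_words = ["woman", "she", "her", "hers", "girl"]
occ_a        = ["doctor", "engineer", "scientist"]      # hypothetical polar
occ_b        = ["nurse", "teacher", "librarian"]        # occupation sets
emb = {w: rng.normal(size=50)
       for w in male_words + female_words + occ_a + occ_b}

def concept_direction(set_a, set_b):
    """Unit vector between the means of two polar word sets
    (following the mean-based model of Dev & Phillips, 2019)."""
    mu_a = np.mean([emb[w] for w in set_a], axis=0)
    mu_b = np.mean([emb[w] for w in set_b], axis=0)
    d = mu_b - mu_a
    return d / np.linalg.norm(d)

v_gender = concept_direction(male_words, female_words)
v_occ    = concept_direction(occ_a, occ_b)

# Simplified orthogonalization: one projection step, where ISR would
# instead apply an iterated rotation about a chosen central point.
v_occ -= v_occ.dot(v_gender) * v_gender
v_occ /= np.linalg.norm(v_occ)

# Once the directions are orthogonal, an orthogonal change of basis R
# aligns them with the first two coordinate axes; being orthogonal, R
# preserves all cosine similarities and Euclidean distances.
dim = 50
M = np.column_stack([v_gender, v_occ, rng.normal(size=(dim, dim - 2))])
Q, Rq = np.linalg.qr(M)
Q = Q * np.sign(np.diag(Rq))   # fix QR sign ambiguity
R = Q.T

assert np.allclose(R @ v_gender, np.eye(dim)[0])  # gender -> axis 1
assert np.allclose(R @ v_occ,   np.eye(dim)[1])   # occupation -> axis 2
```

After this rotation, the gender and occupation concepts occupy the first two coordinates, which a downstream task could drop or keep explicitly.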

