ISOTROPY IN THE CONTEXTUAL EMBEDDING SPACE: CLUSTERS AND MANIFOLDS

Abstract

The geometric properties of contextual embedding spaces for deep language models such as BERT and ERNIE have attracted considerable attention in recent years. Investigations of contextual embeddings reveal a strongly anisotropic space in which most vectors fall within a narrow cone, leading to high cosine similarities. It is surprising that these LMs are as successful as they are, given how similar most of their embedding vectors are to one another. In this paper, we argue that isotropy does exist in the space, when viewed from a different but more constructive perspective. We identify isolated clusters and low-dimensional manifolds in the contextual embedding space, and introduce tools to analyze them both qualitatively and quantitatively. We hope this study provides insights toward a better understanding of deep language models.

1. INTRODUCTION

The polysemous English word "bank" has two common senses: 1. the money sense, a place where people save or borrow money; 2. the river sense, a slope of earth that prevents flooding. In modern usage, the two senses are very different from one another, though interestingly, both share a similar etymology (and both can be traced back to the same word in Proto-Germanic). In a static embedding, multiple instances of the same word (e.g. "bank") are represented by the same vector. By contrast, a contextual embedding assigns different vectors to different instances of the same word, depending on the context. Historically, static embedding models like Word2vec (Mikolov et al., 2013b) and GloVe (Pennington et al., 2014) predated contextual embedding models such as ELMo (Peters et al., 2018), GPT (Radford et al., 2018), BERT (Devlin et al., 2018) and ERNIE (Sun et al., 2019). Much of the literature on language modeling has recently moved to contextual embeddings, largely because of their superior performance on downstream tasks.

1.1. RELATED WORK

Static embeddings are often found to be easier to interpret. For example, the Word2Vec and GloVe papers discuss adding and subtracting vectors, such as: vec(king) - vec(man) + vec(woman) = vec(queen). Inspired by this relationship, researchers began to explore the geometric properties of static embedding spaces. For example, Mu & Viswanath (2018) proposed a counter-intuitive method that removes the top principal components (the dominating directions in the transformed embedding space), which surprisingly improved the word representations. Rather than completely discarding the principal components, Liu et al. (2019) proposed a technique called Conceptor Negation to softly suppress transformed dimensions with larger variances. Both approaches, simply removing certain principal components as well as Conceptor Negation, produce significant improvements over the vanilla embeddings obtained from static language models. In Huang et al. (2020), the authors studied how to effectively transform static word embeddings from one language to another. Unfortunately, strong illustrative representations like the king-queen example above are no longer obvious in a general contextual embedding space. Arguing that syntactic structure does exist in contextual embeddings, Hewitt & Manning (2019) proposed a structural probe to identify the syntax trees buried in the space, and found evidence of implicit syntax trees in BERT and ELMo. The advantage of contextual embeddings over their static counterparts mainly comes from their capability to assign different vectors to the same word, depending on the word's sense in context. Researchers in (Reif et al., 2019) found such a geometric representation of word senses in the BERT model. These papers reveal the existence of linguistic features embedded implicitly in contextual vector spaces.
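The post-processing idea of removing dominant directions can be sketched in a few lines. The following is a minimal illustration, not the authors' exact implementation: center the embedding matrix, find its top principal directions via SVD, and subtract each vector's projection onto them. The function name and the toy data are our own for illustration.

```python
import numpy as np

def remove_top_components(embeddings, d=2):
    """Sketch of the post-processing in Mu & Viswanath (2018):
    subtract the mean, then remove projections onto the top-d
    principal components of the centered embedding matrix."""
    mu = embeddings.mean(axis=0)
    centered = embeddings - mu
    # Right singular vectors of the centered matrix are the
    # principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:d]                      # shape (d, dim)
    # Subtract each vector's projection onto the top-d directions.
    return centered - centered @ top.T @ top

# Toy example: 100 random "embeddings" in 10 dimensions with a
# shared offset, mimicking the anisotropic narrow-cone effect.
rng = np.random.default_rng(0)
vecs = rng.normal(size=(100, 10)) + 5.0
processed = remove_top_components(vecs, d=2)
```

After this step, the processed vectors are orthogonal to the removed directions, which is what suppresses the dominating, variance-heavy components.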

