PROBING BERT IN HYPERBOLIC SPACES

Abstract

Recently, a variety of probing tasks have been proposed to discover linguistic properties learned in contextualized word embeddings. Many of these works implicitly assume these embeddings lie in certain metric spaces, typically the Euclidean space. This work considers a family of geometrically special spaces, the hyperbolic spaces, that exhibit better inductive biases for hierarchical structures and may better reveal linguistic hierarchies encoded in contextualized representations. We introduce a Poincaré probe, a structural probe projecting these embeddings into a Poincaré subspace with explicitly defined hierarchies. We focus on two probing objectives: (a) dependency trees, where the hierarchy is defined by head-dependent structures; (b) lexical sentiments, where the hierarchy is defined by the polarity of words (positivity and negativity). We argue that a key desideratum of a probe is its sensitivity to the existence of linguistic structures. We apply our probes to BERT, a typical contextualized embedding model. In a syntactic subspace, our probe recovers tree structures better than Euclidean probes, revealing the possibility that the geometry of BERT syntax may not necessarily be Euclidean. In a sentiment subspace, we reveal two possible meta-embeddings for positive and negative sentiments and show how lexically-controlled contextualization changes the geometric localization of embeddings. We demonstrate these findings with extensive experiments and visualizations of our Poincaré probe.

1. INTRODUCTION

Contextualized word representations from pretrained language models have significantly advanced NLP (Peters et al., 2018a; Devlin et al., 2019). Previous works point out that abundant linguistic knowledge is implicitly encoded in these representations (Belinkov et al., 2017; Peters et al., 2018b;a; Tenney et al., 2019). This paper is primarily inspired by Hewitt & Manning (2019), who propose a structural probe to recover dependency trees encoded under squared Euclidean distance in a syntactic subspace. Although it is a common implicit assumption, there is no strict evidence that the geometry of these syntactic subspaces should be Euclidean, especially given that the Euclidean space has intrinsic difficulties modeling trees (Linial et al., 1995).

We propose to impose and explore different inductive biases for modeling syntactic subspaces. The hyperbolic space, a special Riemannian space with constant negative curvature, is a good candidate because of its tree-likeness (Nickel & Kiela, 2017; Sarkar, 2011). We adopt a generalized Poincaré ball, a particular model of hyperbolic spaces, to construct a Poincaré probe for contextualized embeddings. Figure 1 (A, B) gives an example of a tree embedded in the Poincaré ball and compares it with its Euclidean counterpart. Intuitively, the volume of a Poincaré ball grows exponentially with its radius, mirroring the way the number of nodes of a full tree grows exponentially with its depth. This gives "enough space" to embed the tree. Meanwhile, the volume of a Euclidean ball grows only polynomially and thus has less capacity to embed tree nodes.

Before going any further, it is crucial to differentiate a probe from a supervised parser (Hall Maudslay et al., 2020), and to ask what makes a good probe. Ideally, a probe should correctly recover syntactic information intrinsically contained in the embeddings, rather than being a powerful parser by itself.
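The volume argument above can be made concrete: the geodesic distance in the unit Poincaré ball has a closed form, and points pushed toward the boundary become far apart even when their Euclidean distance is small, which is what leaves "enough space" for the leaves of a tree. Below is a minimal NumPy sketch (not from the paper), assuming the standard curvature −1 ball; the function name is ours:

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points strictly inside the unit ball:

    d(u, v) = arccosh(1 + 2 ||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2)))
    """
    sq_dist = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq_dist / max(denom, eps)))

# Two sibling leaves near the boundary: their Euclidean distance is tiny,
# but their hyperbolic distance is an order of magnitude larger, so leaf
# nodes do not get "crowded" the way they do in Euclidean space.
a = np.array([0.90, 0.05])
b = np.array([0.90, -0.05])
print(poincare_distance(a, b))  # far larger than...
print(np.linalg.norm(a - b))    # ...the Euclidean distance 0.1
```

The denominator shrinks as points approach the boundary, which is exactly the exponential volume growth the text appeals to.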
So it is important that the probe have restricted modeling power yet remain sensitive to the existence of syntax. For embeddings without strong syntactic information (e.g., randomly initialized word embeddings), a probe should not aim to assign high parsing scores (because this would overestimate the existence of syntax), while a parser aims for high scores no matter how bad the input embeddings are. The quality of a probe is defined by its sensitivity to syntax.

Our work of probing BERT in hyperbolic spaces is exploratory. As opposed to the Euclidean syntactic subspaces in Hewitt & Manning (2019), we consider a Poincaré syntactic subspace and show its effectiveness for recovering syntax. Figure 1 (C) gives an example of a reconstructed dependency tree embedded in the Poincaré ball. In our experiments, we highlight two important observations about our Poincaré probe: (a) it does not give higher parsing scores than Euclidean probes to baseline embeddings (which have no syntactic information), meaning that it is not simply a better parser; (b) it achieves higher parsing scores than the Euclidean probe with strictly restricted capacity, especially for deeper trees, longer edges, and longer sentences. Observation (b) can be interpreted from two perspectives: (1) the Poincaré probe might be more sensitive to the existence of deeper syntax; (2) the structure of BERT's syntactic subspaces could be different from Euclidean, especially for deeper trees. Consequently, the Euclidean probe may underestimate the syntactic capability of BERT, and BERT may exhibit stronger modeling power for deeper syntax in some special metric space, in our case a Poincaré ball.

To best exploit the inductive bias of hyperbolic spaces for hierarchical structures, we generalize our Poincaré probe to sentiment analysis. We construct a Poincaré sentiment subspace by predicting the sentiments of individual words using vector geometry (Figure 1 D).
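The syntactic probing setup described above can be sketched in PyTorch: a rank-restricted linear map sends BERT vectors into a tangent space, the exponential map at the origin places them in the Poincaré ball, and training minimizes the gap between geodesic distances and gold tree distances. This is an illustrative reconstruction under our own assumptions (class and function names are hypothetical; the paper's exact parameterization and loss may differ):

```python
import torch

def expmap0(x, eps=1e-6):
    """Exponential map at the origin: send a tangent vector into the unit ball."""
    norm = x.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(norm) * x / norm

def poincare_dist(u, v, eps=1e-6):
    """Broadcastable geodesic distance inside the unit ball (curvature -1)."""
    sq = (u - v).pow(2).sum(-1)
    denom = ((1 - u.pow(2).sum(-1)) * (1 - v.pow(2).sum(-1))).clamp_min(eps)
    # Clamp the arccosh argument slightly above 1 for numerical stability.
    return torch.acosh((1 + 2 * sq / denom).clamp_min(1 + eps))

class PoincareProbe(torch.nn.Module):
    """Rank-restricted projection of contextual embeddings into a Poincare ball."""
    def __init__(self, dim=768, rank=64):
        super().__init__()
        self.proj = torch.nn.Linear(dim, rank, bias=False)

    def forward(self, h):              # h: (seq_len, dim) BERT vectors
        return expmap0(self.proj(h))   # (seq_len, rank) points in the ball

def probe_loss(points, tree_dist):
    """L1 gap between probe distances and gold tree distances over word pairs."""
    d = poincare_dist(points.unsqueeze(0), points.unsqueeze(1))  # (seq, seq)
    return (d - tree_dist).abs().mean()
```

The restricted rank mirrors the capacity control that the probe-versus-parser discussion calls for: the probe can only reveal structure already linearly recoverable from the embeddings, not compute it.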
We assume two meta representations for the positive and negative sentiments as the roots of the sentiment subspace. The stronger a word's polarity, the closer it is located to its corresponding meta embedding. In our experiments, with clearly different geometric properties, the Poincaré probe shows that BERT encodes sentiment for each word in a very fine-grained way. We further reveal how the localization of word embeddings changes under lexically-controlled contextualization, i.e., how different contexts affect the geometric location of the embeddings in the sentiment subspace.

In summary, we present a Poincaré probe to reveal hierarchical linguistic structures encoded in BERT. From a hyperbolic deep learning perspective, our results indicate the possibility of using Poincaré models for learning better representations of linguistic hierarchies. From a linguistic perspective, we reveal the geometric properties of linguistic hierarchies encoded in BERT and posit that



Figure 1: Visualization of different spaces. (A, B) Comparison between trees embedded in Euclidean space and hyperbolic space. We use geodesics, the analogues of straight lines in hyperbolic spaces, to connect nodes in (B). Line/geodesic segments connecting nodes are approximately of the same length in their corresponding spaces. Intuitively, nodes embedded in Euclidean space look more "crowded", while the hyperbolic space allows sufficient capacity to embed trees and enough distance between leaf nodes. (C) A syntax tree embedded in a Poincaré ball. Hierarchy levels correspond to syntactic depths; the higher a word is in the syntax tree, the closer it is to the origin. (D) Sentiment words embedded in a Poincaré ball. Hierarchy is defined by sentiment polarity. We assume two meta-embeddings, [POS] and [NEG], at the highest level. Words with stronger sentiments are closer to their corresponding meta-embeddings.
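The geometry of panel (D) suggests a simple decision rule: a word's polarity is read off from which meta-embedding it is geodesically closer to, and the margin between the two distances reflects polarity strength. The sketch below is purely illustrative (the coordinates of `pos_meta` and `neg_meta` are placeholders, not the learned meta-embeddings from the paper):

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq / max(denom, eps)))

def polarity(word_point, pos_meta, neg_meta):
    """Nearest meta-embedding under geodesic distance decides the polarity;
    the absolute distance gap serves as a strength score."""
    d_pos = poincare_distance(word_point, pos_meta)
    d_neg = poincare_distance(word_point, neg_meta)
    label = "positive" if d_pos < d_neg else "negative"
    return label, abs(d_pos - d_neg)

# Illustrative placements: meta-embeddings near the boundary (top of the
# hierarchy), a strongly positive word close to [POS], and a near-neutral
# word close to the origin with a much smaller margin.
pos_meta = np.array([0.95, 0.0])
neg_meta = np.array([-0.95, 0.0])
print(polarity(np.array([0.80, 0.1]), pos_meta, neg_meta))  # positive, large margin
print(polarity(np.array([0.05, 0.0]), pos_meta, neg_meta))  # small, near-neutral margin
```

Under this rule, moving a word toward the origin (as weaker or more context-dependent sentiment might) shrinks the margin between both meta-embeddings, matching the figure's depiction of polarity strength as depth in the hierarchy.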

