FILTERED INNER PRODUCT PROJECTION FOR CROSSLINGUAL EMBEDDING ALIGNMENT

Abstract

Due to widespread interest in machine translation and transfer learning, numerous algorithms exist for mapping multiple embeddings to a shared representation space. Recently, these algorithms have been studied in the setting of bilingual lexicon induction, where one seeks to align the embeddings of a source and a target language such that translated word pairs lie close to one another in a common representation space. In this paper, we propose a method, Filtered Inner Product Projection (FIPP), for mapping embeddings to a common representation space. As semantic shifts are pervasive across languages and domains, FIPP first identifies the geometric structure common to both embeddings and then, only on this common structure, aligns their Gram matrices. FIPP aligns embeddings to isomorphic vector spaces even when the source and target embeddings have different dimensionalities. Additionally, FIPP is easy to implement and faster to compute than current approaches. Following the baselines in Glavaš et al. (2019), we evaluate FIPP on bilingual lexicon induction and downstream language tasks. We show that FIPP outperforms existing methods on the XLING (5K) BLI dataset and, when using a self-learning approach, on the XLING (1K) BLI dataset, while also providing robust performance across downstream tasks.

1. INTRODUCTION

The problem of aligning sets of embeddings, i.e. high-dimensional real-valued vectors, is of great interest in natural language processing, with applications in machine translation and transfer learning, and shares connections to graph matching and assignment problems (Grave et al., 2019; Gold & Rangarajan, 1996). Aligning embeddings trained on corpora from different languages has led to improved performance in supervised and unsupervised word and sentence translation (Zou et al., 2013), sequence labeling (Zhang et al., 2016; Mayhew et al., 2017), and information retrieval (Vulić & Moens, 2015). Additionally, linguistic patterns have been studied using embedding alignment algorithms (Schlechtweg et al., 2019; Lauscher & Glavaš, 2019). Embedding alignments have also been shown to improve the performance of multilingual contextual representation models (e.g. mBERT) on certain tasks, such as multilingual document classification, when used during initialization (Artetxe et al., 2020). Recently, algorithms applying embedding alignments to the input token representations of contextual embedding models have been shown to provide efficient domain adaptation (Poerner et al., 2020). Lastly, aligned source and target input embeddings have been shown to improve the transferability of models learned on a source domain to a target domain (Artetxe et al., 2018a; Wang et al., 2018; Mogadala & Rettinger, 2016). In the bilingual lexicon induction task, one seeks to learn a transformation on the embeddings of a source and a target language so that translated word pairs lie close to one another in the shared representation space. Specifically, one is given a small seed dictionary $D$ containing $c$ pairs of translated words, along with embeddings for these word pairs in the source and target languages, $X_s \in \mathbb{R}^{c \times d}$ and $X_t \in \mathbb{R}^{c \times d}$.
Using this seed dictionary, a transformation is learned on $X_s$ and $X_t$ with the objective that unseen translation pairs can be induced, often through nearest-neighbor search. Previous literature on this topic has focused on aligning embeddings by minimizing matrix or distributional distances (Grave et al., 2019; Jawanpuria et al., 2019; Joulin et al., 2018a). For instance, Mikolov et al. (2013a) proposed using Stochastic Gradient Descent (SGD) to learn a mapping $\Omega$ minimizing the sum of squared distances between pairs of words in the seed dictionary, $\|X_s^D \Omega - X_t^D\|_F^2$, which achieves high word translation accuracy for similar languages. Smith et al. (2017) and Artetxe et al. (2017) independently showed that a mapping with an additional orthogonality constraint, which preserves the geometry of the original spaces, admits the closed-form solution of the Orthogonal Procrustes problem, $\Omega^* = \arg\min_{\Omega \in O(d)} \|X_s^D \Omega - X_t^D\|_F$, where $O(d)$ denotes the group of $d$-dimensional orthogonal matrices. However, these methods usually require the dimensions of the source and target language embeddings to be the same, which often may not hold. Furthermore, due to semantic shifts across languages, it is often the case that a word and its translation do not co-occur with the same sets of words (Gulordava & Baroni, 2011). Therefore, seeking an alignment which minimizes all pairwise distances among translated pairs uses information that is not common to both the source and target embeddings. To address these problems, we propose Filtered Inner Product Projection (FIPP) for mapping embeddings from different languages to a shared representation space. FIPP aligns a source embedding $X_s \in \mathbb{R}^{n \times d_1}$ to a target embedding $X_t \in \mathbb{R}^{m \times d_2}$ and maps vectors in $X_s$ to the $\mathbb{R}^{d_2}$ space of $X_t$.
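As a concrete point of reference, the orthogonal baseline above admits a few-line implementation via the SVD. This is an illustrative sketch, not FIPP itself; the function names and the nearest-neighbor induction helper are our own, not from any particular library.

```python
import numpy as np

def procrustes_align(Xs, Xt):
    """Closed-form solution to min_{W in O(d)} ||Xs @ W - Xt||_F
    (the Orthogonal Procrustes problem), for seed-dictionary
    embeddings Xs, Xt of shape (c, d)."""
    # The optimal rotation comes from the SVD of the cross-covariance.
    U, _, Vt = np.linalg.svd(Xs.T @ Xt)
    return U @ Vt

def induce_translation(xs_word, Xt_vocab, W):
    """Induce a translation for one source vector by nearest-neighbor
    (cosine) search in the mapped space; returns a target row index."""
    mapped = xs_word @ W
    sims = (Xt_vocab @ mapped) / (
        np.linalg.norm(Xt_vocab, axis=1) * np.linalg.norm(mapped))
    return int(np.argmax(sims))
```

Note that this baseline requires the two embedding spaces to share a dimension $d$, which is exactly the restriction FIPP removes.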
Instead of word-level information, FIPP focuses on pairwise distance information, specified by the Gram matrices $X_s X_s^T$ and $X_t X_t^T$, where the rows of $X_s$ and $X_t$ correspond to embeddings for the $c$ pairs of source and target words from the seed dictionary. During alignment, FIPP pursues two goals. First, the aligned source embedding $\mathrm{FIPP}(X_s) = \tilde{X}_s \in \mathbb{R}^{c \times d_2}$ should be structurally close to the original source embedding, so that semantic information is retained and overfitting on the seed dictionary is prevented. This goal is reflected in the minimization of the reconstruction loss $\|\tilde{X}_s \tilde{X}_s^T - X_s X_s^T\|_F^2$. Second, as the usage of words and their translations varies across languages, FIPP does not require $\tilde{X}_s$ to use all of the distance information from $X_t$. Instead, it selects a filtered set $K$ of word pairs that have similar distances in both the source and target languages, $K = \{(i, j) \in D : |X_s X_s^T - X_t X_t^T|_{ij} \leq \epsilon\}$. FIPP then minimizes a transfer loss on this set $K$, the squared difference in inner products between the aligned source embeddings and the target embeddings: $\sum_{(i,j) \in K} (\tilde{X}_s[i] \tilde{X}_s[j]^T - X_t[i] X_t[j]^T)^2$.

[Figure 1: FIPP alignment of source and target embeddings, $X_s$ and $X_t$, to a common representation space. Note that $X_s$ is modified using information from $X_t$ and mapped to $\mathbb{R}^{d_2}$ while $X_t$ is unchanged.]

We show that FIPP can be efficiently solved using either low-rank semidefinite approximations or stochastic gradient descent. We also formulate a least squares projection to infer aligned representations for words outside the seed dictionary, and present a weighted Procrustes objective which recovers an orthogonal operator that takes into account the degree of structural similarity among translation pairs. The method is illustrated in Figure 1. Compared to previous approaches, FIPP has improved generality, stability, and efficiency. First, since FIPP's alignment between the source and target embeddings is performed on Gram matrices, i.e. $X_s X_s^T$ and $X_t X_t^T \in \mathbb{R}^{c \times c}$, the embeddings are not required to be of the same dimension and are projected to isomorphic vector spaces. This is particularly helpful for aligning embeddings trained on smaller corpora, such as in low-resource domains, or in compute-intensive settings where embeddings may have been compressed to lower dimensions. Second, alignment modifications made on filtered Gram matrices can incorporate varying constraints on alignment at the most granular level, pairwise distances. Lastly, FIPP is easy to implement, as its alignment stems directly from inner product information and involves only matrix operations.

The rest of this paper is organized as follows. We discuss related work in Section 2. We introduce our FIPP model in Section 3. We present experimental results in Section 4 and further discuss findings in Section 5. Lastly, we discuss related applications of FIPP in Section 6 and conclude in Section 7.
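The FIPP objective described above can be sketched with a small gradient-descent solver. The initialization at the target geometry, the threshold, and all other hyperparameters below are illustrative assumptions, not the paper's tuned settings.

```python
import numpy as np

def fipp_align(Xs, Xt, eps=0.5, lam=1.0, lr=5e-4, steps=1000):
    """Gradient-descent sketch of the FIPP objective.

    Minimizes ||G - Gs||_F^2 + lam * sum_{(i,j) in K} (G - Gt)_{ij}^2,
    where G = Xa @ Xa.T, Gs/Gt are the seed Gram matrices, and K keeps
    pairs with |Gs - Gt|_{ij} <= eps. Returns Xa of shape (c, d2), so
    the aligned source lives in the target space even when d1 != d2.
    """
    Gs, Gt = Xs @ Xs.T, Xt @ Xt.T
    K = (np.abs(Gs - Gt) <= eps).astype(float)  # symmetric pair mask
    Xa = Xt.copy()  # assumed init: start from the target geometry
    for _ in range(steps):
        G = Xa @ Xa.T
        # For symmetric A: d/dX ||X X^T - A||_F^2 = 4 (X X^T - A) X
        grad = 4.0 * ((G - Gs) + lam * K * (G - Gt)) @ Xa
        Xa -= lr * grad
    return Xa
```

Starting from $X_t$, the reconstruction term pulls the solution toward the source geometry while the filtered transfer term retains only the pairwise distances on which the two languages agree.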

2. RELATED WORK

In this section, we discuss related work in quantifying semantic shifts in embeddings and the task of multilingual embedding alignment.

2.1. Distributional Methods for Quantifying Semantic Shifts

Prior work has shown that monolingual text corpora from different communities or time periods exhibit variations in semantics and syntax (Hamilton et al., 2016a;b). In order to find linguistic shifts in different corpora, distributional methods make comparisons on word co-occurrence distributions.





These approaches have been well studied for classifying the introduction of new words and senses (Jatowt & Duh, 2014).
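As a toy illustration of such a distributional comparison (the function and data here are hypothetical, not drawn from the cited works), one can score how much a word's meaning has shifted between two corpora by the overlap of its nearest neighbors in the two embedding spaces:

```python
import numpy as np

def neighbor_overlap(E1, E2, idx, k=3):
    """Fraction of shared k-nearest (cosine) neighbors of word `idx`
    across two embedding matrices over a shared vocabulary (rows are
    words). Low overlap suggests a semantic shift between corpora."""
    def knn(E):
        En = E / np.linalg.norm(E, axis=1, keepdims=True)
        sims = En @ En[idx]
        sims[idx] = -np.inf  # exclude the word itself
        return set(np.argsort(sims)[-k:])
    shared = knn(E1) & knn(E2)
    return len(shared) / k
```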

