FILTERED INNER PRODUCT PROJECTION FOR CROSSLINGUAL EMBEDDING ALIGNMENT

Abstract

Due to widespread interest in machine translation and transfer learning, there are numerous algorithms for mapping multiple embeddings to a shared representation space. Recently, these algorithms have been studied in the setting of bilingual lexicon induction, where one seeks to align the embeddings of a source and a target language such that translated word pairs lie close to one another in a common representation space. In this paper, we propose a method, Filtered Inner Product Projection (FIPP), for mapping embeddings to a common representation space. As semantic shifts are pervasive across languages and domains, FIPP first identifies the common geometric structure in both embeddings and then, only on this common structure, aligns their Gram matrices. FIPP aligns embeddings to isomorphic vector spaces even when the source and target embeddings are of differing dimensionalities. Additionally, FIPP is easy to implement and is faster to compute than current approaches. Following the baselines in Glavaš et al. (2019), we evaluate FIPP in the context of bilingual lexicon induction and downstream language tasks. We show that FIPP outperforms existing methods on the XLING (5K) BLI dataset and, when combined with a self-learning approach, on the XLING (1K) BLI dataset, while also providing robust performance across downstream tasks.

1. INTRODUCTION

The problem of aligning sets of embeddings, or high dimensional real valued vectors, is of great interest in natural language processing, with applications in machine translation and transfer learning, and shares connections to graph matching and assignment problems (Grave et al., 2019; Gold & Rangarajan, 1996). Aligning embeddings trained on corpora from different languages has led to improved performance of supervised and unsupervised word and sentence translation (Zou et al., 2013), sequence labeling (Zhang et al., 2016; Mayhew et al., 2017), and information retrieval (Vulić & Moens, 2015). Additionally, linguistic patterns have been studied using embedding alignment algorithms (Schlechtweg et al., 2019; Lauscher & Glavaš, 2019). Embedding alignments have also been shown to improve the performance of multilingual contextual representation models (i.e. mBERT) on certain tasks such as multilingual document classification, when used during initialization (Artetxe et al., 2020). Recently, algorithms using embedding alignments on the input token representations of contextual embedding models have been shown to provide efficient domain adaptation (Poerner et al., 2020). Lastly, aligned source and target input embeddings have been shown to improve the transferability of models learned on a source domain to a target domain (Artetxe et al., 2018a; Wang et al., 2018; Mogadala & Rettinger, 2016). In the bilingual lexicon induction task, one seeks to learn a transformation on the embeddings of a source and a target language so that translated word pairs lie close to one another in the shared representation space. Specifically, one is given a small seed dictionary D containing c pairs of translated words, and embeddings for these word pairs in a source and a target language, X_s^D ∈ R^{c×d} and X_t^D ∈ R^{c×d}.
Using this seed dictionary, a transformation is learned on X_s and X_t with the objective that unseen translation pairs can be induced, often through nearest neighbors search. Previous literature on this topic has focused on aligning embeddings by minimizing matrix or distributional distances (Grave et al., 2019; Jawanpuria et al., 2019; Joulin et al., 2018a). For instance, Mikolov et al. (2013a) proposed using Stochastic Gradient Descent (SGD) to learn a mapping Ω minimizing the sum of squared distances between pairs of words in the seed dictionary, ‖X_s^D Ω − X_t^D‖_F^2, which achieves high word translation accuracy for similar languages. Smith et al. (2017) and Artetxe et al. (2017) independently showed that a mapping with an additional orthogonality constraint, which preserves the geometry of the original spaces, can be solved with the closed-form solution to the Orthogonal Procrustes problem, Ω* = argmin_{Ω∈O(d)} ‖X_s^D Ω − X_t^D‖_F, where O(d) denotes the group of d-dimensional orthogonal matrices. However, these methods usually require the dimensions of the source and target language embeddings to be the same, which often may not hold. Furthermore, due to semantic shifts across languages, it is often the case that a word and its translation may not co-occur with the same sets of words (Gulordava & Baroni, 2011). Therefore, seeking an alignment which minimizes all pairwise distances among translated pairs results in using information not common to both the source and target embeddings. To address these problems, we propose Filtered Inner Product Projection (FIPP) for mapping embeddings from different languages to a shared representation space. FIPP aligns a source embedding X_s ∈ R^{n×d_1} to a target embedding X_t ∈ R^{m×d_2} and maps vectors in X_s to the R^{d_2} space of X_t.
Instead of word-level information, FIPP focuses on pairwise distance information, specified by the Gram matrices X_s X_s^T and X_t X_t^T, where the rows of X_s and X_t correspond to embeddings for the c pairs of source and target words from the seed dictionary. During alignment, FIPP tries to achieve two goals. First, the aligned source embedding FIPP(X_s) = X̃_s ∈ R^{c×d_2} should be structurally close to the original source embedding, to ensure that semantic information is retained and to prevent overfitting on the seed dictionary. This goal is reflected in the minimization of the reconstruction loss ‖X̃_s X̃_s^T − X_s X_s^T‖_F^2. Second, as the usage of words and their translations varies across languages, instead of requiring X̃_s to use all of the distance information from X_t, FIPP selects a filtered set K of word pairs that have similar distances in both the source and target languages: K = {(i, j) : |X_s X_s^T − X_t X_t^T|_{ij} ≤ ε}. FIPP then minimizes a transfer loss on this set K, the squared difference in distances between the aligned source embeddings and the target embeddings: Σ_{(i,j)∈K} (X̃_s[i] X̃_s[j]^T − X_t[i] X_t[j]^T)^2.

Figure 1: FIPP alignment of source and target embeddings, X_s and X_t, to a common representation space. Note that X_s is modified using information from X_t and mapped to R^{d_2} while X_t is unchanged.

We show that FIPP can be efficiently solved using either low-rank semidefinite approximations or stochastic gradient descent. We also formulate a least squares projection to infer aligned representations for words outside the seed dictionary and present a weighted Procrustes objective which recovers an orthogonal operator that takes into account the degree of structural similarity among translation pairs. The method is illustrated in Figure 1. Compared to previous approaches, FIPP has improved generality, stability, and efficiency. First, since FIPP's alignment between the source and target embeddings is performed on Gram matrices, i.e. X_s X_s^T and X_t X_t^T ∈ R^{c×c}, embeddings are not required to be of the same dimension and are projected to isomorphic vector spaces. This is particularly helpful for aligning embeddings trained on smaller corpora, such as in low-resource domains, or in compute-intensive settings where embeddings may have been compressed to lower dimensions. Secondly, alignment modifications made on filtered Gram matrices can incorporate varying constraints on alignment at the most granular level, pairwise distances. Lastly, FIPP is easy to implement as it involves only matrix operations, is deterministic, and takes an order of magnitude less time to compute than either the best supervised (Joulin et al., 2018b) or unsupervised (Artetxe et al., 2018c) approach compared against. We conduct a thorough evaluation of FIPP using the baselines outlined in Glavaš et al. (2019), including bilingual lexicon induction with 5K and 1K supervision sets and downstream evaluation on the MNLI Natural Language Inference and TED CLDC Document Classification tasks.

The rest of this paper is organized as follows. We discuss related work in Section 2. We introduce our FIPP model in Section 3 and its usage for inference in Section 4. We present experimental results in Section 5 and further discuss findings in Section 6. We conclude the paper in Section 7.

2. RELATED WORK

In this section, we discuss related work in quantifying semantic shifts in embeddings and the task of crosslingual embedding alignment.

2.1. DISTRIBUTIONAL METHODS FOR QUANTIFYING SEMANTIC SHIFTS

Prior work has shown that monolingual text corpora from different communities or time periods exhibit variations in semantics and syntax (Hamilton et al., 2016a; b). To detect such linguistic shifts, distributional methods compare word co-occurrence distributions; these approaches have been well studied for identifying the introduction of new words and senses (Jatowt & Duh, 2014). Word embeddings (Mikolov et al., 2013b; Pennington et al., 2014; Bojanowski et al., 2017) map words to representations in a continuous space with the objective that the inner product between any two words' representations is approximately proportional to their probability of co-occurrence. By comparing pairwise distances in monolingual embeddings trained on separate corpora, one can quantify semantic shifts associated with biases, cultural norms, and temporal differences (Gulordava & Baroni, 2011; Sagi et al., 2011; Kim et al., 2014). Recently proposed metrics on embeddings compare all pairwise inner products of two embeddings, E and F, via quantities of the form ‖EE^T − FF^T‖_F (Yin et al., 2018). While these metrics have been applied in quantifying monolingual semantic variation, they have not been explored in the context of mapping embeddings to a common representation space or in multilingual settings.

2.2. CROSSLINGUAL EMBEDDING ALIGNMENT

The first work on this topic is by Mikolov et al. (2013a), who proposed using Stochastic Gradient Descent (SGD) to learn a mapping Ω minimizing the sum of squared distances between pairs of words in the seed dictionary, ‖X_s Ω − X_t‖_F^2, which achieves high word translation accuracy for similar languages. Smith et al. (2017) and Artetxe et al. (2017) independently showed that a mapping with an additional orthogonality constraint, which preserves the geometry of the original spaces, can be solved with the closed-form solution to the Orthogonal Procrustes problem, Ω* = argmin_{Ω∈O(d)} ‖X_s Ω − X_t‖_F. Dinu & Baroni (2015) worked on corrections to the "hubness" problem in embedding alignment, where certain word vectors may be close to many other word vectors, arising due to the nonuniform density of vectors in R^d. Smith et al. (2017) proposed the inverted softmax metric for inducing matchings between words in embeddings of different languages. Artetxe et al. (2016) studied the impact of normalization, centering, and orthogonality constraints in linear alignment functions. Jawanpuria et al. (2019) presented a composition of orthogonal operators and a Mahalanobis metric, of the form UBV^T with U, V ∈ O(d) and B ⪰ 0, to account for observed correlations and moment differences between dimensions (Søgaard et al., 2018). Joulin et al. (2018a) proposed an alignment based on neighborhood information to account for differences in density and shape of embeddings in their respective R^d spaces. Artetxe et al. (2018c) outlined a framework which unifies many existing alignment approaches as compositions of matrix operations such as orthogonal mappings, whitening, and dimensionality reduction. Nakashole & Flauger (2018) found that locally linear maps vary between different neighborhoods in bilingual embedding spaces, which suggests that nonlinearity is beneficial in global alignments.
Nakashole (2018) proposed an alignment method using neighborhood-sensitive maps which shows strong performance on dissimilar language pairs. Patra et al. (2019) proposed a novel hub filtering method and a semi-supervised alignment approach based on distributional matching. Mohiuddin et al. (2020) learned a non-linear mapping in the latent space of two independently pre-trained autoencoders, which provides strong performance on well-studied BLI tasks. A recent method most similar to ours, that of Glavaš & Vulić (2020), uses non-linear mappings to find a translation vector for each source and target embedding based on the cosine similarities and Euclidean distances between nearest neighbors and corresponding translations. In the unsupervised setting, where a bilingual seed dictionary is not provided, approaches using adversarial learning, distributional matching, and noisy self-supervision have been used to concurrently learn a matching and an alignment between embeddings (Cao et al., 2016; Zhang et al., 2017; Hoshen & Wolf, 2018; Grave et al., 2019; Artetxe et al., 2017; 2018b; Alvarez-Melis & Jaakkola, 2018). Discussion of unsupervised approaches is included in Appendix Section I.
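Several of the linear mappings surveyed above reduce to the Orthogonal Procrustes problem. As a concrete reference point, here is a minimal numpy sketch of its standard closed-form solution (the function name is ours, not from any of the cited codebases):

```python
import numpy as np

def procrustes_rotation(Xs, Xt):
    """Closed-form solution of min_{O orthogonal} ||Xs @ O - Xt||_F.

    Standard result: with U S Vt = svd(Xs.T @ Xt), the minimizer is U @ Vt.
    """
    U, _, Vt = np.linalg.svd(Xs.T @ Xt)
    return U @ Vt
```

For example, if the target is an exactly rotated copy of the source, the recovered rotation reproduces it.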

3. FILTERED INNER PRODUCT PROJECTION

3.1. FILTERED INNER PRODUCT PROJECTION OBJECTIVE

In this section, we introduce Filtered Inner Product Projection (FIPP), a method for aligning embeddings in a shared representation space. FIPP aligns a source embedding X_s ∈ R^{n×d_1} to a target embedding X_t ∈ R^{m×d_2} and projects vectors in X_s to X̃_s ∈ R^{n×d_2}. Let X_s ∈ R^{c×d_1} and X_t ∈ R^{c×d_2} be the source and target embeddings for pairs in the seed dictionary D, |D| = c ≪ min(n, m). FIPP's objective is to minimize a linear combination of a reconstruction loss, which regularizes changes in the pairwise inner products of the source embedding, and a transfer loss, which aligns the source and target embeddings on common portions of their geometries:

min_{X̃_s ∈ R^{c×d_2}}  ‖X̃_s X̃_s^T − X_s X_s^T‖_F^2  +  λ ‖Δ ∘ (X̃_s X̃_s^T − X_t X_t^T)‖_F^2,   (1)

where the first term is the reconstruction loss and the second the transfer loss, λ, ε ∈ R_+ are tunable scalar hyperparameters whose effects are discussed in Appendix Section E, ∘ is the Hadamard product, and Δ is a binary matrix discussed in Section 3.1.2.

3.1.1. RECONSTRUCTION LOSS

Due to the limited, noisy supervision in our problem setting, an alignment should be regularized against overfitting. Specifically, the aligned space needs to retain a similar geometric structure to the original source embeddings; this has been enforced in previous works by ensuring that alignments are close to orthogonal mappings (Mikolov et al., 2013a; Joulin et al., 2018a; Jawanpuria et al., 2019). As X̃_s and X_s can be of differing dimensionality, we check structural similarity by comparing pairwise inner products, captured by a reconstruction loss known as the PIP distance or Global Anchor Metric, ‖X̃_s X̃_s^T − X_s X_s^T‖_F^2 (Yin & Shen, 2018; Yin et al., 2018).

Theorem 1. Suppose E ∈ R^{n×d}, F ∈ R^{n×d} are two matrices with orthonormal columns and Ω* = argmin_{Ω∈O(d)} ‖EΩ − F‖_F. It follows that (Yin et al., 2018): ‖EΩ* − F‖ ≤ ‖EE^T − FF^T‖ ≤ √2 ‖EΩ* − F‖.

This metric has been used in quantifying semantic shifts and has been shown by Yin et al. (2018) to be equivalent to the residual of the Orthogonal Procrustes problem up to a small constant factor, as stated in Theorem 1. Note that the PIP distance is invariant to orthogonal operations such as rotations, which are known to be present in unaligned embeddings.

3.1.2. TRANSFER LOSS

In aligning X_s to X_t, we should seek to utilize only the geometric information common to the two embedding spaces. We propose a simple approach, denoted inner product filtering (although FIPP can admit other filtering mechanisms), in which we utilize only pairwise distances that are similar in both embedding spaces, as defined by a threshold ε. Specifically, we compute a matrix Δ ∈ {0, 1}^{c×c} where Δ_ij is an indicator of whether |X_{s,i} X_{s,j}^T − X_{t,i} X_{t,j}^T| < ε. Here, ε is a hyperparameter which determines how close pairwise distances must be in the source and target embeddings to be deemed similar. We then define the transfer loss as the squared difference between the aligned source embedding X̃_s and the target embedding X_t, but only on pairs of words in K: ‖Δ ∘ (X̃_s X̃_s^T − X_t X_t^T)‖_F^2, where ∘ is the Hadamard product. The FIPP objective is a linear combination of the reconstruction and transfer losses.
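Putting the two losses together, the objective in Eq. (1) and the filtering matrix Δ can be evaluated in a few lines of numpy. This is an illustrative sketch (names such as `fipp_objective` are ours), not the paper's implementation:

```python
import numpy as np

def fipp_objective(Xs_tilde, Xs, Xt, lam, eps):
    """Evaluate Eq. (1): reconstruction loss + lam * transfer loss,
    with Delta masking entries whose source/target inner products
    differ by less than eps (inner product filtering)."""
    Gs, Gt = Xs @ Xs.T, Xt @ Xt.T
    G_tilde = Xs_tilde @ Xs_tilde.T
    delta = (np.abs(Gs - Gt) < eps).astype(float)
    recon = np.linalg.norm(G_tilde - Gs, "fro") ** 2
    transfer = np.linalg.norm(delta * (G_tilde - Gt), "fro") ** 2
    return recon + lam * transfer
```

In particular, a perfect alignment of an embedding to itself incurs zero loss, since both terms vanish.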

3.2. APPROXIMATE SOLUTIONS TO THE FIPP OBJECTIVE

3.2.1. SOLUTIONS USING LOW-RANK SEMIDEFINITE APPROXIMATIONS

Denote the Gram matrices G_s = X_s X_s^T, G_t = X_t X_t^T, and G̃_s = X̃_s X̃_s^T.

Lemma 2. The matrix G* which minimizes the FIPP objective for a fixed λ and ε has entries: G*_{ij} = ((G_s)_{ij} + λ(G_t)_{ij}) / (1 + λ) if (i, j) ∈ K, and G*_{ij} = (G_s)_{ij} otherwise.

Proof. For a fixed λ and ε, L_{FIPP,λ,ε}(G̃_s) can be decomposed as follows:

L_{FIPP,λ,ε}(G̃_s) = ‖G̃_s − G_s‖_F^2 + λ ‖Δ ∘ (G̃_s − G_t)‖_F^2
= Σ_{(i,j)∈K} [((G̃_s)_{ij} − (G_s)_{ij})^2 + λ((G̃_s)_{ij} − (G_t)_{ij})^2] + Σ_{(i,j)∉K} ((G̃_s)_{ij} − (G_s)_{ij})^2.

Taking derivatives with respect to (G̃_s)_{ij} and setting them to zero yields the stated entries of G* = argmin_{G̃_s ∈ R^{c×c}} L_{FIPP,λ,ε}(G̃_s).

We now have the matrix G* ∈ R^{c×c} which minimizes the FIPP objective. However, for G* to be a valid Gram matrix, it is required that G* ∈ S_+^{c×c}, the set of symmetric positive semidefinite matrices. Additionally, to recover an X̃_s ∈ R^{c×d_2} such that X̃_s X̃_s^T = G*, we must have Rank(G*) ≤ d_2. Note that G* is symmetric by construction, since the set K is symmetric and G_s, G_t are symmetric. However, G* is not necessarily positive semidefinite, nor is it necessarily true that Rank(G*) ≤ d_2. Therefore, to recover an aligned embedding X̃_s ∈ R^{c×d_2}, we perform a rank-constrained semidefinite approximation, minimizing ‖X̃_s X̃_s^T − G*‖_F over X̃_s ∈ R^{c×d_2}.

Theorem 3. Let G* = QΛQ^T be the eigendecomposition of G*. A matrix X̃_s ∈ R^{c×d_2} which minimizes ‖X̃_s X̃_s^T − G*‖_F is given by the columns λ_i^{1/2} q_i, where λ_i and q_i are the d_2 largest nonnegative eigenvalues and corresponding eigenvectors.

Proof. Since G* ∈ S^{c×c}, its eigendecomposition G* = QΛQ^T has orthonormal Q. Let λ̃, q̃ denote the d_2 largest nonnegative eigenvalues in Λ and their corresponding eigenvectors; denote the complementary eigenvalues and associated eigenvectors as λ̃_⊥ = Λ \ λ̃ and q̃_⊥ = Q \ q̃. By the Eckart-Young-Mirsky theorem for the Frobenius norm (Kishore Kumar & Schneider, 2017), for any G ∈ S_+^{c×c} with Rank(G) ≤ d_2,

‖G* − G‖_F ≥ ‖q̃_⊥ λ̃_⊥ q̃_⊥^T‖_F,

and ‖G* − G‖_F is minimized for G = q̃ λ̃ q̃^T. Using this result, we can recover X̃_s: the minimizer satisfies

G = Σ_{λ_i ∈ λ̃} (λ_i^{1/2} q_i)(λ_i^{1/2} q_i)^T = X̃_s X̃_s^T,

so the aligned embedding X̃_s has columns λ_i^{1/2} q_i, λ_i ∈ λ̃, a minimizer of ‖X̃_s X̃_s^T − G*‖_F. Due to the rank constraint on G, we are only interested in the d_2 largest eigenvalues and corresponding eigenvectors, which incurs a complexity of O(d_2 c^2) using power iteration (Panju, 2011).
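The closed-form recipe (blend the Gram matrices on K per Lemma 2, then take a rank-d_2 PSD approximation per Theorem 3) can be sketched as follows. This is an illustrative implementation, with `numpy.linalg.eigh` standing in for the power iteration mentioned in the text:

```python
import numpy as np

def fipp_closed_form(Xs, Xt, lam=1.0, eps=0.05, d2=None):
    """Closed-form FIPP sketch: Lemma 2 blend + Theorem 3 rank-d2 PSD
    approximation. numpy.linalg.eigh stands in for power iteration."""
    d2 = Xt.shape[1] if d2 is None else d2
    Gs, Gt = Xs @ Xs.T, Xt @ Xt.T
    mask = np.abs(Gs - Gt) < eps                               # filtered set K
    G_star = np.where(mask, (Gs + lam * Gt) / (1 + lam), Gs)   # Lemma 2
    evals, evecs = np.linalg.eigh((G_star + G_star.T) / 2)
    order = np.argsort(evals)[::-1][:d2]                       # d2 largest
    lam_top = np.clip(evals[order], 0.0, None)                 # drop negatives
    return evecs[:, order] * np.sqrt(lam_top)                  # cols sqrt(l)*q
```

As a sanity check, with an empty filtered set (eps = 0) and d2 equal to the source rank, the recovered embedding reproduces the source Gram matrix exactly.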

3.2.2. SOLUTIONS USING STOCHASTIC GRADIENT DESCENT

Alternatively, solutions to the FIPP objective can be obtained using Stochastic Gradient Descent (SGD). This requires defining a single variable X̃_s ∈ R^{c×d_2} over which to optimize. We find that the solutions obtained after convergence of SGD are close, with respect to the Frobenius norm, to those obtained with low-rank PSD approximations, up to a rotation. However, the complexity of solving FIPP using SGD is O(Tc^2), where T is the number of training epochs. Empirically, we find T > d_2 is required for SGD convergence; as a result, this approach incurs a complexity greater than that of low-rank semidefinite approximations.
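A minimal gradient-based variant of this approach can be written in plain numpy by deriving the gradient of the objective by hand. Note this is a sketch (full-batch gradient descent with a fixed step size), not the Adam-based setup described in the appendix:

```python
import numpy as np

def fipp_sgd(Xs, Xt, lam=1.0, eps=0.05, lr=1e-3, epochs=3000):
    """Full-batch gradient descent on the FIPP objective.

    The gradient is derived by hand so the sketch stays numpy-only:
    grad = 4 (G - Gs) X + 4 lam (Delta o (G - Gt)) X, with G = X X^T,
    using the symmetry of Gs, Gt, and Delta.
    """
    Gs, Gt = Xs @ Xs.T, Xt @ Xt.T
    delta = (np.abs(Gs - Gt) < eps).astype(float)
    rng = np.random.default_rng(0)
    X = 0.1 * rng.normal(size=(Xs.shape[0], Xt.shape[1]))  # small random init
    for _ in range(epochs):
        G = X @ X.T
        grad = 4 * (G - Gs) @ X + 4 * lam * (delta * (G - Gt)) @ X
        X -= lr * grad
    return X
```

On a small problem with an empty filtered set, the iterate drives the reconstruction loss well below that of a trivial (zero) solution.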

3.3. ISOTROPIC PREPROCESSING

Common preprocessing steps used by previous approaches (Joulin et al., 2018a; Artetxe et al., 2018a) involve normalizing the rows of X_s, X_t to have an ℓ_2 norm of 1 and demeaning columns. The transfer loss of the FIPP objective makes direct comparisons on the Gram matrices, X_s X_s^T and X_t X_t^T, of the source and target embeddings for words in the seed dictionary. To reduce the influence of dimensional biases between X_s and X_t and ensure words are weighted equally during alignment, it is desired that X_s and X_t be isotropic, i.e. that the partition function Z(c) = Σ_i exp(c^T X_s[i]) be approximately constant for any unit vector c (Arora et al., 2016). Mu & Viswanath (2018) find that a first-order approximation to enforcing isotropy is achieved by column demeaning, while a second-order approximation is obtained by removal of the top PCA components. In FIPP, we apply this simple preprocessing approach by removing the top PCA component of X_s and X_t. Empirically, the distributions of inner products between a source and target embedding can differ substantially when not preprocessed, rendering a substandard alignment; this is discussed further in Appendix Section G.
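This preprocessing pipeline (row normalization, column demeaning, removal of the top PCA component) might be sketched as below; the function name and the SVD-based PCA are our choices:

```python
import numpy as np

def isotropic_preprocess(X):
    """Row-normalize, demean columns, and remove the top PCA component
    (the second-order isotropy correction of Mu & Viswanath (2018))."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    X = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    u1 = Vt[0]                         # top principal direction
    return X - np.outer(X @ u1, u1)    # project it out
```

After this step the columns remain zero-mean, and the matrix loses exactly one dimension of variance (the removed principal direction).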

4. INFERENCE AND ALIGNMENT

4.1. INFERENCE WITH LEAST SQUARES PROJECTION

To infer aligned source embeddings for words outside of the supervision dictionary, we make the assumption that source words not used for alignment should preserve their distances to those in the seed dictionary in their respective spaces, i.e., X_s^D X_s^T ≈ X̃_s^D X̃_s^T, where X_s^D and X̃_s^D denote the original and aligned embeddings of the seed dictionary words. Using this assumption, we formulate a least squares projection (Boyd & Vandenberghe, 2004) on an overdetermined system of equations to recover X̃_s: X̃_s^T = ((X̃_s^D)^T X̃_s^D)^{-1} (X̃_s^D)^T X_s^D X_s^T.
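The normal-equations form of this projection is nearly a one-liner in numpy. A sketch, under the assumption that the seed-dictionary matrices are passed in explicitly (argument names are ours):

```python
import numpy as np

def infer_full_vocab(Xs_dict, Xs_tilde_dict, Xs_full):
    """Least squares projection of Section 4.1: each source word keeps
    its inner products with the seed dictionary, solved via the normal
    equations X_tilde^T = (A^T A)^{-1} A^T B with A the aligned dictionary."""
    A = Xs_tilde_dict                  # (c, d2) aligned dictionary embeddings
    B = Xs_dict @ Xs_full.T            # (c, n) inner products to preserve
    return np.linalg.solve(A.T @ A, A.T @ B).T   # (n, d2)
```

As a sanity check, when the aligned dictionary equals the original one (an identity alignment), the projection returns the full-vocabulary embeddings unchanged.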

4.2. WEIGHTED ORTHOGONAL PROCRUSTES

As X̃_s ∈ R^{c×d_2} has been optimized only with concern for its inner products, X̃_s must be rotated so that its basis matches that of X_t. We propose a weighted variant of the Orthogonal Procrustes solution to account for differing levels of translation uncertainty among pairs in the seed dictionary, which may arise due to polysemy, semantic shifts, and translation errors. In Weighted Least Squares problems, an inverse variance-covariance weighting W is used (Strutz, 2010; Brignell et al., 2015) to account for differing levels of measurement uncertainty among samples. We solve a weighted Procrustes objective, where the measurement error of each translation pair is approximated by its transfer loss, W^{-1}_{ii} = ‖X̃_s[i] X̃_s^T − X_t[i] X_t^T‖^2. With SVD((W X_t)^T W X̃_s) = UΣV^T, the solution is Ω_W = argmin_{Ω∈O(d_2)} ‖W(X̃_s Ω − X_t)‖_F^2 = UV^T, where O(d_2) is the group of d_2 × d_2 orthogonal matrices. The rotation Ω_W is then applied to X̃_s.
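A sketch of this weighted rotation follows. The per-pair weights are inverse residuals of the Gram-matrix rows as in the text; the small additive constant guarding against division by zero and the weight normalization are our choices, and we use the standard SVD convention for the weighted Procrustes minimizer:

```python
import numpy as np

def weighted_procrustes(Xs_tilde, Xt):
    """Weighted Procrustes rotation for the aligned source embedding.

    Pair weights are inverse residuals of the Gram-matrix rows, a proxy
    for the per-pair transfer loss of Section 4.2; 1e-8 is an
    illustrative guard against division by zero.
    """
    c = Xs_tilde.shape[0]
    resid = np.linalg.norm(Xs_tilde @ Xs_tilde.T - Xt @ Xt.T, axis=1) ** 2
    w = 1.0 / (resid + 1e-8)
    W = (w / w.sum() * c)[:, None]          # normalized diagonal weights
    # standard SVD recipe for argmin_O ||W (Xs_tilde @ O - Xt)||_F
    U, _, Vt = np.linalg.svd((W * Xs_tilde).T @ (W * Xt))
    return U @ Vt
```

When the target is an exact rotation of X̃_s, all residuals vanish, the weights are uniform, and the true rotation is recovered.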

5. EXPERIMENTATION

In this section, we report bilingual lexicon induction results from the XLING dataset and downstream experiments performed on the MNLI Natural Language Inference and TED CLDC tasks.

5.1. XLING BILINGUAL LEXICON INDUCTION

The XLING BLI task dictionaries constructed by Glavaš et al. (2019) include all 28 pairs between 8 languages in different language families: Croatian (HR), English (EN), Finnish (FI), French (FR), German (DE), Italian (IT), Russian (RU), and Turkish (TR). The dictionaries use the same vocabulary across languages and are constructed based on word frequency, to reduce biases known to exist in other datasets (Kementchedjhieva et al., 2019). We evaluate FIPP across all language pairs using supervision dictionaries of size 5K and 1K. On 5K dictionaries, FIPP outperforms other approaches on 22 of 28 language pairs. On 1K dictionaries, FIPP outperforms other approaches on 23 of 28 language pairs when used along with a self-learning framework (denoted FIPP + SL), discussed in Appendix Section C. Our code for FIPP is open-source and available on GitHub. An advantage of FIPP compared to existing methods is its computational efficiency. We provide a runtime comparison of FIPP against the best performing unsupervised (Artetxe et al., 2018c) and supervised (Joulin et al., 2018b) alignment methods on the XLING 5K BLI tasks, along with Proc-B (Glavaš et al., 2019), a supervised method which performs best among the compared approaches on 1K seed dictionaries. The average execution time of alignment over 3 runs of the 'EN-DE' XLING 5K dictionary is provided in Table 4. The implementation used for RCSLS is from Facebook's fastText repository with default parameters, and the VecMap implementation is from the authors' repository, run with and without the 'cuda' flag. Proc-B is implemented from the XLING-Eval repository with default parameters. Hardware specifications are 2 Nvidia GeForce GTX 1080 Ti GPUs, 12 Intel Core i7-6800K processors, and 112GB RAM. In this section, we also compute alignments, using a modification of Procrustes, RCSLS (Joulin et al., 2018b), and FIPP, from an English embedding in R^200 to a German embedding in R^300.
In Figure 2, we plot the spectra of the Gram matrices for the aligned embeddings from each method and for the target German embedding. While the FIPP-aligned embedding is isomorphic to the target German embedding in R^300, the other methods produce a rank-deficient aligned embedding whose spectrum deviates from that of the target. The null space of the aligned source embeddings, for methods other than FIPP, is of dimension d_2 − d_1. We note that issues can arise when learning and transferring models on embeddings from vector spaces of different rank. For regularized models transferred from the source to the target space, at least d_2 − d_1 column features of the target embedding will not be utilized. Meanwhile, models transferred from the target space to the source space will exhibit bias associated with model parameters corresponding to the source embedding's null space. FIPP is able to align embeddings of different dimensionalities to isomorphic vector spaces, unlike competing approaches (Joulin et al., 2018b; Artetxe et al., 2018c). We evaluate BLI performance on embeddings of different dimensionalities for the EN-DE (English, German) language pair. We assume that d_1 ≤ d_2 and align EN embeddings of dimension ∈ {200, 250, 300} to a DE embedding of dimension 300, comparing the performance of FIPP with CCA, implemented in scikit-learn as an iterative estimation of partial least squares, and with the best linear transform with orthonormal rows, Ω* = argmin_{Ω ∈ R^{d_1×d_2}: ΩΩ^T = I_{d_1}} ‖X_s Ω − X_t‖_F, which is equivalent to the Orthogonal Procrustes solution when d_1 = d_2. Both the linear and FIPP methods map X_s and X_t to dimension d_2, while the CCA method maps both embeddings to min(d_1, d_2) dimensions, which may be undesirable. While the performance of all methods decreases as d_2 − d_1 increases, the relative performance gap between FIPP and the other approaches is maintained.

7. CONCLUSION

In this paper, we introduced Filtered Inner Product Projection (FIPP), a method for aligning multiple embeddings to a common representation space using pairwise inner product information. FIPP accounts for semantic shifts and aligns embeddings only on common portions of their geometries. Unlike previous approaches, FIPP aligns embeddings to equivalent rank vector spaces regardless of their dimensions. We provide two methods for finding approximate solutions to the FIPP objective and show that it can be efficiently solved even in the case of large seed dictionaries. We evaluate FIPP on the task of bilingual lexicon induction using the XLING (5K) dataset and the XLING (1K) dataset, on which it achieves state-of-the-art performance on most language pairs. Our method provides a novel efficient approach to the problem of shared representation learning.

A FULL BLI EXPERIMENTATION -XLING (1K) AND XLING (5K)

In Table 6 below, we provide experimental results for FIPP using 1K seed dictionaries. Unlike in the case of a 5K supervision set, FIPP is outperformed by the bootstrapped Procrustes method (Glavaš et al., 2019) and the unsupervised VecMap approach (Artetxe et al., 2018c). However, the addition of a self-learning framework, detailed in Section C, to FIPP (FIPP + SL) results in performance greater than the compared methods, albeit at the cost of close to an 8x increase in computation time, from 23s to 176s, and the requirement of a GPU. Other well-performing methods for XLING 1K, Proc-B (Glavaš et al., 2019) and VecMap (Artetxe et al., 2018c), also use self-learning frameworks; further analysis is required to understand the importance of self-learning frameworks in the case of small (or no) seed dictionaries. The methods compared against were originally proposed in: VecMap (Artetxe et al., 2018c), MUSE (Lample et al., 2018), ICP (Hoshen & Wolf, 2018), CCA (Faruqui & Dyer, 2014), GWA (Alvarez-Melis & Jaakkola, 2018), PROC (Mikolov et al., 2013a), PROC-B (Glavaš et al., 2019), DLV (Ruder et al., 2018), and RCSLS (Joulin et al., 2018b).

Table 9: Top 10 pairings by cosine similarity when using a Self-Learning framework on the English-French language pair.

C SELF-LEARNING FRAMEWORK

Let X_s^D ∈ R^{c×d_1} and X_t^D ∈ R^{c×d_2} be the source and target embeddings for pairs in the seed dictionary D. Additionally, X_s ∈ R^{n×d_1} and X_t ∈ R^{m×d_2} are the source and target embeddings for the entire vocabulary. Assume all vectors have been normalized to have an ℓ_2 norm of 1. Each source vector X_{s,i} and target vector X_{t,j} can be rewritten as c-dimensional row vectors of inner products with their corresponding seed dictionaries: A_{s,i} = X_{s,i} (X_s^D)^T ∈ R^{1×c} and A_{t,j} = X_{t,j} (X_t^D)^T ∈ R^{1×c}. For each source word i, we compute the target word j with the greatest cosine similarity, equivalent to the inner product for vectors with norm of 1, in this c-dimensional space: (i, j) = argmax_j A_{s,i} A_{t,j}^T.
For XLING BLI 1K experiments, we find the 14K (i, j) pairs with the largest cosine similarity and augment our seed dictionaries with these pairs. In Table 7 , the top 10 translation pairings, sorted by cosine similarity, obtained using this self-learning framework are shown for the English to French (EN-FR) language pair.
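The augmentation step above can be sketched as follows; thresholding, batching, and the exact normalization of the inner-product vectors are simplified relative to the paper (we use raw inner products of the c-dimensional profiles as the similarity):

```python
import numpy as np

def self_learning_pairs(Xs_dict, Xt_dict, Xs_full, Xt_full, n_pairs=100):
    """Augment the seed dictionary with high-confidence induced pairs.

    Words are re-represented as c-dimensional vectors of inner products
    with their seed dictionary; the highest-similarity (source, target)
    matches are returned as new training pairs.
    """
    def unit(X):
        return X / np.linalg.norm(X, axis=1, keepdims=True)

    A_s = unit(Xs_full) @ unit(Xs_dict).T        # (n, c) source profiles
    A_t = unit(Xt_full) @ unit(Xt_dict).T        # (m, c) target profiles
    sim = A_s @ A_t.T                            # (n, m) pair similarities
    best_j = sim.argmax(axis=1)                  # best target per source
    best_sim = sim[np.arange(sim.shape[0]), best_j]
    top = np.argsort(best_sim)[::-1][:n_pairs]   # most confident sources
    return [(int(i), int(best_j[i])) for i in top]
```

The returned index pairs would then be appended to the seed dictionary before re-running the alignment.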

D COMPARISON OF FIPP SOLUTION TO ORTHOGONAL ALIGNMENTS

In this section, we conduct experimentation to quantify the degree to which the FIPP alignment differs from an orthogonal solution and compare performance on monolingual tasks before and after FIPP alignment.

D.1 DEVIATION OF FIPP FROM THE CLOSEST ORTHOGONAL SOLUTION

For each language pair, we quantify the deviation of FIPP from an orthogonal solution by first calculating the FIPP alignment before rotation, X̃_s, on the seed dictionary. We then compute the relative deviation of the FIPP alignment from the closest orthogonal alignment of the original embedding X_s. This is equal to D = ‖X_s Ω* − X̃_s‖_F / ‖X̃_s‖_F, where Ω* = argmin_{Ω∈O(d_2)} ‖X_s Ω − X̃_s‖_F. The average of these deviations is 0.292 for 1K seed dictionaries and 0.115 for 5K seed dictionaries. Additionally, the 3 language pairs with the largest and smallest deviations from orthogonal solutions are presented in the table below. We find that in most cases, small deviations from orthogonal solutions are observed between languages in the same language family (e.g. Indo-European -> Indo-European), while pairs in different language families tend to have larger deviations (e.g. Turkic -> Indo-European). A notable exception to this observation is English and Finnish, which belong to different language families, Indo-European and Uralic respectively, yet have small deviations in their FIPP solution compared to an orthogonal alignment.

Lang. Pair (1K)            ℓ2 Deviation   Lang. Pair (5K)            ℓ2 Deviation
German-Italian (DE-IT)     0.025          English-Finnish (EN-FI)    0.008
English-Finnish (EN-FI)    0.090          Croatian-Italian (HR-IT)   0.012
Italian-French (IT-FR)     0.100          Croatian-Russian (HR-RU)   0.014
Turkish-French (TR-FR)     0.405          Finnish-French (FI-FR)     0.343
Turkish-Russian (TR-RU)    0.408          Finnish-Croatian (FI-HR)   0.351
Turkish-Finnish (TR-FI)    0.494          Turkish-Finnish (TR-FI)    0.384

In FIPP, Inner Product Filtering is used to find common geometric information by comparing pairwise distances between a source and target language. To illustrate this step with translation word pairs, the table below shows the 5 words with the largest and smallest fraction of zeros, i.e. the "least and most filtered", in the binary filtering matrix ∆ during alignment between English (En) and Italian (It). The words which are least filtered tend to have an individual word sense, e.g. proper nouns, while those which are most filtered are somewhat ambiguous translations. For instance, while the English word "securing" can be translated to the Italian word "fissaggio", depending on the context the Italian words "garantire", "assicurare", or "fissare" may be more appropriate.

As FIPP does not perform an orthogonal transform, it modifies the inner products of word vectors in the source embedding, which can impact monolingual task accuracy. We evaluate the aligned embedding learned using FIPP, X̃_s, on monolingual word analogy tasks and compare these results to the original fastText embeddings X_s. In Table 3, we compare monolingual English word analogy results for English embeddings X̃_s which have been aligned to a Turkish target embedding using FIPP. Evaluation of the aligned and original source embeddings on multiple English word analogy experiments shows that aligned FIPP embeddings retain performance on monolingual analogy tasks.
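A sketch of the Inner Product Filtering step as we understand it from the description above; the thresholding rule and the ε hyperparameter name are assumptions.

```python
import numpy as np

def inner_product_filter(Xs_seed, Xt_seed, eps=0.05):
    """Binary filtering matrix ∆: keep (i, j) entries where the source and
    target seed Gram matrices agree to within eps (assumed criterion)."""
    Gs = Xs_seed @ Xs_seed.T
    Gt = Xt_seed @ Xt_seed.T
    return (np.abs(Gs - Gt) <= eps).astype(float)

def filtered_fraction(Delta):
    """Per-word fraction of zeroed entries, used to rank the
    'most filtered' and 'least filtered' words."""
    return 1.0 - Delta.mean(axis=1)
```

`filtered_fraction` gives, for each seed word, the share of its pairwise inner products that were discarded, which is the quantity ranked in the word-level table.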

E EFFECT OF HYPERPARAMETERS ON BLI PERFORMANCE

In our experiments, we tune the hyperparameters ε and λ, which control the level of discrimination in the inner product filtering step and the weight of the transfer loss, respectively. When tuning, we account for the sparsity of ∆ by scaling λ in the transfer loss by a factor γ.

F COMPARISON OF SGD AND LOW RANK APPROXIMATION SOLUTIONS

As detailed in Section 3, solutions to FIPP can be calculated either using Low Rank Approximations or Stochastic Gradient Descent (SGD). In this section, we show the error on the FIPP objective for SGD trained over 10,000 epochs on the alignment of a Finnish (FI) embedding to a French (FR) embedding. The Adam (Kingma & Ba, 2015) optimizer is used with a learning rate of 1e-3, and the variable being optimized, X̃_s^SGD, is initialized to the original Finnish embedding X_s. We find that the SGD solution X̃_s^SGD

G EFFECTS OF PREPROCESSING ON INNER PRODUCT DISTRIBUTIONS

We plot the distributions of inner products, entries of X_s X_sᵀ and X_t X_tᵀ, for English (En) and Turkish (Tr) words in the XLING 5K training dictionary, before and after preprocessing, in the figure below. All embeddings used in experimentation are fastText word vectors trained on Wikipedia (Bojanowski et al., 2017). Since inner products between X_s and X_t are compared directly, the isotropic preprocessing utilized in FIPP is necessary to remove biases caused by variations in scaling, shifts, and point density across embedding spaces.

H COMPLEXITY ANALYSIS

In this section, we provide the computational and space complexity of the FIPP method, as described in the paper, for computing X̃_s. We split the complexity by step and leave out steps (e.g. preprocessing) which do not contribute significantly to the runtime or memory footprint. Let MM(m, n, p) denote the computational complexity of multiplying two matrices A1 ∈ R^(m×n) and A2 ∈ R^(n×p). The complexity of the alignment using the Orthogonal Procrustes solution and the Least Squares Projection is as follows:

Space Complexity = 3c² (U, Σ, Vᵀ) + cn (S) + nd2 (X̃_sᵀ)

Time Complexity = MM(c, d2, d2) + MM(d2, d2, d2) (X̃_s V Uᵀ) + d2³ (SVD of X_tᵀ X̃_s) + MM(c, d1, n) + MM(c, d2, c) + c³ + MM(d2, c, n) (Least Squares)
≤ 2(c d2² + (3/2) d2³ + c d1 n + c² d2 + c d2 n) + c³

H.3 DISCUSSION

In performing our analysis, we note that the majority of operations are quadratic in the training set size c. While we incur a time complexity of O(c³) during the Least Squares Projection, due to the matrix inversion in the normal equations, this inversion is a one-time cost. The space complexity of FIPP is O(c²), which is tractable as c is at most 5K. Empirically, FIPP is fast to compute, taking less than 30 seconds for a seed dictionary of size 5K, which is more efficient than competing methods.

I RELATED WORKS: UNSUPERVISED ALIGNMENT METHODS

Cao et al.
(2016) studied aligning the first two moments of sets of embeddings under Gaussian distribution assumptions. In Zhang et al. (2017), an alignment between embeddings of different languages is found by matching distributions using an adversarial autoencoder with an orthogonal regularizer. Artetxe et al. (2017) proposes an alignment approach which jointly bootstraps a small seed dictionary and learns an alignment in a self-learning framework. Hoshen & Wolf (2018) first projects embeddings to a subspace spanned by the top p principal components and then learns an alignment by matching the distributions of the projected embeddings. Grave et al. (2019) proposes an unsupervised variant of the Orthogonal Procrustes alignment which jointly learns an orthogonal transform Ω ∈ R^(d×d) and an assignment matrix P ∈ {0, 1}^(n×n) to minimize the Wasserstein distance between the embeddings subject to an unknown rotation. Three approaches utilize the Gram matrices of embeddings in computing alignment initializations and matchings. Artetxe et al. (2018b) studied the alignment of dissimilar language pairs using a Gram matrix based initialization and robust self-learning. Alvarez-Melis & Jaakkola (2018) proposed an Optimal Transport based approach using the Gromov-Wasserstein distance, GW(C, C', p, q) = min_{Γ ∈ Π(p,q)} Σ_{i,j,k,l} L(C_ik, C'_jl) Γ_ij Γ_kl, where C, C' are Gram matrices for normalized embeddings and Γ is an assignment. Aldarmaki et al. (2018) learns an unsupervised linear mapping between a source and target language with a loss at each iteration equal to the sum of squared differences between the proposed source and target Gram matrices.

J ABLATION STUDIES

We provide an ablation study to quantify the improvements associated with the three modifications of our alignment approach relative to a standard Procrustes alignment: (i) isotropic pre-processing (IP), (ii) inner product filtering (IPF), and (iii) the weighted Procrustes objective (WP), on the XLING 1K and 5K BLI tasks.
For seed dictionaries of size 1K, the improvements associated with each component of FIPP are approximately equal, while for seed dictionaries of size 5K, a larger improvement is obtained from isotropic pre-processing than from inner product filtering or the weighted Procrustes alignment.

J.2 PREPROCESSING ABLATION

We perform an ablation study to understand the impact of preprocessing on XLING 1K and 5K BLI performance, both for Procrustes and FIPP. Additionally, we compare the isotropic preprocessing used in our previous experimentation with iterative normalization, a well-performing preprocessing method proposed by Zhang et al. (2017). For training dictionaries of size 1K, iterative normalization and isotropic preprocessing before running Procrustes result in equivalent aggregate MAP performance of 0.316; iterative normalization and isotropic preprocessing each achieve better performance on 11 of 28 and 12 of 28 language pairs, respectively. Utilizing isotropic preprocessing before running FIPP results in higher aggregate MAP performance for 1K training dictionaries (0.344) when compared to iterative normalization (0.331). In this setting, isotropic preprocessing achieves better performance on all 28 language pairs.
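For reference, a minimal sketch of the iterative normalization baseline, which alternates length normalization and mean centering until the embeddings are approximately unit-length and zero-mean (the iteration count is an assumption; the original method iterates to convergence):

```python
import numpy as np

def iterative_normalization(X, iters=50):
    """Alternate row length normalization and mean centering.
    After convergence, rows are approximately unit norm and the
    column means are zero."""
    X = X.copy().astype(float)
    for _ in range(iters):
        X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit length
        X -= X.mean(axis=0, keepdims=True)             # zero mean
    return X
```

Because centering slightly perturbs row norms, a single pass is not enough; the alternation converges quickly in practice.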

J.3 WEIGHTED PROCRUSTES ABLATION

In order to measure the effect of a Weighted Procrustes rotation on BLI performance, we perform an ablation on both the 1K and 5K XLING BLI datasets against the standard Procrustes formulation. For training dictionaries of size 1K, Weighted Procrustes achieves marginally better aggregate MAP performance than standard Procrustes (0.344 vs. 0.336) and provides improved MAP performance on 27 of 28 language pairs. For training dictionaries of size 5K, Weighted Procrustes achieves marginally better aggregate MAP performance than standard Procrustes (0.442 vs. 0.440) and provides improved MAP performance on 21 of 28 language pairs.
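A weighted Procrustes rotation admits the same closed-form SVD solution as the standard one, applied to a weighted cross-covariance. A sketch, where the per-pair weights w are an assumption for illustration (FIPP's actual weighting is defined in the main text):

```python
import numpy as np

def weighted_procrustes(Xs, Xt, w):
    """Minimize sum_i w_i * ||Xs_i Ω − Xt_i||² over orthogonal Ω.
    Solved by an SVD of the weighted cross-covariance Xsᵀ diag(w) Xt."""
    M = Xs.T @ (w[:, None] * Xt)
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt
```

With all weights equal this reduces to the standard orthogonal Procrustes solution.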



https://github.com/vinsachi/FIPPCLE
https://github.com/facebookresearch/fastText/tree/master/alignment
https://github.com/artetxem/vecmap
https://github.com/codogogo/xling-eval
https://scikit-learn.org/stable/modules/generated/sklearn.cross_decomposition.CCA.html



Figure 2: Spectrum of Aligned EN Embeddings ∈ R^200 and DE Embeddings ∈ R^300. The spectra of the Original, RCSLS, and Linear EN embeddings are all approximately equivalent.

Figure 3: Comparison of FIPP Objective Loss for (FI-FR) for solutions obtained using SGD vs LRA

approaches the error of the Low Rank Approximation X̃_s^LRA, which is the global minimum of the FIPP objective, as shown in Figure 3, but is not equivalent. While the deviation between the SGD and Low Rank Approximation solutions, D_SGD,LRA, is smaller than the deviation between the original embedding and the Low Rank Approximation, we note that the solutions are close but not equivalent.
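A toy NumPy sketch of gradient descent on a FIPP-style objective. The loss form below, min over X of ||∆ ∘ (XXᵀ − Gt)||_F² + λ||(1−∆) ∘ (XXᵀ − Gs)||_F², and the use of plain gradient descent rather than the paper's Adam optimizer, are both assumptions for illustration.

```python
import numpy as np

def fipp_gd(Xs_seed, Xt_seed, Delta, lam=1.0, lr=1e-3, epochs=500):
    """Gradient descent on an assumed FIPP-style Gram-matching loss,
    initialized at the original source embedding. Delta is a symmetric
    binary filtering matrix."""
    Gs = Xs_seed @ Xs_seed.T
    Gt = Xt_seed @ Xt_seed.T
    X = Xs_seed.copy().astype(float)
    for _ in range(epochs):
        # Weighted residual of the two Gram-matching terms.
        R = Delta * (X @ X.T - Gt) + lam * (1 - Delta) * (X @ X.T - Gs)
        # Gradient of the squared Frobenius loss w.r.t. X (Delta symmetric).
        X -= lr * 4 * R @ X
    return X
```

As in the discussion above, such an iterative solution approaches, but need not exactly match, the closed-form low-rank minimizer.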

Figure 4: Gram matrix entries -unprocessed fast-Text embeddings Figure 5: Gram matrix entries -fastText embeddings with preprocessing

Mean Average Precision (MAP) of alignment methods on a subset of XLING BLI with 1K supervision dictionaries; the retrieval method is nearest neighbors. Benchmark results obtained from Glavaš et al. (2019), in which (‡) PROC was evaluated without preprocessing and RCSLS was evaluated with Centering + Length Normalization (C+L) preprocessing. Full results for XLING 1K can be found in Appendix Table 8.

Mean Average Precision (MAP) of alignment methods on a subset of XLING BLI with 5K supervision dictionaries; the retrieval method is nearest neighbors. Benchmark results obtained from Glavaš et al. (2019), in which (*) Proc-B was reported using a 3K seed dictionary, (‡) PROC was evaluated without preprocessing, and RCSLS was evaluated with (C+L) preprocessing. Full results for XLING 5K can be found in Appendix Table 7.

TED-CLDC micro-averaged F1 scores using a CNN model with embeddings from different alignment methods. Evaluation follows Glavaš et al. (2019); (*) signifies language pairs for which unsupervised methods were unable to yield successful runs.

5.2.2 MNLI - NATURAL LANGUAGE INFERENCE

The multilingual XNLI corpus introduced by Conneau et al. (2018), based on the English-only MultiNLI (Williams et al., 2018), includes 5 of the 8 languages used in BLI: EN, DE, FR, RU, and TR. We perform the same evaluation as Glavaš et al. (2019) by training an ESIM model (Chen et al., 2017) using EN word embeddings from a shared EN-L2 embedding space for L2 ∈ {DE, FR, RU, TR}. The trained model is then evaluated, without further training, on the L2 XNLI test set using L2 embeddings from the shared space. The bootstrapped Procrustes approach (Glavaš et al., 2019) narrowly outperforms other methods, while RCSLS performs worst despite having high BLI accuracy.

MNLI test accuracy using an ESIM model with embeddings from different alignment methods. Evaluation follows Glavaš et al. (2019); (*) signifies language pairs for which unsupervised methods were unable to yield successful runs.

Average alignment time; supervised approaches use a 5K dictionary. FIPP+SL augments the dictionary with an additional 10K samples.




Smallest and Largest Deviations of FIPP from Orthogonal Solution, XLING BLI 1K and 5K.

D.2 EFFECT OF INNER PRODUCT FILTERING ON WORD-LEVEL ALIGNMENT

Most and Least Filtered word pairs during FIPP's Inner Product Filtering for English-Italian alignment.

D.3 MONOLINGUAL TASK PERFORMANCE OF ALIGNED EMBEDDINGS

Monolingual Analogy Task Performance for English embedding before/after alignment to Turkish embedding.

FIPP Ablation study of Mean Average Precision (MAP) on XLING 1K and 5K BLI task.

Mean Average Precision (MAP) of alignment methods on XLING with 5K supervision dictionaries using either Iterative Normalization (IN) (Zhang et al., 2017) or Isotropic Preprocessing (IP).

Mean Average Precision (MAP) of FIPP on XLING 5K and 1K supervision dictionaries using either a Weighted Procrustes (WP) or standard Procrustes (P) rotation.

K BLI PERFORMANCE WITH ALTERNATIVE RETRIEVAL CRITERIA

In order to measure the effect of different retrieval criteria on BLI performance, we perform experimentation using CSLS and Nearest Neighbors on both the 1K and 5K XLING BLI datasets. We find that the CSLS retrieval criterion provides significant performance improvements over Nearest Neighbors search, both for Procrustes and FIPP, on both 1K and 5K training dictionaries. For 1K training dictionaries, CSLS improves aggregate MAP by 0.051 and 0.043 for Procrustes and FIPP, respectively. For 5K training dictionaries, CSLS improves aggregate MAP by 0.049 and 0.031 for Procrustes and FIPP, respectively.
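For reference, CSLS (Lample et al., 2018) re-ranks nearest-neighbor similarities by subtracting each word's average similarity to its k nearest cross-lingual neighbors, mitigating hubness. A sketch assuming unit-norm embedding rows:

```python
import numpy as np

def csls_scores(Xs, Xt, k=10):
    """CSLS(x, y) = 2·cos(x, y) − r_T(x) − r_S(y), where r_T(x) is the mean
    similarity of x to its k nearest target neighbors (and symmetrically
    for r_S). Rows of Xs and Xt are assumed unit-normalized."""
    sims = Xs @ Xt.T                                    # cosine similarities
    k_s = min(k, sims.shape[1]); k_t = min(k, sims.shape[0])
    r_s = np.sort(sims, axis=1)[:, -k_s:].mean(axis=1)  # r_T for each source
    r_t = np.sort(sims, axis=0)[-k_t:, :].mean(axis=0)  # r_S for each target
    return 2 * sims - r_s[:, None] - r_t[None, :]
```

Retrieval then takes the arg max of each row of the returned score matrix instead of the raw cosine similarities.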

Mean Average Precision (MAP) of FIPP and Procrustes on XLING with 1K supervision dictionaries using either CSLS or Nearest Neighbors (NN) retrieval criteria.

Mean Average Precision (MAP) of FIPP and Procrustes on XLING with 5K supervision dictionaries using either CSLS or Nearest Neighbors (NN) retrieval criteria.


Benchmark results obtained from Glavaš et al. (2019), in which (*) Proc-B was reported using a 3K seed dictionary, (‡) PROC was evaluated without preprocessing, and RCSLS was evaluated with Centering + Length Normalization (C+L) preprocessing.

B EFFECTS OF RUNNING MULTIPLE ITERATIONS OF FIPP OPTIMIZATION

Although other alignment approaches (Joulin et al., 2018b; Artetxe et al., 2018c) run multiple iterations of their alignment objective, we find that running multiple iterations of the FIPP optimization does not improve performance. We run between 1 and 5 iterations of the FIPP objective; for both 1K and 5K seed dictionaries, 26 of 28 language pairs perform best with a single iteration. In the case of 1K seed dictionaries, (EN, FI) and (RU, FR) with 2 iterations resulted in MAP performance increases of 0.002 and 0.001, respectively. For 5K seed dictionaries, (EN, FI) with 3 iterations and (EN, FR) with 2 iterations resulted in MAP performance increases of 0.002 and 0.002. Due to the limited set of language pairs on which performance improvements were achieved, these results have not been included in Tables 1 and 6.

C SELF-LEARNING FRAMEWORK

A self-learning framework, used to augment the number of available training pairs without direct supervision, has been found to be effective (Glavaš et al., 2019; Artetxe et al., 2018c) both with small seed dictionaries and in an unsupervised setting. We detail a self-learning framework which improves the BLI performance of FIPP in the XLING 1K setting, but not in the case of 5K seed pairs.

