COMPARING SEMANTIC AND MORPHOLOGICAL ANALOGY COMPLETION IN WORD EMBEDDINGS

Abstract

Word embeddings have prompted great excitement in the NLP community due to their capacity to generalize to unforeseen tasks, including semantic analogy completion. Features such as color and category relationships have been examined in previous work, but to our knowledge this is the first study of the morphological relationships encoded in word embeddings. We construct several natural experiments examining analogy completion across word stems modified by affixes, and find no evidence that Word2Vec, GloVe, or FastText models encode these morphological relationships. A special case of this problem is part-of-speech transformation, and the lack of support for part-of-speech analogies is surprising given other successful cases of semantic inference using word embeddings.

1. INTRODUCTION

Prior work (Fejzo et al., 2018) has shown that morphological learning, including compounding, is key to language acquisition and literacy development in children. When we learn a new language, we quickly memorize known words and derive unknown ones from our knowledge of root words and morphological features. This ability enables language learners to efficiently and accurately comprehend the meanings of an enormous number of words, but can computers do the same? This introduction reviews prevailing approaches to modeling semantic relationships between words, as well as prior work on morphological and analogical analysis. We then propose methods for quantifying the ability of word embeddings to represent semantic morphological features, in the form of analogy-completion tasks. Experiments are run on three common word embeddings and results are summarized.

1.1. PRIOR WORK

Early popular methods for Natural Language Processing (NLP) relied on simple models trained over large datasets, such as N-gram statistical language models. Although such models outperformed many more sophisticated counterparts, they were limited by the amount of obtainable training data, which impeded accurate representation of a large vocabulary. As the corpus size increases, the computational resources required for learning with this simple architecture also grow.

1.1.1. WORD EMBEDDING MODELS

In 2013, a group led by Google scientist Tomas Mikolov (Mikolov et al., 2013a) proposed a novel continuous vector representation method along with two neural network models built for the task: the Continuous Bag-of-Words (CBOW) model and the Continuous Skip-gram model. Mikolov's models capture not only the proximity but also the similarity between related words in the vector space, and many relationships are encoded approximately linearly in the embedding. With this property, simple algebraic operations on word vectors let a computer generate analogies comparable to those produced by humans, and pairwise semantic relationships within a subset of words can be represented using the linear structure of the embedding space. In a subsequent publication, Mikolov (Mikolov et al., 2013b) proposed several optimizations, such as subsampling for the Skip-gram model. To encode idiomatic information, common phrases composed of individual words were also trained as separate entries. The optimized model could train on roughly 100 billion words per day, with a total data size in the billions of words compared to a few million words for other models. Word embedding has since become a central topic in Computational Linguistics and Natural Language Processing. The Word2Vec CBOW variant (Mikolov et al., 2013a) predicts a word from its nearby context words but disregards any additional syntactic information; due to this simplification, CBOW tends to rely on co-occurrence frequency rather than syntax.

GloVe (Pennington et al., 2014) is another word embedding model, first published by Jeffrey Pennington and his team at Stanford University in 2014. Like Word2Vec, GloVe uses a similar unsupervised process, but it differs in combining local word context information with global word co-occurrence statistics.
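The vector-offset view of analogies described above can be sketched with a toy example: to answer "a is to b as c is to ?", return the vocabulary word (excluding the query words) whose vector has the highest cosine similarity to b - a + c. The embedding values below are illustrative assumptions, not vectors from a trained model.

```python
import numpy as np

# Toy embedding table; these vectors are made up for illustration,
# not taken from any trained model.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def complete_analogy(a, b, c, emb):
    """Answer 'a is to b as c is to ?' via the offset vector b - a + c,
    excluding the three query words from the candidate set."""
    target = emb[b] - emb[a] + emb[c]
    candidates = {w: v for w, v in emb.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(complete_analogy("man", "king", "woman", emb))  # -> queen
```

In a real embedding the same query runs over a vocabulary of hundreds of thousands of words, which is why excluding the query words themselves from the candidates matters: the nearest neighbor of b - a + c is very often b or c.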
In contrast to Word2Vec and GloVe, FastText (Joulin et al., 2016) from Facebook AI Research (FAIR) is based on a supervised sentence classification task. The task structure encourages word embeddings that can be summed to derive a sentence-level representation. FastText can train on large datasets relatively quickly and runs at a lower computational cost than prior models.
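FastText's word-representation variant is also notable for this paper's topic because it represents each word as a bag of character n-grams (with boundary markers), so affixes surface as shared subword units. A minimal sketch of that subword extraction, using the commonly cited defaults of n-gram lengths 3 to 6 (the parameter values here are assumptions for illustration):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style subword extraction: wrap the word in boundary
    markers '<' and '>' and emit every character n-gram of length
    n_min..n_max, plus the full bracketed word as its own unit."""
    w = f"<{word}>"
    grams = {w[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)}
    grams.add(w)  # the whole word is kept as a separate unit
    return sorted(grams)

# The suffix morpheme -ing appears as the shared subword "ing>".
print(char_ngrams("walking", n_max=4))
```

Because a word's vector is built from these shared subword units, words such as "walking" and "running" overlap in the unit "ing>", which is one route by which affix information could, in principle, enter the embedding space.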

1.1.2. CONTEXT-SENSITIVE SEQUENTIAL MODELS

Some more recent approaches to general NLP tasks involve context-sensitive word embeddings using the Transformer model, first introduced by Vaswani et al. (2017). Like Recurrent Neural Networks (RNNs), Transformer models are designed to process sequential input, but the key difference is that they utilize the "attention mechanism", which weights the context vectors according to their estimated importance and forms connections within a sentence. This method enhances the encoding of long-term dependencies and improves the overall performance of the model. Due to the high efficiency of the Transformer model, several highly general pre-trained models have been designed, such as BERT (Devlin et al., 2018), GPT-3 (Brown et al., 2020), and ELMo (Peters et al., 2018), for a variety of tasks with fine-tuning. These context-sensitive embeddings are more expressive than the static word-embedding models that map each word (or idiomatic phrase) to a single vector regardless of context, but they are much harder to study semantically, as any metric for word similarity becomes necessarily contextual.
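The importance-weighting step described above can be sketched as scaled dot-product attention: each output position is a weighted average of the value vectors, with weights given by a softmax over query-key dot products. The sequence length and dimension below are arbitrary assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Each output row is a weighted average of the rows of V, with
    weights softmax(Q K^T / sqrt(d_k)); the scaling keeps the logits
    well-conditioned as the key dimension d_k grows."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 positions, key/query dimension 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
# each row of w is a probability distribution over the 4 positions
```

Because the weights depend on the other positions in the sentence, the resulting representation of each token is contextual, which is exactly why a single context-free similarity metric no longer applies.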

1.1.3. SEMANTIC EMBEDDINGS FROM MORPHOLOGICAL FEATURES

Several approaches have been proposed to explicitly encode morphological features into word embedding models. One method is character-based learning on word segments (Cao & Rei, 2016), which divides words into segments and treats morphemes separately. However, the character-based model performed worse on semantic similarity and semantic analogy tasks than traditional word-based models. Another is vector representation by composing characters using bidirectional LSTMs (Ling et al., 2015). This model excels in morphologically rich languages, i.e., languages that indicate part of speech by adding affixes rather than by changing a word's position in the sentence, but learning complicated rules for how characters combine is inefficient on larger corpora.

1.1.4. WORD EMBEDDING ANALOGIES

There is also some prior work examining vector-based analogies in word embedding space. For example, Bolukbasi et al. (2016) examine the semantic quality of gender as a difference vector in the embedding space of w2vNEWS (a variant of Word2Vec). That paper focuses specifically on gender bias in word representation and use, and employs methods similar to ours to explore the semantic representation of gender.
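The difference-vector analysis used in that line of work can be sketched as follows: form a direction from a contrasting word pair, normalize it, and project other words onto it. The two-dimensional vectors below are illustrative assumptions chosen so that the first axis behaves like a gender direction; they are not values from w2vNEWS.

```python
import numpy as np

# Toy vectors (assumptions for illustration, not trained values).
emb = {
    "he":      np.array([ 1.0, 0.2]),
    "she":     np.array([-1.0, 0.2]),
    "actor":   np.array([ 0.8, 0.5]),
    "actress": np.array([-0.8, 0.5]),
    "table":   np.array([ 0.0, 0.9]),
}

# A semantic direction as the normalized difference of a word pair.
gender_dir = emb["he"] - emb["she"]
gender_dir = gender_dir / np.linalg.norm(gender_dir)

# Projection onto the direction: large magnitude suggests the word is
# strongly associated with one pole, near zero suggests neutrality
# (under this toy geometry).
for word, vec in emb.items():
    print(f"{word:8s} {vec @ gender_dir:+.2f}")
```

Our morphological experiments apply the same idea with affix pairs (e.g. a stem and its affixed form) in place of gendered word pairs.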

