COMPARING SEMANTIC AND MORPHOLOGICAL ANALOGY COMPLETION IN WORD EMBEDDINGS

Abstract

Word embeddings have prompted great excitement in the NLP community due to their capacity to generalize to unforeseen tasks, including semantic analogy completion. Features such as color and category relationships have been examined by previous work, but this is the first study to consider the morphological relationships encoded in word embeddings. We construct several natural experiments examining analogy completion across word stems modified by affixes, and find no evidence that Word2Vec, GloVe, or fastText models encode these morphological relationships. We note that part-of-speech transformation is a special case of this problem, and that the lack of support for part-of-speech analogies is surprising in the context of other successful cases of semantic inference using word embeddings.

1. INTRODUCTION

Prior work (Fejzo et al., 2018) has shown that morphological learning is key to language acquisition and literacy development in children. When we learn a new language, we quickly memorize known words and derive unknown ones from our knowledge of root words and morphological features. This ability enables language learners to efficiently and accurately comprehend the meanings of an enormous number of words; can computers do the same? This introduction reviews prevailing approaches to modeling semantic relationships between words, as well as prior work on morphological and analogical analysis. We then propose methods for quantifying the ability of word embeddings to represent semantic morphological features, in the form of analogy-completion tasks. Experiments are run on three common word embeddings and the results are summarized.
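To make the proposed task concrete, the following sketch shows one way such morphological analogy questions could be constructed from stem pairs and shared affixes; the stems, affixes, and the helper `build_questions` are hypothetical illustrations, not the paper's actual dataset or code.

```python
# Hypothetical construction of morphological analogy questions: each item
# pairs two stems transformed by the same affix, yielding the query
# "stem1 : stem1+affix :: stem2 : ?" whose expected answer is stem2+affix.
stem_pairs = [("walk", "jump"), ("play", "work")]
affixes = ["ing", "ed", "er"]

def build_questions(stem_pairs, affixes):
    questions = []
    for s1, s2 in stem_pairs:
        for affix in affixes:
            # (a, b, c, d): "a is to b as c is to d", d held out as the answer
            questions.append((s1, s1 + affix, s2, s2 + affix))
    return questions

for a, b, c, d in build_questions(stem_pairs, affixes):
    print(f"{a} : {b} :: {c} : {d}")
```

Regular stems are used here so that simple concatenation produces valid surface forms; a real test set would also need to handle spelling changes (e.g. doubling of final consonants).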

1.1. PRIOR WORK

Early popular methods for Natural Language Processing (NLP) relied on simplistic models trained over large datasets, such as the N-gram model for statistical language modeling. Although such models outperform many more sophisticated counterparts, they are limited by the amount of obtainable training data, which impedes accurate representation over a large vocabulary. As the corpus size increases, the computational resources required for learning with this simplistic architecture also grow.
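For reference, the N-gram approach mentioned above can be sketched in a few lines; this is a minimal bigram model with maximum-likelihood estimates over a toy corpus, purely for illustration.

```python
from collections import Counter

# Toy corpus; real N-gram models are trained over much larger datasets.
corpus = "the cat sat on the mat the cat ran".split()

# Count bigrams and the preceding-word frequencies they are conditioned on.
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def p(next_word, prev_word):
    """Maximum-likelihood estimate of P(next_word | prev_word)."""
    return bigrams[(prev_word, next_word)] / unigrams[prev_word]

print(p("cat", "the"))  # "the" is followed by "cat" in 2 of its 3 occurrences
```

The data-sparsity limitation is visible even here: any bigram unseen in training receives probability zero, which is why larger corpora (and smoothing) are needed as vocabulary grows.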

1.1.1. WORD EMBEDDING MODELS

In 2013, a group led by Google scientist Tomas Mikolov (Mikolov et al., 2013a) proposed a novel continuous vector representation method along with two neural network models built for the task: the Continuous Bag-of-Words (CBOW) model and the Continuous Skip-gram model. Mikolov's models capture not only the proximity but also the similarity between related words in the vector space, and many linear regularities are encoded in the word embedding. With this property, simple algebraic operations on word vectors allow a computer to complete analogies comparably to humans; for example, vector("king") - vector("man") + vector("woman") lies close to vector("queen"). Pairwise semantic relationships within a subset of words can also be represented using the linear properties of the embedding space.
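The vector-arithmetic analogy completion described above can be sketched as follows; the 3-dimensional toy vectors are hypothetical values chosen only to make the example self-contained, not actual trained embeddings.

```python
import numpy as np

# Toy 3-d embeddings (hypothetical values for illustration only).
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def complete_analogy(a, b, c, emb):
    """Answer "a is to b as c is to ?" by maximizing cos(d, b - a + c),
    excluding the three query words themselves, as is standard."""
    target = emb[b] - emb[a] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

print(complete_analogy("man", "king", "woman", emb))  # → queen
```

The same query mechanism, applied to pairs like ("walk", "walking"), is what the morphological experiments in this paper evaluate.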

