FASTER TRAINING OF WORD EMBEDDINGS

Abstract

Word embeddings have gained increasing popularity in recent years due to the Word2vec library and its extension fastText, which uses subword information. In this paper, we aim to improve the execution speed of fastText training on homogeneous multi- and manycore CPUs while maintaining accuracy. We present a novel open-source implementation that flexibly incorporates various algorithmic variants, including negative sample sharing, batched updates, and a byte-pair encoding-based alternative for subword units. We build these novel variants on top of a fastText implementation that we carefully optimized for the architecture, memory hierarchy, and parallelism of current manycore CPUs. Our experiments on three languages demonstrate a 3-20× speed-up in training time at competitive semantic and syntactic accuracy.

1. INTRODUCTION

Word embeddings have a long history (Rumelhart et al., 1986; Bengio et al., 2003; Collobert & Weston, 2008), but have received much attention in recent years due to word2vec (Mikolov et al., 2013) and its computationally efficient implementation via skip-gram with negative sampling. Word embeddings capture contextual relationships between words and have become a standard input representation for the majority of NLP tasks, benefiting, e.g., classification (Joulin et al., 2016; Deriu et al., 2017) or machine translation (Jansen, 2017; Conneau et al., 2017). More recently, state-of-the-art results on many language understanding tasks have been achieved by deep transformer architectures such as BERT (Devlin et al., 2019), which, however, are very compute-intensive both at training and inference time, even with pre-trained models and a reduced parameter space. Thus, simpler and more lightweight static word embeddings such as fastText (Bojanowski et al., 2017) are still widely used, due to their fast execution, comparable results on particular tasks (Tseng et al., 2019), and ability to produce a single vector per word, which aids interpretability and search index construction in information retrieval.

Contributions.

In this paper, we present algorithmic and code optimization techniques to improve the training time of word2vec and fastText embeddings on modern general-purpose multicore and manycore computers. We present an optimized open-source implementation of word2vec and fastText that encapsulates a number of algorithmic variants, including negative sample sharing, batched updates, and subword units based on a byte-pair encoding approach. Our extensive evaluation on three languages shows that the best combinations of optimizations speed up training time by 2.7-20.6× while maintaining accuracy on selected NLP tasks.

2. WORD EMBEDDINGS

Word2vec. Word2vec is built upon a simple bilinear regression model trained on word cooccurrence, resulting in numerical feature representations as floating-point vectors of dimensionality d. Given a word in a sentence, the goal of the algorithm is to maximize the likelihood of predicting the surrounding (context) words. To achieve this, the model is trained to increase the probability of predicting particular words if they appear close to a given current word in the training corpus. A popular variant also decreases the probability of predicting words that do not appear close to the current word (negative sampling (Mikolov et al., 2013; Goldberg & Levy, 2014)). During training, the algorithm processes the corpus in a streaming fashion. Each word w_i (called the current word) is processed together with its surrounding context words {w_{i-C}, ..., w_{i-1}}, {w_{i+1}, ..., w_{i+C}}, where C is the range of the context window.

[Figure 1: Representation of source and target words in the input (M_in) and output (M_out) matrix in fastText (skip-gram). The sentence is "a quick brown fox jumps over the lazy dog", the current word w_i is "brown", and the context window size is 2. The words in the corpus are represented as indices of the corresponding rows in M_in and M_out.]

There are two modes of operation, training the model for the following prediction tasks:
• Skip-gram (SG): predict the target context words using the current word w_i as the source.
• CBOW: predict the target current word w_i using the context words as the source.
Each word w in the vocabulary of size V is represented as a source w_s by one row in the V × d input matrix M_in containing word embeddings, and as a target w_t by one row in the V × d output matrix M_out that is used to calculate the training objective function.
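The skip-gram update with negative sampling can be illustrated with a minimal numpy sketch (not the paper's implementation; matrix names M_in/M_out follow the text above, all hyperparameter values are arbitrary): one source row of M_in is pushed towards the target row of M_out and away from a few randomly drawn negative rows, using the binary logistic loss.

```python
import numpy as np

rng = np.random.default_rng(0)

V, d = 1000, 32   # vocabulary size and embedding dimensionality
lr = 0.05         # learning rate
M_in = rng.uniform(-0.5 / d, 0.5 / d, (V, d)).astype(np.float32)
M_out = np.zeros((V, d), dtype=np.float32)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sg_update(source, target, negatives):
    """One skip-gram SGD step: increase the inner product of the
    (source, target) pair, decrease it for (source, negative) pairs."""
    h = M_in[source]                        # hidden layer = source row
    grad_h = np.zeros_like(h)
    for word, label in [(target, 1.0)] + [(n, 0.0) for n in negatives]:
        score = sigmoid(h @ M_out[word])    # predicted probability
        g = lr * (label - score)            # logistic-loss gradient scale
        grad_h += g * M_out[word]           # accumulate gradient for h
        M_out[word] += g * h                # update the output row
    M_in[source] += grad_h                  # update the source row last

sg_update(source=3, target=7, negatives=[11, 42, 99])
```

After the call, only the rows touched by this (source, target, negatives) tuple have changed, which is why the asynchronous "Hogwild" scheme described below works well in practice: concurrent updates rarely collide on the same rows.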
The goal of the optimization is to maximize the inner products (minimize the difference) of real pairs of source current words with the target context words, or vice versa, using the binary logistic loss. This approach can be improved by negative sampling, where the algorithm additionally maximizes the difference between the source current words and words picked randomly from outside the source's context. Training is performed using stochastic gradient descent (SGD). SGD is performed in parallel with p threads by splitting the training corpus into p parts and processing them asynchronously (the "Hogwild" approach (Recht et al., 2011)). The final embedding of each word is its corresponding row in M_in; M_out is discarded at the end of training.

FastText. FastText (Bojanowski et al., 2017) improves word2vec by utilizing subwords of source words during training. A typical run of fastText uses subwords of lengths k = 3 ... 6, with the delimiters '<' and '>' added at the word boundaries. For example, for the word paris and k = 3 the subwords are: <pa, par, ari, ris, is>. In fastText, the embeddings M_in are extended to contain rows representing both entire words and hashes of all their subwords. Additionally, the representation of the entire word is added to the set of its subwords. During the execution of the algorithm, the hidden layer h is built by averaging the vectors in M_in representing the source word's subwords. The final vector embedding for each word is obtained in the same way. M_out remains unchanged. Fig. 1 shows an example of how the word vectors are stored and accessed. A single update is described in Alg. 1.
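The subword construction above can be sketched in a few lines of Python (an illustration, not fastText's C++ code; in the real library the n-grams are additionally hashed into a fixed number of buckets to index rows of M_in):

```python
def subwords(word, kmin=3, kmax=6):
    """Character n-grams of the delimited word, plus the whole
    delimited word itself, as described for fastText above."""
    w = "<" + word + ">"
    grams = [w[i:i + k]
             for k in range(kmin, kmax + 1)
             for i in range(len(w) - k + 1)]
    return grams + [w]

print(subwords("paris", 3, 3))
# -> ['<pa', 'par', 'ari', 'ris', 'is>', '<paris>']
```

The delimiters let the model distinguish a prefix or suffix from the same character sequence occurring mid-word (e.g., '<pa' vs. 'par').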

Related work.

FastText has been implemented as part of the popular Gensim library (Rehurek & Sojka, 2011) using Cython and the machine's default BLAS library (e.g., Intel MKL) for algebraic operations. In our experiments we found this code memory-expensive and slow: training 5 epochs on a 1 GB English Wikipedia dump with 24 threads took approximately 11 hours on a Knights Landing CPU, about 10 times slower than the original fastText. Therefore, we use the original code provided by Facebook Research (2016a) as the baseline in all our experiments. For skip-gram with negative sampling, pWord2Vec (Ji et al., 2016) transforms the "Hogwild" approach into "Hogbatch" by performing updates on multiple context words at once (effectively turning a series of dot products into a matrix-matrix operation) and sharing negative samples across the entire batch. We employ similar techniques in our implementation. Rengasamy et al. (2017) extend this approach with context combining, where multiple contexts can share a set of negative samples and be updated all at once. We do not adopt this approach, as it requires careful preprocessing that is rewarded by only a relatively small speedup over pWord2Vec.
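The batching idea behind "Hogbatch" can be sketched as follows (a simplified numpy illustration under our own naming, not pWord2Vec's actual code): stacking the batch of positive context rows together with one shared set of negative rows turns batch × (1 + n_neg) independent dot products into a single matrix-vector product, which vectorizes well on wide SIMD units.

```python
import numpy as np

rng = np.random.default_rng(1)
d, batch, n_neg, lr = 32, 8, 5, 0.05
h = rng.standard_normal(d).astype(np.float32)                   # source (hidden) vector
targets = rng.standard_normal((batch, d)).astype(np.float32)    # positive context rows
negatives = rng.standard_normal((n_neg, d)).astype(np.float32)  # negatives shared by the batch

# One stacked operand: all scores come from a single matrix-vector
# product instead of batch * (1 + n_neg) separate dot products.
W = np.vstack([targets, negatives])                  # (batch + n_neg, d)
scores = 1.0 / (1.0 + np.exp(-(W @ h)))              # sigmoids of all inner products
labels = np.concatenate([np.ones(batch, np.float32),
                         np.zeros(n_neg, np.float32)])
coef = lr * (labels - scores)                        # per-row logistic gradients
grad_h = coef @ W                                    # accumulate before modifying W
W += np.outer(coef, h)                               # rank-1 update of all rows at once
h += grad_h
```

With several source words batched as well, the matrix-vector product becomes a matrix-matrix product, matching the description of pWord2Vec above.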

