FASTER TRAINING OF WORD EMBEDDINGS

Abstract

Word embeddings have gained increasing popularity in recent years due to the word2vec library and its extension fastText, which uses subword information. In this paper, we aim at improving the execution speed of fastText training on homogeneous multi- and manycore CPUs while maintaining accuracy. We present a novel open-source implementation that flexibly incorporates various algorithmic variants, including negative sample sharing, batched updates, and a byte-pair encoding-based alternative for subword units. We build these variants on top of a fastText implementation that we carefully optimized for the architecture, memory hierarchy, and parallelism of current manycore CPUs. Our experiments on three languages demonstrate a 3-20× speed-up in training time at competitive semantic and syntactic accuracy.

1. INTRODUCTION

Word embeddings have a long history (Rumelhart et al., 1986; Bengio et al., 2003; Collobert & Weston, 2008), but have received much attention in recent years due to word2vec (Mikolov et al., 2013) and its computationally efficient implementation via skip-gram with negative sampling. Word embeddings capture contextual relationships between words and have become a standard input representation for the majority of NLP tasks, benefitting, e.g., classification (Joulin et al., 2016; Deriu et al., 2017) or machine translation (Jansen, 2017; Conneau et al., 2017). More recently, state-of-the-art results on many language understanding tasks were achieved by deep transformer architectures such as BERT (Devlin et al., 2019), which, however, are very compute-intensive both at training and inference time, even with pre-trained models and a reduced parameter space. Thus, simpler and more lightweight static word embeddings such as fastText (Bojanowski et al., 2017) are still widely used, due to their fast execution, comparable results for particular tasks (Tseng et al., 2019), and ability to produce a single vector per word, which helps in information retrieval with interpretability and search index construction.

Contributions. In this paper, we present algorithmic and code optimization techniques to improve the training time of word2vec and fastText embeddings on modern general-purpose multicore and manycore computers. We present an optimized open-source implementation of word2vec and fastText that encapsulates a number of algorithmic variants, including negative sample sharing, batched updates, and subword units based on a byte-pair encoding approach. Our extensive evaluation on three languages shows that the best combinations of optimizations speed up training time by 2.7-20.6× while maintaining accuracy on selected NLP tasks.

2. WORD EMBEDDINGS

Word2vec. Word2vec is built upon a simple bilinear regression model trained on word co-occurrence, resulting in numerical feature representations as floating point vectors of dimensionality d. Given a word in a sentence, the goal of the algorithm is to maximize the likelihood of predicting surrounding (context) words. To achieve this, the model is trained to increase the probability of predicting particular words if they appear close to a given current word in the training corpus. A popular variant also decreases the probability of predicting words that do not appear close to the current word (negative sampling (Mikolov et al., 2013; Goldberg & Levy, 2014)). During training, the algorithm processes the corpus in a streaming fashion. Each word w_i (called the current word) is processed together with its surrounding context words {w_{i-C}, ..., w_{i-1}}, {w_{i+1}, ..., w_{i+C}}, where C is the range of the context window. There are two modes of operation, training the model for the following prediction tasks:
• Skip-gram (SG): predict target context words using the current word w_i as the source.
• CBOW: predict the target current word w_i using context words as the source.
Each word w in the vocabulary of size V is represented as a source w_s by one row in the V × d input matrix M_in containing word embeddings, and as a target w_t by one row in the V × d output matrix M_out that is used to calculate the training objective function. The goal of the optimization is to maximize the inner products (minimize the difference) of real pairs of source current words with the target context words, or vice versa, using the binary logistic loss.
This approach can be improved by the use of negative sampling, where the algorithm additionally maximizes the difference between the source current words and words picked randomly from outside the source's context. Training is performed using stochastic gradient descent (SGD). SGD is performed in parallel with p threads by splitting the training corpus into p parts and processing them asynchronously (the "Hogwild" approach (Recht et al., 2011)). The final embedding of each word is its corresponding row in M_in; M_out is discarded at the end of training.
FastText. FastText (Bojanowski et al., 2017) improves word2vec by utilizing subwords of target words during training. A typical run of fastText uses subwords of lengths k = 3...6, containing the delimiters < and > at the word boundaries. For example, for the word paris and k = 3, the subwords are: <pa, par, ari, ris, is>. In fastText, the embeddings M_in are extended to contain rows representing both entire words as well as hashes of all their subwords. Additionally, the representation of the entire word is added to the set of its subwords. During the execution of the algorithm, the hidden layer h is built by averaging vectors in M_in representing the source word's subwords. The final vector embedding for each word is obtained in the same way. M_out remains unchanged. Fig. 1 shows an example of how the word vectors are stored and accessed. A single update is described in Alg. 1.
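The subword extraction just described can be sketched as follows. This is a simplified illustration: the actual fastText implementation hashes each n-gram into a fixed number of buckets rather than storing the strings explicitly.

```python
def subwords(word, kmin=3, kmax=6):
    """Extract fastText-style character n-grams of lengths kmin..kmax,
    with boundary delimiters '<' and '>' added around the word."""
    w = "<" + word + ">"
    grams = [w[i:i + k] for k in range(kmin, kmax + 1)
             for i in range(len(w) - k + 1)]
    # fastText also adds the delimited whole word to the set of subwords
    if w not in grams:
        grams.append(w)
    return grams

print(subwords("paris", 3, 3))  # ['<pa', 'par', 'ari', 'ris', 'is>', '<paris>']
```

Note how the delimiters let the model distinguish a prefix such as <pa from the same trigram occurring word-internally.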

Related work.

FastText has been implemented as a part of the popular Gensim library (Rehurek & Sojka, 2011) using Cython and the machine's default BLAS library (e.g., Intel MKL) for algebraic operations. In our experiments, we found this code memory-expensive and slow: training 5 epochs on a 1 GB English Wikipedia dump with 24 threads took approximately 11 hours on a Knights Landing CPU, about 10 times slower than the original fastText. Therefore, we use the original code provided by Facebook Research (2016a) as the baseline in all our experiments. For skip-gram with negative sampling, pWord2Vec (Ji et al., 2016) transforms the "Hogwild" approach into "Hogbatch" by performing updates on multiple context words at once (effectively turning a series of dot products into a matrix-matrix operation) and sharing negative samples for the entire batch. We employ similar techniques in our implementation. Rengasamy et al. (2017) extend this approach by context combining, where multiple contexts can share a set of negative samples and be updated all at once. We do not adopt this approach, as it requires careful preprocessing rewarded by only a relatively small speedup over pWord2Vec.


Algorithm 1: A single iteration of the original fastText algorithm. In skip-gram, it is performed on each current-context word pair (as source-target). In CBOW, all context words are used as source words at the same time.

Word2vec and fastText have also been implemented for GPU clusters. BlazingText (Gupta & Khare, 2017) tackles the problem of efficient batch size and synchronization for multiple GPUs. While this issue is of no concern on a CPU, they report an execution time on a single GPU comparable to a 16-threaded CPU fastText baseline. We further speed up the CPU implementation. The work by Bae & Yi (2016) reports up to 11× speedup of word2vec with negative sampling run on a K20 GPU over single-threaded CPU word2vec. However, they report only up to 1.6× speedup over a 12-threaded CPU run. Our no subword versions of the code are roughly 5× (skip-gram) and 6× (CBOW) faster than the 20-threaded runs of the original word2vec. Word2vec and fastText are memory-intensive algorithms. Additionally, fine-grained parallelism is limited by the relatively small vectors typically used in the computations. These characteristics severely limit the potential advantages of GPUs over CPUs. Li et al. (2019) discuss a distributed version for many GPUs aiming at the reduction of write conflicts in updates. Similarly (and independently), we made attempts at pre-scheduling a list of current-context word updates, but we found the overhead of this preprocessing prohibitive. Nonetheless, the algorithmic variants presented in our paper can be applied in a distributed setting, as long as the data used for a single update fits inside a batch used in the distributed computation. This is the case for our variants, since they either execute separate updates on each current-context word pair, or update current words with their entire (typically small) contexts. This is also the case in the original fastText, and therefore the communication cost should not increase.
Another popular word embedding model, GloVe (Pennington et al., 2014), is based on a completely different algorithmic structure, namely the creation and reduction of a global word co-occurrence matrix; therefore, there is no straightforward way to apply our code optimizations and variants to it.

3. OPTIMIZATION TECHNIQUES AND ALGORITHMIC VARIANTS

To improve the training time, we first identify the most expensive operations. Assume that the source word(s) have a total of m subwords and that we use n negative samples per target word. Then each update comprises:
• Construction of h: a sum of m vectors (line 2 or 4).
• Loss calculation: n + 1 dot products and 2(n + 1) vector additions (lines 6-8, 11-13).
• Gradient update: m vector additions (lines 16-19 or 21-26).
In skip-gram, h is built only from a single current word, while in CBOW, it is constructed from all context words. The loss function, in contrast, is computed once per current-context word pair in skip-gram, while in CBOW, it is computed once using the entire context. This means that for CBOW, the construction of h and the gradient update consume most of the execution time, while for skip-gram, these operations take roughly the same amount of time as the loss function calculation (assuming the default parameter of n = 5). All operations listed above are memory-intensive and therefore memory bound: thus, the best approach to optimize them is to reduce memory movement and avoid unnecessary updates. This is further supported by our observations during tests on isolated sections of the code. We noted that a large amount of execution time is taken by the latency of lower levels of the memory hierarchy when accessing data from rows scattered across M_in and M_out, an access pattern required to provide high-quality embeddings. To speed up the training, we first perform a number of code performance optimizations and then various algorithmic modifications compared to the original fastText. Some modifications depend on each other, as illustrated in Fig. 2. Some, but not all, techniques apply to both modes of operation. All our improvements build on code opt, which is a CPU-specific optimization of the original fastText code.
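The cost structure above can be made concrete with a NumPy sketch of a single update (the skip-gram branch of Alg. 1). This is a simplified model of the computation, not the paper's optimized C++ code: it omits subword hashing, the sigmoid lookup table, and the learning-rate schedule.

```python
import numpy as np

def sgns_update(M_in, M_out, src_rows, target, negatives, lr=0.05):
    """One fastText-style skip-gram update with negative sampling.
    src_rows: indices in M_in of the source word and its subwords,
    target: index in M_out of the positive target word,
    negatives: indices in M_out of the n negative samples."""
    h = M_in[src_rows].mean(axis=0)            # build hidden layer (line 2)
    g = np.zeros_like(h)                       # gradient accumulator for h
    for t, label in [(target, 1.0)] + [(w, 0.0) for w in negatives]:
        s = 1.0 / (1.0 + np.exp(-h @ M_out[t]))  # binary logistic loss
        alpha = lr * (label - s)
        g += alpha * M_out[t]                  # accumulate gradient on h
        M_out[t] += alpha * h                  # update the output row
    M_in[src_rows] += g                        # propagate to word + subwords
```

The scattered row accesses to M_in and M_out visible here are exactly the memory-bound pattern discussed above.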
For skip-gram, we consider a batch variant and negative sharing across the context (NS CT). For CBOW, we consider keeping track of the values in the hidden layer h and updating them dynamically rather than building this layer from scratch in each iteration (DH: variable context window size; DHF: fixed context window size). We consider combinations of this technique with negative sharing involving a different number s of positive samples that the negative samples are shared between (NS s). Additionally, for both CBOW and skip-gram, we test no subword, where we remove the subwords from code opt, making it equivalent to optimized word2vec, and BPE h, where we replace subwords with BPE tokens obtained from a pre-trained token set of size h. We next discuss these variants, referring to the specific parts of Alg. 1 that they modify.
Code performance optimizations (code opt). For efficient execution, we explicitly vectorize matrix and vector operations using AVX-512 intrinsics. We block and merge operations involving multiple reads from the same location in memory to make them more cache-friendly (temporal locality), such as averaging the rows of M_in or subsequently reading from and writing to M_out. During the creation of the hidden layer h (line 2 or 4), we reduce the number of array accesses such that each element of h is stored only once while summing up the vector representations of subwords. This improves on the original code, which performs a separate store for each subword. To speed up the binary logistic loss function, we vectorize the dot product (lines 6, 11) with the use of eight accumulators to increase instruction-level parallelism without too much register pressure. We merge the update of the gradient g and the relevant rows of M_out (lines 7-8, 12-13) to avoid multiple reads from the latter. We still call the loss function once for each target w_t and each negative sample.
Similar to the creation of h, we improve the update of M_in(w_s) with g (line 16 or 21-22) by reading each element of g only once for all words and subwords w_s. The optimizations in the version code opt are used in all algorithmic variants discussed next. Note that these optimizations can also be applied to other regularization schemes, such as the hierarchical softmax used in the original word2vec (Mikolov et al., 2013).
No subwords (no subword). Experiments in Section 4 show that it is sometimes useful to train word embeddings without any subword information. We provide a code variant which disables subwords but applies all the optimizations discussed above. It is algorithmically equivalent to word2vec. In Alg. 1, subwords(w_s) in lines 1-4 and 15-26 becomes an empty set and need not be processed. While this is expected to improve the training time, especially for CBOW, which dedicates a large part of its runtime to averaging and updating subword representations, the information on word morphology becomes scarce. We will later see that word embeddings trained without subword information do not perform well when used for syntactic tasks. On the other hand, the training then focuses on semantic information, which is reflected in the higher semantic quality of these embeddings.
Minibatching (batch for SG). For skip-gram, we implement a form of minibatching of the target words per each source word. Rather than following the work of Ji et al. (2016), which merges all M_in(w_s) rows in a minibatch into a matrix, we follow the original fastText's approach, hitherto only applied to CBOW, which simply averages all these rows. The advantage of our minibatching over the original fastText skip-gram is being able to execute a single update for each context window of the current word w_i, rather than per each current-context word pair.
This means that h and the relevant rows of the input matrix M_in are updated only once per current word, independent of the context window size. Lines 2 and 16 are now executed only once per context window, in a similar fashion to lines 4 and 21-22, respectively. This creates an additional delay between reading and writing a word's subword representations, increasing the possibility of write conflicts, but our experiments later show that the accuracy remains nearly unaffected. Minibatching can bring significant speed improvements to subword-based training due to the relatively high cost of building h and updating all subword representations. As mentioned, in fastText CBOW, this form of batching is already a part of its algorithmic structure.
Negative sharing (NS CT and NS s). We implement the negative sharing proposed by Ji et al. (2016), but adapted for and built over SG batch and the natural batching of fastText CBOW. For skip-gram, we share negative samples among all words in the entire context window of w_i (NS CT). For CBOW, we share negative samples for s consecutive current words w_i (NS s). In our implementation, s is a hyperparameter chosen by the user. Thus, line 10 is executed only n times every s-th update. While negative sharing results in fewer expensive random memory reads and improved memory locality (e.g., for d = 300, n = 5, and a context window size of 11, the data worked upon takes up ca. 16 KB, while a typical L1 cache size is not less than 32 KB), it proportionally reduces the number of data samples used in training per current-context word pair. For this reason, despite improvements in execution time, NS yields inferior accuracy. Therefore, we do not report its results in the paper, but only present them in the appendix.
Dynamic hidden layer update (DH). CBOW spends a large portion of its execution time building h and updating the relevant rows of M_in for each subsequent current word w_i and its context window.
Therefore, we opt for adding and removing subwords only as their words move in and out of the context window as the algorithm processes the training text. After each shift of the context window, we update the rows of M_in for all removed subwords, readjust to the gradient g, and add the new subwords to h. Thus, rather than performing the entire sum in line 4, the data is processed in six steps. Assuming that x embeddings remain inside the context window after a particular shift:
1. Denormalize h.
2. Update M_in for subwords falling out of the context window.
3. Subtract the embeddings of subwords falling out of the context window from h.
4. Readjust to the gradient g: h = h + xg.
5. Add subwords falling into the context window to h.
6. Normalize h.
Note that this creates an additional delay between reading and writing to the rows of M_in, but empirically it does not harm the vector quality.
Fixed window for dynamic hidden layer update (DHF). Since the window size is picked randomly in each iteration, some words will fall in and out of the context window multiple times, forcing DH to remove and add the same subwords to h multiple times over a short period of time. To mitigate this, we fix the window size. While potentially saving time, this approach comes with a pitfall: the variable window size is a natural way of sampling context words that are closer to the current word w_i with greater probability, which reflects the greater contribution of these words to the current word's meaning. DHF effectively ignores the impact of the distance of context words.
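The six-step DH window shift can be sketched as follows. This is a NumPy sketch with illustrative names, not the paper's code: we subtract the pre-update embeddings of the outgoing rows (whose contribution h still reflects), and the real implementation additionally vectorizes each step and tracks subword counts per word.

```python
import numpy as np

def shift_hidden(h, count, M_in, out_rows, in_rows, g):
    """One DH window shift. h: current averaged hidden layer built from
    `count` rows of M_in; out_rows / in_rows: indices leaving / entering
    the window; g: accumulated gradient not yet applied to those rows."""
    h = h * count                        # 1. denormalize (h becomes a sum)
    old = M_in[out_rows].sum(axis=0)     # contribution currently inside h
    M_in[out_rows] += g                  # 2. write back rows leaving the window
    h = h - old                          # 3. subtract their old embeddings
    x = count - len(out_rows)            #    x rows remain in the window
    h = h + x * g                        # 4. readjust: remaining rows owe +g each
    h = h + M_in[in_rows].sum(axis=0)    # 5. add rows entering the window
    count = x + len(in_rows)
    return h / count, count              # 6. normalize
```

After the shift, h equals the average it would have had if rebuilt from scratch with g already applied to the remaining rows, at the cost of touching only the rows that actually changed.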

Byte-Pair vocabulary (BPE h).

We also propose an alternative approach to subword embeddings, replacing the subwords by Byte-Pair Encoding (BPE) tokens (Sennrich et al., 2016). These are produced with the Hugging Face Tokenizers library (Moi, 2019) in the form of token IDs for the h most frequent word fragments, where h is a hyperparameter. We expect this to reduce execution time and memory consumption, as the number of tokens is typically an order of magnitude smaller than that of subwords. To our knowledge, this is the first attempt to apply BPE tokenization to provide additional subword information in a fastText-like fashion. In our experiments in Section 4, we train the tokenizer over the same training corpus as our embeddings, but the two could be trained on different corpora. In case the BPE variant of fastText is unable to tokenize a word found in its training corpus (e.g., because the word was absent from the corpus used for training the tokenizer), the word remains as it is, without additional embeddings. An alternative approach would be to create embeddings for tokens consisting of single characters; however, we found that if many words fail to be tokenized, this may cause a drastic slowdown, likely due to update conflicts on the single-character tokens. In Alg. 1, using the BPE variant means replacing "subwords" with "tokens".
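A greatly simplified sketch of this lookup and its fallback behavior: a greedy longest-match segmentation against a pretrained token set. Real BPE replays the learned merge rules instead, and the vocabulary below is purely illustrative.

```python
def bpe_segment(word, vocab):
    """Greedily segment `word` into the longest tokens found in `vocab`.
    Returns None if the word cannot be fully covered, mirroring the
    fallback described above: such a word keeps only its whole-word
    embedding, with no additional token embeddings."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            return None  # untokenizable word
    return tokens

vocab = {"par", "is", "p", "a", "r"}
print(bpe_segment("paris", vocab))  # ['par', 'is']
print(bpe_segment("tokyo", vocab))  # None
```

Compared with the k = 3...6 character n-grams of fastText, a word typically maps to far fewer tokens, which is the source of the expected speedup.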

4. EVALUATION

In this section, we evaluate our performance-optimized polyalgorithmic implementation for training word embeddings on a current homogeneous multicore system. For each algorithmic variant, and considering three languages, we report the speedup we achieve for training and the obtained accuracy of the generated word embeddings w.r.t. a number of semantic and syntactic quality tests.
Setup. We use a dual-socket Intel(R) Xeon(R) Silver 4114 CPU (Skylake, 20 physical cores). For evaluation, we create an English corpus as described by Facebook Research (2016b). For other languages, we proceed in an analogous fashion: we download the respective Wikipedia dumps (Wikimedia Foundation, 2001) and sanitize and lowercase them with the script wikifil.pl authored by Mahoney (2006). For each language, the script is modified to capture the relevant characters and replace the relevant words. We truncate the outputs to 1 billion characters. The resulting vocabulary sizes are: (a) 218,316 words for English, (b) 592,674 words for German, (c) 385,596 words for Russian. The purpose of our experiments is to speed up training over the original fastText. We demonstrate the speed-ups of our implementation and show which algorithmic variants maintain accuracy at the same time. The results for English are presented in Table 1. The names of the algorithmic variants match those from Section 3 and Fig. 2. We perform various semantic and syntactic accuracy tests, explained below. The best accuracy scores for each test are marked in blue. Pareto-optimal combinations of accuracy scores are shown boldfaced; Pareto-optimal means that no other algorithmic variant dominates it, i.e., is better on every score. Together with Fig. 2, the tables also show the incremental impact on accuracy and execution speed of each variant.
Accuracy tests. We perform multiple evaluation tasks to test the quality of our embeddings.
First, we test our embeddings with the word analogy task script provided with word2vec (Mikolov et al., 2013) for both semantic and syntactic accuracy. For English, we use the questions-words (QW) dataset (Mikolov, 2013). For other languages, we use its translations: German (Köper et al., 2015) and Russian (Kononova, 2017). Additionally, we employ the Vecto library (Vecto, 2018) to evaluate the English embeddings on the Bigger Analogy Test Set (BATS) (Gladkova et al., 2016) with the 3CosAdd method. For all analogy benchmarks, we observe that fastText performs better on syntactic than semantic tasks, as already noted by Bojanowski et al. (2017). Second, we compute word similarity scores with Facebook MUSE (Conneau et al., 2017), using monolingual evaluation with word similarity tasks on semantic datasets. For English and German, we use the tests provided by MUSE. For Russian, we use the HJ dataset (Panchenko et al., 2016). The tables in this section present the averaged MUSE output. For detailed scores for each MUSE test set, see Appendix B. Third, we use the scripts provided by the Word Embedding Benchmarks package (Jastrzebski, 2015) to perform the concept categorization (word clustering) task. We evaluate on the semantic Battig test set introduced by Battig & Montague (1969).
Evaluating English skip-gram. First, we evaluate the multiple variants of skip-gram presented in the first section of Tab. 1. The dependencies between the variants are illustrated in Fig. 2(a). We observe that optimizing only for efficient execution (code opt) already yields a 2.7-3.7× speedup for fastText while maintaining accuracy. The no subword variant yields an 8-10× speedup over original and about a 3× speedup over code opt. For word analogy, no subword improves the embeddings semantically.
The tokenized versions roughly balance between fastText-style and word2vec-style embedding quality, with the exception of BPE 200K, where the number of tokens is close to the vocabulary size, effectively turning only the most common subwords into separate tokens. This approach provides semantic accuracy even greater than that of original for both BATS and QW, however at the price of syntactic quality, while providing roughly a 4-5× speedup over original and up to a 1.5× speedup over code opt. For QW, the accuracies vary greatly, while BATS indicates that a smaller number of tokens is generally preferable. The batch variant maintains or slightly handicaps the accuracy of fastText, and provides a slightly smaller speedup than the tokenized versions. The different variants of skip-gram perform almost equally well on the word similarity and categorization tasks, and all of them yield Pareto-optimal results. All variants show good parallel scaling.
Evaluating English CBOW. The CBOW results are shown in the second section of Tab. 1; the dependencies between the variants are shown in Fig. 2(b). The code opt variant yields a 2.3-4.6× speedup over original, less than for skip-gram, but the obtained accuracy is not Pareto-optimal. For word analogy, CBOW generally performs better on syntactic than semantic questions. The no subword variant provides good scaling, an over 20× speedup over original, and about an 8-9× speedup over code opt. It diminishes the discrepancy between these scores, albeit negatively impacting the syntactic quality of the embeddings, while achieving the highest scores for word similarity and categorization. None of the tokenized variants was able to beat no subword in both speed and evaluation on these tasks, but they provide an improvement in semantic accuracy over original, as well as in word similarity and categorization. The BPE variants achieve roughly an 11× speedup over original.
The DH variant provides only a slight speedup over code opt (a 2.7-4.8× speedup over original), but yields higher accuracies in all tasks, while DHF negatively impacts all scores except for MUSE, but provides a speedup over DH with multiple threads.
Comparison between skip-gram and CBOW. As a rule of thumb, the fastText implementations of skip-gram perform much better on semantic questions in word analogy tasks and slightly better on word similarity and categorization tasks. For syntactic questions, the CBOW code opt and CBOW DH variants are a better option. On the other hand, CBOW no subword performs nearly as well on word similarity and categorization tasks as skip-gram. Therefore, in specific cases, the former can be used in lieu of skip-gram to boost the execution speed.
Evaluation on German and Russian corpora. Table 2 contains the results for German and Russian, presented analogously to those for English. In terms of evaluation accuracy, they are largely consistent with English, with the small exception of CBOW BPE 20K, which performs better than CBOW no subword on the German corpus. This indicates the impact of the number of tokens used during training and opens opportunities for further investigation. Notably, CBOW DH achieves the best scores on syntactic tasks for all evaluated languages. Using skip-gram with BPE tokens rather than fastText-style subwords performs very well in terms of both speedups and accuracy scores, all of which are Pareto-optimal. The code opt variants yield roughly 3.5-5× speedups over their respective original versions, slightly better than for English, and further optimizations lead to significantly greater speedups. The best achieved improvement is CBOW no subword, up to 50× for Russian. This shows that our improvements are particularly beneficial for morphologically rich languages with a large number of subwords per word.

5. CONCLUSIONS

We presented a thorough evaluation, and an associated open-source implementation, of various optimization techniques for fastText and word2vec. In particular, these include code-level performance optimizations and the use of BPE tokens rather than subwords. For example, for English, our code offers practitioners speedups in the range of 2.7-20.6×, while maintaining accuracy with respect to one or more evaluation metrics. We achieve good parallel scaling, which is expected to bring even more benefits in the future as the number of cores further increases. The choice of algorithm depends heavily on the accuracy metric: for all languages, there is no universally best variant, which makes a case for our polyalgorithmic implementation and a thorough evaluation of trade-offs. Our techniques should also apply to sent2vec (Pagliardini et al., 2018) for sentence embeddings.

A DETAILS OF EXPERIMENTAL SETUP

For the experiments with questions-words and MUSE, we do not remove any questions containing out-of-vocabulary words. We use the full test data sets for all modes; therefore, our results are consistent. For MUSE, we use the monolingual word similarity tests provided by the library (see the footnote URLs in Appendix G). For the final MUSE scores, we take the arithmetic mean of the individual scores obtained from these tests. For Russian, MUSE provides no tests; therefore, we download and use the HJ dataset.

B DETAILED EVALUATION WITH MUSE

Tables 3 and 4 present detailed MUSE results for each test set. For Russian, we use only one test set, thus the results would be redundant with those in Section 4 and Appendix F. For the English and German tables, we run the experiments on embeddings obtained in a different training run: therefore, the individual scores are not expected to average exactly to those in Section 4.

C PREPARATION OF DATA

To prepare the German and Russian Wikipedia dumps for training, we modify the wikifil.pl script such that it captures the relevant characters and replaces all digits with the relevant words in each language. For German, we add äöüß to the set of Latin characters. For Russian, we extract Cyrillic characters. Then, we manually truncate the parsed texts to 1 billion characters, and further truncate to the last complete word, so that no word is cut in half at the 1 billion character boundary. Finally, we use iconv to ensure UTF-8 format. For example, for the Russian Wikipedia dump, the order of actions is:

    wget <wiki dump address>/<ru.dump>
    perl wikifil-ru.pl <ru.dump> > ruwiki
    head -c 1000000000 ruwiki > ruwiki9
    # manually truncate the text to the last complete word
    iconv -t utf-8 ruwiki9 -o ruwiki9-utf

We train the embeddings on the file ruwiki9-utf.

D COMPATIBILITY WITH HUGGING FACE TOKENIZERS

We implement tokenization in a way that is compatible with the files produced by ByteLevelBPETokenizer in the Hugging Face Tokenizers library. We apply the same character mapping for UTF-8 characters that use more than one byte, and preprocess the words from the vocabulary such that each begins with the special delimiter character Ġ. Note that the number of tokens h must be selected during tokenization with the Tokenizers library.
Selecting BPE tokens or no subword. To use BPE tokens instead of subwords, provide paths to the merge and vocab files produced by the Tokenizers library:

    -token-merges <path/to/f.txt> -token-vocab <path/to/f.json>

Both these arguments must be set to enable the BPE run. The number of tokens h is obtained from the merge and vocab files. For our experiments, we produce these files using our training corpora. Note that these two arguments are incompatible with -no-subwords. Finally, to run the no subword (word2vec) version of our code, use the argument -no-subwords; it is incompatible with -token-merges and -token-vocab. We run both the BPE and no subword experiments using -mode normal (default). The remaining arguments are identical to those used by the original fastText code. To obtain the results for SG original and CBOW original, please run the original library.

F COMPLETE RESULTS: ENGLISH, RUSSIAN AND GERMAN

We provide complete results for the embeddings trained on English (Tab. 6), German (Tab. 7) and Russian (Tab. 8) corpora. In negative sharing (NS), we use s = 11, 80, 160.

G VARIANCE OF RESULTS

Due to the low stability of word embeddings across different trainings (e.g., Antoniak & Mimno (2018)), we perform multiple experiments on our variants to characterize the standard deviations obtained for the scores of the word analogy task on the questions-words test set (QW). We measure across multiple runs.



https://dl.fbaipublicfiles.com/arrival/wordsim.tar.gz https://github.com/nlpub/russe-evaluation/blob/master/russe/evaluation/hj.csv https://github.com/huggingface/tokenizers https://github.com/FT-Submit/ft-mod https://github.com/facebookresearch/fastText runs. We were unable to perform more tests due to time constraints, but the results provide hints on stability. We present the results in Table9. We note that the standard deviation rarely increases above 1. Moreover, it is more likely to be high for no subword and BPE variants of the code.
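The measurement itself is straightforward; a minimal sketch (the scores below are illustrative, not taken from Table 9):

```python
# Given QW analogy-task scores from repeated trainings of one variant,
# report the mean and the (sample) standard deviation.
from statistics import mean, stdev

scores = [73.2, 72.8, 73.9, 72.5, 73.4]   # one QW accuracy per run
print(f"mean={mean(scores):.2f} sd={stdev(scores):.2f}")
```

Note that `stdev` computes the sample standard deviation (n − 1 in the denominator), which is the appropriate estimator for a small number of runs.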



Figure 1: Representation of source and target words in the input (M_in) and output (M_out) matrices in fastText (skip-gram). The sentence is "a quick brown fox jumps over the lazy dog", the current word w_i is "brown", and the context window size is 2. The words in the corpus are represented as indices of the corresponding rows in M_in and M_out.

Figure 2: Dependency between our code variants of skip-gram and CBOW. The experiments for the "NS" variants (no box frames) are only shown in the appendix due to inferior experimental accuracy.

2. We omit negative sharing (NS) due to low accuracy, but show the results in Appendix F. In tokenized runs (BPE), we use h = 20K, 40K, 200K. All other hyperparameters are the fastText defaults. The speedups shown are over the original implementations SG original and CBOW original, respectively, from Bojanowski et al. (2017), run with the same number of threads. The scaling column shows the speedup of our code when run with 20 threads compared to 1 thread. The runtimes are consistent over several runs and are shown in Appendix F.

To produce the merge and vocab files, we use the Tokenizers library:

    bpe = ByteLevelBPETokenizer()
    bpe.train([<corpus file>], vocab_size=<h>)
    bpe.save(<path>, <filename>)

(Note the [] brackets.)

E HOW TO RUN EXPERIMENTS

We provide parameterized code (https://github.com/FT-Submit/ft-mod) to replicate our experiments. It is a modification of the original fastText library (https://github.com/facebookresearch/fastText). Note that we disabled the production of the .bin file to reduce saving time and storage space. Our experiments apply to unsupervised training with skip-gram and CBOW.

To compile, CMake, the Intel ICPC compiler, and a CPU with AVX-512 support are required. Please compile with:

    cmake .
    make

To run, please use the command:

    fasttext {cbow, skipgram} \
        -input <corpus file> \
        -output <embeddings file> \
        <arguments>

Selecting code optimizations. To run a particular algorithm mode, use the argument -mode <mode>.
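When scripting many runs, the invocation above can be assembled programmatically; a minimal sketch (the helper `build_command` and all file names are illustrative, not part of our code):

```python
# Assemble the fasttext command line described above as an argument list
# suitable for subprocess.run (which avoids shell-quoting issues).
def build_command(model, corpus, output, extra=()):
    assert model in ("cbow", "skipgram")
    return ["fasttext", model, "-input", corpus, "-output", output, *extra]

cmd = build_command("skipgram", "ruwiki9-utf", "ru-vectors",
                    ("-mode", "normal", "-no-subwords"))
print(" ".join(cmd))
```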

Algorithm (excerpt). Data: source word(s) w_s, target word w_t, learning rate l, number of negative samples n; the initialization branches on whether skip-gram or CBOW is used.

Accuracy and speedup achieved with our library over fastText when training on English Wikipedia corpus. Blue: best accuracy in category, bold: Pareto-optimal accuracy, speedup is over the original fastText run with the same number of threads, and scaling is the speedup of 20 threads vs. 1 thread for our code. Higher is better for all metrics.

Accuracy and speedup achieved with our library over fastText when training on (a) German and (b) Russian Wikipedia corpora. Blue: best accuracy in category, bold: Pareto-optimal accuracy, "speedup": over the original fastText run with the same number of threads, "scaling": speedup of 20 threads vs. 1 thread for our code. Higher is better for all metrics.

Detailed results of MUSE tests for fastText when training on the English Wikipedia corpus. Higher is better. The results are obtained from a different training run, hence the slight difference from the main results.

Detailed results of MUSE tests for fastText when training on the German Wikipedia corpus. Higher is better. The results are obtained from a different training run, hence the slight difference from the main results.

Table 5 explains all available modes: an overview of the -mode arguments and their connection to our experiments. The parameter s (the number of words sharing negative samples) is set with -shared <s>. By default, s = 2C + 1, where C is the maximum context window size (set in the original fastText with the argument -ws <c>). Setting -shared to zero restores the default.
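The default for s can be stated as a one-liner (a hypothetical helper, shown only to make the formula concrete):

```python
# Default number of words sharing one set of negative samples:
# s = 2C + 1, where C is the context window size (-ws).
def default_shared(ws: int) -> int:
    return 2 * ws + 1

print(default_shared(5))  # 11
```

With the fastText default window size C = 5 this gives s = 11, which matches the smallest NS setting (s = 11, 80, 160) used in Appendix F.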

Full results: accuracy, runtime and speedup achieved with our library over fastText when training on English Wikipedia corpus. Blue: best accuracy in category, bold: Pareto-optimal accuracy, speedup is over the original fastText run with the same number of threads, and scaling is the speedup of 20 threads vs. 1 thread for our code. For accuracy, speedup and scaling, higher is better. For time, lower is better.

Full results: accuracy, runtime and speedup achieved with our library over fastText when training on German Wikipedia corpus. Blue: best accuracy in category, bold: Pareto-optimal accuracy, speedup is over the original fastText run with the same number of threads, and scaling is the speedup of 20 threads vs. 1 thread for our code. For accuracy, speedup and scaling, higher is better. For time, lower is better.

Full results: accuracy, runtime and speedup achieved with our library over fastText when training on Russian Wikipedia corpus. Blue: best accuracy in category, bold: Pareto-optimal accuracy, speedup is over the original fastText run with the same number of threads, and scaling is the speedup of 20 threads vs. 1 thread for our code. For accuracy, speedup and scaling, higher is better. For time, lower is better.

Standard deviation of the score obtained with word analogy task on QW. Lower is better.

