RETHINKING EMBEDDING COUPLING IN PRE-TRAINED LANGUAGE MODELS

Abstract

We re-evaluate the standard practice of sharing weights between input and output embeddings in state-of-the-art pre-trained language models. We show that decoupled embeddings provide increased modeling flexibility, allowing us to significantly improve the efficiency of parameter allocation in the input embedding of multilingual models. By reallocating the input embedding parameters in the Transformer layers, we achieve dramatically better performance on standard natural language understanding tasks with the same number of parameters during fine-tuning. We also show that allocating additional capacity to the output embedding provides benefits to the model that persist through the fine-tuning stage even though the output embedding is discarded after pre-training. Our analysis shows that larger output embeddings prevent the model's last layers from overspecializing to the pre-training task and encourage Transformer representations to be more general and more transferable to other tasks and languages. Harnessing these findings, we are able to train models that achieve strong performance on the XTREME benchmark without increasing the number of parameters at the fine-tuning stage.

1. INTRODUCTION

The performance of models in natural language processing (NLP) has dramatically improved in recent years, mainly driven by advances in transfer learning from large amounts of unlabeled data (Howard & Ruder, 2018; Devlin et al., 2019). The most successful paradigm consists of pre-training a large Transformer (Vaswani et al., 2017) model with a self-supervised loss and fine-tuning it on data of a downstream task (Ruder et al., 2019). Despite its empirical success, inefficiencies have been observed related to the training duration (Liu et al., 2019b), pre-training objective (Clark et al., 2020b), and training data (Conneau et al., 2020a), among others. In this paper, we reconsider a modeling assumption that may have a similarly pervasive practical impact: the coupling of input and output embeddings¹ in state-of-the-art pre-trained language models.

State-of-the-art pre-trained language models (Devlin et al., 2019; Liu et al., 2019b) and their multilingual counterparts (Devlin et al., 2019; Conneau et al., 2020a) have inherited the practice of embedding coupling from their language model predecessors (Press & Wolf, 2017; Inan et al., 2017). However, in contrast to their language model counterparts, embedding coupling in encoder-only pre-trained models such as Devlin et al. (2019) is only useful during pre-training since output embeddings are generally discarded after pre-training.² In addition, given the willingness of researchers to exchange additional compute during pre-training for improved downstream performance (Raffel et al., 2020; Brown et al., 2020) and the fact that pre-trained models are often used for inference millions of times (Wolf et al., 2019), pre-training-specific parameter savings are less important overall.

Table 1: Overview of the number of parameters in (coupled) embedding matrices of state-of-the-art multilingual (top) and monolingual (bottom) models with regard to overall parameter budget. |V|: vocabulary size. N, N_emb: number of parameters in total and in the embedding matrix, respectively.

Model                                 Languages   |V|    N      N_emb   %Emb.
mBERT (Devlin et al., 2019)           104         120k   178M   92M     52%
XLM-R Base (Conneau et al., 2020a)    100         250k   270M   192M    71%
XLM-R Large (Conneau et al., 2020a)   100         250k   550M   256M    47%
BERT Base (Devlin et al., 2019)       1           30k    110M   23M     21%
BERT Large (Devlin et al., 2019)      1           30k    335M   31M     9%

On the other hand, tying input and output embeddings constrains the model to use the same dimensionality for both embeddings. This restriction limits the researcher's flexibility in parameterizing the model and can lead to allocating too much capacity to the input embeddings, which may be wasteful. This is a problem particularly for multilingual models, which require large vocabularies with high-dimensional embeddings that make up between 47% and 71% of the entire parameter budget (Table 1), suggesting an inefficient parameter allocation.

In this paper, we systematically study the impact of embedding coupling on state-of-the-art pre-trained language models, focusing on multilingual models. First, we observe that while naïvely decoupling the input and output embedding parameters does not consistently improve downstream evaluation metrics, decoupling their shapes comes with a host of benefits. In particular, it allows us to independently modify the input and output embedding dimensions. We show that the input embedding dimension can be safely reduced without affecting downstream performance. Since the output embedding is discarded after pre-training, we can increase its dimension, which improves fine-tuning accuracy and outperforms other capacity expansion strategies.
By reinvesting saved parameters to the width and depth of the Transformer layers, we furthermore achieve significantly improved performance over a strong mBERT (Devlin et al., 2019) baseline on multilingual tasks from the XTREME benchmark (Hu et al., 2020) . Finally, we combine our techniques in a Rebalanced mBERT (RemBERT) model that outperforms XLM-R (Conneau et al., 2020a) , the state-of-the-art cross-lingual model while having been pre-trained on 3.5× fewer tokens and 10 more languages. We thoroughly investigate reasons for the benefits of embedding decoupling. We observe that an increased output embedding size enables a model to improve on the pre-training task, which correlates with downstream performance. We also find that it leads to Transformers that are more transferable across tasks and languages-particularly for the upper-most layers. Overall, larger output embeddings prevent the model's last layers from over-specializing to the pre-training task (Zhang et al., 2020; Tamkin et al., 2020) , which enables training of more general Transformer models.
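The embedding shares reported in Table 1 can be approximately reproduced from the vocabulary size and the embedding width. The sketch below is illustrative only: the vocabulary sizes and totals are the rounded figures from Table 1, and the embedding widths (768 or 1024) are an assumption, taken to equal each model's hidden size because the embeddings are coupled.

```python
# Rounded figures from Table 1: (vocabulary size, embedding width, total parameters).
# The embedding widths are an assumption (equal to the hidden size of each model).
MODELS = {
    "mBERT": (120_000, 768, 178e6),
    "XLM-R Base": (250_000, 768, 270e6),
    "XLM-R Large": (250_000, 1024, 550e6),
    "BERT Base": (30_000, 768, 110e6),
    "BERT Large": (30_000, 1024, 335e6),
}


def embedding_share(vocab_size, emb_dim, total_params):
    """Return the embedding parameter count and its share of the total budget."""
    n_emb = vocab_size * emb_dim
    return n_emb, n_emb / total_params


for name, (vocab, dim, total) in MODELS.items():
    n_emb, share = embedding_share(vocab, dim, total)
    print(f"{name:12s} N_emb = {n_emb / 1e6:6.1f}M  share = {share:5.1%}")
```

Because the inputs are rounded, the printed shares only match the table to within about one percentage point.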

2. RELATED WORK

Embedding coupling. Sharing input and output embeddings in neural language models was proposed to improve perplexity and motivated based on embedding similarity (Press & Wolf, 2017) as well as by theoretically showing that the output probability space can be constrained to a subspace governed by the embedding matrix for a restricted case (Inan et al., 2017). Embedding coupling is also common in neural machine translation models where it reduces model complexity (Firat et al., 2016) and saves memory (Johnson et al., 2017), in recent state-of-the-art language models (Melis et al., 2020), as well as all pre-trained models we are aware of (Devlin et al., 2019; Liu et al., 2019b).

Transferability of representations. Representations of large pre-trained models in computer vision and NLP have been observed to transition from general to task-specific from the first to the last layer (Yosinski et al., 2014; Howard & Ruder, 2018; Liu et al., 2019a). In Transformer models, the last few layers have been shown to become specialized to the MLM task and, as a result, less transferable (Zhang et al., 2020; Tamkin et al., 2020).

Multilingual models. Recent multilingual models are pre-trained on data covering around 100 languages using a subword vocabulary shared across all languages (Devlin et al., 2019; Pires et al., 2019; Conneau et al., 2020a). In order to achieve reasonable performance for most languages, these models need to allocate sufficient capacity for each language, known as the curse of multilinguality (Conneau et al., 2020a; Pfeiffer et al., 2020). As a result, such multilingual models have large vocabularies with large embedding sizes to ensure that tokens in all languages are adequately represented.

Efficient models. Most work on more efficient pre-trained models focuses on pruning or distillation (Hinton et al., 2015). Pruning approaches remove parts of the model, typically attention heads (Michel et al., 2019; Voita et al., 2019), while distillation approaches distill a large pre-trained model into a smaller one (Sun et al., 2020). Distillation can be seen as an alternative form of allocating pre-training capacity via a large teacher model. However, distilling a pre-trained model is expensive (Sanh et al., 2019) and requires overcoming architecture differences and balancing training data and loss terms (Mukherjee & Awadallah, 2020). Our proposed methods are simpler and complementary to distillation as they can improve the pre-training of compact student models (Turc et al., 2019).

3. EXPERIMENTAL METHODOLOGY

Efficiency of models has been measured along different dimensions, from the number of floating point operations (Schwartz et al., 2019) to their runtime (Zhou et al., 2020). We follow previous work (Sun et al., 2020) and compare models in terms of their number of parameters during fine-tuning (see Appendix A.1 for further justification of this setting). For completeness, we generally report the number of pre-training (PT) and fine-tuning (FT) parameters.

Baseline. Our baseline has the same architecture as multilingual BERT (mBERT; Devlin et al., 2019). It consists of 12 Transformer layers with a hidden size H of 768. Input and output embeddings are coupled and have the same dimensionality E as the hidden size, i.e., E_out = E_in = H. The total number of parameters during pre-training and fine-tuning is 177M (see Appendix A.2 for further details). We train variants of this model that differ in certain hyper-parameters but otherwise are trained under the same conditions to ensure a fair comparison.

Tasks. For our experiments, we employ tasks from the XTREME benchmark (Hu et al., 2020) that require fine-tuning, including the XNLI (Conneau et al., 2018), NER (Pan et al., 2017), PAWS-X (Yang et al., 2019), XQuAD (Artetxe et al., 2020), MLQA (Lewis et al., 2020), and TyDiQA-GoldP (Clark et al., 2020a) datasets. We provide details for them in Appendix A.4. We average results across three fine-tuning runs and evaluate on the dev sets unless otherwise stated.

4. EMBEDDING DECOUPLING REVISITED

Naïve decoupling. Embeddings make up a large fraction of the parameter budget in state-of-the-art multilingual models (see Table 1). We now study the effect of embedding decoupling on such models. In Table 2, we show the impact of decoupling the input and output embeddings in our baseline model (§3) with coupled embeddings. Naïvely decoupling the output embedding matrix slightly improves the performance as evidenced by a 0.4 increase on average. However, the gain is not uniformly observed in all tasks. Overall, these results suggest that decoupling the embedding matrices naïvely while keeping the dimensionality fixed does not greatly affect the performance of the model. What is more important, however, is that decoupling the input and output embeddings decouples the shapes, endowing significant modeling flexibility, which we investigate in the following.

Input vs. output embeddings. Decoupling input and output embeddings allows us to flexibly change the dimensionality of both matrices and to determine which one is more important for good transfer performance of the model. To this end, we compare the performance of a model with E_in = 768, E_out = 128 to that of a model with E_in = 128, E_out = 768 (the remaining hyperparameters are the same as the baseline in §3).³ During fine-tuning, the latter model has 43% fewer parameters. We show the results in Table 3. Surprisingly, the model pre-trained with a larger output embedding size is competitive with the comparison method on average despite having 77M fewer parameters during fine-tuning.⁴ Reducing the input embedding dimension saves a significant number of parameters at a noticeably smaller cost to accuracy than reducing the output embedding size. In light of this, the parameter allocation of multilingual models (see Table 1) seems particularly inefficient. For a multilingual model with coupled embeddings, reducing the input embedding dimension to save parameters as proposed by Lan et al. (2020) is very detrimental to performance (see Appendix A.5 for details).

The results in this section indicate that the output embedding plays an important role in the transferability of pre-trained representations. For multilingual models in particular, a small input embedding dimension frees up a significant number of parameters at a small cost to performance. In the next section, we study how to improve the performance of a model by resizing embeddings and layers.
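The parameter accounting behind the two decoupled configurations can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a vocabulary of roughly 120k entries, no bias terms, a linear projection only where an embedding width differs from H, and that the whole output side is dropped before fine-tuning. The function name `embedding_param_counts` is hypothetical.

```python
def embedding_param_counts(vocab, hidden, e_in, e_out, coupled=False):
    """Return (pre-training, fine-tuning) embedding parameter counts.

    Assumptions: no biases; a linear projection is counted only when an
    embedding width differs from the hidden size; everything on the output
    side is discarded before fine-tuning.
    """
    in_params = vocab * e_in + (e_in * hidden if e_in != hidden else 0)
    if coupled:
        # A single shared matrix serves both roles, so the widths must agree.
        assert e_in == e_out
        return in_params, in_params
    out_params = vocab * e_out + (hidden * e_out if e_out != hidden else 0)
    return in_params + out_params, in_params


V, H = 120_000, 768  # approximate mBERT vocabulary size and hidden size
pt_a, ft_a = embedding_param_counts(V, H, e_in=768, e_out=128)
pt_b, ft_b = embedding_param_counts(V, H, e_in=128, e_out=768)
print(f"pre-training embedding params: {pt_a / 1e6:.1f}M vs {pt_b / 1e6:.1f}M")
print(f"fine-tuning savings of the small-input model: {(ft_a - ft_b) / 1e6:.1f}M")
```

Under these assumptions the two models carry the same number of embedding parameters during pre-training, while the model with the small input embedding is roughly 77M parameters lighter during fine-tuning, in line with the figure quoted above.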

5. EMBEDDING AND LAYER RESIZING FOR MORE EFFICIENT FINE-TUNING

Increasing the output embedding size. In §4, we observed that reducing E_out hurts performance on the fine-tuning tasks, suggesting E_out is important for transferability. Motivated by this result, we study the opposite scenario, i.e., whether increasing E_out beyond H improves the performance. We experiment with an output embedding size E_out in the range {128, 768, 3072} while keeping the input embedding size E_in = 128 and all other parts of the model the same as described in §3 (H = 768, 12 layers, etc.). We show the results in Table 4. In all of the tasks we consider, increasing E_out monotonically improves the performance. The improvement is particularly impressive for the more complex question answering datasets. It is important to note that during fine-tuning, all three models have the exact same sizes for E_in and H. The only difference among them is the output embedding, which is discarded after pre-training. These results show that the effect of additional capacity during pre-training persists through the fine-tuning stage even if the added capacity is discarded after pre-training. We perform an extensive analysis of this behavior in §6. We show results with an English BERT Base model in Appendix A.6, which show the same trend.

We also compare increasing E_out to an alternative way of adding pre-training capacity: a model with E_out = 128 and 11 additional Transformer layers that are dropped after pre-training, leaving 12 layers for fine-tuning. We show the results in Table 5. The model with additional layers performs poorly on the question answering tasks, likely because the top layers contain useful semantic information (Tenney et al., 2019). In addition to higher performance, increasing E_out relies only on a more expensive dense matrix multiplication, which is highly optimized on typical accelerators and can be scaled up more easily with model parallelism (Shazeer et al., 2018) because of small additional communication cost. We thus focus on increasing E_out to expand pre-training capacity and leave an exploration of alternative strategies to future work.
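To make concrete where the extra pre-training capacity sits, the following toy sketch scores the vocabulary for one masked position with a decoupled output embedding. It is a simplified illustration under stated assumptions: only the linear projection from H to E_out and the output embedding matrix are modeled (no bias, nonlinearity, or layer normalization), the sizes are toy values, and all names are hypothetical.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def init(rows, cols, rng):
    """Small random matrix with `rows` rows and `cols` columns."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def mlm_logits(hidden_state, out_proj, out_emb):
    """Score every vocabulary item for one masked position."""
    projected = matvec(out_proj, hidden_state)  # H -> E_out
    return matvec(out_emb, projected)           # E_out -> |V|


rng = random.Random(0)
H, E_OUT, VOCAB = 8, 32, 50          # toy sizes; E_out is larger than H
out_proj = init(E_OUT, H, rng)       # used during pre-training only
out_emb = init(VOCAB, E_OUT, rng)    # used during pre-training only
hidden = [rng.gauss(0.0, 1.0) for _ in range(H)]

logits = mlm_logits(hidden, out_proj, out_emb)
pretraining_only_params = E_OUT * H + VOCAB * E_OUT
print(len(logits), pretraining_only_params)
```

Both matrices in this sketch belong to the output side, so enlarging E_out changes the pre-training parameter count without changing what is kept for fine-tuning.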
Reinvesting input embedding parameters. Reducing E_in from 768 to 128 reduces the number of parameters from 177M to 100M. We redistribute these 77M parameters for the model with E_out = 768 to add capacity where it might be more useful by increasing the width or depth of the model. Specifically, we 1) increase the hidden dimension H of the Transformer layers from 768 to 1024,⁵ and 2) increase the number of Transformer layers (L) from 12 to 23 at the same H to obtain models with a similar number of parameters during fine-tuning. Table 6 shows the results for these two strategies. Reinvesting the input embedding parameters in either H or L improves performance on all tasks, while increasing the number of Transformer layers L results in the best performance, with an average improvement of 3.9 over the baseline model with coupled embeddings and the same number of fine-tuning parameters overall.

A rebalanced mBERT. We finally combine and scale up our techniques to design a rebalanced mBERT model that outperforms the current state-of-the-art unsupervised model, XLM-R (Conneau et al., 2020a). As the performance of Transformer-based models strongly depends on their number of parameters (Raffel et al., 2020), we propose a Rebalanced mBERT (RemBERT) model that matches XLM-R's number of fine-tuning parameters (559M) while using a reduced embedding size, resized layers, and more effective capacity during pre-training. The model has a vocabulary size of 250k, E_in = 256, E_out = 1536, and 32 layers with 1152 dimensions and 18 attention heads per layer and was trained on data covering 110 languages. We provide further details in Appendix A.7.

We compare RemBERT to XLM-R and the best-performing models on the XTREME leaderboard in Table 7 (see Appendix A.8 for the per-task results).⁶ The models in the first three rows use additional task or translation data for fine-tuning, which significantly boosts performance (Hu et al., 2020). XLM-R and RemBERT are the only two models that are fine-tuned using only the English training data of the corresponding task. XLM-R was trained with a batch size of 2^13 sequences, each with 2^9 tokens, for 1.5M steps (a total of 6.3T tokens). In comparison, RemBERT is trained with 2^11 sequences of 2^9 tokens for 1.76M steps (1.8T tokens). Even though it was trained with 3.5× fewer tokens and has 10 more languages competing for the model capacity, RemBERT outperforms XLM-R on all tasks we considered. This strong result suggests that our proposed methods are also effective at scale. We will release the pre-trained model checkpoint and the source code for RemBERT in order to promote reproducibility and share the pre-training cost with other researchers.
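The token budgets quoted above follow directly from batch size, sequence length, and step count. The short sketch below redoes that arithmetic; the function name is hypothetical and the inputs are the figures stated in the text.

```python
def pretraining_tokens(batch_sequences, sequence_length, steps):
    """Total tokens seen during pre-training."""
    return batch_sequences * sequence_length * steps


xlmr_tokens = pretraining_tokens(2**13, 2**9, 1_500_000)
rembert_tokens = pretraining_tokens(2**11, 2**9, 1_760_000)
ratio = xlmr_tokens / rembert_tokens

print(f"XLM-R:   {xlmr_tokens / 1e12:.1f}T tokens")
print(f"RemBERT: {rembert_tokens / 1e12:.1f}T tokens")
print(f"ratio:   {ratio:.2f}x")
```

The 3.5× figure corresponds to the ratio of the rounded totals (6.3T / 1.8T); computed from the unrounded products, the ratio is slightly lower, at about 3.4×.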

6. ON THE IMPORTANCE OF THE OUTPUT EMBEDDING SIZE

We carefully design a set of experiments to analyze the impact of an increased output embedding size on various parts of the model. We study the nature of the decoupled input and output representations (§6.1) and the transferability of the Transformer layers with regard to task-specific (§6.2) and language-specific knowledge (§6.3).

6.1. NATURE OF INPUT AND OUTPUT EMBEDDING REPRESENTATIONS

We first investigate to what extent the representations of decoupled input and output embeddings differ based on word embedding association tests (Caliskan et al., 2017). Similar to Press & Wolf (2017), for a given pair of words, we evaluate the correlation between human similarity judgements of the strength of the relationship and the dot product of the word embeddings. We evaluate on MEN (Bruni et al., 2014), MTurk771 (Halawi et al., 2012), Rare-Word (Luong et al., 2013), SimLex999 (Hill et al., 2015), and Verb-143 (Baker et al., 2014). As our model uses subwords, we average the token representations for words with multiple subwords.

We show the results in Table 8. In the first two rows, we can observe that the input embedding of the decoupled model performs similarly to the embeddings of the coupled model while the output embeddings have lower scores.⁷ We note that higher scores are not necessarily desirable as they only measure how well the embedding captures semantic similarity at the lexical level. Focusing on the difference in scores, we can observe that the input embedding learns representations that capture semantic similarity in contrast to the decoupled output embedding. At the same time, the decoupled model achieves higher performance in masked language modeling.

The last three rows of Table 8 show that as E_out increases, the difference between the input and output embedding increases as well. With additional capacity, the output embedding progressively learns representations that differ more significantly from the input embedding. We also observe that the MLM accuracy increases with E_out. Collectively, the results in Table 8 suggest that with increased capacity, the output embeddings learn representations that are worse at capturing traditional semantic similarity (which is purely restricted to the lexical level) while being more specialized to the MLM task (which requires more contextual representations). Decoupling embeddings thus gives the model the flexibility to avoid encoding relationships in its output embeddings that may not be useful for its pre-training task. As pre-training performance correlates well with downstream performance (Devlin et al., 2019), forcing output embeddings to encode lexical information can hurt the latter.
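The evaluation protocol described in this section (average subword vectors per word, score a pair by the dot product, correlate with human judgements) can be sketched as follows. This is an illustrative sketch only: the text does not name the correlation measure, so the use of Spearman rank correlation is an assumption, and the toy vectors, tokenization, and human scores are invented for the example.

```python
def mean_vector(vectors):
    """Average a list of equal-length vectors component-wise."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def word_similarity_correlation(pairs, human_scores, subword_vectors, tokenize):
    """Spearman correlation between dot-product scores and human judgements."""
    model_scores = []
    for w1, w2 in pairs:
        v1 = mean_vector([subword_vectors[t] for t in tokenize(w1)])
        v2 = mean_vector([subword_vectors[t] for t in tokenize(w2)])
        model_scores.append(dot(v1, v2))
    return pearson(ranks(model_scores), ranks(human_scores))


# Hypothetical toy data, for illustration only.
toy_vectors = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0], "##s": [0.0, 0.0]}
toy_tokens = {"cat": ["cat"], "cats": ["cat", "##s"], "dog": ["dog"], "car": ["car"]}
pairs = [("cat", "dog"), ("cat", "car"), ("cats", "dog")]
human = [9.0, 1.0, 8.0]
score = word_similarity_correlation(pairs, human, toy_vectors, toy_tokens.__getitem__)
print(f"correlation: {score:.2f}")
```

The same function could be applied once to the input embedding matrix and once to the output embedding matrix to compare the two, as done in Table 8.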

6.2. CROSS-TASK TRANSFERABILITY OF TRANSFORMER LAYER REPRESENTATIONS

We investigate to what extent more capacity in the output embeddings during pre-training reduces the MLM-specific burden on the Transformer layers and hence prevents them from over-specializing to the MLM task.

Dropping the last few layers. We first study the impact of an increased output embedding size on the transferability of the last few layers. Previous work (Zhang et al., 2020; Tamkin et al., 2020) randomly reinitialized the last few layers to investigate their transferability. However, those parameters are still present during fine-tuning. We propose a more aggressive pruning scheme where we completely remove the last few layers. This setting demonstrates more drastically whether a model's upper layers are over-specialized to the pre-training task by assessing whether performance can be improved with millions fewer parameters.⁸ We show the performance of models with 8-12 remaining layers (removing up to 4 of the last layers) for different output embedding sizes E_out on XNLI in Figure 1. For both E_out = 128 and E_out = 768, removing the last layer improves performance. In other words, the model performs better even with 7.1M fewer parameters. With E_out = 128, the performance remains similar when removing the last few layers, which suggests that the last few layers are not critical for transferability. As we increase E_out, the last layers become more transferable. With E_out = 768, removing more than one layer results in a sharp reduction in performance. Finally, when E_out = 3072, every layer is useful and removing any layer worsens the performance. This analysis demonstrates that increasing E_out improves the transferability of the representations learned by the last few Transformer layers.

⁷ This is opposite from what Press & Wolf (2017) observed in 2-layer LSTMs. They find that performance of the output embedding is similar to the embedding of a coupled model. This difference is plausible as the information encoded in large Transformers changes significantly throughout the model (Tenney et al., 2019).
⁸ Each Transformer layer with H = 768 has about 7.1M parameters.

Published as a conference paper at ICLR 2021

Probing analysis. We further study whether an increased output embedding size improves the general natural language processing ability of the Transformer. We employ the probing analysis of Tenney et al. (2019) and the mix probing strategy where a 2-layer dense network is trained on top of a weighted combination of the 12 Transformer layers. We evaluate performance with regard to core NLP concepts including part-of-speech tagging (POS), constituents (Consts.), dependencies (Deps.), entities, semantic role labeling (SRL), coreference (Coref.), semantic proto-roles (SPR), and relations (Rel.). For a thorough description of the task setup, see Tenney et al. (2019).⁹ We show the results of the probing analysis in Table 9. As we increase E_out, the model improves across all tasks, even though the number of parameters is the same. This demonstrates that increasing E_out enables the Transformer layers to learn more general representations.¹⁰
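The figure of about 7.1M parameters per Transformer layer at H = 768, used in the layer-removal analysis, can be recovered with a rough count. The sketch below assumes a standard BERT-style layer (four attention projections with biases, a feed-forward block with inner size 4H, and two layer normalizations); this breakdown is an assumption rather than something specified in the text.

```python
def transformer_layer_params(hidden, ffn_multiplier=4):
    """Approximate parameter count of one BERT-style Transformer layer."""
    # Query, key, value, and output projections, each with a bias vector.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers (hidden -> ffn_dim -> hidden), each with a bias vector.
    ffn_dim = ffn_multiplier * hidden
    feed_forward = hidden * ffn_dim + ffn_dim + ffn_dim * hidden + hidden
    # Two layer normalizations, each with a scale and a shift vector.
    layer_norms = 2 * 2 * hidden
    return attention + feed_forward + layer_norms


per_layer = transformer_layer_params(768)
extra_layers = round(77e6 / per_layer)
print(f"parameters per layer: {per_layer / 1e6:.1f}M")
print(f"layers affordable with 77M freed parameters: {extra_layers}")
```

Under this count, the 77M parameters freed by shrinking E_in pay for roughly 11 additional layers, which is consistent with the move from 12 to 23 layers in §5.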

6.3. CROSS-LINGUAL TRANSFERABILITY OF TRANSFORMER LAYER REPRESENTATIONS

So far, our analyses were not specialized to multilingual models. Unlike monolingual models, multilingual models have another dimension of transferability: cross-lingual transfer, the ability to transfer knowledge from one language to another. Previous work (Pires et al., 2019; Artetxe et al., 2020) has found that MLM on multilingual data encourages cross-lingual alignment of representations without explicit cross-lingual supervision. While it has been shown that multilingual models learn useful cross-lingual representations, overspecialization to the pre-training task may result in higher layers being less cross-lingual and focusing on language-specific phenomena necessary for predicting the next word in a given language. To investigate to what extent this is the case and whether increasing E_out improves cross-lingual alignment, we evaluate the model's nearest-neighbor translation accuracy (Pires et al., 2019) on English-to-German translation (see Appendix A.9 for a description of the method).

We show the nearest-neighbor translation accuracy for each layer in Figure 2. As E_out increases, we observe that a) the Transformer layers become more language-agnostic as evidenced by higher accuracy and b) the language-agnostic representation is maintained to a higher layer as indicated by a flatter slope from layer 7 to 11. In all cases, the last layer is less language-agnostic than the previous one. The sharp drop in performance after layer 8 at E_out = 128 is in line with previous results on cross-lingual retrieval (Pires et al., 2019; Hu et al., 2020) and is partially mitigated by an increased E_out. In sum, not only does a larger output embedding size improve cross-task transferability but it also helps with cross-lingual alignment and thereby cross-lingual transfer on downstream tasks.

7. CONCLUSION

We have assessed the impact of embedding coupling in pre-trained language models. We have identified the main benefit of decoupled embeddings to be the flexibility endowed by decoupling their shapes. We showed that input embeddings can be safely reduced and that larger output embeddings and reinvesting saved parameters lead to performance improvements. Our rebalanced multilingual BERT (RemBERT) outperforms XLM-R with the same number of fine-tuning parameters while having been trained on 3.5× fewer tokens. Overall, we found that larger output embeddings lead to more transferable and more general representations, particularly in a Transformer's upper layers.

Table 10: Fine-tuning hyperparameters for all models except RemBERT.

Task     Learning rate                                                  Batch size   Train epochs
PAWS-X   [3 × 10^-5, 4 × 10^-5, 5 × 10^-5]                              32           3
XNLI     [1 × 10^-5, 2 × 10^-5, 3 × 10^-5]                              32           3
SQuAD    [2 × 10^-5, 3 × 10^-5, 4 × 10^-5]                              32           3
NER      [1 × 10^-5, 2 × 10^-5, 3 × 10^-5, 4 × 10^-5, 5 × 10^-5]        32           3

[…] (2019) using masked language modeling (MLM). We choose this baseline as its behavior has been thoroughly studied (K et al., 2020; Conneau et al., 2020b; Pires et al., 2019; Wu & Dredze, 2019).

A.3 TRAINING DETAILS

For all pre-training except for the large-scale RemBERT, we trained using 64 Google Cloud TPUs. We trained over 26B tokens of Wikipedia data. All fine-tuning experiments were run on 8 Cloud TPUs. For all fine-tuning experiments other than RemBERT, we use a batch size of 32. We sweep over the learning rate values specified in Table 10. We used the SentencePiece tokenizer trained with unigram language modeling.

A.4 XTREME TASKS

For our experiments, we employ tasks from the XTREME benchmark (Hu et al., 2020). We show statistics for them in […] (Zweigenbaum et al., 2018), and the Tatoeba dataset (Artetxe & Schwenk, 2019). We refer the reader to Hu et al. (2020) for more details. We average results across three fine-tuning runs and evaluate on the dev sets unless otherwise stated.

Crucially, our finding differs from the dimensionality reduction in ALBERT (Lan et al., 2020). While they show that smaller embeddings can be used, their input and output embeddings are coupled and use a much smaller vocabulary (30k vs. 120k). In contrast, we find that simultaneously decreasing both the input and output embedding size drastically reduces the performance of multilingual models. In Table 12, we show the impact of their factorized embedding parameterization on a monolingual and a multilingual model. While the English model suffers a smaller (0.8%) drop in accuracy, the multilingual model's performance drops by 2.6%. Direct application of a factorized embedding parameterization (Lan et al., 2020) is thus not viable for multilingual models.

A.6 ENGLISH MONOLINGUAL RESULTS

So far, we have focused on multilingual models as the number of saved parameters when reducing the input embedding size is largest for them. We now apply the same techniques to the English 12-layer BERT Base with a 30k vocabulary (Devlin et al., 2019). Specifically, we decouple the embeddings, reduce E_in to 128, and increase the output embedding size or the number of layers during pre-training. We show the performance on MNLI (Williams et al., 2018) and SQuAD (Rajpurkar et al., 2016) in Table 13. By adding more capacity during pre-training, performance monotonically increases, similar to the multilingual models. Interestingly, pruning a 24-layer model to 12 layers reduces performance, presumably because some upper layers still contain useful information.

A.7 REMBERT DETAILS

We design a Rebalanced mBERT (RemBERT) to leverage capacity more effectively during pre-training. The model has 995M parameters during pre-training and 575M parameters during fine-tuning. We pre-train on large unlabeled text using both Wikipedia and Common Crawl data, covering 110 languages. The details of hyperparameters and architecture are shown in Table 14. For each language l, we define the empirical distribution as

$$p_l = \frac{n_l}{\sum_{l' \in L} n_{l'}} \tag{1}$$



¹ Output embedding is sometimes referred to as "output weights", i.e., the weight matrix in the output projection in a language model.
² We focus on encoder-only models, and do not consider encoder-decoder models like T5 (Raffel et al., 2020) where none of the embedding matrices are discarded after pre-training. Output embeddings may also be useful for domain-adaptive pre-training (Howard & Ruder, 2018; Gururangan et al., 2020), probing (Elazar & Goldberg, 2019), and tasks that can be cast in the pre-training objective (Amrami & Goldberg, 2019).
³ We linearly project the embeddings from E_in to H and from H to E_out.
⁴ We observe the same trend if we control for the number of trainable parameters during fine-tuning by freezing the input embedding parameters.
⁵ We choose 1024 dimensions to optimize efficient use of our accelerators.
⁶ We do not consider retrieval tasks as they require intermediate task data (Phang et al., 2020).
⁹ The probing tasks are in English while our encoder is multilingual.
¹⁰ In Tenney et al. (2019), going from a BERT-base to a BERT-large model (with 3× more parameters) improves performance on average by 1.1 points, compared to our improvement of 0.5 points without increasing the number of fine-tuning parameters.
¹¹ For encoder-only models such as BERT, parameters after the last Transformer layer (e.g., the output embeddings and the pooling layer) are discarded after pre-training.



Figure 1: XNLI accuracy with the last layers removed. Larger E_out improves transferability.

Table 2: Effect of decoupling the input and output embedding matrices on performance on multiple tasks in XTREME. PT: Pre-training. FT: Fine-tuning. The decoupled model has input and output embeddings with the same size (E = 768) as the embedding of the coupled model. The Transformer parts of the models are the same (i.e., 12 layers with H = 768).

Table 3: Performance of models with a large input and small output embedding size and vice versa. Both models have 12 Transformer layers with H = 768.

Table 4: Effect of an increased output embedding size E_out on tasks in XTREME. All three models have E_in = 128 and 12 Transformer layers with H = 768.

Table 5: Effect of additional capacity via more Transformer layers during pre-training. Both models have E_in = 128. The E_out = 768 model has a larger output embedding size E_out and 12 Transformer layers. In contrast, the model with 11 additional Transformer layers has E_out = 128. Those additional layers are dropped after pre-training, leaving 12 layers for fair comparison during fine-tuning.

Table 6: Effect of reinvesting the input embedding parameters to increase the hidden dimension H and number of Transformer layers L on XTREME tasks. E_in = 128, E_out = 768, H = 768 for all models except for the baseline, which has coupled embeddings and E_in = E_out = 768.

Table 7: Comparison of our model to other models on the XTREME leaderboard. Details about VECO are due to communication with the authors.

Table 8: Results on word embedding association tests for the input (I) and output (O) embeddings of models (left) and the models' masked language modeling performance (right). The first two rows show the performance of coupled and decoupled embeddings with the same embedding size E_in = E_out = 768. The last three rows show the performance as we increase the output embedding size with E_in = 128.

Table 9: Probing analysis of Tenney et al. (2019) with mix strategy.

Table 11: Statistics for the datasets in XTREME, including the number of training, development, and test examples as well as the number of languages for each task.

Table 12: Effect of reducing the embedding size E for monolingual vs. multilingual models on MNLI and XNLI performance respectively. Monolingual numbers are from Lan et al. (2020) and have vocabulary size of 30k.

Table 13: Effect of an increased output embedding size E_out and additional layers during pre-training L = 15 on English BERT Base (E_in = 128).

ACKNOWLEDGEMENTS

We would like to thank Laura Rimell for valuable feedback on a draft of this paper.

where n_l is the number of sentences in l's pre-training corpus. Following Devlin et al. (2019), we use an exponentially smoothed distribution, i.e., we exponentiate p_l by α = 0.5 and renormalize to obtain the sampling distribution. Hyperparameters and pre-training details are summarized in Table 14. Hyperparameters used for the leaderboard submission are shown in Table 15.
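The smoothing step can be sketched in a few lines: compute the empirical distribution over languages, raise each probability to the power α, and renormalize. The sentence counts below are hypothetical and serve only to show the effect on low-resource languages.

```python
def smoothed_sampling_distribution(sentence_counts, alpha=0.5):
    """Exponentially smooth the empirical language distribution and renormalize."""
    total = sum(sentence_counts.values())
    empirical = {lang: n / total for lang, n in sentence_counts.items()}
    unnormalized = {lang: p ** alpha for lang, p in empirical.items()}
    norm = sum(unnormalized.values())
    return {lang: q / norm for lang, q in unnormalized.items()}


# Hypothetical sentence counts for three languages.
counts = {"en": 9_000_000, "de": 900_000, "sw": 100_000}
sampling = smoothed_sampling_distribution(counts, alpha=0.5)
for lang, prob in sampling.items():
    print(f"{lang}: empirical {counts[lang] / sum(counts.values()):.3f} -> sampled {prob:.3f}")
```

With α < 1, high-resource languages are sampled less often than their raw share and low-resource languages more often, while the ordering of languages by probability is preserved.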

A.8 XTREME TASK RESULTS

We show the detailed results for RemBERT and the comparison per task on the XTREME leaderboard in Table 16. Compared to Table 7, which shows the average across task categories, this table shows the average across tasks.

A.9 NEAREST-NEIGHBOR TRANSLATION COMPUTATION

For an English-to-German translation, we sample M = 5000 pairs of sentences from WMT16 (Bojar et al., 2016). For each sentence in each language, we obtain a representation $v^{(l)}_{\text{LANG}}$ at each layer l by averaging the activations of all tokens (except the [CLS] and [SEP] tokens) at that layer. We then compute a translation vector from English to German by averaging the difference between the vectors of each sentence pair across all pairs:

$$\bar{v}^{(l)}_{\text{EN} \to \text{DE}} = \frac{1}{M} \sum_{i=1}^{M} \left( v^{(l)}_{\text{DE}_i} - v^{(l)}_{\text{EN}_i} \right)$$

For each English sentence vector $v^{(l)}_{\text{EN}_i}$, we can now translate it with this vector: $v^{(l)}_{\text{EN}_i} + \bar{v}^{(l)}_{\text{EN} \to \text{DE}}$. We locate the closest German sentence vector based on ℓ2 distance and measure how often the nearest neighbor is the correct pair.
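The procedure in A.9 can be sketched for a single layer as follows. This is a minimal illustration, not the authors' implementation: it assumes [CLS] and [SEP] occupy the first and last token positions, uses squared ℓ2 distance (which yields the same nearest neighbor as ℓ2), and runs on hypothetical toy vectors.

```python
def mean_pool(token_vectors):
    """Average token activations, skipping the first ([CLS]) and last ([SEP]) positions."""
    inner = token_vectors[1:-1]
    return [sum(column) / len(inner) for column in zip(*inner)]


def translation_vector(en_vectors, de_vectors):
    """Mean of (German - English) sentence vectors over all aligned pairs."""
    diffs = [[d - e for e, d in zip(en, de)] for en, de in zip(en_vectors, de_vectors)]
    return [sum(column) / len(diffs) for column in zip(*diffs)]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nn_translation_accuracy(en_vectors, de_vectors):
    """Fraction of English sentences whose shifted vector is closest to its true pair."""
    shift = translation_vector(en_vectors, de_vectors)
    hits = 0
    for i, en in enumerate(en_vectors):
        query = [x + s for x, s in zip(en, shift)]
        nearest = min(range(len(de_vectors)), key=lambda j: squared_l2(query, de_vectors[j]))
        hits += int(nearest == i)
    return hits / len(en_vectors)


# Hypothetical sentence vectors at one layer: German is English plus a constant offset.
en = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
de = [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
accuracy = nn_translation_accuracy(en, de)
print(f"nearest-neighbor translation accuracy: {accuracy:.2f}")
```

Repeating this computation with the sentence vectors from each layer would give a per-layer accuracy curve of the kind shown in Figure 2.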

