RETHINKING EMBEDDING COUPLING IN PRE-TRAINED LANGUAGE MODELS

Abstract

We re-evaluate the standard practice of sharing weights between input and output embeddings in state-of-the-art pre-trained language models. We show that decoupled embeddings provide increased modeling flexibility, allowing us to significantly improve the efficiency of parameter allocation in the input embedding of multilingual models. By reallocating the input embedding parameters in the Transformer layers, we achieve dramatically better performance on standard natural language understanding tasks with the same number of parameters during fine-tuning. We also show that allocating additional capacity to the output embedding provides benefits to the model that persist through the fine-tuning stage even though the output embedding is discarded after pre-training. Our analysis shows that larger output embeddings prevent the model's last layers from overspecializing to the pre-training task and encourage Transformer representations to be more general and more transferable to other tasks and languages. Harnessing these findings, we are able to train models that achieve strong performance on the XTREME benchmark without increasing the number of parameters at the fine-tuning stage.

1. INTRODUCTION

The performance of models in natural language processing (NLP) has dramatically improved in recent years, mainly driven by advances in transfer learning from large amounts of unlabeled data (Howard & Ruder, 2018; Devlin et al., 2019). The most successful paradigm consists of pre-training a large Transformer (Vaswani et al., 2017) model with a self-supervised loss and fine-tuning it on data of a downstream task (Ruder et al., 2019). Despite its empirical success, inefficiencies have been observed related to the training duration (Liu et al., 2019b), pre-training objective (Clark et al., 2020b), and training data (Conneau et al., 2020a), among others. In this paper, we reconsider a modeling assumption that may have a similarly pervasive practical impact: the coupling of input and output embeddings in state-of-the-art pre-trained language models.

State-of-the-art pre-trained language models (Devlin et al., 2019; Liu et al., 2019b) and their multilingual counterparts (Devlin et al., 2019; Conneau et al., 2020a) have inherited the practice of embedding coupling from their language model predecessors (Press & Wolf, 2017; Inan et al., 2017). However, in contrast to their language model counterparts, embedding coupling in encoder-only pre-trained models such as Devlin et al. (2019) is only useful during pre-training, since output embeddings are generally discarded after pre-training. In addition, given the willingness of researchers to exchange additional compute during pre-training for improved downstream performance (Raffel et al., 2020; Brown et al., 2020) and the fact that pre-trained models are often used for inference millions of times (Wolf et al., 2019), pre-training-specific parameter savings are less important overall. On the other hand, tying input and output embeddings constrains the model to use the same dimensionality for both embeddings. This restriction limits the researcher's flexibility in parameterizing the model and can lead to allocating too much capacity to the input embeddings, which may be wasteful.
This is a problem particularly for multilingual models, which require large vocabularies with high-dimensional embeddings that make up between 47-71% of the entire parameter budget (Table 1), suggesting an inefficient parameter allocation.

In this paper, we systematically study the impact of embedding coupling on state-of-the-art pre-trained language models, focusing on multilingual models. First, we observe that while naïvely decoupling the input and output embedding parameters does not consistently improve downstream evaluation metrics, decoupling their shapes comes with a host of benefits. In particular, it allows us to independently modify the input and output embedding dimensions. We show that the input embedding dimension can be safely reduced without affecting downstream performance. Since the output embedding is discarded after pre-training, we can increase its dimension, which improves fine-tuning accuracy and outperforms other capacity expansion strategies. By reinvesting the saved parameters in the width and depth of the Transformer layers, we furthermore achieve significantly improved performance over a strong mBERT (Devlin et al., 2019) baseline on multilingual tasks from the XTREME benchmark (Hu et al., 2020). Finally, we combine our techniques in a Rebalanced mBERT (RemBERT) model that outperforms XLM-R (Conneau et al., 2020a), the state-of-the-art cross-lingual model, while having been pre-trained on 3.5× fewer tokens and covering 10 more languages.

We thoroughly investigate the reasons for the benefits of embedding decoupling. We observe that an increased output embedding size enables a model to improve on the pre-training task, which correlates with downstream performance. We also find that it leads to Transformers that are more transferable across tasks and languages, particularly in the upper-most layers.
Overall, larger output embeddings prevent the model's last layers from over-specializing to the pre-training task (Zhang et al., 2020; Tamkin et al., 2020) , which enables training of more general Transformer models.
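The decoupling idea above can be made concrete with a small sketch. The toy dimensions, matrix names, and the identity stand-in for the Transformer body below are our own illustrative assumptions, not the paper's actual implementation: the input embedding is small and projected up to the model width, while the output embedding is large and only used for the pre-training softmax.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical, not the paper's settings).
V, d_in, d_model, d_out = 1000, 128, 256, 512

# Decoupled embeddings: input and output matrices have independent shapes.
E_in = rng.normal(size=(V, d_in))           # input embedding, deliberately small
W_up = rng.normal(size=(d_in, d_model))     # projects embeddings up to the Transformer width
E_out = rng.normal(size=(V, d_out))         # output embedding, large; discarded after pre-training
W_down = rng.normal(size=(d_model, d_out))  # projects hidden states to the output embedding space

def mlm_logits(token_ids, body=lambda h: h):
    """Input lookup -> (stand-in) Transformer body -> pre-softmax logits over V."""
    h = E_in[token_ids] @ W_up   # (seq_len, d_model)
    h = body(h)                  # the Transformer layers would go here
    return h @ W_down @ E_out.T  # (seq_len, V)

logits = mlm_logits(np.array([1, 5, 42]))
assert logits.shape == (3, V)

# With coupled embeddings, E_in and E_out would be one shared matrix, forcing
# d_in == d_out. Decoupling lets us shrink d_in (and reinvest the savings in
# the Transformer) while growing d_out for pre-training only.
```

At fine-tuning time only `E_in`, `W_up`, and the Transformer body would be kept, so the extra `d_out` capacity costs nothing downstream.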

2. RELATED WORK

Embedding coupling Sharing input and output embeddings in neural language models was proposed to improve perplexity, motivated by embedding similarity (Press & Wolf, 2017) as well as by a theoretical argument that, in a restricted case, the output probability space can be constrained to a subspace governed by the embedding matrix (Inan et al., 2017). Embedding coupling is also common in neural machine translation models, where it reduces model complexity (Firat et al., 2016) and saves memory (Johnson et al., 2017), in recent state-of-the-art language models (Melis et al., 2020), as well as in all pre-trained models we are aware of (Devlin et al., 2019; Liu et al., 2019b).

Transferability of representations Representations of large pre-trained models in computer vision and NLP have been observed to transition from general to task-specific from the first to the last layer. Such representations have been found useful for domain-adaptive pre-training (Howard & Ruder, 2018; Gururangan et al., 2020), probing (Elazar & Goldberg, 2019), and tasks that can be cast in the pre-training objective (Amrami & Goldberg, 2019).



Footnotes: The output embedding is sometimes referred to as the "output weights", i.e., the weight matrix in the output projection of a language model. We focus on encoder-only models, and do not consider encoder-decoder models such as T5 (Raffel et al., 2020), where none of the embedding matrices are discarded after pre-training. Output embeddings may also be



Table 1: Overview of the number of parameters in (coupled) embedding matrices of state-of-the-art multilingual (top) and monolingual (bottom) models relative to the overall parameter budget. |V|: vocabulary size. N, N_emb: number of parameters in total and in the embedding matrix, respectively.
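The embedding share of the parameter budget reported in tables like this one can be estimated directly from |V|, the embedding dimension, and the total parameter count. The mBERT-like figures below are approximate, illustrative values, not the exact entries of Table 1:

```python
def embedding_fraction(vocab_size, d_model, total_params):
    """Fraction of the total parameter budget spent on one coupled embedding matrix."""
    n_emb = vocab_size * d_model
    return n_emb / total_params

# Approximate mBERT-like figures: |V| ~ 119,547 WordPiece tokens,
# d_model = 768, ~178M total parameters (illustrative assumptions).
frac = embedding_fraction(119_547, 768, 178_000_000)
print(f"{frac:.0%}")  # roughly half the budget sits in the embedding matrix
```

For monolingual models with smaller vocabularies the same computation yields a much lower share, which is why the inefficiency is most pronounced in the multilingual setting.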

