RETHINKING EMBEDDING COUPLING IN PRE-TRAINED LANGUAGE MODELS

Abstract

We re-evaluate the standard practice of sharing weights between input and output embeddings in state-of-the-art pre-trained language models. We show that decoupled embeddings provide increased modeling flexibility, allowing us to significantly improve the efficiency of parameter allocation in the input embedding of multilingual models. By reallocating the input embedding parameters in the Transformer layers, we achieve dramatically better performance on standard natural language understanding tasks with the same number of parameters during fine-tuning. We also show that allocating additional capacity to the output embedding provides benefits to the model that persist through the fine-tuning stage even though the output embedding is discarded after pre-training. Our analysis shows that larger output embeddings prevent the model's last layers from overspecializing to the pre-training task and encourage Transformer representations to be more general and more transferable to other tasks and languages. Harnessing these findings, we are able to train models that achieve strong performance on the XTREME benchmark without increasing the number of parameters at the fine-tuning stage.
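The coupling discussed above, often called weight tying, reuses the input embedding matrix as the output projection, whereas decoupling keeps the two matrices separate. The following minimal sketch (illustrative only, not the paper's implementation; all names and sizes are chosen for exposition) shows the difference with toy matrices:

```python
# Sketch of tied vs. decoupled embeddings (illustrative; toy sizes).
import numpy as np

vocab_size, hidden = 8, 4
rng = np.random.default_rng(0)

E = rng.normal(size=(vocab_size, hidden))  # input embedding matrix
O = rng.normal(size=(vocab_size, hidden))  # separate output embedding (decoupled case)

token_id = 3
h = E[token_id]  # stand-in for an encoder representation (identity encoder here)

tied_logits = E @ h        # tied: output projection reuses E
decoupled_logits = O @ h   # decoupled: independent output weights

# With tying, a token's logit against its own representation is the
# squared norm of its embedding, so E receives gradients from both
# the input side and the output side of the model.
assert np.isclose(tied_logits[token_id], h @ h)
```

In the decoupled case, `O` can be sized independently of `E`, which is what allows the reallocation of input-embedding parameters and the enlarged output embeddings studied in this paper.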

1. INTRODUCTION

The performance of models in natural language processing (NLP) has improved dramatically in recent years, mainly driven by advances in transfer learning from large amounts of unlabeled data (Howard & Ruder, 2018; Devlin et al., 2019). The most successful paradigm consists of pre-training a large Transformer (Vaswani et al., 2017) model with a self-supervised loss and fine-tuning it on data of a downstream task (Ruder et al., 2019). Despite its empirical success, inefficiencies have been observed related to the training duration (Liu et al., 2019b), the pre-training objective (Clark et al., 2020b), and the training data (Conneau et al., 2020a), among others.

In this paper, we reconsider a modeling assumption that may have a similarly pervasive practical impact: the coupling of input and output embeddings¹ in state-of-the-art pre-trained language models. State-of-the-art pre-trained language models (Devlin et al., 2019; Liu et al., 2019b) and their multilingual counterparts (Devlin et al., 2019; Conneau et al., 2020a) have inherited the practice of embedding coupling from their language model predecessors (Press & Wolf, 2017; Inan et al., 2017). However, in contrast to their language model counterparts, embedding coupling in encoder-only pre-trained models such as that of Devlin et al. (2019) is only useful during pre-training, since output embeddings are generally discarded after fine-tuning.² In addition, given the willingness of researchers to exchange additional compute during pre-training for improved downstream performance (Raffel



¹ Output embedding is sometimes referred to as "output weights", i.e., the weight matrix in the output projection in a language model.

² We focus on encoder-only models, and do not consider encoder-decoder models like T5 (Raffel et al., 2020) where none of the embedding matrices are discarded after pre-training. Output embeddings may also be

