DEBERTAV3: IMPROVING DEBERTA USING ELECTRA-STYLE PRE-TRAINING WITH GRADIENT-DISENTANGLED EMBEDDING SHARING

Abstract

This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model by replacing masked language modeling (MLM) with replaced token detection (RTD), a more sample-efficient pre-training task. Our analysis shows that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance, because the training losses of the discriminator and the generator pull token embeddings in different directions, creating the "tug-of-war" dynamics. We thus propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics, improving both training efficiency and the quality of the pre-trained model. We have pre-trained DeBERTaV3 using the same settings as DeBERTa to demonstrate its exceptional performance on a wide range of downstream natural language understanding (NLU) tasks. Taking the GLUE benchmark with eight tasks as an example, the DeBERTaV3 Large model achieves a 91.37% average score, which is 1.37% higher than DeBERTa and 1.91% higher than ELECTRA, setting a new state-of-the-art (SOTA) among models with a similar structure. Furthermore, we have pre-trained a multilingual model, mDeBERTaV3, and observed a larger improvement over strong baselines than for the English models. For example, mDeBERTaV3 Base achieves a 79.8% zero-shot cross-lingual accuracy on XNLI, a 3.6% improvement over XLM-R Base, creating a new SOTA on this benchmark. Our models and code are publicly available at https://github.com/microsoft/DeBERTa.

1. INTRODUCTION

Recent advances in Pre-trained Language Models (PLMs) have created new state-of-the-art results on many natural language processing (NLP) tasks. While scaling up PLMs to billions or trillions of parameters (Raffel et al., 2020; Radford et al., 2019; Brown et al., 2020; He et al., 2020; Fedus et al., 2021) is a well-proven way to improve model capacity, it is more important to explore energy-efficient approaches that build PLMs with fewer parameters and less computation cost while retaining high model capacity. A few works have made significant progress in this direction. The first is RoBERTa (Liu et al., 2019), which improves model capacity with a larger batch size and more training data. Building on RoBERTa, DeBERTa (He et al., 2020) further improves pre-training efficiency by incorporating disentangled attention, an improved relative-position encoding mechanism. By scaling up to 1.5B parameters, about an eighth of the parameters of xxlarge T5 (Raffel et al., 2020), DeBERTa surpassed human performance on the SuperGLUE (Wang et al., 2019a) leaderboard for the first time. The second approach to improve pre-training efficiency is Replaced Token Detection (RTD), proposed by ELECTRA (Clark et al., 2020). Unlike BERT (Devlin et al., 2019), which uses a Transformer encoder trained with masked language modeling (MLM) to predict the original tokens at corrupted positions, RTD uses a generator to produce ambiguous corruptions and a discriminator to distinguish the ambiguous tokens from the original inputs, similar to Generative Adversarial Networks (GANs). The effectiveness of RTD has also been verified by several works, including CoCo-LM (Meng et al., 2021), XLM-E (Chi et al., 2021), CodeBERT (Feng et al., 2020) and SmallBenchNLP (Kanakarajan et al., 2021). In this paper, we explore two methods of improving the efficiency of pre-training DeBERTa.
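To make the RTD setup concrete, the following is a minimal sketch of how a corrupted input and its per-token detection labels are built. A random sampler stands in for ELECTRA's small MLM generator (a simplifying assumption; in practice the generator proposes plausible, ambiguous replacements), and the function name and toy vocabulary are hypothetical:

```python
import random

def make_rtd_example(tokens, mask_prob=0.15, vocab=None, seed=0):
    """Build a replaced-token-detection (RTD) training example.

    Each token is replaced with probability `mask_prob`; the replacement
    is sampled from a vocabulary (here uniformly at random, standing in
    for a small MLM generator). Returns the corrupted sequence and the
    per-token binary labels (1 = replaced, 0 = original) that the
    discriminator is trained to predict.
    """
    rng = random.Random(seed)
    vocab = vocab or ["the", "a", "cat", "dog", "sat", "ran", "mat"]
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            # Sample a replacement that differs from the original token.
            new_tok = rng.choice([v for v in vocab if v != tok])
            corrupted.append(new_tok)
            labels.append(1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

corrupted, labels = make_rtd_example(
    ["the", "cat", "sat", "on", "the", "mat"], mask_prob=0.5)
```

Unlike MLM, which computes a loss only at the ~15% of masked positions, the RTD discriminator receives a training signal from every token in the sequence, which is a key source of its sample efficiency.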
First, following ELECTRA-style training, we replace MLM in DeBERTa with RTD, where the model is trained as a discriminator to predict whether each token in the corrupted input is original or has been replaced by a generator. We show that DeBERTa trained with RTD significantly outperforms the same model trained with MLM. The second method is a new embedding sharing scheme. In ELECTRA, the discriminator and the generator share the same token embeddings. However, our analysis shows that this sharing hurts training efficiency and model performance, since the training losses of the discriminator and the generator pull token embeddings in opposite directions. The reason is that their training objectives conflict: the MLM used to train the generator pulls the embeddings of semantically similar tokens close to each other, while the RTD objective of the discriminator must distinguish semantically similar tokens and therefore pushes their embeddings as far apart as possible to optimize binary classification accuracy. In other words, this creates "tug-of-war" dynamics that reduce training efficiency and model quality, as illustrated in Hadsell et al. (2020). On the other hand, we show that using separate embeddings for the generator and the discriminator results in significant performance degradation when we fine-tune the discriminator on downstream tasks, indicating the merit of embedding sharing: the generator's embeddings help produce a better discriminator, as argued in Clark et al. (2020). To balance these tradeoffs, we propose a new gradient-disentangled embedding sharing (GDES) method, in which the generator shares its embeddings with the discriminator but gradients from the discriminator loss are stopped from flowing back into the generator's embeddings. This way, we avoid the tug-of-war effect while preserving the benefits of embedding sharing.
We empirically demonstrate that GDES improves both pre-training efficiency and the quality of the pre-trained models. We pre-train four variants of DeBERTaV3 models, i.e., DeBERTaV3-large, DeBERTaV3-base, DeBERTaV3-small and DeBERTaV3-xsmall. We evaluate them on various representative natural language understanding (NLU) benchmarks and set new state-of-the-art numbers among models with a similar structure. For example, DeBERTaV3-large surpasses previous SOTA models with a similar structure on the GLUE (Wang et al., 2019b) benchmark by more than 1.37% in average score, which is significant. DeBERTaV3-base achieves a 90.6% accuracy on the MNLI-matched (Williams et al., 2018) evaluation set and an 88.4% F1 score on the SQuAD v2.0 (Rajpurkar et al., 2018) evaluation set, improving over DeBERTa-base by 1.8% and 2.2%, respectively. Without knowledge distillation, DeBERTaV3-small and DeBERTaV3-xsmall surpass previous SOTA models with a similar structure on the MNLI-matched and SQuAD v2.0 evaluation sets by more than 1.2% in accuracy and 1.3% in F1, respectively. We also train DeBERTaV3-base on the CC100 (Conneau et al., 2020) multilingual data, using a setting similar to XLM-R (Conneau et al., 2020) but with only a third of the training passes. We denote this model mDeBERTaV3-base. Under the cross-lingual transfer setting, mDeBERTaV3-base achieves a 79.8% average accuracy on the XNLI (Conneau et al., 2018) task, outperforming XLM-R-base and mT5-base (Xue et al., 2021) by 3.6% and 4.4%, respectively. This makes mDeBERTaV3 the best model among multilingual models with a similar structure. All these results strongly demonstrate the efficiency of DeBERTaV3 models and set a good basis for future exploration towards more efficient PLMs.

2. BACKGROUND

2.1 TRANSFORMER

A Transformer-based language model is composed of L stacked Transformer blocks (Vaswani et al., 2017). Each block contains a multi-head self-attention layer followed by a fully connected position-wise feed-forward network. The standard self-attention mechanism lacks a natural way to encode word position information. Thus, existing approaches add a positional bias to each input word embedding, so that each input word is represented by a vector whose value depends on both its content and its position. The positional bias can be implemented using absolute position embeddings (Vaswani et al., 2017; Brown et al., 2020; Devlin et al., 2019) or relative position embeddings (Huang et al., 2018; Yang et al., 2019). Several studies have shown that relative position representations are more effective

