DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION

Abstract

Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and on RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8). The pre-trained DeBERTa models and the source code were released at https://github.com/microsoft/DeBERTa (footnote 1).

1. INTRODUCTION

The Transformer has become the most effective neural network architecture for neural language modeling. Unlike recurrent neural networks (RNNs) that process text in sequence, Transformers apply self-attention to compute in parallel, for every word in the input text, an attention weight that gauges the influence each word has on another, thus allowing for much more parallelization than RNNs for large-scale model training (Vaswani et al., 2017). Since 2018, we have seen the rise of a set of large-scale Transformer-based Pre-trained Language Models (PLMs), such as GPT (Radford et al., 2019; Brown et al., 2020), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019c), XLNet (Yang et al., 2019), UniLM (Dong et al., 2019), ELECTRA (Clark et al., 2020), T5 (Raffel et al., 2020), ALUM (Liu et al., 2020), StructBERT (Wang et al., 2019c) and ERNIE (Sun et al., 2019). These PLMs have been fine-tuned using task-specific labels and have set a new state of the art in many downstream natural language processing (NLP) tasks (Liu et al., 2019b; Minaee et al., 2020; Jiang et al., 2020; He et al., 2019a;b; Shen et al., 2020).

In this paper, we propose a new Transformer-based neural language model DeBERTa (Decoding-enhanced BERT with disentangled attention), which improves previous state-of-the-art PLMs using two novel techniques: a disentangled attention mechanism and an enhanced mask decoder.

Disentangled attention. Unlike BERT, where each word in the input layer is represented using a vector which is the sum of its word (content) embedding and position embedding, each word in DeBERTa is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices based on their contents and relative positions, respectively. This is motivated by the observation that the attention weight of a word pair depends not only on their contents but also on their relative positions.
For example, the dependency between the words "deep" and "learning" is much stronger when they occur next to each other than when they occur in different sentences.

Enhanced mask decoder. Like BERT, DeBERTa is pre-trained using masked language modeling (MLM). MLM is a fill-in-the-blank task, where a model is taught to use the words surrounding a mask token to predict what the masked word should be. DeBERTa uses the content and position information of the context words for MLM. The disentangled attention mechanism already considers the contents and relative positions of the context words, but not the absolute positions of these words, which in many cases are crucial for the prediction. Consider the sentence "a new store opened beside the new mall" with the words "store" and "mall" masked for prediction. Although the local contexts of the two words are similar, they play different syntactic roles in the sentence. (Here, the subject of the sentence is "store", not "mall", for example.) These syntactical nuances depend, to a large degree, upon the words' absolute positions in the sentence, and so it is important to account for a word's absolute position in the language modeling process. DeBERTa incorporates absolute word position embeddings right before the softmax layer, where the model decodes the masked words based on the aggregated contextual embeddings of word contents and positions.

In addition, we propose a new virtual adversarial training method for fine-tuning PLMs to downstream NLP tasks. The method is effective in improving models' generalization.

We show through a comprehensive empirical study that these techniques substantially improve the efficiency of pre-training and the performance of downstream tasks. In the NLU tasks, compared to RoBERTa-Large, a DeBERTa model trained on half the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%), and on RACE by +3.6% (83.2% vs. 86.8%). In the NLG tasks, DeBERTa reduces the perplexity from 21.6 to 19.5 on the Wikitext-103 dataset. We further scale up DeBERTa by pre-training a larger model that consists of 48 Transformer layers with 1.5 billion parameters. The single 1.5B-parameter DeBERTa model substantially outperforms T5 with 11 billion parameters on the SuperGLUE benchmark (Wang et al., 2019a) by 0.6% (89.3% vs. 89.9%), and surpasses the human baseline (89.9 vs. 89.8) for the first time. The ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8).

2. BACKGROUND

2.1. TRANSFORMER

A Transformer-based language model is composed of stacked Transformer blocks (Vaswani et al., 2017). Each block contains a multi-head self-attention layer followed by a fully connected positional feed-forward network. The standard self-attention mechanism lacks a natural way to encode word position information. Thus, existing approaches add a positional bias to each input word embedding so that each input word is represented by a vector whose value depends on its content and position. The positional bias can be implemented using absolute position embedding (Vaswani et al., 2017; Radford et al., 2019; Devlin et al., 2019) or relative position embedding (Huang et al., 2018; Yang et al., 2019). It has been shown that relative position representations are more effective for natural language understanding and generation tasks (Dai et al., 2019; Shaw et al., 2018). The proposed disentangled attention mechanism differs from all existing approaches in that we represent each input word using two separate vectors that encode a word's content and position, respectively, and attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively.

2.2. MASKED LANGUAGE MODEL

Large-scale Transformer-based PLMs are typically pre-trained on large amounts of text to learn contextual word representations using a self-supervision objective known as Masked Language Model (MLM) (Devlin et al., 2019). Specifically, given a sequence $X = \{x_i\}$, we corrupt it into $\tilde{X}$ by masking 15% of its tokens at random and then train a language model parameterized by $\theta$ to reconstruct $X$ by predicting the masked tokens $\tilde{x}$ conditioned on $\tilde{X}$:

$$\max_{\theta} \log p_{\theta}(X \mid \tilde{X}) = \max_{\theta} \sum_{i \in \mathcal{C}} \log p_{\theta}(\tilde{x}_i = x_i \mid \tilde{X}) \tag{1}$$

where $\mathcal{C}$ is the index set of the masked tokens in the sequence. The authors of BERT propose to keep 10% of the masked tokens unchanged, another 10% replaced with randomly picked tokens and the rest replaced with the [MASK] token.

3. THE DEBERTA ARCHITECTURE

3.1. DISENTANGLED ATTENTION: A TWO-VECTOR APPROACH TO CONTENT AND POSITION EMBEDDING

For a token at position $i$ in a sequence, we represent it using two vectors, $\{H_i\}$ and $\{P_{i|j}\}$, which represent its content and relative position with the token at position $j$, respectively. The calculation of the cross attention score between tokens $i$ and $j$ can be decomposed into four components as

$$A_{i,j} = \{H_i, P_{i|j}\} \times \{H_j, P_{j|i}\}^{\top} = H_i H_j^{\top} + H_i P_{j|i}^{\top} + P_{i|j} H_j^{\top} + P_{i|j} P_{j|i}^{\top} \tag{2}$$

That is, the attention weight of a word pair can be computed as a sum of four attention scores using disentangled matrices on their contents and positions as content-to-content, content-to-position, position-to-content, and position-to-position (footnote 2).

Existing approaches to relative position encoding use a separate embedding matrix to compute the relative position bias in computing attention weights (Shaw et al., 2018; Huang et al., 2018). This is equivalent to computing the attention weights using only the content-to-content and content-to-position terms in equation 2. We argue that the position-to-content term is also important since the attention weight of a word pair depends not only on their contents but on their relative positions, which can only be fully modeled using both the content-to-position and position-to-content terms. Since we use relative position embedding, the position-to-position term does not provide much additional information and is removed from equation 2 in our implementation.

Taking single-head attention as an example, the standard self-attention operation (Vaswani et al., 2017) can be formulated as:

$$Q = HW_q, \quad K = HW_k, \quad V = HW_v, \quad A = \frac{QK^{\top}}{\sqrt{d}}, \quad H_o = \mathrm{softmax}(A)V$$

where $H \in \mathbb{R}^{N \times d}$ represents the input hidden vectors, $H_o \in \mathbb{R}^{N \times d}$ the output of self-attention, $W_q, W_k, W_v \in \mathbb{R}^{d \times d}$ the projection matrices, $A \in \mathbb{R}^{N \times N}$ the attention matrix, $N$ the length of the input sequence, and $d$ the dimension of hidden states.
Denote $k$ as the maximum relative distance and $\delta(i, j) \in [0, 2k)$ as the relative distance from token $i$ to token $j$, which is defined as:

$$\delta(i, j) = \begin{cases} 0 & \text{for } i - j \le -k \\ 2k - 1 & \text{for } i - j \ge k \\ i - j + k & \text{otherwise.} \end{cases} \tag{3}$$

We can represent the disentangled self-attention with relative position bias as equation 4, where $Q_c$, $K_c$ and $V_c$ are the projected content vectors generated using projection matrices $W_{q,c}, W_{k,c}, W_{v,c} \in \mathbb{R}^{d \times d}$ respectively, $P \in \mathbb{R}^{2k \times d}$ represents the relative position embedding vectors shared across all layers (i.e., staying fixed during forward propagation), and $Q_r$ and $K_r$ are projected relative position vectors generated using projection matrices $W_{q,r}, W_{k,r} \in \mathbb{R}^{d \times d}$, respectively.

$$Q_c = HW_{q,c}, \quad K_c = HW_{k,c}, \quad V_c = HW_{v,c}, \quad Q_r = PW_{q,r}, \quad K_r = PW_{k,r}$$

$$\tilde{A}_{i,j} = \underbrace{Q^c_i {K^c_j}^{\top}}_{\text{(a) content-to-content}} + \underbrace{Q^c_i {K^r_{\delta(i,j)}}^{\top}}_{\text{(b) content-to-position}} + \underbrace{K^c_j {Q^r_{\delta(j,i)}}^{\top}}_{\text{(c) position-to-content}}$$

$$H_o = \mathrm{softmax}\left(\frac{\tilde{A}}{\sqrt{3d}}\right) V_c \tag{4}$$

$\tilde{A}_{i,j}$ is the element of attention matrix $\tilde{A}$, representing the attention score from token $i$ to token $j$. $Q^c_i$ is the $i$-th row of $Q_c$. $K^c_j$ is the $j$-th row of $K_c$. $K^r_{\delta(i,j)}$ is the $\delta(i, j)$-th row of $K_r$ with regard to relative distance $\delta(i, j)$. $Q^r_{\delta(j,i)}$ is the $\delta(j, i)$-th row of $Q_r$ with regard to relative distance $\delta(j, i)$. Note that we use $\delta(j, i)$ rather than $\delta(i, j)$ here. This is because for a given position $i$, position-to-content computes the attention weight of the key content at $j$ with respect to the query position at $i$, thus the relative distance is $\delta(j, i)$. The position-to-content term is calculated as $K^c_j {Q^r_{\delta(j,i)}}^{\top}$. The content-to-position term is calculated in a similar way.

Finally, we apply a scaling factor of $\frac{1}{\sqrt{3d}}$ on $\tilde{A}$. The factor is important for stabilizing model training (Vaswani et al., 2017), especially for large-scale PLMs.
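As a small worked example of equation 3, the relative distance can be sketched in plain Python. The function names below are illustrative only and do not come from the released code.

```python
from typing import List


def relative_distance(i: int, j: int, k: int) -> int:
    """Bucketed relative distance delta(i, j) in [0, 2k), following equation 3."""
    diff = i - j
    if diff <= -k:
        return 0
    if diff >= k:
        return 2 * k - 1
    return diff + k


def relative_distance_matrix(n: int, k: int) -> List[List[int]]:
    """delta[i][j] for every pair of positions in a sequence of length n."""
    return [[relative_distance(i, j, k) for j in range(n)] for i in range(n)]


# With k = 2 there are 2k = 4 buckets; the middle row of a length-5 sequence is:
print(relative_distance_matrix(n=5, k=2)[2])  # [3, 3, 2, 1, 0]
```

Offsets at or beyond the maximum distance collapse into the two end buckets, so the number of distinct relative positions stays at 2k regardless of the sequence length.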

Algorithm 1 Disentangled Attention

Input: Hidden state $H$, relative distance embedding $P$, relative distance matrix $\delta$. Content projection matrices $W_{k,c}, W_{q,c}, W_{v,c}$, position projection matrices $W_{k,r}, W_{q,r}$.
1: $K_c = HW_{k,c}$, $Q_c = HW_{q,c}$, $V_c = HW_{v,c}$, $K_r = PW_{k,r}$, $Q_r = PW_{q,r}$
2: $A_{c \to c} = Q_c K_c^{\top}$
3: for $i = 0, \ldots, N-1$ do
4:     $\tilde{A}_{c \to p}[i, :] = Q_c[i, :] K_r^{\top}$
5: end for
6: for $i = 0, \ldots, N-1$ do
7:     for $j = 0, \ldots, N-1$ do
8:         $A_{c \to p}[i, j] = \tilde{A}_{c \to p}[i, \delta[i, j]]$
9:     end for
10: end for
11: for $j = 0, \ldots, N-1$ do
12:     $\tilde{A}_{p \to c}[:, j] = K_c[j, :] Q_r^{\top}$
13: end for
14: for $j = 0, \ldots, N-1$ do
15:     for $i = 0, \ldots, N-1$ do
16:         $A_{p \to c}[i, j] = \tilde{A}_{p \to c}[\delta[j, i], j]$
17:     end for
18: end for
19: $\tilde{A} = A_{c \to c} + A_{c \to p} + A_{p \to c}$
20: $H_o = \mathrm{softmax}\left(\frac{\tilde{A}}{\sqrt{3d}}\right) V_c$
Output: $H_o$

3.1.1. EFFICIENT IMPLEMENTATION

For an input sequence of length $N$, it requires a space complexity of $O(N^2 d)$ (Shaw et al., 2018; Huang et al., 2018; Dai et al., 2019) to store the relative position embedding for each token. However, taking content-to-position as an example, we note that since $\delta(i, j) \in [0, 2k)$ and the embeddings of all possible relative positions are always a subset of $K_r \in \mathbb{R}^{2k \times d}$, we can reuse $K_r$ in the attention calculation for all the queries.

In our experiments, we set the maximum relative distance $k$ to 512 for pre-training. The disentangled attention weights can be computed efficiently using Algorithm 1. Let $\delta$ be the relative position matrix according to equation 3, i.e., $\delta[i, j] = \delta(i, j)$. Instead of allocating a different relative position embedding matrix for each query, we multiply each query vector $Q_c[i, :]$ by $K_r^{\top} \in \mathbb{R}^{d \times 2k}$, as in lines 3-5. Then, we extract the attention weight using the relative position matrix $\delta$ as the index, as in lines 6-10. To compute the position-to-content attention score, we calculate $\tilde{A}_{p \to c}[:, j]$, i.e., the column vector of the attention matrix $\tilde{A}_{p \to c}$, by multiplying each key vector $K_c[j, :]$ by $Q_r^{\top}$, as in lines 11-13. Finally, we extract the corresponding attention score via the relative position matrix $\delta$ as the index, as in lines 14-18. In this way, we do not need to allocate memory to store a relative position embedding for each query and thus reduce the space complexity to $O(kd)$ (for storing $K_r$ and $Q_r$).
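The steps of Algorithm 1 can be sketched in dependency-free Python as follows. This is a minimal single-head illustration under stated assumptions, not the released implementation: the helper names, the toy sizes and the random weights are hypothetical, and the position-to-content scores are stored as an N x 2k table (the transpose of the layout in line 12) for convenience.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product a @ b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def softmax_rows(a: Matrix) -> Matrix:
    out = []
    for row in a:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def delta(i: int, j: int, k: int) -> int:
    """Equation 3: i - j + k clipped into [0, 2k)."""
    return min(max(i - j + k, 0), 2 * k - 1)


def disentangled_attention(h: Matrix, p: Matrix, w_qc: Matrix, w_kc: Matrix,
                           w_vc: Matrix, w_qr: Matrix, w_kr: Matrix,
                           k: int) -> Tuple[Matrix, Matrix]:
    n, d = len(h), len(h[0])
    # Line 1: project the content states H and the relative position embeddings P.
    q_c, k_c, v_c = matmul(h, w_qc), matmul(h, w_kc), matmul(h, w_vc)
    q_r, k_r = matmul(p, w_qr), matmul(p, w_kr)
    # Line 2: content-to-content scores (N x N).
    a_c2c = matmul(q_c, transpose(k_c))
    # Lines 3-5: each query against all 2k relative-position keys (N x 2k).
    c2p_all = matmul(q_c, transpose(k_r))
    # Lines 11-13: each key against all 2k relative-position queries,
    # stored as p2c_all[j][r] (N x 2k).
    p2c_all = matmul(k_c, transpose(q_r))
    scale = math.sqrt(3 * d)
    scores = [
        [
            (a_c2c[i][j]
             + c2p_all[i][delta(i, j, k)]    # lines 6-10: gather with delta(i, j)
             + p2c_all[j][delta(j, i, k)])   # lines 14-18: gather with delta(j, i)
            / scale
            for j in range(n)
        ]
        for i in range(n)
    ]
    # Line 20: H_o = softmax(A~ / sqrt(3d)) V_c
    return scores, matmul(softmax_rows(scores), v_c)


def rand(rows: int, cols: int) -> Matrix:
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
N, D, K = 6, 4, 3                                # toy sizes; the text uses k = 512
H, P = rand(N, D), rand(2 * K, D)
weights = [rand(D, D) for _ in range(5)]         # W_qc, W_kc, W_vc, W_qr, W_kr
scores, H_o = disentangled_attention(H, P, *weights, k=K)
print(len(H_o), len(H_o[0]))                     # N rows of dimension D
```

The two intermediate tables have N x 2k entries each and are shared by all query/key pairs, which is the reuse of $K_r$ and $Q_r$ that the efficient implementation relies on.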

3.2. ENHANCED MASK DECODER ACCOUNTS FOR ABSOLUTE WORD POSITIONS

DeBERTa is pre-trained using MLM, where a model is trained to use the words surrounding a mask token to predict what the masked word should be. DeBERTa uses the content and position information of the context words for MLM. The disentangled attention mechanism already considers the contents and relative positions of the context words, but not the absolute positions of these words, which in many cases are crucial for the prediction.

Consider the sentence "a new store opened beside the new mall" with the words "store" and "mall" masked for prediction. Using only the local context (e.g., relative positions and surrounding words) is insufficient for the model to distinguish "store" and "mall" in this sentence, since both follow the word "new" with the same relative positions. To address this limitation, the model needs to take into account absolute positions, as complementary information to the relative positions. For example, the subject of the sentence is "store", not "mall". These syntactical nuances depend, to a large degree, upon the words' absolute positions in the sentence.

There are two methods of incorporating absolute positions. The BERT model incorporates absolute positions in the input layer. In DeBERTa, we incorporate them right after all the Transformer layers but before the softmax layer for masked token prediction, as shown in Figure 2. In this way, DeBERTa captures the relative positions in all the Transformer layers and only uses absolute positions as complementary information when decoding the masked words. Thus, we call DeBERTa's decoding component an Enhanced Mask Decoder (EMD). In the empirical study, we compare these two methods of incorporating absolute positions and observe that EMD works much better. We conjecture that the early incorporation of absolute positions used by BERT might undesirably hinder the model from learning sufficient information about relative positions.

EMD also enables us to introduce other useful information, besides positions, for pre-training. We leave this to future work.

4. SCALE INVARIANT FINE-TUNING

This section presents a new virtual adversarial training algorithm, Scale-invariant-Fine-Tuning (SiFT), a variant of the algorithm described in Miyato et al. (2018) and Jiang et al. (2020), for fine-tuning.

Virtual adversarial training is a regularization method for improving models' generalization. It does so by improving a model's robustness to adversarial examples, which are created by making small perturbations to the input. The model is regularized so that when given a task-specific example, the model produces the same output distribution as it produces on an adversarial perturbation of that example.

For NLP tasks, the perturbation is applied to the word embedding instead of the original word sequence. However, the value ranges (norms) of the embedding vectors vary among different words and models. The variance gets larger for bigger models with billions of parameters, leading to some instability of adversarial training. Inspired by layer normalization (Ba et al., 2016), we propose the SiFT algorithm that improves the training stability by applying the perturbations to the normalized word embeddings. Specifically, when fine-tuning DeBERTa to a downstream NLP task in our experiments, SiFT first normalizes the word embedding vectors into stochastic vectors, and then applies the perturbation to the normalized embedding vectors. We find that the normalization substantially improves the performance of the fine-tuned models. The improvement is more prominent for larger DeBERTa models. Note that we only apply SiFT to DeBERTa 1.5B on SuperGLUE tasks in our experiments, and we will provide a more comprehensive study of SiFT in our future work.
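The normalize-then-perturb idea can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the normalization is a plain layer normalization without learned parameters, the perturbation is random noise of fixed norm rather than the adversarial direction that virtual adversarial training would search for, and the classifier is a hypothetical two-class linear model. It only shows why perturbing normalized embeddings makes the regularizer insensitive to the embedding scale.

```python
import math
import random
from typing import List

Vector = List[float]


def layer_normalize(x: Vector, eps: float = 1e-12) -> Vector:
    """Zero-mean, unit-variance normalization of one embedding vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def perturb(x: Vector, radius: float, rng: random.Random) -> Vector:
    """Add noise of L2 norm `radius` (random stand-in for the adversarial direction)."""
    noise = [rng.gauss(0.0, 1.0) for _ in x]
    norm = math.sqrt(sum(n * n for n in noise))
    return [v + radius * n / norm for v, n in zip(x, noise)]


def predict(x: Vector, weights: List[Vector]) -> Vector:
    """Toy linear classifier followed by a softmax."""
    logits = [sum(w * v for w, v in zip(row, x)) for row in weights]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Vector, q: Vector) -> float:
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def sift_penalty(embedding: Vector, weights: List[Vector],
                 radius: float = 0.1, seed: int = 0) -> float:
    """KL between predictions on the normalized embedding and its perturbed copy."""
    rng = random.Random(seed)
    clean = layer_normalize(embedding)    # normalize first ...
    noisy = perturb(clean, radius, rng)   # ... then perturb the normalized vector
    return kl_divergence(predict(clean, weights), predict(noisy, weights))


weights = [[0.5, -0.2, 0.1, 0.3], [-0.4, 0.6, 0.2, -0.1]]
small = [0.02, -0.01, 0.03, 0.00]
large = [100.0 * v for v in small]        # same direction, much larger norm
print(sift_penalty(small, weights), sift_penalty(large, weights))
```

Because both embeddings normalize to (almost exactly) the same vector, a perturbation of a given radius has the same effect on each, whereas the same radius applied to the raw vectors would be large relative to `small` and negligible relative to `large`.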

5. EXPERIMENT

This section reports DeBERTa results on various NLU tasks.

5.1. MAIN RESULTS ON NLU TASKS

Following previous studies of PLMs, we report results using large and base models.

We use 6 DGX-2 machines (96 V100 GPUs) to train the models. A single model trained with 2K batch size and 1M steps takes about 20 days. Refer to Appendix A for the detailed hyperparameters. We summarize the results on eight NLU tasks of GLUE (Wang et al., 2019b) in Table 1.

Further results are summarized in Table 2. Compared to the previous SOTA PLMs with a similar model size (i.e., BERT, RoBERTa, XLNet, ALBERT large, and Megatron 336M), DeBERTa shows superior performance in all seven tasks. Taking the RACE benchmark as an example, DeBERTa significantly outperforms XLNet by +1.4% (86.8% vs. 85.4%). Although Megatron 1.3B is three times larger than DeBERTa, DeBERTa outperforms it in three of the four benchmarks. We further report DeBERTa on text generation tasks in Appendix A.4.

5.1.2. PERFORMANCE ON BASE MODELS

Our setting for base model pre-training is similar to that for large models. The base model structure follows that of the BERT base model, i.e., $L = 12$, $H = 768$, $A = 12$. We use 4 DGX-2 machines with 64 V100 GPUs to train the base model. It takes 10 days to finish a single pre-training of 1M training steps with batch size 2048. We train DeBERTa using the same 78G dataset, and compare it to RoBERTa and XLNet trained on 160G text data. We summarize the base model results in

Published as a conference paper at ICLR 2021

5.2. MODEL ANALYSIS

In this section, we first present an ablation study to quantify the relative contributions of different components introduced in DeBERTa. Then, we study the convergence property to characterize the model training efficiency. We run experiments for analysis using the base model setting: a model is pre-trained using the Wikipedia + Bookcorpus dataset for 1M steps with batch size 256 in 7 days on a DGX-2 machine with 16 V100 GPUs. Due to space limitations, we visualize the different attention patterns of DeBERTa and RoBERTa in Appendix A.7.

5.2.1. ABLATION STUDY

To verify our experimental setting, we pre-train the RoBERTa base model from scratch. The re-pre-trained RoBERTa model is denoted as RoBERTa-ReImp base. To investigate the relative contributions of different components in DeBERTa, we develop three variations:
• -EMD is the DeBERTa base model without EMD.
• -C2P is the DeBERTa base model without the content-to-position term ((b) in Eq. 4).
• -P2C is the DeBERTa base model without the position-to-content term ((c) in Eq. 4). As XLNet also uses the relative position bias, this model is close to XLNet plus EMD.

[...] on MNLI-m/mm, respectively. Similarly, removing either content-to-position or position-to-content leads to inferior performance in all the benchmarks. As expected, removing two components results in an even more substantial loss in performance.

5.3. SCALE UP TO 1.5 BILLION PARAMETERS

Larger pre-trained models have shown better generalization results (Raffel et al., 2020; Brown et al., 2020; Shoeybi et al., 2019). Thus, we have built a larger version of DeBERTa with 1.5 billion parameters, denoted as DeBERTa 1.5B. The model consists of 48 layers with a hidden size of 1,536 and 24 attention heads (footnote 6). DeBERTa 1.5B is trained on a pre-training dataset amounting to 160G, similar to that in Liu et al. (2019c), with a new vocabulary of size 128K constructed using the dataset.

To train DeBERTa 1.5B, we optimize the model architecture as follows. First, we share the projection matrices of relative position embedding $W_{k,r}$, $W_{q,r}$ with $W_{k,c}$, $W_{q,c}$, respectively, in all attention layers to reduce the number of model parameters. Our ablation study in Table 13 on base models shows that the projection matrix sharing reduces the model size while retaining the model performance. Second, a convolution layer is added alongside the first Transformer layer to induce n-gram knowledge of sub-word encodings, and their outputs are summed up before feeding to the next Transformer layer (footnote 7).

Table 5 reports the test results of SuperGLUE (Wang et al., 2019a), which is one of the most popular NLU benchmarks. SuperGLUE consists of a wide range of NLU tasks, including Question Answering (Clark et al., 2019; Khashabi et al., 2018; Zhang et al., 2018), Natural Language Inference (Dagan et al., 2006; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), Word Sense Disambiguation (Pilehvar & Camacho-Collados, 2019), and Reasoning (Levesque et al., 2011; Roemmele et al., 2011). Since its release in 2019, top research teams around the world have been developing large-scale PLMs that have driven striking performance improvement on SuperGLUE.

The significant performance boost due to scaling DeBERTa to a larger model makes the single DeBERTa 1.5B surpass the human performance on SuperGLUE for the first time in terms of macro-average score (89.9 versus 89.8) as of December 29, 2020, and the ensemble DeBERTa model (DeBERTa Ensemble) sits atop the SuperGLUE benchmark rankings as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8). Compared to T5, which consists of 11 billion parameters, the 1.5-billion-parameter DeBERTa is much more energy efficient to train and maintain, and it is easier to compress and deploy to applications in various settings.

6. CONCLUSIONS

This paper presents a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. The second is an enhanced mask decoder which incorporates absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization on downstream tasks.

We show through a comprehensive empirical study that these techniques significantly improve the efficiency of model pre-training and the performance of downstream tasks. The DeBERTa model with 1.5 billion parameters surpasses the human performance on the SuperGLUE benchmark for the first time in terms of macro-average score.

DeBERTa surpassing human performance on SuperGLUE marks an important milestone toward general AI. Despite its promising results on SuperGLUE, the model is by no means reaching human-level intelligence in NLU. Humans are extremely good at leveraging the knowledge learned from different tasks to solve a new task with little or no task-specific demonstration. This is referred to as compositional generalization, the ability to generalize to novel compositions (new tasks) of familiar constituents (subtasks or basic problem-solving skills). Moving forward, it is worth exploring how to make DeBERTa incorporate compositional structures in a more explicit manner, which could allow combining neural and symbolic computation of natural language similar to what humans do.

A APPENDIX

• GLUE. The General Language Understanding Evaluation (GLUE) benchmark is a collection of nine natural language understanding (NLU) tasks. As shown in Table 6, it includes question answering (Rajpurkar et al., 2016), linguistic acceptability (Warstadt et al., 2018), sentiment analysis (Socher et al., 2013), text similarity (Cer et al., 2017), paraphrase detection (Dolan & Brockett, 2005), and natural language inference (NLI) (Dagan et al., 2006; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009; Levesque et al., 2012; Williams et al., 2018). The diversity of the tasks makes GLUE very suitable for evaluating the generalization and robustness of NLU models.
• SuperGLUE. SuperGLUE is a more difficult extension of the GLUE benchmark, consisting of eight NLU tasks. It covers a variety of tasks including question answering (Zhang et al., 2018; Clark et al., 2019; Khashabi et al., 2018), natural language inference (Dagan et al., 2006; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009; De Marneffe et al., 2019), coreference resolution (Levesque et al., 2012) and word sense disambiguation (Pilehvar & Camacho-Collados, 2019).
• RACE is a large-scale machine reading comprehension dataset, collected from English examinations in China, which are designed for middle school and high school students (Lai et al., 2017).
• SQuAD v1.1/v2.0. The Stanford Question Answering Dataset (SQuAD) v1.1 and v2.0 (Rajpurkar et al., 2016; 2018) are popular machine reading comprehension benchmarks. Their passages come from approximately 500 Wikipedia articles and the questions and answers are obtained by crowdsourcing. The SQuAD v2.0 dataset includes unanswerable questions about the same paragraphs.

With the disentangled attention mechanism, we introduce three additional sets of parameters, $W_{q,r}, W_{k,r} \in \mathbb{R}^{d \times d}$ and $P \in \mathbb{R}^{2k \times d}$. The total increase in model parameters is $2L \times d^2 + 2k \times d$. For the large model ($d = 1024$, $L = 24$, $k = 512$), this amounts to about 49M additional parameters, an increase of 13%. For the base model ($d = 768$, $L = 12$, $k = 512$), this amounts to 14M additional parameters, an increase of 12%. However, by sharing the projection matrix between content and position embedding, i.e., $W_{q,r} = W_{q,c}$, $W_{k,r} = W_{k,c}$, the number of parameters of DeBERTa is the same as RoBERTa. Our experiment on the base model shows that the results are almost the same, as in Table 13.

The additional computational complexity is $O(Nkd)$ due to the calculation of the additional position-to-content and content-to-position attention scores. Compared with BERT or RoBERTa, this increases the computational cost by 30%. Compared with XLNet, which also uses relative position embedding, the increase of computational cost is about 15%. A further optimization by fusing the attention computation kernel can significantly reduce this additional cost. For EMD, since the decoder in pre-training only reconstructs the masked tokens, it does not introduce additional computational cost for unmasked tokens. In the situation where 15% of tokens are masked and we use only two decoder layers, the additional cost is $0.15 \times 2 / L$, which results in an additional computational cost of only 3% for the base model ($L = 12$) and 2% for the large model ($L = 24$) in EMD.
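The two formulas above are easy to evaluate directly. The short sketch below does so; the function names are illustrative, and the remark about units is an observation from the arithmetic rather than something stated in the text.

```python
def extra_parameters(num_layers: int, hidden: int, max_rel_dist: int) -> int:
    """2 * L * d^2 for W_{q,r} and W_{k,r} in every layer, plus 2k * d for the shared P."""
    return 2 * num_layers * hidden ** 2 + 2 * max_rel_dist * hidden


def emd_overhead(num_layers: int, mask_rate: float = 0.15, decoder_layers: int = 2) -> float:
    """Relative extra cost of EMD: mask_rate * decoder_layers / L."""
    return mask_rate * decoder_layers / num_layers


for name, layers, hidden in [("large", 24, 1024), ("base", 12, 768)]:
    extra = extra_parameters(layers, hidden, max_rel_dist=512)
    # Columns: model, raw count, count in units of 2**20, EMD overhead
    print(name, extra, extra / 2 ** 20, emd_overhead(layers))
```

The raw counts are 51,380,224 (large) and 14,942,208 (base), which equal 49 and 14.25 when expressed in units of 2^20; the quoted 49M and 14M therefore appear to use that unit. The EMD overhead evaluates to 2.5% for the base model and 1.25% for the large model, which the text reports rounded as 3% and 2%.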

A.8 ADDITIONAL DETAILS OF ENHANCED MASK DECODER

The structure of EMD is shown in Figure 2b. There are two inputs for EMD, i.e., $I$ and $H$. $H$ denotes the hidden states from the previous Transformer layer, and $I$ can be any necessary information for decoding, e.g., $H$, absolute position embedding or output from the previous EMD layer. $n$ denotes $n$ stacked layers of EMD, where the output of each EMD layer will be the input $I$ for the next EMD layer and the output of the last EMD layer will be fed to the language model head directly. The $n$ layers can share the same weight. In our experiment we share the same weight for $n = 2$ layers to reduce the number of parameters and use absolute position embedding as $I$ of the first EMD layer. When $I = H$ and $n = 1$, EMD is the same as the BERT decoder layer. However, EMD is more general and flexible as it can take various types of input information for decoding.
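The stacking scheme described above (the output of each EMD layer becomes the input $I$ of the next, with one set of weights reused) can be sketched as a short loop. The layer itself is a hypothetical stand-in: it assumes queries are taken from $I$ and keys and values from $H$, and it omits projections, residual connections and feed-forward sublayers, none of which are specified in this section.

```python
import math
from typing import Callable, List

Matrix = List[List[float]]
Layer = Callable[[Matrix, Matrix], Matrix]


def toy_layer(query_input: Matrix, hidden: Matrix) -> Matrix:
    """Hypothetical EMD layer: attention with queries from I and keys/values from H."""
    d = len(hidden[0])
    out = []
    for q in query_input:
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d) for key in hidden]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * row[c] for w, row in zip(weights, hidden)) / total
                    for c in range(d)])
    return out


def enhanced_mask_decoder(hidden: Matrix, first_input: Matrix,
                          layer: Layer, n: int = 2) -> Matrix:
    """Apply the same layer n times; each output becomes the next layer's input I."""
    decoder_input = first_input
    for _ in range(n):
        decoder_input = layer(decoder_input, hidden)  # reusing `layer` shares weights
    return decoder_input  # would be fed to the language model head


H = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]        # toy hidden states from the last Transformer layer
abs_pos = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]   # toy absolute position embeddings used as the first I
decoded = enhanced_mask_decoder(H, abs_pos, toy_layer, n=2)
print(len(decoded), len(decoded[0]))
```

Calling `enhanced_mask_decoder(H, H, toy_layer, n=1)` reduces to a single application of the layer on $H$ alone, mirroring the remark that EMD with $I = H$ and $n = 1$ coincides with a BERT-style decoder layer.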

A.9 ATTENTION PATTERNS

To visualize how DeBERTa operates differently from RoBERTa, we present in Figure 3 the attention patterns (taken in the last self-attention layers) of RoBERTa, DeBERTa and three DeBERTa variants. We observe two differences. First, RoBERTa has a clear diagonal line effect for a token attending to itself, but this effect is not very visible in DeBERTa. This can be attributed to the use of EMD, in which the absolute position embedding is added to the hidden state of content as the query vector, as verified by the attention pattern of DeBERTa-EMD, where the diagonal line effect is more visible than in the original DeBERTa. Second, we observe vertical strips in the attention patterns of RoBERTa, which are mainly caused by high-frequency function words or tokens (e.g., "a", "the", and punctuation). For DeBERTa, the strip only appears in the first column, which represents the [CLS] token. We conjecture that a dominant emphasis on [CLS] is desirable since the feature vector of [CLS] is often used as a contextual representation of the entire input sequence in downstream tasks. We also observe that the vertical strip effect is quite obvious in the patterns of the three DeBERTa variants. We present three additional examples to illustrate the different attention patterns of DeBERTa and RoBERTa in Figures 4 and 5.

Footnote 1: Our code and models are also available at HuggingFace Transformers: https://github.com/huggingface/transformers, https://huggingface.co/models?filter=deberta
Footnote 2: In this sense, our model shares some similarity to Tensor Product Representation (Smolensky, 1990; Schlag et al., 2019; Chen et al., 2019), where a word is represented using a tensor product of its filler (content) vector and its role (position) vector.
Footnote 3: https://dumps.wikimedia.org/enwiki/
Footnote 4: The hidden dimension of ALBERT xxlarge is 4 times that of DeBERTa and the computation cost is about 4 times that of DeBERTa.
Footnote 5: T5 (Raffel et al., 2020) has more parameters (11B). Raffel et al. (2020) only report the test results of T5, which are not comparable with other models.
Footnote 6: See Table 8 in Appendix for the model hyperparameters.
Footnote 7: Please refer to Table 12 in Appendix A.6 for the ablation study of different model sizes, and Table 13 in Appendix A.6 for the ablation study of new modifications.

Figure 2: Comparison of the decoding layer.

Figure 3: Comparison of attention patterns of the last layer among DeBERTa, RoBERTa and DeBERTa variants (i.e., DeBERTa without EMD, C2P and P2C respectively).


Figure 4: Comparison on attention patterns of the last layer between DeBERTa and RoBERTa.

Comparison results on the GLUE development set (columns: QQP, MNLI-m/mm, SST-2, STS-B, QNLI, RTE, MRPC, Avg.).

in Table 1, where DeBERTa is compared with previous Transformer-based PLMs of similar structures (i.e., 24 layers with a hidden size of 1024), including BERT, RoBERTa, XLNet, ALBERT and ELECTRA. Note that RoBERTa, XLNet and ELECTRA are pre-trained on 160G of training data while DeBERTa is pre-trained on 78G of training data. RoBERTa and XLNet are pre-trained for 500K steps with 8K samples per step, which amounts to four billion training samples. DeBERTa is pre-trained for one million steps with 2K samples per step. This amounts to two billion training samples, approximately half of either RoBERTa or XLNet. Table 1 shows that, compared to BERT and RoBERTa, DeBERTa performs consistently better across all the tasks. Meanwhile, DeBERTa outperforms XLNet in six out of eight tasks. In particular, the improvements on MRPC (1.1% over XLNet and 1.0% over RoBERTa), RTE (2.4% over XLNet and 1.7% over RoBERTa) and CoLA (1.5% over XLNet and 2.5% over RoBERTa) are significant. DeBERTa also outperforms other SOTA PLMs, i.e., ELECTRA large and XLNet large, in terms of average GLUE score.

Results on MNLI in/out-domain, SQuAD v1.1, SQuAD v2.0, RACE, ReCoRD, SWAG and CoNLL 2003 NER development sets. Note that missing results in literature are signified by "-".

For comparison, we include ALBERT xxlarge (Lan et al., 2019) 4 and Megatron (Shoeybi et al., 2019) with three different model sizes, denoted as Megatron 336M, Megatron 1.3B and Megatron 3.9B, respectively, which are trained using the same dataset as RoBERTa. Note that Megatron 336M has a similar model size to the other models mentioned above 5.
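The training-budget comparison quoted above (steps multiplied by samples per step) can be checked with a few lines of arithmetic. This is only an illustrative sketch: it assumes "K" means exactly 1,000, and the function name `total_samples` is hypothetical, not part of any released code.

```python
# Sketch: reproduce the "four billion vs. two billion samples" comparison.
# Assumption: "K" is read as exactly 1,000 (nominal batch sizes 8,000 and 2,000).

def total_samples(steps: int, samples_per_step: int) -> int:
    """Training samples seen = optimization steps x samples per step."""
    return steps * samples_per_step

# RoBERTa and XLNet: 500K steps with 8K samples per step.
roberta_xlnet = total_samples(steps=500_000, samples_per_step=8_000)

# DeBERTa: one million steps with 2K samples per step.
deberta = total_samples(steps=1_000_000, samples_per_step=2_000)

# Fraction of the RoBERTa/XLNet budget that DeBERTa consumes.
ratio = deberta / roberta_xlnet

print(f"RoBERTa/XLNet samples: {roberta_xlnet:,}")
print(f"DeBERTa samples:       {deberta:,}")
print(f"DeBERTa / RoBERTa:     {ratio:.2f}")
```

Under this reading DeBERTa sees half as many samples, even though it runs twice as many optimization steps, because its batches are four times smaller.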

Across all three tasks, DeBERTa consistently outperforms RoBERTa and XLNet by a larger margin than for the large models. For example, on MNLI-m, DeBERTa base obtains +1.2% (88.8% vs. 87.6%) over RoBERTa base, and +2% (88.8% vs. 86.8%) over XLNet base.

Ablation study of the DeBERTa base model.

SuperGLUE test set results scored using the SuperGLUE evaluation server. All the results are obtained from https://super.gluebenchmark.com on January 6, 2021.

Summary information of the NLP application benchmarks.

Comparison results of DeBERTa models with different sizes on the GLUE development set.

Ablation study of the additional modifications in DeBERTa 1.5B and DeBERTa 900M models. Note that we progressively add each component on top of DeBERTa base.

acknowledgement

We thank Jade Huang and Nikos Karampatziakis for proofreading the paper and providing insightful comments. We thank Yoyo Liang, Saksham Singhal, Xia Song, and Saurabh Tiwary for their help with large-scale model training. We also thank the anonymous reviewers for valuable discussions.

annex

• SWAG is a large-scale adversarial dataset for the task of grounded commonsense inference, which unifies natural language inference and physically grounded reasoning (Zellers et al., 2018). SWAG consists of 113k multiple choice questions about grounded situations.
• CoNLL 2003 is an English dataset consisting of text from a wide variety of sources. It has 4 types of named entity.

A.2 PRE-TRAINING DATASET

For DeBERTa pre-training, we use Wikipedia (English Wikipedia dump 8; 12GB), BookCorpus (Zhu et al., 2015) 9 (6GB), OPENWEBTEXT (public Reddit content (Gokaslan & Cohen, 2019); 38GB) and STORIES 10 (a subset of CommonCrawl (Trinh & Le, 2018); 31GB). The total data size after data deduplication (Shoeybi et al., 2019) is about 78GB. For pre-training, we also sample 5% of the training data as the validation set to monitor the training process.

A.3 IMPLEMENTATION DETAILS

Following RoBERTa (Liu et al., 2019c), we adopt dynamic data batching. We also include span masking (Joshi et al., 2020) as an additional masking strategy, with a span size of up to three. We list the detailed hyperparameters of pre-training in Table 8. For pre-training, we use Adam (Kingma & Ba, 2014) as the optimizer with weight decay (Loshchilov & Hutter, 2018). For fine-tuning, even though we can get better and more robust results with RAdam (Liu et al., 2019a) on some tasks, e.g., CoLA, RTE and RACE, we use Adam (Kingma & Ba, 2014) as the optimizer for a fair comparison. We fine-tune each task with a hyper-parameter search procedure; each run takes about 1-2 hours on a DGX-2 node. All the hyper-parameters are presented in Table 9. The model selection is based on the performance on the task-specific development sets. Our code is implemented based on Huggingface Transformers 11, FairSeq 12 and Megatron (Shoeybi et al., 2019) 13.

A.3.1 PRE-TRAINING EFFICIENCY

To investigate the efficiency of model pre-training, we plot the performance of the fine-tuned model on downstream tasks as a function of the number of pre-training steps. As shown in Figure 1, for RoBERTa-ReImp base and DeBERTa base, we dump a checkpoint every 150K pre-training steps, fine-tune the checkpoint on two representative downstream tasks, MNLI and SQuAD v2.0, and report the accuracy and F1 score, respectively. As a reference, we also report the final model performance of both the original RoBERTa base (Liu et al., 2019c) and XLNet base (Yang et al., 2019).

A.4 MAIN RESULTS ON GENERATION TASKS

In addition to NLU tasks, DeBERTa can also be extended to handle NLG tasks. To allow DeBERTa to operate like an auto-regressive model for text generation, we use a triangular matrix for self-attention and set the upper triangular part of the self-attention mask to −∞, following Dong et al. (2019). We evaluate DeBERTa on the task of auto-regressive language model (ARLM) using Wikitext-103 (Merity et al., 2016). To do so, we train a new version of DeBERTa, denoted as DeBERTa-MT. It is jointly pre-trained using the MLM and ARLM tasks as in UniLM (Dong et al., 2019). Table 10 summarizes the results on Wikitext-103. We see that DeBERTa base obtains lower perplexities on both dev and test data, and joint training using MLM and ARLM reduces perplexity further. That DeBERTa-AP is inferior to DeBERTa indicates that it is more effective to incorporate absolute position embeddings of words in the decoding layer, as the EMD in DeBERTa does, than in the input layer, as RoBERTa does.
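The triangular self-attention mask described above can be illustrated with a minimal, dependency-free sketch. It is not the released DeBERTa implementation: the helper names `causal_mask` and `masked_softmax` are hypothetical, and the uniform attention scores are a toy input chosen so the effect of the mask is easy to see.

```python
import math

def causal_mask(seq_len):
    """Additive mask: 0 on and below the diagonal, -inf strictly above it."""
    return [[0.0 if j <= i else -math.inf for j in range(seq_len)]
            for i in range(seq_len)]

def masked_softmax(scores, mask):
    """Row-wise softmax of (scores + mask); masked entries get probability 0."""
    out = []
    for score_row, mask_row in zip(scores, mask):
        z = [s + m for s, m in zip(score_row, mask_row)]
        z_max = max(z)  # finite, because the diagonal is never masked
        e = [math.exp(v - z_max) for v in z]  # exp(-inf) evaluates to 0.0
        total = sum(e)
        out.append([v / total for v in e])
    return out

# Toy example (assumption): uniform raw scores for a 4-token sequence.
n = 4
scores = [[0.0] * n for _ in range(n)]
probs = masked_softmax(scores, causal_mask(n))

for row in probs:
    print([round(p, 2) for p in row])
```

Because every entry above the diagonal receives −∞ before the softmax, token i places zero weight on tokens i+1, i+2, ..., which is what lets the model be trained and decoded left to right.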

A.5 HANDLING LONG SEQUENCE INPUT

With relative position bias, we choose to truncate the maximum relative distance to k as in equation 3. Thus in each layer, each token can attend directly to at most 2(k − 1) tokens and itself. By stacking Transformer layers, each token in the l-th layer can attend implicitly to at most 2(k − 1)l tokens besides itself. Taking DeBERTa large as an example, where k = 512 and L = 24, the maximum sequence length that can be handled is, in theory, 2(512 − 1) × 24 = 24,528. This is a byproduct benefit of our design choice and we find it beneficial for the RACE task. A comparison of long sequence effect on the RACE task is shown in Table 11. Long sequence handling is an active research area. There have been a lot of studies where the Transformer architecture is extended for long sequence handling (Beltagy et al., 2020; Kitaev et al., 2019; Child et al., 2019; Dai et al., 2019). One of our future research directions is to extend DeBERTa to deal with extremely long sequences.
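The quoted figure of 24,528 can be reproduced with a short calculation. The sketch below assumes that each layer extends a token's reach by k − 1 positions on either side, so that the reach grows by 2(k − 1) other tokens per layer; this reading is chosen because it matches the number given in the text. The function names are hypothetical.

```python
def direct_span(k: int) -> int:
    """Other tokens reachable in one layer: k - 1 to the left, k - 1 to the right."""
    return 2 * (k - 1)

def implicit_span(k: int, num_layers: int) -> int:
    """Other tokens reachable after stacking num_layers layers (assumed linear growth)."""
    return direct_span(k) * num_layers

# DeBERTa large setting quoted in the text: k = 512, L = 24.
k, num_layers = 512, 24
reach = implicit_span(k, num_layers)
print(f"per-layer reach: {direct_span(k)} tokens; after {num_layers} layers: {reach:,}")
```

With k = 512 the per-layer reach is 1,022 tokens, and 24 layers give 1,022 × 24 = 24,528, in agreement with the theoretical maximum stated above.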

A.6 PERFORMANCE IMPROVEMENTS OF DIFFERENT MODEL SCALES

In this subsection, we study the effect of different model sizes applied to large models on GLUE. Table 12 summarizes the results, showing that larger models can obtain a better result and SiFT also improves the model performance consistently. 

