DEEPENING HIDDEN REPRESENTATIONS FROM PRE-TRAINED LANGUAGE MODELS

Abstract

Transformer-based pre-trained language models have proven effective for learning contextualized language representations. However, current approaches only take advantage of the output of the encoder's final layer when fine-tuning on downstream tasks. We argue that taking only a single layer's output restricts the power of the pre-trained representation. Thus we deepen the representation learned by the model by fusing the hidden representations through an explicit HIdden Representation Extractor (HIRE), which automatically absorbs the representation complementary to the output of the final layer. Utilizing RoBERTa as the backbone encoder, our proposed improvement over the pre-trained models proves effective on multiple natural language understanding tasks and helps our model rival the state-of-the-art models on the GLUE benchmark.

1. INTRODUCTION

Language representation is essential to the understanding of text. Recently, pre-trained language models based on the Transformer (Vaswani et al., 2017) such as GPT (Radford et al., 2018), BERT (Devlin et al., 2019), XLNet (Yang et al., 2019), and RoBERTa (Liu et al., 2019c) have been shown to be effective for learning contextualized language representations. These models have since continued to achieve new state-of-the-art results on a variety of natural language processing tasks, including question answering (Rajpurkar et al., 2018; Lai et al., 2017), natural language inference (Williams et al., 2018; Bowman et al., 2015), named entity recognition (Tjong Kim Sang & De Meulder, 2003), sentiment analysis (Socher et al., 2013) and semantic textual similarity (Cer et al., 2017; Dolan & Brockett, 2005). Normally, Transformer-based models are pre-trained on large-scale unlabeled corpora in an unsupervised manner, and then fine-tuned on downstream tasks by introducing a task-specific output layer. When fine-tuning on supervised downstream tasks, the models pass the output of the Transformer encoder's final layer, which is considered the contextualized representation of the input text, directly to the task-specific layer. However, given the numerous layers (i.e., Transformer blocks) and considerable depth of these pre-trained models, we argue that the output of the last layer may not always be the best representation of the input text during fine-tuning for downstream tasks. Devlin et al. (2019) show that diverse combinations of different layers' outputs of the pre-trained BERT result in distinct performance on the CoNLL-2003 Named Entity Recognition (NER) task (Tjong Kim Sang & De Meulder, 2003). Peters et al. (2018b) point out that for pre-trained language models, including the Transformer, the most transferable contextualized representations of input text tend to occur in the middle layers, while the top layers specialize for language modeling.
Therefore, the exclusive use of the last layer's output may restrict the power of the pre-trained representation. In this paper, we propose an extra network component for Transformer-based models that adaptively leverages the information in the Transformer's hidden layers to refine the language representation. Our design includes two main additional components: 1. the HIdden Representation Extractor (HIRE), which dynamically learns a complementary representation containing the information that the final layer's output fails to capture; 2. a fusion network, which integrates the hidden information extracted by HIRE with the Transformer's final-layer output in two steps, leading to a refined contextualized language representation. Taking advantage of the robustness of RoBERTa by using it as our backbone Transformer-based encoder (Liu et al., 2019c), we conduct experiments on the GLUE benchmark (Wang et al., 2018), which consists of nine Natural Language Understanding (NLU) tasks. With the help of HIRE, our model outperforms the baseline on 5 of the 9 tasks and advances the state of the art on the SST-2 dataset. Keeping the backbone Transformer model unchanged in its architecture, pre-training procedure and training objectives, we obtain performance comparable with other state-of-the-art models on the GLUE leaderboard, which verifies the effectiveness of the proposed HIRE enhancement over the Transformer model.

2. MODEL

2.1. TRANSFORMER-BASED ENCODER

Given an input text of length n, let R ∈ R^{n×d} denote the output of the Transformer-based encoder's last layer, where d is the hidden size of the encoder; R has the same length as the input text. We call it the preliminary representation in this paper to distinguish it from the one that we introduce in Section 2.2. Here, we omit the rather extensive formulation of the Transformer and refer readers to Vaswani et al. (2017), Radford et al. (2018) and Devlin et al. (2019) for more details.

2.2. HIDDEN REPRESENTATION EXTRACTOR

A Transformer-based encoder normally has many structurally identical layers stacked together; for example, BERT_LARGE and XLNet_LARGE both contain 24 layers of identical structure. The outputs of these hidden layers, not only that of the last layer, may be extremely helpful for a specific downstream task. To make full use of the representations from these hidden layers, we introduce an extra component attached to the original encoder, the HIdden Representation Extractor (HIRE), to capture the complementary information that the output of the last layer fails to capture. Since each layer does not carry the same importance for representing a given input sequence on different downstream tasks, we design an adaptive mechanism that computes the importance dynamically, measured by an importance score. The input to HIRE is {H_0, ..., H_j, ..., H_l}, where l represents the number of layers in the encoder. Here H_0 is the initial embedding of the input text, which is the input of the encoder's first layer but is updated during training, and H_j ∈ R^{n×d} is the hidden-state of the encoder at the output of layer j. For the sake of simplicity, we call them all hidden-states afterwards. We use the same 2-layer bidirectional Gated Recurrent Unit (GRU) (Cho et al., 2014) to summarize each hidden-state of the encoder. Instead of taking the whole output of the GRU as the representation of the hidden-state, we concatenate the final states of each layer and each direction of the GRU. In this way, we summarize each hidden-state into a fixed-size vector and obtain U ∈ R^{(l+1)×4d}, with u_i the summarized vector of H_i:

u_i = Bi-GRU(H_i) ∈ R^{4d}, 0 ≤ i ≤ l

Then the importance value α_i for hidden-state H_i is calculated by:

α_i = ReLU(W^T u_i + b) ∈ R    (3)

where W ∈ R^{4d×1} and b ∈ R are trainable parameters.
Let α = {α_i} be normalized into a probability distribution s through a softmax layer:

s = softmax(α) ∈ R^{l+1}    (4)

where s_i is the normalized weight of hidden-state i when computing the representation. Subsequently, we obtain the input sequence's new representation A by:

A = Σ_{i=0}^{l} s_i H_i ∈ R^{n×d}    (5)

With the same shape as the output of the Transformer-based encoder's final layer, HIRE's output A is expected to contain additional useful information from the encoder's hidden-states, and we call it the complementary representation.
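As a concrete illustration, the summarization, scoring, and weighted-sum steps above can be sketched in PyTorch. This is a minimal sketch with our own class and variable names; only the tensor shapes and operations follow the text, and details such as initialization may differ from the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HIRE(nn.Module):
    """Sketch of the HIdden Representation Extractor described above.

    Each of the l+1 hidden-states (shape (n, d)) is summarized by a shared
    2-layer Bi-GRU into a 4d vector, scored with ReLU(W^T u + b), normalized
    by softmax, and used to weight-sum the hidden-states into A.
    """

    def __init__(self, d: int):
        super().__init__()
        # Shared 2-layer bidirectional GRU; hidden size d per direction.
        self.gru = nn.GRU(d, d, num_layers=2, bidirectional=True,
                          batch_first=True)
        # Scoring parameters: W in R^{4d x 1}, b in R.
        self.score = nn.Linear(4 * d, 1)

    def forward(self, hidden_states):
        # hidden_states: list of l+1 tensors, each (batch, n, d)
        summaries = []
        for h in hidden_states:
            # h_n: (num_layers * num_directions, batch, d) = (4, batch, d)
            _, h_n = self.gru(h)
            # Concatenate each layer's and direction's final state -> (batch, 4d)
            summaries.append(h_n.permute(1, 0, 2).reshape(h.size(0), -1))
        u = torch.stack(summaries, dim=1)           # (batch, l+1, 4d)
        alpha = F.relu(self.score(u)).squeeze(-1)   # (batch, l+1)
        s = F.softmax(alpha, dim=-1)                # importance distribution
        H = torch.stack(hidden_states, dim=1)       # (batch, l+1, n, d)
        # Weighted sum over layers: A = sum_i s_i * H_i -> (batch, n, d)
        A = (s[:, :, None, None] * H).sum(dim=1)
        return A, s
```

Note that the Bi-GRU is shared across all hidden-states, so HIRE adds only one GRU and one scoring layer on top of the backbone encoder.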

2.3. FUSION NETWORK

This module fuses the information contained in the output of the Transformer-based encoder with that extracted from the encoder's hidden-states by HIRE. Given the preliminary representation R, instead of letting it flow directly into the task-specific output layer, we combine it with the complementary representation A to yield M, defined by:

M = [R; A; R + A; R ∘ A] ∈ R^{n×4d}    (6)

where ∘ is element-wise multiplication (Hadamard product) and [;] is concatenation across the last dimension. A two-layer bidirectional GRU, with an output size of d for each direction, is then used to fully fuse the information contained in the preliminary representation and the complementary representation. We concatenate the outputs of the GRU in the two directions for the final contextualized representation:

F = Bi-GRU(M) ∈ R^{n×2d}    (7)

2.4. OUTPUT LAYER

The output layer is task-specific. The following are the relevant implementation details for two kinds of tasks, classification and regression. For a classification task, given the input text's contextualized representation F, following Devlin et al. (2019), we take the first row c ∈ R^{2d} of F, corresponding to the first input token (<s>), as the aggregated representation. Let m be the number of labels in the dataset; we pass c through a feed-forward network (FFN):

q = W_2^T tanh(W_1^T c + b_1) + b_2 ∈ R^m    (8)

with W_1 ∈ R^{2d×d}, W_2 ∈ R^{d×m}, b_1 ∈ R^d and b_2 ∈ R^m the only parameters that we introduce in the output layer. Finally, the probability distribution over predicted labels is computed as:

p = softmax(q) ∈ R^m    (9)

For a regression task, we obtain q in the same manner with m = 1 and take q as the predicted value.
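The fusion step (Eq. 6), the Bi-GRU fusion (Eq. 7) and the classification head (Eqs. 8-9) can be sketched together as follows. Class and variable names are ours; only the tensor shapes and operations follow the text.

```python
import torch
import torch.nn as nn

class FusionNetwork(nn.Module):
    """Sketch of the fusion network and classification head described above."""

    def __init__(self, d: int, num_labels: int):
        super().__init__()
        # Two-layer Bi-GRU over the 4d-wide fused features, d per direction.
        self.gru = nn.GRU(4 * d, d, num_layers=2, bidirectional=True,
                          batch_first=True)
        # FFN of Eq. 8: W_1 in R^{2d x d}, W_2 in R^{d x m}.
        self.ffn = nn.Sequential(nn.Linear(2 * d, d), nn.Tanh(),
                                 nn.Linear(d, num_labels))

    def forward(self, R, A):
        # Eq. 6: concatenate R, A, their sum, and their Hadamard product.
        M = torch.cat([R, A, R + A, R * A], dim=-1)   # (batch, n, 4d)
        F_repr, _ = self.gru(M)                       # (batch, n, 2d)
        # Take the first token's row as the aggregated representation c.
        c = F_repr[:, 0]                              # (batch, 2d)
        return self.ffn(c)                            # logits q, (batch, m)
```

For regression, the same head is used with `num_labels=1` and the scalar output is taken directly as the prediction.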

2.5. TRAINING

For a classification task, the training loss to be minimized is the cross-entropy:

L(θ) = -(1/T) Σ_{i=1}^{T} log(p_{i,c})    (10)

where θ is the set of all parameters in the model, T is the number of examples in the dataset, and p_{i,c} is the predicted probability of the gold class c for example i. For a regression task, we define the training loss as the mean squared error (MSE):

L(θ) = (1/T) Σ_{i=1}^{T} (q_i - y_i)^2    (11)

where q_i is the predicted value and y_i the ground-truth value for example i.
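Both objectives above correspond to standard PyTorch losses; a minimal sketch with toy tensors of our own follows.

```python
import torch
import torch.nn.functional as F

# Classification: the loss above is the mean negative log-probability of the
# gold class, i.e. standard cross-entropy applied to the logits q.
q = torch.randn(4, 3)               # logits for T = 4 examples, m = 3 labels
gold = torch.tensor([0, 2, 1, 0])   # gold class c for each example
ce_loss = F.cross_entropy(q, gold)  # = -(1/T) * sum_i log p_{i,c}

# Regression: the loss above is the mean squared error between the scalar
# prediction q_i (m = 1) and the ground-truth value y_i.
q_reg = torch.randn(4)
y = torch.randn(4)
mse_loss = F.mse_loss(q_reg, y)     # = (1/T) * sum_i (q_i - y_i)^2
```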

3.1. DATASET

We conducted experiments on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) to evaluate our proposed method. GLUE is a collection of 9 diverse datasets for training, evaluating, and analyzing natural language understanding models. Table 3 presents the results of the HIRE enhancement and other models on the test set, as submitted to the GLUE leaderboard. Following Liu et al. (2019c), we fine-tune STS-B and MRPC starting from the MNLI single-task model. Given the similarity between RTE, WNLI and MNLI, and the large-scale nature of the MNLI dataset (393k examples), we also initialize RoBERTa+HIRE with the weights of the MNLI single-task model before fine-tuning on RTE and WNLI. We submitted the ensemble-model results to the leaderboard. The results show that RoBERTa+HIRE still boosts the strong RoBERTa baseline on the test set. Specifically, RoBERTa+HIRE outperforms RoBERTa on CoLA, SST-2, MRPC, STS-B, MNLI-mm and QNLI, with improvements of 0.8, 0.4, 0.7/0.9, 0.2/0.1, 0.2 and 0.1 points respectively. Meanwhile, RoBERTa+HIRE obtains the same results as RoBERTa on QQP and WNLI. By category, RoBERTa+HIRE performs better than RoBERTa on the single-sentence tasks and the similarity and paraphrase tasks. It is worth noting that our model obtains state-of-the-art results on the SST-2 dataset, with a score of 97.1. The results are quite promising since HIRE obtains comparable results without modifying the encoder's internal architecture (Yang et al., 2019) or redefining the pre-training procedure (Liu et al., 2019c).

Single Sentence Similarity and Paraphrase Natural Language Inference

Model | CoLA (Mcc) | SST-2 (Acc) | MRPC (Acc) | QQP (Acc) | STS-B (Pearson) | MNLI-m/mm (Acc) | QNLI (Acc) | RTE (Acc)

4. ABLATION STUDY

In this section, we perform a set of ablation experiments to understand the effects of our proposed techniques during fine-tuning. All results reported in this section are the median of five random runs.

- We remove HIRE from our model and pass the preliminary representation R directly into the fusion network. In order to keep the fusion network, we fuse the preliminary representation with itself, i.e., we define M instead by M = [R; R; R + R; R ∘ R] ∈ R^{n×4d}. The results are presented in row 2.
- We remove the fusion network, take the output of HIRE as the final representation of the input, and pass it directly into the output layer; the results are presented in row 3.

As can be seen from the table, using HIRE to extract the hidden information is crucial to the model: the Matthews correlation on CoLA and the Pearson correlation coefficient on STS-B drop by 1.2 and 0.9 points respectively if it is removed. We also observe that the fusion network is an important component, contributing gains of 1.5/0.2 points on the CoLA and STS-B tasks. These results align with our assumption that HIRE extracts complementary information from the hidden layers while the fusion network fuses it with the preliminary representation.

Method | Mcc
HIRE (dynamic) | 69.7
mean, all 25 layers | 68.9 (-0.8)
mean, last 6 layers | 68.0 (-1.7)
mean, first 6 layers | 68.8 (-0.9)
random, all 25 layers | 69.3 (-0.4)
random, last 6 layers | 68.1 (-1.6)

We investigate how the adaptive assignment mechanism of the importance scores affects the model's performance, choosing CoLA as the downstream task. We replace HIRE's dynamic scores with diverse fixed combinations of importance scores over different ranges of layers: the first 6, the last 6, or all 25 layers. Two strategies are studied: mean and random. In the mean setting, we suppose all layers contribute exactly the same. In the random setting, we generate a score for each layer randomly and apply the softmax operation across all layers. From the table, we observe that fixing each layer's importance score for all examples hurts model performance, and that no single fixed strategy yields the best performance. These results lead us to conclude that the hidden-state weighting mechanism introduced by HIRE can indeed adapt to diverse downstream tasks for better performance.
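A sketch of how such fixed weightings could be constructed follows. The function name, the span encoding, and the use of -inf masking before the softmax are our assumptions; the text only specifies the mean/random strategies and the layer ranges.

```python
import torch

def fixed_layer_weights(num_layers: int = 25, strategy: str = "mean",
                        span: str = "all") -> torch.Tensor:
    """Fixed importance distribution over the 25 hidden-states
    (embedding + 24 layers), replacing HIRE's dynamic scores."""
    # Layers outside the chosen span get score -inf, hence weight 0 after softmax.
    scores = torch.full((num_layers,), float("-inf"))
    if span == "all":
        idx = torch.arange(num_layers)
    elif span == "first6":
        idx = torch.arange(6)
    else:  # "last6"
        idx = torch.arange(num_layers - 6, num_layers)
    if strategy == "mean":
        scores[idx] = 0.0                # equal scores -> uniform weights
    else:                                # "random"
        scores[idx] = torch.randn(len(idx))
    return torch.softmax(scores, dim=0)
```

With these fixed weights, Eq. 5's weighted sum is computed as before, but the distribution s no longer depends on the input example.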

4.3. EFFECT OF GRU LAYER NUMBER IN FUSION NETWORK

To investigate the effect of the GRU layer number in the fusion network, we vary the number of GRU layers from 0 to 3 and conduct ablation experiments on the development set of CoLA. Table 6 shows the results. We observe that the modest value of 2 is the better choice.

We also compare the importance score distributions across different NLU tasks. For each task, we run our best single model over the development set, and the results are calculated by averaging the values across all examples within each dataset. The results are presented in the left of Figure 2. From top to bottom of the heatmap, the results are placed in the following order: single-sentence tasks, similarity and paraphrase tasks, and natural language inference tasks. We find that the distribution differs among tasks, which demonstrates HIRE's ability to adapt dynamically to distinct tasks when computing the complementary representation. The largest contribution occurs below the final layer for all tasks except STS-B and RTE, for which all layers contribute nearly equally. In addition, we find that QNLI, a classification task which requires the model to determine whether the context sentence contains the answer to the question, relies on layers 2-4 to some extent. In fact, Jawahar et al. (2019) state that BERT mostly captures phrase-level (or span-level) information in the lower layers and that this information gets gradually diluted in higher layers. Since QNLI is derived from the span-extraction question answering dataset SQuAD, where the answer is a span of the input text, we speculate that HIRE takes advantage of this information when making predictions.

5. ANALYSIS

The right of Figure 2 presents the distribution of importance scores over different layers for each example of the SST-2 dataset; the number on the ordinate axis denotes the index of the example. It shows that HIRE can adapt not only to distinct tasks but also to different examples. At the same time, we observe that even though there are subtle differences among these examples, they follow certain common patterns when calculating the complementary representation: for example, layers 21 and 22, together with the layers around them, contribute the most for almost all examples. The figure also shows that for some examples, all layers contribute nearly equally.

6. RELATED WORK

The Transformer model is empowered by the self-attention mechanism and has served as an effective architecture in many pre-trained language models (Vaswani et al., 2017). OpenAI GPT (Radford et al., 2018) was the first model to introduce the Transformer architecture into unsupervised pre-training. Instead of unidirectional training like GPT, BERT (Devlin et al., 2019) adopts a masked LM objective during pre-training, which enables the representation to incorporate context from both directions. The next sentence prediction (NSP) objective is also used by BERT to better model the relationship between sentences. Trained with dynamic masking, large mini-batches, a larger byte-level BPE, and full-sentences without NSP, RoBERTa (Liu et al., 2019c) improves over BERT on downstream tasks via a better BERT re-implementation trained on BooksCorpus (Zhu et al., 2015), CC-News, OpenWebText and Stories. In terms of fine-tuning on downstream tasks, these powerful Transformer-based models have helped various NLP tasks continuously achieve new state-of-the-art results. Diverse new fine-tuning methods have been proposed recently, including multi-task learning (Liu et al., 2019b), adversarial training (Zhu et al., 2020) and incorporating semantic information into the language representation (Zhang et al., 2020). Traditionally, downstream tasks or the back-end parts of models take representations from the last layer of the pre-trained language model as the default input. However, recent studies have triggered researchers' interest in the intermediate-layer representations learned by pre-trained models. Studying the linguistic knowledge and transferability of contextualized word representations with a series of seventeen diverse probing tasks, Liu et al. (2019a) observe that Transformers tend to encode transferable features in their intermediate layers, in line with the results of Peters et al. (2018b). Jawahar et al.
(2019) show that a rich hierarchy of linguistic information is encoded by BERT, with surface features, syntactic features and semantic features lying from bottom to top in the layers. So far, to the best of our knowledge, exploiting the representations learned by hidden layers has been limited to empirical observations, which motivates us to propose a general solution for fully exploiting all levels of representation learned by pre-trained language models. However, we note that a freezing/fine-tuning discrepancy may exist, since previous observations were made with a feature-extraction approach while our work is conducted in the fine-tuning scenario. Several works have attempted to utilize the information from the intermediate layers of deep models (Zhao et al., 2015; Peters et al., 2018a; Zhu et al., 2018; Tenney et al., 2019). Peters et al. (2018a) propose a deep contextualized word representation called ELMo, which is a task-specific linear combination of the intermediate-layer representations in a bidirectional LSTM language model. Tenney et al. (2019) find that with ELMo-style scalar mixing of layer activations, both deep Transformer models (BERT and GPT) gain a significant performance improvement on a novel edge probing task. Zhu et al. (2018) adopt a linear combination of embeddings from different layers of BERT to encode tokens for conversational question answering, but in a feature-based way. Compared with these works, our approach presents the following innovations: 1) Our dynamic layer-weighting mechanism calculates importance scores for each example over the different layers, which is distinct from previous works (e.g., ELMo), where the contribution score (in the form of a parameter) for each layer is always the same for all examples of the same task. 2) Our approach fuses the representation weighted from the hidden layers with the one from the last layer to obtain a refined representation.
3) To our best knowledge, our work is the first one that takes advantage of the pre-trained language model's intermediate information in the context of fine-tuning.

7. CONCLUSION

This paper presents the HIdden Representation Extractor (HIRE), a novel enhancement component that refines language representation by adaptively leveraging a Transformer-based model's hidden layers. In our proposed design, HIRE dynamically generates a complementary representation from all the hidden layers, beyond that of the default last layer. A light fusion network then incorporates the output of HIRE into that of the original model. The experimental results demonstrate the effectiveness of the refined language representation for natural language understanding. The analysis highlights the distinct contribution of each layer's output for diverse tasks and different examples.



Footnotes:
1. All the datasets can be obtained from https://gluebenchmark.com/tasks
2. https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs
3. https://github.com/huggingface/transformers



Figure 1: Architecture of our model. HIRE denotes HIdden Representation Extractor, in which the Bi-GRUs share the same parameters for each layer's output.

Figure 2: Left: Distribution of importance scores over different layers when computing the complementary representation for various NLU tasks. The numbers on the abscissa axis indicate the corresponding layer with 0 being the first layer and 24 being the last layer. Right: Distribution of importance scores over different layers for each example of SST-2 dataset. The number on the ordinate axis denotes the index of the example.

GLUE dev results. Our results are based on a single model trained on a single task; for each task, the median over five runs with different random seeds but the same hyperparameters is reported. The results of MT-DNN, XLNet_LARGE, ALBERT and RoBERTa are from Liu et al.

Parameter comparison. Table 1 compares our method with a list of Transformer-based models on the development set. A model parameter comparison is shown in Table 2. To obtain a direct and fair comparison with our baseline model RoBERTa, following the original paper (Liu et al., 2019c), we fine-tune RoBERTa+HIRE separately for each of the GLUE tasks, using only task-specific training data. Single-model results are reported for each task. We run our model with five different random seeds but the same hyperparameters and take the median value. Due to the problematic nature of the WNLI dataset, we exclude its results from this table. The results show that RoBERTa+HIRE consistently outperforms RoBERTa on 4 of the GLUE task development sets, with improvements of 1.7, 0.4, 0.5/0.2 and 0.3 points on CoLA, SST-2, MNLI and QNLI respectively. On the MRPC, STS-B and RTE tasks, our model gets the same results as RoBERTa. It should be noted that the improvement is entirely attributable to the introduction of the HIdden Representation Extractor and the fusion network in our model.



Ablation study over model design considerations on the development sets of CoLA and STS-B. The result for each model is the median of five random runs. Both HIRE and the fusion network significantly improve model performance on both datasets. To evaluate the importance of HIRE and the fusion network individually, we vary our model in the following ways and conduct experiments on the development sets of CoLA and STS-B. The results are in Table 4:

Effect of dynamic mechanism when computing the importance scores. A median Matthews correlation of five random runs is reported for CoLA on the development set.



Ablation study over GRU layer number in fusion network on the development set of CoLA. The results are a median Matthews correlation of five random runs. The best result is in bold.

ANNEX

Similarity and paraphrase tasks: Similarity and paraphrase tasks ask the model to predict whether each pair of sentences captures a paraphrase/semantic-equivalence relationship. The Microsoft Research Paraphrase Corpus (MRPC) (Dolan & Brockett, 2005), the Quora Question Pairs (QQP) and the Semantic Textual Similarity Benchmark (STS-B) (Cer et al., 2017) fall into this category.

Natural Language Inference (NLI) tasks: Natural language inference is the task of determining whether a "hypothesis" is true (entailment), false (contradiction), or undetermined (neutral) given a "premise". The GLUE benchmark contains the following tasks: the Multi-Genre Natural Language Inference Corpus (MNLI) (Williams et al., 2018), the converted version of the Stanford Question Answering Dataset (QNLI) (Rajpurkar et al., 2016), Recognizing Textual Entailment (RTE) (Dagan et al., 2006; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009) and the Winograd Schema Challenge (WNLI) (Levesque et al., 2012).

Four official metrics are adopted to evaluate model performance: Matthews correlation (Matthews, 1975), accuracy, F1 score, and Pearson and Spearman correlation coefficients.

A.2 IMPLEMENTATION DETAILS

Our implementation of HIRE and its fusion network is based on the PyTorch implementation of Transformers.

Preprocessing: Following Liu et al. (2019c), we adopt the GPT-2 (Radford et al., 2019) tokenizer with a Byte-Pair Encoding (BPE) vocabulary of 50K subword units. The maximum input sequence length is 128 tokens.

Model configurations:

We use RoBERTa_LARGE as the Transformer-based encoder and load the pre-trained weights of RoBERTa (Liu et al., 2019c). Like BERT_LARGE, the RoBERTa_LARGE model contains 24 Transformer blocks, with a hidden size of 1024 and 16 self-attention heads (Liu et al., 2019c; Devlin et al., 2019).

Optimization: We use the Adam optimizer (Kingma & Ba, 2015) with β1 = 0.9, β2 = 0.98 and ε = 1e-6; the learning rate is selected from {1e-5, 2e-5} with a warmup ratio of 0.06, depending on the nature of the task. The number of training epochs ranges from 3 to 27 with early stopping, and the batch size is selected from {16, 32, 48}. In addition, we clip the gradient norm to 1 to prevent the exploding-gradient problem that can occur in the recurrent neural networks in our model.
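The optimization setup above can be sketched as follows. The tiny model is a stand-in, and the linear decay after warmup is our assumption; only the Adam hyperparameters, the warmup ratio and the clip norm come from the text.

```python
import torch

# Stand-in model; the real model is RoBERTa_LARGE + HIRE + fusion network.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5,
                             betas=(0.9, 0.98), eps=1e-6)

total_steps = 1000                         # placeholder step budget
warmup_steps = int(0.06 * total_steps)     # warmup ratio 0.06

def lr_lambda(step: int) -> float:
    # Linear warmup over the first 6% of steps, then (assumed) linear decay.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# One training step: backward, clip the gradient norm to 1, then update.
loss = model(torch.randn(4, 10)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()
```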

Regularization:

We employ two types of regularization methods during training. We apply dropout (Srivastava et al., 2014) of rate 0.1 to all layers in the Transformer-based encoder and GRUs in the HIRE and fusion network. We additionally adopt L2 weight decay of 0.1 during training.

