DEEPENING HIDDEN REPRESENTATIONS FROM PRE-TRAINED LANGUAGE MODELS

Abstract

Transformer-based pre-trained language models have proven effective for learning contextualized language representations. However, current approaches exploit only the output of the encoder's final layer when fine-tuning on downstream tasks. We argue that relying on a single layer's output restricts the power of the pre-trained representation. We therefore deepen the representation learned by the model by fusing its hidden representations with an explicit HIdden Representation Extractor (HIRE), which automatically absorbs information complementary to the final layer's output. With RoBERTa as the backbone encoder, our proposed improvement over pre-trained models proves effective on multiple natural language understanding tasks and helps our model rival the state-of-the-art models on the GLUE benchmark.

1. INTRODUCTION

Language representation is essential to the understanding of text. Recently, pre-trained language models based on the Transformer (Vaswani et al., 2017), such as GPT (Radford et al., 2018), BERT (Devlin et al., 2019), XLNet (Yang et al., 2019), and RoBERTa (Liu et al., 2019c), have been shown to be effective for learning contextualized language representations. These models have since continued to achieve new state-of-the-art results on a variety of natural language processing tasks, including question answering (Rajpurkar et al., 2018; Lai et al., 2017), natural language inference (Williams et al., 2018; Bowman et al., 2015), named entity recognition (Tjong Kim Sang & De Meulder, 2003), sentiment analysis (Socher et al., 2013), and semantic textual similarity (Cer et al., 2017; Dolan & Brockett, 2005). Typically, Transformer-based models are pre-trained on large-scale unlabeled corpora in an unsupervised manner and then fine-tuned on downstream tasks by introducing a task-specific output layer. When fine-tuning on supervised downstream tasks, these models pass the output of the Transformer encoder's final layer, taken as the contextualized representation of the input text, directly to the task-specific layer. However, given the numerous layers (i.e., Transformer blocks) and considerable depth of these pre-trained models, we argue that the output of the last layer may not always be the best representation of the input text during fine-tuning for downstream tasks. Devlin et al. (2019) show that different combinations of the pre-trained BERT's layer outputs yield distinct performance on the CoNLL-2003 Named Entity Recognition (NER) task (Tjong Kim Sang & De Meulder, 2003). Peters et al. (2018b) point out that for pre-trained language models, including the Transformer, the most transferable contextualized representations of input text tend to occur in the middle layers, while the top layers specialize in language modeling.
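To make the idea of combining layer outputs concrete, the following is a minimal illustrative sketch (not from the paper): given a stack of per-layer hidden states, one can compare using only the final layer against, e.g., concatenating the last four layers, a combination reported to work well in the BERT NER ablation. All shapes and values here are hypothetical, chosen only to show the tensor manipulations.

```python
import numpy as np

# Hypothetical stack of per-layer hidden states from a 12-layer encoder:
# shape (num_layers, seq_len, hidden)
num_layers, seq_len, hidden = 12, 8, 16
rng = np.random.default_rng(0)
hidden_states = rng.standard_normal((num_layers, seq_len, hidden))

# Option 1: the final layer's output alone -> (seq_len, hidden)
last_only = hidden_states[-1]

# Option 2: concatenate the last four layers -> (seq_len, 4 * hidden)
last_four_concat = np.concatenate(list(hidden_states[-4:]), axis=-1)

print(last_only.shape)         # (8, 16)
print(last_four_concat.shape)  # (8, 64)
```

Each such combination yields a different representation of the same input, which is exactly why a fixed choice of layers can be suboptimal across tasks.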
Therefore, the exclusive use of the last layer's output may restrict the power of the pre-trained representation. In this paper, we propose an additional network component for Transformer-based models that adaptively leverages the information in the Transformer's hidden layers to refine the language representation. Our design includes two main components: 1. The HIdden Representation Extractor (HIRE) dynamically learns a complementary representation containing the information that the final layer's output fails to capture.
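One simple way to realize such adaptive leveraging of hidden layers, sketched below under stated assumptions, is a learned softmax-normalized weight per layer whose weighted sum forms a complementary representation fused with the final layer's output. This is a simplified illustration of the general idea, not HIRE's actual architecture; the weights, shapes, and fusion-by-concatenation here are assumptions for the example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of layer logits.
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical per-layer hidden states: (num_layers, seq_len, hidden)
num_layers, seq_len, hidden = 12, 8, 16
rng = np.random.default_rng(1)
hidden_states = rng.standard_normal((num_layers, seq_len, hidden))

# One scalar logit per layer; these would be learnable parameters in practice.
layer_logits = rng.standard_normal(num_layers)
weights = softmax(layer_logits)  # non-negative, sums to 1

# Weighted sum over the layer axis -> (seq_len, hidden): an adaptively
# mixed representation drawn from all hidden layers.
mixed = np.tensordot(weights, hidden_states, axes=([0], [0]))

# Fuse with the final layer's output by concatenation -> (seq_len, 2 * hidden)
fused = np.concatenate([hidden_states[-1], mixed], axis=-1)
print(fused.shape)  # (8, 32)
```

Because the layer weights are learned during fine-tuning, the mixture can shift toward whichever layers carry information the final layer misses for a given task.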

