CONTEXTUAL KNOWLEDGE DISTILLATION FOR TRANSFORMER COMPRESSION

Abstract

Computationally expensive and memory-intensive neural networks lie behind the recent success of language representation learning. Knowledge distillation, a major technique for deploying such vast language models in resource-scarce environments, transfers the knowledge of individual word representations learned without restrictions. In this paper, inspired by recent observations that language representations are relatively positioned and carry more semantic knowledge as a whole, we present a new knowledge distillation strategy for language representation learning that transfers contextual knowledge via two types of relationships across representations: Word Relation and Layer Transforming Relation. We validate the effectiveness of our method on challenging benchmarks of language understanding tasks. The code will be released.

1. INTRODUCTION

Since the Transformer, a simple architecture based on the attention mechanism, succeeded in machine translation, Transformer-based models have become the new state of the art, taking over more complex structures based on recurrent or convolutional networks on various language tasks such as language understanding and question answering (Devlin et al., 2018; Lan et al., 2019; Liu et al., 2019a; Raffel et al., 2019; Yang et al., 2019). However, in exchange for high performance, these models suffer from a major drawback: tremendous computational and memory costs. In particular, it is not possible to deploy such large models on platforms with limited resources such as mobile and wearable devices, so matching the performance of the latest models with a small network is an urgent and impactful research topic. As the main method for this purpose, Knowledge Distillation (KD) transfers knowledge from a large, well-performing network (teacher) to a smaller network (student). Very recently, there have been some efforts to distill Transformer-based models into compact networks (Sanh et al., 2019; Turc et al., 2019; Sun et al., 2019; 2020; Jiao et al., 2019). However, they all build on the idea that each word representation is independent, ignoring relationships between words that could be more informative than the individual representations.

In this paper, we pay attention to the fact that word representations from language models are highly structured and capture certain types of semantic and syntactic relationships. Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) demonstrate that trained word embeddings contain linguistic patterns as linear relationships between word vectors. Recently, Reif et al. (2019) found that the distances between words carry information about the dependency parse tree.
Many other studies also provide evidence that contextual word representations (Belinkov et al., 2017; Tenney et al., 2019a; b) and attention matrices (Vig, 2019; Clark et al., 2019) contain important relations between words. Intuitively, although each word representation carries its own knowledge, the set of word representations as a whole is more semantically meaningful, since words in the embedding space are positioned relative to one another through learning.

Inspired by these observations, we propose a novel distillation objective, termed Contextual Knowledge Distillation (CKD), for language tasks that utilizes the statistics of relationships between word representations. In this paper, we define two types of contextual knowledge: Word Relation (WR) and Layer Transforming Relation (LTR). Specifically, WR captures the knowledge of relationships between word representations within a layer, and LTR captures how each word representation changes as it passes through the network layers. Moreover, unlike some previous approaches that impose constraints for distillation, the proposed objective is more robust to architecture changes as it adds no structural constraints on the teacher or the student.

There are two stages at which a large pre-trained language model can be distilled into a compact network. Several previous works (Sanh et al., 2019; Jiao et al., 2019; Sun et al., 2020) compress a large pre-trained language model into a small language model at the pre-training stage, which requires high computational cost and time. On the other hand, some works (Turc et al., 2019; Sun et al., 2019) present task-specific distillation that transfers knowledge to a well-initialized small network to improve the performance on each task. In this paper, we focus on task-specific distillation, which has the advantage of being directly applicable on top of pre-trained small BERT models (Turc et al., 2019) without conducting a time-consuming pre-training process.
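The exact CKD objectives are defined later in the paper; as a rough illustration of the two kinds of contextual knowledge, the sketch below (our own simplification, not the paper's precise formulation) matches pairwise cosine similarities between word representations within a layer (a WR-style relation) and the per-word cosine similarity between adjacent layers (an LTR-style relation) across teacher and student. Note that both relations are dimension-free (word-by-word matrices or per-word scalars), which is why no projection between teacher and student hidden sizes is needed; aligning teacher and student layers when their depths differ is left as an assumption here.

```python
import numpy as np

def pairwise_cos(H):
    """Pairwise cosine similarities between word vectors.

    H: (num_words, dim) hidden states of one layer.
    Returns a (num_words, num_words) relation matrix.
    """
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    return Hn @ Hn.T

def wr_loss(teacher_H, student_H):
    """Word-Relation-style loss: match intra-layer pairwise similarities."""
    return np.mean((pairwise_cos(teacher_H) - pairwise_cos(student_H)) ** 2)

def ltr_loss(teacher_layers, student_layers):
    """Layer-Transforming-Relation-style loss: match how each word's
    representation moves between consecutive layers, measured here as
    the cosine between a word's vectors in adjacent layers.
    Assumes teacher_layers and student_layers have equal length
    (i.e., layers have already been paired up)."""
    def transitions(layers):
        sims = []
        for A, B in zip(layers[:-1], layers[1:]):
            num = np.sum(A * B, axis=1)
            den = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
            sims.append(num / den)
        return np.stack(sims)  # (num_layers - 1, num_words)
    return np.mean((transitions(teacher_layers) - transitions(student_layers)) ** 2)
```

Because both quantities depend only on relations among words (not on the hidden dimension itself), a student with a narrower hidden size can be trained against a wider teacher without any learned projection, consistent with the robustness-to-architecture claim above.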
We validate our method on the Stanford Question Answering Dataset (SQuAD) and the General Language Understanding Evaluation (GLUE) benchmark. We first demonstrate the effectiveness of our method, which outperforms the current state-of-the-art distillation methods. We also show that CKD performs effectively on a variety of network architectures, including the recently proposed MobileBERT (Sun et al., 2020), a new thin variant of BERT trained with task-agnostic distillation. Our contribution is threefold:
• Inspired by the recent observations that word representations from neural networks are structured, we propose a novel knowledge distillation strategy, Contextual Knowledge Distillation (CKD), that transfers the relationships across word representations.
• We present two types of complementary contextual knowledge: the horizontal Word Relation across representations in a single layer and the vertical Layer Transforming Relation across representations of a single word.
• We validate CKD on standard language understanding benchmark datasets and show that CKD consistently outperforms state-of-the-art distillation methods for BERT across various model sizes.

2. RELATED WORK

Knowledge distillation Since recently popular deep neural networks are computation- and memory-heavy by design, there has been a long line of research on transferring knowledge for the purpose of compression. Hinton et al. (2015) first proposed a teacher-student framework with an objective that minimizes the KL divergence between teacher and student class probabilities. Within this framework, several follow-up works proposed various objectives to distill well-designed knowledge such as image attention maps (Zagoruyko & Komodakis, 2016), similarities (Tung & Mori, 2019), or relations (Park et al., 2019; Liu et al., 2019b) between image features. In the field of natural language processing (NLP), knowledge distillation has been actively studied (Kim & Rush, 2016; Hu et al., 2018; Yang et al., 2020). In particular, after the emergence of large pre-trained language models such as BERT (Devlin et al., 2018; Liu et al., 2019a; Yang et al., 2019; Raffel et al., 2019), many studies have attempted various forms of knowledge distillation in the pre-training process and/or fine-tuning for downstream tasks in order to reduce the burden of handling large models. Specifically, Tang et al. (2019); Chia et al. (2019) proposed to distill BERT into simple recurrent and convolutional networks. Sanh et al. (2019); Turc et al. (2019) proposed to use the teacher's predictive distribution to train a smaller BERT, and Wang et al. (2020) propose a structure-level distillation that transfers the sequence-level predictive distribution for multilingual sequence labeling tasks. Sun et al. (2019) proposed a method to transfer the individual word representations in BERT. In addition to matching the hidden states, Jiao et al. (2019) and Sun et al. (2020) also utilize the attention matrices derived from the Transformer. Several works (Goyal et al., 2020; Liu et al., 2020; Hou et al., 2020) improve the performance of other compression methods such as sparsification and quantization by integrating knowledge distillation objectives. Different from previous knowledge distillation methods that transfer the respective knowledge of word representations, we design an objective to distill their contextual knowledge, which can be combined with existing distillation methods.

Contextual knowledge of word representations Understanding and utilizing the relationships across words is one of the key ingredients in language modeling. Word embedding (Mikolov et al.,
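The teacher-student objective of Hinton et al. (2015) mentioned above can be sketched as follows; this is a generic, minimal illustration of the classic KD loss (the temperature value and logits are illustrative choices, not from this paper): minimize KL(softmax(z_t / T) || softmax(z_s / T)) between temperature-softened teacher and student class probabilities.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; a higher T softens the distribution."""
    e = np.exp(z / T - np.max(z / T))  # shift for numerical stability
    return e / e.sum()

def kd_loss(teacher_logits, student_logits, T=2.0):
    """Hinton-style distillation loss: KL(p_teacher || p_student)
    over temperature-softened class probabilities."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))))
```

The softened targets expose the teacher's "dark knowledge" about relative class similarities; later objectives surveyed above replace or augment these class probabilities with hidden states, attention maps, or, as in this paper, relations among word representations.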

