PEA-KD: PARAMETER-EFFICIENT AND ACCURATE KNOWLEDGE DISTILLATION ON BERT

Abstract

How can we efficiently compress a model while maintaining its performance? Knowledge Distillation (KD) is one of the widely known methods for model compression. In essence, KD trains a smaller student model based on a larger teacher model and tries to retain the teacher model's level of performance as much as possible. However, existing KD methods suffer from the following limitations. First, since the student model is smaller in absolute size, it inherently lacks model capacity. Second, the absence of an initial guide for the student model makes it difficult for the student to imitate the teacher model to its fullest. Conventional KD methods yield low performance due to these limitations. In this paper, we propose Pea-KD (Parameter-efficient and accurate Knowledge Distillation), a novel approach to KD. Pea-KD consists of two main parts: Shuffled Parameter Sharing (SPS) and Pretraining with Teacher's Predictions (PTP). Using this combination, we alleviate the limitations of KD. SPS is a new parameter sharing method that increases the student model's capacity. PTP is a KD-specialized initialization method that acts as a good initial guide for the student. Combined, these methods yield a significant increase in the student model's performance. Experiments conducted on BERT with different datasets and tasks show that the proposed approach improves the student model's performance by 4.4% on average across four GLUE tasks, outperforming existing KD baselines by significant margins.

1. INTRODUCTION

How can we improve the accuracy of knowledge distillation (KD) with a smaller number of parameters? KD uses a well-trained large teacher model to train a smaller student model. The conventional KD method (Hinton et al. (2015)) trains the student model using the teacher model's predictions as targets. That is, the student model uses not only the true labels (hard distribution) but also the teacher model's predictions (soft distribution) as targets. Since better KD accuracy is directly linked to better model compression, improving KD accuracy is valuable and crucial. Naturally, there have been many studies and attempts to improve the accuracy of KD. Sun et al. (2019) introduced Patient KD, which utilizes not only the teacher model's final output but also the intermediate outputs generated from the teacher's layers. Jiao et al. (2019) applied additional KD in the pretraining step of the student model. However, existing KD methods share the limitation that the student has lower model capacity than its teacher model, due to its smaller size. In addition, there is no proper initialization guide established for the student model, which becomes especially important when the student model is small. These limitations lead to lower-than-desired student model accuracy.

In this paper, we propose Pea-KD (Parameter-efficient and accurate Knowledge Distillation), a novel KD method designed especially for Transformer-based models (Vaswani et al. (2017)) that significantly improves the student model's accuracy. Pea-KD is composed of two modules, Shuffled Parameter Sharing (SPS) and Pretraining with Teacher's Predictions (PTP), and is based on the following two main ideas:

1. We apply SPS in order to increase the effective model capacity of the student model without increasing the number of parameters. SPS has two steps: 1) stacking layers that share parameters, and 2) shuffling the parameters between shared pairs of layers. Doing so increases the model's effective capacity, which enables the student to better replicate the teacher model (details in Section 3.2).

2. We apply a pretraining task called PTP for the student. Through PTP, the student model learns general knowledge about the teacher and the task. With this additional pretraining, the student more efficiently acquires and utilizes the teacher's knowledge during the actual KD process (details in Section 3.3).

Throughout the paper, we use Pea-KD applied on the BERT model (PeaBERT) as an example to investigate our proposed approach. We summarize our main contributions as follows:

• Novel framework for KD. We propose SPS and PTP, a parameter sharing method and a KD-specialized initialization method. Together, they serve as a new framework for KD that significantly improves accuracy.

• Performance. When tested on four widely used GLUE tasks, PeaBERT improves the student model's accuracy by 4.4% on average and by up to 14.8% compared to the original BERT model. PeaBERT also outperforms the existing state-of-the-art KD baselines by 3.5% on average.

• Generality. Our proposed method Pea-KD can be applied to any Transformer-based model and classification task with small modifications.

We conclude in Section 5.
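To make the conventional KD objective referenced above concrete, here is a minimal NumPy sketch of Hinton-style distillation: a weighted sum of a soft-target loss (cross-entropy against the teacher's temperature-softened predictions) and a hard-target loss (cross-entropy against the true labels). The function names, the temperature T, and the weight alpha are illustrative assumptions, not the exact formulation or hyperparameters used in Pea-KD.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; higher T gives a softer distribution.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Weighted sum of soft-target (teacher) and hard-target (label) losses."""
    p_teacher = softmax(teacher_logits, T)   # teacher's soft distribution
    p_student = softmax(student_logits, T)   # student's soft distribution
    # Soft loss: cross-entropy between the two soft distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across T.
    soft = -np.sum(p_teacher * np.log(p_student + 1e-12), axis=-1).mean() * T**2
    # Hard loss: ordinary cross-entropy against the true labels.
    p_hard = softmax(student_logits)
    hard = -np.log(p_hard[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * soft + (1 - alpha) * hard
```

A student whose logits match the teacher's (and the true labels) incurs a lower loss than one that contradicts them, which is what drives the distillation process.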

2. RELATED WORK

Pretrained Language Models. The framework of first pretraining language models and then finetuning them for downstream tasks has now become the industry standard for Natural Language Processing (NLP) models. Pretrained language models such as BERT (Devlin et al. (2018)), XLNet (Yang et al. (2019)), RoBERTa (Liu et al. (2019)), and ELMo (Peters et al. (2018)) prove how powerful pretrained language models can be. Specifically, BERT is a language model consisting of multiple Transformer encoder layers. Transformers (Vaswani et al. (2017)) can capture long-term dependencies between input tokens by using a self-attention mechanism. Self-attention calculates an attention function using three components, query, key, and value, denoted as matrices Q, K, and V. The attention function is defined as follows:

A = QK^T / sqrt(d_k),    Attention(Q, K, V) = softmax(A)V    (1)

It is known that through pretraining using Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), the attention matrices in BERT can capture substantial linguistic knowledge. BERT has achieved state-of-the-art performance on a wide range of NLP tasks, such as the GLUE benchmark (Wang et al. (2018)) and SQuAD (Rajpurkar et al. (2016)). However, these modern pretrained models are very large and contain millions of parameters, making them nearly impossible to apply on edge devices with limited resources.

Model Compression. As deep learning algorithms are adopted, implemented, and researched in diverse fields, high computation costs and memory shortages have become challenging factors. Especially in NLP, pretrained language models typically require a large set of parameters, which results in extensive computation and memory costs. As such, model compression has now become an important task for deep learning. There have already been many attempts to tackle this problem, including quantization (Gong et al. (2014)) and weight pruning (Han et al. (2015)). Two promising approaches are KD (Hinton et al. (2015)) and parameter sharing, which we focus on in this paper.

Knowledge Distillation (KD). As briefly covered in Section 1, KD transfers knowledge from a well-trained, large teacher model to a smaller student model. KD uses the teacher model's predictions on top of the true labels to train the student model. It is proven through many experiments
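The scaled dot-product attention defined in Equation 1 can be sketched in NumPy as follows; the function name is ours, and batching and multi-head details are omitted for brevity.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V  (Equation 1)."""
    d_k = K.shape[-1]
    A = Q @ K.T / np.sqrt(d_k)            # scaled similarity scores
    A = A - A.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(A)
    w = w / w.sum(axis=-1, keepdims=True)  # row-wise softmax
    return w @ V                           # weighted sum of values
```

Each output row is a convex combination of the rows of V, with weights given by how strongly the corresponding query attends to each key.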

