AUBER: AUTOMATED BERT REGULARIZATION

Abstract

How can we effectively regularize BERT? Although BERT proves its effectiveness in various downstream natural language processing tasks, it often overfits when there are only a small number of training instances. A promising direction for regularizing BERT is to prune its attention heads based on a proxy score for head importance. However, heuristic-based methods are usually suboptimal since they predetermine the order in which attention heads are pruned. To overcome this limitation, we propose AUBER, an effective regularization method that leverages reinforcement learning to automatically prune attention heads from BERT. Instead of depending on heuristics or rule-based policies, AUBER learns a pruning policy that determines which attention heads should or should not be pruned for regularization. Experimental results show that AUBER outperforms existing pruning methods, achieving up to 9.39% better accuracy. In addition, our ablation study empirically demonstrates the effectiveness of our design choices for AUBER.

1. INTRODUCTION

How can we effectively regularize BERT (Devlin et al. (2018))? In natural language processing (NLP), fine-tuning a large-scale pre-trained language model has greatly enhanced generalization. In particular, BERT has demonstrated effectiveness through improvements in many downstream NLP tasks such as sentence classification and question answering. Despite its recent success and wide adoption, fine-tuning BERT on a downstream task is prone to overfitting due to over-parameterization; BERT-base has 110M parameters and BERT-large has 340M parameters. The overfitting worsens when the target downstream task has only a small number of training examples. Devlin et al. (2018) and Phang et al. (2018) show that fine-tuning BERT sometimes fails on datasets with 10,000 or fewer training examples.

To mitigate this critical issue, multiple studies attempt to regularize BERT by pruning parameters or using dropout to decrease its model complexity (Michel et al. (2019); Voita et al. (2019); Lee et al. (2020)). Among these approaches, we regularize BERT by pruning attention heads, since pruning yields simple and explainable results and can be used along with other regularization methods. To avoid combinatorial search, whose computational complexity grows exponentially with the number of heads, existing methods measure the importance of each attention head using heuristics such as an approximation of BERT's sensitivity to pruning a specific attention head. However, these hand-crafted heuristics are not guaranteed to be directly related to model performance, and can therefore result in suboptimal performance.

In this paper, we propose AUBER, an effective method for regularizing BERT. AUBER overcomes the limitation of past attempts to prune attention heads from BERT by leveraging reinforcement learning.
When pruning attention heads from BERT, our method automates the process by learning policies rather than relying on a predetermined rule-based policy or heuristics. AUBER prunes BERT sequentially in a layer-wise manner. For each layer, AUBER extracts features that help the reinforcement learning agent determine which attention heads to prune from the current layer. The final pruning policy found by the reinforcement learning agent is used to prune the corresponding layer. Before AUBER proceeds to the next layer, BERT is fine-tuned to recapture the information lost by pruning attention heads. An overview of AUBER transitioning from the second to the third layer of BERT is shown in Figure 1.

Our contributions are summarized as follows:

• Method. We propose AUBER for automatically learning to regularize BERT by reinforcement learning. AUBER is designed to carefully represent the state of BERT and reduce the action search cost by dually-greedy search space pruning.
• Analysis. We theoretically justify our design choice of using the L1 norm of the value matrix of each attention head as an element of a state (see Theorem 1).
• Experiments. We perform extensive experiments and show that AUBER successfully regularizes BERT, improving performance by up to 9.39% and outperforming other head pruning methods. Through an ablation study, we empirically show that our design choices for AUBER are effective.
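The layer-wise procedure described above can be illustrated with a minimal sketch. This is not the authors' implementation: `GreedyAgent` is a hypothetical threshold-based stand-in for AUBER's learned agent (AUBER uses deep Q-learning), and `extract_features` and `fine_tune` are hypothetical callbacks standing in for state extraction and BERT fine-tuning.

```python
class GreedyAgent:
    """Toy stand-in for the RL agent: prunes heads whose score is below a threshold."""
    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def select_heads(self, state):
        # state: one importance score per head in the current layer
        return [i for i, score in enumerate(state) if score < self.threshold]


def auber_prune(num_layers, num_heads, agent, extract_features, fine_tune):
    """Layer-wise pruning loop: decide per-layer which heads to keep (1) or prune (0)."""
    mask = [[1] * num_heads for _ in range(num_layers)]
    for layer in range(num_layers):
        state = extract_features(layer, mask)   # e.g. L1 norms of value matrices
        for head in agent.select_heads(state):
            mask[layer][head] = 0               # prune this head
        fine_tune(mask)                         # recover information lost by pruning
    return mask


# Toy demo: fixed "importance" features, no real model or fine-tuning.
features = [[0.2, 0.9, 0.7], [0.8, 0.1, 0.6]]
mask = auber_prune(2, 3, GreedyAgent(),
                   extract_features=lambda layer, m: features[layer],
                   fine_tune=lambda m: None)
```

In the real method, the agent's policy is learned rather than thresholded, and fine-tuning between layers lets each layer's pruning decisions account for the pruning already applied to earlier layers.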

2. PRELIMINARY

We describe preliminaries on multi-headed self-attention (Section 2.1), BERT (Section 2.2), and deep Q-learning (Section 2.3).

2.1. MULTI-HEADED SELF-ATTENTION

An attention function maps a query vector and a set of key-value vector pairs to an output. We compute the query, key, and value vectors by multiplying the input embeddings Q, K, V ∈ R^{N×d} with the parameter matrices W^Q ∈ R^{d×n}, W^K ∈ R^{d×n}, and W^V ∈ R^{d×m}, respectively, where N is the number of tokens in the sentence, and n, m, and d are the query/key, value, and embedding dimensions, respectively. In multi-headed attention, H independently parameterized attention heads are applied in parallel to project the input embeddings into multiple representation subspaces. Each attention head i contains parameter matrices W_i^Q ∈ R^{d×n}, W_i^K ∈ R^{d×n}, and W_i^V ∈ R^{d×m}. The output matrices of the H independent attention heads are concatenated and projected by a matrix W^O ∈ R^{Hm×d} to obtain the final result. This process can be represented as:

MultiHeadAtt(Q, K, V) = Concat(Att_1(Q, K, V), ..., Att_H(Q, K, V)) W^O,
where Att_i(Q, K, V) = softmax((Q W_i^Q)(K W_i^K)^T / √n) V W_i^V.

Multi-headed self-attention follows the same mapping as the general multi-headed attention function, except that the query, key, and value embeddings all come from the same sequence.

Figure 1: An overview of AUBER transitioning from Layer 2 to Layer 3 of BERT-base.

2.2. BERT

BERT (Devlin et al. (2018)) is a multi-layer Transformer (Vaswani et al. (2017)) pre-trained on the masked language modeling and next sentence prediction tasks. It is then fine-tuned on specific tasks including language inference and question answering. BERT-base has 12 Transformer layers, and each layer has 12 self-attention heads. Despite its success in various NLP tasks, BERT sometimes overfits when the training dataset is small due to over-parameterization. Thus, there has been a growing interest in BERT regularization through various methods such as dropout (Lee et al. (2020)).
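The multi-headed attention computation of Section 2.1 can be sketched directly from the formula. This is a minimal NumPy illustration, not production code; the weight matrices here are random stand-ins, and real implementations fuse the per-head projections into single matrix multiplications for efficiency.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """Att_i = softmax((Q W_i^Q)(K W_i^K)^T / sqrt(n)) V W_i^V; concat heads, project by W^O."""
    n = W_q[0].shape[1]                               # query/key dimension
    heads = []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):             # one iteration per head
        scores = (Q @ Wq) @ (K @ Wk).T / np.sqrt(n)   # (N, N) attention scores
        heads.append(softmax(scores) @ (V @ Wv))      # (N, m) head output
    return np.concatenate(heads, axis=-1) @ W_o       # (N, d) final result

# Tiny self-attention demo: Q = K = V = X, with N=4 tokens, d=8, H=3 heads
rng = np.random.default_rng(0)
N, d, n, m, H = 4, 8, 5, 6, 3
X = rng.normal(size=(N, d))
W_q = [rng.normal(size=(d, n)) for _ in range(H)]
W_k = [rng.normal(size=(d, n)) for _ in range(H)]
W_v = [rng.normal(size=(d, m)) for _ in range(H)]
W_o = rng.normal(size=(H * m, d))
out = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)
```

Pruning an attention head, as AUBER does, amounts to dropping one element from each of the `W_q`, `W_k`, `W_v` lists along with the corresponding m rows of `W_o`.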

