AUBER: AUTOMATED BERT REGULARIZATION

Abstract

How can we effectively regularize BERT? Although BERT proves its effectiveness in various downstream natural language processing tasks, it often overfits when only a small number of training instances are available. A promising direction for regularizing BERT is to prune its attention heads using a proxy score for head importance. However, heuristic-based methods are usually suboptimal since they predetermine the order in which attention heads are pruned. To overcome this limitation, we propose AUBER, an effective regularization method that leverages reinforcement learning to automatically prune attention heads from BERT. Instead of depending on heuristics or rule-based policies, AUBER learns a pruning policy that determines which attention heads should or should not be pruned for regularization. Experimental results show that AUBER outperforms existing pruning methods, achieving up to 9.39% better accuracy. In addition, our ablation study empirically demonstrates the effectiveness of our design choices for AUBER.

1. INTRODUCTION

How can we effectively regularize BERT (Devlin et al. (2018))? In natural language processing (NLP), fine-tuning a large-scale pre-trained language model greatly enhances generalization. In particular, BERT has demonstrated its effectiveness through improvements on many downstream NLP tasks such as sentence classification and question answering. Despite its recent success and wide adoption, fine-tuning BERT on a downstream task is prone to overfitting due to over-parameterization: BERT-base has 110M parameters and BERT-large has 340M parameters. The overfitting worsens when the target downstream task has only a small number of training examples. Devlin et al. (2018) and Phang et al. (2018) show that fine-tuning BERT on datasets with 10,000 or fewer training examples sometimes fails.

To mitigate this critical issue, multiple studies attempt to regularize BERT by pruning parameters or applying dropout to decrease its model complexity (Michel et al. (2019); Voita et al. (2019); Lee et al. (2020)). Among these approaches, we regularize BERT by pruning attention heads, since pruning yields simple and explainable results and can be combined with other regularization methods. To avoid combinatorial search, whose computational complexity grows exponentially with the number of heads, existing methods measure the importance of each attention head with heuristics, such as an approximation of the sensitivity of BERT to pruning a specific attention head. However, these hand-crafted heuristics are not guaranteed to be directly related to model performance, and therefore can yield suboptimal results.

In this paper, we propose AUBER, an effective method for regularizing BERT. AUBER overcomes the limitation of past attempts to prune attention heads from BERT by leveraging reinforcement learning.
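To make the heuristic baseline concrete, the sketch below illustrates one common proxy of this kind: a Taylor-style importance score that estimates, per head, how much the loss would change if that head's output were zeroed out. The tensors here are random stand-ins for one layer's per-head attention outputs and their gradients (in practice they would come from a backward pass through BERT); the function name `head_importance` is our own, not from any cited work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins, shaped (batch, heads, features), for one layer's
# per-head attention outputs and the gradient of the loss w.r.t. them.
head_outputs = rng.normal(size=(32, 12, 64))
head_grads = rng.normal(size=(32, 12, 64))

def head_importance(outputs, grads):
    """Taylor-style proxy: E[| sum_f output_f * grad_f |] per head,
    a first-order estimate of the loss change when a head is removed."""
    return np.abs((outputs * grads).sum(axis=-1)).mean(axis=0)

scores = head_importance(head_outputs, head_grads)
# A heuristic pruner fixes the pruning order up front: lowest score first.
prune_order = np.argsort(scores)
print(prune_order[:3])  # the three heads such a method would prune first
```

Because the scores are computed once, the pruning order is predetermined; this is exactly the rigidity that AUBER's learned policy is designed to avoid.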
When pruning attention heads from BERT, our method automates the process by learning a pruning policy rather than relying on a predetermined rule-based policy or heuristics. AUBER prunes BERT sequentially in a layer-wise manner. For each layer, AUBER extracts features that help the reinforcement learning agent determine which attention heads to prune from the current layer. The final pruning policy found by the agent is used to prune the corresponding layer. Before AUBER proceeds to the next layer, BERT is fine-tuned to recapture the information lost by pruning attention heads. An overview of AUBER transitioning from the second to the third layer of BERT is shown in Figure 1. Our contributions are summarized as follows:
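The layer-wise loop described above can be sketched as follows. This is a minimal toy, not the paper's implementation: `dev_accuracy`, `fine_tune`, the per-head `utility` values, and the greedy head selection (standing in for the learned RL policy) are all our own hypothetical constructions, and the model sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
NUM_LAYERS, NUM_HEADS = 4, 12  # toy sizes; BERT-base has 12 layers x 12 heads

# Hypothetical per-head utilities: a negative utility marks a redundant head
# whose removal improves the toy "dev accuracy". In AUBER the reward signal
# would be the real validation accuracy of the pruned BERT.
utility = rng.normal(0.0, 0.01, size=(NUM_LAYERS, NUM_HEADS))

def dev_accuracy(masks):
    """Toy surrogate for validation accuracy under a head mask (1 = keep)."""
    return 0.85 + float((masks * utility).sum())

def fine_tune(masks):
    """Placeholder: AUBER fine-tunes BERT before moving to the next layer."""
    return masks

masks = np.ones((NUM_LAYERS, NUM_HEADS))
for layer in range(NUM_LAYERS):
    # Greedy stand-in for the agent: keep pruning one head at a time
    # while some pruning action still improves the reward, else STOP.
    while True:
        base = dev_accuracy(masks)
        best_head, best_gain = None, 0.0
        for h in np.flatnonzero(masks[layer]):
            masks[layer, h] = 0                      # try pruning head h
            gain = dev_accuracy(masks) - base
            masks[layer, h] = 1                      # undo the trial
            if gain > best_gain:
                best_head, best_gain = h, gain
        if best_head is None:                        # STOP action
            break
        masks[layer, best_head] = 0                  # commit the prune
    masks = fine_tune(masks)                         # recover lost information

print(int(masks.sum()), "of", NUM_LAYERS * NUM_HEADS, "heads kept")
```

The key structural points carried over from the text are the layer-by-layer processing, the per-layer sequence of pruning decisions ending in a stop action, and the fine-tuning step between layers; AUBER replaces the greedy inner loop with a trained reinforcement learning agent acting on extracted features.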

