DATA-AWARE LOW-RANK COMPRESSION FOR LARGE NLP MODELS

Abstract

The representations learned by large-scale NLP models such as BERT have been widely used in various tasks. However, the increasing size of pre-trained models also brings efficiency challenges, including inference speed and model size, when deploying models on devices. Specifically, most operations in BERT consist of matrix multiplications. These matrices are not low-rank, so canonical matrix decomposition cannot find an efficient approximation. In this paper, we observe that the learned representation of each layer lies in a low-dimensional space. Based on this observation, we propose DRONE (data-aware low-rank compression), a provably optimal low-rank decomposition of weight matrices, which has a simple closed-form solution that can be computed efficiently. DRONE is generic, applies to both fully-connected and self-attention layers, and does not require any fine-tuning or distillation steps. Experimental results show that DRONE improves both model size and inference speed with limited loss of accuracy. Specifically, DRONE alone achieves a 1.92x speedup on the MRPC task with only a 1.5% loss of accuracy, and when combined with distillation, DRONE achieves over a 12.3x speedup on various natural language inference tasks.

1. INTRODUCTION

The representations learned by large-scale Natural Language Processing (NLP) models such as BERT have been widely used in various tasks (Devlin et al., 2018). The pre-trained models of BERT and its variations are used as feature extractors for downstream tasks such as question answering and natural language understanding (Radford et al.; Howard & Ruder, 2018). The success of pre-trained BERT relies on the use of large corpora and big models. Indeed, researchers have reported better results from models with more parameters (Shazeer et al., 2018) and more layers (Al-Rfou et al., 2019). The increasing size of pre-trained models prevents most users from training a model from scratch, and it also brings efficiency challenges, including inference speed and model size, when deploying models on devices. To deal with the efficiency issue, most existing works resort to adjusting the model structure or to distillation. For instance, Kitaev et al. (2020) use locality-sensitive hashing to accelerate dot-product attention, Lan et al. (2019) reuse model parameters to reduce the size, and Zhang et al. (2018) apply a pre-defined attention pattern to save computation. A large body of prior work explores variants of knowledge distillation (Sanh et al., 2019; Jiao et al., 2019; Sun et al., 2020; Liu et al., 2020; Xu et al., 2020; Sun et al., 2019). However, all these methods either require a specific model architecture, which is not generic, or require users to train the proposed structure from scratch, which greatly reduces their practicality. In this work, we explore a more generic acceleration method. Note that, as shown in Figure 1, matrix multiplication (the feed-forward layer) is a fundamental operation that appears many times in the Transformer architecture. In fact, the underlying computation of both multi-head attention layers and feed-forward layers is matrix multiplication.

Therefore, instead of resorting to complex architecture-redesign approaches, we aim to investigate whether low-rank matrix approximation, the most classical and simple model compression approach, can be used to accelerate Transformers. Despite being successfully applied to CNNs (Yu et al., 2017; Sindhwani et al., 2015; Shim et al., 2017; You et al., 2019), at first glance low-rank compression cannot work for BERT. We can see in Figure 2 that, regardless of the layer, the matrices in the feed-forward layers and in the query and key transformations of the attention layers are not low-rank. Therefore, even the optimal low-rank approximation (e.g., by truncated SVD) cannot yield an efficient approximation.
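The contrast between weight-only and data-aware low-rank compression can be illustrated with a minimal NumPy sketch on synthetic matrices (not actual BERT weights; the variable names `W_data`, `rel_err_*` and the random subspace construction are purely illustrative). A dense Gaussian weight matrix is close to full-rank, so truncating its SVD (the optimal low-rank approximation in Frobenius norm, by the Eckart-Young theorem) incurs large error. But if the inputs lie near a low-dimensional subspace, projecting the weight onto that subspace gives a low-rank factorization that is nearly exact on the data, which is the kind of structure a data-aware method can exploit:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 256, 32

# A dense weight matrix: approximately full-rank, like Figure 2 suggests.
W = rng.standard_normal((d, d)) / np.sqrt(d)

# Optimal rank-k approximation of W itself (truncated SVD).
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_k = (U[:, :k] * s[:k]) @ Vt[:k]
rel_err_weight = np.linalg.norm(W - W_k) / np.linalg.norm(W)

# Now suppose the activations X lie in a k-dimensional subspace.
B = rng.standard_normal((d, k))    # basis of the (synthetic) data subspace
X = B @ rng.standard_normal((k, 1024))
Y = W @ X

# The weight-only approximation is still poor on the outputs ...
rel_err_output = np.linalg.norm(Y - W_k @ X) / np.linalg.norm(Y)

# ... but a data-aware rank-k factorization, projecting W onto the
# input subspace, reproduces Y up to floating-point error.
Q, _ = np.linalg.qr(B)             # orthonormal basis of the data subspace
W_data = (W @ Q) @ Q.T             # rank-k by construction
rel_err_data = np.linalg.norm(Y - W_data @ X) / np.linalg.norm(Y)

print(rel_err_weight, rel_err_output, rel_err_data)
```

The projection step here is only a toy stand-in for DRONE's closed-form solution, but it captures the key observation: the weight matrices are not low-rank, while the map they compute on the actual data distribution can be.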

