MA-BERT: TOWARDS MATRIX ARITHMETIC-ONLY BERT INFERENCE BY ELIMINATING COMPLEX NON-LINEAR FUNCTIONS

Abstract

Due to their superior results, Transformer-based models such as BERT have become de facto standards in many Natural Language Processing (NLP) applications. However, the intensive use of complex non-linear functions within the Transformer architecture impairs its computing efficiency and complicates corresponding accelerator designs, because non-linear functions are generally computation-intensive and require special hardware support. In light of this, we propose MA-BERT, which allows matrix arithmetic-only operations in Transformer-based NLP models and achieves efficient inference with negligible accuracy loss. Specifically, we propose four correlated techniques that include approximating softmax with a two-layer neural network, replacing GELU with ReLU, fusing normalization layers with adjacent linear layers, and leveraging knowledge transfer from baseline models. Through these techniques, we are able to eliminate the major non-linear functions in Transformer-based models and obtain MA-BERT with only matrix arithmetic and trivial ReLU operations without compromising on accuracy. With mainly regular matrix arithmetic operations, MA-BERT enables hardware-friendly processing on various computing engines, including CPUs and GPUs. Our experimental results show that MA-BERT achieves up to 27% and 41% reduction in inference time on CPU and GPU, respectively, with comparable accuracy on many downstream tasks compared to the baseline BERT models.

1. INTRODUCTION

Recently, pretrained Transformer-based models such as GPT (Radford et al., 2018) and BERT (Devlin et al., 2018) have consistently dominated the leaderboards for a variety of NLP tasks and even surpassed the human baseline. Consequently, there has been a strong push for these models to be used in many NLP applications. At the same time, they have become a popular research topic in academia and industry. This in turn led to the creation of even better models such as GPT-3 (Brown et al., 2020), RoBERTa (Liu et al., 2019), and DeBERTa (He et al., 2020), which further stretched the limits of what such models can achieve. Clearly, the advent of Transformer-based models has brought about a paradigm shift in the NLP field from the pre-Transformer era, when recurrent neural networks and their variants used to dominate. Nonetheless, the exceptional performance of these Transformer-based models partly stems from their deep structure, which involves a huge number of parameters and operations during a single forward propagation. These characteristics make it challenging to meet timing requirements and to deploy them on devices with limited memory and computational power. Consequently, many notable works have been proposed to improve their inference performance. For BERT in particular, some works seek to compress the model by reducing the number of encoder layers via knowledge distillation (Sanh et al., 2019; Aguilar et al., 2020; Sun et al., 2020; 2019; Jiao et al., 2019; Xu et al., 2020) with a minor accuracy penalty. Other works focus on model pruning (Gao et al., 2021; Gordon et al., 2020; Voita et al., 2019; Wang et al., 2021), which involves eliminating unessential weights or attention heads, leading to sparser models with a reduced number of parameters and computing operations.
Additionally, several works leverage quantization (Kim et al., 2021; Bhandare et al., 2019; Zafrir et al., 2019) to reduce the precision of weights and computing operations, which lowers memory demands and speeds up inference. Despite these works, one non-negligible overhead in Transformer-based models is often overlooked: the intensive use of complex non-linear functions. In BERT, these non-linear functions include the softmax operation, the Gaussian Error Linear Unit (GELU) activation function (Hendrycks & Gimpel, 2016), and Layer Normalization (LayerNorm) (Ba et al., 2016). Although these functions undeniably play a role in helping the model learn better during training, they become a considerable bottleneck during inference as they are not straightforward to evaluate. Furthermore, in hardware accelerator designs, these non-linear functions typically require separate hardware for acceleration (Liu et al., 2021; Khan et al., 2021), which makes them challenging to deploy on resource-constrained computing platforms. In Figure 1 and Table 1, we show the breakdown of the cycle budget for the BERT base model. For non-linear operations, the cycle budget is defined as the equivalent number of cycles that the matrix multiply takes to process, given the matrix multiply dimensions and the number of multiplications per cycle (Khan et al., 2021). From the figure, we can see that those non-linear operations can take as much as 43% of the processing time. The computation efficiency of BERT has huge potential to improve if we can optimize these non-linear operations.
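To make the cycle-budget definition above concrete, the following sketch charges a non-linear operation the same number of cycles as the matrix multiply that produces its input. The function name and the example dimensions (a 128x128 attention-score matrix produced by a 128x64 times 64x128 multiply) are our own illustrative assumptions, not values taken from the paper, apart from the sequence length of 128 and the 2048 multiplications per cycle quoted above.

```python
def equivalent_cycles(m, k, n, mults_per_cycle):
    """Equivalent cycle budget for a non-linear op, charged as the cost of
    the (m, k) x (k, n) matrix multiply that feeds it, on an engine that
    performs `mults_per_cycle` multiplications per cycle."""
    return (m * k * n) // mults_per_cycle

# Illustrative example: one attention-score softmax over a (128, 128)
# matrix produced by a (128, 64) x (64, 128) multiply (head dim 64 is an
# assumption), with 2048 multiplications per cycle.
print(equivalent_cycles(128, 64, 128, 2048))  # 512
```

Summing such per-operation budgets over all softmax, GELU, and LayerNorm instances in the model is what yields the breakdown reported in Figure 1 and Table 1.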

[Figure 1: percentage of cycles by operation type, including softmax.]

In view of this, we introduce an efficient BERT, termed MA-BERT, which makes a novel attempt to completely eliminate the complex non-linear functions in BERT by substituting them with simpler functions. In MA-BERT, four correlated techniques are applied: (1) approximating softmax with a two-layer neural network, (2) replacing GELU with ReLU, (3) fusing normalization layers with adjacent linear layers, and (4) leveraging knowledge transfer from pretrained baseline models. Through these techniques, MA-BERT performs only matrix arithmetic and trivial ReLU operations with negligible accuracy loss. Our experiments show that the performance of MA-BERT on downstream tasks is on par with the corresponding BERT baseline while achieving up to 27% and 41% performance improvement on general CPUs and GPUs, respectively. By eliminating the complex non-linear functions, the majority of the operations in MA-BERT become regular matrix-matrix arithmetic and can potentially be deployed on various computing engines, including CPUs and GPUs, without any hardware modification.
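The first technique can be sketched as follows. This is a minimal, generic illustration of replacing row-wise softmax with a small learned network built from matrix arithmetic and ReLU only; the layer sizes, weight names, and the exact way MA-BERT structures and trains this approximation are our assumptions, not details from the paper.

```python
def relu(v):
    # Trivial elementwise ReLU; the only non-matrix operation retained.
    return [max(0.0, x) for x in v]

def linear(W, b, x):
    # Plain matrix-vector arithmetic: y[i] = sum_j W[i][j] * x[j] + b[i]
    return [sum(w * xj for w, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def approx_softmax(x, W1, b1, W2, b2):
    """Two-layer network standing in for row-wise softmax over attention
    scores: linear -> ReLU -> linear, with no exp or division at inference.
    In MA-BERT the weights would be trained (e.g. via knowledge transfer
    from the baseline model) to mimic the true softmax output."""
    return linear(W2, b2, relu(linear(W1, b1, x)))

# Usage with hypothetical (untrained) weights on a length-2 score row:
W1 = [[1.0, 0.0], [0.0, 1.0]]; b1 = [0.0, 0.0]
W2 = [[0.5, 0.5], [0.5, 0.5]]; b2 = [0.0, 0.0]
print(approx_softmax([1.0, 3.0], W1, b1, W2, b2))  # [2.0, 2.0]
```

Because the approximation is itself built from linear layers, it maps directly onto the same matrix-multiply engines that execute the rest of the model, which is what enables deployment without dedicated non-linear hardware.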

2. RELATED WORK

While many works focus on model compression techniques, only a few have paid attention to the optimization of non-linear functions in BERT and other variants of Transformer-based models. For instance, in MobileBERT, Sun et al. (2020) proposed a thin version of BERT-large with reduced hidden dimensions. As part of their operational optimization, they replaced GELU with ReLU and



Cycle budget for BERT base with a sequence length of 128 and 2048 multiplications per cycle (Khan et al., 2021).

