MA-BERT: TOWARDS MATRIX ARITHMETIC-ONLY BERT INFERENCE BY ELIMINATING COMPLEX NON-LINEAR FUNCTIONS

Abstract

Due to their superior results, Transformer-based models such as BERT have become de facto standards in many Natural Language Processing (NLP) applications. However, the intensive use of complex non-linear functions within the Transformer architecture impairs its computing efficiency and complicates corresponding accelerator designs, because non-linear functions are generally computation-intensive and require special hardware support. In light of this, we propose MA-BERT, which allows matrix arithmetic-only operations in Transformer-based NLP models and achieves efficient inference with negligible accuracy loss. Specifically, we propose four correlated techniques: approximating softmax with a two-layer neural network, replacing GELU with ReLU, fusing normalization layers with adjacent linear layers, and leveraging knowledge transfer from baseline models. Through these techniques, we eliminate the major non-linear functions in Transformer-based models and obtain MA-BERT with only matrix arithmetic and trivial ReLU operations, without compromising accuracy. With mainly regular matrix arithmetic operations, MA-BERT enables hardware-friendly processing on various computing engines, including CPUs and GPUs. Our experimental results show that, compared to baseline BERT models, MA-BERT achieves up to 27% and 41% reductions in inference time on CPU and GPU, respectively, with comparable accuracy on many downstream tasks.
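To make the GELU-to-ReLU replacement concrete, the sketch below compares the two activations pointwise. This is an illustrative snippet, not the paper's implementation: it only shows that exact GELU, x * Phi(x) with Phi the standard normal CDF, closely tracks ReLU away from zero, which is why the swap (followed by retraining, as in the knowledge-transfer step) can cost little accuracy while removing the erf/tanh evaluation that complicates hardware.

```python
import math

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x: float) -> float:
    """ReLU: the hardware-friendly replacement, a simple max with zero."""
    return max(x, 0.0)

# The two activations differ mainly in a narrow band around zero;
# for |x| >= 4 they are numerically almost identical.
for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  relu={relu(x):+.4f}")
```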

1. INTRODUCTION

Recently, pretrained Transformer-based models such as GPT (Radford et al., 2018) and BERT (Devlin et al., 2018) have consistently dominated the leaderboards for a variety of NLP tasks and have even surpassed the human baseline. Consequently, there has been a strong push for these models to be used in many NLP applications, and they have become a popular research topic in both academia and industry. This, in turn, has led to the creation of even better models such as GPT-3 (Brown et al., 2020), RoBERTa (Liu et al., 2019), and DeBERTa (He et al., 2020), which further stretched the limits of what such models can achieve. Clearly, the advent of Transformer-based models has brought about a paradigm shift in the NLP field from the pre-Transformer era, when recurrent neural networks and their variants used to dominate. Nonetheless, the exceptional performance of these Transformer-based models partly stems from their deep structure, which involves a huge number of parameters and operations during a single forward propagation. These characteristics make it challenging to meet timing requirements and to deploy them on devices with limited memory and computational power. Consequently, many notable works have been proposed to improve their inference performance. In particular to BERT,

