MIXTURE OF QUANTIZED EXPERTS (MOQE): COMPLEMENTARY EFFECT OF LOW-BIT QUANTIZATION AND ROBUSTNESS

Anonymous

Abstract

Large Mixture of Experts (MoE) models can achieve state-of-the-art quality on various language tasks, including machine translation, thanks to efficient model scaling with expert parallelism (Fedus et al., 2021). However, this scaling brings a fundamental issue: much larger memory consumption at deployment time. It also significantly degrades inference speed at auto-regressive decoding steps due to the increased memory transfers. In this paper, we propose Mixture of Quantized Experts (MoQE), a simple weight-only quantization method that applies ultra low-bit quantization, down to 2-bit, only to expert weights, mitigating the memory and latency issues of MoE models. We show that low-bit quantization together with the MoE architecture delivers reliable model performance while reducing the memory size significantly, even without any additional training. In particular, expert layers in MoE models are much more robust to quantization than conventional feedforward network (FFN) layers. In a comprehensive analysis, we show that MoE models with 2-bit and 80% sparse expert weights can deliver better performance than the dense model trained on the same dataset. We present how quantizing different parts of the model affects performance through various experiments with a large MoE model (5.3B parameters). With low-bit quantization, the model size is reduced by 79.6% compared to the original half-precision floating point (fp16) MoE model: the 5.3B-parameter model shrinks from 8.4x of the dense model's size to only 1.7x after 2-bit quantization, while still achieving 1.88% higher accuracy than the dense model. Combined with an optimized GPU runtime implementation, it also achieves a 2.7x speed-up, which is even slightly faster than the FLOPs-equivalent dense model.
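The compression arithmetic quoted above can be checked with a short back-of-the-envelope calculation. This is only an illustrative sanity check using the figures from the abstract (5.3B parameters, 8.4x dense size, 79.6% reduction); it assumes, as a simplification, that the reported reduction applies uniformly to the whole model size.

```python
# Sanity check of the compression figures quoted in the abstract.
# Assumption (illustrative): the 79.6% reduction applies to the
# whole fp16 model footprint.

moe_params = 5.3e9      # total MoE parameters
dense_ratio = 8.4       # MoE size relative to the dense model
reduction = 0.796       # reported size reduction after 2-bit quantization

fp16_bytes = moe_params * 2                     # 2 bytes per fp16 weight
quantized_bytes = fp16_bytes * (1 - reduction)

print(f"fp16 model:    {fp16_bytes / 1e9:.1f} GB")
print(f"2-bit MoQE:    {quantized_bytes / 1e9:.1f} GB")
# Scaling the 8.4x relative size by the same factor lands near the
# reported 1.7x: 8.4 * (1 - 0.796) = 1.71x
print(f"relative size: {dense_ratio * (1 - reduction):.2f}x of dense")
```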

1. INTRODUCTION

Large Language Models (LLMs) have shown their effectiveness on various language tasks by increasing the number of trainable parameters, within the framework of pre-training a model on large-scale data and adapting it to different downstream tasks (Devlin et al., 2018; Radford et al., 2018; Liu et al., 2019; Raffel et al., 2020). With the advancement of distributed large-scale training methods (Shazeer et al., 2018; Rasley et al., 2020; Ren et al., 2021; Baines et al., 2021) and large-scale data collection (Raffel et al., 2020; Hoffmann et al., 2022), models have grown even larger and set new state-of-the-art performance with the increased capacity (Brown et al., 2020; Rae et al., 2021; Zoph et al., 2022; Zhang et al., 2022; Smith et al., 2022; Chowdhery et al., 2022). However, the cost of training these models increases with every added parameter, and this may not be sustainable. To address this issue, sparsely activated models (Shazeer et al., 2017) have been more widely adopted; they show significant efficiency improvements in model size scaling, enabling up to trillions of parameters to be trained more efficiently while achieving better model accuracy (Lepikhin et al., 2020; Fedus et al., 2021; Kim et al., 2021; Artetxe et al., 2021). Mixture-of-Experts (MoE) models are one type of sparsely activated model, replacing a single layer in a model with a group of parallel layers, called experts, combined with a gate layer. For a given input, the gate layer selects a subset of the experts from the group and uses them to process the input. By limiting this subset to one or two experts per input, the theoretical FLOPs stay almost constant even if we add hundreds of parallel layers to the MoE group.
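The routing behavior described above can be sketched as follows. This is a minimal illustration of top-k expert gating, not the paper's implementation; all shapes and initializations are made up for the example. The point is that only `top_k` experts run per token, so per-token compute does not grow with the number of experts.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ffn, n_experts, top_k = 16, 64, 8, 2

# Each expert is a small two-layer FFN; the gate is one linear layer.
experts = [(rng.standard_normal((d_model, d_ffn)) * 0.1,
            rng.standard_normal((d_ffn, d_model)) * 0.1)
           for _ in range(n_experts)]
gate_w = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_layer(x):
    """Route a single token vector x to its top-k experts."""
    logits = x @ gate_w
    top = np.argsort(logits)[-top_k:]            # indices of selected experts
    probs = np.exp(logits[top])
    probs /= probs.sum()                          # renormalized gate weights
    # Only top_k experts execute, so per-token FLOPs stay ~constant
    # regardless of how many experts exist in total.
    out = np.zeros(d_model)
    for p, idx in zip(probs, top):
        w1, w2 = experts[idx]
        out += p * (np.maximum(x @ w1, 0) @ w2)  # ReLU FFN expert
    return out

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)  # (16,)
```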
Thus far, most studies have shown that it is effective to increase model capacity by replacing the feedforward network (FFN) of Transformer (Vaswani et al., 2017) blocks with an MoE layer consisting of multiple FFN layers together with a gating network (Lepikhin et al., 2020; Fedus et al., 2021; Kim et al., 2021; Artetxe et al., 2021). One of the most unique and critical components of MoE models is the gating network, which decides how to conditionally select experts for each input; various studies have improved it to achieve better training convergence (Lewis et al., 2021; Roller et al., 2021; Zuo et al., 2021; Clark et al., 2022; Liu et al., 2022; Zhou et al., 2022), and they are well surveyed in Fedus et al. (2022). In spite of the progress on the training of MoE models, there have been only a handful of studies on MoE model inference. Rajbhandari et al. (2022) design a more efficient MoE architecture and distributed runtime that achieves a 7.3x inference speed-up. Kudugunta et al. (2021) use task-specific information to reduce the size of the model at deployment time by loading only task-specific experts. Kim et al. (2021) prune some experts at deployment time, trading off model performance for a smaller model. Zoph et al. (2022) use knowledge distillation to distill a large MoE model into a smaller dense model, reducing memory consumption and improving throughput. Even with all these proposed techniques, there has not been a solution that accelerates the inference of MoE models while maintaining accuracy.

Quantization is a model acceleration and compression technique that approximates a floating point number with a smaller-precision number. Various studies show that quantization is effective for accelerating neural network inference (Rodriguez et al., 2018; Stock et al., 2019; Choukroun et al., 2019; Gholami et al., 2022). In particular, it is known to be very effective for natural language generation tasks such as machine translation (Kim et al., 2019; Aji & Heafield, 2020; Fan et al., 2021) and for natural language understanding (Kim & Awadalla, 2020). However, there has not been an in-depth study of how quantization works with large MoE models.

Recently, Dettmers et al. (2022) and Yao et al. (2022) have studied how quantization works on large-scale language models. Dettmers et al. (2022) look at outlier features in the activations of large language models and propose to decompose them while performing matrix multiplications. Our quantization method does not need this, because it is weight-only and outliers in activations cannot affect its performance. Moreover, the weights are dequantized back to fp16 while the matrix multiplication is performed, so our approach does not require special low-bit instructions. We also show that it can be applied at lower precision than 8-bit for large MoE models.

ZeroQuant (Yao et al., 2022) presents a series of techniques, including knowledge distillation (Kim & Rush, 2016), for achieving higher-quality quantization. Our focus is to exploit the intrinsic characteristics of MoE layers based on our investigation, and we show that a simple quantization algorithm can achieve significantly higher efficiency while maintaining quality. Our contributions in this paper are as follows.

• We present extensive studies of how applying low-bit (down to 2-bit) quantization to different layers of MoE Transformer models affects model accuracy, together with comparisons to the corresponding dense model with the same embedding size.

• We show that expert weights are highly robust to quantization: they can be quantized to 3-bit without additional training or calibration data, and to 2-bit with Quantization-Aware Training (QAT), which results in a 79.6% reduction in memory size. Combined with a runtime optimization, the method boosts inference speed by more than 2.7x. We leverage the memory-bound characteristic of auto-regressive decoders, so the reduced memory bottleneck improves overall efficiency even with the additional dequantization steps in our procedure. Based on these observations, we propose a new framework named Mixture of Quantized Experts (MoQE), a simple weight-only quantization method applied only to MoE expert weights.

• Finally, we show an emerging sparsity from 2-bit quantization, with more than 80% of the expert weights being zero. The expert weight matrices are sparse and very low-precision at the same time, while still outperforming the dense counterpart trained on the same dataset.
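The weight-only scheme discussed above (quantize expert weights offline, dequantize back to fp16 at matmul time) can be sketched roughly as follows. This is an illustrative sketch, not the paper's exact algorithm: the symmetric per-row scaling and the helper names are assumptions made for the example. With 2-bit symmetric levels, many Gaussian-like weights round to zero, which illustrates where the sparsity discussed in the paper can come from.

```python
import numpy as np

def quantize_rowwise(w, bits=2):
    """Symmetric linear quantization of each row to signed `bits`-bit integers."""
    levels = 2 ** (bits - 1) - 1                  # 2-bit -> levels {-1, 0, +1}
    scale = np.abs(w).max(axis=1, keepdims=True) / levels
    q = np.clip(np.round(w / scale), -levels, levels).astype(np.int8)
    return q, scale.astype(np.float16)

def dequant_matmul(x, q, scale):
    """Dequantize weights to fp16 on the fly, then run a normal fp16 matmul."""
    w_hat = (q * scale).astype(np.float16)
    return x.astype(np.float16) @ w_hat

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float16)  # a stand-in expert weight
q, scale = quantize_rowwise(w, bits=2)

# Most values fall below half the quantization step and round to zero,
# so the quantized matrix is both low-bit and highly sparse.
print(f"zero fraction after 2-bit quantization: {(q == 0).mean():.0%}")
```

Storage-wise, `q` needs 2 bits per weight plus one fp16 scale per row, versus 16 bits per weight for the original matrix, which is where the large memory reduction comes from.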

