MIXTURE OF QUANTIZED EXPERTS (MOQE): COMPLEMENTARY EFFECT OF LOW-BIT QUANTIZATION AND ROBUSTNESS

Anonymous

Abstract

Large Mixture of Experts (MoE) models can achieve state-of-the-art quality on various language tasks, including machine translation, thanks to efficient model scaling with expert parallelism (Fedus et al., 2021). However, this scaling brings a fundamental issue of larger memory consumption at deployment time, which in turn causes significant inference slowdowns at autoregressive decoding steps due to the increased memory transfers. In this paper, we propose Mixture of Quantized Experts (MoQE), a simple weight-only quantization method that applies ultra low-bit quantization, down to 2 bits, only to the expert weights, mitigating the increased memory and latency issues of MoE models. We show that low-bit quantization together with the MoE architecture delivers reliable model performance while significantly reducing the memory size, even without any additional training. In particular, expert layers in MoE models are much more robust to quantization than conventional feedforward network (FFN) layers. In a comprehensive analysis, we show that MoE models with 2-bit and 80% sparse expert weights can deliver better model performance than a dense model trained on the same dataset. We present how quantizing different parts of the model affects performance through various experiments with a large MoE model (5.3B parameters). With 2-bit quantization, the model size is reduced by 79.6% from the original half-precision floating point (fp16) MoE model: the 5.3B-parameter model shrinks from 8.4x the size of the dense model to only 1.7x, while still achieving 1.88% higher accuracy than the dense model. Combined with an optimized GPU runtime implementation, it also achieves a 2.7x speed-up, which is even slightly faster than the FLOPs-equivalent dense model.
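To make the weight-only scheme above concrete, the following is a minimal NumPy sketch of per-row symmetric 2-bit quantization of an expert weight matrix. The function names and the per-row max-based scaling are illustrative assumptions, not the exact scheme used in this paper; a 2-bit integer plus one scale per row replaces each fp16 row, giving roughly an 8x memory reduction on expert weights.

```python
import numpy as np

def quantize_2bit(w):
    """Per-row symmetric 2-bit quantization (illustrative sketch).

    Each row of the fp16 weight matrix `w` is mapped to signed 2-bit
    integer levels in {-2, -1, 0, 1} with one floating-point scale per
    row; the original weight is approximated as q * scale at runtime.
    """
    # One scale per row so that the row's max magnitude maps near level 2.
    scale = np.abs(w).max(axis=1, keepdims=True) / 2.0
    scale = np.maximum(scale, 1e-8)  # guard against all-zero rows
    q = np.clip(np.round(w / scale), -2, 1).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct an fp approximation of the original weights."""
    return q * scale
```

In a real deployment the int8 container would be packed four weights per byte (2 bits each); the sketch keeps int8 storage for clarity.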

1. INTRODUCTION

Large Language Models (LLMs) have shown their effectiveness on various language tasks by increasing the number of trainable parameters within the framework of pre-training a model on large-scale data and applying it to different downstream tasks (Devlin et al., 2018; Radford et al., 2018; Liu et al., 2019; Raffel et al., 2020). With the advancement of distributed large-scale training methods (Shazeer et al., 2018; Rasley et al., 2020; Ren et al., 2021; Baines et al., 2021) and large-scale data collection (Raffel et al., 2020; Hoffmann et al., 2022), models have grown even larger and achieved state-of-the-art performance with the increased model capacity (Brown et al., 2020; Rae et al., 2021; Zoph et al., 2022; Zhang et al., 2022; Smith et al., 2022; Chowdhery et al., 2022). However, the cost of training these models increases with every added parameter, which may not be sustainable. To address this issue, sparsely activated models (Shazeer et al., 2017) have been more widely adopted; they offer significant efficiency improvements in model size scaling, enabling up to trillions of parameters to be trained efficiently while achieving better model accuracy (Lepikhin et al., 2020; Fedus et al., 2021; Kim et al., 2021; Artetxe et al., 2021). Mixture-of-Experts (MoE) models are one type of sparsely activated model: they replace a single layer in a model with a group of parallel layers, called experts, combined with a gate layer. For a given input, the gate layer selects a subset of the experts from the group and uses them for processing
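The gate-plus-experts pattern described in this paragraph can be illustrated with a minimal top-k MoE layer in NumPy. The softmax gate, top-2 routing, and ReLU FFN experts below are common choices assumed for illustration, not necessarily the configuration of the 5.3B model studied later; real systems add load balancing, capacity limits, and expert parallelism.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class TopKMoE:
    """Minimal mixture-of-experts layer (illustrative sketch).

    A gate scores all experts per token, the top-k experts are selected,
    and their outputs are combined with renormalized gate weights.
    """
    def __init__(self, d_model, n_experts, k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.k = k
        self.gate_w = rng.standard_normal((d_model, n_experts)) * 0.02
        # Each expert is a small ReLU FFN: W1 (d_model x d_ff), W2 (d_ff x d_model).
        self.experts = [
            (rng.standard_normal((d_model, 4 * d_model)) * 0.02,
             rng.standard_normal((4 * d_model, d_model)) * 0.02)
            for _ in range(n_experts)
        ]

    def __call__(self, x):                      # x: (n_tokens, d_model)
        scores = softmax(x @ self.gate_w)       # (n_tokens, n_experts)
        topk = np.argsort(-scores, axis=-1)[:, :self.k]
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            sel = scores[t, topk[t]]
            sel = sel / sel.sum()               # renormalize over selected experts
            for w, e in zip(sel, topk[t]):
                w1, w2 = self.experts[e]
                h = np.maximum(x[t] @ w1, 0.0)  # ReLU FFN expert
                out[t] += w * (h @ w2)
        return out
```

Because only k of the n experts run per token, compute stays close to a dense FFN while the parameter count grows with n; all expert weights must still be resident in memory, which is exactly the deployment cost MoQE targets.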

