SAMOE: PARAMETER EFFICIENT MOE LANGUAGE MODELS VIA SELF-ADAPTIVE EXPERT COMBINATION

Abstract

Recently, Mixture-of-Experts (MoE) has demonstrated success in scaling models to large parameter counts without significant increases in computational cost. However, MoEs have also been reported to be parameter inefficient, in that larger models do not always lead to better performance. In this work, we study how to build parameter-efficient MoE models. Our analysis identifies that MoE layers exhibit poor gradient flow as the number of experts increases, leading to insufficient training of experts. To overcome this issue, we propose a new MoE architecture design (SaMoE), which improves the parameter efficiency of MoE models by learning a soft combination of a global set of expert layers for each MoE layer. Such a scheme enables substantial parameter savings on MoE while achieving comparable or better accuracy than the standard MoE training baseline. Extensive experiments on billion-scale GPT-3 style autoregressive MoE language models demonstrate that SaMoE significantly improves the parameter efficiency of MoE models, reducing total parameters by up to 5.2× while obtaining superior pre-training and zero-shot generalization results compared to the baseline.

1. INTRODUCTION

Over the past few years, there has been an explosion in research revolving around large language models, primarily motivated by the impressive performance of Transformer-based language models (Devlin et al., 2019; Radford et al., 2019; Raffel et al., 2019; Brown et al., 2020). One of the most impactful findings of this research is that the performance of these models continues to scale as the number of parameters increases (Kaplan et al., 2020; Clark et al., 2022). However, sustaining model parameter growth is becoming more challenging due to the increasing compute requirements. As such, there has been substantial interest in exploring more efficient model designs and training methodologies. Among them, sparsely activated models, such as architectures based on Mixture-of-Experts (Shazeer et al., 2017; Lepikhin et al., 2020; Fedus et al., 2021), have demonstrated promising results for training massive language models. MoE allows each input to interact with only a subset of the network parameters, chosen independently for each input. As such, the number of parameters is nearly disentangled from the computational cost of processing an input. Recently, several works have explored whether MoE models can match the accuracy of dense models at much lower computational cost. These works successfully trained MoE-based language models and demonstrated that MoE models can perform on par with their dense counterparts at up to 4-7× lower computation cost (Artetxe et al., 2021; Du et al., 2022; Rajbhandari et al., 2022). Despite these promising results, the MoE architecture appears to be parameter inefficient, considering the model quality improvement it yields relative to the parameters involved. For example, prior works report that to achieve the same quality as a dense model, the MoE model requires roughly an order of magnitude more parameters than its corresponding dense model (Rajbhandari et al., 2022; Du et al., 2022; Artetxe et al., 2021).
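To make the sparse-routing idea concrete, the following is a minimal NumPy sketch of a top-1 gated MoE layer: a learned gate picks one expert per token, so the per-token compute stays roughly constant no matter how many experts (and hence parameters) the layer holds. All names, shapes, and the top-1 routing choice are illustrative assumptions for exposition, not details taken from any particular system.

```python
import numpy as np

def top1_moe_layer(x, gate_w, expert_ws):
    """Illustrative top-1 MoE layer.

    x         : (tokens, d) input activations
    gate_w    : (d, n_experts) router weights
    expert_ws : list of n_experts (d, d) expert weight matrices

    Each token is routed to exactly one expert, so per-token FLOPs do
    not grow with the number of experts, only total parameters do.
    """
    logits = x @ gate_w                                  # (tokens, n_experts)
    # numerically stable softmax over experts
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    chosen = probs.argmax(axis=-1)                       # one expert id per token

    out = np.empty_like(x)
    for e, w in enumerate(expert_ws):
        mask = chosen == e
        if mask.any():
            # scale by the gate probability; in a trained model this keeps
            # the router differentiable with respect to its weights
            out[mask] = (x[mask] @ w) * probs[mask, e:e + 1]
    return out, chosen

rng = np.random.default_rng(0)
d, n_experts, tokens = 8, 4, 16
x = rng.standard_normal((tokens, d))
gate_w = rng.standard_normal((d, n_experts))
expert_ws = [rng.standard_normal((d, d)) for _ in range(n_experts)]
y, chosen = top1_moe_layer(x, gate_w, expert_ws)
```

Note that each expert's weights receive gradient only from the tokens routed to it; as the number of experts grows, each expert sees fewer tokens per batch, which is one intuition behind the gradient-flow concern this paper raises.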
This parameter inefficiency incurs a high cost in additional memory and devices during model training and inference. Therefore, a natural question to ask is: "Are all these expert parameters necessary to increase the model quality?" or, equivalently, "Given a bound on the number of trainable parameters of a model, how can we arrive at an MoE model with higher quality?" In this work, we investigate parameter-efficient architectures for MoE. In particular, our analysis shows that MoE models face challenges of poor gradient flow at MoE layers, leading to insuffi-

