SAMOE: PARAMETER EFFICIENT MOE LANGUAGE MODELS VIA SELF-ADAPTIVE EXPERT COMBINATION

Abstract

Recently, Mixture-of-Experts (MoE) has demonstrated success in scaling models to large numbers of parameters without significant increases in computational cost. However, MoE models have also been reported to be parameter-inefficient: larger models do not always lead to better performance. In this work, we study how to build parameter-efficient MoE models. Our analysis identifies that MoE layers exhibit poor gradient flow as the number of experts increases, leading to insufficient training of the experts. To overcome this issue, we propose a new MoE architecture design (SaMoE), which improves the parameter efficiency of MoE models by learning a soft combination of a global set of shared expert layers for each MoE layer. Such a scheme enables substantial parameter savings on MoE while achieving comparable or better accuracy than the standard MoE training baseline. Extensive experiments on billion-scale GPT-3 style autoregressive MoE language models demonstrate that SaMoE significantly improves the parameter efficiency of MoE models, reducing total parameters by up to 5.2× while obtaining superior pre-training and zero-shot generalization results compared to the baseline.

1. INTRODUCTION

Over the past few years, there has been an explosion of research on large language models, primarily motivated by the impressive performance of Transformer-based language models (Devlin et al., 2019; Radford et al., 2019; Raffel et al., 2019; Brown et al., 2020). One of the most impactful findings of this research is that the performance of these models continues to scale as the number of parameters increases (Kaplan et al., 2020; Clark et al., 2022). However, sustaining this growth in model parameters is becoming more challenging due to the increasing compute requirements. As such, there has been substantial interest in exploring more efficient model designs and training methodologies. Among them, sparsely activated models, such as architectures based on Mixture-of-Experts (Shazeer et al., 2017; Lepikhin et al., 2020; Fedus et al., 2021), have demonstrated promising results for training massive language models. MoE allows each input to interact with only a subset of the network parameters, chosen independently for each input. As such, the number of parameters is nearly disentangled from the computational cost of processing an input. Recently, several works explored whether MoE models can match the accuracy of dense models at much lower computational cost. These works successfully trained MoE-based language models and demonstrated that MoE models can perform on par with their dense counterparts while requiring up to 4-7× less computation (Artetxe et al., 2021; Du et al., 2022; Rajbhandari et al., 2022). Despite these promising results, the MoE architecture appears to be parameter-inefficient when weighing the model quality improvement against the parameters involved. For example, prior works report that, to achieve the same quality as a dense model, an MoE model needs roughly an order of magnitude more parameters than its corresponding dense model (Rajbhandari et al., 2022; Du et al., 2022; Artetxe et al., 2021).
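The sparse routing described above can be illustrated with a minimal top-1 gating layer. This is a hedged sketch in plain NumPy, not the paper's implementation; all function and variable names are ours:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer_top1(tokens, gate_w, experts):
    """Route each token to a single expert (top-1 gating).

    tokens:  (n_tokens, d_model) input activations
    gate_w:  (d_model, n_experts) gating projection
    experts: list of n_experts callables, each (d_model,) -> (d_model,)
    """
    gate_probs = softmax(tokens @ gate_w)      # (n_tokens, n_experts)
    chosen = gate_probs.argmax(axis=-1)        # expert index per token
    out = np.empty_like(tokens)
    for i, tok in enumerate(tokens):
        e = chosen[i]
        # Only one expert runs per token, so per-token compute stays
        # roughly constant even as n_experts (total parameters) grows.
        out[i] = gate_probs[i, e] * experts[e](tok)
    return out, chosen
```

This is why the parameter count is "nearly disentangled" from compute: adding experts grows `gate_w` and the expert list, but each token still touches exactly one expert.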
This parameter inefficiency adds a high cost in additional memory and devices during model training and inference. Therefore, a natural question to ask is: "Are all these expert parameters necessary to increase the model quality?" or, equivalently, "Given a bound on the number of trainable parameters, how can we arrive at an MoE model of higher quality?"

In this work, we investigate parameter-efficient architectures for MoE. In particular, our analysis shows that MoE models face poor gradient flow at MoE layers, leading to insufficient training of those layers compared to the dense layers. Based on this analysis, we conjecture that sharing parameters across experts would allow experts to receive more sufficient training and become useful. As such, we study several expert-sharing strategies for MoE models. Our studies show that, due to the smaller number of parameters, the performance of MoE models with aggressively tied experts suffers when training on large-scale GPT pretraining datasets. On the other hand, relaxing the sharing constraints helps improve model quality, but it requires manually designing the sharing strategy, and a manually determined strategy may still achieve sub-optimal model quality.

Our contributions in this work are:

SaMoE. We improve the parameter efficiency of MoE models by developing a novel parameter-efficient MoE architecture, referred to as SaMoE. SaMoE learns an expert pool that consists of a global set of shared MoE layers and expresses each MoE layer as a soft combination of the global MoE layers. Such a scheme decouples the number of experts from MoE model depth, drastically reducing MoE parameters while achieving better accuracy than baseline approaches (Section 4).

Analysis. We identify poor gradient flow in MoE layers as the main cause of the poor parameter efficiency of MoE models. Our preliminary analysis shows that expert sharing helps overcome the poor gradient flow issue and encourages MoE layers to learn more sufficiently (Section 3).

Evaluation. We conduct experiments on billion-scale autoregressive MoE language models with open-sourced massive datasets and demonstrate that (i) SaMoE significantly improves the parameter efficiency of MoE models, reducing the model size by up to 5.2× while achieving better zero-shot generalization accuracy than prior works such as PR-MoE (Section 5); (ii) an ablation study verifies the effectiveness of the proposed design elements in SaMoE (Section 5.4); (iii) a detailed evaluation of the scaling properties of SaMoE reveals its strong scalability (Section 5.5); and (iv) we compare SaMoE with alternative heuristic strategies (Section 5.6). We will also open-source the training and evaluation code at anonymous_link.
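To make the "soft combination of a global set of shared MoE layers" concrete, the following NumPy sketch shows one plausible reading of the idea: a single global pool of expert weight matrices is shared by all MoE layers, and each layer holds only a small vector of mixing logits. The class and all names here are hypothetical, not taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SharedExpertPool:
    """A global pool of expert weight matrices shared across MoE layers.

    Each MoE layer stores only `pool_size` mixing logits; its effective
    expert weights are a softmax-weighted (soft) combination of the pooled
    weights, so expert parameters no longer grow with model depth.
    """
    def __init__(self, pool_size, d_model, d_ff, n_layers, seed=0):
        rng = np.random.default_rng(seed)
        self.pool = rng.standard_normal((pool_size, d_model, d_ff)) * 0.02
        self.mix_logits = np.zeros((n_layers, pool_size))  # learned per layer

    def layer_weights(self, layer_idx):
        mix = softmax(self.mix_logits[layer_idx])       # (pool_size,)
        # Blend all pooled matrices into one effective weight matrix.
        return np.einsum("p,pdf->df", mix, self.pool)   # (d_model, d_ff)
```

Under this sketch, total expert parameters scale with the pool size rather than with `pool_size × n_layers`, which is how a soft combination can decouple expert count from depth.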

2. RELATED WORK

The Mixture-of-Experts architecture converts multiple layers of a deep neural network into sparsely activated counterparts that are jointly trained with the rest of the network (Jacobs et al., 1991; Shazeer et al., 2017). It falls into the paradigm of conditional computation (Bengio et al., 2013), which was proposed to activate only a small fraction of the model's parameters and computation on demand, on a per-example basis (Bengio et al., 2013; Davis & Arel, 2014; Cho & Bengio, 2014; Bengio et al., 2015). As such, these models provide a promising path to building neural networks of much higher capacity without significantly increasing the computation required. Recent work has shown that MoE models can be combined with the Transformer architecture to scale language models (Lepikhin et al., 2020; Fedus et al., 2021). Despite the promising aspects of sparsely activated parameters, MoE models are difficult to train. In particular, prior works attribute the training difficulty of MoE to the unbalanced load of experts and conjecture that encouraging or enforcing all experts to process balanced compute loads can help improve the learning of the gating function. For example, Lepikhin et al. (2020) and Fedus et al. (2021) propose to add a load-balancing loss term to the training objective. Lewis et al. (2021) guarantee load balancing across experts by post-processing the routing output to re-assign expert selections so that all experts are selected evenly. Roller et al. (2021) propose to use a fixed hash as the gating function, and Nie et al. (2021) propose to adaptively choose K in top-K selection during training. Another challenge is that the gating function is highly non-differentiable. To address it, Clark et al. (2022) propose to use reinforcement learning for routing, and Hazimeh et al. (2021) learn expert selection through a differentiable loss. Different from those works, our analysis shows that there is parameter redundancy in MoE layers, and we focus on developing parameter-efficient MoE architectures that reduce this redundancy.

Parameter-efficient architectures have long been of interest in the machine learning community (Mnih & Hinton, 2008; Mikolov et al., 2010; Press & Wolf, 2017; Inan et al., 2017; Savarese & Maire, 2019; Lan et al., 2020). Lan et al. (2020) discovered that sharing weights across layers improves the parameter efficiency of Transformer models. However, ALBERT focuses on encoder-based masked language model pre-training with dense Transformer blocks. More recently, Xue et al. (2022) proposed to share the weights of all MoE layers. However, we find that directly sharing all MoE layers leads to severe accuracy degradation for decoder-based MoE models, especially on large-scale autoregressive GPT-3 style pretraining tasks (e.g., 10-100× larger in
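The load-balancing loss term mentioned above can be sketched as follows. This is a hedged NumPy illustration in the style of the Switch Transformer formulation of Fedus et al. (2021); the function name is ours:

```python
import numpy as np

def load_balancing_loss(gate_probs, chosen):
    """Auxiliary load-balancing loss in the style of Fedus et al. (2021).

    gate_probs: (n_tokens, n_experts) softmax routing probabilities
    chosen:     (n_tokens,) index of the expert each token was routed to

    Returns n_experts * sum_i f_i * P_i, where f_i is the fraction of
    tokens routed to expert i and P_i is the mean gate probability of
    expert i. The value is minimized (at 1.0) under uniform routing.
    """
    n_tokens, n_experts = gate_probs.shape
    f = np.bincount(chosen, minlength=n_experts) / n_tokens  # token fractions
    P = gate_probs.mean(axis=0)                              # mean gate probs
    return n_experts * float(f @ P)
```

Adding a small multiple of this term to the training objective penalizes routing configurations that concentrate tokens (and gate probability mass) on a few experts.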

