SPARSE MOE AS THE NEW DROPOUT: SCALING DENSE AND SELF-SLIMMABLE TRANSFORMERS

Abstract

Despite their remarkable achievements, gigantic transformers face significant drawbacks, including exorbitant computational and memory footprints during training, as well as severe collapse, evidenced by a high degree of parameter redundancy. Sparsely-activated Mixture-of-Experts (SMoEs) have shown promise in mitigating the training-efficiency issue, yet they are prone to (1) redundant experts, due to representational collapse; and (2) poor expert scalability for inference and downstream fine-tuning, primarily due to overfitting of the learned routing policy to the number of experts activated during training. As recent research efforts are predominantly focused on improving routing policies to encourage expert specialization, this work instead explores the overlooked scalability bottleneck of SMoEs and leverages it to effectively scale dense transformers. To this end, we propose a new plug-and-play training framework, SMoE-Dropout, which enables scaling transformers to better accuracy at full capacity without collapse. Specifically, SMoE-Dropout consists of a randomly initialized and fixed router network that activates experts, gradually increasing the number of activated experts as training progresses. Transformers trained with SMoE-Dropout naturally exhibit a "self-slimmable" property subject to resource availability, offering smooth and consistent performance boosts as more experts are activated during inference or fine-tuning. Our extensive experiments across diverse transformer architectures on a variety of tasks demonstrate the superior performance and substantial computation savings of SMoE-Dropout compared to dense training baselines with equivalent parameter counts. In particular, our trained BERT outperforms its densely trained counterpart with consistent improvements of {1.03%, 0.78%, 1.09%} on the challenging reasoning tasks {ASDiv-A, MAWPS, SVAMP}, respectively. Codes and models are available at https://github.com/VITA-Group/Random.

1. INTRODUCTION

Scaling neural networks, historically with the blessing of modern hardware, has dramatically improved the state-of-the-art on a wide array of real-world machine learning applications and leaderboards, conforming to the empirical scaling laws (Kaplan et al., 2020), where final model quality has been found to have a power-law relationship with the amount of data, model size, and compute time. Transformers (Vaswani et al., 2017), swiftly after their introduction, have become the de facto choice for many natural language processing (NLP) (Yang et al., 2019c; Liu et al., 2019b; Talmor et al., 2018; Jaiswal et al., 2021; Yang et al., 2019b; Wang et al., 2018; Ding et al., 2019; Chowdhery et al., 2022; Wei et al., 2022) and computer vision (Dosovitskiy et al., 2020; Han et al., 2020; Touvron et al., 2021; Mao et al., 2022; Zheng et al., 2021; Parmar et al., 2018) applications, and their parameter counts are now typically measured in billions rather than millions. Unfortunately, this explosion of parameters leads to a roughly quadratic blow-up in training costs as both the model size and the number of training examples grow, especially for dense, advanced transformer-based models (e.g., BERT (Devlin et al., 2018) and GPT (Brown et al., 2020)), which require thousands of GPU days to train. Additionally, these gigantic transformers suffer from representation collapse during vanilla training, as affirmed by a high degree of parameter redundancy (Guo et al., 2019; Ganesh et al., 2020; McCarley et al., 2019) and the observed ineffective usage of transformer expressiveness (Michel et al., 2019; Chen et al., 2022a).
Sparse Mixture-of-Experts (SMoEs) enable efficient scaling of model capacity at a fixed computational cost by performing input-dependent conditional computation. This property facilitates training transformers with significantly higher parameter counts at moderately increased cost compared to their dense counterparts, resulting in improved training efficiency. For instance, at similar training FLOPS, Switch-Large (Fedus et al., 2021) (a kind of SMoE) is 35× larger than a T5-Large dense model (Raffel et al., 2020). Despite their advantages in mitigating computational and energy footprints, SMoEs have several critical limitations. Firstly, current learning-based routing mechanisms in SMoEs tend to push hidden representations to cluster around expert centroids (Chi et al., 2022), implying a trend toward representation collapse, which in turn leads to redundant experts, inferior expert specialization, and thereby substandard performance (Mittal et al., 2022; Chen et al., 2022b). Secondly, SMoEs suffer from poor scalability during inference and downstream fine-tuning, primarily due to overfitting of the learned routing policy to the number of experts activated during training. Naive attempts to mitigate this sparsity immutability often lead to performance degradation. As recent research efforts for SMoEs are predominantly focused on improving routing policies to encourage expert specialization, we explore the overlooked scalability bottleneck of SMoEs and ask: Does there exist a principled and pluggable approach to modify SMoE training that can enhance scalability at inference and downstream fine-tuning of large-scale transformers, by dynamically adapting the number of activated experts subject to resource availability? To this end, this paper proposes a novel plug-and-play training framework, named SMoE-Dropout, to enable scaling transformers to better accuracy at full capacity without collapse.
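To make the conditional-computation idea concrete, the following is a minimal NumPy sketch of a standard top-k SMoE forward pass for a single token. All names (`smoe_forward`, the toy linear experts, `router_w`) are illustrative stand-ins, not the API of any particular SMoE library; the point is only that compute scales with k, not with the total number of experts.

```python
import numpy as np

rng = np.random.default_rng(0)

def smoe_forward(x, experts, router_w, k=2):
    """Top-k SMoE forward pass for a single token vector x.

    Only the k experts with the highest router scores are evaluated,
    so compute scales with k rather than the total expert count.
    """
    logits = router_w @ x                      # one score per expert
    topk = np.argsort(logits)[-k:]             # indices of the k best experts
    gates = np.exp(logits[topk])
    gates /= gates.sum()                       # softmax over selected experts only
    return sum(g * experts[i](x) for g, i in zip(gates, topk))

d, n_experts = 8, 16
# toy experts: independent random linear maps (stand-ins for FFN experts)
weights = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda x, W=W: W @ x for W in weights]
router_w = rng.normal(size=(n_experts, d))     # learned jointly with the model in a standard SMoE

x = rng.normal(size=d)
y = smoe_forward(x, experts, router_w, k=2)    # only 2 of 16 experts run
```

In a standard SMoE, `router_w` is trained jointly with the experts; overfitting of this learned policy to the k used during training is exactly the scalability bottleneck discussed above.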
More specifically, SMoE-Dropout employs a randomly initialized and fixed router network to activate experts, and progressively increases the number of activated experts as training progresses. Our simple yet highly effective strategy yields a multi-fold win for trained transformers, specifically: ❶ a "self-slimmable" property during inference and downstream fine-tuning, subject to resource availability, which delivers a once-for-all, in-situ trade-off between efficiency and performance; ❷ mitigated representational collapse and effective utilization of the full model capacity, where activating more experts produces superior performance (Figure 1 (blue)); ❸ elimination of the overhead of learning routing policies for SMoEs. Note that SMoE-Dropout can be swiftly adapted to train any deep network (e.g., CNNs), given appropriate splitting techniques (Zhang et al., 2021), but this work primarily focuses on transformers, considering their exploding computational footprints. Our contributions can be summarized as:

⋆ We propose a new plug-and-play training framework, SMoE-Dropout, to enable scaling transformers at full capacity without collapse. SMoE-Dropout induces a randomly and sparsely activated structure over network modules, playing an implicit regularization role similar to dropout. Our new framework leads to enhanced generalization and reduced training costs (e.g., up to 37% running-time savings) compared to the vanilla training of large dense transformers with equivalent parameter counts.

⋆ Transformers trained by SMoE-Dropout naturally exhibit a "self-slimmable" property, displaying smooth and consistent performance boosts as activated experts increase during inference or fine-tuning (Figure 1 (blue)). This property enables an "in-situ" trade-off between efficiency and performance at deployment, subject to resource availability.
⋆ Our extensive experiments across representative architectures on a variety of tasks validate the effectiveness of the proposed SMoE-Dropout. Specifically, during pre-training, our approach achieves {1.37, 4.10}, {2.53, 12.44}, and {154.12, 188.00} (×10⁻²) lower BPC than {vanilla dense training (with the same parameter counts), learned SMoE} for Transformer-XL, BERT, and RoBERTa, respectively; after transferring, SMoE-Dropout obtains {0.07%, 1.03%, 0.78%, 1.09%} performance improvements for BERT and {-, 5.88%, 0.07%, 5.04%} for RoBERTa on the {CSQA, ASDiv-A, MAWPS, SVAMP} reasoning tasks, compared to their densely trained counterparts.
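The two ingredients of SMoE-Dropout described above (a frozen, randomly initialized router and a gradually growing number of activated experts) can be sketched as follows. This is a minimal NumPy illustration under our own simplifying assumptions: `scheduled_k` is a hypothetical linear schedule, and the toy linear experts stand in for transformer FFN experts; it is not the paper's reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, total_steps = 8, 16, 1000

# The router is drawn once at initialization and kept frozen:
# it is never updated by gradients, so no routing policy is learned.
frozen_router = rng.normal(size=(n_experts, d))

# toy experts: independent random linear maps (stand-ins for FFN experts)
weights = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda x, W=W: W @ x for W in weights]

def scheduled_k(step, total_steps, n_experts, k_min=2):
    """Hypothetical schedule: linearly grow the number of activated
    experts from k_min up to n_experts over the course of training."""
    frac = step / total_steps
    return min(n_experts, k_min + int(frac * (n_experts - k_min)))

def smoe_dropout_forward(x, step):
    k = scheduled_k(step, total_steps, n_experts)
    logits = frozen_router @ x                 # fixed, input-dependent scores
    topk = np.argsort(logits)[-k:]             # activate the current top-k experts
    gates = np.exp(logits[topk])
    gates /= gates.sum()
    return sum(g * experts[i](x) for g, i in zip(gates, topk))

x = rng.normal(size=d)
y_early = smoe_dropout_forward(x, step=0)      # few experts early in training
y_late = smoe_dropout_forward(x, step=1000)    # full capacity at the end
```

Because the ranking induced by the frozen router is stable, the top-k experts for a given input are nested across different k; at deployment, k can then be chosen freely per the available budget, which is the "self-slimmable" behavior.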



Figure 1: Bits-Per-Character (↓) on the enwik8 test set with a 4-layer Transformer-XL. SMoE-Dropout demonstrates a "self-slimmable" property: inference performance is smoothly boosted as the number of activated parameters increases, whereas learnable SMoEs tend to overfit a certain level of network capacity. Note that only the gray curve is produced by (5) different dense models.



