AUTOMOE: NEURAL ARCHITECTURE SEARCH FOR EFFICIENT SPARSELY ACTIVATED TRANSFORMERS

Abstract

Neural architecture search (NAS) has demonstrated promising results in identifying efficient Transformer architectures that outperform manually designed ones for natural language tasks like neural machine translation (NMT). Existing NAS methods operate on a space of dense architectures, where all of the sub-architecture weights are activated for every input. Motivated by recent advances in sparsely activated models like the Mixture-of-Experts (MoE) model, we introduce sparse architectures with conditional computation into the NAS search space. Given this expressive search space, which subsumes prior densely activated architectures, we develop a new framework, AutoMoE, to search for efficient sparsely activated sub-Transformers. AutoMoE-generated sparse models obtain (i) 4× FLOPs reduction over manually designed dense Transformers and (ii) 23% FLOPs and 10% latency reduction over state-of-the-art NAS-generated dense sub-Transformers, with parity in BLEU score on benchmark datasets for NMT. AutoMoE consists of three training phases: (a) heterogeneous search space design with dense and sparsely activated Transformer modules (e.g., how many experts? where to place them? what should be their sizes?); (b) SuperNet training, which jointly trains several subnetworks sampled from the large search space via weight sharing; (c) evolutionary search for the architecture with the optimal trade-off between task performance and computational metrics like FLOPs and latency.

1. INTRODUCTION

Transformers have demonstrated state-of-the-art performance in several tasks, but the larger their size, the more difficult it is to use them in resource-constrained settings (Strubell et al., 2019). Recent works in neural architecture search (NAS) (Wang et al., 2020; Xu et al., 2022a; 2021; So et al., 2021; Xu et al., 2022b; Javaheripi et al., 2022) have focused on identifying computationally efficient sub-Transformers that are easier to deploy on edge devices. However, existing works on NAS only focus on the subspace of dense Transformer architectures, where all the network weights are activated for every input. In contrast to such dense models, sparsely activated ones like the Mixture-of-Experts (Fedus et al., 2022b) perform conditional computation, in which only a subset of the network's weights are activated per input. Selective compute allows us to design neural networks with a large number of model parameters without a significant increase in computational cost. With this increased capacity, sparse models have demonstrated state-of-the-art performance in natural language tasks such as neural machine translation (NMT) (Kim et al., 2021; Kudugunta et al., 2021; Zuo et al., 2022). The goal of this work is to explore the space of sparsely activated MoE architectures for NAS in order to identify computationally efficient sparse sub-Transformers. Incorporating MoE architectures in the search space requires several design choices. (a) Expert placement: identifying the Transformer layers in which to introduce expert sub-networks. (b) Number of experts: how many experts to introduce in different layers? (c) Expert FFN size: what should be the feedforward network (FFN) size for each expert? Given the large search space of potential architectures and the exorbitant computational cost of training and evaluating them, existing approaches manually design MoE architectures within a highly restricted homogeneous space.
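The three design choices above can be made concrete as a per-layer configuration space. The sketch below is illustrative only (the value ranges, class names, and sampling scheme are our assumptions, not the paper's implementation): each encoder/decoder layer independently chooses its number of experts and a possibly different FFN size per expert.

```python
import random
from dataclasses import dataclass

# Illustrative choice sets; actual ranges in the paper's search space may differ.
NUM_EXPERTS_CHOICES = [1, 2, 4]          # 1 expert reduces to a dense FFN layer
EXPERT_FFN_CHOICES = [1024, 2048, 3072]  # per-expert intermediate FFN dimension

@dataclass
class LayerConfig:
    num_experts: int
    expert_ffn_dims: list  # one size per expert, so experts can be heterogeneous

def sample_layer(rng: random.Random) -> LayerConfig:
    n = rng.choice(NUM_EXPERTS_CHOICES)
    return LayerConfig(n, [rng.choice(EXPERT_FFN_CHOICES) for _ in range(n)])

def sample_architecture(num_enc: int, num_dec: int, seed: int = 0) -> dict:
    """Draw one random sparse sub-Transformer from the heterogeneous space."""
    rng = random.Random(seed)
    return {
        "encoder": [sample_layer(rng) for _ in range(num_enc)],
        "decoder": [sample_layer(rng) for _ in range(num_dec)],
    }
```

Because placement, count, and size vary per layer, the space covers both homogeneous manual designs (e.g., 4 experts in every other layer) and fully heterogeneous ones as special cases.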
For instance, existing designs use the same number of experts of the same capacity in different layers and make ad-hoc decisions like introducing experts in every other layer (Fedus et al., 2022b; Kim et al., 2021; Zuo et al., 2022; Du et al., 2022; Artetxe et al., 2021) or every fourth layer (Zoph et al., 2022). These design choices are not necessarily optimal. The decoder should be lighter than the encoder for auto-regressive NMT tasks due to the cumulative latency of generating tokens one at a time (Liu et al., 2020; Kasai et al., 2021). This impacts the design choices for the number of decoder layers and the number of experts to use in each. For instance, the loss of capacity from decoder layer reduction can be compensated for by adding experts to the remaining layers. On the encoder side, a vanilla placement of the maximum allowable experts in each layer results in increased latency from expert communication and activation, even though theoretical FLOPs can remain unaffected. These observations suggest that the optimal MoE design could be heterogeneous when resources like latency or FLOPs are constrained. In a recent review of sparsely activated models, Fedus et al. (2022a) note that the optimal hyperparameters depend on application and resource specifications, and that a systematic simulation of the compute, memory and communication cost can aid practitioners in quickly determining optimal settings without costly trial-and-error launches. AutoMoE provides such a framework to identify optimal hyperparameter configurations for sparse models under computational constraints. The above observations are depicted in Table 1, which shows demonstrative examples of manually designed architectures vs. those found by our AutoMoE framework from the search space. We compare these architectures against various computational metrics (e.g., latency, FLOPs, active MoE parameters), architectural configurations and task performance.
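The cost-simulation idea advocated by Fedus et al. (2022a) can be illustrated with a back-of-the-envelope estimator. The sketch below is our own simplification, not the paper's simulator: it assumes top-1 routing and a fixed per-expert FFN size, and counts only FFN multiply-adds. It makes the point from the encoder discussion above explicit: with top-1 routing, per-token active FLOPs are independent of the expert count, so adding experts raises latency and memory but not theoretical compute.

```python
def ffn_flops_per_token(d_model: int, d_ff: int) -> int:
    # Two dense projections (d_model -> d_ff -> d_model); counting a
    # multiply-add as 2 FLOPs gives ~4 * d_model * d_ff per token.
    return 4 * d_model * d_ff

def active_moe_flops_per_token(expert_layout: str, d_model: int = 512,
                               d_ff: int = 2048) -> int:
    """Active FFN FLOPs per token for a hyphen-separated layout like "1-4-1-4-1-4".

    With top-1 routing, each token activates exactly one expert per layer,
    so the active FFN compute depends only on the number of layers and the
    per-expert FFN size, not on how many experts each layer hosts.
    """
    num_layers = len(expert_layout.split("-"))
    return num_layers * ffn_flops_per_token(d_model, d_ff)
```

Under these assumptions, "1-4-1-4-1-4" and "1-1-1-1-1-1" cost the same active FLOPs per token; real savings must therefore come from fewer or thinner layers and experts, which is exactly what the heterogeneous search exploits.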
Model                  Encoder experts  Decoder experts  BLEU   Latency  Active params  FLOPs (G)
Manually designed MoE  1-4-1-4-1-4      1-4-1-4-1-4      28.48  506ms    56M            3.4
AutoMoE (4 Experts)    2-4-1-1-3-1      1-1-1-1-1        28.22  239ms    49M            3.1
AutoMoE (4 Experts)    1-1-4-4-4-1      4-1-1-1          28.15  194ms    22M            2.9

Table 1: Manually designed vs. AutoMoE-searched architectures for a 6-layer encoder-decoder Transformer. We report various computational footprint metrics (measured on 1 V100 GPU) and BLEU scores of sparse expert models on the WMT'14 En-De machine translation task. The number of experts per layer is shown hyphen-separated (-) for the encoder and decoder.

Novelty: To the best of our knowledge, AutoMoE introduces the first end-to-end framework to automatically design efficient MoE models under resource constraints. AutoMoE is also the first MoE framework to support adaptive computation via heterogeneous experts, where input tokens are routed to experts of different sizes. With these desiderata, we develop AutoMoE with the following components and contributions:

• We introduce a heterogeneous search space for Transformers consisting of a variable number, FFN size, and placement of experts in both the encoder and decoder, along with a variable number of layers, attention heads, and intermediate FFN dimensions of standard Transformer modules.

• We extend SuperNet training to this new search space, which combines all possible sparse architectures into a single graph and jointly trains them via weight sharing, yielding a reduced amortized training cost.

• We use an evolutionary algorithm to search for the optimal sparse architecture from the SuperNet with the best possible performance on a downstream task (e.g., BLEU score for NMT tasks) while satisfying a user-specified computational constraint.

• Experiments on several NMT benchmarks demonstrate that AutoMoE-searched sparse models obtain (i) 4× FLOPs reduction over manually designed dense Transformers and (ii) 23% FLOPs and 10% latency reduction over state-of-the-art NAS-generated dense sub-Transformers with comparable BLEU scores.
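The last two contributions can be sketched together as a constraint-aware evolutionary loop. The code below is a generic illustration of this search pattern under our own assumptions; function names and the mutation strategy are placeholders, not the paper's API. `fitness_fn` stands in for a task-quality proxy scored via the trained SuperNet (higher is better), and `cost_fn` for a hardware metric such as FLOPs or latency that must stay within `budget`.

```python
import random

def evolutionary_search(sample_fn, mutate_fn, fitness_fn, cost_fn,
                        budget, iterations=200, pop_size=16, seed=0):
    """Evolutionary search for the best candidate satisfying cost_fn <= budget."""
    rng = random.Random(seed)
    pop = []
    while len(pop) < pop_size:                    # feasible random initialization
        cand = sample_fn(rng)
        if cost_fn(cand) <= budget:
            pop.append((fitness_fn(cand), cand))
    for _ in range(iterations):
        parent = max(pop, key=lambda p: p[0])[1]  # exploit the current best
        child = mutate_fn(parent, rng)
        if cost_fn(child) <= budget:              # reject over-budget mutants
            pop.append((fitness_fn(child), child))
            pop.sort(key=lambda p: p[0], reverse=True)
            pop = pop[:pop_size]                  # keep the top candidates
    return max(pop, key=lambda p: p[0])[1]
```

Because candidates are scored with the weight-shared SuperNet rather than trained from scratch, each fitness evaluation is cheap, which is what makes searching the large heterogeneous space tractable.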



Terminologies: (1) Dense architectures refer to fully activated networks for every input. (2) Sparse architectures refer to sparsely activated ones with conditional computation per input. (3) Optimal architectures refer to Pareto-optimal ones with the best trade-off between task performance and computational metrics.

