AUTOMOE: NEURAL ARCHITECTURE SEARCH FOR EFFICIENT SPARSELY ACTIVATED TRANSFORMERS

Abstract

Neural architecture search (NAS) has demonstrated promising results in identifying efficient Transformer architectures that outperform manually designed ones on natural language tasks such as neural machine translation (NMT). Existing NAS methods operate on a space of dense architectures, where all of the sub-architecture weights are activated for every input. Motivated by recent advances in sparsely activated models such as the Mixture-of-Experts (MoE), we introduce sparse architectures with conditional computation into the NAS search space. Given this expressive search space, which subsumes prior densely activated architectures, we develop AutoMoE, a new framework to search for efficient sparsely activated sub-Transformers. AutoMoE-generated sparse models obtain (i) 4× FLOPs reduction over manually designed dense Transformers and (ii) 23% FLOPs and 10% latency reduction over state-of-the-art NAS-generated dense sub-Transformers, with parity in BLEU score on benchmark datasets for NMT. AutoMoE consists of three training phases: (a) heterogeneous search space design with dense and sparsely activated Transformer modules (e.g., how many experts? where to place them? what should their sizes be?); (b) SuperNet training that jointly trains several subnetworks sampled from the large search space via weight-sharing; and (c) evolutionary search for the architecture with the optimal trade-off between task performance and computational metrics like FLOPs and latency.

1. INTRODUCTION

Transformers have demonstrated state-of-the-art performance in several tasks, but the larger their size, the more difficult it is to use them in resource-constrained settings (Strubell et al., 2019). Recent works in neural architecture search (NAS) (Wang et al., 2020; Xu et al., 2022a; 2021; So et al., 2021; Xu et al., 2022b; Javaheripi et al., 2022) have focused on identifying computationally efficient sub-Transformers that are easier to deploy on edge devices. However, existing works on NAS only focus on the subspace of dense Transformer architectures, where all the network weights are activated for every input. In contrast to these dense models, sparsely activated ones like the Mixture-of-Experts (Fedus et al., 2022b) perform conditional computation, in which only a subset of the network weights is activated per input. Selective computation allows us to design neural networks with a large number of model parameters without a significant increase in computational cost. With increased capacity, these sparse models have demonstrated state-of-the-art performance in natural language tasks such as neural machine translation (NMT) (Kim et al., 2021; Kudugunta et al., 2021; Zuo et al., 2022). The goal of this work is to explore the space of sparsely activated MoE architectures for NAS to identify computationally efficient sparse sub-Transformers. Incorporating MoE architectures in the search space requires one to make several design choices.
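To make the conditional-computation idea concrete, the following is a minimal NumPy sketch of a top-1 MoE feed-forward layer: each token is routed to a single expert, so only that expert's weights are activated for that token, even though the layer's total parameter count grows with the number of experts. All shapes, sizes, and the random initialization are illustrative assumptions, not the AutoMoE implementation; note the per-expert FFN sizes are allowed to differ, mirroring the heterogeneous search space discussed later.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts = 8, 4
d_ffn = [16, 16, 32, 32]  # heterogeneous per-expert FFN sizes (assumed values)

# Router and expert weights (illustrative random initialization).
W_router = rng.standard_normal((d_model, n_experts))
experts = [
    (rng.standard_normal((d_model, d)), rng.standard_normal((d, d_model)))
    for d in d_ffn
]

def moe_ffn(x):
    """Route each token to its top-1 expert; only that expert's FFN runs."""
    logits = x @ W_router                      # (tokens, n_experts)
    choice = logits.argmax(axis=-1)            # top-1 expert index per token
    out = np.zeros_like(x)
    for e, (W1, W2) in enumerate(experts):
        mask = choice == e
        if mask.any():
            h = np.maximum(x[mask] @ W1, 0.0)  # ReLU feed-forward
            out[mask] = h @ W2
    return out, choice

tokens = rng.standard_normal((5, d_model))
y, choice = moe_ffn(tokens)
print(y.shape, choice)
```

Because each token touches exactly one expert FFN, the per-token compute matches a single dense FFN of the chosen expert's size, while capacity scales with the number of experts.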

Terminologies: (1) Dense architectures refer to fully activated networks, where all weights are used for every input. (2) Sparse architectures refer to sparsely activated networks with conditional computation per input. (3) Optimal architectures refer to Pareto-optimal ones with the best trade-off between task performance and computational metrics.
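The Pareto-optimality criterion in (3) can be sketched as follows: an architecture is kept only if no other candidate is at least as good on both axes (higher BLEU, lower FLOPs) and strictly better on one. The candidate names and numbers below are hypothetical, not results from the paper.

```python
def pareto_front(candidates):
    """Return names of candidates not dominated in both BLEU (higher is
    better) and FLOPs (lower is better) by any other candidate."""
    front = []
    for name, bleu, flops in candidates:
        dominated = any(
            b >= bleu and f <= flops and (b > bleu or f < flops)
            for _, b, f in candidates
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical (architecture, BLEU, FLOPs) measurements.
candidates = [
    ("arch_a", 27.1, 2.0e9),
    ("arch_b", 27.1, 1.5e9),   # dominates arch_a: same BLEU, fewer FLOPs
    ("arch_c", 26.0, 1.0e9),
    ("arch_d", 25.5, 1.2e9),   # dominated by arch_c on both axes
]
print(pareto_front(candidates))  # → ['arch_b', 'arch_c']
```

The evolutionary search phase mentioned in the abstract selects among sampled sub-Transformers using exactly this kind of dominance test over task and efficiency metrics.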

(a) Expert placement: identifying the Transformer layers in which to introduce expert sub-networks. (b) Number of experts: how many experts to introduce in each layer? (c) Expert FFN size: what should the feed-forward network (FFN) size of each expert be? Given the large search space of

