SCOMOE: EFFICIENT MIXTURES OF EXPERTS WITH STRUCTURED COMMUNICATION

Abstract

Mixture-of-Experts (MoE) models are promising architectures for massively multilingual neural machine translation and large language models due to the advantage of sublinear scaling. However, the training of large MoE models is usually bottlenecked by the all-to-all communication (Lepikhin et al., 2020). To reduce the communication cost, we propose SCoMoE, an MoE architecture with structured all-to-all communication, inspired by the hierarchical structure of the communication topology. SCoMoE encourages data to be communicated across devices through the fast intra-accelerator/node communication channels, reducing the communication volume over the slow inter-node channels. Concretely, we slice the data along the sequence dimension (SCoMoE-Seq) into three communication groups, and project the data along the feature dimension (SCoMoE-Feat) into low-dimensional representations. To compensate for the potential performance drop caused by the routing locality in SCoMoE, we further propose a token clustering approach that aggregates related tokens from different devices before the MoE layers. We also substitute the sigmoid gating in the balanced router used in the token clustering with softmax gating with differentiable sorting. Experiments on bilingual and massively multilingual machine translation demonstrate that SCoMoE achieves a speedup of 1.44x over GShard with comparable performance, and substantially outperforms GShard (+2.8 BLEU) on OPUS-100 with a speedup of 1.25x. Code is available at https://github.com/ZhiYuanZeng/fairseq-moe.
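The sequence-dimension slicing summarized above can be illustrated with a minimal sketch: tokens are partitioned into three communication groups (intra-accelerator, intra-node, inter-node), with larger shares assigned to the faster channels. The function name and the group ratios below are illustrative assumptions, not the paper's actual configuration, and the sketch omits the distributed all-to-all itself.

```python
import numpy as np

def split_sequence_groups(tokens, ratios=(0.5, 0.3, 0.2)):
    """Slice tokens along the sequence dimension into three communication
    groups. `ratios` is a hypothetical allocation that sends most tokens
    through the fastest (intra-accelerator) channel and fewest through the
    slowest (inter-node) channel; the real proportions are tuned."""
    n = tokens.shape[0]
    # Cut points for the first two groups; the last group takes the rest.
    cuts = np.cumsum([int(r * n) for r in ratios[:-1]])
    intra_accel, intra_node, inter_node = np.split(tokens, cuts)
    return intra_accel, intra_node, inter_node

tokens = np.arange(10 * 4).reshape(10, 4)  # (seq_len, hidden)
a, b, c = split_sequence_groups(tokens)
```

In an actual MoE layer, each group would then be dispatched with a separate all-to-all restricted to its level of the topology, so only the smallest group crosses the slow inter-node links.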

1. INTRODUCTION

Recent years have witnessed substantial interest in exploring sparse architectures based on Mixture of Experts for training massively multilingual machine translation (Lepikhin et al., 2020; Kim et al., 2021) and large language models (Fedus et al., 2021; Zhang et al., 2021b; Ma et al., 2022; Du et al., 2021; Zoph et al., 2022; Rajbhandari et al., 2022; Lin et al., 2021). Experts of MoE models are distributed over multiple devices. Due to the sparse architecture, where only a small subset of experts is selected to process each input, the number of experts, and hence the scale of MoE models, can be very large while the computational cost grows only sublinearly with the number of parameters. Despite the advantage of efficient computation, MoE models require expensive all-to-all communication to send the inputs and outputs of experts across the compute network. A previous study on GShard (Lepikhin et al., 2020) has shown that as MoE models scale, the all-to-all communication cost becomes the bottleneck for training. To mitigate this issue, we propose Structured Communication based MoE (SCoMoE), which treats the all-to-all communication in a structured way rather than equally across different devices. The motivation behind SCoMoE is that the network bandwidth differs across the compute network: the bandwidth inside an accelerator (intra-accelerator) is higher than that across accelerators, and the bandwidth inside a node (intra-node) is higher than that across nodes (inter-node). Figure 1a visualizes the hierarchical structure of the communication topology with a 9 × 9 matrix, where different levels of communication are shown in different colors. We view the data flow in



† Work was done while the author was interning at GTCOM. * Corresponding author.

