EXPLORING ROUTING STRATEGIES FOR MULTILINGUAL MIXTURE-OF-EXPERTS MODELS

Anonymous

Abstract

Sparsely-Gated Mixture-of-Experts (MoE) has been a successful approach for scaling multilingual translation models to billions of parameters without a proportional increase in training computation. These models, however, are prohibitively large for serving deployment, and there is no easy way to extract a sub-network to decode a particular language pair. This work proposes improved strategies to route MoE models by tasks instead of tokens, thus enabling separation of network structures at decoding time while enjoying the benefits of scale and task sharing at training time. We compare routing strategies at multiple levels (token, sentence, task) in both the encoder and the decoder, and conduct extensive experiments on two benchmarks: the public WMT dataset of 30 language pairs and an in-house web-scale dataset of 200 language pairs. On WMT, with a Transformer base model with 32 experts, our task-level MoE outperforms the best-performing token-level MoE model by +1.0 BLEU on average over all language pairs. When scaling up to a Transformer big model with 128 experts on the large-scale massively multilingual benchmark, our task-level MoE is competitive with token-level MoE while reducing the decoder model size by a factor of 32.34 and increasing peak throughput by 2.6 times at inference.

1. INTRODUCTION

Scaling up neural network models has recently received great attention, given the significant quality improvements in a variety of areas such as natural language understanding (Raffel et al., 2019; Brown et al., 2020) and multilingual machine translation (Huang et al., 2019; Lepikhin et al., 2020). While training massive models on large amounts of data can almost guarantee improved quality, two factors affect their practicality and applicability: (1) training efficiency and (2) inference efficiency. Large dense models are often prohibitively compute-intensive to train, with some models requiring TFlops-days of compute (Brown et al., 2020). A recent line of work has proposed sparsely-gated Mixture-of-Experts (MoE) layers as an efficient alternative to dense models (Shazeer et al., 2017; Lepikhin et al., 2020; Riabinin & Gusev, 2020) in order to address these training efficiency limitations. In a vanilla sparsely-gated MoE model, each token of the input sequence activates a different subset of the experts, so the computation cost per token is proportional only to the size of the activated sub-network.

However, such models fail to meet inference efficiency requirements. Consider a long sequence where each token activates a disjoint subset of the available experts. From a practical standpoint, the inference trace of the full sequence spans several experts independently for every token, resulting in an independent pathway per token. Although this is a desired property that adds flexibility to the model and increases its capacity, it becomes prohibitive at inference time for the following reasons. The parameters of these large models exceed the memory limit of a single accelerator and require model parallelism to shard them across a cluster of devices during inference. For models with MoE layers, each input token is dynamically routed to experts allocated on different devices.
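To see why per-token routing is costly at serving time, consider a back-of-the-envelope simulation (our own illustration with arbitrary parameters, using uniform random routing as a stand-in for a learned gate): it counts how many distinct experts a single sequence can touch under independent top-2 assignment per token.

```python
import random

def distinct_experts(seq_len, num_experts, k=2, seed=0):
    """Count how many distinct experts a sequence touches when each token
    is independently routed to k experts (uniformly at random here, as a
    stand-in for a learned gating network)."""
    rng = random.Random(seed)
    touched = set()
    for _ in range(seq_len):
        # each token picks k distinct experts
        touched.update(rng.sample(range(num_experts), k))
    return len(touched)

# Even a modest 64-token sequence routed over 128 experts activates a large
# fraction of them, so most expert shards must be resident (and communicated
# with) to decode even one sentence.
```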
This further adds cross-device communication cost to the overall serving cost. Moreover, due to the sequential nature of autoregressive decoding (Kasai et al., 2020; Chen et al., 2018), the added communication cost from model-parallel decoders is multiplied by the number of decoding steps. In addition, serving MoE models efficiently requires batching a large number of input tokens together; otherwise only a subset of the MoE network is activated, leading to device under-utilization.

In this work, we study the inference efficiency of sparsely-gated MoE models while taking into account the characteristics of the intended application, Multilingual Neural Machine Translation (MNMT). MNMT is an inherently multi-task learning problem, aimed at building a single neural network that translates multiple language pairs simultaneously. In an MNMT model, the extent to which parameters are shared across languages determines the magnitude of positive transfer (Baldwin & Ford, 1988) and, conversely, of task interference due to the capacity bottleneck (Arivazhagan et al., 2019). In an ideal scenario, we would efficiently train a single large MNMT model that maximizes transfer while expanding the capacity bottleneck; at the same time, we would like to enjoy the benefits of sparsely activated sub-networks per task at inference time, i.e., extracting a sub-network from the model to decode a particular language pair.

We propose routing algorithms for MoE models with affordable serving costs. While vanilla MoEs route each sub-word token in the input to its preferred experts, we explore alternative routing strategies that leverage global task-level information to route all tokens corresponding to a particular task collectively to the same set of experts.
While this strategy could be perceived as restrictive for parameter sharing across tasks, we empirically demonstrate that routing based on task boundaries performs better when applied to MNMT. During training, we mix the inputs from different tasks in the same batch in order to learn the routing network and encourage positive transfer among the tasks. During inference, we decode each task separately and load only the subset of experts associated with that task. We compare our method with multilingual baselines and find significant gains on two benchmarks: a multilingual WMT task with comparable inference cost (+3.59 BLEU), described in Section 4, and a large internal dataset (+3.6 BLEU), described in Section 4.3.2. These gains are comparable with those of conventional position-wise Mixture-of-Experts models while our decoders require only a fraction (6.25% and 1.56%) of their serving cost. We discuss the trade-offs of these methods in Section 3.2. In Section 4.3.4, we analyze the routing decisions made in MoE models and motivate our method.
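The contrast between the two routing regimes can be sketched schematically (names, the task-to-expert table, and the gate scores below are illustrative assumptions, not the paper's learned assignments): token-level routing picks experts independently per position, while task-level routing looks up one fixed expert set per language pair, so only that subset needs to be loaded at decoding time.

```python
# Illustrative contrast between token-level and task-level expert selection.
# In practice the task -> experts assignment would be learned during
# training; here it is a hypothetical fixed table.

TASK_EXPERTS = {
    "en-fr": (3, 7),
    "en-de": (1, 7),
    "en-ta": (0, 5),
}

def experts_for_token(token_gate_scores, k=2):
    """Token-level routing: each token independently picks its top-k
    experts, so a batch may touch (almost) every expert."""
    ranked = sorted(range(len(token_gate_scores)),
                    key=lambda e: token_gate_scores[e], reverse=True)
    return tuple(ranked[:k])

def experts_for_task(task):
    """Task-level routing: every token of a language pair shares one
    expert set, so only those experts must be loaded to decode it."""
    return TASK_EXPERTS[task]

# At serving time, decoding "en-de" needs only experts {1, 7}, regardless
# of sequence length or batch contents.
```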

2. SCALING TRANSFORMERS WITH MIXTURE-OF-EXPERTS

The Transformer (Vaswani et al., 2017) architecture is a popular model for neural machine translation and other natural language understanding problems. In sequence-to-sequence problems (of which neural machine translation is one example), the model consists of a separate encoder and decoder, each containing multiple Transformer layers. For further details on Transformers, we refer the reader to the original paper (Vaswani et al., 2017).

We use the Mixture-of-Experts Transformer models of Lepikhin et al. (2020), where each MoE layer consists of E feed-forward networks (FFN_1 ... FFN_E):

FFN_e(x_s) = wo_e · ReLU(wi_e · x_s)

y_s = Σ_{e=1}^{E} G_{s,e} · FFN_e(x_s)

Here, x_s is the input token at position s to the MoE layer, and each FFN_e is a two-layer neural network with a ReLU activation function; wi_e and wo_e are the input and output projection weights of the e-th expert. Finally, G_s is the vector over the E experts computed by the gating network: most of its entries are zero, and the few positive entries select the experts the token is routed to and determine how much each selected expert contributes to the final output y_s. Note that, in this work, we route each token to its top-2 weighted experts to be comparable with prior work. The gating network must be designed carefully for efficiency: (1) the utilization of experts must be balanced and (2) the function must be efficient to implement at scale. For a more thorough discussion of the MoE Transformer, we direct the reader to Lepikhin et al. (2020).
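The MoE layer above can be sketched in a few lines of NumPy. This is a minimal single-token illustration under our own naming and shape conventions, not the sharded production implementation (which also handles load balancing and capacity constraints):

```python
import numpy as np

def moe_layer(x, wi, wo, wg, k=2):
    """Single-token MoE forward pass.
    x:  (d_model,)          input token representation x_s
    wi: (E, d_model, d_ff)  expert input projections wi_e
    wo: (E, d_ff, d_model)  expert output projections wo_e
    wg: (d_model, E)        gating network weights (illustrative)
    """
    E = wi.shape[0]
    logits = x @ wg                        # (E,) gate logits
    # softmax over experts
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # keep only the top-k gate values; zero out the rest (sparse G_s)
    top_k = np.argsort(probs)[-k:]
    gates = np.zeros(E)
    gates[top_k] = probs[top_k]
    # weighted sum of the selected experts' FFN outputs
    y = np.zeros_like(x)
    for e in top_k:
        h = np.maximum(x @ wi[e], 0.0)     # ReLU(wi_e · x_s)
        y += gates[e] * (h @ wo[e])        # G_{s,e} · FFN_e(x_s)
    return y
```

Only the k selected experts are evaluated, which is what keeps the per-token compute proportional to the activated sub-network rather than to all E experts.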

3. METHODS

In this section we describe our candidate routing strategies in the context of MNMT and discuss their trade-offs from the perspective of the training and inference efficiency. It is known that multi-

