GSHARD: SCALING GIANT MODELS WITH CONDITIONAL COMPUTATION AND AUTOMATIC SHARDING

Abstract

Neural network scaling has been critical for improving model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is widely held to be a reliable path to better model quality, it faces challenges such as computation cost, ease of programming, and efficient implementation on parallel devices. In this paper we demonstrate conditional computation as a remedy to these impediments, and show its efficacy and utility. We make extensive use of GShard, a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler, to enable large-scale models with up to trillions of parameters. GShard and conditional computation enable us to scale up a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts layers. We demonstrate that such a giant model with 600 billion parameters can be trained efficiently on 2048 TPU v3 cores in 4 days, achieving far superior quality for translation from 100 languages to English compared to the prior art.

1. INTRODUCTION

Scaling neural networks brings dramatic quality gains over a wide array of machine learning problems such as computer vision, language understanding, and neural machine translation (Devlin et al., 2018; Mahajan et al., 2018; Arivazhagan et al., 2019; Huang et al., 2019; Brown et al., 2020b). This general tendency has motivated recent studies to scrutinize the factors playing a critical role in the success of scaling, including the amount of training data, the model size, and the computation being utilized (Advani & Saxe, 2017; Hestness et al., 2019; Geiger et al., 2020). While the final model quality was found to have a power-law relationship with these factors (Hestness et al., 2017; Kaplan et al., 2020), the significant quality gains brought by larger models also came with various practical challenges. Training efficiency, which we define as the amount of compute and time used to achieve a model quality superior to the best existing system, is oftentimes left out of consideration. In this study, we strive to improve model quality while remaining training-efficient. We built a 600-billion-parameter sequence-to-sequence Transformer model with Sparsely-Gated Mixture-of-Experts layers, which enjoys sub-linear computation cost and O(1) compilation time. We trained this model with 2048 TPU v3 devices for 4 days on a multilingual machine translation task and achieved far superior translation quality compared to the prior art when translating 100 languages to English with a single non-ensemble model. We conducted experiments with various model sizes and found that translation quality increases as the model gets bigger, yet the total wall-clock training time increases only sub-linearly with respect to the model size, as illustrated in Figure 1. To train such an extremely large model, we relied on the following key design choices.
Shazeer et al. (2017) showed that scaling RNN model capacity by adding Sparsely-Gated Mixture-of-Experts (MoE) layers makes it possible to achieve improved results at sub-linear cost. We therefore present our approach to extending the Transformer architecture with MoE layers in this study.

GShard Annotation Second, the model description should be separated from the partitioning implementation and optimization. This separation of concerns lets model developers focus on the network architecture and flexibly change the partitioning strategy, while the underlying system applies semantic-preserving transformations and implements efficient parallel execution. To this end we propose a module, GShard, which only requires the user to annotate a few critical tensors in the model with partitioning policies. It consists of a set of simple APIs for annotations, and a compiler extension in XLA for automatic parallelization. Model developers write models as if there were a single device with huge memory and computation capacity, and the compiler automatically partitions the computation for the target platform based on the user annotations and its own heuristics.
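The annotation APIs themselves are not reproduced in this excerpt; the following is a minimal sketch of the annotate-then-partition idea, assuming a hypothetical `split` annotation (the name and signature are illustrative, not GShard's actual API). The model author writes ordinary single-device code on the full tensor, and only the annotation tells the system how to shard it; here the partitioning is simulated eagerly with NumPy slicing.

```python
import numpy as np

def split(tensor, axis, num_partitions):
    """Hypothetical GShard-style annotation: mark `tensor` to be
    partitioned along `axis` across `num_partitions` devices.
    In a real system the compiler would do this; here we simply
    simulate the resulting per-device shards with NumPy."""
    return np.split(tensor, num_partitions, axis=axis)

# Single-device view: one [8, 4] weight matrix.
weights = np.arange(32, dtype=np.float32).reshape(8, 4)

# Annotated view: 4 shards of shape [2, 4], one per (simulated) device.
shards = split(weights, axis=0, num_partitions=4)
```

The key property the sketch illustrates is that the annotation is semantics-preserving: concatenating the shards recovers the original tensor, so the partitioned program computes the same function as the single-device one.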

2. MODEL

The Transformer (Vaswani et al., 2017) architecture has been widely used for natural language processing. We scale Transformer with conditional computation by replacing every other feed-forward layer with a sparsely activated Position-wise Mixture-of-Experts (MoE) layer (Shazeer et al., 2017), with a variant of top-2 gating in both the encoder and the decoder (Figure 2). Each subword token in the training example activates a sub-network of the MoE Transformer during both training and inference. The size of the sub-network is roughly independent of the number of experts per MoE layer, allowing sublinear scaling of the computation cost.
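The interleaving pattern described above can be sketched as follows. The layer names are illustrative stand-ins, not the paper's actual module names; the point is only that the feed-forward sublayer of every second Transformer block becomes an MoE layer.

```python
def build_layer_types(num_blocks):
    """Return the sublayer sequence for a stack of Transformer blocks
    in which every other feed-forward sublayer is an MoE layer.
    Names ("attention", "dense_ffn", "moe_ffn") are illustrative."""
    layers = []
    for i in range(num_blocks):
        layers.append("attention")
        # Odd-indexed blocks get the sparsely activated MoE feed-forward.
        layers.append("moe_ffn" if i % 2 == 1 else "dense_ffn")
    return layers
```

For a 4-block stack this yields an alternation of dense and MoE feed-forward sublayers, so half the blocks carry the expert capacity while per-token compute stays roughly constant.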

2.1. POSITION-WISE MIXTURE-OF-EXPERTS LAYER

The Mixture-of-Experts (MoE) layers used in our model differ from those of Shazeer et al. (2017) in the sparse gating function and the auxiliary loss being used. A MoE layer for Transformer consists of $E$ feed-forward networks $\mathrm{FFN}_1, \dots, \mathrm{FFN}_E$, each of which outputs $\mathrm{FFN}_e(x_s) = wo_e \cdot \mathrm{ReLU}(wi_e \cdot x_s)$, where $x_s$ is the input token to the MoE layer, and $wi_e$ and $wo_e$ are the input and output projection matrices for the feed-forward layer (an expert), with shapes $[M, H]$ and $[H, M]$, respectively. The output of a MoE layer is the combination of the expert outputs, $\sum_{e=1}^{E} \mathcal{G}_{s,e} \cdot \mathrm{FFN}_e(x_s)$, where the vector $\mathcal{G}_{s,E}$ is computed by a gating function $\mathrm{GATE}(\cdot)$. We choose to let each token be dispatched to at most two experts. The corresponding gating entries $\mathcal{G}_{s,e}$ become non-zero, representing how much an expert contributes to the final network output.



Figure 1: Multilingual translation quality (average ∆BLEU compared to bilingual baselines) improves as the MoE model size grows up to 600B parameters, while the end-to-end training cost (in TPU v3 core-years) increases only sublinearly. Increasing the model size from 37.5B to 600B parameters (16x) increases the computation cost from 6 to 22 core-years (3.6x). The 600B-parameter model that achieved the best translation quality was trained with 2048 TPU v3 cores for 4 days, a total cost of 22 TPU v3 core-years. In contrast, training all 100 bilingual baseline models would have required 29 TPU v3 core-years. Our best-quality dense single Transformer model (2.3B parameters), achieving a ∆BLEU of 6.1, was trained with GPipe for a total of 235.5 TPU v3 core-years.

