UNDERSTANDING MULTI-TASK SCALING IN MACHINE TRANSLATION

Abstract

In this work, we provide a large-scale empirical study of the scaling properties of multilingual (multitask) neural machine translation models. We examine how increases in model size affect model performance and investigate the role of the individual task weights in the scaling behavior. We find that these weights affect only the multiplicative factor of the scaling law; in particular, the scaling exponent is unaffected by them. Through a novel joint scaling law formulation, we compute the effective number of parameters allocated to each task and examine the role of language similarity in the scaling behavior of our models. We find minimal evidence that language similarity has any impact. In contrast, the "direction" of the multilinguality plays a significant role, with models translating from multiple languages into English having a larger number of effective parameters per task than their reversed counterparts. Finally, we leverage our observations to predict the performance of multilingual models trained with any language weighting at any scale, greatly reducing the effort required for task balancing in large multitask models. Our findings apply to both in-domain and out-of-domain test sets and to multiple evaluation metrics, such as ChrF and BLEURT.

1. INTRODUCTION

Over the past few years, scaling has emerged as a popular and effective way to improve the performance of neural networks (Brown et al., 2020; Chowdhery et al., 2022; Lepikhin et al., 2020). Given the costs associated with training large state-of-the-art neural models, much work has gone into understanding their scaling properties and predicting the evolution of their performance with scale through scaling laws. Such scaling laws have been instrumental in guiding model development efforts across a variety of domains such as computer vision (Zhai et al., 2022), language modelling (Kaplan et al., 2020; Hoffmann et al., 2022), and neural machine translation (Ghorbani et al., 2022). Despite these impressive developments, most of the scaling-law studies available in the literature focus only on single-task models. In contrast, current massive neural models are often trained to solve more than one task across one or more modalities (Chowdhery et al., 2022; Sanh et al., 2022; Reed et al., 2022). This disconnect from the current research frontier limits the applicability of scaling laws in guiding model development decisions. In particular, currently available scaling-law studies are unable to inform the decision process on how to balance the different tasks effectively at training time. Without such guidance, practitioners often have to rely on cumbersome and costly approaches, such as approximate grid search, to inform their decision-making. Such approaches quickly become infeasible as the problem scale grows.

In this paper, we take an initial step towards developing a quantitative understanding of the scaling behavior of multitask models. We choose multilingual neural machine translation (MNMT) as the setup for this initial study.
This choice is motivated by several reasons. MNMT provides a popular setup with mature benchmarks and a substantial literature on scaling (Lepikhin et al., 2020; Costa-jussà et al., 2022; Bapna et al., 2022; Huang et al., 2019). Moreover, recent results on scaling laws for single-task MT models provide a natural starting point for our study (Ghorbani et al., 2022; Bansal et al., 2022; Gordon et al., 2021; Zhang et al., 2022). Finally, recent findings on the optimization dynamics of MNMT models greatly simplify our study by removing the need to examine the role of the optimization algorithm in our results (Xin et al., 2022).

For our analysis, we train over 200 MNMT models (ranging from 20M to 1B non-embedding parameters) and systematically examine their scaling behavior. We focus our investigation on the data-rich, compute-rich regime, where we have access to vast amounts of training data for all the tasks (i.e., language pairs)¹ and the model is trained to near convergence. Here, the main bottleneck in model performance is the lack of model capacity. We establish the following observations:

• For each fixed task i and task weighting w, the evolution of the test cross-entropy loss (L) with model size (N) follows a scaling law that resembles the scaling behavior of single-task models: L_i(N; w) ≈ β_{w,i} N^{-α_{w,i}} + L_∞^{(w,i)}. Furthermore, we find that changes in the task weightings affect only the multiplicative factor β; the scaling exponent α and the irreducible loss L_∞ are unaffected by these changes. In other words, scaling a multi-task model improves its performance on a task at the same rate regardless of the task's weight in the optimization objective.
• We leverage these findings to propose a scaling law that jointly predicts the performance for all tasks and weightings considered, and use it to examine how the model splits its capacity between the tasks by computing the effective number of parameters allocated to each task (subsection 3.3).

• We examine the popular belief that training multilingual models on similar languages is more effective than training on unrelated languages. Surprisingly, for the high-resource language pairs considered, we do not observe any significant differences between the scaling behavior of models trained to translate from English into related languages (En→{De, Fr}) and that of models trained on unrelated languages (En→{De, Zh}). In contrast, we observe that models trained to translate from multiple languages into English (XX→En) benefit much more from multitasking than models trained to translate out of English (En→XX).

• In Section 3.4, we use simple approximations to f_i(w) to provide a scaling law that predicts the full task-performance trade-off frontier as a function of the model size N (see Figure 7). We describe how these predictions can be utilized for guiding task balancing in the development of massive models.

¹ Using machine translation terminology, all language pairs are high-resource.
² Following the literature conventions, we only consider the non-embedding layers when computing N.

2. BACKGROUND

NEURAL SCALING LAWS

Recent research suggests that the performance of large neural models is well predicted by a smooth function of the fundamental problem parameters: the model size N,² the size of the training data D, and the amount of compute used for training C (Hestness et al., 2017; Rosenfeld et al., 2019; Kaplan et al., 2020; Hernandez et al., 2021). The most relevant of these studies to ours is Ghorbani et al. (2022), where the authors study the effects of increasing the model size for single-task NMT models in the data-rich (D → ∞), compute-rich (C → ∞) regime. In this setting, the authors show that the following bivariate law describes the scaling behavior of encoder-decoder Transformers:

L(N_e, N_d) = β N_e^{-p_e} N_d^{-p_d} + L_∞.    (2)

Here, N_e and N_d correspond to the number of parameters in the encoder and decoder respectively, and L_∞ corresponds to the irreducible loss associated with the task. {β, p_e, p_d, L_∞} are the parameters of the scaling law that need to be empirically estimated from the data. In addition, Ghorbani et al. (2022) examine the question of optimally allocating parameters between the encoder and the decoder. They show that, in order to observe the optimal scaling behavior, one needs to scale the encoder and the decoder proportionally. Under such a scaling scheme, Equation 2 simplifies to

L(N) = β N^{-α} + L_∞.
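As a concrete illustration, the per-task scaling law and the resulting notion of effective parameters can be sketched in a few lines of Python. All coefficients below (β, α, L_∞) are hypothetical placeholders rather than values fitted in this work; the `effective_params` formula simply equates the reducible losses of a weighted model and a hypothetical single-task model, which is well defined because, per the observation above, the task weighting affects only β.

```python
import math

def task_loss(n_params, beta, alpha, l_inf):
    """Per-task scaling law: L_i(N; w) ~= beta_{w,i} * N^(-alpha_i) + L_inf^(i).

    The task weighting w only moves beta; alpha and l_inf are shared
    across weightings (the paper's first observation)."""
    return beta * n_params ** (-alpha) + l_inf

def effective_params(n_params, beta_w, beta_full, alpha):
    """Parameters a weighted model 'effectively' allocates to a task: the
    N_eff at which a single-task model (beta_full) matches the weighted
    model's reducible loss:
        beta_w * N^(-alpha) = beta_full * N_eff^(-alpha)
        =>  N_eff = N * (beta_full / beta_w)^(1 / alpha)."""
    return n_params * (beta_full / beta_w) ** (1.0 / alpha)

# Hypothetical coefficients for one language pair:
alpha, l_inf = 0.3, 1.2          # shared across task weightings
beta_full = 400.0                # beta when the task gets full capacity (w_i = 1)
beta_half = beta_full * 2 ** alpha  # illustrative beta under a smaller weight

N = 100_000_000  # 100M non-embedding parameters
loss_full = task_loss(N, beta_full, alpha, l_inf)
loss_half = task_loss(N, beta_half, alpha, l_inf)
n_eff = effective_params(N, beta_half, beta_full, alpha)  # here: exactly N / 2
```

With these placeholder numbers the down-weighted model behaves, on this task, like a single-task model of half the size; with fitted coefficients the same two functions would trace out the capacity splits discussed in subsection 3.3.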

