UNDERSTANDING MULTI-TASK SCALING IN MACHINE TRANSLATION

Abstract

In this work, we provide a large-scale empirical study of the scaling properties of multilingual (multitask) neural machine translation models. We examine how increases in model size affect model performance and investigate the role of individual task weights in the scaling behavior. We find that these weights only affect the multiplicative factor of the scaling law; in particular, the scaling exponent is unaffected by them. Through a novel joint scaling law formulation, we compute the effective number of parameters allocated to each task and examine the role of language similarity in the scaling behavior of our models. We find minimal evidence that language similarity has any impact. In contrast, the direction of the multilinguality plays a significant role: models translating from multiple languages into English have a larger effective number of parameters per task than their reversed counterparts. Finally, we leverage our observations to predict the performance of multilingual models trained with any language weighting at any scale, greatly reducing the effort required for task balancing in large multitask models. Our findings apply to both in-domain and out-of-domain test sets and to multiple evaluation metrics, such as ChrF and BLEURT.

1. INTRODUCTION

Over the past few years, scaling has emerged as a popular and effective way to improve the performance of neural networks (Brown et al., 2020; Chowdhery et al., 2022; Lepikhin et al., 2020). Given the costs associated with training large state-of-the-art neural models, much work has gone into understanding their scaling properties and predicting the evolution of their performance with scale through scaling laws. Such scaling laws have been instrumental in guiding model development efforts across a variety of domains, such as computer vision (Zhai et al., 2022), language modelling (Kaplan et al., 2020; Hoffmann et al., 2022), and neural machine translation (Ghorbani et al., 2022). Despite these impressive developments, most of the scaling law studies available in the literature to date focus only on single-task models. In contrast, current massive neural models are often trained to solve more than one task across one or more modalities (Chowdhery et al., 2022; Sanh et al., 2022; Reed et al., 2022). This disconnect from the current research frontier limits the applicability of scaling laws in guiding model development decisions. In particular, currently available scaling law studies are unable to inform the decision process on how to balance different tasks effectively at training time. Without such guidance, practitioners often have to rely on cumbersome and costly approaches, such as approximate grid search, to inform their decision-making. Such approaches quickly become infeasible as the problem scale grows.

In this paper, we take the initial step towards developing a quantitative understanding of the scaling behavior of multitask models. We choose multilingual neural machine translation (MNMT) as the setup for this initial study.
This choice is motivated by several reasons. First, MNMT provides a popular setup with mature benchmarks and a substantial literature on scaling (Lepikhin et al., 2020; Costa-jussà et al., 2022; Bapna et al., 2022; Huang et al., 2019). Moreover, recent results on scaling laws for single-task MT models provide a natural starting point for our study (Ghorbani et al., 2022; Bansal et al., 2022; Gordon et al., 2021; Zhang et al., 2022). Finally, recent findings on the optimization dynamics of MNMT models greatly simplify our study by removing the need to examine the role of the optimization algorithm in our results (Xin et al., 2022).
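To make the shape of such a law concrete, the sketch below fits a power law of the form L(N) = β·N^(−α) to loss measurements at several model sizes and extrapolates it to a larger scale. The data here are synthetic and the constants, sizes, and the choice of a two-parameter form (no irreducible loss term) are illustrative assumptions, not the paper's fitted values; the two curves stand in for two task weightings that share a scaling exponent but differ in their multiplicative factor.

```python
import numpy as np

def fit_power_law(sizes, losses):
    """Fit log L = log(beta) - alpha * log(N) by least squares and
    return (alpha, beta). Assumes a pure power law with no floor term."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
    return -slope, np.exp(intercept)

# Synthetic losses for two hypothetical task weightings that share the
# exponent alpha = 0.3 but have different multiplicative factors beta.
sizes = np.array([1e7, 3e7, 1e8, 3e8])   # parameter counts (illustrative)
loss_w1 = 12.0 * sizes ** -0.3           # weighting 1: beta = 12
loss_w2 = 15.0 * sizes ** -0.3           # weighting 2: beta = 15

a1, b1 = fit_power_law(sizes, loss_w1)
a2, b2 = fit_power_law(sizes, loss_w2)
assert abs(a1 - a2) < 1e-6               # same fitted scaling exponent

# Extrapolate weighting 1 to a 1B-parameter model.
pred = b1 * 1e9 ** -a1
```

Fitting in log-log space turns the power law into a line, so ordinary least squares recovers both the exponent (slope) and the multiplicative factor (intercept); extrapolation to unseen scales is then a single evaluation of the fitted curve.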

