GRADIENT VACCINE: INVESTIGATING AND IMPROVING MULTI-TASK OPTIMIZATION IN MASSIVELY MULTILINGUAL MODELS

Abstract

Massively multilingual models subsuming tens or even hundreds of languages pose great challenges to multi-task optimization. While it is common practice to apply a language-agnostic procedure optimizing a joint multilingual task objective, how to properly characterize and take advantage of its underlying problem structure for improving optimization efficiency remains under-explored. In this paper, we attempt to peek into the black box of multilingual optimization through the lens of loss function geometry. We find that gradient similarity measured along the optimization trajectory is an important signal, which correlates well not only with language proximity but also with overall model performance. This observation helps us identify a critical limitation of existing gradient-based multi-task learning methods, and we thus derive a simple and scalable optimization procedure, named Gradient Vaccine, which encourages more geometrically aligned parameter updates for close tasks. Empirically, our method obtains significant performance gains on multilingual machine translation and XTREME benchmark tasks for multilingual language models. Our work reveals the importance of properly measuring and utilizing language proximity in multilingual optimization, and has broader implications for multi-task learning beyond multilingual modeling.

1. INTRODUCTION

Modern multilingual methods, such as multilingual language models (Devlin et al., 2018; Lample & Conneau, 2019; Conneau et al., 2019) and multilingual neural machine translation (NMT) (Firat et al., 2016; Johnson et al., 2017; Aharoni et al., 2019; Arivazhagan et al., 2019), have shown success in processing tens or hundreds of languages simultaneously in a single large model. These models are appealing for two reasons: (1) Efficiency: training and deploying a single multilingual model requires far fewer resources than maintaining one model per language; (2) Positive cross-lingual transfer: by transferring knowledge from high-resource languages (HRL), multilingual models are able to improve performance on low-resource languages (LRL) across a wide variety of tasks (Pires et al., 2019; Wu & Dredze, 2019; Siddhant et al., 2020; Hu et al., 2020).

Despite their efficacy, how to properly analyze or improve the optimization procedure of multilingual models remains under-explored. In particular, multilingual models are multi-task learning (MTL) (Ruder, 2017) problems in nature, but the existing literature often trains them in a monolithic manner, naively optimizing a single language-agnostic objective on the concatenated corpus of many languages. Besides ignoring task relatedness and potentially inducing negative interference (Wang et al., 2020b), this approach leaves the optimization process a black box, obscuring both the interaction among different languages during training and the mechanism of cross-lingual transfer.

In this work, we attempt to open the multilingual optimization black box via the analysis of loss geometry. Specifically, we aim to answer the following questions: (1) Do typologically similar languages enjoy more similar loss geometries in the optimization process of multilingual models? (2) If so, in the joint training procedure, do more similar gradient trajectories imply less interference between tasks, and hence better model quality?
(3) Lastly, can we deliberately encourage more geometrically aligned parameter updates to improve multi-task optimization, especially in real-world massively multilingual models trained on heavily noisy and unbalanced data?

Towards this end, we perform a comprehensive study on massively multilingual neural machine translation tasks, where each language pair is treated as a separate task. We first study the correlation between language similarity and loss-geometry similarity, the latter characterized by gradient similarity along the optimization trajectory. We investigate how these quantities evolve throughout training, and glean insights into how they correlate with cross-lingual transfer and joint performance. In particular, our experiments reveal that gradient similarities across tasks correlate strongly with both language proximities and model performance: typologically close languages share similar gradients, which in turn leads to well-aligned multilingual structure (Wu et al., 2019) and successful cross-lingual transfer.

Based on these findings, we identify a major limitation of a popular multi-task learning method (Yu et al., 2020) when applied to multilingual models, and propose a preemptive method, Gradient Vaccine, which leverages task relatedness to set gradient similarity objectives and adaptively aligns task gradients to achieve them. Empirically, our approach obtains significant performance gains over the standard monolithic optimization strategy and popular multi-task baselines on large-scale multilingual NMT models and multilingual language models. To the best of our knowledge, this is the first work to systematically study and improve loss geometry in multilingual optimization at scale.
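To make the idea of "setting a gradient similarity objective and aligning gradients toward it" concrete, the following is a minimal single-pair sketch, not the full method developed later in the paper. The coefficient is chosen so that, after the update, the cosine similarity between the two gradients exactly equals the target; tracking the target with an exponential moving average of observed similarities is one assumed instantiation of "adaptively":

```python
import numpy as np

def gradvac_step(g1, g2, phi_target, beta=0.01):
    """Sketch of one alignment step between two flattened task gradients.

    If the observed cosine similarity phi between g1 and g2 falls below
    the target phi_target, add just enough of g2 to g1 so that
    cos(g1', g2) == phi_target. The target itself is updated with an
    exponential moving average of observed similarities (an assumption
    for illustration, controlled by beta)."""
    phi = np.dot(g1, g2) / (np.linalg.norm(g1) * np.linalg.norm(g2))
    if phi < phi_target:
        # Coefficient derived from the plane spanned by g1 and g2:
        # it raises the component of g1 along g2 until the target is met.
        coef = (np.linalg.norm(g1)
                * (phi_target * np.sqrt(1 - phi ** 2)
                   - phi * np.sqrt(1 - phi_target ** 2))
                / (np.linalg.norm(g2) * np.sqrt(1 - phi_target ** 2)))
        g1 = g1 + coef * g2
    # Adapt the target to the tasks' observed relatedness.
    phi_target = (1 - beta) * phi_target + beta * phi
    return g1, phi_target
```

For orthogonal gradients and a target of 0.5, the step rotates the first gradient toward the second until their cosine similarity reaches exactly 0.5, rather than only de-conflicting them when the similarity turns negative.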

2. INVESTIGATING MULTI-TASK OPTIMIZATION IN MASSIVELY MULTILINGUAL MODELS

While prior work has studied the effect of data (Arivazhagan et al., 2019; Wang et al., 2020a), architecture (Blackwood et al., 2018; Sachan & Neubig, 2018; Vázquez et al., 2019; Escolano et al., 2020), and scale (Huang et al., 2019b; Lepikhin et al., 2020) on multilingual models, their optimization dynamics are not well understood. We hereby perform a series of controlled experiments on massively multilingual NMT models to investigate how gradients interact in multilingual settings and what their impact on model performance is, as existing work hypothesizes that gradient conflicts, defined as negative cosine similarity between gradients, can be detrimental to multi-task learning (Yu et al., 2020) and cause negative transfer (Wang et al., 2019).
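The conflict criterion above reduces to a single sign check on the cosine similarity of flattened gradient vectors, as in this minimal sketch:

```python
import numpy as np

def cosine_similarity(g1, g2):
    """Cosine similarity between two flattened (non-zero) gradient vectors."""
    return float(np.dot(g1, g2) / (np.linalg.norm(g1) * np.linalg.norm(g2)))

def gradients_conflict(g1, g2):
    """Two task gradients 'conflict' when their cosine similarity is negative,
    i.e., a step that lowers one task's loss locally raises the other's."""
    return cosine_similarity(g1, g2) < 0.0
```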

2.1. EXPERIMENTAL SETUP

For training multilingual machine translation models, we mainly follow the setup in Arivazhagan et al. (2019). In particular, we jointly train multiple translation language pairs in a single sequence-to-sequence (seq2seq) model (Sutskever et al., 2014). We use the Transformer-Big (Vaswani et al., 2017) architecture with 375M parameters described in Chen et al. (2018a), where all parameters are shared across language pairs. We use an effective batch size of 500k tokens and utilize data parallelism to train all models over 64 TPUv3 chips. Sentences are encoded using a shared source-target SentencePiece model (Kudo & Richardson, 2018) with a 64k-token vocabulary, and a <2xx> token is prepended to the source sentence to indicate the target language (Johnson et al., 2017). The full training details can be found in Appendix B.

To study real-world multi-task optimization at a massive scale, we use an in-house training corpus[1] (Arivazhagan et al., 2019) generated by crawling and extracting parallel sentences from the web (Uszkoreit et al., 2010), which contains more than 25 billion sentence pairs for 102 languages to and from English. We select 25 languages (50 language pairs pivoted on English), containing over 8 billion sentence pairs, covering 10 diverse language families and 4 different levels of data size (detailed in Appendix A). We then train two models on the two directions separately, namely Any→En and En→Any. Furthermore, to minimize the confounding factor of inconsistent sentence semantics across language pairs, we create a multi-way aligned evaluation set of 3k sentences for all languages[2]. Then, for each checkpoint at an interval of 1000 training steps, we measure pair-wise cosine similarities of the model's gradients on this dataset between all language pairs. We examine gradient similarities at various granularities, from specific layers to the entire model.
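The per-checkpoint measurement described above amounts to computing cosine similarities over all task pairs. The sketch below assumes a hypothetical `task_grads` mapping from language-pair name to a flattened gradient vector (for the whole model or a single layer, evaluated on the multi-way aligned set); it illustrates the bookkeeping, not the actual evaluation pipeline:

```python
import numpy as np

def pairwise_gradient_similarity(task_grads):
    """Compute pair-wise cosine similarities between task gradients.

    task_grads: dict mapping task name (e.g., 'fr-en') to a flattened
    gradient vector at one checkpoint and one granularity (a layer or
    the full model). Returns {(task_a, task_b): cosine_similarity} for
    all unordered pairs, with names sorted within each key."""
    sims = {}
    names = sorted(task_grads)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            ga, gb = task_grads[a], task_grads[b]
            sims[(a, b)] = float(np.dot(ga, gb)
                                 / (np.linalg.norm(ga) * np.linalg.norm(gb)))
    return sims
```

Running this at every checkpoint (here, every 1000 steps) yields a similarity trajectory per language pair, which can then be compared against typological proximity and final model quality.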



[1] We also experiment on the publicly available WMT datasets and obtain similar observations in Appendix C.
[2] In other words, the 3k sentences are semantically identical across all 25 languages.

