SHARE OR NOT? LEARNING TO SCHEDULE LANGUAGE-SPECIFIC CAPACITY FOR MULTILINGUAL TRANSLATION

Abstract

Using a mix of shared and language-specific (LS) parameters has shown promise in multilingual neural machine translation (MNMT), but the question of when and where LS capacity matters most is still under-studied. We offer such a study by proposing conditional language-specific routing (CLSR). CLSR employs hard binary gates conditioned on token representations to dynamically select LS or shared paths. By manipulating these gates, CLSR can schedule LS capacity across sub-layers in MNMT, guided by translation signals and budget constraints. Moreover, CLSR scales easily to massively multilingual settings. Experiments with Transformer on OPUS-100 and WMT datasets show that: 1) MNMT is sensitive to both the amount and the position of LS modeling: distributing 10%-30% of LS computation to the top and/or bottom encoder/decoder layers delivers the best performance; and 2) one-to-many translation benefits more from CLSR than many-to-one translation, particularly with unbalanced training data. Our study further verifies the trade-off between shared and LS capacity for multilingual translation, and our findings lay the foundation for improved multilingual Transformers.
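To make the gating mechanism concrete, the following is a minimal NumPy sketch of the idea described above: each token representation produces a scalar gate that, when binarized, routes the token through either a shared projection or a language-specific one. All parameter names (`W_shared`, `W_ls`, `w_gate`) and the single-linear-layer form of each path are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy hidden size

# Hypothetical parameters: one shared projection, one projection per
# language, and a gating vector (names are illustrative only).
W_shared = rng.standard_normal((d, d)) * 0.1
W_ls = {"de": rng.standard_normal((d, d)) * 0.1,
        "fr": rng.standard_normal((d, d)) * 0.1}
w_gate = rng.standard_normal(d) * 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def clsr_sublayer(h, lang, hard=True):
    """Mix shared and language-specific paths per token.

    h: (seq_len, d) token representations for one language `lang`.
    A scalar gate in [0, 1] is computed per token; with `hard=True`
    it is binarized, so each token takes exactly one of the two paths.
    """
    g = sigmoid(h @ w_gate)                 # (seq_len,) soft gates
    if hard:
        g = (g > 0.5).astype(h.dtype)       # hard binary routing
    out_shared = h @ W_shared               # shared path
    out_ls = h @ W_ls[lang]                 # language-specific path
    return g[:, None] * out_ls + (1.0 - g)[:, None] * out_shared

h = rng.standard_normal((3, d))
out = clsr_sublayer(h, "de")
print(out.shape)  # (3, 4)
```

In training, such hard gates would need a differentiable surrogate (e.g. a straight-through estimator), and a budget term would penalize the fraction of tokens routed to the LS path; both are omitted here for brevity.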

1. INTRODUCTION

Model architecture design injects inductive biases into neural network layouts, allowing a learning algorithm to favor certain representations over others, independent of the observed data (Mitchell, 1980). In multilingual neural machine translation (MNMT), where the learning objective is commonly cast as a multi-task learning problem (Firat et al., 2016a; Ha et al., 2016; Johnson et al., 2017), the inductive bias researchers usually study is which components of the neural network to share across tasks (languages) and which to leave task- or language-specific. These components can be entire layer stacks, individual layers, or even sub-layers (Sachan & Neubig, 2018; Blackwood et al., 2018; Wang et al., 2019; Zhu et al., 2020). Noticeably, the search space of which parameters to share, and at which granularity, grows rapidly as neural networks become larger or the number of tasks (languages) increases. This rapid expansion prevents us from exhaustively exploring the choice of sharing patterns in MNMT. The inability to explore the full space motivates methods relying on heuristics (Sachan & Neubig, 2018), which lack flexibility as more languages are covered, or on meta-learning (Platanios et al., 2018), which is often hard to scale. These limitations hinder their generalization to large-scale multilingual models, which are the very focus of our study. In large-scale multilingual models, also known as massively multilingual models (Aharoni et al., 2019; Arivazhagan et al., 2019; Zhang et al., 2020b), hundreds of languages with varying amounts of training data, difficulty, and linguistic properties are jointly trained in a multi-task setup. While the joint training enables positive

* Work done while Biao Zhang was interning at Google Research.

Source code and models are available at https://github.com/bzhangGo/zero/

