SHARE OR NOT? LEARNING TO SCHEDULE LANGUAGE-SPECIFIC CAPACITY FOR MULTILINGUAL TRANSLATION

Abstract

Using a mix of shared and language-specific (LS) parameters has shown promise in multilingual neural machine translation (MNMT), but the question of when and where LS capacity matters most is still under-studied. We offer such a study by proposing conditional language-specific routing (CLSR). CLSR employs hard binary gates conditioned on token representations to dynamically select LS or shared paths. By manipulating these gates, it can schedule LS capacity across sub-layers in MNMT, subject to the guidance of translation signals and budget constraints. Moreover, CLSR scales easily to massively multilingual settings. Experiments with Transformer on the OPUS-100 and WMT datasets show that: 1) MNMT is sensitive to both the amount and the position of LS modeling: distributing 10%-30% of LS computation to the top and/or bottom encoder/decoder layers delivers the best performance; and 2) one-to-many translation benefits more from CLSR than many-to-one translation, particularly with unbalanced training data. Our study further verifies the trade-off between shared and LS capacity in multilingual translation, and we corroborate our analysis by building improved multilingual Transformers on top of these findings.

1. INTRODUCTION

Model architecture design injects inductive biases into neural network layouts, allowing a learning algorithm to favor certain representations over others, independent of the observed data (Mitchell, 1980). In multilingual neural machine translation (MNMT), where the learning objective is commonly cast as a multi-task learning problem (Firat et al., 2016a; Ha et al., 2016; Johnson et al., 2017), the inductive bias researchers usually study is which components of the neural network to share between tasks (languages), and which components to leave specific to the task or language. These components can be entire layer stacks, individual layers or even sub-layers (Sachan & Neubig, 2018; Blackwood et al., 2018; Wang et al., 2019; Zhu et al., 2020). Noticeably, the search space of which parameters to share and at which granularity grows rapidly as we make neural networks larger or increase the number of tasks (languages). This rapid expansion of the search space prevents us from exhaustively exploring the choice of sharing patterns in MNMT. The inability to explore the full space motivates methods relying on heuristics (Sachan & Neubig, 2018), which lack flexibility when more languages are covered, or on meta-learning (Platanios et al., 2018), which is often hard to scale. These limitations hinder their generalization to large-scale multilingual models, which are the very focus of our study. In large-scale multilingual models, also known as massively multilingual models (Aharoni et al., 2019; Arivazhagan et al., 2019; Zhang et al., 2020b), hundreds of languages with varying amounts of training data, difficulty and linguistic properties are trained jointly in a multi-task setup.
While joint training enables positive transfer across languages, it also introduces task interference between dissimilar languages (Arivazhagan et al., 2019; Wang et al., 2020a; b), and a capacity bottleneck emerges due to the increased number of languages and data (Huang et al., 2019; Zhang et al., 2020b). In this paper we adopt an end-to-end data-driven approach (conditional language-specific routing, or CLSR) which permits directly probing a large section of the search space. We let the network learn the sharing structure from the data itself, by learning to route between language-specific (LS) or shared pathways. These two routes determine the mode of operation for the network: when the LS branch is selected, the model is given access to a set of LS layers (implemented as simple projections per language), and when the shared branch is chosen, the computation is routed to a layer used by all languages. By guiding the (gating) decision process with token-level activation information, the network flexibly learns to alternate between the two modes and naturally lends itself to a conditional computation approach for multilingual processing (Bengio et al., 2013; Davis & Arel, 2013; Bapna et al., 2020). The gate states are optimized towards maximizing translation quality, but regularized with a budget constraint to control the amount of LS capacity.¹ Reducing the available budget results in fewer gates routing through the LS paths, forcing CLSR to identify the most crucial sub-layers and allowing us to observe and study the importance of each sub-layer for multilingual processing. Our approach is visually depicted in Figure 1. We verify our proposal on WMT and the massively multilingual OPUS-100 dataset, with models building on the Transformer architecture (Vaswani et al., 2017). We explore target-specific and source-specific modeling for one-to-many² and many-to-one translation, respectively.
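To make the routing mechanism concrete, below is a minimal NumPy sketch of a CLSR layer. This is not the authors' implementation: the gate parameterization (a single linear scalar gate with a hard threshold, as at inference time) and all names are illustrative assumptions.

```python
import numpy as np

def clsr_layer(x, lang, W_shared, W_ls, w_gate, b_gate):
    """Illustrative CLSR layer (hypothetical names, not the paper's code).

    x        : (seq_len, d) token representations from a Transformer sub-layer
    lang     : id selecting the language-specific (LS) projection
    W_shared : (d, d) projection shared by all languages
    W_ls     : dict mapping lang -> (d, d) LS projection
    w_gate   : (d,) weights of a scalar gating network
    b_gate   : scalar gate bias
    """
    # Token-level hard binary gate, conditioned on the token representation:
    # 1 routes the token through the LS path, 0 through the shared path.
    g = (x @ w_gate + b_gate > 0.0).astype(x.dtype)[:, None]
    return g * (x @ W_ls[lang]) + (1.0 - g) * (x @ W_shared)
```

During training, a relaxed or straight-through version of the hard gate would be needed to keep the routing decision trainable; the budget constraint then acts on the average of `g` over tokens.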
To measure each sub-layer's tendency to be language-specific, we propose the LSScore metric. Our results show that CLSR successfully navigates the trade-offs in LS modeling, outperforming several strong baselines. Our main findings are summarized below:

• Both the amount and the position of LS layers matter for MNMT. The best performance is achieved by distributing 10%-30% of LS computation to the top and/or bottom encoder/decoder layers.

• Feed-forward sub-layers utilize more LS capacity than other sub-layers on one-to-many translation.

• One-to-many translation benefits more from CLSR (with target LS parameters) than many-to-one translation (with source LS parameters), particularly when the training data is imbalanced.

• The sharing pattern induced by CLSR is highly similar across languages.



¹ We use the term "the amount of LS capacity" to refer to the proportion of open gates, i.e., where CLSR selects to route information through the LS path instead of its shared counterpart; this proportion is directly regularized and guided by the budget constraint p as in Eq. 6.

² In a one-to-many machine translation setup, a single source-side language (commonly English) is translated into multiple target languages, one at a time.
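The budget constraint on the proportion of open gates can be sketched as follows. Since Eq. 6 is not reproduced in this excerpt, the exact form below (an absolute deviation between the observed open-gate fraction and the budget p) is an assumption consistent with the description, not the paper's equation.

```python
import numpy as np

def budget_loss(gates, p):
    """Hypothetical budget regularizer (Eq. 6 is not shown in this excerpt):
    penalize the gap between the fraction of open (LS-routing) gates and
    the target budget p."""
    return abs(float(np.mean(gates)) - p)
```

Shrinking p drives this penalty to close gates, which is what forces CLSR to spend its limited LS budget on the most crucial sub-layers.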



Figure 1: The model architecture used for our experiments. We introduce a CLSR layer after every Transformer sub-layer in the encoder and the decoder. The gating layer learns to route every input through either the LS projection layer or a shared projection layer. We analyze the outputs of the gating layers to develop an MNMT architecture with LS projections.
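The analysis of gating outputs mentioned in the caption amounts to aggregating recorded gate decisions per sub-layer; a minimal sketch is given below, where the log layout and sub-layer names are illustrative assumptions.

```python
import numpy as np

def ls_usage_by_sublayer(gate_log):
    """Fraction of tokens routed through the LS path at each sub-layer.

    gate_log: hypothetical dict mapping a sub-layer name to an array of
    recorded binary gate decisions (1 = LS path, 0 = shared path).
    """
    return {name: float(np.mean(g)) for name, g in gate_log.items()}
```

Sub-layers with consistently high LS usage across languages are natural candidates for dedicated LS projections in a fixed architecture.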

Availability

Source code and models are available at https://github.com/bzhangGo/zero/tree/iclr2021_clsr.

