MAD FOR ROBUST REINFORCEMENT LEARNING IN MACHINE TRANSLATION

Abstract

We introduce a new distributed policy gradient algorithm and show that it outperforms existing reward-aware training procedures such as REINFORCE, minimum risk training (MRT) and proximal policy optimization (PPO) in terms of training stability and generalization performance when optimizing machine translation models. Our algorithm, which we call MAD (on account of using the mean absolute deviation in the importance weighting calculation), has distributed data generators sampling multiple candidates per source sentence on worker nodes, while a central learner updates the policy. MAD depends crucially on two variance reduction strategies: (1) a conditional reward normalization method that ensures each source sentence has both positive and negative reward translation examples and (2) a new robust importance weighting scheme that acts as a conditional entropy regularizer. Experiments on a variety of translation tasks show that policies learned using the MAD algorithm perform very well when using both greedy decoding and beam search, and that the learned policies are sensitive to the specific reward used during training.

1. INTRODUCTION

There is increasing interest in fine-tuning conditional language models on the basis of feedback from task-specific reward models or similarity functions that compare to human-generated reference outputs, rather than relying exclusively on supervised learning (Stiennon et al., 2020; Ziegler et al., 2019; Wu et al., 2018; Paulus et al., 2018; Rennie et al., 2017; Ranzato et al., 2016). Maximising sequence-level rewards has several advantages. First, it avoids the apparent conflict between the intuitive importance of "getting the full sequence right" in generation problems and the more conventional token-level cross entropy loss. Second, since a policy trained to maximize rewards is supervised with its own outputs (both good and bad ones), it mitigates issues arising from "exposure bias," in which a learned policy that has been trained only on correct examples has no experience recovering from errors and therefore performs poorly at test time (Ranzato et al., 2016). Third, feedback from (learned) rewards can be a cost-effective strategy for incorporating human preferences about how a system should behave (Stiennon et al., 2020; Christiano et al., 2017).

Unfortunately, fine-tuning policies that generate in complex output spaces, such as language, on the basis of sparse rewards is challenging. Reliable auxiliary critic/value functions, which many learning algorithms require, are difficult to estimate and debug (Wu et al., 2018; Bahdanau et al., 2017; Nguyen et al., 2017), and commonly used average or batch-level reward baselines (Kreutzer et al., 2017) are poor variance reducers since they are independent of the input, and input difficulty is a strong determinant of reward magnitude.

In this paper, we propose a new distributed policy gradient algorithm (§2) for fine-tuning translation models that addresses these issues. The distributed setup lets us use modest computation to obtain simple and effective empirical reward baselines (Rennie et al., 2017) rather than using inappropriate batch-level statistics or relying on brittle auxiliary value models. Our proposed algorithm has two components designed to make learning from the reward signal more effective: first, a sampling and reward normalization strategy that encourages each batch to contain a mix of both positive and negative rewards for every source sentence and, second, an importance weighting strategy that encourages the algorithm to pay attention to trajectories that are slightly off the current policy. Thus, our algorithm learns from trajectories that are already relatively likely under the current policy (meaning any updates to the policy will be small), while also encouraging continued exploration throughout training by down-weighting trajectories on which the model is very confident. This enables the algorithm to make large improvements in reward while taking small, conservative steps to change its behaviour; it also slows the rate of policy collapse, letting the model obtain continued performance improvements over many training steps.

The policies learned using our MAD algorithm produce high-quality translations, even when using greedy decoding. In our main experiments (§3), we use sentence BLEU as the reward and find that the average improvement on held-out test sets over the initial cross entropy model is 2.0 BLEU. We also find that the resulting models are less sensitive to the beam search hyperparameters of beam size and length normalization.
This means we do not need to tune length normalization for each dataset, and we observe almost no performance degradation with larger beam sizes. We confirm that the algorithm learns different policies depending on the reward function used during optimization, with the resulting policies showing reward-specific improvements on held-out data. Finally, we carry out a careful empirical analysis (§4) to better understand the impact of the algorithm's various components on training dynamics and generalization performance.
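To make the per-source reward normalization idea concrete before the formal description in §2.2, here is a minimal illustrative sketch of our own (not the paper's reference code), standardizing the rewards of one source sentence's candidate set so that it contains both positive and negative values; the epsilon guard is an added assumption for numerical stability.

```python
import torch

def normalize_rewards_per_source(rewards, eps=1e-6):
    """Standardize the N candidate rewards of a single source sentence.

    Centering with the per-source empirical mean guarantees a mix of
    positive and negative values (unless all candidates tie); scaling by
    the per-source standard deviation stops easy sources (uniformly high
    reward) from dominating hard ones.
    """
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std(unbiased=False) + eps)

# Example: sentence-BLEU rewards for four samples of one source.
print(normalize_rewards_per_source([0.31, 0.55, 0.12, 0.40]))
# tensor([-0.2249,  1.3171, -1.4456,  0.3534])  (approximately)
```

By construction the normalized rewards of every source sentence sum to zero, so each source contributes both reinforced and discouraged trajectories to the batch, unlike a single batch-level baseline.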

2. ALGORITHM

Our algorithm consists of workers generating training trajectories and rewards, in parallel, from a slightly out-of-date copy of the policy, θ_old, and a central learner that updates the current policy, θ. At the beginning of training, both θ and θ_old are initialized from a pretrained cross entropy model. The data generation algorithm is shown in Alg. 1 and the learner in Alg. 2. The learning algorithm has three core components: sampling from a range of temperatures (§2.1), conditional reward normalization on the basis of empirical means and variances (§2.2), and a novel robust importance weighting strategy that focuses learning efforts on samples that are slightly off policy (§2.3). We discuss each of these components in turn.

Algorithm 1 Asynchronous data generator

function GENERATE(N, Tmin, Tmax)
    while True do
        θ_old ← θ                             ▷ Get current global weights
        (x, y_ref) ∼ D_train
        Δ ← (Tmax − Tmin) / (N − 1)
        for i ∈ [1, N] do                     ▷ Obtain Y_x, q, and r
            T ← Tmin + (i − 1) × Δ
            y_i ← SAMPLE(p(· | x; θ_old, T))
            q_i ← log p(y_i | x; θ_old)
            r_i ← R(y_i, y_ref)
        end for
        r ← NORMALIZE(r)                      ▷ conditional reward normalization, §2.2
        v ← MADWEIGHT(q)                      ▷ robust importance weights, §2.3
        ENQUEUE(x, y, q, r, v)
    end while
end function

Algorithm 2 Learner

function LEARN(S, η)
    for step ∈ [1, S] do
        x, y, q, r, v ← DEQUEUE()
        p ← log p(y | x; θ)
        u ← exp(p − q)
        α ← SG(min{u × v, 2})                 ▷ SG: stop gradient
        L ← α × r × log p(y | x; dropout(θ))
        θ ← θ + η × ∂L/∂θ
    end for
end function
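As a concrete illustration, the following is a minimal PyTorch sketch of a single learner update under our reading of Alg. 2; the `model.log_prob` helper, the batch layout, and the optimizer interface are assumptions rather than details specified here.

```python
import torch

def mad_learner_step(model, optimizer, batch, clip=2.0):
    """One learner update, mirroring Alg. 2 (a sketch, not reference code).

    Each batch element carries: source x, sampled translation y, the
    behaviour log-probability q = log p(y | x; theta_old) recorded by
    the worker, the normalized reward r, and the MAD weight v.
    """
    x, y, q, r, v = batch

    # Importance ratio to the (stale) behaviour policy, computed without
    # dropout and without gradients: alpha acts as a constant coefficient,
    # which is what SG (stop gradient) means in Alg. 2.
    model.eval()
    with torch.no_grad():
        p = model.log_prob(x, y)              # log p(y | x; theta); assumed helper
        u = torch.exp(p - q)
        alpha = torch.clamp(u * v, max=clip)  # truncated at 2, as in Alg. 2

    # The loss term uses dropout(theta), so recompute with dropout active.
    model.train()
    logp = model.log_prob(x, y)
    loss = -(alpha * r * logp).mean()         # gradient ascent on alpha * r * log p

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because α is detached, gradients flow only through the final log-probability term, so the update is an importance-weighted policy gradient step rather than a ratio-optimization step as in PPO.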

2.1. MULTI-TEMPERATURE SAMPLING

To obtain suitably diverse candidates to learn from, it is conventional to introduce a temperature hyperparameter when generating samples (Shen et al., 2016; Papini et al., 2020). We identify two problems with this. First, there is the practical matter of needing to select a temperature in order to obtain good performance. Second, it is widely observed that policy gradient algorithms result in increasingly peaked distributions as training progresses, so a single temperature that produces diverse samples early in training may fail to do so later on.
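To illustrate the evenly spaced temperature grid the data generator uses in Alg. 1, here is a small sketch; the `next_token_logits` decoding interface and `EOS_ID` are hypothetical stand-ins for the model's actual API.

```python
import torch

EOS_ID = 2  # hypothetical end-of-sequence token id

def sample_multi_temperature(next_token_logits, x, n=8,
                             t_min=0.2, t_max=1.0, max_len=128):
    """Draw n candidate translations of source x, one per temperature on
    an evenly spaced grid [t_min, t_max] (assumes n >= 2), as in Alg. 1."""
    delta = (t_max - t_min) / (n - 1)
    candidates = []
    for i in range(n):
        t = t_min + i * delta                     # the i-th grid temperature
        y = []
        for _ in range(max_len):
            logits = next_token_logits(x, y)      # assumed decoding interface
            probs = torch.softmax(logits / t, dim=-1)
            tok = int(torch.multinomial(probs, num_samples=1))
            y.append(tok)
            if tok == EOS_ID:
                break
        candidates.append(y)
    return candidates
```

Sampling each candidate at a different temperature sidesteps tuning a single temperature and keeps some low-temperature (near-greedy) and some high-temperature (exploratory) trajectories in every candidate set.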

