MAD FOR ROBUST REINFORCEMENT LEARNING IN MACHINE TRANSLATION

Abstract

We introduce a new distributed policy gradient algorithm and show that it outperforms existing reward-aware training procedures such as REINFORCE, minimum risk training (MRT) and proximal policy optimization (PPO) in terms of training stability and generalization performance when optimizing machine translation models. Our algorithm, which we call MAD (on account of using the mean absolute deviation in the importance weighting calculation), has distributed data generators sampling multiple candidates per source sentence on worker nodes, while a central learner updates the policy. MAD depends crucially on two variance reduction strategies: (1) a conditional reward normalization method that ensures each source sentence has both positive and negative reward translation examples and (2) a new robust importance weighting scheme that acts as a conditional entropy regularizer. Experiments on a variety of translation tasks show that policies learned using the MAD algorithm perform very well when using both greedy decoding and beam search, and that the learned policies are sensitive to the specific reward used during training.

1. INTRODUCTION

There is increasing interest in fine-tuning conditional language models on the basis of feedback from task-specific reward models or similarity functions that compare to human-generated reference outputs, rather than relying exclusively on supervised learning (Stiennon et al., 2020; Ziegler et al., 2019; Wu et al., 2018; Paulus et al., 2018; Rennie et al., 2017; Ranzato et al., 2016). Maximizing sequence-level rewards has several advantages. First, it avoids the apparent conflict between the intuitive importance of "getting the full sequence right" in generation problems and the more conventional token-level cross-entropy loss. Second, since a policy trained to maximize rewards is supervised with its own outputs, both good and bad ones, it mitigates issues arising from "exposure bias," in which a learned policy that has been trained only on correct examples has no experience recovering from errors and therefore performs poorly at test time (Ranzato et al., 2016). Third, feedback from (learned) rewards can be a cost-effective strategy for incorporating human preferences about how a system should behave (Stiennon et al., 2020; Christiano et al., 2017). Unfortunately, fine-tuning policies for generating in complex output spaces, such as language, on the basis of sparse rewards is challenging. The auxiliary critic/value functions required by many learning algorithms are difficult to estimate and debug reliably (Wu et al., 2018; Bahdanau et al., 2017; Nguyen et al., 2017), and commonly used average or batch-level reward baselines (Kreutzer et al., 2017) are poor variance reducers since they are independent of the input, and input difficulty is a strong determinant of reward magnitude. In this paper, we propose a new distributed policy gradient algorithm ( §2) for fine-tuning translation models that addresses these issues.
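To make the baseline issue concrete, the following sketch contrasts a batch-level baseline with an input-conditioned empirical one: when several candidate translations are sampled per source sentence, the mean reward of the samples for that same source can serve as its baseline. The function name and the per-source mean-centering are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def per_source_advantages(rewards, source_ids):
    """Center each sample's reward by the mean reward of the samples
    drawn for the *same* source sentence (an empirical, input-conditioned
    baseline), rather than by a single batch-level mean. Since rewards
    are centered within each source, every source ends up with a mix of
    positive and negative advantages."""
    rewards = np.asarray(rewards, dtype=float)
    advantages = np.empty_like(rewards)
    for sid in set(source_ids):
        idx = [i for i, s in enumerate(source_ids) if s == sid]
        advantages[idx] = rewards[idx] - rewards[idx].mean()
    return advantages

# Rewards for 2 source sentences with 3 sampled candidates each.
# An easy source (high rewards) and a hard one (low rewards): a
# batch-level baseline would mark all hard-source samples "bad".
rewards    = [0.9, 0.5, 0.7,   0.2, 0.1, 0.3]
source_ids = [0,   0,   0,     1,   1,   1]
adv = per_source_advantages(rewards, source_ids)
```

Note that under a single batch-level baseline (mean 0.45 here), every sample of the harder source would receive a negative signal regardless of its relative quality, which is exactly the failure mode an input-conditioned baseline avoids.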
The distributed setup lets us use modest computation to obtain simple and effective empirical reward baselines (Rennie et al., 2017) rather than using inappropriate batch-level statistics or relying on brittle auxiliary value models. Our proposed algorithm has two components designed to make learning from the reward signal more effective: first, a sampling and reward normalization strategy that encourages batches to contain a mix of both positive and negative rewards for each source sentence; and second, an importance weighting strategy that encourages the algorithm to pay attention to trajectories that are slightly off the current policy. Thus, our algorithm learns from trajectories that are already relatively likely under the current policy.
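One plausible instantiation of such a robust, MAD-based importance weighting is sketched below: likelihood ratios between the current policy and the data-generating (behavior) policy are damped in proportion to how far each ratio lies from the median ratio, measured in units of the mean absolute deviation. The exact functional form and all names here are assumptions for illustration, not the formula from §2.

```python
import numpy as np

def mad_importance_weights(logp_current, logp_behavior):
    """Illustrative MAD-style robust importance weighting (functional
    form is an assumption): compute likelihood ratios of sampled
    translations under the current vs. behavior policy, then
    exponentially down-weight samples whose ratio is far from the
    median ratio, scaled by the mean absolute deviation (MAD). Samples
    close to the current policy keep weights near their raw ratio;
    far-off-policy samples are strongly damped."""
    ratios = np.exp(np.asarray(logp_current) - np.asarray(logp_behavior))
    median = np.median(ratios)
    # Mean absolute deviation of the ratios; epsilon guards division by zero.
    mad = np.mean(np.abs(ratios - median)) + 1e-8
    damping = np.exp(-np.abs(ratios - median) / mad)
    return ratios * damping
```

Because the damping factor is largest for ratios near the median (typically near 1 when sampling from a recent policy snapshot), this weighting concentrates the gradient signal on trajectories that are only slightly off the current policy, consistent with the behavior described above.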

