FUZZY ALIGNMENTS IN DIRECTED ACYCLIC GRAPH FOR NON-AUTOREGRESSIVE MACHINE TRANSLATION

Abstract

Non-autoregressive translation (NAT) reduces the decoding latency but suffers from performance degradation due to the multi-modality problem. Recently, the structure of directed acyclic graph has achieved great success in NAT, which tackles the multi-modality problem by introducing dependency between vertices. However, training it with negative log-likelihood loss implicitly requires a strict alignment between reference tokens and vertices, weakening its ability to handle multiple translation modalities. In this paper, we hold the view that all paths in the graph are fuzzily aligned with the reference sentence. We do not require the exact alignment but train the model to maximize a fuzzy alignment score between the graph and reference, which takes captured translations in all modalities into account. Extensive experiments on major WMT benchmarks show that our method substantially improves translation performance and increases prediction confidence, setting a new state of the art for NAT on the raw training data.

1. INTRODUCTION

Non-autoregressive translation (NAT) (Gu et al., 2018) reduces the decoding latency by generating all target tokens in parallel. Compared with the autoregressive counterpart (Vaswani et al., 2017) , NAT often suffers from performance degradation due to the severe multi-modality problem (Gu et al., 2018) , which refers to the fact that one source sentence may have multiple translations in the target language. NAT models are usually trained with the cross-entropy loss, which strictly aligns model prediction with target tokens. The strict alignment does not allow multi-modality such as position shifts and word reorderings, so proper translations are likely to be wrongly penalized. The inaccurate training signal makes NAT tend to generate a mixture of different translations rather than a consistent translation, which typically contains many repeated tokens in generated results. Many efforts have been devoted to addressing the above problem (Libovický & Helcl, 2018; Shao et al., 2020; Ghazvininejad et al., 2020a; Du et al., 2021; Huang et al., 2022c) . Among them, Directed Acyclic Transformer (DA-Transformer) (Huang et al., 2022c) introduces a directed acyclic graph (DAG) on top of the NAT decoder, where decoder hidden states are organized as a graph rather than a sequence. By modeling the dependency between vertices, DAG is able to capture multiple translation modalities simultaneously by assigning tokens in different translations to distinct vertices. In this way, DA-Transformer does not heavily rely on knowledge distillation (KD) (Kim & Rush, 2016; Zhou et al., 2020) to reduce training data modalities and can achieve superior performance on raw data. Despite the success of DA-Transformer, training it with negative log-likelihood (NLL) loss, which marginalizes out the path from the joint distribution of DAG path and reference, is sub-optimal in the scenario of NAT. 
It implicitly introduces a strict monotonic alignment between reference tokens and the vertices on every path. Although the DAG enables the model to capture different translations in different transition paths, only paths that are aligned verbatim with the reference with a large probability will be well calibrated by NLL (see Section 2.3 for a detailed analysis). This weakens the DAG's ability to handle data multi-modality, making the model less confident in generating outputs and requiring a large graph size to achieve satisfactory performance.

In this paper, we extend the verbatim alignment between the reference and DAG paths to a fuzzy alignment, aiming to better handle the multi-modality problem. Specifically, we do not require an exact alignment but hold the view that all paths in the DAG are fuzzily aligned with the reference sentence. To indicate the quality of alignment, an alignment score is assigned to each DAG path based on the expectation of n-gram overlap. We further define the alignment score between the whole DAG and the reference as the expected alignment score over all its paths. The model is trained to maximize this alignment score, which takes the captured translations of all modalities into account. Experiments on major WMT benchmarks show that our method substantially improves the translation quality of DA-Transformer. It achieves performance comparable to the autoregressive Transformer without the help of knowledge distillation or beam search decoding, setting a new state of the art for NAT on the raw training data.
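To make the notion of n-gram overlap concrete, the sketch below (our own illustration, not the paper's implementation; function names are ours) counts clipped n-gram matches between the token sequence realized by one decoded path and the reference. This discrete match count is the quantity whose expectation underlies the fuzzy alignment score:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_ngram_matches(hypothesis, reference, n):
    """Number of hypothesis n-grams found in the reference, with each
    reference n-gram credited at most as often as it occurs (clipping)."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    return sum(min(count, ref_counts[g]) for g, count in hyp_counts.items())

# Two hypothetical paths realizing different translations of one source:
ref = "thank you very much".split()
path_a = "thank you very much".split()
path_b = "thanks a lot".split()
print(clipped_ngram_matches(path_a, ref, 2))  # 3: all three bigrams match
print(clipped_ngram_matches(path_b, ref, 2))  # 0: no bigram overlap
```

Clipping prevents a path from being rewarded for repeating a reference n-gram more often than the reference itself contains it, which matters for NAT outputs that tend to repeat tokens.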

2.1. NON-AUTOREGRESSIVE MACHINE TRANSLATION

Non-autoregressive translation (Gu et al., 2018) is proposed to reduce the decoding latency. It abandons the assumption of autoregressive dependency between output tokens and generates all tokens simultaneously. Given a source sentence x = {x_1, ..., x_N}, NAT factorizes the joint probability of the target tokens y = {y_1, ..., y_M} as

P_\theta(y|x) = \prod_{i=1}^{M} P_\theta(y_i|x),

where \theta is the model parameter and P_\theta(y_i|x) denotes the translation probability of y_i at position i. In vanilla NAT, the decoder length is set to the reference length during training and determined by a trainable length predictor during inference. The standard NAT model is trained with the cross-entropy loss, which strictly requires the generation of word y_i at position i:

L_{CE} = -\sum_{i=1}^{M} \log P_\theta(y_i|x).

2.2. DA-TRANSFORMER

Vanilla NAT suffers from two major drawbacks: inflexible length prediction and an inability to handle multi-modality. DA-Transformer (Huang et al., 2022c) addresses these problems by stacking a directed acyclic graph on top of the NAT decoder, where hidden states and transitions between states represent vertices and edges in the DAG. Formally, given a bilingual pair x = {x_1, ..., x_N} and y = {y_1, ..., y_M}, DA-Transformer sets the decoder length L = \lambda \cdot N and models the translation probability by marginalizing out the paths in the DAG:

P_\theta(y|x) = \sum_{a \in \Gamma_y} P_\theta(y|a, x) P_\theta(a|x),

where a = {a_1, ..., a_M} is a path represented by a sequence of vertex indices with the bound 1 = a_1 < ... < a_M = L, and \Gamma_y contains all paths with the same length as the target sentence y. P_\theta(a|x) and P_\theta(y|a, x) denote the probability of path a and the probability of the target sentence y conditioned on path a, respectively. The DAG factorizes the path probability P_\theta(a|x) based on the Markov hypothesis:

P_\theta(a|x) = \prod_{i=1}^{M-1} P_\theta(a_{i+1}|a_i, x) = \prod_{i=1}^{M-1} E_{a_i, a_{i+1}},

where E_{a_i, a_{i+1}} denotes the transition probability from vertex a_i to vertex a_{i+1}.
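The marginalization over \Gamma_y can be computed exactly by dynamic programming rather than path enumeration. The following sketch (toy sizes, our own variable names; emission and transition probabilities are random placeholders, not trained model outputs) fills f[i][v], the probability of emitting the first i reference tokens along a path ending at vertex v, and cross-checks it against brute-force enumeration of all valid paths:

```python
import itertools
import numpy as np

def dag_marginal(emit, trans):
    """P(y|x) summed over all paths a with a_1 = first vertex and
    a_M = last vertex, strictly increasing (0-indexed vertices).
    emit: (M, L) token-emission probs; trans: (L, L) transition probs."""
    M, L = emit.shape
    f = np.zeros((M, L))
    f[0, 0] = emit[0, 0]                      # paths must start at vertex 0
    for i in range(1, M):
        for v in range(L):
            # Sum over all predecessor vertices u < v (acyclic constraint).
            f[i, v] = emit[i, v] * sum(f[i - 1, u] * trans[u, v] for u in range(v))
    return f[M - 1, L - 1]                    # paths must end at vertex L-1

def dag_marginal_bruteforce(emit, trans):
    """Reference implementation: enumerate every increasing path."""
    M, L = emit.shape
    total = 0.0
    for mid in itertools.combinations(range(1, L - 1), M - 2):
        a = (0,) + mid + (L - 1,)
        p = np.prod([emit[i, a[i]] for i in range(M)])
        p *= np.prod([trans[a[i], a[i + 1]] for i in range(M - 1)])
        total += p
    return total

rng = np.random.default_rng(0)
M, L = 4, 8                                   # target length, decoder length
emit = rng.random((M, L))
trans = np.triu(rng.random((L, L)), k=1)      # acyclic: forward edges only
print(np.isclose(dag_marginal(emit, trans), dag_marginal_bruteforce(emit, trans)))
```

The DP costs O(M * L^2), while the number of paths grows combinatorially in L, which is why DA-Transformer relies on this kind of recursion for training.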

