EFFICIENT NEURAL MACHINE TRANSLATION WITH PRIOR WORD ALIGNMENT

Abstract

Prior word alignment has been shown to help translation, provided that the prior is accurate and can be acquired conveniently. Traditionally, word alignment is learned by statistical machine translation (SMT) models. In this paper, we propose a novel method that infuses prior word alignment information into neural machine translation (NMT) to provide hints about the target sentence at run time. Previous work with similar goals builds dictionaries for specific domains, constrains the decoding process, or both. While effective to some extent, these methods can greatly slow decoding and hurt translation flexibility and efficiency. Instead, this paper introduces an enhancement learning model that learns to directly replace specific source words with their target counterparts according to prior alignment information. The proposed model is then inserted into a neural MT model and augments the MT input with additional target information from the learning model in an effective and more efficient way. Our method achieves BLEU improvements of up to 1.1 over a strong baseline on English-Korean, English-German, and English-Romanian translation tasks.

1. INTRODUCTION

As neural machine translation (NMT) models have become the dominant approach to machine translation, the explicit word alignment model, an essential intermediate result of training most statistical machine translation (SMT) models (Koehn et al., 2003; Och & Ney, 2004; Ganchev et al., 2008), seems to be growing increasingly obsolete. Prior research suggests that the attention mechanism of NMT systems takes over the role of the SMT word alignment model (Bahdanau et al., 2014). However, the word alignment extracted from the attention mechanism is far from gold alignment and even performs much worse than automatic word aligners such as FastAlign or GIZA++. In this study, we focus on using prior word alignment in the NMT system to improve translation performance. Given sufficiently good known word alignment, replacing some words in the source sentence with semantically corresponding words in the target language leads to better or user-desired translation; this is also a well-known tip for using translators effectively. As Figure 1 shows, an open translation system¹ generates a translation closer to the target sentence when some target-sentence words are provided in the source sentence. In other words, a user can exploit specific alignments, such as 공개 ↔ released and 사진 ↔ picture, to obtain a desired translation. The case in Figure 1 arises because word alignment between source and target sentences more or less holds, no matter how the model acquires such alignment. However, not all word alignment helps; only sufficiently good alignment can truly enhance the model. When the language pair concerned shares a large vocabulary, such alignment may be obtained easily and then applied conveniently through the proposed early-substitution scheme.
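The early-substitution trick described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name is ours, and the full example sentence is hypothetical, using only the two alignments (공개 ↔ released, 사진 ↔ picture) quoted from Figure 1.

```python
def substitute_with_alignment(source_tokens, prior):
    """Replace each source word that has a known target-side
    counterpart in the prior alignment; leave other words unchanged."""
    return [prior.get(tok, tok) for tok in source_tokens]

# Alignments from Figure 1; the surrounding sentence is hypothetical.
prior = {"공개": "released", "사진": "picture"}
tokens = ["어제", "공개", "된", "사진"]
print(substitute_with_alignment(tokens, prior))
# ['어제', 'released', '된', 'picture']
```

Feeding such a mixed-language source to a translator is exactly the user-side tip the paper mentions; the rest of the paper automates this substitution inside the model.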
This work explores an effective way to identify such 'good enough' alignments for enhancing neural machine translation. Previous studies with similar goals fall largely into two categories: constrained decoding and augmenting the MT input with corresponding target information. The former leverages pre-specified translations to guide decoding in a modified NMT architecture, such as an additional attention layer (Alkhouli et al., 2018; Song et al., 2020) or a modified beam search (Post & Vilar, 2018; Hu et al., 2019). These methods are useful in applications where the user wants to enforce a specific translation of certain words. However, if they do not use information (e.g., alignments and dictionary entries) extracted from a ground-truth sentence pair, they can instead lower translation quality due to strict enforcement of constraints (Dinu et al., 2019). Moreover, constrained decoding based on a modified beam search adds computational overhead at translation time. The latter augments the MT input with pre-defined dictionary entries for a specific domain and lets NMT models learn, at training time, to use the corresponding target term when it is provided in the source sentence (Dinu et al., 2019; Park & Zhao, 2019). Although these methods yield small but consistent improvements, they require a 'suitable' pre-defined dictionary for the translation task; otherwise, performance degrades significantly because they rely on a fixed dictionary without considering context. This paper focuses on integrating word alignment information into NMT systems effectively and efficiently, without building such a dictionary beforehand.
To this end, we exploit the alignment learned by an SMT model and insert an Alignment-Based Word Substitution (ABWS) model into the NMT system; the ABWS model learns, at training time, to replace input words with their target counterparts. Specifically, the ABWS model's input is the source sentence, and its reference is a modified source sentence in which some words are replaced with their corresponding target words according to the prior word alignment. Note that our model requires alignment information only during training, so unlike previously proposed models, there is no need to process the test dataset. At inference time, the final hidden state of the trained ABWS model is integrated into the NMT model to provide additional target information for the MT input. The benefits of our model are twofold: (1) unlike constrained decoding, which modifies the decoding algorithm, our method adds no computational overhead at inference time; (2) although augmenting the MT input normally requires a 'good' pre-defined dictionary, our method needs no separately constructed dictionary because the ABWS model efficiently plays that role. Furthermore, several previous studies inject the alignment signal directly into an attention head of the Transformer for constrained decoding or better alignment extraction, but they do not change translation performance (Alkhouli et al., 2018; Garg et al., 2019; Song et al., 2020). To summarize, we make the following contributions. (1) We propose a novel input-augmentation method that leverages only prior word alignment, without pre-defined dictionaries, to improve translation performance in an NMT system. Prior alignment information from an automatic word aligner can therefore be injected effectively into the NMT system for any bilingual corpus through our method.
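Constructing the ABWS reference sentence from aligner output can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the paper's code: the function name is ours, and we assume the prior alignment comes in the standard Pharaoh "i-j" format emitted by FastAlign or GIZA++, applying only the one-to-one links given (selecting which links are 'good enough' is outside this sketch).

```python
def build_abws_reference(src_tokens, tgt_tokens, pharaoh_links):
    """Build an ABWS training reference: the source sentence with
    aligned words replaced by their target counterparts.

    pharaoh_links: whitespace-separated "i-j" pairs, where i indexes
    src_tokens and j indexes tgt_tokens (FastAlign/GIZA++ style).
    """
    reference = list(src_tokens)
    for link in pharaoh_links.split():
        i, j = (int(x) for x in link.split("-"))
        reference[i] = tgt_tokens[j]
    return reference

src = ["das", "ist", "gut"]
tgt = ["that", "is", "good"]
print(build_abws_reference(src, tgt, "0-0 2-2"))
# ['that', 'ist', 'good']
```

At training time, the ABWS model maps the unmodified source to such a reference; at inference time only its hidden state is used, so no alignments are needed for the test data.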
(2) In our experiments, our model outperforms strong baselines, including the vanilla Transformer, constrained decoding, and input augmentation, on Romanian-English, English-German, and Korean-English translation tasks.

2. RELATED WORK

Word alignment is no longer an indispensable component in training NMT models, but there has recently been a resurgence of interest in word alignment for NMT due to its better interpretability. For example, previous work has used word alignment information to interpret end-to-end NMT systems and analyze translation errors, or to extract better alignments from trained NMT models. There have also been several studies that leverage word alignment to guide NMT decoding directly; in particular, Alkhouli et al. (2018) described this approach as a new downstream task for word alignment (dictionary-guided decoding). In this study, we focus on infusing SMT-derived word alignment into the NMT system to improve translation performance. Alkhouli et al. (2018) added a special alignment head, which conditions a lexical model on the alignment information, to the multi-head source-to-target attention module of the Transformer decoder. The use of a separate alignment model adds significant computational overhead to decoding, requiring special handling to optimize speed. Song et al. (2020) proposed introducing a dedicated head in the multi-head Transformer architecture to capture external supervision signals. Besides, these two studies constrain the decoding process to correct translation errors with
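Supervising one attention head with external alignments, as in the work discussed above, amounts to penalizing the gap between that head's attention distribution and a gold alignment matrix. The sketch below is our own simplified illustration of such a loss (in the spirit of that line of work, not any paper's actual implementation), written in pure Python for clarity rather than in a tensor framework.

```python
import math

def alignment_head_loss(attn_probs, gold_alignment, eps=1e-9):
    """Cross-entropy between one attention head's target-to-source
    attention rows and a 0/1 gold alignment matrix.

    attn_probs:     list of rows, one per target word; each row is a
                    probability distribution over source positions.
    gold_alignment: same shape, with 1 where the words are aligned.
    Unaligned target words (all-zero gold rows) get no supervision.
    """
    total = 0.0
    for probs, links in zip(attn_probs, gold_alignment):
        n_links = sum(links)
        if n_links == 0:
            continue  # no gold link for this target word
        for p, a in zip(probs, links):
            total -= (a / n_links) * math.log(p + eps)
    return total / len(attn_probs)
```

A head that attends exactly where the gold alignment points incurs near-zero loss, while a uniform head incurs roughly log of the source length per aligned word; adding this term to the translation loss steers the chosen head toward the external alignment.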
¹ https://translate.google.com