EFFICIENT NEURAL MACHINE TRANSLATION WITH PRIOR WORD ALIGNMENT

Abstract

Prior word alignment has been shown to be helpful for translation, provided the alignment is of good quality and can be acquired conveniently. Traditionally, word alignment is learned by statistical machine translation (SMT) models. In this paper, we propose a novel method that infuses prior word alignment information into neural machine translation (NMT) to provide hints about the target sentence at run time. Previous works with similar goals build dictionaries for specific domains, constrain the decoding process, or both. While effective to some extent, these methods can greatly slow decoding and hurt translation flexibility and efficiency. Instead, this paper introduces an enhancement learning model that learns how to directly replace specific source words with their target counterparts according to prior alignment information. The proposed model is inserted into a neural MT model and augments the MT input with additional target information from the learning model in an effective and more efficient way. Our method achieves BLEU improvements (up to 1.1) over a strong baseline on English-Korean, English-German, and English-Romanian translation tasks.

1. INTRODUCTION

As neural machine translation (NMT) models have become the dominant approach to machine translation, the explicit word alignment model, an essential intermediate result of training most statistical machine translation (SMT) models (Koehn et al., 2003; Och & Ney, 2004; Ganchev et al., 2008), seems to be becoming obsolete. Prior research suggests that the attention mechanism of NMT systems takes over the role of the word alignment model in SMT systems (Bahdanau et al., 2014). However, the word alignment extracted from the attention mechanism is far from gold alignment and performs much worse than automatic word aligners such as FastAlign or GIZA++. In this study, we focus on using prior word alignment in the NMT system to improve translation performance. With the guidance of good enough known word alignment, replacing some words in the source sentence with semantically corresponding words in the target language leads to better or user-desired translation; this is also a well-known tip for using translators effectively. As Figure 1 shows, an open translation systemfoot_0 generates a translation closer to the target sentence when some target words are provided in the source sentence. In other words, a user can exploit specific alignments, such as 공개 ↔ released and 사진 ↔ picture, to obtain a desired translation. The case in Figure 1 arises because word alignment between source and target sentences more or less holds regardless of how the model acquires it. However, not all word alignments help; only sufficiently good ones truly enhance the model. When the language pair shares a large vocabulary, such good alignments can be obtained easily and then conveniently used in the proposed early-substitution manner.
This work explores an effective way to identify such 'good enough' alignments for enhancing neural machine translation. Previous studies with similar approaches fall largely into two categories: constrained decoding, and augmenting the MT input with corresponding target information. The former leverages pre-specified translations to guide the decoding procedure through a modified NMT architecture, such as an additional attention layer (Alkhouli et al., 2018; Song et al., 2020) or a modified beam search (Post & Vilar, 2018; Hu et al., 2019). These methods are useful in applications where the user wants to enforce specific translations of certain words. However, if the information (e.g., alignment and dictionary) is not extracted from a ground-truth sentence pair, they can instead lower translation quality due to strict enforcement of constraints (Dinu et al., 2019). Constrained decoding with a modified beam search also adds computational overhead at translation time. The latter augments the MT input with pre-defined dictionary entries for a specific domain and lets NMT models learn, at training time, how to use the corresponding target term when it is provided in the source sentence (Dinu et al., 2019; Park & Zhao, 2019). Although these methods yield small but consistent improvements, they require a 'suitable' pre-defined dictionary for each translation task. Otherwise, translation performance degrades significantly because the method relies on a fixed dictionary without considering context. This paper focuses on integrating word alignment information into NMT systems effectively and efficiently, without building a 'suitable' dictionary beforehand.
To this end, we exploit the alignment learned by an SMT model and insert an Alignment-Based Word Substitution (ABWS) model into the NMT system; the goal of this model is to learn, at training time, how to replace input words with their target counterparts. Specifically, the ABWS model's input is the source sentence, and its reference is a modified source sentence in which some words are replaced with their corresponding target words according to the prior word alignment. Note that our model requires alignment information only during training, so unlike previously proposed models, there is no need to process the test dataset. At inference time, the final hidden state of the trained ABWS model is integrated into the NMT model to provide additional target information for the MT input. The benefits of our model are twofold: (1) Unlike constrained decoding, which modifies the decoding algorithm, our method incurs no computational overhead at inference time. (2) While input-augmentation methods require a 'good' pre-defined dictionary, our method does not need to construct one separately, because the ABWS model efficiently plays the role of the pre-defined dictionary. Furthermore, several previous studies inject alignment signals directly into an attention head of the Transformer for constrained decoding or better alignment extraction, but doing so does not improve translation performance and may even lower it (Alkhouli et al., 2018; Garg et al., 2019; Song et al., 2020). To summarize, we make the following contributions. (1) We propose a novel input-augmentation method that leverages only prior word alignment, without pre-defined dictionaries, to improve translation performance in NMT systems. Prior alignment information from an automatic word aligner can thus be injected effectively into the NMT system for any bilingual corpus.
(2) In our experiments, our model outperforms strong baselines such as the vanilla Transformer, constrained decoding, and input-augmentation methods on Romanian-English, English-German, and Korean-English translation tasks.

2. RELATED WORK

Word alignment is no longer an indispensable component in training NMT models, but there has recently been a resurgence of interest in studying word alignment for NMT due to its better interpretability. For example, previous works have used word alignment information to interpret end-to-end NMT systems and analyze translation errors, or to extract better alignments from learned NMT models. Several studies also leverage word alignment to guide NMT decoding directly; in particular, Alkhouli et al. (2018) described this approach as a new downstream task for word alignment (dictionary-guided decoding). In this study, we focus on infusing SMT-derived word alignment into the NMT system to improve translation performance. Alkhouli et al. (2018) added a special alignment head, which conditions a lexical model on the alignment information, to the multi-head source-to-target attention module of the Transformer decoder. The use of a separate alignment model adds significant computational overhead to decoding, requiring special handling to optimize speed. Song et al. (2020) proposed introducing a dedicated head in the multi-head Transformer architecture to capture external supervision signals. Both studies constrain the decoding process to correct translation errors with pre-defined dictionaries and achieve significant improvements on translation tasks.

Figure 1: A Korean-to-English translation example in which an input source sentence has some words replaced with their corresponding target words, in an open translation system. The colorful lines between source and target at the top represent 'good' alignments, and the gray dotted lines are trivial ones. Underlines denote target words corresponding to source words, and asterisks (*) mark words corrected using the 'good' alignments.

Dinu et al. (2019) and Park & Zhao (2019) also used pre-defined dictionaries for a specific domain to let the NMT model learn how to use dictionary entries provided with the input, but they do not constrain the decoding procedure, instead adopting a non-coercive method that augments the MT input with additional information. Furthermore, Dinu et al. (2019) showed that constrained decoding hurt translation performance in their experiments. While our goal resembles that of Dinu et al. (2019) and Park & Zhao (2019) (augmenting the MT input with pre-defined dictionary entries for a specific domain), our method does not need to build pre-defined dictionaries and is not limited to a specific domain; all it needs is bilingual word alignment during training. Since NMT became the dominant MT approach, a variety of studies have integrated external resources (e.g., lemmas, POS tags, named entities, and other linguistic features) into NMT systems. Sennrich & Haddow (2016) proposed a simple but novel method that augments input embeddings with their corresponding linguistic features through concatenation. Li et al. (2020) presented three ways to integrate a compressed version of the input sentence into NMT systems. Correspondingly, we propose four integration methods to infuse the target information generated by the ABWS model into NMT models.

3. MODEL

In this section, we first describe the Transformer (Vaswani et al., 2017) for machine translation. We then propose our core model, the Alignment-Based Word Substitution (ABWS) model, which replaces part of the source sentence with corresponding target words according to prior word alignment. Furthermore, we introduce four strategies to fuse the ABWS output into the NMT system: Source-side Addition (Src-Add), Source-side Context Gate (Src-Gate), Source-side Fusion (Src-Fusion), and Target-side Fusion (Tgt-Fusion). Figure 2 shows the architecture of our proposed model with the Target-side Fusion integration strategy.

3.1. TRANSFORMER

The Transformer architecture follows the encoder-decoder paradigm (Sutskever et al., 2014) and has N stacked encoder layers and N stacked decoder layers that rely entirely on self-attention networks. A sequence of input words is first fed into a word embedding layer to obtain word embeddings, into which positional information is injected. The embeddings are fed into the encoder layer, which consists of two sub-modules: a self-attention module and a feed-forward module. The self-attention module first creates a query matrix Q, a key matrix K, and a value matrix V from the word embeddings and produces an output matrix as follows: SelfAttn(Q, K, V) = Softmax(QK^T / √d_model) V, where d_model denotes the dimension of the model. To capture meaning and context better, the Transformer uses multiple attention heads that are computed in parallel and independently. Multi-head attention is computed from the concatenation of the n attention head outputs head_i: MultiHead(Q, K, V) = Concat(head_1, ..., head_n) W^O, with head_i = SelfAttn(Q W_i^Q, K W_i^K, V W_i^V), where W^O ∈ R^{d×d} and W_i^Q, W_i^K, W_i^V ∈ R^{d×(d/n)} are parameter matrices. The multi-head attention network is the core of the Transformer. To preserve the auto-regressive property of translation, masked multi-head attention is added to each decoder layer. Finally, a softmax layer based on the decoder's last hidden state H^N_final produces a probability distribution over the target vocabulary: p(y_t | y_1, ..., y_{t-1}, x) = Softmax(H^N_final W^F), where W^F is a learned weight matrix, x is the source sentence, and y_1, ..., y_t are the target words.
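The attention equations above can be sketched concretely in NumPy. This is a minimal illustration, not the paper's implementation: the head count and weight shapes follow the equations, and √d scaling is the standard Transformer form.

```python
import numpy as np

def softmax(x):
    # Numerically stable row-wise softmax.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attn(Q, K, V):
    # SelfAttn(Q, K, V) = Softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def multi_head(x, Wq, Wk, Wv, Wo, n_heads):
    # Project x into per-head Q/K/V slices, attend per head,
    # concatenate the head outputs, and apply W^O.
    L, d = x.shape
    dh = d // n_heads
    heads = []
    for i in range(n_heads):
        s = slice(i * dh, (i + 1) * dh)
        heads.append(self_attn(x @ Wq[:, s], x @ Wk[:, s], x @ Wv[:, s]))
    return np.concatenate(heads, axis=-1) @ Wo  # Concat(head_1..head_n) W^O
```

The per-head slices of W_q, W_k, W_v correspond to the W_i^Q, W_i^K, W_i^V ∈ R^{d×(d/n)} matrices in the equation above.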

3.2. ALIGNMENT-BASED WORD SUBSTITUTION MODEL

We propose the Alignment-Based Word Substitution (ABWS) model, which learns at training time how to align source words with target ones while target words are provided in the source sentence. The model consists of 2 stacked base Transformer encoder layers and is jointly trained with the NMT model. The ABWS model's input and reference are as follows:
• Input I ∈ R^{L×V}: the original sentence, identical to the input of the NMT model;
• Reference R^Abws ∈ R^{L×V}: a modified sentence generated by replacing some source words with their target counterparts according to the prior alignment information,
where L is the length of the source sentence and V is the size of the shared vocabulary. At inference time, the last hidden state of the trained model is incorporated into the input of the NMT model. Formally, an input sentence x = {x1, x2, x3, x4, x5} is fed into the ABWS model, which learns to perform the suitable replacement given the reference r = {x1, y5, x3, x4, y1}, where the two source words x2 and x5 are replaced with their corresponding target words y5 and y1 according to the alignment. Because our heuristics govern how alignment is used in word substitution and model training, we distinguish three cases of bilingual word alignment: one-to-null (unaligned word), one-to-one, and one-to-many (multi-aligned word). For one-to-one alignment, we simply replace the source word with the target one in the sentence. Since unaligned source words are not replaced, for them the model simply learns to copy the source word. However, multi-aligned words raise a difficulty: unlike normal single-label classification, where class labels are mutually exclusive, the model must handle multi-label data. In other words, if one source word is replaced with multiple target words, the problem cannot be approached as a single-label classification task, as in a typical sequence generation model.
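To make the substitution concrete, the following sketch builds the ABWS reference from alignment links. The function name, the list-of-labels representation, and the per-call sampling are our own illustration under the paper's description, not its actual code; one-to-many links accumulate multiple labels for a position, which is the multi-label case discussed above.

```python
import random

def build_abws_reference(src, tgt, alignment, sample_ratio=0.9, seed=0):
    """Build the ABWS reference from source tokens, target tokens, and
    (src_idx, tgt_idx) alignment links from an external aligner.

    Unaligned positions keep the source token (the model learns to copy);
    one-to-one links replace it; one-to-many links accumulate several
    labels for the same position (multi-label case).
    """
    rng = random.Random(seed)
    # Sample alignment pairs at ratio `sample_ratio` (noising-style).
    sampled = [p for p in alignment if rng.random() < sample_ratio]
    labels = [[tok] for tok in src]
    replaced = set()
    for s, t in sampled:
        if s in replaced:
            labels[s].append(tgt[t])   # one-to-many: extra label
        else:
            labels[s] = [tgt[t]]       # one-to-one: replace source token
            replaced.add(s)
    return labels
```

For x = {x1, ..., x5} with links x2→y5 and x5→y1 and a sampling ratio of 1.0, this yields r = {x1, y5, x3, x4, y1}, matching the example above.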
Furthermore, some of the aligned target words may not carry the key meaning of the corresponding source word. We therefore adopt three measures for these issues: (1) As in the noising methods of existing pre-trained language models (Devlin et al., 2019; Yang et al., 2019), the replacement process is performed with pairs randomly sampled at a certain ratio from all alignment pairs for each batch; the sampling ratio p is determined empirically. (2) This sampling gives the model the flexibility either to learn the core meaning from aligned target words or to deviate from fixed source-target substitution pairs. The parameters of the output linear layer and the word embeddings are shared, and only the parameters of the model are trained. At inference time, the final hidden state of the model, rather than the output generated by the argmax operation, is integrated directly into the NMT model; this preserves information about multi-aligned words. In preliminary experiments, we observed a drop of about 2-3 BLEU when the argmax output was fed into the NMT model through the embedding layer. (3) Inspired by Garg et al. (2019), we implement multi-label classification for the multi-aligned words. Formally, let O^Abws ∈ R^{L×V} be the output of the model. We minimize the Kullback-Leibler divergence between R^Abws and O^Abws, which is equivalent to optimizing the following cross-entropy loss L_a: L_a(O^Abws) = -(1/V) Σ_{l=1}^{L} Σ_{v=1}^{V} d_{l,v} R^Abws_{l,v} log(O^Abws_{l,v}), where d_{l,v} (the duplicate alignment penalty) is the inverse frequency of each target index within each alignment sequence; the goal of this penalty is to keep words that are repeated often but carry little meaning (e.g., 'the', 'a') from dominating model training. We train our model to minimize L_a in conjunction with the standard translation loss L_t. The overall loss is L = L_t + λ L_a(O^Abws), where λ is a hyper-parameter, set to 0.1 in our experiments.
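The penalized loss can be sketched as follows. This is a minimal NumPy version; treating d as the inverse of each label's per-sequence frequency is our reading of the 'duplicate alignment penalty' described above.

```python
import numpy as np

def abws_loss(O, R, eps=1e-9):
    """Cross-entropy L_a = -(1/V) * sum_l sum_v d[l,v] R[l,v] log O[l,v].

    O: (L, V) predicted label probabilities; R: (L, V) multi-hot reference.
    d[l, v] is the inverse frequency of vocabulary index v as a label
    within this sequence, down-weighting often-repeated function words.
    """
    L, V = R.shape
    freq = R.sum(axis=0)  # per-sequence label counts per vocabulary index
    d = np.divide(1.0, freq, out=np.zeros(V), where=freq > 0)
    return -(d[None, :] * R * np.log(O + eps)).sum() / V
```

A word labeled at many positions of a sequence thus contributes less per occurrence than a word labeled once.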

3.3. INTEGRATION STRATEGIES

While many ways to augment the MT input with additional information have been explored, we consider four strategies to incorporate the last hidden state of the ABWS model into the NMT system: Src-Add, Src-Gate, Src-Fusion, and Tgt-Fusion. Given the last hidden state of the ABWS model H^Abws ∈ R^{l×d} and the input word embeddings S ∈ R^{l×d}, where l is the length of the source sentence, the four integration strategies are performed as follows.

Source-side Addition (Src-Add) adds S to H^Abws:
SrcAdd(S, H^Abws) = S + H^Abws. (7)

Source-side Context Gate (Src-Gate) uses a context gate G ∈ R^{l×d} to fuse S and H^Abws:
SrcGate(S, H^Abws) = ConGate(S, H^Abws) = G ⊙ S + (1 − G) ⊙ H^Abws, (8)
G = σ(MLP([S; H^Abws])),
where σ is the sigmoid function and [·;·] denotes concatenation.

Source-side Fusion (Src-Fusion) introduces an additional dedicated multi-head attention layer in the encoder and gates its output with the original self-attention output:
SrcFusion(S, H^Abws) = ConGate(H^n_Align, H^n_src),
H^n_Align = AlignEncMultiHead(S, H^Abws, H^Abws),
H^n_src = EncMultiHead(S, S, S),
where SrcFusion is fed to the FFN module of the encoder layer, and AlignEncMultiHead is an additional dedicated multi-head attention layer identical in form to the original one.

Target-side Fusion (Tgt-Fusion) introduces an additional dedicated encoder-decoder multi-head attention layer to integrate H^N_Abws into the decoder layer:
TgtFusion(S, H^Abws) = ConGate(T_Align, T_Ori),
T_Align = AlignEncDecMultiHead(H^n_tgt, H^N_Abws, H^N_Abws),
T_Ori = OriEncDecMultiHead(H^n_tgt, H^N_src, H^N_src),
H^N_Abws = FFN(EncMultiHead(H^Abws, H^Abws, H^Abws)),
H^N_src = FFN(EncMultiHead(S, S, S)),
where TgtFusion is fed to the FFN module of the decoder layer, and AlignEncDecMultiHead is an additional dedicated encoder-decoder multi-head attention layer.
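The ConGate operation shared by the gated strategies can be sketched as follows. A single linear layer stands in for the MLP, and the parameter names are illustrative; this is a simplifying assumption, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def con_gate(A, B, W, b):
    """ConGate(A, B) = G * A + (1 - G) * B with G = sigmoid([A; B] W + b).

    A, B: (l, d) states to fuse; W: (2d, d) and b: (d,) parameterize the
    gate. For Src-Gate, A is the input embeddings S and B is H_Abws.
    """
    G = sigmoid(np.concatenate([A, B], axis=-1) @ W + b)
    return G * A + (1.0 - G) * B
```

When the gate saturates at G ≈ 1 the output reduces to A (ignoring the ABWS signal), and at G ≈ 0 to B; learned gates interpolate elementwise between the two.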

4. EXPERIMENTS

To verify the effectiveness of the proposed method, we perform experiments on three language pairs: Romanian-English (EN↔RO), English-German (EN→DE), and Korean-English (KO↔EN). For all translation tasks, we use BLEU (Papineni et al., 2002) to evaluate translation quality. All models are trained on a single NVIDIA Tesla V100 GPU.

4.1. DATASETS

Training and test data for EN↔RO translation are the Europarl v8 corpus and newstest2016, respectively. For EN→DE, we use the Europarl v7 news datasets as training data, newstest2016 as validation data, and newstest2014 and newstest2015 as test data. To evaluate alignment quality for the two automatic word aligners, we use the gold alignments for EN↔ROfoot_1 and EN→DEfoot_2. For KO↔EN translation, we use a news dataset provided by AIHubfoot_3 and split it into validation and test data. For all datasets, we first tokenize the three languages (English, Romanian, German) using Moses (Koehn et al., 2007) and the Korean data using the KoNLPyfoot_4 toolkit, and apply Byte-Pair Encoding (Sennrich et al., 2016). Table 1 shows the statistics of the datasets and the alignment error rate (AER; Och & Ney, 2000) for EN↔RO and EN→DE.
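AER over the gold alignments in Table 1 follows Och & Ney (2000). A minimal sketch with a hypothesis link set A, sure gold links S, and possible gold links P (S ⊆ P):

```python
def aer(hyp, sure, possible):
    """AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the
    hypothesis link set, S the sure gold links, and P the possible gold
    links (S is a subset of P). Lower is better; 0 is a perfect score."""
    A, S, P = set(hyp), set(sure), set(possible)
    return 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))
```

Links are (source index, target index) pairs; hypothesis links matching only a possible gold link are not penalized as hard as missing sure links.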

4.2. SETUP AND BASELINES

For bilingual word alignment, we use MGIZA++foot_5 (Gao & Vogel, 2008), a parallel implementation of GIZA++, and the FastAlignfoot_6 toolkit with default parameters. We align the bilingual training corpora with FastAlign for all language pairs. For the RO↔EN translation task, we additionally use word alignments produced by MGIZA++ to compare the two word aligners. Both FastAlign and GIZA++ are used with default settings, and all training corpora are in subword format. Sentence pairs with errors (e.g., empty sentences) are pruned during alignment generation. In all experiments, we train all models with a base Transformer configuration: an embedding size of 512, 6 encoder and decoder layers, 8 attention heads, shared source and target embeddings, the standard ReLU activation function, and sinusoidal positional embeddings. We train with a batch size of 3,500 tokens, update parameters every 8 batches, and use the validation translation loss for early stopping. We optimize the model parameters using the Adam optimizer with a learning rate of 7e-4, β1 = 0.9, β2 = 0.98, and learning-rate warm-up over the first 4,000 steps. We apply label smoothing with a factor of 0.1, average the last 5 checkpoints, and run inference with a beam size of 5. All experiments were performed using the Torch-based toolkit Fairseq(-py) (Ott et al., 2019). We use the Transformer (Vaswani et al., 2017) as the baseline model for all language pairs. For EN→DE translation, we compare our approach to the following additional baselines:
• Constrained decodingfoot_7: vectorized lexically constrained decoding with dynamic beam allocation, as reported in Post & Vilar (2018) and Hu et al. (2019).
• Training-by-replacing: an input-augmentation method that directly replaces the original term with the target one according to a dictionary (Dinu et al., 2019; Park & Zhao, 2019).
• Training-by-appending: an input-augmentation method that appends the target term to its source version according to a pre-defined dictionary (Dinu et al., 2019).
These baselines all require a pre-defined dictionary, so for a fair comparison we extracted one from the GIZA++ alignment. To avoid spurious matches, we used the following pruning steps: (1) We first removed meaningless words by POS tagfoot_8 (e.g., auxiliaries, determiners, punctuation, stop words). (2) To obtain an appropriate one-to-one matching set, we counted the occurrence frequency of each target term matched to a source term and kept the top target term only if its frequency was more than 90% of the total frequency of all matched target terms. (3) Finally, we filtered out entries occurring among the top 500 most frequent English words.
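Pruning steps (2) and (3) above can be sketched as follows. The POS filter of step (1) is omitted, and the function and threshold names are illustrative, not the exact procedure used in the paper.

```python
from collections import Counter, defaultdict

def prune_dictionary(aligned_pairs, frequent_words, dominance=0.9):
    """Build a one-to-one dictionary from (source, target) word pairs
    extracted from alignment. An entry survives only if its most frequent
    target covers at least `dominance` of all matches for that source
    word (step 2), and the target is not among the most frequent words
    (step 3)."""
    by_src = defaultdict(Counter)
    for s, t in aligned_pairs:
        by_src[s][t] += 1
    dictionary = {}
    for s, counts in by_src.items():
        tgt, top = counts.most_common(1)[0]
        if top / sum(counts.values()) >= dominance and tgt not in frequent_words:
            dictionary[s] = tgt
    return dictionary
```

Ambiguous source words (no single dominant target) and entries mapping to very frequent words are thus dropped, which avoids spurious matches.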

4.3. EXPERIMENTAL RESULTS

Table 2 shows the BLEU evaluation of our systems. From the experimental results, we make the following observations: (1) Regarding the integration strategies, most of them outperform the baseline Transformer, and each strategy improves translation performance to a different degree depending on the language pair or translation direction. For example, Tgt-Fusion performs relatively well on RO↔EN and KO↔EN translation, while Src-Fusion is best on EN→DE translation. (2) Regarding the automatic word aligners, alignment information from GIZA++ yields larger improvements than FastAlign on RO↔EN and EN→DE translation, indicating that better alignment information leads to better translation. (3) For the other baseline models, which use a pre-defined dictionary built from prior word alignment, performance degrades or remains unchanged; this suggests that our model makes better use of the alignment information. (4) The parameters of our proposed model with the four integration strategies increase by 6M to 16M over the baseline Transformer, and the decoding time increases by only a factor of 1.2.

5.1. EVALUATING ALIGNMENT SAMPLING RATIO p

To verify the impact of different sampling ratios p on translation quality, we conducted experiments on the RO→EN translation task with our proposed model. A sampling ratio of p = 0.0 means no alignment information is used, while p = 1.0 is equivalent to leveraging all word alignment information. Table 3a shows the experimental results. Based on these, the alignment sampling ratio is set to 0.9 for the best performance in all our experiments.

5.2. WHICH NEURAL NETWORK IS SUITABLE AS THE ABWS MODEL?

We experimented with different neural networks to determine which could best represent the replaced words for the RO→EN translation task. As shown in Table 3b, there was little change when using a Bi-LSTM as the ABWS model, while the Transformer encoder improved translation performance. Therefore, we adopted the 2 stacked Transformer encoder layers as our model, considering its size and performance.

5.3. EVALUATION OF ALIGNMENT USAGE ON LOW-RESOURCE CASES

We evaluated our proposed model on a low-resource case, which might lead to poor word alignment and thus lower performance of our models. We first sampled 100,000 sentences from the EN→DE dataset; the AER for this small dataset was 34.0. Contrary to our expectations, Table 3c shows that our models achieve even more remarkable performance gains (up to 2 BLEU) in the low-resource case, indicating that our model is also useful in low-resource settings.

5.4. ANALYSIS OF ALIGNMENT USAGE IN TRANSLATION

In this subsection, we examine how the ABWS model utilizes prior word alignment in the NMT model with the Tgt-Fusion integration strategy on the KO→EN translation task. Table 4 shows some translation examples, including the top-3 ABWS outputs. We can see that the ABWS model replaces some source words with target ones, and the replaced target words appear in the translation output. Moreover, from the Korean word 시장 (market, mayor) in the first and second examples, we can see that the ABWS model distinguishes homophones in each sentence; the replacement process thus takes context into account. Another observation is that the ABWS model learns multi-aligned words well. For example, in the third example, 인권 (human rights) is aligned with two target words ('human' and 'rights'), and the ABWS model captures both.

6. CONCLUSION

This work presents a novel solution for the effective and efficient fusion of NMT and SMT. In particular, we explore an efficient way of exploiting prior word alignment offered by SMT models to enhance NMT during the training phase, instead of the constrained decoding of previous work, which can slow down inference. In detail, to augment the NMT input, we design an extra model that learns how to replace specific source words with their corresponding target words according to the prior word alignment. Our method yields significant performance improvements over a strong baseline on three translation tasks, verifying its effectiveness.



Footnotes:
foot_0: https://translate.google.com
foot_1: http://web.eecs.umich.edu/~mihalcea/wpt/index.html#resources
foot_2: https://www-i6.informatik.rwth-aachen.de/goldAlignment/
foot_3: http://www.aihub.or.kr/
foot_4: https://konlpy.org/en/latest/
foot_5: https://github.com/moses-smt/mgiza
foot_6: https://github.com/clab/fast_align
foot_7: https://github.com/pytorch/fairseq/blob/master/examples/constrained_decoding
foot_8: https://spacy.io/



Figure 2: The architecture of our proposed model with Target-side Fusion integration strategy.


Table 1: Corpora statistics and AER [%] w.r.t. FastAlign and GIZA++.

Table 2: Experimental results on three translation tasks. "+" represents systems significantly better than the corresponding baseline Transformer at a p-value < 0.05. Time(s) denotes the average translation time (seconds) per sentence.

Table 3: Performance on translation tasks, where our proposed model uses the Tgt-Fusion integration strategy and word alignment from GIZA++. (a) and (b) show translation performance (BLEU) with different alignment sampling ratios and different ABWS models, respectively; in (a) and (b), the asterisk denotes our proposed ABWS model setting. (c) evaluates our model in low-resource cases.

Table 4: Translation examples in which the ABWS model's output augments the MT input. Underlines denote a source word and its corresponding target word in the ABWS output.

