SWITCHING-ALIGNED-WORDS DATA AUGMENTATION FOR NEURAL MACHINE TRANSLATION

Abstract

In neural machine translation (NMT), data augmentation methods such as back-translation make it possible to use extra monolingual data to improve translation performance, but they require additional training data, and in-domain monolingual data is not always available. In this paper, we present a novel data augmentation method for neural machine translation that uses only the original training data, without any extra data. More concretely, when training neural machine translation models we randomly replace words with, or mix them up with, their aligned alternatives in the other language. Since aligned word pairs appear in the same positions as each other during training, this helps the model form bilingual embeddings, which have been shown to provide a performance boost (Liu et al., 2019). Experiments on both small- and large-scale datasets show that our method significantly outperforms the baseline models.

1. INTRODUCTION

Deep neural networks show great performance when trained on massive amounts of data. Data augmentation is a simple but effective technique for generating additional training samples when deep learning models are hungry for data. In computer vision, image data augmentation is standard practice because trivial transformations such as random rotation, resizing, mirroring, and cropping (Krizhevsky et al., 2012; Cubuk et al., 2018) do not change an image's semantics. The existence of such semantically invariant transformations makes data augmentation easy to apply in computer vision research. Unlike the image domain, data augmentation on text for natural language processing (NLP) tasks is usually non-trivial, as transformations must be performed without changing the meaning of the sentence. In this paper we focus on data augmentation techniques for neural machine translation (NMT), which is special and more difficult than other NLP tasks since we must maintain semantic consistency within language pairs that quite possibly come from different domains. Data augmentation techniques in NMT can be divided into two categories depending on whether an additional monolingual corpus is used. If in-domain monolingual training data is available, one successful data augmentation method is back-translation (Sennrich et al., 2016), whereby an NMT model is trained in the reverse translation direction (target-to-source) and then used to translate target-side monolingual data back into the source language. The resulting synthetic parallel corpus can be added to the existing training data to learn a source-to-target model. Other refinements of back-translation include dual learning (He et al., 2016) and iterative back-translation (Hoang et al., 2018).
When in-domain monolingual data is limited, existing methods such as randomly swapping two words, dropping a word, or replacing a word with another one (Lample et al., 2018) are applied to transform the original training data while changing its semantics as little as possible. However, due to the nature of text, these random transformations often result in significant changes in semantics. Gao et al. (2019) propose to replace the embedding of a word with a weighted combination of multiple semantically similar words. Xiao et al. (2019) use a lattice structure to integrate multiple segmentations of a single sentence to perform immediate data augmentation. In this work, we propose Switching-Aligned-Words (SAW) data augmentation, a simple yet effective data augmentation approach for NMT training. It belongs to the second class of data augmentation methods, where in-domain monolingual data is limited. Different from previous methods that conduct semantically invariant transformations within each language, we propose to use another language (the target language) to help make semantically invariant transformations for the current language (the source language) by randomly switching aligned words. We use the unsupervised word aligner fast-align (Dyer et al., 2013) to pair source and target words that have similar meanings. To verify the effectiveness of our method, we conduct experiments on the WMT14 English-to-German and IWSLT14 German-to-English datasets. The experimental results show that our method obtains remarkable BLEU score improvements over strong baselines.
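The switching operation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the replacement probability `p` and the per-pair Bernoulli sampling scheme are illustrative assumptions, and the alignment is assumed to be given as (source index, target index) pairs such as those produced by an unsupervised aligner like fast-align.

```python
import random

def switch_aligned_words(src_tokens, tgt_tokens, alignment, p=0.15, seed=None):
    """Randomly replace source words with their aligned target words.

    alignment: list of (src_idx, tgt_idx) pairs, e.g. from fast-align.
    p: replacement probability per aligned pair (illustrative value).
    """
    rng = random.Random(seed)
    out = list(src_tokens)
    for i, j in alignment:
        if rng.random() < p:
            out[i] = tgt_tokens[j]  # switch in the aligned target word
    return out

# Example: a German source sentence with a word-for-word alignment
src = "das Haus ist klein".split()
tgt = "the house is small".split()
align = [(0, 0), (1, 1), (2, 2), (3, 3)]
print(switch_aligned_words(src, tgt, align, p=0.5, seed=0))
```

With `p=1.0` every aligned source word is switched to its target counterpart; with `p=0.0` the sentence is unchanged, so the hyper-parameter interpolates between the two languages at the token level.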

2. RELATED WORK

In this section we describe related work on data augmentation for NMT, with and without the use of additional monolingual data.

2.1. WITH MONOLINGUAL DATA

The most successful data augmentation technique for leveraging monolingual data in NMT training is back-translation. It requires training a target-to-source system in order to generate additional synthetic parallel data from the monolingual target data. This data complements the human bitext used to train the desired source-to-target system. There has been a growing body of literature that analyzes and extends back-translation. Edunov et al. (2018) demonstrate that it is more effective to generate source sentences via sampling rather than beam search. Hoang et al. (2018) present iterative back-translation, a method for generating increasingly better synthetic parallel data from monolingual data to train NMT models. Fadaee & Monz (2018) show that words with high predicted loss during training benefit most. Wang et al. (2019) propose to quantify the confidence of NMT model predictions based on model uncertainty to better cope with noise in the synthetic bilingual corpora produced by back-translation. Dual learning (He et al., 2016) extends the back-translation approach to train NMT systems in both translation directions. When jointly training the source-to-target and target-to-source NMT models, the two models can provide back-translated data for each other and perform multiple rounds of back-translation. Different from back-translation, Currey et al. (2017) show that low-resource language pairs can also be improved with synthetic data where the source is simply a copy of the monolingual target data. Wu et al. (2019) propose to use noised training to better leverage both back-translation and self-training data.

2.2. WITHOUT MONOLINGUAL DATA

Lample et al. (2018) randomly swap words within a fixed small window or drop some words in a sentence to learn an autoencoder that helps train an unsupervised NMT model. Fadaee et al. (2017) propose to replace a common word with a low-frequency word in the target sentence and change its corresponding word in the source sentence to improve translation quality for rare words.
Xie et al. (2017) replace a word with a placeholder token or with a word sampled from the frequency distribution of the vocabulary, showing that data noising is an effective regularizer for NMT. Kobayashi (2018) proposes an approach that uses prior knowledge from a bi-directional language model to replace a word token in a sentence. Gao et al. (2019) try to replace the id of a word with a soft id; they train Transformer language models on the original training data to obtain the soft words. Wang et al. (2018) introduce a data augmentation method for NMT called SwitchOut, which randomly replaces words in both source and target sentences with other words.
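As a concrete illustration of the replacement-based methods summarized above, a minimal noising function in the spirit of Xie et al. (2017) might look like the sketch below. The probability values, the `<blank>` placeholder token, and the function name are illustrative assumptions, not details taken from the cited papers.

```python
import random

def noise_tokens(tokens, vocab, freqs, p_blank=0.1, p_swap=0.1, rng=random):
    """Noise a token sequence: with probability p_blank replace a token
    with a placeholder, and with probability p_swap replace it with a
    word sampled from the unigram frequency distribution of the vocab."""
    out = []
    for tok in tokens:
        r = rng.random()
        if r < p_blank:
            out.append("<blank>")  # placeholder substitution
        elif r < p_blank + p_swap:
            # frequency-weighted sample from the vocabulary
            out.append(rng.choices(vocab, weights=freqs, k=1)[0])
        else:
            out.append(tok)  # keep the original token
    return out

# Example usage with a toy vocabulary and unigram counts
vocab = ["the", "cat", "sat", "mat"]
freqs = [10, 3, 2, 1]
print(noise_tokens("the cat sat".split(), vocab, freqs))
```

Because each token is perturbed independently, the expected fraction of altered tokens is simply `p_blank + p_swap`, which makes the noise level easy to tune as a regularization strength.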

3. OUR APPROACH

We first describe the background and then our proposed switching-aligned-words data augmentation approach. The framework can be seen as an adversarial training process like Generative Adversarial Networks (GAN) (Goodfellow et al., 2014; Salimans et al., 2016); see Figure 1 for an overview. For



https://github.com/clab/fast_align

