PRIOR KNOWLEDGE REPRESENTATION FOR SELF-ATTENTION NETWORKS

Abstract

Self-attention networks (SANs) have shown promising empirical results in various natural language processing tasks. Typically, a SAN gradually learns language knowledge from the whole training dataset in a parallel and stacked manner, thereby modeling the language representation. In this paper, we propose a simple and general representation method that incorporates prior knowledge related to the language representation from the beginning of training. The proposed method allows SANs to leverage prior knowledge in a universal way that is compatible with neural networks. Furthermore, we apply it to prior word frequency knowledge for monolingual data and to prior translation lexicon knowledge for bilingual data, respectively, thereby enhancing the language representation. Experimental results on the WMT14 English-to-German and WMT17 Chinese-to-English translation tasks demonstrate the effectiveness and universality of the proposed method over a strong Transformer-based baseline.

1. INTRODUCTION

Self-attention networks (SANs) have attracted increasing attention in the natural language processing community. Instead of using complex recurrent or convolutional neural networks (Sutskever et al., 2014; Bahdanau et al., 2015), SANs first use a positional encoding mechanism (Gehring et al., 2017) to encode order dependencies in the language. The learned positional embedding is then added to the corresponding word embedding to obtain an input representation, based on which SANs perform parallel (multi-head) and stacked (multi-layer) self-attentive functions (Vaswani et al., 2017) to learn the language representation. SAN-based models are iteratively optimized to model language knowledge over the whole training dataset and have achieved state-of-the-art performance in many natural language processing tasks (Barrault et al., 2019; Oepen et al., 2019; Weissenbacher & Gonzalez-Hernandez, 2019; Demner-Fushman et al., 2019).

Despite this success, SANs gradually model language knowledge on batch-level subsets of the data and do not consider prior knowledge over the whole dataset from the beginning of training, which may limit their language representation capability. For example, SAN-based neural machine translation (NMT) models often mistranslate into words that seem natural in the target-language sentence but do not reflect the original meaning of the source-language sentence (Arthur et al., 2016; Wang et al., 2017a). As a result, NMT models produce fluent yet sometimes inadequate translations (Tu et al., 2016; 2017). To address this issue, recent studies have explored the prior knowledge of traditional SMT (Koehn et al., 2003; Liu et al., 2006; Chiang, 2007; Liu et al., 2007), which has a stronger ability to model the adequacy of translation. Taking prior translation lexicon knowledge as an example, Arthur et al. (2016) directly biased or interpolated the bilingual lexicon translation distribution with the output of the softmax layer of NMT to improve the translation of infrequent words. Auxiliary classifiers (Zhao et al., 2018; Wang et al., 2018) were employed to integrate SMT recommendations with NMT generations to produce more faithful translations. In addition, phrase translation rules were used as a recommendation memory to make better predictions in NMT (Wang et al., 2017b; Zhao et al., 2018). Although these studies successfully alleviated the issue of inadequate translations in NMT, they tended to focus on exploring prior translation lexicon knowledge with their own specific methods. In other words, these unique methods make it difficult to explore other prior knowledge in a universal way and to determine whether the improvement comes from the prior knowledge or from the specific method. Meanwhile, these studies directly
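To make the interpolation strategy discussed above concrete, the following is a minimal sketch of mixing a lexicon translation distribution with the decoder's softmax output, in the spirit of Arthur et al. (2016). The function name, the interpolation weight `lam`, and the toy distributions are all illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def interpolate_with_lexicon(model_probs, lexicon_probs, lam=0.5):
    """Linearly interpolate the NMT softmax distribution with a
    bilingual lexicon translation distribution (illustrative sketch).

    model_probs:   (vocab,) softmax output of the NMT decoder
    lexicon_probs: (vocab,) translation probabilities from the lexicon,
                   e.g. attention-weighted lexical probabilities
    lam:           interpolation weight given to the lexicon distribution
    """
    mixed = (1.0 - lam) * model_probs + lam * lexicon_probs
    return mixed / mixed.sum()  # renormalize against rounding error

# Toy example with a 4-word target vocabulary: the decoder prefers a
# fluent but possibly inadequate word (index 0), while the lexicon
# favors an infrequent but faithful translation (index 1).
model = np.array([0.70, 0.10, 0.10, 0.10])
lexicon = np.array([0.05, 0.85, 0.05, 0.05])
print(interpolate_with_lexicon(model, lexicon, lam=0.5))
```

With equal weight, the lexicon evidence is strong enough to flip the prediction toward the faithful word, which is exactly the effect such biasing aims for on infrequent words.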

