PRIOR KNOWLEDGE REPRESENTATION FOR SELF-ATTENTION NETWORKS

Abstract

Self-attention networks (SANs) have shown promising empirical results on various natural language processing tasks. Typically, a SAN gradually learns language knowledge over the whole training dataset in a parallel and stacked manner, thereby building language representations. In this paper, we propose a simple and general representation method that incorporates prior knowledge into language representation from the beginning of training. The proposed method allows SANs to leverage prior knowledge in a universal way compatible with neural networks. We apply it to prior word frequency knowledge for monolingual data and prior translation lexicon knowledge for bilingual data, respectively, thereby enhancing the language representation. Experimental results on the WMT14 English-to-German and WMT17 Chinese-to-English translation tasks demonstrate the effectiveness and universality of the proposed method over a strong Transformer-based baseline.

1. INTRODUCTION

Self-attention networks (SANs) have attracted increasing attention in the natural language processing community. Instead of using complex recurrent or convolutional neural networks (Sutskever et al., 2014; Bahdanau et al., 2015), SANs first use the positional encoding mechanism (Gehring et al., 2017) to encode order dependencies in the language. The learned positional embedding is then added to the corresponding word embedding to obtain an input representation, based on which SANs perform parallel (multi-head) and stacked (multi-layer) self-attentive functions (Vaswani et al., 2017) to learn language representations. SAN-based models are iteratively optimized to model language knowledge over the whole training dataset, and have achieved state-of-the-art performance in many natural language processing tasks (Barrault et al., 2019; Oepen et al., 2019; Weissenbacher & Gonzalez-Hernandez, 2019; Demner-Fushman et al., 2019).

Despite this success, SANs gradually model language knowledge on batch-level data and do not consider prior knowledge over the whole dataset from the beginning of training, which may limit their language representation capability. For example, SAN-based neural machine translation (NMT) models often mistranslate into words that seem natural in the target-language sentence but do not reflect the original meaning of the source-language sentence (Arthur et al., 2016; Wang et al., 2017a). As a result, NMT models produce fluent yet sometimes inadequate translations (Tu et al., 2016; 2017). To address this issue, recent studies explored the prior knowledge that gives traditional SMT its stronger ability to model translation adequacy (Koehn et al., 2003; Liu et al., 2006; Chiang, 2007; Liu et al., 2007). Taking prior translation lexicon knowledge as an example, Arthur et al. (2016) directly biased or interpolated the bilingual lexicon translation distribution with the output of the softmax layer of NMT to improve the translation of infrequent words. Auxiliary classifiers (Zhao et al., 2018; Wang et al., 2018) were employed to integrate SMT recommendations with NMT generations to produce more faithful translations. In addition, phrase translation rules were used as a recommendation memory to make better predictions in NMT (Wang et al., 2017b; Zhao et al., 2018). Although these studies alleviated the issue of inadequate translations in NMT, they tended to explore prior translation lexicon knowledge through method-specific mechanisms. Such specialized methods make it difficult to explore other kinds of prior knowledge in a universal way, and to determine whether the improvement comes from the prior knowledge itself or from the specific method. Meanwhile, these studies directly utilized the probability distribution of the prior knowledge and lacked the neural network's ability to generalize semantically, which further hinders the language representation ability of SANs.

In this paper, we propose a simple and general representation method to introduce prior knowledge into SANs. In particular, we package the prior knowledge related to a source sentence into a continuous-space matrix, which allows SANs to utilize the prior knowledge from the beginning of training and thereby learn language representations in a universal way compatible with neural networks. To maintain the simplicity and flexibility of SANs, we use the prior knowledge representation in parallel and stacked ways to learn the representation of the input sentence. Furthermore, we use the proposed method to explore prior word frequency knowledge for monolingual data and prior translation lexicon knowledge for bilingual data, respectively.

Empirical results on two widely used translation datasets, WMT14 English→German and WMT17 Chinese→English, verify the effectiveness and universality of the proposed method over a strong Transformer-based baseline.

2. SELF-ATTENTION NETWORKS

Self-attention networks (SANs) (Vaswani et al., 2017) are composed of a stack of N identical layers, each of which includes two sub-layers. Formally, given an input sentence of length J, X = {x_1, x_2, ..., x_J}, the positional encoding mechanism (Gehring et al., 2017) is used to compute a positional embedding for each word based on its position index. The positional embedding is then added to the corresponding word embedding to form a combined embedding, yielding a sequence of input representations H^0 = {v_1, v_2, ..., v_J}. The stacked SANs are then organized as follows:

    C^n = LN(SelfAtt^n(Q^{n-1}, K^{n-1}, V^{n-1}) + H^{n-1}),
    H^n = LN(FFN^n(C^n) + C^n),    (1)

where SelfAtt^n(·), LN(·), and FFN^n(·) are the self-attention module, layer normalization (Ba et al., 2016), and feed-forward network of the n-th identical layer, respectively, and C^n is the output of the self-attention sub-layer. Q^{n-1}, K^{n-1}, and V^{n-1} are the query, key, and value matrices transformed from the (n-1)-th layer output H^{n-1}. For example, Q^0, K^0, and V^0 are packed from the H^0 learned by the positional encoding mechanism (Gehring et al., 2017). In particular, SelfAtt^n(·) is applied to {Q^{n-1}, K^{n-1}, V^{n-1}} of the (n-1)-th layer:

    SelfAtt^n(Q^{n-1}, K^{n-1}, V^{n-1}) = softmax(Q^{n-1}(K^{n-1})^T / √d_model) V^{n-1},    (2)

where d_model is the dimension of the query and key vectors. As a result, the output of the N-th layer, H^N, is the representation of the input sentence. Moreover, the self-attention mechanism can be further refined as multi-head self-attention to jointly attend to information from different representation sub-spaces at different positions.
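As an illustrative sketch (not the authors' implementation), the single-head layer computation above can be written in plain NumPy. The function names, the use of random weights, and the omission of biases and multi-head splitting are simplifying assumptions of this example:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    # LN(.): normalize the feature vector at each position
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def self_att(Q, K, V):
    # scaled dot-product attention: softmax(Q K^T / sqrt(d_model)) V
    d_model = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_model)) @ V

def san_layer(H_prev, Wq, Wk, Wv, W1, W2):
    # one identical layer: self-attention sub-layer, then position-wise FFN,
    # each followed by a residual connection and layer normalization
    Q, K, V = H_prev @ Wq, H_prev @ Wk, H_prev @ Wv
    C = layer_norm(self_att(Q, K, V) + H_prev)   # attention sub-layer
    ffn = np.maximum(0.0, C @ W1) @ W2           # two-layer FFN with ReLU
    return layer_norm(ffn + C)                   # FFN sub-layer

# toy sentence of J = 5 positions with d_model = 8
rng = np.random.default_rng(0)
J, d = 5, 8
H0 = rng.normal(size=(J, d))                     # combined embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))
H1 = san_layer(H0, Wq, Wk, Wv, W1, W2)
print(H1.shape)  # (5, 8): one position-wise representation per input word
```

Stacking this function N times, each layer consuming the previous layer's output, yields the sentence representation H^N.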

3. PRIOR KNOWLEDGE REPRESENTATION

In this section, we propose a simple and general representation method to encode prior knowledge, which allows SANs to model the prior knowledge in a manner compatible with neural networks. Given an input sentence X = {x_1, x_2, ..., x_J} of length J, we represent the associated prior knowledge as a matrix M:

    M = [ m_1^1  m_1^2  ...  m_1^K ]
        [ m_2^1  m_2^2  ...  m_2^K ]
        [  ...    ...   ...   ...  ]
        [ m_J^1  m_J^2  ...  m_J^K ],    (3)

where the j-th row denotes the prior knowledge related to word x_j and each element m_j^t is a fixed-size vector. M is then packed into a key and value matrix pair {K, V} for the prior knowledge. The prior {K, V} and the current Q are fed to the self-attention mechanism (see Eq. (2)) to learn a prior knowledge representation PK for the input sentence X:

    PK = LN(SelfAtt(Q, K, V) + H),
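A minimal NumPy sketch of this prior-knowledge attention follows. Flattening the K prior vectors per word into a single shared key/value matrix, and attending without learned projections, are assumptions of this example rather than details confirmed by the text:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def prior_knowledge_repr(H, M):
    """H: (J, d) current sentence representation, used as queries Q.
    M: (J, K, d) prior knowledge matrix, K fixed-size vectors per word."""
    J, K, d = M.shape
    KV = M.reshape(J * K, d)            # pack M into the key/value pair {K, V}
    scores = H @ KV.T / np.sqrt(d)      # queries attend over all prior keys
    PK = softmax(scores) @ KV           # SelfAtt(Q, K, V)
    return layer_norm(PK + H)           # residual connection + layer norm

rng = np.random.default_rng(1)
J, K, d = 5, 3, 8
H = rng.normal(size=(J, d))             # current sentence representation
M = rng.normal(size=(J, K, d))          # prior knowledge vectors per word
PK = prior_knowledge_repr(H, M)
print(PK.shape)  # (5, 8): one prior-knowledge-enriched vector per word
```

Because PK has the same shape as H, it can be consumed by the stacked SAN layers in place of, or alongside, the ordinary input representation.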

