PHRASETRANSFORMER: SELF-ATTENTION USING LOCAL CONTEXT FOR SEMANTIC PARSING

Abstract

Semantic parsing is a challenging task that converts a natural language utterance into a machine-understandable information representation. Recently, approaches using Neural Machine Translation have achieved promising results, the Transformer in particular, because of its ability to learn long-range word dependencies. However, one drawback of adapting the original Transformer to semantic parsing is that it does not express the information in sentences in sufficient detail. Therefore, this work proposes a PhraseTransformer architecture that is capable of a more detailed meaning representation by learning phrase dependencies in the sentence. The main idea is to incorporate a Long Short-Term Memory (LSTM) into the Self-Attention mechanism of the original Transformer to capture more of the local context of phrases. Experimental results show that the proposed model captures detailed meaning better than the Transformer, raises local context awareness, achieves strongly competitive performance on the Geo and MSParS datasets, and leads to SOTA performance on the Atis dataset among methods using Neural Networks. Our architecture (Figure 2) is based on the Encoder-Decoder of the Transformer (Vaswani et al., 2017); we define a new model, named PhraseTransformer, that improves the encoding quality of the Transformer by enhancing the Encoder while keeping the original Decoder.

1. INTRODUCTION

Semantic parsing is an important task that can be applied in many applications, such as Question Answering systems or search systems using natural language (Woods, 1973; Waltz & Goodman, 1977). For example, the sentence "which state borders hawaii" can be represented as a logical form (LF) in λ-calculus syntax: "(lambda $0 e (and (state:t $0) (next to:t $0 hawaii)))". There are various strategies for addressing the semantic parsing task, such as constructing handcrafted rules (Woods, 1973; Waltz & Goodman, 1977; Hendrix et al., 1978), using Combinatory Categorial Grammar (CCG) (Zettlemoyer & Collins, 2005; 2007; Kwiatkowski et al., 2011), adapting statistical machine translation methods (Wong & Mooney, 2006; 2007), or Neural Machine Translation (Dong & Lapata, 2016; Jia & Liang, 2016; Dong & Lapata, 2018; Cao et al., 2019). The key idea of the CCG method is to align sub-parts (lexicons or phrases) of a natural sentence with the corresponding logical form and to learn how best to combine these sub-parts. In more detail, the phrase "borders hawaii" is aligned to "(next to:t $0 hawaii)" in the LF. Conversely, methods using Neural Machine Translation learn an encoder that represents a sentence as a vector and decode that vector into the LF. The current SOTA models are Sequence-to-Sequence with LSTM (Seq2seq) (Dong & Lapata, 2018; Cao et al., 2019) on Geo and Atis, and the Transformer (Ge et al., 2019) on MSParS. Neural methods generally work effectively without any handcrafted features. However, there is still room to improve performance based on the meaning of the local context in phrases. In CCG methods, the semantic representation of a sentence is the combination of sub-meaning representations generated by the phrases in the sentence. However, the Transformer architecture only learns dependencies between single words, without considering the local context given by phrases.
Therefore, we propose a new architecture, named PhraseTransformer, that focuses on learning the relations of phrases in a sentence (Figure 1). To do this, we modify the Multi-Head Attention (Vaswani et al., 2017) by applying the self-attention mechanism to phrases instead of single words. First, we use n-grams to split a sentence into phrases. Then, we use the final hidden state of an LSTM to represent the local context meaning of those phrases. Our contributions are: (1) proposing a novel model based on the Transformer that works effectively for semantic parsing tasks; (2) conducting experiments to confirm the awareness capacity of the model; (3) achieving competitive performance on the Geo and MSParS datasets and new SOTA performance on the Atis dataset among methods using Neural Networks.
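The two steps above (splitting a sentence into n-gram phrases, then summarizing each phrase with the final hidden state of an LSTM that words attend over) can be sketched as follows. This is a minimal PyTorch-style sketch of the idea only: the class and parameter names are ours, not the paper's, and details such as multiple n-gram sizes per layer and the full multi-head integration are omitted.

```python
import torch
import torch.nn as nn


class PhraseRepresentation(nn.Module):
    """Represent each overlapping n-gram phrase by the final hidden
    state of an LSTM run over its words (simplified sketch)."""

    def __init__(self, d_model: int, n: int):
        super().__init__()
        self.n = n  # phrase length (n-gram size)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) word representations
        # Overlapping n-grams: (batch, seq_len - n + 1, n, d_model)
        grams = x.unfold(1, self.n, 1).transpose(2, 3)
        b, p, n, d = grams.shape
        # Run the LSTM over every phrase; keep its final hidden state
        _, (h_n, _) = self.lstm(grams.reshape(b * p, n, d))
        return h_n[-1].reshape(b, p, d)  # one vector per phrase


class PhraseAttention(nn.Module):
    """Attention where queries come from words and keys/values come
    from the LSTM-summarized phrases, so each word attends over
    local-context phrase representations."""

    def __init__(self, d_model: int, n: int, n_heads: int = 4):
        super().__init__()
        self.phrases = PhraseRepresentation(d_model, n)
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ph = self.phrases(x)           # (batch, n_phrases, d_model)
        out, _ = self.attn(x, ph, ph)  # words attend over phrases
        return out
```

For example, a batch of 7-word sentences with trigram phrases, `PhraseAttention(16, n=3)`, yields 5 phrase vectors per sentence and an output with the same shape as the word input, so the module can drop into an encoder layer in place of standard self-attention.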

2. RELATED WORK

In the Semantic Parsing task, recent works have shown that deep learning approaches achieve promising results. These methods can be divided into three groups: Decoder Customization, Data Augmentation, and Weak Supervision.

Decoder Customization. Dong & Lapata apply the Seq2seq model to the semantic parsing task and introduce the Sequence-to-tree (Seq2tree) model (Dong & Lapata, 2016), which constructs the tree structure of the LF. This model focuses on modifying the decoding method, using bracket pairs to start a new decoding level. In another aspect, Dong & Lapata (2018) introduce a new architecture, Coarse-to-Fine (Coarse2Fine), based on a rough sketch of meaning to improve the structure awareness of the Seq2seq model. Similarly, Li et al. (2019) also use the sketch meaning mechanism on the BERT model (Devlin et al., 2019) in two steps: classify the template of the LF, then fill the low-level information into that template. In our opinion, the main problem is to improve the understanding capacity of the model, because semantic parsers need to capture the complexity of natural sentences before decoding. Therefore, our work focuses on designing the Encoder architecture to improve the understanding capacity of the model.

Data Augmentation. Numerous works focus on data augmentation to improve the performance of semantic parsing models (Jia & Liang, 2016; Ziai, 2019; Herzig & Berant, 2019). Jia & Liang (2016) propose three rules based on a Synchronous Context-Free Grammar to recombine data; this step increases the size of the training data and improves the performance of the model. Similarly, Ziai (2019) proposes a method that automatically augments data based on the co-occurrence of words in the sentence. The author separates the training process into two phases: (1) use the augmented data to train BERT (Devlin et al., 2019) and (2) fine-tune on the original data.

Weak Supervision. Some methods use semi-supervised learning for the semantic parsing task (Kočiský et al., 2016; Yin et al., 2018; Goldman et al., 2018; Cao et al., 2019; 2020). These works are promising approaches to the data-hungry problem because of their ability to exploit latent information such as unpaired logical forms. In our proposed model, we aim to construct latent representations for phrases and learn these representations via the self-attention mechanism of the Transformer. We hypothesize that complicated sentences are constructed from various phrases, so learning to represent these phrases makes the model more generalizable.

In the Neural Machine Translation task, approaches using phrase information or constituent trees have proved effective and have attracted many works (Wang et al., 2017; Wu et al., 2018; Wang et al., 2019; Hao et al., 2019; Nguyen et al., 2020). The points that differentiate our work are: (1) our model is capable of learning without any additional information (e.g., a constituent tree); (2) during training, although we do not force the attention or limit the scope of the dependencies, our model automatically pays high attention to the important phrases. Compared with Yang et al. (2018), the purpose of using local context information is similar, but the localness modeling differs: Yang et al. (2018) cast a distance-based Gaussian bias to change the attention scores, while our method is simpler, incorporating multiple n-gram views as various local contexts.

Figure 1: Phrase alignments in PhraseTransformer.
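To make the contrast with Yang et al. (2018) concrete, the distance-based Gaussian localness bias they cast on attention scores can be sketched as below. This is our simplified illustration only: the original work predicts a center position and window per query head, whereas here the window is a fixed constant, and the function name is ours.

```python
import torch


def gaussian_localness_bias(seq_len: int, window: float) -> torch.Tensor:
    """Distance-based Gaussian bias added to attention logits, in the
    spirit of Yang et al. (2018): positions near the query get a bias
    close to 0, distant positions get a strongly negative bias."""
    pos = torch.arange(seq_len, dtype=torch.float)
    dist = pos[None, :] - pos[:, None]       # pairwise position offsets
    return -dist.pow(2) / (2 * window ** 2)  # peaks at distance 0


# Adding the bias before softmax skews attention toward nearby words
scores = torch.randn(5, 5) + gaussian_localness_bias(5, window=2.0)
weights = torch.softmax(scores, dim=-1)
```

In contrast, the PhraseTransformer leaves the attention scores untouched and instead changes what is attended over, feeding LSTM-summarized n-gram views into the attention as local contexts.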

