TO UNDERSTAND REPRESENTATION OF LAYER-AWARE SEQUENCE ENCODERS AS MULTI-ORDER-GRAPH

Anonymous

Abstract

In this paper, we propose a unified explanation of representation for layer-aware neural sequence encoders, which regards the representation as a revisited multigraph called multi-order-graph (MoG), so that model encoding can be viewed as a process of capturing all subgraphs in the MoG. The relationship reflected by the multi-order-graph, called n-order dependency, can present what the existing simple directed graph explanation cannot. Our proposed MoG explanation allows us to precisely observe every step of the generation of representation and puts diverse relationships such as syntax into a unified framework. Based on the proposed MoG explanation, we further propose a graph-based self-attention network, Graph-Transformer, which enhances the ability of capturing subgraph information over current models. Graph-Transformer accommodates different subgraphs into different groups, which allows the model to focus on salient subgraphs. Results of experiments on neural machine translation tasks show that the MoG-inspired model yields effective performance improvement.

1. INTRODUCTION

Vaswani et al. (2017) propose a self-attention network (SAN)-based neural network (called Transformer) for neural machine translation (NMT). As the state-of-the-art NMT model, the Transformer has inspired several variants for further performance improvement (Shaw et al., 2018; He et al., 2018) and for other natural language processing tasks such as language modeling (Devlin et al., 2019), parsing (Kitaev & Klein, 2018; Zhou & Zhao, 2019), etc. Similar to recurrent neural network (RNN)-based models (Kalchbrenner & Blunsom, 2013; Bahdanau et al., 2015; Sutskever et al., 2014), SAN-based models try to make the representation of one word contain information about the rest of the sentence in every layer. Empirically, one layer alone cannot yield a satisfactory result, while stacking layers may greatly increase the complexity of the model (Hao et al., 2019; Yang et al., 2019; Guo et al., 2019). Better understanding the representations may help solve this problem and further improve the performance of SAN-based models. It is common to model the representation as a simple directed graph, which views words as nodes and relationships between words as edges. However, such an understanding of representations may still be insufficient to model the various and complicated relationships among words, such as syntax and semantics, let alone present a unified explanation for the representations given by SAN- or RNN-based models (Eriguchi et al., 2016; Aharoni & Goldberg, 2017; Wang et al., 2018b). In addition, a simple directed graph mostly models the relationships among words but is incapable of modeling the relationships among phrases or clauses. To overcome the shortcomings of modeling the representation as a simple directed graph, and in the hope of further improving SAN-based models, in this paper we propose a novel explanation in which the representation generated by a SAN-based model is viewed as a multigraph called multi-order-graph (MoG). In a MoG, a set of nodes and the edges between these nodes form a subgraph.
Meanwhile, one edge not only connects words, but also connects the subgraphs to which those words belong. We therefore call the relationship reflected by MoG an n-order dependency, where n is the number of words involved in the relationship. With such an explanation, we can precisely observe every step of the generation of representation, unify various complicated relationships such as syntax into n-order dependencies, and eventually understand the model encoding. Inspired by our proposed explanation, we further propose a graph-based SAN-empowered Graph-Transformer by enhancing the ability of capturing subgraph information over the current SAN-based sequence encoder. First, we generally define a full representation as the fused result of all concerned subgraph representations. Then we split the representation of one layer into two parts: the previous representation and the incremental representation. The previous representation reflects the full representation from the previous layer, and the incremental representation reflects the new information generated in this layer. Based on this, the encoding process is modified to adapt to such representation division. We split the original self-attention into three independent parts to generate the incremental representation. Our method accommodates subgraphs of different orders into different parts of the incremental representation and reduces information redundancy. To fuse the full representation, we consider three fusing strategies in terms of different weighting schemes, letting the model focus on salient parts of the representation.
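The representation split described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `fuse_representation` and the single-scalar weighting are our own simplifications of the (unspecified here) weighting schemes.

```python
import numpy as np

def fuse_representation(prev_rep, inc_rep, alpha=0.5):
    """Fuse the previous-layer (full) representation with this layer's
    incremental representation via a weighted sum. The scalar weight is
    one illustrative weighting scheme among the possible ones."""
    return alpha * prev_rep + (1.0 - alpha) * inc_rep

# Toy example: 4 tokens, hidden size 8
prev_rep = np.ones((4, 8))    # full representation from the previous layer
inc_rep = np.zeros((4, 8))    # new information generated in this layer
full_rep = fuse_representation(prev_rep, inc_rep, alpha=0.75)
print(full_rep.shape)  # (4, 8)
```

A learned, per-position weighting (rather than a fixed scalar) would let the model focus on salient parts of the representation, in the spirit of the fusing strategies mentioned above.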

2. MULTI-ORDER-GRAPH EXPLANATION

In graph theory, a directed multigraph (or pseudograph) is a graph that is permitted to have multiple parallel edges, i.e., edges that share the same end nodes: two vertices may be connected by more than one directed edge. In fact, a multigraph suffices to reflect the representation generated by a model after encoding, but its definition of edge cannot reflect the relationship between subgraphs and the process of generation. In this paper, we propose a multigraph called multi-order-graph (MoG) for the representation of the input, which redefines edges to reflect the relationships between nodes more accurately.

2.1. ENCODING OF MODELS

Generally speaking, encoding a sentence is a process that transfers a sequence of words into a sequence of vectors. During encoding, the model is treated as a stable function that is independent of the data and whose parameters do not change. The representation generated by the model therefore only reflects information of the input sentence.
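The words-to-vectors view can be made concrete with a toy lookup encoder. This sketch is only illustrative: the vocabulary and the function name `encode` are hypothetical, and real encoders add contextualization across positions rather than encoding each word in isolation.

```python
import numpy as np

def encode(sentence, embed_table):
    """Encoding as a fixed (parameter-frozen) function: map each word of
    the input sentence to a vector. With the table held constant, the
    output depends only on the input sentence."""
    return np.stack([embed_table[w] for w in sentence])

# Hypothetical vocabulary with deterministic embeddings (seeded RNG)
rng = np.random.default_rng(0)
vocab = ["the", "cat", "sleeps"]
table = {w: rng.standard_normal(4) for w in vocab}

vectors = encode(["the", "cat", "sleeps"], table)
print(vectors.shape)  # (3, 4)
```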

2.2. MULTI-ORDER-GRAPH

We generally define a MoG as G = (V, E, SN, TN) over a given sentence S = {s_1, ..., s_n}, in which nodes V = {v_1, ..., v_n} reflect the words of S, edges E = {e_1, ..., e_m} reflect the relationships between words of S, SN = {sn_1, ..., sn_m | sn_j ∈ V, 1 ≤ j ≤ m} is the set of source nodes of the edges in E, and TN = {tn_1, ..., tn_m | tn_j ∈ V, 1 ≤ j ≤ m} is the set of target nodes of the edges in E. Any node v_i ∈ V in G can access any other node in one step. The information captured from S is split into two parts: (1) word information, contained in V, which reflects the words themselves, and (2) relationship information, contained in E, which reflects the relationships of word pairs. Note that the definition of E in G is the main difference between MoG and a standard multigraph. As mentioned above, MoG revises the definition of edges to reflect the relationships between subgraphs of G. In Section 2.4 we discuss the definition of an edge e_j ∈ E, subgraphs, and the relationship between edges and subgraphs in detail.
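The tuple G = (V, E, SN, TN) can be sketched as a small data structure. The class below is a minimal illustration under our own naming choices (the class `MoG` and method `add_edge` are hypothetical); it captures only the multigraph skeleton, not the revised edge semantics that connect subgraphs.

```python
from dataclasses import dataclass, field

@dataclass
class MoG:
    """Minimal sketch of G = (V, E, SN, TN). Edges are stored as a list
    of (source, target) node-index pairs, so parallel edges between the
    same pair of nodes are allowed, as a multigraph permits."""
    words: list                                # V: one node per word
    edges: list = field(default_factory=list)  # E: (sn_j, tn_j) pairs

    def add_edge(self, source, target):
        """Record a relationship between two nodes; repeating the same
        pair creates a parallel edge."""
        self.edges.append((source, target))

    @property
    def SN(self):
        return [sn for sn, _ in self.edges]    # source node of each edge

    @property
    def TN(self):
        return [tn for _, tn in self.edges]    # target node of each edge

g = MoG(words=["the", "cat", "sleeps"])
g.add_edge(0, 1)   # relationship between "the" and "cat"
g.add_edge(0, 1)   # a parallel edge between the same pair
g.add_edge(1, 2)
print(len(g.words), len(g.edges))  # 3 3
```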

2.3. NODE AND WORD

Similar to a simple directed graph, nodes in MoG reflect the words of the input sentence, which means the number of nodes in MoG equals the number of words in the input. Words are represented by nodes of MoG. Without relationships between words, MoG is just a set of graphs each of which has a single node and no edges; in that case every word is independent of the others, and the model cannot enrich word information.

2.4. EDGE AND SUBGRAPH

In this section, we define edge, subgraph, and the relationship between edge and subgraph in MoG. A subgraph of G is a graph whose vertex set is a subset of V and whose edge set is a subset of E. We define Sub_G = {sub_G^1, ..., sub_G^p} as the set of all subgraphs of G. A subgraph can be defined as

