GN-TRANSFORMER: FUSING AST AND SOURCE CODE INFORMATION IN GRAPH NETWORKS

Abstract

As opposed to natural languages, source code understanding is influenced by grammatical relationships between tokens, regardless of their identifier names. Graph representations of source code, such as the Abstract Syntax Tree (AST) and the Control Flow Graph (CFG), can capture grammatical relationships between tokens that are not obvious from the source code text. Most existing methods use late fusion and underperform when supplementing the source code text with a graph representation. We propose a novel method, GN-Transformer, that fuses representations learned from the graph and text modalities under the Graph Networks (GN) framework with an attention mechanism. Our method learns embeddings on a constructed graph called the Syntax-Code Graph (SCG). We perform experiments on the structure of the SCG, an ablation study on the model design, and the hyperparameters to conclude that the performance advantage comes from the fusion method and not from the specific details of the model. The proposed method achieves state-of-the-art performance on two code summarization datasets and across three metrics.

1. INTRODUCTION

Code summarization is the task of generating a readable summary that describes the functionality of a code snippet. The task requires a high-level comprehension of the snippet, which makes it an effective benchmark for evaluating whether a deep learning model can capture the complex relations and structures inside code. Programming languages are context-free formal languages, so an unambiguous representation, the Abstract Syntax Tree (AST), can be derived from any source code snippet. A parse-tree-based representation of code is precise and without noise: an AST accurately describes the structure of a snippet and the relationships between tokens, which provides valuable supplementary information for code understanding. Graph representations of source code have been the focus of multiple code summarization methods. For example, Alon et al. (2019a) encoded AST paths between tokens and aggregated them with an attention mechanism. Huo et al. (2020) extract AST features, but the cross-modal interaction (Veličković, 2019) is very limited since the AST and code features are independently extracted by separate models and then simply concatenated or summed. In this paper we propose a novel architecture, GN-Transformer, shown in Figure 2, that fuses graph information with an equivalent sequence representation. In summary:

• We extend Graph Networks (GN) (Battaglia et al., 2018) to a novel GN-Transformer architecture: a sequence of GN encoder blocks followed by a vanilla Transformer decoder.
• We propose a novel method for early fusion of the AST representation and that of a code snippet sequence, called the Syntax-Code Graph (SCG).
• We evaluate our approach on the task of code summarization and outperform the previous state of the art on two datasets and across three metrics.

Figure 2: Overview of the GN-Transformer architecture. '+' denotes a residual connection followed by a normalization layer. In 'Node embeddings of graph batch', each black bar represents the node embeddings of one graph in the input batch. Blue dots represent token nodes, grey dots denote padding. The node embeddings in the grey box are fetched as input to the decoder, while the AST node embeddings (red dots) are discarded.

We evaluated our model on the Java and Python datasets used by Ahmad et al. (2020) and compared our results to theirs. Two qualitative results are presented in Figure 1. We make our code, trained models, and pre-processed datasets available in our supplementary package, and we will open-source them after the review process concludes.

2. FUSING GRAPH AND SEQUENCE INFORMATION

Previous methods consider sequences and graphs as two modalities that are processed independently. For sequences, recurrent architectures such as RNNs, LSTMs, and GRUs are commonly used; CNNs have also been applied on sliding windows of sequences (Kim, 2014), and Transformers (Vaswani et al., 2017) have become a popular choice in recent years. For graph data, spectrum-based methods like GCNs (Bruna et al., 2014) capture graph structure through a spectrum. Non-spectral methods like GraphSAGE (Hamilton et al., 2017) aggregate information from neighboring nodes using different aggregators, and GAT (Veličković et al., 2018) introduced an attention mechanism to aggregate neighboring information. Early fusion of multiple modalities is a challenging task. As a result, late fusion methods are typically used when considering multi-modal information in code summarization tasks. The cross-modal interactions are less efficient in late fusion than in early fusion (Veličković, 2019). In Section 2.1 we discuss early fusion approaches for a code sequence and an AST. In Section 2.2 we discuss representing sequences as graphs.

2.1. EARLY FUSION OF SEQUENCE AND GRAPH

For early fusion of a sequence and a graph, it is common to represent them under a single unified representation and feed that to a deep learning model. Random walks (Perozzi et al., 2014) and structure-based traversal (SBT) (Hu et al., 2018a) demonstrated the advantages of flattening a graph structure into a sequence. SBT converts an AST into a sequence that can be used by seq2seq architectures. However, structural information contained in the AST is lost when it is flattened into an unstructured sequence. The main motivation of previous methods for flattening a graph structure into a sequence, rather than the opposite, is the power of sequence models compared to their graph counterparts. Recent advances in general frameworks such as Graph Networks (Battaglia et al., 2018) and Message-Passing Neural Networks (MPNN) (Gilmer et al., 2017) proposed unified graphical representations of data. Battaglia et al. (2018) proposed an extension and generalization of previous approaches that can learn a graphical representation of data under a configurable computation unit, the GN block. We discuss our GN-block-based encoder in Section 4.2. There are several benefits to a graph representation. Firstly, it can contain different information sources with arbitrary relational inductive biases (Battaglia et al., 2018). Secondly, a graph representation can have explicit relational inductive biases among graph entities. Thirdly, the graph structure of the input can be flexibly augmented through expert knowledge. Additionally, graphs offer better combinatorial generalization (Battaglia et al., 2018) due to the reuse of multiple information sources simultaneously in a unified representation. Finally, measures for analyzing performance through graph structure, such as the average path length L and the clustering coefficient C, can be naturally incorporated. We discuss this in more detail in Section 3.2.
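To make the flattening concrete, here is a minimal Python sketch of converting an AST into a bracketed token sequence in the spirit of SBT (Hu et al., 2018a). The exact token format of SBT differs, so `flatten_ast` and its bracketing scheme are illustrative assumptions, not the original algorithm.

```python
import ast

def flatten_ast(node):
    """Flatten an AST into a bracketed token sequence: each subtree is
    wrapped in '(' ... ')' followed by a repeat of the node label, so
    the tree shape remains recoverable from the flat sequence."""
    label = type(node).__name__
    seq = ["(", label]
    for child in ast.iter_child_nodes(node):
        seq += flatten_ast(child)
    seq += [")", label]
    return seq

tokens = flatten_ast(ast.parse("x = 1 + 2"))
# The sequence opens and closes with the root label 'Module' and contains
# one bracketed span per AST node, e.g. '(' 'BinOp' ... ')' 'BinOp'.
```

Note that a plain pre-order traversal without the brackets would lose the tree structure entirely, which is exactly the information-loss problem described above.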

2.2. GRAPH REPRESENTATIONS FOR SEQUENCE

In Graph Networks, defining the graph structure of the input data is the main way to introduce relational inductive biases (Battaglia et al., 2018). Relational inductive biases impose constraints on the relationships among entities. In Graph Networks, the assumption that two entities have a relation or interaction is expressed by an edge between the corresponding entity nodes. The absence of an edge expresses the assumption of isolation, meaning there is no relation or direct influence between two nodes. Sequences are unstructured data in which relationships are implicit. Implicit relationships can be represented with each token as a graph node that is fully connected with all other nodes. Such a representation allows each node to interact with every other node with no explicit assumptions on isolation, and thus allows the model to infer the relationships between them implicitly. Transformers can be regarded as inferring on such a fully connected graph; see Figure 3(a). Each token in an input sequence corresponds to a node. The relationship between tokens is then represented by attention weights: high attention values correspond to strong interaction, while low attention means isolation. It is less efficient for a model to learn to infer relationships without any explicit assumptions of interaction and isolation. An AST provides precise information about interaction and isolation among tokens in source code, since it is a noise-free representation of how the tokens interact during the execution of a code snippet. Thus a natural way of utilizing information from an AST is fusing its graph structure with the input sequence. We can find an explicit mapping between tokens in a sequence and nodes in the AST through the scope information provided by a parser for a given programming language. We are then able to find relations, and thus build edges, between a sequence and a graph. We further discuss the graph structure of the input data and how we fuse the AST with it in Section 3.
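This view can be made concrete in a few lines of NumPy: full attention corresponds to an all-ones adjacency matrix, while an AST-derived graph zeroes out entries to encode isolation. The function below is an illustrative sketch, not the paper's implementation.

```python
import numpy as np

def masked_attention_weights(scores, adj):
    """Softmax over each row of `scores`, restricted to edges in `adj`.

    adj[i, j] == 1 means node j interacts with node i (an explicit edge);
    adj[i, j] == 0 means the two nodes are assumed isolated."""
    masked = np.where(adj == 1, scores, -np.inf)
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# A 3-token sequence viewed as a graph. A vanilla Transformer uses the
# fully connected adjacency; a graph-derived structure zeroes out entries.
scores = np.array([[0.5, 1.0, 0.2],
                   [0.1, 0.3, 0.9],
                   [0.4, 0.4, 0.4]])
full = np.ones((3, 3))            # fully connected: no isolation
sparse = np.array([[1, 1, 0],     # tokens 0 and 2 are isolated
                   [1, 1, 1],
                   [0, 1, 1]])

w_full = masked_attention_weights(scores, full)
w_sparse = masked_attention_weights(scores, sparse)
```

With the sparse adjacency, the weights on absent edges are exactly zero, so the model cannot exchange information along them — the explicit isolation assumption described above.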
The idea of fusing a sequence and a graph can be extended to broader cases apart from AST and code snippet when the mapping is not explicit. There are techniques that are applied in knowledge graphs to find a mapping between entities in knowledge graphs and words in sequence (Peters et al., 2019) . For our problem statement, we make the reasonable assumption that an explicit mapping between the sequence and a graph can be provided.

3. GRAPH STRUCTURE OF AST AND CODE

First, we propose a simple joint graph representation of source code and AST called the standard Syntax-Code Graph (SCG) in Section 3.1. In Section 3.2 we discuss the influence of the structure of the SCG and the existence of a theoretically optimal graph structure.

3.1. SYNTAX-CODE GRAPH

The standard SCG consists of AST nodes taken directly from the AST and token nodes created for each code snippet token (referred to in the remainder of the paper simply as token nodes). The attribute of an AST node is its type in the AST, such as "NameExpr". The attribute of a token node is its identifier name, such as "a", "int", or "+". The standard SCG preserves the AST structure and introduces additional edges that connect token nodes with their direct parent node in the AST. The direct parent of a token node is found through the scope information of AST nodes. Taking the statement 'a[i]=4+2' as an example, the AST is shown in Figure 3(b). The scope of an AST node is determined by the positional mark given by a compiler. The positional mark of the AST node 'AdditiveExpr' is line 1, col 6 ∼ line 1, col 8, which corresponds to '4+2', so its scope covers the token nodes '4', '+', and '2'. A token may be covered by the scopes of multiple AST nodes; we only connect the token with its direct parent, which is the deepest AST node among all AST nodes whose scope covers the token. Figure 3(c) shows the standard SCG. The standard SCG builds directly on top of the AST structure without introducing any additional relational assumptions, and thus objectively reflects the program structure as depicted by a compiler.
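The direct-parent rule can be sketched for Python with the standard `ast` and `tokenize` modules, which expose the positional marks (line/column spans) described above. The helper name and the narrowest-span tie-breaking are our own illustrative choices, and end positions require Python ≥ 3.8.

```python
import ast
import io
import tokenize

def direct_parent_edges(code):
    """Connect each source token to its direct parent: the deepest AST
    node whose positional span (its scope) covers the token."""
    tree = ast.parse(code)
    nodes = [n for n in ast.walk(tree)
             if getattr(n, "end_lineno", None) is not None]
    edges = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type not in (tokenize.NAME, tokenize.NUMBER, tokenize.OP):
            continue
        covering = [n for n in nodes
                    if (n.lineno, n.col_offset) <= tok.start
                    and tok.end <= (n.end_lineno, n.end_col_offset)]
        if covering:
            # deepest parent = covering node with the narrowest span
            parent = min(covering,
                         key=lambda n: (n.end_lineno - n.lineno,
                                        n.end_col_offset - n.col_offset))
            edges.append((type(parent).__name__, tok.string))
    return edges

edges = direct_parent_edges("a[i] = 4 + 2\n")
```

On the running example, the '+' token attaches to the `BinOp` node (Python's analogue of 'AdditiveExpr'), while '4' and '2' attach to their own `Constant` nodes, mirroring the direct-parent edges in Figure 3(c).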

3.2. OPTIMAL SYNTAX-CODE GRAPH STRUCTURE

The graph structure of the SCG plays a critical role in our method. You et al. (2020) represented neural network structure as relational graphs. Message passing in a relational graph is analogous to information propagation in Graph Networks, so the SCG can be directly interpreted as the relational graph of our model: node features correspond to node embeddings in the SCG, the message function is the node-wise FFN, and the aggregation function is MHA. As a result, the graph structure of the SCG influences the performance of the model. The standard SCG structure discussed in Section 3.1 is efficient but may not be optimal. Qualitatively, we observed two problems with the standard SCG. The first problem is the existence of long-range dependencies: it is hard to pass information between nodes separated by long paths in the SCG. In Graph Networks, each node executes information propagation and aggregation simultaneously with its neighboring nodes; one graph neural network (GNN) layer can be regarded as executing one round of information propagation and aggregation. In Figure 3(c), it would take three GNN layers for information to propagate from the leaf node 'a' to the root node 'AssignExpr'. Thus it is difficult to propagate information between distant nodes in a large AST. The second problem is the isolation among token nodes: source code token nodes can only interact with each other indirectly, through AST nodes. Taking '4+2' in Figure 3(c) as an example, there is no direct edge between the token nodes '4' and '2'; they can only interact indirectly through the AST node 'AdditiveExpr'. However, the information passed from 'AdditiveExpr' to '4' and to '2' is identical. This makes it difficult for token nodes under the same expression to learn more complex relationships with each other. We analyzed the test set of the Java dataset and present the results in Figure 4(a).
Longer code usually contains more long-range dependencies and complex expressions, which correspond to the two problems above. Hand-engineering the graph structure with expert knowledge may alleviate these problems and thus improve performance. However, introducing redundant edges can lead to performance degradation. We perform two experiments on opposite ends of the spectrum of graph structures to examine the problems above. Variant 1 adds shortcut edges between each AST node and the token nodes covered by its scope, so the distance between an AST node and every token node within its scope is shortened to 1. Variant 2 makes the token nodes fully connected, so there is no isolation at all. We refer to these as Variant 1 and Variant 2 in the remainder of this text; more details on the two variants are given in Appendix D. Both variants failed to improve performance, due to the introduction of redundant connections and the loss of structural information respectively, leading us to conclude that a trade-off between the two is required. We discuss this in further detail in Section 5.2. We also analyzed the optimal graph structure through quantitative experiments. You et al. (2020) proposed to measure the effectiveness of message passing in a relational graph by the average path length (L) and the clustering coefficient (C), and claim that the optimal structure is a balance between the values of C and L. Figure 4(b) shows L and C for our variants, for the standard AST, and for a fully connected relational graph such as that of a Transformer, which can be formulated as a relational graph in the same way as ours but with a different structure. A vanilla Transformer encoder has the maximum C of 1 and the minimum L of 1. Our standard SCG, at the other extreme, has the minimum C of 0 and a relatively high L due to its tree structure: the average L of the standard SCG is 6.28 in the Java dataset and 6.35 in the Python dataset.

Figure 4(b) also shows that we explored only a small area of the entire design space of possible graph structures over L and C. You et al. (2020) proposed that the optimal structure is usually located around a "sweet spot" in the design space between the extreme cases of tree and fully connected structures, and that model performance improves as the structure approaches this "sweet spot". Thus we conclude that there is large potential for improving model performance by improving the input structure, either through hand-engineering or through additional rules tailored to the specific problem domain.
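The two measures can be computed directly from an adjacency list. The pure-Python sketch below (illustrative, assuming a connected graph) reproduces the two extreme cases cited above: a fully connected graph has C = 1 and L = 1, while a tree has C = 0 and a longer average path.

```python
from itertools import combinations

def avg_path_length(adj):
    """Average shortest-path length L over all ordered node pairs,
    computed by BFS from every node (assumes a connected graph)."""
    n = len(adj)
    total, pairs = 0, 0
    for s in range(n):
        dist, frontier = {s: 0}, [s]
        while frontier:
            nxt = []
            for u in frontier:
                for v in adj[u]:
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        nxt.append(v)
            frontier = nxt
        for t in range(n):
            if t != s:
                total += dist[t]
                pairs += 1
    return total / pairs

def clustering_coeff(adj):
    """Average clustering coefficient C: for each node, the fraction of
    its neighbor pairs that are themselves connected."""
    coeffs = []
    for u in range(len(adj)):
        nbrs = adj[u]
        if len(nbrs) < 2:
            coeffs.append(0.0)
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        coeffs.append(2 * links / (len(nbrs) * (len(nbrs) - 1)))
    return sum(coeffs) / len(adj)

# Fully connected graph on 4 nodes (Transformer view): C = 1, L = 1.
full = {u: [v for v in range(4) if v != u] for u in range(4)}
# A small tree (AST-like): C = 0 and a longer average path.
tree = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
```

Running both functions on `full` and `tree` places the two graphs at the opposite corners of the (L, C) design space discussed above.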

4. MODEL ARCHITECTURE

Our model follows the generic Transformer encoder-decoder architecture. The encoder is an extended architecture of Graph Networks with multiple GN blocks; the decoder is a vanilla Transformer decoder. An overview of the architecture is presented in Figure 2. We introduce the overall structure in Section 4.1 and then present our Graph Networks based encoder in Section 4.2.

4.1. ENCODER-DECODER ARCHITECTURE

The encoder consists of a stack of N GN-Transformer blocks, which are derived from GN blocks, the main computation unit in Graph Networks: a GN block takes a graph as input and returns a graph as output (Battaglia et al., 2018). Each block in our model implements a multi-head attention (MHA) sublayer and a feed-forward network (FFN) sublayer, equivalent to a Transformer encoder layer. The encoder accepts a graph G = {V, E} with node features h as input, where V is the node set and E is the edge set. h ∈ R^{|V|×d_model} are the node features of G, where d_model is the input and output dimension of the encoder. The node features of the input graph are initialized through an embedding layer: AST nodes and token nodes fetch an embedding vector according to their type and identifier name, respectively. In our implementation, we handle code token identifiers and AST node types separately when performing the embedding lookup, to avoid naming conflicts between the two. The encoder outputs the graph with updated node features h′. Only the feature vectors of token nodes are fetched as the decoder input; the feature vectors of AST nodes are discarded, and the token embeddings are padded for batching (see Figure 2). Residual connections and layer normalization, as used in Transformers, are also employed within each block of our model. Moreover, due to the equivalence of the network architectures, the number of parameters of our model is the same as that of a vanilla Transformer with the same configuration.
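The "fetch token nodes, discard AST nodes, pad for batching" step can be sketched in NumPy as follows; the function and variable names are illustrative, not from our released code.

```python
import numpy as np

def gather_token_features(h, is_token, pad_value=0.0):
    """Given encoder outputs `h` (one row per SCG node) for a batch of
    graphs, keep only the token-node rows and pad them to a common
    length for the decoder.

    h:        list of arrays, one per graph, shape (num_nodes_i, d_model)
    is_token: list of boolean arrays marking which nodes are token nodes
    Returns a (batch, max_tokens, d_model) array and a padding mask."""
    kept = [hi[mask] for hi, mask in zip(h, is_token)]
    max_len = max(k.shape[0] for k in kept)
    d = kept[0].shape[1]
    out = np.full((len(kept), max_len, d), pad_value)
    pad_mask = np.zeros((len(kept), max_len), dtype=bool)
    for b, k in enumerate(kept):
        out[b, :k.shape[0]] = k
        pad_mask[b, :k.shape[0]] = True
    return out, pad_mask

h = [np.arange(10).reshape(5, 2).astype(float),   # graph with 5 nodes
     np.arange(8).reshape(4, 2).astype(float)]    # graph with 4 nodes
is_token = [np.array([True, False, True, False, True]),  # 3 token nodes
            np.array([True, True, True, True])]          # 4 token nodes
batch, mask = gather_token_features(h, is_token)
```

The AST-node rows never reach the decoder, matching the grey-box selection shown in Figure 2.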

4.2. GN-TRANSFORMER BLOCKS

We extend the GN blocks proposed by Battaglia et al. (2018) by defining MHA and FFN in this context, and we call the result a GN-Transformer block. One GN-Transformer block executes one round of information propagation and aggregation with the neighboring nodes, updating the node and edge attributes of the graph. This is done by two sub-blocks: an edge block that updates edge attributes, followed by a node block that updates node attributes. Note that for our problem the initial edge attribute matrix of an SCG simply encodes the presence of a connection between two nodes.

The edge block updates edge attributes through the edge update function

φ^e(h_i, h_j) = (h_i W_γ^Q)(h_j W_γ^K)^T / √d_k,

where h_i and h_j are the attributes of nodes i and j respectively, and E_{ij}^{(γ)} = φ^e(h_i, h_j) is the updated attribute of the edge from node i to node j under attention head γ. Each GN-Transformer block has H attention heads, and W_γ^K ∈ R^{d_model × d_k} and W_γ^Q ∈ R^{d_model × d_k} are the parameter matrices of head γ. The edge aggregation function

ρ^{e→v}(E_i) = Concat(head_i^{(1)}, ..., head_i^{(H)}) W^O,  with  head_i^{(γ)} = Σ_{j ∈ N_i} α_{ij}^{(γ)} h_j W_γ^V,

aggregates the edge updates, where α_{ij}^{(γ)} = σ(E_i^{(γ)})_j is the attention from node i to its set of neighboring nodes N_i, σ(·) is the softmax function, and W_γ^V ∈ R^{d_model × d_v} and W^O ∈ R^{H d_v × d_model} are parameter matrices. The node block updates node attributes through the node update function

φ^v(h_i, ρ_i^{e→v}) = FFN(ĥ_i) + ĥ_i,  where  ĥ_i = h_i + ρ_i^{e→v}  and  FFN(x) = max(0, x W_1 + b_1) W_2 + b_2.

φ^e together with ρ^{e→v} define the MHA; φ^v implements a node-wise FFN with residual connections, similar to the point-wise FFN in a Transformer. For our experiments we also use dropout and layer normalization in the same way as Vaswani et al. (2017).
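The three update functions can be combined into a single forward pass. The NumPy sketch below is an illustrative re-implementation of the equations (no dropout, post-layer-norm placement assumed as in Vaswani et al. (2017), parameter names our own); it is not the released model code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gn_transformer_block(h, adj, params, num_heads):
    """One GN-Transformer block: edge update (scaled dot-product scores on
    existing edges), edge aggregation (softmax over neighbors, i.e. MHA
    restricted to the graph), then a node-wise FFN, with residuals and
    layer norm. Every node needs at least one neighbor (or a self-loop)
    so the softmax is well defined."""
    n, d_model = h.shape
    d_k = d_model // num_heads
    heads = []
    for g in range(num_heads):
        Wq, Wk, Wv = params["Wq"][g], params["Wk"][g], params["Wv"][g]
        # edge update: E_ij = (h_i Wq)(h_j Wk)^T / sqrt(d_k), edges only
        scores = (h @ Wq) @ (h @ Wk).T / np.sqrt(d_k)
        scores = np.where(adj == 1, scores, -np.inf)
        alpha = softmax(scores)            # attention over neighbors N_i
        heads.append(alpha @ (h @ Wv))     # head_i = sum_j alpha_ij h_j Wv
    agg = np.concatenate(heads, axis=-1) @ params["Wo"]
    h_hat = layer_norm(h + agg)            # residual + norm (MHA sublayer)
    ffn = np.maximum(0.0, h_hat @ params["W1"] + params["b1"]) @ params["W2"] + params["b2"]
    return layer_norm(h_hat + ffn)         # residual + norm (FFN sublayer)

rng = np.random.default_rng(0)
n, d_model, H = 4, 8, 2
d_k = d_model // H
params = {
    "Wq": [rng.normal(size=(d_model, d_k)) for _ in range(H)],
    "Wk": [rng.normal(size=(d_model, d_k)) for _ in range(H)],
    "Wv": [rng.normal(size=(d_model, d_k)) for _ in range(H)],
    "Wo": rng.normal(size=(d_model, d_model)),
    "W1": rng.normal(size=(d_model, 2 * d_model)),
    "b1": np.zeros(2 * d_model),
    "W2": rng.normal(size=(2 * d_model, d_model)),
    "b2": np.zeros(d_model),
}
h = rng.normal(size=(n, d_model))
adj = np.ones((n, n))   # an SCG adjacency would zero out non-edges
out = gn_transformer_block(h, adj, params, num_heads=H)
```

With `adj` all ones this reduces to a standard Transformer encoder layer, which is why the parameter count matches a vanilla Transformer of the same configuration.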

5. EXPERIMENT

We evaluated our model on two code summarization datasets. Additionally, we performed experiments on the hyperparameters, the model structure, and variants of the graph structure. Our experimental settings are presented in Section 5.1. Results are presented and analyzed in Section 5.2.

5.1. SETTINGS

The experiments are conducted on a Java dataset (Hu et al., 2018b) and a Python dataset (Barone & Sennrich, 2017). We used JavaParser (https://javaparser.org/) to extract the AST and javalang (https://github.com/c2nes/javalang) for parsing Java source code, and the Python ast module (https://docs.python.org/3/library/ast.html) to parse and extract the AST for Python. As our main comparison baseline we chose the Transformer of Ahmad et al. (2020), which achieved state of the art on the two datasets. Ahmad et al. (2020) proposed a base model, which is a vanilla Transformer, and a full model that adds relative position embeddings and a copy mechanism. We reproduced their results on our preprocessed datasets, which are composed of the source code, the corresponding AST, and a text summary. We also compared our method with the results of the other baselines reported in Ahmad et al. (2020). Additional details about the datasets and preprocessing are in Appendix B. To evaluate our model we used the same metrics reported by all baselines: BLEU, ROUGE-L, and METEOR. We applied the same hyperparameters as Ahmad et al. (2020); they are listed in Appendix A.

5.2. RESULTS AND ANALYSIS

Results on Java and Python datasets. Table 1 compares our model, GN-Transformer, with previous works in code summarization. Our method outperformed all previous works in all metrics. The most suitable comparison for our approach among all previous works is the vanilla Transformer rather than Transformer (full): the only difference between our work and the vanilla Transformer of Ahmad et al. (2020) is the GN-Transformer encoder block and the SCG structure, which is the scope of this paper. In contrast, Transformer (full) makes use of positional embeddings and a copy mechanism introduced by See et al. (2017); Shaw et al. (2018). Such improvements apply to all sequence models irrespective of the domain, but they do not directly apply to a graph structure. Analogous techniques introduced to Graph Networks should yield corresponding improvements in our model; however, we consider this to be outside the scope of this paper. Our experiments on positional encoding and the copy mechanism are presented in Appendix C. We applied the same hyperparameters and network configuration as the vanilla Transformer proposed in Ahmad et al. (2020). The results show an improvement of 1.49 and 2.09 BLEU, 0.51 and 1.10 METEOR, and 1.99 and 2.34 ROUGE-L on the Java and Python datasets respectively. Comparisons with more baselines are shown in Table 2 in Appendix C.

Additional experiments on the Java dataset. We conducted additional experiments in the following three directions on the Java dataset. Results are shown in Table 3; Appendix D provides more details about the settings. (A) Hyperparameters. We experimented with different combinations of hyperparameters, including the number of layers, the embedding size, the width of the FFN, and the configuration of attention heads. (B) Model structure. We first tested a 2-hop GN-Transformer block that uses two MHA sublayers in each block, thus aggregating two hops of information. The results show that this harms performance: collecting two hops of information without an FFN in between is less expressive. We then conducted an ablation study to show the advantage of the GN-Transformer blocks. We replaced the GN-Transformer blocks with Graph Attention Network (GAT) layers with the same hyperparameters as the base model. GAT did not perform well on our problem when modeling the text sequence as a graph; the GN-Transformer block largely outperformed GAT even when we compared configurations with similar parameter counts as in (A). We then added an FFN to the GAT layers as an ablation on MHA, to isolate its impact. The results show that MHA brings a significant improvement. We thus conclude that both the FFN and the MHA used by our GN-Transformer blocks are necessary and greatly improve performance. (C) Variants of graph structures. We tested the two variants of graph structures discussed in Section 3.2. Both variants underperform, with Variant 1 much closer to the standard SCG than Variant 2. Variant 1 underperforms because it introduces redundant edges, which lead to redundant interactions between the AST and the tokens; the degradation is small, since the structural information and the isolation among tokens introduced by the AST are preserved. In Variant 2, the code sequence tokens are fully connected, so the isolation among tokens is completely lost, which leads to the loss of structural information. These experiments explain why Ahmad et al. (2020) did not improve performance when using SBT (Hu et al., 2018a): in the flattened SBT sequence all nodes are fully connected, so the structural information of the AST graph is not preserved, similar to Variant 2; moreover, SBT introduces redundant interactions between AST nodes and token nodes, similar to Variant 1. However, Variant 2 still achieved improvements of 0.43 BLEU, 0.47 METEOR, and 0.78 ROUGE-L compared to a vanilla Transformer, which shows the usefulness of the AST.

From our ablation study on the graph structure variants, we conclude that the causes of degradation are redundant edges and the loss of structural information.

6. RELATED WORKS

Integrating valuable information from graph representations like the AST, CFG, and PDG has long been a focus of deep learning for code understanding. Alon et al. (2019b) retrieve AST embeddings with random walks, concatenate them with token embeddings, and aggregate them into a context vector with an attention mechanism. Hu et al. (2018a) proposed structure-based traversal (SBT), which flattens the AST into a sequence. Huo et al. (2020) applied DeepWalk with a CNN and an LSTM to learn a CFG representation, which is then concatenated with the source code representation. LeClair et al. (2020) used a GCN to learn AST embeddings and concatenated them with source code embeddings. Wang et al. (2020) augmented AST features by adding edge type information that represents control and data flow, then applied a gated GNN to learn the augmented-AST embedding. Most existing methods fuse AST and code information by late fusion (Baltrušaitis et al., 2019), concatenating embeddings of the two modalities with few cross-modal interactions. Veličković (2019) investigated the possibility of early fusion and proposed cross-connections between models to enable the sharing of cross-modal information. Tree-LSTM (Tai et al., 2015) fused tree information into a sequence through a tree-structured LSTM. Another category of methods models information sources as heterogeneous graphs that are then processed by GNNs: Yao et al. (2019) constructed heterogeneous graphs of documents and text, and Ren & Zhang (2020) modeled text as a heterogeneous graph of topics and entities. Research on general deep learning frameworks has proposed training on a uniform representation and model framework (Battaglia et al., 2018; Gilmer et al., 2017; Wang et al., 2018). Graph Networks unify deep learning models on a powerful graph representation that can express arbitrary relational inductive biases between entities (Battaglia et al., 2018). On the other hand, You et al. (2020) reveal the relation between graph structure and neural network structure, providing a more comprehensive view of the impact of graph structure in deep learning models.

7. CONCLUSION

In this paper, we analyzed the fusion of a sequence and a graph from the novel perspective of Graph Networks. We proposed the GN-Transformer together with the Syntax-Code Graph. Our method achieved state of the art on two code summarization datasets. We also performed experiments on hyperparameters, the model structure, and graph structures. Future work includes finding the optimal structure of the SCG and fusing supplementary information such as the CFG. Due to the similarity with the Transformer, ideas such as masked pre-training, positional encoding, and copy mechanisms for Transformers are also worth interpreting and implementing in the context of Graph Networks. A decoder designed for graph representations may also improve performance: our method discards the AST node embeddings, although they naturally contain additional structural information that should be useful to the decoder. The simplicity of our method allows it to be extended to other domains in which there is a duality between sequence and graph representations.

B DATASET DETAILS

We kept the provided split of the Java dataset and we split the Python dataset by 6:2:2; all of the above are consistent with Ahmad et al. (2020). There are two differences between our preprocessed datasets and those of Ahmad et al. (2020): 1. Data cleaning: we discarded the samples that cannot be parsed by the compiler; 160 out of 87136 samples in the Java dataset and 4894 out of 113108 samples in the Python dataset are discarded. In the Java dataset, no samples were removed by Ahmad et al. (2020). For the Python dataset, Ahmad et al. (2020) follow the cleaning process of Wei et al. (2019b), which removes samples that exceed a length threshold, discarding 20563 out of 113108 samples. 2. Python dataset: we used the same preprocessing method for Python as for the Java dataset. Ahmad et al. (2020) deleted special characters in the Python source code while we preserved them; as a result, the average code length of the dataset preprocessed by them is 47.98, while ours is 132.64.
Apart from the above two differences, the preprocessing methodology is consistent with Ahmad et al. (2020) as well as with the other baselines they report.

C ADDITIONAL EXPERIMENT RESULTS

The comparison between our method and additional baselines is shown in Table 2. Table 3 shows the results of the additional experiments. Table 4 shows experimental results on positional encoding and the copy mechanism. For the experiments on positional encoding, we applied a learnable absolute positional embedding (APE) layer, as often used in Transformers. We use the summation of the APE and the node embeddings fetched from the input embedding layer as the input to the encoder. We also tested APE only on token nodes, using padding for AST nodes. For the experiments on relative positional encoding (RPE) on our model, we had to adapt the original definition for sequence models to a graph representation: when RPE is applied to sequences it requires the sequence nodes to be fully connected, which is not the case for the SCG. Instead, we applied a two-layer P-GNN (You et al., 2019), which learns relative positional information on a graph, to learn an RPE for each node. We used an APE as the input to the P-GNN, with an embedding size of 512. We applied a fixed number of 6 anchor sets with 2 copies each, instead of the adjusted anchor set number of You et al. (2019). We concatenate the RPE and the node embeddings fetched from the input embedding layer, reduce the dimension to d_model by a linear layer, and provide that as input to the encoder. Next, we tested applying RPE only on token nodes by replacing the RPE of AST nodes with padding. For the copy mechanism, we tested the same copy mechanism as Ahmad et al. (2020). All hyperparameters are the same and are listed in Appendix A. The results show that APE harms performance on both the Java and Python datasets; when applying APE only to token nodes, the degradation is minor.
We hypothesize that this is because APE is not useful for AST nodes: the APE of a token node represents its absolute position in the input sequence, whereas the position of an AST node in the input sequence should be a scope rather than a single absolute position. Absolute positional information is therefore not useful for the AST nodes of a standard SCG. The results for RPE, in the interpretation we made for graph representations, are also not promising. We chose a fixed number of anchor sets for all graphs, while You et al. (2019) dynamically choose the number of anchor sets for different graph sizes. Our number of anchor sets may be inadequate to capture accurate relative positional information in large graphs; this is suggested by the fact that the average node count of the Python dataset is 70.10, compared to 50.30 for the Java dataset, and is reflected in the results, since we obtain improvements on some metrics for the Java dataset but performance degradation for Python. Neither positional encoding nor the copy mechanism improved performance across all metrics, although both were close to the base model. We hypothesize that both mechanisms have shortcomings when applied to graphs and to the SCG. Our results show that we cannot apply tricks designed for the Transformer to our model without further consideration of the problem domain and the graph modality. Ideas from sequence models can still be adapted to the graph representation, but they require additional investigation.
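As a sketch of the "APE on token nodes only" setup, the snippet below adds a positional encoding to token nodes and leaves AST nodes zero-padded. We use a fixed sinusoidal table for illustration, whereas the experiment above used a learnable embedding; the function names are our own.

```python
import numpy as np

def sinusoidal_pe(num_positions, d_model):
    """Fixed sinusoidal positional encoding table, as in Vaswani et al."""
    pos = np.arange(num_positions)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def add_token_only_ape(node_emb, token_positions):
    """Add a positional encoding only to token nodes; AST nodes
    (position None) receive zero padding instead."""
    d = node_emb.shape[1]
    pe = sinusoidal_pe(1 + max(p for p in token_positions
                               if p is not None), d)
    out = node_emb.copy()
    for idx, p in enumerate(token_positions):
        if p is not None:
            out[idx] += pe[p]
    return out

node_emb = np.zeros((5, 8))
# nodes 0 and 3 are AST nodes (no sequence position); the rest are tokens
positions = [None, 0, 1, None, 2]
out = add_token_only_ape(node_emb, positions)
```

The AST rows stay untouched, mirroring the padding described above for the "token nodes only" configuration.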

D DETAILS OF ADDITIONAL EXPERIMENTS

Here we show more details about the additional experiments discussed in Section 5.2. For additional experiment (B), we used the GAT layer implementation from DGL. The difference between MHA in GAT and in GN-Transformer is that GN-Transformer uses the product of the features of the two nodes on an edge to compute the edge feature, while GAT, in DGL's implementation, uses their sum. Secondly, the Transformer applies four linear layers (on the key, query, value, and MHA output, respectively), while GAT applies only one, on the input features. For additional experiment (C), Variant 1 addresses the long-range dependence problem discussed in Section 3.2 by introducing shortcut edges between each AST node and the token nodes related to it, where we define the related token nodes of an AST node as all token nodes within its scope. The distance between an AST node and all nodes within its scope is thereby shortened to 1; this breaks no isolation but introduces additional direct interactions. Figure 5 (a) shows how shortcut edges are added for the 'AdditiveExpr' node: we add bi-directional shortcut edges between the AST node 'AdditiveExpr' and the token nodes '4', '+', '2', which are within the scope of this AST node. These edges allow the AST node to propagate information through the node and edge update rules, which can ameliorate the long-range dependence problem in the graph. In the second variant, we make the token nodes fully connected, as in the Transformer. The token nodes are no longer isolated from each other, so all of them interact directly. See Figure 5 (b).
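The multiplicative-versus-additive edge-scoring difference described above can be illustrated with a toy sketch. The function names and weights below are hypothetical; DGL's actual GATConv also applies per-head linear projections and softmax normalization, which we omit here for brevity.

```python
import torch
import torch.nn.functional as F

d = 8
torch.manual_seed(0)
h_src, h_dst = torch.randn(4, d), torch.randn(4, d)  # features of 4 edges' endpoints

# GN-Transformer-style edge feature: elementwise product of the two
# endpoint features, reduced to a scalar score (as in dot-product attention)
def mul_edge_score(q, k):
    return (q * k).sum(-1) / d ** 0.5

# GAT-style scoring (as implemented in DGL): per-endpoint linear scores
# are summed, then passed through a LeakyReLU
a_l, a_r = torch.randn(d), torch.randn(d)
def add_edge_score(hs, hd):
    return F.leaky_relu(hs @ a_l + hd @ a_r)

s_mul = mul_edge_score(h_src, h_dst)  # one score per edge
s_add = add_edge_score(h_src, h_dst)
```

The multiplicative form couples the two endpoint features directly, whereas the additive form scores each endpoint independently before combining them.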



Footnote URLs referenced above: https://javaparser.org/ ; https://github.com/c2nes/javalang ; https://docs.python.org/3/library/ast.html ; https://www.dgl.ai/



Figure 1: Examples of generated summaries on the Java (red blocks) and Python (blue blocks) test sets. Transformer is a vanilla Transformer; Transformer (full) adds the relative positional encoding and copy mechanism from Ahmad et al. (2020).

Figure 2: Overall structure of our model. The encoder consists of multiple GN-Transformer blocks. We denote '+' as a residual connection followed by a normalization layer. In 'Node embeddings of graph batch', each black bar represents the node embeddings of one graph in the input batch. Blue dots represent token nodes; grey dots denote padding. Node embeddings in the grey box are fetched as input to the decoder, and AST node embeddings (red dots) are discarded.

Figure 3: (a) Simplest fully-connected graph representation of a sequence. Each node corresponds to a token; self-loops are omitted. (b) Deep blue nodes are the AST nodes for the statement 'a[i]=4+2'; light blue nodes show the correspondence between the AST and the source code. (c) Standard SCG structure. It preserves the AST structure, with additional edges between AST nodes and tokens.

Figure 4: (a) Average code length of sample subsets under different sentence-BLEU score thresholds on the Java test set. (b) Average path length and clustering coefficient of our SCG structures, calculated on the Java test set, alongside those of the Transformer's fully-connected graph.
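As a rough illustration of how graph statistics like those in Figure 4 (b) can be computed, here is a pure-Python sketch on a toy tree-shaped graph. The graph and its numbers are illustrative only, not the paper's actual SCGs or measurement code.

```python
from collections import deque

# adjacency list for a toy SCG-like tree (an AST spine plus a token chain)
adj = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2], 4: [0, 5], 5: [4]}

def bfs_dists(src):
    """Shortest-path distances from src via breadth-first search."""
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

nodes = list(adj)
pair_dists = [bfs_dists(u)[v] for i, u in enumerate(nodes) for v in nodes[i + 1:]]
avg_path_len = sum(pair_dists) / len(pair_dists)

def clustering(u):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    nbrs = adj[u]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i, a in enumerate(nbrs) for b in nbrs[i + 1:] if b in adj[a])
    return 2 * links / (k * (k - 1))

avg_clustering = sum(clustering(u) for u in nodes) / len(nodes)
```

A tree has no triangles, so its average clustering coefficient is zero; a fully-connected Transformer graph would instead have average path length 1 and clustering coefficient 1.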

Figure 5: (a) Introducing shortcut edges for the AST node 'AdditiveExpr' in Variant 1: the three token nodes '4', '+', '2' are connected to it. Token nodes are marked light blue, AST nodes deep blue; edges from token nodes to AST nodes are red, and orange for the converse direction. (b) In Variant 2, all token nodes are fully connected with each other.
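The Variant 1 construction in the caption can be sketched as a small graph transformation. The node ids and the base edge set below are made up for illustration.

```python
# Variant 1 sketch: add bidirectional shortcut edges between an AST node
# and every token node in its scope.
def add_shortcut_edges(edges, ast_scopes):
    """edges: set of (src, dst) pairs; ast_scopes: {ast_node: [token nodes]}."""
    out = set(edges)
    for ast_node, tokens in ast_scopes.items():
        for t in tokens:
            out.add((ast_node, t))  # AST -> token shortcut
            out.add((t, ast_node))  # token -> AST shortcut
    return out

# hypothetical ids: 'AdditiveExpr' is node 10; its scope covers the token
# nodes '4', '+', '2' at positions 2, 3, 4 in the token sequence
base = {(10, 11), (11, 2)}  # some pre-existing SCG edges (made up)
scg = add_shortcut_edges(base, {10: [2, 3, 4]})
```

After the transformation, every token in the AST node's scope is one hop away from it, while all original edges are preserved.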

We show ten examples of summaries generated by our model and the baselines on the test set. The first five are Java examples; the remaining five are Python examples.

public boolean isSelected(ItemSelectionChoice pChoice) { return selArray[pChoice.ordinal()]; }
Reference: looks, if the input item type is selected.
GN-Transformer: whether a choice is selected, or not
Transformer: get if the selection should be a selection

Overall results on Java and Python datasets.

Other baselines on Java and Python datasets.

Additional experimental results. Values not listed are identical to the base model. Parameter counts do not include the embedding layer.

Experimental results on positional encoding and the copy mechanism. APE/RPE (token) means APE/RPE applied only to token nodes. Parameter counts do not include the embedding layers.

A HYPERPARAMETERS AND DATA PREPROCESSING

We used the same hyperparameters as Ahmad et al. (2020). For data preprocessing, code and summary sequences are truncated if they exceed the maximum length: we first truncate the tokens that exceed the maximum sequence length, then remove every AST node whose scope contains a truncated token. We set a vocabulary size limit and keep only the most frequent words; words not in the vocabulary are mapped to an Unknown token. Our methodology is consistent with Ahmad et al. (2020) with the exception of the AST truncation, since they do not use that information.
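The truncation rule can be sketched as follows. This is our reading of the description above, not the authors' code; the data structures and names are illustrative.

```python
# Sketch of the truncation rule: tokens beyond max_len are dropped, and
# any AST node whose scope touches a dropped token is removed as well.
def truncate(tokens, ast_nodes, max_len):
    """ast_nodes: list of (name, scope) pairs, where scope is the set of
    token indices covered by that AST node."""
    kept_tokens = tokens[:max_len]
    kept_ast = [
        (name, scope) for name, scope in ast_nodes
        if all(i < max_len for i in scope)
    ]
    return kept_tokens, kept_ast

# toy snippet 'a[i]=4+2' with two hypothetical AST nodes
toks = ["a", "[", "i", "]", "=", "4", "+", "2"]
nodes = [("Subscript", {0, 1, 2, 3}), ("AdditiveExpr", {5, 6, 7})]
t, n = truncate(toks, nodes, max_len=6)
```

With `max_len=6`, the tokens '+' and '2' are dropped, so 'AdditiveExpr' (whose scope contains them) is removed while 'Subscript' survives.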

[Table: statistics of the preprocessed Java and Python datasets]

The statistics of our preprocessed datasets are shown above. Despite the difference in implementation, we kept our methodology consistent with Ahmad et al. (2020). We used the same CamelCase and snake_case tokenizer as Ahmad et al. (2020) to preprocess the source code in both the Java and Python datasets; the SCG for subtokens differs slightly, see Figure 6. We also replaced the strings and numbers in the source code with 'STR' and 'NUM'. For the code summaries, we used the raw corpus for the Java dataset and the same method as Wan et al. (2018) to process summaries in the Python dataset. We used the train/valid/test split from the original corpus.

GN-Transformer: converts a list of strings to upper case strings.
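The CamelCase/snake_case subtokenization with STR/NUM replacement might be sketched as below. The regular expressions are our own approximation, not the actual tokenizer from Ahmad et al. (2020).

```python
import re

def subtokenize(token):
    """Split a source-code token into subtokens; replace string and
    numeric literals with the placeholders 'STR' and 'NUM'."""
    if re.fullmatch(r'".*"|\'.*\'', token):
        return ["STR"]
    if re.fullmatch(r"\d+(\.\d+)?", token):
        return ["NUM"]
    subtokens = []
    for part in token.split("_"):  # snake_case boundaries
        # CamelCase boundaries: uppercase runs, capitalized words, or symbols
        subtokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\S", part)
    return [s.lower() for s in subtokens if s]

subtokenize("isSelected")  # -> ["is", "selected"]
subtokenize("sel_array")   # -> ["sel", "array"]
subtokenize("42")          # -> ["NUM"]
```

Each original token thus expands into one or more subtoken nodes, which is why the SCG for subtokens differs slightly (Figure 6).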

E EXAMPLES

Transformer: returns a comma-separated list of strings.

Reference: sorts the instances according to the given attribute/dimension. the sorting is done on the master index array and not on the actual instances object.
GN-Transformer: sorts the specified range of the array using the specified items
Transformer: src the ordinal field array into ascending order
Transformer (full): sorts the specified range of the array using the given workspace array.

def is_power2(num): return (isinstance(num, numbers.Integral) and (num > 0) and (not (num & (num - 1))))
Reference: test if num is a positive integer power of 2.
GN-Transformer: return true if the power of 2.
Transformer: returns true if and number is a user-defined power.
Transformer (full): return whether or not the argument is a power.

Reference: delete an item or items from a queue cli example: .
GN-Transformer: delete one or more or all items from a queue cli example: .
Transformer: delete an item from a queue cli example: .
Transformer (full): delete message(s) from a queue cli example: .

Reference: merge command line options with configuration file and default options.
GN-Transformer: merges options from a config file.
Transformer: merges all of the data used into an option.
Transformer (full): loads configuration attributes and add attributes.

Reference: loads the data necessary for instantiating a client from file storage.
GN-Transformer: loads the header data necessary for instantiating.
Transformer: loads a data necessary for credit two types.
Transformer (full): loads key: data from a loader context.

F ATTENTION VISUALIZATION FOR SYNTAX-CODE GRAPH

The attention visualization for this short program is shown here; it offers more insight into how each node attends to its neighboring nodes:

