GRAPHCODEBERT: PRE-TRAINING CODE REPRESENTATIONS WITH DATA FLOW

Abstract

Pre-trained models for programming languages have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, and code summarization. However, existing pre-trained models regard a code snippet as a sequence of tokens and ignore the inherent structure of code, which provides crucial code semantics and would enhance the code understanding process. We present GraphCodeBERT, a pre-trained model for programming languages that considers the inherent structure of code. Instead of taking syntactic-level structure of code such as the abstract syntax tree (AST), we use data flow in the pre-training stage, a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables. Such a semantic-level structure is less complex than an AST and does not bring an unnecessarily deep hierarchy, which makes the model more efficient. We develop GraphCodeBERT based on Transformer. In addition to masked language modeling, we introduce two structure-aware pre-training tasks: one predicts code structure edges, and the other aligns representations between source code and code structure. We implement the model efficiently with a graph-guided masked attention function to incorporate the code structure. We evaluate our model on four tasks: code search, clone detection, code translation, and code refinement. Results show that code structure and the newly introduced pre-training tasks improve GraphCodeBERT, which achieves state-of-the-art performance on the four downstream tasks. We further show that the model prefers structure-level attentions over token-level attentions in the task of code search.

1. INTRODUCTION

Pre-trained models such as ELMo (Peters et al., 2018), GPT (Radford et al., 2018) and BERT (Devlin et al., 2018) have led to strong improvements on numerous natural language processing (NLP) tasks. These models are first pre-trained on a large unsupervised text corpus and then fine-tuned on downstream tasks. Their success in NLP has also promoted the development of pre-trained models for programming languages. Existing works (Kanade et al., 2019; Karampatsis & Sutton, 2020; Feng et al., 2020; Svyatkovskiy et al., 2020; Buratti et al., 2020) regard source code as a sequence of tokens and pre-train models on source code to support code-related tasks such as code search, code completion, and code summarization. However, previous works only utilize source code for pre-training and ignore the inherent structure of code. Such code structure provides useful semantic information, which would benefit the code understanding process. Taking the expression v = max_value - min_value as an example, v is computed from max_value and min_value. Programmers do not always follow naming conventions, so it is hard to understand the semantics of the variable v from its name alone. The semantic structure of code provides a way to understand the semantics of v by leveraging the dependency relation between variables. In this work, we present GraphCodeBERT, a pre-trained model for programming languages that considers the inherent structure of code. Instead of taking syntactic-level structure of code such as the abstract syntax tree (AST), we leverage semantic-level information of code, i.e. data flow, for pre-training. Data flow is a graph in which nodes represent variables and edges represent the relation of "where-the-value-comes-from" between variables. Compared with AST, data flow is less complex and does not bring an unnecessarily deep hierarchy, which makes the model more efficient.
In order to learn code representation from source code and code structure, we introduce two new structure-aware pre-training tasks. One is data flow edge prediction for learning representation from code structure, and the other is variable alignment across source code and data flow for aligning representation between source code and code structure. GraphCodeBERT is based on the Transformer neural architecture (Vaswani et al., 2017), which we extend with a graph-guided masked attention function to incorporate the code structure. We pre-train GraphCodeBERT on the CodeSearchNet dataset (Husain et al., 2019), which includes 2.3M functions of six programming languages paired with natural language documents. We evaluate the model on four downstream tasks: natural language code search, clone detection, code translation, and code refinement. Experiments show that our model achieves state-of-the-art performance on the four tasks. Further analysis shows that code structure and the newly introduced pre-training tasks improve GraphCodeBERT and that the model has a consistent preference for attending to data flow. In summary, the contributions of this paper are: (1) GraphCodeBERT is the first pre-trained model that leverages semantic structure of code to learn code representation. (2) We introduce two new structure-aware pre-training tasks for learning representation from source code and data flow. (3) GraphCodeBERT provides significant improvement on four downstream tasks, i.e. code search, clone detection, code translation, and code refinement.

2. RELATED WORKS

Pre-Trained Models for Programming Languages Inspired by the success of pre-training in NLP (Devlin et al., 2018; Yang et al., 2019; Liu et al., 2019; Raffel et al., 2019), pre-trained models for programming languages have also promoted the development of code intelligence (Kanade et al., 2019; Feng et al., 2020; Karampatsis & Sutton, 2020; Svyatkovskiy et al., 2020; Buratti et al., 2020). Kanade et al. (2019) pre-train a BERT model on a massive corpus of Python source code with masked language modeling and next sentence prediction objectives. Feng et al. (2020) propose CodeBERT, a bimodal pre-trained model for programming and natural languages trained with masked language modeling and replaced token detection to support text-code tasks such as code search. Karampatsis & Sutton (2020) pre-train contextual embeddings on a JavaScript corpus using the ELMo framework for the program repair task. Svyatkovskiy et al. (2020) propose GPT-C, a variant of GPT-2 trained from scratch on source code data to support generative tasks like code completion. Buratti et al. (2020) present C-BERT, a transformer-based language model pre-trained on a collection of repositories written in C, and achieve high accuracy in the abstract syntax tree (AST) tagging task. Different from previous works, GraphCodeBERT is the first pre-trained model that leverages code structure to learn code representation to improve code understanding. We further introduce a graph-guided masked attention function to incorporate the code structure into Transformer and two new structure-aware pre-training tasks to learn representation from source code and code structure.
Neural Networks with Code Structure In recent years, some neural networks leveraging code structure such as AST have been proposed and achieved strong performance in code-related tasks like code completion (Li et al., 2017; Alon et al., 2019; Kim et al., 2020) , code generation (Rabinovich et al., 2017; Yin & Neubig, 2017; Brockschmidt et al., 2018) , code clone detection (Wei & Li, 2017; Zhang et al., 2019; Wang et al., 2020) , code summarization (Alon et al., 2018; Hu et al., 2018) and so on (Nguyen & Nguyen, 2015; Allamanis et al., 2018; Hellendoorn et al., 2019) . Nguyen & Nguyen (2015) propose an AST-based language model to support the detection and suggestion of a syntactic template at the current editing location. Allamanis et al. (2018) use graphs to represent programs and graph neural network to reason over program structures. Hellendoorn et al. (2019) propose two different architectures using a gated graph neural network and Transformers for combining local and global information to leverage richly structured representations of source code. However, these works leverage code structure to learn models on specific tasks from scratch without using pre-trained models. In this work, we study how to leverage code structure for pre-training code representation.

3. DATA FLOW

In this section, we describe the basic concept and extraction of data flow. In the next section, we describe how to use data flow for pre-training. Data flow is a graph that represents the dependency relation between variables, in which nodes represent variables and edges represent where the value of each variable comes from. Unlike AST, data flow is the same under different abstract grammars for the same source code. Such code structure provides crucial semantic information for code understanding. Taking v = max_value - min_value as an example, programmers do not always follow naming conventions, so it is hard to understand the semantics of the variable from its name. Data flow provides a way to understand the semantics of the variable v to some extent, i.e. the value of v comes from max_value and min_value in data flow. Besides, data flow helps the model consider long-range dependencies induced by using the same variable or function in distant locations. Taking Figure 1 as an example, there are four variables with the same name (i.e. $x_3$, $x_7$, $x_9$ and $x_{11}$) but with different semantics. The graph in the figure shows the dependency relation between these variables and lets $x_{11}$ pay more attention to $x_7$ and $x_9$ than to $x_3$. Next, we describe how to extract data flow from source code. Figure 1 shows the extraction of data flow from a source code. Given a source code $C = \{c_1, c_2, ..., c_n\}$, we first parse the code into an abstract syntax tree (AST) with a standard compiler tool (tree-sitter, https://github.com/tree-sitter/tree-sitter). The AST includes syntax information of the code, and its terminals (leaves) are used to identify the variable sequence, denoted as $V = \{v_1, v_2, ..., v_k\}$. We take each variable as a node of the graph, and a directed edge $\varepsilon = \langle v_i, v_j \rangle$ from $v_i$ to $v_j$ indicates that the value of the j-th variable comes from the i-th variable. Taking x = expr as an example, edges from all variables in expr to x are added to the graph.
We denote the set of directed edges as $E = \{\varepsilon_1, \varepsilon_2, ..., \varepsilon_l\}$, and the graph $G(C) = (V, E)$ is the data flow used to represent the dependency relation between variables of the source code C.
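The edge rule for x = expr can be illustrated with a minimal sketch. It is not the extraction pipeline used in this work (which parses with tree-sitter across six languages); as an assumption for illustration, it uses Python's standard `ast` module and handles only straight-line assignments to a single name. The function name `extract_data_flow` is hypothetical, and nodes are numbered in the order the sketch visits them rather than by token position as in Figure 1.

```python
import ast


def extract_data_flow(source):
    """Return (variables, edges) for straight-line assignments only.

    variables: one entry per variable occurrence (a node of the graph).
    edges: (i, j) pairs meaning the value of node j comes from node i.
    """
    variables = []
    edges = []
    last_def = {}  # variable name -> node index of its latest definition

    for stmt in ast.parse(source).body:
        if not isinstance(stmt, ast.Assign):
            continue  # control flow, calls, etc. are out of scope here
        rhs_nodes = []
        for node in ast.walk(stmt.value):
            if isinstance(node, ast.Name):
                variables.append(node.id)
                idx = len(variables) - 1
                rhs_nodes.append(idx)
                # A use takes its value from the latest definition, if any.
                if node.id in last_def:
                    edges.append((last_def[node.id], idx))
        for target in stmt.targets:
            if isinstance(target, ast.Name):
                variables.append(target.id)
                tgt = len(variables) - 1
                # x = expr: add an edge from every variable in expr to x.
                edges.extend((src, tgt) for src in rhs_nodes)
                last_def[target.id] = tgt
    return variables, edges


variables, edges = extract_data_flow("v = max_value - min_value")
print(variables, edges)
```

For the running example, the sketch yields three nodes (max_value, min_value, v) and two edges pointing into v, which matches the "where-the-value-comes-from" reading described above.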

4. GRAPHCODEBERT

In this section, we describe GraphCodeBERT, a graph-based pre-trained model based on Transformer for programming languages. We introduce the model architecture, graph-guided masked attention, and the pre-training tasks, including standard masked language modeling and the newly introduced ones. More details about the pre-training setting are provided in Appendix A.

4.1. MODEL ARCHITECTURE

Figure 2 shows the model architecture of GraphCodeBERT. We follow BERT (Devlin et al., 2018) and use the multi-layer bidirectional Transformer (Vaswani et al., 2017) as the model backbone. Instead of only using source code, we also utilize paired comments to pre-train the model to support more code-related tasks involving natural language, such as natural language code search (Feng et al., 2020). We further take data flow, which is a graph, as a part of the input to the model. Given a source code $C = \{c_1, c_2, ..., c_n\}$ with its comment $W = \{w_1, w_2, ..., w_m\}$, we can obtain the corresponding data flow $G(C) = (V, E)$ as discussed in Section 3, where $V = \{v_1, v_2, ..., v_k\}$ is a set of variables and $E = \{\varepsilon_1, \varepsilon_2, ..., \varepsilon_l\}$ is a set of directed edges that represent where the value of each variable comes from. We concatenate the comment, source code and the set of variables as the sequence input $X = \{[CLS], W, [SEP], C, [SEP], V\}$, where [CLS] is a special token in front of the three segments and [SEP] is a special symbol to split two kinds of data types. GraphCodeBERT takes the sequence X as the input and converts it into input vectors $H^0$. For each token, its input vector is constructed by summing the corresponding token and position embeddings. We use a special position embedding for all variables to indicate that they are nodes of data flow. The model applies N transformer layers over the input vectors to produce contextual representations $H^n = transformer_n(H^{n-1}), n \in [1, N]$. Each transformer layer contains an architecturally identical transformer that applies a multi-headed self-attention operation (Vaswani et al., 2017) followed by a feed-forward layer over the input $H^{n-1}$ in the n-th layer:

$$G^n = LN(MultiAttn(H^{n-1}) + H^{n-1}) \tag{1}$$

$$H^n = LN(FFN(G^n) + G^n) \tag{2}$$

where MultiAttn is a multi-headed self-attention mechanism, FFN is a two-layer feed-forward network, and LN represents a layer normalization operation.
For the n-th transformer layer, the output $\hat{G}^n$ of the multi-headed self-attention is computed via:

$$Q_i = H^{n-1} W_i^Q, \quad K_i = H^{n-1} W_i^K, \quad V_i = H^{n-1} W_i^V \tag{3}$$

$$head_i = softmax\left(\frac{Q_i K_i^T}{\sqrt{d_k}} + M\right) V_i \tag{4}$$

$$\hat{G}^n = [head_1; ...; head_u] W_n^O \tag{5}$$

where the previous layer's output $H^{n-1} \in \mathbb{R}^{|X| \times d_h}$ is linearly projected to a triplet of queries, keys and values using model parameters $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d_h \times d_k}$, respectively. u is the number of heads, $d_k$ is the dimension of a head, and $W_n^O \in \mathbb{R}^{d_h \times d_h}$ are model parameters. $M \in \mathbb{R}^{|X| \times |X|}$ is a mask matrix, where $M_{ij}$ is 0 if the i-th token is allowed to attend to the j-th token and $-\infty$ otherwise.
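To make the role of the additive mask M in Equation 4 concrete, the following is a minimal pure-Python sketch of a single attention head. It assumes the queries, keys and values have already been projected (the learned matrices of Equation 3 are omitted), and that every query is allowed to attend to at least one position so the softmax is well defined. The function name `masked_attention_head` is illustrative only.

```python
import math

NEG_INF = float("-inf")


def masked_attention_head(Q, K, V, M):
    """softmax(Q K^T / sqrt(d_k) + M) V for one head, on lists of vectors."""
    d_k = len(K[0])
    outputs = []
    for i, q in enumerate(Q):
        # Scaled dot-product score plus the additive mask entry.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) + M[i][j]
            for j, k in enumerate(K)
        ]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]  # exp(-inf) == 0.0
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        )
    return outputs


Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
M = [[0.0, NEG_INF], [0.0, 0.0]]  # position 0 may not attend to position 1
print(masked_attention_head(Q, K, V, M))
```

Because a masked score becomes $-\infty$ before the softmax, its attention weight is exactly zero, so the first output row equals the first value vector.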

4.2. GRAPH-GUIDED MASKED ATTENTION

To incorporate the graph structure into Transformer, we define a graph-guided masked attention function to filter out irrelevant signals. The attention masking function prevents the key $k_i$ from being attended by the query $q_j$ by adding an infinitely negative value to the attention score $q_j^T k_i$, so that the attention weight becomes zero after the softmax function. To represent the dependency relation between variables, a node-query $q_{v_i}$ is allowed to attend to a node-key $k_{v_j}$ if there is a directed edge from the node $v_j$ to the node $v_i$ (i.e. $\langle v_j, v_i \rangle \in E$) or they are the same node (i.e. i = j). Otherwise, the attention is masked by adding an infinitely negative value to the attention score. To represent the relation between source code tokens and nodes of the data flow, we first define a set $E'$, where $\langle v_i, c_j \rangle / \langle c_j, v_i \rangle \in E'$ if the variable $v_i$ is identified from the source code token $c_j$. We then allow the node $q_{v_i}$ and the code $k_{c_j}$ to attend to each other if and only if $\langle v_i, c_j \rangle / \langle c_j, v_i \rangle \in E'$. More formally, we use the following graph-guided masked attention matrix as the mask matrix M in Equation 4:

$$M_{ij} = \begin{cases} 0 & \text{if } q_i \in \{[CLS], [SEP]\} \text{ or } q_i, k_j \in W \cup C \text{ or } \langle q_i, k_j \rangle \in E \cup E' \\ -\infty & \text{otherwise} \end{cases} \tag{6}$$
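The mask of Equation 6 can be sketched in a few lines of Python. This is an illustrative reading, not the released implementation: positions are labelled with hypothetical kind tags ("cls", "sep", "text", "code", "node"), data flow edges are given as (source, destination) positions, and, following the prose above, a node query attends to the nodes its value comes from and to itself. The equation is applied literally, so text and code queries are not granted access to [CLS] or [SEP] keys here.

```python
NEG_INF = float("-inf")


def build_graph_guided_mask(kinds, flow_edges, align_edges):
    """Build M with 0 where attention is allowed and -inf elsewhere.

    kinds: one tag per input position ("cls", "sep", "text", "code", "node").
    flow_edges: (src, dst) positions; the value of dst comes from src.
    align_edges: (node, code) positions; the node is identified from the token.
    """
    n = len(kinds)
    allowed = set()
    for src, dst in flow_edges:
        allowed.add((dst, src))  # query dst may attend to key src
    for node, code in align_edges:
        allowed.add((node, code))  # node and its code token attend
        allowed.add((code, node))  # to each other in both directions

    mask = [[NEG_INF] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            special_query = kinds[i] in ("cls", "sep")
            both_tokens = (kinds[i] in ("text", "code")
                           and kinds[j] in ("text", "code"))
            same_node = i == j and kinds[i] == "node"
            if special_query or both_tokens or same_node or (i, j) in allowed:
                mask[i][j] = 0.0
    return mask


# [CLS] w [SEP] a b [SEP] v_a v_b, where the value of v_b comes from v_a.
kinds = ["cls", "text", "sep", "code", "code", "sep", "node", "node"]
mask = build_graph_guided_mask(kinds, [(6, 7)], [(6, 3), (7, 4)])
print(mask[7][6], mask[6][7])
```

In this toy input the query for v_b can see v_a, while the reverse direction stays masked, which is how the mask encodes the direction of the dependency.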

4.3. PRE-TRAINING TASKS

We describe the three pre-training tasks used for GraphCodeBERT in this section. The first task is masked language modeling (Devlin et al., 2018) for learning representation from the source code. The second task is data flow edge prediction for learning representation from data flow, where we first mask some variables' data flow edges and then let GraphCodeBERT predict those edges. The last task is variable alignment across source code and data flow for aligning representation between source code and data flow, which predicts where a variable is identified from.

Masked Language Modeling We follow Devlin et al. (2018) and apply the masked language modeling (MLM) pre-training task. Specifically, we randomly sample 15% of the tokens from the source code and paired comment. We replace them with a [MASK] token 80% of the time, with a random token 10% of the time, and leave them unchanged 10% of the time. The MLM objective is to predict the original tokens at these sampled positions, which has proven effective in previous works (Devlin et al., 2018; Liu et al., 2019; Feng et al., 2020). In particular, the model can leverage the comment context if the source code context is not sufficient to infer the masked code token, which encourages the model to align the natural language and programming language representations.

Edge Prediction To learn representation from data flow, we introduce a pre-training task of data flow edge prediction. The motivation is to encourage the model to learn structure-aware representation that encodes the relation of "where-the-value-comes-from" for better code understanding. Specifically, we randomly sample 20% of nodes $V_s$ in data flow, mask the directed edges connecting these sampled nodes by adding an infinitely negative value in the mask matrix, and then predict these masked edges $E_{mask}$. Taking the variable $x_{11}$ in Figure 2 as an example, we first mask the edges $\langle x_7, x_{11} \rangle$ and $\langle x_9, x_{11} \rangle$ in the graph and then let the model predict these edges.
Formally, the pre-training objective of the task is calculated as Equation 7, where $E_c = V_s \times V \cup V \times V_s$ is a set of candidates for edge prediction, $\delta(e_{ij} \in E)$ is 1 if $\langle v_i, v_j \rangle \in E$ and 0 otherwise, and the probability $p_{e_{ij}}$ that an edge exists from the i-th to the j-th node is calculated by a dot product followed by a sigmoid function, using the representations of the two nodes from GraphCodeBERT. To balance the positive-negative ratio of examples, we sample the same number of negative and positive samples for $E_c$.

$$loss_{EdgePred} = -\sum_{e_{ij} \in E_c} \left[\delta(e_{ij} \in E_{mask}) \log p_{e_{ij}} + (1 - \delta(e_{ij} \in E_{mask})) \log(1 - p_{e_{ij}})\right] \tag{7}$$

Node Alignment To align representation between source code and data flow, we introduce a pre-training task of node alignment across source code and data flow, which is similar to data flow edge prediction. Instead of predicting edges between nodes, we predict edges between code tokens and nodes. The motivation is to encourage the model to align variables and source code according to data flow. Taking Figure 3 as an example, we first mask the edges between the variable $x_{11}$ in data flow and code tokens, and then predict which code token the variable $x_{11}$ in data flow is identified from. As we can see, the model could predict that the variable $x_{11}$ is identified from the variable x in the expression "return x" according to data flow information (i.e. the value of $x_{11}$ comes from $x_7$ or $x_9$).

Specifically, we randomly sample 20% of nodes $V_s$ in the graph, mask the edges between code tokens and the sampled nodes, and then predict the masked edges $E'_{mask}$. The pre-training objective of this task is similar to Equation 7, where $E'_c = V_s \times C$ is a set of candidates for node alignment. Similarly, we also sample the same number of negative and positive samples for $E'_c$.

$$loss_{NodeAlign} = -\sum_{e_{ij} \in E'_c} \left[\delta(e_{ij} \in E'_{mask}) \log p_{e_{ij}} + (1 - \delta(e_{ij} \in E'_{mask})) \log(1 - p_{e_{ij}})\right] \tag{8}$$
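Both structure-aware objectives share the same binary cross-entropy form, so a single hedged sketch covers Equations 7 and 8. It assumes that `vecs` maps an input position to its final-layer representation and that the candidate set has already been balanced by sampling; the function names are hypothetical and the sketch ignores numerical-stability tricks a real training loop would need.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pair_prediction_loss(vecs, candidates, masked_edges):
    """Summed binary cross-entropy over candidate pairs (i, j).

    vecs: position -> representation vector (list of floats).
    candidates: iterable of (i, j) pairs (E_c for edge prediction,
        V_s x C for node alignment).
    masked_edges: set of (i, j) pairs that are true masked edges.
    """
    loss = 0.0
    for i, j in candidates:
        # p_eij: sigmoid of the dot product of the two representations.
        score = sum(a * b for a, b in zip(vecs[i], vecs[j]))
        p = sigmoid(score)
        if (i, j) in masked_edges:
            loss -= math.log(p)
        else:
            loss -= math.log(1.0 - p)
    return loss


vecs = {0: [0.5, -0.2], 1: [0.1, 0.9], 2: [-0.3, 0.4]}
print(pair_prediction_loss(vecs, [(0, 1), (2, 1)], {(0, 1)}))
```

For edge prediction, both positions in a pair are variable nodes; for node alignment, one position is a node and the other is a code token.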

5. EXPERIMENTS

We evaluate our model on four downstream tasks, including code search, clone detection, code translation and code refinement. Detailed experimental settings can be found in the Appendix.

5.1. NATURAL LANGUAGE CODE SEARCH

Given a natural language query as the input, the task aims to find the most semantically related code from a collection of candidate codes. We conduct experiments on the CodeSearchNet code corpus (Husain et al., 2019), which includes six programming languages. Different from the dataset and the setting used in Husain et al. (2019), we filter low-quality queries by handcrafted rules and expand the 1000 candidates to the whole code corpus, which is closer to the real-life scenario. We use Mean Reciprocal Rank (MRR) as our evaluation metric and report results of existing methods in Table 1. We provide more details about the filtered dataset and also give results using the same setting as Husain et al. (2019) in Appendix B. All models calculate the inner product of code and query encodings as relevance scores to rank candidate codes. We follow Husain et al. (2019) and implement four methods as baselines in the first group to obtain the encodings: bag-of-words, convolutional neural network, bidirectional recurrent neural network, and multi-head attention. The second group contains the results of pre-trained models. RoBERTa (Liu et al., 2019) is a model pre-trained on a text corpus with the MLM learning objective, while RoBERTa (code) is pre-trained only on code. CodeBERT (Feng et al., 2020) is pre-trained on code-text pairs with MLM and replaced token detection learning objectives. As we can see, GraphCodeBERT, which leverages code structure for pre-training, brings a 2% gain in MRR and achieves state-of-the-art performance. We also conducted a t-test between GraphCodeBERT and the other baselines, and the results show that the improvements are significant with p < 0.01.
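The ranking and metric described above can be sketched as follows. The encodings are assumed to come from some encoder and are passed in as plain lists of floats; `gold[i]` is the index of the correct code for query i. Ties are resolved optimistically (only strictly higher scores count against the gold candidate), which is a choice of this sketch rather than something specified in the text.

```python
def mean_reciprocal_rank(query_vecs, code_vecs, gold):
    """MRR when candidates are ranked by inner product with the query."""
    total = 0.0
    for q, g in zip(query_vecs, gold):
        scores = [sum(a * b for a, b in zip(q, c)) for c in code_vecs]
        # Rank of the gold code = 1 + number of candidates scoring higher.
        rank = 1 + sum(1 for s in scores if s > scores[g])
        total += 1.0 / rank
    return total / len(query_vecs)


queries = [[1.0, 0.0], [0.0, 1.0]]
codes = [[2.0, 0.0], [0.0, 1.0], [1.0, 3.0]]
print(mean_reciprocal_rank(queries, codes, gold=[0, 1]))  # 0.75
```

In the toy data the first query ranks its gold code first and the second ranks it second, giving (1 + 1/2) / 2 = 0.75.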

5.2. CODE CLONE DETECTION

Code clones are multiple code fragments that output similar results when given the same input. The task aims to measure the similarity between two code fragments, which can help reduce the cost of software maintenance and prevent bugs. We conduct experiments on the BigCloneBench dataset (Svajlenko et al., 2014) and report results in Table 2. Deckard (Jiang et al., 2007) computes vectors for structural information within ASTs, and then Locality Sensitive Hashing (LSH) (Datar et al., 2004) is used to cluster similar vectors for detection. RtvNN (White et al., 2016) trains a recursive autoencoder to learn representations for AST. CDLH (Wei & Li, 2017) learns representations of code fragments via an AST-based LSTM, and hamming distance is used to optimize the distance between the vector representations of AST pairs. Results show that GraphCodeBERT, which leverages code structure information, significantly outperforms other methods with p < 0.01, which demonstrates the effectiveness of our pre-trained model for the task of code clone detection.

5.3. CODE TRANSLATION

Code translation aims to migrate legacy software from one programming language on a platform to another. Following Nguyen et al. (2015) and Chen et al. (2018), we conduct experiments on a dataset crawled from the same open-source projects they used and report results in Table 3. The Naive method directly copies the source code as the translation result. PBSMT is short for phrase-based statistical machine translation (Koehn et al., 2003) and has been exploited in previous works (Nguyen et al., 2013; Karaivanov et al., 2014). For the Transformer, we use the same number of layers and hidden size as the pre-trained models. To leverage the pre-trained models for translation, we initialize the encoder with pre-trained models and randomly initialize the parameters of the decoder and the source-to-target attention. Results show that the models initialized with pre-trained models (i.e., the second group) significantly outperform the PBSMT and Transformer models. Among them, GraphCodeBERT achieves state-of-the-art performance, which demonstrates the effectiveness of our model for code translation.

5.4. CODE REFINEMENT

Code refinement aims to automatically fix bugs in the code, which can contribute to reducing the cost of bug fixes. We use the dataset released by Tufano et al. (2019) and report results in Table 4. The Naive method directly copies the buggy code as the refinement result. For the Transformer, we use the same number of layers and hidden size as the pre-trained models. As in Section 5.3, we initialize the encoder with pre-trained models and randomly initialize the parameters of the decoder and the source-to-target attention. Then we use the training data to fine-tune the whole model. In the table, we see that the Transformer significantly outperforms LSTM. Results in the second group show that pre-trained models outperform Transformer models further, and GraphCodeBERT achieves better performance than other pre-trained models on both datasets, which shows that leveraging code structure information is helpful to the task of code refinement.

5.5. MODEL ANALYSIS

Ablation Study We conduct an ablation study on the task of natural language code search to understand how the components of our approach affect overall performance. We remove the two pre-training tasks and data flow, respectively, to analyze their contribution. Table 5 shows that the overall performance drops from 71.3% to 70.3%∼70.7% when removing the Node Alignment and Edge Prediction pre-training tasks, respectively, which reveals the importance of the two structure-aware pre-training tasks. After ablating the data flow entirely, the performance drops from 71.3% to 69.3%, which means that leveraging data flow to learn code representation improves GraphCodeBERT.

Table 6: Attention distribution (%) between code tokens (codes) and variables (nodes) across different programming languages on natural language code search test sets. The first row is the ratio of the number of code tokens to nodes, and the second row is the attention distribution of the [CLS] token.

Comparison between AST and Data Flow Figure 4 shows the MRR score with respect to input sequence length on the validation dataset of the Ruby programming language for the task of code search. AST Pre-order Traversal regards the AST as a sequence by linearizing all AST nodes using a pre-order traversal algorithm. AST Subtree Masking regards the AST as a tree and introduces subtree masking (Nguyen et al., 2019) for the self-attention of the Transformer. In subtree masking, each node-query in the AST attends only to its own subtree descendants, and each leaf-query attends only to leaves of the AST. The Transformer has a self-attention component with $O(n^2)$ time and memory complexity, where n is the input sequence length, and thus does not scale efficiently to long inputs. We observe that injecting AST even hurts the performance when the sequence length is short (e.g. shorter than 128), while GraphCodeBERT consistently brings a performance boost across varying sequence lengths and obtains a better MRR score than AST-based methods.
The main reason is that data flow is less complex and the number of nodes accounts for only 5%∼20% of the input (see Table 6), so it does not bring the unnecessarily deep hierarchy of an AST and makes the model more accurate and efficient.

Case Study

We also give a case study to demonstrate that data flow enhances the code understanding process. Given a source code and a comment, we use GraphCodeBERT with and without data flow to predict whether the comment correctly describes the source code. Results are given in Figure 5. Both models make the correct prediction on the original example, where the threshold is 0.5 (left panel). To study the code understanding ability of the models, we change the source code (center panel) and the comment (right panel), respectively. Although we make only a small change to the source code (return a → return b) and to the comment (sum value → mean value), the semantics of the source code and the comment become completely different, and the corresponding gold labels change from 1 to 0. As we can see in the figure, GraphCodeBERT without data flow fails these tests and still outputs a high probability for the negative examples. After leveraging data flow, GraphCodeBERT better understands the semantics of the source code and makes correct predictions on all tests, which demonstrates that data flow could improve the code understanding ability of the model.

Figure 5: We take a comment and a source code as the input (first row), and use GraphCodeBERT with and without data flow to predict the probability of the source code matching the comment (third row). The label is 1 if the comment correctly describes the source code and 0 otherwise (second row).

6. CONCLUSION

In this paper, we present GraphCodeBERT, which leverages data flow to learn code representation. To the best of our knowledge, this is the first pre-trained model that considers code structure for pre-training code representations. We introduce two structure-aware pre-training tasks and show that GraphCodeBERT achieves state-of-the-art performance on four code-related downstream tasks: code search, clone detection, code translation and code refinement. Further analysis shows that code structure and the newly introduced pre-training tasks boost the performance. Additionally, a case study in the task of code search shows that applying data flow in the pre-trained model improves code understanding.

A PRE-TRAINING DETAILS

GraphCodeBERT includes a 12-layer Transformer with 768-dimensional hidden states and 12 attention heads. For fair comparison, we use the same dataset as CodeBERT (Feng et al., 2020) to pre-train our model. The dataset is the CodeSearchNet dataset (https://github.com/github/CodeSearchNet; Husain et al., 2019), which includes 2.3M functions with document pairs for six programming languages. We train the model on two DGX-2 machines, each having 16 NVIDIA Tesla V100 GPUs with 32GB memory. We set the max length of sequences and nodes as 512 and 128, respectively. We use the Adam optimizer to update model parameters with a batch size of 1,024 and a learning rate of 2e-4. To accelerate the training process, we adopt the parameters of CodeBERT released by Feng et al. (2020) to initialize the model. The model is trained with 200K batches and costs about 83 hours. At each iteration, we alternate the EdgePred and NodeAlign objectives in combination with MLM to pre-train the model. We follow Lample & Conneau (2019) and sample each batch from the same programming language according to a multinomial distribution with probabilities $\{q_i\}_{i=1...N}$:

$$q_i = \frac{p_i^{\alpha}}{\sum_{j=1}^{N} p_j^{\alpha}} \quad \text{with} \quad p_i = \frac{n_i}{\sum_{k=1}^{N} n_k}$$

where $n_i$ is the number of examples for the i-th programming language and $\alpha = 0.7$. Sampling with this distribution alleviates the bias towards high-resource languages.

B NATURAL LANGUAGE CODE SEARCH

Given a natural language query as the input, code search aims to find the most semantically related code from a collection of candidate codes. We conduct experiments on the CodeSearchNet code corpus (Husain et al., 2019) and follow Husain et al. (2019) in taking the first paragraph of the documentation as the query for the corresponding function. However, we observe that some queries contain content unrelated to the code, such as a link "http://..." that refers to external resources. Therefore, we filter out the following examples to improve the quality of the dataset.
(1) Examples whose code could not be parsed into an abstract syntax tree.
(2) Examples whose number of query tokens is shorter than 3 or larger than 256.
(3) Examples whose query contains special tokens such as "http://".
(4) Examples whose query is empty or not written in English.
Different from the setting of Husain et al. (2019), the answer of each query is retrieved from the whole development and testing code corpus instead of 1,000 candidate codes. We list data statistics about the filtered dataset in the accompanying table. We use GraphCodeBERT to separately encode the query and the source code with data flow, and calculate the inner product of their representations of the special token [CLS] as relevance scores to rank candidate codes. In the fine-tuning step, we set the learning rate as 2e-5, the batch size as 32, the max sequence length of queries and codes as 128 and 256, and the max number of nodes as 64. We use the Adam optimizer to update model parameters and perform early stopping on the development set. We also report the results using the same setting as Husain et al. (2019) in Table 8. In this setting, models are required to retrieve an answer for a query from 1000 candidates. The results show that GraphCodeBERT also achieves state-of-the-art performance.

Table 8: Results on natural language code search using the setting of Husain et al. (2019).
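The four filtering rules listed in this appendix can be approximated with a short sketch. Several parts are assumptions made only for illustration: parsing uses Python's standard `ast` module (so it applies to Python code only, whereas the dataset covers six languages), query tokens are counted by whitespace splitting, only the "http://" marker named in the text is checked, and "written in English" is crudely approximated by an ASCII test. The function name `keep_example` is hypothetical.

```python
import ast


def keep_example(code, query):
    """Return True if a (code, query) pair passes the four filtering rules."""
    # Rule (1): the code must parse into an abstract syntax tree.
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return False
    # Rule (2): query length between 3 and 256 tokens (whitespace tokens).
    tokens = query.split()
    if len(tokens) < 3 or len(tokens) > 256:
        return False
    # Rule (3): no special tokens such as links.
    if "http://" in query:
        return False
    # Rule (4): non-empty and, as a rough proxy for English, ASCII only.
    if not query.strip() or not query.isascii():
        return False
    return True


print(keep_example("def f(x):\n    return x", "Return the input value"))
```

A production filter would need a per-language parser and a proper language identifier; the sketch only shows how the rules compose.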

C CODE CLONE DETECTION

Code clone detection aims to measure the similarity between two code fragments. We use the BigCloneBench dataset (Svajlenko et al., 2014), which contains over 6,000,000 true clone pairs and 260,000 false clone pairs from 10 different functionalities. We follow the settings in Wei & Li (2017), discarding code fragments without any tagged true and false clone pairs and using the 9,134 remaining code fragments. Finally, the dataset provided by Wang et al. (2020) includes 901,724/416,328/416,328 examples for training/validation/testing. We treat the task as binary classification to fine-tune GraphCodeBERT, using source code and data flow as the input. The probability of a true clone is calculated by dot product from the representation of [CLS]. In the fine-tuning step, we set the learning rate as 2e-5, the batch size as 16, the max sequence length as 512, and the max number of nodes as 128. We use the Adam optimizer to update model parameters, and we tune hyper-parameters and perform early stopping on the development set. We give a case of the GraphCodeBERT output for this task in Figure 6. In this example, the two Java source codes both download content from a given URL and convert the content into string type. Therefore, the two codes are semantically similar since they output similar results when given the same input. As we can see, our model gives a high score for this case, and the pair is classified as a true clone pair.



https://github.com/tree-sitter/tree-sitter
https://github.com/github/CodeSearchNet
http://lucene.apache.org/
http://poi.apache.org/
https://github.com/eclipse/jgit/
https://github.com/antlr/
http://sourceforge.net/projects/itext/
http://sourceforge.net/projects/jts-topo-suite/



Figure 1: The procedure of extracting data flow from a given source code. The rightmost graph is the data flow, which represents the relation of "where-the-value-comes-from" between variables.

Figure 2: An illustration of GraphCodeBERT pre-training. The model takes source code paired with a comment and the corresponding data flow as the input, and is pre-trained using standard masked language modeling (Devlin et al., 2018) and two structure-aware tasks. One structure-aware task is to predict where a variable is identified from (marked with orange lines), and the other is to predict data flow edges between variables (marked with blue lines).

Figure 3: An example of the Node Alignment task.

Figure 4: MRR score on the validation dataset of Ruby for code search with varying length of input sequence.

Results on code search. GraphCodeBERT outperforms other models significantly (p < 0.01).

is used to cluster similar vectors for detection. RtvNN (White et al., 2016) trains a recursive autoencoder to learn representations for AST. CDLH (Wei & Li, 2017) learns representations of code fragments via an AST-based LSTM, and Hamming distance is used to optimize the distance between the vector representations of AST pairs.

Results on code refinement.

and the source-to-target attention. Then we use the training data to fine-tune the whole model. In the table, we see that the Transformer significantly outperforms LSTM. Results in the second group show that pre-trained models further outperform Transformer models, and

Ablation study on natural language code search.

Node- vs. Token-level Attention. Table 6 shows how frequently the special token [CLS], which is used to calculate the probability of the correct candidate, attends to code tokens (Codes) and variables (Nodes). We see that although nodes account for only 5%∼20% of the input, the share of attention over nodes (around 10% to 32%) exceeds the node/code ratio across all programming languages. The results indicate that data flow plays an important role in the code understanding process and that the model pays more attention to nodes in data flow than to code tokens.
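The comparison between attention share and input share can be made concrete with a small sketch. It assumes a single vector of attention weights from [CLS] to every input position and a boolean mask marking which positions are data flow nodes; both the function name and the toy numbers are hypothetical.

```python
def attention_share(cls_attention, is_node):
    """Split the [CLS] attention mass into (code-token share, node share)."""
    node_mass = sum(w for w, n in zip(cls_attention, is_node) if n)
    code_mass = sum(w for w, n in zip(cls_attention, is_node) if not n)
    total = node_mass + code_mass
    return code_mass / total, node_mass / total


# Toy input: four code tokens followed by one data flow node (hypothetical weights).
cls_attention = [0.15, 0.15, 0.15, 0.15, 0.40]
is_node = [False, False, False, False, True]

code_share, node_share = attention_share(cls_attention, is_node)
node_ratio = sum(is_node) / len(is_node)  # fraction of input positions that are nodes
```

Here the single node makes up one fifth of the input positions but receives two fifths of the attention, which is the kind of gap the table reports between the node/code ratio and the attention distribution.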

Li, Bo Ma, Xin Xia, and Zhi Jin. Detecting code clones with graph neural network and flow-augmented abstract syntax tree. arXiv preprint arXiv:2002.08653, 2020.

Huihui Wei and Ming Li. Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code. In IJCAI, pp. 3034-3040, 2017.

Pengcheng Yin and Graham Neubig. A syntactic neural model for general-purpose code generation. In The 55th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, Canada, July 2017. URL https://arxiv.org/abs/1704.01696.

Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, Kaixuan Wang, and Xudong Liu. A novel neural source code representation based on abstract syntax tree. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp. 783-794. IEEE, 2019.

Code Search | Training examples | Dev queries | Testing queries | Candidate codes

Data statistics about the filtered dataset. For each query in the development and testing sets, the answer is retrieved from the whole set of candidate codes (i.e. the last row).

ACKNOWLEDGMENTS

Daya Guo and Jian Yin are supported by the Research Foundation of Science and Technology Plan Project in Guangdong Province (2017B030308007).

D CODE TRANSLATION

Code translation aims to migrate legacy software from one programming language on a platform to another. We conduct experiments on a dataset crawled from several of the same open-source projects as Nguyen et al. (2015) and Chen et al. (2018), i.e. Lucene, POI, JGit and Antlr. Unlike them, we do not use Itext and JTS because of license problems. These projects have both Java and C# implementations. We pair the methods in the two languages based on their file names and method names. After removing duplicates and methods with a null function body, the total number of method pairs is 11,800. We split off 500 pairs as the development set and another 1,000 pairs as the test set. To demonstrate the effectiveness of GraphCodeBERT on the task of code translation, we adopt various pre-trained models as encoders and keep hyperparameters consistent. We set the learning rate to 1e-4, the batch size to 32, the max sequence length to 256, and the max number of nodes to 64. We use the Adam optimizer to update model parameters, and we tune hyper-parameters and perform early stopping on the development set.

We give a case of the GraphCodeBERT output for this task in Figure 7. In this example, the model successfully translates a piece of Java code into its C# version. The differences include the type name (from "boolean" to "bool") and the way of getting the string value of a bool variable (from "String.valueOf(b)" to "b.ToString()").
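The dataset construction described above can be sketched as follows. This is a minimal illustration, not the pipeline used for the experiments: the dictionary layout of a method record, the case-insensitive matching of names (so that Java "toString" meets C# "ToString"), and the seeded shuffle are all assumptions.

```python
import random


def pair_methods(java_methods, csharp_methods):
    """Pair methods whose file name (without extension) and method name match."""
    def key(method):
        file_stem = method["file"].rsplit(".", 1)[0].lower()
        return (file_stem, method["name"].lower())  # case-insensitive: an assumption

    csharp_by_key = {key(m): m for m in csharp_methods}
    pairs, seen = [], set()
    for jm in java_methods:
        cm = csharp_by_key.get(key(jm))
        if cm is None:
            continue  # no counterpart in the other language
        if not jm["body"].strip() or not cm["body"].strip():
            continue  # drop methods with an empty body
        fingerprint = (jm["body"], cm["body"])
        if fingerprint in seen:
            continue  # drop duplicates
        seen.add(fingerprint)
        pairs.append(fingerprint)
    return pairs


def split_pairs(pairs, n_dev, n_test, seed=0):
    """Shuffle with a fixed seed, then carve out dev and test sets."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


# Toy records (hypothetical).
java = [
    {"file": "Foo.java", "name": "toString", "body": "return x;"},
    {"file": "Foo.java", "name": "empty", "body": ""},
    {"file": "Bar.java", "name": "run", "body": "go();"},
]
csharp = [
    {"file": "Foo.cs", "name": "ToString", "body": "return x;"},
    {"file": "Foo.cs", "name": "Empty", "body": ""},
]
pairs = pair_methods(java, csharp)
train, dev, test = split_pairs(list(range(10)), n_dev=2, n_test=3)
```

Applied to the figures in the text, holding out 500 development pairs and 1,000 test pairs from 11,800 leaves 10,300 pairs for training.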

E CODE REFINEMENT

Code refinement aims to automatically fix bugs in code. We use the dataset released by Tufano et al. (2019). The source is buggy Java functions, while the target is the corresponding fixed ones. Almost all the names of variables and custom methods are normalized. The dataset contains two subsets based on code length. For the small dataset, the numbers of training, development and test samples are 46,680, 5,835 and 5,835. For the medium dataset, the numbers are 52,364, 6,545 and 6,545. We also use the sequence-to-sequence Transformer model to conduct the experiments. In the fine-tuning step, we adopt various pre-trained models as encoders. We set the learning rate to 1e-4, the batch size to 32, the max sequence length to 256, and the max number of nodes to 64. We use the Adam optimizer to update model parameters and perform early stopping on the development set.

We give two cases of the GraphCodeBERT output for this task in Figure 8. In the first example, the model successfully fixes the operation bug (from "*" to "+") to match the function name "add". In the second case, the source function and type names are normalized. The return type of this function is "void", but the buggy code gives a return value. Our model successfully removes the "return" word so that the function body matches its declared return type.

F CASE STUDY

F.1 NATURAL LANGUAGE CODE SEARCH

We give a case study to illustrate the results retrieved by GraphCodeBERT on the natural language code search task, with a comparison to the CodeBERT and RoBERTa (code) models. Two examples are given in Figure 9, and we can see that GraphCodeBERT successfully retrieves the correct source code for the given queries in both examples. As we can see in the first case, incorporating data flow helps GraphCodeBERT better understand the complicated expression "[(k, v) for k, v in self.items() if v is not self.EMPTY]" by leveraging the dependency relation among variables in the data flow graph. In the second case, the terminology "%Y-%m-%d" in the Python programming language is a date-time format. GraphCodeBERT and CodeBERT both successfully retrieve the correct function. Compared with RoBERTa (code), the second case shows that utilizing natural language descriptions for pre-training helps models do better semantic matching between source code and queries on the code search task.

F.2 CODE CLONE DETECTION

We give a case study to compare GraphCodeBERT with the CodeBERT and RoBERTa (code) models on the code clone detection task. An example is shown in Figure 10. The first source code returns the HTML content from a given URL, while the second source code returns the last line from a fixed URL "http://kmttg.googlecode.com/svn/trunk/version". Their semantics are not similar because their outputs differ. Data flow could help GraphCodeBERT better understand that the return value "pageHTML" in the first source code comes from "pageHTML.append(line); pageHTML.append("\r\n");" instead of "bufferedWriter.write(pageHTML.toString());", and that the return value "version" in the second source code comes from "version = inputLine" or "version = null;". Although the two source codes overlap heavily (marked in yellow), GraphCodeBERT successfully predicts the gold label, unlike the other models without data flow.
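To make the "where-the-value-comes-from" relation used in this case study concrete, the following sketch extracts such edges from a few lines of Python. It is a simplified stand-in and not the extraction procedure used for GraphCodeBERT: it relies on Python's built-in `ast` module, handles only top-level, straight-line assignments to plain names, and ignores branches and loops, so it cannot represent a value with several possible sources. The function name `data_flow_edges` is hypothetical.

```python
import ast


def data_flow_edges(source):
    """Return edges (source_var, source_line, target_var, target_line).

    Each edge says that the value of target_var, assigned on target_line,
    comes from source_var as last assigned on source_line.
    """
    last_def = {}  # variable name -> line of its most recent assignment
    edges = []
    for stmt in ast.parse(source).body:
        if not isinstance(stmt, ast.Assign):
            continue  # only simple assignments are handled in this sketch
        # Names read on the right-hand side of the assignment.
        reads = [n for n in ast.walk(stmt.value) if isinstance(n, ast.Name)]
        targets = [t for t in stmt.targets if isinstance(t, ast.Name)]
        for target in targets:
            for read in reads:
                if read.id in last_def:
                    edges.append((read.id, last_def[read.id], target.id, stmt.lineno))
        # Record the new definitions only after the reads have been resolved.
        for target in targets:
            last_def[target.id] = stmt.lineno
    return edges


code = "max_value = 10\nmin_value = 3\nv = max_value - min_value\n"
edges = data_flow_edges(code)
```

For the three-line example, the sketch yields two edges into `v`, one from `max_value` and one from `min_value`, which is the dependency information that a token sequence alone does not state explicitly.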

F.3 CODE TRANSLATION AND CODE REFINEMENT

We give a case study to compare GraphCodeBERT with a Transformer that does not use data flow on code generation tasks, including code translation and code refinement. We list three cases each in Table 9 and Table 10, respectively. [src] represents the source input, [ref] represents the reference, [sys] represents the Transformer without data flow and [ours] represents GraphCodeBERT. We can see that the Transformer ([sys]) baseline makes several mistakes, including repeated tokens, logic errors and syntax errors, while using GraphCodeBERT ([ours]) as an encoder improves the generation.

G ERROR ANALYSIS

We also conduct error analysis and summarize two main classes of errors for both code understanding and generation tasks.

Figure 11 gives three error cases of GraphCodeBERT on the natural language code search task. We observe that GraphCodeBERT mainly fails to retrieve source code that involves library functions, such as "tf" (TensorFlow) in the first case and "GoogleCloudStorageHook" in the second case. It is difficult for GraphCodeBERT to understand the meanings of APIs like "tf.io.read_file" and "tf.image.decode_image" without relevant information. A potential direction to mitigate the problem is to incorporate definitions of the library. The other major problem is that the query contains terminology such as "unistr" (corresponding to "decode('utf-8')" in the Python code) in the third case. Incorporating more text-code pairs for pre-training might alleviate this problem.

As for the code generation task, Table 11 shows two cases of GraphCodeBERT on the code translation task. We find that the major problems include semantic errors, like identifiers from nowhere in the first case, and syntax errors, like missing a "}" symbol before "return n" in the second case. This problem might be mitigated by incorporating a dedicated decoder that takes into account the grammar of programming languages and a different generation paradigm, such as generating a sequence of production rules (Yin & Neubig, 2017; Guo et al., 2018; 2019) in a context-free grammar manner.

Case 1: semantic error - identifiers from nowhere.
[src] public String toString() {return getKey() + ": " + getValue(); }
[ref] public override string ToString(){return GetKey() + ": " + GetValue();}
[ours] public override string ToString(){return Name + ": " + GetValue();}

Case 2: syntax error - missing a "}" before "return n".

[src] represents the source input, [ref] represents the reference and [ours] represents GraphCodeBERT.

