LANGUAGE-AGNOSTIC REPRESENTATION LEARNING OF SOURCE CODE FROM STRUCTURE AND CONTEXT

Abstract

Source code (Context) and its parsed abstract syntax tree (AST; Structure) are two complementary representations of the same computer program. Traditionally, designers of machine learning models have relied predominantly on either Structure or Context. We propose a new model which jointly learns on the Context and Structure of source code. In contrast to previous approaches, our model uses only language-agnostic features, i.e., source code and features that can be computed directly from the AST. Besides obtaining state-of-the-art results on monolingual code summarization for all five programming languages considered in this work, we propose the first multilingual code summarization model. We show that jointly training on non-parallel data from multiple programming languages improves results on all individual languages, with the strongest gains on low-resource languages. Remarkably, multilingual training only from Context does not lead to the same improvements, highlighting the benefits of combining Structure and Context for representation learning on code.

1. INTRODUCTION

Machine learning for code is an active and growing area of research which aims at building models that can learn semantically meaningful representations of programs. These embeddings can be used for downstream tasks such as code generation, bug detection, or code summarization. We focus our work on two complementary data representations of programs: the source code (referred to as Context in this work), and the abstract syntax tree (AST; referred to as Structure). Traditionally, researchers and practitioners have decided to predominantly leverage either Structure or Context in their machine learning models. In this work, we show that jointly learning on Context and Structure improves representation learning on source code (see Fig. 1). The source code representation naturally lends itself to models from natural language processing (NLP), e.g., long short-term memory networks (LSTMs; Hochreiter & Schmidhuber, 1997) or Transformers (Vaswani et al., 2017; Radford et al., 2019; Dai et al., 2019; Yang et al., 2019; Shaw et al., 2018). On the other hand, models leveraging the structure representation are typically based on graph neural networks (GNNs) (Kipf & Welling, 2017; Xu et al., 2019; Veličković et al., 2018; You et al., 2019; Hamilton et al., 2017; Li et al., 2015; Klicpera et al., 2020). While the AST representation makes the highly structured nature of source code explicit to the models, most GNNs use the message-passing framework, so their learned representations are inherently local and struggle to leverage long-range interactions. We propose the CODE TRANSFORMER[1], which combines distances computed on Structure and Context in the self-attention operation. In contrast to the localized treatment via edges described above, we make the full Structure accessible to the model at each layer by computing pairwise distances on the AST, such as shortest path lengths. To this end, we draw inspiration from the XLNet architecture (Yang et al., 2019), which uses relative distances instead of absolute positions in the attention computation. Importantly, all our features are language-agnostic[2], i.e., they can easily be computed for any programming language based on the source code and AST.
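As a rough illustration of how relative distances from both representations can enter the self-attention operation, the following NumPy sketch adds distance-dependent biases to the attention scores. The fixed scalar weights, the Floyd-Warshall helper, and all names below are our own simplifications for illustration, not the actual CODE TRANSFORMER implementation, which learns encodings of several relative distances rather than using hand-set scalars.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def ast_shortest_path_lengths(parent):
    # All-pairs hop distances on a tree given a parent array (root has parent -1).
    n = len(parent)
    dist = np.full((n, n), np.inf)
    np.fill_diagonal(dist, 0.0)
    for child, par in enumerate(parent):
        if par >= 0:
            dist[child, par] = dist[par, child] = 1.0
    for k in range(n):  # Floyd-Warshall; fine for small ASTs
        dist = np.minimum(dist, dist[:, [k]] + dist[[k], :])
    return dist

def distance_biased_attention(x, seq_dist, ast_dist, w_seq=-0.1, w_ast=-0.5):
    # Scaled dot-product self-attention with additive relative-distance biases.
    # The scalar weights w_seq / w_ast stand in for learned distance encodings.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores = scores + w_seq * seq_dist + w_ast * ast_dist
    return softmax(scores) @ x

# Toy example: 5 tokens/AST nodes; the parent array encodes a small tree.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                                    # toy embeddings
seq_dist = np.abs(np.subtract.outer(np.arange(5), np.arange(5)))
ast_dist = ast_shortest_path_lengths([-1, 0, 0, 1, 1])
print(distance_biased_attention(x, seq_dist, ast_dist).shape)  # -> (5, 8)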


We use two datasets comprising 5 different programming languages in total, and evaluate the representations learned by our model on the task of code summarization, where the model predicts a method's name based on its body. Besides setting the state of the art on all five languages for single-language training, we also train the first multilingual model for code summarization. This is enabled by the fact that our model uses only language-agnostic features that can easily be obtained for any programming language. Remarkably, training our model on multiple programming languages substantially improves the performance on all languages. Moreover, multilingual training only from Context does not lead to the same improvements, highlighting the benefits of combining Structure and Context for representation learning on code.
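To make the code summarization task concrete, here is a toy, hypothetical training sample (the method, its name sub-tokens, and the field names are invented for illustration and do not come from the paper's datasets): the method name is masked in the input, and the model predicts it as a sequence of sub-tokens.

# Hypothetical code summarization sample (invented for illustration):
# the method name is masked in the body, and the target is the name's sub-tokens.
sample = {
    "input": "def METHOD_NAME(items):\n    return sum(item.price for item in items)",
    "target": ["get", "total", "price"],  # predicted as a sequence of sub-tokens
}
print(sample["input"])
print(sample["target"])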

2. RELATED WORK

Machine Learning for Code. Early research learned language models on raw text data, e.g., (Wang et al., 2016; Raychev et al., 2014; Dam et al., 2016), providing evidence for the naturalness assumption (Hindle et al., 2012). For example, Allamanis et al. (2015) learned distributed representations of variables and methods, finding that they were indeed able to encode common semantic properties from the regularities present in source code. Alon et al. (2019b) also found evidence of semantic arithmetic in their embedding space, dubbed code2vec. These representations, and their variants like (Mou et al., 2016), can then be used to predict sequences of identifier sub-tokens (Allamanis et al., 2015) or API calls (Acharya et al., 2007; Nguyen et al., 2017). They can be used as advanced auto-completion tools (Hindle et al., 2012; Bhoopchand et al., 2016), including for user-provided tokens like variable names (Raychev et al., 2014; Allamanis et al., 2014). These are useful, for example, for deobfuscating Android applications (Bichsel et al., 2016). Several works leverage structured graphical models for probabilistic models of source code, usually through parse trees (Maddison & Tarlow, 2014; Bielik et al., 2016). Unlike previous works where hand-crafted features were used as node features (Raychev et al., 2014) or as explicit semantic edges (Allamanis et al., 2018), our work does not augment the existing syntactic relationships between the nodes in the AST.



[1] Code at www.daml.in.tum.de/code-transformer; demo at code-transformer.org.
[2] We use the term language-agnostic to highlight that our model does not rely on language-specific features (e.g., program analysis edges), thus facilitating multi-language training, as it is possible to generate unified AST representations for different programming languages.



Recently, Hellendoorn et al. (2020) have explored models that can leverage several representations, including both Structure and Context. Their Graph Relational Embedding Attention Transformer (GREAT) extends Shaw et al. (2018) by biasing the self-attention computation in a localized way based on the underlying graph. The language-specific representations used by GREAT include a combination of the data flow graph, control flow graph, syntactic edges (inspired by Allamanis et al. (2018)), etc., which require specialized pipelines and static analysis tools to obtain.
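For contrast, a minimal sketch of the kind of localized, edge-based attention biasing used by GREAT-style models (the edge types and bias values below are invented for illustration, not GREAT's actual parameters): scores are only adjusted between positions directly connected by an edge, whereas the pairwise distances described above expose the full Structure to every attention layer.

import numpy as np

def edge_biased_scores(scores, edges, bias_per_type):
    # Add a per-edge-type bias only where an edge connects two positions,
    # i.e., the adjustment is localized to one-hop neighbours in the graph.
    scores = scores.copy()
    for src, dst, etype in edges:
        scores[src, dst] += bias_per_type[etype]
    return scores

# Toy example with invented edge types and hand-set biases.
scores = np.zeros((4, 4))
edges = [(0, 1, "data_flow"), (2, 3, "control_flow")]
print(edge_biased_scores(scores, edges, {"data_flow": 1.0, "control_flow": 0.5}))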

Figure 1: Context and Structure both encapsulate valuable information about source code. In this realistic example, tokens 1 and 4 are distant in the sequence of tokens (Context), but only 5 hops apart when traversing the Abstract Syntax Tree (Structure). As such, a method that relies only on the sequence of tokens could neglect the relationship between a method name and its return variable. Conversely, tokens 1 and 2 showcase the opposite setting. Hence, unifying Structure and Context leads to a more powerful representation of source code.


