LANGUAGE-AGNOSTIC REPRESENTATION LEARNING OF SOURCE CODE FROM STRUCTURE AND CONTEXT

Abstract

Source code (Context) and its parsed abstract syntax tree (AST; Structure) are two complementary representations of the same computer program. Traditionally, designers of machine learning models have relied predominantly on either Structure or Context. We propose a new model that jointly learns on Context and Structure of source code. In contrast to previous approaches, our model uses only language-agnostic features, i.e., source code and features that can be computed directly from the AST. Besides obtaining state-of-the-art results on monolingual code summarization for all five programming languages considered in this work, we propose the first multilingual code summarization model. We show that jointly training on non-parallel data from multiple programming languages improves results on all individual languages, with the strongest gains on low-resource languages. Remarkably, multilingual training only from Context does not lead to the same improvements, highlighting the benefits of combining Structure and Context for representation learning on code.

1. INTRODUCTION

Machine learning for code is an active and growing area of research that aims to build models which learn semantically meaningful representations of programs. These embeddings can be used in downstream tasks such as code generation, bug detection, or code summarization. We focus our work on two complementary data representations of programs: the source code (referred to as Context in this work) and the abstract syntax tree (AST; referred to as Structure). Traditionally, researchers and practitioners have chosen to predominantly leverage either Structure or Context in their machine learning models. In this work, we show that jointly learning on Context and Structure improves representation learning on source code (see Fig. 1).

The source code representation naturally lends itself to models from natural language processing (NLP), e.g., long short-term memory (LSTM) networks (Hochreiter & Schmidhuber, 1997) or Transformers (Vaswani et al., 2017; Radford et al., 2019; Dai et al., 2019; Yang et al., 2019; Shaw et al., 2018). Models leveraging the Structure representation, on the other hand, are typically based on graph neural networks (GNNs) (Kipf & Welling, 2017; Xu et al., 2019; Veličković et al., 2018; You et al., 2019; Hamilton et al., 2017; Li et al., 2015; Klicpera et al., 2020). While the AST representation makes the highly structured nature of source code explicit to the models, most GNNs rely on the message-passing framework, so their learned representations are inherently local and struggle to leverage long-range interactions.

Recently, Hellendoorn et al. (2020) have explored models that can leverage several representations, including both Structure and Context. Their Graph Relational Embedding Attention Transformer (GREAT) extends Shaw et al. (2018), which biases the self-attention computation in a localized way given the underlying graph. The language-specific representations used by GREAT include a combination of the data flow graph, control flow graph, syntactic edges (inspired by Allamanis et al. (2018)), etc., which require specialized pipelines and static analysis tools to obtain.
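To make the two views concrete, the snippet below (our illustration; the names and setup are not from the paper's pipeline) uses Python's built-in ast module to obtain both representations of the same small function. The language-agnostic setting described above relies only on such source text and on features computable directly from the parsed AST.

    import ast

    # Context: the raw source code of a small program.
    source = "def add(a, b):\n    return a + b\n"

    # Structure: the abstract syntax tree parsed from that same program.
    tree = ast.parse(source)

    # The AST makes explicit the node types and parent-child relations
    # that are only implicit in the token sequence.
    print(ast.dump(tree))
    print([type(node).__name__ for node in ast.walk(tree)])
    # e.g. ['Module', 'FunctionDef', 'arguments', 'Return', 'arg', 'arg', 'BinOp', ...]

Both views describe the same program but expose different inductive biases: the token sequence is suited to sequence models, while the tree makes hierarchy and nesting directly available.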


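For intuition on how a graph can bias self-attention in models such as GREAT, the following is a minimal single-head sketch, assuming a scalar learned bias per edge type; the function and variable names are ours, and the actual formulations in Shaw et al. (2018) and Hellendoorn et al. (2020) differ in detail.

    import numpy as np

    def graph_biased_attention(Q, K, V, edge_type, edge_bias):
        """Scaled dot-product attention with an additive, edge-dependent bias.

        Q, K, V:   (n, d) query/key/value matrices for n tokens.
        edge_type: (n, n) integers; edge_type[i, j] indexes the relation
                   between tokens i and j in the program graph (0 = no edge).
        edge_bias: (num_relations,) learned scalar bias per relation type.
        """
        d = Q.shape[-1]
        logits = Q @ K.T / np.sqrt(d)            # standard attention logits
        logits = logits + edge_bias[edge_type]   # bias each pair by its relation
        logits = logits - logits.max(axis=-1, keepdims=True)  # stable softmax
        weights = np.exp(logits)
        weights = weights / weights.sum(axis=-1, keepdims=True)
        return weights @ V

    # Hypothetical usage: tokens 0 and 1 are linked by relation 1 (e.g., a
    # data flow edge), boosting attention between them over unlinked pairs.
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
    edge_type = np.zeros((4, 4), dtype=int)
    edge_type[0, 1] = 1
    edge_bias = np.array([0.0, 2.0])
    out = graph_biased_attention(Q, K, V, edge_type, edge_bias)  # (4, 8)

Because the bias only modifies attention logits between node pairs connected in the graph, such a model retains the Transformer's global receptive field over Context while still injecting Structure.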