GLUECODE: A BENCHMARK FOR SOURCE CODE MACHINE LEARNING MODELS

Abstract

A multitude of machine learning models for source code have been proposed in recent years, capturing various aspects of the inherent rich structure and semantics of code. However, these models are commonly designed to perform well on a single task, failing to capture code's multifaceted nature. To address this, we present GLUECode, Global and Local Understanding Evaluation of Code, a benchmark of diverse tasks to evaluate machine learning models of source code. Crucially, GLUECode accounts for the distinct characteristics of source code: (1) source code is highly structured and (2) source code is often composed of multiple interacting entities. Existing tasks incentivize researchers to create models and code representations that perform well on a single task, commonly focusing on local reasoning. GLUECode aims to allow researchers to experiment with multiple local and global source code representations, and to evaluate these models on their ability to capture the diverse characteristics of source code, thus driving the community towards building robust source code models that incorporate global reasoning. We present results for several baselines. The GLUECode tasks are challenging for the evaluated baselines; no model achieves convincing performance across all tasks. This indicates that there is ample room for progress on GLUECode.

1. INTRODUCTION

In recent years, there has been considerable interest in machine learning models of source code artifacts. Machine learning models have been used to address a variety of software engineering tasks, as the inherent rich structure of code has allowed machine learning researchers to explore new models and ideas. However, research has focused on single-purpose models, targeting a single task each time while using varying source code representations and datasets. This impedes progress towards general-purpose machine learning models of code that can learn and reason across many tasks. In this work, we present GLUECode (Global and Local Understanding Evaluation of Code), with the goal of measuring progress in source code modelling across a range of tasks that account for the diverse characteristics of software and require diverse reasoning capabilities over several thousands of software projects. As GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) do for natural language, GLUECode highlights important aspects of reasoning about code: (1) since code in software is composed of multiple interacting entities, it includes tasks that leverage both local (single method) and global (multiple inter-related methods, information beyond the local method) reasoning to varying degrees, in contrast to most tasks and models introduced so far, which focus on local reasoning; (2) since source code mixes structured and unstructured information, GLUECode tasks leverage both kinds of information; and (3) since the space of modelling choices is large, we provide several source code representations, ranging from raw text to abstract syntax trees (AST) and graph representations, lowering the barrier to entry and easing experimentation. The design space for source code models is extremely large and spans a wide range of source code representations.
These range from the simplest (software metrics and n-grams) to very complex representations that fully take advantage of the structure and semantics of source code (such as graph-based representations). Even seemingly simple choices, such as how to preprocess identifiers, can be handled in many different ways and have a disproportionate impact (Karampatsis et al., 2020). GLUECode aims to provide a unified benchmark to explore this design space. We provide performance results on a set of baselines, ranging from simple neural architectures such as LSTMs and CNNs to variants of pre-trained transformers. These models leverage purely local reasoning and limited amounts of structural information. We show that existing models perform well on a few tasks but fail to yield good results on others: in contrast to NLP, where (pre-trained) transformers outperform other models, we find that no single model of code consistently outperforms the others on all tasks. Finally, while models can be evaluated on any single task in the benchmark in isolation (as the field is presently doing), a long-term goal of GLUECode is the creation of unified multi-task source code models that perform well across multiple tasks. A source code model that is jointly trained and performs well on all the tasks in the benchmark would be a significant step towards more versatile models that can, beyond the tasks they were trained on, also adapt to downstream tasks, especially when there is not enough data. Given the performance of our baselines in the single-task scenario, defining a model that performs well across the board is very much an open problem.
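To illustrate how much latitude even identifier preprocessing leaves, the sketch below splits Java-style identifiers into lowercase subtokens, one of the common strategies discussed by Karampatsis et al. (2020). The function name and the exact splitting rules are our own illustration, not part of the benchmark:

```python
import re

def split_identifier(name):
    """Split a source-code identifier into lowercase subtokens.

    Handles camelCase, PascalCase, snake_case, acronym runs, and digit
    boundaries, e.g. 'parseHTTPResponse2' -> ['parse', 'http', 'response', '2'].
    """
    parts = []
    for chunk in name.split('_'):  # snake_case boundaries first
        # Then split acronym runs ('HTTP' before 'Response'), camel-case
        # words, trailing all-caps runs, and digit groups.
        parts += re.findall(r'[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+',
                            chunk)
    return [p.lower() for p in parts]
```

Other reasonable choices at this step include keeping the original casing, applying BPE over raw identifiers, or treating each identifier as a single vocabulary item; each leads to different vocabularies and model behavior.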

2. THE GLUECODE BENCHMARK

Benchmarks are common practice in machine learning and NLP, prominently featuring GLUE and SuperGLUE (Wang et al., 2018; 2019), among others. In the domain of machine learning on source code, several benchmarks have been proposed. However, in contrast to GLUECode, they consider relatively local contexts and do not incentivize non-local reasoning: IdBench (Wainakh et al., 2019) looks at identifiers, BigCloneBench (Svajlenko & Roy, 2015) and OJClone (Mou et al., 2016) at clone detection, and CodeSearchNet (Husain et al., 2020) at function-level text-to-code search. Finally, COSET concerns classifying small programs by their functionality into 38 classes (Wang & Christodorescu, 2019), and CoNaLa is a line-level text-to-code generation benchmark (Yin et al., 2018). In this section, we provide an overview of GLUECode. We first describe the software-specific characteristics that impact the choice of tasks, before detailing the dataset and the tasks involved. Details about other related benchmarks can be found in Appendix D.

2.1. LOCAL VERSUS GLOBAL CONTEXT

Most existing machine learning models of source code work at the level of a single function or method. We call these local models, as they reason over the local context of a single software entity. This is in contrast to global models, which reason over multiple software entities. Global models are highly desirable, since software systems are composed of multiple entities, such as modules and functions, that communicate with each other; this composition of communicating entities dictates the behavior of a software system. For instance, a function may behave radically differently depending on its arguments. Indeed, small local changes can manifest as large changes in behaviour at distant program locations, and only global models are able to detect this. To push forward the state of the art, it is thus critical to focus on global models. To reason over global contexts, however, two limitations need to be overcome. First, time-consuming interprocedural static analyses need to be performed at scale; these require compiling projects and resolving all their dependencies. In GLUECode, we take a step in this direction by using the largest publicly available corpus of compilable Java code (Sec. 2.3). Second, existing methods do not operate well on large and sparse inputs, and representations are thus tailored to use only the necessary information. In GLUECode, we provide access to a variety of representations and propose a set of tasks that cannot be solved by focusing solely on local or global information (Sec. 2.2).
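The kind of interprocedural context a local model cannot see can be sketched as a reachability query over a call graph. In the toy example below, the method names and edges are invented for illustration; a purely local model sees only one method body, while a global model can follow the call edges across entities:

```python
from collections import deque

# Toy call graph: each method maps to the methods it invokes.
# A local model sees a single key's body; global reasoning
# requires traversing the edges between entities.
CALL_GRAPH = {
    "Main.run":       ["Config.load", "Worker.process"],
    "Config.load":    ["Parser.parse"],
    "Worker.process": ["Parser.parse", "Log.write"],
    "Parser.parse":   [],
    "Log.write":      [],
}

def transitive_callees(method, graph):
    """Return every method reachable from `method` via calls (BFS)."""
    seen, queue = set(), deque([method])
    while queue:
        current = queue.popleft()
        for callee in graph.get(current, []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen
```

In practice, building such a graph for real Java projects is exactly the expensive interprocedural analysis mentioned above: it requires resolving dynamic dispatch and project dependencies, which is why a corpus of compilable code matters.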



Fully global models are currently out of reach, but GLUECode incentivizes building models that feature some form of global reasoning in addition to local reasoning. Existing work uses simplified projections of global representations: the GNN works of Allamanis et al. (2017; 2020) look solely at file-level tokens, syntax, data, and control flow information. CocoGum (Wang et al., 2020) uses class context represented as abstracted UML diagrams. LambdaNet (Wei et al., 2020) extracts type dependencies in JavaScript into a single graph for a few mid-sized projects (500-10k lines of code), ignoring syntactic information, code comments, etc. Finally, Func2Vec (DeFreez et al., 2018) computes function embeddings over an interprocedural call graph, ignoring local syntax, function arguments, etc. An extended discussion of related work can be found in Appendix D.

