STATIC PREDICTION OF RUNTIME ERRORS BY LEARNING TO EXECUTE PROGRAMS WITH EXTERNAL RESOURCE DESCRIPTIONS

Abstract

The execution behavior of a program often depends on external resources, such as program inputs or file contents, and so the program cannot be run in isolation. Nevertheless, software developers benefit from fast iteration loops where automated tools identify errors as early as possible, even before programs can be compiled and run. This presents an interesting machine learning challenge: can we predict runtime errors in a "static" setting, where program execution is not possible? Here, we introduce a competitive programming dataset and task for predicting runtime errors, which we show is difficult for generic models like Transformers. We approach this task by developing an interpreter-inspired architecture with an inductive bias towards mimicking program executions, which models exception handling and "learns to execute" descriptions of external resources. Surprisingly, we show that the model can also predict the locations of errors, despite being trained only on labels indicating the presence or absence and kind of errors. In total, we present a practical and difficult-yet-approachable challenge problem related to learning program execution behavior, and we demonstrate promising new capabilities of interpreter-inspired machine learning models for code.

1. INTRODUCTION

We investigate applying neural machine learning methods to the static analysis of source code for early prediction of runtime errors. The execution behavior of a program is in general not fully defined by its source code in isolation, because programs often rely on external resources like inputs, the contents of files, or the network. Nevertheless, software developers benefit from fast iteration loops where automated tools identify errors early, even when program execution is not yet an option. We therefore consider the following machine learning challenge: can we predict runtime errors in a "static" setting, where program execution is not possible? This runtime error prediction task is well suited as a challenge problem because it is difficult-yet-approachable, has real-world value for software developers, requires novel modeling considerations that we hypothesize will be applicable to a range of learning for code tasks, and, with this work, now has a suitable large dataset of complex human-authored code with error labels. The task is to predict whether a program will exhibit a runtime error when it is run, and if so, to determine the error class; even when static analysis cannot provide guarantees of an error in the code, patterns learned from data may point to likely errors. Our dataset consists of 2.4 million Python 3 programs from Project CodeNet (Puri et al., 2021) written by competitive programmers. We have run all programs in a sandboxed environment on sample inputs to determine their error classes, finding the programs exhibit 26 distinct error classes including "no error". Each program relies on an external resource, the stdin input stream, and we pair each program with a natural language description of the behavior of the stream. We make the task and dataset, along with all models considered in this work, available for the research community to facilitate reproduction of this work and further research.¹
To make progress on this challenging task, we identify a promising class of models from prior work, interpreter-inspired models, and we demonstrate that they perform well on the task. Instruction Pointer Attention Graph Neural Network (IPA-GNN) (Bieber et al., 2020) models simulate the execution of a program, following its control flow structure but operating in a continuous embedding space. We make a number of improvements to the IPA-GNN: scaling it up to handle complex programs requiring thousands of execution steps, adding the ability to "learn to execute" descriptions of external resources, and extending the architecture to model exception handling and recover error locations. We evaluate these interpreter-inspired architectures against Transformer, LSTM, and GGNN neural baselines, and against pylint as a static analysis baseline. Our combined improvements lead to increased accuracy in predicting runtime errors, and to interpretability that allows for prediction of error locations even though the models are trained only on error presence and error class, not error location. In total, we summarize our contributions as follows:

• We introduce the runtime error prediction task and a large accompanying dataset, providing runtime error annotations for millions of competition Python programs.
• We demonstrate that IPA-GNN architectures are practical for the complexity of real programs by scaling them to handle competition programs, where we find they outperform generic models.
• We demonstrate that external resource descriptions, such as Japanese or English descriptions of stdin, can be leveraged to improve performance on the task across all model architectures.
• We extend the IPA-GNN to model exception handling, resulting in the Exception IPA-GNN, which we find can localize errors even when trained only on error presence and kind, not error location.
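The central mechanism of an interpreter-inspired model is a soft instruction pointer: a probability distribution over program statements that the model propagates along control-flow edges at each step. The sketch below is a simplified, hypothetical rendering of that update; the actual IPA-GNN also maintains learned per-statement hidden states updated by an RNN and computes branch decisions from those states.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_step(p, true_succ, false_succ, branch_logits):
    """One soft instruction-pointer step over a control-flow graph.

    p: (N,) probability that execution is currently at each of N statements.
    true_succ / false_succ: successor statement index for each statement
        (identical for straight-line statements).
    branch_logits: (N, 2) learned scores for taking the true/false branch.
    Returns the next-step distribution over statements; total probability
    mass is conserved.
    """
    n_statements = p.shape[0]
    p_next = np.zeros(n_statements)
    for n in range(n_statements):
        b_true, b_false = softmax(branch_logits[n])
        p_next[true_succ[n]] += p[n] * b_true
        p_next[false_succ[n]] += p[n] * b_false
    return p_next
```

On a straight-line three-statement program (with the final statement looping to itself as an exit node), starting from all mass on statement 0, repeated application of `soft_step` marches the probability mass down the program, statement by statement; at a conditional, mass splits across both branches according to the learned branch decision.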

2. RELATED WORK

Program analysis Program analysis is a rich family of techniques for detecting defects in programs, including static analyses, which are performed without executing code (Livshits and Lam, 2005; Xie and Aiken, 2006; Ayewah et al., 2008), and dynamic analyses, which are performed at runtime (Cadar et al., 2008; Sen et al., 2005; Godefroid et al., 2005). Linters and type checkers are popular error detection tools that use static analysis. Static analysis (e.g. symbolic execution) does not typically use concrete inputs, while dynamic analysis requires concrete inputs and program execution. Compared with traditional static analysis, our approach is more flexible in its input representation: it uses a general "resource description" abstraction, which can represent the entire spectrum from concrete inputs to input constraints to missing inputs.

Execution-aware models Several neural architectures draw inspiration from program interpreters (Graves et al., 2014; Łukasz Kaiser and Sutskever, 2016; Reed and de Freitas, 2016; Graves et al., 2016; Bošnjak et al., 2017; Gaunt et al., 2017; Dehghani et al., 2019; Bieber et al., 2020). Our work is most similar to Bieber et al. (2020) and Bošnjak et al. (2017), focusing on how interpreters handle control flow and exception handling rather than on memory allocation and function call stacks. Other works use program execution data directly, training with textual representations of execution traces as inputs (Nye et al., 2021a; Pei et al., 2021; Nye et al., 2021b) or performing execution during synthesis (Chen et al., 2019; Li et al., 2022; Shrivastava et al., 2021). Compared with these, our approach uses weaker supervision, training only on runtime error labels.

Fault detection and localization datasets There has been considerable recent interest in applying machine learning to identifying and localizing faults in source code (Allamanis et al., 2018a). Puri et al. (2021) make a large dataset of real-world programs available, which we build on in constructing our runtime errors dataset. Our dataset (i) is large (it has millions of examples), (ii) exhibits many programming language features, (iii) is written by human authors, and (iv) has error labels derived from the execution behavior of programs. Previous code datasets exhibit only a subset of these properties: large real-world and competition code datasets (Hendrycks et al., 2021; Li et al., 2022; Kanade et al., 2020; Raychev et al., 2016; Husain et al., 2019; Puri et al., 2021) exhibit properties i, ii, and iii, but not iv, while learning-to-execute datasets (Zaremba and Sutskever, 2014; Bieber et al., 2020) exhibit property iv but not i, ii, or iii. Recent program synthesis datasets (Chen et al., 2021; Austin et al., 2021) exhibit ii and iii only. Other datasets obtain error labels by injecting synthetic errors (Allamanis et al., 2018b; Karampatsis and Sutton, 2020; Pradel and Sen, 2018) (lacking the realism of iii) or from commit messages (Just et al., 2014; Dinella et al., 2020) (lacking i and iv).

Fault localization approaches Fault localization approaches vary in (i) level of supervision: weak (error labels) (Li et al., 2019) vs. strong (explicit location labels) (Lou et al., 2021; Zhang et al.,

¹ https://github.com/google-research/runtime-error-prediction
