DEEP DATA FLOW ANALYSIS

Abstract

Compiler architects increasingly look to machine learning when building heuristics for compiler optimization. The promise of automatic heuristic design, freeing the compiler engineer from the complex interactions of program, architecture, and other optimizations, is alluring. However, most machine learning methods cannot replicate even the simplest of the abstract interpretations of data flow analysis that are critical to making good optimization decisions. This must change for machine learning to become the dominant technology in compiler heuristics. To this end, we propose PROGRAML (Program Graphs for Machine Learning), a language-independent, portable representation of whole-program semantics for deep learning. To benchmark current and future learning techniques for compiler analyses we introduce an open dataset of 461k Intermediate Representation (IR) files for LLVM, covering five source programming languages, and 15.4M corresponding data flow results. We formulate data flow analysis as a message passing neural network (MPNN) and show that, using PROGRAML, standard analyses can be learned, yielding improved performance on downstream compiler optimization tasks.

1. INTRODUCTION

Compiler implementation is a complex and expensive activity (Cooper & Torczon, 2012). For this reason, there has been significant interest in using machine learning to automate various compiler tasks (Allamanis et al., 2018). Most works have restricted their attention to selecting compiler heuristics or making optimization decisions (Ashouri et al., 2018; Wang & O'Boyle, 2018). Whether learned or engineered by human experts, these decisions naturally require reasoning about the program and its behavior.

Human experts most often rely upon data flow analyses (Kildall, 1973; Kam & Ullman, 1976). These are algorithms on abstract interpretations of the program, propagating information of interest through the program's control-flow graph until a fixed point is reached (Kam & Ullman, 1977). Two examples out of many data flow analyses are: liveness -- determining when resources become dead (unused) and may be reclaimed; and available expressions -- discovering which expressions have been computed on all paths to points in the program.

Prior machine learning works, on the other hand, have typically represented the entirety of the program's behavior as a fixed-length, statically computed feature vector (Ashouri et al., 2018). Typical feature values might be the number of instructions in a loop or the dependency depth. The weakness of these techniques is shown by the fact that they are trivially confused by the addition of dead code, which changes their feature vectors without changing the program's behavior or its response to optimizations. Such learning algorithms are unable to learn their own abstract interpretations of the program and so cannot avoid these pitfalls or more subtle versions thereof (Barchi et al., 2019). Recently, there have been attempts to develop representations that allow finer-grained program reasoning. Many, however, are limited both by how inputs are represented and by how inputs are processed.
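To make the fixed-point mechanism concrete, the following is a minimal sketch of liveness analysis as a backward data flow iteration over a control-flow graph. The CFG, node names, and gen/kill sets below are hypothetical illustrations, not the paper's dataset format; real analyses operate on compiler IR such as LLVM.

```python
def liveness(cfg, gen, kill):
    """Iterate the liveness data flow equations to a fixed point.

    cfg maps each node to its successor list; gen[n] holds variables read
    at n; kill[n] holds variables written at n. Returns live-in sets.
    """
    live_in = {n: set() for n in cfg}
    live_out = {n: set() for n in cfg}
    changed = True
    while changed:  # repeat until no set changes: the fixed point
        changed = False
        for n in cfg:
            out = set().union(*(live_in[s] for s in cfg[n])) if cfg[n] else set()
            inn = gen[n] | (out - kill[n])
            if inn != live_in[n] or out != live_out[n]:
                live_in[n], live_out[n] = inn, out
                changed = True
    return live_in

# Straight-line example: s1: a = 1; s2: b = a + 1; s3: return b
cfg = {"s1": ["s2"], "s2": ["s3"], "s3": []}
gen = {"s1": set(), "s2": {"a"}, "s3": {"b"}}
kill = {"s1": {"a"}, "s2": {"b"}, "s3": set()}
print(liveness(cfg, gen, kill))  # a is live into s2, b into s3
```

Note that the result depends only on the program's structure: renaming variables or reordering unrelated statements leaves the analysis intact, which is precisely the invariance that fixed feature vectors lack.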
Representations based on source code and its direct artifacts (e.g., ASTs) (Alon et al., 2018a; Yin et al., 2018; Haj-Ali et al., 2020) put unnecessary emphasis on naming and stylistic choices that may not correlate with the functionality of the code (e.g., Fig. 2a). Approaches based on intermediate representations (IR) (Ben-Nun et al., 2018; Mirhoseini et al., 2017; Brauckmann et al., 2020) remove such noise but fail to capture information about the program that is important for analysis (e.g., variables in Fig. 2b, commutativity in Fig. 2c). In both cases, models are expected to reason about the flow of information in programs using representations that do not directly encode this information. Clearly, a program representation is needed that enables machine learning algorithms to reason about the execution of a program by developing their own data flow analyses.
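The following sketch illustrates, in the spirit of PROGRAML, what it means for a representation to encode information flow directly: instructions and variables become nodes, and control and data relations become typed edges. The IR snippet, node ids, and edge labels are hypothetical simplifications of the paper's actual graph construction.

```python
# IR fragment (illustrative): %a = add %x, %y ; %b = mul %a, %x
nodes = {
    0: ("instruction", "add"),
    1: ("instruction", "mul"),
    2: ("variable", "%x"),
    3: ("variable", "%y"),
    4: ("variable", "%a"),
    5: ("variable", "%b"),
}
edges = [
    (0, 1, "control"),               # add executes before mul
    (2, 0, "data"), (3, 0, "data"),  # %x, %y are operands of add
    (0, 4, "data"),                  # add defines %a
    (4, 1, "data"), (2, 1, "data"),  # %a, %x are operands of mul
    (1, 5, "data"),                  # mul defines %b
]

def reaches(src, dst, edges, kind="data"):
    """Reachability over one edge type: does src flow into dst?"""
    frontier, seen = {src}, set()
    while frontier:
        n = frontier.pop()
        seen.add(n)
        for u, v, k in edges:
            if u == n and k == kind and v not in seen:
                frontier.add(v)
    return dst in seen

print(reaches(2, 5, edges))  # True: %x flows into %b via the mul
```

With flow encoded as typed edges, a data flow question becomes a path query on the graph rather than a property a model must infer from token order or identifier names.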

