DEEP DATA FLOW ANALYSIS

Abstract

Compiler architects increasingly look to machine learning when building heuristics for compiler optimization. The promise of automatic heuristic design, freeing the compiler engineer from the complex interactions of program, architecture, and other optimizations, is alluring. However, most machine learning methods cannot replicate even the simplest of the abstract interpretations of data flow analysis that are critical to making good optimization decisions. This must change for machine learning to become the dominant technology in compiler heuristics. To this end, we propose PROGRAML (Program Graphs for Machine Learning), a language-independent, portable representation of whole-program semantics for deep learning. To benchmark current and future learning techniques for compiler analyses, we introduce an open dataset of 461k Intermediate Representation (IR) files for LLVM, covering five source programming languages, and 15.4M corresponding data flow results. We formulate data flow analysis as a message passing neural network (MPNN) and show that, using PROGRAML, standard analyses can be learned, yielding improved performance on downstream compiler optimization tasks.

1. INTRODUCTION

Compiler implementation is a complex and expensive activity (Cooper & Torczon, 2012). For this reason, there has been significant interest in using machine learning to automate various compiler tasks (Allamanis et al., 2018). Most works have restricted their attention to selecting compiler heuristics or making optimization decisions (Ashouri et al., 2018; Wang & O'Boyle, 2018). Whether learned or engineered by human experts, these decisions naturally require reasoning about the program and its behavior. Human experts most often rely upon data flow analyses (Kildall, 1973; Kam & Ullman, 1976). These are algorithms on abstract interpretations of the program, propagating information of interest through the program's control-flow graph until a fixed point is reached (Kam & Ullman, 1977). Two examples out of many data flow analyses are: liveness, determining when resources become dead (unused) and may be reclaimed; and available expressions, discovering which expressions have been computed on all paths to points in the program.

Prior machine learning works, on the other hand, have typically represented the entirety of the program's behavior as a fixed-length, statically computed feature vector (Ashouri et al., 2018). Typical feature values might be the number of instructions in a loop or the dependency depth. The weakness of these techniques is shown by the fact that they are trivially confused by the addition of dead code, which changes their feature vectors without changing the program's behavior or its response to optimizations. Such learning algorithms are unable to learn their own abstract interpretations of the program and so cannot avoid these pitfalls or more subtle versions thereof (Barchi et al., 2019). Recently, there have been attempts to develop representations that allow finer-grained program reasoning. Many, however, are limited both by how inputs are represented and by how inputs are processed.
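The fixed-point propagation described above can be made concrete with a classical liveness computation. The three-block program, its def/use sets, and all names below are a minimal illustrative sketch, not drawn from the paper's dataset or implementation.

```python
# A minimal sketch of classical liveness analysis over a hand-built
# CFG. Program sketch (illustrative):
#   b0: a = input()     defs={a}
#   b1: b = a + 1       defs={b}, uses={a}
#   b2: print(b)        uses={b}
cfg = {
    "b0": {"defs": {"a"}, "uses": set(), "succ": ["b1"]},
    "b1": {"defs": {"b"}, "uses": {"a"}, "succ": ["b2"]},
    "b2": {"defs": set(), "uses": {"b"}, "succ": []},
}

def liveness(cfg):
    """Iterate the backward data flow equations to a fixed point:
    live_out[n] = union of live_in over successors of n
    live_in[n]  = uses[n] | (live_out[n] - defs[n])
    """
    live_in = {n: set() for n in cfg}
    live_out = {n: set() for n in cfg}
    changed = True
    while changed:
        changed = False
        for n, node in cfg.items():
            out = set().union(*(live_in[s] for s in node["succ"])) if node["succ"] else set()
            inn = node["uses"] | (out - node["defs"])
            if inn != live_in[n] or out != live_out[n]:
                live_in[n], live_out[n] = inn, out
                changed = True
    return live_in, live_out

live_in, live_out = liveness(cfg)
print(live_out["b0"])  # {'a'}: `a` is live after b0 because b1 uses it
```

The loop re-applies the transfer function at every node until no set changes, which is exactly the fixed-point iteration that the learned analyses in this paper are asked to approximate.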
Representations based on source code and its direct artifacts (e.g., ASTs) (Alon et al., 2018a; Yin et al., 2018; Haj-Ali et al., 2020) put unnecessary emphasis on naming and stylistic choices that may not correlate with the functionality of the code (e.g., Fig. 2a). Approaches based on intermediate representations (IR) (Ben-Nun et al., 2018; Mirhoseini et al., 2017; Brauckmann et al., 2020) remove such noise but fail to capture information about the program that is important for analysis (e.g., Fig. 2b, variables; Fig. 2c, commutativity). In both cases, models are expected to reason about the flow of information in programs using representations that do not directly encode this information. Clearly, a program representation is needed that enables machine learning algorithms to reason about the execution of a program by developing their own data flow analyses.

Since current approaches are ill-suited to program-wide data flow analysis, we propose overcoming their limitations by making the program's control, data, and call dependencies a central part of the program's representation and a primary consideration when processing it. We achieve this by viewing the program as a graph in which individual statements are connected to other statements through relational dependencies. Each statement in the program is understood only in the context of the statements interacting with it. Through relational reasoning (Battaglia et al., 2018), a latent representation of each statement is learned that is a function of not just the statement itself, but also of the (latent) representations of its graph neighborhood. Notably, this formulation bears a striking similarity to the IRs used by compilers, and the iterative propagation of information resembles the transfer functions and meet operators in traditional data flow analyses (Kildall, 1973). Recently proposed techniques for learning over graphs have shown promise in a number of domains (Schlichtkrull et al., 2018; Ziwei et al., 2020).
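The relational-reasoning step described above can be sketched as a few rounds of neighborhood message passing over statement embeddings. The adjacency matrix, embedding width, shared weight matrix, and tanh update below are arbitrary illustrative choices, not the GGNN used in the paper.

```python
import numpy as np

# A hypothetical 4-statement program graph; the adjacency matrix
# encodes dependencies between statements (illustrative only).
adj = np.array([
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
], dtype=float)

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))          # initial statement embeddings
W = rng.normal(size=(8, 8)) * 0.1    # shared message transform

# Each propagation step updates a statement's latent state as a
# function of itself and its neighbors' states, loosely mirroring a
# transfer function combined with a meet operator.
A = adj + adj.T  # propagate along both edge directions
for _ in range(3):
    messages = A @ h @ W       # aggregate transformed neighbor states
    h = np.tanh(h + messages)  # update latent statement representations

print(h.shape)  # (4, 8)
```

After T steps, each row of `h` summarizes a statement together with its T-hop graph neighborhood, which is what allows a downstream readout to answer per-statement analysis questions.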
With a suitable representation and graph-based model, we extend these approaches to the domain of compiler analysis, enabling downstream tasks built on top of such graph models to natively incorporate reasoning about data flow into their decision making. This improves performance on downstream tasks without requiring additional features. We make the following contributions:

• We propose a portable, language-independent graph representation of programs derived from compiler IRs. PROGRAML is the first representation to capture whole-program control, data, and call relations between instructions and operands, as well as their order and data types. PROGRAML is a compiler-agnostic design for use at all points in the optimization pipeline; we provide implementations for the LLVM and XLA IRs.

• We introduce a benchmark dataset that poses a suite of established compiler analysis tasks as supervised machine learning problems. DEEPDATAFLOW comprises five tasks that require, in combination, the ability to model: control and data flow, function boundaries, instruction types, and the type and order of operands over complex programs. DEEPDATAFLOW is constructed from 461k real-world program IRs covering a diverse range of domains and source languages, totaling 8.5 billion data flow analysis classification labels.

• We adapt Gated Graph Neural Networks (GGNN) to the PROGRAML representation. We show that, within a bounded problem size, our approach achieves ≥ 0.939 F1 score on all analysis tasks, a significant improvement over state-of-the-art representations. In evaluating the limits of this approach, we propose directions to better learn over programs.
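As a rough illustration of the typed program graph the first contribution describes, the sketch below encodes instruction and variable nodes connected by control, data, and call edges, with operand positions on data edges to preserve operand order. The node texts, attribute names, and helper function are hypothetical, not the actual PROGRAML schema.

```python
# A minimal sketch of a typed program graph (illustrative only).
# Instruction and variable nodes; each edge carries a flow type and,
# for data edges, an operand position to preserve operand order.
nodes = {
    0: {"text": "br label %loop", "kind": "instruction"},
    1: {"text": "%i = phi i32",   "kind": "instruction"},
    2: {"text": "%sum = add i32", "kind": "instruction"},
    3: {"text": "%i",             "kind": "variable", "type": "i32"},
}
edges = [
    (0, 1, {"flow": "control"}),                 # control flow between instructions
    (1, 3, {"flow": "data"}),                    # the phi defines variable %i
    (3, 2, {"flow": "data", "position": 1}),     # %i is operand 1 of the add
]

def neighbors(n, flow):
    """Successors of node n along edges of a given flow type."""
    return [dst for src, dst, attr in edges if src == n and attr["flow"] == flow]

print(neighbors(1, "data"))  # [3]
```

Keeping the flow type on every edge is what lets a message passing model learn different propagation behavior for control, data, and call relations rather than treating all neighbors identically.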

2. RELATED WORK

Prior work on learning over programs employed methods from Natural Language Processing that represented programs as a sequence of lexical tokens (Allamanis, 2016; Cummins et al., 2017a) .



Figure 1: Our proposed approach for compiler analyses driven by graph-based deep learning.

Data flow analysis is a long-established area of work firmly embedded in modern compilers. Despite its central role, there has been limited work in learning such analyses. Bielik et al. (2017) use ASTs and code synthesis to learn rule-sets for static analyses, some of which are data-flow-related; our approach does not require a program generator or a hand-crafted DSL for rules. Shi et al. (2020) and Wang & Su (2020) use dynamic information (e.g., register snapshots and traces) from instrumented binaries to embed an assembler graph representation; we propose a static approach that does not need runtime features. Si et al. (2018) use a graph embedding of an SSA form to generate invariants; the lack of phi nodes and function call/return edges means that the representation is not suitable for interprocedural analysis as it stands. Kanade et al. (2020) explore a large-scale, context-dependent vector embedding; this is done at a token level, however, and is unsuited for data flow analysis.

