STRUCTURAL CODE REPRESENTATION LEARNING FOR AUTO-VECTORIZATION

Abstract

The single instruction multiple data (SIMD) capability in modern processors is critical to the performance of today's compute-intensive programs. SIMD allows architectures to exploit the natural data parallelism that exists in a wide range of real applications (e.g., games, signal processing) by executing a single instruction on multiple data items simultaneously. Modern compilers use vectorization techniques to exploit this SIMD capability by detecting data parallelism in scalar source code and transforming groups of scalar instructions into vector instructions. In this work, we focus on one of the most common vectorization techniques, loop-based vectorization, which targets loops and optimizes their performance by grouping multiple occurrences of the same operation across loop iterations into single SIMD instructions. This is achieved by setting two key parameters: (1) the vectorization factor (VF) and (2) the interleaving factor (IF). Unfortunately, vectorizing loop computations effectively is a key challenge for both programmers and compilers due to the large search space. Manual vectorization of each loop puts a huge burden on the programmer, is error-prone, and requires expert knowledge of both the software and the architecture. Alternatively, current compilers use fixed cost models based on expert heuristics to make automatic vectorization decisions. However, these models often ignore data dependencies as well as the underlying computation graph. In this paper, we propose a data-driven, graph-based learning framework for automatic vectorization, called autograph, which takes an input program, extracts the loops, and learns a structured representation to automatically predict the correct VF/IF factors.
Our proposed framework utilizes deep reinforcement learning to learn an optimal policy (mapping observations to actions) for an intelligent agent in a SIMD environment, and automatically injects the predicted vectorization pragmas into the input program. We conducted an extensive evaluation on multiple benchmark datasets and compared against state-of-the-art baselines. Our results show that on Polybench, autograph achieves an average performance improvement of 2.47x over neurovectorizer and 3.61x over the baseline.

1. INTRODUCTION

Single instruction multiple data (SIMD) mechanisms have been widely incorporated in modern processors, from gaming machines and massively parallel supercomputers to general-purpose processors (Nuzman et al., 2006; Bachega et al., 2004; Peleg & Weiser, 1996). These mechanisms allow architectures to exploit the natural parallelism that exists in real-world applications (e.g., games, signal processing) by simultaneously executing the same instruction on multiple elements of the input data. Modern compilers use vectorization techniques to exploit the SIMD capability of these architectures. Vectorization allows the compiler to reveal the data parallelism in scalar source code and convert the code from a scalar implementation to a functionally equivalent vector implementation. This allows portions of the code to run on the processor's high-throughput SIMD units without any additional effort from the programmer (Porpodas et al., 2018). With a SIMD architecture, such operations run in fewer cycles while using less energy, boosting performance in applications with vector computations. Vectorization can be classified into two major methods: (i) the loop vectorizer, which operates on loops, and (ii) the superword-level parallelism (SLP) vectorizer (Porpodas, 2017; Mendis et al., 2019), which operates on straight-line code. Loops are commonly used in modern programs to express repetitive tasks concisely; therefore, in this work, we focus on the loop vectorizer. One of the key challenges is to choose the vectorization factor (VF) and the interleaving factor (IF) (Nuzman et al., 2006). The VF determines how many instructions to pack together from different iterations of the loop, and the IF determines the stride of the memory accesses of the packed instructions (Haj-Ali et al., 2020). Hence, the goal of loop vectorization is to search for the optimal VF and IF for a given program.
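To make the two factors concrete, the sketch below shows how a chosen (VF, IF) pair can be communicated to the compiler through Clang's loop pragmas, which is how such hints are typically injected into source code. The helper function and the example loop are illustrative, but `vectorize_width` and `interleave_count` are the pragma clauses Clang accepts for these hints:

```python
def inject_vectorization_pragma(loop_source: str, vf: int, if_: int) -> str:
    """Prepend Clang loop-vectorization pragmas to a loop's source text.

    vf  -- vectorization factor: how many iterations are packed per SIMD op
    if_ -- interleaving factor: how many vector operations run concurrently
    """
    pragma = (f"#pragma clang loop vectorize_width({vf}) "
              f"interleave_count({if_})\n")
    return pragma + loop_source

loop = "for (int i = 0; i < n; i++) c[i] = a[i] + b[i];"
print(inject_vectorization_pragma(loop, 4, 2))
```

With VF = 4 and IF = 2, each vector instruction covers four iterations of the scalar loop, and two such vector operations are interleaved per unrolled iteration.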
As shown in Figure 1(b), even though brute-force search can find the optimal vectorization parameters, compilers cannot afford an exhaustive search because it is far too time-consuming: the brute-force method must try every vectorization parameter for each loop in a kernel, compiling and running the code each time. For a kernel with N loops, M interleaving factors (IFs), and K vectorization factors (VFs), the search space is O(N·M·K). Since manual vectorization is error-prone and difficult to maintain, modern compilers such as LLVM use auto-vectorization techniques that rely on linear, constant-cost models to predict the vectorization factors (Tian et al., 2016; Trifunovic et al., 2009). However, these cost models consider neither the computation graph nor loop dependencies, so they cannot capture the structural dependencies and semantics of the wide range of software programs in today's real-world applications. Machine learning has been proposed to improve these cost models (Stock et al., 2012; Wang & O'Boyle, 2018) by extracting hand-engineered features from assembly code and using supervised learning to predict the vectorization factors. However, these methods are still incapable of automatically learning a representation that captures the computation graph and the dependencies of the input code. As we show in the evaluation, accounting for structural dependencies improves performance by 1.26x. In this work, we propose a framework that learns a representation capable of reasoning about the flow of information and the semantics of the code, while capturing the structural dependencies in the computation graph.
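The O(N·M·K) cost of exhaustive search can be sketched as follows. This is a minimal illustration, not the paper's implementation: `compile_and_time` is a hypothetical helper that injects the pragmas, recompiles the kernel, and returns the measured runtime, and the candidate factor lists are assumptions:

```python
import itertools

VF_CANDIDATES = [1, 2, 4, 8, 16, 32, 64]  # K candidate vectorization factors
IF_CANDIDATES = [1, 2, 4, 8, 16]          # M candidate interleaving factors

def brute_force_search(loops, compile_and_time):
    """Try every (VF, IF) pair for every loop: O(N*M*K) compile-and-run trials.

    loops            -- the N loops extracted from the kernel
    compile_and_time -- hypothetical helper returning the runtime of one trial
    """
    best = {}
    for loop in loops:  # N loops
        # min over the full Cartesian product of M*K configurations
        best[loop] = min(
            itertools.product(VF_CANDIDATES, IF_CANDIDATES),
            key=lambda pair: compile_and_time(loop, *pair),
        )
    return best
```

Because every trial involves a full compile-and-execute cycle, even modest values of N, M, and K make this search impractical inside a compiler, which is what motivates predicting the factors instead.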
More specifically, as shown in Figure 1(a), we propose an end-to-end graph-based deep learning framework for compiler auto-vectorization, called autograph, that takes an input code, extracts the loops, and learns a structured representation to automatically predict the correct VF/IF factors. Autograph utilizes deep reinforcement learning to learn an optimal policy (mapping graph embeddings to VF/IF pairs) for an intelligent agent in a SIMD environment, and automatically injects the predicted vectorization pragmas into the input program to improve performance. We conducted an extensive evaluation on multiple benchmarks and compared against state-of-the-art baselines. Our experimental results show that on Polybench, autograph achieves an average performance improvement of 2.47x over neurovectorizer and 3.61x over the baseline.
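One common way to expose a VF/IF choice to a reinforcement-learning agent is to flatten the candidate pairs into a single discrete action space, so the policy outputs one index per loop. The sketch below shows this decoding step only; the candidate lists and function name are illustrative assumptions, not the paper's exact setup:

```python
VFS = [1, 2, 4, 8, 16, 32, 64]  # candidate vectorization factors (assumed)
IFS = [1, 2, 4, 8, 16]          # candidate interleaving factors (assumed)

def decode_action(action: int) -> tuple:
    """Map a discrete action in [0, len(VFS) * len(IFS)) to a (VF, IF) pair."""
    vf_idx, if_idx = divmod(action, len(IFS))
    return (VFS[vf_idx], IFS[if_idx])

# e.g. action 11 decodes to (VFS[2], IFS[1]) = (4, 2)
```

Under this encoding, the agent observes a graph embedding of the loop, emits one of |VFS|·|IFS| actions, and the decoded pair is injected as pragmas; the measured speedup of the recompiled kernel can then serve as the reward signal.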



Figure 1: (a) Overview of the proposed autograph framework. Autograph extracts loops from source code in the intermediate representation format (LLVM IR). Then, using compiler dependency analysis, autograph constructs dependency graphs to capture the flow of information in the code, as well as its semantics, by using the full text features of the instructions. Autograph then learns a structured representation using an inductive GNN approach. Finally, autograph exploits deep reinforcement learning to learn a mapping from embeddings to vectorization factors. (b) Comparison between brute-force search and prediction-based autograph. For each kernel, the brute-force search time is measured by summing the compile and execution times (shown on a log10 scale, in nanoseconds) over the different VFs and IFs. Although brute force can eventually find the best vectorization parameters, it is ∼60,000x slower than autograph, which predicts the parameters directly without exhaustively searching the space.

