# TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

Tianqi Chen<sup>1</sup>, Thierry Moreau<sup>1</sup>, Ziheng Jiang<sup>2</sup>, Lianmin Zheng<sup>3</sup>, Eddie Yan<sup>1</sup>, Meghan Cowan<sup>1</sup>, Haichen Shen<sup>1</sup>, Leyuan Wang<sup>4</sup>, Yuwei Hu<sup>5</sup>, Luis Ceze<sup>1</sup>, Carlos Guestrin<sup>1</sup>, Arvind Krishnamurthy<sup>1</sup>

<sup>1</sup>Paul G. Allen School of Computer Science & Engineering, University of Washington, <sup>2</sup>Fudan University, <sup>3</sup>Shanghai Jiao Tong University, <sup>4</sup>UC Davis, <sup>5</sup>Cornell

## **Abstract**

There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) requires significant manual effort. We propose TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding. TVM also offers automated optimization of low-level programs to hardware characteristics by employing a novel learning-based cost modeling method for rapid exploration of code optimizations. Experimental results demonstrate that TVM delivers performance across hardware back-ends that is competitive with state-of-the-art hand-tuned libraries for low-power CPUs, mobile GPUs, and server-class GPUs. We also demonstrate TVM's ability to target new accelerator back-ends by targeting an FPGA-based generic deep learning accelerator. The system is open sourced and in production use inside several major companies.
#### 1 Introduction

Deep learning models can now recognize images, process natural language, and defeat humans in challenging strategy games. There is an increasing demand to deploy smart applications to a wide spectrum of devices, ranging from cloud servers to self-driving cars and embedded devices. Mapping deep learning workloads to these devices is complicated by the diversity of hardware characteristics, including embedded CPUs, GPUs, FPGAs, and ASICs (e.g., the TPU [20]). These hardware targets diverge in terms of memory organization, compute functional units, etc., as shown in Figure 1.

Figure 1: CPU, GPU and TPU-like accelerators require different on-chip memory architectures and compute primitives. This divergence must be addressed when generating optimized code.

Current deep learning frameworks, such as TensorFlow, MXNet, Caffe, and PyTorch, rely on a computational graph intermediate representation to implement optimizations such as automatic differentiation and dynamic memory management [3, 4, 8]. Graph-level optimizations, however, are often too high-level to handle hardware back-end-specific operator-level transformations. Most of these frameworks focus on a narrow class of server-class GPU devices and delegate target-specific optimizations to highly engineered and vendor-specific operator libraries. These operator-level libraries require significant manual tuning and hence are too specialized and opaque to be easily ported across hardware devices. Providing support in various deep learning frameworks for diverse hardware back-ends in the present fashion requires significant engineering effort. Even for supported back-ends, frameworks must choose between avoiding graph optimizations that yield new operators absent from the predefined operator library and using unoptimized implementations of these new operators.
In order to enable both graph-level and operator-level optimizations for diverse hardware back-ends, we take a fundamentally different, end-to-end approach. We built TVM, a compiler that takes a high-level specification of a deep learning program from existing frameworks and generates low-level optimized code for a diverse set of hardware back-ends. For TVM to be attractive to users, it needs to offer performance competitive with the multitude of manually optimized operator libraries across diverse hardware back-ends. This goal requires addressing the key challenges described below.

**Leveraging Specific Hardware Features and Abstractions** Deep learning accelerators introduce optimized tensor compute primitives [1, 11, 20], while GPUs and CPUs continuously evolve with improvements to their processing elements. This poses a significant challenge in generating optimized code for a given operator description. The inputs to hardware instructions are multi-dimensional, with fixed or variable lengths; they dictate different data layouts and have special requirements for the memory hierarchy. The system must effectively exploit these complex primitives to benefit from acceleration. Aside from compute primitives, accelerator designs also commonly favor leaner control [20] and offload most of the scheduling complexity to the compiler stack. For specialized accelerators, the system now needs to generate code that explicitly controls pipeline dependencies to hide memory access latency, a job that the hardware performs on CPUs and GPUs.

**Large Search Space for Optimization** Another challenge is producing efficient code without manually tuning operators. The combinatorial choices of memory access, threading pattern, and novel hardware primitives create a huge configuration space for generated code (e.g., loop tiles and ordering, caching, unrolling) that would incur a large search cost if we used black box auto-tuning.
One could adopt a predefined cost model to guide the search, but building an accurate cost model is very hard due to the increasing complexity of modern hardware. Furthermore, such an approach would require us to build separate cost models for each hardware type.

TVM addresses these challenges with three key modules. (1) We introduce a *tensor expression language* to build operators and provide program transformation primitives that generate different versions of the program with various optimizations. This layer extends Halide [30]'s compute/schedule separation concept by also separating target hardware intrinsics from transformation primitives, which enables support for novel accelerators and their corresponding new intrinsics. Moreover, we introduce new transformation primitives to address the challenges brought by GPUs and also enable deployment to specialized accelerators. We can then apply different sequences of program transformations to form a rich space of valid programs for a given operator declaration. (2) We introduce an *automated program optimization framework* to find optimized tensor operators. The optimizer is guided by a machine learning based cost model that adapts and improves as we collect more data from a hardware back-end. (3) On top of the automatic code generator, we introduce a graph rewriter that takes full advantage of high-level and operator-level optimizations.

By combining these three modules, TVM can take model descriptions from existing deep learning frameworks, perform joint high-level and low-level optimizations, and generate hardware-specific optimized code for back-ends such as CPUs, GPUs, and FPGA-based specialized accelerators.

This paper makes the following contributions:

- We identify the major optimization challenges in providing performance portability to deep learning workloads across diverse hardware back-ends.
- We introduce novel schedule primitives to take advantage of cross-thread memory reuse, novel hardware intrinsics, and latency hiding.
- We propose and implement a machine learning based optimization system to automatically explore and search for optimized tensor operators.
- We build an end-to-end compilation and optimization stack allowing the deployment of deep learning workloads specified in high-level frameworks (including Caffe, MXNet, PyTorch, Caffe2, CNTK) to diverse hardware back-ends (including CPUs, server GPUs, mobile GPUs, and FPGA-based accelerators). TVM is open sourced and is in production use inside several major companies.
- We evaluate TVM on a server-class GPU, an embedded GPU, an embedded CPU, and a custom generic FPGA-based accelerator using real world workloads. Experimental results show that TVM offers portable performance across back-ends, and achieves speedups ranging from 1.2× to 3.8× over existing frameworks backed by hand-optimized libraries.

#### 2 Overview

This section gives an overview of TVM by walking through the system components and the user API with an example. Figure 2 summarizes the system execution steps in TVM and their corresponding sections in the paper.

Figure 2: System overview of TVM. The current stack supports descriptions from many deep learning frameworks and targets major CPUs, GPUs, and specialized accelerators.

The system first takes a model from an existing framework as input and transforms it into a computational graph representation. The system then performs high-level dataflow rewriting to generate an optimized graph. The operator-level optimization module needs to generate efficient code for each fused operator in the optimized graph. The operators are specified in a declarative tensor expression language, leaving the execution details unspecified. TVM identifies a collection of possible code optimizations for the operators for a given hardware target.
The possible optimizations form a large space, so we use a machine learning based cost model to find optimized operators. Finally, the system packs the generated code into a deployable module.

**End-User Example** A user can take a model from existing deep learning frameworks and call the TVM API to get a deployable module in a few lines of code:

```
import tvm as t
# Use keras framework as example, import model
graph, params = t.frontend.from_keras(keras_model)
target = t.target.cuda()
graph, lib, params = t.compiler.build(graph, target, params)
```

This compiled runtime module contains three components: the final optimized computational graph (graph), generated operators (lib), and module parameters (params). The user can then use these to deploy the model to the target back-end:

```
import tvm
import tvm.runtime as t
# Create a runtime module on the first CUDA device
module = t.create(graph, lib, t.cuda(0))
module.set_input(**params)
module.run(data=data_array)
# Allocate an output buffer on the device and fetch the first output
output = tvm.nd.empty(out_shape, ctx=t.cuda(0))
module.get_output(0, output)
```

TVM supports multiple deployment back-ends in languages such as C++, Java, and Python. The rest of the paper describes TVM's architecture and how a systems programmer can extend it to support new back-ends.

### **3 Optimizing Computational Graphs**

Computational graphs are a common way to represent programs in deep learning frameworks [3, 4, 6, 8]. Figure 3 shows an example computational graph representation of a two-layer convolutional neural network. The main difference between this high-level representation and a low-level compiler IR, such as LLVM, is that the intermediate data items are large multi-dimensional tensors. Computational graphs provide a global view of operators, yet avoid specifying how each operator needs to be implemented. Similar to LLVM IR, a computational graph can be transformed into functionally equivalent graphs to apply optimizations.

Figure 3: Example computational graph of a two-layer convolutional neural network. Each node in the graph represents an operation that consumes one or more tensors and produces one or more tensors. The tensor operations can be parameterized by attributes to configure their behavior (e.g., padding or strides).

Figure 4: Performance comparison between fused operations and non-fused operations. Both are generated by TVM.

TVM exploits a computational graph representation to apply high-level optimizations: a node represents an operation on tensors or program inputs, and edges represent data dependencies between operations. TVM implements many graph-level optimizations, such as the following:

- *Operator fusion* fuses multiple small operations together.
- *Constant folding* pre-computes the parts of the graph that can be determined statically, saving execution costs.
- A *static memory planning pass* pre-allocates memory to hold each intermediate tensor.
- *Data layout transformations* convert the internal data layouts into back-end-friendly forms.

We now discuss operator fusion and data layout transformation.

**Operator Fusion** Operator fusion combines multiple operators into a single kernel without saving the intermediate results back into memory. This optimization can greatly reduce execution time, particularly on GPUs and specialized accelerators. Specifically, we recognize four categories of graph operators: injective (one-to-one map, e.g., add), reduction (e.g., sum), complex-out-fusable (can fuse an element-wise map to its output, e.g., conv2d), and opaque (cannot be fused, e.g., sort). We provide generic rules to fuse these operators. Multiple injective operators can be fused together into another injective operator. A reduction operator can be fused together with input injective operators (e.g., fuse scale and sum). Operators such as conv2d are categorized as complex-out-fusable, and we can fuse element-wise operators to their output.
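The generic fusion rules above can be illustrated with a small sketch. This is not TVM's graph rewriter: it handles only a linear chain of operators, the `CATEGORY` table and the function name `fuse_chain` are hypothetical, and it treats every injective operator as element-wise for the purpose of fusing onto a complex-out-fusable output.

```python
# Operator categories named in the text; this table is illustrative only.
CATEGORY = {
    "add": "injective", "relu": "injective", "scale": "injective",
    "sum": "reduction", "conv2d": "complex-out-fusable", "sort": "opaque",
}

def fuse_chain(ops):
    """Greedily group a linear chain of operator names into fused kernels."""
    groups = []  # each entry is (member ops, category of the fused group)
    for op in ops:
        cat = CATEGORY[op]
        if groups:
            members, kind = groups[-1]
            if kind == "injective" and cat == "injective":
                # Rule 1: injective + injective stays injective.
                members.append(op)
                continue
            if kind == "injective" and cat == "reduction":
                # Rule 2: a reduction absorbs its input injective operators.
                members.append(op)
                groups[-1] = (members, "reduction")
                continue
            if kind == "complex-out-fusable" and cat == "injective":
                # Rule 3: element-wise operators fuse onto the output.
                members.append(op)
                continue
        # Opaque operators, and anything no rule covers, start a new kernel.
        groups.append(([op], cat))
    return [members for members, _ in groups]

print(fuse_chain(["conv2d", "add", "relu", "sort", "scale", "sum"]))
```

In this example the six operators collapse into three kernels: `conv2d` with its two element-wise consumers, the opaque `sort` on its own, and `scale` fused into `sum`.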
We can apply these rules to transform the computational graph into a fused version. Figure 4 demonstrates the impact of this optimization on different workloads. We find that fused operators yield a $1.2\times$ to $2\times$ speedup by reducing memory accesses.

**Data Layout Transformation** There are multiple ways to store a given tensor in the computational graph. The most common data layout choices are column major and row major. In practice, we may prefer to use even more complicated data layouts. For instance, a deep learning accelerator might exploit $4 \times 4$ matrix operations, requiring data to be tiled into $4 \times 4$ chunks to optimize for access locality.

Data layout optimization converts a computational graph into one that uses better internal data layouts for execution on the target hardware. It starts by specifying the preferred data layout for each operator given the constraints dictated by memory hierarchies. We then perform the proper layout transformation between a producer and a consumer if their preferred data layouts do not match.

While high-level graph optimizations can greatly improve the efficiency of deep learning workloads, they are only as effective as what the operator library provides. Currently, the few deep learning frameworks that support operator fusion require the operator library to provide an implementation of the fused patterns. With more network operators introduced on a regular basis, the number of possible fused kernels can grow dramatically. This approach is no longer sustainable when targeting an increasing number of hardware back-ends, as the required number of fused pattern implementations grows combinatorially with the number of data layouts, datatypes, and accelerator intrinsics that need to be supported. It is not feasible to handcraft operator kernels for the various operations desired by a program and for each back-end.
To this end, we next propose a code generation approach that can generate various possible implementations for the operators appropriate for a given model.

### 4 Generating Tensor Operations

TVM produces efficient code for each operator by generating many valid implementations on each hardware back-end and choosing an optimized implementation. The process of generating multiple valid implementations builds on Halide's idea of decoupling descriptions from computation rules (or *schedule optimizations*) [30], and extends it to support new optimizations (nested parallelism, tensorization, and latency hiding) and a wide array of hardware back-ends. We now highlight TVM-specific features.

```
# tensor expression and initial schedule
A = t.placeholder((1024, 1024))
B = t.placeholder((1024, 1024))
k = t.reduce_axis((0, 1024))
C = t.compute((1024, 1024), lambda y, x:
              t.sum(A[k, y] * B[k, x], axis=k))
s = t.create_schedule(C.op)

# corresponding low-level code
for y in range(1024):
  for x in range(1024):
    C[y][x] = 0
    for k in range(1024):
      C[y][x] += A[k][y] * B[k][x]

# + Loop Tiling
yo, xo, ko, yi, xi, ki = s[C].tile(y, x, k, 8, 8, 8)

# corresponding low-level code
for yo in range(128):
  for xo in range(128):
    C[yo*8:yo*8+8][xo*8:xo*8+8] = 0
    for ko in range(128):
      for yi in range(8):
        for xi in range(8):
          for ki in range(8):
            C[yo*8+yi][xo*8+xi] +=
                A[ko*8+ki][yo*8+yi] * B[ko*8+ki][xo*8+xi]

# + Cache Data on Accelerator Special Buffer
CL = s.cache_write(C, vdla.acc_buffer)
AL = s.cache_read(A, vdla.inp_buffer)
# additional schedule steps omitted ...

# + Map to Accelerator Tensor Instructions
s[CL].tensorize(yi, vdla.gemm8x8)

# corresponding low-level code
inp_buffer AL[8][8], BL[8][8]
acc_buffer CL[8][8]
for yo in range(128):
  for xo in range(128):
    vdla.fill_zero(CL)
    for ko in range(128):
      vdla.dma_copy2d(AL, A[ko*8:ko*8+8][yo*8:yo*8+8])
      vdla.dma_copy2d(BL, B[ko*8:ko*8+8][xo*8:xo*8+8])
      vdla.fused_gemm8x8_add(CL, AL, BL)
    vdla.dma_copy2d(C[yo*8:yo*8+8,xo*8:xo*8+8], CL)
```

Figure 5: Example schedule transformations to optimize a matrix multiplication on a specialized accelerator. Each schedule step is followed by its corresponding low-level code.

Figure 6: TVM schedule lowering and code generation process. The table lists the existing Halide and novel TVM scheduling primitives used to optimize schedules for CPU, GPU, and accelerator back-ends.

## 4.1 Tensor Expression and Schedule Space

We introduce a tensor expression language to support automatic code generation. Unlike high-level computation graph representations, where the implementation of tensor operations is opaque, each operation is described in an index formula expression language. The declaration at the top of Figure 5 is an example tensor expression; it computes a transposed matrix multiplication, $C[y, x] = \sum_k A[k, y] \cdot B[k, x]$.

Each compute operation specifies the shape of the output tensor and an expression describing how to compute each element of the output tensor. Our tensor expression language supports common arithmetic and math operations and covers common operator patterns used in deep learning. The language leaves the loop structure and many other execution details unspecified, and provides flexibility for adding hardware-aware optimizations for various back-ends. Adopting the decoupled compute/schedule principle from Halide [30], we use a schedule to denote a specific mapping from a tensor expression to low-level code. There are many possible schedules that map a given expression to a low-level program.

We build a schedule by incrementally applying basic transformations (schedule primitives) that preserve the logical equivalence of the program. Figure 5 shows an example of scheduling matrix multiplication on a specialized accelerator. Internally, TVM uses a data structure to keep track of the loop structure and other information as we apply schedule transformations. This information can then be used to generate low-level code for a given final schedule.
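The statement that schedule primitives preserve logical equivalence can be checked on the loop tiling step of Figure 5. The sketch below is plain Python rather than TVM code: the function names are hypothetical, the matrices are small integer matrices chosen so that results compare exactly, and the tiled version assumes every extent is a multiple of the tile size.

```python
def matmul_t_naive(A, B):
    """C[y][x] = sum_k A[k][y] * B[k][x], using the default loop nest."""
    K, Y, X = len(A), len(A[0]), len(B[0])
    C = [[0] * X for _ in range(Y)]
    for y in range(Y):
        for x in range(X):
            for k in range(K):
                C[y][x] += A[k][y] * B[k][x]
    return C

def matmul_t_tiled(A, B, tile=8):
    """The same computation after tiling the y, x, and k loops by `tile`."""
    K, Y, X = len(A), len(A[0]), len(B[0])
    assert K % tile == 0 and Y % tile == 0 and X % tile == 0
    C = [[0] * X for _ in range(Y)]
    for yo in range(Y // tile):
        for xo in range(X // tile):
            for ko in range(K // tile):
                for yi in range(tile):
                    for xi in range(tile):
                        for ki in range(tile):
                            y = yo * tile + yi
                            x = xo * tile + xi
                            k = ko * tile + ki
                            C[y][x] += A[k][y] * B[k][x]
    return C

# Two different loop structures, one logical result.
A = [[(3 * i + j) % 7 for j in range(16)] for i in range(16)]
B = [[(i + 2 * j) % 5 for j in range(16)] for i in range(16)]
assert matmul_t_naive(A, B) == matmul_t_tiled(A, B)
```

Only the iteration order changes between the two functions; the set of multiply-accumulate operations is identical, which is why the transformation is safe to apply automatically.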
Our tensor expression takes cues from Halide [30], Darkroom [16], and TACO [22], with the primary enhancements being support for the new schedule optimizations discussed below. In order to achieve high performance on many back-ends, we must support enough schedule primitives to cover a diverse set of optimizations on different hardware back-ends. Figure 6 summarizes the operation code generation process and the schedule primitives supported in TVM. We reuse useful primitives and the low-level loop program AST from Halide and introduce new primitives to optimize for GPUs and accelerators. We describe the new optimization primitives in this section and describe how to automatically derive efficient schedules in Section 5.

## 4.2 Nested Parallelism with Cooperation

Parallelism is the key to improving the efficiency of compute-intensive kernels in deep learning workloads. Modern GPUs offer massive parallelism, requiring us to bake parallel patterns into schedule transformations. Most existing solutions adopt a model called *nested parallelism*, which is a form of fork-join. This model requires a parallel schedule primitive to parallelize a data-parallel task; each task can be further recursively subdivided into subtasks to exploit the multi-level thread hierarchy on the target architecture (e.g., thread groups in GPUs). We call this model *shared-nothing nested parallelism*, as one working thread cannot look at the data of its sibling within the same parallel computation stage.

An alternative to the shared-nothing approach is to fetch data cooperatively. Specifically, groups of threads can cooperatively fetch the data they all need into a shared memory space. This optimization can take advantage of the GPU memory hierarchy and enable data reuse across threads through shared memory regions. This pattern is well known in GPU programming, and TVM supports this optimization using a schedule primitive. The code below shows an example of GPU code that optimizes matrix multiplication.
```
for thread_group (by, bx) in cross(64, 64):
  for thread_item (ty, tx) in cross(2, 2):
    local CL[8][8] = 0
    shared AS[2][8], BS[2][8]
    for k in range(1024):
      # all threads cooperatively load AS and BS
      for i in range(4):
        AS[ty][i*4+tx] = A[k][by*64+ty*8+i*4+tx]
      for each i in 0..4:
        BS[ty][i*4+tx] = B[k][bx*64+ty*8+i*4+tx]
      # barrier inserted automatically by compiler
      memory_barrier_among_threads()
      for yi in range(8):
        for xi in range(8):
          CL[yi][xi] += AS[yi] * BS[xi]
    # finally, each thread writes its local tile CL back to C
```

We introduce the concept of *memory scopes* to the schedule space so that a compute stage (AS and BS in the code) can be marked as shared. Without explicit memory scopes, automatic scope inference will mark them as thread-local. The shared task needs to compute the dependencies of all the working threads in the group. Additionally, memory synchronization barriers need to be properly inserted to guarantee that shared loaded data is visible to the consumers. Finally, in addition to being useful to GPUs, memory scopes allow us to tag special memory buffers and create special lowering rules when targeting specialized deep learning accelerators.

#### 4.3 Tensorization

Deep learning workloads have high arithmetic intensity, which can typically be decomposed into tensor operators like matrix-matrix multiplication or 1D convolution. These natural decompositions have led to the recent trend of adding tensor compute primitives [1, 11, 20]. These new primitives create new opportunities and challenges for schedule-based compilation: making use of these complex primitives can improve performance, but the compilation framework should seamlessly integrate new primitives. We dub this *tensorization*; it is analogous to vectorization for SIMD architectures, but with significant differences. The inputs to the instructions are multi-dimensional, with fixed or variable lengths, and each has a different data layout.
More importantly, we cannot just support a fixed set of primitives, as new accelerators are emerging with their own flavors of tensor instructions, and we therefore need an extensible solution.

Figure 7: TVM virtual thread lowering transforms a virtual thread-parallel program to a single instruction stream with explicit low-level synchronizations that the hardware can interpret to recover the pipeline parallelism required to hide memory access latency.

Figure 8: Decoupled Access-Execute in hardware hides most of the memory access latency by allowing memory and computation to overlap. Execution correctness is enforced by low-level synchronization in the form of dependence token enqueueing/dequeuing actions, which have to be inserted in the instruction stream by the compiler stack.

We make tensorization extensible by separating the target hardware intrinsic from the schedule with a mechanism for tensor intrinsic declaration. We use the same tensor expression language to declare the behavior of each new hardware intrinsic, as well as the lowering rule associated with it. The code below shows how to declare an $8\times 8$ tensor hardware intrinsic.

```
# Behavior of the intrinsic, declared as a tensor expression
w, x = t.placeholder((8, 8)), t.placeholder((8, 8))
k = t.reduce_axis((0, 8))
y = t.compute((8, 8), lambda i, j:
              t.sum(w[i, k] * x[j, k], axis=k))

# Lowering rule: maps the expression to hardware intrinsic calls
def gemm_intrin_lower(inputs, outputs):
    ww_ptr = inputs[0].access_ptr("r")
    xx_ptr = inputs[1].access_ptr("r")
    zz_ptr = outputs[0].access_ptr("w")
    compute = t.hardware_intrin("gemm8x8", ww_ptr, xx_ptr, zz_ptr)
    reset = t.hardware_intrin("fill_zero", zz_ptr)
    update = t.hardware_intrin("fuse_gemm8x8_add", ww_ptr, xx_ptr, zz_ptr)
    return compute, reset, update

gemm8x8 = t.decl_tensor_intrin(y.op, gemm_intrin_lower)
```

Additionally, we introduce a *tensorize* schedule primitive to replace a unit of computation with the corresponding intrinsics. The compiler matches the computation pattern with a hardware declaration, and lowers it to the corresponding hardware intrinsic.
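To make the effect of tensorization concrete, the sketch below emulates the final lowered loop nest of Figure 5 in plain Python: the outer loops stay as ordinary loops, while the inner $8 \times 8 \times 8$ block is replaced by calls to fixed-size micro-kernels. The helpers `fill_zero`, `fused_gemm8x8_add`, and `copy_tile` are software stand-ins named after the calls in Figure 5, not real hardware intrinsics, and the operand layout follows Figure 5 (`A[k][y]`, `B[k][x]`).

```python
T = 8  # tile edge, matching the 8x8 intrinsic discussed in the text

def fill_zero(CL):
    """Stand-in for the accelerator's accumulator reset."""
    for row in CL:
        for j in range(T):
            row[j] = 0

def fused_gemm8x8_add(CL, AL, BL):
    """Stand-in micro-kernel: CL[i][j] += sum_k AL[k][i] * BL[k][j]."""
    for i in range(T):
        for j in range(T):
            CL[i][j] += sum(AL[k][i] * BL[k][j] for k in range(T))

def copy_tile(M, r0, c0):
    """Stand-in for a 2-D DMA copy of the T x T tile of M at (r0, c0)."""
    return [row[c0:c0 + T] for row in M[r0:r0 + T]]

def matmul_t_tensorized(A, B):
    """C[y][x] = sum_k A[k][y] * B[k][x], built from micro-kernel calls."""
    K, Y, X = len(A), len(A[0]), len(B[0])
    assert K % T == 0 and Y % T == 0 and X % T == 0
    C = [[0] * X for _ in range(Y)]
    CL = [[0] * T for _ in range(T)]  # on-chip accumulator buffer
    for yo in range(Y // T):
        for xo in range(X // T):
            fill_zero(CL)
            for ko in range(K // T):
                AL = copy_tile(A, ko * T, yo * T)
                BL = copy_tile(B, ko * T, xo * T)
                fused_gemm8x8_add(CL, AL, BL)
            # Copy the finished accumulator tile back to the output.
            for i in range(T):
                C[yo * T + i][xo * T:xo * T + T] = CL[i]
    return C
```

Swapping in a different accelerator would, under this view, change only the three helper functions, which mirrors how a tensor intrinsic declaration isolates the hardware-specific part from the schedule.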
Tensorization decouples the schedule from specific hardware primitives, making it easy to extend TVM to support new hardware architectures. The generated code of tensorized schedules aligns with practices in high-performance computing: break complex operations into a sequence of micro-kernel calls. We can also use the tensorize primitive to take advantage of handcrafted micro-kernels, which can be beneficial on some platforms. For example, we implement ultra low precision operators for mobile CPUs that operate on datatypes that are one or two bits wide by leveraging a bit-serial matrix vector multiplication micro-kernel. This micro-kernel accumulates results into progressively larger datatypes to minimize memory footprint. Presenting the micro-kernel as a tensor intrinsic to TVM yields up to a $1.5 \times$ speedup over the non-tensorized version.

## 4.4 Explicit Memory Latency Hiding

Latency hiding refers to the process of overlapping memory operations with computation to maximize utilization of memory and compute resources. It requires different strategies depending on the target hardware back-end. On CPUs, memory latency hiding is achieved implicitly with simultaneous multithreading [13] or hardware prefetching [9, 19]. GPUs rely on rapid context switching of many warps of threads [39]. In contrast, specialized deep learning accelerators such as the TPU [20] usually favor leaner control with a *decoupled access-execute* (DAE) architecture [32] and offload the problem of fine-grained synchronization to software.

Figure 8 shows a DAE hardware pipeline that reduces runtime latency. Compared to a monolithic hardware design, the DAE pipeline can hide most memory access overheads and almost fully utilize compute resources. To achieve higher utilization of a DAE pipeline, the instruction stream needs to be augmented with fine-grained synchronization operations. Without these operations, dependencies cannot be enforced, leading to erroneous execution. Consequently, DAE hardware pipelines require fine-grained dependence enqueuing/dequeuing operations between the pipeline stages to guarantee correct execution, as shown in Figure 8's instruction stream.

Figure 9: Roofline [42] of an FPGA-based deep learning accelerator running ResNet inference. With latency hiding enabled by TVM, the performance of the benchmarks is brought closer to the roofline, demonstrating higher compute and memory bandwidth efficiency.

Programming DAE accelerators that require explicit low-level synchronization is difficult. To reduce the burden on the programmer, we introduce a virtual threading scheduling primitive that lets the programmer specify a high-level data-parallel program as they would for hardware back-ends with support for multi-threading. TVM then automatically lowers the program to a single instruction stream with low-level explicit synchronization, as shown in Figure 7. The algorithm starts with a high-level multi-threaded program schedule and then inserts the necessary low-level synchronization operations to guarantee correct execution within each thread. Next, the operations of all virtual threads are interleaved into a single instruction stream. Finally, the hardware recovers the available pipeline parallelism dictated by the low-level synchronizations in the instruction stream.

**Hardware Evaluation of Latency Hiding** We demonstrate the effectiveness of latency hiding on a custom FPGA-based accelerator design, which we describe in depth in Subsection 6.4. We run each layer of ResNet on the accelerator, and use TVM to generate two schedules: one with latency hiding, and one without. The schedule with latency hiding parallelizes the program with virtual threads to expose pipeline parallelism and therefore hide memory access latency. The results are shown in Figure 9 as a roofline diagram [42].
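For readers unfamiliar with the roofline model, its bound is simple to state: attainable throughput is the smaller of the peak compute rate and the product of memory bandwidth and arithmetic intensity (operations per byte moved). The sketch below uses made-up numbers for a hypothetical device; they are not measurements of the accelerator evaluated here.

```python
def roofline_bound(peak_gops, bandwidth_gbs, intensity_ops_per_byte):
    """Attainable throughput in GOPS under the roofline model."""
    return min(peak_gops, bandwidth_gbs * intensity_ops_per_byte)

# Hypothetical device: 100 GOPS peak compute, 10 GB/s memory bandwidth.
# The ridge point sits at 100 / 10 = 10 operations per byte.
memory_bound = roofline_bound(100.0, 10.0, 4.0)    # left of the ridge point
compute_bound = roofline_bound(100.0, 10.0, 20.0)  # right of the ridge point
print(memory_bound, compute_bound)
```

A benchmark plotted below this bound is leaving compute or bandwidth idle, which is the gap that latency hiding narrows.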
Roofline performance diagrams provide insight into how well computation and memory resources are utilized on a given system for different benchmarks. Overall, latency hiding improves performance on all ResNet layers. Peak compute utilization increases from 70% with no latency hiding to 88% with latency hiding turned on.

## 5 Automating Optimization

Given the rich set of schedule primitives, our remaining problem is to find optimal operator implementations for each layer of a deep learning model. TVM creates a specialized operator for the specific input shape and layout associated with each layer. Such specialization offers significant performance benefits (in contrast to handcrafted code that would target a smaller diversity of shapes and layouts), but also brings challenges for automation. The system needs to choose the schedule optimizations, such as modifying the loop order or optimizing for the memory hierarchy, as well as schedule-specific parameters, such as the tiling size and the loop unrolling factor. Such combinatorial choices create a large search space of operator implementations for each hardware back-end. To address this challenge, we build an automated schedule optimizer. The optimizer has two main components: a machine learning cost model that predicts the performance of a given configuration, and a schedule explorer that proposes promising new configurations. This section describes these components and TVM's automated optimization flow (Figure 10).

## 5.1 Schedule Space Specification

We build a schedule template specification API to allow a developer to declare the knobs in the schedule space. The template specification allows incorporation of a developer's domain-specific knowledge, when necessary, in specifying possible schedules. We also provide a generic master template for each hardware back-end that automatically extracts possible knobs based on the computation description expressed using the tensor expression language.
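The way a handful of knobs multiplies into a large space can be seen in a toy example. The sketch below is not the template specification API; the knob names (`tile_y`, `tile_x`, `unroll`, `order`) and the choice of divisors as candidate tile sizes are assumptions made for illustration.

```python
from itertools import product

def divisors(n):
    """All positive divisors of n, used here as candidate tile sizes."""
    return [d for d in range(1, n + 1) if n % d == 0]

def schedule_space(extent_y, extent_x, unroll_factors=(1, 2, 4, 8)):
    """Enumerate every combination of the declared knobs."""
    knobs = {
        "tile_y": divisors(extent_y),
        "tile_x": divisors(extent_x),
        "unroll": list(unroll_factors),
        "order": ["yx", "xy"],
    }
    names = list(knobs)
    for values in product(*(knobs[name] for name in names)):
        yield dict(zip(names, values))

# 11 * 11 * 4 * 2 = 968 configurations for a 1024 x 1024 operator.
print(sum(1 for _ in schedule_space(1024, 1024)))
```

Each additional knob multiplies the count, so a realistic template with more tiling levels, caching choices, and threading options grows far beyond what exhaustive measurement can cover.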
At a high level, we wish to consider as many configurations as possible and leave the selection burden to the optimizer. Consequently, the optimizer needs to search over *billions* of possible configurations for the real world deep learning workloads used in our experiments.

## 5.2 Machine Learning based Cost Model

One way to find the best schedule from a large configuration space is through black box optimization, i.e., auto-tuning. This method is used to tune high performance computing libraries [14, 41]. However, auto-tuning requires many experiments to find a good configuration.

Figure 10: Overview of automated optimization framework. A schedule explorer explores the schedule space using a machine learning based cost model and chooses experiments to run on a distributed device cluster via RPC. The machine learning model is updated periodically using collected data recorded in a database to improve its predictive power.

| Method Category       | Data Cost | Model Bias | Need Hardware Info | Learn from History |
|-----------------------|-----------|------------|--------------------|--------------------|
| Blackbox auto-tuning  | high      | none       | no                 | no                 |
| Predefined cost model | none      | high       | yes                | no                 |
| ML based cost model   | low       | low        | no                 | yes                |

Table 1: Comparison of automation methods. Model bias refers to inaccuracy due to modeling.

An alternate approach is to build a predefined cost model to guide the search for a particular hardware back-end instead of running all possibilities and measuring their performance. Ideally, a perfect cost model for a hardware target considers all factors affecting performance. These factors include memory access patterns, data reuse, pipeline dependencies, and threading patterns, among others. Building such a model, unfortunately, is very hard due to the increasing complexity of modern hardware. Furthermore, every new hardware target requires a new (predefined) cost model.
We instead take a statistical approach to solve the cost modeling problem. In this approach, a schedule explorer proposes configurations that may improve the performance of an operator. For each schedule configuration, we use a machine learning model that takes the lowered loop program as input and predicts its running time on a given hardware back-end. The model is trained using runtime measurement data collected during exploration and does not require the user to input detailed information about the hardware. We also update the model periodically as we explore more configurations during optimization, which improves accuracy for other related workloads as well. This way, the quality of the machine learning model improves with more experimental trials. Table 1 summarizes the key differences between automation methods. ML-based cost models strike a balance between auto-tuning and predefined cost modeling, and can benefit from the historical performance data of other related workloads.

Figure 11: Comparison of different automation methods for a conv2d operator in ResNet-18 on TITAN X. The ML-based model starts with no training data and uses the collected data to improve itself. The Y-axis is the speedup relative to cuDNN. We observe a similar trend for other workloads.

Figure 12: Example workflow of machine learning cost models. XGBoost makes a cost prediction based on the features of the loop program. TreeRNN directly summarizes the AST.

**Machine Learning Model Design Choices** We need to consider two key factors when choosing the machine learning model used by the schedule explorer: *quality* and *speed*. The schedule explorer queries the cost model frequently, which incurs overheads due to model prediction time and model refitting time. In order for the model to be useful, these overheads must be smaller than the time to measure performance on real hardware, which can be on the order of seconds depending on the specific workload/hardware target.
This speed requirement differentiates our problem from traditional hyperparameter tuning problems, where the cost of doing a measurement is very high relative to model overheads and more expensive models can be used. In addition to the choice of the model, we also need to choose an objective function to train the model, such as the error in the predicted running time of a configuration. However, since the explorer only selects the top candidates based on the relative order of the predictions, we do not need to predict execution times directly. Instead, we use a rank objective to predict the relative order of runtime costs.

We implement several types of models in our machine learning optimizer. We employ a gradient tree boosting model (based on XGBoost [7]) that makes predictions based on features extracted from the loop program. These features include the memory access count and reuse ratio of each memory buffer at each loop level, as well as a one-hot encoding of loop annotations such as "vectorize", "unroll", and "parallel." We also evaluate a neural network model that uses TreeRNN [35] to summarize the loop program's AST without feature engineering. Figure 12 summarizes the workflow of the cost models. We found that tree boosting and TreeRNN have similar predictive quality. However, gradient tree boosting performs prediction twice as fast and takes much less time to train. As a result, we choose gradient tree boosting as the default cost model in our experiments. Nevertheless, we believe that both approaches are valuable and expect more future research on this problem.

On average, the tree boosting model does prediction in 0.67 ms, thousands of times faster than running a real measurement. Figure 11 compares the ML-based optimizer to blackbox auto-tuning methods; the ML-based cost model finds better configurations much faster than blackbox auto-tuning.
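The argument for a rank objective can be illustrated with a small sketch. The text above does not say which ranking loss is used, so the pairwise logistic loss below is an assumption, and the scores and runtimes are made-up illustrative numbers. The point it shows is that a model whose scores are on the wrong scale but in the right order still selects the same top candidates.

```python
import math


def pairwise_rank_loss(scores, runtimes):
    """Mean logistic loss over all pairs (i, j) where i was measured faster than j.

    A lower score is meant to indicate a faster configuration, so a pair is
    penalized more heavily when score[i] is not below score[j].
    """
    total, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if runtimes[i] < runtimes[j]:
                total += math.log1p(math.exp(scores[i] - scores[j]))
                pairs += 1
    return total / pairs if pairs else 0.0


def top_k(scores, k):
    """Indices of the k candidates with the lowest predicted cost."""
    return sorted(range(len(scores)), key=lambda i: scores[i])[:k]


# Illustrative data only: measured runtimes (ms) of five configurations.
measured_ms = [4.1, 2.3, 9.8, 3.0, 7.5]
well_ordered = [0.4, 0.1, 0.9, 0.2, 0.7]  # wrong scale, correct relative order
mis_ordered = [0.1, 0.9, 0.2, 0.7, 0.4]   # similar scale, wrong relative order

print(top_k(measured_ms, 2), top_k(well_ordered, 2), top_k(mis_ordered, 2))
print(pairwise_rank_loss(well_ordered, measured_ms))
print(pairwise_rank_loss(mis_ordered, measured_ms))
```

Because the explorer only consumes the output of something like `top_k`, training against relative order rather than absolute running time is sufficient for its purposes.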
### **5.3** Schedule Exploration

Once we have a cost model, we can use it to iteratively select promising configurations on which to run real measurements. In each iteration, the explorer uses the machine learning model's predictions to pick a batch of candidates to measure. The collected data is then used as training data to update the model. When there is no initial training data, the explorer picks random candidates to measure.

The simplest exploration algorithm is to enumerate and run every configuration through the cost model, selecting the top-k predicted performers. However, this strategy becomes intractable with large search spaces. Instead, we run a parallel simulated annealing algorithm [21]. The explorer starts with random configurations and, at each step, randomly walks to a nearby configuration. The transition is accepted if the cost model predicts a decrease in cost; if the target configuration has a higher predicted cost, the transition is rejected with some probability. This random walk tends to converge to configurations that have lower costs as predicted by the cost model. The states of the exploration persist across cost model updates: we continue from the last configuration after each update.

#### 5.4 Distributed Device Pool and RPC

We build a distributed device pool to scale up running on-hardware trials and enable fine-grained resource sharing among multiple optimization jobs. TVM implements a customized RPC-based distributed device pool that enables clients to run programs on a specific type of device. With this interface, we can compile the program on a host compiler, request a remote device, run the function remotely, and access the results in the same script on the host.
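Returning to the explorer of Section 5.3, the random walk it describes can be sketched as follows. This is a minimal illustration under stated assumptions rather than TVM's implementation: the Metropolis acceptance rule `exp(-delta / temperature)`, the geometric cooling schedule, and the toy cost function standing in for the learned cost model are all assumptions, and the chains are advanced one after another here rather than in parallel.

```python
import math
import random


def simulated_annealing(predict_cost, neighbor, starts, steps=500,
                        temperature=1.0, cooling=0.99, seed=0):
    """Advance several annealing chains guided by a predicted cost.

    Returns the final chain states sorted by predicted cost, best first.
    """
    rng = random.Random(seed)
    states = list(starts)
    costs = [predict_cost(s) for s in states]
    for _ in range(steps):
        for i, state in enumerate(states):
            candidate = neighbor(state, rng)
            cand_cost = predict_cost(candidate)
            delta = cand_cost - costs[i]
            # Always accept an improvement; accept a worse point only with
            # probability exp(-delta / temperature).
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                states[i], costs[i] = candidate, cand_cost
        temperature *= cooling
    order = sorted(range(len(states)), key=lambda i: costs[i])
    return [states[i] for i in order]


def toy_cost(config):
    """Stand-in for the learned cost model: best tiles are 2**3 by 2**4."""
    a, b = config
    return (a - 3) ** 2 + (b - 4) ** 2


def toy_neighbor(config, rng):
    """Move one tile-size exponent up or down by one, clamped to [0, 6]."""
    exps = list(config)
    axis = rng.randrange(len(exps))
    exps[axis] = min(6, max(0, exps[axis] + rng.choice((-1, 1))))
    return tuple(exps)


init_rng = random.Random(1)
starts = [(init_rng.randrange(7), init_rng.randrange(7)) for _ in range(4)]
ranked = simulated_annealing(toy_cost, toy_neighbor, starts)
best = ranked[0]
print(best, toy_cost(best))
```

In a full tuning loop, the top entries of `ranked` would be the batch sent for real measurement, and the chain states would be kept so that exploration resumes from them after the cost model is refit.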
| Name | Operator | H, W | IC, OC | K, S |
|------|----------|----------|----------|------|
| C1 | conv2d | 224, 224 | 3, 64 | 7, 2 |
| C2 | conv2d | 56, 56 | 64, 64 | 3, 1 |
| C3 | conv2d | 56, 56 | 64, 64 | 1, 1 |
| C4 | conv2d | 56, 56 | 64, 128 | 3, 2 |
| C5 | conv2d | 56, 56 | 64, 128 | 1, 2 |
| C6 | conv2d | 28, 28 | 128, 128 | 3, 1 |
| C7 | conv2d | 28, 28 | 128, 256 | 3, 2 |
| C8 | conv2d | 28, 28 | 128, 256 | 1, 2 |
| C9 | conv2d | 14, 14 | 256, 256 | 3, 1 |
| C10 | conv2d | 14, 14 | 256, 512 | 3, 2 |
| C11 | conv2d | 14, 14 | 256, 512 | 1, 2 |
| C12 | conv2d | 7, 7 | 512, 512 | 3, 1 |

| Name | Operator | H, W | IC | K, S |
|------|------------------|----------|------|------|
| D1 | depthwise conv2d | 112, 112 | 32 | 3, 1 |
| D2 | depthwise conv2d | 112, 112 | 64 | 3, 2 |
| D3 | depthwise conv2d | 56, 56 | 128 | 3, 1 |
| D4 | depthwise conv2d | 56, 56 | 128 | 3, 2 |
| D5 | depthwise conv2d | 28, 28 | 256 | 3, 1 |
| D6 | depthwise conv2d | 28, 28 | 256 | 3, 2 |
| D7 | depthwise conv2d | 14, 14 | 512 | 3, 1 |
| D8 | depthwise conv2d | 14, 14 | 512 | 3, 2 |
| D9 | depthwise conv2d | 7, 7 | 1024 | 3, 1 |

Table 2: Configurations of all conv2d operators in ResNet-18 and all depthwise conv2d operators in MobileNet used in the single kernel experiments. H/W denotes height and width, IC input channels, OC output channels, K kernel size, and S stride size. All ops use "SAME" padding. All depthwise conv2d operations have channel multipliers of 1.

TVM's RPC supports dynamic upload and runs cross-compiled modules, as well as any functions that use TVM's runtime convention. As a result, we can use the same infrastructure to do a single workload optimization and end-to-end graph inference. Our approach automates the compile, run, and profile steps across multiple devices.
This infrastructure is especially critical for embedded devices, which traditionally require tedious manual effort for cross-compilation, code deployment, and measurement.

#### 6 Evaluation

The core of TVM is implemented in C++ (~50k LoC). We provide language bindings to Python and Java. Earlier sections of this paper evaluated the impact of several individual optimizations and components of TVM, namely *operator fusion* in Figure 4, *latency hiding* in Figure 9, and the *ML-based cost model* in Figure 11. We now focus on an end-to-end evaluation, aiming to answer the following questions:

- Can TVM optimize deep learning workloads over multiple platforms?
- How does TVM compare to existing deep learning frameworks (which rely on heavily optimized libraries) on each back-end?
- Can TVM support new, emerging workloads in deep learning (e.g., depthwise convolution, low precision operations)?
- Can TVM support and optimize for new specialized accelerators?

To answer these questions, we evaluate TVM on four types of platforms: a server-class GPU, an embedded GPU, an embedded CPU, and a deep learning accelerator implemented on a low-power FPGA SoC. The benchmarks are based on real-world deep learning inference workloads including ResNet [15], MobileNet [18], LSTM Language Model [43], Deep Q Network (DQN) [26], and Deep Convolutional Generative Adversarial Networks (DCGAN) [29]. We compare our approach with existing deep learning frameworks, including MXNet [8] and TensorFlow [2], that rely on highly engineered vendor-specific libraries. TVM performs end-to-end automatic optimization and code generation without any external operator library.

#### 6.1 Server-class GPU Evaluation

Figure 13: GPU end-to-end evaluation among TVM, MXNet, TensorFlow, and TensorFlow XLA. Tested on NVIDIA Titan X.

We first compare the end-to-end performance of deep neural networks among TVM, MXNet (v1.1), TensorFlow (v1.7), and TensorFlow XLA on an NVIDIA Titan X.
MXNet and TensorFlow both use cuDNN v7 for convolution operators and implement their own versions of depthwise convolution, because it is relatively new and not yet supported by the latest libraries. They also use cuBLAS v8 for matrix multiplications. On the other hand, TensorFlow XLA uses JIT compilation. Figure 13 shows that TVM outperforms the baselines with speedups ranging from $1.6\times$ to $3.8\times$. This improvement is brought by the joint graph optimization and the automatic optimizer that generates high performance fused operators. The result of DQN ($3.8\times$ speedup) is due to its use of unconventional operators ($4\times4$ conv2d, strides=2) that are not well optimized by cuDNN, while the ResNet workloads are more conventional. TVM automatically finds optimized operators in both cases.

Figure 14: Relative speedup of all conv2d operators in ResNet-18 and all depthwise conv2d operators in MobileNet. Tested on TITAN X. See Table 2 for the configurations of these operators. We also include a weight pre-transformed Winograd [24] for 3x3 conv2d (TVM PT).

Figure 15: ARM A53 end-to-end evaluation of TVM and TFLite.

To evaluate the effectiveness of operator-level optimization, we also perform a breakdown comparison for each tensor operator in ResNet and MobileNet, shown in Figure 14. We include Tensor Comprehensions (TC, commit: ef644ba) [37], a recently introduced auto-tuning framework, as an additional baseline. The results of TC are the best kernels it found in 10 generations $\times$ 100 population $\times$ 2 random seeds for each operator (i.e., 2000 trials per operator). 2D convolution is one of the most important operators in deep learning and is heavily optimized by cuDNN. However, TVM can still generate better GPU kernels for most layers. Depthwise convolution is a newly introduced operator with a simpler structure [18]. In this case, both TVM and TC can find fast kernels compared to the handcrafted kernels in MXNet.
TVM's improvements are mainly due to its exploration of a large schedule space and an effective ML-based search algorithm.

#### **6.2** Embedded CPU Evaluation

We evaluate the performance of TVM on an ARM Cortex A53 (Quad Core 1.2GHz). We use TensorFlow Lite (TFLite, commit: 7558b085) as our baseline system. Figure 16 compares TVM's tensor operators against hand-optimized ones for ResNet and MobileNet. We observe that TVM generates operators that outperform the hand-optimized TFLite versions for both neural network workloads. This result also demonstrates TVM's ability to quickly optimize emerging tensor operators, such as depthwise convolution operators. Finally, Figure 15 shows an end-to-end comparison of three workloads, where TVM outperforms the TFLite baseline.<sup>1</sup>

Figure 16: Relative speedup of all conv2d operators in ResNet-18 and all depthwise conv2d operators in MobileNet. Tested on ARM A53. See Table 2 for the configurations of these operators.

**Ultra Low Precision Operators** We demonstrate TVM's ability to support ultra low precision inference [12, 31] by generating highly optimized operators that operate on fixed-point data types of fewer than 8 bits. Low precision networks replace expensive multiplication with vectorized bit-serial multiplication composed of bitwise AND and popcount reductions [36]. Achieving efficient low precision inference requires packing quantized data types into wider standard data types such as int8 or int32. Our system is able to generate code that outperforms hand-optimized libraries from Caffe2 (commit: 39e07f7) [36]. We implement an ARM-specific tensorization intrinsic that leverages ARM instructions to implement an efficient low precision matrix-vector microkernel. We then leverage TVM's automated optimizer to explore the scheduling space. In Figure 17, we compare TVM against the Caffe2 ultra low precision library on ResNet for 2-bit activations, 1-bit weights inference.
Since the baseline is single-threaded, we also compare it against a single-threaded TVM version. Single-threaded TVM outperforms the baseline, particularly for the C5, C8, and C11 layers. These are convolution layers with kernel size $1 \times 1$ and a stride of 2, for which the ultra low precision baseline library is not optimized. Furthermore, we take advantage of additional TVM capabilities to produce a parallel library implementation that improves over the baseline. Besides the 2-bit+1-bit configuration, TVM can generate and optimize for other precision configurations that are not supported by the baseline library, offering improved flexibility.

Figure 17: Relative speedup of single-threaded and multi-threaded low precision conv2d operators in ResNet. Baseline is a single-threaded hand-optimized implementation from Caffe2 (commit: 39e07f7).

#### **6.3** Embedded GPU Evaluation

For our mobile GPU experiments, we run our end-to-end pipeline on a Firefly-RK3399 board. It is equipped with an ARM Mali-T860MP4 GPU. The baseline is a vendor-provided library, the ARM Compute Library (v18.03). As shown in Figure 18, we outperform the baseline on three available models for both float16 and float32 (DCGAN and LSTM are not yet supported by the baseline). The speedup ranges from $1.2\times$ to $1.6\times$.

Figure 18: End-to-end experiment results on Mali-T860MP4. Two data types, float32 and float16, are evaluated.

#### **6.4 FPGA Accelerator Evaluation**

**Vanilla Deep Learning Accelerator** We demonstrate how TVM tackles accelerator-specific code generation on a generic inference accelerator design we prototyped on an FPGA. We introduce the Vanilla Deep Learning Accelerator (VDLA), which distills characteristics from previous accelerator proposals [11, 20, 25] into a minimalist hardware architecture. We use VDLA in this evaluation to demonstrate TVM's ability to generate highly efficient schedules that can target specialized accelerators.
Figure 19 shows the high-level hardware organization of the VDLA architecture. VDLA is programmed as a tensor processor to efficiently execute operations with high compute intensity (e.g., matrix multiplication, high dimensional convolution). VDLA can perform load/store operations to bring blocked 3-dimensional tensors from DRAM into a contiguous region of SRAM. VDLA also provides specialized on-chip memories for network parameters, layer inputs (narrow data type), and layer outputs (wide data type). Finally, VDLA provides explicit synchronization control over successive load, compute, and store operations to maximize the overlap between memory and compute operations.

<sup>1</sup>DCGAN and LSTM results are not presented because they are not yet supported by the baseline.

**Methodology** We implement the VDLA design on a low-power PYNQ board, which incorporates an ARM Cortex A9 dual core CPU clocked at 667MHz and an Artix-7 based FPGA fabric. On the modest FPGA resources, we implement a 16 × 16 matrix-vector unit clocked at 200MHz that performs products of 8-bit values and accumulates them into a 32-bit register every cycle. The theoretical peak throughput of this flavor of the VDLA design is about 102.4 GOPS. We allocate 32kB of resources for activation storage, 32kB for parameter storage, 32kB for microcode buffers, and 128kB for the register file. These on-chip buffers are far too small to hold a single layer of ResNet on-chip, and therefore enable a case study on effective memory reuse and latency hiding.

We build a driver library for VDLA with a C runtime API that can construct instructions and push them to the target accelerator for execution. Our code generation algorithm then translates the accelerator program to a series of calls into the runtime API. Adding the specialized accelerator back-end takes ~2k LoC in Python.
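The peak throughput and buffer-size figures in the methodology can be checked with a few lines of arithmetic. The sketch below assumes that each multiply-accumulate counts as two operations (one multiply, one add), which reproduces the stated 102.4 figure, and that 1 kB means 1024 bytes. The activation footprint uses the C2 layer shape from Table 2 with one byte per 8-bit activation.

```python
def peak_gops(rows, cols, clock_hz, ops_per_mac=2):
    """Peak throughput in giga-operations per second for a rows x cols MAC array.

    Assumes every multiply-accumulate unit is busy each cycle and that one
    MAC counts as `ops_per_mac` operations.
    """
    macs_per_cycle = rows * cols
    return macs_per_cycle * ops_per_mac * clock_hz / 1e9


# 16 x 16 matrix-vector unit clocked at 200 MHz.
vdla_peak = peak_gops(16, 16, 200e6)

# On-chip storage: activations, parameters, microcode, register file (kB).
buffers_kb = {"activation": 32, "parameter": 32, "microcode": 32, "register_file": 128}
total_kb = sum(buffers_kb.values())

# Input activations of layer C2 (56 x 56 spatial, 64 channels) at 1 byte each.
c2_activation_kb = 56 * 56 * 64 / 1024

print(vdla_peak)         # GOPS
print(total_kb)          # total on-chip buffer capacity in kB
print(c2_activation_kb)  # kB needed for one layer's input activations
```

The C2 input activations alone need 196 kB, roughly six times the 32 kB activation buffer, which is why the compiler has to tile each layer and overlap DRAM transfers with compute.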
**End-to-end ResNet Evaluation** We leverage TVM to generate ResNet inference kernels on the PYNQ platform and offload as many layers as possible to VDLA. We use TVM to generate schedules for both the CPU-only and the CPU+FPGA implementations. Due to its shallow convolution depth, the first ResNet convolution layer could not be efficiently offloaded to the FPGA and is instead computed on the CPU. All other convolution layers in ResNet, however, are amenable to efficient offloading. Operations like residual layers and activations are also performed on the CPU since VDLA does not support these operations.

Figure 20 shows the ResNet inference time breakdown between the CPU-only execution and the CPU+FPGA execution. Most of the computation is spent on the convolution layers that can be offloaded to VDLA. For those convolution layers, the achieved speedup is $40\times$. Unfortunately, by Amdahl's law, the overall performance of the FPGA-accelerated system is bottlenecked by the sections of the workload that have to be executed on the CPU. We envision that extending the VDLA design to support these other operators will help reduce cost even further. This FPGA-based experiment showcases TVM's ability to adapt to new architectures and the hardware intrinsics that they expose.

Figure 19: VDLA Hardware Design Overview.

Figure 20: We offload convolutions in the ResNet workload to an FPGA-based accelerator. The grayed-out bars correspond to layers that cannot be accelerated by the FPGA and therefore have to run on the CPU. The FPGA can provide a $40\times$ acceleration on offloaded convolution layers over the Cortex A9.

#### 7 Related Work

Deep learning frameworks [3, 4, 6, 8] provide convenient interfaces for users to express deep learning workloads, and deploy them easily on different hardware back-ends.
While existing frameworks currently depend on vendor-specific tensor operator libraries to execute their workloads, they can leverage TVM's stack to generate optimized code for a larger number of hardware devices.

High-level computation graph DSLs are a typical way to represent and perform high-level optimizations. TensorFlow's XLA [3] and the recently introduced DLVM [40] fall into this category. The representations of computation graphs in these works are similar, and a high-level computation graph DSL is also used in this paper. While graph-level representations are a good fit for high-level optimizations, they are too high-level to optimize tensor operators under a diverse set of hardware back-ends. Prior work relies on specific lowering rules to directly generate low-level LLVM or resorts to vendor-crafted libraries. These approaches require significant engineering effort for each hardware back-end and operator-variant combination.

Halide [30] introduced the idea of separating compute and scheduling. We adopt Halide's insights and reuse its existing useful scheduling primitives in our compiler. Our tensor operator scheduling is also related to other work on DSLs for GPUs [17, 23, 33, 34] as well as work on polyhedral-based loop transformation [5, 38]. TACO [22] introduces a generic way to generate sparse tensor operators on CPU. Weld [28] is a DSL for data processing tasks. We specifically focus on solving the new scheduling challenges of deep learning workloads for GPUs and specialized accelerators. Our new primitives can potentially be adopted by the optimization pipelines in these works.

High-performance libraries such as ATLAS [41] and FFTW [14] use auto-tuning to get the best performance. Tensor Comprehensions [37] applies blackbox auto-tuning together with polyhedral optimizations to optimize CUDA kernels. A predefined cost model is used to automatically schedule image processing pipelines in Halide [27].
TVM's machine learning based distributed schedule optimizer scales to a larger search space and can find state-of-the-art kernels on a large range of supported back-ends. More importantly, we provide an end-to-end stack that can take descriptions directly from deep learning frameworks, and jointly optimize together with the graph-level stack.

Despite the emerging popularity of accelerators for deep learning [10, 20], it remains unclear how a compilation stack can be built to effectively target these devices. The VDLA design used in the evaluation provides a generic way to summarize the properties of these accelerators, and enables a concrete case study on how to compile code for accelerators. This paper provides a generic solution to effectively target specialized accelerators via tensorization and compiler-driven latency hiding.

### 8 Conclusion

We proposed an end-to-end compilation stack to solve fundamental optimization challenges for deep learning across a diverse set of hardware back-ends. Our system includes automated end-to-end optimization, which is historically a labor-intensive and highly specialized task. We hope this work will encourage more studies of end-to-end compilation approaches and open new opportunities for software-hardware co-design techniques for deep learning systems.

### Acknowledgement

We would like to thank Ras Bodik, James Bornholt, Xi Wang, Tom Anderson and Qiao Zhang for their thorough feedback on earlier versions of this paper. Tianqi Chen is supported by the Google PhD Fellowship. This work was partially supported by the NSF under grant #1518703.

#### References

- [1] NVIDIA Tesla V100 GPU Architecture: The World's Most Advanced Data Center GPU, 2017.
- [2] ABADI, M., AGARWAL, A., BARHAM, P., BREVDO, E., CHEN, Z., CITRO, C., CORRADO, G.
S., DAVIS, A., DEAN, J., DEVIN, M., GHEMAWAT, S., GOODFELLOW, I., HARP, A., IRVING, G., ISARD, M., JIA, Y., JOZEFOWICZ, R., KAISER, L., KUDLUR, M., LEVENBERG, J., MANÉ, D., MONGA, R., MOORE, S., MURRAY, D., OLAH, C., SCHUSTER, M., SHLENS, J., STEINER, B., SUTSKEVER, I., TALWAR, K., TUCKER, P., VANHOUCKE, V., VASUDEVAN, V., VIÉGAS, F., VINYALS, O., WARDEN, P., WATTENBERG, M., WICKE, M., YU, Y., AND ZHENG, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
- [3] ABADI, M., BARHAM, P., CHEN, J., CHEN, Z., DAVIS, A., DEAN, J., DEVIN, M., GHEMAWAT, S., IRVING, G., ISARD, M., KUDLUR, M., LEVENBERG, J., MONGA, R., MOORE, S., MURRAY, D. G., STEINER, B., TUCKER, P., VASUDEVAN, V., WARDEN, P., WICKE, M., YU, Y., AND ZHENG, X. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (2016), pp. 265–283.
- [4] AGARWAL, A., AKCHURIN, E., BASOGLU, C., CHEN, G., CYPHERS, S., DROPPO, J., EVERSOLE, A., GUENTER, B., HILLEBRAND, M., HOENS, R., HUANG, X., HUANG, Z., IVANOV, V., KAMENEV, A., KRANEN, P., KUCHAIEV, O., MANOUSEK, W., MAY, A., MITRA, B., NANO, O., NAVARRO, G., ORLOV, A., PADMILAC, M., PARTHASARATHI, H., PENG, B., REZNICHENKO, A., SEIDE, F., SELTZER, M. L., SLANEY, M., STOLCKE, A., WANG, Y., WANG, H., YAO, K., YU, D., ZHANG, Y., AND ZWEIG, G. An introduction to computational networks and the computational network toolkit. Tech. Rep. MSR-TR-2014-112, August 2014.
- [5] BAGHDADI, R., BEAUGNON, U., COHEN, A., GROSSER, T., KRUSE, M., REDDY, C., VERDOOLAEGE, S., BETTS, A., DONALDSON, A. F., KETEMA, J., ABSAR, J., HAASTREGT, S. V., KRAVETS, A., LOKHMOTOV, A., DAVID, R., AND HAJIYEV, E. Pencil: A platform-neutral compute intermediate language for accelerator programming. In Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT) (Washington, DC, USA, 2015), PACT '15, IEEE Computer Society, pp.
138–149.
- [6] BASTIEN, F., LAMBLIN, P., PASCANU, R., BERGSTRA, J., GOODFELLOW, I. J., BERGERON, A., BOUCHARD, N., AND BENGIO, Y. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
- [7] CHEN, T., AND GUESTRIN, C. Xgboost: A scalable tree boosting system. In *Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining* (New York, NY, USA, 2016), KDD '16, ACM, pp. 785–794.
- [8] CHEN, T., LI, M., LI, Y., LIN, M., WANG, N., WANG, M., XIAO, T., XU, B., ZHANG, C., AND ZHANG, Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In Neural Information Processing Systems, Workshop on Machine Learning Systems (LearningSys'15) (2015).
- [9] CHEN, T.-F., AND BAER, J.-L. Effective hardware-based data prefetching for high-performance processors. *IEEE Transactions on Computers* 44, 5 (May 1995), 609–623.
- [10] CHEN, Y., LUO, T., LIU, S., ZHANG, S., HE, L., WANG, J., LI, L., CHEN, T., XU, Z., SUN, N., AND TEMAM, O. Dadiannao: A machine-learning supercomputer. In *Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture* (Washington, DC, USA, 2014), MICRO-47, IEEE Computer Society, pp. 609–622.
- [11] CHEN, Y.-H., EMER, J., AND SZE, V. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In *Proceedings of the 43rd International Symposium on Computer Architecture* (Piscataway, NJ, USA, 2016), ISCA '16, IEEE Press, pp. 367–379.
- [12] COURBARIAUX, M., BENGIO, Y., AND DAVID, J. Binaryconnect: Training deep neural networks with binary weights during propagations. *CoRR abs/1511.00363* (2015).
- [13] EGGERS, S. J., EMER, J. S., LEVY, H. M., LO, J. L., STAMM, R. L., AND TULLSEN, D. M. Simultaneous multithreading: a platform for next-generation processors. *IEEE Micro* 17, 5 (Sept 1997), 12–19.
- [14] FRIGO, M., AND JOHNSON, S. G.
Fftw: an adaptive software architecture for the fft. In Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on (May 1998), vol. 3, pp. 1381–1384.
- [15] HE, K., ZHANG, X., REN, S., AND SUN, J. Identity mappings in deep residual networks. arXiv preprint arXiv:1603.05027 (2016).
- [16] HEGARTY, J., BRUNHAVER, J., DEVITO, Z., RAGAN-KELLEY, J., COHEN, N., BELL, S., VASILYEV, A., HOROWITZ, M., AND HANRAHAN, P. Darkroom: Compiling high-level image processing code into hardware pipelines. ACM Trans. Graph. 33, 4 (July 2014), 144:1–144:11.
- [17] HENRIKSEN, T., SERUP, N. G. W., ELSMAN, M., HENGLEIN, F., AND OANCEA, C. E. Futhark: Purely functional gpu-programming with nested parallelism and in-place array updates. In *Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation* (New York, NY, USA, 2017), PLDI 2017, ACM, pp. 556–571.
- [18] HOWARD, A. G., ZHU, M., CHEN, B., KALENICHENKO, D., WANG, W., WEYAND, T., ANDREETTO, M., AND ADAM, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. *CoRR abs/1704.04861* (2017).
- [19] JOUPPI, N. P. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture (May 1990), pp. 364–373.
- [20] JOUPPI, N. P., YOUNG, C., PATIL, N., PATTERSON, D., AGRAWAL, G., BAJWA, R., BATES, S., BHATIA, S., BODEN, N., BORCHERS, A., BOYLE, R., CANTIN, P.-L., CHAO, C., CLARK, C., CORIELL, J., DALEY, M., DAU, M., DEAN, J., GELB, B., GHAEMMAGHAMI, T. V., GOTTIPATI, R., GULLAND, W., HAGMANN, R., HO, C.
R., HOGBERG, D., HU, J., HUNDT, R., HURT, D., IBARZ, J., JAFFEY, A., JAWORSKI, A., KAPLAN, A., KHAITAN, H., KILLEBREW, D., KOCH, A., KUMAR, N., LACY, S., LAUDON, J., LAW, J., LE, D., LEARY, C., LIU, Z., LUCKE, K., LUNDIN, A., MACKEAN, G., MAGGIORE, A., MAHONY, M., MILLER, K., NAGARAJAN, R., NARAYANASWAMI, R., NI, R., NIX, K., NORRIE, T., OMERNICK, M., PENUKONDA, N., PHELPS, A., ROSS, J., ROSS, M., SALEK, A., SAMADIANI, E., SEVERN, C., SIZIKOV, G., SNELHAM, M., SOUTER, J., STEINBERG, D., SWING, A., TAN, M., THORSON, G., TIAN, B., TOMA, H., TUTTLE, E., VASUDEVAN, V., WALTER, R., WANG, W., WILCOX, E., AND YOON, D. H. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (New York, NY, USA, 2017), ISCA '17, ACM, pp. 1–12.
- [21] KIRKPATRICK, S., GELATT, C. D., AND VECCHI, M. P. Optimization by simulated annealing. *Science* 220, 4598 (1983), 671–680.
- [22] KJOLSTAD, F., KAMIL, S., CHOU, S., LUGATO, D., AND AMARASINGHE, S. The tensor algebra compiler. *Proc. ACM Program. Lang. 1*, OOPSLA (Oct. 2017), 77:1–77:29.
- [23] KLÖCKNER, A. Loo.py: transformation-based code generation for GPUs and CPUs. In *Proceedings of ARRAY '14: ACM SIGPLAN Workshop on Libraries, Languages, and Compilers for Array Programming* (Edinburgh, Scotland, 2014), Association for Computing Machinery.
- [24] LAVIN, A., AND GRAY, S. Fast algorithms for convolutional neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016 (2016), pp. 4013–4021.
- [25] LIU, D., CHEN, T., LIU, S., ZHOU, J., ZHOU, S., TEMAN, O., FENG, X., ZHOU, X., AND CHEN, Y. Pudiannao: A polyvalent machine learning accelerator. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2015), ASPLOS '15, ACM, pp. 369–381.
- [26] MNIH, V., KAVUKCUOGLU, K., SILVER, D., RUSU, A. A., VENESS, J., BELLEMARE, M. G., GRAVES, A., RIEDMILLER, M., FIDJELAND, A. K., OSTROVSKI, G., ET AL. Human-level control through deep reinforcement learning. *Nature* 518, 7540 (2015), 529.
- [27] MULLAPUDI, R. T., ADAMS, A., SHARLET, D., RAGAN-KELLEY, J., AND FATAHALIAN, K. Automatically scheduling halide image processing pipelines. ACM Trans. Graph. 35, 4 (July 2016), 83:1–83:11.
- [28] PALKAR, S., THOMAS, J. J., NARAYANAN, D., SHANBHAG, A., PALAMUTTAM, R., PIRK, H., SCHWARZKOPF, M., AMARASINGHE, S. P., MADDEN, S., AND ZAHARIA, M. Weld: Rethinking the interface between data-intensive applications. *CoRR abs/1709.06416* (2017).
- [29] RADFORD, A., METZ, L., AND CHINTALA, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015).
- [30] RAGAN-KELLEY, J., BARNES, C., ADAMS, A., PARIS, S., DURAND, F., AND AMARASINGHE, S. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In *Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation* (New York, NY, USA, 2013), PLDI '13, ACM, pp. 519–530.
- [31] RASTEGARI, M., ORDONEZ, V., REDMON, J., AND FARHADI, A. Xnor-net: Imagenet classification using binary convolutional neural networks. In *European Conference on Computer Vision* (2016), Springer, pp. 525–542.
- [32] SMITH, J. E. Decoupled access/execute computer architectures. In Proceedings of the 9th Annual Symposium on Computer Architecture (Los Alamitos, CA, USA, 1982), ISCA '82, IEEE Computer Society Press, pp. 112–119.
- [33] STEUWER, M., REMMELG, T., AND DUBACH, C. Lift: A functional data-parallel ir for high-performance gpu code generation. In *Proceedings of the 2017 International Symposium on Code Generation and Optimization* (Piscataway, NJ, USA, 2017), CGO '17, IEEE Press, pp. 74–85.
- [34] SUJEETH, A. K., LEE, H., BROWN, K.
J., CHAFI, H., WU, M., ATREYA, A. R., OLUKOTUN, K., ROMPF, T., AND ODERSKY, M. Optiml: An implicitly parallel domain-specific language for machine learning. In Proceedings of the 28th International Conference on Machine Learning (USA, 2011), ICML'11, pp. 609–616.
- [35] TAI, K. S., SOCHER, R., AND MANNING, C. D. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075 (2015).
- [36] TULLOCH, A., AND JIA, Y. High performance ultra-low-precision convolutions on mobile devices. arXiv preprint arXiv:1712.02427 (2017).
- [37] VASILACHE, N., ZINENKO, O., THEODORIDIS, T., GOYAL, P., DEVITO, Z., MOSES, W. S., VERDOOLAEGE, S., ADAMS, A., AND COHEN, A. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. *CoRR abs/1802.04730* (2018).
- [38] VERDOOLAEGE, S., CARLOS JUEGA, J., COHEN, A., IGNACIO GÓMEZ, J., TENLLADO, C., AND CATTHOOR, F. Polyhedral parallel code generation for cuda. ACM Trans. Archit. Code Optim. 9, 4 (Jan. 2013), 54:1–54:23.
- [39] VOLKOV, V. Understanding Latency Hiding on GPUs. PhD thesis, University of California at Berkeley, 2016.
- [40] WEI, R., ADVE, V., AND SCHWARTZ, L. Dlvm: A modern compiler infrastructure for deep learning systems. *CoRR abs/1711.03016* (2017).
- [41] WHALEY, R. C., AND DONGARRA, J. J. Automatically tuned linear algebra software. In *Proceedings of the 1998 ACM/IEEE Conference on Supercomputing* (Washington, DC, USA, 1998), SC '98, IEEE Computer Society, pp. 1–27.
- [42] WILLIAMS, S., WATERMAN, A., AND PATTERSON, D. Roofline: An insightful visual performance model for multicore architectures. *Commun. ACM* 52, 4 (Apr. 2009), 65–76.
- [43] ZAREMBA, W., SUTSKEVER, I., AND VINYALS, O. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 (2014).