LEARNED HARDWARE/SOFTWARE CO-DESIGN OF NEURAL ACCELERATORS

Abstract

The use of deep learning has grown at an exponential rate, giving rise to numerous specialized hardware and software systems for deep learning. Because the design space of deep learning software stacks and hardware accelerators is diverse and vast, prior work considers software optimizations separately from hardware architectures, effectively reducing the search space. Unfortunately, this bifurcated approach means that many profitable design points are never explored. This paper instead casts the problem as hardware/software co-design, with the goal of automatically identifying desirable points in the joint design space. The key to our solution is a new constrained Bayesian optimization framework that avoids invalid solutions by exploiting the highly constrained, semi-continuous/semi-discrete structure of this design space. We evaluate our optimization framework by applying it to a variety of neural models, improving the energy-delay product by 18% (ResNet) and 40% (DQN) over hand-tuned state-of-the-art systems, as well as demonstrating strong results on other neural network architectures, such as MLPs and Transformers.

1. INTRODUCTION

The compute requirements of deep learning are growing at a double-exponential rate (Hernandez & Brown, 2020), with more powerful models requiring exponentially more compute to train. This growth has been enabled by large systems of hardware accelerators, such as GPUs and TPUs (NVIDIA, 2017; Jouppi et al., 2017). However, the continued scaling of these systems is limited by issues of power density, cooling, and memory, so we must improve computational efficiency. Efficiency improvements can be sought at each layer of the deep learning stack, from better learning algorithms (Kingma & Ba, 2014), to improved neural network architectures (Tan & Le, 2019), to deep learning compilers (Chen et al., 2018), to specialized DNN accelerators that increase hardware efficiency (Chen et al., 2014a; 2016).

In this paper, we focus on the low-level software and hardware portions of this stack, with the goal of automatically optimizing the energy × delay product of executing a particular model on a hardware accelerator. We consider two components of the deep learning stack: the hardware accelerator and the software compiler that maps a model onto that hardware. This area is commonly referred to as hardware/software co-design, and since it requires human expertise from multiple disciplines (software engineers, compiler writers, hardware architects), it is typically driven by manual heuristics or heuristic-based search (Yang et al., 2020b).

We propose a different approach, recognizing that for a given DNN model, this hardware/software co-design can be framed as a joint search over the space of all valid mappings and hardware architectures that can correctly execute the model. We formally parameterize this space based on prior work (Parashar et al., 2019), and we find that standard optimization techniques, including off-the-shelf Bayesian optimization, perform poorly because the design space is semi-discrete and the vast majority of points in the space are infeasible.
Prior work (Nardi et al., 2019) makes a similar observation, noting that (1) complex constraints such as hardware area and energy budgets limit the feasible parameter values (i.e., a small feasibility set), and (2) some constraints are unknown until after a sample point has been evaluated (i.e., unknown feasibility).

Our solution casts the search as a bilevel optimization problem, as shown in Figure 1. The outer loop optimizes over hardware architectures, while the inner loop optimizes over software mappings for a given architecture. Both are heavily constrained black-box global optimization problems that require expensive simulations to obtain performance estimates. We therefore propose a nested, constrained Bayesian optimization (BO) formulation that uses Bayesian models of hardware and software performance to guide the search toward promising regions of the design space. Our approach is extensible to a variety of neural network architectures, and we make the code publicly available.

We find that, compared against state-of-the-art manually designed hardware accelerators that use heuristic software mappings, our BO-based approach provides significant improvements in the speed and energy efficiency of the resulting system, improving the energy-delay product (EDP) by 16.0% to 40.2% on a series of neural networks. The key to our solution is our robust BO software optimizer, whose consistent efficiency allows our approach to scale to this huge search space.

This paper makes the following contributions:

• We present the first system that automatically co-optimizes both the hardware architecture and software mapping phases of DNN accelerator design using a principled and systematic search algorithm.
• We present a constrained formulation of hardware and software design for BO, a challenging problem given the high ratio (90%) of invalid hardware and software designs.
• We present a nested hardware/software formulation of BO that is extensible to other hardware accelerator designs.
• We provide model-specific hardware and state-of-the-art results on multiple models.
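The bilevel structure of the search can be sketched in a few lines of Python. This is a minimal illustration only: the design spaces, validity checks, and `simulate_edp` cost model below are all hypothetical stand-ins (a real system would query an expensive cycle-accurate simulator), and random sampling stands in for the BO acquisition step. What it does show is the nesting (outer hardware loop, inner mapping loop) and the two kinds of constraint handling: infeasible points that can be rejected up front, and feasibility that is only discovered by attempting the inner search.

```python
import random

# Hypothetical, tiny design spaces; the real ones are vast and semi-discrete.
HW_SPACE = {"num_pes": [64, 128, 256], "buffer_kib": [32, 64, 128, 256]}
SW_SPACE = {"tile_c": [1, 2, 4, 8, 16], "order": ["CKX", "KCX", "XCK"]}

def sample(space):
    return {k: random.choice(v) for k, v in space.items()}

def hw_valid(hw):
    # Stand-in area constraint: many sampled architectures are infeasible.
    return hw["num_pes"] * hw["buffer_kib"] <= 16384

def sw_valid(hw, sw):
    # Stand-in mapping constraint: the tile must fit the parallel resources.
    return sw["tile_c"] <= hw["num_pes"] // 16

def simulate_edp(hw, sw):
    # Placeholder for an expensive simulation returning energy * delay
    # (lower is better); the numbers here are illustrative only.
    delay = 1e6 / (hw["num_pes"] * sw["tile_c"])
    energy = hw["num_pes"] * hw["buffer_kib"] * 1e-3
    return energy * delay

def inner_opt(hw, trials=50):
    """Inner loop: find the best software mapping for a fixed architecture."""
    best = None
    for _ in range(trials):
        sw = sample(SW_SPACE)
        if not sw_valid(hw, sw):          # reject infeasible mappings
            continue
        edp = simulate_edp(hw, sw)
        if best is None or edp < best[0]:
            best = (edp, sw)
    return best  # None if no feasible mapping was found

def outer_opt(trials=50):
    """Outer loop: search over hardware, scoring each by its best mapping."""
    best = None
    for _ in range(trials):
        hw = sample(HW_SPACE)
        if not hw_valid(hw):              # reject infeasible architectures
            continue
        inner = inner_opt(hw)
        if inner is None:                 # feasibility only known after trying
            continue
        if best is None or inner[0] < best[0]:
            best = (inner[0], hw, inner[1])
    return best
```

In the full system, the two `sample` calls are replaced by BO acquisition steps driven by Bayesian models of hardware and software performance.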

2. A FORMAL REPRESENTATION OF SOFTWARE AND HARDWARE

Hardware/software co-design is typically performed manually, but we believe that this vast design space is best navigated by an intelligent search process. To facilitate this automation, this section formally defines the hardware and software design spaces.

2.1 PARAMETERIZING THE DESIGN SPACE

Software design points can be parameterized by the loop ordering, loop tiling, and computational parallelism of the seven-level loop nest used to compute a convolutional layer (see appendix), as noted by recent work (Parashar et al., 2019; Yang et al., 2020b). These software parameters are subject to hardware constraints, such as the quantity and layout of processing elements (PEs) and the size of storage elements.
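To make this parameterization concrete, the following sketch encodes a software design point as (loop ordering, tiling factors, parallelized dimension) and checks it against hardware constraints. The dimension names, sizes, and capacity limits are hypothetical, and only three of the seven loop levels are shown; the point is that validity is determined by simple but compounding constraints (divisibility, PE-array width, buffer capacity), which rule out most of the enumerated points.

```python
from dataclasses import dataclass
from itertools import permutations

# Hypothetical layer dimensions: channels, filters, output width.
DIMS = {"C": 64, "K": 128, "X": 56}

@dataclass
class Mapping:
    order: tuple      # loop ordering, e.g. ("K", "C", "X")
    tiles: dict       # tiling factor per dimension
    parallel: str     # dimension distributed across PEs

def is_feasible(m, num_pes, buffer_words):
    # Tiling factors must evenly divide their dimensions.
    if any(DIMS[d] % m.tiles[d] != 0 for d in DIMS):
        return False
    # The parallelized tile must fit across the PE array.
    if m.tiles[m.parallel] > num_pes:
        return False
    # Stand-in capacity constraint: one tile of each dim fits on chip.
    footprint = 1
    for d in DIMS:
        footprint *= m.tiles[d]
    return footprint <= buffer_words

def enumerate_feasible(num_pes=16, buffer_words=1024):
    """Enumerate a small slice of the space; most points are infeasible."""
    factors = [1, 2, 4, 8, 16, 32]
    feasible = []
    for order in permutations(DIMS):
        for tc in factors:
            for tk in factors:
                for tx in factors:
                    m = Mapping(order, {"C": tc, "K": tk, "X": tx}, order[0])
                    if is_feasible(m, num_pes, buffer_words):
                        feasible.append(m)
    return feasible
```

Even in this toy slice, many enumerated points violate at least one constraint, which is why constraint-aware search matters at full scale.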

Hardware parameters can be broken down into two broad categories:

Resource configurations represent the physical aspects of hardware, such as buffer sizes, tile sizes, and the cluster size of global buffers, as well as the layouts of the PE array and of the global buffer.

Dataflow configurations represent the usage of the PE array as implemented in hardware, such as the blocking factors and degree of parallelism at the PE level, which also determine the communication patterns among PEs.

Figure 2 shows two possible design points for a 1D convolution. Both design points tile and parallelize the channel (C) dimension. To the right of each component in the architecture is a set of loops that specifies the control logic for the component, which can be broken down into temporal streaming (for loops) and spatial distribution (parallel_for loops). For example, in the architecture on the left, the global buffer distributes across the PEs 1 weight from 4 separate channels (c2), and the
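The for / parallel_for distinction above can be illustrated with a small executable sketch. All sizes are hypothetical, the kernel has unit width for clarity, and the spatial parallel_for (one c2 per PE) is simulated sequentially; the tiled nest is checked against an untiled reference computation.

```python
import numpy as np

# Hypothetical sizes for a 1D convolution over C input channels:
# out[x] = sum_c W[c] * In[c, x]   (unit kernel width for clarity).
C, X = 8, 6
C1, C2 = 2, 4            # channel tiling: C = C1 * C2, with c2 spatial
W = np.arange(C, dtype=float)
In = np.arange(C * X, dtype=float).reshape(C, X)

def conv1d_tiled():
    out = np.zeros(X)
    # Temporal loop over channel tiles (a `for` loop in Figure 2's notation).
    for c1 in range(C1):
        # Spatial loop: each c2 would run on a separate PE
        # (a `parallel_for`); here we simulate it sequentially.
        for c2 in range(C2):
            c = c1 * C2 + c2
            # Temporal loop over output positions inside each PE.
            for x in range(X):
                out[x] += W[c] * In[c, x]
    return out

# Untiled reference: the tiled nest must compute the same result.
reference = sum(W[c] * In[c] for c in range(C))
```

Changing C1 and C2 (while keeping C = C1 * C2) changes the hardware/software design point, i.e. how work is split between temporal streaming and spatial distribution, without changing the computed result.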

