OPTIMIZING MEMORY PLACEMENT USING EVOLUTIONARY GRAPH REINFORCEMENT LEARNING

Abstract

For deep neural network accelerators, memory movement is both energetically expensive and can bound computation. Therefore, optimal mapping of tensors to memory hierarchies is critical to performance. The growing complexity of neural networks calls for automated memory mapping instead of manual heuristic approaches; yet the search space of neural network computational graphs has previously been prohibitively large. We introduce Evolutionary Graph Reinforcement Learning (EGRL), a method designed for large search spaces that combines graph neural networks, reinforcement learning, and evolutionary search. A set of fast, stateless policies guides the evolutionary search to improve its sample efficiency. We train and validate our approach directly on the Intel NNP-I chip for inference. EGRL outperforms policy-gradient, evolutionary-search, and dynamic-programming baselines on BERT, ResNet-101, and ResNet-50. We additionally achieve a 28-78% speed-up over the native NNP-I compiler on all three workloads.

1. INTRODUCTION

The proliferation of deep learning (DL) has been fueled, in part, by a rapid growth in the size and complexity of deep neural networks (DNNs) (Dean et al., 2012; Ying et al., 2018). This has spurred the rapid development of hardware (Wang et al., 2016; Jouppi et al., 2017) and software (Abadi et al., 2016; Paszke et al., 2018; Cyphers et al., 2018) dedicated to deep learning workloads, seeking to optimize critical performance metrics such as throughput and power efficiency (Mattson et al., 2020). Producing compiler optimizations that map the tensors of a neural network's computational graph to the memory units on the host hardware is a critical challenge. Since different memory types trade off bandwidth and capacity differently, a sub-optimal mapping can significantly increase latency. For DL inference, the computational graph is static, so placement can be pre-planned rather than relying on online cache management (Zhang et al., 2020; Shi et al., 2019). However, this is especially challenging for DNNs due to the high-dimensional search space. For example, ResNet-50 (He et al., 2016) has 57 operational layers. Mapping each activation and weight tensor to, say, three memory levels (DRAM, LLC, and SRAM) represents 3^(2*57) ≈ 10^54 possible decisions. BERT (Devlin et al., 2018) has 376 operational layers and a search space of ~10^358. Since optimizing this mapping is intractable with traditional approaches such as dynamic programming (Bellman, 1954), current solutions primarily rely on manually tuned heuristic rules encoded in a compiler. Because of the large search space, prior reinforcement learning (RL) algorithms for automating mappings have relied on manually designed grouping (Mirhoseini et al., 2017; Addanki et al., 2018) or a learned grouper whose hierarchical structure is domain dependent (Mirhoseini et al., 2018).
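The search-space sizes quoted above follow from simple counting: each layer contributes a weight tensor and an activation tensor, and each tensor is independently assigned to one of three memory levels. A short sketch of the arithmetic (digit counting is used instead of floating-point logarithms, since 10^358 overflows a double):

```python
# Size of the memory-mapping search space: each of a workload's
# operational layers has a weight tensor and an activation tensor,
# and each tensor is independently mapped to one of three memory
# levels (DRAM, LLC, SRAM).
def mapping_space_size(num_layers, num_memory_levels=3, tensors_per_layer=2):
    return num_memory_levels ** (tensors_per_layer * num_layers)

# Order of magnitude = number of decimal digits minus one.
def order_of_magnitude(n):
    return len(str(n)) - 1

print(order_of_magnitude(mapping_space_size(57)))   # ResNet-50: 54
print(order_of_magnitude(mapping_space_size(376)))  # BERT: 358
```

This reproduces the 3^(2*57) ≈ 10^54 and ~10^358 figures for ResNet-50 and BERT, respectively.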
In addition to the extremely large action space, the large number of nodes renders the reward sparse and noisy, making the problem further unsuitable for gradient-based deep RL algorithms. This sparsity stems from the fact that an overall performance metric can only be measured after all nodes have been processed. In this paper, we present Evolutionary Graph Reinforcement Learning (EGRL), a hybrid approach combining evolutionary search with gradient-based learning that natively searches a space orders of magnitude larger than those of previous approaches. EGRL is an extension of CERL (Khadka et al., 2019), a population-based method for sparse-reward tasks that combines fast policy gradient (PG) learning with a stable evolutionary algorithm (EA). Since the action spaces explored in this paper are several orders of magnitude larger than those explored in CERL, we introduce Boltzmann chromosomes: a set of fast, stateless policies that accelerate evolution by providing partially optimized solutions as anchors. This mechanism is necessary to improve the sample-efficiency of the slow EA component in this large action space. Further, we employ a graph neural network (GNN) (Wu et al., 2020; Scarselli et al., 2008) to represent our policy. This allows our agent to natively process computational graphs representing deep learning workloads, enabling generalization over workloads of varying size and connectivity. We demonstrate our solution on the Intel Neural Network Processor for Inference (NNP-I), a deep learning accelerator, by mapping modern neural networks onto the three memory hierarchies on the chip. Each memory level in this chip trades off memory size against bandwidth, as detailed in Wechsler et al. (2019). This additionally differentiates our work from prior works such as REGAL (Paliwal et al., 2020), which assume infinite bandwidth and memory, assumptions that are not practical on real hardware.
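To make the Boltzmann-chromosome idea concrete, the following is an illustrative sketch, not the paper's exact implementation: a stateless policy that keeps a table of preferences over memory levels for every tensor and samples placements from the softmax (Boltzmann) distribution at a given temperature. Evolution can mutate the preference table directly, and evaluating such a policy requires no forward pass through a network, which is what makes it fast. All names here (`BoltzmannChromosome`, `prefs`, `mutate`) are hypothetical.

```python
import numpy as np

class BoltzmannChromosome:
    """Illustrative stateless policy: one preference vector over
    memory levels (e.g. DRAM/LLC/SRAM) per tensor in the graph."""

    def __init__(self, num_tensors, num_levels=3, temperature=1.0, rng=None):
        self.rng = rng or np.random.default_rng()
        self.prefs = self.rng.normal(size=(num_tensors, num_levels))
        self.temperature = temperature

    def act(self):
        # Boltzmann (softmax) sampling at the chromosome's temperature.
        logits = self.prefs / self.temperature
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # One memory-level index per tensor.
        return np.array([self.rng.choice(p.size, p=p) for p in probs])

    def mutate(self, scale=0.1):
        # Evolution perturbs the preference table in place.
        self.prefs += self.rng.normal(scale=scale, size=self.prefs.shape)
```

A chromosome for ResNet-50's 114 tensors would be `BoltzmannChromosome(114)`, and `act()` returns one complete candidate memory map that can seed or anchor the evolutionary population.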
Additionally, we consider single-batch inference, an important industry benchmark (Mattson et al., 2020). While large batch sizes offer greater computational efficiency (e.g., Boudoukh et al. (2020) on NNP-I), they are sub-optimal for a given inference example due to the latency of queuing up a batch. Therefore, single-batch inference is key to many time-critical applications (Park et al., 2018) where an individual inference query must be processed in real time. Results on ResNet-50, ResNet-101 (He et al., 2016) and BERT show that EGRL significantly outperforms the chipset's native compiler across all workloads, and exceeds the performance of dynamic-programming, evolutionary-search and policy-gradient approaches. Specifically, the contributions of this work are:

1. A generalized GNN-based policy that natively accepts a computational graph and produces a corresponding graph representation with the optimal memory maps, eliminating the need for serialized, layer-dependent representations.

2. EGRL, a scalable population-based algorithm that can effectively train on sparse and noisy feedback from the host hardware in large search spaces.

3. An RL agent that trains directly on the hardware, with a feedback mechanism for constraint violation, allowing direct deployment and testing on hardware.
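The training-on-hardware loop implied by contributions 2 and 3 can be sketched as follows. This is a hedged outline under stated assumptions: the agent emits a complete memory map for every tensor at once, the workload runs on the device, and a single scalar reward arrives only after all placements are fixed (the source of the reward sparsity), with constraint violations fed back as a penalty. The names `run_on_hardware` and `capacity_ok` are placeholders, not a real NNP-I API.

```python
def evaluate_mapping(mapping, run_on_hardware, capacity_ok):
    """Score one complete tensor-to-memory map.

    mapping          -- one memory-level index per tensor
    run_on_hardware  -- placeholder: runs the compiled workload and
                        returns a scalar performance metric
                        (e.g. single-batch throughput)
    capacity_ok      -- placeholder: checks hardware constraints
                        (e.g. SRAM not over-subscribed)
    """
    if not capacity_ok(mapping):
        # Constraint violations are reported back to the agent as a
        # penalty rather than aborting training.
        return -1.0
    # Reward is observed only after the whole graph is mapped and run,
    # which is what makes the signal sparse and noisy.
    return run_on_hardware(mapping)
```

Both the evolutionary population and the policy-gradient learners would consume this single end-of-episode scalar per candidate mapping.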



Figure 1: Workflow of Graph RL agent mapping weights (W) and activations (A) of each layer of a trained neural network workload to various on-board memory components (e.g. DRAM, SRAM).

