MEMORY OPTIMIZATION FOR DEEP NETWORKS

Abstract

Deep learning is slowly, but steadily, hitting a memory bottleneck. While tensor computation in top-of-the-line GPUs increased by 32× over the last five years, the total available memory only grew by 2.5×. This prevents researchers from exploring larger architectures, as training large networks requires more memory for storing intermediate outputs. In this paper, we present MONET, an automatic framework that minimizes both the memory footprint and computational overhead of deep networks. MONET jointly optimizes the checkpointing schedule and the implementation of various operators. MONET is able to outperform all prior hand-tuned operations as well as automated checkpointing. MONET reduces the overall memory requirement by 3× for various PyTorch models, with a 9-16% overhead in computation. For the same computation cost, MONET requires 1.2-1.8× less memory than current state-of-the-art automated checkpointing frameworks. Our code is available at https://github.com/utsaslab/MONeT.

1. INTRODUCTION

Deep networks are widely used in domains ranging from image classification (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015; He et al., 2016) to video recognition (Wu et al., 2019; Feichtenhofer et al., 2019) and natural language processing (Devlin et al., 2019; Yang et al., 2019). However, training deep networks is resource-intensive. In particular, the amount of available GPU memory bottlenecks the training of many deep networks (Dong et al., 2016; Kim et al., 2016; Chen et al., 2018; Child et al., 2019). This bottleneck forces researchers either to modify the network architecture or to scale training to multiple nodes, incurring significant overheads.

We present MONET, an automatic framework to minimize the memory footprint of deep networks. MONET jointly optimizes global compute-graph-level techniques (such as checkpointing) and local techniques (such as memory-efficient implementations of individual operators). At the heart of MONET is a theoretical analysis that enables this joint optimization and provides tight bounds on memory consumption. We analyze the memory consumption and computational cost of a general forward and backward pass under changing local operator implementations and a global checkpointing schedule. Specifically, we tightly bound the peak memory consumption of the forward, backward, and recomputation stages. MONET uses these bounds as constraints to find the most efficient forward and backward implementations, both locally and globally, under a fixed memory budget. We linearize all memory bounds and express both implementation selection and checkpointing as a 0-1 integer program, which we solve using standard solvers.

We conduct extensive experiments, demonstrating that MONET significantly outperforms existing automatic frameworks that use only local or only global techniques.
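To make the 0-1 integer program concrete, the following toy sketch jointly chooses, for each layer, a binary checkpointing decision and a binary implementation choice so as to minimize compute under a peak-memory budget. All layer costs, the recompute penalty, and the brute-force "solver" are illustrative assumptions, not MONET's actual formulation (which linearizes the memory bounds and uses a standard 0-1 integer program solver):

```python
from itertools import product

# Hypothetical per-layer (compute, memory) costs for two implementations:
# index 0 = fast/high-memory, index 1 = slow/low-memory.
IMPLS = [
    [(4.0, 10.0), (6.0, 4.0)],   # layer 0
    [(5.0, 12.0), (8.0, 5.0)],   # layer 1
    [(3.0, 8.0),  (5.0, 3.0)],   # layer 2
]
RECOMPUTE_PENALTY = 1.0  # extra compute factor if a layer is not checkpointed

def solve(budget):
    """Enumerate all 0-1 assignments; return the cheapest feasible plan."""
    best = None
    n = len(IMPLS)
    for s in product([0, 1], repeat=n):        # s[i]: checkpoint layer i?
        for k in product([0, 1], repeat=n):    # k[i]: implementation of layer i
            compute = sum(
                IMPLS[i][k[i]][0] * (1 if s[i] else 1 + RECOMPUTE_PENALTY)
                for i in range(n)
            )
            # Peak memory: stored checkpoints plus the largest live working set.
            stored = sum(IMPLS[i][k[i]][1] for i in range(n) if s[i])
            working = max(IMPLS[i][k[i]][1] for i in range(n))
            peak = stored + working
            if peak <= budget and (best is None or compute < best[0]):
                best = (compute, s, k, peak)
    return best

print(solve(budget=20.0))
```

A real instance replaces the brute-force enumeration with an integer-program solver, since the number of binary variables grows with network depth; the point here is only that checkpointing and implementation selection interact and must be chosen jointly.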
On multiple architectures (ResNet (He et al., 2016), VGG (Simonyan & Zisserman, 2015), UNet (Ronneberger et al., 2015), GoogLeNet (Szegedy et al., 2015), and MobileNet-V2 (Sandler et al., 2018)), memory budgets (5-10 GB), and network configurations (multiple resolutions), MONET consistently achieves a lower memory footprint at equivalent or lower computational overhead. MONET reduces the overall memory requirement by 3× for various models, with a 9-16% overhead in computation. For the same computation cost, MONET requires 1.2-1.8× less memory than the current state-of-the-art automated checkpointing framework. These results demonstrate the power of jointly optimizing global checkpointing schedules and local operator implementations.

In-place activated batchnorm (Bulò et al., 2018) and in-place ReLU layers use their output activations to compute gradients, reusing a single memory buffer for the gradient computation in consecutive layers. Mixed precision training (Micikevicius et al., 2018) uses half precision (FP16) instead of single precision (FP32) for all tensors and arithmetic during training, reducing memory by nearly half. While training at precision lower than FP16 degrades training quality (Banner et al., 2018), prior work such as backpropagation with approximate activations (Chakrabarti & Moseley, 2019) carefully quantizes certain intermediate outputs (activations) to 4 bits, yielding significant memory savings. Although these hand-crafted techniques independently save memory, there is no one-size-fits-all recipe, and different implementations perform best on different architectures. In contrast, MONET automatically finds the best implementation for each forward and backward operator given a memory budget.

Chen et al. (2016) proposed dividing a network into different segments, dropping all intermediate outputs within each segment, and recomputing them later. Chen et al. use √n equal segments, trading memory savings for the cost of an extra forward pass. Checkmate (Jain et al., 2019) solves the problem in a more general setting, using a mixed-integer linear program solver to decide which layers to recompute for a given network.

Like Checkmate, our work optimizes a checkpointing schedule, but on a different computation graph. Our computation graph allows for the optimization of an entire execution plan, jointly finding a checkpointing schedule and the best implementation of each forward and backward operator. In Checkmate, a change in operator implementation induces a different computation graph, and thus cannot directly be optimized. Appendix F highlights some of the difficulties of adding operator optimizations to Checkmate. In summary, while much work has been done on local optimizations (operator implementations) and global compute-graph-level techniques (automated checkpointing), MONET is the first system to jointly optimize a given architecture using both local and global techniques.

Figure 1: Memory Optimized Network Training (MONeT), an automatic framework that minimizes the memory footprint of deep networks by jointly optimizing global and local techniques.


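The √n-segment checkpointing of Chen et al. (2016) discussed above can be sketched with a toy cost model for a linear chain of n layers: store activations only at segment boundaries and re-run the forward pass within a segment when backward needs its activations. The chain, cost units, and function name below are illustrative assumptions, not code from any of the cited systems:

```python
import math

def train_step_costs(n, segment):
    """Return (#forward layer evaluations, peak #stored activations)
    for one training step with segment-boundary checkpointing."""
    checkpoints = set(range(0, n, segment))  # stored segment boundaries
    forward_evals = n                        # initial forward pass
    # Backward: for each segment (last to first), recompute its interior.
    for start in sorted(checkpoints, reverse=True):
        seg_len = min(segment, n - start)
        forward_evals += seg_len             # recomputation cost
    # Peak memory: all checkpoints plus one fully materialized segment.
    peak = len(checkpoints) + segment
    return forward_evals, peak

n = 100
seg = int(math.sqrt(n))  # sqrt(n) segments of length sqrt(n)
print(train_step_costs(n, seg))  # prints: (200, 20)
```

With √n segments, peak activation storage drops from n to roughly 2√n (here 20 instead of 100), at the cost of roughly one extra forward pass (200 layer evaluations instead of 100), matching the trade-off described above.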