MEMORY OPTIMIZATION FOR DEEP NETWORKS

Abstract

Deep learning is slowly, but steadily, hitting a memory bottleneck. While the tensor-computation throughput of top-of-the-line GPUs increased by 32× over the last five years, the total available memory only grew by 2.5×. This prevents researchers from exploring larger architectures, as training large networks requires more memory for storing intermediate outputs. In this paper, we present MONET, an automatic framework that minimizes both the memory footprint and computational overhead of deep networks. MONET jointly optimizes the checkpointing schedule and the implementation of various operators. MONET is able to outperform all prior hand-tuned operations as well as automated checkpointing. MONET reduces the overall memory requirement by 3× for various PyTorch models, with a 9-16% overhead in computation. For the same computation cost, MONET requires 1.2-1.8× less memory than current state-of-the-art automated checkpointing frameworks. Our code is available at https://github.com/utsaslab/MONeT.

1. INTRODUCTION

Deep networks are widely used in domains ranging from image classification (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015; He et al., 2016) to video recognition (Wu et al., 2019; Feichtenhofer et al., 2019) and natural language processing (Devlin et al., 2019; Yang et al., 2019). However, training deep networks is resource-intensive. In particular, the amount of available GPU memory often bottlenecks the training of deep networks (Dong et al., 2016; Kim et al., 2016; Chen et al., 2018; Child et al., 2019). Working around this bottleneck requires either modifying the network architecture or scaling training to multiple nodes, both of which incur significant overheads.

We present MONET, an automatic framework that minimizes the memory footprint of deep networks. MONET jointly optimizes global, compute-graph-level techniques (such as checkpointing) and local techniques (such as memory-efficient implementations of individual operators). At the heart of MONET is a theoretical analysis that enables this joint optimization and provides tight bounds on memory consumption. We analyze the memory consumption and computational cost of a general forward and backward pass under changing local operator implementations and a global checkpointing schedule. Specifically, we tightly bound the peak memory consumption of the forward, backward, and recomputation stages. MONET uses these bounds as constraints to find the most efficient forward and backward implementation, both locally and globally, under a fixed memory budget. We linearize all memory bounds and express both implementation selection and checkpointing as a 0-1 integer program, which we solve using standard solvers. We conduct extensive experiments, demonstrating that MONET significantly outperforms existing automatic frameworks that use only local or only global techniques.
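To make the checkpointing side of this trade-off concrete, the following is a deliberately simplified sketch, not MONET's actual formulation (which also selects among operator implementations and solves a full 0-1 integer program). For a toy chain of layers with assumed activation sizes and forward costs, it brute-forces the 0-1 choice of which activations to keep: peak memory is modeled as the stored checkpoints plus the largest segment that must be rematerialized during the backward pass, and the objective is to minimize recomputation cost under a memory budget. All names (`act`, `cost`, `best_schedule`) are illustrative.

```python
from itertools import product

# Toy chain of 5 layers: per-layer activation sizes (MB) and
# forward compute costs (ms). These numbers are made up.
act = [4, 8, 8, 4, 2]
cost = [1, 3, 3, 2, 1]

def peak_memory(keep):
    """Peak memory for a 0-1 checkpoint vector `keep` (1 = store activation)."""
    # Checkpointed activations stay resident for the whole backward pass.
    stored = sum(a for a, k in zip(act, keep) if k)
    # Each run of non-checkpointed layers is recomputed as one segment,
    # materializing its activations on top of the stored checkpoints.
    seg, worst = 0, 0
    for a, k in zip(act, keep):
        if k:
            seg = 0
        else:
            seg += a
            worst = max(worst, seg)
    return stored + worst

def recompute_cost(keep):
    """Extra forward compute: every non-checkpointed layer is run twice."""
    return sum(c for c, k in zip(cost, keep) if not k)

def best_schedule(budget):
    """Brute-force the 0-1 schedule minimizing recompute under `budget` MB."""
    best = None
    for keep in product([0, 1], repeat=len(act)):
        if peak_memory(keep) <= budget:
            cand = (recompute_cost(keep), keep)
            if best is None or cand < best:
                best = cand
    return best  # (recompute cost, keep vector), or None if infeasible
```

With a budget covering all activations (26 MB here), the optimum keeps everything and recomputes nothing; tightening the budget forces some layers to be dropped and replayed. MONET replaces this exponential enumeration with linearized memory constraints in an integer program, and additionally lets the solver pick per-operator implementations.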
On multiple architectures (ResNet (He et al., 2016), VGG (Simonyan & Zisserman, 2015), UNet (Ronneberger et al., 2015), GoogLeNet (Szegedy et al., 2015), MobileNet-V2 (Sandler et al., 2018)), memory budgets (5-10 GB), and network configurations (multiple input resolutions), MONET consistently achieves lower memory footprints at equivalent or lower computational overhead. MONET reduces the overall memory requirement by 3× for various models, with a 9-16% overhead in computation. For the same computation cost, MONET requires 1.2-1.8× less memory than the current state-of-the-art automated checkpointing framework. These results demonstrate the power of jointly optimizing global checkpointing schedules and local operator implementations.

