DROPIT: DROPPING INTERMEDIATE TENSORS FOR MEMORY-EFFICIENT DNN TRAINING

Abstract

A standard hardware bottleneck when training deep neural networks is GPU memory. The bulk of memory is occupied by caching intermediate tensors for gradient computation in the backward pass. We propose a novel method to reduce this footprint: Dropping Intermediate Tensors (DropIT). DropIT drops the min-k elements of the intermediate tensors and approximates gradients from the sparsified tensors in the backward pass. Theoretically, DropIT reduces noise on estimated gradients and therefore has a higher rate of convergence than vanilla SGD. Experiments show that we can drop up to 90% of the intermediate tensor elements in fully-connected and convolutional layers while achieving higher testing accuracy for Vision Transformers and Convolutional Neural Networks on various tasks (e.g., classification, object detection, instance segmentation). Our code and models are available at https://github.com/chenjoya/dropit.

1. INTRODUCTION

The training of state-of-the-art deep neural networks (DNNs) (Krizhevsky et al., 2017; Simonyan & Zisserman, 2015; He et al., 2016; Vaswani et al., 2017; Dosovitskiy et al., 2021) for computer vision often requires a large amount of GPU memory. For example, training a simple visual transformer detection model, ViTDet-B (Li et al., 2022), with its required input image size of 1024×1024 and a batch size of 64, requires ∼700 GB of GPU memory. Such a high memory requirement puts the training of DNNs out of reach for the average academic or practitioner without access to high-end GPU resources.

When training DNNs, GPU memory has six primary uses (Rajbhandari et al., 2020): network parameters, parameter gradients, optimizer states (Kingma & Ba, 2015), intermediate tensors (also called activations), temporary buffers, and memory fragmentation. Vision tasks often require training with large batches of high-resolution images or videos, which can lead to a significant memory cost for intermediate tensors. In the case of ViTDet-B, approximately 70% of the GPU memory (∼470 GB) is devoted to the intermediate tensor cache. Similarly, in NLP, approximately 50% of GPU memory is consumed by caching intermediate tensors when training the language model GPT-2 (Radford et al., 2019; Rajbhandari et al., 2020). As such, previous studies (Gruslys et al., 2016; Chen et al., 2016; Rajbhandari et al., 2020; Feng & Huang, 2021) treat the intermediate tensor cache as the largest consumer of GPU memory.

For differentiable layers, standard implementations store the intermediate tensors for computing the gradients during back-propagation. One option to reduce storage is to cache tensors from only some layers. Uncached tensors are then recomputed on the fly during the backward pass; this is the strategy of gradient checkpointing (Gruslys et al., 2016; Chen et al., 2016; Bulo et al., 2018; Feng & Huang, 2021).
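The cache-some-layers-and-recompute idea behind gradient checkpointing can be sketched as follows. This is a hypothetical, pure-Python illustration over a chain of layer functions, not the autograd-graph machinery of a real implementation such as PyTorch's `torch.utils.checkpoint`; the function names and the `every` parameter are our own:

```python
def forward_with_checkpoints(layers, x, every=2):
    """Run a chain of layer functions, caching the input of only every
    `every`-th layer (a "checkpoint") instead of every activation."""
    checkpoints = {0: x}
    for i, f in enumerate(layers):
        if i % every == 0:
            checkpoints[i] = x  # cache the input of this layer
        x = f(x)
    return x, checkpoints

def recompute_activation(layers, checkpoints, i, every=2):
    """Recover the input of layer i during the backward pass by replaying
    the forward pass from the nearest earlier checkpoint
    (trading extra computation for memory)."""
    start = (i // every) * every
    x = checkpoints[start]
    for j in range(start, i):
        x = layers[j](x)
    return x
```

With `every=2`, half of the activations are never stored; in the worst case (a checkpoint only at the input), the backward pass replays essentially the whole forward pass, which matches the recomputation cost noted above.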
Another option is to quantize the tensors after the forward computation and use the quantized values for gradient computation during the backward pass (Jain et al., 2018; Chakrabarti & Moseley, 2019; Fu et al., 2020; Evans & Aamodt, 2021; Liu et al., 2022); this is known as activation compression training (ACT). Quantization can reduce memory considerably, but also brings inevitable performance drops. Accuracy drops can be mitigated by bounding the error at each layer through adaptive quantization (Evans & Aamodt, 2021; Liu et al., 2022), i.e., adaptive ACT. However, training time consequently suffers, as extensive tensor profiling is necessary during training.

In this paper, we propose to reduce the memory usage of intermediate tensors by simply dropping elements from the tensor. We call our method Dropping Intermediate Tensors (DropIT). In the most basic setting, the dropped indices can be selected randomly, though dropping based on a min-k ranking of element magnitudes is more effective. Both strategies are much simpler than the sensitivity checking and other profiling strategies of adaptive ACT, making DropIT much faster. During training, the intermediate tensor is converted to a sparse format after the forward computation is complete. The sparse tensor is then recovered to a dense tensor during backward gradient computation, with the dropped indices filled with zeros. Curiously, with the right dropping strategy and ratio, DropIT has improved convergence properties compared to SGD. We attribute this to the fact that DropIT can, theoretically, reduce noise on the gradients; in general, reducing noise results in more precise and stable gradients. Experimentally, this strategy exhibits consistent performance improvements across various network architectures and tasks. To the best of our knowledge, we are the first to propose activation sparsification.
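The min-k dropping and zero-filled recovery described above can be sketched as follows. This is a simplified, hypothetical illustration on Python lists (the actual method operates on GPU tensors with efficient sparse storage); the function names and the drop ratio `gamma` are our own:

```python
def dropit_minmag(x, gamma):
    """Drop the gamma fraction of elements with the smallest magnitude
    (min-k dropping) and return a sparse representation:
    (kept indices, kept values, original length)."""
    n = len(x)
    k = int(gamma * n)  # number of elements to drop
    # rank indices by |x|; keep the (n - k) largest-magnitude elements
    order = sorted(range(n), key=lambda i: abs(x[i]))
    kept = sorted(order[k:])
    return kept, [x[i] for i in kept], n

def recover(kept, values, n):
    """Densify the sparse tensor for the backward pass,
    filling dropped positions with zero."""
    out = [0.0] * n
    for i, v in zip(kept, values):
        out[i] = v
    return out
```

For example, with `gamma = 0.6`, `dropit_minmag([0.1, -2.0, 0.05, 3.0, -0.2], 0.6)` keeps only the two largest-magnitude elements (-2.0 and 3.0), and `recover` fills the three dropped positions with zeros before the gradient is computed.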
The closest related line of existing work is ACT, but unlike ACT, DropIT leaves the key elements untouched, which is crucial for preserving accuracy. Nevertheless, DropIT is orthogonal to activation quantization, and the two can be combined for additional memory reduction with higher final accuracy. The key contributions of our work are summarized as follows:

• We propose DropIT, a novel strategy to reduce activation memory by dropping elements of the intermediate tensors.

• We theoretically and experimentally show that DropIT can be seen as a noise reduction on stochastic gradients, which leads to better convergence.

• DropIT works in various settings: training from scratch, fine-tuning on classification, object detection, etc. Our experiments demonstrate that DropIT can drop up to 90% of the intermediate tensor elements in fully-connected and convolutional layers with a testing accuracy higher than the baseline for CNNs and ViTs. We also show that DropIT is much better than SOTA activation quantization methods in both accuracy and speed, and that it can be combined with them to pursue higher memory efficiency.

2. RELATED WORK

Memory-efficient training. Current DNNs usually incur considerable memory costs due to huge model parameters (e.g., GPTs (Radford et al., 2019; Brown et al., 2020)) or intermediate tensors (e.g., high-resolution feature maps (Sun et al., 2019; Gu et al., 2022)). The model parameters and corresponding optimizer states can be reduced with lightweight operations (Howard et al., 2017; Xie et al., 2017; Zhang et al., 2022), distributed optimization scheduling (Rajbhandari et al., 2020), and mixed-precision training (Micikevicius et al., 2018). Nevertheless, intermediate tensors, which are essential for gradient computation during the backward pass, consume the majority of GPU memory (Gruslys et al., 2016; Chen et al., 2016; Rajbhandari et al., 2020; Feng & Huang, 2021), and reducing their size can be challenging.

Gradient checkpointing. To reduce the tensor cache, gradient checkpointing (Chen et al., 2016; Gruslys et al., 2016; Feng & Huang, 2021) stores tensors from only a few layers and recomputes any uncached tensors when performing the backward pass; in the worst case, this is equivalent to duplicating the forward pass, so the memory savings come at an extra computational expense. InPlace-ABN (Bulo et al., 2018) halves the tensor cache by merging batch normalization and activation into a single in-place operation: the tensor cache is compressed in the forward pass and recovered in the backward pass. Our method is distinct in that it does not require additional recomputation; instead, the cached tensors are sparsified heuristically.

Activation compression. Several works (Jain et al., 2018; Chakrabarti & Moseley, 2019; Fu et al., 2020; Evans & Aamodt, 2021; Chen et al., 2021; Liu et al., 2022) explored lossy compression of the activation cache via low-precision quantization. Wang et al. (2022) compressed high-frequency components, while Evans et al. (2020) adopted JPEG-style compression.
In contrast to all of these methods, DropIT reduces activation storage via sparsification, an approach previously unexplored. In addition, DropIT is more lightweight than adaptive low-precision quantization methods (Evans & Aamodt, 2021; Liu et al., 2022).
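To make the contrast with ACT concrete, the low-precision quantization it relies on can be sketched as a uniform quantize/dequantize of an activation. This is a generic, hypothetical illustration on Python lists, not the scheme of any particular paper cited above; the function names are our own:

```python
def quantize_uniform(x, bits=8):
    """Uniformly quantize a list of floats to `bits`-bit unsigned integer
    codes; only the codes plus (lo, scale) need to be cached, which is the
    lossy compression behind activation compression training (ACT)."""
    lo, hi = min(x), max(x)
    levels = (1 << bits) - 1  # e.g. 255 code levels for 8 bits
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in x]
    return codes, lo, scale

def dequantize_uniform(codes, lo, scale):
    """Reconstruct approximate activations for the backward pass; every
    element carries a small rounding error, unlike DropIT, which keeps
    the retained elements exact and zeros out the rest."""
    return [lo + c * scale for c in codes]
```

Every reconstructed element is perturbed by up to half a quantization step, whereas DropIT's sparsification leaves the kept (large-magnitude) elements bit-exact; this is the distinction the paragraph above draws.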

