DROPIT: DROPPING INTERMEDIATE TENSORS FOR MEMORY-EFFICIENT DNN TRAINING

Abstract

A standard hardware bottleneck when training deep neural networks is GPU memory. The bulk of memory is occupied by caching intermediate tensors for gradient computation in the backward pass. We propose a novel method to reduce this footprint: Dropping Intermediate Tensors (DropIT). DropIT drops the min-k elements of the intermediate tensors and approximates gradients from the sparsified tensors in the backward pass. Theoretically, DropIT reduces noise on estimated gradients and therefore has a higher rate of convergence than vanilla SGD. Experiments show that we can drop up to 90% of the intermediate tensor elements in fully-connected and convolutional layers while achieving higher testing accuracy for Visual Transformers and Convolutional Neural Networks on various tasks (e.g., classification, object detection, instance segmentation). Our code and models are available at https://github.com/chenjoya/dropit.

1. INTRODUCTION

The training of state-of-the-art deep neural networks (DNNs) (Krizhevsky et al., 2017; Simonyan & Zisserman, 2015; He et al., 2016; Vaswani et al., 2017; Dosovitskiy et al., 2021) for computer vision often requires a large GPU memory. For example, training a simple visual transformer detection model, ViTDet-B (Li et al., 2022), with its required input image size of 1024×1024 and a batch size of 64, requires ∼700 GB of GPU memory. Such a high memory requirement puts the training of DNNs out of reach for the average academic or practitioner without access to high-end GPU resources. When training DNNs, GPU memory has six primary uses (Rajbhandari et al., 2020): network parameters, parameter gradients, optimizer states (Kingma & Ba, 2015), intermediate tensors (also called activations), temporary buffers, and memory fragmentation. Vision tasks often require training with large batches of high-resolution images or videos, which can lead to a significant memory cost for intermediate tensors. In the instance of ViTDet-B, approximately 70% of the GPU memory cost (∼470 GB) is assigned to the intermediate tensor cache. Similarly, for NLP, approximately 50% of GPU memory is consumed by caching intermediate tensors when training the language model GPT-2 (Radford et al., 2019; Rajbhandari et al., 2020). As such, previous studies (Gruslys et al., 2016; Chen et al., 2016; Rajbhandari et al., 2020; Feng & Huang, 2021) treat the intermediate tensor cache as the largest consumer of GPU memory. For differentiable layers, standard implementations store the intermediate tensors for computing the gradients during back-propagation. One option to reduce storage is to cache tensors from only some layers; uncached tensors are then recomputed on the fly during the backward pass. This is the strategy of gradient checkpointing (Gruslys et al., 2016; Chen et al., 2016; Bulo et al., 2018; Feng & Huang, 2021).
Another option is to quantize the tensors after the forward computation and use the quantized values for gradient computation during the backward pass (Jain et al., 2018; Chakrabarti & Moseley, 2019; Fu et al., 2020; Evans & Aamodt, 2021; Liu et al., 2022); this is known as activation compression training (ACT). Quantization can reduce memory considerably, but it also brings inevitable performance drops. Accuracy drops can be mitigated by bounding the error at each layer through adaptive quantization (Evans & Aamodt, 2021; Liu et al., 2022), i.e. adaptive ACT. However, training time consequently suffers, as extensive tensor profiling is necessary during training. In this paper, we propose to reduce the memory usage of intermediate tensors by simply dropping elements from the tensor. We call our method Dropping Intermediate Tensors (DropIT). In the most basic setting, the dropped indices can be selected randomly, though dropping based on a min-k ranking of element magnitudes is more effective. Both strategies are much simpler than the sensitivity checking and other profiling strategies of adaptive ACT, making DropIT much faster. During training, the intermediate tensor is converted to a sparse format after the forward computation is complete. The sparse tensor is then recovered to a dense tensor during backward gradient computation, with the dropped indices filled with zeros. Curiously, with the right dropping strategy and ratio, DropIT has improved convergence properties compared to SGD. We attribute this to the fact that DropIT can, theoretically, reduce noise on the gradients; in general, reducing noise yields more precise and stable gradients. Experimentally, this strategy exhibits consistent performance improvements on various network architectures and different tasks. To the best of our knowledge, we are the first to propose activation sparsification.
The closest related line of existing work is ACT, but unlike ACT, DropIT leaves key elements untouched, which is crucial for ensuring accuracy. Nevertheless, DropIT is orthogonal to activation quantization, and the two can be combined for additional memory reduction with higher final accuracy. The key contributions of our work are summarized as follows:

• We propose DropIT, a novel strategy to reduce activation memory by dropping elements of the intermediate tensor.

• We theoretically and experimentally show that DropIT can be seen as a noise reduction on stochastic gradients, which leads to better convergence.

• DropIT works in various settings: training from scratch, fine-tuning on classification, object detection, etc. Our experiments demonstrate that DropIT can drop up to 90% of the intermediate tensor elements in fully-connected and convolutional layers with a testing accuracy higher than the baseline for CNNs and ViTs. We also show that DropIT outperforms state-of-the-art activation quantization methods in accuracy and speed, and that it can be combined with them to pursue higher memory efficiency.

2. RELATED WORK

Memory-efficient training. Current DNNs usually incur considerable memory costs due to huge model parameters (e.g. GPTs (Radford et al., 2019; Brown et al., 2020)) or intermediate tensors (e.g., high-resolution feature maps (Sun et al., 2019; Gu et al., 2022)). The model parameters and corresponding optimizer states can be reduced with lightweight operations (Howard et al., 2017; Xie et al., 2017; Zhang et al., 2022), distributed optimization scheduling (Rajbhandari et al., 2020), and mixed-precision training (Micikevicius et al., 2018). Nevertheless, intermediate tensors, which are essential for gradient computation during the backward pass, consume the majority of GPU memory (Gruslys et al., 2016; Chen et al., 2016; Rajbhandari et al., 2020; Feng & Huang, 2021), and reducing their size can be challenging.

Gradient checkpointing. To reduce the tensor cache, gradient checkpointing (Chen et al., 2016; Gruslys et al., 2016; Feng & Huang, 2021) stores tensors from only a few layers and recomputes any uncached tensors when performing the backward pass; in the worst-case scenario, this is equivalent to duplicating the forward pass, so any memory savings come at extra computational expense. InPlace-ABN (Bulo et al., 2018) halves the tensor cache by merging batch normalization and activation into a single in-place operation: the tensor cache is compressed in the forward pass and recovered in the backward pass. Our method is distinct in that it does not require additional recomputation; instead, the cached tensors are sparsified heuristically.

Activation compression. Prior work (Jain et al., 2018; Chakrabarti & Moseley, 2019; Fu et al., 2020; Evans & Aamodt, 2021; Chen et al., 2021; Liu et al., 2022) explored lossy compression of the activation cache via low-precision quantization. Wang et al. (2022) compressed high-frequency components, while Evans et al. (2020) adopted JPEG-style compression.
In contrast to all of these methods, DropIT reduces activation storage via sparsification, which was previously unexplored. In addition, DropIT is more lightweight than adaptive low-precision quantization methods (Evans & Aamodt, 2021; Liu et al., 2022).

Gradient approximation. Approximating gradients has been explored in large-scale distributed training to limit the communication bandwidth for gradient exchange. Several approaches (Strom, 2015; Dryden et al., 2016; Aji & Heafield, 2017; Lin et al., 2018) drop gradients below fixed thresholds and send only the most significant entries of the stochastic gradients, with guaranteed convergence (Stich et al., 2018; Cheng et al., 2022; Chen et al., 2020). Instead of dropping gradient components, DropIT directly drops elements within intermediate tensors, as our objective is to reduce training memory.

3.1. PRELIMINARIES

We denote the forward function and learnable parameters of the i-th layer as l and θ, respectively. In the forward pass, l operates on the layer's input a to compute the output z:¹

z = l(a, θ).   (1)

For example, if layer i is a convolution layer, l would denote a convolution operation, with θ representing the kernel weights and bias parameters. Given a loss function F(Θ), where Θ represents the parameters of the entire network, the gradient with respect to θ at layer i can be estimated via the chain rule as

∇θ ≜ ∂F(Θ)/∂θ = ∇z · ∂z/∂θ = ∇z · ∂l(a, θ)/∂θ,   (2)

where ∇z ≜ ∂F(Θ)/∂z is the gradient passed back from layer i+1. Note that the computation of ∂l(a, θ)/∂θ requires a if the forward function l involves tensor multiplication between a and θ. This is the case for common learnable layers, such as convolutions in CNNs and fully-connected layers in transformers. As such, a is necessary for estimating the gradient and is cached after it is computed during the forward pass, as illustrated in Figure 1(a). A common way to reduce storage for a is to store a quantized version (Jain et al., 2018; Chakrabarti & Moseley, 2019; Fu et al., 2020; Evans & Aamodt, 2021; Liu et al., 2022); subsequent gradients in the backward pass are then computed using the quantized a. The gradient ∇a can be estimated similarly via the chain rule as

∇a ≜ ∂F(Θ)/∂a = ∇z · ∂z/∂a = ∇z · ∂l(a, θ)/∂a.   (3)

Analogous to Eq. 2, the partial derivative ∂l(a, θ)/∂a may depend on the parameter θ, which is similarly stored in model memory. However, the stored θ always shares memory with the model residing on the GPU, so it incurs no additional memory consumption; furthermore, θ typically occupies much less memory. As Table 1 shows, the intermediate tensor's space complexity becomes significant when B or L_a is large, which is common in CV and NLP tasks.
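To make the role of the cached input a concrete, here is a minimal NumPy sketch (ours, not the paper's implementation) computing the weight gradient of a toy fully-connected layer. The shapes and the sum-loss are illustrative assumptions; the point is that ∇θ = aᵀ∇z cannot be formed without keeping a around until the backward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy fully-connected layer: z = a @ theta, with B=4, C_a=3, C_z=2.
a = rng.standard_normal((4, 3))      # input, cached during the forward pass
theta = rng.standard_normal((3, 2))  # learnable weights
z = a @ theta                        # forward output

# Suppose the loss is F = sum(z); the incoming gradient grad_z is all-ones.
grad_z = np.ones_like(z)

# Chain rule (Eq. 2): grad_theta = a^T @ grad_z. This is why `a` must stay
# in memory until back-propagation reaches this layer.
grad_theta = a.T @ grad_z

# Sanity check against the closed form d(sum(a @ theta))/d(theta[i, j]) = sum_n a[n, i].
assert np.allclose(grad_theta[:, 0], a.sum(axis=0))
```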

3.2. DROPPING INTERMEDIATE TENSORS

Let X denote the set of all indices of an intermediate tensor a. Suppose that X is partitioned into two disjoint sets X_d and X_r, i.e. X_r ∩ X_d = ∅ and X_r ∪ X_d = X. In DropIT, we introduce a dropping operation D(·) that sparsifies a into â, where â consists of the elements a_{X_r} and the indices X_r, i.e.

â = D(a) = {a_{X_r}, X_r}.

The sparse â can be used as a substitute for a in Eq. 2. While sparsification can theoretically reduce both storage and computation time, in practice we benefit only from the storage savings: we retain general matrix multiplication because the sparsity rate is insufficient for sparse matrix multiplication to provide meaningful computational gains. As such, the full intermediate tensors are recovered for gradient computation, i.e.

∇θ ≈ ∇z · ∂l(R(â), θ)/∂θ,

where R(·) inflates â back to a dense tensor with the dropped indices filled with zeros. The overall procedure is illustrated in Figure 1(b). Consider a convolutional layer with C_z kernels of size K × K. For the j-th kernel, where j ∈ [1, C_z], the gradient at location (u, v) for the k-th channel is given by convolving the incoming gradient ∇z with the input a:

∇θ_{j,k}(u, v) = Σ_{(n,x,y)∈X} ∇z_j^n(x, y) · a_k^n(x′, y′),   (4)

where x′ = x + u and y′ = y + v. The set X in this case denotes all sample indices n ∈ [1, B] and all location indices (x, y) ∈ [1, W] × [1, H] in the feature map. Without loss of generality, we can partition X into the two disjoint sets X_r and X_d to split Eq. 4 as

∇θ_{j,k}(u, v) = Σ_{(n,x,y)∈X_d} ∇z_j^n(x, y) · a_k^n(x′, y′) + Σ_{(n,x,y)∈X_r} ∇z_j^n(x, y) · a_k^n(x′, y′),   (5)

where we denote the second term by ⊤θ_{j,k}(u, v). Assume now that some element a_k^n(x′, y′) is small or near-zero; in CNNs and Transformers, such an assumption is reasonable due to preceding batch/layer normalization and ReLU or GeLU activations (see Figure 3). Accordingly, this element's contribution to the gradient will also be correspondingly small.
If we assign the spatial indices (x, y) in sample n of all small or near-zero elements to X_d, then we can approximate the gradient ∇θ_{j,k}(u, v) by simply the second term of Eq. 5. We denote the approximated gradient as g_dropit = ⊤θ_{j,k}(u, v).

Figure 2: Forward and backward passes of DropIT on a fully-connected layer (without bias). In the forward pass, we sparsify the cached tensor with the dropping function D(·) (e.g., Random, γ = 0.9), dropping a γ fraction of the storage. In the backward pass, only the saved elements participate in the gradient computation.

For a fully-connected layer, the approximated gradient can be defined similarly as

g_dropit = ⊤θ_{j,k} = Σ_{n∈X_r} ∇z_j^n · a_k^n.

A visualization of the gradient approximation is shown in Figure 2. With the approximated gradient, we can use any standard deep learning optimization scheme to update the parameters.
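The sparsify-and-recover procedure above can be sketched in a few lines of NumPy. This is an illustration of the idea, not the paper's PyTorch implementation (which uses torch.topk and coordinate format); the function names drop_min_k and recover, and all shapes, are ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def drop_min_k(a, gamma):
    """D(.): keep the (1 - gamma) fraction of entries with largest |value|;
    return the kept values and their flat indices (the sparse cache)."""
    flat = a.ravel()
    keep = max(1, int(round((1.0 - gamma) * flat.size)))
    idx = np.argpartition(np.abs(flat), -keep)[-keep:]
    return flat[idx], idx, a.shape

def recover(values, idx, shape):
    """R(.): inflate the sparse cache back to a dense tensor, zeros elsewhere."""
    flat = np.zeros(int(np.prod(shape)), dtype=values.dtype)
    flat[idx] = values
    return flat.reshape(shape)

a = rng.standard_normal((8, 16))        # intermediate tensor (input cache)
grad_z = rng.standard_normal((8, 4))    # gradient arriving from the next layer

values, idx, shape = drop_min_k(a, gamma=0.9)   # ~10% of entries survive
a_hat = recover(values, idx, shape)

grad_theta_approx = a_hat.T @ grad_z    # g_dropit: uses only the kept entries
grad_theta_exact = a.T @ grad_z         # what full caching would have given
```

Because the kept entries are exactly the largest-magnitude ones, the dropped terms of the sum in Eq. 5 are the ones with the smallest |a| factors.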

3.3. DROPPING FUNCTION D(•)

We define the overall dropping rate as γ = |X_d| / (B·C_a) for a fully-connected layer and γ = |X_d| / (B·C_a·H·W) for a convolutional layer. γ can be varied and parameterizes the dropping function D(·). As we aim to drop elements with minimal contribution to the gradient, it is logical to perform a min-k selection on the elements' magnitudes before dropping; as a baseline comparison, we also select X_d by uniform random sampling. We investigate the following options for D(·):

Random Elements: a γ fraction of elements is dropped uniformly at random within a mini-batch.

Min-K Elements: within a mini-batch, we drop the γ fraction of elements with the smallest absolute magnitudes.
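The two dropping functions can be sketched as mask generators over a mini-batch tensor (a NumPy illustration under our own naming, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(2)

def random_drop_mask(a, gamma):
    """Drop a uniformly random gamma-fraction of elements (baseline)."""
    n_drop = int(round(gamma * a.size))
    mask = np.ones(a.size, dtype=bool)
    mask[rng.choice(a.size, size=n_drop, replace=False)] = False
    return mask.reshape(a.shape)          # True marks retained indices X_r

def min_k_drop_mask(a, gamma):
    """Drop the gamma-fraction of elements with smallest |value|."""
    n_drop = int(round(gamma * a.size))
    order = np.argsort(np.abs(a), axis=None)   # ascending by magnitude
    mask = np.ones(a.size, dtype=bool)
    mask[order[:n_drop]] = False
    return mask.reshape(a.shape)

a = rng.standard_normal((64, 32))
for make_mask in (random_drop_mask, min_k_drop_mask):
    m = make_mask(a, gamma=0.9)
    # Both strategies give the same bookkeeping: |X_r| = (1 - gamma) |X|.
    assert m.sum() == a.size - int(round(0.9 * a.size))
```

The difference is only in which indices land in X_d: min-k guarantees that every dropped element has magnitude no larger than every retained one.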

3.4. THEORETICAL ANALYSIS

Below, we analyze convergence for dropping min-k elements. The gradient of Stochastic Gradient Descent (SGD) is commonly viewed as Gradient Descent (GD) with noise:

g_sgd = g_gd + n(0, ξ²),   (7)

where n represents zero-mean noise with variance ξ², introduced by variation across input data batches. With min-k dropping, the gradient becomes biased; we assume it can be modeled as

g_min-k = α·g_gd + β·n(0, ξ²).   (8)

That is, min-k dropping introduces a bias factor α while scaling the noise by a factor β. Both vary each iteration, i.e., α = {α_1, α_2, ..., α_t} and β = {β_1, β_2, ..., β_t}. Additionally, in Appendix A.2 we provide a nonlinear model of g_min-k that achieves the same convergence. By scaling the learning rate by a factor of 1/α, the gradient after min-k dropping as given in Eq. 8 can also be expressed as

g_min-k = g_gd + (β/α)·n(0, ξ²).   (9)

We can formally show (see Appendix A.3) that E[α] ≥ E[β] ≥ 1 − γ and therefore E[β/α] ≤ 1. This suggests that min-k dropping reduces the noise of the gradient, and with less noise, better theoretical convergence is expected. As in the convergence proofs of most optimizers, we assume that the loss function F is L-smooth. Under this assumption, for SGD with learning rate η and min-k dropping with learning rate η/α_t, we reach the following convergence after T iterations:

SGD: (1/T)·E[Σ_{t=1}^T ∥∇F(x_t)∥²] ≤ 2(F(x_1) − F(x*))/(Tη) + ηLξ²,

DropIT with min-k: (1/T)·E[Σ_{t=1}^T ∥∇F(x_t)∥²] ≤ 2(F(x_1) − F(x*))/(Tη) + ηLξ²·(1/T)·Σ_{t=1}^T β_t²/α_t²,   (10)

where x* indicates an optimal solution; the full proof can be found in Appendix A.1. Note that the two inequalities differ only in the second term on the right-hand side: α_t represents the bias caused by dropping at the t-th iteration, and β_t measures the noise reduction effect after dropping.
We further investigate α and β in the supplementary material and show that, under certain conditions, E[α] ≥ E[β], thereby reducing the noise and improving the convergence of DropIT over standard SGD.
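The role of the noise factor β/α can be illustrated with a toy simulation (ours, not from the paper). On the quadratic F(x) = ½∥x∥², an SGD-style update with its noise scaled by a factor below 1 (standing in for β/α < 1) traces iterates that are linear in the noise scale when started from x = 0, so the loss shrinks with the square of that factor, mirroring the β_t²/α_t² term in the bound above.

```python
import numpy as np

rng = np.random.default_rng(3)

# F(x) = 0.5 * ||x||^2, so grad F(x) = x, and the noisy update is
# x <- x - eta * (x + scale * eps), i.e. g = g_gd + scale * n.
def run(noise_scale, noise_seq, eta=0.1):
    x = np.zeros(noise_seq.shape[1])
    losses = []
    for eps in noise_seq:
        x = x - eta * (x + noise_scale * eps)
        losses.append(0.5 * float(x @ x))
    return losses

noise_seq = rng.standard_normal((2000, 10))  # one shared noise realization
loss_full = run(1.0, noise_seq)              # plain SGD-like noise
loss_reduced = run(0.5, noise_seq)           # e.g. beta/alpha = 0.5

# Starting from x = 0 with identical noise, the iterates are proportional
# to the noise scale, so the loss is exactly (0.5)^2 = 1/4 of the original.
assert np.allclose(loss_reduced, np.array(loss_full) * 0.25)
```

This is only an illustration of "less noise, lower error floor"; it does not model the bias α itself.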

3.5. DROPIT FOR NETWORKS

For some layers, e.g. normalization and activation layers, ∂l(a, θ)/∂a may also depend on a. In these cases, we do not drop the cache of intermediate tensors, as doing so would affect subsequent back-propagation. In DropIT, dropping happens only where the gradient flows to the parameters, which prevents the accumulation of errors from the approximated gradients. So far, we have discussed dropping tensor elements from the cache of a single layer. DropIT is theoretically applicable to all convolutional and fully-connected layers in a network, since it does not affect the forward pass. For Visual Transformers (Dosovitskiy et al., 2021), we apply DropIT to most learnable layers, though we leave normalization and activation layers such as LayerNorm (Ba et al., 2016) and GELU (Hendrycks & Gimpel, 2016) untouched. The applicable layers include the fully-connected layers in multi-head attention and the MLPs in each block, the initial convolutional layer (for patch projection), and the final fully-connected classification layer. For CNNs, the applicable layers include all convolutional layers and the final fully-connected classification layer. Networks are left unchanged during inference.

Table 2: Ablation study on dropping strategy and dropping rate. Reported results are top-1 accuracy on the ImageNet-1k validation set, achieved by DeiT-Ti trained from scratch on the ImageNet-1k training set. We highlight accuracies higher than the baseline (≥72.1).


Figure 4: Training loss curves of Min-K DropIT. The baseline (γ = 0%) is bolded. γ = 80%, 90% are hidden as their losses are clearly higher than the baseline; γ = 10%, 30% are also hidden for easier viewing. γ = 40%∼70% achieve lower loss than the baseline at the end of training. Best viewed in color.

4. EXPERIMENTS

In this section, we present a comprehensive evaluation of DropIT's effectiveness, starting with experiments training from scratch on ImageNet-1k (Russakovsky et al., 2015). Our results demonstrate that DropIT achieves lower training loss, higher testing accuracy, and reduced GPU memory consumption. We then showcase the versatility of DropIT in various fine-tuning scenarios, such as ImageNet-1k to CIFAR-100 (Krizhevsky et al., 2009), and object detection and instance segmentation on MS-COCO (Lin et al., 2014). Furthermore, we compare DropIT with recent state-of-the-art ACT methods (Pan et al., 2022; Liu et al., 2022) and establish its superiority in terms of accuracy, speed, and memory cost.

4.1. EXPERIMENTAL DETAILS

Models. For image classification, we employed DeiT (Touvron et al., 2021) instead of vanilla ViT (Dosovitskiy et al., 2021), since it does not require ImageNet-21k pre-training. DeiT and ViT share the same architecture, differing only in their training hyper-parameters. Additionally, for transfer learning, we utilized Faster/Mask R-CNN models (Ren et al., 2017; He et al., 2017) to evaluate our approach on object detection and instance segmentation. Implementation Details. We use the official implementations of DeiT (without distillation) and Faster/Mask R-CNN, and keep all hyper-parameters consistent; the only difference is that we compute gradients using DropIT. Our implementation is based on PyTorch 1.12 (Paszke et al., 2019) and utilizes the torch.autograd package. During the forward pass, DropIT converts the dense tensor to coordinate format; it recovers the dense tensor during the backward pass. The min-k strategy of DropIT is implemented with torch.topk, retaining the proportion 1−γ of elements with the largest absolute values, along with their corresponding indices. For all experiments, we follow DeiT (Touvron et al., 2021) and set a fixed random seed of 0. We measure training speed and memory on NVIDIA RTX A5000 GPUs. Additional details can be found in Appendix A.8.

4.2. IMPACT ON ACCURACY

Training from scratch on ImageNet-1k. Table 2 shows that training DeiT-Ti from scratch without DropIT (baseline) gives a top-1 accuracy of 72.1 on ImageNet-1k. Random dropping matches or improves the accuracy (72.4) when γ ≤ 20%, but with higher γ (γ ≥ 30%), accuracy progressively decreases from the baseline. This phenomenon can be explained as follows: (1) small amounts of random dropping (γ ≤ 20%) can be regarded as adding random noise to the gradient, and this noise has a regularization effect on the network optimization that improves accuracy, similar to what was observed in previous studies (Neelakantan et al., 2015; Evans & Aamodt, 2021); (2) too much random dropping (γ ≥ 30%) results in deviations that can no longer be seen as small gradient noise, hence reducing performance. With min-k dropping, DropIT can match or exceed the baseline accuracy over a wide range of γ (≤ 70%). Intuitively, training from scratch should be difficult with DropIT, especially under large dropping rates, as the computed gradients are approximations. However, our experiments demonstrate that DropIT achieves 0.4% and 0.3% higher accuracy at γ = 50% and 60%, respectively. In fact, DropIT can match the baseline accuracy even after discarding 70% of the elements. Fig. 4 compares the loss curves when training the DeiT-Ti model from scratch without and with min-k DropIT. The loss curves of DropIT with various γ values follow the same trend as the baseline; up to some value of γ, the curves are also consistently lower than the baseline, with γ = 50%, 60% achieving the lowest losses and highest accuracies. As such, we conclude that DropIT accurately approximates the gradient while reducing noise, as per our theoretical analysis. Fine-tuning on CIFAR-100. Table 3(a) shows that DeiT networks can be fine-tuned with DropIT to achieve higher-than-baseline accuracies even while dropping up to 90% of the intermediate elements.
Compared to training from scratch (Table 2), DropIT can work with a more extreme dropping rate (90% vs. 70%). We attribute this to the network already having a good initialization before fine-tuning, which simplifies the optimization and allows a higher γ. Backbone fine-tuning, head network training from scratch, on COCO. We investigated DropIT in two settings: training from scratch and fine-tuning from a pre-trained network. We also studied a backbone network initialized with ImageNet pre-training while leaving the other components, such as the RPN and R-CNN heads, uninitialized, which is common practice in object detection. As shown in Table 5, MESA (Pan et al., 2022) can reduce memory with 8-bit quantization and has no impact on the baseline accuracy (89.7). However, the time cost of the MESA algorithm is considerable: it is 416/172 ≈ 2.4× slower than the baseline and 416/212 ≈ 2× slower than DropIT, with no accuracy improvement in CIFAR-100 fine-tuning. With DeiT-Ti on ImageNet-1k, MESA achieves 71.9 accuracy, whereas DropIT can reach 72.5 (Table 2, γ = 50%). We can combine MESA with DropIT by applying DropIT in the conv/fc layers and MESA in the other layers; together, accuracy, memory, and speed all improve over MESA alone, conclusively demonstrating the effectiveness of DropIT. We compare similarly with GACT (Liu et al., 2022): Table 5 shows that at 4 bits it can reduce max-memory even further. Combining GACT with DropIT marginally increases max-memory due to DropIT's indexing consumption; however, there are gains in both accuracy and speed. Furthermore, GACT reports a 0.2∼0.4 AP box drop on COCO (Liu et al., 2022), whereas DropIT yields a 0.6 AP box improvement on COCO (Table 3(b)). To sum up, DropIT has unique advantages in terms of accuracy and speed compared to existing activation quantization methods.
Although it saves less memory than quantization, the two can be combined to achieve higher memory efficiency.

A APPENDIX

A.1 COMPLETE CONVERGENCE ANALYSIS

Here we prove convergence of DropIT with the min-k dropping strategy. By scaling the learning rate with a factor of 1/α, the gradient of min-k dropping is modeled as

g_min-k = g_gd + (β/α)·n(0, ξ²),

where n is zero-mean noise with variance ξ², and α, β vary each iteration. We assume that the loss function F is L-smooth, i.e., F is differentiable and there exists a constant L > 0 such that

F(y) ≤ F(x) + ⟨∇F(x), y − x⟩ + (L/2)·∥y − x∥²,  ∀x, y ∈ R^d.

Applying this bound to the update x_{t+1} = x_t − (η/α_t)·g_min-k and taking expectations, we have

E[F(x_{t+1})] ≤ F(x_t) − η·∥∇F(x_t)∥² + (η²Lξ²/2)·(β_t²/α_t²).   (14)

Rearranging the terms of the above inequality and dividing by η/2, we obtain

∥∇F(x_t)∥² ≤ 2(F(x_t) − E[F(x_{t+1})])/η + ηLξ²·β_t²/α_t².

Summing from t = 1 to T and dividing by T, we get

(1/T)·E[Σ_{t=1}^T ∥∇F(x_t)∥²] ≤ 2(F(x_1) − F(x*))/(Tη) + ηLξ²·(1/T)·Σ_{t=1}^T β_t²/α_t²,

where x* indicates an optimal solution.

A.2 MODELING MIN-k DROPPING GRADIENT WITH NONLINEAR FUNCTION

We can replace Eq. 8 (the gradient model of min-k dropping) with a nonlinear function and still achieve the same convergence as in Eq. 10. The gradient is biased with min-k dropping; we assume it can be modeled as

g_min-k = g_gd + β·n(0, ξ²) + b,

where b is a bias term with ∥b∥² ≤ (1 − α)∥g_gd∥² (derivation continued below). Using a learning rate of η/α_t instead, we have

∥∇F(x_t)∥² ≤ 2(F(x_t) − E[F(x_{t+1})])/η + ηLξ²·β_t²/α_t².

Summing from t = 1 to T and dividing by T yields the same bound as in Eq. 10. In Table 6, we compare the convergence of SGD and DropIT under various learning rates. Under a fixed learning rate, SGD and DropIT differ in both convergence speed (the first term of the bound) and error (the second term). For a fair setting, we compare SGD with learning rate η against DropIT with learning rate η/α: at a fixed convergence speed, DropIT theoretically achieves a lower error.

A.3 THEORETICAL ANALYSIS ON α AND β

In this section we compare the gradients of SGD and DropIT with min-k dropping. Note that we slightly change the notation of the gradients from g_sgd and g_min-k in the main paper to improve clarity for the element-wise analysis (continued in the annex below).

[Fragment of the DropIT usage listing:]
    for x, y in dataloader:
        x = model(x)  # x will not be released with DropIT
        loss_func(x, y).backward()



¹ Note that the output of the previous layer i−1 is the input of layer i, i.e. a_i = z_{i−1}. However, we assign different symbols to explicitly denote the input and output of a given layer; this redundant notation conveniently allows us, for clarity, to drop the explicit layer index i as a superscript.

Implementation references:
https://www.github.com/facebookresearch/deit/blob/main/README_deit.md
https://www.github.com/facebookresearch/deit/issues/45
https://www.github.com/pytorch/vision/tree/main/references/detection#faster-r-cnn-resnet-50-fpn
https://www.github.com/pytorch/vision/tree/main/references/detection#mask-r-cnn



Figure 3: Distribution of element values in the intermediate tensors of DeiT-Ti. Dropped elements are shaded in grey; DropIT with min-k discards only elements close to zero. We show only the final block, observing that the distributions of the other blocks are similar.

x* indicates an optimal solution. The convergence is exactly the same as in Appendix A.1.

Table 6: Theoretical convergence of SGD and DropIT under the L-smooth condition.













Table 1: Space complexity for parameters and intermediate tensors in a single layer. B: batch size; L_a: input sequence length (e.g., width×height); C_a, C_z: the numbers of input and output channels; K: convolutional kernel size. Typically, C_a, C_z, and K are fixed once the model has been built, so the complexity of intermediate tensors becomes considerable with large B and L_a.
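The asymmetry in Table 1 is easy to see with a back-of-the-envelope calculation; the concrete sizes below are our illustrative assumptions, not numbers from the paper.

```python
# Memory for one convolutional layer at fp32 (4 bytes/element), in the
# notation of Table 1. The sizes are illustrative, not from the paper.
B, H, W = 64, 256, 256            # batch and feature-map size (L_a = H * W)
C_a, C_z, K = 256, 256, 3

param_bytes = C_a * C_z * K * K * 4   # O(C_a * C_z * K^2): batch-independent
act_bytes = B * H * W * C_a * 4       # O(B * L_a * C_a): grows with B and L_a

print(param_bytes / 2**20, "MiB of weights")      # a couple of MiB
print(act_bytes / 2**30, "GiB of cached input")   # several GiB per layer
```

Parameters stay at a few MiB regardless of batch size, while the cached input scales linearly in B and L_a, which is why the activation cache dominates for large batches of high-resolution inputs.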

From Touvron et al. (2021)'s official implementation, we obtain 72.13 with public weights and our training.

Table 3: Fine-tuning with DropIT on image classification, object detection, and instance segmentation.

Table 4: Memory cost of DropIT cached tensors (without indices) for different γ. DropIT reduces this memory precisely by the factor γ. The measured model is DeiT-S with a batch size of 1024.

Table 3(b) shows that DropIT steadily improves detection accuracy (AP box). When γ = 80%, we observed an impressive 0.6 AP box gain in Mask R-CNN, although this gain was not observed in AP mask. We believe the segmentation performance may be closely related to the deconvolutional layers in the mask head, which DropIT does not currently support; we plan to investigate this further in future work. These experiments demonstrate the effectiveness of DropIT on CNNs, and in Appendix A.4 we further demonstrate its effectiveness for ResNet training on ImageNet-1k. Note: GACT performs a time-consuming sensitivity-profiling computation every 1000 iterations, which costs 49.81 seconds (25.43 seconds with +DropIT) in our benchmark; we therefore amortize this time over 1000 iterations in the reported speeds.

Table 5: Comparing and combining with state-of-the-art ACT methods. FC: fully-connected; MaxM: maximum memory; MaxM (−Index): maximum memory without indices (moved to CPU). Following MESA, we use batch size 128 to measure memory and speed on a single GPU.

4.3 IMPACT ON MEMORY & SPEED, SOTA COMPARISON

Intermediate Tensor Cache Reduction. Table 4 shows the intermediate tensor cache reduction achieved by DropIT. In the layers where DropIT is applied (the FC layers of DeiT-S), the total reserved activation memory (batch size = 1024) is 11.26 GB. When we use DropIT to discard activations, the memory reduction is precisely controlled by γ, i.e. γ = 90% yields a reduction of 11.26 × 0.9 GB. DropIT does incur some memory cost for indexing, but as we show next, the maximum GPU memory can still be reduced. Comparison and Combination with SOTA. In Table 5, we compare and combine DropIT with state-of-the-art activation quantization methods. Measured individually with γ = 90%, DropIT improves accuracy by 0.4, reduces maximum memory by 1.07 GB (1.37 GB of activations minus 0.3 GB of indexing), and slightly increases the time per iteration (by 40 ms). The maximum memory reduction is smaller than in Table 4 because activations from non-applicable layers still occupy considerable memory. A natural way to supplement DropIT is therefore to apply activation quantization to the layers without DropIT. We next present the results of combining DropIT with the recent methods MESA (Pan et al., 2022) and GACT (Liu et al., 2022).
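The trade-off between value savings and index overhead can be made concrete with a little accounting. This is our illustrative arithmetic, not the paper's measurement setup: the element and index widths (fp32 values, int32 flat indices) are assumptions.

```python
# Net saving from storing (1 - gamma) * N values plus their indices,
# instead of N dense values. Widths are assumptions (fp32 + int32),
# not details taken from the paper.
def net_saving_bytes(n_elements, gamma, value_bytes=4, index_bytes=4):
    dense = n_elements * value_bytes
    kept = n_elements - int(round(gamma * n_elements))
    sparse = kept * (value_bytes + index_bytes)
    return dense - sparse

n = 100_000_000  # 10^8 cached elements (~0.4 GB dense at fp32)
for gamma in (0.5, 0.9):
    print(gamma, net_saving_bytes(n, gamma) / 2**30, "GiB saved")
```

Under these widths, γ = 0.9 leaves 10% of values plus indices (80 MB vs. 400 MB dense, a clear win), while at γ = 0.5 the index overhead cancels the benefit entirely, which is consistent with the paper's use of high dropping rates and with offloading indices to the CPU.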

5. CONCLUSION

In this paper, we propose the Dropping Intermediate Tensors (DropIT) method to reduce the GPU memory cost of training DNNs. Specifically, DropIT drops elements of intermediate tensors to achieve a memory-efficient tensor cache and, in the backward pass, recovers the sparsified tensors from the remaining elements to compute the gradient. Our experiments show that DropIT can improve the accuracy of DNNs and save GPU memory across different backbones and datasets.

where b is a bias with ∥b∥² ≤ (1 − α)∥g_gd∥², and α and β vary each iteration, i.e., α = {α_1, α_2, ..., α_t} and β = {β_1, β_2, ..., β_t}. Assuming the loss function F is L-smooth, we obtain

E[F(x_{t+1})] ≤ F(x_t) − η⟨∇F(x_t), ∇F(x_t) + b⟩ + (η²L/2)·E[∥∇F(x_t) + b + β_t·n∥²]
            ≤ F(x_t) − η⟨∇F(x_t), ∇F(x_t) + b⟩ + (η²L/2)·∥∇F(x_t) + b∥² + (η²Lξ²β_t²)/2.

Rearranging the terms of the above inequality and dividing by ηα_t/2, we obtain

∥∇F(x_t)∥² ≤ 2(F(x_t) − E[F(x_{t+1})])/η + ηLξ²·β_t²/α_t².

6. ACKNOWLEDGEMENTS

This research is supported by the National Research Foundation, Singapore under its NRF Fellowship for AI (NRF-NRFFAI1-2019-0001). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore.

APPENDIX

We begin with an element-wise analysis. We denote the gradient computed by SGD as G and by DropIT as G′. Both gradients are computed from an input tensor A and an intermediate tensor Z; in DropIT we drop γ percent of the elements in A, so G′ is computed from A ⊙ D instead of A, where ⊙ is element-wise multiplication and D is a dropping mask whose elements are either 1 or 0. From an element-wise viewpoint, each entry of the gradient is a sum of products of elements of A and Z, with each element a_ik masked by d_ik in DropIT, where d_ik is 0 or 1 depending on a_ik (Eq. 24 and Eq. 25).

For simplicity of analysis, we assume A and Z are independent. Let μ be the mean value of A and c the mean contribution of the dropped elements; after dropping, A ⊙ D has mean μ − c. Taking the expectation over all possible inputs, the multiplicative bias caused by dropping is therefore (μ − c)/μ. Recall that we drop the elements with the smallest absolute values. In the extreme case where every element of A has the same value μ, c reaches its upper bound γμ; therefore (μ − c)/μ ≥ 1 − γ.

Now we analyze the noise and compute β. Due to the variation of input samples, there is noise in A and Z, which results in noise in G and G′. To make the noise explicit, we rewrite a noisy element x as x̄ + n_x, where x̄ is the mean value of x and n_x is zero-mean noise. Applying this to Eq. 24 and Eq. 25, and focusing on the noise of the gradients, we obtain the noise terms of G and G′. Recall that d_ik is a mask depending on a_ik and therefore on ā_ik (Eq. 26); moreover, γ percent of the entries of D are 0 and the remaining (1 − γ) percent are 1. Plugging these into Eq. 30, we obtain the bound on the noise, where the inequality holds because c ≤ γμ.

A.5 MORE NETWORK RESULTS

We present more results for different network architectures in Table 7. ResNet-18 is trained from scratch for 90 epochs on ImageNet, exactly following the torchvision reference code. ViT-B/16 is fine-tuned for 3 epochs from its ImageNet-21k pre-trained weights. Our proposed DropIT improves accuracy in these settings with a lower GPU memory cost.

A.6 WHY DROPIT IS NOT USED FOR THE NETWORK FIRST & FINAL LAYERS

We do not apply DropIT to a conv/fc layer if it is the first or final layer of the network, because doing so does not save memory:

(1) The first layer. DropIT saves memory by creating a smaller tensor x_dropped (via torch.topk) from the input tensor x, after which x is automatically released by Python garbage collection. However, a popular training-loop code style looks like:

    dataloader = ...
    loss_func = ...
    for x, y in dataloader:
        loss = loss_func(model(x), y)
        ...

As we can see, in the dataloader loop the input x to the model can only be recycled once the model's forward pass has finished. So using DropIT in the first layer does not reduce the maximum memory -- instead, it increases it, since DropIT creates an extra tensor x_dropped.

(2) The final layer. It is easy to see that DropIT in the final layer has no effect on memory either. In the code block above, when running layer_i, the maximum memory consists of the cached tensors of layer_1 through layer_{i-1} plus the input x to layer_i. If we use DropIT at layer_i, an extra x_dropped is produced, making the maximum memory even higher.

A.7 HOW TO SELECT γ OF DROPIT

From our experiments, we recommend γ = 70% for training from scratch and γ = 80% or 90% for fine-tuning. As DropIT incurs a memory cost for indexing, γ should be larger than 50% to be meaningful (assuming the index data type is int32, which has the same number of bits as a float32 activation). Empirically, we observe that γ is reflected consistently in both the training loss and the testing accuracy: a too-high γ biases the gradient and produces training losses higher than the baseline. An alternative way to select γ is therefore to observe the training loss after some iterations (e.g., 100); if it is lower than the baseline's, the testing accuracy is likely to improve as well.
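The 50% break-even point can be sketched with simple per-element accounting (a hypothetical helper, not from the released code; it assumes int32 indices and float32 activations, as in the text, and ignores implementation overheads):

```python
def bytes_per_element_after_dropit(gamma: float,
                                   value_bytes: int = 4,
                                   index_bytes: int = 4) -> float:
    """Cached bytes per original activation element under DropIT: each
    kept element stores its float32 value plus one int32 index."""
    return (1.0 - gamma) * (value_bytes + index_bytes)

# Dense float32 caching costs 4 bytes/element, so DropIT only wins
# when gamma exceeds 0.5:
assert bytes_per_element_after_dropit(0.5) == 4.0  # break-even
assert bytes_per_element_after_dropit(0.7) < 4.0   # saves memory
assert bytes_per_element_after_dropit(0.3) > 4.0   # costs more
```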

A.8 MORE EXPERIMENTAL DETAILS

We list the detailed key training hyper-parameters, although they are identical to those of the official implementations:

• DeiT-Ti, training from scratch, ImageNet-1k, w/wo DropIT: batch size 1024, AdamW optimizer, learning rate 10^-3, weight decay 0.05, cosine LR schedule, 300 epochs, with automatic mixed precision (AMP) training;

• DeiT-S, fine-tuning from official DeiT-S ImageNet-1k weights, CIFAR-100, w/wo DropIT: batch size 768, SGD optimizer (momentum 0.9), learning rate 10^-2, weight decay 10^-4, cosine LR schedule, 1000 epochs, with AMP training;

• DeiT-B, fine-tuning from official DeiT-B ImageNet-1k weights, CIFAR-100, w/wo DropIT: batch size 768, SGD optimizer (momentum 0.9), learning rate 10^-2, weight decay 10^-4, cosine LR schedule, 1000 epochs, with AMP training;

