DIVISION: MEMORY EFFICIENT TRAINING VIA DUAL ACTIVATION PRECISION

Abstract

Activation compressed training (ACT) has been shown to be a promising way to reduce the memory cost of training deep neural networks (DNNs). However, existing work on ACT relies on searching for the optimal bit-width during DNN training to reduce quantization noise, which makes the procedure complicated and less transparent. To this end, we propose a simple and effective method to compress DNN training. Our method is motivated by an instructive observation: DNN backward propagation mainly utilizes the low-frequency component (LFC) of the activation maps, while the majority of memory is spent caching the high-frequency component (HFC) during training. This indicates that the HFC of activation maps is highly redundant and compressible during DNN training, which inspires our proposed Dual ActIVation PrecISION (DIVISION). During training, DIVISION preserves a high-precision copy of the LFC and compresses the HFC into a light-weight copy with low numerical precision. This significantly reduces the memory cost without negatively affecting the precision of backward propagation, so that DIVISION maintains competitive model accuracy. Experimental results show that DIVISION achieves over 10× compression of activation maps and significantly higher training throughput than state-of-the-art ACT methods, without loss of model accuracy. The code is available at https://anonymous.4open.science/r/division-5CC0/.

1. INTRODUCTION

Deep neural networks (DNNs) have been widely applied to real-world tasks such as language understanding (Devlin et al., 2018), machine translation (Vaswani et al., 2017), and visual detection and tracking (Redmon et al., 2016). With increasingly larger and deeper architectures, DNNs achieve remarkable improvements in representation learning and generalization capacity (Krizhevsky et al., 2012). Generally, training a larger model requires more memory to cache the activation values of all intermediate layers during back-propagation [1]. For example, training a DenseNet-121 (Huang et al., 2017) on the ImageNet dataset (Deng et al., 2009) requires caching over 1.3 billion float activation values (4.8 GB) during back-propagation, and training a ResNet-50 (He et al., 2016) requires caching over 4.6 billion float activation values (17 GB). Several techniques have been developed to reduce the training cache of DNNs, such as checkpointing (Chen et al., 2016; Gruslys et al., 2016), mixed-precision training (Vanholder, 2016), low bit-width training (Lin et al., 2017; Chen et al., 2020), and activation compressed training (Georgiadis, 2019; Liu et al., 2022). Among these, activation compressed training (ACT) has emerged as a promising method due to its significant reduction of training memory and its competitive learning performance (Liu et al., 2021b). Existing work on ACT quantizes the activation maps to reduce the memory consumption of DNN training, e.g., BLPA (Chakrabarti & Moseley, 2019), TinyScript (Fu et al., 2020), and ActNN (Chen et al., 2021). Although ACT can significantly reduce the training memory cost, the quantization process introduces noise into backward propagation, which makes training suffer from undesirable degradation of accuracy (Fu et al., 2020).
For this reason, BLPA requires 4-bit ACT to ensure convergence to an optimal solution on the ImageNet dataset, which yields only a 6× compression rate [2] of activation maps (Chakrabarti & Moseley, 2019). Other works propose to search for the optimal bit-width to match different samples during training, such as ActNN (Chen et al., 2021) and AC-GC (Evans & Aamodt, 2021). Although this moderately reduces the quantization noise and achieves an optimal solution under 2-bit ACT (nearly 10× compression rate), the following issues cannot be ignored. First, it is time-consuming to search for the optimal bit-width during training. Second, the framework of bit-width searching is complicated and non-transparent, which brings new challenges to follow-up studies on ACT and its real-world applications. In this work, we propose a simple and transparent method to reduce the memory cost of DNN training. Our method is motivated by an instructive observation: DNN backward propagation mainly utilizes the low-frequency component (LFC) of the activation maps, while the majority of memory is spent storing the high-frequency component (HFC) during training. This indicates that the HFC of activation maps is highly redundant and compressible during training. Following this direction, we propose Dual Activation Precision (DIVISION), which preserves a high-precision copy of the LFC and compresses the HFC into a light-weight copy with low numerical precision during training. In this way, DIVISION can significantly reduce the memory cost. Meanwhile, it does not negatively affect the quality of backward propagation and maintains competitive model accuracy.
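To make the dual-precision idea concrete, the following is a minimal sketch of an LFC/HFC split with dual numerical precision. It is an illustrative stand-in, not the actual DIVISION implementation: we assume block-wise average pooling as the low-pass filter, take the HFC as the residual, keep the LFC in full precision at reduced spatial resolution, and apply uniform 2-bit quantization to the HFC (stored here in uint8 for simplicity; a real implementation would pack bits). The function names `compress` and `decompress` are hypothetical.

```python
import torch
import torch.nn.functional as F

def compress(h, block=4, bits=2):
    """Split an activation map into a high-precision LFC (kept in fp32
    at 1/block^2 resolution) and a low-precision quantized HFC."""
    # LFC: block-wise average pooling acts as a simple low-pass filter
    lfc = F.avg_pool2d(h, block)
    # HFC: residual between the activation and the upsampled LFC
    hfc = h - F.interpolate(lfc, scale_factor=block, mode='nearest')
    # Per-tensor uniform quantization of the HFC to `bits` per value
    lo, hi = hfc.min(), hfc.max()
    scale = (hi - lo) / (2 ** bits - 1) + 1e-12
    q = torch.round((hfc - lo) / scale).to(torch.uint8)
    return lfc, q, lo, scale

def decompress(lfc, q, lo, scale, block=4):
    """Reconstruct the activation from the dual-precision copies."""
    hfc = q.float() * scale + lo
    return F.interpolate(lfc, scale_factor=block, mode='nearest') + hfc

h = torch.randn(8, 16, 32, 32)
lfc, q, lo, scale = compress(h)
h_hat = decompress(lfc, q, lo, scale)
```

With block = 4 and 2-bit HFC, the cache holds 1/16 of the elements in fp32 plus 2 bits per original element, versus 32 bits per element for exact caching; the reconstruction error is bounded by half a quantization step, and it affects only the HFC.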

2. PRELIMINARY

2.1 NOTATIONS

Without loss of generality, we consider an L-layer deep neural network in this work. During the forward pass, for each layer l (1 ≤ l ≤ L), the activation map is calculated by

H_l = forward(H_{l-1}; W_l),    (1)

where H_l denotes the activation map of layer l; H_0 is a mini-batch of input images; W_l denotes the weight of layer l; and forward(·) denotes the feed-forward operation. During the backward pass, the gradients of the loss with respect to the activation maps and weights are estimated by

[∇H_{l-1}, ∇W_l] = backward(∇H_l, H_{l-1}, W_l),    (2)

where ∇H_{l-1} and ∇H_l denote the gradients with respect to the activation maps of layers l-1 and l, respectively; ∇W_l denotes the gradient with respect to the weight of layer l; and backward(·) [3] denotes the backward function, which takes ∇H_l, H_{l-1}, and W_l, and outputs the gradients ∇H_{l-1} and ∇W_l. Equation (2) indicates that the activation maps H_0, ..., H_{L-1} must be cached after the feed-forward operations for gradient estimation during backward propagation.
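The recurrence above can be sketched for a single layer. The following is a minimal illustration, assuming forward(·) is instantiated as a linear layer followed by ReLU (an assumption for concreteness; the paper's notation covers arbitrary layer types). It shows why the cached H_{l-1} is needed in the backward pass: both the ReLU mask and the weight gradient depend on it.

```python
import torch

def forward(h_prev, w):
    # H_l = forward(H_{l-1}; W_l); here a linear layer with ReLU
    return torch.relu(h_prev @ w)

def backward(grad_h, h_prev, w):
    # [grad_H_{l-1}, grad_W_l] = backward(grad_H_l, H_{l-1}, W_l)
    # The cached H_{l-1} is needed both to recompute the ReLU mask
    # and to form the weight gradient.
    pre = h_prev @ w
    g = grad_h * (pre > 0).float()   # gradient through the ReLU
    grad_h_prev = g @ w.t()          # propagated to the previous layer
    grad_w = h_prev.t() @ g          # gradient for the layer weight
    return grad_h_prev, grad_w
```

The manual `backward` reproduces what torch.autograd computes for this layer, which makes explicit that every H_{l-1} must survive from the forward pass until its layer's backward step.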



Footnotes:
[1] The activation map of each layer is required for estimating the gradient during backward propagation.
[2] A 6× compression rate indicates that the memory of the cached activation maps is 1/6 of that of normal training.
[3] We do not focus on the closed form of the backward function, which is implemented by torch.autograd.



Compared with existing work that integrates searching into learning (Liu et al., 2022), DIVISION has a more simplified compressor and decompressor, speeding up the procedure of ACT. More importantly, it reveals the compressible (HFC) and non-compressible (LFC) factors during DNN training, improving the transparency of ACT. Experiments are conducted to evaluate DIVISION in terms of memory cost, model accuracy, and training throughput. An overall comparison is given in Figure 1 (a). Our proposed DIVISION consistently outperforms state-of-the-art baseline methods in all three aspects. The contributions of this work are summarized as follows:
• We experimentally demonstrate and theoretically prove that DNN backward propagation mainly utilizes the LFC of the activation maps, while the HFC is highly redundant and compressible.
• We propose a simple framework, DIVISION, to effectively reduce the memory cost of DNN training by removing the redundancy in the HFC of activation maps during training.
• Experiments on three benchmark datasets demonstrate the effectiveness of DIVISION in terms of memory cost, model accuracy, and training throughput.

2.2 ACTIVATION COMPRESSED TRAINING

It has been shown in existing work (Chen et al., 2020) that the majority of memory (nearly 90%) is used for caching activation maps during the training of DNNs. Following this direction, activation compressed training (ACT) reduces the memory cost by compressing the activation maps in real time during training. A typical ACT framework from existing work (Chakrabarti & Moseley, 2019) is shown in Figure 1 (b). Specifically, after the feed-forward operation of each layer l, the activation map H_{l-1} is compressed into a compact representation for caching. The compression enables a significant reduction of memory compared with caching the original (exact) activation maps. During the backward pass of layer l, ACT decompresses the cached representation into Ĥ_{l-1} and estimates the gradients by substituting the reconstructed Ĥ_{l-1} into Equation (2): [∇H_{l-1}, ∇W_l] = backward(∇H_l, Ĥ_{l-1}, W_l).
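This compress-on-forward / decompress-on-backward pattern can be sketched with a custom torch.autograd.Function. The example below is illustrative only: it uses a simple 8-bit uniform quantizer as a stand-in for the compressors in the cited methods, and the class name `ACTLinear` is hypothetical. The key point is that only the compressed copy of H_{l-1} is saved for the backward pass, so the backward step operates on the reconstruction Ĥ_{l-1}.

```python
import torch

class ACTLinear(torch.autograd.Function):
    """Linear layer that caches a quantized copy of its input,
    following the ACT compress-then-decompress pattern."""

    @staticmethod
    def forward(ctx, h_prev, w, bits=8):
        out = h_prev @ w
        # Compress H_{l-1} before caching; the exact tensor is discarded,
        # which is where the memory saving comes from.
        lo, hi = h_prev.min(), h_prev.max()
        scale = (hi - lo) / (2 ** bits - 1) + 1e-12
        q = torch.round((h_prev - lo) / scale).to(torch.uint8)
        ctx.save_for_backward(q, lo, scale, w)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        q, lo, scale, w = ctx.saved_tensors
        # Decompress the cached representation into Ĥ_{l-1}
        h_hat = q.float() * scale + lo
        grad_h_prev = grad_out @ w.t()
        grad_w = h_hat.t() @ grad_out  # quantization noise enters here
        return grad_h_prev, grad_w, None
```

Calling `ACTLinear.apply(h, w, 8)` inside a model makes the quantization noise in ∇W_l explicit: the weight gradient is formed from Ĥ_{l-1} rather than H_{l-1}, which is exactly the source of accuracy degradation discussed above.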

