DIVISION: MEMORY EFFICIENT TRAINING VIA DUAL ACTIVATION PRECISION

Abstract

Activation compressed training (ACT) has been shown to be a promising way to reduce the memory cost of training deep neural networks (DNNs). However, existing work on ACT relies on searching for the optimal bit-width during DNN training to reduce the quantization noise, which makes the procedure complicated and less transparent. To this end, we propose a simple and effective method to compress DNN training. Our method is motivated by an instructive observation: DNN backward propagation mainly utilizes the low-frequency component (LFC) of the activation maps, while the majority of memory is spent on caching the high-frequency component (HFC) during training. This indicates that the HFC of activation maps is highly redundant and compressible during DNN training, which inspires our proposed Dual ActIVation PrecISION (DIVISION). During training, DIVISION preserves a high-precision copy of the LFC and compresses the HFC into a light-weight copy with low numerical precision. This significantly reduces the memory cost without negatively affecting the precision of backward propagation, so DIVISION maintains competitive model accuracy. Experimental results show that DIVISION achieves over 10× compression of activation maps and significantly higher training throughput than state-of-the-art ACT methods, without loss of model accuracy. The code is available at https://anonymous.4open.science/r/division-5CC0/.

1. INTRODUCTION

Deep neural networks (DNNs) have been widely applied to real-world tasks such as language understanding (Devlin et al., 2018), machine translation (Vaswani et al., 2017), and visual detection and tracking (Redmon et al., 2016). With increasingly larger and deeper architectures, DNNs achieve remarkable improvements in representation learning and generalization capacity (Krizhevsky et al., 2012). Generally, training a larger model requires more memory to cache the activation values of all intermediate layers for back-propagation. For example, training a DenseNet-121 (Huang et al., 2017) on the ImageNet dataset (Deng et al., 2009) requires caching over 1.3 billion float activation values (4.8GB) during back-propagation, and training a ResNet-50 (He et al., 2016) requires caching over 4.6 billion float activation values (17GB). Several techniques have been developed to reduce the training cache of DNNs, such as checkpointing (Chen et al., 2016; Gruslys et al., 2016), mixed precision training (Vanholder, 2016), low bit-width training (Lin et al., 2017; Chen et al., 2020) and activation compressed training (Georgiadis, 2019; Liu et al., 2022). Among these, activation compressed training (ACT) has emerged as a promising method due to its significant reduction of training memory and its competitive learning performance (Liu et al., 2021b). Existing work on ACT quantizes the activation maps to reduce the memory consumption of DNN training, such as BLPA (Chakrabarti & Moseley, 2019), TinyScript (Fu et al., 2020) and ActNN (Chen et al., 2021). Although ACT can significantly reduce the training memory cost, the quantization process introduces noise into backward propagation, which causes an undesirable degradation of accuracy (Fu et al., 2020).
For this reason, BLPA requires 4-bit ACT to ensure convergence to the optimal solution on the ImageNet dataset, which yields only a 6× compression rate of activation maps (Chakrabarti & Moseley, 2019). Other works propose to search for the optimal bit-width to match different samples during training, such as ActNN (Chen et al., 2021) and AC-GC (Evans & Aamodt, 2021). Although this moderately reduces the quantization noise and reaches the optimal solution under 2-bit ACT (nearly 10× compression rate), the following issues cannot be ignored. First, it is time-consuming to search for the optimal bit-width during training. Second, the framework of bit-width searching is complicated and non-transparent, which brings new challenges to follow-up studies on ACT and its real-world applications.
In this work, we propose a simple and transparent method to reduce the memory cost of DNN training. Our method is motivated by an instructive observation: DNN backward propagation mainly utilizes the low-frequency component (LFC) of the activation maps, while the majority of memory is spent on storing the high-frequency component (HFC) during training. This indicates that the HFC of an activation map is highly redundant and compressible during training. Following this direction, we propose Dual Activation Precision (DIVISION), which preserves a high-precision copy of the LFC and compresses the HFC into a light-weight copy with low numerical precision during training. In this way, DIVISION significantly reduces the memory cost. Meanwhile, it does not negatively affect the quality of backward propagation and maintains competitive model accuracy. Compared with existing work that integrates searching into learning (Liu et al., 2022), DIVISION has a simpler compressor and decompressor, which speeds up the ACT procedure. More importantly, it reveals the compressible (HFC) and non-compressible (LFC) factors during DNN training, improving the transparency of ACT.
Experiments are conducted to evaluate DIVISION in terms of memory cost, model accuracy, and training throughput. An overall comparison is given in Figure 1 (a). Our proposed DIVISION consistently outperforms state-of-the-art baseline methods in the above three aspects. The contributions of this work are summarized as follows:
• We experimentally demonstrate and theoretically prove that DNN backward propagation mainly utilizes the LFC of the activation maps. The HFC is highly redundant and compressible.
• We propose a simple framework, DIVISION, to effectively reduce the memory cost of DNN training by removing the redundancy in the HFC of activation maps during training.
• Experiments on three benchmark datasets demonstrate the effectiveness of DIVISION in terms of memory cost, model accuracy, and training throughput.

2. PRELIMINARY

2.1. NOTATIONS

Without loss of generality, we consider an $L$-layer deep neural network in this work. During the forward pass, for each layer $l$ ($1 \le l \le L$), the activation map is calculated by

$$H_l = \mathrm{forward}(H_{l-1}; W_l), \tag{1}$$

where $H_l$ denotes the activation map of layer $l$; $H_0$ takes a mini-batch of input images; $W_l$ denotes the weight of layer $l$; and $\mathrm{forward}(\cdot)$ denotes the feed-forward operation. During the backward pass, the gradients of the loss value with respect to the activation maps and weights are estimated by

$$[\hat\nabla_{H_{l-1}}, \hat\nabla_{W_l}] = \mathrm{backward}(\hat\nabla_{H_l}, H_{l-1}, W_l), \tag{2}$$

where $\hat\nabla_{H_{l-1}}$ and $\hat\nabla_{H_l}$ denote the gradients with respect to the activation maps of layers $l-1$ and $l$, respectively; $\hat\nabla_{W_l}$ denotes the gradient with respect to the weight of layer $l$; and $\mathrm{backward}(\cdot)$ denotes the backward function, which takes $\hat\nabla_{H_l}$, $H_{l-1}$ and $W_l$, and outputs the gradients $\hat\nabla_{H_{l-1}}$ and $\hat\nabla_{W_l}$. Equation (2) indicates that the activation maps $H_0, \cdots, H_{L-1}$ must be cached after the feed-forward operations for gradient estimation during backward propagation.
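To make the caching requirement concrete, the following minimal sketch implements the forward and backward functions for a single bias-free linear layer in plain Python. The names `forward` and `backward` mirror the notation above; the choice of layer, the toy numbers, and the list-based representation are illustrative assumptions and not part of the method.

```python
def forward(h_prev, w):
    """H_l = W_l H_{l-1} for a bias-free linear layer (illustrative choice)."""
    return [sum(w_ij * h_j for w_ij, h_j in zip(row, h_prev)) for row in w]


def backward(grad_h, h_prev, w):
    """Return (gradient w.r.t. H_{l-1}, gradient w.r.t. W_l) given the gradient w.r.t. H_l."""
    # The weight gradient is the outer product of grad_h and the cached input,
    # which is why H_{l-1} has to stay in memory after the forward pass.
    grad_w = [[g * x for x in h_prev] for g in grad_h]
    # The input gradient only needs the weights: W_l^T grad_h.
    grad_h_prev = [sum(w[i][j] * grad_h[i] for i in range(len(w)))
                   for j in range(len(h_prev))]
    return grad_h_prev, grad_w


w1 = [[1.0, 2.0], [3.0, 4.0]]
h0 = [1.0, -1.0]                 # cached input activation H_0
h1 = forward(h0, w1)             # forward pass
grad_h0, grad_w1 = backward([1.0, 0.5], h0, w1)
```

In this toy case the weight gradient cannot be formed without `h0`, which is the quantity that ACT methods replace with a compressed representation.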

2.2. ACTIVATION COMPRESSED TRAINING

Existing work (Chen et al., 2020) has shown that the majority of memory (nearly 90%) is used for caching activation maps during the training of DNNs. Following this direction, activation compressed training (ACT) reduces the memory cost by compressing the activation maps in real time during training. A typical ACT framework from existing work (Chakrabarti & Moseley, 2019) is shown in Figure 1 (b). Specifically, after the feed-forward operation of each layer $l$, the activation map $H_{l-1}$ is compressed into a representation for caching. The compression enables a significant reduction of memory compared with caching the original (exact) activation maps. During the backward pass of layer $l$, ACT decompresses the cached representation into $\hat H_{l-1}$, and estimates the gradient by taking the reconstructed $\hat H_{l-1}$ into Equation (2):

$$[\hat\nabla_{H_{l-1}}, \hat\nabla_{W_l}] = \mathrm{backward}(\hat\nabla_{H_l}, \hat H_{l-1}, W_l).$$

Even though the pipeline of compression and decompression is lossy, i.e., $\hat H_l \neq H_l$ for $1 \le l \le L$, it has been proved that ACT can limit the reconstruction error flowing back to early layers and enables the training to approach an approximately optimal solution (Chen et al., 2021).

2.3. DISCRETE COSINE TRANSFORMATION

Discrete Cosine Transformation (DCT) projects the target data from the spatial domain to the frequency domain via the inner product of the data and a collection of cosine functions with different frequencies (Rao & Yip, 2014). We focus on the 2D-DCT in this work, where the target data are the input images and activation maps of DNNs. Specifically, for 2D matrix data $H$, the frequency-domain feature $\tilde H$ is estimated by $\tilde H = \mathrm{DCT}(H)$, where $H$ and $\tilde H$ have the same shape $N \times N$, and each element $\tilde h_{i,j}$ is given by

$$\tilde h_{i,j} = \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} h_{m,n} \cos\left[\frac{\pi}{N}\left(m + \frac{1}{2}\right) i\right] \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right) j\right],$$

where $h_{m,n}$, $0 \le m, n \le N-1$, are the elements of the original matrix $H$. During the training of DNNs, an image or activation map has the shape Minibatch×Channel×$N$×$N$. In this case, the frequency-domain feature is estimated by applying the 2D-DCT to each $N \times N$ matrix in each channel.

With the DCT, we can extract the low-frequency/high-frequency component (LFC/HFC) of an image or activation map, using a pipeline of low-pass/high-pass masking and inverse DCT, as shown in Figure 2. To be concrete, the estimation of the LFC and HFC is given by

$$H^L = \mathrm{iDCT}(\tilde H \odot M), \tag{4}$$
$$H^H = \mathrm{iDCT}(\tilde H \odot (1_{N \times N} - M)), \tag{5}$$

where $\mathrm{iDCT}(\cdot)$ denotes the inverse DCT (Rao & Yip, 2014); $M = [m_{i,j} \mid 1 \le i, j \le N]$ denotes an $N \times N$ low-pass mask satisfying $m_{i,j} = 1$ for $1 \le i, j \le W$ and $m_{i,j} = 0$ for all other elements; and $1_{N \times N} - M$ is the high-pass mask. Intuitively, $H^L$ is described by $W^2$ non-zero float numbers in each channel, in contrast with $N^2 - W^2$ non-zero float numbers in each channel of $H^H$. Generally, we have $W \ll N$ in practical scenarios, e.g. $W/N = 0.1$ in Figure 2 (a). This indicates that the HFC takes the majority of the memory cost in the caching of activation maps.
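The LFC/HFC split of Equations (4) and (5) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: it uses the orthonormal DCT-II basis (which differs from the unnormalized sum above only by constant scale factors, so that the inverse is simply the transpose), and the helper names `dct_matrix`, `matmul`, `transpose` and `split_lfc_hfc` are hypothetical, not taken from the released code.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis: row i holds the cosine of frequency i."""
    rows = []
    for i in range(n):
        scale = math.sqrt((1.0 if i == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi / n * (m + 0.5) * i) for m in range(n)])
    return rows


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def split_lfc_hfc(h, w):
    """Return (H_L, H_H) for a square matrix h using a w x w low-pass mask."""
    n = len(h)
    c = dct_matrix(n)
    ct = transpose(c)
    freq = matmul(matmul(c, h), ct)            # 2D DCT: C H C^T
    low = [[freq[i][j] if (i < w and j < w) else 0.0 for j in range(n)]
           for i in range(n)]                  # H~ (.) M
    high = [[freq[i][j] - low[i][j] for j in range(n)]
            for i in range(n)]                 # H~ (.) (1 - M)
    h_l = matmul(matmul(ct, low), c)           # inverse DCT: C^T X C
    h_h = matmul(matmul(ct, high), c)
    return h_l, h_h


h = [[float(4 * i + j) for j in range(4)] for i in range(4)]
h_l, h_h = split_lfc_hfc(h, 1)   # W = 1 keeps only the DC coefficient
```

Because the two masks are complementary, `h_l` and `h_h` sum back to `h`; with `w = 1` the LFC reduces to the mean of the matrix in every position.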

3. CONTRIBUTION OF LFC AND HFC TO BACKWARD PROPAGATION

In this section, we show experimentally that the LFC of activation maps contributes significantly more to DNN backward propagation than the HFC. Moreover, our theoretical result indicates that the LFC bounds the estimated gradient within a tighter range around the optimal value, leading to a more accurate learned model, which is consistent with the experimental results.

[...] LFC-ACT takes the LFC into the backward function as given in Equation (6), where $H^L_l$ is estimated by Equation (4); HFC-ACT takes the HFC into the backward function as given in Equation (7), where $H^H_l$ is estimated according to Equation (5); normal training (for comparison) estimates the gradients by Equation (2).

$$[\hat\nabla_{H_{l-1}}, \hat\nabla_{W_l}] = \mathrm{backward}(\hat\nabla_{H_l}, H^L_l, W_l), \quad \triangleright \text{LFC-ACT} \tag{6}$$
$$[\hat\nabla_{H_{l-1}}, \hat\nabla_{W_l}] = \mathrm{backward}(\hat\nabla_{H_l}, H^H_l, W_l). \quad \triangleright \text{HFC-ACT} \tag{7}$$

[...] than that of the LFC, i.e., the storage of the HFC consumes the majority of memory. To better understand the results of model accuracy, we theoretically prove that the gradient for backward propagation is bounded within a tighter range around the optimal value in LFC-ACT. This enables LFC-ACT to learn a more accurate model than HFC-ACT.

3.2. THEORETICAL ANALYSIS

We theoretically analyze the gradient estimation error of LFC-ACT and HFC-ACT, which adopt Equations (6) and (7) for backward propagation, respectively. For LFC-ACT and HFC-ACT, let $\hat\nabla^L_{W_l}$ and $\hat\nabla^H_{W_l}$ denote the estimated gradient of layer $l$, respectively. In this way, $\|\hat\nabla^L_{W_l} - \nabla_{W_l}\|_F$ and $\|\hat\nabla^H_{W_l} - \nabla_{W_l}\|_F$ indicate the gradient estimation errors, taking the complete gradient $\nabla_{W_l}$ as a reference. To compare the distortion of backward propagation in LFC-ACT and HFC-ACT, let $\mathrm{GEB}^L_l$ and $\mathrm{GEB}^H_l$ denote the respective gradient error upper bounds (GEB), i.e. $\|\hat\nabla^L_{W_l} - \nabla_{W_l}\|_F \le \mathrm{GEB}^L_l$ and $\|\hat\nabla^H_{W_l} - \nabla_{W_l}\|_F \le \mathrm{GEB}^H_l$. Intuitively, a higher GEB indicates less accurate backward propagation, leading to a less accurate model after training. To this end, we give Theorem 1 to compare $\mathrm{GEB}^L_l$ and $\mathrm{GEB}^H_l$, where a convolutional layer is considered. The proof is given in Appendix B. A similar analysis of the GEB for a linear layer is provided in Appendix C.

Theorem 1. During the backward pass of a convolutional layer $l$, $\mathrm{GEB}^L_l$ and $\mathrm{GEB}^H_l$ satisfy

$$\mathrm{GEB}^L_l - \mathrm{GEB}^H_l = \left(\alpha_{l,l} \|H^T_{l-1}\|_F + \beta_l\right)(\lambda^H_l - \lambda^L_l) + \|H^T_{l-1}\|_F \sum_{i=l+1}^{L} \alpha_{l,i} (\lambda^H_i - \lambda^L_i) \prod_{j=l}^{i-1} \gamma_j,$$

where $\alpha_{l,i}, \beta_l, \gamma_l > 0$ for $1 \le l, i \le L$ depend on the model weights before backward propagation (given by Equation (24) in Appendix B); $\lambda^L_l = \|\tilde H_l \odot M\|_F$; $\lambda^H_l = \|\tilde H_l \odot (1 - M)\|_F$; $\tilde H_l = \mathrm{DCT}(H_l)$; and $M$ denotes the low-pass mask given by Equation (4).

In this section, from both experimental and theoretical perspectives, we have shown that the HFC of activation maps contributes less to backward propagation than the LFC. However, according to Figure 2

4. DUAL ACTIVATION PRECISION TRAINING

We introduce the proposed Dual ActIVation PrecISION (DIVISION) in this section. The framework of DIVISION is shown in Figure 4. Specifically, after the feed-forward operation of each layer, DIVISION estimates the LFC and compresses the HFC into a low-precision copy, so that the total memory cost is significantly decreased after the compression. Before the backward propagation of each layer, the low-precision HFC is decompressed and combined with the LFC to reconstruct the activation map. The compression and decompression are detailed as follows.

4.1. ACTIVATION MAP COMPRESSION

To compress the activation map $H_l$ of layer $l$, DIVISION estimates the LFC $H^L_l$ and HFC $H^H_l$ after the feed-forward operation. However, the high computational complexity of the DCT prevents us from directly applying it in real-time algorithms. We thus give Theorem 2 to introduce a moving average operation that approximates the low-pass filter. The proof is given in Appendix E.

Theorem 2. For any real-valued function $f(x)$ and its moving average $\bar f(x) = \frac{1}{2B} \int_{x}^{x+2B} f(t)\,dt$, let $F(\omega)$ and $\bar F(\omega)$ denote the Fourier transformation of $f(x)$ and $\bar f(x)$, respectively. Generally, we have $\bar F(\omega) = H(\omega) F(\omega)$, where $|H(\omega)| = \left|\frac{\sin \omega B}{\omega B}\right|$.

Remark 1. The frequency response of $H(\omega)$ depends on its envelope function $\frac{1}{|\omega B|}$. Note that $\frac{1}{|\omega B|}$ decreases with $|\omega|$ such that $\frac{1}{|\omega B|} \to 0$ as $\omega \to \infty$. Hence, $H(\omega)$ is an approximate low-pass filter.

According to Remark 1, we approximate the LFC $H^L_l$ by the moving average of $H_l$. Note that the average pooling operator provides an efficient moving average. DIVISION adopts average pooling to estimate the LFC by

$$H^L_l = \mathrm{AveragePooling}(H_l).$$

The block size and the moving stride share a unified hyper-parameter $B$, which controls the memory of $H^L_l$. Moreover, $H^L_l$ is cached in the bfloat16 format to save memory. In our experiments, we found that $B = 8$ provides a representative LFC for backward propagation, where the memory cost of $H^L_l$ is only 0.8% of $H_l$. To estimate the HFC, DIVISION calculates the residual value

$$H^H_l = H_l - \mathrm{UpSampling}(H^L_l),$$

where $\mathrm{UpSampling}(\cdot)$ enlarges $H^L_l$ to shape Minibatch×Channel×$N$×$N$ via nearest interpolation. Then, DIVISION compresses $H^H_l$ into low precision because it plays a less important role during backward propagation but consumes most of the memory. Specifically, DIVISION adopts $Q$-bit per-channel quantization for the compression, where the bit-width $Q$ controls the precision and memory cost of the HFC after compression.
Let $V^H_l$ denote a $Q$-bit integer matrix, the low-precision representation of $H^H_l$. The detailed procedure of compressing $H^H_l$ into $V^H_l$ is given by

$$V^H_l = \mathrm{Quant}(H^H_l) = \left\lfloor \Delta_l^{-1} (H^H_l - \delta_l) \right\rceil, \tag{9}$$

where $\delta_l$ denotes the minimum element in $H^H_l$; $\Delta_l = (h_{\max} - \delta_l)/(2^Q - 1)$ denotes the quantization step; $h_{\max}$ denotes the maximum element in $H^H_l$; $\lfloor \cdot \rceil$ denotes stochastic rounding (Gupta et al., 2015); and $\delta_l$ and $\Delta_l$ are cached in the bfloat16 format to save memory. In this way, the memory cost of $(V^H_l, \delta_l, h_{\max})$ is $(N^2 Q/8 + 4)$ bytes per channel, in contrast with $4N^2$ bytes per channel for $H_l$. In our experiments, we found that $Q = 2$ provides enough representation for backward propagation, where the memory cost of $V^H_l$ is only 8.3% of $H_l$. After the compression, the tuple $(H^L_l, V^H_l, \Delta_l, \delta_l)$ is cached to memory as the representation of $H_l$ for reconstructing the activation maps during the backward pass.
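The compression steps of this subsection (average pooling, nearest up-sampling, residual, and Q-bit quantization with stochastic rounding) can be sketched for a single N×N channel as follows. This is a simplified illustration in plain Python, assuming that N is divisible by B and omitting the bfloat16 casting; the helper names `average_pool`, `upsample_nearest`, `stochastic_round` and `compress` are hypothetical and are not the released implementation.

```python
import math
import random


def average_pool(h, b):
    """Mean over non-overlapping b x b blocks (block size = stride = b)."""
    k = len(h) // b
    return [[sum(h[i * b + di][j * b + dj] for di in range(b) for dj in range(b)) / (b * b)
             for j in range(k)] for i in range(k)]


def upsample_nearest(lfc, b):
    """Repeat every LFC value over its b x b block (nearest interpolation)."""
    size = len(lfc) * b
    return [[lfc[i // b][j // b] for j in range(size)] for i in range(size)]


def stochastic_round(x, rng):
    """Round up with probability equal to the fractional part of x."""
    low = math.floor(x)
    return low + (1 if rng.random() < x - low else 0)


def compress(h, b=8, q=2, rng=None):
    """Return (LFC, Q-bit codes, step, offset) for one N x N channel."""
    rng = rng or random.Random(0)
    lfc = average_pool(h, b)
    up = upsample_nearest(lfc, b)
    hfc = [[x - u for x, u in zip(rx, ru)] for rx, ru in zip(h, up)]   # residual
    offset = min(min(row) for row in hfc)                              # delta_l
    h_max = max(max(row) for row in hfc)
    levels = 2 ** q - 1
    step = (h_max - offset) / levels or 1.0    # Delta_l; guard against a constant HFC
    codes = [[min(levels, stochastic_round((v - offset) / step, rng)) for v in row]
             for row in hfc]
    return lfc, codes, step, offset


rng = random.Random(1)
h = [[rng.uniform(-1.0, 1.0) for _ in range(16)] for _ in range(16)]
lfc, codes, step, offset = compress(h, b=8, q=2)
```

With `b=8` and `q=2`, the 16×16 channel is stored as a 2×2 LFC plus one 2-bit code per element; since stochastic rounding picks one of the two neighbouring levels, each reconstructed value differs from the original by at most one quantization step.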

4.2. ACTIVATION MAP DECOMPRESSION

During the backward pass, DIVISION adopts the cached tuples $\{(H^L_l, V^H_l, \Delta_l, \delta_l) \mid 0 \le l \le L-1\}$ to reconstruct the activation maps layer by layer. Specifically, for each layer $l$, DIVISION dequantizes the HFC via $\hat H^H_l = \Delta_l V^H_l + \delta_l$, which is the inverse process of Equation (9). Then, the activation map is reconstructed via

$$\hat H_l = \mathrm{UpSampling}(H^L_l) + \hat H^H_l,$$

where $\mathrm{UpSampling}(\cdot)$ enlarges $H^L_l$ to shape Minibatch×Channel×$N$×$N$ via nearest interpolation. After the decompression, DIVISION frees the cache of $(H^L_l, V^H_l, \Delta_l, \delta_l)$, and takes $\hat H_l$ into $[\hat\nabla_{H_{l-1}}, \hat\nabla_{W_l}] = \mathrm{backward}(\hat\nabla_{H_l}, \hat H_{l-1}, W_l)$ to estimate the gradient for backward propagation. Without loss of generality, 1D/3D activation maps are considered for DIVISION in Appendix F.
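The decompression of one channel can be sketched as follows. This is a minimal plain-Python illustration under the assumption that the cached tuple is available as nested lists and floats; the function names `upsample_nearest`, `dequantize` and `decompress` and the toy values are hypothetical.

```python
def upsample_nearest(lfc, b):
    """Repeat every LFC value over its b x b block (nearest interpolation)."""
    size = len(lfc) * b
    return [[lfc[i // b][j // b] for j in range(size)] for i in range(size)]


def dequantize(codes, step, offset):
    """Inverse of the quantizer: H^H_hat = Delta * V + delta."""
    return [[step * v + offset for v in row] for row in codes]


def decompress(lfc, codes, step, offset, b):
    """Reconstruct H_hat = UpSampling(H^L) + H^H_hat for one channel."""
    up = upsample_nearest(lfc, b)
    hfc_hat = dequantize(codes, step, offset)
    return [[u + r for u, r in zip(ru, rr)] for ru, rr in zip(up, hfc_hat)]


# Toy cached tuple: 2 x 2 LFC, 2-bit codes for a 4 x 4 channel, B = 2.
lfc = [[1.0, 2.0], [3.0, 4.0]]
codes = [[0, 1, 2, 3],
         [3, 2, 1, 0],
         [0, 0, 3, 3],
         [1, 1, 2, 2]]
h_hat = decompress(lfc, codes, step=0.5, offset=-0.75, b=2)
```

For instance, the top-left element is reconstructed as 1.0 + (0.5 · 0 − 0.75) = 0.25.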

4.3. ALGORITHM OF DIVISION

Algorithm 1 Mini-batch updating of DIVISION
Input: Mini-batch samples $x$ and labels $y$.
Output: Weight and bias $\{W_l, B_l \mid 1 \le l \le L\}$.
 1: for layer $l := 1$ to $L$ do
 2:     $H_l = f(W_l H_{l-1} + B_l)$    // $H_0 = x$
 3:     $H^L_{l-1} = \mathrm{AveragePooling}(H_{l-1})$
 4:     $H^H_{l-1} = H_{l-1} - \mathrm{UpSampling}(H^L_{l-1})$
 5:     $V^H_{l-1}, \Delta_{l-1}, \delta_{l-1} = \mathrm{Quant}(H^H_{l-1})$
 6:     Cache $(H^L_{l-1}, V^H_{l-1}, \Delta_{l-1}, \delta_{l-1})$
 7: end for
 8: Estimate the loss value and gradient $\hat\nabla_{H_L}$.
 9: for layer $l := L$ to $1$ do
10:     $\hat H^H_{l-1} = \mathrm{Dequant}(V^H_{l-1}, \Delta_{l-1}, \delta_{l-1})$
11:     $\hat H_{l-1} = \mathrm{UpSampling}(H^L_{l-1}) + \hat H^H_{l-1}$
12:     Estimate $[\hat\nabla_{H_{l-1}}, \hat\nabla_{W_l}]$ and update $W_l$.
13:     Free $(H^L_{l-1}, V^H_{l-1}, \Delta_{l-1}, \delta_{l-1})$.
14: end for

Algorithm 1 demonstrates one mini-batch update of DIVISION, which includes a forward pass and a backward pass. During the forward pass of each layer, DIVISION first forwards the exact activation map to the next layer (line 2); then estimates the LFC and HFC (lines 3-4); after this, obtains the low-precision copy of the HFC (line 5); and finally caches the representation to memory (line 6). During the backward pass of each layer, DIVISION first decompresses the HFC (line 10); then reconstructs the activation map (line 11); after this, estimates the gradients and updates the weights of layer $l$ (line 12); and finally frees the cache of $(H^L_{l-1}, V^H_{l-1}, \Delta_{l-1}, \delta_{l-1})$ (line 13). For each mini-batch update, the memory usage reaches its maximum after the forward pass (caching the representation of activation maps layer by layer), and falls to its minimum after the backward pass (freeing the cache layer by layer). Existing work (Chen et al., 2021)

5.2. EVALUATION BY MEMORY COST (RQ1)

We evaluate the training methods in terms of the training memory cost on the ImageNet dataset, where the configuration of our computational infrastructure is given in Appendix R. Table 1 gives the training memory cost and practical compression rate of DIVISION and the baseline methods. Moreover, DIVISION is compared with the checkpoint strategy of Megatron-LM (Shoeybi et al., 2019) in Appendix O. Overall, we have the following observations:
• DIV vs SWAP, Checkpoint & BLPA: SWAP reduces the GPU memory cost merely by transferring the overhead from GPU to CPU, which is ineffective when the memory utilization of both GPU and CPU is considered. Checkpoint shows considerable memory overhead because it caches some key activation maps to reconstruct other activation maps during the backward pass. BLPA is less effective than DIVISION because it relies on at least 4-bit compression.
• DIV vs AC-GC: The practical memory cost of AC-GC should be greater than the values given in Table 1. AC-GC searches the bit-width from an initial maximum value and ends with an optimal bit-width. Thus, the average memory cost should be greater than that in the last epoch.
• Act. Maps: For normal training, the caching of activation maps takes the majority of the memory cost (>90%, growing with the mini-batch size), which is consistent with our discussion in Section 1.
• Compression rate: The activation map compression rate of DIVISION is consistent with the theoretical results ($R_{\text{ResNet-50}}, R_{\text{WRN-50-2}} \ge 10.35$, see Appendix G), and is not influenced by the mini-batch size. Moreover, the overall compression rate grows with the mini-batch size.

5.3. EVALUATION BY TRAINING THROUGHPUT (RQ1)

We now evaluate the training methods in terms of the training throughput on the ImageNet dataset. Generally, the throughput indicates the running speed of a training method by counting the average number of data samples processed per second. The throughput is given by $\frac{\text{Mini-batch Size}}{T_{\text{batch}}}$, where $T_{\text{batch}}$ denotes the time consumption of a single mini-batch update. Each method is combined with automatic mixed precision (AMP) to speed up the training. Figures 6 (a) and (b) show the throughput averaged over 20 mini-batch updates. Overall, we have the following observations:
• Reason for time overhead: Compared with normal training, the time overhead of DIVISION comes from the estimation of the LFC and the compression of the HFC. In ActNN, the overhead mainly comes from the dynamic bit-width allocation and activation map quantization. In Checkpoint, it comes from replaying the forward process of intermediate layers. In SWAP, the overhead mainly derives from the communication cost between the CPUs and GPUs.
• DIV vs ActNN: DIVISION shows higher throughput than ActNN as a result of simpler data compression. To be concrete, DIVISION adopts average pooling to extract the LFC, and a fixed bit-width per-channel quantization to compress the HFC. In contrast, ActNN relies on searching for the optimal bit-width to match different samples, and on per-group quantization based on the searched bit-width. ActNN has more complex processing, which leads to its lower throughput.
• DIV vs SWAP: SWAP is less efficient than the ACT-based methods (DIVISION and ActNN), which indicates that the CPU-GPU communication cost is larger than the cost of activation map processing.
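The throughput definition above, mini-batch size divided by the average time of one mini-batch update, can be measured with a small timing helper. This is a hedged sketch: `measure_throughput` and the dummy `step_fn` are hypothetical names, the default of 20 updates follows the averaging used in this section, and on a GPU the device would additionally have to be synchronized before each clock read for the timing to be meaningful.

```python
import time


def measure_throughput(step_fn, batch_size, n_steps=20):
    """Samples per second, averaged over n_steps mini-batch updates."""
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()                       # one full mini-batch update
    t_batch = (time.perf_counter() - start) / n_steps
    return batch_size / t_batch


def dummy_step():
    """Stand-in for a real training step (assumption for illustration)."""
    return sum(i * i for i in range(10000))


samples_per_second = measure_throughput(dummy_step, batch_size=256)
```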

5.4. EFFECT OF DUAL PRECISION STRATEGY (RQ2)

To study the effect of our proposed dual precision strategy, DIVISION is compared with three training methods:
• DIVISION w/o HFC: Merely caching the high-precision LFC for back-propagation.
• DIVISION w/o LFC: Merely caching the low-precision HFC for back-propagation.
• Fixed Quant: Compressing the activation maps using a fixed bit-width quantization.
The experiments are conducted on the ImageNet dataset using the hyper-parameters given in Appendix S. More experiments on Fixed Quant with different bit-widths are given in Appendix P. The model accuracy is given in Figure 6 (c). Overall, we have the following insights:
• LFC & Low Precision HFC: When either the HFC or the LFC is removed from DIVISION, the training converges to far lower levels of accuracy. This indicates that both the LFC and the low-precision HFC of activation maps are necessary for the training to converge to an optimal solution.
• Benefits of Dual Precision: The fixed bit-width quantization fails to converge to an optimal solution. This indicates that the noise caused by the fixed bit-width quantization can severely disturb back-propagation. DIVISION solves this problem by combining a high-precision LFC with a fixed bit-width quantization for compressing the activation maps.

5.5. HYPER-PARAMETER TUNING FOR DIVISION (RQ3)

(a) Effect of Hyper-parameters. We study the effect of the hyper-parameters $B$ (block size) and $Q$ (bit-width) on the accuracy and compression rate. Specifically, we adopt DIVISION to train ResNet-18 on the CIFAR-10 dataset with $B \in \{8, 12, 18\}$ and $Q \in \{2, 4, 8\}$. The accuracy versus compression rate is shown in Figure 7 (a). Overall, we have the following insights:
• Effect of Q: The accuracy is stable (consistently close to 95%) when the precision level of the HFC is reduced ($Q$ decreases from 8 to 2). This indicates that DIVISION only requires approximate values of the HFC during backward propagation.
• Effect of B: A lower-precision LFC for backward propagation leads to significant degradation of accuracy (as $B$ grows from 8 to 18). This is because DIVISION relies on a high-precision LFC to reconstruct the activation maps for backward propagation.
• Optimal Setting: DIVISION has the best accuracy-compression trade-off with $B = 8$ and $Q = 2$, where the degradation of accuracy is less than 0.4%. According to further empirical studies in Appendix N, $B = 8$ and $Q = 2$ can serve as a default setting that is effective for most model architectures and datasets.
Note that normal training can be accelerated by automatic mixed precision (AMP) (Micikevicius et al., 2017) without loss of accuracy. We study whether AMP can speed up DIVISION without loss of accuracy. Specifically, we follow the setting of DIVISION with $B = 8$, $Q = 2$ to train ResNet-18 on the CIFAR-10 dataset. The accuracy and training throughput of DIVISION with and without AMP are shown in Figure 7 (b). More experiments with different mini-batch sizes are given in Appendix Q. We observe that AMP can significantly speed up DIVISION when the mini-batch size is at least 256, without loss of model accuracy. This indicates that DIVISION has the potential to be applied to scenarios where both time and memory are limited.
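To give a feel for how B and Q trade memory against precision, the per-channel byte counts stated in Section 4.1 (a bfloat16 LFC, Q-bit HFC codes plus 4 bytes of quantization metadata, against 4N² bytes for the float activation map) can be turned into a small calculator. This is a back-of-the-envelope sketch for a single N×N channel: the rounding-up of N/B for non-divisible sizes and the names `stored_bytes_per_channel` and `compression_rate` are assumptions, and the resulting numbers need not match the overall rates reported in the figures, which aggregate over layers of different sizes.

```python
import math


def stored_bytes_per_channel(n, b, q):
    """Bytes cached by the dual-precision representation of one N x N channel."""
    lfc_bytes = 2 * math.ceil(n / b) ** 2    # bfloat16 LFC after B x B pooling (ceil is an assumption)
    hfc_bytes = n * n * q / 8 + 4            # Q-bit codes + two bfloat16 scalars
    return lfc_bytes + hfc_bytes


def compression_rate(n, b, q):
    """Ratio of the float32 activation size (4 N^2 bytes) to the cached size."""
    return 4 * n * n / stored_bytes_per_channel(n, b, q)


# Grid used in the hyper-parameter study, evaluated for a 32 x 32 channel.
rates = {(b, q): compression_rate(32, b, q) for b in (8, 12, 18) for q in (2, 4, 8)}
```

Under these assumptions the rate is dominated by Q, since the HFC codes scale with N² while the LFC shrinks by roughly a factor of B²; B mainly affects how much low-frequency detail is kept at high precision.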

6. CONCLUSION

In this work, we propose a simple framework of activation compressed training. Our framework is motivated by an instructive observation: DNN backward propagation mainly depends on the LFC of the activation maps, while the majority of memory is for the storage of HFC during the training. This indicates back-propagation mainly utilizes the LFC to estimate the gradient, while the HFC is highly redundant and compressible. Following this direction, our proposed DIVISION compresses the activation maps into dual precision representations: high-precision LFC and low-precision HFC, according to their contributions to the back-propagation. This dual precision compression can significantly reduce the memory cost of activation maps without disturbing the training. Different from the existing work of ACT, DIVISION is a simple and transparent framework, where the simplicity enables efficient compression and decompression; and transparency allows us to understand the compressible (HFC) and non-compressible factors (LFC) during DNN training. To this end, we hope our work could provide some inspiration for the compression of DNN training.

APPENDIX A IMPLEMENTATION DETAILS OF SECTION 3

We give the details of the experiment in Section 3. Without loss of generality, the experiment is conducted on the CIFAR-10 dataset using ResNet-18, DenseNet-121 and ShuffleNet-V2. During the backward propagation of normal training, the gradient of each layer $l$ is estimated by

$$[\hat\nabla_{H_{l-1}}, \hat\nabla_{W_l}] = \mathrm{backward}(\hat\nabla_{H_l}, H_l, W_l). \tag{12}$$

For LFC-ACT, the gradient is estimated by

$$[\hat\nabla_{H_{l-1}}, \hat\nabla_{W_l}] = \mathrm{backward}(\hat\nabla_{H_l}, H^L_l, W_l), \tag{13}$$

where $H^L_l$ denotes the LFC of $H_l$; for HFC-ACT, the gradient is estimated by

$$[\hat\nabla_{H_{l-1}}, \hat\nabla_{W_l}] = \mathrm{backward}(\hat\nabla_{H_l}, H^H_l, W_l), \tag{14}$$

where $H^H_l$ denotes the HFC of $H_l$. Note that Equations (13) and (14) cause the distortion of backward propagation in LFC-ACT and HFC-ACT, respectively. The objective of this experiment is to investigate whether this distortion of back-propagation is strong enough to lead the training to a non-optimal solution. The hyper-parameter setting of the training is given in Table 2.

Table 2: Hyper-parameter setting.
LR scheduler          Step LR   Step LR   Step LR
Weight-decay          0.0005    0.0005    0.0005
Optimizer             SGD       SGD       SGD
Momentum              0.9       0.9       0.9
Ratio of LFC (W/N)    0.3       0.3       0.5

B PROOF OF THEOREM 1

We prove Theorem 1 in this section. Theorem 1 During the backward pass of a convolutional layer l, GEB L l and GEB H l satisfy GEB L l -GEB H l = α l,l ||H T l-1 || F +β l (λ H l -λ L l )+ ||H T l-1 || F L i=l+1 α l,i (λ H i -λ L i ) i-1 j=l γ j , ( ) where α l,i , β l , γ l > 0 for 1 ≤ l, i ≤ L are given by Equation ( 24); λ L l = || H l ⊙ M|| F ; λ H l = || H l ⊙(1-M)|| F ; H l = DCT(H l ) ; and M denotes the loss-pass mask given by Equation (4). Proof. For simplicity of derivation, we study the case with a single input channel and output channel number. In this case, H l and W l are 2-D matrix for each layer l, where 1 ≤ l ≤ L. The backward propagation of a convolutional layer is given by ∇Z l = ∇Z l+1 * W rot l+1 ⊙ σ ′ ( Ẑl ), ∇W l = ∇Z l * ĤT l-1 , where * denotes a convolutional operation; Ẑl = W l * Ĥl-1 + b l ; b l denotes the bias of layer l; and W rot l denotes to rotate W l by 180 • . The case of multiple input and output channels can be proved in an analogous way, which is omitted in this work. According to Equation ( 16), we have the gradient of Z l given by ∇Z l -∇ Z l = ∇Z l+1 * W rot l+1 ⊙ σ ′ ( Ẑl ) -∇ Z l+1 * W rot l+1 ⊙ σ ′ (Z l ), = ∇Z l+1 * W rot l+1 ⊙σ ′ ( Ẑl )-∇Z l+1 * W rot l+1 ⊙σ ′ (Z l )+ ∇Z l+1 * W rot l+1 ⊙σ ′ (Z l )-∇ Z l+1 * W rot l+1 ⊙σ ′ (Z l ), = ∇Z l+1 * W rot l+1 ⊙ [σ ′ ( Ẑl ) -σ ′ (Z l )] + ( ∇Z l+1 -∇ Z l+1 ) * W rot l+1 ⊙ σ ′ (Z l ). ( ) For the activation functions ReLu(•), LeakyReLu(•), Sigmoid(•), Tanh(•) and SoftPlus(•), the gradient σ ′ (•) satisfies |σ ′′ (•)| ≤ 1 in the differentable domains. Note that we have ||W l * H l-1 || F ≤ (K l + N l -1)||W l || F ||H l-1 || F according to Corollary 1. ||σ ′ ( Ẑl ) -σ ′ (Z l )|| F satisfies ||σ ′ ( Ẑl ) -σ ′ (Z l )|| F ≤ || Ẑl -Z l || F ≤ (K l + N l -1)|| Ĥl-1 -H l-1 || F ||W ′ l || F , where K l and N l denote the size of convolutional kernel W l and activation map H l in layer l, respectively. 
After taking Equation ( 18) into Equation ( 17), we have || ∇Z l -∇ Z l || F ≤ (K l + N l -1)|| ∇Z l+1 || F ||W rot l+1 || F ||σ ′ ( Ẑl ) -σ ′ (Z l )|| F + (K l + N l -1)|| ∇Z l+1 -∇ Z l+1 || F ||W rot l+1 || F ||σ ′ (Z l )|| F , = (K l + N l -1) 2 || ∇Z l+1 || F ||W rot l+1 || F || Ĥl-1 -H l-1 || F ||W ′ l || F + (K l + N l -1)|| ∇Z l+1 -∇ Z l+1 || F ||W rot l+1 || F ||σ ′ (Z l )|| F , = η l || Ĥl-1 -H l-1 || F + γ l || ∇Z l+1 -∇ Z l+1 || F , where η l and γ l are given by η l = (K l + N l -1) 2 || ∇Z l+1 || F ||W l+1 || F ||W ′ l || F ; γ l = (K l + N l -1)||W l+1 || F ||σ ′ (Z l )|| F ; the value η l and γ l depend on the model weight before backward propagation, which is constant with respect to the gradient. Iterate Equation ( 19) until l = L where || ∇Z L -∇ Z L || F ≤ η L || ĤL-1 - H L-1 || F . In this way, we have || ∇Z l -∇ Z l || F ≤ η l || Ĥl-1 -H l-1 || F + L i=l+1 η i || Ĥi-1 -H i-1 || F i-1 j=l γ j . According to Equation ( 16), we have the gradient of W l given by ∇W l -∇ W l = ∇Z l * ĤT l-1 -∇ Z l * H T l-1 , = ∇Z l * ĤT l-1 -∇Z l * H T l-1 + ∇Z l * H T l-1 -∇ Z l * H T l-1 , = ∇Z l * ( ĤT l-1 -H T l-1 ) + ( ∇Z l -∇ Z l ) * H T l-1 . 
( ) Taking Equation ( 21) into Equation ( 22), we have || ∇W l -∇ W l || F ≤ (K l + N l -1)|| ∇Z l || F || ĤT l-1 -H T l-1 || F + (K l + N l -1)|| ∇Z l -∇ Z l || F ||H T l-1 || F , ≤(K l +N l -1) || ∇Z l || F || ĤT l-1 -H T l-1 || F +||H T l-1 || F η l || Ĥl-1 -H l-1 || F + L i=l+1 η i || Ĥi-1 -H i-1 || F i-1 j=l γ j =(K l +N l -1) || ∇Z l || F +η l ||H T l-1 || F || ĤT l-1 -H T l-1 || F +||H T l-1 || F L i=l+1 η i || Ĥi-1 -H i-1 || F i-1 j=l γ j , = β l +α l,l ||H T l-1 || F || ĤT l-1 -H T l-1 || F +||H T l-1 || F L i=l+1 α l,i || ĤT i-1 -H T i-1 || F i-1 j=l γ j , where α l,i = (K l + N l -1)(K i + N i -1) 2 || ∇Zi+1 || F ||W i+1 || F ||W ′ i || F ; β l = (K l + N l -1)|| ∇Z l || F ; γ l = (K l + N l -1)||W l+1 || F ||σ ′ (Z l )|| F ; (24) K l and N l denote the size of convolutional kernel W l and activation map H l in layer l, respectively. During the LFC-ACT and HFC-ACT trainings, the activation map of a convolutional layer satisfies ||H l -H L l || F = || H l -H L l || F = || H l ⊙ (1 -M)|| F ≜ λ H l , ||H l -H H l || F = || H l -H H l || F = || H l ⊙ M|| F ≜ λ L l . Taking Equations ( 25) and ( 26) into ( 23), we have GEB L l and GEB H l of a convolutional layer by || ∇W l -∇ L W l || F ≤ α l,l ||H T l-1 || F +β l λ H l + ||H T l-1 || F L i=l+1 α l,i λ H i i-1 j=l γ j ≜ GEB L l , || ∇W l -∇ H W l || F ≤ α l,l ||H T l-1 || F +β l λ L l + ||H T l-1 || F L i=l+1 α l,i λ L i i-1 j=l γ j ≜ GEB H l . Given the expression of GEB L l and GEB H l by Equations ( 27) and ( 28), respectively, we have the GEB for a convolutional layer given by GEB L l -GEB H l = α l,l ||H T l-1 || F +β l (λ H l -λ L l )+ ||H T l-1 || F L i=l+1 α l,i (λ H i -λ L i ) i-1 j=l γ j . Corollary 1. For a K × K convolutional kernel and a N × N square matrix H, we have the ||W * H|| F ≤ (K + N -1)||W|| F ||H|| F Proof. 
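According to the relation between the convolutional operation and the Discrete Fourier Transformation (Sundararajan, 2001), $W * H$ satisfies
$$\mathrm{FFT}(W * H) = \mathrm{FFT}(\mathrm{ZP}(W)) \odot \mathrm{FFT}(\mathrm{ZP}(H)),$$
where $\mathrm{FFT}(\cdot)$ denotes the discrete Fourier transformation; $\mathrm{ZP}(W)$ denotes zero-padding $W$ into a $(K+N-1)\times(K+N-1)$ matrix. According to Parseval's theorem (Diniz et al., 2010), $\mathrm{FFT}(\mathrm{ZP}(W))$, $\mathrm{FFT}(\mathrm{ZP}(H))$ and $\mathrm{FFT}(W * H)$ satisfy
$$\|\mathrm{FFT}(\mathrm{ZP}(W))\|_F = (K+N-1)\,\|W\|_F, \quad \|\mathrm{FFT}(\mathrm{ZP}(H))\|_F = (K+N-1)\,\|H\|_F, \quad \|\mathrm{FFT}(W * H)\|_F = (K+N-1)\,\|W * H\|_F.$$
Taking $\|A_1 \odot A_2\|_F \le \|A_1\|_F\,\|A_2\|_F$ into Equation (31), we have
$$\|\mathrm{FFT}(\mathrm{ZP}(W)) \odot \mathrm{FFT}(\mathrm{ZP}(H))\|_F \le \|\mathrm{FFT}(\mathrm{ZP}(W))\|_F\,\|\mathrm{FFT}(\mathrm{ZP}(H))\|_F.$$
Taking Equation (31) into Equation (32), we have
$$(K+N-1)\,\|W * H\|_F = \|\mathrm{FFT}(\mathrm{ZP}(W)) \odot \mathrm{FFT}(\mathrm{ZP}(H))\|_F \le \|\mathrm{FFT}(\mathrm{ZP}(W))\|_F\,\|\mathrm{FFT}(\mathrm{ZP}(H))\|_F = (K+N-1)\,\|W\|_F\,(K+N-1)\,\|H\|_F.$$
Dividing both sides by $(K+N-1)$ gives $\|W * H\|_F \le (K+N-1)\,\|W\|_F\,\|H\|_F$.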
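The bound of Corollary 1 can be checked numerically on small random inputs. The sketch below is only an illustration, not part of the proof: it assumes that $W * H$ is the full 2-D convolution with a $(K+N-1)\times(K+N-1)$ output (matching the zero-padded size used above), and the helper names `full_conv2d`, `frobenius` and `corollary_gap` are hypothetical.

```python
import math
import random


def full_conv2d(w, h):
    """Full 2-D convolution of a K x K kernel with an N x N matrix.

    The output has size (K+N-1) x (K+N-1), the zero-padded size in the proof.
    """
    k, n = len(w), len(h)
    size = k + n - 1
    out = [[0.0] * size for _ in range(size)]
    for a in range(k):
        for b in range(k):
            for c in range(n):
                for d in range(n):
                    out[a + c][b + d] += w[a][b] * h[c][d]
    return out


def frobenius(m):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(x * x for row in m for x in row))


def corollary_gap(k, n, seed=0):
    """Return (||W*H||_F, (K+N-1) ||W||_F ||H||_F) for random Gaussian W and H."""
    rng = random.Random(seed)
    w = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(k)]
    h = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    lhs = frobenius(full_conv2d(w, h))
    rhs = (k + n - 1) * frobenius(w) * frobenius(h)
    return lhs, rhs


lhs, rhs = corollary_gap(k=3, n=8)
print(f"||W*H||_F = {lhs:.3f}  <=  bound = {rhs:.3f}")
```

On random inputs the left-hand side is typically far below the bound, which suggests the corollary is loose but sufficient for the error analysis.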

C GRADIENT ERROR BOUND (GEB) OF A LINEAR LAYER

We give the Gradient Error upper Bound (GEB) of a linear layer and its proof in this section.

Theorem 1B. During the backward pass of a linear layer $l$, $\mathrm{GEB}^L_l$ and $\mathrm{GEB}^H_l$ satisfy
$$\mathrm{GEB}^L_l - \mathrm{GEB}^H_l = \big(\alpha_l\,\|H^T_{l-1}\|_F + \beta_l\big)(\lambda^H_l - \lambda^L_l) + \|H^T_{l-1}\|_F\sum_{i=l+1}^{L}\alpha_i\,(\lambda^H_i - \lambda^L_i)\prod_{j=l}^{i-1}\gamma_j,$$
where $\alpha_l, \beta_l, \gamma_l > 0$ for $1 \le l \le L$ are given by Equation (42); $\lambda^L_l = \|\tilde{H}_l \odot M\|_F$; $\lambda^H_l = \|\tilde{H}_l \odot (1-M)\|_F$; $\tilde{H}_l = \mathrm{DCT}(H_l)$; and $M$ denotes the 1-D low-pass mask.

Proof. For simplicity of derivation, we consider the case MiniBatch = 1. In this case, $H_l$ is a vector and $W_l$ is a 2-D matrix, for $1 \le l \le L$. The backward propagation of a linear layer is given by
$$\hat{\nabla}_{Z_l} = (W_{l+1}\hat{\nabla}_{Z_{l+1}}) \odot \sigma'(\hat{Z}_l), \qquad \hat{\nabla}_{W_l} = \hat{\nabla}_{Z_l}\hat{H}^T_{l-1}, \tag{34}$$
where $\hat{Z}_l = W^T_l \hat{H}_{l-1} + b_l$; and $b_l$ denotes the bias of layer $l$. The case of MiniBatch ≥ 2 can be proved in an analogous way, which is omitted in this work. According to Equation (34), we have the gradient of $Z_l$ given by
$$\begin{aligned}
\hat{\nabla}_{Z_l} - \nabla_{Z_l} &= W_{l+1}\hat{\nabla}_{Z_{l+1}} \odot \sigma'(\hat{Z}_l) - W_{l+1}\nabla_{Z_{l+1}} \odot \sigma'(Z_l) \\
&= W_{l+1}\hat{\nabla}_{Z_{l+1}} \odot \sigma'(\hat{Z}_l) - W_{l+1}\hat{\nabla}_{Z_{l+1}} \odot \sigma'(Z_l) + W_{l+1}\hat{\nabla}_{Z_{l+1}} \odot \sigma'(Z_l) - W_{l+1}\nabla_{Z_{l+1}} \odot \sigma'(Z_l) \\
&= W_{l+1}\hat{\nabla}_{Z_{l+1}} \odot [\sigma'(\hat{Z}_l) - \sigma'(Z_l)] + (W_{l+1}\hat{\nabla}_{Z_{l+1}} - W_{l+1}\nabla_{Z_{l+1}}) \odot \sigma'(Z_l).
\end{aligned} \tag{35}$$
For the activation functions ReLU(•), LeakyReLU(•), Sigmoid(•), Tanh(•) and SoftPlus(•), the gradient $\sigma'(\cdot)$ satisfies $|\sigma''(\cdot)| \le 1$ in each differentiable domain. Combined with the Cauchy-Schwarz inequality $\|A_1 A_2\|_F \le \|A_1\|_F\,\|A_2\|_F$ (Horn & Johnson, 2012), we have
$$\|\sigma'(\hat{Z}_l) - \sigma'(Z_l)\|_F \le \|\hat{Z}_l - Z_l\|_F \le \|W_l\|_F\,\|\hat{H}_{l-1} - H_{l-1}\|_F. \tag{36}$$
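According to the inequality $\|A_1 \odot A_2\|_F \le \|A_1\|_F\,\|A_2\|_F$ (Horn & Johnson, 2012), we have the upper bound of $\|\hat{\nabla}_{Z_l} - \nabla_{Z_l}\|_F$ given by
$$\begin{aligned}
\|\hat{\nabla}_{Z_l} - \nabla_{Z_l}\|_F &\le \|W_{l+1}\|_F\,\|\hat{\nabla}_{Z_{l+1}}\|_F\,\|\sigma'(\hat{Z}_l) - \sigma'(Z_l)\|_F + \|W_{l+1}\|_F\,\|\hat{\nabla}_{Z_{l+1}} - \nabla_{Z_{l+1}}\|_F\,\|\sigma'(Z_l)\|_F \\
&\le \|W_{l+1}\|_F\,\|\hat{\nabla}_{Z_{l+1}}\|_F\,\|\hat{H}_{l-1} - H_{l-1}\|_F\,\|W'_l\|_F + \|W_{l+1}\|_F\,\|\hat{\nabla}_{Z_{l+1}} - \nabla_{Z_{l+1}}\|_F\,\|\sigma'(Z_l)\|_F \\
&= \alpha_l\,\|\hat{H}_{l-1} - H_{l-1}\|_F + \gamma_l\,\|\hat{\nabla}_{Z_{l+1}} - \nabla_{Z_{l+1}}\|_F,
\end{aligned} \tag{37}$$
where $\alpha_l$ and $\gamma_l$ are given by
$$\alpha_l = \|W_{l+1}\|_F\,\|\hat{\nabla}_{Z_{l+1}}\|_F\,\|W'_l\|_F; \qquad \gamma_l = \|W_{l+1}\|_F\,\|\sigma'(Z_l)\|_F. \tag{38}$$
The values of $\alpha_l$ and $\gamma_l$ depend on the model weights before backward propagation, which are constant with respect to the gradient. Iterate Equation (37) until $l = L$, where $\|\hat{\nabla}_{Z_L} - \nabla_{Z_L}\|_F \le \alpha_L\,\|\hat{H}_{L-1} - H_{L-1}\|_F$. In such a manner, we have
$$\|\hat{\nabla}_{Z_l} - \nabla_{Z_l}\|_F \le \alpha_l\,\|\hat{H}_{l-1} - H_{l-1}\|_F + \sum_{i=l+1}^{L}\alpha_i\,\|\hat{H}_{i-1} - H_{i-1}\|_F\prod_{j=l}^{i-1}\gamma_j. \tag{39}$$
According to Equation (34), we have the gradient of $W_l$ given by
$$\begin{aligned}
\hat{\nabla}_{W_l} - \nabla_{W_l} &= \hat{\nabla}_{Z_l}\hat{H}^T_{l-1} - \nabla_{Z_l}H^T_{l-1} \\
&= \hat{\nabla}_{Z_l}\hat{H}^T_{l-1} - \hat{\nabla}_{Z_l}H^T_{l-1} + \hat{\nabla}_{Z_l}H^T_{l-1} - \nabla_{Z_l}H^T_{l-1} \\
&= \hat{\nabla}_{Z_l}(\hat{H}^T_{l-1} - H^T_{l-1}) + (\hat{\nabla}_{Z_l} - \nabla_{Z_l})H^T_{l-1}.
\end{aligned} \tag{40}$$
Taking Equation (39) into Equation (40), we have
$$\begin{aligned}
\|\hat{\nabla}_{W_l} - \nabla_{W_l}\|_F &\le \|\hat{\nabla}_{Z_l}\|_F\,\|\hat{H}^T_{l-1} - H^T_{l-1}\|_F + \|\hat{\nabla}_{Z_l} - \nabla_{Z_l}\|_F\,\|H^T_{l-1}\|_F \\
&\le \|\hat{\nabla}_{Z_l}\|_F\,\|\hat{H}^T_{l-1} - H^T_{l-1}\|_F + \|H^T_{l-1}\|_F\Big(\alpha_l\,\|\hat{H}_{l-1} - H_{l-1}\|_F + \sum_{i=l+1}^{L}\alpha_i\,\|\hat{H}_{i-1} - H_{i-1}\|_F\prod_{j=l}^{i-1}\gamma_j\Big) \\
&= \big(\beta_l + \alpha_l\,\|H^T_{l-1}\|_F\big)\,\|\hat{H}^T_{l-1} - H^T_{l-1}\|_F + \|H^T_{l-1}\|_F\sum_{i=l+1}^{L}\alpha_i\,\|\hat{H}^T_{i-1} - H^T_{i-1}\|_F\prod_{j=l}^{i-1}\gamma_j,
\end{aligned} \tag{41}$$
where $\alpha_l$, $\beta_l$ and $\gamma_l$ are given by
$$\alpha_l = \|W_{l+1}\|_F\,\|\hat{\nabla}_{Z_{l+1}}\|_F\,\|W'_l\|_F; \quad \beta_l = \|\hat{\nabla}_{Z_l}\|_F; \quad \gamma_l = \|W_{l+1}\|_F\,\|\sigma'(Z_l)\|_F. \tag{42}$$
During the LFC-ACT and HFC-ACT trainings, the activation map of a linear layer satisfies
$$\|H_l - H^L_l\|_F = \|\tilde{H}_l - \tilde{H}^L_l\|_F = \|\tilde{H}_l \odot (1-M)\|_F \triangleq \lambda^H_l, \tag{43}$$
$$\|H_l - H^H_l\|_F = \|\tilde{H}_l - \tilde{H}^H_l\|_F = \|\tilde{H}_l \odot M\|_F \triangleq \lambda^L_l. \tag{44}$$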
Taking Equations (43) and (44) into (41), we have the $\mathrm{GEB}^L_l$ and $\mathrm{GEB}^H_l$ of a linear layer given by
$$\|\nabla_{W_l} - \nabla^L_{W_l}\|_F \le \big(\alpha_l\,\|H^T_{l-1}\|_F + \beta_l\big)\lambda^H_l + \|H^T_{l-1}\|_F\sum_{i=l+1}^{L}\alpha_i\,\lambda^H_i\prod_{j=l}^{i-1}\gamma_j \triangleq \mathrm{GEB}^L_l, \tag{45}$$
$$\|\nabla_{W_l} - \nabla^H_{W_l}\|_F \le \big(\alpha_l\,\|H^T_{l-1}\|_F + \beta_l\big)\lambda^L_l + \|H^T_{l-1}\|_F\sum_{i=l+1}^{L}\alpha_i\,\lambda^L_i\prod_{j=l}^{i-1}\gamma_j \triangleq \mathrm{GEB}^H_l. \tag{46}$$
Given the expressions of $\mathrm{GEB}^L_l$ and $\mathrm{GEB}^H_l$ in Equations (45) and (46), we have the GEB difference for a linear layer given by
$$\mathrm{GEB}^L_l - \mathrm{GEB}^H_l = \big(\alpha_l\,\|H^T_{l-1}\|_F + \beta_l\big)(\lambda^H_l - \lambda^L_l) + \|H^T_{l-1}\|_F\sum_{i=l+1}^{L}\alpha_i\,(\lambda^H_i - \lambda^L_i)\prod_{j=l}^{i-1}\gamma_j.$$

D $\lambda^L_l$ VERSUS $\lambda^H_l$ IN DENSENET-121

A further experiment is conducted to study whether $\lambda^L_l > \lambda^H_l$ can be guaranteed for DenseNet-121. Specifically, the values of $\lambda^L_l$ and $\lambda^H_l$ in the training (epochs 20, 40, and 60) of DenseNet-121 are given in Tables 3 and 4, where the low-pass mask $M$ satisfies $W/N = 0.1$ and $W/N = 0.2$, respectively; and $\lambda^L_l$ and $\lambda^H_l$ are estimated based on the input activation maps of the four DenseBlocks. It is consistently observed that $\lambda^L_l > \lambda^H_l$ for $W/N = 0.1$ and $W/N = 0.2$ in different training epochs. This indicates our proposed Theorem 1 holds without loss of generality.
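To make the two quantities concrete, the following sketch computes $\lambda^L$ and $\lambda^H$ for a 1-D vector. It is an illustration under stated assumptions rather than the measurement code behind Tables 3 and 4: the DCT is taken to be the orthonormal DCT-II, the low-pass mask keeps the first $W$ coefficients, and the input is a synthetic smooth signal, not a trained activation map. The names `dct2_orthonormal` and `lambda_pair` are hypothetical.

```python
import math


def dct2_orthonormal(x):
    """Orthonormal DCT-II of a 1-D sequence (assumed convention for DCT)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def lambda_pair(h, ratio):
    """Return (lambda_L, lambda_H) for a 1-D vector h.

    The low-pass mask M keeps the first W = round(ratio * N) DCT coefficients;
    lambda_L is the norm of the kept part, lambda_H the norm of the rest.
    """
    coeffs = dct2_orthonormal(h)
    w = max(1, round(ratio * len(h)))
    lam_l = math.sqrt(sum(c * c for c in coeffs[:w]))
    lam_h = math.sqrt(sum(c * c for c in coeffs[w:]))
    return lam_l, lam_h


# Synthetic input: offset + slow oscillation + weak fast oscillation.
signal = [
    1.0 + 0.5 * math.sin(2 * math.pi * t / 64) + 0.05 * math.sin(2 * math.pi * 20 * t / 64)
    for t in range(64)
]
lam_l, lam_h = lambda_pair(signal, ratio=0.2)
print(f"lambda_L = {lam_l:.3f}, lambda_H = {lam_h:.3f}")
```

Because the transform is orthonormal, $\lambda_L^2 + \lambda_H^2$ equals the squared norm of the input, so a larger $\lambda^L$ directly means a smaller share of energy in the masked-out high frequencies.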

E PROOF OF THEOREM 2

We give the proof of Theorem 2 in this section.

Theorem 2. For any real-valued function $f(x)$ and its moving average $\bar{f}(x) = \frac{1}{2B}\int_{x}^{x+2B} f(t)\,dt$, let $F(\omega)$ and $\bar{F}(\omega)$ denote the Fourier transformation (Madisetti, 1997) of $f(x)$ and $\bar{f}(x)$, respectively. Generally, we have $\bar{F}(\omega) = H(\omega)F(\omega)$, where $|H(\omega)| = \left|\frac{\sin \omega B}{\omega B}\right|$.

Proof. We adopt the limit operator to reformulate $\bar{f}(x)$ into
$$\bar{f}(x) = \frac{1}{2B}\int_{x}^{x+2B} f(t)\,dt = \frac{1}{2B}\lim_{N\to\infty}\sum_{n=0}^{N-1}\frac{2B}{N}\,f\Big(x + \frac{2Bn}{N}\Big) = \lim_{N\to\infty}\sum_{n=0}^{N-1}\frac{1}{N}\,f\Big(x + \frac{2Bn}{N}\Big). \tag{47}$$
Taking Equation (47) into the Fourier transform of $\bar{f}(x)$, we have
$$\begin{aligned}
\bar{F}(\omega) &= \int_{-\infty}^{\infty}\bar{f}(x)\,e^{-i\omega x}\,dx = \int_{-\infty}^{\infty}\lim_{N\to\infty}\frac{1}{N}\sum_{n=0}^{N-1} f\Big(x + \frac{2Bn}{N}\Big)e^{-i\omega x}\,dx \\
&= \lim_{N\to\infty}\frac{1}{N}\sum_{n=0}^{N-1}\int_{-\infty}^{\infty} f\Big(x + \frac{2Bn}{N}\Big)e^{-i\omega x}\,dx = \lim_{N\to\infty}\frac{1}{N}\sum_{n=0}^{N-1} e^{i\omega\frac{2Bn}{N}}\int_{-\infty}^{\infty} f(x)\,e^{-i\omega x}\,dx \\
&= F(\omega)\lim_{N\to\infty}\frac{1}{N}\sum_{n=0}^{N-1} e^{i\omega\frac{2Bn}{N}} = F(\omega)\,(1 - e^{i\omega 2B})\lim_{N\to\infty}\frac{1}{N\,(1 - e^{i\omega\frac{2B}{N}})} = F(\omega)\,\frac{1 - e^{i\omega 2B}}{-i\omega 2B},
\end{aligned}$$
where $i$ denotes the imaginary unit. Let $H(\omega) = \frac{1 - e^{i\omega 2B}}{-i\omega 2B}$. The magnitude of $H(\omega)$ is given by
$$|H(\omega)| = \frac{|1 - \cos \omega 2B - i\sin \omega 2B|}{|\omega 2B|} = \frac{\sqrt{(1 - \cos \omega 2B)^2 + \sin^2 \omega 2B}}{|\omega 2B|} = \frac{\sqrt{2 - 2\cos \omega 2B}}{|\omega 2B|} = \left|\frac{\sin \omega B}{\omega B}\right|.$$

Checkpoint: Checkpoint relies on the model-specific design of the checkpointed layers. We find that checkpointing the activation map of each Bottleneck block in ResNet-50 and WRN-50-2 provides a good trade-off between memory cost and running speed (a ResNet-50 or WRN-50-2 has four Bottleneck blocks). To implement the checkpointing of intermediate layers, we revised the forward function of ResNet-50 and WRN-50-2 as shown in Figure 8.

SWAP: For the SWAP method, the memory utilization counts both GPU and CPU memory because the activation maps are stored on both GPU and CPU in SWAP.

J COMPRESSION OF POOLING LAYERS, RELU ACTIVATIONS, AND DROPOUT

DIVISION follows Algorithms 2, 3, 4 and 5 to compress the activation map of a Max-Pooling layer, Average-Pooling layer, ReLU activation and Dropout operator, respectively. For the pooling layers, we consider a simple case kernelsize = movingstride = k.
General cases with different kernelsize and movingstride can be designed in analogous ways.

Algorithm 2 Max-Pooling layer.
Function Forward($H_{l-1}$, $k$, **kwargs)
    $H_l, V_{l-1}$ = Max-Pooling($H_{l-1}$, $k$, kwargs)    ▷ $V_{l-1}$ records the location of each kernel-wise max-value in $H_{l-1}$
    Pack & Cache $V_{l-1}$ using Int8.
    return $H_l$

Function Backward($\nabla_{H_l}$)
    Load $V_{l-1}$ and $k$.
    $\nabla'_{H_l} = 1_{k\times k} \otimes \nabla_{H_l}$
    $\nabla_{H_{l-1}} = V_{l-1} \odot \nabla'_{H_l}$
    return $\nabla_{H_{l-1}}$

Algorithm 3 Average-Pooling layer.
    ...
    $\nabla'_{H_l} = 1_{k\times k} \otimes \nabla_{H_l}$
    $\nabla_{H_{l-1}} = k^{-2}\,\nabla'_{H_l}$
    return $\nabla_{H_{l-1}}$

Algorithm 4 ReLU operator.
Function Forward($H_{l-1}$)
    $V_{l-1} = \mathrm{sgn}(H_{l-1})$
    $H_l = V_{l-1} \odot H_{l-1}$
    Pack & Cache $V_{l-1}$ using Int8.
    return $H_l$

Function Backward($\nabla_{H_l}$)
    Load $V_{l-1}$.
    $\nabla_{H_{l-1}} = V_{l-1} \odot \nabla_{H_l}$
    return $\nabla_{H_{l-1}}$

Algorithm 5 Dropout operator.
Function Forward($H_{l-1}$)
    Generate a Minibatch×Channel×N×N binary matrix $V_{l-1}$ following the Bernoulli distribution with dropout probability $p$.
    $H_l = V_{l-1} \odot H_{l-1}$
    Pack & Cache $V_{l-1}$ using Int8.
    return $H_l$

Function Backward($\nabla_{H_l}$)
    Load $V_{l-1}$.
    $\nabla_{H_{l-1}} = V_{l-1} \odot \nabla_{H_l}$
    return $\nabla_{H_{l-1}}$

K EVALUATION OF DIVISION ON MULTI-LAYER PERCEPTRONS (MLPS)

We conduct experiments on the GAS dataset (Dua & Graff, 2017)

Although DIVISION shows slightly lower accuracy than Mesa, its compression rate is significantly higher than that of Mesa. Moreover, DIVISION can be applied to general vision models, including MLPs, CNNs, and vision transformers, while Mesa is explicitly designed for vision transformers.

As a supplement to Section 5.5, a follow-up experiment is conducted to study the effect of batch-size on the training throughput. The result on the CIFAR-10 dataset is given in Table 12. It is observed that, for the experiment w/o AMP, the throughput grows significantly as the batch-size grows from 64 to 128, but is almost unchanged when the batch-size ≥ 128; for the experiment w/ AMP, it grows continuously when the batch-size grows from 64 to 256, and becomes stable when the batch-size ≥ 256. Intuitively, as the batch-size grows, the GPU can process more images per second in parallel, until the GPU is 100% utilized (Goyal et al., 2017).
In the experiment w/o AMP, the GPU is almost 100% utilized when batch-size ≥ 128; while this happens when batch-size ≥ 256 in the experiment w/ AMP. More images can be processed in the experiment w/ AMP, since it employs bfloat16 operations in the training, in contrast with the float32 operations in the training w/o AMP, where a bfloat16 operation has nearly half of the computation cost of a float32 operation. Therefore, the training throughput significantly grows from 2086 (images/s) to 3753 (images/s) as the batch-size grows from 128 to 256 in the experiment w/ AMP, but is almost unchanged (2184 images/s vs 2335 images/s) in the same condition in the experiment w/o AMP.

R DETAILS ABOUT THE COMPUTATION INFRASTRUCTURE

The details about our physical computing infrastructure for testing the training memory cost and throughput are given in Table 13 . 



The activation map of each layer is required for estimating the gradient during backward propagation.
A 6× compression rate indicates that the memory of cached activation maps is 1/6 of that of normal training.
We do not focus on the closed form of the backward function, which is implemented by torch.autograd.
For the case N < B, the pooling block-size and stride will be N such that the shape of $H^L_l$ is Minibatch×Channel×1×1.
A fixed bit-width is adopted for the quantization of all layers to maximize the efficiency of data processing.
Per-channel quantization is more efficient and lightweight than the per-group quantization in state-of-the-art work.
⌊x⌉ takes the value of ⌈x⌉ with a probability of x-⌊x⌋ and takes ⌊x⌋ with a probability of ⌈x⌉-x.
The stochastic rounding enables the quantization-dequantization pipeline to be unbiased, i.e. $E[V^H_l] = H^H_l$.
torch.cuda.memory_allocated returns the memory occupied by tensors in bytes.
https://pytorch.org/docs/stable/amp.html
https://paperswithcode.com/lib/torchvision
$V_{l-1}$ reserves the locations of each kernel-wise max-values in $H_{l-1}$.



Figure 1: (a) Overall performance of DIVISION versus baseline methods. (b) Activation compressed training.

Figure 3: $\lambda^L_l = \|\tilde{H}_l \odot M\|_F$ versus $\lambda^H_l = \|\tilde{H}_l \odot (1-M)\|_F$ in the training (epochs 20, 40, and 60) of ResNet-18. $H_l$ takes the activation maps of four BasicBlocks in ResNet-18; □ indicates the mean values; $W = 0.5N$.

3.1 EXPERIMENTAL ANALYSIS

To study the individual contribution of LFC and HFC to DNN backward propagation, we design three training methods with different backward propagations: LFC-ACT takes LFC into the backward function as shown in Equation (6), where $H^L_l$ is estimated by Equation (4); HFC-ACT takes HFC into the backward function as given in Equation (7), where $H^H_l$ is estimated according to Equation (5); Normal training (for comparison) estimates the gradients by Equation (2).
$$[\hat{\nabla}_{H_{l-1}}, \hat{\nabla}_{W_l}] = \mathrm{backward}(\hat{\nabla}_{H_l}, H^L_l, W_l), \quad \triangleright\ \text{LFC-ACT} \tag{6}$$
$$[\hat{\nabla}_{H_{l-1}}, \hat{\nabla}_{W_l}] = \mathrm{backward}(\hat{\nabla}_{H_l}, H^H_l, W_l). \quad \triangleright\ \text{HFC-ACT} \tag{7}$$
We conduct the experiments on the CIFAR-10 dataset. The implementation details are given in Appendix A. The top-1 accuracy and memory cost of LFC-ACT, HFC-ACT, and normal training are shown in Figure 2 (b) and (c), respectively. Overall, we have the following observations:
• Accuracy drop: According to Figure 2 (b), HFC-ACT suffers from significantly more degradation of accuracy than LFC-ACT. This indicates DNN backward propagation mainly utilizes the LFC of activation maps during the training.
• Memory cost: According to Figure 2 (c), the storage of HFC requires significantly more memory than that of the LFC, i.e. the storage of HFC consumes the majority of memory.

To better understand the results of model accuracy, we theoretically prove that the gradient for backward propagation is bounded into a tighter range around the optimal value in LFC-ACT. This enables LFC-ACT to learn a more accurate model than HFC-ACT.


Figure 4: Framework of Dual Activation Precision Training.

activation maps of the BasicBlocks in ResNet-18, and DenseBlocks in DenseNet-121. The estimation of $\lambda^L_l$ and $\lambda^H_l$ is based on the checkpoints of ResNet-18 in epochs 20, 40, and 60, and visualized in Figures 3 (a)-(c), respectively; the results of DenseNet-121 are given in Appendix D. It is consistently observed that $\lambda^L_l > \lambda^H_l$ for different instances and layers. This leads to $\mathrm{GEB}^L_l < \mathrm{GEB}^H_l$ according to Theorem 1. Therefore, HFC-ACT suffers from a worse distortion of backward propagation during the training, eventually leading to a less accurate learned model than LFC-ACT.

Figure 5: Top-1 accuracy (%) ↑ of Normal training, DIVISION, BLPA (a), AC-GC (b), and ActNN (c).

5 EVALUATION OF DIVISION

We conduct the experiments to evaluate DIVISION by answering the following research questions. RQ1: How does DIVISION perform compared with state-of-the-art baseline methods in terms of the model accuracy, memory cost, and training throughput? RQ2: Does the strategy of dual-precision compression contribute to DIVISION? RQ3: What is the effect of hyper-parameters on DIVISION?

The experiment setting including the datasets, baseline methods and DNN architectures is specified in Appendix H. The implementation details including the hyper-parameters of DIVISION and configuration of baseline methods are given in Appendix I. More experiments on MLPs, vision transformers, depthwise, and pointwise convolutional layers are given in Appendix K, L, and M.

5.1 EVALUATION BY MODEL ACCURACY (RQ1)

In this section, we evaluate the training methods in terms of model accuracy on the CIFAR-10, CIFAR-100 and ImageNet datasets. Specifically, DIVISION is compared with BLPA (Chakrabarti & Moseley, 2019), AC-GC (Evans & Aamodt, 2021) and ActNN (Chen et al., 2021) in Figure 5 (a)-(c), respectively, where different model architectures are considered. Checkpoint and SWAP are not considered in this section because they are able to reduce the training memory without degradation of model accuracy. Overall, we have the following observations:
• DIV vs Baseline Methods: Compared with normal training, DIVISION achieves almost the same top-1 validation accuracy. In contrast, the baseline methods suffer from slightly higher validation error. This indicates DIVISION provides nearly loss-less compression of DNN training.
• Flexibility of DIV: DIVISION consistently achieves competitive model accuracy in the training of different architectures on different datasets. This indicates DIVISION is a flexible framework that can be applied to different scenarios.
• Compressibility of HFC: Note that DIVISION adopts a significantly high compression rate 12× for the HFC during the training, and achieves nearly loss-less accuracy. This result indicates the HFC of activation map is highly redundant and compressible during the training.

Figure 6: Training throughput ↑ of (a) Resnet-50 and (b) WRN-50-2 on the ImageNet dataset, where indicates out of memory. (c) Top-1 validation accuracy (%) ↑ of DIVISION, DIVISION w/o HFC, DIVISION w/o LFC and fixed bit-width quantization on the ImageNet dataset.

Figure 7: (a) Top-1 accuracy and compression rate of DIVISION in different settings. (b) Top-1 accuracy and training throughput of DIVISION w/ and w/o AMP.

Figure 8: Implementation of checkpointed ResNet-50 and WRN-50-2.

$\|\tilde{H}_l \odot M\|_F$ and $\lambda^H_l = \|\tilde{H}_l \odot (1-M)\|_F$ during the training of ResNet-18 and DenseNet-121 on the CIFAR-10 dataset. Specifically, $H_l$ takes the

4 The Frobenius norm of an n×n matrix A is given by $\|A\|_F$ =

(c), the HFC takes the majority of memory cost during the training. This indicates the HFC is highly redundant and compressible during the training. Following this direction, we propose DIVISION to compress the activation maps into a dual precision representation: high-precision LFC combined with low-precision HFC. On the one hand, both the LFC and the low-precision HFC require much less memory to cache. On the other hand, removing the redundancy of HFC does not cause much distortion of backward propagation. In this way, DIVISION enables effective compression of training memory without degradation of model accuracy.

estimates the memory cost of activation maps by

Memory Cost = Memory Utilization after forward - Memory Utilization after backward, (11)

where existing deep learning tools provide APIs 10 to estimate the memory utilization.

The theoretical compression rate R of DIVISION is given in Appendix G, where general cases of convolutional neural networks and multi-layer perceptrons are considered for the estimation. For the model architectures in our experiments, we have $R_{\text{ResNet-50}}, R_{\text{WRN-50-2}} \ge 10.35$.

• DIV vs ActNN: DIVISION has approximately the same memory cost as ActNN. Beyond the storage of 2-bit activation maps, DIVISION has overhead for caching the LFC; and ActNN spends almost equal overhead for storing the parameters of per-group quantization.

Memory cost ↓ and compression rate ↑. Total Mem refers to the total memory cost of weights, optimizer, data and activation maps. Act Mem refers to the memory cost of activation maps. OOM refers to out of memory.

Hyper-parameter setting.

W/N = 0.1.

Hyper-parameter setting.

(128-dimensional features, 13910 instances, 6-class classification task). The classification model is a 4-layer MLP (128 neurons in the input layer, 6 neurons in the output layer, and 64 neurons in the hidden layer). The setting of DIVISION is B = 16 and Q = 2. The model accuracy and memory cost of activation maps are given in Table 6. It is observed that DIVISION has a 7.3× compression rate with only 0.07% degradation of model accuracy. This indicates the effectiveness of DIVISION on MLP models.

Model Accuracy on the GAS dataset.

Accuracy of Swin Transformer-T on the ImageNet dataset.

To evaluate DIVISION on the depthwise convolution and pointwise convolution layers, we conduct experiments of MobileNet-V2 on the CIFAR-10 and CIFAR-100 datasets, where the model accuracy is given in Table 8. It is observed that DIVISION achieves nearly the same accuracy as normal training. This indicates the effectiveness of DIVISION for the depthwise convolution and pointwise convolution layers.

Accuracy of MobileNet-V2 on the CIFAR-10 and CIFAR-100 datasets.

8} to train ResNet-18 and MobileNet-V2 on the CIFAR-10 and CIFAR-100 datasets. The model accuracy is given in Table 9. It is observed that B and Q have a consistent impact on different model architectures and datasets: the accuracy slightly grows with Q and considerably reduces with B. This indicates we can reuse the hyper-parameter setting of DIVISION on CIFAR-10 for CIFAR-100 with the same model architecture, or we can reuse the setting across ResNet-18 and MobileNet-V2 on the same dataset.

Accuracy of ResNet-18 and MobileNet-V2 with different hyper-parameter settings.
B = 18, Q = 2; B = 8, Q = 2; B = 8, Q = 8

O COMPARISON OF DIVISION WITH CHECKPOINT STRATEGY OF MEGATRON-LM

DIVISION is compared with the checkpointing strategy of Megatron-LM (Shoeybi et al., 2019). According to the official guideline, Megatron-LM checkpoints the activation map after each transformer block. We follow this strategy to checkpoint the activation map after each transformer block in the Swin Transformer, and after each Bottleneck block in the ResNet-50. The memory cost (on the ImageNet dataset with batch-size 128) is given in Table 10. It is observed that DIVISION has a higher compression rate (2.8× for Swin Transformer-T and 7.9× for ResNet-50) than the Checkpoint strategy of Megatron-LM (2.3× for Swin Transformer-T and 2.27× for ResNet-50). This indicates the effectiveness of DIVISION over the checkpoint strategy of Megatron-LM.

Memory cost of DIVISION and Checkpoint strategy of Megatron-LM.

P EFFECT OF BIT-WIDTH ON THE FIXED QUANTIZATION

To demonstrate the effectiveness of dual activation precision, DIVISION (B = 8, Q = 2) is compared with the fixed quantization under different bit-widths. The model accuracy of ResNet-50 on the ImageNet dataset is given in Table 11. It is observed that the training fails to converge to an optimal solution under 2-bit quantization, even though it performs favorably under 4-bit and 8-bit quantization. This result is consistent with existing work (Table 3 in Chen et al., 2021). In contrast, DIVISION achieves 75.9% accuracy when adopting 2-bit quantization to compress the HFC. This indicates the effectiveness of dual activation precision in terms of the model accuracy under low bit-width quantization.

Model accuracy of fixed quantization under different bit-width.

Training throughput versus the batch-size.

Computing infrastructure for the experiments.

We give the implementation details of the experiment in Section 5.4. Specifically, DIVISION w/o HFC takes block-size B = 4 for estimating LFC; DIVISION w/o LFC takes the bit-width Q = 2 for the quantization of HFC; DIVISION combines these settings for the training; and Fixed Quant has a 2-bit per-group quantization of activation maps during the backward pass of the training, where the group size of quantization is 256. Other training hyper-parameters are given in Table 5.

F COMPRESSION OF 1D AND 3D ACTIVATION MAPS BY DIVISION

We give more details about DIVISION considering 1D, 2D and 3D activation maps in this section.

F.1 ACTIVATION MAP COMPRESSION

DIVISION adopts average pooling to estimate the LFC by $H^L_l = \mathrm{AveragePooling}(H_l)$. The value of block-size and moving stride is a unified hyper-parameter B, which controls the memory of $H^L_l$. The average pooling covers 1D, 2D and 3D activation maps.

To estimate the HFC, DIVISION calculates the residual value $H^H_l = H_l - \mathrm{UpSampling}(H^L_l)$, where $\mathrm{UpSampling}(\cdot)$ enlarges $H^L_l$ to the shape of $H_l$ via nearest interpolation. The up-sampling likewise covers 1D, 2D and 3D activation maps.

Then, DIVISION adopts Q-bit per-channel quantization for the compression, where the bit-width Q controls the precision and memory cost of HFC after the compression. Let $V^H_l$ denote a Q-bit integer matrix, as the low-precision representation of $H^H_l$. The detailed procedure of compressing $H^H_l$ into $V^H_l$ is given by
$$V^H_l = \left\lfloor \Delta_l^{-1}\,(H^H_l - \delta_l) \right\rceil, \tag{52}$$
where $\delta_l$ denotes the minimum element in $H^H_l$; $\Delta_l = (h_{max} - \delta_l)/(2^Q - 1)$ denotes the quantization step; $h_{max}$ denotes the maximum element in $H^H_l$; and $\lfloor\cdot\rceil$ denotes the stochastic rounding. After the compression, as the representation of $H_l$, the tuple $(H^L_l, V^H_l, \Delta_l, \delta_l)$ is cached to the memory for reconstructing the activation maps during the backward pass.
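The compression steps described above can be sketched for a single 1-D channel as follows. This is a minimal illustration in plain Python, not the released implementation: the function names (`average_pool_1d`, `upsample_nearest_1d`, `stochastic_round`, `compress`) are hypothetical, the quantization formula is the inverse of the dequantization $\hat{H}^H_l = \Delta_l V^H_l + \delta_l$, and the clamp to $[0, 2^Q - 1]$ and the handling of a constant channel are added safeguards that the text does not mention.

```python
import math
import random


def average_pool_1d(h, block):
    """Block means with block size = stride = block (the last block may be shorter)."""
    return [sum(h[i:i + block]) / len(h[i:i + block]) for i in range(0, len(h), block)]


def upsample_nearest_1d(lfc, block, length):
    """Nearest-neighbour up-sampling of the pooled values back to the original length."""
    return [lfc[i // block] for i in range(length)]


def stochastic_round(x, rng):
    """Round up with probability x - floor(x), otherwise round down (unbiased)."""
    lo = math.floor(x)
    return lo + (1 if rng.random() < x - lo else 0)


def compress(h, block, bits, rng):
    """Return (LFC, Q-bit codes, step, offset) for one channel of activations."""
    lfc = average_pool_1d(h, block)
    up = upsample_nearest_1d(lfc, block, len(h))
    hfc = [a - b for a, b in zip(h, up)]          # residual high-frequency part
    delta = min(hfc)                              # offset: minimum element
    levels = 2 ** bits - 1
    step = (max(hfc) - delta) / levels            # quantization step
    if step == 0.0:                               # constant channel: nothing to encode
        return lfc, [0] * len(hfc), step, delta
    codes = [min(levels, stochastic_round((v - delta) / step, rng)) for v in hfc]
    return lfc, codes, step, delta


rng = random.Random(0)
h = [math.sin(0.3 * t) + 0.1 * t for t in range(32)]
lfc, codes, step, delta = compress(h, block=8, bits=2, rng=rng)
print(len(lfc), "LFC values;", len(codes), "2-bit codes; step =", round(step, 4))
```

With B = 8 and Q = 2, the 32 input values are represented by 4 pooled values, 32 two-bit codes and two scalars, and each reconstructed element differs from the original by less than one quantization step.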

F.2 ACTIVATION MAP DECOMPRESSION

During the backward pass, DIVISION adopts the cached tuples $\{(H^L_l, V^H_l, \Delta_l, \delta_l) \mid 0 \le l \le L-1\}$ to reconstruct the activation map layer-by-layer. Specifically, for each layer $l$, DIVISION dequantizes the HFC via $\hat{H}^H_l = \Delta_l V^H_l + \delta_l$, which is the inverse process of Equation (52). Then, the activation map is reconstructed via $\hat{H}_l = \mathrm{UpSampling}(H^L_l) + \hat{H}^H_l$, where $\mathrm{UpSampling}(\cdot)$ enlarges $H^L_l$ to the shape of $H_l$ via nearest interpolation. The cases of 1D, 2D and 3D activation maps are considered in Equation (51).

After the decompression, DIVISION frees the cache of $(H^L_l, V^H_l, \Delta_l, \delta_l)$, and takes $\hat{H}_l$ into $[\hat{\nabla}_{H_{l-1}}, \hat{\nabla}_{W_l}] = \mathrm{backward}(\hat{\nabla}_{H_l}, \hat{H}_{l-1}, W_l)$ to estimate the gradient for backward propagation.
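The reconstruction step can be illustrated with a small worked example for one 1-D channel. The sketch assumes the cached tuple is available as plain Python values; the function names and the numbers below are made up for illustration and do not come from the experiments.

```python
def upsample_nearest_1d(lfc, block, length):
    """Nearest-neighbour up-sampling of the cached LFC to the original length."""
    return [lfc[i // block] for i in range(length)]


def decompress(lfc, codes, step, delta, block):
    """Rebuild one channel: up-sampled LFC plus dequantized HFC (step * code + offset)."""
    up = upsample_nearest_1d(lfc, block, len(codes))
    return [u + step * c + delta for u, c in zip(up, codes)]


# Hypothetical cached tuple for an 8-element channel with B = 4 and Q = 2.
lfc = [1.0, 3.0]
codes = [0, 3, 1, 2, 3, 0, 2, 1]
step, delta = 0.5, -0.75

h_hat = decompress(lfc, codes, step, delta, block=4)
print(h_hat)
```

Each code `c` maps to the residual `0.5 * c - 0.75`, so the four possible codes give residuals -0.75, -0.25, 0.25 and 0.75 around the block mean.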

G THEORETICAL COMPRESSION RATE OF DIVISION

The compression rate of DIVISION is estimated in this section. A general case of convolutional neural networks (CNN) and multi-layer perceptron (MLP) are considered for the estimation.

G.1 COMPRESSION RATE OF CNN TRAINING

Without loss of generality, we estimate the compression rate for a block consisting of a convolutional layer (conv), a batch normalization layer (BN) and a ReLU activation. Most existing backbones simply stack conv-BN-ReLU blocks (He et al., 2016; Huang et al., 2017; Szegedy et al., 2015; Tan & Le, 2019; Simonyan & Zisserman, 2014), so our estimated compression rate holds in practice. Generally, the compression rate is defined as the memory reduction ratio after the compression. To be concrete, let Minibatch×Channel×N×N denote the shape of activation maps for a conv-BN-ReLU block; given the block-size B and bit-width Q, the compression rate of activation maps achieved by DIVISION is given by Theorem 3A.

Theorem 3A. DIVISION has an average activation map compression rate R for a conv-BN-ReLU block, where Minibatch×Channel×N×N is the shape of activation map $H_l$; B denotes the block-size of LFC average pooling; and Q denotes the bit-width of HFC quantization.

Proof. For each mini-batch update of normal training, a conv-layer or BN-layer caches $N^2$ float32 × 4 byte/float32 = $4N^2$ byte of activation map; a ReLU operator caches $N^2$ int8 × 1 byte/int8 = $N^2$ byte of activation map. For each mini-batch update of DIVISION, a conv-layer or BN-layer caches the LFC and the Q-bit HFC, and spends 2 bfloat16 × 2 byte/bfloat16 = 4 byte for $\Delta_l$ and $\delta_l$. Moreover, a ReLU operator caches $N^2$ bit × 1/8 byte/bit = $N^2/8$ byte of activation map. Therefore, the average activation map compression rate of a conv-BN-ReLU block is the ratio of these two memory costs.

A higher compression rate indicates more effective compression. It is observed that the compression rate grows with B and N, and decreases with Q. In our experiments, we found B = 8 and Q = 2 can provide loss-less model accuracy. In this condition, the shape of activation maps satisfies N ≥ 7 for ResNet-50 and WRN-50-2 on the ImageNet dataset (He et al., 2016). According to Equation (53), we have $R_{\text{ResNet-50}}, R_{\text{WRN-50-2}} \ge 10.35$.
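The byte counting in the proof can be turned into a short calculation. The closed-form expression is not legible in this copy, so the sketch below is a reconstruction under assumptions, not a quotation of Equation (53): the LFC is assumed to hold $\lceil N/B \rceil^2$ values stored in bfloat16 (2 bytes each), and the HFC is assumed to take $QN^2/8$ bytes. These assumptions were chosen because they reproduce the stated bound of about 10.35 at N = 7, B = 8, Q = 2. The function name is hypothetical.

```python
import math


def conv_bn_relu_compression_rate(n, block, bits, lfc_bytes_per_value=2):
    """Per-channel, per-sample compression rate of one conv-BN-ReLU block.

    Assumptions (not stated explicitly in the text): the LFC has ceil(n / block)^2
    values of `lfc_bytes_per_value` bytes each, and the HFC takes bits * n^2 / 8 bytes.
    """
    # Normal training: conv and BN cache float32 maps, ReLU caches an int8 map.
    normal = 4 * n * n + 4 * n * n + n * n
    # DIVISION: LFC + HFC + 4 bytes for (step, offset), for conv and for BN.
    lfc = lfc_bytes_per_value * math.ceil(n / block) ** 2
    hfc = bits * n * n / 8
    per_layer = lfc + hfc + 4
    relu = n * n / 8                              # 1 bit per element
    return normal / (2 * per_layer + relu)


for n in (7, 14, 28, 56):
    print(n, round(conv_bn_relu_compression_rate(n, block=8, bits=2), 2))
```

Under these assumptions the smallest feature map (N = 7) gives the lowest rate, and the rate increases for larger N or B and decreases for larger Q, in line with the observations stated above.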

G.2 COMPRESSION RATE OF MLP TRAINING

We estimate the compression rate for a linear-ReLU block in Theorem 3B. An MLP simply stacks multiple linear-ReLU blocks, such that our estimated compression rate holds for MLP models.

Theorem 3B. DIVISION has an average activation map compression rate R for a linear-ReLU block, where Minibatch×N is the shape of activation map $H_l$; B denotes the block-size of LFC average pooling; and Q denotes the bit-width of HFC quantization.

Proof.

H EXPERIMENT SETTING

We give the experiment setting including the datasets, baseline methods and model architectures in this section.

Datasets. We consider the CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009) and ImageNet (Deng et al., 2009) datasets in our experiments.
CIFAR-10: An image dataset with 60,000 color images in 10 different classes, where each image has 32×32 pixels.
CIFAR-100: An image dataset with 60,000 color images in 100 different classes, where each image has 32×32 pixels.
ImageNet: A large scale image dataset which has over one million color images covering 1000 categories, where each image has 224×224 pixels.

Baseline Methods.
Normal: Caching the exact activation map for backward propagation.
BLPA: A systematic implementation of ACT by Chakrabarti & Moseley (2019), which only supports ResNet-related architectures.
AC-GC: A framework of ACT with automatically searched bit-width for the quantization of activation maps (Evans & Aamodt, 2021).
ActNN: Activation compression training with dynamic bit-width quantization, where the bit-allocation minimizes the variance of activation maps via dynamic programming (Chen et al., 2021).
Checkpoint: Caching some key activation maps to reconstruct other activation maps via replaying parts of the forward pass during the backward pass (Chen et al., 2016).
SWAP: Swapping the activation maps to the CPU during the forward pass to reduce the memory consumption of the GPU, and reloading the activation maps to the GPU during the backward pass (Huang et al., 2020).

I IMPLEMENTATION DETAILS ABOUT DIVISION AND BASELINE METHODS

DIVISION: DIVISION adopts block-size 8 (B = 8) and 2-bit quantization (Q = 2) to compress the activation maps of linear, convolutional and BatchNorm layers, where the theoretical compression rate is not less than 10.35×. For the operators without quantization error during backward propagation, such as pooling layers, ReLU activation, and Dropout, DIVISION follows the algorithms in Appendix J to compress the activation maps. Other hyper-parameter settings are given in Table 5.

BLPA: Existing work (Chakrabarti & Moseley, 2019) has shown that BLPA requires at least 4-bit ACT for loss-less DNN training. We follow this setting for BLPA, where the compression rate of activation maps is not more than 8×.

AC-GC: AC-GC follows existing work (Evans & Aamodt, 2021) to take the multiplicative error $(1 + e^2_{\text{AC-GC}}) = 1.5$, where the searched bit-width enables AC-GC to satisfy this loss bound (training loss not more than 150% of normal training). In this setting, AC-GC finalizes the bit-width as 7.01 after the searching, which has a nearly 4.57× compression rate of activation maps.

ActNN: ActNN adopts 2-bit ACT and dynamic programming for searching the optimal bit-width specific for each layer, and uses per-group quantization for compressing the activation map, which has approximately 10.5× compression rate of activation maps. Such an experimental setting is denoted as the L3 strategy in the original work (Chen et al., 2021), and we follow this setting in this section.

