DIVISION: MEMORY EFFICIENT TRAINING VIA DUAL ACTIVATION PRECISION

Abstract

Activation compressed training (ACT) has been shown to be a promising way to reduce the memory cost of training deep neural networks (DNNs). However, existing work on ACT relies on searching for the optimal bit-width during DNN training to reduce the quantization noise, which makes the procedure complicated and less transparent. To this end, we propose a simple and effective method to compress DNN training. Our method is motivated by an instructive observation: DNN backward propagation mainly utilizes the low-frequency component (LFC) of the activation maps, while the majority of memory is used for caching the high-frequency component (HFC) during training. This indicates that the HFC of activation maps is highly redundant and compressible during DNN training, which inspires our proposed Dual ActIVation PrecISION (DIVISION). During training, DIVISION preserves a high-precision copy of the LFC and compresses the HFC into a lightweight copy with low numerical precision. This significantly reduces the memory cost without negatively affecting the precision of backward propagation, so that DIVISION maintains competitive model accuracy. Experimental results show that DIVISION achieves over 10× compression of activation maps and significantly higher training throughput than state-of-the-art ACT methods, without loss of model accuracy. The code is available at https://anonymous.4open.science/r/division-5CC0/.
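The dual-precision idea described above can be illustrated with a minimal NumPy sketch. The abstract does not specify how the LFC is extracted or how the HFC is quantized, so the choices here are assumptions made only for illustration: the LFC is taken as a block-wise average (stored in float16), the HFC is the residual after upsampling the LFC, and the HFC is quantized with per-sample uniform min-max quantization. The function names (`compress`, `decompress`, `upsample`) and the parameters `block` and `bits` are hypothetical and are not taken from the released code.

```python
import numpy as np

def upsample(lfc, block):
    """Nearest-neighbour upsampling of the pooled map back to full resolution."""
    full = lfc.astype(np.float32)
    return np.repeat(np.repeat(full, block, axis=2), block, axis=3)

def compress(act, block=8, bits=2):
    """Split an (N, C, H, W) float32 activation map into LFC and quantized HFC.

    Assumes H and W are divisible by `block`.
    """
    n, c, h, w = act.shape
    # Assumed LFC: block-wise average pooling, kept at (relatively) high precision.
    pooled = act.reshape(n, c, h // block, block, w // block, block).mean(axis=(3, 5))
    lfc = pooled.astype(np.float16)
    # Assumed HFC: what is left after removing the upsampled LFC.
    hfc = act - upsample(lfc, block)
    # Per-sample uniform min-max quantization of the HFC to `bits` bits.
    lo = hfc.min(axis=(1, 2, 3), keepdims=True)
    hi = hfc.max(axis=(1, 2, 3), keepdims=True)
    levels = 2 ** bits - 1
    scale = np.maximum(hi - lo, 1e-8) / levels
    # Codes are held in uint8 for simplicity; a real implementation would bit-pack them.
    q = np.round((hfc - lo) / scale).astype(np.uint8)
    return lfc, q, lo, scale

def decompress(lfc, q, lo, scale, block=8):
    """Rebuild an approximate activation map for use in the backward pass."""
    return upsample(lfc, block) + q.astype(np.float32) * scale + lo

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    act = rng.standard_normal((2, 3, 16, 16)).astype(np.float32)
    lfc, q, lo, scale = compress(act)
    rec = decompress(lfc, q, lo, scale)
    print("max abs reconstruction error:", float(np.abs(rec - act).max()))
```

In this sketch only the small pooled map is kept at 16 bits, while every full-resolution element costs `bits` bits once packed, which is where the memory saving would come from under these assumptions.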

1. INTRODUCTION

Deep neural networks (DNNs) have been widely applied to real-world tasks such as language understanding (Devlin et al., 2018), machine translation (Vaswani et al., 2017), and visual detection and tracking (Redmon et al., 2016). With increasingly larger and deeper architectures, DNNs achieve remarkable improvement in representation learning and generalization capacity (Krizhevsky et al., 2012). Generally, training a larger model requires more memory resources to cache the activation values of all intermediate layers during back-propagation¹. For example, training a DenseNet-121 (Huang et al., 2017) on the ImageNet dataset (Deng et al., 2009) requires caching over 1.3 billion float activation values (4.8GB) during back-propagation, and training a ResNet-50 (He et al., 2016) requires caching over 4.6 billion float activation values (17GB). Some techniques have been developed to reduce the training cache of DNNs, such as checkpointing (Chen et al., 2016; Gruslys et al., 2016), mixed-precision training (Vanholder, 2016), low bit-width training (Lin et al., 2017; Chen et al., 2020), and activation compressed training (Georgiadis, 2019; Liu et al., 2022). Among these, activation compressed training (ACT) has emerged as a promising method due to its significant reduction of training memory and its competitive learning performance (Liu et al., 2021b). Existing work on ACT relies on quantizing the activation maps to reduce the memory consumption of DNN training, such as BLPA (Chakrabarti & Moseley, 2019), TinyScript (Fu et al., 2020), and ActNN (Chen et al., 2021). Although ACT can significantly reduce the training memory cost, the quantization process introduces noise in backward propagation, which makes the training suffer from undesirable degradation of accuracy (Fu et al., 2020). For this reason, BLPA requires 4-bit ACT to ensure convergence to the optimal solution on the ImageNet dataset, which yields only a 6× compression rate² of activation maps (Chakrabarti & Moseley, 2019). Other works propose to

¹ The activation map of each layer is required for estimating the gradient during backward propagation.

² A 6× compression rate indicates that the memory of cached activation maps is 1/6 of that of normal training.
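The memory figures quoted above can be checked with a short calculation. This is a sketch under two assumptions that the text does not state explicitly: each "float" activation value occupies 4 bytes (float32), and "GB" is read as 2^30 bytes. The helper names `cache_gib` and `compressed_gib` are hypothetical.

```python
BYTES_PER_FLOAT32 = 4   # assumption: activations are cached as 32-bit floats
GIB = 2 ** 30           # assumption: "GB" in the text means 2^30 bytes

def cache_gib(num_values, bytes_per_value=BYTES_PER_FLOAT32):
    """Memory needed to cache `num_values` activation values, in GiB."""
    return num_values * bytes_per_value / GIB

def compressed_gib(num_values, rate):
    """Cache size under a compression rate, i.e. 1/rate of normal training."""
    return cache_gib(num_values) / rate

if __name__ == "__main__":
    for name, n in [("DenseNet-121", 1.3e9), ("ResNet-50", 4.6e9)]:
        full = cache_gib(n)
        small = compressed_gib(n, rate=6)
        print(f"{name}: {full:.1f} GiB uncompressed, {small:.1f} GiB at a 6x rate")
```

Under these assumptions the uncompressed sizes come out close to the 4.8GB and 17GB stated in the text, and a 6× rate divides each by six.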

