LEARNING WITH AUXILIARY ACTIVATION FOR MEMORY-EFFICIENT TRAINING

Abstract

While deep learning has achieved great success in various fields, a large amount of memory is necessary to train deep neural networks, which hinders the development of massive state-of-the-art models. The reason is that the conventional learning rule, backpropagation, must temporarily store the input activations of all layers in the network. To overcome this, recent studies have suggested various memory-efficient implementations of backpropagation. However, those approaches incur computational overhead due to the recomputation of activations, slowing down neural network training. In this work, we propose a new learning rule that significantly reduces memory requirements while closely matching the performance of backpropagation. The algorithm combines an auxiliary activation with the output activation during forward propagation, while only the auxiliary activation is stored and used during backward propagation instead of the actual input activation, reducing the amount of data to be temporarily kept in memory. We mathematically show that our learning rule can reliably train networks if the auxiliary activation satisfies certain conditions. Based on this observation, we suggest candidates for auxiliary activations that satisfy those conditions. Experimental results confirm that the proposed learning rule achieves competitive performance compared to backpropagation in various models such as ResNet, Transformer, BERT, ViT, and MLP-Mixer.

1. INTRODUCTION

Backpropagation (1) is an essential learning rule for training deep neural networks and has demonstrated outstanding performance across diverse models and tasks. In general, the wider and deeper a deep learning model is, the better its training performance (2). However, increasing the model size unavoidably requires larger memory in training hardware such as GPUs (3; 4). This is because backpropagation must temporarily store the input activations of all layers in the network, generated during forward propagation, as they are later used to update weights in backward propagation. Consequently, state-of-the-art deep learning models require substantial memory resources due to the large volume of input activations to store. To train very deep models with limited hardware resources, the batch size can be reduced (3; 4) or many GPUs can be used in parallel (5; 6; 7; 8; 9; 10; 11). However, reducing the batch size lengthens training time and negates the benefit of batch normalization (12). Moreover, training huge models such as GPT-3 (13) still requires expensive clusters with thousands of GPUs and incurs high I/O costs even with parallelism. Recently, a wide range of algorithms have been proposed to alleviate this memory requirement. For instance, new optimizers (14; 15) and neural network architectures (16; 17; 18; 19; 20; 21) have been suggested to reduce memory requirements. Gradient checkpointing (22; 23; 24; 25; 26) saves memory by storing only some of the input activations during forward propagation and restoring the unsaved ones through recomputation in backward propagation. In-place activated batch normalization (27) merges a batch normalization layer and a leaky ReLU layer and stores the output activation of the merged layer in forward propagation; the layer input can then be reconstructed in backward propagation because the leaky ReLU function is invertible.
Similarly, RevNet (28), Momentum ResNet (29), and Reformer (30) employ reversible neural network architectures, which allow input activations to be calculated from output activations in backward propagation. Gradient checkpointing and reversible network structures reduce training memory because they store input activations only partially (e.g., the input activations of selected layers). However, these methods incur additional computational overhead because the unstored input activations must be recomputed during backward propagation. Alternatively, algorithms that approximate activations have been suggested (31; 32; 33; 34; 35; 36; 37; 38), but they suffer from performance degradation or slow down training due to the additional computations required to quantize and dequantize activations. TinyTL (39) entirely avoids saving activations by updating only bias parameters while keeping weight parameters fixed; however, it is applicable only to fine-tuning a pre-trained model.
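The memory/compute trade-off behind gradient checkpointing can be made concrete with a toy NumPy sketch (our own illustration, not any library's actual checkpointing API): the forward pass keeps only every k-th input activation, and any unstored activation is recovered later by replaying the forward computation from the nearest stored checkpoint.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, dim, k = 6, 4, 3                     # store a checkpoint every k layers
W = [rng.normal(size=(dim, dim)) * 0.5 for _ in range(n_layers)]

def layer(x, l):
    # One toy layer: linear transform followed by tanh.
    return np.tanh(W[l] @ x)

x0 = rng.normal(size=(dim, 1))

# Forward pass: keep only every k-th input activation (the "checkpoints").
checkpoints = {}
x = x0
for l in range(n_layers):
    if l % k == 0:
        checkpoints[l] = x
    x = layer(x, l)

def recompute_input(l):
    # Backward-time recomputation: replay the forward pass from the
    # nearest stored checkpoint to recover the input of layer l.
    start = (l // k) * k
    x = checkpoints[start]
    for j in range(start, l):
        x = layer(x, j)
    return x

# Reference forward pass that keeps the input of layer 4 for comparison.
x_full = x0
for l in range(n_layers):
    if l == 4:
        saved = x_full
    x_full = layer(x_full, l)

print(np.allclose(recompute_input(4), saved))  # True
```

Only 2 of the 6 input activations are ever stored, but recovering an unstored one costs up to k − 1 extra layer evaluations, which is exactly the computational overhead discussed above.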

2.1. MEMORY REQUIREMENTS OF BACKPROPAGATION

To understand why backpropagation uses a large memory space for training, we describe how it trains a deep neural network. The training process of backpropagation can be expressed by the equations below:

h_{l+1} = ϕ(W_{l+1} h_l + b_{l+1})    (1)

δ_l = W_{l+1}^T δ_{l+1} ⊙ ϕ′(y_l)    (2)

∆W_{l+1} = −η δ_{l+1} h_l^T    (3)

Here h_l, W_{l+1}, and b_{l+1} denote the input activation, weight, and bias of hidden layer l + 1, respectively. ϕ, δ, and η represent the nonlinear function, the backpropagated error, and the learning rate, respectively. y_l is the output of a linear or convolutional layer (i.e., y_l = W_l h_{l−1} + b_l), which becomes the input of the nonlinear function ϕ. In forward propagation, the input propagates through the network following equation (1). During this process, the generated input activations are stored in memory. In backward propagation, the error is backpropagated through the network based on equation (2), and the stored input activations are consumed by the weight update in equation (3).
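The storage requirement in equations (1)–(3) can be seen in a small NumPy sketch of a two-layer network (the layer sizes, ReLU nonlinearity, and toy error signal are illustrative choices of ours): every input activation h_l produced in the forward pass must be kept in memory, because the update ∆W_{l+1} = −η δ_{l+1} h_l^T needs it in the backward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(y):          # nonlinear function phi (ReLU here)
    return np.maximum(y, 0.0)

def phi_prime(y):    # derivative of phi w.r.t. its pre-activation y
    return (y > 0).astype(y.dtype)

sizes = [8, 16, 4]   # two hidden layers
W = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(2)]
b = [rng.normal(size=(sizes[i + 1], 1)) for i in range(2)]
h0 = rng.normal(size=(sizes[0], 1))

# Forward propagation (equation 1): every input activation h_l and
# pre-activation y_l is stored, because equations (2)-(3) need them later.
stored_h, stored_y = [h0], []
h = h0
for l in range(2):
    y = W[l] @ h + b[l]
    h = phi(y)
    stored_y.append(y)
    stored_h.append(h)

# Backward propagation (equations 2 and 3) with learning rate eta.
eta = 0.01
delta = stored_h[-1] - 1.0                       # toy error signal at the output
dW = []
for l in reversed(range(2)):
    dW.insert(0, -eta * delta @ stored_h[l].T)   # eq. (3): uses stored h_l
    if l > 0:
        delta = (W[l].T @ delta) * phi_prime(stored_y[l - 1])  # eq. (2)

print([g.shape for g in dW])   # each update matches its weight's shape
```

With only two layers the stored activations are small, but in a deep network this list grows linearly with depth (and with batch size), which is the memory cost the paper targets.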



In this study, we propose a new learning rule, Auxiliary Activation Learning, which significantly reduces the memory requirements for training deep neural networks without sacrificing training speed. We first introduce the concept of auxiliary activation into the training process. Auxiliary activations are combined with output activations and become the input activations of the next layer during forward propagation, but only the auxiliary activations, rather than the actual input activations, are temporarily stored for updating weights in backward propagation. To justify our algorithm, we prove that this alternative type of input activation can reliably train neural networks if the auxiliary activation satisfies certain conditions. We then propose multiple candidates for auxiliary activations that meet these conditions. Experimental results demonstrate that the proposed algorithm not only trains ResNet models (40) on ImageNet (41) with performance similar to backpropagation, but is also suitable for training other neural network architectures such as Transformer (42), BERT (43), ViT (44), and MLP-Mixer (45).
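To convey the flavor of the rule, the following single-layer NumPy sketch (our own simplified illustration, not the paper's full algorithm) substitutes a compact auxiliary activation a for the stored input activation h in the weight update; sign(h) is used here as a hypothetical candidate, costing one bit per entry instead of a full-precision value. The paper's actual candidates and the conditions they must satisfy are developed in later sections.

```python
import numpy as np

rng = np.random.default_rng(0)
eta, din, dout = 0.01, 8, 4
W = rng.normal(size=(dout, din)) * 0.5

h = rng.normal(size=(din, 1))        # actual input activation (full precision)
a = np.sign(h)                       # auxiliary activation: sign(h) is one
                                     # hypothetical candidate, 1 bit per entry

delta = rng.normal(size=(dout, 1))   # backpropagated error for this layer

# Backpropagation stores h and computes   Delta W = -eta * delta @ h.T
dW_bp  = -eta * delta @ h.T
# The auxiliary rule stores only a and uses it in place of h
dW_aal = -eta * delta @ a.T

# When a is positively correlated with h, the two updates stay aligned:
# their inner product is eta^2 * ||delta||^2 * sum|h_j| > 0.
cos = (dW_bp * dW_aal).sum() / (np.linalg.norm(dW_bp) * np.linalg.norm(dW_aal))
print(cos > 0)   # True for this correlated choice of a
```

The positive alignment between the exact and the auxiliary update is the intuition behind the conditions the paper imposes on auxiliary activations; the guarantee here holds for this particular sign-based choice and should not be read as the paper's general proof.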

Figure 1: Conventional memory-efficient training algorithms and Auxiliary Activation Learning.

