LEARNING WITH AUXILIARY ACTIVATION FOR MEMORY-EFFICIENT TRAINING

Abstract

While deep learning has achieved great success in various fields, a large amount of memory is required to train deep neural networks, which hinders the development of massive state-of-the-art models. The reason is that the conventional learning rule, backpropagation, must temporarily store the input activations of all layers in the network. To overcome this, recent studies have suggested various memory-efficient implementations of backpropagation. However, those approaches incur computational overhead due to the recomputation of activations, slowing down neural network training. In this work, we propose a new learning rule that significantly reduces memory requirements while closely matching the performance of backpropagation. The algorithm combines an auxiliary activation with the output activation during forward propagation, while only the auxiliary activation, instead of the actual input activation, is used during backward propagation, reducing the amount of data that must be temporarily stored. We mathematically show that our learning rule can reliably train networks if the auxiliary activation satisfies certain conditions. Based on this observation, we suggest candidates for auxiliary activations that satisfy those conditions. Experimental results confirm that the proposed learning rule achieves competitive performance compared to backpropagation on various models such as ResNet, Transformer, BERT, ViT, and MLP-Mixer.

1. INTRODUCTION

Backpropagation (1) is an essential learning rule for training deep neural networks and has proven its outstanding performance across diverse models and tasks. In general, the wider and deeper the deep learning model, the better the training performance (2). However, increasing the model size unavoidably requires larger memory in training hardware such as GPUs (3; 4). This is because backpropagation must temporarily store the input activations of all layers in the network generated during forward propagation, as they are later used to update weights in backward propagation. Consequently, state-of-the-art deep learning models require substantial memory resources due to the large amount of input activation to store. To train very deep models with limited hardware resources, the batch size may be reduced (3; 4), or many GPUs can be used in parallel (5; 6; 7; 8; 9; 10; 11). However, reducing the batch size lengthens training time and diminishes the benefit of batch normalization (12). Also, training huge models such as GPT-3 (13) still requires expensive clusters with thousands of GPUs and incurs high I/O costs even with parallelism.

Recently, a wide range of algorithms has been proposed to alleviate this memory requirement. For instance, new optimizers (14; 15) and neural network architectures (16; 17; 18; 19; 20; 21) have been suggested to reduce memory requirements. Gradient checkpointing (22; 23; 24; 25; 26) reduces memory usage by storing only some of the input activations during forward propagation; the unsaved input activations are then restored through recomputation in backward propagation. In-place activated batch normalization (27) merges a batch normalization layer and a leaky ReLU layer and stores only the output activation of the merged layer in forward propagation. In backward propagation, the layer input can be reconstructed for training because the leaky ReLU function is invertible.
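To make the checkpointing trade-off concrete, the sketch below uses PyTorch's `torch.utils.checkpoint.checkpoint_sequential`; the toy model, its sizes, and the number of segments are arbitrary illustrative choices, not settings from this paper:

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

torch.manual_seed(0)

# A toy deep stack; plain backpropagation would keep every layer's
# input activation alive until backward propagation.
layers = []
for _ in range(8):
    layers += [torch.nn.Linear(64, 64), torch.nn.ReLU()]
model = torch.nn.Sequential(*layers)

x = torch.randn(16, 64, requires_grad=True)

# Gradient checkpointing: split the stack into 4 segments, store only the
# activations at segment boundaries, and recompute the activations inside
# each segment on the fly during backward propagation.
out = checkpoint_sequential(model, 4, x, use_reentrant=False)
out.sum().backward()  # gradients match plain backprop, at extra compute cost
```

The resulting gradients are identical to those of ordinary backpropagation; the memory saving scales roughly with the segment length, paid for with one extra forward pass per segment during backward propagation.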
Similarly, RevNet (28), Momentum ResNet (29), and Reformer (30) employ reversible neural network architectures, which allow the input activation to be calculated from the output activation during backward propagation. Gradient checkpointing and reversible network structures reduce training memory because they store input activations only partially (e.g., the input activations of selected layers). However,

