FEW-BIT BACKWARD: QUANTIZED GRADIENTS OF ACTIVATION FUNCTIONS FOR MEMORY FOOTPRINT REDUCTION

Abstract

Memory footprint is one of the main limiting factors for large neural network training. In backpropagation, one needs to store the input to each operation in the computational graph. Every modern neural network model contains quite a few pointwise nonlinearities, and each such operation induces additional memory costs which, as we show, can be significantly reduced by quantization of the gradients. We propose a systematic approach to compute an optimal quantization of the retained gradients of pointwise nonlinear functions with only a few bits per element. We show that such an approximation can be achieved by computing an optimal piecewise-constant approximation of the derivative of the activation function, which can be done by dynamic programming. Drop-in replacements are implemented for all popular nonlinearities and can be used in any existing pipeline. We confirm the memory reduction and unchanged convergence on several open benchmarks.

1. INTRODUCTION

Memory consumed by the model during training (excluding intermediate tensors) can be split into two groups: 1) the model weights (including additional memory for the optimizer state), and 2) activations saved for the backward pass, over which no computation is carried out at the moment, but which will be required later to compute the gradients. Every operation in the computational graph generates a memory footprint. It is typically overlooked that applying a pointwise nonlinearity (such as GELU or sigmoid) results in storing its input for the backward pass. We show that instead of keeping the full input tensor, it is possible to store a low-bit representation, which still allows accurate gradient approximation.

In this work, we propose to approximate the derivative of the activation function in a piecewise-constant form. Such an approximation problem has to be solved once for each activation function, and we propose a simple technique to do that. The proposed approximation divides all input values into several bins and saves only the corresponding bin indices instead of the values themselves. This is a lossy compression, but the additional noise it introduces is negligible, as we show on several benchmarks in Section 4.

The main contributions of our paper are:

• We propose new approximate backward computation schemes that significantly reduce the memory consumption of neural network training.

• We benchmark our approach on several tasks. We show that it provides up to 40% memory reduction on various tasks while maintaining accuracy on par with the model trained via the standard approach.
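To make the bin-index idea concrete, the sketch below approximates the sigmoid derivative with a 3-bit (8-level) piecewise-constant function. The uniform bin edges over a clipped range and the midpoint levels are hypothetical simplifications chosen for illustration; the paper instead finds optimal edges and levels via dynamic programming.

```python
import math

def sigmoid_grad(x):
    """Exact derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

# Hypothetical 3-bit scheme: 8 uniform bins over [-8, 8]; each bin stores
# the derivative value at its midpoint.  (The paper computes optimal,
# non-uniform bin edges and levels by dynamic programming.)
BITS = 3
LO, HI = -8.0, 8.0
N_BINS = 2 ** BITS
WIDTH = (HI - LO) / N_BINS
LEVELS = [sigmoid_grad(LO + (i + 0.5) * WIDTH) for i in range(N_BINS)]

def quantize(x):
    """Map an input value to its bin index -- the only thing stored for backward."""
    return int((min(max(x, LO), HI - 1e-9) - LO) / WIDTH)

def dequantize(i):
    """Recover the approximate derivative from the stored bin index."""
    return LEVELS[i]
```

Each element thus costs 3 bits instead of 32, at the price of replacing the exact derivative with its bin's constant level.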

2. QUANTIZED GRADIENTS OF ACTIVATIONS

Figure 2: Computational graph of the forward and backward pass. Orange and purple parts of the graph correspond to the standard and the proposed way of saving tensors for backward, respectively. The tensor $x_{\mathrm{bit}}$ stands for the version saved using 2-bit quantization, while $x$ denotes its uncompressed version.

Gradients of activations using automatic differentiation. Modern deep learning frameworks use reverse-mode automatic differentiation to calculate the gradients of the loss with respect to the model parameters. The forward computation can be associated with a directed acyclic graph, depicted in Fig. 2. Each operation $f$ computes the output $X_{l+1}$ given the input $X_l$ and has to save some information $S_l$ that is used on the backward pass to calculate the derivative $\partial L / \partial X_l$ from $\partial L / \partial X_{l+1}$ and $S_l$. Thus, in a typical training loop, the intermediates $S_l$ of all operations in the graph are kept in memory from the forward pass until they are no longer needed, i.e., after the completion of the corresponding operation during the backward pass. This requires additional memory, which can be quite significant and even exceed the total size of the model parameters.

Pointwise activations. In this paper, we focus on pointwise activation functions, which are ubiquitous in modern neural network architectures. Given an input tensor $X_l$, we apply a function $f$ to each of its elements:
$$f(X_l) = \left[f\left(X_l^{j_1,\dots,j_k}\right)\right]_{j_1,\dots,j_k}, \qquad f : \mathbb{R} \to \mathbb{R}.$$
This operation is very cheap compared to other operations in a deep neural network and does not attract much attention when analysing computational complexity. However, the standard implementation in frameworks such as PyTorch induces a non-negligible memory footprint: the whole input $X_l$ is saved for the backward pass.
The backward pass for such a function consists of element-wise multiplication of the propagated gradient tensor by the derivative of the nonlinearity at the points of the input tensor: if $X_{l+1} = f(X_l)$, then the gradient of the loss $L$ with respect to $X_l$ is computed as
$$\frac{\partial L}{\partial X_l} = \frac{\partial L}{\partial X_{l+1}} \odot f'(X_l),$$
where $\odot$ denotes element-wise multiplication.
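This backward rule is all that is needed: once $f'(X_l)$ is replaced by its piecewise-constant approximation, only the bin indices must survive the forward pass. A minimal pure-Python sketch, using illustrative 2-bit edges and levels for the sigmoid derivative (placeholders, not the paper's optimal values):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical 2-bit scheme for the sigmoid derivative: 3 edges define 4 bins,
# each with one constant derivative level.  Real edges/levels would come from
# the dynamic-programming fit described in the paper.
EDGES = [-2.0, 0.0, 2.0]            # bin boundaries (illustrative)
LEVELS = [0.05, 0.20, 0.20, 0.05]   # one derivative value per bin (illustrative)

def bin_index(x):
    i = 0
    while i < len(EDGES) and x >= EDGES[i]:
        i += 1
    return i

def forward(xs):
    """Compute sigmoid(x); save only 2-bit bin indices instead of the input."""
    saved = [bin_index(x) for x in xs]
    return [sigmoid(x) for x in xs], saved

def backward(grad_out, saved):
    """dL/dx = dL/dy * f'(x), with f'(x) approximated by the bin's level."""
    return [g * LEVELS[i] for g, i in zip(grad_out, saved)]
```

In a real framework this would be a drop-in custom autograd operation; the only behavioural change is that `saved` holds 2-bit indices rather than the full-precision input tensor.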



Figure 1: Examples of 3-bit approximations for derivatives of popular nonlinearities: GELU, SELU, and Sigmoid.

