FEW-BIT BACKWARD: QUANTIZED GRADIENTS OF ACTIVATION FUNCTIONS FOR MEMORY FOOTPRINT REDUCTION

Abstract

Memory footprint is one of the main limiting factors for large neural network training. In backpropagation, one needs to store the input to each operation in the computational graph. Every modern neural network model has quite a few pointwise nonlinearities in its architecture, and each such operation induces an additional memory cost which, as we show, can be significantly reduced by quantization of the gradients. We propose a systematic approach to compute optimal quantization of the retained gradients of the pointwise nonlinear functions with only a few bits per element. We show that such an approximation can be achieved by computing an optimal piecewise-constant approximation of the derivative of the activation function, which can be done by dynamic programming. Drop-in replacements are implemented for all popular nonlinearities and can be used in any existing pipeline. We confirm the memory reduction and unchanged convergence on several open benchmarks.
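To make the dynamic-programming step concrete, the following is a minimal sketch of fitting an optimal piecewise-constant approximation to a derivative on a uniform grid under plain squared error; the paper's dynamic program optimizes a related objective, and the names `gelu_grad` and `optimal_piecewise_constant` are illustrative, not taken from the paper's code.

```python
import numpy as np
from scipy.stats import norm

def gelu_grad(x):
    """Exact derivative of GELU(x) = x * Phi(x): Phi(x) + x * phi(x)."""
    return norm.cdf(x) + x * norm.pdf(x)

def optimal_piecewise_constant(y, k):
    """Split y[0:n] into k contiguous segments, each replaced by its mean,
    minimizing total squared error. O(k * n^2) dynamic program."""
    n = len(y)
    s1 = np.concatenate([[0.0], np.cumsum(y)])      # prefix sums of y
    s2 = np.concatenate([[0.0], np.cumsum(y * y)])  # prefix sums of y^2

    def sse(i, j):
        # Squared error of segment y[i:j] approximated by its mean.
        m = (s1[j] - s1[i]) / (j - i)
        return (s2[j] - s2[i]) - m * (s1[j] - s1[i])

    cost = np.full((k + 1, n + 1), np.inf)
    back = np.zeros((k + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for p in range(1, k + 1):          # number of segments used so far
        for j in range(p, n + 1):      # segments cover y[0:j]
            for i in range(p - 1, j):  # last segment is y[i:j]
                c = cost[p - 1, i] + sse(i, j)
                if c < cost[p, j]:
                    cost[p, j], back[p, j] = c, i
    # Walk the backpointers to recover the segment boundaries.
    cuts, j = [n], n
    for p in range(k, 0, -1):
        j = back[p, j]
        cuts.append(j)
    cuts = cuts[::-1]
    levels = [(s1[b] - s1[a]) / (b - a) for a, b in zip(cuts[:-1], cuts[1:])]
    return cuts, levels

# 3 bits per element -> 2**3 = 8 constant pieces for GELU'(x) on [-5, 5].
xs = np.linspace(-5.0, 5.0, 501)
cuts, levels = optimal_piecewise_constant(gelu_grad(xs), k=8)
print("breakpoints:", [round(float(xs[c]), 2) for c in cuts[1:-1]])
print("levels:", [round(float(l), 3) for l in levels])
```

The resulting levels can be stored as a lookup table of 2^b constants, so the backward pass only needs a b-bit code per element to reconstruct the quantized derivative.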

1. INTRODUCTION

Memory consumed by the model during training (excluding intermediate tensors) can be split into two groups: 1) the model weights (including additional memory for the optimizer state), and 2) activations saved for the backward pass, i.e., tensors that are not being operated on at the moment but will be required later to compute the gradients. Every operation in the computational graph generates a memory footprint. It is typically overlooked that applying a pointwise nonlinearity (such as GELU or sigmoid) results in storing the whole input tensor for the backward pass.
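To make this concrete, below is a minimal sketch of a drop-in quantized-gradient GELU as a PyTorch autograd function; it saves a small per-element code instead of the full fp32 input. The boundaries and levels are illustrative placeholders, not the optimal values produced by the paper's dynamic program, and uint8 stands in for a packed 3-bit code.

```python
import torch
import torch.nn.functional as F

class FewBitGELU(torch.autograd.Function):
    # Hypothetical 3-bit (8-level) piecewise-constant fit of GELU'(x);
    # illustrative values only, not the paper's optimal quantization.
    boundaries = torch.tensor([-2.4, -1.2, -0.5, 0.0, 0.5, 1.2, 2.4])
    levels = torch.tensor([-0.02, -0.10, 0.08, 0.33, 0.67, 0.95, 1.10, 1.00])

    @staticmethod
    def forward(ctx, x):
        # Save one small integer code per element instead of the fp32 input.
        code = torch.bucketize(x, FewBitGELU.boundaries.to(x.device))
        ctx.save_for_backward(code.to(torch.uint8))
        return F.gelu(x)

    @staticmethod
    def backward(ctx, grad_out):
        (code,) = ctx.saved_tensors
        levels = FewBitGELU.levels.to(grad_out.device)
        # Quantized derivative: one of 8 constants, looked up by the code.
        return grad_out * levels[code.long()]

x = torch.randn(4, 16, requires_grad=True)
FewBitGELU.apply(x).sum().backward()
print(x.grad.shape)  # gradients flow through the quantized derivative
```

Compared with the standard GELU, which retains 32 bits per input element, a packed 3-bit code cuts the nonlinearity's activation memory by roughly a factor of ten.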



Figure 1: Examples of 3-bit approximations for derivatives of popular nonlinearities: GELU, SELU, and Sigmoid.

