TRAINING NEURAL NETWORKS WITH LOW-PRECISION MODEL MEMORY

Abstract

The demand for memory to store model-related statistics ("model memory") is a major bottleneck for training large neural networks. A promising solution is low-precision optimizers, which reduce the numerical precision of the model memory. However, existing work only compresses the momentum, resulting in suboptimal memory efficiency. This paper proposes Low-Precision Model Memory (LPMM), an optimization framework that keeps the entire model memory in low precision. LPMM compresses not only the momentum but also model parameters and gradient accumulators. We identify arithmetic underflow as the main problem in building low-precision optimizers and propose a stochastic quantization method and a micro-batching technique to overcome this problem. We analyze the convergence behavior of LPMM and theoretically show how the proposed techniques affect underflowing, which in turn affects convergence. We apply LPMM to the SGD optimizer with momentum (SGDM). On several realistic benchmarks, LPMM-SGDM trains neural networks with negligible loss of accuracy while reducing over 70% of the model memory compared to full-precision SGDM.

1. INTRODUCTION

Many huge models (Kenton & Toutanova, 2019; Radford et al., 2019; Dosovitskiy et al., 2021; Brown et al., 2020; Fedus et al., 2022) have emerged in recent years. Despite being powerful, these models are challenging to train. Memory is the main bottleneck in developing large models, as the training memory footprint is typically proportional to the number of model parameters. During training, the device memory is consumed by three types of objects: 1. data-related objects ("data memory"), including data and each layer's activations, whose size is proportional to the data size, i.e., mini-batch size and image resolution / sequence length; 2. model-related objects ("model memory"), including model parameters, momentum, and gradient accumulators, whose size is proportional to the number of model parameters; 3. temporary objects ("workspace memory"), such as scratch memory used by computing kernels and memory fragments. Among the three types, model memory is the main bottleneck in scaling up machine learning models (Rajbhandari et al., 2020). Quantization is a promising way of reducing the model memory. Specifically, low-precision optimizers (Ramesh et al., 2021; Dettmers et al., 2021) represent their states with low-precision numerical formats, such as 8-bit integers, which consume less memory. In particular, Dettmers et al. (2021) propose an 8-bit optimizer, which quantizes the momentum to a block-wise 8-bit format. However, existing works have two limitations. First, the convergence behavior of low-precision optimizers is not well understood theoretically. Second, they only quantize the momentum, while model parameters and gradients are left in full precision, so the overall memory saving is unsatisfactory. In this work, we propose LPMM, a novel framework for optimization with low-precision model memory. Unlike previous works, LPMM keeps all model-related objects, including model parameters, momentum, and gradients, in low precision.
We identify arithmetic underflow as the major bottleneck in building low-precision optimizers. Theoretically, we analyze the convergence behavior of low-precision optimizers. Our analysis reveals how the design of the optimizer affects the degree of underflowing, which we link to the convergence behavior. Algorithmically, we propose stochastic quantization and gradient accumulation methods to reduce underflowing; these techniques are backed by our theoretical findings. We further discuss the quantizer design and system implementation for LPMM. We evaluate LPMM on the standard image classification benchmark. LPMM can quantize gradients and the momentum to 8 bits, and model parameters to 12 bits, with negligible loss of accuracy. In total, LPMM requires only 28 bits of model memory per parameter, compared to 72 bits for Dettmers et al. (2021) or 96 bits for the full-precision algorithm. Activation compression methods, which keep low-precision activations in memory, reduce the data memory rather than the model memory and are orthogonal to our approach.
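To illustrate why underflow arises and how stochastic quantization mitigates it, the following sketch (our own illustration, not code from the paper) compares deterministic round-to-nearest with unbiased stochastic rounding on an update far smaller than the quantization step:

```python
import numpy as np

def quantize_stochastic(x, step):
    """Stochastically round x to a multiple of `step`.

    Rounds up with probability equal to the fractional position between
    grid points, so E[quantize_stochastic(x, step)] = x. Small updates
    survive in expectation instead of underflowing to zero.
    """
    scaled = x / step
    low = np.floor(scaled)
    up = np.random.rand(*np.shape(x)) < (scaled - low)  # fractional part
    return (low + up) * step

def quantize_nearest(x, step):
    """Deterministic round-to-nearest: updates below step/2 vanish."""
    return np.round(x / step) * step

# An update below half the quantization step underflows deterministically,
# but is preserved on average by stochastic rounding.
step, update = 1.0 / 256, 0.001
print(quantize_nearest(update, step))                      # 0.0
print(quantize_stochastic(np.full(100000, update), step).mean())  # ~0.001
```

The zero output of round-to-nearest is exactly the arithmetic underflow described above: a parameter updated this way would never move, no matter how many such updates are applied.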

2. RELATED WORK

Quantized Training A line of work focuses on accelerating neural network training with low-precision computation (Wang et al., 2018b; Micikevicius et al., 2017; Zhu et al., 2020; Sun et al., 2020). However, these methods still store the model-related objects in full precision or in 16-bit formats.

3. TRAINING WITH LOW-PRECISION MODEL MEMORY

In this section, we formulate the problem of training neural networks with low-precision model memory. We show that arithmetic underflow is the major bottleneck for reducing the numerical precision. To solve this problem, we propose stochastic quantization and micro-batching techniques. Here, we mainly consider stochastic gradient descent with momentum (SGDM) (Qian, 1999; Sutskever et al., 2013) as a motivating example, but the proposed techniques also apply to other optimizers, such as stochastic gradient descent (Bottou, 2010) and Adam (Kingma & Ba, 2015).

3.1. BASIC OPTIMIZATION FRAMEWORK

Consider the empirical risk minimization problem $\min_\theta f(\theta) = \frac{1}{n}\sum_{i=1}^n f_i(\theta)$, where $\theta$ is the model parameter. Since the dataset size $n$ is large, stochastic optimizers are adopted for solving the problem. SGDM is one of the most widely used optimizers. Starting with an initial model $\theta_0$ and momentum $m_0 = 0$, SGDM performs the following updates: $m_t \leftarrow \beta m_{t-1} + \tilde\nabla f(\theta_{t-1})$, $\theta_t \leftarrow \theta_{t-1} - \alpha m_t$, where $\tilde\nabla f(\theta_{t-1})$ is an unbiased estimator of the gradient. We define $\tilde\nabla f(\theta_{t-1}) := \nabla f(\theta_{t-1}, \zeta_t) := \frac{1}{|\zeta_t|}\sum_{i \in \zeta_t} \nabla f_i(\theta_{t-1})$, where $\zeta_t$ is a minibatch sampled uniformly from the dataset $\{1, 2, \ldots, n\}$.
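As a concrete reference, the SGDM update can be written in a few lines of NumPy (a full-precision sketch with names of our choosing; LPMM additionally quantizes the states):

```python
import numpy as np

def sgdm_step(theta, m, grad, lr=0.1, beta=0.9):
    """One SGDM update: m_t = beta * m_{t-1} + grad; theta_t = theta_{t-1} - lr * m_t."""
    m = beta * m + grad
    theta = theta - lr * m
    return theta, m

# Minimize f(x) = x^2 (gradient 2x), starting from x = 1:
theta, m = np.array([1.0]), np.zeros(1)
for _ in range(100):
    theta, m = sgdm_step(theta, m, 2.0 * theta, lr=0.05)
```

In the low-precision setting, `theta` and `m` live in quantized formats, so each assignment above involves a quantization step in which small increments may underflow.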



Training Methods With Compressed Model Memory Several works train neural networks with quantized parameters, momentum, or gradients. QSGD (Alistarh et al., 2017) quantizes the gradient into fewer bits for efficient communication. Low-precision SGD (Li et al., 2017; Li & De Sa, 2019; De Sa et al., 2018; Yang et al., 2019) uses low-precision parameters for training weight-quantized neural networks. The 8-bit optimizer (Dettmers et al., 2021) quantizes the momentum for SGDM and Adam. Many of these works are not designed for saving memory. Moreover, each considers only part of the model memory, without a unified framework or analysis.
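For concreteness, QSGD's gradient quantizer can be sketched as follows (our paraphrase of Alistarh et al., 2017: each coordinate is stochastically rounded onto s uniform levels scaled by the gradient's Euclidean norm, which keeps the quantizer unbiased):

```python
import numpy as np

def qsgd_quantize(v, s=4):
    """Quantize vector v onto s uniform magnitude levels, QSGD-style.

    Each |v_i| / ||v||_2 is stochastically rounded to a multiple of 1/s,
    so E[qsgd_quantize(v, s)] = v (unbiased) while each coordinate needs
    only about log2(s) + 1 bits plus the shared norm.
    """
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    scaled = s * np.abs(v) / norm                 # in [0, s]
    low = np.floor(scaled)
    up = np.random.rand(v.size) < (scaled - low)  # stochastic rounding
    return norm * np.sign(v) * (low + up) / s
```

The same stochastic-rounding idea underlies the quantization used throughout this paper, though QSGD applies it to communicated gradients rather than to stored model memory.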

