FAST BINARIZED NEURAL NETWORK TRAINING WITH PARTIAL PRE-TRAINING

Abstract

Binarized neural networks, networks with weights and activations constrained to lie in a 2-element set, allow for more time- and resource-efficient inference than standard floating-point networks. However, binarized neural networks typically require more training to plateau in accuracy than their floating-point counterparts, in terms of both iteration count and wall-clock time. We demonstrate a technique, partial pre-training, that allows for faster from-scratch training of binarized neural networks by first training the network as a standard floating-point network for a short amount of time, then converting the network to a binarized neural network and continuing to train from there. Without tuning any hyperparameters, across four networks on three different datasets, partial pre-training trains binarized neural networks between 1.26× and 1.61× faster than training a binarized network from scratch using standard low-precision training.

1. INTRODUCTION

Quantizing neural networks (Gupta et al., 2015), constraining weights and activations to take on values within some small fixed set, is a popular family of techniques for reducing the storage (Han et al., 2016) or compute (Fromm et al., 2020) requirements of deep neural networks. Weights and activations can often be quantized down to as few as 8 bits with no loss in accuracy compared to a full-precision model. Further quantization often comes at the expense of accuracy: it is possible to binarize neural networks (Hubara et al., 2016; Rastegari et al., 2016), constraining weights and activations to take on values within a set of two elements (often {-1, 1}), but such binarization often lowers the accuracy of the resultant network, necessitating a tradeoff between the desired compression and accuracy. In the literature, there are two primary techniques for obtaining a quantized neural network: quantizing a pre-trained full-precision network (Banner et al., 2019; Han et al., 2016), and training a quantized network from scratch (Hubara et al., 2016; Gupta et al., 2015).

Full-precision training. Quantizing a full-precision network requires few or even no additional training epochs on top of training that full-precision network. Typical procedures for quantizing a full-precision network range from data-blind procedures, such as selecting quantization bins to minimize distance from the original weights (Banner et al., 2019), to data-intensive procedures, such as retraining the network to be more amenable to quantization (Han et al., 2016). However, without significant additional training time, quantizing a pre-trained network often does not reach the highest accuracy possible for the quantized network architecture (Alizadeh et al., 2019).
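To make the binarization operation concrete, the following is a minimal sketch of deterministic weight binarization to {-1, +1} via the sign function, in the style of Hubara et al. (2016). The function name and the handling of exact zeros are illustrative choices, not taken from any particular paper's code.

```python
import numpy as np

def binarize(w):
    """Binarize a tensor to {-1, +1} using the sign function.

    np.sign maps 0 to 0, so exact zeros are sent to +1 here to keep
    the 2-element codomain {-1, +1}; this tie-breaking choice is an
    illustrative convention.
    """
    b = np.sign(w)
    b[b == 0] = 1.0
    return b

w = np.array([0.7, -0.2, 0.0, 1.3])
print(binarize(w))  # every entry is -1 or +1
```

The same operator is typically applied to activations as well, with a scaling factor per channel or per layer in schemes such as XNOR-Net (Rastegari et al., 2016).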
Further, achieving high accuracy with heavy quantization, such as binarization, often requires changing the network architecture, for instance by adding skip connections (Bethge et al., 2019); such architectural changes mean that the weights of a pre-trained full-precision network may not transfer to the new architecture.

Low-precision training. Alternatively, training a quantized network from scratch allows for achieving high accuracy regardless of the availability of pre-trained full-precision weights (Alizadeh et al., 2019). Typical procedures for training a quantized network from scratch involve tracking and optimizing latent weights, weights which are quantized during the forward pass but treated as full-precision during the backward pass (Hubara et al., 2016). However, training a quantized network from scratch can be costly. Quantized networks typically require more training iterations to plateau in accuracy (Hubara et al., 2016, Figure 1; Bethge et al., 2019, Figure 2). Further, since quantized networks are often trained by simulating the quantized operations in floating-point (Zhang et al., 2019), low-precision training can be even more computationally expensive than the full-precision equivalent.

Research question. In this paper, we explore the question: Can we accelerate training a binarized neural network from scratch to a given target accuracy? Concretely, we assume that a network architecture and standard training schedule are provided, but that pre-trained full-precision networks are not available. We also specifically focus on achieving accuracy in the early phase of training, exposing the tradeoff between training cost and accuracy.
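The latent-weight training described above can be sketched on a toy problem: the forward pass uses binarized weights, while the gradient is applied "straight through" to the full-precision latent weights as if binarization were the identity (Hubara et al., 2016). The regression task, learning rate, and clipping range below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def binarize(w):
    # Map to {-1, +1}; zeros are sent to +1 by convention.
    b = np.sign(w)
    b[b == 0] = 1.0
    return b

# Toy one-layer regression whose target is realizable by binary weights.
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 4))
y = x @ np.array([1.0, -1.0, 1.0, -1.0])

w_latent = 0.1 * rng.standard_normal(4)      # full-precision latent weights
lr = 0.05
for _ in range(200):
    w_bin = binarize(w_latent)               # forward pass uses binary weights
    err = x @ w_bin - y
    grad = x.T @ err / len(x)                # gradient w.r.t. the binary weights ...
    w_latent -= lr * grad                    # ... applied straight through to the latent weights
    w_latent = np.clip(w_latent, -1.0, 1.0)  # latent-weight clipping, as in Hubara et al. (2016)

print(binarize(w_latent))  # recovered binary weights
```

Note that the latent weights never appear in the forward pass; they only accumulate gradient information that eventually flips the signs of the binarized weights.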

Contributions.

• We present partial pre-training, which can train binarized neural networks from scratch between 1.26× and 1.61× faster than standard low-precision training.
• We find that partial pre-training both requires fewer iterations to train to a given accuracy and takes less time per iteration, on average, than standard low-precision training.
• We analyze the sensitivity of partial pre-training to the choice of split between full-precision and low-precision training, finding that an even split, though not always optimal, nearly matches the highest accuracy achievable by any other choice of split.

Altogether, we find that partial pre-training is a simple and effective approach for accelerating binarized neural network training. Partial pre-training is a step towards the goal of binarized neural network training procedures that can match the efficiency gains of binarized neural network inference.
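The split discussed in the last contribution can be expressed as a simple budget calculation; a minimal sketch follows, where the function name and the `pretrain_frac` parameter are illustrative, not from the paper's code.

```python
def split_budget(total_iters, pretrain_frac=0.5):
    """Divide a fixed iteration budget between full-precision
    pre-training and low-precision (binarized) training.
    pretrain_frac=0.5 is the even split discussed above."""
    fp_iters = int(total_iters * pretrain_frac)
    return fp_iters, total_iters - fp_iters

print(split_budget(100_000))        # even split of the budget
print(split_budget(100_000, 0.25))  # shorter pre-training phase
```

The point of the sensitivity analysis is that, under a fixed total budget, the default `pretrain_frac=0.5` is a safe choice without per-network tuning.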

2. BACKGROUND

Binarized neural networks trade off accuracy for inference efficiency. However, binarized neural networks often take longer to train than the full-precision versions of the same network architecture, both in terms of training iterations until convergence and wall-clock time per iteration.

Training iterations. Binarized neural networks tend to take more iterations to train than the full-precision versions of the same network architecture. For instance, Hubara et al. (2016, Figure 1) show a binarized neural network with a custom architecture requiring 4× as many training iterations to plateau in accuracy as a full-precision baseline on CIFAR-10. Bethge et al. (2019, Figure 2) similarly show a binarized ResNet-18 taking 2× as many training iterations to plateau in accuracy as a full-precision baseline on ImageNet.

Wall-clock time. Beyond requiring more iterations to train, binarized neural networks tend to take more wall-clock time to complete each iteration of training than full-precision networks do. This is because binarized neural networks are often trained by simulating low-precision operations with standard floating-point operations (Zhang et al., 2019; Fromm et al., 2020), and require additional bookkeeping that full-precision networks do not, such as performing the binarization of weights and activations (Hubara et al., 2016) and calculating scaling factors (Rastegari et al., 2016). While it is theoretically possible to accelerate binarized neural network training, we are not aware of any effort to exploit binarization during the training phase. It is also not clear what the maximum speedup from accelerating binarized neural network training would be: Fromm et al. (2020) show an acceleration of 6.33× for a VGG during inference; with the additional bookkeeping of binarized neural network training, and potentially requiring higher-precision gradients in the backward pass (Zhou et al., 2016), real training speedups would likely be lower.

Partial pre-training. To answer the research question posed above, we evaluate a technique, partial pre-training, that allows for faster training of binarized neural networks: first train the network as a standard floating-point network with standard full-precision training for a short amount of time, then convert the network to a binarized neural network and continue to train it with standard low-precision training for the remainder of the budgeted training time. We specifically evaluate partial pre-training's speedup over standard low-precision training when training a binarized neural network from scratch. We find that partial pre-training can train VGG, ResNet, and Neural Collaborative Filtering networks on CIFAR-10, ImageNet, and MovieLens-20m between 1.26× and 1.61× faster than standard low-precision training.
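The two-phase schedule can be sketched end-to-end on a toy problem: phase 1 trains full-precision weights, then phase 2 binarizes the forward pass and continues from the pre-trained weights with straight-through latent-weight updates. The regression task, hyperparameters, and helper names are illustrative assumptions, not the paper's training setup.

```python
import numpy as np

def binarize(w):
    # Map to {-1, +1}; zeros are sent to +1 by convention.
    b = np.sign(w)
    b[b == 0] = 1.0
    return b

def train(total_iters, pretrain_frac=0.5, lr=0.05, seed=0):
    """Partial pre-training on a toy linear-regression task whose
    target is realizable by binary weights. Phase 1: full-precision
    training. Phase 2: binarized forward pass with straight-through
    updates to the (now latent) full-precision weights."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((64, 8))
    y = x @ binarize(rng.standard_normal(8))

    w = 0.1 * rng.standard_normal(8)
    fp_iters = int(total_iters * pretrain_frac)   # even split by default
    for step in range(total_iters):
        # Phase switch: full-precision forward, then binarized forward.
        w_fwd = w if step < fp_iters else binarize(w)
        grad = x.T @ (x @ w_fwd - y) / len(x)
        w -= lr * grad                            # straight-through in phase 2
        if step >= fp_iters:
            w = np.clip(w, -1.0, 1.0)             # latent-weight clipping
    return binarize(w), x, y

w_bin, x, y = train(400)
print(float(np.mean((x @ w_bin - y) ** 2)))  # small if the signs were recovered
```

Because phase 1 drives the full-precision weights close to the (binary) target, the conversion at the phase boundary starts phase 2 from latent weights whose signs are already largely correct, which is the intuition behind the speedup.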

