FAST BINARIZED NEURAL NETWORK TRAINING WITH PARTIAL PRE-TRAINING

Abstract

Binarized neural networks, networks with weights and activations constrained to lie in a 2-element set, allow for more time- and resource-efficient inference than standard floating-point networks. However, binarized neural networks typically take more training to plateau in accuracy than their floating-point counterparts, in terms of both iteration count and wall-clock time. We demonstrate a technique, partial pre-training, that allows for faster from-scratch training of binarized neural networks by first training the network as a standard floating-point network for a short amount of time, then converting the network to a binarized neural network and continuing to train from there. Without tuning any hyperparameters across four networks on three different datasets, partial pre-training is able to train binarized neural networks between 1.26× and 1.61× faster than training a binarized network from scratch using standard low-precision training.

1. INTRODUCTION

Quantizing neural networks (Gupta et al., 2015), constraining weights and activations to take on values within some small fixed set, is a popular family of techniques for reducing the storage (Han et al., 2016) or compute (Fromm et al., 2020) requirements of deep neural networks. Weights and activations can often be quantized down to as few as 8 bits with no loss in accuracy compared to a full-precision model. Further quantization often comes at the expense of accuracy: it is possible to binarize neural networks (Hubara et al., 2016; Rastegari et al., 2016), constraining weights and activations to take on values within a set of two elements (often {-1, 1}), but such binarization often lowers the accuracy of the resultant network, necessitating a tradeoff between desired compression and accuracy. In the literature, there are two primary techniques for obtaining a quantized neural network: quantizing a pre-trained full-precision network (Banner et al., 2019; Han et al., 2016), and training a quantized network from scratch (Hubara et al., 2016; Gupta et al., 2015).

Full-precision training. Quantizing a full-precision network requires few or even no additional training epochs on top of training that full-precision network. Typical procedures for quantizing a full-precision network range from data-blind procedures, like selecting quantization bins to minimize distance from the original weights (Banner et al., 2019), to data-intensive procedures, such as retraining the network to be more amenable to quantization (Han et al., 2016). However, without significant additional training time, quantizing a pre-trained network often does not reach the highest accuracy possible for the quantized network architecture (Alizadeh et al., 2019).
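To make the data-blind case concrete, the following sketch binarizes a pre-trained weight tensor without any training data, using the closed-form scaling factor α = mean(|w|) that minimizes the L2 distance between the original weights and their binarized form (Rastegari et al., 2016). The function name and the example tensor are illustrative, not from any particular codebase.

```python
import numpy as np

def binarize_data_blind(w):
    """Binarize pre-trained weights to alpha * sign(w), where
    alpha = mean(|w|) minimizes ||w - alpha * sign(w)||_2.
    No training data is needed: the procedure only reads w.
    (Real implementations also map sign(0) to +1 or -1;
    this sketch assumes no exactly-zero weights.)"""
    alpha = np.abs(w).mean()
    return alpha * np.sign(w), alpha

# Example: a small pre-trained weight vector.
w = np.array([0.3, -1.2, 0.7, -0.1])
wb, alpha = binarize_data_blind(w)  # alpha = 0.575
```

Every binarized weight is ±α, so the layer can be stored as one float plus one bit per weight.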
Further, achieving high accuracy with heavy quantization, such as binarization, often requires changing the network architecture, for instance by adding skip connections (Bethge et al., 2019); such architectural changes mean that the weights of a pre-trained full-precision network may not transfer to the new architecture.

Low-precision training. Alternatively, training a quantized network from scratch allows for achieving high accuracy regardless of the availability of pre-trained full-precision weights (Alizadeh et al., 2019). Typical procedures for training a quantized network from scratch involve tracking and optimizing latent weights: weights which are quantized during the forward pass but treated as full-precision during the backward pass (Hubara et al., 2016). However, training a quantized network from scratch can be costly. Quantized networks typically require more training iterations to plateau in accuracy (Hubara et al., 2016, Figure 1; Bethge et al., 2019, Figure 2). Further, since quantized networks are often trained by simulating the quantized operations in floating point (Zhang et al., 2019), low-precision training can be even more computationally expensive than its full-precision equivalent.
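The latent-weight scheme can be sketched as a single SGD step. This is a minimal illustration under assumed simplifications (a single linear unit, squared-error loss, hypothetical function names), not the authors' implementation: the forward pass uses the binarized weights, while the gradient is applied to the full-precision latent weights via the straight-through estimator, i.e. the binarization step is treated as the identity in the backward pass.

```python
import numpy as np

def binarize(w):
    # Quantized forward pass: map latent weights to {-1, +1},
    # treating sign(0) as +1.
    return np.where(w >= 0, 1.0, -1.0)

def latent_weight_step(w_latent, x, y_target, lr=0.1):
    """One SGD step on a single linear unit with squared-error loss.
    Forward: binarized weights. Backward: straight-through estimator,
    so the gradient w.r.t. the binarized weights updates the latent
    full-precision weights directly."""
    w_b = binarize(w_latent)       # quantized weights used in the forward pass
    y = x @ w_b                    # prediction with binarized weights
    grad_y = 2.0 * (y - y_target)  # d(loss)/dy for squared error
    grad_w = grad_y * x            # straight-through: d(binarize)/dw treated as 1
    return w_latent - lr * grad_w  # update the latent (float) weights

# One update step on a toy input.
w_new = latent_weight_step(np.array([0.2, -0.5]),
                           np.array([1.0, 1.0]),
                           y_target=2.0)
```

Note that the latent weights stay full-precision throughout training, which is why simulated low-precision training can cost as much as, or more than, ordinary floating-point training.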

