CPT: EFFICIENT DEEP NEURAL NETWORK TRAINING VIA CYCLIC PRECISION

Abstract

Low-precision deep neural network (DNN) training has gained tremendous attention as reducing precision is one of the most effective knobs for boosting DNNs' training time/energy efficiency. In this paper, we explore low-precision training from a new perspective inspired by recent findings in understanding DNN training: we conjecture that DNNs' precision might have a similar effect as the learning rate during DNN training, and advocate dynamic precision along the training trajectory for further boosting the time/energy efficiency of DNN training. Specifically, we propose Cyclic Precision Training (CPT) to cyclically vary the precision between two boundary values, which can be identified using a simple precision range test within the first few training epochs. Extensive simulations and ablation studies on five datasets and eleven models demonstrate that CPT's effectiveness is consistent across various models/tasks (including classification and language modeling). Furthermore, through experiments and visualization we show that CPT helps to (1) converge to a wider minimum with a lower generalization error and (2) reduce training variance, which we believe opens up a new design knob for simultaneously improving the optimization and efficiency of DNN training.

1. INTRODUCTION

The record-breaking performance of modern deep neural networks (DNNs) comes at a prohibitive training cost due to the required massive training data and parameters, limiting the development of the highly demanded DNN-powered intelligent solutions for numerous applications (Liu et al., 2018; Wu et al., 2018). As an illustration, training ResNet-50 involves 10^18 FLOPs (floating-point operations) and can take 14 days on one state-of-the-art (SOTA) GPU (You et al., 2020b). Meanwhile, the large cost of DNN training has raised increasing financial and environmental concerns. For example, it is estimated that training one DNN can cost more than $10K US dollars and emit as much carbon as a car's lifetime emissions. In parallel, recent DNN advances have fueled a tremendous need for intelligent edge devices, many of which require on-device in-situ learning to maintain accuracy under dynamic real-world environments, yet there is a mismatch between these devices' limited resources and the prohibitive training costs (Wang et al., 2019b; Li et al., 2020; You et al., 2020a).

To address the aforementioned challenges, extensive research efforts have been devoted to developing efficient DNN training techniques. Among them, low-precision training has gained significant attention as it can largely boost the training time/energy efficiency (Jacob et al., 2018; Wang et al., 2018a; Sun et al., 2019). For instance, GPUs can now perform mixed-precision DNN training with 16-bit IEEE half-precision floating-point formats (Micikevicius et al., 2017b). Despite their promise, existing low-precision works have not yet fully leveraged recent findings in understanding DNN training. In particular, existing works mostly fix the model precision during the whole training process, i.e., adopt a static quantization strategy, while recent works in DNN training optimization suggest varying hyper-parameters dynamically along DNNs' training trajectory.
For example, (Li et al., 2019) shows that a large initial learning rate helps the model memorize easier-to-fit and more generalizable patterns, which aligns with the common practice of starting from a large learning rate for exploration and annealing to a small one for final convergence; and (Smith, 2017; Loshchilov & Hutter, 2016) improve DNNs' classification accuracy by adopting cyclical learning rates. In this work, we advocate dynamic precision training, and make the following contributions:

• We show that DNNs' precision seems to have a similar effect as the learning rate during DNN training, i.e., low precision with large quantization noise helps DNN training exploration while high precision with more accurate updates aids model convergence, and dynamic precision schedules help DNNs converge to a better minimum. This finding opens up a design knob for simultaneously improving the optimization and efficiency of DNN training.

• We propose Cyclic Precision Training (CPT), which adopts a cyclic precision schedule along DNNs' training trajectory, pushing forward the achievable trade-offs between DNNs' accuracy and training efficiency. Furthermore, we show that the cyclic precision bounds can be automatically identified at the very early stage of training using a simple precision range test, which incurs a negligible computational overhead.

• Extensive experiments on five datasets and eleven models across a wide spectrum of applications (including classification and language modeling) validate the consistent effectiveness of the proposed CPT technique in boosting training efficiency while achieving comparable or even better accuracy. Furthermore, we provide loss surface visualizations to better understand CPT's effectiveness and discuss its connection with recent findings in understanding DNNs' training optimization.
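To make the idea of a cyclic precision schedule concrete, the following is a minimal sketch of one plausible instantiation: a cosine-shaped bitwidth schedule that oscillates between a lower and an upper precision bound within each cycle. The function name, cosine shape, and rounding are our own illustrative assumptions; the bounds themselves would come from the precision range test discussed later.

```python
import math

def cyclic_precision(t, total_iters, num_cycles, b_min, b_max):
    """Illustrative cosine-shaped cyclic bitwidth schedule (our sketch,
    not necessarily the paper's exact formulation): within each cycle,
    precision starts at b_min, rises toward b_max, then resets."""
    cycle_len = total_iters // num_cycles
    pos = t % cycle_len  # position within the current cycle
    b = b_min + 0.5 * (b_max - b_min) * (1 - math.cos(math.pi * pos / cycle_len))
    return int(round(b))
```

For example, with 100 total iterations, 2 cycles, and bounds of 3 and 8 bits, each 50-iteration cycle begins at 3 bits and climbs toward 8 bits before the next cycle resets to 3.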

2. RELATED WORKS

Quantized DNNs. DNN quantization (Courbariaux et al., 2015; 2016; Rastegari et al., 2016; Zhu et al., 2016; Li et al., 2016; Jacob et al., 2018; Mishra & Marr, 2017; Mishra et al., 2017; Park et al., 2017; Zhou et al., 2016) has been well explored based on the target accuracy-efficiency trade-offs. For example, (Jacob et al., 2018) proposes quantization-aware training to preserve the post-quantization accuracy; (Jung et al., 2019; Bhalgat et al., 2020; Esser et al., 2019; Park & Yoo, 2020) strive to improve low-precision DNNs' accuracy using learnable quantizers. Mixed-precision DNN quantization (Wang et al., 2019a; Xu et al., 2018; Elthakeb et al., 2020; Zhou et al., 2017) assigns different bitwidths to different layers/filters. While these works all adopt a static quantization strategy, i.e., the assigned precision is fixed post quantization, CPT adopts a dynamic precision schedule during the training process.

Low-precision DNN training. Pioneering works (Wang et al., 2018a; Banner et al., 2018; Micikevicius et al., 2017a; Gupta et al., 2015; Sun et al., 2019) have shown that DNNs can be trained with reduced precision. For distributed learning, (Seide et al., 2014; De Sa et al., 2017; Wen et al., 2017; Bernstein et al., 2018) quantize the gradients to reduce the communication costs, while the training computations still adopt full precision; for centralized/on-device learning, the weights, activations, gradients, and errors involved in both the forward and backward computations all adopt reduced precision. Our CPT can be applied on top of these low-precision training techniques, all of which adopt a static precision during the whole training trajectory, to further boost the training efficiency.
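As background for the quantization works discussed above, the forward pass of quantization-aware training is often sketched as uniform "fake quantization": values are rounded onto a small number of levels and then mapped back to float. The sketch below is a generic illustration under our own assumptions, not the exact quantizer of any cited work.

```python
import numpy as np

def fake_quantize(x, num_bits):
    """Uniform fake quantization sketch: map x onto 2**num_bits evenly
    spaced levels spanning its own range, then back to float. In
    quantization-aware training, the backward pass would treat the
    rounding as identity (straight-through estimator)."""
    qmax = 2 ** num_bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = max(hi - lo, 1e-8) / qmax  # avoid division by zero
    q = np.clip(np.round((x - lo) / scale), 0, qmax)
    return q * scale + lo
```

With `num_bits = 2`, an input spanning [0, 1] is snapped onto the four levels {0, 1/3, 2/3, 1}, which is why accuracy preservation at low bitwidths requires training-time compensation.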

3. THE PROPOSED CPT TECHNIQUE

In this section, we first introduce the hypothesis that motivates us to develop CPT using visualization examples in Sec. 3.1, and then present the CPT concept in Sec. 3.2 followed by the Precision Range Test (PRT) method in Sec. 3.3, where PRT aims to automate the precision schedule for CPT.
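Before the detailed description of PRT, here is a rough sketch of how such a range test could pick the lower precision bound from early-training behavior. Everything here, including the threshold criterion, the callback interface, and the fallback, is our own assumption for illustration, not the paper's exact procedure.

```python
def precision_range_test(acc_after_warmup, bit_candidates, threshold=0.15):
    """Hypothetical precision range test sketch: return the smallest
    candidate bitwidth whose accuracy after a few warm-up training
    iterations clears a threshold; fall back to the largest candidate.
    acc_after_warmup is assumed to map a bitwidth to that accuracy."""
    for bits in sorted(bit_candidates):
        if acc_after_warmup(bits) > threshold:
            return bits
    return max(bit_candidates)
```

In practice the accuracy probe would come from a few iterations of actual training at each precision, which is why the overhead is confined to the first few epochs.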



Dynamic precision DNNs. There exist some dynamic precision works which aim to derive a quantized DNN for inference after full-precision training. Specifically, (Zhuang et al., 2018) first trains a full-precision model to convergence and then gradually decreases the model precision to the target one for achieving better inference accuracy; (Khoram & Li, 2018) also starts from a full-precision model and then gradually learns the precision of each layer to derive a mixed-precision counterpart; (Yang & Jin, 2020) learns a fractional precision for each layer/filter based on the linear interpolation of two consecutive bitwidths, which doubles the computation and requires an extra fine-tuning step; and (Shen et al., 2020) proposes to adapt the precision of each layer during inference in an input-dependent manner to balance computational cost and accuracy.

Code availability: https://github.com/RICE

