ACCELERATING DNN TRAINING THROUGH SELECTIVE LOCALIZED LEARNING

Abstract

Training Deep Neural Networks (DNNs) places immense compute requirements on the underlying hardware platforms, expending large amounts of time and energy. We propose LoCal+SGD, a new algorithmic approach to accelerate DNN training by selectively combining localized or Hebbian learning within a Stochastic Gradient Descent (SGD) based training framework. Back-propagation is a computationally expensive process that requires 2 Generalized Matrix Multiply (GEMM) operations to compute the error and weight gradients for each layer. We alleviate this by selectively updating some layers' weights using localized learning rules that require only 1 GEMM operation per layer. Further, since the weight update is performed during the forward pass itself, the layer activations for the mini-batch do not need to be stored until the backward pass, resulting in a reduced memory footprint. Localized updates can substantially boost training speed, but need to be used selectively and judiciously in order to preserve accuracy and convergence. We address this challenge through the design of a Learning Mode Selection Algorithm, where all layers start with SGD, and as epochs progress, layers gradually transition to localized learning. Specifically, for each epoch, the algorithm identifies a Localized→SGD transition layer, which delineates the network into two regions. Layers before the transition layer use localized updates, while the transition layer and later layers use gradient-based updates. The trend in the weight updates made to the transition layer across epochs is used to determine how the boundary between SGD and localized updates is shifted in future epochs. We also propose a low-cost weak supervision mechanism by controlling the learning rate of localized updates based on the overall training loss. We applied LoCal+SGD to 8 image recognition CNNs (including ResNet50 and MobileNetV2) across 3 datasets (Cifar10, Cifar100 and ImageNet). 
Our measurements on a Nvidia GTX 1080Ti GPU demonstrate up to 1.5× improvement in end-to-end training time with ∼0.5% loss in Top-1 classification accuracy.

1. INTRODUCTION

Deep Neural Networks (DNNs) have achieved continued success in many application domains involving images (Krizhevsky et al., 2017), videos (Ng et al., 2015), text (Zhou et al., 2015) and natural language (Goldberg & Hirst, 2017). However, training state-of-the-art DNN models is computationally quite challenging, often requiring exa-FLOPs of compute, as the models are quite complex and need to be trained using large datasets. Despite rapid improvements in the capabilities of GPUs and the advent of specialized accelerators, training large models using current platforms is still quite expensive and often takes days to even weeks. In this work, we aim to reduce the computational complexity of DNN training through a new algorithmic approach called LoCal+SGD¹, which alleviates the key performance bottlenecks in Stochastic Gradient Descent (SGD) through the selective use of localized or Hebbian learning.

Computational Bottlenecks in DNN Training. DNNs are trained in a supervised manner using gradient-descent based cost minimization techniques such as SGD (Bottou, 2010) or Adam (Kingma & Ba, 2015). The training inputs (typically grouped into mini-batches) are iteratively forward propagated (FP) and back propagated (BP) through the DNN layers to compute weight updates that push the network parameters in the direction that decreases the overall classification loss. Back-propagation is computationally expensive, accounting for 65-75% of the total training time on GPUs. This is attributed to two key factors: (i) BP involves 2 Generalized Matrix Multiply (GEMM) operations, one to propagate the error across layers and the other to compute the weight gradients, and (ii) when training on distributed systems using data/model parallelism (Dean et al., 2012b; Krizhevsky et al., 2012), aggregation of weight gradients/errors across devices incurs significant communication overhead. Further, BP through auxiliary ops such as batch normalization is also more expensive than FP.
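To make the cost asymmetry concrete, the following sketch contrasts the two GEMMs that BP performs for a fully connected layer with the single GEMM of an Oja-style localized update. It is a plain-NumPy illustration of the general idea, not the paper's exact learning rule; the layer sizes and learning rates are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
B, n_in, n_out = 32, 512, 256               # mini-batch size, layer fan-in / fan-out

W = rng.standard_normal((n_in, n_out)) * 0.01
x = rng.standard_normal((B, n_in))          # layer input activations
y = x @ W                                   # forward pass: 1 GEMM

# --- SGD / back-propagation: 2 GEMMs per layer ---
dy = rng.standard_normal((B, n_out))        # error arriving from the next layer
dW = x.T @ dy                               # GEMM 1: weight gradient
dx = dy @ W.T                               # GEMM 2: error for the previous layer
W_sgd = W - 1e-3 * dW

# --- Localized (Oja-style Hebbian) update: 1 GEMM, no back-propagated error ---
# Oja's rule per element: dW_ij ∝ y_j * (x_i - y_j * W_ij), computed entirely
# from forward-pass quantities, so x need not be stored until BP.
dW_local = x.T @ y - W * (y * y).sum(axis=0)  # one GEMM plus elementwise terms
W_local = W + 1e-3 * dW_local
```

Since the localized update consumes only forward-pass quantities, it can be applied during FP itself, which is the source of both the compute and the memory savings described above.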
Prior Efforts on Efficient DNN Training. Prior research efforts to improve DNN training time can be grouped into a few directions. One group of efforts enables larger scales of parallelism in DNN training through learning rate tuning (You et al., 2017a; Goyal et al., 2017; You et al., 2017b) and asynchronous weight updates (Dean et al., 2012a). Another class of efforts employs importance-based sample selection during training, wherein 'easier' training samples are selectively discarded to improve runtime (Jiang et al., 2019; Zhang et al., 2019). Finally, model quantization (Sun et al., 2019) and pruning (Lym et al., 2019) can lead to significant runtime benefits during training by enabling the use of reduced-bitwidth processing elements.

LoCal+SGD: Combining SGD with Localized Learning. Complementary to the aforementioned efforts, we propose a new approach, LoCal+SGD, to alleviate the performance bottlenecks in DNN training while preserving model accuracy. Our hybrid approach combines Hebbian or localized learning (Hebb) with SGD by selectively applying it in specific layers and epochs. Localized learning rules (Hebb; Oja, 1982; Zhong, 2005) utilize a single feed-forward weight update to learn the feature representations, eschewing BP. Careful formulation of the localized learning rule can result in ∼2× computation savings compared to SGD, and also significantly reduces the memory footprint, as activations from FP need not be retained until BP. The reduction in memory footprint can in turn allow increasing the batch size during training, which leads to further runtime savings due to better compute utilization and reduced communication costs. It is worth noting that localized learning has been actively explored in the context of unsupervised learning (Chen et al., 2020; van den Oord et al., 2018; Hénaff et al., 2019). Further, there have been active research efforts on neuro-scientific learning rules (Lee et al., 2015; Nøkland, 2016). Our work is orthogonal to such efforts and represents a new application of localized learning in a fully supervised context, wherein we selectively combine it within an SGD framework to achieve computational savings.

Preserving model accuracy and convergence with LoCal+SGD requires localized updates to be applied judiciously, i.e., only to selected layers in certain epochs. We address this challenge through the design of a learning mode selection algorithm. At the start of training, the selection algorithm initializes the learning mode of all layers to SGD, and as training progresses, it determines the layers that transition to localized learning. Specifically, for each epoch, the algorithm identifies a Localized→SGD transition layer, which delineates the network into two regions. Layers before the transition layer use localized updates, while subsequent layers use gradient-based updates. This allows BP to stop at the transition layer, as layers before it have no use for the back-propagated errors. The algorithm takes advantage of the magnitude of the weight updates of the Localized→SGD transition layer in deciding the new position of the boundary every epoch. Further, we provide weak supervision by tweaking the learning rate of locally updated layers based on the overall training loss.

Contributions: To the best of our knowledge, LoCal+SGD is the first effort that combines localized learning (an unsupervised learning technique) within a supervised SGD context to reduce computational costs while maintaining classification accuracy. This favorable tradeoff is achieved by LoCal+SGD through a Learning Mode Selection Algorithm that applies localized learning to selected layers and epochs. Further improvement is achieved through the use of weak supervision, by modulating the learning rate of locally updated layers based on the overall training loss. Across 8 image recognition CNNs (including ResNet50 and MobileNetV2) and 3 datasets (Cifar10, Cifar100 and ImageNet), we demonstrate that LoCal+SGD achieves up to 1.5× improvement in training time with ∼0.5% Top-1 accuracy loss on a Nvidia GTX 1080Ti GPU.

¹ In addition to combining localized and SGD-based learning, LoCal+SGD is Low-Calorie SGD, i.e., SGD with reduced computational requirements.

2. LoCal+SGD: COMBINING SGD WITH SELECTIVE LOCALIZED LEARNING

The key idea in LoCal+SGD is to apply localized learning to selected layers and epochs during DNN training to improve the overall execution time, without incurring loss in accuracy. The following components are critical to the effectiveness of LoCal+SGD:

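The per-epoch boundary adjustment performed by the learning mode selection algorithm can be sketched as follows. The paper states only that the trend in the transition layer's weight-update magnitudes drives the decision; the stabilization criterion, the relative-change threshold, and the one-layer-at-a-time step used here are illustrative assumptions.

```python
def select_transition_layer(prev_t, update_norm_history, threshold=0.1):
    """Decide the Localized→SGD transition layer for the next epoch.

    prev_t: index of the current transition layer (layers < prev_t use
        localized updates; layers >= prev_t use SGD).
    update_norm_history: per-epoch ||ΔW|| values of layer prev_t so far.
    threshold: hypothetical relative-change cutoff (an assumption, not a
        value from the paper).
    """
    if len(update_norm_history) < 2:
        return prev_t  # not enough history yet; keep the boundary in place
    # If the SGD weight updates to the boundary layer have largely stabilized,
    # treat its features as converged enough to hand over to localized learning.
    prev, curr = update_norm_history[-2], update_norm_history[-1]
    rel_change = abs(curr - prev) / (abs(prev) + 1e-12)
    if rel_change < threshold:
        return prev_t + 1  # advance the boundary: one more layer goes localized
    return prev_t          # updates still changing; keep this layer on SGD

# Example: update norms nearly flat across epochs, so the boundary advances.
next_t = select_transition_layer(3, [1.00, 0.99])
```

Because layers before the returned index no longer need back-propagated errors, BP in the next epoch can terminate at the transition layer, which is where the runtime savings come from.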

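The weak supervision mechanism, which controls the learning rate of the localized updates based on the overall training loss, can likewise be rendered as a short sketch. The multiplicative constants and the loss-trend rule below are purely illustrative assumptions, not the paper's exact mechanism.

```python
def weakly_supervised_lr(base_lr, loss_history, boost=1.05, damp=0.7):
    """Modulate the localized-learning rate using the overall training loss.

    boost/damp are hypothetical constants: the idea is only that the global
    loss trend weakly steers the otherwise unsupervised local updates.
    """
    if len(loss_history) < 2:
        return base_lr  # no trend yet; leave the rate unchanged
    if loss_history[-1] < loss_history[-2]:
        return base_lr * boost  # loss improving: trust the local updates more
    return base_lr * damp       # loss worsening: damp the local updates

# Example: training loss dropped from 2.0 to 1.5, so the rate is nudged up.
lr_next = weakly_supervised_lr(0.01, [2.0, 1.5])
```

This keeps the supervision cost negligible: only the scalar training loss, which is computed anyway, feeds back into the locally updated layers.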