ACCELERATING DNN TRAINING THROUGH SELECTIVE LOCALIZED LEARNING

Abstract

Training Deep Neural Networks (DNNs) places immense compute requirements on the underlying hardware platforms, expending large amounts of time and energy. We propose LoCal+SGD, a new algorithmic approach that accelerates DNN training by selectively combining localized or Hebbian learning within a Stochastic Gradient Descent (SGD) based training framework. Back-propagation is a computationally expensive process that requires two Generalized Matrix Multiply (GEMM) operations to compute the error and weight gradients for each layer. We alleviate this by selectively updating the weights of some layers using localized learning rules that require only one GEMM operation per layer. Further, since the weight update is performed during the forward pass itself, the layer activations for the mini-batch do not need to be stored until the backward pass, resulting in a reduced memory footprint. Localized updates can substantially boost training speed, but must be used selectively and judiciously in order to preserve accuracy and convergence. We address this challenge through the design of a Learning Mode Selection Algorithm, in which all layers start with SGD and, as epochs progress, layers gradually transition to localized learning. Specifically, for each epoch, the algorithm identifies a Localized→SGD transition layer, which delineates the network into two regions. Layers before the transition layer use localized updates, while the transition layer and later layers use gradient-based updates. The trend in the weight updates made to the transition layer across epochs is used to determine how the boundary between SGD and localized updates is shifted in future epochs. We also propose a low-cost weak supervision mechanism that controls the learning rate of localized updates based on the overall training loss. We applied LoCal+SGD to 8 image recognition CNNs (including ResNet50 and MobileNetV2) across 3 datasets (CIFAR-10, CIFAR-100 and ImageNet).
Our measurements on an NVIDIA GTX 1080 Ti GPU demonstrate up to a 1.5× improvement in end-to-end training time with ∼0.5% loss in Top-1 classification accuracy.
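The learning mode selection idea summarized above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the paper's exact algorithm: the function name, the use of consecutive update-norm differences, and the `threshold` parameter are our own assumptions.

```python
def select_transition_layer(prev_boundary, update_norms, threshold=0.1):
    """Return the Localized->SGD transition-layer index for the next epoch.

    prev_boundary: index of the current transition layer; layers before it
                   use localized updates, the rest use SGD.
    update_norms:  recent weight-update magnitudes observed at that layer.
    Heuristic (our assumption): if the transition layer's updates have
    flattened out (relative change below `threshold`), shift the boundary
    one layer deeper into the network; otherwise keep it in place.
    """
    if len(update_norms) < 2:
        return prev_boundary
    rel_change = abs(update_norms[-1] - update_norms[-2]) / (abs(update_norms[-2]) + 1e-12)
    return prev_boundary + 1 if rel_change < threshold else prev_boundary

# Epoch 0: all layers start with SGD (boundary at layer 0).
boundary = 0
boundary = select_transition_layer(boundary, [1.0, 0.99])  # flat -> advance boundary
boundary = select_transition_layer(boundary, [0.9, 0.5])   # still changing -> hold
```

After these two epochs, the first layer would use localized updates while all later layers remain on SGD; the boundary only moves monotonically deeper, matching the paper's description of a gradual SGD-to-localized transition.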

1. INTRODUCTION

Deep Neural Networks (DNNs) have achieved continued success in many application domains involving images (Krizhevsky et al., 2017), videos (Ng et al., 2015), text (Zhou et al., 2015) and natural language (Goldberg & Hirst, 2017). However, training state-of-the-art DNN models is computationally quite challenging, often requiring exa-FLOPs of compute, as the models are complex and need to be trained on large datasets. Despite rapid improvements in the capabilities of GPUs and the advent of specialized accelerators, training large models on current platforms remains expensive and often takes days to weeks. In this work, we aim to reduce the computational complexity of DNN training through a new algorithmic approach called LoCal+SGD¹, which alleviates the key performance bottlenecks in Stochastic Gradient Descent (SGD) through the selective use of localized or Hebbian learning.

Computational Bottlenecks in DNN Training. DNNs are trained in a supervised manner using gradient-descent based cost minimization techniques such as SGD (Bottou, 2010) or Adam (Kingma & Ba, 2015). The training inputs (typically grouped into mini-batches) are iteratively forward propagated (FP) and back propagated (BP) through the DNN layers to compute weight updates that push the network parameters in the direction that decreases the overall classification loss.
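The GEMM counts behind these bottlenecks can be made concrete for a single fully connected layer. The NumPy sketch below is illustrative only (the variable names and the simple correlation-based Hebbian rule are our own choices, not the paper's exact update): FP costs one GEMM, BP adds two more, while a localized update needs just one GEMM and can be applied during FP itself.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_in, n_out = 32, 64, 16
W = rng.standard_normal((n_in, n_out)) * 0.1
x = rng.standard_normal((batch, n_in))

# Forward pass: 1 GEMM per layer.
y = x @ W

# SGD/back-propagation: 2 additional GEMMs per layer.
dy = rng.standard_normal((batch, n_out))  # error arriving from the layer above
dx = dy @ W.T      # GEMM 1: error gradient passed to the layer below
dW_sgd = x.T @ dy  # GEMM 2: weight gradient (x must be stored until BP)

# Hebbian-style localized update: 1 GEMM, computed during FP itself,
# so x need not be retained for the backward pass.
eta = 1e-3
dW_local = eta * (x.T @ y)  # correlation of pre- and post-activations

assert dW_sgd.shape == dW_local.shape == W.shape
```

The memory-footprint claim follows directly: `dW_sgd` needs `x` long after the forward pass, whereas `dW_local` consumes `x` and `y` immediately and can discard them.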



¹ In addition to denoting the combination of localized and SGD-based learning, LoCal+SGD alludes to low-calorie SGD, i.e., SGD with reduced computational requirements.

