CUTTING LONG GRADIENT FLOWS: DECOUPLING END-TO-END BACKPROPAGATION BASED ON SUPERVISED CONTRASTIVE LEARNING

Abstract

End-to-end backpropagation (BP) is the foundation of current deep learning technology. Unfortunately, as a network becomes deeper, BP becomes inefficient for various reasons. This paper proposes a new methodology for decoupling BP that transforms a long gradient flow into multiple short ones, thereby addressing the optimization issues caused by long gradient flows. We report thorough experiments illustrating the effectiveness of our model compared with BP, Early Exit, and associated learning (AL), a state-of-the-art methodology for backpropagation decoupling. We release our experimental code for reproducibility.

1. INTRODUCTION

Current deep learning technology largely depends on backpropagation and gradient-based learning methods (e.g., gradient descent) for model training. Meanwhile, many successful applications rely on extremely deep neural networks; for example, Transformer contains at least 12 layers, most of which comprise several sublayers (Vaswani et al., 2017), BERT has 12 to 24 layers, most of which also comprise several sublayers (Devlin et al., 2018), and GoogLeNet has 22 layers, many of which are Inception modules containing several sublayers (Szegedy et al., 2015). However, training a deep network with backpropagation is inefficient for several reasons. First, a long gradient flow may suffer from gradient vanishing or explosion (Hochreiter, 1998). Second, a long gradient flow may lead to unstable gradients in the early layers (the layers close to the input layer) (Nielsen, 2015). Third, backpropagation causes backward locking: the gradient of a network parameter can be computed only after all other gradients on which it depends have been computed (Jaderberg et al., 2017). These issues may become severe bottlenecks, especially when a network is deep. To train deep networks more efficiently, researchers have developed various strategies, such as batch normalization, gradient clipping, new activation functions (e.g., ReLU and leaky ReLU), new network architectures (e.g., LSTM (Hochreiter & Schmidhuber, 1997)), and many more. Since a long gradient flow is a root cause of the above issues, a possible way to eliminate them is to shorten the gradient flow, for example, by cutting a network into multiple components and assigning a local objective to each component. In this way, a long gradient flow is divided into multiple shorter pieces, which should, at least partially, address vanishing/exploding gradients, unstable gradients in early layers, and backward locking.
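The decoupling idea above can be sketched as follows: each component trains against its own local objective, and the next component receives the previous component's activations as a detached input, so no gradient crosses the component boundary. The two-component toy network, the synthetic task, the local auxiliary classifier heads, and all hyperparameters below are illustrative assumptions rather than this paper's actual architecture; gradients are written out by hand in NumPy to make the cut in the gradient flow explicit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic binary classification task.
X = rng.normal(size=(64, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Component 1: a hidden layer W1 plus a local auxiliary classifier head V1.
W1 = rng.normal(scale=0.1, size=(8, 16))
V1 = rng.normal(scale=0.1, size=(16, 1))
# Component 2: its own layer W2 and head V2. It sees component 1's output
# only as a detached input, so its gradients never reach W1.
W2 = rng.normal(scale=0.1, size=(16, 16))
V2 = rng.normal(scale=0.1, size=(16, 1))

lr = 0.5
losses = []
for _ in range(200):
    # --- Component 1: forward, local loss, local update only ---
    h1 = np.maximum(X @ W1, 0.0)        # ReLU activation
    p1 = sigmoid(h1 @ V1)               # local auxiliary prediction
    g1 = (p1 - y) / len(X)              # BCE gradient w.r.t. the logits
    dV1 = h1.T @ g1
    dh1 = (g1 @ V1.T) * (h1 > 0)        # backprop stays inside component 1
    dW1 = X.T @ dh1
    V1 -= lr * dV1
    W1 -= lr * dW1

    # --- Component 2: detached input cuts the gradient flow here ---
    h1_detached = h1                    # no gradient is ever pushed through it
    h2 = np.maximum(h1_detached @ W2, 0.0)
    p2 = sigmoid(h2 @ V2)
    g2 = (p2 - y) / len(X)
    dV2 = h2.T @ g2
    dh2 = (g2 @ V2.T) * (h2 > 0)
    dW2 = h1_detached.T @ dh2           # stops at h1_detached; W1 untouched
    V2 -= lr * dV2
    W2 -= lr * dW2

    loss = -np.mean(y * np.log(p2 + 1e-9) + (1 - y) * np.log(1 - p2 + 1e-9))
    losses.append(loss)

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

Each of the two weight updates above depends only on activations and gradients local to its component, so neither component waits on the other's backward pass (no backward locking), and each gradient flow spans a single layer rather than the whole network.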
Perhaps the most straightforward approach for assigning a local objective to a component is by adding a local auxiliary classifier that outputs a predicted ŷ and updates the local parameters based on the difference between ŷ and the ground-truth target y. We call this strategy "Early Exit" in this paper because each such auxiliary classifier can be regarded as an exit of the neural network. The concept of Early Exit has been used in many previous studies, e.g., Mostafa et al. (2018); Teerapittayanon et al. (2016); Szegedy et al. (2015). However, most of these studies have used Early Exit for other purposes, e.g., creating multiple prediction paths or helping to obtain gradients for the parameters in the early layers. Consequently, these studies have not investigated the separation of end-to-end backpropagation (BP) into multiple pieces, and the associated gradient flows are still long. In addition, even if Early Exit is used to isolate the gradient flow, as shown in (Mostafa et al., 2018), the test accuracies are lower than those of models trained via BP. There are other methods of cutting long gradient flows (Jaderberg et al., 2017; Czarnecki et al., 2017; Löwe et al., 2019; Wu et al., 2022; Kao & Chen, 2021). However, most of these methods have been applied only to simple networks, and

