CUTTING LONG GRADIENT FLOWS: DECOUPLING END-TO-END BACKPROPAGATION BASED ON SUPERVISED CONTRASTIVE LEARNING

Abstract

End-to-end backpropagation (BP) is the foundation of current deep learning technology. Unfortunately, as a network becomes deeper, BP becomes inefficient for various reasons. This paper proposes a new methodology for decoupling BP to transform a long gradient flow into multiple short ones in order to address the optimization issues caused by long gradient flows. We report thorough experiments conducted to illustrate the effectiveness of our model compared with BP, Early Exit, and associated learning (AL), a state-of-the-art methodology for backpropagation decoupling. We release the experimental code for reproducibility.

1. INTRODUCTION

Current deep learning technology largely depends on backpropagation and gradient-based learning methods (e.g., gradient descent) for model training. Meanwhile, many successful applications rely on extremely deep neural networks; for example, Transformer contains at least 12 layers (most have several sublayers) (Vaswani et al., 2017) , BERT has 12 to 24 layers (most also have several sublayers) (Devlin et al., 2018) , and GoogLeNet has 22 layers (many layers are Inception modules containing several sublayers) (Szegedy et al., 2015) . However, training a deep network based on backpropagation is inefficient for many reasons. First, a long gradient flow may suffer from gradient vanishing or explosion (Hochreiter, 1998) . Second, a long gradient flow may lead to unstable gradients in the early layers (the layers close to the input layer) (Nielsen, 2015) . Third, backpropagation results in backward locking, meaning that the gradient of a network parameter can be computed only when all other gradients on which it depends have been computed (Jaderberg et al., 2017) . These issues may become severe bottlenecks, especially when a network is deep. To train deep networks more efficiently, researchers have developed various strategies, such as batch normalization, gradient clipping, new activation functions (e.g., ReLU and leaky ReLU), new network architectures (e.g., LSTM (Hochreiter & Schmidhuber, 1997) ), and many more. Since a long gradient flow is a root cause of the above issues, a possible way to eliminate these issues is to shorten the length of the gradient flow, for example, by cutting a network into multiple components and assigning a local objective to each component. In this way, a long gradient flow can be divided into multiple shorter pieces, which should, at least partially, address the issues of vanishing/exploding gradients, unstable gradients in early layers, and backward locking. 
Perhaps the most straightforward approach for assigning a local objective to a component is by adding a local auxiliary classifier that outputs a predicted ŷ and updates the local parameters based on the difference between ŷ and the ground-truth target y. We call this strategy "Early Exit" in this paper because each such auxiliary classifier can be regarded as an exit of the neural network. The concept of Early Exit has been used in many previous studies, e.g., Mostafa et al. (2018); Teerapittayanon et al. (2016) ; Szegedy et al. (2015) . However, most of these studies have used Early Exit for other purposes, e.g., creating multiple prediction paths or helping to obtain gradients for the parameters in the early layers. Consequently, these studies have not investigated the separation of end-to-end backpropagation (BP) into multiple pieces, and the associated gradient flows are still long. In addition, even if Early Exit is used to isolate the gradient flow, as shown in (Mostafa et al., 2018) , the test accuracies are lower than those of models trained via BP. There are other methods of cutting long gradient flows (Jaderberg et al., 2017; Czarnecki et al., 2017; Löwe et al., 2019; Wu et al., 2022; Kao & Chen, 2021) . However, most of these methods have been applied only to simple networks, and ), a state-of-the-art methodology for BP decoupling that yields results comparable to those of BP. We find that Delog-SCL outperforms AL in terms of test accuracy in most cases with fewer parameters. Additionally, since our method has a more straightforward network architecture than AL, our method could be a favorable alternative to AL. The rest of the paper is organized as follows. In Section 2, we introduce Delog-SCL and its properties. Section 3 presents a comparison of Delog-SCL with BP, Early Exit, and AL in terms of their test accuracies and numbers of parameters. We also report the results of analyses on certain properties of Delog-SCL in the same section. 
Section 4 reviews previous works on BP decoupling and presents a comparison of these works with our model. We conclude our contribution in Section 5.

2.1. PRELIMINARIES: CONTRASTIVE LEARNING AND SUPERVISED CONTRASTIVE LEARNING

Contrastive learning (CL) is a self-supervised technique for learning visual representations of images. Referring to the left of Figure 1, given an image $x$, CL involves generating different views (i.e., $x_1$ and $x_2$ in Figure 1) via the same family of data augmentations $\mathcal{T}$. The generated views ($x_1$ and $x_2$) are further transformed via an encoder function $f$ and a projection head $g$, and training minimizes the contrastive loss between the output vectors (i.e., $z_1$ and $z_2$). After training, the projection head $g$ is discarded, and only the encoder $f$ is used to generate the visual representations of images (Chen et al., 2020). In other words, given an anchor image $x$, CL regards its augmented images as positive instances and all other images as negative instances, and requires that positive pairs be close after encoding and projection.

Supervised contrastive learning (SCL) extends CL from a self-supervised setting to a fully supervised setting. Therefore, the training data for SCL consist of not only the training images themselves but also the classes of those images. Referring to the right of Figure 1, given an anchor image $x$ of class $c$, the positive instances include the other images of class $c$ in the same batch, whereas all other images in the same batch are regarded as negative instances (Khosla et al., 2020).

Figure 2: An example neural network with 3 hidden layers. The black arrows correspond to the forward path, the red arrows correspond to the backward path, and the green box denotes the comparison of the distance between two incoming variables $\hat{y}^{(i)}$ and $y^{(i)}$.

Let us first consider a standard neural network with 3 hidden layers as an example. As shown in Figure 2, $x^{(i)}$ refers to an input image $i$, and the function $f_\ell$ ($\ell = 1, \ldots, 4$) transforms $r^{(i)}_{\ell-1}$ into $r^{(i)}_\ell$ (under the assumptions that $x^{(i)} = r^{(i)}_0$ and the predicted class $\hat{y}^{(i)} = r^{(i)}_4$). Depending on the network architecture, the functions $f_\ell$ could be various well-known neural network layers, such as fully connected layers, convolutional layers, pooling layers, or residual blocks. The objective $\mathcal{L}^{OUT}$ is determined by the task type. For example, if we are addressing a classification task, we could use the cross-entropy loss between the predicted class $\hat{y}^{(i)}$ and the ground-truth class $y^{(i)}$ as the objective $\mathcal{L}^{OUT}$. We use backpropagation to obtain $\partial \mathcal{L}^{OUT} / \partial \theta_{f_\ell}$ for each layer $\ell$, where $\theta_{f_\ell}$ represents the parameters of function $f_\ell$. Once the gradients are obtained, we can use gradient-based optimization strategies, e.g., gradient descent, to update the parameter values.

Given a neural network with $H$ hidden layers, it can be seen that the longest gradient flow is constructed as a product of $H + 2$ local gradients. For example, to obtain $\partial \mathcal{L}^{OUT} / \partial \theta_{f_1}$ in a network with 3 hidden layers (as shown in Figure 2), we need the following:

$$\frac{\partial \mathcal{L}^{OUT}}{\partial \theta_{f_1}} = \frac{\partial \mathcal{L}^{OUT}}{\partial \hat{y}^{(i)}} \times \frac{\partial \hat{y}^{(i)}}{\partial r^{(i)}_3} \times \frac{\partial r^{(i)}_3}{\partial r^{(i)}_2} \times \frac{\partial r^{(i)}_2}{\partial r^{(i)}_1} \times \frac{\partial r^{(i)}_1}{\partial \theta_{f_1}}. \qquad (1)$$

The number of terms of this product grows linearly with the depth of the network. Therefore, as networks become deeper, their long gradient flows cause several optimization issues, as discussed in Section 1.

We use Figure 3 to illustrate our strategy of cutting a long gradient flow into several local gradients for a neural network with 3 hidden layers.
Let $r^{(i)}_0$ (i.e., $x^{(i)}$) and $r^{(j)}_0$ (i.e., $x^{(j)}$) be two image views in the same batch ($r^{(i)}_0$ and $r^{(j)}_0$ may or may not be augmented views of the same image). We use $f_1$ to transform each of them, obtaining $r^{(i)}_1$ and $r^{(j)}_1$, and further use the function $g_1$ to convert them into $z^{(i)}_1$ and $z^{(j)}_1$, respectively. The functions $f_1$ and $g_1$ can be regarded as the encoding function and the projection head, respectively, in CL (refer to Figure 1). We repeat the same process for each hidden layer $\ell$ to form the corresponding component $\ell$. If $x^{(i)}$ and $x^{(j)}$ are two different views of the same image or if $y^{(i)}$ (the class of $x^{(i)}$) is equal to $y^{(j)}$ (the class of $x^{(j)}$), then we should ensure that $z^{(i)}_\ell$ is close to $z^{(j)}_\ell$ for all $\ell$. Otherwise, we should increase the distance between $z^{(i)}_\ell$ and $z^{(j)}_\ell$. In the last layer, we compute the distance between $\hat{y}^{(i)}$ and $y^{(i)}$ as the loss $\mathcal{L}^{OUT}_i$.

Finally, we define the local supervised contrastive loss $\mathcal{L}^{SC}_\ell$ for batch $B$ in layer $\ell$ as shown in Equation 2:

$$\mathcal{L}^{SC}_\ell = \sum_{i \in B} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\left(z^{(i)}_\ell \cdot z^{(p)}_\ell / \tau\right)}{\sum_{j \in B} \mathbb{I}(j \neq i) \exp\left(z^{(i)}_\ell \cdot z^{(j)}_\ell / \tau\right)}, \qquad (2)$$

where $B = \{1, 2, \ldots, N\}$ represents a batch of multiview images, $P(i)$ is the set of all positive samples for an image $i$, $\tau$ is a hyperparameter, and $\mathbb{I}(j \neq i) \in \{0, 1\}$ is an indicator function that returns 1 if $j \neq i$ and 0 otherwise.

Ultimately, the global objective function is an accumulation of the local supervised contrastive losses and the losses in the output layer, as defined below:

$$\mathcal{L} = \sum_{\ell=1}^{H} \mathcal{L}^{SC}_\ell + \sum_{i=1}^{N} \mathcal{L}^{OUT}_i, \qquad (3)$$

where $H$ is the number of hidden layers and $\mathcal{L}^{OUT}_i$ is the $i$th loss in the output layer (refer to Figure 3). The computation of $\mathcal{L}^{SC}_\ell$ and the pseudocode of Delog-SCL for a 3-layer vanilla ConvNet are given in Algorithm 1 and Algorithm 2, respectively, in Appendix A.5.
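As a rough illustration of Equation 2 (this is not the paper's Algorithm 1, which is not reproduced here), the following minimal sketch computes the local supervised contrastive loss for one layer in plain Python. The names `dot` and `local_supcon_loss` are hypothetical. The sketch assumes that the projected vectors are already L2-normalized, that P(i) contains every other batch item sharing the label of i, and that anchors without any positive are skipped; none of these details is stated in the text above.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def local_supcon_loss(z, labels, tau=0.1):
    """Sketch of Equation 2 for a single layer.

    z      : list of projected vectors z_l^(i), one per batch item
             (assumed to be L2-normalized by the caller)
    labels : list of class labels y^(i), same length as z
    tau    : temperature hyperparameter
    """
    n = len(z)
    total = 0.0
    for i in range(n):
        # P(i): other batch items with the same class as the anchor i.
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # assumption: an anchor with no positives contributes nothing
        # Denominator: sum over all j in the batch with j != i.
        denom = sum(math.exp(dot(z[i], z[j]) / tau) for j in range(n) if j != i)
        log_probs = [dot(z[i], z[p]) / tau - math.log(denom) for p in positives]
        total += -sum(log_probs) / len(positives)
    return total
```

With this definition, a batch whose same-class vectors point in the same direction receives a smaller loss than the same vectors with mismatched labels, which is the behavior the text asks of each hidden layer.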

2.3. FORWARD PATH, BACKWARD PATH, AND INFERENCE FUNCTION

For a regular neural network (e.g., Figure 2), the forward path and the inference function are identical, and the backward path is simply obtained by inverting the direction of the forward path. However, the situation is more complicated in our case because we divide the global objective into several local ones. Consequently, we have multiple short forward paths, multiple short backward paths, and one inference path. Thus, the inference path and the forward paths are no longer identical in Delog-SCL.

During training, each component $\ell$ has its own forward and backward paths. Taking Figure 3 as an example, the forward path of component $\ell$ transforms each $r^{(i)}_{\ell-1}$ into $r^{(i)}_\ell$ via the local encoding function $f_\ell$ and further transforms each $r^{(i)}_\ell$ into $z^{(i)}_\ell$ via the local projection head $g_\ell$. On the backward path, each hidden layer computes $\partial \mathcal{L}^{SC}_\ell / \partial \theta_{g_\ell}$ and $\partial \mathcal{L}^{SC}_\ell / \partial \theta_{f_\ell}$ based on the chain rule and updates the parameters by means of gradient-based optimization strategies. We block the gradient flow between components (see the footnote). As a result, each gradient flow remains within one component and is therefore short. Equation 4 and Equation 5 show these local gradient flows.

$$\frac{\partial \mathcal{L}^{SC}_\ell}{\partial \theta_{g_\ell}} = \frac{\partial \mathcal{L}^{SC}_\ell}{\partial z^{(i)}_\ell} \times \frac{\partial z^{(i)}_\ell}{\partial \theta_{g_\ell}}. \qquad (4)$$

$$\frac{\partial \mathcal{L}^{SC}_\ell}{\partial \theta_{f_\ell}} = \frac{\partial \mathcal{L}^{SC}_\ell}{\partial z^{(i)}_\ell} \times \frac{\partial z^{(i)}_\ell}{\partial r^{(i)}_\ell} \times \frac{\partial r^{(i)}_\ell}{\partial \theta_{f_\ell}}. \qquad (5)$$

Consequently, even if we construct a deep neural network, the cost of computing each $\partial \mathcal{L}^{SC}_\ell / \partial \theta_{f_\ell}$ and each $\partial \mathcal{L}^{SC}_\ell / \partial \theta_{g_\ell}$ remains constant. Additionally, the gradient flow in the output layer is also short: we simply compute $\partial \mathcal{L}^{OUT}_i / \partial \theta_{f_{H+1}}$ (where $H$ is the number of hidden layers). This design alleviates various issues caused by long gradient flows.

In the inference (prediction) phase, we need the functions $f_\ell$ but not $g_\ell$, as shown by Equation 6:

$$\hat{y}^{(i)} = f_{H+1} \circ f_H \circ \cdots \circ f_2 \circ f_1 (x^{(i)}), \qquad (6)$$

where $\circ$ is the function composition operator ($H = 3$ for the example illustrated in Figure 3).
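The difference between the training-time forward paths and the inference path can be sketched with toy scalar functions standing in for real layers. The class `Component` and the function `infer` are hypothetical names introduced only for this illustration. The sketch does not model gradient computation or gradient blocking; it only shows which functions each path touches.

```python
from functools import reduce


class Component:
    """One component: an encoder f and, for hidden layers, a projection head g."""

    def __init__(self, f, g=None):
        self.f = f  # used during both training and inference
        self.g = g  # used during training only; None for the output layer

    def training_forward(self, r_prev):
        """Local forward path: r_prev -> r -> z (z is None without a head)."""
        r = self.f(r_prev)
        z = self.g(r) if self.g is not None else None
        return r, z


def infer(components, x):
    """Inference path as in Equation 6: compose the encoders f only."""
    return reduce(lambda r, component: component.f(r), components, x)


if __name__ == "__main__":
    # Toy scalar stand-ins for layers, chosen only so the arithmetic is easy to follow.
    net = [
        Component(lambda r: r + 1, lambda r: 10 * r),
        Component(lambda r: 2 * r, lambda r: -r),
        Component(lambda r: r - 3),  # output layer, no projection head
    ]
    print(net[0].training_forward(4))
    print(infer(net, 4))
```

Because `infer` never calls `g`, removing every projection head after training leaves the prediction unchanged, which is why the decoupled network and its BP counterpart share the same inference-time function class.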
Although our proposed method (e.g., Figure 3) involves more parameters than a standard neural network structure (e.g., Figure 2) during training, the two have the same number of parameters during inference because both use only the functions $f_\ell$. Therefore, they have the same hypothesis space. The parameters that participate in the inference phase (denoted by $\theta_{f_\ell}$) are called the effective parameters, and the parameters used during training but not during inference (denoted by $\theta_{g_\ell}$) are called the affiliated parameters.

In this section, we discuss three properties of our proposed model: short gradient flows, a flexible structure, and the ability to perform parallel (pipelined) training.

2.4. PROPERTIES

As discussed in Section 2.3, training a regular neural network with BP requires a gradient flow of length O(H). In contrast, the length of each gradient flow in our model is independent of the number of layers; the length is always a constant. Therefore, the various optimization issues resulting from long gradient flows, as discussed in Section 1, are eliminated (or at least alleviated). The network structure is more flexible and perhaps easier to understand than that of associated learning (AL), a state-of-the-art methodology for decoupling BP in terms of test accuracy (Kao & Chen, 2021; Wu et al., 2022) . Specifically, AL involves projecting the features x and the target y into the same latent space for each layer ℓ. Although this design yields excellent test accuracies that are comparable to those of BP-trained models (Wu et al., 2022) , it has at least two unconventional and perhaps mysterious characteristics. First, AL involves projecting a one-hot-encoded target variable y into a latent vector t 1 and then transforming t 1 back into y. Interestingly, the length of t 1 is sometimes greater than the number of classes. This process corresponds to building an autoencoder whose bottleneck layer is larger than the input and output layers. Although this unconventional approach works surprisingly well in practice (Wu et al., 2022; Kao & Chen, 2021) , the fundamental reasons for this are still unclear. Second, when converting a neural network into its AL form, we sometimes need to create extra fully connected layers. In contrast, our design is more natural because we need neither the autoencoder nor the extra fully connected layers. Finally, since each component has its own local objective, we can parallelize the training procedure by means of pipelining. We use the network illustrated in Figure 3 as an example. Let Task ℓ denote the entire forward and backward process in layer ℓ; then, we can illustrate the pipelining process as shown in Table 1 . 
Specifically, in the first time unit $t_1$, component 1 uses the first batch ($B_1$) to perform Task 1. At $t_2$, component 2 performs Task 2 based on $B_1$, and component 1 continues to perform Task 1 based on the second batch ($B_2$). Starting at $t_3$, all three components can perform forward propagation, backward propagation, and parameter updating simultaneously. However, we have shown here only that parallelization by means of pipelining is feasible; implementation of the pipeline mechanism is left for future work.
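The schedule described above can be enumerated with a few lines of Python. This is only a bookkeeping sketch of which batch each component would handle in each time unit, reconstructed from the description in the text; it is not a parallel implementation, and the function name `pipeline_schedule` is hypothetical.

```python
def pipeline_schedule(num_components, num_time_units):
    """Return schedule[t][c]: the batch index handled by component c + 1
    during time unit t + 1, or None if that component is still idle."""
    schedule = []
    for t in range(1, num_time_units + 1):
        row = []
        for c in range(1, num_components + 1):
            # Component c receives batch B_1 at time unit t_c, B_2 at t_(c+1), ...
            batch = t - c + 1
            row.append(batch if batch >= 1 else None)
        schedule.append(row)
    return schedule


if __name__ == "__main__":
    for t, row in enumerate(pipeline_schedule(3, 4), start=1):
        print(f"t{t}:", row)
```

From the third time unit onward no entry is None, matching the statement that all three components can work simultaneously once the pipeline is full.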

3. EXPERIMENTS

We compare Delog-SCL with three baselines using different neural networks on different datasets. The baseline models include BP, the Early Exit mechanism introduced in Section 1, and AL, a state-of-the-art method for BP decoupling in terms of test accuracy. We test three neural networks: a vanilla convolutional neural network (vanilla ConvNet), the VGG network (Simonyan & Zisserman, 2014), and the residual network (ResNet) (He et al., 2016). The experimental datasets include CIFAR-10 (50,000 color training images and 10,000 test images; each image belongs to 1 of 10 classes), CIFAR-100 (50,000 color training images and 10,000 test images; each image belongs to 1 of 100 classes), and Tiny-ImageNet (100,000 color training images, 10,000 validation images, and 10,000 test images; each image belongs to 1 of 200 classes).

We also tested BP, Early Exit, AL, and Delog-SCL on CIFAR-100. The results, as shown in Table 3, are similar to those on CIFAR-10: Delog-SCL performs better than both AL and BP with vanilla ConvNet and VGG, whereas Delog-SCL performs worse than BP when ResNet is used. These results are also consistent with those reported in (Kao & Chen, 2021; Wu et al., 2022).

3.1. ACCURACY COMPARISON

Table 4 gives the results obtained on Tiny-ImageNet. When VGG is used, both AL and Delog-SCL outperform BP. However, for ResNet, BP performs much better. Delog-SCL is stable in training, as shown in Figure 5 in the Appendix.

3.1.1. DISCUSSION ON ACCURACY COMPARISON

When BP is used, all parameters are updated to minimize a global objective: the residual between the prediction $\hat{y}$ and the target $y$. In contrast, methods that decouple end-to-end backpropagation, such as Delog-SCL and AL, are composed of many local objectives, which may differ from the global objective. Therefore, it is surprising that Delog-SCL and AL outperform BP for some network structures. The authors of AL proposed several conjectures to explain this remarkable result. First, projecting the feature vector $x$ and the target $y$ into the same latent space may be helpful. Second, the autoencoder may implicitly perform some feature extraction and regularization. Third, overparameterization may be helpful for optimization (Arora et al., 2018; Chen & Chen, 2020). However, the first and second conjectures apply only to AL and not to Delog-SCL, yet Delog-SCL still yields better accuracies than BP and AL with vanilla ConvNet and VGG. Therefore, the above conjectures may not fully explain the success of Delog-SCL. Further investigation is needed to uncover the fundamental reasons.

As for ResNet, its authors state that the main effect of a residual connection is not about promoting gradient flows (He et al., 2016). Instead, ResNet performs better than vanilla ConvNet because the latent representations $H_\ell$ and $H_{\ell+1}$ at deep neighboring layers $\ell$ and $\ell + 1$ are likely similar. It may be difficult for regular nonlinear transformations to approximate an (almost) identity mapping from $H_\ell$ to $H_{\ell+1}$. However, the residual connection sets $H_{\ell+1}$ to be $f(H_\ell) + H_\ell$. Even if $f(\cdot)$ is a nonlinear function, it is easier for a solver to make $H_{\ell+1} \approx H_\ell$ by making $f(H_\ell) \approx 0$. The property that $H_{\ell+1} \approx H_\ell$ is likely true when a network is deep. However, when using Delog-SCL or AL, each local network is short, so Delog-SCL and AL are unlikely to take advantage of the residual connections.
As a result, optimizing a ResNet by BP usually gives better results than by BP-decoupling methods, such as Delog-SCL and AL. Table 5 shows the numbers of parameters required during training and inference for VGG and ResNet (using CIFAR-10 as an example).

3.2. NUMBER OF EFFECTIVE PARAMETERS

For BP, the training and testing stages involve the same set of parameters. In contrast, the training process of AL requires additional bridge functions and encoding functions, which are not used during testing. Thus, the number of training parameters in AL is much larger than the number of parameters required for BP. During testing, the extra fully connected layers in AL also require more parameters than are needed in BP. Finally, Delog-SCL and BP have the same number of testing parameters. However, during training, Delog-SCL needs the projection heads $g_\ell$ for each component, so the number of required parameters during training is larger than for BP but much smaller than for AL. The difference in the parameter counts may also be reflected in the training and inference speed. We show the practical training and testing time of different methods in Table 7 in the Appendix.
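The bookkeeping behind this comparison for Delog-SCL can be written down in a few lines. The function name `parameter_counts` and the example sizes below are hypothetical; they are not taken from Table 5, and the sketch only separates the parameters of the encoders $f_\ell$ from those of the projection heads $g_\ell$.

```python
def parameter_counts(encoder_params, head_params):
    """Count training-time and inference-time parameters for a decoupled network.

    encoder_params : parameter counts of the encoders f_1, ..., f_(H+1)
    head_params    : parameter counts of the projection heads g_1, ..., g_H
    """
    effective = sum(encoder_params)   # used in both training and inference
    affiliated = sum(head_params)     # used in training only
    return {"training": effective + affiliated, "inference": effective}


if __name__ == "__main__":
    # Made-up layer sizes, for illustration only.
    print(parameter_counts([100, 200, 50], [30, 30]))
```

Under this accounting, a BP-trained network with the same encoders has the same inference count, and its training count is simply the `"inference"` value, since it has no projection heads.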

3.3. MULTIPLE SHORT GRADIENT FLOWS ACCELERATE LEARNING

This section shows that dividing a long gradient flow into multiple short ones accelerates learning. Referring to Figure 3, our proposed Delog-SCL uses a local supervised contrastive loss $\mathcal{L}^{SC}_\ell$ to create a short local gradient flow in each component. We compare the standard Delog-SCL with a modification in which only a single long gradient flow is used. In this baseline, we allow the gradient of the global objective $\mathcal{L}^{OUT}$ to pass through the entire network and remove all local supervised contrastive losses $\mathcal{L}^{SC}_\ell$. The results are shown in Figure 4. We label the original Delog-SCL as "multiple short gradient flows" and the modification with a single long gradient flow as "single gradient flow". Using multiple short gradient flows accelerates learning, especially in the first 100 epochs.

3.4. THE EFFECT OF BATCH SIZE AND PROJECTION HEAD

We also examined how the batch size and the type of projection head influence learning. The experimental results show that a larger batch size improves the learning quality, which is consistent with previous studies (Chen et al., 2020; Henaff, 2020; Bachman et al., 2019). As for the projection head, using a nonlinear function benefits the representation quality of the layers before it. This result also matches the experiments conducted in Chen et al. (2020). The experimental details of the batch size and projection heads are presented in Section A.3 and Section A.4 in the Appendix.

4. RELATED WORK

Studies on alternatives to BP mostly aim to address optimization and performance issues, such as gradient vanishing/explosion and training costs. We review some of these works that have particularly focused on the creation of local objectives and local gradient flows.

The first type of BP alternative is target propagation (Lee et al., 2015; Meulemans et al., 2020; Manchev & Spratling, 2020; Bengio, 2014), which assigns a local target for each layer via feedback (inverse) mapping. Such a methodology can alleviate the problem of vanishing/exploding gradients since each gradient flow is short. However, the parameters are still updated in a layerwise fashion, so it could be challenging to learn the parameters in different layers simultaneously.

Methods of the second type model BP as a constrained optimization problem, in which the output of one layer is forced to equal the input to the next layer (Gotmare et al., 2018; Marra et al., 2020). Such a design shortens the gradient flows and enables parallel parameter updates. However, the experimental results show that the test accuracies are lower than those of standard BP.

Methods of the third type determine the local objectives through transformations of the target. A representative method of this type is AL (Wu et al., 2022; Kao & Chen, 2021), which transforms both the feature vector $x$ and the target $y$ into the same set of latent spaces. To the best of our knowledge, AL is the only existing method that can achieve BP decoupling for a wide range of network architectures and yield test accuracies that are comparable to those obtained with BP.

Our proposed Delog-SCL is motivated by both AL and Greedy InfoMax (GIM) (Löwe et al., 2019), which uses the contrastive loss as each local objective.
However, GIM targets self-supervised learning tasks, whereas our Delog-SCL can handle supervised learning tasks because Delog-SCL uses the supervised contrastive loss in the hidden layers and the distance between the predicted and observed targets in the output layer. Since only Delog-SCL and AL yield test accuracies comparable to those of BP, in Table 6 , we further compare the properties of these three methods. First, BP requires no affiliated parameters because all parameters collaborate to reduce the global loss. BP can be applied to almost all kinds of neural networks. However, its gradient flow is long (especially when the network is deep), and it is challenging to achieve pipelined training with this method. AL requires transforming both the feature vector x and the target y alongside each other. As a result, AL usually requires additional fully connected layers, resulting in a large number of affiliated parameters and less structural flexibility. However, each gradient flow in AL is short, and parameters in different layers can be updated simultaneously via pipelined training. Finally, because Delog-SCL requires computing the supervised contrastive loss in each hidden layer, this method also needs additional affiliated parameters (although much fewer than AL) during training. The introduction of the supervised contrastive loss also adds complexity in the network design. The advantages of Delog-SCL are similar to those of AL: the gradient flows are short, and parallel parameter updating is possible (via pipelining).

5. CONCLUSION

This paper presents Delog-SCL, a new methodology for decoupling the components of the BP process in a neural network. Delog-SCL may address various optimization issues (e.g., vanishing/exploding gradients and unstable gradients in the early layers) resulting from the long gradient flows in deep neural networks. We report experiments conducted to show that Delog-SCL's predictive power is comparable to (and frequently better than) that of either BP or AL, which is a state-of-the-art alternative to BP. Delog-SCL is more flexible than AL because Delog-SCL does not require additional fully connected layers, whereas AL usually does. Therefore, Delog-SCL is a natural substitute for AL and could be a promising alternative to BP.

We also tested how the batch size influences the test accuracy. As shown in Table 8, performing Delog-SCL training using a large batch size is helpful, and the improvement on VGG is more evident than in other networks. This finding is consistent with the results reported in previous studies, e.g., Chen et al. (2020); Henaff (2020); Bachman et al. (2019), in which the authors noted that because a larger batch tends to include more negative pairs (as shown in Equation 2), the model has access to more information that can be used to distinguish positive pairs from negative pairs. Although other studies, e.g., Mitrovic et al. (2020), have shown that the number of negative pairs may not be critical to the improvement of the test accuracy, most studies tend to agree that a larger batch size leads to better results.

This section presents the influence of different projection heads. Table 9 compares the accuracies of VGG and ResNet on CIFAR-10 when 3 different types of projection heads are used: identity mapping (i.e., $g_\ell(r^{(i)}_\ell) = r^{(i)}_\ell$), linear mapping (i.e., $g_\ell(r^{(i)}_\ell) = w^T r^{(i)}_\ell + w_0$), and the default mapping based on a multilayer perceptron (MLP).
The results are similar to those reported in (Chen et al., 2020): the MLP mapping shows a 2.6% improvement over linear projection, which outperforms identity projection by over 10%. Using an MLP as the projection head is beneficial likely because the information loss induced by the contrastive loss is more severe when a simple projection head is used (Chen et al., 2020). In particular, since a projection head $g_\ell$ (refer to Figure 3 and Figure 1) maximizes the agreement between augmented images, $g_\ell$ may remove information relevant to image rotation, flipping, and other data augmentation operations, which could be useful for downstream tasks. When a simple projection head $g_\ell$ is used, the information contained in $r^{(i)}_\ell$ will be similar to that in $z^{(i)}_\ell$, which means that $r^{(i)}_\ell$ is invariant to data augmentation. On the other hand, when a complex projection head such as an MLP is used, the information in r



The gradient flow can be blocked by using Tensor.detach() in PyTorch or tf.stop_gradient in TensorFlow.



Figure 1: An illustration of contrastive learning and supervised contrastive learning.

Figure 3: An example of a decoupled neural network based on supervised contrastive learning. The black arrows correspond to the forward path, the red arrows correspond to the backward path, and the green box denotes the comparison of the distance between two incoming variables.


Figure 4: Training Delog-SCL on CIFAR-10 using multiple short gradient flows vs. using a single long gradient flow.

Figure 5: Epoch vs test accuracy for BP and Delog-SCL on CIFAR-100 using vanilla ConvNet.

a PyTorch pseudocode for the creation of the local supervised contrastive losses (Algorithm 1) and Delog-SCL (Algorithm 2).

Table 1: An illustration of the training process pipeline.

Table 2: A comparison of the test accuracies (mean ± standard deviation) of different methodologies when using different neural network architectures on CIFAR-10. We highlight the winner among the non-BP methodologies in bold face. We mark a methodology with a † symbol if the test accuracy of this methodology is higher than that of BP.

Table 3: A comparison of the test accuracies of different methodologies when using different neural network architectures on CIFAR-100. We follow the same notations used in Table 2.

Table 2 shows the test accuracies of the various methods on the CIFAR-10 dataset. The simple Early Exit mechanism can be used to learn the relationship between an image and its corresponding class. However, the test accuracies of Early Exit are much worse than those of BP. Both AL and our proposed Delog-SCL yield better test accuracies than BP based on the vanilla ConvNet and VGG architectures. However, when ResNet is used, BP yields the highest test accuracy. If we compare only the methods that involve BP decomposition, Delog-SCL performs the best among them.



Table 6: A comparison of the properties of BP, AL, and Delog-SCL.

Table 7: A comparison of the training and testing seconds (mean ± standard deviation) per epoch with different methodologies (using CIFAR-10 as an example).

Table 8: The test accuracies of Delog-SCL when using different batch sizes on CIFAR-10.

Table 9: The test accuracies when using different projection heads on CIFAR-10 (columns: type of projection head, VGG, ResNet).

A APPENDIX

A.1 TEST ACCURACY VS EPOCH

Figure 5 compares Delog-SCL and BP on vanilla ConvNet in terms of how their test accuracies change as the number of epochs increases. First, the test accuracy of Delog-SCL improves stably. Second, Delog-SCL outperforms BP after approximately 100 epochs. BP is better than Delog-SCL at the beginning, likely because all the parameters in BP are updated to minimize a global objective. In contrast, most of the parameters in Delog-SCL are updated to fit local objective functions, which usually have no direct access to the target variable. Different hyperparameter settings may lead to slightly different curves. However, most of them follow a similar pattern. Experiments on other datasets (CIFAR-10 and Tiny-ImageNet) for the VGG network structure also show similar trends.

