REVISITING LOCALLY SUPERVISED LEARNING: AN ALTERNATIVE TO END-TO-END TRAINING

Abstract

Due to the need to store intermediate activations for back-propagation, end-to-end (E2E) training of deep networks usually suffers from a high GPU memory footprint. This paper aims to address this problem by revisiting locally supervised learning, where a network is split into gradient-isolated modules and trained with local supervision. We experimentally show that simply training local modules with an E2E loss tends to collapse task-relevant information at early layers, and hence hurts the performance of the full model. To avoid this issue, we propose an information propagation (InfoPro) loss, which encourages local modules to preserve as much useful information as possible while progressively discarding task-irrelevant information. As the InfoPro loss is difficult to compute in its original form, we derive a feasible upper bound as a surrogate optimization objective, yielding a simple but effective algorithm. In fact, we show that the proposed method boils down to minimizing the combination of a reconstruction loss and a normal cross-entropy/contrastive term. Extensive empirical results on five datasets (i.e., CIFAR, SVHN, STL-10, ImageNet and Cityscapes) validate that InfoPro is capable of achieving competitive performance with less than 40% of the memory footprint of E2E training, while allowing the use of higher-resolution training data or larger batch sizes under the same GPU memory constraint. Our method also enables training local modules asynchronously, offering potential training acceleration.

1. INTRODUCTION

End-to-end (E2E) back-propagation has become the standard paradigm for training deep networks (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; Szegedy et al., 2015; He et al., 2016; Huang et al., 2019). Typically, a training loss is computed at the final layer, and the gradients are then propagated backward layer by layer to update the weights. Although effective, this procedure suffers from memory and computation inefficiencies. First, the entire computational graph, as well as the activations of most (if not all) layers, needs to be stored, resulting in intensive memory consumption. The GPU memory constraint is often the bottleneck that inhibits training state-of-the-art models with high-resolution inputs and sufficient batch sizes, a situation that arises in many realistic scenarios, such as 2D/3D semantic segmentation and object detection in autonomous driving, tissue segmentation in medical imaging, and object recognition from remote sensing data. Most existing works address this issue via the gradient checkpointing technique (Chen et al., 2016) or reversible architecture designs (Gomez et al., 2017), but both come at the cost of significantly increased computation. Second, E2E training is a sequential process that impedes model parallelization (Belilovsky et al., 2020; Löwe et al., 2019), as earlier layers need to wait for their successors to provide error signals. As an alternative to E2E training, the locally supervised learning paradigm (Hinton et al., 2006; Bengio et al., 2007; Nøkland & Eidnes, 2019; Belilovsky et al., 2019; 2020) by design enjoys higher memory efficiency and allows for model parallelization. Specifically, it divides a deep network into several gradient-isolated modules and trains them separately under local supervision (see Figure 1(b)). Since back-propagation is performed only within local modules, one does not need to store all

Code availability: https://github.com/blackfeather-wang/InfoPro-Pytorch.

[Figure 1: (a) end-to-end supervised learning, which minimizes a single end-to-end loss; (b) locally supervised learning (K = 2) with greedy supervision, where each gradient-isolated module is trained with the end-to-end loss; (c) locally supervised learning with the proposed information propagation loss (L = L_InfoPro).]
intermediate activations at the same time. Consequently, the memory footprint during training is reduced without significant computational overhead. Moreover, since local modules do not need to wait for error signals from later layers, they can potentially be trained in parallel. This approach is also considered more biologically plausible, given that brains are highly modular and predominantly learn from local signals (Crick, 1989; Dan & Poo, 2004; Bengio et al., 2015). However, a major drawback of local learning is that it usually leads to inferior performance compared to E2E training (Mostafa et al., 2018; Belilovsky et al., 2019; 2020).

In this paper, we revisit locally supervised training and analyze its drawbacks from an information-theoretic perspective. We find that directly adopting an E2E loss function (i.e., cross-entropy) to train local modules produces more discriminative intermediate features at earlier layers, but collapses task-relevant information from the inputs and leads to inferior final performance. In other words, local learning tends to be short-sighted: it learns features that benefit only the local module, ignoring the demands of the remaining layers. Once task-relevant information is washed out in earlier modules, later layers cannot take full advantage of their capacity to learn more powerful representations.

Based on these observations, we hypothesize that a less greedy training procedure that preserves more information about the inputs might be a rescue for locally supervised training. We therefore propose a less greedy information propagation (InfoPro) loss, which encourages local modules to propagate forward as much information from the inputs as possible while progressively discarding task-irrelevant parts (formulated by an additional random variable named the nuisance), as shown in Figure 1(c). The proposed method differs from existing algorithms (Nøkland & Eidnes, 2019; Belilovsky et al., 2019; 2020) in that it allows intermediate features to retain a certain amount of information that may hurt short-term performance but can potentially be leveraged by later modules. In practice, as the InfoPro loss is difficult to estimate in its exact form, we derive a tractable upper bound, leading to surrogate losses, e.g., a cross-entropy loss and a contrastive loss.

Empirically, we show that the InfoPro loss effectively prevents local modules from collapsing task-relevant information, and yields favorable results on five widely used benchmarks (i.e., CIFAR, SVHN, STL-10, ImageNet and Cityscapes).
For instance, it achieves accuracy comparable to E2E training while using 40% or less of the GPU memory, and it allows using a 50% larger batch size or a 50% higher input resolution under the same memory constraint. Additionally, our method enables training different local modules asynchronously (even in parallel).
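To make the surrogate objective sketched above concrete — a reconstruction term (encouraging a module to preserve input information) combined with a standard cross-entropy term (local supervision) — the following NumPy sketch computes such a local loss for one gradient-isolated module. The linear decoder and classifier, the toy dimensions, and the weights `lam_recon`/`lam_ce` are illustrative assumptions, not the paper's exact architecture or configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def local_infopro_loss(x, h, y, W_dec, W_cls, lam_recon=0.5, lam_ce=1.0):
    """InfoPro-style surrogate loss for one local module (illustrative).

    Reconstruction term: an auxiliary decoder maps the module's features h
    back toward the input x, penalizing modules that discard input
    information. Discriminative term: an auxiliary classifier provides the
    usual local cross-entropy supervision against the labels y.
    """
    x_hat = h @ W_dec                          # auxiliary decoder output
    recon = np.mean((x - x_hat) ** 2)          # reconstruction loss
    probs = softmax(h @ W_cls)                 # auxiliary classifier output
    ce = -np.mean(np.log(probs[np.arange(len(y)), y] + 1e-12))
    return lam_recon * recon + lam_ce * ce

# Toy shapes: batch of 8 inputs (dim 32), module features (dim 16), 10 classes.
x = rng.standard_normal((8, 32))
h = rng.standard_normal((8, 16))               # stand-in for a module's output
y = rng.integers(0, 10, size=8)
W_dec = 0.1 * rng.standard_normal((16, 32))
W_cls = 0.1 * rng.standard_normal((16, 10))

loss = local_infopro_loss(x, h, y, W_dec, W_cls)
```

In an actual training loop, each module would minimize its own such loss and pass a stop-gradient (detached) copy of its output to the next module, so that back-propagation never crosses module boundaries.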

2. WHY DOES LOCALLY SUPERVISED LEARNING UNDERPERFORM E2E TRAINING?

We start by considering a local learning setting in which a deep network is split into multiple successively stacked modules, each with the same depth. The inputs are fed forward in the ordinary way, while gradients are produced at the end of every module and back-propagated only until they reach the beginning of that module, never crossing into an earlier module. To generate supervision signals, a straightforward solution is to train all the local modules as

