REVISITING LOCALLY SUPERVISED LEARNING: AN ALTERNATIVE TO END-TO-END TRAINING

Abstract

Due to the need to store intermediate activations for back-propagation, end-to-end (E2E) training of deep networks usually suffers from a high GPU memory footprint. This paper aims to address this problem by revisiting locally supervised learning, where a network is split into gradient-isolated modules and trained with local supervision. We experimentally show that simply training local modules with an E2E loss tends to collapse task-relevant information at early layers, and hence hurts the performance of the full model. To avoid this issue, we propose an information propagation (InfoPro) loss, which encourages local modules to preserve as much useful information as possible while progressively discarding task-irrelevant information. As the InfoPro loss is difficult to compute in its original form, we derive a feasible upper bound as a surrogate optimization objective, yielding a simple but effective algorithm. In fact, we show that the proposed method boils down to minimizing the combination of a reconstruction loss and a normal cross-entropy/contrastive term. Extensive empirical results on five datasets (i.e., CIFAR, SVHN, STL-10, ImageNet and Cityscapes) validate that InfoPro achieves competitive performance with less than 40% of the memory footprint of E2E training, while allowing higher-resolution training data or larger batch sizes under the same GPU memory constraint. Our method also enables training local modules asynchronously for potential training acceleration.

1. INTRODUCTION

End-to-end (E2E) back-propagation has become the standard paradigm for training deep networks (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; Szegedy et al., 2015; He et al., 2016; Huang et al., 2019). Typically, a training loss is computed at the final layer, and the gradients are then propagated backward layer-by-layer to update the weights. Although effective, this procedure suffers from memory and computation inefficiencies. First, the entire computational graph, as well as the activations of most, if not all, layers, needs to be stored, resulting in intensive memory consumption. The GPU memory constraint is usually the bottleneck that inhibits training state-of-the-art models with high-resolution inputs and sufficient batch sizes, which arises in many realistic scenarios, such as 2D/3D semantic segmentation and object detection in autonomous driving, tissue segmentation in medical imaging, and object recognition from remote sensing data. Most existing works address this issue via the gradient checkpointing technique (Chen et al., 2016) or reversible architecture designs (Gomez et al., 2017), but both come at the cost of significantly increased computation. Second, E2E training is a sequential process that impedes model parallelization (Belilovsky et al., 2020; Löwe et al., 2019), as earlier layers need to wait for their successors to provide error signals. As an alternative to E2E training, the locally supervised learning paradigm (Hinton et al., 2006; Bengio et al., 2007; Nøkland & Eidnes, 2019; Belilovsky et al., 2019; 2020) by design enjoys higher memory efficiency and allows for model parallelization. Specifically, it divides a deep network into several gradient-isolated modules and trains them separately under local supervision (see Figure 1 (b)).
[Figure 1 caption (excerpt): "End-to-end Loss" refers to the standard loss function used by E2E training, e.g., the softmax cross-entropy loss for classification, while L denotes the loss function used to train local modules. Panel (c) compares the three training approaches in terms of the information captured by features: greedy supervised learning (greedy SL) tends to collapse some of the task-relevant information within the beginning module, leading to inferior final performance, whereas the proposed information propagation (InfoPro) loss alleviates this problem by encouraging local modules to propagate forward all the information from the inputs while maximally discarding task-irrelevant information.]

Since back-propagation is performed only within local modules, one does not need to store all intermediate activations at the same time. Consequently, the memory footprint during training is reduced without significant computational overhead. Moreover, by removing the need to obtain error signals from later layers, different local modules can potentially be trained in parallel. This approach is also considered more biologically plausible, given that brains are highly modular and predominantly learn from local signals (Crick, 1989; Dan & Poo, 2004; Bengio et al., 2015). However, a major drawback of local learning is that it usually leads to inferior performance compared to E2E training (Mostafa et al., 2018; Belilovsky et al., 2019; 2020). In this paper, we revisit locally supervised training and analyse its drawbacks from an information-theoretic perspective. We find that directly adopting an E2E loss function (i.e., cross-entropy) to train local modules produces more discriminative intermediate features at earlier layers, but collapses task-relevant information from the inputs and leads to inferior final performance. In other words, local learning tends to be short-sighted: it learns features that only benefit local modules, while ignoring the demands of the remaining layers.
Once task-relevant information is washed out in earlier modules, later layers cannot take full advantage of their capacity to learn more powerful representations. Based on the above observations, we hypothesize that a less greedy training procedure that preserves more information about the inputs might be a remedy for locally supervised training. Therefore, we propose a less greedy information propagation (InfoPro) loss that encourages local modules to propagate forward as much information from the inputs as possible, while progressively abandoning task-irrelevant parts (formulated via an additional random variable named the nuisance), as shown in Figure 1 (c). The proposed method differentiates itself from existing algorithms (Nøkland & Eidnes, 2019; Belilovsky et al., 2019; 2020) in that it allows intermediate features to retain a certain amount of information that may hurt short-term performance but can potentially be leveraged by later modules. In practice, as the InfoPro loss is difficult to estimate in its exact form, we derive a tractable upper bound, leading to surrogate losses, e.g., a cross-entropy loss and a contrastive loss. Empirically, we show that the InfoPro loss effectively prevents collapsing task-relevant information at local modules, and yields favorable results on five widely used benchmarks (i.e., CIFAR, SVHN, STL-10, ImageNet and Cityscapes). For instance, it achieves accuracy comparable to E2E training using 40% or less GPU memory, while allowing a 50% larger batch size or a 50% larger input resolution under the same memory constraint. Additionally, our method enables training different local modules asynchronously (even in parallel).

2. WHY DOES LOCALLY SUPERVISED LEARNING UNDERPERFORM E2E TRAINING?

We start by considering a local learning setting in which a deep network is split into multiple successively stacked modules, each with the same depth. The inputs are fed forward in the ordinary way, while gradients are produced at the end of every module and back-propagated only until reaching the beginning of that module. To generate supervision signals, a straightforward solution is to train all the local modules as independent networks, e.g., in classification tasks, attaching a classifier to each module and computing a local classification loss such as cross-entropy.

[Figure 2: The linear separability (left, measured by test errors), mutual information with the input x (middle), and mutual information with the label y (right) of the intermediate features h from different layers, when the greedy supervised learning (greedy SL) algorithm is adopted with K ∈ {1, 2, 4, 8, 16} local modules. The ends of local modules are marked using larger markers with black edges. The experiments are conducted on CIFAR-10 with a ResNet-32.]

Table 1: Test errors of a ResNet-32 using greedy SL on CIFAR-10. The network is divided into K successive local modules, each trained separately with the softmax cross-entropy loss by appending a global-pool layer followed by a fully-connected layer (see Appendix F for details). "K = 1" refers to end-to-end (E2E) training.

    K            1        2        4        8        16
    Test Error   7.37%    10.30%   16.07%   21.19%   24.59%

However, such a greedy version of the standard supervised learning (greedy SL) algorithm leads to inferior performance of the whole network. For instance, Table 1 presents the test errors of a ResNet-32 (He et al., 2016) on CIFAR-10 (Krizhevsky et al., 2009) when it is greedily trained with K modules. One can observe severe degradation (even more than 15%) as K grows larger. Striking as this phenomenon is, it remains unclear whether it is inherent to local learning and how to alleviate it. In this section, we investigate the performance degradation of greedy local training from an information-theoretic perspective, laying the basis for the proposed algorithm.

Linear separability of intermediate features.
In the common case that greedy SL operates directly on the features output by internal layers, a natural first step is to investigate how these locally learned features differ from their E2E-learned counterparts in task-relevant behavior. To this end, we freeze the networks in Table 1 and train a linear classifier on the features from each layer. The test errors of these classifiers are presented in the left plot of Figure 2, where the horizontal axis denotes the indices of layers. The plot shows an intriguing trend: greedy SL yields dramatically more discriminative features within the first one (or first few) local modules, but is only able to improve performance slightly with all subsequent modules. In contrast, the E2E-learned network progressively boosts the linear separability of features throughout the whole network, with even more significant effects in the later layers, eventually surpassing greedy SL. This raises an interesting question: why does the full network achieve inferior performance under greedy SL compared to the E2E counterpart, even though the former builds on more discriminative earlier features? This observation appears incompatible with prior works such as deeply-supervised nets (Lee et al., 2015a). Information in features. Since we use the same training configuration for both greedy SL and E2E learning, we conjecture that the answer to the above question lies in differences of the features beyond mere separability. To test this, we look into the information captured by the intermediate features. Specifically, given an intermediate feature h corresponding to the input data x and the label y (all treated as random variables), we use the mutual information I(h, x) and I(h, y) to measure the amount of retained information and task-relevant information in h, respectively. As these quantities cannot be computed directly, we estimate the former by training a decoder with a binary cross-entropy loss to reconstruct x from h.
For the latter, we train a CNN that takes h as input to predict the label y, and estimate I(h, y) from its performance. Details are deferred to Appendix G. The estimates of I(h, x) and I(h, y) at different layers are shown in the middle and right plots of Figure 2. We note that in E2E-learned networks, I(h, y) remains unchanged as the features pass through the layers, while I(h, x) reduces gradually, revealing that the models progressively discard task-irrelevant information. By contrast, greedily trained networks collapse both I(h, x) and I(h, y) in their first few modules. We attribute this to the short-sighted optimization objective of earlier modules, which have relatively small capacity compared with the full network and are not capable of extracting and leveraging all the task-relevant information in x, as E2E-learned networks do. As a consequence, later modules, even though they introduce additional parameters and capacity, lack the necessary information about the target y to construct more discriminative features. Information collapse hypothesis. The above observations suggest that greedy SL induces local modules to collapse some of the task-relevant information that is of little use for short-term (local) performance, yet useful to the full model. In addition, we postulate that, although E2E training cannot extract all task-relevant information at earlier layers either, it alleviates this phenomenon by allowing a larger amount of task-irrelevant information to be kept, even though this may not be ideal for short-term performance. More empirical validation of our hypothesis is provided in Appendix A.
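As a concrete toy illustration of the gradient-isolated training procedure studied in this section (our own minimal sketch with scalar linear "modules" and hand-derived gradients, not the paper's ResNet setup), each module is updated only by its local loss, and its incoming feature is treated as a constant so that no gradient crosses a module boundary:

```python
def train_greedy_local(ws, x, target, lr=0.01, steps=500):
    """Greedy locally supervised training of a chain of scalar linear modules.

    Module k computes h_k = w_k * h_{k-1} and is updated only by its own local
    loss L_k = (h_k - target) ** 2; the incoming feature h_{k-1} is treated as
    a constant ("detached"), so no gradient flows across module boundaries.
    """
    ws = list(ws)
    for _ in range(steps):
        h = x
        for k in range(len(ws)):
            out = ws[k] * h
            ws[k] -= lr * 2.0 * (out - target) * h  # dL_k/dw_k, with h detached
            h = out  # feed forward; later modules never send gradients back
    return ws

def forward(ws, x):
    """Plain forward pass through the chain of trained modules."""
    h = x
    for w in ws:
        h = w * h
    return h
```

In this toy case every module can fit the local target on its own, so greediness is harmless; the information-collapse issue analysed above arises precisely when early modules lack the capacity to preserve everything that later modules need.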

3. INFORMATION PROPAGATION (INFOPRO) LOSS

In this section, we propose an information propagation (InfoPro) loss to address the issue of information collapse in locally supervised training. The key idea is to enforce local modules to retain as much information about the input as possible, while progressively discarding task-irrelevant parts. As it is difficult to estimate the InfoPro loss in its exact form, we derive an easy-to-compute upper bound as a surrogate loss, and analyze its tightness.

3.1. LEARNING TO DISCARD USELESS INFORMATION

Nuisance. We first model the task-irrelevant information in the input data x by introducing the concept of a nuisance. A nuisance is defined as an arbitrary random variable that affects x but provides no helpful information for the task of interest (Achille & Soatto, 2018). Take recognizing a car in the wild for example: the random variables determining the weather and the background are both nuisances. Formally, given a nuisance r, we have I(r, x) > 0 and I(r, y) = 0, where y is the label. Without loss of generality, we suppose that y, r, x and h form the Markov chain (y, r) → x → h, namely p(h|x, y, r) = p(h|x). As a consequence, for the intermediate feature h from any layer, we obviously have I(r, x) ≥ I(r, h). Nevertheless, we postulate that max_r I(r, h) > 0. This assumption is mild, since it fails only when h strictly contains no task-irrelevant information.

Information Propagation (InfoPro) Loss. We are now ready to introduce the proposed InfoPro loss. Instead of overly emphasizing highly discriminative features at local modules, we also pay attention to preventing the collapse of useful information in the feed-forward process. A simple way to achieve this is to maximize the mutual information I(h, x): ideally, if there is no information loss, all useful information is retained. However, this goes to the other extreme, where local modules are not driven to learn any task-relevant feature. By contrast, as shown above, intermediate layers in E2E training also progressively discard useless (task-irrelevant) information. Therefore, to model both effects simultaneously, we propose the following combined loss function:

$$\mathcal{L}_{\text{InfoPro}}(h) = \alpha \left[ -I(h, x) + \beta I(r^*, h) \right], \quad \alpha, \beta \ge 0, \quad \text{s.t. } r^* = \operatorname*{argmax}_{r,\, I(r,x)>0,\, I(r,y)=0} I(r, h), \tag{1}$$

where the nuisance r^* is formulated to capture as much task-irrelevant information in h as possible, and the coefficient β balances the amount of information that is propagated forward (first term) against the task-irrelevant information that is discarded (second term). Notably, we assume that the final module is always trained with the normal E2E loss (e.g., the softmax cross-entropy loss for classification) weighted by the constant 1, such that α is essential to balance the intermediate losses against the final one. In addition, L_InfoPro(h) is used to train the local module that outputs h, whose inputs are not required to be x; the module may stack on top of another local module trained with the same form of L_InfoPro(h) but (possibly) different α and β. Our method differs from existing works (Nøkland & Eidnes, 2019; Belilovsky et al., 2019; 2020) in that it is a non-greedy approach. The major effect of minimizing L_InfoPro(h) can be described as maximally discarding task-irrelevant information under the goal of retaining as much information of the input as possible; obtaining high short-term performance is not necessarily required. As we explicitly facilitate information propagation, we refer to L_InfoPro(h) as the InfoPro loss.
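The data-processing step invoked above, I(r, x) ≥ I(r, h) for the Markov chain (y, r) → x → h, can be checked exactly on a toy discrete example (a sketch with our own helper function: y and r are independent fair bits, x encodes both, and h = y keeps only the task-relevant bit):

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Exact mutual information (in nats) of an empirical joint distribution,
    given as a list of (a, b) samples: I = sum p(a,b) log[p(a,b) / (p(a)p(b))]."""
    n = len(pairs)
    pab = Counter(pairs)
    pa = Counter(a for a, _ in pairs)
    pb = Counter(b for _, b in pairs)
    return sum((c / n) * math.log((c / n) / ((pa[a] / n) * (pb[b] / n)))
               for (a, b), c in pab.items())

# x = (y, r): the input carries both the label bit y and the nuisance bit r;
# h = y: a feature that has discarded the nuisance entirely.
samples = [(y, r) for y in (0, 1) for r in (0, 1)]
I_rx = mutual_information([(r, (y, r)) for y, r in samples])  # = H(r) = ln 2
I_rh = mutual_information([(r, y) for y, r in samples])       # = 0 (r, y independent)
```

Here the feature h = y realizes the ideal limit in which I(r, h) has been driven to zero; a feature that retains part of the nuisance would instead satisfy 0 < I(r, h) ≤ I(r, x), matching the assumption max_r I(r, h) > 0.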

3.2. UPPER BOUND OF L_InfoPro

The objective function in Eq. (1) is difficult to optimize directly, since it is usually intractable to estimate r^*, which amounts to disentangling all task-irrelevant information from the intermediate features. Therefore, we derive an easy-to-compute upper bound of L_InfoPro as a surrogate loss. Our result is summarized in Proposition 1, with the proof in Appendix B.

Proposition 1. Suppose that the Markov chain (y, r) → x → h holds. Then an upper bound of L_InfoPro is given by

$$\mathcal{L}_{\text{InfoPro}} \le -\lambda_1 I(h, x) - \lambda_2 I(h, y) \triangleq \hat{\mathcal{L}}_{\text{InfoPro}}, \quad \text{where } \lambda_1 = \alpha(1-\beta), \; \lambda_2 = \alpha\beta. \tag{2}$$

For simplicity, we absorb α and β into two mutually independent hyper-parameters, λ_1 and λ_2. Although we do not explicitly restrict λ_1 ≥ 0, we find in experiments that the performance of networks is significantly degraded as λ_1 → 0+ (see Figure 4), or equivalently β → 1−, where models tend to reach local minima by trivially minimizing I(r^*, h) in Eq. (1). Thus, we assume λ_1, λ_2 ≥ 0. With Proposition 1, we can optimize the upper bound $\hat{\mathcal{L}}_{\text{InfoPro}}$ as an approximation, circumventing the intractable term I(r^*, h) in L_InfoPro. To ensure that the approximation is accurate, the gap between the two should be reasonably small. Below we present an analysis of the tightness of $\hat{\mathcal{L}}_{\text{InfoPro}}$ in Proposition 2 (proof given in Appendix C); we also empirically examine it in Appendix H. Proposition 2 provides a useful tool to examine the discrepancy between L_InfoPro and its upper bound.

Proposition 2. Given that $r^* = \operatorname*{argmax}_{r,\, I(r,x)>0,\, I(r,y)=0} I(r, h)$ and that y is a deterministic function of x, the gap $\Delta \triangleq \hat{\mathcal{L}}_{\text{InfoPro}} - \mathcal{L}_{\text{InfoPro}}$ is upper bounded by

$$\Delta \le \lambda_2 \left[ I(x, y) - I(h, y) \right]. \tag{3}$$
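The coefficient substitution in Proposition 1 is simple arithmetic; the following sketch (helper names are ours) makes the (α, β) → (λ_1, λ_2) mapping and the resulting surrogate explicit:

```python
def lambdas_from_alpha_beta(alpha, beta):
    """Map the (alpha, beta) weights of the InfoPro loss of Eq. (1) to the
    surrogate coefficients: lambda_1 = alpha * (1 - beta), lambda_2 = alpha * beta."""
    assert alpha >= 0 and 0.0 <= beta <= 1.0
    return alpha * (1.0 - beta), alpha * beta

def surrogate_loss(I_hx, I_hy, alpha, beta):
    """Upper bound of Proposition 1: -lambda_1 * I(h, x) - lambda_2 * I(h, y)."""
    lam1, lam2 = lambdas_from_alpha_beta(alpha, beta)
    return -lam1 * I_hx - lam2 * I_hy
```

Note that λ_1 + λ_2 = α always holds, and β → 1 drives λ_1 → 0, the degenerate regime discussed above in which the reconstruction term vanishes.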

3.3. MUTUAL INFORMATION ESTIMATION

In the following, we describe the specific techniques we use to estimate the mutual information terms I(h, x) and I(h, y) in $\hat{\mathcal{L}}_{\text{InfoPro}}$. Both are estimated using small auxiliary networks; however, we note that the additional computational costs involved are minimal or even negligible (see Tables 3, 4).

Estimating I(h, x). Let R(x|h) denote the expected error for reconstructing x from h. It is well known that I(h, x) = H(x) − H(x|h) ≥ H(x) − R(x|h), where H(x), the marginal entropy of x, is a constant (Vincent et al., 2008; Rifai et al., 2012; Kingma & Welling, 2013; Makhzani et al., 2015; Hjelm et al., 2019). Therefore, we estimate I(h, x) by training a decoder parameterized by w to achieve the minimal reconstruction loss, namely I(h, x) ≈ max_w [H(x) − R_w(x|h)]. In practice, we use the binary cross-entropy loss for R_w(x|h).

Estimating I(h, y). We propose two ways to estimate I(h, y). Since I(h, y) = H(y) − H(y|h) = H(y) − E_(h,y)[−log p(y|h)], a straightforward approach is to train an auxiliary classifier q_ψ(y|h) with parameters ψ to approximate p(y|h), such that

$$I(h, y) \approx \max_{\psi} \Big\{ H(y) - \mathbb{E}_{h} \Big[ \sum\nolimits_{y} -p(y|h) \log q_{\psi}(y|h) \Big] \Big\}.$$

Note that this approximation becomes an equality if and only if q_ψ(y|h) ≡ p(y|h) (by Gibbs' inequality). Finally, we estimate the expectation over h using the samples {(x_i, h_i, y_i)}_{i=1}^{N}, namely

$$I(h, y) \approx \max_{\psi} \Big\{ H(y) - \frac{1}{N} \sum_{i=1}^{N} -\log q_{\psi}(y_i|h_i) \Big\}.$$

Consequently, q_ψ(y|h) can be trained in the regular classification fashion with the cross-entropy loss. In addition, motivated by recent advances in contrastive representation learning (Chen et al., 2020; Khosla et al., 2020; He et al., 2020), we formulate a contrastive-style loss function L_contrast, and prove in Appendix D that minimizing L_contrast is equivalent to maximizing a lower bound of I(h, y). Empirical results indicate that adopting L_contrast may lead to better performance when a large batch size is available.
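The sample-based estimator of I(h, y) above can be sanity-checked with toy classifier outputs (a hedged sketch; `q_probs` stands in for the auxiliary classifier's predicted distributions q_ψ(y|h), and the helper names are ours):

```python
import math

def entropy(p):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def estimate_I_hy(q_probs, labels, class_prior):
    """I(h, y) ~= H(y) - (1/N) * sum_i -log q(y_i | h_i): a lower-bound style
    estimate that is tight when q equals the true posterior p(y|h)."""
    ce = sum(-math.log(q[y]) for q, y in zip(q_probs, labels)) / len(labels)
    return entropy(class_prior) - ce

# A confident, correct classifier recovers nearly all of H(y):
confident = [[0.99, 0.01], [0.01, 0.99]]
# A classifier that has learned nothing about y from h estimates I(h, y) ~= 0:
uninformed = [[0.5, 0.5], [0.5, 0.5]]
```

With a uniform binary prior, the uninformed predictions give an estimate of exactly 0 nats, while the confident ones approach the ceiling H(y) = ln 2, mirroring the Gibbs'-inequality argument above.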
Specifically, considering a mini-batch of intermediate features {h_1, ..., h_N} corresponding to the labels {y_1, ..., y_N}, L_contrast is given by

$$\mathcal{L}_{\text{contrast}} = -\frac{1}{\sum_{i \ne j} \mathbb{1}_{y_i = y_j}} \sum_{i \ne j} \mathbb{1}_{y_i = y_j} \log \frac{\exp(z_i^{\top} z_j / \tau)}{\sum_{k=1}^{N} \mathbb{1}_{i \ne k} \exp(z_i^{\top} z_k / \tau)}, \quad z_i = f_{\phi}(h_i).$$

Herein, the indicator 1_A ∈ {0, 1} returns 1 only when A is true, τ > 0 is a pre-defined hyper-parameter (the temperature), and f_φ is a projection head parameterized by φ that maps the feature h_i to a representation vector z_i (this design follows Chen et al. (2020); Khosla et al. (2020)).

Implementation details. We defer the details of the network architectures of w, ψ and φ to Appendix E. Briefly, on CIFAR, SVHN and STL-10, w is a two-layer decoder with up-sampled inputs (if not otherwise noted), while ψ and φ share the same architecture, consisting of a single convolutional layer followed by two fully-connected layers. On ImageNet and Cityscapes, we use relatively larger auxiliary networks, but they remain very small compared with the primary network. Empirically, we find that these simple architectures consistently achieve competitive performance. Moreover, in implementation, we train w, ψ and φ jointly with the main network. Formally, let θ denote the parameters of the local module to be trained; our optimization objective is then

$$\min_{\theta, w, \psi} \; \lambda_1 R_w(x|h) + \lambda_2 \frac{1}{N} \sum_{i=1}^{N} -\log q_{\psi}(y_i|h_i) \quad \text{or} \quad \min_{\theta, w, \phi} \; \lambda_1 R_w(x|h) + \lambda_2 \mathcal{L}_{\text{contrast}},$$

which correspond to using the cross-entropy loss and the contrastive loss to estimate I(h, y), respectively. Such an approximation is acceptable, as we do not need the exact value of the mutual information, and empirically it performs well in various experimental settings.
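For reference, the contrastive objective L_contrast can be written out in a few lines of plain Python (a sketch with our own helper name; a practical implementation would operate on batched GPU tensors with normalized projections):

```python
import math

def contrastive_loss(z, labels, tau=0.07):
    """Supervised contrastive loss over projected features z_i = f_phi(h_i):
    the mean over positive pairs (i, j), i != j with y_i == y_j, of
    -log( exp(z_i . z_j / tau) / sum_{k != i} exp(z_i . z_k / tau) )."""
    def dot(a, b):
        return sum(u * v for u, v in zip(a, b))
    n, total, pairs = len(z), 0.0, 0
    for i in range(n):
        denom = sum(math.exp(dot(z[i], z[k]) / tau) for k in range(n) if k != i)
        for j in range(n):
            if i != j and labels[i] == labels[j]:
                total -= math.log(math.exp(dot(z[i], z[j]) / tau) / denom)
                pairs += 1
    return total / pairs
```

Minimizing this quantity pulls same-class projections together and pushes different-class projections apart; features in which same-class pairs are already aligned incur a lower loss than features in which they are not.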

4. EXPERIMENTS

Setups. Our experiments are based on five widely used datasets (i.e., CIFAR-10 (Krizhevsky et al., 2009), SVHN (Netzer et al., 2011), STL-10 (Coates et al., 2011), ImageNet (Deng et al., 2009) and Cityscapes (Cordts et al., 2016)) and two popular network architectures (i.e., ResNet (He et al., 2016) and DenseNet (Huang et al., 2019)) with varying depths. We split each network into K local modules with the same (or approximately the same) number of layers, where the first K − 1 modules are trained using $\hat{\mathcal{L}}_{\text{InfoPro}}$ and the last module is trained using the standard E2E loss, as aforementioned. Due to space limitations, details on data pre-processing, training configurations and local-module splitting are deferred to Appendix F. The hyper-parameters λ_1 and λ_2 are selected from {0, 0.1, 0.5, 1, 2, 5, 10, 20}. Notably, to avoid introducing too many tunable hyper-parameters when K is large (e.g., K = 16), we assume that λ_1 and λ_2 change linearly from the 1st to the (K − 1)th local module, and thus we merely tune λ_1 and λ_2 for these two modules. We always use τ = 0.07 in L_contrast. Two training modes are considered: (1) simultaneous training, where the back-propagation of all local modules is sequentially triggered with every mini-batch of training data; and (2) asynchronous training, where local modules are learned in isolation given cached outputs from fully trained earlier modules. Both modes enjoy high memory efficiency, since only the activations within a single module need to be stored at a time; the second mode additionally removes the dependence of local modules on their predecessors, enabling fully decoupled training of network components. The experiments using asynchronous training are referred to as "Asy-InfoPro", while all other results are based on simultaneous training.
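One plausible reading of the linear λ schedule described above (a sketch under our assumption that the two tuned endpoint values are interpolated evenly across the K − 1 locally trained modules; the helper name is hypothetical):

```python
def linear_lambda_schedule(lam_first, lam_last, K):
    """Per-module hyper-parameter values for local modules 1 .. K-1 (the K-th
    module uses the standard E2E loss), interpolated linearly between the value
    tuned for the first module and the value tuned for the (K-1)-th module."""
    if K < 2:
        raise ValueError("need at least one locally trained module")
    if K == 2:
        return [lam_first]  # a single local module: nothing to interpolate
    step = (lam_last - lam_first) / (K - 2)
    return [lam_first + step * i for i in range(K - 1)]
```

Under this reading, only two values per hyper-parameter need to be tuned regardless of K, which is the stated motivation for the schedule.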

[Figure 3: left, test errors of different local learning methods for varying numbers of local modules K; right, estimated mutual information between the intermediate features h and the label y.]
Comparisons with other local learning methods. We first compare the proposed InfoPro method with three recently proposed algorithms, decoupled greedy learning (DGL) (Belilovsky et al., 2020), BoostResNet (Huang et al., 2018a) and deep incremental boosting (DIB) (Mosca & Magoulas, 2017), in Figure 3. Our method yields the lowest test errors for all values of K. Notably, DGL can be viewed as a special case of InfoPro with λ_1 = 0; hence, for a fair comparison, DGL uses the same auxiliary-network architecture as our method. In addition, we present the estimates of the mutual information between intermediate features and labels in the right plot of Figure 3. One can observe that DGL suffers from a severe collapse of task-relevant information at early modules, since it optimizes local modules greedily for merely short-term performance. By contrast, our method effectively alleviates this problem, retaining a larger amount of task-relevant information in the intermediate features.

Results on various image classification benchmarks are presented in Table 2, where we also report the result of DGL (Belilovsky et al., 2020) in our implementation.

Table 2: Performance of different networks with varying numbers of local modules. The averaged test errors and standard deviations of 5 independent trials are reported. InfoPro (Softmax/Contrast) refers to the two approaches to estimating I(h, y). The results of Asy-InfoPro are obtained by asynchronous training, while the others are based on simultaneous training. Greedy SL+ adopts deeper networks to have the same computational costs as InfoPro.

    Dataset    Network                          Method                      K = 2           K = 4           K = 8           K = 16
    CIFAR-10   ResNet-32 (E2E: 7.37 ± 0.10%)    Greedy SL                   10.30 ± 0.20%   16.07 ± 0.46%   21.19 ± 0.52%   24.59 ± 0.83%
                                                DGL (Belilovsky et al., 2020)  …

It can be observed that InfoPro consistently outperforms greedy SL by large margins across different networks, especially when K is large. For example, on CIFAR-10, ResNet-32 + InfoPro achieves a test error of 12.75% with K = 16, surpassing greedy SL by 11.84%. For ResNet-110, we note that our method performs on par with E2E training with K = 2, while degrading the performance by up to 0.8% with K = 4. Moreover, InfoPro compares favorably against DGL under most settings. In addition, given that our method introduces auxiliary networks, we enlarge the network depth for greedy SL to match the computational cost of InfoPro, referred to as greedy SL+. However, this only slightly ameliorates the performance, since the problem of information collapse remains. Another interesting phenomenon is that InfoPro (Contrast) outperforms InfoPro (Softmax) on CIFAR-10 and SVHN, yet fails to do so on STL-10. We attribute this to the larger batch size we use on the former two datasets and the proper value of the temperature τ; a detailed analysis is given in Appendix H.

Asynchronous and parallel training. The results of asynchronous training are presented in Table 2 as "Asy-InfoPro", which appears to slightly hurt performance. Asy-InfoPro differs from InfoPro in that it adopts the cached outputs of fully trained earlier modules as the inputs of later modules; the degradation of performance might therefore be ascribed to the loss of the regularizing effect of the noisy outputs of earlier modules during training (Löwe et al., 2019). However, Asy-InfoPro is still considerably better than both greedy SL and DGL, approaching E2E training.

[Caption fragment from a results table: "... [0.5, 1.75]) and left-right flipped inputs during inference. We also present the results reported by the original paper in the 'original' row. DGL refers to decoupled greedy learning (Belilovsky et al., 2020)."]
d S z w 8 r e 1 1 M R + a P W b x C 7 s M O v I P J U B e T x t N b F S u U k 4 F 3 p e j k 6 2 L x j A c l 0 4 x p F z 3 A 8 2 L R S 1 q l Z V h a z q N m R d x A 4 n L d B j u S j 8 W 9 i o n A T W Q + u V s m T 9 V t z 5 Q q p A + L s / M E W 6 z o 3 x h V 6 E D O T i P n D E 9 v F X a Y O 6 / h r B K b w u 9 g X z w 8 Z z J 1 V e j T U r n A T 6 7 c v + R V f D 8 m a w 6 / D D c j Y / R 4 V 7 x H M k H A 4 t t p X f 1 S j h s 7 q 0 2 r 1 b Q + r 8 1 v r J V f 0 T o 9 p x e 0 i J 5 9 T R v 0 g T Z x Q r i 1 m Z p V e 1 N 7 W 3 9 f D + q m f l J A p 2 o l 5 x l d u u r f / w K b Y s Y j < / l a t e x i t > 1 st local module < l a t e x i t s h a 1 _ b a s e 6 4 = " B t / X S I x X H y v q N K E g r W Z O Y h o S v P 4 = " > A A A I x X i c p V X b T t t A E J 1 A L 2 n o B d r H v q y K k E C C N E Z I R e 0 L U h / a S n 2 g F T e J U G Q 7 G 7 D i W + 0 1 F 0 V W X / o r / a j + Q f s X P T N 2 S C D g g G o L M h 6 f c 2 Z n V R B 1 M l / n h 7 P z r W Z L L j V u W K U x T + W 1 G c 1 N / 6 Q 2 d S g i l z I K S F N I B r Z P N q W 4 9 8 m i F s X w H V A f v g S W J + 8 1 5 d Q A N w N K A 2 H D 2 8 P / I z z t l 9 4 Q z 6 y Z C t t F F B 9 / C Z i K F s C J g E t g c z Q l 7 z N R Z u 9 N 2 n 3 R 5 L W d 4 9 c p t Q J 4 D R 3 D O 4 k 3 Q N 6 W x 7 k Y 6 t K 6 5 O A h p 1 g 8 n J 1 b q m R S F V 6 5 G s n K Q C G G j + 0 O 3 i e w X W E O 6 q y E k 0 r u X F t b 3 v 8 R J H v 5 2 S 2 x G f 2 t z M 6 B K u 9 I 4 7 8 y 4 Z i + 7 N I p 7 A K / A o Y a w d + s v 4 V u Y e 0 u + O E l 5 e J e E M 1 U 6 r Q O X Q e q X A X m q Q t W V Z b H 2 P W i p l y j 7 o j 2 M M I Q 4 8 u q e 5 W K m e C u 6 l x W T K E T g F n s 8 D Z 9 p c + y v 0 U c V j X S A S E 0 q 3 Y o w t 2 T v n C A u G 7 t M V Y T I S 8 t H e 5 J R X h H V u g 7 V G 3 J i O O q U q X o n 0 k 9 P 6 z s d T E d m T 9 m 8 Q q 5 D z v w D i Z D X U w a T 2 9 V r F B O B t 6 V o p O v i 8 U z H p R M M 6 Z d 9 A D P i 0 W v a Z W W Y W k 5 j 5 o V c Q O J y 3 U b 7 E g + F v c q J g I 3 k f n k b p k 8 
V b c 9 U 6 q Q P i z O z h N s s a K 7 M a r Q g Z y d R s 4 Z n t 4 q 7 D B 3 X s N Z J T a F 3 8 G + e H j O Z O q q 0 K e l c o G f X L m 7 5 F V 8 P y Z r D r 8 M N y N j 9 H h X v E c y Q c D i 2 2 l d / V K O G z u r T a v V t L 6 s z W + s l V / R O r 2 k V 7 S I n n 1 D G / S R N n F C u L W Z m l V 7 W 3 t X / 1 A P 6 q Z + U k C n a i X n B V 2 6 6 j / + A R D 6 x j I = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " B t / X S I x X H y v q N K E g r W Z O Y h o S v P 4 = " > A A A I x X i c p V X b T t t A E J 1 A L 2 n o B d r H v q y K k E C C N E Z I R e 0 L U h / a S n 2 g F T e J U G Q 7 G 7 D i W + 0 1 F 0 V W X / o r / a j + Q f s X P T N 2 S C D g g G o L M h 6 f c 2 Z n V R B 1 M l / n h 7 P z r W Z L L j V u W K U x T + W 1 G c 1 N / 6 Q 2 d S g i l z I K S F N I B r Z P N q W 4 9 8 m i F s X w H V A f v g S W J + 8 1 5 d Q A N w N K A 2 H D 2 8 P / I z z t l 9 4 Q z 6 y Z C t t F F B 9 / C Z i K F s C J g E t g c z Q l 7 z N R Z u 9 N 2 n 3 R 5 L W d 4 9 c p t Q J 4 D R 3 D O 4 k 3 Q N 6 W x 7 k Y 6 t K 6 5 O A h p 1 g 8 n J 1 b q m R S F V 6 5 G s n K Q C G G j + 0 O 3 i e w X W E O 6 q y E k 0 r u X F t b 3 v 8 R J H v 5 2 S 2 x G f 2 t z M 6 B K u 9 I 4 7 8 y 4 Z i + 7 N I p 7 A K / A o Y a w d + s v 4 V u Y e 0 u + O E l 5 e J e E M 1 U 6 r Q O X Q e q X A X m q Q t W V Z b H 2 P W i p l y j 7 o j 2 M M I Q 4 8 u q e 5 W K m e C u 6 l x W T K E T g F n s 8 D Z 9 p c + y v 0 U c V j X S A S E 0 q 3 Y o w t 2 T v n C A u G 7 t M V Y T I S 8 t H e 5 J R X h H V u g 7 V G 3 J i O O q U q X o n 0 k 9 P 6 z s d T E d m T 9 m 8 Q q 5 D z v w D i Z D X U w a T 2 9 V r F B O B t 6 V o p O v i 8 U z H p R M M 6 Z d 9 A D P i 0 W v a Z W W Y W k 5 j 5 o V c Q O J y 3 U b 7 E g + F v c q J g I 3 k f n k b p k 8 V b c 9 U 6 q Q P i z O z h N s s a K 7 M a r Q g Z y d R s 4 Z n t 4 q 7 D B 3 X s N Z J T a F 3 8 G + e H j O Z O q q 0 K e l c o G f X L m 7 5 F V 8 P y Z r D r 8 M N y N j 9 H h X v E c y Q c D i 2 2 l d / V K O G z u r T a v 
V t L 6 s z W + s l V / R O r 2 k V 7 S I n n 1 D G / S R N n F C u L W Z m l V 7 W 3 t X / 1 A P 6 q Z + U k C n a i X n B V 2 6 6 j / + A R D 6 x j I = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " B t / X S I x X H y v q N K E g r W Z O Y h o S v P 4 = " > A A A I x X i c p V X b T t t A E J 1 A L 2 n o B d r H v q y K k E C C N E Z I R e 0 L U h / a S n 2 g F T e J U G Q 7 G 7 D i W + 0 1 F 0 V W X / o r / a j + Q f s X P T N 2 S C D g g G o L M h 6 f c 2 Z n V R B 1 M l / n h 7 P z r W Z L L j V u W K U x T + W 1 G c 1 N / 6 Q 2 d S g i l z I K S F N I B r Z P N q W 4 9 8 m i F s X w H V A f v g S W J + 8 1 5 d Q A N w N K A 2 H D 2 8 P / I z z t l 9 4 Q z 6 y Z C t t F F B 9 / C Z i K F s C J g E t g c z Q l 7 z N R Z u 9 N 2 n 3 R 5 L W d 4 9 c p t Q J 4 D R 3 D O 4 k 3 Q N 6 W x 7 k Y 6 t K 6 5 O A h p 1 g 8 n J 1 b q m R S F V 6 5 G s n K Q C G G j + 0 O 3 i e w X W E O 6 q y E k 0 r u X F t b 3 v 8 R J H v 5 2 S 2 x G f 2 t z M 6 B K u 9 I 4 7 8 y 4 Z i + 7 N I p 7 A K / A o Y a w d + s v 4 V u Y e 0 u + O E l 5 e J e E M 1 U 6 r Q O X Q e q X A X m q Q t W V Z b H 2 P W i p l y j 7 o j 2 M M I Q 4 8 u q e 5 W K m e C u 6 l x W T K E T g F n s 8 D Z 9 p c + y v 0 U c V j X S A S E 0 q 3 Y o w t 2 T v n C A u G 7 t M V Y T I S 8 t H e 5 J R X h H V u g 7 V G 3 J i O O q U q X o n 0 k 9 P 6 z s d T E d m T 9 m 8 Q q 5 D z v w D i Z D X U w a T 2 9 V r F B O B t 6 V o p O v i 8 U z H p R M M 6 Z d 9 A D P i 0 W v a Z W W Y W k 5 j 5 o V c Q O J y 3 U b 7 E g + F v c q J g I 3 k f n k b p k 8 V b c 9 U 6 q Q P i z O z h N s s a K 7 M a r Q g Z y d R s 4 Z n t 4 q 7 D B 3 X s N Z J T a F 3 8 G + e H j O Z O q q 0 K e l c o G f X L m 7 5 F V 8 P y Z r D r 8 M N y N j 9 H h X v E c y Q c D i 2 2 l d / V K O G z u r T a v V t L 6 s z W + s l V / R O r 2 k V 7 S I n n 1 D G / S R N n F C u L W Z m l V 7 W 3 t X / 1 A P 6 q Z + U k C n a i X n B V 2 6 6 j / + A R D 6 x j I = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " B t / X S I x 
X H y v q N K E g r W Z O Y h o S v P 4 = " > A A A I x X i c p V X b T t t A E J 1 A L 2 n o B d r H v q y K k E C C N E Z I R e 0 L U h / a S n 2 g F T e J U G Q 7 G 7 D i W + 0 1 F 0 V W X / o r / a j + Q f s X P T N 2 S C D g g G o L M h 6 f c 2 Z n V R B 1 M l / n h 7 P z r W Z L L j V u W K U x T + W 1 G c 1 N / 6 Q 2 d S g i l z I K S F N I B r Z P N q W 4 9 8 m i F s X w H V A f v g S W J + 8 1 5 d Q A N w N K A 2 H D 2 8 P / I z z t l 9 4 Q z 6 y Z C t t F F B 9 / C Z i K F s C J g E t g c z Q l 7 z N R Z u 9 N 2 n 3 R 5 L W d 4 9 c p t Q J 4 D R 3 D O 4 k 3 Q N 6 W x 7 k Y 6 t K 6 5 O A h p 1 g 8 n J 1 b q m R S F V 6 5 G s n K Q C G G j + 0 O 3 i e w X W E O 6 q y E k 0 r u X F t b 3 v 8 R J H v 5 2 S 2 x G f 2 t z M 6 B K u 9 I 4 7 8 y 4 Z i + 7 N I p 7 A K / A o Y a w d + s v 4 V u Y e 0 u + O E l 5 e J e E M 1 U 6 r Q O X Q e q X A X m q Q t W V Z b H 2 P W i p l y j 7 o j 2 M M I Q 4 8 u q e 5 W K m e C u 6 l x W T K E T g F n s 8 D Z 9 p c + y v 0 U c V j X S A S E 0 q 3 Y o w t 2 T v n C A u G 7 t M V Y T I S 8 t H e 5 J R X h H V u g 7 V G 3 J i O O q U q X o n 0 k 9 P 6 z s d T E d m T 9 m 8 Q q 5 D z v w D i Z D X U w a T 2 9 V r F B O B t 6 V o p O v i 8 U z H p R M M 6 Z d 9 A D P i 0 W v a Z W W Y W k 5 j 5 o V c Q O J y 3 U b 7 E g + F v c q J g I 3 k f n k b p k 8 V b c 9 U 6 q Q P i z O z h N s s a K 7 M a r Q g Z y d R s 4 Z n t 4 q 7 D B 3 X s N Z J T a F 3 8 G + e H j O Z O q q 0 K e l c o G f X L m 7 5 F V 8 P y Z r D r 8 M N y N j 9 H h X v E c y Q c D i 2 2 l d / V K O G z u r T a v V t L 6 s z W + s l V / R O r 2 k V 7 S I n n 1 D G / S R N n F C u L W Z m l V 7 W 3 t X / 1 A P 6 q Z + U k C n a i X n B V 2 6 6 j / + A R D 6 x j I = < / l a t e x i t > 1 (coefficient of I(h, x)) < l a t e x i t s h a 1 _ b a s e 6 4 = " c U A i P u I A X F Z g p 0 y m m T 5 D a l v X q 7 8 = " > A 2 ) since their training processes are identical except for the parallelism. 
A A I x X i c p V X b T t t A E J 1 A L 2 7 o B d r H v q w a R Q I J 0 h g h F b U v S H 1 o K / W B V t w k Q p H t r I O V t d e 1 1 0 A U W X 3 p r / S j + g f t X 3 R m 7 J B A w A H V F m Q 8 P u f M z u z M 2 o 1 V k J p 2 + 3 d t b v 7 e / Q c P r U f 1 h c d P n j 5 b X H q + l + o s 8 e S u p 5 V O D l w n l S q I 5 K 4 J j J I H c S K d 0 F V y 3 + 2 / p / f 7 p z J J A x 3 t m E E s j 0 K n F w V + 4 D k G X c e L v 5 q i o x D e d Y 7 X O 2 K 5 Y + S 5 i X Q S O m r o a e k j M J C R E d o X + a f l j h s O T / J V M V h Z E f U R z b 4 1 j X 7 P c + I 2 h f 1 t O E l K T Z 5 3 x K R H a c 9 R I t T d T M n 8 e L H R b r X 5 E t O G X R o N K K 9 t v T T / E z r Q B Q 0 e Z B C C h A g M 2 g o c S P E + B B v a E K P v C I b o S 9 A K + L 2 E H O r I z R A l E e G g t 4 / / e / h 0 W H o j f C b N l N k e R l H 4 l y B T Q B M 5 G n E J 2 h R N 8 P u M l c l 7 k / a Q N W l t A / x 1 S 6 0 Q v Q Z O M c V d 1 L i u m q B M i s 9 j h X f g K n 3 l / i z i k a r g D I t S s 2 i G N d 5 / 7 w k X E d W u P c T U a 8 5 L c 4 Q F X h H Z k D b 6 j q s M Z U V x R q h T 9 M 6 v n x 5 W 9 L q b L 8 0 c s W i H 1 Y R e 9 o 8 k Q F 5 N G 0 1 s V K + K T g X a l 6 O T r Y t G M h y X T T G k X P U D z Y s N r W I d V t C S f R 6 2 K u C H H p b q N d i S f i n s V o 5 G b 8 H x S t 8 y e q t u e K V V I h R Z l F z C 2 W N H d G F X o k M 9 O w + c M T W 8 V 8 = " > A A A I x X i c p V X b T t t A E J 1 A L 2 7 o B d r H v q w a R Q I J 0 h g h F b U v S H 1 o K / W B V t w k Q p H t r I O V t d e 1 1 0 A U W X 3 p r / S j + g f t X 3 R m 7 J B A w A H V F m Q 8 P u f M z u z M 2 o 1 V k J p 2 + 3 d t b v 7 e / Q c P r U f 1 h c d P n j 5 b X H q + l + o s 8 e S u p 5 V O D l w n l S q I 5 K 4 J j J I H c S K d 0 F V y 3 + 2 / p / f 7 p z J J A x 3 t m E E s j 0 K n F w V + 4 D k G X c e L v 5 q i o x D e d Y 7 X O 2 K 5 Y + S 5 i X Q S O m r o a e k j M J C R E d o X + a f l j h s O T / J V M V h Z E f U R z b 4 1 j X 7 P c + I 2 h f 1 t O E l K T Z 5 3 x K R H 
a c 9 R I t T d T M n 8 e L H R b r X 5 E t O G X R o N K K 9 t v T T / E z r Q B Q 0 e Z B C C h A g M 2 g o c S P E + B B v a E K P v C I b o S 9 A K + L 2 E H O r I z R A l E e G g t 4 / / e / h 0 W H o j f C b N l N k e R l H 4 l y B T Q B M 5 G n E J 2 h R N 8 P u M l c l 7 k / a Q N W l t A / x 1 S 6 0 Q v Q Z O M c V d 1 L i u m q B M i s 9 j h X f g K n 3 l / i z i k a r g D I t S s 2 i G N d 5 / 7 w k X E d W u P c T U a 8 5 L c 4 Q F X h H Z k D b 6 j q s M Z U V x R q h T 9 M 6 v n x 5 W 9 L q b L 8 0 c s W i H 1 Y R e 9 o 8 k Q F 5 N G 0 1 s V K + K T g X a l 6 O T r Y t G M h y X T T G k X P U D z Y s N r W I d V t C S f R 6 2 K u C H H p b q N d i S f i n s V o 5 G b 8 H x S t 8 y e q t u e K V V I h R Z l F z C 2 W N H d G F X o k M 9 O w + c M T W 8 V 8 = " > A A A I x X i c p V X b T t t A E J 1 A L 2 7 o B d r H v q w a R Q I J 0 h g h F b U v S H 1 o K / W B V t w k Q p H t r I O V t d e 1 1 0 A U W X 3 p r / S j + g f t X 3 R m 7 J B A w A H V F m Q 8 P u f M z u z M 2 o 1 V k J p 2 + 3 d t b v 7 e / Q c P r U f 1 h c d P n j 5 b X H q + l + o s 8 e S u p 5 V O D l w n l S q I 5 K 4 J j J I H c S K d 0 F V y 3 + 2 / p / f 7 p z J J A x 3 t m E E s j 0 K n F w V + 4 D k G X c e L v 5 q i o x D e d Y 7 X O 2 K 5 Y + S 5 i X Q S O m r o a e k j M J C R E d o X + a f l j h s O T / J V M V h Z E f U R z b 4 1 j X 7 P c + I 2 h f 1 t O E l K T Z 5 3 x K R H a c 9 R I t T d T M n 8 e L H R b r X 5 E t O G X R o N K K 9 t v T T / E z r Q B Q 0 e Z B C C h A g M 2 g o c S P E + B B v a E K P v C I b o S 9 A K + L 2 E H O r I z R A l E e G g t 4 / / e / h 0 W H o j f C b N l N k e R l H 4 l y B T Q B M 5 G n E J 2 h R N 8 P u M l c l 7 k / a Q N W l t A / x 1 S 6 0 Q v Q Z O M c V d 1 L i u m q B M i s 9 j h X f g K n 3 l / i z i k a r g D I t S s 2 i G N d 5 / 7 w k X E d W u P c T U a 8 5 L c 4 Q F X h H Z k D b 6 j q s M Z U V x R q h T 9 M 6 v n x 5 W 9 L q b L 8 0 c s W i H 1 Y R e 9 o 8 k Q F 5 N G 0 1 s V K + K T g X a l 6 O T r Y t G M h y X T T G 
k X P U D z Y s N r W I d V t C S f R 6 2 K u C H H p b q N d i S f i n s V o 5 G b 8 H x S t 8 y e q t u e K V V I h R Z l F z C 2 W N H d G F X o k M 9 O w + c M T W 8 V 8 = " > A A A I x X i c p V X b T t t A E J 1 A L 2 7 o B d r H v q w a R Q I J 0 h g h F b U v S H 1 o K / W B V t w k Q p H t r I O V t d e 1 1 0 A U W X 3 p r / S j + g f t X 3 R m 7 J B A w A H V F m Q 8 P u f M z u z M 2 o 1 V k J p 2 + 3 d t b v 7 e / Q c P r U f 1 h c d P n j 5 b X H q + l + o s 8 e S u p 5 V O D l w n l S q I 5 K 4 J j J I H c S K d 0 F V y 3 + 2 / p / f 7 p z J J A x 3 t m E E s j 0 K n F w V + 4 D k G X c e L v 5 q i o x D e d Y 7 X O 2 K 5 Y + S 5 i X Q S O m r o a e k j M J C R E d o X + a f l j h s O T / J V M V h Z E f U R z b 4 1 j X 7 P c + I 2 h f 1 t O E l K T Z 5 3 x K R H a c 9 R I t T d T M n 8 e L H R b r X 5 E t O G X R o N K K 9 t v T T / E z r Q B Q 0 e Z B C C h A g M 2 g o c S P E + B B v a E K P v C I b o S 9 A K + L 2 E H O r I z R A l E e G g t 4 / / e / h 0 W H o j f C b N l N k e R l H 4 l y B T Q B M 5 G n E J 2 h R N 8 P u M l c l 7 k / a Q N W l t A / x 1 S 6 0 Q v Q Z O M c V d 1 L i u m q B M i s 9 j h X f g K n 3 l / i z i k a r g D I t S s 2 i G N d 5 / 7 w k X E d W u P c T U a 8 5 L c 4 Q F X h H Z k D b 6 j q s M Z U V x R q h T 9 M 6 v n x 5 W 9 L q b L 8 0 c s W i H 1 Y R e 9 o 8 k Q F 5 N G 0 1 s V K + K T g X a l 6 O T r Y t G M h y X T T G k X P U D z Y s N r W I d V t C S f R 6 2 K u C H H p b q N d i S f i n s V o 5 G b 8 H x S t 8 y e q t u e K V V I h R Z l F z C 2 W N H d G F X o k M 9 O w + c M T W 8 V d p w 7 r e G 8 E p u i 3 8 V 9 C f A 5 4 6 m r Q p + V y g V + d u X u k l f x / Z i t O f V B 7 F 3 l E R h F H g G X Y e z v z o K w V 3 v c L U j F j t G n p l E Z 7 G n + o G W I c I i m R i h Q 1 F + W u z 4 c f + 4 X B b n S 0 u i u S A G R P f W R P o 9 K y 3 b / d Y f J e W m L D t i 1 K N 0 4 C k R 6 2 6 h Z H k 4 O 9 9 u t f k S 4 4 Z b G f N Q X Z t 6 b v o n d K A L G g I o I A Y J C R i 0 F X i Q 4 7 0 P L r Q h R d 
8 B 9 N G X o R X x e w k l N J F b I E o i w k N v D / 8 f 4 d N + 5 U 3 w m T R z Z g c Y R e F f h k w B C 8 j R i M v Q p m i C 3 x e s T N 6 b t P u s S W s 7 x 1 + / 0 o r R a + A Y v Z N 4 A + R t e Z S L g R D W O Y c I c 0 r Z Q 9 k F l U r B V a G V i 5 G s D C q k 6 C O 7 i + 8 z t A N m D u o s m J N z 7 l R b j 9 / / Y S R 5 6 T m o s A X 8 r c 3 O R 1 X a k e Z / Z U I x F e / S K d o W v 4 I M M Y K / W X 8 L u 4 W 0 Q + Q n l 5 T t v c C a O d d p H X V 9 V K U q E E 9 c s O q y P M Z d t z W l G o U j 2 s M I Q 4 z i V f d q F Q v G X d W 5 r J i j T o x M u 8 P b 8 B U + 8 / 7 a O K R q u A M S 1 K z b I Y 1 3 j / v C R 8 R 1 a 0 9 x N R r z k t z h E V e E d m Q F v q O q x x l R X F G V B 7 F 3 l E R h F H g G X Y e z v z o K w V 3 v c L U j F j t G n p l E Z 7 G n + o G W I c I i m R i h Q 1 F + W u z 4 c f + 4 X B b n S 0 u i u S A G R P f W R P o 9 K y 3 b / d Y f J e W m L D t i 1 K N 0 4 C k R 6 2 6 h Z H k 4 O 9 9 u t f k S 4 4 Z b G f N Q X Z t 6 b v o n d K A L G g I o I A Y J C R i 0 F X i Q 4 7 0 P L r Q h R d 8 B 9 N G X o R X x e w k l N J F b I E o i w k N v D / 8 f 4 d N + 5 U 3 w m T R z Z g c Y R e F f h k w B C 8 j R i M v Q p m i C 3 x e s T N 6 b t P u s S W s 7 x 1 + / 0 o r R a + A Y v Z N 4 A + R t e Z S L g R D W O Y c I c 0 r Z Q 9 k F l U r B V a G V i 5 G s D C q k 6 C O 7 i + 8 z t A N m D u o s m J N z 7 l R b j 9 / / Y S R 5 6 T m o s A X 8 r c 3 O R 1 X a k e Z / Z U I x F e / S K d o W v 4 I M M Y K / W X 8 L u 4 W 0 Q + Q n l 5 T t v c C a O d d p H X V 9 V K U q E E 9 c s O q y P M Z d t z W l G o U j 2 s M I Q 4 z i V f d q F Q v G X d W 5 r J i j T o x M u 8 P b 8 B U + 8 / 7 a O K R q u A M S 1 K z b I Y 1 3 j / v C R 8 R 1 a 0 9 x N R r z k t z h E V e E d m Q F v q O q x x l R X F G V B 7 F 3 l E R h F H g G X Y e z v z o K w V 3 v c L U j F j t G n p l E Z 7 G n + o G W I c I i m R i h Q 1 F + W u z 4 c f + 4 X B b n S 0 u i u S A G R P f W R P o 9 K y 3 b / d Y f J e W m L D t i 1 K N 0 4 C k R 6 2 6 h Z H 
k 4 O 9 9 u t f k S 4 4 Z b G f N Q X Z t 6 b v o n d K A L G g I o I A Y J C R i 0 F X i Q 4 7 0 P L r Q h R d 8 B 9 N G X o R X x e w k l N J F b I E o i w k N v D / 8 f 4 d N + 5 U 3 w m T R z Z g c Y R e F f h k w B C 8 j R i M v Q p m i C 3 x e s T N 6 b t P u s S W s 7 x 1 + / 0 o r R a + A Y v Z N 4 A + R t e Z S L g R D W O Y c I c 0 r Z Q 9 k F l U r B V a G V i 5 G s D C q k 6 C O 7 i + 8 z t A N m D u o s m J N z 7 l R b j 9 / / Y S R 5 6 T m o s A X 8 r c 3 O R 1 X a k e Z / Z U I x F e / S K d o W v 4 I M M Y K / W X 8 L u 4 W 0 Q + Q n l 5 T t v c C a O d d p H X V 9 V K U q E E 9 c s O q y P M Z d t z W l G o U j 2 s M I Q 4 z i V f d q F Q v G X d W 5 r J i j T o x M u 8 P b 8 B U + 8 / 7 a O K R q u A M S 1 K z b I Y 1 3 j / v C R 8 R 1 a 0 9 x N R r z k t z h E V e E d m Q F v q O q x x l R X F G f c L 2 v z G 2 v V V 9 S B l / A K F r F n 3 8 A G f I R N P C G C x k z D b b x t v H M + O L F j n B M L n W p U n B d w 6 X J + / A N D 0 c Y y < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " G H V + 5 k m w k v R I I + 0 w m n a R S G g o H P o = " > A A A I x X i c p V X b T t t A E J 1 A L 2 7 o B d r H v q y K k E C C N E Z I R e 0 L U h / a S n 2 g F T e J U G Q 7 a 7 C y 9 r r 2 m o s i q y / 9 l X 5 U / 6 D 9 i 8 7 M O i Q Q c E C 1 B R m P z z m z M z u z 9 l M V 5 a b d / t 2 Y m r 5 3 / 8 F D 5 1 F z 5 v G T p 8 9 m 5 5 7 v 5 L r I A r k d a K W z P d / L p Y o S u W 0 i o + R e m k k v 9 p X c 9 X v v 6 f 3 u i c z y S C d b 5 j y V B 7 F 3 l E R h F H g G X Y e z v z o K w V 3 v c L U j F j t G n p l E Z 7 G n + o G W I c I i m R i h Q 1 F + W u z 4 c f + 4 X B b n S 0 u i u S A G R P f W R P o 9 K y 3 b / d Y f J e W m L D t i 1 K N 0 4 C k R 6 2 6 h Z H k 4 O 9 9 u t f k S 4 4 Z b G f N Q X Z t 6 b v o n d K A L G g I o I A Y J C R i 0 F X i Q 4 7 0 P L r Q h R d 8 B 9 N G X o R X x e w k l N J F b I E o i w k N v D / 8 f 4 d N + 5 U 3 w m T R z Z g c Y R e F f h k w B C 8 j R i M v Q p m i C 3 x e s T N 6 b t P u s S W s 7 
x 1 + / 0 o r R a + A Y v Z N 4 A + R t e Z S L g R D W O Y c I c 0 r Z Q 9 k F l U r B V a G V i 5 G s D C q k 6 C O 7 i + 8 z t A N m D u o s m J N z 7 l R b j 9 / / Y S R 5 6 T m o s A X 8 r c 3 O R 1 X a k e Z / Z U I x F e / S K d o W v 4 I M M Y K / W X 8 L u 4 W 0 Q + Q n l 5 T t v c C a O d d p H X V 9 V K U q E E 9 c s O q y P M Z d t z W l G o U j 2 s M I Q 4 z i V f d q F Q v G X d W 5 r J i j T o x M u 8 P b 8 B U + 8 / 7 a O K R q u A M S 1 K z b I Y 1 3 j / v C R 8 R 1 a 0 9 x N R r z k t z h E V e E d m Q F v q O q x x l R X F G p 2 P 6 Z 1 P P D y l 4 X 0 + f 5 I x a t k P q w i 9 7 B Z I i L S a P p r Y u V 8 M l A u 2 I 7 + b p Y N O N x x T R j 2 r Y H a F 5 c e A 2 r s I y W 5 P O o V R M 3 5 r h U t 8 G O l G N x r 2 I 0 c j O e T + q W y V N 1 2 z O l D q n Q o u w i x t o V 3 Y 1 R h 4 7 5 7 D R 8 z t D 0 1 m G H u d M a z m q x O f Reducing GPUs memory requirements. Here we split the network into local modules to ensure each module consumes a similar amount of GPU memory during training. Note that this is different from splitting the model into modules with the same number of layers. We denote the results in this setting by InfoPro * , and the trade-off between GPU memory consumption and test errors is presented in Table 3 , where we report the minimally required GPU memory to run the training algorithm. The contrastive and softmax loss are used in InfoPro * on CIFAR-10 and STL-10, respectively. One can observe that our method significantly improves the memory efficiency of CNNs. For instance, on STL-10, InfoPro * (K = 4) outperforms the E2E baseline by 1.05% with 37.9% of the GPUs memory requirements. The computational overhead is presented in both the theoretical results and the practical wall time. Due to implementation issues, we find that the latter is slightly larger than the former for InfoPro * . 
Compared to the gradient checkpointing technique (Chen et al., 2016), our method achieves competitive performance with significantly reduced computation and time cost.

Results on ImageNet are reported in Table 4. The softmax loss is used in InfoPro* since the batch size is relatively small. The proposed method reduces the memory cost by 40% and achieves slightly better performance. Notably, our method enables training these large networks using 16 GB GPUs.

Results of semantic segmentation on Cityscapes are presented in Table 5. We report the mean Intersection over Union (mIoU) of all classes on the validation set. The softmax loss is used in InfoPro*. The details of ψ and w are presented in Appendix E. Our method boosts the performance of the DeepLab-V3 (Chen et al., 2017) network and allows training the model with 50% larger batch sizes (2 → 3 per GPU) under the same memory constraints. This yields more accurate batch normalization statistics, a practical concern for tasks with high-resolution inputs. In addition, InfoPro* enables larger crop sizes (512×1024 → 640×1280) during training without enlarging the GPU memory footprint, which significantly improves the mIoU. Note that this does not increase the training or inference cost.

4.2. HYPER-PARAMETER SENSITIVITY AND ABLATION STUDY

The coefficients λ1 and λ2. To study how λ1 and λ2 affect performance, we vary them for the 1st and 3rd local modules of a ResNet-32 trained using InfoPro (Contrast), K = 4, with the results shown in Figure 4. We find that the earlier module benefits from a small λ2, which propagates more information forward, while a larger λ2 helps the later module boost the final accuracy. This is compatible with previous work showing that removing earlier layers in ResNets has a minimal impact on performance (Veit et al., 2016).

Ablation study. For ablation, we test directly removing the decoder w or replacing the contrastive head φ with the linear classifier used in greedy SL, as shown in Table 6.
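To make the role of the two coefficients concrete, the per-module objective can be sketched as a weighted sum of the reconstruction term and the task term, consistent with the surrogate loss described in the main text. The linear schedule below is purely hypothetical and only illustrates the observation that later modules tolerate a larger λ2:

```python
def local_infopro_loss(recon_loss, task_loss, lam1, lam2):
    # Surrogate per-module objective: lam1 weights the reconstruction term
    # (preserving information about the input x), lam2 weights the
    # cross-entropy/contrastive term (extracting task-relevant information).
    return lam1 * recon_loss + lam2 * task_loss

def lambda_schedule(module_idx, num_modules, lam1=1.0, lam2_max=1.0):
    # Hypothetical schedule: earlier modules use a smaller lam2 so more
    # information propagates forward; later modules use a larger lam2.
    lam2 = lam2_max * (module_idx + 1) / num_modules
    return lam1, lam2
```

The exact per-module values used in the experiments are those reported with Figure 4; the schedule above is illustrative only.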

5. RELATED WORK

Greedy training of deep networks was first proposed to learn unsupervised deep generative models, or to obtain an appropriate initialization for E2E supervised training (Hinton et al., 2006; Bengio et al., 2007). However, later works revealed that this initialization is dispensable once proper network architectures are adopted, e.g., batch normalization layers (Ioffe & Szegedy, 2015), skip connections (He et al., 2016) or dense connections (Huang et al., 2019). Some other works (Kulkarni & Karande, 2017; Malach & Shalev-Shwartz, 2018; Marquez et al., 2018; Huang et al., 2018a) attempt to learn deep models in a layer-wise fashion. For example, BoostResNet (Huang et al., 2018a) trains the residual blocks of a ResNet (He et al., 2016) sequentially with a boosting algorithm. Deep Cascade Learning (Marquez et al., 2018) extends the cascade-correlation algorithm (Fahlman & Lebiere, 1990) to deep learning, aiming to improve training efficiency. However, these approaches mainly focus on theoretical analysis and are usually validated with limited experimental results on small datasets. More recently, several works have pointed out the inefficiencies of back-propagation and revisited this problem (Nøkland & Eidnes, 2019; Belilovsky et al., 2019; 2020). These works adopt a local learning setting similar to ours, but they mostly optimize local modules with a greedy short-term objective and hence suffer from the information collapse issue we discuss in this paper. In contrast, our method trains local modules by minimizing the non-greedy InfoPro loss. Alternatives to back-propagation have been widely studied in recent years. Some biologically motivated algorithms, including target propagation (Lee et al., 2015b; Bartunov et al., 2018) and feedback alignment (Lillicrap et al., 2014; Nøkland, 2016), avoid back-propagation by directly propagating optimal activations or error signals backward with auxiliary networks.
Decoupled Neural Interfaces (DNI) (Jaderberg et al., 2017) learn auxiliary networks to produce synthetic gradients. In addition, optimization methods like the Alternating Direction Method of Multipliers (ADMM) split the end-to-end optimization into sub-problems using auxiliary variables (Taylor et al., 2016; Choromanska et al., 2018). Decoupled Parallel Back-propagation (Huo et al., 2018b) and Features Replay (Huo et al., 2018a) update parameters with previous gradients instead of current ones and prove convergence theoretically, enabling network modules to be trained in parallel. Nevertheless, these methods are fundamentally different from ours, as they train local modules by explicitly or implicitly optimizing the global objective, whereas we only optimize local objectives. Information-theoretic analysis in deep learning has received increasing attention in the past few years. Shwartz-Ziv & Tishby (2017) and Saxe et al. (2019) study the information bottleneck (IB) principle (Tishby et al., 2000) to explain the training dynamics of deep networks. Achille & Soatto (2018) decompose the cross-entropy loss and propose a novel IB for weights. There are also efforts toward efficient training with IB (Alemi et al., 2016). In the context of unsupervised learning, a number of methods have been proposed based on mutual information maximization (Oord et al., 2018; Tian et al., 2020; Hjelm et al., 2019). SimCLR (Chen et al., 2020) and MoCo (He et al., 2020) maximize the mutual information between different views of the same input with a contrastive loss. This paper likewise analyzes the drawbacks of greedy local supervision and proposes the InfoPro loss from an information-theoretic perspective. In addition, our method can also be implemented as the combination of a contrastive term and a reconstruction loss.

6. CONCLUSION

B PROOF OF PROPOSITION 1

By the definition of nuisance, we note that r* and y are mutually independent, and thus we obtain

I(h, y|r*) = H(y|r*) − H(y|h, r*) = H(y) − H(y|h, r*) ≥ H(y) − H(y|h) = I(h, y). (10)

Combining Eqs. (9) and (10), we have

I(h, r*) ≤ I(h, x) − I(h, y). (11)

Finally, Proposition 1 is proved by combining Eq. (7) and Inequality (11).

C PROOF OF PROPOSITION 2

We first introduce a lemma proven by Achille & Soatto (2018).

Lemma 1. Given a joint distribution p(x, y), where y is a discrete random variable, we can always find a random variable r independent of y such that x = f(y, r), for some deterministic function f.

Proposition 2. Given that r* = argmax_{r: I(r,x)>0, I(r,y)=0} I(r, h) and that y is a deterministic function of x, the gap Δ = L̂_InfoPro − L_InfoPro between the surrogate loss and the InfoPro loss is upper bounded by

Δ ≤ λ2 [I(x, y) − I(h, y)]. (12)

Proof. Let r be the random variable given by Lemma 1. Then, since there exists a deterministic function mapping x to y, we have

I(h, x) = I(h, (y, r)) = I(h, r) + I(h, y|r).

Since y is a deterministic function of x, we obtain H(y|x) = 0, and therefore

ζ ≤ I(x, y) − I(h, y). (16)

Given that Δ = L̂_InfoPro − L_InfoPro = λ2 ζ, we have Δ ≤ λ2 [I(x, y) − I(h, y)], which proves Proposition 2.
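The deterministic-label assumption enters the bound through a standard identity; a brief LaTeX restatement (only the usual definitions of entropy and mutual information are assumed):

```latex
% y being a deterministic function of x gives H(y|x) = 0, so the
% task-relevant information in x equals the label entropy:
\begin{aligned}
I(x, y) = H(y) - H(y|x) = H(y).
\end{aligned}
```

This is why the right-hand side of Inequality (12) can be read as the amount of label information not yet captured by h.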

D WHY DOES MINIMIZING THE CONTRASTIVE LOSS MAXIMIZE A LOWER BOUND ON TASK-RELEVANT INFORMATION?

In this section, we show that minimizing the proposed contrastive loss, namely

L_contrast = −(1 / Σ_{i≠j} 1[y_i = y_j]) Σ_{i≠j} 1[y_i = y_j] log[ exp(z_i·z_j/τ) / Σ_{k=1}^{N} 1[i≠k] exp(z_i·z_k/τ) ],  z_i = f_φ(h_i),

actually maximizes a lower bound on the task-relevant information I(h, y). We start by considering a simplified but equivalent situation. Suppose that we have a query sample z⁺, together with a set X = {z_1, ..., z_N} of N samples containing exactly one positive sample z_p from the same class as z⁺, while the remaining negative samples are randomly sampled, namely X = {z_p} ∪ X_neg. Then the expectation of L_contrast can be written as

E[L_contrast] = E_{z⁺,X}[ −log( exp(z⁺·z_p/τ) / Σ_{i=1}^{N} exp(z⁺·z_i/τ) ) ]. (19)

Eq. (19) can be viewed as a categorical cross-entropy loss for recognizing the positive sample z_p correctly. Hence, we define the optimal probability for this classification problem as P_pos(z_i|X), which denotes the true probability of z_i being the positive sample. Assuming that the label of z_p is y, the positive and negative samples can be viewed as being drawn from the true distributions p(z|y) and p(z), respectively. As a consequence, P_pos(z_i|X) can be derived as

P_pos(z_i|X) = [ p(z_i|y) Π_{l≠i} p(z_l) ] / [ Σ_{j=1}^{N} p(z_j|y) Π_{l≠j} p(z_l) ] = [ p(z_i|y)/p(z_i) ] / [ Σ_{j=1}^{N} p(z_j|y)/p(z_j) ],

which indicates that an optimal value for exp(z⁺·z_p/τ) is p(z_p|y)/p(z_p). Therefore, assuming that z⁺ is uniformly sampled from all classes, we have

E[L_contrast] ≥ E[L_contrast^optimal]
 = E_{y,X}[ −log( (p(z_p|y)/p(z_p)) / Σ_{j=1}^{N} (p(z_j|y)/p(z_j)) ) ] (21)
 = E_{y,X}[ −log( (p(z_p|y)/p(z_p)) / ( p(z_p|y)/p(z_p) + Σ_{z_j∈X_neg} p(z_j|y)/p(z_j) ) ) ] (22)
 = E_{y,X}[ log( 1 + (p(z_p)/p(z_p|y)) Σ_{z_j∈X_neg} p(z_j|y)/p(z_j) ) ] (23)
 ≥ −I(z_p, y) + log(N − 1) (24)
 ≥ −I(h, y) + log(N − 1). (27)

In the above, Inequality (24) follows from Oord et al. (2018), and it quickly becomes more accurate as N increases. Inequality (27) follows from the data processing inequality (Shwartz-Ziv & Tishby, 2017), since z_p is a deterministic function of h.
Finally, we have E[L_contrast] ≥ log(N − 1) − I(h, y), and thus minimizing L_contrast under the stochastic gradient descent framework maximizes a lower bound on I(h, y).
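As a concrete illustration, the loss defined above fits in a few lines of code. This is a minimal pure-Python sketch of the stated formula (with the 1[i≠k] indicator realized by skipping k = i in the denominator), not the authors' released implementation:

```python
import math

def contrastive_loss(z, y, tau=0.5):
    """Sketch of L_contrast: z is a list of projected feature vectors
    z_i = f_phi(h_i) (assumed normalized), y the corresponding labels."""
    n = len(z)
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    total, num_pos = 0.0, 0
    for i in range(n):
        # Denominator: sum over k != i of exp(z_i . z_k / tau).
        denom = sum(math.exp(dot(z[i], z[k]) / tau) for k in range(n) if k != i)
        for j in range(n):
            if i != j and y[i] == y[j]:  # indicator 1[y_i = y_j], i != j
                total += -math.log(math.exp(dot(z[i], z[j]) / tau) / denom)
                num_pos += 1
    return total / num_pos
```

Features that cluster by class yield a smaller loss than features where same-class pairs are far apart, which is exactly the behavior the bound above predicts.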

E ARCHITECTURE OF AUXILIARY NETWORKS

Here, we introduce the network architectures of w, ψ and φ used in our experiments. Note that w is a decoder that reconstructs the input images from deep features, while ψ and φ share the same architecture except for the last layer. The architectures used on CIFAR, SVHN and STL-10 are shown in Table 7 and Table 8. The architectures used on ImageNet are shown in Table 9 and Table 10. The architecture of ψ for the semantic segmentation experiments on Cityscapes is shown in Table 11, where we use the same decoder w as on ImageNet (except for the size of the feature maps). An empirical study on the size and architecture of the auxiliary networks is presented in Appendix H.

F DETAILS OF EXPERIMENTS

Datasets. (1) The CIFAR-10 ( Krizhevsky et al., 2009) dataset consists of 60,000 32x32 colored images of 10 classes, 50,000 for training and 10,000 for test. We normalize the images with channel Following Chen et al. (2017) , we conduct our experiments on the finely annotated dataset and report the performance on the validation set. The training images are augmented by randomly scaling (from 0.5 to 2.0) followed by randomly cropping high-resolution patches (512×1024 or 640×1280). At test time, we simply feed the whole 1024×2048 images into the model. Networks and training hyper-parameters. Our experiments on CIAFR-10, SVHN and STL-10 are based on three popular networks, namely ResNet-32/110 (He et al., 2016) and DenseNet-BC-100-12 (Huang et al., 2019) . The networks are trained using a SGD optimizer with a Nesterov momentum of 0.9 for 160 epochs. The L2 weight decay ratio is set to 1e-4. For ResNets, the batch size is set to 1024 and 128 for CIFAR-10/SVHN and STL-10, associated with an initial learning rate of 0.8 and 0.1, respectively. For DenseNets, we use a batch size of 256 and an initial learning rate of 0.2. The cosine learning rate annealing is adopted. Note that, the results of greedy supervised learning presented in In the cases where the number of basic layers is not divisible by K, we assign one less basic layer to earlier modules. For example, if ResNet-110 is split into K = 4 modules, the corresponding numbers of basic layers will be {13, 14, 14, 14}. If ResNet-110 is split into K = 16 modules, the corresponding numbers of basic layers will be {3} × 9 modules + {4} × 7 modules. For DenseNet-BC, similarly, we view each dense layer (the composite function BN-ReLU-1×1Conv-BN-ReLU-3×3Conv) as a basic layer following their paper (Huang et al., 2019) . The first convolutional layer and the transition layers are viewed as individual basic layers. The splitting criteria is the same as ResNets. 
Notably, for InfoPro*, the networks are divided so that each local module has the same memory consumption during training; hence the aforementioned splitting criterion based on equal numbers of basic layers is not applicable, as discussed in Section 4.1.

G DETAILS OF MUTUAL INFORMATION ESTIMATION

In this section, we describe how we obtain the estimates of I(h, x) and I(h, y) presented in Figures 2, 3 and 6. As discussed in Section 3.3, the expected reconstruction error R(x|h) satisfies I(h, x) = H(x) − H(x|h) ≥ H(x) − R(x|h) (Vincent et al., 2008; Rifai et al., 2012; Kingma & Welling, 2013; Makhzani et al., 2015; Hjelm et al., 2019). Therefore, similar to Section 3.3, we estimate I(h, x) by training a decoder parameterized by w to minimize the reconstruction loss, namely I(h, x) ≈ max_w [H(x) − R_w(x|h)]. Note that, ideally, this bound can be arbitrarily tight provided that w has sufficient capacity. Specifically, we use the same network architecture as in Table 7, and train it for 10 epochs to minimize the average per-pixel binary cross-entropy reconstruction loss. An Adam (Kingma & Ba, 2014) optimizer with default hyper-parameters (lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight decay=0) is adopted. Naive as this procedure might seem, we find that it suffices to reconstruct the input images well given enough information, and to distinguish different values of I(h, x) via the quality of the reconstructed images, as shown in Figure 7. Moreover, we are primarily concerned with comparing I(h, x) between end-to-end training and various cases of greedy supervised learning, rather than obtaining the exact values of I(h, x). The same training process is applied to all intermediate features h, so the comparisons are fair. Finally, since H(x) is a constant, for ease of understanding we simply present 1 − AverageBinaryCrossEntropyLoss(x|h) as the estimate of I(h, x), which is equivalent to adding the constant 1 − H(x) to the real estimate of I(h, x).
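As an illustration of this estimation principle, the toy sketch below trains a tiny logistic decoder by gradient descent to minimize the average binary cross-entropy R_w(x|h) on synthetic binary "images", and reports H(x) − R_w(x|h) as a lower-bound estimate of I(h, x). The linear decoder and random data are illustrative stand-ins for the convolutional decoder w and real features used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, dx, dh = 512, 16, 8

x = (rng.random((N, dx)) < 0.5).astype(float)  # binary "pixels": H(x) = log 2 per pixel
A = rng.normal(size=(dx, dh))
h = x @ A                                      # fixed "features" computed from x

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = np.zeros((dh, dx))                         # decoder parameters w
for _ in range(500):                           # gradient descent on the BCE
    p = sigmoid(h @ W)
    W -= 0.01 * h.T @ (p - x) / N

p = np.clip(sigmoid(h @ W), 1e-7, 1 - 1e-7)
R = float(-np.mean(x * np.log(p) + (1 - x) * np.log(1 - p)))  # R_w(x|h), nats/pixel
H_x = np.log(2.0)
print("estimated lower bound on I(h, x): %.3f nats/pixel" % (H_x - R))
```

Since the untrained decoder starts at exactly H(x) = log 2 nats per pixel, any positive value of H(x) − R_w(x|h) reflects information about x that the decoder recovers from h.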
For I(h, y), as also discussed in Section 3.3, since I(h, y) = H(y) − H(y|h) = H(y) − E_{(h,y)}[−log p(y|h)], we train an auxiliary classifier q_ψ(y|h) with parameters ψ to approximate p(y|h), such that we have I(h, y) ≈ max_ψ {H(y) − (1/N) Σ_{i=1}^{N} [−log q_ψ(y_i|h_i)]}. Here we simply adopt the test accuracy of q_ψ(y|h) as the estimate of I(h, y), which is highly correlated with the value of (1/N) Σ_{i=1}^{N} [−log q_ψ(y_i|h_i)] (i.e., the cross-entropy loss). This can be viewed as the highest generalization performance that a classifier based on h is able to reach. Notably, we use a ResNet-32 as q_ψ. For the inputs of q_ψ, we up-sample h to 32×32 and map h to 16 channels at the first layer. All training hyper-parameters of the ResNet-32 are the same as in Appendix F.
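To make the bound concrete, the toy computation below evaluates H(y) − (1/N) Σ_i [−log q(y_i|h_i)] for two extreme hypothetical classifiers: a confident, always-correct one (the bound reaches H(y), i.e., h determines y) and a uniform one (the bound is 0, i.e., h is uninformative). Values are in nats; the classifiers are synthetic stand-ins for the trained q_ψ.

```python
import numpy as np

rng = np.random.default_rng(0)
C, N = 10, 1000
y = rng.integers(0, C, size=N)                 # labels, drawn uniformly

def bound(q):
    # I(h, y) >= H(y) - (1/N) * sum_i [-log q(y_i | h_i)], in nats
    H_y = np.log(C)                            # entropy of uniform labels
    ce = float(-np.log(q[np.arange(N), y]).mean())
    return H_y - ce

q_perfect = np.full((N, C), 1e-9)              # confident, always-correct classifier
q_perfect[np.arange(N), y] = 1.0               # (rows not exactly normalized; fine for illustration)
q_uniform = np.full((N, C), 1.0 / C)           # uninformative classifier

print(bound(q_perfect))   # ~ log 10 = H(y): h tells everything about y
print(bound(q_uniform))   # ~ 0: h tells nothing about y
```

Any real classifier falls between these extremes, which is why a better-fitting q_ψ yields a larger (tighter) estimate of I(h, y).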

H MORE RESULTS

Size and architecture of auxiliary nets. Here we investigate how the auxiliary nets (i.e., w, ψ and φ) influence the performance of our method. Since ψ and φ share the same architecture, we take the decoder w and the projection head φ as examples. We start by scaling their width by a certain factor (which equals 1 for our original design), and show the results in Figure 8. It can be observed that using larger w and φ both improves the performance, but the effects are modest. In addition, their sizes can be shrunk by up to 2× without severely degrading the accuracy. We also test other architectures for φ in Table 12, where "Conv" and "Linear" refer to convolutional and linear layers, respectively. We find that including at least one conv layer and using an MLP instead of a simple linear layer are both important design choices. Besides, although adding more conv layers further boosts the performance, this comes at considerably increased computational overhead.

Temperature τ. In Figure 9, we vary the temperature τ for InfoPro (Contrast). Our method is robust to τ when τ ≤ 0.1; however, we find that training tends to be unstable when τ < 0.005.

Batch size. For a comprehensive comparison of InfoPro (Softmax) and InfoPro (Contrast), we vary the batch size of the SGD optimizer, and present the results in Table 13. It is shown that training models with small mini-batches for sufficiently many epochs is beneficial for InfoPro (Softmax), but produces limited positive effects for InfoPro (Contrast). We also note that using small mini-batches usually prolongs the practical training time, as they cannot make full use of even a single GPU (Nvidia Titan Xp). Besides, when considering a short schedule with the same number of updates (iterations), adopting small mini-batches significantly hurts the performance of InfoPro (Contrast). This observation is consistent with previous works (Chen et al., 2020; He et al., 2020; Khosla et al., 2020).
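The role of the temperature τ can be illustrated with a generic NT-Xent-style contrastive loss (Chen et al., 2020); the function below is a minimal NumPy sketch of such a loss, not the paper's released implementation of the contrastive term in L̂InfoPro.

```python
import numpy as np

def nt_xent(z, tau=0.1):
    """Generic NT-Xent loss on features z of shape (2N, d).

    Rows 2k and 2k+1 are treated as a positive pair; a smaller
    temperature tau sharpens the softmax over similarities.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau                       # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)            # exclude self-similarity
    pos = np.arange(len(z)) ^ 1               # partner index: 0<->1, 2<->3, ...
    logits = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(z)), pos].mean())
```

For well-aligned positive pairs the loss approaches zero, while for random features it is roughly log(2N − 1); very small τ amplifies any spurious similarity, which is consistent with the instability observed for τ < 0.005.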
Empirical study on the tightness of the upper bound. Here, we empirically study the gap between the upper bound L̂InfoPro and LInfoPro. Since the gap Δ = L̂InfoPro − LInfoPro is upper bounded by Δ ≤ λ2 [I(x, y) − I(h, y)], we simply need to check the gap between I(x, y) and I(h, y). To this end, we train a ResNet-110 on CIFAR-10 using InfoPro (Contrast) with K = 2, and estimate the mutual information I(h, y) between the outputs of the first local module and the label. In addition, to obtain a comparable estimate of I(x, y), we assume that in end-to-end learned networks the intermediate features retain all the task-relevant information of the original inputs (which is empirically observed in this paper). Hence, we can estimate I(x, y) by training the network in an end-to-end fashion and estimating the mutual information I(h, y) at the same position, i.e., the end of the first module. The comparison between the estimates of I(x, y) and I(h, y) is presented in Figure 10. One can observe that the gap shrinks gradually during the training process.

Results with VGG are presented in Table 14. Observations similar to Table 2 can be made: InfoPro achieves competitive performance with E2E training, while outperforming DGL.

Transferability. To verify the transferability of models trained by the proposed method, we initialize the backbone of Faster-RCNN (Ren et al., 2015) with a ResNet-101 trained on ImageNet by either E2E training or InfoPro* (K = 2), and train it on the MS COCO (Lin et al., 2014) object detection task. The training process adopts the default configuration of MMDetection (Chen et al., 2019) with feature pyramid networks (FPN) (Lin et al., 2017). The results are reported in Table 15. It can be observed that the locally learned backbone using InfoPro* outperforms its E2E counterpart in terms of average precision (AP). This result is consistent with the accuracy on ImageNet.

I ADDITIONAL DISCUSSIONS AND FUTURE WORK

Applying InfoPro to regression tasks. Although this paper mainly focuses on implementing InfoPro in the context of classification-based tasks (i.e., image classification and semantic segmentation), the formulations of LInfoPro and L̂InfoPro are general and flexible. As long as the mutual information I(h, x) and I(h, y) can be estimated, InfoPro can be applied to more tasks. In vision tasks, I(h, x) can usually be estimated with a decoder, as introduced in Section 3.3, while the technique we discussed for estimating I(h, y) may easily be extended to regression tasks (e.g., depth estimation, bounding box regression in object detection). For example, consider a target value y_i ∈ [0, 1] corresponding to the sample x_i and the hidden representation h_i. It can be viewed as a Bernoulli distribution with P(y = 1) = y_i, such that we have I(h, y) ≈ max_ψ {H(y) − (1/N) Σ_{i=1}^{N} [−y_i log q_ψ(y = 1|h_i) − (1 − y_i) log q_ψ(y = 0|h_i)]}. As a result, the auxiliary network q_ψ(y|h) can be trained with the binary cross-entropy loss. This might also be approximated by the mean-square loss, which has the same minima as the binary cross-entropy loss. In the future, we will focus on applying InfoPro to more complex tasks, such as 2D/3D detection, instance segmentation and video recognition.
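The claim that the mean-square loss shares its minimizer with the binary cross-entropy for soft targets y_i ∈ [0, 1] can be checked numerically; the grid search below is purely illustrative.

```python
import numpy as np

q = np.linspace(1e-4, 1 - 1e-4, 9999)  # candidate predictions q(y=1|h)

def bce(y, q):
    # binary cross-entropy with a soft target y in [0, 1]
    return -(y * np.log(q) + (1 - y) * np.log(1 - q))

for y in (0.2, 0.5, 0.9):
    q_bce = q[np.argmin(bce(y, q))]      # minimizer of the BCE
    q_mse = q[np.argmin((y - q) ** 2)]   # minimizer of the squared error
    print(y, round(q_bce, 3), round(q_mse, 3))  # both minimizers equal y
```

Setting the derivative of the BCE to zero gives −y/q + (1 − y)/(1 − q) = 0, i.e., q = y, the same stationary point as for the squared error, which justifies the approximation mentioned above.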



Figure 1: (a) and (b) illustrate the paradigms of end-to-end (E2E) learning and locally supervised learning (K = 2). "End-to-end Loss" refers to the standard loss function used by E2E training, e.g., softmax cross-entropy loss for classification, while L denotes the loss function used to train local modules. (c) compares the three training approaches in terms of the information captured by features. Greedy supervised learning (greedy SL) tends to collapse some of the task-relevant information in the first module, leading to inferior final performance. The proposed information propagation (InfoPro) loss alleviates this problem by encouraging local modules to propagate forward all the information from the inputs, while maximally discarding task-irrelevant information.

Figure 3: Comparisons of InfoPro and state-of-the-art local learning methods in terms of the test errors at the final layer (left) and the task-relevant information captured by intermediate features, I(h, y) (right). Results of ResNet-32 on CIFAR-10 are reported. We use the contrastive loss in L̂InfoPro.

Figure 4: Sensitivity tests. The CIFAR-10 test errors of ResNet-32 trained using InfoPro (K = 4) are reported. We vary λ1 and λ2 for the 1st and 3rd local modules respectively, with all other modules unchanged. We do not consider λ1 = λ2 = 0, where we obviously have L̂InfoPro = LInfoPro ≡ 0.

Besides, we note that asynchronous training can easily be extended to training different local modules in parallel by dynamically caching the outputs of earlier modules. To this end, we preliminarily test training two local modules in parallel on 2 GPUs with K = 2, using the same experimental protocol as Huo et al. (2018b) (training ResNet-110 with a batch size of 128 on CIFAR-10) and their public code. Our method achieves a 1.5× speedup over the standard parallel paradigm of E2E training (the DataParallel toolkit in PyTorch). Note that parallel training has the same performance as simultaneous training (i.e., "InfoPro" in Table 2), since the two training processes are identical apart from the parallelism.

Let ζ = [I(h, x) − I(h, y)] − I(h, r*), where r* = argmax_{r: I(r,x)>0, I(r,y)=0} I(r, h). Since I(h, r*) ≥ I(h, r) for any such r, we obtain

ζ ≤ [I(h, x) − I(h, y)] − I(h, r)
  = I(h, r) + I(h, y|r) − I(h, y) − I(h, r)
  = I(h, y|r) − I(h, y)
  = I(h, y|r) − I(x, y) + I(x, y) − I(h, y)
  = H(y|r) − H(y|r, h) − H(y) + H(y|x) + I(x, y) − I(h, y).    (14)

Since y and r are mutually independent, namely H(y|r) = H(y), we have

ζ ≤ H(y) − H(y|r, h) − H(y) + H(y|x) + I(x, y) − I(h, y)
  = H(y|x) − H(y|r, h) + I(x, y) − I(h, y)
  ≤ H(y|x) + I(x, y) − I(h, y).

Figure 7: Visualization of the reconstruction results obtained from the decoder w.

Figure 9: Performance of InfoPro (Contrast) with varying temperature τ . Test errors of ResNet-110 on CIFAR-10.

Figure 10: The estimates of I(x, y) and I(h, y).




Results of semantic segmentation on Cityscapes. 2 Nvidia GeForce RTX 3090 GPUs are used for training. 'SS' refers to single-scale inference; 'MS' and 'Flip' denote employing the average prediction over multi-scale and flipped inputs.

Ablation studies. Test errors of ResNet-32 on CIFAR-10 are reported.

This work studied locally supervised deep learning from an information-theoretic perspective. We demonstrated that training local modules greedily collapses task-relevant information at earlier layers, degrading the final performance. To address this issue, we proposed an information propagation (InfoPro) loss that encourages local modules to preserve more information about the input, while progressively discarding task-irrelevant information. Extensive experiments validated that InfoPro significantly reduces the GPU memory footprint during training without sacrificing accuracy, and enables model parallelization in an asynchronous fashion. InfoPro may open new avenues for developing more efficient and biologically plausible deep learning algorithms.

The results of greedy supervised learning presented in Table 1 follow exactly the same experimental configurations stated here. On ImageNet, we train ResNet-101 and ResNet-152 with a batch size of 1024 and an initial learning rate of 0.4. For ResNeXt-101 (32×8d), we use a batch size of 512 and an initial learning rate of 0.2. The number of training epochs is set to 90. Other hyper-parameters are the same as for CIFAR-10. For the DeepLab-V3 model used in semantic segmentation, we follow the training configurations of MMSegmentation (Contributors, 2020) (with ResNet-101 and synchronized batch normalization), except that we use an initial learning rate of 0.015 when setting the batch size to 12 or using 640×1280 cropped patches.

Local module splitting. Since ResNets consist of a cascade of residual blocks, which naturally should not be further divided into smaller parts, we view each residual block as a minimal indivisible unit, or a basic layer (as distinguished from a single convolutional layer). In particular, the first convolutional layer of the network is individually viewed as a basic layer. As a consequence, ResNet-32 has 16 basic layers and ResNet-110 has 55 basic layers. If ResNet-32 is split into K = 8 local modules, each module has 2 basic layers.

Test errors of ResNet-110 trained by InfoPro (Contrast) on CIFAR-10, with different architectures of φ. "MLP" refers to a multi-layer perceptron.

Performance of InfoPro (Contrast/Softmax) with varying batch sizes. Two settings are considered: training models for the same number of iterations (40/80/160 epochs for batch size=256/512/1024) and for the same number of epochs (160 epochs). All other training hyper-parameters (i.e., the learning rate schedule, weight decay, etc.) remain unchanged. Test errors of ResNet-110 on CIFAR-10 are reported.
InfoPro (Softmax): 8.88 ± 0.36% / 7.70 ± 0.23% / 7.01 ± 0.34% / 6.14 ± 0.11% / 5.95 ± 0.19%
InfoPro (Contrast): 9.96 ± 0.29% / 7.49 ± 0.35% / 6.42 ± 0.08% / 6.16 ± 0.15% / 6.19 ± 0.20%
K = 8:
InfoPro (Softmax): 11.22 ± 0.10% / 9.42 ± 0.05% / 9.40 ± 0.27% / 8.37 ± 0.33% / 8.04 ± 0.29%
InfoPro (Contrast): 13.22 ± 0.57% / 9.96 ± 0.13% / 8.93 ± 0.40% / 9.02 ± 1.18% / 9.32 ± 1.44%

Performance of InfoPro with the VGG network (Simonyan & Zisserman, 2014). The average test errors and standard deviations of 5 independent trials are reported. InfoPro (Softmax/Contrast) refers to the two approaches to estimating I(h, y).

Object detection results on COCO (Lin et al., 2014). We initialize the backbone of Faster-RCNN-FPN (Ren et al., 2015) using ResNet-101 trained on ImageNet by E2E training or InfoPro* (K = 2). The COCO-style box average precision (AP) metric is adopted, where AP50 and AP75 denote AP at the 50% and 75% IoU thresholds, mAP averages AP over thresholds from 50% to 95%, and mAPS, mAPM and mAPL denote AP for objects at different scales. All results are presented in percentages (%). The better results are bold-faced.

ACKNOWLEDGMENTS

This work is supported in part by the National Science and Technology Major Project of the Ministry of Science and Technology of China under Grant 2018AAA0100701, the National Natural Science Foundation of China under Grants 61906106 and 62022048, the Institute for Guo Qiang of Tsinghua University and the Beijing Academy of Artificial Intelligence.

CODE AVAILABILITY

https://github.com/blackfeather-wang/InfoPro-Pytorch

To further validate the information collapse hypothesis, we visualize the "information flow" within deep networks using a toy example. First, we establish an MNIST-STL10 dataset by placing MNIST digits at a certain position (randomly selected from 64 candidates) of a background image from STL-10. Then, three specific tasks can be defined on MNIST-STL10, namely classifying the digit, the background and the position of the digit. We refer to their labels as y1, y2 and y3, respectively, as illustrated in Figure 5. We train ResNet-32 networks for the three tasks with greedy SL (K = 4) and end-to-end training (K = 1). The estimates of the mutual information I(h, y1), I(h, y2) and I(h, y3) are shown in Figure 6, using the same estimation approach as in Figure 2 (details in Appendix G). Note that when one label (take y1 for example) is adopted for training, the information related to the other labels (I(h, y2) and I(h, y3)) is task-irrelevant. From the plots, one can clearly observe that end-to-end training retains all task-relevant information throughout the feed-forward process, while greedy SL usually yields less informative intermediate representations in terms of the task of interest. This phenomenon empirically confirms the proposed information collapse hypothesis.
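The construction of this toy dataset can be sketched as follows. The 64 candidate positions are assumed here to form an 8×8 grid, an illustrative choice: the paper does not specify their exact layout. MNIST digits are 28×28 and STL-10 images are 96×96.

```python
import numpy as np

def compose(digit, background, pos_idx, grid=8):
    """Paste a digit crop onto a background at one of grid*grid positions.

    digit: (d, d) array; background: (H, W) array; pos_idx in [0, grid*grid).
    Cell corners are spread evenly over the image (grid >= 2 assumed).
    """
    H, W = background.shape
    d = digit.shape[0]
    r, c = divmod(pos_idx, grid)
    top = r * (H - d) // (grid - 1)      # top-left corner of the chosen cell
    left = c * (W - d) // (grid - 1)
    out = background.copy()
    out[top:top + d, left:left + d] = np.maximum(
        out[top:top + d, left:left + d], digit)  # keep digit strokes visible
    return out, (r, c)

# a 28x28 "digit" on a 96x96 "background", placed in the bottom-right cell
img, cell = compose(np.ones((28, 28)), np.zeros((96, 96)), 63)
print(cell)  # (7, 7)
```

Each sample then carries three labels at once: the digit class, the background class and the position index, matching y1, y2 and y3 above.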
In addition, we postulate that end-to-end learned early layers avoid collapsing task-relevant information by being allowed to keep a larger amount of task-irrelevant information, which, however, may lead to inferior classification performance of the intermediate features, and thus cannot be achieved by greedy SL.

Figure 6: The estimates of mutual information between the intermediate features h and the three labels of MNIST-STL10 (see Figure 5), i.e., y1 (left, background), y2 (middle, digit) and y3 (right, position of digit). Models are trained using greedy SL (K = 1, 4) supervised by one of the three labels, and the results are shown with respect to layer indices. "K = 1" corresponds to end-to-end training.

B PROOF OF PROPOSITION 1

Proposition 1. Suppose that the Markov chain (y, r) → x → h holds. Then an upper bound of LInfoPro is given by L̂InfoPro.

Proof. Note that, following the definition of LInfoPro, we have I(h, r*) ≤ I(h, x) − I(h, y|r*).

Published as a conference paper at ICLR 2021.

Auxiliary network on ImageNet (cf. Tables 9-10): 1×1 conv., stride=1, padding=0, output channels=128, BatchNorm+ReLU; 3×3 conv., stride=2, padding=1, output channels=256, BatchNorm+ReLU; 3×3 conv., stride=2, padding=1, output channels=512, BatchNorm+ReLU; 1×1 conv., stride=1, padding=0, output channels=2048, BatchNorm+ReLU; global average pooling; fully connected 2048→1000.

Auxiliary network ψ on Cityscapes (Table 11): input 64×128 feature maps with 1024 channels; 3×3 conv., stride=1, padding=1, output channels=512, BatchNorm+ReLU; Dropout, p=0.1; 1×1 conv., stride=1, padding=0, output channels=19.

The CIFAR images are normalized with channel means and standard deviations for pre-processing. Data augmentation is then performed by 4×4 random translation followed by random horizontal flips (He et al., 2016; Huang et al., 2019). (2) SVHN (Netzer et al., 2011). (4) ImageNet is a 1,000-class dataset from ILSVRC2012 (Deng et al., 2009), with 1.2 million images for training and 50,000 images for validation. We adopt the same data augmentation and pre-processing configurations as Huang et al. (2019; 2018b); Wang et al. (2019; 2020b); Yang et al. (2020). (5) The Cityscapes dataset (Cordts et al., 2016) contains 5,000 finely annotated 1024×2048 images (2,975/500/1,525 for training, validation and testing) and 20,000 coarsely annotated images from 50 different cities. Each pixel of the image is categorized among 19 classes.

