BLOCKWISE SELF-SUPERVISED LEARNING WITH BARLOW TWINS

Abstract

Current state-of-the-art deep networks are all powered by backpropagation. In this paper, we explore alternatives to full backpropagation in the form of blockwise learning rules, leveraging the latest developments in self-supervised learning. Notably, we show that a blockwise pretraining procedure consisting of independently training the 4 main blocks of layers of a ResNet-50 with a Barlow Twins loss function at each block performs almost as well as end-to-end backpropagation on ImageNet: a linear probe trained on top of our blockwise pretrained model obtains a top-1 classification accuracy of 70.48%, only 1.1% below the accuracy of an end-to-end pretrained network (71.57%). We perform extensive experiments to understand the impact of the different components of our method and explore a variety of adaptations of self-supervised learning to the blockwise paradigm, building an exhaustive understanding of the critical avenues for scaling local learning rules to large networks, with implications ranging from hardware design to neuroscience.

1. INTRODUCTION

One of the primary ingredients behind the success of deep learning is backpropagation. It remains an open question whether comparable recognition performance can be achieved with local learning rules. Previous attempts in the context of supervised and unsupervised learning have only been successful on small datasets like MNIST (Salakhutdinov and Hinton, 2009; Löwe et al., 2019; Ahmad et al., 2020; Ernoult et al., 2022; Lee et al., 2015) or on large datasets but with small networks like VGG-11 (Belilovsky et al., 2019) (67.6% top-1 accuracy on ImageNet). Being able to train models with local learning rules at scale is useful for a multitude of reasons. By locally training different parts of the network, one could optimize the learning of very large networks while limiting the memory footprint during training. This approach was recently illustrated at scale in the domain of video prediction, using a stack of VAEs trained sequentially in a greedy fashion (Wu et al., 2021). From a neuroscientific standpoint, it is interesting to explore the viability of alternative learning rules to backpropagation, as it is debated whether the brain performs backpropagation (mostly considered implausible) (Lillicrap et al., 2020), approximations of backpropagation (Lillicrap et al., 2016), or instead relies on local learning rules (Halvagal and Zenke, 2022; Clark et al., 2021; Illing et al., 2021). Finally, local learning rules could unlock the possibility of adaptive computation: as each part of the network is trained to solve a subtask in isolation, different parts naturally become tuned to solve different tasks (Yin et al., 2022; Baldock et al., 2021), offering interesting energy and speed trade-offs depending on the complexity of the input. There is some evidence suggesting that the brain also uses computational paths that depend on the complexity of the task (e.g., Shepard and Metzler (1971)).
Self-supervised learning has proven very successful as an approach to pretrain deep networks. A simple linear classifier trained on top of the last layer can yield accuracy close to the best supervised learning methods (Chen et al., 2020a;b; He et al., 2020; Chen et al., 2020c; Caron et al., 2021a;b; Zbontar et al., 2021; Bardes et al., 2022; Dwibedi et al., 2021), and can even exceed the performance of supervised learning when finetuned (He et al., 2016). Self-supervised learning objectives typically have two terms that reflect the desired properties of the learned representations: the first term ensures invariance to distortions that do not affect the label of an image, and the second term ensures that the representation is informative about its input (Zbontar et al., 2021). Such loss functions may better reflect the goal of intermediate layers than supervised learning rules, which only try to extract labels without any consideration of preserving input information for the next layers. In this paper, we revisit the possibility of using local learning rules as a replacement for backpropagation using a recent self-supervised learning method, Barlow Twins. As a natural transition towards purely local learning rules, we resort to blockwise training, where we limit the length of the backpropagation path to single blocks of the network.¹
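The two-term structure of such objectives can be made concrete with the Barlow Twins loss itself: an invariance term pulling the cross-correlation between two distorted views toward the identity, and a redundancy-reduction term suppressing off-diagonal correlations. The following is a minimal NumPy sketch of that loss; the function name and the default trade-off value `lam` are illustrative, not the paper's implementation.

```python
import numpy as np

def barlow_twins_loss(z1, z2, lam=5e-3):
    """Sketch of the Barlow Twins objective.

    z1, z2: (N, D) embeddings of two distorted views of the same N images.
    lam: trade-off between invariance and redundancy-reduction terms
         (illustrative default, not the paper's setting).
    """
    # Normalize each feature dimension across the batch.
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-8)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-8)
    n = len(z1)
    c = z1.T @ z2 / n  # (D, D) cross-correlation matrix between views

    # Invariance term: diagonal of c should be 1 (views agree per feature).
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()
    # Redundancy-reduction term: off-diagonal entries should be 0.
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()
    return on_diag + lam * off_diag
```

If the two views produce identical embeddings, the invariance term vanishes and only residual off-diagonal correlations contribute; for unrelated embeddings, the loss is dominated by the invariance term.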
We applied a self-supervised learning loss locally at different blocks of a ResNet-50 trained on ImageNet and made the following contributions:

• We show that a network trained with a blockwise self-supervised learning rule can reach performance on par with the same network trained end-to-end, as long as the network is not broken into too many blocks.

• We find that training blocks simultaneously (as opposed to sequentially) is essential to achieving this level of performance, suggesting that important learning interactions take place between blocks through the feedforward path.

• We find that methods which increase the downstream performance of lower blocks on the classification task, such as supervised blockwise training, tend to decrease the performance of the overall network, suggesting that building invariant representations prematurely in early blocks may be harmful for learning higher-level features in later blocks.

• We compare different spatial and feature pooling strategies for the intermediate block outputs (which are fed to the intermediate loss functions via projector networks), and find that expanding the feature dimensionality of the block's output is key to successfully training the network.

• We evaluated a variety of strategies to customize the training procedure at the different blocks (e.g., tuning the trade-off parameter of the objective function as a function of the block, feeding different image distortions to different blocks, routing samples to different blocks depending on their difficulty) and found that none of these approaches added a substantial gain in performance. However, despite these negative initial results, we consider these attempts to be promising avenues for future work. We exhaustively describe these negative results in Appendix A.
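The core mechanic of blockwise training, each block updated only from its own local loss while activations still flow forward, can be illustrated with a toy stand-in. The sketch below uses linear "blocks" and a simple local decorrelation loss, L = ||HᵀH/n − I||², in place of the full Barlow Twins objective, purely so that the gradient can be written by hand; all names and the hyperparameters are hypothetical, not the paper's setup. The input to each block is treated as a constant (a stop-gradient), so no error signal ever crosses a block boundary, and all blocks are updated simultaneously at every step rather than trained one after another.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 512, 16
x = rng.normal(size=(n, d))  # stand-in for a batch of inputs

# Four linear "blocks", each with its own local objective.
blocks = [0.1 * rng.normal(size=(d, d)) for _ in range(4)]

def local_step(h_in, w, lr=0.05):
    """One local update: gradient of L = ||H^T H / n - I||_F^2 w.r.t. w only.

    h_in is treated as a constant (stop-gradient): the gradient is never
    propagated back into the block that produced it.
    """
    h = h_in @ w
    c = h.T @ h / len(h)                 # feature covariance of block output
    loss = ((c - np.eye(d)) ** 2).sum()
    grad = (4.0 / len(h)) * h_in.T @ h @ (c - np.eye(d))  # dL/dw, by hand
    return loss, w - lr * grad, h

first_losses, last_losses = [], []
for step in range(300):
    h = x
    per_block = []
    for i, w in enumerate(blocks):
        loss, blocks[i], h = local_step(h, w)  # h passes on "detached"
        per_block.append(loss)
    first_losses.append(per_block[0])
    last_losses.append(per_block[-1])
```

Even in this toy, later blocks only learn well once earlier blocks stabilize their outputs, which mirrors the finding above that the interaction between simultaneously trained blocks happens through the feedforward path alone.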



¹ We consider all layers with the same spatial resolution to belong to the same block.



Figure 1: We rank blockwise/local learning methods from the literature according to their biological plausibility, and indicate their best demonstrated performance on large-scale datasets, i.e., ImageNet (intensity of the blue rectangle below each model family). Our method sits at a unique trade-off between biological plausibility and performance on large-scale datasets, by being (1) on par in performance with intertwined blockwise training (Xiong et al., 2020) and supervised broadcasted learning (Nøkland, 2016; Clark et al., 2021; Belilovsky et al., 2019) while being (2) more biologically plausible than these alternatives. Methods represented in magenta used a similar methodology to ours (Halvagal and Zenke, 2022; Löwe et al., 2019), but only demonstrated performance on small datasets.

