BLOCKWISE SELF-SUPERVISED LEARNING WITH BARLOW TWINS

Abstract

Current state-of-the-art deep networks are all powered by backpropagation. In this paper, we explore alternatives to full backpropagation in the form of blockwise learning rules, leveraging the latest developments in self-supervised learning. Notably, we show that a blockwise pretraining procedure, in which the 4 main blocks of layers of a ResNet-50 are trained independently with the Barlow Twins loss function at each block, performs almost as well as end-to-end backpropagation on ImageNet: a linear probe trained on top of our blockwise pretrained model obtains a top-1 classification accuracy of 70.48%, only 1.1% below the accuracy of an end-to-end pretrained network (71.57%). We perform extensive experiments to understand the impact of the different components of our method and explore a variety of adaptations of self-supervised learning to the blockwise paradigm, building an exhaustive understanding of the critical avenues for scaling local learning rules to large networks, with implications ranging from hardware design to neuroscience.

1. INTRODUCTION

One of the primary ingredients behind the success of deep learning is backpropagation. It remains an open question whether comparable recognition performance can be achieved with local learning rules. Previous attempts in the context of supervised and unsupervised learning have only been successful on small datasets like MNIST (Salakhutdinov and Hinton, 2009; Löwe et al., 2019; Ahmad et al., 2020; Ernoult et al., 2022; Lee et al., 2015) or on large datasets with small networks like VGG-11 (Belilovsky et al., 2019) (67.6% top-1 accuracy on ImageNet). Being able to train models with local learning rules at scale is useful for a multitude of reasons. By locally training different parts of the network, one could optimize the learning of very large networks while limiting the memory footprint during training. This approach was recently illustrated at scale in the domain of video prediction, using a stack of VAEs trained sequentially in a greedy fashion (Wu et al., 2021). From a neuroscientific standpoint, it is interesting to explore the viability of alternative learning rules to backpropagation, as it is debated whether the brain performs backpropagation (mostly considered implausible) (Lillicrap et al., 2020), approximations of backpropagation (Lillicrap et al., 2016), or instead relies on local learning rules (Halvagal and Zenke, 2022; Clark et al., 2021; Illing et al., 2021). Finally, local learning rules could unlock the possibility of adaptive computation: since each part of the network is trained to solve a subtask in isolation, different parts are naturally tuned to solve different tasks (Yin et al., 2022; Baldock et al., 2021), offering interesting energy and speed trade-offs depending on the complexity of the input. There is some evidence suggesting that the brain also uses computational paths that depend on the complexity of the task (e.g., Shepard and Metzler, 1971).
Self-supervised learning has proven very successful as an approach to pretrain deep networks. A simple linear classifier trained on top of the last layer can yield an accuracy close to that of the best supervised learning methods (Chen et al., 2020a; b; He et al., 2020; Chen et al., 2020c; Caron et al., 2021a; b; Zbontar et al., 2021; Bardes et al., 2022; Dwibedi et al., 2021), and even exceeds the performance of supervised learning when finetuned (He et al., 2016). Self-supervised learning objectives typically have two terms that reflect the desired properties of the learned representations: the first term ensures invariance to distortions that do not affect the label of an image, and the second term ensures that the representation is informative about its input (Zbontar et al., 2021). Such loss func-


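To make the two terms of such an objective concrete, the following is a minimal NumPy sketch of the Barlow Twins loss (Zbontar et al., 2021): the on-diagonal entries of the empirical cross-correlation matrix between two distorted views are pushed toward 1 (invariance), while the off-diagonal entries are pushed toward 0 (redundancy reduction, keeping each embedding dimension informative). The function name and the trade-off coefficient below are illustrative choices for this sketch, not values taken from this paper.

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lambda_offdiag=5e-3):
    """Barlow Twins loss on two batches of embeddings of shape (N, D).

    z_a and z_b hold the embeddings of two distorted views of the same
    batch of images. The on-diagonal term enforces invariance to the
    distortions; the off-diagonal term decorrelates the embedding
    dimensions (redundancy reduction).
    """
    n, d = z_a.shape
    # Normalize each embedding dimension to zero mean, unit std across the batch.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + 1e-8)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + 1e-8)
    # Empirical cross-correlation matrix of shape (D, D).
    c = z_a.T @ z_b / n
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()
    return on_diag + lambda_offdiag * off_diag
```

Feeding two identical batches yields a near-zero invariance term (the diagonal of the cross-correlation is 1 by construction), while two independent batches are heavily penalized because the diagonal collapses toward 0.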