MOVING BEYOND HANDCRAFTED ARCHITECTURES IN SELF-SUPERVISED LEARNING

Abstract

The current literature on self-supervised learning (SSL) focuses on developing learning objectives to train neural networks more effectively on unlabeled data. The typical development process involves taking well-established architectures, e.g., ResNet or ViT, demonstrated on ImageNet, and using them to evaluate newly developed objectives on downstream scenarios. While convenient, this neglects the role of architecture, which has been shown to be crucial in the supervised learning literature. In this work, we establish extensive empirical evidence showing that network architecture plays a significant role in contrastive SSL. We conduct a large-scale study with over 100 variants of ResNet and MobileNet architectures and evaluate them across 11 downstream scenarios in the contrastive SSL setting. We show that no single network performs consistently well across the scenarios. Based on this, we propose to learn not only network weights but also architecture topologies in the SSL regime. We show that "self-supervised architectures" outperform popular handcrafted architectures (ResNet18 and MobileNetV2) while performing competitively with the larger and computationally heavier ResNet50 on major image classification benchmarks (ImageNet-1K, iNat2021, and more). Our results suggest that it is time to consider moving beyond handcrafted architectures in contrastive SSL and to start incorporating architecture search into self-supervised learning objectives.
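To make the contrastive SSL setting referred to above concrete, the following is a minimal NumPy sketch of the NT-Xent loss used by SimCLR-style methods (Chen et al., 2020a). The function name and toy dimensions are illustrative, not the authors' implementation; given two batches of embeddings of two augmented views of the same images, each view's positive is the other view of the same image and all remaining embeddings act as negatives.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent (normalized temperature-scaled cross-entropy) contrastive loss.

    z1, z2: (N, d) embeddings of two augmented views of the same N images.
    Returns the mean cross-entropy of picking each embedding's positive
    (the other view of the same image) among all 2N - 1 candidates.
    """
    N = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)               # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize -> cosine sim
    sim = z @ z.T / temperature                        # (2N, 2N) similarity logits
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    # The positive for row i is row i+N (mod 2N): the other view of the same image.
    pos = np.concatenate([np.arange(N, 2 * N), np.arange(N)])
    # Row-wise cross-entropy with the positive as the target class.
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return (logsumexp - sim[np.arange(2 * N), pos]).mean()

rng = np.random.default_rng(0)
view1 = rng.normal(size=(8, 16))
# Perfectly aligned views give a much lower loss than randomly paired ones.
aligned_loss = nt_xent_loss(view1, view1)
random_loss = nt_xent_loss(view1, rng.normal(size=(8, 16)))
```

Note that the loss depends only on the embeddings produced by the backbone; the same objective can therefore be paired with any architecture, which is exactly the degree of freedom this work investigates.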

1. INTRODUCTION

Self-supervised learning (SSL) achieves impressive results on challenging tasks involving image, video, audio, and text. Models pretrained on large unlabeled data perform nearly as well as, and sometimes even better than, their supervised counterparts (Caron et al., 2020; Chen & He, 2021). So far, the focus has been on designing effective learning objectives - e.g., pretext tasks (Gidaris et al., 2018; Caron et al., 2018), contrastive (Oord et al., 2018; Chen et al., 2020a) and non-contrastive (Grill et al., 2020) tasks - together with empirical (Cole et al., 2021; Feichtenhofer et al., 2021) and theoretical (Arora et al., 2019; Poole et al., 2019) studies providing key insights and underpinnings. Recent works propose new objectives with a different class of network architectures such as vision transformers (ViT) (Bao et al., 2021) and masked autoencoders (He et al., 2022).

However, there has been little focus on the role of architectures in SSL. Currently, the de facto protocol in SSL is to take architectures that perform well on established benchmarks in the supervised setting and to adapt them to the self-supervised setting by plugging in different learning objectives. For example, several existing works on contrastive learning use ResNet (He et al., 2016) as the backbone (Chen et al., 2020a; He et al., 2020). This is partly for convenience: evaluating different architectures in SSL is computationally expensive, and selecting an architecture in advance and fixing it throughout makes it easy to evaluate different learning objectives. It also stems from the strong empirical success of those architectures in transfer learning; e.g., CNNs trained on large labeled data provide "unreasonable effectiveness" (Sun et al., 2017; Zhang et al., 2018; Sejnowski, 2020) in a variety of downstream cases. Nonetheless, one implicit assumption is that an architecture that works well in the supervised learning scenario will continue to be effective in the SSL regime. We argue that this assumption is incorrect and dangerous. It is valid only to a limited extent, and performance starts deteriorating significantly when SSL is conducted on data whose distribution deviates substantially from the distribution on which the architecture was originally developed. This is counter to

