MOVING BEYOND HANDCRAFTED ARCHITECTURES IN SELF-SUPERVISED LEARNING

Abstract

The current literature on self-supervised learning (SSL) focuses on developing learning objectives to train neural networks more effectively on unlabeled data. The typical development process takes well-established architectures, e.g., ResNet or ViT demonstrated on ImageNet, and uses them to evaluate newly developed objectives on downstream scenarios. While convenient, this neglects the role of architecture, which has been shown to be crucial in the supervised learning literature. In this work, we establish extensive empirical evidence showing that network architecture plays a significant role in contrastive SSL. We conduct a large-scale study with over 100 variants of ResNet and MobileNet architectures and evaluate them across 11 downstream scenarios in the contrastive SSL setting. We show that no single network performs consistently well across the scenarios. Based on this, we propose to learn not only network weights but also architecture topologies in the SSL regime. We show that "self-supervised architectures" outperform popular handcrafted architectures (ResNet18 and MobileNetV2) while performing competitively with the larger and computationally heavy ResNet50 on major image classification benchmarks (ImageNet-1K, iNat2021, and more). Our results suggest that it is time to consider moving beyond handcrafted architectures in contrastive SSL and to start incorporating architecture search into self-supervised learning objectives.

1. INTRODUCTION

Self-supervised learning (SSL) achieves impressive results on challenging tasks involving image, video, audio, and text. Models pretrained on large unlabeled data perform nearly as well as, and sometimes even better than, their supervised counterparts (Caron et al., 2020; Chen & He, 2021). So far, the focus has been on designing effective learning objectives, e.g., pretext tasks (Gidaris et al., 2018; Caron et al., 2018), contrastive (Oord et al., 2018; Chen et al., 2020a) and non-contrastive (Grill et al., 2020) tasks, together with empirical (Cole et al., 2021; Feichtenhofer et al., 2021) and theoretical (Arora et al., 2019; Poole et al., 2019) studies providing key insights and underpinnings. Recent works propose new objectives with a different class of network architectures such as vision transformers (ViT) (Bao et al., 2021) and masked autoencoders (He et al., 2022). However, there has been little focus on the role of architectures in SSL. Currently, the de facto protocol in SSL is to take architectures that perform well on established benchmarks in the supervised setting and to adapt them to the self-supervised setting by plugging in different learning objectives. For example, several existing works on contrastive learning use ResNet (He et al., 2016) as the backbone (Chen et al., 2020a; He et al., 2020). This is partly for convenience: evaluating different architectures in SSL is computationally expensive, and selecting an architecture in advance and fixing it throughout makes it easy to evaluate different learning objectives. It also stems from the strong empirical success of those architectures in transfer learning, e.g., CNNs trained on large labeled data provide "unreasonable effectiveness" (Sun et al., 2017; Zhang et al., 2018; Sejnowski, 2020) in a variety of downstream cases. Nonetheless, one implicit assumption is that an architecture that works well in the supervised learning scenario will continue to be effective in the SSL regime.
We argue that this assumption is incorrect and dangerous. It is valid only to a limited extent, and performance starts deteriorating significantly when SSL is conducted on data whose distribution deviates substantially from the original distribution on which the architecture was developed. This is counter to the promise of SSL, where one can learn optimal representations for a wide range of tasks. One main reason for this performance degradation is that different data distributions benefit from different inductive biases: an architecture with specific layer types and the wiring between them naturally encodes inductive biases, which may be optimal only for a certain data distribution (e.g., object-centric imagery such as ImageNet) and not for others (e.g., medical and satellite imagery). In fact, numerous studies have shown that standard "recipes" for architecture design do not translate well across different data distributions (Tuggener et al., 2021; Dey et al., 2021; Kolesnikov et al., 2019).

The main objective of this work is to show that the choice of network architecture crucially matters in SSL, and that it is not easy to handcraft architectures that are effective across different SSL scenarios. To see this, recall that the goal of SSL is to learn data representations capturing important features and attributes that generalize well across various downstream tasks. There is extensive literature on the expressivity of neural networks (Raghu et al., 2017; Zhang et al., 2021a, and references therein); one of its important conclusions is that the network topology plays a significant role in determining the expressivity, i.e., the kinds of functions a network can approximate are bounded by the network capacity and the available sample size in the finite-sample regime.
This implies that, in practice, SSL with a fixed architecture learns representations only within the scope of the function space induced by the pre-selected architecture topology. Therefore, the ultimate success of SSL can be achieved when it finds the optimal architecture from a given search space, in conjunction with its weights, for a specific data distribution.

In this paper, we establish extensive empirical evidence showing that architecture matters in self-supervised learning. We do this in two sets of large-scale studies. First, we sample 116 variants of ResNet (He et al., 2016) and MobileNet (Sandler et al., 2018) architectures with different topologies and evaluate them on 11 downstream tasks in the SSL setting. We pretrain all models under the same setting, optimizing the SimCLR objective (Chen et al., 2020a) on ImageNet (Deng et al., 2009), and investigate whether there is any correlation in downstream performance between these models on different datasets. We observe no strong correlation, except for tasks highly similar to ImageNet. We further show that ImageNet downstream performance, the gold-standard benchmark in the SSL literature, is not indicative of performance on other downstream tasks. This implies that we need to be careful in choosing an architecture for evaluating any newly developed SSL objective, as one might draw different conclusions from different network architectures. This subsequently raises the question: can we improve SSL by learning not only network weights but also architectures directly optimized for the given dataset? Doing so removes the burden of manually searching for effective architectures in SSL and, if successful, can substantially improve the performance of SSL. To test this hypothesis, as the second set of our study, we apply a well-established NAS algorithm (Cai et al., 2018) to the SSL setting.
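For reference, the SimCLR objective used for pretraining is the NT-Xent contrastive loss over pairs of augmented views. The following is a minimal NumPy sketch of that loss; the batch size, embedding dimension, and temperature below are illustrative placeholders, not the values used in our experiments.

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent (SimCLR) loss for two batches of embeddings.

    z1, z2: (N, D) arrays; z1[i] and z2[i] are two augmented
    views of the same image (the positive pair).
    """
    z = np.concatenate([z1, z2], axis=0)                 # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)     # L2-normalize
    sim = (z @ z.T) / tau                                # scaled cosine sims
    np.fill_diagonal(sim, -np.inf)                       # drop self-similarity
    n = len(z1)
    # Each sample's positive sits N rows away: i <-> i + N.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # Log-softmax over each row's remaining 2N - 1 candidates.
    logits = sim - sim.max(axis=1, keepdims=True)        # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), pos].mean())

# Toy check: nearly identical views yield a lower loss than unrelated pairs.
rng = np.random.default_rng(1)
a = rng.normal(size=(8, 32))
print(nt_xent(a, a + 0.01 * rng.normal(size=a.shape)))   # aligned views
print(nt_xent(a, rng.normal(size=a.shape)))              # random pairs
```

Minimizing this loss pulls the two views of each image together while pushing apart all other samples in the batch, which is the signal our pretraining (and, later, our architecture search) relies on.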
Unlike the typical NAS setting, which optimizes on a labeled target dataset, we search for optimal architectures directly on an unlabeled pretraining dataset via contrastive learning (Chen et al., 2020a). We evaluate our "NAS + SSL" framework on datasets with different distributions, ImageNet-1K and iNat2021 (Van Horn et al., 2021), and show that self-supervised architectures consistently outperform handcrafted ones in the same parameter range (MobileNetV2 and ResNet18) across 11 downstream tasks. This provides strong evidence for the importance of learning architecture topologies in addition to their weights in SSL.

Our work focuses on studying the role of architectures in contrastive SSL with the SimCLR framework for CNN-based architectures such as ResNets and MobileNets. As a first step in this direction, we provide an in-depth analysis through large-scale experiments in this specific (yet limited) setting. Extending our study to different SSL approaches (He et al., 2020; Grill et al., 2020; He et al., 2022) and architectures such as ViTs would be an interesting direction but is beyond the scope of this paper.

In summary, our main contributions are:

1) We establish extensive evidence showing that no single network architecture performs consistently well across different downstream scenarios in the SimCLR setting. We show this using 116 variants of ResNet and MobileNet architectures pretrained on ImageNet and evaluated on 11 downstream datasets.

2) We show that ImageNet performance, the gold standard in SSL benchmarking, is not always indicative of downstream performance. This means that findings about SSL objectives demonstrated only on ImageNet do not generalize across other data distributions.

3) We propose to self-supervise a CNN architecture topology and its network weights on unlabeled data. We show that self-supervised architectures outperform handcrafted ones in a similar parameter range for the SimCLR setting across different downstream datasets.
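The cross-task comparison behind the first two contributions boils down to rank correlation: if ImageNet performance were predictive, architectures ranked by ImageNet accuracy would rank similarly on other tasks. A minimal sketch of that computation follows; the accuracy matrix here is synthetic, purely to illustrate the analysis (in our study it comes from evaluating the 116 pretrained models on the 11 downstream datasets).

```python
import numpy as np

def spearman(x, y):
    # Spearman rank correlation, computed as the Pearson correlation
    # of the ranks (assumes no ties, which holds for real-valued scores).
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# Synthetic stand-in: 116 architecture variants x 11 downstream tasks.
rng = np.random.default_rng(0)
accs = rng.uniform(0.4, 0.8, size=(116, 11))

# Correlate architecture rankings on task 0 (say, ImageNet) with every
# other task; uniformly high values would mean one task predicts the rest.
rhos = [spearman(accs[:, 0], accs[:, j]) for j in range(1, 11)]
print(np.round(rhos, 2))
```

Weak or inconsistent correlations across task pairs are what lead us to conclude that no single handcrafted architecture, nor ImageNet accuracy alone, suffices for selecting models in SSL.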

