SLIMMABLE NETWORKS FOR CONTRASTIVE SELF-SUPERVISED LEARNING

Abstract

Self-supervised learning has made great progress in large-model pre-training but struggles with small models. Previous solutions to this problem mainly rely on knowledge distillation, which entails a two-stage procedure: first train a large teacher model, then distill it to improve the generalization ability of small ones. In this work, we present a new one-stage solution that yields pre-trained small models without extra teachers: slimmable networks for contrastive self-supervised learning (SlimCLR). A slimmable network contains a full network and several weight-sharing sub-networks. We can pre-train only once and obtain various networks, including small ones with low computation costs. However, in the self-supervised setting, interference between the weight-sharing networks leads to severe performance degradation. One symptom of this interference is gradient imbalance: a small proportion of parameters produces dominant gradients during backpropagation, so the main parameters may not be fully optimized. The interference between networks also causes gradient directions to diverge. To overcome these problems, we make the main parameters produce dominant gradients and provide consistent guidance for sub-networks via three techniques: slow start training of sub-networks, online distillation, and loss re-weighting according to model sizes. In addition, a switchable linear probe layer is applied during linear evaluation to avoid interference between weight-sharing linear layers. We instantiate SlimCLR with typical contrastive learning frameworks and achieve better performance than previous arts with fewer parameters and FLOPs.
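The weight-sharing mechanism described above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the class name `SlimmableLinear` and the width ratios are hypothetical, and a plain linear layer stands in for the sliced convolutional channels of an actual slimmable network. Each sub-network at width ratio r simply reuses the top-left block of the full weight matrix, so no extra parameters are stored.

```python
import numpy as np

class SlimmableLinear:
    """Illustrative linear layer whose sub-networks share slices of one weight matrix."""

    def __init__(self, in_features, out_features, seed=0):
        rng = np.random.default_rng(seed)
        self.in_features = in_features
        self.out_features = out_features
        # One full weight matrix is stored; every sub-network is a view into it.
        self.weight = rng.standard_normal((out_features, in_features))

    def forward(self, x, width_ratio=1.0):
        # The sub-network at `width_ratio` uses only the first fraction of
        # input and output features -- the same underlying parameters as
        # the full network, just fewer of them.
        in_w = int(self.in_features * width_ratio)
        out_w = int(self.out_features * width_ratio)
        return x[:, :in_w] @ self.weight[:out_w, :in_w].T

layer = SlimmableLinear(in_features=8, out_features=4)
x = np.ones((2, 8))
for ratio in (0.25, 0.5, 1.0):
    y = layer.forward(x, width_ratio=ratio)
    print(f"width {ratio}: output shape {y.shape}")
```

Training such a model typically runs forward/backward passes at several width ratios per step and accumulates their gradients into the shared weights; the interference described in the abstract arises exactly because those accumulated gradients can conflict.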

1. INTRODUCTION

In the past decade, deep learning has achieved great success in different fields of artificial intelligence. A large amount of manually labeled data is the fuel behind this success. However, manually labeled data is expensive and far scarcer than unlabeled data in practice. To relieve the constraint of costly annotations, self-supervised learning (Dosovitskiy et al., 2016; Wu et al., 2018; van den Oord et al., 2018; He et al., 2020; Chen et al., 2020a) aims to learn transferable representations for downstream tasks by training networks on unlabeled data. Great progress has been made with large models, i.e., models at least as big as ResNet-50 (He et al., 2016), which has roughly 25M parameters. For example, ReLICv2 (Tomasev et al., 2022) achieves 77.1% accuracy on ImageNet (Russakovsky et al., 2015) under the linear evaluation protocol with ResNet-50, outperforming the supervised baseline of 76.5%. In contrast to the success of large-model pre-training, self-supervised learning with small models lags behind. For instance, supervised ResNet-18 with 12M parameters achieves 72.1% accuracy on ImageNet, but its self-supervised result with MoCov2 (Chen et al., 2020c) is only 52.5% (Fang et al., 2021). The gap is nearly 20%. To close this large performance gap between supervised and self-supervised small models, previous methods (Fang et al., 2021; Gao et al., 2022; Xu et al., 2022) mainly focus on knowledge distillation, namely, they transfer the knowledge of a self-supervised large model into small ones. Nevertheless, this methodology is in fact a two-stage procedure: first train an additional large model, then train a small model to mimic the large one. Besides, one-time distillation produces only a single small model for a specific computation scenario. An interesting question naturally arises: can we obtain different small models through one-time pre-training to meet various computation scenarios without extra teachers?
Inspired by the success of slimmable networks (Yu et al., 2019) in supervised learning, we present a novel one-stage

