SLIMMABLE NETWORKS FOR CONTRASTIVE SELF-SUPERVISED LEARNING

Abstract

Self-supervised learning has made great progress in large-model pre-training but struggles when training small models. Previous solutions to this problem mainly rely on knowledge distillation and entail a two-stage learning procedure: first train a large teacher model, then distill it to improve the generalization ability of small ones. In this work, we present a new one-stage solution that obtains pre-trained small models without extra teachers: slimmable networks for contrastive self-supervised learning (SlimCLR). A slimmable network contains a full network and several weight-sharing sub-networks. We can pre-train only once and obtain various networks, including small ones with low computation costs. However, in the self-supervised setting, interference between the weight-sharing networks leads to severe performance degradation. One piece of evidence of this interference is gradient imbalance: a small proportion of parameters produces dominant gradients during backpropagation, so the main parameters may not be fully optimized. The interference between networks also results in gradient-direction divergence. To overcome these problems, we make the main parameters produce dominant gradients and provide consistent guidance for sub-networks via three techniques: slow-start training of sub-networks, online distillation, and loss re-weighting according to model sizes. In addition, a switchable linear probe layer is applied during linear evaluation to avoid interference between weight-sharing linear layers. We instantiate SlimCLR with typical contrastive learning frameworks and achieve better performance than previous methods with fewer parameters and FLOPs.

1. INTRODUCTION

In the past decade, deep learning has achieved great success in many fields of artificial intelligence, fueled by large amounts of manually labeled data. However, manual labels are expensive, and labeled data is far scarcer than unlabeled data in practice. To relieve the constraint of costly annotations, self-supervised learning (Dosovitskiy et al., 2016; Wu et al., 2018; van den Oord et al., 2018; He et al., 2020; Chen et al., 2020a) aims to learn transferable representations for downstream tasks by training networks on unlabeled data. Great progress has been made with large models, i.e., models as large as or larger than ResNet-50 (He et al., 2016), which has roughly 25M parameters. For example, ReLICv2 (Tomasev et al., 2022) achieves 77.1% accuracy on ImageNet (Russakovsky et al., 2015) under the linear evaluation protocol with ResNet-50, outperforming the supervised baseline of 76.5%. In contrast to the success of large-model pre-training, self-supervised learning with small models lags behind. For instance, a supervised ResNet-18 with 12M parameters achieves 72.1% accuracy on ImageNet, but its self-supervised counterpart trained with MoCov2 (Chen et al., 2020c) reaches only 52.5% (Fang et al., 2021), a gap of nearly 20%. To close this large performance gap between supervised and self-supervised small models, previous methods (Fang et al., 2021; Gao et al., 2022; Xu et al., 2022) mainly focus on knowledge distillation, namely transferring the knowledge of a self-supervised large model into small ones. Nevertheless, this methodology is in fact a two-stage procedure: first train an additional large model, then train a small model to mimic it. Besides, one round of distillation produces only a single small model for a specific computation scenario. An interesting question naturally arises: can we obtain different small models through one-time pre-training, meeting various computation scenarios, without extra teachers?
Inspired by the success of slimmable networks (Yu et al., 2019) in supervised learning, we present a novel one-stage method to obtain pre-trained small models without additional large models: slimmable networks for contrastive self-supervised learning (SlimCLR). A slimmable network consists of a full network and several weight-sharing sub-networks with different widths, where the width denotes the number of channels in the network. Slimmable networks can execute at various widths, permitting flexible deployment on different computing devices. We can thus obtain multiple networks, including small ones suited to low-compute scenarios, via one-time pre-training. The weight-sharing sub-networks can also inherit knowledge from the large ones through the shared parameters to achieve better generalization performance. However, the weight-sharing networks in a slimmable network interfere with each other when trained simultaneously, and the situation is worse in self-supervised cases. As shown in Figure 1, with supervision, weight-sharing networks have only a slight impact on each other, e.g., the full model achieves 76.6% accuracy in ResNet-50 [1.0] vs. 76.0% in ResNet-50 [1.0, 0.75, 0.5, 0.25]. Without supervision, the corresponding numbers become 67.2% vs. 64.8%. One observed phenomenon of the interference is gradient imbalance: a small proportion of parameters produces dominant gradients during backpropagation. The imbalance occurs because the shared parameters receive gradients from the losses of multiple networks during optimization, and as a result the main parameters may not be fully optimized. Besides, conflicts in the gradient directions of the weight-sharing networks also cause gradient-direction divergence in the full network. Please refer to Appendix A.3 for detailed explanations and visualizations. To relieve the gradient imbalance, the main parameters should produce dominant gradients during the optimization process.
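The width-sharing mechanism can be illustrated with a toy layer: a sub-network of width w simply reuses the first w-fraction of the full layer's input and output channels, so no extra parameters are introduced. Below is a minimal NumPy sketch of this channel-slicing idea; the class and parameter names are ours and this is not the paper's actual implementation.

```python
import numpy as np

class SlimmableLinear:
    """Toy slimmable linear layer (illustrative, not the paper's code).

    A sub-network at width w uses only the first w-fraction of the
    shared weight matrix, so all sub-networks share parameters with
    the full network."""

    def __init__(self, in_features, out_features):
        rng = np.random.default_rng(0)
        self.weight = rng.standard_normal((out_features, in_features)) * 0.01

    def forward(self, x, width=1.0):
        # Slice the shared parameters: keep the first fraction of
        # input and output channels of the full layer.
        out_w = int(self.weight.shape[0] * width)
        in_w = int(self.weight.shape[1] * width)
        return x[:, :in_w] @ self.weight[:out_w, :in_w].T

layer = SlimmableLinear(8, 4)
x = np.ones((2, 8))
full = layer.forward(x, width=1.0)   # uses all 8 -> 4 channels
half = layer.forward(x, width=0.5)   # uses the first 4 -> 2 channels
```

Because the half-width output is computed from a slice of the very same weight matrix, any gradient applied to the sub-network also updates the full network's parameters, which is precisely the source of the interference discussed above.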
To avoid conflicts in the gradient directions of the various networks, sub-networks should receive consistent guidance. Following these principles, we introduce three simple yet effective techniques during pre-training to relieve the interference between networks. 1) We adopt a slow-start strategy for sub-networks. The networks and the pseudo supervision of contrastive learning are both unstable and fast-changing at the start of training. To avoid interference making the situation worse, we train only the full model at first; after the full model becomes relatively stable, the sub-networks can inherit its knowledge via the shared parameters and start from a better initialization. 2) We apply online distillation to keep all sub-networks consistent with the full model and eliminate the divergence between networks. The predictions of the full model serve as global guidance for all sub-networks. 3) We re-weight the losses of the networks according to their widths so that the full model dominates the optimization process. Besides, we adopt a switchable linear probe layer to avoid interference between weight-sharing linear layers during evaluation: a single slimmable linear layer cannot realize several complex mappings simultaneously when the data distribution is complicated. We instantiate two algorithms for SlimCLR with typical contrastive learning frameworks, i.e., MoCov2 and MoCov3 (Chen et al., 2020c; 2021). Extensive experiments on the ImageNet (Russakovsky et al., 2015) dataset show that our methods achieve significant performance improvements over previous methods with fewer parameters and FLOPs.
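The three pre-training techniques can be condensed into a single loss-combination rule. The sketch below is a hypothetical simplification: the exact slow-start schedule, distillation loss, and width-proportional weights are our assumptions for illustration, not the paper's precise formulation.

```python
def slimclr_loss(contrastive, distill, widths=(1.0, 0.75, 0.5, 0.25),
                 epoch=0, slow_start_epochs=10):
    """Combine per-width losses (illustrative sketch).

    contrastive: dict mapping width -> contrastive loss of that network
    distill:     dict mapping width -> distillation loss against the
                 full model's predictions
    The width-proportional weighting and epoch threshold are assumed
    here for clarity, not taken from the paper."""
    # 1) Slow start: only the full network is trained early on, so
    #    sub-networks later inherit a relatively stable initialization.
    if epoch < slow_start_epochs:
        return contrastive[1.0]
    # 3) Re-weighting by width keeps the full model's gradients
    #    dominant over those of the sub-networks.
    total = contrastive[1.0]
    for w in widths:
        if w == 1.0:
            continue
        # 2) Online distillation aligns each sub-network with the
        #    full model's predictions (consistent global guidance).
        total += w * (contrastive[w] + distill[w])
    return total

# Example with dummy scalar losses for two widths:
contrastive = {1.0: 1.0, 0.5: 2.0}
distill = {0.5: 0.5}
early = slimclr_loss(contrastive, distill, widths=(1.0, 0.5), epoch=0)
late = slimclr_loss(contrastive, distill, widths=(1.0, 0.5), epoch=20)
```

During the slow-start phase only the full model's loss contributes; afterwards each sub-network adds its width-weighted contrastive and distillation terms, so smaller networks contribute proportionally smaller gradients.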

2. RELATED WORKS

Self-supervised learning Self-supervised learning aims to learn transferable representations for downstream tasks from the input data itself. Following Liu et al. (2020), self-supervised methods can be summarized into three main categories according to their objectives: generative, contrastive,



Figure 1: Training a slimmable ResNet-50 in supervised (left) and self-supervised (right) manners. ResNet-50 [1.0, 0.75, 0.5, 0.25] means this slimmable network can switch among widths [1.0, 0.75, 0.5, 0.25]. The width 0.25 represents that the number of channels is scaled to 0.25 of the full network.

