SLIMMABLE NETWORKS FOR CONTRASTIVE SELF-SUPERVISED LEARNING

Abstract

Self-supervised learning has made great progress in large-model pre-training but struggles with training small models. Previous solutions to this problem mainly rely on knowledge distillation and in fact follow a two-stage learning procedure: first train a large teacher model, then distill it to improve the generalization ability of small ones. In this work, we present a new one-stage solution for obtaining pre-trained small models without extra teachers: slimmable networks for contrastive self-supervised learning (SlimCLR). A slimmable network contains a full network and several weight-sharing sub-networks. We can pre-train only once and obtain various networks, including small ones with low computation costs. However, in self-supervised cases, the interference between weight-sharing networks leads to severe performance degradation. One piece of evidence of the interference is gradient imbalance: a small proportion of parameters produces dominant gradients during backpropagation, and the main parameters may not be fully optimized. The interference between networks also results in gradient direction divergence. To overcome these problems, we make the main parameters produce dominant gradients and provide consistent guidance for sub-networks via three techniques: slow start training of sub-networks, online distillation, and loss re-weighting according to model sizes. In addition, a switchable linear probe layer is applied during linear evaluation to avoid the interference of weight-sharing linear layers. We instantiate SlimCLR with typical contrastive learning frameworks and achieve better performance than previous methods with fewer parameters and FLOPs.

Under review as a conference paper at ICLR 2023

[Figure 1: Top-1 accuracy on ImageNet (%) of a slimmable ResNet-50 (supervised, 90 epochs) trained at widths [1.0], [1.0, 0.5], [1.0, 0.5, 0.25], and [1.0, 0.75, 0.5, 0.25].]

1. INTRODUCTION

In the past decade, deep learning has achieved great success in different fields of artificial intelligence. A large amount of manually labeled data is the fuel behind such success. However, manually labeled data is expensive and far less abundant than unlabeled data in practice. To relieve the constraint of costly annotations, self-supervised learning (Dosovitskiy et al., 2016; Wu et al., 2018; van den Oord et al., 2018; He et al., 2020; Chen et al., 2020a) aims to learn transferable representations for downstream tasks by training networks on unlabeled data. Great progress has been made with large models, i.e., models bigger than ResNet-50 (He et al., 2016), which has roughly 25M parameters. For example, ReLICv2 (Tomasev et al., 2022) achieves 77.1% accuracy on ImageNet (Russakovsky et al., 2015) under the linear evaluation protocol with ResNet-50, outperforming the supervised baseline of 76.5%. In contrast to the success of large-model pre-training, self-supervised learning with small models lags behind. For instance, supervised ResNet-18 with 12M parameters achieves 72.1% accuracy on ImageNet, but its self-supervised result with MoCov2 (Chen et al., 2020c) is only 52.5% (Fang et al., 2021). The gap is nearly 20%. To close the large performance gap between supervised and self-supervised small models, previous methods (Fang et al., 2021; Gao et al., 2022; Xu et al., 2022) mainly focus on knowledge distillation, namely, they try to transfer the knowledge of a self-supervised large model into small ones. Nevertheless, this methodology has a two-stage procedure: first train an additional large model, then train a small model to mimic the large one. Besides, one-time distillation only produces a single small model for a specific computation scenario. An interesting question naturally arises: can we obtain different small models through one-time pre-training to meet various computation scenarios without extra teachers?
Inspired by the success of slimmable networks (Yu et al., 2019) in supervised learning, we present a novel one-stage method to obtain pre-trained small models without adding large models: slimmable networks for contrastive self-supervised learning (SlimCLR). [Figure 1: Top-1 accuracy on ImageNet (%), widths of a slimmable ResNet-50 (200 epochs) at [1.0], [1.0, 0.5], [1.0, 0.5, 0.25], and [1.0, 0.75, 0.5, 0.25]. The width 0.25 represents that the number of channels is scaled by 0.25 of the full network.] A slimmable network consists of a full network and some weight-sharing sub-networks with different widths. The width denotes the number of channels in a network. Slimmable networks can execute at various widths, permitting flexible deployment on different computing devices. We can thus obtain multiple networks, including small ones suited to low-computation scenarios, via one-time pre-training. Weight-sharing networks can also inherit knowledge from the large ones via the shared parameters to achieve better generalization performance. Weight-sharing networks in a slimmable network interfere with each other when trained simultaneously, and the situation is worse in self-supervised cases. As shown in Figure 1, with supervision, weight-sharing networks have only a slight impact on each other, e.g., the full model achieves 76.6% vs. 76.0% accuracy in the [1.0] and [1.0, 0.75, 0.5, 0.25] settings. Without supervision, the corresponding numbers become 67.2% vs. 64.8%. One observed phenomenon of the interference is gradient imbalance: a small proportion of parameters produces dominant gradients during backpropagation. The imbalance occurs because the shared parameters receive gradients from multiple losses of different networks during optimization. The main parameters may thus not be fully optimized. Besides, conflicts between the gradient directions of weight-sharing networks also cause gradient direction divergence of the full network. Please refer to Appendix A.3 for detailed explanations and visualizations.
To relieve the gradient imbalance, the main parameters should produce dominant gradients during the optimization process. To avoid conflicts between the gradient directions of various networks, sub-networks should have consistent guidance. Following these principles, we introduce three simple yet effective techniques during pre-training to relieve the interference between networks. 1) We adopt a slow start strategy for sub-networks. The networks and the pseudo supervision of contrastive learning are both unstable and fast-changing at the start of training. To prevent interference from making the situation worse, we only train the full model at first. After the full model becomes relatively stable, sub-networks can inherit its knowledge via the shared parameters and start from a better initialization. 2) We apply online distillation to keep all sub-networks consistent with the full model and eliminate the divergence between networks. The predictions of the full model serve as global guidance for all sub-networks. 3) We re-weight the losses of the networks according to their widths to ensure that the full model dominates the optimization process. Besides, we adopt a switchable linear probe layer to avoid interference between weight-sharing linear layers during evaluation: a single slimmable linear layer cannot achieve several complex mappings simultaneously when the data distribution is complicated. We instantiate two algorithms for SlimCLR with typical contrastive learning frameworks, i.e., MoCov2 and MoCov3 (Chen et al., 2020c; 2021). Extensive experiments are conducted on the ImageNet (Russakovsky et al., 2015) dataset, and the results show that our methods achieve significant performance improvements over previous methods with fewer parameters and FLOPs.

2. RELATED WORKS

Self-supervised learning Self-supervised learning aims to learn transferable representations for downstream tasks from the input data itself. According to Liu et al. (2020), self-supervised methods can be summarized into three main categories according to their objectives: generative, contrastive, and generative-contrastive (adversarial). Methods belonging to the same category can be further classified by their pretext tasks. Given input x, generative methods encode x into an explicit vector z and decode z to reconstruct x, e.g., auto-regressive (van den Oord et al., 2016a; b) and auto-encoding models (Ballard, 1987; Kingma & Welling, 2014; Devlin et al., 2019; He et al., 2022). Contrastive learning methods encode input x into an explicit vector z to measure similarity; the two mainstream approaches in this category are context-instance contrast and instance-instance contrast.

Slimmable networks Slimmable networks were first proposed to achieve instant and adaptive accuracy-efficiency trade-offs on different devices (Yu et al., 2019). They can execute at different widths at runtime. Following the pioneering work, universally slimmable networks (Yu & Huang, 2019b) develop systematic training approaches that allow slimmable networks to run at arbitrary widths. AutoSlim (Yu & Huang, 2019a) further achieves one-shot architecture search for channel numbers under a given computation budget. MutualNet (Yang et al., 2020) trains slimmable networks with different input resolutions to learn multi-scale representations. Dynamic slimmable networks (Li et al., 2022; 2021) change the number of channels of each layer on the fly according to the input. In contrast to the weight-sharing sub-networks in slimmable networks, some methods train multiple sub-networks with independent parameters (Zhao et al., 2022b).
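The core mechanism of slimmable execution, running one set of weights at several widths by slicing its leading channels, can be sketched in a few lines. This is an illustrative NumPy toy (the class name and initialization are ours, not from any of the cited implementations):

```python
import numpy as np

class SlimmableLinear:
    """Toy linear layer that executes at several widths by slicing one
    shared weight matrix: smaller widths reuse the leading channels, so
    their parameters are a strict subset of the full network's."""

    def __init__(self, in_features, out_features, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = rng.standard_normal((out_features, in_features)) * 0.01

    def forward(self, x, width=1.0):
        # Keep only the leading fraction of input and output channels.
        out_c = int(self.weight.shape[0] * width)
        in_c = int(self.weight.shape[1] * width)
        return x[:, :in_c] @ self.weight[:out_c, :in_c].T

layer = SlimmableLinear(8, 4)
x = np.ones((2, 8))
full = layer.forward(x, width=1.0)   # uses all weights, output shape (2, 4)
half = layer.forward(x, width=0.5)   # shares the top-left weight block, shape (2, 2)
```

Because the half-width output is computed from a sub-block of the full weights, updating the small network necessarily updates part of the full one, which is exactly the source of the interference discussed later.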
A concept related to slimmable networks in network pruning is network slimming (Liu et al., 2017; Chavan et al., 2022; Wang et al., 2021), which aims to achieve channel-level sparsity for computation efficiency.

3.1. DESCRIPTION OF SLIMCLR

We develop two instantiations of SlimCLR with the typical contrastive learning frameworks MoCov2 and MoCov3 (Chen et al., 2020c; 2021). As shown in Figure 2a (right), a slimmable network with $n$ widths $w_1, \ldots, w_n$ contains multiple weight-sharing networks $f_{\theta_{w_1}}, \ldots, f_{\theta_{w_n}}$, which are parameterized by learnable weights $\theta_{w_1}, \ldots, \theta_{w_n}$, respectively. Each network $f_{\theta_{w_i}}$ in the slimmable network has its own set of weights $\Theta_{w_i}$ with $\theta_{w_i} \in \Theta_{w_i}$. A network with a small width shares its weights with larger ones, namely, $\Theta_{w_j} \subset \Theta_{w_i}$ if $w_j < w_i$. Generally, we assume $w_j < w_i$ if $j > i$, i.e., $w_1, \ldots, w_n$ are arranged in descending order, and $\theta_{w_1}$ represents the parameters of the full model. We first illustrate the learning process of SlimCLR-MoCov2 in Figure 2a. Given a set of images $\mathcal{D}$, an image $x$ sampled uniformly from $\mathcal{D}$, and a distribution of image augmentations $\mathcal{T}$, SlimCLR produces two data views $x_1 = t(x)$ and $x_2 = t'(x)$ from $x$ by applying augmentations $t \sim \mathcal{T}$ and $t' \sim \mathcal{T}$, respectively. For the first view, SlimCLR outputs multiple representations $h_{\theta_{w_1}}, \ldots, h_{\theta_{w_n}}$ and predictions $z_{\theta_{w_1}}, \ldots, z_{\theta_{w_n}}$, where $h_{\theta_{w_i}} = f_{\theta_{w_i}}(x_1)$ and $z_{\theta_{w_i}} = g_{\theta_{w_i}}(h_{\theta_{w_i}})$. $g$ is a stack of slimmable linear transformation layers, i.e., a slimmable version of the MLP head in MoCov2 and SimCLR (Chen et al., 2020a). For the second view, SlimCLR only outputs a single representation from the full model, $h_{\xi_{w_1}} = f_{\xi_{w_1}}(x_2)$, and its prediction $z_{\xi_{w_1}} = g_{\xi_{w_1}}(h_{\xi_{w_1}})$. We minimize the InfoNCE (van den Oord et al., 2018) loss to maximize the similarity of the positive pair $z_{\theta_{w_i}}$ and $z_{\xi_{w_1}}$:

$$\mathcal{L}_{z_{\theta_{w_i}}, z_\xi, \{z^-\}} = -\log \frac{\exp(z_{\theta_{w_i}} \cdot z_{\xi_{w_1}} / \tau_1)}{\exp(z_{\theta_{w_i}} \cdot z_{\xi_{w_1}} / \tau_1) + \sum_{z^-} \exp(z_{\theta_{w_i}} \cdot z^- / \tau_1)},$$

where $z_{\theta_{w_i}}$ and $z_{\xi_{w_1}}$ are $\ell_2$-normalized, i.e., $z \leftarrow z / \lVert z \rVert_2$, and the total loss sums over all widths:

$$\mathcal{L}_{z_\theta, z_\xi, \{z^-\}} = \sum_{i=1}^{n} \mathcal{L}_{z_{\theta_{w_i}}, z_\xi, \{z^-\}}.$$

$\xi$ is updated by $\theta$ every iteration: $\xi \leftarrow m\xi + (1-m)\theta$, where $m \in [0, 1)$ is a momentum coefficient.
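The summed InfoNCE objective over widths can be sketched as follows; the features are random stand-ins for the encoder outputs, and the helper names are illustrative rather than taken from the paper's code:

```python
import numpy as np

def info_nce(z_q, z_k, negatives, tau=0.2):
    """InfoNCE loss for one query against its positive key and a bank of
    negatives (K x d); all vectors are L2-normalized first."""
    z_q = z_q / np.linalg.norm(z_q)
    z_k = z_k / np.linalg.norm(z_k)
    negatives = negatives / np.linalg.norm(negatives, axis=1, keepdims=True)
    pos = np.exp(z_q @ z_k / tau)
    neg = np.exp(negatives @ z_q / tau).sum()
    return -np.log(pos / (pos + neg))

rng = np.random.default_rng(0)
d, K = 128, 256                       # feature dim; a small bank for the sketch
z_key = rng.standard_normal(d)        # momentum-encoder prediction for view 2
bank = rng.standard_normal((K, d))    # negative queue {z^-}

# One query per width: predictions of the weight-sharing networks for view 1.
widths = [1.0, 0.75, 0.5, 0.25]
queries = {w: rng.standard_normal(d) for w in widths}

# L_{z_theta, z_xi, {z^-}} = sum_i L_{z_theta_{w_i}, z_xi, {z^-}}
total = sum(info_nce(queries[w], z_key, bank) for w in widths)
```

Every width's query is contrasted against the same momentum key and negative queue, so all $n$ losses backpropagate into the shared parameters, which is where the gradient imbalance of Section 3.2 originates.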
Compared to SlimCLR-MoCov2, SlimCLR-MoCov3 has an additional projection process: it first projects the representation into another high-dimensional space and then makes predictions. The projector $q$ is a stack of slimmable linear transformation layers. SlimCLR-MoCov3 also adopts the InfoNCE loss, but the negative samples come from other samples in the mini-batch. After contrastive learning, we keep only $f_{\theta_{w_1}}, \ldots, f_{\theta_{w_n}}$ and discard the other components.

3.2. GRADIENT IMBALANCE AND SOLUTIONS

As shown in Figure 1, a vanilla implementation of the above framework leads to severe performance degradation because the weight-sharing networks interfere with each other during pre-training. One piece of evidence of such interference we observed is gradient imbalance: a small proportion of parameters produces dominant gradients during backpropagation. To quantitatively evaluate the phenomenon, we show the ratios of gradient norms of main and minor parameters, $\lVert \nabla_{\theta_{1.0}} \mathcal{L} \rVert_2$ and $\lVert \nabla_{\theta_{1.0\backslash 0.25}} \mathcal{L} \rVert_2$ versus $\lVert \nabla_{\theta_{0.25}} \mathcal{L} \rVert_2$, in Figure 3, where $\mathcal{L}$ is the loss function. Meanwhile, the ratio of the numbers of parameters is $|\Theta_{1.0} \backslash \Theta_{0.25}| / |\Theta_{0.25}| \approx 15$, where $\theta_{1.0\backslash 0.25} \in \Theta_{1.0} \backslash \Theta_{0.25}$. This means $\Theta_{1.0} \backslash \Theta_{0.25}$ contains more than 90% of the total parameters. Generally, the main parameters should dominate the optimization process and produce large gradient norms, i.e., the two ratios should both be large ($> 1$). In Figure 3a, the two ratios are both around 3.5 when training a normal network. However, in Figures 3b and 3c, when training a slimmable network, gradient imbalance occurs because the shared parameters receive multiple gradients from different losses. To be specific, if the widths $w_1, \ldots, w_n$ of a slimmable network are arranged in descending order and the training loss is $\mathcal{L}_{z_\theta, z_\xi, \{z^-\}}$, then $\theta_{w_n}$, which represents only a small part of the parameters, receives gradients from $n$ different losses and obtains a large gradient norm:

$$\nabla_{\theta_{w_n}} \mathcal{L}_{z_\theta, z_\xi, \{z^-\}} = \frac{\partial \mathcal{L}_{z_\theta, z_\xi, \{z^-\}}}{\partial \theta_{w_n}} = \sum_{i=1}^{n} \frac{\partial \mathcal{L}_{z_{\theta_{w_i}}, z_\xi, \{z^-\}}}{\partial \theta_{w_n}}.$$

Gradient imbalance is more obvious in self-supervised cases. In the supervised case in Figure 3b, $\lVert \nabla_{\theta_{1.0\backslash 0.25}} \mathcal{L} \rVert_2$ is close to $\lVert \nabla_{\theta_{0.25}} \mathcal{L} \rVert_2$ at first, and the former becomes larger as training progresses. By contrast, for vanilla SlimCLR-MoCov2 in Figure 3c, $\lVert \nabla_{\theta_{1.0\backslash 0.25}} \mathcal{L} \rVert_2$ is smaller than the other most of the time. A conjecture is that instance discrimination is harder than supervised classification.
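Why the shared parameters accumulate inflated gradient norms can be seen from the sum in the equation above: the shared block receives one gradient per width, and when those per-loss gradients are aligned, their sum grows roughly linearly in $n$ rather than like $\sqrt{n}$. A synthetic NumPy illustration (the numbers are made up, not measured from training):

```python
import numpy as np

rng = np.random.default_rng(0)
n_losses, dim = 4, 1000   # n widths, size of the shared block theta_{w_n}

# Per-loss gradients w.r.t. the shared block. When the weight-sharing
# networks solve near-identical tasks, their gradients are roughly aligned.
base = rng.standard_normal(dim)
aligned = [base + 0.1 * rng.standard_normal(dim) for _ in range(n_losses)]
independent = [rng.standard_normal(dim) for _ in range(n_losses)]

# nabla_{theta_{w_n}} L = sum_i dL_i / d theta_{w_n}
norm_aligned = np.linalg.norm(np.sum(aligned, axis=0))
norm_indep = np.linalg.norm(np.sum(independent, axis=0))
norm_single = np.linalg.norm(base)

# Aligned contributions add almost linearly (~ n * ||g||); independent ones
# only grow like sqrt(n). The small shared block therefore ends up with an
# inflated gradient norm relative to parameters receiving a single gradient.
```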
Consequently, small networks with limited capacity are hard to converge, produce large losses, and cause more disturbance to the other weight-sharing networks. The gradient directions of weight-sharing networks may also diverge from each other during backpropagation, which causes gradient direction divergence of the full network during training. To avoid gradient imbalance, one natural idea is to make the main parameters dominate the optimization process, i.e., the two ratios in Figure 3 should both be large. To resolve the conflicts between gradient directions, the networks should have a consistent optimization goal. To achieve these purposes, we develop three simple yet effective techniques during pre-training: slow start, online distillation, and loss reweighting. Besides, we further introduce a switchable linear probe layer to avoid the interference of weight-sharing linear layers during linear evaluation.

slow start At the start of training, the model and the pseudo supervision of contrastive learning are both changing quickly, and the optimization procedure is unstable. To prevent interference between weight-sharing networks from making the situation worse, during the first S epochs we only train the full model, i.e., we only update $\theta_{1.0}$ by $\nabla_{\theta_{1.0}} \mathcal{L}_{z_{\theta_{1.0}}, z_\xi, \{z^-\}}$. In Figure 3d, the ratios of gradient norms are large before the S-th epoch and drop dramatically once the slow start ends. During the first S epochs, the full model can learn from the data without disturbance, and the sub-networks then inherit this knowledge via the shared parameters and start from a better initialization. Similar approaches are adopted in some one-shot NAS methods (Cai et al., 2020; Yu et al., 2020).

online distillation The full model has the greatest capacity to learn knowledge from the data. Its predictions can serve as consistent guidance for all sub-networks and resolve the gradient direction conflicts of the weight-sharing networks.
Following Yu & Huang (2019b), we minimize the Kullback-Leibler (KL) divergence between the estimated probabilities of the sub-networks and the full model, where

$$p_{w_i} = \frac{\exp(z_{\theta_{w_i}} \cdot z_{\xi_{w_1}} / \tau_2)}{\exp(z_{\theta_{w_i}} \cdot z_{\xi_{w_1}} / \tau_2) + \sum_{z^-} \exp(z_{\theta_{w_i}} \cdot z^- / \tau_2)},$$

and

$$\mathcal{L}_{p_{w_i}} = -p_{w_1} \log p_{w_i}, \quad w_i \in \{w_2, \ldots, w_n\}.$$

$\tau_2$ is a temperature coefficient for distillation. In Figure 3e, we observe that online distillation helps $\lVert \nabla_{\theta_{1.0\backslash 0.25}} \mathcal{L} \rVert_2 / \lVert \nabla_{\theta_{0.25}} \mathcal{L} \rVert_2$ become larger than 1.0. This means that online distillation also relieves the gradient imbalance and helps the main parameters dominate the optimization process.

loss reweighting Another straightforward solution to gradient imbalance and gradient direction divergence is to assign high confidence to networks with large widths. We adopt a strategy in which the strongest takes control. The weight for the loss of the network with width $w_i$ is

$$\lambda_i = 1.0 + \mathbb{1}\{w_i = w_1\} \times \sum_{j=2}^{n} w_j,$$

where $\mathbb{1}\{\cdot\}$ equals 1 if the inner condition is true and 0 otherwise. In Figure 3f, both ratios become large, and $\lVert \nabla_{\theta_{1.0\backslash 0.25}} \mathcal{L} \rVert_2 / \lVert \nabla_{\theta_{0.25}} \mathcal{L} \rVert_2$ is larger than 1.0 by a clear margin. Loss reweighting helps the main parameters produce large gradient norms and dominate the optimization process. The overall pre-training objective of SlimCLR is:

$$\mathcal{L}_{all} = \lambda_1 \mathcal{L}_{z_{\theta_{w_1}}, z_\xi, \{z^-\}} + \sum_{i=2}^{n} \lambda_i \, \frac{\mathcal{L}_{z_{\theta_{w_i}}, z_\xi, \{z^-\}} + \mathcal{L}_{p_{w_i}}}{2}.$$

switchable linear probe layer As we demonstrate theoretically in Appendix A.1, given the features extracted by a slimmable network pre-trained via contrastive self-supervised learning, a single slimmable linear probe layer cannot achieve several complex mappings from different representations to the same object classes simultaneously. The failure arises because the learned representations in Figure 2 do not meet the requirement discussed in Appendix A.1. We therefore propose a switchable linear probe layer: each network in the slimmable network has its own linear probe layer for linear evaluation.
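The pre-training recipe of this section, slow start, online distillation, and loss reweighting, can be condensed into a short sketch; `infonce` and `distill` are placeholder callables standing in for $\mathcal{L}_{z_{\theta_{w_i}}, z_\xi, \{z^-\}}$ and $\mathcal{L}_{p_{w_i}}$, not real loss implementations:

```python
widths = [1.0, 0.75, 0.5, 0.25]   # w_1 > w_2 > ... > w_n

def loss_weights(widths):
    """lambda_i = 1.0 + 1{w_i = w_1} * sum_{j>=2} w_j ("strongest takes control")."""
    bonus = sum(widths[1:])
    return [1.0 + (bonus if i == 0 else 0.0) for i in range(len(widths))]

def slimclr_step(epoch, S, infonce, distill):
    """One evaluation of the pre-training objective. `infonce(w)` and
    `distill(w)` are placeholders for the per-width InfoNCE and online
    distillation losses."""
    if epoch < S:                 # slow start: only the full model trains
        return infonce(widths[0])
    lam = loss_weights(widths)
    total = lam[0] * infonce(widths[0])
    for i, w in enumerate(widths[1:], start=1):
        # sub-networks: average of their InfoNCE and distillation losses
        total += lam[i] * (infonce(w) + distill(w)) / 2.0
    return total
```

For the four widths above, `loss_weights` gives the full model a weight of 2.5 and every sub-network a weight of 1.0, matching the "strongest takes control" rule.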

4.1. EXPERIMENTAL DETAILS

Dataset We train SlimCLR on ImageNet (Russakovsky et al., 2015), which contains 1.28M training and 50K validation images. During pre-training, we use the training images without labels.

Learning strategies of SlimCLR-MoCov2 By default, we use a total batch size of 1024, an initial learning rate of 0.2, and a weight decay of 1 × 10⁻⁴. We adopt the SGD optimizer with a momentum of 0.9. A linear warm-up and cosine decay policy (Goyal et al., 2017; He et al., 2019) is applied to the learning rate, with 10 warm-up epochs. The temperatures are $\tau_1 = 0.2$ for InfoNCE and $\tau_2 = 5.0$ for online distillation. Unless otherwise specified, other settings, including data augmentations, queue size (65536), and feature dimension (128), are the same as in MoCov2 (Chen et al., 2020c). The slow start epoch S of sub-networks is set to half of the total number of epochs.
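The warm-up-then-cosine schedule above can be sketched as a small function; the exact curve used in the paper's code may differ slightly:

```python
import math

def lr_at(epoch, total_epochs, base_lr=0.2, warmup_epochs=10):
    """Linear warm-up for the first `warmup_epochs`, then cosine decay to ~0."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# e.g. 200-epoch pre-training: the rate reaches 0.2 at epoch 9 and then
# decays smoothly toward zero by the final epoch.
```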

Learning strategies of SlimCLR-MoCov3

We use a total batch size of 1024, an initial learning rate of 1.2, and a weight decay of 1 × 10⁻⁶. We adopt the LARS (You et al., 2017) optimizer and a cosine learning rate policy with 10 warm-up epochs. The temperatures are $\tau_1 = 1.0$ and $\tau_2 = 1.0$. The slow start epoch S is half of the total epochs. One difference is that we increase the initial learning rate to 3.2 after S epochs. All pre-training is done with mixed precision (Micikevicius et al., 2018).

Linear evaluation Following the general linear evaluation protocol (Chen et al., 2020a; He et al., 2020), we add new linear layers on top of the backbone and freeze the backbone during evaluation. We also apply online distillation with a temperature $\tau_2 = 1.0$ when training these linear layers. For the evaluation of SlimCLR-MoCov2, we use a total batch size of 1024, 100 epochs, and an initial learning rate of 60, decayed by 10× at epochs 60 and 80. For the evaluation of SlimCLR-MoCov3, we use a total batch size of 1024, 90 epochs, and an initial learning rate of 0.4 with a cosine decay policy.

4.2. RESULTS OF SLIMCLR ON IMAGENET

Results of SlimCLR on ImageNet are shown in Table 1. Even though we put great effort into relieving the interference of weight-sharing networks, as described in Section 3.2, slimmable training inevitably leads to a performance drop for the full model. When training for more epochs, the degradation is more obvious. However, such degradation also occurs in the supervised case. Considering the advantages of slimmable training discussed below, the degradation is acceptable. Compared to MoCov2 with individual networks, SlimCLR helps sub-networks achieve significant performance improvements. Specifically, for ResNet-50 0.5 and ResNet-50 0.25, SlimCLR-MoCov2 achieves 3.5% and 6.6% improvements, respectively, when pre-training for 200 epochs. This verifies that sub-networks can inherit knowledge from the full model via the shared parameters to improve their generalization ability. We can also use a more powerful contrastive learning framework to further boost the performance of sub-networks, i.e., SlimCLR-MoCov3. Compared to previous methods that distill the knowledge of large teacher models, the sub-networks of ResNet-50 [1.0, 0.75, 0.5, 0.25] achieve better performance with fewer parameters and FLOPs. SlimCLR also brings the performance of small models closer to their supervised counterparts. Furthermore, SlimCLR does not need any additional training process for large teacher models, and all networks in SlimCLR are trained jointly. By training only once, we obtain different models with various computation costs that are suitable for different devices. This demonstrates the superiority of adopting slimmable networks for contrastive learning to obtain pre-trained small models.

4.3. DISCUSSION

In this section, we discuss the influence of the different components of SlimCLR.

switchable linear probe layer The influence of the switchable linear probe layer is shown in Table 2a. A switchable linear probe layer brings significant improvements in accuracy compared to a slimmable linear probe layer. With only one slimmable layer, the interference between weight-sharing linear layers is unavoidable. It is also possible that the learned representations of pre-trained models do not meet the requirements discussed in Appendix A.1.

slow start The influence of the slow start epoch S is shown in Table 2f. Setting S to half of the total epochs is a natural and appropriate choice.

online distillation Here we compare two classical distillation losses, mean-square error (MSE) and KL divergence (KD), and two distillation losses from recent works, ATKD (Guo, 2022) and DKD (Zhao et al., 2022a). ATKD reduces the difference in sharpness between the teacher and student distributions to help the student better mimic the teacher. DKD decouples the classical knowledge distillation objective (Hinton et al., 2015) into target-class and non-target-class knowledge distillation for more effective and flexible distillation. In Table 2c, we can see that these four distillation losses make trivial differences in our context. Combining the results of distillation with ResNet [1.0, 0.5, 0.25] in Figure 3e, we find that distillation mainly improves the performance of the full model, while the improvements for sub-networks are relatively small. This runs counter to the usual purpose of knowledge distillation: distilling the knowledge of large models to improve the performance of small ones. A likely reason is that sub-networks in a slimmable network already inherit knowledge from the full model via the shared weights, so feature distillation cannot help them much in this case. The main function of online distillation in our context is to relieve the interference between sub-networks, as shown in Figure 3e.
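For reference, the two classical losses compared above can be sketched as follows (ATKD and DKD are omitted; the function names and random logits are ours):

```python
import numpy as np

def softmax(x, tau):
    e = np.exp((x - x.max()) / tau)
    return e / e.sum()

def kl_distill(z_s, z_t, tau=1.0):
    """KL(p_teacher || q_student) between temperature-softened distributions."""
    p, q = softmax(z_t, tau), softmax(z_s, tau)
    return float((p * (np.log(p) - np.log(q))).sum())

def mse_distill(z_s, z_t):
    """Mean-square error computed directly on the logits."""
    return float(((z_s - z_t) ** 2).mean())

rng = np.random.default_rng(0)
z_t = rng.standard_normal(8)               # full-model (teacher) logits
z_s = z_t + 0.1 * rng.standard_normal(8)   # sub-network (student) logits
```

Both losses vanish when the student matches the teacher and grow as the logits diverge, which is consistent with the observation that the choice among them makes little difference here.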
We also test the influence of different temperatures in online distillation, i.e., $\tau_2$ in $p_{w_i}$ in equation 4; the results are shown in Table 2d. Following classical KD (Hinton et al., 2015), we choose $\tau_2 \in \{3.0, 4.0, 5.0, 6.0\}$. The choice of temperature makes a trivial difference in our context. SEED (Fang et al., 2021) uses a small temperature of 0.01 for the teacher to get a sharp distribution and a temperature of 0.2 for the student. BINGO (Xu et al., 2022) adopts a single temperature of 0.2. Their choices are quite different from ours, and SlimCLR is more robust to the choice of temperatures. We further provide an analysis of the influence of temperatures in Appendix A.2.

loss reweighting We compare four loss reweighting manners in Table 2e: (1) $\lambda_i = 1.0 + \mathbb{1}\{w_i = w_1\} \times \sum_{j=2}^{n} w_j$; (2) $\lambda_i = 1.0 + \mathbb{1}\{w_i = w_1\} \times \max\{w_2, \ldots, w_n\}$; (3) $\lambda_i = n \times w_i / \sum_{j=1}^{n} w_j$; (4) $\lambda_i = n \times (1.0 + \ldots$, corresponding to the weights [1.54, 1.08, 0.77, 0.62]. It is clear that a larger weight for the full model helps the system achieve better performance. This demonstrates again that it is important for the full model to lead the optimization direction during training. The differences among the four reweighting strategies are mainly reflected in the smallest sub-networks. To ensure the performance of the smallest network, we adopt reweighting manner (1) in practice.

Transfer learning to object detection and instance segmentation Following previous works (He et al., 2020; Fang et al., 2021; Xu et al., 2022), we evaluate the generalization ability of SlimCLR on object detection and instance segmentation tasks. As with the supervised slimmable networks (Yu et al., 2019), we use Mask R-CNN (He et al., 2017) with FPN (Lin et al., 2017) for the two tasks. We fine-tune the parameters of all layers, including batch normalization (Ioffe & Szegedy, 2015), end-to-end on the COCO 2017 (Lin et al., 2014) dataset. The training schedule is the default 1× schedule in Chen et al. (2019).
The pre-trained backbone used here is ResNet-50 [1.0, 0.75, 0.5, 0.25] pre-trained via SlimCLR-MoCov2 (800 epochs). The transfer learning results are shown in Table 3. SlimCLR-MoCov2 achieves better transfer learning results than the supervised baseline.

Differences between training slimmable networks in self-supervised and supervised cases In Appendix A.3, through extensive visualizations, we show that the optimization process of training slimmable networks is harder in self-supervised cases than in supervised cases. Specifically, the gradient imbalance and gradient direction divergence are both more significant in self-supervised cases, as discussed in Section 3.2 and Appendix A.3.2. In supervised cases, clear global supervision can help slimmable networks avoid these problems to some extent during training. In self-supervised learning, we do not have such clear global supervision and must make more effort to deal with the performance degradation problem.

5. CONCLUSION

In this work, we adapt slimmable networks for contrastive learning to obtain pre-trained small models in a self-supervised manner. By using slimmable networks, we can pre-train once and obtain several models of different sizes suitable for devices with different computation resources. Besides, unlike previous distillation-based methods, our methods do not require an additional training process for large teacher models. However, weight-sharing sub-networks in a slimmable network cause severe interference to each other in self-supervised learning. One piece of evidence of such interference we observed is the gradient imbalance in the backpropagation process. We develop several techniques to relieve the interference of weight-sharing networks during pre-training and linear evaluation. Two specific algorithms are instantiated in this work, i.e., SlimCLR-MoCov2 and SlimCLR-MoCov3. We conduct extensive experiments on ImageNet and achieve better performance than previous methods with fewer network parameters and FLOPs.

and

$$\theta_{11} = (B_{11} X_{11}^T + B_{12} X_{12}^T) T = (X_1^T X_1)^{-1} X_1^T T.$$

From equation 14, we get

$$B_{21} = -(X_{12}^T X_{12})^{-1} X_{12}^T X_{11} B_{11}, \quad B_{12} = -B_{11} X_{11}^T X_{12} (X_{12}^T X_{12})^{-1}.$$

Substituting equation 18 into equation 12, we get

$$X_{11}^T X_{11} B_{11} - X_{11}^T X_{12} (X_{12}^T X_{12})^{-1} X_{12}^T X_{11} B_{11} = I_{d_1},$$

$$B_{11} = \left( X_{11}^T X_{11} - X_{11}^T X_{12} (X_{12}^T X_{12})^{-1} X_{12}^T X_{11} \right)^{-1}.$$

At the same time,

$$\theta_{11} = (B_{11} X_{11}^T + B_{12} X_{12}^T) T = B_{11} \left( X_{11}^T - X_{11}^T X_{12} (X_{12}^T X_{12})^{-1} X_{12}^T \right) T = \left( X_{11}^T X_{11} - X_{11}^T X_{12} (X_{12}^T X_{12})^{-1} X_{12}^T X_{11} \right)^{-1} \left( X_{11}^T - X_{11}^T X_{12} (X_{12}^T X_{12})^{-1} X_{12}^T \right) T.$$

Combining equation 17 and equation 22, we obtain the condition on the input. These results demonstrate that the features of slimmable networks learned by contrastive self-supervised learning cannot meet the input condition (equation 23) when using a single slimmable linear probe layer.
This explains why using a switchable linear probe layer achieves much better performance than a single slimmable linear probe layer in Table 2a. The input condition (equation 23) is:

$$\left( X_{11}^T X_{11} - X_{11}^T X_{12} (X_{12}^T X_{12})^{-1} X_{12}^T X_{11} \right)^{-1} \left( X_{11}^T - X_{11}^T X_{12} (X_{12}^T X_{12})^{-1} X_{12}^T \right) T = (X_1^T X_1)^{-1} X_1^T T.$$

A.2 INFLUENCE OF TEMPERATURES DURING DISTILLATION

In this section, we analyze the influence of temperatures when applying distillation. A previous method, SEED (Fang et al., 2021), uses different temperatures for the student and teacher; without loss of generality, we adopt such a strategy in our analysis. Specifically, we use $\tau_t$ for the teacher and $\tau_s$ for the student. The predicted probability for a certain category $i$ of the student is $q_i = \frac{e^{z_i/\tau_s}}{\sum_j e^{z_j/\tau_s}}$, where $z$ is the output of the student model, i.e., its logits. The probability for category $i$ of the teacher is $p_i = \frac{e^{v_i/\tau_t}}{\sum_j e^{v_j/\tau_t}}$, where $v$ is the output of the teacher model. The loss is the KL divergence:

$$\mathcal{L} = -\sum_k p_k \log q_k.$$

The gradient of $q$ w.r.t. $z$ is:

$$\frac{\partial q_i}{\partial z_i} = \frac{1}{\tau_s} \frac{e^{z_i/\tau_s}}{\sum_j e^{z_j/\tau_s}} - \frac{1}{\tau_s} \frac{e^{z_i/\tau_s} e^{z_i/\tau_s}}{\left( \sum_j e^{z_j/\tau_s} \right)^2} = \frac{1}{\tau_s} (q_i - q_i^2),$$

$$\frac{\partial q_t}{\partial z_i} \bigg|_{t \neq i} = 0 - \frac{1}{\tau_s} \frac{e^{z_t/\tau_s} e^{z_i/\tau_s}}{\left( \sum_j e^{z_j/\tau_s} \right)^2} = -\frac{1}{\tau_s} q_i q_t.$$

Similarly,

$$\frac{\partial p_i}{\partial v_i} = \frac{1}{\tau_t} (p_i - p_i^2), \quad \frac{\partial p_t}{\partial v_i} = -\frac{1}{\tau_t} p_i p_t.$$

The gradient of $\mathcal{L}$ w.r.t. $z$ is:

$$\frac{\partial \mathcal{L}}{\partial z_i} = -\frac{p_i}{q_i} \frac{\partial q_i}{\partial z_i} + \sum_{t \neq i} \left( -\frac{p_t}{q_t} \right) \frac{\partial q_t}{\partial z_i} = \frac{1}{\tau_s} (q_i - p_i).$$

Following classical KD (Hinton et al., 2015), we assume the temperatures are much larger than the logits and use the first-order Taylor series to approximate the exponential function (Goodfellow et al., 2016):

$$\frac{\partial \mathcal{L}}{\partial z_i} \approx \frac{1}{\tau_s} \left( \frac{1 + z_i/\tau_s}{C + \sum_j z_j/\tau_s} - \frac{1 + v_i/\tau_t}{C + \sum_j v_j/\tau_t} \right),$$

where $C$ is the number of classes. Further assuming $\sum_j z_j = \sum_j v_j = 0$, as in classical KD (Hinton et al., 2015), we get:

$$\frac{\partial \mathcal{L}}{\partial z_i} \approx \frac{1}{C \tau_s} \left( \frac{z_i}{\tau_s} - \frac{v_i}{\tau_t} \right).$$

A.3 VISUALIZATION

In this section, we provide more detailed visualizations and explanations for gradient imbalance, gradient direction divergence, and the optimization trajectory during training. These visualizations help readers better understand how our methods work. They are also helpful for understanding the differences between training slimmable networks in supervised and self-supervised learning.
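The closed-form gradient derived in A.2, $\partial \mathcal{L} / \partial z_i = (q_i - p_i)/\tau_s$, can be verified numerically with finite differences; a minimal sketch with random logits (function names are ours):

```python
import numpy as np

def kd_loss(z, v, tau_s, tau_t):
    """L = -sum_k p_k log q_k with student temperature tau_s, teacher tau_t."""
    q = np.exp(z / tau_s); q = q / q.sum()
    p = np.exp(v / tau_t); p = p / p.sum()
    return -(p * np.log(q)).sum()

def kd_grad(z, v, tau_s, tau_t):
    """Closed form from the derivation above: dL/dz_i = (q_i - p_i) / tau_s."""
    q = np.exp(z / tau_s); q = q / q.sum()
    p = np.exp(v / tau_t); p = p / p.sum()
    return (q - p) / tau_s

rng = np.random.default_rng(0)
z, v = rng.standard_normal(10), rng.standard_normal(10)
tau_s, tau_t = 5.0, 5.0

# Central finite difference on coordinate i of the student logits.
eps, i = 1e-6, 3
dz = np.zeros(10); dz[i] = eps
numeric = (kd_loss(z + dz, v, tau_s, tau_t)
           - kd_loss(z - dz, v, tau_s, tau_t)) / (2 * eps)
```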

A.3.1 GRADIENT IMBALANCE

Besides the ratios of gradient norms in Figure 3 , we also display the absolute values of gradient norms in Figure 4 to help readers better understand the gradient imbalance phenomenon.

A.3.2 GRADIENT DIRECTION DIVERGENCE

Besides the imbalance of gradient magnitudes, the gradient directions of different weight-sharing networks also conflict with each other. Such conflicts result in disordered gradient directions of the full model; we call this phenomenon gradient direction divergence. We show the phenomenon when training slimmable networks on ImageNet in Figure 5, in both supervised and self-supervised cases. In Figures 5a, 5b, and 5c, the largest and second-largest singular values of the full gradient matrix of the last linear layer of the model are displayed. In Figure 5b, the network has small singular values in the absence of slimmable training, namely, with no interference from other weight-sharing networks. In Figures 5b and 5c, after the slow start point, the training of weight-sharing networks dramatically increases the singular values. In the supervised case (Figure 5a), training of weight-sharing networks also results in large singular values. Training of weight-sharing networks makes the directions of the full gradients disordered, and this phenomenon is more serious in self-supervised learning. We visualize the gradient directions in Figures 5d, 5e, and 5f. Specifically, we collect the gradients of the weights of the last linear layer during training; after training, we perform PCA on these gradients and show their projections on the first two principal components (Li et al., 2018). In Figure 5e, the gradient directions are stable and consistent before the slow start phase ends. Afterwards, the gradient directions become disordered due to the conflicts between weight-sharing networks. In Figure 5d, training of weight-sharing networks also makes the gradient directions disordered; however, as the networks share the same global supervision, the gradient direction divergence is less obvious than in Figure 5e.
In Figure 5f, our proposed distillation and loss re-weighting techniques effectively resolve the divergence of gradient directions and stabilize the training process.
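As a concrete illustration of the PCA-based visualization described above, the following minimal numpy sketch projects a history of flattened gradient vectors onto their first two principal components. The function name and the toy gradient history are our own illustrative assumptions, not the paper's code.

```python
import numpy as np

def principal_gradient_directions(grads, k=2):
    """Project a history of flattened gradient vectors onto their
    first k principal components (PCA via SVD), following the style
    of Li et al. (2018)."""
    G = np.stack(grads)                # (T, d): one flattened gradient per step
    G_centered = G - G.mean(axis=0)    # center before PCA
    # rows of Vt are principal directions, ordered by singular value
    U, S, Vt = np.linalg.svd(G_centered, full_matrices=False)
    proj = G_centered @ Vt[:k].T       # (T, k) coordinates for plotting
    return proj, S

rng = np.random.default_rng(0)
# toy gradient history: a consistent direction early, disordered later
stable = rng.normal(0, 0.05, (50, 10)) + np.linspace(1, 0, 50)[:, None]
noisy = rng.normal(0, 1.0, (50, 10))
proj, S = principal_gradient_directions(list(stable) + list(noisy))
print(proj.shape)
```

Plotting `proj[:, 0]` against `proj[:, 1]` over training steps reproduces the kind of trajectory shown in Figures 5d-5f.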

A.3.3 ERROR SURFACE AND OPTIMIZATION TRAJECTORY

The performance degradation caused by the interference between weight-sharing networks is more severe in self-supervised learning than in supervised learning, because both the gradient imbalance and the gradient direction divergence are more significant in self-supervised cases. In this subsection, we further provide the error surface and optimization trajectory (Li et al., 2018) when training a slimmable network to re-emphasize this point: the optimization of a slimmable network is harder in self-supervised cases than in supervised cases.

We train slimmable networks in both supervised and self-supervised (MoCo (He et al., 2020)) manners on CIFAR-10 (Krizhevsky & Hinton, 2009). The base network is a ResNet-20×4, which has 4.3M parameters. We train the model for 100 epochs. At the end of each epoch, we save the weights of the full model and calculate the Top-1 accuracy; for self-supervised cases, we use a k-NN predictor (Wu et al., 2018) to obtain the accuracy. After training, following Li et al. (2018), we compute the principal components of the differences between the weights saved at each epoch and the weights of the final model, and use the first two principal components as the directions to plot the error surface and optimization trajectory in Figure 6.

The visualization shows that self-supervised learning is harder than supervised learning. In the left error surfaces of Figure 6a and Figure 6c, the terrain around the valley is flat in the supervised case; by contrast, the terrain around the valley is more complicated in the self-supervised case. From the trajectory of ResNet-20×4 in the left of Figure 6b and Figure 6d, the contours in the supervised case are denser, i.e., two neighboring contours are closer. Since the accuracy gap between any two contour lines is the same, the model in the self-supervised case needs more time to achieve the same improvement in accuracy as the model in the supervised case.
In supervised cases, clear global guidance helps the model quickly reach the global minimum; without such guidance in self-supervised cases, it is harder for the model to reach the global minimum fast. The visualization also shows that the interference of weight-sharing networks is more significant in self-supervised cases. First, in self-supervised cases, the weight-sharing networks bring huge changes to the error surface in Figure 6c, whereas the change is much less obvious in supervised cases. Second, the interference between weight-sharing networks in self-supervised cases makes the model drift further from the global minimum (the origin in the visualization), as shown in Figure 6d.



Figure 1: Training a slimmable ResNet-50 in supervised (left) and self-supervised (right) manners. ResNet-50 [1.0, 0.75, 0.5, 0.25] means this slimmable network can switch among widths [1.0, 0.75, 0.5, 0.25]. The width 0.25 represents that the number of channels is scaled to 0.25 of the full network.

Max Hjelm et al. (2019), CPC van den Oord et al. (2018), AMDIM Bachman et al. (2019)) and instance-instance contrast (DeepCluster Caron et al. (2018), MoCo He et al. (2020); Chen et al. (2021), SimCLR Chen et al. (2020a;b), SimSiam Chen & He (2021)). Generative-contrastive methods generate a fake sample x′ from x and try to distinguish x′ from real samples, e.g., DCGANs Radford et al. (2016), inpainting Pathak et al. (2016), and colorization Zhang et al. (2016).

Figure 2: The overall framework of SlimCLR. A slimmable network produces different outputs from weight-sharing networks with various widths $w_1, \ldots, w_n$, where $w_1$ is the width of the full model. $\theta$ denotes the network parameters and $\xi$ is an exponential moving average of $\theta$. sg means stop-gradient.

Figure 3: Ratios of gradient norms: $\|\nabla_{\theta_{1.0}}\mathcal{L}\|_2 / \|\nabla_{\theta_{0.25}}\mathcal{L}\|_2$ and $\|\nabla_{\theta_{1.0\backslash 0.25}}\mathcal{L}\|_2 / \|\nabla_{\theta_{0.25}}\mathcal{L}\|_2$. The gradient norm of a network is calculated by averaging its layer-wise $\ell_2$ gradient norms. $\nabla_{\theta_{1.0\backslash 0.25}}\mathcal{L}$ is the gradient of the final loss w.r.t. the parameters $\theta_{1.0\backslash 0.25} \in \Theta_{1.0}\backslash\Theta_{0.25}$, i.e., the rest of the parameters of $\Theta_{1.0}$ besides $\Theta_{0.25}$.
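The averaging described in this caption can be sketched as follows; the layer shapes below are hypothetical placeholders for the per-layer gradient tensors of the two parameter groups.

```python
import numpy as np

def mean_layerwise_grad_norm(layer_grads):
    """Average of the per-layer l2 gradient norms, as used for Figure 3."""
    return float(np.mean([np.linalg.norm(g.ravel()) for g in layer_grads]))

# hypothetical gradients: theta_0.25 vs the remaining theta_1.0 \ theta_0.25
g_small = [np.ones((4, 4)), np.ones(4)]
g_rest  = [np.ones((8, 8)), np.ones(8)]
ratio = mean_layerwise_grad_norm(g_rest) / mean_layerwise_grad_norm(g_small)
print(ratio > 1.0)
```

A ratio below 1 over many layers is exactly the imbalance the figure highlights: the small sub-network's parameters receive disproportionately large gradients.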

Slow start and training time Experiments with and without slow start are shown in Table 2b. The pre-training times of SlimCLR-MoCov2 without and with slow start epoch S = 100 on 8 Tesla V100 GPUs are 45 and 33 hours, respectively. For reference, the pre-training time of MoCov2 with ResNet-50 is roughly 20 hours. Slow start largely reduces the pre-training time. It also avoids interference between weight-sharing networks at the start of training and helps the system quickly reach a steady point during optimization; sub-networks thus start from a good initialization and achieve better performance. We also provide ablations of the slow start epoch S when training for a longer time in Table
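The slow-start schedule can be sketched as a simple width-selection rule; the function name and signature below are our own illustration, not the actual training code.

```python
def widths_for_epoch(epoch, slow_start_epoch=100,
                     widths=(1.0, 0.75, 0.5, 0.25)):
    """Before the slow-start epoch S, only the full network (width 1.0)
    is trained; afterwards all weight-sharing widths join training."""
    if epoch < slow_start_epoch:
        return (1.0,)      # only the full model produces gradients
    return widths          # all sub-networks join the forward/backward pass
```

During pre-training, each width returned for the current epoch would run its own forward/backward pass before the optimizer step, which is why skipping the sub-networks early saves wall-clock time.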

To verify whether equation 23 holds in practice, we sample 2048 images from the training set of ImageNet and use a ResNet-50 [1.0, 0.75, 0.5, 0.25] pre-trained by SlimCLR-MoCov2 (800 epochs) to extract the features of these images. The features from ResNet-50 1.0 are denoted $X \in \mathbb{R}^{2048\times 1024}$ and the features from ResNet-50 0.5 are denoted $X_1 \in \mathbb{R}^{2048\times 512}$. We use L to represent the left side of equation 23 and R for the right side, and use the extracted features to compute the element-wise absolute difference |L − R|. The average value of the entries of |L − R| is 1.07, which corresponds to a total difference of 1096665.50. Similar experiments on the validation set of ImageNet give an average entry value of 0.88, i.e., a total difference of 903094.19.

Figure 4: The average gradient norm. We calculate the $\ell_2$ norm of the gradients layer-wise and take their mean. $\theta_{1.0} \in \Theta_{1.0}$ and $\theta_{0.25} \in \Theta_{0.25}$ represent the parameters of the full model and the sub-network with width 0.25. $\theta_{1.0\backslash 0.25} \in \Theta_{1.0}\backslash\Theta_{0.25}$ denotes the rest of the parameters of $\Theta_{1.0}$ besides $\Theta_{0.25}$. It is normal that the gradient norm increases during training (Goodfellow et al., 2016).

Figure 5: The gradient direction divergence. (a)(b)(c) The singular values of the full gradient matrix of the last linear layer. (d)(e)(f) The principal directions of the gradients of the last linear layer.

In Figure 6b, the maximal offsets from the global minimum along the 2nd PCA component are 21.75 and 28.49 for ResNet-20×4 and ResNet-20×4 [1.0,0.5], an increase of 31.0%. For the self-supervised cases in Figure 6d, the maximal offsets from the global minimum along the 2nd PCA component are 13.26 and 18.75 for ResNet-20×4 and ResNet-20×4 [1.0,0.5], an increase of 41.4%. It is clear that the interference of weight-sharing networks is more significant in self-supervised cases than in supervised cases.

Figure 6: Visualization of the error surface and optimization trajectory. (a) Error surface (supervised); left: R-20×4, 94.07%; right: R-20×4 [1.0,0.5], 93.58%, 93.03%. (b) Trajectory (supervised); left: R-20×4, 94.07%; right: R-20×4 [1.0,0.5], 93.58%, 93.03%. (c) Error surface (MoCo); left: R-20×4, 76.75%; right: R-20×4 [1.0,0.5], 75.74%, 74.24%. (d) Trajectory (MoCo); left: R-20×4, 76.75%; right: R-20×4 [1.0,0.5], 75.74%, 74.24%.

$\tau_1$ is a temperature hyper-parameter, and $\{z^-\}$ are the features of negative samples. For SlimCLR-MoCov2, $\{z^-\}$ come from a queue. Following MoCov2, the queue is updated with $z_{\xi_{w_1}}$ at every iteration during training. The overall objective is the sum of the losses of all networks with various widths:
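For concreteness, a minimal numpy sketch of a MoCo-style contrastive (InfoNCE) loss with a negative queue is given below; the function signature, shapes, and default temperature are simplified stand-ins for the actual implementation.

```python
import numpy as np

def info_nce_loss(z_q, z_k, queue, tau=0.2):
    """Numpy sketch of an InfoNCE loss with a negative queue.

    z_q: (N, D) query features, z_k: (N, D) positive key features,
    queue: (K, D) negative features; tau plays the role of tau_1."""
    z_q = z_q / np.linalg.norm(z_q, axis=1, keepdims=True)
    z_k = z_k / np.linalg.norm(z_k, axis=1, keepdims=True)
    queue = queue / np.linalg.norm(queue, axis=1, keepdims=True)
    l_pos = np.sum(z_q * z_k, axis=1, keepdims=True)  # (N, 1) positive logits
    l_neg = z_q @ queue.T                             # (N, K) negative logits
    logits = np.concatenate([l_pos, l_neg], axis=1) / tau
    # cross-entropy with the positive sample always at index 0
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[:, 0].mean())
```

In the slimmable setting, one such loss would be evaluated per width and the per-width losses summed, as stated above.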

Linear evaluation results of SlimCLR with ResNet-50 [1.0, 0.75, 0.5, 0.25] on ImageNet. Through only one-time pre-training, SlimCLR obtains multiple small models without extra large teacher models, and it outperforms previous methods using ResNets as backbones. The performance degradation when training a slimmable network is shown in cyan. The gaps between the self-supervised and supervised results are shown in orange; the smaller, the better.

Ablation experiments with SlimCLR-MoCov2 on ImageNet. The experiment in each former table serves as the baseline for the subsequent table.

Transfer learning results on the COCO val2017 set. Bounding-box AP (AP$^{bb}$) for object detection and mask AP (AP$^{mk}$) for instance segmentation.

A APPENDIX

A.1 CONDITIONS OF INPUTS GIVEN A SLIMMABLE LINEAR LAYER

We consider the conditions on the inputs when using only one slimmable linear transformation layer, i.e., when solving multiple multi-class linear regression problems with shared weights. The parameters of the linear layer are $\theta \in \mathbb{R}^{d\times C}$, where $C$ is the number of classes, and
$$\theta = \begin{bmatrix} \theta_{11} \\ \theta_{21} \end{bmatrix}, \quad \theta_{11} \in \mathbb{R}^{d_1\times C}.$$
The input to the full model is $X \in \mathbb{R}^{N\times d}$, where $N$ is the number of samples, and $X_1 \in \mathbb{R}^{N\times d_1}$ is the input feature for the sub-model parameterized by $\theta_{11}$. Generally, we have $N \ge d > d_1$. We assume that both $X$ and $X_1$ have independent columns, i.e., $X^{\top}X$ and $X_1^{\top}X_1$ are invertible. The ground truth is $T \in \mathbb{R}^{N\times C}$. The prediction of the full model is $Y = X\theta$; to minimize the sum-of-least-squares loss between prediction and ground truth, we solve
$$\min_{\theta} \|X\theta - T\|_F^2.$$
By setting the derivative w.r.t. $\theta$ to 0, we get
$$\theta = (X^{\top}X)^{-1}X^{\top}T.$$
In the same way, we can get the sub-model solution
$$\theta_{11} = (X_1^{\top}X_1)^{-1}X_1^{\top}T.$$
Writing $X = [X_1, X_2]$ with $X_2 \in \mathbb{R}^{N\times(d-d_1)}$, for $X^{\top}X$ we have
$$X^{\top}X = \begin{bmatrix} X_1^{\top}X_1 & X_1^{\top}X_2 \\ X_2^{\top}X_1 & X_2^{\top}X_2 \end{bmatrix}.$$
We denote the inverse of $X^{\top}X$ by $B = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}$, so $X^{\top}XB = I$; as $X^{\top}X$ is a symmetric matrix, its inverse is also symmetric, so $B_{12} = B_{21}^{\top}$. From $X^{\top}XB = I$, we have
$$X_1^{\top}X_1B_{11} + X_1^{\top}X_2B_{21} = I, \qquad X_1^{\top}X_1B_{12} + X_1^{\top}X_2B_{22} = 0,$$
$$X_2^{\top}X_1B_{11} + X_2^{\top}X_2B_{21} = 0, \qquad X_2^{\top}X_1B_{12} + X_2^{\top}X_2B_{22} = I.$$
Then we can get the top block of the full-model solution,
$$\theta_{11}' = B_{11}X_1^{\top}T + B_{12}X_2^{\top}T.$$
At the same time, weight sharing requires this block to coincide with the sub-model solution $(X_1^{\top}X_1)^{-1}X_1^{\top}T$, which yields the condition on the inputs in equation 23.
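The block algebra above can be checked numerically. The following numpy snippet (with arbitrary random dimensions of our choosing) verifies that the top block of the full least-squares solution equals $B_{11}X_1^{\top}T + B_{12}X_2^{\top}T$, and that it generally differs from the standalone sub-model solution.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, d1, C = 64, 8, 3, 5
X = rng.normal(size=(N, d))
T = rng.normal(size=(N, C))
X1, X2 = X[:, :d1], X[:, d1:]

# full-model least-squares solution theta = (X^T X)^{-1} X^T T
theta = np.linalg.solve(X.T @ X, X.T @ T)

# block inverse B of X^T X and the top block of theta recovered from it
B = np.linalg.inv(X.T @ X)
B11, B12 = B[:d1, :d1], B[:d1, d1:]
theta11_full = B11 @ X1.T @ T + B12 @ X2.T @ T

# standalone sub-model solution
theta11_sub = np.linalg.solve(X1.T @ X1, X1.T @ T)

print(np.allclose(theta[:d1], theta11_full))  # True: the block algebra holds
print(np.allclose(theta11_full, theta11_sub))  # generally False for random data
```

The mismatch in the second check is exactly why weight sharing imposes a non-trivial condition on the inputs.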

A.4 MORE IMPLEMENTATION DETAILS

Slimmable networks We adopt the implementation of slimmable networks described in Yu et al. (2019), which has switchable batch normalization layers. Namely, each network in the slimmable network has its own independent batch normalization process.
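A minimal numpy sketch of the switchable batch normalization idea (independent statistics and affine parameters per width) is given below. The class and variable names are our own; real implementations such as Yu et al. (2019) operate on 4D feature maps and track running statistics.

```python
import numpy as np

class SwitchableBatchNorm:
    """Sketch of switchable batch normalization: each width keeps its
    own affine parameters and normalizes its own active channels."""
    def __init__(self, channels, widths=(1.0, 0.75, 0.5, 0.25)):
        self.params = {
            w: {"gamma": np.ones(int(channels * w)),
                "beta": np.zeros(int(channels * w))}
            for w in widths
        }

    def __call__(self, x, width):
        p = self.params[width]          # independent parameters per width
        c = len(p["gamma"])
        x = x[:, :c]                    # channels active at this width
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mean) / np.sqrt(var + 1e-5)
        return p["gamma"] * x_hat + p["beta"]
```

Keeping the statistics separate per width avoids mixing the feature distributions of different sub-networks in one set of normalization parameters.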

SlimCLR-MoCov2

We train SlimCLR-MoCov2 on 8 Tesla V100 32GB GPUs without synchronized batch normalization across GPUs. The momentum coefficient m is 0.999 during training.

SlimCLR-MoCov3

We train SlimCLR-MoCov3 on 8 Tesla V100 32GB GPUs with synchronized batch normalization across GPUs. Synchronized batch normalization is important for MoCov3 to obtain better performance in linear evaluation. The momentum coefficient m is 0.99 with a cosine schedule when training for 300 epochs. The data augmentations are the same as those of MoCov3 (Chen et al., 2021).

Ablation study with SlimCLR-MoCov3 Besides the ablation studies with SlimCLR-MoCov2, we also provide an empirical analysis of SlimCLR-MoCov3. Different from SlimCLR-MoCov2, SlimCLR-MoCov3 adopts the temperatures τ1 = 1.0 and τ2 = 1.0. When applying slow start, SlimCLR-MoCov3 also increases the initial learning rate at the same time.

In Table 4a, we show the influence of the temperature τ2 for online distillation. Different from the choice in SlimCLR-MoCov2, it is better for SlimCLR-MoCov3 to choose a distillation temperature τ2 close to the temperature τ1 of the contrastive loss. τ1 = 1.0 is the default choice of MoCov3 (Chen et al., 2021), and we do not modify it.

Another interesting phenomenon is that when the number of training epochs of SlimCLR-MoCov3 becomes larger, we need to increase the learning rate (lr) when we start to train the sub-networks. The influence of the learning rate for slimmable training in SlimCLR-MoCov3 is shown in Table 4b. Here the learning rate refers to the base learning rate; the immediate learning rate after warmup is calculated from this base learning rate and the current training step as $\frac{1}{2}\,\mathrm{lr}\,(1 + \cos(\pi\,\mathrm{step}/\mathrm{max\_step}))$. Different from SlimCLR-MoCov2 and SlimCLR-MoCov3 with fewer epochs, SlimCLR-MoCov3 performs poorly if we do not change the learning rate when training for more epochs. We attribute the difference to the LARS (You et al., 2017) optimizer we adopt for SlimCLR-MoCov3.
LARS normalizes the layer-wise gradients in the network to avoid imbalanced gradient magnitudes across layers and ensures convergence when training networks with very large batch sizes. LARS is sensitive to changes in the learning rate and helps self-supervised models trained with large batches converge fast (Chen et al., 2021; 2020a). When training for 300 epochs, the full model reaches a local minimum quickly in the first 150 epochs. In this case, a learning rate of 0.6 (half of the base learning rate) is not able to help the system walk out of the valley and reach a better local minimum; consequently, a larger learning rate is needed to give the system more powerful momentum. From Table 4b, we can also see that SlimCLR-MoCov3 with LARS is sensitive to changes in the learning rate, which is consistent with the observations of previous works (Chen et al., 2021; 2020a).
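The cosine decay used after warmup can be written as a small helper; this is a sketch of the formula stated above, and the warmup handling in the actual code may differ.

```python
import math

def cosine_lr(base_lr, step, max_step):
    """Cosine learning-rate decay applied after warmup:
    0.5 * lr * (1 + cos(pi * step / max_step))."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / max_step))
```

With base_lr = 1.2, the rate starts at 1.2, passes 0.6 at the halfway point, and decays to 0 at the final step, which matches the "half of the base learning rate" reading of the schedule midway through training.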

