DSPNET: TOWARDS SLIMMABLE PRETRAINED NETWORKS BASED ON DISCRIMINATIVE SELF-SUPERVISED LEARNING

Abstract

Self-supervised learning (SSL) has achieved promising downstream performance. However, when facing various resource budgets in real-world applications, pretraining multiple networks of different sizes one by one incurs a huge computation burden. In this paper, we propose Discriminative-SSL-based Slimmable Pretrained Networks (DSPNet), which can be trained once and then slimmed to multiple sub-networks of various sizes, each of which learns a good representation and can serve as a good initialization for downstream tasks under various resource budgets. Specifically, we extend the idea of slimmable networks to a discriminative SSL paradigm by gracefully integrating SSL and knowledge distillation. On ImageNet, DSPNet achieves comparable or improved performance relative to networks pretrained individually one by one under the linear evaluation and semi-supervised evaluation protocols, while greatly reducing training cost. The pretrained models also generalize well to downstream detection and segmentation tasks. Code will be made public.

1. INTRODUCTION

Recently, self-supervised learning (SSL) has drawn much attention from researchers, as it learns good representations without depending on manual annotations. Such representations are considered to suffer from less human bias and to transfer better to downstream tasks. Generally, SSL solves a well-designed pretext task, such as image colorization (Zhang et al., 2016), jigsaw puzzle solving (Noroozi & Favaro, 2016), instance discrimination (Dosovitskiy et al., 2014), and masked image modeling (Bao et al., 2021). According to the type of pretext task, SSL methods can be categorized into generative approaches and discriminative approaches. Discriminative approaches have received more interest during the past few years, especially those following the instance discrimination pretext task (He et al., 2020; Chen et al., 2020a; Grill et al., 2020; Chen et al., 2020b; Caron et al., 2020; Chen & He, 2021). These pretraining approaches have shown superiority over their ImageNet-supervised counterparts on multiple downstream tasks. However, the pretraining is time-consuming and demands large computing resources; e.g., pretraining BYOL costs over 4000 TPU hours on Cloud TPU v3 cores. In real-world applications, resource budgets for deploying deep networks vary over a wide range, and a single trained network cannot achieve the optimal accuracy-efficiency trade-off across different devices. We therefore usually need multiple networks of different sizes. A naive way to exploit pretraining under such conditions is to pretrain the networks one by one and then fine-tune each for specific downstream tasks. However, the pretraining cost then grows approximately linearly with the number of desired networks. This cost is so large that the naive solution is far from a practical way to benefit from self-supervised pretraining.
In this paper, we propose a feasible approach to this problem: developing slimmable pretrained networks that can be trained once and then slimmed to multiple sub-networks of different sizes, each of which learns a good representation and can serve as a good initialization for downstream tasks under various resource budgets. To this end, we take inspiration from slimmable networks (Yu et al., 2018), which can be trained once and executed at different scales, permitting the network FLOPs to be dynamically configured at runtime. Nonetheless, slimmable networks (Yu et al., 2018) and the subsequent works (Yu & Huang, 2019; Yu et al., 2020; Cai et al., 2019; Yang et al., 2020) all focus on specific tasks with supervision from manual annotations, while we concentrate on SSL-based representation learning without manual annotations. To bridge this gap, we build our approach upon BYOL (Grill et al., 2020), a representative discriminative SSL method. Specifically, since multiple networks of different sizes with good representations must be pretrained in one go, we construct them as a network family built from the same basic building blocks but with different widths and depths; we refer to them as desired networks (DNs). Different from the original design of BYOL, where the online and target networks share the same architecture at every training iteration, our online network is deployed by activating randomly sampled sub-networks within it during pretraining, and these sub-networks include all the desired networks above. We apply BYOL's similarity loss between the target branch and the sampled sub-networks in the online branch, and the target network is updated as an exponential moving average of the online network. After pretraining, the DNs can be slimmed from the online network, and each of them learns a good representation.
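The training iteration described above, which samples a sub-network of the online branch, matches it to the full target branch with BYOL's similarity loss, and updates the target by exponential moving average, can be illustrated with a minimal numerical sketch. All names here (`byol_loss`, `ema_update`, the concrete width multipliers) are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical width multipliers defining the desired networks (DNs).
WIDTH_CHOICES = (0.25, 0.5, 0.75, 1.0)

def byol_loss(p, z):
    """BYOL's similarity loss: 2 - 2 * cosine similarity, averaged over the batch."""
    p = p / np.linalg.norm(p, axis=-1, keepdims=True)
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)
    return float(np.mean(2.0 - 2.0 * np.sum(p * z, axis=-1)))

def ema_update(target, online, tau=0.99):
    """Target weights track an exponential moving average of the online weights."""
    return tau * target + (1.0 - tau) * online

online_w = rng.normal(size=8)   # stands in for the online network's parameters
target_w = online_w.copy()      # target initialized from the online network

for step in range(3):
    # Randomly activate one sub-network of the online branch per iteration.
    width = WIDTH_CHOICES[rng.integers(len(WIDTH_CHOICES))]
    # ... forward the sampled online sub-network and the full target network,
    # compute byol_loss on their projected features, and back-propagate ...
    target_w = ema_update(target_w, online_w)
```

After pretraining, each DN would simply be sliced out of the shared online encoder, so no per-network pretraining run is needed.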
In this way, the DNs are pretrained together by sharing weights, which greatly reduces training cost compared with pretraining them one by one. We name the proposed approach Discriminative-SSL-based Slimmable Pretrained Networks (DSPNet). Our contributions are summarized as follows:

• We propose DSPNet, which can be pretrained once with SSL and then slimmed to multiple sub-networks of various sizes, each of which learns a good representation and serves as a good initialization for downstream tasks under various resource budgets.

• We show that our slimmed pretrained networks achieve comparable or improved performance relative to individually pretrained ones under various evaluation protocols, while greatly reducing training cost.

• With extensive experiments, we show that DSPNet also performs on par with or better than previous distillation-based SSL methods, which can only produce a single network with a good representation per training run.

2. RELATED WORK

Self-supervised Learning. Recent self-supervised learning (SSL) approaches have shown prominent results by pretraining models on ImageNet (Deng et al., 2009) and transferring them to downstream tasks. They largely close the performance gap with supervised models, or even surpass them, especially when adopting large encoders. Generally, SSL solves a well-designed pretext task. According to the type of pretext task, SSL methods can be categorized into generative approaches and discriminative approaches. Generative approaches focus on reconstructing the original data (Pathak et al., 2016; Bao et al., 2021; He et al., 2021; Zhang et al., 2016). Discriminative approaches have drawn much attention in recent years, especially those based on the instance discrimination pretext task (Dosovitskiy et al., 2014), which treats each image in a dataset as its own class. Among them, contrastive learning (Hadsell et al., 2006) methods achieve more promising performance: the representations of different views of the same image are brought closer (positive pairs), while the representations of views from different images (negative pairs) are pushed apart. BYOL (Grill et al., 2020) further gets rid of negative pairs while preserving high performance, and is thus called a non-contrastive method. Most of the above methods rely on large batch sizes and long training schedules, which cost large computing resources.

Slimmable Networks. Slimmable networks are a class of networks executable at different scales; they permit the network FLOPs to be dynamically configured at runtime and enable customers to trade off accuracy against latency when deploying deep networks. The original version (Yu et al., 2018) achieves networks slimmable at different widths, and US-Nets (Yu & Huang, 2019) further extends slimmable networks to execute at arbitrary widths, also proposing two improved training techniques, namely the sandwich rule and inplace distillation. Subsequent works go beyond only changing the network width, e.g., additionally changing the network depth (Cai et al., 2019), kernel size (Yu et al., 2020), and input resolution (Yang et al., 2020). However, previous works all focus on specific tasks with supervision from manual annotations. Our approach extends them to SSL-based representation learning, obtaining slimmable pretrained models with a single training run.
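The width-slimming mechanism behind these works can be made concrete with a toy layer that executes at several widths by slicing a shared weight matrix. `SlimmableLinear` and its width list are hypothetical illustrations of the idea in Yu et al. (2018), not code from that work:

```python
import numpy as np

class SlimmableLinear:
    """Toy fully-connected layer executable at several widths, in the spirit of
    slimmable networks: all widths share one weight matrix, and a narrower
    sub-network simply uses its leading output channels."""

    def __init__(self, in_features, out_features, width_mults=(0.25, 0.5, 1.0)):
        self.weight = np.random.default_rng(0).normal(size=(out_features, in_features))
        self.width_mults = width_mults

    def forward(self, x, width_mult):
        assert width_mult in self.width_mults
        out = int(self.weight.shape[0] * width_mult)
        # Slim the layer: keep only the first `out` output channels.
        return x @ self.weight[:out].T

layer = SlimmableLinear(16, 8)
x = np.ones((1, 16))
print(layer.forward(x, 0.5).shape)   # (1, 4): half of the 8 output units active
```

Real slimmable networks additionally keep separate batch-normalization statistics per width (switchable BN), which is omitted here for brevity.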

