DSPNET: TOWARDS SLIMMABLE PRETRAINED NETWORKS BASED ON DISCRIMINATIVE SELF-SUPERVISED LEARNING

Abstract

Self-supervised learning (SSL) has achieved promising downstream performance. However, when facing various resource budgets in real-world applications, pretraining multiple networks of various sizes one by one incurs a huge computation burden. In this paper, we propose Discriminative-SSL-based Slimmable Pretrained Networks (DSPNet), which can be trained once and then slimmed to multiple sub-networks of various sizes, each of which faithfully learns a good representation and can serve as a good initialization for downstream tasks under various resource budgets. Specifically, we extend the idea of slimmable networks to the discriminative SSL paradigm by gracefully integrating SSL and knowledge distillation. Under the linear evaluation and semi-supervised evaluation protocols on ImageNet, DSPNet achieves comparable or improved performance over networks individually pretrained one by one, while greatly reducing the training cost. The pretrained models also generalize well to downstream detection and segmentation tasks. Code will be made public.

1. INTRODUCTION

Recently, self-supervised learning (SSL) has drawn much attention from researchers, as it learns good representations without relying on manual annotations. Such representations are considered to suffer from less human bias and to transfer better to downstream tasks. Generally, SSL solves a well-designed pretext task, such as image colorization (Zhang et al., 2016), jigsaw puzzle solving (Noroozi & Favaro, 2016), instance discrimination (Dosovitskiy et al., 2014), and masked image modeling (Bao et al., 2021). According to the type of pretext task, SSL methods can be categorized into generative and discriminative approaches. Discriminative approaches have received more interest during the past few years, especially those built on the instance discrimination pretext task (He et al., 2020; Chen et al., 2020a; Grill et al., 2020; Chen et al., 2020b; Caron et al., 2020; Chen & He, 2021). These pretraining approaches have shown superiority over their ImageNet-supervised counterparts on multiple downstream tasks.

However, such pretraining is time-consuming and demands large computing resources; e.g., pretraining BYOL takes over 4000 hours on Cloud TPU v3 cores. Moreover, in real-world applications, deep networks must be deployed under resource budgets that vary over a wide range, and a single trained network cannot achieve optimal accuracy-efficiency trade-offs across different devices. This means that we usually need multiple networks of different sizes. A naive way to combine pretraining with this requirement is to pretrain the networks one by one and then fine-tune each of them for a specific downstream task. However, the pretraining cost then grows approximately linearly with the number of desired networks, consuming so many computing resources that this naive solution is far from a practical way to exploit the benefits of self-supervised pretraining.
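To make the contrast with pretraining separate networks concrete, the core weight-sharing mechanism of slimmable networks (Yu et al., 2018) can be illustrated with a minimal sketch: a single full-width layer is stored, and each narrower sub-network simply uses the leading slice of its channels. This is an illustrative NumPy toy, not the DSPNet implementation, and the class and parameter names here are hypothetical.

```python
import numpy as np

class SlimmableLinear:
    """Toy linear layer whose sub-networks share one weight matrix.

    A width multiplier in (0, 1] selects the leading fraction of output
    channels, so all widths are trained/executed from the same parameters
    instead of maintaining one separately pretrained network per budget.
    """

    def __init__(self, max_in: int, max_out: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((max_out, max_in)) * 0.01
        self.b = np.zeros(max_out)

    def forward(self, x: np.ndarray, width_mult: float = 1.0) -> np.ndarray:
        # Keep only the first `width_mult` fraction of output channels;
        # the retained slice is shared with every wider configuration.
        n_out = int(round(self.W.shape[0] * width_mult))
        n_in = x.shape[-1]
        return x @ self.W[:n_out, :n_in].T + self.b[:n_out]

layer = SlimmableLinear(max_in=8, max_out=16)
x = np.ones((2, 8))
full = layer.forward(x, width_mult=1.0)   # shape (2, 16)
half = layer.forward(x, width_mult=0.5)   # shape (2, 8)
```

Because the half-width output is computed from a slice of the same weight matrix, it coincides with the first eight channels of the full-width output; this shared-parameter property is what lets one training run serve several deployment budgets.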
In this paper, we propose a feasible approach to address this problem: developing slimmable pretrained networks that can be trained once and then slimmed to multiple sub-networks of different sizes, each of which learns a good representation and can serve as a good initialization for downstream tasks under various resource budgets. To this end, we take inspiration from slimmable networks (Yu et al., 2018), which can be trained once and executed at different scales, and permit

