ONE STEP TOWARDS SUSTAINABLE SELF-SUPERVISED LEARNING

Abstract

Although increasingly expensive to train, most self-supervised learning (SSL) models have repeatedly been trained from scratch but not fully utilized, since only a few SOTA models are employed for downstream tasks. In this work, we explore a sustainable SSL framework with two major challenges: i) learning a stronger new SSL model based on an existing pretrained SSL model, called the "base" model, in a cost-friendly manner; ii) allowing the training of the new model to be compatible with various base models. We propose a Target-Enhanced Conditional (TEC) scheme which introduces two components into existing mask-reconstruction based SSL. First, we propose patch-relation enhanced targets, which enhance the targets given by the base model and encourage the new model to learn semantic-relation knowledge from the base model using incomplete inputs. This input hardening and target enhancement help the new model surpass the base model, since they enforce additional patch-relation modeling to handle the incomplete input. Second, we introduce a conditional adapter that adaptively adjusts the new model's predictions to align with the targets of different base models. Extensive experimental results show that our TEC scheme accelerates learning and also improves SOTA SSL base models, e.g., MAE and iBOT, taking an explorative step towards sustainable SSL.
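To make the two TEC components concrete, the following is a minimal, purely illustrative sketch of one training objective. The relation-mixing rule, the per-base-model scale-and-shift adapter, and all function names here are our own assumptions for illustration; the paper's actual implementation (e.g., its target-enhancement design and adapter architecture) may differ substantially.

```python
import math

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    num = sum(a * b for a, b in zip(u, v))
    den = (math.sqrt(sum(a * a for a in u)) *
           math.sqrt(sum(b * b for b in v)) + 1e-8)
    return num / den

def enhance_targets(base_feats):
    # Patch-relation enhancement (toy rule): mix each base-model patch
    # target with a similarity-weighted average of all patch targets,
    # so the target carries inter-patch relation information.
    enhanced = []
    for fi in base_feats:
        weights = [math.exp(cosine(fi, fj)) for fj in base_feats]
        z = sum(weights)
        mix = [sum(w * fj[d] for w, fj in zip(weights, base_feats)) / z
               for d in range(len(fi))]
        enhanced.append([0.5 * a + 0.5 * b for a, b in zip(fi, mix)])
    return enhanced

def conditional_adapter(pred, base_id, adapters):
    # Toy conditional adapter: a per-base-model scale and shift that
    # maps new-model predictions into that base model's target space.
    scale, shift = adapters[base_id]
    return [scale * p + shift for p in pred]

def tec_loss(new_preds, base_feats, base_id, adapters):
    # Mean-squared error between adapted predictions and the
    # relation-enhanced targets derived from the base model.
    targets = enhance_targets(base_feats)
    total = 0.0
    for p, t in zip(new_preds, targets):
        p = conditional_adapter(p, base_id, adapters)
        total += sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)
    return total / len(new_preds)
```

An identity adapter (scale 1, shift 0) leaves predictions unchanged, so a single new model can serve several base models simply by switching the adapter parameters conditioned on the base model's identity.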

1. INTRODUCTION

Self-supervised learning (SSL) has achieved overwhelming success in unsupervised representation learning, with astonishingly high performance in many downstream tasks like classification (Zhou et al., 2022a;b), object detection, and segmentation (Bao et al., 2021; He et al., 2022). In SSL, a pretext task is first built, e.g., an instance discrimination task (He et al., 2020; Chen* et al., 2021) or masked image modeling (MIM) (Bao et al., 2021; He et al., 2022), and pseudo labels are then generated via the pretext task to train a network without requiring manual labels. Though successful, SSL is developing in a direction of increasingly large training costs, e.g., 200 training epochs for MoCo (He et al., 2020) but 1,600 epochs for MAE (He et al., 2022) to realize its potential. Unfortunately, most researchers have limited computational budgets and often cannot afford to train large SSL models. Moreover, pretrained non-SOTA SSL models are rarely used in practice, since the SOTA is updated frequently and a previous model quickly becomes obsolete, wasting huge training resources. Thus, a sustainable SSL framework is highly desirable.
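The MIM pretext task mentioned above can be sketched in a few lines: mask most patches, reconstruct them, and score the reconstruction only on the masked positions, where the pseudo labels are the original patches themselves. The mean-of-visible-patches "model" below is a deliberate stand-in (a real MIM model such as MAE uses a ViT encoder-decoder), and the function name and mask ratio are our own illustrative choices.

```python
import random

def masked_modeling_step(patches, mask_ratio=0.75):
    # Toy masked image modeling (MIM) step: hide a random subset of
    # patches, reconstruct them, and compute the loss only on the
    # masked positions. The original patches act as pseudo labels,
    # so no manual annotation is needed.
    n = len(patches)
    n_mask = max(1, int(n * mask_ratio))
    masked_idx = set(random.sample(range(n), n_mask))
    visible = [p for i, p in enumerate(patches) if i not in masked_idx]

    # Stand-in "model": predict every masked patch as the mean of the
    # visible patches. A real model would be a trained network.
    dim = len(patches[0])
    mean = [sum(p[d] for p in visible) / len(visible) for d in range(dim)]

    loss = 0.0
    for i in masked_idx:
        loss += sum((mean[d] - patches[i][d]) ** 2 for d in range(dim)) / dim
    return loss / len(masked_idx), sorted(masked_idx)
```

The key property this sketch shows is that the supervision signal (the hidden patches) comes for free from the input itself, which is what makes the training cost, not the labeling cost, the bottleneck in scaling SSL.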



Figure 1: The concept of sustainable SSL. Just as human experience is enriched and passed from one generation to the next in human society, we let an SSL model inherit the knowledge of a pretrained SSL base model to achieve superior representation learning ability for "sustainable" learning, and to improve learning efficiency over training a new SSL model from scratch.

Fig. 1 illustrates sustainable SSL in more detail: we call the new SSL model to be trained the new model and the pretrained SSL model the base model. To surpass the base model, in sustainable SSL the new model exploits not only the implicit knowledge of the base model but also knowledge absent from the base model. Such a learning process is fully self-supervised and differs from self-training schemes (Xie et al., 2020; Yalniz et al., 2019) that require labels for supervised learning.

