ONE STEP TOWARDS SUSTAINABLE SELF-SUPERVISED LEARNING

Abstract

Although increasingly expensive to train, most self-supervised learning (SSL) models have repeatedly been trained from scratch but not fully utilized, since only a few SOTA models are employed for downstream tasks. In this work, we explore a sustainable SSL framework with two major challenges: i) learning a stronger new SSL model based on an existing pretrained SSL model, also called the "base" model, in a cost-friendly manner, and ii) allowing the training of the new model to be compatible with various base models. We propose a Target-Enhanced Conditional (TEC) scheme which introduces two components to existing mask-reconstruction based SSL. First, we propose patch-relation enhanced targets, which enhance the targets given by the base model and encourage the new model to learn semantic-relation knowledge from the base model using incomplete inputs. This input hardening and target enhancement help the new model surpass the base model, since they enforce additional patch-relation modeling to handle the incomplete input. Second, we introduce a conditional adapter that adaptively adjusts the new model's predictions to align with the targets of different base models. Extensive experimental results show that our TEC scheme accelerates learning and also improves SOTA SSL base models, e.g., MAE and iBOT, taking an explorative step towards sustainable SSL.

1. INTRODUCTION

Self-supervised learning (SSL) has achieved overwhelming success in unsupervised representation learning, with remarkably high performance in many downstream tasks such as classification (Zhou et al., 2022a;b), object detection, and segmentation (Bao et al., 2021; He et al., 2022). In SSL, a pretext task is first built, e.g., instance discrimination (He et al., 2020; Chen* et al., 2021) or masked image modeling (MIM) (Bao et al., 2021; He et al., 2022), and pseudo labels generated by the pretext task are then used to train a network without manual labels. Though successful, SSL is developing in a direction of ever-larger training costs, e.g., 200 training epochs for MoCo (He et al., 2020) but 1,600 epochs for MAE (He et al., 2022) to release its potential. Unfortunately, most researchers have limited computational budgets and often cannot afford to train large SSL models. Moreover, pretrained non-SOTA SSL models are rarely used in practice, since the SOTA is updated frequently and a previous model quickly becomes obsolete, wasting huge training resources. Thus, a sustainable SSL framework is in great demand.

Qualitatively, previous SSL models miss some semantic regions, e.g., ears, whereas TEC with iBOT as the base model captures all semantics and well distinguishes the different components of an input image. Because of this more powerful ability to capture comprehensive semantics, TEC helps achieve the challenging goal of sustainable SSL and can provide rich and flexible semantics for downstream tasks. However, different SSL base models can have different properties due to their different training targets and training strategies, e.g., iBOT models capture more category semantics while MAE models retain more image details (He et al., 2022). It is therefore important to build high-quality and compatible reconstruction targets from the base model so that the new model learns these targets in a complementary manner.
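To make the setup concrete, mask-reconstruction against a pretrained base model can be sketched as below. The helpers `base_encode` and `new_predict`, the list-of-lists patch features, and the 75% mask ratio are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import random

def mim_with_base_targets(patches, base_encode, new_predict, mask_ratio=0.75):
    """Sketch of mask-reconstruction with base-model targets.

    patches:     list of N patch feature vectors (each a list of floats).
    base_encode: hypothetical helper; maps the full patch list to target
                 features (the base model sees the complete image).
    new_predict: hypothetical helper; sees only the visible patches and
                 predicts features for the masked patch indices.
    Returns the mean-squared reconstruction error over masked patches.
    """
    n = len(patches)
    masked = set(random.sample(range(n), int(n * mask_ratio)))
    visible = [p for i, p in enumerate(patches) if i not in masked]

    targets = base_encode(patches)                 # full-image targets
    preds = new_predict(visible, sorted(masked))   # predictions for masked patches

    loss = 0.0
    for k, i in enumerate(sorted(masked)):
        diff = [a - b for a, b in zip(preds[k], targets[i])]
        loss += sum(d * d for d in diff) / len(diff)
    return loss / max(len(masked), 1)
```

Because the new model must predict base-model features for patches it never sees, it is forced to model relations among patches rather than merely copy the base model.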
A good model target should reveal the semantic relations among patches, e.g., the relation between car wheels and the car body, so that the new model can learn these general relation patterns and adapt to downstream tasks. To this end, we propose to enhance the target quality of the base model with two complementary reconstruction targets: a) patch-dim normalization, which normalizes base model targets along the patch dimension to enhance the relations among input patches, and b) patch attention maps with rich semantics, which filter out possible noise and establish the correlation between the whole-image semantics and the patch semantics. For target compatibility, we introduce conditional adapters into the new model so that its predictions can adapt to various base models with different properties. Given a base model target, the adapters conditionally activate and adjust mid-level features of the new model to predict the target more effectively. These adapters are discarded after pretraining, but if kept they can serve parameter-efficient finetuning (Jia et al., 2022; Chen et al., 2022b).
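The patch-dim normalization above can be sketched in a few lines: for each feature channel, statistics are computed across all patches (rather than across channels, as in the usual layer normalization), so the normalized target encodes how each patch compares to the others. The pure-Python list representation and the epsilon value are illustrative assumptions of this sketch.

```python
import math

def patch_dim_normalize(targets, eps=1e-6):
    """Normalize base-model patch features along the patch dimension.

    targets: list of N patch feature vectors, each of length D.
    For every channel j, subtract the mean and divide by the std computed
    over the N patches, emphasizing inter-patch relations per channel.
    """
    n = len(targets)
    d = len(targets[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):                                  # one channel at a time
        col = [targets[i][j] for i in range(n)]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        std = math.sqrt(var + eps)
        for i in range(n):
            out[i][j] = (targets[i][j] - mean) / std
    return out
```

After this step, each channel has (approximately) zero mean and unit variance across patches, so a patch's target value is meaningful only relative to the other patches of the same image.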




Figure 1: The concept of sustainable SSL. Just as human experience is enriched and passed from one generation to the next in human society, we let an SSL model inherit the knowledge of a pretrained SSL base model to achieve superior representation learning ability for "sustainable" learning, and also to improve learning efficiency over training a new SSL model from scratch.

Fig. 1 illustrates sustainable SSL, in which we call the new SSL model to be trained the new model and the pretrained SSL model the base model. To surpass the base model in sustainable SSL, the new model exploits not only the implicit knowledge of the base model but also knowledge absent from the base model. Such a learning process is fully self-supervised and differs from self-training schemes (Xie et al., 2020; Yalniz et al., 2019) that require labels for supervised learning.

Figure 3: Top-1 accuracy on ImageNet-1k. TEC models share the same color as their base model.

We call the above method for sustainable SSL Target-Enhanced Conditional (TEC) mask-reconstruction. As shown in Fig. 3, on ImageNet, TEC without any extra training data improves SSL base models, e.g., MAE (He et al., 2022) and iBOT (Zhou et al., 2022a), by a remarkable margin. For instance, taking iBOT trained for 1600 epochs as the base model, TEC with only 800 training epochs yields a 1.0% improvement. Moreover, we find that TEC can significantly accelerate the SSL learning process and save training cost. For example, training TEC for only 100 epochs from random initialization with a 300-epoch-trained MAE base model outperforms MAE trained for 1600 epochs. This work takes one step closer to sustainable SSL, and we hope our initial effort will inspire future work to sustainably improve SSL in a cost-friendly manner.

