MC-SSL: TOWARDS MULTI-CONCEPT SELF-SUPERVISED LEARNING

Abstract

Self-supervised pretraining is the method of choice for natural language processing models and is rapidly gaining popularity in many vision tasks. Recently, self-supervised pretraining has been shown to outperform supervised pretraining for many downstream vision applications, marking a milestone in the area. This superiority is attributed to the negative impact of incomplete labelling of the training images, which convey multiple concepts but are annotated using a single dominant class label. Although Self-Supervised Learning (SSL) is, in principle, free of this limitation, the choice of a pretext task facilitating SSL can perpetuate this shortcoming by driving the learning process towards a single-concept output. This study aims to investigate the possibility of modelling all the concepts present in an image without using labels. In this respect, the proposed Multi-Concept SSL (MC-SSL) framework is a step towards unsupervised learning that embraces all the diverse content in an image, with the aim of explicitly modelling every concept it contains. MC-SSL involves two core design steps: Group Mask Model Learning (GMML) and learning of pseudo-concepts for data tokens using a momentum encoder framework. The experimental results on multi-label and multi-class image classification downstream tasks demonstrate that MC-SSL not only surpasses existing SSL methods but also outperforms supervised transfer learning. The source code will be made publicly available for the community to train on larger corpora.
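The two core design steps above can be illustrated with a minimal sketch: GMML-style masking of spatially grouped patch tokens (so that each masked region is likely to cover a semantic concept rather than isolated pixels), and the exponential-moving-average (EMA) update underlying a momentum encoder. This is an illustrative simplification under assumed parameter values (grid size, mask ratio, group size, momentum), not the authors' implementation.

```python
import numpy as np

def gmml_mask(num_patches_side, mask_ratio=0.5, group_size=3, rng=None):
    """Build a GMML-style mask over a square grid of patch tokens.

    Whole group_size x group_size blocks of neighbouring patches are
    masked (rather than isolated patches), until at least mask_ratio
    of the tokens are covered.  Returns a boolean grid, True = masked.
    """
    rng = rng or np.random.default_rng()
    h = w = num_patches_side
    mask = np.zeros((h, w), dtype=bool)
    target = int(mask_ratio * h * w)
    while mask.sum() < target:
        # Drop a block of connected patches at a random location.
        r = int(rng.integers(0, h - group_size + 1))
        c = int(rng.integers(0, w - group_size + 1))
        mask[r:r + group_size, c:c + group_size] = True
    return mask

def ema_update(teacher_params, student_params, momentum=0.996):
    """Momentum-encoder step: teacher <- m * teacher + (1 - m) * student."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

In a full pipeline, the student network would reconstruct the pixels of the masked tokens while the momentum (teacher) encoder, updated by `ema_update`, would supply pseudo-concept targets for the individual tokens.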

1. INTRODUCTION

Recent advances in self-supervised learning [1, 2, 3, 4, 5, 6] have shown great promise for downstream applications, particularly for image classification datasets with labels highlighting one dominant concept per image (also known as multi-class datasets, e.g. ImageNet-1K [7]). A typical representative of these SSL approaches, DINO [6], trains the system to extract and associate image features with a single dominant concept, but it ignores the intrinsic multi-label nature of natural images, which depict more than one object/concept. The empirical evidence provided by the study of Stock and Cisse [8] showed that the remaining error on the ImageNet-1K dataset is predominantly due to the single-label annotation. Indeed, every pixel in an image is part of some semantic concept. However, collecting an exhaustive list of labels for every object/concept is not scalable and requires significant human effort, making large-scale supervised training infeasible for multi-label datasets. In view of this, it is pertinent to ask what the best strategy is to address these deficiencies. We believe that multi-concept self-supervised learning (i.e. the ability to represent each concept/object in an image without using labels) is the principled way forward. This study is a step towards building a self-supervised (SS) framework that is capable of learning representations for all the objects in an image. Once the multiple concepts are extracted by MC-SSL without labels, the expensive multi-label annotated datasets will only be needed for evaluation. MC-SSL is built on Group Mask Model Learning (GMML), introduced in [1, 9]. In MC-SSL, the network is trained with two objectives: i) to reconstruct the raw pixel values of the data tokens manipulated by GMML and, crucially, ii) to learn patch-level concepts/classes for individual data tokens. MC-SSL forces the network to learn the representation of an object/concept, i.e. its properties such as colour, texture and structure, as well as its context, in order to reconstruct and recover the distorted data tokens using the information conveyed by the unmasked data tokens. This encourages all the

