MC-SSL: TOWARDS MULTI-CONCEPT SELF-SUPERVISED LEARNING

Abstract

Self-supervised pretraining is the method of choice for natural language processing models and is rapidly gaining popularity in many vision tasks. Recently, self-supervised pretraining has been shown to outperform supervised pretraining for many downstream vision applications, marking a milestone in the area. This superiority is attributed to the negative impact of incomplete labelling of the training images, which convey multiple concepts but are annotated with a single dominant class label. Although Self-Supervised Learning (SSL), in principle, is free of this limitation, the choice of a pretext task facilitating SSL can perpetuate this shortcoming by driving the learning process towards a single-concept output. This study investigates the possibility of modelling all the concepts present in an image without using labels. To this end, the proposed Multi-Concept SSL (MC-SSL) framework is a step towards unsupervised learning that embraces all the diverse content in an image, with the aim of explicitly modelling the information from all the concepts present in it. MC-SSL involves two core design steps: Group Mask Model Learning (GMML) and the learning of pseudo-concepts for data tokens using a momentum-encoder framework. Experimental results on multi-label and multi-class image classification downstream tasks demonstrate that MC-SSL not only surpasses existing SSL methods but also outperforms supervised transfer learning. The source code will be made publicly available for the community to train on larger corpora.

1. INTRODUCTION

Recent advances in self-supervised learning [1, 2, 3, 4, 5, 6] have shown great promise for downstream applications, particularly for image classification datasets whose labels highlight one dominant concept per image (also known as multi-class datasets). A typical representative of these SSL approaches, DINO [6], trains the system to extract and associate image features with a single dominant concept in a promising way, but it ignores the intrinsic multi-label nature of natural images, which depict more than one object/concept. The empirical evidence provided by the study of Stock and Cisse [8] showed that the remaining error on the ImageNet-1K dataset is predominantly due to single-label annotation. Indeed, every pixel in an image is part of some semantic concept. However, collecting an exhaustive list of labels for every object/concept is not scalable and requires significant human effort, making large-scale supervised training infeasible for multi-label datasets. In view of this, it is pertinent to ask what the best strategy is to address these deficiencies. We believe that multi-concept self-supervised learning, i.e. the ability to represent each concept/object in an image without using labels, is the principled way forward. This study is a step towards building a self-supervised (SS) framework that is capable of learning representations for all the objects in an image. Once the multiple concepts are extracted by MC-SSL without labels, the expensive multi-label annotated datasets will only be needed for evaluation. MC-SSL is built on Group Mask Model Learning (GMML), introduced in [1, 9]. In MC-SSL, the network is trained with two objectives: i) to reconstruct the raw pixel values of the GMML-manipulated data tokens and, crucially, ii) to learn patch-level concepts/classes for individual data tokens. MC-SSL forces the network to learn the representation of an object/concept, i.e.
its properties, such as colour, texture and structure, as well as its context, in order to recover the distorted data tokens using the information conveyed by the unmasked data tokens. This encourages all the data tokens to become context-aware. The ultimate role of the auxiliary but complementary task of learning a patch classifier is to assign a pseudo-semantic label to a group of context-aware data tokens covering an object. Our conjecture is that learning pseudo-labels for patches encourages data tokens on similar objects, within an image and across images, to belong to the same pseudo-class, promoting intra- and inter-image concept consistency. Our scientific hypothesis is that this context-based learning of objects across a collection of images through representation clustering will conform to human semantic labels.

The main contribution of this work is the introduction of a novel SSL framework which models the information conveyed by all the objects/concepts present in an image, rather than focusing on the dominant object. Most importantly, from a practical point of view, MC-SSL enables the training of data-hungry transformers from scratch with only a few thousand images. The possibility of training from scratch on limited data with high accuracy will have a significant impact on small AI research groups, companies, and application domains which rely on pretrained models. Additionally, we show that, although MC-SSL is unaware of the semantic concepts present in an image, the self-learnt grouping of data tokens corresponds to distinct semantic concepts, as evident from Figure 1. The impact of the proposed innovation is that MC-SSL outperforms state-of-the-art (SOTA) SSL methods by a large margin on multi-label classification tasks and achieves competitive results on multi-class tasks. Lastly, MC-SSL-based self-supervised pretraining outperforms supervised pretraining for downstream tasks.
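To make the two training objectives concrete, the sketch below illustrates GMML-style group masking, the pixel-space reconstruction loss on corrupted tokens, and the momentum (EMA) update of the teacher that would supply patch-level pseudo-concept targets. This is a minimal NumPy illustration, not the authors' implementation; the grid size, masking ratio, corruption scheme and EMA rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def group_mask(grid=14, block=2, ratio=0.5):
    """Mask connected blocks of patches (rather than isolated tokens)
    until roughly `ratio` of all data tokens are masked."""
    mask = np.zeros((grid, grid), dtype=bool)
    while mask.mean() < ratio:
        r = rng.integers(0, grid - block + 1)
        c = rng.integers(0, grid - block + 1)
        mask[r:r + block, c:c + block] = True
    return mask.reshape(-1)

def reconstruction_loss(pred, target, mask):
    """Objective i): l1 loss computed on the corrupted tokens only."""
    return np.abs(pred[mask] - target[mask]).mean()

def ema_update(teacher, student, m=0.996):
    """Momentum-encoder update: the teacher producing the patch-level
    pseudo-concept targets is an exponential moving average of the student."""
    return m * teacher + (1.0 - m) * student

tokens = rng.standard_normal((196, 32))   # 14x14 patch embeddings (toy size)
mask = group_mask()
corrupted = tokens.copy()
corrupted[mask] = 0.0                     # replace masked tokens (e.g. with noise)
pred = corrupted + 0.1                    # stand-in for the transformer's output
loss = reconstruction_loss(pred, tokens, mask)
```

Masking contiguous groups of tokens, rather than independent ones, is what forces the network to use surrounding context: an entire local region of the object is hidden, so its recovery must rely on the unmasked tokens.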

2. RELATED WORK

Self-supervised methods can roughly be categorised into generative and discriminative approaches. Generative approaches [10, 11, 12] learn to model the distribution of the data. However, data distribution modelling is generally computationally expensive and may not be necessary for representation learning in all scenarios. On the other hand, discriminative approaches, typically implemented in a contrastive learning framework [13, 14, 15, 16, 3, 2, 6, 17, 18, 19] or using pretext tasks [20, 21, 22, 1, 5, 23], demonstrate the ability to obtain better generalised representations with modest computational requirements. A large body of work on contrastive learning trains the model to discriminate between images, considering each of them as a different class. Such approaches require either large batches [15, 19] or memory banks [13, 24] to perform well, which is often computationally expensive. Another recent line of work [2, 25, 3, 17, 18, 6] has shown the feasibility of learning feature representations that are invariant to different augmented views of the same image, without requiring negative samples. Such approaches are prone to learning trivial embeddings. Grill et al. [3] prevent mode collapse by employing a momentum encoder with an asymmetric predictor. Barlow Twins [17] and VICReg [18] employ a covariance constraint. In particular, in Barlow Twins, the model is trained to obtain an identity cross-correlation matrix between the outputs of two identical networks fed with two distorted views of the same image.
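The cross-correlation objective of Barlow Twins can be stated very compactly. The sketch below is a schematic NumPy rendering of the idea; the weighting `lam` and the per-dimension standardisation follow the commonly used formulation, not necessarily the paper's exact code.

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Push the cross-correlation matrix of two embedding batches towards
    the identity: diagonal -> 1 (invariance to the augmentation),
    off-diagonal -> 0 (redundancy reduction between dimensions)."""
    n, _ = z_a.shape
    # standardise each embedding dimension over the batch
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-8)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-8)
    c = z_a.T @ z_b / n                                 # d x d cross-correlation
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()
    return on_diag + lam * off_diag

rng = np.random.default_rng(0)
z = rng.standard_normal((256, 32))
z_view = z + 0.05 * rng.standard_normal(z.shape)    # a second "view" of the batch
aligned = barlow_twins_loss(z, z_view)
mismatched = barlow_twins_loss(z, z[::-1].copy())   # pairing broken
```

Because the constraint acts on the correlation matrix rather than on instance discrimination, no negative samples, large batches or memory banks are needed to avoid trivial (collapsed) embeddings.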



Figure 1: MC-SSL acquires basic knowledge of objects, as shown by the self-learnt grouping of data tokens obtained by k-means clustering of the vision transformer's output data tokens, given only the number of clusters. Data tokens on objects of the same class have similar representations. This is achieved as a by-product of MC-SSL, without any labels or any self-supervised segmentation objective. Notice how the concepts are refined when asking for more concepts to be discovered. Concept learning remains a challenge for MC-SSL, as shown by the spread-out representations exemplified by the 5th column. This calls for training on bigger datasets and the design of even more advanced MC-SSL methods, which this research aims to stimulate.
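The grouping shown in the figure can be reproduced along the following lines: run k-means over the transformer's output data tokens and render each token's cluster index at its patch location. The sketch below uses a plain NumPy k-means on synthetic token embeddings, since the trained model's outputs are not reproduced here; the two Gaussian "concepts" are purely illustrative.

```python
import numpy as np

def kmeans(tokens, k, iters=20, seed=0):
    """Plain k-means: alternate nearest-centroid assignment and
    centroid re-estimation for a fixed number of iterations."""
    rng = np.random.default_rng(seed)
    centroids = tokens[rng.choice(len(tokens), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(tokens[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = tokens[labels == j].mean(axis=0)
    return labels

# synthetic 14x14 grid of token embeddings with two well-separated "concepts"
rng = np.random.default_rng(0)
tokens = np.concatenate([rng.normal(0.0, 0.1, (98, 16)),
                         rng.normal(3.0, 0.1, (98, 16))])
labels = kmeans(tokens, k=2)
segmentation = labels.reshape(14, 14)   # cluster index per patch location
```

In the actual figure, varying `k` corresponds to asking for more or fewer concepts to be discovered; no segmentation supervision enters at any point.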

