

Abstract

Vision transformers have generated significant interest in the computer vision community because of their flexibility in exploiting contextual information, whether sharply confined and local or long-range and global. However, they are known to be data hungry. This has motivated research into self-supervised transformer pretraining, which does not need to decode the semantic information conveyed by labels and link it to image properties, but instead focuses directly on extracting a concise representation of the image data that reflects the notion of similarity and is invariant to nuisance factors. The key vehicle of the self-learning process used by the majority of self-supervised methods is the generation of multiple views of the training data and the creation of pretext tasks which use these views to define the notion of image similarity and data integrity. However, this approach lacks the natural propensity to extract contextual information. We propose Group Masked Model Learning (GMML), a Self-Supervised Learning (SSL) mechanism for pretraining vision transformers with the ability to extract the contextual information present in all the concepts in an image. This is achieved by randomly manipulating groups of connected tokens, thereby covering a meaningful part of a semantic concept, and then recovering the hidden semantic information from the visible part of the concept. GMML implicitly introduces a novel data augmentation process. Unlike most existing SSL approaches, GMML requires neither a momentum encoder nor careful implementation details such as large batches and gradient stopping, which are artefacts of many current SSL techniques.
Since its conception at the beginning of 2021, GMML has remained an unbeaten SSL method with several desirable benefits, and it marked a significant milestone in computer vision as one of the first self-supervised pretraining methods to consistently outperform supervised pretraining by a large margin. GMML is currently the best mechanism to extract information from a given dataset and instil this information into a transformer's weights. The code will be made publicly available for the community to train on bigger corpora.

Impact of GMML:

We proposed GMML for self-supervised learning of vision transformers at the beginning of 2021, using a masked autoencoder with a reconstruction loss; however, the idea is generally applicable to other losses [1, 2, 3]. The merits of GMML were demonstrated with small models and small/medium-scale datasets owing to extremely restricted computational resources. Since then, GMML has been widely adopted in computer vision and related fields. Towards the end of 2021, SimMIM [4] and MAE [5] extended GMML with a reconstruction loss using huge vision transformers on large-scale datasets such as ImageNet-1K [6]. GMML is now the leading SSL framework in multiple application areas, giving state-of-the-art results for image classification [2], segmentation [4], audio analysis [7], medical image analysis [8, 9], video representation [10], etc.

1. INTRODUCTION

Vision transformers (ViTs) [11] have shown tremendous potential thanks to the self-attention mechanism, which is able to model global context. Borrowing ideas from Natural Language Processing (NLP) [12, 13], ViTs treat an image as a 1D sequence of visual tokens. This induces a lack of intrinsic inductive bias for modelling local visual structure, and ViTs therefore require orders of magnitude more data to learn such bias [11]. Very recently, vision transformers have been shown to perform well on ImageNet-1K [6] without external data [14]; however, they need distillation approaches and guidance from their CNN counterparts. Another hindrance preventing widespread adoption of ViTs is their tremendous computational demand [11], despite improvements in vision transformer architecture design [15, 16]. These drawbacks particularly affect AI researchers with a smaller resource budget. Self-supervised pretraining (SSP) can be an alternative to data-hungry supervised pretraining (SP) of ViTs. Thanks to its success, SSP of transformers is the de facto standard in NLP [13]. In computer vision, however, SP is still the default owing to its superiority over SSP. Tremendous progress in SSL for visual data was marked by recent methods [17, 18, 19, 20] prior to GMML. A common theme of these non-GMML methods is the learning of representations that are invariant to different views (distortions/augmentations) of the visual data by maximising the similarity between these views. However, most of these approaches are prone to trivial constant solutions. To avoid trivial solutions, these SSL approaches rely on careful implementation details such as large batches, gradient stopping, weight updates by moving average, asymmetric projection heads, etc. In contrast to existing unsupervised learning approaches, GMML exploits information redundancy and complementarity in the image data by learning to reconstruct local content by linking it to context.
In spirit, this principle is similar to the masked language modelling (MLM) used in BERT [13], which recovers masked words from context. The principle of predicting words from context is also inspired by word2vec [21]. In computer vision, we take inspiration from the principle of the denoising autoencoder [22] and from the idea of the context encoder [23], both of which have been studied for unsupervised learning with CNNs. The main aim of this study is to extend the principles of MLM, denoising autoencoders, and context encoders to vision transformers for self-supervised learning. This is achieved by three principles: i) learning to reconstruct the input stimulus by a mechanism akin to autoencoding, implemented by means of random data perturbation using masking of groups of connected tokens, etc.; ii) a perception-action mechanism [24], which learns to recognise an action from its impact on perception; and iii) learning the notion of similarity of content from the preservation of content identity in the data. The proposed SSL approach is instrumental in extracting an intrinsic data model and is admirably able to adapt to downstream tasks by finetuning. GMML addresses the data-efficiency issues of ViTs by investigating how to train vision transformers from scratch, using limited data, by means of self-supervised pretraining, without using any external data. The proposed methodology of transformer pretraining by self-supervision is expected to have a significant impact on the advancement of science by enabling the wider research community starved of resources to contribute to deep learning. The main contributions and findings of this study are summarised as follows:
• We introduce GMML, a simple method for SSL of visual representations using transformers, inspired by the MLM of BERT, denoising autoencoders, and context encoders.
• We endow the GMML architecture with a decoder and demonstrate that it can be implemented by essentially a couple of pointwise convolutional (linear) layers, thanks to the intrinsic characteristics of the transformer. This transformer-based autoencoder avoids the need for the whole decoder block which is typically present in CNN-based encoder-decoders.
• The amount of labelled training data required for finetuning on a downstream task is two orders of magnitude lower than with supervised pretraining and finetuning.
• The total amount of training data (labelled and unlabelled) is also orders of magnitude lower.
• GMML outperforms state-of-the-art supervised/self-supervised methods on small, medium, and large datasets by large margins, reaching +35% improvement.
• To the best of our knowledge, GMML is one of the first self-supervised pretraining methods in computer vision to outperform supervised pretraining. We hope that this will enable vision transformers to enjoy the same success as BERT in NLP.
• GMML is among the concurrent SSL methods which neither suffer from trivial solutions nor require careful implementation details, others being Barlow Twins [20] and VICReg [25].
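The pointwise decoder in the second contribution can be illustrated in a few lines: because each linear layer acts on every token independently (equivalent to a 1x1 convolution over the token grid), reconstruction needs no spatial decoder block. The following NumPy sketch uses illustrative dimensions and randomly initialised weights (in practice they are learned jointly with the encoder), with ReLU standing in for the usual GELU; all names are ours, not the paper's code.

```python
import numpy as np

def pointwise_decoder(tokens, patch_size=16, in_chans=3, seed=0):
    """Sketch of a decoder made of two pointwise (per-token) linear layers.

    tokens: (N, D) array of encoder token embeddings.
    Returns (N, patch_size*patch_size*in_chans) reconstructed patch pixels.
    Weights are randomly initialised here purely for illustration.
    """
    n, d = tokens.shape
    out_dim = patch_size * patch_size * in_chans
    rng = np.random.default_rng(seed)
    w1, b1 = rng.normal(0, 0.02, (d, d)), np.zeros(d)
    w2, b2 = rng.normal(0, 0.02, (d, out_dim)), np.zeros(out_dim)
    h = np.maximum(tokens @ w1 + b1, 0.0)  # per-token linear + nonlinearity
    return h @ w2 + b2                     # per-token projection to pixels
```

Since every token is decoded independently, the parameter count stays small and the decoder can simply be discarded after pretraining.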

2. METHOD

Unlike recent SSL joint-embedding based methods [17, 18, 19, 20, 25, 26, 27, 28], GMML does not rely on maximising similarity between joint embeddings of different views of the image. Instead, GMML is motivated by the successful NLP pretext task of MLM [13], and by the denoising autoencoder [22] and context encoder [23] in the image domain. There are several considerations when designing MLM
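The masking of groups of connected tokens at the heart of GMML can be sketched as follows. This is a minimal illustration assuming a square token grid; the function name, block-size bound, and mask ratio are illustrative choices, not the paper's exact recipe.

```python
import numpy as np

def gmml_group_mask(grid_h, grid_w, mask_ratio=0.5, max_block=4, seed=None):
    """Mask random rectangular groups of connected tokens on a
    (grid_h x grid_w) token grid until ~mask_ratio of tokens are hidden.

    Returns a boolean array of shape (grid_h, grid_w); True = masked.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    target = int(mask_ratio * grid_h * grid_w)
    while mask.sum() < target:
        # Pick a random block size and top-left corner for a connected group.
        bh = rng.integers(1, max_block + 1)
        bw = rng.integers(1, max_block + 1)
        top = rng.integers(0, grid_h - bh + 1)
        left = rng.integers(0, grid_w - bw + 1)
        mask[top:top + bh, left:left + bw] = True
    return mask
```

Because whole rectangular groups are hidden rather than isolated tokens, a meaningful part of a semantic concept is covered, and the network must exploit the surrounding visible context to recover it.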

