MASKED UNSUPERVISED SELF-TRAINING FOR LABEL-FREE IMAGE CLASSIFICATION

Abstract

State-of-the-art computer vision models are mostly trained with supervised learning using human-labeled images, which limits their scalability due to the expensive annotation cost. While self-supervised representation learning has achieved impressive progress, it still requires a second stage of finetuning on labeled data. On the other hand, models pre-trained with large-scale text-image supervision (e.g., CLIP) have enabled zero-shot transfer to downstream image classification tasks. However, the zero-shot performance of CLIP-like models is often insufficient for real-world adoption. In this paper, we aim to leverage the abundant unlabeled data from a target domain to improve the performance of a pre-trained zero-shot classifier, by unsupervised finetuning of the pre-trained model. We propose Masked Unsupervised Self-Training (MUST), a new unsupervised adaptation method which leverages two different and complementary sources of training signals: pseudo-labels and raw images. MUST jointly optimizes three objectives to learn both class-level global features and pixel-level local features, and enforces a regularization between the two. We demonstrate the efficacy of MUST on a variety of downstream tasks, where it improves upon CLIP by a large margin. MUST also outperforms supervised few-shot adaptation methods. It achieves a top-1 accuracy of 77.7% on ImageNet using ViT-B, +9.4% higher than CLIP, and +6.2% higher than 16-shot CLIP adaptation. Our code is available at https://github.com/salesforce/MUST.

1. INTRODUCTION

Zero-shot image classification is a challenging goal that marks the capability of a vision model to solve tasks without human supervision. Recently, vision-language pre-training (e.g., CLIP (Radford et al., 2021)) has shown promising performance on open-vocabulary zero-shot classification, where it leverages web-scale image-text pairs to train image and text encoders that can be transferred to downstream tasks through natural language prompting. However, the zero-shot performance of CLIP is often inadequate for real-world adoption, especially when compared to models trained with supervised learning. On the other hand, abundant unlabeled data is available for many tasks. In this paper, we aim to improve the performance of an open-vocabulary zero-shot classifier by finetuning it on unlabeled images from a downstream task.

Given unlabeled data, the key question is: what is the source of supervision? Numerous papers have attempted to answer this question; among them, self-training and self-supervised learning are two of the most dominant approaches. In self-training (Zoph et al., 2020; Xie et al., 2020; Lee et al., 2013), pseudo-labels are generated by a teacher model and then used to supervise task-specific training of a student model, where the student usually has the same model size as the teacher. On the other hand, self-supervised learning methods are generally task-agnostic. Masked image modeling (Bao et al., 2022; He et al., 2021; Xie et al., 2022), which trains the model to predict missing information from masked image patches, has recently emerged as a superior self-supervised learning method for vision transformers (ViT) (Dosovitskiy et al., 2021). However, both self-training and self-supervised learning have their limitations. Self-training overly relies on the pseudo-labels as the only source of supervision, and is thus prone to overfitting to noise in the pseudo-labels.
Self-supervised learning requires an additional stage of task-specific finetuning on labeled data, and is thus not a one-stop solution.

In this paper, we propose Masked Unsupervised Self-Training (MUST), a simple and effective method for label-free image classification. MUST performs unsupervised learning using both pseudo-labels and raw images as two different and complementary training signals. Specifically, MUST jointly optimizes three objectives to finetune a pre-trained classification model (e.g., CLIP) on unlabeled images: (1) a self-training objective to learn global task-specific class predictions; (2) a masked image modeling objective to learn local pixel-level information; (3) a global-local feature alignment objective to bridge the knowledge learned from the two sources of supervision.

We validate the efficacy of MUST on 8 image classification tasks across a variety of domains, showing significant improvement over CLIP (Radford et al., 2021). MUST also outperforms supervised few-shot adaptation methods (Zhou et al., 2021; 2022). For instance, MUST achieves 77.7% top-1 accuracy on ImageNet, +9.4% higher than CLIP, and +6.2% higher than 16-shot CLIP adaptation. On certain domains, MUST achieves performance comparable to a fully-supervised method. We further perform extensive quantitative and qualitative analysis to examine the effect of each proposed component. MUST is a low-cost solution for image classification that unlocks the potential of CLIP-like models for practical scenarios where images are abundant but labels are scarce.
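To make the combined training signal concrete, the following numpy sketch shows plausible forms for the self-training and masked image modeling losses and their weighted sum. The confidence threshold, the L1 pixel-regression target, and the unit loss weights are illustrative assumptions rather than the exact settings used by MUST; the alignment loss is left as an input.

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def pseudo_label_loss(teacher_logits, student_logits, threshold=0.7):
    """(1) Self-training: cross-entropy against confident teacher pseudo-labels.
    Samples whose teacher confidence falls below `threshold` are masked out,
    filtering the noisiest pseudo-labels."""
    probs = softmax(teacher_logits)
    confident = probs.max(axis=-1) >= threshold
    labels = probs.argmax(axis=-1)
    log_student = np.log(softmax(student_logits) + 1e-12)
    ce = -log_student[np.arange(len(labels)), labels]
    return float((ce * confident).sum() / max(confident.sum(), 1))

def mim_loss(pred_patches, target_patches, patch_mask):
    """(2) Masked image modeling: pixel regression averaged over masked patches only."""
    per_patch = np.abs(pred_patches - target_patches).mean(axis=-1)  # (N, num_patches)
    return float((per_patch * patch_mask).sum() / max(patch_mask.sum(), 1))

def must_objective(loss_st, loss_mim, loss_align, w_mim=1.0, w_align=1.0):
    """Total loss: self-training + masked image modeling + global-local alignment."""
    return loss_st + w_mim * loss_mim + w_align * loss_align
```

In practice the teacher is typically a slowly updated (e.g., EMA) copy of the student, so the pseudo-labels improve as adaptation proceeds.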

2. RELATED WORK

Zero-shot learning traditionally aims to recognize unseen classes by training the model on base classes (Xian et al., 2017; Wang et al., 2019), where the most common approach is to utilize auxiliary information such as attributes (Huynh & Elhamifar, 2020) or knowledge graphs (Wang et al., 2018). CLIP (Radford et al., 2021) popularizes a new approach for open-vocabulary zero-shot image classification by leveraging natural language supervision from web-scale datasets. Despite its impressive performance, the zero-shot accuracy of CLIP still falls far below supervised methods in many domains. Some recent work adapts CLIP to downstream tasks using labeled data (Zhou et al., 2021; Gao et al., 2021), which is less scalable than our unsupervised adaptation method. Our method is also orthogonal to CLIP-like research in model pre-training and can be applied to other image classification models (Jia et al., 2021; Li et al., 2022; Yao et al., 2022; Li et al., 2021), such as LiT (Zhai et al., 2022), which performs two-stage pre-training.

Self-training has shown promising progress in many domains including vision (Zoph et al., 2020; Xie et al., 2020; Sahito et al., 2022), NLP (He et al., 2020a), and speech (Kahn et al., 2020). Our method is more closely related to the self-training approaches proposed for semi-supervised learning (Tarvainen & Valpola, 2017; Sohn et al., 2020; Berthelot et al., 2020), where pseudo-labels on unlabeled data are used as training targets. We construct our self-training objective by following three principles: (1) consistency regularization (Sohn et al., 2020; Laine & Aila, 2017), which requires the model to output the same prediction when the input is perturbed; (2) entropy minimization (Grandvalet & Bengio, 2004), which encourages the model to give "sharp" predictions with low entropy; (3) prediction fairness (Berthelot et al., 2020), which alleviates the model's bias towards certain classes.
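Two of these principles admit compact forms. The numpy sketch below gives one plausible implementation of temperature sharpening for entropy minimization and a KL-to-uniform fairness regularizer; the temperature value and the exact regularizer form are illustrative assumptions, and consistency regularization (comparing predictions on weakly and strongly augmented views) is omitted for brevity.

```python
import numpy as np

def sharpen(probs, T=0.5):
    """Entropy minimization via temperature sharpening:
    lower T pushes the distribution toward a one-hot, lowering its entropy."""
    p = probs ** (1.0 / T)
    return p / p.sum(axis=-1, keepdims=True)

def fairness_loss(probs):
    """Penalize deviation of the batch-average prediction from uniform:
    KL(uniform || mean prediction) = -mean(log p) - log K, which is zero
    when the average prediction is uniform and positive otherwise."""
    mean_p = probs.mean(axis=0)
    K = mean_p.shape[0]
    return float(-np.log(mean_p + 1e-12).mean() - np.log(K))
```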
Our method leverages a separate supervision signal from raw image pixels to reduce the confirmation bias in self-training, which is orthogonal to methods for pseudo-label debiasing (Wei et al., 2021; Wang et al., 2022). Masked image modeling, fueled by the success of vision transformers (Dosovitskiy et al., 2021), recently emerged as a more appealing self-supervised representation learning method than contrastive learning (He et al., 2020b; Chen et al., 2020; Hjelm et al., 2019). While some methods train the model to predict discrete tokens (Bao et al., 2022) or contextualized representations (Baevski et al., 2022) for masked image patches, MAE (He et al., 2021) and SimMIM (Xie et al., 2022) achieve competitive performance by simply predicting the pixel values. The MIM objective has also been used for test-time training (Gandelsman et al., 2022). Different from existing self-supervised learning methods, which require an additional stage of supervised finetuning on labeled data, we synergistically incorporate masked image modeling into self-training as a one-stage solution for zero-shot image classification.

3. METHOD

MUST is a simple unsupervised learning approach that adapts a pre-trained open-vocabulary classifier to a downstream task using unlabeled images. In this paper, we consider vision transformers pre-trained by CLIP (Radford et al., 2021) as the models to be adapted due to their distinctive zero-shot performance. CLIP pre-trains an image encoder and a text encoder with a contrastive loss such that paired images and texts have higher similarities in a shared embedding space. To perform zero-shot classification, CLIP converts a set of class names into text embeddings using an ensemble of prompt templates.
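As background for the adaptation procedure, CLIP-style zero-shot classification with prompt ensembling can be sketched as follows. The function names are illustrative, and the embeddings would in practice come from the CLIP text and image encoders; the averaging-then-renormalizing recipe follows the standard CLIP prompt-ensembling approach.

```python
import numpy as np

def class_embedding(prompt_embs):
    """Ensemble several prompt-template embeddings for one class:
    average them, then renormalize to unit length."""
    avg = prompt_embs.mean(axis=0)
    return avg / np.linalg.norm(avg)

def zero_shot_classify(image_emb, class_embs):
    """Predict the class whose text embedding has the highest
    cosine similarity with the image embedding."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_embs / np.linalg.norm(class_embs, axis=-1, keepdims=True)
    return int(np.argmax(txt @ img))
```

The resulting class scores can also be interpreted as logits, which is what allows the zero-shot classifier to serve as the initial teacher for pseudo-labeling.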

