MASKED UNSUPERVISED SELF-TRAINING FOR LABEL-FREE IMAGE CLASSIFICATION

Abstract

State-of-the-art computer vision models are mostly trained with supervised learning on human-labeled images, which limits their scalability due to the expensive annotation cost. While self-supervised representation learning has achieved impressive progress, it still requires a second stage of finetuning on labeled data. On the other hand, models pre-trained with large-scale text-image supervision (e.g., CLIP) have enabled zero-shot transfer to downstream image classification tasks. However, the zero-shot performance of CLIP-like models is often insufficient for real-world adoption. In this paper, we aim to leverage the abundant unlabeled data from a target domain to improve the performance of a pre-trained zero-shot classifier by unsupervised finetuning of the pre-trained model. We propose Masked Unsupervised Self-Training (MUST), a new unsupervised adaptation method which leverages two different and complementary sources of training signal: pseudo-labels and raw images. MUST jointly optimizes three objectives to learn both class-level global features and pixel-level local features, and enforces a regularization between the two. We demonstrate the efficacy of MUST on a variety of downstream tasks, where it improves upon CLIP by a large margin. MUST also outperforms supervised few-shot adaptation methods. It achieves a top-1 accuracy of 77.7% on ImageNet using ViT-B, +9.4% higher than CLIP and +6.2% higher than 16-shot CLIP adaptation. Our code is available at https://github.com/salesforce/MUST.

1. INTRODUCTION

Zero-shot image classification is a challenging goal that marks the capability of a vision model to solve tasks without human supervision. Recently, vision-language pre-training (e.g., CLIP (Radford et al., 2021)) has shown promising performance on open-vocabulary zero-shot classification, where it leverages web-scale image-text pairs to train image and text encoders that can be transferred to downstream tasks through natural language prompting. However, the zero-shot performance of CLIP is often inadequate for real-world adoption, especially when compared to models trained with supervised learning. On the other hand, abundant unlabeled data are available for many tasks. In this paper, we aim to improve the performance of an open-vocabulary zero-shot classifier by finetuning it on unlabeled images from a downstream task.

Given unlabeled data, the key question is: what is the source of supervision? Numerous papers have attempted to answer this question. Among them, self-training and self-supervised learning are two of the most dominant approaches. In self-training (Zoph et al., 2020; Xie et al., 2020; Lee et al., 2013), pseudo-labels are generated by a teacher model and then used to supervise task-specific training of a student model, where the student usually has the same model size as the teacher. On the other hand, self-supervised learning methods are generally task-agnostic. Masked image modeling (Bao et al., 2022; He et al., 2021; Xie et al., 2022), which trains the model to predict missing information from masked image patches, has recently emerged as the superior self-supervised learning method for vision transformers (ViTs) (Dosovitskiy et al., 2021).

However, both self-training and self-supervised learning have their limitations. Self-training relies on pseudo-labels as the only source of supervision, and is thus prone to overfitting to the noise in the pseudo-labels.
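The pseudo-labeling step at the core of self-training can be made concrete with a minimal sketch. The function below is illustrative only (it is not the procedure proposed in this paper, and the threshold value is an arbitrary choice): given the teacher's logits for a batch of unlabeled images, it keeps only the samples whose predicted class probability exceeds a confidence threshold, a common filtering heuristic used to reduce pseudo-label noise.

```python
# Hedged sketch of confidence-thresholded pseudo-labeling for self-training.
# All names and the threshold value are illustrative assumptions.
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over class scores.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def pseudo_label(teacher_logits, threshold=0.7):
    """Return (kept indices, pseudo-labels) for confident teacher predictions.

    teacher_logits: (N, C) array of unnormalized class scores.
    Samples whose max softmax probability falls below `threshold`
    receive no pseudo-label and are excluded from the student's loss.
    """
    probs = softmax(teacher_logits)
    confidence = probs.max(axis=1)
    keep = np.nonzero(confidence >= threshold)[0]
    return keep, probs[keep].argmax(axis=1)

# Toy example: 3 unlabeled samples, 4 classes.
logits = np.array([[4.0, 0.1, 0.0, 0.2],   # confident -> pseudo-label 0
                   [0.5, 0.6, 0.4, 0.5],   # uncertain -> dropped
                   [0.0, 0.1, 3.5, 0.2]])  # confident -> pseudo-label 2
idx, labels = pseudo_label(logits)
print(idx.tolist(), labels.tolist())  # -> [0, 2] [0, 2]
```

The student is then trained with a standard classification loss on the kept samples only; this filtering is exactly why self-training inherits any systematic errors the teacher makes above the confidence threshold.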
Self-supervised learning requires an additional stage of task-specific finetuning on labeled data, and is thus not a one-stop solution. In this paper, we propose Masked Unsupervised Self-Training (MUST), a simple and effective method for label-free image classification. MUST performs unsupervised learning using both pseudo-labels and raw images as two different and complementary training signals. Specifically, MUST jointly

