CHILS: ZERO-SHOT IMAGE CLASSIFICATION WITH HIERARCHICAL LABEL SETS

Abstract

Open vocabulary models (e.g., CLIP) have shown strong performance on zero-shot classification through their ability to generate embeddings for each class based on its (natural language) name. Prior work has focused on improving the accuracy of these models through prompt engineering or by incorporating a small amount of labeled downstream data (via finetuning). In this paper, we aim to tackle classification problems with coarsely-defined class labels. We propose Classification with Hierarchical Label Sets (or CHiLS), an alternative strategy that proceeds in three steps: (i) for each class, produce a set of subclasses, using either existing label hierarchies or by querying GPT-3; (ii) perform the standard zero-shot CLIP procedure as though these subclasses were the labels of interest; (iii) map the predicted subclass back to its parent to produce the final prediction. Across numerous datasets (with implicit semantic hierarchies), CHiLS leads to improved accuracy, yielding gains of over 30% in situations where known hierarchies are available and more modest gains when they are not. CHiLS is simple to implement within existing CLIP pipelines and requires no additional training cost.

1. INTRODUCTION

Recently, machine learning researchers have become captivated by the remarkable capabilities of pretrained open vocabulary models (Radford et al., 2021; Wortsman et al., 2021; Jia et al., 2021; Gao et al., 2021; Pham et al., 2021; Cho et al., 2022; Pratt et al., 2022). These models, like CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021), learn to map images and captions into shared embedding spaces such that images are close in embedding space to their corresponding captions but far from randomly sampled captions. The resulting models can then be used to assess the relative compatibility of a given image with an arbitrary set of textual "prompts". Notably, Radford et al. (2021) observed that by inserting each class name directly within a natural language prompt, one can then use CLIP embeddings to assess the compatibility of an image with each of the possible classes. Thus, open vocabulary models are able to perform zero-shot image classification, and do so with high rates of success (Radford et al., 2021; Zhang et al., 2021b).

Despite these documented successes, the current interest in open vocabulary models poses a new question: How should we represent our classes for a given problem in natural language? As class names are now part of the inferential pipeline (as opposed to mostly an afterthought in traditional scenarios) for models like CLIP in the zero-shot setting, CLIP's performance is directly tied to the descriptiveness of the class "prompts" (Santurkar et al., 2022). While many researchers have focused on improving the quality of the prompts into which class names are embedded (Radford et al., 2021; Pratt et al., 2022; Zhou et al., 2022b; a; Huang et al., 2022), surprisingly little attention has been paid to improving the richness of the class names themselves.
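As a concrete reference point, the standard zero-shot procedure described above can be sketched as follows. The `toy_embed` encoder below is a deterministic stand-in for CLIP's text and image encoders (an assumption purely for illustration, not the actual model), while the prompt template and temperature scaling mimic common CLIP usage.

```python
import hashlib
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def zero_shot_predict(image_emb, class_names, embed_text):
    """Standard CLIP-style zero-shot classification over a set of class names."""
    prompts = [f"a photo of a {name}" for name in class_names]
    text_embs = np.stack([embed_text(p) for p in prompts])  # (num_classes, d)
    sims = text_embs @ image_emb                # cosine similarity (unit norms)
    probs = softmax(100.0 * sims)               # CLIP-style temperature scaling
    return class_names[int(np.argmax(probs))], probs

# Deterministic toy "encoder": hashes text to a seed, draws a unit vector.
def toy_embed(text, d=64):
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "little")
    v = np.random.default_rng(seed).standard_normal(d)
    return v / np.linalg.norm(v)

# An image whose embedding happens to match the "dog" prompt exactly:
pred, probs = zero_shot_predict(toy_embed("a photo of a dog"),
                                ["dog", "cat"], toy_embed)  # pred == "dog"
```

The key point of the sketch is that the class names enter the pipeline only through the prompt strings, which is why their descriptiveness matters.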
This can be particularly crucial in cases where class names are not very informative or are too coarsely defined to match the sort of descriptions that might arise in natural captions. Consider, for example, the classes "large man-made outdoor things" and "reptiles" in the CIFAR20 dataset (Krizhevsky, 2009): neither label is likely to appear verbatim in a natural image caption.

In this paper, we introduce a new method to tackle zero-shot classification with CLIP models for problems with coarsely-defined class labels. We refer to our method as Classification with Hierarchical Label Sets (CHiLS for short). Our method utilizes a hierarchical map to convert each class into a list of subclasses, performs normal CLIP zero-shot prediction across the union set of all subclasses, and finally uses the inverse mapping to convert the subclass prediction to the requisite superclass. We additionally include a reweighting step wherein we leverage the raw superclass probabilities in order to make our method robust to less-confident predictions at the superclass and subclass level.

Figure 1: (Left) Standard CLIP pipeline for zero-shot classification. At inference, CLIP takes as input a set of classes and an image on which we want to make a prediction, and predicts one of those classes. (Right) Our proposed method, CHiLS, for leveraging hierarchical class information in the zero-shot pipeline. We map each individual class to a set of subclasses, perform inference in the subclass space (i.e., the union set of all subclasses), and map the predicted subclass back to its original superclass.

We evaluate CHiLS on a wide array of image classification benchmarks with and without available hierarchical information. In the former case, leveraging preexisting hierarchies leads to strong accuracy gains across all datasets.
In the latter, we show that rather than enumerating the hierarchy by hand, using GPT-3 to query a list of possible subclasses for each class (whether or not they are actually present in the dataset) still leads to consistently improved accuracy over raw superclass prediction. We summarize our main contributions below:

• We propose CHiLS, a new method for improving zero-shot CLIP performance in scenarios with ill-defined and/or overly general class structures, which requires no labeled data or training time and is flexible to both existing and synthetically generated hierarchies.

• We show that CHiLS consistently performs as well as or better than standard practice in situations with only synthetic hierarchies, and that CHiLS can achieve up to 30% accuracy gains when ground-truth hierarchies are available.
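The three steps of CHiLS, together with the reweighting step, can be sketched as below. The `fake_embed` one-hot encoder is a hypothetical stand-in so the sketch runs without an actual CLIP model, and reweighting each subclass score by its parent's superclass probability is one plausible instantiation of the reweighting step described above, not necessarily the paper's exact rule.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def chils_predict(image_emb, hierarchy, embed_text):
    """CHiLS sketch: subclass inference, inverse mapping, and reweighting.

    hierarchy: dict mapping each superclass name to a list of subclass names.
    embed_text: stand-in for CLIP's text encoder (assumption for illustration).
    """
    supers = list(hierarchy)
    subs, parents = [], []
    for sup in supers:                        # (i) build the union subclass set
        for sub in hierarchy[sup]:
            subs.append(sub)
            parents.append(sup)

    def probs_for(names):
        embs = np.stack([embed_text(f"a photo of a {n}") for n in names])
        return softmax(100.0 * (embs @ image_emb))  # CLIP-style temperature

    sup_probs = dict(zip(supers, probs_for(supers)))
    sub_probs = probs_for(subs)               # (ii) zero-shot over subclasses
    # Reweight each subclass by its parent's raw superclass probability.
    weighted = sub_probs * np.array([sup_probs[p] for p in parents])
    return parents[int(np.argmax(weighted))]  # (iii) map back to the superclass

# Hypothetical stand-in encoder: one-hot "embeddings" per known name.
NAMES = ["reptiles", "mammals", "snake", "lizard", "dog", "cat"]
def fake_embed(prompt):
    v = np.zeros(len(NAMES))
    v[NAMES.index(prompt.removeprefix("a photo of a "))] = 1.0
    return v

hierarchy = {"reptiles": ["snake", "lizard"], "mammals": ["dog", "cat"]}
print(chils_predict(fake_embed("a photo of a snake"), hierarchy, fake_embed))
```

Note that the superclass labels never need to appear in the winning prompt: the prediction is made in the richer subclass space and only then mapped back.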

2. RELATED WORK

2.1 TRANSFER LEARNING

While the focus of this paper is on improving CLIP models in the zero-shot regime, there is a large body of work exploring improvements to CLIP's few-shot capabilities. In the standard fine-tuning paradigm for CLIP models, practitioners discard the text encoder and use only the image embeddings as inputs to some additional trained layers. This, however, leads to problems such as reduced robustness under distribution shift and catastrophic forgetting. One particular line of work on improving the fine-tuned capabilities of CLIP models leverages model weight interpolation. Wortsman et al. (2021) proposes to linearly interpolate the weights of a fine-tuned and a zero-shot CLIP model to improve the fine-tuned model under distribution shifts. This idea is extended by Wortsman et al. (2022) into a general-purpose paradigm for ensembling models' weights in order to improve robustness. Ilharco et al. (2022) then builds on both of these works and puts forth a method to "patch" fine-tuned and zero-shot CLIP weights together in order to avoid the issue of catastrophic forgetting. Among all the works in this section, our paper is perhaps most similar to this vein of work (albeit in spirit), as CHiLS too seeks to combine two different predictive methods. Ding et al. (2022) also tackles catastrophic forgetting, though they propose an orthogonal direction and fine-tune both the image encoder and the text encoder, where the latter draws from a replay vocabulary of text concepts from the original CLIP database.
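The weight-interpolation idea above fits in a few lines; the helper below (a hypothetical name, not taken from any of the cited papers' codebases) mixes two checkpoints parameter-by-parameter, with `alpha = 0` recovering the zero-shot model and `alpha = 1` the fine-tuned one.

```python
import numpy as np

def interpolate_checkpoints(zero_shot, fine_tuned, alpha):
    """Linearly interpolate two checkpoints (dicts of parameter arrays).

    alpha = 0 returns the zero-shot weights, alpha = 1 the fine-tuned ones;
    intermediate values trade in-distribution accuracy against robustness,
    in the spirit of Wortsman et al. (2021).
    """
    assert zero_shot.keys() == fine_tuned.keys()
    return {name: (1.0 - alpha) * zero_shot[name] + alpha * fine_tuned[name]
            for name in zero_shot}

# Tiny illustrative checkpoints:
zs = {"w": np.array([0.0, 2.0]), "b": np.array([1.0])}
ft = {"w": np.array([2.0, 0.0]), "b": np.array([3.0])}
mid = interpolate_checkpoints(zs, ft, 0.5)  # {"w": [1.0, 1.0], "b": [2.0]}
```

Because the interpolation is per-parameter, it requires the two models to share an architecture, which is exactly the setting of fine-tuning a zero-shot CLIP model.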

