CHILS: ZERO-SHOT IMAGE CLASSIFICATION WITH HIERARCHICAL LABEL SETS

Abstract

Open vocabulary models (e.g., CLIP) have shown strong performance on zero-shot classification through their ability to generate embeddings for each class based on its (natural language) name. Prior work has focused on improving the accuracy of these models through prompt engineering or by incorporating a small amount of labeled downstream data (via finetuning). In this paper, we aim to tackle classification problems with coarsely-defined class labels. We propose Classification with Hierarchical Label Sets (or CHiLS), an alternative strategy that proceeds in three steps: (i) for each class, produce a set of subclasses, using either existing label hierarchies or by querying GPT-3; (ii) perform the standard zero-shot CLIP procedure as though these subclasses were the labels of interest; (iii) map the predicted subclass back to its parent to produce the final prediction. Across numerous datasets with implicit semantic hierarchies, CHiLS improves accuracy, yielding gains of over 30% in situations where known hierarchies are available and more modest gains when they are not. CHiLS is simple to implement within existing CLIP pipelines and requires no additional training cost.

1. INTRODUCTION

Recently, machine learning researchers have become captivated by the remarkable capabilities of pretrained open vocabulary models (Radford et al., 2021; Wortsman et al., 2021; Jia et al., 2021; Gao et al., 2021; Pham et al., 2021; Cho et al., 2022; Pratt et al., 2022). These models, like CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021), learn to map images and captions into a shared embedding space such that images are close in embedding space to their corresponding captions but far from randomly sampled captions. The resulting models can then be used to assess the relative compatibility of a given image with an arbitrary set of textual "prompts". Notably, Radford et al. (2021) observed that by inserting each class name directly within a natural language prompt, one can use CLIP embeddings to assess the compatibility of an image with each of the possible classes. Thus, open vocabulary models are able to perform zero-shot image classification, and do so with high rates of success (Radford et al., 2021; Zhang et al., 2021b).

Despite these documented successes, the current interest in open vocabulary models raises a new question: how should we represent our classes for a given problem in natural language? Because class names are now part of the inferential pipeline (as opposed to mostly an afterthought in traditional scenarios) for models like CLIP in the zero-shot setting, CLIP's performance is directly tied to the descriptiveness of the class "prompts" (Santurkar et al., 2022). While many researchers have focused on improving the quality of the prompts into which class names are embedded (Radford et al., 2021; Pratt et al., 2022; Zhou et al., 2022b; a; Huang et al., 2022), surprisingly little attention has been paid to improving the richness of the class names themselves.
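To make the standard zero-shot procedure concrete, the following is a minimal sketch of how CLIP-style models score an image against a set of class prompts: each class name is embedded as text, and the predicted class is the one whose embedding has the highest cosine similarity with the image embedding. The embeddings here are random mock vectors standing in for real CLIP encoder outputs; the function names and dimensions are illustrative, not from the paper.

```python
import numpy as np

def zero_shot_classify(image_emb, class_embs, class_names):
    """Pick the class whose prompt embedding is most similar to the image.

    Embeddings are L2-normalized so the dot product equals cosine
    similarity, mirroring how CLIP scores image/text compatibility.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    class_embs = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    scores = class_embs @ image_emb  # one similarity score per class
    return class_names[int(np.argmax(scores))]

# Mock 4-dim embeddings standing in for real CLIP encoder outputs.
rng = np.random.default_rng(0)
classes = ["cat", "dog", "car"]
text_embs = rng.normal(size=(3, 4))
# An image embedding lying close to the "dog" prompt embedding.
image = text_embs[1] + 0.05 * rng.normal(size=4)
print(zero_shot_classify(image, text_embs, classes))
```

In a real pipeline the mock vectors would be replaced by the outputs of CLIP's image and text encoders, with each class name wrapped in a prompt template such as "a photo of a {class}".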
This can be particularly crucial in cases where class names are not very informative or are too coarsely-defined to match the sort of descriptions that might arise in natural captions. Consider, for example, the classes "large man-made outdoor things" and "reptiles" in the CIFAR20 dataset (Krizhevsky, 2009): such coarse labels are unlikely to appear verbatim in natural image captions.

In this paper, we introduce a new method to tackle zero-shot classification with CLIP models for problems with coarsely-defined class labels. We refer to our method as Classification with Hierarchical Label Sets (CHiLS for short). Our method utilizes a hierarchical map to convert each class into a list of subclasses, performs normal CLIP zero-shot prediction across the union set of all subclasses, and finally uses the inverse mapping to convert the subclass prediction back to its parent class.
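The three steps above can be sketched as follows. The hierarchy here is hard-coded for illustration (in the paper it would come from an existing label hierarchy or from querying GPT-3), and the text encoder is a mock that returns cached random vectors; all names and dimensions are assumptions, not the paper's implementation.

```python
import numpy as np

# Step (i): hypothetical superclass -> subclass map. In CHiLS this comes
# from an existing hierarchy or GPT-3; hard-coded here for illustration.
HIERARCHY = {
    "reptiles": ["crocodile", "lizard", "snake", "turtle"],
    "large man-made outdoor things": ["bridge", "castle", "road", "skyscraper"],
}

def chils_predict(image_emb, embed_text):
    """CHiLS: zero-shot over the union of subclasses, then map to parents."""
    pairs = [(parent, sub) for parent, subs in HIERARCHY.items() for sub in subs]
    # Step (ii): standard CLIP zero-shot scoring over all subclasses.
    sub_embs = np.stack([embed_text(sub) for _, sub in pairs])
    sub_embs /= np.linalg.norm(sub_embs, axis=1, keepdims=True)
    image_emb = image_emb / np.linalg.norm(image_emb)
    best = int(np.argmax(sub_embs @ image_emb))
    # Step (iii): inverse map from predicted subclass to its parent.
    return pairs[best][0]

# Mock text encoder: a cached random vector per string, standing in for
# a real CLIP text encoder.
rng = np.random.default_rng(1)
_cache = {}
def mock_embed(text):
    if text not in _cache:
        _cache[text] = rng.normal(size=8)
    return _cache[text]

# An image embedding lying close to the "snake" subclass embedding.
image = mock_embed("snake") + 0.05 * rng.normal(size=8)
print(chils_predict(image, mock_embed))
```

Note that the coarse parent labels never enter the similarity computation; only the more caption-like subclass names are compared against the image, which is the source of the method's gains.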

