VISUAL CLASSIFICATION VIA DESCRIPTION FROM LARGE LANGUAGE MODELS

Abstract

Vision-language models (VLMs) such as CLIP have shown promising performance on a variety of recognition tasks using the standard zero-shot classification procedure: computing similarity between the query image and the embedded words for each category. By only using the category name, they neglect the rich context of additional information that language affords. The procedure gives no intermediate understanding of why a category is chosen, and furthermore provides no mechanism for adjusting the criteria used towards this decision. We present an alternative framework for classification with VLMs, which we call classification by description. We ask VLMs to check for descriptive features rather than broad categories: to find a tiger, look for its stripes; its claws; and more. By basing decisions on these descriptors, we can provide additional cues that encourage using the features we want to be used. In the process, we get a clear idea of what features the model uses to construct its decision; it gains some level of inherent explainability. We query large language models (e.g., GPT-3) for these descriptors to obtain them in a scalable way. Extensive experiments show our framework has numerous advantages beyond interpretability. We show improvements in accuracy on ImageNet across distribution shifts; demonstrate the ability to adapt VLMs to recognize concepts unseen during training; and illustrate how descriptors can be edited to effectively mitigate bias compared to the baseline.

1. INTRODUCTION

Why does a person recognize a hen in Fig. 1? If you had to justify your answer, you might name its beak, describe its feathers, or discuss any number of other traits that we associate with hens. It is easy for people to describe the visual features of categories in words, as well as to use these verbal descriptions to aid perception. However, generating such schemata, let alone leveraging them for perceptual tasks, has remained a key challenge in machine learning.

Vision-language models (VLMs) trained on large corpora of paired image-text data, such as CLIP (Radford et al., 2021), have seen huge successes recently, dominating image classification. The standard zero-shot classification procedure (computing similarity between the query image and the embedded words for each category, then choosing the highest) has shown impressive performance on many popular benchmarks, such as ImageNet (Russakovsky et al., 2015). Comparing to the word that names a category was a reasonable place to start, because these methods can rely on the fact that the word "hen" tends to show up near images of hens on the Internet. Despite these advances in classification performance, large models often make unreasonable mistakes or give undesired answers (Goh et al., 2021). The standard zero-shot method gives us no intermediate understanding (i.e., explanation) of the model's reasoning process. Models often fail to look at cues that a human would use easily, and there is no clear way to identify the right cues or provide them to the model.

Our key insight is that we can use language as an internal representation for visual recognition, creating an interpretable bottleneck for computer vision tasks. Instead of querying a VLM with just a category name, the use of language enables us to flexibly compare to any words. If we have an idea of what features should be used, we can ask the VLM to check for those features instead of just the class name.
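The standard zero-shot procedure described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: `embed_text` is a placeholder standing in for a CLIP-like text encoder, and the "a photo of a" prompt template is an assumption.

```python
import numpy as np

def zero_shot_classify(image_emb, class_names, embed_text):
    """Standard zero-shot procedure: pick the class whose name
    embedding is most similar (cosine) to the image embedding."""
    sims = []
    for name in class_names:
        # Compare the image against the embedded category name alone.
        t = embed_text(f"a photo of a {name}")
        sims.append(float(image_emb @ t) /
                    (np.linalg.norm(image_emb) * np.linalg.norm(t)))
    return class_names[int(np.argmax(sims))]
```

Note that the only language the model sees here is the category name itself; the rest of this paper replaces that single comparison with comparisons against descriptive features.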
To find a hen, look for its beak; its feathers; and more. By basing the decision on these features, we can provide additional cues that encourage looking at the features we want to be used. In the process, we get a clear idea of what the model uses to make its decision; it is inherently explainable. However, hand-writing these features can be costly and does not scale to large numbers of classes. We can solve this by requesting help from another model. Large language models (LLMs), such as GPT-3 (Brown et al., 2020), show remarkable world knowledge on a variety of topics. They can be thought of as implicit knowledge bases, noisily condensing the collective knowledge of the Internet in a way that can be easily queried with natural language (Petroni et al., 2019). As people often write about what things look like, this includes knowledge of visual descriptors. We can thus simply ask an LLM, much like a 5-year-old asking their parent: what does it look like?

We provide an alternative to the current zero-shot classification paradigm with vision-language models, comparing to class descriptors obtained from a large language model instead of just the class name directly. This requires no additional training and incurs no substantial computational overhead during inference. By construction, it provides some level of inherent interpretability; we can know an image was labeled a tiger because the model saw its stripes rather than its tail. Rather than compromising performance metrics, our approach improves accuracy across datasets and distribution shifts, achieving a ∼4-5% increase in top-1 ImageNet accuracy.
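Asking an LLM "what does it look like?" amounts to building a prompt per category and parsing the listed features from the completion. The following is a sketch only: the exact prompt wording and the list format the LLM returns are assumptions, and the LLM call itself is left as a placeholder rather than a specific API invocation.

```python
def build_descriptor_prompt(category: str) -> str:
    # Hypothetical prompt shape; the exact wording is an assumption.
    return (
        f"Q: What are useful visual features for distinguishing a {category} "
        "in a photo?\n"
        f"A: There are several useful visual features to tell there is a "
        f"{category} in a photo:\n-"
    )

def parse_descriptors(completion: str) -> list:
    # Parse a dash-prefixed list of features returned by the LLM.
    return [line.strip("- ").strip()
            for line in completion.splitlines() if line.strip()]
```

Because the prompt is generated from the category name alone, this scales to thousands of classes without any hand-written features.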

2.1. PERFORMING CLASSIFICATION WITH DESCRIPTORS

Given an image x, our goal is to classify whether a visual category c is present in the image, where we represent a category c through a textual phrase, e.g., "school bus."
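Classification with descriptors can then be sketched as scoring each category by the mean image-descriptor similarity and taking the highest-scoring class. This is a minimal illustration under assumptions: `embed_text` stands in for a CLIP-like text encoder, and the "{category}, which has {descriptor}" phrase template and function names are illustrative rather than the paper's exact implementation.

```python
import numpy as np

def descriptor_score(image_emb, category, descriptors, embed_text):
    """Score a category as the mean cosine similarity between the image
    and each descriptor phrase, e.g. 'tiger, which has stripes'."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    phrases = [f"{category}, which has {d}" for d in descriptors]
    return sum(cos(image_emb, embed_text(p)) for p in phrases) / len(phrases)

def classify_by_description(image_emb, descriptors_by_class, embed_text):
    # Predict the class whose descriptors best match the image; the
    # per-descriptor similarities double as the decision's justification.
    return max(descriptors_by_class,
               key=lambda c: descriptor_score(
                   image_emb, c, descriptors_by_class[c], embed_text))
```

Keeping the per-descriptor similarities around is what yields the justifications visualized in Figure 1: each bar is one descriptor's similarity to the image.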

Figure 1: On the left, we show an example decision by our model in addition to its justification (blue bars). On the right, we show how CLIP classifies this image. Our model does not make the same mistake because it cannot produce a justification compatible with the image (red bars).

Figure 2: (a) The standard vision-language model compares image embeddings (white dot) to word embeddings of the category name (colorful dots) in order to perform classification. (b) We instead mine large language models to automatically build descriptors, and perform recognition by comparing to the category descriptors.

