WHAT DOES A PLATYPUS LOOK LIKE? GENERATING CUSTOMIZED PROMPTS FOR ZERO-SHOT IMAGE CLASSIFICATION

Abstract

Open vocabulary models are a promising new paradigm for image classification. Unlike traditional classification models, open vocabulary models classify among any arbitrary set of categories specified with natural language during inference. This natural language, called "prompts", typically consists of a set of hand-written templates (e.g., "a photo of a {}") which are completed with each of the category names. This work introduces a simple method to generate higher accuracy prompts, without relying on any explicit knowledge of the task domain and with far fewer hand-constructed sentences. To achieve this, we combine open vocabulary models with large language models (LLMs) to create Customized Prompts via Language models (CuPL, pronounced "couple"). In particular, we leverage the knowledge contained in LLMs in order to generate many descriptive sentences that are customized for each object category. We find that this straightforward and general approach improves accuracy on a range of zero-shot image classification benchmarks, including over one percentage point gain on ImageNet. Finally, this simple baseline requires no additional training and remains completely zero-shot.



"A spatula is a flat, rectangular kitchen utensil with a long handle" "A photo of a spatula" Standard Zero-shot Customized Prompts via Language models (CuPL) "What does a platypus look like?" "A platypus looks like a beaver with a duck's bill" "A photo of a goldfish" "A photo of a platypus" "Goldfish are small, orange fish with shiny scales" "A platypus looks like a beaver with a duck's bill" 

1. INTRODUCTION

Open vocabulary models (Pham et al., 2021; Jia et al., 2021; Radford et al., 2021; Yu et al., 2022a) achieve high classification accuracy across a large number of datasets without labeled training data for those tasks. To accomplish this, these models leverage the massive amounts of image-text pairs available on the internet by learning to associate the images with their correct caption, leading to greater flexibility during inference. Unlike standard models, these models classify images by providing a similarity score between an image and a caption. To perform inference, one can generate a caption or "prompt" associated with each of the desired categories, and match each image to the best prompt. This means that categories can be selected ad hoc and adjusted without additional training.
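The matching step described above can be sketched in a few lines. This is an illustrative toy example, not the paper's implementation: the vectors below are hand-made stand-ins for what an open vocabulary model's image and text encoders would actually produce, and the prompt strings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify(image_embedding, prompt_embeddings):
    """Return the prompt whose embedding is most similar to the image embedding."""
    return max(prompt_embeddings,
               key=lambda p: cosine(image_embedding, prompt_embeddings[p]))

# Stand-in embeddings; a real system would encode these strings and the image
# with the open vocabulary model's text and image encoders.
prompt_embeddings = {
    "a photo of a platypus": [0.9, 0.1, 0.2],
    "a photo of a goldfish": [0.1, 0.9, 0.3],
}
image_embedding = [0.8, 0.2, 0.1]  # embedding of the query image

print(classify(image_embedding, prompt_embeddings))  # prints: a photo of a platypus
```

Because classification reduces to nearest-prompt matching, swapping in a new category only requires embedding a new prompt, with no retraining.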

However, this new paradigm poses a challenge:

How can we best represent an image category through natural language prompts? The standard approach is to hand-write a number of prompt templates (Radford et al., 2021) (e.g., "a photo of a {}"), compile a natural language label for each category in the dataset, and create a set of prompts for each category by filling in each of these templates with the natural language labels. Then, image embeddings are matched to the nearest set of prompt embeddings and labeled with the category associated with that set of prompts (more details in Section 2).

This method has three major drawbacks. Firstly, each prompt template has to be hand-written, so having twice as many prompts for a category requires twice as much human effort. This can become costly, as each new dataset typically has a different set of prompt templates (Radford et al., 2021). Secondly, the prompt templates must be general enough to apply to all image categories. For example, a prompt for the ImageNet (Deng et al., 2009) category "platypus" could only be as specific as "a photo of a {platypus}"; it could not be something like "a photo of a {platypus}, a type of aquatic mammal", as that template would no longer be relevant for other image categories. Lastly, writing high-performing prompt templates currently requires prior information about the contents of the dataset. For example, the list of hand-written ImageNet prompts (Radford et al., 2021) includes "a black and white photo of the {}.", "a low resolution photo of a {}.", and "a toy {}.", all of which reflect prior knowledge about the kinds of representations present in the dataset. This knowledge does not generalize to other datasets: ImageNet contains "black and white" and "toy" depictions of its categories, but other datasets (e.g., FGVC Aircraft (Maji et al., 2013)) do not.

To overcome these challenges, we propose Customized Prompts via Language models (CuPL).
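The standard template-filling procedure can be made concrete with a short sketch. The templates below are three of the hand-written ImageNet templates quoted above; the category list and helper name are illustrative.

```python
# Hand-written prompt templates (examples from the ImageNet set of
# Radford et al., 2021).
templates = [
    "a photo of a {}.",
    "a low resolution photo of a {}.",
    "a black and white photo of the {}.",
]
categories = ["platypus", "goldfish"]  # natural language labels for the dataset

def build_prompt_sets(templates, categories):
    """Fill every hand-written template with every category name,
    yielding one set of prompts per category."""
    return {c: [t.format(c) for t in templates] for c in categories}

prompt_sets = build_prompt_sets(templates, categories)
# In the full pipeline, each set of prompts is embedded by the text encoder
# and the per-category embeddings are matched against image embeddings.
```

Note the linear cost this sketch makes visible: every additional prompt per category is an additional hand-written template, and the templates must stay generic enough to combine with every category name.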
In this algorithm, we couple a large language model (LLM) with a zero-shot open vocabulary image classification model. We use the LLM to generate prompts for each of the image categories in a dataset. Using an LLM allows us to generate an arbitrary number of prompts from a fixed number of hand-written sentences. Additionally, these prompts are now customized to each category and can contain rich visual descriptions while still remaining zero-shot (e.g., "A platypus looks like a beaver with a duck's bill", a sentence generated by an LLM). We find these customized prompts outperform the hand-written templates on 15 zero-shot image classification benchmarks, including a greater than 1 percentage point gain in ImageNet (Deng et al., 2009) Top-1 accuracy and a greater than 6 percentage point gain on the Describable Textures Dataset (Cimpoi et al., 2014), while requiring fewer hand-written prompts than the standard method used in Radford et al. (2021). Finally, this method requires no additional training or labeled data for either model.

2. METHODS

The CuPL algorithm consists of two steps: (1) generating customized prompts for each of the categories in a given dataset and (2) using these prompts to perform zero-shot image classification.
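The two-step pipeline can be sketched as follows. This is a minimal illustration under stated assumptions: `llm_complete` is a hypothetical stand-in for a call to a real LLM API (such as GPT-3), and the template strings are examples, not the paper's exact LLM-prompts.

```python
def generate_image_prompts(category, llm_prompt_templates, llm_complete,
                           n_per_prompt=10):
    """Step 1: fill each hand-written LLM-prompt template with the category
    name, then ask the LLM for several completions of each filled prompt.
    The completions serve as customized image-prompts for that category."""
    image_prompts = []
    for template in llm_prompt_templates:
        llm_prompt = template.format(category)
        image_prompts.extend(llm_complete(llm_prompt, n=n_per_prompt))
    return image_prompts

# Hypothetical stub standing in for a real LLM call.
def fake_llm(prompt, n):
    return [f"completion {i} of: {prompt}" for i in range(n)]

llm_prompt_templates = [
    "Describe what a {} looks like.",
    "What does a {} look like?",
]
prompts = generate_image_prompts("platypus", llm_prompt_templates, fake_llm,
                                 n_per_prompt=2)
# Step 2: embed these image-prompts with the open vocabulary model's text
# encoder and match image embeddings to the nearest category's prompts.
```

With a handful of LLM-prompt templates and `n_per_prompt` completions each, the number of image-prompts per category scales without any additional hand-writing.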

2.1. GENERATING CUSTOMIZED PROMPTS

This step consists of generating prompts using an LLM. For clarity, we distinguish between two different kinds of prompts. The first are the prompts that cue the LLM to generate descriptions of the dataset categories. These prompts do not describe an object, but rather prompt the description of an object (e.g., "What does a platypus look like?"). We refer to these as "LLM-prompts". Secondly, there are the prompts to be matched with images in the zero-shot image classification model. These are the prompts that describe a category (e.g., "A platypus looks like ..."). We call these "image-prompts"; they are the output of the LLM, as exemplified in Figure 2. In this work, we use GPT-3 (Brown et al., 2020) as our LLM. To generate our image-prompts, we must first construct a number of LLM-prompt templates. While this does require some engineering



Figure 1: Schematic of the method. (Left) The standard method of a zero-shot open vocabulary image classification model (e.g., CLIP (Radford et al., 2021)). (Right) Our method of CuPL. First, an LLM generates descriptive captions for given class categories. Next, an open vocabulary model uses these captions as prompts for performing classification.

