WHAT DOES A PLATYPUS LOOK LIKE? GENERATING CUSTOMIZED PROMPTS FOR ZERO-SHOT IMAGE CLASSIFICATION

Abstract

Open vocabulary models are a promising new paradigm for image classification. Unlike traditional classification models, open vocabulary models classify among any arbitrary set of categories specified with natural language during inference. This natural language, called "prompts", typically consists of a set of hand-written templates (e.g., "a photo of a {}") which are completed with each of the category names. This work introduces a simple method to generate higher-accuracy prompts, without relying on any explicit knowledge of the task domain and with far fewer hand-constructed sentences. To achieve this, we combine open vocabulary models with large language models (LLMs) to create Customized Prompts via Language models (CuPL, pronounced "couple"). In particular, we leverage the knowledge contained in LLMs in order to generate many descriptive sentences that are customized for each object category. We find that this straightforward and general approach improves accuracy on a range of zero-shot image classification benchmarks, including a gain of over one percentage point on ImageNet. Finally, this simple baseline requires no additional training and remains completely zero-shot.
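To make the template mechanism concrete, hand-written prompts of the kind described above can be built with simple string formatting. The template list below is an illustrative sketch, not the exact set used in prior work:

```python
# Illustrative hand-written templates (not the exact set from prior work).
templates = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a photo of a small {}.",
]

def build_prompts(class_names, templates):
    """Complete every hand-written template with every class name."""
    return {name: [t.format(name) for t in templates] for name in class_names}

prompts = build_prompts(["platypus", "goldfish"], templates)
print(prompts["platypus"][0])  # a photo of a platypus.
```

CuPL replaces this fixed template set with class-specific sentences produced by an LLM, so the prompt text itself carries visual knowledge about each category.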



"A spatula is a flat, rectangular kitchen utensil with a long handle" "A photo of a spatula" Standard Zero-shot Customized Prompts via Language models (CuPL)

GPT-3

"What does a platypus look like?" "A platypus looks like a beaver with a duck's bill" "A photo of a goldfish" "A photo of a platypus" "Goldfish are small, orange fish with shiny scales" "A platypus looks like a beaver with a duck's bill" 

1. INTRODUCTION

Open vocabulary models (Pham et al., 2021; Jia et al., 2021; Radford et al., 2021; Yu et al., 2022a) achieve high classification accuracy across a large number of datasets without labeled training data for those tasks. To accomplish this, these models leverage the massive amount of image-text pairs available on the internet, learning to associate images with their correct captions and thereby gaining greater flexibility during inference. Unlike standard classification models, they classify an image by producing a similarity score between the image and a caption. To perform inference, one can generate a



Figure 1: Schematic of the method. (Left) The standard prompting method for a zero-shot open vocabulary image classification model (e.g., CLIP (Radford et al., 2021)). (Right) Our CuPL method. First, an LLM generates descriptive captions for the given class categories. Next, an open vocabulary model uses these captions as prompts for classification.
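The two-step pipeline in the caption can be sketched in a few lines. Everything below is a hedged illustration: `llm_complete` and `encode_text` are hypothetical stand-ins for a real LLM call and a real open vocabulary text encoder, and the per-class averaging follows the common practice of ensembling prompt embeddings.

```python
import math

def generate_descriptions(class_name, llm_complete, n=3):
    """Step 1: ask an LLM for descriptive sentences about a class.
    `llm_complete` is a hypothetical stand-in for a real LLM call."""
    question = f"What does a {class_name} look like?"
    return [llm_complete(question) for _ in range(n)]

def classify(image_emb, class_prompts, encode_text):
    """Step 2: score the image against the mean (ensembled) text
    embedding of each class's prompts; return the best-scoring class."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def normalize(v):
        n = math.sqrt(dot(v, v))
        return [x / n for x in v]

    scores = {}
    for name, prompts in class_prompts.items():
        embs = [normalize(encode_text(p)) for p in prompts]
        mean = normalize([sum(col) / len(embs) for col in zip(*embs)])
        scores[name] = dot(normalize(image_emb), mean)
    return max(scores, key=scores.get)
```

With a toy encoder that maps each prompt onto its class's axis, an image embedded along the "platypus" axis is assigned the "platypus" label; in practice, the image and text embeddings come from the open vocabulary model's jointly trained encoders.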

