HIERARCHICAL PROMPTING IMPROVES VISUAL RECOGNITION ON ACCURACY, DATA EFFICIENCY AND EXPLAINABILITY

Abstract

When humans try to distinguish inherently similar visual concepts, e.g., Rosa Peace and China Rose, they may use the underlying hierarchical taxonomy to prompt the recognition. For example, given a prompt that the image belongs to the rose family, a person can narrow down the category range and thus focus on the comparison between different roses. In this paper, we explore hierarchical prompting for deep visual recognition (image classification, in particular) based on the prompting mechanism of the transformer. We show that the transformer can gain a similar benefit when coarse-class prompts are injected into its intermediate blocks. The resulting Transformer with Hierarchical Prompting (TransHP) is very simple and consists of three steps: 1) TransHP learns a set of prompt tokens to represent the coarse classes, 2) learns to predict the coarse class of the input image using an intermediate block, and 3) absorbs the prompt token of the predicted coarse class into the feature tokens. Consequently, the injected coarse-class prompt conditions (influences) the subsequent feature extraction and encourages better focus on the relatively subtle differences among the descendant classes. Through extensive experiments on popular image classification datasets, we show that this simple hierarchical prompting improves visual recognition in terms of classification accuracy (e.g., improving ViT-B/16 by +2.83% ImageNet classification accuracy), training data efficiency (e.g., +12.69% improvement over the baseline under 10% ImageNet training data), and model explainability.
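To make the three steps concrete, the following is a minimal toy sketch in plain Python, not the authors' implementation. It assumes hand-fixed prompt vectors in place of learned ones, a mean-pooled dot-product score in place of the intermediate block's coarse classifier, and simple appending of the selected prompt in place of attention-based absorption. All names (`predict_coarse`, `inject_prompt`, `prompts`, `tokens`) are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_token(tokens):
    """Mean-pool a list of feature tokens into a single vector."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]


def predict_coarse(tokens, prompts):
    """Step 2 (toy): score each coarse class by the dot product between
    its prompt token and the mean-pooled features; return the best index."""
    pooled = mean_token(tokens)
    scores = [dot(pooled, p) for p in prompts]
    return max(range(len(prompts)), key=lambda k: scores[k])


def inject_prompt(tokens, prompts):
    """Step 3 (toy): append the prompt token of the predicted coarse class
    to the token sequence, so that later blocks can attend to it."""
    k = predict_coarse(tokens, prompts)
    return k, tokens + [prompts[k]]


# Step 1 stand-in (assumption): prompt tokens would be learned during
# training; here they are fixed by hand for two hypothetical coarse classes.
prompts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Hypothetical feature tokens produced by an intermediate block.
tokens = [[0.1, 0.9, 0.2], [0.0, 0.7, 0.4]]

k, conditioned = inject_prompt(tokens, prompts)
```

In this sketch the token sequence passed on to the subsequent blocks is `conditioned`, which holds the original feature tokens plus one coarse-class prompt. Hard selection of a single prompt is a simplification chosen for readability.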

1. INTRODUCTION

For human visual recognition, awareness of the underlying semantic hierarchy is sometimes beneficial, especially when the object is difficult to recognize. More specifically, when trying to distinguish inherently similar visual concepts, a person may use the hierarchical taxonomy to prompt the recognition. For example, the China Rose is easily confused with the Rosa Peace when the scope-of-interest is the whole Plantae (or even larger). However, given the prompt that the image belongs to the rose family (i.e., the ancestor class), a person can narrow down the category range and shift their focus to the subtle variation between different roses. Therefore, the prompt of the coarse (ancestor) class in the hierarchy conditions (influences) the subsequent inference and benefits the fine (descendant) class recognition.

In this paper, we explore the above hierarchical prompting for deep visual recognition. We base our exploration on the prompting mechanism of the transformer, which typically uses prompts to condition the model for different tasks (Li & Liang, 2021; Gu et al., 2021; He et al., 2021), different domains (Ge et al., 2022), etc. For the first time, we show that in the image classification task, the transformer can benefit from being prompted with coarse-class information. To this end, we inject the coarse-class prompts into the intermediate block to dynamically condition the subsequent feature extraction. Such hierarchical prompting is similar to that in human visual recognition.

Exploiting the underlying semantic hierarchy to improve visual recognition has attracted great research interest and yielded several popular tasks, e.g., hierarchical image classification and hierarchical semantic segmentation. Considering that classification is fundamental for many computer vision tasks, this paper focuses on hierarchical image classification.
Many popular image classification datasets (e.g., ImageNet and iNaturalist) accommodate this task well because they already provide hierarchical annotations ("coarse + fine" labels). Compared with prior literature on

