HIERARCHICAL PROMPTING IMPROVES VISUAL RECOGNITION ON ACCURACY, DATA EFFICIENCY AND EXPLAINABILITY

Abstract

When humans try to distinguish inherently similar visual concepts, e.g., Rosa Peace and China Rose, they may use the underlying hierarchical taxonomy to prompt the recognition. For example, given a prompt that the image belongs to the rose family, a person can narrow down the category range and thus focus on the comparison between different roses. In this paper, we explore hierarchical prompting for deep visual recognition (image classification, in particular) based on the prompting mechanism of the transformer. We show that the transformer can reap a similar benefit by injecting coarse-class prompts into its intermediate blocks. The resulting Transformer with Hierarchical Prompting (TransHP) is very simple and consists of three steps: 1) TransHP learns a set of prompt tokens to represent the coarse classes, 2) learns to predict the coarse class of the input image using an intermediate block, and 3) absorbs the prompt token of the predicted coarse class into the feature tokens. Consequently, the injected coarse-class prompt conditions (influences) the subsequent feature extraction and encourages better focus on the relatively subtle differences among the descendant classes. Through extensive experiments on popular image classification datasets, we show that this simple hierarchical prompting improves visual recognition in classification accuracy (e.g., improving ViT-B/16 by +2.83% ImageNet classification accuracy), training data efficiency (e.g., a +12.69% improvement over the baseline with 10% of the ImageNet training data), and model explainability.

1. INTRODUCTION

For human visual recognition, awareness of the underlying semantic hierarchy is sometimes beneficial, especially when the object is difficult to recognize. More specifically, when trying to distinguish some inherently similar visual concepts, a person may use the hierarchical taxonomy to prompt the recognition. For example, the China Rose is easily confused with the Rosa Peace when the scope of interest is the whole Plantae (or even larger). However, given the prompt that the image belongs to the rose family (i.e., the ancestor class), a person can narrow down the category range and shift his/her focus to the subtle variation between different roses. Therefore, the prompt of the coarse (ancestor) class in the hierarchy conditions (influences) the subsequent inference and benefits the fine (descendant) class recognition.

In this paper, we explore the above hierarchical prompting for deep visual recognition. We base our exploration on the prompting mechanism of the transformer, which typically uses prompts to condition the model for different tasks (Li & Liang, 2021; Gu et al., 2021; He et al., 2021), different domains (Ge et al., 2022), etc. For the first time, we show that in the image classification task, the transformer can benefit from being prompted with coarse-class information. To this end, we inject the coarse-class prompts into an intermediate block to dynamically condition the subsequent feature extraction. Such hierarchical prompting is similar to that in human visual recognition.

Exploiting the underlying semantic hierarchy to improve visual recognition has attracted great research interest and yielded several popular tasks, e.g., hierarchical image classification and hierarchical semantic segmentation. Considering that classification is fundamental for many computer vision tasks, this paper focuses on hierarchical image classification.
Many popular image classification datasets (e.g., ImageNet and iNaturalist) can well accommodate this task because they already provide hierarchical annotations ("coarse + fine" labels). Compared with prior literature on this topic, our method differs significantly due to the employed prompting mechanism. Please refer to Section 2 (Related Works) for a detailed comparison.

We model our intuition into a Transformer with Hierarchical Prompting (TransHP) based on the Vision Transformer (ViT) (Fig. 1 (a)). TransHP is very simple, as illustrated in Fig. 1 (b). Without loss of generality, Fig. 1 assumes the hierarchy has only two levels for simplicity, i.e., a coarse level and a fine level. In other words, each image simultaneously has a coarse label (e.g., fish) and a fine label (e.g., goldfish). TransHP selects an intermediate block as the "prompting block" to inject the coarse-class information. Specifically, given the feature tokens (i.e., the "class" token and the patch tokens) output from the preceding block, the prompting block concatenates them with a set of prompt tokens. Each prompt token represents a coarse class and is learnable (Section 3.2). The prompting block learns to predict the coarse class of the input image and to select the corresponding prompt token through weighted absorption (i.e., high absorption of the target prompt and low absorption of the non-target prompts). Therefore, during inference, the prompt injection concentrates on the predicted coarse class on the fly and dynamically conditions the subsequent recognition. Since our prompting mechanism follows the coarse-to-fine (or ancestor-to-descendant) semantic structure, we term it hierarchical prompting. We hypothesize that this hierarchical prompting (and conditioning) will encourage TransHP to focus on the subtle differences among the descendant classes for better discrimination.
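The prompting block described above can be sketched roughly as follows. This is a minimal, illustrative PyTorch sketch under our own assumptions, not the authors' implementation: all names (PromptingBlock, coarse_head, etc.) are hypothetical, a standard nn.TransformerEncoderLayer stands in for the ViT block, and the weighted absorption is written as an explicit softmax-weighted sum of prompt tokens, whereas the actual TransHP may realize absorption differently (e.g., inside the attention itself).

```python
import torch
import torch.nn as nn

class PromptingBlock(nn.Module):
    """Illustrative sketch of a TransHP-style prompting block.

    1) Appends one learnable prompt token per coarse class,
    2) predicts the coarse class from the class token, and
    3) absorbs the prompt tokens into the feature tokens, weighted by the
       predicted coarse-class distribution (high weight on the predicted
       coarse class, low weight on the others).
    """

    def __init__(self, dim: int, num_coarse: int, num_heads: int = 4):
        super().__init__()
        # One learnable prompt token per coarse class.
        self.prompts = nn.Parameter(torch.randn(num_coarse, dim) * 0.02)
        # Head that predicts the coarse class from the class token.
        self.coarse_head = nn.Linear(dim, num_coarse)
        # Stand-in for a standard ViT transformer block.
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor):
        # tokens: (B, 1 + num_patches, dim); tokens[:, 0] is the class token.
        B, n_feat, _ = tokens.shape
        prompts = self.prompts.unsqueeze(0).expand(B, -1, -1)
        # Self-attention over the concatenated feature + prompt tokens.
        out = self.block(torch.cat([tokens, prompts], dim=1))
        feat = out[:, :n_feat]                        # keep feature tokens
        coarse_logits = self.coarse_head(feat[:, 0])  # coarse-class prediction
        # Weighted absorption: soft-select prompt tokens by the predicted
        # coarse-class distribution and add the mixture to every feature token.
        weights = coarse_logits.softmax(dim=-1)       # (B, num_coarse)
        absorbed = weights @ self.prompts             # (B, dim)
        return feat + absorbed.unsqueeze(1), coarse_logits

# Hypothetical usage: 10 coarse classes, 16 patch tokens + 1 class token.
block = PromptingBlock(dim=64, num_coarse=10)
feat, coarse_logits = block(torch.randn(2, 17, 64))
```

During training, `coarse_logits` would be supervised with the coarse labels (alongside the usual fine-class loss on the final block), so that at inference the absorption concentrates on the predicted coarse class on the fly.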
We conduct extensive experiments on multiple image classification datasets (e.g., ImageNet (Deng et al., 2009) and iNaturalist (Van Horn et al., 2018)) and show that the hierarchical prompting improves the accuracy, data efficiency, and explainability of the transformer: (1) Accuracy. TransHP



Figure 1: The comparison between Vision Transformer (ViT) and the proposed Transformer with Hierarchical Prompting (TransHP). In (a), ViT attends to the overall foreground region and recognizes the goldfish from the 1000 classes in ImageNet. In (b), TransHP uses an intermediate block to recognize the input image as belonging to the fish family and then injects the corresponding prompt. Afterwards, the last block attends to the face and crown which are particularly informative for distinguishing the goldfish against other fish species. Please refer to Fig. 5 for more visualizations. Note that TransHP may have several prompting blocks, and we only add one in (b) for demonstration.

Fig. 1 partially validates our hypothesis by visualizing the attention map of the class token in the last transformer block. In Fig. 1 (a), given a goldfish as the input image, the baseline model (ViT) attends to the whole body to recognize it among the entire 1000 classes in ImageNet. In contrast, in TransHP (Fig. 1 (b)), since the intermediate block has already received the prompt of "fish", the final block mainly attends to the face and crown, which are particularly informative for distinguishing the goldfish from other fish species. Please refer to Section 4.4 for more visualization examples.

