HICLIP: CONTRASTIVE LANGUAGE-IMAGE PRETRAINING WITH HIERARCHY-AWARE ATTENTION

Abstract

The success of large-scale contrastive vision-language pretraining (CLIP) has benefited both visual recognition and multimodal content understanding. Its concise design gives CLIP an advantage in inference efficiency over other vision-language models with heavier cross-attention fusion layers, making it a popular choice for a wide spectrum of downstream tasks. However, CLIP does not explicitly capture the hierarchical nature of the high-level and fine-grained semantics conveyed in images and texts, which is arguably critical to vision-language understanding and reasoning. To this end, we equip both the visual and language branches in CLIP with hierarchy-aware attention, namely Hierarchy-aware CLIP (HiCLIP), to progressively discover semantic hierarchies layer by layer from both images and texts in an unsupervised manner. Such hierarchical aggregation significantly improves cross-modal alignment. To demonstrate the advantages of HiCLIP, we conduct qualitative analysis of its unsupervised hierarchy induction during inference, as well as extensive quantitative experiments on both visual recognition and vision-language downstream tasks.
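As a rough illustration of what a hierarchy-aware attention layer could look like, the PyTorch sketch below augments standard self-attention with a soft constituent prior: neighboring tokens predict a merge probability, and attention is biased toward tokens within the same soft group. This is a hypothetical sketch in the spirit of tree-structured attention, not the released HiCLIP implementation; all class and variable names are our own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchyAwareSelfAttention(nn.Module):
    """Toy single-head self-attention with a soft constituent prior.

    Illustrative only: adjacent tokens predict a "merge" probability, and
    the log-product of merge scores over a span acts as a prior that biases
    attention toward tokens in the same soft group, so deeper layers can
    aggregate progressively larger constituents.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.merge_q = nn.Linear(dim, dim)
        self.merge_k = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, D)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale              # (B, N, N)

        # Merge probability between neighboring tokens i and i+1.
        link = (self.merge_q(x)[:, :-1] * self.merge_k(x)[:, 1:]).sum(-1)
        link = torch.sigmoid(link * self.scale)                     # (B, N-1)

        # Soft constituent prior: log-probability that every adjacent pair
        # between positions i and j stays linked (0 on the diagonal, <= 0 elsewhere).
        cum = F.pad(torch.cumsum(torch.log(link + 1e-6), dim=-1), (1, 0))   # (B, N)
        prior = -(cum.unsqueeze(1) - cum.unsqueeze(2)).abs()                # (B, N, N)

        return F.softmax(attn + prior, dim=-1) @ v
```

Stacking several such layers lets the soft groups grow from local spans toward the full sequence, mirroring the layer-by-layer hierarchy discovery described above.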

1. INTRODUCTION

In recent years, vision-language pretraining has achieved significant progress when paired with large-scale multimodal data. Contrastive vision-language pretraining (CLIP) is notable for its generalization ability on zero-shot tasks and its robustness to domain shift (Radford et al., 2021). Moreover, the spectrum of problems that CLIP can solve ranges from visual recognition and image-text retrieval to vision-language reasoning tasks, given appropriate prompt engineering (Zhou et al., 2022; Gao et al., 2021; Xu et al., 2021; Shridhar et al., 2021; Rao et al., 2022; Zhong et al., 2022). Since CLIP is built upon simple cross-modal interactions, it has superior inference efficiency over cross-attention based vision-language models (Li et al., 2021; Chen et al., 2020; Li et al., 2020; Tan & Bansal, 2019; Dou et al., 2022). Recent studies including DeCLIP (Li et al., 2022), SLIP (Mu et al., 2021), and FILIP (Yao et al., 2022) extend CLIP by either leveraging extra self-supervised training objectives or performing contrastive loss on dense token features.

As humans perceive the world in a hierarchical manner (Hubel & Wiesel, 1968; Fukushima & Miyake, 1982; Kuzovkin et al., 2018), this hierarchical nature of visual and linguistic content has been explored to assist the design of various model architectures. However, contrastive vision-language learning methods like CLIP often cannot capture visual and linguistic hierarchies in an explicit way. In the example on the right of Fig. 1(a), pixels first form local patches as the image encoder's inputs, and are further merged into semantic groups denoting objects ("traffic lights", "sky"), attributes ("cloudy"), etc. Similarly, syntactic hierarchies can be observed in natural language, where a caption can be decomposed into constituents as shown in Fig. 1(b). Therefore, we argue that the hierarchical nature (i.e., merging from local to global) of vision and language is critical and can be explicitly utilized to improve CLIP's capability on multimodal tasks, especially those requiring high-level understanding and reasoning.
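To make the notion of "simple cross-modal interactions" concrete, the sketch below shows the symmetric contrastive (InfoNCE) objective used by CLIP-style models: each modality is encoded independently, and the only cross-modal computation is a batch-level cosine-similarity matrix, which is what keeps inference cheap relative to cross-attention fusion. This is a minimal PyTorch sketch with illustrative names, not code from the HiCLIP release.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_feats: torch.Tensor,
                          text_feats: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_feats, text_feats: (B, D) outputs of the two unimodal encoders.
    The i-th image and i-th text form the only positive pair in the batch.
    """
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    logits = image_feats @ text_feats.t() / temperature         # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_i2t = F.cross_entropy(logits, targets)                 # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)             # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```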

Work conducted while interning at ByteDance.

¹We release our implementation of HiCLIP at https://github.com/jeykigung

