HICLIP: CONTRASTIVE LANGUAGE-IMAGE PRETRAINING WITH HIERARCHY-AWARE ATTENTION

Abstract

The success of large-scale contrastive vision-language pretraining (CLIP) has benefited both visual recognition and multimodal content understanding. Its concise design gives CLIP an advantage in inference efficiency over other vision-language models that rely on heavier cross-attention fusion layers, making it a popular choice for a wide spectrum of downstream tasks. However, CLIP does not explicitly capture the hierarchical nature of high-level and fine-grained semantics conveyed in images and texts, which is arguably critical to vision-language understanding and reasoning. To this end, we equip both the visual and language branches of CLIP with hierarchy-aware attention, yielding Hierarchy-aware CLIP (HiCLIP), which progressively discovers semantic hierarchies layer by layer from both images and texts in an unsupervised manner. The resulting hierarchical aggregation significantly improves cross-modal alignment. To demonstrate the advantages of HiCLIP, we conduct qualitative analysis of its unsupervised hierarchy induction during inference, as well as extensive quantitative experiments on both visual recognition and vision-language downstream tasks.

1. INTRODUCTION

In recent years, vision-language pretraining has achieved significant progress by leveraging large-scale multimodal data. Contrastive vision-language pretraining (CLIP) is notable for its generalization ability on zero-shot tasks and its robustness to domain shift (Radford et al., 2021). Moreover, the spectrum of problems that CLIP can solve ranges from visual recognition and image-text retrieval to vision-language reasoning tasks, given appropriate prompt engineering (Zhou et al., 2022; Gao et al., 2021; Xu et al., 2021; Shridhar et al., 2021; Rao et al., 2022; Zhong et al., 2022). Since CLIP is built upon simple cross-modal interactions, it offers superior inference efficiency over cross-attention based vision-language models (Li et al., 2021; Chen et al., 2020; Li et al., 2020; Tan & Bansal, 2019; Dou et al., 2022). Recent studies including DeCLIP (Li et al., 2022), SLIP (Mu et al., 2021), and FILIP (Yao et al., 2022) extend CLIP by either leveraging extra self-supervised training objectives or applying the contrastive loss to dense token features.

As humans perceive the world in a hierarchical manner (Hubel & Wiesel, 1968; Fukushima & Miyake, 1982; Kuzovkin et al., 2018), the hierarchical nature of vision and language content has been exploited in the design of various model architectures. However, contrastive vision-language learning methods such as CLIP do not capture visual and linguistic hierarchies explicitly. In the example on the right of Fig. 1 (a), pixels first form local patches that serve as the image encoder's inputs, and are then merged into semantic groups denoting objects ("traffic lights", "sky"), attributes ("cloudy"), etc. Similarly, syntactic hierarchies can be observed in natural language, where a caption can be decomposed into constituents as shown in Fig. 1 (b). We therefore argue that this hierarchical nature (i.e., merging from local to global) in vision and language is critical and can be exploited explicitly to improve CLIP's capability on multimodal tasks, especially those requiring high-level understanding and reasoning.

To this end, we introduce hierarchy-aware attention into CLIP, denoted as HiCLIP. Hierarchy-aware attention applies an attention mask to the conventional attention mechanism to indicate the tendency of vision patches and language tokens to merge into groups when they are spatially close and semantically or visually similar. We generalize hierarchy-aware attention to both images and texts: the mask is obtained by first calculating a neighboring affinity score between adjacent patches or tokens, and then propagating these scores to arbitrary patch or token pairs. In addition, we constrain the affinity scores to increase as layers get deeper, ensuring that groups merged in earlier layers remain merged. In this way, we progressively aggregate hierarchies layer by layer for both images and texts.

For modeling hierarchies in natural language, we share intuitions with previous studies on unsupervised grammar induction, which aim at unsupervised hierarchy mining (Shi et al., 2019; Drozdov et al., 2019). Tree Transformer (Wang et al., 2019) proposes a modified attention mechanism that is essentially a special case of hierarchy-aware attention, where the attention mask is instantiated as a constituent prior to encourage the merging of semantically similar tokens.
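For concreteness, the following is a minimal sketch of hierarchy-aware attention over a 1D token sequence, in the spirit of the Tree-Transformer-style constituent prior described above. It is written in PyTorch-style Python; the tensor layout, the sigmoid parameterization of the neighboring affinity, and the helper names (neighboring_affinity, constituent_prior, hierarchy_aware_attention) are illustrative assumptions rather than the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def neighboring_affinity(x, w_q, w_k, prev_a=None):
    """Affinity a_i between adjacent tokens i and i+1, in (0, 1).

    x:      (B, T, D) token features
    prev_a: affinities from the previous layer; scores are forced to
            increase with depth so groups merged earlier stay merged.
    """
    q, k = x @ w_q, x @ w_k                                   # (B, T, D)
    s = (q[:, :-1] * k[:, 1:]).sum(-1) / q.size(-1) ** 0.5    # (B, T-1)
    a = torch.sigmoid(s)
    if prev_a is not None:
        a = prev_a + (1.0 - prev_a) * a                       # monotone in depth
    return a

def constituent_prior(a):
    """Mask C[i, j] = product of adjacent affinities on the path from i to j."""
    log_a = torch.log(a + 1e-9)                               # (B, T-1)
    cum = F.pad(log_a.cumsum(-1), (1, 0))                     # (B, T), cum[:, i] = sum of first i
    diff = cum.unsqueeze(1) - cum.unsqueeze(2)                # (B, T, T), diff[i, j] = cum[j] - cum[i]
    return torch.exp(-diff.abs())                             # C in (0, 1], C[i, i] = 1

def hierarchy_aware_attention(x, w_q, w_k, w_v, a):
    """Standard scaled dot-product attention modulated by the mask C."""
    C = constituent_prior(a)                                  # (B, T, T)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
    attn = attn * C                                           # bias attention toward same-group tokens
    attn = attn / attn.sum(-1, keepdim=True)                  # renormalize rows
    return attn @ v

# Example usage with hypothetical shapes:
# x = torch.randn(2, 16, 64); w = [torch.randn(64, 64) for _ in range(3)]
# a = neighboring_affinity(x, w[0], w[1])
# out = hierarchy_aware_attention(x, w[0], w[1], w[2], a)
```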
Capturing hierarchies in visual content is more challenging, because spatial correlation must be considered in addition to visual similarity. We therefore extend hierarchy-aware attention to Vision Transformers (Dosovitskiy et al., 2021) by creating a Group Transformer that progressively aggregates image patches into semantic groups until all patches are merged into one common group, i.e., the whole image. Different from the 1D scenario in Tree Transformer, the neighboring affinity score is computed among the four adjacent neighbors of each image patch (Fig. 1 (a)). Afterwards, we propagate the neighboring affinity scores by comparing two special paths connecting image patches on the 2D grid graph (see the sketch after the contribution list below). Applying such hierarchy-aware attention to both the image and text branches of CLIP yields the proposed Hierarchy-aware CLIP (HiCLIP), which features the following advantages: (1) it automatically discovers hierarchies in vision and language that match human intuition in an unsupervised manner; (2) it produces better multimodal representations, especially for vision-language downstream tasks; and (3) it offers comprehensive hierarchy visualization that helps parse visual and textual hierarchies. To validate these advantages, we pretrain HiCLIP and other CLIP-style approaches on large-scale image-text pairs, and conduct extensive experiments on downstream tasks including visual recognition, image-text retrieval, visual question answering, and visual entailment reasoning. To sum up, our contributions are as follows:
• We incorporate hierarchy-aware attention into CLIP (HiCLIP) for both image and text content, achieving better performance on vision and vision-language downstream tasks.
• To model images hierarchically, we propagate neighboring affinity scores through two special paths on 2D grid graphs and generalize hierarchy-aware attention to the Vision Transformer.
• We visualize the evolution of hierarchies in images and texts to demonstrate HiCLIP's ability for unsupervised hierarchy induction, which contributes to better interpretability.
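To illustrate the 2D extension referenced above, here is a hypothetical sketch of the vision-side affinity step: each patch scores its four adjacent neighbors on the grid, and the mask value for an arbitrary patch pair is taken from the higher-scoring of the two L-shaped routes connecting them, each route scored by the product of affinities along it. The function names, the sigmoid scoring, and the single-example indexing are assumptions made for illustration and are not claimed to match the Group Transformer implementation.

```python
import torch

def grid_affinity(patches, w_q, w_k):
    """Neighboring affinities on a 2D patch grid.

    patches: (B, H, W, D) patch features laid out on the grid.
    Returns affinities between horizontally adjacent patches, (B, H, W-1),
    and vertically adjacent patches, (B, H-1, W), each in (0, 1).
    """
    q, k = patches @ w_q, patches @ w_k
    scale = q.size(-1) ** 0.5
    a_h = torch.sigmoid((q[:, :, :-1] * k[:, :, 1:]).sum(-1) / scale)  # right neighbors
    a_v = torch.sigmoid((q[:, :-1, :] * k[:, 1:, :]).sum(-1) / scale)  # bottom neighbors
    return a_h, a_v

def pair_prior(a_h, a_v, p1, p2):
    """Mask value for patches p1 = (r1, c1), p2 = (r2, c2) (single example, batch index 0).

    Each of the two L-shaped routes (row-first vs. column-first) is scored by the
    product of neighboring affinities along it; the higher-valued route is kept.
    """
    (r1, c1), (r2, c2) = p1, p2

    def row_prod(row, ca, cb):
        lo, hi = sorted((ca, cb))
        return a_h[0, row, lo:hi].prod()   # empty slice -> product 1.0

    def col_prod(col, ra, rb):
        lo, hi = sorted((ra, rb))
        return a_v[0, lo:hi, col].prod()

    route_row_first = row_prod(r1, c1, c2) * col_prod(c2, r1, r2)
    route_col_first = col_prod(c1, r1, r2) * row_prod(r2, c1, c2)
    return torch.maximum(route_row_first, route_col_first)
```

In a full layer, the resulting pairwise prior would modulate patch-to-patch attention in the same way as the text-side sketch above, with the affinities again constrained to grow as layers get deeper.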

2. RELATED WORK

Vision-Language Models. With the proliferation of multimodal information, exploring the interaction between vision and language has become an important topic. As a result, many vision-language



Figure 1: Illustration of hierarchical structures in both (a) vision and (b) language modalities. Based on the affinity scores between adjacent vision patches or word tokens (marked in blue boxes), the attention mask C in hierarchy-aware attention considers both spatial and semantic similarity by following the highest-valued route (marked with red arrows) between two patches or tokens. The affinity scores evolve layer by layer, yielding different levels of hierarchy granularity.

Code availability: https://github.com/jeykigung/HiCLIP.

