HICLIP: CONTRASTIVE LANGUAGE-IMAGE PRETRAINING WITH HIERARCHY-AWARE ATTENTION

Abstract

The success of large-scale contrastive vision-language pretraining (CLIP) has benefited both visual recognition and multimodal content understanding. Its concise design gives CLIP an advantage in inference efficiency over vision-language models with heavier cross-attention fusion layers, making it a popular choice for a wide spectrum of downstream tasks. However, CLIP does not explicitly capture the hierarchical nature of the high-level and fine-grained semantics conveyed in images and texts, which is arguably critical to vision-language understanding and reasoning. To this end, we equip both the visual and language branches of CLIP with hierarchy-aware attention, yielding Hierarchy-aware CLIP (HiCLIP), which progressively discovers semantic hierarchies layer by layer in both images and texts in an unsupervised manner. This hierarchical aggregation significantly improves cross-modal alignment. To demonstrate the advantages of HiCLIP, we conduct qualitative analysis of its unsupervised hierarchy induction during inference, as well as extensive quantitative experiments on both visual recognition and vision-language downstream tasks.

1. INTRODUCTION

In recent years, vision-language pretraining has achieved significant progress by pairing with large-scale multimodal data. Contrastive vision-language pretraining (CLIP) features generalization ability on zero-shot tasks and robustness to domain shift (Radford et al., 2021). Moreover, the spectrum of problems that CLIP can solve ranges from visual recognition and image-text retrieval to vision-language reasoning tasks, given appropriate prompt engineering (Zhou et al., 2022; Gao et al., 2021; Xu et al., 2021; Shridhar et al., 2021; Rao et al., 2022; Zhong et al., 2022). Since CLIP is built upon simple cross-modal interactions, it has superior inference efficiency over cross-attention based vision-language models (Li et al., 2021; Chen et al., 2020; Li et al., 2020; Tan & Bansal, 2019; Dou et al., 2022). Recent studies including DeCLIP (Li et al., 2022), SLIP (Mu et al., 2021), and FILIP (Yao et al., 2022) extend CLIP by either leveraging extra self-supervised training objectives or performing the contrastive loss on dense token features. As humans perceive the world in a hierarchical manner (Hubel & Wiesel, 1968; Fukushima & Miyake, 1982; Kuzovkin et al., 2018), this hierarchical nature of vision and language contents has been explored to inform the design of various model architectures. However, contrastive vision-language learning methods like CLIP often cannot capture visual and linguistic hierarchies in an explicit way. In the example on the right of Fig. 1(a), pixels first form local patches as the image encoder's inputs, and are further merged into semantic groups denoting objects ("traffic lights", "sky"), attributes ("cloudy"), etc. Similarly, syntactic hierarchies can be observed in natural language, where a caption can be decomposed into constituents as shown in Fig. 1(b).
Therefore, we argue that the hierarchical nature (i.e., merging from local to global) of vision and language is critical and can be explicitly utilized to improve CLIP's capability on multimodal tasks, especially those requiring high-level understanding and reasoning. To this end, we introduce hierarchy-aware attention into CLIP, denoted as HiCLIP. Hierarchy-aware attention applies an attention mask to the conventional attention mechanism to indicate the tendency to merge certain vision patches and language tokens into groups when they are spatially adjacent and semantically similar. We generalize hierarchy-aware attention to both images and texts, where the mask is obtained by first calculating the neighboring affinity scores among adjacent patches or tokens, and then propagating the scores across arbitrary patch or token pairs. In addition, we formulate the affinity score to increase monotonically as the layer gets deeper, ensuring that merged groups are never split. In this way, we progressively aggregate hierarchies layer by layer for both images and texts. Specifically, for modeling hierarchies in natural language, we share intuitions with previous studies on unsupervised grammar induction, which aim at unsupervised hierarchical structure mining (Shi et al., 2019; Drozdov et al., 2019). Tree Transformer (Wang et al., 2019) proposes a similarly modified attention mechanism that is essentially a special case of hierarchy-aware attention, where the attention mask is instantiated as a constituent prior to encourage the merging of semantically similar tokens. Capturing hierarchies in visual contents is more challenging, because spatial correlation must also be considered in addition to visual similarity.
Therefore, we extend the hierarchy-aware attention to Vision Transformers (Dosovitskiy et al., 2021) by creating a Group Transformer that progressively aggregates image patches into semantic groups until all patches are merged into one common group, i.e., the original image. Different from the 1D scenario in Tree Transformer, the neighboring affinity score is computed among the four adjacent neighbors of each image patch (Fig. 1(a)). Afterwards, we propagate the neighboring affinity scores by comparing two special paths connecting image patches on the 2D grid graph. Applying such hierarchy-aware attention to both the image and text branches of CLIP yields the proposed hierarchy-aware CLIP (HiCLIP), which features the following advantages: (1) it automatically discovers hierarchies in vision and language that match human intuitions in an unsupervised manner; (2) it generates better multimodal representations, especially for vision-language downstream tasks; and (3) it offers comprehensive hierarchy visualization to help parse visual and textual hierarchies. To demonstrate these advantages, we pretrain HiCLIP and other CLIP-style approaches on large-scale image-text pairs, and conduct extensive experiments on downstream tasks including visual recognition, image-text retrieval, visual question answering, and visual entailment reasoning. Our contributions are summarized as follows:
• We incorporate hierarchy-aware attention into CLIP (HiCLIP) for both image and text contents, which achieves better performance on vision and vision-language downstream tasks.
• To model images in a hierarchical manner, we propagate neighboring affinity scores through two special paths on 2D grid graphs and generalize the hierarchy-aware attention to Vision Transformers.
• We visualize the evolution of hierarchies in images and texts to demonstrate HiCLIP's ability of unsupervised hierarchy induction, which contributes to better interpretability.

2. RELATED WORK

Vision-Language Models. With the proliferation of multimodal information, exploring the interaction of vision and language information has become an important topic, and many vision-language models have flourished recently. Based on how the training objective is designed, they can be divided into three categories. The first category includes early bilinear pooling and attention based multimodal models such as MUTAN (Ben-Younes et al., 2017), BAN (Kim et al., 2018), bottom-up top-down attention (Anderson et al., 2018), and intra-inter modality attention (Gao et al., 2019). The second category is built upon the masked language modeling (MLM) pretraining objective and consists of approaches such as ViLBERT (Lu et al., 2019), LXMERT (Tan & Bansal, 2019), UNITER (Chen et al., 2020), and ALBEF (Li et al., 2021). Several recent approaches such as SOHO (Huang et al., 2021) and BEiT (Wang et al., 2022) further extend MLM to masked visual modeling (MVM) to push the boundary of multimodal learning. In addition, CLIP family models (Radford et al., 2021; Li et al., 2022; Mu et al., 2021; Yao et al., 2022; Chen et al., 2023), which rely on vision-language contrastive learning and large-scale image-text pairs, constitute the last category. Unsupervised Grammar Induction. Unsupervised grammar induction is a classic topic in the NLP domain aiming at automatically inducing phrase-structure grammars from free text without parse-tree annotations. Early on, probabilistic context-free grammars (PCFGs) built upon context-free grammar were widely applied and solved by the inside-outside algorithm (Baker, 1979) or the CYK algorithm (Sakai, 1961).
More recently, many deep learning based approaches have been proposed by extending the conventional methods with neural networks, such as C-PCFG (Kim et al., 2019) and DIORA (Drozdov et al., 2019), designing special modules to induce tree structures, such as PRPN (Shen et al., 2018), ON-LSTM (Shen et al., 2019), and Tree Transformer (Wang et al., 2019), or assisting unsupervised grammar induction with cross-modality alignment, such as VG-NSL (Shi et al., 2019), VC-PCFG (Zhao & Titov, 2020), and CLIORA (Wan et al., 2022). Hierarchical Discovery in Vision. Discovering the hierarchy in visual contents is a well-established area of vision research. For example, Lin et al. (2017) construct a hierarchy of feature pyramids in object detection to help the model capture semantics at all scales. More recent work on transformers (Liu et al., 2021; Zhang et al., 2020) adopts a similar intuition to generate hierarchical feature maps with special local-global attention designs. Meanwhile, another line of research designs new fine-grained parsing tasks to understand the hierarchy within a scene, such as scene graph parsing (Krishna et al., 2017; Zhang et al., 2019) and action graph parsing (Ji et al., 2020). Recently, more efforts have been devoted to automatic hierarchy learning with self-supervised or weakly-supervised objectives (Xie et al., 2021; Dai et al., 2021), designing special inductive biases for the self-attention mechanism (Yu et al., 2022; Zheng et al., 2021), and automatically merging semantically similar embeddings (Xu et al., 2022; Locatello et al., 2020). Our work falls within the scope of developing special attention constraints and utilizing contrastive learning objectives for unsupervised hierarchy discovery.

3. HIERARCHY-AWARE ATTENTION IN CLIP

As discussed in Section 1, both vision and language share a hierarchical nature in information parsing. The lower level of the hierarchy contains more localized and finer-grained information while the higher levels capture more holistic semantics. These properties are in line with how we humans understand vision (Hubel & Wiesel, 1968; Fukushima & Miyake, 1982; Kuzovkin et al., 2018) and language information (Chomsky, 1956; Manning, 2022) .

3.1. A FRAMEWORK OF HIERARCHICAL INFORMATION AGGREGATION

Hierarchy-aware attention is based on the attention mechanism in conventional Transformers. Given query $Q$, key $K$, and value $V$, and the scaling factor $\sqrt{d_h}$ that maintains the order of magnitude of the features, where $d_h$ denotes the feature dimension, the general attention function is defined as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_h}}\right) V \quad (1)$$

As illustrated in Fig. 2, we propose to enhance CLIP's vision and language branches with hierarchy-aware attention. Following the common Transformer architecture, with the modality inputs split into low-level image patches and text tokens, we recursively merge patches and tokens that are semantically and spatially similar, gradually forming more semantically concentrated clusters such as image objects and text phrases. We first define the hierarchy aggregation priors as follows:

Tendency to merge. We recursively merge patches and tokens that are spatially adjacent and semantically similar into higher-level clusters. Intuitively, if two nearby image patches have similar appearances, it is natural to merge them into one unit conveying the same semantic information.

Non-splittable. Once patches or tokens are merged, they are never split at later layers. With this constraint, we ensure that the hierarchical information aggregation never degrades and, as a result, preserve the complete process of the hierarchy evolving layer by layer.

We then incorporate these hierarchy aggregation priors into an attention mask $C$, which serves as an extra inductive bias that helps the conventional attention mechanism in Transformers better explore hierarchical structures adapted to each modality format, i.e., a 2D grid for images and a 1D sequence for texts. The proposed hierarchy-aware attention is therefore defined as:

$$\mathrm{HierarchyAttention}(Q, K, V) = \left(C \odot \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_h}}\right)\right) V \quad (2)$$

Note that $C$ is shared among all heads and progressively updated bottom-up across Transformer layers.
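As a minimal single-head sketch of the gating idea (illustrative NumPy code under assumed shapes, not the authors' implementation), the hierarchy-aware attention simply multiplies the ordinary attention probabilities elementwise by the mask C before aggregating values:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hierarchy_attention(Q, K, V, C):
    """Scaled dot-product attention whose probabilities are gated
    elementwise by a hierarchy-aware mask C (one mask, shared by all heads)."""
    d_h = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d_h))  # (n, n) attention probabilities
    return (C * attn) @ V                   # suppress cross-group attention, then pool

# Toy usage: 4 tokens, 8-dim features, a hypothetical hard block-diagonal
# mask that only lets {0, 1} and {2, 3} attend within their own group.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
C = np.kron(np.eye(2), np.ones((2, 2)))
out = hierarchy_attention(Q, K, V, C)
```

In the model C is soft (values in [0, 1]) and learned; the hard mask here only makes the gating effect easy to see.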
We elaborate on the formulations of the hierarchy-aware mask C for each modality as follows.

3.1.1. HIERARCHY INDUCTION FOR LANGUAGE BRANCH

In this section, we revisit the Tree Transformer method from the proposed hierarchy-aware attention point of view and show how to impose the hierarchy aggregation priors on $C$ in three steps.

Generating neighboring attention scores. The neighboring attention score describes the merging tendency of adjacent word tokens. Two learnable query and key matrices $W'_Q, W'_K$ transform any pair of adjacent word tokens $(t_i, t_{i+1})$, so that the neighboring attention score $s_{i,i+1}$ is defined as their inner product:

$$s_{i,i+1} = \frac{(t_i W'_Q) \cdot (t_{i+1} W'_K)}{\sigma_t} \quad (3)$$

Here $\sigma_t$ is a hyper-parameter controlling the scale of the generated scores. Then, for each token $t_i$, a softmax function normalizes its merging tendencies toward its two neighbors:

$$p_{i,i+1}, p_{i,i-1} = \mathrm{softmax}(s_{i,i+1}, s_{i,i-1}) \quad (4)$$

For a neighboring pair $(t_i, t_{i+1})$, the neighboring affinity score $\hat{a}_{i,i+1}$ is defined as the geometric mean of $p_{i,i+1}$ and $p_{i+1,i}$: $\hat{a}_{i,i+1} = \sqrt{p_{i,i+1} \cdot p_{i+1,i}}$. From a graph perspective, it describes the strength of edge $e_{i,i+1}$ by comparing it with edges $e_{i-1,i}$ ($p_{i,i+1}$ vs. $p_{i,i-1}$) and $e_{i+1,i+2}$ ($p_{i+1,i}$ vs. $p_{i+1,i+2}$).

Enforcing the non-splittable property. Intuitively, a higher neighboring affinity score indicates that two neighboring tokens are more closely bonded. To ensure that merged tokens are never split, the layer-wise affinity score $a^l_{i,i+1}$ should increase as the network goes deeper, i.e., $a^l_{i,i+1} \geq a^{l-1}_{i,i+1}$ for all $l$, which helps gradually generate the desired hierarchical structure:

$$a^l_{i,i+1} = a^{l-1}_{i,i+1} + \left(1 - a^{l-1}_{i,i+1}\right)\hat{a}^l_{i,i+1} \quad (5)$$

Modeling the tendency to merge. To measure the tendency to merge, namely $C_{i,j}$, for any word-token pair $(t_i, t_j)$, we propagate the affinity scores of the neighboring tokens between $t_i$ and $t_j$. Specifically, $C_{i,j}$ is derived through multiplication:

$$C_{i,j} = \prod_{k=i}^{j-1} a_{k,k+1} \quad (6)$$

Note that $C$ is a symmetric matrix, so we have $C_{i,j} = C_{j,i}$.
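The three steps can be sketched end to end. This is an illustrative NumPy rendering under assumed shapes (a token matrix `t` of shape `(n, d)` and projection matrices `W_q`, `W_k` are our placeholders), not the authors' code:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def neighbor_affinity(t, W_q, W_k, sigma):
    """Step 1: neighboring affinity scores a_hat[i] for adjacent pairs (i, i+1)."""
    q, k = t @ W_q, t @ W_k
    n = len(t)
    p = np.zeros((n, n))  # p[i, j]: tendency of token i to merge with neighbor j
    for i in range(n):
        nbrs = [j for j in (i - 1, i + 1) if 0 <= j < n]
        probs = softmax(np.array([q[i] @ k[j] for j in nbrs]) / sigma)
        for j, pr in zip(nbrs, probs):
            p[i, j] = pr
    idx = np.arange(n - 1)
    return np.sqrt(p[idx, idx + 1] * p[idx + 1, idx])  # geometric mean

def update_affinity(a_prev, a_hat):
    """Step 2: non-splittable update, guaranteeing a_l >= a_{l-1}."""
    return a_prev + (1.0 - a_prev) * a_hat

def affinity_to_mask(a):
    """Step 3: C[i, j] is the product of affinities on the chain between i and j."""
    n = len(a) + 1
    C = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            C[i, j] = C[j, i] = np.prod(a[i:j])
    return C
```

The update in `update_affinity` is a convex step toward 1, so once a pair's affinity is high it can only grow across layers, which is exactly the non-splittable property.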

3.1.2. HIERARCHY INDUCTION FOR VISUAL BRANCH

From a graph perspective, it is straightforward to generalize the hierarchy-aware mask $C$ from the 1D sequence in language to the 2D grid in the vision domain. First, we again employ query and key matrices $W''_Q, W''_K$ to calculate the neighboring attention scores among the four-adjacent neighbors of each patch $t_{i,j}$:

$$s_{(i,j),(i',j')} = \frac{(t_{i,j} W''_Q) \cdot (t_{i',j'} W''_K)}{\sigma_v} \quad (7)$$

where $\sigma_v$ controls the scale of the generated scores, and the neighborhood is limited to the four-adjacent patches of $t_{i,j}$, i.e., $(i', j') \in \{(i + \delta, j), (i, j + \eta);\ \delta, \eta \in \{-1, +1\}\} \equiv \mathcal{A}$. Next, for each patch $t_{i,j}$, a softmax function yields the merging tendency of $t_{i,j}$ toward its four neighbors: $\{p_{(i,j),(i',j')}\} = \mathrm{softmax}(\{s_{(i,j),(i',j')};\ (i', j') \in \mathcal{A}\})$. Similar to the formulation in the language branch, the neighboring affinity score with the non-splittable property is obtained by:

$$a^l_{(i,j),(i',j')} = a^{l-1}_{(i,j),(i',j')} + \left(1 - a^{l-1}_{(i,j),(i',j')}\right)\hat{a}^l_{(i,j),(i',j')} \quad (8)$$

where $\hat{a}_{(i,j),(i',j')} = \sqrt{p_{(i,j),(i',j')} \cdot p_{(i',j'),(i,j)}}$. Lastly, the neighboring affinity scores need to be propagated across the whole image to obtain $C_{(i_1,j_1),(i_2,j_2)}$ between any two patches $(t_{i_1,j_1}, t_{i_2,j_2})$. Viewing the image as a 2D grid graph, a natural solution is to define $C_{(i_1,j_1),(i_2,j_2)}$ via the shortest path with $-\log(a_{(i,j),(i',j')})$ as edge weights. For better computational efficiency, we instead consider two special paths connecting $(t_{i_1,j_1}, t_{i_2,j_2})$ along the grid with only one turn. The lengths of these two paths are computed by vertical-first and horizontal-first propagation:

$$C_1 = \prod_{n=i_1}^{i_2-1} a_{(n,j_1),(n+1,j_1)} \prod_{m=j_1}^{j_2-1} a_{(i_2,m),(i_2,m+1)}, \qquad C_2 = \prod_{m=j_1}^{j_2-1} a_{(i_1,m),(i_1,m+1)} \prod_{n=i_1}^{i_2-1} a_{(n,j_2),(n+1,j_2)} \quad (9)$$

and $C_{(i_1,j_1),(i_2,j_2)} = \max(C_1, C_2)$.
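A direct (unoptimized) sketch of the two-path propagation, assuming horizontal affinities `a_h[i, j]` between patches (i, j) and (i, j+1) and vertical affinities `a_v[i, j]` between (i, j) and (i+1, j); both array names are ours, not the paper's:

```python
import numpy as np

def seg_h(a_h, row, j1, j2):
    lo, hi = sorted((j1, j2))
    return np.prod(a_h[row, lo:hi])

def seg_v(a_v, col, i1, i2):
    lo, hi = sorted((i1, i2))
    return np.prod(a_v[lo:hi, col])

def grid_mask(a_h, a_v):
    """Propagate neighboring affinities to every patch pair through the two
    single-turn paths (vertical-first and horizontal-first), keeping the max."""
    H, W = a_v.shape[0] + 1, a_h.shape[1] + 1
    n = H * W
    C = np.ones((n, n))
    for a in range(n):
        i1, j1 = divmod(a, W)           # row-major patch index -> (row, col)
        for b in range(a + 1, n):
            i2, j2 = divmod(b, W)
            c1 = seg_v(a_v, j1, i1, i2) * seg_h(a_h, i2, j1, j2)  # down col j1, then row i2
            c2 = seg_h(a_h, i1, j1, j2) * seg_v(a_v, j2, i1, i2)  # along row i1, then col j2
            C[a, b] = C[b, a] = max(c1, c2)
    return C
```

For an H-by-W grid, `a_h` has shape (H, W-1) and `a_v` has shape (H-1, W). Taking the maximum over the two paths lets one low-affinity edge be detoured around, so a weak link only blocks merging when both single-turn routes cross it.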
Intuitively, $C_{(i_1,j_1),(i_2,j_2)}$ finds the maximum merging tendency along two possible paths, either horizontal-first or vertical-first. In this way, both spatial and visual similarities contribute to the attention mask $C$ for 2D images. Since our approach organizes vision patches sharing high similarities into groups, we dub it the "Group Transformer". Relation to Recursive Bilateral Filtering. Recursive bilateral filtering (Yang, 2012) shares a similar spirit with our Group Transformer. Given two pixels in an image, recursive bilateral filtering decomposes the calculation of the range filtering kernel $R$ into two 1D operations, horizontal and vertical. For each 1D operation, let $x_k, x_i$ denote two pixels on a scanline of the 2D image; the 1D range filtering kernel is $R_{k,i} = R_{k,k+1} R_{k+1,k+2} \cdots R_{i-2,i-1} R_{i-1,i} = \prod_{j=k}^{i-1} R_{j,j+1}$, where $R_{j,j+1}$ is computed with a radial basis function kernel: $R_{j,j+1} = \exp\left(-\frac{|x_j - x_{j+1}|^2}{2\sigma_R^2}\right)$. We can observe a property similar to the one Tree Transformer possesses. Bilateral filters tend to preserve sharp edges while smoothing the rest of an image, which is in accordance with the goal of our Group Transformer: aggregating similar patches into a group. The differences between recursive bilateral filtering and our framework are two-fold: 1) the basic operation unit is the pixel in recursive bilateral filtering, while our approach operates on patches or word tokens; 2) recursive bilateral filtering measures range similarity with a radial basis function kernel, while we use the neighboring affinity score instead.
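For comparison, the 1D range kernel of recursive bilateral filtering can be written analogously as a chain product along a scanline (a sketch; `x` is a 1D array of pixel intensities):

```python
import numpy as np

def range_kernel(x, sigma_r):
    """R[k, i] = product of RBF responses between consecutive pixels on the scanline."""
    r = np.exp(-np.diff(np.asarray(x, dtype=float)) ** 2 / (2.0 * sigma_r ** 2))
    n = len(x)
    R = np.ones((n, n))
    for k in range(n):
        for i in range(k + 1, n):
            R[k, i] = R[i, k] = np.prod(r[k:i])
    return R
```

A sharp edge (large |x_j - x_{j+1}|) drives one factor toward zero and suppresses every kernel entry whose chain crosses that edge, mirroring how a low neighboring affinity blocks cross-group attention in C.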

3.2. HIERARCHY-AWARE CLIP

Pretraining with HiCLIP. To equip both CLIP branches with the ability of dynamic hierarchy discovery, our Hierarchy-aware CLIP adopts the Group Transformer as the image encoder and the Tree Transformer as the text encoder. Let $v_i$ and $u_i$ denote the image and text feature vectors of the $i$-th pair; the contrastive pretraining objective $\mathcal{L}$ can be written as:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\left(v_i^\top u_i / \tau\right)}{\sum_{j=1}^{N} \exp\left(v_i^\top u_j / \tau\right)} - \frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\left(u_i^\top v_i / \tau\right)}{\sum_{j=1}^{N} \exp\left(u_i^\top v_j / \tau\right)} \quad (10)$$

where $\tau$ is a learnable temperature parameter and $N$ is the total number of image-text pairs. Unsupervised Hierarchy Induction. During inference, we follow Eq. (5) and Eq. (8) to generate all neighboring affinity scores $\{a^l_{i,i+1}\}_{l=1}^{L}$ and $\{a^l_{(i,j),(i',j')}\}_{l=1}^{L}$ from the bottom to the top layer $L$ for texts and images, respectively. These neighboring affinity scores are then used for hierarchy induction. Intuitively, a low affinity score at a certain layer indicates that the two corresponding neighbors remain split within that layer. By repeating this process in a top-down greedy manner, we generate tree hierarchies for texts and analogous group hierarchies for images in an unsupervised fashion.
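The symmetric contrastive objective can be sketched in NumPy as follows (assuming, as in CLIP, that the feature vectors are already L2-normalized):

```python
import numpy as np

def logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis) + np.log(np.exp(x - m).sum(axis=axis))

def clip_loss(v, u, tau):
    """Symmetric contrastive loss over N aligned (image, text) feature pairs.
    v, u: (N, d) arrays; matching pairs sit on the diagonal of the logit matrix."""
    logits = v @ u.T / tau                                     # (N, N) similarity logits
    diag = np.diag(logits)
    img_to_txt = -(diag - logsumexp(logits, axis=1)).mean()    # rows: image -> all texts
    txt_to_img = -(diag - logsumexp(logits, axis=0)).mean()    # cols: text -> all images
    return img_to_txt + txt_to_img
```

Each term is a cross-entropy that pulls matched pairs together and pushes the other in-batch pairs apart; the loss is minimized when the diagonal dominates every row and column.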

4. EXPERIMENTS

4.1. EXPERIMENTAL SETTINGS

Pretraining Datasets. To make a fair comparison with state-of-the-art contrastive vision-language pretraining approaches, we adopt the YFCC15M benchmark proposed in Cui et al. (2022), which builds on a subset of YFCC100M (Thomee et al., 2016) consisting of 15M image-text pairs. In addition, we construct a 30M version of the pretraining data by including Conceptual Captions 3M (CC3M) (Sharma et al., 2018) and 12M (CC12M) (Changpinyo et al., 2021). We thus validate our model on two different scales of pretraining data.

Downstream Datasets. Following CLIP and DeCLIP, we select 11 visual recognition datasets under the zero-shot setting, namely ImageNet (Deng et al., 2009), CIFAR-10 & CIFAR-100 (Krizhevsky et al., 2009), StanfordCars (Krause et al., 2013), Caltech101 (Fei-Fei et al., 2004), Flowers102 (Nilsback & Zisserman, 2008), SUN397 (Xiao et al., 2010), DTD (Cimpoi et al., 2014), FGVCAircraft (Maji et al., 2013), OxfordPets (Parkhi et al., 2012), and Food101 (Bossard et al., 2014). The same zero-shot classification protocol as Radford et al. (2021) is applied, using predefined prompts as text inputs; the full list of prompts is provided in the Appendix. Although CLIP and DeCLIP only evaluate on visual recognition, we also provide comprehensive comparisons on vision-language tasks, which are more relevant for evaluating multimodal models, including image-text retrieval on MSCOCO Caption (Chen et al., 2015), as well as vision-language reasoning on VQAv2 (Antol et al., 2015) and SNLI-VE (Xie et al., 2019).

Implementation Details. Two variants of the Vision Transformer (Dosovitskiy et al., 2021) are used as the image encoder in our experiments, ViT-B/32 and ViT-B/16, while the text encoder is a vanilla Transformer (Vaswani et al., 2017) following CLIP for a fair comparison. The embedding size of both image and text features is 512 throughout our paper.
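The zero-shot classification protocol described above can be sketched as follows: the prompted class-name embeddings are averaged into one classifier weight per class, and images are labeled by cosine similarity. This is a schematic on precomputed embeddings; the encoders themselves are omitted:

```python
import numpy as np

def build_zero_shot_classifier(text_embs):
    """text_embs: (n_classes, n_prompts, d) embeddings of prompted class names.
    Average over prompts, then L2-normalize to form one weight vector per class."""
    w = text_embs.mean(axis=1)
    return w / np.linalg.norm(w, axis=1, keepdims=True)

def zero_shot_predict(image_embs, class_weights):
    """Return the index of the most similar class for each image embedding."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    return (img @ class_weights.T).argmax(axis=1)
```

Averaging over many prompt templates (e.g. "a photo of a {label}.") reduces the sensitivity of the class embedding to any single phrasing.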
To make a fair comparison with the CLIP family baselines, we train all models for 32 epochs under the same set of pretraining hyperparameters, including learning rate, warmup steps, weight decay, etc. The input image size is set to 224 × 224, and the input text sequence is truncated or padded to a length of 77. The scaling factors σ_t and σ_v of the hierarchy-aware attention are both set to 256 for the Tree Transformer and the Group Transformer, respectively. Following CLIP and DeCLIP, the learnable temperature parameter τ is initialized to 0.07.

4.2. VISUAL RECOGNITION

We first compare HiCLIP with state-of-the-art CLIP family approaches on the YFCC15M benchmark (Cui et al., 2022). Since DeCLIP (Li et al., 2022) applies multiple single-modal self-supervised tasks in addition to CLIP's contrastive objective, we incorporate the same objectives into our hierarchy-aware model for a fair comparison (denoted as HiDeCLIP). By combining the contrastive learning and self-supervised learning loss functions, our HiDeCLIP further improves zero-shot ImageNet classification by 2.7% over DeCLIP, and is overall 13.1% higher than CLIP. 11 Visual Recognition Benchmarks. Note that we include both versions of the training data, YFCC15M (short as 15M) and 30M, in these experiments as discussed in Section 4.1. We observe that the zero-shot performance on the Cars and Aircraft datasets is very low for all models, because only 0.04% and 0% of the descriptions in the YFCC benchmark contain the aircraft and car labels used in these datasets, such as "Audi V8 Sedan 1994". HiCLIP achieves a significant average improvement over CLIP on both pretraining datasets, indicating that HiCLIP maintains a substantial advantage over CLIP when scaling up the training data. Although the absolute improvement from incorporating hierarchy-aware attentions into CLIP is less pronounced than that from adding multiple self-supervised tasks, it is notable that hierarchy-aware attention is compatible with self-supervised learning (HiDeCLIP) and further improves performance over DeCLIP.

4.3. PERFORMANCE COMPARISON ON VISION-LANGUAGE TASKS

In Table 2, we compare different CLIP-style methods on downstream vision-language tasks, including image-text retrieval, which emphasizes cross-modal alignment, and two vision-language reasoning tasks (VQA and SNLI-VE), which focus more on collaborative multimodal reasoning. Zero-shot Image-Text Retrieval on MSCOCO. Across all algorithms and training datasets, HiCLIP and HiDeCLIP improve retrieval performance by a large margin. It is worth noting that, unlike on the visual recognition tasks, HiCLIP consistently outperforms DeCLIP while relying solely on CLIP's contrastive loss, without complicated self-supervised learning objectives. This finding suggests that the benefits of adding self-supervised learning are mostly confined to visual recognition, while our approach fully exploits the hierarchical nature of multimodal contents, which contributes to a significant performance boost on vision-language tasks. Fine-tuning on Vision-Language Reasoning Tasks. Similar to the results on zero-shot retrieval, we observe consistent performance gains across all visual reasoning tasks and pretraining data scales, indicating that hierarchy-aware attentions are more efficient multimodal learners and are capable of tackling tasks that require content understanding and reasoning capabilities.

4.4. ABLATION STUDY

In this section, we provide additional ablation studies on influencing factors, including the patch granularity of the visual encoder and the pretraining data volume, in Table 3. We report zero-shot accuracy on ImageNet, average accuracy over all 11 visual recognition datasets, Rsum over recall@1, 5, and 10 on zero-shot image-text retrieval, as well as accuracy on VQA and SNLI-VE with fine-tuning. (Table 3: Ablations on the patch granularity and pretraining data scale for HiCLIP & HiDeCLIP. Rsum is the summation of R@1, R@5, and R@10 of image-to-text and text-to-image retrieval.) In addition, we conduct a component analysis in Table 4 to show that the Group Transformer and the Tree Transformer both play important roles in HiCLIP.


On Patch Granularity. We compare all downstream tasks using ViT-B/32 and ViT-B/16 as visual encoders. Since the Group Transformer operates on visual patches and benefits from finer-grained patch segments, HiCLIP and HiDeCLIP achieve consistent performance improvements when comparing the same method across the two visual encoder variants. When we fix the visual encoder and compare HiCLIP and HiDeCLIP with their corresponding baselines (i.e., CLIP and DeCLIP), HiCLIP and HiDeCLIP consistently outperform CLIP and DeCLIP on all tasks with the help of hierarchy-aware attention. It is worth noting that HiCLIP alone, without complex self-supervised losses, outperforms DeCLIP on three out of five tasks, trailing only on ImageNet and VQA by small margins. This shows that the hierarchical information captured by HiCLIP benefits the vision-language contrastive learning paradigm. On Pretraining Data Scale. As shown in Table 3, for most vision recognition tasks, we observe that the benefits contributed by a better modeling strategy saturate as more data is used during pretraining, in line with the findings reported by many other works including CLIP (Radford et al., 2021). One possible explanation is that, to achieve further improvements on visual recognition tasks, a more vision-specific training scheme such as self-supervised learning is potentially more beneficial, because the ability of high-level multimodal reasoning is not as critical in vision-only tasks. In contrast, when scaling up the pretraining data, the performance improvements achieved on vision-language tasks are more significant and consistent across all methods. Similarly, HiCLIP and HiDeCLIP still enjoy large improvements over CLIP and DeCLIP when the pretraining dataset scales up.
In addition, HiCLIP pretrained on the 30M data achieves better vision-language performance on all three tasks than DeCLIP, suggesting potentially better scalability of HiCLIP on vision-language reasoning tasks, while DeCLIP retains better visual recognition performance. On Component Analysis. In Table 4, we show that using the Group Transformer alone (HiCLIP-Group) for vision modeling yields improvements on the visual recognition task (zero-shot ImageNet classification) comparable to using the Tree Transformer alone (HiCLIP-Tree). In addition, the improvements on image-text retrieval are more significant when applying the Tree Transformer alone than the Group Transformer alone, indicating that language modeling may have a larger impact than visual modeling on such vision-language tasks. Moreover, when we activate both the Group Transformer and the Tree Transformer, substantial performance boosts are obtained over HiCLIP-Group and HiCLIP-Tree, showcasing the synergy between the dual hierarchy-aware attentions even under naive cross-modal interactions.

4.5. UNSUPERVISED HIERARCHY INDUCTION WITH PRETRAINED HICLIP MODEL

It is natural to adopt the Tree Transformer because texts are essentially discrete tokens among which certain semantic dependencies are shared. Following the same analogy, since each image is pre-patchified in Vision Transformers, we expect the image patches to gradually join semantic groups from bottom layers to top layers for a better representation, although this is more challenging than in the language counterpart. Therefore, in addition to the performance gains achieved across various downstream tasks, we also visualize the hierarchies captured by our Group and Tree Transformers. As shown in Figure 3, by virtue of explicitly modeling the inputs with hierarchy-aware attentions during pretraining, our model is able to gradually group semantically similar neighbors, demonstrating its ability to perform hierarchical visual and language induction in an unsupervised manner.
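For the text branch, a minimal version of the top-down greedy parse reads the final-layer affinities and recursively splits each span at its weakest link. This is our simplified rendering of the Tree Transformer parsing procedure, not the exact algorithm:

```python
import numpy as np

def induce_tree(a, lo=0, hi=None):
    """a[k] is the affinity between adjacent tokens k and k+1. Returns a
    nested-tuple binary tree over token indices [lo, hi], splitting each
    span at its minimum-affinity link (greedy, top-down)."""
    if hi is None:
        hi = len(a)                       # tokens are 0 .. len(a)
    if lo == hi:
        return lo                         # single-token span
    k = lo + int(np.argmin(a[lo:hi]))     # weakest link inside the span
    return (induce_tree(a, lo, k), induce_tree(a, k + 1, hi))
```

For affinities [0.9, 0.1, 0.8] over four tokens, the weakest link sits between tokens 1 and 2, so the parse is ((0, 1), (2, 3)).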

5. CONCLUSION AND FUTURE WORK

In this work, we equip both the visual and language branches of CLIP with hierarchy-aware attention to automatically capture the hierarchies in image-caption pairs. Following the discovered hierarchical structures, the proposed HiCLIP creates compact image and text embeddings by gradually aggregating spatially and semantically similar patches or tokens into common groups. Supported by extensive experiments on multiple downstream tasks, we show that hierarchy-aware attention greatly improves the alignment of image and text modalities compared with several recent CLIP-style approaches. Moreover, after pretraining, both branches of HiCLIP can be adopted for unsupervised hierarchy induction by analyzing the generated constituent attention weights. With limited computational resources, we conduct experiments with up to 30 million image-text pairs without extensive parameter tuning. As a future direction, we plan to scale up both the pretraining dataset and the visual encoder to fully validate the scalability of our approach. In addition, we also plan to explore the full potential of hierarchy-aware attentions with better multimodal information fusion operations than the simple dot product used in CLIP-style models.

A PROMPTS ENGINEERING FOR ZERO-SHOT VISUAL RECOGNITION

In this work, we follow the 80 prompts proposed in Radford et al. (2021) to evaluate zero-shot image classification on the ImageNet dataset. The full list of prompts for ImageNet is presented in Table 5. For the other downstream visual recognition datasets, we also use domain-specific prompts according to Radford et al. (2021). The full set of prompts for the 10 downstream datasets can be found in Table 6.

B LINEAR PROBE PERFORMANCE

In addition to the zero-shot image classification tasks presented in Table 1, we also perform linear probing on frozen image features to estimate the quality of the pretrained image encoders. We follow the settings of CLIP and DeCLIP and train a linear classifier with the L-BFGS optimizer from the scikit-learn machine learning library. The linear probe results on the downstream visual recognition tasks are presented in Table 7. From the table, we observe that HiCLIP / HiDeCLIP still outperform CLIP / DeCLIP across all visual encoder types and pretraining data sizes.
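The probe itself is a standard logistic regression on frozen features. A minimal sketch using scikit-learn's L-BFGS solver follows; the regularization value `C` here is a hypothetical placeholder (CLIP sweeps this hyperparameter on a validation split):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe(train_feats, train_labels, test_feats, C=3.16):
    """Fit a logistic-regression probe on frozen image features and predict
    labels for held-out features. C is the inverse regularization strength."""
    clf = LogisticRegression(C=C, solver="lbfgs", max_iter=1000)
    clf.fit(train_feats, train_labels)
    return clf.predict(test_feats)
```

Because the encoder stays frozen, probe accuracy isolates the quality of the learned representation from any fine-tuning effects.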

C ADDITIONAL PRETRAINING IMPLEMENTATION DETAILS

Our implementation is based on an open-source PyTorch implementation. Following Cui et al. (2022), we use the AdamW optimizer (Loshchilov & Hutter, 2019) with a weight decay rate of 0.1 during pretraining. The learning rate is first linearly increased to 0.001 within 2500 warmup steps, and then decayed to 0 following a cosine schedule. For the 15M version of the pretraining data, we set the batch size to 4096 and run all experiments on 32 A100 GPUs. For the 30M version, we set the batch size to 8192 and run all experiments on 64 A100 GPUs.
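Under the stated hyperparameters, the schedule can be sketched in pure Python (the warmup step count and peak rate come from the text; the total step count depends on the data and batch size):

```python
import math

def learning_rate(step, total_steps, warmup_steps=2500, peak_lr=1e-3):
    """Linear warmup to peak_lr over warmup_steps, then cosine decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

The rate is 0 at step 0, exactly `peak_lr` at the end of warmup, and decays smoothly to 0 at `total_steps`.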

D MORE VISUALIZATION RESULTS & VISUALIZATION PROCESS

Besides the visualization results illustrated in Figure 3, we provide eight more cases of unsupervised hierarchy induction in Figure 4 through Figure 11. Moreover, we provide a detailed description of the visualization process for input images in Algorithm 1, where we set the list of "break" threshold values {θ1, . . . , θ12} to {0.35, 0.5, 0.5, 0.6, 0.8, 0.85, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9}. For unsupervised grammar induction, we adopt the same parsing algorithm as Tree Transformer (Wang et al., 2019). Based on the visualization results, we conclude that by integrating hierarchy-aware attention into the conventional attention mechanism, HiCLIP can discover and aggregate spatially and semantically similar visual patches and language tokens in a layer-by-layer manner. However, the current unsupervised hierarchy induction of HiCLIP (especially the visualization of the vision encoder) follows a top-down style and relies on threshold values to decide whether to split two adjacent visual patches or language tokens. For the visual hierarchy, we manually specify the thresholds for different layers (a higher layer also receives a higher threshold value), so the threshold list may not be suitable for every image. In addition, changing the threshold values may influence the visual and language induction results. Ideally, the thresholds would be adaptive to each input image and sentence; our future work is to find a better way (e.g., a data-dependent algorithm) to parse the C matrix for each layer.

E VISUALIZATION OF LEARNED FEATURE SPACE

In Figure 12 and Figure 13, we provide t-SNE visualizations of the learned feature space for CLIP, HiCLIP, and DeCLIP pretrained on YFCC-15M and 30M data, respectively. We use the 10 classes of the CIFAR-10 dataset for all visualization experiments.
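A sketch of such a t-SNE projection using scikit-learn, with random placeholder features standing in for the encoders' image embeddings (the perplexity value is an assumption, not the paper's setting):

```python
# t-SNE sketch in the spirit of Figures 12-13: project high-dimensional image
# embeddings to 2-D for a scatter plot coloured by class label.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 512))  # placeholder image embeddings

coords = TSNE(n_components=2, perplexity=10.0,
              init="pca", random_state=0).fit_transform(features)
# `coords` holds one 2-D point per image.
```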

Excerpt from Table 6 (domain-specific prompts):

FGVCAircraft: "a photo of a {label}, a type of aircraft." / "a photo of the {label}, a type of aircraft."

Flowers102: "a photo of a {label}, a type of flower."

OxfordPets: "a photo of a {label}, a type of pet."

SUN397: "a photo of a {label}." / "a photo of the {label}."

F DETAILED ILLUSTRATION OF THE COMPUTATION OF C

In Figure 14, we illustrate the detailed computation steps of the attention mask C. For the toy example sentence "a blue cat sitting on bench", we show how the C^l_{i,j} matrix in each Tree-Transformer layer is calculated from the neighbourhood affinity scores a^l_{i,i+1} through multiplication, i.e., C^l_{i,j} = ∏_{k=i}^{j-1} a^l_{k,k+1}, where i ∈ {0, . . . , N−1} and N is the input sequence length.

Figure 4: Visualization results for "a collection of fruits and vegetables sitting on a stove top". Our HiCLIP successfully recognizes the green vegetable, fruits such as the apples, as well as the stove top. Meanwhile, the language hierarchy of the input sentence is also created by analyzing the constituent attention weights.

Figure 5: Visualization results for "a group of zebra standing next to each other on a dirt field". Our HiCLIP approach generates a correct parsing tree while aggregating image patches that correspond to the concepts zebra and dirt field into common groups in an unsupervised manner.

Figure 6: Visualization results for "a player swings his bat during a baseball game". Our HiCLIP approach successfully aggregates the regions of the dugout and the baseball field, although the batter is not well recognized in the visual hierarchy. Meanwhile, the language parsing tree obtained by analyzing the constituent attention weights is generally correct.

Figure 7: Visualization results for "a living room with a couch and a tv near a wall". Our HiCLIP successfully generates a correct language hierarchy for the input sentence. Moreover, it merges the patches that correspond to the couch and the tv, as well as the carpet and door regions of the living room.

Figure 10: Visualization results for "a stone building that has a clock on the top". In this case, our HiCLIP can roughly merge the regions of the stone building and the clock tower. For the language hierarchy, however, "a" and "clock" should have been aggregated together first, before being merged with "has".
Figure 11: Visualization results for "there are people playing a game of tennis". In this case, our HiCLIP does not achieve a meaningful visual hierarchy for the tennis court, even though the patches corresponding to the player are merged during the induction process. For the language hierarchy, "a" and "game" should have been aggregated together first, before being merged with "playing".
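The product rule C^l_{i,j} = ∏_{k=i}^{j-1} a^l_{k,k+1} described in Section F can be sketched for a single layer as follows; the affinity values are hypothetical and not taken from Figure 14:

```python
import numpy as np

def constituent_mask(affinity):
    """Build the attention mask C for one layer from neighbour affinities.

    affinity[k] is the affinity score a_{k,k+1} between tokens k and k+1;
    C[i, j] is the product of all neighbour affinities on the path from
    token i to token j, and the diagonal stays at 1.
    """
    n = len(affinity) + 1                  # sequence length N
    C = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            C[i, j] = C[j, i] = np.prod(affinity[i:j])
    return C

# Toy sentence "a blue cat sitting on bench": 6 tokens, 5 neighbour scores.
C = constituent_mask([0.1, 0.9, 0.2, 0.85, 0.3])
```

A high affinity between adjacent tokens (e.g., blue/cat at 0.9) keeps C large inside that group, while a low affinity anywhere along the path drives C between distant tokens toward zero.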






Figure 1: Illustration of hierarchical structures in both (a) vision and (b) language modalities. Based on the affinity scores between adjacent vision patches or word tokens (marked in blue boxes), the attention mask C in hierarchy-aware attention considers both spatial and semantic similarity following the highest valued route (marked in red arrows) between two patches or tokens. The affinity scores evolve layer-by-layer contributing to the different levels of hierarchy granularity.

Figure 2: Illustration of Hierarchy-aware CLIP (HiCLIP), which employs hierarchy-aware attention in both vision and language encoders. HiCLIP estimates the affinity scores of neighbouring vision patches or language tokens and progressively groups them into higher-level constituents, encouraging the encoders to explicitly capture hierarchical information during training.

Figure 3: An example of unsupervised hierarchy induction for a semantically aligned image-text pair, "a small blue plane sitting on top of a field": (a) visual hierarchy and (b) language hierarchy.

Excerpt from Table 5 (ImageNet prompts): "drawing of the {label}." / "a photo of the large {label}." / "a black and white photo of a {label}." / "the plushie {label}." / "a dark photo of a {label}." / "itap of a {label}." / "graffiti of the {label}." / "a toy {label}." / "itap of my {label}." / "a photo of a cool {label}." / "a photo of a small {label}." / "a tattoo of the {label}."



Figure 8: Visualization results for "a train passing by a flag on a clear day". Our HiCLIP successfully recognizes the regions of the train, the flag, and the elevated track. For the language hierarchy, HiCLIP also aggregates these concept words together correctly.

Figure 9: Visualization results for "a couple of elephants standing by some trees". Our HiCLIP captures a correct language hierarchy of the verb and concept words in the input sentence. Meanwhile, HiCLIP also aggregates image patches that correspond to the concepts elephant and tree.


Figure 14: Detailed illustration of the computation of C. We take the short sentence "a blue cat sitting on bench" as a toy example and show the real values of a^l_{i,i+1} and the calculated C^l_{i,j} matrices in the first three layers of Tree-Transformer. Several words are grouped together when the affinity score between them is high enough, i.e., greater than a threshold (e.g., 0.8): blue and cat in Layer 1; sitting and on in Layer 1; and sitting, on, and bench in Layer 2.

containing only the ImageNet zero-shot setting. Then we present zero-shot classification results on the other common visual recognition datasets. Both results are presented in Table 1.

YFCC15M Benchmark. CLIP (Radford et al., 2021), SLIP (Mu et al., 2021), and FILIP (Yao et al., 2022) all leverage contrastive learning and can be directly compared with HiCLIP.

Table 1: Zero-shot classification top-1 accuracy (%) on 11 vision datasets against state-of-the-art CLIP-style models, including CIFAR10/100 (C10/100), Food101 (F101), Flowers (Flow), Caltech (Cal), Aircraft (Air), and ImageNet (IN). ViT-B/32 is used for all compared models.

Zero-shot image-text retrieval on MSCOCO (5K) dataset and vision-language reasoning on VQAv2 and SNLI-VE with fine-tuning. ViT-B/32 is adopted for all models.

Ablations on the use of Group Transformer (G-Trans) and Tree Transformer (T-Trans) in HiCLIP. All models are pretrained on YFCC15M.

Table 6: Full list of prompts to evaluate on 10 downstream domain-specific visual recognition datasets.

CIFAR10 & CIFAR100: "a photo of a {label}." / "a blurry photo of a {label}." / "a black and white photo of a {label}." / "a low contrast photo of a {label}."

Table 7: Linear probe performance on downstream datasets. C10/100 is CIFAR10/100, F101 is Food101, Flow is Flowers, Cal is Caltech, and Air is Aircraft.

Availability. Our code is available at https://github.com/jeykigung/HiCLIP.

Algorithm 1 Unsupervised hierarchy induction for input images

Require: All neighbouring affinity scores a^l_{(i,j),(i',j')} for layers l = 1, . . . , N; a list of "break" threshold values {θ1, . . . , θN}, one per layer.
 1: l ← N                                          ▷ Start from the highest layer
 2: Initialize a nested list B = {B1, . . . , BN}  ▷ Store the break edges of each layer
 3: while l > 0 do
 4:     for each edge ((i,j), (i',j')) in the patch graph do
 5:         if a^l_{(i,j),(i',j')} < θ_l then
 6:             if l = N then
 7:                 Append the edge ((i,j), (i',j')) to B_l  ▷ Break the edge in the top layer N
 8:             else
 9:                 if the edge ((i,j), (i',j')) is not in B_{l+1} then
10:                     Append the edge ((i,j), (i',j')) to B_l  ▷ Break the edge in layer l
11:                 end if
12:             end if
13:         end if
14:     end for
15:     l ← l − 1                                  ▷ Move to the next lower layer
16: end while
17: Draw the visual hierarchy based on B, then remove redundant edges by finding connected components.
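A compact Python rendering of Algorithm 1's break-edge collection, assuming each layer's affinities are given as a dictionary from edge keys to scores; this is an illustrative data structure, not the paper's actual implementation, and the final drawing/connected-components step is omitted:

```python
def induce_break_edges(affinities, thresholds):
    """Top-down break-edge collection from Algorithm 1.

    affinities[l - 1] maps each patch-graph edge to its affinity score at
    layer l; thresholds[l - 1] is the "break" threshold for layer l.
    Returns B, where B[l - 1] is the set of edges broken at layer l.
    """
    num_layers = len(affinities)
    B = [set() for _ in range(num_layers)]
    for l in range(num_layers, 0, -1):       # highest layer first
        for edge, score in affinities[l - 1].items():
            if score < thresholds[l - 1]:
                if l == num_layers:
                    B[l - 1].add(edge)       # always break in the top layer
                elif edge not in B[l]:       # skip edges already broken one layer above
                    B[l - 1].add(edge)
    return B
```

For instance, with two layers and thresholds of 0.5, an edge scoring 0.2 at layer 1 and 0.9 at layer 2 is broken only at layer 1.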

Figure 12: Visualization of the learned feature space via t-SNE on the CIFAR-10 dataset, using CLIP, HiCLIP, and DeCLIP checkpoints pretrained on YFCC-15M data (panels: CLIP 15M, HiCLIP 15M, DeCLIP 15M).

Figure 13: Visualization of the learned feature space via t-SNE on the CIFAR-10 dataset, using CLIP, HiCLIP, and DeCLIP checkpoints pretrained on 30M data (panels: CLIP 30M, HiCLIP 30M, DeCLIP 30M).

