Topic and Hyperbolic Transformer to Handle Multi-modal Dependencies

Abstract

Multi-modal search relies on jointly learning image-text representations and has been widely investigated in the literature; our innovation is Chimera, a framework for learning their representations and similarities. Because the core of multi-modal search is embedding the modalities in a shared semantic space and measuring their similarity, search quality depends on how expressive the space used for learning is. This motivates us to identify spaces that can elucidate the semantic and complex relationships between modalities with small information loss. We ensure novelty by introducing topic and hyperbolic spaces, and by performing contrastive/metric learning tasks so that these spaces cooperate with the Transformer. Experiments show that Chimera empowers pre-trained models for multi-modal search tasks and demonstrate the ability of the layers it introduces.

1. INTRODUCTION

While conventional search systems find documents relevant to a query, multi-modal (cross-modal) search systems retrieve relevant instances to answer the query. Text-based image retrieval models return images matched to a given text, where images are ranked based on their similarity to the text. Inferring the semantic alignment between images and words in the associated sentences allows the fine-grained interplay between vision and language data to be captured, and makes image-text matching more interpretable. As this search model needs semantic-intermediate expressions for both images and text, our challenge is to identify spaces that can learn their appropriate representations and define similarities. Since joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, many studies have explored how to learn their embeddings, project them into a semantic space, and measure their similarity. Many existing multi-modal approaches employ either embedding Lee et al. (2018) or classification Huang et al. (2017). In the embedding approach, the semantic space is generally of low dimension, allowing modalities to be represented as vectors and compared directly with conventional distance metrics (e.g., cosine similarity or Euclidean distance) as the similarity function in this embedding space. The literature describes a variety of matching functions, such as metric learning Hsieh et al. (2017); Tay et al. (2018b) and/or neural networks He et al. (2017). Although these studies focus primarily on designing distance functions over objects (i.e., between users and items), and have empirically demonstrated reasonable success in collaborative ranking with implicit feedback, such matching functions work only within Euclidean space Tran et al. (2020). Since hypernymy and entailment relationships exist among words and texts Vendrov et al.
(2016), one can perceive a semantic hierarchical relationship between sentences and images. This intuition motivates us to explore spaces that explicitly illuminate both semantic and hierarchical relationships, and to project modalities into them for representation learning. That is, our framework, Chimera, focuses on topics and hyperbolic geometry as the semantic space and the roomier space, respectively; contrastive and metric learning approaches are used as training objectives. As topic models describe the statistical relationships of word occurrences as global information (e.g., over a given dataset), Chimera adopts their concept to complement the Transformer with global information and to mitigate the strong conditional dependencies between inputs and modalities. As recent studies Hui & Ku (2022); Yang et al. (2022) show that hyperbolic space is roomier than Euclidean space and can better represent complex relationships, Chimera employs both this space and its operations to realize simpler representations rather than looking for good image/text matches in Euclidean space. Unlike other models, Chimera is a framework in which Transformer-based models are directed to jointly learn representations in different spaces (i.e., topic and hyperbolic space). Chimera applies metric learning in both Euclidean and hyperbolic space to utilize the benefits of both, and assigns higher similarity scores to positive pairs than to negative ones. The key advantages of our approach are two-fold. Theoretical contribution: Chimera (1) derives a way to incorporate both topic and hyperbolic space with Euclidean space to learn more semantic and complex relationships over modalities in an end-to-end fashion; and (2) trains objectives via multi-task learning, adopting the simplicity and effectiveness of the metric learning paradigm to combine Euclidean-space pre-trained models and a hyperbolic-space model in the fine-tuning stage.
Practical contribution: Chimera is a plug-and-play framework that can utilize the benefits of pre-trained models by fusing Euclidean/hyperbolic/topic spaces through a contrastive/metric learning paradigm in fine-tuning, as shown in Sections 4.3 and 4.4.

2. RELATED WORK

Cross-modal or multi-modal retrieval Balaneshin-kordan & Kotov (2018); Carvalho et al. (2018); Wang et al. (2019) finds a set of objects in response to textual queries, or assigns understandable terms to objects; multi-modal inputs are processed simultaneously for joint vision + language (V+L) understanding. Recent pre-trained language models Devlin et al. (2019); Radford et al. (2019); Yang et al. (2019); Liu et al. (2019b); Lan et al. (2020) use the Transformer Vaswani et al. (2017) to learn contextualized text representations; they have yielded great advances in NLP tasks and have been applied to multi-modal tasks Singh et al. (2022); Hu & Singh (2021). ViLBERT Lu et al. (2019) and LXMERT Tan & Bansal (2019) introduced the two-stream architecture, where separate Transformers are applied to images and text independently, and their results are fused by a third Transformer in a second stage. Unicoder-VL Li et al. (2020a) takes the visual regions of the image and the textual tokens of the sentence as input, and encodes them to yield the linguistic embedding and image embedding, respectively. As for pre-training, VL-BERT Su et al. (2020) performs pre-training on both visual-linguistic and text-only datasets jointly, but does not address the task of Sentence-Image Relationship Prediction, unlike other works Lu et al. (2019). UNITER Chen et al. (2020b) uses conditional masking on Masked Language Modeling and Masked Region Modeling, and introduces a novel Word-Region Alignment pre-training task; it can realize heterogeneous downstream V+L tasks with joint multi-modal embeddings. Zhou et al. (2020) present a unified Vision-Language Pre-training model that also allows generation via a unified Transformer with various self-attention masks Dong et al. (2019). Murahari et al. (2020) propose VisDial-BERT, which employs pre-trained models for visual dialog. VD-BERT Wang et al.
(2020) relies on the self-attention mechanism within a single-stream Transformer encoder to capture the complex interactions in a unified manner. Oscar Li et al. (2020b) is motivated by the observation that the salient objects in an image can be detected and are often mentioned in the accompanying text, and uses object tags as anchor points to align the image and language modalities in a shared semantic space. VinVL Zhang et al. (2021) feeds in the visual features generated by a new object detection model, and utilizes an improved version, OSCAR+, to pre-train the V+L model and fine-tune it for a wide range of downstream V+L tasks. Hyperbolic representation learning has recently demonstrated great potential across a diverse range of applications, such as learning entity hierarchies Nickel & Kiela (2017), natural language processing Tay et al. (2018a), and hyperbolic graph neural networks Chami et al. (2019); Liu et al. (2019a). Hyperbolic geometry has also been applied to the recommender systems domain Tran et al. (2020); Wang et al. (2022a) and to image recognition applications Liu et al. (2020); Khrulkov et al. (2020). Hyperbolic space is a kind of manifold studied in Riemannian geometry, in which basic mathematical operations (e.g., distance measurement) are defined differently from Euclidean space. Nickel & Kiela (2018) propose a method that exploits the individual strengths of both models by building embeddings using the Lorentz model and mapping them into the Poincaré ball. Hyperbolic Visual Embedding Learning Networks Chen et al. (2020a) learn hierarchy-aware image embedding features in hyperbolic space, where image labels are projected into hyperbolic space with the Poincaré hierarchy embedding model Nickel & Kiela (2018) and Poincaré GloVe Tifrea et al. (2019). Law et al. (2019) explain why the squared Lorentzian distance is a better choice than the Poincaré metric.
Despite the advances made in hyperbolic embeddings, how to use them for downstream tasks such as classification remains a challenge due to the absence of corresponding hyperbolic neural network layers Hui & Ku (2022). This motivates us to explore a space that identifies more semantic and complex relationships with small information loss, and to introduce topic and hyperbolic spaces as alternatives.

3.1. Model Overview

The architecture of Chimera is illustrated in Figure 1. Chimera sits between the top Transformer layer and the model-specific objectives or output functions. It receives the hidden states of the last Transformer layer as input, transforms them via the topic and hyperbolic spaces with their objectives, and feeds its output into the model-specific objectives or functions. As Chimera is based on the understanding that hyperbolic space can better represent complex relationships than Euclidean space Nickel & Kiela (2018); Zhang & Gao (2020); Gülçehre et al. (2019), it represents sets of images and texts in both a topic and a hyperbolic space to illuminate their semantic and hierarchical relationships. This direction leads us to adapt Transformer-based models to perform an attentive read operation over different representation vectors in both topic and hyperbolic space, while maintaining the simplicity and effectiveness of the metric learning paradigm. In practice, this framework enables Chimera to connect topic, Euclidean, and hyperbolic spaces, while retaining the benefits of pre-trained models and alleviating catastrophic forgetting Ramasesh et al. (2021).

3.2. Transformer layers

The Transformer has $l$ layers, each of which contains two blocks. The core of the first block is multi-head attention with $k$ heads; the core of the second block is a feed-forward network with ReLU activation Nair & Hinton (2010). It embeds an input of length $|x|$, $X \in \mathbb{R}^{|x|}$, into the hidden representation $H_0 \in \mathbb{R}^{|x| \times d_h}$, and projects it using inner dimension $f$, with layer normalization Ba et al. (2016):

$H_0 = \mathrm{EMB}(X)$, $\tilde{H}_{l-1} = \mathrm{LayerNorm}(H_{l-1})$, $H_l = \mathrm{MultiHead}(\tilde{H}_{l-1}) + \tilde{H}_{l-1}$, $\mathrm{MultiHead}(Q, K, V) = [h_1; \cdots; h_k] W^O$, $h_j = \mathrm{Attention}(Q_j, K_j, V_j)$, $\tilde{H}_l = \mathrm{LayerNorm}(H_l)$, $H_{l+1} = \mathrm{FFN}(\tilde{H}_l) + \tilde{H}_l$, $\mathrm{FFN}(\tilde{H}_l) = \max(0, \tilde{H}_l U)V$, (1)

where $U \in \mathbb{R}^{d_h \times f}$, $V \in \mathbb{R}^{f \times d_h}$, and $W^O \in \mathbb{R}^{k d_k \times d_h}$ are learnable weights, and $Q_j, K_j, V_j \in \mathbb{R}^{|x| \times d_k}$ are obtained by transforming the output of the $l$-th layer, $H_l$, using $W_l^{Q_j}, W_l^{K_j}, W_l^{V_j} \in \mathbb{R}^{d_h \times d_k}$, respectively.
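As a concrete illustration, one such layer can be sketched in NumPy (a minimal sketch: the weight names, the pre-LayerNorm arrangement, and all shapes are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(h, eps=1e-5):
    return (h - h.mean(-1, keepdims=True)) / np.sqrt(h.var(-1, keepdims=True) + eps)

def transformer_layer(H, params, k_heads):
    """One layer: k-head scaled dot-product attention, then a ReLU FFN, with residuals."""
    d_k = H.shape[-1] // k_heads
    Hn = layer_norm(H)                            # H~_{l-1} = LayerNorm(H_{l-1})
    heads = []
    for j in range(k_heads):
        Q = Hn @ params["W_q"][j]                 # |x| x d_k
        K = Hn @ params["W_k"][j]
        V = Hn @ params["W_v"][j]
        A = softmax(Q @ K.T / np.sqrt(d_k))       # scaled dot-product attention
        heads.append(A @ V)
    H_att = np.concatenate(heads, -1) @ params["W_o"] + Hn   # residual connection
    Hn2 = layer_norm(H_att)
    ffn = np.maximum(0.0, Hn2 @ params["U"]) @ params["V"]   # FFN = max(0, H~ U) V
    return ffn + Hn2                              # H_{l+1} = FFN(H~_l) + H~_l

rng = np.random.default_rng(0)
x_len, d_h, k, f = 4, 8, 2, 16
d_k = d_h // k
params = {
    "W_q": rng.normal(size=(k, d_h, d_k)), "W_k": rng.normal(size=(k, d_h, d_k)),
    "W_v": rng.normal(size=(k, d_h, d_k)), "W_o": rng.normal(size=(k * d_k, d_h)),
    "U": rng.normal(size=(d_h, f)), "V": rng.normal(size=(f, d_h)),
}
H_out = transformer_layer(rng.normal(size=(x_len, d_h)), params, k)
```

Note that the output keeps the input shape $|x| \times d_h$, which is what lets the topic and hyperbolic layers be stacked on top without changing the host model.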

3.3. Topic layer

The motivation for introducing the topic layer is to model the uncertainty in the generative process, and to strike a balance between the visual information in the image and the linguistic knowledge acquired from the text. Given an input sequence $x_d = \{x_{d,1}, \cdots, x_{d,|x|}\}$ and dataset $D = \{x_1, \cdots, x_{|D|}\}$, non-autoregressive generation can be achieved by minimizing the following objective, which makes an independence assumption in the decoding process:

$\min_\theta \mathcal{L}(\theta) = \min_\theta -\sum_{d=1}^{|D|} \sum_{t=1}^{|x|} \log P_\theta(x_{d,t} \mid x_{d,\setminus t})$, (2)

where $\theta$ represents the model parameters and $x_{d,\setminus t}$ represents $x_d$ without $x_{d,t}$. To mitigate the gap between inputs and modalities, our framework introduces topics, $z$, into this process and so modifies Eq (2) to:

$\min_\theta \mathcal{L}_T(\theta) = \min_\theta -\sum_{d=1}^{|D|} \sum_{t=1}^{|x|} \log \sum_{z_t=1}^{Z} P_\theta(x_{d,t} \mid x_{d,\setminus t}, z_t) P_\theta(z_t \mid x_{d,\setminus t})$, (3)

where $z_t$ denotes the topic of the $t$-th token, and $Z$ is the number of topics. By placing this topic layer on both the text and the visual encoder (Transformer layers), we mitigate the discrepancy between the cross-modal representations learned by the encoder and the representation needed by the decoder for generating text. As Chimera injects topics into the top blocked attention layers, it maps the hidden representation vector $h \in \mathbb{R}^{d_h}$ to a latent topic vector $z \in \mathbb{R}^Z$, and then projects this topic vector into the topic-specific distribution over words. This yields Eq (4) by defining the topic matrix, $W_Z \in \mathbb{R}^{d_h \times Z}$, and the word generation function, $\mathcal{F}(h_{L,t})$, where $V$ is the size of the vocabulary. This method is used to sample the next token according to the following probability:

$H_T(\text{text/image}) = \underbrace{\mathrm{LayerNorm}(H_L) W_Z}_{P_\theta(z_t \mid x_{d,1:t-1})} \times \underbrace{\mathcal{F}(h_{L,t}) \mid z = z_t}_{P_\theta(x_{d,t} \mid x_{d,1:t-1}, z_t)}$, (4)

where $W_Z$ is a learnable weight. As for $F_z \in \mathcal{F}(h_{L,t}) \mid z = z_t$, we propose three transformations to generate $x_{d,t}$ in accordance with the given $z_t$ and $x_{d,1:t-1}$.
$F_z = W_{V|Z} \times \begin{cases} h_{L,t} & \text{residual, if } z = 0 \\ (1 - \omega) h_{L,t} + \omega g_z & \text{addition, if } z > 0 \\ h_{L,t} \otimes g_z & \text{multiplication, if } z > 0 \\ h_{L,t} W_{R|z} + b_z & \text{affine, if } z > 0, \end{cases}$ (5)

where $g_z \in \mathbb{R}^{d_h}$, $W_{R|z} \in \mathbb{R}^{d_h \times d_h}$, and $b_z \in \mathbb{R}^{d_h}$ are the learnable weights specific to topic $z$. We prepare the residual to select the input, i.e., Euclidean space, when $z = 0$. Note that just as Eq (2) is transformed into Eq (3) through the introduction of topics, the matrix $W_{V|Z} \in \mathbb{R}^{d_h \times V}$ used in the previous Transformer is decomposed into the product of $W_Z$ and $\mathcal{F}(h_{L,t})$ in Eq (4).
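The case analysis above can be sketched as follows (a minimal NumPy sketch; `g_z`, `W_R`, `b_z`, `omega`, and the `mode` switch are hypothetical stand-ins for the topic-specific learnable weights, and the final vocabulary projection by $W_{V|Z}$ is omitted):

```python
import numpy as np

def topic_transform(h, z, g_z, W_R, b_z, omega=0.5, mode="addition"):
    """One topic-conditioned transformation F_z: residual for z = 0,
    otherwise addition / multiplication / affine for z > 0."""
    if z == 0:
        return h                                  # residual: keep the Euclidean input
    if mode == "addition":
        return (1.0 - omega) * h + omega * g_z    # convex mix with the topic vector
    if mode == "multiplication":
        return h * g_z                            # elementwise (Hadamard) product
    if mode == "affine":
        return h @ W_R + b_z                      # topic-specific affine map
    raise ValueError(mode)

rng = np.random.default_rng(0)
d_h = 8
h = rng.normal(size=d_h)
g_z = rng.normal(size=d_h)
W_R = rng.normal(size=(d_h, d_h))
b_z = rng.normal(size=d_h)
out = topic_transform(h, 3, g_z, W_R, b_z, mode="affine")
```

The residual branch is what lets the layer fall back to the unmodified Euclidean representation on a per-token basis.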

3.4. Hyperbolic layer: Lorentz model and attention mechanism

Formally, $n$-dimensional hyperbolic space is a manifold of constant negative curvature. As the number of objects grows exponentially with semantic distance from the query, hyperbolic geometry can, unlike Euclidean geometry, encode those objects without creating interference Gülçehre et al. (2019). Among the five most common models of hyperbolic space, we primarily make use of the Lorentz model because its distance function avoids the numerical instabilities of the Poincaré equivalent. Note that one-to-one isometric transformations can be defined between the different models of hyperbolic space. The Lorentz model is the only unbounded hyperbolic model and is defined as $\mathcal{L}^n = (\mathcal{H}^n, g_L)$, with points constrained by $\mathcal{H}^n = \{x \in \mathbb{R}^{n+1} : \langle x, x \rangle_L = -1, x_0 > 0\}$, where the Riemannian metric tensor is $g_L(x) = \mathrm{diag}(-1, 1, \ldots, 1)$ and $\langle x, y \rangle_L$ is the Lorentzian scalar product. The associated distance function $d_L$ is then given as:

$d_L(x, y) = \operatorname{arcosh}(-\langle x, y \rangle_L)$, $\langle x, y \rangle_L = -x_0 y_0 + \sum_{i=1}^{n} x_i y_i$. (6)

As shown in Eq (1), $h_j$ corresponds to scaled dot-product attention, written as $R = \mathrm{softmax}(\frac{QK^T}{\sqrt{d}})V$, where $Q$, $K$, and $V$ denote the queries, keys, and values, respectively, and $d$ is the shared dimensionality of the queries and keys. The attentive read operation over keys $k$ by query $q$ has the following form:

$r_i = \sum_j \left[\frac{\alpha_{i,j}}{Z}\right] v_{i,j}$, $\alpha_{i,j} = \exp\left(\frac{1}{\sqrt{d}} \langle q_i, k_j \rangle\right)$, $v_{i,j} = v_j$, $Z = \sum_j \alpha_{i,j}$,

where $q_i$ is a vector called the query, the $k_j$ are the keys for the memory locations being read from, $\alpha_{i,j}$ computes a scalar matching score between $q_i$ and $k_j$, $v_{i,j}$ is the value read from the $j$-th location by the $i$-th query, and $Z (> 0)$ is a normalization factor for the full sum. As the visual/linguistic features are extracted as vectors in Euclidean space, they are projected into hyperbolic space to align with the word representations in this phase.
The most natural way to exploit hyperbolic geometry for matching pairs of points is to use the hyperbolic distance between them Gülçehre et al. (2019). To obtain the weighted values of the Euclidean Transformer, we project $v_{i,j}$ onto the Klein model and adopt the Einstein midpoint Ungar (2005), together with Eq (6), to compute the aggregation weights appropriately. Thus, the hyperbolic attention weights, $\alpha_{i,j}$, and each entry, $r_{i,L}$, of $R$ that results from an attentive read can be expressed as:

$r_{i,L} = \sum_j \left[\frac{\alpha_{i,j}\, \gamma(v_{i,j})}{\sum_l \alpha_{i,l}\, \gamma(v_{i,l})}\right] v_{i,j}$, $\alpha_{i,j} = f(-\beta\, d_L(q_i, k_j) - c)$, $\gamma(v_{i,j}) = \frac{1}{\sqrt{1 - \lVert v_{i,j} \rVert^2}}$,

where the $\gamma(\cdot)$ are the Lorentz factors and $f(\cdot)$ denotes the sigmoid function; $\beta$ and $c$ are parameters that are either set manually or learned along with the rest of the network. The above yields $H_h$.
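The Lorentz distance of Eq (6) and the Einstein-midpoint attentive read can be sketched as follows (a toy NumPy sketch; the `lift` helper that places Euclidean vectors onto the hyperboloid is a hypothetical stand-in for the exponential map, and the values are assumed to already lie inside the Klein ball):

```python
import numpy as np

def lorentz_inner(x, y):
    # <x, y>_L = -x_0 y_0 + sum_i x_i y_i  (Eq (6))
    return -x[..., 0] * y[..., 0] + (x[..., 1:] * y[..., 1:]).sum(-1)

def lorentz_dist(x, y):
    # d_L(x, y) = arcosh(-<x, y>_L); clip guards arccosh's domain (>= 1)
    return np.arccosh(np.clip(-lorentz_inner(x, y), 1.0, None))

def lift(v):
    # place a Euclidean vector on the hyperboloid: x_0 = sqrt(1 + ||v||^2)
    x0 = np.sqrt(1.0 + (v * v).sum(-1, keepdims=True))
    return np.concatenate([x0, v], -1)

def hyperbolic_attention(q, k, v_klein, beta=1.0, c=0.0):
    """q, k: Euclidean query/key rows; v_klein: values on the Klein ball (||v|| < 1)."""
    qL, kL = lift(q), lift(k)
    d = np.array([[lorentz_dist(qi, kj) for kj in kL] for qi in qL])
    alpha = 1.0 / (1.0 + np.exp(beta * d + c))          # sigmoid(-beta d_L - c)
    gamma = 1.0 / np.sqrt(1.0 - (v_klein**2).sum(-1))   # Lorentz factors
    w = alpha * gamma                                   # Einstein-midpoint weights
    return (w / w.sum(-1, keepdims=True)) @ v_klein

rng = np.random.default_rng(1)
q = rng.normal(size=(3, 4))
k = rng.normal(size=(5, 4))
v = rng.uniform(-0.3, 0.3, size=(5, 4))  # keep values strictly inside the Klein ball
r = hyperbolic_attention(q, k, v)
```

Because the result is a Lorentz-factor-weighted convex combination, each output row stays inside the Klein ball, so the read remains a valid point of the model.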

3.5. Matching layer

This layer integrates the output of the topic layers and the hyperbolic layer into the hidden representation, $H_M$, which can be fed to the objective tasks of previous Transformer-based models. Thus, the output of this layer is expressed as:

$H_M = (1 - \omega_t) H_h + \omega_t \left(\omega_v H_T(\text{image}) + (1 - \omega_v) H_T(\text{text})\right)$,

where $\omega_t$ and $\omega_v$ are the learnable weights of the topic and visual components, respectively. That is, both the topic layers and the hyperbolic layer can be adopted by, and work seamlessly within, existing Transformer-based models.
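The combination rule is a simple nested convex mixture; a minimal sketch (the $\omega$ weights are learnable in the paper but fixed scalars here):

```python
import numpy as np

def matching_layer(H_h, H_T_img, H_T_txt, omega_t=0.5, omega_v=0.5):
    """H_M = (1 - w_t) H_h + w_t (w_v H_T(image) + (1 - w_v) H_T(text))."""
    return (1.0 - omega_t) * H_h + omega_t * (
        omega_v * H_T_img + (1.0 - omega_v) * H_T_txt
    )

H_M = matching_layer(np.zeros(3), np.full(3, 2.0), np.full(3, 4.0))
```

Since the mixture is linear and shape-preserving, $H_M$ can be handed directly to the host model's original output heads.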

3.6. Training Tasks

Topic Level Matching (TLM): Inspired by contrastive learning Khosla et al. (2020); Radford et al. (2021), Topic Level Matching (TLM) aims to bring similar instances closer together and to push dissimilar instances further apart. Assuming that instances that are similar in Euclidean space are also similar at the topic level, topics are learned to explain the latent semantic similarity; accordingly, we apply TLM to the topic layer. We use $t$ and $i$ to refer to the text and image on the topic layer, respectively. As each image has multiple sentences and these sentences refer to the same image, they are considered to be semantically similar. Since this task aims to utilize this semantic similarity for representation learning, it is designed to predict whether one linguistic sentence can be semantically matched with the other sentences attached to the same image. Like Image-Text Matching (ITM) in other models Chen et al. (2020b), this model samples both sentences with the same image (positive) and those with other images (negative) and learns their matching scores, where it creates the negative pair by replacing the sentence in a paired sample with a randomly selected equivalent from another sample.

$\mathcal{L}_{TLM}(\zeta) = -\mathbb{E}_{(t,i) \sim D} \log \frac{\exp \phi_\theta(t, i)}{\exp \phi_\theta(t, i) + \sum_{n=1}^{N} \exp \phi_\theta(t, i_n)}$, (10)

where $\phi_\theta(t, i)$ denotes the inner product of $t$ and $i$, $N$ is the number of negative samples, and $\{i_n\}_{n=1}^N$ are the $N$ negative samples.
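For a single (text, image) pair with $N$ sampled negatives, the loss of Eq (10) can be sketched as (a minimal sketch with $\phi$ taken as the plain inner product, per the definition above):

```python
import numpy as np

def tlm_loss(t, i_pos, i_negs):
    """Contrastive TLM loss for one (text, image) pair: the positive's score
    is normalized against itself plus the N sampled negatives."""
    pos = np.exp(t @ i_pos)                         # exp(phi(t, i))
    neg = sum(np.exp(t @ i_n) for i_n in i_negs)    # sum_n exp(phi(t, i_n))
    return -np.log(pos / (pos + neg))

t = np.array([1.0, 0.0])
good = tlm_loss(t, np.array([5.0, 0.0]), [np.array([-5.0, 0.0])])  # aligned positive
bad = tlm_loss(t, np.array([-5.0, 0.0]), [np.array([5.0, 0.0])])   # aligned negative
```

As expected for a contrastive objective, the loss is near zero when the positive pair scores highest and grows when a negative outscores it.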

Hyperbolic Level Matching (HLM):

To realize fine-tuning for image-text and text-image retrieval, we formulate the task as a ranking problem. The fine-tuning inputs share the same data preprocessing procedures used in pre-training, except that we do not mask words or regions in the fine-tuning stage. Following the V+L task, we denote the score function as $s$. Here, we are motivated by Visual Linguistic Matching (VLM) in Li et al. (2020a), and by ITM. Both define the similarity between two inputs with the ranking-loss constraint widely used in embedding-based methods, constructing a bi-directional max-margin ranking loss instead of treating "match" and "mismatch" as a binary classification problem. Similar to Hsieh et al. (2017), for a given query, the nearest-neighborhood images are: 1) the images previously referred to by this query, and 2) the images referred to by other words that have similar semantics to this word. That is, we are also able to indirectly observe the relationships between word-word pairs and image-image pairs through the pull-push mechanism of metric learning in Euclidean space. Using the exponential map allows us to project image features $v_i$ into hyperbolic space. Here, we transform this feature vector into its polar form as $(d, r) \in \mathbb{R}^{n+1}$, where $r = \lVert v_i \rVert$ and $d = \frac{v_i}{r}$, i.e., $\lVert d \rVert = 1$, and denote this transformed feature as $\tilde{v}_j$. Similarly, we obtain the transformed feature, $\tilde{w}_i$, from the word embedding $w_i$. Thus, we can project these different features into the same Lorentz space. As the model is trained in Lorentz hyperbolic space, we use the distance, $d_L$, defined in Eq (6) and define our pull-and-push loss function as:

$\mathcal{L}_{HLM} = \sum_{(i,j) \in D} \sum_{(k,j) \notin D} \left[m + d_L^2(\tilde{w}_i, \tilde{v}_j) - d_L^2(\tilde{w}_k, \tilde{v}_j)\right]_+$,

where $D$ contains all observed implicit feedback, i.e., positive image-word pairs; $[z]_+ = \max(0, z)$ is the standard hinge loss, and $m > 0$ is the safety margin size.
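A minimal sketch of this pull-and-push loss, assuming a simple lift onto the hyperboloid in place of the exponential map (the `lift` helper and the margin value are illustrative assumptions):

```python
import numpy as np

def lift(v):
    # place a Euclidean vector on the hyperboloid: x_0 = sqrt(1 + ||v||^2)
    return np.concatenate([[np.sqrt(1.0 + v @ v)], v])

def d_lorentz(x, y):
    # Eq (6): d_L(x, y) = arcosh(-<x, y>_L); clip guards arccosh's domain
    inner = -x[0] * y[0] + x[1:] @ y[1:]
    return np.arccosh(np.clip(-inner, 1.0, None))

def hlm_loss(w_pos, w_neg, v_img, m=0.5):
    """Hinge loss: the positive word should sit closer to the image (in squared
    Lorentzian distance) than the negative word, by at least margin m."""
    d_pos = d_lorentz(lift(w_pos), lift(v_img)) ** 2
    d_neg = d_lorentz(lift(w_neg), lift(v_img)) ** 2
    return max(0.0, m + d_pos - d_neg)

v = np.array([0.1, 0.2])       # image feature
near = np.array([0.1, 0.2])    # positive word, coincides with the image
far = np.array([3.0, -2.0])    # negative word, far from the image
```

Minimizing this term pulls positive word-image pairs together and pushes negatives apart, which is the pull-push mechanism described above.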
Because the goal when embedding one space into another is to preserve distances while maintaining complex structures/relationships Sala et al. (2018), HyperML Tran et al. (2020) explores metric learning in hyperbolic space for recommender systems. It defines a distortion optimization function for embedding user-item pairs into hyperbolic space under the constraint of preserving good structural quality for metric learning. As our goal is to coordinate Euclidean space with hyperbolic space, our approach adapts this distortion optimization function as follows:

$\mathcal{L}_D = \sum_{(i,j) \in D} \left[\frac{|d_L(\tilde{w}_i, \tilde{v}_j) - d_E(w_i, v_j)|}{d_E(w_i, v_j)}\right]_+ + \sum_{(k,j) \notin D} \left[\frac{|d_L(\tilde{w}_k, \tilde{v}_j) - d_E(w_k, v_j)|}{d_E(w_k, v_j)}\right]_+,$

where $|\cdot|$ denotes the absolute value and $v_j$ is the embedding representation in Euclidean space. Thus, fine-tuning aims to preserve the distances by minimizing $\mathcal{L}_D$, as this function associates lower distortion with better preservation.
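Each summand of $\mathcal{L}_D$ reduces to a relative-error term between the hyperbolic and Euclidean distances of a pair; a minimal sketch over precomputed distances:

```python
def distortion_term(d_hyp, d_euc):
    """One summand of L_D: |d_L - d_E| / d_E, hinged at zero.
    Lower values mean the Euclidean distances are better preserved
    after projection into hyperbolic space."""
    return max(0.0, abs(d_hyp - d_euc) / d_euc)

exact = distortion_term(2.0, 2.0)      # distances agree -> zero distortion
stretched = distortion_term(3.0, 2.0)  # 50% relative stretch
```

Summing this term over positive and negative pairs recovers $\mathcal{L}_D$, so minimizing it ties the hyperbolic geometry back to the Euclidean structure the pre-trained model was optimized in.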

3.7. Fine-Tuning

Chimera can adopt pre-trained Transformer-based models and can be fine-tuned for their downstream multi-modal tasks, because the Topic/Hyperbolic/Matching layers and TLM/HLM/$\mathcal{L}_D$ can be easily plugged into these models without changing their structure. By performing pre-training over a suitable visual-linguistic corpus, Chimera can be customized and applied to visual-linguistic tasks, while keeping the sizes of the input and output representations in line with those of the pre-trained Transformer-based models. When jointly trained, the tasks $\mathcal{L}_T(\theta)$, $\mathcal{L}_{TLM}$, $\mathcal{L}_{HLM}$, and $\mathcal{L}_D$ boost model performance while providing good representations; there is, however, an unavoidable tradeoff between these functions Sala et al. (2018); Tran et al. (2020). Accordingly, we integrate these tasks with the model-specific objective task set, $\mathcal{L}_O$, in an end-to-end fashion in a multi-task learning framework. The objective function is defined as:

$\min_\Theta \mathcal{L}_\Theta = \underbrace{\lambda_T \mathcal{L}_T(\theta) + \lambda_{TLM} \mathcal{L}_{TLM} + \lambda_{HLM} \mathcal{L}_{HLM} + \lambda_D \mathcal{L}_D}_{\text{Chimera}} + \underbrace{\lambda_O \mathcal{L}_O}_{\text{pre-trained models}}$, (13)

where $\Theta$ is the total parameter space that covers all embeddings and variables of the networks, and each $\lambda_*$ is the weight of the corresponding task.
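The combined objective is just a weighted sum of the task losses; a minimal sketch using the $\lambda$ values reported in Section 4.2 (the dictionary keys are illustrative names for the five terms):

```python
def total_loss(losses, lambdas):
    """Weighted multi-task objective: the four Chimera terms plus the
    host model's own objective L_O, each scaled by its lambda."""
    return sum(lambdas[key] * losses[key] for key in losses)

# Grid-searched weights from Section 4.2: 0.5 for the Chimera terms, 1.0 for L_O.
lambdas = {"T": 0.5, "TLM": 0.5, "HLM": 0.5, "D": 0.5, "O": 1.0}
combined = total_loss({"T": 1.0, "TLM": 1.0, "HLM": 1.0, "D": 1.0, "O": 1.0}, lambdas)
```

Keeping $\lambda_O$ at 1.0 leaves the pre-trained model's own objective dominant, which is consistent with the plug-and-play positioning of the framework.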

4.1. Datasets

We use two datasets, MSCOCO Lin et al. (2014) and Flickr30K, both widely used for multi-modal search Wang et al. (2019); Li et al. (2020a). We follow Karpathy & Li (2015) and Faghri et al. (2018) to split these datasets into training/validation/test subsets. Their statistics are shown in Table 1.

4.2. Implementation Detail

We implemented Chimera using PyTorch with Nvidia Apex and Horovod to speed up training. We will release the code later. As shown in Figure 1, Chimera can be inserted as additional layers, with the original objective tasks, $\mathcal{L}_O$ in Eq (13), into various pre-trained Transformer models without changing their architecture. Chimera works with the Transformer layers and their objective tasks. To get the best out of the combination of the pre-trained models and Chimera, rather than comparing fairly over various models, we follow the settings of each pre-trained model. After hyperparameter optimization using grid search, we set $\lambda_T$, $\lambda_{TLM}$, $\lambda_{HLM}$, $\lambda_D$, and $\lambda_O$ to 0.5, 0.5, 0.5, 0.5, and 1.0, respectively, and apply these values to the other models. Like other state-of-the-art pre-trained V-L models, our framework can be fine-tuned for various downstream visual-linguistic tasks by simply modifying the input format, output prediction, loss function, and training strategy.

4.3. Evaluation metrics and Results

For the V-L task, we perform two image-text matching tasks. I2T: sentence retrieval, i.e., retrieving ground-truth sentences given a query image. T2I: image retrieval, i.e., retrieving ground-truth images given a query text. Since these tasks are based on rankings, candidates are ranked according to their scores, and the number of candidates varies with the query. For both tasks, the commonly used evaluation metric is R@N, defined as the recall rate over the top N results for the query. Table 2 shows the results of these tasks (EX1). We confirm that the contribution of Chimera is proportional to the increase in N; this tendency is stronger for T2I than for I2T. As the accuracy was virtually the same in the previous experiment, we performed an additional experiment with different conditions. That is, we randomly permuted the words in each sentence to determine the robustness of the models. As shown in Table 3, Chimera offers significantly better performance than the other models in experiment EX2. This table indicates that not only the hyperbolic space but also TLM/HLM, which are newly introduced as training tasks in our architecture, contributed to this superiority, as shown by the ablation analysis. Since the difference is greater for T2I than for I2T, the latter task is considered more difficult than the former. This result suggests that hyperbolic space is better suited to the visual-linguistic task. Chimera is more robust against word reordering because the topic and hyperbolic layers learn representations at each input level.

4.4. Ablation analysis

To investigate the respective contributions of the pre-training tasks to overall performance, we conducted an ablation analysis over the same datasets with the same metric, following EX1. To compare the effect of each component, Chimera used ViT as the pre-trained model. To emphasize the difference between the effects of these components, we performed fine-tuning over the data with 20% of each of the images and texts randomly masked, and the order of the words in each sentence randomly shuffled. A comparison of the improvements achieved is shown in Table 4. The lower the value, the greater the effectiveness of the excluded task. A comparison of the magnitude of the decreases shows that the contributions of both the topic layer and the hyperbolic layer to Chimera are higher, and supports our assumption that both topic space and hyperbolic space contribute to providing better representations through the discovery of semantic relationships and the summarization of complex relationships. This table shows that, among these tasks, the topic layer is more sensitive than the hyperbolic layer both to the topic-specific tasks (e.g., TLM and HLM) and to a decrease in the number of topics, which corresponds to the number of hyperbolic dimensions.

5. DISCUSSION

Our approach uses topic and hyperbolic space and metric learning to focus on the characteristics and differences of multi-modal dependencies. Each image and its corresponding text differ not only in the number of features but also in the amount of information they contain. The relationship between images and text is asymmetric, and this relationship is observed as task asymmetry in Table 3, EX2. Table 3 implies that the meaning of a sentence does not change much even if a word is replaced, but images lose their meaning if regions within them are shuffled; that is, the degree of freedom is low. As the V+L task requires a representation space into which both images and text can be embedded without any other bias, our architecture learns their representations and projects them into topic and hyperbolic space. To jointly learn image-text representations more effectively, Chimera employs metric learning to coordinate representations in Euclidean space with those in hyperbolic space, and can move a data point a certain distance in hyperbolic space with smaller force than is needed in Euclidean space.

6. CONCLUSION

To develop semantic intermediate expressions for both images and texts, our proposed framework, Chimera, bridges the gap between Euclidean and hyperbolic geometry through topics and the metric learning approach. This leads our framework to project image-text representations into topic and hyperbolic space while preserving semantic similarity in Euclidean space through metric learning. Experiments on various datasets showed that Chimera achieved better results on the Vision-and-Language task than the baselines, and demonstrated its ability to learn image-text dependencies.



Footnotes:
- http://bryanplummer.com/Flickr30kEntities/
- https://pytorch.org/
- https://github.com/NVIDIA/apex
- https://github.com/horovod/horovod
- https://github.com/ChenRocks/UNITER, https://github.com/jackroos/VL-BERT, https://github.com/salesforce/VD-BERT.git, https://github.com/jeonsworld/ViT-pytorch



As with training tasks, Chimera can jointly fine-tune models with other training frameworks Radford et al. (2021); Wang et al. (2022b); Huang et al. (2022); Zeng et al. (2022); Zhong et al. (2022).

Figure 1: Architecture of (left) Chimera, and (right) the various spaces and models adopted in Chimera for the attention layer: Chimera consists of a text/image topic layer, a hyperbolic layer, a matching layer, and objective functions (e.g., $\mathcal{L}_T(\theta)$, $\mathcal{L}_D$, TLM, and HLM), which can be injected into pre-trained Transformer-based models without modifying their architecture. Upon receiving the output of the top Transformer layer, $H_L$, the hyperbolic layer feeds $H_h$ to the matching layer. The text/image topic layer receives $H_L(\text{text})$/$H_L(\text{image})$ and feeds $H_T(\text{text})$/$H_T(\text{image})$ to the matching layer, where $H_L(\text{text})$ and $H_L(\text{image})$ denote the $H_L$ associated with text and image, respectively. That is, Chimera decomposes $H_L \in \mathbb{R}^{|x| \times d_h}$ into $H_h$, $H_T(\text{text})$, and $H_T(\text{image})$, and integrates them into $H_M \in \mathbb{R}^{|x| \times d_h}$.

As with the Transformer-based models (UNITER, VL-BERT, VD-BERT, ViT), we modify the implemented models and optimize their parameters. To train the Transformer in our framework, we use Adaptive Moment Estimation (Adam) Kingma & Ba (2015) with β1 = 0.9 and β2 = 0.999 for optimization over mini-batches to update parameters, and adopt the dropout strategy Srivastava et al. (2014) to optimize the networks. We then follow the fine-tuning of Radford et al. (2021); Jia et al. (2021) and set the number N in Eq (10) to the batch size.

Basic statistics of the datasets used in fine-tuning

EX1: The contribution of Chimera in image-text matching tasks over the Flickr30K and MSCOCO datasets using R@N: The value is the improvement (+%).

EX2: The contribution of Chimera in image-text matching tasks over the Flickr30K and MSCOCO datasets: Different from the previous experiment shown in Table 2 using R@N, we randomly shuffle the order of the words in each input sentence.

Ablation analysis of pre-training of image-text matching tasks over Flickr30K: In this table, TL, HL, Z, and N denote the Topic Layer, Hyperbolic Layer, #topics, and #dimensions of the hyperbolic space, respectively; ad, mu, and af denote the addition, multiplication, and affine transformations of the topic layer, respectively. The best results are marked in bold, where a bold value denotes statistical significance at p < 0.01 compared to the best baseline. TLM and HLM/$\mathcal{L}_D$ require the topic layer and the hyperbolic layer, respectively.

