Topic and Hyperbolic Transformer to Handle Multi-modal Dependencies

Abstract

Multi-modal search relies on jointly learning image-text representations, a problem that has been widely investigated in the literature; our contribution is Chimera, a framework for learning these representations and their similarities. Because the core of multi-modal search is embedding the modalities in a shared semantic space and measuring their similarities, search quality depends on how expressive the space used for learning is. This motivates us to identify spaces that can elucidate the semantic and complex relationships among modalities with small information loss. Chimera introduces topic and hyperbolic spaces, and performs contrastive/metric learning tasks to ensure the cooperation of these spaces with Transformer. Experiments show that Chimera empowers pre-trained models for multi-modal search tasks and demonstrate the effectiveness of the layers it introduces.

1. INTRODUCTION

While conventional search systems find documents relevant to a query, multi-modal (cross-modal) search systems retrieve relevant instances from another modality to answer the query. Text-based image retrieval models return images matched to a given text, where images are ranked by their similarity to the text. Inferring the semantic alignment between image regions and words in the associated sentences allows the fine-grained interplay between vision and language data to be captured, and makes image-text matching more interpretable. As this search model needs semantic intermediate expressions for both images and text, our challenge is to identify spaces that can learn their appropriate representations and define similarities. Since joint image-text embedding is the bedrock of most Vision-and-Language (V+L) tasks, many studies have explored how to learn these embeddings, project them into a semantic space, and measure their similarity. Most existing multi-modal approaches employ either embedding Lee et al. (2018) or classification Huang et al. (2017). In the embedding approach, the semantic space is generally of low dimension, allowing representation by vectors and direct comparison of modalities with conventional distance metrics (e.g., cosine similarity or Euclidean distance) as the similarity function in this embedding space. The literature describes a variety of matching functions, such as metric learning Hsieh et al. (2017); Tay et al. (2018b) and/or neural networks He et al. (2017). Although these studies primarily focus on designing distance functions over objects (i.e., between users and items), and have empirically demonstrated reasonable success in collaborative ranking with implicit feedback, their matching functions work only within the scope of Euclidean space Tran et al. (2020). Since hypernym and entailment relationships exist among words and texts Vendrov et al. (2016), one can perceive a semantic hierarchical relationship between sentences and images.
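To make the similarity functions above concrete, the following is a minimal sketch of how embeddings in a shared space are compared with cosine similarity and Euclidean distance; the 4-dimensional toy vectors are illustrative and not the embeddings produced by any model discussed here.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors; 1.0 means identical direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Euclidean distance between embeddings; smaller means more similar."""
    return float(np.linalg.norm(u - v))

# Toy image/text embeddings in a shared low-dimensional space (hypothetical values).
img = np.array([0.20, 0.80, 0.10, 0.50])
txt = np.array([0.25, 0.75, 0.05, 0.55])

# A well-matched pair scores high on cosine similarity and low on distance.
sim = cosine_similarity(img, txt)
dist = euclidean_distance(img, txt)
```

In practice the ranking for text-based image retrieval simply sorts candidate images by such a score against the query text's embedding.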
This intuition motivates us to explore spaces that explicitly illuminate both semantic and hierarchical relationships, and to project modalities into them for representation learning. That is, our framework, Chimera, adopts topics as the semantic space and hyperbolic geometry as the roomier space, with contrastive and metric learning approaches as the training objectives. Since topic models describe the statistical relationships of word occurrences as global information (e.g., over a given dataset), Chimera adopts their concept to complement Transformer with global information and to mitigate the strong conditional dependencies between inputs and modalities. Since recent studies Hui & Ku (2022); Yang et al. (2022) show that hyperbolic space is roomier than Euclidean space and can better represent complex relationships, Chimera employs both this space and its operations to realize simpler representations rather than looking for good image/text matches in Euclidean space. Unlike other models, Chimera is a framework in which Transformer-based models are directed to jointly learn representations in different spaces (i.e., topic and hyperbolic space). Chimera applies metric learning in both Euclidean and hyperbolic space to utilize the benefits of both, assigning positive pairs higher similarity scores than negative ones. The key advantages of our approach are two-fold. Theoretical contributions: Chimera (1) derives a way to incorporate both topic and hyperbolic space with Euclidean space to learn more semantic and complex relationships over modalities in an end-to-end fashion; and (2) trains its objectives via multi-task learning, adopting the simplicity and effectiveness of the metric learning paradigm to combine Euclidean-space pre-trained models with a hyperbolic-space model in the fine-tuning stage.
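The metric-learning objective described above can be sketched as a triplet-style hinge loss applied with a distance function from either space. This is a simplified illustration, not Chimera's actual implementation: the function names, the margin value, and the toy points (which must lie inside the unit ball for the Poincaré model) are all assumptions.

```python
import numpy as np

def poincare_distance(u: np.ndarray, v: np.ndarray, eps: float = 1e-9) -> float:
    """Geodesic distance in the Poincare ball model of hyperbolic space."""
    sq = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2)) + eps
    return float(np.arccosh(1 + 2 * sq / denom))

def euclidean_distance(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.linalg.norm(u - v))

def triplet_margin_loss(anchor, positive, negative, dist, margin: float = 0.2) -> float:
    """Hinge loss: push the positive pair closer than the negative pair by a margin."""
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)

# Toy anchor (e.g., a text embedding), a matching image, and a non-matching image.
anchor = np.array([0.10, 0.00])
pos = np.array([0.12, 0.01])
neg = np.array([-0.50, 0.40])

# Illustrative combination: sum the hinge losses from both spaces, so training
# benefits from Euclidean and hyperbolic geometry simultaneously.
combined = (triplet_margin_loss(anchor, pos, neg, euclidean_distance)
            + triplet_margin_loss(anchor, pos, neg, poincare_distance))
```

Here the well-separated triplet incurs zero loss in both spaces; swapping the positive and negative produces a positive penalty that gradient-based training would reduce.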
Practical contributions: Chimera is a plug-and-play framework that can utilize the benefits of pre-trained models by fusing Euclidean/hyperbolic/topic spaces through a contrastive/metric learning paradigm in fine-tuning, as shown in Sections 4.3 and 4.4.

2. RELATED WORK

Cross-modal or multi-modal retrieval Balaneshin-kordan & Kotov (2018); Carvalho et al. (2018); Wang et al. (2019) finds a set of objects in response to textual queries, or assigns understandable terms to objects; multimodality inputs are processed simultaneously for joint vision + language (V+L) understanding. Recent pre-trained language models Devlin et al. (2019); Radford et al. (2019); Yang et al. (2019); Liu et al. (2019b); Lan et al. (2020) use Transformer Vaswani et al. (2017) to learn contextualized text representations; they have yielded great advances in NLP tasks, and have been applied to multi-modal tasks Singh et al. (2022); Hu & Singh (2021). ViLBERT Lu et al. (2019) and LXMERT Tan & Bansal (2019) introduced the two-stream architecture, in which separate Transformers are applied to images and text independently, and their outputs are fused by a third Transformer in a second stage. Unicoder-VL Li et al. (2020a) takes the visual regions of the image and the textual tokens of the sentence as input, and encodes them to yield the image and linguistic embeddings, respectively. For pre-training, VL-BERT Su et al. (2020) trains on visual-linguistic and text-only datasets jointly, but unlike other works Lu et al. (2019) does not address the task of Sentence-Image Relationship Prediction. UNITER Chen et al. (2020b) uses conditional masking for Masked Language Modeling and Masked Region Modeling, and introduces a novel Word-Region Alignment pre-training task; its joint multi-modal embeddings support heterogeneous downstream V+L tasks. Zhou et al. (2020) present a unified Vision-Language Pre-training model that also allows generation via a unified Transformer with various self-attention masks Dong et al. (2019). Murahari et al. (2020) propose VisDial-BERT, which employs pre-trained models for visual dialog. VD-BERT Wang et al.
(2020) relies on the self-attention mechanism within a single-stream Transformer encoder to capture the complex interactions in a unified manner. Oscar Li et al. (2020b) is motivated by the observation that the salient objects in an image can be detected and are often mentioned in the accompanying text, and uses object tags as anchor points to align the image and language modalities in a shared semantic space. VinVL Zhang et al. (2021) feeds in the visual features generated by a new object detection model, and utilizes an improved version of OSCAR+ to pre-train the V+L model and fine-tune it for a wide range of downstream V+L tasks. Hyperbolic representation learning has recently demonstrated great potential across a diverse range of applications, such as learning entity hierarchies Nickel & Kiela (2017), natural language processing Tay et al. (2018a), and hyperbolic graph neural networks Chami et al. (2019); Liu et al. (2019a). Hyperbolic geometry has also been applied to the recommender systems domain Tran et al. (2020); Wang et al. (2022a) and to image recognition applications Liu et al. (2020); Khrulkov et al. (2020). Hyperbolic space is a kind of manifold studied in Riemannian geometry, in which basic mathematical operations (e.g., distance measurement) are defined differently from Euclidean space. Nickel and Kiela Nickel &
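As a small illustration of how hyperbolic distance differs from its Euclidean counterpart, the sketch below (assuming the Poincaré ball model; not tied to any particular cited implementation) compares two point pairs with the same Euclidean gap, one near the origin and one near the ball's boundary.

```python
import numpy as np

def poincare_distance(u: np.ndarray, v: np.ndarray, eps: float = 1e-9) -> float:
    """Geodesic distance in the Poincare ball model; grows rapidly near the boundary."""
    sq = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2)) + eps
    return float(np.arccosh(1 + 2 * sq / denom))

# Two pairs with identical Euclidean separation (0.1) placed at different radii.
near_origin = poincare_distance(np.array([0.00, 0.0]), np.array([0.10, 0.0]))
near_boundary = poincare_distance(np.array([0.85, 0.0]), np.array([0.95, 0.0]))

# The hyperbolic distance is much larger near the boundary; this exponentially
# expanding "room" is what lets hyperbolic space embed tree-like hierarchies
# with low distortion.
```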

