Topic and Hyperbolic Transformer to Handle Multi-modal Dependencies

Abstract

Multi-modal search relies on jointly learning image-text representations, a problem widely investigated in the literature; our contribution is Chimera, a framework for learning such representations and their similarities. Because the core of multi-modal search is embedding the modalities in a shared semantic space and measuring their similarity there, search quality depends on how expressive that space is. This motivates us to identify spaces that can elucidate the semantic and complex relationships between modalities with little information loss. Chimera introduces topic and hyperbolic spaces, and performs contrastive and metric learning tasks so that these spaces cooperate with the Transformer. Experiments show that Chimera empowers pre-trained models for multi-modal search tasks and demonstrate the effectiveness of the layers it introduces.

1. INTRODUCTION

While conventional search systems find documents relevant to a query, multi-modal (cross-modal) search systems retrieve relevant instances of another modality in answer to the query. Text-based image retrieval models return images matched to a given text, where images are ranked by their similarity to the text. Inferring the semantic alignment between images and words in the associated sentences captures the fine-grained interplay between vision and language data, and makes image-text matching more interpretable. Since such a search model needs semantic intermediate representations for both images and text, our challenge is to identify spaces that can learn appropriate representations and define similarities. Because joint image-text embedding is the bedrock of most Vision-and-Language (V+L) tasks, many studies have explored how to learn these embeddings, project them into a semantic space, and measure their similarity. Most existing multi-modal approaches employ either embedding Lee et al. (2018) or classification Huang et al. (2017). In the embedding approach, the semantic space is generally of low dimension, allowing representation by vectors and direct comparison of modalities by conventional distance metrics (e.g., cosine similarity or Euclidean distance) as the similarity function in the embedding space. The literature describes a variety of matching functions, such as metric learning Hsieh et al. (2017); Tay et al. (2018b) and/or neural networks He et al. (2017). Although these studies primarily focus on designing distance functions over objects (i.e., between users and items), and have empirically demonstrated reasonable success in collaborative ranking with implicit feedback, the matching functions work only within Euclidean space Tran et al. (2020). Since hypernymy and entailment relationships hold within words and texts Vendrov et al. (2016), one can perceive a semantic hierarchical relationship between sentences and images.
This intuition motivates us to explore spaces that explicitly illuminate both semantic and hierarchical relationships and to learn representations in them. That is, our framework, Chimera, adopts topics as the semantic space and hyperbolic geometry as the roomier space; contrastive and metric learning approaches are used as training objectives. As topic models describe the statistical relationships of word occurrences as global information (e.g., over a given corpus), Chimera adopts their concept to complement the Transformer with global information and mitigate the strong conditional dependencies be-



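Why hyperbolic geometry is a "roomier" space for hierarchies can be illustrated with the Poincaré ball model, whose geodesic distance grows rapidly near the boundary; the points below are hypothetical concept embeddings, not part of Chimera:

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    # Geodesic distance in the Poincare ball model of hyperbolic space:
    # d(u, v) = arccosh(1 + 2 ||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2)))
    sq = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return np.arccosh(1 + 2 * sq / max(denom, eps))

# Near the origin the geometry is almost Euclidean; near the boundary
# points become exponentially far apart, which lets a hierarchy's many
# leaves fit without crowding.
root  = np.array([0.0, 0.0])    # e.g., a general concept
child = np.array([0.5, 0.0])    # a more specific concept
leaf  = np.array([0.95, 0.0])   # a very specific concept near the boundary
```

Note that child and leaf are closer than root and child in Euclidean terms (0.45 vs. 0.5), yet farther apart hyperbolically, illustrating the extra room available deeper in the hierarchy.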

