MULTI-MODALITY ALONE IS NOT ENOUGH: GENERATING SCENE GRAPHS USING CROSS-RELATION-MODALITY TOKENS

Abstract

Recent years have seen growing interest in Scene Graph Generation (SGG), a comprehensive visual scene understanding task that aims to predict the relationships between objects detected in a scene. One of its key challenges is the strong bias of the visual world toward a few frequently occurring relationships, leaving a long tail of under-represented classes. Although infusing additional modalities is one prominent way to improve SGG performance on under-represented classes, we argue that additional modalities alone are not enough. We propose to inject entity relation information (Cross-Relation) and modality dependencies (Cross-Modality) into each embedding token of a transformer, a scheme we term primal fusion. The resulting Cross-RElAtion-Modality (CREAM) token acts as a strong inductive bias for the SGG framework. Our experimental results on the Visual Genome dataset demonstrate that our CREAM model outperforms state-of-the-art SGG models by around 20% while being simpler and requiring substantially less computation. Additionally, to analyse the generalisability of the CREAM model, we evaluate it on the Open Images dataset. Finally, we examine the impact of depth-map quality on SGG performance and empirically show the superiority of our model over the prior state of the art in capturing depth data, boosting performance by a margin of around 25%.

1. INTRODUCTION

Visual scene understanding has evolved in recent years from mere object detection and recognition to more complex problems such as Visual Question Answering (VQA) (Antol et al., 2015) and Image Captioning (IC) (Hossain et al., 2019). One prominent tool for scene understanding is scene graph generation (SGG) (Lu et al., 2016): given any two entities in a scene, the task of SGG is to detect any existing relationships between them. While standard SGG uses entity features from RGB images to detect relations, moving towards the goal of generating scene graphs that adequately typify our visual world requires additional clues to effectively capture under-represented classes. In this regard, researchers have taken multiple directions, such as infusing complementary modalities (Zareian et al., 2020; Sharifzadeh et al., 2021) or conditioning on additional image context (Lu et al., 2021). Among SGG methods based on further modalities, Zareian et al. (2020) exploit external knowledge via a late fusion mechanism in which the scene graphs generated from RGB features are refined over multiple iterations using knowledge graphs for each relation detection step, resulting in very high computational cost. On the other hand, SGG with additional depth data (Sharifzadeh et al., 2021) uses an early fusion mechanism. Although the approach of Sharifzadeh et al. (2021) requires less computation, it fails to effectively fuse the modalities and uses depth maps of limited quality (Fig. 7). To effectively fuse different modalities, transformer models can be beneficial: their usage has expanded from text (Vaswani et al., 2017) to other modalities such as images (Dosovitskiy et al., 2021) and speech (Lin and Wang, 2020). Recently, transformers have also been used in SGG, mostly to capture dependencies across time in video-based SGG (Cong et al., 2021).
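The early-versus-late fusion distinction above can be sketched schematically. The snippet below is a minimal illustration, not the cited methods' actual implementations: all names, feature dimensions, and the use of linear predictors are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-entity features from two modalities (dimensions illustrative).
rgb_feat = rng.standard_normal(256)    # appearance features from the RGB image
depth_feat = rng.standard_normal(256)  # geometry features from a depth map

# Early fusion: combine modality features *before* relation prediction,
# e.g. by concatenation, so a single predictor sees a joint representation.
early_input = np.concatenate([rgb_feat, depth_feat])  # shape (512,)

# Late fusion: run a separate predictor per modality and merge the outputs,
# e.g. by averaging per-class relation scores after the fact.
num_relations = 50
w_rgb = rng.standard_normal((num_relations, 256))    # stand-in linear predictor
w_depth = rng.standard_normal((num_relations, 256))  # stand-in linear predictor
late_scores = 0.5 * (w_rgb @ rgb_feat + w_depth @ depth_feat)  # shape (50,)
```

Early fusion lets the predictor model cross-modal interactions directly but commits to one joint representation; late fusion keeps modalities separate until the output, which is cheaper to retrofit but cannot exploit feature-level interactions.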
In the case of still images, transformers have been used to extract object dependencies (Dhingra et al., 2021) and context dependencies (Lu et al., 2021). Capturing known dependencies explicitly can boost performance on under-represented classes (Lu et al., 2021). However, using transformers for multi-modal fusion

