MULTI-MODALITY ALONE IS NOT ENOUGH: GENERATING SCENE GRAPHS USING CROSS-RELATION-MODALITY TOKENS

Abstract

Recent years have seen growing interest in Scene Graph Generation (SGG), a comprehensive visual scene understanding task that aims to predict the relationships between objects detected in a scene. One of its key challenges is the strong bias of the visual world around us toward a few frequently occurring relationships, leaving a long tail of under-represented classes. Although infusing additional modalities is one prominent way to improve SGG performance on under-represented classes, we argue that additional modalities alone are not enough. We propose to inject entity relation information (Cross-Relation) and modality dependencies (Cross-Modality) into each embedding token of a transformer, a strategy we term primal fusion. The resulting Cross-RElAtion-Modality (CREAM) token acts as a strong inductive bias for the SGG framework. Our experimental results on the Visual Genome dataset demonstrate that our CREAM model outperforms state-of-the-art SGG models by around 20% while being simpler and requiring substantially less computation. Additionally, to analyse the generalisability of the CREAM model, we evaluate it on the Open Images dataset. Finally, we examine the impact of depth-map quality on SGG performance and empirically show that our model exploits depth data more effectively than the prior state of the art, boosting performance by a margin of around 25%.

1. INTRODUCTION

Visual scene understanding has evolved in recent years from mere object detection and recognition tasks to more complex problems such as Visual Question Answering (VQA) (Antol et al., 2015) and Image Captioning (IC) (Hossain et al., 2019). One prominent tool for scene understanding is scene graph generation (SGG) (Lu et al., 2016): given any two entities in a scene, the task of SGG is to detect any existing relationships between them. While standard SGG uses entity features from RGB images to detect relations, to move towards the goal of generating scene graphs that adequately typify our visual world, we need additional clues to effectively capture under-represented classes in SGG. In this regard, researchers have taken multiple directions, such as infusing complementary modalities (Zareian et al., 2020; Sharifzadeh et al., 2021) or conditioning on additional image context (Lu et al., 2021). Among SGG methods based on further modalities, Zareian et al. (2020) exploit external knowledge by using a late fusion mechanism in which the scene graphs generated from the RGB features are refined using knowledge graphs over multiple iterations for every relation-detection step, resulting in a very high computational cost. On the other hand, SGG with additional depth data (Sharifzadeh et al., 2021) has used an early fusion mechanism. Although the approach of Sharifzadeh et al. (2021) requires less computation, it fails to effectively fuse the modalities and uses depth maps of limited quality (Fig. 7). To effectively fuse different modalities, transformer models can be beneficial, with their usage expanding from text (Vaswani et al., 2017) to other modalities such as images (Dosovitskiy et al., 2021) and speech (Lin and Wang, 2020). Recently, transformers have also been used in SGG, mostly to capture dependencies across time in video-based SGG (Cong et al., 2021).
In the case of still images, transformers have been used to extract object dependencies (Dhingra et al., 2021) and context dependencies (Lu et al., 2021). Capturing known dependencies explicitly can boost the performance on under-represented classes (Lu et al., 2021). However, using transformers for multi-modal fusion can be challenging because of the high computational cost and model complexity (Nagrani et al., 2021). The major reason for the increased expense of multi-modal transformers stems from the fusion strategies used. Although there is a lot of research into multi-modal fusion, it has primarily focused on fusion strategies inside the transformer, which results in an increased sequence length and thus higher model complexity. Moreover, explicitly modelling the known inductive bias is challenging for fusion strategies that start inside the transformer. The limitations of existing methods raise the following open questions: (Q1) How can we effectively combine different modalities in SGG? (Q2) What is more important for improving SGG performance, better-quality depth maps or a better fusion strategy? (Q3) Will having multiple modalities alone be enough to boost the coverage of under-represented classes in SGG?

To address these questions, we propose a simple yet effective token generation strategy that we call Cross-RElAtion-Modality (CREAM) tokens, which strategically combines entity relations (between the features of the subject and the object entity) with modality dependencies (i.e. between the RGB and the corresponding depth modality). As depicted in Fig. 1, we explicitly combine the modalities and subject-object features prior to entering the transformer, unlike early fusion, in which we rely on implicit fusion inside the transformer. We call our fusion strategy primal fusion. Surprisingly, our primal-fused CREAM tokens are able to capture the inherent inductive bias in SGG using a single encoder-only transformer, without any complex cross-attention (Chen et al., 2021), fusion (Prakash et al., 2021), or encoder-decoder (Lu et al., 2021) architectures. In particular, we use our system to study the effect of depth-map quality on scene graph generation using an improved depth map, VG-Depth.v2, generated for the Visual Genome (VG) dataset (Krishna et al., 2017) with the monocular depth estimator of Yin et al. (2021), and compare it to the VG-Depth.v1 maps (Sharifzadeh et al., 2021) (Fig. 7). This study is crucial for two reasons: (1) to evaluate the relationship between architectural choice and depth-map quality in SGG performance, and (2) for the efficacy of SGG in real-world visual scene understanding scenarios such as automated driving, where instantaneous 3D reconstruction is far-fetched but using monocular depth estimators to generate depth maps is feasible. We make our code publicly available at https://anonymous.4open.science/r/CREAM_Model-113F.

Specifically, we make the following contributions: (C1) We propose a novel token generation strategy (CREAM tokens) for transformer-based multi-modal SGG. Our CREAM tokens explicitly force the multi-head self-attention (MSA) component of the transformer to learn enriched representations by focusing on different subspaces. Our proposal significantly boosts performance while also reducing the computational cost. (C2) By primally fusing CREAM tokens, we outperform state-of-the-art models on the mRecall metric despite not using any additional context. (C3) We conduct extensive depth-data analysis and ablation studies to show the significance of our proposed approach.
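To make the primal-fusion idea concrete, the following is a minimal sketch of how one CREAM token could be formed outside the transformer. The feature dimensions, the concatenate-then-project design, and all names (`primal_fuse`, `W`, `d_model`) are our illustrative assumptions, not the exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def primal_fuse(subj_rgb, obj_rgb, subj_depth, obj_depth, W):
    """Form one CREAM token by explicitly fusing subject-object
    features across the RGB and depth modalities BEFORE the
    transformer sees them (primal fusion). Illustrative sketch."""
    # Cross-Relation: pair subject and object features per modality.
    rel_rgb = np.concatenate([subj_rgb, obj_rgb])        # (2d,)
    rel_depth = np.concatenate([subj_depth, obj_depth])  # (2d,)
    # Cross-Modality: join the two modality-specific relation vectors.
    fused = np.concatenate([rel_rgb, rel_depth])         # (4d,)
    # Project to the transformer token width (a hypothetical choice;
    # the resulting vector is one input token of the encoder).
    return W @ fused                                     # (d_model,)

d, d_model = 8, 16
W = rng.standard_normal((d_model, 4 * d)) / np.sqrt(4 * d)
token = primal_fuse(rng.standard_normal(d), rng.standard_normal(d),
                    rng.standard_normal(d), rng.standard_normal(d), W)
print(token.shape)  # → (16,)
```

The contrast with early fusion is that here the subject-object pairing and the RGB-depth pairing both happen in this explicit pre-processing step, so the encoder's self-attention operates on already-fused tokens rather than on separate per-modality token sequences.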

2. RELATED WORK

SGG. Scene graphs have been receiving increased attention from the research community due to their potential usability in assisting downstream visual reasoning tasks (Shi et al., 2019; Wang et al., 2019; Krishna et al., 2018).



Figure 1: Primal vs Early Fusion. In the case of primal fusion, the modalities and the subject-object features are fused explicitly to form the transformer input tokens. Contrastingly, for early fusion there is no interaction between the modalities and subject-object features prior to entering the transformer.

The scene graph generation (SGG) task itself was first introduced by Lu et al. (2016). The bias problem in SGG was first brought into focus by Zellers et al. (2018), and an unbiased evaluation metric, mean Recall (mRecall), was proposed by Chen et al. (2019a) and Tang et al. (2019). Various debiasing techniques were proposed subsequently, such as using unbiased loss
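Since mRecall is central to the evaluation, a toy sketch of the metric may help: mRecall@K averages Recall@K over predicate classes, so a rare predicate weighs as much as a frequent one. The triplets and the simplified set-matching below are made up for illustration; real SGG evaluation also matches bounding boxes:

```python
from collections import defaultdict

def mean_recall_at_k(gt_triplets, topk_predictions):
    """mRecall@K: recall computed per predicate class, then averaged
    over classes. Toy formulation on (subject, predicate, object)
    triplets; real evaluation additionally matches detected boxes."""
    hits, totals = defaultdict(int), defaultdict(int)
    pred_set = set(topk_predictions)
    for s, p, o in gt_triplets:
        totals[p] += 1
        if (s, p, o) in pred_set:
            hits[p] += 1
    per_class = [hits[p] / totals[p] for p in totals]
    return sum(per_class) / len(per_class)

gt = {("man", "riding", "horse"), ("man", "on", "street"),
      ("dog", "on", "street"), ("horse", "wearing", "saddle")}
pred = [("man", "on", "street"), ("dog", "on", "street"),
        ("man", "riding", "horse")]
print(round(mean_recall_at_k(gt, pred), 3))  # → 0.667
```

Note how the single missed "wearing" triplet drags mRecall down to 0.667 even though the overall (class-agnostic) recall is 3/4: frequent predicates like "on" can no longer mask failures on the long tail.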

