WEAKLY SUPERVISED SCENE GRAPH GROUNDING

Abstract

Recent research has achieved substantial advances in learning structured representations from images. However, current methods rely heavily on annotated mappings between scene graph nodes and object bounding boxes inside images. Here, we explore the problem of learning the mapping between scene graph nodes and visual objects under weak supervision. Our proposed method learns a metric between visual objects and scene graph nodes by incorporating information from both object features and relational features. Extensive experiments on the Visual Genome (VG) and Visual Relation Detection (VRD) datasets verify that our model improves over current state-of-the-art approaches on the scene graph grounding task. Further experiments on the scene graph parsing task verify that the groundings found by our model can reinforce the performance of an existing method.

1. INTRODUCTION

Motivated by various needs, researchers have designed multiple representations to describe visual content. More specifically, object bounding boxes localize the objects inside an image, while scene graphs represent object-wise interactions. Ideally, each bounding box should correspond to a node in the scene graph. However, in many cases, such node-object correspondences are not established, particularly when the information in scene graphs comes from non-visual inputs, such as image captions (Wang et al., 2018b), knowledge graphs (Zareian et al., 2020b) and commonsense bases (Shi et al., 2019). The lack of node-object mappings in data constrains various multi-modal learning tasks, e.g., scene graph parsing (Xu et al., 2017; Zhang et al., 2017b), VQA (Ghosh et al., 2019) and image captioning (Yang et al., 2019). If the mapping can be learned without extra annotations, a comprehensive view of an image can be created to benefit a number of downstream tasks. Therefore, in this paper, we focus on grounding scene graph nodes to visual objects under weak supervision, where the node-object correspondences are not annotated even during the training phase.

Although the scene graph grounding problem can benefit numerous downstream tasks, it has barely been studied. Unlike other weakly supervised learning tasks, which focus on a single label space (Dietterich et al., 1997; Wang et al., 2018a), the scene graph grounding problem involves two label spaces: object categories and relation types, which are disjoint but dependent. More specifically, visual relations are highly correlated with visual objects. As a result, a desirable model should correctly handle the interaction between object categories and visual relations instead of simply learning them independently. Therefore, most well-studied weakly supervised learning methods are not suitable for learning on scene graphs.
Among the few recent efforts on this task, Zareian et al. achieve impressive results. They identified the grounding problem while addressing weakly supervised scene graph parsing. Treating it as a side challenge in weakly supervised learning, they propose to tackle it by jointly learning the node-object mapping and a visual relation parser under weak supervision. In their method, a parser that captures the interaction between relation features and object features represents the image as a bounding box graph. They then align this bounding box graph with the scene graph to establish the correspondence between visual regions and scene graph nodes. The mapping found by the alignment algorithm is further utilized in optimizing the scene graph parser. However, the graph alignment process results in one core limitation of their method: since the graph-matching problem is NP-hard, they must trade off efficiency against accuracy. Furthermore, to enable weakly supervised learning, in training stage
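To make the alignment step concrete, a heavily simplified version of node-object grounding, which ignores the relational structure that makes full graph matching NP-hard, can be posed as a linear assignment problem over a node-box similarity matrix. The sketch below is illustrative only and is not the algorithm of Zareian et al.; the feature inputs and the choice of cosine similarity are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ground_nodes_to_boxes(node_feats, box_feats):
    """Assign each scene graph node to one bounding box by maximizing
    total cosine similarity (Hungarian algorithm).

    node_feats: (num_nodes, d) array of node embeddings (assumed given)
    box_feats:  (num_boxes, d) array of region embeddings (assumed given)
    Returns a dict {node_index: box_index}.

    Note: this treats nodes independently; it does not enforce that
    matched boxes respect the graph's relational edges, which is
    precisely the hard part of the full grounding problem.
    """
    # L2-normalize rows so the dot product equals cosine similarity.
    n = node_feats / np.linalg.norm(node_feats, axis=1, keepdims=True)
    b = box_feats / np.linalg.norm(box_feats, axis=1, keepdims=True)
    sim = n @ b.T                              # (num_nodes, num_boxes)
    # linear_sum_assignment minimizes cost, so negate to maximize.
    rows, cols = linear_sum_assignment(-sim)
    return dict(zip(rows.tolist(), cols.tolist()))
```

Extending this node-level assignment to also score agreement between relation edges is what turns the problem into general graph matching, motivating the efficiency-accuracy trade-off discussed above.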

