SIM2SG: SIM-TO-REAL SCENE GRAPH GENERATION FOR TRANSFER LEARNING

Abstract

Scene graph (SG) generation has been gaining a lot of traction recently. Current SG generation techniques, however, rely on expensive and scarce labeled datasets. Synthetic data offers a viable alternative, as labels are essentially free. However, neural network models trained on synthetic data do not perform well on real data because of the domain gap. To overcome this challenge, we propose Sim2SG, a scalable technique for sim-to-real transfer for scene graph generation. Sim2SG addresses the domain gap by decomposing it into appearance, label, and prediction discrepancies between the two domains. We handle these discrepancies by introducing pseudo-statistic-based self-learning and adversarial techniques. Sim2SG does not require costly supervision from the real-world dataset. Our experiments demonstrate significant improvements over baselines in reducing the domain gap, both qualitatively and quantitatively. We validate our approach on toy simulators as well as realistic simulators evaluated on real-world data.

1. INTRODUCTION

Scene graphs (SGs) in both computer vision and computer graphics are an interpretable and structured representation of scenes. A scene graph summarizes the entities in a scene and plausible relationships among them. SGs (Dai et al., 2017; Herzig et al., 2018; Li et al., 2017; Newell & Deng, 2017; Xu et al., 2017; Yang et al., 2018; Zellers et al., 2018) are a manifestation of vision as inverse graphics. They have found a variety of applications such as image captioning, visual question answering, high-level reasoning, image retrieval, and image generation. However, most prior work on SG generation relies on expensive and limited labeled datasets such as Visual Genome (Krishna et al., 2017) and the Visual Relationship Dataset (VRD) (Lu et al., 2016).

A general lack of sufficient labeled data is one of the main limitations in supervised machine learning. Synthetic data is a viable alternative to this problem, since annotations are essentially free. Synthetic data has been used for a variety of tasks such as image classification, object detection, semantic segmentation, optical flow modeling, 3D keypoint extraction, object pose estimation, and 3D reconstruction (Borrego et al., 2018; Butler et al., 2012; Dosovitskiy et al., 2015; McCormac et al., 2016; Mueller et al., 2017; Richter et al., 2016; Ros et al., 2016; Suwajanakorn et al., 2018; Tremblay et al., 2018; Tsirikoglou et al., 2017). It has also been shown to be effective for initializing task networks (Prakash et al., 2019) and for data augmentation. However, the use of synthetic data for SG generation and visual relationships is yet to be explored. One crucial issue with training on a labeled source domain (synthetic data) and evaluating on an unlabeled target domain (real data) is the performance gap known as the domain gap (Torralba & Efros, 2011), which stems from the difference in data distributions between the source and target domains.
Kar et al. (2019) argue that the domain gap can be divided into an appearance gap and a content gap. The appearance gap can be addressed by making scenes photo-realistic (McCormac et al., 2016; Wrenninge & Unger, 2018), by using image translation (Hoffman et al., 2018; Huang et al., 2018; Zhu et al., 2017), by feature alignment (Chang et al., 2019; Chen et al., 2018; Li et al., 2019; Luo et al., 2019; Saito et al., 2019; Sun et al., 2019), or by learning robust representations through domain randomization (Prakash et al., 2019; Tobin et al., 2017). Other studies address the content gap for image classification (Azizzadenesheli et al., 2019; Lipton et al., 2018; Tan et al., 2019).

We present a thorough investigation of the domain gap between source and target domains. We assume a gap in both appearance and content, expand these gaps into different sub-components, and provide a way to address each of them. We primarily apply our method to reduce the domain gap for SG generation; nonetheless, our techniques can also be applied to other vision tasks such as image classification, image segmentation, and object detection. We propose Sim2SG (Simulation to Scene Graph), a model that learns sim-to-real scene graph generation by leveraging labeled synthetic data and unlabeled real data. Extending the formulation of Wu et al. (2019), Sim2SG addresses the domain gap by bounding the task error (where the task is scene graph generation) on real data in terms of the appearance, prediction, and label (ground-truth) discrepancies between the two domains, plus the task error on synthetic data. Our work differs from Wu et al. (2019) in that they do not provide a way to address the content gap, and their risk discrepancy is intractable. To the best of our knowledge, Sim2SG is the first work to introduce a tractable error bound on the content component of the domain gap.
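Schematically, this bound has the following form; the d terms below are placeholder names for the appearance, prediction, and label discrepancies, so this is an illustrative sketch rather than the exact formal statement:

```latex
% Schematic form of the bound: the SG-generation error on real data is
% controlled by the error on synthetic data plus three discrepancy terms.
% The d_{*} symbols are placeholder names, not the exact formal definitions.
\epsilon_{\mathrm{real}}(h \circ \phi) \;\lesssim\;
    \epsilon_{\mathrm{syn}}(h \circ \phi)
    \;+\; d_{\mathrm{app}}    % appearance discrepancy between domains
    \;+\; d_{\mathrm{pred}}   % prediction discrepancy between domains
    \;+\; d_{\mathrm{label}}  % label (ground-truth) discrepancy between domains
```

Minimizing each discrepancy term separately, as described next, tightens the bound on the real-data error.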
We minimize the appearance and prediction discrepancies by aligning the corresponding latent and output distributions via gradient reversal layers (Ganin et al., 2017). We address the label discrepancy using principles of self-learning (Zou et al., 2018). However, self-learning based on pseudo-labels often suffers from inaccurately generated labels (e.g., predicted bounding boxes that are ill-categorized or imprecise, causing the model to regress on the wrong objects) (Zheng & Yang, 2020; Kim et al., 2019). We therefore instead collect higher-level statistics (e.g., a list of objects with their types, positions, and relationships for placement), which we call pseudo-statistics, from target data, and leverage the synthetic data generator to produce valid objects with precise labels (e.g., bounding boxes).

We experimentally demonstrate our method in three distinct environments: the all-synthetic CLEVR (Johnson et al., 2017), the more realistic Dining-Sim, and Drive-Sim, a driving simulator evaluated on KITTI (Geiger et al., 2012). We almost close the domain gap in the CLEVR environment, and we show significant improvements over the respective baselines in Dining-Sim and Drive-Sim. Through ablations, we validate our assumptions about the appearance and content gaps. Sim2SG differs from other unsupervised domain adaptation methods (Chen et al., 2018; Xu et al., 2020; Li et al., 2020) in that, with access to a synthetic data generator, it can modify the source distribution (via self-learning based on pseudo-statistics) to align with the target distribution. We also outperform these domain adaptation baselines (Chen et al., 2018; Xu et al., 2020; Li et al., 2020), as shown in Section 4.3.

Contributions: Our contributions are three-fold. In terms of methodology, to the best of our knowledge, (1) we are the first to propose sim-to-real transfer learning for scene graph generation, without requiring costly supervision from the target real-world dataset.
(2) We study the domain gap between synthetic and real data in detail, provide a tractable error bound on the content component of the gap, and propose a novel pipeline, including pseudo-statistics, to fully handle the gap. Experimentally, (3) we show that Sim2SG learns SG generation and obtains significant improvements over baselines in all three scenarios: CLEVR, Dining-Sim, and Drive-Sim. We also present ablations to illustrate the effectiveness of our technique.
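As a concrete illustration of the pseudo-statistics idea introduced above, the sketch below aggregates only high-level statistics (object categories, coarse positions, relationship triplets) from SG predictions on unlabeled target images; a synthetic generator can then re-render scenes matching these statistics and emit exact labels. All class and function names here are hypothetical and simplified, not our actual implementation:

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical sketch of pseudo-statistics: instead of trusting raw
# pseudo-labels (predicted boxes may be imprecise), we keep only high-level
# scene statistics and let the synthetic generator supply exact labels.

@dataclass
class PredictedObject:
    category: str
    position: tuple  # coarse (x, y) placement, not a tight box
    score: float     # prediction confidence

def collect_pseudo_statistics(predictions, score_threshold=0.7):
    """Aggregate object categories, coarse positions, and relationship
    counts from SG predictions on target-domain images."""
    category_counts = Counter()
    placements = []
    relationship_counts = Counter()
    for objects, relationships in predictions:
        for obj in objects:
            if obj.score >= score_threshold:  # keep confident objects only
                category_counts[obj.category] += 1
                placements.append((obj.category, obj.position))
        for subject, predicate, obj in relationships:
            relationship_counts[(subject, predicate, obj)] += 1
    return {"categories": category_counts,
            "placements": placements,
            "relationships": relationship_counts}

def generate_labeled_scenes(stats, renderer):
    """Ask the synthetic generator to place objects according to the
    collected statistics; it returns scenes with exact ground truth."""
    return [renderer(category, position)
            for category, position in stats["placements"]]
```

Because the generator, not the noisy predictor, produces the final bounding boxes, the labels used for self-learning are precise by construction.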

2. PROPOSED METHOD: SIM2SG

Our proposed Sim2SG pipeline is illustrated in Figure 1. We first describe how we generate scene graphs in Section 2.1. When we naïvely train on a source distribution (synthetic data) and evaluate on a target distribution (real data), we face a domain gap (Torralba & Efros, 2011). We study it in more detail in Section 2.2 and propose methods to address it.

2.1. SCENE GRAPHS

This section describes scene graphs (SGs) and how we train the SG predictor module using labels from the source domain. Notation: We represent the scene graph of a given image I as a graph G with nodes o and edges r. Each node is a tuple o_i = (b_i, c_i) of a bounding box b_i = {xmin_i, ymin_i, w_i, h_i} and a category c_i. Each relationship r is a triplet (o_i, p, o_j), where p is a predicate. SG prediction has two key components: a feature extractor φ and a graph predictor h. φ maps the input space x to a latent space z, and h maps the latent space z to the output space y. The predicted SG is G = h(φ(x)). We use ResNet-101 (He et al., 2016) to implement φ and the Graph R-CNN (Yang et al., 2018) architecture to implement h. We train the networks φ and h using the following task loss (Yang et al., 2018): cross-entropy loss for object classification and relationship classification, and ℓ1 loss for bounding boxes.
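The notation above can be made concrete with a minimal data-structure sketch. The names below are illustrative assumptions, and the toy callables stand in for the actual φ (ResNet-101) and h (Graph R-CNN) networks:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Minimal sketch of the scene-graph representation described above.
# Node and SceneGraph mirror the (b_i, c_i) and (o_i, p, o_j) notation.

@dataclass
class Node:
    box: Tuple[float, float, float, float]  # b_i = (xmin, ymin, w, h)
    category: str                           # c_i

@dataclass
class SceneGraph:
    nodes: List[Node]
    # relationships as (subject index, predicate, object index) triplets
    relationships: List[Tuple[int, str, int]]

def predict_scene_graph(image, phi: Callable, h: Callable) -> SceneGraph:
    """G = h(phi(x)): map the image to latent features, then to a graph."""
    z = phi(image)  # feature extractor: input space x -> latent space z
    return h(z)     # graph predictor: latent space z -> scene graph G
```

In training, the classification terms of the task loss are computed over the node categories and relationship predicates, while the ℓ1 term is computed over the box coordinates.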

