SIM2SG: SIM-TO-REAL SCENE GRAPH GENERATION FOR TRANSFER LEARNING

Abstract

Scene graph (SG) generation has gained significant traction recently. Current SG generation techniques, however, rely on expensive and limited labeled datasets. Synthetic data offers a viable alternative, as labels are essentially free. However, neural network models trained on synthetic data do not perform well on real data because of the domain gap. To overcome this challenge, we propose Sim2SG, a scalable technique for sim-to-real transfer for scene graph generation. Sim2SG addresses the domain gap by decomposing it into appearance, label, and prediction discrepancies between the two domains. We handle these discrepancies by introducing pseudo-statistic-based self-learning and adversarial techniques. Sim2SG requires no costly supervision from the real-world dataset. Our experiments demonstrate significant improvements over baselines in reducing the domain gap, both qualitatively and quantitatively. We validate our approach on toy simulators, as well as on realistic simulators evaluated on real-world data.

1. INTRODUCTION

Scene graphs (SGs) are an interpretable, structured representation of scenes in both computer vision and computer graphics. A scene graph summarizes the entities in a scene and the plausible relationships among them. SGs (Dai et al., 2017; Herzig et al., 2018; Li et al., 2017; Newell & Deng, 2017; Xu et al., 2017; Yang et al., 2018; Zellers et al., 2018) are a manifestation of vision as inverse graphics. They have found a variety of applications, such as image captioning, visual question answering, high-level reasoning tasks, image retrieval, and image generation. However, most prior work on SG generation relies on expensive and limited labeled datasets such as Visual Genome (Krishna et al., 2017) and the Visual Relationship Dataset (VRD) (Lu et al., 2016). A general lack of sufficient labeled data is one of the main limitations in supervised machine learning. Synthetic data is a viable alternative, since annotations are essentially free. It has been used for a variety of tasks such as image classification, object detection, semantic segmentation, optical flow modeling, 3D keypoint extraction, object pose estimation, and 3D reconstruction (Borrego et al., 2018; Butler et al., 2012; Dosovitskiy et al., 2015; McCormac et al., 2016; Mueller et al., 2017; Richter et al., 2016; Ros et al., 2016; Suwajanakorn et al., 2018; Tremblay et al., 2018; Tsirikoglou et al., 2017). It has also been shown to be effective for initializing task networks (Prakash et al., 2019) and for data augmentation. However, the use of synthetic data for SG generation and visual relationships remains unexplored. One crucial issue with training on a labeled source domain (synthetic data) and evaluating on an unlabeled target domain (real data) is the performance gap known as the domain gap (Torralba & Efros, 2011), which arises from the difference between the data distributions of the source and target domains.
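The abstract mentions pseudo-statistic-based self-learning on the unlabeled target domain. While Sim2SG's actual procedure is not described in this section, the generic pseudo-label self-training idea it builds on (train on labeled synthetic data, then iteratively absorb confidently pseudo-labeled real samples into the training set) can be sketched with a toy nearest-centroid classifier. All function names and the confidence heuristic below are our own illustrative choices, not the paper's method.

```python
import numpy as np

def fit_centroids(X, y):
    # Toy "model": one mean feature vector per class.
    return np.stack([X[y == c].mean(axis=0) for c in np.unique(y)])

def predict_with_confidence(centroids, X):
    # Distance to each centroid; softmax over negative distances
    # serves as a crude per-sample confidence score.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    p = np.exp(-d) / np.exp(-d).sum(axis=1, keepdims=True)
    return p.argmax(axis=1), p.max(axis=1)

def self_train(Xs, ys, Xt, threshold=0.8, rounds=3):
    # Xs, ys: labeled source (synthetic) data; Xt: unlabeled target (real) data.
    # Each round, pseudo-label the target set and keep only samples whose
    # confidence clears the threshold, then refit on source + kept target.
    X, y = Xs.copy(), ys.copy()
    for _ in range(rounds):
        centroids = fit_centroids(X, y)
        yt_hat, conf = predict_with_confidence(centroids, Xt)
        keep = conf >= threshold
        X = np.concatenate([Xs, Xt[keep]])
        y = np.concatenate([ys, yt_hat[keep]])
    return fit_centroids(X, y)
```

For example, with two well-separated source clusters and a target set shifted by a small constant offset, the returned centroids move toward the target distribution and classify the shifted points correctly. Real self-training pipelines replace the centroid model with a task network and must guard against confirmation bias from wrong pseudo-labels.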
Kar et al. (2019) argue that the domain gap can be divided into an appearance gap and a content gap. The appearance gap can be addressed by making scenes photo-realistic (McCormac et al., 2016; Wrenninge & Unger, 2018), by image translation (Hoffman et al., 2018; Huang et al., 2018; Zhu et al., 2017), by feature alignment (Chang et al., 2019; Chen et al., 2018; Li et al., 2019; Luo et al., 2019; Saito et al., 2019; Sun et al., 2019), or by learning robust representations via domain randomization (Prakash et al., 2019; Tobin et al., 2017). There are also studies that address the content gap for image classification (Azizzadenesheli et al., 2019; Lipton et al., 2018; Tan et al., 2019). We present a thorough investigation of the domain gap between the source and target domains. We assume a gap in both appearance and content, decompose these gaps into sub-components, and provide a way to address each. We primarily apply our method to reduce the domain gap for

