CONTRASTIVE SYN-TO-REAL GENERALIZATION

Abstract

Training on synthetic data can be beneficial for label- or data-scarce scenarios. However, synthetically trained models often suffer from poor generalization in real domains due to domain gaps. In this work, we make a key observation that the diversity of the learned feature embeddings plays an important role in generalization performance. To this end, we propose contrastive synthetic-to-real generalization (CSG), a novel framework that leverages pre-trained ImageNet knowledge to prevent overfitting to the synthetic domain, while promoting the diversity of feature embeddings as an inductive bias to improve generalization. In addition, we enhance the proposed CSG framework with attentional pooling (A-pool) to let the model focus on semantically important regions and further improve its generalization. We demonstrate the effectiveness of CSG on various synthetic training tasks, exhibiting state-of-the-art performance on zero-shot domain generalization.

1. INTRODUCTION

Deep neural networks have pushed the boundaries of many visual recognition tasks. However, their success often hinges on the availability of both training data and labels. Obtaining data and labels can be difficult or expensive in many applications such as semantic segmentation, correspondence, 3D reconstruction, pose estimation, and reinforcement learning. In these cases, learning with synthetic data can greatly benefit such applications, since large amounts of data and labels are available at relatively low cost. For this reason, synthetic training has recently gained significant attention (Wu et al., 2015; Richter et al., 2016; Shrivastava et al., 2017; Savva et al., 2019). Despite these benefits, synthetically trained models often generalize poorly to the real domain due to large domain gaps between synthetic and real images. Limitations in simulation and rendering can degrade synthesis quality, producing aliased boundaries, unrealistic textures, fake appearance, over-simplified lighting conditions, and unreasonable scene layouts. These issues create domain gaps between synthetic and real images, preventing synthetically trained models from capturing meaningful representations and limiting their generalization ability on real images.


To mitigate these issues, domain generalization and adaptation techniques have been proposed (Li et al., 2017; Pan et al., 2018; Yue et al., 2019). Domain adaptation assumes the availability of target data (labeled, partially labeled, or unlabeled) during training. On the other hand, domain generalization considers zero-shot generalization without seeing the target data of real images, and is therefore more challenging. An illustration of the domain generalization protocol on the VisDA-17 dataset (Peng et al., 2017) is shown in Figure 1. Considering that ImageNet pre-trained representations are widely used as model initialization, recent efforts on domain generalization show that such knowledge can be used to prevent overfitting to the synthetic domain (Chen et al., 2018; 2020c). Specifically, they impose a distillation loss to regularize the distance between the synthetically trained and the ImageNet pre-trained representations, which improves synthetic-to-real generalization. The above approaches still face limitations due to the challenging nature of this problem. Taking a closer look, we observe the following pitfalls in training on synthetic data. First, obtaining photorealistic appearance features at the micro level, such as texture and illumination, is challenging due to the limits of simulation complexity and rendering granularity. Without special treatment, CNNs tend to be biased towards textures (Geirhos et al., 2019) and suffer from badly learned representations on synthetic data. Second, the common lack of texture and shape variations in synthetic images often leads to collapsed and trivial representations without any diversity. This is unlike training with natural images, where models get sufficiently trained by seeing enough variations. Such a lack of diversity in the representation makes the learned models vulnerable to natural variations in the real world.
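The distillation-style regularizer used in prior work can be sketched as follows. This is a minimal illustration, assuming a simple L2 penalty between the current features and those of a frozen ImageNet-pretrained copy of the backbone; the function name and the exact distance used are our assumptions, not necessarily the formulation of Chen et al.

```python
import numpy as np

def distillation_loss(feat_syn, feat_imagenet):
    # feat_syn:      (B, D) embeddings from the backbone being trained on synthetic data
    # feat_imagenet: (B, D) embeddings of the same images from a frozen
    #                ImageNet-pretrained copy of the backbone (no gradients flow here)
    # Penalizes drifting away from the pre-trained representation.
    diff = feat_syn - feat_imagenet
    return np.mean(np.sum(diff * diff, axis=1))
```

Note that a pure distance penalty only pulls features toward the pre-trained ones; by itself it does nothing to keep embeddings of different images apart, which motivates the contrastive formulation proposed in this work.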

Summary of contributions and results:

• We observe that the diversity of the learned feature embedding plays an important role in synthetic-to-real generalization. We show an example of collapsed representations learned by a synthetic model, in sharp contrast to features learned from real data (Section 2).
• Motivated by the above observation, we propose a contrastive synthetic-to-real generalization (CSG) framework that simultaneously regularizes the synthetically trained representation while promoting its diversity to improve generalization (Section 3.1).
• We further enhance the CSG framework with attentional pooling (A-pool), where feature representations are guided by model attention. This allows the model to localize its attention on semantically more important regions, and thus improves synthetic-to-real generalization (Section 3.4).
• We benchmark CSG on various synthetic training tasks, including image classification (VisDA-17) and semantic segmentation (GTA5 → Cityscapes), and show that CSG considerably improves generalization performance without seeing target data. Our best model reaches 64.05% accuracy on VisDA-17, compared to 61.1% for the previous state of the art (Chen et al., 2020c) (Section 4).
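A contrastive objective of the kind described above can be sketched with an InfoNCE-style loss: each image's synthetically trained embedding is pulled toward the ImageNet-pretrained embedding of the same image (positive pair) and pushed away from embeddings of other images in the batch (negatives). The temperature value and the exact pairing scheme here are illustrative assumptions, not the precise CSG formulation.

```python
import numpy as np

def info_nce(z_syn, z_ima, tau=0.1):
    # z_syn: (B, D) embeddings from the model trained on synthetic data
    # z_ima: (B, D) embeddings of the same images from the frozen
    #        ImageNet-pretrained model; rows of both are assumed L2-normalized.
    # Positives sit on the diagonal of the similarity matrix; all other
    # entries act as negatives, which discourages collapsed representations.
    logits = z_syn @ z_ima.T / tau                      # (B, B) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                  # cross-entropy on matching pairs
```

Unlike a pure distance penalty, the denominator involves all other images in the batch, so the loss can only be made small when embeddings of different images stay spread apart.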

2. A MOTIVATING EXAMPLE

We give a motivating example to show the significant differences between the features learned on synthetic and real images. Specifically, we use a ResNet-101 backbone and extract the L2-normalized feature embedding after global average pooling (denoted v). We consider the following three models:
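The embedding v analyzed in this section can be computed as follows; this is a minimal sketch, assuming the feature map comes from the final convolutional stage of the backbone.

```python
import numpy as np

def gap_embedding(feature_map):
    # feature_map: (C, H, W) activation from the backbone
    # (e.g. the last conv stage of ResNet-101).
    # Returns the L2-normalized global-average-pooled embedding v, ||v|| = 1.
    v = feature_map.mean(axis=(1, 2))        # (C,) global average pooling
    return v / (np.linalg.norm(v) + 1e-12)   # L2 normalization onto the unit sphere
```

Because v lies on the unit sphere, the diversity of a set of such embeddings can be measured by how uniformly they spread over the sphere, which is what the hyperspherical energy in Figure 2 quantifies.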



* Work done during a research internship with NVIDIA. † Corresponding author.
For a fair comparison, we use a random subset of the training set equal in size to the validation set, since the training set of VisDA-17 is larger than the validation set.



Figure 1: An illustration of the domain generalization protocol on the VisDA-17 dataset, where real target domain (test) images are assumed unavailable during model training.

Figure 2: Feature diversity on VisDA-17 test images in R^2, visualized with Gaussian kernel density estimation (KDE). Darker areas have more concentrated features. Es: hyperspherical energy of the features; lower values indicate more diverse features.
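The hyperspherical energy reported in Figure 2 can be sketched as a Riesz s-energy over pairs of unit vectors, following the minimum hyperspherical energy formulation of Liu et al. (2018); the exponent s and the averaging convention used here are assumptions for illustration.

```python
import numpy as np

def hyperspherical_energy(V, s=1.0, eps=1e-8):
    # V: (N, D) feature embeddings with L2-normalized rows.
    # Average over unique pairs of 1 / ||v_i - v_j||^s.
    # Lower energy <=> features spread more uniformly on the unit sphere,
    # i.e. a more diverse representation; collapsed features give huge energy.
    d = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=-1)  # (N, N) pairwise distances
    iu = np.triu_indices(len(V), k=1)                           # indices of unique pairs
    return np.mean(1.0 / (d[iu] ** s + eps))
```

For example, N identical embeddings (a fully collapsed representation) yield a near-infinite energy, while mutually orthogonal embeddings yield a small one, matching the "lower is more diverse" reading of Es in the figure.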

