PROPORTIONAL AMPLITUDE SPECTRUM TRAINING AUGMENTATION FOR SYN-TO-REAL DOMAIN GENERALIZATION

Abstract

Synthetic data offers the promise of cheap and bountiful training data for settings where labeled real-world data is scarce. However, models trained on synthetic data significantly underperform when evaluated on real-world data. In this paper, we propose Proportional Amplitude Spectrum Training Augmentation (PASTA), a simple and effective augmentation strategy to improve out-of-the-box synthetic-to-real (syn-to-real) generalization performance. PASTA perturbs the amplitude spectrums of synthetic images in the Fourier domain to generate augmented views. We design PASTA to perturb the amplitude spectrums in a structured manner, such that high-frequency components are perturbed relatively more than low-frequency ones. For the tasks of semantic segmentation (GTAV→Real), object detection (Sim10K→Real), and object recognition (VisDA-C Syn→Real), across a total of five syn-to-real shifts, we find that PASTA either outperforms or is consistently competitive with more complex state-of-the-art methods, while being complementary to other generalization approaches.

1. INTRODUCTION

Performant deep models for complex tasks rely heavily on access to substantial labeled data during training. However, gathering labeled real-world data can be expensive, and the collected data often captures only a portion of the real world seen at test time. Training models on synthetic data so that they generalize to diverse real-world data has therefore emerged as a popular alternative. However, models trained on synthetic data have a hard time generalizing to real-world data: for example, the performance of a vanilla DeepLabv3+ (Chen et al., 2018a) architecture with a ResNet-50 backbone on semantic segmentation drops from 73.45% mIoU on GTAV to 28.95% mIoU on Cityscapes for the same set of classes. Several approaches have been considered in prior work to tackle this problem. In this paper, we propose an augmentation strategy, called Proportional Amplitude Spectrum Training Augmentation (PASTA), for the synthetic-to-real generalization problem. PASTA perturbs the amplitude spectrums of the source synthetic images in the Fourier domain. While prior work in domain generalization has considered augmenting images in the Fourier domain (Xu et al., 2021; Yang & Soatto, 2020; Huang et al., 2021a), these approaches mostly rely on two observations: (1) low-frequency bands of the amplitude spectrum tend to capture style information and low-level statistics (illumination, lighting, etc.) (Yang & Soatto, 2020), and (2) the corresponding phase spectrum tends to capture high-level semantic content (Oppenheim et al., 1979; Oppenheim & Lim, 1981; Piotrowski & Campbell, 1982; Hansen & Hess, 2007; Yang et al., 2020). In addition to these observations from prior work, we observe that synthetic images have less diversity in the high-frequency bands of their amplitude spectrums than real images do (see Sec. 3.2 for a detailed discussion).
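For readers less familiar with the Fourier-domain view, the decomposition that these observations refer to can be sketched in a few lines of NumPy. This is only an illustrative sketch under our own assumptions: the function names (`to_amplitude_phase`, `from_amplitude_phase`, `radial_frequency`) are hypothetical and not taken from any released code, and the normalization of the radial frequency is one convenient choice among several. It shows how an image splits into amplitude and phase spectrums, that the split is lossless, and how each frequency bin can be assigned a distance from the zero-frequency component so that "low" and "high" bands can be told apart.

```python
import numpy as np


def to_amplitude_phase(image):
    """Split an (H, W) or (H, W, C) image into amplitude and phase spectrums."""
    spectrum = np.fft.fft2(image, axes=(0, 1))  # 2D FFT over the spatial axes
    return np.abs(spectrum), np.angle(spectrum)


def from_amplitude_phase(amplitude, phase):
    """Recombine amplitude and phase spectrums into a real-valued image."""
    spectrum = amplitude * np.exp(1j * phase)
    return np.real(np.fft.ifft2(spectrum, axes=(0, 1)))


def radial_frequency(height, width):
    """Distance of each FFT bin from the zero-frequency bin, scaled to [0, 1].

    Assumption: frequencies are measured in cycles per pixel and divided by
    the largest attainable radius, sqrt(0.5**2 + 0.5**2).
    """
    fy = np.fft.fftfreq(height)[:, None]  # vertical frequencies in [-0.5, 0.5)
    fx = np.fft.fftfreq(width)[None, :]   # horizontal frequencies in [-0.5, 0.5)
    return np.sqrt(fx**2 + fy**2) / np.sqrt(0.5)


# Usage on a random stand-in for an RGB image.
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))

amplitude, phase = to_amplitude_phase(image)
restored = from_amplitude_phase(amplitude, phase)
radius = radial_frequency(32, 32)

# The decomposition is lossless up to floating-point error.
assert np.allclose(restored, image)
```

Under this sketch, modifying `amplitude` while leaving `phase` untouched changes an image's low-level statistics without rearranging its content, which is the property that amplitude-based augmentations rely on.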
Motivated by these key observations, PASTA provides a structured way to perturb the amplitude spectrums of source synthetic images so that a model is exposed to more variation in high-frequency components during training. We empirically observe that, despite relying on such a simple set of motivating observations, PASTA leads to significant improvements in synthetic-to-real generalization performance: for example, the out-of-the-box GTAV→Cityscapes generalization performance of a vanilla DeepLabv3+ semantic segmentation architecture with a ResNet-50 backbone improves from 28.95% mIoU to 44.12% mIoU. PASTA involves the following steps. Given an input image, we apply a 2D Fast Fourier Transform (FFT) to obtain the corresponding amplitude and phase spectrums in the Fourier domain. For every spatial frequency (m, n) in the amplitude spectrum, we sample a multiplicative jitter value ϵ from

