PROPORTIONAL AMPLITUDE SPECTRUM TRAINING AUGMENTATION FOR SYN-TO-REAL DOMAIN GENERALIZATION

Abstract

Synthetic data offers the promise of cheap and bountiful training data for settings where large amounts of labeled real-world data are unavailable. However, models trained on synthetic data significantly underperform on real-world data. In this paper, we propose Proportional Amplitude Spectrum Training Augmentation (PASTA), a simple and effective augmentation strategy to improve out-of-the-box synthetic-to-real (syn-to-real) generalization performance. PASTA perturbs the amplitude spectrums of the synthetic images in the Fourier domain to generate augmented views. We design PASTA to perturb the amplitude spectrums in a structured manner such that high-frequency components are perturbed relatively more than low-frequency ones. For the tasks of semantic segmentation (GTAV→Real), object detection (Sim10K→Real), and object recognition (VisDA-C Syn→Real), across a total of 5 syn-to-real shifts, we find that PASTA either outperforms or is consistently competitive with more complex state-of-the-art methods while being complementary to other generalization approaches.

1. INTRODUCTION

Performant deep models for complex tasks rely heavily on access to substantial labeled data during training. However, gathering labeled real-world data can be expensive, and the collected data often captures only a portion of the real world seen at test time. Training models on synthetic data so that they generalize to diverse real-world data has therefore emerged as a popular alternative. However, models trained on synthetic data have a hard time generalizing to real-world data; e.g., the performance of a vanilla DeepLabv3+ (Chen et al., 2018a) architecture (ResNet-50 backbone) on semantic segmentation drops from 73.45% mIoU on GTAV to 28.95% mIoU on Cityscapes for the same set of classes. Several approaches have been considered in prior work to tackle this problem. In this paper, we propose an augmentation strategy, called Proportional Amplitude Spectrum Training Augmentation (PASTA), for the synthetic-to-real generalization problem. PASTA involves perturbing the amplitude spectrums of the source synthetic images in the Fourier domain. Prior work in domain generalization has also considered augmenting images in the Fourier domain (Xu et al., 2021; Yang & Soatto, 2020; Huang et al., 2021a), relying mostly on two observations: (1) low-frequency bands of the amplitude spectrum tend to capture style information / low-level statistics (illumination, lighting, etc.) (Yang & Soatto, 2020), and (2) the corresponding phase spectrum tends to capture high-level semantic content (Oppenheim et al., 1979; Oppenheim & Lim, 1981; Piotrowski & Campbell, 1982; Hansen & Hess, 2007; Yang et al., 2020). In addition to these observations from prior work, we observe that synthetic images have less diversity in the high-frequency bands of their amplitude spectrums compared to real images (see Sec. 3.2 for a detailed discussion).
Motivated by these key observations, PASTA provides a structured way to perturb the amplitude spectrums of source synthetic images to ensure that a model is exposed to more variations in high-frequency components during training. We empirically observe that, despite relying on such a simple set of motivating observations, PASTA leads to significant improvements in synthetic-to-real generalization performance; e.g., out-of-the-box GTAV→Cityscapes generalization performance of a vanilla DeepLabv3+ (ResNet-50 backbone) semantic segmentation architecture improves from 28.95% mIoU to 44.12% mIoU. PASTA involves the following steps. Given an input image, we apply a 2D Fast Fourier Transform (FFT) to obtain the corresponding amplitude and phase spectrums in the Fourier domain. For every spatial frequency (m, n) in the amplitude spectrum, we sample a multiplicative jitter value ϵ from $\mathcal{N}(1, \sigma^2[m, n])$ such that σ[m, n] increases monotonically with (m, n) (specifically, with $\sqrt{m^2 + n^2}$), thereby ensuring that higher-frequency components in the amplitude spectrum are perturbed more than lower-frequency components. The dependence of σ[m, n] on (m, n) can be controlled using a set of hyper-parameters that govern the degree of monotonicity. Finally, given the perturbed amplitude spectrum and the original phase spectrum, we apply an inverse 2D Fast Fourier Transform (iFFT) to obtain the augmented image. Fig. 1 shows a few examples of augmentation by PASTA. In terms of Fourier domain augmentations, the approaches closest to PASTA are perhaps Amplitude Jitter (AJ) (Xu et al., 2021) and Amplitude Mixup (AM) (Xu et al., 2021). The overarching principle across these methods is to perturb only the amplitude spectrums of images (while keeping the phase spectrum unaffected) to ensure models are invariant to the applied perturbations.
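The PASTA steps described above (FFT, frequency-dependent multiplicative jitter on the amplitude, iFFT with the original phase) can be illustrated with a minimal NumPy sketch. This is not the released implementation. The text here only states that σ[m, n] grows monotonically with $\sqrt{m^2 + n^2}$ and is controlled by hyper-parameters, so the power-law form below, the hyper-parameter names `alpha`, `k`, and `beta`, their default values, the function names, and the final clipping to [0, 1] are all assumptions made for illustration.

```python
import numpy as np


def pasta_sigma(height, width, alpha=3.0, k=2.0, beta=0.25):
    """Per-frequency standard deviation sigma[m, n].

    ASSUMED form: sigma = (2 * alpha * r / sqrt(H^2 + W^2)) ** k + beta,
    where r = sqrt(m^2 + n^2). It is monotonic in r for k > 0.
    """
    # Signed integer frequency indices in FFT layout (index 0 is the DC term).
    m = np.fft.fftfreq(height) * height
    n = np.fft.fftfreq(width) * width
    radius = np.sqrt(m[:, None] ** 2 + n[None, :] ** 2)
    norm = np.sqrt(height ** 2 + width ** 2)
    return (2.0 * alpha * radius / norm) ** k + beta


def pasta_augment(image, rng, alpha=3.0, k=2.0, beta=0.25):
    """Augment a float image of shape (H, W, C) with values in [0, 1]."""
    h, w = image.shape[:2]
    # Step 1: 2D FFT per channel, split into amplitude and phase.
    spectrum = np.fft.fft2(image, axes=(0, 1))
    amplitude, phase = np.abs(spectrum), np.angle(spectrum)
    # Step 2: sample a jitter eps ~ N(1, sigma^2[m, n]) for every frequency.
    sigma = pasta_sigma(h, w, alpha, k, beta)[:, :, None]
    eps = rng.normal(loc=1.0, scale=sigma, size=amplitude.shape)
    # Step 3: recombine the perturbed amplitude with the ORIGINAL phase.
    perturbed = (eps * amplitude) * np.exp(1j * phase)
    # Step 4: inverse FFT. The jitter is not conjugate-symmetric, so the
    # result is complex; keeping the real part and clipping is an assumption.
    augmented = np.fft.ifft2(perturbed, axes=(0, 1)).real
    return np.clip(augmented, 0.0, 1.0)


# Usage on a random stand-in for a synthetic training image.
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
augmented = pasta_augment(image, rng)
```

In this sketch, setting `alpha=0` makes σ constant across frequencies (every frequency is jittered with the same variance `beta ** 2`), while larger `alpha` or `k` concentrates the perturbation in the high-frequency bands, which is the behavior the text motivates.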
AM, a mixup strategy (Zhang et al., 2018; Verma et al., 2019), mixes the amplitude spectrums of distinct intra-source images, while AJ uniformly perturbs the amplitude spectrum with a single jitter value ϵ. Another frequency randomization technique, Frequency Space Domain Randomization (FSDR) (Huang et al., 2021a), first isolates domain-variant and domain-invariant frequency components by using SYNTHIA (Ros et al., 2016) (extra data) and ImageNet and then sets up a learning paradigm. Unlike these methods, PASTA applies fine-grained perturbations and does not involve sampling a separate mixup image or the use of any extra images. Instead, PASTA provides a simple strategy to perturb the amplitude spectrum of images in a structured way that leads to strong out-of-the-box generalization. We will release our code and data upon acceptance. In summary, we make the following contributions.

• We introduce Proportional Amplitude Spectrum Training Augmentation (PASTA), a simple and effective augmentation strategy for synthetic-to-real generalization. PASTA involves perturbing the amplitude spectrums of synthetic images in the Fourier domain so as to expose a model to more variations in high-frequency components.
• We show that PASTA leads to considerable improvements or competitive results across three tasks: (1) Semantic Segmentation: GTAV → Cityscapes, Mapillary, BDD100k, (2) Object Detection: Sim10K → Cityscapes, and (3) Object Recognition: VisDA-C Syn → Real, covering a total of 5 syn-to-real shifts across multiple backbones.
• We show that PASTA (1) often makes a baseline model competitive with prior state-of-the-art approaches relying on either specific architectural components, extra data, or objectives, (2) is complementary to said approaches, and (3) is competitive with augmentation strategies like FACT (Xu et al., 2021) and RandAugment (Cubuk et al., 2020).

Figure 1: PASTA augmentation samples. Examples of images from different synthetic datasets when augmented using PASTA and RandAugment (Cubuk et al., 2020). Row 1 includes examples from GTAV and row 2 from VisDA-C.

2. RELATED WORK

Domain Generalization (DG). DG involves training models on single or multiple labeled data sources to generalize well to novel test-time data sources (unseen during training). Since its inception (Blanchard et al., 2011; Muandet et al., 2013), several approaches have been proposed to tackle the problem of domain generalization. These include decomposing a model into domain-invariant and domain-specific components and utilizing the former to make predictions (Ghifary et al., 2015; Khosla et al., 2012), learning domain-specific masks for generalization (Chattopadhyay et al., 2020), using meta-learning to train a robust model by mimicking the DG problem during training (Li et al., 2018; Wang et al., 2020; Balaji et al., 2018; Chen et al., 2022; Dou et al., 2019), manipulating feature statistics to augment training data (Zhou et al., 2021; Li et al., 2022; Nuriel et al., 2021), and using models crafted based on risk minimization formalisms (Arjovsky et al., 2019). Recently, properly tuned Empirical Risk Minimization (ERM) has proven to be a competitive DG approach (Gulrajani & Lopez-Paz, 2020), with follow-up work adopting various optimization and regularization techniques on top of ERM (Shi et al., 2021; Cha et al., 2021).

