PROPORTIONAL AMPLITUDE SPECTRUM TRAINING AUGMENTATION FOR SYN-TO-REAL DOMAIN GENERALIZATION

Abstract

Synthetic data offers the promise of cheap and bountiful training data for settings where labeled real-world data is scarce. However, models trained on synthetic data significantly underperform on real-world data. In this paper, we propose Proportional Amplitude Spectrum Training Augmentation (PASTA), a simple and effective augmentation strategy to improve out-of-the-box synthetic-to-real (syn-to-real) generalization performance. PASTA perturbs the amplitude spectrums of synthetic images in the Fourier domain to generate augmented views. We design PASTA to perturb the amplitude spectrums in a structured manner such that high-frequency components are perturbed relatively more than low-frequency ones. For the tasks of semantic segmentation (GTAV→Real), object detection (Sim10K→Real), and object recognition (VisDA-C Syn→Real), across a total of 5 syn-to-real shifts, we find that PASTA either outperforms or is consistently competitive with more complex state-of-the-art methods while being complementary to other generalization approaches.

1. INTRODUCTION

Performant deep models for complex tasks rely heavily on access to substantial labeled data during training. However, gathering labeled real-world data can be expensive and often captures only a portion of the real world seen at test time. Training models on synthetic data to better generalize to diverse real-world data has therefore emerged as a popular alternative. However, models trained on synthetic data have a hard time generalizing to real-world data, e.g., the performance of a vanilla DeepLabv3+ (Chen et al., 2018a) (ResNet-50 backbone) architecture on semantic segmentation drops from 73.45% mIoU on GTAV to 28.95% mIoU on Cityscapes for the same set of classes. Several approaches have been considered in prior work to tackle this problem. In this paper, we propose an augmentation strategy, called Proportional Amplitude Spectrum Training Augmentation (PASTA), for the synthetic-to-real generalization problem. PASTA perturbs the amplitude spectrums of the source synthetic images in the Fourier domain. While prior work in domain generalization has considered augmenting images in the Fourier domain (Xu et al., 2021; Yang & Soatto, 2020; Huang et al., 2021a), it mostly relies on the observations that (1) low-frequency bands of the amplitude spectrum tend to capture style information / low-level statistics (illumination, lighting, etc.) (Yang & Soatto, 2020) and (2) the corresponding phase spectrum tends to capture high-level semantic content (Oppenheim et al., 1979; Oppenheim & Lim, 1981; Piotrowski & Campbell, 1982; Hansen & Hess, 2007; Yang et al., 2020). In addition to the observations from prior work, we observe that synthetic images have less diversity in the high-frequency bands of their amplitude spectrums compared to real images (see Sec. 3.2 for a detailed discussion).
Motivated by these key observations, PASTA provides a structured way to perturb the amplitude spectrums of source synthetic images so that a model is exposed to more variations in high-frequency components during training. We empirically observe that, despite relying on such a simple set of motivating observations, PASTA leads to significant improvements in synthetic-to-real generalization performance, e.g., the out-of-the-box GTAV→Cityscapes generalization performance of a vanilla DeepLabv3+ (ResNet-50 backbone) semantic segmentation architecture improves from 28.95% mIoU to 44.12% mIoU. PASTA involves the following steps. Given an input image, we apply a 2D Fast Fourier Transform (FFT) to obtain the corresponding amplitude and phase spectrums in the Fourier domain. For every spatial frequency (m, n) in the amplitude spectrum, we sample a multiplicative jitter value ε from N(1, σ²[m, n]) such that σ[m, n] increases monotonically with (m, n) (specifically with √(m² + n²)), thereby ensuring that higher-frequency components in the amplitude spectrum are perturbed more than lower-frequency components. The dependence of σ[m, n] on (m, n) can be controlled using a set of hyper-parameters that govern the degree of monotonicity. Finally, given the perturbed amplitude and the original phase spectrums, we apply an inverse 2D Fast Fourier Transform (iFFT) to obtain the augmented image. Fig. 1 shows a few examples of augmentation by PASTA. In terms of Fourier domain augmentations, closest to PASTA are perhaps Amplitude Jitter (AJ) (Xu et al., 2021) and Amplitude Mixup (AM) (Xu et al., 2021). The overarching principle across these methods is to perturb only the amplitude spectrums of images (while keeping the phase spectrum unaffected) to ensure models are invariant to the applied perturbations.
For instance, AM, which is a type of mixup strategy (Zhang et al., 2018; Verma et al., 2019), performs mixup between the amplitude spectrums of distinct intra-source images, while AJ uniformly perturbs the amplitude spectrums with a single jitter value ε. Another frequency randomization technique, Frequency Space Domain Randomization (FSDR) (Huang et al., 2021a), first isolates domain-variant and domain-invariant frequency components using SYNTHIA (Ros et al., 2016) (extra data) and ImageNet, and then sets up a learning paradigm. Unlike these methods, PASTA applies fine-grained perturbations and does not involve sampling a separate mixup image or the use of any extra images. Instead, PASTA provides a simple strategy to perturb the amplitude spectrum of images in a structured way that leads to strong out-of-the-box generalization. We will release our code and data upon acceptance. In summary, we make the following contributions.

• We introduce Proportional Amplitude Spectrum Training Augmentation (PASTA), a simple and effective augmentation strategy for synthetic-to-real generalization. PASTA involves perturbing the amplitude spectrums of synthetic images in the Fourier domain so as to expose a model to more variations in high-frequency components.

• We show that PASTA leads to considerable improvements or competitive results across three tasks: (1) Semantic Segmentation: GTAV → Cityscapes, Mapillary, BDD-100K; (2) Object Detection: Sim10K → Cityscapes; and (3) Object Recognition: VisDA-C Syn → Real, covering a total of 5 syn-to-real shifts across multiple backbones.

• We show that PASTA (1) often makes a baseline model competitive with prior state-of-the-art approaches relying on either specific architectural components, extra data, or objectives, (2) is complementary to said approaches and (3) is competitive with augmentation strategies like FACT (Xu et al., 2021) and RandAugment (Cubuk et al., 2020).

2. RELATED WORK

Domain Generalization (DG). DG involves training models on single or multiple labeled data sources to generalize well to novel test-time data sources (unseen during training). Since its inception (Blanchard et al., 2011; Muandet et al., 2013), several approaches have been proposed to tackle the problem of domain generalization. These include decomposing a model into domain-invariant and domain-specific components and utilizing the former to make predictions (Ghifary et al., 2015; Khosla et al., 2012), learning domain-specific masks for generalization (Chattopadhyay et al., 2020), using meta-learning to train a robust model by mimicking the DG problem during training (Li et al., 2018; Wang et al., 2020; Balaji et al., 2018; Chen et al., 2022; Dou et al., 2019), manipulating feature statistics to augment training data (Zhou et al., 2021; Li et al., 2022; Nuriel et al., 2021), and using models crafted based on risk-minimization formalisms (Arjovsky et al., 2019). Recently, properly tuned Empirical Risk Minimization (ERM) has proven to be a competitive DG approach (Gulrajani & Lopez-Paz, 2020), with follow-up work adopting various optimization and regularization techniques on top of ERM (Shi et al., 2021; Cha et al., 2021). Single Domain Generalization (SDG). Unlike DG, which leverages diversity across domains for generalization, SDG considers generalizing from only one source domain. Some notable approaches for SDG involve using meta-learning (Qiao et al., 2020) by treating strongly augmented versions of source images as meta-target data (exposing the model to increasingly distinct augmented views of the source data (Wang et al., 2021; Li et al., 2021)) and learning feature normalization schemes with auxiliary objectives (Fan et al., 2021). Synthetic-to-Real Generalization (syn-to-real).
Approaches specific to syn-to-real generalization in prior work (most relevant to our experimental settings) mostly consider learning specific feature normalization schemes so that predictions are invariant to style characteristics (Pan et al., 2018; Choi et al., 2021), collecting external data to inject style information (Kim et al., 2021; Kundu et al., 2021), learning to optimize for robustness (Chen et al., 2020b), leveraging strong augmentations / domain randomization (Yue et al., 2019; Kundu et al., 2021) or using contrastive techniques to aid generalization (Chen et al., 2021). To adapt to real images from synthetic data, (Chen et al., 2018b) trained Faster R-CNN in an adversarial manner, (Saito et al., 2019) leveraged an adversarial alignment loss to emphasize globally similar images, (Chen et al., 2020a) proposed a method to harmonize the transferability and discriminability of features hierarchically, and (Vibashan et al., 2021) ensured category-aware feature alignment for learning domain-invariant features. We consider three of the most commonly studied settings for syn-to-real generalization in this paper: (1) Semantic Segmentation: GTAV→Real datasets, (2) Object Detection: Sim10K→Real and (3) Object Recognition: the VisDA-C (Peng et al., 2017) dataset. To appropriately characterize the "right" synthetic data for generalization, (Mishra et al., 2021) has recently considered tailoring synthetic data for downstream generalization. Fourier Domain Generalization, Adaptation and Robustness. Prior work has considered augmenting images in the frequency domain as opposed to the pixel space. These approaches rely on the empirical observation (Oppenheim et al., 1979; Oppenheim & Lim, 1981; Piotrowski & Campbell, 1982; Hansen & Hess, 2007) that the phase component of the Fourier spectrum corresponds to the semantics of the image. PASTA is in line with this style of approach.
Closest and perhaps most relevant to our work are the approaches of Xu et al. (2021) and Yang & Soatto (2020), which perturb the amplitude spectrums of the source synthetic images. Building on top of (Xu et al., 2021), (Yang et al., 2021) adds a significance mask during the linear interpolation of amplitudes. (Huang et al., 2021b) decomposes images into multiple frequency components and only perturbs components that capture little semantic information. (Wang et al., 2022) uses an encoder-decoder architecture to obtain high- and low-frequency features, and augments the image by adding random noise to the phase of high-frequency features and to the amplitude of low-frequency features. More generally, (Yin et al., 2019) finds that perturbations in the higher-frequency domain increase the robustness of models to high-frequency image corruptions.
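To make the contrast with PASTA concrete, the two amplitude-only perturbation styles above (AJ's single uniform jitter and AM's mixup of two intra-source amplitude spectrums) can be sketched as follows. This is a minimal reading of those methods; the function names, helper structure, and default values are illustrative, not taken from the original implementations.

```python
import numpy as np

def _amp_phase(x):
    """Amplitude and phase spectrums of a single-channel image."""
    F = np.fft.fft2(x)
    return np.abs(F), np.angle(F)

def _recon(A, P):
    """Recover an image from an amplitude spectrum A and phase spectrum P."""
    return np.fft.ifft2(A * np.exp(1j * P)).real

def amplitude_jitter(x, sigma=0.1):
    """AJ: a single multiplicative jitter value eps applied uniformly to
    the whole amplitude spectrum; the phase is left untouched."""
    A, P = _amp_phase(x)
    eps = np.random.normal(1.0, sigma)   # one scalar for all frequencies
    return _recon(eps * A, P)

def amplitude_mixup(x1, x2, lam=0.5):
    """AM: mix the amplitude spectrums of two intra-source images,
    keeping the phase of the first."""
    A1, P1 = _amp_phase(x1)
    A2, _ = _amp_phase(x2)
    return _recon(lam * A1 + (1 - lam) * A2, P1)
```

Both operate on the amplitude spectrum globally: AJ uses one jitter value for every frequency, and AM requires sampling a second image; PASTA instead draws a separate jitter per spatial frequency, with variance growing toward high frequencies.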

3. METHOD

In this section, we first cover preliminaries and then describe our proposed approach. Given an image, we first apply a 2D Fast Fourier Transform (FFT) to obtain the amplitude and phase spectrums. The amplitude spectrum is then perturbed as outlined in Eqns. 4 and 5. Finally, we use the perturbed amplitude spectrum and the pristine phase spectrum to recover the augmented image by applying a 2D inverse FFT.

3.1. PRELIMINARIES

Problem Setup. The goal is to train a model on labeled source data such that it generalizes well to a target distribution(s), without access to target data during training. For our experiments, the source data is synthetic and the target data is real.

Fourier Transform. Consider a single-channel image x ∈ R^(H×W). The Fourier transform of x can be expressed as

F(x)[m, n] = Σ_{h=0}^{H−1} Σ_{w=0}^{W−1} x[h, w] exp(−2πi (hm/H + wn/W)),   (1)

where i² = −1. The inverse Fourier transform, F⁻¹(·), which maps signals from the frequency domain back to the image domain, is defined accordingly. Note that the Fourier spectrum F(x) ∈ C^(H×W). If Re(F(x)[·, ·]) and Im(F(x)[·, ·]) denote the real and imaginary parts of the Fourier spectrum, the corresponding amplitude (A(x)[·]) and phase (P(x)[·]) spectrums can be expressed as

A(x)[m, n] = √( Re(F(x)[m, n])² + Im(F(x)[m, n])² )   (2)

P(x)[m, n] = arctan( Im(F(x)[m, n]) / Re(F(x)[m, n]) )   (3)

Without loss of generality, we assume for the rest of this section that the amplitude and phase spectrums are zero-centered, i.e., the low-frequency components (low m, n) have been shifted so that the lowest frequency component is at the center. The Fourier transform and its inverse can be calculated efficiently using the Fast Fourier Transform (FFT) (Nussbaumer, 1981) algorithm. For an RGB image, we can obtain the Fourier spectrum (and A(x)[·] and P(x)[·]) independently for each channel. In the following section, we use a single-channel image as a running example to describe our proposed approach.
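In code, the quantities above can be computed with standard FFT routines. Below is a minimal NumPy sketch for a single-channel image; the function name is ours, and the zero-centering via `fftshift` follows the convention described above.

```python
import numpy as np

def amplitude_phase(x):
    """Zero-centered amplitude and phase spectrums of a single-channel
    image x (H x W), per Eqns. 1-3."""
    # 2D FFT, then shift the lowest frequency component to the center.
    F = np.fft.fftshift(np.fft.fft2(x))
    amplitude = np.abs(F)   # sqrt(Re^2 + Im^2), Eqn. 2
    phase = np.angle(F)     # arctan(Im / Re), Eqn. 3
    return amplitude, phase

# Sanity check: amplitude * exp(i * phase) recovers the original image.
x = np.random.rand(8, 8)
A, P = amplitude_phase(x)
x_rec = np.fft.ifft2(np.fft.ifftshift(A * np.exp(1j * P))).real
assert np.allclose(x, x_rec)
```

For an RGB image, the same computation is simply applied independently per channel.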

3.2. PROPORTIONAL AMPLITUDE SPECTRUM TRAINING AUGMENTATION (PASTA)

We propose to create augmented images through perturbations to the Fourier amplitude spectrum. Following prior work (Oppenheim et al., 1979; Oppenheim & Lim, 1981; Piotrowski & Campbell, 1982; Hansen & Hess, 2007), our augmentations perturb only the amplitude spectrum and leave the phase untouched to roughly preserve the semantics of the scene. The key steps involved in generating an augmented image from an original image x ∈ R^(H×W) are:

1. Translating x to the Fourier domain to obtain the Fourier, amplitude and phase spectrums (see Eqns. 1, 2, 3): F(x), A(x), P(x).

2. Perturbing only the amplitude spectrum via a perturbation function g_Λ(·) to obtain the perturbed amplitude spectrum Â(x) = g_Λ(A(x)).

3. Applying an inverse Fourier transform, F⁻¹(·), to the perturbed amplitude Â(x) and the pristine phase spectrum P(x) to obtain the augmented image.

Alternative approaches to augmenting images in the Fourier domain (Yang & Soatto, 2020; Xu et al., 2021) differ mostly in how the function g_Λ(·) operates. Across a set of synthetic source datasets, we make the important observation that synthetic images tend to have smaller variations in the high-frequency components of their amplitude spectrum than real images. Fig. 2 shows sample amplitude spectrums of images from different datasets (on the left) and the standard deviation of amplitude values for different frequency bands per dataset (on the right). We find that the variance in amplitude values for high-frequency components is significantly higher for real as opposed to synthetic data (see Fig. 2 (b)). In Sec. A.4 of the appendix, we show that this phenomenon is consistent across (1) several syn-to-real shifts and (2) fine-grained frequency-band discretizations. This phenomenon is likely a consequence of how synthetic images are rendered.
For instance, in VisDA-C, the synthetic images are viewpoint images of 3D object models (under different lighting conditions), so it is unlikely for them to be diverse in high-frequency details. For images from GTAV, synthetic rendering can lead to contributing factors such as low texture variation; for instance, "roads" (one of the head classes in semantic segmentation) in synthetic images likely have less high-frequency variation than real roads. Consequently, to generalize well to real data, we would like our augmentation strategy to expose the model to more variations in the high-frequency components of the amplitude spectrum during training. Our intent is to ensure that the learned models are invariant to variations in high-frequency components, thereby avoiding overfitting to a specific syn-to-real shift.

PASTA. This is precisely where PASTA comes in. PASTA perturbs the amplitude spectrums of images in a manner that is proportional to the spatial frequencies, i.e., higher frequencies are perturbed more than lower frequencies (see Fig. 3). For PASTA, we express g_Λ(·) as

g_Λ(A(x))[m, n] = ε[m, n] · A(x)[m, n]   (4)

where ε[m, n] ∼ N(1, σ²[m, n]),  σ[m, n] = ( 2α √(m² + n²) / √(H² + W²) )^k + β,  and Λ = {α, k, β}.   (5)

We ensure that the perturbation applied at spatial frequency (m, n) has a direct dependence on (m, n) (Eqn. 5). Λ = {α, k, β} are controllable hyper-parameters: β ensures a baseline level of jitter applied to all frequencies, and α, k govern how the perturbations grow with increasing frequency. Note that setting either α = 0 or k = 0 (removing the frequency dependence) results in a setting where σ[m, n] is the same across all (m, n). In Sec. A.4 of the appendix, we show quantitatively how applying PASTA increases the variance metric measured in Fig. 2 (b) for synthetic images across fine-grained frequency-band discretizations.
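A minimal NumPy sketch of the full augmentation, under the σ[m, n] form of Eqn. 5, is shown below. The function name `pasta`, the assumption of a [0, 1]-valued input, and the choice to share one jitter sample ε across the three channels are our implementation assumptions, not prescribed by the method.

```python
import numpy as np

def pasta(x, alpha=3.0, k=2.0, beta=0.25):
    """Sketch of PASTA for an H x W x C image x with values in [0, 1]."""
    H, W = x.shape[:2]
    # Zero-centered integer frequencies (m, n), with (0, 0) at the center.
    m = np.fft.fftshift(np.fft.fftfreq(H)) * H
    n = np.fft.fftshift(np.fft.fftfreq(W)) * W
    r = np.sqrt(m[:, None] ** 2 + n[None, :] ** 2)
    # Eqn. 5: sigma grows monotonically with sqrt(m^2 + n^2).
    sigma = (2 * alpha * r / np.sqrt(H ** 2 + W ** 2)) ** k + beta
    # Eqn. 4: one multiplicative jitter per spatial frequency (std = sigma).
    eps = np.random.normal(1.0, sigma)
    out = np.empty_like(x)
    for c in range(x.shape[-1]):  # per-channel FFT, as in Sec. 3.1
        F = np.fft.fftshift(np.fft.fft2(x[..., c]))
        A, P = np.abs(F), np.angle(F)
        F_hat = np.fft.ifftshift((eps * A) * np.exp(1j * P))
        out[..., c] = np.fft.ifft2(F_hat).real
    return np.clip(out, 0.0, 1.0)
```

With α = 0 (or k = 0), the jitter std collapses to a constant across frequencies, recovering a uniform amplitude jitter; the defaults mirror the α = 3, k = 2, β = 0.25 setting discussed in our experiments.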

4. EXPERIMENTS

We conduct our experiments across three tasks: semantic segmentation, object detection and object recognition, where we train on synthetic data so as to generalize to real-world data.

4.1 DATASETS

Semantic Segmentation. For segmentation, we consider GTAV (Richter et al., 2016) as our synthetic source dataset. GTAV consists of ∼25k driving-scene images with 19 object classes. We consider Cityscapes (Cordts et al., 2016), BDD-100K (Yu et al., 2020) and Mapillary (Neuhold et al., 2017) as our real target datasets, which contain ∼5k, ∼8k, and ∼25k finely annotated real-world driving / street-view images respectively. The 19 classes in GTAV are compatible with those of Cityscapes, BDD-100K and Mapillary. We train all our models on the training split of the source synthetic dataset and evaluate on the validation splits of the real target datasets. We report performance in terms of mIoU (mean intersection over union).

Object Detection. For object detection, we consider Sim10K (Johnson-Roberson et al., 2016) as our synthetic source dataset. Sim10K consists of ∼10k images of street scenes obtained from GTAV with ∼59k bounding boxes for cars. We consider Cityscapes (Cordts et al., 2016) as our real target dataset and consider detecting instances of the class "car" across Sim10K and Cityscapes. We use mAP@50 (mean average precision at an IoU threshold of 0.5) to report performance.

Table 1: Synthetic-to-Real generalization results for semantic segmentation and object detection. Tables 1a, 1b and 1c summarize syn-to-real generalization results for semantic segmentation (across 3 runs) on Cityscapes (C), BDD-100K (B) and Mapillary (M) when trained on GTAV (G).

Object Recognition. For object recognition, we consider the VisDA-2017 (Peng et al., 2017) image-classification benchmark. The source synthetic domain consists of 3D renderings of 12 object categories from different angles and under different lighting conditions, resulting in a total of ∼152k synthetic images.
The target real domain (the "real" val split in VisDA-C) consists of ∼55k images of the 12 classes cropped from COCO images. We split the source domain into an 80/20 train/val split (for checkpoint selection) and evaluate on the entirety of the target domain specified above. For comparisons with CSG on VisDA-C, we use the same experimental configurations as (Chen et al., 2021). We use accuracy as our evaluation metric.

4.2. MODELS, IMPLEMENTATION DETAILS AND BASELINES

Models and Implementation Details. For our semantic segmentation experiments, we consider the DeepLabv3+ (Chen et al., 2018a) architecture with backbones ResNet-50 (R-50) (He et al., 2016), ResNet-101 (R-101) (He et al., 2016) and MobileNetv2 (MN-v2) (Sandler et al., 2018) (see Tables 1a, 1b and 1c). Unlike Table 1a (R-50), for Tables 1b (R-101) and 1c (MN-v2), we downsample source GTAV images to a resolution of 1024×560 for faster training (due to limited computational resources). As augmentation baselines, we consider (1) RandAugment (Cubuk et al., 2020) and (2) the setting where α = 0 in PASTA, to assess the extent to which a monotonic increase in σ[m, n] makes a difference. For segmentation and detection, we consider only the pixel-level / photometric transforms in the RandAugment vocabulary. For our object recognition experiments, we consider the entire RandAugment vocabulary.

5. RESULTS

5.1. SYNTHETIC-TO-REAL GENERALIZATION RESULTS

Our semantic segmentation results are summarized in Tables 1a, 1b and 1c, object detection results in Table 1d and object recognition results in Tables 2 and 3.

PASTA consistently improves performance. We observe that PASTA offers consistent improvements for all three tasks and the considered synthetic-to-real shifts.

- Semantic Segmentation. From Tables 1a, 1b and 1c, we observe that PASTA significantly improves the performance of a baseline model (∼16%, ∼10% and ∼11% mIoU for R-50, R-101 and MN-v2 respectively) and outperforms RandAugment (Cubuk et al., 2020) by a significant margin: 9.3% for R-50 (Table 1a, rows 2, 4). Additionally, we find that PASTA makes the baseline DeepLabv3+ model improve over existing approaches that use either extra data, modeling components, or objectives by a significant margin. For instance, the Baseline + PASTA rows across Tables 1a, 1b and 1c outperform IBN-Net, ISW, DRPC (Yue et al., 2019), ASG (Chen et al., 2020b) and CSG (Chen et al., 2021). For WEDGE (Kim et al., 2021), FSDR (Huang et al., 2021a) and WildNet (Lee et al., 2022), we find that Baseline + PASTA outperforms only for R-50 and not for R-101. We note that DRPC, ASG, CSG, WEDGE, FSDR and WildNet use either more data (the entirety of GTAV or additional datasets) or different base architectures, making these comparisons unfair to PASTA. We also note that for our experiments using R-101 and MN-v2, due to limited computational resources, we downsample GTAV images to a lower resolution while training. This slightly hurts PASTA performance relative to experiments run at the original resolution.

PASTA is complementary to existing generalization methods. In addition to ensuring that a baseline model is competitive with or significantly improves over existing methods, PASTA can also complement existing generalization methods. For semantic segmentation, from Tables
1a, 1b and 1c, we find that applying PASTA to IBN-Net and ISW significantly improves performance (see the IBN-Net, IBN-Net + PASTA and ISW, ISW + PASTA sets of rows), with improvements ranging from ∼6-9% mIoU for IBN-Net and ∼5-7% mIoU for ISW across different backbones. For object recognition, in Table 3, we apply PASTA to CSG (Chen et al., 2021), one of the state-of-the-art generalization methods on VisDA-C. Since CSG inherently uses RandAugment, we consider both settings where PASTA is applied to CSG with and without RandAugment. In both cases, applying PASTA improves performance.

PASTA vs. Frequency-based Augmentation Strategies. Closest to our work is FACT (Xu et al., 2021), where the authors employ Amplitude Mixup (AM) and Amplitude Jitter (AJ) as augmentations in a broader training pipeline. In our experimental settings (semantic segmentation using an R-50 DeepLabv3+ baseline), we find that PASTA outperforms AM and AJ: 41.90% (PASTA) vs. 39.70% (AM) vs. 30.70% (AJ) mIoU on real datasets (single run, trained at 1024×560). Since FACT was designed specifically for multi-source domain generalization settings (not necessarily in a syn-to-real context), we also compare directly with FACT on PACS (a multi-source DG setup for object recognition) by replacing the augmentation pipeline in FACT with PASTA. We find that FACT-PASTA outperforms FACT-Vanilla: 87.97% vs. 87.10% average leave-one-out domain accuracy. We further note that PASTA is also competitive (within 1%) with FSDR (Huang et al., 2021a), another frequency-space domain randomization method, which uses extra data in its training pipeline (see Table 1b).

5.2. ANALYZING PASTA

Sensitivity of PASTA to α, k and β. We discuss how sensitive generalization performance is to the choice of α, k and β in PASTA and attempt to provide general guidelines and pitfalls for selecting them.
We note the following:

- The same choice of α, k and β leads to consistent improvements across multiple tasks (object detection and semantic segmentation). We find that the same set of values, α ∈ {0, 3}, k = 2, β ∈ {0.25, 0.5}, offers significant improvements across (1) both the tasks of detection and segmentation and (2) the datasets {GTAV, Sim10K} and VisDA-C (Tables 1a, 1b, 1c, 1d vs. 2, 3). This is not entirely surprising since VisDA-C and {GTAV, Sim10K} images differ significantly: GTAV includes synthetic street views, while VisDA-C includes different viewpoints of 3D CAD models of objects.

- Appropriate selection of α and k is more important than that of β. In Fig. 4, we evaluate an R-50 model trained on the synthetic VisDA-C source on the real (validation) target split for different values of α, k and β and report generalization performance. We find that performance is quite sensitive to the choice of k: performance increases slightly with increasing k and then drops significantly. This holds across all values of β. Furthermore, increasing α for a fixed value of k results in significant drops. Qualitatively, we observe that for high α and k the augmented images have their semantic content significantly occluded, resulting in poor generalization.

Is PASTA complementary to other augmentation strategies? We first check if PASTA is complementary to RandAugment. To assess this, we conduct experiments with a vanilla R-50 CNN on VisDA-C using both PASTA and RandAugment. We find that, in terms of absolute values, PASTA + RandAugment outperforms RandAugment but not PASTA, though all performances are within 1% of each other. We notice that combining the best PASTA setting with the best RandAugment setting leads to very strong augmentations, and further note that reducing the strength of PASTA in the combination leads to improved results.
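Under the σ[m, n] form of Eqn. 5, the sensitivity to k has a simple arithmetic reading: at radii where 2α√(m² + n²)/√(H² + W²) is below 1, the jitter shrinks as k grows, while above that radius it grows geometrically, concentrating ever larger perturbations in the highest frequencies. A quick illustrative check (α = 3 and β = 0.25 as in our experiments; the three sampled relative radii are arbitrary):

```python
# sigma (Eqn. 5) at three relative radii rel = r / sqrt(H^2 + W^2),
# for alpha = 3, beta = 0.25 and growing k.
alpha, beta = 3.0, 0.25
for k in (1, 2, 4, 8):
    sigmas = [(2 * alpha * rel) ** k + beta for rel in (0.1, 0.25, 0.5)]
    print(f"k={k}: " + ", ".join(f"{s:.2f}" for s in sigmas))
# Jitter at low frequencies (rel = 0.1) shrinks with k, while jitter at
# the highest frequencies (rel = 0.5) grows rapidly.
```

This is consistent with the qualitative observation above: for large k (and large α), the extreme high-frequency jitter is strong enough to occlude semantic content.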
On semantic segmentation, we find that a simple combination of just RGB Gaussian noise and color jitter itself leads to significant improvements in generalization (an mIoU of 40.96%; single run, trained at 1024×560), with Gaussian noise likely adding more high-frequency components and color jitter likely perturbing low-frequency components. PASTA, on the other hand, is a more fine-grained method that perturbs all frequencies in a structured manner, which is likely why it outperforms this combination. To summarize, across our extensive set of experiments, we observe that PASTA serves as a simple and effective augmentation strategy that (1) ensures a baseline method is competitive with existing generalization approaches, (2) often improves performance over existing generalization methods and (3) is complementary to existing generalization methods.

Limitations. Even though PASTA leads to strong generalization results by relying on the observation that synthetic images can be lacking in high-frequency variations, akin to any augmentation strategy, it is not without its limitations. First, PASTA has only been tested on the syn-to-real generalization settings explored in this work, and it is an important step for future work to assess the extent to which the improvements offered by PASTA translate to more such settings. Second, as observed in Sec. 5.2, PASTA is quite sensitive to the choice of k, with only a small subset of values of α, k and β defining beneficial levels of augmentation.

6. CONCLUSION

We propose Proportional Amplitude Spectrum Training Augmentation (PASTA), an augmentation strategy for synthetic-to-real generalization. PASTA is motivated by the observation that amplitude spectra are less diverse in synthetic than in real data, especially for high-frequency components. Thus, PASTA augments synthetic data by perturbing the amplitude spectra, with perturbation magnitudes increasing for higher frequencies. We show that PASTA offers strong out-of-the-box generalization performance on semantic segmentation, object detection, and object classification tasks. The strong performance of PASTA holds whether it is used alone (i.e., training with ERM on PASTA-augmented images) or together with alternative generalization / augmentation algorithms. We would like to emphasize that the strength of PASTA lies in its simplicity and effectiveness, offering strong improvements despite not using extra modeling components, objectives, or extra data. We hope that future research endeavors in syn-to-real generalization take domain randomization techniques like PASTA into account.

7. ETHICS STATEMENT

While simulation offers the promise of more diverse synthetic training data, this is predicated on the assumption that simulators will be designed to produce the diversity of data needed to represent the world. Most existing work on developing simulated visuals focuses on diversity in factors such as weather, lighting, and sensor choice. In order for syn-to-real transfer to be effective for the world population, we will also need to consider the design and collection of synthetic data that adequately represents all sub-populations.

8. REPRODUCIBILITY STATEMENT

For all our experiments surrounding PASTA and associated baselines, we provide details surrounding the datasets and the training, validation and test splits in Sec. 4 of the main paper and Sec. A.1 of the appendix. Sec. A.1 also provides details surrounding the choice of hyper-parameters, optimization, and model selection criteria for all three tasks of semantic segmentation, object detection, and object recognition. Sec. A.6 of the appendix provides details about the frameworks and base code repositories used for our experiments. We will release our code and data upon acceptance.

A APPENDIX

A.1 IMPLEMENTATION DETAILS

Semantic Segmentation. For semantic segmentation, we use the DeepLabv3+ architecture with backbones ResNet-50 (R-50) (He et al., 2016), ResNet-101 (R-101) (He et al., 2016) and MobileNetv2 (MN-v2) (Sandler et al., 2018) (see Tables 1a, 1b, and 1c in the main paper). In our experiments using R-101 and MN-v2, we resize the source GTAV images to a resolution of 1024×560 for faster training under limited computational resources. We adopt the hyper-parameter (and distributed training) settings used by (Choi et al., 2021) for training. We train ResNet-50, ResNet-101 and MobileNetv2 based models across 4, 4 and 2 GPUs respectively in a distributed manner (similar to (Choi et al., 2021)). This includes the use of SGD as an optimizer with an initial learning rate of 10⁻² and a momentum of 0.9. Similar to (Choi et al., 2021), we also use a polynomial learning rate schedule (Liu et al., 2015) with a power of 0.9. Our models are initialized with ImageNet (Krizhevsky et al., 2012) pre-trained weights. We train all our models for 40k iterations with a batch size of 16 for GTAV. Our segmentation models are trained on the train split of GTAV and evaluated on the validation splits of the target datasets (Cityscapes, BDD-100K and Mapillary). For segmentation, PASTA is applied with a base set of positional and photometric augmentations (PASTA first and then the base augmentations): Gaussian blur, color jitter, random crop, random horizontal flip and random scaling.
We use PyTorch (Paszke et al., 2019) implementations of these augmentations. For RandAugment (Cubuk et al., 2020), we only consider the vocabulary of photometric augmentations for segmentation. We conduct ablations (within computational constraints) for the best performing RandAugment setting using R-50 for syn-to-real generalization and find that the best performance is achieved when 8 augmentations are sampled from the vocabulary and applied at the highest severity level (30). Whenever we train a prior generalization approach, ISW (Choi et al., 2021) or IBN-Net (Pan et al., 2018), we follow the same set of hyper-parameter configurations as (Choi et al., 2021). All models are trained across 3 random seeds.

Object Detection.

Object Recognition. For object recognition, we consider two settings: (1) training a vanilla model (see Table 2, main paper) and (2) checking whether PASTA is complementary to CSG (Chen et al., 2021) when trained with a ResNet-101 backbone (see Table 3, main paper). For (1), we split the synthetic source data into training and validation splits in an 80/20 ratio for checkpoint selection. We use SGD without momentum as an optimizer with a learning rate of 10⁻⁴, a weight decay of 10⁻⁵ and a batch size of 64. We train these models for 10 epochs. Our models are initialized with ImageNet (Krizhevsky et al., 2012) pre-trained weights. We use PyTorch (Paszke et al., 2019) implementations of color jitter, random resized crop and random horizontal flip as the base set of augmentations. For (2), we follow the same hyper-parameter configurations adopted by (Chen et al., 2021).

Table: Per-class generalization results when semantic segmentation models trained on GTAV are evaluated on Cityscapes. The base model is DeepLabv3+ (R-50 backbone). Results are reported across 3 runs. * indicates numbers drawn directly from published manuscripts. k = 2 and β = 0.25 for PASTA (α = 3); β = 0.5 for PASTA (α = 0). Class headers are in decreasing order of pixel frequency.
This includes the use of an SGD optimizer (momentum 0.9) with a learning rate of 10^-4, weight decay of 5×10^-4, and a batch size of 32. These models are trained for 30 epochs. CSG (Chen et al., 2021) also uses RandAugment (Cubuk et al., 2020) as an augmentation. For both (1) and (2), models are trained on single GPUs across 3 random seeds.
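As a concrete reference for the segmentation optimization setup in Sec. A.1 above (initial learning rate 10^-2, polynomial schedule with power 0.9, 40k iterations), a minimal sketch of polynomial learning-rate decay; the function name and exact decay form are illustrative and frameworks may differ in details such as warmup or a minimum LR:

```python
def poly_lr(base_lr, it, max_iters, power=0.9):
    """Polynomial decay: lr = base_lr * (1 - it / max_iters) ** power."""
    return base_lr * (1.0 - it / max_iters) ** power

# Example: initial LR 1e-2 over 40k iterations (the segmentation setting above).
lrs = [poly_lr(1e-2, it, 40_000) for it in (0, 20_000, 39_999)]
```

With power 0.9 the schedule decays slightly slower than linearly, reaching zero only at the final iteration.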

A.2 SYNTHETIC-TO-REAL GENERALIZATION RESULTS

Overall GTAV→Real Generalization Results. We first note that using RandAugment (Cubuk et al., 2020) for the Baseline DeepLabv3+ model with ResNet-101 and MobileNetv2 backbones leads to overall syn-to-real generalization performances of 38.28±0.44 (vs. 42.01±0.26 for PASTA) and 30.47±1.22 (vs. 37.71±0.54 for PASTA) respectively. Overall, across MN-v2, R-50, and R-101, we observe that the gap between RandAugment and PASTA shrinks as the number of parameters in the backbone increases. In Tables 1a, 1b, and 1c of the main paper, we present overall synthetic-to-real generalization results when models trained on GTAV are evaluated on Cityscapes, BDD-100K, and Mapillary. As stated in Sec. 5.1 of the main paper, for ResNet-50 and ResNet-101 (Tables 1a and 1b of the main paper), comparisons with DRPC (Yue et al., 2019), ASG (Chen et al., 2020b), CSG (Chen et al., 2021), WEDGE (Kim et al., 2021), FSDR (Huang et al., 2021a), and WildNet (Lee et al., 2022) are not entirely fair to PASTA since these approaches use either more data or different base architectures for training. For instance, WEDGE (Kim et al., 2021) and CSG (Chen et al., 2021) use DeepLabv2, ASG (Chen et al., 2020b) uses FCNs, DRPC (Yue et al., 2019) uses the entirety of GTAV (not just the training split), and WEDGE uses ∼5k extra Flickr images in its overall pipeline. FSDR (Huang et al., 2021a) is another such approach that, when trained with a ResNet-101 backbone, achieves an overall mIoU of 43.13% (within 1% of Baseline / IBN-Net / ISW + PASTA) on real datasets. However, FSDR also uses FCNs and the entirety of GTAV for training. FSDR (Huang et al., 2021a) and WildNet (Lee et al., 2022) also use extra ImageNet (Krizhevsky et al., 2012) images for stylization / randomization. For FSDR, the first step in the pipeline also requires access to SYNTHIA (Ros et al., 2016). Per-class GTAV→Real Generalization Results.
Tables 5, 6, and 7 include per-class synthetic-to-real generalization results when a DeepLabv3+ (R-50 backbone) model trained on GTAV is evaluated on Cityscapes, BDD-100K, and Mapillary respectively. We briefly discuss in the main paper how PASTA (α ≥ 0) improves performance across several classes for GTAV→Real. For GTAV→Cityscapes (see Table 5), we find that Baseline + PASTA consistently improves over Baseline and RandAugment. For IBN-Net and ISW in this setting, we observe consistent improvements (except for the classes terrain and fence). For GTAV→BDD-100K (see Table 6), we find that for the Baseline, while PASTA outperforms RandAugment on the majority of classes, both are fairly competitive and outperform the vanilla Baseline approach. For IBN-Net and ISW, PASTA (α = 3) almost always outperforms the vanilla approaches (except for the class wall). For GTAV→Mapillary (see Table 7), for the Baseline, we find that PASTA (α ≥ 0) outperforms the vanilla approach and RandAugment. For IBN-Net and ISW, PASTA (α = 3) outperforms the vanilla approaches with the exception of the classes train and fence. Overall SYNTHIA→Real Generalization Results. We conducted additional experiments using SYNTHIA (Ros et al., 2016) as the source domain and Cityscapes, BDD-100K, and Mapillary as the target domains, with the same set of hyper-parameters for PASTA (α = 3, k = 2, β = 0.25).

Table 6: GTA5→BDD-100K per-class generalization results. Per-class IoU comparisons for syn-to-real generalization results when semantic segmentation models trained on GTAV are evaluated on BDD-100K. Base model involved is DeepLabv3+ (R-50 backbone). Results are reported across 3 runs. * indicates drawn directly from published manuscripts. k = 2 and β = 0.25 for PASTA (α = 3). β = 0.5 for PASTA (α = 0). Class headers are in decreasing order of pixel frequency.
For a baseline DeepLabv3+ model (R-101), we find that PASTA (1) provides strong improvements over the vanilla baseline (31.77% mIoU, a +3.91% absolute improvement) and (2) is competitive with RandAugment (32.30% mIoU). More generally, we find that syn-to-real generalization performance is worse when SYNTHIA is used as the source domain as opposed to GTAV; for instance, ISW (Choi et al., 2021) achieves an average mIoU of 31.07% (SYNTHIA) as opposed to 35.58% (GTAV). SYNTHIA has significantly fewer images than GTAV (9.4k vs 25k), which likely contributes to the relatively worse generalization performance. PASTA and Base Augmentations. As stated in Sec. A.1, PASTA is applied with some consistent color and positional augmentations. To understand whether PASTA alone leads to any improvements, we trained a baseline DeepLabv3+ model (R-50) without these augmentations. Starting from the settings reported in Table 1a, Row 4 of the main paper plus training at a resolution of 1024 × 560, average performance on real datasets drops (1) from 41.90% to 40.37% mIoU when the photometric augmentations (color jitter and gaussian blur) are removed and (2) further to 40.25% mIoU when both positional and photometric augmentations are removed. Therefore, applying PASTA without any base augmentations still yields performance significantly above a vanilla baseline (40.25% vs 26.99% mIoU).

A.3 PASTA ANALYSIS

In this section, we provide further discussion of different aspects of PASTA. Functional form of PASTA. We now discuss the specific functional form of the perturbations introduced by PASTA as per Eqns. 4 and 5 in the main paper. Eqn. 5 is constructed to ensure we have a multiplicative jitter interaction. Unlike the multiplicative interaction, an additive jitter in the Fourier space, with the perturbation sampled from a gaussian distribution, can be expressed as a gaussian jitter perturbation in RGB space, which is easier to attain by simply adding gaussian noise to the images. For Eqn. 5, we first considered uniform perturbations (α = 0) independent of the spatial frequencies. As a natural extension based on the observations made in Section 3.2 of the main paper, we considered a linear dependence on spatial frequencies, hence the inclusion of α. The inclusion of k decides how much attention we pay to the high-frequency bands compared to the low-frequency ones. For fixed α, since the frequency-dependent term being exponentiated in Eqn. 5 is normalized (2√(m² + n²)/√(H² + W²) ∈ [0, 1]), increasing k perturbs the lower frequency components less. PASTA and Amplitude Mixup (AM). We also compare PASTA with the Amplitude Mixup (AM) technique proposed in (Xu et al., 2021). Amplitude Mixup perturbs the amplitude spectrum of the image of concern by taking a convex combination with the amplitude spectrum of another "mixup" image drawn from the same source data. For a DeepLabv3+ architecture with an R-50 backbone trained on GTAV (single run at a resolution of 1024 × 560), AM achieves an mIoU of 39.70% (vs 41.90% for PASTA) on the real datasets. We further observe that for AM, performance is unaltered even if we reduce the set of mixup images to a very small set (10 randomly selected images from GTAV) instead of sampling one for every source image.
We observe that this variant of AM also achieves an mIoU of 39.70% (single run). Overall, while AM underperforms compared to PASTA, the pipeline involved in AM, sampling "mixup" images, is also overkill. PASTA and Amplitude Jitter (AJ). We also compare PASTA with the Amplitude Jitter (AJ) technique considered in (Xu et al., 2021). Amplitude Jitter perturbs the amplitude spectrum with a single jitter value ϵ for every spatial frequency and channel; ϵ is a multiplicative factor sampled from a gaussian distribution centered at 1. We use AJ with ϵ ∼ N(1, 0.5) for our experiments with DeepLabv3+ (R-50). We find that AJ achieves an average mIoU of 30.70% (single run trained at a resolution of 1024 × 560) on the real datasets, which is significantly lower than PASTA. Qualitatively, we observe that multiplying the entire amplitude spectrum by a constant value results in a uniform change of brightness in the augmented image. If we sample ϵ separately per channel, the performance of this AJ variant improves to an mIoU of 32.83% (single run trained at a resolution of 1024 × 560). Here, we observe that multiplying the amplitude spectrum channel-wise by three distinct constants leads to one of the image channels dominating over the others, resulting in a red-ish, green-ish, or blue-ish hue in the augmented images. Finally, if we consider PASTA (α = 0, β = 0.5), a more fine-grained setting where ϵ is sampled per spatial frequency and per channel, generalization performance improves significantly to 39.19% mIoU. Therefore, approaches along the lines of AJ lead to better generalization when implemented in a fine-grained manner, as in PASTA. Constrained PASTA (α = 0). Note that sampling an ϵ per spatial frequency creates a jitter image per channel, which is then multiplied element-wise with the original amplitude spectrum to perturb it.
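The three jitter granularities contrasted above (scalar AJ, per-channel AJ, and the per-frequency jitter images of constrained PASTA (α = 0)) can be sketched as follows; this is an illustrative NumPy sketch, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter_amplitude(amp, mode="per_frequency", std=0.5):
    """Multiplicative amplitude jitter at three granularities.

    amp: (C, H, W) amplitude spectrum. 'scalar' is AJ (one epsilon shared by
    all frequencies and channels), 'per_channel' samples one epsilon per
    channel, and 'per_frequency' samples one epsilon per channel and spatial
    frequency, i.e., a random "jitter image" per channel, corresponding to
    the constrained PASTA (alpha = 0) setting discussed above.
    """
    C, H, W = amp.shape
    if mode == "scalar":
        eps = rng.normal(1.0, std)
    elif mode == "per_channel":
        eps = rng.normal(1.0, std, size=(C, 1, 1))
    elif mode == "per_frequency":
        eps = rng.normal(1.0, std, size=(C, H, W))
    else:
        raise ValueError(mode)
    return amp * eps
```

The qualitative effects reported above follow directly from the granularity: a single shared epsilon rescales the whole spectrum (a brightness change), per-channel epsilons rescale channels unequally (a color cast), while per-frequency epsilons vary the perturbation across the spectrum.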
Similar to the Amplitude Mixup (AM) set of experiments, we also consider a setting where we sample these jitter images from a fixed set, instead of sampling a new one for every source image. We create jitter image sets (each jitter image randomly sampled) of sizes 1, 2, 5, and 10 and observe that for DeepLabv3+ (R-50) (single runs trained at a resolution of 1024 × 560) we achieve mIoUs of 39.81%, 40.33%, 38.78%, and 39.16% respectively on the real datasets. Therefore, even if we severely restrict the source of randomness / variations in ϵ for PASTA (α = 0), we do not observe a significant loss in generalization performance. Comparison with Fourier Domain Adaptation (FDA) (Yang & Soatto, 2020). FDA is a recent approach for syn-to-real domain adaptation and naturally requires access to unlabeled target data. In FDA, the low-frequency bands of the amplitude spectra of source images are replaced with those of the target, essentially mimicking a cheap style transfer operation. Since we do not assume access to target data in our experimental settings, a direct comparison is not possible. Instead, we consider a proxy task where we intend to generalize to real datasets (Cityscapes, BDD-100K, Mapillary) by assuming additional access to 6 real-world street view images under different weather conditions (for style transfer) -sunny day, rainy day, cloudy day, etc. -in addition to synthetic images from GTAV. We find that FDA in this setting achieves an mIoU of 33.04% (vs 26.99% for Baseline and 41.90% for Baseline + PASTA (α = 3, k = 2)) for DeepLabv3+ with an R-50 backbone trained at a resolution of 1024 × 560.

A.5 QUALITATIVE EXAMPLES

We include qualitative segmentation predictions for the approaches considered in Table 1a of the main paper in Fig. 10, 12, and 14 respectively when different augmentations are applied (RandAugment and PASTA). The Cityscapes images we show predictions on were selected randomly. We include RandAugment predictions only for the Baseline.
To get a better sense of the kinds of mistakes made by different approaches, we also include the differences between the predictions and ground-truth segmentation masks in Fig. 11, 13, and 15 (ordered accordingly for easy reference). The difference images show the predicted classes only for pixels where the prediction differs from the ground truth.
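Returning to the FDA comparison in Sec. A.3 above, the low-frequency amplitude swap at its core can be sketched for a single-channel image as follows; the `band` parameter and the square swap region are illustrative simplifications, and FDA's original formulation differs in details:

```python
import numpy as np

def fda_swap(src, tgt, band=0.1):
    """Replace the low-frequency amplitude band of `src` with that of `tgt`,
    keeping the source phase (a minimal FDA-style sketch for 2D arrays).

    band: half-width of the swapped central square as a fraction of the
    smaller image dimension; larger values transfer more target "style".
    """
    fs, ft = np.fft.fft2(src), np.fft.fft2(tgt)
    amp_s = np.fft.fftshift(np.abs(fs))  # center the low frequencies
    amp_t = np.fft.fftshift(np.abs(ft))
    pha_s = np.angle(fs)

    H, W = src.shape
    b = int(min(H, W) * band)
    ch, cw = H // 2, W // 2
    amp_s[ch - b:ch + b + 1, cw - b:cw + b + 1] = \
        amp_t[ch - b:ch + b + 1, cw - b:cw + b + 1]

    # Recombine swapped amplitude with the original source phase.
    amp_s = np.fft.ifftshift(amp_s)
    return np.real(np.fft.ifft2(amp_s * np.exp(1j * pha_s)))
```

Because only the amplitude around the DC component is exchanged, the source image's semantic layout (carried largely by phase) is preserved while low-level statistics move toward the target.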

A.6 ASSETS LICENSES

The assets used in this work can be grouped into three categories -Datasets, Code Repositories and Dependencies. We include the licenses of each of these assets below. Datasets. We used the following publicly available datasets in this work -GTAV (Richter et al., 2016), Cityscapes (Cordts et al., 2016), BDD-100K (Yu et al., 2020), Mapillary (Neuhold et al., 2017), and VisDA-C (Peng et al., 2017). For GTAV, the codebase used to extract data from the original GTAV game is distributed under the MIT license. 5 The license agreement for the Cityscapes dataset dictates that the dataset is made freely available to academic and non-academic entities for non-commercial purposes such as academic research, teaching, scientific publications, or personal experimentation.



In Fig. 2(b), for every image, upon obtaining the amplitude spectrum, we first take an element-wise logarithm. Then, for a particular pre-defined frequency band, we compute the standard deviation of amplitude values within that band (across all the channels). Finally, we average these standard deviations across images and report them in the bar plots.
2 When PASTA is applied, we find that performance on "road" increases by a significant margin (see per-class generalization results in Sec. A.2 of the appendix). In Sec. A.2 of the appendix, we show how PASTA improves across several classes for the considered shifts. We discuss these experiments in Sec. A.3 of the appendix.
5 https://bitbucket.org/visinf/projects-2016-playing-for-data/src/master/



Figure 1: PASTA augmentation samples. Examples of images from different synthetic datasets when augmented using PASTA and RandAugment (Cubuk et al., 2020). Row 1 includes examples from GTAV and row 2 from VisDA-C.

Figure 2: Amplitude spectrum characteristics. (a) Sample amplitude spectrums (lowest frequency at the center) for one channel of synthetic and real images. Note that the amplitude spectrums tend to follow a specific pattern: statistics of natural images have been found to exhibit the property that amplitude values follow an inverse power law w.r.t. frequency (Burton & Moorhead, 1987; Tolhurst et al., 1992), i.e., roughly, the amplitude at frequency f satisfies A(f) ∝ 1/f^γ for some empirically determined γ. (b) Variations in amplitude values across images. Synthetic images have less variance in the high-frequency components of the amplitude spectrum compared to real images.

Figure 3: PASTA. The figure outlines the augmentation pipeline involved in PASTA. Given an image, we first obtain its amplitude and phase spectrums via the Fourier transform, perturb the amplitude spectrum (perturbing high-frequency components relatively more than low-frequency ones), and apply the inverse transform to obtain the augmented image.

Figure 4: Ablating α, β, k in PASTA. We train a vanilla ResNet-50 on VisDA-C source by applying PASTA and varying α, k and β within the sets {0, 3, 5, 7, 9}, {2, 4, 6, 8} and {0.25, 0.5, 0.75, 1.0} respectively. The trained models are evaluated on the target (validation) real data with class-balanced accuracy as the metric.


Figure 9: PASTA augmentation samples. Examples of images from different synthetic datasets when augmented using PASTA and RandAugment (Cubuk et al., 2020). Rows 1-3 include examples from GTAV and rows 4-6 from VisDA-C.

Figure 11: GTAV→Cityscapes Baseline segmentation prediction diffs. Differences between prediction and ground truth for predictions made on randomly selected Cityscapes validation images by a Baseline DeepLabv3+ model (R-50 backbone) trained on GTAV synthetic images. The first two columns indicate the original image and the associated ground truth and rest indicate the considered approaches.

Figure 13: GTAV→Cityscapes IBN-Net (Pan et al., 2018) segmentation prediction diffs. Differences between prediction and ground truth for predictions made on randomly selected Cityscapes validation images by IBN-Net (DeepLabv3+ model with R-50 backbone) trained on GTAV synthetic images. The first two columns indicate the original image and the associated ground truth and rest indicate the considered approaches.

Table 1d summarizes syn-to-real generalization results for object detection on Cityscapes (C) when models are trained on Sim10K (S). * indicates numbers drawn from published manuscripts. † indicates training with downsampled 1024 × 560 images due to restricted compute. k = 2 and β = 0.25 for PASTA (α = 3). β = 0.5 for PASTA (α = 0). Rows in gray font use different base architectures and / or extra data for training and are included primarily for completeness (drawn directly from published manuscripts). Bold indicates best and underline indicates second best.

Table 2: VisDA-C (ResNet-50) generalization. A vanilla ResNet-50 CNN trained (3 runs) on the synthetic source data of VisDA-C is evaluated on the real (val split) target data of VisDA-C. k = 3 and β = 0.875. I.D. and O.O.D. are regular in- and out-of-domain accuracies. O.O.D. (Bal.) is class-balanced accuracy on out-of-domain data. Bold indicates best and underline indicates second best.

(e.g., the performance in Table 1a, row 4 drops from 43.81 to 41.90 mIoU when downsampling during training). In Sec. A.2 of the appendix, we also show that for the SYNTHIA → Real shift, (1) PASTA provides strong improvements over the vanilla baseline (+3.91% absolute improvement) and (2) is competitive with RandAugment. -Object Detection. For object detection (see Table 1d), we compare PASTA to RandAugment and Photometric Distortion (PD) and find that PASTA (1) improves over the Baseline (by ∼17% and ∼12% for R-50 and R-101 respectively) and (2) improves over PD and RandAugment for R-50 and is competitive with RandAugment for R-101. More interestingly, for R-101, we find that not only does PASTA improve performance on real data but it also significantly outperforms the state-of-the-art adaptive object detection method ILLUME (Khindkar et al., 2022), a method that has access to target data at training time! -Object Recognition. For object recognition (see Tables 2 and 3), we find that while PASTA improves over a vanilla baseline, the offered improvements are competitive with those of RandAugment. From rows 1, 2, and 4 in Table 2, we can observe that while both RandAugment and PASTA outperform the Baseline, RandAugment and PASTA (α = 4) are competitive when we look at best syn-to-real generalization performance. In Table 3 (rows 8, 9, 10), we can observe that CSG is competitive with versions where RandAugment in CSG is replaced with PASTA.

Table 3: VisDA-C (ResNet-101) generalization. We apply PASTA to CSG (Chen et al., 2021). Since CSG inherently uses RandAug, we also report results with and without the use of RandAug when PASTA is applied. * indicates drawn directly from published manuscripts. We report class-balanced accuracy on the real (val split) target data of VisDA-C. k = 1 and β = 0.5 for PASTA. Results are reported across 3 runs. Bold indicates best and underline indicates second best.

This appendix is organized as follows. In Sec. A.1, we first expand on implementation and training details from the main paper. Then, in Sec. A.2, we provide per-class synthetic-to-real generalization results (see Sec. 5.1 of the main paper). Sec. A.3 includes additional discussion of different aspects of PASTA. Sec. A.4 includes additional analysis of the amplitudes across multiple frequency bands and datasets. Next, Sec. A.5 contains more qualitative examples of PASTA augmentations and predictions for semantic segmentation. Finally, Sec. A.6 summarizes the licenses associated with the different assets used in our experiments.

A.1 IMPLEMENTATION AND TRAINING DETAILS

In this section, we outline our training and implementation details for each of the three tasks: Semantic Segmentation, Object Detection, and Object Recognition. We also summarize these details in Tables 4a, 4b, and 4c. Semantic Segmentation (see Table 4a). As stated in Sec. 4.2 of the main paper, for our semantic segmentation experiments, we consider the DeepLabv3+ (Chen et al., 2018a) architecture with ResNet-50 (R-50), ResNet-101 (R-101), and MobileNetv2 (MN-v2) backbones.

Object Detection (see Table 4b). For object detection, we consider the Faster-RCNN (Ren et al., 2015) architecture with ResNet-50 and ResNet-101 backbones (see Table 1d in the main paper). We train on the entirety of Sim10K (Johnson-Roberson et al., 2016) (source dataset) for 10k iterations and pick the last checkpoint for Cityscapes (target dataset) evaluation. We use SGD with momentum as our optimizer with an initial learning rate of 10^-2 (adjusted according to a step learning rate schedule) and a batch size of 32. Our models are initialized with ImageNet (Krizhevsky et al., 2012) pre-trained weights. All models are trained on 4 GPUs. For detection, we also compare PASTA against RandAugment (Cubuk et al., 2020) and Photometric Distortion (PD). The sequence of operations in PD to augment input images is: randomized brightness, randomized contrast, RGB→HSV conversion, randomized saturation & hue changes, HSV→RGB conversion, randomized contrast, and randomized channel swap. Object Recognition (see Table 4c). As stated earlier, for object recognition, we consider two sets of experiments (see Tables 2 and 3 in the main paper): (1) where we train a baseline ResNet-50 CNN from scratch with different augmentation strategies (see Table 2, main paper) and (2) where we check whether PASTA is complementary to CSG when trained with a ResNet-101 backbone (see Table 3, main paper).

Implementation & Optimization Details. We summarize details surrounding dataset, training, optimization and model selection criteria for our semantic segmentation, object detection and object recognition experiments.

Table 5: GTA5→Cityscapes per-class generalization results. Per-class IoU comparisons for syn-to-real generalization results when semantic segmentation models trained on GTAV are evaluated on Cityscapes. Base model involved is DeepLabv3+ (R-50 backbone). Results are reported across 3 runs. * indicates drawn directly from published manuscripts. k = 2 and β = 0.25 for PASTA (α = 3). β = 0.5 for PASTA (α = 0). Class headers are in decreasing order of pixel frequency.

Table 7: GTA5→Mapillary per-class generalization results. Per-class IoU comparisons for syn-to-real generalization results when semantic segmentation models trained on GTAV are evaluated on Mapillary. Base model involved is DeepLabv3+ (R-50 backbone). Results are reported across 3 runs. * indicates drawn directly from published manuscripts. k = 2 and β = 0.25 for PASTA (α = 3). β = 0.5 for PASTA (α = 0). Class headers are in decreasing order of pixel frequency.


Algorithm 1 Pseudocode of PASTA in a PyTorch-like style. 
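The pseudocode figure itself is not reproduced here; the following is a self-contained NumPy sketch of the same pipeline. The paper's version is PyTorch-based, and the jitter-strength expression sigma = beta * (1 + alpha * t**k) reflects our reading of Eqn. 5 (it recovers uniform jitter of strength beta at alpha = 0), so it should be checked against the main paper:

```python
import numpy as np

def pasta(img, alpha=3.0, k=2.0, beta=0.25, rng=None):
    """PASTA augmentation sketch. img: float array (C, H, W) in [0, 1].

    We perturb the amplitude spectrum with multiplicative Gaussian jitter
    eps ~ N(1, sigma(m, n)^2), where sigma grows with the normalized
    frequency radius t = 2*sqrt(m^2 + n^2) / sqrt(H^2 + W^2) in [0, 1],
    then recombine with the original phase and invert the transform.
    """
    rng = np.random.default_rng() if rng is None else rng
    C, H, W = img.shape
    fft = np.fft.fft2(img, axes=(-2, -1))
    amp, pha = np.abs(fft), np.angle(fft)

    # Per-frequency jitter strength (no fftshift needed: fftfreq already
    # gives signed frequencies in the unshifted layout).
    m = np.fft.fftfreq(H)[:, None] * H
    n = np.fft.fftfreq(W)[None, :] * W
    t = 2.0 * np.sqrt(m**2 + n**2) / np.sqrt(H**2 + W**2)
    sigma = beta * (1.0 + alpha * t**k)

    # Multiplicative jitter, sampled per channel and spatial frequency.
    eps = 1.0 + rng.standard_normal((C, H, W)) * sigma
    aug = np.fft.ifft2(amp * eps * np.exp(1j * pha), axes=(-2, -1))
    return np.clip(np.real(aug), 0.0, 1.0)
```

Note that independent per-frequency jitter breaks the conjugate symmetry of the spectrum, so the inverse transform has a small imaginary residue; taking the real part and clipping handles this in the sketch.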

A.4 AMPLITUDE ANALYSIS

PASTA is motivated by the empirical observation that synthetic images have less variance in their high-frequency components compared to real images. In this section, we first show that this observation holds across a set of syn-to-real shifts over fine-grained frequency band discretizations, and then demonstrate how PASTA helps counter this discrepancy. Fine-grained Band Discretization. For Fig. 2(b) in the main paper, the low, mid, and high frequency bands are chosen such that the first band is 1/3 the height of the image (it includes all spatial frequencies up to 1/3rd of the image height), the second band extends up to 2/3 the height of the image excluding band-1 frequencies, and the third band contains all the remaining frequencies. We begin by splitting the amplitude spectrum into 3, 5, 7, and 9 frequency bands in the manner described above, and analyze the diversity of these frequency bands across multiple datasets. Across 7 domain shifts (see Fig. 5 and 6) -{GTAV, SYNTHIA} → {Cityscapes, BDD-100K, Mapillary}, and VisDA-C Syn→Real -we find that (1) for every dataset (whether synthetic or real), diversity decreases as we head towards higher frequency bands and (2) synthetic images exhibit less diversity in high-frequency bands at all the levels of granularity we consider. Increase in amplitude variations post-PASTA. Next, we observe how PASTA affects the diversity of the amplitude spectrums on GTAV and VisDA-C. Similar to the above, we split the amplitude spectrum into 3, 5, 7, and 9 frequency bands, and analyze the diversity of these frequency bands before and after applying PASTA to images (see Fig. 7 and 8). For synthetic images from GTAV, when PASTA (α = 3, k = 2) is applied, we observe that the standard deviation of amplitude spectrums increases from 0.4 to 0.497, 0.33 to 0.51, and 0.3 to 0.52 for the low, mid, and high frequency bands respectively. As expected, we observe the maximum increase for the high-frequency bands.
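The per-band diversity measure used above (element-wise log of the amplitude spectrum, then a standard deviation within each band) can be sketched as follows. For simplicity this sketch bins by normalized frequency radius into equal intervals, whereas the analysis above defines bands via image-height fractions, so the exact band boundaries differ:

```python
import numpy as np

def band_stds(img, n_bands=3):
    """Std of the log-amplitude spectrum within each of n frequency bands.

    img: float array (C, H, W). Bands partition the normalized frequency
    radius t in [0, 1] into n equal intervals, lowest frequencies first.
    Averaging the returned values over a dataset gives one bar per band,
    in the spirit of Fig. 2(b).
    """
    fft = np.fft.fft2(img, axes=(-2, -1))
    log_amp = np.log(np.abs(fft) + 1e-8)  # element-wise logarithm

    _, H, W = img.shape
    m = np.fft.fftfreq(H)[:, None] * H
    n = np.fft.fftfreq(W)[None, :] * W
    t = 2.0 * np.sqrt(m**2 + n**2) / np.sqrt(H**2 + W**2)  # in [0, 1]

    # Band index per spatial frequency; clamp t = 1 into the last band.
    idx = np.minimum((t * n_bands).astype(int), n_bands - 1)
    return [float(log_amp[:, idx == b].std()) for b in range(n_bands)]
```

Comparing these per-band statistics for a dataset before and after augmentation is one way to reproduce the "increase in amplitude variations post-PASTA" analysis above.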
Figure 7: For GTAV, we find that applying PASTA increases variations in amplitude values across different frequency bands. The four plots correspond to fine-grained frequency band discretizations (3, 5, 7, and 9 bands; increasing in frequency from Band-1 to Band-n). We find the maximum increase for the highest frequency bands across the different granularity levels.

