CYCLE-CONSISTENT MASKED AUTOENCODER FOR UNSUPERVISED DOMAIN GENERALIZATION

Abstract

Self-supervised learning methods suffer undesirable performance drops when there is a significant domain gap between training and testing scenarios. Unsupervised domain generalization (UDG) is therefore proposed to tackle this problem: a model is trained on several different domains without supervision and must generalize well to unseen test domains. Existing methods rely either on cross-domain, semantically consistent image pairs in contrastive methods or on reconstruction pairs in generative methods, yet such precise image pairs are not available without semantic labels. In this paper, we propose a cycle cross-domain reconstruction task for unsupervised domain generalization in the absence of paired images. The cycle cross-domain reconstruction task converts a masked image from one domain to another domain and then reconstructs the original image from the converted images. To preserve the divergent domain knowledge of the decoders in the cycle reconstruction task, we propose a novel domain contrastive loss that regularizes the reconstructed images to be encoded with the desired domain style. Quantitative results on extensive datasets show that our method improves state-of-the-art unsupervised domain generalization methods by an average of +5.59%, +4.52%, +4.22%, and +7.02% on the 1%, 5%, 10%, and 100% data fraction settings of PACS.

1. INTRODUCTION

Recent progress has shown the great capability of unsupervised learning to learn good representations without manual annotations (Doersch et al., 2015; Noroozi & Favaro, 2016; Gidaris et al., 2018; Chen et al., 2020b; He et al., 2020; Chen et al., 2021; Zbontar et al., 2021; Caron et al., 2021; Tian et al., 2020; Henaff, 2020; Oord et al., 2018; Wu et al., 2018; Misra & Maaten, 2020; Caron et al., 2020; Li et al., 2022; 2023). However, these methods mostly rely on the assumption that the training and testing data follow an independent and identical distribution. In many real-world situations, this assumption hardly holds due to domain gaps between the training and testing sets. As a result, significant performance drops can be observed when deep learning models encounter out-of-distribution deployment scenarios (Zhuang et al., 2019; Sariyildiz et al., 2021; Wang et al., 2021; Bengio et al., 2019; Engstrom et al., 2019; Hendrycks & Dietterich, 2018; Recht et al., 2019; Su et al., 2019). A novel setting, unsupervised domain generalization (UDG) (Zhang et al., 2022; Harary et al., 2021; Yang et al., 2022), is therefore introduced to solve this problem: the model is trained on multiple unlabeled source domains and expected to generalize well to unseen target domains. Existing unsupervised domain generalization methods rely on constructing cross-domain but semantically consistent image pairs to design their pretext tasks, i.e., contrastive-based (Zhang et al., 2022; Harary et al., 2021) and generative-based methods (Yang et al., 2022). The contrastive-based methods aim to pull cross-domain positive pairs (samples of the same class but from different domains) together and push negative pairs (samples of different classes) apart (Fig. 1(a)).
In contrast, the generative-based method proposes a cross-domain masked image reconstruction task that recovers the original image from its style-transferred counterpart, aiming to disentangle the domain information and obtain a domain-invariant content encoder (see Fig. 1(b)). Although these methods achieve great success in unsupervised domain generalization, how to fully exploit the information of multiple domains and establish better input-target reconstruction pairs remains a fundamental challenge. Ideally, the reconstruction pairs should cover realistic and diverse cross-domain sample pairs, but without image annotations such pairs cannot be accurately obtained. To tackle this challenge, we propose a generative-based model named Cycle-consistent Masked AutoEncoder (CycleMAE). Our method designs a novel cycle cross-domain reconstruction task for unsupervised domain generalization in the absence of paired images, which reconstructs multiple images from different domains in a self-circulating manner with a masked autoencoder. Specifically, the cycle cross-domain reconstruction task first transforms a randomly masked image from one domain to another domain, and then brings the generated counterpart back to its original domain, forming the self-circulating approach (as illustrated in Fig. 1(c)). In this way, the forward reconstruction establishes cross-domain counterparts in multiple domains, which serve as high-quality input-target pairs for the backward reconstruction. Since these input-target pairs come from the model outputs, they can be more realistic than cross-domain pairs designed by manual rules, and with these more realistic pairs the model can be taught to better disentangle domain-invariant features. We further observe that directly applying the cycle reconstruction task may weaken the model's ability to disentangle the style information in the domain-specific decoders.
Without any supervision of the reconstructed images in the forward reconstruction, the model tends to take "shortcuts": the domain-specific decoders generate images in similar domains to reduce the difficulty for the encoder of extracting content information in the backward reconstruction. To this end, we additionally introduce a domain contrastive loss to keep different decoders capturing divergent information. Specifically, we accomplish this regularization by pulling samples with the same domain labels close and pushing samples with different domain labels apart, which forces the decoders to capture less redundant information from each other and thus better disentangle the domain information. To demonstrate the effectiveness of CycleMAE, extensive experiments are conducted on the commonly used multi-domain UDG benchmarks, including PACS (Li et al., 2017) and DomainNet (Peng et al., 2019). The experimental results demonstrate that CycleMAE achieves a new state of the art and obtains significant performance gains on all correlated unsupervised domain generalization settings (Zhang et al., 2022). Specifically, CycleMAE improves the state-of-the-art unsupervised domain generalization methods by an average of +5.59%, +4.52%, +4.22%, and +7.02% on the 1%, 5%, 10%, and 100% data fraction settings of PACS, and by +5.08%, +6.49%, +1.79%, and +0.53% on the 1%, 5%, 10%, and 100% data fraction settings of DomainNet. Our contributions are two-fold: (1) We propose a self-circulating cross-domain reconstruction task to learn domain-invariant features. (2) We propose a domain contrastive loss that preserves the domain discriminativeness of the transformed images in the cycle cross-domain reconstruction task, which regularizes the encoder to learn domain-invariant features. Extensive experiments validate the effectiveness of our proposed method, which improves the state-of-the-art generative-based methods by a large margin. Related works are elaborated in Appendix 5.1.

2. CYCLE-CONSISTENT MASKED AUTOENCODER

We now introduce our Cycle-consistent Masked AutoEncoder (CycleMAE), which learns domain-invariant features for unsupervised domain generalization, building on a simple generative baseline, DiMAE. Given images from a set of N different domains X = {X_1, X_2, ..., X_N}, let E denote the shared content encoder and D_i the domain-specific decoder for the image domain X_i. The key innovation of our proposed method lies in easing the difficulty of constructing cross-domain but semantically identical image pairs, via a novel cycle cross-domain reconstruction task and a domain contrastive loss. Given an image x from domain X_i, the image reconstruction cycle should be able to bring x back to the original image, i.e., $x \rightarrow y = D(E(x)) \rightarrow \hat{x} = D_i(E(y)) \approx x$, where y = {y_1, y_2, ..., y_N} are the images generated by the forward transformation, y_i = D_i(E(x)) is the generated image encoded with the style of X_i, and $\hat{x}$ is the reconstructed image in X_i after the cycle cross-domain reconstruction task. The domain contrastive loss preserves the domain differences among y, regularizing the domain-specific decoders to learn their own domain style information.

2.1. OVERVIEW OF CYCLEMAE

As shown in Fig. 2, the proposed CycleMAE is based on DiMAE (Yang et al., 2022), but differently undergoes two consecutive and reversible processes: the forward transformation process (illustrated by blue arrows) and the backward transformation process (illustrated by green arrows). For the forward transformation (Step 1), we mostly follow the process in DiMAE, which transforms a style-mixed image into images of different domains, i.e., x → y. For the backward transformation process (Step 2), we transform the generated images y back to reconstruct x. The cycle consistency loss and the domain contrastive loss, along with the original cross-domain reconstruction loss in DiMAE, are imposed in Step 3. Step 1: Transform an image x to the forward-transformed images y_1, y_2, ..., y_N (blue arrows). Given an image x in X_i, we apply the style-mix of (Yang et al., 2022) to generate its style-mixed image v. Then we randomly divide v into visible patches v^v and masked patches v^m. We feed the visible patches v^v into the encoder-decoder structure to generate the forward-transformed images y = {y_1, y_2, ..., y_N}. Step 2: Transform the generated images y back to the image x (green arrows). Given the set of forward-transformed images y, we randomly divide each of them into visible patches y^v and masked patches y^m. Then we use the visible patches y^v to reconstruct an image that belongs to the same domain as x. Step 3: Optimize the network with the cross-domain reconstruction loss, the proposed domain contrastive loss, and the cycle reconstruction loss. The parameters of the encoder E and the domain-specific decoders D = {D_1, D_2, ..., D_N} are optimized by the cycle reconstruction loss (Eq. 4), the domain contrastive loss (Eq. 5), and the cross-domain reconstruction loss (Eq. 6) used in (Yang et al., 2022).
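The three steps above can be sketched end-to-end. Below is a minimal NumPy toy in which linear maps stand in for the ViT encoder E and the transformer decoders D_i (a loose sketch under these assumptions; masking and style-mix are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
D_FEAT, N_DOMAINS = 16, 3

# Hypothetical linear stand-ins for the shared encoder E and the
# N domain-specific decoders D_i used in the paper.
W_enc = rng.normal(size=(D_FEAT, D_FEAT)) * 0.1
W_dec = [rng.normal(size=(D_FEAT, D_FEAT)) * 0.1 for _ in range(N_DOMAINS)]

def E(patches):              # shared content encoder
    return patches @ W_enc

def D(i, z):                 # decoder for domain X_i
    return z @ W_dec[i]

x = rng.normal(size=(8, D_FEAT))   # patches of an image from domain i0
i0 = 0

# Step 1 (forward, blue arrows): transform x into every domain.
y = [D(i, E(x)) for i in range(N_DOMAINS)]

# Step 2 (backward, green arrows): bring each y_i back to domain i0.
x_hat = [D(i0, E(y_i)) for y_i in y]

# Step 3: cycle reconstruction loss (Eq. 4), summed over the N domains.
L_cycle = sum(((xh - x) ** 2).mean() for xh in x_hat)
```

In the actual model the encoder sees only visible patches and the decoders additionally receive masked-patch queries; the sketch keeps only the cycle structure.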

2.2. CYCLE CROSS-DOMAIN RECONSTRUCTION TASK

Previous contrastive-based methods and generative-based methods rely on high-quality cross-domain but semantically consistent image pairs to construct their pretext tasks. However, we argue that typical algorithms, e.g., applying different image augmentations to the same image or selecting nearest neighbors, cannot precisely define such paired images because of the large domain gap. Therefore, we propose the cycle cross-domain reconstruction task to generate cross-domain and semantically similar image pairs in a self-circulating way, i.e., transforming an image into other domains and then bringing the transformed images back to their original domain. The overall cycle cross-domain reconstruction task consists of a forward transformation process and a backward transformation process. The cycle reconstruction loss is used to minimize the distance between the image generated after the cycle and the original image. Forward transformation process. The forward transformation process transforms an image into images of multiple domains with the designed encoder and domain-specific decoders. Specifically, given an image x in domain X_i, we apply the style-mix of (Yang et al., 2022) to obtain the style-mixed image v, and then randomly divide it into visible patches v^v and masked patches v^m. The visible patches are fed into the encoder E to extract the content feature z, i.e., z = E(v^v). With the domain-specific decoders {D_1, D_2, ..., D_N}, the masked patch queries q, and the content feature z, we reconstruct the images y = {y_1, y_2, ..., y_N} in the multiple domains X = {X_1, X_2, ..., X_N}. Given a masked query q, the reconstructed image y_i is defined as y_i = D_i(z, q). Backward transformation process. The backward transformation process transforms the generated images {y_1, y_2, ..., y_N} from their multiple domains back to the domain of the original image in order to reconstruct x.
Specifically, we divide every image y_i into visible patches y^v_i and masked patches y^m_i, where i = 1, 2, ..., N. We follow the implementation of the forward transformation, replacing v^v with y^v_i, i.e., t_i = E(y^v_i). With the decoder D that generates images in the same domain X_i as x, a set of masked patch queries {r_1, r_2, ..., r_N} for {y_1, y_2, ..., y_N}, and the content features {t_1, t_2, ..., t_N}, we reconstruct the image x in X_i. Mathematically, given the masked queries {r_1, r_2, ..., r_N}, the reconstructed image $\hat{x}_i$ from t_i can be formulated as

$\hat{x}_i = D(t_i, r_i).$ (3)


Figure 3: DiMAE leverages the heuristic methods in (Yang et al., 2022) to construct the reconstruction input, which contains many artifacts. CycleMAE instead uses the cross-domain reconstructed images as the reconstruction input, which are more natural and realistic.

Cycle reconstruction loss. The cycle reconstruction loss minimizes the distance between the masked patches of the reconstructed images and the corresponding masked patches of the original image x. Specifically,

$\mathcal{L}_{cycle} = \sum_{i=1}^{N} (\hat{x}_i - x)^2,$ (4)

where $\hat{x}_i$ are the outputs of the backward transformation process and x is the original image. Discussion. A high-quality input-target reconstruction pair should be both natural and diverse. As shown in Fig. 3, the previous method, i.e., DiMAE, constructs the input-target pair by the heuristic style-mix method, which introduces artificial defects into the reconstruction input. In contrast, our CycleMAE leverages images generated by the domain-specific decoders as the inputs for reconstructing the original image. Our reconstruction inputs do not rely on heuristic designs but exploit the information in the deep model, and are therefore more natural. Furthermore, owing to the multiple domain-specific decoders of the generative-based UDG method, we can generate multiple diverse inputs for reconstructing the original image, instead of the single image in DiMAE.
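As a concrete sketch of the cycle loss, the snippet below computes a squared error between each backward-transformed output and the original image, restricted to the masked patches. The restriction to masked patches follows standard MAE practice and is an assumption here, since Eq. 4 writes a plain squared error:

```python
import numpy as np

def masked_mse(x_hat, x, mask):
    # Squared error averaged over the masked patches only
    # (mask[p] is True when patch p was masked out of the input).
    diff = (x_hat - x) ** 2
    return (diff * mask[:, None]).sum() / (mask.sum() * x.shape[1])

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))                    # 8 patches, 4 values per patch
# Stand-ins for the N = 3 backward-transformation outputs x_hat_i.
x_hats = [x + 0.1 * rng.normal(size=x.shape) for _ in range(3)]
mask = np.array([1, 1, 0, 1, 1, 1, 0, 1], dtype=bool)   # 75% masking ratio

# L_cycle sums the per-domain masked errors, as in Eq. 4.
L_cycle = sum(masked_mse(xh, x, mask) for xh in x_hats)
```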

2.3. DOMAIN CONTRASTIVE LEARNING

Although the cycle cross-domain reconstruction task can generate precise input-target reconstruction pairs through the self-circulating process, we observe that directly applying the cycle reconstruction task decreases the model's ability to disentangle the style information in the domain-specific decoders, which in turn weakens the encoder's ability to learn domain-invariant features. Without any supervision of the transformed images in the forward transformation process, the model tends to take "shortcuts": the domain-specific decoders generate images in similar domains to reduce the difficulty for the encoder of extracting content information in the backward transformation process. Therefore, we propose a domain contrastive loss that regularizes different decoders to learn divergent domain style information, providing the encoder with diverse inputs for reconstructing the original image in the backward transformation process. Specifically, the domain contrastive loss pulls samples of the same domain close and pushes samples of different domains apart. As mentioned in (Cao et al., 2022), the semantic information of features increases as the layers go deeper. To effectively regularize the decoders to learn divergent domain information while minimizing the influence on semantic information, we use the features d from the first decoder transformer layer for domain contrastive learning, where d_i = D^1_i(z, q) and D^1_i denotes the first transformer layer of D_i. Given an intermediate decoder feature d_i and the content feature z defined in Eq. 2, the domain contrastive loss is defined as

$\mathcal{L}_{domain} = -\sum_{i=1}^{N} \log \frac{\exp(d_i \cdot d_i^+ / \tau)}{\exp(d_i \cdot d_i^+ / \tau) + \exp(d_i \cdot d_i^- / \tau)},$ (5)

where $d_i^+$ denotes the features of samples in the same domain as d_i and $d_i^-$ denotes the features of samples from different domains within the training batch.
With the domain contrastive loss, the decoders generate images with divergent domain styles, giving the encoder diverse input images for reconstructing the original images in the backward transformation process.
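A simplified sketch of Eq. 5: for each anchor feature, same-domain samples act as positives and cross-domain samples as negatives. The paper applies this to first-decoder-layer features; here plain vectors, L2 normalization, and a one-positive-per-anchor structure are illustrative assumptions:

```python
import numpy as np

def domain_contrastive_loss(feats, domains, tau=0.07):
    # L2-normalize features, then compute temperature-scaled similarities.
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T / tau
    loss = 0.0
    for i in range(len(feats)):
        pos = [j for j in range(len(feats)) if j != i and domains[j] == domains[i]]
        neg = [j for j in range(len(feats)) if domains[j] != domains[i]]
        pos_term = np.exp(sim[i, pos]).sum()   # same-domain similarities
        neg_term = np.exp(sim[i, neg]).sum()   # cross-domain similarities
        loss += -np.log(pos_term / (pos_term + neg_term))
    return loss / len(feats)

# Two well-separated "domains": correct domain labels give a much lower
# loss than mismatched ones, illustrating the pull-close/push-apart effect.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_matched = domain_contrastive_loss(feats, [0, 0, 1, 1])
loss_mismatched = domain_contrastive_loss(feats, [0, 1, 0, 1])
```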

2.4. OBJECTIVE FUNCTION

Our cycle cross-domain reconstruction task is compatible with the original cross-domain reconstruction task in (Yang et al., 2022). Therefore, to take full advantage of the reconstruction tasks, we also preserve the original cross-domain reconstruction loss of (Yang et al., 2022) in our total objective. Specifically, given an image from domain X_i, the cross-domain reconstruction loss is formulated as

$\mathcal{L}_{recons} = (y_i - x)^2.$ (6)

Our total objective function can therefore be formulated as

$\mathcal{L} = \mathcal{L}_{recons} + \alpha \mathcal{L}_{domain} + \beta \mathcal{L}_{cycle},$

where α and β are hyperparameters, both empirically set to 2. The sensitivity analysis of the hyperparameters is presented in Appendix 5.2.1.
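The total objective is a plain weighted sum; a trivial helper makes the default weights explicit (α = β = 2, per the sensitivity analysis):

```python
def total_loss(l_recons, l_domain, l_cycle, alpha=2.0, beta=2.0):
    # L = L_recons + alpha * L_domain + beta * L_cycle  (Sec. 2.4)
    return l_recons + alpha * l_domain + beta * l_cycle
```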

3. EXPERIMENTS

3.1. EXPERIMENTAL SETUP

Dataset. Two benchmark datasets are adopted: PACS (Li et al., 2017) and DomainNet (Peng et al., 2019). Following the all correlated setting of DARLING (Zhang et al., 2022), we select Painting, Real, and Sketch as source domains and Clipart, Infograph, and Quickdraw as target domains for DomainNet. In this setting, we select 20 classes out of the 345 categories for both training and testing, exactly following (Zhang et al., 2022). For PACS, we follow the common domain generalization setting (Li et al., 2018; Rahman et al., 2020; Albuquerque et al., 2019), where three domains are selected for self-supervised training and the remaining domain is used for evaluation. The remaining experiments are shown in Appendix 5.2.2. Evaluation protocol. We follow the all correlated setting of DARLING (Zhang et al., 2022) and divide the testing process into three steps. First, we train our model on the unlabeled source domains. Second, we use different numbers of labeled training examples from the validation subset of the source domains to finetune the classifier or the whole backbone. In detail, when the fraction of labeled finetuning data is lower than 10% of the whole validation subset of the source domains, we finetune only the linear classifier for all methods; when the fraction is 10% or larger, we finetune the whole network, including the backbone and the classifier. Last, we evaluate the model on the target domains. Implementation details. We use ViT-Small as the backbone network unless otherwise specified. The learning rate for pre-training is 1.5 × 10^-5 and decays with a cosine schedule. The weight decay is set to 0.05 and the batch size is set to 256 × N_d, where N_d is the number of domains in the training set.
All methods are pre-trained for 1000 epochs, consistent with the implementations in (Zhang et al., 2022) for fair comparison. The feature dimension is set to 1024. For finetuning, we follow the exact training schedule of (Zhang et al., 2022). We initialize from an MAE (He et al., 2021) model unsupervised pre-trained on ImageNet for 1600 epochs, ensuring that labels are unavailable throughout the whole pre-training process.
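The pre-training schedule described above (base learning rate 1.5e-5 with cosine decay) can be sketched as a standalone function; the optional linear warmup is an assumption, not stated in the paper, and is disabled by default:

```python
import math

def cosine_lr(step, total_steps, base_lr=1.5e-5, warmup_steps=0):
    # Optional linear warmup, then cosine decay from base_lr to 0.
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```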

3.2. EXPERIMENTAL RESULTS

We present the results in Tab. 8 (DomainNet) and Tab. 9 (PACS), which show that our CycleMAE achieves better performance than previous methods on most tasks. Specifically, CycleMAE improves the performance by +3.81% and +5.08% on overall and average accuracy for the 1% fraction setting on DomainNet. For the 5% fraction setting, CycleMAE improves over previous methods by +5.53% and +6.49% on overall and average accuracy on DomainNet. For the 10% and 100% fraction settings, which adopt finetuning of the whole network, CycleMAE improves the state of the art under the linear evaluation protocol and improves the contrastive-based methods by +23.29% and +10.40% on average accuracy under the finetuning evaluation protocol. Furthermore, we find that CycleMAE obtains higher performance gains under the finetuning evaluation protocol, which is consistent with other unsupervised learning research (He et al., 2021; Xie et al., 2022). We also compare our CycleMAE with the state-of-the-art generative-based method DiMAE, which heuristically generates artificial style-mixed views to construct cross-domain images. In contrast, CycleMAE uses a cycle reconstruction task to construct cross-domain pairs, obtaining diverse image pairs from multiple domains. Our CycleMAE shows +5.08%, +6.49%, +1.79%, and +0.53% gains on the 1%, 5%, 10%, and 100% settings of DomainNet.

3.3. ABLATION STUDY

To further investigate the components of our proposed CycleMAE, we conduct a detailed ablation study. Specifically, we train ViT-Tiny for 100 epochs on the combination of the Painting, Real, and Sketch training sets of DomainNet, and use the linear evaluation protocol on Clipart. Effectiveness of Each Component of CycleMAE. As shown in Tab. 3, we explore the effectiveness of each component in CycleMAE. The cycle reconstruction improves the accuracy by +5.00%, and the performance is further improved by +2.45% with the domain contrastive loss. The results verify that the proposed modules help the encoder learn more domain-invariant features. Comparison of Features from Different Layers. The domain contrastive loss regularizes the output features of the decoders, which raises the question of which layer's features yield the best performance. As shown in Tab. 6, the performance decreases from the first layer to the last layer. As mentioned in (Cao et al., 2022), the semantic information of features increases as the layers go deeper; thus applying the regularization to the first decoder layer forces the features of different decoders to share less redundancy. Therefore, we apply the regularization after the first decoder layer as the default protocol. Effectiveness of Domain Distance Regularization Loss. Domain distance regularization forces the domain information in different decoders to differ, and many designs of such regularization exist. As shown in Tab. 4, the domain contrastive loss achieves the best result. This is because the domain contrastive loss pulls samples from the same domain close and pushes samples from different domains apart, whereas other distance metrics only push samples from different domains apart without minimizing the intra-domain distance.
Thus, the domain contrastive loss improves over other domain distance regularizations by +1.66%. Comparison of Cross-domain Pairs for Construction. Paired data are important but absent in cross-domain datasets. The previous generative-based method (Yang et al., 2022) proposes heuristic pairs consisting of an original image and its style-mixed view. Comparing our pairs with those of (Yang et al., 2022) in Tab. 5, where FT/BT denote the forward/backward transformation respectively, we demonstrate that CycleMAE obtains a +5.64% performance gain over the other configurations. Compared with using heuristic pairs in both FT and BT, the optimal setting, which uses heuristic pairs only in FT, is better because the heuristic images produced by style-mix are not realistic. However, compared with not using heuristic pairs at all, using them in FT is important because they provide a good starting point for the encoder in our CycleMAE to learn domain-invariant features and for the decoders to learn domain-specific information.

4. CONCLUSIONS

In this paper, we propose the Cycle-consistent Masked AutoEncoder (CycleMAE) to tackle the unsupervised domain generalization problem. CycleMAE designs a cycle reconstruction task to construct cross-domain input-target pairs, which generates more realistic and diverse image pairs and helps learn a content encoder that extracts domain-invariant features. Furthermore, we propose a novel domain contrastive loss to help CycleMAE better disentangle the domain information. Extensive experiments on PACS and DomainNet show that CycleMAE achieves state-of-the-art performance, verifying the effectiveness of our method.

5.1. RELATED WORKS

Self-supervised Learning. Self-supervised learning (SSL) is introduced to learn powerful semantic representations from massive unlabeled data. Recent SSL methods can be divided into two categories: discriminative methods (Noroozi & Favaro, 2016; Gidaris et al., 2018; Chen et al., 2020b; Grill et al., 2020; He et al., 2020; Chen et al., 2020d; 2021; Zbontar et al., 2021; Caron et al., 2021) and generative methods (Pathak et al., 2016; Larsson et al., 2016; 2017; He et al., 2021). Among the discriminative methods, early works design auxiliary tasks, such as jigsaw puzzles (Noroozi & Favaro, 2016) and rotation prediction (Gidaris et al., 2018), to learn semantic representations. Recent works are mainly based on contrastive losses (Chen et al., 2020b; Grill et al., 2020; He et al., 2020; Chen et al., 2020d; 2021; Zbontar et al., 2021; Caron et al., 2021), which model similarity and dissimilarity by constructing positive and negative pairs, and learn semantic representations by pulling the positive pairs close and pushing the negative pairs away. Generative methods depend on the design of an encoder-decoder structure. Recent methods use masked image modeling (MIM) to recover original images from masked ones with a vision transformer. iGPT (Chen et al., 2020a) autoregressively predicts pixel values from pixel sequences.
MAE (He et al., 2021), a recent state-of-the-art method, recovers the input image from a few of its patches to pre-train the autoencoder, capturing semantic representations in this way. However, these SSL methods only consider the situation where the training and testing datasets share the same distribution, and performance may drop when a domain gap exists between them. We therefore propose a generative-based method that takes the domain gap into consideration. Unsupervised Domain Generalization. Despite its success, domain generalization still relies on fully labeled data. To ease the annotation burden, Unsupervised Domain Generalization (UDG) is proposed as a novel generalization task that trains on unlabeled source domains and tests on target domains that exhibit domain shifts from the training domains. Derived from contrastive learning, DARLING (Zhang et al., 2022) incorporates domain information into the contrastive loss by reweighting domain labels. BrAD (Harary et al., 2021) projects inputs into an auxiliary bridge domain and applies contrastive learning in this domain to learn domain-invariant features. Although these works can reduce the influence of domain shifts to some extent, their performance is still limited by the difficulty of defining good positive pairs for contrastive learning. Recently, generative-based methods have been proposed to solve the UDG problem. One of the most representative is DiMAE (Yang et al., 2022), which establishes an MAE-style (He et al., 2021) generative framework for the UDG task. DiMAE contains a content encoder and multiple domain-specific decoders. The input images are transformed into style views by the proposed Content-Preserved StyleMix module (Yang et al., 2022) and then masked randomly.
The content encoder extracts domain-invariant features of the style-view counterparts, and the domain-specific decoders then recover the reconstruction from these features. In this way, domain-invariant features are learned by the content encoder. Different from DiMAE, our proposed method does not rely on the heuristic pairs generated by style-mix and thus obtains more realistic cross-domain pairs. Cycle-consistency. Cycle-consistency is a common visual technique for keeping content unchanged. (Zhou et al., 2016) uses the consistency across instances of the same category as a supervision signal, forcing the model to predict correspondences between instances containing the same object. (Yi et al.) and (Zhu et al., 2017) both solve the image-to-image translation problem with cycle-consistency: DualGAN is similar to dual learning and trains the primal and dual GANs at the same time, while CycleGAN uses cycle-consistency to make the reconstructed images match their original images closely. These methods use cycle-consistency to keep the object unchanged from one image to another so that the two images share the same or a similar object. Although we also introduce cycle-consistency to keep content unchanged, our purpose is not to generate or find another image with the same object; we use cycle-consistency to push the features extracted by the content encoder to retain more content information and thus obtain better domain-invariant features.

5.2.1. SENSITIVITY ANALYSIS OF HYPERPARAMETERS

Fig. 6(b) shows that the performance is not sensitive to the weight of the cycle reconstruction loss unless the weight equals 0. We conjecture that the cycle reconstruction loss, as the core of the cycle reconstruction task, guarantees the good performance of CycleMAE for any nonzero weight.
The domain contrastive loss, in contrast, acts as a regularization of the cycle reconstruction task, and its weight influences its regularization ability. Based on these results, we set α = 2 and β = 2.

5.2.2. EXPERIMENTS ON OPPOSITE SETTING OF DOMAINNET

We showed part of our results in the main text, where we train our model on Painting, Real, and Sketch and evaluate its generalization ability on Clipart, Infograph, and Quickdraw. In this section, we train our model on Clipart, Infograph, and Quickdraw and evaluate it on Painting, Real, and Sketch. As in the main text, we exactly follow the all correlated setting proposed by DARLING (Zhang et al., 2022). The results are presented in Tab. 7. Our CycleMAE still performs well: it achieves improvements of +1.00% and +0.79% on overall accuracy and +1.52% and +1.05% on average accuracy for the 1% and 5% fraction settings. For the 10% and 100% fraction settings, our CycleMAE improves the state-of-the-art methods by +1.76% and +1.81% on average accuracy and +1.94% and +1.90% on overall accuracy.

5.3. EXPERIMENTS WITH VIT TINY BACKBONE

In this section, we use a smaller backbone, ViT-Tiny, to illustrate the effectiveness of our proposed CycleMAE. We still follow the all correlated setting of DARLING (Zhang et al., 2022).



Figure 1: Comparison of previous methods and the proposed method for the UDG task. (a): Contrastive methods aim to pull cross-domain but semantically similar images together. (b): Previous generative-based methods aim to reconstruct the original image from cross-domain images generated by handcrafted style-mix. (c): Our proposed CycleMAE leverages a self-circulating cross-domain reconstruction task for unsupervised domain generalization in the absence of paired data. (X and Y are a cross-domain but semantically identical image pair.)

Figure 2: Illustration of CycleMAE with the cycle cross-domain reconstruction task. The training process includes the forward and backward transformation processes. For an image x, the forward transformation process transforms it into images of multiple domains. The backward transformation process then brings them back to the original domain. We use the cross-domain reconstruction loss, the cycle cross-domain reconstruction loss, and the domain contrastive loss to supervise the model optimization.

Figure 4: The t-SNE Visualization on the feature distributions with different methods.

Figure 5: Reconstruction visualization of different decoders.

VISUALIZATION

Feature Visualization. We present the feature distribution visualization of MoCo V3, MAE, DiMAE, and CycleMAE in Fig. 4 by t-SNE (Van der Maaten & Hinton, 2008), where the features are extracted from part of the combined Painting, Real, and Sketch training set of DomainNet. From the visualization, we can see that the features from different domains are distributed more uniformly in CycleMAE than in the other methods, which indicates that CycleMAE exhibits better domain-invariant characteristics.

Reconstruction Visualization. In Fig. 5, we present the reconstruction results of CycleMAE using ViT-Base for better visualization. The visualization shows that CycleMAE can produce realistic and diverse pairs that are semantically identical but from different domains.
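A t-SNE projection of encoder features colored by domain, as in Fig. 4, can be reproduced with a few lines of scikit-learn. This is a generic sketch, not the authors' plotting code; it assumes features have already been extracted into a NumPy array with one domain id per row, and the function name `plot_domain_tsne` is our own.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless-safe backend for saving figures
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_domain_tsne(features, domain_labels, out_path="tsne.png", seed=0):
    """Project (N, D) encoder features to 2-D with t-SNE and color by domain.
    domain_labels is a length-N array of integer domain ids."""
    emb = TSNE(n_components=2, init="pca", random_state=seed,
               perplexity=min(30, len(features) - 1)).fit_transform(features)
    for d in np.unique(domain_labels):
        m = domain_labels == d
        plt.scatter(emb[m, 0], emb[m, 1], s=4, label=f"domain {d}")
    plt.legend()
    plt.axis("off")
    plt.savefig(out_path, dpi=200)
    plt.close()
    return emb
```

Well-mixed domain clusters in the resulting scatter plot suggest domain-invariant features, while clearly separated clusters indicate that the encoder still captures domain style.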

Figure 6: Performance with different hyper-parameters. The horizontal axis is the value of hyperparameters and the vertical axis is the top-1 accuracy.

In Fig. 6(a), we ablate the loss weight of the domain contrastive loss. We observe that the performance is sensitive to this weight. Fig. 6(b)

.1 EXPERIMENTS ON DOMAINNET

In this section, we use ViT-Tiny as the backbone of our proposed CycleMAE and evaluate the performance on DomainNet with two tasks. The first task trains our model on Painting, Real, and Sketch and evaluates its generalization ability on Clipart, Infograph, and Quickdraw (Painting, Real, Sketch → Clipart, Infograph, Quickdraw). The second task trains our model on Clipart, Infograph, and Quickdraw and evaluates it on Painting, Real, and Sketch (Clipart, Infograph, Quickdraw → Painting, Real, Sketch).

PACS is a widely used benchmark for domain generalization. It consists of four domains, Photo (1,670 images), Art Painting (2,048 images), Cartoon (2,344 images), and Sketch (3,929 images), and each domain contains seven categories. Peng et al. (2019) propose a large and diverse cross-domain benchmark, DomainNet, which contains 586,575 examples over 345 object classes in six domains: Real, Painting, Sketch, Clipart, Infograph, and Quickdraw.

The cross-domain generalization results on DomainNet. All models are trained on the Painting, Real, and Sketch domains of DomainNet and tested on the other three domains. The title of each column indicates the name of the target domain. All models are pretrained for 1000 epochs before being finetuned on the labeled data. Results style: best, second best.

The cross-domain generalization results on PACS. Since the experiment for each target domain is run separately, there is no overall accuracy across domains; thus we report the average accuracy and the accuracy for each domain. The title of each column indicates the name of the domain used as the target. All models are pretrained for 1000 epochs before being finetuned on the labeled data. Results style: best, second best.

Effectiveness of each proposed component of CycleMAE.

Comparison of different domain distance regularization loss.

Ablation study on heuristic pairs in the forward and backward transformation processes. The heuristic pair consists of an image and its style-mixed view.

Comparison of features from different layers for domain contrastive regularization. The decoders consist of 8 layers.

Results of the cross-domain generalization on DomainNet. All models are trained on the Clipart, Infograph, and Quickdraw domains of DomainNet and tested on the other three domains. The title of each column indicates the name of the domain used as the target. All models are pretrained for 1000 epochs before being finetuned on the labeled data. Results style: best, second best.

ACKNOWLEDGMENTS

The majority of this work was completed during Haiyang Yang's internship at SenseTime under the mentorship of Feng Zhu. We also extend our gratitude to Qingsong Xie for his contribution to part of this idea.

APPENDIX

Table 9: Results of the cross-domain generalization setting on PACS. Since the experiment for each target domain is run separately, there is no overall accuracy across domains; thus we report the average accuracy and the accuracy for each domain. The title of each column indicates the name of the domain used as the target. All models are pretrained for 1000 epochs before being finetuned on the labeled data. Results style: best, second best.

5.5. BROADER IMPACT

We propose an effective generative-based unsupervised domain generalization method, from whose model outputs more realistic cross-domain pairs can be obtained. However, there is a potential issue in our experiments that we should consider remedying in the future: our experiments rely on many GPUs for pretraining and testing, which consumes a large amount of electricity, and high electricity consumption can contribute to pollution that harms the environment.

