DOMAIN GENERALIZATION VIA INDEPENDENT REGULARIZATION FROM EARLY-BRANCHING NETWORKS

Anonymous

Abstract

Learning domain-invariant feature representations is critical for achieving domain generalization, where a model is required to perform well on unseen domains. The key challenge is that standard training often results in entangled domain-invariant and domain-specific features (see Figure 2). To address this issue, we use a dual-branching network to learn two features, one for the domain classification problem and the other for the original target classification problem, and the feature of the latter is required to be independent of the former. While this idea seems straightforward, we show that several factors need to be carefully considered for it to work effectively. In particular, we investigate different branching structures and discover that the common practice of using a shared base feature extractor with two lightweight prediction heads is detrimental to performance. Instead, a simple early-branching architecture, where the domain classification and target classification branches share the first few blocks while diverging thereafter, leads to better results. Moreover, we incorporate a random style augmentation scheme as an extension to further unleash the power of the proposed method; it can be seamlessly integrated into the dual-branching network through our loss terms. Such an extension gives rise to an effective domain generalization method. Experimental results show that the proposed method outperforms state-of-the-art domain generalization methods on various benchmark datasets.

1. INTRODUCTION

Domain generalization (DG) requires learned models to perform well on unseen domains; the key lies in learning domain-invariant representations that are robust to domain shift (Ben-David et al., 2006). Standard training often results in entangled domain-invariant and domain-specific features, which hinders the model from generalizing to new domains. Existing methods address this issue by introducing various forms of regularization, such as adopting alignment (Muandet et al., 2013; Ghifary et al., 2016; Li et al., 2018b; Hu et al., 2020), using domain-adversarial training (Ganin et al., 2016; Li et al., 2018b; Yang et al., 2021; Li et al., 2018c), or developing meta-learning methods (Li et al., 2018a; Balaji et al., 2018; Dou et al., 2019; Li et al., 2019). Despite the success of these methods, DG remains challenging and is far from being solved. For example, as a recent study (Gulrajani & Lopez-Paz, 2021) suggests, under a rigorous evaluation protocol, the naive empirical risk minimization (ERM) method (Vapnik, 1999), which simply aggregates training data from all domains and trains a model in an end-to-end manner without additional effort, can perform competitively against more elaborate alternatives. This observation indicates that a more effective approach might be needed to disentangle the domain-invariant and domain-specific features for better DG. In this paper, we adopt a simple method that leverages a conventional dual-branching network, with one branch predicting image classes (target prediction) and the other predicting domain labels. Regarding the features from the target and domain branches as domain-invariant and domain-specific representations, respectively, entanglement results in an undesired situation where domain-specific information is also encoded in the target branch, which inevitably corrupts the prediction when the domain varies during inference.
Thus, to explicitly disentangle the domain-invariant and domain-specific features, we impose a regularization that requires the former to be independent of the latter. This idea seems straightforward, but we show that several factors need to be carefully considered for it to work effectively. In particular, we first investigate the structure of the dual-branching network and, somewhat surprisingly, discover that the common practice of using a shared base feature extractor with two lightweight prediction heads (Chen et al., 2021; Atzmon et al., 2020) is detrimental to the performance. Instead, a simple early-branching architecture, where the domain classification branch and target classification branch share the first few blocks while diverging thereafter, yields the best results. Incorporating this discovery, we propose the basic form of the proposed method. Specifically, we employ the Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., 2005; 2007) as a measurement of feature independence and use two sub-networks (branches) with only a few shared convolution blocks for the target prediction task and the domain prediction task. A glimpse of the basic form is shown in Figure 1. Next, to further unleash the power of the proposed method, we suggest using domain augmentation to expose the domain-invariant features to a sufficient diversity of domain-specific representations. Precisely, we propose a new random style sampling (RDS) scheme that augments the set of domain types by incorporating features with randomly modified style statistics. In contrast to previous methods (Zhou et al., 2021; Li et al., 2022) that mix features or add noise to synthesize new domains, RDS directly perturbs the mean and variance of feature maps with a controllable perturbing strength.
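To make the independence regularizer concrete, below is a minimal NumPy sketch of the biased empirical HSIC estimator of Gretton et al. (2005), which is near zero when two feature batches are independent. The RBF kernel and its fixed bandwidth are illustrative assumptions, not the paper's exact configuration; in training, such a score would be computed on mini-batch features from the two branches and added to the loss.

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    # Pairwise squared distances -> Gaussian (RBF) kernel matrix.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC estimator (Gretton et al., 2005).

    X, Y: (n, d) feature batches. Returns a non-negative scalar
    that approaches zero when the two feature sets are independent.
    """
    n = X.shape[0]
    K = rbf_kernel(X, sigma)
    L = rbf_kernel(Y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```

Minimizing this quantity between the target-branch and domain-branch features pushes the two representations toward statistical independence.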
To seamlessly integrate the basic form and the augmentation strategy, we further propose accompanying loss terms that encourage the target branch to be invariant to the original and augmented representations, and vice versa for the domain-specific branch. Through our experimental studies, we illustrate: (1) the effectiveness of enforcing independence between the class and domain features within the early-branching design; (2) the advantages of the proposed RDS scheme over existing solutions (Zhou et al., 2021; Li et al., 2022), and the effectiveness of the adopted loss functions; and (3) that our complete method performs favorably against other state-of-the-art algorithms when evaluated on the current benchmark (Gulrajani & Lopez-Paz, 2021).
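As a concrete illustration of the random style sampling idea described above, the sketch below perturbs the per-channel mean and standard deviation (the "style" statistics) of a batch of feature maps with a controllable strength. The Gaussian sampling scheme and the `strength` parameterization are our assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def random_style_sample(feat, strength=0.5, rng=None):
    """Hypothetical sketch of random style sampling (RDS).

    feat: (N, C, H, W) feature maps. Per the paper's description,
    the per-channel mean and variance are directly perturbed with
    a strength controlled by `strength` (the noise model here is
    an assumption).
    """
    rng = rng or np.random.default_rng()
    N, C = feat.shape[:2]
    mu = feat.mean(axis=(2, 3), keepdims=True)            # (N, C, 1, 1)
    sigma = feat.std(axis=(2, 3), keepdims=True) + 1e-6
    normalized = (feat - mu) / sigma                      # strip style
    # Sample new style statistics around the originals.
    new_mu = mu + strength * rng.normal(size=(N, C, 1, 1))
    new_sigma = sigma * (1.0 + strength * rng.normal(size=(N, C, 1, 1)))
    return normalized * np.abs(new_sigma) + new_mu        # re-style
```

With `strength=0` the features are returned unchanged, so the perturbation strength interpolates smoothly between the original and a randomly re-styled domain.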

2. RELATED WORK

Various methods have been proposed in the DG literature recently (Li et al., 2017; Motiian et al., 2017; Li et al., 2018b; a; Gong et al., 2019; Zhou et al., 2020a; Zhao et al., 2020; Li et al., 2019; Honarvar Nazari & Kovashka, 2020; Li et al., 2021; Zhou et al., 2021; Xu et al., 2021a; Kim et al., 2021; Wang et al., 2020; Bui et al., 2021; Yang et al., 2021; Li et al., 2022; Chen et al., 2022). Despite the varying details, current DG methods can be roughly grouped by motivating intuition into a few categories: invariant representation learning (Ganin et al., 2016; Li et al., 2017; 2018b; c; Shi et al., 2021), augmentation (Zhou et al., 2021; Li et al., 2022; Xu et al., 2021a; Li et al., 2021), and general machine learning algorithms such as meta-learning (Li et al., 2018a; Balaji et al., 2018; Dou et al., 2019; Li et al., 2019) and self-supervised learning (Carlucci et al., 2019a; Jeon et al., 2021; Kim et al., 2021). This section briefly reviews methods from the most relevant categories.

Invariant representation learning. The pioneering work (Ben-David et al., 2006) theoretically proved that if features remain invariant across different domains, they are general and transferable to new domains. Inspired by this theory, many recent works aim to use deep networks to extract domain-invariant features. For example, (Ganin et al., 2016) train a domain-adversarial neural network (DANN) to obtain domain-invariant features by maximizing the domain classification loss. This idea is further explored by (Li et al., 2018b), who employ a maximum mean discrepancy constraint on the representation learning of an auto-encoder via adversarial training. Instead of directly obtaining domain-invariant features, some works (Khosla et al., 2012; Li et al., 2017) suggest decomposing the model parameters into domain-invariant and domain-specific parts and using only the domain-invariant parameters for prediction when confronting unseen domains.
Recently, the task has been further explored at the gradient level. Koyama and Yamaguchi (Koyama & Yamaguchi, 2020) learn domain-invariant features by minimizing the variance of inter-domain gradients. Inspired by the fact that optimization directions should be similar across domains, (Shi et al., 2021) suggest maximizing the gradient inner products between domains to maintain invariance. Different from these approaches, our method is based on the simple intuition that domain and class features should be statistically independent. By enforcing a straightforward independence constraint, our method achieves comparable or even better performance against these methods.



It is noteworthy that the early-branching structure can work with many independence measurements; we use HSIC because it yields better performance. Please see Sec. 4.3 and Appendix B for details.

