HARNESSING OUT-OF-DISTRIBUTION EXAMPLES VIA AUGMENTING CONTENT AND STYLE

Abstract

Machine learning models are vulnerable to Out-Of-Distribution (OOD) examples, and this problem has drawn much attention. However, current methods lack a full understanding of the different types of OOD data: there are benign OOD data that can be properly adapted to enhance the learning performance, while other malign OOD data severely degrade the classification result. To harness OOD data, this paper proposes a HOOD method that can leverage the content and style from each image instance to identify benign and malign OOD data. Particularly, we design a variational inference framework to causally disentangle content and style features by constructing a structural causal model. Subsequently, we augment the content and style through an intervention process to produce malign and benign OOD data, respectively. The benign OOD data contain novel styles but hold the contents of interest, and they can be leveraged to help train a style-invariant model. In contrast, the malign OOD data inherit unknown contents but carry familiar styles; detecting them improves model robustness against deceiving anomalies. Thanks to the proposed novel disentanglement and data augmentation techniques, HOOD can effectively deal with OOD examples in unknown and open environments, and its effectiveness is empirically validated on three typical OOD applications, including OOD detection, open-set semi-supervised learning, and open-set domain adaptation.

* Corresponding to Tongliang Liu and Chen Gong.
¹ We follow (Bengio et al., 2011) to regard the augmented data as a type of OOD data.

1. INTRODUCTION

Learning in the presence of Out-Of-Distribution (OOD) data has been a challenging task in machine learning, as a deployed classifier tends to fail if unseen data drawn from unknown distributions are not properly handled (Hendrycks & Gimpel, 2017; Pan & Yang, 2009). Such a critical problem ubiquitously arises when deep models meet domain shift (Ganin et al., 2016; Tzeng et al., 2017) and unseen-class data (Hendrycks & Gimpel, 2017; Scheirer et al., 2012), and it has drawn much attention in important fields such as OOD detection (Hein et al., 2019; Hendrycks & Gimpel, 2017; Lee et al., 2018; Liang et al., 2018; Liu et al., 2020; Wang et al., 2022a; 2023; 2022b), Open-Set Domain Adaptation (DA) (Liu et al., 2019; Saito et al., 2018), and Open-Set Semi-Supervised Learning (SSL) (Huang et al., 2021b; 2022b; a; Oliver et al., 2018; Saito et al., 2021; Yu et al., 2020). In the above fields, OOD data can be divided into two types, namely benign OOD data¹ and malign OOD data. The benign OOD data can boost the learning performance on the target distribution through DA techniques (Ganin & Lempitsky, 2015; Tzeng et al., 2017), but they can be misleading if not properly exploited. To improve model generalization, many positive data augmentation techniques (Cubuk et al., 2018; Xie et al., 2020) have been proposed. For instance, the performance of SSL (Berthelot et al., 2019; Sohn et al., 2020) has been greatly improved thanks to the augmented benign OOD data. On the contrary, malign OOD data with unknown classes can damage the classification results, but they are deceiving and hard to detect (Hendrycks & Gimpel, 2017; Liang et al., 2018; Wei et al., 2022b; a). To train a robust model against malign OOD data, some works (Kong & Ramanan, 2021; Sinha et al., 2020) conduct negative data augmentation to generate "hard" malign data which resemble in-distribution (ID) data. By separating such "hard" data from ID data, the OOD detection performance can be improved. When presented with both malign and benign OOD data, it is even more challenging to decide which to separate and which to exploit. As a consequence, the performance of existing open-set methods could be sub-optimal due to two drawbacks: 1) radically exploiting too much malign OOD data, and 2) conservatively denying too much benign OOD data.

In this paper, we propose the HOOD framework (see Fig. 2) to properly harness OOD data in several OOD problems. To distinguish benign and malign OOD data, we model the data generating process by following a structural causal model (SCM) (Glymour et al., 2016; Pearl, 2009; Gao et al., 2022), shown in Fig. 1(a). Particularly, we decompose an image instance X into two latent components: 1) a content variable C which denotes the object of interest, and 2) a style variable S which contains other influential factors such as brightness, orientation, and color. The content C indicates the true class Y, and the style S is decisive for the environmental condition, which is termed the domain D. Intuitively, malign OOD data cannot be incorporated into network training because they contain unseen contents, so their true classes differ from any known class; benign OOD data can be adapted because they only have novel styles but contain the same contents as ID data.
Therefore, we can distinguish benign and malign OOD data based on the extracted content and style features. In addition, we conduct causal disentanglement by maximizing an approximated evidence lower-bound (ELBO) (Blei et al., 2017; Yao et al., 2021; Xia et al., 2022b) of the joint distribution P(X, Y, D). As a result, we can effectively break the spurious correlation (Pearl, 2009; Glymour et al., 2016; Hermann et al., 2020; Li et al., 2021b; Zhang et al., 2022) between content and style which commonly arises during network training (Arjovsky et al., 2019), as shown by the dashed lines in Fig. 1(b). In the ablation study, we find that HOOD can correctly disentangle content and style, which correspondingly benefits the generalization tasks (open-set DA and open-set SSL) and the detection task (OOD detection). To further improve the learning performance, we conduct both positive and negative data augmentation by solely intervening on the style and content, respectively, as shown by the green and red lines in Fig. 1(c). This process is achieved by backpropagating the gradient computed from an intervention objective. As a result, style-changed data must be identified as benign OOD data, and content-changed data should be recognized as malign OOD data. Without introducing any bias, the benign OOD data can be easily harnessed to improve model generalization, and the malign OOD data can be directly recognized as harmful, which benefits the detection of unknown anomalies. By conducting extensive experiments on several OOD applications, including OOD detection, open-set SSL, and open-set DA, we validate the effectiveness of our method on typical benchmark datasets. To sum up, our contributions are three-fold:

• We propose a unified framework dubbed HOOD which can effectively disentangle the content and style features to break the spurious correlation. As a result, benign OOD data and malign OOD data can be correctly identified based on the disentangled features.

• We design a novel data augmentation method which correspondingly augments the style and content features to produce benign and malign OOD data, respectively, and further leverage them to enhance the learning performance.

• We empirically validate the effectiveness of HOOD on three typical OOD applications, namely OOD detection, open-set SSL, and open-set DA.

2. METHODOLOGY

In this section, we present the HOOD framework shown in Fig. 2. Specifically, we utilize the class labels of labeled data and the pseudo labels (Lee, 2013) of unlabeled data as the class supervision to capture the content feature. Moreover, we perform different types of data augmentation and regard the augmentation types as the domain supervision for each style. Thus, each instance x is paired with a class label y and a domain label d. Then, we apply two separate encoders g_c and g_s, parameterized by θ_c and θ_s, to model the posterior distributions q_{θ_c}(C | X) and q_{θ_s}(S | X), respectively. Subsequently, the generated C and S are correspondingly fed into two fully-connected classifiers f_c and f_s, parameterized by ϕ_c and ϕ_s, which produce the label predictions q_{ϕ_c}(Y | C) and q_{ϕ_s}(D | S), respectively. To further enhance the identifiability of C and S, a decoder h with parameters ψ is employed to reconstruct the input instance x based on its content and style. Below, we describe the detailed procedures and components of HOOD. We first introduce the proposed variational inference framework for disentangling content and style based on the constructed SCM. Subsequently, we conduct intervention to produce benign OOD data and malign OOD data. Further, we appropriately leverage the benign and malign OOD data to boost the learning performance. Finally, we formulate the deployment of HOOD in three OOD applications.
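To make these components concrete, the following is a minimal PyTorch-style sketch of the HOOD architecture under simplifying assumptions: the encoders output point estimates rather than full Gaussian posteriors, the decoder is a small MLP for CIFAR-sized images, and all module and dimension names (e.g., feat_dim, num_domains) are illustrative rather than taken from the authors' code.

```python
import torch
import torch.nn as nn

class HOOD(nn.Module):
    """Sketch of the HOOD components: content/style encoders g_c and g_s,
    class/domain heads f_c and f_s, and a decoder h for reconstruction."""

    def __init__(self, backbone_c, backbone_s, feat_dim, num_classes, num_domains):
        super().__init__()
        self.g_c = backbone_c                            # content encoder g_c (theta_c)
        self.g_s = backbone_s                            # style encoder g_s (theta_s)
        self.f_c = nn.Linear(feat_dim, num_classes)      # class head f_c (phi_c)
        self.f_s = nn.Linear(feat_dim, num_domains)      # domain head f_s (phi_s)
        self.decoder = nn.Sequential(                    # decoder h (psi): x from [c, s]
            nn.Linear(2 * feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 3 * 32 * 32))

    def forward(self, x):
        c = self.g_c(x)                                  # content feature C
        s = self.g_s(x)                                  # style feature S
        y_logits = self.f_c(c)                           # q_{phi_c}(Y | C)
        d_logits = self.f_s(s)                           # q_{phi_s}(D | S)
        x_rec = self.decoder(torch.cat([c, s], dim=1)).view(-1, 3, 32, 32)
        return c, s, y_logits, d_logits, x_rec
```

In the full variational treatment introduced next, the encoders would instead output the mean and variance of Gaussian posteriors, with C and S sampled via the reparameterization trick.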

2.1. VARIATIONAL INFERENCE FOR CONTENT AND STYLE DISENTANGLEMENT

First, we assume that the data generating process can be captured by certain probability distributions. According to the constructed SCM in Fig. 1(a), the joint distribution $P(X, Y, D, C, S)$ of the variables of interest can be factorized as:
$$P(X, Y, D, C, S) = P(C, S)\,P(Y, D \mid C, S)\,P(X \mid C, S). \quad (1)$$
Based on the SCM in Fig. 1(a), $Y$ and $D$ are conditionally independent given $(C, S)$, i.e., $Y \perp D \mid (C, S)$, so we have $P(Y, D \mid C, S) = P(Y \mid C, S)\,P(D \mid C, S)$. Similarly, we have $P(C, S) = P(C)\,P(S)$. Moreover, $Y$ is not conditioned on $S$ and $D$ is not conditioned on $C$, hence we can further derive $P(Y, D \mid C, S) = P(Y \mid C)\,P(D \mid S)$. However, the aforementioned spurious correlation frequently appears when facing OOD examples (Arjovsky et al., 2019). As a consequence, when variational inference is based on the factorization in Eq. 1, the approximated content $\tilde{C}$ and style $\tilde{S}$ could both directly influence $Y$ and $D$, i.e., $Y \leftarrow \tilde{C} \rightarrow D$ and $Y \leftarrow \tilde{S} \rightarrow D$, thus leading to inaccurate approximations, whereas the desired condition is $Y \leftarrow \tilde{C} \nrightarrow D$ and $Y \nleftarrow \tilde{S} \rightarrow D$. The unwanted correlations $\tilde{C} \rightarrow D$ and $\tilde{S} \rightarrow Y$ in Fig. 1(b) are caused by the erroneous posteriors $P(D \mid C)$ and $P(Y \mid S)$. Therefore, to break these correlations, the posteriors $q_{\phi_s}(D \mid C)$ and $q_{\phi_c}(Y \mid S)$, which are correspondingly approximated by the classifiers $\phi_s$ and $\phi_c$, can be used as denominators to $q_{\phi_c}(Y \mid C)$ and $q_{\phi_s}(D \mid S)$, respectively. In this way, we can successfully disentangle content $C$ and style $S$ and ensure that the decoding of $Y$ and $D$ is not influenced by spurious features from $S$ and $C$, respectively. To this end, the factorization in Eq. 1 can be approximated as:
$$\tilde{P}(X, Y, D, C, S) := \frac{P(C)\,P(S)\,P(Y \mid C)\,P(D \mid S)\,P(X \mid C, S)}{q_{\phi_s}(D \mid C)\,q_{\phi_c}(Y \mid S)}. \quad (2)$$
Then, we maximize the log-likelihood of the joint distribution $p(x, y, d)$ of each data point $(x, y, d)$:
$$\log p(x, y, d) := \log \int_c \int_s \tilde{p}(x, y, d, c, s)\,\mathrm{d}c\,\mathrm{d}s, \quad (3)$$
in which lower-case letters denote the values of the corresponding variables. Since the integration over the latents $C$ and $S$ is intractable, we follow variational inference (Blei et al., 2017) to obtain an approximated evidence lower-bound $\widetilde{\mathrm{ELBO}}(x, y, d)$ of the log-likelihood in Eq. 3:
$$\log p(x, y, d) \geq \mathbb{E}_{(c,s)\sim q_\theta(C,S\mid x)}\left[\log \frac{\tilde{p}(x, y, d, c, s)}{q_\theta(c, s \mid x)}\right] := \widetilde{\mathrm{ELBO}}(x, y, d). \quad (4)$$
Recalling the modified joint distribution factorization in Eq. 2, we have:
$$\begin{aligned}
\widetilde{\mathrm{ELBO}}(x, y, d) &= \mathbb{E}_{(c,s)\sim q_\theta(C,S\mid x)}\left[\log \frac{p(c)\,p(s)\,q_{\phi_c}(y \mid c)\,q_{\phi_s}(d \mid s)\,p_\psi(x \mid c, s)}{q_\theta(c, s \mid x)\,q_{\phi_c}(y \mid s)\,q_{\phi_s}(d \mid c)}\right] \\
&= -\mathrm{KL}\big(q_{\theta_c}(c \mid x)\,\|\,p(C)\big) - \mathrm{KL}\big(q_{\theta_s}(s \mid x)\,\|\,p(S)\big) \\
&\quad + \mathbb{E}_{c\sim q_{\theta_c}(C\mid x)}\big[\log q_{\phi_c}(y \mid c) - \log q_{\phi_s}(d \mid c)\big] + \mathbb{E}_{s\sim q_{\theta_s}(S\mid x)}\big[\log q_{\phi_s}(d \mid s) - \log q_{\phi_c}(y \mid s)\big] \\
&\quad + \mathbb{E}_{(c,s)\sim q_\theta(C,S\mid x)}\big[\log p_\psi(x \mid c, s)\big] \quad (5a) \\
&= \mathrm{ELBO}(x, y, d) - \mathbb{E}_{c\sim q_{\theta_c}(C\mid x)}\big[\log q_{\phi_s}(d \mid c)\big] - \mathbb{E}_{s\sim q_{\theta_s}(S\mid x)}\big[\log q_{\phi_c}(y \mid s)\big]. \quad (5b)
\end{aligned}$$
In Eq. 5a, the first two terms are the Kullback-Leibler divergences between the latent variables $C$ and $S$ and their prior distributions; in practice, we assume that the priors $p(C)$ and $p(S)$ follow standard multivariate Gaussian distributions. The third and fourth terms contain the approximated log-likelihoods of the label predictions and the disentanglement of content and style. The last term stands for the estimated distribution of $x$. Note that in Eq. 5b, our approximated $\widetilde{\mathrm{ELBO}}$ is composed of two parts: the original ELBO, which could be obtained from the factorization in Eq. 1, and two regularization terms that aim to disentangle $C$ and $S$ via the log-likelihood terms $\log q_{\phi_s}(d \mid c)$ and $\log q_{\phi_c}(y \mid s)$, as shown by the dashed lines in Fig. 2. By maximizing $\widetilde{\mathrm{ELBO}}$, we can train an accurate class predictor that is invariant to different styles. The detailed derivation is provided in the supplementary material. Next, we introduce our data augmentation to assist in harnessing OOD examples.
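Before moving to the augmentation step, the following rough sketch illustrates how the approximated ELBO of Eq. 5a could be turned into a training loss. It assumes Gaussian posteriors with the reparameterization trick, a Gaussian decoder (so the reconstruction term reduces to a squared error), and that the cross terms q_{ϕ_s}(d|c) and q_{ϕ_c}(y|s) are obtained by simply feeding the content sample to the domain head and vice versa; this naive treatment of the cross terms, as well as the function names, are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def hood_loss(mu_c, logvar_c, mu_s, logvar_s, model, x, y, d):
    """Negative approximated ELBO (Eq. 5a) to be minimized.
    mu_*/logvar_* parameterize q(C|x) and q(S|x); `model` provides the heads
    f_c, f_s and the decoder as in the architecture sketch above."""
    # Reparameterization: sample c ~ q(C|x) and s ~ q(S|x)
    c = mu_c + torch.randn_like(mu_c) * (0.5 * logvar_c).exp()
    s = mu_s + torch.randn_like(mu_s) * (0.5 * logvar_s).exp()

    # KL(q(C|x) || N(0, I)) and KL(q(S|x) || N(0, I))
    kl_c = -0.5 * torch.sum(1 + logvar_c - mu_c.pow(2) - logvar_c.exp(), dim=1).mean()
    kl_s = -0.5 * torch.sum(1 + logvar_s - mu_s.pow(2) - logvar_s.exp(), dim=1).mean()

    # log q(y|c) - log q(d|c): class should be predictable from content, domain should not
    ll_y_from_c = -F.cross_entropy(model.f_c(c), y)
    ll_d_from_c = -F.cross_entropy(model.f_s(c), d)
    # log q(d|s) - log q(y|s): domain should be predictable from style, class should not
    ll_d_from_s = -F.cross_entropy(model.f_s(s), d)
    ll_y_from_s = -F.cross_entropy(model.f_c(s), y)

    # log p(x|c,s) under a Gaussian decoder reduces to a (negative) squared error
    x_rec = model.decoder(torch.cat([c, s], dim=1)).view_as(x)
    rec = -F.mse_loss(x_rec, x)

    elbo = -kl_c - kl_s + (ll_y_from_c - ll_d_from_c) + (ll_d_from_s - ll_y_from_s) + rec
    return -elbo
```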

2.2. DATA AUGMENTATION WITH CONTENT AND STYLE INTERVENTION

After disentangling content and style, we try to harness OOD examples via two opposite augmentation procedures, namely positive data augmentation and negative data augmentation, which aim to produce benign OOD data $\tilde{x}$ and malign OOD data $\bar{x}$, respectively, so as to further enhance model generalization and improve robustness against anomalies. Specifically, positive data augmentation only conducts intervention on the style feature while keeping the content information the same, whereas negative data augmentation perturbs the content feature while leaving the style unchanged so as to produce malign OOD data, as shown in Fig. 1(b).

To achieve this goal, we employ adversarial data augmentation (Goodfellow et al., 2014; Miyato et al., 2018; Volpi et al., 2018), which can directly conduct intervention on one latent variable without influencing the other, making it well suited to our intuition of separately augmenting content and style. Particularly, by adding a learnable perturbation $e$ to each instance $x$, we can obtain malign OOD data $\bar{x}$ and benign OOD data $\tilde{x}$ with augmented content and style, respectively. For each data point $(x, y, d)$, the perturbation $e$ is obtained by minimizing an intervention objective $L(\cdot)$:
$$e = \arg\min_{e;\,\|e\|_p < \epsilon} L(x + e, y, d; \theta_c, \phi_c, \theta_s, \phi_s), \quad (6)$$
where $\epsilon$ denotes the magnitude of the perturbation $e$ under the $\ell_p$-norm. Since the goals of positive and negative data augmentation are completely different, the intervention objective is designed differently for producing $\tilde{x}$ and $\bar{x}$. For positive data augmentation, the intervention objective is:
$$L_{\mathrm{pos}} = L_d\big(g_c(x; \theta_c),\, g_c(x + e; \theta_c)\big) - L_{\mathrm{ce}}\big(f_s(g_s(x + e; \theta_s); \phi_s),\, d\big), \quad (7)$$
where the first term $L_d(\cdot)$ measures the distance between the contents extracted from the original instance and its perturbed version, and the second term $L_{\mathrm{ce}}(\cdot)$ denotes the cross-entropy loss. By minimizing $L_{\mathrm{pos}}$, the perturbation $e$ does not significantly affect the content feature while introducing a novel style that is distinct from the original domain $d$. Consequently, the augmented benign data with novel styles can be utilized to train a style-invariant model that is resistant to domain shift. Moreover, a specific style with domain label $d'$ can be injected by modifying $L_{\mathrm{pos}}$ as:
$$L'_{\mathrm{pos}} = L_d\big(g_c(x; \theta_c),\, g_c(x + e; \theta_c)\big) + L_{\mathrm{ce}}\big(f_s(g_s(x + e; \theta_s); \phi_s),\, d'\big). \quad (8)$$
Different from Eq. 7, here we minimize the cross-entropy loss so that the perturbed instance contains the style information of a target domain $d'$. As a result, the augmented benign data can successfully bridge the gap between the source and target domains, and further improve the test performance on the target distribution. As for negative data augmentation, the intervention objective is defined as:
$$L_{\mathrm{neg}} = L_d\big(g_s(x; \theta_s),\, g_s(x + e; \theta_s)\big) - L_{\mathrm{ce}}\big(f_c(g_c(x + e; \theta_c); \phi_c),\, y\big). \quad (9)$$
By minimizing $L_{\mathrm{neg}}$, the perturbation does not greatly change the style information but deviates the content from its original one with class label $y$. Subsequently, by recognizing the augmented malign data as unknown, the trained model becomes robust to deceiving anomalies with familiar styles, thus boosting the OOD detection performance. To accomplish the adversarial data augmentation process, we perform multi-step projected gradient descent (Madry et al., 2018; Wang et al., 2021). Formally, the optimal $\tilde{x}$ and $\bar{x}$ can be iteratively found through:
$$\tilde{x}_{t+1} = \tilde{x}_t + \arg\min_{e_t;\,\|e_t\|_p < \epsilon} L_{\mathrm{pos}}(\tilde{x}_t + e_t), \qquad \bar{x}_{t+1} = \bar{x}_t + \arg\min_{e_t;\,\|e_t\|_p < \epsilon} L_{\mathrm{neg}}(\bar{x}_t + e_t), \quad (10)$$
where the number of iterations $t$ is set to 15 in practice. The resulting augmented data are then incorporated into model training, which is described in the next section.
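To make the intervention step concrete, the following is a minimal sketch of the positive intervention in Eqs. 7 and 10, assuming an ℓ∞ constraint, a mean-squared distance for L_d, and a signed-gradient step; the step size and function names are illustrative assumptions, and the negative intervention of Eq. 9 would follow the same pattern with the roles of the content and style branches swapped.

```python
import torch
import torch.nn.functional as F

def positive_intervention(x, d, g_c, g_s, f_s, eps=0.03, step=0.007, n_iter=15):
    """Produce benign OOD data x_tilde (Eq. 7 / Eq. 10): keep the content of x
    (small distance between g_c(x) and g_c(x+e)) while pushing the style away
    from its original domain d (maximize the cross-entropy of f_s(g_s(x+e)) vs. d)."""
    with torch.no_grad():
        c_orig = g_c(x)                                    # reference content feature
    e = torch.zeros_like(x, requires_grad=True)
    for _ in range(n_iter):
        x_adv = x + e
        l_pos = F.mse_loss(g_c(x_adv), c_orig) \
                - F.cross_entropy(f_s(g_s(x_adv)), d)      # intervention objective L_pos
        grad, = torch.autograd.grad(l_pos, e)
        with torch.no_grad():
            e = (e - step * grad.sign()).clamp(-eps, eps)  # descend on L_pos, project to eps-ball
        e.requires_grad_(True)
    return (x + e).detach()                                # benign OOD sample x_tilde
```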

2.3. MODEL TRAINING WITH BENIGN AND MALIGN OOD DATA

Algorithm 1: Training process of HOOD
Input: Labeled set $D_l = \{(x_i, y_i)\}_{i=1}^{l}$, unlabeled set $D_u = \{x_i\}_{i=1}^{u}$.
1: for i = 1 to Max_Iter do
2:   Pre-train the variational inference framework by maximizing $\widetilde{\mathrm{ELBO}}$ in Eq. 5a;
3:   Assign pseudo labels $y^{ps}$ to the unlabeled data, i.e., $D_u := \{(x_i, y_i^{ps})\}_{i=1}^{u}$;
4:   if i == Augmentation_Iter then
5:     Conduct adversarial data augmentation to obtain $\tilde{x}$ and $\bar{x}$ through Eq. 10;
6:     Add $\tilde{x}$ and $\bar{x}$ into $\tilde{D}$ and $\bar{D}$, respectively;
7:   end if
8:   Enumerate $\tilde{D}$ and conduct supervised training on each $\tilde{x}$;
9:   Enumerate $\bar{D}$ and recognize each $\bar{x}$ as unknown;
10: end for

Finally, based on the aforementioned disentanglement and data augmentation in Sections 2.1 and 2.2, we obtain a benign OOD datum $\tilde{x}$ and a malign OOD datum $\bar{x}$ from each data point $(x, y, d)$, which are appended to the benign dataset $\tilde{D}$ and the malign dataset $\bar{D}$, respectively. To utilize a benign OOD datum $\tilde{x}$, we assign it the original class label $y$ and perform supervised training. To separate a malign OOD datum $\bar{x}$, we employ a one-vs-all classifier (Padhy et al., 2020) to recognize it as unknown data that is distinct from its original class label $y$. The proposed HOOD method is summarized in Algorithm 1. Below, we specify the proposed HOOD algorithm for three typical applications with OOD data, namely OOD detection, open-set SSL, and open-set DA.
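Before turning to the applications, the sketch below illustrates how the two augmented sets can enter the loss in the last two steps of Algorithm 1 via a one-vs-all objective. It is a deliberately simplified stand-in: the exact formulation of Padhy et al. (2020) also trains negatives for the non-label classes, so treat the function below and its names as assumptions.

```python
import torch

def ova_loss(logits, y, is_malign):
    """Simplified one-vs-all objective. `logits` has shape [B, K], where each
    column k is an independent binary (in/out) detector for class k.
    Benign samples are trained as positives for their label y; malign samples
    are pushed to be rejected by the detector of their original label y."""
    prob_in = torch.sigmoid(logits)                         # per-class "in" probability
    p_y = prob_in.gather(1, y.unsqueeze(1)).squeeze(1)      # probability for the labelled class
    pos = -torch.log(p_y.clamp_min(1e-8))                   # keep benign x_tilde in class y
    neg = -torch.log((1.0 - p_y).clamp_min(1e-8))           # reject malign x_bar from class y
    return torch.where(is_malign, neg, pos).mean()
```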

2.4. DEPLOYMENT TO OOD APPLICATIONS

Generally, in all three investigated applications, we are given a labeled set $D_l = \{(x_i, y_i)\}_{i=1}^{l}$ containing $l$ labeled examples drawn from a data distribution $P^l$, and an unlabeled set $D_u = \{x_i\}_{i=1}^{u}$ composed of $u$ unlabeled examples sampled from a data distribution $P^u$. Moreover, the label spaces of $D_l$ and $D_u$ are denoted $Y^l$ and $Y^u$, respectively.

OOD detection. The labeled set is used for training, and the unlabeled set is used as a test set which contains both ID data and malign OOD data. Particularly, the distribution $P^u_{id}$ of the unlabeled ID data is the same as the labeled distribution, but the distribution $P^u_{ood}$ of the OOD data is different, i.e., $P^u_{id} = P^l \neq P^u_{ood}$. The goal is to correctly distinguish OOD data from ID data in the test phase. During training, we conduct data augmentation to obtain the domain label $d$, and then follow the workflow described in Algorithm 1 to obtain the model parameters. During test, we only use the content branch to predict the OOD score, which is produced by the one-vs-all classifier. An instance is considered an ID datum if its OOD score is smaller than 0.5, and an OOD datum otherwise.

Open-set SSL. The labeled set $D_l$ and the unlabeled set $D_u$ are both used for training, and they are sampled from the same data distribution with different label spaces. Specifically, the unlabeled data contain some ID data that have the same classes as $D_l$, while the remaining unlabeled OOD data come from unknown classes that do not exist in $D_l$; formally, $Y^l \subset Y^u$, $Y^u \setminus Y^l \neq \emptyset$, and $P^l(x \mid y) = P^u(x \mid y)$ for $y \in Y^l$. The goal is to properly leverage the labeled data and unlabeled ID data without being misled by malign OOD data, and to correctly classify test data with labels in $Y^l$. The training process is similar to OOD detection, except that HOOD produces an OOD score for each unlabeled instance; if an unlabeled instance is recognized as OOD data, it is left out.

Open-set DA. The labeled set is drawn from a source distribution $P^l$ which is different from the target distribution $P^u$ of the unlabeled set. In addition, the label space $Y^l$ is also a subset of $Y^u$. Therefore, the unlabeled data consist of benign OOD data, which have the same class labels as the labeled data, and malign OOD data, whose data distribution as well as class labels are distinct from the labeled data; formally, $P^l \neq P^u$, $Y^l \subset Y^u$, $Y^u \setminus Y^l \neq \emptyset$. The goal is to transfer the knowledge of the labeled data to the benign OOD data while identifying the malign OOD data as unknown. In this application, we assign each target instance a domain label to distinguish it from the other augmented data. Then we alter the positive data augmentation objective from Eq. 7 to Eq. 8 and train the framework through Algorithm 1. During test, HOOD predicts a known class for each target instance if it is benign OOD data, and predicts unknown otherwise.
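For concreteness, the sketch below illustrates the test-time use of the content branch described above for OOD detection and for filtering unlabeled data in open-set SSL; the exact definition of the OOD score from the one-vs-all head is not spelled out in this section, so the "one minus maximum in-probability" form used here, as well as the function names, are assumptions.

```python
import torch

@torch.no_grad()
def ood_score(x, g_c, f_c_ova):
    """OOD score from the content branch only: one minus the largest per-class
    'in' probability of the one-vs-all head. Scores >= 0.5 are treated as OOD."""
    prob_in = torch.sigmoid(f_c_ova(g_c(x)))        # per-class in-distribution probabilities
    return 1.0 - prob_in.max(dim=1).values          # high score -> likely malign OOD

@torch.no_grad()
def filter_unlabeled(x, g_c, f_c_ova, thr=0.5):
    """Open-set SSL usage: keep only the unlabeled instances scored as ID."""
    keep = ood_score(x, g_c, f_c_ova) < thr
    return x[keep], x[~keep]                        # (retained ID data, rejected OOD data)
```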

3. EXPERIMENT

In this section, we first describe the implementation details. Then, we experimentally validate our method on three applications, namely OOD detection, open-set SSL, and open-set DA. Finally, we present extensive performance analysis on our disentanglement and intervention modules. Additional details and quantitative findings can be found in the supplementary material.

3.1. IMPLEMENTATION DETAILS

In our experiments, we choose Wide ResNet-28-2 (Zagoruyko & Komodakis, 2016) for the OOD detection and open-set SSL tasks, and follow (You et al., 2020; Cao et al., 2019) to utilize a ResNet50 pre-trained on ImageNet (Russakovsky et al., 2015) for open-set DA. For implementing HOOD, we randomly choose 4 augmentation methods from the transformation pool in RandAugment (Cubuk et al., 2020) to simulate different styles. The pre-training iteration Augmentation_Iter is set to 100,000, and the perturbation magnitude is set to ϵ = 0.03 following (Volpi et al., 2018) in all experiments. Next, we experimentally validate HOOD in the three applications.
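For illustration, the snippet below sketches how style domains could be simulated by sampling transformations and attaching domain labels, with domain 0 reserved for the original input (as described in the appendix); the specific operations in the pool and their parameters are illustrative placeholders, not the actual RandAugment pool.

```python
import random
from torchvision.transforms import functional as TF

# A small pool of style-like transformations standing in for the RandAugment pool;
# the operations and their parameters below are illustrative assumptions.
STYLE_POOL = [
    lambda img: TF.adjust_brightness(img, 1.8),
    lambda img: TF.adjust_contrast(img, 1.8),
    lambda img: TF.rotate(img, 15),
    lambda img: TF.gaussian_blur(img, kernel_size=3),
]

def make_domain_labelled_views(img, num_styles=4):
    """Create (view, domain_label) pairs: domain 0 is the original image and
    domains 1..num_styles correspond to the sampled style transformations."""
    ops = random.sample(STYLE_POOL, num_styles)
    views = [(img, 0)] + [(op(img), i + 1) for i, op in enumerate(ops)]
    return views
```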

3.2. OOD DETECTION

In the OOD detection task, we use SVHN (Netzer et al., 2011) and CIFAR10 (Krizhevsky et al., 2009) as the ID datasets, and use the LSUN (Yu et al., 2015), DTD (Cimpoi et al., 2014), CUB (Wah et al., 2011), Flowers (Nilsback & Zisserman, 2006), Caltech (Griffin et al., 2007), and Dogs (Khosla et al., 2011) datasets as the OOD datasets that occur during the test phase. Particularly, to explore the model generalization ability, we only sample 100 labeled data per class and 20,000 unlabeled data, conduct semi-supervised training, and then test the trained model on the unlabeled OOD datasets. To evaluate the performance, we utilize AUROC (Hendrycks & Gimpel, 2017), an essential metric for OOD detection; a higher AUROC value indicates a better performance. For comparison, we choose some typical OOD detection methods, including Likelihood (Hendrycks & Gimpel, 2017), which simply utilizes the softmax score as the detection criterion, ODIN (Liang et al., 2018), which enhances Likelihood through adding adversarial attack, Likelihood Ratio (Ren et al., 2019), which modifies the softmax score by focusing on the semantic feature, and OpenGAN (Kong & Ramanan, 2021), which further improves the performance by separating generated "hard" examples that are deceivingly close to ID data. The experimental results are shown in Table 1.
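For reference, AUROC can be computed directly from the detector's scores on the ID and OOD test splits; the small helper below (using scikit-learn, with hypothetical argument names) follows the convention that higher scores indicate OOD.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc_ood(scores_id, scores_ood):
    """AUROC for OOD detection: scores_id / scores_ood are the detector's
    scores on ID and OOD test data, with higher scores meaning more OOD-like."""
    y_true = np.concatenate([np.zeros(len(scores_id)), np.ones(len(scores_ood))])
    y_score = np.concatenate([scores_id, scores_ood])
    return roc_auc_score(y_true, y_score)
```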

3.3. OPEN-SET SSL

In the open-set SSL task, we follow (Guo et al., 2020) to construct our training dataset using two benchmark datasets, CIFAR10 and CIFAR100 (Krizhevsky et al., 2009), which contain 10 and 100 classes, respectively. The constructed dataset has 20,000 randomly sampled unlabeled data and a varied number of labeled data; here the number of labeled data is set to 50, 100, and 400 per class in both CIFAR10 and CIFAR100. Moreover, to create the open-set problem in CIFAR10, the unlabeled data are sampled from all 10 classes and the labeled data are sampled from the 6 animal classes. As for CIFAR100, the unlabeled data are sampled from all 100 classes and the labeled data are sampled from the first 60 classes. For evaluation, we first use the test sets from the original CIFAR10 and CIFAR100 and denote the test accuracy as "Clean Acc.". Further, to evaluate the capability of handling OOD examples, we test on CIFAR10-C and CIFAR100-C (Hendrycks & Dietterich, 2019), which add different types of corruptions to CIFAR10 and CIFAR100, respectively. The test accuracy on the corrupted datasets reveals the robustness of neural networks against corruptions and perturbations, and it is denoted as "Corrupted Acc.". For comparison, we choose some typical open-set SSL methods, including the Uncertainty-Aware Self-Distillation method UASD (Chen et al., 2020) and T2T (Huang et al., 2021a), which filter out the OOD data via OOD detection, Safe Deep Semi-Supervised Learning DS3L (Guo et al., 2020), which employs meta-learning to down-weight the OOD data, the Multi-Task Curriculum Framework MTCF (Yu et al., 2020), which recognizes the OOD data as a different domain, and OpenMatch (Saito et al., 2021), which utilizes open-set consistency training on OOD data. The experimental results are shown in Table 2. Compared to the strongest baseline method OpenMatch, which randomly samples eleven different transformations from a transformation pool, our method is limited to only four types of transformations. Regarding the Clean Acc. on CIFAR10 and CIFAR100, the proposed HOOD is slightly outperformed by OpenMatch. However, thanks to the disentanglement, HOOD can be invariant to different styles and focus on the content feature. Therefore, when facing corruption, HOOD is more robust than all baseline methods; as shown by the Corrupted Acc. results, our method surpasses OpenMatch by more than 3%.

3.4. OPEN-SET DA

In the open-set DA task, we follow (Saito et al., 2018) to validate on two DA benchmark datasets, Office (Saenko et al., 2010) and VisDA (Peng et al., 2018). The Office dataset contains three domains, Amazon (A), Webcam (W), and DSLR (D), and each domain is composed of 31 classes. The VisDA dataset contains two domains, Synthetic and Real, and each domain consists of 12 classes. To create an open-set situation in Office, we follow (Saito et al., 2018; Liu et al., 2019) to construct the source dataset by sampling from the first 21 classes in alphabetical order; the target dataset is then sampled from all 31 classes. As for VisDA, we choose the first 6 classes for the source domain and use all 12 classes for the target domain. We use "A→W" to indicate the transfer from domain "A" to domain "W". For comparison, we choose three typical open-set DA approaches, including Open-Set DA by Back-Propagation OSBP (Saito et al., 2018), which employs an OpenMax classifier to recognize unknown classes and performs gradient flipping for open-set DA, Universal Adaptation Network UAN (You et al., 2020), which utilizes entropy and domain similarity to down-weight malign OOD data, and Separate To Adapt STA (Liu et al., 2019), which utilizes an SVM to separate the malign OOD data. The experimental results are shown in Table 3. Compared to the baseline methods, the proposed HOOD largely benefits from the generated benign OOD data, which have two major strengths: 1) they resemble target-domain data by having common styles, and 2) their labels are accessible as they share the same content as their corresponding source data. Therefore, by conducting supervised training on such benign OOD data, the domain gap can be further mitigated, thus achieving better performance than the baseline methods. Quantitative results show that HOOD surpasses the other methods in most scenarios. Especially on VisDA, HOOD outperforms the second-best method by 6%, which proves the effectiveness of HOOD in dealing with open-set DA.

Analysis on Augmentation Number: Since HOOD does not introduce any extra hyper-parameter, the most influential setting is the number of data augmentations. To analyze its influence on the learning results, we vary the number of augmentations sampled from the RandAugment pool (Cubuk et al., 2020) from 2 to 6. The results are shown in Fig. 4. We can see that both too few and too many augmentations hurt the results: a small augmentation number undermines the generalization to various styles, while a large augmentation number increases the classification difficulty of the style branch, further making the disentanglement hard to achieve. Therefore, setting the augmentation number to 4 is reasonable.

3.5. PERFORMANCE ANALYSIS

Visualization: Furthermore, to show the effect of our data augmentations, we visualize the augmented images obtained by applying a large perturbation magnitude (4.7) (Tsipras et al., 2018) in Fig. 5, where the model prediction is shown below each image. We can see that the negative data augmentation significantly changes the content, which becomes almost unidentifiable. In contrast, positive data augmentation preserves most of the content information and only changes the style of the images. Therefore, the augmented data can be correctly leveraged to help train a robust classifier.

Disentanglement of Content and Style:

To further verify that our disentanglement between content and style is effective, we select the latent variables from different content and style categories and use the learned class and domain classifiers for cross-prediction. Specifically, there are four kinds of input-prediction pairs: content-class, content-domain, style-class, and style-domain. As we can see in Fig. 3, only the content features are meaningful for class prediction, and the same holds for the style input and domain prediction. In contrast, the style features cannot be identified by the class predictor, nor can the content features be identified by the domain predictor. Therefore, we can reasonably conclude that our disentanglement between content and style is effectively achieved.
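A simple way to reproduce this probe is sketched below: each latent feature is fed to both heads and the four accuracies are collected. The loader format (x, y, d), the device handling, and the assumption that both heads accept either feature (i.e., matching feature dimensions) are illustrative.

```python
import torch

@torch.no_grad()
def cross_prediction_check(loader, g_c, g_s, f_c, f_s, device="cuda"):
    """Cross-prediction probe for disentanglement: feed the content feature to
    both the class and domain heads, and likewise for the style feature.
    Only content->class and style->domain are expected to be accurate."""
    acc = {"content-class": 0, "content-domain": 0, "style-class": 0, "style-domain": 0}
    n = 0
    for x, y, d in loader:
        x, y, d = x.to(device), y.to(device), d.to(device)
        c, s = g_c(x), g_s(x)
        acc["content-class"] += (f_c(c).argmax(1) == y).sum().item()
        acc["content-domain"] += (f_s(c).argmax(1) == d).sum().item()
        acc["style-class"] += (f_c(s).argmax(1) == y).sum().item()
        acc["style-domain"] += (f_s(s).argmax(1) == d).sum().item()
        n += x.size(0)
    return {k: v / n for k, v in acc.items()}
```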

4. CONCLUSION

In this paper, we propose HOOD to effectively harness OOD examples. Specifically, we construct an SCM to disentangle content and style, which can be leveraged to identify benign and malign OOD data. Subsequently, by maximizing the approximated ELBO, we can successfully disentangle the content and style features and break the spurious correlation between class and domain. As a result, HOOD is more robust when facing distribution shifts and unseen OOD data. Furthermore, we augment the content and style through a novel intervention process to produce malign and benign OOD data, respectively, which can be leveraged to improve classification and OOD detection performance. Extensive experiments empirically validate the effectiveness of HOOD on three typical OOD applications.

A APPENDIX

In this supplementary material, we first discuss some related work in Section B. Then, we complement the implementation details of HOOD in Section C. Further, we present the detailed derivation of the ELBO in Section D. Additionally, we discuss two notable causal graph assumptions in Section E. Moreover, to further understand the effectiveness of HOOD, we provide additional analysis in Section F. Finally, we discuss the limitations and social impact of our method in Section G.

B RELATED WORK

OOD applications cover three typical problems, namely OOD detection, open-set SSL, and open-set DA. OOD detection (Hendrycks & Gimpel, 2017; Liang et al., 2018) aims to train a robust model which can accurately identify the newly-emerged malign OOD data during the test phase. Open-set SSL (Guo et al., 2020; Chen et al., 2020; He et al., 2022a;b; Yu et al., 2020) deals with the problem where labeled data are scarce and the unlabeled data are contaminated by malign OOD data. As for open-set DA (Saito et al., 2018; Liu et al., 2019), it tries to transfer the knowledge from source ID data to the benign OOD data in the target domain, meanwhile detecting the malign OOD data that are encountered during transfer. In all three applications, the predictive confidence has been frequently leveraged to separate malign OOD data (Hendrycks & Gimpel, 2017; Liang et al., 2018; Ren et al., 2019; Xia et al., 2022a). Moreover, ID data and OOD data can be distinguished via a discriminator (Kong & Ramanan, 2021; Neal et al., 2018; Yu et al., 2020; Xia et al., 2021). Further, various open classifiers have been designed to predict OOD data as unknown (Ge et al., 2017; Padhy et al., 2020; Saito et al., 2018). Thanks to the advances in unsupervised learning, many approaches employ self-supervised learning to make ID data and OOD data separable from each other (Cao et al., 2022; Li et al., 2021a; Saito et al., 2020).

Causality in OOD problems mainly focuses on learning invariant representations that stay constant when other causal factors change, thus achieving better performance when facing non-stationary data distributions. To accomplish this goal, it is common to learn causal factors and non-causal factors through the variational auto-encoder framework (Blei et al., 2017; Kingma & Welling, 2013). Thanks to this, domain adaptation (Gong et al., 2016; Schölkopf et al., 2011; Zhang et al., 2013) and domain generalization (Li et al., 2018; Shankar et al., 2018) can be tackled by extracting domain-invariant features. Moreover, based on causal effects, biased features can be eliminated through re-weighting (Bahadori et al., 2017; Shen et al., 2018). Additionally, the spurious correlation which is harmful to inference can be alleviated through do-calculus (Lee et al., 2021; Nam et al., 2020; Pearl, 2009). Recent methods (Ilse et al., 2021; Mitrovic et al., 2020; Von Kügelgen et al., 2021) conduct data augmentation with self-supervised learning to train a robust model that can handle distribution shifts and corruptions. In general, HOOD has two major differences from existing methods in OOD applications and causality. On one hand, instead of treating an image instance as a whole, as is commonly done in many approaches, HOOD can properly leverage OOD examples through their disentangled contents and styles; moreover, augmenting content and style can help improve generalization and robustness simultaneously. On the other hand, current causal approaches are incapable of dealing with malign OOD data, whereas HOOD is able to learn style-invariant features from benign OOD data while avoiding the damage brought by malign OOD data.

C COMPLEMENTARY DETAILS

Each trial of our experiments is conducted on a single NVIDIA 3090 GPU. In the open-set SSL and OOD detection tasks, we follow (Saito et al., 2021) by using Wide ResNet-28-2 (Zagoruyko & Komodakis, 2016) as the network backbone and train from scratch. In the open-set DA application, we follow (Saito et al., 2018) to fine-tune a ResNet50 pre-trained on ImageNet (Russakovsky et al., 2015). The employed Stochastic Gradient Descent (SGD) optimizer starts with an initial learning rate of 3e-2, which is decayed by following the cosine function cos(const × current_iteration / 500,000) without warm-up, where const is a constant that we follow (Saito et al., 2021) to set as (7/16)π. The momentum factor is set to 0.9, which is also the same as (Saito et al., 2021). For choosing the pseudo labels of unlabeled data, we follow (Lee, 2013) to set the pseudo-label threshold to 0.95; the unlabeled examples with confidence smaller than 0.95 are excluded from training the class branch, but all of them are leveraged to optimize the domain branch. Moreover, the content and style features are the outputs of the penultimate layer, which are further fed into the fully-connected layers to produce the class and domain predictions. Here the class number is decided by the class labels, and the domain number is the number of augmentations plus one logit that stands for the original input.
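A minimal sketch of this optimizer setup is given below, assuming the scheduler is stepped once per training iteration; the helper name and the use of LambdaLR are assumptions, while the constants (3e-2, momentum 0.9, (7/16)π, 500,000 iterations) come from the text above.

```python
import math
import torch

def build_optimizer_and_scheduler(params, total_iters=500_000, base_lr=3e-2):
    """SGD with the cosine decay described above:
    lr(t) = base_lr * cos((7/16) * pi * t / total_iters), stepped per iteration."""
    optimizer = torch.optim.SGD(params, lr=base_lr, momentum=0.9)
    const = 7.0 / 16.0 * math.pi
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda t: math.cos(const * t / total_iters))
    return optimizer, scheduler
```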

D DERIVATION OF ELBO

In the main paper, the modified structural causal model is factorized as:
$$\tilde{P}(X, Y, D, C, S) = \frac{P(C)\,P(S)\,P(Y \mid C)\,P(D \mid S)\,P(X \mid C, S)}{P(D \mid C)\,P(Y \mid S)}.$$
We employ two encoders to model the distributions of content and style, respectively: $q_{\theta_c}(C \mid X)$ and $q_{\theta_s}(S \mid X)$. Moreover, we utilize two classifiers to model the posteriors of class and domain, respectively: $q_{\phi_c}(Y \mid C)$ and $q_{\phi_s}(D \mid S)$. Additionally, a decoder is employed to reconstruct the input instance through the distribution $p_\psi(X \mid C, S)$. Then, our goal is to maximize the log-likelihood of the joint distribution $p(x, y, d)$:
$$\begin{aligned}
\log p(x, y, d) &= \log \int_c \int_s \tilde{p}(x, y, d, c, s)\,\mathrm{d}c\,\mathrm{d}s \\
&= \log \int_c \int_s \tilde{p}(x, y, d, c, s)\,\frac{q_\theta(c, s \mid x)}{q_\theta(c, s \mid x)}\,\mathrm{d}c\,\mathrm{d}s \\
&= \log \mathbb{E}_{(c,s)\sim q_\theta(C,S\mid x)}\left[\frac{\tilde{p}(x, y, d, c, s)}{q_\theta(c, s \mid x)}\right] \\
&\geq \mathbb{E}_{(c,s)\sim q_\theta(C,S\mid x)}\left[\log \frac{\tilde{p}(x, y, d, c, s)}{q_\theta(c, s \mid x)}\right] := \widetilde{\mathrm{ELBO}}(x, y, d).
\end{aligned}$$
By applying Eq. 2 to $\tilde{p}(x, y, d, c, s)$, and using $q_\theta(c, s \mid x) = q_{\theta_c}(c \mid x)\,q_{\theta_s}(s \mid x)$ since the content and style are encoded independently, we have:
$$\begin{aligned}
\widetilde{\mathrm{ELBO}}(x, y, d) &= \mathbb{E}_{(c,s)\sim q_\theta(C,S\mid x)}\left[\log \frac{p(c)\,p(s)\,q_{\phi_c}(y \mid c)\,q_{\phi_s}(d \mid s)\,p_\psi(x \mid c, s)}{q_\theta(c, s \mid x)\,q_{\phi_c}(y \mid s)\,q_{\phi_s}(d \mid c)}\right] \\
&= \mathbb{E}_{(c,s)\sim q_\theta(C,S\mid x)}\left[\log \frac{p(c)\,p(s)}{q_{\theta_c}(c \mid x)\,q_{\theta_s}(s \mid x)}\right] + \mathbb{E}_{(c,s)\sim q_\theta(C,S\mid x)}\left[\log \frac{q_{\phi_c}(y \mid c)}{q_{\phi_s}(d \mid c)}\right] \\
&\quad + \mathbb{E}_{(c,s)\sim q_\theta(C,S\mid x)}\left[\log \frac{q_{\phi_s}(d \mid s)}{q_{\phi_c}(y \mid s)}\right] + \mathbb{E}_{(c,s)\sim q_\theta(C,S\mid x)}\big[\log p_\psi(x \mid c, s)\big] \\
&= \mathbb{E}_{c\sim q_{\theta_c}(C\mid x)}\left[\log \frac{p(c)}{q_{\theta_c}(c \mid x)}\right] + \mathbb{E}_{s\sim q_{\theta_s}(S\mid x)}\left[\log \frac{p(s)}{q_{\theta_s}(s \mid x)}\right] \\
&\quad + \mathbb{E}_{c\sim q_{\theta_c}(C\mid x)}\left[\log \frac{q_{\phi_c}(y \mid c)}{q_{\phi_s}(d \mid c)}\right] + \mathbb{E}_{s\sim q_{\theta_s}(S\mid x)}\left[\log \frac{q_{\phi_s}(d \mid s)}{q_{\phi_c}(y \mid s)}\right] + \mathbb{E}_{(c,s)\sim q_\theta(C,S\mid x)}\big[\log p_\psi(x \mid c, s)\big] \\
&= -\mathrm{KL}\big(q_{\theta_c}(c \mid x)\,\|\,p(C)\big) - \mathrm{KL}\big(q_{\theta_s}(s \mid x)\,\|\,p(S)\big) \\
&\quad + \mathbb{E}_{c\sim q_{\theta_c}(C\mid x)}\big[\log q_{\phi_c}(y \mid c) - \log q_{\phi_s}(d \mid c)\big] + \mathbb{E}_{s\sim q_{\theta_s}(S\mid x)}\big[\log q_{\phi_s}(d \mid s) - \log q_{\phi_c}(y \mid s)\big] \\
&\quad + \mathbb{E}_{(c,s)\sim q_\theta(C,S\mid x)}\big[\log p_\psi(x \mid c, s)\big],
\end{aligned}$$
which recovers Eq. 5a in the main paper.

F.1 VISUALIZATION OF HIGHER RESOLUTION IMAGES

In the main paper, we provided the visualization of augmented CIFAR10 images. To verify that our positive and negative data augmentations are still effective on higher-resolution images, we conduct experiments using ImageNet30 with resolution 256 × 256, and show the augmented benign and malign data in Fig. 7. We observe a similar phenomenon as on CIFAR10: under negative data augmentation, the objects become completely unidentifiable, which leads to erroneous model predictions. On the contrary, positive data augmentation only changes the style (three style types are presented: purple, green, and sharp-texture) but leaves the objects intact; as a result, the model predictions are usually correct. Therefore, our augmentation method can be effectively deployed to high-resolution images.

F.4 OOD SCORE

To show the effectiveness of distinguishing malign OOD data from benign OOD data, we test HOOD on the three applications, observe the OOD scores of benign and malign OOD data, and report the averaged OOD scores in Table 5. We can see that the OOD score produced by our one-vs-all classifier can clearly distinguish benign and malign OOD data during the test phase, which again validates the effectiveness of HOOD.

G LIMITATION AND SOCIAL IMPACT

The proposed HOOD method has many advantages, which have been demonstrated in the main paper. However, HOOD is still limited in some aspects. Firstly, HOOD contains an extra phase that computes the gradient to produce the augmented data, so it cannot be conducted in an end-to-end manner; it is therefore worthwhile to improve HOOD by designing a more compact method that incorporates augmentation into the training process. Secondly, HOOD utilizes existing data augmentation techniques to simulate different styles, which cannot perfectly cover all possible styles in the real world; hence, better style simulation might further improve the learning performance of HOOD. Regardless of these limitations, our method could have some positive social impacts. First, HOOD can be safely deployed in many open situations where many unknown classes exist, as practical problems contain lots of uncertainty and novel instances occur constantly; thanks to the negative data augmentation of HOOD, such novel instances can be successfully identified and do not harm the prediction accuracy. Secondly, in many non-stationary environments, the knowledge of the backbone model can be easily transferred thanks to the positive data augmentation of HOOD, which further broadens the practical usage of HOOD in modern industry.



Figure 1: (a) An ideal causal diagram which reveals the data generating process. (b) Illustration of our disentanglement. The brown-edged variables C̃ and S̃ are approximations of content C and style S. The dashed lines indicate the unwanted causal relations that should be broken. (c) Illustration of the data augmentation of HOOD. The green lines and red lines denote the augmentation of benign OOD data and malign OOD data, respectively. In all figures, the blank variables are observable and the shaded variables are latent.

Figure 2: Architecture of the HOOD. The solid lines denote the inference flow, the dashed lines indicate the disentanglement of content and style, and the tildes stand for the approximation of the corresponding variables.


Figure 4: Augmentation number analysis.

Figure 3: Illustration of disentanglement between content and style on CIFAR10. The number in each cell denotes the prediction probability.



Figure 6: Comparison of two different causal relationship assumptions.

F.2 T-SNE VISUALIZATION

To further demonstrate the effectiveness of our disentanglement, we show the t-SNE (Maaten & Hinton, 2008) visualization of the features extracted with or without disentanglement, as shown in Fig. 8. We can see that when training without disentanglement, the features are closely gathered in each cluster; however, the malign OOD data, represented by gray and pink points, are also tightly aligned within the clusters, which would damage the model robustness. In contrast, when training with disentanglement, there is a slight gap between OOD data and ID data, which means that the model trained with disentanglement can avoid overfitting to specific styles and shows better robustness against OOD data.

Figure 8: t-SNE visualization on CIFAR10 dataset. The blue, brown, yellow, green, orange, and purple points are ID data, and the gray and pink points are OOD data.

Table 1: Comparison with typical OOD detection methods. Averaged AUROC (%) with standard deviations computed over three independent trials. The best results are highlighted in bold.

Table 2: Comparison with typical open-set SSL methods. Averaged test accuracies (%) with standard deviations computed over three independent trials. The best results are highlighted in bold.


Table 3: Comparison with typical open-set DA methods. Averaged test accuracies (%) with standard deviations computed over three independent trials. The best results are highlighted in bold.

Table 4: Ablation study on the necessity of each module.

The experimental results of the ablation study are shown in Table 4. We can see that each module influences the performance differently in the three applications. First, we can see that the malign OOD data is essential for OOD detection.

Table 5: Averaged OOD scores with standard deviations on the three applications.

Additionally, to give a quantitative comparison of the execution efficiency of HOOD, we provide the running time on a 3090 GPU compared to some typical baseline methods. The results are shown in Table 6. Note that our method involves causal disentanglement as well as adversarial training; therefore, its training time is longer than that of the other methods.

Table 6: Execution efficiency comparisons on the three applications.

5. ACKNOWLEDGEMENT

Li Shen was partially supported by the Major Science and Technology Innovation 2030 "New Generation Artificial Intelligence" key project No. 2021ZD0111700. Zhuo Huang was supported by JD Technology Scholarship for Postgraduate Research in Artificial Intelligence No. SC4103. Xiaobo Xia was supported by Australian Research Council Projects DE-190101473 and Google PhD Fellowship. Bo Han was supported by NSFC Young Scientists Fund No. 62006202, Guangdong Basic and Applied Basic Research Foundation No. 2022A1515011652 and RGC Early Career Scheme No. 22200720. Mingming Gong was supported by ARC DE210101624. Chen Gong was supported by NSF of China No. 61973162, NSF of Jiangsu Province No. BZ2021013, NSF for Distinguished Young Scholar of Jiangsu Province No. BK20220080, and CAAI-Huawei MindSpore Open Fund. Tongliang Liu was partially supported by Australian Research Council Projects IC-190100031, LP-220100527, DP-220102121, and FT-220100318.

