OUT-OF-DISTRIBUTION DETECTION WITH IMPLICIT OUTLIER TRANSFORMATION

Abstract

Outlier exposure (OE) is powerful in out-of-distribution (OOD) detection, enhancing detection capability via model fine-tuning with surrogate OOD data. However, surrogate data typically deviate from test OOD data, so the performance of OE can weaken when facing unseen OOD data. To address this issue, we propose a novel OE-based approach that makes the model perform well even for unseen OOD cases. It leads to a min-max learning scheme: searching to synthesize OOD data that lead to the worst judgments, and learning from such OOD data for uniform performance in OOD detection. In our realization, these worst-case OOD data are synthesized by transforming the original surrogate ones. Specifically, the associated transform functions are learned implicitly, based on our novel insight that model perturbation leads to data transformation. Our methodology offers an efficient way of synthesizing extra OOD data, beyond the surrogate ones, that can further benefit the detection model. We conduct extensive experiments under various OOD detection setups, demonstrating the effectiveness of our method against its advanced counterparts.

1. INTRODUCTION

Deep learning systems in the open world often encounter out-of-distribution (OOD) data whose label space is disjoint from that of the in-distribution (ID) samples. For many safety-critical applications, deep models should make reliable predictions for ID data, while OOD cases (Bulusu et al., 2020) should be reported as anomalies. It leads to the well-known OOD detection problem (Lee et al., 2018c; Fang et al., 2022), which has attracted intensive attention in reliable machine learning. OOD detection remains non-trivial, since deep models can be over-confident when facing OOD data (Nguyen et al., 2015; Bendale & Boult, 2016), and many efforts have been made in pursuing reliable detection models (Yang et al., 2021; Salehi et al., 2021).

Building upon discriminative models, existing OOD detection methods generally fall into two categories, namely, post-hoc approaches and fine-tuning approaches. The post-hoc approaches assume a model well-trained on ID data with fixed parameters, using model responses to devise various scoring functions that indicate ID and OOD cases (Hendrycks & Gimpel, 2017; Liang et al., 2018; Lee et al., 2018c; Liu et al., 2020; Sun et al., 2021; 2022; Wang et al., 2022). By contrast, the fine-tuning methods allow the target model to be further adjusted, boosting its detection capability via regularization (Lee et al., 2018a; Hendrycks et al., 2019; Tack et al., 2020; Mohseni et al., 2020; Sehwag et al., 2021; Chen et al., 2021; Du et al., 2022; Ming et al., 2022; Bitterwolf et al., 2022). Typically, fine-tuning approaches benefit from explicit knowledge of unknowns during training and thus generally reveal reliable performance across various real-world situations (Yang et al., 2021). Among the fine-tuning approaches, outlier exposure (OE) (Hendrycks et al., 2019) is a representative method, which fine-tunes the model with surrogate OOD data. By regularizing the model toward low-confidence predictions on these surrogate OOD data, OE explicitly enables the detection model to learn knowledge for effective OOD detection.
A caveat is that one can hardly know what kind of OOD data will be encountered when the model is deployed. Thus, a distribution gap exists between surrogate (training-time) and unseen (test-time) OOD cases. This distribution gap is harmful for OOD detection, since one can hardly ensure the model performance when facing OOD data that largely deviate from the surrogate OOD data (Yang et al., 2021; Dong et al., 2020). Addressing the OOD distribution gap issue is essential but challenging for OE. Several works are related to this problem, typically shrinking the gap by making the model learn from additional OOD data. For example, Lee et al. (2018a) use generative models to synthesize OOD data on which the model makes mistakes, and the detection model then learns to assign low-confidence predictions to these synthetic data. However, synthesizing unseen OOD data is intractable in general (Du et al., 2022), meaning that the corresponding data may not fully benefit OE training. Instead, Zhang et al. (2023) mix up ID and surrogate OOD data to expand the coverage of OOD cases, and Du et al. (2022) sample OOD data from the low-likelihood region of the class-conditional distribution in a low-dimensional feature space. However, linear interpolation in the former can hardly cover diverse OOD situations, and feature-space data generation in the latter may fail to fully benefit the underlying feature extractor. Hence, there is still a long way to go in addressing the OOD distribution gap issue in OE.

To overcome the above drawbacks, we suggest a simple yet powerful way to access extra OOD data: we transform the available surrogate data into new OOD data that further benefit our detection models. The key insight is that model perturbation implicitly leads to data transformation, and the detection model can learn from such implicit data by updating the model after its perturbation.
The associated transform functions are free from tedious manual designs (Zhang et al., 2023; Huang et al., 2023) and complex generative models (Lee et al., 2018b), while remaining flexible enough to synthesize OOD data that deviate from the original data. Two factors support the effectiveness of our data synthesis: 1) the implicit data follow a distribution different from that of the original data (cf., Theorem 1), and 2) the discrepancy between the original and transformed data distributions can be very large, given that our detection model is deep enough (cf., Lemma 1). It indicates that one can effectively synthesize extra OOD data that are largely different from the original ones, and then learn from such data to further benefit the detection model.

Accordingly, we propose Distributional-agnostic Outlier Exposure (DOE), a novel OE-based approach built upon our implicit data transformation. The term "distributional-agnostic" reflects our ultimate goal of making the detection model perform uniformly well with respect to various unseen OOD distributions while accessing only ID and surrogate OOD data during training. In DOE, we measure the model performance in OOD detection by the worst OOD regret (WOR) regarding a candidate set of OOD distributions (cf., Definition 2), leading to a min-max learning scheme as in equation 6. Then, based on our systematic way of implicit data synthesis, we iterate between 1) searching for implicit OOD data that lead to a large WOR via model perturbation and 2) learning from such data for uniform detection power.

DOE is related to distributionally robust optimization (DRO) (Rahimian & Mehrotra, 2019), which similarly learns from worst-case distributions. Their conceptual comparison is summarized in Figure 1. Therein, DRO considers a closed-world setting, striving for uniform performance regarding various data distributions within the support (Sagawa et al., 2020).
However, it fails in the open-world OOD settings that require detecting unseen data (cf., Section 5.3), i.e., the part of the test support that is disjoint from the surrogate one in Figure 1(b). By contrast, our data transformation offers an effective approach to learning from unseen data, pursuing uniform performance even for regions beyond the surrogate support. Thus, DOE can mitigate the distribution gap issue to some extent, reflected by the smaller disjoint region than in the DRO case in Figure 1(c).

We conduct extensive experiments in Section 5 on widely used benchmark datasets, verifying the effectiveness of our method across a wide range of OOD detection setups. For common OOD detection, our DOE reduces the average FPR95 by 7.26%, 20.30%, and 13.97% compared with the original OE on the CIFAR-10, CIFAR-100, and ImageNet datasets. For hard OOD detection, our DOE reduces the FPR95 by 7.45%, 7.75%, and 4.09% compared with advanced methods on various hard OOD datasets.

2. PRELIMINARY

Let X ⊆ R^d denote the input space and Y = {1, ..., C} the label space. We consider the ID distribution D_ID defined over X × Y and the OOD distribution D_OOD defined over X. In general, the OOD distribution D_OOD is defined as an irrelevant distribution whose label set has no intersection with Y (Yang et al., 2021); it is unseen during training and should not be predicted by the model.

2.1. SOFTMAX SCORING

Building upon the model h ∈ H : X → R^C with logit outputs, our goal is to utilize a scoring function s : X → R to discern test-time inputs drawn from D_ID from those drawn from D_OOD. Typically, if the score s(x) is greater than a threshold τ ∈ R, the associated input x ∈ X is taken as an ID case, and otherwise as an OOD case. A representative scoring function in the literature is the maximum softmax prediction (MSP) (Hendrycks & Gimpel, 2017), following s_MSP(x; h) = max_k softmax_k(h(x)), where softmax_k(•) denotes the k-th element of the softmax output. Since the true labels of OOD data are not in the label space, the model is expected to return lower scores for them than for ID cases.
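As a concrete illustration, the MSP score and its thresholding rule can be sketched in a few lines of numpy (function names are ours, not from the paper):

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over the last axis
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def msp_score(logits):
    # s_MSP(x; h) = max_k softmax_k(h(x))
    return softmax(logits).max(axis=-1)

def detect(logits, tau):
    # True -> treated as ID, False -> reported as OOD
    return msp_score(logits) > tau
```

For instance, a confident logit vector such as [10, 0, 0] scores close to 1 (ID-like), whereas a perfectly flat vector scores 1/C.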

2.2. OUTLIER EXPOSURE

Unfortunately, for a normally trained model h(•), MSP may make over-confident predictions for some OOD data (Liu et al., 2020), which is detrimental to effective OOD detection. To this end, OE (Hendrycks et al., 2019) boosts the detection capability by making the model h(•) learn from the surrogate OOD distribution D_OOD^s, with the associated learning objective of the form

L(h) = E_{D_ID}[ℓ_CE(h(x), y)] + λ E_{D_OOD^s}[ℓ_OE(h(x))],

where the two terms are denoted L_CE(h; D_ID) and L_OE(h; D_OOD^s), λ is the trade-off parameter, ℓ_CE(•) is the cross-entropy loss, and ℓ_OE(•) is defined by the Kullback-Leibler divergence to the uniform distribution, which can be written as ℓ_OE(h(x)) = -Σ_k log softmax_k(h(x)) / C. Note that since we know nothing about unseen OOD data during training, the surrogate distribution D_OOD^s is generally quite different from the real one D_OOD. The difference between surrogate and unseen OOD data then leads to the OOD distribution gap between training-time (i.e., D_OOD^s) and test-time (i.e., D_OOD) situations. When deployed, the model inherits this data bias, potentially making over-confident predictions for unseen OOD data that differ from the surrogate ones.
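To make the objective concrete, here is a minimal numpy sketch of the OE loss (a hypothetical stand-alone implementation for illustration, not the paper's training code); the KL divergence to the uniform distribution equals -(1/C) Σ_k log softmax_k(h(x)) up to an additive constant:

```python
import numpy as np

def log_softmax(logits):
    # numerically stable log-softmax over the last axis
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def ce_loss(logits, labels):
    # cross-entropy on ID data, averaged over the batch
    ls = log_softmax(logits)
    return -ls[np.arange(len(labels)), labels].mean()

def oe_loss(logits):
    # l_OE(h(x)) = -(1/C) * sum_k log softmax_k(h(x)), averaged over the batch;
    # minimized (value log C) when the softmax output is uniform
    return -log_softmax(logits).mean(axis=-1).mean()

def oe_objective(id_logits, labels, ood_logits, lam=1.0):
    # L(h) = L_CE(h; D_ID) + lambda * L_OE(h; D_OOD^s)
    return ce_loss(id_logits, labels) + lam * oe_loss(ood_logits)
```

Uniform logits attain the minimum value log C of the OE term, while confident predictions on surrogate OOD data are penalized.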

3. OOD SYNTHESIS

The OOD distribution gap issue stems from our insufficient knowledge about (test-time) unseen OOD data. Therefore, a direct remedy is to give the model access to extra OOD data via data synthesis, doing our best to fill the distribution gap between training- and test-time situations. For data synthesis, a direct approach is to utilize generative models (Lee et al., 2018a), while generating unseen OOD data is intractable in general (Du et al., 2022). Therefore, MixOE (Zhang et al., 2023) mixes up ID and surrogate OOD data to expand the coverage of various OOD situations, and VOS (Du et al., 2022) generates additional OOD data in the embedding space with respect to low-likelihood ID regions. However, the former relies on manually designed synthesizing procedures, which can hardly cover diverse OOD situations. The latter generates OOD data in a low-dimensional space, which relies on specific assumptions about the ID distribution (e.g., a mixture of Gaussians) and hardly benefits the underlying feature extractor in learning meaningful OOD patterns.

3.1. MODEL PERTURBATION FOR DATA SYNTHESIS

Considering previous drawbacks in OOD synthesis, we suggest a new way to access additional OOD data, which is simple yet powerful. Overall, we transform the available surrogate OOD data to synthesize new data that can further benefit our model. The associated transform function is parasitic on our detection model and is learnable without auxiliary deep models or manual designs. The key insight is that perturbing model parameters has the same impact as transforming data, where a specific model perturbation corresponds to a specific transform function. For the beneficial data of our interest (e.g., the worst-case OOD data), we can implicitly access them by finding the corresponding model perturbation. By updating the detection model after its perturbation, the model learns from the transformed data (i.e., the beneficial ones) instead of the original inputs.

Now, we formalize our intuition. We study the piecewise affine ReLU network model (Arora et al., 2018), covering a large group of deep models with ReLU activations, fully connected layers, convolutional layers, residual layers, etc. Here, we consider the recursive definition of an L-layer ReLU network, following z^(l) = h^(l)(W^(l-1) z^(l-1)) for l = 1, ..., L, where W^(l) ∈ R^{n_l × n_{l-1}} is the l-th layer weight matrix and h^(l)(z) = max{0, z} the element-wise ReLU activation. We have z^(0) = x the model input and z^(L) = h(x) the model output. If necessary, we write h_W in place of h, with the joint form of weights W = {W^(l)}_{l=1}^{L} containing all trainable parameters. Our discussion is on a specific form of model perturbation named multiplicative perturbation.

Definition 1 (Multiplicative Perturbation (Petzka et al., 2021)). For an L-layer ReLU network h(•), its l-th layer is multiplicatively perturbed if W^(l) is changed into W^(l)(I + αA^(l)), where α > 0 is the perturbation strength and A^(l) ∈ R^{n_{l-1} × n_{l-1}} is the perturbation matrix.
Furthermore, the model h(•) is multiplicatively perturbed if all its layers are multiplicatively perturbed. Now, we link the multiplicative perturbation of the l-th layer to a data transformation in the associated embedding space, summarized by the following proposition.

Proposition 1. Consider the data distribution D and the multiplicative perturbation of the l-th layer of a ReLU network. Measured in the feature space, the multiplicative perturbation is equivalent to a data transformation. Further, the transformed data follow a new distribution D′ that is different from D if the eigenvalues of A^(l) are all greater than 0.

Therefore, model perturbation offers an alternative way to implicitly modify data and their distribution. Now, we generalize Proposition 1 to the multiplicative perturbation of the whole model, showing that it can modify the data distribution in the original input space.

Theorem 1. Consider the data distribution D and an L-layer ReLU network. Measured in the input space X ⊆ R^d, the multiplicative perturbation of the model is equivalent to a data transformation in the input space, with the transformed data following a distribution D′. Then, D′ and D are different if the eigenvalues of A^(l) are all greater than 0 and W^(l),† = W^(l),-1 for l = 1, ..., L.

The proof of the above theorem directly leads to the following lemma, indicating that our data-synthesizing approach can benefit from the layer-wise architectures of deep models.

Lemma 1. Consider an L-layer ReLU network with the multiplicative perturbations {A^(l)}_{l=1}^{L} satisfying the conditions in Theorem 1. Then, the discrepancy between the original and the transformed data distributions increases with the depth L.

All the above proofs can be found in Appendix A, revealing that model perturbation leads to data transformation. There are two points worth emphasizing. First, the distribution of the transformed data can be very different from that of the original data under the mild condition of positive eigenvalues.
Second, the corresponding transform function is complex enough, with layer-wise nonlinearity, such that deep models induce strong forms of transformations (regarding distributions).
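The claim of Proposition 1 can be checked numerically for a single ReLU layer: multiplying the weights by (I + αA) produces exactly the same activations as feeding the affinely transformed features through the unperturbed layer. A small numpy sketch (sizes and seed are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
W = rng.normal(size=(n, n))      # layer weights W^(l)
A = rng.normal(size=(n, n))      # perturbation matrix A^(l)
alpha = 0.1
z = rng.normal(size=(n,))        # input features z^(l)
relu = lambda t: np.maximum(t, 0.0)

# perturbed layer applied to the original features
out_perturbed = relu(W @ (np.eye(n) + alpha * A) @ z)

# unperturbed layer applied to the transformed features (I + alpha A) z
out_transformed = relu(W @ ((np.eye(n) + alpha * A) @ z))

assert np.allclose(out_perturbed, out_transformed)
```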

4. DISTRIBUTIONAL-AGNOSTIC OUTLIER EXPOSURE

Our data synthesis scheme allows the model h(•) to learn from additional OOD data besides the surrogate ones. Recall that we aim for the model to perform uniformly well for various unseen OOD data. Then, a critical issue is what kinds of synthesized OOD data can benefit our model the most. To begin with, we measure the detection capability by the worst-case OOD performance of the detection model, leading to the following definition of the worst OOD regret (WOR).

Definition 2 (Worst OOD Regret). For the detection model h(•), its worst OOD regret is

WOR(h) = sup_{D ∈ D_OOD} [ L_OE(h; D) - inf_{h′ ∈ H} L_OE(h′; D) ], (5)

where D_OOD here denotes the set of all OOD distributions and H is the hypothesis space.

Minimizing the WOR controls the worst-case regret and thus bounds the detection performance uniformly over OOD cases. Therefore, synthetic OOD data that lead to a large WOR are of our interest, and learning from such data can benefit our model the most. Note that we could also measure the detection capability by the risk, i.e., sup_{D ∈ D_OOD} L_OE(h; D), while we find that our regret-based measurement is better, since it further considers the fitting power of the model when facing the extremely large space of unseen data.

4.1. LEARNING OBJECTIVE

The WOR measures the regret with respect to the worst OOD distribution, which suits our perturbation-based data transformation that can lead to new data distributions (cf., Theorem 1). Therefore, to empirically upper-bound the WOR, one can first find the model perturbation that leads to a large OOD regret and then update the model parameters after its perturbation. Here, an implicit assumption is that the data given by model perturbation (with surrogate OOD inputs) are valid OOD cases. It is reasonable, since the WOR in equation 5 does not involve any term that makes the associated data close to ID data in either semantics or style.

Then, we propose an OE-based method for uniformly good OOD detection, namely, Distributional-agnostic Outlier Exposure (DOE). It is formalized by a min-max learning problem, namely,

L_DOE(h_W; D_ID, D_OOD^s) = L_CE(h_W; D_ID) + λ max_{P: ||P|| ≤ 1} [ L_OE(h_{W+αP}; D_OOD^s) - min_{W′} L_OE(h_{W′+αP}; D_OOD^s) ], (6)

where the bracketed term, denoted WOR_P(h_W; D_OOD^s), is a perturbation-based realization of the WOR. Several points require our attention. First, the ID data remain the same during training and testing, and the distribution gap occurs only for OOD cases; therefore, the WOR is applied only to the surrogate OOD data, and the original risk L_CE(h_W; D_ID) is applied for the ID data. Furthermore, we adopt the implicit data transformation to search for the worst OOD distribution, substituting the search space of distributions D_OOD with the search space of perturbations, i.e., {P : ||P|| ≤ 1}. Here, we adopt a fixed threshold of 1, since one can change the perturbation strength via the parameter α. Finally, we adopt the additive perturbation W + αP, which is easier to implement than its multiplicative counterpart; the two are equivalent when P = WA.
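The equivalence claimed in the last sentence is a one-line identity: W(I + αA) = W + αWA, so an additive perturbation with P = WA matches the multiplicative one exactly. A quick numpy check (shapes and seed are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n_out, n_in = 4, 6
W = rng.normal(size=(n_out, n_in))
A = rng.normal(size=(n_in, n_in))
alpha = 0.05

W_mult = W @ (np.eye(n_in) + alpha * A)  # multiplicative: W (I + alpha A)
P = W @ A                                # choose P = W A
W_add = W + alpha * P                    # additive: W + alpha P

assert np.allclose(W_mult, W_add)
```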

4.2. REALIZATION

We consider a stochastic realization of DOE, where ID and OOD mini-batches are randomly sampled in each iteration, denoted by B_ID and B_OOD^s, respectively. The overall DOE algorithm is summarized in Appendix B. Here, we emphasize several vital points.

Regret Estimation. The exact regret computation is hard, since we need to find the optimal risk for each candidate perturbation. As an effective estimation, following (Arjovsky et al., 2019; Agarwal & Zhang, 2022), we calculate the norm of the gradients with respect to the risk L_OE, namely,

WOR_G(h_W; B_OOD^s) = ||∇_σ L_OE(σ • h_{W+αP}; B_OOD^s)|_{σ=1.0}||_2. (7)

Intuitively, a large gradient norm indicates that the current model is far from optimal, and thus the corresponding regret should be large. It leads to an efficient indicator of regret.

Perturbation Estimation. Gradient ascent is employed to find the proper perturbation P for the max operation in equation 6. In each step, the perturbation is updated by

P ← ∇_P WOR_G(h_{W+αP}; B_OOD^s), (8)

with P initialized to 0. We further normalize P using P_NORM = NORM(P) to satisfy the norm constraint. By default, we employ one step of gradient update as an efficient estimation of its value, which can be taken as the solution for the first-order Taylor-approximated model.

Stable Estimation. Equation 8 is calculated on a mini-batch of OOD samples and is thus biased from the exact solution of P that leads to the worst regret over the whole training set. To mitigate this gap, for the resultant P_NORM, we adopt its moving average across training steps, namely, P_MA ← (1 - β) P_MA + β P_NORM, where β ∈ (0, 1] is the smoothing strength. Overall, a smaller β averages over a wider range of steps, leading to a more stable estimation of the perturbation.

Scoring Function. After training, we adopt the MaxLogit scoring (Hendrycks et al., 2022) for OOD detection, which is better than the MSP scoring when facing large semantic spaces.
It is of the form s_ML(x; h) = max_k h_k(x), where h_k(•) denotes the k-th element of the logit output. In general, a large value of s_ML(x; h) indicates high confidence that the associated x is an ID case.
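The per-iteration bookkeeping described above (norm-constrained perturbation, moving average, MaxLogit scoring) can be sketched as follows; this is an illustrative numpy fragment with our own function names, not the paper's released code:

```python
import numpy as np

def normalize(P, eps=1e-12):
    # project the perturbation onto the unit-norm ball: P_NORM = NORM(P)
    return P / (np.linalg.norm(P) + eps)

def update_perturbation(P_ma, ascent_direction, beta=0.6):
    # one DOE-style step: normalize the estimated ascent direction, then
    # smooth it across iterations: P_MA <- (1 - beta) P_MA + beta P_NORM
    P_norm = normalize(ascent_direction)
    return (1.0 - beta) * P_ma + beta * P_norm

def maxlogit_score(logits):
    # s_ML(x; h) = max_k h_k(x)
    return logits.max(axis=-1)
```

Here `ascent_direction` stands in for the gradient of the regret estimate with respect to P, which in practice comes from automatic differentiation.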

5. EXPERIMENTS

This section conducts extensive experiments on OOD detection. In Section 5.1, we verify the superiority of our DOE against state-of-the-art methods on both the CIFAR (Krizhevsky & Hinton, 2009) and the ImageNet (Deng et al., 2009) benchmarks. In Section 5.2, we demonstrate the effectiveness of our method for hard OOD detection. In Section 5.3, we further conduct an ablation study to understand our learning mechanism in depth. The code is publicly available at: github.com/qizhouwang/doe.

Baseline Methods. We compare our DOE with advanced methods in OOD detection. For post-hoc approaches, we consider MSP (Hendrycks & Gimpel, 2017), ODIN (Liang et al., 2018), Mahalanobis (Lee et al., 2018c), Free Energy (Liu et al., 2020), ReAct (Sun et al., 2021), and KNN (Sun et al., 2022); for fine-tuning approaches, we consider OE (Hendrycks et al., 2019), CSI (Tack et al., 2020), SSD+ (Sehwag et al., 2021), MixOE (Zhang et al., 2023), and VOS (Du et al., 2022).

Evaluation Metrics. The OOD detection performance is evaluated via two representative, threshold-independent metrics (Davis & Goadrich, 2006): the false positive rate of OOD data when the true positive rate of ID data is at 95% (FPR95), and the area under the receiver operating characteristic curve (AUROC), which can be viewed as the probability that an ID case receives a greater score than an OOD case.

Pre-training Setups. For the CIFAR benchmarks, we employ WRN-40-2 (Zagoruyko & Komodakis, 2016) as the backbone model, following (Liu et al., 2020). The models are trained for 200 epochs via empirical risk minimization, with a batch size of 64, momentum of 0.9, and an initial learning rate of 0.1; the learning rate is divided by 10 after 100 and 150 epochs. For ImageNet, we employ ResNet-50 (He et al., 2016) with well-trained parameters downloaded from the PyTorch repository, following (Sun et al., 2021).

DOE Setups.
Hyper-parameters are chosen based on the OOD detection performance on validation datasets, which are separated from the ID and surrogate OOD data. For the CIFAR benchmarks, DOE is run for 10 epochs with an initial learning rate of 0.01 and cosine decay (Loshchilov & Hutter, 2017). The batch size is 128 for ID cases and 256 for OOD cases; the number of warm-up epochs is 5; λ is 1 and β is 0.6. For the ImageNet dataset, DOE is run for 4 epochs with an initial learning rate of 0.0001 and cosine decay. The batch sizes are 64 for both ID and surrogate OOD cases; the number of warm-up epochs is 2; λ is 1 and β is 0.1. For both the CIFAR and the ImageNet benchmarks, σ is uniformly sampled from {1e-1, 1e-2, 1e-3, 1e-4} in each training step, which covers a wider range of OOD situations than assigning a fixed value. Furthermore, the number of perturbation steps is fixed to 1.

Surrogate OOD Datasets. For the CIFAR benchmarks, we adopt the tinyImageNet dataset (Le & Yang, 2015) as the surrogate OOD dataset for training. For the ImageNet dataset, we employ the ImageNet-21K-P dataset (Ridnik et al., 2021), which cleanses invalid classes and resizes images compared with the original ImageNet-21K (Deng et al., 2009).
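The two evaluation metrics can be computed directly from raw detection scores; below is a brute-force numpy sketch (our own implementation for illustration, equivalent in spirit to standard library routines):

```python
import numpy as np

def fpr_at_95_tpr(id_scores, ood_scores):
    # pick the threshold that keeps 95% of ID samples (TPR = 95%),
    # then report the fraction of OOD samples that also pass it
    tau = np.percentile(id_scores, 5)
    return np.mean(ood_scores >= tau)

def auroc(id_scores, ood_scores):
    # probability that a random ID score exceeds a random OOD score,
    # counting ties as one half (brute-force over all pairs)
    diff = id_scores[:, None] - ood_scores[None, :]
    return np.mean(diff > 0) + 0.5 * np.mean(diff == 0)
```

In practice one would use a library routine (e.g., scikit-learn's roc_auc_score) on large score arrays; the brute-force version above is O(n·m) but transparent.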

5.1. COMMON OOD DETECTION

We begin with our main experiments on the CIFAR and ImageNet benchmarks. Model performance is tested on several common OOD datasets widely adopted in the literature (Sun et al., 2022). For the CIFAR cases, we employ Texture (Cimpoi et al., 2014), SVHN (Netzer et al., 2011), Places365 (Zhou et al., 2018), LSUN-Crop (Yu et al., 2015), and iSUN (Xu et al., 2015); for the ImageNet case, we employ iNaturalist (Horn et al., 2018), SUN (Xu et al., 2015), Places365 (Zhou et al., 2018), and Texture (Cimpoi et al., 2014). In Table 1, we report the average performance (i.e., FPR95 and AUROC) regarding the OOD datasets mentioned above. Please refer to Tables 4-5 and 8 in Appendix C for the detailed results.

CIFAR Benchmarks. Overall, the fine-tuning methods can lead to effective OOD detection, in that they (e.g., OE and DOE) generally demonstrate better results than most of the post-hoc approaches.

ImageNet Benchmark. Huang & Li (2021) show that many advanced methods developed on the CIFAR benchmarks can hardly work for the ImageNet dataset due to its large semantic space with about 1k classes. Therefore, Table 1 also compares the results of DOE with advanced methods on ImageNet. As we can see, similar to the CIFAR benchmarks, the fine-tuning approaches generally reveal superior results compared with the post-hoc approaches, and DOE remains effective, showing the best detection performance in expectation. Overall, Table 1 demonstrates the effectiveness of DOE across widely adopted experimental settings, revealing the power of our implicit data search scheme and distributionally robust learning scheme.

5.2. HARD OOD DETECTION

Besides the above test OOD datasets, we also consider hard OOD scenarios (Tack et al., 2020), of which the test OOD data are very similar to the ID cases in style. Following the common setup (Sun et al., 2022) with the CIFAR-10 dataset being the ID case, we evaluate our DOE on three hard OOD datasets, namely, LSUN-Fix (Yu et al., 2015), ImageNet-Resize (Deng et al., 2009), and CIFAR-100. Note that data in ImageNet-Resize (1,000 classes) overlap in semantic space with the surrogate OOD data adopted during training. To some extent, it may indicate that our implicit data synthesis can even cover some hard OOD cases, and thus our DOE can lead to improved performance in hard OOD detection.

5.3. ABLATION STUDY

Our proposal claims two key contributions: the first is the implicit data transformation via model perturbation, and the second is the distributionally robust learning scheme regarding WOR. Here, we design a series of experiments to demonstrate their respective power.

Implicit Data Transformation. In Section 3.1, we demonstrate that model perturbation can lead to data transformation. Here, we verify that other realizations (besides searching for the worst regret) can also benefit the model with additional OOD data. We employ perturbations with fixed values of ones (all-ones) and two types of random noise, namely, Gaussian noise with mean 0 and covariance matrix I (Gaussian) and uniform noise over the interval [-1, 1] (Uniform) (cf., Appendix B). We summarize the results on CIFAR-100 in Table 3 (Implicit Data Transformation). Compared to MSP and OE without model perturbation, all forms of perturbation lead to improved detection, indicating that our implicit data transformation is a general way to benefit the model with additional OOD data.

Distributional Robust Learning. In Section 4, we employ the implicit data transformation for uniform performance in OOD detection. As mentioned in Section 1, DRO (Rahimian & Mehrotra, 2019) also focuses on distributional robustness. Here, we conduct experiments with two realizations of DRO, with χ2 divergence (χ2) (Hashimoto et al., 2018) and Wasserstein distance (WD) (Kwon et al., 2020) (cf., Appendix B). We also consider adversarial training (AT) (Madry et al., 2018b) as a baseline method, which can also be interpreted through the lens of DRO. We summarize the related experiments on CIFAR-100 in Table 3 (Distributional Robust Learning). The two traditional DRO-based methods (i.e., χ2 and WD) mainly consider cases where the support of the test OOD data is a subset of the surrogate case. This closed-world setup fails in OOD detection, and thus they reveal unsatisfactory results.
Though AT also performs data transformation, its transformation is limited to additive noise, which can hardly cover the diversity of unseen data. In contrast, our DOE can search for complex transform functions that better exploit unseen situations, yielding large improvements over all other robust learning methods.

6. CONCLUSION

Our proposal makes two key contributions. The first is the implicit data transformation for OOD synthesis, based on our novel insight that model perturbation leads to data transformation. Synthetic data follow a distribution different from that of the original ones, enabling the target model to learn from unseen data. The second contribution is a distributionally robust learning method, built upon a min-max optimization scheme that searches for the worst regret. We demonstrate that learning from the worst regret in OOD detection can yield better results than the risk-based counterpart. Accordingly, we propose DOE to mitigate the OOD distribution gap issue inherent in OE-based methods, and extensive experiments verify its effectiveness. Our two contributions may not be limited to the OOD detection field; we will explore their usage in other areas, such as OOD generalization, adversarial training, and distributionally robust optimization.

7. ETHICS STATEMENT

This paper does not raise any ethical concerns. This study does not involve human subjects, dataset releases, potentially harmful insights or applications, conflicts of interest or sponsorship, discrimination/bias/fairness concerns, privacy or security issues, legal compliance issues, or research integrity issues.

8. REPRODUCIBILITY STATEMENT

The experimental setups for training and evaluation as well as the hyper-parameters are described in detail in Section 5, and the experiments are all conducted using public datasets. The code is publicly available at: github.com/qizhouwang/doe.

A PROOFS

This section provides the detailed proofs for our theoretical claims in the main text.

A.1 PROOF OF PROPOSITION 1

Proof. To make the derivation clear, we adopt an equivalent form of the recursive model definition in equation 3, following h^(l+1)(W^(l) z^(l)) = h^(l+1)(z^(l); W^(l)). Then, by multiplicatively perturbing the l-th layer of the model, we have

h^(l+1)(z^(l); W^(l)(I + αA^(l))) = max{W^(l)(I + αA^(l)) z^(l), 0}
                                 = max{W^(l) [(I + αA^(l)) z^(l)], 0}
                                 = h^(l+1)((I + αA^(l)) z^(l); W^(l)).

Therefore, measured in the feature space Z^(l), the multiplicative perturbation modifies the original features z^(l) by the affine transformation I + αA^(l). Assuming the original data are i.i.d. drawn from a distribution with probability density function (pdf) f_{Z^(l)}(z^(l)), the transformed data are i.i.d. drawn from the distribution with pdf

f_{Z′^(l)}(z′^(l)) = f_{Z^(l)}(z^(l)) |I + αA^(l)|^{-1}.

Using the KL divergence to measure the discrepancy between the original and transformed feature distributions, we have

D_KL(f_{Z^(l)} || f_{Z′^(l)}) = E_{f_{Z^(l)}}[ log ( f_{Z^(l)}(z^(l)) / f_{Z′^(l)}(z′^(l)) ) ] = log |I + αA^(l)|.

Without loss of generality, assume K different eigenvalues for the matrix A^(l). By the Jordan matrix decomposition, we can write A^(l) = T^(l),-1 J^(l) T^(l), where J^(l) = diag(J(λ_1), J(λ_2), ..., J(λ_K)) is block diagonal and J(λ_k) is the k-th Jordan block (of size n_k × n_k) corresponding to the k-th eigenvalue of A^(l). Then, we have

|I + αA^(l)| = |T^(l),-1 (I + αJ^(l)) T^(l)| = |I + αJ^(l)|.

Since J^(l) is an upper triangular matrix, we can write |I + αJ^(l)| = Π_{k=1}^{K} (αλ_k + 1)^{n_k}. Accordingly, if the eigenvalues of A^(l) are all greater than 0 and α > 0, we have |I + αA^(l)| > 1 and D_KL(f_{Z^(l)} || f_{Z′^(l)}) > 0. Therefore, the distributions f_{Z^(l)} and f_{Z′^(l)} are different regarding the KL divergence. Thus we complete our proof.
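The determinant identity at the heart of this proof can be verified numerically: for a matrix A with strictly positive eigenvalues, |I + αA| equals the product of (1 + αλ_k) and exceeds 1, so the KL divergence log|I + αA| is positive. A small numpy sketch (sizes and seed are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
n, alpha = 6, 0.1

# build A with strictly positive eigenvalues: A = T^{-1} diag(lmbda) T
lmbda = rng.uniform(0.5, 2.0, size=n)
T = rng.normal(size=(n, n))
A = np.linalg.inv(T) @ np.diag(lmbda) @ T

det = np.linalg.det(np.eye(n) + alpha * A)  # |I + alpha A|
prod = np.prod(1.0 + alpha * lmbda)         # prod_k (1 + alpha lambda_k)

assert np.isclose(det, prod)   # determinant = product over eigenvalues
assert np.log(det) > 0.0       # hence D_KL = log|I + alpha A| > 0
```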

A.2 PROOF OF THEOREM 1

Proof. We consider an induction proof, justifying the following claim: the multiplicative perturbation with $A^{(l)} \in \mathbb{R}^{n_l \times n_l}$ of any layer $l = 1, \dots, L$ can be transformed into an equivalent multiplicative perturbation with $\bar{A}^{(l-1)} \in \mathbb{R}^{n_{l-1} \times n_{l-1}}$ in the $(l-1)$-th layer; moreover, $|\bar{A}^{(l-1)}| > 0$ if $|A^{(l)}| > 0$. Then, one can transform the multiplicative perturbation of the model into an equivalent form in the input space. Since the determinant of the equivalent perturbation is greater than $0$, applying Proposition 1 shows that multiplicative perturbation can lead to data transformation in the original input space.

To find the equivalent perturbation matrix $\bar{A}^{(l-1)}$ in the $(l-1)$-th layer for the original one $A^{(l)}$ in the $l$-th layer, we solve the following equation:
$$W^{(l)}(I + \alpha A^{(l)})\, h^{(l)}(W^{(l-1)} z^{(l-1)}) = W^{(l)}\, h^{(l)}\big(W^{(l-1)}(I + \alpha \bar{A}^{(l-1)})\, z^{(l-1)}\big). \tag{15}$$
If $[z^{(l-1)}]_i \neq 0$ in each dimension, equation 15 can be rewritten as
$$A^{(l)}\, h^{(l)}(W^{(l-1)} z^{(l-1)}) = h^{(l)\prime}(W^{(l-1)} z^{(l-1)})\, W^{(l-1)} \bar{A}^{(l-1)} z^{(l-1)} \tag{16}$$
by applying the Taylor theorem to the right-hand side¹. Then, since the ReLU activation is applied, we solve the equivalent formulation of equation 16, following
$$A^{(l)} W^{(l-1)} = W^{(l-1)} \bar{A}^{(l-1)}. \tag{17}$$
The solution of $\bar{A}^{(l-1)}$ is $W^{(l-1),\dagger} A^{(l)} W^{(l-1)}$, with $\dagger$ being the Moore-Penrose inverse. We have thus justified that the multiplicative perturbation in the $l$-th layer can be transformed to one in the $(l-1)$-th layer. Therefore, the equivalent perturbation $\bar{A}^{(l-1)}$ and the original perturbation $A^{(l-1)}$ in the $(l-1)$-th layer formulate a joint perturbation $I + \alpha \tilde{A}^{(l-1)}$, with
$$\tilde{A}^{(l-1)} = \bar{A}^{(l-1)} + A^{(l-1)} + \alpha A^{(l-1)} \bar{A}^{(l-1)}.$$
Now, we justify that $\tilde{A}^{(l-1)}$ can also lead to a distributional transformation. If $W^{(l-1),\dagger} = W^{(l-1),-1}$ (here, we implicitly assume $n_{l-1} = n_{l-2}$) and the eigenvalues of the matrix $A^{(l)}$ are all greater than $0$, then the eigenvalues of the matrix $\bar{A}^{(l-1)}$ are all greater than $0$.
Again, we have $\big|I + \alpha \bar{A}^{(l-1)}\big| > 1$. Then, the joint perturbation $\tilde{A}^{(l-1)}$ satisfies:
$$\big|I + \alpha \tilde{A}^{(l-1)}\big| = \big|(I + \alpha A^{(l-1)})(I + \alpha \bar{A}^{(l-1)})\big| \tag{19}$$
$$= \big|I + \alpha A^{(l-1)}\big|\cdot\big|I + \alpha \bar{A}^{(l-1)}\big| \tag{20}$$
$$> \big|I + \alpha A^{(l-1)}\big| \tag{21}$$
$$> 1.$$
By induction, the multiplicative perturbation of the model can be approximated by an input transformation. Applying Proposition 1, we know that $x$ and its perturbation-based transformed counterpart follow different data distributions. Thus we complete our proof.
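The key step of the induction, $\bar{A}^{(l-1)} = W^{(l-1),\dagger} A^{(l)} W^{(l-1)}$, can be verified numerically in the square, invertible case that the proof assumes. A minimal NumPy sketch (matrix sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
W = rng.standard_normal((n, n))          # square weight matrix, a.s. invertible
B = rng.standard_normal((n, n))
A = B @ B.T + np.eye(n)                  # eigenvalues > 0, hence |A| > 0

A_bar = np.linalg.pinv(W) @ A @ W        # W^dagger A W

# The defining relation A W = W A_bar holds exactly when W is invertible:
assert np.allclose(A @ W, W @ A_bar)
# The similarity transform preserves eigenvalues, so |A_bar| > 0 as claimed:
assert np.linalg.det(A_bar) > 0
```

Since `A_bar` is similar to `A`, its eigenvalues remain strictly positive, matching the inductive hypothesis used above.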

A.3 PROOF OF LEMMA 1

Proof. For the $(L+1)$-layer ReLU network, we assume that its model parameters and model perturbations are the same as those of the corresponding layers of the $L$-layer ReLU network (except for the $(L+1)$-th layer). Then, by inspecting equation 20, the perturbation from the $(L+1)$-th layer makes the joint multiplicative perturbation for the $(L+1)$-layer network no smaller than that of the $L$-layer network at each layer (including the input space). Thus, we complete our proof.

A.4 EXCESS RISK BOUND

We further derive the learning bound of DOE. Here, we make standard assumptions for our learning problem. First, we assume that the Rademacher complexity $\mathfrak{R}_n(\mathcal{H})$ of $\mathcal{H}$ is bounded, i.e., there is a constant $C_\mathcal{H}$ such that $\mathfrak{R}_n(\mathcal{H}) \le C_\mathcal{H}/\sqrt{n}$, which holds for ReLU models. Further, the CE loss is bounded by $A_{\mathrm{CE}}$ and is $L_{\mathrm{CE}}$-Lipschitz continuous; the OE loss is bounded by $A_{\mathrm{OE}}$ and is $L_{\mathrm{OE}}$-Lipschitz continuous. To ease notation, we also define
$$\epsilon(C, L, A) = 2CL + A\sqrt{\frac{\log(1/\delta)}{2}}.$$
We are now ready to state the upper bound for the worst-case population performance of our DOE.

Theorem 2. Given ID and surrogate OOD training samples $S_{\mathrm{ID}}$ and $S^s_{\mathrm{OOD}}$, we write the optimal solution as $h^*_W = \arg\min_{h_W \in \mathcal{H}} L_{\mathrm{DOE}}(h_W; \mathcal{D}_{\mathrm{ID}}, \mathcal{D}^s_{\mathrm{OOD}})$ and the empirical counterpart as $\hat{h}_W = \arg\min_{h_W \in \mathcal{H}} L_{\mathrm{DOE}}(h_W; S_{\mathrm{ID}}, S^s_{\mathrm{OOD}})$. Then, under the above assumptions, with probability at least $1 - \delta$, we have
$$L_{\mathrm{DOE}}(\hat{h}_W; \mathcal{D}_{\mathrm{ID}}, \mathcal{D}^s_{\mathrm{OOD}}) \le L_{\mathrm{DOE}}(h^*_W; \mathcal{D}_{\mathrm{ID}}, \mathcal{D}^s_{\mathrm{OOD}}) + (2 + 4\lambda)\,\epsilon(C_\mathcal{H}, L, A)\Big/\sqrt{\min\{|S_{\mathrm{ID}}|, |S^s_{\mathrm{OOD}}|\}},$$
where $L = \max\{L_{\mathrm{CE}}, L_{\mathrm{OE}}\}$ and $A = \max\{A_{\mathrm{CE}}, A_{\mathrm{OE}}\}$.

Proof. We apply the Rademacher bound for $L_{\mathrm{CE}}$ and $L_{\mathrm{OE}}$: with probability at least $1 - \delta$, for all $h_W \in \mathcal{H}$,
$$\big|L_{\mathrm{CE}}(h_W; \mathcal{D}_{\mathrm{ID}}) - L_{\mathrm{CE}}(h_W; S_{\mathrm{ID}})\big| \le \epsilon(C_\mathcal{H}, L_{\mathrm{CE}}, A_{\mathrm{CE}})\big/\sqrt{|S_{\mathrm{ID}}|}, \tag{25}$$
$$\big|L_{\mathrm{OE}}(h_W; \mathcal{D}^s_{\mathrm{OOD}}) - L_{\mathrm{OE}}(h_W; S^s_{\mathrm{OOD}})\big| \le \epsilon(C_\mathcal{H}, L_{\mathrm{OE}}, A_{\mathrm{OE}})\big/\sqrt{|S^s_{\mathrm{OOD}}|}.$$
When the hypothesis space $\mathcal{H}$ is large enough such that $h^*_W = \arg\min_{h_W\in\mathcal{H}} L_{\mathrm{CE}}(h_W; \mathcal{D}_{\mathrm{ID}})$, then for any $\epsilon > 0$ there exists $h^\epsilon_W$ with $L_{\mathrm{CE}}(h^\epsilon_W; \mathcal{D}_{\mathrm{ID}}) \le L_{\mathrm{CE}}(h^*_W; \mathcal{D}_{\mathrm{ID}}) + \epsilon$. Thus, using $L_{\mathrm{CE}}(\hat{h}_W; S_{\mathrm{ID}}) \le L_{\mathrm{CE}}(h^\epsilon_W; S_{\mathrm{ID}})$, we can write
$$L_{\mathrm{CE}}(\hat{h}_W; \mathcal{D}_{\mathrm{ID}}) - L_{\mathrm{CE}}(h^*_W; \mathcal{D}_{\mathrm{ID}}) \le L_{\mathrm{CE}}(\hat{h}_W; \mathcal{D}_{\mathrm{ID}}) - L_{\mathrm{CE}}(h^\epsilon_W; \mathcal{D}_{\mathrm{ID}}) + \epsilon \tag{28}$$
$$= L_{\mathrm{CE}}(\hat{h}_W; \mathcal{D}_{\mathrm{ID}}) - L_{\mathrm{CE}}(\hat{h}_W; S_{\mathrm{ID}}) + L_{\mathrm{CE}}(\hat{h}_W; S_{\mathrm{ID}}) - L_{\mathrm{CE}}(h^\epsilon_W; \mathcal{D}_{\mathrm{ID}}) + \epsilon \tag{29}$$
$$\le L_{\mathrm{CE}}(\hat{h}_W; \mathcal{D}_{\mathrm{ID}}) - L_{\mathrm{CE}}(\hat{h}_W; S_{\mathrm{ID}}) + L_{\mathrm{CE}}(h^\epsilon_W; S_{\mathrm{ID}}) - L_{\mathrm{CE}}(h^\epsilon_W; \mathcal{D}_{\mathrm{ID}}) + \epsilon \tag{31}$$
$$\le 2 \sup_{h\in\mathcal{H}} \big|L_{\mathrm{CE}}(h; \mathcal{D}_{\mathrm{ID}}) - L_{\mathrm{CE}}(h; S_{\mathrm{ID}})\big| + \epsilon. \tag{32}$$
Since equation 25 holds for all $h_W \in \mathcal{H}$ and any $\epsilon > 0$, we have
$$L_{\mathrm{CE}}(\hat{h}_W; \mathcal{D}_{\mathrm{ID}}) \le L_{\mathrm{CE}}(h^*_W; \mathcal{D}_{\mathrm{ID}}) + 2\,\epsilon(C_\mathcal{H}, L_{\mathrm{CE}}, A_{\mathrm{CE}})\big/\sqrt{|S_{\mathrm{ID}}|}. \tag{33}$$
For any $h_W \in \mathcal{H}$, a similar argument applied to the worst-case OE term also gives
$$\max_{P:\|P\|\le 1} L_{\mathrm{OE}}(\hat{h}_{W+\alpha P}; \mathcal{D}^s_{\mathrm{OOD}}) \le \max_{P:\|P\|\le 1} L_{\mathrm{OE}}(h^*_{W+\alpha P}; \mathcal{D}^s_{\mathrm{OOD}}) + 4\,\epsilon(C_\mathcal{H}, L_{\mathrm{OE}}, A_{\mathrm{OE}})\big/\sqrt{|S^s_{\mathrm{OOD}}|}. \tag{39}$$
Combining equation 33 and equation 39, we complete our proof. The theorem states that the empirical solution leads to a promising detection capability in expectation, which takes the uniform OOD performance into account via the WOR. The critical point is that the original surrogate OOD data remain very important (i.e., a small sample size of $S^s_{\mathrm{OOD}}$ leads to a loose excess risk bound), even though our method can synthesize additional OOD data.

Algorithm 1 Distribution-agnostic Outlier Exposure (DOE).

Input: ID and surrogate OOD samples from $\mathcal{D}_{\mathrm{ID}}$ and $\mathcal{D}^s_{\mathrm{OOD}}$, resp.; $P_{\mathrm{MA}} = 0$.
for ns = 1 to num_step do
  Sample $B_{\mathrm{ID}}$ and $B^s_{\mathrm{OOD}}$ from the ID and surrogate OOD data, resp.; $P = 0$;
  if ns > num_warm then
    for np = 1 to num_pert do
      $\mathrm{WOR}_G(h_{W+\alpha P}; B^s_{\mathrm{OOD}}) = \big\|\nabla_\sigma\big|_{\sigma=1.0}\, L_{\mathrm{OE}}(\sigma \cdot h_{W+\alpha P}; B^s_{\mathrm{OOD}})\big\|_2$;
      $P \leftarrow \nabla_P\, \mathrm{WOR}_G(h_{W+\alpha P}; B^s_{\mathrm{OOD}})$;
    end for
    $P_{\mathrm{MA}} \leftarrow (1-\beta)\cdot P_{\mathrm{MA}} + \beta\cdot \mathrm{NORM}(P)$;
    $W \leftarrow W - lr \cdot \nabla_W\big[L_{\mathrm{CE}}(h_W; B_{\mathrm{ID}}) + \lambda L_{\mathrm{OE}}(h_{W+\alpha P_{\mathrm{MA}}}; B^s_{\mathrm{OOD}})\big]$;
  else
    $W \leftarrow W - lr \cdot \nabla_W\big[L_{\mathrm{CE}}(h_W; B_{\mathrm{ID}}) + \lambda L_{\mathrm{OE}}(h_W; B^s_{\mathrm{OOD}})\big]$;
  end if
end for
Output: detection model $h_W(\cdot)$.

B ALGORITHM DESIGNS

We summarize details of algorithm designs for a set of related learning schemes.

B.1 DISTRIBUTIONAL ROBUSTNESS AND DISTRIBUTION GAP

Overall, to demonstrate why our distributionally robust learning scheme can mitigate the OOD distribution gap, we consider the following two situations: (1) the true OOD distribution contains all possible OOD situations; and (2) the capacity of the implicit data transformation is large enough. For the first situation, we assume that the true OOD distribution contains all possible OOD situations, i.e., all samples with labels outside the considered label space. This is a reasonable consideration since we do not know what kinds of OOD data will be encountered at test time, and thus any OOD situation may arise. In this case, the surrogate and the (associated) implicit OOD data are subsets of the true OOD distribution, since they do not share semantics with the ID distribution. Then, compared with OE, which learns only from surrogate OOD data, our DOE can further benefit from the implicit OOD data. This enlarges the coverage of OOD situations, since the implicit data follow new data distributions beyond the surrogate OOD distribution (cf. Theorem 1). For the second situation, we assume that the capacity of the implicit data transformation is large enough to cover sufficiently many OOD cases. This is also a reasonable assumption, since the transformation's capacity benefits from layer-wise architectures (cf. Lemma 1), and deep models (which contain many layers) are typically adopted in OOD detection. Accordingly, although we do not know precisely what the true OOD distribution is, we can upper-bound the worst OOD performance to guarantee uniform performance of the model under various test situations (cf. Theorem 2). When the capacity is large enough (covering many test OOD situations), DOE performs well on these unseen test OOD data, thus mitigating the OOD distribution gap.

B.2 DOE

Algorithm 1 summarizes a stochastic realization of our DOE. The overall algorithm runs for num_step steps, with num_warm warm-up epochs employing the original OE. In each subsequent training step, we first calculate the perturbation P with respect to the OOD mini-batch for num_pert steps, and the normalized result is used to update the moving average P_MA. With the resulting perturbation P_MA applied to the OE loss, we update the model via one step of mini-batch gradient descent. After training, we apply the MaxLogit scoring to discern ID and OOD cases.
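To make the inner perturbation loop of Algorithm 1 concrete, below is a toy, framework-free sketch. It replaces the deep network with a linear model and automatic differentiation with finite differences, so all functions, sizes, and constants are illustrative stand-ins rather than our actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def oe_loss(W, X, sigma=1.0):
    """OE loss (cross-entropy to the uniform label) for a toy linear model
    with logits z = sigma * (W x): mean over the batch of logsumexp(z) - mean(z)."""
    Z = sigma * (X @ W.T)
    m = Z.max(axis=1, keepdims=True)
    lse = (m + np.log(np.exp(Z - m).sum(axis=1, keepdims=True)))[:, 0]
    return float(np.mean(lse - Z.mean(axis=1)))

def wor_g(W, P, X, alpha, eps=1e-4):
    """Gradient-based worst OOD regret surrogate:
    |d/dsigma L_OE(sigma * h_{W+alpha P}; X)| evaluated at sigma = 1."""
    Wp = W + alpha * P
    return abs(oe_loss(Wp, X, 1 + eps) - oe_loss(Wp, X, 1 - eps)) / (2 * eps)

def grad_P_wor(W, P, X, alpha, eps=1e-4):
    """Finite-difference grad of WOR_G w.r.t. P (a single autograd pass in practice)."""
    G = np.zeros_like(P)
    for idx in np.ndindex(P.shape):
        E = np.zeros_like(P)
        E[idx] = eps
        G[idx] = (wor_g(W, P + E, X, alpha) - wor_g(W, P - E, X, alpha)) / (2 * eps)
    return G

# Toy setting: a 3-class linear "network" and a surrogate OOD mini-batch.
K, d, alpha = 3, 4, 0.1
W = rng.standard_normal((K, d))
X_ood = rng.standard_normal((8, d))

P = np.zeros((K, d))
for _ in range(2):                       # num_pert inner steps
    P = grad_P_wor(W, P, X_ood, alpha)   # P <- grad_P WOR_G, as in Algorithm 1

P_ma = P / (np.linalg.norm(P) + 1e-12)   # NORM(P); moving average omitted here
loss_pert = oe_loss(W + alpha * P_ma, X_ood)  # OE term evaluated at h_{W+alpha*P_MA}
```

In the full algorithm, P_MA would additionally be smoothed by the moving average with strength β, and the model update would combine this OE term with the CE loss on the ID mini-batch.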

B.3 WORST RISK-BASED DOE

Our proposed DOE searches for the model perturbation that leads to the WOR, i.e., the worst regret-based realization. In the main text, we state its superiority over the risk-based counterpart in Section 4, with experimental verification in Section 5.3. For completeness, we further describe the realization of the worst risk-based DOE, named DOE-risk. Similar to our proposed DOE, DOE-risk can also be formalized as a min-max learning problem:
$$L_{\mathrm{DOE\text{-}risk}}(h_W; \mathcal{D}_{\mathrm{ID}}, \mathcal{D}^s_{\mathrm{OOD}}) = L_{\mathrm{CE}}(h_W; \mathcal{D}_{\mathrm{ID}}) + \lambda \max_{P:\|P\|\le 1} L_{\mathrm{OE}}(h_{W+\alpha P}; \mathcal{D}^s_{\mathrm{OOD}}).$$
For its stochastic realization, one step of gradient ascent is employed for the perturbation with respect to the mini-batch, namely, $P \leftarrow \nabla_P\big|_{P=0}\, L_{\mathrm{OE}}(h_{W+\alpha P}; B^s_{\mathrm{OOD}})$. All other parts follow the realization of the original DOE. After training, we also employ the MaxLogit scoring to discern ID and OOD data.
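For a toy linear model, the one-step perturbation of DOE-risk has a closed form: by the chain rule, $\nabla_P|_{P=0}\, L_{\mathrm{OE}}(h_{W+\alpha P}) = \alpha \nabla_W L_{\mathrm{OE}}(h_W)$. A minimal NumPy sketch under this assumption (the linear model and all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def oe_loss(W, X):
    """OE loss for logits z = W x: mean of logsumexp(z) - mean(z)."""
    Z = X @ W.T
    m = Z.max(axis=1, keepdims=True)
    lse = (m + np.log(np.exp(Z - m).sum(axis=1, keepdims=True)))[:, 0]
    return float(np.mean(lse - Z.mean(axis=1)))

def oe_grad_W(W, X):
    """Analytic grad_W of the OE loss: mean_i (softmax(W x_i) - 1/K) x_i^T."""
    K = W.shape[0]
    S = softmax(X @ W.T)                       # (batch, K)
    return (S - 1.0 / K).T @ X / X.shape[0]    # (K, d), same shape as W

K, d, alpha = 3, 4, 0.1
W = rng.standard_normal((K, d))
B_ood = rng.standard_normal((8, d))

# One-step worst-risk perturbation: P <- grad_P|_{P=0} L_OE(h_{W+alpha P}).
P = alpha * oe_grad_W(W, B_ood)
loss_pert = oe_loss(W + alpha * P, B_ood)      # OE term at the perturbed model
```

In a deep network, this single ascent step is of course computed by automatic differentiation rather than by a closed form.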

B.4 IMPROVED OE WITH PREDEFINED PERTURBATION

We consider several implicit data transformations with predefined perturbations in Section 5.3. Here, we briefly summarize their realizations.

All-ones matrices. For the perturbation matrices with fixed values, we employ the simple all-ones matrices, namely, $P_{\mathrm{one}} = \{I^{(l)}\}_{l=1}^{L}$, with $I^{(l)} \in \mathbb{R}^{n_{l-1}\times n_{l-1}}$ for $l = 1, \dots, L$ being the all-ones matrix. Then, the associated learning objective can be written as
$$L_{\mathrm{OE\text{-}one}}(h_W; \mathcal{D}_{\mathrm{ID}}, \mathcal{D}^s_{\mathrm{OOD}}) = L_{\mathrm{CE}}(h_W; \mathcal{D}_{\mathrm{ID}}) + \lambda L_{\mathrm{OE}}(h_{W+\alpha P_{\mathrm{one}}}; \mathcal{D}^s_{\mathrm{OOD}}).$$

Gaussian noise. When adopting Gaussian noise for random perturbation, we have $P_{\mathrm{gau}} = \{N^{(l)}\}_{l=1}^{L}$, with the elements drawn from the Gaussian distribution with zero mean and unit standard deviation. Then, the associated learning objective is of the form
$$L_{\mathrm{OE\text{-}gau}}(h_W; \mathcal{D}_{\mathrm{ID}}, \mathcal{D}^s_{\mathrm{OOD}}) = L_{\mathrm{CE}}(h_W; \mathcal{D}_{\mathrm{ID}}) + \lambda L_{\mathrm{OE}}(h_{W+\alpha P_{\mathrm{gau}}}; \mathcal{D}^s_{\mathrm{OOD}}).$$

Uniform noise. Similarly, one can adopt uniform noise for random perturbation, which we denote by $P_{\mathrm{uni}} = \{U^{(l)}\}_{l=1}^{L}$. The elements of $U^{(l)}$ are drawn from the uniform distribution over the interval $[-1, 1]$. Then, the associated learning objective is
$$L_{\mathrm{OE\text{-}uni}}(h_W; \mathcal{D}_{\mathrm{ID}}, \mathcal{D}^s_{\mathrm{OOD}}) = L_{\mathrm{CE}}(h_W; \mathcal{D}_{\mathrm{ID}}) + \lambda L_{\mathrm{OE}}(h_{W+\alpha P_{\mathrm{uni}}}; \mathcal{D}^s_{\mathrm{OOD}}).$$
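The three predefined perturbation sets can be written down directly. A small NumPy sketch (the layer widths here are hypothetical; the real perturbation matrices match the widths of the trained network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer widths of a small L-layer net: n_0 (input), n_1, n_2.
widths = [4, 8, 3]

# One perturbation matrix per layer, sized to act on that layer's input.
P_one = [np.ones((n, n)) for n in widths[:-1]]                      # all-ones
P_gau = [rng.standard_normal((n, n)) for n in widths[:-1]]          # N(0, 1) entries
P_uni = [rng.uniform(-1.0, 1.0, size=(n, n)) for n in widths[:-1]]  # U[-1, 1] entries

# Applying, e.g., the Gaussian variant multiplicatively at the first layer:
alpha = 0.1
z = rng.standard_normal(widths[0])
z_pert = (np.eye(widths[0]) + alpha * P_gau[0]) @ z   # (I + alpha P^(1)) z
```

In training, the perturbations would instead be added to the corresponding weight matrices (h_{W+αP}), which is equivalent to the feature-space view shown here for the linear layer.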

B.5 DRO

Distributionally robust optimization (DRO) (Rahimian & Mehrotra, 2019) is a traditional technique for making the model perform uniformly well. In OE, one can utilize DRO by replacing the original OE risk in equation 2 with its distributionally robust counterpart, namely,
$$L_{\mathrm{DRO}}(h; \mathcal{D}_{\mathrm{ID}}, \mathcal{D}^s_{\mathrm{OOD}}) = L_{\mathrm{CE}}(h; \mathcal{D}_{\mathrm{ID}}) + \lambda \underbrace{\sup_{\mathcal{D}^w_{\mathrm{OOD}} \in\, \mathcal{U}(\mathcal{D}^s_{\mathrm{OOD}})} L_{\mathrm{OE}}(h; \mathcal{D}^w_{\mathrm{OOD}})}_{L^{\mathrm{DRO}}_{\mathrm{OE}}(h;\, \mathcal{D}^s_{\mathrm{OOD}})}, \tag{45}$$
where $\mathcal{U}(\mathcal{D}^s_{\mathrm{OOD}})$ is the ambiguity set. Basically, $\mathcal{U}(\mathcal{D}^s_{\mathrm{OOD}})$ constrains the difference between the surrogate OOD distribution $\mathcal{D}^s_{\mathrm{OOD}}$ and its worst counterpart $\mathcal{D}^w_{\mathrm{OOD}}$. In expectation, equation 45 makes the training procedure cover a wide range of potential test OOD distributions in $\mathcal{U}(\mathcal{D}^s_{\mathrm{OOD}})$, guaranteeing uniform performance by bounding the worst OOD risk derived from $\mathcal{D}^w_{\mathrm{OOD}}$. The ambiguity set $\mathcal{U}(\mathcal{D}^s_{\mathrm{OOD}})$ is defined by $\{\mathcal{D}^w : \mathrm{Div}_f(\mathcal{D}^w \| \mathcal{D}) \le \rho\}$, with $\mathrm{Div}_f(\cdot)$ the f-divergence and $\rho$ the constraint. For the worst OOD distribution that leads to the worst OOD risk, a weighting-based searching scheme can be derived for the empirical counterpart of equation 45, taking the form of a re-weighted empirical risk, namely,
$$\sup_{p}\ \mathbb{E}_{p}\big[\ell_{\mathrm{OE}}(h(x))\big] \quad \text{s.t.} \quad p \in \Delta \ \text{and}\ D_f(p\,\|\,\mathbf{1}) \le \rho.$$
However, due to this equivalent re-weighting scheme, DRO actually assumes that the support of test-time OOD data lies within that of the training situation. This assumption is violated in OOD detection, since the surrogate OOD data can differ largely from the unseen situations, i.e., their support sets can be greatly different. Therefore, traditional DRO cannot lead to much-improved results compared with the original OE, which we demonstrate by the experimental results in Section 5.3. Note that some DRO methods (Krueger et al., 2021) try to search for worst distributions beyond the support set of the training data. However, they rely on more than one training domain, which is not directly applicable in OOD detection.
χ²-divergence DRO. Hashimoto et al. (2018) define the ambiguity set by the χ² divergence $D_{\chi^2}(P\|Q)$. They assume the data distribution can be written as a joint form of sub-populations, i.e., $\mathcal{D}^s_{\mathrm{OOD}} = \sum_{k\in[K]} \alpha_k \mathcal{D}^{s,k}_{\mathrm{OOD}}$. Then, one can derive the dual form of $L^{\mathrm{DRO}}_{\mathrm{OE}}(h; \mathcal{D}^s_{\mathrm{OOD}})$ in equation 46 with respect to the χ² divergence, namely,
$$\inf_{\eta\in\mathbb{R}} \Big\{ \big(2(1/\alpha_{\min} - 1)^2 + 1\big)^{1/2}\, \mathbb{E}_{\mathcal{D}^s_{\mathrm{OOD}}}\big[\max\{\ell_{\mathrm{OE}}(h(x)) - \eta,\, 0\}^2\big]^{1/2} + \eta \Big\}, \tag{47}$$
where $\alpha_{\min} = \min_k \alpha_k$. Hashimoto et al. suggest that for deep models relying on stochastic gradient descent, one can utilize the dual objective in equation 47, leading to the learning objective of the form
$$L_{\chi^2}(h; \mathcal{D}_{\mathrm{ID}}, \mathcal{D}^s_{\mathrm{OOD}}) = L_{\mathrm{CE}}(h; \mathcal{D}_{\mathrm{ID}}) + \lambda\, \mathbb{E}_{\mathcal{D}^s_{\mathrm{OOD}}}\big[\max\{\ell_{\mathrm{OE}}(h(x)) - \eta,\, 0\}^2\big], \tag{48}$$
where $\eta$ is treated as a hyper-parameter. Overall, equation 48 ignores all data points that suffer less than η-level loss values, while losses above η are up-weighted due to the square operation.

Wasserstein DRO. The ambiguity set with the Wasserstein distance has also attracted much attention in the literature. Specifically, the Wasserstein distance is given by
$$W_r(P, Q) = \inf_{O \in J(P,Q)} \Big( \int_{\mathcal{Z}\times\mathcal{Z}} \|\zeta - \bar\zeta\|^r \, \mathrm{d}O(\zeta, \bar\zeta) \Big)^{1/r}. \tag{49}$$
However, direct calculation of the Wasserstein DRO is intractable, and Kwon et al. (2020) propose a simple learning method that leads to an effective approximation. Specifically, if the loss function is differentiable and its gradient is Hölder continuous, one can optimize the following surrogate objective as an effective approximation of the optimal solution of Wasserstein DRO:
$$L_{\mathrm{WDRO}}(h; \mathcal{D}_{\mathrm{ID}}, \mathcal{D}^s_{\mathrm{OOD}}) = L_{\mathrm{CE}}(h; \mathcal{D}_{\mathrm{ID}}) + \lambda\big(L_{\mathrm{OE}}(h; \mathcal{D}^s_{\mathrm{OOD}}) + \mathbb{E}_{\mathcal{D}^s_{\mathrm{OOD}}}\|\nabla_x \ell_{\mathrm{OE}}(h(x))\|\big).$$
Please refer to (Kwon et al., 2020) for an in-depth discussion.
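The dual objective in equation 48 is straightforward to compute per mini-batch. A minimal sketch of the χ²-DRO OE term (the per-sample loss values below are toy numbers):

```python
import numpy as np

def chi2_dro_oe(losses, eta):
    """Dual-form chi^2-DRO surrogate for the OE term: E[max(l_OE - eta, 0)^2].
    Samples with loss below eta are ignored; large losses are up-weighted
    by the square operation."""
    excess = np.maximum(losses - eta, 0.0)
    return float(np.mean(excess ** 2))

losses = np.array([0.2, 0.5, 1.5, 3.0])   # per-sample OE losses (toy values)
term = chi2_dro_oe(losses, eta=1.0)       # only the losses 1.5 and 3.0 contribute
```

Here η acts as the hyper-parameter of equation 48: raising it to 10.0, for instance, zeroes out the whole term for this batch.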

B.6 AT

Adversarial training (AT) (Madry et al., 2018a) directly modifies input features to increase the risk, which can also be interpreted through the lens of distributionally robust learning (Sinha et al., 2018). For OE, one can modify the features of surrogate OOD data by adding adversarial noise, namely,
$$\delta_p \leftarrow \mathrm{Proj}\big(\delta_p + \kappa\,\mathrm{sign}\big(\nabla_{\delta_p} \ell_{\mathrm{OE}}(h(x + \delta_p))\big)\big), \tag{51}$$
where $\kappa$ controls the magnitude of the perturbation, Proj is the clipping operation that keeps $\delta_p$ valid, and sign is the signum function. Equation 51 iterates for several steps, and $\delta_p$ is typically initialized with random noise. Applying the adversarial noise to the surrogate OOD data, the resulting learning objective is
$$L_{\mathrm{AT}}(h; \mathcal{D}_{\mathrm{ID}}, \mathcal{D}^s_{\mathrm{OOD}}) = L_{\mathrm{CE}}(h; \mathcal{D}_{\mathrm{ID}}) + \lambda\, \mathbb{E}_{\mathcal{D}^s_{\mathrm{OOD}}}\big[\ell_{\mathrm{OE}}(h(x + \delta_p))\big].$$
AT can be viewed as a direct way of data transformation. However, as demonstrated in Section 5.3, the associated transformation function is simpler than that of our DOE, and thus the performance of AT is inferior to our DOE.

Table 6 reports the mean results and the standard deviations over five individual trials. As we can see, our DOE not only leads to improved average performance in OOD detection, but its results are also more stable than those of the original OE. The superiority of DOE in stability may lie in the fact that the target model can learn from more data than in the OE case, further demonstrating the effectiveness of our proposal. We also compare the performance of OE and DOE when using the MSP scoring and the MaxLogit scoring, with the experiments summarized in Table 7. For both scoring functions, our DOE always achieves superior performance over OE, demonstrating that our proposal can genuinely mitigate the OOD distribution gap issue in OOD detection.
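The single adversarial step of equation 51 can be sketched for a toy linear model, where $\nabla_x \ell_{\mathrm{OE}}$ has the closed form $W^\top(\mathrm{softmax}(Wx) - 1/K)$; the projection step is omitted here, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def oe_input_grad(W, X):
    """Analytic grad_x of the per-sample OE loss for logits z = W x,
    where ell_OE(x) = logsumexp(z) - mean(z): grad = W^T (softmax(z) - 1/K)."""
    K = W.shape[0]
    return (softmax(X @ W.T) - 1.0 / K) @ W   # (batch, d)

K, d, kappa = 3, 4, 0.01
W = rng.standard_normal((K, d))
X = rng.standard_normal((8, d))               # surrogate OOD mini-batch

# One FGSM-style step of equation 51 (Proj omitted in this toy sketch).
delta = np.zeros_like(X)
delta = delta + kappa * np.sign(oe_input_grad(W, X + delta))
X_adv = X + delta
```

Iterating the update a few times, with `delta` initialized by random noise and clipped after each step, recovers the multi-step scheme described above.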

C.5 EFFECT OF HYPER-PARAMETERS

We study the effect of hyper-parameters on the final performance of our DOE, considering the trade-off parameter λ, the perturbation strength α, the smoothing strength β, the perturbation steps num_pert, and the warm-up epochs num_warm. We also study the case of sub-model perturbation, where the model perturbation is applied to only a part of the whole model. All the above experiments are conducted on the CIFAR-100 dataset. As one can see from Tables 11-14, our DOE is fairly robust to different choices of the hyper-parameters (i.e., λ, β, num_pert, and num_warm), and the results are superior to OE across most of the hyper-parameter settings. Still, a proper choice of the hyper-parameters can truly induce improved results in effective OOD detection, reflecting that all the introduced hyper-parameters are useful in our proposed DOE. In Table 15, we further demonstrate that randomly selecting α (from the candidates) reveals superior performance over assigning fixed values. Note that random selection can cover a wider range of OOD situations than fixed values; since the model can then learn from more implicit OOD data, its capability in OOD detection is better than in the fixed-value case. Finally, we show the experimental results with sub-model perturbation in Table 16, where only a part of the model is perturbed in our DOE. Here, we separate the WRN-40-2 into 3 blocks, following the block structure in (Zagoruyko & Komodakis, 2016). As we can see, perturbing the whole model reveals superior performance over the cases with sub-model perturbation. This can be explained by our Lemma 1, in that perturbing the whole model benefits the data transformation most from the layer-wise structure of deep models. With a more flexible form of the transform function, perturbing the whole model reveals better results than sub-model perturbation, since the model can learn from more diverse (implicit) OOD data.



¹ With the usual adjustment that the equations only hold almost everywhere in parameter space.



Figure 1: Comparison between OE, DRO, and DOE. Black boxes indicate the support sets of surrogate/test OOD data. Color intensities indicate the coverage of the learning schemes: a deeper-colored region indicates that the associated model can make more reliable detections therein. As we can see, OE directly makes the model learn from surrogate OOD data, which largely deviate from test OOD situations. DRO further makes the model perform uniformly well over sub-populations, so the model can excel within the support set of the surrogate case. Moreover, DOE makes the model learn from additional OOD data besides the surrogate cases, covering wider OOD situations (exceeding the support set) than OE and DRO. Thus, OOD detection capability increases from left to right.

distribution $\mathcal{D}'_L$. Then, there exists an $(L+1)$-layer ReLU network with the multiplicative perturbation $\{A^{(l)}_{L+1}\}_{l=1}^{L+1}$ and the associated transformed distribution $\mathcal{D}'_{L+1}$, such that the difference between $\mathcal{D}'_{L+1}$ and $\mathcal{D}$ is no smaller than the difference between $\mathcal{D}'_L$ and $\mathcal{D}$.

Figure 2: The scoring densities of OE and DOE on the CIFAR-100 dataset, where the MaxLogit scoring is employed.

We emphasize that the improvement of our DOE over OE is not dominated by our specific choice of scoring strategy. To verify this, we conduct experiments with OE and DOE and then employ the MaxLogit scoring after model training. Figure 2 illustrates the scoring densities with (a) OE and (b) DOE on the CIFAR-100 dataset, where we consider two test-time OOD datasets, namely, Texture and SVHN. Compared with OE, the overlap regions of DOE between the ID (i.e., CIFAR-100) and the OOD (i.e., Texture and SVHN) distributions are reduced. This reveals that even with the same scoring function (i.e., MaxLogit), DOE can still improve the model's detection capability over the original OE. Therefore, we state that the key reason for our improved performance is our novel learning strategy, learning from extra OOD data that can benefit the model. Please refer to Appendix C for their detailed comparison.
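For reference, the two scoring functions compared here are simple functions of the logits. A minimal sketch (the toy logits merely illustrate a confident ID-like prediction versus a flat OOD-like one):

```python
import numpy as np

def msp_score(logits):
    """Maximum softmax probability (MSP) score."""
    Z = logits - logits.max(axis=1, keepdims=True)
    S = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)
    return S.max(axis=1)

def maxlogit_score(logits):
    """MaxLogit score: the largest unnormalized logit."""
    return logits.max(axis=1)

# A confident ID-like prediction versus a flat OOD-like one (toy logits).
id_logits = np.array([[6.0, 1.0, 0.5]])
ood_logits = np.array([[1.1, 1.0, 0.9]])
```

In both scorings, larger values indicate the ID case; thresholding the score then separates ID from OOD inputs.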




Comparison in OOD detection on the CIFAR and ImageNet benchmarks. ↓ (or ↑) indicates that smaller (or larger) values are preferred; a bold font indicates the best results in a column. (Columns report FPR95 ↓ and AUROC ↑ per OOD dataset.)

Comparison of DOE and advanced methods in hard OOD detection. ↓ (or ↑) indicates that smaller (or larger) values are preferred; a bold font indicates the best results in a column. (Columns report FPR95 ↓ and AUROC ↑ per OOD dataset.)

Effectiveness of implicit data transformation and distributionally robust learning. ↓ (or ↑) indicates that smaller (or larger) values are preferred; a bold font indicates the best results in a row.

ImageNet (200 classes) are removed. We compare our DOE with several works reported to perform well in hard OOD detection, including KNN, OE, CSI, and SSD+, with the results summarized in Table 2. As we can see, our DOE beats these advanced methods across all the considered datasets, even for the challenging CIFAR-10 versus CIFAR-100 setting.


Comparison of DOE and advanced methods on the CIFAR-10 dataset. ↓ (or ↑) indicates that smaller (or larger) values are preferred; a shaded row of results indicates the best method among previous post-hoc (or fine-tuning) methods; and a bold font indicates the best result in a column. (Columns report FPR95 ↓ and AUROC ↑ per OOD dataset.)

Comparison of DOE and advanced methods on the CIFAR-100 dataset. ↓ (or ↑) indicates that smaller (or larger) values are preferred; a shaded row of results indicates the best method among post-hoc (or fine-tuning) methods; and a bold font indicates the best results in a column. (Columns report FPR95 ↓ and AUROC ↑ per OOD dataset.)

We first summarize the main experiments in Tables 4-5 on the CIFAR benchmarks for common OOD detection. A brief version can also be found in Table 1 in the main text. Overall, our DOE reveals superior average performance regarding both evaluation metrics, FPR95 and AUROC. However, when it comes to individual test-time OOD datasets, our DOE may not work best in all situations (e.g., KNN on the OOD dataset Places365 with the ID dataset CIFAR-100). We emphasize that this does not challenge the generality of our proposal, since DOE has demonstrated stable improvements over the original OE. The interesting point here is that if OE can further benefit from the latest progress in OOD scoring, the performance of our method in effective OOD detection can also be further improved, which requires our future study. Now, we compare OE and DOE on the CIFAR benchmarks with five individual trials in Table 6.

Comparison of DOE and OE on CIFAR benchmarks with 5 individual trials. ↓ (or ↑) indicates that smaller (or larger) values are preferred; and a bold font indicates the best results in the corresponding column. (Columns report FPR95 ↓ and AUROC ↑ per OOD dataset.)

Comparison of DOE and OE on CIFAR and ImageNet benchmarks with the MSP scoring and the MaxLogit scoring. ↓ (or ↑) indicates that smaller (or larger) values are preferred; and a bold font indicates the best results in the corresponding column. (Columns report FPR95 ↓ and AUROC ↑ per OOD dataset.)

Comparison of OE and DOE when training from scratch on CIFAR benchmarks.

Comparison of DOE and advanced methods on the ImageNet dataset. ↓ (or ↑) indicates that smaller (or larger) values are preferred; a shaded row of results indicates the best method among post-hoc (or fine-tuning) methods; and a bold font indicates the best results in the corresponding column. (Columns report FPR95 ↓ and AUROC ↑ per OOD dataset.)

Robust learning with the worst OOD regret and the worst OOD risk. Table 10 summarizes the results on the CIFAR-100 dataset, comparing searching for the worst OOD regret (DOE-regret) and the worst OOD risk (DOE-risk). Both realizations improve the results over the original OE. However, the regret-based DOE reveals better results than the risk-based one, with a further 4.59 improvement in FPR95. Here, the worst regret better indicates the worst OOD distribution than its risk counterpart, and thus DOE-regret, as employed in Algorithm 1, demonstrates superior results in Table 10.



DOE on CIFAR-100 with various α.

DOE on CIFAR-100 with sub-model perturbation.

