NON-PARAMETRIC OUTLIER SYNTHESIS

Abstract

Out-of-distribution (OOD) detection is indispensable for safely deploying machine learning models in the wild. One of the key challenges is that models lack supervision signals from unknown data, and as a result can produce overconfident predictions on OOD data. Recent work on outlier synthesis modeled the feature space as a parametric Gaussian distribution, a strong and restrictive assumption that might not hold in reality. In this paper, we propose a novel framework, non-parametric outlier synthesis (NPOS), which generates artificial OOD training data and facilitates learning a reliable decision boundary between ID and OOD data. Importantly, our proposed synthesis approach does not make any distributional assumption on the ID embeddings, thereby offering strong flexibility and generality. We show that our synthesis approach can be mathematically interpreted as a rejection sampling framework. Extensive experiments show that NPOS achieves superior OOD detection performance, outperforming competitive rivals by a significant margin.

1. INTRODUCTION

When deploying machine learning models in the open and non-stationary world, their reliability is often challenged by the presence of out-of-distribution (OOD) samples. As the trained models have not been exposed to the unknown distribution during training, identifying OOD inputs has become a vital and challenging problem in machine learning. There is an increasing awareness in the research community that source-trained models should not only perform well on in-distribution (ID) samples, but also be capable of distinguishing ID vs. OOD samples. To achieve this goal, a promising learning framework is to jointly optimize for both (1) accurate classification of samples from P_in, and (2) reliable detection of data from outside P_in. This framework thus integrates distributional uncertainty as a first-class construct in the learning process. In particular, an uncertainty loss term performs a level-set estimation that separates ID vs. OOD data, in addition to performing ID classification. Despite the promise, a key challenge is how to provide OOD data for training without explicit knowledge about the unknowns.

A recent work by Du et al. (2022c) proposed synthesizing virtual outliers from the low-likelihood region in the feature space of ID data, and showed strong efficacy for discriminating the boundaries between known and unknown data. However, they modeled the feature space as a class-conditional Gaussian distribution, a strong and restrictive assumption that might not always hold in practice when facing complex distributions in the open world. Our work mitigates this limitation. In this paper, we propose a novel learning framework, Non-Parametric Outlier Synthesis (NPOS), that enables models to learn from the unknowns. Importantly, our proposed synthesis approach does not make any distributional assumption on the ID embeddings, thereby offering strong flexibility and generality, especially when the embeddings do not conform to a parametric distribution.
Our framework is illustrated in Figure 1. To synthesize outliers, our key idea is to "spray" around the low-likelihood ID embeddings, which lie on the boundary between ID and OOD data. These boundary points are identified by non-parametric density estimation with the nearest neighbor distance. The artificial outliers are then sampled from a Gaussian kernel centered at the embeddings of the boundary ID samples. Rejection sampling is performed by only keeping synthesized outliers with low likelihood. Leveraging the synthesized outliers, our uncertainty loss effectively performs the level-set estimation, learning to separate data between the two sets (ID vs. outliers). This loss term is crucial for learning a compact decision boundary and preventing overconfident predictions on unknown data. Our uncertainty loss is trained in tandem with another loss that optimizes the ID classification and embedding quality. In particular, that loss encourages learning highly distinguishable ID representations in which ID samples are close to their class centroids. Such compact representations are desirable and beneficial for non-parametric outlier synthesis, where the density estimation depends on the quality of the learned features. Our learning framework is end-to-end trainable and converges when both ID classification and ID/outlier separation perform satisfactorily.

Extensive experiments show that NPOS achieves superior OOD detection performance, outperforming competitive rivals by a large margin. In particular, Fort et al. (2021) recently exploited large pre-trained models for OOD detection. They fine-tuned the model using the standard cross-entropy loss and then applied the maximum softmax probability (MSP) score in testing. Under the same pre-trained model weights (i.e., CLIP-B/16 (Radford et al., 2021)) and the same fine-tuning configuration, NPOS reduces the FPR95 from 41.87% (Fort et al., 2021) to 5.76%, a direct 36.11% improvement.
Since both methods employ MSP in testing, the performance gap signifies the efficacy of our training loss using outlier synthesis for model regularization. Moreover, we contrast NPOS with the most relevant baseline VOS (Du et al., 2022c), which uses parametric outlier synthesis; NPOS outperforms it by 13.40% in FPR95. This comparison directly confirms the advantage of our proposed non-parametric outlier synthesis approach. To summarize our key contributions:

1. We propose a new learning framework, non-parametric outlier synthesis (NPOS), which automatically generates outlier data for effective model regularization and improved test-time OOD detection. Compared to the recent method VOS, which relies on a parametric distributional assumption (Du et al., 2022b), NPOS offers both stronger performance and generality.

2. We mathematically formulate our outlier synthesis approach as a rejection sampling procedure. Through non-parametric density estimation, our training process approximates the level set distinguishing ID and OOD data.

3. We conduct comprehensive ablations to understand the efficacy of NPOS, and further verify its scalability to large datasets, including ImageNet. These results provide insights into the non-parametric approach for OOD detection, shedding light on future research.

2. PRELIMINARIES

Background and notations. Over the last few decades, machine learning has primarily operated in the closed-world setting, where the classes are assumed to be the same between training and test data. Formally, let X = R^d denote the input space and Y_in = {1, . . . , C} denote the label space, with C being the number of classes. The learner has access to the labeled training set D_in = {(x_i, y_i)}_{i=1}^n, drawn i.i.d. from the joint data distribution P_{X Y_in}. Let P_in denote the marginal distribution on X, which is also referred to as the in-distribution. Let f : X → R^C denote a function for the classification task, which predicts the label of an input sample. To obtain an optimal classifier, a classic approach is empirical risk minimization (ERM) (Vapnik, 1999): f* = argmin_{f ∈ F} R_closed(f), where R_closed(f) = (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i), ℓ is the loss function, and F is the hypothesis space.

Out-of-distribution detection. The closed-world assumption rarely holds for models deployed in the open world, where data from unknown classes can naturally emerge (Bendale & Boult, 2015). Formally, our framework concerns a common real-world scenario in which the algorithm is trained on ID data with classes Y_in = {1, ..., C}, but will then be deployed in environments containing out-of-distribution samples from unknown classes y ∉ Y_in, which should therefore not be predicted by f. At its core, OOD detection estimates the lower level set L := {x : P_in(x) ≤ β}. Given any x, it classifies x as OOD if and only if x ∈ L. The threshold β is chosen by controlling the false detection rate ∫_L P_in(x) dx, which can be set to a typical value (e.g., 0.05) or another value appropriate for the application. As in classic statistics (see Chen et al. (2017) and the references therein), we estimate the lower level set L from the ID dataset D_in.
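As a toy illustration of this level-set view (our own sketch, not part of the paper; the 1-D standard-normal P_in and the 5% rate are illustrative assumptions), the threshold β can be chosen as a quantile of the ID density values:

```python
import numpy as np

# Toy 1-D example: P_in is a standard normal (an assumption for illustration).
rng = np.random.default_rng(0)
x_id = rng.normal(0.0, 1.0, size=100_000)   # ID samples drawn from P_in

def p_in(x):
    """Density of the (here, known) in-distribution P_in = N(0, 1)."""
    return np.exp(-0.5 * np.square(x)) / np.sqrt(2.0 * np.pi)

# Pick beta so that ~5% of the ID samples fall in the lower level set
# L = {x : P_in(x) <= beta}, i.e., the false detection rate is ~0.05.
beta = np.quantile(p_in(x_id), 0.05)

def is_ood(x):
    """Flag x as OOD iff it lies in the lower level set L."""
    return p_in(x) <= beta
```

Points deep in the tails (low ID density) are flagged as OOD, while points near the mode are kept as ID; in practice P_in is unknown and must be estimated, which is exactly what the non-parametric machinery of Section 3 addresses.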

3. METHOD

Framework overview. Machine learning models deployed in the wild must deliver both classification accuracy and safety performance. We use safety to characterize the model's ability to detect OOD data. This safety performance is lacking in off-the-shelf machine learning algorithms, which typically focus on minimizing error only on the in-distribution data from P_in but do not account for the uncertainty that can arise outside P_in. For example, recall that empirical risk minimization (ERM) (Vapnik, 1999), a long-established method that is commonly used today, operates under the closed-world assumption (i.e., no distribution shift between training and testing). Models optimized with ERM are known to produce overconfident predictions on OOD data (Nguyen et al., 2015), since the decision boundary is not conservative. To address these challenges, our learning framework jointly optimizes for both: (1) accurate classification of samples from P_in, and (2) reliable detection of data from outside P_in. Given a weighting factor α, the risk can be formalized as:

argmin [ R_closed(f) + α · R_open(g) ],

where R_closed(f) is the classification error on ID data and R_open(g) is the error of the OOD detector. Define the OOD conditional distribution as P_in restricted to the lower level set L := {x : P_in(x) ≤ β} and renormalized:

Q(x | OOD) = 1[P_in(x) ≤ β] P_in(x) / Z_out,   Z_out = ∫ 1[P_in(x) ≤ β] P_in(x) dx,

where 1[·] is the indicator function. Similarly, define an ID conditional distribution:

Q(x | ID) = 1[P_in(x) > β] P_in(x) / (1 − Z_out).

Note that Q(x | OOD) and Q(x | ID) have disjoint supports that partition h(X), where h : X → R^d is a feature encoder mapping an input to its d-dimensional penultimate-layer embedding. In particular, for any non-degenerate prior Q(OOD) = 1 − Q(ID), the Bayes decision boundary for the joint distribution Q is precisely the β level set. Since we only have access to the ID training data, a critical consideration is how to provide OOD data for training. A recent work by Du et al.
(2022c) proposed synthesizing virtual outliers from the low-likelihood region of the feature space h(x), which is more tractable than synthesizing x in the input space X. However, they modeled the feature space as a class-conditional Gaussian distribution, a strong assumption that might not always hold in practice. To circumvent this limitation, our new idea is to perform non-parametric outlier synthesis, which does not make any distributional assumption on the ID embeddings. Our proposed synthesis approach thereby offers stronger flexibility and generality. To synthesize outlier data, we formalize our idea by rejection sampling (Rubinstein & Kroese, 2016) with P_in as the proposal distribution. In a nutshell, the rejection sampling can be done in three steps:

1. Draw an index into D_in by i ∼ Uniform{1, ..., n}, where n is the number of training samples.

2. Draw a sample v (a candidate synthesized outlier) in the feature space from a Gaussian kernel centered at h(x_i) with covariance σ²I: v ∼ N(h(x_i), σ²I).

3. Accept v with probability Q(v | OOD) / (M · P_in(v)), where M is an upper bound on the likelihood ratio Q(v | OOD) / P_in(v). Since Q(v | OOD) is a truncated P_in, one can choose M = 1/Z_out. Equivalently, accept v if P_in(v) ≤ β.

Despite the soundness of the mathematical framework, the realization in modern neural networks is non-trivial. A salient challenge is computational efficiency: drawing samples uniformly from P_in in step 1 is expensive, since the majority of samples have a high density and are easily rejected in step 3. To realize our framework efficiently, we propose the following procedure: (1) identify boundary ID samples, and (2) synthesize outliers based on the boundary samples.

Identify ID samples near the boundary. We leverage the non-parametric nearest neighbor distance as a heuristic surrogate for approximating P_in, and select the ID data with the highest k-NN distances as the boundary samples. We illustrate this step in the middle panel of Figure 1. Specifically, denote the embedding set of the training data as Z = {z_1, z_2, ..., z_n}, where z_i is the L2-normalized penultimate feature z_i = h(x_i)/∥h(x_i)∥_2. For any embedding z′ ∈ Z, we calculate the k-NN distance w.r.t. Z: d_k(z′, Z) = ∥z′ − z_(k)∥_2, where z_(k) is the k-th nearest neighbor of z′ in Z. If an embedding has a large k-NN distance, it is likely to lie on the boundary of the feature space. We therefore select the embeddings with the largest k-NN distances, and denote the resulting set of boundary samples as B.

Synthesize outliers based on boundary samples. Having obtained a set of ID embeddings near the boundary of the feature space, we synthesize outliers by sampling from a multivariate Gaussian distribution centered at each selected ID embedding h(x_i) ∈ B: v ∼ N(h(x_i), σ²I), where v denotes a synthesized outlier around h(x_i) and σ² modulates the variance.
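The two-step procedure can be sketched as follows (a simplified NumPy version; the function names and the default values of `k`, `sigma`, and `p` are our own illustrative choices, with the acceptance step approximated by keeping the Gaussian candidate farthest, in k-NN distance, from the ID set):

```python
import numpy as np

def knn_distance(points, reference, k):
    """Distance from each row of `points` to its k-th nearest neighbor in `reference`.

    Brute-force O(n^2) pairwise distances; fine for a sketch, not for ImageNet-scale.
    """
    dists = np.linalg.norm(points[:, None, :] - reference[None, :, :], axis=-1)
    dists.sort(axis=1)
    return dists[:, k - 1]

def select_boundary(embeddings, k=3, num_boundary=1):
    """Keep the ID embeddings with the largest k-NN distance (likely boundary points)."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)  # L2-normalize
    # k + 1 because each point is its own zero-distance nearest neighbor in Z
    d = knn_distance(z, z, k + 1)
    return z[np.argsort(-d)[:num_boundary]], z

def synthesize_outliers(boundary, z, sigma=0.05, p=10, k=3, rng=None):
    """Draw p Gaussian candidates per boundary point, v ~ N(b, sigma^2 I), and
    keep the candidate with the largest k-NN distance w.r.t. the ID set Z."""
    rng = rng or np.random.default_rng(0)
    outliers = []
    for b in boundary:
        cands = b + sigma * rng.standard_normal((p, b.shape[0]))
        d = knn_distance(cands, z, k)
        outliers.append(cands[np.argmax(d)])
    return np.stack(outliers)
```

A production implementation would typically replace the brute-force k-NN search with an approximate nearest-neighbor index; the logic of "large k-NN distance ≈ low ID density" stays the same.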
For each boundary ID sample, we can repeatedly sample p different outliers using Equation 5, which produces a set V_i = {v_1, v_2, ..., v_p}. To ensure that the outliers are sufficiently far away from the ID data, we further perform a filtering step by selecting the virtual outlier in V_i with the highest k-NN distance w.r.t. Z, as illustrated in the right panel of Figure 1 (dark orange points). The final collection of accepted virtual outliers, denoted V, is used for the binary training objective:

R_open = E_{v∼V} [ −log ( 1 / (1 + exp(ϕ(v))) ) ] + E_{x∼P_in} [ −log ( exp(ϕ(h(x))) / (1 + exp(ϕ(h(x)))) ) ],

where ϕ(·) is a nonlinear MLP function and h(x) denotes the ID embedding. In other words, the loss function takes both the ID and synthesized outlier embeddings and estimates the level set through the binary cross-entropy loss.

Now we discuss the design of R_closed, which minimizes the risk on the in-distribution data. We aim to produce highly distinguishable ID representations, on which non-parametric outlier synthesis (cf. Section 3.1) depends. In a nutshell, the model aims to learn compact representations that align ID samples with their class prototypes. Specifically, we denote by µ_1, µ_2, . . . , µ_C the prototype embeddings for the ID classes c ∈ {1, 2, ..., C}. The prototype for each sample is assigned based on the ground-truth class label. For any input x with corresponding embedding h(x), we can calculate the cosine similarity between h(x) and the prototype vector µ_j:

f_j(x) = h(x) · µ_j / (∥h(x)∥ · ∥µ_j∥),

which can be viewed as the j-th logit output. A larger f_j(x) indicates a stronger association with the j-th class. The classification loss is the cross-entropy applied to the softmax output:

R_closed = −log [ e^{f_y(x)/(τ·∥f∥)} / Σ_{j=1}^C e^{f_j(x)/(τ·∥f∥)} ],

where τ is the temperature and f_y(x) is the logit output corresponding to the ground-truth label y. Our training framework is end-to-end trainable, where the two losses R_open (cf.
Section 3.1) and R_closed work in a synergistic fashion. First, as the classification loss (Equation 8) shapes the ID embeddings, our non-parametric outlier synthesis module benefits from the highly distinguishable representations. Second, our uncertainty loss in Equation 6 facilitates learning a compact decision boundary between ID and OOD data, which provides a reliable estimate of the OOD uncertainty that can arise. The entire training process converges when both components perform satisfactorily.
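Under the notation above, the two loss terms can be sketched as follows (a simplified version: the ∥f∥ factor is folded into the temperature τ, ϕ is treated as a given scalar-output head, and the names `r_open`, `r_closed`, `total_risk` are illustrative):

```python
import numpy as np

def r_open(phi_outlier, phi_id):
    """Binary logistic loss: synthesized outliers pushed toward label 0, ID toward 1.
    `phi_*` are arrays of scalar logits produced by the MLP head phi(.)."""
    loss_outlier = np.mean(np.log1p(np.exp(phi_outlier)))  # -log 1/(1 + e^phi)
    loss_id = np.mean(np.log1p(np.exp(-phi_id)))           # -log e^phi/(1 + e^phi)
    return loss_outlier + loss_id

def r_closed(embeddings, prototypes, labels, tau=0.1):
    """Cross-entropy over temperature-scaled cosine-similarity logits between
    sample embeddings and class prototypes."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    mu = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = (z @ mu.T) / tau                       # cosine logits f_j(x)/tau
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(log_prob[np.arange(len(labels)), labels])

def total_risk(phi_outlier, phi_id, embeddings, prototypes, labels,
               alpha=0.1, tau=0.1):
    """Joint objective: R_closed + alpha * R_open."""
    return r_closed(embeddings, prototypes, labels, tau) + alpha * r_open(phi_outlier, phi_id)
```

In a real training loop these would be differentiable tensor ops (e.g., in PyTorch) rather than NumPy, but the structure of the objective is the same.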

3.3. TEST-TIME OOD DETECTION

In testing, we use the same scoring function as Ming et al. (2022a) for OOD detection:

S(x) = max_j [ e^{f_j(x)/τ} / Σ_{c=1}^C e^{f_c(x)/τ} ],  where f_j(x) = h(x) · µ_j / (∥h(x)∥ · ∥µ_j∥).

The rationale is that ID data will be matched to one of the prototype vectors with a high score, whereas OOD data will not. Based on the scoring function, the OOD detector is G_λ(x) = 1{S(x) ≥ λ}, where, by convention, 1 represents the positive class (ID) and 0 indicates OOD. The threshold λ is chosen so that a high fraction of ID data (e.g., 95%) is above it. Our algorithm is summarized in Algorithm 1 (Appendix C).
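A sketch of this test-time scoring rule (the prototypes, `tau`, and `lam` below are illustrative stand-ins for the class prototypes, temperature, and threshold):

```python
import numpy as np

def ood_score(embedding, prototypes, tau=0.1):
    """Maximum softmax probability over temperature-scaled cosine logits."""
    z = embedding / np.linalg.norm(embedding)
    mu = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = (mu @ z) / tau                 # f_j(x)/tau for every class j
    probs = np.exp(logits - logits.max())   # stable softmax
    probs /= probs.sum()
    return float(probs.max())               # S(x)

def detect(embedding, prototypes, lam, tau=0.1):
    """G_lambda(x): 1 = classified as ID, 0 = flagged as OOD."""
    return int(ood_score(embedding, prototypes, tau) >= lam)
```

An embedding aligned with one prototype yields a near-one score, while an embedding equidistant from all prototypes yields roughly 1/C, which is what makes the max-softmax score a useful ID/OOD separator here.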

4. EXPERIMENTS

In this section, we present empirical evidence to validate the effectiveness of our method on real-world classification tasks. We describe the setup in Section 4.1, followed by the results and a comprehensive analysis in Section 4.2-Section 4.5.

4.1. SETUP

Datasets. We use both the standard CIFAR-100 benchmark (Krizhevsky et al., 2009) and the large-scale ImageNet dataset (Deng et al., 2009) as the in-distribution data. Our main results and ablation studies use ImageNet-100 as ID data, with iNaturalist (Van Horn et al., 2018), SUN (Xiao et al., 2010), PLACES (Zhou et al., 2017), and TEXTURE (Cimpoi et al., 2014) as OOD test sets. For each OOD dataset, the categories are disjoint from the ID dataset. We provide details of the datasets and categories in Appendix A.

Model. In our main experiments, we perform training by fine-tuning the CLIP model (Radford et al., 2021), one of the most popular publicly available pre-trained models. CLIP aligns an image with its corresponding textual description in the feature space via a self-supervised contrastive objective. Concretely, it adopts a simple dual-stream architecture with one image encoder I : x → R^d (e.g., ViT (Dosovitskiy et al., 2021)) and one text encoder T : t → R^d (e.g., Transformer (Vaswani et al., 2017)). We fine-tune the last two blocks of CLIP's image encoder using our proposed training objective in Section 3. To indicate the input patch size in ViT models, we append "/x" to model names, and we use -B and -L to indicate the Base and Large versions of the corresponding architecture. For instance, ViT-B/16 denotes the Base variant with an input patch resolution of 16 × 16. We mainly use CLIP-B/16, which contains a ViT-B/16 Transformer as the image encoder. We utilize the pre-extracted text embeddings from a masked self-attention Transformer as the prototypes for each class, where µ_i = T(t_i) and t_i is the text prompt for label y_i. The text encoder is not needed in the training process. We additionally conduct ablations on alternative backbone architectures, including ResNet, in Section 4.3. Note that our method is not limited to pre-trained models; it is generally applicable to models trained from scratch, as we show in Section 4.4.

Experimental details. We employ a two-layer MLP with a ReLU nonlinearity for ϕ, with a hidden-layer dimension of 16.
We train the model using stochastic gradient descent with a momentum of 0.9 and a weight decay of 10^-4. For ImageNet-100, we train the model for a total of 20 epochs, using only Equation 8 for representation learning during the first 10 epochs; we then train the model jointly with our outlier synthesis loss (Equation 6) for the last 10 epochs. We set the learning rate to 0.1 for the R_closed branch, and 0.01 for the MLP in the R_open branch. For the ImageNet-1k dataset, we train the model for 60 epochs, where the first 20 epochs are trained with Equation 8 only. Extensive ablations on the hyperparameters are conducted in Section 4.3 and Appendix F.

Evaluation metrics. We report the following metrics: (1) the false positive rate (FPR95) of OOD samples when the true positive rate of ID samples is 95%, (2) the area under the receiver operating characteristic curve (AUROC), and (3) the ID classification error rate (ID ERR).
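The first two metrics can be computed from arrays of ID and OOD scores as follows (a self-contained sketch; production code typically uses a library such as scikit-learn for AUROC):

```python
import numpy as np

def fpr_at_95_tpr(id_scores, ood_scores):
    """FPR95: fraction of OOD samples scoring above the threshold at which
    95% of ID samples are (correctly) accepted as ID."""
    threshold = np.quantile(id_scores, 0.05)   # 95% of ID scores lie above it
    return float(np.mean(np.asarray(ood_scores) >= threshold))

def auroc(id_scores, ood_scores):
    """AUROC equals the probability that a randomly chosen ID sample outscores
    a randomly chosen OOD sample (ties count half)."""
    id_s = np.asarray(id_scores, dtype=float)[:, None]
    ood_s = np.asarray(ood_scores, dtype=float)[None, :]
    return float(np.mean((id_s > ood_s) + 0.5 * (id_s == ood_s)))
```

The pairwise formulation of AUROC is O(n·m) and is meant only to make the metric's meaning explicit; rank-based implementations are used for large score arrays.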

4.2. MAIN RESULTS

NPOS significantly improves OOD detection performance. As shown in Table 1, we compare the proposed NPOS with competitive OOD detection methods. For a fair comparison, all methods use only ID data, without auxiliary outlier datasets. We compare our method with the following recent competitive baselines: (1) Maximum Concept Matching (MCM) (Ming et al., 2022a), (2) Maximum Softmax Probability (Hendrycks & Gimpel, 2017; Fort et al., 2021), (3) ODIN score (Liang et al., 2018), (4) Energy score (Liu et al., 2020b), (5) GradNorm score (Huang et al., 2021), (6) ViM score (Wang et al., 2022), (7) KNN distance (Sun et al., 2022), and (8) VOS (Du et al., 2022b), which synthesizes outliers by modeling the ID embeddings as a mixture of Gaussian distributions and sampling from the low-likelihood region of the feature space. MCM is the latest zero-shot OOD detection approach for vision-language models. All other baseline methods are fine-tuned using the same pre-trained model weights (i.e., CLIP-B/16) and the same number of layers as ours. We added a fully connected layer to the model backbone, which produces the classification output. In particular, Fort et al. (2021) fine-tuned the model using the cross-entropy loss and then applied the MSP score in testing. The KNN distance is calculated using features from the penultimate layer of the same fine-tuned model as Fort et al. (2021). For VOS, we follow the original loss function and OOD score defined in Du et al. (2022b). We show that NPOS achieves superior OOD detection performance, outperforming the competitive rivals by a large margin. In particular, NPOS reduces the FPR95 from 41.87% (Fort et al., 2021) to 5.76% (ours), a direct 36.11% improvement. The performance gap signifies the effectiveness of our training loss using outlier synthesis for model regularization.
By incorporating the synthesized outliers, our risk term R_open is crucial for preventing overconfident predictions on OOD data and improving test-time OOD detection.

Non-parametric outlier synthesis outperforms VOS. We contrast NPOS with the most relevant baseline VOS, which NPOS outperforms by 13.40% in FPR95. A major difference between the two approaches lies in how outliers are synthesized: a parametric approach (VOS) vs. a non-parametric approach (ours). Compared to VOS, our method does not make any distributional assumption on the ID embeddings, hence offering stronger flexibility and generality. Another difference lies in the ID classification loss: VOS employs the softmax cross-entropy loss, while our method utilizes a different loss (cf. Equation 8) to learn distinguishable ID embeddings. To clearly isolate the effect, we further enhance VOS by using the same classification loss as defined in Equation 8, endowing VOS with a stronger representation space. The resulting method, dubbed VOS+, and its corresponding performance are also shown in Table 1. Note that VOS+ and NPOS differ only in how outliers are synthesized. While VOS+ indeed performs better than the original VOS, its FPR95 is still worse than our method's. This experiment directly and fairly confirms the superiority of our proposed non-parametric outlier synthesis approach. Computationally, the training time of NPOS is 30.8 minutes on ImageNet-100, comparable to VOS (30.0 minutes).

NPOS scales effectively to large datasets. To examine the scalability of NPOS, we also evaluate on the ImageNet-1k dataset (ID) in Table 2. Recently, Fort et al. (2021) explored small-scale OOD detection by fine-tuning the ViT model and then applying the MSP score in testing. When extending to large-scale tasks, we find that NPOS yields superior performance under the same image encoder configuration (ViT-B/16). In particular, NPOS reduces the FPR95 from 67.31% to 37.93%.
This highlights the advantage of utilizing non-parametric outlier synthesis to learn a conservative decision boundary for OOD detection. Our results also confirm that NPOS can indeed scale to large datasets with complex visual diversity.

Learning compact ID representations benefits NPOS. We investigate the importance of optimizing ID embeddings for non-parametric outlier synthesis. Recall that our loss function in Equation 8 facilitates learning a compact representation for ID data. To isolate its effect, we replace the loss function with the cross-entropy loss while keeping everything else the same. On ImageNet-100, this yields an average FPR95 of 17.94% and an AUROC of 95.75%. The worsened OOD detection performance signifies the importance of our ID classification loss, which optimizes ID embedding quality.

4.3. ABLATION STUDIES

Ablation on the weight α. In Figure 2(a), we ablate the effect of the weight α on OOD detection performance. Here the ID data is ImageNet-100. When α is reduced to a small value (e.g., 0.01), the performance approaches that of the MSP baseline trained with R_closed only. In contrast, under mild weighting, such as α = 0.1, the OOD detection performance is significantly improved. Excessively strong regularization using synthesized outliers ultimately degrades the performance.

Ablation on the variance σ² in sampling. A proper variance σ² for sampling virtual outliers is critical to our method. Recall from Equation 5 that σ² modulates the variance when synthesizing outliers around boundary samples. In Figure 2(b), we systematically analyze the effect of σ² on OOD detection performance. We vary σ² ∈ {0.01, 0.1, 0.5, 1.0, 10}. We observe that the performance of NPOS is insensitive to moderate variance values. In the extreme case where σ² becomes too large, the sampled virtual outliers may overlap severely with the ID data, which leads to performance degradation, as expected.

Ablation on k in calculating the KNN distance.
In Figure 2(c), we analyze the effect of k, i.e., the number of nearest neighbors used for non-parametric density estimation. In particular, we vary k ∈ {100, 200, 300, 400, 500}. We observe that our method is not sensitive to this hyperparameter as k varies from 100 to 500.

4.4. ADDITIONAL RESULTS WITHOUT PRE-TRAINED MODEL

Going beyond fine-tuning a large pre-trained model, we show that NPOS is also applicable and effective when training from scratch. This setting allows us to evaluate our algorithm itself, without the impact of strong model initialization. Appendix D showcases the performance of NPOS trained on three datasets: CIFAR-10, CIFAR-100, and ImageNet-100. We substitute the text embeddings µ_i of the vision-language model with class-conditional image embeddings estimated in an exponential-moving-average (EMA) manner (Li et al., 2021):

µ_c := Normalize(γ · µ_c + (1 − γ) · z),  ∀c ∈ {1, 2, . . . , C},

where the prototype µ_c for class c is updated during training as the moving average of all embeddings with label c, and z denotes the normalized embedding of a sample in class c. The EMA-style update avoids the costly alternating training and prototype estimation over the entire training set used in the conventional approach (Zhe et al., 2019). We evaluate on six OOD datasets: TEXTURES (Cimpoi et al., 2014), SVHN (Netzer et al., 2011), PLACES365 (Zhou et al., 2017), LSUN-RESIZE & LSUN-C (Yu et al., 2015), and ISUN (Xu et al., 2015). The comparison is shown in Table 4, Table 5, and Table 6. NPOS consistently improves OOD detection performance over all published baselines. For example, on CIFAR-100, NPOS outperforms the most relevant baseline VOS by 27.41% in FPR95.
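The EMA prototype update can be sketched as follows (γ and the per-sample update schedule are illustrative; in training, z would be the normalized embedding of each incoming sample of class c):

```python
import numpy as np

def ema_prototype_update(prototypes, z, label, gamma=0.95):
    """mu_c <- Normalize(gamma * mu_c + (1 - gamma) * z) for the sample's class c.

    `prototypes` has shape (C, d); `z` is a unit-norm embedding of a sample
    whose ground-truth class is `label`.
    """
    mu = prototypes.copy()
    updated = gamma * mu[label] + (1.0 - gamma) * z
    mu[label] = updated / np.linalg.norm(updated)   # keep the prototype unit-norm
    return mu
```

Each update nudges only the matching class prototype toward the new embedding while renormalizing it, so no pass over the full training set is needed.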

5. RELATED WORK

OOD detection has attracted a surge of interest since the overconfidence phenomenon on OOD data was first revealed by Nguyen et al. (2015). One line of work performs OOD detection by devising scoring functions, including confidence-based methods (Bendale & Boult, 2016; Hendrycks & Gimpel, 2017; Liang et al., 2018), energy-based scores (Liu et al., 2020b; Wang et al., 2021), distance-based approaches (Lee et al., 2018b; Tack et al., 2020; Sehwag et al., 2021; Sun et al., 2022; Du et al., 2022a; Ming et al., 2023), gradient-based scores (Huang et al., 2021), and Bayesian approaches (Gal & Ghahramani, 2016; Lakshminarayanan et al., 2017; Maddox et al., 2019; Malinin & Gales, 2019; Wen et al., 2020; Kristiadi et al., 2020). Another promising line of work addresses OOD detection by training-time regularization (Bevandić et al., 2018; Malinin & Gales, 2018; Geifman & El-Yaniv, 2019; Hein et al., 2019; Meinke & Hein, 2020; Mohseni et al., 2020; Jeong & Kim, 2020; van Amersfoort et al., 2020; Yang et al., 2021; Wei et al., 2022). For example, the model is regularized to produce lower confidence (Lee et al., 2018a; Hendrycks et al., 2019) or higher energy (Du et al., 2022c; Liu et al., 2020b; Katz-Samuels et al., 2022; Ming et al., 2022b) on the outlier data. Most regularization methods require the availability of auxiliary OOD data. Among methods utilizing ID data only, Hsu et al. (2020) proposed to decompose confidence scoring during training with a modified input pre-processing method. Liu et al. (2020a) proposed a spectral-normalized neural Gaussian process, optimizing the network design for uncertainty estimation. Closest to our work, VOS (Du et al., 2022c) synthesizes virtual outliers using multivariate Gaussian distributions and regularizes the model's decision boundary between ID and OOD data during training. In this paper, we propose a novel non-parametric outlier synthesis approach, mitigating the distributional assumption made in VOS.
Large-scale OOD detection. Recent works have advocated for OOD detection in large-scale settings, which are closer to real-world applications. Research efforts include scaling OOD detection to large semantic label space (Huang & Li, 2021) and exploiting large pre-trained models (Fort et al., 2021) . Recently, powerful pre-trained vision-language models have achieved strong results on zero-shot OOD detection (Ming et al., 2022a) . Different from prior works, we propose a new training/fine-tuning procedure with non-parametric outlier synthesis for model regularization. Our learning framework renders a conservative decision boundary between ID and OOD data, and thereby improves OOD detection.

6. CONCLUSION

In this paper, we propose a novel framework, NPOS, which tackles ID classification and OOD uncertainty estimation in one coherent framework. NPOS mitigates the key shortcomings of previous outlier synthesis-based OOD detection approaches, synthesizing outliers without imposing any distributional assumption. To the best of our knowledge, NPOS makes the first attempt to employ non-parametric outlier synthesis for OOD detection, and it can be formally interpreted as a rejection sampling framework. NPOS establishes competitive performance on challenging real-world OOD detection tasks, evaluated broadly under both recent vision-language models and models trained from scratch. Our in-depth ablations provide further insights into the efficacy of NPOS. We hope our work inspires future research on OOD detection based on non-parametric outlier synthesis.

Non-parametric Outlier Synthesis (Appendix)

A DETAILS OF DATASETS

ImageNet-100. We randomly sample 100 classes from ImageNet-1k (Deng et al., 2009) to create ImageNet-100. The dataset contains the following categories: n01986214, n04200800, n03680355, n02963159, n03874293, n02058221, n04612504, n02841315, n02099712, n02093754, n03649909, n02114712, n03733281, n02319095, n01978455, n04127249, n07614500, n03595614, n04542943, n02391049, n04540053, n03483316, n03146219, n02091134, n02870880, n04479046, n03347037, n02090379, n10148035, n07717556, n04487081, n04192698, n02268853, n02883205, n02002556, n04273569, n02443114, n03544143, n03697007, n04557648, n02510455, n03633091, n02174001, n02077923, n03085013, n03888605, n02279972, n04311174, n01748264, n02837789, n07613480, n02113712, n02137549, n02111129, n01689811, n02099601, n02085620, n03786901, n04476259, n12998815, n04371774, n02814533, n02009229, n02500267, n04592741, n02119789, n02090622, n02132136, n02797295, n01740131, n02951358, n04141975, n02169497, n01774750, n02128757, n02097298, n02085782, n03476684, n03095699, n04326547, n02107142, n02641379, n04081281, n06596364, n03444034, n07745940, n03876231, n09421951, n02672831, n03467068, n01530575, n03388043, n03991062, n02777292, n03710193, n09256479, n02443484, n01728572, n03903868.

OOD datasets. Huang & Li (2021) curated a diverse collection of subsets from iNaturalist (Van Horn et al., 2018), SUN (Xiao et al., 2010), Places (Zhou et al., 2017), and Texture (Cimpoi et al., 2014) as large-scale OOD datasets for ImageNet-1k, where the classes of the test sets do not overlap with ImageNet-1k. We provide a brief introduction to each dataset as follows.

iNaturalist contains images of the natural world (Van Horn et al., 2018). It has 13 super-categories and 5,089 sub-categories covering plants, insects, birds, mammals, and so on. We use the subset that contains 110 plant classes not overlapping with ImageNet-1k.

SUN stands for the Scene UNderstanding Dataset (Xiao et al., 2010).
SUN contains 899 categories that cover more than indoor, urban, and natural places with or without human beings appearing in them. We use the subset which contains 50 natural objects not in ImageNet-1k. Places is a large scene photographs dataset (Zhou et al., 2017) . It contains photos that are labeled with scene semantic categories from three macro-classes: Indoor, Nature, and Urban. The subset we use contains 50 categories that are not present in ImageNet-1k. Texture stands for the Describable Textures Dataset (Cimpoi et al., 2014) . It contains images of textures and abstracted patterns. As no categories overlap with ImageNet-1k, we use the entire dataset as in Huang & Li (2021) .

B BASELINES

To evaluate the baselines, we follow the original definitions in MSP (Hendrycks & Gimpel, 2017; Fort et al., 2021), ODIN score (Liang et al., 2018), Energy score (Liu et al., 2020b), GradNorm score (Huang et al., 2021), ViM score (Wang et al., 2022), KNN distance (Sun et al., 2022), and VOS (Du et al., 2022b).

• For ODIN, we follow the original setting in the work and set the temperature T to 1000.
• For both the Energy and GradNorm scores, the temperature is set to T = 1.
• For ViM, we follow the original implementation according to the released code.
• For VOS, we ensure that the number of negative samples is consistent with our method: for each class, we sample 60k points after estimating the distribution and select the six outliers with the lowest likelihood. For the OOD score, we adopt the uncertainty score proposed in the original method.
• For VOS+, we use the same loss function as defined in Section 3, but replace only the sampling method with a parametric one. VOS+ synthesizes outliers in the same way as VOS (first modeling the feature embeddings as a mixture of multivariate Gaussian distributions, then sampling virtual outliers from the low-likelihood region of the embedding space). For a fair comparison, we also use the textual embeddings extracted from CLIP as the prototypes for VOS+. Note that VOS+ and NPOS differ only in how outliers are synthesized.
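To make the contrast with NPOS concrete, the parametric synthesis used by VOS/VOS+ can be sketched for a single class as follows. This is a minimal NumPy illustration under our own function names and toy data, not the released VOS code: a Gaussian is fit to the class features, candidates are drawn from it, and the lowest-likelihood candidates are kept as virtual outliers.

```python
import numpy as np

def parametric_outliers(feats, n_candidates=60000, n_keep=6, seed=0):
    """VOS-style synthesis for one class: fit a multivariate Gaussian to
    the ID features, draw candidates from it, and keep the candidates
    with the lowest likelihood (largest Mahalanobis distance)."""
    rng = np.random.default_rng(seed)
    mu = feats.mean(axis=0)
    # Small diagonal shrinkage keeps the covariance positive definite.
    cov = np.cov(feats, rowvar=False) + 1e-4 * np.eye(feats.shape[1])
    cand = rng.multivariate_normal(mu, cov, size=n_candidates)
    # Log-density up to a constant: -0.5 * (x - mu)^T Sigma^{-1} (x - mu)
    diff = cand - mu
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    return cand[np.argsort(-maha)[:n_keep]]  # largest distance = lowest density

# Toy usage: 500 ID features in 8 dimensions for one class
outliers = parametric_outliers(
    np.random.default_rng(1).normal(size=(500, 8)), n_candidates=5000)
```

In the actual experiments the candidate count is 60k per class, as stated above; the smaller numbers here are only to keep the toy example fast.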

C ALGORITHM OF NPOS

We summarize our algorithm below. Following Du et al. (2022c), we construct a class-conditional in-distribution sample queue {Q_c}_{c=1}^C, which is periodically updated as new batches of training samples arrive.

D EXPERIMENTAL DETAILS AND RESULTS ON TRAINING FROM SCRATCH

In this section, we provide the implementation details and experimental results for NPOS trained from scratch. We evaluate on three datasets: CIFAR-10, CIFAR-100, and ImageNet-100. We summarize the training configurations of NPOS in Table 7.

CIFAR-10 and CIFAR-100. The results on CIFAR-10 are shown in Table 4. All methods are trained on ResNet-18. We consider the same set of baselines as in the main paper. For the post-hoc OOD detection methods (MSP, ODIN, Energy score, GradNorm, ViM, KNN), we report the results by training the model with the cross-entropy loss for 100 epochs using stochastic gradient descent with momentum 0.9. The initial learning rate is 0.1 and decays by a factor of 10 at epochs 50, 75, and 90, respectively. The batch size is set to 256. The average FPR95 of NPOS is 10.16%, significantly outperforming the best baseline VOS (27.88%). The results on CIFAR-100 are shown in Table 5, where the strong performance of NPOS holds.

Model calibration. In Table 9, we also measure the calibration error of NPOS (with 5 random seeds) on different datasets using the Expected Calibration Error (ECE, in %) (Guo et al., 2017). In the implementation, we adopt a publicly available codebase for metric calculation. The results suggest that NPOS maintains overall comparable (in some cases even better) calibration performance, while achieving much stronger OOD uncertainty estimation.
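The ECE metric reported above has a standard equal-width-binning definition; a minimal NumPy sketch (our own helper for illustration, not the cited codebase) is:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Standard ECE: bin predictions by confidence, then average the gap
    between mean confidence and accuracy per bin, weighted by bin size."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

# Perfectly calibrated toy case: confidence 0.8, and 8 of 10 are correct
conf = np.full(10, 0.8)
corr = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0], dtype=float)
print(expected_calibration_error(conf, corr))  # prints 0.0
```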

F ADDITIONAL ABLATIONS ON HYPERPARAMETERS AND DESIGNS

In this section, we provide additional analysis of the hyperparameters and designs of NPOS. For all the ablations, we use the ImageNet-100 dataset as the in-distribution training data and fine-tune ViT-B/16.

Ablation on the number of boundary samples. We show in Table 10 the effect of m, the number of boundary samples selected per class. We vary m ∈ {100, 150, 200, 250, 300, 350, 400} and observe that NPOS is not sensitive to this hyperparameter.

Ablation on the number of candidate outliers sampled from the Gaussian kernel (per boundary ID sample). As shown in Table 12, we analyze the effect of p, the number of candidate outliers synthesized using Equation 5 around each boundary ID sample. We vary p ∈ {600, 800, 1000, 1200, 1400}. A reasonably large p helps provide a meaningful set of candidate outliers to select from.

Ablation on the temperature for ID embedding optimization. In Table 13, we ablate the effect of the temperature τ used in the ID embedding optimization loss (cf. Equation 8).

Ablation on the density estimation implementation. NPOS adopts a class-conditional approach for outlier synthesis. For instance, it identifies the boundary ID samples by calculating the k-NN distance between sample pairs holding the same class label. After synthesizing outliers in the feature space, it rejects those with lower k-NN distance, which is also implemented in a class-conditional way. In this ablation, we contrast this with an alternative class-agnostic implementation, i.e., we calculate the k-NN distance between samples across all classes in the training set. Under the same training and inference setting, the class-agnostic NPOS gives similar OOD detection performance to the class-conditional NPOS (Table 15).

Variance across runs. We repeat the training of our method on ImageNet-100 with pre-trained ViT-B/16 for 5 different runs. We report the mean and standard deviations for both NPOS (ours) and the most relevant baseline VOS in Table 16.
NPOS is relatively stable, and outperforms VOS by a significant margin. 



https://github.com/gpleiss/temperature_scaling



Figure 1: Illustration of our non-parametric outlier synthesis (NPOS). (a) Embeddings of ID data are optimized using Equation 8, which facilitates learning distinguishable representations. (b) Boundary ID embeddings are selected based on the non-parametric k-NN distance. (c) Outliers are synthesized by sampling from a multivariate Gaussian distribution centered around the boundary embeddings. Rejection sampling is performed by keeping the synthesized outliers (orange) with the lowest likelihood. The risk term R_open performs level-set estimation, learning to separate the synthesized outliers and ID embeddings (Equation 6). Best viewed in color.

Figure 2: (a) Ablation study on the regularization weight α on R_open. (b) Ablation on the variance σ² for synthesizing outliers in Equation 5. (c) Ablation on k for the k-NN distance. The numbers are FPR95. The ID training dataset is ImageNet-100.

Figure 3: t-SNE visualization of synthesized outliers by NPOS.

Qualitatively, we show the t-SNE visualization (Van der Maaten & Hinton, 2008) of the outliers synthesized by our proposed method NPOS in Figure 3. The ID features (colored in purple) are extracted from the penultimate layer of a model trained on ImageNet-100 (class name: HERMIT CRAB). Without making any distributional assumption on the embedding space, NPOS is able to synthesize outliers (colored in orange) in the low-likelihood region, thereby offering strong flexibility and generality.

NPOS: Non-parametric Outlier Synthesis

Input: ID training data D_in = {(x_i, y_i)}_{i=1}^n; initial parameters θ for the backbone, the nonlinear MLP layer ϕ, and the class-conditional prototypes µ.
Output: Learned classifier f(x) and OOD detector G(x).

while train do
  1. Update the class-conditional queue {Q_c}_{c=1}^C with the feature embeddings h(x) of training samples in the current batch.
  2. Select a set of boundary samples B_c consisting of the top-m embeddings with the largest k-NN distances using Equation 4.
  3. Synthesize a set of outliers V_i around each boundary sample x_i ∈ B_1 ∪ B_2 ∪ ... ∪ B_C using Equation 5.
  4. Accept the outliers in each V_i with large k-NN distances.
  5. Calculate the level-set estimation loss R_open and the ID embedding optimization loss R_closed using Equations 6 and 8, respectively; update the parameters θ, ϕ based on the loss in Equation 1.
  6. Update the prototypes via µ_c := Normalize(γ µ_c + (1 - γ) z), ∀c ∈ {1, 2, . . . , C}.
end
while eval do
  1. Calculate the OOD score defined in Section 3.3.
  2. Perform OOD detection by thresholding.
end
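Steps 2-4 of the algorithm above can be sketched for a single class as follows. This is a minimal NumPy illustration with brute-force distance computation; the function names, toy data, and hyperparameter values are our own, and the actual implementation may compute k-NN distances differently.

```python
import numpy as np

def kth_nn_dist(query, ref, k, skip_self=False):
    """Distance from each query point to its k-th nearest neighbor in ref.
    With skip_self=True the zero self-distance is ignored (query ⊆ ref)."""
    d = np.linalg.norm(query[:, None, :] - ref[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, k] if skip_self else d[:, k - 1]

def npos_synthesize(queue, m=2, p=50, sigma=0.3, k=3, n_keep=4, seed=0):
    """Sketch of NPOS for one class:
    (1) pick the top-m ID embeddings with the largest k-NN distance,
    (2) sample p candidates per boundary point from N(x_i, sigma^2 I),
    (3) accept the candidates with the largest k-NN distance to the queue."""
    rng = np.random.default_rng(seed)
    scores = kth_nn_dist(queue, queue, k, skip_self=True)
    boundary = queue[np.argsort(-scores)[:m]]               # step (1)
    noise = sigma * rng.standard_normal((m, p, queue.shape[1]))
    cand = (boundary[:, None, :] + noise).reshape(m * p, -1)  # step (2)
    keep = np.argsort(-kth_nn_dist(cand, queue, k))[:n_keep]  # step (3)
    return cand[keep]

# Toy usage: 200 ID embeddings in 5 dimensions for one class
virtual_outliers = npos_synthesize(np.random.default_rng(2).normal(size=(200, 5)))
```

Keeping the candidates with the largest k-NN distance is the rejection step: it discards samples that fall back into dense ID regions, so the accepted outliers lie in the low-likelihood region without any parametric density assumption.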

OOD detection performance on ImageNet-100 (Deng et al., 2009) as ID. All methods are trained on the same backbone. Values are percentages. Bold numbers are superior results. ↑ indicates larger values are better, and ↓ indicates smaller values are better.

OOD detection performance for ImageNet-1k (Deng et al., 2009) as ID.

Ablation on model capacity and architecture. ID dataset is ImageNet-100.

4.3 ABLATION STUDY

Ablation on model capacity and architecture. To show the effectiveness of a ResNet-based architecture, we replace the CLIP image encoder with RN50x4 (178.3M parameters), which has a similar number of parameters to CLIP-B/16 (149.6M). The OOD detection performance of NPOS on the ImageNet-100 dataset (ID) is shown in Figure 3. NPOS still shows promising results with the ResNet-based backbone, and the performance is comparable between RN50x4 and

OOD detection performance on CIFAR-10 as ID. All methods are trained on ResNet-18. Values are percentages. Bold numbers are superior results.

ImageNet-100. The results are shown in Table 6. All methods are trained on ResNet-101 using the ImageNet-100 dataset. We use a slightly larger model capacity to accommodate the larger-scale dataset with high-resolution images. NPOS significantly outperforms the best baseline KNN by 18.96% in FPR95.

The number in the bracket of the first column indicates the clean accuracy on the in-distribution test data. The results demonstrate that, compared to the vanilla classifier trained with the cross-entropy loss only, NPOS does not incur a substantial change in distributional robustness.

Evaluations on data-shift robustness (numbers are in %).

Calibration performance (numbers are in %).

Ablation study on the number of boundary samples (per class).

Ablation on the number of samples in the class-conditional queue. In Table 11, we investigate the effect of the ID queue size |Q_c| ∈ {1000, 1500, 2000, 2500, 3000}. Overall, the OOD detection performance of NPOS is not sensitive to the size of the class-conditional queue. A sufficiently large |Q_c| is nevertheless desirable, since it makes the non-parametric density estimation more accurate.
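The class-conditional queue {Q_c} whose size is ablated here can be sketched as a fixed-capacity FIFO buffer per class; this is an illustrative helper of our own, not the released implementation.

```python
from collections import deque

class ClassConditionalQueue:
    """FIFO queue of ID embeddings per class; the oldest entries are
    evicted once the per-class capacity |Q_c| is reached."""
    def __init__(self, num_classes, capacity=1000):
        self.queues = [deque(maxlen=capacity) for _ in range(num_classes)]

    def update(self, embeddings, labels):
        # Append each embedding of the current batch to its class queue.
        for z, y in zip(embeddings, labels):
            self.queues[y].append(z)

# Toy usage: 3 classes, capacity 2 per class
q = ClassConditionalQueue(num_classes=3, capacity=2)
q.update([[0.1], [0.2], [0.3]], [0, 0, 0])  # oldest class-0 entry evicted
print(len(q.queues[0]))  # → 2
```

The `deque(maxlen=...)` eviction makes the "periodically updated as new batches arrive" behavior automatic: appending beyond capacity silently drops the oldest embedding.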

Ablation study on the size of ID queue (per class).

Ablation on the number of candidate outliers drawn from the Gaussian kernel.

Ablation study on the temperature τ.

Ablation on the starting epoch of adding R_open(g). In Table 14, we ablate the effect of the epoch at which R_open(g) is added to training. The table shows that adding R_open(g) at the very beginning of training yields slightly worse OOD detection performance, likely because the representations are not yet well-formed at the early stage of training. Adding the regularization in the middle of training yields more desirable performance.

Ablation study on the starting epoch of adding R open (g).

Ablation on different implementations of the non-parametric density estimation.

Results on the mean and standard deviations after 5 runs.

Software and hardware. We use Python 3.8.5 and PyTorch 1.11.0, and 8 NVIDIA GeForce RTX 2080Ti GPUs.

CODE AVAILABILITY

https://github.com/deeplearning-wisc


For the post-hoc OOD detection methods on ImageNet-100, we report the results by training the model with the cross-entropy loss for 100 epochs using stochastic gradient descent with momentum 0.9. The initial learning rate is 0.1 and decays by a factor of 10 at epochs 50, 75, and 90, respectively. The batch size is set to 512.

Our results above demonstrate that NPOS achieves strong OOD detection performance without necessarily relying on pre-trained models. Our framework thus provides strong generality across both scenarios: training from scratch and fine-tuning pre-trained models.
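The step-decay schedule described above (decay by a factor of 10 at epochs 50, 75, and 90) can be written as a small helper; this is an illustrative sketch, not the actual training code.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(50, 75, 90), gamma=0.1):
    """Step-decay schedule: multiply the base LR by gamma at each
    milestone epoch that has already been reached."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

print(learning_rate(0))             # → 0.1
print(round(learning_rate(60), 6))  # → 0.01
```

In a PyTorch training loop the equivalent behavior is typically obtained with a multi-step LR scheduler over the same milestones.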

