OUT-OF-DISTRIBUTION DETECTION WITH DIFFUSION-BASED NEIGHBORHOOD

Abstract

Out-of-distribution (OOD) detection is an important task for ensuring the reliability and safety of deep learning, and discriminator models currently outperform other approaches. However, the feature extraction of discriminator models must compress the data and thereby loses certain information, leaving room for bad cases and malicious attacks. In this paper, we propose a new assumption: discriminator models are more sensitive to some subareas of the input space, and this uneven treatment creates bad cases and overconfidence areas. Under this assumption, we design new detection methods and indicator scores. For detection methods, we introduce diffusion models (DMs) into OOD detection. We find that the diffusion denoising process (DDP) of DMs also functions as a novel form of asymmetric interpolation, which is suitable for enhancing the input and reducing the overconfidence areas. For indicator scores, we find that the discriminator features of OOD inputs change sharply under DDP, and we use the norm of this dynamic change as our indicator score. We thus develop a new framework that combines discriminator and generation models for OOD detection under our new assumption. The discriminator models provide proper detection spaces and the generation models reduce the overconfidence problem. According to our experiments on CIFAR10 and CIFAR100, our new methods successfully outperform state-of-the-art methods. Our implementation is included in the supplementary materials.

1. INTRODUCTION

Out-of-distribution (OOD) detection is an important task for deep models: it helps a model determine its capability boundary and keeps it from being fooled by OOD data. It is closely connected to many real-world machine-learning applications, such as cybersecurity (Xin et al., 2018), medical diagnosis (Latif et al., 2018; Guo et al., 2020) and autopilot (Geiger et al., 2012). Existing methods for OOD detection can be broadly categorized into discriminator-based and generation-based methods. Discriminator-based methods (Wang et al., 2022) detect OOD inputs in the logit or feature space. Generation-based methods (An & Cho, 2015; Nalisnick et al., 2019) rely on the reconstruction difference in data space or on density estimation in latent space. Discriminator-based methods can extract useful features and make detection faster and better in most cases. However, such extraction and compression lose some information and leave room for bad cases and malicious attacks (Goodfellow et al., 2014; Amodei et al., 2016). Generation-based methods can capture the whole data distribution but lack effective indicator scores to compete with the SOTA discriminator-based methods, partly because of the curse of dimensionality. Previous works mostly concentrate on solving these challenges using only one kind of model. For discriminator-based methods, Wang et al. (2022) combine the information from both features and logits, and Sehwag et al. (2020) use self-supervised learning to improve feature extraction. For generation-based methods, Nalisnick et al. (2019) use the typical set to design better indicator scores, and Jiang et al. (2022) use statistical methods in the latent space, such as the Kolmogorov-Smirnov test. Beyond overcoming the problems of each kind of model by itself, we find that generative and discriminative models can be combined to solve each other's problems.
We propose a new assumption, that discriminator models are more sensitive to some subareas of the input space, to explain the existence of bad cases and overconfidence areas. To solve this problem, we take inspiration from water quality detection in the real world, where a few fixed detectors are used and the water is made to flow by stirring, so that these detectors can monitor a large area of water. For OOD detection, the discriminator models also concentrate on some fixed subareas of the whole input space. Therefore, we also want to "stir" the input to improve detection accuracy and reduce overconfidence areas. We find that generation models are a good choice for the "stirring" operators. Under such operators, the results for InD data remain normal at all times, while those for OOD data expose anomalies. To design suitable generation strategies that can enhance discriminator models, we introduce diffusion models (DMs), which play an important role among generation models, into OOD detection. DMs have produced many state-of-the-art generation results (Vahdat et al., 2021; Ho et al., 2022). We dive into the structure of DMs and find that the diffusion denoising process (DDP) of DMs is an ideal choice for the "stirring" operator mentioned above, because it can adjust any level of feature space and provides tools to keep the adjustment under control through its denoising and interpolation properties. Such an operator "stirs" the input and needs to be resampled several times for the result to converge, which builds a neighborhood of the input data, called the diffusion-based neighborhood (DiffNB). According to the above analysis, the features of OOD data expose anomalies, which means that the features can change sharply. We can detect such anomalies with a simple Euclidean distance between several different features of DiffNB. Our pipeline is shown in Figure 1.
We choose ten representative methods to compare with our methods on two representative datasets: CIFAR10 and CIFAR100. According to our experiments, our new methods outperform existing models and methods in most cases. Our work makes the following contributions:
• We propose a new assumption that discriminator models are more sensitive to some subareas of the input space in OOD detection. We analyze why this causes the overconfidence problem and how to solve it.
• We find that the diffusion denoising process of invertible diffusion models is a novel kind of asymmetric interpolation, which can keep the InD data relatively unchanged and provides tools to control the direction of the denoising process.
• We develop a framework that combines discriminator and generation models, using a ResNet to extract features and the diffusion denoising process of a diffusion model to reduce overconfidence areas. Our methods achieve OOD detection results competitive with SOTA methods.

2. BACKGROUND

In this section, we first introduce existing methods for OOD detection. Then, we show the development of diffusion models related to our paper. Because of the limited space in the main paper, more related works about diffusion models can be found in Appendix A.1.

2.1. OUT-OF-DISTRIBUTION DETECTION

OOD detection is an important task that helps neural networks determine their capability boundary. More specifically, let $X = \{x_1, \dots, x_n\} \sim p$ be a group of images from the in-distribution (InD) $p$. We want to build a detector $f$ such that $f(x_1, \dots, x_n) = 1$ if $p(x_i) \geq \sigma$ for all $i$, and $f(x_1, \dots, x_n) = 0$ if $p(x_i) < \sigma$ for all $i$. Here, $\sigma$ controls the decision boundary. When we get another group of data $Y = \{y_1, \dots, y_n\}$, we decide whether this group is from the InD $p$ or from an unknown distribution $q$ based on the result of $f$. If $n = 1$, this is pointwise OOD detection, and if $n \geq 2$, this is group OOD detection. In general, the existing OOD detection methods can be categorized into discriminator-based and generation-based methods.
Discriminator-based methods design indicator scores based on the output of discriminator models. Some methods can be used without modifying the model. ODIN (Liang et al., 2018) uses temperature scaling and the softmax results to detect OOD samples. ViM (Wang et al., 2022) combines the information of features and logits. KNN (Sun et al., 2022) brings the kth nearest neighbor of the input in feature space into the detection process. Other methods try to improve the detection ability during training. G-ODIN (Hsu et al., 2020) designs a new loss function. ConfGAN (Sricharan & Srivastava, 2018) generates OOD data using GANs to help the discriminator models determine the boundary. PixMix (Hendrycks et al., 2022) uses data augmentation to improve the detection results. SSD (Sehwag et al., 2020) uses self-supervised learning to improve feature extraction.
Generation-based methods use the reconstruction difference in the input space and density estimation in the latent space to do OOD detection. Some methods assume that generation models can reconstruct the in-distribution data better; for example, An & Cho (2015) use the reconstruction ability of VAEs.
Some methods use the distribution transformation capability of generation models to transfer the input distribution into a simple Gaussian distribution. The likelihood of the input becomes a direct choice, but Nalisnick et al. (2018) find that OOD data can also lie in the high-likelihood area. Nalisnick et al. (2019) find that the InD data is concentrated in the typical set instead of the high-likelihood area and design new methods using the typical set. Serrà et al. (2019) find that input complexity can be used to correct the bias of the likelihood. In addition to the likelihood, many existing statistical methods can test whether a distribution follows the standard Gaussian distribution. Zhang et al. (2020) use KL-divergence to detect OOD data. Jiang et al. (2022) use a nonparametric statistical method called the Kolmogorov-Smirnov test.
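To make the feature-space baselines above more concrete, the following is a minimal sketch of a kth-nearest-neighbor score in the spirit of KNN (Sun et al., 2022). The function name `knn_ood_score`, the L2 normalization step, the value of `k` and the synthetic two-dimensional "features" are assumptions of this sketch, not details taken from this paper.

```python
import numpy as np

def knn_ood_score(train_feats, query_feats, k=5):
    """Distance from each query to its k-th nearest training feature (higher = more OOD)."""
    # L2-normalise so that distances depend on direction only (an assumption of this sketch).
    tr = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    # Pairwise Euclidean distances, shape (num_query, num_train).
    d = np.linalg.norm(q[:, None, :] - tr[None, :, :], axis=2)
    # k-th smallest distance in every query row.
    return np.sort(d, axis=1)[:, k - 1]

# Synthetic stand-ins for discriminator features (hypothetical data).
rng = np.random.default_rng(0)
train = rng.normal(loc=[5.0, 0.0], scale=0.1, size=(200, 2))
ind_q = rng.normal(loc=[5.0, 0.0], scale=0.1, size=(20, 2))
ood_q = rng.normal(loc=[0.0, 5.0], scale=0.1, size=(20, 2))

ind_scores = knn_ood_score(train, ind_q)
ood_scores = knn_ood_score(train, ood_q)
print(ind_scores.max(), ood_scores.min())
```

Such a score only sees the final feature, so an OOD input whose feature happens to land near the InD features would go unnoticed.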

2.2. DIFFUSION MODEL

Classical diffusion model. DMs build a transformation from a Gaussian distribution to the image distribution through a multistep denoising process. Given a data distribution $x_0 \sim q(x_0)$, the diffusion process is a Markov process, following Ho et al. (2020):
$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} \mathcal{N}\big(\sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\big), \qquad q(x_t \mid x_0) = \mathcal{N}\big(\sqrt{\bar\alpha_t}\, x_0,\ (1-\bar\alpha_t) I\big). \qquad (1)$$
Here, $T = 1000$ is the maximum iteration step and $\beta_t \in (0, 1)$ controls the speed of adding noise. Additionally, $\alpha_t = 1 - \beta_t$, $\bar\alpha_t = \prod_{i=1}^{t} \alpha_i$, $\tilde\mu_t = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t} x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t} x_t$ and $\tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t} \beta_t$. The objective function is defined by:
$$L_{t-1} = \mathbb{E}_{x_0, \epsilon}\left[ \frac{\beta_t^2}{\alpha_t (1-\bar\alpha_t)} \left\| \epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon,\ t\big) \right\|^2 \right]. \qquad (2)$$
Here, $\epsilon_\theta$ is an estimate of the noise $\epsilon$. Once $\epsilon_\theta$ is well trained, according to Song et al. (2021a), the denoising process of Denoising Diffusion Probabilistic Models (DDPMs) and Denoising Diffusion Implicit Models (DDIMs) satisfies:
$$x_{t-\delta} = \sqrt{\bar\alpha_{t-\delta}} \left( \frac{x_t - \sqrt{1-\bar\alpha_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}} \right) + \sqrt{1 - \bar\alpha_{t-\delta} - \sigma_t^2}\ \epsilon_\theta(x_t, t) + \sigma_t \epsilon_t. \qquad (3)$$
Here, $\delta$ is the iteration step size. If $\sigma_t$ is set to the DDPM posterior value ($\eta = 1$ in the notation of Song et al. (2021a)), Equation (3) represents the denoising process of DDPMs; if $\sigma_t$ equals zero, it represents the denoising process of DDIMs. In Appendix A.3, we further describe how to make the iteration of the diffusion model fast and invertible.
Classifier-free guidance. Ho & Salimans (2021) show a simple and effective way to generate conditional samples, called classifier-free guidance. It adds a condition embedding $c$ into $\epsilon_\theta$ in the training process and changes the final estimate of the noise to:
$$\tilde\epsilon_\theta(x_t, c) = (1 + \omega)\, \epsilon_\theta(x_t, c) - \omega\, \epsilon_\theta(x_t).$$
Here, $\omega$ is the guidance weight, which controls the balance between realness and diversity.
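The sampling formulas of this section can be sketched in a few lines of NumPy. This is an illustrative sketch only: the linear $\beta_t$ schedule, the helper names `q_sample`, `guided_eps` and `denoise_step`, and the use of the true noise in place of a trained $\epsilon_\theta$ are assumptions made for the example.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear schedule (assumed for this sketch)
# alpha_bar[0] = 1 (no noise); alpha_bar[t] = prod_{i<=t} (1 - beta_i) for t = 1..T.
alpha_bar = np.concatenate([[1.0], np.cumprod(1.0 - betas)])

def q_sample(x0, t, eps):
    """Draw x_t from q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I) using noise eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def guided_eps(eps_cond, eps_uncond, omega):
    """Classifier-free guidance: (1 + omega) * eps(x_t, c) - omega * eps(x_t)."""
    return (1.0 + omega) * eps_cond - omega * eps_uncond

def denoise_step(x_t, t, delta, eps_hat, sigma_t=0.0, noise=None):
    """One step of Equation (3) from t to t - delta; sigma_t = 0 is the DDIM step."""
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
    direction = np.sqrt(1.0 - alpha_bar[t - delta] - sigma_t ** 2) * eps_hat
    extra = 0.0 if noise is None else sigma_t * noise
    return np.sqrt(alpha_bar[t - delta]) * x0_pred + direction + extra

rng = np.random.default_rng(0)
x0 = rng.normal(size=(3, 32, 32))    # stand-in for a 32 x 32 RGB image
eps = rng.normal(size=x0.shape)
x_t = q_sample(x0, 300, eps)
# With the true noise as the estimate, a deterministic step stays on the same
# noising trajectory, i.e. it reproduces q_sample(x0, 250, eps).
x_prev = denoise_step(x_t, 300, 50, eps_hat=eps)
```

In practice `eps_hat` would come from a trained network, optionally passed through `guided_eps` when a class condition is available.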

3. DIFFUSION-BASED NEIGHBORHOOD FOR OOD DETECTION

In this section, we first formally define our task, which is to design proper additional operators for a fixed detector. We assume that the detector only concentrates on some subareas of the input space. Then we use a toy example to show how to design proper additional operators and find that reconstruction operators are potential candidates for general cases. After that, we use invertible diffusion models and the diffusion denoising process (DDP) to develop a new asymmetric interpolation that satisfies the requirements of the additional operators. Then we use a second toy example to show what happens under DDP in more complex situations and use the norm of the dynamic change under DDP as our detection score. Both DDP and our new score can be applied to any black-box detector f, and we introduce classifier-free guidance for the multi-class condition. In the end, we provide our general methods and visualize their actual effect when the detector is a pre-trained ResNet18.

3.1. THEORETICAL SETTING

We first translate the OOD detection task and the idea behind water quality detection mentioned in Section 1 into a theoretical problem. Following the definitions in Section 2.1, the detector is a function $f$ that can be an analytic function or a neural network. The detection problem is as follows: given a distribution $p$ on a certain space $U$, let $W = \{x \in U \mid p(x) > \sigma\}$; we want to find a detector $f$ that satisfies $f^{-1}(1) = W$ exactly. The idea behind water quality detection is to use a fixed detector and make the water flow by stirring. The fixed detector means that $f$ may treat bad cases as normal, so that $f^{-1}(1)$ is strictly bigger than $W$. The stirring operator means that we can add additional operators to help the detection. Therefore, we can define a group of operators $\{g_i\}_{i \in I_g}$ satisfying $f \circ g_i(x) = f(x) = 1$ if $x \in W$, and $\cap_{i \in I_g} (f \circ g_i)^{-1}(1) \subset f^{-1}(1)$. The first condition means that every $f \circ g_i$ identifies the InD data correctly, and the second condition means that $\cap_{i \in I_g} (f \circ g_i)^{-1}(1)$ becomes smaller as more $g_i$ are added. Now we introduce the theoretical version of our problem:
Given a fixed subarea $W \subset U$ and a single-valued function $f$ satisfying $W \subset f^{-1}(1)$, how can we design operators $\{g_i\}_{i \in I_g}$ to minimize $\cap_{i \in I_g} (f \circ g_i)^{-1}(1)$?
For our task, $f$ is a neural network and the difference between $f^{-1}(1)$ and $W$ is the overconfidence area. Here, we assume that all training data belong to one class. To design proper $\{g_i\}_{i \in I_g}$, we need to locate what causes the overconfidence problem. We find that discriminator models are similar to the fixed water detector and only concentrate on some subareas of the whole input space. Here, we state the main assumption of our paper:
For a discriminator model, different channels concentrate on different subareas of the input space. Some subareas are more important than others.
We emphasize that even within the class-related area, some subareas are more important than others.
To show why this assumption is reasonable, we dive into the structure of a discriminator model and find that the linear operators, including the convolutional layers and the following pooling layers, play an important role. According to the singular value decomposition, every linear operator $L_A : \mathbb{R}^n \to \mathbb{R}^m$, $m < n$, can be transformed into a restriction operator $L_{A^*}|_S$ under a proper change of basis, where $S$ is an $m$-dimensional subspace. The input data is bounded in our task, so the subspace $S$ becomes a bounded area in the input space. Different linear kernels in a layer concentrate on different $S_i$, and the total sensitive area of this layer is $\cup_{i \in I_S} S_i$. Some subareas are included in more $S_i$ than others, and these are more important than the remaining area.
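The singular value decomposition argument above can be checked numerically. The sketch below is illustrative: the random matrix `A` merely stands in for one linear layer, and the dimensions `n = 8`, `m = 3` are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 3
A = rng.normal(size=(m, n))              # a linear operator R^n -> R^m with m < n

# Full SVD: A = U @ diag(s) @ Vt. The first m rows of Vt span the subspace S
# that A is sensitive to; the remaining rows span the directions A ignores.
U, s, Vt = np.linalg.svd(A, full_matrices=True)
S_basis = Vt[:m]                          # shape (m, n)
null_basis = Vt[m:]                       # shape (n - m, n)

x = rng.normal(size=n)
x_on_S = S_basis.T @ (S_basis @ x)        # orthogonal projection of x onto S
blind_shift = null_basis.T @ rng.normal(size=n - m)   # arbitrary change outside S

# A only sees the component of x inside S ...
same_output = np.allclose(A @ x, A @ x_on_S)
# ... so any perturbation outside S leaves the output untouched.
undetected = np.allclose(A @ x, A @ (x + blind_shift))
print(same_output, undetected)
```

The second check is the linear analogue of an overconfidence area: the input changes, but the operator cannot notice it.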

3.2. SINGLE-LAYER MODEL

The whole input space may be covered by the support areas of all kernels, $\cup_{i \in I_S} S_i$, and some areas are covered more times than others. To simplify, we reduce the problem to a restriction operator. The key idea is to set a threshold $N$ and mask the area that is covered by fewer than $N$ support areas $S_i$. The data then pass through a layer of a neural network as if they had first passed through a mask, and some information is ignored. For the convenience of theoretical analysis, we choose a simple neural network as $f$, which contains only one convolutional block and a fully connected layer. The support area of each convolutional kernel in the input space is $S_i$ and the total support area is $\cup_{i \in I_S} S_i$. Here, we simplify it to a simple restriction operator, namely, each $S_i$ is a square and all $S_i$ are mutually disjoint. Our input space is the image space and our training set contains only one element, zero (the empty image). Our InD is just a uniform distribution in a spherical neighborhood of zero, and our target is to detect whether a new input is InD or OOD, namely, whether it falls in the neighborhood of zero.
Figure 2: The left side shows the detection process of our toy example (Input and Output of the masked detector) and a bad case under this setting. The right side provides three different operators (Move, Mix with a uniform 1/16 kernel, and Recon) to correct the result of this bad case.
We show this toy example on the left side of Figure 2; the third row shows a bad case under this setting, which looks like zero after the mask operator. On the right side of Figure 2, we use three kinds of additional operators, moving, mixing and reconstruction, to solve this problem. The first two can perfectly solve the toy example in Figure 2 and can be extended to any restriction operator supported on different $S$. We put the proof in Appendix A.5. The key idea of these solutions is that they move or mix information from one place to another, so that the detector $f$ can get the whole information from a small support area $S$. Such ideas can be generalized to more complex cases, and the third operator in Figure 2, reconstruction, plays a central role here. We find that the reconstruction strategy contains the moving operator (the face of the cat moves to the left and down), the mixing operator (the boundary of each small box becomes unclear) and a semantic-level moving operator (the color of the cat is lighter). We can now move or mix the information at both the semantic and the pixel level instead of only the pixel level. Therefore, it provides a candidate $\{g_i\}_{i \in I_g}$ with potential advantages for real detectors $f$, such as a ResNet.
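The toy example above can be reproduced in code. The following sketch is an illustration under assumed sizes: the 8 × 8 image, the two 2 × 2 support squares, the radius 0.1 and the helper names `f`, `move`, `mix` and `f_enhanced` are choices made for this example and are not taken from the paper.

```python
import numpy as np

H = W = 8
mask = np.zeros((H, W), dtype=bool)
mask[0:2, 0:2] = True                    # disjoint support squares S_i;
mask[4:6, 4:6] = True                    # every other pixel is ignored by f

def f(x, radius=0.1):
    """Toy detector: 1 (InD) if the masked view lies in a small ball around zero."""
    return int(np.linalg.norm(x[mask]) <= radius)

def move(x, dy, dx):
    """Moving operator: cyclic shift of the image."""
    return np.roll(x, shift=(dy, dx), axis=(0, 1))

def mix(x, k=4):
    """Mixing operator: replace each k x k block by its mean (weights 1 / k^2)."""
    means = x.reshape(H // k, k, W // k, k).mean(axis=(1, 3))
    return np.repeat(np.repeat(means, k, axis=0), k, axis=1)

def f_enhanced(x):
    """Intersection over operators: InD only if f accepts x under every g_i."""
    # Even shifts are enough here: each pixel visits the first 2 x 2 square once.
    shifts = [(dy, dx) for dy in range(0, H, 2) for dx in range(0, W, 2)]
    return int(all(f(move(x, dy, dx)) for dy, dx in shifts) and f(mix(x)))

zero = np.zeros((H, W))                  # the single training image
bad = zero.copy()
bad[3, 3] = 1.0                          # OOD content hidden outside the mask

print(f(bad), f_enhanced(bad), f_enhanced(zero))
```

The plain detector accepts the bad case because nothing falls inside the mask, while the intersection over moved and mixed copies rejects it and still accepts the zero image.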

3.3. DIFFUSION DENOISING PROCESS

Such a reconstruction strategy can be carried out with any kind of generation model. However, this strategy also has a weakness. In Figure 2, we can detect the OOD examples more precisely using a reconstruction operator, but we also need to satisfy the condition $W \subset f^{-1}(1)$, namely, to keep the pure white picture from becoming dirty. This challenge tells us that we need to add more control to the reconstruction strategy. In the following, we introduce the diffusion denoising process (DDP) to solve this challenge. We dive into the structure of diffusion models and show that DDP is a kind of interpolation under the invertible condition. Therefore, DDP combines the benefits of denoising and interpolation and provides many powerful tools to control the reconstruction process. Due to space limitations, we put the analysis of enhancing invertibility in Appendix A.3 and directly analyze interesting applications of DDP under the invertible condition.
Interpolation. Many previous papers use diffusion models to interpolate between two inputs, but our method differs from the existing symmetric one, which uses spherical linear interpolation (Shoemake, 1985). We combine invertible diffusion models and the diffusion denoising process to get a new asymmetric interpolation. Let $x_0$ be an image and let $\epsilon$ be the Gaussian noise obtained by inverting an image $x_1$; we use $x_0$ and $\epsilon$ to get $x_t$. The diffusion denoising process is shown in Algorithm 1 and we define DDP as $\Phi(x_t, t, 0)$. When $t$ equals zero, no noise is added and we recover the original image $x_0$. When $t$ equals $T$, $x_0$ is removed entirely and only the noise $\epsilon$ remains, so DDP performs a full denoising of this noise; because the denoising process is invertible, we recover the image $x_1$. Therefore, the outputs of DDP gradually change from $x_0$ to $x_1$. Figure 3 shows three different interpolation results. The first row uses the cat as $x_0$ and the second row uses the car as $x_0$. We find that $x_0$ is better preserved than under the symmetric interpolation in the third row.
DDP is a kind of reconstruction strategy; at the same time, it is also a denoising operator and an interpolation operator. That DDP is a denoising operator means that it can keep the InD data relatively unchanged and pull the OOD data toward the high-density area of InD, which is a distribution-sensitive property. That DDP is an interpolation operator means that even if we cannot reconstruct the input perfectly, we can at least control the direction of change. Both properties give us more possibilities to control DDP and address the above weakness. Finally, we get the following property:
The invertible diffusion denoising process is an asymmetric distribution-sensitive interpolation.
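The behavior of DDP described above can be illustrated on a degenerate toy distribution for which the ideal noise predictor is known in closed form. The sketch below makes strong assumptions: the InD is a point mass at `mu`, `eps_oracle` is the exact predictor for that case and replaces a trained $\epsilon_\theta$, and the linear $\beta_t$ schedule and step size `delta = 50` are arbitrary. Because every noise inverts to the same point `mu` here, the sketch shows only the distribution-sensitive side of DDP, not a gradual interpolation between two distinct images.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # linear schedule (assumed)
alpha_bar = np.concatenate([[1.0], np.cumprod(1.0 - betas)])

mu = np.zeros(16)                         # toy InD: all mass at a single point mu

def eps_oracle(x_t, t):
    """Exact noise predictor when the data distribution is a point mass at mu."""
    return (x_t - np.sqrt(alpha_bar[t]) * mu) / np.sqrt(1.0 - alpha_bar[t])

def ddp(x0, t, eps, delta=50, eps_model=eps_oracle):
    """Noise x0 to level t with a chosen eps, then denoise deterministically to level 0."""
    x = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    for s in range(t, 0, -delta):
        e = eps_model(x, s)
        x0_pred = (x - np.sqrt(1.0 - alpha_bar[s]) * e) / np.sqrt(alpha_bar[s])
        nxt = max(s - delta, 0)
        x = np.sqrt(alpha_bar[nxt]) * x0_pred + np.sqrt(1.0 - alpha_bar[nxt]) * e
    return x

rng = np.random.default_rng(0)
eps = rng.normal(size=mu.shape)
x_ind = mu.copy()                         # an InD input
x_ood = mu + 3.0                          # an input far away from the InD

ind_change = np.linalg.norm(ddp(x_ind, 300, eps) - x_ind)
ood_change = np.linalg.norm(ddp(x_ood, 300, eps) - x_ood)
print(ind_change, ood_change)
```

With `t = 0` the loop body never runs and `ddp` returns its input, matching the statement that no noise is added in that case.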

3.4. MULTI-LAYER MODEL

In the above section, we show how to design $\{g_i\}_{i \in I_g}$, and our choice is invertible DDP. Here, we show the effect of DDP in a more complex example in order to design a proper detection score. We consider the last convolutional layer of a multi-layer model. In this setting, we face a major change: the input of this layer has many channels. More formally, the input feature is in $\mathbb{R}^{m \times m \times c}$, the input feature space is $\mathbb{R}^{m \times m}$ and the input feature is a combination of $c$ elements in $\mathbb{R}^{m \times m}$, each of which is the output of a previous convolutional kernel. Therefore, a single input becomes $c$ points in the feature space instead of a single point. For the convenience of theoretical analysis, we abstract the whole input space into a 2-dimensional space, and the ideal feature space belonging to the input data is the light blue square. The points in each image represent the $c = 6$ features of a single input and the corresponding arrows show the movement of these features under DDP. We use the dark blue area to show the features used by the convolutional kernels of the last convolutional layer. For better visualization, the output of each convolutional kernel is a single value, the number of points $N$ that fall in its sensitive area. Again, we assume that all the sensitive areas form a mask. This example is just a simplification of the process of computing confidence. For a normal InD input (the first row), its features are uniformly distributed in the feature area, which ensures that $N$ maintains a dynamic balance under the perturbation of DDP in the feature space. For a normal OOD input (the second row), DDP pulls its features toward the high-density area and $N$ increases at the same time. A more challenging OOD input (the third row) is not InD, and its features are relatively sparse in the ideal feature set. However, all these features happen to fall in the dark blue area at the same time, either by coincidence or by deliberate construction, which causes the overconfidence problem.
Such an imbalance breaks the dynamic balance between the dark blue area and the remaining area and causes a rapid decline of $N$ under the perturbation of DDP. Therefore, both kinds of OOD input can be detected by the change of confidence. This analysis can be extended to any middle layer by replacing the single-value output with a vector output.

3.5. DIFFUSION-BASED NEIGHBORHOOD

The first toy example tells us that we can use DDP to enhance the detection process, and the second toy example tells us what happens in feature spaces under DDP. Detecting the dynamic change of confidence still requires the output to be single-valued. We can generalize this idea to detect the change in any feature space of the model by using the norm of the change as the score. Once we remove the limitation on the form of the output, both DDP and the dynamic change score can be applied to any black-box detector $f$, such as a ResNet.
Algorithm 1 Diffusion-denoising process
Input: Images $x_0$, generative interval $[0, T]$, generative gap $\delta$
1: for $t = T, \dots, \delta$ do
2:   $x_t^0 = \frac{1}{\sqrt{\bar\alpha_t}} \big(x_t - \sqrt{1-\bar\alpha_t}\, \epsilon_\theta(x_t, t)\big)$
3:   $\epsilon = \epsilon_\theta(x_t, t)$
4:   $x_{t-\delta} = \sqrt{\bar\alpha_{t-\delta}}\, x_t^0 + \sqrt{1-\bar\alpha_{t-\delta}}\, \epsilon$
The final step is to take multiple categories into account and avoid category migration for InD inputs. Here, we need the help of the interpolation property, and there are two choices. First, we can search for the k nearest neighbors of the input in the input space and generate their corresponding noises. We then interpolate these noises with the original input using DDP. Another, more interesting choice is to train a conditional diffusion model and fix the class condition to the class of the input. Then all noises correspond to images in the same class, and we can interpolate the input with any noise instead of searching for it first. Moreover, we can choose several noises for each input, and all the results of DDP form a neighborhood of the input. We call this neighborhood the diffusion-based neighborhood (DiffNB). Then we use a discriminator model to measure the dynamic change in the feature spaces and determine the OOD samples based on that. We put our algorithm in Algorithm 4. In Figure 5
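The dynamic change score over a diffusion-based neighborhood can be sketched as follows. This is not the paper's Algorithm 4, which is not reproduced in this section: averaging the Euclidean feature change over `r` samples is an assumed aggregation, and `toy_ddp` and `toy_features` are hypothetical stand-ins for a trained diffusion model and a pre-trained ResNet.

```python
import numpy as np

def diffnb_score(x0, feature_fn, ddp_fn, t=300, r=4, rng=None):
    """Mean Euclidean feature change between x0 and r DDP samples (higher = more OOD)."""
    rng = np.random.default_rng() if rng is None else rng
    base = feature_fn(x0)
    changes = []
    for _ in range(r):
        eps = rng.normal(size=x0.shape)   # one noise draw gives one neighbour in DiffNB
        neighbour = ddp_fn(x0, t, eps)
        changes.append(np.linalg.norm(feature_fn(neighbour) - base))
    return float(np.mean(changes))

mu = np.zeros(8)                          # centre of a toy InD (hypothetical)

def toy_ddp(x0, t, eps):
    """Stand-in for DDP: collapses any input to a small cloud around mu."""
    return mu + 0.05 * eps

def toy_features(x):
    """Stand-in for a discriminator feature or logit head."""
    return x[:4]

rng = np.random.default_rng(0)
score_ind = diffnb_score(mu.copy(), toy_features, toy_ddp, rng=rng)
score_ood = diffnb_score(mu + 2.0, toy_features, toy_ddp, rng=rng)
print(score_ind, score_ood)
```

The defaults `t = 300` and `r = 4` mirror the disturbance degree and repeat size reported with Table 1; any black-box `feature_fn` can be plugged in, since the score uses only its outputs.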

4. EXPERIMENT

In this section, we first describe the detailed setting of our experiments. Then we report our OOD detection results, including our method and existing representative methods. After that, we provide ablation study results to show the contribution of each item and hyperparameter in our new scores.
Table 1: The AUROC results of different methods. We train the models on the training data for 160k epochs and test the results on the test data. We use the conditional version of our method with guidance weight 2. We set the disturbance degree t = 300 and the repeat size r = 4, and use the logit space as our detection space. Higher is better, and the bold results are the best in each case.

4.1. SETUP

We evaluate our methods on a recent OOD detection benchmark, OpenOOD (Yang et al., 2022). We use images from six different datasets, which are filtered to ensure that the in-distribution and OOD data do not overlap. We use CIFAR10 (Krizhevsky et al., 2009) and CIFAR100 as InD datasets. For CIFAR10, we use CIFAR100, TinyImagenet (Krizhevsky et al., 2017), SVHN (Netzer et al., 2011), Texture and Places365 (Zhou et al., 2017) as OOD data. For CIFAR100, the OOD datasets are the same, except that CIFAR100 is swapped for CIFAR10 as an OOD dataset. For a fair comparison, we first train the discriminator and generation models using the training set. We then evaluate the results by calculating the area under the receiver operating characteristic curve (AUROC) (Fawcett, 2006) between the test set of the InD dataset and the test sets of the others, to avoid the influence of model overfitting. All images from the different datasets are resized to 32 × 32. The discriminator models are pre-trained ResNet18 models from OpenOOD. The diffusion model used in this paper is the classical model from DDPMs. We use pre-trained unconditional models and train the conditional version ourselves.
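For reference, the AUROC between an InD test set and an OOD test set can be computed directly from the two score arrays. The sketch below uses the pairwise (Mann-Whitney) form of the AUROC; the function name `auroc` and the convention that a higher score means "more OOD" are assumptions of this example.

```python
import numpy as np

def auroc(ind_scores, ood_scores):
    """P(score_ood > score_ind) + 0.5 * P(tie), with OOD as the positive class."""
    ind = np.asarray(ind_scores, dtype=float)
    ood = np.asarray(ood_scores, dtype=float)
    # Compare every OOD score with every InD score.
    greater = (ood[:, None] > ind[None, :]).sum()
    ties = (ood[:, None] == ind[None, :]).sum()
    return (greater + 0.5 * ties) / (ind.size * ood.size)

perfect = auroc([0.1, 0.2], [0.3, 0.4])     # every OOD score exceeds every InD score
chance = auroc([0.1, 0.4], [0.2, 0.3])      # half of the pairs are ordered correctly
print(perfect, chance)
```

The pairwise comparison needs memory proportional to the product of the two set sizes, so a rank-based implementation is preferable for full test sets.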

4.2. OUT-OF-DISTRIBUTION DETECTION

We put the results in Table 1. We choose ten representative baselines. The first seven methods, like ours, do not adjust the discriminator model. ODIN (Liang et al., 2018) uses temperature scaling and gradient-based input perturbation. EBO (Liu et al., 2020) uses an energy-based function. ReAct (Sun et al., 2021) uses rectified activation. MLS (Hendrycks et al., 2019) uses maximum logit scores. ViM (Wang et al., 2022) combines the information of the feature space and the logit space. KNN (Sun et al., 2022) uses the nearest neighbor in the feature space. All these methods are post-hoc methods, and we outperform them in all cases on CIFAR10 and in two cases on CIFAR100. We also compare our methods with four methods that require additional training of the discriminator model. G-ODIN (Hsu et al., 2020) decomposes the posterior to model the probability of InD. CSI (Tack et al., 2020) explores the effectiveness of contrastive learning objectives. CutMix (Yun et al., 2019) and PixMix (Hendrycks et al., 2022) are two kinds of data augmentation that improve the capability of models. Our method outperforms them in three cases and gets competitive results in the others.
Our method performs worse than the SOTA methods when the test dataset is SVHN. A performance bottleneck is that, in addition to density estimation, DDP also follows a lazy strategy in the denoising process: it tends to keep smooth areas unchanged. In Figure 6, we show this phenomenon using a simple case. We resize an InD image to r × r and then resize it back, which pulls the image away from the InD data. However, the reconstruction error decreases instead of increasing under this operator. This phenomenon also occurs when the input comes from the relatively simple SVHN dataset.
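The logit-based post-hoc baselines listed above reduce to a few lines each. The sketch below shows the maximum softmax probability with temperature scaling (the scaling part of ODIN, without its input perturbation), the maximum logit score and the energy score; the function names and the example logits are assumptions of this sketch, and all three follow the convention that a higher value means "more InD".

```python
import numpy as np

def logsumexp(z, axis=-1):
    """Numerically stable log(sum(exp(z))) along one axis."""
    m = z.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(z - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def msp_score(logits, temperature=1.0):
    """Maximum softmax probability after temperature scaling."""
    z = logits / temperature
    return np.exp(z.max(axis=-1) - logsumexp(z))

def max_logit_score(logits):
    """Maximum logit score (MLS)."""
    return logits.max(axis=-1)

def energy_score(logits, temperature=1.0):
    """Negative free energy, T * logsumexp(logits / T)."""
    return temperature * logsumexp(logits / temperature)

logits = np.array([[10.0, 0.0, 0.0],      # confident prediction (hypothetical)
                   [0.1, 0.0, 0.0]])      # nearly flat prediction (hypothetical)
msp = msp_score(logits)
mls = max_logit_score(logits)
energy = energy_score(logits)
print(msp, mls, energy)
```

These confidence scores have the opposite orientation to a change-based score, so one of the two must be negated before they are compared in the same AUROC computation.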

4.3. ABLATION STUDY

Here, we show the influence of each item and hyperparameter on our scores. All settings are the same as in the main experiments in Table 1 except for the ablated item.
Detection space. In Figure 7, we compare the results when we use different detection spaces, including the input image space, feature spaces at different levels, and the logit space. We get the best results when we use the high-level feature or the logit. Here, the logit has 10 dimensions and is much smaller than the high-level feature space (512 dimensions). This shows that DDP can successfully reduce information loss. The low-level features and the image space get relatively poor results, mainly because the information is still redundant at these levels.
Condition. In Figure 8a, we compare the unconditional and conditional versions of our method. The main problem is that CIFAR100 has many more classes, which makes separating the feature space more difficult, and unconditional diffusion models cannot keep the interpolation within a single class. We also test different guidance weights ω and find that a higher weight gives relatively better results. This shows that realness is more important than diversity in the OOD detection task. In addition, the conditional version improves the detection results on CIFAR100 more noticeably, which means that conditional control is especially important when the number of classes is large.
Training. In Figure 8b, we find that although training diffusion models to the best FID is relatively computationally expensive, OOD detection does not need the models to be fully trained (200k steps). After 40k training steps, we already get relatively good results, and further improvement of the FID does not help OOD detection.
Timestep. In Figure 8c, we determine how to choose the best t in DDP. We find that the best choice is t = 300, which is consistent with our examples in Figure 4. When t < 300, the difference caused by DDP is still not obvious enough. When t > 300, information starts to be lost because the noise term accounts for a larger and larger proportion, which limits the OOD detection results.
Resampling. In Figure 8d, we determine the influence of the repeated sampling size. According to our analysis, consistent detection results are maintained by dynamic balance; therefore, we need to resample several times to remove the random error in DDP. We find that more resampling is better and that resampling 4 times is good enough.

5. DISCUSSION

In this paper, we start with an assumption to explain the overconfidence problem. Then we combine discriminator and generation models to solve it. Under the setting of the first toy example, such a strategy can perfectly solve the OOD detection problem. Although we cannot say the general cases can also be perfectly solved, we show how to use this idea in the abstract feature space and get competitive results on CIFAR10 and CIFAR100 using the combination of a ResNet and a diffusion model. Our approach has good interpretability and a solid theoretical background. We believe that this strategy opens a new door to developing more powerful OOD detection methods and has the potential to be applied to OOD generalization and other related tasks.

REPRODUCIBILITY STATEMENT

Our implementation is included in the supplementary materials. We will release our method on GitHub once our paper is accepted.

APPENDIX

A.1 RELATED WORK

Diffusion Model. Denoising Diffusion Probabilistic Models (DDPMs) (Ho et al., 2020) successfully generate high-quality images and have made DMs popular. DMs can now not only generate high-quality unconditional images but are also applied in many different fields. For conditional generation, DMs can perform interpolation, manipulation, image editing, style transfer and text-conditional generation (Ramesh et al., 2021; 2022). For different data types, DMs can generate text (Austin et al., 2021), audio (Kong et al., 2020; Lam et al., 2021) and video. The main challenge for DMs is that they require hundreds to thousands of iterations to produce high-quality results, which limits their application. After DDPMs, many works try to make DMs faster and better. Some focus on the denoising equations of DMs. Nichol & Dhariwal (2021) design a better time schedule for the denoising process. Liu et al. (2022) provide new numerical methods for the denoising process. Bao et al. (2022) find analytic results for the variance of the denoising process. Others design new training strategies and new models. Salimans & Ho (2022) use distillation to accelerate DMs. Dhariwal & Nichol (2021) change the model structure from Unet to GAN to make each iteration step more powerful. Moreover, Song et al. (2021b) first show that DMs can be rewritten as two neural differential equations (Chen et al., 2018; Dupont et al., 2019). Therefore, the numerical methods used to accelerate neural differential equations can also be used here.

A.2 SIMILARITY AND DIFFERENCE

Several baselines are similar to ours in some ways. The first one is KNN (Sun et al., 2022), which performs a KNN search in the feature space. However, KNN ignores the possibility that an OOD input may have a feature similar to that of InD data, and all methods that only use the final feature share this problem. In addition, using KNN directly in the input space is also invalid because of data sparsity and interference from irrelevant information.

Another similar approach is data generation and augmentation. These methods retrain the discriminator, whereas our generation and discriminator models are trained separately. Moreover, existing methods use generation models to generate OOD data (Marek et al., 2021) to help the discriminator models learn their capability boundary. Our method uses the generation models to interpolate between the new input data and the training data, which does not require carefully classifying the training data or designing new loss functions. Some methods use data augmentation to help the training process of discriminator models. Although classical data augmentation can enrich the data and keep it realistic, some new methods add complex and unrealistic augmentation (Hendrycks et al., 2022), which increases the burden on the models and lacks clear motivation.

Many methods combine a generation model and a discriminator model to do OOD detection, such as Ge et al. (2017); Neal et al. (2018); Lee et al. (2017); Du et al. (2022). However, these papers all use generation models to generate OOD data, which requires designing new objective functions and coupling the training of the generation and discriminator models. Our method uses diffusion models to do interpolation and detects the dynamic change under DDP, which is simple and effective.

A.3 INVERTIBLE DIFFUSION MODEL

Here, we introduce more details of the diffusion model used in this paper. In addition, we analyze how to make the iteration of diffusion models more invertible.

Score-based generation model. Song et al. (2021b) show that the diffusion-denoising process can also be treated as two differential equations:

$$dx = \left(\sqrt{1-\beta(t)} - 1\right)x(t)\,dt + \sqrt{\beta(t)}\,dw,$$

$$dx = \left(\left(\sqrt{1-\beta(t)} - 1\right)x(t) - \frac{1}{2}\beta(t)\,s_\theta(x(t), t)\right)dt.$$

This formulation is called probability flows (PFs). The noise $\epsilon_\theta$ of DMs and the gradient of the log-likelihood $s_\theta$ are equivalent (Bao et al., 2022). More specifically, we have

$$s_\theta(x, t) = -\frac{1}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x, t).$$

Pseudo numerical method. Liu et al. (2022) provide pseudo numerical methods for diffusion models (PNDMs) to accelerate DDIMs. PNDMs define Equation (3) with $\sigma_t = 0$ as the transfer function:

$$\phi(x_t, \epsilon_t, t, t-\delta) = \frac{\sqrt{\bar{\alpha}_{t-\delta}}}{\sqrt{\bar{\alpha}_t}}\,x_t - \frac{\bar{\alpha}_{t-\delta} - \bar{\alpha}_t}{\sqrt{\bar{\alpha}_t}\left(\sqrt{(1-\bar{\alpha}_{t-\delta})\bar{\alpha}_t} + \sqrt{(1-\bar{\alpha}_t)\bar{\alpha}_{t-\delta}}\right)}\,\epsilon_t.$$

PNDMs combine this transfer function with the noise estimated by classical numerical methods, such as the linear multistep method, to get the new denoising equations:

$$\epsilon'_t = \frac{1}{24}\left(55\epsilon_t - 59\epsilon_{t+\delta} + 37\epsilon_{t+2\delta} - 9\epsilon_{t+3\delta}\right),$$

$$x_{t-\delta} = \phi(x_t, \epsilon'_t, t, t-\delta).$$

Here, $\epsilon_t = \epsilon_\theta(x_t, t)$. Both PFs and PNDMs accelerate the denoising process without loss of quality.

Invertibility. We show the test results in Figure 9. For DDIMs, the error appears from the beginning and accumulates as the total number of generation steps increases. For PFs, the initial error is small, but cumulative error appears when the total number of generation steps exceeds 500. DDIMs are first-order methods, whereas the other methods are higher-order. This suggests that a higher order of convergence improves invertibility. PFs use numerical methods with adaptive step size, and PNDMs use methods with fixed step size. Therefore, we hypothesize that a fixed step size also benefits invertibility. To verify this, we replace the adaptive-step-size methods used by PFs with fixed-step-size methods and call the result probability flows plus (PFs+). We find that the errors decrease immediately. The reason is that a fixed step size keeps the sampling locations of the forward and reverse processes consistent, which benefits invertibility. Combining the above analysis, we have the following property: a high order of convergence and a fixed iteration step size improve the invertibility of DMs under a fixed total number of iteration steps.
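To make the PNDM update of this section concrete, the following is a minimal scalar sketch of the transfer function and the linear multistep noise estimate. The function names and the numeric values are illustrative assumptions, not the authors' code. The sketch also checks numerically that the transfer function coincides with the deterministic DDIM step, which is what "Equation (3) with σ_t = 0" refers to.

```python
import math


def transfer(x_t: float, eps: float, abar_t: float, abar_prev: float) -> float:
    """PNDM transfer function phi(x_t, eps, t, t - delta) for scalar inputs.

    abar_t and abar_prev are the cumulative products alpha-bar at t and t - delta.
    """
    denom = math.sqrt(abar_t) * (
        math.sqrt((1.0 - abar_prev) * abar_t) + math.sqrt((1.0 - abar_t) * abar_prev)
    )
    return math.sqrt(abar_prev) / math.sqrt(abar_t) * x_t - (abar_prev - abar_t) / denom * eps


def ddim_step(x_t: float, eps: float, abar_t: float, abar_prev: float) -> float:
    """Deterministic DDIM step (sigma_t = 0), written in the usual x0-prediction form."""
    x0_pred = (x_t - math.sqrt(1.0 - abar_t) * eps) / math.sqrt(abar_t)
    return math.sqrt(abar_prev) * x0_pred + math.sqrt(1.0 - abar_prev) * eps


def multistep_eps(e_t: float, e_t1: float, e_t2: float, e_t3: float) -> float:
    """Fourth-order linear multistep estimate from eps at t, t+delta, t+2delta, t+3delta."""
    return (55.0 * e_t - 59.0 * e_t1 + 37.0 * e_t2 - 9.0 * e_t3) / 24.0


def score_from_eps(eps: float, abar_t: float) -> float:
    """Convert predicted noise to the score s_theta = -eps / sqrt(1 - alpha-bar_t)."""
    return -eps / math.sqrt(1.0 - abar_t)


# Illustrative values only.
x_t, eps, abar_t, abar_prev = 0.7, -0.3, 0.5, 0.8
gap = abs(transfer(x_t, eps, abar_t, abar_prev) - ddim_step(x_t, eps, abar_t, abar_prev))
print(gap)
```

Because the multistep coefficients sum to one, a constant noise history is returned unchanged, so the scheme reduces to the plain DDIM step in that case.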

A.4 INTERPOLATION

In Algorithms 5 and 6, we introduce two types of interpolation using diffusion models. The positions of $x_0$ and $x_1$ are symmetric in the original interpolation and asymmetric in the asymmetric one. According to our experiments, $x_0$ is more important in asymmetric interpolation. The visualization of these interpolations can be found in Figure 3.

Algorithm 5 Symmetric interpolation
Input: Images $x_0^1$, $x_0^2$, generative gap $\delta$, interpolation rate $\sigma$
1: $x_T^1, x_T^2 = \Phi(x_0^1, 0, T, \delta), \Phi(x_0^2, 0, T, \delta)$
2: $x_T^{inter} = \mathrm{Slerp}(x_T^1, x_T^2, \sigma)$
3: $x_0^{inter} = \Phi(x_T^{inter}, T, 0, \delta)$
4: return $x_0^{inter}$

Algorithm 6 Asymmetric interpolation
Input: Images $x_0^1$, $x_0^2$, generative gap $\delta$, interpolation timestep $t$
1: $x_T^2 = \Phi(x_0^2, 0, T, \delta)$
2: $x_t^{inter} = \sqrt{\bar{\alpha}_t}\,x_0^1 + \sqrt{1-\bar{\alpha}_t}\,x_T^2$
3: $x_0^{inter} = \Phi(x_t^{inter}, t, 0, \delta)$
4: return $x_0^{inter}$
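The two algorithms can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: `flow(x, t_start, t_end, delta)` is a hypothetical stand-in for the diffusion flow Φ, images are flat lists of floats, and the convention that rate 0 returns the first latent in `slerp` is our choice, since the paper does not spell it out.

```python
import math
from typing import Callable, List

Vector = List[float]
Flow = Callable[[Vector, int, int, int], Vector]  # hypothetical stand-in for Phi


def slerp(a: Vector, b: Vector, rate: float) -> Vector:
    """Spherical linear interpolation; rate=0 returns a, rate=1 returns b (assumed convention)."""
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    cos_theta = sum(p * q for p, q in zip(a, b)) / (norm_a * norm_b)
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    if theta < 1e-8:  # nearly parallel vectors: fall back to linear interpolation
        return [(1.0 - rate) * p + rate * q for p, q in zip(a, b)]
    w_a = math.sin((1.0 - rate) * theta) / math.sin(theta)
    w_b = math.sin(rate * theta) / math.sin(theta)
    return [w_a * p + w_b * q for p, q in zip(a, b)]


def symmetric_interpolation(x1: Vector, x2: Vector, flow: Flow, T: int, delta: int, rate: float) -> Vector:
    """Algorithm 5: encode both images, slerp the latents, decode."""
    x1_T, x2_T = flow(x1, 0, T, delta), flow(x2, 0, T, delta)
    return flow(slerp(x1_T, x2_T, rate), T, 0, delta)


def asymmetric_interpolation(
    x1: Vector, x2: Vector, flow: Flow, abar_t: float, t: int, T: int, delta: int
) -> Vector:
    """Algorithm 6: mix x1 with the latent of x2 at timestep t, then decode from t."""
    x2_T = flow(x2, 0, T, delta)
    x_inter_t = [math.sqrt(abar_t) * p + math.sqrt(1.0 - abar_t) * q for p, q in zip(x1, x2_T)]
    return flow(x_inter_t, t, 0, delta)


# Identity flow so the sketch runs without a trained diffusion model.
def identity_flow(x: Vector, t_start: int, t_end: int, delta: int) -> Vector:
    return list(x)


sym = symmetric_interpolation([1.0, 0.0], [0.0, 1.0], identity_flow, T=1000, delta=20, rate=0.5)
asym = asymmetric_interpolation([1.0, 0.0], [0.0, 1.0], identity_flow, abar_t=0.64, t=300, T=1000, delta=20)
print(sym, asym)
```

In the asymmetric version, only the second image is encoded to the latent space, while the first image enters directly with weight sqrt(ᾱ_t), which is why the two inputs play different roles.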

A.5 THE PROOF OF TOY EXAMPLE

To make this claim rigorous, we need some definitions. Each image can be represented as a function $r$ on $[0, 1]^2$, where $r(x, y)$ is the RGB value at position $(x, y)$. Images are continuous at most positions. Therefore, we simplify the input space to $C([0, 1]^2)$, the continuous functions on $[0, 1]^2$, and then to the one-dimensional $C([0, 1])$ for simplicity. The mask operator here is the restriction operator $\phi_S(r) = r|_S$ with $S = [0, 0.25] \cup [0.5, 0.75]$. The InD set is $\{r \in C([0, 1]) \mid \|r\| \le \sigma\}$, where $\|r\|$ is the maximum absolute value of $r$ on $[0, 1]$. The bad cases form the set $A_\sigma(\phi_S) = \{r \mid \phi_S(r) = \phi_S(0) = 0,\ \|r\| > \sigma\}$. We can prove that the set of bad cases becomes empty, $A_\sigma(\{g_i|_S\}) = \emptyset$, when we choose proper $\{g_i\}$. A straightforward solution is, for each $r \in C(\mathbb{R})$, to let $\{g_i(x) = r(x + i) \mid i \in \{0, \pm 0.25\}\}$, which represents the moving operator. The proof is that we can use $g_0|_S$ to get the information about $r$ on $S$ and use $g_{\pm 0.25}|_S$ to get the remaining information on $[0, 1] \setminus S$. Then $r$ must satisfy $|r(x)| < \sigma$, $\forall x \in [0, 1]$. There also exist other kinds of solutions. For example, let $\{g_{a,b}(x) \equiv \frac{1}{b-a}\int_a^b r(x)\,dx \mid a, b \in [0, 1]\}$, which represents the mixing operator. In addition, the mask operator is not necessary and we can extend it to any restriction operator.

Here we prove that $\{g_i(x) = f(x + i) \mid i \in \{0, \pm 0.25\}\}$ and $\{g_{a,b}(x) \equiv \frac{1}{b-a}\int_a^b f(x)\,dx \mid a, b \in [0, 1]\}$ can solve the following problem: given the input space $C([0, 1])$, the restriction operator $\phi_S$ and a fixed $\delta$, how can we design an additional operator set $\{g_i\}$ to minimize the annihilator set $A_\delta(\{g_i|_S\})$?

For $\{g_i(x) = f(x + i) \mid i \in \{0, \pm 0.25\}\}$, we have $g_0|_S = f|_S$, so $g_0|_S = 0$ means that $f$ is zero on $S$. Similarly, $g_{0.25}|_S$ takes the values of $f$ on $S + 0.25$ and $g_{-0.25}|_S$ takes the values of $f$ on $S - 0.25$, so if both are zero, $f$ is zero on $(S - 0.25) \cup (S + 0.25)$, which covers $[0, 1] \setminus S$. Then the only choice of $f$ is zero and the annihilator set is empty.

For $\{g_{a,b}(x) \equiv \frac{1}{b-a}\int_a^b f(x)\,dx \mid a, b \in [0, 1]\}$: if $\exists x_0 \in [0, 1]$ with $f(x_0) > \delta$, then $\exists \epsilon$ such that $\forall x \in [x_0 - \epsilon, x_0 + \epsilon]$, $f(x) > \delta$, because $f$ is continuous (the case $f(x_0) < -\delta$ is symmetric). With $a = x_0 - \epsilon$ and $b = x_0 + \epsilon$, we have $g_{a,b}(x) \equiv \frac{1}{b-a}\int_a^b f(x)\,dx > \delta$, and therefore $g_{a,b}|_S > \delta$, too. This does not satisfy our condition, so the annihilator set is empty.

A.6 OOD DETECTION RESULT

We add experiments on adversarial OOD examples in Table 2. We use 5000 images of CIFAR10 to generate 5000 adversarial OOD examples. We use the PGD and OnePixel methods, which use the $L_\infty$ and $L_0$ norms, respectively. We use the open library Harry24k/adversarial-attacks-pytorch to generate the adversarial examples. For PGD, we use eps=8/255, alpha=2/255 and steps=10. For OnePixel, we use pixels=10, steps=10, popsize=10 and inf_batch=128. We find that our method performs well on the adversarial OOD examples.

We also test the computation cost of our method. We record the total time to compute the detection scores of 5000 images using an RTX3090. Processing one image takes about 0.06s. This is much slower than existing methods. However, given the success of diffusion model acceleration, we believe that our method can be made faster in the future.

Table 2: The AUROC results of different methods. We train the models on the training data for 160k epochs and test the results on the test data. We use the conditional version of the method with guidance weight 2. We set the disturbance degree t = 300 and the repeat size r = 4, and use the logit space as our detection space. Higher results are better and the bold results are the best in each case.

$I_g$ and $I_S$ are the index sets of $\{g_i\}$ and $\{S_i\}$, respectively. They can be finite or infinite.

A restriction operator is defined by: $\forall f$, $f|_S(x) = f(x)$ if $x \in S$, else $f|_S(x) = 0$. $S$ is called the support area.

When the label is unavailable, we use the discriminator model to generate a pseudo label.

Figure 1: The pipeline of our detection method.

Figure 3: Different interpolation results using two fixed images.

Figure 4: The change in different levels of feature space under the disturbance of DDP.

Figure 5: The influence on the different levels under DDP, including the image space and the first and fourth feature spaces of the ResNet18. In Fig (a), the first column is the original input and the remaining columns are its DiffNB. In Fig (b, c), the blue and green arrows are the feature changes of InD data and OOD data, respectively, under DDP after dimensionality reduction using PCA. Fig (b) and (c) show the results in the first and the fourth feature spaces of a ResNet18, respectively.

In Figure 5, we show the DiffNB of different inputs using DDP and the influence of DDP on the different feature spaces of a pre-trained ResNet18. The DiffNB has a more obvious semantic change for OOD input. For example, a rickshaw becomes a horse in Fig 5a. On the other hand, the semantic information of the DiffNB of the InD data is relatively unchanged. Correspondingly, we find that the change in low-level features (the first feature space, in Fig 5b) is similar for both, but the change in high-level semantic features (the fourth feature space, in Fig 5c) is small when the input is InD and relatively big when the input is OOD.

Figure 6: The bad denoising cases for DDP.

Figure 7: The results using different detection spaces.

Figure 8: The AUROC results under different guidance weights, training steps, disturbance degrees and resampling sizes.

Figure 9: The reconstruction error under different iteration intervals [0, t] and a fixed step size of 20.
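The shifting-operator argument from the toy example in Section A.5 can be checked numerically. The sketch below is illustrative only: the test function `r`, the grid resolution and the threshold value are assumptions chosen for the demonstration. The function vanishes on the mask S = [0, 0.25] ∪ [0.5, 0.75] but has a bump of height 3 on (0.25, 0.5), so the mask alone cannot see it, while the shifted views restricted to S do.

```python
import math

SIGMA = 1.0                              # assumed InD threshold for the demonstration
GRID = [i / 400 for i in range(401)]     # sample points on [0, 1]


def in_support(x: float) -> bool:
    """Membership in the mask S = [0, 0.25] U [0.5, 0.75]."""
    return 0.0 <= x <= 0.25 or 0.5 <= x <= 0.75


def r(x: float) -> float:
    """Continuous test function: zero on S, bump of height 3 on (0.25, 0.5), zero elsewhere."""
    if 0.25 < x < 0.5:
        return 3.0 * math.sin(4.0 * math.pi * x) ** 2
    return 0.0


def restricted_max(shift: float) -> float:
    """Maximum of |g_shift| over S, where g_shift(x) = r(x + shift)."""
    return max(abs(r(x + shift)) for x in GRID if in_support(x))


mask_only = restricted_max(0.0)
with_shifts = max(restricted_max(s) for s in (0.0, 0.25, -0.25))
print(mask_only, with_shifts)
```

The masked view reports zero, which would make `r` a bad case, while the shifted views recover the bump and exceed the threshold, matching the claim that the annihilator set is empty once the moving operators are added.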

