BOOSTING OUT-OF-DISTRIBUTION DETECTION WITH MULTIPLE PRE-TRAINED MODELS Anonymous

Abstract

Out-of-Distribution (OOD) detection, i.e., identifying whether an input is sampled from a novel distribution other than the training distribution, is a critical task for safely deploying machine learning systems in the open world. Recently, post hoc detection utilizing pre-trained models has shown promising performance and can be scaled to large-scale problems. This advance raises a natural question: Can we leverage the diversity of multiple pre-trained models to improve the performance of post hoc detection methods? In this work, we propose a detection enhancement method that ensembles multiple detection decisions derived from a zoo of pre-trained models. Our approach uses the p-value instead of the commonly used hard threshold and leverages a fundamental framework of multiple hypothesis testing to control the true positive rate of In-Distribution (ID) data. We focus on the usage of model zoos and provide systematic empirical comparisons with current state-of-the-art methods on various OOD detection benchmarks. The proposed ensemble scheme shows consistent improvement over single-model detectors and significantly outperforms current competitive methods. Relative to the best baseline, our method reduces the average false positive rate by 65.40% on the CIFAR10 benchmark and by 26.96% on the ImageNet benchmark.

1. INTRODUCTION

Deep neural networks have achieved empirical success in many applications, but generalization robustness has always been a thorny problem in deep learning. A sophisticated and well-trained deep neural network can provide excellent test performance on in-distribution (ID) test data but may fail to make accurate predictions on inputs from outside the training distribution (Nguyen et al., 2015). This poses a big obstacle to the generalization of deep neural network models. Especially in safety-critical applications, it is better to identify out-of-distribution (OOD) inputs ahead of time rather than letting the model make predictions that may be unreliable.

On the basis of pre-trained deep neural networks, many recent works on post hoc OOD detection have proposed diverse score functions to distinguish OOD samples utilizing the output probability, logits, gradients, and features of the pre-trained classifier. For example, MSP (Hendrycks & Gimpel, 2017) uses the maximum softmax probability, Energy score (Liu et al., 2020) considers the logits, and GradNorm (Huang et al., 2021) employs the vector norm of gradients. Based on these frameworks, several improved methods such as ODIN (Liang et al., 2018), Adjusted Energy Score (Lin et al., 2021), and ReAct (Sun et al., 2021) have been proposed to enhance the performance of OOD detection. At the same time, some works propose new training strategies to encourage the network to learn more features that may not be relevant to the classification task. The score functions above measure the similarity between a test input and the training (ID) data through pre-trained feature extractors or classifiers.

There are also many distance-based algorithms that directly quantify the distance of samples in the embedding space extracted from a pre-trained model and regard a test input as an OOD sample when it is far from the ID data. Lee et al. (2018) assume that the conditional distribution of features given the class label is Gaussian and derive a confidence score based on the Mahalanobis distance. SSD (Sehwag et al., 2020) considers self-supervised pre-training and a Mahalanobis distance. Tack et al. (2020) use contrastive learning with distributionally-shifted augmentations for pre-training and propose a detection score specific to their training scheme. Sun et al. (2022) study the nearest-neighbor distance and demonstrate the efficacy of non-parametric modeling of the feature distribution for OOD detection tasks.

The performance of post hoc detection highly depends on the quality of pre-training. The most commonly used model architectures in OOD detection include convolutional networks such as ResNet (He et al., 2016), DenseNet (Huang et al., 2017), and Wide-ResNet (Zagoruyko & Komodakis, 2016), as well as Transformer models such as Swin (Liu et al., 2022) and ViT (Dosovitskiy et al., 2021). In general, the pre-trained models focus on the features related to classification tasks, and the learnt representation may be insufficiently rich for OOD detection. Therefore, researchers have proposed ideas such as contrastive learning (Winkens et al., 2020; Tack et al., 2020), adversarial training (Biggio & Roli, 2018; Miller et al., 2020; Chalapathy & Chawla, 2019), outlier exposure (Hendrycks et al., 2018; Papadopoulos et al., 2021), other auxiliary artificially synthesized data (Lee et al., 2017), and auxiliary loss functions (Vyas et al., 2018) to encourage models to learn high-level, task-agnostic, and comprehensive features, which makes the model more robust and efficient in the downstream detection task. Models trained with different architectures and training strategies can extract diverse features that may complement each other well.
So, a natural question is raised: Can we leverage the diversity of multiple pre-trained models to improve the performance of post hoc OOD detectors? To answer this question, we first build a model zoo that captures as many properties of the input as possible and remains sensitive to distributional changes. Then we reformulate the OOD detection task to check whether there exists a model in the model zoo that can identify the test input as an OOD sample. Section 3.1 shows that the naive ensemble of multiple OOD detection decisions cannot maintain the true positive rate of the ID data (TPR). Therefore, we propose an ensemble scheme to integrate the results of multiple OOD detectors and provide theoretical guarantees that our method can keep TPR at the target level. In Section 4, we also report the empirical TPR of our method, which is close to the target TPR level. Ensembling is not new to OOD detection. Morningstar et al. (2021) combines multiple test statistics from generative models to differentiate ID and OOD data. Haroush et al. (2022) uses both the Simes' method and Fisher's method to summarize p-values computed for each channel and layer of a deep neural network. Bergamin et al. (2022) shows that combining different types of test statistics using Fisher's method overall leads to a more accurate out-of-distribution test. Recently, Magesh et al. (2022) proposes an ensemble framework that combines any number of different test statistics using the Benjamini-Yekutieli procedure (Benjamini & Yekutieli, 2001 ) and a conformal p-value estimator (Vovk et al., 1999) . In this work, we develop a simple and fundamental ensemble scheme for using model zoos in OOD detection and name our method Zoo-based OOD Detection Enhancement (ZODE). Our method directly estimates the p-values according to its definition and employs the Benjamini-Hochberg procedure (Benjamini & Hochberg, 1995) to control TPR. 
Then, we provide theoretical guarantees and empirical validation to show that ZODE can maintain the TPR close to its target level. On the other hand, we focus on the settings of the model zoo and conduct systematic experiments to demonstrate the superiority of our approach. First, we show that ZODE can consistently improve current OOD detectors. Second, by comparing single-model detectors with the ZODE-ensembled detector, we find that ZODE can exploit the diversity of multiple pretrained models and leverage complementarity among single-model detectors. Finally, our approach significantly improves current SOTA performance. We summarize our contributions as follows: • We provide novel insights into OOD detection from the perspective of the model zoo. We propose an enhancement scheme, ZODE, for OOD detection by exploiting the diversity of pre-trained models. The proposed method is inspired by a simple and fundamental framework of multiple hypothesis testing. Our theoretical results and experiments clearly show that ZODE can leverage the complementarity among single-model detectors to improve performance. • We point out that the naive ensemble of multiple OOD detectors leads to lower TPR. Then we provide theoretical analysis and empirical validation to demonstrate that our proposed method can maintain TPR well under the settings of the model zoo. • Extensive experiments show that our method can effectively and consistently improve the power of identifying OOD samples. On a commonly used CIFAR10 benchmark, our method significantly improves the SOTA result of the average false positive rate from 11.07% to 3.83%. For a challenging OOD detection task based on ImageNet, we show that our method is scalable to large-scale problems and significantly improves the SOTA result of the average false positive rate from 38.47% to 28.10%.

2. PRELIMINARIES

Out-of-distribution detection aims to check whether a test input is generated from the training distribution. It is a one-sample hypothesis testing problem if we can only access the training data. We denote by $\mathcal{X}$ and $\mathcal{Y}$ the input and label spaces, respectively, and let $P_{id}$ be the training distribution over $\mathcal{X} \times \mathcal{Y}$. Suppose that $\phi(x)$ is a neural network trained on data drawn from $P_{id}$ to predict the label of an input $x \in \mathcal{X}$. Let $D_{id}$ denote the marginal distribution on $\mathcal{X}$. We call $x \sim D_{id}$ an in-distribution (ID) sample; otherwise, we treat it as an "unknown" input, called out-of-distribution (OOD) data. At test time, OOD detection distinguishes OOD samples from ID samples by using a decision function:

$$G(x^*) = \begin{cases} \text{ID}, & S(x^*) \ge \lambda; \\ \text{OOD}, & S(x^*) < \lambda, \end{cases} \tag{1}$$

where $x^*$ is a test input, $S(\cdot)$ is a score function that gives higher scores for ID data and lower scores for OOD data, and $\lambda$ is the threshold. In this work, we consider post hoc OOD detection, in which the score function $S$ is derived from a pre-trained classifier $\phi$, i.e., $S(x^*) = S(x^*; \phi)$. For any pre-trained model $\phi$, we denote by $F(s; \phi)$ the distribution of $S(x; \phi)$ with $x \sim D_{id}$. Then, if $x^*$ is an ID sample, the score $S(x^*; \phi)$ is an ID value following the distribution $F(s; \phi)$. Therefore, given a pre-trained model zoo $\mathcal{M} = \{\phi_1, \dots, \phi_m\}$, we strengthen the OOD detection problem to: Is there a $\phi \in \mathcal{M}$ that would allow us to identify $x^*$ as an OOD sample? In this work, we propose an approach to answer this question.

3.1. NAIVE ENSEMBLE CANNOT MAINTAIN TPR

To leverage the model zoo $\mathcal{M}$ for OOD detection, a straightforward way is to execute the detection procedure in Eq. (1) with each pre-trained model,

$$G(x^*; \phi) = \begin{cases} \text{ID}, & S(x^*; \phi) \ge \lambda_\phi; \\ \text{OOD}, & S(x^*; \phi) < \lambda_\phi, \end{cases}$$

and identify $x^*$ as an OOD sample if there exists $\phi \in \mathcal{M}$ such that $G(x^*; \phi) = \text{OOD}$, i.e.,

$$G(x^*; \mathcal{M}) = \begin{cases} \text{ID}, & \text{if } S(x^*; \phi) \ge \lambda_\phi,\ \forall \phi \in \mathcal{M}; \\ \text{OOD}, & \text{if } S(x^*; \phi) < \lambda_\phi,\ \exists \phi \in \mathcal{M}. \end{cases} \tag{2}$$

In other words, $x^*$ is classified as an ID sample only if all detectors $G(x^*; \phi_i)$, $\phi_i \in \mathcal{M}$, agree that $x^*$ is an ID sample. However, this simple approach makes it hard to control the true positive rate of the ID data (TPR). In practice, the threshold $\lambda_\phi$ is chosen so that a high fraction (e.g., 95%) of ID data is correctly identified. We denote the target level of the true positive rate of the ID data as $\mathrm{TPR}_0$ and write $\alpha = 1 - \mathrm{TPR}_0$. Each detector $G(x^*; \phi_i)$ therefore misidentifies an ID sample as an OOD sample with probability $\alpha$. When ensembling multiple single-model detectors, the probability of making mistakes accumulates. The detector $G(x^*; \mathcal{M})$ misidentifies an ID sample as an OOD sample with probability larger than $\alpha$, specifically $1 - (1 - \alpha)^m$ when the detectors are independent. As more and more pre-trained models become available, this error probability of $G(x^*; \mathcal{M})$ increases and approaches 100%. This implies that the naive ensembled detector cannot maintain the target TPR level. On the other hand, for a fixed $m$, we can choose a small $\alpha$ so that $1 - (1 - \alpha)^m = 5\%$. In this case, the per-detector level $\mathrm{TPR}_0$ must be very large, even close to 1. This greatly reduces the probability of successfully identifying OOD data, as each single-model detector becomes very conservative and can only identify extreme OOD data. In this work, we develop an ensemble scheme that can maintain the target TPR level while keeping a high probability of successfully identifying OOD data.
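The accumulation of errors described above can be checked with a few lines of arithmetic. The following is a minimal sketch under the independence assumption stated in the text; the function names `naive_id_error` and `per_detector_alpha` are illustrative and do not come from the paper.

```python
def naive_id_error(alpha: float, m: int) -> float:
    """Probability that at least one of m independent detectors, each with
    ID error rate alpha, flags an ID sample as OOD: 1 - (1 - alpha)^m."""
    return 1.0 - (1.0 - alpha) ** m


def per_detector_alpha(target: float, m: int) -> float:
    """Per-detector alpha needed so that the naive ensemble of m independent
    detectors has overall ID error rate `target`."""
    return 1.0 - (1.0 - target) ** (1.0 / m)


if __name__ == "__main__":
    # Columns: zoo size, naive ensemble error at alpha = 0.05,
    # per-detector alpha required to bring the ensemble error back to 0.05.
    for m in (1, 5, 7, 20, 100):
        print(m, round(naive_id_error(0.05, m), 4), round(per_detector_alpha(0.05, m), 5))
```

With seven independent detectors at α = 0.05, the naive rule already rejects roughly 30% of ID inputs, and restoring a 5% overall error requires each detector to operate at a much smaller α, which is the conservative regime the text warns about.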

3.2. USING P-VALUE FOR OOD DETECTION

Directly integrating score functions is uninterpretable and lacks theoretical guarantees. Therefore, we use the p-value for OOD detection. The p-value (Abramovich & Ritov, 2013) is defined in the framework of statistical hypothesis testing. In OOD detection, the p-value is a probability measure that quantifies how extreme the observed score is when the input is an ID sample (Cai & Koutsoukos, 2020; Morningstar et al., 2021; Haroush et al., 2022; Bergamin et al., 2022; Magesh et al., 2022; Kaur et al., 2022). For example, we identify an input $x$ as an OOD sample (reject the null hypothesis) when the observed detection score $S(x)$ is smaller than a critical value $\gamma$. Given a test sample $x^*$, the lower the value of $S(x^*)$, the more likely it is that $x^*$ is not drawn from the training distribution. Hence, the p-value of $x^*$ is the probability that $S(x)$ is at most $S(x^*)$ under the ID distribution, that is,

$$\text{P-value of } x^* = P\big(S(x) \le S(x^*) \mid x \sim D_{id}\big). \tag{3}$$

In general, if the p-value of $x^*$ is less than 0.05, we can determine that $x^*$ is an OOD sample at the significance level 0.05. In Appendix C, we show that using the p-value is equivalent to using the hard threshold $\lambda$ in Eq. (1).

Suppose the test input $x^*$ is an ID sample, i.e., $x^* \sim D_{id}$, and the detection score $S(x^*)$ is a continuous random variable. We write $p_0$ for the p-value of $x^*$ and let $F(s)$ be the cumulative distribution function of $S(x)$ with $x \sim D_{id}$. Then we have

$$p_0 = P\big(S(x) \le S(x^*) \mid x \sim D_{id}\big) = F(S(x^*)).$$

It follows from the continuity of $S(x^*)$ and Lemma 21.1 of Van der Vaart (2000) that

$$P(p_0 < \alpha) = 1 - P\big(F(S(x^*)) \ge \alpha\big) = 1 - P\big(S(x^*) \ge F^{-1}(\alpha)\big) = F(F^{-1}(\alpha)) = \alpha.$$

This implies that the p-value of $x^*$ follows the uniform distribution $U[0, 1]$. In the following, we will use this property to develop an ensemble scheme (Theorem 1 and Lemma 2).
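To make the p-value definition and its uniformity under the ID distribution concrete, here is a minimal sketch that estimates p-values from held-out ID scores. It assumes synthetic Gaussian scores in place of a real detector, and the helper name `empirical_p_values` is illustrative rather than taken from the paper.

```python
import numpy as np


def empirical_p_values(val_scores, test_scores):
    """Empirical p-value of each test score: the fraction of ID validation
    scores that are less than or equal to it."""
    sorted_val = np.sort(np.asarray(val_scores, dtype=float))
    # side="right" counts the validation scores <= each test score.
    counts = np.searchsorted(sorted_val, np.asarray(test_scores, dtype=float), side="right")
    return counts / sorted_val.size


rng = np.random.default_rng(0)
val_scores = rng.normal(size=20000)     # synthetic ID validation scores (assumption)
id_test_scores = rng.normal(size=5000)  # fresh ID scores drawn from the same law
p_values = empirical_p_values(val_scores, id_test_scores)

# For ID inputs the p-values should be close to uniform on [0, 1], so roughly
# a fraction alpha of them should fall below alpha.
for alpha in (0.05, 0.25, 0.5):
    print(alpha, float(np.mean(p_values < alpha)))
```

The printed fractions should track α closely, which is the empirical counterpart of P(p_0 < α) = α.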

3.3. TPR CONTROLLING FOR ENSEMBLE

According to Eq. (3), the p-value relies on the score function $S(x)$, which is derived from a pre-trained model $\phi$, i.e., $S(x) = S(x; \phi)$. Given one pre-trained model, we can construct a score function and compute the p-value of a test input. But when multiple pre-trained models are accessible, how can we fuse the single-model results to leverage the diversity of multiple pre-trained models while strictly maintaining the TPR on ID data? We borrow the idea of the Benjamini-Hochberg procedure (Benjamini & Hochberg, 1995) and propose an ensemble scheme for OOD detection via p-value correction.

Consider a model zoo with $m$ pre-trained models, $\mathcal{M} = \{\phi_1, \phi_2, \dots, \phi_m\}$, and a score function $S(x; \phi)$. Given a test input $x^*$ and a pre-trained model $\phi_i$, we compute the score value $S(x^*; \phi_i)$ and obtain the corresponding p-value $p_i$. Going through all pre-trained models, we obtain $m$ p-values $\{p_1, p_2, \dots, p_m\}$ and sort them in ascending order: $p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)}$. Then, we identify the test input $x^*$ as an OOD sample if there exists an integer $1 \le k \le m$ such that

$$p_{(k)} \le \frac{k}{m}\,(1 - \mathrm{TPR}_0).$$

Here $\mathrm{TPR}_0$ is a predetermined TPR level of the ID data. In general, it is taken to be 95%. We call the proposed method Zoo-based OOD Detection Enhancement (ZODE) and present the details of ZODE in Algorithm 1. Next, we provide theoretical guarantees that Algorithm 1 can maintain the target TPR level on ID data.

Theorem 1. Suppose a pre-trained model zoo $\{\phi_1, \phi_2, \dots, \phi_m\}$ is accessible and the score function is $S(x; \phi)$. Let $\mathrm{TPR}_0 > 0.5$ be a predetermined TPR level for the ID data. If the test input $x^*$ is an ID sample, i.e., $x^* \sim D_{id}$, and $S(x^*; \phi_i)$ is independent of $S(x^*; \phi_j)$ for all $i \ne j$, then Algorithm 1 identifies $x^*$ as an ID sample with probability larger than $\mathrm{TPR}_0$.

Remark. Here we assume that $S(x^*; \phi_i)$ is independent of $S(x^*; \phi_j)$, which leads to the independence between $p_i$ and $p_j$ for all $i \ne j$.
This assumption can hold if different pre-trained models learn completely different features. In this case, the model zoo has the desired diversity. In practice, the pre-trained models can still be very diverse, but different models may extract related features. Therefore, we report the empirical TPR of our method in Section 4. One can find that ZODE still keeps the empirical TPR at or above the target level even though the p-values may be related. The proof is postponed to Appendix A. In Appendix B, we analyze the detection power of Algorithm 1 as m tends to infinity, which implies that FPR is guaranteed as the size of the model zoo increases. Appendix D presents more discussions about the ensemble scheme and compares the BH procedure with three baseline ensemble schemes.

[Algorithm 1: ZODE. Recoverable fragments of the listing: "Estimate the p-value of x* given ϕ_j: p_j = #{x_i : S(x_i, ϕ_j) ≤ S(…"; "return x* is an ID sample."; "13: end if".]

Computational complexity. In Algorithm 1, we decompose ZODE into three stages: inference, testing, and ensemble. The inference stage requires computing the score of all validation samples for all pre-trained models, and its computational complexity is m times that of post hoc OOD detection using a single pre-trained model, since ZODE uses m pre-trained models. However, the inference stage only needs the ID data and can be done before deploying the OOD detector. Therefore, it does not increase the detection time when testing new inputs. In addition, the inference stage is only feed-forward and is easily parallelizable, so the computational burden is not heavy. In this work, all experiments can be done using one NVIDIA V100 GPU.

Interpretability. One of the benefits of ZODE is interpretability. If a test input is classified as an OOD sample, we can track which pre-trained models lead to this detection decision.
At Step 9 of Algorithm 1, if $p_{(k)} \le \frac{k}{m}(1 - \mathrm{TPR}_0)$ and $p_{(j)} > \frac{j}{m}(1 - \mathrm{TPR}_0)$ for all $j > k$, then there are $k$ pre-trained models, corresponding to $p_{(1)}, \dots, p_{(k)}$ respectively, that identify the test input as an OOD sample. In our experiments, we exploit this interpretability to find that there are OOD images that only one pre-trained model can detect. This implies that ZODE leverages the complementarity between all single-model detectors.

Limitation. The limitation of ZODE is that the testing stage takes up a lot of storage space. Post hoc OOD detection computes the score of all validation samples and selects a hard threshold by the quantile of the empirical distribution of the detection score. Therefore, post hoc OOD detection only passes the threshold from the inference stage to the testing stage. In Algorithm 1, the testing stage requires the scores of the validation samples to compute p-values. Therefore, the testing stage of ZODE takes up more storage space than post hoc OOD detection methods.

(Netzer et al., 2011), LSUN (Yu et al., 2015), iSUN (Xu et al., 2015), Texture (Cimpoi et al., 2014), Places365 (Zhou et al., 2017), and CIFAR100 (Krizhevsky et al., 2009). We then consider more challenging benchmarks based on ImageNet, i.e., large-scale OOD detection tasks. The ID data is ImageNet-1K (Deng et al., 2009). We evaluate OOD detectors on four test datasets that are subsets of Places365 (Zhou et al., 2017), iNaturalist (Van Horn et al., 2018), SUN (Xiao et al., 2010), and Texture (Cimpoi et al., 2014), with categories different from each other.
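The ensemble rule of Section 3.3, together with the Step 9 bookkeeping used for interpretability, can be sketched as follows. This is a minimal illustration that assumes the per-model p-values have already been computed; the function name `zode_decision` and the example p-values are hypothetical and are not taken from the paper's Algorithm 1 listing.

```python
import numpy as np


def zode_decision(p_values, tpr0=0.95):
    """Benjamini-Hochberg style decision over per-model p-values.

    Returns ("OOD", indices of the k models behind the decision) if some
    sorted p-value satisfies p_(k) <= (k / m) * (1 - tpr0), else ("ID", []).
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    alpha = 1.0 - tpr0
    order = np.argsort(p)                           # model indices, smallest p-value first
    thresholds = alpha * np.arange(1, m + 1) / m    # (k / m) * alpha for k = 1..m
    passed = np.nonzero(p[order] <= thresholds)[0]
    if passed.size == 0:
        return "ID", []
    k = int(passed[-1]) + 1                         # largest k meeting the condition
    return "OOD", order[:k].tolist()


# Hypothetical p-values from a zoo of five models.
print(zode_decision([0.40, 0.004, 0.62, 0.03, 0.85]))
print(zode_decision([0.20, 0.50, 0.90, 0.30, 0.70]))
```

In the first call only the model at index 1 has a p-value below its threshold, which corresponds to the case discussed in the text where a single pre-trained model is responsible for the OOD decision.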

Metrics:

We evaluate OOD detection methods by the following three metrics: (1) the true positive rate of the ID samples (TPR); (2) the false positive rate of OOD samples when the true positive rate of the ID samples is about 95% (FPR); (3) the area under the receiver operating characteristic curve (AUC). For single-model detectors, the hard threshold is determined by TPR = 95%. Therefore, the first metric checks whether our ensemble scheme can maintain the TPR level close to 95%. FPR and AUC are often used in the literature to reflect the capabilities of OOD detectors. For the AUC metric, we use grid values of TPR ranging from 0 to 1 with a gap of 0.0005 and obtain the corresponding FPR to compute the area under the receiver operating characteristic curve.

Enhanced OOD detection: We consider three OOD detection methods: MSP (Hendrycks & Gimpel, 2017), Energy (Liu et al., 2020), and KNN (Sun et al., 2022). MSP is a simple baseline method that uses maximum softmax probabilities as the detection score. In some experiments, MSP can yield surprisingly good results when used on top of a large pre-trained model that has been fine-tuned on the ID data (Fort et al., 2021). The energy-based model (LeCun et al., 2006) maps a test input to a scalar that is higher for OOD samples and lower for the training data. Liu et al. (2020) propose an energy score that uses the logits output by a pre-trained classifier. Sun et al. (2022) use the feature distance between the test input and the k-th nearest ID sample and propose a KNN-based detector. These three OOD detection methods represent three kinds of detectors based on probability, logit, and distance, respectively. We take them as the baseline methods and denote our enhanced methods by 'ZODE-MSP', 'ZODE-Energy', and 'ZODE-KNN' respectively.
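The FPR and AUC metrics described above can be sketched as follows. This is an illustrative implementation on synthetic scores, not the authors' evaluation code: the function names are hypothetical, scores are assumed to be higher for ID data as in Eq. (1), and pinning the ROC endpoints to (0, 0) and (1, 1) is an assumption of this sketch.

```python
import numpy as np


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """FPR on OOD data when the threshold keeps a fraction `tpr` of ID data."""
    lam = np.quantile(np.asarray(id_scores, dtype=float), 1.0 - tpr)
    return float(np.mean(np.asarray(ood_scores, dtype=float) >= lam))


def auc_on_tpr_grid(id_scores, ood_scores, step=0.0005):
    """Area under the ROC curve evaluated on an evenly spaced TPR grid."""
    id_scores = np.asarray(id_scores, dtype=float)
    ood_sorted = np.sort(np.asarray(ood_scores, dtype=float))
    tpr_grid = np.linspace(0.0, 1.0, int(round(1.0 / step)) + 1)
    lams = np.quantile(id_scores, 1.0 - tpr_grid)            # threshold for each TPR
    below = np.searchsorted(ood_sorted, lams, side="left")   # OOD scores < threshold
    fpr = 1.0 - below / ood_sorted.size
    fpr[0], fpr[-1] = 0.0, 1.0   # assumption: close the curve at (0, 0) and (1, 1)
    # Trapezoidal rule for the area under TPR as a function of FPR.
    return float(np.sum(np.diff(fpr) * (tpr_grid[1:] + tpr_grid[:-1]) / 2.0))


rng = np.random.default_rng(0)
id_scores = rng.normal(loc=2.0, size=5000)   # synthetic ID scores (assumption)
ood_scores = rng.normal(loc=0.0, size=5000)  # synthetic OOD scores (assumption)
print("FPR at 95% TPR:", fpr_at_tpr(id_scores, ood_scores))
print("AUC:", auc_on_tpr_grid(id_scores, ood_scores))
```

A lower FPR and a higher AUC both indicate better separation between ID and OOD scores.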

4.1. EVALUATION ON CIFAR10 BENCHMARKS

Model Zoo. We build a model zoo with seven pre-trained models: ResNet18, ResNet34, ResNet50, ResNet101, ResNet152 (He et al., 2016), DenseNet (Huang et al., 2017)

ZODE-KNN achieves superior performance. We compare our method with competitive OOD detection methods, including MSP (Hendrycks & Gimpel, 2017), ODIN (Liang et al., 2018), Energy (Liu et al., 2020), GODIN (Hsu et al., 2020), Mahalanobis (Lee et al., 2018), KNN (Sun et al., 2022), CSI (Tack et al., 2020), SSD+ (Sehwag et al., 2020), as well as KNN+ (Sun et al., 2022). We cite the results of the competitors reported in Sun et al. (2022). For a fair comparison, we set k = 50 in the experiments of ZODE-KNN, which is the same as Sun et al. (2022). Compared to the best baseline KNN+, ZODE-KNN reduces the average FPR from 11.07% to 3.83%, a relative reduction of 65.40%. Note that ZODE-KNN significantly reduces FPR when OOD samples are drawn from iSUN, Texture, and Places365. For LSUN, ZODE-KNN slightly improves the performance of KNN+. In addition, SSD+ outperforms ZODE-KNN on SVHN. Overall, ZODE-KNN significantly improves the performance of existing methods on these five OOD datasets.

ZODE achieves consistent improvements. We consider three different kinds of OOD detection scores. MSP (Hendrycks & Gimpel, 2017) is based on the probabilities, Energy (Liu et al., 2020) uses the logits, and Mahalanobis (Lee et al., 2018) and KNN (Sun et al., 2022) quantify the distance in the embedding space. Then we compare them with the corresponding enhanced detectors: ZODE-MSP, ZODE-Energy, ZODE-Mahalanobis, and ZODE-KNN. For ZODE-MSP, ZODE-Energy, and ZODE-Mahalanobis, we use the same settings as Hendrycks & Gimpel (2017) and Liu et al. (2020). We find that ZODE-enhanced detectors consistently improve the performance of the corresponding baselines (Table 1).

ZODE leverages the complementarity between the single-model detectors.
implies that the superior performance of ZODE does not fully come from any single-model detector. Therefore, our ensemble procedure works and is necessary for the improvements. We further take Places365 as an example to illustrate that ZODE exploits the diversity of multiple pre-trained models. At Step 9 of Algorithm 1, if $p_{(1)} \le \frac{1}{m}(1 - \mathrm{TPR}_0)$ and $p_{(j)} > \frac{j}{m}(1 - \mathrm{TPR}_0)$ for all $j \ge 2$, then there is only one pre-trained model that can help to identify the test input as an OOD sample. Figure 1 presents seven such images, and each image corresponds to one pre-trained model in our model zoo.

Evaluations on CIFAR10 vs CIFAR100. We consider a challenging OOD detection task that identifies OOD samples drawn from CIFAR100 when the ID data is CIFAR10. Table 3a summarizes a detailed comparison with GRAM (Sastry & Oore, 2019), MaSF (Haroush et al., 2022), SSD (Sehwag et al., 2020), and KNN (Sun et al., 2022). Compared with the best baseline SSD+, ZODE reduces the FPR by 20.21 percentage points, which is a relative 52.49% improvement in detection power. The results in Table 3b clearly show that ZODE significantly outperforms the single-model KNN detectors and that our ensemble scheme fully leverages the complementarity between the single-model detectors.

4.2. EVALUATION ON IMAGENET BENCHMARKS

Model zoo and implementation details. We use five pre-trained models to build a model zoo, consisting of models with different architectures and different pre-training strategies. The models are as follows: ResNet50* (Sun et al., 2022), semi-weakly supervised ResNeXt101 32x16d (Yalniz et al., 2019), Swinv2-B256, Swinv2-B384, and Swinv2-L256 (Liu et al., 2022). Note that the input resolutions of Swinv2-B256, Swinv2-B384, and Swinv2-L256 are 256x256, 384x384, and 256x256 respectively. ResNet50* is trained with the SupCon loss (Khosla et al., 2020), which pulls points belonging to the same class together in the embedding space and separates samples from different classes. ResNeXt101 is pre-trained on billion-scale images associated with meta information semantically relevant to ImageNet and achieves 84.8% top-1 accuracy on ImageNet. The three Swinv2 models are pre-trained at higher resolution, and their top-1 accuracies on ImageNet all exceed 84%. In the following, we only report the results of ZODE-KNN based on the model zoo. The hyperparameter TPR 0 is taken to be 93.50%, which makes the empirical TPR of ZODE-KNN close to 95%. We use k = 1000 for ResNet50*, which is the same as Sun et al. (2022). For the remaining models, we select the k from {100, 200, 500, 700, 800, 900, 1000, 3000, 5000} that minimizes the FPR.

ZODE-KNN achieves superior performance. In Table 4, we compare ZODE-KNN with competitive OOD detection methods, including MSP (Hendrycks & Gimpel, 2017), ODIN (Liang et al., 2018), Energy (Liu et al., 2020), GODIN (Hsu et al., 2020), Mahalanobis (Lee et al., 2018), KNN (Sun et al., 2022), SSD+ (Sehwag et al., 2020), as well as KNN+ (Sun et al., 2022). ZODE-KNN outperforms the best baseline KNN+ uniformly on all four OOD datasets, substantially reducing the average FPR from 38.47% to 28.10%, which is a relative 26.96% improvement in detection power.
In particular, when the test datasets are iNaturalist and Textures, ZODE-KNN reduces the FPR by a relative 83.40% and 70.61% respectively, which highlights the effectiveness of ZODE.

ZODE combines the advantages of the single-model detectors. In Table 5, we report the performance of every single-model detector derived from our model zoo. We highlight three trends: (1) ZODE-KNN outperforms the best single-model KNN detector with a relative 22.95% improvement in FPR. This implies that ZODE works on the ImageNet benchmarks and that the ensemble scheme of ZODE-KNN is necessary for the improvements. (2) ZODE combines the advantages of single-model detectors. In Table 5, we can observe that ResNet50* and ResNeXt101 32x16d perform well on Textures but underperform on iNaturalist, while the Swin models show the opposite behavior. However, the ZODE-ensembled detector achieves strong and stable performance on all test datasets. (3) ZODE leverages the complementarity between the single-model detectors. Similar to the discussions in Figure 1, we find some images in Textures that can be successfully identified as OOD samples and for which the detection decision depends on only one single-model detector. Figure 1 presents five such images, and each image corresponds to one pre-trained model in our model zoo.
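The relative improvements quoted for the average FPR follow directly from the reported before and after values. The short check below reproduces that arithmetic; the helper name `relative_reduction` is illustrative.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Relative reduction of a metric, in percent of the baseline value."""
    return 100.0 * (baseline - ours) / baseline


# Average FPR (%) of the best baseline KNN+ versus ZODE-KNN, as reported in the text.
print("CIFAR10 :", round(relative_reduction(11.07, 3.83), 2))
print("ImageNet:", round(relative_reduction(38.47, 28.10), 2))
```

Both values agree with the 65.40% and 26.96% relative improvements stated in the paper.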

5. CONCLUSION

In this paper, we exploit the diversity of multiple pre-trained models in a model zoo to improve the performance of post hoc OOD detection. We propose ZODE, an efficient and fundamental ensemble scheme for combining multiple detection decisions. Extensive experiments show that ZODE can effectively address the missed detections of single-model detectors by exploiting the complementarity of multiple detectors. We find that ZODE combined with the KNN detector (Sun et al., 2022) works very well. On a wide range of OOD detection benchmarks, ZODE-KNN significantly improves the current SOTA results.

□

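According to Lemma 2, for any $\phi_i \in \mathcal{M}$,

$$p_i = P\big(S(x; \phi_i) \le S(x^*; \phi_i) \mid x \sim D_{id}\big) \sim U[0, 1],$$

and the density function of $p_i$ is

$$f_{p_i}(x) = \begin{cases} 1, & x \in [0, 1]; \\ 0, & \text{otherwise}. \end{cases}$$

Then, the joint probability density of the ordered values $p_{(1)}, p_{(2)}, \dots, p_{(m)}$ is

$$f_{p_{(1)}, p_{(2)}, \dots, p_{(m)}}(x_1, x_2, \dots, x_m) = m! \prod_{i=1}^{m} f_{p_i}(x_i) = m!.$$

We denote $\alpha = 1 - \mathrm{TPR}_0$ and define an event $E$: for all $1 \le j \le m$, $p_{(j)} \ge \frac{j}{m}\alpha$. Then we have

$$P(E \mid x^* \sim D_{id}) = \int_{\frac{m}{m}\alpha}^{1} \cdots \int_{\frac{2}{m}\alpha}^{1} \int_{\frac{1}{m}\alpha}^{1} f_{p_{(1)}, \dots, p_{(m)}}(x_1, x_2, \dots, x_m)\, dx_1\, dx_2 \cdots dx_m = m!\Big(1 - \frac{1}{m}\alpha\Big)\Big(1 - \frac{2}{m}\alpha\Big) \cdots \Big(1 - \frac{m}{m}\alpha\Big).$$

Next, we prove that for any $m \ge 1$ and $\alpha \le 0.5$,

$$m!\Big(1 - \frac{1}{m}\alpha\Big)\Big(1 - \frac{2}{m}\alpha\Big) \cdots \Big(1 - \frac{m}{m}\alpha\Big) \ge 1 - \alpha. \tag{5}$$

It is easy to see that Eq. (5) holds when $m = 1$. Suppose Eq. (5) holds for $m = m_0$. Then for $m = m_0 + 1$, we have

$$\begin{aligned} &(m_0 + 1)!\Big(1 - \frac{1}{m_0 + 1}\alpha\Big)\Big(1 - \frac{2}{m_0 + 1}\alpha\Big) \cdots \Big(1 - \frac{m_0 + 1}{m_0 + 1}\alpha\Big) \\ &\quad \ge (m_0 + 1)!\Big(1 - \frac{1}{m_0}\alpha\Big)\Big(1 - \frac{2}{m_0}\alpha\Big) \cdots \Big(1 - \frac{m_0}{m_0}\alpha\Big)\Big(1 - \frac{m_0 + 1}{m_0 + 1}\alpha\Big) \\ &\quad \ge (m_0 + 1)(1 - \alpha)^2 \ge 1 - \alpha, \end{aligned}$$

which implies that Eq. (5) also holds for $m = m_0 + 1$. Hence, the proof is finished.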

□

B THE POWER OF ALGORITHM 1

In this section, we study the FPR of Algorithm 1 from the asymptotic perspective, i.e., as $m$ tends to infinity. According to Section 3.2 and Appendix A, the p-value of an ID sample follows the uniform distribution $U[0, 1]$. Also, if a pre-trained model fails to identify OOD samples, the p-value derived from the model follows the uniform distribution. For an OOD dataset, we assume that there is a fixed proportion $\pi$ of pre-trained models that can recognize the OOD data points. We call these models active models and denote the set of active models as $\mathcal{A}$. Then the pre-trained models that fail to identify OOD samples belong to the set $\mathcal{A}^c = \mathcal{M} - \mathcal{A}$. We have, for any $0 \le u \le 1$,

$$P(p_j \le u \mid \phi_j \in \mathcal{A}^c) = u \quad \text{and} \quad P(p_j \le u \mid \phi_j \in \mathcal{A}) = G(u),$$

where $G(u)$ is a cumulative distribution function different from that of the uniform distribution $U[0, 1]$. Therefore, the p-values of the OOD dataset are sampled from a mixture model with cumulative distribution function

$$F(u) = (1 - \pi)u + \pi G(u).$$

Let $k$ be the number of pre-trained models that classify the OOD input as an OOD sample. That is, $p_{(k)} \le \frac{k}{m}\alpha$ and $p_{(j)} > \frac{j}{m}\alpha$ for all $j > k$, with $\alpha = 1 - \mathrm{TPR}_0$. Table 6 summarizes the OOD detection

$$E\Big[\frac{S}{m_1}\Big] = E\Big[\frac{k \times \frac{S}{k}}{m_1}\Big] = E\Big[\frac{\frac{k}{m} \times \big(1 - \frac{V}{k}\big)}{\frac{m_1}{m}}\Big].$$

According to Chi (2007), $\frac{k}{m}$ converges to a positive value $p^*(\alpha, F)$ as $m \to \infty$, which serves as the limit of the proportion of rejected p-values. By Theorem 1 and its lemma in Benjamini & Hochberg (1995),

$$E\Big[1 - \frac{V}{k}\Big] \ge 1 - \frac{m_0}{m}\alpha = 1 - (1 - \pi)\alpha,$$

where the last equality uses $m_0 = (1 - \pi)m$. Therefore, if $m$ is sufficiently large, then we have

$$E\Big[\frac{S}{m_1}\Big] \ge \frac{p^*(\alpha, F)\big(1 - (1 - \pi)\alpha\big)}{\pi} \ge \frac{1}{\pi m} = \frac{1}{m_1},$$

with $m_1 = \pi m$. This implies that if $m$ is sufficiently large, $S$ is greater than or equal to 1 with high probability. In other words, if the number of pre-trained models is sufficiently large, Algorithm 1 can identify OOD samples with high probability.
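The asymptotic argument can be illustrated with a small simulation of the mixture model above. This is only a sketch under strong assumptions that are not in the paper: the active models' p-values are drawn from a Beta(0.1, 1) distribution as a stand-in for the unspecified G(u), all p-values are independent, and the function names are hypothetical.

```python
import numpy as np


def bh_any_rejection(p, alpha):
    """True for each row of p where some sorted p-value satisfies p_(k) <= (k / m) * alpha."""
    m = p.shape[-1]
    thresholds = alpha * np.arange(1, m + 1) / m
    return np.any(np.sort(p, axis=-1) <= thresholds, axis=-1)


def detection_rate(m, pi=0.5, alpha=0.05, shape=0.1, trials=20000, seed=0):
    """Fraction of simulated OOD inputs flagged by the BH rule with m models,
    a proportion pi of which are 'active'."""
    rng = np.random.default_rng(seed)
    n_active = max(1, int(round(pi * m)))
    p = rng.uniform(size=(trials, m))                                # inactive models: U[0, 1]
    p[:, :n_active] = rng.beta(shape, 1.0, size=(trials, n_active))  # active models: assumed G
    return float(np.mean(bh_any_rejection(p, alpha)))


for m in (2, 5, 10, 20):
    print(m, detection_rate(m))
```

Under these assumptions the detection rate grows with the zoo size m, in line with the claim that a sufficiently large model zoo identifies OOD samples with high probability.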

C USING P-VALUE FOR OOD DETECTION

In OOD detection, the p-value is a probability measure that quantifies how extreme the observed score is when the input is an ID sample (Cai & Koutsoukos, 2020; Morningstar et al., 2021; Haroush et al., 2022; Bergamin et al., 2022; Magesh et al., 2022; Kaur et al., 2022). Given a test sample $x^*$, the lower the value of $S(x^*; \phi)$, the more likely it is that $x^*$ is not drawn from the training distribution. Hence, the p-value of $x^*$ is the probability that $S(x; \phi)$ is at most $S(x^*; \phi)$ under the ID distribution:

$$\text{P-value of } x^* = P\big(S(x; \phi) \le S(x^*; \phi) \mid x \sim D_{id}\big).$$

In practice, using the p-value is equivalent to using the hard threshold $S(x^*) < \lambda$. We denote by $\{(x_i, y_i)\}_{i=1}^{n}$ the validation data sampled from the ID distribution $P_{id}$ and sort their detection scores in ascending order: $S(x_{(1)}; \phi) \le S(x_{(2)}; \phi) \le \cdots \le S(x_{(n)}; \phi)$. In post hoc OOD detection, the threshold $\lambda_\phi$ is determined by correctly classifying 95% of validation data as ID samples, i.e., TPR is at least 95%. Therefore,

$$S(x_{(\lfloor 0.05n \rfloor)}; \phi) \le \lambda_\phi \le S(x_{(\lfloor 0.05n \rfloor + 1)}; \phi),$$

where $\lfloor \cdot \rfloor$ is the floor function. On the other hand, the p-value of $x^*$ being less than 0.05 implies that

$$P\big(S(x; \phi) \le S(x^*; \phi) \mid x \sim \hat{D}_{id}\big) \approx 0.05 \ \Rightarrow\ S(x^*; \phi) \lessapprox S(x_{(\lfloor 0.05n \rfloor + 1)}; \phi),$$

where $\hat{D}_{id}$ is the empirical distribution of $\{x_i\}_{i=1}^{n}$. Therefore, when the sample size $n$ is sufficiently large, the OOD region derived from the critical value, $\{x : S(x; \phi) < \lambda_\phi\}$, is the same as the OOD region determined by the p-value, $\{x : \text{P-value of } x < 0.05\}$.

D USING BH PROCEDURE FOR ENSEMBLE

In this work, we use the Benjamini-Hochberg (BH) procedure (Benjamini & Hochberg, 1995) to ensemble multiple detection decisions (p-values) and exploit the diversity of pre-trained models; see Steps 7-13 in Algorithm 1. Recall that TPR 0 is the target TPR level and α = 1 - TPR 0 . According to Theorem 1, this procedure controls the true positive rate of ID data at a level greater than TPR 0 under the assumption of independent p-values. Benjamini & Yekutieli (2001) point out that the procedure also applies to certain types of positively dependent p-values. In this section, we consider three ensemble schemes, Naive, Average, and Voting, as competitors to illustrate the superiority of the BH procedure. Here 'Naive' is the naive ensemble scheme in Eq. (2), 'Average' identifies a test input as an OOD sample if the average p-value is smaller than α, and 'Voting' is majority voting, which identifies a test input as an OOD sample if more than half of the p-values are less than α. The score function is the KNN score (Sun et al., 2022) with k = 50, and TPR 0 is 95%. The results are presented in Tables 7 and 8. The empirical TPR of our method is well controlled and close to the target TPR level. The naive ensemble scheme cannot maintain the target TPR level, so its low FPR is unreliable. This observation is consistent with our theoretical results in Section 3.1. The average ensemble scheme keeps its empirical TPR above 95%, but its FPR is very large. Overall, the voting ensemble is comparable to our scheme, but its TPR on ImageNet is not well controlled.
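The BH, Average, and Voting decision rules described in this section can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the zoo size of 7 is an assumption, and ID p-values are simulated as independent uniforms. The Naive scheme is omitted because Eq. (2) is not reproduced here.

```python
import numpy as np

def bh_is_ood(p_values, alpha):
    """Flag OOD if some sorted p_(k) <= (k / m) * alpha (BH step-up rule)."""
    p_sorted = np.sort(np.asarray(p_values, dtype=float))
    m = p_sorted.size
    return bool(np.any(p_sorted <= alpha * np.arange(1, m + 1) / m))

def average_is_ood(p_values, alpha):
    """Flag OOD if the average p-value is smaller than alpha."""
    return bool(np.mean(p_values) < alpha)

def voting_is_ood(p_values, alpha):
    """Flag OOD if more than half of the p-values are below alpha."""
    p = np.asarray(p_values, dtype=float)
    return bool(np.sum(p < alpha) > p.size / 2)

alpha = 0.05  # alpha = 1 - TPR_0 with TPR_0 = 95%
schemes = {"BH": bh_is_ood, "Average": average_is_ood, "Voting": voting_is_ood}

# Assumption: ID inputs yield independent Uniform[0, 1] p-values per model.
rng = np.random.default_rng(0)
id_p = rng.random((20000, 7))
tpr = {
    name: 1.0 - np.mean([rule(p, alpha) for p in id_p])
    for name, rule in schemes.items()
}
for name, value in tpr.items():
    print(f"{name:8s} empirical TPR on simulated ID data: {value:.4f}")

# One confident detector among three: only the BH rule flags the input.
single_hit = [0.004, 0.5, 0.5]
print({name: rule(single_hit, alpha) for name, rule in schemes.items()})
```

In this simulation the BH rule keeps the empirical TPR near the 95% target, while the Average and Voting rules are far more conservative. The last example shows why that conservatism costs FPR: a single strongly confident model is enough for BH but not for the other two rules.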

E INFLUENCE OF HYPERPARAMETER

In Section 4, we find that the ZODE-KNN detector works very well and significantly improves on the best baseline method across a wide range of OOD detection benchmarks. The KNN detector (Sun et al., 2022) uses the distance between the test input and the k-th nearest ID sample as the detection score. Here k is a hyperparameter that needs to be predetermined. In this section, we use CIFAR10 to investigate the effect of the choice of k on ZODE. We consider k = 1 and k = 50 and report the results in Table 9. The choice of k affects the performance of ZODE-KNN, but the influence is small compared to the improvements that ZODE-KNN brings. Both ZODE-KNN(k=1) and ZODE-KNN(k=50) significantly outperform the best baseline methods. In Table 10, we report the detailed comparison between the ZODE-ensembled KNN detector and the single-model KNN detectors derived from our model zoo. We observe a similar phenomenon: the choice of k affects the performance of both ensembled and single-model detectors, while the influence is small compared to the improvement brought by the ensemble scheme.

(Hendrycks et al., 2018) can explicitly regularize the model. Another method to fine-tune the model is to modify the loss function or use auxiliary objectives. Some loss functions encourage the predictive distribution of OOD samples toward the uniform distribution (Lee et al., 2017; Hendrycks et al., 2018), while others add a contrastive loss (Winkens et al., 2020), a margin loss (Vyas et al., 2018), or an adversarial learning objective (Biggio & Roli, 2018; Miller et al., 2020; Chalapathy & Chawla, 2019) to force models to learn more high-level, task-agnostic, comprehensive features from the training dataset, making them robust enough for downstream tasks with various distribution shifts. These works either modify the neural network and objective function or require additional data, which incurs additional computational cost.
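The KNN detection score discussed in Appendix E, i.e., the distance from a test input to its k-th nearest ID sample, can be sketched as below. This is a minimal brute-force sketch under stated assumptions: features are L2-normalized before computing distances, the distance is negated so that a lower score means more OOD-like (matching the convention for S used in this paper), and the synthetic features and the name `knn_score` are hypothetical.

```python
import numpy as np

def knn_score(train_feats, test_feats, k):
    """Negative distance to the k-th nearest ID feature (lower = more OOD)."""
    def normalize(z):
        # Assumption: features are L2-normalized before distance computation.
        return z / np.linalg.norm(z, axis=1, keepdims=True)

    train = normalize(np.asarray(train_feats, dtype=float))
    test = normalize(np.asarray(test_feats, dtype=float))
    # Pairwise Euclidean distances, shape (num_test, num_train).
    dists = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=2)
    # k-th smallest distance per test point (k is 1-indexed).
    kth = np.partition(dists, k - 1, axis=1)[:, k - 1]
    return -kth

# Synthetic features: ID data clusters around one direction.
rng = np.random.default_rng(0)
center = np.zeros(8)
center[0] = 1.0
train_feats = center + 0.05 * rng.normal(size=(200, 8))

id_like = center + 0.05 * rng.normal(size=8)   # close to the ID cluster
ood_like = np.zeros(8)
ood_like[1] = 1.0                              # orthogonal direction
test_feats = np.stack([id_like, ood_like])

scores = {k: knn_score(train_feats, test_feats, k) for k in (1, 50)}
for k, s in scores.items():
    print(f"k = {k:2d}: ID-like score = {s[0]:.3f}, OOD-like score = {s[1]:.3f}")
```

For both k = 1 and k = 50 the OOD-like input receives the lower score in this toy setting; changing k shifts the score values but not the ordering, which mirrors the mild sensitivity to k reported above.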



EXPERIMENTS

In this section, we demonstrate the effectiveness of our proposed method. First, we evaluate whether our model zoo and ensemble scheme can enhance OOD detectors. Second, we demonstrate that ZODE exploits the diversity of pre-trained models and leverages the complementarity between the single-model detectors to achieve superior performance. Finally, we show that our method can significantly improve the current SOTA results.

Dataset: We evaluate our proposed method on the CIFAR benchmarks. We use CIFAR10 (Krizhevsky et al., 2009) as the ID data and evaluate OOD detectors on six OOD datasets: SVHN



Figure 1: Places365. Example OOD images that only one single-model detector can identify.

Figure 2: Textures. Example OOD images that only one single-model detector can identify.

Algorithm 1 ZODE: Zoo-based OOD Detection Enhancement
Require: Training data $\{x_i\}_{i=1}^{n}$, pre-trained model zoo $\{\phi_1, \dots, \phi_m\}$, test sample $x^*$, detection score $S(x; \phi)$, TPR level for ID data $\mathrm{TPR}_0$;
1: Stage 1. Inference
2: Compute the score value $S(x_i, \phi_j)$, $\forall\, 1 \le i \le n$ and $\forall\, 1 \le j \le m$;
3: Stage 2. Testing
4: for $1 \le j \le m$ do
x * , f j ) m
8: Sort $\{p_1, \dots, p_m\}$ in ascending order: $\{p_{(1)}, \dots, p_{(m)}\}$;
9: if $\exists\, 1 \le k \le m$ such that $p_{(k)} \le \frac{k}{m}(1 - \mathrm{TPR}_0)$ then

Results on CIFAR10. Comparison with competitive OOD detection methods. The results of all competitors are from Sun et al. (2022). All values are percentages. ↓ indicates smaller values are better and vice versa.

and ResNet18* (Sun et al., 2022). Here ResNet and DenseNet are two backbones routinely used in the literature on OOD detection. Therefore, we consider different architectures and use six models trained by cross-entropy loss. In addition, we also notice the effect of the loss function and introduce the model ResNet18

Results on CIFAR10. We compare the ZODE-KNN detector with the single-model KNN detector. All values are percentages. ↓ indicates smaller values are better and vice versa.

Results on CIFAR10 for CIFAR100 as OOD. The results of GRAM and MaSF are from Haroush et al. (2022). We cite the results of SSD and SSD+ reported in Sehwag et al. (2020). All values are percentages. ↓ indicates smaller values are better and vice versa.

Results on ImageNet. All results of the competitors are cited from Sun et al. (2022). Methods reported are all based on ID data only (ImageNet-1k). All values are percentages. ↓ indicates smaller values are better and vice versa.

Results on ImageNet. Comparison with single-model detectors and ZODE. All values are percentages. ↓ indicates smaller values are better and vice versa.

OOD detection.

Results on CIFAR10. Comparison with three baseline ensemble schemes. All values are percentages. ↓ indicates smaller values are better and vice versa.

Results on ImageNet. Comparison with three baseline ensemble schemes. All values are percentages. ↓ indicates smaller values are better and vice versa.

Results on CIFAR10. Comparison with competitive OOD detection methods. The results of all competitors are from Sun et al. (2022). All values are percentages. ↓ indicates smaller values are better and vice versa.

Results on CIFAR10. Comparison with single-model detectors at different k levels. Setting k = 1 and k = 50 respectively, we compare the performance with single-model detectors and ZODE.

A PROOF OF THEOREM 1

Theorem. Suppose a pre-trained model zoo $\{\phi_1, \phi_2, \dots, \phi_m\}$ is accessible and the score function is $S(x; \phi)$. Let $\mathrm{TPR}_0 > 0.5$ be the target TPR level for the ID data. If the test input $x^*$ is an ID sample, i.e., $x^* \sim D_{id}$, and $S(x^*; \phi_i)$ is independent of $S(x^*; \phi_j)$ for all $i \ne j$, then Algorithm 1 identifies $x^*$ as an ID sample with probability larger than $\mathrm{TPR}_0$.

Proof. Before proving the theorem, we state a useful lemma that provides the distribution of p-values under the ID distribution.

Lemma 2. Suppose the test input $x^*$ is an ID sample, i.e., $x^* \sim D_{id}$, and the detection score $S(x^*; \phi)$ is a continuous random variable. We write the p-value of $x^*$ as $p_0 = P\left(S(x; \phi) \le S(x^*; \phi) \mid x \sim D_{id}\right)$. Then $p_0$ follows the uniform distribution $U[0, 1]$.

Proof. Let $F_\phi(s)$ be the cumulative distribution function of $S(x; \phi)$ with $x \sim D_{id}$. Then, $p_0 = F_\phi(S(x^*; \phi))$. By the continuity of $S(x^*; \phi)$ and Lemma 21.1 of Van der Vaart (2000), we have $P(p_0 \le u) = P\left(F_\phi(S(x^*; \phi)) \le u\right) = u$ for any $0 \le u \le 1$. Hence $p_0$ follows the uniform distribution $U[0, 1]$.

