UNLEASHING MASK: EXPLORE THE INTRINSIC OUT-OF-DISTRIBUTION DETECTION CAPABILITY Anonymous

Abstract

Out-of-distribution (OOD) detection is an important aspect of safely deploying machine learning models in real-world applications. Previous approaches either design better scoring functions or utilize the knowledge of outliers to equip well-trained models with the ability of OOD detection. However, few of them explore the intrinsic OOD detection capability of a given model. In this work, we discover that a model trained on in-distribution data passes through an intermediate stage with higher OOD detection performance than its final stage across different settings, and we further identify learning with atypical samples as the critical cause. Based on these empirical insights, we propose a new method, Unleashing Mask (UM), that restores the OOD discriminative capabilities of the model. Specifically, we utilize the mask to figure out the memorized atypical samples and fine-tune the model to forget them. Extensive experiments have been conducted to characterize and verify the effectiveness of our method.

1. INTRODUCTION

Out-of-distribution (OOD) detection has drawn increasing attention when deploying machine learning models in open-world scenarios (Nguyen et al., 2015; Lee et al., 2018a). Since test samples can naturally arise from a label-different distribution, identifying OOD inputs is important, especially for safety-critical applications like autonomous driving and medical intelligence. Previous studies focus on designing a series of scoring functions (Hendrycks & Gimpel, 2017b; Liang et al., 2018; Lee et al., 2018a; Liu et al., 2020; Sun et al., 2021; 2022) for OOD uncertainty estimation or on fine-tuning with auxiliary outlier data to better distinguish the OOD inputs (Hendrycks et al., 2019c; Tack et al., 2020; Mohseni et al., 2020; Sehwag et al., 2021; Wei et al., 2022; Ming et al., 2022). Despite the promising results achieved by previous methods (Hendrycks & Gimpel, 2017a; Hendrycks et al., 2019c; Liu et al., 2020; Ming et al., 2022), little attention has been paid to whether the given well-trained model is the most appropriate one for OOD detection. In general, models deployed for various applications have different targets (e.g., multi-class classification) (Goodfellow et al., 2016) instead of OOD detection (Nguyen et al., 2015; Lee et al., 2018a). However, most representative score functions, e.g., MSP (Hendrycks & Gimpel, 2017b), ODIN (Liang et al., 2018), and Energy (Liu et al., 2020), uniformly leverage the given models for OOD detection. Considering this target-oriented discrepancy, a critical question arises: does the given well-trained model have the optimal OOD detection capability? If not, how can we find a more appropriate model for OOD detection? In this work, we start by revealing an important observation (as illustrated in Figure 1), i.e., there exists a historical training stage where the model has a higher OOD detection performance than the final well-trained one.
This is generally true across different OOD/ID datasets (Netzer et al., 2011; Van Horn et al., 2018; Cimpoi et al., 2014), learning rate schedules (Loshchilov & Hutter, 2017), and model structures (Huang et al., 2017; Zagoruyko & Komodakis, 2016). The empirical results of Figure 1 reflect the inconsistency between gaining better OOD detection capability (Nguyen et al., 2015) and pursuing better performance on ID data. We delve into the differences between the intermediate model and the final model by visualizing the misclassified examples. As shown in Figure 2, one possible cause of the covered detection capability is the memorization of atypical samples (at the semantic level) that are hard for the model to learn. Seeking zero error on those samples makes the model more confident on OOD data (see Figures 1(b) and 1(c)).

Figure 1: All the experiments testing for OOD detection performance have been conducted multiple times. By backtracking the training phase, we can observe the existence of a model stage with better OOD detection capability, using the Energy score to distinguish the OOD inputs. When zooming in on the ID and OOD distributions at Epoch 60 and Epoch 100 respectively, it can be seen that their overlap grows as training proceeds through the later stage. Figure 2 contains further exploration.

The above analysis inspires us to propose a new strategy, namely, Unleashing Mask (UM), to excavate the once-covered detection capability of a given well-trained model by alleviating the memorization of those atypical samples (as illustrated in Figure 3) of ID data. In general, we aim to backtrack to a previous stage of the model with a better OOD detection capability. To achieve this target, there are two essential issues: (1) the model that is well-trained on ID data has already memorized some atypical samples; (2) how can those memorized atypical samples be forgotten, given only the final model?
Accordingly, our proposed UM contains two parts utilizing different insights to address the above problems. First, as atypical samples are more sensitive to changes in the model parameters, we initialize a mask with a specific cutting rate to mine these samples through the constructed discrepancy. Then, with the loss reference estimated by the mask, we conduct constrained gradient ascent (i.e., Eq. 3) for model forgetting, which encourages the model to finally stabilize around the optimal stage. To avoid a severe sacrifice of the original task performance on ID data, we further propose UM Adopts Pruning (UMAP), which performs the tuning on the introduced mask with a newly designed objective. For our proposed methods, we conduct extensive experiments to characterize and understand the working mechanism (in Section 4 and Appendix F). We have verified the effectiveness of UM on a series of OOD detection benchmarks considering two different ID datasets, i.e., CIFAR-10 and CIFAR-100. Under the various evaluation metrics, our UM, as well as UMAP, can indeed excavate the better OOD detection capability of given models, and the averaged FPR95 can be reduced by a significant margin. Finally, a range of ablation studies and further discussions related to our proposed strategy are provided. We summarize our main contributions as follows:
• Conceptually, we explore the OOD detection performance from a new perspective, i.e., backtracking the model training phase without regularizing by any auxiliary outliers, different from most previous works that start with the well-trained model on ID data.
• Empirically, we reveal the potential detection capability of the well-trained model. We observe the general existence of an intermediate stage where the model has more appropriate discriminative features that can be utilized for OOD detection.
• Technically, we introduce a new strategy, i.e., Unleashing Mask, to excavate the once-covered OOD detection capability of a given model. By introducing the mask, we estimate the loss constraint for forgetting the atypical samples and empower the detection performance.
• Experimentally, we conduct extensive explorations to verify the general effectiveness of our methods in improving OOD detection performance. Using various ID and OOD benchmarks, we provide comprehensive results across different setups and further discussion.

2. BACKGROUND

In this section, we briefly introduce the preliminaries and related work about OOD detection.

2.1. PRELIMINARIES

We consider multi-class classification as the training task, where $\mathcal{X} \subset \mathbb{R}^d$ denotes the input space and $\mathcal{Y} = \{1, \ldots, C\}$ denotes the label space. Given the model deployed in the real world, a reliable classifier is expected to figure out the OOD input, which can be formulated as a binary classification problem. Given $\mathcal{P}$, the distribution over $\mathcal{X} \times \mathcal{Y}$, we consider $\mathcal{D}_{in}$ as the marginal distribution of $\mathcal{P}$ for $\mathcal{X}$, namely, the distribution of ID data. At test time, the environment can present a distribution $\mathcal{D}_{out}$ over $\mathcal{X}$ of OOD data. In general, the OOD distribution $\mathcal{D}_{out}$ is defined as an irrelevant distribution whose label set has no intersection with $\mathcal{Y}$ (Yang et al., 2021) and which therefore should not be predicted by the model. The decision can be made with the threshold $\lambda$:

$$D_\lambda(x; f) = \begin{cases} \text{ID}, & S(x) \ge \lambda \\ \text{OOD}, & S(x) < \lambda \end{cases} \tag{1}$$

Building upon the model $f \in \mathcal{H}: \mathcal{X} \to \mathbb{R}^C$ trained on ID data with the logit outputs, the goal of the decision is to utilize the scoring function $S: \mathcal{X} \to \mathbb{R}$ to distinguish the inputs of $\mathcal{D}_{in}$ from those of $\mathcal{D}_{out}$ by $S(x)$. Typically, if the score value is at least the threshold $\lambda$, the associated input $x$ is classified as ID, and vice versa. We consider several representative scoring functions designed for OOD detection, e.g., MSP (Hendrycks & Gimpel, 2017b), ODIN (Liang et al., 2018), and Energy (Liu et al., 2020). More detailed definitions and implementation details can be found in Appendix A. To mitigate the issue of over-confident predictions for some OOD data (Hendrycks & Gimpel, 2017b; Liu et al., 2020), recent works (Hendrycks et al., 2019c; Tack et al., 2020) utilize an auxiliary unlabeled dataset to regularize the model behavior. Among them, one representative baseline is outlier exposure (OE) (Hendrycks et al., 2019c).
OE can further improve the detection performance by fine-tuning the model $f(\cdot)$ with a surrogate OOD distribution $\mathcal{D}^s_{out}$, and its corresponding learning objective is defined as follows,

$$L_f = \mathbb{E}_{\mathcal{D}_{in}}\left[\ell_{CE}(f(x), y)\right] + \lambda\, \mathbb{E}_{\mathcal{D}^s_{out}}\left[\ell_{OE}(f(x))\right], \tag{2}$$

where $\lambda$ is the balancing parameter, $\ell_{CE}(\cdot)$ is the Cross-Entropy (CE) loss, and $\ell_{OE}(\cdot)$ is the Kullback-Leibler divergence to the uniform distribution, which can be written, up to an additive constant, as $\ell_{OE}(f(x)) = -\sum_k \log \mathrm{softmax}_k f(x) / C$, where $\mathrm{softmax}_k(\cdot)$ denotes the $k$-th element of a softmax output. The OE loss $\ell_{OE}(\cdot)$ is designed for model regularization, making the model learn from surrogate OOD inputs to return low-confidence predictions. The general formulation of Eq. 2 is also adopted in other related works for designing better tuning objectives that use different auxiliary outlier data. Although previous works show promising results via designing scoring functions or regularizing models based on the model $f$ trained on ID data, few of them investigate the original detection capability of the given well-trained model. In this work, we introduce the layer-wise mask $m$ (Han et al., 2016; Ramanujan et al., 2020) to mine the atypical samples that are memorized by the model. Accordingly, the decision can be written as $D(x; m \odot f)$, and the output of the masked model is $m \odot f(x)$.
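The decision rule and the OE regularizer described in this subsection can be sketched in a few lines. The following pure-Python example is only illustrative: the function names, the example logits, and the threshold value are assumptions made here and are not taken from the authors' implementation. The Energy score is written as the negative free energy so that larger values indicate ID data, which matches the thresholding convention of the decision rule.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def msp_score(logits):
    """Maximum softmax probability; larger means more ID-like."""
    return max(softmax(logits))

def energy_score(logits, temperature=1.0):
    """Negative free energy T * logsumexp(logits / T); larger means more ID-like."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    return temperature * (m + math.log(sum(math.exp(s - m) for s in scaled)))

def decide(score, threshold):
    """Threshold rule: ID if S(x) >= threshold, otherwise OOD."""
    return "ID" if score >= threshold else "OOD"

def oe_loss(logits):
    """Cross-entropy between the uniform distribution and softmax(logits).

    This equals KL(uniform || softmax) plus the constant log C.
    """
    probs = softmax(logits)
    return -sum(math.log(p) for p in probs) / len(probs)

# Hypothetical logits: one confident prediction and one nearly flat prediction.
confident = [8.0, 0.5, -1.0]
flat = [0.3, 0.2, 0.1]
threshold = 5.0  # assumed value; in practice it is chosen on held-out ID data

for name, logits in (("confident", confident), ("flat", flat)):
    score = energy_score(logits)
    print(name, round(score, 3), decide(score, threshold), round(oe_loss(logits), 3))
```

The OE term is smallest when the softmax output is uniform, which is why minimizing it on surrogate outliers pushes the model toward low-confidence predictions on them.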

2.2. RELATED WORK

Out-of-distribution Detection without auxiliary data. Hendrycks & Gimpel (2017a) formally shed light on out-of-distribution detection, proposing to use the softmax prediction probability as a baseline, which has been demonstrated to be unsuitable for OOD detection. Subsequent works keep focusing on designing post-hoc metrics to distinguish ID samples from OOD samples: ODIN (Liang et al., 2018) introduces small perturbations into input images to facilitate the separation of softmax scores, the Mahalanobis distance-based confidence score (Lee et al., 2018b) exploits the feature space by obtaining class-conditional Gaussian distributions, and the energy-based score (Liu et al., 2020) aligns better with the probability density. Besides designing score functions, many other works pay attention to other aspects to enhance OOD detection: LogitNorm (Wei et al., 2022) produces confidence scores by training with a constant vector norm on the logits, and DICE (Sun & Li, 2022) reduces the variance of the output distribution by leveraging the sparsification of the model.

Out-of-distribution Detection with auxiliary data. Another promising direction towards OOD detection involves auxiliary outliers for model regularization. On the one hand, some works generate virtual outliers: Lee et al. (2018a) use generative adversarial networks to generate boundary samples, and VOS (Du et al., 2022a) regularizes the decision boundary by adaptively sampling virtual outliers from the low-likelihood region. On the other hand, other works exploit information from natural outliers: outlier exposure (OE) is introduced by Hendrycks et al. (2019b), given that diverse data are available in enormous quantities. Yu & Aizawa (2019) train an additional "head" and maximize the discrepancy between the decision boundaries of the two heads to detect OOD samples.
Energy-bounded learning (Liu et al., 2020) fine-tunes the neural network to widen the energy gap by adding an energy loss term to the objective. Some other works also highlight the sampling strategy, such that ATOM (Chen et al., 2021) greedily utilizes informative auxiliary data to tighten the decision boundary for OOD detection, and POEM (Ming et al., 2022) adopts Thompson sampling to contour the decision boundary precisely. The performance of training with outliers is usually superior to that without outliers, shown in many recent works (Mohseni et al., 2020; Liu et al., 2020; Fort et al., 2021; Sun et al., 2021; Sehwag et al., 2021; Yang et al., 2021; Chen et al., 2021; Salehi et al., 2021) .

3. PROPOSED METHOD: UNLEASHING MASK

In this section, we introduce our new method, i.e., Unleashing Mask (UM), to reveal the potential OOD detection capability of the well-trained model. First, we present and discuss the important observation that inspires our methods (Section 3.1). Second, we provide the insights behind the two critical parts of our UM (Section 3.2). Lastly, we describe the framework and the learning objective of UM that incorporates the previous component, as well as its variant, i.e., UMAP (Section 3.3).

3.1. ONCE-COVERED OOD DETECTION CAPABILITY

First, we present the phenomenon of the inconsistent trend between a better OOD detection capability and a smaller training error during training. Empirically, as shown in Figure 1, we plot the OOD detection performance during model training, averaged over multiple runs of the experiments. Across three different OOD testing datasets, we can observe the existence of a stage with better detection capability, as measured by the FPR95 metric based on the Energy (Liu et al., 2020) or ODIN (Liang et al., 2018) score. The generality has also been verified using different learning schedules and model structures in our experimental sections. We further show the comparison of the ID/OOD distributions in Figures 1(b) and 1(c). To be specific, the statistics of the two distributions indicate that the gap between the ID and OOD data narrows, and their overlap grows, as training proceeds. After Epoch 60, although the model becomes more confident on ID data, which satisfies a part of the calibration target (Hendrycks et al., 2019a), its predictions on the OOD data also become more confident, which is unexpected. Since the model has not seen any auxiliary outliers, this motivates us to explore how it arrives at that behavior. We take a closer look at the model behaviors in Figure 2, where we check its corresponding training/testing loss and accuracy. We find that the training loss has reached a reasonably small value at Epoch 60, where its detection performance also achieves a satisfactory level. Following Arpit et al. (2017), we attribute the inconsistent trend to memorizing those atypical data at the later stage. In Appendix C, we provide a detailed discussion of how this differs from the concept of conventional overfitting (Goodfellow et al., 2016).

3.2. UNLEASHING THE INTRINSIC DETECTION POWER

In general, the models that are developed for the original classification tasks are always seeking better performance (i.e., higher testing accuracy and lower training loss) in practice. However, the inconsistent trend revealed above indicates that the intrinsic OOD detection capability may be covered during the later stage of training. This gives us a chance to unleash the potential detection power while only considering the ID data in training. To this end, we have two important issues that need to be addressed: (1) the model that is well-trained on ID data may have already memorized some atypical samples which cannot be figured out directly; (2) how can those atypical samples be forgotten, given only the final model? Mining the atypical samples with constructed discrepancy. According to Figures 2(a) and 2(b), both training accuracy and loss provide limited information that can differentiate the typical and atypical data. Inspired by the learning dynamics (Goodfellow et al., 2016; Arpit et al., 2017) of deep neural networks and the pathway conjecture (Barham et al., 2022) for inference, we try to manually construct a parameter discrepancy to mine the atypical samples from a well-trained model. To be specific, we introduce a novel layer-wise mask to achieve the goal. The masks are applied to all layers, which is consistent with the mask generation in the conventional pruning pipeline (Han et al., 2016). In Figure 3(c), we provide empirical evidence to show that we can figure out the atypical samples by enlarging the mask rate. When the masked output is used for loss computation, the atypical samples can be better differentiated. We also provide more discussion about the intuition in Appendix B. Forgetting the atypical samples with gradient ascent. As the training loss reaches zero at the final stage of the given model, we need extra optimization signals to forget those memorized atypical samples.
Considering the previous consistent trend before the potential optimal stage (e.g., before Epoch 60 in Figure 1 (a)), the optimization signal also needs to control the model update not to be too greedy to drop the discriminative features for OOD detection. Starting with the given model, we can employ the gradient ascent (Sorg et al., 2010; Ishida et al., 2020) to forget the targeted samples, while the tuning phase should also prevent further updates if the model can achieve the expected stage.
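To illustrate how a mask constructs an output discrepancy that can be read off from the loss, the sketch below applies a binary mask with a given pruning rate to a single toy linear layer and recomputes the cross-entropy. It is a minimal example under stated assumptions: the mask is drawn at random (following the "randomly knocking out" description in the ablation study), and the weights, data, and all function names are hypothetical. The authors' actual mask construction may differ.

```python
import math
import random

def random_mask(n_rows, n_cols, prune_rate, rng):
    """Binary mask for one layer with a `prune_rate` fraction of entries set to 0."""
    n = n_rows * n_cols
    n_pruned = round(prune_rate * n)
    flat = [0.0] * n_pruned + [1.0] * (n - n_pruned)
    rng.shuffle(flat)
    return [flat[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]

def apply_mask(weights, mask):
    """Element-wise product of a weight matrix and its mask."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]

def logits(weights, x):
    """Output of a bias-free linear layer."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def cross_entropy(logit_vec, label):
    """Cross-entropy of one sample, computed with a stable log-sum-exp."""
    m = max(logit_vec)
    log_z = m + math.log(sum(math.exp(z - m) for z in logit_vec))
    return log_z - logit_vec[label]

def per_sample_losses(weights, mask, data):
    """Loss of every sample under the masked weights."""
    masked = apply_mask(weights, mask)
    return [cross_entropy(logits(masked, x), y) for x, y in data]

def mean_masked_loss(weights, mask, data):
    """Averaged masked loss, usable as a fixed reference value."""
    losses = per_sample_losses(weights, mask, data)
    return sum(losses) / len(losses)

# Hypothetical 3-class layer with 4 input features and three labeled samples.
WEIGHTS = [[2.0, -1.0, 0.5, 0.0],
           [-1.0, 2.0, 0.0, 0.5],
           [0.0, 0.5, -1.0, 2.0]]
DATA = [([1.0, 0.0, 0.2, 0.0], 0),
        ([0.0, 1.0, 0.0, 0.1], 1),
        ([0.1, 0.0, 0.0, 1.0], 2)]

rng = random.Random(0)
ones = random_mask(3, 4, 0.0, rng)    # no pruning: reproduces the original model
mask = random_mask(3, 4, 0.25, rng)   # knocks out 3 of the 12 weights
print(per_sample_losses(WEIGHTS, ones, DATA))
print(per_sample_losses(WEIGHTS, mask, DATA))
print(mean_masked_loss(WEIGHTS, mask, DATA))
```

Comparing the two per-sample lists shows which samples react most strongly to the perturbation, and the averaged masked loss plays the role of the loss reference used during forgetting.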

3.3. METHOD REALIZATION

Based on previous insights, we present our overall framework as well as the learning objective of the proposed Unleashing Mask for OOD detection. Lastly, we discuss its compatibility with either the fundamental scoring functions or the outlier exposure approaches utilizing auxiliary outlier data. Framework. As illustrated in Figure 3(a), our framework consists of two critical components for uncovering the intrinsic OOD detection capability: (1) the initialized mask with a specific pruning rate for constructing the output discrepancy with the original model; (2) the fine-tuning procedure for alleviating the memorization of atypical samples. The overall workflow starts by estimating the loss value at which those atypical samples are misclassified and then tunes the model so that it forgets them. Objective for forgetting. Based on our framework, we introduce the forgetting objective as,

$$\min L_{UM} = \min_{m_\delta \in [0,1]^n} \left| \ell_{CE}(f) - \hat{\ell}_{CE}(m_\delta \odot f^*) \right| + \hat{\ell}_{CE}(m_\delta \odot f^*), \tag{3}$$

where $m_\delta$ is our proposed layer-wise mask with the pruning rate $\delta$, $\ell_{CE}$ is the CE loss, $\hat{\ell}_{CE}$ is the averaged CE loss over the ID training data, $|\cdot|$ denotes the absolute value, and $m_\delta \odot f^*$ denotes the masked output of the fixed pretrained model that is used to estimate the loss constraint for the learning objective of forgetting; this estimate remains constant during the whole fine-tuning process. Concretely, the well-trained model will start to optimize itself again if it has memorized the atypical samples and achieved an almost zero loss value. We provide a positive gradient signal when the current loss value is lower than the estimated one, and vice versa. The model is expected to finally stabilize around the stage at which those atypical samples are forgotten. Unleashing Mask Adopts Pruning (UMAP).
Considering the potential negative effect on the original task performance when conducting tuning for forgetting, we further propose a variant, UM Adopts Pruning (UMAP), which conducts tuning based on the masked output (e.g., replacing $\ell_{CE}(f)$ with $\ell_{CE}(\hat{m}_p \odot f^*)$ in Eq. 3) using a different mask $\hat{m}_p$ with its pruning rate $p$ as follows,

$$\min L_{UMAP} = \min_{\hat{m}_p \in [0,1]^n,\; m_\delta \in [0,1]^n} \left| \ell_{CE}(\hat{m}_p \odot f^*) - \hat{\ell}_{CE}(m_\delta \odot f^*) \right| + \hat{\ell}_{CE}(m_\delta \odot f^*). \tag{4}$$

Different from the objective of UM (i.e., Eq. 3), which minimizes the loss value over the model parameters, the objective of UMAP minimizes the loss over the mask to achieve the target of forgetting those atypical samples. UMAP provides an extra mask to restore the detection capacity but does not affect the model parameters used for inference on the original task, indicating that UMAP is a more practical choice in real-world applications (as empirically verified in our experiments, e.g., Table 1). We summarize the algorithms of UM (Algorithm 1) and UMAP (Algorithm 2) in Appendix D. Compatibility with other methods. As we explore the original OOD detection capability of the well-trained model, our approach is orthogonal to and compatible with those promising methods that equip the given model with better detection ability. To be specific, through our proposed methods, we reveal the once-covered OOD detection capability by tuning the original model towards its intermediate training stage. The discriminative features learned at that stage can be utilized by different scoring functions (Huang et al., 2021; Sun & Li, 2022; Wei et al., 2022), like ODIN (Liang et al., 2018) adopted in Figure 2(c). For those methods (Hendrycks et al., 2019a; Liu et al., 2020; Ming et al., 2022) utilizing auxiliary outliers to regularize the model, our fine-tuned model obtained by UM and UMAP can also serve as their starting point or adjustment.
As our strategy does not require any auxiliary outlier data to be involved in training, adjusting the model using ID data during its developing phase is practical.
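The behavior of the forgetting objective can be checked on a toy problem. Because the reference term is constant, the derivative of the objective with respect to the current CE loss is the sign of (current loss minus reference): when the current loss is below the reference, a descent step on the objective is an ascent step on the CE loss, and once the loss exceeds the reference it is ordinary descent. The sketch below is a minimal illustration under assumed settings: a hypothetical one-dimensional logistic classifier, an assumed reference loss of 0.2, and an assumed learning rate. It is not the authors' training code.

```python
import math

# y * x for four separable toy samples (hypothetical data).
MARGINS = [1.0, 2.0, 1.0, 2.0]

def ce_loss(w):
    """Mean logistic loss of a 1-D linear classifier with weight w."""
    return sum(math.log1p(math.exp(-w * m)) for m in MARGINS) / len(MARGINS)

def ce_grad(w):
    """Derivative of ce_loss with respect to w."""
    return sum(-m / (1.0 + math.exp(w * m)) for m in MARGINS) / len(MARGINS)

def um_objective(current_loss, reference_loss):
    """|l_CE(f) - l_hat| + l_hat, with l_hat a fixed reference value."""
    return abs(current_loss - reference_loss) + reference_loss

def um_step(w, reference_loss, lr):
    """One gradient-descent step on the UM objective with respect to w."""
    diff = ce_loss(w) - reference_loss
    sign = (diff > 0) - (diff < 0)  # -1 below the reference: ascent on the CE loss
    return w - lr * sign * ce_grad(w)

def finetune(w, reference_loss, lr=0.1, steps=5000):
    """Repeat um_step; the loss is driven toward the reference and stays near it."""
    for _ in range(steps):
        w = um_step(w, reference_loss, lr)
    return w

w_start = 4.0           # stands in for a model that has reached near-zero training loss
reference = 0.2         # assumed loss estimated from the masked model
w_final = finetune(w_start, reference)
print(ce_loss(w_start), ce_loss(w_final), um_objective(ce_loss(w_final), reference))
```

In this toy setting the weight shrinks until the loss reaches the reference and then oscillates within one step of it, which mirrors the intended effect of stopping further updates once the model is back at the expected stage.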

4. EXPERIMENTS

In this section, we present the performance comparison of the proposed method in the OOD detection scenario. Specifically, we verify the effectiveness of our UM and UMAP with the two mainstream families of OOD detection approaches: (i) fundamental scoring function methods; (ii) outlier exposure methods involving auxiliary samples. To better understand and characterize our proposed method, we further conduct extensive ablation studies and provide the corresponding discussion on each sub-aspect considered in our work. More details and additional results can be found in Appendix F.

4.1. EXPERIMENTAL SETUPS

Datasets. Following the common benchmarks used in previous work (Liu et al., 2020; Ming et al., 2022), we adopt CIFAR-10 and CIFAR-100 (Krizhevsky, 2009) as our ID datasets. We use a series of different image datasets as the OOD datasets, namely Textures (Cimpoi et al., 2014), Places365 (Zhou et al., 2017), SUN (Xiao et al., 2010), LSUN (Yu et al., 2015), and iNaturalist (Van Horn et al., 2018). We also use the other ID dataset as an OOD dataset when training on a specific ID dataset, given that the two share no classes (Yang et al., 2021); e.g., we treat CIFAR-100 as the OOD dataset when training on CIFAR-10 in our experiments for comparison. Training details. We conduct all major experiments on DenseNet-101 (Huang et al., 2017) with training epochs fixed to 100. We also include experimental results on other types of models in Appendix F. The models are trained using stochastic gradient descent (Kiefer & Wolfowitz, 1952) with Nesterov momentum (Duchi et al., 2011). We adopt Cosine Annealing (Loshchilov & Hutter, 2017) to schedule the learning rate, which begins at 0.1. We set the momentum and weight decay to 0.9 and $10^{-4}$ respectively throughout all experiments. The size of the mini-batch is 64 for both ID samples (when training and testing) and OOD samples (when testing). More details and further discussion about choosing the mask ratio in experiments can be found at the end of Appendix F. Evaluation metrics.
We employ the following three common metrics to evaluate the performance of OOD detection: (i) Area Under the Receiver Operating Characteristic curve (AUROC) (Davis & Goadrich, 2006), which can be interpreted as the probability for a positive sample to have a higher discriminating score than a negative sample (Fawcett, 2006); (ii) Area Under the Precision-Recall curve (AUPR) (Manning & Schütze, 1999), which is an ideal metric to adjust for the extreme difference between positive and negative base rates; (iii) False Positive Rate (FPR) at 95% True Positive Rate (TPR) (Liang et al., 2018), which indicates the probability for a negative sample to be misclassified as positive when the true positive rate is at 95%. We also include in-distribution testing accuracy (ID-ACC) to reflect how well the performance of the original classification task on ID data is preserved. OOD detection baselines. We compare the proposed method with several competitive baselines in the two directions. Specifically, we adopt Maximum Softmax Probability (MSP) (Hendrycks & Gimpel, 2017a), ODIN (Liang et al., 2018), Mahalanobis score (Lee et al., 2018b), and Energy score (Liu et al., 2020) as scoring function baselines; we adopt OE (Hendrycks et al., 2019b), Energy-bounded learning (Liu et al., 2020), and POEM (Ming et al., 2022) as baselines with outliers. For all scoring function methods, we assume that well-trained models are accessible. For all methods involving outliers, we constrain all major experiments to a fine-tuning scenario, which is more practical in real cases. Compared with training a dual-task model from the very beginning, equipping deployed models with OOD detection ability is a much more common circumstance, considering the millions of existing and running deep learning systems. We leave more details in Appendix A.
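Two of the metrics described above can be computed directly from the score lists of ID and OOD test samples. The sketch below is a small pure-Python reference, assuming ID samples are the positive class and that a larger score means more ID-like; the function names and the example scores are hypothetical, and AUPR is omitted for brevity. It is intended to make the definitions concrete rather than to replace an optimized library routine.

```python
import math

def auroc(id_scores, ood_scores):
    """Probability that a random ID score exceeds a random OOD score (ties count 1/2)."""
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))

def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted as ID at the threshold keeping >= tpr of ID samples."""
    ranked = sorted(id_scores, reverse=True)
    k = math.ceil(tpr * len(ranked))  # number of ID samples that must be accepted
    threshold = ranked[k - 1]
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)

# Hypothetical scores: one OOD sample (0.65) scores above the weakest ID sample (0.6).
id_scores = [0.9, 0.8, 0.7, 0.6]
ood_scores = [0.5, 0.65, 0.1, 0.2]
print(auroc(id_scores, ood_scores), fpr_at_tpr(id_scores, ood_scores))
```

A lower FPR95 and a higher AUROC both indicate better separation between the two score distributions.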

4.2. PERFORMANCE COMPARISON

In this part, we present the performance comparison with some representative baseline methods to demonstrate the effectiveness of our UM and UMAP. Our proposed UM is designed for excavating the potential OOD detection capability of the given model. Here we consider several scoring functions to compare the detection performance, and also some outlier exposure methods to further regularize the given model and boost the OOD detection ability. In each category, we choose the method with the best detection performance, apply UM/UMAP to it, and check the detection results together with the ID-ACC. In Table 1, we summarize the results on different OOD test sets using different methods. Note that the evaluation results here are obtained by averaging over several OOD test datasets and multiple independent trials. For the scoring-based methods, our UM can further improve the overall detection performance by alleviating the memorization of atypical ID data, while the ID-ACC remains comparable with the baseline. For the more complex CIFAR-100 dataset, our UMAP can be adopted as a practical way to empower the detection performance and simultaneously avoid affecting the original performance on ID data. As for the methods of the second category (i.e., involving auxiliary outliers $D_{aux}$ sampled from ImageNet), since we consider a practical workflow, i.e., fine-tuning, on the given model, OE achieves the best performance on the task. Due to their special optimization characteristics, Energy (w. $D_{aux}$) and POEM focus more on the energy loss for differentiating OOD data, while they do not preserve the ID-ACC well. Without sacrificing much performance on ID data, OE with our UM can still achieve better detection performance. In Table 3, the fine-grained detection performance on each OOD testing set demonstrates the general effectiveness of UM.
We have comprehensively verified the significant improvement (up to 18% reduction in averaged FPR95) in OOD detection of our methods across different setups in Appendix F; the complete results can be found in Tables 8 to 20. More fine-grained results of the experiments on CIFAR-100 are provided in Tables 16 and 19. In addition, we also provide similar results using another model structure in Table 18.

4.3. ABLATION AND FURTHER DISCUSSION

In this part, we conduct various explorations to provide a thorough understanding of our presented Unleashing Mask from different perspectives. First, we examine the general existence of the once-covered OOD detection capability. Second, we present the exploration about the mask, which helps us to characterize its effect on figuring out the atypical samples. Third, we provide further exploration on excavating the detection power via pruning. General existence of once-covered OOD detection capability. In Figure 4(a) and Figure 4(b), we explore 4 learning rate schedules to demonstrate the general existence of once-covered OOD detection capability. To be specific, the OOD performance (indicated by FPR95) is evaluated every 5 epochs during training, in which the model takes CIFAR-10 as ID data and SVHN as OOD data. As shown by the curves, a middle stage exists with better OOD performance than that of the final stage across different schedules. We thus empirically verify that this phenomenon is not specific to a particular schedule. More explorations on the other ID dataset and model structure reveal similar results in Figures 5, 6, and 7. The detailed information and discussion can be found in Appendix F. Effects of the mask on mining atypical samples. Following our previous illustration of Figure 3(c), we scrutinize the change of training loss on a random batch of the training set in Figure 4(c). The results further explain why the loss value estimated by the UM can be used to force the model to forget atypical samples. It can be seen from Figure 4(c) that the loss is proportionally increased by randomly knocking out 2.5% of the weights. In this case, the estimated loss is more influenced by the samples that have a higher initial loss, which are what we term atypical samples. By controlling the training loss to the estimated value, the model is encouraged to backtrack to a middle training stage where samples with high loss value have little influence on the forgetting process of the gradient ascent. Exploration on revealing detection capability with model pruning.
Although a large constraint on the training loss can help reveal the model's OOD performance, the ID-ACC is undermined under such circumstances. Generally speaking, the proposed UM forces the model to forget the atypical samples and may result in lower test performance. To mitigate this issue, we further adopt pruning as a countermeasure to learn a mask instead of tuning the model parameters directly. In Figure 4(e), we experiment with various prune rates p and demonstrate that we can achieve the same or better OOD performance by pruning as well. Specifically, our UMAP can achieve a lower FPR95 than pure pruning with the original objective. The prune rate can be selected from a wide range (e.g., p ∈ [0.3, 0.9]) while still yielding fast convergence and effectiveness. Since pruning does not change the well-trained model parameters, it can preserve the performance of the original task. We also provide additional empirical results and corresponding discussion about the effectiveness of our UMAP in Appendix F.

5. CONCLUSION

In this work, we explore the intrinsic OOD detection capability of a well-trained model. Without involving any auxiliary outliers in training, we reveal the inconsistent trend between minimizing original training loss and gaining OOD detection capability. We further attribute it to the memorization behavior of atypical samples. To excavate the once-covered capability, we propose a new method, namely, Unleashing Mask (UM). Through UM, we construct model-level discrepancy that figures out the memorized atypical samples and utilizes the constrained gradient ascent to encourage forgetting. It better utilizes the given model for OOD detection via backtracking or sub-structure pruning. We hope our work could provide new insights for revisiting the model development in OOD detection.

APPENDIX REPRODUCIBILITY STATEMENT

We will provide an anonymous repository containing our source code in the discussion phase for reviewing purposes to ensure the reproducibility of our experimental results. Below we summarize some critical aspects to facilitate reproduction:
• Datasets. The datasets we used are all publicly accessible and are introduced in Section 4.1. For methods involving auxiliary outliers, we strictly follow previous works (Sun et al., 2021; Du et al., 2022b) to avoid overlap between the auxiliary dataset (ImageNet-1k) (Deng et al., 2009) and any other OOD datasets.
• Assumption. We set our experiments in a post-hoc scenario where a well-trained model is available, and some of the training samples are also available for subsequent fine-tuning.
• Open source. The code will be available in an anonymous repository for reviewing purposes. We provide a backbone of our experiments as well as several auxiliary components, such as score estimation.
• Environment. All experiments are conducted over multiple runs on NVIDIA Tesla V100-SXM2-32GB GPUs with Python 3.6 and PyTorch 1.8.

A DETAILED INFORMATION ABOUT THE USED BASELINES AND METRICS

In this section, we provide the details about the baselines for the scoring functions and fine-tuning with auxiliary outliers, and the corresponding hyper-parameters that are considered in our work.

Maximum Softmax Probability (MSP). Hendrycks & Gimpel (2017a) propose to use the maximum softmax probability to discriminate ID and OOD samples. The score is defined as follows,

$$S_{\text{MSP}}(x; f) = \max_{c} P(y = c \mid x; f) = \max \operatorname{softmax}(f(x))$$

where f represents the given well-trained model and c is one of the classes Y = {1, . . . , C}. A larger softmax score indicates a larger probability for a sample to be ID data, reflecting the model's confidence on the sample.

ODIN. Liang et al. (2018) designed the ODIN score, leveraging temperature scaling and tiny perturbations to widen the gap between the distributions of ID and OOD samples. The ODIN score is defined as follows,

$$S_{\text{ODIN}}(x; f) = \max_{c} P(y = c \mid \tilde{x}; f) = \max \operatorname{softmax}\left(\frac{f(\tilde{x})}{T}\right)$$

where $\tilde{x}$ represents the perturbed samples (controlled by ϵ) and T represents the temperature. For a fair comparison, we adopt the suggested hyperparameters (Liang et al., 2018): ϵ = 1.4 × 10^{-3}, T = 1.0 × 10^{4}.

Mahalanobis. Lee et al. (2018b) introduce a Mahalanobis distance-based confidence score, exploiting the feature space of the neural networks by inspecting the class-conditional Gaussian distributions. The Mahalanobis distance score is defined as follows,

$$S_{\text{Mahalanobis}}(x; f) = \max_{c} \, -(f(x) - \mu_c)^{\top} \Sigma^{-1} (f(x) - \mu_c)$$

where $\mu_c$ represents the estimated mean of the multivariate Gaussian distribution of class c, and $\Sigma$ represents the estimated tied covariance of the C class-conditional Gaussian distributions.

Energy. Liu et al. (2020) propose to use the Energy of the predicted logits to distinguish the ID and OOD samples. The Energy score is defined as follows,

$$S_{\text{Energy}}(x; f) = -T \log \sum_{c=1}^{C} e^{f(x)_c / T} \tag{8}$$

where T represents the temperature parameter. As theoretically illustrated in Liu et al.
(2020), a lower Energy score indicates a higher probability for a sample to be ID. Following Liu et al. (2020), we fix T to 1.0 throughout all experiments.

Outlier Exposure (OE). Hendrycks et al. (2019b) initiate a promising approach towards OOD detection by involving outliers to force apart the distributions of ID and OOD samples. In the experiments, we use the cross-entropy from f(x_out) to the uniform distribution as the L_OE (Lee et al., 2018a),

$$\mathcal{L}_f = \mathbb{E}_{D_{\text{in}}}\left[\ell_{\text{CE}}(f(x), y)\right] + \lambda\, \mathbb{E}_{D^{s}_{\text{out}}}\left[\log \sum_{c=1}^{C} e^{f(x)_c} - \mathbb{E}_{D^{s}_{\text{out}}}(f(x))\right]$$

Energy (w. D_aux). In addition to using the Energy as a post-hoc score to distinguish ID and OOD samples, Liu et al. (2020) propose an Energy-bounded objective to further separate the two distributions. The OE objective is as follows,

$$\mathcal{L}_{\text{OE}} = \mathbb{E}_{D^{s}_{\text{in}}}\left(\max(0, S_{\text{Energy}}(x, f) - m_{\text{in}})\right)^{2} + \mathbb{E}_{D^{s}_{\text{out}}}\left(\max(0, m_{\text{out}} - S_{\text{Energy}}(x, f))\right)^{2}$$

We keep the thresholds the same as in Liu et al. (2020): m_in = -25.0, m_out = -7.0.

POEM. Ming et al. (2022) explore the Thompson sampling strategy (Thompson, 1933) to make the most use of outliers to learn a tight decision boundary. Although POEM is by nature orthogonal to other OE methods, we use the Energy (w. D_aux) as the backbone, which is the same as Eq. (10) in Liu et al. (2020). For the details of Thompson sampling, we refer to Ming et al. (2022).

Detailed formulations of FPR and TPR. Suppose we have a binary classification task (to predict an image to be an ID or OOD sample in this paper). There are two possible outputs: a positive result (the model predicts an image to be an ID sample) and a negative result (the model predicts an image to be an OOD sample). Since we have two possible labels and two possible outputs, we can form a confusion matrix with all possible outputs as follows.
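As a minimal sketch of the conventions above (not the evaluation code used in the experiments), the MSP and Energy scores can be computed from raw logits, and the four confusion-matrix outcomes can be tallied at a fixed threshold. The helper names, the toy logits, and the threshold of -5.0 are illustrative assumptions; a sample is treated as predicted ID (positive) when its Energy score falls at or below the threshold.

```python
import math


def msp_score(logits):
    """Maximum softmax probability; a larger value suggests an ID sample."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    return max(exps) / sum(exps)


def energy_score(logits, T=1.0):
    """S_Energy = -T * log(sum_c exp(f_c / T)); a lower value suggests an ID sample."""
    top = max(z / T for z in logits)
    return -T * (top + math.log(sum(math.exp(z / T - top) for z in logits)))


def confusion_counts(id_scores, ood_scores, threshold):
    """Tally (TP, FN, FP, TN) with 'positive' meaning 'predicted ID'.

    Assumption: scores are Energy scores, so score <= threshold means predicted ID.
    """
    tp = sum(s <= threshold for s in id_scores)
    fn = len(id_scores) - tp
    fp = sum(s <= threshold for s in ood_scores)
    tn = len(ood_scores) - fp
    return tp, fn, fp, tn


# Hypothetical logits for three ID and three OOD samples (3 classes).
id_logits = [[9.0, 0.5, 0.1], [7.5, 1.0, 0.2], [8.0, 0.3, 0.4]]
ood_logits = [[2.0, 1.8, 1.9], [8.5, 0.2, 0.1], [1.0, 1.2, 0.9]]

id_energy = [energy_score(z) for z in id_logits]
ood_energy = [energy_score(z) for z in ood_logits]
counts = confusion_counts(id_energy, ood_energy, threshold=-5.0)
print(counts)  # (TP, FN, FP, TN)
```

In this toy setting, the confidently predicted OOD sample is the only false positive, which is the failure mode that the FPR is meant to capture.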
Therefore, the false positive rate (FPR) is calculated as:

$$\text{FPR} = \frac{FP}{FP + TN}$$

The true positive rate (TPR) is calculated as:

$$\text{TPR} = \frac{TP}{TP + FN}$$

B ADDITIONAL EXPLANATION TOWARDS MINING THE ATYPICAL SAMPLES

First, for identifying the atypical samples using a layer-wise mask with the well-trained model, the core intuition is to construct a parameter-level discrepancy that mines the atypical samples. It is inspired by and based on the evidence drawn from previous literature about the learning behaviors (Arpit et al., 2017; Goodfellow et al., 2016) of deep neural networks (DNNs) and sparse representation (Frankle & Carbin, 2019; Goodfellow et al., 2013; Barham et al., 2022). To be specific, the atypical samples tend to be learned by DNNs later than the typical samples (Arpit et al., 2017), and are relatively more sensitive to changes of the model parameters, as the model does not generalize well on them. With the layer-wise mask, the constructed discrepancy can make the model misclassify the atypical samples and provides an estimate of the loss constraint for the forgetting objective, as visualized in Figure 3(c).

Second, introducing the layer-wise mask has several advantages for achieving the staged target of mining atypical samples in our proposed method, although we admit that the layer-wise mask is neither irreplaceable nor necessarily optimal. On the one hand, considering that the model has been trained to approach zero error on the training data, utilizing the layer-wise mask is an integrated strategy to 1) figure out the atypical samples and 2) obtain the loss value computed by the masked output that misclassifies them. The loss constraint is later used in the forgetting objective to fine-tune the model. On the other hand, the layer-wise mask is also compatible with the proposed UMAP to generate a flexible mask for restoring the detection capability of the original model.
Third, we also adopt the unit/weight mask and visualize the misclassified samples in Figure 12. We think these masks can also be used to mine the atypical samples and can be extended or improved to be a more flexible choice. Further investigating the specific effect of different methods that construct the parameter-level discrepancy would be an interesting sub-topic for future work. As for the value of the CE loss, although the atypical samples tend to have a high CE loss value, they are already memorized and correctly classified, as indicated by the zero training error. Using only the high CE loss therefore cannot provide an estimate of the loss at which the model no longer correctly classifies those samples.
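The mining step described in this section can be illustrated with a minimal sketch: a training sample is flagged as atypical when the original well-trained model classifies it correctly but the masked model does not. The function name and the toy predictions below are hypothetical and only serve to show the selection rule.

```python
def mine_atypical(labels, original_preds, masked_preds):
    """Return indices that the original model gets right but the masked model gets wrong."""
    return [
        i
        for i, (y, p_orig, p_mask) in enumerate(zip(labels, original_preds, masked_preds))
        if p_orig == y and p_mask != y
    ]


# Hypothetical predictions on five training samples.
labels = [0, 1, 2, 1, 0]
original_preds = [0, 1, 2, 1, 0]  # zero training error for the well-trained model
masked_preds = [0, 1, 0, 1, 2]    # the masked model flips two predictions

atypical_idx = mine_atypical(labels, original_preds, masked_preds)
print(atypical_idx)  # [2, 4]
```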

C CONCEPTUAL AND EMPIRICAL COMPARISON WITH OVERFITTING

First, we refer to the concept of conventional overfitting (Goodfellow et al., 2016; Belkin et al., 2019), i.e., the model "overfits" the training data but fails to generalize and perform well on the test data that is unseen during training. The common empirical reflection of overfitting is that the training error decreases while the test error increases at the same time, which enlarges the generalization gap of the model. This is empirically not the case in our observation, as shown in Figure 2(a) and 2(b). To be specific, for the original classification task, there is no conventional overfitting observed, as the test performance still improves at the later training stage, which is a general pursuit of the model development phase on the original tasks. Second, when we consider the OOD detection performance of the well-trained model, our unique observation is about the inconsistency between gaining better OOD detection capability and pursuing better performance on the original classification task for the in-distribution (ID) data. It is worth noting that here the training task is not the binary classification of OOD detection, but the classification task on ID data. This falls outside the rigorous concept of conventional overfitting and, to the best of our knowledge, has received limited attention and discussion in the previous literature about OOD detection. Considering the practical scenario in which a target-level discrepancy exists, our observation may encourage us to revisit the detection capability of the well-trained model. Third, our empirical observation suggests that strategies designed for preventing conventional overfitting may need to change their target to OOD detection.
In our experiments, for all the baseline models including the one used in Figure 1, we have adopted such strategies (Srivastava et al., 2014; Hastie et al., 2009) (e.g., drop-out, weight decay) to reduce overfitting. They are found to be insufficient to restore the OOD detection performance. As for another shared issue, on the CIFAR-100 dataset, our UM restores the OOD detection capability of the well-trained model at a significant sacrifice of "ID-ACC". Strategies for reducing overfitting in the model development phase may likewise be unacceptable to users if they lead to such a low performance on the original task. In contrast, our proposed UMAP can be a more practical and flexible way to restore detection performance. We conduct extra comparisons between our UM and UMAP and those methods for reducing overfitting. The results are summarized in the following Tables 4, 5, 6 and 7. According to our extra experiments, most conventional methods proposed to prevent conventional overfitting show limited benefits for gaining better OOD detection performance. Based on our observation, the effective criterion, i.e., early stopping, also needs to change its validation target to the OOD data. However, most of these methods suffer from a higher sacrifice of the performance on the original task and may not be compatible with or practical in the current general setting, i.e., starting from a well-trained model. Given the aforementioned conceptual discrepancy, one conclusive message is that "memorization of the atypical samples" is not "memorization in overfitting". Those atypical samples are empirically beneficial for improving the performance on the original classification task, as shown in Figure 2. However, this part of the knowledge is not necessary and is even harmful to the OOD detection task, as the detection performance of the model drops significantly.
Based on the training and test curves in our observation, the memorization in overfitting is expected to happen later than the final stage, at a point where the test performance would drop. Since we have already used some strategies to prevent overfitting, this kind of memorization is not observed. Intuitively, the "atypical samples" identified in our work are relative to the OOD detection task. The memorization of "atypical samples" indicates that the model may not be able to draw the general information of the ID distribution through further learning on those atypical samples with the original classification task. Since we mainly provide the empirical observation and understanding of the proposed algorithm in this work, further analysis from other views or from a theoretical perspective would be an interesting and major part of future work.
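To make the point about changing the early-stopping target more concrete, the following is a minimal sketch, under stated assumptions, of selecting a checkpoint by validation FPR95 rather than by ID validation error. The function names, the convention that a higher score means "more ID-like" (e.g., the negative Energy score), and all toy scores are hypothetical and do not correspond to any reported result.

```python
import math


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """FPR on OOD data at the threshold that accepts a `tpr` fraction of the ID data.

    Assumption: higher score means 'more ID-like'.
    """
    ranked = sorted(id_scores, reverse=True)
    # Number of ID samples that must be accepted (small epsilon guards float error).
    k = max(1, math.ceil(tpr * len(ranked) - 1e-9))
    threshold = ranked[k - 1]
    false_pos = sum(s >= threshold for s in ood_scores)
    return false_pos / len(ood_scores)


def select_checkpoint(val_scores_by_epoch):
    """Pick the epoch with the lowest validation FPR95."""
    fpr = {epoch: fpr_at_tpr(i, o) for epoch, (i, o) in val_scores_by_epoch.items()}
    return min(fpr, key=fpr.get), fpr


# Hypothetical validation scores for two checkpoints.
id_scores = [float(v) for v in range(1, 21)]
val = {
    60: (id_scores, [0.5, 1.5, 1.0, 10.0]),
    100: (id_scores, [0.5, 3.0, 5.0, 10.0]),
}
best_epoch, fpr = select_checkpoint(val)
print(best_epoch, fpr)
```

Note that this sketch requires checkpoints from intermediate training stages, whereas UM and UMAP start from the final well-trained model only.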

D DETAILED REALIZATION OF THE PROPOSED ALGORITHMS

In this section, we provide the detailed realization of our proposed Unleashing Mask (UM) (i.e., in Algorithm 1) and Unleashing Mask Adopt Pruning (UMAP) (i.e., in Algorithm 2). To estimate the loss constraint ζ (i.e., ℓ_CE(m_δ ⊙ f*) in Eq. 3 with the fixed given model f*) for forgetting, we need to randomly knock out parts of the weights according to the given mask ratio δ. To be specific, we sample a score from a Gaussian distribution for every weight. Then we initialize an all-ones matrix for every layer of the model, matching the size of the layer. We formulate the mask m_δ according to the sampled scores: for each layer, we find the threshold (termed the quantile) determined by the given mask ratio over that layer's scores, and then set to zero every entry of the all-ones matrix whose corresponding score is larger than the layer's threshold. We multiply every layer's weights element-wise with the resulting binary matrix, as if we deleted some parts of the weights. We input a batch of training samples to the masked model and treat the mean value of the outputs' cross-entropy loss as the loss constraint. After all of these steps, we begin to fine-tune the model's weights with the loss constraint applied to the cross-entropy loss. In our algorithms, the number of fine-tuning epochs k counts the epochs of fine-tuning after the well-trained model is obtained. For UMAP, the only difference from UM is that, instead of fine-tuning the weights, we generate a popup score for every weight and force the gradients to pass through the scores. In every iteration, we need to formulate a binary mask according to the given prune rate p, in the same way as when estimating the loss constraint. For more details, we refer to Ramanujan et al. (2020). In Table 8, we summarize the overall comparison results of UM and UMAP to show their effectiveness.
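The steps above can be sketched on a toy single-layer model in plain Python. This is an illustrative sketch rather than the released implementation: the function names, the toy linear model, and the nearest-rank quantile are assumptions. In particular, the sketch follows the prose in treating δ as the fraction of weights that stays active, with entries scoring above the layer's δ-quantile set to zero; the direction of this comparison is an interpretation.

```python
import math
import random


def quantile(values, q):
    """Nearest-rank quantile of a list (a simple stand-in for a library routine)."""
    ranked = sorted(values)
    idx = min(int(round(q * len(ranked))), len(ranked) - 1)
    return ranked[idx]


def layerwise_mask(layers, delta, rng):
    """Draw one Gaussian score per weight and build a binary mask per layer.

    Assumption: entries whose score exceeds the layer's delta-quantile are zeroed.
    """
    masks = {}
    for name, weights in layers.items():
        scores = [rng.gauss(0.0, 1.0) for _ in weights]
        threshold = quantile(scores, delta)
        masks[name] = [0.0 if s > threshold else 1.0 for s in scores]
    return masks


def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for one sample."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_norm - logits[label]


def linear_logits(flat_weights, x, num_classes):
    """Toy model: weights stored row-major with shape (num_classes, len(x))."""
    d = len(x)
    return [sum(flat_weights[c * d + j] * x[j] for j in range(d)) for c in range(num_classes)]


def estimate_loss_constraint(layers, masks, batch, num_classes):
    """zeta: mean cross-entropy of the masked model on a batch of (x, y) pairs."""
    masked_fc = [w * m for w, m in zip(layers["fc"], masks["fc"])]
    losses = [cross_entropy(linear_logits(masked_fc, x, num_classes), y) for x, y in batch]
    return sum(losses) / len(losses)


def forgetting_objective(ce_loss, zeta):
    """|L_CE - zeta| + zeta: gradient descent above zeta, gradient ascent below it."""
    return abs(ce_loss - zeta) + zeta


rng = random.Random(0)
layers = {"fc": [rng.uniform(-1.0, 1.0) for _ in range(3 * 4)]}  # 3 classes, 4 features
batch = [([1.0, 0.0, 0.5, -0.5], 0), ([0.2, 1.0, -1.0, 0.3], 2)]

masks = layerwise_mask(layers, delta=0.75, rng=rng)
zeta = estimate_loss_constraint(layers, masks, batch, num_classes=3)
print(sum(masks["fc"]), zeta, forgetting_objective(0.05, zeta))
```

The absolute value in the objective is what redirects the gradient: once the training loss falls below ζ, minimizing the objective increases the loss again, which encourages forgetting the samples that were fitted last.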

E ADDITIONAL SETUPS OF THE EXPERIMENTS

In this section, we describe more details about the experimental setups for our exploration.

Model setups. For DenseNet-101, we fix the growth rate and reduce rate to 12 and 0.5, respectively, with the bottleneck block included in the backbone. We also explore the proposed UM on WideResNet (Zagoruyko & Komodakis, 2016) with depth 40 and widen factor 4, which is termed WRN-40-4. The batch size for both ID and OOD testing samples is 64, and the batch size of auxiliary samples is 2000. The λ in Eq. (9) is 0.5 to keep the OE loss comparable to the CE loss. As for the strategy of sampling outliers, we randomly retrieve 50000 samples from ImageNet-1k (Deng et al., 2009) for OE and Energy (w. D_aux) and 50000 samples using Thompson sampling (Ming et al., 2022) for POEM.

Algorithm 1 (update step)
$\theta^{(t+1)} = \theta^{(t)} - \eta \, \partial(|\mathcal{L}_{CE}(x, \theta) - \zeta| + \zeta) / \partial\theta$

Algorithm 2 (continued)
14: for $l \in \theta_{\text{layers}}$ do
15: $m^{l}_{p} = s^{l(t)} > \text{quantile}(s^{l(t)}, p)$
16: end for
17: $s^{(t+1)} = s^{(t)} - \eta \, \partial(|\mathcal{L}_{CE}(x, m_p \odot \theta) - \zeta| + \zeta) / \partial\theta$
18: end for

Learning rate schedules. We use 4 different learning rate schedules to demonstrate the existence of the once-covered OOD detection capability. For cosine annealing, we follow the common setups in Loshchilov & Hutter (2017); for the linear schedule, the learning rate remains the same in the first third of the epochs, decreases linearly to a tenth of the initial rate in the middle third, and decreases linearly to 1% of the initial rate in the last third; for the multiple decay schedule, the learning rate decreases by 10% of the initial rate (0.01) every 10% of the epochs (10 epochs); for the multiple step schedule, the learning rate decreases to 10% of the current rate every 30 epochs. All learning rate schedules used in our experiments are illustrated in Figure 4 (a).
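The four learning rate schedules described in this section can be written as simple functions of the epoch index. The sketch below is illustrative only: the function names are hypothetical, epochs are assumed to be 0-indexed, the initial rate of 0.1 is inferred from the statement that 10% of it equals 0.01, and the cosine variant assumes a single annealing cycle down to zero without restarts.

```python
import math


def cosine_schedule(epoch, total, base_lr, min_lr=0.0):
    """Single-cycle cosine annealing from base_lr to min_lr (assumed: no restarts)."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * epoch / total))


def linear_schedule(epoch, total, base_lr):
    """Constant, then linear to 10% of base_lr, then linear to 1% of base_lr."""
    third = total / 3.0
    if epoch < third:
        return base_lr
    if epoch < 2 * third:
        frac = (epoch - third) / third
        return base_lr + frac * (0.1 * base_lr - base_lr)
    frac = (epoch - 2 * third) / third
    return 0.1 * base_lr + frac * (0.01 * base_lr - 0.1 * base_lr)


def multi_decay_schedule(epoch, total, base_lr):
    """Subtract 10% of base_lr every 10% of the epochs."""
    interval = max(1, total // 10)
    return max(base_lr - 0.1 * base_lr * (epoch // interval), 0.0)


def multi_step_schedule(epoch, base_lr, step=30, gamma=0.1):
    """Multiply the current rate by gamma every `step` epochs."""
    return base_lr * gamma ** (epoch // step)


base_lr, total = 0.1, 100  # assumed values for illustration
for epoch in (0, 30, 60, 90):
    print(
        epoch,
        round(cosine_schedule(epoch, total, base_lr), 5),
        round(linear_schedule(epoch, total, base_lr), 5),
        round(multi_decay_schedule(epoch, total, base_lr), 5),
        round(multi_step_schedule(epoch, base_lr), 5),
    )
```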

F ADDITIONAL EXPERIMENT RESULTS AND ABLATION STUDY

In this section, we provide more experimental results. We first show the fine-grained results on CIFAR-10 and CIFAR-100, then conduct the experiments under a different model structure (i.e., WRN-40-4), and finally present an additional ablation study on the proposed UM and UMAP. The mean and variance of all metrics (ID-ACC, AUROC, AUPR, FPR95) are reported based on multiple independent trials.

Empirical verification on typical/atypical data. In the following Tables 9, 10, 11, 12, and 13, we further conduct experiments to identify the negative effect of learning on the atypical samples by comparing with a counterpart that learns only from the typical samples. The results confirm that the degeneration in detection performance is more likely to come from learning atypical samples. In Table 9, we provide the main results for verification on typical/atypical samples. Intuitively, we separate the training dataset into a typical set and an atypical set and train on these two sets respectively, to see whether it is learning atypical samples that causes the OOD performance to decrease during the latter part of the training phase. We pass the training samples through the model (DenseNet-101) of the 60th epoch and use the CE loss for the separation. We provide the ACC of the generated sets on the model of the 60th epoch (ACC in the tables). The extremely low ACCs of the atypical sets show that the model of the 60th epoch can hardly predict these samples, which meets our definition of the atypical sample. We then fine-tune the model of the 60th epoch with the generated dataset and report the OOD performance. The results show that learning from only the atypical data fails to gain better detection performance than its counterpart, i.e., learning from only the typical data. Learning on the atypical samples fails to draw suitable features for the OOD detection task, though it can still improve the original task performance.
The experiments provide a conceptual verification of our conjecture, which links our observation and the proposed method.

Results of fine-tuning for fewer epochs. UM, which fine-tunes with the proposed forgetting objective, is cost-effective compared with training from scratch. For the tuning epochs, we show in Figures 9 and 10 that fine-tuning using UM can converge within about 20 epochs, indicating that we can apply our UM/UMAP for far fewer than 100 epochs (compared with training from scratch) to restore the better detection performance of the original well-trained model. It is intuitively reasonable that fine-tuning with the newly designed objective would benefit from the well-trained model, allowing a faster convergence since the two phases consider the same task with the same training data. In the major experiments conducted in our work, fine-tuning adopts 100 epochs to better explore and understand its learning dynamics for research purposes; this configuration is indicated in the training details of Section 4.1. We also provide an extra comparison to show the relative efficiency of our proposed UM/UMAP in the following Table 14 and Table 15. The results show that UM and UMAP can efficiently restore detection performance compared with the baseline.

Fine-grained results on OOD data. In order to further figure out the effectiveness of the proposed UM and UMAP on different OOD datasets, we report the fine-grained results of our experiments on CIFAR-10 and CIFAR-100 with 6 OOD datasets (CIFAR-10/CIFAR-100, textures, Places365, SUN, LSUN, iNaturalist). The results on the 6 OOD datasets show the general effectiveness of the proposed UM as well as UMAP. In Table 16, OE + UM can outperform all the OOD baselines and further improve the OOD performance even though the original detection performance is already good.
Equipped with our proposed UM and UMAP, the baselines can outperform their counterparts on most of the OOD datasets. For instance, the FPR95 can decrease from 1.91 to 1.42. In Table 17, we also take a closer look at the results on CIFAR-100 with 6 OOD datasets. Our proposed method can improve almost all competitive baselines (either the scoring functions or the fine-tuning with auxiliary outliers) on the 6 OOD datasets. In both the w. D_aux and w.o. D_aux scenarios, Unleashing Mask can significantly excavate the intrinsic OOD detection capability of the model. In addition to unleashing the excellent OOD performance, UMAP can also maintain a high ID-ACC by learning a binary mask instead of tuning the well-trained original parameters directly.

Experiment on different model structure. Following Section 4.2, we additionally conduct critical experiments on the WRN-40-4 (Lin et al., 2021) backbone to demonstrate the effectiveness of the proposed UM and UMAP. In Figure 7, we find that during the model training phase on ID data, there also exists a once-covered OOD detection capability that can be explored in later development. In Table 18, we show the comparison of multiple OOD detection baselines, evaluating the OOD performance on the 7 OOD datasets mentioned in Section 4.1. The results again demonstrate that our proposed method indeed excavates the intrinsic detection capability and improves the performance. As for the fine-grained results of WRN-40-4, we report results on 6 OOD datasets respectively. When trained on CIFAR-10, UM outperforms all the scoring function baselines on 5 OOD datasets, the exception being Textures, on which Mahalanobis performs better, while UMAP still has excellent OOD performance, ranking second only to UM. When trained on CIFAR-100, UM and UMAP can also outperform the baselines on most OOD datasets. The fine-grained results of WRN-40-4 further demonstrate the effectiveness of the proposed UM/UMAP on other architectures.
Additional results about the general existence of once-covered OOD detection capability. In Section 4.3, we display the once-covered OOD detection capability on CIFAR-10 using SVHN as the OOD dataset. Here, we additionally verify the previously observed trend when training DenseNet-101 on CIFAR-100 using iNaturalist as the OOD dataset. In Figure 5, we trace the three evaluation metrics during training on CIFAR-100 using 4 different learning rate schedules. It can be seen for all three metrics that there exists a middle stage where the model has better OOD detection capability (FPR95 is smaller (better) in the middle stage; AUROC and AUPR are higher (better) in the middle stage). Besides that, we also look into the change of OOD performance on another architecture (e.g., WRN-40-4) in Figure 6 and Figure 7. In Figure 6, we display the curves of the three metrics of WRN-40-4 when trained on CIFAR-10 with SVHN and Textures as OOD datasets. The trend that the OOD performance first improves and then converges to a worse level is again reflected. In Figure 7, we provide the curves of the three metrics of WRN-40-4 during training on CIFAR-100 with iNaturalist, Places365, and SUN as OOD datasets. A clearly better middle stage can still be excavated in this scenario.

UMAP: adopting pruning on UM. We conduct various experiments to see whether pruning has an impact on Unleashing Mask itself. To be specific, we expect the pruning to learn a mask on the given model while not impairing the excellent OOD performance that UM brings. Figure 8 shows that prune rates from a wide range (e.g., p ∈ [0.3, 0.9]) can maintain the effectiveness of UM while converging well.
For simplicity, in Figure 8 we use prune to indicate the original pruning approach and UMAP to indicate UM with pruning on the mask with our newly designed forgetting objective. In Figure 8 (a), the solid lines represent the proposed UMAP and the dashed lines represent only pruning the well-trained model at prune rates 0.2, 0.5, and 0.8. While the model's OOD performance can't be improved (it is not better than the baseline) through pruning alone, using our proposed forgetting objective with the loss constraint can bring out significantly better OOD performance over a wide range of mask rates (e.g., p ∈ [0.5, 0.8]). In Figure 8 (b), we reflect the effect of the loss constraint estimated by the initialized mask, which redirects the gradients when the loss reaches that value, while the loss just approaches 0 when pruning only. In Figure 8 (c), we show that the ID-ACC still rises to a high level after training for 100 epochs, even though it is not taken into consideration for UMAP.

The effectiveness of UM. In Figure 9, we present the FPR95, AUROC, and AUPR curves during training to compare the original training with our proposed UM on ID data. We observe that training using UM consistently outperforms the vanilla model training, either at the final stage or at the middle stage with the best OOD detection performance indicated by the FPR95 curve. In Figure 10, we also adopt different mask rates for the initialized loss constraint estimation for forgetting the atypical samples. The results show that a wide range of mask ratios (i.e., from 96% to 99%) for estimating the loss constraint used in Eq. 3 can gain better OOD detection performance than the baseline. This suggests that the method is robust to the choice of the mask ratio within a certain range. The principal intuition behind this is our observation indicated in Figures 1(a), 2(b), and 2(c).
With the guidance of this general mechanism, empirically choosing the hyper-parameter using a validation set is supportable and valuable for excavating better OOD detection capability of the model, as done in previous literature (Hendrycks et al., 2019b; Liu et al., 2020; Sun et al., 2021). In our experiments, we empirically determine the value for our proposed UM and UMAP by examining the training loss on the masked output. For CIFAR-10 as the ID dataset, the mask rate is 97.5% and the estimated loss constraint for forgetting is 0.10 for our tuning until convergence; for CIFAR-100, the mask rate is 97% and the estimated loss constraint for forgetting is 1.20 for our tuning until convergence. To choose the parameters of the estimated loss constraint, we use the TinyImageNet (Tavanaei, 2020) dataset as the validation set, which is not seen during training and is not considered in our evaluation of OOD detection performance. Since the core intuition behind our method is to restore the OOD detection performance starting from the well-trained model stage, forgetting a relatively small portion of atypical samples (empirically found at around a 97% mask ratio) can be beneficial. To find the optimal parameter for tuning, more advanced searching techniques like AutoML or a validation design based on the observation in our work may be further employed in the future.

Fine-grained comparison of the model weights. We display the weights of the original model, the pruned model, and the UMAP model respectively in Figure 11. The histograms show that the adopted pruning algorithm tends to choose weights far from 0 for the first convolution layer, as shown in Figure 11 (a). However, for almost all layers (from the 2nd to the 98th), the pruning chooses weights regardless of their values, as shown in Figure 11 (b).
For the "head" of the model (the fully connected layer), the pruning algorithm itself still keeps its behavior on the first layer, while UMAP forces the prune algorithm to choose weights near 0, shown in Figure 11 (c), indicating that forgetting learned atypical samples doesn't necessarily correspond to larger weights or smaller weights. 



Figure 1: (a) the curves of FPR95 (false positive rate of OOD examples when the true positive rate of in-distribution examples is at 95%) based on the Energy score (Liu et al., 2020) across three different OOD datasets during the training on the CIFAR-10 dataset. (b) comparison between ID and OOD distribution at Epoch 60. (c) comparison between ID and OOD distribution at Epoch 100. All the experiments testing for OOD detection performance have been conducted multiple times. By backtracking the training phase, we can observe the existence of a model stage with better OOD detection capability using the Energy score to distinguish the OOD inputs. When zooming in on the ID and OOD distributions at Epoch 60 and Epoch 100 respectively, it can be seen that the overlap between them grows as training proceeds at the later stage. Figure 2 contains further exploration.

Figure 2: We train on CIFAR-10 for the original multi-class classification and check the details as follows: (a) training/testing loss on ID data; (b) training/testing accuracy on ID data; (c) curves of FPR95 based on ODIN score; (d) curves of FPR95 based on Energy score; (e) visualization of correctly classified samples at Epoch 60; (f) visualization of misclassified samples at Epoch 60. We investigate the once-covered OOD detection capability by checking the model behavior during its training phase. We take a closer look at the corresponding training and testing performance with the OOD detection capability indicated by two different scores. Through comparison, we find that achieving a reasonably small loss value (at around Epoch 60) on ID data is enough for OOD detection. However, continually optimizing on the atypical samples can impair the detection performance.

Figure 3: (a) a brief illustration of the proposed Unleashing Mask (UM); (b) the mask rate w.r.t. loss value using the masked outputs; (c) examples of misclassified samples after masking the original well-trained model. As for our framework, given a well-trained model, we initialize an extra mask for mining the atypical samples that are sensitive to the changes in model parameters. Then we fine-tune the original model or adopt pruning with the estimated forgetting threshold, i.e., the loss value estimated by the UM. The final model can serve as the base of various score functions to utilize the discriminative features and also as the new initialization of fine-tuning with the auxiliary outliers.

Figure 4: Ablation studies. (a)-(b) exploring the existence of potential OOD detection capability with different learning rate schedules; (c) comparison of loss value using original output with masked output (the x-axis represents the index of samples within a mini-batch); (d) effects of using different masking ratios in UM; (e) comparison of using original pruning with our proposed UMAP.

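Algorithm 1 (continued)
13: $\theta^{(t+1)} = \theta^{(t)} - \eta \, \partial(|\mathcal{L}_{CE}(x, \theta) - \zeta| + \zeta) / \partial\theta$
14: end for

Algorithm 2 Unleashing Mask Adopt Pruning (UMAP)
Input: well-trained model: $\theta$, mask ratio: $\delta \in [0, 1]$, fine-tuning epochs of UM: $k$, training samples: $x \sim D^{s}_{\text{in}}$, prune rate: $p$;
Output: learnt binary mask $m_p$;
1: Initialize a popup score for every weight
2: for $w \in \theta$ do
3: $s_w \sim \mathcal{N}(\mu, \sigma^2)$
4: end for
5: Generate mask by the popup scores
6: for $l \in \theta_{\text{layers}}$ do
7: $m^{l}_{\delta} = s^{l} > \text{quantile}(s^{l}, \delta)$
8: end for
9: Estimate loss constraint
10: $\zeta = \mathbb{E}_{x \sim D^{s}_{\text{in}}}(\mathcal{L}_{CE}(x, m_{\delta} \odot \theta))$
11: Unleashing Mask Adopt Pruning: fine-tuning
12: $s^{(1)} \sim \mathcal{N}(\mu, \sigma^2)$
13: for $t \in (1, \ldots, k)$ do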

Figure 5: Ablation studies on three metrics with 4 different learning rate schedules. The model is DenseNet-101 trained on CIFAR-100 with iNaturalist as the OOD dataset. (a) change of FPR95 throughout the pruning phase when training on CIFAR-100; (b) change of AUROC throughout the pruning phase when training on CIFAR-100; (c) change of AUPR throughout the pruning phase when training on CIFAR-100. It demonstrates a better middle stage exists according to the three metrics.

Figure 6: Ablation studies on three metrics of WRN-40-4 with CIFAR-10 as ID dataset, SVHN, and Textures as OOD datasets. (a) change of FPR95 throughout the pruning phase when training on CIFAR-10; (b) change of AUROC throughout the pruning phase when training on CIFAR-10; (c) change of AUPR throughout the pruning phase when training on CIFAR-10. It demonstrates a better middle stage exists according to the three metrics.

Figure 8: Ablation studies on the prune rate of UMAP. (a) change of OOD performance throughout the pruning phase; (b) training loss converges properly to the estimated loss constraint; (c) though ID-ACC is not taken into consideration for UMAP, it still rises to a high level after training for 100 epochs.

Figure 9: Ablation studies to reflect the effectiveness of UM. The mask ratio of UM is 99.5%. (a) change of FPR95 throughout the training phase on CIFAR-10; (b) change of AUROC throughout the training phase on CIFAR-10; (c) change of AUPR throughout the training phase on CIFAR-10.

Figure 12: Examples of misclassified samples after masking the original well-trained model. The scores are estimated according to uniform distribution. (a) layer-wise masking scores; (b) layer-wise masking weights directly; (c) model-wise masking scores; (d) model-wise masking weights directly.

Main Results (%). Comparison with competitive OOD detection baselines.

MSP (Hendrycks & Gimpel, 2017a)    89.90 ± 0.30    91.48 ± 0.43    60.08 ± 0.76    94.01 ± 0.08
ODIN (Liang et al., 2018)          91.46 ± 0.56    91.67 ± 0.58    42.31 ± 1.38    94.01 ± 0.08
Mahalanobis (Lee et al., 2018b)    75.10 ± 1.04    72.32 ± 1.92    61.35 ± 1.25    94.01 ± 0.08
Energy (Liu et al., 2020)          92.07 ± 0.22    92.72 ± 0.39    42.69 ± 1.31    94.01 ± 0.08
Energy+UM (ours)                   93.73 ± 0.36    94.27 ± 0.60    33.29 ± 1.70    92.80 ± 0.47
Daux) (Liu et al., 2020)           94.58 ± 0.64    94.69 ± 0.65    18.79 ± 2.31    80.91 ± 3.13

MSP (Hendrycks & Gimpel, 2017a)    74.06 ± 0.69    75.37 ± 0.73    83.14 ± 0.87    74.86 ± 0.21
ODIN (Liang et al., 2018)          76.18 ± 0.14    76.49 ± 0.20    78.93 ± 0.31    74.86 ± 0.21
Mahalanobis (Lee et al., 2018b)    63.90 ± 1.91    64.31 ± 0.91    78.79 ± 0.50    74.86 ± 0.21
Energy (Liu et al., 2020)          76.29 ± 0.24    77.06 ± 0.55    78.46 ± 0.06    74.86 ± 0.21
Energy+UM (ours)                   76.22 ± 0.42    76.39 ± 1.03    74.05 ± 0.55    64.55 ± 0.24
Daux) (Liu et al., 2020)           88.92 ± 0.57    89.13 ± 0.56    37.90 ± 2.59    57.85 ± 2.65

Fine-grained Results (%). Comparison on different OOD benchmark datasets.

Confusion Matrix.

Comparison among overfitting methods and ODIN with DenseNet-101 (%). ↑ indicates higher values are better, and ↓ indicates lower values are better.

Comparison among overfitting methods and Energy with DenseNet-101 (%). ↑ indicates higher values are better, and ↓ indicates lower values are better.



Comparison among overfitting methods and Energy with WRN-40-4 (%). ↑ indicates higher values are better, and ↓ indicates lower values are better.

Algorithm 1 Unleashing Mask (UM)
Input: well-trained model: θ, mask ratio: δ ∈ [0, 1], fine-tuning epochs of UM: k, training samples:
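Since the algorithm listing is cut off here, the following is only a rough sketch of the two ingredients its inputs suggest: a mask over the weights of the well-trained model, and a fine-tuning objective tied to an estimated loss constraint. Two assumptions are made and are not confirmed by the text above: (i) the mask keeps the fraction of weights with the largest magnitude, and (ii) the objective has the flooding-style form |ℓ − ℓ̂| + ℓ̂, where ℓ̂ is the loss estimated with the masked model. All names are hypothetical.

```python
def magnitude_mask(weights, keep_ratio):
    """Binary mask that keeps the `keep_ratio` fraction of weights with the
    largest magnitude (assumed reading of the mask ratio)."""
    n_keep = round(keep_ratio * len(weights))
    order = sorted(range(len(weights)),
                   key=lambda i: abs(weights[i]), reverse=True)
    kept = set(order[:n_keep])
    return [1 if i in kept else 0 for i in range(len(weights))]


def apply_mask(weights, mask):
    """Element-wise product of the mask and the weights."""
    return [w * m for w, m in zip(weights, mask)]


def constrained_loss(loss, loss_constraint):
    """Assumed flooding-style objective |loss - c| + c.
    Above c it equals the plain loss (gradient descent as usual);
    below c it equals 2c - loss, so its gradient pushes the loss back up."""
    return abs(loss - loss_constraint) + loss_constraint


if __name__ == "__main__":
    theta = [0.1, -2.0, 0.5, 3.0]
    mask = magnitude_mask(theta, keep_ratio=0.5)
    print(mask, apply_mask(theta, mask))
    # In a full pipeline the constraint would be the training loss of the
    # masked model; here it is a placeholder value.
    print(constrained_loss(0.2, 0.5), constrained_loss(0.9, 0.5))
```

Under these assumptions, fine-tuning for k epochs with `constrained_loss` holds the training loss near the constraint rather than driving it to zero, which matches the stated intent of making the model forget memorized atypical samples.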

Complete Results (%). Comparison with competitive OOD detection baselines. ↑ indicates higher values are better, and ↓ indicates lower values are better.

Fine-tuning on typical/atypical samples with different model structures (%). ↑ indicates higher values are better, and ↓ indicates lower values are better.

Fine-tuning on typical/atypical CIFAR-10 samples with DenseNet-101 (%). ↑ indicates higher values are better, and ↓ indicates lower values are better.

Fine-tuning on typical/atypical CIFAR-10 samples with WRN-40-4 (%). ↑ indicates higher values are better, and ↓ indicates lower values are better.

Fine-tuning on typical/atypical CIFAR-100 samples with DenseNet-101 (%). ↑ indicates higher values are better, and ↓ indicates lower values are better.

Fine-tuning on typical/atypical CIFAR-100 samples with WRN-40-4 (%). ↑ indicates higher values are better, and ↓ indicates lower values are better.

from the well-trained model, allowing faster convergence as the two phases consider the same task and training data. Considering the significance of OOD awareness for safety-critical areas, it is worthwhile to further excavate the OOD detection capability of the deployed well-trained model using our UM and UMAP.

Fine-tuning for 20 epochs with DenseNet-101 (%). ↑ indicates higher values are better, and ↓ indicates lower values are better.

Fine-tuning for 20 epochs with WRN-40-4 (%). ↑ indicates higher values are better, and ↓ indicates lower values are better.

Fine-grained Results (%) of DenseNet-101 on CIFAR-10. Comparison on different OOD benchmark datasets respectively. ↑ indicates higher values are better, and ↓ indicates lower values are better.

Fine-grained Results (%) of DenseNet-101 on CIFAR-100. Comparison on different OOD benchmark datasets respectively. ↑ indicates higher values are better, and ↓ indicates lower values are better.

Results (%) of the comparison with competitive OOD detection baselines. We train WRN-40-4 on CIFAR-10 and CIFAR-100, respectively. For the methods involving outliers, we retrieve 5000 samples from ImageNet-1k. ↑ indicates higher values are better, and ↓ indicates lower values are better.

Fine-grained Results (%) of WRN-40-4 on CIFAR-10. Comparison on different OOD benchmark datasets. ↑ indicates higher values are better, and ↓ indicates lower values are better.

MSP                   70.96 ± 0.70    86.08 ± 0.08    68.81 ± 1.29    86.53 ± 0.83    68.31 ± 0.25    86.71 ± 0.13
ODIN                  64.97 ± 0.08    83.36 ± 0.11    66.86 ± 2.24    81.34 ± 0.81    66.49 ± 1.16    83.47 ± 0.93
Mahalanobis           79.84 ± 0.55    70.33 ± 0.24    22.56 ± 0.08    94.07 ± 0.04    85.09 ± 0.59    67.90 ± 0.37
Energy                61.09 ± 0.58    86.66 ± 0.04    64.29 ± 1.72    85.56 ± 0.53    55.32 ± 0.13    88.29 ± 0.26
Energy+UM (ours)      57.21 ± 1.41    87.56 ± 0.15    46.49 ± 1.03    89.74 ± 0.45    40.68 ± 4.46    92.51 ± 0.97
Energy+UMAP (ours)    65.45 ± 1.10    84.65 ± 0.95    59.14 ± 1.64    85.27 ± 1.74    48.16 ± 1.89    90.43 ± 0.47

Fine-grained Results (%) of WRN-40-4 on CIFAR-100. Comparison on different OOD benchmark datasets. ↑ indicates higher values are better, and ↓ indicates lower values are better.

REPRODUCIBILITY STATEMENT

The experimental setups are detailed in Section 4.1 and Appendices A and E. To ensure the reproducibility of our experimental results, we will also provide an anonymous repository containing our source code during the discussion phase for reviewing purposes.

