HOW TO EXPLOIT HYPERSPHERICAL EMBEDDINGS FOR OUT-OF-DISTRIBUTION DETECTION?

Abstract

Out-of-distribution (OOD) detection is a critical task for reliable machine learning. Recent advances in representation learning give rise to distance-based OOD detection, where testing samples are detected as OOD if they are relatively far away from the centroids or prototypes of in-distribution (ID) classes. However, prior methods directly take off-the-shelf contrastive losses that suffice for classifying ID samples, but are not optimally designed when test inputs contain OOD samples. In this work, we propose CIDER, a novel representation learning framework that exploits hyperspherical embeddings for OOD detection. CIDER jointly optimizes two losses to promote strong ID-OOD separability: a dispersion loss that promotes large angular distances among different class prototypes, and a compactness loss that encourages samples to be close to their class prototypes. We analyze and establish the unexplored relationship between OOD detection performance and the embedding properties in the hyperspherical space, and demonstrate the importance of dispersion and compactness. CIDER establishes superior performance, outperforming the latest rival by 13.33% in FPR95. Code is available at https://github.com/deeplearning-wisc/cider.

1. INTRODUCTION

When deploying machine learning models in the open world, it is important to ensure the reliability of the model in the presence of out-of-distribution (OOD) inputs-samples from an unknown distribution that the network has not been exposed to during training, and therefore should not be predicted with high confidence at test time. We desire models that are not only accurate when the input is drawn from the known distribution, but are also aware of the unknowns outside the training categories. This gives rise to the task of OOD detection, where the goal is to determine whether an input is in-distribution (ID) or not. A plethora of OOD detection algorithms have been developed recently, among which distance-based methods demonstrated promise (Lee et al., 2018; Xing et al., 2020) . These approaches circumvent the shortcoming of using the model's confidence score for OOD detection, which can be abnormally high on OOD samples (Nguyen et al., 2015) and hence not distinguishable from ID data. Distance-based methods leverage feature embeddings extracted from a model, and operate under the assumption that the test OOD samples are relatively far away from the clusters of ID data. Arguably, the efficacy of distance-based approaches can depend largely on the quality of feature embeddings. Recent works including SSD+ (Sehwag et al., 2021) and KNN+ (Sun et al., 2022) directly employ off-the-shelf contrastive losses for OOD detection. In particular, these works use the supervised contrastive loss (SupCon) (Khosla et al., 2020) for learning the embeddings, which are then used for OOD detection with either parametric Mahalanobis distance (Lee et al., 2018; Sehwag et al., 2021) or non-parametric KNN distance (Sun et al., 2022) . However, existing training objectives produce embeddings that suffice for classifying ID samples, but remain sub-optimal for OOD detection. For example, when trained on CIFAR-10 using SupCon loss, the average angular distance between ID and OOD data is only 29.86 degrees in the embedding space, which is too small for effective ID-OOD separation. This raises the important question: How to exploit representation learning methods that maximally benefit OOD detection? In this work, we propose CIDER, a Compactness and DispErsion Regularized learning framework designed for OOD detection. Our method is motivated by the desirable properties of hyperspherical embeddings, which can be naturally modeled by the von Mises-Fisher (vMF) distribution. vMF is a classical and important distribution in directional statistics (Mardia et al., 2000) , is analogous to spherical Gaussian distributions for features with unit norms. Our key idea is to design an endto-end trainable loss function that enables optimizing hyperspherical embeddings into a mixture of vMF distributions, which satisfy two properties simultaneously: (1) each sample has a higher probability assigned to the correct class in comparison to incorrect classes, and (2) different classes are far apart from each other. To formalize our idea, CIDER introduces two losses: a dispersion loss that promotes large angular distances among different class prototypes, along with a compactness loss that encourages samples to be close to their class prototypes. These two terms are complementary to shape hyperspherical embeddings for both OOD detection and ID classification purposes. Unlike previous contrastive loss, CIDER explicitly formalizes the latent representations as vMF distributions, thereby providing a direct theoretical interpretation of hyperspherical embeddings. In particular, we show that promoting large inter-class dispersion is key to strong OOD detection performance, which has not been explored in previous literature. Previous methods including SSD+ directly use off-the-shelf SupCon loss, which produces embeddings that lack sufficient interclass dispersion needed for OOD detection. CIDER mitigates the issue by explicitly optimizing for large inter-class margins and leads to more desirable hyperspherical embeddings. Noticeably, when trained on CIFAR-10, CIDER displays a relative 42.36% improvement of ID-OOD separability compared to SupCon. We further show that CIDER's strong representation can benefit different distance-based OOD scores, outperforming recent competitive methods SSD+ (Sehwag et al., 2021) and KNN+ (Sun et al., 2022) by a significant margin. Our key results and contributions are: 1. We propose CIDER, a novel representation learning framework designed for OOD detection. Compared to the latest rival (Sun et al., 2022) , CIDER produces superior embeddings that lead to 13.33% error reduction (in FPR95) on the challenging CIFAR-100 benchmark. 2. We are the first to establish the unexplored relationship between OOD detection performance and the embedding quality in the hyperspherical space, and provide measurements based on the notion of compactness and dispersion. This allows future research to quantify the embedding in the hyperspherical space for effective OOD detection. 3. We offer new insights on the design of representation learning for OOD detection. We also conduct extensive ablations to understand the efficacy and behavior of CIDER, which remains effective and competitive under various settings, including the ImageNet dataset.

2. PRELIMINARIES

We consider multi-class classification, where X denotes the input space and Y in = {1, 2, ..., C} denotes the ID labels. The training set D in tr = {(x i , y i )} N i=1 is drawn i.i.d. from P X Y in . Let P X denote the marginal distribution over X , which is called the in-distribution (ID). Out-of-distribution detection. OOD detection can be viewed as a binary classification problem. At test time, the goal of OOD detection is to decide whether a sample x ∈ X is from P X (ID) or not (OOD). In practice, OOD is often defined by a distribution that simulates unknowns encountered during deployment, such as samples from an irrelevant distribution whose label set has no intersection with Y in and therefore should not be predicted by the model. Mathematically, let D ood test denote an OOD test set where the label space Y ood ∩ Y in = ∅. The decision can be made via a level set estimation: G λ (x) = 1{S(x) ≥ λ}, where samples with higher scores S(x) are classified as ID and vice versa. The threshold λ is typically chosen so that a high fraction of ID data (e.g. 95%) is correctly classified.

Hyperspherical embeddings.

A hypersphere is a topological space that is homeomorphic to a standard n-sphere, which is the set of points in (n + 1)-dimensional Euclidean space that are located at a constant distance from the center. When the sphere has a unit radius, it is called the unit hypersphere. Formally, an n-dimensional unit-hypersphere S n := {z ∈ R n+1 |∥z∥ 2 = 1}. Geometrically, hyperspherical embeddings lie on the surface of a hypersphere. 

3. METHOD

Overview. Our framework CIDER is illustrated in Figure 1 . The general architecture consists of two components: (1) a deep neural network encoder f : X → R e that maps the augmented input x to a high dimensional feature embedding f (x) (often referred to as the penultimate layer features); (2) a projection head h : R e → R d that maps the high dimensional embedding f (x) to a lower dimensional feature representation z := h(f (x)). The loss is applied to the normalized feature embedding z := z/∥z∥ 2 . The normalized embeddings are also referred to as hyperspherical embeddings, since they are on a unit hypersphere. Our goal is to shape the hyperspherical embedding space so that the learned embeddings can be mostly effective for distinguishing ID vs. OOD data.

3.1. MODEL HYPERSPHERICAL EMBEDDINGS

The hyperspherical embeddings can be naturally modeled by the von Mises-Fisher (vMF) distribution, a classical and important distribution in directional statistics (Mardia et al., 2000) . In particular, vMF is analogous to spherical Gaussian distributions for features z with unit norms (∥z∥ 2 = 1). The probability density function for a unit vector z ∈ R d in class c is defined as: p d (z; µ c , κ) = Z d (κ) exp κµ ⊤ c z , where µ c is the class prototype with unit norm, κ ≥ 0 indicates the tightness of the distribution around the mean direction µ c , and Z d (κ) is the normalization factor. The larger value of κ, the stronger the distribution is concentrated in the mean direction. In the extreme case of κ = 0, the sample points are distributed uniformly on the hypersphere. Under this probability model, an embedding vector z is assigned to class c with the following normalized probability: P(y = c|z; {κ, µ j } C j=1 ) = Z d (κ) exp κµ ⊤ c z C j=1 Z d (κ) exp κµ ⊤ j z (2) = exp µ ⊤ c z/τ C j=1 exp µ ⊤ j z/τ , where κ = 1 τ . Next, we outline our proposed training method that promotes class-conditional vMF distributions for OOD detection.

3.2. HOW TO OPTIMIZE HYPERSPHERICAL EMBEDDINGS?

Training objective. Our key idea is to design a trainable loss function that enables optimizing hyperspherical embeddings into a mixture of vMF distributions, which satisfy two properties simultaneously: (1) each sample has a higher probability assigned to the correct class in comparison to incorrect classes, and (2) different classes are far apart from each other. To achieve 1 , we can perform maximum likelihood estimation (MLE) on the training data: argmax θ N i=1 p(y i |z i ; {κ j , µ j } C j=1 ), where i is the index of the embedding and N is the size of the training set. By taking the negative log-likelihood, the objective function is equivalent to minimizing the following loss: L comp = - 1 N N i=1 log exp z ⊤ i µ c(i) /τ C j=1 exp z ⊤ i µ j /τ , where c(i) denotes the class index of a sample x i , and τ is the temperature parameter. We term this compactness loss, since it encourages samples to be closely aligned with its class prototype. To promote property 2 , we propose the dispersion loss, optimizing large angular distances among different class prototypes: L dis = 1 C C i=1 log 1 C -1 C j=1 1{j ̸ = i}e µ ⊤ i µj /τ . ( ) Figure 2 : Illustration of desirable hyperspherical embeddings for OOD detection. As OOD samples lie between ID clusters, optimizing a large angular distance among ID clusters benefits OOD detection. While prototypes with larger pairwise angular distances may not impact ID classification accuracy, they are crucial for OOD detection, as we will show later in Section 4.2. Since the embeddings of OOD samples lie inbetween ID clusters, optimizing large inter-class margin benefits OOD detection. The importance of inter-class dispersion can also be explained in Figure 2 , where samples in the fox class (OOD) are semantically close to cat (ID) and dog (ID). A larger angular distance (i.e. smaller cosine similarity) between ID classes cat and dog in the hyperspherical space improves the separability from fox, and allows for more effective detection. We investigate this phenomenon quantitatively in Section 4. Formally, our training objective CIDER (compactness and dispersion regularized learning) is: L CIDER = L dis + λ c L comp , where λ c is the co-efficient modulating the relative importance of two losses. These two terms are complementary to shape hyperspherical embeddings for both ID classification and OOD detection. Prototype estimation and update. During training, an important step is to estimate the class prototype µ c for each class c ∈ {1, 2, ..., C}. One canonical way to estimate the prototypes is to use the mean vector of all training samples for each class and update it frequently during training. Despite its simplicity, this method incurs a heavy computational toll and causes undesirable training latency. Instead, the class-conditional prototypes can be effectively updated in an exponential-moving-average manner (EMA) (Li et al., 2020) : µ c := Normalize(αµ c + (1 -α)z), ∀c ∈ {1, 2, . . . , C} where the prototype µ c for class c is updated during training as the moving average of all embeddings with label c, and z denotes the normalized embedding of samples of class c. We ablate the effect of prototype update factor α in Section C. The pseudo-code for our method is in Appendix A. Remark 1: Differences w.r.t. SSD+. We highlight three fundamental differences w.r.t. SSD+, in terms of training objective, test-time OOD detection, and theoretical interpretation. (1) At training time, SSD+ directly uses off-the-shelf SupCon loss (Khosla et al., 2020) , which produces embeddings that lack sufficient inter-class dispersion needed for OOD detection. For example, when trained on CIFAR-10 using SupCon loss, the average angular distance between ID and OOD data is only 29.86 degrees in the embedding space. In contrast, CIDER enforces the inter-class dispersion by explicitly maximizing the angular distances among different ID class prototypes. As we will show in Section 4.2, CIDER displays a relative 42.36% improvement of ID-OOD separability compared to SSD+, due to the explicit inter-class dispersion. (2) At test time, SSD+ uses the Mahalanobis distance (Lee et al., 2018) , which imposes a strong Gaussian distribution assumption on hyperspherical embeddings. In contrast, CIDER alleviates this assumption with a non-parametric distance score. (3) Lastly, CIDER explicitly models the latent representations as vMF distributions, providing a direct and clear geometrical interpretation of hyperspherical embeddings. Remark 2: Differences w.r.t. Proxy-based methods. Our work also bears significant differences w.r.t. proxy-based metric learning methods. (1) Our primary task is OOD detection, whereas deep metric learning is commonly used for face verification and image retrieval tasks; (2) Prior methods such as ProxyAnchor (Kim et al., 2020) lack explicit prototype-to-prototype dispersion, which we show is crucial for OOD detection. Moreover, ProxyAnchor initializes the proxies randomly and updates through gradients, while we estimate prototypes directly from sample embeddings using EMA. We provide experimental comparisons next.

4. EXPERIMENTS

4.1 COMMON SETUP Datasets and training details. Following the common benchmarks in the literature, we consider CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009) as in-distribution datasets. For OOD test datasets, we use a suite of natural image datasets including SVHN (Netzer et al., 2011 ), Places365 (Zhou et al., 2017) , Textures (Cimpoi et al., 2014b) , LSUN (Yu et al., 2015) , and iSUN (Xu et al., 2015) . In our main experiments, we use ResNet-18 as the backbone for CIFAR-10 and ResNet-34 for CIFAR-100. We train the model using stochastic gradient descent with momentum 0.9, and weight decay 10 -4 . To demonstrate the simplicity and effectiveness of CIDER, we adopt the same hyperparameters as in SSD+ (Sehwag et al., 2021) with the SupCon loss: the initial learning rate is 0.5 with cosine scheduling, the batch size is 512, and the training time is 500 epochs. We choose the default weight λ c = 2, so that the value of different loss terms are similar upon model initialization. Following the literature (Khosla et al., 2020) , we use the embedding dimension of 128 for the projection head. The temperature τ is 0.1. We adjust the prototype update factor α, batch size, temperature, loss weight, prototype update factor, and model architecture in our ablation study (Appendix C). We report the ID classification results for SSD+ and CIDER following the common linear evaluation protocol (Khosla et al., 2020) , where a linear classifier is trained on top of the normalized penultimate layer features. More experimental details are provided in Appendix B. Code and data is released publicly for reproducible research. OOD detection scores. During test time, we employ a distance-based method for OOD detection. An input x is considered OOD if it is relatively far from the ID data in the embedding space. By default, we adopt a simple non-parametric KNN distance (Sun et al., 2022) , which does not impose any distributional assumption on the feature space. Here the distance is the cosine similarity with respect to the k-th nearest neighbor, which is equivalent to the (negated) Euclidean distance as all features have unit norms. In the ablation study, we also consider the commonly used Mahalanobis score (Lee et al., 2018) for a fair comparison with SSD+ (Sehwag et al., 2021) . Evaluation metrics. We report the following metrics: (1) the false positive rate (FPR95) of OOD samples when the true positive rate of ID samples is at 95%, (2) the area under the receiver operating characteristic curve (AUROC), and (3) ID classification accuracy (ID ACC).

4.2. MAIN RESULTS AND ANALYSIS

CIDER outperforms competitive approaches. 4 ).

4.3. CHARACTERIZING AND UNDERSTANDING EMBEDDING QUALITY

CIDER learns distinguishable representations. We visualize the learned feature embeddings in Figure 3 using UMAP (McInnes et al., 2018) , where the colors encode different class labels. A salient observation is that embeddings obtained with CIDER enjoy much better compactness compared to embeddings trained with the CE loss (3a). Moreover, the classes are distributed more uniformly in the space, highlighting the efficacy of the dispersion loss. CIDER improves inter-class dispersion and intra-class compactness. Beyond visualization, we also quantitatively measure the embedding quality. We propose two measurements: inter-class dispersion and intra-class compactness: Dispersion(µ) = 1 C C i=1 1 C -1 C j=1 µ ⊤ i µ j 1{j ̸ = i}. Compactness(D in tr , µ) = 1 C C j=1 1 n n i=1 z ⊤ i µ j 1{y i = j}, where D in tr = {x i , y i } N i=1 , and z i is the normalized embedding of x i for all 1 ≤ i ≤ N . Dispersion is measured by the average cosine similarity among pair-wise class prototypes. The compactness can be interpreted as the average cosine similarity between each feature embedding and its corresponding class prototype. To make the measurements more interpretable, we convert cosine similarities to angular degrees. Hence, a higher inter-class dispersion (in degrees) indicates more separability among class prototypes, which is desirable. Similarly, lower intra-class compactness (in degrees) is better. The results are shown in Table 4 based on the CIFAR-10 test set. Compared to SSD+ (with SupCon loss), CIDER significantly improves the inter-class dispersion by 12.03 degrees. Different from SupCon, CIDER explicitly optimizes the inter-class dispersion, which especially benefits OOD detection. CIDER improves ID-OOD separability. Next, we quantitatively measure how the feature embedding quality affects the ID-OOD separability. We introduce a separability score, which measures on average how close the embedding of a sample from the OOD test set is to the closest ID class prototype, compared to that of an ID sample. The traditional notion of "OOD being far away from ID classes" is now translated to "OOD being somewhere between ID clusters on the hypersphere". A higher separability score indicates that the OOD test set is easier to be detected. Formally, we define the separability measurement as: ↑ Separability = 1 |D ood test | x∈D ood test max j∈[C] z ⊤ x µ j - 1 |D in test | x ′ ∈D in test max j∈[C] z ⊤ x ′ µ j , where D ood test is the OOD test dataset and z x denotes the normalized embedding of sample x. Table 4 shows that CIDER leads to higher separability and consequently superior OOD detection performance (c.f. Table 1 ). Averaging across 5 OOD test datasets, our method displays a relative 42.36% improvement of ID-OOD separability compared to SupCon. This further verifies the effectiveness of CIDER for improving OOD detection. CIDER is competitive on large-scale datasets. To further examine the performance of CIDER on real-world tasks, we evaluate the performance of CIDER on the more challenging large-scale benchmarks. Specifically, we use ImageNet-100 as ID, a subset of ImageNet (Deng et al., 2009) consisting of 100 randomly sampled classes. For OOD test datasets, we use the same ones in (Huang & Li, 2021) , including subsets of INATURALIST (Van Horn et al., 2018) , SUN (Xiao et al., 2010) , PLACES (Zhou et al., 2017) , and TEXTURE (Cimpoi et al., 2014a) . For each OOD dataset, the categories do not overlap with the ID dataset. For computational efficiency, we fine-tune pretrained ResNet-34 with CIDER and SupCon losses for 10 epochs with an initial learning rate of 0.01. For each loss, we update the weights of the last residual block and the nonlinear projection head, while freezing the parameters in the first three residual blocks. At test time, we use the same detection score (KNN) to evaluate representation quality. The performance (in AUROC) is shown in Figure 4 (more results are in Appendix D). We can see that CIDER remains very competitive on all the OOD test sets where CIDER consistently outperforms SupCon. This further verifies the benefits of explicitly promoting inter-class dispersion and intraclass compactness. Ablation studies on weight scale, prototype update factor, learning rate, temperature, batch size, and architecture.

4.4. ADDITIONAL ABLATIONS AND ANALYSIS

We provide comprehensive ablation studies to understand the impact of various factors in Appendix C. For example, as shown in Figure 5 , CIDER consistently outperforms SupCon across different batch sizes.

5. RELATED WORKS

Out-of-distribution detection. The majority of works in the OOD detection literature focus on the supervised setting, where the goal is to derive a binary ID-OOD classifier along with a classification model for the in-distribution data. Compared to generative model-based methods (Kirichenko et al., 2020; Nalisnick et al., 2019; Ren et al., 2019; Serrà et al., 2020; Xiao et al., 2020) , OOD detection based on supervised discriminative models typically yield more competitive performance. For deep neural networks-based methods, most OOD detection methods derive confidence scores either based on the output (Bendale & Boult, 2016; Hendrycks & Gimpel, 2017; Hsu et al., 2020; Huang & Li, 2021; Liang et al., 2018; Liu et al., 2020; Sun et al., 2021; Ming et al., 2022b) , gradient information (Huang et al., 2021) , or the feature embeddings (Lee et al., 2018; Sastry & Oore, 2020; Tack et al., 2020; Zhou et al., 2021; Sehwag et al., 2021; Sun et al., 2022; Ming et al., 2022a; Du et al., 2022) . Our method can be categorized as a distance-based OOD detection method by exploiting the hyperspherical embedding space. Contrastive representation learning. Contrastive representation learning (van den Oord et al., 2019) aims to learn a discriminative embedding where positive samples are aligned while negative ones are dispersed. It has demonstrated remarkable success for visual representation learning in unsupervised (Chen et al., 2020a; b; He et al., 2020; Robinson et al., 2021) , semi-supervised (Assran et al., 2021) , and supervised settings (Khosla et al., 2020) Representation learning for OOD detection. Self-supervised learning has been shown to improve OOD detection. Prior works (Sehwag et al., 2021; Winkens et al., 2020) verify the effectiveness of directly applying the off-the-shelf multi-view contrastive losses such as SupCon and SimCLR for OOD detection. CSI (Tack et al., 2020) investigates the type of data augmentations that are particularly beneficial for OOD detection. Different from prior works, we focus on hyperspherical embeddings and propose to explicitly encourage the desirable properties for OOD detection, and thus alleviate the dependence on specific data augmentations or self-supervision. Moreover, CIDER explicitly models the latent representations as vMF distributions, providing a direct and clear geometrical interpretation of hyperspherical embeddings. Closest to our work, Du et al. (2022) recently explores shaping representations into vMF distributions for object-level OOD detection. However, they do not consider the inter-class dispersion loss, which we show is crucial to achieve strong OOD detection performance. Deep metric learning. Learning a desirable embedding is a fundamental goal in the deep metric learning community. Various losses have been proposed for face verification (Deng et al., 2019; Liu et al., 2017; Wang et al., 2018) , person re-identification (Chen et al., 2017; Xiao et al., 2017) , and image retrieval (Kim et al., 2019; Movshovitz-Attias et al., 2017; Oh Song et al., 2016; Teh et al., 2020) . However, none of the works focus on desirable embeddings for OOD detection. The difference between CIDER and proxy-based methods has been discussed in Remark 2.

6. CONCLUSION AND OUTLOOK

In this work, we propose CIDER, a novel representation learning framework that exploits hyperspherical embeddings for OOD detection. CIDER jointly optimizes the dispersion and compactness losses to promote strong ID-OOD separability. We show that CIDER achieves superior performance on common OOD benchmarks, including large-scale OOD detection tasks. Moreover, we introduce new measurements to quantify the hyperspherical embedding, and establish the relationship with OOD detection performance. We conduct extensive ablations to understand the efficacy and behavior of CIDER under various settings and hyperparameters. We hope our work can inspire future methods of exploiting hyperspherical representations for OOD detection. Remark 1: The prototype update rule. The class prototypes are only updated by exponential moving average (EMA). Since the prototypes are not learnable parameters, the gradients of the dispersion loss have no direct impact on their updates. EMA-style techniques have been used in prior works Li et al. (2020) , and can be rigorously interpreted from a clustering-based Expectation-Maximization (EM) perspective. Alternatively, the prototypes can also be updated via gradients without EMA. We provide an ablation study of CIDER based on different prototype update rules in Appendix C. Lcomp = -b i=1 log exp(z ⊤ i µ c(i) /τ ) C j=1 exp(z ⊤ i µ j /τ ) // calculate dispersion loss 10 L dis = 1 C C i=1 log 1 C-1 C j=1 1{j ̸ = i}e µ ⊤ i µ j /τ // Remark 2: CIDER vs. Wang & Isola (2020) . The notion of alignment and uniformity for contrastive losses were proposed in Wang & Isola (2020) for the unsupervised setting where both metrics are based on individual samples. CIDER is a contrastive loss designed for the supervised setting. In particular, the uniform loss in Wang & Isola (2020) is defined based on randomly sampled pairs of data and promotes an instance-to-instance uniformity on the hypersphere. The notion of uniformity is fundamentally different from CIDER, which promotes prototype-to-prototype dispersion. Remark 3: CIDER vs. Cross Entropy. When a model is trained with the cross-entropy (CE) loss, the weight matrix of the last fully connected layer can be interpreted as the set of class prototypes. However, CE is suboptimal for OOD detection for two main reasons: (1) CE does not explicitly optimize for the intra-class compactness and inter-class dispersion in the feature space. As a consequence, the embeddings obtained by CE loss display insufficient compactness and dispersion (Table 4 ). Compared to CE, CIDER is more structured by exploiting the hyperspherical embeddings and explicitly optimizing towards the desirable properties for OOD detection and ID classification. (2) the feature space obtained by CE loss is Euclidean instead of hyperspherical. As shown in Tack et al. (2020) , ID data tend to have a larger norm than OOD data. As a result, the Euclidean distance between ID features can be larger than the distance from OOD to ID data. A recent work (Sun et al., 2022) verified that Euclidean embedding without feature normalization leads to suboptimal OOD detection performance. Instead, CIDER is designed to optimize hyperspherical embeddings, which benefit OOD detection. Remark 4: On the measurement of embedding quality. In Section 4.3, we provide measurements of embedding quality (inter-class dispersion and intra-class compactness) via prototypes. In practice, one can replace the class prototypes in the dispersion and compactness metrics to be one random sample (or the average of a random subset) from the corresponding class. For example, on CIFAR-10, as the embeddings of CIDER are compact, we observe that the two metrics give similar results (e.g., the ID-OOD Separability is 42.5 with prototype-based metrics vs. 38.3 with instance-based metrics). We choose to use prototypes because (1) the definition directly maps to our loss function design, and (2) prototypes are calculated as the averaged feature for each class, which helps to mitigate the sampling bias.

B EXPERIMENTAL DETAILS

Software and hardware. All methods are implemented in Pytorch 1.10. We run all the experiments on NVIDIA GeForce RTX-2080Ti GPU for small to medium batch size and on NVIDIA A100 GPU for large batch size and larger network encoder. Architecture. As shown in Figure 1 , the overall architecture of CIDER consists of a projection head h on top of a deep neural network encoder f . Following common practice and fair comparison with prior works (Sehwag et al., 2021) , (Khosla et al., 2020) , we fix the output dimension of the projection head to be 128. We use a two-layer non-linear projection head for CIFAR-10 and CIFAR-100 as in Sun et al. (2022) . Training. For methods based on pre-trained models such as MSP (Hendrycks & Gimpel, 2017) , ODIN (Liang et al., 2018) , Mahalanobis (Lee et al., 2018) , and Energy (Liu et al., 2020) , we follow the configurations in Sun et al. (2022) for CIFAR-10 and train with the cross-entropy loss for 100 epochs. The initial learning rate is 0.1 and decays by a factor of 10 at epochs 50, 75, and 90 respectively. For the more challenging dataset CIFAR-100, we train 200 epochs. We use stochastic gradient descent with momentum 0.9, and weight decay 10 -4 . For fair comparison, methods involving contrastive learning (Winkens et al., 2020) , (Tack et al., 2020) , (Sehwag et al., 2021) are trained for 500 epochs on CIFAR-10 and CIFAR-100. For CIDER, we adopt the same key hyperparameters for contrastive losses such as initial learning rate (0.5), temperature (0.1), and batch size (512) as SSD+ (Sehwag et al., 2021) in main experiments to demonstrate the effectiveness and simplicity of CIDER. For the prototype update factor α, we set the default value as 0.95 for simplicity. We observed that α = 0.95 on CIFAR-10 with ResNet-18 and α = 0.5 on CIFAR-100 with ResNet-34 provide stronger performance. OOD detection score. By default, we use the non-parametric KNN score Sun et al. (2022) . We use a larger K = 300 for CIFAR-100 and a smaller K = 100 for CIFAR-10 for simplicity. Adjusting K ∈ {10, 20, 50, 100, 200, 300, 500} yields similar performance. In practice, K can be tuned using a validation method (Sun et al., 2022) to further improve the performance.

C ADDITIONAL ABLATION STUDIES

CIDER is effective under various batch sizes. Figure 6a and 6b also indicate that CIDER remains competitive under different batch size configurations compared to SupCon. To explain this, the standard SupCon loss requires instance-to-instance distance measurement, whereas compactness loss reduces the complexity to instance-to-prototype. The class-conditional prototypes are updated during training, which capture the average statistics of each class and alleviate the dependency on the batch size. This leads to an overall memory-efficient solution for OOD detection. Ablation on the loss weights. In the main results (Table 1 ), we demonstrate the effectiveness of CIDER where the loss weight λ c is simply set to balance the initial scale between the L dis and L comp . In fact, CIDER can be further improved by adjusting λ c . As shown in Figure 7a , the performance of CIDER is relatively stable for moderate adjustments of λ c (e.g. 0.5 to 2), with the best performance at around λ c ∈ [1, 2] . This indicates CIDER provides a simple and effective solution for improving OOD detection, without much need for hyperparameter tuning on the loss scale. Adjusting prototype update factor α improves CIDER. We show in Figure 7b the performance by varying the moving-average discount factor α in Eq. 8. We can observe that the detection performance (averaged over 5 test sets) is still competitive across a wide range of α. In particular, for CIFAR-100, α = 0.5 results in the best performance with average FPR95 of 46.89% under KNN score. For CIFAR-10, we observe that a larger α (e.g. 0.95 to 0.99) results in stronger performance. Ablation on the learning rate. Prior works (Khosla et al., 2020; Sehwag et al., 2021) use a default initial learning rate (lr) of 0.5 to train contrastive losses, which is also the default setting of CIDER. We further investigate the impact of the initial learning rate on OOD detection. As shown in Figure 8a , a relatively higher initial lr is indeed desirable for competitive performance while too small lr (e.g. 0.1) would lead to performance degradation. Small temperature τ leads to better performance. Figure 8b demonstrates the detection performance as we vary the temperature parameter τ . We observe that the OOD detection performance is desirable at a relatively small temperature. Complementary to our finding, a relatively small temperature is shown to be desirable for ID classification (Khosla et al., 2020; Wang & Liu, 2021b) which penalizes hard negative samples with larger gradients and leads to separable features. Ablation on network capacity. We verify the effectiveness of CIDER under networks with other architectures such as ResNet-50 for CIFAR-100. The results are shown in Figure 9 . The trend is similar to what we observed with ResNet-34. Specifically, as a result of the improved representation, training with CIDER improves the FPR95 for various test sets compared to training with the SupCon loss. Ablation on gradient-based prototype update. We examine the effect of updating prototypes via gradients. Compared to CIDER with EMA, CIDER with learnable prototypes (LP) can be more sensitive to initialization. We report the average performance of CIDER (EMA) and CIDER (LP) across 3 independent runs for CIFAR-10 in Table 5 . All training and evaluation configurations (e.g., learning rate and batch size) are the same. We can see that CIDER with EMA improves the average FPR95 by 5.08% with smaller standard deviation. Therefore, we empirically verify that updating prototypes via EMA is a better option with stronger training stability in practice. Stability of CIDER. To verify that CIDER consistently provides strong performance, we train with 3 independent seeds for each ID dataset. Table 6 shows the OOD detection performance of CIDER with ResNet-18 trained on CIFAR-10 and ResNet-34 trained on CIFAR-100. Comparing Table 1 and Table 7 , we can see that CIDER yields consistently strong performance. Code and checkpoints are provided in https://github.com/deeplearning-wisc/cider. 

D RESULTS ON LARGE-SCALE DATASETS

In recent years, there has been a paradigm shift towards fine-tuning pre-trained models, as opposed to training from scratch. Given this trend, it is important to explore whether CIDER remains effective based on pre-trained models. Specifically, we fine-tune ImageNet pre-trained ResNet-34 on ImageNet-100 with CIDER and SupCon losses for 10 epochs. For each loss, we update the weights of the last residual block and the nonlinear projection head, while freezing the parameters in the first three residual blocks. At test time, we use the same detection score (KNN) to evaluate representation quality. FPR95 and AUROC for each OOD test set are shown in Figure 10a and 10b, respectively. The results suggest that CIDER remains very competitive, which highlight the benefits of promoting inter-class dispersion and intra-class compactness. E RESULTS ON CIFAR-10 In the main paper, we mainly focus on the more challenging task CIFAR-100. In this section, we additionally evaluate on CIFAR-10, a commonly used benchmark in literature. For methods involving contrastive losses, we use the same network encoder and embedding dimension, while only varying the training objective. The Mahalanobis score is used for OOD detection in SSD+ (Sehwag et al., 2021) , CE+SimCLR (Winkens et al., 2020) , SupCon, and CIDER. As CIFAR-10 is much less challenging compared to CIFAR-100, recent methods with contrastive losses yield similarly strong performance. For methods trained with cross-entropy loss, we use the publicly available checkpoints in Sun et al. (2022) for better consistency. The results are shown in Table 7 . Similar trends also hold as we describe in Section 4.2: (1) CIDER achieves superior OOD detection performance in CIFAR-10 as a result of better inter-class dispersion and intra-class compactness. For example, compared to the Mahalanobis baseline (Lee et al., 2018) , CIDER reduces the FPR95 by 24.99% averaged over 5 diverse test sets; (2) Although the ID classification accuracy of CIDER is similar to another proxy-based loss ProxyAnchor (Table 8 ), CIDER significantly improves the OOD detection performance by 21.05% in FPR95 due to the addition of explicit inter-class dispersion which we show is critical for OOD detection in Section 4.3. The significant improvements highlight the importance of representation learning for OOD detection.

F ID CLASSIFICATION ACCURACY

The ID classification accuracy on CIFAR-10 and CIFAR-100 can be seen in Table 8 and Table 9 , where for contrastive losses such as KNN+, SSD+, and CIDER, we follow the common practice as in Khosla et al. (2020) 



Figure 1: Overview of our compactness and dispersion regularized (CIDER) learning framework for OOD detection. We jointly optimize two complementary terms to encourage desirable properties of the embedding space: (1) a dispersion loss to encourage larger angular distances among different class prototypes, and (2) a compactness loss to encourage samples to be close to their class prototypes.

Figure 3: (a): UMAP (McInnes et al., 2018) visualization of the features when the model is trained with CE vs. CIDER for CIFAR-10 (ID). (b): CIDER makes OOD samples more separable from ID compared to CE (c.f. Table4).

Figure 4: Fine-tuning pre-trained ResNet-34 on ImageNet-100 (ID).

Figure 5: OOD Detection performance across different batch sizes.

Figure 6: Ablation on CIDER v.s. SupCon loss under different batch sizes. The results are averaged across the 5 OOD test sets based on ResNet-34. CIDER outperforms SupCon across different batch sizes, suggesting the effectiveness of explicitly facilitating prototype-wise dispersion.

Figure 7: Ablation on (a) weight λc of the compactness loss; (b) prototype update discount factor α. The results are based on CIFAR-100 (ID) averaged over 5 OOD test sets.

Figure 8: Ablation on (a) initial learning rate and (b) temperature. The results are based on CIFAR-100 (ID) averaged over 5 OOD test sets.

Figure 9: Ablation on architecture. Results are based on ResNet-50.

Table 1 contains a wide range of competitive methods for OOD detection. All methods are trained on ResNet-34 using CIFAR-100, without assuming access to auxiliary outlier datasets. For clarity, we divide the methods into two categories: trained with and without contrastive losses. For pre-trained model-based scores such as MSP (Hendrycks OOD detection performance for CIFAR-100 (ID) with ResNet-34. Training with CIDER significantly improves OOD detection performance.

Ablation on OOD detection score. Results are FPR95 on CIFAR-100 (ID) with ResNet-34. We evaluate both Mahalanobis and KNN score (K = 300). Here we examine the effects of loss components on OOD detection. As shown in Table3, we have the following observations: (1) For ID classification, training with L comp alone leads to an accuracy of 75.19%, similar to the ID accuracy of SSD+ (75.11%). This suggests that promoting intra-class compactness and a moderate level of inter-class dispersion (as a result of sample-to-prototype negative pairs in L

Ablation study on loss component. Results (in AUROC) are based on CIFAR-100 trained with ResNet-34. Training with only L comp suffices for ID classification. Inter-class dispersion induced by L dis is key to OOD detection.

Compactness and dispersion of CIFAR-10 feature embedding, along with the separability w.r.t. each OOD test set. We convert cosine similarity to angular degrees for better readability.

. Recently,Li et al. (2021b)  propose a prototype-based contrastive loss for unsupervised learning where prototypes are generated via a clustering algorithm, while our method is supervised where prototypes are updated based on labels.Li et al. (2021a)  incorporate a prototype-based loss to tackle data noise.Wang & Isola (2020)   analyze the asymptotic behavior of contrastive losses theoretically, whileWang & Liu (2021a)   empirically investigate the properties of contrastive losses for classification. Recently,Bucci et al. (2022) investigates contrastive learning for open-set domain adaptation. However, none of the works focus on OOD detection. We aim to fill the gap and facilitate the design and understanding of contrastive losses for OOD detection.

calculate overall loss

Ablation on prototype update rules. OOD detection performance for ResNet-18 trained on CIFAR-10 with EMA-style updates denoted as CIDER (EMA) vs. learnable prototypes denoted as CIDER (LP). CIDER with EMA demonstrates strong OOD detection performance. Results are averaged over 3 independent runs. ±1.50 99.55 ±0.48 28.13 ±1.59 94.53 ±0.12 5.23 ±2.68 98.16 ±0.99 36.47 ±8.93 94.59 ±0.92 16.25 ±1.07 97.38 ±0.19 17.65 ±2.36 96.84 ±0.40 CIDER (EMA) 3.04 ±1.38 99.50 ±0.30 26.60 ±2.47 94.64 ±0.51 4.10 ±1.68 99.14 ±0.19 15.94 ±4.56 97.10 ±0.54 13.19 ±0.82 97.39 ±0.48 12.57 ±1.31 97.56 ±0.33

Ablation on stability. OOD detection performance of CIDER for CIFAR-10 and CIFAR-100. Results are averaged over 3 independent runs. ±1.38 99.50 ±0.30 26.60 ±2.47 94.64 ±0.51 4.10 ±1.68 99.14±0.19 15.94 ±4.56 97.10 ±0.54 13.19 ±0.82 97.39 ±0.48 12.57 ±1.31 97.56±0.33  CIFAR-100 23.67 ±2.28 95.07 ±0.13 79.37 ±1.84 72.97 ±3.90 22.04 ±5.12 96.01 ±1.80 62.16 ±8.48 83.70 ±2.92 44.96 ±6.01 90.25 ±0.97 46.45 ±2.01 87.60 ±1.03

and use linear probe on normalized features. Results on CIFAR-10. OOD detection performance for ResNet-18 trained on CIFAR-10 with and without contrastive loss. CIDER achieves strong OOD detection performance and ID classification accuracy (Table8).

ID classification accuracy on CIFAR-10 (%)

ID classification accuracy on CIFAR-100 (%)

ACKNOWLEDGEMENT

We gratefully acknowledge the support of AFOSR Young Investigator Award under No. FA9550-23-1-0184; Philanthropic Fund from SFF; and faculty research awards from Google, Meta, and Amazon. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements either expressed or implied, of the sponsors.

APPENDIX A ALGORITHM DETAILS AND DISCUSSIONS

The training scheme of our compactness and dispersion regularized (CIDER) learning framework is shown in Algorithm 1. We jointly optimize: (1) a compactness loss to encourage samples to be close to their class prototypes, and (2) a dispersion loss to encourage larger angular distances among different class prototypes.Algorithm 1: Pseudo-code of CIDER. 

