A FUNCTIONAL PERSPECTIVE ON MULTI-LAYER OUT-OF-DISTRIBUTION DETECTION Anonymous

Abstract

A crucial component of reliable classifiers is the ability to detect examples far from the reference (training) distribution, referred to as out-of-distribution (OOD) samples. A key feature of OOD detection is to exploit the pre-trained multi-layer classifier by extracting statistical patterns and relationships across its layers. Despite achieving solid results, state-of-the-art methods require either additional OOD examples, expensive computation of gradients, or are tied to a particular architecture, limiting their applications. This work adopts an original approach based on a functional view of the network that exploits the sample's trajectory through the various layers and its statistical dependencies. In this new framework, OOD detection translates into detecting samples whose trajectories differ from the typical behavior characterized by the training set. Our method significantly decreases the OOD detection error of classifiers trained on ImageNet and outperforms state-of-the-art methods on average AUROC and TNR at 95% TPR. We demonstrate that the functional signature left by a sample in a network carries relevant information for OOD detection.

1. INTRODUCTION

The ability of a Deep Neural Network (DNN) to generalize to new data is mainly restricted to the concepts present in the training dataset. In real-world scenarios, Machine Learning (ML) models may encounter Out-Of-Distribution (OOD) samples, such as data belonging to novel concepts (classes) (Pimentel et al., 2014), abnormal samples (Tishby & Zaslavsky, 2015), or even carefully crafted attacks designed to exploit the model (Szegedy et al., 2013). The behavior of ML systems on unseen data is of great concern for safety-critical applications (Amodei et al., 2016b;a), such as medical diagnosis in healthcare (Subbaswamy & Saria, 2020) and autonomous vehicle control in transportation (Bojarski et al., 2016), among others. To address safety issues arising from the presence of OOD samples, a successful line of work augments ML models with an OOD binary detector that distinguishes between abnormal and in-distribution examples (Hendrycks & Gimpel, 2017). An analogy for the detector is the human body's immune system, whose task is to differentiate between antigens and the body itself. Distinguishing OOD samples is challenging. Some previous works developed detectors by combining scores from the various layers of the multi-layer pre-trained classifier (Sastry & Oore, 2020; Lee et al., 2018; Gomes et al., 2022; Huang et al., 2021). These detectors require either a held-out OOD dataset (e.g., adversarially generated or true OOD data) or ad-hoc methods, tied to a particular architecture, to linearly combine the OOD scores computed on each layer embedding. A key observation is that existing aggregation techniques overlook the sequential nature of the underlying problem and thus limit the discriminative power of those methods. Indeed, an input sample passes consecutively through each layer and generates a highly correlated signature that can be statistically characterized.
Our observations in this work motivate the statement that: The input's trajectory through a network is key for discriminating typical from atypical samples. In this paper, we introduce a significant change of perspective. Instead of looking at each layer score independently, we cast the scores into a sequential representation that captures the statistical trajectory of an input sample through the various layers of a multi-layer neural network. To this end, we adopt a functional point of view by considering the sequential representations as curves parametrized by the layers. Consequently, we redefine OOD detection as detecting samples whose trajectories are abnormal (or atypical) compared to reference trajectories characterized by the training set. Our method, which requires little parameter tuning and, perhaps more importantly, no additional OOD or synthetic data, can identify OOD samples from their trajectories. Furthermore, we show that typical multivariate detection methods fail to detect OOD patterns, which may manifest in an isolated fashion by shifts in magnitude or overall shape. Figure 1 summarizes our method.

Contributions. This work presents a new principle and an unsupervised method for detecting OOD samples that requires no OOD (or extra) data and brings novel insights into the problem of OOD detection. Our main contributions can be summarized as follows.

1. A novel problem formulation. We reformulate the problem of OOD detection through a functional perspective that effectively captures the statistical dependencies of an input sample's path across a multi-layer neural classifier. Moreover, we propose a map from the multivariate feature space (at each layer) to a functional space that relies on the probability-weighted projection of the test sample onto the class-conditional training prototypes at that layer. It is computationally efficient and straightforward to implement.

2. Computing OOD scores from trajectories. We compute the inner product between the test trajectory and the average training trajectory to measure the similarity of the input w.r.t. the training set. Low similarity indicates that the test sample is likely OOD.

3. Empirical evaluation. We validate the proposed method on a mid-size OOD detection benchmark on ImageNet-1k. We obtain competitive results, demonstrating an average AUROC gain of 3.7% across three architectures and four OOD datasets. We release our code anonymized online.

2. RELATED WORKS

This section briefly discusses prior work on OOD detection, highlighting confidence-based and feature-based methods without special training, as they resonate the most with our work. Another thread of research relies on re-training to learn representations adapted to OOD detection (Mohseni et al., 2020; Bitterwolf et al.; Mahmood et al., 2021), either through contrastive training (Hendrycks et al., 2019; Winkens et al., 2020; Sehwag et al., 2021), regularization (Lee et al., 2021; Nandy et al., 2021; Hein et al., 2019; Du et al., 2022), generative (Schlegl et al., 2017; ood, 2019; Xiao et al.; Ren et al.; Zhang et al., 2021), or ensemble (Vyas et al., 2018; Choi & Jang, 2018) based approaches. Related subfields are open set recognition (Geng et al., 2021), novelty detection (Pimentel et al., 2014), anomaly detection (Chandola et al., 2009; Chalapathy & Chawla, 2019), outlier detection (Hodge & Austin, 2004), and adversarial attack detection (Akhtar & Mian, 2018). Confidence-based OOD detection. A natural measure of uncertainty about a sample's label is the classification model's softmax output. Hendrycks & Gimpel (2017) observed that the maximum of the softmax output could be used as a discriminative score between in-distribution and OOD samples, while Hein et al. (2019) observed that it may still assign overconfident values to OOD examples. Liang et al. (2018) and Hsu et al. (2020) propose re-scaling the softmax response with a temperature value and a pre-processing technique that further separates in- from out-of-distribution examples. Liu et al. (2020) propose an energy-based OOD detection score, replacing the softmax confidence score with the free energy function. Sun & Li (2022) propose sparsifying the classification-layer weights to improve OOD detection by regularizing predictions. Feature-based OOD detection. Other scores are computed on the latent representations, e.g., through class-conditional Gaussian models (Lee et al., 2018; Ren et al., 2021) or further information geometry tools (Gomes et al., 2022).
Efforts toward combining multiple features to improve performance have been explored previously (Lee et al., 2018; Sastry & Oore, 2020; Gomes et al., 2022). These strategies rely heavily on additional data for tuning the detector or focus on specific model architectures, which are limiting factors in real-world applications.

3. PRELIMINARIES

We start by recalling the general setting of the OOD detection problem from a mathematical point of view (Section 3.1). Then, in Section 3.2, we motivate our method through a simple yet clarifying illustrative example showcasing the limitation of previous works and how we approach the problem.

3.1. BACKGROUND

Let (X, Y) be a random variable valued in a space X × Y with unknown probability density function (pdf) p_XY and probability distribution P_XY. Here, X ⊆ R^d represents the covariate space and Y = {1, …, C} corresponds to the labels attached to elements from X. The training dataset S_N = {(x_i, y_i)}_{i=1}^N consists of independent and identically distributed (i.i.d.) realizations of P_XY. In this formulation, detecting OOD samples boils down to building a binary rule g : X → {0, 1} through a soft scoring function s : X → R and a threshold γ ∈ R. Namely, a new observation x ∈ X is considered in-distribution, i.e., generated by P_XY, when g(x) = 0, and OOD when g(x) = 1. Finding this rule g directly on X can become intractable when the dimension d is large. Thus, previous work relies on a multi-layer pre-trained classifier f_θ : X → Y defined as f_θ(·) = (h ∘ f_L ∘ f_{L−1} ∘ ⋯ ∘ f_1)(·), with L ≥ 1 layers, where f_ℓ : R^{d_{ℓ−1}} → R^{d_ℓ} is the ℓ-th layer of the multi-layer neural classifier, d_ℓ denotes the dimension of the latent space induced by the ℓ-th layer (d_0 = d), and h is the classification head that outputs the logits. We also define z_ℓ = (f_ℓ ∘ ⋯ ∘ f_1)(x) as the latent vectorial representation at the ℓ-th layer for an input sample x. To homogenize notation, we refer to the logits as z_{L+1} and to h as f_{L+1}. It is worth emphasizing that the trajectory (z_1, z_2, …, z_{L+1}) corresponding to a test input x_0 consists of dependent random variables whose joint distribution strongly depends on the underlying distribution of the input. Therefore, the design of the function g(·) is typically based on three key steps: (i) a similarity measure d(· ; ·) (e.g., cosine similarity, Mahalanobis distance, etc.) between a sample and a population is applied at each layer to measure the similarity (or dissimilarity) of a test input x_0 at the ℓ-th layer, z_{ℓ,0} = (f_ℓ ∘ ⋯ ∘ f_1)(x_0), w.r.t. the population of training examples observed at the same layer, {z_ℓ = (f_ℓ ∘ ⋯ ∘ f_1)(x) : x ∈ S_N}; (ii) the layer-wise scores obtained are mapped to the real line, collecting the OOD scores; (iii) lastly, a threshold is chosen to build the final decision function.
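As an illustration, the three steps above can be sketched in a few lines of numpy. The Gaussian toy features, the cosine similarity at each layer, and the simple averaging aggregation are all hypothetical choices for exposition, not the method proposed in this paper:

```python
import numpy as np

def cosine_score(z, train_feats):
    """Step (i): similarity of a test embedding z to the mean training embedding."""
    mu = train_feats.mean(axis=0)
    return float(np.dot(z, mu) / (np.linalg.norm(z) * np.linalg.norm(mu) + 1e-12))

def detect(layer_scores, aggregate, gamma):
    """Steps (ii) and (iii): map the layer scores to one scalar, then threshold.
    Returns 1 (OOD) when the aggregated similarity falls below gamma."""
    s = aggregate(layer_scores)
    return int(s <= gamma)

rng = np.random.default_rng(0)
train = [rng.normal(1.0, 0.1, size=(100, 8)) for _ in range(3)]  # 3 toy layers
z_in = [rng.normal(1.0, 0.1, size=8) for _ in range(3)]          # in-distribution
z_out = [rng.normal(-1.0, 0.1, size=8) for _ in range(3)]        # shifted OOD

scores_in = [cosine_score(z, t) for z, t in zip(z_in, train)]
scores_out = [cosine_score(z, t) for z, t in zip(z_out, train)]
agg = lambda s: sum(s) / len(s)   # a naive aggregation map for step (ii)
print(detect(scores_in, agg, 0.5), detect(scores_out, agg, 0.5))
```

The design choices in step (ii) are exactly where existing methods diverge, as discussed next.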

A fundamental question remains in step (ii):

How to consistently leverage the information collected from multiple layers outputs in an unsupervised way, i.e., without resorting to OOD or pseudo-OOD examples?

3.2. FROM INDEPENDENT MULTI-LAYER SCORES TO A SEQUENTIAL PERSPECTIVE OF OOD DETECTION

Previous multi-feature OOD detection works treat step (ii) as a supervised learning problem (Lee et al., 2018; Gomes et al., 2022) whose solution is a linear binary classifier. The objective is to find a linear combination of the scores obtained at each layer that sufficiently separates in-distribution from OOD samples. A held-out OOD dataset is collected from true (or pseudo-generated) OOD samples. The linear soft novelty scoring function s_α writes:

s_α(x_0) = Σ_{ℓ=1}^{L} α_ℓ · d(x_0; {(f_ℓ ∘ ⋯ ∘ f_1)(x) : x ∈ S_N}).

The shortcomings of this approach are the need for extra data or ad-hoc parameters, which results in decision boundaries that underfit the problem and fail to capture certain types of OOD samples. To illustrate this phenomenon, we designed a toy example (see Figure 2a) where scores are extracted from two features by fitting a linear discriminator on held-out in-distribution (IND) and OOD samples. As a consequence, areas of unreliable predictions arise, where OOD samples cannot be detected due to the misspecification of the linear model. One could simply introduce a non-linear discriminator that better captures the geometry of the data for this 2D toy example. However, this becomes challenging as we move to higher dimensions with limited data. By reformulating the problem from a functional data point of view, we can identify trends and typicality in the trajectories extracted by the network from the input. Figure 2b shows the dispersion of trajectories coming from in-distribution and OOD samples. These patterns are extracted from multiple latent representations and aligned in a time-series-like object. We observed that trajectories coming from OOD samples exhibit a different shape when compared to typical trajectories from training data. Thus, to determine if an instance belongs to the in-distribution, we can test whether the observed path is similar to the functional reference trajectory extracted from the training set.
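A minimal sketch of this supervised aggregation baseline, assuming synthetic layer scores and a least-squares fit standing in for a full linear discriminator (both hypothetical choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# Per-layer scores for held-out IND (label 0) and OOD (label 1) samples (toy data).
ind = rng.normal(1.0, 0.2, size=(50, 4))
ood = rng.normal(0.0, 0.2, size=(50, 4))
X = np.vstack([ind, ood])
y = np.concatenate([np.zeros(50), np.ones(50)])

# Least-squares fit of the linear weights alpha (bias folded in as a column of ones).
Xb = np.hstack([X, np.ones((100, 1))])
alpha, *_ = np.linalg.lstsq(Xb, y, rcond=None)

s = Xb @ alpha                  # soft score s_alpha(x)
pred = (s > 0.5).astype(int)
print((pred == y).mean())       # high accuracy on this easily separable toy data
```

The point of the toy example in the text is precisely that such a linear fit, while accurate on the held-out OOD data it was tuned on, can misfire on OOD samples drawn from elsewhere.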

4. FUNCTIONAL OUT-OF-DISTRIBUTION DETECTION

This section presents our OOD detection framework, which applies to any pre-trained multi-layer neural network with no requirements for OOD samples. We describe our method through two key steps: functional representation of the input sample (see Section 4.1) and test time OOD score computation (see Section 4.2).

4.1. TOWARDS A FUNCTIONAL REPRESENTATION

The first step to obtain a univariate functional representation of the data from the multivariate hidden representations is to reduce each feature map to a scalar value. To do so, we first compute the class-conditional training population prototypes defined by:

μ_{ℓ,y} = (1/N_y) Σ_{i : y_i = y} z_{ℓ,i},   where N_y = |{i ∈ {1, …, N} : y_i = y}|, 1 ≤ ℓ ≤ L+1, and z_{ℓ,i} = (f_ℓ ∘ ⋯ ∘ f_1)(x_i).

Given an input example, we compute the probability-weighted scalar projection between its features (including the logits) and the training class-conditional prototypes, resulting in L+1 scalar scores:

d_ℓ(x; M_ℓ) = Σ_{y=1}^{C} σ_y(x) · proj_{μ_{ℓ,y}} z_ℓ = Σ_{y=1}^{C} σ_y(x) ‖z_ℓ‖ cos ∠(z_ℓ, μ_{ℓ,y}),

where M_ℓ = {μ_{ℓ,y} : y ∈ Y}, ‖·‖ is the ℓ2-norm, ∠(·,·) is the angle between two vectors, and σ_y(x) is the softmax output on the logits f_θ(x) for class y. Hence, our layer-wise scores rely on the notions of vector length and angle between vectors, which can be generalized to any n-dimensional inner product space without imposing geometrical constraints. It is worth emphasizing that our layer score has advantages compared to the class-conditional Gaussian model first introduced in Lee et al. (2018) and the Gram-matrix-based method introduced in Sastry & Oore (2020). Our layer score encompasses a broader class of distributions, as we do not assume a specific underlying probability distribution. We avoid computing covariance matrices, which are often ill-conditioned for latent representations of DNNs. Since we do not store covariance matrices, our functional approach has a negligible overhead in terms of memory requirements. Also, our method can be applied to any vector-based hidden representation, not being restricted to matrix-based representations as in Sastry & Oore (2020). Thus, our approach applies to a broader range of models, including transformers.
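The layer score d_ℓ can be sketched as follows, with random prototypes and a fixed softmax vector standing in for a real network's statistics (illustrative assumptions only):

```python
import numpy as np

def layer_score(z, prototypes, probs):
    """Probability-weighted scalar projection of the feature z onto the class
    prototypes: sum_y sigma_y(x) * ||z|| * cos(angle(z, mu_{l,y}))."""
    proj = prototypes @ z / np.linalg.norm(prototypes, axis=1)  # scalar projections
    return float(probs @ proj)

rng = np.random.default_rng(0)
C, d = 3, 16
prototypes = rng.normal(size=(C, d))      # one prototype per class at this layer
probs = np.array([0.8, 0.1, 0.1])         # softmax output sigma(x)

z_aligned = 2.0 * prototypes[0]           # feature aligned with the predicted class
z_random = rng.normal(size=d)             # feature unrelated to any prototype

s_in = layer_score(z_aligned, prototypes, probs)
s_out = layer_score(z_random, prototypes, probs)
print(s_in > s_out)
```

A feature well aligned with the prototype of the predicted class yields a large positive projection, whereas an arbitrary feature does not, which is the behavior the score exploits.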
By computing the scalar projection at each layer, we define the following functional neural-representation extraction function:

ϕ : X → R^{L+1},  x ↦ (d_1(x; M_1), …, d_{L+1}(x; M_{L+1})).  (3)

Thus, we can map samples from the representation space to a functional space while retaining information on the typicality of the sample w.r.t. the training dataset. We apply ϕ to each training input x_i to obtain the representation of the training sample across the network, u_i = ϕ(x_i). We consider the vectors u_i, i ∈ {1, …, N}, as curves parameterized by the layers of the network. We build a training dataset U = {u_i}_{i=1}^N from these functional representations that will be useful for detecting OOD samples at test time. We then rescale the training trajectories w.r.t. the maximum value found at each coordinate so that the layer-wise scores share the same scale. Hence, letting max(U) := [max_i u_{i,1}, …, max_i u_{i,L+1}]^⊤, we compute a reference trajectory ū for the entire training dataset, defined in Equation 4, that serves as a global typical reference against which test trajectories are compared:

ū = (1/N) Σ_{i=1}^{N} u_i / max(U),  (4)

where the division is taken coordinate-wise.

4.2. COMPUTING THE OOD SCORE AT TEST TIME

At inference time, we first rescale the test sample's trajectory as we did for the training reference: φ(x) = ϕ(x) / max(U). Then, we compute a similarity score w.r.t. this typical reference, which will be our OOD score. As the metric, we again choose the scalar projection of the test vector onto the training reference. In practical terms, it boils down to the inner product between the test sample's trajectory and the training set's typical reference trajectory, since the norm of the average trajectory is constant across test samples. Mathematically, our scoring function s : X → R writes:

s(x; ū) = ⟨φ(x), ū⟩ = Σ_{j=1}^{L+1} φ(x)_j ū_j,

which is bounded by the Cauchy–Schwarz inequality.
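The rescaling, the reference trajectory of Equation 4, and the inner-product score can be sketched on synthetic trajectories (the toy values below are illustrative, not network outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy per-layer scores u_i for N = 200 training samples over L + 1 = 5 layers.
U = rng.uniform(1.0, 2.0, size=(200, 5))

scale = U.max(axis=0)               # max(U): per-coordinate scaling vector
u_bar = (U / scale).mean(axis=0)    # reference trajectory (Equation 4)

def ood_score(u, scale, u_bar):
    """Inner product between the rescaled test trajectory and the reference."""
    return float((u / scale) @ u_bar)

u_test_in = U.mean(axis=0)                        # a typical trajectory
u_test_out = np.array([1.5, 0.1, 1.5, 0.1, 1.5])  # a shape anomaly

print(ood_score(u_test_in, scale, u_bar) > ood_score(u_test_out, scale, u_bar))
```

Trajectories that track the reference shape score high; trajectories that dip or oscillate where the reference does not score low, which is the behavior the detector thresholds on.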
From this OOD score, we can derive a binary classifier g by fixing a threshold γ ∈ R:

g(x; s, γ) = 1 if s(x) ≤ γ, and 0 otherwise,

where g(x) = 1 means that the input sample x is classified as out-of-distribution. Please refer to the Appendix (see Section A.1) for further details on the algorithm.
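In practice, γ can be set from in-distribution scores alone, e.g., as the 5th percentile of the training scores so that the detector operates at 95% TPR. A sketch on synthetic scores (the Gaussian score distribution is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
s_train = rng.normal(3.0, 0.5, 5000)   # OOD scores s(x) on in-distribution data

gamma = np.percentile(s_train, 5)      # keeps 95% of in-distribution samples above

def g(s, gamma):
    """Binary OOD discriminator: 1 = out-of-distribution, 0 = in-distribution."""
    return int(s <= gamma)

print(g(3.0, gamma), g(0.5, gamma))
```

Note that no OOD data is involved in choosing γ, consistent with the unsupervised setting of the method.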

5. EXPERIMENTAL SETTING

This section describes the experimental setting, including the datasets, the pre-trained DNN architectures, the evaluation metrics, and other essential information. We open-source our code at https://github.com/ood-trajectory/ood-trajectory.

5.1. DATASETS

We set ImageNet-1k (= ILSVRC2012; Deng et al., 2009) as the in-distribution dataset for all of our experiments; it is a challenging, mid-size, and realistic dataset that has recently been incorporated into OOD detection benchmarks. It contains around 1.28M training samples and 50,000 test samples belonging to 1,000 different classes. For the out-of-distribution datasets, we take the same dataset splits introduced by Huang & Li (2021). The iNaturalist (Horn et al., 2017) dataset contains over 5,000 species of plants and animals. We consider a split with 10,000 test samples with concepts from 110 classes disjoint from the in-distribution ones. The SUN (Xiao et al., 2010) dataset is a scene dataset of 397 categories. We consider a split with 10,000 randomly sampled test examples belonging to 50 categories. Places365 (Zhou et al., 2017) is also a scene dataset, with 365 different concepts. We again consider a random split with 10,000 samples from 50 disjoint categories. The DTD or Textures (Cimpoi et al., 2014) dataset is composed of textural patterns. We consider all of the 5,640 available test samples. Note that there are a few overlaps between the semantics of classes from this dataset and ImageNet. We decided to keep the entire dataset in order to be comparable with Huang et al. (2021). We provide a case study on this in Section 6.

5.2. MODELS

We ran experiments with three models of different architectures. A DenseNet-121 (Huang et al., 2017) pre-trained on ILSVRC2012 with 8M parameters and a test-set top-1 accuracy of 74.43%. We reduce the intermediate representations from the transition blocks with a max pooling operation, obtaining a final vector with dimension equal to the number of channels of each output: 128, 256, 512, and 1024, respectively. We resize input images to 224×224 pixels. A BiT-S-101 (Kolesnikov et al., 2020) with a top-1 accuracy of 77.41% and 44.5M parameters. We extract features from the outputs of layers 1 to 4 and the penultimate layer, obtaining representations with sizes 256, 512, 1024, and 2048, respectively, after max pooling. We resize input images to 480×480. We also ran experiments with a Vision Transformer (ViT-B-16; Dosovitskiy et al., 2021) trained on the ILSVRC2012 dataset with 82.64% top-1 test accuracy and 70M parameters. We take the output class tokens of layers 1 to 13 and the encoder's output as latent representations, totaling 14 features of dimension 768 each. We resize images to 224×224. We download the DenseNet and Vision Transformer weights from the PyTorch (Paszke et al., 2019) hub and the Big Transfer weights from Kolesnikov et al. (2020). All models are trained from scratch on ImageNet-1k.

5.3. EVALUATION METRICS

We evaluate the methods in terms of AUROC and TNR. The Area Under the Receiver Operating Characteristic curve (AUROC) is the area under the curve of the true positive rate plotted against the false positive rate for a varying threshold. It measures how well the OOD score can distinguish between out- and in-distribution data in a threshold-independent manner. The True Negative Rate at 95% True Positive Rate (TNR at 95% TPR, or TNR for short) is a threshold-dependent metric that reports the detector's performance at a reasonable choice of threshold. It measures the accuracy in detecting OOD samples when the accuracy of detecting in-distribution samples is fixed at 95%. For both metrics, higher is better.
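Both metrics can be computed from scratch. The sketch below assumes synthetic Gaussian score distributions and implements AUROC via the Mann-Whitney statistic:

```python
import numpy as np

def auroc(scores_in, scores_out):
    """AUROC as the probability that a random in-distribution sample scores
    higher than a random OOD sample (Mann-Whitney U statistic)."""
    s_in = np.asarray(scores_in)[:, None]
    s_out = np.asarray(scores_out)[None, :]
    return float((s_in > s_out).mean() + 0.5 * (s_in == s_out).mean())

def tnr_at_95_tpr(scores_in, scores_out):
    """TNR when the threshold keeps 95% of in-distribution samples above it."""
    thr = np.percentile(scores_in, 5)   # the 95% TPR operating point
    return float((np.asarray(scores_out) < thr).mean())

rng = np.random.default_rng(0)
s_in = rng.normal(2.0, 1.0, 1000)    # in-distribution scores (higher = more typical)
s_out = rng.normal(0.0, 1.0, 1000)   # OOD scores
print(round(auroc(s_in, s_out), 2), tnr_at_95_tpr(s_in, s_out) > 0.3)
```

The pairwise formulation makes explicit that AUROC is threshold-independent, while TNR pins the detector to one operating point.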

5.4. BASELINES' HYPERPARAMETERS

For ODIN (Liang et al., 2018), we set the temperature to 1000 and the noise magnitude to zero. We take a temperature equal to one for Energy (Liu et al., 2020). We set the temperature to one for GradNorm (GradN. for short; Huang et al., 2021). For Mahalanobis (Maha. for short; Lee et al., 2018), we take only the scores computed on the outputs of the penultimate layer of the network. MSP (Hendrycks & Gimpel, 2017) has no hyperparameters. For ReAct (Sun et al., 2021), we compute the activation clipping threshold with a percentile equal to 90. For KNN (Sun et al., 2022), we set α = 1% and k = 10. Our method is hyperparameter-free.
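For reference, the MSP and Energy baseline scores are simple functions of the logits; the sketch below uses toy logits and temperature T = 1, and is a simplified illustration rather than the exact benchmark implementation:

```python
import numpy as np

def msp_score(logits):
    """Maximum softmax probability (Hendrycks & Gimpel, 2017)."""
    z = logits - logits.max()              # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(p.max())

def energy_score(logits, T=1.0):
    """Negative free energy (Liu et al., 2020): higher = more in-distribution."""
    return float(T * np.log(np.exp(logits / T).sum()))

confident = np.array([10.0, 0.0, 0.0])     # peaked logits (in-distribution-like)
flat = np.array([1.0, 1.0, 1.0])           # flat logits (OOD-like)

print(msp_score(confident) > msp_score(flat))
print(energy_score(confident) > energy_score(flat))
```

Both baselines use only the final logits, whereas the proposed method aggregates information across all layers.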

6. RESULTS AND DISCUSSION

We report our main results in Table 1, which includes the performance for the three model architectures, four OOD datasets, and seven detection methods. Our results are consistent across the board and on average superior to previous methods, obtaining an AUROC of 90% on average (see Table 1). We also ran experiments on a CIFAR-10 benchmark, showing that our method also performs well on small image datasets. Results are deferred to the Appendix (see Section A.4).

Multivariate OOD Scores Are Not Enough. Even though well-known multivariate novelty (or anomaly) detection techniques, such as One-class SVM (Cortes & Vapnik, 1995), Isolation Forest (Liu et al., 2008), Extended Isolation Forest (Hariri et al., 2021), Local Outlier Factor (LOF; Breunig et al., 2000), k-NN approaches (Fix & Hodges, 1989), and distance-based approaches (Mahalanobis, 1936) are adapted to various scenarios, they proved inefficient for integrating the layers' information. A hypothesis explaining this failure is the strong sequential dependence pattern we noticed in the in-distribution layer-wise scores. Table 2 shows the performance of a few unsupervised aggregation methods based on a multivariate OOD detection paradigm. We tried typical methods: evaluating the Euclidean and Mahalanobis distances w.r.t. a Gaussian fit of the training set, and fitting an Isolation Forest and a One-class SVM on the training trajectory vectors. We compare the results with the performance of taking only the penultimate-layer scores, and we observe that standard multivariate aggregation fails to improve the scores for DenseNet-121.

Qualitative Evaluation of the Functional Dataset. The test in-distribution trajectories follow a well-defined trend very similar to the training distribution (see Figure 3a), while the OOD trajectories manifest mainly as shape and magnitude anomalies (w.r.t. the taxonomy of Hubert et al., 2015).
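As an example of one multivariate baseline of the kind evaluated in Table 2, a Mahalanobis distance on trajectory vectors can be sketched as follows (synthetic Gaussian trajectories; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy training trajectories: 500 samples over 4 layers, tightly concentrated.
U = rng.multivariate_normal(mean=np.full(4, 1.0),
                            cov=0.01 * np.eye(4), size=500)

mu = U.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(U, rowvar=False) + 1e-6 * np.eye(4))

def mahalanobis(u):
    """Multivariate baseline: distance of a trajectory to the training Gaussian."""
    d = u - mu
    return float(np.sqrt(d @ cov_inv @ d))

u_in = np.full(4, 1.0)
u_out = np.array([1.0, 0.0, 1.0, 0.0])   # a shape anomaly in the trajectory
print(mahalanobis(u_in) < mahalanobis(u_out))
```

Such distances treat the coordinates as an unordered feature vector; the point made above is that they discard the sequential ordering of the layers, which the functional view retains.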
These characteristics are reflected in the histogram of our detection score (see Figure 3b). The in-distribution histogram is generally symmetric, while the histogram for the OOD data is typically skewed and has a smaller average. Please refer to Appendix A.5 for additional similar figures. Figure 4a shows the label given by our OOD binary discriminator; we set as threshold the score value at 95% TPR.

Case Study. There are a few overlaps in the semantics of class names between the Textures and ImageNet datasets. In particular, "honeycombed" in Textures versus "honeycomb" in ImageNet, "stripes" vs. "zebra", "tiger", and "tiger cat", and "cobwebbed" vs. "spider web". We showed that our method decreases the false negatives in this OOD benchmark. To better understand how our method can discriminate where baselines often fail, we designed a simple case study. Take honeycombed vs. honeycomb, for instance (the first row of Fig. 4a shows four examples of this partition and a couple of examples from ImageNet). The honeycomb class from ImageNet refers to natural honeycombs, while honeycombed has a broader definition attached to artificial patterns. In this class, the Energy baseline makes 108 false-negative mistakes, while we make only 20. We noticed that some of our mistakes are aligned with real examples of honeycombs (e.g., the second example of the first row), whilst we confidently classify other patterns correctly as OOD. For the spider webs class, most examples from Textures are visually close to ImageNet. For the striped case (middle row), our method flags only 16 examples as in-distribution, but we noticed an average higher score for the trajectories in Fig. 4b. Note that, for the animal classes, the context and head are essential features for classification. Overall, the study shows that our scores are aligned with the semantic proximity between testing samples and the training set.
Potential Shortcomings of our Method. We believe this work is only a first step in paving the way for efficient post-score aggregation, as we have tackled the open and challenging problem of combining multi-layer information. However, there is room for improvement, since our metric lives in an inner product space, which is a special case of the more general structures found in Hilbert spaces that might offer more adequate metrics. Another concern is that, from a practical point of view, current third-party ML services often restrict the practitioner from accessing the intermediate outputs of the model in production. Nonetheless, the service provider has access to this information and could leverage it to deliver OOD detection as a service, for instance.

7. CONCLUSION

In this work, we introduced an original approach to OOD detection based on a functional view of the pre-trained multi-layer neural network that leverages the sample's trajectory through the layers. We shifted the problem from a multivariate to a functional perspective in a simple yet original way. Our method detects samples whose trajectories differ from the typical behavior characterized by the training set. The key ingredient is the statistical dependencies of the scores extracted at each layer, exploited by a purely unsupervised algorithm. Beyond the novelty and practical advantages of the algorithm, our results establish the value of a functional approach as an unsupervised technique for combining multiple scores, which offers an exciting alternative to the usual single-layer detection methods. We hope this work will pave a new way for future research to enhance the safety of AI systems.

A APPENDIX

A.1 ALGORITHM AND COMPUTATIONAL DETAILS

This section provides further details on the algorithm and computing resources. Algorithms 1 and 2 describe how to extract the neural functional representations from the samples and how to compute the OOD score for test samples, respectively. Note that we emphasize "functional representations" because the global behavior of the trajectory matters: the method can look at the past and future of the series, contrary to a "sequential point of view", which is restricted to the past of the series only, not the trajectory as a whole.

The body of Algorithm 2 proceeds as follows:

for ℓ ∈ {1, …, L+1} do
    z_{ℓ,0} ← (f_ℓ ∘ ⋯ ∘ f_1)(x_0)
end
u_0 ← ϕ(z_0; {μ_{ℓ,y} : y ∈ Y and ℓ ∈ {1, …, L+1}})
ũ_0 ← u_0 / max(U)
s(x_0) ← (1/‖ū‖²) Σ_{i=1}^{L+1} ũ_{0,i} ū_i
return s(x_0)

A.1.1 COMPUTING RESOURCES

We ran our experiments on an internal cluster. Since we use popular pre-trained models, retraining the deep models was not necessary; our results should be reproducible with a single GPU. Since we are dealing with ImageNet datasets (approximately 150GB), a large storage capacity is expected. We cache the features on disk to speed up downstream analysis of the algorithm, which may occupy around 200GB of storage.

A.2 ON INTRA-BLOCK TRAJECTORIES

We performed ablation studies to understand the impact of adding the intra-block convolutional layers to the trajectories. To do so, we extracted the outputs of every intermediate convolutional layer of these networks and plotted their score trajectories in Figure 5. The resulting trajectories are noisy and would not represent the data properly. As a solution, we propose smoothing the curves with a polynomial filter to extract more manageable trajectories. We also observe regions of high variance in the in-distribution trajectory inside block 2 of the DenseNet-121 model (see Fig. 5b). A further preprocessing step could filter out these high-variance features, which would be unreliable for the downstream task of OOD detection.
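Under toy assumptions (random features in place of a real network's activations, and uniform softmax outputs), Algorithms 1 and 2 condense into the following numpy sketch. With the 1/‖ū‖² normalization of Algorithm 2, the training scores average to one by construction, which serves as a sanity check:

```python
import numpy as np

rng = np.random.default_rng(0)
C, d, N, L = 3, 8, 300, 4                          # classes, feature dim, samples, layers
feats = rng.normal(0.0, 1.0, size=(L + 1, N, d))   # toy per-layer features z_{l,i}
labels = rng.integers(0, C, size=N)
probs = np.full((N, C), 1.0 / C)                   # toy softmax outputs sigma(x_i)

# Algorithm 1: prototypes, trajectories U, scaling max(U), reference u_bar.
protos = np.stack([[feats[l][labels == y].mean(0) for y in range(C)]
                   for l in range(L + 1)])         # shape (L+1, C, d)

def trajectory(z, p):
    """phi(x): probability-weighted scalar projections at every layer."""
    return np.array([p @ (protos[l] @ z[l] / np.linalg.norm(protos[l], axis=1))
                     for l in range(L + 1)])

U = np.stack([trajectory(feats[:, i], probs[i]) for i in range(N)])
scale = U.max(axis=0)
u_bar = (U / scale).mean(axis=0)

# Algorithm 2: score a test sample against the reference trajectory.
def score(z, p):
    u = trajectory(z, p) / scale
    return float(u @ u_bar / (u_bar @ u_bar))

s_train = np.array([score(feats[:, i], probs[i]) for i in range(N)])
print(abs(s_train.mean() - 1.0) < 1e-6)
```

The offline pass (Algorithm 1) is a single sweep over the training features, and the online pass (Algorithm 2) costs one projection per layer plus an (L+1)-dimensional inner product.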
In this section, we study whether the mean vectors are sufficiently informative to statistically model the class-conditional embedding features. From a statistical point of view, the average is informative if the data is compact. To address this point, we plot the median and the mean of the coordinates of the feature map and measure their difference in Figure 7. We observe that, in most dimensions, they almost superpose or are separated by a minor difference, which indicates that the data is compact and central. In addition, we show in Figure 6 that the halfspace data depth (Dyckerhoff & Mozharovskyi, 2014) of the mean vector of a given class is superior to the maximum depth of a training sample vector of the same class, suggesting the average is central, or deep, in the feature data distribution. From a practical point of view, the clear advantages of using only the mean as a reference are computational efficiency, simplicity, and interpretability. We believe future work could explore methods that better model the density of the embedding features, especially as more accurate classifiers are developed.

A.4 RESULTS ON CIFAR-10

We ran experiments with a ResNet-18 model trained on CIFAR-10 and evaluated the OOD performance of a few methods compared to ours. We extract features from the outputs of blocks 2 to 4, the penultimate layer, and the logits. The results are displayed in the table below. Our method outperforms comparable state-of-the-art methods by 2.4% in average AUROC, demonstrating that it is consistent and suitable for OOD detection on small datasets too. The classes considered for the iNaturalist dataset were:



Other metrics measuring the similarity of an input w.r.t. the population of examples can also be used.



Figure 1: The left-hand side of the figure shows the feature extraction process of a deep neural network classifier f. The mapping of the hidden representations of an input sample into a functional representation is given by a function ϕ. The functional representation of a sample encodes valuable information for OOD detection. The right-hand side of the figure shows how our method computes the OOD score s of a sample at test time: a projection of the sample's trajectory onto the training reference trajectory ū is computed. Finally, a threshold γ is set to obtain a binary discriminator g.

(a) Example of a misspecified model in a 2D toy example caused by fitting on a held-out OOD dataset. (c) Correlation between layers' training scores in a network, highlighting structure in the trajectories.

Figure 2: Figure 2a summarizes the limitation of supervised methods for aggregating layer scores that rely on held-out OOD or pseudo-OOD data: they bias the decision boundary (D.B.), which does not generalize well to other types of OOD data. We observed that in-distribution and OOD data have disparate trajectories through a network (Fig. 2b), especially on the last five features. These features are correlated in a sequential fashion, as observed in Fig. 2c.

Figure 3: Functional representation with 5% and 95% quantiles, histogram, and ROC curve for our OOD score on a DenseNet-121 model with Textures as the OOD dataset.

(a) A few examples from the Textures dataset sharing semantic overlap with ImageNet classes. (b) Trajectories of the leftmost examples of Fig. 4a and their OOD scores in parentheses.

Figure 4: Case study on OOD detection of individual samples for our method on classes with semantic overlap between the ImageNet and Textures datasets. The badge on each image in Fig. 4a shows the label given by our OOD binary discriminator. We set the threshold to the score value achieving 95% TPR.

Algorithm 1: Neural functional representation extraction, computed offline.
    Input:  Training dataset S_N = {(x_i, y_i)}_{i=1}^N and a pre-trained DNN f = f_{L+1} ∘ ··· ∘ f_1.
    Output: Reference trajectory ū, scaling vector max(U), and neural functional trajectory
            extraction function ϕ( · ; {µ_{ℓ,y} : y ∈ Y and ℓ ∈ {1, ..., L+1}}).

    // Training dataset feature extraction
    for ℓ ∈ {1, ..., L+1} do
        z_{ℓ,i} ← (f_ℓ ∘ ··· ∘ f_1)(x_i)
        for y ∈ Y do
            µ_{ℓ,y} ← (1/N_y) Σ_{i=1}^{N_y} z_{ℓ,i}    // class-conditional feature prototypes
        end
    end
    // Functional trajectory extractor ϕ(·)
    for i ∈ {1, ..., N} do
        u_i ← [d_1(z_{1,i}; {µ_{1,y} : y ∈ Y}), ..., d_{L+1}(z_{L+1,i}; {µ_{L+1,y} : y ∈ Y})]
    end
    U ← {u_i}_{i=1}^N
    max(U) ← [max_i u_{i,1}, ..., max_i u_{i,L+1}]^⊤
    ū ← (1/N) Σ_{i=1}^N u_i    // reference trajectory
    return ū, max(U), ϕ( · ; {µ_{ℓ,y} : y ∈ Y and ℓ ∈ {1, ..., L+1}})

Algorithm 2: Out-of-distribution score computation, online.
    Input:  Test sample x_0, the DNN f = f_{L+1} ∘ ··· ∘ f_1, reference trajectory ū,
            scaling vector max(U), and neural functional trajectory extraction function
            ϕ( · ; {µ_{ℓ,y} : y ∈ Y and ℓ ∈ {1, ..., L+1}}).
    Output: Test sample's OOD score s(x_0).
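A compact numpy rendering of both algorithms, under stated assumptions: we take d_ℓ to be the Euclidean distance to the nearest class prototype, ū to be the mean of the scaled training trajectories, and the test-time score to be the projection of the scaled trajectory onto ū. The paper's exact choices for d_ℓ and the projection may differ; this is a sketch, not the reference implementation.

```python
import numpy as np

def layer_score(z, prototypes):
    # d_l: distance from embedding z to the nearest class prototype
    return min(np.linalg.norm(z - mu) for mu in prototypes.values())

def fit_reference(features, labels):
    """Offline stage (Algorithm 1 sketch).

    features: dict layer -> (N, d_layer) array of embeddings
    labels:   (N,) int array of class labels
    Returns class prototypes, reference trajectory u_bar, scaling max(U)."""
    layers = sorted(features)
    protos = {l: {y: features[l][labels == y].mean(axis=0)
                  for y in np.unique(labels)} for l in layers}
    U = np.array([[layer_score(features[l][i], protos[l]) for l in layers]
                  for i in range(len(labels))])
    scale = U.max(axis=0)             # max(U), per-layer normalisation
    u_bar = (U / scale).mean(axis=0)  # assumed: mean of scaled trajectories
    return protos, u_bar, scale

def ood_score(feats_x0, protos, u_bar, scale):
    """Online stage (Algorithm 2 sketch): project the scaled test
    trajectory onto the reference trajectory u_bar."""
    layers = sorted(protos)
    u0 = np.array([layer_score(feats_x0[l], protos[l]) for l in layers]) / scale
    return float(np.dot(u0, u_bar) / np.dot(u_bar, u_bar))
```

In this form, samples whose trajectories deviate strongly from the training prototypes at every layer receive a large projection coefficient, which the threshold γ then turns into a binary decision.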

ON THE CENTRALITY OF THE CLASS-CONDITIONAL FEATURE MAPS

Figure 6: Histogram showing that the halfspace depth of the average vectors for a given class is higher than the highest depth of an embedding feature vector of the same class, demonstrating multivariate centrality.
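The halfspace (Tukey) depth behind Figure 6 can be approximated by minimising, over random directions, the fraction of points lying on one side of a hyperplane through the query point. A numpy sketch of this Monte Carlo approximation (not the exact algorithm of Dyckerhoff & Mozharovskyi, 2014):

```python
import numpy as np

def approx_halfspace_depth(x, data, n_dirs=500, seed=0):
    """Monte Carlo approximation of the Tukey depth of x w.r.t. data.

    For each random unit direction, compute the fraction of points whose
    projection lies at or beyond x's projection; the depth is the minimum
    of these fractions over all sampled directions."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_dirs, data.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    proj_data = data @ dirs.T   # (N, n_dirs) projections of the sample
    proj_x = dirs @ x           # (n_dirs,) projections of the query point
    return float((proj_data >= proj_x).mean(axis=0).min())
```

For a roughly symmetric cloud, the mean vector has depth close to 0.5, whereas an outlying point has depth close to 0, which is the qualitative pattern reported in Figure 6.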

Figure 7: Histogram for the first 20 dimensions of the penultimate feature of a DenseNet-121 for class index 0 of ImageNet. The green line is the average and the red line is the estimated median.
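The mean/median comparison behind Figure 7 amounts to checking, per coordinate, how far the empirical mean sits from the median relative to that coordinate's spread. A small sketch (synthetic features stand in for the real penultimate-layer embeddings):

```python
import numpy as np

def mean_median_gap(features, eps=1e-12):
    """Per-dimension |mean - median|, normalised by the dimension's std.

    Small values indicate a compact, roughly symmetric distribution,
    for which the mean is a faithful central reference."""
    mean = features.mean(axis=0)
    median = np.median(features, axis=0)
    std = features.std(axis=0) + eps
    return np.abs(mean - median) / std
```

A symmetric distribution yields gaps near zero in every dimension, while a skewed one yields visibly larger gaps; the near-zero gaps observed in Figure 7 are what justifies using the mean alone as the class reference.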

Figure 8: Average trajectories, OOD detection score histogram and ROC curve for the DenseNet-121 model.

(a) In-distribution. (b) Out-of-distribution.

Figure 10: Example of data samples from in-distribution (ImageNet) and OOD samples from iNaturalist, SUN, Places365 and Textures datasets.

Comparison against post-hoc state-of-the-art methods for OOD detection on the ImageNet benchmark. M+Ours stands for using the Mahalanobis distance-based layer score with our proposed unsupervised score aggregation algorithm based on trajectory similarity. Values are in percentage.

The first two rows show single-layer baselines where Penultimate layer is our layer score on the penultimate layer outputs. The subsequent rows show the performance of other unsupervised multivariate aggregation methods and our method. We ran experiments with a DenseNet-121 model.
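The AUROC and TNR at 95% TPR figures reported in these tables can be computed directly from the two score populations. A self-contained numpy sketch, under the convention that higher scores indicate OOD and the 95% TPR threshold is set on the in-distribution scores (flip the sign of the scores if the opposite convention is used):

```python
import numpy as np

def auroc(scores_in, scores_out):
    """Probability that a random OOD sample scores higher than a random
    in-distribution sample (Mann-Whitney U / rank formulation)."""
    ranks = np.concatenate([scores_in, scores_out]).argsort().argsort() + 1
    n_in, n_out = len(scores_in), len(scores_out)
    r_out = ranks[n_in:].sum()
    return (r_out - n_out * (n_out + 1) / 2) / (n_in * n_out)

def tnr_at_95_tpr(scores_in, scores_out):
    """Fraction of OOD samples rejected at the threshold that keeps the
    in-distribution true positive rate at 95%."""
    thr = np.quantile(scores_in, 0.95)  # 95% of in-dist scores fall below
    return float((scores_out > thr).mean())
```

Both metrics are threshold-free to report: AUROC integrates over all thresholds, while TNR@95TPR fixes the single operating point used by the binary discriminator g.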

CIFAR-10 benchmark results in terms of AUROC based on a ResNet-18 model.

A.5 ADDITIONAL PLOTS, FUNCTIONAL DATASET, HISTOGRAMS AND ROC CURVES

We display additional plots in Figures 9 and 8 for the observed functional data, the histogram of our scores showing the separation between in-distribution and OOD data, and the ROC curves for all of our experiments.

A.1.2 INFERENCE TIME

In this section, we conduct a time analysis of our algorithm. It is worth noting that most of the computational burden is handled offline. At inference, only a forward pass and feature-wise scores are computed. We conducted a practical experiment where we performed live inference and OOD score computation with three models for the MSP, Energy, Mahalanobis, and Trajectory Projection (ours) methods. The results, normalized by the inference time, are available in Table 3 below. We acknowledge that more computationally efficient implementations of these algorithms may exist, so this remains a naive benchmark of their computational overhead.

We showed through several benchmarks that taking the outputs of each convolutional block for the DenseNet-121 and BiT-S-101 models is enough to obtain excellent results.

A.6.4 TEXTURES

The classes considered for the Textures dataset were: [banded, blotchy, braided, bubbly, bumpy, chequered, cobwebbed, cracked, crosshatched, crystalline, dotted, fibrous, flecked, freckled, frilly, gauzy, grid, grooved, honeycombed, interlaced, knitted, lacelike, lined, marbled, matted, meshed, paisley, perforated, pitted, pleated, polka-dotted, porous, potholed, scaly, smeared, spiralled, sprinkled, stained, stratified, striped, studded, veined, waffled, woven, wrinkled, zigzagged]

