PERFECT DENSITY MODELS CANNOT GUARANTEE ANOMALY DETECTION

Abstract

Thanks to the tractability of their likelihood, some deep generative models show promise for seemingly straightforward but important applications like anomaly detection, uncertainty estimation, and active learning. However, the likelihood values empirically attributed to anomalies conflict with the expectations these proposed applications suggest. In this paper, we take a closer look at the behavior of distribution densities and show that these quantities carry less meaningful information than previously thought, beyond estimation issues or the curse of dimensionality. We conclude that the use of these likelihoods for out-of-distribution detection relies on strong and implicit hypotheses, and highlight the necessity of explicitly formulating these assumptions for reliable anomaly detection.

1. INTRODUCTION

Several machine learning methods aim at extrapolating a behavior observed on training data in order to produce predictions on new observations. But every so often, such extrapolation can result in wrong outputs, especially on points that we would consider infrequent with respect to the training distribution. Faced with unusual situations, whether adversarial (Szegedy et al., 2013; Carlini & Wagner, 2017) or just rare (Hendrycks & Dietterich, 2019), a desirable behavior from a machine learning system would be to flag these outliers so that the user can assess if the result is reliable and gather more information if need be (Zhao & Tresp, 2019; Fu et al., 2017). This can be critical for applications like medical decision making (Lee et al., 2018) or autonomous vehicle navigation (Filos et al., 2020), where such outliers are ubiquitous. What are the situations that are deemed unusual? Defining these anomalies (Hodge & Austin, 2004; Pimentel et al., 2014) manually can be laborious if not impossible, and so generally applicable, automated methods are preferable. In that regard, the framework of probabilistic reasoning has been an appealing formalism because natural candidates for outliers are situations that are improbable or out-of-distribution. Since the true probability distribution density $p_X^*$ of the data is often not provided, one would instead use an estimator, $p_X^{(\theta)}$, from this data to assess the regularity of a point. Density estimation has been a particularly challenging task on high-dimensional problems.
However, recent advances in deep probabilistic models, including variational auto-encoders (Kingma & Welling, 2014; Rezende et al., 2014; Vahdat & Kautz, 2020), deep autoregressive models (Uria et al., 2014; van den Oord et al., 2016b; a), and flow-based generative models (Dinh et al., 2014; 2016; Kingma & Dhariwal, 2018), have shown promise for density estimation, which has the potential to enable accurate density-based methods (Bishop, 1994) for anomaly detection. Yet, several works have observed that a significant gap persists between the potential of density-based anomaly detection and empirical results. For instance, Choi et al. (2018), Nalisnick et al. (2018), and Hendrycks et al. (2018) noticed that generative models trained on a benchmark dataset (e.g., CIFAR-10, Krizhevsky et al., 2009) and tested on another (e.g., SVHN, Netzer et al., 2011) are not able to identify the latter as out-of-distribution with current methods. Different hypotheses have been formulated to explain that discrepancy, ranging from the curse of dimensionality (Nalisnick et al., 2019) to a significant mismatch between $p_X^{(\theta)}$ and $p_X^*$ (Choi et al., 2018; Fetaya et al., 2020; Kirichenko et al., 2020; Zhang et al., 2020). In this work, we propose a new perspective on this discrepancy and challenge the expectation that density estimation should enable anomaly detection. We show that the aforementioned discrepancy persists even with perfect density models, and therefore goes beyond issues of estimation, approximation, or optimization errors (Bottou & Bousquet, 2008). We highlight that this issue is pervasive as it occurs even in low-dimensional settings and for a variety of density-based methods for anomaly detection.

Figure 1: There is an infinite number of ways to partition a distribution in two subsets, $\mathcal{X}_{in}$ and $\mathcal{X}_{out}$, such that $P_X^*(\mathcal{X}_{in}) = 0.95$. Here, we show several choices for a standard Gaussian $p_X^* = \mathcal{N}(0, 1)$.

2.1. UNSUPERVISED ANOMALY DETECTION: PROBLEM STATEMENT

Unsupervised anomaly detection is a classification problem (Moya et al., 1993; Schölkopf et al., 2001), where one aims at distinguishing between regular points (inliers) and irregular points (outliers). However, as opposed to the usual classification task, labels distinguishing inliers and outliers are not provided for training, if outliers are even provided at all. Given an input space $\mathcal{X} \subseteq \mathbb{R}^D$, the task can be summarized as partitioning this space between the subset of outliers $\mathcal{X}_{out}$ and the subset of inliers $\mathcal{X}_{in}$, i.e., $\mathcal{X}_{out} \cup \mathcal{X}_{in} = \mathcal{X}$ and $\mathcal{X}_{out} \cap \mathcal{X}_{in} = \emptyset$. When the training data is distributed according to the probability measure $P_X^*$ (with density $p_X^*$; see footnote 1), one would usually pick the set of regular points $\mathcal{X}_{in}$ such that this set contains the majority (but not all) of the mass (e.g., 95%) of this distribution, i.e., $P_X^*(\mathcal{X}_{in}) = 1 - \alpha \in \left(\frac{1}{2}, 1\right)$. But, for any given $\alpha$, there exists in theory an infinity of corresponding partitions into $\mathcal{X}_{in}$ and $\mathcal{X}_{out}$ (see Figure 1). How are these partitions defined to match our intuition of inliers and outliers? We will focus in this paper on recently used methods based on probability density.
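To make the non-uniqueness of the partition concrete, the following minimal sketch builds three different inlier sets for a standard Gaussian, each holding 95% of the mass. The three particular choices (central interval, one-sided interval, and everything except a narrow band around the mode) and all function names are illustrative assumptions, not constructions taken from the paper.

```python
from statistics import NormalDist

p_star = NormalDist(0.0, 1.0)  # standard Gaussian, as in Figure 1
alpha = 0.05                   # mass assigned to the outliers

def interval_mass(lo, hi):
    """Probability mass of the interval [lo, hi] under p_star."""
    return p_star.cdf(hi) - p_star.cdf(lo)

# Choice 1: central interval (-c, c).
c = p_star.inv_cdf(1.0 - alpha / 2.0)
# Choice 2: one-sided interval (-inf, upper).
upper = p_star.inv_cdf(1.0 - alpha)
# Choice 3: everything except a narrow band (-b, b) around the mode.
b = p_star.inv_cdf(0.5 + alpha / 2.0)

mass_central = interval_mass(-c, c)
mass_one_sided = p_star.cdf(upper)
mass_without_mode = 1.0 - interval_mass(-b, b)

def in_central(x):
    return -c < x < c

def in_one_sided(x):
    return x < upper

def in_without_mode(x):
    return abs(x) > b
```

All three sets satisfy the mass constraint, yet they disagree on individual points: the mode x = 0 is an inlier under the first two choices and an outlier under the third.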

2.2. DENSITY SCORING

When talking about outliers, infrequent observations, the association with probability can be quite intuitive. For instance, one would expect an anomaly to happen rarely and be unlikely. Since the language of statistics often associates the term likelihood with quantities like $p_X^{(\theta)}(x)$, one might consider an unlikely sample to have a low "likelihood", that is a low probability density $p_X^*(x)$. Conversely, regular samples would have a high density $p_X^*(x)$ following that reasoning. This is an intuition that is not only prevalent in several modern anomaly detection methods (Bishop, 1994; Blei et al., 2017; Hendrycks et al., 2018; Kirichenko et al., 2020; Rudolph et al., 2020; Liu et al., 2020) but also in techniques like low-temperature sampling (Graves, 2013) used for example in Kingma & Dhariwal (2018) and Parmar et al. (2018). The associated approach, described in Bishop (1994), consists of defining the inliers as the points whose density exceeds a certain threshold $\lambda > 0$ (for example, chosen such that inliers include a predefined amount of mass, e.g., 95%), making the modes the most regular points in this setting. $\mathcal{X}_{out}$ and $\mathcal{X}_{in}$ are then respectively the lower-level and upper-level sets $\{x \in \mathcal{X} : p_X^*(x) \leq \lambda\}$ and $\{x \in \mathcal{X} : p_X^*(x) > \lambda\}$ (see Figure 2b).

Figure 2: Illustration of different density-based methods applied to a particular one-dimensional distribution $p_X^*$. Panels: (b) density scoring method applied to the distribution $p_X^*$; (c) typicality test method (with one sample) applied to the distribution $p_X^*$. Outliers are in red and inliers are in blue. The thresholds are picked so that inliers include 95% of the mass. In Figure 2b, inliers are considered as the points with density above the threshold $\lambda > 0$ while in Figure 2c, they are the points whose log-density is in the $\epsilon$-interval around the negentropy $-H(p_X^*)$.
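The density scoring rule can be sketched for the simplest case, a one-dimensional Gaussian, where the upper-level set holding a given mass is a centered interval and the threshold is available in closed form. The helper names `density_threshold` and `is_inlier` are hypothetical, and the shortcut used to find the threshold only applies to symmetric unimodal densities.

```python
from statistics import NormalDist

def density_threshold(dist, mass=0.95):
    """Threshold lam such that {x : p(x) > lam} holds `mass` of a 1-D Gaussian.

    Assumption: for a symmetric unimodal density the upper-level set is a
    centered interval, so lam is the density at the interval's endpoint.
    """
    endpoint = dist.inv_cdf(0.5 + mass / 2.0)
    return dist.pdf(endpoint)

def is_inlier(dist, x, lam):
    """Density scoring: x is an inlier when its density exceeds lam."""
    return dist.pdf(x) > lam

p_star = NormalDist(0.0, 1.0)
lam = density_threshold(p_star)           # density at x close to 1.96
mode_is_inlier = is_inlier(p_star, 0.0, lam)
tail_is_inlier = is_inlier(p_star, 3.0, lam)
```

Under this rule the mode is always the most regular point, which is exactly the property that the later sections call into question.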

2.3. TYPICALITY TEST

The Gaussian Annulus theorem (Blum et al., 2016) (generalized in Vershynin, 2019) attests that most of the mass of a high-dimensional standard Gaussian $\mathcal{N}(0, I_D)$ is located close to the hypersphere of radius $\sqrt{D}$. However, the mode of its density is at the center 0. A natural conclusion is that the curse of dimensionality creates a discrepancy between the density upper-level sets and what we expect as inliers (Choi et al., 2018; Nalisnick et al., 2019; Morningstar et al., 2020; Dieleman, 2020). This motivated Nalisnick et al. (2019) to propose another method for testing whether a point is an inlier or not, relying on a measure of its typicality. This method relies on the notion of typical set (Cover, 1999), defined by taking as inliers points whose average log-density is close to the average log-density of the distribution (see Figure 2c).

Definition 1 (Cover, 1999). Given independent and identically distributed elements $(x^{(n)})_{n \leq N}$ from a distribution with density $p_X^*$, the typical set $A_\epsilon^{(N)}(p_X^*) \subset \mathcal{X}^N$ is made of all sequences that satisfy
$$\left| H(p_X^*) + \frac{1}{N} \sum_{n=1}^{N} \log p_X^*\left(x^{(n)}\right) \right| \leq \epsilon,$$
where $H(p_X^*) = -\mathbb{E}[\log p_X^*(X)]$ is the (differential) entropy and $\epsilon > 0$ a constant.

This method matches the intuition behind the Gaussian Annulus theorem on the set of inliers of a high-dimensional standard Gaussian. Indeed, using a concentration inequality, we can show that $\lim_{N \to +\infty} P^*_{(X_n)_{1 \leq n \leq N}}\left(A_\epsilon^{(N)}(p_X^*)\right) = 1$, which means that with $N$ large enough, $A_\epsilon^{(N)}(p_X^*)$ will contain most of the mass of $(p_X^*)^N$, justifying the name typicality.
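As a small illustration of Definition 1, the sketch below applies the typicality test to a standard Gaussian in D = 100 dimensions, using the closed-form entropy $H = \frac{D}{2}\log(2\pi e)$. The function names and the choice $\epsilon = 1$ are assumptions made for this example only.

```python
import math

def gaussian_log_density(x):
    """Log-density of a standard Gaussian N(0, I_D) at the point x."""
    d = len(x)
    return -0.5 * d * math.log(2.0 * math.pi) - 0.5 * sum(v * v for v in x)

def gaussian_entropy(d):
    """Differential entropy of N(0, I_D): (D / 2) * log(2 * pi * e)."""
    return 0.5 * d * math.log(2.0 * math.pi * math.e)

def is_typical(samples, eps):
    """Typicality test: |H + average log-density of the samples| <= eps."""
    d = len(samples[0])
    avg_log_p = sum(gaussian_log_density(x) for x in samples) / len(samples)
    return abs(gaussian_entropy(d) + avg_log_p) <= eps

D = 100
mode = [0.0] * D       # highest-density point, at the center
on_shell = [1.0] * D   # a point with norm sqrt(D)

mode_typical = is_typical([mode], eps=1.0)
shell_typical = is_typical([on_shell], eps=1.0)
```

For a single sample the test statistic reduces to $|D/2 - \|x\|^2/2|$, so the point on the shell of radius $\sqrt{D}$ passes while the mode fails, in line with the Gaussian Annulus intuition.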

3. THE ROLE OF REPARAMETRIZATION

Given the anomaly detection problem formulation in Subsection 2.1, we are interested in reasoning about the properties a solution ought to satisfy, in the ideal case of infinite data and capacity. For density-based methods this means that $p_X^{(\theta)} = p_X^*$ (see footnote 2).

Figure 3: Illustration of the change of variables formula and how much the application of a bijection can affect the density of the points considered in a one-dimensional case. Panel (c): resulting density $p^*_{f(X)}$ from applying $f$ to $X \sim p_X^*$ as a function of the new axis $f(x)$. In Figures 3a and 3c, points $x$ with high density $p_X^*(x)$ are in blue and points with low density $p_X^*(x)$ are in red.

Although we work in practice on points (e.g., vectors), it is important to keep in mind that these points are actually representations of an underlying outcome. As a random variable, $X$ is by definition the function from this outcome $\omega$ to the corresponding observation $x = X(\omega)$. However, at its core, an anomaly detection solution aims at classifying outcomes through these measurements. How does the choice of $X$ affect the problem of anomaly detection? While several papers studied the effects of a change of representation through the lens of inductive bias (Kirichenko et al., 2020; Zhang et al., 2020), we investigate the more fundamental effects of reparametrizations $f$. To sidestep concerns about loss of information (Winkens et al., 2020), we study the particular case of an invertible map $f$. The measurements $x = X(\omega)$ and $f(x) = (f \circ X)(\omega)$ represent the same outcome $\omega$ (although differently), and, since $x$ and $f(x)$ are connected by an invertible transformation $f$, the same method applied respectively to $X$ or $f(X)$ should classify them with the same label, either as an inlier or an outlier. The target of these methods is essentially to assess the regularity of the outcome $\omega$. From this, we could ideally make the following requirement for a solution to anomaly detection.

Principle. In an infinite data and capacity setting, the result of an anomaly detection method should be invariant to any continuous invertible reparametrization $f$.

Do density-based methods follow this principle?

To answer that question, we look into how density behaves under a reversible change of representation. In particular, the change of variables formula (Kaplan, 1952) (used in Tabak & Turner, 2013; Dinh et al., 2014; Rezende & Mohamed, 2015) formalizes a simple intuition of this behavior: where points are brought closer together the density increases, whereas this density decreases when points are spread apart. The formula itself is written as:
$$p^*_{f(X)}\big(f(x)\big) = p_X^*(x) \left| \frac{\partial f}{\partial x^T}(x) \right|^{-1},$$
where $\left| \frac{\partial f}{\partial x^T}(x) \right|$ is the Jacobian determinant of $f$ at $x$, a quantity that reflects a local change in volume incurred by $f$. Figure 3 already illustrates how the function $f$ (Figure 3b) can spread apart points close to the extremities to decrease the corresponding density around 0 and 1, and, as a result, turns the density on the left (Figure 3a) into the density on the right (Figure 3c). With this example, one can wonder to which degree an invertible change of representation can affect the density and the anomaly detection methods presented in Subsections 2.2 and 2.3 that use it.
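The change of variables formula can be checked numerically in one dimension. The sketch below uses $X \sim \mathcal{N}(0, 1)$ and the bijection $f(x) = e^x$; this particular $f$ is an illustrative assumption and is not the map shown in Figure 3. The formula is compared against a finite-difference derivative of the cumulative distribution function of $f(X)$.

```python
import math
from statistics import NormalDist

p_star = NormalDist(0.0, 1.0)

def f(x):
    """Illustrative bijection from R onto (0, +inf)."""
    return math.exp(x)

def f_prime(x):
    """Derivative of f, i.e. the one-dimensional Jacobian."""
    return math.exp(x)

def density_after_f(x):
    """Density of f(X) at f(x), from the change of variables formula."""
    return p_star.pdf(x) / abs(f_prime(x))

def density_after_f_numeric(x, h=1e-5):
    """Same quantity, from a central difference of the CDF of f(X)."""
    y = f(x)
    def cdf_y(t):
        # P(f(X) <= t) = P(X <= log t) because f is increasing.
        return p_star.cdf(math.log(t))
    return (cdf_y(y + h) - cdf_y(y - h)) / (2.0 * h)
```

With this choice of $f$ the density ranking of two points flips: $x = 0$ has a higher density than $x = -1$ under $p_X^*$, but $f(-1)$ has a higher density than $f(0)$ under $p^*_{f(X)}$, since $f$ compresses the region around $x = -1$.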

4.1. UNIFORMIZATION

We start by showing that unambiguously defining outliers and inliers with any density-based approach becomes impossible when considering a particular type of invertible reparametrization of the problem, irrespective of dimensionality. Under weak assumptions, one can map any distribution to a uniform distribution using an invertible transformation (Hyvärinen & Pajunen, 1999). This is in fact a common strategy for sampling from complicated one-dimensional distributions (Devroye, 1986). Figure 4 shows an example of this where a bimodal distribution (Figure 4a) is pushed through an invertible map (Figure 4b) to obtain a uniform distribution (Figure 4c).

Figure 4: Illustration of the one-dimensional case version of a Knothe-Rosenblatt rearrangement, which is just the application of the cumulative distribution function $\mathrm{CDF}_{p_X^*}$ on the variable $x$. Panel (c): the resulting density from applying $\mathrm{CDF}_{p_X^*}$ to $X \sim p_X^*$ is $p^*_{\mathrm{CDF}_{p_X^*}(X)} = \mathcal{U}([0, 1])$, therefore we color all the points the same.

To construct this invertible uniformization function, we rely on the notion of Knothe-Rosenblatt rearrangement (Rosenblatt, 1952; Knothe et al., 1957). A Knothe-Rosenblatt rearrangement (notably used in Hyvärinen & Pajunen, 1999) is defined for a random variable $X$ distributed according to a strictly positive density $p_X^*$ with a convex support $\mathcal{X}$, as a continuous invertible map $f^{(KR)}$ from $\mathcal{X}$ onto $[0, 1]^D$ such that $f^{(KR)}(X)$ follows a uniform distribution in this hypercube. This rearrangement is constructed as follows:
$$\forall d \in \{1, ..., D\}, \quad f^{(KR)}(x)_d = \mathrm{CDF}_{p^*_{X_d | X_{<d}}}(x_d \mid x_{<d}),$$
where $\mathrm{CDF}_p$ is the cumulative distribution function corresponding to the density $p$. In these new coordinates, neither the density scoring method nor the typicality test approach can discriminate between inliers and outliers in this uniform $D$-dimensional hypercube $[0, 1]^D$.
Since the resulting density $p^*_{f^{(KR)}(X)} = 1$ is constant, the density scoring method attributes the same regularity to every point. Moreover, a typicality test on $f^{(KR)}(X)$ will always succeed as, $\forall \epsilon > 0$, $N \in \mathbb{N}^*$, $\forall (x^{(n)})_{n \leq N}$,
$$\left| H\left(p^*_{f^{(KR)}(X)}\right) + \frac{1}{N} \sum_{n=1}^{N} \log p^*_{f^{(KR)}(X)}\left(f^{(KR)}\left(x^{(n)}\right)\right) \right| = \left| H\left(\mathcal{U}\left([0, 1]^D\right)\right) + \frac{1}{N} \sum_{n=1}^{N} \log(1) \right| = 0 \leq \epsilon.$$
However, these uniformly distributed points are merely a different representation of the same initial points. Therefore, if the identity of the outliers is ambiguous in this uniform distribution, then anomaly detection in general should be as difficult.
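The one-dimensional case of this uniformization can be sketched with a bimodal Gaussian mixture pushed through its own cumulative distribution function. The mixture parameters, sample size, bin count, and function names below are arbitrary assumptions chosen for illustration; the empirical histogram of the transformed samples should be approximately flat.

```python
import random
from statistics import NormalDist

# Hypothetical bimodal mixture: (weight, component) pairs.
COMPONENTS = [(0.5, NormalDist(-2.0, 0.5)), (0.5, NormalDist(2.0, 1.0))]

def mixture_pdf(x):
    return sum(w * comp.pdf(x) for w, comp in COMPONENTS)

def mixture_cdf(x):
    """One-dimensional Knothe-Rosenblatt map: the CDF of the mixture."""
    return sum(w * comp.cdf(x) for w, comp in COMPONENTS)

def sample_mixture(rng):
    """Draw one sample: pick a component by weight, then sample from it."""
    first_weight, first = COMPONENTS[0]
    comp = first if rng.random() < first_weight else COMPONENTS[1][1]
    return rng.gauss(comp.mean, comp.stdev)

def uniformized_histogram(n_samples=20000, n_bins=10, seed=0):
    """Fraction of CDF-transformed samples falling in each bin of [0, 1]."""
    rng = random.Random(seed)
    counts = [0] * n_bins
    for _ in range(n_samples):
        u = mixture_cdf(sample_mixture(rng))
        counts[min(int(u * n_bins), n_bins - 1)] += 1
    return [count / n_samples for count in counts]

fractions = uniformized_histogram()
```

In the original coordinates the mode at x = -2 and the valley at x = 0 have very different densities, yet after the transformation every bin holds roughly the same fraction of samples, so density carries no ranking information.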

4.2. ARBITRARY SCORING

While a particular parametrization can prevent density-based outlier detection methods from separating between outliers and inliers, we find that it is also possible to build a reparametrization of the problem to impose to each point an arbitrary density level in the new representation. To illustrate this idea, consider some points from a distribution whose density is depicted in Figure 5a and a score function indicated in red in Figure 5b. In this example, high-density regions correspond to areas with low score value (and vice-versa). We show that there exists a reparametrization (depicted in Figure 5c) such that the density in this new representation (Figure 5d) now matches the desired score, which can be designed to mislead density-based methods into a wrong classification of anomalies.

Figure 5: Illustration of how we can modify the space with an invertible function so that each point $x$ follows a predefined score. Panels: (c) a continuous invertible reparametrization $f^{(s)}$ such that $p^*_{f^{(s)}(X)}\big(f^{(s)}(x)\big) = s(x)$; (d) resulting density $p^*_{f^{(s)}(X)}$ from applying $f^{(s)}$ to $X \sim p_X^*$ as a function of $f^{(s)}(x)$. In Figures 5a and 5d, points $x$ with high density $p_X^*(x)$ are in blue and points with low density $p_X^*(x)$ are in red.

Proposition 1. For any variable $X \sim p_X^*$ with $p_X^*$ continuous strictly positive (with $\mathcal{X}$ convex) and any measurable continuous function $s : \mathcal{X} \to \mathbb{R}_+^*$ bounded below by a strictly positive number, there exists a continuous bijection $f^{(s)}$ such that for any $x \in \mathcal{X}$, $p^*_{f^{(s)}(X)}\big(f^{(s)}(x)\big) = s(x)$ almost everywhere.

Proof. We write $x$ to denote $(x_1, \ldots, x_{D-1}, x_D)$ and $(x_{<D}, t)$ for $(x_1, \ldots, x_{D-1}, t)$. Let $f^{(s)} : \mathcal{X} \to \mathcal{Z} \subset \mathbb{R}^D$ be a function such that
$$f^{(s)}(x)_D = \int_0^{x_D} \frac{p_X^*(x_{<D}, t)}{s(x_{<D}, t)} \, dt, \quad \text{and} \quad \forall d \in \{1, ..., D - 1\}, \ f^{(s)}(x)_d = x_d.$$
As $s$ is bounded below, $f^{(s)}$ is well defined and invertible. By the change of variables formula, $\forall x \in \mathcal{X}$,
$$p^*_{f^{(s)}(X)}\big(f^{(s)}(x)\big) = p_X^*(x) \cdot \left| \frac{\partial f^{(s)}}{\partial x^T}(x) \right|^{-1} = p_X^*(x) \cdot \left( \frac{p_X^*(x)}{s(x)} \right)^{-1} = s(x).$$

If $\mathcal{X}_{in}$ and $\mathcal{X}_{out}$ are respectively the true sets of inliers and outliers, we can pick a ball $A \subset \mathcal{X}_{in}$ such that $P_X^*(A) = \alpha < 0.5$, and choose $s$ such that for any $x \in (\mathcal{X} \setminus A)$, $s(x) = 1$ and for any $x \in A$, $s(x) = 0.1$. With this choice of $s$ (or a smooth approximation) and the function $f^{(s)}$ defined earlier, both the density scoring and the (one-sample) typical set methods will consider the set of inliers to be $(\mathcal{X} \setminus A)$ while $\mathcal{X}_{out} \subset (\mathcal{X} \setminus A)$, making their results completely wrong. While we can also reparametrize the problem so that these methods may succeed, such reparametrization requires knowledge of $(p_X^* / s)(x)$. Without any constraints on the space considered, individual densities can be arbitrarily manipulated, which reveals how little these quantities say about the underlying outcome in general.
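The construction of Proposition 1 can be reproduced numerically in one dimension. The sketch below assumes $p_X^* = \mathcal{N}(0, 1)$ and a hypothetical smooth score $s$ that equals 0.1 at the mode and approaches 1 elsewhere (a smooth stand-in for the ball $A$ above). The map $f^{(s)}$ is computed with the trapezoidal rule and its Jacobian with a central difference, so the results are approximate.

```python
import math
from statistics import NormalDist

p_star = NormalDist(0.0, 1.0)

def score(x):
    """Hypothetical target density level: 0.1 at the mode, near 1 elsewhere."""
    return 1.0 - 0.9 * math.exp(-0.5 * (x / 0.25) ** 2)

def integrand(t):
    return p_star.pdf(t) / score(t)

def f_s(x, n_steps=2000):
    """f^(s)(x): integral from 0 to x of p*(t) / s(t) dt (trapezoidal rule)."""
    h = x / n_steps  # negative step for x < 0, giving a signed integral
    total = 0.5 * (integrand(0.0) + integrand(x))
    total += sum(integrand(i * h) for i in range(1, n_steps))
    return total * h

def density_in_new_coordinates(x, dx=1e-4):
    """Change of variables, with the Jacobian of f_s from a central difference."""
    jacobian = (f_s(x + dx) - f_s(x - dx)) / (2.0 * dx)
    return p_star.pdf(x) / abs(jacobian)
```

In the new coordinates the density matches the score up to numerical error, so the mode of the original Gaussian ends up with a lower density than the tail point x = 3, even though nothing about the underlying outcomes has changed.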

4.3. CANONICAL DISTRIBUTION

Since our analysis in Subsections 4.1 and 4.2 reveals that densities or low typicality regions are not sufficient conditions for an observation to be an anomaly, whatever its distribution or its dimension, we are now interested in investigating whether additional realistic assumptions can lead to some guarantees for anomaly detection. Motivated by several representation learning algorithms which attempt to learn a mapping to a predefined distribution (e.g., a standard Gaussian, see Chen & Gopinath, 2001; Kingma & Welling, 2014; Rezende et al., 2014; Dinh et al., 2014; Krusinga et al., 2019), we consider the more restricted setting of a fixed distribution of our choice, whose regular regions could for instance be known. Surprisingly, we find that it is possible to exchange the densities of an inlier and an outlier even within a canonical distribution.

Proposition 2. For any strictly positive density function $p_X^*$ over a convex space $\mathcal{X} \subseteq \mathbb{R}^D$ with $D > 2$, for any $x^{(in)}, x^{(out)}$ in the interior $\mathcal{X}^\circ$ of $\mathcal{X}$, there exists a continuous bijection $f : \mathcal{X} \to \mathcal{X}$ such that $p_X^* = p^*_{f(X)}$, $p^*_{f(X)}\big(f(x^{(in)})\big) = p_X^*\big(x^{(out)}\big)$, and $p^*_{f(X)}\big(f(x^{(out)})\big) = p_X^*\big(x^{(in)}\big)$.

We provide a sketch of proof and put the details in Appendix A. We rely on the transformation depicted in Figure 6, which can swap two points while acting in a very local area. If the distribution of points is uniform inside this local area, then this distribution will be unaffected by this transformation. In order to arrive at this situation, we use the uniformization method presented in Subsection 4.1, along with a linear function to fit this local area inside the support of the distribution (see Figure 7). Once those two points have been swapped, we can reverse the functions preceding this swap to recover the original distribution overall. Since the resulting distribution $p^*_{f(X)}$ is identical to the original $p_X^*$, their entropies are the same: $H\big(p^*_{f(X)}\big) = H(p_X^*)$.
Hence, when $x^{(in)}$ and $x^{(out)}$ are respectively an inlier and an outlier, whether in terms of density scoring or typicality, there exists a reparametrization of the problem conserving the overall distribution while still exchanging their status as inlier/outlier. We provide an example applied to a standard Gaussian distribution in Figure 8. This result is important from a representation learning perspective and a complement to the general non-identifiability result in several representation learning approaches (Hyvärinen & Pajunen, 1999; Locatello et al., 2019). It means that learning a representation with a predefined, well-known distribution and knowing the true density $p_X^*$ are not sufficient conditions to control the individual density of each point and accurately distinguish outliers from inliers.
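The local swap at the heart of this argument can be sketched in a two-dimensional plane. The exact angle profile of the norm-dependent rotation is not reproduced in this text, so the profile below (an angle of $\pi$ at radius $r_0$, decreasing linearly to 0 at $r_{max}$) is an assumption that is merely consistent with the description in Appendix A. The points, radii, and function names are likewise illustrative.

```python
import math

def norm_dependent_rotation(point, center, r0, r_max):
    """Rotate `point` around `center` by an angle depending on its distance.

    Assumed angle profile: pi at radius r0, decreasing linearly with the
    radius, and 0 at or beyond r_max, so points outside the ball of radius
    r_max are left untouched.
    """
    dx, dy = point[0] - center[0], point[1] - center[1]
    r = math.hypot(dx, dy)
    theta = math.pi * max(r_max - r, 0.0) / (r_max - r0)
    c, s = math.cos(theta), math.sin(theta)
    return (center[0] + c * dx - s * dy, center[1] + s * dx + c * dy)

def jacobian_determinant(fn, point, h=1e-6):
    """Central-difference estimate of the Jacobian determinant of fn."""
    x, y = point
    fx1, fx0 = fn((x + h, y)), fn((x - h, y))
    fy1, fy0 = fn((x, y + h)), fn((x, y - h))
    a = (fx1[0] - fx0[0]) / (2.0 * h)
    b = (fy1[0] - fy0[0]) / (2.0 * h)
    c = (fx1[1] - fx0[1]) / (2.0 * h)
    d = (fy1[1] - fy0[1]) / (2.0 * h)
    return a * d - b * c

# Two points inside the unit square, assumed uniformly distributed.
x_in, x_out = (0.4, 0.5), (0.6, 0.5)
center = ((x_in[0] + x_out[0]) / 2.0, (x_in[1] + x_out[1]) / 2.0)
r0 = 0.5 * math.hypot(x_out[0] - x_in[0], x_out[1] - x_in[1])
r_max = r0 + 0.2  # the ball of radius r_max stays inside the unit square

def swap(point):
    return norm_dependent_rotation(point, center, r0, r_max)
```

Under these assumptions `swap` exchanges `x_in` and `x_out`, leaves points outside the ball unchanged, and has a Jacobian determinant of 1 wherever it is differentiable, so a uniform distribution on the square is preserved.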

5. DISCUSSION

Fundamentally, density-based methods for anomaly detection rely on the belief that density, as a quantity, conveys useful information to assess whether an outcome is an outlier or not. For example, several density-based methods operate in practice on features learned independently from the anomaly detection task (Lee et al., 2018; Krusinga et al., 2019; Morningstar et al., 2020; Winkens et al., 2020) or on the original input features (Nalisnick et al., 2018; Hendrycks et al., 2018; Kirichenko et al., 2020; Rudolph et al., 2020; Nalisnick et al., 2019). In general, there is no evidence that the density in these representations will carry any useful information for anomaly detection, bringing into question whether performance of probabilistic models on this task (e.g., Du & Mordatch, 2019; Grathwohl et al., 2019; Kirichenko et al., 2020; Liu & Abbeel, 2020) reflects goodness-of-fit of the density model. On the contrary, we have proven in this paper that density-based anomaly detection methods are inconsistent across a range of possible representations (see footnote 3), even under strong constraints on the distribution, which suggests that finding the right input representation for meaningful density-based anomaly detection requires privileged information, as discussed in Subsection 4.2. Moreover, several papers have pointed to existing problems in commonly used input representations; for example, the geometry of a bitmap representation does not follow our intuition of semantic distance (Theis et al., 2016), or images can come from photographic sensors tuned to specific populations (Roth, 2009; Buolamwini & Gebru, 2018). This shows how strong, and otherwise understated, the assumption is that the methods presented in Subsection 2.2 and Subsection 2.3 would work on input representations. This is particularly problematic for applications as critical as autonomous vehicle navigation or medical decision-making.
While defining anomalies might be impossible without prior knowledge (Winkens et al., 2020) as out-of-distribution detection is an ill-posed problem (Choi et al., 2018; Nalisnick et al., 2019; Morningstar et al., 2020), several approaches make these assumptions more explicit. For instance, the density scoring method has also been interpreted in Bishop (1994) as a likelihood ratio method (Ren et al., 2019; Serrà et al., 2020; Schirrmeister et al., 2020), which, on the one hand, relies heavily on the definition of an arbitrary reference density as a denominator of this ratio but, on the other hand, is invariant to reparametrization. Inspired by the Bayesian approach from Choi et al. (2018), one can work on defining a prior distribution on possible reparametrizations over which to average (similarly to Jørgensen & Hauberg, 2020).



Footnote 1: We will also assume in the rest of the paper that for any $x \in \mathcal{X}$, $p_X^*(x) > 0$.

Footnote 2 (on the setting $p_X^{(\theta)} = p_X^*$ of Section 3): This setting is appealing as it gives space for theoretical results without worrying about the underfitting or overfitting issues mentioned by Hendrycks et al. (2018); Fetaya et al. (2020); Morningstar et al. (2020); Kirichenko et al. (2020); Zhang et al. (2020).

Footnote 3: Alternatively, this can be seen as a change of base distribution used to define a probability density as a Radon-Nikodym derivative.

Example of an invertible function f from [0, 1] to [0, 1].

An example of a distribution density p * X . Points x with high density p * X (x) are in blue and points with low density p * X (x) are in red.

(a) Points $x^{(in)}$ and $x^{(out)}$ in a uniformly distributed subset. $f^{(rot)}$ will pick a two-dimensional plane and use the polar coordinates with the mean $x^{(m)}$ of $x^{(in)}$ and $x^{(out)}$ as the center. (b) Applying a bijection $f^{(rot)}$ exchanging the points $x^{(in)}$ and $x^{(out)}$. $f^{(rot)}$ is a rotation depending on the distance from the mean $x^{(m)}$ of $x^{(in)}$ and $x^{(out)}$ in the previously selected two-dimensional plane.

Figure 6: Illustration of the norm-dependent rotation, a locally-acting bijection that allows us to swap two different points while preserving a uniform distribution (as a volume-preserving function).

Figure 7: We illustrate how, given $x^{(in)}$ and $x^{(out)}$ in a uniformly distributed hypercube $[0, 1]^D$, one can modify the space such that $f^{(rot)}$ shown in Figure 6 can be applied without modifying the distribution.

(b) Applying a bijection $f$ that preserves the distribution $p^*_{f(X)} = \mathcal{N}(0, I_2)$ to the points in Figure 8a. (c) The original distribution $p_X^*$ with respect to the new coordinates $f(x)$, $p_X^* \circ f^{-1}$.

Figure 8: Application of a transformation using the bijection in Figure 6 to a standard Gaussian distribution N (0, I 2 ), leaving it overall invariant.

annex

A PROOF OF PROPOSITION 2

Proposition 2. For any strictly positive density function $p_X^*$ over a convex space $\mathcal{X} \subseteq \mathbb{R}^D$ with $D > 2$, for any $x^{(in)}, x^{(out)}$ in the interior $\mathcal{X}^\circ$ of $\mathcal{X}$, there exists a continuous bijection $f : \mathcal{X} \to \mathcal{X}$ such that $p_X^* = p^*_{f(X)}$, $p^*_{f(X)}\big(f(x^{(in)})\big) = p_X^*\big(x^{(out)}\big)$, and $p^*_{f(X)}\big(f(x^{(out)})\big) = p_X^*\big(x^{(in)}\big)$.

Proof. Our proof will rely on the following non-rigid rotation $f^{(rot)}$. Working in a hyperspherical coordinate system consisting of a radial coordinate $r > 0$ and $(D - 1)$ angular coordinates [...], where for all $i \in \{1, 2, ..., D - 2\}$, $\phi_i \in [0, \pi)$ and $\phi_{D-1} \in [0, 2\pi)$, given $r_{max} > r_0 > 0$, we define the continuous mapping $f^{(rot)}$ as: [...] where $(\cdot)_+ = \max(\cdot, 0)$. This mapping only affects points inside $B_2(0, r_{max})$, and exchanges two points corresponding to $(r_0, \phi_1, \ldots, \phi_{D-2}, \phi_{D-1})$ and $(r_0, \phi_1, \ldots, \phi_{D-2}, \phi_{D-1} + \pi)$ in a continuous way (see Figure 6). Since the Jacobian determinant of the hyperspherical coordinates transformation is not a function of [...]

[...] $z^{(in)} = f^{(KR)}\big(x^{(in)}\big)$ and $z^{(out)} = f^{(KR)}\big(x^{(out)}\big)$. Since $f^{(KR)}$ is continuous, $z^{(in)}, z^{(out)}$ are in the interior $(0, 1)^D$. Therefore, there is an $\epsilon > 0$ such that the $L_2$-balls $B_2\big(z^{(in)}, \epsilon\big)$ and $B_2\big(z^{(out)}, \epsilon\big)$ are inside $(0, 1)^D$. Since $(0, 1)^D$ is convex, so is their convex hull.

Let $r_0 = \frac{1}{2} \left\| z^{(in)} - z^{(out)} \right\|_2$ and $r_{max} = r_0 + \epsilon$. Given $z \in (0, 1)^D$, we write $z_\parallel$ and $z_\perp$ to denote its parallel and orthogonal components with respect to $z^{(in)} - z^{(out)}$. We consider the linear bijection $L$ defined by [...] $^{(KR)}$. Since $L$ is a linear function (i.e., with constant Jacobian), [...] $z^{(m)}$ is the mean of $z^{(in)}$ and $z^{(out)}$, then $f^{(z)}(\mathcal{X})$ contains $B_2\big(L(z^{(m)}), r_{max}\big)$ (see Figure 7). We can then apply the non-rigid rotation $f^{(rot)}$ defined earlier, centered on $L(z^{(m)})$, to exchange $L(z^{(in)})$ and $L(z^{(out)})$ while maintaining this uniform distribution.

We can then apply the bijection $\big(f^{(z)}\big)^{-1}$ to obtain the invertible map [...] $p^*_{f(X)}\big(f(x^{(in)})\big) = p_X^*\big(x^{(out)}\big)$, and $p^*_{f(X)}\big(f(x^{(out)})\big) = p_X^*\big(x^{(in)}\big)$.

