PERFECT DENSITY MODELS CANNOT GUARANTEE ANOMALY DETECTION

Abstract

Thanks to the tractability of their likelihood, some deep generative models show promise for seemingly straightforward but important applications like anomaly detection, uncertainty estimation, and active learning. However, the likelihood values empirically attributed to anomalies conflict with the expectations these proposed applications suggest. In this paper, we take a closer look at the behavior of distribution densities and show that these quantities carry less meaningful information than previously thought, beyond estimation issues or the curse of dimensionality. We conclude that the use of these likelihoods for out-of-distribution detection relies on strong and implicit hypotheses, and highlight the necessity of explicitly formulating these assumptions for reliable anomaly detection.

1. INTRODUCTION

Several machine learning methods aim at extrapolating a behavior observed on training data in order to produce predictions on new observations. But every so often, such extrapolation can result in wrong outputs, especially on points that we would consider infrequent with respect to the training distribution. Faced with unusual situations, whether adversarial (Szegedy et al., 2013; Carlini & Wagner, 2017) or just rare (Hendrycks & Dietterich, 2019), a desirable behavior from a machine learning system would be to flag these outliers so that the user can assess if the result is reliable and gather more information if need be (Zhao & Tresp, 2019; Fu et al., 2017). This can be critical for applications like medical decision making (Lee et al., 2018) or autonomous vehicle navigation (Filos et al., 2020), where such outliers are ubiquitous. Which situations should be deemed unusual? Defining these anomalies (Hodge & Austin, 2004; Pimentel et al., 2014) manually can be laborious if not impossible, so generally applicable, automated methods are preferable. In that regard, the framework of probabilistic reasoning has been an appealing formalism, because a natural candidate for outliers are situations that are improbable or out-of-distribution. Since the true probability density p*_X of the data is often not provided, one would instead use an estimator p_X^(θ) learned from this data to assess the regularity of a point. Density estimation has been a particularly challenging task on high-dimensional problems.
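As a toy illustration of this density-estimation approach (a minimal sketch, not from the paper, using a kernel density estimator as a stand-in for p_X^(θ)), one can score new observations by their estimated density:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Training data drawn from the (unknown) distribution p*_X, here N(0, 1).
train = rng.normal(loc=0.0, scale=1.0, size=5000)

# Fit a simple density estimator p_X^(theta) to the training sample.
kde = gaussian_kde(train)

# Score new points by their estimated density: lower density = "more unusual".
points = np.array([0.0, 1.0, 6.0])
densities = kde(points)

# The point far in the tail (x = 6) receives a much lower density score
# than points near the bulk of the training distribution.
assert densities[2] < densities[0]
assert densities[2] < densities[1]
```

The sections below examine why this intuitive scoring rule is less well-founded than it appears.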
However, recent advances in deep probabilistic models, including variational auto-encoders (Kingma & Welling, 2014; Rezende et al., 2014; Vahdat & Kautz, 2020), deep autoregressive models (Uria et al., 2014; van den Oord et al., 2016b;a), and flow-based generative models (Dinh et al., 2014; 2016; Kingma & Dhariwal, 2018), have shown promise for density estimation, which has the potential to enable accurate density-based methods (Bishop, 1994) for anomaly detection. Yet, several works have observed that a significant gap persists between the potential of density-based anomaly detection and empirical results. For instance, Choi et al. (2018), Nalisnick et al. (2018), and Hendrycks et al. (2018) noticed that generative models trained on a benchmark dataset (e.g., CIFAR-10, Krizhevsky et al., 2009) and tested on another (e.g., SVHN, Netzer et al., 2011) are not able to identify the latter as out-of-distribution with current methods. Different hypotheses have been formulated to explain this discrepancy, ranging from the curse of dimensionality (Nalisnick et al., 2019) to a significant mismatch between p_X^(θ) and p*_X (Choi et al., 2018; Fetaya et al., 2020; Kirichenko et al., 2020; Zhang et al., 2020). In this work, we propose a new perspective on this discrepancy and challenge the expectation that density estimation should enable anomaly detection. We show that the aforementioned discrepancy persists even with perfect density models, and therefore goes beyond issues of estimation, approximation, or optimization errors (Bottou & Bousquet, 2008). We highlight that this issue is pervasive, as it occurs even in low-dimensional settings and for a variety of density-based methods for anomaly detection.

Under review as a conference paper at ICLR 2021

Figure 1: There is an infinite number of ways to partition a distribution into two subsets X_in and X_out such that P*_X(X_in) = 0.95. Here, we show several choices for a standard Gaussian p*_X = N(0, 1). [Figure: plot of x against p*_X(x).]
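The failure mode reported on CIFAR-10/SVHN can be mimicked in one dimension even with a perfect density model; the following hedged sketch (a toy analogue, not from the paper) uses a model that exactly matches the in-distribution N(0, 1), yet assigns higher average log-likelihood to out-of-distribution samples concentrated near its mode:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# "Perfect" density model for the in-distribution data: p*_X = N(0, 1).
model = norm(loc=0.0, scale=1.0)

# In-distribution test samples, and out-of-distribution samples tightly
# concentrated near the mode (a 1D analogue of a "narrower" dataset).
in_dist = rng.normal(0.0, 1.0, size=10000)
ood = rng.normal(0.0, 0.1, size=10000)

# The OOD samples receive *higher* average log-likelihood under the perfect
# model, so no likelihood threshold can flag them as anomalous.
assert model.logpdf(ood).mean() > model.logpdf(in_dist).mean()
```

This illustrates that the discrepancy is not merely an artifact of imperfect estimation or of high dimensionality.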

2.1. UNSUPERVISED ANOMALY DETECTION: PROBLEM STATEMENT

Unsupervised anomaly detection is a classification problem (Moya et al., 1993; Schölkopf et al., 2001), where one aims at distinguishing between regular points (inliers) and irregular points (outliers). However, as opposed to the usual classification task, labels distinguishing inliers and outliers are not provided for training, if outliers are even provided at all. Given an input space X ⊆ R^D, the task can be summarized as partitioning this space between the subset of outliers X_out and the subset of inliers X_in, i.e., X_out ∪ X_in = X and X_out ∩ X_in = ∅. When the training data is distributed according to the probability measure P*_X (with density p*_X [1]), one would usually pick the set of regular points X_in such that this set contains the majority (but not all) of the mass (e.g., 95%) of this distribution, i.e., P*_X(X_in) = 1 − α ∈ (1/2, 1). But, for any given α, there exists in theory an infinity of corresponding partitions into X_in and X_out (see Figure 1). How are these partitions defined to match our intuition of inliers and outliers? We will focus in this paper on recently used methods based on probability density.
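The non-uniqueness of these partitions is easy to verify numerically. As a minimal sketch (not from the paper), two very different choices of X_in for a standard Gaussian both capture exactly 95% of the mass, echoing Figure 1:

```python
from scipy.stats import norm

# Two different choices of X_in for p*_X = N(0, 1), each with mass 0.95:
# a central interval and a one-sided interval (cf. Figure 1).
central = norm.cdf(1.959964) - norm.cdf(-1.959964)   # P(-1.96 < X < 1.96)
one_sided = norm.cdf(1.644854)                       # P(X < 1.645)

assert abs(central - 0.95) < 1e-4
assert abs(one_sided - 0.95) < 1e-4
```

Both sets satisfy P*_X(X_in) = 1 − α with α = 0.05, yet they disagree on which points count as outliers, so the mass constraint alone does not determine the partition.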

2.2. DENSITY SCORING

When talking about outliers as infrequent observations, the association with probability can be quite intuitive. For instance, one would expect an anomaly to happen rarely and be unlikely. Since the language of statistics often associates the term likelihood with quantities like p_X^(θ)(x), one might consider an unlikely sample to have a low "likelihood", that is, a low probability density p*_X(x). Conversely, regular samples would have a high density p*_X(x) following that reasoning. This intuition is not only prevalent in several modern anomaly detection methods (Bishop, 1994; Blei et al., 2017; Hendrycks et al., 2018; Kirichenko et al., 2020; Rudolph et al., 2020; Liu et al., 2020) but also in techniques like low-temperature sampling (Graves, 2013), used for example in Kingma & Dhariwal (2018) and Parmar et al. (2018). The associated approach, described in Bishop (1994), consists in defining the inliers as the points whose density exceeds a certain threshold λ > 0 (for example, chosen such that inliers include a predefined amount of mass, e.g., 95%), making the modes the most regular points in this setting. X_out and X_in are then respectively the lower-level and upper-level sets {x ∈ X : p*_X(x) ≤ λ} and {x ∈ X : p*_X(x) > λ} (see Figure 2b).
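This density-scoring rule can be sketched concretely (a toy illustration assuming access to the true density p*_X of a standard Gaussian, not the paper's experimental setup): λ is chosen so that the upper-level set contains 95% of the mass, and points are classified by comparing their density to λ.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# True density p*_X = N(0, 1); choose lambda so that the upper-level set
# {x : p*_X(x) > lambda} contains 95% of the mass (Bishop, 1994).
samples = rng.normal(0.0, 1.0, size=100000)
lam = np.quantile(norm.pdf(samples), 0.05)  # 5% of samples fall below lambda

def is_inlier(x):
    """Density scoring: inliers are points whose density exceeds lambda."""
    return norm.pdf(x) > lam

assert is_inlier(0.0)          # the mode is the most regular point
assert not is_inlier(4.0)      # a far-tail point falls in the lower-level set
```

For a standard Gaussian this recovers the familiar central interval (λ ≈ p*_X(1.96)), since the density-based partition happens to coincide with the symmetric 95% region here.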



[1] We will also assume in the rest of the paper that for any x ∈ X, p*_X(x) > 0.

