AN EMPIRICAL EXPLORATION OF OPEN-SET RECOGNITION VIA LIGHTWEIGHT STATISTICAL PIPELINES

Anonymous

Abstract

Machine-learned safety-critical systems need to be self-aware and reliably know their unknowns in the open world. This is often explored through the lens of anomaly/outlier detection or out-of-distribution modeling. One popular formulation is open-set classification, where an image classifier trained for 1-of-K classes should also recognize images belonging to a (K+1)th "other" class not present in the training set. Recent work has shown that, somewhat surprisingly, most if not all existing open-world methods do not work well on high-dimensional open-world images (Shafaei et al., 2019). In this paper, we carry out an empirical exploration of open-set classification, and find that combining classic statistical methods with carefully computed features can dramatically outperform prior work. We extract features from off-the-shelf (OTS) state-of-the-art networks for the underlying K-way closed-world task. We leverage insights from the retrieval community for computing feature descriptors that are low-dimensional (via pooling and PCA) and normalized (via L2-normalization), enabling the modeling of training-data densities via classic statistical tools such as kmeans and Gaussian Mixture Models (GMMs). Finally, we (re)introduce the task of open-set semantic segmentation, which requires classifying individual pixels into one of K known classes or an "other" class. In this setting, our feature-based statistical models noticeably outperform prior open-world methods.

1. INTRODUCTION

Embodied perception and autonomy require systems to be self-aware and reliably know their unknowns. This requirement is often formulated as the open-set recognition problem (Scheirer et al., 2012): the system, e.g., a K-way classification model, should recognize anomalous examples that do not belong to one of the K closed-world classes. This is a significant challenge for machine-learned systems, which notoriously over-generalize to anomalies and unknowns on which they should instead raise a warning flag (Amodei et al., 2016).

Open-world benchmarks: Curating open-world benchmarks is hard (Liu et al., 2019). One common strategy re-purposes existing classification datasets into closed vs. open examples, e.g., declaring MNIST digits 0-5 as closed and 6-9 as open (Neal et al., 2018; Oza & Patel, 2019; Geng et al., 2020). In contrast, anomaly/out-of-distribution (OOD) benchmarks usually generate anomalous samples by adding examples from different datasets, e.g., declaring CIFAR as anomalous for MNIST (Ge et al., 2017; Oza & Patel, 2019; Liu et al., 2019). Most open-world protocols assume open-world data is not available during training (Liang et al., 2018; Oza & Patel, 2019). Interestingly, Dhamija et al. (2018) and Hendrycks et al. (2019b) find that, if some open examples are available during training, one can learn simple open-vs-closed binary classifiers that are remarkably effective. However, Shafaei et al. (2019) comprehensively compare various well-known open-world methods through rigorous experiments, and empirically show that none of the compared methods generalize to high-dimensional open-world images. Intuitively, classifiers can easily overfit to the available set of open-world images, which will not likely exhaustively span the open world outside the K classes of interest.

In this paper, we carry out a rigorous empirical exploration of open-set recognition of high-dimensional images. We explore simple statistical models such as Nearest Class Means (NCMs), kmeans and Gaussian Mixture Models (GMMs). Our hypothesis is that such classic statistical methods can reliably model the closed-world distribution (through the closed-world training data) and help avoid overfitting (an issue in open-vs-closed classifiers). Traditionally, such simple models have been used to address the open world (Chandola et al., 2009; Geng et al., 2020), but they are largely neglected in the recent literature. We revisit these simple methods and find them quite effective once crucial techniques are considered, as summarized by the contributions below.

Contribution 1: We build classic statistical models on top of off-the-shelf (OTS) features computed by the underlying K-way classification network. We find it crucial to use OTS features that have been pre-trained and post-processed appropriately (discussed further below). Armed with such features, we find that classic statistical models such as kmeans and GMMs (Murphy, 2012) can outperform prior work. We describe two core technical insights below.

Insight-1: Pre-training. Pre-training networks (e.g., on ImageNet (Deng et al., 2009)) is common practice for traditional closed-world tasks. However, to the best of our knowledge, open-world methods do not sufficiently exploit pre-training (Oza & Patel, 2019). Hendrycks et al. (2019a) report that pre-training improves anomaly detection using softmax confidence thresholding (Hendrycks & Gimpel, 2017). We find pre-training to be a crucial factor in learning better representations that support more sophisticated open-world reasoning. Intuitively, pre-trained networks have been exposed to diverse data that may look similar to open-world examples encountered at test time. We operationalize this intuition by building statistical models on top of existing discriminative networks, which tend to make use of pre-training by design. We demonstrate that this significantly outperforms features trained from scratch, as most prior open-set work does.

Insight-2: Low-dimensional normalized features. While some existing open-world methods also exploit OTS features (Lee et al., 2018), we find it crucial to make use of insufficiently well-known best practices for feature extraction. Specifically, to reduce dimensionality, we pool spatially (Gong et al., 2014) and use principal component analysis (PCA) (Turk & Pentland, 1991). Then, to ensure features are invariant to scalings, we adopt L2-normalization (Gong et al., 2014; Gordo et al., 2017). While these are somewhat standard practices for deep feature extraction in areas such as retrieval, their combination is not well explored in the open-set literature (Bendale & Boult, 2016; Grathwohl et al., 2019). Given a particular OTS K-way classification network, we determine the "right" feature processing through validation. In particular, we find that L2-normalization greatly boosts open-world recognition performance, while spatial pooling and PCA together reduce feature dimension by three orders of magnitude without degrading performance, resulting in a lightweight pipeline.

Contribution 2: We (re)introduce the problem of open-set semantic segmentation. Interestingly, classic benchmarks explicitly evaluate background pixels outside the set of K classes of interest (Everingham et al., 2015). However, contemporary benchmarks such as Cityscapes (Cordts et al., 2016) ignore such pixels during evaluation. As a result, most contemporary segmentation networks also ignore such pixels during training. Perhaps surprisingly, such ignored pixels include vulnerable objects like strollers and wheelchairs. Misclassifying such objects may have serious implications for real-world autonomous systems (see Figure 1). Instead of ignoring these pixels, we use them to explore open-world recognition by repurposing them as open-world examples. Interestingly, this setup naturally allows for open pixels in the train-set, a protocol advocated by (Dhamija et al., 2018; Hendrycks et al., 2019b). We benchmark various open-world methods on this setup, and show that our suggested simple statistical models still outperform typical open-world methods. Similar to past work, we also find that simple open-vs-closed binary classifiers serve as strong baselines, provided one has enough training examples of open pixels that span the open world.

2. RELATED WORK

Open-set recognition. Multiple lines of work address open-world problems in the context of K-way classification, such as anomaly/out-of-distribution detection (Chandola et al., 2009; Zong et al., 2018; Hendrycks et al., 2019b) and novelty/outlier detection (Pidhorskyi et al., 2018). Defined on K-way classification, these problems can be crisply formulated as open-set recognition (Scheirer et al., 2012; Bendale & Boult, 2016; Lee et al., 2018; Geng et al., 2020). Given a testing example, these methods compute the likelihood that it belongs to the open world via post-hoc functions such as density estimation (Zong et al., 2018), uncertainty modeling (Gal & Ghahramani, 2016; Liang et al., 2018; Kendall & Gal, 2017), or the reconstruction error of the testing example (Pidhorskyi et al., 2018; Dehaene et al., 2020). Different from these sophisticated methods, we train simple statistical models (e.g., GMMs) that can work much better when following our proposed pipeline. Feature extraction. Off-the-shelf (OTS) features can be extracted from a discriminative network and act as powerful embeddings (Donahue et al., 2014). Using OTS features for open-set recognition has been explored in prior work (Oza & Patel, 2019; Grathwohl et al., 2019; Lee et al., 2018).

3. OPEN-SET RECOGNITION VIA LIGHTWEIGHT STATISTICAL PIPELINES

In this section, we discuss various design choices in our pipeline, including (1) training schemes for the underlying closed-world task, (2) methods for extracting and repurposing closed-world feature descriptors for open-world recognition, and (3) the statistical density-estimation models built on such extracted features. We conclude with (4) an analysis of the additional compute required for self-aware processing (via the addition of an open-world "head" on top of the closed-world network), pointing out that minimal additional processing is needed.

1. Network training strategies. Virtually all state-of-the-art deep classifiers make use of large-scale pre-training, e.g., on ImageNet (Deng et al., 2009), which consistently improves performance on closed-world data (Sun et al., 2017; Mahajan et al., 2018). However, many, if not all, open-world methods train the discriminative network purely on closed-world data without pre-training (Oza & Patel, 2019; Hendrycks & Gimpel, 2017). We argue that a pre-trained network also serves as an abstraction of the (pseudo) open world. Intuitively, such a pre-trained model has already seen diverse data that may look similar to the open-world examples that will be encountered at test time, particularly if ImageNet does not look similar to the (closed) training set for the task of interest. Recently, Hendrycks et al. (2019a) show that pre-training improves open-world robustness with a simplistic method that thresholds softmax confidence (Hendrycks & Gimpel, 2017). Our diagnostic study shows that our explored statistical models, as well as prior methods, perform much better when built on a pre-trained network than on a network trained from scratch.

2. Feature extraction. OTS features generated at different layers of the trained discriminative model can be repurposed for open-set recognition (Lee et al., 2018). Most methods leverage softmax scores (Hendrycks & Gimpel, 2017) or logits (Bendale & Boult, 2016; Grathwohl et al., 2019), which can be thought of as features extracted at the top layers. Similar to Lee et al. (2018), we find it crucial to analyze features from intermediate layers, for which logits and softmax may be too invariant to be effective for open-set recognition (see Figure 3). One immediate challenge in extracting features from an intermediate layer is their high dimensionality, e.g., 512x7x7 for ResNet18 (He et al., 2016). To reduce feature dimension, we simply (max- or average-) pool the feature activations spatially into a 512-dim feature vector (Yang & Ramanan, 2015). We further use PCA, which can reduce dimension by 10x (from 512-dim to 50-dim) without sacrificing performance. We find this dimensionality reduction particularly important for learning second-order covariance statistics, as in the GMMs described below. Finally, following (Gong et al., 2014; Gordo et al., 2017), we find it crucial to L2-normalize the extracted features (see Figure 2).
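The pool-then-PCA-then-normalize recipe above can be sketched in a few lines of numpy. This is an illustrative sketch only (the function name is ours): for simplicity the PCA basis is fit on the input batch itself, whereas the paper's pipeline would fit it once on closed-world training features and reuse it at test time.

```python
import numpy as np

def postprocess_features(feats, pca_dim=50, pooling="avg"):
    """Pool, PCA-project, and L2-normalize convolutional features.

    feats: (N, C, H, W) activations from an intermediate layer.
    Returns (N, pca_dim) unit-length descriptors.
    """
    # 1) spatial pooling: (N, C, H, W) -> (N, C)
    pooled = feats.mean(axis=(2, 3)) if pooling == "avg" else feats.max(axis=(2, 3))
    # 2) PCA: project onto the top principal directions of the centered data
    mean = pooled.mean(axis=0)
    _, _, vt = np.linalg.svd(pooled - mean, full_matrices=False)
    proj = (pooled - mean) @ vt[:pca_dim].T
    # 3) L2-normalize so descriptors are invariant to feature scalings
    return proj / (np.linalg.norm(proj, axis=1, keepdims=True) + 1e-12)

# toy run: 100 "images" with 512x7x7 feature maps -> 50-dim unit vectors
x = np.random.RandomState(0).randn(100, 512, 7, 7)
d = postprocess_features(x, pca_dim=50)
```

The resulting descriptors are the inputs to all statistical models discussed next.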

3. Statistical models.

Given the above extracted features, we can learn various generative statistical models to capture the confidence/probability that a test example belongs to the closed-world distribution. We explore simple parametric models such as Nearest Class Means (NCMs) (Mensink et al., 2013) and class-conditional Gaussian models (Lee et al., 2018; Grathwohl et al., 2019), as well as non-parametric models such as nearest neighbors (NN) (Boiman et al., 2008; Júnior et al., 2017). We finally explore an intermediate regime of mixture models, including (class-conditional) GMMs and kmeans (Chandola et al., 2009; Cao et al., 2016; Geng et al., 2020). Our models label a test example as open-world when the inverse probability (e.g., of the most-likely class-conditional GMM) or the distance (e.g., to the closest class centroid) is above a threshold. One benefit of such simple statistical models is that they are interpretable, making failures relatively easy to diagnose. For example, one failure mode is an open-world sample being misclassified as a closed-world class; this happens when open-world data lie close to a class centroid or Gaussian component mean (see Figure 3, left). Note that a single statistical model may have several hyperparameters: GMMs can have multiple Gaussian components and different structures of second-order covariance, e.g., either a single scalar, a vector, or a full-rank general covariance per component, denoted by "spherical", "diag" and "full", respectively. We use a validation set to determine these hyperparameters (as well as the feature-processing steps listed above). 4. Lightweight pipeline. We re-iterate that the above feature extraction and statistical models result in a lightweight pipeline for open-set recognition. We now analyze the number of additional parameters in our pipeline.
Naively learning a GMM over features from the last convolutional layer results in massive second-order statistics, on the order of (512×7×7)^2 parameters for a 512x7x7 Res18 feature map. We find that spatial pooling and PCA can reduce the dimensionality to 50, which requires only 50^2 covariance parameters (a reduction of 10^5). We find linear dimensionality reduction more effective than sparse covariance matrices (e.g., assuming diagonal structure); the appendix includes additional experiments. A class-conditional five-component GMM (the largest found to be effective through cross-validation) requires 128KB of storage per class, or 594KB for all 19 classes in Cityscapes. This is less than 0.1% of the size of the underlying closed-world network (e.g., HRNet at 250MB), making it a practical addition that enables self-aware processing in real-time autonomy stacks.
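As a concrete illustration, the class-conditional GMM scoring described above might look like the following scikit-learn sketch. The function names and hyperparameters here are ours, chosen for a toy example rather than the validated settings from the paper; the reject threshold would be tuned on a validation set.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_gmms(feats, labels, n_components=2, cov="full"):
    """Fit one GMM per closed-world class on processed OTS features."""
    return {c: GaussianMixture(n_components=n_components, covariance_type=cov,
                               random_state=0).fit(feats[labels == c])
            for c in np.unique(labels)}

def openset_score(gmms, feats):
    """Open-set score = negated max class-conditional log-likelihood.
    A test example far from every class model gets a high score."""
    ll = np.stack([g.score_samples(feats) for g in gmms.values()], axis=1)
    return -ll.max(axis=1)

# toy data: two tight closed-world classes and far-away "open" points
rng = np.random.RandomState(0)
closed = np.vstack([rng.randn(200, 5) * 0.3,         # class 0 near the origin
                    rng.randn(200, 5) * 0.3 + 5.0])  # class 1 near (5,...,5)
labels = np.repeat([0, 1], 200)
open_pts = rng.randn(50, 5) * 0.3 + 20.0             # nowhere near either class
gmms = fit_class_gmms(closed, labels)
s_closed = openset_score(gmms, closed)
s_open = openset_score(gmms, open_pts)
```

Thresholding `s_open`-style scores yields the open-vs-closed decision; NCM and kmeans variants replace the log-likelihood with a distance to the nearest class mean or centroid.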

4. EXPERIMENTS

We extensively validate our proposed lightweight statistical pipeline under standard open-set recognition benchmarks, typically focused on image classification. We also consider open-set semantic segmentation, revisiting classic formulations of semantic segmentation that make use of a background label (Everingham et al., 2015). We start by introducing implementation details, evaluation metrics and baselines, and then present comprehensive evaluations on each setup. Metrics. We evaluate open-set recognition with the area under the ROC curve (AUROC) (Davis & Goadrich, 2006). AUROC is a calibration-free and threshold-less metric, simplifying comparisons between methods. For open-set semantic segmentation, we also use AUROC to evaluate the performance of recognizing "background" pixels as open-world examples. This differs from traditional practice in segmentation benchmarks (Everingham et al., 2015), which treat such "background" pixels as just another class. Baselines. Our statistical pipeline supports various statistical models. We study the simple models proposed in Section 3, including NN, kmeans, NCM, and GMMs. All models, including baselines to which we compare, are based on the same underlying classification network. Hyperparameters for all models (e.g., number of mixtures) are tuned on a validation set. • Classifiers. Hendrycks et al. (2019b) learn a binary open-vs-closed classifier (CLS2) for anomaly detection. Following classic work in semantic segmentation (Everingham et al., 2015), we also evaluate a (K+1)-way classifier (CLS(K+1)). We use the softmax score corresponding to the (K+1)th "other" class as the open-set likelihood. Both methods require open-set examples during training. • Likelihoods. Many probabilistic models measure open-set likelihood on OTS features, including Max Softmax Probability (MSP) (Hendrycks & Gimpel, 2017) and Entropy (Steinhardt & Liang, 2016) (derived from softmax probabilities). OpenMax (Bendale & Boult, 2016) fits logits to Weibull distributions (Scheirer et al., 2011) that recalibrate softmax outputs for open-set recognition. C2AE (Oza & Patel, 2019) learns an additional K-way classifier on OTS features based on reconstruction errors, which are then used as the open-set likelihood function. GDM (Lee et al., 2018) learns a Gaussian Discriminant Model on OTS features and designs an open-set likelihood based on Mahalanobis distance. CROSR (Yoshihashi et al., 2019) trains a reconstruction-based model that jointly performs closed-set K-way classification and open-set recognition. G-Open (Ge et al., 2017) and OSRCI (Neal et al., 2018) turn to Generative Adversarial Networks (GANs) to generate fake images that augment the closed-set training set, and train a discriminative model for open-set recognition. CGDL (Sun et al., 2020) learns a class-conditional Gaussian model and relies on reconstruction error for open-set recognition. The last three methods (CROSR, G-Open and CGDL) train ground-up models, in contrast to our statistical models that operate on OTS features of an already-trained K-way classification network. As we focus on an empirical exploration rather than achieving the state-of-the-art, we refer readers to more recent approaches that train ground-up models with sophisticated techniques (Zhang et al., 2020; Chen et al., 2020). Setup. We follow the standard protocol of (Neal et al., 2018; Hendrycks & Gimpel, 2017).
All three datasets contain ten classes with balanced numbers of images per class. The standard protocol randomly splits six (four) classes of the train/validation sets into closed (open) train/validation sets, respectively. We repeat this five times and report the average AUROC for each method. Through cross-validation, we find reliable OTS features can be computed by average-pooling features from the last convolutional layer down to 512-dim, projecting down to 50-dim via PCA, and L2-normalizing.
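Throughout these experiments, open-set performance is summarized by AUROC computed from per-example open-set scores. A toy sketch (the score values below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# open-set scores for three closed (label 0) and three open (label 1)
# test examples; higher score = more likely open-world
scores = np.array([0.10, 0.20, 0.15, 0.80, 0.90, 0.70])
labels = np.array([0, 0, 0, 1, 1, 1])
auroc = roc_auc_score(labels, scores)  # threshold-free ranking metric
```

Because every open example here scores above every closed one, this toy case yields a perfect AUROC of 1.0; real methods are compared by how close they get to that ideal ranking.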

Results. Table 1 shows that, perhaps surprisingly, simple statistical models (like kmeans and GMMs) defined on such normalized features already perform on par with many prior methods. Because GDM (Lee et al., 2018) does not L2-normalize features, we evaluate a variant that does (GDM-L2). The improved performance demonstrates the importance of feature normalization, which, although well known in the image retrieval community, is not widely used in open-set recognition. We hereafter focus on statistical models trained on normalized features, providing raw vs. normalized comparisons in the appendix.

4.2. SETUP-II: CROSS-DATASET OPEN-SET RECOGNITION

Setup. In these experiments, we use the cross-dataset protocol advocated by Shafaei et al. (2019), which draws open-set examples from other datasets (Torralba & Efros, 2011). Datasets. We use TinyImageNet as the closed-world dataset (for K-way classification), which has 200 classes of 64x64 images, split into 500/50/50 images per class as the train/val/test sets. Following (Shafaei et al., 2019), we construct open val/test sets using cross-dataset images (Torralba & Efros, 2011), including MNIST, SVHN, CIFAR and Cityscapes. For example, we use an outlier dataset (e.g., the MNIST train-set) to tune/train an open-world method, and test on another dataset as the open set (e.g., the CIFAR test-set). We use bilinear interpolation to resize all images to 64x64 to match the TinyImageNet image resolution. Through cross-validation, we find reliable OTS features can be computed by average-pooling features from the last convolutional layer down to 2048-dim, projecting down to 200-dim via PCA, and L2-normalizing. Results for Table 2 are summarized below:
• Simple statistical models (e.g., NCM and kmeans) can outperform prior open-set methods (e.g., C2AE and GDM). We find that L2-normalization greatly contributes to the success of these simple statistical methods (cf. details in the appendix). Both the metric learning and image retrieval literatures (Mensink et al., 2012; Musgrave et al., 2020) have shown the importance of L2-normalization. Informally, open-set recognition queries the testing example and measures how close it is to any of the closed-world training examples (Musgrave et al., 2020).
• Interestingly, kmeans performs slightly better than GMMs. Considering that the former can be seen as a special case of GMMs with identity covariance, we conjecture that learning other types of covariance (e.g., a full-rank covariance matrix) does not help when the underlying K-way network has already provided compact feature representations.
• From the last row pair, we can see that pre-training notably improves all the methods. GDM-L2 outperforms the original GDM, which operates on raw features (without L2-normalization). This further confirms the importance of L2-normalization in feature extraction for open-set recognition.
• Perhaps surprisingly, OpenMax does not work well in this setup (though we spent considerable effort tuning it). This is consistent with the results in (Dhamija et al., 2018; Shafaei et al., 2019); we conjecture the reason is that OpenMax cannot effectively recognize cross-dataset anomalous inputs using logit features, because they are too invariant to be useful for open-set recognition (Figure 3). Similar lackluster results hold for other methods that operate on logit features (Entropy and MSP).

4.3. SETUP-III: OPEN-SET SEMANTIC SEGMENTATION

Setup. In these experiments, we (re)introduce the task of open-set segmentation by repurposing "background" pixels in a contemporary segmentation benchmark (Cityscapes) as open-world pixels. As elaborated before, such pixels are traditionally either treated as just another class for segmentation evaluation (Everingham et al., 2015) or ignored completely. Instead, we evaluate them using open-world metrics such as AUROC. We will show that our statistical methods also outperform other typical open-world methods. As this setup has natural access to open-world pixels during training, we additionally explore the training of simple open-vs-closed classifiers. Datasets. Cityscapes (Cordts et al., 2016) provides per-pixel annotations for urban scene images (1024x2048 resolution) for autonomous driving research. We construct our train- and val-sets from its 2,975 training images, in which we use the last 10 images as the val-set and the rest as the train-set. We use its official 500 validation images as our test-set. The "background" pixels (shown in white in the ground-truth visualization in Figure 4) are the open-world examples in this setup. Through validation, we find reliable OTS features can be computed by projecting features from the last convolutional layer from 720-dim down to 100-dim via PCA, and L2-normalizing. Results. For our statistical models (as well as GDM), we randomly sample 5000 closed-world pixel features from each class, as it is prohibitively space-consuming to use all the pixel features from the Cityscapes train-set. We show a quantitative comparison in Table 3 and list salient conclusions below.
• Clearly, our simple statistical models (e.g., NN and GMM) perform significantly better than the classic open-world methods (e.g., MSP and OpenMax). However, when training on large amounts of open pixels, CLS methods achieve significantly better performance. This clearly shows the benefit of training on open-world pixels (Hendrycks et al., 2019b). We do note that GMMs do not need any open pixels during learning, and so may generalize better to novel open-world scenarios not encountered in the training set (Figure 5).
• GDM performs poorly, probably due to the arbitrary scales of the raw features, which are too uninformative to be used for open-set pixel recognition. We note that the other statistical methods all struggle with raw pooled features (cf. appendix). However, once we L2-normalize the pixel features to be scale-invariant, these statistical methods perform significantly better (as reported in this table).
• Figure 4 shows qualitative results. MSP predicts segment boundaries as open pixels. This makes sense, as MSP mostly returns aleatoric uncertainties corresponding to ambiguous pixel sensor measurements around object boundaries (Kendall & Gal, 2017). In contrast, GMM reports open pixels on truly novel objects, such as the street-shop and rollator, both of which are ignored during the training of the semantic segmentation network HRNet (Wang et al., 2019). These regions appear to be caused by epistemic uncertainty arising from the lack of training data (Kendall & Gal, 2017).
• Figure 6 plots AUROC performance vs. model size for various statistical models. Notably, NN consumes the most memory, even more than the underlying networks. GMMs perform the best and are quite lightweight, consuming only 0.6MB when built on the HRNet model (250MB). We find the best AUROC-memory tradeoff on the validation set (shown here to be a single-mixture GMM with full covariance and PCA), and find that it generalizes well to the held-out test set (cf. appendix).
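The per-class pixel subsampling used in the results above can be sketched as follows. This is a minimal numpy sketch under our own naming; it caps the number of retained features per class, since memorizing every Cityscapes pixel feature is prohibitively space-consuming.

```python
import numpy as np

def sample_pixel_features(pixel_feats, pixel_labels, per_class=5000, seed=0):
    """Subsample per-pixel features before fitting statistical models.

    pixel_feats: (N, D) features of labeled closed-world pixels;
    pixel_labels: (N,) class ids. At most `per_class` features are
    retained per class, sampled uniformly without replacement.
    """
    rng = np.random.RandomState(seed)
    keep = []
    for c in np.unique(pixel_labels):
        idx = np.flatnonzero(pixel_labels == c)
        if len(idx) > per_class:
            idx = rng.choice(idx, per_class, replace=False)
        keep.append(idx)
    keep = np.concatenate(keep)
    return pixel_feats[keep], pixel_labels[keep]

# toy run: 3 classes with 40/35/25 pixels, capped at 10 per class
feats = np.random.RandomState(1).randn(100, 8)
labs = np.repeat([0, 1, 2], [40, 35, 25])
sub_f, sub_l = sample_pixel_features(feats, labs, per_class=10)
```

The subsampled features then feed the same GMM/kmeans/NCM fitting used for image-level open-set recognition.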

5. CONCLUSION

We carry out an empirical exploration of open-set recognition via lightweight statistical pipelines. We find simple statistical models quite effective if built on properly processed off-the-shelf features computed by discriminative networks (originally trained for closed-world tasks). Our pipelines endow K-way networks with the ability to be "self-aware", with negligible additional compute cost (0.1%).

For GMMs, we study three types of covariance, "spherical", "diag" and "full", meaning the covariance matrix of each Gaussian component is controlled by a single scalar, a vector, or a full-rank (symmetric) matrix, respectively. For open-set semantic segmentation, it is prohibitively space-consuming to memorize all the pixel features of the whole train-set, so we randomly sample 5000 pixels from each of the 19 classes defined by Cityscapes (∼200k in total). In Figure 8, we plot the open-world performance (AUROC) for the two tasks w.r.t. the total memory cost (i.e., the space required to store a model's parameters). We can see that NN takes the most memory. Despite this, we note that the validation performance (as plotted here) translates nicely to real test sets, as shown in Figure 10. It is worth noting that the specified PCA-reduced dimension is not optimal: an even lower dimension can lead to better open-world performance (cf. Figure 7). We do not exhaustively explore this in this work, but instead emphasize that our pipeline is quite lightweight and can be tuned for specific tasks, e.g., a 0.6MB GMM-full compared with the 250MB HRNet for semantic segmentation.
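The memory comparison across "spherical", "diag" and "full" covariance structures reduces to simple parameter counting, sketched below. The function is ours; exact byte figures in the text additionally depend on float precision and on whether symmetric matrices are stored compactly, so this counts learned parameters rather than bytes.

```python
def gmm_param_count(d, k, cov="full"):
    """Learned parameters in a k-component, d-dimensional GMM."""
    means = k * d
    weights = k - 1                  # mixture weights sum to one
    if cov == "full":
        covs = k * d * (d + 1) // 2  # symmetric covariance per component
    elif cov == "diag":
        covs = k * d
    else:                            # "spherical": one scalar per component
        covs = k
    return means + weights + covs

# 50-dim pooled+PCA features vs. a raw 512x7x7 Res18 feature map
small = gmm_param_count(50, 5)
big = gmm_param_count(512 * 7 * 7, 5)
ratio = big / small
```

The count confirms the pipeline's headline claim: pooling plus PCA shrinks the covariance cost of a five-component full-covariance GMM by more than five orders of magnitude relative to the raw feature map.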

E VISUALIZATION OF GAUSSIAN MEANS

As statistical models are interpretable, we visualize what they capture. To do so, we visualize the per-class Gaussian means through medoid images: training images whose features are closest to their corresponding per-class mean feature. We show the medoid images in Figure 9, as well as some random images sorted by cosine similarity (i.e., Euclidean distance on L2-normalized features) to the Gaussian means within each class. The medoid images tend to capture the canonical objects of each class, e.g., those with a "standard" pose and a clean background.
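The medoid selection described above amounts to a nearest-to-mean lookup on normalized features. A minimal numpy sketch (the function name is ours):

```python
import numpy as np

def class_medoids(feats, labels):
    """Index of each class's medoid image: the training example whose
    L2-normalized feature has the highest cosine similarity to the
    class mean feature (equivalently, the smallest Euclidean distance
    on normalized features)."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    medoids = {}
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        mean = feats[idx].mean(axis=0)
        medoids[c] = int(idx[np.argmax(feats[idx] @ mean)])
    return medoids

# toy run: the "middle" direction within each class is its medoid
feats = np.array([[1., 0.], [0., 1.], [1., 1.],
                  [-1., 0.], [0., -1.], [-1., -1.]])
labels = np.array([0, 0, 0, 1, 1, 1])
med = class_medoids(feats, labels)
```

Sorting all of a class's features by the same dot product yields the similarity-ranked image strips shown in Figure 9.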

F OPEN-SOURCE DEMONSTRATION

We attach our code (via three Jupyter Notebook files) to demonstrate our exploration of open-set recognition. One can run the code with access to networks (Res50 and HRNet (Wang et al., 2019)) trained for the closed-world tasks. We are not able to upload models or pre-computed features due to the space limit, but we are committed to releasing them publicly after the paper notification. We refer readers to the Jupyter Notebook files for self-explanatory descriptions. • "demo_Open-Set-Image-Recognition-Setup-II_GMM_Res50pt_pca_L2norm.



We use open-source code when available. We implemented C2AE and its authors validated our code through personal communication.




Figure 1: We motivate open-set recognition with safety concerns in autonomous systems. Left: State-of-the-art semantic segmentation networks (Wang et al., 2019) do not model "strollers", which are outside the K closed-set categories in the Cityscapes benchmark (Cordts et al., 2016). Here, the network misclassifies the "stroller" as a "motorcycle", which can be a critical mistake when fed into an autonomy stack, because the two objects exhibit different behaviours (and so require different plans for obstacle avoidance). Right: While classic semantic segmentation benchmarks explicitly evaluate background pixels outside the set of K classes (Everingham et al., 2015), contemporary benchmarks such as Cityscapes ignore such pixels during evaluation. As a result, most segmentation networks also ignore such pixels during training. Perhaps surprisingly, such ignored pixels include vulnerable objects like wheelchairs and strollers (see left). We repurpose these ignored pixels as open-set examples from the (K+1)th "other" class, allowing for a large-scale exploration of open-set recognition via semantic segmentation.

Figure 2: Flowchart for extracting off-the-shelf (OTS) features used for open-set recognition. We determine the appropriate feature-processing steps on a validation set, including spatial pooling (sp) and L2-normalization (L2). Left: for open-set image recognition, we extract OTS features at the last convolutional layer of a K-way classification network. Right: for open-set semantic segmentation, we extract OTS features from the "pyramid head" module, which has sufficiently captured multi-scale information. We do not adopt spatial pooling and instead use the per-pixel features to represent pixels. Note that, different from our practice, many other methods like OpenMax (Bendale & Boult, 2016) and generative methods (Grathwohl et al., 2019) work on logit features, which are too invariant to be effective for open-set recognition (cf. Figure 3 and Table 2).



Figure 3: tSNE plots (Maaten & Hinton, 2008) of open-vs-closed data, as encoded by different features from a Res50 model (trained with pre-training in the closed world, cf. Table 2). Points are colored w.r.t. closed-world class labels. Left: Logit features mix open and closed data, suggesting that methods based on them (Entropy, SoftMax and OpenMax) may struggle in open-set classification. Right: Convolutional features better separate open vs. closed data (cf. Figure 2).


Figure 4: Two random images from Cityscapes-val, visualized with the ground-truth and the semantic segmentation maps predicted by HRNet. We visualize open-world pixels in the ground-truth (white regions), as well as predicted open-world pixels for a standard baseline (MSP) and our method (GMM). MSP tends to predict segment boundaries as open pixels, while our GMM tends to find open-set objects (street-shop and rollator, as indicated by the red arrows).

Figure 5: Performance of CLS versus the amount of (open) training data. We train CLS on the OTS features of the state-of-the-art HRNet, which has already exploited all closed-world pixels in the train-set. The binary open-vs-closed classifier CLS 2 outperforms CLS (K+1), presumably because the former is trained on balanced batches. With fewer than 50 open training images, GMMs outperform such discriminative models; with enough open training examples, simple binary classification performs remarkably well. However, because GMMs do not require any open training examples, they cannot overfit to them and so may generalize better to the open-world.

Finally, we (re)introduce the task of open-set semantic segmentation by repurposing background pixels as open-world examples, requiring classification of individual pixels into one of K known/closed-world classes or an "other" open-world class.
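The per-pixel decision rule this task implies can be sketched as follows. This is a minimal sketch (function name and threshold `tau` are ours): each pixel takes its argmax closed-set class unless its closed-world likelihood (e.g., from a per-class GMM on pixel features) falls below a threshold, in which case it is assigned to the reserved "other" class:

```python
import numpy as np

def openset_segment(pixel_scores, openset_ll, tau):
    """Label each pixel with its argmax closed-set class, or with index K
    (the "other" class) when the closed-world likelihood falls below tau.

    pixel_scores: (H, W, K) closed-set class scores per pixel.
    openset_ll:   (H, W) closed-world likelihood per pixel; low values
                  indicate likely open-set pixels.
    """
    K = pixel_scores.shape[-1]
    labels = pixel_scores.argmax(axis=-1)  # closed-set prediction
    labels[openset_ll < tau] = K           # reserve index K for "other"
    return labels
```

In practice `tau` would be tuned on a validation set, as done for the other hyper-parameters in the paper.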

Figure 7: Open-set performance versus feature dimension after PCA reduction. We study this with the open-set image classification experiment in which TinyImageNet/Cityscapes are the closed/open sets. We extract OTS features from the Res50pt network (as detailed in the paper). The spatially pooled (and L2-normalized) feature is 2048-dimensional. To avoid randomness (present in statistical models such as GMM and k-means), we use the NCM model, which simply computes per-class mean features and uses the distance to the nearest center as the open-set likelihood. Surprisingly, using PCA to reduce feature dimension can improve open-set performance!
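The PCA+NCM pipeline studied in this figure can be sketched with scikit-learn. This is a minimal sketch under our reading of the caption (function names ours): reduce features with PCA, store per-class means, and score test features by the (negated) distance to the nearest class mean:

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_ncm(feats, labels, pca_dim=100):
    """PCA-reduce features, then store per-class mean vectors (NCM)."""
    pca = PCA(n_components=pca_dim).fit(feats)
    z = pca.transform(feats)
    means = np.stack([z[labels == c].mean(axis=0) for c in np.unique(labels)])
    return pca, means

def ncm_score(pca, means, feats):
    """Closed-world likelihood: negative distance to the nearest class mean."""
    z = pca.transform(feats)
    d = np.linalg.norm(z[:, None, :] - means[None, :, :], axis=-1)
    return -d.min(axis=1)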

Figure 8: Open-set performance vs. memory cost (MB) of different models. Memory cost means the space required to store the parameters of each model. Left: open-set (200-way) image recognition for TinyImageNet-vs-Cityscapes. Right: open-set semantic segmentation (19 classes) on Cityscapes. NN memorizes training examples for open-world recognition, and hence consumes substantial memory to store the OTS features of training examples (more than the underlying SOTA models). Compared with the underlying networks, GMM-spherical and k-means incur negligible cost. They also perform considerably better than NN and GMM-full/diag on open-set image classification, but not as well as NN and GMM-full on open-world semantic segmentation. These plots serve as guidelines for choosing the appropriate statistical model for a specific task. Note that the validation performance shown here translates nicely to the test sets, as detailed in Figure 10.
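The memory comparison above can be made concrete by counting learned GMM parameters under each covariance structure. A back-of-the-envelope sketch (float32 storage assumed; exact library storage may differ slightly):

```python
def gmm_param_count(n_components, dim, covariance_type):
    """Number of learned parameters in a GMM (weights + means + covariances)."""
    weights = n_components - 1           # mixture weights (they sum to 1)
    means = n_components * dim
    cov_per_comp = {
        "spherical": 1,                  # one scalar variance per component
        "diag": dim,                     # one variance per dimension
        "full": dim * (dim + 1) // 2,    # symmetric covariance matrix
    }[covariance_type]
    return weights + means + n_components * cov_per_comp

def gmm_memory_mb(n_components, dim, covariance_type, bytes_per_param=4):
    """Approximate storage in MB, assuming float32 parameters."""
    return gmm_param_count(n_components, dim, covariance_type) * bytes_per_param / 2**20
```

For high-dimensional features, the full covariance grows quadratically in `dim`, which is why spherical/diagonal GMMs (combined with PCA) are far more memory-efficient.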

ipynb": We show how we train, select and evaluate GMMs on cross-dataset open-set image recognition (Setup-II).
• "demo_tsne_visual_res50pt.ipynb": We show t-SNE visualizations of OTS features of cross-dataset open-set examples (Setup-II). This intuitively demonstrates the benefit of exploiting OTS features for open-set recognition.
• "demo_open-set-semantic-segmentation.ipynb": We demonstrate how we train and evaluate a GMM under Setup-III, open-set semantic segmentation.

Figure 9: Visualization of per-class Gaussian means with medoid images (those whose features are closest to the means within their corresponding classes). As a comparison, we show random images sorted by their cosine similarity to the per-class Gaussian mean. The medoid images capture "canonical" objects representing their classes, e.g., those with "standard" shapes and cleaner backgrounds. This visualization suggests our statistical models are quite interpretable.

Figure 10: Detailed results of (left) open-set image recognition using Cityscapes images as cross-dataset open-world examples (Setup-II) and (right) open-set semantic segmentation (Setup-III). In each cell, the first and (if present) second row numbers denote AUROC performance on the val and test sets, respectively. We highlight the best performance on the val set, on which we tune the hyper-parameters and report the performance on the test set. As for notation, gGMM means we learn a GMM "globally" on the whole closed train-set, agnostic to class labels, while cGMM means that we learn class-conditional GMMs. For open-set semantic segmentation, we only train GMMs in a class-conditional fashion (i.e., cGMM), because using pixel features from all classes to train a global GMM is prohibitively time-consuming. "Raw feat." means the feature extracted from the last convolution layer without L2-normalization, and "w/ L2" means we L2-normalize the extracted features. Clearly, L2-normalization greatly boosts open-world performance.
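The class-conditional variant (cGMM) described above can be sketched with scikit-learn. A minimal sketch (function names ours; one spherical component per class for brevity): fit one GMM per closed-world class and score a test feature by its best per-class log-likelihood:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_cgmm(feats, labels, n_components=1, covariance_type="spherical", seed=0):
    """Fit one GMM per closed-world class (class-conditional GMMs)."""
    return {c: GaussianMixture(n_components=n_components,
                               covariance_type=covariance_type,
                               random_state=seed).fit(feats[labels == c])
            for c in np.unique(labels)}

def cgmm_score(gmms, feats):
    """Closed-world likelihood: max per-class log-likelihood per example."""
    return np.stack([g.score_samples(feats) for g in gmms.values()]).max(axis=0)
```

Low scores indicate features far from every class-conditional density, i.e., likely open-set examples.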

Implementation. As discussed earlier, open-world recognition is often explored through the lens of open-set classification. To ensure our approaches retain high accuracy on the original closed-world tasks, we build statistical models on top of off-the-shelf (OTS) state-of-the-art networks. For open-set image classification, we fine-tune an ImageNet-pretrained ResNet network (Res18/50 in our experiments) (He et al., 2016) exclusively on the closed train-set using cross-entropy loss. For open-set semantic segmentation we use HRNet (Wang et al., 2019), a highly-ranked model on the Cityscapes leaderboard (Cordts et al., 2016). We extract features at the penultimate layer of each discriminative network (other layers could also be used, but we do not explore them in this work). We conduct experiments with PyTorch (Paszke et al., 2017) on a single Titan X GPU.

Single-dataset open-set recognition (Setup-I) AUROC↑. Error bars are shown in gray rows. Bolded numbers mark the top-5 ranked methods on each dataset. Because GDM does not L2-normalize features, we add a variant that does (denoted GDM-L2). L2-normalization clearly improves performance, particularly on CIFAR. Interestingly, GDM-L2 underperforms NCM, implying that a full-covariance Gaussian (used by GDM) overfits compared to an identity covariance. GMM makes use of low-dimensional covariances (learned via PCA) and strikes a balance between flexibility and generalization, achieving comparable performance to many sophisticated approaches.
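The GDM baseline compared here scores features by Mahalanobis distance to the nearest class mean under a shared covariance (the GDM-L2 variant simply L2-normalizes features first). A minimal sketch under that reading (function names ours):

```python
import numpy as np

def fit_gdm(feats, labels):
    """Gaussian Discriminant Model: per-class means, shared full covariance."""
    classes = np.unique(labels)
    means = np.stack([feats[labels == c].mean(axis=0) for c in classes])
    centered = np.concatenate([feats[labels == c] - means[i]
                               for i, c in enumerate(classes)])
    cov = centered.T @ centered / len(feats)
    # Small ridge for numerical stability before inversion.
    prec = np.linalg.inv(cov + 1e-6 * np.eye(feats.shape[1]))
    return means, prec

def gdm_score(means, prec, feats):
    """Closed-world likelihood: negative Mahalanobis distance to nearest mean."""
    d = np.stack([np.einsum("nd,df,nf->n", feats - m, prec, feats - m)
                  for m in means])
    return -d.min(axis=0)
```

Examples far from every class mean (in Mahalanobis distance) receive low scores and are flagged as open-set.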

, where some outlier examples are sampled from a different dataset for training/validation (e.g., train on TinyImageNet-closed as the closed set, validate on MNIST-open as the outlier set, and test on CIFAR-open as the open set). Conclusions drawn under this setup may generalize better due to less dataset bias in the experimental protocol.



Cross-dataset open-set image recognition (Setup-II) AUROC↑. In this setup, we train on TinyImageNet, validate using outlier images from a second dataset, and test using open-set images from a third dataset. For each open-set dataset, we compute the average AUROC over all results when using different outlier datasets. We study two Res50 models, either trained from scratch (pink row) or fine-tuned from an ImageNet-pretrained model (blue row). Clearly, simple statistical models can handily outperform much prior work. Pre-training boosts open-set recognition performance for all methods (see the last row pair). (Columns compare MSP, Entropy, OpenMax, MSPc, MCdrop, C2AE, GDM, GDM-L2, NN, NCM, kmeans, GMM, and the binary classifiers CLS 2 and CLS (K+1).)

Open-set semantic segmentation (Setup-III) AUROC↑. Simple statistical methods (GMMs) outperform prior methods, with the notable exception of discriminative classifiers (CLS 2 and CLS (K+1)) that have access to open-set training examples. Figure 5 analyzes this further, demonstrating that GMMs can outperform such discriminative models when the latter have access to fewer open training examples, suggesting that GMMs may better generalize to never-before-seen open-world scenarios. AUROC vs. memory cost (MB) for various statistical models for open-set semantic segmentation: NN stores ∼100k OTS features, which is larger than the underlying network (HRNet). We explore GMMs with various covariance structures (spherical, diagonal, full), feature dimensionalities via PCA, and numbers of mixture components.

and 5 list details of various methods under cross-dataset evaluation (Setup-II), Table 2. Please refer to the caption for details.

C PERFORMANCE VS. PCA-REDUCED DIMENSION

As analyzed in the main paper, PCA is an important technique for making our pipeline lightweight by considerably reducing feature dimension. We study how a statistical model performs under different feature dimensions reduced by PCA. We choose the simple NCM method, which does not involve randomness (unlike kmeans and GMM, which require random initialization for learning). We study this through open-set image recognition under Setup-II. To simplify the study, we choose (resized) Cityscapes images as open-set data, i.e., we use TinyImageNet/Cityscapes images as the closed/open set. As we use the Res50 network in this diagnostic study, the original dimension of the pooled features is 2048. In Fig. 7, we plot the performance (AUROC) of NCM as a function of the PCA-reduced dimension. Perhaps surprisingly, PCA even improves open-world performance while significantly reducing feature dimension (from 2048 to 100)!

D PERFORMANCE VS. MEMORY/COMPUTE

As seen previously, PCA greatly reduces the feature dimension and hence makes the statistical models quite lightweight. We now study how lightweight different statistical models can be while maintaining open-world performance. We analyze the models learned for two tasks (open-world image classification and open-world semantic segmentation), where the OTS features have dimension 2048 (extracted from Res50) and 720 (extracted from HRNet), respectively. We use PCA to reduce the feature dimensions to 200 and 100, respectively.

APPENDIX OUTLINE

As elaborated in the main paper, we introduce a lightweight statistical pipeline for open-set recognition by repurposing off-the-shelf (OTS) features computed by a state-of-the-art recognition network. As our pipeline does not require (re)training the underlying network, it is guaranteed to replicate the state-of-the-art performance of the network on the (closed-world) task for which it was trained, while still allowing the final recognition system to properly identify never-before-seen data from the open-world. In the appendix, we expand on our pipeline with more experiments, analyses and visualizations. We outline the appendix below.

Section C: Reduced dimension via PCA. We show that PCA can reduce dimensionality significantly (making our pipeline quite lightweight) while maintaining or even improving performance.

Section D: Performance vs. memory/compute. We rigorously evaluate the memory/compute costs of our various statistical pipelines, emphasizing solutions that are both accurate and lightweight.

Section E: Visualization of Gaussian component means. One benefit of our simple statistical models is their interpretability; we visualize Gaussian means through medoid images, and demonstrate that they correspond to canonical objects (e.g., those with standard poses and clean backgrounds).

Section F: Open-source demonstration. We include code (via Jupyter notebooks) for open-set semantic segmentation, assuming one has access to precomputed features from HRNet (Wang et al., 2019).

In the main paper, we state that we tune and select statistical models (when there are hyper-parameters to tune) on the small validation set, and report on the test set with the selected (best-performing) model.
Such hyper-parameters include the number of means/components in the kmeans and GMM models, and the covariance type in GMM: "spherical", "diagonal" and "full" denote that the covariance matrix of each Gaussian component is controlled by a single scalar, a vector, and a full-rank matrix, respectively. This demonstrates that validation can reliably tune the statistical models, whose performance translates to the test sets. Moreover, we also record the detailed results of whether using

Table 4: Cross-dataset evaluation (Setup-II) with a K-way classification network trained from scratch. We report performance with the AUROC metric. This table supplements Table 2. Recall that we train on TinyImageNet as the closed set, use another dataset as the outlier set to tune and select models, and report on a third dataset as the open set. All methods operate on off-the-shelf features extracted from the underlying classification network. We report their average performance and standard deviation in the last two columns.
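The tuning procedure described above (selecting the component count and covariance type on the validation set) can be sketched with scikit-learn. A minimal sketch (function name and grid values ours): fit a GMM on closed-set training features for each hyper-parameter setting, score validation features, and keep the model with the best open-vs-closed AUROC:

```python
import numpy as np
from itertools import product
from sklearn.metrics import roc_auc_score
from sklearn.mixture import GaussianMixture

def select_gmm(train_feats, val_feats, val_is_open, seed=0):
    """Grid-search GMM hyper-parameters; keep the model with the best
    validation AUROC for open-vs-closed detection."""
    best = None
    for k, cov in product([1, 2, 4], ["spherical", "diag", "full"]):
        gmm = GaussianMixture(n_components=k, covariance_type=cov,
                              random_state=seed).fit(train_feats)
        # Low log-likelihood => open-set, so negate the score for AUROC.
        auroc = roc_auc_score(val_is_open, -gmm.score_samples(val_feats))
        if best is None or auroc > best[0]:
            best = (auroc, k, cov, gmm)
    return best  # (val AUROC, n_components, covariance_type, fitted model)
```

The selected model is then evaluated once on the held-out test set, mirroring the protocol reported in the tables.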

