NON-PARAMETRIC OUTLIER SYNTHESIS

Abstract

Out-of-distribution (OOD) detection is indispensable for safely deploying machine learning models in the wild. One of the key challenges is that models lack supervision signals from unknown data, and as a result, can produce overconfident predictions on OOD data. Recent work on outlier synthesis modeled the feature space as a parametric Gaussian distribution, a strong and restrictive assumption that might not hold in reality. In this paper, we propose a novel framework, non-parametric outlier synthesis (NPOS), which generates artificial OOD training data and facilitates learning a reliable decision boundary between ID and OOD data. Importantly, our proposed synthesis approach does not make any distributional assumption on the ID embeddings, thereby offering strong flexibility and generality. We show that our synthesis approach can be mathematically interpreted as a rejection sampling framework. Extensive experiments show that NPOS achieves superior OOD detection performance, outperforming competitive baselines by a significant margin.

1. INTRODUCTION

When deploying machine learning models in the open and non-stationary world, their reliability is often challenged by the presence of out-of-distribution (OOD) samples. As trained models have not been exposed to the unknown distribution during training, identifying OOD inputs has become a vital and challenging problem in machine learning. There is an increasing awareness in the research community that trained models should not only perform well on in-distribution (ID) samples, but also be capable of distinguishing ID from OOD samples. To achieve this goal, a promising learning framework is to jointly optimize for both (1) accurate classification of samples from P_in, and (2) reliable detection of data from outside P_in. This framework thus integrates distributional uncertainty as a first-class construct in the learning process. In particular, an uncertainty loss term aims to perform a level-set estimation that separates ID from OOD data, in addition to performing ID classification. Despite the promise, a key challenge is how to provide OOD data for training without explicit knowledge of the unknowns. A recent work by Du et al. (2022c) proposed synthesizing virtual outliers from the low-likelihood region in the feature space of ID data, and showed strong efficacy for discriminating the boundary between known and unknown data. However, they modeled the feature space as a class-conditional Gaussian distribution, a strong and restrictive assumption that might not always hold in practice when facing complex distributions in the open world. Our work mitigates this limitation. In this paper, we propose a novel learning framework, Non-Parametric Outlier Synthesis (NPOS), that enables models to learn from the unknowns. Importantly, our proposed synthesis approach does not make any distributional assumption on the ID embeddings, thereby offering strong flexibility and generality, especially when the embeddings do not conform to a parametric distribution.
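The joint objective above can be sketched as a classification loss on ID samples plus a weighted uncertainty term that pushes an OOD score apart for ID data and synthesized outliers. The sketch below is illustrative only: the "energy" score, the binary logistic surrogate, and the weight `lam` are placeholder assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the class dimension
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def joint_loss(logits_id, labels_id, score_id, score_ood, lam=0.1):
    # (1) standard cross-entropy on in-distribution samples
    p = softmax(logits_id)
    ce = -np.log(p[np.arange(len(labels_id)), labels_id] + 1e-12).mean()
    # (2) uncertainty loss: a binary logistic surrogate encouraging a
    #     high OOD score on ID data and a low one on synthetic outliers
    #     (placeholder for the paper's level-set estimation loss)
    unc = (np.logaddexp(0.0, -score_id).mean()
           + np.logaddexp(0.0, score_ood).mean())
    return ce + lam * unc
```

In practice both terms would be backpropagated through the feature encoder; here the scores are just arrays to keep the sketch self-contained.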
Our framework is illustrated in Figure 1. To synthesize outliers, our key idea is to "spray" around the low-likelihood ID embeddings, which lie on the boundary between ID and OOD data. These boundary points are identified by non-parametric density estimation with the nearest-neighbor distance. Artificial outliers are then sampled from a Gaussian kernel centered at the embeddings of the boundary ID samples. Rejection sampling is performed by keeping only the synthesized outliers with low likelihood. Leveraging the synthesized outliers, our uncertainty loss effectively performs the level-
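The three synthesis steps described above (boundary selection via k-NN distance, Gaussian-kernel sampling, rejection of high-likelihood candidates) can be sketched as follows. All hyperparameters (`k`, `n_boundary`, `n_samples`, `n_keep`, `sigma`) are illustrative assumptions, and brute-force distances stand in for the approximate nearest-neighbor search a real implementation would use.

```python
import numpy as np

def knn_distance(points, reference, k):
    # distance from each point to its k-th nearest neighbor in `reference`
    d = np.linalg.norm(points[:, None, :] - reference[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, k - 1]

def synthesize_outliers(id_embs, k=5, n_boundary=10, n_samples=20,
                        n_keep=5, sigma=0.1, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    # 1. non-parametric density estimate: a large k-NN distance marks a
    #    low-likelihood (boundary) ID embedding; k+1 skips the self-distance
    dists = knn_distance(id_embs, id_embs, k + 1)
    boundary = id_embs[np.argsort(dists)[-n_boundary:]]
    # 2. "spray": sample candidates from a Gaussian kernel centered at
    #    randomly chosen boundary embeddings
    centers = boundary[rng.integers(len(boundary), size=n_samples)]
    candidates = centers + sigma * rng.standard_normal(centers.shape)
    # 3. rejection step: keep only the candidates farthest from the ID set,
    #    i.e. those with the lowest estimated likelihood
    cand_dists = knn_distance(candidates, id_embs, k)
    return candidates[np.argsort(cand_dists)[-n_keep:]]
```

The kept outliers then serve as the negative samples for the uncertainty loss described above.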

Code is available at https://github.com/deeplearning-wisc

