AN UPPER BOUND FOR THE DISTRIBUTION OVERLAP INDEX AND ITS APPLICATIONS

Abstract

The overlap index between two probability distributions has various applications in statistics, machine learning, and other scientific research. However, approximating the overlap index is challenging when the probability distributions are unknown (i.e., in distribution-free settings). This paper proposes an easy-to-compute upper bound for the overlap index that requires no knowledge of the underlying distribution models. We first use the bound to derive an upper limit on the accuracy of a trained machine learning model under a domain shift. We additionally employ the bound to study distribution membership classification of given samples. Specifically, we build a novel, distribution-free, computation- and memory-efficient one-class classifier by converting the bound into a confidence score function. The proposed classifier requires no trainable parameters and is empirically accurate with only a small number of in-class samples. It outperforms many state-of-the-art methods on various datasets and in different one-class classification scenarios, including novelty detection, out-of-distribution detection, and backdoor detection. These results show significant promise toward broadening the applications of overlap-based metrics.

1. INTRODUCTION

The distribution overlap index refers to the area of the region shared by two probability density functions (i.e., Fig. 1(a)) and measures the similarity between the two distributions: a high overlap index implies high similarity. Although the overlap index has applications in many areas, such as biology (Langøy et al., 2012; Utne et al., 2012), economics (Milanovic & Yitzhaki, 2002), and statistics (Inman & Bradley Jr, 1989), the literature on approximating it in distribution-free settings is thin. This work proposes an upper bound for the overlap index in distribution-free settings to broaden the potential applications of overlap-based metrics. The bound is easy to compute and contains three terms: a constant, the norm of the difference between the two distributions' means, and a variation distance between the two distributions over a subset. Although finding such an upper bound for the distribution overlap index is already valuable, we further explore two additional applications of our bound, discussed below.

One application of our bound is domain shift analysis. A domain shift is a change in distribution between a model's training dataset and the testing dataset encountered during deployment (i.e., the overlap index between the two datasets' distributions is less than 1). We express the model's testing accuracy in terms of the overlap index between the training and testing distributions and then derive an upper limit on the accuracy using our bound on the overlap index. Knowing an upper bound on a model's testing accuracy helps measure the model's potential and compare it with other models. We validated the calculated upper limit with experiments on backdoor attacks.

Another application of our bound is one-class classification.
Specifically, a one-class classifier outputs positive for in-class samples and negative for out-class samples, which are absent, poorly sampled, or not well defined (i.e., Fig. 1(b)). We propose a novel one-class classifier by converting our bound into a confidence score function that evaluates whether a sample is in-class or out-class. The proposed classifier has many advantages. For example, implementing deep neural network-based classifiers requires training thousands of parameters and large memory, whereas implementing our classifier does not: it only needs sample norms to calculate the confidence scores.

Overall, the contributions of this paper include:
• Finding a distribution-free upper bound for the overlap index.
• Applying this bound to the domain shift analysis problem, with experiments.
• Proposing a novel one-class classifier with the bound as the confidence score function.
• Evaluating the proposed one-class classifier against various state-of-the-art methods on several datasets, including UCI datasets, CIFAR-100, and sub-ImageNet, and in different one-class classification scenarios, such as novelty detection, out-of-distribution detection, and neural network backdoor detection.
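For concreteness, when both densities are actually known (unlike the distribution-free setting targeted by this paper), the overlap index can be estimated by numerically integrating the pointwise minimum of the two densities. Below is a minimal sketch with two Gaussians; the function names, grid size, and integration ranges are illustrative choices, not part of the paper:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def overlap_index(pdf1, pdf2, lo, hi, n=200_000):
    """OVL = integral of min(p1, p2): the shared area in Fig. 1(a),
    approximated with a Riemann sum on a grid over [lo, hi]."""
    x = np.linspace(lo, hi, n)
    dx = (hi - lo) / (n - 1)
    return float(np.minimum(pdf1(x), pdf2(x)).sum() * dx)

# Identical unit Gaussians: the densities coincide, so OVL is 1.
same = overlap_index(lambda x: gaussian_pdf(x, 0.0, 1.0),
                     lambda x: gaussian_pdf(x, 0.0, 1.0), -8.0, 8.0)

# Means 3 apart: for equal-variance unit Gaussians OVL = 2*Phi(-3/2),
# roughly 0.134, so the overlap shrinks as the means separate.
apart = overlap_index(lambda x: gaussian_pdf(x, 0.0, 1.0),
                      lambda x: gaussian_pdf(x, 3.0, 1.0), -8.0, 11.0)
```

For two densities that each integrate to 1, the standard identity OVL = 1 − TV holds, where TV is the total variation distance; this is what ties the overlap index to the variation-distance term appearing in the proposed bound.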

1.1. BACKGROUND AND RELATED WORKS

Measuring the similarity between distributions: Gini & Livada (1943) and Weitzman (1970) introduced the concept of the distribution overlap index. Other measures of similarity between distributions include the total variation distance, the Kullback-Leibler divergence (Kullback & Leibler, 1951), Bhattacharyya's distance (Bhattacharyya, 1943), and the Hellinger distance (Hellinger, 1909). In psychology, the definitions of some effect-size measures involve the distribution overlap index, such as Cohen's U index (Cohen, 2013), McGraw and Wong's CL measure (McGraw & Wong, 1992), and Huberty's I degree of non-overlap index (Huberty & Lowman, 2000). However, they all make strong distribution assumptions (e.g., symmetry or unimodality) regarding the overlap index. Pastore & Calcagnì (2019) approximate the overlap index via kernel density estimators.

Figure 1: (a): Overlap of two distributions. (b): One-class classification. (c): Backdoor attack.

One-class classification: Moya & Hush (1996) coined the term one-class classification. One-class classification intersects with novelty detection, anomaly detection, out-of-distribution detection, and outlier detection; Yang et al. (2021) explain the differences among these areas. Khan & Madden (2014) discuss many traditional, non-neural-network one-class classifiers, such as the one-class support vector machine (Schölkopf et al., 2001), decision trees (Comité et al., 1999), and one-class nearest neighbor (Tax, 2002). Two neural network-based one-class classifiers are Deep SVDD (Ruff et al., 2018) and OCGAN (Perera et al., 2019). Morteza & Li (2022) introduce a Gaussian-mixture-based energy measurement and compare it with several other score functions for one-class classification, including the maximum softmax score (Hendrycks & Gimpel, 2017), the maximum Mahalanobis distance (Lee et al., 2018), and the energy score (Liu et al., 2020a).

Neural network backdoor attack and detection: Gu et al. (2019) and Liu et al. (2018b) introduced the concept of the neural network backdoor attack. The attack has two steps: during training, the attacker injects triggers into the training dataset; during testing, the attacker leads the network to misclassify by presenting the triggers (i.e., Fig. 1(c)). The data poisoning attack (Biggio et al., 2012) and the adversarial attack (Goodfellow et al., 2014) overlap with the backdoor attack. Some proposed
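The one-class classifiers surveyed above, like the classifier proposed in this paper, reduce to computing a confidence score for a sample and thresholding it on in-class data. The sketch below illustrates that pattern with a deliberately simple, hypothetical norm-based score; it is not the bound-based score developed in this paper, and the class name, quantile level, and synthetic data are assumptions for illustration only:

```python
import numpy as np

class NormScoreOneClassClassifier:
    """Illustrative one-class classifier: scores a sample by how far its
    norm falls from the mean norm of the in-class samples, then thresholds.
    No trainable weights -- only norms of in-class samples are summarized."""

    def fit(self, X_in, quantile=0.95):
        norms = np.linalg.norm(X_in, axis=1)
        self.center = norms.mean()
        # Threshold chosen so that `quantile` of the in-class samples score as positive.
        self.threshold = np.quantile(np.abs(norms - self.center), quantile)
        return self

    def predict(self, X):
        # Positive (in-class) when the sample's norm is close to the in-class mean norm.
        scores = np.abs(np.linalg.norm(X, axis=1) - self.center)
        return scores <= self.threshold

rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(500, 16))   # in-class: standard normal
outliers = rng.normal(0.0, 4.0, size=(100, 16))  # out-class: much larger scale
clf = NormScoreOneClassClassifier().fit(inliers)
```

Here most inliers are accepted and the large-scale outliers are rejected, since their norms sit far from the in-class mean norm; the proposed classifier follows the same fit-free, score-and-threshold pattern but with a score derived from the overlap bound.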

