ADAPTIVE ROBUST EVIDENTIAL OPTIMIZATION FOR OPEN SET DETECTION FROM IMBALANCED DATA

Abstract

Open set detection (OSD) aims to identify data samples of an unknown class (i.e., the open set) from those of known classes (i.e., the closed set) using a model trained only on closed set samples. However, a closed set may involve a highly imbalanced class distribution. Accurately differentiating open set samples from those of a minority class in the closed set poses a fundamental challenge, as the model may be equally uncertain when recognizing samples from the minority class. In this paper, we propose Adaptive Robust Evidential Optimization (AREO), which offers a principled way to quantify sample uncertainty through evidential learning while optimally balancing model training over all classes in the closed set through adaptive distributionally robust optimization (DRO). To prevent the model from focusing primarily on the most difficult samples, as standard DRO would, adaptive DRO training is performed, governed by a novel multi-scheduler learning mechanism that ensures an optimal training behavior: the model gives sufficient attention to the difficult samples and the minority class while remaining capable of learning common patterns from the majority classes. Experimental results on multiple real-world datasets demonstrate that the proposed model outputs uncertainty scores that clearly separate samples from the closed and open sets, and that its detection performance outperforms competitive baselines.

1. INTRODUCTION

In many practical scenarios (e.g., drug discovery, anomaly detection, etc.), one is likely to encounter unknown samples, and it is desirable for the model to properly detect these samples as unknown. Various approaches have been proposed to tackle the unknown sample detection problem (Bendale & Boult, 2016; Sun et al., 2020), using techniques such as the Weibull-calibrated SVM (W-SVM) (Scheirer et al., 2013), reconstruction error (Zhang & Patel, 2017), nearest neighbor (Júnior et al., 2016), and quasi-linear functions (Cevikalp & Yavuz, 2017). As a representative example, the OpenMax framework removes softmax from the last layer of a neural network and includes an additional layer to produce the probability of a sample being unknown. This essentially redistributes the probability mass over (K + 1) classes (with unknown being a new class). Multiple efforts follow this direction (Sun et al., 2020; Neal et al., 2018). While this technique is viable for detecting open set samples, the additional layer is only included during the testing phase; as a result, training still follows the closed set assumption.

Recent advances in uncertainty quantification provide a more systematic way to break the closed set limitation by explicitly modeling the uncertainty mass that corresponds to the unknown class. One representative work is the evidential deep learning (EDL) model (Sensoy et al., 2018), which treats the predicted multi-class probability as a multinomial opinion according to subjective logic (Jøsang, 2016). Similar to EDL, Prior Networks (PNs) (Malinin & Gales, 2018) explicitly consider the distributional uncertainty that quantifies the distributional mismatch. Posterior Networks (Charpentier et al., 2020) further improve PNs by leveraging normalizing flows for density estimation in the latent space to predict a posterior distribution, which can be used to separate out-of-distribution (OOD) samples from in-distribution ones.

Despite the promising progress in OSD, which focuses on differentiating samples of the closed and open sets, limited attention has been devoted to the situation where the closed set involves highly imbalanced classes, which is quite common in many practical settings. For example, in anomaly detection, the known types of anomalies available for model training are usually unevenly distributed over multiple categories (e.g., car accident vs. shooting). Similarly, in computer-aided medical diagnosis, the diseases known to the model may be highly imbalanced based on the available cases. Thus, under the standard Empirical Risk Minimization (ERM) training framework, the model may not learn properly from the minority class due to the lack of positive samples. As a result, it is more likely to misidentify a minority-class sample as an unknown-class sample during OSD, leading to a high false-positive rate.

Distributionally Robust Optimization (DRO) offers an effective means to handle an imbalanced class distribution in the closed set setting (Qi et al., 2020; Zhu et al., 2019). In DRO, the worst-case weighted loss is optimized, where the weights are searched in a given neighborhood (referred to as the uncertainty set) of the empirical sample distribution such that the overall loss is maximized. By expanding the uncertainty set, the model is encouraged to assign higher weights to difficult samples.
As a result, samples from the minority class will be given more emphasis during model training if they are not properly learned (which incurs a larger loss). Another common solution for handling an imbalanced class distribution in the closed set is oversampling to achieve a more balanced class distribution (Chawla et al., 2002). While both oversampling and DRO may help to improve closed set performance, neither is adequate to address OSD from imbalanced data. A fundamental challenge lies in the interplay between samples from the minority class and the difficult samples from the majority classes. Simply oversampling the minority class may neglect these difficult samples. Similarly, applying DRO with a flexible uncertainty set may put too much emphasis on the difficult samples and ignore the minority class as well as some representative samples from the majority classes, which hampers proper model training. In fact, directly applying these models to OSD may lead to even worse detection performance, as evidenced by our experimental results.

A few recent approaches try to address OSD under the class-imbalanced setting. Liu et al. (2019) leverage the visual similarity across the centroids of closed set classes to allow more effective training from the minority class samples. However, the samples from the minority class may look quite different from most other samples, making such a strategy less effective. Further, Wang et al. (2022) try to push minority class samples away from open set ones in the feature space using contrastive learning. However, the final OSD depends heavily on the selection of open set samples, as evidenced by our experimental results.

To systematically tackle the fundamental challenge outlined above, we propose Adaptive Robust Evidential Optimization (AREO), which offers a principled way to quantify sample uncertainty through evidential learning while optimally balancing model training over all classes in the closed set through novel adaptive DRO learning. To prevent the model from focusing primarily on the most difficult samples, as standard DRO would, the adaptive learning strategy gradually increases the size of the uncertainty set using a multi-scheduler function (MSF), which allows the model to learn from easy to hard samples. A class-ratio biased loss is further assigned to the minority class to ensure proper learning from its limited samples. Our main contribution is fourfold:

• a novel extension of DRO to evidential learning, which enables principled uncertainty quantification under the class-imbalanced setting, critical for many applications, including OSD;
• adaptive DRO training governed by a uniquely designed multi-scheduler learning mechanism to ensure an optimal model training behavior that gives sufficient attention to the difficult samples and the minority class while remaining capable of learning common patterns from the majority classes;
• a theoretical connection to a boosting model (i.e., AdaBoost), which ensures the nice convergence and generalization properties of AREO;
• state-of-the-art OSD performance on various datasets.

2. RELATED WORK

Open set detection. Various SVM-based techniques (Scheirer et al., 2013; Jain et al., 2014; Scheirer et al., 2014) have been proposed for OSD. For instance, Scheirer et al. (2013) proposed an SVM-based model that performs detection using a Weibull-calibrated SVM (W-SVM) by leveraging Extreme Value Theory (EVT). Reconstruction-based approaches have also been proposed (Zhang & Patel, 2017), where a threshold defined over the reconstruction error is used to decide whether a sample is from a known or an unknown class. Other traditional models, such as nearest neighbor (Júnior et al., 2016) and quasi-linear functions (Cevikalp & Yavuz, 2017), have been explored as well. Deep learning models have been increasingly applied to open set detection (Yoshihashi et al., 2019; Sun et al., 2020; Bendale & Boult, 2016). As an example, OpenMax replaces the softmax function and redistributes the softmax probability to produce the probability of a sample being unknown (Bendale & Boult, 2016). Sun et al. (2020) proposed VAE-based open set recognition, where the probability of a sample belonging to each of the known classes is used as a proxy to detect whether the sample is known or unknown; each known class distribution is modeled as a Gaussian using the training data. Some recent approaches aim to learn a more compact representation of closed set samples (Cevikalp et al., 2021; Yang et al., 2020) or push the open set class samples to a specific region in an embedding space for better recognition (Chen et al., 2021). Special loss functions (Dhamija et al., 2018) and generative processes (Perera et al., 2020) have also been leveraged to separate open set samples from closed set ones.

Recently, systematic approaches have been presented to break the closed set limitation by explicitly modeling the uncertainty mass belonging to the unknown distribution. One representative work in this line is the evidential deep learning (EDL) model (Sensoy et al., 2018). Similarly, Malinin & Gales (2018) propose Prior Networks (PNs) that explicitly consider the distributional uncertainty to quantify the distribution mismatch. Despite offering a natural way to quantify uncertainty, both of these methods require OOD data samples for model training, which is less practical. Charpentier et al. (2020) propose Posterior Networks, which leverage normalizing flows for density estimation in the latent space to predict the posterior distribution using only in-distribution samples. Despite the significant progress in OSD, limited attention has been drawn to the scenario where the closed set involves highly imbalanced classes. A few recent works try to tackle this fundamental challenge of OSD under the class-imbalanced setting. Liu et al. (2019) propose a technique based on the assumption that visual similarity exists between head and tail classes in the closed set, and design a model that leverages this similarity to make recognition of minority class samples more robust. However, such an assumption may not universally hold, which limits the applicability of the model in general settings. Further, Wang et al. (2022) leverage contrastive learning to push the minority class samples away from the open set ones in the feature space during training. However, the final OSD performance highly depends on the open set samples used for training.
Distributionally robust optimization. Distributionally robust optimization is grounded in statistical learning theory: the worst-case weighted loss is optimized by searching the weights in a given uncertainty set (Duchi & Namkoong, 2019; Zhu et al., 2019; Namkoong & Duchi, 2016). DRO offers a systematic way to handle imbalanced class distributions and has been commonly used in supervised learning settings (Qi et al., 2020; Zhu et al., 2019) as well as in multiple instance learning (Sapkota et al., 2021). In a similar spirit, Li et al. (2020) propose Tilted Empirical Risk Minimization (TERM), which redefines ERM by introducing a hyperparameter t. Depending on the value of t, different variants of the loss (maximum, minimum, and average) are recovered, providing a unified way to perform effective training in the presence of outliers and class imbalance. While DRO may help to improve closed set performance, it is not sufficient to address the OSD problem with imbalanced data, because DRO with a flexible uncertainty set may put too much emphasis on the difficult samples and ignore those from the minority class as well as representative samples from the majority classes. Our proposed AREO model offers an adaptive learning strategy that learns from easy samples in the early training phase and gradually shifts the focus to the difficult samples. Furthermore, the class-ratio biased loss ensures proper learning from the limited samples in the minority class.

3. METHODOLOGY

3.1. PRELIMINARIES

Evidential Learning for OSD. Let $\mathcal{D}_N = \{X, Y\} = \{(x_1, y_1), ..., (x_N, y_N)\}$ be a set of training samples in the closed set. Each $x_n \in \mathbb{R}^D$ is a D-dimensional feature vector and $y_n \in \{0, 1\}^C$ is the one-hot encoding of its class label: $y_{nj} = 1$ and $y_{nk} = 0$ for all $k \neq j$, with $j$ being the true label. Following the principle of subjective logic (SL) (Jøsang, 2016), we consider a total of $C + 1$ mass values, with $C$ being the number of classes. We assign a belief mass $b_c, \forall c \in [C]$, to each singleton, which corresponds to one class in the closed set; the remaining mass is referred to as the uncertainty mass, denoted by $u$. The belief masses and the uncertainty mass are all non-negative and sum to one:

$$u + \sum_{c=1}^{C} b_c = 1, \quad u \geq 0, \quad b_c \geq 0 \tag{1}$$

They can be evaluated as $b_c = \frac{e_c}{S}$ and $u = \frac{C}{S}$, where $S = \sum_{c=1}^{C}(e_c + 1)$ and $e_c \geq 0$ is the evidence derived for the $c$-th singleton, which can be generated by a neural network with a non-negative output activation. The belief mass assignment above corresponds to a Dirichlet distribution with concentration parameters $\alpha_c = e_c + 1$:

$$\text{Dir}(\mathbf{p}|\boldsymbol{\alpha}) = \begin{cases} \frac{1}{B(\boldsymbol{\alpha})} \prod_{c=1}^{C} p_c^{\alpha_c - 1}, & \text{for } \mathbf{p} \in \mathcal{S}^C \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

where $\mathcal{S}^C$ is the $(C-1)$-simplex and $B(\boldsymbol{\alpha})$ is the multivariate beta function. Given the evidences, the expected probability of the $c$-th singleton is $\mathbb{E}[p_c] = \frac{\alpha_c}{S}$. Consider a sample $x_n$ and let $f(x_n, \Theta)$ denote the evidence vector generated by an evidential neural network parameterized by $\Theta$. This fully characterizes the Dirichlet distribution, whose mean vector gives the probability of assigning $x_n$ to each class. There are multiple ways to design a loss function to train the evidential neural network (Sensoy et al., 2018). A simple but effective option is the sum-of-squares loss:

$$l^{EL}_n(\Theta) = \|y_n - \mathbb{E}[\mathbf{p}_n]\|_2^2 + \lambda_t\, \text{KL}\big[\text{Dir}(\mathbf{p}_n|\tilde{\boldsymbol{\alpha}}_n)\,\|\,\text{Dir}(\mathbf{p}_n|(1, ..., 1)^\top)\big] \tag{3}$$

where $\lambda_t = \min(1, t/10)$ is the annealing coefficient at epoch $t$ and $\tilde{\boldsymbol{\alpha}}_n = y_n + (1 - y_n) \odot \boldsymbol{\alpha}_n$.

Remark. Besides serving as a powerful model for closed set classification, a unique benefit of evidential learning is that it offers a principled way to quantify the uncertainty mass, which is explicitly allocated to account for what is 'unknown' to the model. Intuitively, a properly trained evidential model will output a high total evidence for data samples whose features are sufficiently exposed to the model during training. In contrast, it should predict a low total evidence for less representative samples in the training data; for these samples, the corresponding uncertainty mass $u$ will be large (as the total mass sums to one). As a result, the uncertainty mass fits squarely for detecting open set samples, which have not been exposed to a model trained using the closed set samples.

Distributionally Robust Optimization. DRO handles minority and/or difficult class samples by optimizing the worst-case loss, where the weight assigned to each sample is drawn from an uncertainty set. Let $l_n(\Theta)$ be the loss for sample $x_n$ under a network parameterized by $\Theta$. The corresponding DRO loss is given as

$$\mathcal{L}^{DRO}(\Theta) = \max_{\mathbf{p} \in \mathcal{P}^{DRO}} \sum_{n=1}^{N} p_n l_n(\Theta) \tag{4}$$

where the uncertainty set used to assign the weights $\mathbf{p}$ is

$$\mathcal{P}^{DRO} := \Big\{\mathbf{p} \in \mathbb{R}^N : \mathbf{p}^\top \mathbf{1} = 1,\ \mathbf{p} \geq 0,\ D_f\big(\mathbf{p} \,\|\, \tfrac{1}{N}\mathbf{1}\big) \leq \eta \Big\} \tag{5}$$

where $D_f(\mathbf{p}\|\mathbf{q})$ is an $f$-divergence between distributions $\mathbf{p}$ and $\mathbf{q}$, and $\eta$ controls the size of the uncertainty set.
When $\eta$ is large, the weight distribution $\mathbf{p}$ can deviate substantially from the uniform distribution, making it possible to assign a very high weight to certain data samples. In contrast, a small $\eta$ constrains $\mathbf{p}$ to stay close to the uniform distribution, so that all samples share a similar weight.
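To make the evidence-to-uncertainty mapping and the loss above concrete, the following minimal PyTorch sketch (our own illustration of the stated formulas, not the authors' released code; all function names are hypothetical) computes the Dirichlet quantities and the per-sample EDL loss:

```python
import torch

def edl_quantities(evidence):
    """Map non-negative evidence to Dirichlet parameters, expected
    class probabilities E[p_c] = alpha_c / S, and uncertainty u = C / S."""
    alpha = evidence + 1.0                       # concentration parameters
    S = alpha.sum(dim=-1, keepdim=True)          # Dirichlet strength
    return alpha, alpha / S, evidence.shape[-1] / S

def kl_to_uniform_dirichlet(alpha):
    """Closed-form KL( Dir(alpha) || Dir(1, ..., 1) )."""
    C = alpha.shape[-1]
    S = alpha.sum(dim=-1)
    return (torch.lgamma(S) - torch.lgamma(alpha).sum(dim=-1)
            - torch.lgamma(torch.tensor(float(C)))
            + ((alpha - 1.0) * (torch.digamma(alpha)
                                - torch.digamma(S).unsqueeze(-1))).sum(dim=-1))

def edl_loss(evidence, y_onehot, epoch):
    """Per-sample sum-of-squares EDL loss, as in Eq. (3)."""
    alpha, prob, _ = edl_quantities(evidence)
    sq_err = ((y_onehot - prob) ** 2).sum(dim=-1)
    alpha_tilde = y_onehot + (1.0 - y_onehot) * alpha   # remove true-class evidence
    lam = min(1.0, epoch / 10.0)                        # annealing coefficient
    return sq_err + lam * kl_to_uniform_dirichlet(alpha_tilde)
```

At test time, the uncertainty mass returned by `edl_quantities` can be used directly as the open set score.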

3.2. DISTRIBUTIONALLY ROBUST EVIDENTIAL OPTIMIZATION

Standard evidential learning does not explicitly consider an imbalanced class distribution. Further, it does not focus on the difficult samples resulting from multi-modality, where a single class can contain multiple types of samples. As a result, minority classes and/or difficult samples are usually assigned a higher uncertainty mass due to a lack of sufficient training data. While this may not significantly impact the closed set performance (i.e., accuracy), it poses a more severe issue for OSD, as difficult/minority-class samples become equally uncertain as open set samples. To address this challenge, one straightforward way would be to integrate evidential learning with DRO for robust uncertainty mass quantification on minority-class/difficult samples in the closed set. Intuitively, since the model explicitly focuses on learning from minority-class/difficult samples, it assigns a low uncertainty mass to these samples while the uncertainty mass remains high for open set samples. This novel integration of DRO and evidential learning allows us to define a distributionally robust evidential loss (DREL):

$$\mathcal{L}^{DREL}(\Theta) = \max_{\mathbf{p} \in \mathcal{P}^{DRO}} \sum_{n=1}^{N} p_n l^{EL}_n(\Theta) \tag{6}$$

The detailed optimization of Eq. (6) is provided in Appendix B. Depending on $\eta$ in the uncertainty set, we can decide whether to assign an equal weight to all data samples or to focus on the most difficult ones. The lemma below reveals the relationship between DREL and the standard evidential loss.

Lemma 1. With $\eta \to 0$, the EDL loss under DRO reduces to the standard EDL loss.

When $\eta$ is set to be very small, the model gives similar weights to all samples, which allows them to participate equally in the training process. At the other extreme, we can direct the model to fully focus on the most difficult sample with the maximum loss, as summarized in the lemma below.

Lemma 2. With $\eta \to \infty$, the loss under DRO becomes equivalent to a maximum-loss-based approach focusing only on the hardest sample.

The above lemma implies that a highly flexible uncertainty set may cause the model to put too much emphasis on difficult samples. Since these difficult samples may come from the majority classes, simply setting a large $\eta$ will not necessarily direct the model's attention to the samples from the minority class. Furthermore, using a flexible uncertainty set in the initial phase of model training may misguide the model to neglect a large number of representative data samples; as a result, the model will not be able to capture the common patterns exhibited by most of the training samples. As such, the direct integration of DRO and EDL does not work well, which is also justified experimentally through the comparison of the proposed technique with the DRO baseline.
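In practice (see Appendix B), the constrained inner maximization is replaced by a KL-regularized variant whose worst-case weights have a closed form (Lemma 4 in the appendix). A minimal sketch of this weighting step under that assumption, with a temperature λ whose role is inverse to that of η:

```python
import torch

def dro_weights(per_sample_losses, lam):
    """Closed-form worst-case weights for the KL-regularized DRO objective:
    p*_n is proportional to exp(l_n / lam). A large lam keeps the weights
    near uniform (the Lemma 1 regime); a small lam concentrates them on the
    hardest samples (the Lemma 2 regime)."""
    # detach: the weights are treated as constants when back-propagating
    return torch.softmax(per_sample_losses.detach() / lam, dim=0)

def drel_loss(per_sample_losses, lam):
    """Worst-case weighted evidential loss (the DREL objective)."""
    p = dro_weights(per_sample_losses, lam)
    return (p * per_sample_losses).sum()
```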

3.3. ADAPTIVE ROBUST EVIDENTIAL OPTIMIZATION (AREO)

The key idea to address the limitations of distributionally robust evidential optimization is to gradually increase the size of the uncertainty set, which allows the model to learn from easy to hard samples of the closed set classes. Scheduler functions (SFs) provide a natural way to achieve the desired training behavior. Figure 1 (a-c) shows three typical SFs: cosine in (a), $\cos(\frac{\pi t}{2T})$; offset cosine in (b), $\frac{1}{2}\cos(\frac{\pi t}{T}) + \frac{1}{2}$; and exponential in (c), $\exp(-\frac{t}{\beta})$, where $t$ denotes the index of the training epoch, $T$ is the terminating epoch, and $\beta$ is a parameter specific to the exponential function. While the general trends of the different SFs are similar, they exhibit key differences that may lead to quite distinct training behaviors. For example, a cosine function keeps the uncertainty set small for a relatively long time at the beginning of training, which ensures that the model learns from the representative samples in the majority classes (according to Lemma 1). In contrast, an exponential function changes the size of the uncertainty set very rapidly, which gives the model more time to learn from the difficult samples in the later phase (according to Lemma 2). The offset cosine function offers both a relatively long initial learning phase and a long later learning phase. However, choosing an SF that best matches the nature of a given dataset poses a key challenge. Furthermore, a single SF may not be rich enough to express the desired training behavior for a complex dataset. To address this challenge, we propose multi-scheduler learning to automatically construct a composite scheduler function, learned for each given dataset, that delivers the optimal training behavior. More specifically, the multi-scheduler function (MSF) is formulated as a convex combination of a set of atomic SFs:

$$\text{MSF}(\mathbf{w}, \boldsymbol{\beta}, t, T) = \sum_{m=1}^{M} w_m\, \text{SF}_m(\beta_m, t, T), \quad \sum_{m=1}^{M} w_m = 1, \quad w_m \geq 0\ \forall m \in [M] \tag{7}$$

where $\mathbf{w}$ are the mixing weights and $\boldsymbol{\beta}$ is a set of parameters specific to the atomic SFs. Figure 1 (d) visualizes an example MSF that combines a cosine and an exponential function with different mixing weights and fixed $\beta = 20$, $T = 600$. As can be seen, the MSF is much more expressive than either of its component SFs, which makes it capable of representing a much broader range of training behaviors. By leveraging the proposed MSF to control the size of the uncertainty set, we can achieve adaptive robust training. Let $\eta_0$ be the initial size of the uncertainty set; the size of the set at epoch $t$ is

$$\eta_t = \frac{\eta_{t-1}}{\text{MSF}(\mathbf{w}, \boldsymbol{\beta}, t, T)} \tag{8}$$

Based on this adaptive uncertainty set, we define the Adaptive Robust Evidential Loss (AREL) as

$$\mathcal{L}^{AREL}(\Theta) = \max_{\mathbf{p} \in \mathcal{P}^{ARO}} \sum_{n=1}^{N} p_n l^{EL}_n(\Theta) \tag{9}$$

where $l^{EL}_n$ is the evidential loss for $x_n$ given by Eq. (3) under the adaptive robust optimization framework and $\mathcal{P}^{ARO}$ is the adaptive robust uncertainty set defined as

$$\mathcal{P}^{ARO} := \Big\{\mathbf{p} \in \mathbb{R}^N : \mathbf{p}^\top \mathbf{1} = 1,\ \mathbf{p} \geq 0,\ D_f\big(\mathbf{p} \,\|\, \tfrac{1}{N}\mathbf{1}\big) \leq \eta_t \Big\} \tag{10}$$

As $\eta_t$ increases, the model gradually shifts its focus from easier samples to the more difficult ones. In this way, the model is first trained to capture the common patterns in the data and then fine-tuned by attending to those difficult samples. However, for imbalanced classes, there may be a good number of difficult samples from the majority classes; therefore, solely controlling the size of the uncertainty set does not guarantee sufficient training on the minority class. A sketch of the schedule is given below, followed by the minority-class remedy.
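The following short sketch (our own illustration; helper names are hypothetical) shows the atomic SFs, their convex combination, and the resulting η schedule, assuming the reconstruction above in which η grows as the decaying MSF shrinks:

```python
import numpy as np

def sf_cosine(t, T, beta=None):
    return np.cos(np.pi * t / (2 * T))

def sf_exponential(t, T, beta=20.0):
    return np.exp(-t / beta)

def msf(t, T, weights, betas, atoms=(sf_cosine, sf_exponential)):
    """Convex combination of atomic scheduler functions; `weights` must be
    non-negative and sum to one."""
    return sum(w * f(t, T, b) for w, f, b in zip(weights, atoms, betas))

def eta_schedule(eta0, T, weights, betas):
    """Grow the uncertainty-set size as the MSF decays, shifting the model's
    emphasis from easy to hard samples over training."""
    eta, etas = eta0, []
    for t in range(1, T + 1):
        eta = eta / max(msf(t, T, weights, betas), 1e-8)
        etas.append(eta)
    return etas
```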
To address this, we further leverage the label of the minority class $c$ to formulate a ratio-biased weight augmentation on samples from this class. Let $p(c) = \sum_{n: y_{nc}=1} p_n$ be the total weight of minority class $c$ obtained by solving (9). The weights for the minority class samples are then adjusted as:

$$\tilde{p}(c) = \begin{cases} p(c), & \text{if } p(c) \geq \frac{1}{C} \\ \min\Big(\frac{1}{C},\ \frac{p(c)}{\text{MSF}(\mathbf{w}', \boldsymbol{\beta}', t, T)}\Big), & \text{otherwise} \end{cases} \qquad \tilde{p}_n = \begin{cases} \frac{\tilde{p}(c)}{p(c)}\, p_n, & \text{if } y_{nc} = 1 \\ \frac{1 - \tilde{p}(c)}{1 - p(c)}\, p_n, & \text{otherwise} \end{cases} \tag{11}$$

As the MSF monotonically decreases over the training epochs, the total weight of the minority class samples will eventually reach $\frac{1}{C}$, making the class equally weighted with the other $(C - 1)$ classes.

Remark. Our approach treats a class as a minority class if there is an obvious gap between $\frac{1}{C}$ and the fraction of samples from that class over the total samples from all C classes; any other class is regarded as a majority class. Our approach can handle multiple minority classes by applying the ratio-biased weight augmentation (given by Eq. (11)) to each minority class; see the sketch after this paragraph. The adaptive robust training is achieved through a bi-level optimization, where the inner loop optimizes the model parameters $\Theta$ and the outer loop optimizes the MSF parameters $\mathcal{W} = \{\mathbf{w}, \mathbf{w}', \boldsymbol{\beta}, \boldsymbol{\beta}'\}$:

$$\min_{\mathcal{W}} \mathcal{L}^{AREL}_{val}(\Theta^*, \mathcal{W}), \quad \text{s.t.} \quad \Theta^* = \arg\min_{\Theta} \mathcal{L}^{AREL}_{train}(\Theta, \mathcal{W}) \tag{12}$$

where $\mathcal{L}^{AREL}_{train}$ and $\mathcal{L}^{AREL}_{val}$ are the training and validation losses, respectively. The outer-loop optimization can be solved by computing hypergradients (Maclaurin et al., 2015; Pedregosa, 2016) or through population-based methods (Jaderberg et al., 2017); the former may easily get stuck in a local optimum (Tao et al., 2020). We therefore extend the existing population-based method to learn an optimal MSF; the details are given in Appendix B.
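A minimal sketch of the ratio-biased weight augmentation in Eq. (11), under the assumption of a single minority class and unit-sum weights (all names are our own):

```python
import numpy as np

def ratio_biased_reweight(p, minority_mask, msf_value, C):
    """Boost the total weight of a minority class toward 1/C while keeping
    the weights a valid distribution. `p`: worst-case sample weights summing
    to one; `minority_mask`: boolean mask of minority-class samples;
    `msf_value`: current MSF output in (0, 1]."""
    p_c = p[minority_mask].sum()
    if p_c >= 1.0 / C:
        return p                              # already sufficiently weighted
    p_c_new = min(1.0 / C, p_c / msf_value)   # Eq. (11), first case
    out = p.copy()
    out[minority_mask] *= p_c_new / p_c       # scale minority samples up
    out[~minority_mask] *= (1.0 - p_c_new) / (1.0 - p_c)  # renormalize the rest
    return out                                # still sums to one
```

For multiple minority classes, the same adjustment is applied to each minority class in turn, as noted in the remark above.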

3.4. THEORETICAL ANALYSIS

We establish the key theoretical properties of AREO, including the convergence speed of model training and the generalization capability, by formally demonstrating the equivalence between AREO and AdaBoost under a non-convex robust uncertainty loss. The key idea is to leverage the equivalence between AdaBoost and a gradient descent search for an optimal function within a linear combination of a set of (weak) learners (Mohri et al., 2012; Blanchet et al., 2019). Let $\mathcal{F} = \{f_1, ..., f_K\}$ be a set of different classifiers; the linear span generated by $\mathcal{F}$ is

$$\text{LS}(\mathcal{F}) = \Big\{f : f = \sum_{k=1}^{K} \sigma_k f_k,\ 1 \leq k \leq K\Big\} \tag{13}$$

AREO training consists of alternating updates between optimizing the worst-case probability and updating the prediction function $f$. The update of the prediction function can be regarded as finding a sub-gradient $G_t \in \partial \mathcal{L}^{AREL}(f_t)$ and updating with the projection

$$\Pi_{\text{LS}(\mathcal{F})}(G_t) = \arg\min_{f \in \text{LS}(\mathcal{F})} \|f - G_t\|_{\mathcal{D}_N} \tag{14}$$

where $\mathcal{D}_N$ is the training data. Letting $L_n(f_t)$ be the loss associated with data sample $x_n$, the update of $\mathbf{p}$ involves the optimization of the following objective with $f_t$ fixed:

$$\mathcal{L}^{AREL}(f_t) = \max_{\mathbf{p} \in \mathcal{P}^{ARO}} \sum_{n=1}^{N} p_n L_n(f_t) \tag{15}$$

where the uncertainty set is given by (10). The corresponding Lagrangian of the above optimization problem is

$$\max_{\mathbf{p} \geq 0,\, \mathbf{p}^\top \mathbf{1} = 1} \sum_{n=1}^{N} p_n L_n(f_t) - \alpha\Big(\sum_{n=1}^{N} p_n \log p_n - \eta_t\Big)$$

It should be noted that finding the optimal $f$ is non-trivial because the optimization involves the nonconvex loss $\mathcal{L}^{AREL}$. This makes it difficult to show the equivalence between AREO and AdaBoost. To ensure the convergence of $f$ to a stationary point, we adapt the ProbAbilistic Gradient Estimator (PAGE) technique (Li et al., 2021) to our unique adaptive robust evidential optimization setting. This convergence guarantee paves the way for showing the equivalence between AREO and AdaBoost, given by the theorem below.

Theorem 3. Under the assumption of a finite exponential moment for $L_n(f)$, with $\alpha \geq 0$ being sufficiently large and

$$\eta_t = \beta^* \psi'(\beta^*) - \psi(\beta^*) \tag{16}$$

the worst-case probability $\mathbf{p}^*$ is given by

$$p^*_n = \frac{\exp(L_n(f_t)/\alpha)}{\sum_{j=1}^{N} \exp(L_j(f_t)/\alpha)} \tag{17}$$

where $\beta^* = \frac{1}{\alpha^*}$, $\alpha^* \geq 0$ is the optimal $\alpha$, and $\psi(\beta) = \log\big(\frac{1}{N}\sum_{n=1}^{N} \exp(\beta L_n(f_t))\big)$. The alternating optimization of $f$ with the above worst-case probability solution exactly recovers the AdaBoost algorithm proposed in (Freund & Schapire, 1997).

Remark. There are several key benefits of connecting AREO with AdaBoost. First, AdaBoost is less prone to overfitting, even when run for a large number of iterations (Mease & Wyner, 2008). Inheriting such a property is crucial for OSD, as an overfitted evidential model can produce highly confident wrong predictions; in particular, a low uncertainty may be predicted for samples that the model is less familiar with, resulting in false-negative detection of open set samples. Furthermore, since the target function is expressed as a linear combination of a set of weak learners, the optimal function can be regarded as maximizing the $l_1$ geometric margin over the training samples, ensuring good generalization capability like other maximum-margin classifiers (Mohri et al., 2012). This ensures a decent closed set performance from AREO (as shown by our experiments). The proof of Theorem 3 is provided in Appendix C.

4. EXPERIMENTS

We perform extensive experiments to evaluate the effectiveness of the proposed AREO model. We first describe five real-world image datasets, in each of which a minority class is introduced to create an imbalanced setting. We then assess the OSD performance of the proposed technique by comparing it with competitive baselines. Finally, we conduct a qualitative analysis that uncovers deeper insights into the performance advantage of the proposed model.

4.1. DATASETS

Our experiments involve five real-world image datasets: Cifar10, Cifar100 (Krizhevsky, 2009), ImageNet (Deng et al., 2009), MNIST (Deng, 2012), and the Architecture Heritage Elements Dataset (AHED) (Llamas, 2017). In our experimentation, model training is performed solely on the closed set samples. During the detection phase, the testing samples of the closed set classes are assessed against samples from the open set classes. For all datasets, a randomly selected 20% of the training set is used for hyperparameter optimization. A brief description of each dataset is given below; for the detailed description and the sample distribution over majority and minority classes, please refer to the Appendix.

• MNIST: Five classes are treated as the open set and the rest as the closed set. To make the dataset imbalanced, we consider class '3' as a minority class and randomly keep 30% of its samples relative to the majority classes. The same imbalance ratio is applied to both the training and testing sets. In addition to the MNIST open set classes described above, we follow other existing works (Sun et al., 2020) and further test the OSD performance on open set samples from three additional sources (see Appendix D.1).
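The imbalanced split described above can be constructed with a short data-preparation sketch (our own illustration, not the authors' script):

```python
import numpy as np

def make_imbalanced_split(labels, minority_class, keep_ratio=0.3, seed=0):
    """Keep only `keep_ratio` of the minority class relative to the size of
    the majority classes; all other classes are kept in full."""
    rng = np.random.default_rng(seed)
    idx = np.arange(len(labels))
    minority = idx[labels == minority_class]
    keep_n = int(keep_ratio * np.bincount(labels).max())
    kept = rng.choice(minority, size=keep_n, replace=False)
    return np.sort(np.concatenate([idx[labels != minority_class], kept]))
```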

4.2. EXPERIMENTAL SETTINGS

Evaluation metric. To assess model performance, we report the mean average precision (MAP) score, which summarizes the precision-recall curve as a weighted mean of the precision achieved at each threshold, with the increase in recall from the previous threshold used as the weight. Specifically, for OSD we treat the open set samples as positive and the closed set samples as negative, and compute the MAP score based on the uncertainty score produced by the trained model. Different from AUROC, MAP places more emphasis on the initial part of the ranking, rewarding a model that ranks the open set samples at the top based on their predicted uncertainty scores. This metric works well in practice, as the main focus is often devoted to the first few predicted candidates, especially when the candidate list is long. A theoretical result shows that MAP is approximately the AUROC times the initial precision of the model (Su et al., 2015). We therefore focus on reporting the MAP performance and leave the AUROC results to Appendix D. It is worth noting that our AUROC results show a trend consistent with the MAP results.

Network architecture. For all datasets, the evidential neural network is a LeNet5 with tanh activations in the feature extractor and ReLU in the fully connected layers. For training, we use the Adam optimizer with a learning rate of 0.001 and $l_2$ regularization with a coefficient of 0.001. The detailed hyperparameter settings are provided in the Appendix.
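The evaluation protocol described above can be sketched in a few lines (our own illustration using scikit-learn; not the authors' evaluation script):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def osd_scores(uncertainty, is_open_set):
    """Score OSD by ranking all test samples by predicted uncertainty,
    treating open set samples as positives."""
    ap = average_precision_score(is_open_set, uncertainty)   # MAP
    auroc = roc_auc_score(is_open_set, uncertainty)
    return ap, auroc

# Toy example: a perfect ranking puts all open set samples on top.
u = np.array([0.9, 0.8, 0.2, 0.7, 0.1, 0.3])
labels = np.array([1, 1, 0, 1, 0, 0])   # 1 = open set, 0 = closed set
print(osd_scores(u, labels))            # -> (1.0, 1.0)
```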

4.3. PERFORMANCE COMPARISON

In our comparison study, we include the baselines most relevant to our model: EDL, EDL augmented with SMOTE oversampling (Chawla et al., 2002) (referred to as AEDL), and EDL with standard DRO training (referred to as DRO). Further, we compare with Posterior Networks (Charpentier et al., 2020) and their robust form, PostNet (RS), proposed by Kopetzki et al. (2021). In addition, we compare with representative baselines with outstanding OSD performance: OpenMAX (Bendale & Boult, 2016), CGDL (Sun et al., 2020), and OLTR (Liu et al., 2019). Please refer to Appendix D for a more detailed description of the baselines, along with additional results and an ablation study.

Table 1 presents the OSD performance comparison between the different models on all five datasets. AREO consistently outperforms all baselines across all datasets. For certain datasets, the performance advantage over the second-best model is close to or more than 10%. This clearly demonstrates the benefit of conducting evidential learning through adaptive DRO training to achieve optimally balanced learning from all classes and different types of data samples. We also observe that EDL performs better than other non-evidential-learning-based models, such as OpenMAX, in most cases. The better OSD performance of EDL is attributed to its explicit modeling of the uncertainty mass, which works naturally for detecting open set samples. In contrast, directly applying DRO with a flexible uncertainty set, which aims to address the imbalanced class distribution, puts too much emphasis on the difficult samples and degrades the detection performance.

4.4. QUALITATIVE EXAMPLES

We perform a qualitative analysis to further assess the effectiveness of AREO. The top row of Figure 2 (a) shows representative testing samples from the minority class ('bird') in Cifar10. These images appear difficult even for humans to identify as birds, since only a small part of the bird is visible. Thus, EDL, AEDL, and DRO assign relatively high uncertainty scores to them. As a result, many open set samples may receive relatively lower uncertainty scores, leading to false-negative detections on these samples.

Appendix

In this appendix, we first summarize the major notations used in the main paper in Appendix A. We then present a detailed description of the training process based on bi-level optimization in Appendix B. Proofs of the main theoretical results are provided in Appendix C. Finally, we present more detailed experimental datasets, results, and settings in Appendix D. The link to the source code is provided in Appendix E.

A SUMMARY OF NOTATIONS

Table 2 summarizes all the major symbols along with their descriptions. 

B TRAINING THROUGH BI-LEVEL OPTIMIZATION

Our training involves a bi-level optimization, where we jointly optimize the network parameters $\Theta$ along with the MSF parameters $\mathcal{W}$. Algorithm 1 shows the overall training process based on population-based optimization. In Line 3, we randomly initialize the MSF parameters $\mathcal{W}_p$ and network parameters $\Theta_p$ from the corresponding spaces $\mathcal{H}$ and $\Theta_{param}$, respectively; this initialization is performed for P different models. Next, in each epoch we independently optimize the P models using the proposed objective function defined in Eq. (6). After every s epochs, we evaluate each model on the validation set using 'eval' as the evaluation metric; in our case, we use the closed set classification performance (MAP) as the 'eval' metric. We identify the P′ (with P′ < P) worst-performing models and replace their model parameters with those of a model randomly selected from the set of b most accurate models. This process, known as exploitation, is shown in Line 12. The MSF parameters of the worst-performing models are obtained either by random selection from the original space $\mathcal{H}$ or by a small perturbation of the $\mathcal{W}$ of the model whose parameters are copied. This process, shown in Line 13, is called exploration, as we are searching for new MSF parameters. The parameters and accuracy of the best-performing model are stored in $\Theta^*$ and $acc^*$, respectively. Finally, the best model $\Theta^*$ is returned as the optimal model for testing.

The optimization specified in (9) involves an inequality constraint, which incurs a high computational overhead. Therefore, in our actual optimization process, we consider a regularized version of the AREO loss:

$$\mathcal{L}^{AREL} = \max_{\mathbf{p} \geq 0,\, \mathbf{p}^\top \mathbf{1} = 1} \sum_{n=1}^{N} p_n l^t_n - \lambda D_f\Big(\mathbf{p} \,\Big\|\, \frac{1}{N}\mathbf{1}\Big) \tag{18}$$

Solving this maximization problem leads to a closed-form solution for $\mathbf{p}^*$, as shown by the lemma below. Note that the role of $\lambda$ is exactly opposite to that of $\eta$. Specifically, we start from a high $\lambda$ so that the model gives equal emphasis to all data samples, and in each step we decrease $\lambda$ using

$$\lambda_t = \lambda_{t-1}\, \text{MSF}(\mathbf{w}, \boldsymbol{\beta}, t, T) \tag{19}$$

Decreasing $\lambda$ helps the model focus on the difficult samples as training progresses.

Lemma 4. Assuming that $D_f$ is the KL divergence, solving (18) leads to the solution

$$\mathcal{L}^{AREL} = \sum_{n=1}^{N} p^*_n l^t_n \tag{20}$$

where $p^*_n$ is given by

$$p^*_n = \frac{\exp(l^t_n/\lambda)}{\sum_{j=1}^{N} \exp(l^t_j/\lambda)}$$

The detailed proof is provided in Appendix C.
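A condensed sketch of the population-based search in Algorithm 1 (the callables and default sizes below are hypothetical placeholders, not the authors' API):

```python
import copy
import random

def population_based_training(make_model, sample_msf, perturb_msf,
                              train_one_epoch, evaluate,
                              P=8, P_worst=2, b=2, epochs=600, s=5):
    """P models with independent MSF parameters train in parallel; every s
    epochs the P_worst worst performers copy the weights of one of the b
    best models (exploit) and receive perturbed MSF parameters (explore)."""
    pop = [{"model": make_model(), "msf": sample_msf()} for _ in range(P)]
    best_model, best_acc = None, -float("inf")
    for epoch in range(1, epochs + 1):
        for member in pop:
            train_one_epoch(member["model"], member["msf"], epoch)  # Eq. (6)
        if epoch % s == 0:
            pop.sort(key=lambda m: evaluate(m["model"]), reverse=True)
            for member in pop[-P_worst:]:              # worst performers
                src = random.choice(pop[:b])           # one of the best
                member["model"].load_state_dict(src["model"].state_dict())
                member["msf"] = perturb_msf(src["msf"])  # explore
            acc = evaluate(pop[0]["model"])
            if acc > best_acc:                         # track the best model
                best_acc, best_model = acc, copy.deepcopy(pop[0]["model"])
    return best_model
```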

C PROOFS OF THEORETICAL RESULTS

In this section, we present the detailed proofs of Lemmas 1, 2, and 4, and Theorem 3.

Proof of Lemma 1: By setting $\eta \to 0$, we have $D_f(\mathbf{p} \| \frac{1}{N}\mathbf{1}) \to 0$. This implies that $\mathbf{p}$ is uniform with each element equal to $\frac{1}{N}$. As a result, the optimization problem becomes

$$\mathcal{L}^{DRO}(\Theta) = \frac{1}{N}\sum_{n=1}^{N} l^{EL}_n(\Theta)$$

Proof of Lemma 2: With $\eta \to \infty$, the uncertainty set defined in (5) reduces to

$$\mathcal{P}^{DRO} := \{\mathbf{p} \in \mathbb{R}^N : \mathbf{p}^\top \mathbf{1} = 1,\ \mathbf{p} \geq 0\} \tag{23}$$

The corresponding Lagrangian of (6) becomes

$$\mathcal{L}^{DRO}(\Theta, \mathbf{u}, \mu) = \sum_{n=1}^{N} \big(p_n l^{EL}_n(\Theta) + u_n p_n\big) + \mu\Big(\sum_{n=1}^{N} p_n - 1\Big)$$

where $u_n$ and $\mu$ are Lagrangian multipliers. Taking the gradient with respect to $p_n$ and setting it to zero, we get

$$l^{EL}_n(\Theta) + u_n + \mu = 0$$

Let $k = \arg\max_n l^{EL}_n(\Theta)$ be the index of the data sample with the maximum loss (assumed unique). Then the following holds:

$$u_k < u_n, \quad \forall n \in [1, N],\ n \neq k \tag{26}$$

This consequently leads to $u_n > 0, \forall n \in [1, N], n \neq k$. Due to the KKT condition $u_n p_n = 0, \forall n \in [1, N]$, we have $p_n = 0, \forall n \in [1, N], n \neq k$. Using the constraint $\sum_{n=1}^{N} p_n = 1$, we conclude

$$p_n = \begin{cases} 1, & \text{if } n = k \\ 0, & \text{otherwise} \end{cases}$$

This means the optimization reduces to

$$\mathcal{L}^{DRO}(\Theta) = \max_n l^{EL}_n(\Theta)$$

which proves the lemma.

Proof of Lemma 4: The Lagrangian of the regularized loss in (18) is

$$\mathcal{L}^{AREL}(\Theta, v) = \sum_{n=1}^{N} p_n l^t_n - \lambda\Big(\sum_{n=1}^{N} p_n \log p_n + \log N\Big) + v\Big(\sum_{n=1}^{N} p_n - 1\Big) \tag{31}$$

where $v$ is the Lagrangian multiplier. Taking the derivative with respect to $p_n$ and setting it to 0:

$$l^t_n - \lambda \log p_n - \lambda + v = 0$$

Simplifying the above equation, we get

$$p_n = \exp\Big(\frac{l^t_n + v}{\lambda} - 1\Big) \tag{33}$$

Using the summation constraint $\sum_{n=1}^{N} p_n = 1$ leads to

$$\sum_{n=1}^{N} \exp\Big(\frac{l^t_n + v}{\lambda} - 1\Big) = 1 \tag{34}$$

Solving this equation, we get the expression for $v$:

$$v = \lambda \log\Bigg(\frac{1}{\sum_{n=1}^{N} \exp\big(\frac{l^t_n}{\lambda} - 1\big)}\Bigg)$$

Substituting this value of $v$ into (33) gives

$$p_n = \frac{\exp(l^t_n/\lambda)}{\sum_{j=1}^{N} \exp(l^t_j/\lambda)}$$

This concludes the proof of Lemma 4.

Proof of Theorem 3: AdaBoost can be obtained through an alternating optimization between a classification function $f$ and the worst-case probability solution (Freund & Schapire, 1997). To show the equivalence with the proposed AREO, our proof includes three steps: (i) a specially designed deep neural network (DNN) architecture and a loss function adapted to match the learning process of AdaBoost, (ii) projected functional sub-gradient descent to optimize the classification function $f$, and (iii) optimization of the worst-case probability solution.

Step 1: A specially designed DNN. Let $\phi(x) \in \mathbb{R}^M$ denote an M-dimensional feature vector learned by a DNN. By applying a fully connected linear layer with a weight matrix $W \in \mathbb{R}^{K \times M}$ on top of the feature vector, we obtain a set of K (discriminative) functions: $\mathbf{f} = (f_1, ..., f_K)^\top = W\phi(x)$. The final output of the DNN is obtained by aggregating these K functions, leading to $f = \boldsymbol{\sigma}^\top \mathbf{f}$, where $\boldsymbol{\sigma} = (\sigma_1, ..., \sigma_K)^\top$. As a result of this design, the final function output by the DNN can be regarded as lying in the linear span of a set of functions $\mathcal{F} = \{f_1, ..., f_K\}$, given by

$$\text{LS}(\mathcal{F}) = \Big\{f : f = \sum_{k=1}^{K} \sigma_k f_k,\ 1 \leq k \leq K,\ \sigma_k \in (-\infty, \infty)\Big\} \tag{37}$$

Training of AREO alternates between re-weighting using the worst-case probability distribution and updating the prediction function $f$. Next, we show that, given the specially designed DNN, we can exactly optimize the classification function $f$ while keeping the worst-case probability fixed, and vice versa.

Step 2: Optimizing the classification function f under the worst-case probability.
We first formulate the distributionally robust evidential loss

$$\mathcal{L}^{AREL} = \max_{\mathbf{p} \in \mathcal{P}^{ARO}} \sum_{n=1}^{N} p_n L_n(f)$$

where $L_n(f)$ is the loss associated with the data sample $x_n$. The optimal $f^*$ can then be obtained by minimizing the distributionally robust loss:

$$f^* = \arg\min_{f \in \text{LS}(\mathcal{F})} \mathcal{L}^{AREL} \tag{39}$$

This optimization involves the nonconvex loss $\mathcal{L}^{AREL}$. To ensure the convergence of $f$ to a stationary point, we adapt the ProbAbilistic Gradient Estimator (PAGE) technique (Li et al., 2021) to the DRO setting (shown in Algorithm 2), which ensures convergence in $O(b + \frac{\sqrt{b}}{\epsilon^2})$ steps, with $b$ being the batch size; please refer to Theorem 6 for further details. To show that an optimal $f^*$ can be achieved, we first verify that the specially designed DNN and the loss function described above meet a number of key conditions specified by (Blanchet et al., 2019): (i) the loss functional $\mathcal{L}^{AREL}$ is L-smooth, (ii) for two different functions $f_1, f_2 \in \text{LS}(\mathcal{F})$, $f_1(\phi(x_n)) \neq f_2(\phi(x_n))$, and (iii) $\text{LS}(\mathcal{F})$ has a finite-dimensional basis. First, (i) holds because $\mathcal{L}^{AREL}$ is a convex combination of the losses $L_n(f)$. As each individual loss involves a ReLU term (ReLU is applied to the output of the DNN to ensure non-negativity of the evidence), the resulting convex combination may not be smooth; we therefore use SoftPlus, a smooth approximation of ReLU, so that the convex combination of SoftPlus terms makes $\mathcal{L}^{AREL}$ L-smooth. Second, the rich, high-dimensional input data (i.e., diverse images) and the feature encoding through the deep architecture of the DNN ensure that (ii) holds. Last, since the dimensionality of the weight matrix $W$ is $K \times M$, the dimensionality of the basis of $\text{LS}(\mathcal{F})$ is bounded by $K$, so (iii) holds. The smoothness of $\mathcal{L}^{AREL}$ ensures that a stationary solution with $\mathbb{E}[\|\nabla \mathcal{L}^{AREL}\|] \leq \epsilon$ is reached within $O(b + \frac{\sqrt{b}}{\epsilon^2})$ gradient steps, even in this nonconvex optimization setting. Furthermore, since $\mathcal{L}^{AREL}$ is a functional of $f$, the other two conditions ensure that the functional gradient exists and can be evaluated (Blanchet et al., 2019). During the optimization process, the trajectory of the functional gradient must lie in the space $\text{LS}(\mathcal{F})$, which is achieved through functional gradient projection.

Step 3: Optimizing the worst-case probability solution. Let $f_t$ denote the optimal classification function at the current iteration $t$. We next optimize the worst-case probability solution. The following lemma shows that such an optimal solution exists.

Lemma 5. Assume that $L_n(f_t)$ has a finite exponential moment, with $\alpha \geq 0$ being sufficiently large and

$$\eta_t = \beta^* \psi'(\beta^*) - \psi(\beta^*) \tag{40}$$

Then the worst-case probability is given by

$$p^*_n = \frac{\exp(L_n(f_t)/\alpha)}{\sum_{j=1}^{N} \exp(L_j(f_t)/\alpha)}$$

where $\beta^* = \frac{1}{\alpha^*}$, $\alpha^* \geq 0$ is the optimal $\alpha$, and $\psi(\beta) = \log\big(\frac{1}{N}\sum_{n=1}^{N} \exp(\beta L_n(f_t))\big)$.

Proof. Taking the derivative of the Lagrangian of the optimization problem in (15) with respect to $p_n$ leads to

$$L_n(f_t) - \alpha \log p_n - \alpha + u_n = 0 \tag{42}$$

where $u_n$ is the Lagrangian multiplier for the constraint $\mathbf{p} \geq 0$ and $\alpha$ is the Lagrange multiplier for the DRO constraint with size $\eta_t$. Simplifying the above expression yields

$$\log p_n = \frac{L_n(f_t)}{\alpha} + \frac{u_n - \alpha}{\alpha} \tag{43}$$

For some $\lambda'$ with $p_n = \lambda' \exp\big(\frac{L_n(f_t)}{\alpha}\big)$, a candidate solution is

$$p^*_n = \frac{\exp(L_n(f_t)/\alpha)}{\sum_{j=1}^{N} \exp(L_j(f_t)/\alpha)} \tag{44}$$

The above equation is expressed in terms of the Lagrangian multiplier.
By leveraging the sufficiency result presented in Chapter 8, Theorem 1 of (Luenberger, 1997), we can relate the multiplier to our constraint parameter $\eta_t$, so that the optimal solution can be expressed in terms of the original constraint. Suppose that we can find $\alpha^* \geq 0$ and $\mathbf{p}^* \in \mathcal{P}^{ARO}$ such that $\mathbf{p}^*$ maximizes (15) for $\alpha = \alpha^*$ and $\sum_{n=1}^{N} p^*_n \log p^*_n = \eta_t$, with the optimal solution defined in (44). We then have

$$\eta_t = \sum_{n=1}^{N} p^*_n \log p^*_n = \sum_{n=1}^{N} p^*_n \frac{L_n(f_t)}{\alpha^*} - \log\Bigg(\sum_{j=1}^{N} \exp\Big(\frac{L_j(f_t)}{\alpha^*}\Big)\Bigg) = \beta^* \psi'(\beta^*) - \psi(\beta^*) \tag{45}$$

where we define $\beta^* = \frac{1}{\alpha^*}$ and $\psi(\beta) = \log \sum_{n=1}^{N} \exp(\beta L_n(f_t))$. This allows us to express the Lagrangian multiplier using $\eta_t$. Next, we verify that the solution defined in (44) is the unique optimum by leveraging the convexity of the exponential function. Specifically, substituting (44) into (15), we get

$$\sum_{n=1}^{N} p^*_n L_n(f_t) - \alpha \sum_{n=1}^{N} p^*_n \log p^*_n = \alpha \log \sum_{n=1}^{N} \exp\Big(\frac{L_n(f_t)}{\alpha}\Big) \tag{46}$$

If we can show that the inequality

$$\alpha \log \sum_{n=1}^{N} \exp\Big(\frac{L_n(f_t)}{\alpha}\Big) \geq \sum_{n=1}^{N} p_n L_n(f_t) - \alpha \sum_{n=1}^{N} p_n \log p_n$$

holds for any feasible $\mathbf{p}$, then the candidate solution is the optimal solution. Rearranging the terms, this is equivalent to

$$\sum_{n=1}^{N} \exp\Big(\frac{L_n(f_t)}{\alpha}\Big) \geq \exp\Bigg(\sum_{n=1}^{N} \Big(\frac{p_n L_n(f_t)}{\alpha} - p_n \log p_n\Big)\Bigg)$$

This can be shown as follows:

$$\sum_{n=1}^{N} \exp\Big(\frac{L_n(f_t)}{\alpha}\Big) = \sum_{n=1}^{N} p_n p_n^{-1} \exp\Big(\frac{L_n(f_t)}{\alpha}\Big) = \sum_{n=1}^{N} p_n \exp\Big(\frac{L_n(f_t)}{\alpha} - \log p_n\Big)$$

Applying Jensen's inequality to the (convex) exponential function, $\exp\big(\sum_n p_n x_n\big) \leq \sum_n p_n \exp(x_n)$, we have

$$\sum_{n=1}^{N} p_n \exp\Big(\frac{L_n(f_t)}{\alpha} - \log p_n\Big) \geq \exp\Bigg(\sum_{n=1}^{N} \Big(\frac{p_n L_n(f_t)}{\alpha} - p_n \log p_n\Big)\Bigg) \tag{49}$$

This completes the proof of the lemma.

Theorem 6. Suppose that $\mathcal{L}^{AREL}$ satisfies the L-smoothness criterion

$$\|\nabla \mathcal{L}^{AREL}(f_1) - \nabla \mathcal{L}^{AREL}(f_2)\| \leq L\|f_1 - f_2\| \tag{50}$$

Then, choosing a learning rate $\gamma \leq \frac{1}{L\big(1 + \sqrt{\frac{1-p}{pb'}}\big)}$ with minibatch size $b = n$ and secondary minibatch size $b' < b$, the number of iterations required by our algorithm to find an $\epsilon$-approximate solution, i.e., $\mathbb{E}[\|\nabla \mathcal{L}^{AREL}(\hat{f}_T)\|] \leq \epsilon$, is bounded by

$$T = \frac{2\Delta_0 L}{\epsilon^2}\Big(1 + \sqrt{\frac{1-p}{pb'}}\Big) \tag{51}$$

Further, the gradient complexity in terms of the number of gradient computations is

$$N_{grad} = b + \frac{2\Delta_0 L}{\epsilon^2}\Big(1 + \sqrt{\frac{1-p}{pb'}}\Big)\big(pb + (1-p)b'\big)$$

Before giving the formal proof, we first present two lemmas used in the proof.

Lemma 7. The L-smoothness condition in Eq. (50) leads to the inequality

$$\mathcal{L}^{AREL}(f_2) \leq \mathcal{L}^{AREL}(f_1) + \langle \nabla \mathcal{L}^{AREL}(f_1), f_2 - f_1 \rangle + \frac{L}{2}\|f_2 - f_1\|^2, \quad \forall f_1, f_2 \in \mathbb{R}^m$$

where $\langle a, b \rangle = a^\top b$ and $\|\cdot\|$ is the Euclidean norm.

Proof of Lemma 7. For completeness, the proof is as follows. By the fundamental theorem of calculus,

$$\mathcal{L}^{AREL}(f_2) = \mathcal{L}^{AREL}(f_1) + \int_0^1 \langle \nabla \mathcal{L}^{AREL}(f_1 + \tau(f_2 - f_1)), f_2 - f_1 \rangle\, d\tau = \mathcal{L}^{AREL}(f_1) + \langle \nabla \mathcal{L}^{AREL}(f_1), f_2 - f_1 \rangle + \int_0^1 \langle \nabla \mathcal{L}^{AREL}(f_1 + \tau(f_2 - f_1)) - \nabla \mathcal{L}^{AREL}(f_1), f_2 - f_1 \rangle\, d\tau$$

The Cauchy-Schwarz inequality $\langle u, v \rangle \leq \|u\|\|v\|$ leads to

$$\mathcal{L}^{AREL}(f_2) \leq \mathcal{L}^{AREL}(f_1) + \langle \nabla \mathcal{L}^{AREL}(f_1), f_2 - f_1 \rangle + \int_0^1 \|\nabla \mathcal{L}^{AREL}(f_1 + \tau(f_2 - f_1)) - \nabla \mathcal{L}^{AREL}(f_1)\|\, \|f_2 - f_1\|\, d\tau$$

Using the L-smoothness assumption in Eq. (50), we have

$$\mathcal{L}^{AREL}(f_2) \leq \mathcal{L}^{AREL}(f_1) + \langle \nabla \mathcal{L}^{AREL}(f_1), f_2 - f_1 \rangle + \int_0^1 L\tau\|f_2 - f_1\|^2\, d\tau = \mathcal{L}^{AREL}(f_1) + \langle \nabla \mathcal{L}^{AREL}(f_1), f_2 - f_1 \rangle + \frac{L}{2}\|f_2 - f_1\|^2$$

Next, we provide another lemma required to prove the theorem, which builds on Lemma 7.

Lemma 8. Under the L-smoothness assumption in Eq. (50), let $f_{t+1} := f_t - \gamma g_t$.
Then, for any $g_t \in \mathbb{R}^M$ and $\gamma > 0$, we have

$$\mathcal{L}^{AREL}(f_{t+1}) \leq \mathcal{L}^{AREL}(f_t) - \frac{\gamma}{2}\|\nabla \mathcal{L}^{AREL}(f_t)\|^2 - \Big(\frac{1}{2\gamma} - \frac{L}{2}\Big)\|f_{t+1} - f_t\|^2 + \frac{\gamma}{2}\|g_t - \nabla \mathcal{L}^{AREL}(f_t)\|^2 \tag{54}$$

Proof of Lemma 8. Using the L-smoothness of $\mathcal{L}^{AREL}$ (Lemma 7) and $f_{t+1} - f_t = -\gamma g_t$, we have

$$\mathcal{L}^{AREL}(f_{t+1}) \leq \mathcal{L}^{AREL}(f_t) + \langle \nabla \mathcal{L}^{AREL}(f_t), f_{t+1} - f_t \rangle + \frac{L}{2}\|f_{t+1} - f_t\|^2 = \mathcal{L}^{AREL}(f_t) - \gamma\langle \nabla \mathcal{L}^{AREL}(f_t), g_t \rangle + \frac{L\gamma^2}{2}\|g_t\|^2$$

Applying the identity $\langle a, b \rangle = \frac{1}{2}\big(\|a\|^2 + \|b\|^2 - \|a - b\|^2\big)$ with $a = \nabla \mathcal{L}^{AREL}(f_t)$ and $b = g_t$ gives

$$\mathcal{L}^{AREL}(f_{t+1}) \leq \mathcal{L}^{AREL}(f_t) - \frac{\gamma}{2}\|\nabla \mathcal{L}^{AREL}(f_t)\|^2 + \frac{\gamma}{2}\|g_t - \nabla \mathcal{L}^{AREL}(f_t)\|^2 - \Big(\frac{\gamma}{2} - \frac{L\gamma^2}{2}\Big)\|g_t\|^2$$

Since $\|g_t\|^2 = \frac{1}{\gamma^2}\|f_{t+1} - f_t\|^2$, the last term equals $-\big(\frac{1}{2\gamma} - \frac{L}{2}\big)\|f_{t+1} - f_t\|^2$, which completes the proof of Lemma 8.

The last term in (54) is the variance of the gradient estimator, which can be bounded using the following lemma.

Lemma 9. Suppose that the smoothness assumption in Eq. (50) holds. If the gradient estimator $g_{t+1}$ is defined as in Line 13 of Algorithm 2, then

$$\mathbb{E}[\|g_{t+1} - \nabla \mathcal{L}^{AREL}(f_{t+1})\|^2] \leq (1 - p_t)\|g_t - \nabla \mathcal{L}^{AREL}(f_t)\|^2 + \frac{(1 - p_t)L^2}{b'}\|f_{t+1} - f_t\|^2 \tag{55}$$

Proof of Lemma 9. According to Algorithm 2, the estimator is

$$g_{t+1} = \begin{cases} \frac{1}{b}\sum_{n \in B} a_n(f_{t+1})\nabla L_n(f_{t+1}), & \text{with probability } p_t \\ g_t + \frac{1}{b'}\sum_{n \in B'} \big(a_n(f_{t+1})\nabla L_n(f_{t+1}) - a_n(f_t)\nabla L_n(f_t)\big), & \text{with probability } 1 - p_t \end{cases}$$

Conditioning on the two cases, the left-hand side of the lemma can be written as

$$\mathbb{E}[\|g_{t+1} - \nabla \mathcal{L}^{AREL}(f_{t+1})\|^2] = p_t\, \mathbb{E}\Big[\Big\|\tfrac{1}{b}\textstyle\sum_{n \in B} a_n(f_{t+1})\nabla L_n(f_{t+1}) - \nabla \mathcal{L}^{AREL}(f_{t+1})\Big\|^2\Big] + (1 - p_t)\, \mathbb{E}\Big[\Big\|g_t + \tfrac{1}{b'}\textstyle\sum_{n \in B'}\big(a_n(f_{t+1})\nabla L_n(f_{t+1}) - a_n(f_t)\nabla L_n(f_t)\big) - \nabla \mathcal{L}^{AREL}(f_{t+1})\Big\|^2\Big]$$

Since $b = n$ (a full batch), the first term vanishes. For the second term, adding and subtracting $\nabla \mathcal{L}^{AREL}(f_t)$ and noting that the minibatch correction is an unbiased estimator of $\nabla \mathcal{L}^{AREL}(f_{t+1}) - \nabla \mathcal{L}^{AREL}(f_t)$ (so the cross term vanishes in expectation), we obtain

$$\mathbb{E}[\|g_{t+1} - \nabla \mathcal{L}^{AREL}(f_{t+1})\|^2] = (1 - p_t)\|g_t - \nabla \mathcal{L}^{AREL}(f_t)\|^2 + (1 - p_t)\, \mathbb{E}\Big[\Big\|\tfrac{1}{b'}\textstyle\sum_{n \in B'}\big(a_n(f_{t+1})\nabla L_n(f_{t+1}) - a_n(f_t)\nabla L_n(f_t)\big) - \big(\nabla \mathcal{L}^{AREL}(f_{t+1}) - \nabla \mathcal{L}^{AREL}(f_t)\big)\Big\|^2\Big]$$

Bounding the variance of the averaged per-sample differences and applying the L-smoothness assumption in Eq. (50) to each difference yields

$$\mathbb{E}[\|g_{t+1} - \nabla \mathcal{L}^{AREL}(f_{t+1})\|^2] \leq (1 - p_t)\|g_t - \nabla \mathcal{L}^{AREL}(f_t)\|^2 + \frac{(1 - p_t)L^2}{b'}\|f_{t+1} - f_t\|^2$$

Proof of Theorem 6. We leverage the above lemmas to prove the theorem. Adding Eq. (54) to $\frac{\gamma}{2p} \times$ Eq. (55) and taking expectations gives

$$\mathbb{E}\Big[\mathcal{L}^{AREL}(f_{t+1}) - \mathcal{L}^{AREL*} + \frac{\gamma}{2p}\|g_{t+1} - \nabla \mathcal{L}^{AREL}(f_{t+1})\|^2\Big] \leq \mathbb{E}\Big[\mathcal{L}^{AREL}(f_t) - \mathcal{L}^{AREL*} + \frac{\gamma}{2p}\|g_t - \nabla \mathcal{L}^{AREL}(f_t)\|^2\Big] - \frac{\gamma}{2}\mathbb{E}\big[\|\nabla \mathcal{L}^{AREL}(f_t)\|^2\big] - \mathbb{E}\Big[\Big(\frac{1}{2\gamma} - \frac{L}{2} - \frac{(1-p)\gamma L^2}{2pb'}\Big)\|f_{t+1} - f_t\|^2\Big]$$

where $\mathcal{L}^{AREL*}$ is the loss at the optimal $f^*$.
Using the inequality $\frac{1}{2\gamma} - \frac{L}{2} - \frac{(1-p)\gamma L^2}{2pb'} \geq 0$, i.e., $\gamma \leq \frac{1}{L\big(1 + \sqrt{\frac{1-p}{pb'}}\big)}$, we can write

$$\mathbb{E}\Big[\mathcal{L}^{AREL}(f_{t+1}) - \mathcal{L}^{AREL*} + \frac{\gamma}{2p}\|g_{t+1} - \nabla \mathcal{L}^{AREL}(f_{t+1})\|^2\Big] \leq \mathbb{E}\Big[\mathcal{L}^{AREL}(f_t) - \mathcal{L}^{AREL*} + \frac{\gamma}{2p}\|g_t - \nabla \mathcal{L}^{AREL}(f_t)\|^2\Big] - \frac{\gamma}{2}\mathbb{E}\big[\|\nabla \mathcal{L}^{AREL}(f_t)\|^2\big]$$

Now define

$$\phi_t := \mathcal{L}^{AREL}(f_t) - \mathcal{L}^{AREL*} + \frac{\gamma}{2p}\|g_t - \nabla \mathcal{L}^{AREL}(f_t)\|^2$$

Then we can write

$$\mathbb{E}[\phi_{t+1}] \leq \mathbb{E}[\phi_t] - \frac{\gamma}{2}\mathbb{E}[\|\nabla \mathcal{L}^{AREL}(f_t)\|^2]$$

Summing from $t = 0$ to $T - 1$ results in

$$\mathbb{E}[\phi_T] \leq \mathbb{E}[\phi_0] - \frac{\gamma}{2}\sum_{t=0}^{T-1}\mathbb{E}[\|\nabla \mathcal{L}^{AREL}(f_t)\|^2]$$

According to Algorithm 2, $\hat{f}_T$ is chosen uniformly from $\{f_t\}_{t \in [T]}$, and since $g_0$ is the full-batch gradient,

$$\phi_0 = \mathcal{L}^{AREL}(f_0) - \mathcal{L}^{AREL*} + \frac{\gamma}{2p}\|g_0 - \nabla \mathcal{L}^{AREL}(f_0)\|^2 = \mathcal{L}^{AREL}(f_0) - \mathcal{L}^{AREL*} = \Delta_0$$

so that

$$\mathbb{E}[\|\nabla \mathcal{L}^{AREL}(\hat{f}_T)\|^2] \leq \frac{2\Delta_0}{\gamma T}$$

Setting $T = \frac{2\Delta_0}{\epsilon^2 \gamma}$ and using Jensen's inequality results in

$$\mathbb{E}[\|\nabla \mathcal{L}^{AREL}(\hat{f}_T)\|] \leq \sqrt{\mathbb{E}[\|\nabla \mathcal{L}^{AREL}(\hat{f}_T)\|^2]} \leq \sqrt{\frac{2\Delta_0}{\gamma T}} = \epsilon$$

Thus, with the total number of iterations

$$T = \frac{2\Delta_0}{\epsilon^2 \gamma} = \frac{2\Delta_0 L}{\epsilon^2}\Big(1 + \sqrt{\frac{1-p}{pb'}}\Big) \tag{62}$$

we obtain an $\epsilon$-approximate stationary point. The number of gradient computations required by Algorithm 2 is

$$N_{grad} = b + T\big(pb + (1-p)b'\big)$$

Replacing $T$ by Equation (62), we have

$$N_{grad} = b + \frac{2\Delta_0 L}{\epsilon^2}\Big(1 + \sqrt{\frac{1-p}{pb'}}\Big)\big(pb + (1-p)b'\big)$$

This proves Theorem 6.
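For intuition, the PAGE-style estimator adapted in Algorithm 2 can be sketched as follows (a simplified illustration with flattened gradient vectors; `weighted_grad` is a hypothetical helper returning the worst-case-weighted gradient, not the authors' API):

```python
import random

def page_gradient(params, prev_params, g_prev, batch_full, batch_small,
                  weighted_grad, p):
    """With probability p, take a full-batch weighted gradient; otherwise,
    reuse g_prev plus a cheap small-minibatch correction between the current
    and previous parameters (Li et al., 2021, adapted to the DRO setting)."""
    if random.random() < p:
        return weighted_grad(params, batch_full)
    correction = (weighted_grad(params, batch_small)
                  - weighted_grad(prev_params, batch_small))
    return g_prev + correction
```

The correction term is an unbiased estimate of the gradient change between iterates, which is exactly the quantity bounded by Lemma 9.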

D ADDITIONAL EXPERIMENTAL DETAILS

D.1 DATASET DESCRIPTION

MNIST. In this dataset, the classes corresponding to digits '1', '3', '5', '7', and '9' are treated as closed set classes and the rest as the open set. As the number of data samples per class is not exactly the same, we first sample 5,000 samples per class for the training set and 1,000 samples per class for testing. To make the dataset imbalanced, we consider class '3' as a minority class and randomly keep 30% of its samples relative to the majority classes. The same imbalance ratio is applied to both the training and testing sets. Table 3 shows the number of data samples in the minority class and the majority classes. In addition to the MNIST open set classes described above, we follow other existing works (Sun et al., 2020) and further test the OSD performance on open set samples from three additional sources: (1) MNIST-Noise, (2) Noise, and (3) Omniglot (Lake et al., 2015). More specifically, MNIST-Noise is constructed by adding random noise to the closed set testing samples, Noise consists of pure random noise, and Omniglot consists of data samples from the Omniglot dataset. For these sources, we select the same number of data samples as in the closed set.

Table 4 compares the OSD performance of the MSF obtained through the proposed multi-scheduler learning strategy with that of prefixed atomic scheduler functions, including cosine and exponential. Due to their limited expressiveness, a single atomic SF usually cannot work well for all datasets. For example, on Cifar10, cosine yields a relatively better OSD performance than exponential, whereas on MNIST, exponential produces a relatively higher OSD performance. In contrast, by combining both and properly balancing their contributions according to the nature of the dataset, MSF achieves the best performance in all cases.

Same Cardinality: In this first setting, we demonstrate the ability of our technique to handle multiple minority classes. We consider two minority classes with $c_1 = 6\%$ and $c_2 = 6\%$. We randomly select the minority classes, repeat the experiment twice, and average the scores to obtain the final result. Table 5 shows the performance of the different baselines along with our proposed AREO. As shown, our technique achieves far better OSD performance than the existing baselines.

Different Cardinality with Severe Imbalance: In this setting, we demonstrate the ability of our approach to handle multiple minority classes with different cardinalities. We consider two minority classes with $c_1 = 6\%$ and $c_2 = 3\%$. Similar to the first case, we randomly select the minority classes, repeat the experiment twice, and average the scores. This setting also demonstrates the ability of our technique to handle a more severe imbalance, where $c_2 = 3\%$. Table 6 shows the performance comparison across the different techniques. As shown, AREO performs better than the existing baselines.

CGDL uses a variational autoencoder that learns a class-conditional posterior distribution for each class in the latent space (Sun et al., 2020). Any sample with a low probability of belonging to any of the classes is regarded as an outlier. CGDL is consistently outperformed by the proposed AREO model. One possible reason is that CGDL may not properly learn the minority-class posterior distribution in the latent space.
Further, the approach may ignore the difficult samples when forming the posterior distribution. As such, the model cannot differentiate minority-class and OOD samples, as both may have a low probability of belonging to any of the classes, resulting in lower performance.

Posterior Networks (Charpentier et al., 2020) leverage normalizing flows to obtain a posterior distribution over the predicted probabilities. A latent representation is learned by the encoder, and a per-class probability density value for that latent representation is computed to obtain the posterior distribution. For the minority class, the normalizing flow may not learn to produce a high density value for its samples; as such, when computing the uncertainty, it may still assign a high uncertainty to the minority class. Further, the model may not learn to produce high density values for difficult samples from the other classes. It is therefore likely that the model confuses hard samples with OOD samples, which makes OSD difficult. As shown in Table 1, Posterior Networks consistently perform worse than AREO. Kopetzki et al. (2021) propose a more robust form of Posterior Networks by training the network using randomized smoothing (RS). The key idea is to draw multiple samples $x^s_i \sim \mathcal{N}(x_i, \sigma)$ around the input sample $x_i$. Although this technique has shown improvement under adversarial attacks, it is not designed for the imbalanced setting, as demonstrated by its lower OSD performance in Table 1.

We have also included OLTR, proposed by Liu et al. (2019), as a baseline. This work proposes a way to deal with OSD under an imbalanced data distribution; however, it has several limitations. First, the approach is based on the assumption that visual similarity is shared across the minority and majority classes, which enables robust learning even for the minority class. If the minority class is different from the other majority classes, the model may not properly learn the minority class and may thus confuse minority and open set samples. Also, OLTR does not have a mechanism to focus on the difficult samples from the majority classes during training; as such, the model may detect difficult samples as open set samples. These limitations are reflected in the performance shown in Table 1.

Recently, Wang et al. (2022) proposed a contrastive-loss-based approach in which minority classes are pushed away from the OOD samples in the feature space. Although this work also considers open set detection under the class-imbalanced setting, it relies heavily on the open set samples selected for the training process. Table 7 shows the OSD performance with respect to the open set training datasets Flower (Nilsback & Zisserman, 2006) and MIT Indoor Scene (Quattoni & Torralba, 2009). As shown, the performance of this method is highly dependent on the OOD dataset selected for training.

In this section, we first provide a detailed discussion of recently developed general open set recognition methods, which do not specifically focus on imbalanced data. We then choose some representative methods and present a comparison with the proposed AREO model. Finally, we discuss some recent OSD models designed for few-shot learning under the meta-learning setting.
The proposed approach requires open set datasets to be available during training, which may limit its applicability in more general settings. In (Perera et al., 2020), self-supervision is performed to construct the decision boundaries between classes based on the semantics in the feature space. A generative model is then trained on the known class samples. Thus, the generated images will be close to those of the closed set class samples. As the open set samples are not seen during the generative modeling process, the produced images will exhibit a high disparity from those of the closed set samples. Finally, Yang et al. (2020) propose the Convolutional Prototype Network (CPN), where a prototype for each known class is constructed in the feature space and two different loss functions are defined. For the generative loss, a generative assumption is followed where the class-specific features are drawn from certain distributions (e.g., Gaussian) with the mean given by the prototype representation. This generative loss helps reduce the intra-class variance, making the known-class sample representations very compact. Thus, the model can reserve more space for unknowns, making OSD relatively easier. The second loss (i.e., the discriminative loss) encourages the separation of samples from different classes based on the distance between the sample and prototype representations. We perform a comparison with the first three methods discussed above, including the two most recent baselines with competitive OSD performance, and report the comparison results on the Cifar10, Cifar100, and MNIST datasets in the table below.

In addition to the above baselines, there are also recent OSD models specifically developed for few-shot learning under the meta-learning setting. For instance, Jeong et al. (2021) propose a few-shot open-set recognition (FSOSR) model. This approach is designed to work with testing tasks with limited labeled data through meta-learning. The design of the training paradigm is based on episodic learning (Snell et al., 2017), widely used in meta-learning, where the query and support sets are constructed by selecting subsets of the meta-training data. In contrast, our model is not designed for the few-shot setting through meta-learning. Furthermore, the FSOSR approach in (Jeong et al., 2021) does not consider an imbalanced class distribution, either. Therefore, the problem setting, model training, and evaluation process are all different. Similarly, Liu et al. (2020) propose an oPen set mEta LEaRning (PEELER) algorithm that adapts ProtoNet to FSOSR under the meta-learning setting. There are two key differences between our technique and PEELER. First, PEELER assumes that unknown samples are also available during the training process. Second, the algorithm is designed under the meta-learning setting, which makes a direct comparison with our approach infeasible. In this case, the loss function does not perform closed set classification. In contrast, our approach achieves state-of-the-art OSD performance while ensuring decent closed set performance. Finally, OpenGAN does not have a specific mechanism to handle the imbalanced setting, which is one primary design focus of our approach.

D.6 PERFORMANCE COMPARISON USING AUROC

In addition to the MAP scores, the AUROC scores, reported in Table 9, show a consistent trend in OSD performance.

It is interesting to see that DRO with a flexible uncertainty set performs the worst in the closed set setting, as it does not learn properly from the most representative samples in the training data while focusing only on the difficult ones. AEDL performs very competitively and achieves the best performance on two datasets. This is partly because we evaluate MAP by treating the minority class as positive, and oversampling helps to improve the prediction on the minority class quite significantly. AREO also performs competitively and achieves the best performance on the other two datasets. The good closed set performance further confirms our theoretical result that proves the equivalence between AREO and AdaBoost, which justifies its strong generalization capability.

Figure 4 provides deeper insight into the superior OSD performance of AREO over other competitive baselines, including EDL, DRO, and AEDL. Cifar10 is used as an illustrative example, and similar patterns are obtained on the other datasets. First, while EDL is able to separate outliers from most samples in the majority classes based on their predicted uncertainty scores, it assigns much higher uncertainty scores to samples from the minority class, making them hard to separate from the outliers. Second, the uncertainty scores for the majority classes span a wide range, which implies that several (difficult) samples from these classes have also been assigned very high uncertainty scores. If the goal is to ensure that most top-ranked samples are true outliers for effective detection in practice, these highly uncertain closed set samples may significantly affect the detection effectiveness. Third, while oversampling can help to better detect samples from the minority class, as indicated by the lower uncertainty scores achieved by AEDL, most majority classes become much more uncertain and some of them even have a higher average uncertainty score than the outliers. Furthermore, the uncertainty scores of most classes also span a wide range. Finally, DRO effectively narrows the range of the uncertainty scores as it allows the model to focus more on the difficult samples. However, it does not effectively bring down the high uncertainty scores of the minority class, which remain higher than those of the outliers. Similar to DRO, the proposed AREO also manages to keep the uncertainty scores of data samples from the majority classes low, so that even the difficult samples are unlikely to be mis-identified as outliers. Meanwhile, it effectively lowers the uncertainty scores of the minority-class samples so that they can be better separated from the outliers.
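The uncertainty score used throughout this analysis can be computed directly from the predicted Dirichlet evidence. Below is a minimal Python sketch assuming the standard evidential formulation u = C / sum_c alpha_c (Sensoy et al., 2018); the function name and tensor shapes are ours for illustration, not from the released code.

import torch

def dirichlet_uncertainty(evidence: torch.Tensor) -> torch.Tensor:
    # evidence: non-negative tensor of shape (batch, C), e.g., a Softplus output.
    alpha = evidence + 1.0               # Dirichlet concentration parameters
    strength = alpha.sum(dim=-1)         # S = sum_c alpha_c
    num_classes = evidence.shape[-1]     # C
    return num_classes / strength        # u = C / S, in (0, 1]; higher = more uncertain

Open set detection then ranks test samples by u (as in Figure 4) and flags the top-ranked samples, or those above a threshold, as open set candidates.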

E LINK TO SOURCE CODE

For the source code, please click here.



ACKNOWLEDGMENTS

This research was supported in part by an NSF IIS award IIS-1814450 and an ONR award N00014-18-1-2875. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agency.



Figure 1: Examples of Scheduler Functions

• Cifar10: Five classes each are assigned to the open set and the closed set, respectively. 'Bird' is made the minority class using the same strategy introduced above. In addition to the open set classes from Cifar10 itself, we further assess the OSD performance with Cifar+10 and Cifar+50.
• Cifar100: 'Living being' related super classes are assigned as the closed set and the remaining super classes are assigned as the open set. We make the 'insect' related classes the minority ones.
• ImageNet: Five classes each are assigned to the open set and the closed set, respectively. We make 'king crab' the minority class.
• Architectural Heritage Elements Dataset (AHED): Five classes each are assigned to the open set and the closed set, respectively. This is an inherently highly imbalanced dataset where the number of data points is unevenly distributed across the classes. The class 'portal' is the minority one.

A minimal sketch of the minority-class subsampling strategy shared by these datasets is given below.
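The construction (e.g., keeping 30% as many samples for the minority class as for each majority class) can be realized in a few lines; the function and argument names below are ours for illustration and do not come from the released code.

import numpy as np

def make_imbalanced_split(labels, minority_class, per_class=5000, ratio=0.3, seed=0):
    # Keep `per_class` samples for each majority class and
    # `ratio * per_class` samples for the minority class.
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        n = int(per_class * ratio) if c == minority_class else per_class
        keep.append(rng.choice(idx, size=n, replace=False))
    return np.concatenate(keep)

# Example (MNIST closed set, digit '3' as the minority class):
# train_idx = make_imbalanced_split(train_labels, minority_class=3)
# The testing split follows the same call with per_class=1000.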

Figure 2: (a) Top row: minority class; bottom row: majority classes; (b) sample ranking.

Figure 2(b) shows the ranking of these samples according to their uncertainty scores (a lower rank indicates lower uncertainty). In contrast, AREO assigns much lower rankings to these bird objects. This analysis justifies the effectiveness of AREO for detecting minority-class data samples in the closed set. Similarly, the bottom row of Figure 2(a) shows representative images from some majority classes. Again, AREO is able to recognize these difficult samples and assign them relatively low uncertainty scores to avoid them being mis-identified as open set samples, as shown in Figure 2(b).

5 CONCLUSION

In this paper, we focus on open set detection from imbalanced closed set data. To address the fundamental challenge arising from the interplay between minority-class samples and difficult samples from the majority classes, we propose an important extension of DRO to the evidential learning setting, leading to a novel Adaptive Robust Evidential Optimization (AREO) model. As an evidential learning model, AREO effectively breaks the closed set assumption by explicitly modeling the uncertainty mass, which is uniquely suitable for detecting open set samples. An adaptive DRO training process is achieved through multi-scheduler learning to attain an optimal training behavior. The experimentation conducted on five real-world datasets with diverse types of open set data samples justifies the effectiveness of the proposed model.

Algorithm 1: Multi-Scheduler Learning Process

Input: H, P, s, eval, P′, T
 1: Initialize: epoch = 0, Θ* = None, acc* = None
 2: for p ∈ [P] do
 3:     Θ_p, W_p ← initialize(Θ_param, H)
 4: while epoch < T do
 5:     Θ_p ← optimize(Θ_p | W_p), ∀p ∈ [P]
 6:     if epoch % s = 0 then
 7:         acc_p ← eval(Θ_p, W_p), ∀p ∈ [P]
 8:         sorted_idx ← argSortDesc({acc_p} for p ∈ [P])
 9:         bottom_idx ← sorted_idx[-P′:]    // the P′ weakest models
10:         top_idx ← sorted_idx[:P′]        // the P′ strongest models
11:         for idx ∈ bottom_idx do
12:             Θ_idx ← Θ_j, j ~ uniform(top_idx)
13:     epoch ← epoch + 1
14: return Θ* with the highest acc
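For readers who prefer code, the following is a compact Python sketch of this explore-and-prune population loop. The init_model, optimize, and evaluate callables are placeholders for the paper's scheduler sampling, training, and validation steps, and n_swap plays the role of P′; all names are ours for illustration.

import copy
import random

def multi_scheduler_learning(init_model, optimize, evaluate, P=8, s=5, T=100, n_swap=2):
    # Maintain P models, each trained under its own scheduler parameters W.
    population = [init_model() for _ in range(P)]       # each entry holds (Theta, W)
    for epoch in range(T):
        population = [optimize(m) for m in population]  # one epoch of training per model
        if epoch % s == 0:                              # exploration/exploitation step
            accs = [evaluate(m) for m in population]
            order = sorted(range(P), key=lambda i: accs[i], reverse=True)
            top, bottom = order[:n_swap], order[-n_swap:]
            for i in bottom:                            # replace weak models with copies
                population[i] = copy.deepcopy(population[random.choice(top)])
    accs = [evaluate(m) for m in population]
    return population[max(range(P), key=lambda i: accs[i])]  # best model

Unlike Algorithm 1, this sketch selects the best model from the final population rather than tracking Θ* across epochs, a simplification for brevity.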

Figure 3 below shows the resulting MSF output for the Cifar10 and Cifar100 datasets, which exhibit quite different learning behaviors. For Cifar10, the model can quickly learn from the classes due to the relatively easy data samples. As the MSF output decreases quickly, the resulting η value in AREO increases quickly, making the model focus mostly on the difficult samples. In the case of Cifar100, because of its difficulty, the model takes more time to learn well from all samples. Only in the later phase of training does the model start to put more emphasis on the difficult samples by increasing the η value in AREO.
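To make this behavior concrete, the sketch below gives one plausible form of the atomic scheduler functions and their mixture; the exact functional forms, parameter values, and the mapping from the MSF output to η are our assumptions for illustration and may differ from the implementation.

import math

def cosine_sf(epoch, T, beta=1.0):
    # Cosine decay from 1 to 0 over T epochs; beta sharpens the decay (assumed form).
    return (0.5 * (1.0 + math.cos(math.pi * min(epoch / T, 1.0)))) ** beta

def exponential_sf(epoch, T, beta=5.0):
    # Exponential decay from 1 towards 0 (assumed form).
    return math.exp(-beta * epoch / T)

def msf(epoch, T, w=(0.5, 0.5), betas=(1.0, 5.0)):
    # Multi-scheduler function: convex combination of the atomic SFs with
    # mixing weights w. A rapidly decreasing MSF output corresponds to a
    # rapidly increasing eta, i.e., earlier emphasis on difficult samples.
    z = w[0] * cosine_sf(epoch, T, betas[0]) + w[1] * exponential_sf(epoch, T, betas[1])
    return z / (w[0] + w[1])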

Chen et al. (2021) propose Adversarial Reciprocal Point Learning (ARPL), where an adversarial reciprocal point is generated for each known class in the embedding space by leveraging the representations of other known class samples along with unknown samples generated through an adversarial mechanism. During training, ARPL maximizes the gap between the representations of known class samples and the corresponding adversarial point. Meanwhile, the model tries to push the unknown samples' representations into a specific region using an adversarial margin constraint. To achieve this, diverse and confusing training samples are generated through adversarial learning. Cevikalp et al. (2021) leverage polyhedral conic functions and define two losses. The first loss ensures a good separation among the known classes, whereas the second loss enforces compactness within each class so that open set samples can be easily rejected. Dhamija et al. (2018) also leverage two losses: the Entropic open set loss ensures that the softmax output for open set samples is uniformly distributed over all known classes, and the Objectosphere loss assigns a higher feature magnitude to known class samples in the embedding space than to those from open set classes.

Kong & Ramanan (2021) propose the OpenGAN model to discriminate open set samples from closed set ones. There are some key differences from our work. First, OpenGAN introduces open set samples into the training and validation sets, whereas our approach does not involve any open set samples in the training or validation sets. Second, OpenGAN only works in the binary classification setting, where the loss function is designed to discriminate whether a sample is in the open set or the closed set.

Figure 4: OSD performance comparison on the imbalanced Cifar10 dataset.

Table 1: OSD (MAP) performance on all datasets

Symbols with Descriptions

p_c    Probability for the c-th singleton
y_n    One-hot encoded C-dimensional multinomial variable
y_nc   Class label for the n-th data sample for class c
p_nc   Probability of the n-th data sample belonging to class c
η_t    Readjusted weight associated with the c-th class from Eq. (11)
w      Mixing weights associated with the MSF to control the uncertainty set η_t
w′     Mixing weights associated with the MSF to readjust the class-specific weights
β      Set of specific parameters for the SFs to control the uncertainty set η_t
β′     Set of specific parameters for the SFs to readjust the class-specific weights

Ablation Study on OSD Results for the Different Datasets. For the MSF parameters (w and w′) of each model in P, we initialize them by uniformly sampling from [0, 1]. We then follow the training procedure shown in Algorithm 1 with random parameter selection in the exploration phase.
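A minimal sketch of this initialization is given below; the normalization to convex mixing weights is our assumption, and all names are ours for illustration.

import numpy as np

rng = np.random.default_rng()

def init_msf_params(num_sf=2):
    # Uniformly sample the MSF mixing weights: w controls the uncertainty-set
    # schedule and w_prime the class-specific re-weighting schedule.
    w = rng.uniform(0.0, 1.0, size=num_sf)
    w_prime = rng.uniform(0.0, 1.0, size=num_sf)
    return w / w.sum(), w_prime / w_prime.sum()

# In the exploration phase of Algorithm 1, a pruned model simply re-draws
# its parameters, e.g., w, w_prime = init_msf_params().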

Table 5: OSD (MAP) performance on Multiple Imbalanced Classes with Same Cardinality

Table 6: OSD (MAP) performance on Multiple Imbalanced Classes with Different Cardinality

Table 7: OSD (MAP) performance of (Wang et al., 2022)

As discussed earlier, all these methods are general open set recognition models and hence suffer from lower OSD performance in the more challenging setting that involves imbalanced data. The results further justify the effectiveness of the proposed AREO model.


Table 9: OSD (AUC) performance on Multiple Datasets

Table 10 below shows the closed set performance for competitive baselines.

Table 10: Closed set performance (MAP) on all datasets

EDL    … ± 3.78      31.80 ± 2.37  55.84 ± 1.60  99.58 ± 0.26  40.48 ± 2.65
AEDL   54.98 ± 0.63  36.11 ± 0.08  55.62 ± 0.58  99.62 ± 0.23  41.36 ± 3.98
DRO    27.16 ± 5.94  10.50 ± 0.25  20.71 ± 1.00  90.87 ± 3.89  30.02 ± 0.93
AREO   54.65 ± 1.02  36.44 ± 0.23  55.31 ± 1.11  99.88 ± 0.01  49.68 ± 1.58


Algorithm 2: Alternative Optimization between f and p using Probabilistic SGD
1: Initialize: f_0, stepsize γ, minibatch sizes b, b′ < b, p_t ∈ [0, 1], t = 0, p_n(f …

Cifar10. Five classes each are assigned to the open set and the closed set, respectively. Furthermore, the class 'Bird' is made the minority class using the same strategy introduced above. In addition to the open set classes from Cifar10 itself, we further assess the OSD performance with Cifar+10 and Cifar+50. In particular, Cifar+10 includes data samples from 10 randomly selected classes of the Cifar100 dataset, and Cifar+50 includes data samples from 50 randomly selected classes of the Cifar100 dataset.

Cifar100. In the Cifar100 dataset, the 'living being' related super classes are regarded as the closed set classes and the remaining 'non-living being' related super classes are regarded as the open set classes. We make the 'insect' related classes the minority ones.

ImageNet. In the ImageNet dataset, we perform the experimentation by randomly picking five classes as known classes and five classes as unknown classes. Specifically, the classes 'ant', 'king crab', 'lion', 'French bulldog', and 'great white shark' are treated as known classes, whereas the classes 'iPod', 'lipstick', 'street sign', 'bookshop', and 'miniskirt' are treated as unknown classes. We make 'king crab' the minority class in this dataset.

Architectural Heritage Elements Dataset (AHED). In this dataset, we pick the classes 'bell tower', 'portal', 'gargoyle', 'dome', and 'column' as closed set classes, whereas the classes 'apse', 'vault', 'altar', 'stained glass', and 'flying buttress' are treated as open set classes. This is an inherently highly imbalanced dataset where the number of data points is unevenly distributed across the classes. The class 'portal' is the minority class in this dataset.

D.2 DETAILED EXPERIMENTAL SETTING

For all datasets, we use LeNet5 (Lecun et al., 1998) as the network architecture, with Tanh activation in the CNN layers and Softplus activation in the fully connected layers. Training uses the Adam optimizer with a learning rate of 0.001 and an l2 regularization coefficient of 0.001. We initialize the uncertainty set with λ_0 = 100 for all datasets so that the model initially gives equal emphasis to all the samples.
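A minimal PyTorch sketch of this configuration is given below. The layer sizes assume a single-channel 28x28 input (as in MNIST); Cifar and ImageNet variants would change the input channels and spatial sizes, and the final Softplus producing non-negative evidence is our assumption, consistent with the evidential setup.

import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=5):
        super().__init__()
        self.features = nn.Sequential(                  # Tanh in the CNN layers
            nn.Conv2d(1, 6, 5), nn.Tanh(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, 5), nn.Tanh(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(                # Softplus in the FC layers
            nn.Flatten(),
            nn.Linear(16 * 4 * 4, 120), nn.Softplus(),
            nn.Linear(120, 84), nn.Softplus(),
            nn.Linear(84, num_classes), nn.Softplus(),  # non-negative evidence output
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-3)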

