EVALUATING ROBUSTNESS OF PREDICTIVE UNCERTAINTY ESTIMATION: ARE DIRICHLET-BASED MODELS RELIABLE?

Abstract

Robustness to adversarial perturbations and accurate uncertainty estimation are crucial for the reliable application of deep learning in real-world settings. Dirichlet-based uncertainty (DBU) models are a family of models that predict the parameters of a Dirichlet distribution (instead of a categorical one) and promise to signal when not to trust their predictions: untrustworthy predictions, obtained on unknown or ambiguous samples, are marked with high uncertainty. In this work, we show that DBU models with standard training are not robust w.r.t. three important tasks in the field of uncertainty estimation. First, we evaluate how useful the uncertainty estimates are to (1) indicate correctly classified samples. Our results show that while they are a good indicator on unperturbed data, performance on perturbed data decreases dramatically. (2) We evaluate whether uncertainty estimates are able to detect adversarial examples that try to fool classification. It turns out that uncertainty estimates can detect FGSM attacks but not PGD attacks. We further evaluate the reliability of DBU models on the task of (3) distinguishing between in-distribution (ID) and out-of-distribution (OOD) data. To this end, we present the first study of certifiable robustness for DBU models. Furthermore, we propose novel uncertainty attacks that fool models into assigning high confidence to OOD data and low confidence to ID data, respectively. Both approaches show that detecting OOD samples and distinguishing between ID and OOD data is not robust. Based on our results, we explore first approaches to make DBU models more robust. We use adversarial training procedures based on label attacks, uncertainty attacks, or random noise and demonstrate how they affect the robustness of DBU models on ID data and OOD data.
Snoek et al. (2019) compare uncertainty estimates of models based on drop-out and ensembles under data set shifts. Cardelli et al. (2019) and Wicker et al. (2020) study probabilistic safety of Bayesian neural networks under adversarial perturbations by analyzing input sets and the corresponding mappings in the output space. Contrary to the other models, Prior Networks (PriorNet) (Malinin & Gales, 2018a; 2019) require OOD data for training to "teach" the neural network the difference between ID and OOD data. PriorNet is trained with a loss function consisting of two KL-divergence terms. The first term is designed to learn the Dirichlet parameters for ID data, while the second one is used to learn a flat Dirichlet distribution (α = 1) for OOD data. There are two variants of PriorNet: the first one is trained with a reverse KL-divergence (Malinin & Gales, 2019), while the second one is trained with the forward KL-divergence (Malinin & Gales, 2018a). We include the more recent reverse-KL version of PriorNet in our experiments, as it shows superior performance (Malinin & Gales, 2019).

1. INTRODUCTION

Neural networks achieve high predictive accuracy in many tasks, but they are known to have two substantial weaknesses: First, neural networks are not robust against adversarial perturbations, i.e., semantically meaningless input changes that lead to wrong predictions (Szegedy et al., 2014; Goodfellow et al., 2015). Second, neural networks tend to make over-confident predictions at test time (Lakshminarayanan et al., 2017). Even worse, standard neural networks are unable to identify samples that differ from the samples they were trained on; in these cases, they provide uninformed decisions instead of abstaining. These two weaknesses make them impractical in sensitive domains such as finance, autonomous driving, or medicine, which require trust in predictions. To increase trust in neural networks, models that provide predictions along with the corresponding uncertainty have been proposed. There are three main families of models that aim to provide meaningful estimates of their predictive uncertainty. The first family are Bayesian neural networks (Blundell et al., 2015; Osawa et al., 2019; Maddox et al., 2019), which have the drawback of being computationally demanding. The second family consists of Monte-Carlo drop-out based models (Gal & Ghahramani, 2016) and ensembles (Lakshminarayanan et al., 2017), which estimate uncertainty by computing statistics such as mean and variance over the forward passes of multiple models. A disadvantage of all of these models is that uncertainty estimation at inference time is expensive. In contrast, the recently growing family of Dirichlet-based uncertainty (DBU) models (Malinin & Gales, 2018a; 2019; Sensoy et al., 2018; Malinin et al., 2019; Charpentier et al., 2020) directly predicts the parameters of a Dirichlet distribution over categorical probability distributions. DBU models provide efficient uncertainty estimates at test time since they require only a single forward pass.
DBU models bring the benefit of providing both aleatoric and epistemic uncertainty estimates. Aleatoric uncertainty is irreducible and caused by the natural complexity of the data, such as class overlap or noise. Epistemic uncertainty results from a lack of knowledge about unseen data, e.g. when the model is presented with an image of an unknown object. Both uncertainty types can be quantified using different uncertainty measures based on the Dirichlet distribution, such as differential entropy, mutual information, or pseudo-counts (Malinin & Gales, 2018a; Charpentier et al., 2020). These uncertainty measures have shown outstanding performance in, e.g., the detection of OOD samples and are thus superior to softmax-based confidence (Malinin & Gales, 2019; Charpentier et al., 2020). Neural networks from the families outlined above are expected to know what they don't know, i.e., to notice when they are unsure about a prediction. This raises a question with regard to adversarial examples: should uncertainty estimates detect these corrupted samples and abstain from making a prediction (indicated by high uncertainty), or should they be robust to adversarial examples and produce the correct output even under perturbations? Using humans as the gold standard of image classification and assuming that the perturbations are semantically meaningless, which is typically implied by a small L_p norm of the corruption, we argue that the best option is for the models to be robust to adversarial perturbations (see Figure 1). Beyond being robust w.r.t. label prediction, we expect models to robustly know what they do not know, i.e., they should robustly distinguish between ID and OOD data even if those are perturbed. In this work, we focus on DBU models and analyze their robustness w.r.t. the classification decision and uncertainty estimation, going beyond simple softmax output confidence by investigating advanced measures like differential entropy.
Specifically, we study the following questions: (1) Is high certainty a reliable indicator of correct predictions? (2) Can we use uncertainty estimates to detect label attacks on the classification decision? (3) Are uncertainty estimates such as differential entropy a robust feature for OOD detection? In addressing these questions, we place particular focus on adversarial perturbations of the input in order to evaluate the worst-case performance of the models. We address question one by analyzing uncertainty estimation on correctly and wrongly labeled samples, with and without adversarial perturbations of the inputs. To answer question two, we study uncertainty estimates of DBU models under label attacks. More specifically, we analyze whether there is a difference between uncertainty estimates on perturbed and unperturbed inputs and whether DBU models are capable of recognizing successful label attacks via uncertainty estimation. Addressing question three, we use robustness verification based on randomized smoothing and propose uncertainty attacks, which aim at changing the uncertainty estimate such that ID data is marked as OOD data and vice versa. Finally, we propose robust training procedures that use label attacks, uncertainty attacks, or random noise, and analyze how they affect the robustness of DBU models on ID data and OOD data.

2. RELATED WORK

In contrast, our work focuses on DBU models and analyzes their robustness w.r.t. adversarial perturbations specifically designed to fool the label or uncertainty predictions of the models. Furthermore, previous works on attack defenses have focused on evaluating either robustness w.r.t. class predictions (Carlini & Wagner, 2017; Weng et al., 2018) or label attack detection (Carlini & Wagner, 2017). In contrast, our work jointly evaluates both tasks by analyzing them from the uncertainty perspective. In addition to label attacks, we study a new type of adversarial perturbation that directly targets uncertainty estimation. These attacks are different from traditional label attacks (Madry et al., 2018; Dang-Nhu et al., 2020). Different models have been proposed to account for uncertainty while being robust. Smith & Gal (2018) and Lee et al. (2018) have tried to improve uncertainty-based label attack detection using drop-out or density estimation. Beyond improving label attack detection for large unseen perturbations, Stutz et al. (2020) aimed at improving robustness w.r.t. class label predictions under small input perturbations. To this end, they proposed a new adversarial training with softer labels for adversarial samples further from the original input. Qin et al. (2020) suggested a similar adversarial training where labels are softened differently depending on the input robustness. These previous works only consider the aleatoric uncertainty contained in the predicted categorical probabilities, i.e. the softmax output. They do not consider DBU models, which explicitly account for both aleatoric and epistemic uncertainty. Malinin & Gales (2019) proposed to improve a single type of DBU model for label attack detection by assigning adversarial samples high uncertainty during training.
Please note that the works of Tagasovska & Lopez-Paz (2019), Kumar et al. (2020), Bitterwolf et al. (2020) and Meinke & Hein (2020) study a different, orthogonal problem. Tagasovska & Lopez-Paz (2019) propose to compute confidence intervals, while Kumar et al. (2020) propose certificates on softmax predictions. Bitterwolf et al. (2020) use interval bound propagation to compute bounds on softmax predictions in the L∞-ball around an OOD point, and, for ReLU networks, Meinke & Hein (2020) propose an approach to obtain certifiably low confidence on OOD data. These four studies estimate confidence based on softmax predictions, which account for aleatoric uncertainty only. In this paper, we provide certificates on the OOD classification task using DBU models directly, which is better suited to epistemic uncertainty measures.

3. DIRICHLET-BASED UNCERTAINTY MODELS

Standard (softmax) neural networks predict the parameters of a categorical distribution $p^{(i)} = [p^{(i)}_1, \ldots, p^{(i)}_C]$ for a given input $x^{(i)} \in \mathbb{R}^d$, where $C$ is the number of classes. Given the parameters of a categorical distribution, we can evaluate its aleatoric uncertainty, i.e. the uncertainty on the class label prediction $y^{(i)} \in \{1, \ldots, C\}$. For example, when predicting the result of an unbiased coin flip, we expect the model to have high aleatoric uncertainty and predict $p(\text{head}) = 0.5$. In contrast to standard (softmax) neural networks, DBU models predict the parameters of a Dirichlet distribution (the natural prior of categorical distributions) given input $x^{(i)}$, i.e. $q^{(i)} = \text{Dir}(\alpha^{(i)})$ where $f_\theta(x^{(i)}) = \alpha^{(i)} \in \mathbb{R}^C_+$. Hence, the epistemic distribution $q^{(i)}$ expresses the epistemic uncertainty on $x^{(i)}$, i.e. the uncertainty on the categorical distribution prediction $p^{(i)}$. From the epistemic distribution follows an estimate of the aleatoric distribution of the class label prediction $\text{Cat}(\bar{p}^{(i)})$ with $\bar{p}^{(i)} = \mathbb{E}_{q^{(i)}}[p^{(i)}]$. An advantage of DBU models is that one pass through the neural network is sufficient to compute the epistemic distribution, the aleatoric distribution, and the class label prediction:

$$q^{(i)} = \text{Dir}(\alpha^{(i)}), \qquad \bar{p}^{(i)}_c = \frac{\alpha^{(i)}_c}{\alpha^{(i)}_0} \ \text{ with } \ \alpha^{(i)}_0 = \sum_{c=1}^{C} \alpha^{(i)}_c, \qquad \bar{y}^{(i)} = \arg\max_c \bar{p}^{(i)}_c$$

This parametrization allows computing classic uncertainty measures in closed form. As an example, the concentration parameters $\alpha^{(i)}_c$ can be interpreted as pseudo-counts of observed samples of class $c$ and are thus a good indicator of epistemic uncertainty. Further measures, such as the differential entropy of the Dirichlet distribution (see Equation 2, where $\Gamma$ is the Gamma function and $\Psi$ is the Digamma function) or the mutual information between the label $y^{(i)}$ and the categorical $p^{(i)}$, can also be computed in closed form (App. A.2, (Malinin & Gales, 2018a)).
Hence, DBU models can efficiently use these measures to assign high uncertainty to unknown data, making them specifically suited for the detection of OOD samples such as anomalies:

$$m_{\text{diffE}} = \sum_{c=1}^{K} \ln \Gamma(\alpha_c) - \ln \Gamma(\alpha_0) - \sum_{c=1}^{K} (\alpha_c - 1) \cdot (\Psi(\alpha_c) - \Psi(\alpha_0)) \qquad (2)$$

Several recently proposed models for uncertainty estimation belong to the family of DBU models, such as PriorNet, EvNet, DDNet and PostNet. These models differ in terms of their parametrization of the Dirichlet distribution, their training, and their density estimation. An overview of these differences is provided in Table 1. We evaluate all recent versions of these models in our study.

Table 1: Summary of DBU models. Further details on the loss functions are provided in the appendix.

Evidential Networks (EvNet) (Sensoy et al., 2018) are trained with a loss that computes the expected sum of squares between the one-hot encoded true label $y^{*(i)}$ and the predicted categorical $p^{(i)}$ under the Dirichlet distribution. Ensemble Distribution Distillation (DDNet) (Malinin et al., 2019) is trained in two steps: first, an ensemble of $M$ classic neural networks is trained; then, the soft labels $\{p^{(i)}_m\}_{m=1}^{M}$ provided by the ensemble are distilled into a Dirichlet-based network by fitting them with maximum likelihood under the Dirichlet distribution. Posterior Network (PostNet) (Charpentier et al., 2020) performs density estimation on the ID data with normalizing flows and uses a Bayesian loss formulation. Note that EvNet and PostNet model the Dirichlet parameters as $f_\theta(x^{(i)}) = 1 + \alpha^{(i)}$, while PriorNet, RevPriorNet and DDNet compute them as $f_\theta(x^{(i)}) = \alpha^{(i)}$.
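As an illustration, the single-pass quantities above can be sketched in a few lines. This is a minimal numpy example, not code from the paper; the three-class α values are made up:

```python
import numpy as np

def dirichlet_stats(alpha):
    """Closed-form quantities from predicted Dirichlet parameters alpha (shape [C]):
    mean categorical distribution, predicted class label, and precision alpha_0."""
    alpha = np.asarray(alpha, dtype=float)
    alpha_0 = alpha.sum()            # precision: total pseudo-count of observations
    p_bar = alpha / alpha_0          # expected categorical distribution
    y_hat = int(np.argmax(p_bar))    # class label prediction
    return p_bar, y_hat, alpha_0

# a confident, ID-like prediction (large pseudo-count for class 0) ...
p_bar, y_hat, a0 = dirichlet_stats([20.0, 1.0, 1.0])
# ... vs. a flat, OOD-like Dirichlet with low precision
_, _, a0_flat = dirichlet_stats([1.0, 1.0, 1.0])
```

All three quantities fall out of one forward pass producing α, which is what makes DBU uncertainty estimation cheap at test time.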

4. ROBUSTNESS OF DIRICHLET-BASED UNCERTAINTY MODELS

We analyze the robustness of DBU models in the field of uncertainty estimation w.r.t. the following four aspects: accuracy, confidence calibration, label attack detection and OOD detection. Uncertainty is quantified by differential entropy, mutual information and pseudo-counts. A formal definition of all uncertainty estimation measures is provided in the appendix. Robustness of Dirichlet-based uncertainty models is evaluated based on label attacks and a newly proposed type of attack called uncertainty attacks. While label attacks aim at changing the predicted class, uncertainty attacks aim at changing the uncertainty assigned to a prediction. All existing works are based on label attacks and focus on robustness w.r.t. the classification decision. Thus, we are the first to propose attacks targeting uncertainty estimates such as differential entropy and to analyze further desirable robustness properties of DBU models. Both attack types compute a perturbed input $\tilde{x}^{(i)}$ close to the original input $x^{(i)}$, i.e. $\|x^{(i)} - \tilde{x}^{(i)}\|_2 < r$, where $r$ is the attack radius. The perturbed input is obtained by optimizing a loss function $l(x)$ using the Fast Gradient Sign Method (FGSM) or Projected Gradient Descent (PGD). We also use a black-box attack (Noise), which generates 10 noise samples from a Gaussian distribution with mean equal to the original sample; the perturbed sample that fools the loss function the most is selected as the attack. To complement the attacks, we propose the first study of certifiable robustness for DBU models, which is based on randomized smoothing (Cohen et al., 2019). The questions we address in our experiments share a common assessment metric: distinguishing between correctly and wrongly classified samples, between non-attacked and attacked inputs, or between ID and OOD data can all be treated as binary classification problems. To quantify the performance of the models on these binary classification problems, we compute the AUC-PR.
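A minimal sketch of the two attack styles on an arbitrary differentiable loss (numpy only; the toy quadratic loss and all parameter values are illustrative, not the paper's setup):

```python
import numpy as np

def pgd_attack(x, grad_fn, radius, step=0.1, n_steps=50):
    """PGD sketch: ascend a loss l via its gradient, projecting the perturbed
    input back onto the L2-ball of the given radius around the original x."""
    x = np.asarray(x, dtype=float)
    x_adv = x.copy()
    for _ in range(n_steps):
        g = grad_fn(x_adv)
        x_adv = x_adv + step * g / (np.linalg.norm(g) + 1e-12)  # normalized ascent step
        delta = x_adv - x
        norm = np.linalg.norm(delta)
        if norm > radius:                    # projection back onto the L2-ball
            x_adv = x + delta * (radius / norm)
    return x_adv

def noise_attack(x, loss_fn, sigma=0.5, n=10, seed=0):
    """Black-box 'Noise' attack sketch: sample n Gaussian candidates around x
    and keep the one that fools the loss the most."""
    rng = np.random.default_rng(seed)
    cands = x + sigma * rng.standard_normal((n,) + np.shape(x))
    return cands[int(np.argmax([loss_fn(c) for c in cands]))]

# toy loss: l(x) = -||x - t||^2 is maximized at the (out-of-ball) target t,
# so PGD should stop on the boundary of the radius-1 ball, at [1, 0]
t = np.array([3.0, 0.0])
loss = lambda x: -np.sum((x - t) ** 2)
grad = lambda x: -2.0 * (x - t)
x_adv = pgd_attack(np.zeros(2), grad, radius=1.0)
```

In the paper's setting, `loss` would be the cross-entropy (label attacks) or an uncertainty measure such as differential entropy (uncertainty attacks), with gradients taken through the network.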
Experiments are performed on two image data sets (MNIST (LeCun & Cortes, 2010) and CIFAR10 (Krizhevsky et al., 2009) ), which contain bounded inputs and two tabular data sets (Segment (Dua & Graff, 2017) and Sensorless drive (Dua & Graff, 2017) ), consisting of unbounded inputs. Note that unbounded inputs are challenging since it is impossible to describe the infinitely large OOD distribution. As PriorNet requires OOD training data, we use two further image data sets (FashionMNIST (Xiao et al., 2017) and CIFAR100 (Krizhevsky et al., 2009) ) for training on MNIST and CIFAR10, respectively. All other models are trained without OOD data. To obtain OOD data for the tabular data sets, we remove classes from the ID data set (class window for the Segment data set and class 9 for Sensorless drive) and use them as the OOD data. See appendix for further details on the setup.
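The class-removal construction of tabular OOD data can be sketched as follows. This is a hypothetical numpy example with synthetic arrays, not the actual Segment or Sensorless drive data:

```python
import numpy as np

def id_ood_split(X, y, ood_class):
    """Hold out one class of a tabular data set as OOD data
    (e.g. class 'window' for Segment, class 9 for Sensorless drive)."""
    mask = (y == ood_class)
    X_id, y_id = X[~mask], y[~mask]   # ID data: all remaining classes
    X_ood = X[mask]                   # OOD data: the held-out class
    return (X_id, y_id), X_ood

# synthetic stand-in: 6 samples, 3 classes, class 2 becomes OOD
X = np.arange(12.0).reshape(6, 2)
y = np.array([0, 1, 2, 0, 1, 2])
(X_id, y_id), X_ood = id_ood_split(X, y, ood_class=2)
```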

4.1. UNCERTAINTY ESTIMATION UNDER LABEL ATTACKS

Label attacks aim at changing the predicted class. To obtain a perturbed input with a different label, we maximize the cross-entropy loss, $\tilde{x}^{(i)} \approx \arg\max_{\tilde{x}} l(\tilde{x})$ with $l(\tilde{x}) = \text{CE}(\bar{p}^{(i)}, y^{(i)})$, under the radius constraint. For the sake of completeness, we also analyze label attacks regarding their ability to change class predictions and report the resulting accuracy for different radii (see Appendix, Table 7). As expected and partially shown by previous works, none of the DBU models is robust against label attacks. However, we note that PriorNet is slightly more robust than the other models. This might be explained by the use of OOD data during training, which can be seen as a form of robust training. From now on, we switch to the core focus of this work and analyze the robustness properties of uncertainty estimation.
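For a toy linear softmax classifier, the FGSM variant of this cross-entropy label attack looks as follows. This is a numpy sketch under made-up weights; the paper attacks full DBU networks, not this linear model:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fgsm_label_attack(x, W, y, eps):
    """One FGSM step maximizing CE(softmax(W @ x), y) under an L-inf budget eps."""
    p = softmax(W @ x)
    g_logits = p.copy()
    g_logits[y] -= 1.0               # d CE / d logits = p - onehot(y)
    grad_x = W.T @ g_logits          # chain rule through the linear layer
    return x + eps * np.sign(grad_x)

W = np.eye(2)                        # made-up weights: logits = x
x, y = np.array([2.0, 0.0]), 0       # originally classified as class 0
x_adv = fgsm_label_attack(x, W, y, eps=3.0)
```

The single signed-gradient step is what makes FGSM cheap but, as the results below show, also much easier to detect than iterative PGD.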

Is high certainty a reliable indicator of correct predictions?

Expected behavior: Predictions with high certainty are more likely to be correct than low-certainty predictions. Assessment metric: We distinguish between correctly classified samples (label 0) and wrongly classified ones (label 1) based on the differential entropy scores produced by the DBU models (Malinin & Gales, 2018a). Correctly classified samples are expected to have low differential entropy, reflecting the model's confidence, while wrongly predicted samples are expected to have higher differential entropy. Observed behavior: Note that the positive and negative classes are not balanced; thus, the use of AUC-PR scores (Saito & Rehmsmeier, 2015) is important to obtain meaningful measures. While uncertainty estimates are indeed an indicator of correctly classified samples on non-perturbed data, none of the models maintains its high performance on perturbed data (see Table 2). Thus, using uncertainty estimates as an indicator of correctly labeled inputs is not robust to adversarial perturbations, even though the used attacks do not target uncertainty.

Can we use uncertainty estimates to detect label attacks on the classification decision?

Expected behavior: Adversarial examples are not from the natural data distribution. Therefore, DBU models are expected to detect them as OOD data by assigning them a higher uncertainty. We expect perturbations with a larger attack radius r to be easier to detect, as they differ more significantly from the data distribution. Assessment metric: The goal of attack detection is to distinguish between unperturbed samples (label 0) and perturbed samples (label 1). To quantify the performance, we use the differential entropy (Malinin & Gales, 2018a). Non-perturbed samples are expected to have low differential entropy, reflecting the fact that they are from the distribution the models were trained on, while perturbed samples are expected to have high differential entropy.
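The AUC-PR evaluation used throughout these binary tasks can be approximated by average precision over the uncertainty scores. A small numpy sketch with made-up labels and scores, not the paper's evaluation code:

```python
import numpy as np

def average_precision(labels, scores):
    """AUC-PR approximation: labels 1 = positive class (e.g. wrongly classified
    or attacked samples), scores = uncertainty, higher = more positive."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)
    # mean precision at the rank of each positive sample
    return float(precision[labels == 1].mean())

# a perfect ranking (all positives receive higher uncertainty) scores AUC-PR = 1
ap_perfect = average_precision([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
```

Because AP is threshold-free, no decision boundary on the uncertainty measure has to be fixed, which is also why the later certification experiments (which do need a threshold) are set up differently.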
Further results based on other uncertainty measures are provided in the appendix. Observed behavior: Table 7 shows that the accuracy of all models decreases significantly under PGD label attacks, but none of the models provides a correspondingly increasing attack detection rate (see Table 3). Even larger perturbations are hard to detect for DBU models. Although PGD label attacks do not explicitly consider uncertainty, they seem to produce adversarial examples with similar uncertainty as the original input. Such high-certainty adversarial examples are illustrated in Figure 2, where certainty is visualized based on the precision α0, which is supposed to be high for ID data and low for OOD data. While the original input (perturbation size 0.0) is correctly classified as frog and as ID data, there exist adversarial examples that are classified as deer or bird. The certainty on the prediction of these adversarial examples is similar to or even higher than on the original input. Using the differential entropy to distinguish between ID and OOD data results in the same ID/OOD assignment, since the differential entropy of the three right-most adversarial examples is similar to or even smaller than on the unperturbed input. For the less powerful FGSM and Noise attacks (see Appendix), DBU models achieve mostly better attack detection rates than for PGD attacks. This suggests that uncertainty estimation is able to detect weak attacks, which is consistent with the observations in (Malinin & Gales, 2018b). Furthermore, PostNet provides better label attack detection rates for large perturbations on tabular data sets. An explanation for this observation is that the density estimation of the ID samples has been shown to work better on tabular data sets (Charpentier et al., 2020). Standard adversarial training (based on label attacks targeting the cross-entropy loss function) improves robustness w.r.t.
class predictions (see Appendix, Table 32), but does not improve the label attack detection performance of any model (see Table 40). Overall, none of the DBU models provides a reliable indicator for adversarial inputs that target the classification decision. DBU models are designed to provide uncertainty estimates (beyond softmax-based confidence) alongside predictions and to use this predictive uncertainty for OOD detection. Thus, in this section we focus on attacking these uncertainty estimates. We present results for attacks based on the differential entropy as loss function, $\tilde{x}^{(i)} \approx \arg\max_{\tilde{x}} l(\tilde{x})$ with $l(\tilde{x}) = \text{Diff-Ent}(\text{Dir}(\alpha^{(i)}))$.

Are uncertainty estimates a robust feature for OOD detection?

Expected behavior: We expect Dirichlet-based uncertainty models to be able to distinguish between ID and OOD data by providing reliable uncertainty estimates, even under small perturbations. That is, we expect the uncertainty estimates of DBU models to be robust under attacks. Assessment metric: We distinguish between ID data (label 0) and OOD data (label 1) based on the differential entropy as uncertainty scoring function (Malinin & Gales, 2018a). Differential entropy is expected to be small on ID samples and high on OOD samples. Experiments on further uncertainty measures and results for AUROC are provided in the appendix. Observed behavior: OOD samples are perturbed as illustrated in Figure 3. The left part shows an OOD sample, which is identified as OOD. Adding adversarial perturbations ≥ 0.5 to it changes the Dirichlet parameters such that the resulting images are identified as ID, based on precision or differential entropy as uncertainty measure. Adding adversarial perturbations to an ID sample (right part) results in images identified as OOD. The OOD detection performance of all DBU models decreases rapidly with the size of the perturbation, regardless of whether attacks are computed on ID or OOD data (Table 4).
Thus, using uncertainty estimation to distinguish between ID and OOD data is not robust. PostNet and DDNet achieve slightly better performance than the other models. Further, PostNet provides better scores for large perturbations on tabular data sets, which could again be explained by its density-based approach. During training, we augment the data set with samples computed based on (i) PGD attacks against the cross-entropy loss, (ii) PGD attacks against the differential entropy function, which is used to distinguish between ID and OOD data, or (iii) random noise, as proposed for randomized smoothing training. Since attacks are used during robust training, we want to avoid tying robustness evaluation to gradient-based attacks. Instead, we propose the first approach that certifies robustness of DBU models based on randomized smoothing (Cohen et al., 2019). Randomized smoothing was proposed to verify robustness w.r.t. class predictions, and we modify it for ID/OOD verification. As randomized smoothing treats classifiers as a black box, we transform distinguishing between ID data (label 0) and OOD data (label 1) into a binary classification problem based on an uncertainty measure, which requires setting a threshold on the uncertainty measure to obtain an actual decision boundary. This is in contrast to our attack-based experiments, where we avoided setting thresholds by analyzing area-under-the-curve metrics. Thresholds for the uncertainty measures are set for each model individually based on the validation set, such that the accuracy w.r.t. the ID/OOD assignment of the model is maximized. In the following, we discuss results for ID/OOD verification based on differential entropy on CIFAR10 (ID data) and SVHN (OOD data). Further results on other data sets and uncertainty measures, as well as results for the standard classification-based randomized smoothing verification, are shown in the appendix. Table 5 shows the percentage of samples which are correctly identified as ID (resp. OOD) data and are certifiably robust within this type (cc; certified correct), along with the corresponding mean certified radius. The higher the portion of cc samples and the larger the radius, the more robust is the ID/OOD distinction w.r.t. the corresponding perturbation size σ.

Table 5: Randomized smoothing verification for different σ on CIFAR10 (ID data) and SVHN (OOD data). Left part: percentage of samples that are correctly identified and certified as ID data (cc) and corresponding mean certified radius (R). Right part: same for OOD data.

For each model, we observe a performance jump between ID- and OOD-verification, where robustness on ID data drops from high values to low ones while the cc percentage and radius on OOD data increase. These jumps are observed for normal training as well as for adversarial training based on the cross-entropy or the differential entropy. Thus, either ID-verification or OOD-verification performs well, depending on the chosen threshold. Augmenting the data set with random-noise-perturbed samples (randomized smoothing loss) does not result in such performance jumps (except for PriorNet), but there is still a trade-off between robustness on ID data and robustness on OOD data, and there is no parametrization where ID-verification and OOD-verification perform equally well.

Table 6: Randomized smoothing verification for different σ on CIFAR10 (ID data) and SVHN (OOD data): percentage of samples that are wrongly identified as ID/OOD and certifiably robust as this wrong type (cw) and corresponding mean certified radius (R). The lower cw, the more robust the model.

A.1 LOSS FUNCTIONS OF DBU MODELS

In this section, we provide details on the losses used by each DBU model. PostNet uses a Bayesian loss which can be expressed as follows:

$$\mathcal{L}_{\text{PostNet}} = \frac{1}{N} \sum_i \mathbb{E}_{q^{(i)}}\big[\text{CE}(p^{(i)}, y^{(i)})\big] - H(q^{(i)})$$

where CE denotes the cross-entropy.
Both the expectation term $\mathbb{E}_{q^{(i)}}[\text{CE}(p^{(i)}, y^{(i)})]$ and the entropy term $H(q^{(i)})$ can be computed in closed form (Charpentier et al., 2020). PriorNet uses a loss composed of two KL-divergence terms for ID and OOD data:

$$\mathcal{L}_{\text{PriorNet}} = \frac{1}{N} \Bigg[ \sum_{x^{(i)} \in \text{ID data}} \text{KL}\big[\text{Dir}(\alpha_{\text{ID}}) \,\|\, q^{(i)}\big] + \sum_{x^{(i)} \in \text{OOD data}} \text{KL}\big[\text{Dir}(\alpha_{\text{OOD}}) \,\|\, q^{(i)}\big] \Bigg] \qquad (4)$$

Both KL-divergence terms can be computed in closed form (Malinin & Gales, 2019). The precisions $\alpha_{\text{ID}}$ and $\alpha_{\text{OOD}}$ are hyper-parameters. $\alpha_{\text{ID}}$ is usually set to $10$ for the correct class and $1$ otherwise; $\alpha_{\text{OOD}}$ is usually set to $1$. DDNet uses the Dirichlet likelihood of the soft labels produced by an ensemble of $M$ neural networks:

$$\mathcal{L}_{\text{DDNet}} = -\frac{1}{N} \sum_i \sum_{m=1}^{M} \ln q^{(i)}(\pi_{im})$$

where $\pi_{im}$ denotes the soft label of the $m$-th neural network. The Dirichlet likelihood can be computed in closed form (Malinin et al., 2019). EvNet uses the expected mean squared error between the one-hot encoded label and the predicted categorical distribution:

$$\mathcal{L}_{\text{EvNet}} = \frac{1}{N} \sum_i \mathbb{E}_{p^{(i)} \sim \text{Dir}(\alpha^{(i)})}\big[\|y^{*(i)} - p^{(i)}\|^2\big] \qquad (6)$$

where $y^{*(i)}$ denotes the one-hot encoded label. The expected MSE loss can also be computed in closed form (Sensoy et al., 2018). For more details, please refer to the original papers on PriorNet (Malinin & Gales, 2018a), PostNet (Charpentier et al., 2020), DDNet (Malinin et al., 2019) and EvNet (Sensoy et al., 2018).
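The closed form of the expected MSE in Equation 6 decomposes, per class, into a squared-bias term and a Dirichlet variance term. A numpy sketch of this decomposition (the standard form from Sensoy et al. (2018); the α values below are made up), which can be verified against Monte-Carlo sampling:

```python
import numpy as np

def evnet_loss_closed_form(alpha, y_onehot):
    """E_{p ~ Dir(alpha)} ||y* - p||^2
       = sum_c (y*_c - p_bar_c)^2 + p_bar_c (1 - p_bar_c) / (alpha_0 + 1)."""
    alpha = np.asarray(alpha, dtype=float)
    a0 = alpha.sum()
    p_bar = alpha / a0
    bias_sq = (y_onehot - p_bar) ** 2               # squared bias per class
    var = p_bar * (1.0 - p_bar) / (a0 + 1.0)        # Dirichlet marginal variance
    return float((bias_sq + var).sum())

alpha = np.array([8.0, 2.0, 2.0])
y = np.array([1.0, 0.0, 0.0])
loss = evnet_loss_closed_form(alpha, y)
```

Growing the pseudo-count of the correct class shrinks both terms, so the loss rewards concentrated, correct Dirichlets.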

A.2 CLOSED-FORM COMPUTATION OF UNCERTAINTY MEASURES & UNCERTAINTY ATTACKS

Dirichlet-based uncertainty models allow computing several uncertainty measures in closed form (see (Malinin & Gales, 2018a) for a derivation). As proposed by Malinin & Gales (2018a), we use the precision $m_{\alpha_0}$, the differential entropy $m_{\text{diffE}}$ and the mutual information $m_{\text{MI}}$ to estimate the uncertainty of predictions. The differential entropy $m_{\text{diffE}}$ of a DBU model reaches its maximum value for equally probable categorical distributions and thus on a flat Dirichlet distribution. It is a measure of distributional uncertainty and expected to be low on ID data but high on OOD data:

$$m_{\text{diffE}} = \sum_{c=1}^{K} \ln \Gamma(\alpha_c) - \ln \Gamma(\alpha_0) - \sum_{c=1}^{K} (\alpha_c - 1) \cdot (\Psi(\alpha_c) - \Psi(\alpha_0)) \qquad (7)$$

where $\alpha$ are the parameters of the Dirichlet distribution, $\Gamma$ is the Gamma function and $\Psi$ is the Digamma function. The mutual information $m_{\text{MI}}$ is the difference between the total uncertainty (entropy of the expected distribution) and the expected uncertainty of the data (expected entropy of the distribution). This uncertainty is expected to be low on ID data and high on OOD data:

$$m_{\text{MI}} = -\sum_{c=1}^{K} \frac{\alpha_c}{\alpha_0} \left( \ln \frac{\alpha_c}{\alpha_0} - \Psi(\alpha_c + 1) + \Psi(\alpha_0 + 1) \right) \qquad (8)$$

Furthermore, we use the precision $\alpha_0$ to measure uncertainty, which is expected to be high on ID data and low on OOD data:

$$m_{\alpha_0} = \alpha_0 = \sum_{c=1}^{K} \alpha_c \qquad (9)$$

As these uncertainty measures are computed in closed form and their gradients can be obtained, we use them (i.e. $m_{\text{diffE}}$, $m_{\text{MI}}$, $m_{\alpha_0}$) as target functions of our uncertainty attacks. Changing the attacked target function allows us to use a wide range of gradient-based attacks, such as FGSM and PGD attacks, but also more sophisticated attacks such as Carlini-Wagner attacks.
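Equations 7 to 9 translate directly into code with scipy's `gammaln` and `digamma`. A sketch with illustrative α values; in the paper, α would come from the network's forward pass:

```python
import numpy as np
from scipy.special import digamma, gammaln

def m_diff_e(alpha):
    """Differential entropy of Dir(alpha), Eq. (7); maximal on flat Dirichlets."""
    a0 = alpha.sum()
    return float(gammaln(alpha).sum() - gammaln(a0)
                 - ((alpha - 1.0) * (digamma(alpha) - digamma(a0))).sum())

def m_mi(alpha):
    """Mutual information between label and categorical, Eq. (8)."""
    a0 = alpha.sum()
    p = alpha / a0
    return float(-(p * (np.log(p) - digamma(alpha + 1.0) + digamma(a0 + 1.0))).sum())

def m_alpha0(alpha):
    """Precision (total pseudo-count), Eq. (9)."""
    return float(alpha.sum())

flat = np.ones(3)                    # OOD-like: flat Dirichlet
conc = np.array([20.0, 1.0, 1.0])    # ID-like: concentrated Dirichlet
```

Since all three functions are differentiable in α, re-implementing them in an autodiff framework immediately yields the gradients needed for the uncertainty attacks described above.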

A.3 DETAILS OF THE EXPERIMENTAL SETUP

Models. We trained all models with a similar base architecture: 3 linear layers for the tabular data sets, 3 convolutional layers with kernel size 5 followed by 3 linear layers for MNIST, and the VGG16 architecture (Simonyan & Zisserman, 2015) with batch normalization for CIFAR10. All implementations use PyTorch (Paszke et al., 2019). We optimized all models with the Adam optimizer and performed early stopping by checking for loss improvement every 2 epochs with a patience of 10. The models were trained on GPUs. We performed a grid search over hyper-parameters for all models. The learning rate was searched in [1e-5, 1e-3]. For PostNet, we used radial flows with a depth of 6 and a latent dimension of 6, and performed a grid search for the regularization factor in [1e-7, 1e-4]. For PriorNet, we performed a grid search for the OOD loss weight in [1, 10]. For DDNet, we distilled the knowledge of 5 neural networks after a grid search over [2, 5, 10, 20] ensemble members; note that this already implies a significant training overhead compared to the other models. Metrics. For all experiments, we focus on AUC-PR scores, since they are well suited to imbalanced tasks (Saito & Rehmsmeier, 2015) while carrying theoretically similar information to AUC-ROC scores (Davis & Goadrich, 2006). We scale all scores from [0, 1] to [0, 100]. All results are averaged over 5 training runs using the best hyper-parameters found in the grid search. Data sets. For the tabular data sets, we use 5 different random splits to train all models, splitting the data into training, validation and test sets (60%, 20%, 20%). We use the Segment data set (Dua & Graff, 2017), where the goal is to classify areas of images into 7 classes (window, foliage, grass, brickface, path, cement, sky). Finally, we use the CIFAR10 image data set Krizhevsky et al.
(2009), where the goal is to classify pictures of objects into 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). Each input is a 3 × 32 × 32 tensor. The data set contains 60,000 samples. For OOD detection experiments, we use street view house numbers (SVHN) (Netzer et al., 2011) and CIFAR100 (Krizhevsky et al., 2009), containing images of numbers and objects, respectively. CIFAR100 was used as training OOD for PriorNet, while SVHN is used as OOD at test time.

Perturbations. For all label and uncertainty attacks, we used the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD). We used the radii [0.0, 0.1, 0.2, 0.5, 1.0, 2.0, 4.0], which operate on the input space after data normalization. We bound perturbations by the L∞-norm or the L2-norm, with L∞(x) = max_{i=1,...,D} |x_i| and L2(x) = (Σ_{i=1}^D x_i²)^{1/2}. For the L∞-norm, it is straightforward to relate the perturbation size ε to perturbed input images, because all inputs are standardized such that their feature values lie between 0 and 1. A perturbation of size ε = 0 corresponds to the original input, while a perturbation of size ε = 1 covers the whole input space and allows changing all features to any value. For the L2-norm, the relation between perturbation size ε and perturbed input images is less obvious. To justify our choice of ε w.r.t. this norm, we relate the perturbation size ε_2 corresponding to the L2-norm to the perturbation size ε_∞ corresponding to the L∞-norm. First, we compute ε_2 such that the L2-ball is the smallest superset of the L∞-ball: for a perturbation of size ε_∞, the largest L2-norm is obtained if each feature is perturbed by ε_∞. Thus, the perturbation ε_2 such that L2 encloses L∞ is ε_2 = (Σ_{i=1}^D ε_∞²)^{1/2} = √D · ε_∞. For the MNIST data set, with D = 28 × 28 input features, the L2-norm with ε_2 = 28 encloses the L∞-norm with ε_∞ = 1. Alternatively, ε_2 can be computed such that the volume spanned by the L2-norm is equivalent to the one spanned by the L∞-norm. Using that the volume spanned by the L∞-norm is ε_∞^D and the volume spanned by the L2-norm is π^{0.5D} ε_2^D / Γ(0.5D + 1) (where Γ is the Gamma function), we obtain volume equivalence if ε_2 = Γ(0.5D + 1)^{1/D} √π ε_∞.
For the MNIST data set, with D = 28 × 28 input features, the L2-norm with ε_2 ≈ 21.39 is volume-equivalent to the L∞-norm with ε_∞ = 1.
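The two ε_2 choices above can be checked numerically. This is a small sketch with hypothetical helper names; the volume-equivalent radius evaluates the formula as stated above in log space, since Γ(0.5D + 1) overflows for image-sized D:

```python
import math
from scipy.special import gammaln

def enclosing_l2_radius(eps_inf, d):
    # Smallest L2 radius whose ball contains the L-inf ball: sqrt(D) * eps_inf.
    return math.sqrt(d) * eps_inf

def volume_equivalent_l2_radius(eps_inf, d):
    # eps_2 = Gamma(0.5 D + 1)^(1/D) * sqrt(pi) * eps_inf, via log-Gamma.
    return math.exp(gammaln(0.5 * d + 1) / d) * math.sqrt(math.pi) * eps_inf
```

For MNIST (D = 784, ε_∞ = 1), these give 28 and approximately 21.39, matching the values above.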

A.4 ADDITIONAL EXPERIMENTS

Tables 7 and 8 illustrate that no DBU model maintains high accuracy under gradient-based label attacks. Accuracy decreases more under PGD attacks than under FGSM attacks, since PGD is the stronger attack. Interestingly, Noise attacks also achieve good performance as the noise standard deviation increases. Note that Noise attacks are not constrained to a given radius.
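For illustration, a PGD L∞ label attack can be sketched against a toy linear softmax classifier. This is our own minimal NumPy sketch under simplifying assumptions, not the attack code used in the experiments (which operates on the DBU networks via autograd); `pgd_linear` and its parameters are hypothetical names:

```python
import numpy as np

def pgd_linear(W, x, y, eps, step_size=None, steps=10):
    """PGD L-inf label attack on a toy linear softmax classifier.

    Iterated signed-gradient ascent on the cross-entropy of the true
    label y, projecting back onto the eps-ball around the original x.
    """
    step_size = step_size or eps / 4.0
    x0, x_adv = x, x.copy()
    onehot = np.eye(W.shape[0])[y]
    for _ in range(steps):
        z = W @ x_adv
        p = np.exp(z - z.max())
        p /= p.sum()                          # softmax probabilities
        grad = W.T @ (p - onehot)             # d CE / d x for a linear model
        x_adv = x_adv + step_size * np.sign(grad)
        x_adv = x0 + np.clip(x_adv - x0, -eps, eps)  # L-inf projection
    return x_adv
```

The projection step is what distinguishes PGD from repeated FGSM: every iterate stays within the ε-ball around the clean input.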

Is high certainty a reliable indicator of correct predictions?

On non-perturbed data, uncertainty estimates are an indicator of correctly classified samples, but if the input data is perturbed, none of the DBU models maintains its high performance. Thus, uncertainty estimates are not a robust indicator of correctly labeled inputs. Models obtain better results when they are attacked by FGSM attacks (Table 13), but as FGSM attacks provide much weaker adversarial examples than PGD attacks, this cannot be seen as a real advantage.

Can we use uncertainty estimates to detect attacks against the classification decision? PGD attacks do not explicitly consider uncertainty during the computation of adversarial examples, but they seem to produce perturbed inputs with similar uncertainty to the original input. FGSM and Noise attacks are easier to detect, but also weaker than PGD attacks. This suggests that DBU models are capable of detecting weak attacks by uncertainty estimation.

Are uncertainty estimates a robust feature for OOD detection? Using uncertainty estimation to distinguish between ID and OOD data is not robust, as shown in the following tables.
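The uncertainty attacks referred to throughout can be illustrated on a toy evidential model with α = exp(Wx), where a single FGSM step on the precision m_α0 lowers (ID attack) or raises (OOD attack) the model's confidence. This is a hypothetical sketch with our own names, not the paper's architecture or attack code:

```python
import numpy as np

def fgsm_uncertainty_attack(W, x, eps, increase=False):
    """Single FGSM step on the precision alpha_0 = sum_c exp(w_c . x).

    With increase=False the step lowers the model's confidence (attack on
    an ID input); with increase=True it raises confidence (attack on an
    OOD input).
    """
    alpha = np.exp(W @ x)      # Dirichlet parameters of the toy model
    grad = W.T @ alpha         # gradient of alpha_0 w.r.t. the input
    step = eps * np.sign(grad)
    return x + step if increase else x - step
```

Because the precision is differentiable in the input, the same signed-gradient machinery used for label attacks applies directly to the uncertainty measures.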



We want to highlight again that attacks are used here only to enable robust training of the models. The robustness evaluation itself operates on the original data (not attacked and thus seemingly easy), only smoothed via randomized smoothing. The verification provides us with a radius that guarantees robustness around the sample.
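The randomized-smoothing verification can be illustrated with a Monte-Carlo sketch in the style of Cohen et al. (2019). The helper `certify_smoothed` and its defaults are hypothetical, and a sound certificate would use a lower confidence bound on the vote probability rather than the raw estimate:

```python
import numpy as np
from scipy.stats import norm

def certify_smoothed(predict, x, sigma=0.25, n=1000, seed=0):
    """Monte-Carlo sketch of randomized smoothing.

    predict: maps an input array to a class index.
    Returns the majority class under Gaussian input noise and the
    certified L2 radius sigma * Phi^-1(p_hat), where p_hat is the
    empirical probability of the majority class.
    """
    rng = np.random.default_rng(seed)
    votes = np.bincount([predict(x + sigma * rng.standard_normal(x.shape))
                         for _ in range(n)])
    top = int(votes.argmax())
    # Clip p_hat away from 1 so the inverse Gaussian CDF stays finite.
    p_hat = min(votes[top] / n, 1.0 - 1.0 / n)
    radius = sigma * norm.ppf(p_hat) if p_hat > 0.5 else 0.0
    return top, radius
```

The certified radius grows with both the noise level σ and the agreement of the smoothed votes; a sample is "certifiably wrong" (cw) when the certified class differs from the desired one.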



Figure 1: Visualization of the desired uncertainty estimates.

Figure 2: Input & corr. Dir.-parameters under label attacks (dotted: threshold to distinguish ID and OOD).

Figure 3: ID and OOD input with corresponding Dirichlet-parameters under uncertainty attacks (dotted line: threshold to distinguish ID and OOD).

For the segment data set, we remove the class 'window' from the ID training data to provide OOD training data to PriorNet. Further, we remove the class 'sky' from training and instead use it as the OOD data set for OOD detection experiments. Each input is composed of 18 attributes describing the image area. The data set contains 2,310 samples in total. We further use the Sensorless Drive vector data set (Dua & Graff, 2017), where the goal is to classify extracted motor current measurements into 11 different classes. We remove class 9 from the ID training data to provide OOD training data to PriorNet. We remove classes 10 and 11 from training and use them as the OOD data set for OOD detection experiments. Each input is composed of 49 attributes describing motor behaviour. The data set contains 58,509 samples in total. Additionally, we use the MNIST image data set (LeCun & Cortes, 2010), where the goal is to classify pictures of hand-drawn digits into 10 classes (digit 0 to digit 9). Each input is a 1 × 28 × 28 tensor. The data set contains 70,000 samples. For OOD detection experiments, we use FashionMNIST (Xiao et al., 2017) and KMNIST (Clanuwat et al., 2018), containing images of clothes and Japanese characters, respectively. FashionMNIST was used as training OOD for PriorNet, while KMNIST is used as OOD at test time.

Figures 4 and 5 visualize the differential entropy distribution of ID data and OOD data under label attacks. On CIFAR10, PriorNet and DDNet can barely distinguish between clean ID and OOD data. We observe a better ID/OOD distinction for PostNet and EvNet on clean data. However, for no model do we observe an increase of the uncertainty estimates on label-attacked data. Even worse, PostNet, PriorNet and DDNet seem to assign higher confidence under class label attacks. On MNIST, the models show slightly better behavior: they are capable of assigning higher uncertainty to label attacks up to some attack radius.

Figure 4: Visualization of the differential entropy distribution of ID data (CIFAR10) and OOD data (SVHN) under label attack. The first row corresponds to no attack. The other rows correspond to increasingly stronger attack strengths.

Figures 6, 7, 8 and 9 visualize the differential entropy distribution of ID data and OOD data under uncertainty attacks. For both the CIFAR10 and MNIST data sets, we observe that the uncertainty estimates of all models can be manipulated. That is, OOD uncertainty attacks can shift the OOD uncertainty distribution towards more certain predictions, and ID uncertainty attacks can shift the ID uncertainty distribution towards less certain predictions.

Figure 7: Visualization of the differential entropy distribution of ID data (CIFAR10) and OOD data (SVHN) under ID uncertainty attack. The first row corresponds to no attack. The other rows correspond to increasingly stronger attack strengths.

Certainty based on differential entropy under PGD label attacks (AUC-PR).

Label Attack-Detection by normally trained DBU models based on differential entropy under PGD label attacks (AUC-PR).



as this wrong type (cw; certified wrong). These cw samples are worse than adversarial examples. Neither robust training based on label attacks, uncertainty attacks, nor noise-perturbed samples consistently reduces the portion of certifiably wrong samples; even worse, it seems to increase the number of cw samples. Thus, although robust training improves DBU-model resistance against label attacks (see Appendix, Table 35), ID/OOD verification shows that each model is either robust on ID data or on OOD data. Achieving robustness on both types is challenging. Our results raise the following question: How do we make DBU models robust w.r.t. class label predictions and ID/OOD differentiation without favoring either performance on ID data or on OOD data?

5 CONCLUSION

This work analyzes the robustness of uncertainty estimation by DBU models and answers multiple questions in this context. Our results show: (1) While uncertainty estimates are a good indicator for identifying correctly classified samples on unperturbed data, performance decreases drastically on perturbed data points. (2) None of the Dirichlet-based uncertainty models is able to detect PGD label attacks against the classification decision by uncertainty estimation, regardless of the uncertainty measure used. (3) Detecting OOD samples and distinguishing between ID data and OOD data is not robust. (4) Robust training based on label attacks or uncertainty attacks increases the performance of Dirichlet-based uncertainty models w.r.t. either ID data or OOD data, but achieving high robustness on both is challenging, and poses an interesting direction for future studies.


Accuracy under PGD label attacks.

Accuracy under FGSM label attacks.

Accuracy under Noise label attacks.

Certainty based on differential entropy under PGD label attacks (AUC-PR).

Certainty based on precision α 0 under PGD label attacks (AUC-PR).

Certainty based on mutual information under PGD label attacks (AUC-PR).

Certainty based on differential entropy under FGSM label attacks (AUC-PR).

Certainty based on differential entropy under Noise label attacks (AUC-PR).

Attack-Detection based on differential entropy under PGD label attacks (AUC-PR).

Attack-Detection based on precision α 0 under PGD label attacks (AUC-PR).

Attack-Detection based on mutual information under PGD label attacks (AUC-PR).

Attack-Detection based on differential entropy under FGSM label attacks (AUC-PR).

Attack-Detection based on differential entropy under Noise label attacks (AUC-PR).

OOD detection based on differential entropy under PGD uncertainty attacks against differential entropy on ID data and OOD data (AUC-PR).

OOD detection (AUC-ROC) under PGD uncertainty attacks against precision α 0 on ID data and OOD data.

OOD detection (AUC-PR) under PGD uncertainty attacks against distributional uncertainty on ID data and OOD data.

OOD detection (AUC-ROC) under PGD uncertainty attacks against distributional uncertainty on ID data and OOD data.

OOD detection (AUC-PR) under FGSM uncertainty attacks against differential entropy on ID data and OOD data.

OOD detection (AUC-PR) under Noise uncertainty attacks against differential entropy on ID data and OOD data.

ROBUST TRAINING FOR DBU MODELS & ID/OOD VERIFICATION

Tables 5 and 29 on adversarial training illustrate that there is a jump between ID-verification and OOD-verification, where robustness on ID data drops while robustness on OOD data increases. These jumps are observed for each model and each training procedure (normal, noise-based, adversarial with label attacks, adversarial with uncertainty attacks). Thus, either ID-verification or OOD-verification performs well, depending on the chosen threshold.

Randomized smoothing verification of MNIST (ID data) and KMNIST (OOD data): percentage of samples that is certifiably correct (cc) and mean certified radius (R).

Randomized smoothing verification of MNIST (ID data) and KMNIST (OOD data): percentage of samples that is certifiably wrong (cw) and mean certified radius (R).

Randomized smoothing verification of MNIST (ID data) and KMNIST (OOD data) harmonic mean.

Adversarial training with CE: Accuracy under PGD label attacks (AUC-PR).

Adversarial training with CE: Accuracy under FGSM label attacks (AUC-PR).

Randomized smoothing verification of CIFAR10: percentage of samples that is certifiably correct (cc) w.r.t. the predicted class label and mean certified radius (R) w.r.t. class labels.

Randomized smoothing verification of MNIST: percentage of samples that is certifiably correct (cc) w.r.t. the predicted class label and mean certified radius (R) w.r.t. class labels.

Adversarial training with CE: Certainty based on differential entropy under PGD label attacks (AUC-PR).

Adversarial training with CE: OOD detection based on differential entropy under PGD uncertainty attacks against differential entropy on ID data and OOD data (AUC-PR).

Adversarial training with CE: OOD detection based on differential entropy under FGSM uncertainty attacks against differential entropy on ID data and OOD data (AUC-PR).

Adversarial training with CE: OOD detection based on differential entropy under Noise uncertainty attacks against differential entropy on ID data and OOD data (AUC-PR).

Adversarial training with Diff. Ent.: Accuracy based on differential entropy under PGD label attacks (AUC-PR).

Adversarial training with Diff. Ent.: Accuracy based on differential entropy under FGSM label attacks (AUC-PR).

Adversarial training with Diff. Ent.: Attack-Detection based on differential entropy under Noise label attacks (AUC-PR).

Adversarial training with Diff. Ent.: OOD detection based on differential entropy under PGD uncertainty attacks against differential entropy on ID data and OOD data (AUC-PR).

Adversarial training with Diff. Ent.: OOD detection based on differential entropy under FGSM uncertainty attacks against differential entropy on ID data and OOD data (AUC-PR).

Adversarial training with Diff. Ent.: OOD detection based on differential entropy under Noise uncertainty attacks against differential entropy on ID data and OOD data (AUC-PR).

VISUALIZATION OF DIFFERENTIAL ENTROPY DISTRIBUTIONS ON ID DATA AND OOD DATA

The following figures visualize the differential entropy distribution on ID data and OOD data for all models with standard training. We used label attacks and uncertainty attacks on CIFAR10 and MNIST. Thus, they show how well the DBU models separate ID data from OOD data on clean and perturbed inputs.

