EVALUATING ROBUSTNESS OF PREDICTIVE UNCERTAINTY ESTIMATION: ARE DIRICHLET-BASED MODELS RELIABLE?

Abstract

Robustness to adversarial perturbations and accurate uncertainty estimation are crucial for the reliable application of deep learning in real-world settings. Dirichlet-based uncertainty (DBU) models are a family of models that predict the parameters of a Dirichlet distribution (instead of a categorical one) and promise to signal when not to trust their predictions. On unknown or ambiguous samples, these models are expected to mark their predictions as untrustworthy by assigning high uncertainty. In this work, we show that DBU models with standard training are not robust w.r.t. three important tasks in the field of uncertainty estimation. First, we evaluate how useful the uncertainty estimates are to (1) indicate correctly classified samples. Our results show that while they are a good indicator on unperturbed data, performance decreases dramatically on perturbed data. (2) We evaluate whether uncertainty estimates can detect adversarial examples that try to fool the classification. It turns out that uncertainty estimates are able to detect FGSM attacks but not PGD attacks. We further evaluate the reliability of DBU models on the task of (3) distinguishing between in-distribution (ID) and out-of-distribution (OOD) data. To this end, we present the first study of certifiable robustness for DBU models. Furthermore, we propose novel uncertainty attacks that fool models into assigning high confidence to OOD data and low confidence to ID data, respectively. Both approaches show that detecting OOD samples and distinguishing between ID data and OOD data is not robust. Based on our results, we explore the first approaches to make DBU models more robust. We use adversarial training procedures based on label attacks, uncertainty attacks, or random noise and demonstrate how they affect the robustness of DBU models on ID data and OOD data.

Recently, multiple works have analyzed uncertainty estimation and robustness of neural networks. Snoek et al. (2019) compare uncertainty estimates of models based on drop-out and ensembles under dataset shifts. Cardelli et al. (2019) and Wicker et al. (2020) study probabilistic safety of Bayesian neural networks under adversarial perturbations by analyzing input sets and the corresponding mappings.

1. INTRODUCTION

Neural networks achieve high predictive accuracy in many tasks, but they are known to have two substantial weaknesses. First, neural networks are not robust against adversarial perturbations, i.e., semantically meaningless input changes that lead to wrong predictions (Szegedy et al., 2014; Goodfellow et al., 2015). Second, neural networks tend to make over-confident predictions at test time (Lakshminarayanan et al., 2017). Even worse, standard neural networks are unable to identify samples that differ from those they were trained on; in these cases, they provide uninformed decisions instead of abstaining. These two weaknesses make them impractical in sensitive domains such as finance, autonomous driving, or medicine, which require trust in predictions.

To increase trust in neural networks, models that provide predictions along with the corresponding uncertainty have been proposed. There are three main families of models that aim to provide meaningful estimates of their predictive uncertainty. The first family are Bayesian neural networks (Blundell et al., 2015; Osawa et al., 2019; Maddox et al., 2019), which have the drawback of being computationally demanding. The second family consists of Monte-Carlo drop-out-based models (Gal & Ghahramani, 2016) and ensembles (Lakshminarayanan et al., 2017), which estimate uncertainty by computing statistics such as mean and variance over the forward passes of multiple models. A disadvantage of all of these models is that uncertainty estimation at inference time is expensive. In contrast, the recently growing family of Dirichlet-based uncertainty (DBU) models (Malinin & Gales, 2018a; 2019; Sensoy et al., 2018; Malinin et al., 2019; Charpentier et al., 2020) directly predicts the parameters of a Dirichlet distribution over categorical probability distributions. These models provide efficient uncertainty estimates at test time since they only require a single forward pass.
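To make the single-forward-pass idea concrete, the following minimal numpy sketch shows a toy DBU output layer. The exp parametrisation, the weight shapes, and the random inputs are illustrative assumptions, not the architecture of any specific published DBU model:

```python
import numpy as np

def dbu_forward(x, W, b):
    """Single forward pass of a toy Dirichlet-based uncertainty model.

    Instead of a softmax over logits, the last layer outputs one logit
    per class and alpha = exp(logit) gives strictly positive Dirichlet
    concentration parameters (an illustrative parametrisation).
    """
    logits = x @ W + b
    alpha = np.exp(logits)                  # concentration parameters, all > 0
    alpha0 = alpha.sum(axis=-1, keepdims=True)
    mean_probs = alpha / alpha0             # expected categorical distribution
    return alpha, mean_probs

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 8))                 # batch of 4 inputs, 8 features
W, b = rng.normal(size=(8, 3)), np.zeros(3)
alpha, probs = dbu_forward(x, W, b)         # one pass yields the prediction
                                            # and, via alpha, its uncertainty
```

The same alpha that determines the predicted class probabilities also parametrises the Dirichlet from which uncertainty measures are computed, which is why no sampling or repeated forward passes are needed.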
DBU models bring the benefit of providing both aleatoric and epistemic uncertainty estimates. Aleatoric uncertainty is irreducible and caused by the natural complexity of the data, such as class overlap or noise. Epistemic uncertainty results from a lack of knowledge about unseen data, e.g., when the model is presented with an image of an unknown object. Both uncertainty types can be quantified using different uncertainty measures based on the Dirichlet distribution, such as differential entropy, mutual information, or pseudo-counts (Malinin & Gales, 2018a; Charpentier et al., 2020). These uncertainty measures have shown outstanding performance in, e.g., the detection of OOD samples and are thus superior to softmax-based confidence (Malinin & Gales, 2019; Charpentier et al., 2020). Neural networks from the families outlined above are expected to know what they don't know, i.e., to notice when they are unsure about a prediction. This raises questions with regard to adversarial examples: should uncertainty estimates detect these corrupted samples and abstain from making a prediction (indicated by high uncertainty in the prediction), or should they be robust to adversarial examples and produce the correct output even under perturbations? Using humans as the gold standard of image classification and assuming that the perturbations are semantically meaningless, which is typically implied by a small L_p norm of the corruption, we argue that the best option is for models to be robust to adversarial perturbations (see Figure 1). Beyond being robust w.r.t. the label prediction, we expect models to robustly know what they do not know, i.e., to distinguish between ID and OOD data even if those are perturbed. In this work, we focus on DBU models and analyze their robustness w.r.t. the classification decision and uncertainty estimation, going beyond simple softmax output confidence by investigating advanced measures such as differential entropy.
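The measures named above have closed forms in the Dirichlet parameters. The following scipy sketch computes the pseudo-count, the differential entropy of the Dirichlet, and the mutual information between labels and categorical parameters; the example concentration vectors are illustrative:

```python
import numpy as np
from scipy.special import digamma, gammaln

def dirichlet_uncertainties(alpha):
    """Standard closed-form uncertainty measures for Dir(alpha).

    alpha is a 1-D array of concentration parameters; the formulas
    follow the usual Dirichlet identities.
    """
    alpha = np.asarray(alpha, dtype=float)
    a0 = alpha.sum()                    # pseudo-count / precision
    p = alpha / a0                      # expected class probabilities

    # Differential entropy of Dir(alpha):
    # ln B(alpha) + (a0 - K) psi(a0) - sum_k (alpha_k - 1) psi(alpha_k)
    log_B = gammaln(alpha).sum() - gammaln(a0)
    diff_ent = (log_B
                + (a0 - alpha.size) * digamma(a0)
                - ((alpha - 1.0) * digamma(alpha)).sum())

    # Mutual information = entropy of the mean minus expected entropy
    expected_ent = -(p * (digamma(alpha + 1.0) - digamma(a0 + 1.0))).sum()
    total_ent = -(p * np.log(p)).sum()
    mutual_info = total_ent - expected_ent

    return {"pseudo_count": a0, "diff_entropy": diff_ent,
            "mutual_info": mutual_info}

# A peaked Dirichlet (confident ID prediction) vs. a flat one (OOD-like):
confident = dirichlet_uncertainties([100.0, 1.0, 1.0])
flat = dirichlet_uncertainties([1.0, 1.0, 1.0])
```

A confident prediction yields a low differential entropy and low mutual information, while the flat Dirichlet scores high on both, which is what makes these quantities usable as OOD scores.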
Specifically, we study the following questions:
1. Is high certainty a reliable indicator of correct predictions?
2. Can we use uncertainty estimates to detect label attacks on the classification decision?
3. Are uncertainty estimates such as differential entropy a robust feature for OOD detection?

In addressing these questions, we place particular focus on adversarial perturbations of the input in order to evaluate the worst-case performance of the models. We address question one by analyzing uncertainty estimation on correctly and wrongly labeled samples, with and without adversarial perturbations of the inputs. To answer question two, we study the uncertainty estimates of DBU models under label attacks. More specifically, we analyze whether there is a difference between uncertainty estimates on perturbed and unperturbed inputs and whether DBU models are capable of recognizing successful label attacks via uncertainty estimation. To address question three, we use robustness verification based on randomized smoothing and propose uncertainty attacks. Uncertainty attacks aim at changing the uncertainty estimate such that ID data is marked as OOD data and vice versa. Finally, we propose robust training procedures that use label attacks, uncertainty attacks, or random noise and analyze how they affect the robustness of DBU models on ID data and OOD data.
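The notion of an uncertainty attack can be sketched as a PGD-style ascent on an uncertainty measure: perturb the input within an L_inf ball so that the predicted Dirichlet becomes flatter, making an ID input look like OOD. The toy model, the finite-difference gradient (used to keep the sketch framework-free), and all hyperparameters below are illustrative assumptions, not the attack implementation evaluated in this paper:

```python
import numpy as np
from scipy.special import digamma, gammaln

def diff_entropy(alpha):
    """Differential entropy of Dir(alpha) for a 1-D parameter vector."""
    a0 = alpha.sum()
    return (gammaln(alpha).sum() - gammaln(a0)
            + (a0 - alpha.size) * digamma(a0)
            - ((alpha - 1.0) * digamma(alpha)).sum())

def uncertainty_attack(x, model, eps=0.3, step=0.05, iters=20):
    """PGD-style uncertainty attack (sketch): maximize differential
    entropy of the predicted Dirichlet within an L_inf ball of radius eps."""
    x_adv = x.copy()
    h = 1e-4
    for _ in range(iters):
        grad = np.zeros_like(x_adv)
        for i in range(x_adv.size):         # finite-difference gradient
            e = np.zeros_like(x_adv)
            e[i] = h
            grad[i] = (diff_entropy(model(x_adv + e))
                       - diff_entropy(model(x_adv - e))) / (2 * h)
        x_adv = x_adv + step * np.sign(grad)        # signed ascent step
        x_adv = x + np.clip(x_adv - x, -eps, eps)   # project onto the ball
    return x_adv

# Toy DBU model: alpha(x) = exp(W x); W and x are random illustrations.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
model = lambda x: np.exp(W @ x)
x = rng.normal(size=5)
x_adv = uncertainty_attack(x, model)
```

The reverse direction (making OOD data look confident, i.e., minimizing the measure) is obtained by descending instead of ascending, and other measures such as mutual information can be substituted for the differential entropy.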



Figure 1: Visualization of the desired uncertainty estimates.

