PAC CONFIDENCE PREDICTIONS FOR DEEP NEURAL NETWORK CLASSIFIERS

Abstract

A key challenge for deploying deep neural networks (DNNs) in safety-critical settings is the need for rigorous ways to quantify their uncertainty. In this paper, we propose a novel algorithm that constructs predicted classification confidences for DNNs and comes with provable correctness guarantees. Our approach uses Clopper-Pearson confidence intervals for the Binomial distribution in conjunction with the histogram binning approach to calibrated prediction. In addition, we demonstrate how our predicted confidences can be used to enable downstream guarantees in two settings: (i) fast DNN inference, where we demonstrate how to compose a fast but inaccurate DNN with an accurate but slow DNN in a rigorous way to improve performance without sacrificing accuracy, and (ii) safe planning, where we guarantee safety when using a DNN to predict whether a given action is safe based on visual observations. In our experiments, we demonstrate that our approach can be used to provide guarantees for state-of-the-art DNNs.

1. INTRODUCTION

Due to the recent success of machine learning, there has been increasing interest in using predictive models such as deep neural networks (DNNs) in safety-critical settings, such as robotics (e.g., obstacle detection (Ren et al., 2015) and forecasting (Kitani et al., 2012)) and healthcare (e.g., diagnosis (Gulshan et al., 2016; Esteva et al., 2017) and patient care management (Liao et al., 2020)). One of the key challenges is the need to provide guarantees on the safety or performance of DNNs used in these settings. Some potential for failure is unavoidable, since DNNs will inevitably make mistakes in their predictions. Instead, our goal is to design tools for quantifying the uncertainty of these models; the overall system can then estimate and account for the risk inherent in using their predictions. For instance, a medical decision-making system may want to fall back on a doctor when it is uncertain whether its diagnosis is correct, and a robot may want to stop moving and ask a human for help if it is uncertain whether it can act safely. Uncertainty estimates can also be useful for human decision-makers, e.g., for a doctor deciding whether to trust their intuition over the predicted diagnosis. While many DNNs provide confidences in their predictions, especially in the classification setting, these confidences are often overconfident. This phenomenon most likely arises because DNNs are designed to overfit the training data (e.g., to avoid local minima (Safran & Shamir, 2018)), which results in the predicted probabilities on the training data being very close to one for the correct prediction. Recent work has demonstrated how to calibrate the confidences to significantly reduce overconfidence (Guo et al., 2017). Intuitively, these techniques rescale the confidences on a held-out calibration set. Because they fit only a small number of parameters, they do not overfit the data the way the original DNN training does.
However, these techniques do not provide theoretical guarantees on their correctness, which can be necessary in safety-critical settings. We propose a novel algorithm for calibrated prediction in the classification setting that provides theoretical guarantees on the predicted confidences. We focus on on-distribution guarantees, i.e., where the test distribution is the same as the training distribution. In this setting, we can build on ideas from statistical learning theory to provide probably approximately correct (PAC) guarantees (Valiant, 1984). Our approach is based on a calibrated prediction technique called histogram binning (Zadrozny & Elkan, 2001), which rescales the confidences by binning them and then rescaling each bin independently. We use Clopper-Pearson bounds on the tails of the binomial distribution to obtain PAC upper/lower bounds on the predicted confidences. Next, we study how our approach enables theoretical guarantees in two applications. First, we consider the problem of speeding up DNN inference by composing a fast but inaccurate model with a slow but accurate model, i.e., by using the accurate model for inference only when the fast model is insufficiently confident (Teerapittayanon et al., 2016). We use our algorithm to obtain guarantees on the accuracy of the composed model. Second, for safe planning, we consider using a DNN to predict whether or not a given action (e.g., move forward) is safe (e.g., does not run into obstacles) given an observation (e.g., a camera image). The robot only continues to act if the predicted confidence is above some threshold. We use our algorithm to ensure safety with high probability. Finally, we evaluate the efficacy of our approach in the context of these applications.

Related work. Calibrated prediction (Murphy, 1972; DeGroot & Fienberg, 1983; Platt, 1999) has recently gained attention as a way to improve DNN confidences (Guo et al., 2017).
Histogram binning is a non-parametric approach that sorts the data into finitely many bins and rescales the confidences per bin (Zadrozny & Elkan, 2001; 2002; Naeini et al., 2015). However, traditional approaches do not provide theoretical guarantees on the predicted confidences. There has been work on predicting confidence sets (i.e., predicting a set of labels instead of a single label) with theoretical guarantees (Park et al., 2020a), but this approach does not provide the confidence of the most likely prediction, as is often desired. There has also been work providing guarantees on the overall calibration error (Kumar et al., 2019), but this approach does not provide per-prediction guarantees. There has also been work on speeding up DNN inference (Hinton et al., 2015). One approach is to allow intermediate layers to be dynamically skipped (Teerapittayanon et al., 2016; Figurnov et al., 2017; Wang et al., 2018), which can be thought of as composing multiple models that share a backbone. Unlike our approach, these methods do not provide guarantees on the accuracy of the composed model. There has also been work on safe learning-based control (Akametalu et al., 2014; Fisac et al., 2019; Bastani, 2019; Li & Bastani, 2020; Wabersich & Zeilinger, 2018; Alshiekh et al., 2018); however, these approaches are not applicable to perception-based control. The most closely related work is Dean et al. (2019), which handles perception, but is restricted to known linear dynamics.

2. PAC CONFIDENCE PREDICTION

In this section, we begin by formalizing the PAC confidence coverage prediction problem; then, we describe our algorithm for solving this problem based on histogram binning.

Calibrated prediction. Let $x \in \mathcal{X}$ be an example and $y \in \mathcal{Y}$ a label from a finite label set, and let $D$ be a distribution over $\mathcal{X} \times \mathcal{Y}$. A confidence predictor is a model $f : \mathcal{X} \to \mathcal{P}_{\mathcal{Y}}$, where $\mathcal{P}_{\mathcal{Y}}$ is the space of probability distributions over labels. In particular, $f(x)_y$ is the predicted confidence that the true label for $x$ is $y$. We let $\hat{y} : \mathcal{X} \to \mathcal{Y}$ be the corresponding label predictor, i.e., $\hat{y}(x) := \arg\max_{y \in \mathcal{Y}} f(x)_y$, and let $\hat{p} : \mathcal{X} \to \mathbb{R}_{\ge 0}$ be the corresponding top-label confidence predictor, i.e., $\hat{p}(x) := \max_{y \in \mathcal{Y}} f(x)_y$. While traditional DNN classifiers are confidence predictors, a naively trained DNN is not reliable, i.e., its predicted confidences do not match the true confidences; recent work has studied heuristics for improving reliability (Guo et al., 2017). In contrast, our goal is to construct a confidence predictor that comes with theoretical guarantees. We first introduce the definition of calibration (DeGroot & Fienberg, 1983; Zadrozny & Elkan, 2002; Park et al., 2020b), i.e., what we mean for a predicted confidence to be "correct". In many cases, the main quantity of interest is the confidence of the top prediction. Thus, we focus on ensuring that the top-label predicted confidence $\hat{p}(x)$ is calibrated (Guo et al., 2017); our approach can easily be extended to providing guarantees on all confidences predicted using $f$. Then, we say a confidence predictor $f$ is well-calibrated with respect to a distribution $D$ if
$$\mathbb{P}_{(x,y)\sim D}\left[y = \hat{y}(x) \mid \hat{p}(x) = t\right] = t \qquad (\forall t \in [0,1]).$$
That is, among all examples $x$ such that the label prediction $\hat{y}(x)$ has predicted confidence $t = \hat{p}(x)$, $\hat{y}(x)$ is the correct label for exactly a $t$ fraction of these examples. Using a change of variables (Park et al., 2020b), $f$ being well-calibrated is equivalent to
$$\hat{p}(x) = c_f^*(x) := \mathbb{P}_{(x',y')\sim D}\left[y' = \hat{y}(x') \mid \hat{p}(x') = \hat{p}(x)\right] \qquad (\forall x \in \mathcal{X}).$$
(1)

Then, the goal of well-calibration is to make $\hat{p}$ equal to $c_f^*$. Note that $f$ appears on both sides of the equation $\hat{p}(x) = c_f^*(x)$ (implicitly in $\hat{p}$), which is what makes it challenging to satisfy. Indeed, in general, it is unlikely that (1) holds exactly for all $x$. Instead, based on the idea of histogram binning (Zadrozny & Elkan, 2001), we consider a variant where we partition the data into a fixed number of bins and then construct confidence coverages separately for each bin. In particular, consider $K$ bins $B_1, \ldots, B_K \subseteq [0,1]$, where $B_1 = [0, b_1]$ and $B_k = (b_{k-1}, b_k]$ for $k > 1$. Here, $K$ and $0 \le b_1 \le \cdots \le b_{K-1} \le b_K = 1$ are hyperparameters. Given $f$, we let $\kappa_f : \mathcal{X} \to \{1, \ldots, K\}$ denote the index of the bin that contains $\hat{p}(x)$, i.e., $\hat{p}(x) \in B_{\kappa_f(x)}$.

Definition 1. We say $f$ is well-calibrated for a distribution $D$ and bins $B_1, \ldots, B_K$ if
$$\hat{p}(x) = c_f(x) := \mathbb{P}_{(x',y')\sim D}\left[y' = \hat{y}(x') \mid \hat{p}(x') \in B_{\kappa_f(x)}\right] \qquad (\forall x \in \mathcal{X}), \tag{2}$$
where we refer to $c_f(x)$ as the true confidence.

Intuitively, this definition "coarsens" the calibration problem across the bins, i.e., rather than sorting the inputs $x$ into a continuum of "bins" $\hat{p}(x) = t$, we sort them into a finite number of bins.

Problem formulation. We formalize the problem of PAC calibration. We focus on the setting where the training and test distributions are identical; e.g., we cannot handle adversarial examples or shifts in the covariate distribution (as are common in reinforcement learning). Importantly, while we assume a pre-trained confidence predictor $f$ is given, we make no assumptions about $f$; e.g., it can be uncalibrated or heuristically calibrated. If $f$ performs poorly, then the predicted confidences will be close to $1/|\mathcal{Y}|$, i.e., express no confidence in the predictions. Thus, it is fine if $f$ is poorly calibrated; the important property is that inputs binned together by $f$ have similar true confidences. The challenge in formalizing PAC calibration is the quantification over all $x$ in (2).
One approach is to provide guarantees in expectation over $x$ (Kumar et al., 2019); however, this approach does not provide guarantees for individual predictions. Instead, our goal is to find a set of predicted confidences that includes the true confidence with high probability. Of course, we could simply predict the interval $[0, 1]$, which always contains the true confidence; thus, we simultaneously want to make the size of the interval small. To this end, we consider a confidence coverage predictor $\hat{C} : \mathcal{X} \to 2^{\mathbb{R}}$, where $c_f(x) \in \hat{C}(x)$ with high probability. In particular, $\hat{C}(x)$ outputs an interval $[\underline{c}, \overline{c}] \subseteq \mathbb{R}$, where $\underline{c} \le \overline{c}$, rather than an arbitrary set. We only consider a single interval (rather than disjoint intervals) since one interval suffices to localize the true confidence $c_f$. We are interested in providing theoretical guarantees for an algorithm used to construct the confidence coverage predictor $\hat{C}$ given a held-out calibration set $Z \subseteq \mathcal{X} \times \mathcal{Y}$. In addition, we assume the algorithm is given a pretrained confidence predictor $f$. Thus, we consider $\hat{C}$ as depending on $Z$ and $f$, which we denote by $\hat{C}(\cdot; f, Z)$. Then, we want $\hat{C}$ to satisfy the following guarantee:

Definition 2. Given $\delta \in \mathbb{R}_{>0}$ and $n \in \mathbb{N}$, $\hat{C}$ is probably approximately correct (PAC) if, for any $D$,
$$\mathbb{P}_{Z \sim D^n}\left[\bigwedge_{x \in \mathcal{X}} c_f(x) \in \hat{C}(x; f, Z)\right] \ge 1 - \delta. \tag{3}$$
Note that $c_f$ depends on $D$. Here, "approximately correct" technically refers to the mean of $\hat{C}(x; f, Z)$, which is an estimate of $c_f(x)$; the interval captures the bound on the error of this estimate; see Appendix A for details. Furthermore, the conjunction over all $x \in \mathcal{X}$ may seem strong. We can obtain such a guarantee due to our binning strategy: the property $c_f(x) \in \hat{C}(x; f, Z)$ only depends on the bin $B_{\kappa_f(x)}$, so the conjunction is really only over bins $k \in \{1, \ldots, K\}$.

Algorithm. We propose a confidence coverage predictor that satisfies the PAC property.
The problem of estimating the confidence interval $\hat{C}(x)$ of the binned true confidence $c_f(x)$ is closely related to binomial proportion confidence interval estimation. Consider a Bernoulli random variable $b \sim B := \text{Bernoulli}(\theta)$ for some unknown $\theta \in [0, 1]$, where $b = 1$ denotes a success and $b = 0$ denotes a failure. Given a sequence of observations $b_{1:n} := (b_1, \ldots, b_n) \sim B^n$, the goal is to construct an interval $\Theta(b_{1:n}) \subseteq \mathbb{R}$ that includes $\theta$ with high probability, i.e.,
$$\mathbb{P}_{b_{1:n} \sim B^n}\left[\theta \in \Theta(b_{1:n})\right] \ge 1 - \alpha, \tag{4}$$
where $\alpha \in \mathbb{R}_{>0}$ is a given confidence level. In particular, the Clopper-Pearson interval
$$\bar{\Theta}_{\mathrm{CP}}(b_{1:n}; \alpha) := \left[\inf\left\{\theta \,\middle|\, \mathbb{P}_\theta[S \ge s] \ge \frac{\alpha}{2}\right\},\ \sup\left\{\theta \,\middle|\, \mathbb{P}_\theta[S \le s] \ge \frac{\alpha}{2}\right\}\right]$$
guarantees (4) (Clopper & Pearson, 1934; Brown et al., 2001), where $s = \sum_{i=1}^n b_i$ is the number of observed successes, $n$ is the number of observations, and $S \sim \text{Binomial}(n, \theta)$. Intuitively, the interval is constructed such that the number of observed successes falls in the high-probability region for any $\theta$ in the interval. The following expression is equivalent due to the relationship between the Binomial and Beta distributions (Hartley & Fitch, 1951; Brown et al., 2001), i.e., $\mathbb{P}_\theta[S \ge s] = I_\theta(s, n - s + 1)$, where $I_\theta$ is the CDF of $\text{Beta}(s, n - s + 1)$:
$$\bar{\Theta}_{\mathrm{CP}}(b_{1:n}; \alpha) = \left[\frac{\alpha}{2}\text{ quantile of }\text{Beta}(s, n - s + 1),\ \left(1 - \frac{\alpha}{2}\right)\text{ quantile of }\text{Beta}(s + 1, n - s)\right].$$
Now, for each of the $K$ bins, we apply $\bar{\Theta}_{\mathrm{CP}}$ with confidence level $\alpha = \delta/K$, i.e.,
$$\hat{C}(x; f, Z, \delta) := \bar{\Theta}_{\mathrm{CP}}\left(W_{\kappa_f(x)}; \frac{\delta}{K}\right) \quad \text{where} \quad W_k := \left\{\mathbb{1}(\hat{y}(x) = y) \,\middle|\, (x, y) \in Z \text{ s.t. } \kappa_f(x) = k\right\}.$$
Here, $W_k$ is the set of success/failure observations corresponding to the subset of labeled examples $(x, y) \in Z$ such that $\hat{p}(x)$ falls in the bin $B_k$, where a success is defined to be a correct prediction $\hat{y}(x) = y$. We note that for efficiency, the confidence interval for each of the $K$ bins can be precomputed.
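To make the Beta-quantile form concrete, the interval can be computed in a few lines. The following is an illustrative sketch of ours (the function name and the edge-case conventions for $s = 0$ and $s = n$ are our own choices), using `scipy.stats.beta`:

```python
from scipy.stats import beta

def clopper_pearson(s, n, alpha):
    """Exact two-sided Clopper-Pearson 1 - alpha interval for a binomial
    proportion, via the Beta-quantile form: the lower endpoint is the
    alpha/2 quantile of Beta(s, n - s + 1), and the upper endpoint is the
    1 - alpha/2 quantile of Beta(s + 1, n - s)."""
    lo = beta.ppf(alpha / 2, s, n - s + 1) if s > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, s + 1, n - s) if s < n else 1.0
    return lo, hi
```

For example, with $s = 5$ successes out of $n = 10$ at $\alpha = 0.05$, this yields an interval of roughly $(0.19, 0.81)$, noticeably wider than the point estimate $0.5$ alone would suggest.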
Our construction of $\hat{C}$ satisfies the following guarantee; see Appendix B for a proof:

Theorem 1. Our confidence coverage predictor $\hat{C}$ is PAC for any $\delta \in \mathbb{R}_{>0}$ and $n \in \mathbb{N}$.

Note that Clopper-Pearson intervals are exact, which keeps the size of $\hat{C}$ for each bin small in practice. Finally, an important special case is when there is a single bin $B = [0, 1]$, i.e.,
$$\hat{C}_0(x; f, Z', \delta) := \bar{\Theta}_{\mathrm{CP}}(W'; \delta) \quad \text{where} \quad W' := \left\{\mathbb{1}(\hat{y}(x') = y') \,\middle|\, (x', y') \in Z'\right\}.$$
Note that $\hat{C}_0$ does not depend on $x$, so we drop it, i.e., $\hat{C}_0(f, Z', \delta) := \bar{\Theta}_{\mathrm{CP}}(W'; \delta)$; in other words, $\hat{C}_0$ computes the Clopper-Pearson interval over $Z'$, which is a subset of the original calibration set $Z$.
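The full construction of $\hat{C}$ then amounts to precomputing one Clopper-Pearson interval per bin at level $\delta/K$. Below is a minimal sketch under our own naming conventions (it assumes right bin edges $b_1 \le \cdots \le b_K = 1$ as in the text, and empty bins receive the trivial interval $[0, 1]$):

```python
import numpy as np
from scipy.stats import beta

def clopper_pearson(s, n, alpha):
    # Exact binomial interval via Beta quantiles (see above).
    lo = beta.ppf(alpha / 2, s, n - s + 1) if s > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, s + 1, n - s) if s < n else 1.0
    return lo, hi

def fit_pac_bins(p_hat, correct, boundaries, delta):
    """Precompute one interval per bin at level delta / K.
    p_hat: top-label confidences on the calibration set Z;
    correct: indicators 1(y_hat(x) == y);
    boundaries: right bin edges b_1 <= ... <= b_K = 1."""
    K = len(boundaries)
    # kappa_f(x): index of the bin containing p_hat(x).
    bins = np.searchsorted(boundaries, p_hat, side="left")
    intervals = []
    for k in range(K):
        w = correct[bins == k]  # W_k: success/failure observations in bin k
        intervals.append(clopper_pearson(int(w.sum()), len(w), delta / K))
    return intervals
```

At test time, $\hat{C}(x)$ is then a table lookup: compute $\kappa_f(x)$ with the same `searchsorted` call and return the precomputed interval for that bin.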

3. APPLICATION TO FAST DNN INFERENCE

A key application of predicted confidences is model composition, which improves the running time of DNNs without sacrificing accuracy. The idea is to use a fast but inaccurate model when it is confident in its prediction, and to switch to an accurate but slow model otherwise (Bolukbasi et al., 2017); we refer to the combination as the composed model. To further improve performance, we can have the two models share a backbone, i.e., the fast model shares the first few layers of the slow model (Teerapittayanon et al., 2016). We refer to the decision of whether to skip the slow model as the exit condition; our goal is then to construct confidence thresholds for the exit conditions in a way that provides theoretical guarantees on the overall accuracy.

Problem setup. The early-exit approach to fast DNN inference can be formalized as a sequence of branching classifiers organized in a cascading way, i.e.,
$$\hat{y}_C(x; \gamma_{1:M-1}) := \begin{cases} \hat{y}_m(x) & \text{if } \bigwedge_{i=1}^{m-1}(\hat{p}_i(x) < \gamma_i) \wedge (\hat{p}_m(x) \ge \gamma_m) \quad (\forall m \in \{1, \ldots, M-1\}) \\ \hat{y}_M(x) & \text{otherwise}, \end{cases}$$
where $M$ is the number of branches, $\hat{f}_m$ is the confidence predictor of the $m$th branch, and $\hat{y}_m$ and $\hat{p}_m$ are the associated label and top-label confidence predictors, respectively. For conciseness, we denote the exit condition of the $m$th branch by
$$d_m(x; \gamma_{1:M-1}) := \mathbb{1}\left(\bigwedge_{i=1}^{m-1}(\hat{p}_i(x) < \gamma_i) \wedge (\hat{p}_m(x) \ge \gamma_m)\right)$$
with thresholds $\gamma_1, \ldots, \gamma_{M-1} \in [0, 1]$. The $\hat{f}_m$ share a backbone and are trained in the standard way; see Appendix F.4 for details. Figure 1 illustrates the composed model for $M = 4$; the gray area represents the shared backbone. We refer to an overall model composed in this way as a cascading classifier.

Desired error bound. Given $\xi \in \mathbb{R}_{>0}$, our goal is to choose $\gamma_{1:M-1} := (\gamma_1, \ldots, \gamma_{M-1})$ so that the error difference between the cascading classifier $\hat{y}_C$ and the slow classifier $\hat{y}_M$ is at most $\xi$, i.e.,
$$p_{\mathrm{err}} := \mathbb{P}_{(x,y)\sim D}\left[\hat{y}_C(x) \ne y\right] - \mathbb{P}_{(x,y)\sim D}\left[\hat{y}_M(x) \ne y\right] \le \xi.$$
(5)

To obtain the desired error bound, an example $x$ should exit at the $m$th branch only if $\hat{y}_m$ is likely to classify $x$ correctly, allowing for at most a $\xi$ fraction of additional errors in total. Intuitively, if the confidence of $\hat{y}_m$ on $x$ is sufficiently high, then $\hat{y}_m(x) = y$ with high probability. In this case, $\hat{y}_M$ either correctly classifies or misclassifies the same example; if the example is misclassified by $\hat{y}_M$, exiting early decreases $p_{\mathrm{err}}$; otherwise, we have $\hat{y}_m(x) = y = \hat{y}_M(x)$ with high probability, which maintains $p_{\mathrm{err}}$.

Fast inference. To minimize running time, we prefer to allow higher error rates at the lower branches, i.e., we want to choose $\gamma_m$ as small as possible at lower branches $m$.

Algorithm. Our algorithm takes the prediction branches $\hat{f}_m$ (for $m \in \{1, \ldots, M\}$), the desired relative error $\xi \in [0, 1]$, a confidence level $\delta \in [0, 1]$, and a calibration set $Z \subseteq \mathcal{X} \times \mathcal{Y}$, and outputs $\gamma_{1:M-1}$ such that (5) holds with probability at least $1 - \delta$. It iteratively chooses the thresholds from $\gamma_1$ to $\gamma_{M-1}$; at each step, it chooses $\gamma_m$ as small as possible subject to $p_{\mathrm{err}} \le \xi$. Note that $\gamma_m$ implicitly appears in $p_{\mathrm{err}}$ in the constraint due to the dependence of $d_m(x)$ on $\gamma_m$. The challenge is enforcing the constraint, since we cannot evaluate it. To this end, let
$$\bar{e}_m := \mathbb{P}_{(x,y)\sim D}\left[\hat{y}_m(x) \ne y \wedge \hat{y}_m(x) \ne \hat{y}_M(x) \wedge d_m(x) = 1\right]$$
$$\underline{e}_m := \mathbb{P}_{(x,y)\sim D}\left[\hat{y}_M(x) \ne y \wedge \hat{y}_m(x) \ne \hat{y}_M(x) \wedge d_m(x) = 1\right];$$
then, it is possible to show that $p_{\mathrm{err}} = \sum_{m=1}^{M-1} (\bar{e}_m - \underline{e}_m)$ (see the proof of Theorem 2 in Appendix C). Then, we can compute bounds on $\bar{e}_m$ and $\underline{e}_m$ using the following:
$$\mathbb{P}\left[\hat{y}_m(x) \ne y \,\middle|\, \hat{y}_m(x) \ne \hat{y}_M(x) \wedge d_m(x) = 1\right] \in [\underline{c}_m, \bar{c}_m] := \hat{C}_0\left(\hat{f}_m, Z_m, \tfrac{\delta}{3(M-1)}\right)$$
$$\mathbb{P}\left[\hat{y}_M(x) \ne y \,\middle|\, \hat{y}_m(x) \ne \hat{y}_M(x) \wedge d_m(x) = 1\right] \in [\underline{c}'_m, \bar{c}'_m] := \hat{C}_0\left(\hat{f}_M, Z_m, \tfrac{\delta}{3(M-1)}\right)$$
$$\mathbb{P}\left[\hat{y}_m(x) \ne \hat{y}_M(x) \wedge d_m(x) = 1\right] \in [\underline{r}_m, \bar{r}_m] := \bar{\Theta}_{\mathrm{CP}}\left(W_m; \tfrac{\delta}{3(M-1)}\right),$$
where
$$Z_m := \left\{(x, y) \in Z \,\middle|\, \hat{y}_m(x) \ne \hat{y}_M(x) \wedge d_m(x) = 1\right\}$$
$$W_m := \left\{\mathbb{1}(\hat{y}_m(x) \ne \hat{y}_M(x) \wedge d_m(x) = 1) \,\middle|\, (x, y) \in Z\right\}.$$
Thus, we have $\bar{e}_m \le \bar{c}_m \bar{r}_m$ and $\underline{e}_m \ge \underline{c}'_m \underline{r}_m$, in which case it suffices to sequentially solve
$$\gamma_m = \arg\min_{\gamma \in [0,1]} \gamma \quad \text{subj. to} \quad \sum_{i=1}^{m} \left(\bar{c}_i \bar{r}_i - \underline{c}'_i \underline{r}_i\right) \le \xi. \tag{6}$$
Here, $\bar{c}_m$, $\bar{r}_m$, $\underline{c}'_m$, and $\underline{r}_m$ are implicitly functions of $\gamma$, which we can optimize using line search. We have the following guarantee; see Appendix C for a proof:

Theorem 2. We have $p_{\mathrm{err}} \le \xi$ with probability at least $1 - \delta$ over $Z \sim D^n$.

Moreover, the proposed greedy algorithm (6) is actually optimal in terms of inference speed when $M = 2$. Intuitively, we are always better off in terms of inference time by classifying more examples using the faster model. In particular, we have the following theorem; see Appendix D for a proof:

Theorem 3. If $M = 2$, $\gamma_1^*$ minimizes (6), and the classifiers $\hat{y}_m$ are faster for smaller $m$, then the resulting $\hat{y}_C$ is the fastest cascading classifier among cascading classifiers satisfying $p_{\mathrm{err}} \le \xi$.
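To make the composition concrete for $M = 2$, the following sketch implements the cascade and the line search in (6); the function names, the threshold grid, and the brute-force search are our own illustrative choices (not the authors' implementation), and it assumes precomputed branch outputs on the calibration set:

```python
import numpy as np
from scipy.stats import beta

def clopper_pearson(s, n, alpha):
    # Exact binomial interval via Beta quantiles (Section 2).
    lo = beta.ppf(alpha / 2, s, n - s + 1) if s > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, s + 1, n - s) if s < n else 1.0
    return lo, hi

def cascade_predict(probs_fast, probs_slow, gamma1):
    """Two-branch cascading classifier: exit at the fast branch when its
    top-label confidence reaches gamma1, else fall through to the slow one."""
    exit_fast = probs_fast.max(axis=1) >= gamma1  # exit condition d_1(x)
    y_fast = probs_fast.argmax(axis=1)
    y_slow = probs_slow.argmax(axis=1)
    return np.where(exit_fast, y_fast, y_slow), exit_fast

def choose_gamma(p1, y1, yM, y, xi, delta):
    """Line search for the smallest gamma_1 satisfying the empirical version
    of the constraint in (6) (M = 2, so delta is split as delta / 3)."""
    n = len(y)
    for gamma in np.linspace(0.0, 1.0, 101):  # smallest feasible first
        d = (p1 >= gamma) & (y1 != yM)        # exit AND branches disagree
        r_lo, r_hi = clopper_pearson(int(d.sum()), n, delta / 3)
        # Error rates of the fast/slow branches on the disagreement set.
        _, c_hi = clopper_pearson(int((y1[d] != y[d]).sum()), int(d.sum()), delta / 3)
        c_lo, _ = clopper_pearson(int((yM[d] != y[d]).sum()), int(d.sum()), delta / 3)
        if c_hi * r_hi - c_lo * r_lo <= xi:
            return gamma
    return 1.0
```

Scanning the grid in increasing order returns the first (smallest) feasible threshold, matching the preference for exiting as many examples as possible at the fast branch.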

4. APPLICATION TO SAFE PLANNING

Robots acting in open-world environments must rely on deep learning for tasks such as object recognition (e.g., detecting objects in a camera image); providing guarantees on these predictions is critical for safe planning. Safety requires not just that the robot is safe while taking an action, but that it can safely come to a stop afterwards, e.g., that a robot can safely come to a stop before running into a wall. We consider a binary classification DNN trained to predict a probability $f(x) \in [0, 1]$ that the robot is unsafe in this sense. If $f(x) \ge \gamma$ for some threshold $\gamma \in [0, 1]$, then the robot comes to a stop (e.g., to ask a human for help). If the label $\mathbb{1}(f(x) \ge \gamma)$ correctly predicts safety, then this policy ensures safety as long as the robot starts from a safe state (Li & Bastani, 2020). We apply our approach to choose $\gamma$ so as to ensure safety with high probability.

Problem setup. Given a performant but potentially unsafe policy $\pi$ (e.g., a DNN policy trained to navigate to the goal), our goal is to override $\pi$ as needed to ensure safety. We assume that $\pi$ is trained in a simulator, and our goal is to ensure that $\pi$ is safe according to our model of the environment, which is already a challenging problem when $\pi$ is a deep neural network over perceptual inputs. In particular, we do not address the sim-to-real problem. Let $x \in \mathcal{X}$ be the states, $\mathcal{X}_{\mathrm{safe}} \subseteq \mathcal{X}$ be the safe states (i.e., our goal is to ensure the robot stays in $\mathcal{X}_{\mathrm{safe}}$), $o \in \mathcal{O}$ be the observations, $u \in \mathcal{U}$ be the actions, $g : \mathcal{X} \times \mathcal{U} \to \mathcal{X}$ be the (deterministic) dynamics, and $h : \mathcal{X} \to \mathcal{O}$ be the observation function. A state $x$ is recoverable (denoted $x \in \mathcal{X}_{\mathrm{rec}}$) if the robot can use $\pi$ in state $x$ and then safely come to a stop using a backup policy $\pi_0$ (e.g., braking). Then, the shield policy uses $\pi$ if $x \in \mathcal{X}_{\mathrm{rec}}$, and $\pi_0$ otherwise (Bastani, 2019). This policy guarantees safety as long as the initial state is recoverable, i.e., $x_0 \in \mathcal{X}_{\mathrm{rec}}$. The challenge is determining whether $x \in \mathcal{X}_{\mathrm{rec}}$.
When we observe $x$, we can use model-based simulation to perform this check. However, in many settings, we only have access to observations (e.g., camera images or LIDAR scans), so this approach does not apply. Instead, we propose to train a DNN to predict recoverability, i.e.,
$$\hat{y}(o) := \begin{cases} 1 \text{ ("unrecoverable")} & \text{if } f(o) \ge \gamma \\ 0 \text{ ("recoverable")} & \text{otherwise}, \end{cases}$$
where $o = h(x)$, with the goal that $\hat{y}(o) \approx y^*(x) := \mathbb{1}(x \notin \mathcal{X}_{\mathrm{rec}})$, resulting in the following shield policy $\pi_{\mathrm{shield}}$:
$$\pi_{\mathrm{shield}}(o) := \begin{cases} \pi(o) & \text{if } \hat{y}(o) = 0 \\ \pi_0(o) & \text{otherwise}. \end{cases}$$

Safety guarantee. Our main goal is to choose $\gamma$ so that $\pi_{\mathrm{shield}}$ ensures safety with high probability, i.e., given $\xi \in \mathbb{R}_{>0}$ and any distribution $D$ over initial states $\mathcal{X}_0 \subseteq \mathcal{X}_{\mathrm{rec}}$, we have
$$p_{\mathrm{unsafe}} := \mathbb{P}_{\zeta \sim D_{\pi_{\mathrm{shield}}}}\left[\zeta \not\subseteq \mathcal{X}_{\mathrm{safe}}\right] \le \xi,$$
where $\zeta(x_0, \pi) := (x_0, x_1, \ldots)$ is a rollout from $x_0 \sim D$ generated using $\pi$, i.e., $x_{t+1} = g(x_t, \pi(h(x_t)))$. We assume the rollout terminates either once the robot reaches its goal, or once it switches to $\pi_0$ and comes to a stop; in particular, the robot never switches from $\pi_0$ back to $\pi$.

Success rate. To maximize the success rate (i.e., the rate at which the robot achieves its goal), we need to minimize how often $\pi_{\mathrm{shield}}$ switches to $\pi_0$, which corresponds to maximizing $\gamma$. We want to maximize $\gamma$ subject to $p_{\mathrm{unsafe}} \le \xi$, where $p_{\mathrm{unsafe}}$ is implicitly a function of $\gamma$. However, we cannot evaluate $p_{\mathrm{unsafe}}$, so we need an upper bound. To this end, consider a rollout in which the first unsafe state is encountered on step $t$ (i.e., $x_t \notin \mathcal{X}_{\mathrm{safe}}$ but $x_i \in \mathcal{X}_{\mathrm{safe}}$ for all $i < t$), which we call an unsafe rollout, and denote the event that such an unsafe rollout is encountered by $E_t$; we exploit the unsafe rollouts to bound $p_{\mathrm{unsafe}}$. In particular, let $p_t := \mathbb{P}_{\zeta \sim D_\pi}[E_t]$, and let $p := \sum_{t=0}^{\infty} p_t$ be the probability that a rollout is unsafe.
Then, consider a new distribution $\tilde{D}$ over $\mathcal{O}$ with probability density function $p_{\tilde{D}}(o) := \sum_{t=0}^{\infty} p_{D_\pi}(o \mid E_t) \cdot p_t / p$, where $p_{D_\pi}$ is the original probability density function of $D_\pi$; in particular, we can draw an observation $o \sim \tilde{D}$ by sampling the observation of the first unsafe state from a rollout sample (and rejecting if the entire rollout is safe). Then, we can show the following (see the proof of Theorem 4 in Appendix E):
$$p_{\mathrm{unsafe}} \le \mathbb{P}_{o \sim \tilde{D}}\left[\hat{y}(o) = 0\right] \cdot p =: \bar{p}_{\mathrm{unsafe}}. \tag{8}$$
We use our confidence coverage predictor $\hat{C}_0$ to compute bounds on $\bar{p}_{\mathrm{unsafe}}$, i.e.,
$$\mathbb{P}_{o \sim \tilde{D}}\left[\hat{y}(o) = 1\right] \in [\underline{c}, \bar{c}] := \hat{C}_0\left(f, Z', \tfrac{\delta}{2}\right) \quad \text{where} \quad Z' := \{(o, 1) \mid o \in Z\}$$
$$p \in [\underline{r}, \bar{r}] := \bar{\Theta}_{\mathrm{CP}}\left(W'; \tfrac{\delta}{2}\right) \quad \text{where} \quad W' := \{\mathbb{1}(\zeta \not\subseteq \mathcal{X}_{\mathrm{safe}}) \mid \zeta \in W\}.$$
Here, $Z'$ is a labeled version of $Z$, where "1" denotes "unrecoverable", $n := |W|$, and $n' := |Z|$, where $n \ge n'$. Then, we have $\bar{p}_{\mathrm{unsafe}} \le \bar{r} \cdot (1 - \underline{c})$, so it suffices to solve the following problem:
$$\gamma := \arg\max_{\gamma' \in [0,1]} \gamma' \quad \text{subj. to} \quad \bar{r} \cdot (1 - \underline{c}) \le \xi. \tag{9}$$
Here, $\underline{c}$ is implicitly a function of $\gamma'$; thus, we use line search to solve this optimization problem. We have the following safety guarantee; see Appendix E for a proof:

Theorem 4. We have $p_{\mathrm{unsafe}} \le \xi$ with probability at least $1 - \delta$ over $W \sim D_\pi^n$ and $Z \sim \tilde{D}^{n'}$.
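A sketch of solving (9) by line search, under assumptions of our own (precomputed per-rollout unsafety indicators $W'$, the scores $f(o)$ at the first unsafe observation of each unsafe rollout, and an arbitrary grid resolution):

```python
import numpy as np
from scipy.stats import beta

def clopper_pearson(s, n, alpha):
    # Exact binomial interval via Beta quantiles (Section 2).
    lo = beta.ppf(alpha / 2, s, n - s + 1) if s > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, s + 1, n - s) if s < n else 1.0
    return lo, hi

def choose_safety_threshold(rollout_unsafe, scores, xi, delta):
    """Largest gamma with r_hi * (1 - c_lo) <= xi, as in (9).
    rollout_unsafe: 1(rollout left X_safe) for each calibration rollout;
    scores: f(o) at the first unsafe observation of each unsafe rollout."""
    n = len(rollout_unsafe)
    _, r_hi = clopper_pearson(int(sum(rollout_unsafe)), n, delta / 2)
    best = 0.0
    for gamma in np.linspace(0.0, 1.0, 101):
        # c_lo lower-bounds P[f(o) >= gamma] over the unsafe observations.
        c_lo, _ = clopper_pearson(int((scores >= gamma).sum()), len(scores), delta / 2)
        if r_hi * (1.0 - c_lo) <= xi:
            best = gamma  # feasible; try a larger threshold
        else:
            break  # the constraint only gets harder as gamma grows
    return best
```

Since the number of flagged observations is nonincreasing in $\gamma$, the constraint is monotone, so the search can stop at the first violation.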

5. EXPERIMENTS

We demonstrate how our proposed approach can be used to obtain provable guarantees in our two applications: fast DNN inference and safe planning. Additional results are in Appendix G.

5.1. CALIBRATION

We illustrate the calibration properties of our approach using reliability diagrams, which show the empirical accuracy in each bin as a function of the predicted confidence (Guo et al., 2017). Ideally, the accuracy should equal the predicted confidence, so the ideal curve is the line y = x. To draw our predicted confidence intervals in these plots, we need to rescale them; see Appendix F.3.

Setup. We use the ImageNet dataset (Russakovsky et al., 2015) and ResNet101 (He et al., 2016) for evaluation. We split the ImageNet validation set into 20,000 calibration and 10,000 test images.

Baselines. We consider three baselines: (i) naïve softmax of f, (ii) temperature scaling (Guo et al., 2017), and (iii) histogram binning (Zadrozny & Elkan, 2001); see Appendix F.2 for details. For histogram binning and our approach, we use K = 20 bins of equal size.

Metrics. We use expected calibration error (ECE) and reliability diagrams (see Appendix F.3).

Results. Results are shown in Figure 2. The ECE of the naïve softmax baseline is 4.79% (Figure 2a); temperature scaling improves this to 1.66% (Figure 2b), and histogram binning achieves 0.99% (Figure 2c). Our approach predicts an interval that includes the empirical accuracy in all bins (solid red lines in Figure 2c); furthermore, the upper/lower bounds on the ECE over values in our bins are [0.0%, 3.76%], which includes zero ECE. See Appendix G.1 for additional results.
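For reference, ECE bins the predictions by confidence and averages the gap between per-bin accuracy and per-bin mean confidence, weighted by bin size. A minimal sketch (our own binning convention; the evaluation in the paper may differ in details):

```python
import numpy as np

def expected_calibration_error(p_hat, correct, boundaries):
    """ECE with right bin edges b_1 <= ... <= b_K = 1: a size-weighted
    average over bins of |empirical accuracy - mean predicted confidence|."""
    bins = np.searchsorted(boundaries, p_hat, side="left")
    ece = 0.0
    for k in range(len(boundaries)):
        mask = bins == k
        if not mask.any():
            continue  # empty bins contribute nothing
        gap = abs(correct[mask].mean() - p_hat[mask].mean())
        ece += mask.mean() * gap
    return ece
```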

5.2. FAST DNN INFERENCE

Setup. We use the same ImageNet setup with ResNet101 as in the calibration experiments. For the cascading classifier, we use the original ResNet101 as the slow network, and add a single exit branch (i.e., M = 2) a quarter of the way from the input layer. We train the newly added branch using the standard procedure for training ResNet101.

Baselines. We compare to naïve softmax and to calibrated prediction via temperature scaling, both using a threshold $\gamma_1 = 1 - \xi'$, where $\xi'$ is the sum of $\xi$ and the validation error of the slow model; intuitively, this is the threshold we would use if the predicted probabilities were perfectly calibrated. We also compare to histogram binning, i.e., our approach but using the mean of each bin instead of the upper/lower bounds. See Appendix F.2 for details.

Metrics. First, we measure test-set top-1 classification error (i.e., 1 - accuracy), which we want to guarantee is lower than the desired error (i.e., the error of the slow model plus the desired relative error ξ). To measure inference time, we report the average number of multiply-accumulate operations (MACs) used in inference per example. Note that the MACs are averaged over all examples in the test set, since the composed model may use different numbers of MACs for different examples.

Results. Comparisons with the baselines are shown in Figure 3a. The original network is denoted "slow network", our approach (6) "rigorous", and the baselines "(1 - ξ')-softmax", "(1 - ξ')-temp.", and "hist. bin.". For each method, we plot the classification error and the time in MACs. The desired error upper bound is plotted as a dotted line; the goal is for the classification error to be below this line. As can be seen, our method is guaranteed to achieve the desired error, while improving inference time by 32% compared to the slow model. In contrast, the histogram binning baseline improves inference time but fails to satisfy the desired error bound.
Asymptotically, histogram binning is guaranteed to be perfectly calibrated, but here it makes mistakes due to finite-sample error. The other baselines do not improve inference time. Next, Figure 3b shows the classification error as we vary the desired relative error ξ; our approach always achieves the desired error on the test set, and is often very close to it (which maximizes speed). However, the baselines often violate the desired error bound. Finally, the MACs metric is only an approximation of the actual inference time. To complement MACs, we also measure CPU and GPU time using the PyTorch profiler. In Figure 3c, we show the inference times for each method; the trends are as before: our approach improves running time by 54%, at the cost of only 1% in classification error. The histogram binning baseline is faster than our approach, but does not satisfy the error guarantee. These results include the time needed to compute the intervals, which is negligible.

5.3. SAFE PLANNING

Setup. We evaluate on AI Habitat (Savva et al., 2019), an indoor robot simulator that provides agents with observations o = h(x) that are RGB camera images. The safety constraint is to avoid colliding with obstacles such as furniture in the environment. We use the learned policy π available with the environment. We then train a recoverability predictor using 500 rollouts with a horizon of 100, and calibrate this model on an additional n rollouts.

Baselines. We compare to three baselines: (i) histogram binning, i.e., our approach but using the mean of each bin rather than the upper/lower bounds, (ii) a naïve approach of choosing γ = 0.5, and (iii) a naïve but adaptive approach of choosing γ = ξ, called "ξ-naïve".

Metrics. We measure both the safety rate and the success rate; in particular, a rollout is successful if the robot reaches its goal state, and safe if it does not reach any unsafe states.

Results. We show results in Figure 4a. The desired safety rate ξ is shown by the dotted line, i.e., we expect the safety rate to be above this line. As can be seen, our approach achieves the desired safety rate. While it sacrifices success rate, this is because the underlying learned policy π is frequently unsafe; in particular, it is safe in only about 30% of rollouts. The naïve approach fails to satisfy the safety constraint. The ξ-naïve approach tends to be optimistic, and also fails when ξ = 0.03 (Figure 4b). The histogram binning baseline performs similarly to our approach; the main benefit of our approach is that it provides a provable guarantee on safety, which the histogram binning baseline does not. Thus, in this case, our approach provides this guarantee while achieving similar performance. Figure 4b shows the safety rate as we vary the desired safety rate ξ. Both our approach and the histogram binning baseline satisfy the desired safety guarantee, whereas the naïve approaches do not always do so.

6. CONCLUSION

We have proposed a novel algorithm for calibrated prediction that provides PAC guarantees, and demonstrated how our approach can be applied to fast DNN inference and safe planning. There are many directions for future work, e.g., leveraging these techniques in more application domains and extending our approach to settings with distribution shift (see Appendix F.1 for a discussion).

A CONNECTION TO PAC LEARNING THEORY

We explain the connection to PAC learning theory. First, note that we can represent $\hat{C}$ as a confidence interval around the empirical estimate of $c_f(x)$, i.e.,
$$\hat{c}_f(x) := \sum_{(x',y') \in S_x} \mathbb{1}(y' = \hat{y}(x')) / |S_x|, \quad \text{where} \quad S_x = \{(x', y') \in Z \mid \hat{p}(x') \in B_{\kappa_f(x)}\}.$$
Then, we can write $\hat{C}(x) = [\hat{c}_f(x) - \underline{\epsilon}_x, \hat{c}_f(x) + \bar{\epsilon}_x]$. In this case, (3) is equivalent to
$$\mathbb{P}_{Z \sim D^n}\left[\bigwedge_{x \in \mathcal{X}} \hat{c}_f(x) - \underline{\epsilon}_{\kappa_f(x)} \le c_f(x) \le \hat{c}_f(x) + \bar{\epsilon}_{\kappa_f(x)}\right] \ge 1 - \delta, \tag{10}$$
for some $\underline{\epsilon}_1, \bar{\epsilon}_1, \ldots, \underline{\epsilon}_K, \bar{\epsilon}_K$. In this bound, "approximately" refers to the fact that the empirical estimate $\hat{c}_f(x)$ is within $\epsilon$ of the true value $c_f(x)$, and "probably" refers to the fact that this error bound holds with high probability over the training data $Z \sim D^n$. By abuse of terminology, we refer to the confidence interval predictor $\hat{C}$ as PAC rather than just $\hat{c}_f(x)$. Alternatively, we also have the following connection to PAC learning theory:

Definition 3. Given $\epsilon, \delta \in \mathbb{R}_{>0}$ and $n \in \mathbb{N}$, $\hat{C}$ is probably approximately correct (PAC) if, for any distribution $D$, we have
$$\mathbb{P}_{Z \sim D^n}\left[\mathbb{P}_{x \sim D}\left[c_f(x) \in \hat{C}(x; f, Z)\right] \ge 1 - \epsilon\right] \ge 1 - \delta.$$
The following theorem shows that the proposed confidence coverage predictor $\hat{C}$ satisfies the PAC guarantee in Definition 3.

Theorem 5. Our confidence coverage predictor $\hat{C}$ satisfies Definition 3 for all $\epsilon, \delta \in \mathbb{R}_{>0}$ and $n \in \mathbb{N}$.

Proof. We exploit the independence of each bin for the proof.
Let θ_{κ_f(x)} := c_f(x), the parameter of the Binomial distribution of the κ_f(x)th bin. Then the following holds:

\begin{align*}
&P_{Z \sim D^n}\!\left[ P_{x \sim D}\!\left[ c_f(x) \in \hat{C}(x; \hat{f}, Z, \delta) \right] \ge 1-\epsilon \right] \\
&\quad= P_{Z \sim D^n}\!\left[ P_{x \sim D}\!\left[ c_f(x) \in \hat{C}(x; \hat{f}, Z, \delta) \wedge \bigvee_{k=1}^{K} \hat{p}(x) \in B_k \right] \ge 1-\epsilon \right] \\
&\quad= P_{Z \sim D^n}\!\left[ \sum_{k=1}^{K} P_{x \sim D}\!\left[ c_f(x) \in \hat{C}(x; \hat{f}, Z, \delta) \wedge \hat{p}(x) \in B_k \right] \ge 1-\epsilon \right] \\
&\quad= P_{Z \sim D^n}\!\left[ \sum_{k=1}^{K} P_{x \sim D}\!\left[ c_f(x) \in \hat{C}(x; \hat{f}, Z, \delta) \,\middle|\, \hat{p}(x) \in B_k \right] P_{x \sim D}[\hat{p}(x) \in B_k] \ge 1-\epsilon \right] \\
&\quad= P_{Z \sim D^n}\!\left[ \sum_{k=1}^{K} P_{x \sim D}\!\left[ \theta_k \in \hat{\Theta}_{\mathrm{CP}}\!\left(W_k; \tfrac{\delta}{K}\right) \,\middle|\, \hat{p}(x) \in B_k \right] P_{x \sim D}[\hat{p}(x) \in B_k] \ge 1-\epsilon \right] \\
&\quad= P_{Z \sim D^n}\!\left[ \sum_{k=1}^{K} \mathbb{1}\!\left( \theta_k \in \hat{\Theta}_{\mathrm{CP}}\!\left(W_k; \tfrac{\delta}{K}\right) \right) P_{x \sim D}[\hat{p}(x) \in B_k] \ge 1-\epsilon \right] \\
&\quad\ge P_{Z \sim D^n}\!\left[ \sum_{k=1}^{K} \mathbb{1}\!\left( \theta_k \in \hat{\Theta}_{\mathrm{CP}}\!\left(W_k; \tfrac{\delta}{K}\right) \right) P_{x \sim D}[\hat{p}(x) \in B_k] \ge 1-\epsilon \;\wedge\; \bigwedge_{k=1}^{K} \mathbb{1}\!\left( \theta_k \in \hat{\Theta}_{\mathrm{CP}}\!\left(W_k; \tfrac{\delta}{K}\right) \right) = 1 \right] \\
&\quad= P_{Z \sim D^n}\!\left[ \sum_{k=1}^{K} P_{x \sim D}[\hat{p}(x) \in B_k] \ge 1-\epsilon \,\middle|\, \bigwedge_{k=1}^{K} \mathbb{1}(\cdot) = 1 \right] P_{Z \sim D^n}\!\left[ \bigwedge_{k=1}^{K} \mathbb{1}(\cdot) = 1 \right] \\
&\quad= P_{Z \sim D^n}\!\left[ 1 \ge 1-\epsilon \,\middle|\, \bigwedge_{k=1}^{K} \mathbb{1}(\cdot) = 1 \right] P_{Z \sim D^n}\!\left[ \bigwedge_{k=1}^{K} \mathbb{1}(\cdot) = 1 \right] \\
&\quad= P_{Z \sim D^n}\!\left[ \bigwedge_{k=1}^{K} \theta_k \in \hat{\Theta}_{\mathrm{CP}}\!\left(W_k; \tfrac{\delta}{K}\right) \right] \ge 1-\delta,
\end{align*}

where 1(·) abbreviates 1(θ_k ∈ Θ̂_CP(W_k; δ/K)); the second-to-last step uses the fact that Σ_{k=1}^K P_{x∼D}[p̂(x) ∈ B_k] = 1 ≥ 1 − ε always holds, and the last inequality holds by the union bound.
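The Clopper-Pearson interval Θ̂_CP used above can be computed exactly from Binomial tail probabilities. The following stdlib-only sketch (our own helper names, not the paper's released code) inverts the two tails by bisection:

```python
import math

def binom_sf(k, n, p):
    # P[X >= k] for X ~ Binomial(n, p)
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def binom_cdf(k, n, p):
    # P[X <= k] for X ~ Binomial(n, p)
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(0, k + 1))

def _bisect(f, target, increasing):
    # Solve f(p) = target for monotone f on [0, 1] by bisection.
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, delta):
    """Exact (conservative) 1 - delta Clopper-Pearson interval for a
    Binomial proportion, given k successes out of n trials."""
    alpha = delta / 2
    # lower endpoint: the p at which P[X >= k | p] = alpha (sf increases in p)
    lower = 0.0 if k == 0 else _bisect(lambda p: binom_sf(k, n, p), alpha, True)
    # upper endpoint: the p at which P[X <= k | p] = alpha (cdf decreases in p)
    upper = 1.0 if k == n else _bisect(lambda p: binom_cdf(k, n, p), alpha, False)
    return lower, upper
```

For example, with 0 successes in 10 trials at δ = 0.05, the upper endpoint is 1 − (δ/2)^{1/10}, the standard closed form for the zero-count case.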

B PROOF OF THEOREM 1

We prove this by exploiting the independence of each bin. Recall that Ĉ(x) := [ĉ_f(x) − ε_x, ĉ_f(x) + ε̄_x], where the interval is obtained by applying the Clopper-Pearson interval with confidence level δ/K at each bin. Then, due to the Clopper-Pearson interval, the following holds for all k ∈ {1, 2, …, K}:

$$P\left[\,|c_{f,k} - \hat{c}_{f,k}| > \epsilon_k\,\right] \le \frac{\delta}{K},$$

where c_{f,k} := c_f(x) and ĉ_{f,k} := ĉ_f(x) for x such that κ_f(x) = k, and ε_k := max(ε_x, ε̄_x). By applying the union bound, the following also holds:

$$P\left[\,\bigvee_{k=1}^{K} |c_{f,k} - \hat{c}_{f,k}| > \epsilon_k\,\right] \le \delta.$$

Considering the fact that X is partitioned into K subsets by the binning, together with the equivalent form (10) of the PAC criterion in Definition 3, the claimed statement holds.
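The construction in this proof (per-bin intervals at level δ/K, combined by a union bound) can be sketched as follows. For brevity, this sketch substitutes a Hoeffding-style interval for the exact Clopper-Pearson one (both give valid coverage; Hoeffding is simply looser), and the equal-width binning and all function names are our assumptions, not the paper's code:

```python
import math

def pac_confidence_intervals(top_probs, correct, K, delta):
    """Bin examples by top-label confidence into K equal-width bins and
    return, per bin, a confidence interval for the true accuracy that is
    valid simultaneously for all bins w.p. >= 1 - delta (union bound:
    each bin's interval is built at level delta / K). Uses a Hoeffding
    interval; Clopper-Pearson would be tighter."""
    bins = [[] for _ in range(K)]
    for p, c in zip(top_probs, correct):
        k = min(int(p * K), K - 1)  # bin index for p in [k/K, (k+1)/K)
        bins[k].append(c)
    intervals = []
    for members in bins:
        n_k = len(members)
        if n_k == 0:
            intervals.append((0.0, 1.0))  # vacuous interval for empty bins
            continue
        c_hat = sum(members) / n_k
        eps = math.sqrt(math.log(2 * K / delta) / (2 * n_k))
        intervals.append((max(0.0, c_hat - eps), min(1.0, c_hat + eps)))
    return intervals
```

The δ/K level per bin is exactly the union-bound bookkeeping used in the proof above.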

C PROOF OF THEOREM 2

We drop probabilities over (x, y) ∼ D. First, we decompose the error of the cascading classifier P[ŷ_C(x) ≠ y] as follows:

\begin{align*}
P[\hat{y}_C(x) \ne y]
&= P\left[\hat{y}_C(x) \ne y \wedge \left(\hat{y}_C(x) = \hat{y}_M(x) \vee \hat{y}_C(x) \ne \hat{y}_M(x)\right)\right] \\
&= P\left[\left(\hat{y}_C(x) \ne y \wedge \hat{y}_C(x) = \hat{y}_M(x)\right) \vee \left(\hat{y}_C(x) \ne y \wedge \hat{y}_C(x) \ne \hat{y}_M(x)\right)\right] \\
&= P[\hat{y}_C(x) \ne y \wedge \hat{y}_C(x) = \hat{y}_M(x)] + P[\hat{y}_C(x) \ne y \wedge \hat{y}_C(x) \ne \hat{y}_M(x)],
\end{align*}

where the last equality holds since the events ŷ_C(x) = ŷ_M(x) and ŷ_C(x) ≠ ŷ_M(x) are disjoint. Similarly, for the error of the slow classifier P[ŷ_M(x) ≠ y], we have:

$$P[\hat{y}_M(x) \ne y] = P[\hat{y}_M(x) \ne y \wedge \hat{y}_C(x) = \hat{y}_M(x)] + P[\hat{y}_M(x) \ne y \wedge \hat{y}_C(x) \ne \hat{y}_M(x)].$$

Since the first terms coincide (when ŷ_C(x) = ŷ_M(x), the two classifiers err together), the error difference can be represented as follows:

$$P[\hat{y}_C(x) \ne y] - P[\hat{y}_M(x) \ne y] = P[\hat{y}_C(x) \ne y \wedge \hat{y}_C(x) \ne \hat{y}_M(x)] - P[\hat{y}_M(x) \ne y \wedge \hat{y}_C(x) \ne \hat{y}_M(x)]. \tag{12}$$

To complete the proof, we need to upper bound (12) by ξ. Define the following events:

\begin{align*}
D_m &:= \bigwedge_{i=1}^{m-1} (\hat{p}_i(x) < \gamma_i) \wedge (\hat{p}_m(x) \ge \gamma_m) \quad (\forall m \in \{1, \dots, M-1\}) \\
D_M &:= \bigwedge_{i=1}^{M-1} (\hat{p}_i(x) < \gamma_i) \\
E_C &:= (\hat{y}_C(x) \ne \hat{y}_M(x)) \qquad E_m := (\hat{y}_m(x) \ne \hat{y}_M(x)) \quad (\forall m \in \{1, \dots, M-1\}) \\
F_C &:= (\hat{y}_C(x) \ne y) \qquad\qquad F_m := (\hat{y}_m(x) \ne y) \quad (\forall m \in \{1, \dots, M-1\}) \\
G &:= (\hat{y}_M(x) \ne y),
\end{align*}

where D_1, D_2, …, D_M form a partition of the sample space. (On D_M we have ŷ_C(x) = ŷ_M(x), so we may take E_M to be the empty event and F_M := G, which makes the Mth terms below vanish.) Then, we have:

\begin{align*}
P[\hat{y}_C(x) \ne y \wedge \hat{y}_C(x) \ne \hat{y}_M(x)]
&= P[F_C \wedge E_C]
= P\left[F_C \wedge E_C \wedge \bigvee_{m=1}^{M} D_m\right]
= P\left[\bigvee_{m=1}^{M} (F_C \wedge E_C \wedge D_m)\right] \\
&= \sum_{m=1}^{M} P[F_C \wedge E_C \wedge D_m]
= \sum_{m=1}^{M} P[F_m \wedge E_m \wedge D_m]
= \sum_{m=1}^{M} P[F_m \mid E_m \wedge D_m] \cdot P[E_m \wedge D_m].
\end{align*}

Similarly, we have:

$$P[\hat{y}_M(x) \ne y \wedge \hat{y}_C(x) \ne \hat{y}_M(x)] = \sum_{m=1}^{M} P[G \mid E_m \wedge D_m] \cdot P[E_m \wedge D_m].$$

Thus, (12) can be rewritten as follows:

\begin{align*}
P[\hat{y}_C(x) \ne y] - P[\hat{y}_M(x) \ne y]
&= \sum_{m=1}^{M} \left(P[F_m \mid E_m \wedge D_m] - P[G \mid E_m \wedge D_m]\right) \cdot P[E_m \wedge D_m] \\
&= \sum_{m=1}^{M} (\hat{e}_m - e_m) = \sum_{m=1}^{M-1} (\hat{e}_m - e_m) \le \xi,
\end{align*}

where ê_m := P[F_m | E_m ∧ D_m] · P[E_m ∧ D_m] and e_m := P[G | E_m ∧ D_m] · P[E_m ∧ D_m]; the second equality holds since ê_M − e_M = 0, and the last inequality holds due to (6) with probability at least 1 − δ, which proves the claim.
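The key identity behind this decomposition, namely that the errors cancel on inputs where the two classifiers agree, holds sample-by-sample and can be checked numerically. This is our illustrative sanity check, not code from the paper:

```python
import random

def error_difference(samples):
    """Compare both sides of the decomposition (12): the cascade/slow
    error difference (lhs) equals the same difference restricted to
    inputs where the two classifiers disagree (rhs).
    `samples` holds (y, y_cascade, y_slow) triples."""
    n = len(samples)
    lhs = (sum(yc != y for y, yc, ys in samples)
           - sum(ys != y for y, yc, ys in samples)) / n
    rhs = (sum(yc != y and yc != ys for y, yc, ys in samples)
           - sum(ys != y and yc != ys for y, yc, ys in samples)) / n
    return lhs, rhs

# Random labels in {0, 1, 2}: the identity holds for any joint distribution.
rng = random.Random(0)
samples = [(rng.randrange(3), rng.randrange(3), rng.randrange(3))
           for _ in range(1000)]
```

Because the cancelled terms are identical events, the two sides agree exactly, not just in expectation.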

D PROOF OF THEOREM 3

Suppose there is a γ_1 that is different from γ*_1 and produces a faster cascading classifier than the cascading classifier with γ*_1. Since γ*_1 is the optimal solution of (6), any other feasible choice satisfies γ_1 > γ*_1. This implies that fewer examples exit at the first branch of the cascading classifier with γ_1, and those examples are instead classified by the later, slower branch. Hence, the overall inference speed of the cascading classifier with γ_1 is slower than that with γ*_1, which is a contradiction.
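The argument above rests on γ*_1 being the smallest feasible threshold. A sketch of that selection rule on validation data follows; the names are ours, and for simplicity it uses the empirical error difference where the paper's guarantee uses a high-probability upper bound:

```python
def select_threshold(p1_scores, fast_wrong, slow_wrong, xi):
    """Pick the smallest first-branch threshold gamma_1 such that the
    estimated extra error of the cascade over the slow model stays <= xi.
    An example exits early iff its branch-1 confidence >= gamma_1.
    (Illustrative point-estimate version of the paper's constrained
    threshold selection.)"""
    n = len(p1_scores)
    for gamma in sorted(set(p1_scores)):  # ascending: smallest feasible wins
        # extra error incurred on examples that exit at the first branch
        extra = sum((fw - sw)
                    for p, fw, sw in zip(p1_scores, fast_wrong, slow_wrong)
                    if p >= gamma) / n
        if extra <= xi:
            return gamma
    return 1.0  # no feasible threshold: never exit early
```

A smaller γ_1 lets more examples exit early (faster inference) at the cost of more error, so scanning candidates in ascending order returns the fastest feasible threshold.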

E PROOF OF THEOREM 4

For clarity, we use r to denote that a state x is "recoverable" (i.e., y*(x) = 0) and u to denote that a state x is "unrecoverable" (i.e., y*(x) = 1). Now, note that a rollout ζ(x_0, π_shield) := (x_0, x_1, …) is unsafe if (i) at some step t, we have y*(x_t) = u (i.e., x_t is not recoverable), yet ŷ(o_t) = r, where o_t = h(x_t) (i.e., ŷ predicts that x_t is recoverable), and furthermore (ii) for every step i ≤ t − 1, y*(x_i) = ŷ(o_i) = r; i.e.,

$$p_{\mathrm{unsafe}} = P_{\zeta \sim D_{\pi_{\mathrm{shield}}}}\left[\bigvee_{t=0}^{\infty} \left(\bigwedge_{i=0}^{t-1} \left(y^*(x_i) = r \wedge \hat{y}(o_i) = r\right)\right) \wedge \left(y^*(x_t) = u \wedge \hat{y}(o_t) = r\right)\right].$$

Condition (i) is captured by the second parenthetical inside the probability; intuitively, it says that ŷ(o_t) is a false negative. Condition (ii) is captured by the first parenthetical inside the probability; intuitively, it says that ŷ(o_i) is a true negative for every i ≤ t − 1. Next, define the event E_t by

$$E_t := \left(\bigwedge_{i=0}^{t-1} y^*(x_i) = r\right) \wedge \left(y^*(x_t) = u\right);$$

then the following holds:

\begin{align*}
&P_{\zeta \sim D_{\pi_{\mathrm{shield}}}}\left[\bigvee_{t=0}^{\infty} \left(\bigwedge_{i=0}^{t-1} \left(y^*(x_i) = r \wedge \hat{y}(o_i) = r\right)\right) \wedge \left(y^*(x_t) = u \wedge \hat{y}(o_t) = r\right)\right] \\
&\quad= P_{\zeta \sim D_{\pi_{\mathrm{shield}}}}\left[\bigvee_{t=0}^{\infty} \left(\bigwedge_{i=0}^{t-1} y^*(x_i) = r\right) \wedge \left(y^*(x_t) = u\right) \wedge \left(\bigwedge_{i=0}^{t-1} \hat{y}(o_i) = r\right) \wedge \left(\hat{y}(o_t) = r\right)\right] \\
&\quad= P_{\zeta \sim D_{\pi_{\mathrm{shield}}}}\left[\bigvee_{t=0}^{\infty} E_t \wedge \left(\bigwedge_{i=0}^{t-1} \hat{y}(o_i) = r\right) \wedge \left(\hat{y}(o_t) = r\right)\right] \\
&\quad\le P_{\zeta \sim D_{\pi_{\mathrm{shield}}}}\left[\bigvee_{t=0}^{\infty} E_t \wedge \hat{y}(o_t) = r\right].
\end{align*}
Recall that p := Σ_{t=0}^∞ P_{ζ∼D_π}[E_t] and p_D̃(o) := Σ_{t=0}^∞ p_{D_π}(o | E_t) · P_{ζ∼D_π}[E_t] / p; then we can upper-bound (13) as follows:

\begin{align*}
p_{\mathrm{unsafe}}
&\le P_{\zeta \sim D_{\pi_{\mathrm{shield}}}}\left[\bigvee_{t=0}^{\infty} E_t \wedge \hat{y}(o_t) = r\right]
= \sum_{t=0}^{\infty} P_{\zeta \sim D_{\pi_{\mathrm{shield}}}}\left[E_t \wedge \hat{y}(o_t) = r\right] \\
&= \sum_{t=0}^{\infty} P_{\zeta \sim D_{\pi_{\mathrm{shield}}}}\left[\hat{y}(o_t) = r \mid E_t\right] \cdot P_{\zeta \sim D_{\pi_{\mathrm{shield}}}}[E_t]
= \sum_{t=0}^{\infty} P_{\zeta \sim D_{\pi_{\mathrm{shield}}}}\left[\hat{y}(o) = r \mid E_t\right] \cdot P_{\zeta \sim D_{\pi_{\mathrm{shield}}}}[E_t] \\
&= \sum_{t=0}^{\infty} \int \mathbb{1}(\hat{y}(o) = r) \cdot p_{D_{\pi_{\mathrm{shield}}}}(o \mid E_t) \cdot P_{\zeta \sim D_{\pi_{\mathrm{shield}}}}[E_t] \, do
= \int \mathbb{1}(\hat{y}(o) = r) \sum_{t=0}^{\infty} p_{D_{\pi_{\mathrm{shield}}}}(o \mid E_t) \cdot P_{\zeta \sim D_{\pi_{\mathrm{shield}}}}[E_t] \, do \\
&\le \int \mathbb{1}(\hat{y}(o) = r) \sum_{t=0}^{\infty} p_{D_{\pi}}(o \mid E_t) \cdot P_{\zeta \sim D_{\pi}}[E_t] \, do
= \int \mathbb{1}(\hat{y}(o) = r) \, p_{\tilde{D}}(o) \, p \, do
= p \cdot P_{o \sim \tilde{D}}[\hat{y}(o) = r],
\end{align*}

where we use the fact that the E_t are disjoint by construction for the first equality, and we use o without the time index t for the third equality, since conditioned on E_t it clearly denotes the last observation. Moreover, the last inequality holds due to the following: (i) p_{D_{π_shield}}(o | E_t) = p_{D_π}(o | E_t), since E_t implies that the backup policy of π_shield is not activated up to step t, so π_shield = π; and (ii) P_{ζ∼D_{π_shield}}[E_t] ≤ P_{ζ∼D_π}[E_t], since π_shield is by design less likely than π to reach unsafe states. Thus, the constraint in (9) implies p_unsafe ≤ ξ with probability at least 1 − δ, so the claim follows.

F ADDITIONAL DISCUSSION

F.1 LIMITATION TO ON-DISTRIBUTION SETTING

Our PAC guarantees (i.e., Theorems 1 & 5) transfer to the test distribution if it is identical to the validation distribution. Providing theoretical guarantees for out-of-distribution data is an important direction; nevertheless, we believe that our work is an important stepping stone towards this goal. In particular, to the best of our knowledge, no existing work provides theoretical guarantees on calibrated probabilities even for the in-distribution case. One possible direction is to use our approach in conjunction with covariate shift detectors, e.g., (Gretton et al., 2012).
Alternatively, it may be possible to directly incorporate ideas from recent work on calibrated prediction with covariate shift (Park et al., 2020b) or uncertainty set prediction with covariate shift (Cauchois et al., 2020; Tibshirani et al., 2019). In particular, we can use importance weights q(x)/p(x), where p(x) is the training distribution and q(x) is the test distribution, to reweight our training examples, enabling us to transfer our guarantees from the training set to the test distribution. The key challenge arises when these weights are unknown. In this case, we can estimate them given a set of unlabeled examples from the test distribution (Park et al., 2020b), but we then need to account for the error in our estimates.
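The reweighting step described above can be sketched with a self-normalized importance-weighted estimator. This is an illustrative sketch under the assumption that the weights w(x) = q(x)/p(x) are known exactly; when they are estimated, their error must also be accounted for, as noted above:

```python
def importance_weighted_mean(values, weights):
    """Self-normalized importance-weighted mean: re-weights samples drawn
    from the source distribution p by w(x) = q(x)/p(x) to estimate an
    expectation (e.g., per-bin accuracy) under the test distribution q.
    (Illustrative sketch; names are ours, not the paper's code.)"""
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total
```

For example, applying this to the per-bin correctness indicators would yield a covariate-shift-adjusted analogue of the empirical bin confidence.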

F.2 BASELINES

The following includes brief descriptions of the baselines that we used in our experiments.

Histogram binning. This algorithm calibrates the top-label confidence prediction of f by sorting the calibration examples (x, y) into bins B_i based on their predicted top-label confidence, i.e., (x, y) is associated with B_i if p̂(x) ∈ B_i. Then, for each bin, it computes the empirical confidence

$$\hat{p}_i := \frac{1}{|S_i|} \sum_{(x,y) \in S_i} \mathbb{1}(\hat{y}(x) = y),$$

where S_i is the set of labeled examples associated with bin B_i, i.e., the empirical counterpart of the true confidence in (2). Finally, at test time, it returns the predicted confidence p̂_i for any future test example x with p̂(x) ∈ B_i.

(1 − ξ′)-softmax. In fast DNN inference, a threshold can be heuristically chosen based on the desired relative error ξ and the validation error of the slow model. In particular, when a cascading classifier consists of two branches, i.e., M = 2, the threshold of the first branch is chosen as γ_1 = 1 − ξ′, where ξ′ is the sum of ξ and the validation error of the slow model. We call this approach (1 − ξ′)-softmax.

(1 − ξ′)-temperature scaling. A more advanced approach is to first calibrate each branch to obtain better confidences. We use temperature scaling to do so, i.e., we first calibrate each branch using temperature scaling, and then set the branch threshold to γ_1 = 1 − ξ′ when M = 2. We call this approach (1 − ξ′)-temperature scaling.
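The histogram binning baseline described above can be sketched in a few lines; equal-width bins and the empty-bin fallback are our assumptions for the sketch, since the description does not prescribe them:

```python
def fit_histogram_binning(top_probs, correct, K):
    """Histogram binning: partition [0, 1] into K equal-width bins and
    return the empirical accuracy of each bin (the empirical confidence
    p_hat_i described above). Falls back to the raw bin midpoint when a
    bin is empty (our choice, not prescribed by the paper)."""
    counts = [0] * K
    hits = [0] * K
    for p, c in zip(top_probs, correct):
        k = min(int(p * K), K - 1)  # bin index for p in [k/K, (k+1)/K)
        counts[k] += 1
        hits[k] += c
    return [hits[k] / counts[k] if counts[k] else (k + 0.5) / K
            for k in range(K)]

def predict_confidence(bin_means, p):
    """Test-time lookup: return the calibrated confidence for an example
    whose raw top-label confidence is p."""
    K = len(bin_means)
    return bin_means[min(int(p * K), K - 1)]
```

Our approach replaces the per-bin mean with a per-bin Clopper-Pearson interval, which is what enables the PAC guarantee.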

F.3 CALIBRATION: INDUCED INTERVALS FOR ECE AND RELIABILITY DIAGRAM

ECE. The expected calibration error (ECE), which is one way to measure calibration performance, is defined as follows:

$$\mathrm{ECE} := \sum_{j=1}^{J} \frac{|S_j|}{|S|} \left| \frac{1}{|S_j|} \sum_{(x,y) \in S_j} \hat{p}(x) - \frac{1}{|S_j|} \sum_{(x,y) \in S_j} \mathbb{1}(\hat{y}(x) = y) \right|,$$

where J is the total number of bins for the ECE, S ⊆ X × Y is the evaluation set, and S_j is the set of labeled examples associated with the jth bin, i.e., (x, y) ∈ S_j if p̂(x) ∈ B_j. A confidence coverage predictor Ĉ(x) outputs an interval instead of a point estimate p̂(x) of the confidence. To evaluate the confidence coverage predictor, we remap intervals using the ECE formulation. In particular, we equivalently represent Ĉ(x) by a mean confidence ĉ_f(x) and differences from the mean, i.e., Ĉ(x) = [c̲(x), c̄(x)] = [ĉ_f(x) − ε_x, ĉ_f(x) + ε̄_x] (see Appendix A for a description of this equivalent representation). Then, we sort each labeled example into bins using ĉ_f(x) to form S_j. Next, we consider an interval instead of p̂(x) to compute the ECE, i.e., ECE_induced := [ECE̲, ECĒ], where

$$\underline{\mathrm{ECE}} := \sum_{j=1}^{J} \frac{|S_j|}{|S|} \inf_{p_j \in \mathrm{Conf}_j} \left| p_j - \frac{1}{|S_j|} \sum_{(x,y) \in S_j} \mathbb{1}(\hat{y}(x) = y) \right|, \quad
\overline{\mathrm{ECE}} := \sum_{j=1}^{J} \frac{|S_j|}{|S|} \sup_{p_j \in \mathrm{Conf}_j} \left| p_j - \frac{1}{|S_j|} \sum_{(x,y) \in S_j} \mathbb{1}(\hat{y}(x) = y) \right|,$$

and Conf_j is the confidence interval induced for the jth bin (defined in the sequel).

F.4 FAST DNN INFERENCE: CASCADING CLASSIFIER TRAINING

We describe a way to train a cascading classifier with M branches. Basically, we independently train M different neural networks with a shared backbone. In particular, we first train the Mth branch on a training set by minimizing a cross-entropy loss. Then, we train the (M − 1)th branch, which consists of two parts: a backbone part from the Mth branch, and the head of the (M − 1)th branch. Here, the backbone part was already trained in the previous stage, so we do not update it and only train the head of this branch on the same training set by minimizing the same cross-entropy loss (with the same optimization hyperparameters). This step is repeated down to the first branch.

F.5 SAFE PLANNING: DATA COLLECTION FROM A SIMULATOR

We collect the required data from a simulator, where a given policy π has already been learned over the simulator. We describe how we form the necessary data from rollouts sampled from the simulator. First, to sample a rollout ζ, we run π from a random initial state x_0 ∼ D; we denote the sequence of states visited as a rollout ζ(x_0, π) := (x_0, x_1, …). We denote the induced distribution over rollouts by ζ ∼ D_π. Note that ζ may contain unsafe states, since π is potentially unsafe. However, when constructing our recoverability classifier, we only use the sequence of safe states, followed by a single unsafe state. In particular, we let W be a set of i.i.d. sampled rollouts ζ ∼ D_π. Next, for a given rollout, we consider the observation at the first unsafe state in that rollout (if one exists); we denote the distribution over such observations by D̃. Finally, we take Z to be a set of sampled observations o ∼ D̃.

Figure 12: Ablation study on various ξ for n = 20,000 and δ = 10^-2.
Each plot uses the shown desired unsafe rate ξ with the same n and δ. The "naive", "ξ-naive", and "histogram binning" curves represent baseline results on the safety rate and success rate. The proposed approach is labeled "rigorous". The desired safety rate is user-specified; the safety rate of each method must be above this line. The safety rate of the proposed approach is closely above the desired safety rate, satisfying the safety constraint. However, the naive approach can violate the safety constraint, and the ξ-naive approach can be overly optimistic. The histogram binning and ξ-naive approaches look fine empirically, but they can in theory violate the desired safety rate (e.g., see Figure 4b).

Figure 13: Ablation study on various δ for n = 20,000 and ξ = 0.1. Each plot uses the shown confidence level δ with the same n and ξ. The "naive", "ξ-naive", and "histogram binning" curves represent baseline results on the safety rate and success rate. The proposed approach is labeled "rigorous". The desired safety rate is user-specified; the safety rate of each method must be above this line. The proposed approach produces more conservative safety rates, i.e., a larger gap between the estimated safety rate and the desired safety rate, as we enforce a stronger confidence level, i.e., from δ = 10^-1 to δ = 10^-4. Note that the other baselines do not depend on δ by design.
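The data-collection procedure described in Appendix F.5 (keeping, per rollout, only the observation at the first unsafe state) can be sketched as follows; the helper names `is_unsafe` and `observe` are ours, standing in for the simulator's safety check and the sensor model h:

```python
def collect_unsafe_observations(rollouts, is_unsafe, observe):
    """Build the calibration set Z from sampled rollouts: for each
    rollout (a list of states), keep only the observation at the first
    unsafe state, if one exists. (Illustrative sketch of the procedure
    in Appendix F.5, not the paper's released code.)"""
    Z = []
    for states in rollouts:
        for x in states:
            if is_unsafe(x):
                Z.append(observe(x))
                break  # only the first unsafe state per rollout
    return Z
```

Rollouts that never reach an unsafe state contribute nothing to Z, matching the "if one exists" clause in the text.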



Footnote 1: Since |Y| = 2, f can be represented as a map f : X → [0, 1]; the second component is simply 1 − f(x).
Footnote 2: We can handle infinitely long rollouts, but in practice rollouts will be finite (though possibly arbitrarily long); D_{π_shield} is the induced distribution over rollouts ζ(x_0, π_shield).



Figure 1: A model composed in a cascading way for M = 4.

Figure 2: Calibration comparison; default parameters of Ĉ are K = 20, n = 20, 000, and δ = 10 -2 .

Varying desired error rate ξ.

Figure 3: Fast DNN inference results; default parameters are n = 20, 000, ξ = 0.02, and δ = 10 -3 .

Varying desired safety rate ξ.

Figure 4: Safe planning results; default parameters are: n = 20, 000, ξ = 0.1, δ = 10 -2 .

The induced interval for the jth bin is

$$\mathrm{Conf}_j := [\underline{\mathrm{Conf}}_j, \overline{\mathrm{Conf}}_j] := \left[\min_{(x,y) \in S_j} \underline{c}(x),\; \max_{(x,y) \in S_j} \overline{c}(x)\right].$$

Reliability diagram. This evaluation technique is a pictorial summary of the ECE, where the x-axis represents the mean confidence (1/|S_j|) Σ_{(x,y)∈S_j} p̂(x) for each bin, and the y-axis represents the mean accuracy (1/|S_j|) Σ_{(x,y)∈S_j} 1(ŷ(x) = y) for each bin. If an interval from a confidence coverage predictor is given, then the mean confidence is replaced by Conf_j, resulting in visualizing an interval instead of a point.
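The point-estimate ECE defined in Appendix F.3 can be computed as follows; this is a minimal sketch with equal-width bins, and for the induced variant the mean-confidence term would be replaced by the inf/sup over Conf_j:

```python
def ece(confs, correct, J=10):
    """Expected calibration error with J equal-width bins: the bin-weighted
    absolute gap between mean confidence and mean accuracy."""
    bins = [[] for _ in range(J)]
    for p, c in zip(confs, correct):
        bins[min(int(p * J), J - 1)].append((p, c))
    total = len(confs)
    err = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing (weight zero)
        mean_conf = sum(p for p, _ in members) / len(members)
        mean_acc = sum(c for _, c in members) / len(members)
        err += len(members) / total * abs(mean_conf - mean_acc)
    return err
```

A perfectly calibrated predictor (confidence equal to accuracy in every bin) attains an ECE of zero, which is the "ideal" diagonal in the reliability diagrams.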

Figure 5: Calibration comparison in reliability diagrams and ECEs. The size of the validation set for calibration is 20,000. The blue histogram represents the fraction of examples in each bin. The diagonal line labeled "ideal" is the best possible reliability diagram, which yields zero ECE. The estimated reliability diagram is labeled "estimated" in dotted red. (a) The naïve softmax output from a neural network is unreliable in terms of ECE. (b, c) Temperature scaling and histogram binning are fairly good calibration approaches, which decrease the ECE. (d) The proposed approach generates "induced" intervals (see Appendix F.3) on top of the histogram binning approach, where each interval contains the ideal diagonal line with high probability. Moreover, the proposed approach also produces an induced ECE interval, which contains zero ECE with high probability.

Figure 7: Accuracy-confidence plot. This plot is useful for choosing a confidence threshold that achieves a desired conditional accuracy (Lakshminarayanan et al., 2017); the x-axis is the confidence threshold t and the y-axis is the empirical value of the conditional accuracy P[ŷ(x) = y | p̂(x) ≥ t]. Since our approach outputs an interval [c̲(x), c̄(x)], we plot P[ŷ(x) = y | c̲(x) ≥ t] and P[ŷ(x) = y | c̄(x) ≥ t], which form the upper and lower bounds of the green area, respectively.

Figure 8: Ablation study on various n for ξ = 0.02, δ = 10^-2, and M = 2. Each plot uses the shown number of samples n with the same ξ and δ. The "estimated trade-off" curve represents the error-MACs trade-off as a function of the threshold γ_1. The markers show the trade-offs of the baselines. The "desired error" is a user-specified error bound; the error of each method must be below this line. The proposed approach reduces the inference time as the number of samples increases, while satisfying the desired error bound. The baselines either fail to satisfy the desired error bound or are overly conservative in satisfying it.

Figure 9: Ablation study on various ξ for n = 20,000, δ = 10^-2, and M = 2. Each plot uses the shown desired relative error ξ with the same n and δ. The "estimated trade-off" curve represents the error-MACs trade-off as a function of the threshold γ_1. The markers show the trade-offs of the shown baselines. The "desired error" is a user-specified error bound; the error of each method must be below this line. The proposed approach admits more error, up to the specified desired error, in order to reduce the inference time. The other baselines either overly increase the error or conservatively maintain an overly low error.

Figure 10: Ablation study on various δ for n = 20,000, ξ = 0.02, and M = 2. Each plot uses the shown confidence level δ with the same n and ξ. The "estimated trade-off" curve represents the error-MACs trade-off as a function of the threshold γ_1. The markers show the trade-offs of the shown baselines. The "desired error" is a user-specified error bound; the error of each method must be below this line. The proposed approach produces a larger gap between the actual error and the desired error as we enforce a stronger confidence level, i.e., from δ = 10^-1 to δ = 10^-4. Note that the other baselines do not depend on δ by design.

Published as a conference paper at ICLR 2021

ACKNOWLEDGMENTS

This work was supported in part by AFRL/DARPA FA8750-18-C-0090, ARO W911NF-20-1-0080, DARPA FA8750-19-2-0201, and NSF CCF 1910769. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Air Force Research Laboratory (AFRL), the Army Research Office (ARO), the Defense Advanced Research Projects Agency (DARPA), or the Department of Defense, or the United States Government.


Figure 11: Ablation study on various n for ξ = 0.1 and δ = 10^-2. Each plot uses the shown sample size n with the same ξ and δ. The "naive", "ξ-naive", and "histogram binning" curves represent baseline results on the safety rate and success rate. The proposed approach is labeled "rigorous". The desired safety rate is user-specified; the safety rate of each method must be above this line. The safety rate of the proposed approach is above the desired safety rate, and it tends to be closer to the desired safety rate as n increases.

