ESTIMATING AND EVALUATING REGRESSION PREDICTIVE UNCERTAINTY IN DEEP OBJECT DETECTORS

Abstract

Predictive uncertainty estimation is an essential next step for the reliable deployment of deep object detectors in safety-critical tasks. In this work, we focus on estimating predictive distributions for bounding box regression output with variance networks. We show that in the context of object detection, training variance networks with negative log likelihood (NLL) can lead to high entropy predictive distributions regardless of the correctness of the output mean. We propose to use the energy score as a non-local proper scoring rule and find that when used for training, the energy score leads to better calibrated and lower entropy predictive distributions than NLL. We also address the widespread use of non-proper scoring metrics for evaluating predictive distributions from deep object detectors by proposing an alternate evaluation approach founded on proper scoring rules. Using the proposed evaluation tools, we show that although variance networks can be used to produce high quality predictive distributions, ad-hoc approaches used by seminal object detectors for choosing regression targets during training do not provide wide enough data support for reliable variance learning. We hope that our work helps shift evaluation in probabilistic object detection to better align with predictive uncertainty evaluation in other machine learning domains. Code for all models, evaluation, and datasets is available at: https://github.com/asharakeh/probdet.git.

1. INTRODUCTION

Deep object detectors are being increasingly deployed as perception components in safety-critical robotics and automation applications. For reliable and safe operation, subsequent tasks using detectors as sensors require meaningful predictive uncertainty estimates correlated with their outputs. As an example, overconfident incorrect predictions can lead to non-optimal decision making in planning tasks, while underconfident correct predictions can lead to under-utilizing information in sensor fusion. This paper investigates probabilistic object detectors, extensions of standard object detectors that estimate predictive distributions for output categories and bounding boxes simultaneously. We aim to identify the shortcomings of recent trends followed by state-of-the-art probabilistic object detectors, and provide theoretically founded solutions for the identified issues. Specifically, we observe that the majority of state-of-the-art probabilistic object detection methods (Feng et al., 2018a; Le et al., 2018; Feng et al., 2018b; He et al., 2019; Kraus & Dietmayer, 2019; Meyer et al., 2019; Choi et al., 2019; Feng et al., 2020; He & Wang, 2020; Harakeh et al., 2020; Lee et al., 2020) build on deterministic object detection backends, estimating bounding box predictive distributions by extending such backends with variance networks (Detlefsen et al., 2019). The mean and variance of bounding box predictive distributions estimated using variance networks are then learnt using negative log likelihood (NLL). It is also common for these methods to use non-proper scoring rules such as the mean Average Precision (mAP) when evaluating the quality of their output predictive distributions.

Pitfalls of NLL

We show that under standard training procedures used by common object detectors, using NLL as a minimization objective results in variance networks that output high entropy distributions regardless of the correctness of an output bounding box.
We address this issue by using the Energy Score (Gneiting & Raftery, 2007), a distance-sensitive proper scoring rule based on energy statistics (Székely & Rizzo, 2013), as an alternative for training variance networks. We show that predictive distributions learnt with the energy score are lower entropy, better calibrated, and of higher quality when evaluated using proper scoring rules.

Pitfalls of Evaluation

We address the widespread use of non-proper scoring rules for evaluating probabilistic object detectors by providing evaluation tools based on well established proper scoring rules (Gneiting & Raftery, 2007) that are only minimized if the estimated predictive distribution is equal to the true target distribution, for both classification and regression. Using the proposed tools, we benchmark probabilistic extensions of three common object detection architectures on in-distribution, shifted, and out-of-distribution data. Our results show that variance networks can differentiate between in-distribution, shifted, and out-of-distribution data using their predictive entropy. We find that ad-hoc approaches used by seminal object detectors for choosing their regression targets during training do not provide a wide enough data support for reliable learning in variance networks. Finally, we provide clear recommendations in Sec. 5 to avoid the pitfalls described above.

2. RELATED WORK

Estimating predictive distributions with deep neural networks has long been a topic of interest for the research community. Bayesian Neural Networks (BNNs) (MacKay, 1992) quantify predictive uncertainty by approximating a posterior distribution over a set of network parameters given a predefined prior distribution. Variance networks (Nix & Weigend, 1994) capture predictive uncertainty (Kendall & Gal, 2017) by estimating the mean and variance of every output through separate neural network branches, and are usually trained using maximum likelihood estimation (Detlefsen et al., 2019). Deep ensembles (Lakshminarayanan et al., 2017) train multiple copies of variance networks from different network initializations to estimate predictive distributions from output sample sets. Monte Carlo (MC) Dropout (Gal & Ghahramani, 2016) provides predictive uncertainty estimates based on output samples generated by activating dropout layers at test time. We refer the reader to the work of Detlefsen et al. (2019) for an in-depth comparison of the performance of variance networks, BNNs, ensembles, and MC dropout on regression tasks. We find variance networks to be the most scalable of these methods for the object detection task. Finally, we do not distinguish between aleatoric and epistemic uncertainty as is done in Kendall & Gal (2017), instead focusing on predictive uncertainty (Detlefsen et al., 2019), which reflects both types.

State-of-the-art probabilistic object detectors model predictive uncertainty by adapting the work of Kendall & Gal (2017) to state-of-the-art object detectors. Standard detectors are extended with a variance network, usually referred to as the variance regression head, alongside the mean bounding box regression head, and the resulting network is trained using NLL (Feng et al., 2018a; Le et al., 2018; He et al., 2019; Lee et al., 2020; Feng et al., 2020; He & Wang, 2020).
Some approaches combine variance networks with dropout (Feng et al., 2018b; Kraus & Dietmayer, 2019) and use Monte Carlo sampling at test time. Others (Meyer et al., 2019; Choi et al., 2019; Harakeh et al., 2020) make use of the predicted variance by modifying the non-maximum suppression postprocessing stage. Such modifications are orthogonal to the scope of this paper. It is important to note that a substantial portion of existing probabilistic object detectors rely on non-proper scoring metrics such as mAP and calibration errors to evaluate the quality of their predictive distributions. More recent methods (Harakeh et al., 2020; He & Wang, 2020) use the probability-based detection quality (PDQ) proposed by Hall et al. (2020) for evaluating probabilistic object detectors, which can also be shown to be non-proper (see Appendix D.3). Instead, we combine the error decomposition proposed by Hoiem et al. (2012) with well-established proper scoring rules (Gneiting & Raftery, 2007) to evaluate probabilistic object detectors.

3. PROBLEM FORMULATION AND SCORING RULES

3.1. PROBLEM FORMULATION

We consider object detectors that, given an input image x, estimate a category y and a bounding box z for every object in the scene. Given a training dataset D = {(x_n, y_n, z_n)}_{n=1}^{N} of N i.i.d. samples from a true joint conditional probability distribution p*(y, z|x) = p*(z|y, x) p*(y|x), we use neural networks with parameter vector θ to model p_θ(z|y, x) and p_θ(y|x). p_θ(z|y, x) is fixed to be a multivariate Gaussian distribution N(μ(x, θ), Σ(x, θ)), and p_θ(y|x) a categorical distribution Cat(p_1(x, θ), ..., p_K(x, θ)). Unless mentioned otherwise, z ∈ R^4 and is represented as (u_min, v_min, u_max, v_max), where (u_min, v_min) and (u_max, v_max) are the pixel coordinates of the top-left and bottom-right bounding box corners respectively. Throughout this work, we denote random variables with bold characters, and the associated ground truth instances of random variables realized in the dataset D are italicized.

3.2. PROPER SCORING RULES

Let a denote either y or z. A scoring rule is a function S(p_θ, (a, x)) that assigns a numerical value to the quality of the predictive distribution p_θ(a|x) given the actual event that materialized, a|x ∼ p*(a|x), where a lower value indicates better quality. With slight abuse of notation, let S(p_θ, p*) also refer to the expected value of S(p_θ, ·); a scoring rule is then said to be proper if S(p*, p*) ≤ S(p_θ, p*), with equality if and only if p_θ = p*, meaning that the actual data generating distribution is assigned the lowest possible score value (Gneiting & Raftery, 2007). We provide a more formal definition of proper scoring rules in Appendix D.2. Beyond the notion of properness, scoring rules can be further divided into local and non-local rules. Local scoring rules evaluate a predictive distribution based on its value only at the true target, whereas non-local rules take into account other characteristics of the predictive distribution. As an example, distance-sensitive non-local proper scoring rules reward predictive distributions that assign probability mass to the vicinity of the true target, even if not exactly at that target. Lakshminarayanan et al. (2017) noted the utility of proper scoring rules as neural network loss functions for learning predictive distributions. Predictive uncertainty for classification tasks has been extensively studied in recent literature (Ovadia et al., 2019; Ashukha et al., 2020). We find commonly used proper scoring rules such as NLL and the Brier score (Brier, 1950) satisfactory for learning and evaluating categorical predictive distributions in probabilistic object detectors. On the other hand, regression tasks have overwhelmingly relied on a single proper scoring rule, the negative log likelihood (Kendall & Gal, 2017; Lakshminarayanan et al., 2017; Detlefsen et al., 2019). NLL is a local scoring rule and should be satisfactory if used to evaluate pure inference problems (Bernardo, 1979).
In addition, the choice of proper scoring rule should not matter asymptotically: the true parameters of p_θ should be recovered by minimizing any proper scoring rule (Gneiting & Raftery, 2007) during training. Unfortunately, object detection does not conform to these idealized assumptions; we show in the next section the pitfalls of using NLL for learning and evaluating bounding box predictive distributions. We also explore the Energy Score (Gneiting & Raftery, 2007), a proper and non-local scoring rule, as an alternative for learning and evaluating multivariate Gaussian predictive distributions.
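The contrast between a local rule (NLL) and a non-local rule (the Brier score) for categorical outputs can be made concrete with a small numpy sketch; the class probabilities below are hypothetical and chosen only for illustration:

```python
import numpy as np

def categorical_nll(p, k):
    """Negative log likelihood of predicted class probabilities p at true class k (local rule)."""
    return -np.log(p[k])

def brier_score(p, k):
    """Brier score: squared distance between p and the one-hot target (non-local rule)."""
    onehot = np.zeros_like(p)
    onehot[k] = 1.0
    return np.sum((p - onehot) ** 2)

# Two forecasts with identical mass at the true class (k = 0) but different tails.
p_concentrated = np.array([0.5, 0.5, 0.0])   # residual mass piled on one wrong class
p_spread = np.array([0.5, 0.25, 0.25])       # residual mass spread across classes

# NLL, which only sees the mass at the true class, cannot tell them apart.
assert categorical_nll(p_concentrated, 0) == categorical_nll(p_spread, 0)
# The Brier score, which sees the whole probability vector, can.
assert brier_score(p_concentrated, 0) != brier_score(p_spread, 0)
```

This is the same effect discussed later for DETR false positives, where two predictive distributions with equal NLL receive different Brier scores depending on how the remaining probability mass is distributed.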

3.3. NEGATIVE LOG LIKELIHOOD AS A SCORING RULE

For a multivariate Gaussian, the NLL can be written as:

NLL = 1/(2N) Σ_{n=1}^{N} [ (z_n − μ(x_n, θ))^T Σ(x_n, θ)^{-1} (z_n − μ(x_n, θ)) + log det Σ(x_n, θ) ],   (1)

where N is the size of the dataset D. NLL is the only proper scoring rule that is also local (Bernardo, 1979), with higher values implying a worse predictive density quality at the true target value. When used as a loss to minimize, the first term of NLL encourages increasing the entropy of the predictive distribution as the mean estimate diverges from the true target value. The log determinant regularization term has a contrasting effect, penalizing high entropy distributions and preventing a zero loss from infinitely high uncertainty predictions at all data points (Kendall & Gal, 2017). It has been shown by Machete (2013) that NLL prefers predictive densities that are less informative, penalizing overconfidence even when the probability mass is concentrated on a likely outcome. We demonstrate the relevance of this property to object detection with a simple toy example. Bounding box results from the state-of-the-art object detector DETR (Carion et al., 2020), trained to achieve a competitive mAP of 42% on the COCO validation split, are assigned a single mock multivariate Gaussian probability distribution N(μ(x_n, θ), σI), where σI is a 4 × 4 isotropic covariance matrix with σ as a variable parameter. We split the detection results into a high error set with IOU ≤ 0.5, and a low error set with IOU > 0.5, where the IOU is determined as the maximum IOU with any ground truth bounding box in the scene. We plot the value of the NLL loss of both high error and low error detection instances in Figure 1, where NLL is estimated at values of σ in [10^{-2}, 10^5]. As expected, the variance value that minimizes NLL for low error detections is two orders of magnitude lower than for high error detections.
What is more interesting is the behavior of NLL away from its minimum for both low error and high error detections. NLL is seen to penalize lower entropy distributions (smaller values of σ) more severely than higher entropy distributions, a property that is shown to be detrimental for training variance networks in Section 4.
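The toy experiment can be reproduced in miniature with numpy. The targets below are illustrative stand-ins for low and high error detections, not the actual DETR outputs used for Figure 1:

```python
import numpy as np

def gaussian_nll_isotropic(z, mu, sigma, d=4):
    """NLL of a d-dim Gaussian N(mu, sigma * I) at target z (additive constants dropped),
    mirroring the Mahalanobis and log-determinant terms of equation 1."""
    sq_err = np.sum((z - mu) ** 2)
    return 0.5 * (sq_err / sigma + d * np.log(sigma))

mu = np.zeros(4)
z_low_err = mu + 0.1          # stand-in for a well-localized detection
z_high_err = mu + 10.0        # stand-in for a badly localized detection

sigmas = np.logspace(-2, 5, 200)
nll_low = [gaussian_nll_isotropic(z_low_err, mu, s) for s in sigmas]
nll_high = [gaussian_nll_isotropic(z_high_err, mu, s) for s in sigmas]

# The minimizing variance for low error detections is orders of magnitude smaller.
assert sigmas[np.argmin(nll_low)] < sigmas[np.argmin(nll_high)] / 100

# Away from the minimum, NLL penalizes underestimated variance far more severely
# than overestimated variance (equal multiplicative distances from the minimizer).
s_star = sigmas[np.argmin(nll_high)]
assert (gaussian_nll_isotropic(z_high_err, mu, s_star / 100)
        > gaussian_nll_isotropic(z_high_err, mu, s_star * 100))
```

The second assertion captures the asymmetry discussed above: shrinking σ below the minimizer raises NLL much faster than inflating it by the same factor, which is the incentive that pushes NLL-trained variance networks toward high entropy outputs.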

3.4. THE ENERGY SCORE (ES)

The energy score is a strictly proper and non-local (Gneiting et al., 2008) scoring rule used to assess probabilistic forecasts of multivariate quantities. The energy score can be written as:

ES = 1/N Σ_{n=1}^{N} [ 1/M Σ_{i=1}^{M} ||z_{n,i} − z_n|| − 1/(2M^2) Σ_{i=1}^{M} Σ_{j=1}^{M} ||z_{n,i} − z_{n,j}|| ],   (2)

where z_n is the ground truth bounding box, and z_{n,i} is the i-th i.i.d. sample from N(μ(x_n, θ), Σ(x_n, θ)). The energy score is derived from the energy distance (Rizzo & Székely, 2016), a maximum mean discrepancy metric (Sejdinovic et al., 2013) that measures the distance between distributions of random vectors. Beyond its theoretical appeal, the energy score has an efficient Monte-Carlo approximation (Gneiting et al., 2008) for multivariate Gaussian distributions, written as:

ES = 1/N Σ_{n=1}^{N} [ 1/M Σ_{i=1}^{M} ||z_{n,i} − z_n|| − 1/(2(M−1)) Σ_{i=1}^{M−1} ||z_{n,i} − z_{n,i+1}|| ],   (3)

which requires only a single set of M samples to be drawn from N(μ(x_n, θ), Σ(x_n, θ)) for every object instance in the minibatch. For training our object detectors, we find that setting M = 1000 allows us to compute the approximation in equation 3 with very little computational overhead. We also use M = 1000 when using the energy score as an evaluation metric. The energy score has been previously used successfully as an optimization objective in DISCO Nets (Bouchacourt et al., 2016) to train neural networks that output samples from a posterior probability distribution, and to train generative adversarial networks (Bellemare et al., 2017). Unlike DISCO Nets, we use the energy score to learn parametric distributions with differentiable sampling. Since the energy distance is non-local, it favors distributions that place probability mass near the ground truth target value, even if not exactly at that target. Going back to the toy example in Figure 1, ES is shown to be minimized at similar values to NLL, an unsurprising observation given that both are proper scoring rules.
Unlike NLL, ES penalizes high entropy distributions more severely than low entropy ones. In the next section, we observe the effects of this property when using the energy score as a loss, leading to better calibrated, lower entropy predictive distributions when compared to NLL.
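The single-sample-set approximation of equation 3 can be sketched in numpy for one predictive Gaussian (the mean, target, and covariances below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def energy_score_mc(z_true, mu, cov, M=1000):
    """Monte Carlo approximation of the energy score in the form of equation 3,
    using a single set of M samples from the predictive Gaussian N(mu, cov)."""
    samples = rng.multivariate_normal(mu, cov, size=M)
    # First term: mean distance from samples to the ground truth target.
    term1 = np.mean(np.linalg.norm(samples - z_true, axis=1))
    # Second term: mean distance between consecutive samples, halved.
    term2 = np.mean(np.linalg.norm(samples[:-1] - samples[1:], axis=1)) / 2.0
    return term1 - term2

mu = np.zeros(4)
z = mu + 0.5  # target near the predicted mean

es_low_entropy = energy_score_mc(z, mu, 0.1 * np.eye(4))
es_high_entropy = energy_score_mc(z, mu, 100.0 * np.eye(4))

# Unlike NLL, ES penalizes the needlessly high entropy forecast more severely.
assert es_low_entropy < es_high_entropy
```

Because the score only needs samples and norms, it remains cheap to evaluate and, with reparameterized sampling, differentiable with respect to μ and Σ.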

3.5. DIRECT MOMENT MATCHING (DMM)

A final scoring rule we consider in our experiments for learning predictive distributions is Direct Moment Matching (DMM) (Feng et al., 2020):

DMM = 1/N Σ_{n=1}^{N} [ ||z_n − μ(x_n, θ)||_p + ||Σ(x_n, θ) − (z_n − μ(x_n, θ))(z_n − μ(x_n, θ))^T||_p ],

where ||·||_p is a p-norm. DMM was previously proposed by Feng et al. (2020) as an auxiliary loss to calibrate 3D object detectors using a multi-stage training procedure. DMM matches the mean and covariance matrix of the predictive distribution to sample statistics obtained using the predicted and true target values. DMM is not a proper scoring rule; we show this property with an example. If (z_n − μ(x_n, θ)) is a vector of zeros, DMM is minimized only if all entries of Σ(x_n, θ) are 0, regardless of the actual covariance of the data generating distribution. We use results from variance networks trained with DMM to discuss the pitfalls of relying only on distance-sensitive proper scoring rules for evaluation.
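The non-properness argument can be checked numerically. The sketch below assumes the matrix p-norm (here, p = 2) for the covariance term, which is one possible reading of the formula:

```python
import numpy as np

def dmm(z, mu, cov, p=2):
    """Direct Moment Matching loss for a single detection (p-norm form above)."""
    resid = z - mu
    return (np.linalg.norm(resid, ord=p)
            + np.linalg.norm(cov - np.outer(resid, resid), ord=p))

# When the residual is exactly zero, DMM is minimized by a zero covariance
# matrix regardless of the true spread of the data: the rule is not proper,
# since the data generating distribution need not have zero covariance.
z = np.zeros(4)
mu = np.zeros(4)
assert dmm(z, mu, np.zeros((4, 4))) == 0.0
assert dmm(z, mu, np.eye(4)) > 0.0
```

A proper rule would assign its minimum to the true covariance of the data generating distribution; here the minimizer depends only on a single residual.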

4. EXPERIMENTS AND RESULTS

For our experiments, we extend three common object detection methods: DETR (Carion et al., 2020), RetinaNet (Lin et al., 2017), and FasterRCNN (Ren et al., 2015). All probabilistic object detectors are trained on the COCO (Lin et al., 2014) training data split. For testing, the COCO validation dataset is used as in-distribution data. Following recent recommendations for evaluating the quality of predictive uncertainty estimates (Ovadia et al., 2019), we also test our probabilistic object detectors on shifted data distributions. We construct 3 distorted versions of the COCO validation dataset (C1, C3, and C5) by applying 18 different image corruptions introduced by Hendrycks & Dietterich (2019) at increasing intensity levels [1, 3, 5]. To test on natural dataset shift, we use OpenImages data (Kuznetsova et al., 2020) to create a shifted dataset with the same categories as COCO, and an out-of-distribution dataset that contains none of the 80 categories found in COCO. More details on these datasets and their construction can be found in Appendix B. The three deterministic backends are chosen to represent one-stage (RetinaNet), two-stage (FasterRCNN), and the recently proposed set-based (DETR) object detectors. In addition, the implementation of the DETR, RetinaNet, and FasterRCNN models is publicly available under the Detectron2 (Wu et al., 2019) object detection framework, with hyperparameters optimized to produce the best detection results for the COCO dataset.

4.1. EVALUATION PROTOCOL

Following the error decomposition of Hoiem et al. (2012), we partition output detections by their maximum IOU with ground truth: false positives have an IOU below 0.1, and localization errors are instances with 0.1 < IOU < 0.5. Detections with 0.5 ≤ IOU are considered true positives. If multiple detections have 0.5 ≤ IOU with the same ground truth object, the one with the lower classification score is considered a duplicate. In practice, we report the average of all scores for true positives and duplicates at multiple IOU thresholds between 0.5 and 0.95, similar to how mAP is evaluated for the COCO dataset.
Table 1 shows that the number of output objects in each of the four partitions is very similar for probabilistic detectors sharing the same backend, reinforcing the fairness of our evaluation. Duplicates are seen to comprise a small fraction of the output detections, and analyzing them is not found to provide any additional insight over evaluation of the other three partitions (See Figure F.1). Partitioning details and IOU thresholds can be found in Appendix C. True positives and localization errors have corresponding ground truth targets; their predictive distributions can be evaluated using proper scoring rules. As non-local rules, we use the Brier score (Brier, 1950) for evaluating categorical predictive distributions and the energy score for evaluating bounding box predictive distributions. As a local rule, we use the negative log likelihood for evaluating both. False positives are not assigned a ground truth target; we argue that these output instances should be classified as background and assign them the background category as their classification ground truth target. The quality of false positive categorical predictive distributions can then be evaluated using NLL or the Brier score. Bounding box targets cannot be assigned to false positives, and as such we only examine their differential predictive entropy. In addition to using proper scoring rules, we provide mAP, classification marginal calibration error (MCE) (Kumar et al., 2019) and regression calibration error (Kuleshov et al., 2018) results for all methods in Table 1.
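A minimal sketch of the partitioning logic, assuming corner-format boxes and a false positive cutoff of 0.1 implied by the 0.1 < IOU < 0.5 localization error range (duplicate handling and the multi-threshold averaging are omitted for brevity):

```python
import numpy as np

def box_iou(a, b):
    """IOU of two boxes in (u_min, v_min, u_max, v_max) form."""
    lt = np.maximum(a[:2], b[:2])          # top-left corner of the intersection
    rb = np.minimum(a[2:], b[2:])          # bottom-right corner of the intersection
    inter = np.prod(np.clip(rb - lt, 0.0, None))
    area = lambda x: (x[2] - x[0]) * (x[3] - x[1])
    return inter / (area(a) + area(b) - inter)

def partition(max_iou):
    """Assign a detection to an output partition by its maximum IOU with
    ground truth; the 0.1 lower cutoff for false positives is an assumption."""
    if max_iou >= 0.5:
        return "true_positive"
    if max_iou > 0.1:
        return "localization_error"
    return "false_positive"

gt = np.array([0.0, 0.0, 10.0, 10.0])
assert partition(box_iou(np.array([0.0, 0.0, 10.0, 10.0]), gt)) == "true_positive"
assert partition(box_iou(np.array([5.0, 5.0, 15.0, 15.0]), gt)) == "localization_error"
assert partition(box_iou(np.array([20.0, 20.0, 30.0, 30.0]), gt)) == "false_positive"
```

Each partition is then scored separately: proper scoring rules on true positives and localization errors, and predictive entropy only on false positives.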

4.2. RESULTS ANALYSIS

Figures 2 and 3 show the results of evaluating the classification and regression predictive distributions for true positives, localization errors, and false positives under dataset shift and using proper scoring rules. Figure 2 also shows that the bounding box mean squared errors (MSE) for probabilistic extensions of the same backend are very similar, meaning that differences between regression proper scoring rules among these methods arise from different predictive covariance estimates and not predictive mean estimates. Similar to what has been reported by Ovadia et al. (2019) for pure classification tasks, we observe that the quality of both category and bounding box predictive distributions for all output partitions degrades under dataset shift. Probabilistic detectors sharing the same detection backend are shown in Figure 3 to have similar classification scores, which we expect given that we do not modify the classification loss function. However, when comparing regression proper scoring rules in Figure 2, the rank of methods can vary based on which proper scoring rule one looks at, a phenomenon we explore later in this section. Advantages of Our Evaluation: Independently evaluating the quality of predictive distributions for various types of errors leads to a more insightful analysis when compared to standard evaluation using mAP or PDQ (Hall et al., 2020). As an example, Figure 3 shows probabilistic extensions of DETR to have a lower classification NLL, but a higher Brier score for their false positives when compared to their localization errors. We conclude that for its false positives, DETR assigns high probability mass to the correct target category while simultaneously assigning high probability mass to a small number of other erroneous categories, leading to lower NLL but a higher Brier score when compared to localization errors, for which the probability mass is distributed across many categories.
Our observation highlights the importance of using non-local proper scoring rules alongside local scoring rules for evaluation.

Pitfalls of Training and Evaluation Using NLL:

Figure 4 shows the differential entropy of bounding box predictive distributions plotted against the error, measured as the IOU of their means with ground truth boxes. When using DETR or RetinaNet as a backend, variance networks trained with NLL are shown to predict higher entropy values when compared to those trained with ES regardless of the error. This observation does not extend to the FasterRCNN backend, where variance networks trained with NLL and ES are seen to have similar predictive entropy values. Figure 4 also shows the distribution of errors, measured as IOU with ground truth, of targets used to compute regression loss functions during the training of all three detection backends. At all stages of training, the DETR and RetinaNet backends compute regression losses on targets with a much lower IOU than FasterRCNN, which are seen in Figure 4 as low IOU tails in their estimated histograms. We observe a direct correlation between the number of low IOU regression targets used during training and the overall magnitude of the entropy of predictive distributions learnt using NLL. DETR, using the largest number of low IOU regression targets, is seen in Figure 4 to learn the highest entropy predictive distributions, while FasterRCNN, using the fewest, learns the lowest entropy predictive distributions. Variance networks trained with ES do not exhibit similar behavior, producing a consistent entropy magnitude regardless of their backend. Table 1 shows NLL to have a ∼2.7%, ∼0.9%, and ∼0.14% reduction in mAP compared to ES and DMM when used to train DETR, RetinaNet, and FasterRCNN, respectively. Figure 4 shows that the drop in mAP when using NLL for training is directly correlated with the number of low IOU regression targets chosen during training by the three deterministic backends.
Table 1 also shows the low entropy distributions of DETR-ES and RetinaNet-ES to achieve a much lower regression calibration error than the high entropy distributions of DETR-NLL and RetinaNet-NLL. Columns 2 and 3 of Figure 2 show DETR-ES and RetinaNet-ES to have lower negative log likelihood and energy score for bounding box predictive distributions of true positives when compared to DETR-NLL and RetinaNet-NLL. For localization errors (Columns 4 and 5), the deviation from the mean is large, and as such the high predictive entropy provided by DETR-NLL and RetinaNet-NLL leads to lower values on proper scoring rules when compared to DETR-ES and RetinaNet-ES. We notice that if one uses only NLL for evaluation, networks can achieve a constant value of NLL by estimating high entropy predictive distributions regardless of the deviation from the mean, as seen for DETR-NLL on true positives in Figure 2. We can mitigate this issue by evaluating with the energy score, which is seen to distinguish between correct and incorrect high entropy distributions. As an example, DETR-NLL is shown to have lower negative log likelihood but a much higher energy score when compared to DETR-ES for true positives on shifted data. On the other hand, DETR-NLL shows a slightly lower energy score when compared to DETR-ES on localization errors, meaning the energy score can indicate that the high entropy values provided by DETR-NLL are a better estimate for localization errors than the low entropy ones provided by DETR-ES. Considering that true positives outnumber localization errors by at least a factor of two in Table 1, we argue that it is more beneficial to train with the energy score for higher quality true positive predictive distributions over training with the negative log likelihood and predicting high entropy distributions regardless of the true error.
Evaluating only with the energy score also has disadvantages; we show that it does not sufficiently discriminate between the quality of low entropy distributions. Figure 2 shows that DMM-trained networks, which predict the lowest entropy (Figure 4), achieve similar values of the energy score but much higher values of negative log likelihood when compared to networks trained with ES. The higher NLL seen in Figure 2 for all networks trained with DMM indicates lower quality distributions when compared to networks trained with ES, specifically at the correct ground truth target value.

Pitfalls of Common Approaches for Regression Target Assignment:

Figure 4 shows the entropy for all methods with the DETR backend to steadily decrease as a function of decreasing error. On the other hand, the entropy of methods using the RetinaNet and FasterRCNN backends is seen to have two inflection points, one at an IOU of 0.5 and another at a higher IOU of around 0.9. As a function of decreasing error, the entropy increases before the first inflection point, decreases between the two, and then increases again after the second inflection point. We hypothesize that this phenomenon is caused by the way backends choose their regression targets during training. DETR uses optimal assignment to choose regression targets that span the whole range of possible IOU with ground truth, even during the final stages of training, as is visible in Figure 4. On the other hand, RetinaNet and FasterRCNN use ad-hoc assignment with IOU thresholds (Ren et al., 2015), a method that is seen to provide regression targets concentrated in the 0.5 to 0.9 IOU range throughout the training process, resulting in a much narrower data support when compared to DETR. We conclude that outside the data support, variance networks with RetinaNet and FasterRCNN backends fail to provide uncertainty estimates that capture the quality of mean predictions. Our conclusion is not unprecedented: Detlefsen et al. (2019) have previously shown variance networks to perform poorly outside the training data support for multiple regression tasks. However, our analysis pinpoints the reason for such behavior in probabilistic object detectors, showing that well established training approaches based on choosing high IOU regression targets work well for predictive mean estimation, but are not necessarily optimal for estimating predictive uncertainty.

Performance on OOD Data:

Finally, Figure 5 shows histograms of the regression predictive entropy of false positives from probabilistic detectors with DETR and FasterRCNN backends on in-distribution, naturally shifted, and out-of-distribution data. Variance networks trained with any of the three loss functions considered achieve the highest predictive differential entropy on out-of-distribution data, followed by the naturally shifted data, and the lowest entropy on in-distribution data, showing that variance networks are capable of reliably capturing dataset shift when used to predict uncertainty.

5. TAKEAWAYS

We propose to use the energy score, a proper and non-local scoring rule, to train probabilistic detectors. Answering the call for more reliable benchmarks in numerous setups of uncertainty estimation (Ashukha et al., 2020), we also present tools to evaluate probabilistic object detectors using well-established proper scoring rules. We summarize our main findings below:

• No single proper scoring rule can capture all the desirable properties of category classification and bounding box regression predictive distributions in probabilistic object detection. We recommend using both local and non-local proper scoring rules on multiple output partitions for a more expressive evaluation of probabilistic object detectors.

• Using a proper scoring rule as a minimization objective does not guarantee good predictive uncertainty estimates for probabilistic object detectors. Non-local rules, like the energy score, learn better calibrated, lower entropy, and higher quality predictive distributions when compared to local scoring rules like NLL.

A.1 MODEL ARCHITECTURES

All three architectures regress a transformed bounding box representation b = T(z), where T : R^4 → R^4 is an invertible transformation with inverse T^{-1}(·). As an example, FasterRCNN estimates the difference between proposals generated from a region proposal network and ground truth bounding boxes. We train all probabilistic object detectors to estimate the covariance matrix Σ_b of the transformed bounding box representation b. To predict positive semi-definite covariance matrices, each of the three object detection architectures is extended with a covariance regression head that outputs the 10 parameters of the lower triangular matrix L of the Cholesky decomposition Σ_b = LL^T. The diagonal parameters of the matrix L are passed through the exponential function to guarantee that the output covariance Σ_b is positive semi-definite. We train two models for each of the 9 architecture/regression loss combinations, one using a full covariance assumption and the other using a diagonal covariance assumption.
We report the results of the top performing variant, chosen according to its performance on mAP as well as on regression proper scoring rules. Considering the covariance structure as a hyperparameter was necessary for fair evaluation, as we could not get RetinaNet to converge with NLL and a full covariance assumption (See Figure F.5). In the case of a diagonal covariance assumption, the off-diagonal elements of Σ_b are set to 0. The covariance prediction head for each object detection architecture is an exact copy of the bounding box regression head used by that architecture, taking the same feature maps as input. Variance networks are initialized to produce identity covariance matrices, which we find to be key for stable training, especially for the NLL loss.
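The covariance head construction can be sketched in a few lines of numpy; the raw parameter values below are illustrative stand-ins for the head's outputs:

```python
import numpy as np

def covariance_from_head(params):
    """Build Sigma_b = L @ L.T from the 10 raw outputs of the covariance head.
    Diagonal entries of L pass through exp, so the Cholesky factor always has a
    positive diagonal and Sigma_b is a valid covariance matrix."""
    L = np.zeros((4, 4))
    L[np.tril_indices(4)] = params            # fill the lower triangle row by row
    diag = np.diag_indices(4)
    L[diag] = np.exp(L[diag])                 # exponentiate diagonal entries
    return L @ L.T

# Zero raw outputs yield the identity covariance used at initialization.
assert np.allclose(covariance_from_head(np.zeros(10)), np.eye(4))

# Any raw output still yields a symmetric matrix with positive eigenvalues.
sigma_b = covariance_from_head(np.linspace(-1.0, 1.0, 10))
assert np.allclose(sigma_b, sigma_b.T)
assert np.all(np.linalg.eigvalsh(sigma_b) > 0)
```

The first assertion also shows why initializing the head's outputs to zero produces the identity covariance matrices mentioned above.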

A.2 MODEL INFERENCE

When evaluating probabilistic detectors, one needs to be agnostic to the source of probabilistic bounding box predictions. As such, all considered methods are required to provide a consistent output bounding box representation and a corresponding covariance matrix Σ, as presented in Section 3. We approximate z ∼ N(µ, Σ) by drawing 1000 samples from b ∼ N(µ_b, Σ_b), passing those through T^{-1}(·), and then estimating µ and Σ as the sample mean and sample covariance matrix. Note that because of T(·), a diagonal Σ_b does not in general lead to a diagonal Σ. Other than the special consideration required to estimate the final bounding box probability distribution, the inference process is fixed to the one provided in the original implementation for all detectors.
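The sampling-based propagation described above can be sketched as follows. The inverse transformation shown here (a center/log-size parameterization, loosely similar to common detector target encodings) is purely hypothetical and stands in for whichever T^{-1} a given detector actually uses.

```python
import numpy as np

def t_inv(b):
    """Hypothetical inverse transformation: (cx, cy, log w, log h) back to
    corner coordinates (x1, y1, x2, y2). A stand-in for the detector's T^{-1}."""
    cx, cy, lw, lh = b
    w, h = np.exp(lw), np.exp(lh)
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

def propagate_box_distribution(mu_b, sigma_b, n_samples=1000, seed=0):
    """Estimate the corner-space mean and covariance by sampling from
    N(mu_b, Sigma_b) and pushing each sample through T^{-1}."""
    rng = np.random.default_rng(seed)
    samples_b = rng.multivariate_normal(mu_b, sigma_b, size=n_samples)
    samples_z = np.array([t_inv(s) for s in samples_b])
    return samples_z.mean(axis=0), np.cov(samples_z, rowvar=False)
```

Even with a diagonal Σ_b, the resulting corner-space Σ has non-zero off-diagonal entries, because opposite corners share the center coordinates through T^{-1}.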

A.3 MODEL TRAINING

For training all probabilistic extensions of the three object detectors, we adhere to the hyperparameters provided by the original implementation whenever possible. Complex architectures such as DETR need a couple of weeks of training on our hardware setup, and as such searching for the optimal values of hyperparameters for each tested configuration is outside the scope of this paper. All models are trained using a fixed random seed, which is shared across all APIs (numpy, torch, detectron2, etc.). To ensure that empirical results are not determined by lucky convergence, we additionally train using 5 random seeds per configuration and find the results to be consistent, with very small variance in terms of mAP and probabilistic metrics. RetinaNet and FasterRCNN both use ResNet-50 followed by a feature pyramid network (FPN) for feature extraction. Since both models are trained with SGD and momentum in the original implementation, we use the linear scaling rule to scale down from a batch size of 16 to a batch size of 4 due to hardware limitations. We train our probabilistic extensions of those models using 2 GPUs with a learning rate of 0.0025 for RetinaNet and 0.005 for FasterRCNN. Both models are trained for 270000 iterations, and the learning rate is dropped by a factor of 10 at 210000 and then again at 250000 iterations. All additional hyperparameters are left intact. Both RetinaNet and FasterRCNN were trained using a soft mean warmup stage (Detlefsen et al., 2019). This is achieved through loss annealing, where the bounding box regression loss is defined as:

L_reg = (1 − λ) L_original + λ L_probabilistic,    λ = (100^ω − 1) / (100 − 1),    ω = min(1, i / 250000),

where L_original is the regression loss in the original non-probabilistic implementation of the respective object detector, and i is the current training step. The value of 100 as the base of the exponent was chosen using hyperparameter tuning.
This loss formulation ensures that the network starts by emphasizing learning of the bounding box mean, slowly shifting to learning the probabilistic regression loss as training proceeds. For the last 20000 steps, only the probabilistic regression loss is used for training. We found this loss formulation to be essential for the convergence of models trained using the NLL. Each FasterRCNN model takes ∼3 days to train using 2 P-100 GPUs. On the same setup, RetinaNet models take ∼4 days to finish training. DETR also uses ResNet-50 as a base feature extractor. DETR's original implementation requires a very long training schedule for convergence, leading us to use hard mean warmup for all probabilistic DETR models. We use the model parameters provided by DETR's authors after training for 500 epochs to reach 42% mAP on the COCO validation dataset. Weights from this deterministic model are used as initial weights for all probabilistic extensions of DETR, which are trained for an additional 50 epochs using the losses presented in Section 3 and the same hyperparameters as the original deterministic implementation. We reduce the batch size from 64 to 16, but keep the same initial learning rate, since DETR is trained with ADAM. The learning rate is then dropped by a factor of 10 after 30 epochs. Training for 50 epochs takes ∼4 days using 4 T-4 GPUs.
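The soft mean warmup schedule can be written directly from the annealing equations above; this is a minimal sketch, with function names of our choosing.

```python
def annealing_weight(i, warmup_steps=250_000, base=100.0):
    """lambda = (base**omega - 1) / (base - 1) with omega = min(1, i / warmup_steps).

    Grows slowly at first (mean-focused training), reaching 1 at warmup_steps,
    after which only the probabilistic loss remains active.
    """
    omega = min(1.0, i / warmup_steps)
    return (base ** omega - 1.0) / (base - 1.0)

def regression_loss(l_original, l_probabilistic, i):
    """Soft mean warmup: (1 - lambda) * L_original + lambda * L_probabilistic."""
    lam = annealing_weight(i)
    return (1.0 - lam) * l_original + lam * l_probabilistic
```

Because the exponential base is large, λ stays below 0.1 for the first half of the warmup, which is what lets the network focus on the bounding box mean before variance learning takes over.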

B SHIFTED AND OOD DATASETS B.1 SHIFTING COCO DATASET WITH IMAGENET-C CORRUPTIONS

We corrupt the COCO validation dataset using the 18 corruption types proposed in ImageNet-C (Hendrycks & Dietterich, 2019) at 5 increasing levels of intensity. The frames of the COCO validation dataset are corrupted using every corruption type in repeating sequential order, such that the first corruption type is applied to frames 1, 19, 37, . . . , the second to frames 2, 20, 38, . . . , and so on. By increasing the corruption intensity from level 1 to level 5, we create 5 shifted versions of the 5000 frames of the COCO validation dataset, such that every frame is corrupted with the same corruption type, but at an increasing intensity.
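The repeating sequential assignment of corruption types can be expressed as a one-line mapping from frame index to corruption type (a sketch assuming 1-indexed frames and corruption types, as in the description above):

```python
def corruption_for_frame(frame_idx, n_corruptions=18):
    """Corruption type (1-indexed) applied to a 1-indexed frame when the 18
    ImageNet-C corruption types are cycled in repeating sequential order."""
    return (frame_idx - 1) % n_corruptions + 1
```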

B.2 SHIFTED AND OUT OF DISTRIBUTION DATASETS FROM OPENIMAGES-V4

Shifted Dataset: To test methods beyond artificially shifted datasets, we create a new dataset comprising 9,351 frames from the OpenImages-V4 (Kuznetsova et al., 2020) 2D detection data, containing instances belonging to the 80 object categories found in the COCO dataset. This testing data is naturally shifted due to differences in image quality and the different labeling mechanisms employed to generate the ground truth object boxes. Figure B.1 shows the number of instances belonging to each one of the 80 categories in the COCO validation dataset when compared to our generated OpenImages dataset. We tried to maintain the balance of categories found in the COCO validation dataset, as we aim to highlight performance differences originating from distribution shift rather than from the number of instances per category. Out-Of-Distribution Dataset: To test methods on out-of-distribution data, we collect 1,852 frames from OpenImages-V4 containing no instances from any of the 80 categories found in the COCO dataset. All frames were manually checked to minimize the existence of unlabeled in-distribution category instances.

C PARTITIONING ERRORS IN OBJECT DETECTION

To maximize mAP, all our detection architectures are designed to produce a fixed number of detections per frame, usually 100 (Carion et al., 2020), regardless of their classification score. To avoid performing our probabilistic evaluation on objects that can be trivially eliminated by a score threshold, we filter the output of our probabilistic detectors using the classification score threshold that maximizes the F-1 score on the COCO in-distribution dataset. To analyze the predictive distributions provided by object detection models, we partition their filtered results into mutually exclusive subsets by adapting the object detection error decomposition presented by Hoiem et al. (2012). For every detection instance, the intersection-over-union (IOU) is calculated with every ground truth instance in the scene to determine the largest IOU, IOU_max. Based on IOU_max, detection instances are partitioned into false positives, true positives, localization errors, or duplicates. False positives are detection instances for which IOU_max ≤ 0.1, whereas localization errors are those with 0.1 < IOU_max < 0.5. Multiple localization errors can be assigned the same ground truth instance; we argue that such duplication is an artifact of failed post-processing stages usually found in modern object detectors (non-maximum suppression for instance) and should neither be ignored nor lumped with false positives. To avoid a discussion on the definition of true positives, we define an IOU threshold η_tp ∈ {0.5, 0.55, . . . , 0.95} and consider a detection with IOU_max ≥ η_tp a true positive. For a finer error decomposition, if two such detections are assigned the same ground truth instance, the one with the lower classification score is considered a duplicate. The choice of η_tp is inspired by how mean average precision is computed on COCO (Lin et al., 2014).
By computing averages of evaluation metrics on true positives and duplicates determined using multiple IOU thresholds, we aim for a fair evaluation of models that are skewed towards better performance at either low or high IOU. Unlike the error decomposition presented in Hoiem et al. (2012), we do not combine duplicates with localization errors, as they are expected to have different predictive uncertainty qualities. Furthermore, we do not put any constraint on the category of ground truth instances assigned to predicted objects. We argue that well localized but misclassified object instances should not be counted as false positives, but should instead have their predictive categorical distributions evaluated through classification proper scoring rules (Ovadia et al., 2019). Finally, we do not consider false negatives in our evaluation, as analyzing this type of error from a probabilistic detection perspective does not provide additional insights beyond what was provided by Hoiem et al. (2012).
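The partitioning rules above can be sketched as a small decision function. How detections with 0.5 ≤ IOU_max < η_tp are treated when η_tp > 0.5 is not spelled out above, so this sketch folds them into localization errors as an assumption.

```python
def partition_detection(iou_max, is_outscored_duplicate, eta_tp=0.5):
    """Assign a filtered detection to one of four mutually exclusive partitions
    based on its largest IOU with any ground truth instance.

    is_outscored_duplicate: True if a higher-scoring detection was already
    assigned to the same ground truth instance.
    """
    if iou_max <= 0.1:
        return "false_positive"
    if iou_max < eta_tp:
        # Assumption: boxes below the TP threshold (but above 0.1) count as
        # localization errors, matching the eta_tp = 0.5 case in the text.
        return "localization_error"
    return "duplicate" if is_outscored_duplicate else "true_positive"
```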

D EVALUATING PREDICTIVE DISTRIBUTIONS

With the increase in interest surrounding the estimation of predictive distributions in recent deep learning literature, correctly evaluating such distributions given members of a test dataset is of utmost importance. This section aims to explain what characteristics we would like to see in high quality predictive distributions. It also provides a mathematical explanation of proper scoring rules.

D.1 CALIBRATION AND SHARPNESS

Given a predictive distribution p(z|x, D; θ) learnt using a training dataset D, and a testing dataset {(x_n, z_n) | n ∈ {1, . . . , N}}, we need a scoring rule S that assigns a numerical score to the predictive distribution p(z|x_n, D; θ) given the actual event z_n that materialized in the testing dataset. A question naturally arises on which qualities of the predictive distribution need to be captured by S. Gneiting & Raftery (2007) contended that the goal of a predictive distribution is to maximize the sharpness around the materialized event z_n subject to calibration. Calibration (Kuleshov et al., 2018; Kumar et al., 2019) is a joint property of the estimated predictive distribution and the events or values that materialized, reflecting their statistical consistency (Gneiting & Raftery, 2007). In simple words, if a well-calibrated predictive distribution assigns a 0.8 probability to an event, the event should occur around 80% of the time. From its definition, one can see that calibration by itself is not enough to guarantee useful predictive distributions. As an example, the predictive distribution

p(z|x, D) = 1 if z = E[z], and 0 otherwise,

is perfectly calibrated, but not very useful (unless one wants to always predict E[z]). Sharpness, on the other hand, quantifies the concentration of the predictive distribution and is a property of the predictive distribution only. Good predictions need to be sharp, and ideal predictions should be both sharp and well-calibrated. We would like our scoring rules S to capture both the sharpness and the calibration of a predictive distribution.
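Calibration as described above can be checked empirically by measuring how often materialized values fall inside predicted intervals. The following is a generic 1-D illustration under a Gaussian predictor, not part of the paper's evaluation protocol; the quantile is obtained by bisection on erf to keep the sketch dependency-free.

```python
import numpy as np
from math import erf, sqrt

def central_interval_halfwidth(level):
    """Standard-normal quantile for a central interval, via bisection on erf."""
    lo, hi = 0.0, 10.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if erf(mid / sqrt(2.0)) < level:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def interval_coverage(mu, sigma, z, level=0.8):
    """Empirical frequency with which materialized values z fall inside the
    central `level` interval of N(mu, sigma^2). A calibrated predictor
    should achieve coverage close to `level`."""
    q = central_interval_halfwidth(level)
    return float(np.mean(np.abs(np.asarray(z) - mu) <= q * sigma))
```

An overconfident predictor (variance too small) yields coverage well below the nominal level, which is exactly the kind of miscalibration the scoring rules in this section are meant to penalize.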

D.2 PROPER SCORING RULES

Let Ω be a general sample space, A a σ-algebra on Ω, and P a convex class of probability measures on (Ω, A) such that p, p* ∈ P. A scoring rule S : P × Ω → R is written as S(p(z|x_n, D; θ), z_n), mapping a predictive distribution and a materialized event to a scalar value. Throughout the rest of this paper, scoring rules are negatively oriented; that is, the lower the score, the better the predictive distribution. With some abuse of notation, given the true data generating distribution p*(z|x_n), we write the expected score under p* as:

S(p(z|x_n, D; θ), p*(z|x_n)) = ∫ S(p(z|x_n, D; θ), z_n) dp*(z_n|x_n).

A scoring rule is said to be proper relative to P if:

S(p*(z|x_n), p*(z|x_n)) ≤ S(p(z|x_n, D; θ), p*(z|x_n))  ∀ p, p* ∈ P,

and strictly proper relative to P if the above holds with equality only if p = p*. In simple words, a strictly proper scoring rule is only minimized if the predictive distribution is exactly equal to the true data generating distribution. In both cases, a lower score signifies a predictive distribution that is closer to the data generating distribution, so proper scoring rules can be used to rank predictive distributions based on theoretically founded quantities.

D.3 THE PROBABILITY-BASED DETECTION QUALITY (PDQ)

PDQ can be written as:

PDQ(G, D) = (1 / (|G| + N_FP)) Σ_{i,j,f} pPDQ(G_i^f, D_j^f),

where G_i^f is the i-th ground truth object instance in the f-th frame of a dataset, D_j^f is the j-th matched detection from the same frame, |G| is the number of ground truth instances in the dataset, and N_FP is the total number of false positives. The pPDQ can be written as:

pPDQ(G_i^f, D_j^f) = Q_S(G_i^f, D_j^f) · Q_L(G_i^f, D_j^f),

where Q_S is a spatial quality quantifying the quality of the bounding box predictive distribution, and Q_L is a label quality quantifying the quality of the categorical predictive distribution. Let us assume, for the sake of argument, that both Q_S and Q_L are proper scoring rules. Different combinations of Q_S and Q_L can lead to the same value of pPDQ. An extreme example: if Q_S = 0, the pPDQ will be 0 regardless of Q_L. This would result in erroneous predictive distributions having the same pPDQ as correct ones, meaning that PDQ cannot correctly rank probabilistic object detectors. A more substantial problem with PDQ relates to the spatial quality Q_S, which we empirically show to not be a proper scoring rule.
To do so, we generate a toy example consisting of one ground truth bounding box and three corresponding predictive probability distributions. The ground truth bounding box is defined using its top left and bottom right corners: (u_min, v_min), (u_max, v_max). We refer to the first predictive distribution as case 1 and assign it the mean µ_1 = (u_min + 15, v_min + 15), (u_max − 15, v_max − 15), which is the ground truth bounding box shrunk by 15 pixels on the two image axes. The second predictive distribution, case 2, has µ_2 = (u_min − 15, v_min − 15), (u_max + 15, v_max + 15), which is the ground truth bounding box expanded by 15 pixels on the two image axes. The final predictive distribution, case 3, is assigned the ground truth bounding box expanded by 14 pixels: µ_3 = (u_min − 14, v_min − 14), (u_max + 14, v_max + 14). All three predictive distributions are assigned an identical 4 × 4 isotropic covariance matrix σ²I with σ² = 50. The three distributions, along with the ground truth bounding box, are shown in Figure D.1. Our argument is simple: if Q_S is a proper scoring rule, it should rank the three predictive distributions in a manner consistent with NLL and ES. Table 2 shows the NLL, ES, and PDQ results of evaluating the three predictive distributions against the ground truth box sample. The first thing to note is that the NLL and ES values for cases 1 and 2 are identical. This is because both distributions share an identical covariance matrix, as well as an identical error magnitude with respect to the ground truth bounding box (15 pixels in each image axis). However, the value of Q_S for case 1 is around 6% higher than that of case 2. This phenomenon is due to the ad-hoc design choices used to define Q_S in Hall et al. (2020), particularly the way Q_S is built from "foreground" and "background" components.
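The NLL and ES rankings in this toy example can be reproduced up to Monte Carlo error with the short sketch below. The concrete corner coordinates are hypothetical, since only the ±15/±14 pixel offsets and σ² = 50 matter for the ranking.

```python
import numpy as np

GT = np.array([100.0, 100.0, 300.0, 300.0])  # hypothetical ground truth box
CASES = {
    1: GT + np.array([15.0, 15.0, -15.0, -15.0]),   # shrunk by 15 px
    2: GT + np.array([-15.0, -15.0, 15.0, 15.0]),   # expanded by 15 px
    3: GT + np.array([-14.0, -14.0, 14.0, 14.0]),   # expanded by 14 px
}
VAR = 50.0  # isotropic covariance sigma^2 I

def gaussian_nll(mu, var, z):
    """NLL of z under the isotropic Gaussian N(mu, var * I)."""
    err = z - mu
    d = len(mu)
    return 0.5 * (d * np.log(2.0 * np.pi * var) + err @ err / var)

def energy_score(mu, var, z, n=20_000, seed=0):
    """Monte Carlo energy score ES = E||f - z|| - 0.5 E||f - f'||."""
    rng = np.random.default_rng(seed)
    f = rng.normal(mu, np.sqrt(var), size=(n, len(mu)))
    f2 = rng.normal(mu, np.sqrt(var), size=(n, len(mu)))
    return (np.linalg.norm(f - z, axis=1).mean()
            - 0.5 * np.linalg.norm(f - f2, axis=1).mean())
```

Cases 1 and 2 obtain identical NLL (and, up to sampling noise, identical ES), while case 3 scores strictly better under both rules, which is the ranking Q_S fails to reproduce.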
A more definitive proof that Q_S is not proper stems from a comparison between the results of cases 1 and 3 in Table 2 (rows 1 and 3). Case 3 has an error of 14 pixels in every image axis when compared to the ground truth, 1 pixel less than the error of case 1. The lower error translates to lower NLL and ES values for case 3 when compared to case 1, implying that the predictive distribution of case 3 is of higher quality than that of case 1. However, when using Q_S for comparison between the two cases, case 1 has a 2% better value than case 3. In short, Q_S can rank lower quality predictive distributions above higher quality ones (as measured by proper scoring rules such as NLL and ES), and as such is not a proper scoring rule. In addition to not being a proper scoring rule, pPDQ is computed on a single output partition, constructed by using optimal assignment to match each ground truth object to a single detection output. This prevents in-depth analysis of the sources of errors, such as the one presented in Section 4. Finally, PDQ has been previously shown to be gameable, where ad-hoc modifications of predictive distributions lead to performance gains regardless of the theoretical soundness of such modifications. Contestants in a recent probabilistic object detection challenge (CVPR 2019) that relies on PDQ for evaluation showed that replacing the categorical output probability distribution with a one-hot vector representation led to strictly higher PDQ scores than any method attempting to accurately represent the data generating distribution (see Sections 4.3 in (Ammirato & Berg, 2019) and 3.2 in (Wang et al., 2019)). The gameability of PDQ highlights the importance of our suggestion to use proper scoring rules for evaluation, as practitioners are seen to be susceptible to letting go of theoretical soundness in favor of performance gains on evaluation metrics.

E MAXIMUM MEAN DISCREPANCY AND THE ENERGY DISTANCE

Maximum Mean Discrepancy (MMD) has been previously used to train generative models (Li et al., 2015; 2017), where minimizing MMD can be interpreted as matching the moments of the predicted model distribution to the empirical data distribution. In this work we are concerned with the Energy Distance (Rizzo & Székely, 2016), a maximum mean discrepancy (see Sejdinovic et al. (2013)) that is simple and efficient to estimate from distribution samples. Given two independent random vectors f, g ∈ R^d with cumulative distribution functions F, G respectively, the squared Energy Distance (ED) can be written as:

D²(F, G) = 2 E||f − g|| − E||f − f′|| − E||g − g′||,

where f, f′ are i.i.d. samples from F, and g, g′ are i.i.d. samples from G. Rizzo & Székely (2016) show that the energy distance satisfies all axioms of a metric, providing a measure of equality of distributions and ensuring that D²(F, G) = 0 if and only if F = G. In the context of deep learning, the energy distance has been previously used to parallelize text-to-speech generative models (Gritsenko et al., 2020) and to train generative adversarial networks (Bellemare et al., 2017). It can be shown (Gneiting & Raftery, 2007) that Equation 2 can be written as:

ES = E||f − g|| − (1/2) E||f − f′||.

It is easy to see that the energy score presented in equation 11 is, up to a constant factor, the squared energy distance when only one sample, g, is available from G.
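A direct Monte Carlo estimator of the squared energy distance follows immediately from the equation above; this sketch excludes the zero diagonal when averaging within-sample distances so that E||f − f′|| is estimated over distinct pairs.

```python
import numpy as np

def energy_distance_sq(f_samples, g_samples):
    """Monte Carlo estimate of the squared energy distance
    D^2(F, G) = 2 E||f - g|| - E||f - f'|| - E||g - g'||
    from two sample sets of shape (n, d)."""
    f = np.asarray(f_samples)
    g = np.asarray(g_samples)
    cross = np.linalg.norm(f[:, None, :] - g[None, :, :], axis=-1).mean()

    def within(a):
        n = a.shape[0]
        d = np.linalg.norm(a[:, None, :] - a[None, :, :], axis=-1)
        # exclude the zero diagonal: average over the n(n-1) distinct pairs
        return d.sum() / (n * (n - 1))

    return 2.0 * cross - within(f) - within(g)
```

For two sample sets drawn from the same distribution the estimate is close to zero, consistent with D²(F, G) = 0 iff F = G; shifting one distribution makes it strictly positive.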




Footnotes: See Appendix E. For more details see Appendix D.3. https://github.com/facebookresearch/detr/tree/master/d2



Figure 1: A toy example showing values of NLL (Blue), ES (Green) and DMM (Orange) plotted against the parameter σ of an isotropic covariance matrix σI, when assigned to low error (Solid Lines) and high error (Dashed Lines) detection outputs from DETR.

Figure 2: Average over 80 classification categories of NLL, ES, and MSE for bounding box predictive distributions estimates from probabilistic detectors with DETR, RetinaNet, and FasterRCNN backends on in-distribution (COCO), artificially shifted (C1-C5), and naturally shifted (OpenIm) datasets. Error bars represent the 95% confidence intervals around the mean.

Figure 3: Average over 80 classification categories of NLL and Brier score for classification predictive distributions generated using DETR. Error bars represent the 95% confidence intervals around the mean. Similar trends are seen for RetinaNet and FasterRCNN backends in Figures F.2, F.3.

Figure 4: Left: Differential Entropy vs IOU with ground truth plots for bounding box predictive distribution estimates on in-distribution data. Right: Histograms of the IOU of ground truth boxes with boxes assigned as regression targets during network training, plotted at 0%, 50%, and 100% of the training process. The red dashed line signifies the 0.5 IOU level on both plots.

Figure 5: Histogram of differential entropy for false positives bounding box predictive distributions produced by probabilistic detectors with FasterRCNN and RetinaNet as a backend. Results for DETR exhibit similar trends (Figure F.4).

Figure C.1: Detection instances from deterministic FasterRCNN partitioned into true positives (blue), duplicates (magenta), localization errors (teal), and false positives (red) according to η_tp = 0.5.

Figure D.1: A toy example showing the three bounding box distributions used to show the spatial quality Q_S to be a non-proper scoring rule. The green bounding box is the ground truth sample, whereas the red bounding box is the predicted mean of the probability distribution. The 95% confidence ellipse of bounding box corner predictive distributions is also plotted in red.

Figure F.3: Point plots of the Brier Score and NLL for results from FasterRCNN. The same trends are seen as in Figure 3.

Figure F.5: Bar plots showing that our probabilistic detector implementations achieve the same level of performance as their deterministic counterparts (in red) on mean AP. For RetinaNet, using NLL and a full covariance matrix assumption results in a model with much lower mAP than the original RetinaNet implementation.

Table 1: Left: Results of mAP and calibration errors of probabilistic extensions of DETR, RetinaNet, and FasterRCNN. Right: The number of output detection instances classified as true positives (TP), duplicates, localization errors, and false positives (FP).

Table 2: NLL, ES, and Spatial Quality (Q_S) results of the three predictive distributions from our toy example.

Figure F.1: Results of regression metrics seem to have the same trend as those of true positives, while results of classification metrics follow those of localization errors. This is no surprise, given that true positives and duplicates can have the same IOU with ground truth, the only difference being duplicates having a lower class probability.

Figure F.2: Point plots of the Brier Score and NLL for results from RetinaNet. The same trends are seen as in Figure 3, with the exception of false positives having a lower Brier score than localization errors and duplicates. We suspect that this is due to the usage of multilabel classification with sigmoid instead of multiclass classification with softmax in the original RetinaNet implementation.

Figure F.4: Histogram of differential entropy for false positive bounding box predictive distributions on DETR. For networks trained using NLL and ES, the same trend is observed as in Figure 5. The network trained with DMM is shown to be unable to differentiate between in- and out-of-distribution false positives.

