MULTI-HYPOTHESIS 3D HUMAN POSE ESTIMATION METRICS FAVOR MISCALIBRATED DISTRIBUTIONS

Abstract

Due to depth ambiguities and occlusions, lifting 2D poses to 3D is a highly ill-posed problem. Well-calibrated distributions of possible poses can make these ambiguities explicit and preserve the resulting uncertainty for downstream tasks. This study shows that previous attempts, which account for these ambiguities via multiple hypothesis generation, produce miscalibrated distributions. We identify that the miscalibration can be attributed to the use of sample-based metrics such as minMPJPE. In a series of simulations, we show that minimizing minMPJPE, as commonly done, should converge to the correct mean prediction. However, it fails to correctly capture the uncertainty, thus resulting in a miscalibrated distribution. To mitigate this problem, we propose an accurate and well-calibrated model called Conditional Graph Normalizing Flow (cGNF). Our model is structured such that a single cGNF can estimate both conditional and marginal densities within the same model, effectively solving a zero-shot density estimation problem. We evaluate cGNF on the Human 3.6M dataset and show that cGNF provides a well-calibrated distribution estimate while being close to state-of-the-art in terms of overall minMPJPE. Furthermore, cGNF outperforms previous methods on occluded joints while remaining well-calibrated.1

1. INTRODUCTION

The task of estimating the 3D human pose from 2D images is a classical problem in computer vision and has received significant attention over the years (Agarwal & Triggs, 2004; Mori & Malik, 2006; Bo et al., 2008). With the advent of deep learning, various approaches have been applied to this problem, with many of them achieving impressive results (Martinez et al., 2017; Pavlakos et al., 2016; 2018; Zhao et al., 2019; Zou & Tang, 2021). However, the task of 3D pose estimation from 2D images is highly ill-posed: a single 2D joint can often be associated with multiple 3D positions, and due to occlusions, many joints can be entirely missing from the image. While many previous studies still estimate one single solution for each image (Martinez et al., 2017; Pavlakos et al., 2017; Sun et al., 2017; Zhao et al., 2019; Zhang et al., 2021), some attempts have been made to generate multiple hypotheses to account for these ambiguities (Li & Lee, 2019; Sharma et al., 2019; Biggs et al., 2020; Oikarinen et al., 2020; Li & Lee, 2020; Kolotouros et al., 2021; Wehrbein et al., 2021). Many of these approaches rely on estimating the conditional distribution of 3D poses given the 2D observation implicitly through sample-based methods. Since direct likelihood estimation in sample-based methods is usually not feasible, different sample-based evaluation metrics have become popular. As a result, the field's focus has been on the quality of individual samples with respect to the ground truth and not on the quality of the probability distribution of 3D poses itself. In this study, we show that common sample-based metrics in lifting, such as the mean per joint position error, encourage overconfident distributions rather than correct estimates of the true distribution. As a result, they do not guarantee that the estimated density of 3D poses is a faithful representation of the underlying data distribution and its ambiguities.
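This failure mode can be made concrete with a small Monte Carlo sketch (our illustrative example, not code from any of the cited works). For a 1D Gaussian "ground truth", we compare two hypothesis generators under a best-of-N distance metric, a 1D analogue of minMPJPE: one samples from the true distribution, the other always outputs the same fixed, collapsed set of points.

```python
import numpy as np

rng = np.random.default_rng(0)
trials, n_hyp = 50_000, 5

# Ground truth: one draw per trial from the true 1D distribution N(0, 1).
y = rng.standard_normal((trials, 1))

# Generator A: hypotheses sampled from the *true* distribution (well-calibrated).
calibrated = rng.standard_normal((trials, n_hyp))

# Generator B: a fixed, deterministic set of points (a degenerate "distribution"
# with zero sampling variance), placed at quantiles of N(0, 1).
collapsed = np.tile([-1.28, -0.52, 0.0, 0.52, 1.28], (trials, 1))

# Sample-based metric: distance of the *best* hypothesis to the ground truth,
# averaged over trials (a 1D analogue of minMPJPE).
def min_err(hyp):
    return np.abs(hyp - y).min(axis=1).mean()

print(f"calibrated sampler: {min_err(calibrated):.3f}")
print(f"collapsed set:      {min_err(collapsed):.3f}")  # lower, i.e. "better"
```

The degenerate generator scores better under this metric, even though its implied distribution puts no probability mass anywhere except five points; this is the sense in which sample-based metrics can favor miscalibrated distributions.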
As a consequence, their predicted uncertainty cannot be trusted in downstream decisions, which would be one of the key benefits of a probabilistic model (Fig. 1). In a series of experiments, we show that a probabilistic lifting model trained with likelihood provides a higher-quality estimated distribution. First, we evaluate the distributions learned by minimizing minMPJPE.

2. RELATED WORK

Lifting Models Estimating the human 3D pose from a 2D image is an active research area (Pavlakos et al., 2016; Martinez et al., 2017; Zhao et al., 2019; Wu et al., 2022). An effective approach is to decouple 2D keypoint detection from 3D pose estimation (Martinez et al., 2017): first, the 2D keypoints are estimated from the image using a 2D keypoint detector; then a lifting model uses just these keypoints to obtain a 3D pose estimate. Since estimating a 3D pose from 2D data is a highly ill-posed problem, approaches have been proposed to estimate multiple hypotheses (Li & Lee, 2019; Sharma et al., 2019; Oikarinen et al., 2020; Kolotouros et al., 2021; Li et al., 2021; Wehrbein et al., 2021). However, these approaches i) do not explicitly account for occluded or missing keypoints and ii) do not consider the calibration of the estimated densities. Wehrbein et al. (2021) incorporate a Normalizing Flow (Tabak, 2000) architecture to model the well-defined 3D-to-2D projection and exploit the invertible nature of Normalizing Flows to obtain 2D-to-3D estimates. Albeit structured as a Normalizing Flow, the model is not trained as a probabilistic model. Instead, the authors optimize it by minimizing a set of cost functions, all of which in some form depend on the distance of hypotheses to the ground truth. In addition, they utilize an adversarial loss to improve the quality of the hypotheses. The proposed model achieves high performance on popular metrics in multi-hypothesis pose estimation, which are all sample-based distance measures rather than distribution-based metrics.
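To clarify the distinction between a flow-structured model and a flow trained as a probabilistic model, the following minimal change-of-variables sketch (our illustration, not the architecture of any cited work) shows the exact log-likelihood that likelihood-based training maximizes, here for a single affine transform with a standard-normal base distribution:

```python
import numpy as np

def affine_flow_log_likelihood(x, mu, sigma):
    """Exact log p(x) for the affine flow z = (x - mu) / sigma.

    Change of variables: log p(x) = log N(z; 0, 1) + log |dz/dx|.
    Likelihood training maximizes this quantity directly, rather than a
    distance between generated samples and the ground truth.
    """
    z = (x - mu) / sigma                          # invertible transform
    log_base = -0.5 * (z**2 + np.log(2 * np.pi))  # log-density of the base N(0, 1)
    log_det = -np.log(sigma)                      # log-abs-determinant of dz/dx
    return log_base + log_det

x = np.array([0.3, -1.2, 2.0])
# Equals the analytic N(mu, sigma^2) log-density, as it must for an affine flow.
print(affine_flow_log_likelihood(x, mu=0.5, sigma=2.0))
```

In a deeper flow, the same two terms appear per layer; the key point is that the model assigns an explicit, normalized density, so calibration can be trained and evaluated directly.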

Sample-Based Metrics in Pose Estimation

The most widely used metric in pose estimation is the mean per joint position error (MPJPE) (Wang et al., 2021). It is defined as the mean Euclidean distance between the K ground truth joint positions X ∈ R^{K×3} and the predicted joint positions X̂ ∈ R^{K×3}. Multi-hypothesis pose estimation considers N hypotheses of positions X̂ ∈ R^{N×K×3} and commonly reports the minimum MPJPE over the N hypotheses (minMPJPE).
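As a concrete reference, a straightforward implementation of MPJPE and its multi-hypothesis variant might look as follows (a sketch; variable names are ours, and poses are assumed to be already root-aligned):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error: mean Euclidean distance over K joints.

    pred, gt: arrays of shape (K, 3).
    """
    return np.linalg.norm(pred - gt, axis=-1).mean()

def min_mpjpe(hypotheses, gt):
    """minMPJPE: the error of the best of N hypotheses, shape (N, K, 3)."""
    per_hypothesis = np.linalg.norm(hypotheses - gt, axis=-1).mean(axis=-1)
    return per_hypothesis.min()

# Toy example with K = 17 joints and N = 5 hypotheses.
rng = np.random.default_rng(0)
gt = rng.normal(size=(17, 3))
hyps = gt + 10.0 * rng.normal(size=(5, 17, 3))
print(mpjpe(hyps[0], gt), min_mpjpe(hyps, gt))
```

Note that min_mpjpe scores only the single closest hypothesis; the remaining N-1 hypotheses, and hence the shape of the predicted distribution, are left unconstrained by the metric.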



1 Code and pretrained model weights are available at https://github.com/XXXX.



Figure 1: Examples showcasing the consequences of an overconfident distribution vs. our well-calibrated distribution. Ground truth is marked with colored poses. Artificial 2D keypoint failures are used.

Sharma et al. (2019) introduce a conditional variational autoencoder architecture with an ordinal ranking to disambiguate depth. Similarly to Wehrbein et al. (2021), the authors additionally optimize the poses on sample-based reconstruction metrics and report performance on sample-based metrics only. Oikarinen et al. (2020) utilize a graph-based approach to construct a mixture density network of Gaussian distributions. Kolotouros et al. (2021) use a volume-preserving normalizing flow model based on the GLOW architecture (Kingma & Dhariwal, 2018).

