MULTI-HYPOTHESIS 3D HUMAN POSE ESTIMATION METRICS FAVOR MISCALIBRATED DISTRIBUTIONS

Abstract

Due to depth ambiguities and occlusions, lifting 2D poses to 3D is a highly ill-posed problem. Well-calibrated distributions of possible poses can make these ambiguities explicit and preserve the resulting uncertainty for downstream tasks. This study shows that previous attempts, which account for these ambiguities via multiple hypotheses generation, produce miscalibrated distributions. We identify that miscalibration can be attributed to the use of sample-based metrics such as minMPJPE. In a series of simulations, we show that minimizing minMPJPE, as commonly done, should converge to the correct mean prediction. However, it fails to correctly capture the uncertainty, thus resulting in a miscalibrated distribution. To mitigate this problem, we propose an accurate and well-calibrated model called Conditional Graph Normalizing Flow (cGNF). Our model is structured such that a single cGNF can estimate both conditional and marginal densities within the same model, effectively solving a zero-shot density estimation problem. We evaluate cGNF on the Human 3.6M dataset and show that cGNF provides a well-calibrated distribution estimate while being close to state-of-the-art in terms of overall minMPJPE. Furthermore, cGNF outperforms previous methods on occluded joints while remaining well-calibrated.

1. INTRODUCTION

The task of estimating the 3D human pose from 2D images is a classical problem in computer vision and has received significant attention over the years (Agarwal & Triggs, 2004; Mori & Malik, 2006; Bo et al., 2008). With the advent of deep learning, various approaches have been applied to this problem with many of them achieving impressive results (Martinez et al., 2017; Pavlakos et al., 2016; 2018; Zhao et al., 2019; Zou & Tang, 2021). However, the task of 3D pose estimation from 2D images is highly ill-posed: a single 2D joint can often be associated with multiple 3D positions, and due to occlusions, many joints can be entirely missing from the image. While many previous studies still estimate one single solution for each image (Martinez et al., 2017; Pavlakos et al., 2017; Sun et al., 2017; Zhao et al., 2019; Zhang et al., 2021), some attempts have been made to generate multiple hypotheses to account for these ambiguities (Li & Lee, 2019; Sharma et al., 2019; Biggs et al., 2020; Oikarinen et al., 2020; Li & Lee, 2020; Kolotouros et al., 2021; Wehrbein et al., 2021). Many of these approaches rely on estimating the conditional distribution of 3D poses given the 2D observation implicitly through sample-based methods. Since direct likelihood estimation in sample-based methods is usually not feasible, different sample-based evaluation metrics have become popular. As a result, the field's focus has been on the quality of individual samples with respect to the ground truth and not the quality of the probability distribution of 3D poses itself. In this study, we show that common sample-based metrics in lifting, such as mean per joint position error, encourage overconfident distributions rather than correct estimates of the true distribution. As a result, they do not guarantee that the estimated density of 3D poses is a faithful representation of the underlying data distribution and its ambiguities.
As a consequence, their predicted uncertainty cannot be trusted in downstream decisions, which would be one of the key benefits of a probabilistic model (Fig. 1). In a series of experiments, we show that a probabilistic lifting model trained with likelihood provides a higher-quality estimated distribution. First, we evaluate the distributions learned by minimizing minMPJPE instead of negative log-likelihood (NLL), observing that, although minMPJPE-optimal distributions have a good mean, they are not well-calibrated. Next, we use the SimpleBaseline (Martinez et al., 2017) lifting model with a simple Gaussian noise model on Human3.6M to demonstrate that a model optimized for NLL is well-calibrated but underperforms on minMPJPE. The same model optimized for minMPJPE performs well in that metric but turns out to be miscalibrated. To balance this trade-off, we propose an interpretable evaluation strategy that allows comparing sample-based methods, while retaining calibration. Finally, we introduce a novel method to learn the distribution of 3D poses conditioned on the available 2D keypoint positions. To that end, we propose a Conditional Graph Normalizing Flow (cGNF). Unlike previous methods, cGNF does not require training a separate model for the prior and posterior. Thus, our model does not require an adversarial loss term, as opposed to Wehrbein et al. (2021). By evaluating the cGNF's performance on the Human 3.6M dataset (Ionescu et al., 2014), we show that, in contrast to previous methods, our model is well-calibrated while being close to state-of-the-art in terms of overall minMPJPE, and that it significantly outperforms prior work on occluded joints.

2. RELATED WORK

Lifting Models Estimating the human 3D pose from a 2D image is an active research area (Pavlakos et al., 2016; Martinez et al., 2017; Zhao et al., 2019; Wu et al., 2022). An effective approach is to decouple 2D keypoint detection from 3D pose estimation (Martinez et al., 2017). First, the 2D keypoints are estimated from the image using a 2D keypoint detector, then a lifting model uses just these keypoints to obtain a 3D pose estimate. Since the task of estimating a 3D pose from 2D data is a highly ill-posed problem, approaches have been proposed to estimate multiple hypotheses (Li & Lee, 2019; Sharma et al., 2019; Oikarinen et al., 2020; Kolotouros et al., 2021; Li et al., 2021; Wehrbein et al., 2021). However, these approaches i) do not explicitly account for occluded or missing keypoints and ii) do not consider the calibration of the estimated densities. Wehrbein et al. (2021) incorporate a Normalizing Flow (Tabak, 2000) architecture to model the well-defined 3D to 2D projection and exploit the invertible nature of Normalizing Flows to obtain 2D to 3D estimates. Although their model is structured as a Normalizing Flow, it is not trained as a probabilistic model. Instead, the authors optimize the model by minimizing a set of cost functions, all of which depend in some form on the distance of hypotheses to the ground truth. In addition, they utilize an adversarial loss to improve the quality of the hypotheses. The proposed model achieves high performance on popular metrics in multi-hypothesis pose estimation, which are all sample-based distance measures rather than distribution-based metrics.

Sample-Based Metrics in Pose Estimation

The most widely used metric in pose estimation is the mean per joint position error (MPJPE) (Wang et al., 2021). It is defined as the mean Euclidean distance between the K ground truth joint positions $X \in \mathbb{R}^{K \times 3}$ and the predicted joint positions $\hat{X} \in \mathbb{R}^{K \times 3}$. Multi-hypothesis pose estimation considers N hypotheses of positions $\hat{X} \in \mathbb{R}^{N \times K \times 3}$ and adapts the error to consider the hypothesis closest to the ground truth (Jahangiri & Yuille, 2017).

$$\mathrm{minMPJPE}(\hat{X}, X) = \min_{n} \frac{1}{K} \sum_{k}^{K} \lVert \hat{X}_{n,k} - X_k \rVert_2$$

In this work, we refer to this minimum version of the MPJPE as minMPJPE. Procrustes-Aligned MPJPE (PA-MPJPE) is a variation on MPJPE which first aligns the test pose to the ground truth pose. The percentage of correct keypoints (PCK) (Toshev & Szegedy, 2013; Tompson et al., 2014; Mehta et al., 2016) is another widely accepted metric in pose estimation which measures the percentage of keypoints in a circle of 150 mm around the ground truth in terms of minMPJPE. Correct pose score (CPS) proposed by Wandt et al. (2021) considers a pose to be correct if all the keypoints are within a radius r ∈ [0 mm, 300 mm] of the ground truth in terms of minMPJPE. CPS is defined as the area under the curve of percentage correct poses and r.

Calibration is an important property of a probabilistic model measuring a model's ability to correctly reflect the uncertainty in the data. Thus, the confidence of an event assigned by a well-calibrated model should be equal to the true probability of the event (Brier, 1950). Guo et al. (2017) show that calibration of densities is especially important in the field of deep learning, where different architecture choices have been shown to lead to miscalibration. Naeini et al. (2015) propose the expected calibration error (ECE) metric, which approximates the expectation of the absolute difference between the predicted probability $\hat{p}_n$ and the true probability $p_n$.

$$\mathrm{ECE} = \frac{1}{N} \sum_{n=1}^{N} \lvert \hat{p}_n - p_n \rvert \qquad (1)$$

The lower the ECE the better the calibration of the distribution.
A model which predicts the same probability for all samples has an ECE of 0.5, whereas a perfectly calibrated model has ECE = 0. DeGroot & Fienberg (1983) and Niculescu-Mizil & Caruana (2005) provide a visual representation of calibration using reliability diagrams. They display the calibration curve, which is a function of confidence against the true probability. If the calibration curve is an identity function then the model is perfectly calibrated.
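As a concrete illustration of the two definitions above, the following minimal Python sketch computes minMPJPE for a set of hypotheses and the ECE of Eq. 1. The function names and the nested-list pose layout are illustrative assumptions and are not taken from any released code.

```python
import math


def mpjpe(pred, truth):
    """Mean Euclidean distance over K joints; each pose is a list of K (x, y, z) points."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(truth)


def min_mpjpe(hypotheses, truth):
    """MPJPE of the hypothesis closest to the ground truth (N poses of K x 3)."""
    return min(mpjpe(h, truth) for h in hypotheses)


def expected_calibration_error(predicted, observed):
    """Mean absolute gap between predicted and observed probabilities (Eq. 1)."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)
```

Because only the best of the N hypotheses is scored, the remaining N - 1 hypotheses have no influence on minMPJPE, which is why the metric alone cannot reveal whether the spread of the hypotheses is appropriate.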

3. OBSERVING MISCALIBRATION

In this section, we demonstrate that the current state-of-the-art lifting models are not well-calibrated. We consider two of the latest methods: Sharma et al. (2019) and Wehrbein et al. (2021) . We compute the ECE for the two models and visualize their reliability diagrams (Fig. 2a ).

3.1. QUANTILE CALIBRATION FOR POSE ESTIMATION

Algorithm 1 Quantile calibration for pose estimation
  for each $X^*_m$ and $C_m$ do
    draw N hypotheses $\hat{X} \mid C_m$
    $\tilde{X}_{m,k} \leftarrow \mathrm{median}(\hat{X}_{:,m,k})$
    $\varepsilon_{n,m,k} \leftarrow \lVert \hat{X}_{n,m,k} - \tilde{X}_{m,k} \rVert_2$
    $\Phi_m(\varepsilon) \leftarrow \mathrm{CDF}(\varepsilon_{:,m,k})$
    $\varepsilon^*_{m,k} \leftarrow \lVert X^*_{m,k} - \tilde{X}_{m,k} \rVert_2$
  end for
  $\omega_k(q) \leftarrow \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}\left[\Phi_m(\varepsilon^*_{m,k}) \le q\right]$
  $\omega(q) \leftarrow \mathrm{median}(\omega_k(q))$
  $\mathrm{ECE} = \frac{1}{|Q|} \sum_{q \in Q} \lvert \omega(q) - q \rvert$

Quantile calibration (Song et al., 2019) defines a perfectly calibrated distribution as one for which ground-truth values $X^*$ fall within the q-th quantile q% of the time. However, for high dimensions, estimating whether a point is contained within a given quantile is non-trivial. We, therefore, propose to simplify the problem by projecting to the univariate space of squared errors ε from the median $\tilde{X}$ of N hypotheses $\hat{X}$ conditioned on 2D poses C with K keypoints. We then compute ECE in the space of ε over the set of quantiles Q ⊂ [0, 1] (Algorithm 1). As a measurement of central tendency we choose the median statistic, which is more robust to outlier samples. However, in practice, the choice of median vs. mean results in minor differences in the calibration outcomes (sec. A.3).
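The steps of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the function name and the nested-list layout are hypothetical, the median is taken per coordinate, and unsquared distances are used as in the algorithm listing (squaring them would not change their ranks and hence not the calibration curve).

```python
import math
import statistics


def quantile_calibration_ece(hypotheses, truths, quantiles):
    """hypotheses[m][n][k] and truths[m][k] are 3D points; returns (ECE, omega curve)."""
    M, K = len(truths), len(truths[0])
    # cdf_at_truth[k][m]: empirical CDF of hypothesis errors evaluated at the true error.
    cdf_at_truth = [[0.0] * M for _ in range(K)]
    for m in range(M):
        for k in range(K):
            points = [hyp[k] for hyp in hypotheses[m]]
            # Coordinate-wise median as the reference point (an assumption).
            center = [statistics.median(p[d] for p in points) for d in range(3)]
            errors = [math.dist(p, center) for p in points]
            true_error = math.dist(truths[m][k], center)
            cdf_at_truth[k][m] = sum(e <= true_error for e in errors) / len(errors)
    curve = []
    for q in quantiles:
        # omega_k(q): fraction of observations whose true error lies within quantile q.
        per_joint = [sum(v <= q for v in cdf_at_truth[k]) / M for k in range(K)]
        curve.append(statistics.median(per_joint))  # median over joints
    ece = sum(abs(w - q) for w, q in zip(curve, quantiles)) / len(quantiles)
    return ece, curve
```

For an overconfident model the true error usually exceeds almost all hypothesis errors, so the curve stays far below the identity line and the ECE approaches the mean of the evaluated quantiles.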

3.2. SAMPLE-BASED METRICS PROMOTE MISCALIBRATION

In this section, we show that sample-based metrics are a major component that contributes to miscalibration. In principle, minMPJPE could be a good surrogate metric for NLL. However, as it became a common metric for selecting models, it might become subject to Goodhart's Law (Goodhart, 1975): "When a measure becomes a target, it ceases to be a good measure" (Strathern, 1997). In the case of minimizing the mean MPJPE over hypotheses, the posterior distribution collapses onto the mean (sec. A.1). Similarly, simulations indicate that minMPJPE converges to the correct mean, but it encourages miscalibration (Fig. 2b,d and sec. A.2). We illustrate this with a small toy example. Consider M samples $X^* \in \mathbb{R}^{M \times D}$ from a D-dimensional isotropic Normal distribution with mean $\mu^* \in \mathbb{R}^D$ and variance $\sigma^{*2} \in \mathbb{R}^D$ and an approximate isotropic Normal posterior distribution q(X) with mean $\mu \in \mathbb{R}^D$ and variance $\sigma^2 \in \mathbb{R}^D$. We assume the ground truth mean to be known, $\mu = \mu^*$, and only optimize the variance $\sigma^2$ to minimize minMPJPE with N hypotheses. We optimize $\sigma^2$ for different numbers of dimensions D and hypotheses N. If the distribution converges to a variance lower than the true variance $\sigma^{*2}$ we call such a distribution overconfident. However, if the converged variance is larger than the true variance then the distribution is considered to be underconfident. Intuitively, for a small sampling budget drawing samples at the mean constitutes the least risk of generating a bad sample. With an increase in the number of hypotheses, increasing variance should gradually become beneficial, as the samples cover more of the volume. For a sufficiently large number of hypotheses, we can expect the variance to increase beyond the true variance, as the low-probability samples can have sufficient representation. Increasing dimensions should have an inverse effect since the volume to be covered increases with each dimension. We observe these effects in the toy example (Fig. 2b).
When we consider the case which corresponds to the 3D pose estimation problem (D = 45 and N = 200, black point in Fig. 2b ), we expect an overconfident distribution based on our toy example. This is also what we observe for the current state-of-the-art lifting models (Fig. 2a ). Furthermore, we show that the minMPJPE optimal distribution outperforms the ground truth distribution in terms of minMPJPE, but not in terms of negative log-likelihood (Fig. 2b ). Together, the results imply that minimizing minMPJPE, directly or by model selection, is expected to result in miscalibrated distributions and thus minMPJPE by itself is not sufficient to identify the best model.
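The toy experiment described in this section can be approximated with a small Monte Carlo sketch. The code below is an illustration under stated assumptions rather than the simulation used for Fig. 2b: it fixes the mean at zero, replaces gradient-based optimization with a grid search over σ, and uses hypothetical function names and default values.

```python
import math
import random


def expected_min_error(sigma, n_hyp, dim, true_sigma=0.5, n_data=2000, seed=0):
    """Monte Carlo estimate of minMPJPE for an isotropic Gaussian with known mean 0."""
    rng = random.Random(seed)  # same seed for every sigma: common random numbers
    total = 0.0
    for _ in range(n_data):
        target = [rng.gauss(0.0, true_sigma) for _ in range(dim)]
        best = math.inf
        for _ in range(n_hyp):
            hyp = [sigma * rng.gauss(0.0, 1.0) for _ in range(dim)]
            best = min(best, math.dist(hyp, target))  # keep the closest hypothesis
        total += best
    return total / n_data


def best_sigma(grid, n_hyp, dim, **kwargs):
    """Grid-search the sigma that minimizes the estimated minMPJPE."""
    return min(grid, key=lambda s: expected_min_error(s, n_hyp, dim, **kwargs))
```

With a single hypothesis the search should select the smallest σ on the grid, reflecting the collapse onto the mean, whereas many hypotheses in few dimensions favor a σ above the true value.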

3.3. UNCONDITIONAL GAUSSIAN NOISE BASELINE ON HUMAN 3.6M

To verify the conclusions from the toy model in section 3.2 we test the prediction with a simplified model on the Human3.6M dataset (Catalin Ionescu, 2011; Ionescu et al., 2014) (see section 5 for more details about the dataset). We train an additive Gaussian noise model on top of the SimpleBaseline (Martinez et al., 2017), a well-established single-hypothesis model. We generate N hypotheses $\hat{X} \in \mathbb{R}^{N \times M \times K \times 3}$ of poses with K keypoints for M observations $C \in \mathbb{R}^{M \times K \times 2}$ according to:

$$\hat{X}_{n,m} = \mathrm{SimpleBaseline}(C_m) + \sigma z_n$$

where $\mathrm{SimpleBaseline}(C_m)$ estimates the mean of the noise and σ is the standard deviation parameter scaling the standard normal samples $z \sim \mathcal{N}(z; 0, I)$ (Fig. 2c). It is important to note that we do not condition σ on the 2D observation $C_m$, i.e. the same noise model is used for every input. We test two optimization setups: 1) minimizing minMPJPE and 2) maximizing likelihood. Based on the predictions from the toy problem (sec. 3.2), we expect the minMPJPE model to be overconfident and outperform the NLL model on the minMPJPE, but the NLL model to be better calibrated. This is exactly what we observe (Fig. 2c). Furthermore, each of these models achieves minMPJPE performance in a range similar to state-of-the-art multi-hypothesis methods and even outperforms some established single-hypothesis methods (Table 1).
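The unconditional noise model above is simple enough to sketch directly. The following Python fragment is a minimal illustration in which the mean poses stand in for the SimpleBaseline predictions; the function names are hypothetical. For a single shared σ and fixed means, the likelihood-optimal σ has a closed form, the root-mean-square residual, so no iterative training is needed in this sketch.

```python
import math
import random


def sample_hypotheses(mean_pose, sigma, n_hyp, rng):
    """Draw n_hyp poses as mean_pose + sigma * z with z ~ N(0, I)."""
    return [[[c + sigma * rng.gauss(0.0, 1.0) for c in joint] for joint in mean_pose]
            for _ in range(n_hyp)]


def rms_residual(mean_poses, true_poses):
    """Root-mean-square residual per coordinate; equals the maximum-likelihood sigma."""
    sq, count = 0.0, 0
    for mean_pose, true_pose in zip(mean_poses, true_poses):
        for mean_joint, true_joint in zip(mean_pose, true_pose):
            for m, t in zip(mean_joint, true_joint):
                sq += (t - m) ** 2
                count += 1
    return math.sqrt(sq / count)


def gaussian_nll(mean_poses, true_poses, sigma):
    """Average negative log-likelihood per coordinate under N(mean, sigma^2)."""
    mse = rms_residual(mean_poses, true_poses) ** 2
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + mse / (2 * sigma ** 2)
```

Any σ smaller than the RMS residual lowers the spread of the hypotheses at the cost of a higher NLL, which is the overconfident regime discussed above.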

3.4. EVALUATING SAMPLE-BASED METHODS

Given that minMPJPE is not sufficient to fully evaluate multi-hypothesis methods, we propose an evaluation strategy that remains interpretable and promotes calibrated distributions. Consider the landscapes of minMPJPE and ECE with respect to the mean and variance of an approximate distribution (Fig. 2d). Simulations indicate that optimizing minMPJPE identifies the correct mean µ (sec. A.2), but not the correct σ. ECE, however, is minimized by a manifold of µ, σ values and converges to a good standard deviation for each mean, but it does not guarantee an accurate model. We thus hypothesize that a likelihood-optimal distribution can be approximated when minMPJPE is minimized on the ECE-optimal manifold. Thus, minMPJPE can become a measure of accuracy if it is constrained by ECE, but it should not be considered as accuracy if calibration is not matched.

4. CONDITIONAL GRAPH NORMALIZING FLOW

Given the observations made in sec. 3 we conclude that MPJPE-based objective functions are not sufficient to obtain a well-calibrated distribution. The objective function should instead be based on likelihood, which in this case is maximized if and only if the estimated distribution recovers the ground truth distribution, i.e. if the distribution is well-calibrated (Hastie et al., 2009). Therefore, in this section, we propose a method that can be optimized purely based on likelihood. Moreover, we utilize the natural graph structure of the human pose, providing zero-shot generalization capabilities to occluded and unobserved body parts. We propose to learn the conditional distribution $p(x \mid c)$ of the 3D pose x given the 2D pose c using conditional graph normalizing flows (cGNF). We define a target graph $x = (H_x, E_x)$ of 3D poses and a context graph $c = (H_c, E_c)$ of 2D detections. $H_x \in \mathbb{R}^{n \times D_x}$ are the node features and $E_x$ are the edges between the nodes of the target graph; $H_c \in \mathbb{R}^{m \times D_c}$ are the node features and $E_c$ are the edges between the nodes of the context graph. In the case that an observation is not present, the corresponding node is removed from c. The model is built of L transformation blocks, each of which consists of a per-node feature split step, a graph merging step, an actnorm (Kingma & Dhariwal, 2018) and two graph neural network layers (Gori et al., 2005) (Fig. 3). These elements construct an affine coupling layer (Dinh et al., 2016), which is then followed by a permutation layer. The transformation blocks are only applied to the target graph, while the context graph is passed through unchanged.

Per-Node Feature Split

This step splits the target node features $H_x$ into two parts, $(H_x)_{:,1:D-1}$ and $(H_x)_{:,D}$, across the feature dimension. We incorporate a leave-one-out strategy for splitting the features. The i-th feature dimension is propagated directly to the affine coupling layer and the remaining dimensions are passed to the graph neural network layers. In each subsequent block, the next feature dimension is used.

Graph Merging When utilizing conditional normalizing flows (Winkler et al., 2019) on graph-structured data, a key challenge is incorporating the context graph in the transformation. We propose to merge the context graph c with the target graph x into a heterogeneous graph $x \mid c$. The context graph c forms directed edges from nodes in c to nodes in x as defined by the relations matrix $R^{c \to x}$. $R^{c \to x}_{i,j} = 1$ indicates that node i in the context graph forms an edge with node j in the target graph x (Fig. 3).

Graph Neural Network Layers

We define the graph neural network layers as relational graph convolutions (R-GCNs) (Schlichtkrull et al., 2018). In the message passing step, the message received by node v from the neighboring nodes is defined as

$$m^{(v)}_{t+1} = \sum_{u \in N_{c \to x}(v)} \psi_{c \to x}\left(h^{(v)}_c, e^{(u,v)}\right) + \sum_{r \in R} \sum_{u \in N_r(v)} \psi_r\left(h^{(u)}_t, e^{(u,v)}\right)$$

where $\psi_r : \mathbb{R}^{D_n} \to \mathbb{R}^{D_h}$ and $\psi_{c \to x} : \mathbb{R}^{D_c} \to \mathbb{R}^{D_h}$, with $D_h$ as the number of latent dimensions. $\psi_{c \to x}$ should be flexible enough to allow the network to learn to distinguish between missing observations and zero observations, i.e. $\psi_{c \to x}(0, e^{(u,v)}) \neq 0$. The Update step is defined by the mapping $g : \mathbb{R}^{D_h} \to \mathbb{R}^{D_o}$ which maps the latent space to the output dimension of size $D_o$. We implement the Update step as a single fully connected linear layer.

Affine Coupling Layer Similarly to Liu et al. (2019), the output of the GNN layers models the scale $s(x_2, c)$ and translation $t(x_2, c)$ functions. The scale and translation functions are then applied to the split $x_1$ to produce the transformed graph $z^l_1$.

$$z^l_1 = x_1 \odot \exp\left(s(x_2, c)\right) + t(x_2, c), \qquad z^l_2 = x_2$$

The split $x_2$ is copied to $z^l_2$ unchanged. $z^l_1$ and $z^l_2$ are then concatenated to form the transformed graph $z^l$, which is passed to the next transformation block.
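The affine coupling transformation above is invertible in closed form, and its log-determinant is the sum of the predicted log-scales. The Python sketch below illustrates this for flat feature vectors. It is a simplified illustration and not the cGNF implementation: `scale_fn` and `shift_fn` are hypothetical stand-ins for the GNN outputs s and t, and the context `c` is passed as a plain value.

```python
import math


def coupling_forward(x1, x2, c, scale_fn, shift_fn):
    """z1 = x1 * exp(s(x2, c)) + t(x2, c), z2 = x2; returns (z1, z2, log|det J|)."""
    s, t = scale_fn(x2, c), shift_fn(x2, c)
    z1 = [a * math.exp(si) + ti for a, si, ti in zip(x1, s, t)]
    # The Jacobian is triangular, so its log-determinant is the sum of log-scales.
    return z1, list(x2), sum(s)


def coupling_inverse(z1, z2, c, scale_fn, shift_fn):
    """Invert the coupling: x2 = z2, x1 = (z1 - t(z2, c)) * exp(-s(z2, c))."""
    s, t = scale_fn(z2, c), shift_fn(z2, c)
    x1 = [(a - ti) * math.exp(-si) for a, si, ti in zip(z1, s, t)]
    return x1, list(z2)
```

Because s and t only ever see the untouched split $x_2$ and the context, they can be arbitrary networks without affecting invertibility.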

Estimating Conditional and Marginal Densities

The cGNF architecture allows for estimating both the conditional and marginal densities within a single model. The conditional density $p(x \mid c)$ is estimated by merging the target graph x with the context graph c. Consequently, the output density becomes constrained by the context. By removing nodes from c, the associated conditioning variables are removed from $p(x \mid c)$, functionally conditioning only on a subset of the possible nodes in c. Finally, if the context graph is empty, the model provides a marginal density $p(x)$.

Loss The standard optimization procedure for normalizing flows is to maximize the log probability of the observed data x obtained through the inverse path (x → z) (Fig. 3). Assuming x are i.i.d., the task of the flow is to model $p(x \mid c) = \prod_{i}^{N} p(x_i \mid c_i)$ where $x_i$ are the 3D poses and $c_i$ are the corresponding 2D observations. We thus define the loss as the negative log probability of pairs of observations x and c.

$$L_{\mathrm{post}} = -\ln q_0(f(x, c)) + \sum_{k=1}^{K} \ln \det \nabla_{z_{k-1}} f_k(z_{k-1}, c)$$

where $q_0 \sim \mathcal{N}(z; 0, I)$ is the source distribution. We augment the training data by randomly removing context variables to simulate new observations with missing keypoints in c. The augmented observations contain 20%, 40%, 60% or 80% of all observable keypoints. For all 3D poses, we additionally compute the prior loss, which expresses the likelihood of a pose given that no 2D keypoints were observed.

$$L_{\mathrm{prior}} = -\ln q_0(f(x, \emptyset)) + \sum_{k=1}^{K} \ln \det \nabla_{z_{k-1}} f_k(z_{k-1}, \emptyset)$$

Our overall loss function is thus the sum of the two partial losses.

$$L = \frac{1}{2} L_{\mathrm{prior}} + L_{\mathrm{post}} \qquad (2)$$

The proposed training strategy and architecture formulate pose estimation as a zero-shot density estimation problem. The cGNF model is trained on a subset of possible observations and is required to evaluate previously unseen conditional densities. Such zero-shot capabilities are useful in reliably estimating occluded poses.
The graph structure allows the cGNF to share information between nodes and as a result allows modeling distributions with sets of conditioning variables that have not been seen before. We observe that cGNF can solve these zero-shot density estimation problems comparably to specialized conditional normalizing flow problems (sec B.3). Root Node 3D poses are relative to a root node (usually the pelvis). Hence, the root node's position is deterministic. We, therefore, remove the root node and corresponding edges from the target graph x and represent it as a root node-type r, which has features H r ∈ R 3 and a message generation function ψ r which is a fully connected neural network with 100 units.
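The context-removal augmentation used during training can be sketched as follows. This is a minimal illustration under assumptions: the dictionary layout, the function name, and the rounding of the kept count are hypothetical, while the keep fractions follow the 20%, 40%, 60% and 80% stated in the loss description.

```python
import random

# Fractions of observable keypoints that remain after augmentation.
KEEP_FRACTIONS = (0.2, 0.4, 0.6, 0.8)


def drop_context(keypoints, rng):
    """Keep a random subset of the 2D keypoints; removed joints are simply absent.

    keypoints maps a joint name to its (u, v) detection.
    """
    names = sorted(keypoints)
    n_keep = round(rng.choice(KEEP_FRACTIONS) * len(names))
    kept = rng.sample(names, n_keep)
    return {name: keypoints[name] for name in kept}
```

Representing a missing keypoint by the absence of its node, rather than by a zero-valued feature, matches the requirement that the model distinguish missing observations from zero observations; passing an empty dictionary would correspond to the prior term.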

Graph Symmetries

The human pose graph has symmetries, e.g. the left and right limbs are mirrored. We impose a hierarchical structure on the nodes of the target graph x. A node may have a parent and a child, for example, the elbow node is the child of the shoulder node and the parent of the wrist node. Messages passed from the parent to the child are forward messages generated by ψ x→x and messages from the child to the parent are backward messages generated by ψ x←x . Occlusion Representation We use 2D keypoint positions published by Wehrbein et al. (2021) estimated using the HRNet model (Sun et al., 2019) and the provided Gaussian distribution fits for evaluating occluded keypoints. If a keypoint is classified as occluded (2D detection σ > 5px) its corresponding node is removed from the context graph. To adjust for the differences between the pose definitions used by HRNet and H36M we employ an embedding network using the SageConv architecture (Hamilton et al., 2017) with a learnable adjacency matrix. The embedding network transforms the observed 2D keypoints into a 10-dimensional embedding vector for each of the keypoints. Additional implementation details of the architecture are given in the appendix (sec. B.1). 

5. LIFTING HUMAN3.6M

Data We use the Human3.6M dataset (H36M) under the academic-use-only license (Catalin Ionescu, 2011; Ionescu et al., 2014), which is the largest dataset for 3D human pose estimation. It consists of tuples of 2D images, 2D poses, and 3D poses for 7 professional actors performing 15 different activities captured with 4 cameras. Accurate 3D positions are obtained from 10 motion capture cameras and markers placed on the subjects. For evaluation, we additionally use the Human 3.6M Ambiguous (H36MA) dataset introduced by Wehrbein et al. (2021). H36MA is a subset of the H36M dataset containing only ambiguous poses from subjects 9 and 11. A pose is defined as ambiguous when the 2D keypoint detector is highly uncertain about at least one of the keypoints.

Evaluation We evaluate the model on every 64th frame of subjects 9 and 11 and the H36MA subset. We compare our model's performance to prior work on minMPJPE and ECE using 200 samples (Table 1). As expected from the observations made in section 3, our method underperforms on minMPJPE but significantly outperforms on ECE (Fig. 4c). We further compare our method to Kolotouros et al. (2021), which utilizes a similar likelihood-based loss and a normalizing flow architecture, but does not account for occlusions and does not utilize graph inductive biases (Table 2). As we predict in section 3, we find that Kolotouros et al. (2021) is well-calibrated. We show that cGNF outperforms Kolotouros et al. (2021) even though fewer samples are used and remains comparably well-calibrated. Samples from the posterior and prior are shown in figure 4a and b. Additional examples are included in the appendix (posterior samples Fig. S3; prior samples Fig. S4). We further compare our model performance in scenarios where other models exhibit overconfidence (Fig. 1 and S6) and explore failure cases (Fig. S7).

Performance on individual occluded joints The poses contained in H36MA are not only occluded but also generally more difficult than the average pose in H36M. Therefore, we propose to evaluate the performance on solely the occluded joints instead of the whole poses. We report these errors in table 1 (Occluded), where we show that our method outperforms the competing methods by a significant margin on both minMPJPE and ECE. This shows that our model learns a posterior distribution that is better calibrated than those of previous methods while also outperforming them on minMPJPE for the occluded joints.

Table 1: Comparison of the cGNF model to state-of-the-art methods for multi-hypothesis pose estimation using expected calibration error (ECE) and minimum mean per joint position error (minMPJPE) between the ground truth 3D pose and N hypotheses. Best model row is printed in bold font. We report the mean across the outcomes of 3 different seeds and the standard deviation (SD). For ECE the SD is smaller than 0.001 in all cases; thus, we do not report the SD value for ECE in this table. For all the metrics lower is better. We underlined the results that we did not compute but instead used the originally reported value. With † we mark results which used ground truth 2D keypoints rather than estimated 2D keypoints; these are not included in the comparison.

6. CONCLUSION

In this study, we explored the problem of miscalibration in multi-hypothesis 3D pose estimation. Obtaining calibrated density estimates is important for safety-critical applications, such as healthcare or autonomous driving. Here we provide evidence that a focus on sample-based metrics for multi-hypothesis 3D pose estimation (e.g. minMPJPE) can lead to miscalibrated distributions. We propose a flexible model which can be trained to minimize the negative log-likelihood loss and show that, unlike previous methods, our model can learn a well-calibrated posterior distribution and outperforms comparably calibrated methods on minMPJPE. In particularly ambiguous situations, i.e. for the occluded joints, we show that our model even outperforms the state-of-the-art on minMPJPE while maintaining a well-calibrated distribution. We believe that our findings will be useful for future work in identifying and mitigating miscalibration in multi-hypothesis pose estimation and will lead to more robust and safer applications of multi-hypothesis pose estimation.

The objective below is equivalent to the mean position error for a single joint. Note that $x$ and $\hat{x}$ are conditionally independent given c, i.e. $x \perp \hat{x} \mid c$. With $\mu_c = E[x \mid c]$, the objective can then be expanded as follows:

$$\begin{aligned}
L &= E_{x \sim p(x \mid c),\, \hat{x} \sim q(\hat{x} \mid c),\, c}\left[(x - \hat{x})^2\right] \\
&= E_c E_{x, \hat{x} \mid c}\left[(x - \mu_c + \mu_c - \hat{x})^2\right] \\
&= E_c\Big[\underbrace{\mathrm{Var}[x \mid c]}_{\text{indep. of } q} - 2 E_{x, \hat{x} \mid c}\left[(x - \mu_c)(\hat{x} - \mu_c)\right] + E_{\hat{x} \mid c}\left[(\hat{x} - \mu_c)^2\right]\Big] \\
&= \mathrm{const.} + E_c\Big[-2 \underbrace{E_{x \mid c}\left[x - \mu_c\right]}_{=0} E_{\hat{x} \mid c}\left[\hat{x} - \mu_c\right] + E_{\hat{x} \mid c}\left[(\hat{x} - \mu_c)^2\right]\Big] \\
&= \mathrm{const.} + \underbrace{E_c E_{\hat{x} \mid c}\left[(\hat{x} - \mu_c)^2\right]}_{\geq 0}
\end{aligned}$$

The expectation in the final line is non-negative and can be minimized by $q(\hat{x} \mid c) = \delta(\hat{x} - \mu_c)$, i.e. setting $\hat{x} = \mu_c$ and shrinking the variance to zero. This means that q would be extremely overconfident.

A.2 minMPJPE CONVERGES TO THE CORRECT MEAN

Consider 1D samples $x^*$ from a data distribution p(x) and an approximate Gaussian distribution q(x) with parameters µ and σ.
We sample N hypotheses from q(x) and minimize the minMPJPE objective:

$$\mathrm{minMPJPE} = E_{q(z)} E_{p(x)}\left[\min_i \left(x^* - \mu - \sigma z_i\right)^2\right]$$

Consider $z^*_j$ as the $z_i$ sample which minimizes the expression for the j-th data sample $x^*_j$.

$$\mathrm{minMPJPE} = E_{q(z)} E_{p(x)}\left[\left(x^* - \mu - \sigma z^*_j\right)^2\right]$$

Thus the derivative can be computed to be

$$\frac{\partial}{\partial \mu} \mathrm{minMPJPE} = -2 E_{q(z)} E_{p(x)}\left[x^* - \mu - \sigma z^*_j\right] = 0 \;\Longrightarrow\; E_{p(x)}[x^*] - \mu - \sigma E_{q(z)}\left[z^*_j\right] = 0$$

Simulations indicate that $E_{q(z)}[z^*_j]$ can be approximated by a sigmoid function

$$E_{q(z)}\left[z^*_j\right] = S\left(E_{p(x)}[x^*] - \mu\right) \cdot C(\sigma, N)$$

where $C(\sigma, N)$ is a scalar scaling value dependent on σ and the number of hypotheses. Thus the root of the derivative can be computed to be:

$$\mu = E_{p(x)}[x^*]$$
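The claim of this subsection can be checked numerically with a short simulation. The sketch below is an illustration under assumptions and not the simulation used in the paper: it evaluates the 1D minMPJPE objective on a coarse grid of candidate means for a fixed σ, with hypothetical function names and a fixed seed so that every candidate is scored on the same noise draws.

```python
import random


def min_sq_error(mu, sigma, n_hyp, data, seed=0):
    """Average over data of min_i (x - mu - sigma * z_i)^2 with z_i ~ N(0, 1)."""
    rng = random.Random(seed)  # identical noise draws for every candidate mu
    total = 0.0
    for x in data:
        total += min((x - mu - sigma * rng.gauss(0.0, 1.0)) ** 2 for _ in range(n_hyp))
    return total / len(data)


def best_mu(grid, sigma, n_hyp, data):
    """Candidate mean with the lowest estimated minMPJPE objective."""
    return min(grid, key=lambda mu: min_sq_error(mu, sigma, n_hyp, data))
```

For data drawn from a Gaussian, the grid point closest to the data mean is expected to win, consistent with the root $\mu = E_{p(x)}[x^*]$ derived above.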

A.3 IMPACT OF CENTER TENDENCY MEASURE ON EXPECTED CALIBRATION ERROR

The choice of central tendency measure should be considered when computing the expected calibration error. Therefore, on a subset of the models presented in table 1, we compare the effect of choosing 3 different reference points: 1) the median of the samples, 2) the mean of the samples, and 3) the mode of the samples. We showcase the results in table 3. We observe that the use of the median in contrast to the mean has little to no effect on the computation of ECE. Using the mode as a reference point results in generally smaller values of ECE. Finally, it can be observed that regardless of the reference point type our cGNF model remains better calibrated than the other methods.

B CONDITIONAL GRAPH NORMALIZING FLOW

B.1 ARCHITECTURE DETAILS

The cGNF model consists of 10 flow layers. Each flow layer $f_k$ consists of two GNN layers, each performing one message-passing step as defined in eq. 4. In the first GNN layer each



Code and pretrained model weights are available at https://github.com/XXXX.



Figure 1: Examples showcasing the consequences of overconfident distributions vs. our well-calibrated distribution. Ground truth marked with colored poses. Uses artificial 2D keypoint failures.

Figure 2: a) Calibration curves of previous lifting models with the corresponding expected calibration error (ECE) scores. b) Standard deviation σ of a Gaussian distribution optimized to minimize minMPJPE for different numbers of samples and dimensions. The true σ is 0.5 (black line), underconfident σ > 0.5 (blue), overconfident σ < 0.5 (pink). The human pose equivalent distribution (black point, 45 dimensions, 200 samples) compared to an oracle distribution (with true µ and σ) in terms of minMPJPE and NLL. c) Gaussian noise model schematic to the left. The SimpleBaseline model weights are not trained. Right bar plots compare the performance on minMPJPE and ECE when optimizing for minMPJPE and NLL. d) Loss landscapes of minMPJPE and ECE for a 1D Gaussian distribution with parameters σ and µ. The gold star represents the ground truth values of σ * = 4 and µ * = 0. To the right is a schematic of the ECE constrained optimization.

Figure 3: A schematic of the cGNF. Target variables x are represented by a graph with the feature matrix $H_x$ and the adjacency matrix $A_x$. The context variables are represented by a context graph c with the feature matrix $H_c$ and adjacency matrix $A_c$. In the inference path the target graph x is transformed into a latent space z which follows a standard normal distribution. The transformation is achieved through L transformation blocks.

Figure 4: a) Hypotheses generated by the cGNF (gray) vs the ground truth pose (blue). Original image is shown to the right. b) Example of samples from the prior learned by the cGNF. c) Calibration of the conditional density. Comparison of the frequency that the distance of the ground truth from the median pose is within a given quantile. Median calibration curves for our model (cGNF, orange) and Wehrbein et al. (2021) (ProHPE, gray). Left shows calibration curves for the whole Human 3.6M test set and right for only the occluded joints.

Comparison of methods on the Procrustes-Aligned minMPJPE metric.

Analogously to Kolotouros et al. (2021), we sample a pose from the mode of the source distribution and minimize the minMPJPE between the sampled pose and the ground truth pose. This additional loss term is shown to improve the minMPJPE performance. At the original model capacity, both the minMPJPE and the calibration performance improve. However, while the minMPJPE performance improves further with model capacity, calibration degrades significantly. We compare the performances in Table 1. Additional model capacity evaluations are made in Sec. B.4.
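As a rough illustration of the metric discussed above, minMPJPE can be sketched as the smallest mean per-joint Euclidean error over a set of hypotheses. The function names and the list-of-tuples pose layout are assumptions made for this sketch, and the Procrustes alignment step is omitted.

```python
import math


def mpjpe(pose, target):
    """Mean Euclidean distance over joints.

    Both arguments are sequences of (x, y, z) joint positions
    (a layout assumed for this sketch).
    """
    return sum(math.dist(p, t) for p, t in zip(pose, target)) / len(target)


def min_mpjpe(hypotheses, target):
    """MPJPE of the hypothesis that lies closest to the target pose."""
    return min(mpjpe(h, target) for h in hypotheses)


# Toy example with two joints and two hypotheses.
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
hypotheses = [
    [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)],  # every joint is off by 1.0
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)],  # only the second joint is off, by 0.5
]
best_error = min_mpjpe(hypotheses, ground_truth)
```

Because only the best hypothesis enters the score, the metric says nothing about how the remaining samples are spread, which is why it cannot by itself certify calibration.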

REPRODUCIBILITY STATEMENT

Our code for reproducing the results is open-sourced at https://github.com/XXXX. We also include code to reproduce the results we obtained for other works. Experimental logs and downloadable pretrained models are fully available at https://wandb.ai/XXXX.

A METRICS

A.1 MEAN PER JOINT POSITION ERROR

A popular optimization metric is the MPJPE. While this metric is especially popular in single-pose estimation methods, it has also been used in various forms in multi-hypothesis methods. Optimizing this metric causes the distribution of poses to be overconfident. We show this for a simple one-dimensional distribution; the generalization to the multi-dimensional case is straightforward. Given samples x ∼ p(x|c) from a data distribution given a particular context c, such as keypoints from an image, consider an approximate distribution q(x|c) that is supposed to reflect the uncertainty about x|c.

Table 3: Comparison of different reference point definitions on the resulting ECE score. In bold we mark the method that has the lowest ECE under the particular reference point.

Method    Median    Mean    Mode

[...] is a single-layer fully connected neural network with 100 units and a ReLU activation (Agarap, 2018). All the messages to a node are summed together, resulting in the output of the Message step as in eq. 4. Then the Update step takes the message output as its input to a single-layer fully connected neural network with 100 units and linear activation. The context c is transformed via ψ c to 100 dimensions and passed to the next GNN layer. In the next GNN layer, the message generation functions ψ(2) r are single-layer fully connected neural networks with 100 units and ReLU activation. The Update is a neural network layer with 3 output units. In the next flow layer, the original context graph c is used and not the transformed context.
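The Message and Update steps described above could be sketched roughly as follows, in plain Python for readability. This is an illustration under assumptions, not the authors' implementation: the helper names (`linear`, `relu`, `init_layer`, `message_passing`) are hypothetical, a message is assumed to depend only on the sending node's features, and a single edge type is used.

```python
import random


def linear(weights, bias, v):
    """Fully connected layer: weights is a list of rows, one per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, x) for x in v]


def init_layer(n_out, n_in, rng):
    """Random weights and zero bias (initialization scheme assumed)."""
    scale = 1.0 / n_in ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def message_passing(features, edges, msg_layer, upd_layer):
    """One GNN layer: ReLU messages, sum aggregation, linear update.

    features: dict mapping node -> feature vector
    edges:    list of (source, destination) pairs
    """
    n_hidden = len(msg_layer[1])
    summed = {node: [0.0] * n_hidden for node in features}
    for src, dst in edges:
        msg = relu(linear(*msg_layer, features[src]))           # Message step
        summed[dst] = [a + b for a, b in zip(summed[dst], msg)]  # sum over senders
    return {node: linear(*upd_layer, summed[node]) for node in features}  # Update step


# Example: a 3-node chain with 3D features and 100 hidden units.
rng = random.Random(0)
msg_layer = init_layer(100, 3, rng)
upd_layer = init_layer(100, 100, rng)
features = {0: [0.0, 0.0, 0.0], 1: [0.1, -0.5, 0.2], 2: [0.3, -0.9, 0.1]}
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
hidden = message_passing(features, edges, msg_layer, upd_layer)
```

A second layer with a 3-unit update, as in the text, would reuse `message_passing` with `upd_layer = init_layer(3, 100, rng)`.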

B.2 TRAINING DETAILS

We train the model on subjects 1, 5, 6, 7, and 8 on every 4th frame. We reduce the learning rate on plateau, starting from an initial learning rate of 0.001 with a patience of 10 steps and reducing the learning rate by a factor of 10. Training is stopped after the 3rd decrease in the learning rate or after 200 epochs. The model was trained on a single Nvidia Tesla V100 GPU for about 6 days.
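The schedule and stopping rule described above could look roughly like the following sketch. The class name `PlateauSchedule` is hypothetical, the "steps" of patience are assumed to be validation epochs, and a decrease is assumed to trigger once the number of non-improving epochs exceeds the patience; the training code itself may differ in these details.

```python
class PlateauSchedule:
    """Reduce-on-plateau learning rate with a stop criterion (illustrative)."""

    def __init__(self, lr=1e-3, patience=10, factor=0.1,
                 max_decreases=3, max_epochs=200):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.max_decreases = max_decreases
        self.max_epochs = max_epochs
        self.best = float("inf")
        self.bad_epochs = 0
        self.decreases = 0
        self.epoch = 0

    def step(self, val_loss):
        """Register one epoch; return True while training should continue."""
        self.epoch += 1
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            self.lr *= self.factor  # divide the learning rate by 10
            self.decreases += 1
            self.bad_epochs = 0
        return self.decreases < self.max_decreases and self.epoch < self.max_epochs


def epochs_until_stop(schedule, losses):
    """Feed validation losses to the schedule and count epochs until it stops."""
    count = 0
    for loss in losses:
        count += 1
        if not schedule.step(loss):
            break
    return count
```

In a training loop, `schedule.step(val_loss)` would be called once per epoch and `schedule.lr` copied into the optimizer before the next epoch.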

B.3 ZERO-SHOT DENSITY ESTIMATION

We evaluate cGNF's zero-shot capability to estimate a previously unseen conditional density. We simulated 50 different triple pendulums with initial velocities sampled from a normal distribution v ∼ N(0, 10) for 25 timesteps each. Each pendulum was constructed from 4 nodes connected in a chain. The zeroth node was fixed and the remaining nodes x_1, x_2 and x_3 were freely moving. The nodes were observed with c_i = x_i + ε with ε ∼ N(0, 5 · 10^-2). On this dataset, we trained 3 models: (I) a CNF trained to estimate the density when all positions are observed, p(x | c_1, c_2, c_3); (II) a CNF trained on a density where only one node is observed, p(x | c_1); (III) a cGNF trained on the densities where at most 2 nodes are observed, i.e. the cGNF never sees examples of p(x | c_1, c_2, c_3). To test zero-shot capabilities, we compare the performances of these 3 models on p(x | c_1, c_2, c_3). Model I (CNF) serves as a reference for estimating this distribution when p(x | c_1, c_2, c_3) is in distribution. Model II (CNF) serves as a reference for a model that cannot zero-shot estimate densities, since p(x | c_1, c_2, c_3) is out of distribution for it. Model III (cGNF) shows that our model can zero-shot estimate a previously unseen conditional density (Fig. S1).
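The observation model and the training contexts for model III could be sketched as follows. Several details are assumptions of this sketch rather than statements from the text: the noise parameter 5 · 10^-2 is treated as a standard deviation, positions are 2D, the empty context is counted among the "at most 2 nodes observed" cases, and the names `training_masks` and `observe` are hypothetical.

```python
import itertools
import random

NODES = (1, 2, 3)  # the freely moving pendulum nodes


def training_masks(max_observed=2):
    """All sets of observed nodes with at most `max_observed` members."""
    masks = []
    for k in range(max_observed + 1):
        masks.extend(itertools.combinations(NODES, k))
    return masks


def observe(x, mask, noise_std=5e-2, rng=random):
    """Return c_i = x_i + eps for observed nodes and None for hidden ones.

    x maps node index -> (px, py); noise_std is assumed to be a standard deviation.
    """
    return {
        i: tuple(v + rng.gauss(0.0, noise_std) for v in x[i]) if i in mask else None
        for i in NODES
    }


# Example: one pendulum state, observed at nodes 1 and 3 only.
state = {1: (0.0, -1.0), 2: (0.0, -2.0), 3: (0.0, -3.0)}
context = observe(state, (1, 3), rng=random.Random(0))
held_out_mask = NODES  # p(x | c_1, c_2, c_3), never seen while training model III
```

Under these assumptions the training set covers seven contexts, and the fully observed context is reserved for the zero-shot evaluation.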

B.4 CONSEQUENCES OF MODEL SCALE

We explore the effect of increasing the number of parameters of the model. We train 3 sizes of models: 1) small with 852 546 parameters, 2) large with 3 301 546 parameters, and 3) xlarge with 8 318 741 parameters. The individual architectures were found by architecture search. We observe that as the size increases, the performance of cGNF on the lifting task improves, decreasing the gap to the state-of-the-art methods. On occluded joints, the performance improves further, outperforming the state-of-the-art method. However, the improvement in performance comes at the cost of calibration (Fig. S2).
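The calibration cost noted above could be quantified with a quantile-based calibration curve of the kind shown in Fig. 4c. The following is a minimal one-dimensional sketch, not the evaluation code behind the reported numbers: the function names, the scalar samples, and the definition of ECE as the mean absolute gap between the curve and the diagonal are assumptions of this sketch.

```python
import statistics


def quantile_rank(samples, truth):
    """Fraction of samples no farther from the sample median than the truth is."""
    median = statistics.median(samples)
    truth_dist = abs(truth - median)
    return sum(abs(s - median) <= truth_dist for s in samples) / len(samples)


def calibration_curve(sample_sets, truths, quantiles):
    """For each quantile q, how often the truth falls within the q-quantile region."""
    ranks = [quantile_rank(s, t) for s, t in zip(sample_sets, truths)]
    return [sum(r <= q for r in ranks) / len(ranks) for q in quantiles]


def expected_calibration_error(sample_sets, truths, n_bins=10):
    """Mean absolute gap between the calibration curve and the diagonal."""
    quantiles = [(k + 1) / n_bins for k in range(n_bins)]
    curve = calibration_curve(sample_sets, truths, quantiles)
    return sum(abs(w - q) for w, q in zip(curve, quantiles)) / n_bins


# An overconfident model: samples cluster tightly while the truth lies far away,
# so the truth is rarely inside the central quantile regions.
overconfident_curve = calibration_curve([[-0.1, 0.1]], [5.0], [0.5])
# An underconfident model: samples are wide while the truth sits at the median.
underconfident_curve = calibration_curve([[-1.0, 1.0]], [0.0], [0.5])
```

A well-calibrated model yields a curve close to the diagonal, so comparing this score across the small, large, and xlarge models would expose the trade-off described above.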

