OPTIMISING 2D POSE REPRESENTATION: IMPROVING ACCURACY, STABILITY AND GENERALISABILITY IN UNSUPERVISED 2D-3D HUMAN POSE ESTIMATION

Abstract

This paper addresses the problem of 2D pose representation during unsupervised 2D to 3D pose lifting to improve the accuracy, stability and generalisability of 3D human pose estimation (HPE) models. All current unsupervised 2D-3D HPE approaches provide the entire 2D kinematic skeleton to a model during training. We argue that this is sub-optimal and disruptive, as long-range correlations are induced between independent 2D key points and predicted 3D ordinates during training. To this end, we conduct the following study. With a maximum architecture capacity of 6 residual blocks, we evaluate the performance of 5 models, each of which represents a 2D pose differently during the adversarial unsupervised 2D-3D HPE process. Additionally, we show the correlations between 2D key points that are learned during the training process, highlighting the unintuitive correlations induced when an entire 2D pose is provided to a lifting model. Our results show that the optimal representation of a 2D pose is two independent segments, the torso and legs, with no shared features between the corresponding lifting networks. This approach decreased the average error by 20% on the Human3.6M dataset compared to a model with a near-identical parameter count trained on the entire 2D kinematic skeleton. Furthermore, due to the complex nature of adversarial learning, we show how this representation also improves convergence during training, allowing an optimal result to be obtained more often. Code and weights will be made available.

1. INTRODUCTION

Monocular 3D human pose estimation (HPE) aims to reconstruct the 3D skeleton of the human body from 2D images or video. This is known to be an ill-posed inverse problem, as multiple different 3D poses can correspond to the same 2D projection. Even with this hurdle, deep learning has enabled accurate 2D-3D pose regression mappings to be learned, yielding remarkable results when models are trained and tested on 3D pose datasets (Wandt & Rosenhahn, 2019; Martinez et al., 2017; Pavlakos et al., 2018; Yang et al., 2018; Cheng et al., 2020; Pavllo et al., 2019). Unfortunately, the difficulty of obtaining 3D datasets leads to poor performance when evaluating in domains where rigid environmental constraints (lighting, action, camera location, etc.) cannot be controlled. Recent work (Chen et al., 2019; Wandt et al., 2022; Drover et al., 2019; Yu et al., 2021) has investigated whether an unsupervised solution for 3D HPE is possible. These approaches utilize a geometric self-supervision cycle through random rotations to create a consistent lifting network, together with some form of pose or probability discriminator that assesses whether the rotated pose, once reprojected back to 2D, is realistic. As 2D data is cheaper to obtain, more efficient for computation and readily available in many circumstances, improving the performance of unsupervised 2D-3D HPE networks would allow accurate 3D poses to be obtained in many unconstrained scenarios.

An overlooked aspect of prior work, however, is the representation of the 2D pose given to the lifting model. We posit that when a full 2D pose is provided to a lifting model during training, long-range correlations are induced between a key point's 3D prediction and all of the pose's other 2D key-point coordinates (i.e. the 3D prediction of the left elbow will be influenced by the 2D coordinate of the right ankle).
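The geometric self-supervision cycle described above can be sketched minimally as follows. This is an illustrative sketch, not the exact model of any cited work: `lift_fn` (here a toy constant-depth lifter), the rotation about the vertical axis only, and the orthographic reprojection are all simplifying assumptions.

```python
import numpy as np

def random_y_rotation(rng):
    """Random rotation about the vertical (y) axis, standing in for the
    random rotations of the self-supervision cycle."""
    theta = rng.uniform(-np.pi, np.pi)
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def cycle(pose_2d, lift_fn, rng):
    """One pass of the lift -> rotate -> reproject cycle.

    pose_2d: (J, 2) array of 2D key points.
    lift_fn: hypothetical stand-in for the lifting network; maps a
             (J, 2) pose to per-joint depths of shape (J,).
    Returns the reprojected (J, 2) pose a discriminator would score.
    """
    z = lift_fn(pose_2d)                                      # predicted depths
    pose_3d = np.concatenate([pose_2d, z[:, None]], axis=1)   # (J, 3)
    rotated = pose_3d @ random_y_rotation(rng).T              # novel viewpoint
    return rotated[:, :2]                                     # orthographic reprojection

# Toy usage: a constant-depth "lifter" on a 4-joint pose.
rng = np.random.default_rng(0)
pose = rng.standard_normal((4, 2))
reproj = cycle(pose, lambda p: np.ones(len(p)), rng)
```

In training, a discriminator would score `reproj` for realism, and consistency losses would close the cycle by lifting and un-rotating the reprojected pose; note that rotating about the y-axis leaves each joint's vertical coordinate unchanged under this orthographic projection.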
Although supervised approaches have touched upon this topic by learning joint dependency through graph convolutional networks (GCN) (Lee et al., 2018; Zhao 

