OPTIMISING 2D POSE REPRESENTATION: IMPROVING ACCURACY, STABILITY AND GENERALISABILITY IN UNSUPERVISED 2D-3D HUMAN POSE ESTIMATION

Abstract

This paper addresses the problem of 2D pose representation during unsupervised 2D to 3D pose lifting to improve the accuracy, stability and generalisability of 3D human pose estimation (HPE) models. All unsupervised 2D-3D HPE approaches provide the entire 2D kinematic skeleton to a model during training. We argue that this is sub-optimal and disruptive, as long-range correlations are induced between independent 2D key points and predicted 3D ordinates during training. To this end, we conduct the following study. With a maximum architecture capacity of 6 residual blocks, we evaluate the performance of 5 models which each represent a 2D pose differently during the adversarial unsupervised 2D-3D HPE process. Additionally, we show the correlations between 2D key points which are learned during the training process, highlighting the unintuitive correlations induced when an entire 2D pose is provided to a lifting model. Our results show that the optimal representation of a 2D pose is that of two independent segments, the torso and legs, with no shared features between the two lifting networks. This approach decreased the average error by 20% on the Human3.6M dataset when compared to a model with a near-identical parameter count trained on the entire 2D kinematic skeleton. Furthermore, due to the complex nature of adversarial learning, we show how this representation can also improve convergence during training, allowing an optimum result to be obtained more often. Code and weights will be made available.

1. INTRODUCTION

Monocular 3D human pose estimation (HPE) aims to reconstruct the 3D skeleton of the human body from 2D images or video. This is known to be an ill-posed inverse problem, as multiple different 3D poses can project to the same 2D pose. Despite this hurdle, deep learning has allowed accurate 2D-3D pose regression mappings to be learned, yielding remarkable results when trained and tested on 3D pose datasets (Wandt & Rosenhahn, 2019; Martinez et al., 2017; Pavlakos et al., 2018; Yang et al., 2018; Cheng et al., 2020; Pavllo et al., 2019). Unfortunately, the difficulty of obtaining 3D datasets leads to poor performance when evaluating in domains where rigid environmental constraints (lighting, action, camera location, etc.) cannot be controlled. Recent work (Chen et al., 2019; Wandt et al., 2022; Drover et al., 2019; Yu et al., 2021) has investigated whether an unsupervised solution for 3D HPE is possible. These approaches utilize a geometric self-supervision cycle through random rotations to create a consistent lifting network, with some form of pose or probability discriminator judging whether the rotated pose, once reprojected back to 2D, is realistic. As 2D data is cheaper to obtain, more efficient for computation and readily available in many circumstances, improving the performance of unsupervised 2D-3D HPE networks would therefore allow accurate 3D poses to be obtained in many unconstrained scenarios. An overlooked aspect in prior work, however, is the representation of the 2D pose given to the lifting model. We posit that when a full 2D pose is provided to a lifting model during training, long-range correlations are induced between a key point's 3D prediction and all of the pose's other 2D key-point coordinates (i.e. the 3D prediction of the left elbow will be influenced by the 2D coordinate of the right ankle).
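The geometric self-supervision cycle described above can be sketched as follows. This is a minimal illustrative sketch in NumPy, not any published implementation: the placeholder depth predictor, the 17-key-point pose and the orthographic projection are all assumptions, and a real lifting network would be a trained model.

```python
import numpy as np

def lift(pose_2d, predict_depth):
    """Lift a 2D pose (N, 2) to 3D by appending a predicted depth per key point."""
    z = predict_depth(pose_2d)  # (N,) predicted depth ordinates
    return np.concatenate([pose_2d, z[:, None]], axis=1)  # (N, 3)

def rotate_y(pose_3d, theta):
    """Rotate a 3D pose about the vertical (y) axis by angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, 0.0, s],
                    [0.0, 1.0, 0.0],
                    [-s, 0.0, c]])
    return pose_3d @ rot.T

def project(pose_3d):
    """Orthographic reprojection: drop the depth ordinate."""
    return pose_3d[:, :2]

# One pass of the cycle: lift, rotate by a random angle, reproject.
# The reprojected pose is what a discriminator would judge for realism,
# and lifting it again closes the consistency loop.
rng = np.random.default_rng(0)
pose_2d = rng.standard_normal((17, 2))  # 17 key points, toy values
toy_depth = lambda p: np.zeros(len(p))  # placeholder depth predictor
pose_3d = lift(pose_2d, toy_depth)
theta = rng.uniform(-np.pi, np.pi)
reprojected = project(rotate_y(pose_3d, theta))
```

In the published approaches cited above, `toy_depth` would be the lifting network under training, and the discriminator would receive `reprojected` as a candidate "real" 2D pose.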
Supervised approaches have touched upon this topic by learning joint dependencies through graph convolutional networks (GCNs) (Lee et al., 2018; Zhao et al., 2019), modelling relationships between joints via relational networks (Park & Kwak, 2018), or splitting and recombining the limbs of a pose (Zeng et al., 2020). To our best knowledge, however, this has never been done in an unsupervised setting, and it has never been assumed that the pose could be two or more independent structures with no inter-relational correspondence needed. Additionally, the large variations in network architecture and optimisation within prior work mean we are unable to fairly compare the results between approaches to find an optimum representation. We address this problem by training 5 models with near-identical parameter counts and identical training approaches on different 2D pose representations. By evaluating the results obtained from each model we can determine an optimum representation that future work can use to obtain the best performance. We also show the correlations induced between 2D key points during training for both a model given the full pose and our best-performing 2D representation model, providing some intuition behind the improved performance. To summarise, our paper makes the following contributions:

• We show the effect of using different 2D pose representations during the unsupervised adversarial 2D-3D HPE process, where changing the 2D pose representation can reduce the average error by 20%.

• Our findings can be easily implemented within the current state of the art, as our approach utilizes the popular residual block introduced by Martinez (Martinez et al., 2017) and used within (Chen et al., 2019; Wandt et al., 2022; Wandt & Rosenhahn, 2019; Drover et al., 2019; Yu et al., 2021).
• We show the correlations induced between key points for a full 2D pose representation model and our best 2D pose representation model, highlighting the sub-optimal learning that occurs when a full 2D pose is provided to a network.

• We show the adversarial stability of our best pose representation model against a full 2D pose representation model, highlighting that the improvement is consistent across multiple random initializations.
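As a concrete illustration of the best-performing representation (two independent segments with no shared features), the sketch below splits a pose into torso and leg key-point groups and lifts each with its own toy network. The 16-joint ordering, the index sets and the one-layer "lifters" are hypothetical stand-ins for the paper's actual residual-block networks; they exist only to make the independence of the two segments concrete.

```python
import numpy as np

# Hypothetical key-point ordering for a 16-joint Human3.6M-style skeleton;
# the exact index sets are illustrative, not the paper's definition.
TORSO = [0, 7, 8, 9, 10, 11, 12, 13, 14, 15]  # root, spine, head, arms
LEGS = [0, 1, 2, 3, 4, 5, 6]                  # root, hips, knees, ankles

def make_lifter(n_joints, seed):
    """A toy stand-in for an independent lifting network: one linear map
    from flattened 2D key points to one depth per key point."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((n_joints * 2, n_joints)) * 0.01
    return lambda pose_2d: pose_2d.reshape(-1) @ w

torso_lifter = make_lifter(len(TORSO), seed=0)  # no parameters shared
leg_lifter = make_lifter(len(LEGS), seed=1)     # between the two networks

def lift_split(pose_2d):
    """Lift torso and legs independently, then reassemble the full 3D pose."""
    depth = np.zeros(len(pose_2d))
    depth[TORSO] = torso_lifter(pose_2d[TORSO])
    depth[LEGS] = leg_lifter(pose_2d[LEGS])  # shared root (index 0) overwritten
    return np.concatenate([pose_2d, depth[:, None]], axis=1)

pose_2d = np.zeros((16, 2))
pose_3d = lift_split(pose_2d)
```

Because the two lifters share no parameters, a leg key point cannot influence a torso depth prediction; this is exactly the independence that the split representation enforces and that a full-pose model cannot guarantee.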

2. RELATED WORK

There currently exist two main avenues of deep-learning research for 3D HPE. The first learns the mapping of 3D joints directly from a 2D image (Pavlakos et al., 2017; Lie et al., 2019; Mehta et al., 2017a; Li et al., 2015; Tome et al., 2017). The second builds upon an accurate intermediate 2D pose estimate, with the 2D pose obtained from an image through techniques such as stacked-hourglass architectures (Newell et al., 2016) or part affinity fields (Cao et al., 2021), and lifts this 2D pose to 3D. This work focuses on the latter 2D-to-3D lifting avenue, which can be organized into the following categories:



2.1 FULLY SUPERVISED

Fully supervised approaches seek to learn mappings from paired 2D-3D data which contain ground truth 2D locations of key points and their corresponding 3D coordinates. Martinez et al. (2017) introduced a baseline fully connected regression model which learned 3D coordinates from their relative 2D locations. Exemplar approaches such as Chen & Ramanan (2017) and Yang et al. (2019) use large dictionaries/databases of 3D poses with a nearest-neighbour search to determine an optimal 3D pose. Pavllo et al. (2019) used temporal convolutions over 2D key points to predict the pose of the central or end frame in a time series, whereas Mehta et al. (2017b) utilized multi-task learning to combine a convolutional pose regressor with kinematic skeleton fitting for real-time 3D HPE.

2.2 WEAKLY-SUPERVISED

Weakly-supervised approaches do not use explicit 2D-3D correspondences and instead use either augmented 3D data during training or unpaired 2D-3D data to learn human body priors (shape or articulation). Pavlakos et al. (2018) and Ronchi et al. (2018) proposed the learning of 3D poses from 2D with ordinal depth relationships between key points (e.g. the right wrist is behind the right elbow). Wandt & Rosenhahn (2019) introduced a weakly-supervised adversarial approach where

