CAPTURING THE MOTION OF EVERY JOINT: 3D HUMAN POSE AND SHAPE ESTIMATION WITH INDEPENDENT TOKENS

Abstract

In this paper we present a novel method to estimate 3D human pose and shape from monocular videos. This task requires directly recovering pixel-aligned 3D human pose and body shape from monocular images or videos, which is challenging due to its inherent ambiguity. To improve precision, existing methods rely heavily on an initialized mean pose and shape as a prior estimate and on parameter regression in an iterative error-feedback manner. In addition, video-based approaches model the overall change over image-level features to temporally enhance the single-frame feature, but fail to capture the rotational motion at the joint level and cannot guarantee local temporal consistency. To address these issues, we propose a novel Transformer-based model with a design of independent tokens. First, we introduce three types of tokens independent of the image feature: joint rotation tokens, a shape token, and a camera token. By progressively interacting with image features through Transformer layers, these tokens learn to encode prior knowledge of human 3D joint rotations, body shape, and position information from large-scale data, and are updated to estimate SMPL parameters conditioned on a given image. Second, benefiting from the proposed token-based representation, we further use a temporal model to capture the rotational temporal information of each joint, which is empirically conducive to preventing large jitters in local parts. Despite being conceptually simple, the proposed method attains superior performance on the 3DPW and Human3.6M datasets. Using ResNet-50 and Transformer architectures, it obtains 42.0 mm error on the PA-MPJPE metric of the challenging 3DPW dataset, outperforming state-of-the-art counterparts by a large margin. Code will be publicly available.

1. INTRODUCTION

Capturing the motion of the human body is of great value in widespread applications, such as movement analysis, human-computer interaction, film making, digital avatar animation, and virtual reality. Traditional marker-based motion capture systems can acquire accurate movement information of humans, but are applicable only to limited scenes due to the time-consuming fitting process and prohibitively expensive costs. In contrast, markerless motion capture based on RGB image and video processing algorithms is a promising alternative that has attracted much research in the fields of deep learning and computer vision. In particular, thanks to the parametric SMPL model (Loper et al., 2015) and diverse datasets with 3D annotations (Ionescu et al., 2013; Mehta et al., 2017; von Marcard et al., 2018), remarkable progress has been made on monocular 3D human pose and shape estimation and motion capture.


Figure 1: Left: Mainstream temporal-based human mesh methods, e.g. (Kanazawa et al., 2019a; Kocabas et al., 2020; Choi et al., 2021), adopt a temporal encoder to mix temporal information from past and future frames and then regress the SMPL parameters from the temporally enhanced feature of each frame. Right: Our method first acquires tokens of each joint in the time dimension and then separately captures the motion of each joint using a shared temporal encoder.

Existing regression-based human mesh recovery methods are implicitly based on the assumption that predicting 3D body joint rotations and human shape strongly depends on the given image features. The pose and shape parameters are directly estimated from the image feature using MLP regressors. Nevertheless, due to the inherent ambiguity, the mapping from 2D image features to 3D pose and shape is an ill-posed problem. To achieve accurate pose and shape estimation, these methods initialize the mean pose and shape parameters and use iterative residual regression to reduce error. Such an end-to-end learning and inference scheme (Kanazawa et al., 2018) has proven effective in practice, but it ignores temporal information and produces implausible human motions and unsatisfactory pose jitters on video streaming data. Video-based methods such as (Kanazawa et al., 2019a; Kocabas et al., 2020; Choi et al., 2021; Wei et al., 2022) leverage large-scale motion capture data as priors and exploit temporal information among different frames to penalize implausible motions. They usually enhance the single-frame feature using a temporal encoder and then still use a deep regressor to predict SMPL parameters from the temporally enhanced image feature, as shown in the left subfigure of Fig. 1.
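The iterative error-feedback regression described above can be sketched as follows. This is a minimal, illustrative stand-in, not the authors' implementation: the MLP regressor is replaced by a single random linear map `W`, and the 85-dimensional parameter vector (72 pose + 10 shape + 3 camera, a common SMPL regression convention) starts from a mean estimate and is refined by predicted residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, param_dim = 32, 85   # 85 = 72 pose + 10 shape + 3 camera (common SMPL convention)
# stand-in for a trained MLP regressor: one linear map over [feature; current estimate]
W = rng.normal(0.0, 0.01, (feat_dim + param_dim, param_dim))

def iterative_regress(img_feat, mean_params, n_iter=3):
    """HMR-style iterative error feedback: start from the mean pose/shape
    and repeatedly predict a residual correction conditioned on the
    image feature and the current estimate."""
    theta = mean_params.copy()
    for _ in range(n_iter):
        x = np.concatenate([img_feat, theta])
        delta = x @ W              # residual predicted from [feature; estimate]
        theta = theta + delta      # additive update toward the target parameters
    return theta

img_feat = rng.normal(size=feat_dim)
mean_params = np.zeros(param_dim)  # the initial mean estimate the paper refers to
theta = iterative_regress(img_feat, mean_params)
print(theta.shape)  # (85,)
```

The point of the loop is that each step conditions the predicted residual on the current estimate, which is exactly the dependence on a good initial (mean) estimate that the token-based design below removes.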
This scheme, however, is unable to focus on the joint-level rotational motion specific to each joint, and thus fails to ensure the temporal consistency of local joints. To address these problems, we attempt to understand human 3D reconstruction from a causal perspective. We argue that, assuming a still background, the primary causes behind image pixel changes and human body appearance changes are 1) the motions of 3D joint rotations in human skeletal dynamics and 2) the viewpoint changes of the observer (camera). In fact, a prior human body model exists independently of any specific image, and the 3D relative rotations of all joints (relative to their parent joints) and the body shape can be abstracted beyond image pixels, independent of the image contents and observer views. In other words, the joint rotations cannot be "seen"; they are image-independent and viewpoint-independent concepts. Based on these considerations, we propose a novel 3D human pose and shape estimation model based on independent tokens (INT). The core idea of the model is to introduce three types of independent tokens that specifically encode the 3D rotation information of every joint, the shape of the human body, and the camera information. These initialized tokens learn prior knowledge and mutual relationships from large-scale training data, requiring neither an iterative regressor that takes the mean shape and pose as an initial estimate (Kanazawa et al., 2018; Kolotouros et al., 2019a; Kocabas et al., 2020; Choi et al., 2021), nor a kinematic topology decoder defined by human prior knowledge (Wan et al., 2021). Given an image as a conditional observation, these tokens are repeatedly updated by interacting with 2D image evidence using a Transformer (Vaswani et al., 2017). Finally, they are transformed into posterior estimates of the pose, shape, and camera parameters.
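The token-update mechanism can be sketched as single-head cross-attention in plain numpy. This is a simplified illustration under assumed dimensions (24 SMPL body joints, a flattened 7x7 feature map, no learned projections or feed-forward blocks), not the paper's full Transformer: learnable tokens, initialized independently of any image, repeatedly query the image features and are then decoded to parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_joints = 32, 24                      # SMPL has 24 body joints
# 24 joint-rotation tokens + 1 shape token + 1 camera token,
# initialized independently of the image (they carry the learned prior)
tokens = rng.normal(0.0, 0.02, (n_joints + 2, d))
img_feat = rng.normal(size=(49, d))       # e.g. a flattened 7x7 CNN feature map

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(q, kv):
    """Single-head cross-attention: tokens (queries) gather 2D image
    evidence (keys/values) and are residually updated."""
    attn = softmax(q @ kv.T / np.sqrt(q.shape[-1]))
    return q + attn @ kv

for _ in range(3):                        # progressive updates through stacked layers
    tokens = cross_attend(tokens, img_feat)

joint_tokens = tokens[:n_joints]
shape_token, cam_token = tokens[n_joints], tokens[n_joints + 1]
# each updated joint token is decoded to a rotation (here a 6D representation,
# a common choice; W_rot stands in for a trained output head)
W_rot = rng.normal(0.0, 0.01, (d, 6))
rot6d = joint_tokens @ W_rot
print(rot6d.shape)  # (24, 6)
```

Because the tokens themselves hold the prior, no mean-parameter initialization or iterative residual regressor is needed: conditioning on the image happens entirely through attention.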
As a consequence, this method of abstracting joint rotation tokens from image pixels can represent the motion state of each joint and establish correlations in the time dimension. Benefiting from this, we can separately capture the temporal rotational motion of every joint by sending the tokens of each joint at different timestamps to a temporal model. Compared with capturing the overall temporal changes of image features and the whole pose, this modeling scheme focuses on the separate rotational motion of each joint, which is conducive to maintaining the temporal coherence and rationality of each joint rotation.
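The per-joint temporal modeling can be sketched as follows. As a hedged illustration, the shared temporal encoder is replaced here by a causal exponential moving average plus one linear map (the paper's actual temporal model is a learned network); the essential structure is that one encoder is shared across joints but applied to each joint's token sequence independently.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_joints, d = 16, 24, 32
# joint-rotation tokens collected over a video clip: (time, joint, dim)
joint_tokens = rng.normal(size=(T, n_joints, d))

W_t = rng.normal(0.0, 0.01, (d, d))  # one weight shared by all joints

def temporal_encode(seq):
    """Stand-in shared temporal encoder for ONE joint's token sequence:
    a causal moving average (suppressing frame-to-frame jitter in that
    joint's rotation) followed by a shared linear map."""
    smoothed = np.copy(seq)
    for t in range(1, len(seq)):
        smoothed[t] = 0.5 * smoothed[t - 1] + 0.5 * seq[t]
    return smoothed @ W_t

# the encoder is shared, but each joint is processed independently, so the
# model captures the rotational motion of joint j rather than the whole pose
out = np.stack([temporal_encode(joint_tokens[:, j]) for j in range(n_joints)],
               axis=1)
print(out.shape)  # (16, 24, 32)
```

Contrast this with the left side of Fig. 1, where a single temporal encoder mixes whole image-level features and cannot attribute a jitter to a specific joint's rotation.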

