SPATIO-TEMPORAL SELF-ATTENTION FOR EGOCENTRIC 3D POSE ESTIMATION

Abstract

Vision-based egocentric 3D human pose estimation (ego-HPE) is essential to support critical applications of XR technologies. However, severe self-occlusions and the strong distortion introduced by the fisheye view of the head-mounted camera make ego-HPE extremely challenging. While current state-of-the-art (SOTA) methods try to address the distortion, they still suffer from large errors on the most critical joints (such as the hands) due to self-occlusions. To this end, we propose a spatio-temporal Transformer model that attends to semantically rich feature maps obtained from popular convolutional backbones. To leverage the complex spatio-temporal information encoded in egocentric videos, we design feature map tokens (FMT), which can attend to all other spatial units in our spatio-temporal feature maps. Powered by this FMT-based Transformer, we build the Egocentric Spatio-Temporal Self-Attention Network (Ego-STAN), which uses heatmap-based representations and spatio-temporal attention specialized to address distortions and self-occlusions in ego-HPE. In quantitative evaluation on the contemporary sequential xR-EgoPose dataset, Ego-STAN achieves a 38.2% improvement on the highest-error joints over the SOTA ego-HPE model, while using 22% fewer parameters. Finally, we also demonstrate the generalization capabilities of our model to real-world HPE tasks beyond ego-views.

1. INTRODUCTION

The rise of virtual immersive technologies, such as augmented, virtual, and mixed reality environments (XR) [1] [2] [3], has fueled the need for accurate human pose estimation (HPE) to support critical applications in medical simulation training [4] and robotics [5], among others [6] [7] [8] [9] [10]. Vision-based HPE has increasingly become the primary choice [11] [12] [13] [14] [15], since the alternative requires sophisticated motion capture systems with sensors to track major human joints [16], which are impractical for real-world use. Vision-based 3D pose estimation is largely divided on the basis of camera viewpoint: outside-in versus egocentric. Extensive literature is devoted to outside-in 3D HPE, where the cameras have a fixed effective recording volume and view angle [17] [18] [19] [20], making them unsuitable for critical applications where high and robust (low-variance) accuracy is required [4]. In contrast, the egocentric perspective is mobile and amenable to large-scale cluttered environments, since the viewing angle stays on the subject with minimal obstruction from the surroundings [21] [22] [23]. Nevertheless, egocentric imaging comes with its own challenges: lower-body joints are (a) visually much smaller than upper-body joints (distortion) and (b) in most cases heavily occluded by the upper torso (self-occlusion). Recent works address these challenges with a dual-branch autoencoder-based 2D-to-3D pose estimator [21] and by incorporating extra camera information [24]. However, self-occlusions remain difficult to resolve from static views alone. Moreover, while critical applications of ego-HPE (e.g., surgeon training [4]) require accurate and robust estimation of the extremities (hands and feet), current methods suffer from high errors on these very joints, making them unsuitable for such applications [21, 24].
Previous outside-in spatio-temporal works regress 3D pose from an input sequence of 2D keypoints, not images [25-28], and focus on mitigating the high output variance of 3D human pose. In contrast, we estimate accurate 2D heatmaps directly from images by dynamically aggregating intermediate feature maps, and consequently produce accurate 3D poses. Moreover, outside-in occlusion-robustness methods are not applicable to ego-pose due to dynamic camera angles, constantly changing backgrounds, and distortion, with recent works requiring supervision on joint visibility [29] [30] [31].

Our main contributions are summarized as follows.

• Feature map token and interpretable attention. To leverage the complex spatio-temporal information encoded in egocentric videos, we design the feature map token (FMT): learnable parameters that, through our spatio-temporal Transformer, globally attend to all spatial units of the extracted sequential feature maps to aggregate valuable information. The FMT also provides interpretability, revealing the complex temporal dependence of the attention (Fig. 1).

• Hybrid spatio-temporal Transformer powered by the FMT. Powered by the FMT, we design Ego-STAN's hybrid architecture, which uses Transformer-based [32] spatio-temporal attention to self-attend to a sequence of semantically rich feature maps extracted by a Convolutional Neural Network (ResNet-101) [33]. Complementary to this architecture, we propose an ℓ1-based loss function for robust pose estimation that handles both self-occlusions and visually difficult (low-resolution) joints. In addition, we evaluate Ego-STAN on Human3.6M, an outside-in sequential HPE dataset, showing an 8% improvement in the Percentage of Correct Keypoints (PCK) for 2D joint detection and demonstrating the versatility of the proposed attention architecture and FMT.

• Direct regression from heatmap to 3D pose.
We propose a simple neural-network-based 2D-heatmap-to-3D-pose regression module, which significantly reduces both the overall MPJPE and the number of trainable parameters compared to the SOTA [21]. We also indirectly evaluate the advantages of this module via HPE on the Mo2Cap2 dataset (static ego-HPE) and on the Human3.6M dataset. Through detailed ablations, we also reveal a surprising fact: the autoencoder-based architectures recommended by the SOTA may create information bottlenecks and be counterproductive for ego-HPE.

• Extensive ablation studies. We perform comprehensive ablation studies to analyze the impact of each component of Ego-STAN. These ablations thoroughly demonstrate that the composition of the Transformer network, the ℓ1 loss, direct 3D regression, and the FMT leads to Ego-STAN's superior performance.
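To make the FMT idea concrete, the following is a minimal NumPy sketch of single-head self-attention over flattened spatio-temporal feature maps with one learnable token prepended. All sizes (T, S, D) and the random weights are illustrative assumptions, not the paper's actual architecture or trained parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Hypothetical sizes: T frames, each feature map flattened into S spatial
# units of dimension D (a real backbone like ResNet-101 would set these).
T, S, D = 3, 16, 32
feature_maps = rng.standard_normal((T * S, D))  # spatio-temporal feature tokens
fmt = rng.standard_normal((1, D))               # "learnable" feature map token (FMT)

tokens = np.concatenate([fmt, feature_maps], axis=0)  # (1 + T*S, D)

# Single-head scaled dot-product self-attention with random projections.
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
attn = softmax(Q @ K.T / np.sqrt(D))  # (1 + T*S, 1 + T*S) attention weights
out = attn @ V

# Row 0 is the FMT: it attends to every spatial unit of every frame, so
# out[0] is a global spatio-temporal summary usable for heatmap decoding.
fmt_summary = out[0]
```

Because the FMT's attention row spans all T*S spatial units, visualizing that row (as in Fig. 1) shows which image regions across time the model aggregates from.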



Figure 1: Interpreting Ego-STAN's attention mechanism. A sequence of images I(1), I(2), and I(3) yields feature maps F(1), F(2), and F(3), respectively, which are appended with a (learnable) feature map token (K). Sections of Ego-STAN's feature map token (in blue) can be deconvolved to identify the corresponding attended region(s) in the image sequence (in red), allowing interpretation of how information is aggregated from the images.

Therefore, we need to build specialized models to address the unique challenges of ego-HPE. Given these challenges, we investigate the following question: how can we design a unified model to reliably estimate the locations of heavily occluded joints while addressing the distortions of egocentric views? To this end, we propose the Egocentric Spatio-Temporal Self-Attention Network (Ego-STAN), which leverages a specialized spatio-temporal attention mechanism that we call the feature map token (FMT), heatmap-based representations, and a simple 2D-to-3D pose estimation module. On the SOTA sequential ego-view dataset xR-EgoPose [21], Ego-STAN achieves an average 38.2% improvement in mean per-joint position error (MPJPE) on the highest-error joints against the SOTA egocentric pose estimation work [21], while using 22% fewer trainable parameters. Furthermore, Ego-STAN generalizes to other HPE tasks on the static ego-view Mo2Cap2 dataset [22] and on outside-in views from the Human3.6M dataset [16], where it reduces MPJPE by 9% against [21], demonstrating its ability to generalize to real-world views and adapt to other HPE scenarios.
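For reference, the MPJPE metric used throughout these comparisons is simply the mean Euclidean distance between predicted and ground-truth 3D joints. A small self-contained sketch (the joint count and toy values are made up for illustration):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance
    between predicted and ground-truth joints. pred, gt: (J, 3) arrays."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

# Toy example: 3 joints, each prediction offset by 3 units along x.
gt = np.zeros((3, 3))
pred = gt + np.array([3.0, 0.0, 0.0])
error = mpjpe(pred, gt)  # 3.0
```

Reported improvements such as "38.2% on the highest-error joints" are relative reductions of this quantity, typically computed per joint before averaging over the selected joints.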

