SPATIO-TEMPORAL SELF-ATTENTION FOR EGOCENTRIC 3D POSE ESTIMATION

Abstract

Vision-based egocentric 3D human pose estimation (ego-HPE) is essential to support critical applications of XR technologies. However, severe self-occlusions and the strong distortion introduced by the fisheye view of the head-mounted camera make ego-HPE extremely challenging. While current state-of-the-art (SOTA) methods attempt to address the distortion, they still suffer from large errors on the most critical joints (such as the hands) due to self-occlusions. To this end, we propose a spatio-temporal transformer model that attends to semantically rich feature maps obtained from popular convolutional backbones. Leveraging the complex spatio-temporal information encoded in egocentric videos, we design a spatial concept called feature map tokens (FMT), which can attend to all other spatial units in our spatio-temporal feature maps. Powered by this FMT-based transformer, we build the Egocentric Spatio-Temporal Self-Attention Network (Ego-STAN), which uses heatmap-based representations and spatio-temporal attention specialized to address distortions and self-occlusions in ego-HPE. Our quantitative evaluation on the contemporary sequential xR-EgoPose dataset achieves a 38.2% improvement on the highest-error joints against the SOTA ego-HPE model, while achieving a 22% reduction in the number of parameters. Finally, we also demonstrate the generalization capabilities of our model to real-world HPE tasks beyond ego-views.
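To make the FMT idea concrete, the following is a minimal, hypothetical sketch (not the paper's implementation): a single learnable feature-map token is prepended to the flattened spatio-temporal feature-map units, and single-head scaled dot-product self-attention lets that token aggregate information from every spatial unit across all frames. The shapes, the use of random matrices in place of learned Q/K/V projections, and the function names are illustrative assumptions only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fmt_attention(feature_maps, fmt_token, rng):
    """Sketch of a feature-map token (FMT) attending to all spatio-temporal units.

    feature_maps: (T, H, W, C) stack of per-frame convolutional feature maps
    fmt_token:    (C,) learnable summary token (hypothetical shape)
    Returns the attended FMT vector of shape (C,).
    """
    T, H, W, C = feature_maps.shape
    tokens = feature_maps.reshape(T * H * W, C)      # flatten space and time
    seq = np.vstack([fmt_token[None, :], tokens])    # prepend the FMT
    # Random projections stand in for learned Q/K/V weight matrices.
    Wq, Wk, Wv = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
    Q, K, V = seq @ Wq, seq @ Wk, seq @ Wv
    attn = softmax(Q @ K.T / np.sqrt(C))             # scaled dot-product attention
    out = attn @ V
    return out[0]  # row 0: the FMT, now a summary of all spatio-temporal units

rng = np.random.default_rng(0)
fmaps = rng.standard_normal((4, 8, 8, 32))  # e.g. T=4 frames, 8x8 maps, C=32
fmt = rng.standard_normal(32)
summary = fmt_attention(fmaps, fmt, rng)
print(summary.shape)  # (32,)
```

In Ego-STAN this summary-style token would feed the downstream heatmap-based pose estimation head; here it simply illustrates how one token can attend to every spatio-temporal feature-map unit.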

1. INTRODUCTION

The rise of virtual immersive technologies, such as augmented, virtual, and mixed reality environments (XR) [1-3], has fueled the need for accurate human pose estimation (HPE) to support critical applications in medical simulation training [4] and robotics [5], among others [6-10]. Vision-based HPE has increasingly become the primary choice [11-15], since the alternative requires sophisticated motion capture systems with sensors to track major human joints [16], which are impractical for real-world use. Vision-based 3D pose estimation is largely divided on the basis of camera viewpoint: outside-in versus egocentric view. Extensive literature is devoted to outside-in 3D HPE, where the cameras have a fixed effective recording volume and view angle [17-20], making them unsuitable for critical applications where high and robust (low-variance) accuracy is required [4]. In contrast, the egocentric perspective is mobile and amenable to large-scale cluttered environments, since the viewing angle is consistently on the subject with minimal obstructions from the surroundings [21-23]. Nevertheless, egocentric imaging does come with challenges: lower-body joints are (a) visually much smaller than the upper-body joints (distortion) and (b) in most cases heavily occluded by the upper torso (self-occlusion). Recent works address these challenges by utilizing a dual-branch autoencoder-based 2D-to-3D pose estimator [21], and by incorporating extra camera information [24]. However, self-occlusions remain challenging to address from static views alone. Moreover, while critical applications of ego-HPE (e.g., surgeon training [4]) require accurate and robust estimation of the extremities (hands and feet), current methods suffer from high errors on these very joints, making them unsuitable for such critical applications [21, 24].
Previous outside-in spatio-temporal works attempt to regress 3D pose from an input sequence of 2D keypoints (not images) [25-28], and focus on mitigating the high output variance of the 3D human pose. In contrast, we estimate accurate 2D heatmaps (from images) by dynamically aggregating over intermediate feature maps, and consequently produce accurate 3D poses. Moreover, outside-in occlusion-robustness methods are not applicable to ego-pose due to dynamic camera angles, a constantly changing background, and distortion, with

