LIFTEDCL: LIFTING CONTRASTIVE LEARNING FOR HUMAN-CENTRIC PERCEPTION

Abstract

Figure 1: The backbone network pre-trained by LiftedCL can be transferred to various human-centric downstream tasks, including human pose estimation, human shape recovery and human parsing. The first column shows two examples from the Human3.6M (Ionescu et al., 2013) dataset. The second and third columns present the estimated 2D and 3D poses. The last two columns demonstrate the reconstructed human meshes and the estimated human semantic parts.

1. INTRODUCTION

Human-centric perception, such as human pose estimation (Xiao et al., 2018; Sun et al., 2019; Pavllo et al., 2019; Gong et al., 2021), human shape recovery (Kanazawa et al., 2018; Choi et al., 2020; Xu et al., 2021) and human parsing (Yang et al., 2019; Li et al., 2020; Gong et al., 2018), has received significant attention in computer vision. As in other computer vision tasks, pre-training has become a widely used paradigm in human-centric perception: models are first pre-trained on large-scale datasets (e.g., ImageNet (Deng et al., 2009)) and then fine-tuned on a specific human-centric downstream task.

For human-centric perception, leveraging 3D human structure information at the fine-tuning stage has been demonstrated to improve performance. For instance, in 3D pose estimation, RepNet (Wandt & Rosenhahn, 2019) adds a KCS (Wandt et al., 2018) layer to an adversary to better represent the bone lengths and joint angles of a pose, which yields more accurate 3D pose reconstruction. HMR (Kanazawa et al., 2018) employs a prior human body model parameterized by shape and 3D joint angles and shows competitive results on 3D pose estimation and part segmentation. In (Qiu et al., 2019), a penalty is added when the estimated 3D pose has unreasonable limb lengths according to a human body structure prior. Such methods show that leveraging a 3D human kinematic prior at the fine-tuning stage contributes to performance. We argue that models can also benefit from a 3D human kinematic prior at the pre-training stage.

Concurrently, powered by contrastive representation learning, recent self-supervised pre-training methods (Chen et al., 2020a; He et al., 2020; Grill et al., 2020; Caron et al., 2018) have broken the dominance of supervised ImageNet pre-training on various downstream tasks, including image classification, object detection and semantic segmentation.
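To make the KCS representation mentioned above concrete, the sketch below (not from the paper; the joint ordering and connection matrix C are illustrative) computes a Kinematic Chain Space matrix: bone vectors B = PC are formed from 3D joint positions, and the Gram matrix Ψ = BᵀB carries squared bone lengths on its diagonal and inter-bone angle information off-diagonal.

```python
import numpy as np

# Hypothetical 5-joint skeleton with 4 bones:
# pelvis->spine, pelvis->hip, spine->neck, hip->knee.
# Each column of C has +1 at the bone's start joint and -1 at its end.
C = np.array([
    [ 1,  1,  0,  0],   # pelvis
    [-1,  0,  1,  0],   # spine
    [ 0,  0, -1,  0],   # neck
    [ 0, -1,  0,  1],   # hip
    [ 0,  0,  0, -1],   # knee
], dtype=float)

def kcs(P):
    """Kinematic Chain Space matrix.

    P: (3, k) array of 3D joint positions.
    Returns a (b, b) matrix whose diagonal holds squared bone lengths
    and whose off-diagonal entries encode angles between bones.
    """
    B = P @ C          # (3, b) bone vectors
    return B.T @ B     # (b, b) Gram matrix of the bones

P = np.random.randn(3, 5)       # random joint positions for illustration
Psi = kcs(P)
# Diagonal entries equal the squared bone lengths.
assert np.allclose(np.diag(Psi), np.sum((P @ C) ** 2, axis=0))
```

Because Ψ is invariant to global rotation and translation of the pose, an adversary operating on it can focus on anthropometric plausibility (limb lengths, joint angles) rather than camera pose.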
These self-supervised learning methods mainly adopt image-level instance discrimination as a pretext task to learn transferable representations. In addition, some methods (Wang et al., 2021a; Xie et al., 2021; Xiao et al., 2021; Wang et al., 2021b) extend the image-level contrastive learning framework to a dense paradigm, achieving superior performance on dense prediction tasks. However, existing contrastive representation learning methods are mainly designed for image classification, object detection and semantic segmentation, rather than human-centric perception. Better results on those tasks do not guarantee superior performance on human-centric perception (see Table 4). Moreover, most of these works do not link a 3D prior to 2D representation learning. Human-centric perception involves challenging cases, such as invisible joints and self-occluded keypoints, where a 3D human kinematic prior can help the model better understand the relationships between body parts. Thus, a pre-training approach tailored to human-centric tasks is still desirable.

Our goal is to improve contrastive learning by leveraging 3D human structure information for human-centric pre-training in a simple yet effective way. To this end, we propose a novel contrastive learning framework termed LiftedCL that exploits 3D human structure information for human-centric pre-training. First, we generalize the conventional InfoNCE loss (Oord et al., 2018) to an equivariant paradigm. Based on this, image-level invariant and pixel-level equivariant contrastive learning are applied to the projected feature vectors and maps, respectively. Meanwhile, the representations are lifted to 3D human skeletons to better reveal the hidden human structure information. In particular, a set of 3D skeletons is randomly sampled by resorting to a 3D human kinematic prior.
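For reference, the conventional InfoNCE loss that LiftedCL generalizes can be sketched as below. This is a minimal NumPy illustration, not the paper's implementation; the function name and the temperature value of 0.07 are placeholder choices.

```python
import numpy as np

def info_nce(query, keys, pos_idx, temperature=0.07):
    """InfoNCE loss for a single query (Oord et al., 2018).

    query: (d,) L2-normalized embedding of one augmented view.
    keys:  (n, d) L2-normalized embeddings, where keys[pos_idx] is the
           positive (the other view of the same image) and the rest
           serve as negatives.
    Returns the negative log-probability of picking the positive.
    """
    logits = keys @ query / temperature          # (n,) scaled similarities
    logits -= logits.max()                       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[pos_idx]

# Usage: the loss is small when the positive key matches the query.
q = np.ones(4) / 2.0                 # unit-norm query
keys = np.stack([q, -q])             # positive at index 0, negative at 1
loss = info_nce(q, keys, pos_idx=0)
```

Image-level contrastive learning applies this loss to pooled feature vectors so that two views of the same image are invariant; the dense, pixel-level variant applies an analogous loss per spatial location, where equivariance to the geometric augmentation determines which locations are positives.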
With this set of real 3D samples, an adversary is adopted to induce the learning of 3D-aware human-centric representations. We demonstrate the effectiveness of LiftedCL by pre-training on MS COCO (Lin et al., 2014) human images and fine-tuning on specific target datasets. Compared to the state-of-the-art method PixPro (Xie et al., 2021), LiftedCL achieves significant improvements on various human-centric downstream tasks, including COCO 2D human pose estimation (+0.4% mAP), MPII 2D human pose estimation (+0.3% PCKh@0.5), Human3.6M 2D human pose estimation (+0.9% JDR), Human3.6M 3D human pose estimation (-1.8 mm MPJPE), 3DPW human shape recovery (-1.7 mm reconstruction error) and LIP human parsing (+0.5% mIoU).

Our main contributions are summarized as follows:

• We propose Lifting Contrastive Learning (LiftedCL) for human-centric pre-training in a simple yet effective way.

• We demonstrate a feasible approach to learning 3D-aware representations via lifting and adversarial learning, using only single-view images.

• LiftedCL significantly outperforms state-of-the-art self-supervised learning methods on four human-centric downstream tasks, including 2D and 3D human pose estimation (0.4% mAP and 1.8 mm MPJPE improvement on COCO 2D pose estimation and Human3.6M 3D pose estimation, respectively), human shape recovery and human parsing.
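The "real" skeletons fed to the adversary are drawn from a kinematic prior rather than from 3D annotations. A minimal sketch of such sampling is given below, reduced to a single planar leg chain for brevity; the bone lengths and joint-angle limits are placeholder values, not the anthropometric statistics the paper would use, and a full prior would cover the whole body in 3D.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical kinematic chain hip -> knee -> ankle.
BONE_LENGTHS = [0.45, 0.42]               # thigh, shank (metres; illustrative)
ANGLE_LIMITS = [(-0.5, 2.0), (0.0, 2.4)]  # hip / knee flexion range (radians)

def sample_skeleton():
    """Sample one plausible chain by drawing each joint angle within its
    limits and running forward kinematics down the chain."""
    joints = [np.zeros(2)]                # root (hip) at the origin
    theta = 0.0                           # accumulated orientation
    for length, (lo, hi) in zip(BONE_LENGTHS, ANGLE_LIMITS):
        theta += rng.uniform(lo, hi)      # relative joint angle
        step = length * np.array([np.sin(theta), -np.cos(theta)])
        joints.append(joints[-1] + step)
    return np.stack(joints)               # (3, 2) joint positions

# A batch of sampled skeletons serves as the adversary's "real" set.
real_skeletons = np.stack([sample_skeleton() for _ in range(128)])
```

Because every sample respects the bone lengths and angle limits by construction, a discriminator trained to separate these samples from the lifted skeletons pushes the lifting head, and hence the backbone, toward kinematically plausible 3D-aware representations.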

