LIFTEDCL: LIFTING CONTRASTIVE LEARNING FOR HUMAN-CENTRIC PERCEPTION

Abstract

[Figure 1] The backbone network pre-trained by LiftedCL can be transferred to various human-centric downstream tasks including human pose estimation, human shape recovery, and human parsing. The first column shows two examples from the Human3.6M (Ionescu et al., 2013) dataset. The second and third columns present the estimated 2D and 3D poses. The last two columns show the reconstructed human mesh and the estimated human semantic parts.

1. INTRODUCTION

Human-centric perception, such as human pose estimation (Xiao et al., 2018; Sun et al., 2019; Pavllo et al., 2019; Gong et al., 2021), human shape recovery (Kanazawa et al., 2018; Choi et al., 2020; Xu et al., 2021), and human parsing (Yang et al., 2019; Li et al., 2020; Gong et al., 2018), has received significant attention in computer vision. Similar to other computer vision tasks, pre-training the

