RETHINKING SELF-SUPERVISED VISUAL REPRESENTATION LEARNING IN PRE-TRAINING FOR 3D HUMAN POSE AND SHAPE ESTIMATION

Abstract

Recently, a few self-supervised representation learning (SSL) methods have outperformed ImageNet classification pre-training for vision tasks such as object detection. However, their effects on 3D human body pose and shape estimation (3DHPSE), whose target is fixed to a single class, the human, and which has an inherent task gap with SSL, remain open to question. We empirically study and analyze the effects of SSL and further compare it with other pre-training alternatives for 3DHPSE. The alternatives are 2D annotation-based pre-training and synthetic data pre-training, which share the motivation of SSL to reduce labeling costs. They have been widely utilized as sources of weak supervision or fine-tuning, but have not been examined as pre-training sources. SSL methods underperform conventional ImageNet classification pre-training on multiple 3DHPSE benchmarks by 7.7% on average. In contrast, despite a much smaller amount of pre-training data, 2D annotation-based pre-training improves accuracy on all benchmarks and shows faster convergence during fine-tuning. Our observations challenge the naive application of current SSL pre-training to 3DHPSE and highlight the value of other data types in the pre-training aspect.

1. INTRODUCTION

Transferring the knowledge contained in one task and dataset to solve other downstream tasks (i.e., transfer learning) has proven very successful in a range of computer vision tasks (Girshick et al., 2014; Carreira & Zisserman, 2017; He et al., 2017). In practice, transfer learning is done by pre-training a backbone (He et al., 2016) on source data to learn visual representations that benefit the target task. ImageNet classification has been the de facto pre-training paradigm in computer vision, and the 3D human body pose and shape estimation (3DHPSE) literature has followed it.

Recently, self-supervised representation learning (SSL) has gained popularity in the interest of reducing labeling costs (Chen et al., 2020a; Grill et al., 2020; He et al., 2020; Caron et al., 2020; Hénaff et al., 2021). SSL pre-trains a backbone on unlabeled images of arbitrary objects and fine-tunes the backbone on target tasks. MoCo (He et al., 2020) and DetCon (Hénaff et al., 2021) surpassed ImageNet classification pre-training on downstream tasks such as object detection and instance segmentation over arbitrary object classes. Motivated by them, PeCLR (Spurr et al., 2021) and HanCo (Zimmermann et al., 2021) targeted the human hand and pre-trained a backbone on hand data without 3D labels. They showed accuracy improvements for 3D hand pose and shape estimation in a controlled setting (Zimmermann et al., 2019), compared with random initialization (no pre-training) and ImageNet classification pre-training. While the results of PeCLR and HanCo are promising for 3DHPSE, they offer limited practical lessons. For example, the amount of labeled hand data used for fine-tuning (∼64K) is significantly smaller than that of commonly used labeled body data (∼480K). Also, the total training (pre-training & fine-tuning) time of the different approaches is not matched, which is critical to the final accuracy (He et al., 2019).
Last, they require labeled data with hand bounding boxes.

This paper questions the effectiveness of SSL pre-training for 3DHPSE by thoroughly comparing it with alternatives in multiple aspects (i.e., final accuracy, convergence speed, and cost-effectiveness). We perform experiments by fixing the fine-tuning task to 3DHPSE and changing the pre-training approach. The experiments are organized in three steps. First, we compare state-of-the-art SSL methods, pre-trained on ImageNet, with ImageNet classification pre-training. Unlike for object detection and instance segmentation, the SSL methods are outperformed by classification pre-training on three 3DHPSE benchmarks by a 7.7% margin on average. Interestingly, the accuracy of SSL is comparable to, or even worse than, the random initialization baseline. The results imply that general visual representations learned by SSL can be detrimental to 3DHPSE. Second, we explore the reasons behind the current SSL methods' disappointing performance in depth by contriving a new pre-training approach. Modern SSL pre-training methods (Chen et al., 2020a; He et al., 2020) have two unfavorable factors for 3DHPSE: 1) they learn inconsistent representations for instances of the same class, as argued by Khosla et al. (2020), which hinders learning high-level priors about a specific class, and 2) SSL pre-training has an instance-level learning characteristic (i.e., a single attribute per image), which has an inherent task gap with 3DHPSE, a task that requires understanding of fine-level semantic information (i.e., multiple attributes per image), the human joints. We combine an SSL approach with 2D joint labels, which we call JointCon, to experimentally validate the effects of the two factors. Last, we compare the pre-training alternatives with the classification baseline. The 2D annotation-based pre-training improves accuracy on all benchmarks; in 3DPW, the convergence speed is approximately 2× faster. In the semi-supervised setting, the accuracy improvement increases to 9.9% on 3DPW and 7.1% on Human3.6M.
We assume that the rich pose and appearance information learned from the 2D pose data is, as expected, the key to these improvements. Synthetic data pre-training produces higher errors than the classification baseline. We conjecture that a domain gap between real and synthetic data inter-



Unless otherwise noted, 'data' indicates labeled images



Figure 1: (Left) We pre-train a backbone (ResNet-50 (He et al., 2016)) with different data types: unlabeled arbitrary objects (Russakovsky et al., 2015), labeled arbitrary objects (Russakovsky et al., 2015), synthetic 3D human data (Varol et al., 2017), and real 2D human data (Lin et al., 2014). (Right) 3DHPSE errors when initializing the backbone with differently pre-trained weights. We fine-tune PARE (Kocabas et al., 2021) on Human3.6M (Ionescu et al., 2014) and MSCOCO (Lin et al., 2014) and evaluate it on 3DPW (von Marcard et al., 2018).

JointCon contrasts local image features at human joint locations instead of global features of whole images.
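The exact JointCon objective is not given in this excerpt; as a rough, hypothetical sketch, a supervised contrastive loss in the spirit of Khosla et al. (2020), applied to per-joint local features rather than whole-image features, could look as follows (the function name, shapes, and temperature are our assumptions, not the authors' implementation):

```python
import numpy as np

def joint_contrastive_loss(features, joint_labels, temperature=0.1):
    """Supervised contrastive loss over local joint features.

    Features carrying the same joint-type label (e.g. all 'left knee'
    crops across a batch) are pulled together as positives; features of
    other joint types act as negatives. Shapes and names are illustrative.

    features:     (N, D) local feature vectors, one per sampled joint.
    joint_labels: (N,)   integer joint-type labels.
    """
    # Cosine similarities between L2-normalized features.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T / temperature
    np.fill_diagonal(sim, -np.inf)            # a feature is never its own pair
    sim -= sim.max(axis=1, keepdims=True)     # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = (joint_labels[:, None] == joint_labels[None, :])
    pos &= ~np.eye(len(joint_labels), dtype=bool)
    # Mean negative log-likelihood of the positives for each anchor.
    per_anchor = -np.where(pos, log_prob, 0.0).sum(axis=1)
    return (per_anchor / np.maximum(pos.sum(axis=1), 1)).mean()
```

Under this formulation, features of the same joint type across images are encouraged to be consistent, directly addressing factor 1), while operating on local joint features rather than one global feature per image addresses factor 2).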

