DEEP DEFORMATION BASED ON FEATURE CONSTRAINTS FOR 3D HUMAN MESH CORRESPONDENCE

Abstract

In this study, we address the challenges of mesh correspondence for various types of complete or single-view human body data. Parametric human models have been widely used in human-related applications and in 3D human mesh correspondence because they provide sufficient scope for modifying the resulting model. In contrast to prior methods that optimize both the correspondences and the human model parameters (pose and shape), some recent methods directly deform each vertex of a parametric template by processing the point clouds that represent the input shapes. This allows the models to represent details more accurately while maintaining the correspondence. However, we identified two limitations in these methods. First, it is difficult for the transformed template to completely restore the input shapes using only a pointwise reconstruction loss. Second, they cannot deform the template to a single-view human body captured by a depth camera, nor infer the correspondences between various forms of input human bodies. In representation learning, one of the main challenges is to design appropriate loss functions for supervising features with different abilities. To address this, we introduce the feature constraint deformation network (FCD-Net), an end-to-end deep learning approach that identifies 3D human mesh correspondences by learning various shape transformations from a predetermined template. The FCD-Net adopts an encoder-decoder architecture: the encoder produces a global feature from the input shape, and the decoder deforms the template as guided by this feature. During training, we simultaneously feed the complete shape and the single-view shape into the encoder and constrain their features to be close, which enables the encoder to learn more robust features. Meanwhile, the decoder learns to generate a fully transformed template by using the complete shape as the ground truth, even when the input is single-view human body data.
We conduct extensive experiments to validate the effectiveness of the proposed FCD-Net on four types of single-view human body data, from both qualitative and quantitative aspects. We also demonstrate that our approach improves on the state-of-the-art results in the difficult "FAUST-inter" and "SHREC'19" challenges, with average correspondence errors of 2.54 cm and 6.62 cm, respectively. In addition, the proposed FCD-Net performs well on real, noisy point clouds from a depth camera.

1. INTRODUCTION

The rapid development of 3D sensor devices has led to tremendous growth in the field of 3D vision technologies. An essential application of 3D vision technology is 3D shape correspondence. Model-driven shape reconstruction and matching methods for articulated humans utilize a parametric body template model, e.g., the skinned multi-person linear (SMPL) model (Loper et al. (2015)) or the shape completion and animation for people (SCAPE) model (Anguelov et al. (2005)), as a geometric prior, and optimize or learn the parameters that deform the template, typically its pose and shape. The deformed models share the same vertex definition, definite semantic information, and the same face connectivity as the template. This makes the correspondence problem easier than in methods that must find associations between a variable and large number of points through an optimization strategy that minimizes an objective function. However, the low-dimensional parameters (shape and pose) limit the description of details. Researchers have proposed SMPL + displacement approaches to increase the detail of the model (Bhatnagar et al. (2020)). Other works have combined a parametric model with a deep implicit function to realize free-form human body reconstruction at arbitrary resolutions (Huang et al. (2020); He et al. (2021)). However, SMPL + displacement methods require two stages to produce the final results, making them prone to errors at both levels. Free-form implicit functions can lose the semantic information and correspondences of the points. Some recent works (Wang et al. (2019)) have directly deformed each vertex of a parametric template based on a global feature encoded from the input shape. This allows the models to represent details more accurately while maintaining the correspondence of the parametric model. However, we identified two limitations in these methods.
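The per-vertex template deformation described above can be sketched with a shared MLP that maps each template vertex, concatenated with the shape's global feature, to a 3D displacement. This is a minimal illustration, not the exact architecture of the cited works; the function name, the use of a single hidden layer, and the choice to predict a displacement rather than an absolute position are all assumptions made for clarity.

```python
import numpy as np

def deform_template(template_vertices, global_feature, weights):
    """Deform each template vertex conditioned on a global shape feature.

    template_vertices: (n, 3) array of template vertex positions.
    global_feature:    (d,) global feature encoded from the input shape.
    weights:           (W1, b1, W2, b2) parameters of a shared two-layer MLP.

    Each vertex is concatenated with the global feature, so the same MLP
    is applied to every vertex; it predicts a per-vertex 3D displacement.
    """
    n = template_vertices.shape[0]
    feat = np.tile(global_feature, (n, 1))             # broadcast feature to every vertex
    x = np.concatenate([template_vertices, feat], 1)   # (n, 3 + d)
    W1, b1, W2, b2 = weights
    h = np.tanh(x @ W1 + b1)                           # shared hidden layer
    return template_vertices + h @ W2 + b2             # template plus predicted offsets
```

Because all vertices pass through the same network, the deformed output keeps the template's vertex ordering and face connectivity, which is exactly what preserves the correspondence.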
First, it is difficult for the transformed template to completely restore the input shapes using only a pointwise reconstruction loss. Second, they can only deform the template to a complete human body obtained by scanning (or via reconstruction by other methods (Choe et al. (2021))). In this study, we focus on deforming the template to single-view 3D human shapes from depth camera images and on inferring the correspondences between various forms of the human body. A single-view shape is the most easily obtained 3D data form, owing to the development of low-cost, low-power RGBD cameras. In Groueix et al. (2018), a point-based neural network was used to learn a global feature from an input shape. The global feature stores all of the information of the input shape for directing the template deformation. In our view, the global feature should not only accurately represent the complete human shape but should also facilitate recovering the invisible part(s) when the input is a single-view shape. In representation learning, one of the main challenges is to design appropriate loss functions for supervising features with different abilities (discriminative, expressive, or restorative). Motivated by Deng et al. (2019), we penalize the loss in the feature space; specifically, we penalize the angular difference between the deep features obtained from the single-view shape and the complete human shape. This can be achieved by simultaneously inputting the complete and single-view data of one object during the training process. Therefore, we propose a framework that is suitable for searching for the correspondence relationships between various types of complete or single-view human bodies by processing the point clouds that represent the input shapes. We call this framework the feature constraint deformation network (FCD-Net). The FCD-Net is designed with an encoder-decoder structure.
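A natural way to penalize the angular difference between the two global features is a cosine-distance term over L2-normalized features. The following is a minimal sketch under that assumption; the function name and the exact form (1 − cos θ rather than θ itself) are illustrative choices, not necessarily the paper's implementation.

```python
import numpy as np

def feature_constraint_loss(f_single, f_complete):
    """Penalize the angular difference between the global feature of a
    single-view shape and that of the corresponding complete shape.

    Returns 1 - cos(theta): zero when the two features point in the same
    direction, and 1 when they are orthogonal.
    """
    f_single = f_single / np.linalg.norm(f_single)
    f_complete = f_complete / np.linalg.norm(f_complete)
    return 1.0 - float(np.dot(f_single, f_complete))
```

Because the loss depends only on direction, the encoder is pushed to map the partial observation and the complete shape to the same region of the feature space regardless of feature magnitude.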
The encoder comprises a deep neural network that generates a global feature representing the input shape. A shape deformation network then learns to deform a template, guided by the encoded global feature, to align with the target shape, as described in detail in Section 3. We train the FCD-Net with single-view and complete shapes as input shape pairs, supervised by the known correspondences between the input shapes and the template; these are explicit when both are generated by the same parametric model. During testing, only one type of observation is needed, i.e., single-view or complete, and the correspondences between the various types of inputs can be computed under a unified framework. To demonstrate the advantages of the FCD-Net, we first show that it can achieve single-view 3D human body correspondence. We then test the FCD-Net on finding correspondences for scanned 3D humans on several public datasets. The FCD-Net achieves state-of-the-art results in the "INTER" challenge of the FAUST dataset with an average correspondence error of 2.54 cm, and in the SHREC'19 challenge with an average correspondence error of 6.62 cm.
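With known correspondences, the supervision reduces to a vertex-indexed reconstruction loss: the i-th deformed template vertex is compared directly with the i-th ground-truth vertex. A minimal sketch, assuming a mean-squared-error formulation (the function name and the choice of MSE over, e.g., an L1 distance are assumptions):

```python
import numpy as np

def pointwise_reconstruction_loss(deformed_vertices, gt_vertices):
    """Mean squared distance between each deformed template vertex and its
    ground-truth counterpart. The correspondence is given by vertex index,
    which is explicit when both meshes are generated by the same
    parametric model.
    """
    diffs = deformed_vertices - gt_vertices   # (n, 3) per-vertex residuals
    return float(np.mean(np.sum(diffs ** 2, axis=1)))
```

Note that this is exactly the pointwise loss whose limitation is discussed above: it supervises each vertex independently, so it does not by itself guarantee that the deformed template restores the overall input shape.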

2. RELATED WORKS

The traditional search for correspondences between 3D human body models is often conducted through regression-optimization methods or functional mapping methods. The normal iterative closest point (NICP) algorithm (Serafin & Grisetti (2015)) represents the corresponding relationship between a source model and a target model by establishing a complex mathematical relationship between the two.

Shape correspondence and deformation (Huang & Fang (2021)), which attempt to establish reliable correspondences between two 3D shapes (Klatzow et al. (2022); Sahillioglu & Yemez (2010); Huang et al. (2017)), are a hot research topic in 3D vision. In contrast to the registration of scenes or objects (Gojcic et al. (2020); Segal et al. (2009); Ao et al. (2021)), which involves only rigid deformations such as rotations and translations, estimating correspondences on articulated human bodies requires handling flexible, complex, non-rigid deformations and pose variations (Serafin & Grisetti (2015); Bhatnagar et al. (2020); Groueix et al. (2018)). This makes the correspondence process more challenging.

