EVA3D: COMPOSITIONAL 3D HUMAN GENERATION FROM 2D IMAGE COLLECTIONS

Abstract

Figure 1: EVA3D generates high-quality and diverse 3D humans with photo-realistic RGB renderings and detailed geometry. Only 2D image collections are used for training.

1. INTRODUCTION

Inverse graphics studies the inverse engineering of projection physics, aiming to recover the 3D world from 2D observations. It is not only a long-standing scientific quest, but also enables numerous applications in VR/AR and VFX. Recently, 3D-aware generative models (Chan et al., 2021; Or-El et al., 2022; Chan et al., 2022) have demonstrated great potential in inverse graphics by learning to generate 3D rigid objects (e.g., human/animal faces, CAD models) from 2D image collections. However, human bodies are articulated objects with complex articulations and diverse appearances. It is therefore challenging to learn 3D human generative models that can synthesize animatable 3D humans with high-fidelity textures and vivid geometric details.

To generate high-quality 3D humans, we argue that two main factors should be properly addressed: 1) the 3D human representation; and 2) the generative network training strategies. Due to the articulated nature of human bodies, a desirable human representation should allow explicit control over the pose and shape of 3D humans. With an articulated representation, a 3D human is modeled in its canonical pose (canonical space) and can be rendered in different poses and shapes (observation space). Moreover, the efficiency of the representation matters for high-quality 3D human generation. Previous methods (Noguchi et al., 2022; Bergman et al., 2022) fail to achieve high-resolution generation due to their inefficient human representations. In addition, training strategies can also strongly influence 3D human generative models. The issue mainly stems from the data characteristics. Compared with datasets used by Noguchi et al. (2022) (e.g., AIST (Tsuchida et al., 2019)), fashion datasets (e.g., DeepFashion (Liu et al., 2016)) are better aligned with real-world human image distributions, making them a favorable choice. However, fashion datasets mostly contain very limited human poses and highly imbalanced viewing angles.
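To make the canonical/observation-space distinction concrete, the following is a minimal sketch of the kind of mapping an articulated representation implies, assuming SMPL-style per-bone transforms and linear blend skinning; the function and argument names are illustrative, not EVA3D's actual API.

```python
import numpy as np

def observation_to_canonical(x_obs, bone_transforms, skin_weights):
    """Map a point from observation (posed) space back to canonical space
    via inverse linear blend skinning, a common device in articulated NeRFs.

    x_obs:            (3,) point in observation space
    bone_transforms:  (K, 4, 4) per-bone canonical-to-observation transforms
    skin_weights:     (K,) blend weights for this point, summing to 1
    """
    # Blend the per-bone transforms with the skinning weights.
    T = np.einsum("k,kij->ij", skin_weights, bone_transforms)  # (4, 4)
    x_h = np.append(x_obs, 1.0)                                # homogeneous coords
    # Invert the blended transform: solve T * x_can = x_obs.
    x_can = np.linalg.solve(T, x_h)
    return x_can[:3]
```

Querying the canonical-space NeRF at `x_can` is what lets one generator render the same identity under many poses.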
This imbalanced 2D data distribution can hinder 3D GAN learning, leading to difficulties in novel view/pose synthesis. Therefore, a proper training strategy is needed to alleviate the issue.

In this work, we propose EVA3D, an unconditional high-quality 3D human generative model learned from sparse 2D human image collections only. To facilitate this, we propose a compositional human NeRF representation to improve model efficiency. We divide the human body into 16 parts and assign each part an individual network that models the corresponding local volume. Our human representation provides three main advantages. 1) It inherently encodes the human body prior, which supports explicit control over human body shapes and poses. 2) It supports adaptive allocation of computational resources: more complex body parts (e.g., heads) can be allocated more parameters. 3) The compositional representation enables efficient rendering and achieves high-resolution generation. Rather than using one big volume (Bergman et al., 2022), our compositional representation tightly models each body part and avoids wasting parameters on empty volumes. Moreover, thanks to the part-based modeling, we can efficiently sample rays inside local volumes and avoid sampling empty space. With this compact representation and the efficient rendering algorithm, we achieve high-resolution (512 × 256) rendering and GAN training without super-resolution modules, while existing methods can only train at a native resolution of 128². Moreover, we carefully design training strategies to address the imbalance of human poses and viewing angles. We analyze the head-facing angle distribution and propose a pose-guided sampling strategy to facilitate effective 3D human geometry learning. Quantitative and qualitative experiments on two fashion datasets (Liu et al., 2016; Fu et al., 2022) demonstrate the advantages of EVA3D.
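The "sample rays only inside local part volumes" idea above can be sketched with standard slab-method ray/AABB intersection; this is an illustrative reconstruction under the assumption that each body part carries an axis-aligned bounding box, not EVA3D's exact sampler.

```python
import numpy as np

def ray_aabb_interval(origin, direction, box_min, box_max):
    """Slab-method intersection of a ray with an axis-aligned bounding box.
    Returns (t_near, t_far) or None if the ray misses the box.
    Restricting NeRF queries to such per-part boxes avoids wasting
    samples on empty space."""
    inv_d = 1.0 / direction              # assumes no exactly-zero components
    t0 = (box_min - origin) * inv_d
    t1 = (box_max - origin) * inv_d
    t_near = np.max(np.minimum(t0, t1))  # latest entry across the 3 slabs
    t_far = np.min(np.maximum(t0, t1))   # earliest exit across the 3 slabs
    if t_far < max(t_near, 0.0):
        return None
    return t_near, t_far

def sample_in_part(origin, direction, box_min, box_max, n_samples=16):
    """Place sample depths uniformly within the ray's overlap with one
    local part volume (empty array if the part is missed)."""
    hit = ray_aabb_interval(origin, direction, box_min, box_max)
    if hit is None:
        return np.empty((0, 3))
    t = np.linspace(max(hit[0], 0.0), hit[1], n_samples)
    return origin + t[:, None] * direction
```

In a compositional model, each ray is tested against all 16 part boxes and points are drawn only from the intersected intervals, which is what keeps high-resolution rendering tractable.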
We also experiment on UBCFashion (Zablotskaia et al., 2019) and AIST (Tsuchida et al., 2019) for comparison with prior work, and provide extensive ablations of our design choices. In conclusion, our contributions are as follows: 1) We are the first to achieve high-resolution, high-quality 3D human generation from 2D image collections. 2) We propose a compositional human NeRF representation tailored for efficient GAN training. 3) We introduce practical training strategies to address the imbalance of real 2D human image collections. 4) We demonstrate applications of EVA3D, i.e., interpolation and GAN inversion, which pave the way for further exploration of 3D human GANs.
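The pose-guided sampling strategy mentioned above can be sketched as inverse-frequency importance sampling over a binned facing-angle histogram, so that rare viewing angles are drawn more often during training; this is an illustrative reconstruction, and the binning, weighting, and temperature knob here are assumptions rather than the paper's exact scheme.

```python
import numpy as np

def pose_sampling_weights(facing_angles, n_bins=36, temperature=1.0):
    """Per-sample draw probabilities from inverse bin frequency.

    facing_angles: head-facing angles in degrees, one per training image
    n_bins:        number of angular histogram bins over [0, 360)
    temperature:   >1 sharpens the reweighting, <1 softens it
    """
    angles = np.asarray(facing_angles, dtype=float) % 360.0
    bins = np.floor(angles / (360.0 / n_bins)).astype(int)
    counts = np.bincount(bins, minlength=n_bins).astype(float)
    # Inverse-frequency weight per bin; empty bins contribute nothing.
    inv = np.where(counts > 0, 1.0 / counts, 0.0) ** temperature
    w = inv[bins]
    return w / w.sum()
```

Feeding these weights to the training sampler (e.g., `np.random.choice(n, p=w)`) flattens the otherwise front-facing-dominated distribution that fashion datasets exhibit.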

2. RELATED WORK

3D-Aware GAN. Generative Adversarial Networks (GANs) (Goodfellow et al., 2020) have achieved great success in 2D image generation (Karras et al., 2019; 2020). Many efforts have also been devoted to 3D-aware generation. Nguyen-Phuoc et al. (2019); Henzler et al. (2019) use voxels, and Pan et al. (2020) use meshes to assist 3D-aware generation. With recent advances in NeRF (Mildenhall et al., 2020; Tewari et al., 2021), many works have built 3D-aware GANs upon NeRF (Schwarz et al., 2020; Niemeyer & Geiger, 2021; Chan et al., 2021; Deng et al., 2022). To increase the generation

