3D GENERATION ON IMAGENET

Abstract

All existing 3D-from-2D generators are designed for well-curated single-category datasets, where all the objects have (approximately) the same scale, 3D location and orientation, and the camera always points to the center of the scene. This makes them inapplicable to diverse, in-the-wild datasets of non-alignable scenes rendered from arbitrary camera poses. In this work, we develop a 3D generator with Generic Priors (3DGP): a 3D synthesis framework with more general assumptions about the training data, and show that it scales to very challenging datasets, such as ImageNet. Our model is based on three new ideas. First, we incorporate an inaccurate off-the-shelf depth estimator into 3D GAN training via a special depth adaptation module to handle the imprecision. Second, we create a flexible camera model and a regularization strategy for it to learn its distribution parameters during training. Finally, we extend the recent ideas of transferring knowledge from pretrained classifiers into GANs for patch-wise trained models by employing a simple distillation-based technique on top of the discriminator. This technique achieves more stable training than the existing methods and speeds up convergence by at least 40%. We explore our model on four datasets: SDIP Dogs 256², SDIP Elephants 256², LSUN Horses 256², and ImageNet 256², and demonstrate that 3DGP outperforms the recent state-of-the-art in terms of both texture and geometry quality.
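The distillation idea mentioned above can be illustrated with a minimal sketch: the discriminator's intermediate features are mapped through a learnable projection and pulled toward the features that a frozen pretrained classifier extracts for the same images. This is a hypothetical simplification, not the paper's implementation; the function name `distillation_loss`, the projection matrix, and the feature sizes are all illustrative assumptions.

```python
import numpy as np

def distillation_loss(disc_feats, teacher_feats, proj):
    # Project the discriminator's features into the teacher's feature space
    # and penalize the mean squared distance to the frozen classifier's
    # ("teacher's") features for the same batch of images.
    projected = disc_feats @ proj
    return float(np.mean((projected - teacher_feats) ** 2))

rng = np.random.default_rng(0)
disc_feats = rng.normal(size=(4, 128))     # discriminator features for a batch
teacher_feats = rng.normal(size=(4, 512))  # frozen pretrained-classifier features
proj = rng.normal(size=(128, 512)) * 0.01  # learnable projection head
loss = distillation_loss(disc_feats, teacher_feats, proj)
```

In training, this loss would be added to the adversarial objective and backpropagated through the discriminator and projection only, with the teacher kept frozen.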

1. INTRODUCTION

We witness remarkable progress in the domain of 3D-aware image synthesis. The community is developing new methods to improve the image quality, 3D consistency, and efficiency of the generators.

* Work done during internship at Snap Inc.



Figure 1: Selected samples from EG3D (Chan et al., 2022) and our generator trained on ImageNet 256² (Deng et al., 2009). EG3D models the geometry in low resolution and renders either flat shapes (when trained with the default camera distribution) or repetitive "layered" ones (when trained with a wide camera distribution). In contrast, our model synthesizes the radiance field in the full dataset resolution and learns high-fidelity details during training. Zoom in for a better view.

Availability: https://snap-research.github.io/3dgp

