3D GENERATION ON IMAGENET

Abstract

All existing 3D-from-2D generators are designed for well-curated single-category datasets, where all the objects have (approximately) the same scale, 3D location and orientation, and the camera always points to the center of the scene. This makes them inapplicable to diverse, in-the-wild datasets of non-alignable scenes rendered from arbitrary camera poses. In this work, we develop a 3D generator with Generic Priors (3DGP): a 3D synthesis framework with more general assumptions about the training data, and show that it scales to very challenging datasets, like ImageNet. Our model is based on three new ideas. First, we incorporate an inaccurate off-the-shelf depth estimator into 3D GAN training via a special depth adaptation module that handles the imprecision. Second, we create a flexible camera model and a regularization strategy for it to learn its distribution parameters during training. Finally, we extend the recent ideas of transferring knowledge from pretrained classifiers into GANs to patch-wise trained models by employing a simple distillation-based technique on top of the discriminator. This technique achieves more stable training than the existing methods and speeds up convergence by at least 40%. We explore our model on four datasets: SDIP Dogs 256², SDIP Elephants 256², LSUN Horses 256², and ImageNet 256², and demonstrate that 3DGP outperforms the recent state-of-the-art in terms of both texture and geometry quality.

1. INTRODUCTION

We witness remarkable progress in the domain of 3D-aware image synthesis. The community is developing new methods to improve the image quality, 3D consistency and efficiency of the generators (e.g., Chan et al. (2022); Deng et al. (2022); Skorokhodov et al. (2022); Zhao et al. (2022); Schwarz et al. (2022)). However, all the existing frameworks are designed for well-curated and aligned datasets consisting of objects of the same category, scale and global scene structure, like human or cat faces (Chan et al., 2021). Such curation requires domain-specific 3D knowledge about the object category at hand, since one needs to infer the underlying 3D keypoints to properly crop, rotate and scale the images (Deng et al., 2022; Chan et al., 2022). This makes a similar alignment procedure infeasible for large-scale multi-category datasets that are inherently "non-alignable": there does not exist a single canonical position into which all the objects could be transformed (e.g., it is impossible to align a landscape panorama with a spoon). To extend 3D synthesis to in-the-wild datasets, one needs a framework that relies on more universal 3D priors. In this work, we make a step in this direction and develop a 3D generator with Generic Priors (3DGP): a 3D synthesis model which is guided only by (imperfect) depth predictions from an off-the-shelf monocular depth estimator. Surprisingly, such 3D cues are enough to learn reasonable scenes from loosely curated, non-aligned datasets, such as ImageNet (Deng et al., 2009).
Training a 3D generator on in-the-wild datasets comes with three main challenges: 1) extrinsic camera parameters of real images are unknown and impossible to infer; 2) objects appear in different shapes, positions, rotations and scales, complicating the learning of the underlying geometry; and 3) the dataset typically contains a lot of variation in terms of texture and structure, and is difficult to fit even for 2D generators. As shown in Fig. 1 (left), state-of-the-art 3D-aware generators, such as EG3D (Chan et al., 2022), struggle to learn proper geometry in such a challenging scenario. In this work, we develop three novel techniques to address these problems.

Learnable "Ball-in-Sphere" camera distribution. Most existing methods utilize a restricted camera model (e.g., Schwarz et al., 2020; Niemeyer & Geiger, 2021b; Chan et al., 2021): the camera is positioned on a sphere of constant radius, always points to the world center, and has fixed intrinsics. Diverse, non-aligned datasets violate these assumptions: e.g., dog datasets contain both close-up photos of a snout and photos of full-body dogs, which implies variability in the focal length and look-at positions. We therefore introduce a novel camera model with 6 degrees of freedom to address this variability. We optimize its distribution parameters during training and develop an efficient gradient penalty that prevents its collapse to a delta distribution.
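The 6 degrees of freedom can be sketched as a sampling routine: two angles place the camera on a fixed-radius sphere, one value controls the field of view, and a 3D look-at point is drawn from an inner ball (hence "Ball-in-Sphere"). The NumPy sketch below is purely illustrative, not our implementation; the decomposition into these particular parameters and all distribution values (`yaw_std`, `pitch_std`, `ball_radius`, etc.) are hypothetical placeholders for the learned quantities.

```python
import numpy as np

def sample_camera(rng, radius=1.0, ball_radius=0.3,
                  yaw_std=0.3, pitch_std=0.15,
                  fov_mean=0.3, fov_std=0.05):
    """Illustrative 'Ball-in-Sphere' sampler: 2 DoF for the camera
    position on a fixed-radius sphere, 1 DoF for the field of view,
    and 3 DoF for a look-at point inside an inner ball."""
    yaw = rng.normal(0.0, yaw_std)              # azimuth on the sphere
    pitch = rng.normal(0.0, pitch_std)          # elevation on the sphere
    fov = fov_mean + rng.normal(0.0, fov_std)   # variable intrinsics

    # Look-at point: uniform inside a ball of radius `ball_radius`
    # (random direction, cube-root radial law for uniform density).
    direction = rng.normal(0.0, 1.0, size=3)
    direction /= max(np.linalg.norm(direction), 1e-8)
    lookat = ball_radius * direction * rng.uniform(0.0, 1.0) ** (1 / 3)

    # Camera position on the outer sphere.
    pos = radius * np.array([
        np.cos(pitch) * np.sin(yaw),
        np.sin(pitch),
        np.cos(pitch) * np.cos(yaw),
    ])
    return pos, lookat, fov
```

In a trainable version, the standard deviations and means above would be parameters of the camera generator, updated jointly with the synthesis network.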

Adversarial depth supervision (ADS).

A generic image dataset features a wide diversity of objects with different shapes and poses. That is why learning a meaningful 3D geometry together with the camera distribution is an ill-posed problem: an incorrect scale can be well compensated by an incorrect camera model (Hartley & Zisserman, 2003) or by flat geometry (Zhao et al., 2022; Chan et al., 2022). To instill the 3D bias, we provide the scene geometry to the discriminator by concatenating the depth map of a scene as the fourth channel of its RGB input. For real images, we use (imperfect) estimates from a generic off-the-shelf monocular depth predictor (Miangoleh et al., 2021). For fake images, we render the depth from the synthesized radiance field and process it with a shallow depth adaptor module, bridging the distribution gap between the estimated and rendered depth maps. This ultimately guides the generator to learn the proper 3D geometry.
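The data flow of ADS can be illustrated with a toy sketch: the rendered depth passes through a shallow adaptor before being concatenated as a fourth input channel, while real images get their channel from the monocular estimator. The two-layer per-pixel affine adaptor below is a hypothetical stand-in for the actual learned module, and the function names are our own for illustration.

```python
import numpy as np

def adapt_depth(rendered_depth, weights):
    """Toy shallow depth adaptor: two per-pixel affine maps with a ReLU
    in between, standing in for a small learned network that nudges the
    rendered depth toward the statistics of the monocular estimator."""
    h = np.maximum(weights["w1"] * rendered_depth + weights["b1"], 0.0)
    return weights["w2"] * h + weights["b2"]

def discriminator_input(rgb, depth):
    """Concatenate a depth map as the 4th channel of a (C, H, W) image."""
    return np.concatenate([rgb, depth[None]], axis=0)

# Fake branch: rendered depth goes through the adaptor first.
# Real branch: the estimated depth is concatenated directly.
```

In training, both branches feed the same 4-channel discriminator, so any systematic mismatch between rendered and estimated depth is penalized adversarially rather than with an explicit reconstruction loss.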



Figure 1: Selected samples from EG3D (Chan et al., 2022) and our generator trained on ImageNet 256² (Deng et al., 2009). EG3D models the geometry in low resolution and renders either flat shapes (when trained with the default camera distribution) or repetitive "layered" ones (when trained with a wide camera distribution). In contrast, our model synthesizes the radiance field in the full dataset resolution and learns high-fidelity details during training. Zoom-in for a better view.

Figure 2: Model overview. Left: our tri-plane-based generator. To synthesize an image, we first sample camera parameters from a prior distribution and pass them to the camera generator. This gives the posterior camera parameters, used to render an image and its depth map. The depth adaptor mitigates the distribution gap between the rendered and the predicted depth. Right: our discriminator receives a 4-channel color-depth pair as an input. A fake sample consists of the RGB image and its (adapted) depth map. A real sample consists of a real image and its estimated depth. Our two-headed discriminator predicts adversarial scores and image features for knowledge distillation.
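The distillation head of the two-headed discriminator can be sketched as follows: its feature branch regresses the embedding a pretrained classifier produces for the same real image. The normalized-L2 form of the objective and the function name are assumptions for illustration; the exact loss is specified later in the paper.

```python
import numpy as np

def distillation_loss(disc_feats, teacher_feats):
    """Toy distillation objective: L2 distance between the
    discriminator's feature-head output and a (frozen) pretrained
    classifier's embedding, both unit-normalized. The normalization
    and L2 form are illustrative assumptions."""
    d = disc_feats / (np.linalg.norm(disc_feats) + 1e-8)
    t = teacher_feats / (np.linalg.norm(teacher_feats) + 1e-8)
    return float(np.sum((d - t) ** 2))
```

This term is added to the usual adversarial loss for real samples only, so the discriminator inherits the classifier's semantics without the classifier ever seeing generated patches.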

