DO 2D GANS KNOW 3D SHAPE? UNSUPERVISED 3D SHAPE RECONSTRUCTION FROM 2D IMAGE GANS



Figure 1: The first column shows images generated by off-the-shelf 2D GANs trained on RGB images only, while the remaining columns show that our method can reconstruct the 3D shape (visualized as 3D mesh, surface normal, and texture) from a single 2D image in an unsupervised manner by exploiting the geometric cues contained in GANs. The last two columns depict 3D-aware image manipulation effects (rotation and relighting) enabled by our framework. More results are provided in the Appendix.

ABSTRACT

Natural images are projections of 3D objects on a 2D image plane. While state-of-the-art 2D generative models like GANs show unprecedented quality in modeling the natural image manifold, it is unclear whether they implicitly capture the underlying 3D object structures. And if so, how could we exploit such knowledge to recover the 3D shapes of objects in the images? To answer these questions, in this work, we present the first attempt to directly mine 3D geometric cues from an off-the-shelf 2D GAN that is trained on RGB images only. Through our investigation, we found that such a pre-trained GAN indeed contains rich 3D knowledge and thus can be used to recover 3D shape from a single 2D image in an unsupervised manner. The core of our framework is an iterative strategy that explores and exploits diverse viewpoint and lighting variations in the GAN image manifold. The framework does not require 2D keypoint or 3D annotations, or strong assumptions on object shapes (e.g., that shapes are symmetric), yet it successfully recovers 3D shapes with high precision for human faces, cars, buildings, etc. The recovered 3D shapes immediately allow high-quality image editing like relighting and object rotation. We quantitatively demonstrate the effectiveness of our approach compared to previous methods in both 3D shape reconstruction and face rotation. Our code is available at https://github.com/XingangPan/GAN2Shape.

1. INTRODUCTION

Generative adversarial networks (GANs) (Goodfellow et al., 2014) are capable of modeling the 2D natural image manifold (Zhu et al., 2016) of diverse object categories with high fidelity. Since natural images are projections of 3D objects onto the 2D plane, an ideal 2D image manifold should reflect the underlying 3D geometric properties. For example, it has been shown that a GAN can shift the object in its generated images (e.g., human faces) in a 3D-rotation-like manner, as a direction in the GAN image manifold may correspond to a viewpoint change (Shen et al., 2020). This phenomenon motivates us to ask: "Is it possible to reconstruct the 3D shape of an object in a single 2D image by exploiting the 3D-alike image manipulation effects produced by GANs?" Despite its potential as a powerful way to learn 3D shape from unconstrained RGB images, this problem remains much less explored. Some previous attempts (Lunz et al., 2020; Henzler et al., 2019; Szabó et al., 2019) also adopt GANs to learn 3D shapes from images, but they rely on explicitly modeling the 3D representation and rendering process during training (e.g., 3D voxels, 3D meshes). Due to either heavy memory consumption or the additional training difficulty introduced by the rendering process, the quality of their generated samples notably lags behind that of their 2D GAN counterparts.
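The viewpoint-change phenomenon referenced above amounts to shifting a latent code along a semantic direction, as in Shen et al. (2020). The following is a minimal, hypothetical sketch of that edit; the function and variable names are illustrative and not the paper's actual API, and the generator itself is omitted.

```python
import numpy as np

def edit_latent(w: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Shift latent code w by alpha along a (normalized) semantic direction.

    In Shen et al. (2020), such a direction can correspond to a factor like
    viewpoint, so feeding the shifted code to the generator rotates the object.
    """
    d = direction / np.linalg.norm(direction)
    return w + alpha * d

rng = np.random.default_rng(0)
w = rng.standard_normal(512)       # a latent code (e.g., a 512-d StyleGAN2 w-code)
d = rng.standard_normal(512)       # a direction assumed to control viewpoint
w_rotated = edit_latent(w, d, alpha=3.0)
```

Passing `w_rotated` to the generator would then render the same object from a shifted viewpoint; here only the latent-space geometry of the edit is shown.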
Another line of work (Wu et al., 2020; Goel et al., 2020; Tulsiani et al., 2020; Li et al., 2020) on unsupervised 3D shape learning infers the viewpoint and shape of each image in an 'analysis by synthesis' manner. Despite their impressive results, these methods often assume object shapes are symmetric (the symmetry assumption) to prevent trivial solutions, which is hard to generalize to asymmetric objects such as buildings. We believe that existing pre-trained 2D GANs, without the above specific designs, already capture sufficient knowledge to recover the 3D shapes of objects in 2D images. Since the 3D structure of an instance can be inferred from images of that instance under multiple viewpoint and lighting variations, our insight is that we can create these variations by exploiting the image manifold captured by 2D GANs. The main challenge, however, is to discover well-disentangled semantic directions in the image manifold that control viewpoint and lighting in an unsupervised manner, since manually inspecting and annotating samples in the image manifold is laborious and time-consuming.

To tackle this challenge, we observe that for many objects such as faces and cars, a convex shape prior like an ellipsoid provides a hint on how their viewpoints and lighting conditions change. Inspired by this, given a GAN-generated image, we employ an ellipsoid as its initial 3D shape and render a number of unnatural images, called 'pseudo samples', under various randomly-sampled viewpoints and lighting conditions, as shown in Fig. 2. By reconstructing them with the GAN, these pseudo samples guide the original image towards the sampled viewpoints and lighting conditions in the GAN manifold, producing a number of natural-looking images, called 'projected samples'. These projected samples then serve as the ground truth of a differentiable rendering process that refines the prior 3D shape (i.e., the ellipsoid).
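The refinement loop described above can be sketched as follows. This is a toy, runnable illustration only: `render`, `gan_invert`, and `refine` are simplified stand-ins (the actual method uses a differentiable renderer and GAN inversion on StyleGAN2), and all names and dimensions are assumptions for the sketch.

```python
import numpy as np

def render(shape, view, light):
    # Toy "renderer": combines a shape code with viewpoint and lighting codes.
    return shape + 0.1 * view + 0.1 * light

def gan_invert(image):
    # Toy "GAN inversion": projects a pseudo sample toward the natural-image
    # manifold (here modeled as a mild shrink toward the origin).
    return 0.9 * image

def refine(shape, views, lights, targets, lr=0.5, steps=20):
    # Gradient-style fitting: adjust the shape so that rendering it under each
    # sampled viewpoint/lighting reproduces the corresponding projected sample.
    for _ in range(steps):
        residuals = [render(shape, v, l) - t
                     for v, l, t in zip(views, lights, targets)]
        shape = shape - lr * np.mean(residuals, axis=0)
    return shape

rng = np.random.default_rng(0)
shape = np.zeros(16)                               # initial ellipsoid stand-in
for _ in range(4):                                 # iterative refinement rounds
    views = [rng.standard_normal(16) for _ in range(8)]
    lights = [rng.standard_normal(16) for _ in range(8)]
    pseudo = [render(shape, v, l) for v, l in zip(views, lights)]   # 'pseudo samples'
    projected = [gan_invert(p) for p in pseudo]                     # 'projected samples'
    shape = refine(shape, views, lights, projected)
```

The outer loop mirrors the paper's iteration: each round's refined shape becomes the next round's initial shape, with fresh viewpoint and lighting samples drawn every time.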
To achieve more precise results, we further treat the refined shape as the new initial shape and repeat the above steps to progressively refine the 3D shape. With the proposed approach, named GAN2Shape, we show that existing 2D GANs trained on images alone are sufficient to accurately reconstruct the 3D shape of a single image for many object categories such as human faces, cars, and buildings. Our method is thus an effective approach to unsupervised 3D shape reconstruction from unconstrained 2D images without any 2D keypoint or 3D annotations. With an improved GAN inversion strategy, our method works not only on GAN samples, but also on real natural images. On the BFM benchmark (Paysan et al., 2009), our method outperforms a recent strong baseline designed specifically for 3D shape learning (Wu et al., 2020). We also demonstrate high-quality 3D-aware image manipulation using the semantic latent directions discovered by our approach, which achieves more accurate human face rotation than competing methods. Our contributions are summarized as follows. 1) We present the first attempt to reconstruct 3D object shapes using GANs that are pre-trained on 2D images only. Our work shows that 2D GANs



Figure 2: Framework outline. Starting with an initial ellipsoid 3D shape (viewed in surface normal), our approach renders various 'pseudo samples' with different viewpoints and lighting conditions. GAN inversion is applied to these samples to obtain the 'projected samples', which are used as the ground truth of the rendering process to refine the initial 3D shape. This process is repeated until more precise results are obtained.

