STYLEMORPH: DISENTANGLED 3D-AWARE IMAGE SYNTHESIS WITH A 3D MORPHABLE STYLEGAN

Abstract

We introduce StyleMorph, a 3D-aware generative model that disentangles 3D shape, camera pose, object appearance, and background appearance for high-quality image synthesis. We account for shape variability by morphing a canonical 3D object template, effectively learning a 3D morphable model in an entirely unsupervised manner through backpropagation. We chain 3D morphable modelling with deferred neural rendering by performing an implicit surface rendering of "Template Object Coordinates" (TOCS), which can be understood as an unsupervised counterpart to UV maps. This provides a detailed 2D TOCS map signal that reflects the compounded geometric effects of non-rigid shape variation, camera pose, and perspective projection. We combine 2D TOCS maps with an independent appearance code to condition a StyleGAN-based deferred neural rendering (DNR) network for foreground (object) image synthesis; we use a separate code for background synthesis and perform late fusion to deliver the final result. We show competitive synthesis results on four datasets (FFHQ faces, AFHQ Cats, Dogs, and Wild), while achieving the joint disentanglement of shape, pose, object texture, and background texture.
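The TOCS idea described above can be illustrated with a minimal sketch: every surface point carries its *canonical* template coordinate as a label, and rendering splats those labels into a 2D map that encodes deformation, pose, and projection at once. The sketch below is a toy numpy version under simplifying assumptions (a point-sampled surface, a translation-only pinhole camera, nearest-point z-buffer splatting); all function and parameter names are hypothetical, not the paper's actual implementation.

```python
import numpy as np

def render_tocs_map(template, offsets, cam_t, focal=1.0, res=32):
    """Splat canonical template coordinates into a 2D 'TOCS map' (toy sketch).

    template: (N, 3) canonical vertex positions in [0, 1]^3
    offsets:  (N, 3) per-vertex deformation (the 'morph')
    cam_t:    (3,) camera translation (camera looks down +z; assumed setup)
    Returns a (res, res, 3) map holding, at each covered pixel, the
    canonical coordinate of the closest deformed surface point.
    """
    world = template + offsets              # morphed ("world") shape
    cam = world + cam_t                     # translation-only camera model
    z = cam[:, 2]
    xy = focal * cam[:, :2] / z[:, None]    # perspective projection to [-1, 1]
    px = ((xy + 1.0) * 0.5 * (res - 1)).round().astype(int)
    tocs = np.zeros((res, res, 3))
    zbuf = np.full((res, res), np.inf)
    for (u, v), depth, canon in zip(px, z, template):
        if 0 <= u < res and 0 <= v < res and depth < zbuf[v, u]:
            zbuf[v, u] = depth              # keep the closest point per pixel
            tocs[v, u] = canon              # record its *canonical* coordinate
    return tocs
```

Note that the map stores template coordinates, not world coordinates: the same surface point keeps the same label under any deformation or camera motion, which is what makes the map a purely geometric conditioning signal for the appearance network.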

1. INTRODUCTION

Learning the structure and statistics of the 3D world by observing 2D images is at the forefront of current vision and learning research, as it can unlock applications in robotics, augmented reality, and graphics while also having fundamental scientific value for advancing visual perception. In this work we aim to develop this ability through a model that is highly disentangled, yielding a level of control similar to that enjoyed by 3D morphable models (3DMMs) (Blanz & Vetter, 1999), without requiring anything other than an unstructured set of 2D images. 3DMMs are the workhorse of facial visual effects (VFX) in the film industry and augmented reality (AR) (Egger et al., 2021), as they provide VFX creators with fine-grained, disentangled control over expression, pose, and appearance. In this work we aspire to develop unsupervised counterparts for general object categories. In particular, we show that we can learn such models for several categories beyond human faces while having no prior knowledge of the object topology, other 3D priors, or the camera pose. We build on recent progress in 3D-aware GANs and show that we can improve the FID of the most competitive methods that use the same level of supervision (plain 2D images), while exerting more control over the image synthesis process: we disentangle shape (e.g. gender, expression, hair style), camera pose, object appearance, and background. This allows us to perform fine semantic edits that preserve all properties other than the one being edited. We show that this applies not only to faces but also to images of cats, dogs, and wild animals.

Relation with previous works. Advances in 3D-aware category-level modelling have shown that one can use 2D image supervision to train 3D generative models of shape and appearance variability.
Starting from standard MLPs (Schwarz et al., 2020; Niemeyer & Geiger, 2021) and subsequently custom sinusoidal networks (Sitzmann et al., 2020; Chan et al., 2021b), 3D implicit models quickly delivered results competitive with those of voxel-based approaches (Nguyen-Phuoc et al., 2019; 2020). Hybrid models (Gu et al., 2021; Zhou et al., 2021; Or-El et al., 2021; Chan et al., 2021a; Xue et al., 2022) have increased the resolution and quality at which images can be synthesized without compromising speed or memory, by rendering coarse-resolution neural features from 3D to 2D and then delegating the full-resolution image synthesis task to 2D, StyleGAN-type blocks (Karras et al., 2020). These works have shown increasingly high-quality results, but their hybrid nature makes it harder to obtain a clear separation of geometry and appearance, or to provide consistent image synthesis results when changing rigid (camera) or non-rigid (gender/expression/hair style) 3D geometry.

Recent works aimed at disentangling appearance from shape have incorporated 3D deformations in the synthesis process (Park et al., 2021; Pumarola et al., 2021; Gafni et al., 2021a; Su et al., 2021; Xu et al., 2021; Weng et al., 2022), but have so far remained limited to the single dynamic scene use case, or assume that a deformable model already exists for the category (Gafni et al., 2021a; Xu et al., 2021; Weng et al., 2022; Su et al., 2021). For faces in particular, compelling controllable synthesis results have been obtained by recent works that combined 3DMMs with NeRFs (Athar et al., 2022; Gafni et al., 2021b) or 3D-aware GANs (Liu et al., 2022; Tewari et al., 2020). Still, constructing a 3DMM typically requires extensive 3D scanning and manual alignment, making it meaningful only for critical categories such as faces. Learning 3DMMs from 2D images has also recently been achieved for monocular 3D reconstruction based on limited information, such as binary segmentation masks (Kanazawa et al., 2018; Sahasrabudhe et al., 2019; Kokkinos & Kokkinos, 2021b), allowing a broad range of categories to be handled (Ye et al., 2021; Vasudev et al., 2022); other works have provided models that accommodate articulation (Kulkarni et al., 2020; Kokkinos & Kokkinos, 2021a; Yang et al., 2022) and varied object topology (Duggal & Pathak, 2022), addressing known shortcomings of 3DMMs. The synthesis results of these methods, however, rely on a parametric low-resolution surface and texture map, yielding synthetic-looking images.

Most recently, Tewari et al. (2022) injected 3DMMs into GAN training, showing that one can control image synthesis through 3D warps. In our work we turn 3DMMs into first-class citizens for 3D generative modelling by combining them with deferred neural rendering.

Contributions. Our work builds on advances from these three strands of research to combine 3DMMs with GANs in an unsupervised manner. We show that it is possible to inject the main idea of morphable models, i.e. deforming a fixed "canonical" template into a diverse set of "world" shapes, into the design of implicit 3D networks. Existing approaches model shape variability through a random input to an occupancy network. Instead, we bridge 3D morphable models with 3D-aware image synthesis.

Figure 1: Our model achieves disentangled control of image synthesis: starting from a synthesized sample we change one factor at a time, and at the end show the compounded variation obtained by changing all of them. Our 3D-based conditioning signal (shown on the top and bottom rows) is exclusively geometric; it is hence the same as the left image's in the foreground and background columns, where the change is effected only by the respective appearance codes (not shown).

Project page: https://stylemorph.github.io/stylemorph/. *Equal contribution.
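The core morphable-model mechanism, deforming a fixed canonical template into per-sample "world" shapes, can be sketched as a deformation field conditioned on a shape code. The toy numpy version below uses a tiny two-layer network with random stand-in weights; the architecture, dimensions, and names are hypothetical, chosen only to illustrate the idea that shape variability lives in the offsets while canonical coordinates stay fixed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy deformation network: offsets = MLP([x_canonical, z_shape]).
# Weights are random stand-ins; in a real model they would be trained
# end-to-end from 2D images (hypothetical sizes: 8-dim shape code, 64 hidden).
W1 = rng.normal(0.0, 0.1, (3 + 8, 64))
W2 = rng.normal(0.0, 0.1, (64, 3))

def morph(x_canonical, z_shape):
    """Map canonical template points to 'world' points for one shape code."""
    n = x_canonical.shape[0]
    h = np.concatenate([x_canonical, np.broadcast_to(z_shape, (n, 8))], axis=1)
    offsets = np.tanh(h @ W1) @ W2      # small, smooth per-point offsets
    return x_canonical + offsets        # morphable-model style deformation
```

Different shape codes deform the same template differently, while the canonical coordinates remain fixed identity labels, which is what lets a single appearance network be conditioned consistently across all shapes and poses.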

