NOVEL VIEW SYNTHESIS WITH DIFFUSION MODELS

Abstract

We present 3DiM, a diffusion model for 3D novel view synthesis, which translates a single input view into consistent and sharp completions across many views. The core component of 3DiM is a pose-conditional image-to-image diffusion model, which takes a source view and its pose as inputs and generates a novel view for a target pose as output. 3DiM can then generate multiple views that are approximately 3D consistent using a novel technique called stochastic conditioning. At inference time, the output views are generated autoregressively: when generating each novel view, one selects a random conditioning view from the set of previously generated views at each denoising step. We demonstrate that stochastic conditioning significantly improves 3D consistency compared to a naïve sampler for an image-to-image diffusion model, which conditions on a single fixed view. We compare 3DiM to prior work on the SRN ShapeNet dataset, demonstrating that 3DiM's generated completions from a single view achieve much higher fidelity while being approximately 3D consistent. We also introduce a new evaluation methodology, 3D consistency scoring, to quantify the 3D consistency of a generated object by training a neural field on the model's output views. 3DiM is geometry-free, does not rely on hyper-networks or test-time optimization for novel view synthesis, and allows a single model to easily scale to a large number of scenes.


Figure 1: Given a single input image on the left, 3DiM performs novel view synthesis and generates the four views on the right. We trained a single ∼471M parameter 3DiM on all of ShapeNet (without class-conditioning) and sample frames with 256 denoising steps (512 score function evaluations with classifier-free guidance). See the Supplementary Website (https://3d-diffusion.github.io/) for video outputs.

1. INTRODUCTION

Diffusion Probabilistic Models (DPMs) (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020), also known simply as diffusion models, have recently emerged as a powerful family of generative models, achieving state-of-the-art performance on audio and image synthesis (Chen et al., 2020; Dhariwal & Nichol, 2021), while admitting better training stability than adversarial approaches (Goodfellow et al., 2014), as well as likelihood computation, which enables further applications such as compression and density estimation (Song et al., 2021; Kingma et al., 2021). Diffusion models have achieved impressive empirical results in a variety of image-to-image translation tasks, including text-to-image, super-resolution, inpainting, colorization, uncropping, and artifact removal (Song et al., 2020; Saharia et al., 2021a; Ramesh et al., 2022; Saharia et al., 2022). One image-to-image translation problem where diffusion models have not yet been investigated is novel view synthesis: given a set of images of a 3D scene, the task is to infer how the scene looks from novel viewpoints. Before the recent emergence of Scene Representation Networks (SRN) (Sitzmann et al., 2019) and Neural Radiance Fields (NeRF) (Mildenhall et al., 2020), state-of-the-art approaches to novel view synthesis were typically built on generative models (Sun et al., 2018) or more classical techniques based on interpolation or disparity estimation (Park et al., 2017; Zhou et al., 2018), sometimes combined with volumetric rendering followed by generative super-resolution (the latter being responsible for the approximation). Compared to such complex setups, we not only provide a significantly simpler architecture, but also a simpler hyper-parameter tuning experience than GANs, which are well known to be notoriously difficult to tune (Mescheder et al., 2018).
Motivated by these observations and the success of diffusion models in image-to-image tasks, we introduce 3D Diffusion Models (3DiMs). 3DiMs are image-to-image diffusion models trained on pairs of images of the same scene, where we assume the poses of the two images are known. Drawing inspiration from Scene Representation Transformers (Sajjadi et al., 2021), 3DiMs are trained to build a conditional generative model of one view given another view and their poses. Our key discovery is that we can turn this image-to-image model into one that produces an entire set of 3D-consistent frames through autoregressive generation, which we enable with our novel stochastic conditioning sampling algorithm. We cover stochastic conditioning in more detail in Section 2.2 and provide an illustration in Figure 3. Compared to prior work, 3DiMs are generative (vs. regressive) geometry-free models, they allow training to scale to a large number of scenes, and they offer a simple end-to-end approach. We now summarize our core contributions:

1. We introduce 3DiM, a geometry-free image-to-image diffusion model for novel view synthesis.
2. We introduce the stochastic conditioning sampling algorithm, which encourages 3DiM to generate 3D-consistent outputs.
3. We introduce X-UNet, a new UNet (Ronneberger et al., 2015) architecture variant for 3D novel view synthesis, demonstrating that changes in architecture are critical for high-fidelity results.
4. We introduce an evaluation scheme for geometry-free view synthesis models, 3D consistency scoring, which numerically captures 3D consistency by training neural fields on model outputs.
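The autoregressive sampling loop with stochastic conditioning can be sketched as follows. This is a minimal illustration only, not the paper's implementation: `denoise_step` and `init_noise` are hypothetical stand-ins for the pose-conditional diffusion model's reverse update and noise initialization (the actual 3DiM sampler additionally uses classifier-free guidance).

```python
import random

def stochastic_conditioning_sample(denoise_step, input_view, input_pose,
                                   target_poses, num_steps, init_noise):
    """Autoregressively generate novel views with stochastic conditioning.

    `denoise_step(x_t, t, cond_view, cond_pose, target_pose)` is assumed to
    perform one reverse-diffusion update of the noisy target frame x_t,
    conditioned on a single (view, pose) pair.
    """
    # The conditioning pool starts with just the given input view.
    views = [(input_view, input_pose)]
    for target_pose in target_poses:
        x = init_noise()  # each new frame starts from pure noise
        for t in reversed(range(num_steps)):
            # Key idea: at *every* denoising step, re-sample which previous
            # view to condition on, instead of fixing a single source view.
            cond_view, cond_pose = random.choice(views)
            x = denoise_step(x, t, cond_view, cond_pose, target_pose)
        # The completed frame joins the pool for subsequent frames.
        views.append((x, target_pose))
    return [view for view, _ in views[1:]]
```

With a toy scalar "model" that pulls the sample halfway toward the conditioning view at each step, the loop produces one output per target pose, each influenced by a random mix of earlier frames; in the real model, this per-step re-sampling of the conditioning view is what encourages approximate 3D consistency across the generated set.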



Today, these models have been outperformed by NeRF-class models (Yu et al., 2021; Niemeyer et al., 2021; Jang & Agapito, 2021), where 3D consistency is guaranteed by construction, as images are generated by volume rendering of a single underlying 3D representation (a.k.a. "geometry-aware" models). Still, these approaches have their own limitations. Heavily regularized NeRFs for novel view synthesis with few images, such as RegNeRF (Niemeyer et al., 2021), produce undesired artifacts when given very few images and fail to leverage knowledge from multiple scenes (recall that NeRFs are trained on a single scene, i.e., one model per scene); yet, given one or very few views of a novel scene, a reasonable model must extrapolate to complete the occluded parts of the scene. PixelNeRF (Yu et al., 2021) and VisionNeRF (Lin et al., 2022) address this by training NeRF-like models conditioned on feature maps that encode the input view(s). However, these approaches are regressive rather than generative, and as a result they cannot yield different plausible modes and are prone to blurriness. This type of failure has also been previously observed in regression-based models (Saharia et al., 2021b). Other works such as CodeNeRF (Jang & Agapito, 2021) and LOLNeRF (Rebain et al., 2021) instead employ test-time optimization to handle novel scenes, but still have issues with sample quality.

