NOVEL VIEW SYNTHESIS WITH DIFFUSION MODELS

Abstract

We present 3DiM, a diffusion model for 3D novel view synthesis, which is able to translate a single input view into consistent and sharp completions across many views. The core component of 3DiM is a pose-conditional image-to-image diffusion model, which is trained to take a source view and its pose as inputs, and generates a novel view for a target pose as output. 3DiM can then generate multiple views that are approximately 3D consistent using a novel technique called stochastic conditioning. At inference time, the output views are generated autoregressively: when generating each novel view, one selects a random conditioning view from the set of previously generated views at each denoising step. We demonstrate that stochastic conditioning significantly improves 3D consistency compared to a naïve sampler for an image-to-image diffusion model, which conditions on a single fixed view. We compare 3DiM to prior work on the SRN ShapeNet dataset, demonstrating that 3DiM's completions generated from a single view achieve much higher fidelity while remaining approximately 3D consistent. We also introduce a new evaluation methodology, 3D consistency scoring, which quantifies the 3D consistency of a generated object by training a neural field on the model's output views. 3DiM is geometry-free, does not rely on hyper-networks or test-time optimization for novel view synthesis, and allows a single model to easily scale to a large number of scenes.

[Teaser figure: an input view and 3DiM outputs conditioned on different poses.] See the Supplementary Website (https://3d-diffusion.github.io/) for video outputs.
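The stochastic conditioning sampler described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `denoise_step` is a hypothetical callable wrapping one reverse-diffusion step of the pose-conditional model, and all function names and signatures here are assumptions for exposition.

```python
import random

def stochastic_conditioning_sample(views, poses, target_pose, denoise_step,
                                   init_noise, num_steps=256):
    """Generate one novel view at `target_pose`.

    views, poses : views already available and their poses (initially just
                   the input view and its pose).
    denoise_step : hypothetical (x_t, cond_view, cond_pose, target_pose, t)
                   -> x_{t-1}, one reverse step of the pose-conditional model.
    """
    x = init_noise
    for t in reversed(range(num_steps)):
        # Key idea: re-draw the conditioning view at *every* denoising step,
        # so the final sample is guided by all available views rather than
        # by a single fixed one (the naive sampler).
        i = random.randrange(len(views))
        x = denoise_step(x, views[i], poses[i], target_pose, t)
    return x

def autoregressive_synthesis(input_view, input_pose, target_poses,
                             denoise_step, noise_fn, num_steps=256):
    """Generate target views one at a time; each newly generated view
    joins the conditioning set for subsequent views."""
    views, poses = [input_view], [input_pose]
    for pose in target_poses:
        new_view = stochastic_conditioning_sample(
            views, poses, pose, denoise_step, noise_fn(), num_steps)
        views.append(new_view)
        poses.append(pose)
    return views[1:]  # generated views only
```

With a single conditioning view the sampler reduces to the naive fixed-view sampler; the benefit appears once several views have been generated and each denoising step can be conditioned on a different one.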

1. INTRODUCTION

Diffusion Probabilistic Models (DPMs) (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020), also known simply as diffusion models, have recently emerged as a powerful family of generative models, achieving state-of-the-art performance on audio and image synthesis (Chen et al., 2020; Dhariwal & Nichol, 2021), while offering better training stability than adversarial approaches (Goodfellow et al., 2014), as well as likelihood computation, which enables further applications such as compression and density estimation (Song et al., 2021; Kingma et al., 2021). Diffusion models have achieved impressive empirical results in a variety of image-to-image translation tasks, including text-to-image synthesis, super-resolution, inpainting, colorization, uncropping, and artifact removal (Song et al., 2020; Saharia et al., 2021a; Ramesh et al., 2022; Saharia et al., 2022).



Figure 1: Given a single input image on the left, 3DiM performs novel view synthesis and generates the four views on the right. We trained a single ∼471M-parameter 3DiM on all of ShapeNet (without class-conditioning) and sample frames with 256 steps (512 score function evaluations with classifier-free guidance). See the Supplementary Website (https://3d-diffusion.github.io/) for video outputs.

