UNDERSTANDING PURE CLIP GUIDANCE FOR VOXEL GRID NERF MODELS

Anonymous

Paper under double-blind review

[Figure 1 prompts: "night city with vaporwave aesthetic." "medieval people celebrating a festival with many stalls; trending on artstation." "Tokyo city; trending on artstation." "American muscle car palms and moon synthwave." "steampunk city; trending on artstation."]

ABSTRACT

We explore the task of text-to-3D object generation using CLIP. Specifically, we use CLIP for guidance without access to any datasets, a setting we refer to as pure CLIP guidance. While prior work has adopted this setting, there has been no systematic study of the mechanisms that prevent adversarial generations under CLIP. We illustrate how different image-based augmentations prevent the adversarial generation problem, and how they affect the generated results. We test different CLIP model architectures and show that ensembling multiple models for guidance can prevent adversarial generations in larger models and produce sharper results. Furthermore, we implement an implicit voxel grid model to show how neural networks provide an additional layer of regularization, resulting in better geometric structure and coherence of the generated objects. Compared to prior work, we achieve more coherent results with higher memory efficiency and faster training.

1. INTRODUCTION

Text-to-image generation has seen major recent advances with the release of DALL-E (Ramesh et al., 2021) and diffusion models such as DALL-E 2 (Ramesh et al., 2022), CogView2 (Ding et al., 2022), and Latent Diffusion (Rombach et al., 2022). A natural next step is the task of generating 3D objects from text input. However, supervised methods relying on paired data are less suited to text-to-3D generation, as large-scale paired text and 3D datasets are scarce. This makes the regime of little to no 3D training supervision attractive. While this might seem daunting, recent work in text-to-3D generation has shown promising results without large-scale datasets by bridging the gap with guidance from pretrained vision-language models such as CLIP (Radford et al., 2021). At the same time, advances in differentiable neural rendering and the development of NeRF (Mildenhall et al., 2020) now allow for direct optimization of a 3D representation to match input images. Combining these approaches with CLIP guidance, we can generate 3D representations from text.

Supplementary Website: https://isekaicoder.github.io/ICLR3801-Supplemental/
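The optimization loop this combination implies can be sketched in a few lines. The following is a toy, self-contained illustration rather than any paper's implementation: `render_and_embed` is a placeholder for sampling a camera pose, rendering the 3D representation, applying an image augmentation, and encoding the result with CLIP's image encoder; `TEXT_EMB` stands in for the CLIP text embedding of the prompt; and finite differences stand in for backpropagation through a differentiable renderer.

```python
import math
import random

def cosine(u, v):
    # Cosine similarity -- the score CLIP assigns to an image/text pair.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) + 1e-8
    nv = math.sqrt(sum(b * b for b in v)) + 1e-8
    return dot / (nu * nv)

# Placeholder for the CLIP text embedding of the input prompt.
TEXT_EMB = [0.2, -0.5, 0.8, 0.1]

def render_and_embed(scene_params, seed):
    # Placeholder for: sample a camera pose, render the 3D representation,
    # apply an image augmentation, and run CLIP's image encoder.
    # Here it is simply the scene parameters plus per-view noise.
    rng = random.Random(seed)
    return [p + rng.uniform(-0.01, 0.01) for p in scene_params]

def clip_guidance_step(scene_params, lr=0.1, eps=1e-3, n_views=4):
    # One step of ascent on the average CLIP similarity over several
    # random views/augmentations. Real implementations backpropagate
    # through a differentiable renderer; finite differences keep this
    # toy dependency-free.
    def neg_sim(params):
        return -sum(cosine(render_and_embed(params, s), TEXT_EMB)
                    for s in range(n_views)) / n_views
    base = neg_sim(scene_params)
    grads = []
    for i in range(len(scene_params)):
        bumped = list(scene_params)
        bumped[i] += eps
        grads.append((neg_sim(bumped) - base) / eps)
    new_params = [p - lr * g for p, g in zip(scene_params, grads)]
    return new_params, -base  # updated parameters, similarity before update

params = [0.0] * len(TEXT_EMB)
for _ in range(50):
    params, sim = clip_guidance_step(params)
```

Averaging the similarity over multiple views and augmentations is the key regularizer: it is what makes it harder for the optimization to find a single adversarial view that scores well under CLIP without corresponding to a coherent object.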



Figure 1: Examples of multi-view images generated from the input prompts with our implicit Vox Imp model trained at resolution 224² and various CLIP models. Our model produces highly detailed 3D representations roughly matching the input text.

