UNDERSTANDING PURE CLIP GUIDANCE FOR VOXEL GRID NERF MODELS

Anonymous authors

Abstract

Paper under double-blind review

Figure 1: Examples of multi-view images generated with our implicit Vox Imp model, trained at resolution 224² with various CLIP models, from the input prompts "night city with vaporwave aesthetic", "medieval people celebrating a festival with many stalls; trending on artstation", "Tokyo city; trending on artstation", "American muscle car palms and moon synthwave", "steampunk city; trending on artstation", and "an armoured knight with wings; trending on artstation". Our model produces highly detailed 3D representations roughly matching the input text.

1. INTRODUCTION

Text to image generation has seen major recent advances with the release of DALL-E (Ramesh et al., 2021) and diffusion models such as DALL-E 2 (Ramesh et al., 2022), CogView2 (Ding et al., 2022), and Latent Diffusion (Rombach et al., 2022). A natural next step is the task of generating 3D objects from text input. However, supervised methods relying on paired image-text data are less suited to text to 3D generation, as large-scale paired text and 3D datasets are scarce. Methods requiring little to no 3D training supervision are therefore attractive. While this might seem daunting, recent work in text to 3D generation showed promising results without using large-scale datasets by bridging the gap with guidance from pretrained vision-language models such as CLIP (Radford et al., 2021). At the same time, advances in differentiable neural rendering and the development of NeRF (Mildenhall et al., 2020) now allow direct optimization of a 3D representation to match input images. Combining these approaches with CLIP guidance, we can generate 3D representations directly from text, without paired text-3D data, by optimizing the similarity between the text and the rendered images.

Work that leverages CLIP for text to 3D generation can be grouped by the amount of 3D data required. The first set of methods trains a generative model on a 3D dataset and then optimizes a mapping network from text to the latent space of the generative model using CLIP guidance and differentiable rendering. The second set of methods uses no 3D or text supervision and has access only to the pretrained CLIP model. We refer to this latter regime as pure CLIP guidance. Given the scarcity of text-3D pair datasets, we focus on this regime.

Supplementary Website: https://isekaicoder.github.io/ICLR3801-Supplemental/
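The pure CLIP guidance objective described above can be sketched in a few lines. The following is a minimal, self-contained illustration (not the authors' implementation): placeholder NumPy vectors stand in for the CLIP text embedding and for the embeddings of images rendered from the 3D representation, and the loss is the negative mean cosine similarity, which a gradient-based optimizer would minimize with respect to the 3D parameters.

```python
import numpy as np

def clip_guidance_loss(text_emb: np.ndarray, image_embs: np.ndarray) -> float:
    """Negative mean cosine similarity between one text embedding and a
    batch of rendered-view image embeddings (lower is better)."""
    t = text_emb / np.linalg.norm(text_emb)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    return float(-(v @ t).mean())

# Toy usage: view embeddings identical to the text embedding give loss -1.
text = np.array([1.0, 0.0, 0.0])
views = np.stack([text, text])
loss = clip_guidance_loss(text, views)  # -1.0
```

In the full pipeline, `image_embs` would come from the CLIP image encoder applied to differentiably rendered views, so the gradient of this loss flows back into the voxel grid or MLP parameters.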
A prominent example of the pure CLIP guidance regime is Dream Fields (Jain et al., 2022), which uses Mip-NeRF (Barron et al., 2021) and CLIP to guide the 3D optimization process for every new input text prompt. Unfortunately, this approach requires significant computational resources and exhibits poor generation quality, with low-density artifacts, when using direct voxel grid optimization (see the appendix of the original paper). We also find that the quality of the Dream Fields results is largely attributable to the LiT (Zhai et al., 2022) guidance model; when using the vanilla CLIP models, as in our work, results are far worse. Optimizing the CLIP similarity is also prone to adversarial examples, where generated images with high similarity according to CLIP bear little perceived resemblance to the text description for a human (Liu et al., 2021). Recent text to 3D methods use image-based augmentations as regularization to prevent these issues. However, there has been no systematic study of which of these regularizations matter, and by how much. In addition, there are several possible design choices for the NeRF and CLIP modules, including the use of explicit voxel grids without any neural networks vs. implicit neural representations. We systematically compare these and other factors that impact generation quality, and show that it is possible to generate highly detailed 3D representations with voxel grids alone. Our main contributions are: 1) we conduct a systematic study of augmentations and their effect on text to 3D generation results with pure CLIP guidance; 2) we compare different CLIP backbones for guidance, as well as model ensembles, for finer 3D object detail; 3) we compare the regularization effects on geometry of explicit vs. implicit voxel grids; and 4) we demonstrate generation of high-resolution grids using CLIP guidance only.
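As a concrete example of the image-based augmentations discussed above, the sketch below composites a rendered RGBA view over random solid-colour backgrounds and takes random crops before CLIP scoring. The function name and parameters are our own illustrative choices, not the exact augmentation set used by Dream Fields or any other specific method.

```python
import numpy as np

def augment_views(render: np.ndarray, alpha: np.ndarray,
                  n_aug: int, crop: int, rng: np.random.Generator) -> np.ndarray:
    """Composite an (H, W, 3) render with its (H, W) alpha over random solid
    backgrounds, then take a random (crop, crop) patch from each copy."""
    h, w, _ = render.shape
    batch = []
    for _ in range(n_aug):
        bg = rng.random(3)  # random background colour discourages degenerate, flat geometry
        comp = alpha[..., None] * render + (1.0 - alpha[..., None]) * bg
        y = int(rng.integers(0, h - crop + 1))
        x = int(rng.integers(0, w - crop + 1))
        batch.append(comp[y:y + crop, x:x + crop])
    # (n_aug, crop, crop, 3) batch, to be fed to the CLIP image encoder
    return np.stack(batch)

rng = np.random.default_rng(0)
views = augment_views(np.ones((224, 224, 3)), np.ones((224, 224)),
                      n_aug=8, crop=180, rng=rng)
```

Averaging the CLIP similarity over such an augmented batch makes the score harder to satisfy with a single adversarial rendering, which is the regularization effect studied in our experiments.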

2. RELATED WORK

Text to Image. Recent text to image generation work has shown impressive results, from autoregressive methods (Ramesh et al., 2021; Ding et al., 2021; Yu et al., 2022) to diffusion methods (Nichol et al., 2021; Ramesh et al., 2022; Rombach et al., 2022; Saharia et al., 2022; Ding et al., 2022). However, these methods require significant computational resources, as well as data supervision that is hard to obtain in large quantities in the case of 3D objects and corresponding text prompts. To alleviate this problem, several works use CLIP and a pretrained image generator for image generation or manipulation without explicit supervision from corresponding pairs. VQGAN-CLIP (Crowson et al., 2022) passes a randomly initialized image through a pretrained VQGAN encoder to obtain a latent vector; the latent vector is then fed through the decoder, and after several augmentations are applied, CLIP similarity is computed and used as a loss to optimize the latent vector. FuseDream (Liu et al., 2021) shows that CLIP similarity scores are prone to adversarial attacks and that applying DiffAugment (Zhao et al., 2020) yields more robust CLIP scores that can be used for optimization.

Text to 3D with CLIP Guidance. Initial work in text to 3D shape generation relied on paired 3D shape and text data for supervised training of joint 3D-text embedding spaces (Chen et al., 2018; Jahan et al., 2021; Liu et al., 2022). Large pretrained image-text embeddings and differentiable rendering led to recent work demonstrating that CLIP and NeRF enable 3D object generation (Sanghi et al., 2022; Jain et al., 2022), manipulation (Michel et al., 2022; Wang et al., 2022; Youwang et al., 2022), and even 3D human animation generation (Hong et al., 2022) without direct supervision from corresponding text and 3D pairs. Here we distinguish between two different levels of supervision in these works.
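The latent-optimization recipe shared by VQGAN-CLIP and FuseDream can be illustrated with a toy linear model, where the gradient of the cosine similarity can be written in closed form. The random matrices below are stand-ins for the frozen decoder and the frozen CLIP image encoder, and the target vector stands in for a CLIP text embedding; none of these are real model weights.

```python
import numpy as np

rng = np.random.default_rng(0)
decode = rng.standard_normal((32, 8))   # frozen toy "decoder": latent -> image
encode = rng.standard_normal((16, 32))  # frozen toy "image encoder": image -> embedding
target = rng.standard_normal(16)
target /= np.linalg.norm(target)        # stand-in for the CLIP text embedding

def cosine(z: np.ndarray) -> float:
    e = encode @ (decode @ z)
    return float((e @ target) / np.linalg.norm(e))

z = rng.standard_normal(8)  # latent vector being optimized (image stays frozen-decoded from it)
start = cosine(z)
for _ in range(500):        # gradient ascent on the CLIP-style similarity
    e = encode @ (decode @ z)
    n = np.linalg.norm(e)
    sim = (e @ target) / n
    grad_e = target / n - sim * e / n ** 2     # d(cosine)/d(embedding)
    z += 0.01 * decode.T @ (encode.T @ grad_e)  # chain rule through both frozen maps
```

In the real systems, the decoder and image encoder are deep networks and the gradient is obtained by automatic differentiation, with the augmentations from the sketch above applied to the decoded image before encoding.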
In the first category, although no corresponding text and 3D examples are available, a dataset of 3D objects is. CLIP-Forge (Sanghi et al., 2022) first trains an implicit (occupancy) autoencoder model on the ShapeNet dataset. Then, in a second stage, a normalizing flow model is trained with multi-view images and CLIP to map from the CLIP latent space onto the latent space of the autoencoder. CLIP-NeRF (Wang et al., 2022) takes a similar approach, training a conditional NeRF model and a mapping network that predicts updates to the conditional codes according to the input text. Text2Mesh (Michel et al., 2022) takes an input base mesh matching

