ISS: IMAGE AS STEPPING STONE FOR TEXT-GUIDED 3D SHAPE GENERATION

Abstract

Text-guided 3D shape generation remains challenging due to the absence of large paired text-shape datasets, the substantial semantic gap between the two modalities, and the structural complexity of 3D shapes. This paper presents a new framework, Image as Stepping Stone (ISS), that introduces the 2D image as a stepping stone to connect the two modalities and to eliminate the need for paired text-shape data. Our key contribution is a two-stage feature-space-alignment approach that maps CLIP features to shapes by harnessing a pre-trained single-view reconstruction (SVR) model with multi-view supervision: we first map the CLIP image feature to the detail-rich shape space of the SVR model, then map the CLIP text feature to the shape space and optimize the mapping by encouraging CLIP consistency between the input text and the rendered images. Further, we formulate a text-guided shape stylization module to dress up the output shapes with novel structures and textures. Going beyond existing works on 3D shape generation from text, our approach can create shapes in a broad range of categories without requiring paired text-shape data. Experimental results show that our approach outperforms the state of the art and our baselines in terms of fidelity and consistency with text, and that it can stylize the generated shapes with both realistic and fantasy structures and textures.



Code availability: https://github.com/liuzhengzhe/ISS-Image-as-Stepping-Stone-for-Text-Guided-3D-Shape-Generation

1. INTRODUCTION

3D shape generation has a broad range of applications, e.g., in the Metaverse, CAD, games, and animation. Among the various ways to generate 3D shapes, a user-friendly approach is to generate shapes from natural-language text descriptions. By this means, users can readily create shapes, e.g., to add or modify objects in VR/AR worlds, or to design shapes for 3D printing. Yet, generating shapes from text is very challenging, due to the lack of large-scale paired text-shape data, the large semantic gap between the text and shape modalities, and the structural complexity of 3D shapes.

Existing works (Chen et al., 2018; Jahan et al., 2021; Liu et al., 2022) typically rely on paired text-shape data for model training. Yet, collecting 3D shapes is already very challenging on its own, let alone the tedious manual annotation needed to construct text-shape pairs. To the best of our knowledge, the largest existing paired text-shape dataset (Chen et al., 2018) contains only two categories, i.e., table and chair, thus severely limiting the applicability of existing works. Very recently, two annotation-free approaches, CLIP-Forge (Sanghi et al., 2022) and Dream Fields (Jain et al., 2022), were proposed to address this dataset limitation. These two state-of-the-art approaches utilize the joint text-image embedding of the large-scale pre-trained language-vision model CLIP (Radford et al., 2021) to eliminate the need for paired text-shape data in model training.

However, generating 3D shapes from text without paired texts and shapes remains extremely challenging, for the following reasons. First, the range of object categories that can be generated is still limited due to the scarcity of 3D datasets. For example, CLIP-Forge (Sanghi et al., 2022) is built upon a shape auto-encoder; it can hardly generate plausible shapes beyond the ShapeNet categories. Also, it is challenging to learn a 3D prior of the desired shape from text.
For instance, Dream Fields (Jain et al., 2022) cannot generate 3D shapes as our approach does, since it lacks a 3D prior: it is trained to produce only multi-view images with a neural radiance field. Further, with over an hour of optimization for each shape instance from scratch, there is still no guarantee that the multi-view consistency constraint of Dream Fields (Jain et al., 2022) can drive the model to produce shapes that match the given text; we investigate this further in our experiments. Last, the visual quality of the generated shapes is far from satisfactory due to the substantial semantic gap between the unpaired texts and shapes. As shown in Figure 1(b), the results generated by Dream Fields typically look surrealistic (rather than real), due to insufficient information extracted from the text for the shape structures and details. On the other hand, CLIP-Forge (Sanghi et al., 2022) is highly restricted by its limited 64³ resolution and lacks colors and textures, further manifesting the difficulty of generating 3D shapes from unpaired text-shape data.

Going beyond the existing works, we propose a new approach to 3D shape generation from text without needing paired text-shape data. Our key idea is to implicitly leverage the 2D image as a stepping stone (ISS) to connect the text and shape modalities. Specifically, we employ the joint text-image embedding in CLIP and train a CLIP2Shape mapper to map CLIP image features to a pre-trained, detail-rich 3D shape space with multi-view supervision; see Figure 1(a): stage 1. Thanks to the joint text-image embedding of CLIP, our trained mapper can then connect the CLIP text features with the shape space for text-guided 3D shape generation. Yet, due to the gap between the CLIP text and CLIP image features, the mapped text feature may not align well with the destination shape feature; see the empirical analysis in Section 3.2.
Hence, we further fine-tune the mapper for each text input by encouraging CLIP consistency between the rendered images and the input text, thereby enhancing the consistency between the input text and the generated shape; see Figure 1(a): stage 2.

Our new approach advances the frontier of 3D shape generation from text in the following aspects. First, by taking the image as a stepping stone, we make the challenging text-guided 3D shape generation task more approachable, casting it as a single-view reconstruction (SVR) task; in doing so, we learn 3D shape priors from the adopted SVR model directly in the feature space. Second, benefiting from the 3D priors learned from the SVR model and the joint text-image embeddings, our approach can produce a 3D shape in only 85 seconds, vs. 72 minutes for Dream Fields (Jain et al., 2022). More importantly, our approach produces plausible 3D shapes, not just multi-view images, going beyond the generation capabilities of the state-of-the-art approaches; see Figure 1(b).

With our two-stage feature-space alignment, we can already generate shapes with good fidelity from text. To further enrich the generated shapes with vivid textures and structures beyond the generative space of the pre-trained SVR model, we additionally design a text-guided stylization module that generates novel textures and shapes by encouraging consistency between the rendered images and the text description of the target style. We can then effectively fuse this module with the two-stage feature-space alignment to enable the generation of both realistic and fantasy textures, as well as shapes beyond the generation capability of the SVR model; see Figure 1(b) for examples. Furthermore, our approach is compatible with different SVR models (Niemeyer et al., 2020; Alwala et al., 2022).
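To make the two-stage alignment concrete, below is a minimal NumPy sketch. It is not the paper's implementation: the linear mapper, the toy dimensions, the random stand-ins for CLIP features and SVR shape latents, and the fixed linear map standing in for "differentiable renderer + CLIP image encoder" are all assumptions for illustration. Stage 1 fits a mapper from (stand-in) CLIP image features to shape latents; stage 2 then nudges a mapped latent to raise the cosine similarity between the "rendered" feature and the text feature, mimicking the per-text CLIP-consistency fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; real CLIP features are 512-D or larger).
D_CLIP, D_SHAPE, N = 128, 32, 512

# --- Stand-ins for frozen, pre-trained components --------------------------
# clip_img: CLIP image features of rendered training views     (N, D_CLIP)
# z_shape : matching latents in the pre-trained SVR shape space (N, D_SHAPE)
clip_img = rng.standard_normal((N, D_CLIP))
hidden_map = 0.1 * rng.standard_normal((D_CLIP, D_SHAPE))
z_shape = clip_img @ hidden_map  # pretend a clean alignment exists

# --- Stage 1: fit a (here linear) CLIP2Shape mapper ------------------------
def train_mapper(x, z, lr=1e-3, steps=300):
    """Gradient descent on ||x W - z||^2, the feature-space alignment loss."""
    W = np.zeros((x.shape[1], z.shape[1]))
    losses = []
    for _ in range(steps):
        err = x @ W - z
        losses.append(float((err ** 2).mean()))
        W -= lr * (2.0 * x.T @ err / x.shape[0])
    return W, losses

W, losses = train_mapper(clip_img, z_shape)

# --- Stage 2: per-text CLIP-consistency fine-tuning ------------------------
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# Fixed linear map standing in for "render the shape, then encode with CLIP".
render_clip = np.eye(D_SHAPE) + 0.1 * rng.standard_normal((D_SHAPE, D_SHAPE))
f_text = rng.standard_normal(D_SHAPE)  # stand-in CLIP text feature

def finetune_latent(z, f_text, lr=0.05, steps=200):
    """Ascend the cosine similarity between CLIP(render(z)) and the text.
    The paper fine-tunes the mapper; updating z directly is a simplification."""
    z = z.copy()
    for _ in range(steps):
        u = render_clip @ z
        nu, nt = np.linalg.norm(u), np.linalg.norm(f_text)
        c = (u @ f_text) / (nu * nt)
        grad_u = f_text / (nu * nt) - c * u / (nu ** 2)  # d(cos)/d(u)
        z += lr * (render_clip.T @ grad_u)               # chain rule through render
    return z

z0 = rng.standard_normal(D_SHAPE)  # latent mapped from a text feature
z1 = finetune_latent(z0, f_text)
print(losses[0], losses[-1])              # stage-1 alignment loss drops
print(cosine(render_clip @ z0, f_text),
      cosine(render_clip @ z1, f_text))   # CLIP consistency rises
```

The same cosine-ascent step in stage 2 also sketches the spirit of the stylization module, where the target feature would come from the style text rather than the object description.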
For example, we can adopt SS3D (Alwala et al., 2022) to generate shapes from single-view in-the-wild images, broadening the range of 3D shape categories that our approach can generate beyond Sanghi et al. (2022), which covers only 13 ShapeNet categories. Besides, our approach can also work with the very recent approach GET3D (Gao et al., 2022) to generate high-quality 3D shapes from text; see our results in Section 4.

2. RELATED WORKS

Text-guided image generation. Existing text-guided image generation approaches can be roughly cast into two branches: (i) direct image synthesis (Reed et al., 2016a; b; Zhang et al., 2017; 2018;  



Figure 1: Our novel "Image as Stepping Stone" framework (a) is able to connect the text space (the CLIP Text feature) and the 3D shape space (the SVR feature) through our two-stage feature-space alignment, such that we can generate plausible 3D shapes from text (b) beyond the capabilities of the existing works (CLIP-Forge and Dream Fields), without requiring paired text-shape data.


