ISS: IMAGE AS STEPPING STONE FOR TEXT-GUIDED 3D SHAPE GENERATION

Abstract

Text-guided 3D shape generation remains challenging due to the absence of large paired text-shape datasets, the substantial semantic gap between the two modalities, and the structural complexity of 3D shapes. This paper presents a new framework called Image as Stepping Stone (ISS) for this task, introducing the 2D image as a stepping stone to connect the two modalities and to eliminate the need for paired text-shape data. Our key contribution is a two-stage feature-space-alignment approach that maps CLIP features to shapes by harnessing a pre-trained single-view reconstruction (SVR) model with multi-view supervision: we first map the CLIP image feature to the detail-rich shape space in the SVR model, then map the CLIP text feature to the shape space and optimize the mapping by encouraging CLIP consistency between the input text and the rendered images. Further, we formulate a text-guided shape stylization module to dress up the output shapes with novel structures and textures. Going beyond existing works on 3D shape generation from text, our approach is general and can create shapes in a broad range of categories without requiring paired text-shape data. Experimental results demonstrate that our approach outperforms the state of the art and our baselines in terms of fidelity and consistency with text. Further, our approach can stylize the generated shapes with both realistic and fantasy structures and textures.
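The two-stage feature-space alignment can be summarized abstractly. Below is a minimal NumPy sketch, not the authors' implementation: the frozen CLIP and SVR encoders are replaced by random stand-in features, stage 1 fits a mapper from image features to the shape space (here a closed-form least-squares fit, whereas the paper trains a network), and stage 2 initializes the text-to-shape mapper from the image mapper before refinement. All names, dimensions, and the linear-mapper assumption are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D_CLIP, D_SHAPE, N = 512, 256, 1000  # hypothetical feature dims / sample count

# Stand-ins for frozen encoders: random features playing the roles of
# CLIP image embeddings and SVR shape codes (not real model outputs).
img_feat = rng.normal(size=(N, D_CLIP))
shape_feat = img_feat @ rng.normal(size=(D_CLIP, D_SHAPE)) * 0.1

# Stage 1: map CLIP image features into the SVR shape space.
# Here a least-squares fit stands in for the trained mapper network.
M_img, *_ = np.linalg.lstsq(img_feat, shape_feat, rcond=None)

# CLIP text features of matching content lie near the image features,
# so stage 2 can start the text mapper from the image mapper and then
# fine-tune it (the paper optimizes a CLIP-consistency objective on
# rendered views; here we only measure the resulting alignment error).
txt_feat = img_feat + 0.01 * rng.normal(size=(N, D_CLIP))  # proxy text features
M_txt = M_img.copy()  # initialize text->shape mapping from stage 1

pred = txt_feat @ M_txt
err = np.mean((pred - shape_feat) ** 2)
print(f"text->shape alignment error: {err:.4f}")
```

Because the text features are close to the image features, the stage-1 mapper already aligns them well; the small residual error is what the stage-2 CLIP-consistency optimization would further reduce.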



Code availability: https://github.com/liuzhengzhe/ISS-Image-as-Stepping-Stone-for-Text-Guided-3D-Shape-Generation.

1. INTRODUCTION

3D shape generation has a broad range of applications, e.g., in the Metaverse, CAD, games, animation, etc. Among the various ways to generate 3D shapes, a user-friendly approach is to generate shapes from natural language, or text descriptions. By this means, users can readily create shapes, e.g., to add or modify objects in VR/AR worlds, to design shapes for 3D printing, etc. Yet, generating shapes from text is very challenging, due to the lack of large-scale paired text-shape data, the large semantic gap between the text and shape modalities, and the structural complexity of 3D shapes. Existing works (Chen et al., 2018; Jahan et al., 2021; Liu et al., 2022) typically rely on paired text-shape data for model training. Yet, collecting 3D shapes is already very challenging on its own, let alone the tedious manual annotation needed to construct text-shape pairs. To the best of our knowledge, the largest existing paired text-shape dataset (Chen et al., 2018) contains only two categories, i.e., table and chair, severely limiting the applicability of existing works.



Figure 1: Our novel "Image as Stepping Stone" framework (a) is able to connect the text space (the CLIP Text feature) and the 3D shape space (the SVR feature) through our two-stage feature-space alignment, such that we can generate plausible 3D shapes from text (b) beyond the capabilities of the existing works (CLIP-Forge and Dream Fields), without requiring paired text-shape data.


