MESHDIFFUSION: SCORE-BASED GENERATIVE 3D MESH MODELING

Abstract

We consider the task of generating realistic 3D shapes, which is useful for a variety of applications such as automatic scene generation and physical simulation. Compared to other 3D representations like voxels and point clouds, meshes are more desirable in practice, because (1) they enable easy and arbitrary manipulation of shapes for relighting and simulation, and (2) they can fully leverage the power of modern graphics pipelines which are mostly optimized for meshes. Previous scalable methods for generating meshes typically rely on sub-optimal post-processing, and they tend to produce overly-smooth or noisy surfaces without fine-grained geometric details. To overcome these shortcomings, we take advantage of the graph structure of meshes and use a simple yet very effective generative modeling method to generate 3D meshes. Specifically, we represent meshes with deformable tetrahedral grids, and then train a diffusion model on this direct parametrization. We demonstrate the effectiveness of our model on multiple generative tasks.



Zhen Liu 1,2 *, Yao Feng 2,3, Michael J. Black 2, Derek Nowrouzezahrai 4, Liam Paull 1, Weiyang Liu 2,5. 1 Mila, Université de Montréal; 2 Max Planck Institute for Intelligent Systems, Tübingen

1. INTRODUCTION

As one of the most challenging tasks in computer vision and graphics, generative modeling of high-quality 3D shapes is of great significance in many applications such as virtual reality and the metaverse [11]. Traditional methods for generative 3D shape modeling are usually built upon voxel [51] or point cloud [1] representations, mostly because ground-truth data for these representations are relatively easy to obtain and convenient to process. Neither representation, however, produces fine-level surface geometry, and therefore neither can be used for photorealistic rendering of shapes with different materials under different lighting conditions. Moreover, despite being convenient for computers to process, voxels and point clouds are relatively hard for artists to edit, especially when the generated 3D shapes are complex and of low quality. In addition, modern graphics pipelines are built and optimized for explicit geometry representations like meshes, making meshes one of the most desirable final 3D shape representations. While it is still possible to use methods like Poisson reconstruction to obtain surfaces from voxels and point clouds, the resulting surfaces are generally noisy and contain many topological artifacts, even with carefully tuned hyperparameters. To improve representational flexibility, signed distance fields (SDFs) have been adopted to model shape surfaces, which enables marching cubes [29] to extract the zero level set and thus a 3D mesh. However, SDFs are typically harder to learn, as they require a carefully designed sampling strategy and regularization. Because SDFs are usually parameterized with multi-layer perceptrons (MLPs), in which a smoothness prior is implicitly embedded, the generated shapes tend to be so smooth that sharp edges and important (and potentially semantic) details are lost.
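To make the SDF-based pipeline above concrete, the following minimal numpy sketch samples a signed distance field of a sphere on a regular grid; the grid resolution and sphere radius are arbitrary choices of ours, and a marching-cubes implementation (e.g., scikit-image's `measure.marching_cubes`) would extract the zero level set from such a volume as a triangle mesh.

```python
import numpy as np

# Sample a signed distance field (SDF) of a unit sphere on a 32^3 grid.
# Negative values lie inside the surface, positive values outside; the
# surface itself is the zero level set that marching cubes extracts.
res = 32
xs = np.linspace(-1.5, 1.5, res)
X, Y, Z = np.meshgrid(xs, xs, xs, indexing="ij")
sdf = np.sqrt(X**2 + Y**2 + Z**2) - 1.0  # distance to sphere of radius 1

# The surface passes through every pair of neighboring grid points whose
# SDF values change sign -- exactly the cells marching cubes triangulates.
crossings = np.sum(np.sign(sdf[:-1]) != np.sign(sdf[1:]))
print(sdf.shape, crossings > 0)
```

Note that every shape sampled this way occupies a tensor of identical size regardless of its topology, which is the property exploited later in the paper.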
Moreover, SDFs are costly to render and therefore less suitable for downstream tasks like conditional generation from RGB images, which require an efficient differentiable renderer during inference. We instead aim to generate 3D shapes by directly producing 3D meshes, where surfaces are represented as a graph of triangular or polygonal faces. With 3D meshes, all local surface information is completely contained in the mesh vertices (along with the vertex connectivity), because the surface normal of any point on the shape surface is simply a nearest neighbor or some local linear combination of vertex normals. Such a regular structure with rich geometric details enables us to better model the data distribution and learn generative models that are more geometry-aware. In light of recent advances in score-based generative modeling [16, 47], which demonstrate powerful generative performance and effortless training, we propose to train diffusion models on these vertices to generate meshes. This is by no means trivial, however, and poses two critical problems: (1) the numbers of vertices and faces vary across shapes in general object categories, and (2) the underlying topology varies wildly, and edges have to be generated at the same time. A natural solution is to enclose meshes within another structure such that the space of mesh topology and spatial configuration is constrained. One common approach is to discretize the 3D space and encapsulate each mesh in a tiny cell, which has proven useful in simulation [37] and human surface modeling [35]. Observing that this sort of mesh modeling has been shown viable in the recent differentiable geometry modeling literature [32, 38, 43], we propose to train diffusion models on a discretized and uniform tetrahedral grid structure that parameterizes a small yet representative family of meshes.
With such a grid representation, topological changes are subsumed into the SDF values, and the inputs to the diffusion model now assume a fixed and identical size. More importantly, since the SDF values are now independent scalars instead of the outputs of an MLP, the parameterized shapes are no longer biased towards smooth surfaces. Indeed, such explicit probabilistic modeling introduces an explicit geometric prior into shape generation, because the score-matching loss of a diffusion model on the grid vertices corresponds directly and simply to the vertex positions of the triangular mesh. We demonstrate that our method, dubbed MeshDiffusion, produces high-quality meshes and enables conditional generation with a differentiable renderer. MeshDiffusion is also very stable to train, without bells and whistles. We validate the visual quality of our generated samples qualitatively with different rendered views and quantitatively with proxy metrics. We further conduct ablation studies to show that our design choices are necessary and well suited to the task of 3D mesh generation. Our contributions are summarized below:

• To our knowledge, we are the first to apply diffusion models to unconditional generation of high-quality 3D meshes and to show that diffusion models are well suited for 3D geometry.

• Taking advantage of the deformable tetrahedral grid parametrization of 3D mesh shapes, we propose a simple and effortless way to train a diffusion model to generate 3D meshes.

• We qualitatively and quantitatively demonstrate the superiority of MeshDiffusion on different tasks, including (1) unconditional generation, (2) conditional generation, and (3) interpolation.
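Because the grid tensors have a fixed size, training reduces to standard denoising diffusion on plain arrays. The numpy sketch below illustrates the forward noising process and the noise-prediction (score-matching) loss under our own illustrative assumptions: the batch and grid sizes, the linear noise schedule, and the placeholder `denoiser` all stand in for the real network and hyperparameters, which the paper does not specify here.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 4096, 4))  # batch of fixed-size grid tensors

# Linear variance schedule, as in standard DDPM formulations.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

# Forward diffusion q(x_t | x_0): scale the data and add Gaussian noise.
t = rng.integers(0, T, size=8)
eps = rng.standard_normal(x0.shape)
ab = alpha_bar[t][:, None, None]
x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

def denoiser(x_t, t):
    # Placeholder for the learned noise-prediction network eps_theta;
    # in practice this is a 3D network over the tetrahedral grid.
    return np.zeros_like(x_t)

# Simple (epsilon-prediction) score-matching training loss.
loss = float(np.mean((denoiser(x_t, t) - eps) ** 2))
print(x_t.shape, loss > 0)
```

With the zero placeholder the loss is simply the mean squared noise; a trained denoiser would drive it towards zero.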

2. RELATED WORK

3D Shape Generation. 3D shape generation is commonly done by using generative models on voxels [51] and point clouds [53], due to their simplicity and accessibility. The resulting (often noisy) voxels or point clouds, however, do not explicitly encode surface information and therefore have to be processed with surface reconstruction methods before being used in applications like relighting and



Figure 1: (a) Unconditionally generated 3D mesh samples randomly selected from the proposed MeshDiffusion, a simple diffusion model trained on a direct parametrization of 3D meshes without bells and whistles. (b) 3D mesh samples generated by MeshDiffusion with text-conditioned textures from [39]. MeshDiffusion produces highly realistic and fine-grained geometric details while being easy and stable to train.

