VOGE: A DIFFERENTIABLE VOLUME RENDERER USING GAUSSIAN ELLIPSOIDS FOR ANALYSIS-BY-SYNTHESIS

Abstract

Gaussian reconstruction kernels were proposed by Westover (1990) and studied by the computer graphics community in the 1990s; they give an alternative representation of 3D object geometry to meshes and point clouds. On the other hand, current state-of-the-art (SoTA) differentiable renderers, e.g., Liu et al. (2019), use rasterization to collect the triangles or points on each image pixel and blend them based on the viewing distance. In this paper, we propose VoGE, which utilizes volumetric Gaussian reconstruction kernels as geometric primitives. The VoGE rendering pipeline uses ray tracing to capture the nearest primitives and blends them as mixtures based on their volume density distributions along the rays. To render efficiently via VoGE, we propose an approximate closed-form solution for the volume density aggregation and a coarse-to-fine rendering strategy. Finally, we provide a CUDA implementation of VoGE, which enables real-time-level rendering at a speed competitive with PyTorch3D. Quantitative and qualitative experimental results show that VoGE outperforms SoTA counterparts when applied to various vision tasks, e.g., object pose estimation, shape/texture fitting, and occlusion reasoning.

1. INTRODUCTION

Recently, the integration of deep learning and computer graphics has achieved significant advances in many computer vision tasks, e.g., pose estimation Wang et al. (2020a), 3D reconstruction Zhang et al. (2021), and texture estimation Bhattad et al. (2021). Although rendering quality has improved significantly over decades of computer graphics development, the differentiability of the rendering process remains to be explored and improved. Specifically, differentiable renderers compute gradients w.r.t. the image formation process, and hence make it possible to broadcast cues from 2D images back to the parameters of computer graphics models, such as the camera parameters, and object geometries and textures. Such an ability is also essential when combining graphics models with deep neural networks. In this work, we focus on developing a differentiable renderer using explicit object representations, i.e., Gaussian reconstruction kernels, which can either be used on their own for image generation or serve as 3D-aware neural network layers. The traditional rendering process typically involves a naive rasterization Kato et al. (2018), which projects geometric primitives onto the image plane and captures only the nearest primitive for each pixel. However, this process eliminates the cues from occluded primitives and blocks gradients toward them. Rasterization also imposes a limitation on differentiable rendering: it assumes that primitives do not overlap with each other and are ordered front to back along the viewing direction Zwicker et al. (2001). This assumption raises a paradox for gradient-based optimization, since primitives must overlap whenever they change order along the viewing direction. Liu et al. (2019) provide a partial solution that tracks a set of nearest primitives for each image pixel and blends them based on the viewing distance.
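The distance-based blending just described can be sketched as a softmax over depths. This is a toy illustration only, not the exact formulation of Liu et al. (2019); the colors, depths, and the `gamma` temperature are made up for the example.

```python
import math

def distance_blend(colors, depths, gamma=0.1):
    """Blend per-primitive RGB colors with a softmax over negative depth,
    so nearer primitives receive larger weights.  A toy sketch of
    distance-based blending, not the formulation of Liu et al. (2019)."""
    logits = [-z / gamma for z in depths]
    m = max(logits)                                  # for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    return tuple(sum(w * c[i] for w, c in zip(weights, colors))
                 for i in range(3))

# A red primitive in front of a large blue one (made-up depths):
red, blue = (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)
far_apart = distance_blend([red, blue], depths=[1.0, 5.0])       # essentially red
near_each_other = distance_blend([red, blue], depths=[4.9, 5.0])  # blends toward purple
```

With the two primitives far apart the nearer one dominates, but once they come close the output drifts toward a mixed color regardless of actual occlusion, which is exactly the ambiguity discussed next.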
However, such an approach introduces an ambiguity. For example, suppose a red object floats in front of a large blue object lying in the background. Using the distance-based blending method, the cues of the blue object change as the red one moves from far to near; in particular, when the red object is close to the blue one, blending yields an unrealistic purple color. To resolve this ambiguity, we record the volume density distributions instead of simply recording the viewing distance, since such distributions provide cues about occlusion and the interaction of primitives when they overlap. In the VoGE rendering pipeline, a ray tracing method is designed to replace rasterization, and a better blending function is developed based on the integral of the traced volume density functions. As Figure 1 shows, VoGE uses a set of Gaussian ellipsoids to reconstruct the object in 3D space. Each Gaussian ellipsoid is described by a center location M and a spatial variance Σ. During rendering, we first sample viewing rays according to the camera configuration. Along each ray, we trace the volume density of each ellipsoid as a function of distance, compute the occupancy along the ray via an integral of the volume density, and reweight the contribution of each ellipsoid accordingly. Finally, we interpolate the attributes of the reconstruction kernels into an image using the kernel-to-pixel weights. In practice, we propose an approximate closed-form solution, which avoids computationally heavy operations in the density aggregation, e.g., integrals and cumulative sums. Benefiting from this advanced differentiability, VoGE obtains both high performance and speed on various vision tasks.
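A minimal sketch of this tracing-and-reweighting pipeline, assuming isotropic kernels (Σ = σ²I) and a crude front-to-back attenuation in place of VoGE's approximate closed-form aggregation; the function names `ray_gaussian` and `kernel_weights` are illustrative, not part of the VoGE API:

```python
import math

def ray_gaussian(o, d, M, sigma):
    """Trace one isotropic 3D Gaussian kernel (center M, variance sigma^2*I)
    along the ray r(t) = o + t*d, with d unit length.  Because the density
    rho(r(t)) = exp(-|o + t*d - M|^2 / (2*sigma^2)) is itself Gaussian in t,
    its peak depth, peak value, and total integral have closed forms."""
    v = [M[i] - o[i] for i in range(3)]
    l = sum(v[i] * d[i] for i in range(3))              # depth of the density peak
    q = sum(x * x for x in v) - l * l                   # squared perpendicular distance
    peak = math.exp(-q / (2.0 * sigma ** 2))            # rho at t = l
    integral = peak * sigma * math.sqrt(2.0 * math.pi)  # closed-form integral of rho over t
    return l, peak, integral

def kernel_weights(traces):
    """Crude stand-in for the kernel-to-pixel weights W_k: attenuate each
    kernel's integrated density by the density already accumulated in front
    of it (front-to-back by peak depth).  VoGE's actual aggregation uses an
    approximate closed-form transmittance, not this simplification."""
    order = sorted(range(len(traces)), key=lambda k: traces[k][0])
    weights, acc = [0.0] * len(traces), 0.0
    for k in order:
        _, _, integral = traces[k]
        weights[k] = math.exp(-acc) * integral          # attenuated contribution
        acc += integral                                 # accumulate occupancy
    return weights

# One ray through two kernels stacked along +z: the nearer kernel occludes.
o, d = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
traces = [ray_gaussian(o, d, (0.0, 0.0, 5.0), 1.0),
          ray_gaussian(o, d, (0.0, 0.0, 8.0), 1.0)]
weights = kernel_weights(traces)                        # weights[0] > weights[1]
```

Note that the occluded kernel still receives a nonzero, smoothly varying weight, which is what lets gradients flow to invisible primitives.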

Recently, Mildenhall et al. (2020); Schwarz et al. (2020) show the power of volume rendering with high-quality occlusion reasoning and differentiability, which benefits from the ray tracing of volume densities introduced by Kajiya & Von Herzen (1984). However, the rendering process in those works relies on implicit object representations, which limits modifiability and interpretability. Back in the 90s, Westover (1990); Zwicker et al. (2001) developed the splatting method, which reconstructs objects using volumetric Gaussian kernels and renders based on a simplification of the ray tracing volume densities method. Unfortunately, splatting methods were designed for graphics rendering without considering differentiability, and they approximate the ray tracing of volume densities using rasterization. Inspired by both approaches, Zwicker et al. (2001); Liu et al. (2019), we propose VoGE, which uses 3D Gaussian kernels to represent objects and thereby softens the boundaries of primitives. Specifically, VoGE traces each primitive along the viewing rays as a density function, which gives a probability of observation along the viewing direction for the blending process.

In summary, the contributions of VoGE include: 1. A ray tracing method that traces each component along the viewing rays as density functions; the VoGE ray tracer is a replacement for rasterization. 2. A blending function, based on the integral of the density functions along the viewing rays, that reasons about occlusion between primitives and provides differentiability toward both visible and invisible primitives. 3. A differentiable CUDA implementation with real-time-level rendering speed; VoGE can be easily inserted into neural networks via our PyTorch APIs. 4. Exceptional performance on various vision tasks: quantitative results demonstrate that VoGE significantly outperforms concurrent state-of-the-art differentiable renderers on in-the-wild object pose estimation.

Figure 1: VoGE conducts ray tracing of volume densities. Given the Gaussian ellipsoids, i.e., a set of ellipsoidal 3D Gaussian reconstruction kernels, VoGE first samples rays r(t). Along each ray, VoGE traces the density distribution of each ellipsoid ρ_k(r(t)) respectively. Then the occupancy T(r(t)) is accumulated via density aggregation along the ray. The observation of each Gaussian ellipsoid kernel W_k is computed via the integral of the reweighted per-kernel volume density W_k(r(t)). Finally, VoGE synthesizes the image using the computed W_k on each pixel to interpolate per-kernel attributes. In practice, the density aggregation is bootstrapped via approximate closed-form solutions.
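The ray tracing of volume densities referenced above follows the standard emission-absorption volume rendering model, sketched here in the notation of Figure 1; VoGE replaces these integrals with approximate closed-form solutions rather than evaluating them numerically.

```latex
% Transmittance: attenuation accumulated from all kernel densities in front of depth t
T(r(t)) = \exp\!\left(-\int_{0}^{t} \sum_{k} \rho_k(r(s)) \, ds\right)

% Kernel-to-pixel weight: integral of the transmittance-reweighted per-kernel density
W_k = \int_{0}^{\infty} T(r(t)) \, \rho_k(r(t)) \, dt
```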

Availability

https://github.com/Angtian/VoGE.

