VOGE: A DIFFERENTIABLE VOLUME RENDERER USING GAUSSIAN ELLIPSOIDS FOR ANALYSIS-BY-SYNTHESIS

Abstract

The Gaussian reconstruction kernels were proposed by Westover (1990) and studied by the computer graphics community back in the 90s; they give an alternative representation of 3D object geometry to meshes and point clouds. On the other hand, current state-of-the-art (SoTA) differentiable renderers, Liu et al. (2019), use rasterization to collect triangles or points on each image pixel and blend them based on the viewing distance. In this paper, we propose VoGE, which utilizes the volumetric Gaussian reconstruction kernels as geometric primitives. The VoGE rendering pipeline uses ray tracing to capture the nearest primitives and blends them as mixtures based on their volume density distributions along the rays. To efficiently render via VoGE, we propose an approximate closed-form solution for the volume density aggregation and a coarse-to-fine rendering strategy. Finally, we provide a CUDA implementation of VoGE, which enables real-time level rendering with a competitive rendering speed in comparison to PyTorch3D. Quantitative and qualitative experiment results show that VoGE outperforms SoTA counterparts when applied to various vision tasks, e.g., object pose estimation, shape/texture fitting, and occlusion reasoning. The code is available: https://github.com/Angtian/VoGE.

1. INTRODUCTION

Recently, the integration of deep learning and computer graphics has achieved significant advances in many computer vision tasks, e.g., pose estimation Wang et al. (2020a), 3D reconstruction Zhang et al. (2021), and texture estimation Bhattad et al. (2021). Although rendering quality has significantly improved over decades of development of computer graphics, the differentiability of the rendering process still remains to be explored and improved. Specifically, differentiable renderers compute the gradients w.r.t.
the image formation process, and hence enable cues from 2D images to be broadcast toward the parameters of computer graphics models, such as the camera parameters and the object geometries and textures. Such an ability is also essential when combining graphics models with deep neural networks. In this work, we focus on developing a differentiable renderer using explicit object representations, i.e., Gaussian reconstruction kernels, which can be used either separately for image generation or as 3D-aware neural network layers.

The traditional rendering process typically involves a naive rasterization Kato et al. (2018), which projects geometric primitives onto the image plane and only captures the nearest primitive for each pixel. However, this process eliminates the cues from the occluded primitives and blocks gradients toward them. The rasterization process also introduces a limitation for differentiable rendering: rasterization assumes that primitives do not overlap with each other and are ordered front to back along the viewing direction Zwicker et al. (2001). Such an assumption raises a paradox: during gradient-based optimization, primitives necessarily overlap with each other when they change order along the viewing direction. Liu et al. (2019) provide a naive solution that tracks a set of nearest primitives for each image pixel and blends them based on the viewing distance. However, such an approach introduces an ambiguity. For example, consider a red object floating in front of a large blue object that serves as the background. Using the distance-based blending method, the cues of the second object will change when moving the red one from far to near, especially when the red object is near the blue one, which gives an unrealistic purple blended color. In order to resolve the ambiguity, we record the volume density distributions instead of simply recording the viewing distance, since such distributions provide cues on occlusion and interaction of primitives when they overlap.

Figure caption (fragment): Finally, VoGE synthesizes the image using the computed $W_k$ on each pixel to interpolate per-kernel attributes. In practice, the density aggregation is bootstrapped via approximate closed-form solutions.

Recently, Mildenhall et al. (2020); Schwarz et al. (2020) show the power of volume rendering with high-quality occlusion reasoning and differentiability, which benefits from the usage of ray tracing volume densities introduced by Kajiya & Von Herzen (1984). However, the rendering process in those works relies on implicit object representations, which limits modifiability and interpretability. Back in the 90s, Westover (1990); Zwicker et al. (2001) developed the splatting method, which reconstructs objects using volumetric Gaussian kernels and renders based on a simplification of the ray tracing volume densities method. Unfortunately, splatting methods were designed for graphics rendering without considering differentiability, and they approximate the ray tracing volume densities using rasterization. Inspired by both approaches, Zwicker et al. (2001); Liu et al. (2019), we propose VoGE, which uses 3D Gaussian kernels to represent objects and thereby softens the boundaries of primitives. Specifically, VoGE traces primitives along viewing rays as a density function, which gives a probability of observation along the viewing direction for the blending process.
In the VoGE rendering pipeline, the ray tracing method is designed to replace rasterization, and a better blending function is developed based on the integral of the traced volume density functions. As Figure 1 shows, VoGE uses a set of Gaussian ellipsoids to reconstruct the object in 3D space. Each Gaussian ellipsoid is described by a center location $M$ and a spatial variance $\Sigma$. During rendering, we first sample viewing rays according to the camera configuration. On each ray, we trace the volume density of each ellipsoid as a function of distance along the ray, compute the occupancy along the ray via an integral of the volume density, and reweight the contribution of each ellipsoid. Finally, we interpolate the attributes of each reconstruction kernel with the kernel-to-pixel weights into an image. In practice, we propose an approximate closed-form solution, which avoids computationally heavy operations in the density aggregation, e.g., integrals and cumulative sums. Benefiting from advanced differentiability, VoGE obtains both high performance and speed on various vision tasks.

In summary, the contributions of VoGE include:
1. A ray tracing method that traces each component along the viewing ray as a density function. The VoGE ray tracer is a replacement for rasterization.
2. A blending function based on the integral of the density functions along viewing rays that reasons about occlusion between primitives, which provides differentiability toward both visible and invisible primitives.
3. A differentiable CUDA implementation with real-time level rendering speed. VoGE can be easily inserted into neural networks via our PyTorch APIs.
4. Exceptional performance on various vision tasks. Quantitative results demonstrate that VoGE significantly outperforms current state-of-the-art differentiable renderers on in-wild object pose estimation tasks.

2022), use implicit functions, e.g., neural networks, as object representations.
Though those implicit representations give satisfying performance, they lack interpretability and modifiability, which may limit their usage in analysis tasks. In this work, we provide a solution that utilizes explicit representations while rendering with ray tracing volume density aggregation.

Kernel Reconstruction of 3D Volume. Westover (1990) introduces the volume reconstruction kernel, which decomposes a 3D volume into a sum of homogeneous primitives. Zwicker et al. (2001) introduce the elliptical Gaussian kernel and show that such a reconstruction gives a satisfactory shape approximation. However, both approaches conduct non-differentiable rendering and use rasterization to approximate the ray tracing process.

Differentiable Renderer using Graphics. Graphics renderers use explicit object representations, which represent objects as a set of geometric primitives. As

3. VOLUME RENDERER FOR GAUSSIAN ELLIPSOIDS

In this section, we describe the VoGE rendering pipeline, which renders 3D Gaussian ellipsoids into images under a given camera configuration. Section 3.1 introduces volume rendering. Section 3.2 describes the kernel reconstruction of the 3D volume using Gaussian ellipsoids. In Section 3.3, we propose the rendering pipeline for Gaussian ellipsoids via an approximate closed-form solution of ray tracing volume densities. Section 3.4 discusses the integration of VoGE with deep neural networks.

3.1. VOLUME RENDERING

Different from surface-based shape representations, in volume rendering, objects are represented using continuous volume density functions. Specifically, for each point in the volume, we have a corresponding density $\rho(x, y, z)$ with emitted color $c(x, y, z) = (r, g, b)$, where $(x, y, z)$ denotes the location of the point in 3D space. Kajiya & Von Herzen (1984) propose using the light scattering equation during volume rendering, which provides a mechanism to compute the observed color $C(r)$ along a ray $r(t) = (x(t), y(t), z(t))$:

$$C(r) = \int_{t_n}^{t_f} T(t)\,\rho(r(t))\,c(r(t))\,dt, \quad \text{where } T(t) = \exp\left(-\tau \int_{t_n}^{t} \rho(r(s))\,ds\right) \tag{1}$$

where $\tau$ is a coefficient that determines the rate of absorption, $t_n$ and $t_f$ denote the near and far bounds along the ray, and $T(t)$ is the transmittance.
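The scattering integral above can be checked numerically on a toy ray. The following is a minimal sketch, assuming the emission-absorption form $C(r) = \int T(t)\rho(r(t))c(r(t))\,dt$ with $T(t) = \exp(-\tau\int\rho)$; the function `render_ray` and the two-slab scene are hypothetical illustrations and are not part of VoGE.

```python
import math

def render_ray(density, color, t_near, t_far, tau=1.0, n_steps=2000):
    """Midpoint-rule estimate of the observed color C(r) along one ray.

    density(t) returns rho(r(t)); color(t) returns the emitted (r, g, b) at r(t).
    """
    dt = (t_far - t_near) / n_steps
    absorbed = 0.0                      # running integral of rho from t_near
    observed = [0.0, 0.0, 0.0]
    for step in range(n_steps):
        t = t_near + (step + 0.5) * dt
        rho = density(t)
        # Transmittance T(t) evaluated at the cell midpoint.
        transmittance = math.exp(-tau * (absorbed + 0.5 * rho * dt))
        emitted = color(t)
        for channel in range(3):
            observed[channel] += transmittance * rho * emitted[channel] * dt
        absorbed += rho * dt
    return tuple(observed)


# Hypothetical scene: a red slab on [1, 2) in front of a blue slab on [3, 4).
def slab_density(t):
    return 1.0 if (1.0 <= t < 2.0 or 3.0 <= t < 4.0) else 0.0

def slab_color(t):
    return (1.0, 0.0, 0.0) if t < 2.5 else (0.0, 0.0, 1.0)

pixel = render_ray(slab_density, slab_color, t_near=0.0, t_far=5.0)
```

For this scene the integral has a closed form: the red channel approaches $1 - e^{-1}$ and the blue channel approaches $e^{-1}(1 - e^{-1})$, so the occluded slab still contributes, only with a smaller weight.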

3.2. GAUSSIAN ELLIPSOID RECONSTRUCTION KERNEL

Due to the difficulty of obtaining a continuous function of the volume density and the enormous computation cost of calculating the integral, Westover (1990) introduces kernel reconstruction to conduct volume rendering in a computationally efficient way. The reconstruction decomposes the continuous volume into a set of homogeneous kernels, where each kernel can be described with a simple density function. We use volumetric ellipsoidal Gaussians as the reconstruction kernels. Specifically, we reconstruct the volume with a sum of ellipsoidal Gaussians:

$$\rho(X) = \sum_{k=1}^{K} \frac{1}{2\pi \cdot \|\Sigma_k\|_2} e^{-\frac{1}{2}(X - M_k)^T \cdot \Sigma_k^{-1} \cdot (X - M_k)} \tag{2}$$

where $K$ is the total number of Gaussian kernels and $X = (x, y, z)$ is an arbitrary location in 3D space. $M_k$, a $3 \times 1$ vector, is the center of the $k$-th ellipsoidal Gaussian kernel, whereas $\Sigma_k$ is a $3 \times 3$ spatial variance matrix, which controls the direction, size and shape of the $k$-th kernel. Also, following Zwicker et al. (2001), we assume that the emitted color is approximately constant inside each reconstruction kernel, $c(r(t)) = c_k$.

The VoGE mesh converter creates Gaussian ellipsoids from a mesh. Specifically, we create Gaussians centered at all vertex locations of the mesh. First, we compute sphere-type Gaussians with the same $\sigma_k$ in each direction, via the average distance $l$ from the center vertex to its connected neighbors, $\sigma_k = \frac{l^2}{4 \cdot \log(1/\zeta)}$, where $\zeta$ is a parameter that controls the size of the Gaussians. Then, we flatten the sphere-type Gaussians into ellipsoids with a flatten rate. Finally, for each Gaussian, we compute a rotation matrix via the mesh surface normal direction at the corresponding mesh vertex. We apply the rotation matrix to $\Sigma_k$ so that the Gaussians are flattened along the surface.
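As a small illustration of the kernel reconstruction, the sketch below evaluates the density of Equation 2 at a 3D point with NumPy. The helper names `kernel_density` and `volume_density` are hypothetical, and the prefactor follows the form in which Equation 2 is printed here (spectral norm of $\Sigma_k$); this is an assumption about notation and not the CUDA implementation.

```python
import numpy as np

def kernel_density(x, center, cov):
    """Density of one ellipsoidal Gaussian kernel at the 3D point x."""
    diff = np.asarray(x, dtype=float) - np.asarray(center, dtype=float)
    # (X - M)^T Sigma^{-1} (X - M), computed without forming the inverse.
    mahalanobis = diff @ np.linalg.solve(cov, diff)
    prefactor = 1.0 / (2.0 * np.pi * np.linalg.norm(cov, 2))
    return prefactor * np.exp(-0.5 * mahalanobis)

def volume_density(x, centers, covs):
    """Sum of all kernel densities at x, i.e. rho(X) of the reconstruction."""
    return sum(kernel_density(x, m, s) for m, s in zip(centers, covs))


# Hypothetical example: one unit-sphere kernel at the origin.
centers = [np.zeros(3)]
covs = [np.eye(3)]
rho_center = volume_density([0.0, 0.0, 0.0], centers, covs)
rho_offset = volume_density([1.0, 0.0, 0.0], centers, covs)
```

Each kernel needs only a center and a $3 \times 3$ matrix, which is what makes the representation explicit and directly editable.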

3.3. RENDER GAUSSIAN ELLIPSOIDS

Figure 4 shows the rendering process of VoGE. VoGE takes a perspective camera and Gaussian ellipsoids as inputs to render images, while computing gradients toward both the camera and the Gaussian ellipsoids (shown in Figures 2 and 3).

Viewing transformation utilizes the extrinsic configuration $E$ of the camera to transfer the Gaussian ellipsoids from the object coordinate to the camera coordinate. Let $M_k^o$ denote the centers of the ellipsoids in the object coordinate. Following the standard approach, we compute the centers in the camera coordinate:

$$M_k = R \cdot M_k^o + T \tag{3}$$

where $R$ and $T$ are the rotation and translation matrices included in $E$. Since we consider the 3D Gaussian kernels to be ellipsoidal, observations of the variance matrices also change upon camera rotations:

$$\Sigma_k^{-1} = R^T \cdot (\Sigma_k^o)^{-1} \cdot R \tag{4}$$

Perspective rays indicate the viewing direction in the camera coordinate. For each pixel, we compute the viewing ray under the assumption that the camera is fully perspective:

$$r(t) = D * t = \begin{bmatrix} \frac{i - O_y}{F} & \frac{j - O_x}{F} & 1 \end{bmatrix}^T * t \tag{5}$$

where $p = (i, j)$ is the pixel location on the image, $(O_x, O_y)$ is the principal point of the camera, $F$ is the focal length, and $D$ is the ray direction vector.

Ray tracing observes the volume densities of each ellipsoid along the ray $r$. Note that the observation of each ellipsoid is a 1D Gaussian function along the viewing ray (for detailed mathematics, refer to Appendix A.1):

$$\rho_m(r(s)) = \exp\left(q_m - \frac{(s - l_m)^2}{2 \cdot \sigma_m^2}\right) \tag{6}$$

where $l_m = \frac{M_m^T \cdot \Sigma_m^{-1} \cdot D + D^T \cdot \Sigma_m^{-1} \cdot M_m}{2 \cdot D^T \cdot \Sigma_m^{-1} \cdot D}$ is the length along the ray that gives the peak activation for the $m$-th kernel, and $q_m = -\frac{1}{2} V_m^T \cdot \Sigma_m^{-1} \cdot V_m$, with $V_m = M_m - l_m \cdot D$, computes the peak density of the $m$-th kernel along the ray. The 1D variance is computed via $\frac{1}{\sigma_m^2} = D^T \cdot \Sigma_m^{-1} \cdot D$. Thus, when tracing along each ray, we only need to record $l_m$, $q_m$ and $\sigma_m$ for each ellipsoid.

Blending via Volume Densities computes the observation along the ray $r$.
As Figure 1 shows, different from other generic renderers, which only consider the viewing distance for blending, VoGE blends all observations based on the integral of volume densities along the ray. However, computing the integral by brute force is so computationally inefficient that it is infeasible even with current computation power. To resolve this, we propose an approximate closed-form solution, which conducts the computation in both an accurate and efficient way. We use the error function erf to compute the integral of a Gaussian, since it can be computed directly via a numerical approach. Specifically, with Equation 2 and Equation 5, we can calculate the transmittance $T(t)$ as (for the proof of this approximation, refer to Appendix A.2):

$$T(t) = \exp\left(-\tau \int_{-\infty}^{t} \rho(r(s))\,ds\right) = \exp\left(-\tau \sum_{m=1}^{K} e^{q_m} \frac{\operatorname{erf}((t - l_m)/\sigma_m) + 1}{2}\right) \tag{7}$$

Figure 5: Comparison of rendering speeds of VoGE and PyTorch3D, reported in images per second (higher is better). We evaluate the rendering speed using cuboids with different numbers of primitives (vertices, ellipsoids), which are illustrated using different colors, as well as different image sizes and numbers of primitives per pixel.

Now, to compute the closed-form solution of the outer integral in Equation 1, we use $T(t)$ at $t = l_k$, the peak of $\rho(r(t))$ along the rays. Here we provide the closed-form solution for $C(r)$:

$$C(r) = \int_{-\infty}^{\infty} T(t)\,\rho(r(t))\,c(r(t))\,dt = \sum_{k=1}^{K} T(l_k)\,e^{q_k}\,c_k \tag{8}$$

Note that, based on the assumption that distances from the camera to the ellipsoids are significantly larger than the ellipsoid sizes, it is equivalent to set $t_n = -\infty$ and $t_f = \infty$.

Coarse-to-fine rendering. In order to improve the rendering efficiency, we implement VoGE rendering with a coarse-to-fine strategy. Specifically, the VoGE renderer has an optional coarse rasterizing stage that, for each ray, selects only around 10% of all ellipsoids (details in Appendix A.3). Besides, the ray tracing of volume densities also works in a coarse-to-fine manner. VoGE blends the $K'$ nearest ellipsoids among all traced kernels that give $e^{q_k} > thr = 0.01$. Using CUDA from NVIDIA et al. (2022), we implement VoGE with both forward and backward functions. CUDA-VoGE is packed as an easy-to-use "autogradable" PyTorch API.
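The tracing and blending steps of this section can be sketched on a single ray in a few lines. This is a hedged reference sketch in NumPy, not the CUDA implementation: the function names are hypothetical, kernels are assumed to be already expressed in camera coordinates with their inverse variance matrices given, and $\tau = 1$ is an assumed default.

```python
import math
import numpy as np

def ray_direction(i, j, o_x, o_y, focal):
    """Viewing ray direction D for pixel p = (i, j) of a perspective camera."""
    return np.array([(i - o_y) / focal, (j - o_x) / focal, 1.0])

def trace_kernel(center, cov_inv, d):
    """Return (l, q, sigma) of the 1D Gaussian that one kernel forms along r(t) = D * t."""
    dsd = d @ cov_inv @ d
    l = (center @ cov_inv @ d + d @ cov_inv @ center) / (2.0 * dsd)
    v = center - l * d                  # offset from the peak location on the ray to the center
    q = -0.5 * (v @ cov_inv @ v)        # log of the peak density along the ray
    sigma = 1.0 / math.sqrt(dsd)
    return l, q, sigma

def transmittance(t, traced, tau=1.0):
    """Approximate closed-form T(t) using the error function."""
    total = sum(math.exp(q) * (math.erf((t - l) / sigma) + 1.0) / 2.0
                for l, q, sigma in traced)
    return math.exp(-tau * total)

def blend(traced, colors, tau=1.0):
    """Kernel-to-pixel weights T(l_k) * e^{q_k} and the blended pixel color."""
    weights = [transmittance(l, traced, tau) * math.exp(q) for l, q, _ in traced]
    pixel = sum(w * np.asarray(c, dtype=float) for w, c in zip(weights, colors))
    return weights, pixel


# Hypothetical scene: a red kernel at depth 5 in front of a blue kernel at depth 10.
d = ray_direction(i=64, j=64, o_x=64, o_y=64, focal=100.0)
centers = [np.array([0.0, 0.0, 5.0]), np.array([0.0, 0.0, 10.0])]
traced = [trace_kernel(m, np.eye(3), d) for m in centers]
weights, pixel = blend(traced, [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)])
```

In this example the front kernel receives weight $e^{-1/2}$ and the occluded kernel $e^{-3/2}$, so the occluded kernel keeps a non-zero weight and its parameters still influence the pixel.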

3.4. VOGE IN NEURAL NETWORKS

VoGE can be easily embedded into neural networks by serving as a neural sampler and renderer. As a sampler, VoGE extracts attributes $\alpha_k$ (e.g., deep neural features, textures) from images or feature maps into kernel-corresponding attributes, which is conducted by reconstructing their spatial distribution in the screen coordinates. When serving as a renderer, VoGE converts kernel-corresponding attributes into images or feature maps. Since both sampling and rendering give the same spatial distribution of features/textures, it is possible for VoGE to conduct geometry-based image-to-image transformation.

Here we discuss how VoGE samples deep neural features. Let $\Phi$ denote the observed features, where $\phi_p$ is the value at location $p$. Let $A = \bigcup_{k=1}^{K} \{\alpha_k\}$ denote the per-kernel attributes, which we want to discover during sampling. With a given object geometry $\Gamma = \bigcup_{k=1}^{K} \{M_k, \Sigma_k\}$ and viewing rays $r(p)$, the observation is formulated with a conditional probability regarding $\alpha_k$:

$$\phi'(p) = \sum_{k=1}^{K} P(\alpha_k \mid \Gamma, r(p), k)\,\alpha_k \tag{9}$$

Since $\Phi$ is a discrete observation of a continuous distribution $\phi(p)$ on the screen, the synthesis can only be evaluated at discrete positions, i.e., the pixel centers. As the goal is to make $\Phi'$ similar to $\Phi$ at all observable locations, we resolve this via an inverse reconstruction:

$$\alpha_k = \sum_{p=1}^{P} P(\phi(p) \mid \Gamma, r(p), p)\,\phi(p) = \frac{\sum_{p=1}^{P} W_{p,k} * \phi_p}{\sum_{p=1}^{P} W_{p,k}} \tag{10}$$

where $W_{p,k} = T(l_k)\,e^{q_k}$ is the kernel-to-pixel weight as described in Section 3.3.
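The sampling and rendering directions can be written as two small matrix operations. The sketch below assumes the kernel-to-pixel weights are already available as a dense `(P, K)` array and that the rendering probability is taken to be the weight itself; both function names and the small `eps` guard against kernels that no pixel observes are hypothetical.

```python
import numpy as np

def sample_attributes(weights, features, eps=1e-8):
    """Inverse reconstruction: per-kernel attributes from per-pixel features.

    weights:  (P, K) kernel-to-pixel weights W[p, k]
    features: (P, C) observed feature phi_p at each pixel
    returns:  (K, C) weighted average of the features seen by each kernel
    """
    numerator = weights.T @ features
    denominator = weights.sum(axis=0)[:, None]
    return numerator / (denominator + eps)

def render_attributes(weights, attributes):
    """Forward synthesis: per-pixel features from per-kernel attributes."""
    return weights @ attributes


# Hypothetical toy case: 3 pixels, 2 kernels, 1 feature channel.
W = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 2.0]])
phi = np.array([[1.0], [3.0], [5.0]])
alpha = sample_attributes(W, phi)       # kernel 0 averages pixels 0 and 1
phi_rendered = render_attributes(W, alpha)
```

Because the same weights drive both directions, features sampled under one view can be re-synthesized under another view by recomputing only the weights.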

4. EXPERIMENT

We explore several applications of VoGE. In Section 4.1, we study object pose estimation using VoGE in a feature-level render-and-compare pose estimator. In Section 4.2, we explore the texture extraction ability of VoGE. In Section 4.4, we demonstrate that VoGE can optimize the shape representation via multi-view images. Visualizations of VoGE rendering are included in Appendix B.

Rendering Speed. As Figure 5 shows, CUDA-VoGE provides a competitive rendering speed compared to a state-of-the-art differentiable generic renderer when rendering a single cuboidal object.

4.1. OBJECT POSE ESTIMATION IN WILD

We evaluate the ability of VoGE when serving as a feature sampler and renderer in an object pose estimation pipeline, NeMo Wang et al. (2020a), an in-wild category-level object 3D pose estimator that conducts render-and-compare at the neural feature level. NeMo utilizes PyTorch3D Ravi et al. (2020) as the feature sampler and renderer, which converts the feature maps to vertex-corresponding feature vectors and conducts the inverse process. In our NeMo+VoGE experiment, we use VoGE to replace the PyTorch3D sampler and renderer via the approach described in Section 3.4.

Dataset. Following NeMo, we evaluate pose estimation performance on the PASCAL3D+ dataset Xiang et al. (2014), the Occluded PASCAL3D+ dataset Wang et al. (2020b) and the ObjectNet3D dataset Xiang et al. (2016). The PASCAL3D+ dataset contains objects in 12 man-made categories with 11045 training images and 10812 testing images. The Occluded PASCAL3D+ dataset contains occluded versions of the same images, which are obtained by superimposing occluders cropped from the MS-COCO dataset Lin et al. (2014). The dataset includes three levels of occlusion with increasing occlusion rates. In the experiment on ObjectNet3D, we follow NeMo and test on 18 categories.

Evaluation Metric. We measure the pose estimation performance via the accuracy of the rotation error under given thresholds and the median of per-image rotation errors. The rotation error is defined as the difference between the predicted rotation matrix and the ground truth rotation matrix:

$$\Delta(R_{pred}, R_{gt}) = \frac{\left\| \operatorname{logm}\left(R_{pred}^T R_{gt}\right) \right\|_F}{\sqrt{2}}$$

Experiment Details. Following the experiment setup in NeMo, we train the feature extractor for 800 epochs with a progressive learning rate. During inference, for each image, we sample 144 starting poses and optimize for 300 steps via an ADAM optimizer. We convert the meshes provided by NeMo using the method described in Section 3.2.

Results. Figure 6 and Table 2 show the qualitative and quantitative results of object pose estimation on PASCAL3D+ and the Occluded PASCAL3D+ dataset. Results in Table 2 demonstrate 2020). Moreover, both qualitative and quantitative results show that our method is significantly robust under partial occlusion and in out-of-distribution cases. Also, Figure 6 demonstrates that our approach can generalize to those out-of-distribution cases, e.g., a car without a front bumper, which is infeasible for baseline renderers. Table 3 shows the results on ObjectNet3D, which demonstrate a significant performance gain compared to the baseline approaches. The ablation study is included in Appendix C.1.
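For proper rotation matrices, the rotation error $\|\operatorname{logm}(R_{pred}^T R_{gt})\|_F / \sqrt{2}$ equals the angle of the relative rotation, because the matrix logarithm of a rotation by angle $\theta$ is $\theta$ times a skew-symmetric matrix whose Frobenius norm is $\sqrt{2}$. The sketch below uses that identity with the trace formula instead of a matrix logarithm; `rotation_error` and `rotation_z` are hypothetical helper names, not functions from NeMo or VoGE.

```python
import math
import numpy as np

def rotation_error(r_pred, r_gt):
    """Geodesic angle (radians) between two 3x3 rotation matrices."""
    relative = np.asarray(r_pred, dtype=float).T @ np.asarray(r_gt, dtype=float)
    cos_angle = (np.trace(relative) - 1.0) / 2.0
    # Clamp to [-1, 1] to guard against round-off before acos.
    return math.acos(min(1.0, max(-1.0, cos_angle)))

def rotation_z(theta):
    """Rotation by theta radians about the z axis (used only for the example)."""
    c, s = math.cos(theta), math.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])


error = rotation_error(rotation_z(0.3), np.eye(3))
```

Accuracy under a threshold is then the fraction of test images whose error falls below that threshold, and the median is taken over the same per-image errors.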

4.2. TEXTURE EXTRACTION AND RERENDERING

As Figure 7 shows, we conduct texture extraction on real images and rerender the extracted textures under novel viewpoints. The qualitative results are produced on the PASCAL3D+ dataset. The experiment is conducted on each image independently, so no training is involved. Specifically, for each image, we have only three inputs, i.e., the image, the camera configuration, and the Gaussian ellipsoids converted from the CAD models provided by the dataset. Using the method proposed in Section 3.4, we extract the RGB value for each kernel of the Gaussian ellipsoids using the given ground-truth camera configuration. Then we rerender the Gaussian ellipsoids with the extracted texture under a novel view, in which we increase or decrease the azimuth of the viewpoint (horizontal rotation). The qualitative results demonstrate a satisfying texture extraction ability of VoGE, even with only a single image. Also, the details (e.g., numbers on the second car) are retained in high quality under the novel views.

4.3. OCCLUSION REASONING OF MULTIPLE OBJECTS

Figure 8 shows the results of differentiating through the occlusion reasoning process between two objects. Specifically, a target image, the colored cuboid models, and their initial locations are given to the method. Then we render and optimize the 3D locations of both cuboids. In this experiment, we find that both SoftRas and VoGE can successfully optimize the locations when the occludee (blue cuboid) is near the occluder (red cuboid), i.e., 1.5 scales behind the occluder, where the thickness of the occluder is 0.6 scales. However, when the occludee is far behind the occluder (5 scales), SoftRas fails to produce correct gradients to optimize the locations, whereas VoGE can still successfully optimize them. We think this advantage comes from the volume density blending, as opposed to the distance-based blending used in SoftRas.

4.4. SHAPE FITTING VIA INVERSE RENDERING

Figure 9 shows the qualitative results of multi-view shape fitting. In this experiment, we follow the setup of the "fit a mesh with texture via rendering" tutorial from the official PyTorch3D tutorials Ravi et al. (2022a). First, a standard graphics renderer is used to render the cow CAD model from 20 different viewpoints under a fixed lighting condition; these renderings are used as the optimization targets. For both the baseline and our method, we start from a sphere geometry with 2562 vertices and optimize toward the target images using the same configuration, e.g., iterations, learning rate, optimizer, and loss function. During the shape optimization process, we compute the MSE loss on both silhouettes and RGB values between the synthesized images and the targets. The vertex locations and colors are updated by gradient descent with an ADAM optimizer Kingma & Ba (2014). We conduct the optimization for 2000 iterations, and in each iteration we randomly select 5 out of the 20 images to conduct the optimization. In Figure 9 (e) and (f), we use the normal consistency, edge, and Laplacian losses Nealen et al. (2006) to constrain the object geometry, while in (d) no additional loss is used. From the results, we can see that VoGE has a competitive ability regarding shape fitting via deformation. Specifically, VoGE gives better color prediction and a smoother object boundary. Also, we observe that the current geometry constraint losses do not significantly contribute to our final prediction. We argue that those losses are designed for triangular surface meshes and are not suitable for Gaussian ellipsoids. The design of geometry constraints that are suitable for Gaussian ellipsoids is an interesting topic but beyond the scope of this paper.

5. CONCLUSION

In this work, we propose VoGE, a differentiable volume renderer using Gaussian ellipsoids. Experiments on in-wild object pose estimation and neural view matching show that VoGE has an extraordinary ability when applied to neural features compared to current well-known differentiable generic renderers. The texture extraction and rerendering experiments show the ability of VoGE in feature and texture sampling, which potentially benefits downstream tasks. Overall, VoGE demonstrates better differentiability, which benefits vision tasks, while retaining a competitive rendering speed.

A ADDITIONAL DETAILS OF VOGE RENDERER

In this section we provide more detailed discussion for the math of ray tracing volume densities in VoGE (section A.1 and A.2), coarse-to-fine rendering strategy (section A.3), and the converters (section A.4).

A.1 RAY TRACING

In this section, we provide the detailed derivation of Equation 6 in the main text. First, let us recall the formula of ray tracing volume densities Kajiya & Von Herzen (1984):

$$C(r) = \int_{t_n}^{t_f} T(t)\,\rho(r(t))\,c(r(t))\,dt, \quad \text{where } T(t) = \exp\left(-\tau \int_{t_n}^{t} \rho(r(s))\,ds\right) \tag{11}$$

where $T(t)$ is the occupancy function along the viewing ray $r(t)$, as we describe in Equation 5 in the main text:

$$r(t) = D * t \tag{12}$$

where $D$ is the normalized direction vector of the viewing ray. Also, as we describe in Section 3.2, we reconstruct the volume density function $\rho(r(t))$ via the sum of a set of ellipsoidal Gaussians:

$$\rho(X) = \sum_{k=1}^{K} \frac{1}{2\pi \cdot \|\Sigma_k\|_2} e^{-\frac{1}{2}(X - M_k)^T \cdot \Sigma_k^{-1} \cdot (X - M_k)} \tag{13}$$

where $K$ is the total number of Gaussian kernels and $X = (x, y, z)$ is an arbitrary location in the 3D volume. $M_k$ is the center of the $k$-th ellipsoidal Gaussian kernel:

$$M_k = (\mu_{k,x}, \mu_{k,y}, \mu_{k,z}) \tag{14}$$

whereas $\Sigma_k$ is the spatial variance matrix:

$$\Sigma_k = \begin{bmatrix} \sigma_{k,xx} & \sigma_{k,xy} & \sigma_{k,xz} \\ \sigma_{k,yx} & \sigma_{k,yy} & \sigma_{k,yz} \\ \sigma_{k,zx} & \sigma_{k,zy} & \sigma_{k,zz} \end{bmatrix} \tag{15}$$

Note that $\Sigma_k$ is a symmetric matrix, e.g., the covariance $\sigma_{k,xy} = \sigma_{k,yx}$.

Occupancy Function. Based on Equations 13 and 11, $T(t)$ can be computed via:

$$T(t) = \exp\left(-\tau \int_{t_n}^{t} \rho(r(s))\,ds\right) = \exp\left(-\tau \int_{t_n}^{t} \sum_{k=1}^{K} \frac{1}{2\pi \cdot \|\Sigma_k\|_2} e^{-\frac{1}{2}(sD - M_k)^T \cdot \Sigma_k^{-1} \cdot (sD - M_k)}\,ds\right) \tag{16}$$

Now, let $M_k = l_k D + V_k$, where $l_k$ is a length along the viewing ray and $V_k = M_k - l_k D$ is the vector from the location $l_k D$ on the ray to the vertex $M_k$ (we will discuss a solution for $V_k$ and $l_k$ later). Since $sD - M_k = (s - l_k) D - V_k$, Equation 16 can be expanded as:

$$T(t) = \exp\left(-\tau \int_{t_n}^{t} \sum_{k=1}^{K} \frac{1}{2\pi \cdot \|\Sigma_k\|_2}\, e^{-\frac{1}{2}(s - l_k)^2 D^T \cdot \Sigma_k^{-1} \cdot D}\, e^{\frac{1}{2}(s - l_k)\left(V_k^T \cdot \Sigma_k^{-1} \cdot D + D^T \cdot \Sigma_k^{-1} \cdot V_k\right)}\, e^{-\frac{1}{2} V_k^T \cdot \Sigma_k^{-1} \cdot V_k}\,ds\right) \tag{17}$$

In order to further simplify $T(t)$, we take the $V_k$ that makes the cross term vanish:

$$V_k^T \cdot \Sigma_k^{-1} \cdot D + D^T \cdot \Sigma_k^{-1} \cdot V_k = 0 \tag{18}$$

which can be solved using $V_k = M_k - l_k D$:

$$\begin{aligned} (M_k - l_k D)^T \cdot \Sigma_k^{-1} \cdot D + D^T \cdot \Sigma_k^{-1} \cdot (M_k - l_k D) &= 0 \\ M_k^T \cdot \Sigma_k^{-1} \cdot D + D^T \cdot \Sigma_k^{-1} \cdot M_k - 2 l_k D^T \cdot \Sigma_k^{-1} \cdot D &= 0 \\ l_k &= \frac{M_k^T \cdot \Sigma_k^{-1} \cdot D + D^T \cdot \Sigma_k^{-1} \cdot M_k}{2 \cdot D^T \cdot \Sigma_k^{-1} \cdot D} \end{aligned} \tag{19}$$

Note that $l_k$ is also the length that gives the maximum density $\rho_k(r(t))$ along the ray for the $k$-th kernel. To prove this, we compute:

$$\frac{\partial}{\partial t} \rho_k(r(t)) = \frac{\partial}{\partial t}\, \frac{1}{2\pi \cdot \|\Sigma_k\|_2} e^{-\frac{1}{2}(tD - M_k)^T \cdot \Sigma_k^{-1} \cdot (tD - M_k)} = \frac{1}{4\pi \cdot \|\Sigma_k\|_2} \left(M_k^T \Sigma_k^{-1} D + D^T \Sigma_k^{-1} M_k - 2t D^T \Sigma_k^{-1} D\right) e^{-\frac{1}{2}(tD - M_k)^T \cdot \Sigma_k^{-1} \cdot (tD - M_k)} \tag{20}$$

Since the exponential factor is strictly positive, the solution of $\frac{\partial}{\partial t} \rho_k(r(t)) = 0$ is:

$$t = \frac{M_k^T \cdot \Sigma_k^{-1} \cdot D + D^T \cdot \Sigma_k^{-1} \cdot M_k}{2 \cdot D^T \cdot \Sigma_k^{-1} \cdot D} = l_k \tag{21}$$

Now the density function of the $k$-th ellipsoid along the viewing ray $r(s)$ gives a 1D Gaussian function:

$$\rho_k(r(s)) = \frac{e^{-\frac{1}{2} V_k^T \cdot \Sigma_k^{-1} \cdot V_k}}{2\pi \cdot \|\Sigma_k\|_2}\, e^{-\frac{1}{2}(s - l_k)^2 D^T \cdot \Sigma_k^{-1} \cdot D} = \frac{1}{2\pi \cdot \|\Sigma_k\|_2} \cdot \exp\left(q_k - \frac{(s - l_k)^2}{2 \cdot \sigma_k^2}\right) \tag{22}$$

where $q_k = -\frac{1}{2} V_k^T \cdot \Sigma_k^{-1} \cdot V_k$ and $\frac{1}{\sigma_k^2} = D^T \cdot \Sigma_k^{-1} \cdot D$. Thus, when tracing along each ray, we only need to record $l_k$, $q_k$ and $\sigma_k$ for each ellipsoid.

Figure 10 (caption, truncated): $W(t) = \int_{-\infty}^{t} T(s) \cdot \rho(r(s))\,ds$, $W'(t) = \int_{-\infty}^{t} T'(s) \cdot \rho(r(s))\,ds$, and the difference $W(t) - W'(t)$. Since we use the infinite integral over $t$, only the error at the end of the $t$ axis needs to be considered. (c) shows the accumulated $W(t) - W'(t)$ using different $\sigma = D^T \cdot \Sigma^{-1} \cdot D$. Interestingly, the final error gives a fixed value that is independent of $\sigma$. Note that the final error is 0.0256, which can be ignored when compared to the integral result $W = 1$.

A.2 BLENDING VIA VOLUME DENSITY

Since $q_k$ is independent of $t$, Equation 16 can be further simplified:

$$T(t) = \exp\left(-\tau \sum_{k=1}^{K} e^{q_k} \int_{-\infty}^{t} \frac{1}{2\pi \cdot \|\Sigma_k\|_2} e^{-(s - l_k)^2 / \sigma_k^2}\,ds\right) = \exp\left(-\tau \sum_{k=1}^{K} e^{q_k} \frac{\operatorname{erf}((t - l_k)/\sigma_k) + 1}{2}\right) \tag{23}$$

where erf is the error function, which current computation platforms, e.g., PyTorch and SciPy, have already implemented.

Scattering Equation. Now we compute the final color observation $C(r)$. As we describe in Section 3.2, we assume each kernel has a homogeneous color $c_k$. Thus, here we compute:

$$W_k(t) = \int_{t_n}^{t_f} T(t)\,\rho(r(t))\,dt = \int_{-\infty}^{\infty} T(t) \sum_{k=1}^{K} \frac{1}{2\pi \cdot \|\Sigma_k\|_2} e^{-\frac{1}{2}(X - M_k)^T \cdot \Sigma_k^{-1} \cdot (X - M_k)}\,dt \tag{24}$$

where $X = tD$.
Similar to the previous simplifications, we use $q_k$ and $l_k$ to replace $M_k$ in Equation 24:

$$W_k(t) = \sum_{k=1}^{K} e^{q_k} \int_{-\infty}^{\infty} T(t)\, \frac{1}{2\pi \cdot \|\Sigma_k\|_2} e^{-(t - l_k)^2 / \sigma_k^2}\,dt \tag{25}$$

Since the error function is already a complex function, it is infeasible to compute the integral involving $T(t)$ in closed form. We propose an approximate solution in which we use $T(l_k)$ to replace $T(t)$ inside the integral. The final closed-form solution for $W_k(t)$ is then computed by:

$$W_k(t) = \sum_{k=1}^{K} T(l_k)\, e^{q_k} \int_{-\infty}^{\infty} \frac{1}{2\pi \cdot \|\Sigma_k\|_2} e^{-(t - l_k)^2 / \sigma_k^2}\,dt = \sum_{k=1}^{K} T(l_k)\, e^{q_k} \tag{26}$$

Because of the complexity of computing the integral of the erf function, here we show empirically that in practice such an approximation gives high enough accuracy. To simplify the problem, we study the case in which the volume contains only a single Gaussian ellipsoid kernel. We further suggest that in multi-kernel cases, the errors between different kernels introduced by the approximation will be lower, because the $m$-th kernel has a low $\rho_m(r(t))$ at $l_k$, which makes the corresponding $T(t)$ flatter, and thus the approximation fits better. As Figure 10 (a) shows, we plot the density function along the ray. Specifically, we sample 10k points on the ray, and for each point, we plot its density, the real occupancy, and the approximate occupancy. Figure 10 (b) shows the real weight $W$, which is computed via the cumulative sum along the ray, and the approximate weight $W'$, which is computed via our proposed approximate closed-form solution. We also show the difference between $W'$ and $W$ with the green line, which is significantly smaller than $W$. Interestingly, as Figure 10 (c) shows, we find that the error $W(t) - W'(t)$ is independent of $\Sigma^{-1}$ and $D$ and always converges to the same value: 0.0256. Though we cannot give a mathematical explanation regarding this phenomenon, we argue that the result is already enough to draw the conclusion that such an approximation gives satisfying accuracy.
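The single-kernel comparison can be reproduced with a short numerical experiment. The sketch below is a hedged reading of that setup, not the authors' script: it assumes one kernel with $l_k = 0$ and $q_k = 0$, $\tau = 1$, and a density scaled so that its running integral is exactly $(\operatorname{erf}(t/\sigma) + 1)/2$. Under these assumptions the substitution $u = \int_{-\infty}^{t}\rho$ gives the exact weight $\int_0^1 e^{-u}\,du = 1 - e^{-1}$ for any $\sigma$, while the approximation gives $T(l_k) = e^{-1/2}$, so the gap is $1 - e^{-1} - e^{-1/2} \approx 0.0256$, which is consistent with the reported constant.

```python
import math

def exact_weight(sigma, tau=1.0, n_steps=20000, half_width=8.0):
    """Midpoint-rule integral of T(t) * rho(t) for one kernel with l = 0 and q = 0.

    rho is scaled so that its running integral equals (erf(t / sigma) + 1) / 2.
    """
    lo = -half_width * sigma
    dt = 2.0 * half_width * sigma / n_steps
    total = 0.0
    for step in range(n_steps):
        t = lo + (step + 0.5) * dt
        rho = math.exp(-(t / sigma) ** 2) / (sigma * math.sqrt(math.pi))
        occupancy = math.exp(-tau * (math.erf(t / sigma) + 1.0) / 2.0)
        total += occupancy * rho * dt
    return total

def approximate_weight(tau=1.0):
    """Closed-form T(l_k) * e^{q_k} with q_k = 0; erf(0) = 0 gives exp(-tau / 2)."""
    return math.exp(-tau / 2.0)


gaps = {sigma: exact_weight(sigma) - approximate_weight() for sigma in (0.5, 1.0, 3.0)}
```

Every entry of `gaps` comes out at about 0.0256 regardless of `sigma`, which would explain the independence from $\Sigma^{-1}$ and $D$ if the experiment behind Figure 10 matches these assumptions.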

A.3 COARSE-TO-FINE RENDERING WITH KERNEL SELECTION

As we discussed in Section 3.3 in the main text, in order to efficiently render Gaussian ellipsoids, we design the coarse-to-fine rendering strategy. Specifically, we gradually reduce the number of ellipsoids that interact with viewing rays. Following PyTorch3D, we develop a optional coarse rasterization stage, which select 10% of all ellipsoids and feed them into the ray tracing stage. Specifically, we project the center of each ellipsoid onto the screen coordinate via standard object-to-camera transformation, then for each ellipsoids, we compute the height b h and width b w of a maximum bounding box of the ellipsoids in 2D screen coordinate. The height and width are computed via: [b h b w .] = log(-η) d z • Ω • Σ -1 • Ω (27) where d z is the distance from camera to the center of ellipsoid, η is the threshold for maximum volume density, Ω is the projection matrix from camera coordinate to screen: Ω =   2•F h 0 0 0 2•F w 0 0 0 1   (28) Then we rasterize the bounding boxes to produce a pixel-to-kernels assignment in a low resolution (8 times smaller compared to the image size), which indicates the set of ellipsoid kernels for each pixel to trace. Similarly, the ray tracing stage is also select only part of all Gaussian ellipsoids to feed into the blending stage. When conducting ray tracing, we only trace K ′ nearest kernels that has non-trivial contributions regarding its final weight W k . Specifically, we first record all ellipsoids that gives a maximum density e q k > η. For all the recorded kernels, we sort them via the length to the 1D Gaussian center l k and select K ′ nearest ellipsoids. In the experiment, we find K ′ has a significant impact on the quality of rendered images, while the threshold η has relatively low impact, but needs to be fit with K ′ . Here we provide default settings that give a satisfying quality with low computation cost: K ′ = 20, η = 0.01. Figure 11 shows the rendered cuboids using different K ′ and η. 
The results demonstrate that an inadequate $K'$ leads to dark regions around the boundaries of kernels, which we believe is caused by the hard cutoff at the boundary. On the other hand, decreasing the threshold $\eta$ makes the object denser (less transparent), but requires more kernels (a higher $K'$) to avoid the artifacts.
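
The per-ray kernel selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the CUDA implementation: the function name `select_kernels` and its list-based inputs are hypothetical, and the per-ray peak densities and center distances are assumed to be computed beforehand.

```python
from typing import List


def select_kernels(max_density: List[float],
                   center_dist: List[float],
                   eta: float = 0.01,
                   k_prime: int = 20) -> List[int]:
    """Return indices of at most k_prime kernels for one ray, nearest first.

    max_density[k] plays the role of e^{q_k} (peak density of kernel k on the ray);
    center_dist[k] plays the role of l_k (distance to the 1D Gaussian center).
    """
    # Step 1: record kernels whose peak density along the ray exceeds eta.
    recorded = [k for k, rho in enumerate(max_density) if rho > eta]
    # Step 2: sort the recorded kernels by their distance l_k along the ray.
    recorded.sort(key=lambda k: center_dist[k])
    # Step 3: keep only the K' nearest kernels for the blending stage.
    return recorded[:k_prime]


# Usage with four hypothetical kernels and K' = 2.
picked = select_kernels(max_density=[0.9, 0.005, 0.4, 0.02],
                        center_dist=[3.0, 1.0, 2.0, 5.0],
                        eta=0.01, k_prime=2)
```

In this toy case kernel 1 is the closest but is discarded by the density threshold, which illustrates why $\eta$ and $K'$ interact: a lower $\eta$ admits more faint kernels that then compete for the $K'$ slots.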

A.4 MESH & POINT CLOUD CONVERTER

We develop a simple mesh converter, which converts triangular meshes into isotropic Gaussian ellipsoids, and a point cloud converter. In the mesh converter, we retain all original vertices of the mesh and compute $\Sigma_k$ using the distance between each vertex and its connected neighbors. Specifically, for each vertex, we compute the average length $d_k$ of the edges connected to that vertex. Then $\Sigma_k$ is computed via:

$$\Sigma_k = \begin{bmatrix} \sigma_k & 0 & 0 \\ 0 & \sigma_k & 0 \\ 0 & 0 & \sigma_k \end{bmatrix}$$

where $\sigma_k$ is computed from the coverage rate $\zeta$ and $d_k$:

$$\sigma_k = \frac{(d_k / 2)^2}{\log(1/\zeta)}$$

Similarly, in the point cloud converter, $\Sigma_k$ is controlled by the same function, but $d_k$ is determined by the distance to the $m$ nearest points of the target point. Since the current mesh converter does not consider the shape of the triangles, we think it could be improved by converting each triangle into an anisotropic Gaussian ellipsoid, which we are still working on.
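
The two converters can be sketched as follows, reading the formula above as $\sigma_k = (d_k/2)^2 / \log(1/\zeta)$, which makes $\sigma_k$ grow with the coverage rate. The function names are hypothetical, the brute-force neighbor search is only for illustration, and averaging the $m$ nearest distances in the point cloud case is an assumption of this sketch.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def sigma_from_spacing(d_k: float, zeta: float) -> float:
    """sigma_k = (d_k / 2)^2 / log(1 / zeta), valid for 0 < zeta < 1."""
    if not 0.0 < zeta < 1.0:
        raise ValueError("coverage rate zeta must lie strictly between 0 and 1")
    return (d_k / 2.0) ** 2 / math.log(1.0 / zeta)


def mesh_to_sigmas(verts: Sequence[Point],
                   faces: Sequence[Tuple[int, int, int]],
                   zeta: float = 0.5) -> List[float]:
    """Per-vertex sigma_k from the average length of the incident edges."""
    neighbors = [set() for _ in verts]
    for a, b, c in faces:
        for i, j in ((a, b), (b, c), (c, a)):
            neighbors[i].add(j)
            neighbors[j].add(i)
    sigmas = []
    for i, nbrs in enumerate(neighbors):
        if not nbrs:
            raise ValueError(f"vertex {i} has no incident edge")
        d_k = sum(math.dist(verts[i], verts[j]) for j in nbrs) / len(nbrs)
        sigmas.append(sigma_from_spacing(d_k, zeta))
    return sigmas


def points_to_sigmas(points: Sequence[Point], m: int = 3,
                     zeta: float = 0.5) -> List[float]:
    """Per-point sigma_k from the mean distance to the m nearest points.

    Averaging the m distances is an assumption; the search is brute force, O(N^2).
    """
    sigmas = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:m]
        if not nearest:
            raise ValueError("need at least two points")
        sigmas.append(sigma_from_spacing(sum(nearest) / len(nearest), zeta))
    return sigmas


# Usage: a single right triangle with legs of length 2.
verts = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
faces = [(0, 1, 2)]
sigmas = mesh_to_sigmas(verts, faces, zeta=0.5)
```

Each returned value is the diagonal entry of an isotropic $\Sigma_k = \sigma_k I$; the kernel centers are simply the original vertices or points.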

B.1 RENDERING ANISOTROPIC GAUSSIAN ELLIPSOIDS

As Figure 12 shows, the VoGE rendering pipeline natively supports anisotropic ellipsoidal Gaussian kernels, where the spatial variance of each kernel is represented by the 3 × 3 symmetric matrix $\Sigma_k$. Note that each spatial covariance, e.g., $\sigma_{k,xy}$, cannot exceed the square root of the product of the two corresponding variances, e.g., $\sqrt{\sigma_{k,xx}\sigma_{k,yy}}$; otherwise, the kernel becomes a hyperbola instead of an ellipsoid (as the last row in Figure 12 shows). On the other hand, ellipsoidal Gaussian kernels can also approximate 2D Gaussian ellipses (the representation used in DSS Yifan et al. (2019)), which can be done simply by setting $\det(\Sigma) \to 0$, where $\det$ denotes the matrix determinant. Figure 13 shows the rendering results using flattened Gaussian ellipsoids. As demonstrated in the third row, the VoGE rendering pipeline allows rendering such surface-like representations in a stable manner.
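
A quick way to test whether a candidate $\Sigma_k$ describes an ellipsoid is to check that it is symmetric positive definite. The sketch below uses Sylvester's criterion (all leading principal minors positive); the helper names and example matrices are illustrative and not part of VoGE. Note that the pairwise bound on the covariances stated above is necessary but not sufficient on its own, as the third example shows.

```python
from typing import Sequence

Matrix3 = Sequence[Sequence[float]]


def det3(m: Matrix3) -> float:
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def is_ellipsoid(sigma: Matrix3, tol: float = 1e-12) -> bool:
    """True if sigma is symmetric positive definite (Sylvester's criterion)."""
    for r in range(3):
        for c in range(r + 1, 3):
            if abs(sigma[r][c] - sigma[c][r]) > tol:
                return False  # not symmetric
    minor1 = sigma[0][0]
    minor2 = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
    minor3 = det3(sigma)
    return minor1 > tol and minor2 > tol and minor3 > tol


# An axis-aligned anisotropic kernel: valid.
axis_aligned = [[1.5, 0.0, 0.0], [0.0, 0.3, 0.0], [0.0, 0.0, 0.3]]
# A covariance larger than sqrt(var_x * var_y): invalid.
too_correlated = [[1.0, 1.2, 1.6], [1.2, 1.0, 0.9], [1.6, 0.9, 1.0]]
# Every pairwise bound holds, yet the matrix is still not positive definite.
pairwise_ok_only = [[1.0, 0.9, -0.9], [0.9, 1.0, 0.9], [-0.9, 0.9, 1.0]]

results = [is_ellipsoid(m) for m in (axis_aligned, too_correlated, pairwise_ok_only)]
```

Flattened kernels with $\det(\Sigma) \to 0$ sit right at the boundary of this test, so in practice the tolerance `tol` decides whether a nearly flat kernel is accepted.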

B.2 RENDERING SURFACE NORMAL

As Figure 14 shows, we render CAD models provided by The Stanford 3D Scanning Repository Curless & Levoy (1996). Specifically, we use our mesh converter to convert the meshes provided by the repository into Gaussian ellipsoids.

Figure 16 shows the surface normal rendering quality of VoGE using different numbers of Gaussians. We also include a comparison of the rendering quality of VoGE vs. the PyTorch3D mesh renderer. In each image, we use the same number of Gaussians as mesh vertices, which gives a similar number of parameters: $9 \cdot N_{\text{Gauss}}$ vs. $3 \cdot N_{\text{verts}} + 3 \cdot N_{\text{faces}}$. We observe that increasing the number of Gaussians significantly improves the rendering quality. Admittedly, the VoGE renderer gives slightly fuzzier boundaries compared to the mesh renderer.
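
The parameter comparison can be checked with simple arithmetic. The sketch below assumes a closed, genus-0 triangle mesh, for which Euler's formula gives $N_{\text{faces}} = 2 N_{\text{verts}} - 4$; the scanned meshes used in the experiment need not satisfy this exactly. One natural reading of the factor 9 is three values for the kernel center plus the six unique entries of the symmetric $\Sigma_k$, but this breakdown is an interpretation and is not spelled out in the text.

```python
def gaussian_param_count(n_gauss: int) -> int:
    """Parameter count 9 * N_Gauss for a set of Gaussian ellipsoids."""
    return 9 * n_gauss


def mesh_param_count(n_verts: int, n_faces: int) -> int:
    """Parameter count 3 * N_verts + 3 * N_faces for a triangle mesh."""
    return 3 * n_verts + 3 * n_faces


# Hypothetical example: 1000 vertices on a closed genus-0 triangle mesh.
n_verts = 1000
n_faces = 2 * n_verts - 4  # Euler's formula, assumption of this sketch

n_gauss_params = gaussian_param_count(n_verts)
n_mesh_params = mesh_param_count(n_verts, n_faces)
```

Under this assumption the two counts differ only by a constant (12 values), which is consistent with the statement that matching the number of Gaussians to the number of vertices yields a similar parameter budget.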

B.4 LIGHTING WITH EXTERNAL NORMALS

Although Gaussian ellipsoids do not contain surface normal information (since they are represented as volumes), VoGE can still utilize surface normals by processing them as an extra attribute in an external channel, as we describe in Section B.2. Once the surface normals are rendered, the light diffusion method of a traditional shader can be used to integrate lighting information into the VoGE rendering pipeline. Figure 15 shows the results of integrating lighting information when rendering the Stanford bunny mesh using VoGE. Specifically, we first render the surface normals computed via PyTorch3D into an image-like map (the same process as in Section B.2). Then we use the diffuse function (PyTorch3D.renderer.lighting) to compute the brightness of the rendered bunny under a point light. In the visualization, we place the light source at various locations, while using a fully white texture on the bunny.
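
For intuition, the diffuse term applied to a rendered normal map can be sketched per pixel with the standard Lambertian model. This is a plain-Python illustration and not PyTorch3D's implementation; the function names are hypothetical, and light color, attenuation and ambient terms are omitted.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def _normalize(v: Vec3) -> Vec3:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return (v[0] / norm, v[1] / norm, v[2] / norm)


def diffuse_brightness(normal: Vec3, surface_point: Vec3,
                       light_position: Vec3) -> float:
    """Lambertian brightness max(0, n . l) for a point light.

    normal is the (rendered) surface normal at the pixel, surface_point the
    corresponding 3D location, and light_position the point light location.
    """
    n = _normalize(normal)
    to_light = _normalize((light_position[0] - surface_point[0],
                           light_position[1] - surface_point[1],
                           light_position[2] - surface_point[2]))
    cosine = sum(a * b for a, b in zip(n, to_light))
    # Surfaces facing away from the light receive no diffuse light.
    return max(0.0, cosine)


# Usage: a surface at the origin facing +z, lit from three positions.
facing = diffuse_brightness((0.0, 0.0, 1.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
oblique = diffuse_brightness((0.0, 0.0, 1.0), (0.0, 0.0, 0.0), (1.0, 0.0, 1.0))
behind = diffuse_brightness((0.0, 0.0, 1.0), (0.0, 0.0, 0.0), (0.0, 0.0, -1.0))
```

With a fully white texture, as in the visualization, this brightness value directly gives the pixel intensity.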

B.5 RENDERING POINT CLOUDS

Figure 17 shows the point cloud rendering results using VoGE and PyTorch3D. We follow the "Render a colored point cloud" PyTorch3D official tutorial Ravi et al. (2022b). Specifically, we use the PittsburghBridge point cloud provided by PyTorch3D, which contains 438544 points with an RGB color for each point. We first convert the point cloud into Gaussian ellipsoids using the method described in A.4. Then we render the Gaussian ellipsoids using the same configuration, except for the camera: the tutorial uses an orthogonal camera, which we currently do not support, so we replace it with a PerspectiveCamera with a similar viewing scope. The qualitative results demonstrate that VoGE achieves better quality with smoother boundaries.

C ADDITIONAL EXPERIMENT RESULTS

C.1 IN-WILD OBJECT POSE ESTIMATION

Ablation Study. As Table 4 shows, we conduct controlled experiments to validate the effects of different geometric primitives. Using the method we described in 3.2, we develop tools that convert triangle meshes to Gaussian ellipsoids, where a tunable parameter, the coverage rate, controls the intersection rate between nearby Gaussian ellipsoids. Specifically, a higher coverage rate gives a larger $\Sigma$, which makes the features smoother but also fuzzier, and vice versa. As the results demonstrate, increasing $\Sigma$ improves the coarse accuracy under the $\frac{\pi}{6}$ threshold, while reducing it improves the performance under the stricter evaluation threshold. We also ablate the effect of blocking part of the gradient in Equation 8. Specifically, we conduct two experiments in which, for all kernels, we block the gradient on $T(l_k)$ and on $e^{q_k}$, respectively. The results show that blocking either term has a significant negative impact on the final performance.

Additional Results. Table 5 shows the per-category object pose estimation results on the PASCAL3D+ dataset (L0). All NeMo Wang et al. (2020a) baseline results and ours are obtained using the single cuboid setting described in NeMo. Specifically, the Gaussian ellipsoids used in VoGE are converted from the same single cuboid mesh models provided by NeMo (coverage rate $\zeta = 0.5$). Figure 18 shows additional qualitative results of object pose estimation. In the visualization, we use a standard graphics renderer to render the original CAD models provided by the PASCAL3D+ dataset under the predicted pose, and superimpose the rendered object onto the input image.
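
The evaluation metrics referred to above (accuracy under an angular threshold and median error) can be sketched as follows. The rotation error is taken here as the geodesic angle between the predicted and ground-truth rotation matrices, which is a common choice and an assumption of this sketch; the function names are hypothetical, and the stricter threshold $\frac{\pi}{18}$ in the usage example is only illustrative.

```python
import math
import statistics
from typing import Sequence

Matrix3 = Sequence[Sequence[float]]


def rotation_error(r_pred: Matrix3, r_gt: Matrix3) -> float:
    """Geodesic angle in radians between two 3 x 3 rotation matrices."""
    # trace(R_pred^T R_gt) equals the elementwise (Frobenius) inner product.
    trace = sum(r_pred[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for safety
    return math.acos(cos_angle)


def accuracy(errors_rad: Sequence[float], threshold_rad: float) -> float:
    """Fraction of samples whose rotation error is below the threshold."""
    return sum(e < threshold_rad for e in errors_rad) / len(errors_rad)


def median_error_deg(errors_rad: Sequence[float]) -> float:
    """Median rotation error, reported in degrees."""
    return math.degrees(statistics.median(errors_rad))


# Usage with four hypothetical per-image errors (given in degrees).
errors = [math.radians(d) for d in (5.0, 20.0, 40.0, 100.0)]
acc_coarse = accuracy(errors, math.pi / 6)   # threshold named in the text
acc_strict = accuracy(errors, math.pi / 18)  # illustrative stricter threshold
med_err = median_error_deg(errors)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
quarter_turn_z = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
quarter_turn_error = rotation_error(identity, quarter_turn_z)
```

A fuzzier feature map (larger $\Sigma$) would be expected to help mainly the coarse accuracy, while the strict accuracy and the median error are more sensitive to fine localization, which matches the trade-off reported in the ablation.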



Figure 1: VoGE conducts ray tracing of volume densities. Given the Gaussian ellipsoids, i.e., a set of ellipsoidal 3D Gaussian reconstruction kernels, VoGE first samples rays $r(t)$. Along each ray, VoGE traces the density distribution of each ellipsoid, $\rho_k(r(t))$. The occupancy $T(r(t))$ is then accumulated via density aggregation along the ray. The observation of each Gaussian ellipsoid kernel, $W_k$, is computed by integrating the reweighted per-kernel volume density $W_k(r(t))$. Finally, VoGE synthesizes the image using the computed $W_k$ at each pixel to interpolate per-kernel attributes. In practice, the density aggregation is bootstrapped via approximate closed-form solutions.

Figure 2: Rendering with increasing numbers of Gaussian ellipsoids. Top: the kernel-to-pixel weights along the median row of the image; the colors indicate the corresponding Gaussian ellipsoids. Bottom: the rendered RGB image. Note that VoGE resolves occlusion naturally in a contiguous way.

Figure 3: Computing the gradient of M when rendering two ellipsoids. The colored numbers below indicate the M of each ellipsoid. The red arrow and $G_x$, $G_y$ show the ∂(I-Î) 2

Figure 4: The forward process for VoGE rendering. The camera is described by the extrinsic matrix E, composed of R and T, and the intrinsic matrix I, composed of F and $O_x$, $O_y$. Given the Gaussian ellipsoids, the VoGE renderer synthesizes an image O.

Figure 6: Qualitative object pose estimation results on the PASCAL3D+ dataset. We visualize the predicted object poses from NeMo+VoGE and standard NeMo. Specifically, we use a standard mesh renderer to render the original CAD model under the predicted pose and superimpose it onto the input image.

Figure 7: Sampling texture and rerendering from a novel view. The inputs include a single RGB image and the Gaussian ellipsoids with the corresponding pose. Note that the result is produced without any training or symmetry information.

significant performance improvement using VoGE compared to Soft Rasterizer Liu et al. (2019), DSS Yifan et al. (2019) and PyTorch3D Ravi et al. (

Figure 8: Reasoning about multi-object occlusions for single-view optimization of object locations. Left: initialization and target locations. Middle: target images; note that each target image is generated by the corresponding rendering method. Right: results. Videos: VoGE, SoftRas.

Figure 9: Shape fitting results with 20 multi-view images, following the PyTorch3D Ravi et al. (2020) official tutorial.

Figure 10: Approximate computation of the integral along the viewing ray for a single kernel. (a) $T(t)$ is the real occupancy function along the ray; $T'(t)$ means we use the occupancy at $l_k$ of $\rho(r(t))$, since $\rho(r(t))$ is mainly concentrated near $l_k$. (b) shows W (t) =

Figure 11: Rendering a cuboid using different $K'$ and $\eta$. (a) to (d) show the rendering results using different $K'$, with the threshold fixed at $\eta = 0.01$. (e) to (h) show the rendered cuboid with different $\eta$, with $K' = 20$ fixed.

Figure 12: Rendering a single anisotropic ellipsoidal Gaussian. The left column shows the $\Sigma$ of the kernel. We render the kernel under 3 different viewpoints: 0° azimuth, 45° azimuth, and 90° azimuth.

Figure 14: Rendering the surface normal using VoGE under 8 different viewpoints. The Gaussian ellipsoids are converted from meshes provided by The Stanford 3D Scanning Repository.

Figure 15: Lighting the rendered mesh using external normals. We first render the surface normals of the bunny mesh using VoGE. Then we use the light diffusion functions provided by PyTorch3D to light the rendered surface normals. For each image, we place a point light source in the object space using different elevations (e) and azimuths (a). The distance from the light source to the object center is fixed at 1.

Figure 16: Comparison of rendering quality with number of primitives using VoGE vs PyTorch3D hard mesh renderer.

Figure 18: Additional qualitative in-wild object pose estimation results for NeMo+VoGE and NeMo+PyTorch3D.

(a) VoGE with all constraints. (b) VoGE without constraints. (c) PyTorch3D with all constraints.

Figure 20: Losses in the shape fitting experiment. Note that for VoGE without constraints, we only calculate the geometry losses but do not compute gradients from those losses.

Figure 21: Additional results for the texture extraction experiment on the car, bus and boat categories of the PASCAL3D+ dataset. We extract the texture using an in-wild image and Gaussian ellipsoids with the corresponding viewpoint, and render under a novel viewpoint.

Comparison with state-of-the-art differentiable renderers. Similar to NeRF but different from previous graphics renderers, VoGE uses ray tracing to record volume densities on each ray for each ellipsoid, and blends them with transmittance computed via volume densities.

Pose estimation results on the PASCAL3D+ and the Occluded PASCAL3D+ datasets. Occlusion level L0 denotes the original images from PASCAL3D+, while occlusion levels L1 to L3 denote the occluded PASCAL3D+ images with increasing occlusion ratios. NeMo is an object pose estimation pipeline based on neural feature-level render-and-compare. We compare the object pose estimation performance using different renderers, i.e., VoGE, Soft Rasterizer, DSS, and PyTorch3D (which is used in NeMo originally).

NeMo+PyTorch3D    86.1 76.0 63.9 46.8 61.0 46.3 32.0 17.1 8.8 13.6 20.9 36.5
NeMo+VoGE (Ours)  90.1 83.1 72.5 56.0 69.2 56.1 41.5 24.8 6.9 9.9 15.0 26.3

Pose estimation results on the ObjectNet3D dataset. Evaluated via pose estimation accuracy for errors under $\frac{\pi}{6}$ (higher is better).

(Left) Ablation study for object pose estimation on PASCAL3D+. We control the coverage rate $\zeta$ when computing $\Sigma$; a higher $\zeta$ gives larger values in $\Sigma$. "w/o grad $T(r)$" means we block the gradient from $T(r)$, while "w/o grad $\rho(r)$" means the gradient on $e^{q_k}$ in Equation 8 is blocked.

Per-category results for in-wild object pose estimation on PASCAL3D+. Results are reported in Accuracy (percentage, higher is better) and Median Error (degrees, lower is better).

aero bike boat bottle bus car chair table mbike sofa train tv Mean
↑ ACC π


Published as a conference paper at ICLR 2023

C.2 TEXTURE EXTRACTION AND RERENDERING

Figure 21 shows additional texture extraction and rerendering results on car, bus and boat images from the PASCAL3D+ dataset. Interestingly, Figure 21 (g) shows that texture extraction using VoGE demonstrates satisfying generalization ability on such out-of-distribution cases.

C.3 SHAPE FITTING VIA INVERSE RENDERING

Figure 20 shows the losses in the multi-view shape fitting experiment. Specifically, we plot the losses against the optimization iterations using the method provided by the "fit a mesh with texture via rendering" PyTorch3D official tutorial Ravi et al. (2022a). Note that the geometry constraint losses, except the normal consistency loss, remain relatively low in the VoGE experiment without constraints. We think these results demonstrate that optimization with VoGE effectively provides correct gradients toward the optimal solution, such that the tightness of the Gaussian ellipsoids is retained even without geometry constraints. As for the normal consistency loss, since we use volumetric Gaussian ellipsoids, the surface normal directions are no longer informative.

