DIFFUSION GENERATIVE MODELS ON SO(3)

Abstract

Diffusion-based generative models represent the current state-of-the-art for image generation. However, standard diffusion models are based on Euclidean geometry and do not translate directly to manifold-valued data. In this work, we develop extensions of both score-based generative models (SGMs) and Denoising Diffusion Probabilistic Models (DDPMs) to the Lie group of 3D rotations, SO(3). SO(3) is of particular interest in many disciplines such as robotics, biochemistry and astronomy/planetary science. Contrary to more general Riemannian manifolds, SO(3) admits a tractable solution to heat diffusion, and allows us to implement efficient training of diffusion models. We apply both SO(3) DDPMs and SGMs to synthetic densities on SO(3) and demonstrate state-of-the-art results.

1. INTRODUCTION

Deep generative models (DGMs) are trained to learn the underlying data distribution and then generate new samples that match the empirical data. There are several classes of deep generative models, including Generative Adversarial Networks (Goodfellow et al., 2014), Variational Auto-Encoders (Kingma & Welling, 2013) and Normalizing Flows (Rezende & Mohamed, 2015). Recently, a new class of DGMs based on diffusion, such as Denoising Diffusion Probabilistic Models (DDPMs) (Ho et al., 2020) and Score Matching with Langevin Dynamics (SMLD) (Song & Ermon, 2019), a subset of general score-based generative models (SGMs), has achieved state-of-the-art quality in generating images, molecules, audio and graphs (Song et al., 2021). Unlike GANs, diffusion models are usually stable and straightforward to train and suffer less from mode collapse, while generating images of similar quality. In parallel with the success of these diffusion models, Song et al. (2021) demonstrated that both SGMs and DDPMs can mathematically be understood as variants of the same process. In both cases, the data distribution is progressively perturbed by a noise diffusion process defined by a specific Stochastic Differential Equation (SDE), which can then be time-reversed to generate realistic data samples from initial noise samples. While the success of diffusion models has mainly been driven by data with Euclidean geometry (e.g., images), there is great interest in extending these methods to manifold-valued data, which are ubiquitous in many scientific disciplines. Examples include high-energy physics (Brehmer & Cranmer, 2020; Craven et al., 2022), astrophysics (Hemmati et al., 2019), geoscience (Gaddes et al., 2019), and biochemistry (Zelesko et al., 2020).
Very recently, pioneering work has started to develop generic frameworks for defining SGMs on arbitrary compact Riemannian manifolds (De Bortoli et al., 2022) and non-compact Riemannian manifolds (Huang et al., 2022). In this work, instead of considering generic Riemannian manifolds, we are specifically concerned with the Special Orthogonal group in 3 dimensions, SO(3), which corresponds to the Lie group of 3D rotations. Modeling 3D orientations is of particularly high interest in many fields, for instance in robotics (estimating the pose of an object, Hoque et al. 2021) and in biochemistry (finding the conformation angle of molecules that minimizes the binding energy, Mansimov et al. 2019). Contrary to more generic Riemannian manifolds, SO(3) benefits from specific properties, including a tractable heat kernel and efficient geometric ODE/SDE solvers, that allow us to define very efficient diffusion models specifically for this manifold.

Figure 1: A given base distribution (rightmost, denoted by circles) can be evolved under the probability flow ODE (Eq. 3) towards a noisy distribution (leftmost), or vice versa from the noisy distribution back to the target distribution. Each point represents a rotation matrix in SO(3) projected on the sphere according to its canonical axis; the color indicates the tilt around that axis (visualisation adopted from Murphy et al. 2021). An animation of this figure is available online.

The contributions of our paper are summarized as follows:
• We reformulate Euclidean diffusion models on the SO(3) manifold, and demonstrate how the tractable heat kernel solution on SO(3) can be used to recover simple and efficient algorithms on this manifold.
• We provide concrete implementations of both Score-Based Generative Models and Denoising Diffusion Probabilistic Models specialized for SO(3).
• We reach a new state-of-the-art in sample quality on synthetic SO(3) distributions with our proposed SO(3) Score-Based Generative Model.

2. PRELIMINARIES AND NOTATIONS

In this work, we exclusively consider the SO(3) manifold, the Lie group of 3D rotation matrices. We denote by exp : so(3) → SO(3) and log : SO(3) → so(3) the exponential and logarithmic maps that connect SO(3) to its tangent space and Lie algebra so(3). so(3) consists of all skew-symmetric 3×3 matrices, which can be parameterised in terms of a vector in $\mathbb{R}^3$; this vector corresponds in turn to the axis-angle representation of rotation matrices. For any rotation $x \in$ SO(3), its axis-angle representation $\boldsymbol{\omega} \in \mathbb{R}^3$ can be computed as $\boldsymbol{\omega} = \omega v$, with rotation angle $\omega = \arccos\left(\tfrac{1}{2}(\mathrm{tr}(x) - 1)\right) \in [0, \pi]$ and $v = \frac{1}{2 \sin \omega}\,(x_{32} - x_{23},\; x_{13} - x_{31},\; x_{21} - x_{12})$ a unit vector of $\mathbb{R}^3$. We direct the interested reader to Appendix C for more information on representations of SO(3).
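As an illustration of the exponential and logarithmic maps described above, the following NumPy sketch implements them through Rodrigues' formula and the axis-angle expression of this section. The function names (`hat`, `exp_so3`, `log_so3`) and the small-angle thresholds are our own illustrative choices, not part of the paper's implementation. The logarithm as written is not numerically reliable for angles very close to π, where sin ω vanishes.

```python
import numpy as np

def hat(w):
    """Map a vector of R^3 to the corresponding skew-symmetric matrix in so(3)."""
    wx, wy, wz = w
    return np.array([[0.0, -wz, wy],
                     [wz, 0.0, -wx],
                     [-wy, wx, 0.0]])

def exp_so3(w):
    """Exponential map (Rodrigues' formula): axis-angle vector -> rotation matrix."""
    w = np.asarray(w, dtype=float)
    angle = np.linalg.norm(w)
    K = hat(w)
    if angle < 1e-8:
        # Second-order Taylor expansion close to the identity.
        return np.eye(3) + K + 0.5 * K @ K
    return (np.eye(3)
            + np.sin(angle) / angle * K
            + (1.0 - np.cos(angle)) / angle**2 * K @ K)

def log_so3(x):
    """Logarithmic map: rotation matrix -> axis-angle vector (angle below pi)."""
    cos_angle = np.clip(0.5 * (np.trace(x) - 1.0), -1.0, 1.0)
    angle = np.arccos(cos_angle)
    # Antisymmetric part of x, equal to 2 sin(angle) times the unit axis.
    r = np.array([x[2, 1] - x[1, 2], x[0, 2] - x[2, 0], x[1, 0] - x[0, 1]])
    if angle < 1e-8:
        return 0.5 * r  # limit of angle / (2 sin(angle)) as angle -> 0
    return angle / (2.0 * np.sin(angle)) * r
```

A round trip `log_so3(exp_so3(w))` recovers `w` whenever the norm of `w` is below π.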

3. DIFFUSION PROCESS ON SO(3)

Similarly to Euclidean diffusion models (Song et al., 2021), we begin by defining a Brownian noising process that will be used to perturb the data distribution. Let us assume a Stochastic Differential Equation of the following form:

$$dx = f(x, t)\,dt + g(t)\,dw, \qquad (1)$$

where $w$ is a Brownian process on SO(3), $f(\cdot, t)$ : SO(3) → $T_x$SO(3) is a drift term, and $g(\cdot) : \mathbb{R} \to \mathbb{R}$ is a diffusion term. If we sample initial conditions for this SDE at $t = 0$ from a given data distribution $x(0) \sim p_{data}$, we denote by $p_t$ the marginal distribution of $x(t)$ at time $t > 0$. Thus $p_0 = p_{data}$, and at the final time $T$ at which we stop the diffusion process, $p_T$ will typically tend to a known target distribution that is easy to sample from. Just like in the Euclidean case, as demonstrated in De Bortoli et al. (2022), under mild regularity conditions Equation 1 admits a reverse diffusion process on compact Riemannian manifolds such as SO(3), defined by the following reverse-time SDE:

$$dx = \left[f(x, t) - g(t)^2\, \nabla \log p_t(x)\right] dt + g(t)\, d\bar{w}, \qquad (2)$$

where $\bar{w}$ is a reversed-time Brownian motion and the score function $\nabla \log p_t(x) \in T_x$SO(3) is the derivative of the log marginal density of the forward process at time $t$. Corresponding to this reverse-time SDE, one can also define a probability flow ODE (Song et al., 2021):

$$dx = \left[f(x, t) - \tfrac{1}{2}\, g(t)^2\, \nabla \log p_t(x)\right] dt. \qquad (3)$$

This deterministic process is entirely defined once the score is known and maps $p_T$ to any intermediate marginal distribution $\{p_t\}_{0 \le t < T}$ of the forward process, including $p_0$. In particular, it can be seen as the equivalent of Neural ODE-based Continuous Normalizing Flows (CNF, Chen et al., 2018) with an explicit parameterization in terms of the score function. We illustrate this process in Figure 1 with samples from two Gaussian-like blobs on SO(3) being transported reversibly through this ODE between $t = 0$ and $t = T$. While these equations are direct analogs of the Euclidean SDEs and ODE described in Song et al. (2021), defining diffusion generative models on SO(3) differs mainly on the two following points:
• Defining the equivalent of the Gaussian heat kernel on SO(3): this is needed to easily sample from any intermediate $p_t$ without having to simulate an SDE.
• Solving SDEs and ODEs on the manifold: contrary to the Euclidean case, the diffusion process must remain confined to the SO(3) manifold, which requires specific solvers.
We address these two points below before moving on to defining our generative models on SO(3).

3.1. THE ISOTROPIC GAUSSIAN DISTRIBUTION ON SO(3)

In general, the main disadvantage of working on Riemannian manifolds compared to Euclidean space is that they lack a closed-form expression for the heat kernel, i.e., the solution of the diffusion process (which is a Gaussian in Euclidean space). For compact manifolds, the heat kernel is in general only available as an infinite series, which in the case of SO(3) takes the following form (Nikolayev & Savyolov, 1970):

$$f_\epsilon(\omega) = \sum_{\ell=0}^{\infty} (2\ell + 1)\, \exp\left(-\ell(\ell + 1)\,\epsilon^2\right) \frac{\sin\left((\ell + 1/2)\,\omega\right)}{\sin(\omega/2)}, \qquad (4)$$

where $\omega = |\boldsymbol{\omega}| \in [0, \pi]$ is the rotation angle of the axis-angle representation $\boldsymbol{\omega}$ of a given rotation matrix and $\epsilon$ is a concentration parameter. While for $\epsilon > 1$ this series converges quickly ($\ell_{max} = 5$ is sufficient to achieve sub-percent accuracy), the convergence gets slower as $\epsilon$ gets smaller, which makes it impractical for modeling concentrated distributions. Thankfully, this series has been thoroughly studied in the literature, and Matthies et al. (1988) show that an excellent approximation of Equation 4 can be achieved for $\epsilon < 1$ using the following closed-form expression:

$$f_\epsilon(\omega) \simeq \sqrt{\pi}\, \epsilon^{-3/2}\, e^{\frac{\epsilon}{4} - \frac{(\omega/2)^2}{\epsilon}}\; \frac{\omega - e^{-\pi^2/\epsilon}\left[(\omega - 2\pi)\, e^{\pi\omega/\epsilon} + (\omega + 2\pi)\, e^{-\pi\omega/\epsilon}\right]}{2 \sin(\omega/2)}. \qquad (5)$$

Therefore, in practical applications, one can switch between a truncation of Equation 4 for $\epsilon \ge 1$ and the approximation of Equation 5 for $\epsilon < 1$. Because it is a solution of a diffusion process on SO(3), $f_\epsilon$ can be used to define the manifold equivalent of the Euclidean isotropic Gaussian distribution, which we refer to as $\mathcal{IG}_{SO(3)}$, the Isotropic Gaussian on SO(3) (Leach et al., 2022; Ryu et al., 2022), also known in the literature as the normal distribution on SO(3) (Nikolayev & Savyolov, 1970; Matthies et al., 1988). For a given mean rotation $\mu \in$ SO(3) and scale $\epsilon$, the probability density of a rotation $x \in$ SO(3) under $\mathcal{IG}_{SO(3)}(\mu, \epsilon)$ is given by:

$$\mathcal{IG}_{SO(3)}(x; \mu, \epsilon) = f_\epsilon\left(\arccos\left(\tfrac{1}{2}\left(\mathrm{tr}(\mu^T x) - 1\right)\right)\right). \qquad (6)$$

Sampling from $\mathcal{IG}_{SO(3)}(\mu, \epsilon)$ is achieved in practice by inverse transform sampling. The cumulative distribution function over angles, needed to sample with respect to the uniform distribution on SO(3), can be evaluated numerically by integrating $\frac{1 - \cos\omega}{\pi} f_\epsilon(\omega)$ over $[0, \pi]$. To form a rotation matrix $x \sim \mathcal{IG}_{SO(3)}(\cdot\,; \mu, \epsilon)$, one therefore first samples a rotation angle $\omega$ by inverse transform sampling given this CDF, then samples a rotation axis $v$ uniformly on $S^2$, yielding an axis-angle representation $\boldsymbol{\omega} = \omega v$ of a rotation matrix, which is then shifted by the mean of the distribution according to $x = \mu \exp(\boldsymbol{\omega})$.

An important property of $\mathcal{IG}_{SO(3)}(\mu, \epsilon)$, which sets it apart from other distributions on SO(3) (e.g. Bingham, Matrix Fisher, Wrapped Normal; more on this in Appendix E.1), is that it remains closed under convolution, as a direct consequence of being the solution of a diffusion process. The convolution of two centered $\mathcal{IG}_{SO(3)}$ distributions of scale parameters $\epsilon_1$ and $\epsilon_2$ is an $\mathcal{IG}_{SO(3)}$ distribution of scale $\epsilon_1 + \epsilon_2$. We also note two interesting asymptotic behaviors. For large $\epsilon$, the distribution tends to $\mathcal{U}_{SO(3)}$, the uniform distribution on SO(3), while for small $\epsilon$ the distribution $\mathcal{IG}_{SO(3)}(I, \epsilon)$ can locally be approximated in the axis-angle representation of the tangent space by a normal distribution $\mathcal{N}(0, \sigma^2 I)$ in $\mathbb{R}^3$, with $\epsilon = \sigma^2 / 2$.

Algorithm 1 Geometric ODE solver on SO(3) (Heun's method) for $dx = f(x, t)\,dt$
Require: Step size $h$, initial condition $x_0$, time steps $\{t_n\}_{n=0}^{N}$, number of steps $N$
1: for $n \in \{0, \ldots, N - 1\}$ do
2:     $y_1 = h\, f(x_n, t_n)$
3:     $y_2 = h\, f\left(\exp(\tfrac{1}{2} y_1)\, x_n,\; t_n + \tfrac{1}{2} h\right)$
4:     $x_{n+1} = \exp(y_2)\, x_n$
5: end for
6: return $\{x_n\}_{n=0}^{N}$
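The sampling procedure of this section (angle by inverse transform sampling, axis uniform on the sphere, shift by the mean) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the exponent of the series is read as −ℓ(ℓ+1)ε² from Equation 4, the truncation `l_max`, the grid size `n_grid` and all function names are hypothetical choices, and the CDF is tabulated with a simple trapezoid rule.

```python
import numpy as np

def igso3_angle_density(omega, eps, l_max=200):
    """Truncated series of Equation 4 times the Haar factor (1 - cos(omega)) / pi."""
    omega = np.atleast_1d(np.asarray(omega, dtype=float))
    ell = np.arange(l_max + 1)[:, None]
    # Assumption: the exponent is read as -l(l+1) * eps**2 from Equation 4.
    terms = ((2 * ell + 1) * np.exp(-ell * (ell + 1) * eps**2)
             * np.sin((ell + 0.5) * omega) / np.sin(0.5 * omega))
    return (1.0 - np.cos(omega)) / np.pi * terms.sum(axis=0)

def sample_igso3_angles(eps, n_samples, rng, n_grid=2048):
    """Inverse transform sampling of rotation angles from a tabulated CDF."""
    grid = np.linspace(1e-4, np.pi, n_grid)  # avoid the 0/0 form at omega = 0
    pdf = np.clip(igso3_angle_density(grid, eps), 0.0, None)
    steps = 0.5 * (pdf[1:] + pdf[:-1]) * np.diff(grid)  # trapezoid rule
    cdf = np.concatenate([[0.0], np.cumsum(steps)])
    cdf /= cdf[-1]
    return np.interp(rng.uniform(size=n_samples), cdf, grid)

def rodrigues(axes, angles):
    """Batched rotation matrices exp(angle * hat(axis)) for unit axes."""
    K = np.zeros((len(angles), 3, 3))
    K[:, 0, 1], K[:, 0, 2] = -axes[:, 2], axes[:, 1]
    K[:, 1, 0], K[:, 1, 2] = axes[:, 2], -axes[:, 0]
    K[:, 2, 0], K[:, 2, 1] = -axes[:, 1], axes[:, 0]
    s = np.sin(angles)[:, None, None]
    c = (1.0 - np.cos(angles))[:, None, None]
    return np.eye(3) + s * K + c * (K @ K)

def sample_igso3(mu, eps, n_samples, rng):
    """Draw rotations x = mu exp(omega v) with omega from the CDF and v uniform on S^2."""
    angles = sample_igso3_angles(eps, n_samples, rng)
    axes = rng.normal(size=(n_samples, 3))
    axes /= np.linalg.norm(axes, axis=1, keepdims=True)
    return mu @ rodrigues(axes, angles)
```

For example, `sample_igso3(np.eye(3), 0.5, 1000, np.random.default_rng(0))` returns an array of shape `(1000, 3, 3)` of rotation matrices scattered around the identity.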

3.2. SOLVING ORDINARY DIFFERENTIAL EQUATIONS ON SO(3)

Thanks to the existence of a tractable heat kernel on SO(3), the generative models we will define in the next section will not actually require us to solve the SDEs introduced at the beginning of this section, and we will only need to solve the probability flow ODE defined in Equation 3. Solving differential equations on manifolds can broadly be achieved using two distinct strategies, either projection methods using a Euclidean solver followed by a projection step onto the manifold, or intrinsic methods that rely on additional structure of the manifold to define an iteration that remains by construction on the manifold. In this work, we are concerned with SO(3), which is not only a compact Riemannian manifold, but also possesses a Lie group structure, which makes it amenable to efficient solvers. In particular, we will make use of the Runge-Kutta-Munthe-Kaas (RK-MK) class of algorithms and direct the interested reader to Iserles et al. (2000) for a review of Lie group integrators. We adopt in practice the Lie group equivalent of Heun's method, which is one variant of RK-MK integrators, and we provide the details of this integrator in Algorithm 1. While we will not require it in practice, it is also possible to build SDE solvers on SO(3) with a similar strategy, and we point the interested reader for instance to the Geodesic Random Walk algorithm described in De Bortoli et al. (2022) .
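The Lie group integrator of Algorithm 1 can be sketched in NumPy as below. We assume, for illustration only, that the vector field `f(x, t)` returns its value as three axis-angle coordinates of the Lie algebra; the function names and the uniform time grid are hypothetical choices. The update follows the steps of Algorithm 1 as written: each iterate is obtained by left-multiplying with an exponential, so it stays on SO(3) by construction.

```python
import numpy as np

def hat(w):
    """Vector of R^3 -> skew-symmetric matrix in so(3)."""
    wx, wy, wz = w
    return np.array([[0.0, -wz, wy], [wz, 0.0, -wx], [-wy, wx, 0.0]])

def exp_so3(w):
    """Exponential map via Rodrigues' formula."""
    w = np.asarray(w, dtype=float)
    angle = np.linalg.norm(w)
    K = hat(w)
    if angle < 1e-8:
        return np.eye(3) + K + 0.5 * K @ K
    return (np.eye(3) + np.sin(angle) / angle * K
            + (1.0 - np.cos(angle)) / angle**2 * K @ K)

def geometric_ode_solve(f, x0, t0, t1, n_steps):
    """Integrate dx = f(x, t) dt on SO(3) following the steps of Algorithm 1.

    f(x, t) must return a length-3 array (axis-angle coordinates).
    Returns the list of iterates [x_0, ..., x_N].
    """
    h = (t1 - t0) / n_steps  # negative when integrating backwards in time
    x, t = x0, t0
    trajectory = [x0]
    for _ in range(n_steps):
        y1 = h * f(x, t)
        y2 = h * f(exp_so3(0.5 * y1) @ x, t + 0.5 * h)
        x = exp_so3(y2) @ x
        t += h
        trajectory.append(x)
    return trajectory
```

With a constant field `f = lambda x, t: w`, the final iterate equals `exp_so3(w) @ x0` after integrating from 0 to 1, which gives a simple sanity check.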

4. DIFFUSION GENERATIVE MODELS ON SO(3)

The core idea of diffusion models is to perturb a given empirical data distribution $p_{data}$ by a noise process defined in terms of a stochastic differential equation of the form of Equation 1. While several expressions can be proposed for this SDE, for simplicity we consider here the case of the Variance-Exploding SDE (for our experiments we use a Variance-Preserving SDE in the DDPM case), with $f(x, t) = 0$ and $g(t) = \sqrt{d\epsilon(t)/dt}$ for a given choice of noise schedule $\epsilon(t)$, which corresponds to the canonical choice of the Euclidean Score Matching with Langevin Dynamics (Song & Ermon, 2019):

$$dx = \sqrt{\frac{d\epsilon(t)}{dt}}\; dw. \qquad (7)$$

For our fiducial model, and unless stated otherwise, we further assume for simplicity the following noise schedule: $\epsilon(t) = t$. The main drawback of this SDE in Euclidean geometry is that it tends to a Gaussian with infinitely large variance. On SO(3), however, this SDE tends to the uniform distribution $\mathcal{U}_{SO(3)}$, which is a natural choice for the prior distribution at large $T$.

Figure 2: panels shown at t = T, t = 3T/4, t = T/2, t = T/4 and t = 0.

Following from this choice of SDE, we can define a noise kernel $p_\epsilon(\tilde{x}|x) = \mathcal{IG}_{SO(3)}(\tilde{x}; x, \epsilon)$ for $x, \tilde{x} \in$ SO(3), such that the data distribution convolved by this noise kernel becomes

$$p_\epsilon(\tilde{x}) = \int_{SO(3)} p_{data}(x)\, p_\epsilon(\tilde{x}|x)\, dx, \qquad (8)$$

and corresponds to $p_t$, the marginal distribution of the diffusion process at time $t$: $p_{\epsilon(t)} = p_t$. Having introduced a specific choice of kernel and SDE well suited to the SO(3) manifold, we now move on to describing the two different approaches to build generative models: Score-Based Models and Denoising Diffusion Probabilistic Models. Both lead to the sampling procedures illustrated in Figure 2.

4.1. SCORE-BASED GENERATIVE MODEL

The first strategy directly extends Euclidean SGMs (Song & Ermon, 2019; Song et al., 2021) and relies on the time-reversed diffusion process described in Equation 2. Samples from the learned distribution $p_0$ can be drawn by first sampling $x_T \sim \mathcal{U}_{SO(3)}$ and evolving these samples either through the reverse SDE (Equation 2) or the probability flow ODE (Equation 3) back to $t = 0$. This process is entirely defined as soon as the score function of the marginal distribution at any intermediate time $t$, $\nabla \log p_{\epsilon(t)}$, is known. Therefore the first step is to establish a score-matching strategy on SO(3). Let us consider $\{X_i\}_{i=1}^{3}$, an orthonormal basis of the tangent space $T_e$SO(3). The directional derivative of the log density of the noise kernel $p_\epsilon(\tilde{x}|x)$ can be computed as:

$$\nabla_{X_i} \log p_\epsilon(\tilde{x}|x) = \left.\frac{d}{ds} \log p_\epsilon\left(\tilde{x} \exp(s X_i)\,|\,x\right)\right|_{s=0}, \qquad (9)$$

which can be computed in practice by automatic differentiation given the explicit approximation formulae for the $\mathcal{IG}_{SO(3)}$ distribution introduced in subsection 3.1. To match this derivative, we introduce a neural score estimator $s_\theta(x, \epsilon)$ : SO(3) × $\mathbb{R}^{+\star} \to \mathbb{R}^3$, which can be trained directly under a conventional denoising score matching loss:

$$\mathcal{L}_{DSM} = \mathbb{E}_{p_{data}(x)}\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2)}\, \mathbb{E}_{p_{|\epsilon|}(\tilde{x}|x)} \left[\, |\epsilon|\; \left\| s_\theta(\tilde{x}, \epsilon) - \nabla_X \log p_{|\epsilon|}(\tilde{x}|x) \right\|_2^2 \,\right], \qquad (10)$$

where we sample random noise scales $\epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2)$ at training time, similarly to Song & Ermon (2020). The minimum of this loss is achieved for $s_\theta(x, \epsilon) = \nabla \log p_\epsilon(x)$. Once the score function is estimated from data using this score matching loss, sampling from the generative model can be achieved by using either the reverse SDE formula or the ODE flow formula. In this work, we use the latter for its simplicity and speed, so that our specific fiducial sampling strategy becomes:

$$x_T \sim \mathcal{U}_{SO(3)}; \qquad dx_t = -\frac{1}{2}\, \frac{d\epsilon(t)}{dt}\; s_\theta(x_t, \epsilon(t))\; dt, \qquad (11)$$

which we solve down to $t = 0$ with the geometric ODE solver described in Algorithm 1. Compared to stochastic sampling strategies based on simulating the reverse SDE, this approach has several advantages: 1) it is much faster, and can benefit from adaptive ODE solvers that bring down the number of score evaluations needed; 2) the same ODE can be used to evaluate the log likelihood of the model by applying the probability flow formula of CNFs.

Algorithm 2 Sampling from Denoising Diffusion Probabilistic Model on SO(3)
Require: Trained neural networks $\mu_\theta(x, t)$, $\epsilon_\theta(x, t)$, number of steps $N$, time steps $\{t_i\}_{i=0}^{N}$
1: $x_N \sim \mathcal{U}_{SO(3)}$
2: for $i = N, N - 1, \ldots, 1$ do
3:     $x_{i-1} \sim p_\theta(\cdot\,; x_i) = \mathcal{IG}_{SO(3)}\left(\cdot\,; \exp(\mu_\theta(x_i, t_i)),\, \epsilon_\theta(x_i, t_i)\right)$
4: end for
5: return $\{x_n\}_{n=0}^{N}$
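To make the directional derivative of Equation 9 concrete, the sketch below estimates the score of the noise kernel with central finite differences rather than the automatic differentiation used in the paper. It is an illustration under stated assumptions: the basis is taken as the three generators `hat(e_i)`, the kernel density is the truncated series of Equation 4 with the exponent read as −ℓ(ℓ+1)ε², and all names and step sizes are hypothetical.

```python
import numpy as np

def hat(w):
    """Vector of R^3 -> skew-symmetric matrix in so(3)."""
    wx, wy, wz = w
    return np.array([[0.0, -wz, wy], [wz, 0.0, -wx], [-wy, wx, 0.0]])

def exp_so3(w):
    """Exponential map via Rodrigues' formula."""
    w = np.asarray(w, dtype=float)
    angle = np.linalg.norm(w)
    K = hat(w)
    if angle < 1e-8:
        return np.eye(3) + K + 0.5 * K @ K
    return (np.eye(3) + np.sin(angle) / angle * K
            + (1.0 - np.cos(angle)) / angle**2 * K @ K)

def log_f(omega, eps, l_max=50):
    """Log of the truncated series of Equation 4 (assumed exponent: eps**2)."""
    ell = np.arange(l_max + 1)
    series = np.sum((2 * ell + 1) * np.exp(-ell * (ell + 1) * eps**2)
                    * np.sin((ell + 0.5) * omega)) / np.sin(0.5 * omega)
    return np.log(series)

def rotation_angle(r):
    """Rotation angle of a rotation matrix, as in Section 2."""
    return np.arccos(np.clip(0.5 * (np.trace(r) - 1.0), -1.0, 1.0))

def kernel_score(x_noisy, x, eps, step=1e-5):
    """Finite-difference estimate of Equation 9 along hat(e_1), hat(e_2), hat(e_3)."""
    score = np.zeros(3)
    for i, e in enumerate(np.eye(3)):
        plus = rotation_angle(x.T @ x_noisy @ exp_so3(step * e))
        minus = rotation_angle(x.T @ x_noisy @ exp_so3(-step * e))
        score[i] = (log_f(plus, eps) - log_f(minus, eps)) / (2.0 * step)
    return score
```

For a noisy sample obtained by rotating `x` about a single axis, the estimated score is aligned with that axis and points back towards `x`, which is the regression target that the loss of Equation 10 asks the network to match.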

4.2. DENOISING DIFFUSION PROBABILISTIC MODEL

As described in Song et al. (2021), when using a finite number of steps, the forward diffusion process $\{x_i\}_{i=0}^{N}$ defined by Equation 7 (corresponding to times $\{0 \le t_i \le T\}_{i=0}^{N}$) can be interpreted as a Markov process:

$$p(x_{0:N}) = p(x_0)\, p_{\epsilon_1}(x_1|x_0) \cdots p_{\epsilon_i}(x_i|x_{i-1}) \cdots p_{\epsilon_N}(x_N|x_{N-1}), \qquad (12)$$

with the transition kernel $p_{\epsilon_{i+1}}(x_{i+1}|x_i) = \mathcal{IG}_{SO(3)}(x_{i+1}; x_i, \epsilon_{i+1})$, where $\epsilon_{i+1} = \epsilon(t_{i+1}) - \epsilon(t_i)$. The idea of DDPMs is to introduce a reverse Markov process defined in terms of variational transition kernels $p_\theta(x_{i-1}|x_i)$:

$$p_\theta(x_{0:N}) = p_\theta(x_N)\, p_\theta(x_{N-1}|x_N) \cdots p_\theta(x_{i-1}|x_i) \cdots p_\theta(x_0|x_1). \qquad (13)$$

While one could choose any distribution on SO(3) to parameterize this inverse transition kernel (e.g., Matrix Fisher, Bingham), we adopt for convenience an Isotropic Gaussian on SO(3) and use the following expression:

$$p_\theta(x_{i-1}|x_i) = \mathcal{IG}_{SO(3)}\left(x_{i-1};\; x_i\, \delta_\theta(x_i, t_i),\; \epsilon_\theta(x_i, t_i)\right), \qquad (14)$$

where $\delta_\theta$ : SO(3) × $\mathbb{R}^+$ → SO(3) is a neural network predicting the residual rotation to apply to $x_i$ to obtain the mean of the reverse kernel and $\epsilon_\theta$ : SO(3) × $\mathbb{R}^+ \to \mathbb{R}^+$ is a neural network predicting the variance of this reverse kernel. To parameterize the output of $\delta_\theta$ we adopt the 6D continuous rotation representation of Zhou et al. (2019) and explore the impact of this choice in Appendix D. If the reverse Markov process can be successfully trained to match the forward process, it provides a direct sampling strategy to generate samples from $p_0$ by initializing the chain from $p_T$ and iteratively sampling from the reverse kernel $p_\theta(x_{i-1}|x_i)$. In DDPMs, the training strategy is to write down the Evidence Lower Bound (ELBO), given this variational approximation for the reverse Markov process, in order to train the individual transition kernels $p_\theta(x_{i-1}|x_i)$. To reduce the variance of this loss over a naive evaluation of the ELBO, Sohl-Dickstein et al. (2015) and Ho et al. (2020) propose to use a closed-form expression of the reverse kernel $p(x_{i-1}|x_i, x_0)$ when conditioned on $x_0$. This makes it possible to rewrite the ELBO in terms of analytic KL divergences between Gaussian transition kernels. However, contrary to the Gaussian case of Euclidean DDPMs, for $\mathcal{IG}_{SO(3)}$ we do not easily have access to a closed-form expression of the reverse kernel $p(x_{i-1}|x_i, x_0)$, which is needed to derive the training loss used in Ho et al. (2020), so the same approach cannot be applied. Instead, we consider the expression for the ELBO:

$$\mathbb{E}\left[-\log p_\theta(x_0)\right] \le \mathbb{E}_p\left[-\log p(x_N) - \sum_{i \ge 1} \log \frac{p_\theta(x_{i-1}|x_i)}{p(x_i|x_{i-1})}\right] =: \mathcal{L}_{ELBO}, \qquad (15)$$

which is optimized by maximizing the log likelihood of individual transition kernels $\log p_\theta(x_{i-1}|x_i)$ over samples $x_{i-1}, x_i$ obtained by simulating the forward Markov diffusion process over the training set. Our strategy on SO(3) is therefore to train each transition kernel by maximum likelihood using the following loss function:

$$\mathcal{L}_{DDPM} := \sum_{i \ge 0} \mathbb{E}_{p_{data}(x_0)}\, \mathbb{E}_{p_{\epsilon(t_i)}(x_i|x_0)}\, \mathbb{E}_{p_{\epsilon_{i+1}}(x_{i+1}|x_i)} \left[-\log p_\theta(x_i|x_{i+1})\right], \qquad (16)$$

where the log probability of the $\mathcal{IG}_{SO(3)}$ distribution used in our parameterised reverse kernel is defined in Equation 6. While this loss can indeed be used to train a DDPM (as demonstrated in the next section), compared to the strategy of Ho et al. (2020) we expect it to suffer from larger variance, and it is not explicitly parameterised in terms of the score function (Song et al., 2021). Once trained, we can use the sampling strategy described in Algorithm 2 to draw from the generative model.
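The two ingredients of this section, the per-transition negative log-likelihood that enters the DDPM loss and the reverse Markov chain of Algorithm 2, can be sketched as follows. This is a minimal illustration and not the authors' code: `mean_fn`, `scale_fn` and `sample_ig` are hypothetical callables standing in for the trained networks and for an IG SO(3) sampler, and the density is the truncated series of Equation 4 with the exponent read as −ℓ(ℓ+1)ε².

```python
import numpy as np

def ig_log_prob(x, mean, eps, l_max=50):
    """Log density of Equation 6 using the truncated series of Equation 4."""
    cos_angle = np.clip(0.5 * (np.trace(mean.T @ x) - 1.0), -1.0, 1.0)
    omega = max(np.arccos(cos_angle), 1e-6)  # guard the 0/0 form at omega = 0
    ell = np.arange(l_max + 1)
    # Assumption: the exponent is read as -l(l+1) * eps**2 from Equation 4.
    series = np.sum((2 * ell + 1) * np.exp(-ell * (ell + 1) * eps**2)
                    * np.sin((ell + 0.5) * omega)) / np.sin(0.5 * omega)
    return np.log(series)

def transition_nll(x_prev, x_next, t_next, mean_fn, scale_fn):
    """One term -log p_theta(x_i | x_{i+1}) of the maximum likelihood loss."""
    mean = mean_fn(x_next, t_next)   # hypothetical network: mean rotation
    eps = scale_fn(x_next, t_next)   # hypothetical network: positive scale
    return -ig_log_prob(x_prev, mean, eps)

def ddpm_sample(x_init, times, mean_fn, scale_fn, sample_ig, rng):
    """Reverse Markov chain of Algorithm 2, run from i = N down to i = 1.

    sample_ig(mean, eps, rng) is a hypothetical sampler returning one draw
    from the isotropic Gaussian on SO(3) with the given mean and scale.
    """
    x = x_init
    chain = [x]
    for t in reversed(times[1:]):
        x = sample_ig(mean_fn(x, t), scale_fn(x, t), rng)
        chain.append(x)
    return chain
```

Averaging `transition_nll` over pairs of consecutive states simulated from the forward process gives a Monte Carlo estimate of the loss, while `ddpm_sample` started from a uniformly drawn rotation produces one sample path of the reverse chain.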

5. RELATED WORK

Most related to our work is Song et al. (2021) which introduces the diffusion framework we use in this paper, and served as a point of reference throughout. We survey below related works that have developed methodologies to represent distributions on SO(3).

Directional statistics

The classical approach for modeling distributions on SO(3) relies on (mixtures of) analytic distributions defined over the group of rotations. Common examples of such distributions for modeling uncertainties over orientations include the Bingham distribution (Peretroukhin et al., 2020; Srivatsan et al., 2018a; Gilitschenski et al., 2020) and the matrix Fisher distribution (Mohlin et al., 2020). The two main issues of these approaches are the lack of flexibility and expressivity of these analytic distributions, and the general difficulty of computing their normalization constant, which is typically required to train these models by maximum likelihood.

Normalizing Flows

A number of approaches have been proposed to build density estimators on manifolds (which include SO(3)) based on Normalizing Flows. A first class of methods uses a conventional Euclidean Normalizing Flow in $\mathbb{R}^n$, which is then mapped to the target manifold using an invertible mapping (Gemici et al., 2016; Falorsi et al., 2019). This has some limitations, however, as the target manifold needs to be homeomorphic to $\mathbb{R}^n$ (which is not the case for SO(3)), and this mapping can also present discontinuities. As an improvement over this approach, a second class of methods based on continuous normalizing flows (Chen et al., 2018) has emerged, defining flows directly on the manifold (Falorsi & Forré, 2020; Mathieu & Nickel, 2020). These approaches remain relatively costly, as training requires backpropagating through an ODE solver. Rozen et al. (2021) propose to sidestep that issue by training the CNF through penalizing the divergence of the neural network. Finally, in recent work, Ben-Hamu et al. (2022) propose to train a flow on manifolds by penalizing a Probability Path Divergence (PPD).

Diffusion models

In concurrent work, Leach et al. (2022) proposed an implementation of DDPMs on SO(3) by direct analogy with Ho et al. (2020), based on the Isotropic Gaussian on SO(3) as a replacement for the Normal distribution in $\mathbb{R}^n$. However, as mentioned in the previous section, the loss function used in Euclidean DDPMs does not directly translate to SO(3), which leads to imperfect density estimation, as we illustrate in our experiments. Finally, De Bortoli et al. (2022), Huang et al. (2022) and Thornton et al. (2022) introduce generic frameworks for diffusion models on Riemannian manifolds, but only for Score-Based Generative Models (SGMs). Their generic approach means they do not benefit from the knowledge of a solution to the heat equation on SO(3), which we use extensively in our work to avoid the need to simulate SDEs and to efficiently generate samples from the forward diffusion process. In addition, we note that the method developed in Huang et al. (2022) is not particularly efficient on the orthogonal group, as it requires a projection operation that involves a singular value decomposition. Finally, training the Implicit-PDF model (Murphy et al., 2021) by maximum likelihood requires computing the normalization constant of its implicit probability density function through brute-force evaluation on a tiling of SO(3), which is very costly in memory and limits the effective resolution of the learned densities.

6. EXPERIMENTS

We investigate the quality of the generative models described in the previous section on a series of synthetic test densities on SO(3). Details of the training procedures and architectural choices for all models can be found in Appendix A.

Under review as a conference paper at ICLR 2023

Table 1: C2ST scores of each model on the three test densities.

Model                                 Checkerboard    4-Gaussians     3-Stripes
SGM on SO(3) (ours)                   0.50 ± 0.01     0.50 ± 0.01     0.51 ± 0.01
DDPM on SO(3) (ours)                  0.52 ± 0.01     0.53 ± 0.01     0.52 ± 0.01
RSGM (De Bortoli et al., 2022)        0.51 ± 0.01     -               0.51 ± 0.01
Moser Flow (Rozen et al., 2021)       0.56 ± 0.01     0.60 ± 0.02     0.53 ± 0.02
DDPM (Leach et al., 2022)             0.71 ± 0.04     0.90 ± 0.05     0.60 ± 0.03
Implicit-PDF (Murphy et al., 2021)    0.59 ± 0.04     0.81 ± 0.09     0.63 ± 0.04

Test densities on SO(3)

We adopt three different toy distributions on SO(3): a checkerboard pattern, a multi-modal distribution of 4 concentrated Gaussians, and a stripe pattern that can be viewed as circles on the sphere. We focus on evaluating the generative models in terms of the quality of their sample generation using the Classifier 2-Sample Tests (C2ST) metric (Lopez-Paz & Oquab, 2017; Dalmasso et al., 2020; Lueckmann et al., 2021). The C2ST metric has been used in particular in the context of simulation-based inference to quantify the quality of inferred distributions. Concisely, the C2ST method uses a neural network classifier to discriminate between true and generated samples, yielding a value of 0.5 if the two distributions are perfectly indistinguishable to the classifier, up to a value of 1 if they are extremely different. In contrast to the usual Negative Log Likelihood (NLL), C2ST can be consistently computed for all generative models we compare below. We present the results in Figure 3 and Table 1. We also note that the Implicit-PDF model, in comparison, is extremely limited in resolution because of the memory cost of evaluating the pdf on a tiling of SO(3), and thus yields markedly worse C2ST values. The best results after our method are achieved by the RSGM model (De Bortoli et al., 2022), which is expected due to its similarity with our work, but it is slower to train in the specific case of SO(3). We find that the cost of simulating the forward SDE in the training phase leads to a factor of 8 increase in computation time per batch on a given GPU.
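To illustrate how a C2ST value is obtained, the sketch below computes the held-out accuracy of a classifier trained to separate two sets of samples. It is only a stand-in under stated assumptions: it uses a k-nearest-neighbour classifier in NumPy instead of the neural network classifier used in the experiments, and the function name, the value of `k` and the train/test split are hypothetical choices.

```python
import numpy as np

def c2st_knn(samples_p, samples_q, rng, k=5, test_fraction=0.5):
    """Held-out accuracy of a k-NN classifier separating two sample sets.

    Values near 0.5 mean the two sets are indistinguishable to the classifier;
    values near 1 mean they are easy to tell apart.
    """
    n_p, n_q = len(samples_p), len(samples_q)
    # Flatten each sample (e.g. a 3x3 rotation matrix) into a feature vector.
    features = np.concatenate([samples_p, samples_q]).reshape(n_p + n_q, -1)
    labels = np.concatenate([np.zeros(n_p), np.ones(n_q)])
    perm = rng.permutation(n_p + n_q)
    features, labels = features[perm], labels[perm]
    n_test = int(test_fraction * (n_p + n_q))
    x_test, y_test = features[:n_test], labels[:n_test]
    x_train, y_train = features[n_test:], labels[n_test:]
    # Squared Euclidean distances between every test and training point.
    d2 = ((x_test[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(d2, axis=1)[:, :k]
    predicted = y_train[nearest].mean(axis=1) > 0.5  # majority vote
    return float(np.mean(predicted == (y_test > 0.5)))
```

Passing two arrays of rotation matrices of shape `(n, 3, 3)`, one from the target density and one from a generative model, returns a score comparable in spirit to the entries of Table 1, although the absolute values depend on the classifier.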

7. CONCLUSIONS AND DISCUSSION

In this paper, we have presented a framework for score-based diffusion generative models on SO(3), as an extension of Euclidean SDE-based models (Song et al., 2021) . Because it is developed specifically for the SO(3) manifold, our work proposes a simpler and more efficient alternative to other recent (and general) Riemannian diffusion models while reaching state-of-the-art quality on synthetic distributions on SO(3). One of the most promising applications of this work is in robotics and computer vision, for the general task of pose-estimation, where our proposed model significantly outperforms current baselines (Murphy et al., 2021) . In the natural sciences, generative models on SO(3) are also of great interest and can be used for instance to find the angle of a molecule that minimizes the binding energy. Finally we note that as an interesting extension of the models presented in this work, one could define a Schrödinger bridge approach (De Bortoli et al., 2021; Thornton et al., 2022) specifically for SO(3), which would improve both sampling efficiency and sample quality. 

A IMPLEMENTATION AND TRAINING

We designed our neural networks with layers of {256, 256, 256, 256, 256} neurons, each with leaky ReLU activation and a residual connection. The neural networks were implemented using the axis-angle representation of SO(3), i.e. the input and output elements were represented in axis-angle form. Additionally, the neural networks were conditioned on the noise schedule, and the noise scales were also learned parameters. We trained our models using the Adam optimizer with a learning rate of $10^{-4}$, exponential decay rates of $\beta_1 = 0.90$ and $\beta_2 = 0.95$, 400 000 iterations, and a batch size of 1024. Training was run on an NVIDIA Tesla V100 GPU, using the JAX and DeepMind Haiku Python libraries. For the DDPM, we adopt in practice the Variance Preserving SDE of Ho et al. (2020), as we obtain better results empirically than with a Variance Exploding SDE.

B ADDITIONAL POSE ESTIMATION RESULTS

We provide additional results on pose estimation in Figure 5.

C REPRESENTATIONS OF SO(3)

The special orthogonal group, SO(3), is the Lie group of all rotations about the origin in 3-dimensional space. There are several ways to represent the elements of the group SO(3), each with its advantages and disadvantages:
• Rotation matrices in $\mathbb{R}^{3 \times 3}$ with determinant equal to 1. This representation has 9 parameters and can be subject to some numerical instabilities, such as when computing the inverse or trigonometric functions.
• Euler angles (also called yaw, pitch, and roll in robotics) are three angles α, β, ψ that describe an orientation with respect to a fixed coordinate system. This representation is subject to the infamous gimbal lock, where one degree of freedom is lost when two axes of the gimbal become parallel.
• Unit quaternions, defined as γ = a + bi + cj + dk, where a, b, c, d are real numbers satisfying $\sqrt{a^2 + b^2 + c^2 + d^2} = 1$, with bi + cj + dk denoting the vector (or imaginary) part of the unit quaternion. This representation has 4 parameters and has elegant operations (the Hamilton product) without trigonometric functions (see Appendix E). However, quaternions are antipodally symmetric, which introduces some degeneracies.
• Axis-angle representation (normalized) or tangent space (unnormalized), defined as $\boldsymbol{\theta} = \theta e = (\theta_1, \theta_2, \theta_3) = \theta (e_1, e_2, e_3)$, where θ is the rotation angle and e is the rotation axis. However, this representation does not have well-defined operations to combine rotations, and is furthermore discontinuous at θ = π (Zhou et al., 2019).
Therefore, in practice it is best to use some combination of the aforementioned representations and convert back and forth among them. For a comprehensive review of SO(3) representations and metrics, especially for computer scientists, please refer to Hartley et al. (2013).

Figure 6: Comparison of distributions sampled from the DDPM under two different parameterizations of both input and output rotations. (a) DDPM trained with axis-angle inputs and outputs. (b) DDPM trained with 3x3 rotation matrices as inputs and using a continuous 6D output rotation representation. Once the models are fully trained, as illustrated here, the impact is small, but on a partially trained network discontinuities would be visible in the case of the axis-angle representation, mostly due to the discontinuity in the input rotations.
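The unit quaternion operations mentioned in the list of representations above (Hamilton product, conjugate, antipodal symmetry) can be sketched in NumPy as follows. Quaternions are stored as arrays `(a, b, c, d)`; the function names are illustrative choices, and the conversion to a rotation matrix uses the standard active-rotation convention, which is an assumption since the paper does not state one.

```python
import numpy as np

def hamilton_product(q1, q2):
    """Hamilton product of two quaternions given as (a, b, c, d)."""
    a1, b1, c1, d1 = q1
    a2, b2, c2, d2 = q2
    return np.array([
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    ])

def conjugate(q):
    """Conjugate of a quaternion, equal to its inverse for unit quaternions."""
    return np.asarray(q, dtype=float) * np.array([1.0, -1.0, -1.0, -1.0])

def quat_to_matrix(q):
    """Rotation matrix of a unit quaternion (q and -q give the same matrix)."""
    a, b, c, d = q
    return np.array([
        [1 - 2 * (c * c + d * d), 2 * (b * c - a * d), 2 * (b * d + a * c)],
        [2 * (b * c + a * d), 1 - 2 * (b * b + d * d), 2 * (c * d - a * b)],
        [2 * (b * d - a * c), 2 * (c * d + a * b), 1 - 2 * (b * b + c * c)],
    ])
```

Because `quat_to_matrix` is quadratic in the components of `q`, the quaternions `q` and `-q` map to the same rotation, which is the antipodal symmetry noted above.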

D IMPACT OF ROTATION REPRESENTATIONS ON NEURAL DIFFUSION MODEL

As highlighted in Zhou et al. (2019), a particular choice of rotation representation can affect the training and accuracy of neural networks that either take rotations as inputs or output rotations. In particular, common representations such as axis-angle and quaternions are known to have discontinuities, which are needlessly difficult for a neural network to capture. That work proposes to use 5- or 6-dimensional representations, which have the particularity of being continuous, and demonstrates their benefit in neural network training. In our work, we choose to provide the 3x3 rotation matrix directly as input to the neural networks. Only the network involved in the DDPM needs to represent rotations as an output, and there we adopt the 6D representation following Zhou et al. (2019), which can be seen as two 3D vectors from which we can build a full orthogonal rotation matrix using cross-products. In comparing the impact of this choice against using only an axis-angle representation as inputs and outputs, we observe the following points:
• The choice of the output parameterization (in the case of the DDPM) has no noticeable effect, which is expected as the model outputs residual rotations, which remain small and thus away from the discontinuity in the axis-angle representation.
• While the differences are small once the networks are fully trained, as illustrated in Figure 6, we notice for partially trained networks a discontinuity in the sampled distributions in the case of an input axis-angle representation.
Therefore, we directly feed the 3x3 rotation matrix as an input to our networks.
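As an illustration of how a 6D output can be turned into a rotation matrix, the sketch below orthonormalises the two 3D vectors and completes the frame with a cross-product. The Gram-Schmidt step and the convention of stacking the resulting vectors as columns are assumptions on our part, in the spirit of Zhou et al. (2019); the function name is hypothetical.

```python
import numpy as np

def rotation_from_6d(v):
    """Build a rotation matrix from a 6D vector made of two 3D vectors."""
    v = np.asarray(v, dtype=float)
    a1, a2 = v[:3], v[3:]
    b1 = a1 / np.linalg.norm(a1)
    # Remove the component of a2 along b1, then normalise (Gram-Schmidt).
    a2_perp = a2 - np.dot(b1, a2) * b1
    b2 = a2_perp / np.linalg.norm(a2_perp)
    # The cross-product completes a right-handed orthonormal frame.
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=1)
```

Any 6D vector whose two halves are not collinear is mapped to a valid rotation, and feeding the first two columns of a rotation matrix returns that same matrix.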

E QUATERNION OPERATIONS FOR SO(3)

Unit quaternions form a group under multiplication, defined by the Hamilton product. Given quaternions $\gamma_1$ and $\gamma_2$, the Hamilton product is obtained by expanding

$$\gamma_1 \cdot \gamma_2 = (a_1 + b_1 i + c_1 j + d_1 k) \cdot (a_2 + b_2 i + c_2 j + d_2 k)$$

in a distributive manner, keeping in mind the basis multiplication identities. This operation is physically equivalent to rotating by $\gamma_1$ and then by $\gamma_2$. The identity of the group is the quaternion $\gamma_0 = 1 + 0i + 0j + 0k$, and the inverse of $\gamma = a + bi + cj + dk$ (which is also its conjugate) is defined as $\gamma^* = a - bi - cj - dk$.

The reparametrization trick and variance preserving quaternions. In Euclidean space, for a Gaussian random variable $x$ drawn from $\mathcal{N}(\mu; \sigma^2 I)$, the reparametrization trick is defined as $x = \mu + \sigma \cdot \delta$, where $\delta \sim \mathcal{N}(0; I)$. We can define an analogous operation in the quaternion group as follows:

$$\gamma = \theta \cdot \epsilon^{\delta} \qquad (17)$$

Here, a quaternion raised to a scalar power is defined as $\gamma^a = \exp(\ln(\gamma)\, a)$. In other words, we take the quaternion from the manifold to the tangent space, perform the scalar multiplication there, and bring the result back to the manifold using the exponential map. The variance preserving operation analogous to $\sqrt{\alpha} \cdot x + \sqrt{1-\alpha} \cdot \delta$ can then be defined as

$$x = x^{\sqrt{\alpha}} \cdot \delta^{\sqrt{1-\alpha}}$$

For the variance exploding case, we can directly sample from the heat kernel without resorting to the quaternion operations.

E.1 DISTRIBUTIONS ON SO(3)

In the literature there are numerous ways to represent distributions on the hypersphere. Most of them involve taking a standard distribution from the Euclidean space $\mathbb{R}^n$ and then constraining or projecting it onto the hypersphere $S^{n-1}$. Some of the popular distributions are:

• Projected Gaussian(s) on the sphere, where standard Gaussian(s) on the tangent space of the hypersphere are projected via central projection, as done in Feiten et al. (2013);

• The von Mises-Fisher (vMF) distribution, where an isotropic Gaussian on $\mathbb{R}^n$ is restricted to the unit hypersphere (von Mises, 1918);

• The recently developed Power Spherical distribution (Cao & Aziz, 2020), which addresses some of the challenges of the vMF distribution, such as numerical stability and scalability;

• The antipodally symmetric Bingham distribution. The antipodal symmetry makes it a suitable distribution to represent quaternions, since quaternions double cover the space of rotations SO(3) (Gilitschenski et al., 2020; Peretroukhin et al., 2020; Srivatsan et al., 2018b;a). However, the Bingham distribution is notorious for its normalization constant, which is very hard to compute.

Unfortunately, these distributions are not closed under convolution (i.e., composition of their random variables), so writing down a closed-form diffusion process akin to the Euclidean Gaussian case is intractable. One way to circumvent this problem is to use a class of functions that is closed under convolution on the manifold. An obvious choice is the heat kernel, which is the canonical solution to the diffusion equation and is closed under convolution by definition (Grigoryan, 2009).
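Returning to the quaternion operations introduced at the start of this appendix (the Hamilton product, the scalar power defined through the logarithm and exponential maps, and the variance preserving composition of a sample raised to the power $\sqrt{\alpha}$ with a noise quaternion raised to the power $\sqrt{1-\alpha}$), a minimal NumPy sketch could look as follows. All function names are hypothetical. The sketch assumes unit quaternions stored scalar-first as `(a, b, c, d)` and represents the logarithm of a unit quaternion by its 3D imaginary part. It is meant as an illustration under these assumptions rather than as the implementation used in this work.

```python
import numpy as np


def hamilton(q1, q2):
    """Hamilton product of two quaternions stored as (a, b, c, d)."""
    a1, b1, c1, d1 = q1
    a2, b2, c2, d2 = q2
    return np.array([
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    ])


def conjugate(q):
    """Conjugate, which equals the inverse for a unit quaternion."""
    return np.asarray(q, dtype=float) * np.array([1.0, -1.0, -1.0, -1.0])


def quat_log(q):
    """Logarithm of a unit quaternion, returned as a 3D tangent vector."""
    v = np.asarray(q[1:], dtype=float)
    n = np.linalg.norm(v)
    if n < 1e-12:  # identity quaternion maps to the zero tangent vector
        return np.zeros(3)
    return np.arctan2(n, q[0]) * v / n


def quat_exp(v):
    """Exponential map from a 3D tangent vector to a unit quaternion."""
    v = np.asarray(v, dtype=float)
    n = np.linalg.norm(v)
    if n < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    return np.concatenate([[np.cos(n)], np.sin(n) * v / n])


def quat_pow(q, a):
    """Scalar power q**a, computed as exp(a * ln(q))."""
    return quat_exp(a * quat_log(q))


def variance_preserving_step(x, delta, alpha):
    """Compose x**sqrt(alpha) with delta**sqrt(1 - alpha)."""
    return hamilton(quat_pow(x, np.sqrt(alpha)),
                    quat_pow(delta, np.sqrt(1.0 - alpha)))
```

With `alpha = 1` the step returns `x` unchanged, and because the product of unit quaternions is again a unit quaternion, the result stays on the manifold without any renormalization.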



For a comprehensive list of articles on score-based generative modeling, see https://scorebasedgenerativemodeling.github.io/



Figure 1: Illustration of reversible diffusion of a mixture of two $\mathcal{IG}_{SO(3)}$ blobs on SO(3). Samples from a

Figure 2: Sampling from a Diffusion Generative Model trained on a synthetic density on SO(3). Starting from $\mathcal{U}_{SO(3)}$, the uniform distribution on SO(3), at t = T (left), the sampling procedure (either based on SGMs or DDPMs) denoises this distribution back to the target density at t = 0 (right). For visualization, this density plot shows the distribution of canonical axes of sampled rotations projected on the sphere; the tilt around that axis is discarded.

Murphy et al. (2021) develop a non-parametric representation of distributions on SO(3) by introducing a neural network to implicitly represent an unnormalized density on SO(3).

Figure 3: Density plot comparing samples from learned synthetic densities on SO(3). For visualization this density plot shows the distribution of canonical axes of sampled rotations projected on the sphere; the tilt around that axis is discarded.

results of our (SGM, DDPM-VExp, DDPM-VPres) comparisons on these test densities against the implicit-pdf method of Murphy et al. (2021), the DDPM implementation of Leach et al. (2022), Moser flow of Rozen et al. (2021), and the Riemannian Score-Based Generative Model (RSGM) of De Bortoli et al. (2022) (trained under their $\ell_{t|0}$ score matching loss). We find that in all cases our SGM implementation on SO(3) yields the best C2ST metric, which is in line with the visual quality of distributions shown in Figure 3. Our DDPM implementation on SO(3) yields distributions that are comparatively less sharp, which we attribute to the larger variance of our training loss for that model. Compared to other models, our experiments illustrate a failure mode in the method of Leach et al. (2022), which we attribute to the fact that the usual DDPM loss function cannot be directly translated to SO(3) (as discussed in subsection 4.2).

To test practical applications of our model, following Murphy et al. (2021), we used a vision description obtained from a pre-trained ResNet architecture with ImageNet weights, consisting of a 2048-dimensional vector, to condition an SO(3) SGM. Using images of symmetric solids from the SYMSOL dataset (Murphy et al., 2021), we show that we can correctly estimate poses of objects with degenerate symmetry, as shown in Fig. 4 (and in Appendix B).

The antipodally symmetric Bingham distribution. The antipodal symmetry makes it a suitable distribution to represent quaternions, since quaternions double cover the space of rotations on SO(3) Gilitschenski et al. (2020); Peretroukhin et al. (2020); Srivatsan et al.

Richard von Mises. Über die 'Ganzzahligkeit' der Atomgewichte und verwandte Fragen. Physikal. Z., 19:490-500, 1918.

Nathan Zelesko, Amit Moscovich, Joe Kileel, and Amit Singer. Earthmover-based manifold learning for analyzing molecular conformation spaces. In 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), pp. 1715-1719, 2020. doi: 10.1109/ISBI45749.2020.9098723.

Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5745-5753, 2019.

[Figure 5 panel titles: Image, True, Scatter (Predicted), Density (Predicted)]

Figure 5: Predicted poses for an image of a solid with degenerate symmetry; here we only show it for the cone. The 1st column depicts the image of the symmetric solid. In column 2, each point represents a rotation matrix in SO(3) projected on the sphere according to its canonical axis; the color indicates the tilt around that axis. For visualization, the density plot (column 3) shows the distribution of canonical axes of sampled rotations projected on the sphere; the tilt around that axis is discarded.

