AQUAMAM: AN AUTOREGRESSIVE, QUATERNION MANIFOLD MODEL FOR RAPIDLY ESTIMATING COMPLEX SO(3) DISTRIBUTIONS

Abstract

Accurately modeling complex, multimodal distributions is necessary for optimal decision-making, but doing so for rotations in three dimensions, i.e., the SO(3) group, is challenging due to the curvature of the rotation manifold. The recently described implicit-PDF (IPDF) is a simple, elegant, and effective approach for learning arbitrary distributions on SO(3) up to a given precision. However, inference with IPDF requires N forward passes through the network's final multilayer perceptron, where N places an upper bound on the likelihood that can be calculated by the model, which is prohibitively slow for those without the computational resources necessary to parallelize the queries. In this paper, I introduce AQuaMaM, a neural network capable of both learning complex distributions on the rotation manifold and calculating exact likelihoods for query rotations in a single forward pass. Specifically, AQuaMaM autoregressively models the projected components of unit quaternions as mixtures of uniform distributions that partition their geometrically-restricted domain of values. On an "infinite" toy dataset with ambiguous viewpoints, AQuaMaM rapidly converges to a sampling distribution closely matching the true data distribution. In contrast, the sampling distribution for IPDF dramatically diverges from the true data distribution, despite IPDF approaching its theoretical minimum evaluation loss during training. On a constructed dataset of 500,000 renders of a die in different rotations, an AQuaMaM model trained from scratch reaches a log-likelihood 14% higher than an IPDF model using a pretrained ResNet-50. Further, compared to IPDF, AQuaMaM uses 24% fewer parameters, has a prediction throughput 52× faster on a single GPU, and converges in a similar amount of time during training.

1. INTRODUCTION AND RELATED WORK

In many robotics applications, e.g., robotic weed control (Wu et al., 2020), the ability to accurately estimate the poses of objects is a prerequisite for successful deployment. However, compared to other automation tasks, which primarily involve either classification or regression in R^n, pose estimation is particularly challenging because the 3D rotation group SO(3) lies on a curved manifold. As a result, standard probability distributions (e.g., the multivariate Gaussian) are not well-suited for modeling elements of the SO(3) set. Further, because the steps for interacting with an object in the "mean" pose between two possible poses (Figure 1) will often fail when applied to the object when it is in one of the non-mean poses, accounting for multimodality in the context of rotations is essential. The recently described implicit-PDF (IPDF) (Murphy et al., 2021) is a simple, elegant, and effective approach for modeling distributions on SO(3) that both respects the curvature of the rotation manifold and is inherently multimodal. The IPDF model f is trained through negative sampling where, for each ground truth image/rotation pair (X, R_1), a set of N - 1 negative rotations {R_i}_{i=2}^{N} is sampled and a score is assigned to each rotation matrix as s_i = f(X, R_i). The final density p(R_1|X) is then approximated as p(R_1|X) ≈ softmax(s)[1]/V, where s is a vector containing the scores with s[1] = s_1, and V = π²/N, i.e., the volume of SO(3) split into N pieces.

Figure 1: When minimizing the unimodal Bingham loss for the two rotations R_1 and R_2, the maximum likelihood estimate R is a rotation that was never observed in the dataset. Note, the die images are for demonstration purposes only, i.e., no images were used during optimization. R_0 is the identity rotation.
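The density approximation above can be sketched in a few lines. Note that `ipdf_density` is a hypothetical helper written for illustration (not the authors' implementation), and the scores are stand-ins for the outputs of f(X, R_i):

```python
import numpy as np

# Illustrative sketch of the IPDF density approximation described above.
# Index 0 of `scores` plays the role of the ground-truth rotation R_1;
# the remaining N - 1 entries are the negatively sampled rotations.
def ipdf_density(scores):
    """Approximate p(R_1 | X) = softmax(s)[1] / V with V = pi^2 / N."""
    scores = np.asarray(scores, dtype=np.float64)
    n = scores.size
    volume_per_cell = np.pi ** 2 / n  # SO(3) has a total volume of pi^2.
    probs = np.exp(scores - scores.max())  # Numerically stable softmax.
    probs /= probs.sum()
    return probs[0] / volume_per_cell

# With uniform scores, every rotation receives the uniform density 1 / pi^2.
print(ipdf_density(np.zeros(4096)))
```

As a sanity check, when all N scores are equal, the softmax is uniform and the approximation reduces to the uniform density on SO(3), i.e., 1/π², independent of N.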
While effective, the N hyperparameter induces a non-trivial trade-off between model precision (in terms of the maximum likelihood that can be assigned to a rotation) and inference speed in environments that do not have the computational resources necessary to extensively parallelize the task. Further, again due to the precision/speed trade-off, IPDF is trained with N_train ≪ N_test, which can make it difficult to reason about how the model will behave in the wild. The alternative to implicitly modeling distributions on SO(3) is explicitly modeling them. Here, I briefly describe the three baselines used in Murphy et al. (2021), which are representative of explicit distribution modeling approaches. Notably, IPDF outperformed each of these models by a wide margin in a distribution modeling task (see Table 1 in Murphy et al. (2021)). First, Prokudin et al. (2018) described a biternion network (Beyer et al., 2015) trained to output Euler angles by optimizing a loss derived from the von Mises distribution. The two multimodal variants of the model consist of: (1) a variant that outputs mixture components for a von Mises mixture distribution, and (2) an "infinite mixture model" variant, which is implemented as a conditional (Sohn et al., 2015) variational autoencoder (Kingma & Welling, 2014). Second, Gilitschenski et al. (2020) described a procedure for training a network to directly model a distribution of rotations by optimizing a loss derived from the Bingham distribution (Bingham, 1974), with the multimodal variant outputting the mixture components for a Bingham mixture distribution. Lastly, Deng et al. (2020) extended the work of Gilitschenski et al. (2020) by optimizing a "winner takes all" loss (Guzmán-Rivera et al., 2012; Rupprecht et al., 2017) in an attempt to overcome the difficulties (Makansi et al., 2019) commonly encountered when training Mixture Density Networks (Bishop, 1994). One additional approach that needs to be introduced is the direct classification of an individual rotation from a set of rotations, which, in some sense, is the explicit version of IPDF. Unfortunately, the computational complexity of this strategy quickly becomes prohibitive as more precision is required. For example, Murphy et al. (2021) used an evaluation grid of ∼2.4 million equivolumetric cells (Yershova et al., 2010), which would not only require an extraordinary number of parameters in the final classification layer of a model, but would also require an extremely large dataset to reasonably "fill in" the grid due to the curse of dimensionality. In this paper, I introduce AQuaMaM, an Autoregressive, Quaternion Manifold Model that learns arbitrary distributions on SO(3) with a high level of precision.
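To make the precision side of this trade-off concrete, here is a small back-of-the-envelope calculation (my own illustration, not from either paper): because softmax(s)[1] ≤ 1, the density IPDF can assign to a rotation is capped at 1/V = N/π², so the achievable log-likelihood grows only logarithmically in N.

```python
import math

# Upper bound on the log-likelihood IPDF can report for a given N, since
# p(R|X) = softmax(s)[1] / V <= 1 / V = N / pi^2. The two values of N below
# are the N_train and N_test reported in Murphy et al. (2021).
def max_log_likelihood(n):
    return math.log(n / math.pi ** 2)

for n in (4_096, 2_359_296):
    print(f"N = {n:>9,}: max log-likelihood = {max_log_likelihood(n):.2f}")
```

Raising the ceiling by a fixed amount thus requires multiplying N, and with it the number of forward passes through the final multilayer perceptron at inference time.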
Specifically, AQuaMaM models the projected components of unit quaternions as mixtures of uniform distributions that partition their geometrically-restricted domain of values. Architecturally, AQuaMaM is a Transformer (Vaswani et al., 2017). Although Transformers were originally motivated by language tasks, their flexibility and expressivity have allowed them to be successfully applied to a wide range of data types, including: images (Parmar et al., 2018; Dosovitskiy et al., 2021), tabular data (Padhi et al., 2021; Fakoor et al., 2020; Alcorn & Nguyen, 2021c), multi-agent trajectories (Alcorn & Nguyen, 2021a), and proteins (Rao et al., 2021). Recently, multimodal Transformers (Ramesh et al., 2021; Reed et al., 2022), i.e., Transformers that jointly process different modalities of input (e.g., text and images), have revealed additional fascinating capabilities of this class of neural networks. AQuaMaM is multimodal in both senses of the word: it processes inputs from different modalities (images and unit quaternions) to model distributions on the rotation manifold with multiple modes.
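The mixture-of-uniforms likelihood can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: it assumes K equal-width uniform bins per component, the component ordering (q_x, q_y, q_z) with q_w recovered from the unit-norm constraint, and per-component mixture logits that, in the real model, would come from the Transformer.

```python
import numpy as np

# Illustrative sketch of evaluating one projected quaternion component under
# a mixture of K uniform distributions that partition its restricted domain.
def component_log_likelihood(value, mixture_logits, lo, hi):
    """log p(value) when [lo, hi] is split into K equal-width uniform bins
    weighted by softmax(mixture_logits)."""
    k = mixture_logits.size
    width = (hi - lo) / k
    bin_idx = min(int((value - lo) / width), k - 1)  # Guard value == hi.
    log_weights = mixture_logits - np.logaddexp.reduce(mixture_logits)
    # The density inside the chosen bin is its mixture weight / bin width.
    return log_weights[bin_idx] - np.log(width)

# Autoregressive factorization over (q_x, q_y, q_z): each component's domain
# shrinks as the earlier components use up the quaternion's unit norm, i.e.,
# q_y is restricted to [-sqrt(1 - q_x^2), sqrt(1 - q_x^2)], and so on.
def quaternion_log_likelihood(q_xyz, logits_per_component):
    total, remaining = 0.0, 1.0
    for value, logits in zip(q_xyz, logits_per_component):
        bound = np.sqrt(remaining)
        total += component_log_likelihood(value, logits, -bound, bound)
        remaining -= value ** 2
    return total
```

Because the bins partition each component's domain exactly, the per-component densities integrate to one over the restricted interval, and the product of the three conditional densities gives an exact likelihood for the query rotation in a single pass.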



1. Pronounced "aqua ma'am".
2. All code to generate the datasets, train and evaluate the models, and generate the figures can be found at: <anonymized for review>.
3. SO(3) stands for "special orthogonal group in three dimensions", with the "special" referring to the fact that all rotation matrices have a determinant of one. See https://blogs.scientificamerican.com/roots-of-unity/a-few-of-my-favorite-spaces-so-3/ for a popular science introduction to SO(3).
4. In Murphy et al. (2021), N_train = 4,096 and N_test = 2,359,296, i.e., N_train/N_test ≈ 0.2%.
5. Notably, Mahendran et al. (2018) used a maximum of 200 rotations in their classification approach.




