IMAGE TO SPHERE: LEARNING EQUIVARIANT FEATURES FOR EFFICIENT POSE PREDICTION

Abstract

Predicting the pose of objects from a single image is an important but difficult computer vision problem. Methods that predict a single point estimate do not predict the pose of symmetric objects well and cannot represent uncertainty. Alternatively, some works predict a distribution over orientations in SO(3). However, training such models can be computation- and sample-inefficient. Instead, we propose a novel mapping of features from the image domain to the 3D rotation manifold. Our method then leverages SO(3)-equivariant layers, which are more sample efficient, and outputs a distribution over rotations that can be sampled at arbitrary resolution. We demonstrate the effectiveness of our method at object orientation prediction, and achieve state-of-the-art performance on the popular PASCAL3D+ dataset. Moreover, we show that our method can model complex object symmetries without any modifications to the parameters or loss function.

1. INTRODUCTION

Determining the pose of an object from an image is a challenging problem with important applications in augmented reality, robotics, and autonomous vehicles. Traditionally, pose estimation has been approached as a point regression problem, minimizing the error to a single ground-truth 3D rotation. In this way, object symmetries are manually disambiguated using domain knowledge (Xiang et al., 2018) and uncertainty is not accounted for. This approach to pose estimation cannot scale to the open-world setting, where we wish to reason about uncertainty from sensor noise or occlusions and model novel objects with unknown symmetries.

Recent work has instead attempted to learn a distribution over poses. Single rotation labels can be modeled as random samples from the distribution over object symmetries, which removes the need for injecting domain knowledge. For instance, a table with front-back symmetry presents a challenge for single-pose regression methods, but can be effectively modeled with a bimodal distribution.

The drawback to learning distributions over the large space of 3D rotations is that it requires large amounts of data, especially when modeling hundreds of instances across multiple object categories. This poor data efficiency can be improved by constraining the weights to encode symmetries present in the problem (Cohen & Welling, 2016). The pose prediction problem exhibits 3D rotational symmetry, i.e., the symmetry group SO(3): if we change the canonical reference frame of an object, the predictions of our model should transform correspondingly. For certain input modalities, such as point clouds or 3D camera images, the symmetry group acts directly on the input data via 3D rotation matrices. Thus, many networks exploit this symmetry with end-to-end SO(3) equivariance to achieve sample-efficient pose estimation. However, achieving 3D rotation equivariance in a network trained on 2D images is less explored.
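The equivariance property described here — changing the reference frame of the input transforms the output correspondingly — can be checked numerically on a toy equivariant map. The sketch below uses the centroid of a point cloud, a trivially SO(3)-equivariant function chosen purely for illustration; it is not part of the paper's method.

```python
import numpy as np

def random_rotation(rng):
    # QR decomposition of a Gaussian matrix yields a random orthogonal matrix;
    # flip a column sign if needed so the determinant is +1 (a proper rotation).
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1
    return q

def centroid(points):
    # A trivially SO(3)-equivariant map: rotating the input rotates the output.
    return points.mean(axis=0)

rng = np.random.default_rng(0)
points = rng.standard_normal((100, 3))   # toy "point cloud"
R = random_rotation(rng)

# Equivariance: f(R x) == R f(x)
lhs = centroid(points @ R.T)   # rotate first, then apply f
rhs = R @ centroid(points)     # apply f first, then rotate
assert np.allclose(lhs, rhs)
```

A network with end-to-end SO(3) equivariance satisfies the same identity, with the learned feature map in place of the centroid.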
Thus, we present Image2Sphere (I2S), a novel method that learns SO(3)-equivariant features to represent distributions over 3D rotations. Features extracted by a convolutional network are projected from image space onto a half of the 2-sphere. Then, spherical convolution is performed on the features with a learned filter over the entire 2-sphere, resulting in a signal that is equivariant to 3D rotations. A final SO(3) group convolution produces a probability distribution over SO(3) parameterized in the Fourier domain. Our method can be trained to accurately predict object orientation and to correctly express ambiguous orientations for objects with symmetries not specified at training time. I2S achieves state-of-the-art performance on the PASCAL3D+ pose estimation dataset, and outperforms all baselines on the ModelNet10-SO(3) dataset. We demonstrate that our proposed architecture for learning SO(3)-equivariant features from images empirically outperforms a variety of sensible alternative approaches. In addition, we use the diagnostic SYMSOL datasets to show that our approach is more expressive at representing complex object symmetries than methods using parametric families of multi-modal distributions.
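The projection step can be sketched with a toy orthographic mapping from a feature map onto a hemisphere grid. This is an illustrative approximation, not the paper's exact implementation: the grid resolution, nearest-pixel sampling, and the function name `project_to_hemisphere` are all assumptions.

```python
import numpy as np

def project_to_hemisphere(feat, n_beta=16, n_alpha=32):
    """Sample a CxHxW feature map onto a grid covering half of the 2-sphere.

    Hypothetical sketch: each grid point (x, y, z) with z >= 0 is mapped by
    orthographic projection to pixel coordinates (x, y), and the feature at
    the nearest pixel becomes the spherical signal at that point.
    """
    c, h, w = feat.shape
    # Polar angle beta in [0, pi/2) covers only the visible hemisphere.
    beta = np.linspace(0, np.pi / 2, n_beta, endpoint=False)
    alpha = np.linspace(0, 2 * np.pi, n_alpha, endpoint=False)
    B, A = np.meshgrid(beta, alpha, indexing="ij")
    x, y = np.sin(B) * np.cos(A), np.sin(B) * np.sin(A)   # z = cos(B) >= 0
    # Orthographic projection: drop z, map x, y in [-1, 1] to pixel indices.
    cols = np.clip(((x + 1) / 2 * (w - 1)).round().astype(int), 0, w - 1)
    rows = np.clip(((y + 1) / 2 * (h - 1)).round().astype(int), 0, h - 1)
    return feat[:, rows, cols]   # shape (C, n_beta, n_alpha)

feat = np.random.default_rng(1).standard_normal((8, 5, 5))
sphere_signal = project_to_hemisphere(feat)
assert sphere_signal.shape == (8, 16, 32)
```

The resulting spherical signal can then be convolved with a filter defined over the full sphere, which is what makes the downstream features rotation-equivariant.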

Contributions:

• We propose a novel hybrid architecture that uses non-equivariant layers to learn SO(3)-equivariant features, which are further processed by equivariant layers.

• Our method uses the Fourier basis of SO(3) to represent detailed distributions over pose more efficiently than other methods.

• We empirically demonstrate that our method can describe ambiguities in pose due to partial observability or object symmetry, unlike point-estimate methods.

• I2S achieves state-of-the-art performance on PASCAL3D+, a challenging pose estimation benchmark using real-world images.

2. RELATED WORK

Object Pose Prediction One line of work (…, 2020) predicts 3D keypoints on the object, from which the pose can be extracted. Another line of work (Wang et al., 2019; Li et al., 2019; Zakharov et al., 2019) outputs a dense representation of the object's coordinate space. Most of these methods are benchmarked on datasets with a limited number of object instances (Hinterstoisser et al., 2011; Xiang et al., 2018). In contrast, our method is evaluated on datasets that have hundreds of object instances or novel instances in the test set. Moreover, our method makes minimal assumptions about the labels, requiring only a 3D rotation matrix per image regardless of the underlying object symmetry.

Rotation Equivariance Symmetries present in data can be preserved using equivariant neural networks to improve performance and sample efficiency. For the symmetry group of 3D rotations, SO(3), a number of equivariant models have been proposed. Chen et al. (2021) and Fuchs et al. (2020) introduce networks to process point cloud data with equivariance to the discrete icosahedral group and the continuous SO(3) group, respectively. Esteves et al. (2019b) combines images from structured viewpoints and then performs discrete group convolution to classify shapes. Cohen et al. (2018a) introduces spherical convolution to process signals that live on the sphere, such as images from 3D cameras. However, these methods are restricted to cases where the SO(3) group acts on the input space, which prevents their use on 2D images. Falorsi et al. (2018) and Park et al. (2022) extract 3D rotation-equivariant features from images to model object orientation, but were limited to simplistic datasets with a single object. Similar to our work, Esteves et al. (2019a) learns SO(3)-equivariant embeddings from image input for object pose prediction; however, they use a supervised loss to replicate the embeddings of a spherical convolutional network pretrained on 3D images. In contrast, our method incorporates a novel architecture for achieving SO(3) equivariance from image inputs that can be trained end-to-end on challenging pose prediction tasks.

Uncertainty over SO(3) Due to object symmetry or occlusion, there may be a set of equivalent rotations that result in the same object appearance, which makes pose prediction challenging. Most early works on object pose prediction avoided this issue by either breaking the symmetry when labelling the data (Xiang et al., 2014) or applying loss functions that handle known symmetries (Xiang et al., 2018; Wang et al., 2019). However, this approach requires knowing what symmetries are present in the data, and does not work for objects that have ambiguous orientations due to occlusion (e.g., a coffee mug when the handle is not visible). Several works have proposed models to reason about orientation uncertainty by predicting the parameters of von Mises (Prokudin et al., …
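As a one-dimensional analogue of such parametric approaches, the pose of a front-back symmetric object (like the table from the introduction) can be modeled on the circle by an equal-weight mixture of two von Mises densities. This is an illustrative sketch, not any cited method's implementation; the mode locations and concentration kappa are assumptions.

```python
import numpy as np

def von_mises_pdf(theta, mu, kappa):
    # Density of the von Mises distribution on the circle; np.i0 is the
    # modified Bessel function of the first kind, order 0 (the normalizer).
    return np.exp(kappa * np.cos(theta - mu)) / (2 * np.pi * np.i0(kappa))

def bimodal_pdf(theta, mu=0.0, kappa=8.0):
    # Equal-weight mixture with modes pi apart models front-back symmetry.
    return 0.5 * (von_mises_pdf(theta, mu, kappa)
                  + von_mises_pdf(theta, mu + np.pi, kappa))

theta = np.linspace(0, 2 * np.pi, 360, endpoint=False)
p = bimodal_pdf(theta)
# The density integrates to one and is pi-periodic (two equal modes).
assert np.isclose(p.sum() * (2 * np.pi / 360), 1.0)
assert np.allclose(p, bimodal_pdf(theta + np.pi))
```

The limitation noted above applies here: the number of mixture components must be chosen in advance, whereas a distribution represented in a Fourier basis can capture unanticipated symmetries without changing its parameterization.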

