IMAGE TO SPHERE: LEARNING EQUIVARIANT FEATURES FOR EFFICIENT POSE PREDICTION

Abstract

Predicting the pose of objects from a single image is an important but difficult computer vision problem. Methods that predict a single point estimate handle objects with symmetries poorly and cannot represent uncertainty. Alternatively, some works predict a distribution over orientations in SO(3). However, training such models can be computation- and sample-inefficient. Instead, we propose a novel mapping of features from the image domain to the 3D rotation manifold. Our method then leverages SO(3)-equivariant layers, which are more sample efficient, and outputs a distribution over rotations that can be sampled at arbitrary resolution. We demonstrate the effectiveness of our method at object orientation prediction, achieving state-of-the-art performance on the popular PASCAL3D+ dataset. Moreover, we show that our method can model complex object symmetries without any modifications to the parameters or loss function.

1. INTRODUCTION

Determining the pose of an object from an image is a challenging problem with important applications in augmented reality, robotics, and autonomous vehicles. Traditionally, pose estimation has been approached as a point regression problem, minimizing the error to a single ground truth 3D rotation. In this formulation, object symmetries are manually disambiguated using domain knowledge (Xiang et al., 2018) and uncertainty is not accounted for. This approach to pose estimation cannot scale to the open-world setting, where we wish to reason about uncertainty from sensor noise or occlusions and model novel objects with unknown symmetries.

Recent work has instead attempted to learn a distribution over poses. Single rotation labels can be modeled as random samples from the distribution over object symmetries, which removes the need to inject domain knowledge. For instance, a table with front-back symmetry presents a challenge for single pose regression methods, but can be effectively modeled with a bimodal distribution.

The drawback to learning distributions over the large space of 3D rotations is that it requires large amounts of data, especially when modeling hundreds of instances across multiple object categories. This poor data efficiency can be improved by constraining the weights to encode symmetries present in the problem (Cohen & Welling, 2016). The pose prediction problem exhibits 3D rotational symmetry, described by the group SO(3): if we change the canonical reference frame of an object, the predictions of our model should transform correspondingly. For certain input modalities, such as point clouds or 3D camera images, the symmetry group acts directly on the input data via 3D rotation matrices. Thus, many networks exploit the symmetry with end-to-end SO(3) equivariance to achieve sample-efficient pose estimation. However, achieving 3D rotation equivariance in a network trained on 2D images is less explored.
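The equivariance property described above can be stated precisely. In standard notation (the representation symbols below are conventional, not taken from this paper), a map f is equivariant to a group G if transforming the input and then applying f gives the same result as applying f and then transforming the output:

```latex
f\big(\rho_{\mathrm{in}}(g)\, x\big) \;=\; \rho_{\mathrm{out}}(g)\, f(x)
\qquad \forall\, g \in \mathrm{SO}(3),
```

where \rho_{\mathrm{in}} and \rho_{\mathrm{out}} are representations of the group acting on the input and output feature spaces. Weight sharing that enforces this constraint is what yields the sample-efficiency gains of equivariant architectures (Cohen & Welling, 2016).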
Thus, we present Image2Sphere (I2S), a novel method that learns SO(3)-equivariant features to represent distributions over 3D rotations. Features extracted by a convolutional network are projected from image space onto half of the 2-sphere. Spherical convolution of these features with a learned filter defined over the entire 2-sphere then produces a signal that is equivariant to 3D rotations. A final SO(3) group convolution yields a probability distribution over SO(3) parameterized in the Fourier domain. Our method can be trained to accurately predict object orientation and to correctly express ambiguous orientations for objects whose symmetries are not specified at training time.
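To make the projection step concrete, below is a minimal numpy sketch (not the authors' implementation) of lifting a 2D feature map onto points of the visible hemisphere. The Fibonacci-lattice grid, the orthographic projection convention, and the nearest-neighbor sampling are all simplifying assumptions made for illustration.

```python
import numpy as np

def project_to_hemisphere(feat, n_pts=512):
    """Sample image features onto the half 2-sphere (z >= 0).

    feat: (C, H, W) feature map from a 2D convolutional backbone.
    Returns sphere points of shape (n, 3) and their features (n, C).
    Assumes an orthographic projection: a sphere point (x, y, z)
    reads the image at its (x, y) coordinates.
    """
    C, H, W = feat.shape
    # Fibonacci lattice: approximately uniform points on the full sphere.
    i = np.arange(n_pts)
    phi = np.pi * (3.0 - np.sqrt(5.0)) * i          # golden-angle increments
    z = 1.0 - 2.0 * (i + 0.5) / n_pts
    r = np.sqrt(np.maximum(0.0, 1.0 - z**2))
    pts = np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)
    pts = pts[pts[:, 2] >= 0]                       # keep the visible half
    # Map (x, y) in [-1, 1] to pixel indices (nearest neighbor).
    cols = np.clip(((pts[:, 0] + 1) / 2 * (W - 1)).round().astype(int), 0, W - 1)
    rows = np.clip(((pts[:, 1] + 1) / 2 * (H - 1)).round().astype(int), 0, H - 1)
    return pts, feat[:, rows, cols].T               # (n, 3), (n, C)

pts, sph_feat = project_to_hemisphere(np.random.rand(16, 8, 8))
```

The resulting per-point features form a signal on the hemisphere, which is the input a spherical convolution layer would consume; in practice one would use a differentiable sampling scheme rather than nearest-neighbor rounding.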

