SELF-SUPERVISED MULTI-VIEW LEARNING VIA AUTO-ENCODING 3D TRANSFORMATIONS

Abstract

3D object representation learning is a fundamental challenge in computer vision for drawing inferences about the 3D world. Recent advances in deep learning have shown their effectiveness in 3D object recognition, among which view-based methods have performed best so far. However, feature learning of multiple views in existing methods is mostly trained in a supervised fashion, which often requires a large number of costly data labels. Hence, it is critical to learn multi-view feature representations in a self-supervised fashion. To this end, we propose a novel self-supervised learning paradigm of Multi-View Transformation Equivariant Representations (MV-TER), exploiting the equivariant transformations of a 3D object and its projected multiple views. Specifically, we perform a 3D transformation on a 3D object, and obtain multiple views before and after the transformation via projection. Then, we self-train a representation learning module to capture the intrinsic 3D object representation by decoding the 3D transformation parameters from the fused feature representations of multiple views before and after transformation. Experimental results demonstrate that the proposed MV-TER significantly outperforms state-of-the-art view-based approaches in 3D object classification and retrieval tasks.

1. INTRODUCTION

3D object representation has become increasingly prominent for a wide range of applications, such as 3D object recognition and retrieval (Maturana & Scherer, 2015; Qi et al., 2016; Brock et al., 2016; Qi et al., 2017a;b; Klokov & Lempitsky, 2017; Su et al., 2015; Feng et al., 2018; Yu et al., 2018; Yang & Wang, 2019). Recent advances in Convolutional Neural Network (CNN) based methods have shown success in 3D object recognition and retrieval (Su et al., 2015; Feng et al., 2018; Yu et al., 2018; Yang & Wang, 2019). One important family of methods is view-based methods, which project a 3D object into multiple views and learn a compact 3D representation by fusing the feature maps of these views for downstream tasks. Feature learning of multiple views in existing approaches is mostly performed in a supervised fashion, hinging on a large number of data labels, which limits wide applicability. Hence, self-supervised learning is in demand to alleviate the dependence on labels by exploiting unlabeled data to train multi-view feature representations in an unsupervised or (semi-)supervised fashion. Many attempts have been made to explore self-supervisory signals at various levels of visual structure for representation learning. The self-supervised learning framework requires only unlabeled data to formulate a pretext learning task (Kolesnikov et al., 2019), where a target objective can be computed without any supervision. These pretext tasks can be summarized into four categories (Jing & Tian, 2019): generation-based (Zhang et al., 2016; Pathak et al., 2016; Srivastava et al., 2015), context-based, free semantic label-based (Faktor & Irani, 2014; Stretcu & Leordeanu, 2015; Ren & Jae Lee, 2018), and cross modal-based (Sayed et al., 2018; Korbar et al., 2018).
Among them, context-based pretext tasks include representation learning from image transformations, which is closely connected with transformation equivariant representations, as such representations transform equivariantly with the transformed images. Transformation equivariant representation learning assumes that representations equivarying to transformations are able to encode the intrinsic structures of data, such that the transformations can be reconstructed from the representations before and after transformation (Qi, 2019). Learning transformation equivariant representations has been advocated in Hinton's seminal work on learning transformation capsules (Hinton et al., 2011). Following this, a variety of approaches have been proposed to learn transformation equivariant representations (Kivinen & Williams, 2011; Sohn & Lee, 2012; Schmidt & Roth, 2012; Skibbe, 2013; Lenc & Vedaldi, 2015; Gens & Domingos, 2014; Dieleman et al., 2015; 2016; Zhang et al., 2019; Qi et al., 2019; Gao et al., 2020; Wang et al., 2020). Nevertheless, these works focus on transformation equivariant representation learning of a single modality, such as 2D images or 3D point clouds. In this paper, we propose to learn Multi-View Transformation Equivariant Representations (MV-TER) by decoding the 3D transformations from multiple 2D views. This is inspired by the equivariant transformations of a 3D object and its projected multiple 2D views. That is, when we perform a 3D transformation on a 3D object, the 2D views projected from the 3D object via fixed viewpoints will transform equivariantly. In contrast to previous works, where 2D/3D transformations are decoded from a single original image/point cloud and its transformed counterpart, we exploit the equivariant transformations of a 3D object and its projected 2D views.
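To make this equivariance concrete, the following is a minimal numpy sketch (our own toy illustration, not the paper's rendering pipeline). Assuming an orthographic camera and a 3D rotation that commutes with the viewpoint rotation (here, both about the z-axis), the projected view of the transformed object is exactly a 2D-rotated version of the original view; the helper names `rotation_z` and `project` are hypothetical.

```python
import numpy as np

def rotation_z(theta):
    """3x3 rotation matrix about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def project(points, view):
    """Orthographic projection of Nx3 points for a fixed viewpoint:
    rotate into the camera frame, then drop the depth axis."""
    return (points @ view.T)[:, :2]

rng = np.random.default_rng(0)
cloud = rng.standard_normal((100, 3))   # toy 3D object as a point cloud
t = rotation_z(np.pi / 6)               # 3D transformation applied to the object
view = rotation_z(np.pi / 4)            # one fixed camera viewpoint

# The view of the transformed object equals the original view transformed
# by the induced 2D rotation (the top-left 2x2 block of t):
assert np.allclose(project(cloud @ t.T, view),
                   project(cloud, view) @ t[:2, :2].T)
```

With perspective cameras and general rotations the induced 2D transformation is no longer this simple closed form, which is why MV-TER learns to decode the 3D transformation from view features rather than relying on an analytic relation.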
We propose to decode 3D transformations from multiple views of a 3D object before and after transformation, which serves as self-supervisory regularization to enforce the learning of intrinsic 3D representations. By estimating 3D transformations from the fused feature representations of the multiple original views and those of their equivariantly transformed counterparts from the same viewpoints, we enable accurate learning of 3D object representations even with a limited number of labels. Specifically, we first perform a 3D transformation on a 3D object (e.g., a point cloud or mesh), and render the original and transformed 3D objects into multiple 2D views with a fixed camera setup. Then, we feed these views into a representation learning module to infer representations of the multiple views before and after transformation, respectively. A decoder is set up to predict the applied 3D transformation from the fused representations of the multiple views before and after transformation. We formulate multi-view transformation equivariant representation learning as a regularizer alongside the loss of a specific task (e.g., classification) to train the entire network end-to-end. Experimental results demonstrate that the proposed method significantly outperforms state-of-the-art view-based models in 3D object classification and retrieval tasks. Our main contributions are summarized as follows.

• We propose Multi-View Transformation Equivariant Representations (MV-TER) to learn 3D object representations from multiple 2D views that transform equivariantly with the 3D transformation, in a self-supervised fashion.

• We formalize the MV-TER as a self-supervisory regularizer to learn 3D object representations by decoding the 3D transformation from fused features of projected multiple views before and after the 3D transformation of the object.

• Experiments demonstrate that the proposed method outperforms state-of-the-art view-based methods in 3D object classification and retrieval tasks in a self-supervised fashion.
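The pipeline above, encode each view, fuse the per-view features before and after the transformation, decode the transformation, and add the decoding error to the task loss, can be sketched as follows. This is a toy numpy sketch with linear stand-ins for the CNN encoder, fusion module, and transformation decoder; all names, shapes, and the weight `lam` are hypothetical, and the actual method uses multi-view CNN backbones on rendered views.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(view_pixels, W):
    """Toy per-view encoder: one linear layer + tanh (stand-in for a CNN)."""
    return np.tanh(view_pixels.ravel() @ W)

def fuse(features):
    """Aggregate per-view features by average pooling."""
    return np.mean(features, axis=0)

# Hypothetical shapes: 4 views of 8x8 pixels, 16-d features, 3 rotation angles.
n_views, d_feat, d_params = 4, 16, 3
W_enc = rng.standard_normal((64, d_feat)) * 0.1
W_dec = rng.standard_normal((2 * d_feat, d_params)) * 0.1

views_before = rng.standard_normal((n_views, 8, 8))  # renders of the object
views_after = rng.standard_normal((n_views, 8, 8))   # renders after transform
t_true = np.array([0.5, -0.2, 0.1])                  # applied rotation angles

# Fuse features of the views before and after the 3D transformation,
# then decode the transformation parameters from their concatenation.
z_before = fuse([encode(v, W_enc) for v in views_before])
z_after = fuse([encode(v, W_enc) for v in views_after])
t_pred = np.concatenate([z_before, z_after]) @ W_dec

# Self-supervisory regularizer: error of the decoded transformation,
# added to the (supervised) task loss with weight lam.
loss_ter = np.mean((t_pred - t_true) ** 2)
lam = 0.1
total_loss = 0.0 + lam * loss_ter  # 0.0 stands in for the task loss
```

Because `loss_ter` needs only the transformation parameters we sampled ourselves, it can be computed on unlabeled objects, which is what makes the regularizer self-supervised.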

2. RELATED WORKS

In this section, we review previous works on transformation equivariant representations and multi-view based neural networks.



Gens & Domingos (2014) propose an approximately equivariant convolutional architecture, which utilizes sparse, high-dimensional feature maps to deal with groups of transformations. Dieleman et al. (2015) show that rotation symmetry can be exploited in convolutional networks to effectively learn an equivariant representation. Dieleman et al. (2016) extend this work to other computer vision tasks that exhibit cyclic symmetry. Cohen & Welling (2016) propose group equivariant convolutions, which equivary to more types of transformations. The idea of

