SELF-SUPERVISED MULTI-VIEW LEARNING VIA AUTO-ENCODING 3D TRANSFORMATIONS

Abstract

3D object representation learning is a fundamental challenge in computer vision for drawing inferences about the 3D world. Recent advances in deep learning have proven effective for 3D object recognition, among which view-based methods have performed best so far. However, feature learning over multiple views in existing methods is mostly trained in a supervised fashion, which often requires a large amount of data labels acquired at high cost. Hence, it is critical to learn multi-view feature representations in a self-supervised fashion. To this end, we propose a novel self-supervised learning paradigm of Multi-View Transformation Equivariant Representations (MV-TER), exploiting the equivariant transformations of a 3D object and its projected multiple views. Specifically, we perform a 3D transformation on a 3D object and obtain multiple views before and after the transformation via projection. Then, we self-train a representation learning module to capture the intrinsic 3D object representation by decoding the 3D transformation parameters from the fused feature representations of the multiple views before and after the transformation. Experimental results demonstrate that the proposed MV-TER significantly outperforms state-of-the-art view-based approaches on 3D object classification and retrieval tasks.
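As a minimal illustration of the data pipeline described above, the following NumPy sketch applies a random 3D rotation to a toy point cloud and renders multiple views before and after the transformation via orthographic projection. The function names (`rotation_matrix`, `render_views`) and the projection scheme are illustrative assumptions, not the paper's actual rendering setup; the representation network and fusion module are omitted.

```python
import numpy as np

def rotation_matrix(angles):
    """Compose a 3D rotation from Euler angles (rx, ry, rz) in radians."""
    rx, ry, rz = angles
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(rx), -np.sin(rx)],
                   [0, np.sin(rx),  np.cos(rx)]])
    Ry = np.array([[ np.cos(ry), 0, np.sin(ry)],
                   [0, 1, 0],
                   [-np.sin(ry), 0, np.cos(ry)]])
    Rz = np.array([[np.cos(rz), -np.sin(rz), 0],
                   [np.sin(rz),  np.cos(rz), 0],
                   [0, 0, 1]])
    return Rz @ Ry @ Rx

def render_views(points, n_views=4):
    """Orthographic projections from n_views cameras circling the y-axis."""
    views = []
    for k in range(n_views):
        theta = 2 * np.pi * k / n_views
        cam = rotation_matrix((0.0, theta, 0.0))
        views.append((points @ cam.T)[:, :2])  # rotate into camera frame, drop depth
    return views

rng = np.random.default_rng(0)
obj = rng.standard_normal((1024, 3))        # toy 3D object (point cloud)
t = rng.uniform(-np.pi / 4, np.pi / 4, 3)   # transformation parameters (self-supervisory target)
obj_t = obj @ rotation_matrix(t).T          # transformed 3D object

views_before = render_views(obj)            # multiple views before transformation
views_after = render_views(obj_t)           # multiple views after transformation
# A representation module would now encode both view sets, fuse the per-view
# features, and be trained to regress the parameters t from the fused features.
```

The rotation parameters `t` are known to the training procedure by construction, which is what makes the decoding objective self-supervised: no human annotation is involved.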

1. INTRODUCTION

3D object representation has become increasingly prominent for a wide range of applications, such as 3D object recognition and retrieval (Maturana & Scherer, 2015; Qi et al., 2016; Brock et al., 2016; Qi et al., 2017a;b; Klokov & Lempitsky, 2017; Su et al., 2015; Feng et al., 2018; Yu et al., 2018; Yang & Wang, 2019). Recent advances in Convolutional Neural Network (CNN) based methods have shown success in 3D object recognition and retrieval (Su et al., 2015; Feng et al., 2018; Yu et al., 2018; Yang & Wang, 2019). One important family consists of view-based methods, which project a 3D object into multiple views and learn a compact 3D representation by fusing the feature maps of these views for downstream tasks. Feature learning of multiple views in existing approaches is mostly trained in a supervised fashion, hinging on a large amount of data labels, which prevents wide applicability. Hence, self-supervised learning is in demand to alleviate the dependence on labels by exploiting unlabeled data to train multi-view feature representations in an unsupervised or (semi-)supervised fashion. Many attempts have been made to explore self-supervisory signals at various levels of visual structure for representation learning. The self-supervised learning framework requires only unlabeled data in order to formulate a pretext learning task (Kolesnikov et al., 2019), where a target objective can be computed without any supervision. These pretext tasks can be grouped into four categories (Jing & Tian, 2019): generation-based (Zhang et al., 2016; Pathak et al., 2016; Srivastava et al., 2015), context-based, free semantic label-based (Faktor & Irani, 2014; Stretcu & Leordeanu, 2015; Ren & Jae Lee, 2018), and cross modal-based (Sayed et al., 2018; Korbar et al., 2018).
Among them, context-based pretext tasks include representation learning from image transformations, which is closely connected with transformation equivariant representations, as such representations transform equivariantly with the transformed images. Transformation equivariant representation learning assumes that representations equivarying to transformations are able to encode the intrinsic structures of data, such that the transformations can be reconstructed from the representations before and after transformation (Qi, 2019). Learning
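The idea that a transformation can be reconstructed from representations before and after it is applied can be illustrated with a toy example. Here the 3D points themselves stand in for learned equivariant features (an illustrative assumption; an actual model would use network activations), and the rotation is recovered in closed form via orthogonal Procrustes alignment:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy "representation": the 3D points themselves, standing in for learned
# features that transform equivariantly with the input.
z_before = rng.standard_normal((512, 3))

# Random proper rotation (orthogonal matrix with det = +1 via QR decomposition).
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
Q *= np.sign(np.linalg.det(Q))
z_after = z_before @ Q.T  # representation after the transformation

# Orthogonal Procrustes: R = argmin_R ||R z_before^T - z_after^T||_F,
# solved by the SVD of the cross-covariance of the two representations.
U, _, Vt = np.linalg.svd(z_after.T @ z_before)
R = U @ Vt

assert np.allclose(R, Q)  # the transformation is reconstructed exactly
```

For perfectly equivariant features and a noiseless linear transformation the recovery is exact; MV-TER instead learns this decoding with a network, from fused multi-view features rather than raw coordinates.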

