AUTOENCODERS AS CROSS-MODAL TEACHERS: CAN PRETRAINED 2D IMAGE TRANSFORMERS HELP 3D REPRESENTATION LEARNING?

Abstract

The success of deep learning heavily relies on large-scale data with comprehensive labels, which are more expensive and time-consuming to obtain in 3D than for 2D images or natural language. This motivates the potential of utilizing models pretrained with data other than 3D as teachers for cross-modal knowledge transfer. In this paper, we revisit masked modeling in a unified fashion of knowledge distillation, and we show that foundational Transformers pretrained on 2D images or natural language can help self-supervised 3D representation learning through training Autoencoders as Cross-Modal Teachers (ACT). The pretrained Transformers are transferred as cross-modal 3D teachers using discrete variational autoencoding self-supervision, during which the Transformers are frozen with prompt tuning for better knowledge inheritance. The latent features encoded by the 3D teachers are used as the targets of masked point modeling, through which the dark knowledge is distilled to the 3D Transformer students as foundational geometry understanding. Our ACT pretrained 3D learner achieves state-of-the-art generalization capacity across various downstream benchmarks, e.g., 88.21% overall accuracy on ScanObjectNN.

1. INTRODUCTION

In recent years, AI systems powered by data-driven deep learning have been deployed in various areas (LeCun et al., 2015; He et al., 2016; Vaswani et al., 2017). The advancements in computing hardware have largely facilitated machine intelligence development, which also encourages an emerging paradigm of transferring models trained on broad data, i.e., foundational models (Bommasani et al., 2021). Great success has been witnessed in natural language processing (NLP) (Devlin et al., 2019; Radford et al., 2018; 2019; Brown et al., 2020; Radford et al., 2021), where the models are designed to learn generic representations through self-supervised knowledge probing on data of extreme size. Since the rapid development of the Transformer (Vaswani et al., 2017) in vision (Dosovitskiy et al., 2021; Liu et al., 2021b), various efforts have been made to spread this trend from NLP toward foundational 2D visual understanding (Bao et al., 2022; He et al., 2022b; Wang et al., 2022a). In this work, our ACT turns pretrained Transformers into spontaneous cross-modal teachers that provide semantically enriched masked modeling targets for 3D point clouds. Since the pretrained Transformers are tuned as 3D autoencoders, no image, language data, or 3D downstream annotations are required during this cross-modal Transformer transfer. Besides, as the tuned Transformers are only used as teachers for 3D Transformer student learning, our method does not introduce additional computing or storage costs during downstream feature transfer. Extensive experiments on various tasks show the superior generalization performance of our ACT pretrained 3D Transformers. For example, an average accuracy improvement of +11.9% is achieved on the ScanObjectNN dataset. To the best of our knowledge, this paper is the first to show that a pretrained foundational Transformer can help 3D representation learning without accessing any 2D data, language data, or 3D downstream annotations.
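To make the distillation objective concrete, the following is a minimal numpy sketch of masked point modeling against teacher latents: the student's predicted features are regressed onto the (fixed) teacher-encoded features, with the loss computed only at masked token positions. The function name and the plain mean-squared-error form are illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

def masked_distillation_loss(student_pred, teacher_latent, mask):
    """MSE between student predictions and (stop-gradient) teacher
    latent features, averaged over masked token positions only."""
    diff = (student_pred - teacher_latent) ** 2   # (N, D) per-dim error
    per_token = diff.mean(axis=-1)                # (N,) per-token error
    return float(per_token[mask].mean())          # average over masked tokens

# Toy example: 4 point tokens with 3-dim features; tokens 1 and 3 are masked.
teacher = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.], [1., 1., 0.]])
student = teacher.copy()
student[1] += 0.5   # the student errs only at one masked position
mask = np.array([False, True, False, True])

loss = masked_distillation_loss(student, teacher, mask)  # 0.125
```

Unmasked positions contribute no gradient signal here, so the student must infer the teacher's semantics for the masked geometry from visible context alone.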
ACT is a self-supervised framework that can be generalized to other modalities and tasks, and we expect it to spur further exploration of such ACT-style representation learning.



1 For example, the in-house JFT-300M dataset from Google covers over one billion labels for 300M images, and the Common Crawl dataset (Raffel et al., 2020) for NLP consists of nearly one trillion words.



Data desert. In comparison to images and free-form language, 3D (Chang et al., 2015) and 4D (Liu et al., 2022b) data are more difficult to collect and label, generally requiring more expensive and labor-intensive effort. In addition, 3D data are severely lacking in scale 1 . This motivates the usage of cross-modal knowledge transfer. Recent works either jointly train with other modalities for more effective contrast (Afham et al., 2022) or directly fine-tune 2D Transformers pretrained on image data (Wang et al., 2022b).

Pattern difference. Table 1 shows the data pattern comparison of language, 2D images, and 3D point clouds. It is observed that: (i) unlike language, 3D point clouds are usually unstructured and contain sparse semantics, which makes discrete token identification learning with a BERT-style tokenizer (Devlin et al., 2019) more difficult on point clouds (Yu et al., 2022) (see Sec. 6.1); (ii) 2D images are regularly distributed on grids, while 3D point clouds are irregularly sampled from object surfaces, a structural difference that complicates the construction of contrastive targets, both for single-modality augmentations (Hou et al., 2021) and for cross-modal correspondence (Li et al., 2022); (iii) consequently, designing a better representation with enriched semantics becomes the de facto principle for self-supervised 3D understanding. Motivated by the analysis above, we propose to train Autoencoders as Cross-Modal Teachers (ACT). Our ACT utilizes foundational Transformers pretrained with 2D images or natural language as cross-modal teachers, which carry profound knowledge and powerful representation capacity. In this way, the data desert issue in 3D is alleviated. A Transformer is employed as the generic 3D learner, which closes the architectural gap toward masked modeling representation learning.
By simply tuning the pretrained Transformers as autoencoders on 3D data in a self-supervised fashion, the Transformers can consume and encode 3D point clouds into representations with rich semantics. In order to preserve and inherit the pretrained foundational knowledge, prompt tuning (Jia et al., 2022) is used during this procedure.
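The frozen-teacher-with-prompts interface described above can be sketched as follows. This toy numpy example stands in for the real architecture: a fixed linear map plays the role of the frozen pretrained Transformer layers, and a small set of learnable prompt tokens is prepended to the 3D point-token sequence; during autoencoder tuning, only the prompts (and any lightweight heads) would receive gradient updates. All names and shapes here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen pretrained Transformer weights (never updated).
W_frozen = rng.standard_normal((3, 3))

# Learnable prompt tokens prepended to the point-token sequence;
# in ACT-style tuning these are the trainable parameters.
prompts = np.zeros((2, 3))

# Stand-in for embedded 3D point tokens.
point_tokens = rng.standard_normal((4, 3))

def encode(prompts, tokens, W):
    """Prepend prompt tokens, then apply the frozen projection
    (a stand-in for running the frozen Transformer layers)."""
    seq = np.concatenate([prompts, tokens], axis=0)
    return seq @ W

latents = encode(prompts, point_tokens, W_frozen)
# 2 prompt tokens + 4 point tokens, each mapped to a 3-dim latent.
```

The design point is that knowledge inheritance comes from keeping `W_frozen` fixed: adapting only the prompts steers the pretrained model toward 3D inputs without overwriting its 2D/language knowledge.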

Self-Supervised Representation Learning for 3D Geometric Processing is currently arousing significant interest in the community. Classical methods are built upon reconstruction-based geometry understanding pretext tasks, e.g., point cloud part reordering (Sauder & Sievers, 2019), orientation estimation (Poursaeed et al., 2020), local and global reconstruction (Rao et al., 2020), flow consistency (Mittal et al., 2020), deformation (Achituve et al., 2021), and occlusion (Wang et al., 2021). Concurrently, Xie et al. (2020) propose PointContrast to learn discriminative view consistency between augmented point clouds. Following this direction, various works have been proposed (Zhang et al., 2021; Hou et al., 2021; Chen et al., 2022). Recently, many works have proposed to apply DAE-style pretraining to point cloud Transformers, and remarkable success has been achieved. Yu et al. (2022) pioneer this direction by extending the idea of BERT-style pretraining (Devlin et al., 2019; Bao et al., 2022), combined with a global contrastive objective (He et al., 2020). Liu et al. (2022a) propose to add some noisy points and classify whether the masked tokens are real or fake for each masked position, which shares a similar pattern with Selfie (Trinh et al., 2019), which classifies whether masked image patches are real or fake. Pang et al. (2022) propose exploring MAE on point clouds by masked modeling of 3D point cloud coordinates. We follow this DAE-style representation learning paradigm, but different from previous methods, our work seeks to use latent features encoded by the 3D autoencoder with pretrained foundational Transformers as masked modeling targets.

Availability: //github.com/

