AUTOENCODERS AS CROSS-MODAL TEACHERS: CAN PRETRAINED 2D IMAGE TRANSFORMERS HELP 3D REPRESENTATION LEARNING?

Abstract

The success of deep learning heavily relies on large-scale data with comprehensive labels, which are more expensive and time-consuming to acquire in 3D than for 2D images or natural languages. This motivates the use of models pretrained on data beyond 3D as teachers for cross-modal knowledge transfer. In this paper, we revisit masked modeling in a unified fashion of knowledge distillation, and we show that foundational Transformers pretrained with 2D images or natural languages can help self-supervised 3D representation learning through training Autoencoders as Cross-Modal Teachers (ACT). The pretrained Transformers are transferred as cross-modal 3D teachers using discrete variational autoencoding self-supervision, during which the Transformers are frozen with prompt tuning for better knowledge inheritance. The latent features encoded by the 3D teachers are used as the targets of masked point modeling, wherein the dark knowledge is distilled to the 3D Transformer students as foundational geometry understanding. Our ACT pretrained 3D learner achieves state-of-the-art generalization capacity across various downstream benchmarks, e.g., 88.21% overall accuracy on ScanObjectNN.

1. INTRODUCTION

In recent years, AI systems powered by data-driven deep learning have been deployed in various areas (LeCun et al., 2015; He et al., 2016; Vaswani et al., 2017). Advancements in computing hardware have largely facilitated the development of machine intelligence, encouraging an emerging paradigm of transferring models trained on broad data, i.e., foundational models (Bommasani et al., 2021). Great success has been witnessed in natural language processing (NLP) (Devlin et al., 2019; Radford et al., 2018; 2019; Brown et al., 2020; Radford et al., 2021), where models are designed to learn generic representations through self-supervised knowledge probing on data of extreme scale. With the rapid development of the Transformer (Vaswani et al., 2017) in vision (Dosovitskiy et al., 2021; Liu et al., 2021b), various efforts have been made to spread this trend from NLP towards foundational 2D visual understanding (Bao et al., 2022; He et al., 2022b; Wang et al., 2022a). Meanwhile, compared to 2D vision and NLP, this course towards foundational visual computing is significantly lagging in the 3D community. We ask: what makes 3D representation learning more challenging than 2D vision or NLP? We offer some analytical answers from the following three perspectives: i. Architecture disunity. Pioneering architectures like PointNet (Qi et al., 2017a;b) can only encode 3D coordinates and are not applicable to masked denoising autoencoding (DAE) (Vincent et al., 2008; 2010; Devlin et al., 2019), which has proved successful in NLP and 2D vision (He et al., 2022b). Transformers (Vaswani et al., 2017) have now closed this architectural gap, enabling a unified representation across all modality formats (Wang et al., 2022a) and bringing great potential for extending DAE to 3D (Yu et al., 2022; Pang et al., 2022). ii. Data desert.
In comparison to images and free-form languages, 3D (Chang et al., 2015) or 4D (Liu et al., 2022b) data are more difficult to collect and label, generally requiring more expensive and labor-intensive efforts. In addition, 3D data are seriously scarce considering the scale of data in other modalities¹. This motivates the use of cross-modal knowledge transfer: recent works either jointly train with other modalities for more effective contrast (Afham et al., 2022) or directly fine-tune 2D Transformers pretrained on image data (Wang et al., 2022b). iii. Pattern difference. Table 1 shows a data pattern comparison of languages, 2D images, and 3D point clouds. It is observed that: (i) a 3D point cloud is usually unstructured and contains sparse semantics, unlike language. This makes discrete token identification learning with a BERT-style tokenizer (Devlin et al., 2019) more difficult on point clouds (Yu et al., 2022) (see Sec. 6.1). (ii) 2D images are regularly distributed on grids, while 3D point clouds are irregularly sampled from object surfaces. This structural difference makes it hard to construct contrastive targets, both for single-modality augmentations (Hou et al., 2021) and for cross-modal correspondence (Li et al., 2022). In light of these differences, designing a representation with enriched semantics becomes the de facto principle for self-supervised 3D understanding. Motivated by the analysis above, we propose to train Autoencoders as Cross-Modal Teachers (ACT). Our ACT utilizes foundational Transformers pretrained with 2D images or natural languages as cross-modal teachers, carrying profound knowledge and powerful representation capacity. In this way, the data desert issue in 3D is alleviated. A Transformer is employed as the generic 3D learner, which closes the architectural gap toward masked modeling representation learning.
By simply tuning pretrained Transformers as autoencoders on 3D data in a self-supervised fashion, the Transformers can consume and encode 3D point clouds into representations with rich semantics. To preserve and inherit the pretrained foundational knowledge, prompt tuning (Jia et al., 2022) is used during this procedure. As a result, ACT turns the pretrained Transformers into spontaneous cross-modal teachers that provide semantically enriched masked modeling targets for 3D point clouds. Since the pretrained Transformers are tuned as 3D autoencoders, no image or language data and no 3D downstream annotations are required during this cross-modal Transformer transfer. Moreover, as the tuned Transformers are only used as the teacher for the 3D Transformer student, our method introduces no additional computing or storage costs during downstream feature transferring. Extensive experiments on various tasks show the superior generalization performance of our ACT pretrained 3D Transformers; for example, an average accuracy improvement of +11.9% is achieved on the ScanObjectNN dataset. To the best of our knowledge, this paper is the first to show that a pretrained foundational Transformer can help 3D representation learning without accessing any 2D or language data or any 3D downstream annotations. ACT is a self-supervised framework that can be generalized to other modalities and tasks, and we expect it to spur more exploration of such ACT-style representation learning.

2. RELATED WORKS

Self-Supervised Representation Learning for 3D Geometric Processing is currently arousing significant interest in the community. Classical methods are built upon reconstruction-based geometry understanding pretext tasks, e.g., point cloud part reordering (Sauder & Sievers, 2019), orientation estimation (Poursaeed et al., 2020), local and global reconstruction (Rao et al., 2020), flow consistency (Mittal et al., 2020), deformation (Achituve et al., 2021), and occlusion (Wang et al., 2021). Concurrently, Xie et al. (2020) propose PointContrast to learn discriminative view consistency between augmented point clouds, and various works have followed this direction (Zhang et al., 2021; Hou et al., 2021; Chen et al., 2022). Recently, many works have applied DAE pretraining to point cloud Transformers, and remarkable success has been achieved. Yu et al. (2022) pioneer this direction by extending the idea of BERT-style pretraining (Devlin et al., 2019; Bao et al., 2022), combined with a global contrastive objective (He et al., 2020). Liu et al. (2022a) propose to add noisy points and classify, for each masked position, whether the masked token is real or fake, which shares a similar pattern with Selfie (Trinh et al., 2019) that classifies whether masked image patches are real or fake. Afham et al. (2022) propose an intra- and inter-modal contrastive learning framework among augmented point clouds and the corresponding rendered 2D images. By utilizing prior geometric information for dense association, another line of work explores fine-grained local feature matching. Liu et al. (2021a) propose a contrastive knowledge distillation method to align fine-grained 2D and 3D features. Li et al. (2022) propose a simple contrastive learning framework for inter- and intra-modal dense feature contrast, with the Hungarian algorithm used for better correspondence.
Recently, great progress has been made by directly using pretrained 2D image encoders via supervised fine-tuning. Image2Point (Xu et al., 2022) transfers pretrained weights by inflating convolutional layers. P2P (Wang et al., 2022b) projects 3D point clouds to 2D images through a learnable coloring module, which serve as input to the image backbone. Our work also explores whether pretrained foundational models can help 3D learning. However, our method (1) does not use the pretrained 2D or language models as the backbone model for inference, (2) uses pretrained foundational models from other modalities only during self-supervised pretraining, without downstream 3D annotations, and (3) does not need paired point-image or point-language data. Besides 2D images, some works utilize natural languages for contrastive 3D representation learning (Rozenberszki et al., 2022), zero-shot learning (Zhang et al., 2022c), and scene understanding (Zhang et al., 2023).

3.1. 3D POINT CLOUD REPRESENTATIONS WITH TRANSFORMERS

Different from images that lie on regular grids, point clouds are known to be irregular and less structured. Many efforts have been devoted to deep learning architecture design for point cloud data (Qi et al., 2017a;b; Wang et al., 2019), exploiting the permutation and translation invariance of a point set for feature learning. Instead of purely relying on such specialized backbones, we leverage the Transformer backbone (Vaswani et al., 2017), which is easier to unify with other modalities such as images and language, facilitating cross-modal knowledge transfer. We feed the Transformer with local geometry patch embeddings computed by specialized point networks like Qi et al. (2017a) to output more effective geometric representations.

Local Geometry Patch Embedding Suppose we have a point cloud $\mathbf{P} = \{p_i \mid i = 1, 2, \ldots, N\} \in \mathbb{R}^{N \times 3}$ with $N$ coordinates in $(x, y, z)$ Cartesian space. Following Yu et al. (2022), we first sample $N_s$ seed points using farthest point sampling (FPS). The point cloud $\mathbf{P}$ is then grouped into $N_s$ neighborhoods $\mathcal{N} = \{\mathcal{N}_i \mid i = 1, 2, \ldots, N_s\} \in \mathbb{R}^{N_s \times K \times 3}$ whose centroids are the seed point set $\mathbf{P}_s$. Each neighborhood contains the $K$ points found by a $K$-nearest-neighbor search around the corresponding seed point. The local geometry feature $x_i$ around each seed point $p_i \in \mathbf{P}_s$ is computed by max-pooling per-point features within the neighborhood:

$$x_i = \mathrm{MAX}_{p_{i,j} \in \mathcal{N}_i}\, \Phi_\theta(\xi_{i,j}), \tag{1}$$

where $\Phi_\theta(\cdot)$ is a point feature extractor with parameters $\theta$, e.g., a per-point MLP as in (Qi et al., 2017a;b), and $\xi_{i,j}$ is the feature of the $j$-th neighbor point $p_{i,j}$ in the neighborhood $\mathcal{N}_i$. The set of neighborhood features is used as the token features fed to the following Transformer blocks.

Transformer Point Feature Encoding A standard Transformer block (Vaswani et al., 2017) is used as the encoder to further transform the local patch embeddings $\mathbf{X} = \{x_i \mid i = 1, 2, \ldots, N_s\} \in \mathbb{R}^{N_s \times C}$, with $C$ being the embedding size. Following Yu et al. (2022), we use a two-layer MLP $\psi_\rho$ with learnable parameters $\rho$ as the positional embedding, which is applied to every block for stable training:

$$\mathbf{E}_{pos} = \big[\mathbf{E}^{[CLS]}_{pos};\ \psi_\rho(\mathbf{P}_s)\big], \quad \mathbf{E}^{[CLS]}_{pos} \in \mathbb{R}^{C} \tag{2}$$

$$\mathbf{h}_0 = \big[\mathbf{E}_{[CLS]};\ x_1;\ x_2;\ \cdots;\ x_{N_s}\big] + \mathbf{E}_{pos}, \quad \mathbf{E}_{[CLS]} \in \mathbb{R}^{C},\ \mathbf{E}_{pos} \in \mathbb{R}^{(N_s+1) \times C} \tag{3}$$

$$\mathbf{h}'_\ell = \mathrm{MSA}\big(\mathrm{LN}(\mathbf{h}_{\ell-1} + \mathbf{E}_{pos})\big) + \mathbf{h}_{\ell-1}, \quad \ell = 1 \ldots L \tag{4}$$

$$\mathbf{h}_\ell = \mathrm{MLP}\big(\mathrm{LN}(\mathbf{h}'_\ell)\big) + \mathbf{h}'_\ell, \quad \ell = 1 \ldots L \tag{5}$$

where MSA denotes multi-head self-attention, LN denotes LayerNorm, and MLP is a two-layer network with GELU non-linearity. $\mathbf{E}_{[CLS]}$ is a learnable global representation embedding with $\mathbf{E}^{[CLS]}_{pos}$ as its learnable positional embedding (Dosovitskiy et al., 2021).
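To make the pipeline above concrete, here is a minimal NumPy sketch of Eqns. (1)-(5): FPS seeding, K-NN grouping with max-pooled per-point features, and a toy pre-LN Transformer that adds the positional embedding at every block. All function names are illustrative; the identity feature map stands in for $\Phi_\theta$, and a projection-free single-head attention with a ReLU in place of the two-layer GELU MLP stands in for the real blocks.

```python
import numpy as np

def farthest_point_sample(points, n_seeds):
    # Greedy FPS: repeatedly pick the point farthest from the chosen seed set.
    chosen, dist = [0], np.full(len(points), np.inf)
    for _ in range(n_seeds - 1):
        dist = np.minimum(dist, np.linalg.norm(points - points[chosen[-1]], axis=1))
        chosen.append(int(np.argmax(dist)))
    return np.asarray(chosen)

def group_and_embed(points, n_seeds, k, phi=lambda x: x):
    # Eqn (1): x_i = max-pool of Phi_theta over the K-NN neighborhood of seed i.
    seeds = points[farthest_point_sample(points, n_seeds)]
    tokens = []
    for s in seeds:
        idx = np.argsort(np.linalg.norm(points - s, axis=1))[:k]  # K nearest points
        tokens.append(phi(points[idx] - s).max(axis=0))           # channel-wise max
    return np.stack(tokens)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def msa(x):
    # Toy single-head self-attention with identity Q/K/V projections.
    a = x @ x.T / np.sqrt(x.shape[-1])
    a = np.exp(a - a.max(-1, keepdims=True))
    a /= a.sum(-1, keepdims=True)
    return a @ x

def encode(tokens, pos, cls, cls_pos, n_layers=2):
    e_pos = np.vstack([cls_pos, pos])           # Eqn (2): prepend [CLS] pos embed
    h = np.vstack([cls, tokens]) + e_pos        # Eqn (3): prepend [CLS] token
    for _ in range(n_layers):
        h = msa(layer_norm(h + e_pos)) + h      # Eqn (4): E_pos added at every block
        h = np.maximum(layer_norm(h), 0.0) + h  # Eqn (5): ReLU stands in for the MLP
    return h

rng = np.random.default_rng(0)
P = rng.normal(size=(64, 3))              # N = 64 points
X = group_and_embed(P, n_seeds=8, k=4)    # N_s = 8 tokens (C = 3 in this toy)
pos = rng.normal(size=(8, 3))
h_L = encode(X, pos, rng.normal(size=(1, 3)), rng.normal(size=(1, 3)))
```

With $N_s = 8$ seeds, `h_L` has $N_s + 1$ rows because of the prepended [CLS] token.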

3.2. KNOWLEDGE DISTILLATION: A UNIFIED VIEW OF MASKED MODELING

Masked signal modeling can be viewed as an extension of classical denoising autoencoders (DAE) with masked corruption (He et al., 2022b), which has recently been explored for language (Devlin et al., 2019) and vision (Bao et al., 2022) models. Formally, we are given a sequence of $N_t$ tokens $\mathbf{T} = \{t_i \mid i = 1, 2, \ldots, N_t\}$, e.g., the token embeddings of an RGB image or of point cloud data. The objective is to train a student encoder $f^S$ to predict/reconstruct the output of a teacher encoder $f^T$, where the teacher could be a discrete variational autoencoder (dVAE) (Bao et al., 2022) or simply an identity mapping (He et al., 2022b). In this fashion, the student learns the dark knowledge within the data under the guidance of the teacher. To corrupt the input data, a set of masks $\mathbf{M} = \{m_i \mid i = 1, 2, \ldots, N_t\} \in \{0, 1\}^{N_t}$ is generated, indicating for each position whether the token is masked. A learnable corruption embedding $e_{[M]}$ replaces each masked position, with which the corrupted representation $\mathbf{Z}_M = \mathbb{1}(\mathbf{M}) \odot e_{[M]} + \mathbb{1}(\mathbf{1} - \mathbf{M}) \odot \mathbf{T}$ is fed to the encoder (Devlin et al., 2019) or decoder (He et al., 2022b)². Here, $\odot$ denotes the Hadamard product and $\mathbb{1}$ is the indicator function. With a distance function $\mathcal{L}_D(\cdot, \cdot)$ defined in some metric space $\mathcal{D}$ and $h^S$, $h^T$ as the decoders, the objective is to minimize:

$$-\sum_{i=1}^{N_t} m_i \cdot \mathcal{L}_D\big(h^S \circ f^S(\mathbf{Z}_M),\ h^T \circ f^T(\mathbf{T})\big). \tag{6}$$

The decoders $h$ vary with the modeling targets, e.g., a non-linear projection with softmax for BERT (Devlin et al., 2019; Bao et al., 2022), where the metric function becomes cross-entropy. Eqn. (6) can be viewed as a unified formulation for masked modeling. It is thus natural to consider how to build a knowledgeable teacher for masked 3D modeling, and our idea is to leverage cross-modal teachers from 2D or language foundation models.
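The corruption and the unified objective of Eqn. (6) can be sketched in a few lines of pure Python. The identity student/teacher and the squared-error distance below are placeholders for $f^S$, $f^T$, and $\mathcal{L}_D$; we accumulate the (positive) masked distances and leave the sign convention of Eqn. (6) aside.

```python
def corrupt(T, M, e_mask):
    # Z_M = 1(M) ⊙ e_[M] + 1(1 − M) ⊙ T: masked positions receive the learnable
    # corruption embedding e_[M]; visible positions keep their token.
    return [e_mask if m else t for t, m in zip(T, M)]

def masked_modeling_loss(f_s, f_t, T, M, e_mask, dist):
    # Unified objective of Eqn. (6): a distance between (decoded) student
    # features of the corrupted sequence and teacher features of the clean
    # sequence, accumulated at the masked positions only.
    zs, zt = f_s(corrupt(T, M, e_mask)), f_t(T)
    return sum(m * dist(s, t) for m, s, t in zip(M, zs, zt))

# toy usage: identity student/teacher, squared-error distance
sq = lambda s, t: sum((a - b) ** 2 for a, b in zip(s, t))
T = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]
loss = masked_modeling_loss(lambda z: z, lambda z: z, T, [1, 0, 0], [0.0, 0.0], sq)
```

With only position 0 masked and a zero corruption embedding, the loss reduces to the squared norm of the first token.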

4. ACT: AUTOENCODERS AS CROSS-MODAL TEACHERS

Our goal is to facilitate 3D representation learning through a pretrained 2D image or language Transformer, which carries dark knowledge absorbed from massive data. However, 3D point clouds are known to have structures different (Li et al., 2022; Afham et al., 2022) from 2D images or languages, which makes the association of fine-grained knowledge difficult. We address this issue with a two-stage training procedure. An overview of our ACT framework is illustrated in Figure 1.
• Stage I. We tune the pretrained 2D or language Transformer as a 3D autoencoder, where it learns to understand 3D geometry through self-supervised prompt tuning (Sec. 4.1).
• Stage II. We use the pretrained 3D autoencoder as a cross-modal teacher that distills latent features to the 3D point cloud Transformer student through masked modeling (Sec. 4.2).

4.1. 3D AUTOENCODING WITH PRETRAINED FOUNDATIONAL TRANSFORMER

Transformers, recently the dominant architecture in various areas, can model sequential data of any modality in a unified fashion (Vaswani et al., 2017) . Therefore, we could directly use the pretrained Transformer blocks by feeding the sequential tokens with 3D positional embeddings of the input point clouds, as described in Sec. 3.1. A lightweight DGCNN is used following Yu et al. (2022) , where Φ θ in Eqn. (1) represents the edge convolution layer (Wang et al., 2019) .
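As a reference for how $\Phi_\theta$ works when instantiated as an edge convolution, here is a minimal NumPy sketch of an EdgeConv-style layer (Wang et al., 2019): each point aggregates, by channel-wise max, a shared MLP applied to edge features $[x_i;\ x_j - x_i]$ over its $k$ nearest neighbors. The identity map stands in for the shared MLP, and all names are illustrative.

```python
import numpy as np

def edge_conv(X, k=4, theta=None):
    # EdgeConv sketch: for each point feature x_i, build edge features
    # [x_i, x_j - x_i] over its k nearest neighbors (in feature space)
    # and max-pool a shared MLP h_theta over them.
    if theta is None:
        theta = lambda e: e  # identity standing in for the shared MLP
    n, c = X.shape
    out = []
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        idx = np.argsort(d)[1:k + 1]  # k nearest neighbors, excluding self
        edges = np.hstack([np.tile(X[i], (k, 1)), X[idx] - X[i]])  # (k, 2c)
        out.append(theta(edges).max(axis=0))  # channel-wise max over neighbors
    return np.stack(out)  # (n, 2c) with the identity MLP

Xf = np.random.default_rng(4).normal(size=(10, 5))
Y = edge_conv(Xf)
```

In the real DGCNN the neighbor graph is recomputed per layer from the current features, which is why it is called a dynamic graph.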

Cross-Modal Embedding with Prompts

The point cloud $\mathbf{P}$ is first encoded by the DGCNN-style patch embedding network $g_{pre}$, producing a set of token embeddings $\mathbf{X} = g_{pre}(\mathbf{P})$. We then prompt the token embeddings and feed them into $D$ layers of pretrained and frozen Transformer blocks, e.g., a 2D Transformer $g^{2D} = \{g^{2D}_\ell \mid \ell = 1, 2, \ldots, D\}$, where $g^{2D}_\ell$ denotes the $\ell$-th layer. We use $m$ learnable prompt embeddings $\mathbf{E}^{[P]}_\ell = \{e^{[P]}_k \in \mathbb{R}^C \mid k \in \mathbb{N}, 1 \le k \le m\}$, which are applied to each layer of the Transformer (Jia et al., 2022). Specifically, the $\ell$-th layer $g^{2D}_\ell$ transforms the hidden representations $\mathbf{h}_{\ell-1}$ from the $(\ell-1)$-th layer into $\mathbf{h}_\ell$ as follows:

$$\big[\mathbf{h}_\ell;\ \mathbf{E}'^{[P]}_\ell\big] = g^{2D}_\ell\big(\big[\mathbf{h}_{\ell-1};\ \mathbf{E}^{[P]}_\ell\big]\big), \quad \ell = 1 \ldots D \tag{7}$$

With this parameter-efficient prompt tuning strategy, we are able to tune the pretrained foundational Transformer while preserving as much pretrained knowledge as possible (He et al., 2022a).

Point Cloud Autoencoding Another DGCNN network $g_{post}$ is used to extract local geometric features from the hidden representations $\mathbf{h}_\ell$ embedded by the foundational Transformer. After this, we leverage a FoldingNet (Yang et al., 2018) to reconstruct the input point cloud. We train the above 3D autoencoder as a discrete variational autoencoder (dVAE) (Kingma & Welling, 2014; Ramesh et al., 2021; Bao et al., 2022) to maximize the log-likelihood $P(\tilde{p}_i | p_i)$, where $(p_i, \tilde{p}_i) \in \mathcal{D}$ denote the original and reconstructed point clouds, respectively. The overall optimization maximizes the evidence lower bound (ELBO), which holds when $\beta = 1$ (Ramesh et al., 2021):

$$\sum_{(p_i, \tilde{p}_i) \in \mathcal{D}} \ln P_\theta(\tilde{p}_i | p_i) \ge \sum_{(p_i, \tilde{p}_i) \in \mathcal{D}} \Big( \mathbb{E}_{z_i \sim Q_\phi(z | p_i)}\big[\ln P_\psi(\tilde{p}_i | z_i)\big] - \beta\, \mathcal{L}_{KL}\big[Q_\phi(z | p_i),\ P_\theta(z | \tilde{p}_i)\big] \Big), \tag{8}$$

where (1) $Q_\phi(z | p)$ denotes the discrete 3D dVAE tokenizer; (2) $P_\psi(\tilde{p} | z)$ is the dVAE decoder given discrete point tokens; (3) $P_\theta(\tilde{p} | p)$ reconstructs the input point clouds in an autoencoding way.
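The prompt tuning of Eqn. (7) above amounts to concatenating $m$ learnable prompt tokens to each frozen block's input and discarding their outputs before the next block. A minimal pure-Python sketch, where toy "frozen" layers stand in for $g^{2D}_\ell$ and all names are illustrative:

```python
def prompted_forward(h, frozen_layers, prompts):
    # Eqn (7): concatenate the prompt embeddings E_[P]_l to the input of each
    # frozen block g_2D_l; the transformed prompts E'_[P]_l are dropped before
    # the next layer, so only the token positions are propagated.
    n = len(h)
    for layer, e_p in zip(frozen_layers, prompts):
        h = layer(h + e_p)[:n]  # list concat, then keep token positions only
    return h

# toy usage: two "frozen" layers (scale, then shift), m = 1 prompt per layer
scale = lambda xs: [[2.0 * v for v in x] for x in xs]
shift = lambda xs: [[v + 1.0 for v in x] for x in xs]
h0 = [[1.0, 2.0], [3.0, 4.0]]
prompts = [[[0.5, 0.5]], [[0.5, 0.5]]]
hD = prompted_forward(h0, [scale, shift], prompts)
```

In Stage I, only the prompt embeddings and the 3D-specific modules $g_{pre}$, $g_{post}$ would receive gradients; the pretrained blocks themselves stay frozen.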

4.2. MASKED POINT MODELING AS CROSS-MODAL KNOWLEDGE DISTILLATION

By simply training the 3D autoencoder, the strong representation of the pretrained Transformer is translated into the 3D feature space, making the autoencoder spontaneously a cross-modal teacher. We motivate our method with a formulation similar to Eqn. (6). We use the pretrained point cloud encoder introduced in Sec. 4.1 as the teacher $\mathcal{F}^T = h^T \circ g_{post} \circ g^{2D} \circ g_{pre}$ and a 3D Transformer $\mathcal{F}^S = h^S \circ f^S$ as the student. Masked point modeling as cross-modal knowledge distillation minimizes a negative cosine similarity $\mathcal{L}_{\cos}(s, t) = 1 - \frac{s \cdot t}{\|s\| \cdot \|t\|}$ between the encoded teacher and student features:

$$-\sum_{i=1}^{N_t} m_i \cdot \mathcal{L}_{\cos}\big(\mathcal{F}^S(\mathbf{Z}_M),\ \mathcal{F}^T(\mathbf{T})\big). \tag{9}$$

3D Synthetic Dataset Classification We evaluate 3D shape classification on the synthetic dataset ModelNet40 (Wu et al., 2015). To demonstrate the data efficiency of ACT given limited training examples, we first follow Sharma & Kaul (2020) to evaluate few-shot learning. From Table 5, we see: (i) ACT brings significant improvements of +9.0%, +4.7%, +8.7%, and +6.2%, respectively, in the four settings over the from-scratch FULL transferring baseline. (ii) ACT consistently achieves the best performance compared to other SSL methods. We then show results on the full dataset in Table 3, where ACT achieves a +2.5% accuracy improvement over the from-scratch baseline under the FULL protocol, and the results are comparable or better than other self-supervised learning methods across all transferring protocols. Table 6 shows the average fine-tuning accuracy on ScanObjectNN using ACT with different decoder depths. The performance is not sensitive to the decoder depth, and we find that a decoder with 2 blocks achieves the highest results. Note that when the decoder depth is 0, we adopt a masked modeling architecture similar to BERT (Devlin et al., 2019), where there is no decoder and the encoder sees all tokens, including masked ones. We find that this leads to inferior results, consistent with the observation in 2D that data of low semantics require a non-trivial decoder for masked modeling (He et al., 2022b).

Masking Strategy and Teacher Choice Figure 2(a) shows the average fine-tuning accuracy on ScanObjectNN with different masking strategies. A higher masking ratio with random masking yields better results, while block masking favors lower masking ratios. Note that when the masking ratio is zero, we use vanilla knowledge distillation on all tokens, which leads to inferior performance. Figure 2(b) shows the average fine-tuning accuracy on ScanObjectNN using ACT with different teacher Transformers, including Vision Transformers (Dosovitskiy et al., 2021; Touvron et al., 2021b), all-MLP architectures (Tolstikhin et al., 2021; Touvron et al., 2021a), a language model (Devlin et al., 2019), and a vision-language model (Radford et al., 2021). A larger teacher consistently yields better performance. Moreover, surprisingly, ACT with the language model BERT-B (i.e., BERT-base) as the cross-modal teacher achieves an average accuracy of 85.12±0.54% (up to 85.88%), demonstrating that ACT can generalize to any modality. Since ACT uses encoded features as masked modeling targets, it can also be applied as auxiliary feature distillation. Table 9 shows the results of training Point-MAE with ACT as auxiliary deep supervision of the intermediate features, where the ACT-encoded latent features are distilled to the encoder features of Point-MAE.
We observe that ACT improves Point-MAE significantly, by +0.87% accuracy on ScanObjectNN, demonstrating that ACT is scalable and effective as a knowledge distillation method.
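For reference, the cross-modal distillation objective of Eqn. (9) with the negative cosine similarity can be sketched in pure Python as follows. Identity encoders stand in for $\mathcal{F}^S$ and $\mathcal{F}^T$, and the masked sum is accumulated with a positive sign; all names are illustrative.

```python
import math

def l_cos(s, t):
    # L_cos(s, t) = 1 - s·t / (||s|| · ||t||)
    dot = sum(a * b for a, b in zip(s, t))
    ns = math.sqrt(sum(a * a for a in s))
    nt = math.sqrt(sum(b * b for b in t))
    return 1.0 - dot / (ns * nt)

def act_distill_loss(F_s, F_t, tokens_corrupted, tokens_clean, mask):
    # Eqn (9): cosine distance between student features of the corrupted
    # sequence Z_M and teacher features of the clean sequence T, accumulated
    # over the masked positions only.
    zs, zt = F_s(tokens_corrupted), F_t(tokens_clean)
    return sum(m * l_cos(s, t) for m, s, t in zip(mask, zs, zt))

# toy usage with identity student/teacher: distilling a token to itself costs 0
T = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
loss = act_distill_loss(lambda z: z, lambda z: z, T, T, [1, 0, 1])
```

Because the teacher is frozen during Stage II, only the student branch would be updated by this loss in practice.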

6.3. HOW DOES THE 2D VISION TRANSFORMER UNDERSTAND 3D POINT CLOUDS?

To better understand how the 2D image Transformers understand 3D inputs through the autoencoder training, we study the effect of positional embedding used by ViT-B in our ACT dVAE model. From Table 10 , we can observe that: (i) Without any positional embedding, the pretrained ViT still learns transferable 3D features (84.21±0.45% accuracy). We argue that it is because the positional geometry information is already contained in the input 3D coordinates and the pretrained 2D Transformer can process 3D data purely by geometry features without explicit positional hints. (ii) When using positional embedding with only 2D xy plane coordinates, accuracy is improved significantly by +0.89%. We argue that 2D positional embedding is learned to fit the frozen image Transformer, enabling the image Transformer to encode 3D inputs into pretrained 2D feature space with high semantics. (iii) With all 3D coordinates used for positional embedding, the 2D image Transformer succeeds in leveraging the additional coordinate information for better feature encoding.

7. CONCLUSIONS

This paper presents ACT, a self-supervised learning framework that performs masked modeling as feature distillation from pretrained foundational Transformers to 3D Transformer students. ACT first transfers the pretrained foundational Transformers into cross-modal 3D teachers via self-supervised 3D autoencoding. The semantically enriched latent features from the tuned 3D autoencoder are then used as masked modeling targets for the 3D Transformer students' representation learning, which shows remarkable generalization performance on various downstream 3D tasks. As a general SSL framework, we believe ACT could be easily extended to modalities other than 3D data. This self-supervised fashion of transferring cross-modal knowledge shows great potential and may largely facilitate the development of foundational modeling in this data-driven deep learning era.

A ADDITIONAL RELATED WORKS

Self-Supervised Representation Learning has achieved remarkable success in natural language processing (Devlin et al., 2019; Brown et al., 2020) and 2D visual understanding (Noroozi & Favaro, 2016; Dosovitskiy et al., 2016; Pathak et al., 2016; Ye et al., 2019). One prominent strand of research follows the contrastive objective, via construct, then contrast, for learning constructed invariance and consistency (Hadsell et al., 2006; Wu et al., 2018; van den Oord et al., 2018; Hjelm et al., 2019; Chuang et al., 2020; Grill et al., 2020; Chen et al., 2020b; He et al., 2020; Chen & He, 2021; Zhang et al., 2022a). Another paradigm trains denoising autoencoders (DAE) (Vincent et al., 2008; 2010) via corrupt, then reconstruct (predict) of data signals in a self-supervised fashion. With the rapid development of Transformers in vision (Vaswani et al., 2017; Dosovitskiy et al., 2021; Liu et al., 2021b), abundant works have generalized DAE to masked modeling of RGB pixels (Zhang et al., 2016; Chen et al., 2020a; He et al., 2022b), pretrained DALL-E tokens (Ramesh et al., 2021; Bao et al., 2022), online teacher token features (Zhou et al., 2022), and HOG features (Dalal & Triggs, 2005; Wei et al., 2022). Recently, several works have explored combining the merits of these two paradigms (Tian et al., 2022; Yi et al., 2022; Tao et al., 2022). Knowledge Distillation generally trains a student model to mimic a knowledgeable teacher, in which the dark knowledge is transferred. This technique was first proposed by Bucila et al. (2006) for model compression and was further extended by Hinton et al. (2015) for deep neural networks. It has since become one of the most widely used techniques for model compression in 2D vision (Romero et al., 2015; Zagoruyko & Komodakis, 2017; Zhang & Ma, 2021), natural language processing (Sanh et al., 2019; Jiao et al., 2020), and 3D vision (Zhang et al., 2022b; Yang et al., 2022).
Recently, this technique has been extended to efficient visual representation learning through self-distillation (Zhang et al., 2019) of a distillation token (Touvron et al., 2021b) or momentum tokenizer features (Zhou et al., 2022).

B IMPLEMENTATION DETAILS B.1 SELF-SUPERVISED PRETRAINING SETUP

Data We use ShapeNetCore from ShapeNet (Chang et al., 2015) as the pretraining dataset. ShapeNet is a collection of clean 3D CAD object models with rich annotations, consisting of ∼51K unique 3D models from 55 common object categories. We sample 1,024 points per 3D model using farthest point sampling (FPS), which are further divided into 64 groups of 32 points as local geometry patches using KNN. Standard data augmentations, i.e., random scaling and translation, are adopted when pretraining the 3D autoencoder and the 3D point cloud Transformer. 3D Autoencoder Following Yu et al. (2022), we use a lightweight DGCNN (Wang et al., 2019) as the local geometry patch embedding module, which takes the KNN groups as input and models local geometry relationships through dynamic graph message passing. The encoded geometry patch embedding is then fed into a pretrained 2D image Transformer, e.g., ViT (Dosovitskiy et al., 2021) or DeiT (Touvron et al., 2021b). Unless otherwise specified, the results in this paper use ViT-B pretrained on ImageNet (Deng et al., 2009) as the 2D image Transformer. Besides, only the Transformer blocks and layer normalization are used, while other layers, such as the original 2D patch embedding, are dropped. The decoder consists of several DGCNN layers that further model the 2D-embedded 3D features, followed by a FoldingNet (Yang et al., 2018) for reconstruction. As pointed out by Ramesh et al. (2021), the weight of the KL divergence loss (i.e., β in Eqn. (8)) must be small during training; we therefore set it to 0 in the first 10K steps and gradually increase it to 0.1 over the following 100K steps. We use the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 5e-4. A cosine learning rate scheduler with 60K warm-up steps is adopted. Following Chen et al. (2020a), the Gumbel-softmax temperature is decayed from 1 to 0.0625 within 100K steps. The batch size is set to 64, and the overall training takes ∼150K steps.
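The β warm-up and Gumbel-softmax temperature schedules described above can be sketched as follows. The paper only specifies the endpoints (0 → 0.1 for β after 10K frozen steps; 1 → 0.0625 for the temperature over 100K steps), so the linear ramps below are an assumption, not the authors' exact schedule.

```python
def kl_weight(step, zero_steps=10_000, ramp_steps=100_000, beta_max=0.1):
    # β held at 0 for the first 10K steps, then increased to 0.1 over the
    # next 100K steps (linear ramp assumed; the text says "gradually").
    if step < zero_steps:
        return 0.0
    return min(beta_max, beta_max * (step - zero_steps) / ramp_steps)

def gumbel_temperature(step, t0=1.0, t1=0.0625, decay_steps=100_000):
    # Temperature decayed from 1 to 0.0625 within 100K steps, then held
    # constant (linear decay assumed).
    frac = min(step / decay_steps, 1.0)
    return t0 + frac * (t1 - t0)
```

A training loop would query both schedules at every optimizer step and plug the values into the dVAE loss and the Gumbel-softmax relaxation, respectively.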
The training of the 3D autoencoder is supervised by the reconstruction objective and the variational distribution loss. Following Yu et al. (2021), we use coarse- and fine-grained predictions against the ground-truth point cloud. The ℓ1-style Chamfer distance is used as the reconstruction objective:

$$\mathcal{L}_{CD\text{-}\ell_1}(\mathbf{P}, \mathbf{G}) = \frac{1}{|\mathbf{P}|} \sum_{p \in \mathbf{P}} \min_{g \in \mathbf{G}} \|p - g\| + \frac{1}{|\mathbf{G}|} \sum_{g \in \mathbf{G}} \min_{p \in \mathbf{P}} \|g - p\|,$$

where $\mathbf{P}$ denotes the predicted point cloud and $\mathbf{G}$ the ground truth. Following Ramesh et al. (2021), we use a uniform prior for the discrete variational autoencoder (dVAE) training, where the KL divergence is adopted for distribution alignment. Hence, the overall objective function is:

$$\mathcal{L}_{dVAE} = \mathcal{L}_{CD\text{-}\ell_1}(\mathbf{P}_{coarse}, \mathbf{G}) + \mathcal{L}_{CD\text{-}\ell_1}(\mathbf{P}_{fine}, \mathbf{G}) + \beta \mathcal{L}_{KL}.$$

Masked Point Modeling For masked point modeling, the autoencoder's encoder serving as the backbone model is a standard Transformer architecture (Vaswani et al., 2017) with a lightweight PointNet (Qi et al., 2017a) token embedding. Following Wang et al. (2022b), no voting strategy is adopted during testing, and unless otherwise specified we report overall accuracy (OA) on the most challenging PB_T50_RS benchmark. ShapeNetPart The ShapeNetPart dataset (Yi et al., 2016) is a popular synthetic benchmark for point-level object part segmentation, covering ∼17K objects from 16 object categories with 50 fine-grained part categories. We use the AdamW optimizer with 1e-5 weight decay and a cosine learning rate schedule with a learning rate of 2e-5 and 10 warm-up epochs. Standard random scaling and translation are used for data augmentation. The batch size is set to 16, and we train models for 300 epochs. S3DIS The S3DIS dataset (Armeni et al., 2016) provides densely annotated semantic labels for point clouds. It consists of six large-scale indoor areas from three different buildings, covering a total of 273 million points from 13 categories. Following Tchapmi et al. (2017), we use Area 5 for evaluation for better and fairer generalization benchmarking.
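The ℓ1-style Chamfer distance used as the reconstruction objective above can be sketched in NumPy as follows (a dense pairwise-distance version; real implementations typically use batched GPU kernels):

```python
import numpy as np

def chamfer_l1(P, G):
    # ℓ1-style Chamfer distance: average nearest-neighbor Euclidean distance
    # from each predicted point to the ground truth, plus the reverse term.
    d = np.linalg.norm(P[:, None, :] - G[None, :, :], axis=-1)  # |P| x |G| pairwise
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

rng = np.random.default_rng(0)
pred, gt = rng.normal(size=(32, 3)), rng.normal(size=(32, 3))
d_self = chamfer_l1(gt, gt)   # zero: every point matches itself
d_cross = chamfer_l1(pred, gt)
```

Note the distance is symmetric in its two arguments by construction, and the dense |P| x |G| matrix makes this sketch O(|P|·|G|) in memory.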
We use the AdamW optimizer with 1e-5 weight decay and a cosine learning rate of 2e-5 with 10 warm-up epochs. The batch size is 32, and the total training takes 60 epochs. ScanNetV2 ScanNetV2 (Dai et al., 2017) is a large-scale dataset of ∼2.5M RGB-D scans from 1,513 indoor scenes with comprehensive annotations. Following Liu et al. (2022a), we construct a ScanNet-Medium subset containing ∼15K frames by sampling every 100th frame from the raw dataset, and use it for 300 epochs of ACT pretraining. We use 3DETR (Misra et al., 2021) with the same training recipe for 3D object detection downstream transferring. Note that only the encoder is pretrained and transferred; it has 3 layers with an embedding dimension of 384, while the decoder has 8 layers.

C ADDITIONAL EXPERIMENTS

3D Object Detection We evaluate the representation capability of ACT with downstream 3D object detection on the large-scale scene dataset ScanNetV2 with 3DETR (Misra et al., 2021). Comparison to Supervised Cross-Modal 3D Representation Learning Methods Table 12 compares our method to the cross-modal 3D representation learning method P2P (Wang et al., 2022b), which also uses extra image data via supervised fine-tuning of pretrained image models. From the results, our ACT achieves 88.21% OA on PB_T50_RS with only a 22.1M-parameter pure 3D Transformer, while P2P achieves 87.4%/89.3% with 42.7M/195.8M large-scale image models (i.e., ResNet-101 (He et al., 2016) and HorNet (Rao et al., 2022)). Part Segmentation The ShapeNetPart dataset (Yi et al., 2016) is used to evaluate the capacity to learn detailed shape semantics within 3D objects. Table 13 shows the per-category IoU results, from which we see: (i) ACT significantly improves the from-scratch baseline by 1.2% Cls. mIoU and 1.0% Ins. mIoU, respectively. (ii) ACT outperforms the other methods, achieving the best or second-best IoU on up to 12 of the 16 categories.

D VISUALIZATION

Reconstruction Results Figure 3 compares the reconstruction results of our 2D image Transformer-based 3D dVAE and the Point-BERT 3D dVAE model. The results show that our 3D autoencoder can reconstruct high-quality object details. For relatively simple objects like the rectangular table in the second row, both our method and Point-BERT reconstruct them well. However, for point sets with relatively complicated details, such as the thin shelf and the armchair in the third row, our method still reconstructs the objects with detailed local geometric information. These qualitative observations are consistent with the quantitative results in Table 7.



For example, the in-house JFT-300M dataset from Google covers over one billion labels for 300M images, and the Common Crawl dataset (Raffel et al., 2020) for NLP consists of nearly one trillion words.

For MAE, the encoder only receives visible tokens, and the token sequence T for calculating Z^M should contain only the visible tokens, i.e., Z^M = f_S([t_i | ∀ m_i = 0, m_i ∈ M]), where the corrupted representation Z^M is fed into the decoder for masked modeling distillation.
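The visible-token bookkeeping for the corrupted representation can be sketched as follows; this is a minimal stand-in where the student encoder f_S is an arbitrary callable, and all names are illustrative:

```python
import numpy as np

def encode_visible(tokens, mask, f_s):
    """MAE-style corrupted representation: the student encoder f_s only
    sees tokens whose mask bit is 0 (visible); masked positions are
    dropped before encoding.
    tokens: (N, D) token array, mask: (N,) array of 0/1 mask bits."""
    visible = tokens[mask == 0]   # [t_i | m_i = 0, m_i in M]
    return f_s(visible)           # Z^M, later fed to the decoder

# toy example with an identity "encoder"
tokens = np.arange(12, dtype=float).reshape(4, 3)
mask = np.array([0, 1, 0, 1])
z_m = encode_visible(tokens, mask, lambda x: x)
```

The decoder then receives Z^M together with placeholders for the masked positions, which is where the distillation target is predicted.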



Figure 1: Overview of our ACT framework (Sec. 3-4). (a) ACT utilizes Transformers pretrained on large-scale data, e.g., ViT (Dosovitskiy et al., 2021) pretrained with 2D images or BERT (Devlin et al., 2019) pretrained with languages. (b) Stage I of ACT (Sec. 4.1): the pretrained Transformers are tuned by self-supervised 3D autoencoding with prompts (Jia et al., 2022). (c) Stage II of ACT (Sec. 4.2): the encoder of the 3D autoencoder is used as a cross-modal teacher that encodes latent features as masked point modeling targets for 3D Transformer student representation learning.

(a) FULL: fine-tuning pretrained models by updating all backbone and classification head parameters. (b) MLP-LINEAR: the classification head is a single-layer linear MLP, and we only update this head's parameters during fine-tuning. (c) MLP-3: the classification head is a three-layer non-linear MLP (the same as the one used in FULL), and we only update this head's parameters during fine-tuning.

3D Real-world Dataset Classification We first show the evaluation of 3D shape recognition on the challenging real-world dataset ScanObjectNN (Uy et al., 2019b). The results are shown in
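The three transfer protocols differ only in which parameters are updated; a minimal sketch of the selection logic (the `head.` naming prefix and parameter names are our own illustrative convention, not the paper's code):

```python
def trainable_params(param_names, protocol):
    """Return the parameter names updated under each transfer protocol.
    FULL updates backbone and head; MLP-LINEAR and MLP-3 update only the
    classification head (they differ in head depth, not in what is frozen)."""
    if protocol == "FULL":
        return list(param_names)
    if protocol in ("MLP-LINEAR", "MLP-3"):
        return [n for n in param_names if n.startswith("head.")]
    raise ValueError(f"unknown protocol: {protocol}")

names = ["backbone.blocks.0.attn.qkv", "backbone.blocks.11.mlp.fc1", "head.fc1"]
```

In a framework like PyTorch, the same effect is achieved by setting `requires_grad = False` on the excluded parameters before building the optimizer.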

Figure 3: Reconstruction results of synthetic objects from ShapeNet test set.

Figure 4: t-SNE (Van der Maaten & Hinton, 2008) feature manifold visualization on ModelNet40 and ScanObjectNN PB_T50_RS datasets. Feature vectors extracted by ACT models after ShapeNet pretraining and downstream fine-tuning are visualized in (a), (c), and (b), (d), respectively.
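A t-SNE projection like the one in Figure 4 can be produced with scikit-learn; the features below are random stand-ins for the extracted 384-d backbone features, and the perplexity and initialization are assumed settings rather than the paper's:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 384))   # stand-in for extracted 384-d features
# project to 2D for plotting; color points by class label to inspect clusters
emb = TSNE(n_components=2, perplexity=30.0, init="pca",
           random_state=0).fit_transform(feats)
```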

Data pattern comparison.

Cross-Modal 3D Representation Learning aims at leveraging more modality-inherent learning signals besides 3D point clouds, e.g., 2D images are known to carry rich contextual and textural knowledge, while free-form languages are of dense semantics. Mainstream methods are developed upon contrastive learning of global feature matching. For instance, Jing et al. (2021) propose a discriminative Center loss for feature alignment of point clouds, meshes, and images. Afham et al.

Classification results on ScanObjectNN. Ours 1 : results trained with no data augmentation. Ours 2 : results trained with simple point cloud rotation. DA: data augmentation is used during finetuning training. The overall accuracy, i.e., OA (%) is reported.



Classification results on the ModelNet40 dataset. The overall accuracy, i.e., OA (%), is reported. [ST]: standard Transformer architecture.

ScanObjectNN. Besides, ACT succeeds in reaching state-of-the-art (SOTA) performance among methods using a pure 3D Transformer architecture on ScanObjectNN, e.g., ACT outperforms Point-MAE by +3.0% accuracy on the most challenging PB_T50_RS benchmark.

Semantic segmentation results on the S3DIS Area 5. The mean accuracy and mean IoU across all categories, i.e., mAcc (%) and mIoU (%) are reported. xyz: point cloud coordinates are used. xyz+rgb: both coordinates and RGB color are used.

Ablation study on the depth of the pretraining decoder.

Ablation study on different training strategies of the dVAE tokenizer. The F-Score and the Chamfer distance using L1-norm and L2-norm, i.e., CD-ℓ1 and CD-ℓ2, are reported. Columns: Method, Num. of Prompts, Prompt Type, Freeze, F-Score↑, CD-ℓ1↓, CD-ℓ2↓.

Table 7 shows the reconstruction results of different training configurations for the 3D autoencoder with a pretrained 2D image Transformer. It is observed that: (i) our 3D dVAE model with a pretrained image Transformer achieves significantly better reconstruction results than Point-BERT, demonstrating that pretrained 2D image Transformers have a strong representation capacity for 3D data. (ii) Prompt tuning or freezing the model leads to better results than full tuning; we argue this is because full tuning forgets some of the pretrained 2D knowledge, an issue that prompt tuning effectively avoids. Reconstruction visualizations can be found in Appendix D.

Study on the effect of the pretrained image Transformer-based 3D autoencoder.
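The reconstruction metrics used throughout these ablations can be sketched as follows; note that exact normalization conventions for CD-ℓ1/CD-ℓ2 and the F-Score threshold vary across papers, so this is an illustrative version rather than the paper's evaluation code:

```python
import numpy as np

def chamfer(p, q, squared=False):
    """Symmetric Chamfer distance between point clouds p (N, 3) and q (M, 3).
    squared=False ~ CD-l1-style (mean nearest L2 distance),
    squared=True  ~ CD-l2-style (mean nearest squared distance)."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # (N, M) pairwise
    if squared:
        d = d ** 2
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def f_score(p, q, tau=0.01):
    """F-Score at threshold tau: harmonic mean of the fraction of p-points
    within tau of q (precision) and of q-points within tau of p (recall)."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    precision = (d.min(axis=1) < tau).mean()
    recall = (d.min(axis=0) < tau).mean()
    return 2 * precision * recall / max(precision + recall, 1e-8)
```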

Study of different positional embeddings for the 2D image Transformer in the dVAE model. (a) N/A: no positional embedding is used. (b) 2D: positional embedding with only the 2D xy-plane coordinates. (c) 3D: positional embedding with all 3D xyz coordinates. The F-Score, the Chamfer distance using L1-norm and L2-norm, i.e., CD-ℓ1 and CD-ℓ2, and OA on ScanObjectNN are reported.

When using the Point-BERT discrete token as the masked modeling target, applying our dVAE model with pretrained 2D image Transformers yields the worst performance. This demonstrates that discrete tokens are not suitable for semantically sparse point cloud data, no matter how strong the tokenizer is. (iii) When using our ACT, the performance is significantly improved, demonstrating that the 3D dVAE with a pretrained 2D image Transformer can encode features with rich semantics, which are better suited for masked point modeling.
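The three positional-embedding variants differ only in which coordinates are projected; a toy sketch (the projection, names, and shapes are illustrative, not the paper's implementation):

```python
import numpy as np

def positional_embedding(xyz, mode, proj):
    """'na': nothing is added to the token features.
    '2d': project only the xy-plane coordinates.
    '3d': project all xyz coordinates.
    proj maps (N, k) coordinates to (N, D) embeddings."""
    if mode == "na":
        return 0.0  # token features pass through unchanged
    if mode == "2d":
        return proj(xyz[:, :2])
    if mode == "3d":
        return proj(xyz)
    raise ValueError(f"unknown mode: {mode}")

# linear projections as stand-ins for the learned embedding layers
rng = np.random.default_rng(0)
w2, w3 = rng.normal(size=(2, 384)), rng.normal(size=(3, 384))
proj2, proj3 = (lambda c: c @ w2), (lambda c: c @ w3)
xyz = rng.normal(size=(8, 3))
```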

patch embedding module, and the decoder is also a Transformer architecture. The encoder Transformer has 12 blocks with an embedding dimension of 384, while the decoder Transformer has only 2 blocks with the same embedding dimension. The multi-head attention has 6 heads, and the MLP ratio is set to 4. Stochastic depth (Huang et al., 2016) with rate 0.1 is applied to all Transformer blocks. The AdamW optimizer is adopted with a cosine learning rate of 1e-3 and a weight decay of 5e-2. The model is pretrained for 300 epochs with a batch size of 128.

ModelNet40 ModelNet40 (Wu et al., 2015), as one of the most classical datasets, is used for the evaluation of object classification on clean 3D CAD models. There are ∼12K meshed 3D CAD models covering 40 categories. For benchmarking purposes, we use the standard data split of 9,843/2,468 for training and validation, following Qi et al. (2017b). The classification head is a three-layer MLP with a dropout of 0.5, and the hidden layer dimension is set to 384, the same as the Transformer backbone. The AdamW optimizer with a 0.05 weight decay is used, together with a cosine learning rate scheduler, a 5e-4 learning rate, and a 10-epoch warmup. The batch size is 32, and the total training is 300 epochs. Standard random scaling and translation augmentations are used, and note that we use a voting-based evaluation strategy (Liu et al., 2019b) for a fair comparison.

ScanObjectNN The ScanObjectNN dataset (Uy et al., 2019b) is a collection of 3D object point clouds from the challenging real-world indoor scene ScanNet dataset (Dai et al., 2017), which includes ∼15K objects from 15 categories. We use three variants of ScanObjectNN following Uy et al. (2019b), i.e., OBJ_BG, OBJ_ONLY, and PB_T50_RS. The optimization and other training settings (e.g., training epochs) are the same as for ModelNet40. For data augmentation, we report results trained with no data augmentation and with simple point cloud rotation as used by
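The architecture and pretraining hyperparameters above can be collected into a small configuration sketch; the values are as reported in the text, while the field names are our own:

```python
from dataclasses import dataclass

@dataclass
class ACTPretrainConfig:
    """Hyperparameters as reported in the text; field names are ours."""
    encoder_depth: int = 12       # encoder Transformer blocks
    decoder_depth: int = 2        # decoder Transformer blocks
    embed_dim: int = 384
    num_heads: int = 6
    mlp_ratio: int = 4
    drop_path_rate: float = 0.1   # stochastic depth (Huang et al., 2016)
    lr: float = 1e-3              # AdamW with cosine schedule
    weight_decay: float = 5e-2
    epochs: int = 300
    batch_size: int = 128

    @property
    def head_dim(self) -> int:
        # per-head dimension of the multi-head attention
        return self.embed_dim // self.num_heads

cfg = ACTPretrainConfig()
```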

From Table 11, it is observed that: (i) ACT significantly improves over the from-scratch baseline by +1.7% AP25 and +4.2% AP50. (ii) In comparison to other SSL methods, ACT outperforms MaskPoint by a clear margin.

3D object detection on the ScanNetV2 dataset. The detection performance using mean Average Precision (mAP) at two different IoU thresholds of 0.50 and 0.25, i.e., AP50 and AP25, is reported. xyz: point cloud coordinates are used. Columns: Method, SSL, Input, AP50, AP25.
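The AP50/AP25 thresholds refer to 3D bounding-box IoU; for axis-aligned boxes it reduces to the following (a simplified sketch, since a full detection evaluation also handles box matching and precision-recall integration):

```python
import numpy as np

def iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes given as
    (xmin, ymin, zmin, xmax, ymax, zmax). AP50 / AP25 count a detection
    as a true positive when its IoU with a matched ground-truth box
    exceeds 0.50 / 0.25 respectively."""
    lo = np.maximum(a[:3], b[:3])                 # intersection min corner
    hi = np.minimum(a[3:], b[3:])                 # intersection max corner
    inter = np.prod(np.clip(hi - lo, 0.0, None))  # 0 if boxes are disjoint
    vol = lambda box: np.prod(box[3:] - box[:3])
    return inter / (vol(a) + vol(b) - inter)
```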

Comparison to supervised cross-modal 3D representation learning method on ScanObjectNN. Overall accuracy, i.e., OA (%), is reported.

Part segmentation results on the ShapeNetPart dataset. The mean IoU across all categories, i.e., Cls. mIoU, the mean IoU across all instances, i.e., Ins. mIoU (%), and IoU (%) for each category are reported. The best results are bolded and the second best results are underlined.
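The two headline metrics differ only in averaging order: Ins. mIoU averages over all shape instances, while Cls. mIoU first averages within each category and then across categories. A sketch:

```python
import numpy as np

def mean_ious(per_shape_iou, per_shape_cat):
    """per_shape_iou: IoU of each evaluated shape instance;
    per_shape_cat: category id of each shape.
    Returns (Cls. mIoU, Ins. mIoU)."""
    per_shape_iou = np.asarray(per_shape_iou, dtype=float)
    per_shape_cat = np.asarray(per_shape_cat)
    ins_miou = per_shape_iou.mean()  # average over instances
    cls_miou = np.mean([per_shape_iou[per_shape_cat == c].mean()
                        for c in np.unique(per_shape_cat)])  # per-category first
    return cls_miou, ins_miou
```

The two values diverge whenever categories have unequal instance counts, which is why both are commonly reported on ShapeNetPart.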

Availability: //github.com/

