SAMPLED TRANSFORMER FOR POINT SETS

Abstract

The sparse transformer can reduce the computational complexity of the self-attention layers to O(n) while remaining a universal approximator of continuous sequence-to-sequence functions. However, this permutation-variant operation is not appropriate for direct application to sets. In this paper, we propose an O(n) complexity sampled transformer that can process point set elements directly without any additional inductive bias. Our sampled transformer introduces random element sampling, which randomly splits point sets into subsets, and then applies a shared Hamiltonian self-attention mechanism to each subset. The overall attention mechanism can be viewed as a Hamiltonian cycle in the complete attention graph, and the permutation of point set elements is equivalent to randomly sampling Hamiltonian cycles. This mechanism implements a Monte Carlo simulation of the O(n^2) dense attention connections. We show that it is a universal approximator for continuous set-to-set functions. Experimental results for classification and few-shot learning on point clouds show comparable or better accuracy with significantly reduced computational complexity compared to the dense transformer or alternative sparse attention schemes.

1. INTRODUCTION

Encoding structured data has become a focal point of modern machine learning. In recent years, the de facto choice has been to use transformer architectures for sequence data, e.g., in language (Vaswani et al., 2017) and image (Dosovitskiy et al., 2020) processing pipelines. Indeed, transformers have not only shown strong empirical results, but have also been proven to be universal approximators for sequence-to-sequence functions (Yun et al., 2019). Although the standard transformer, with its permutation invariant dense attention, is a natural choice for set data, its versatility is limited by the costly O(n^2) computational complexity. To decrease the cost, a common trick is to use sparse attention, reducing the complexity from O(n^2) to O(n) (Yun et al., 2020; Zaheer et al., 2020; Guo et al., 2019). However, in general this results in an attention mechanism that is not permutation invariant: swapping two set elements changes which elements they attend to. As a result, sparse attention cannot be directly used for set data.

Recent work has explored the representation power of transformers in point sets as a plug-in module (Lee et al., 2019), a pretraining-finetuning pipeline (Yu et al., 2022; Pang et al., 2022), and with a hierarchical structure (Zhao et al., 2021). However, these set transformers introduced additional inductive biases to (theoretically) approach the same performance as the densely connected case in language and image processing applications. For example, to achieve permutation invariance with efficient computational complexity, previous work has required nearest neighbor search (Zhao et al., 2021) or inducing points sampling (Lee et al., 2019). Following the above analysis, a research question naturally arises to avoid introducing unneeded inductive bias: Can O(n) complexity sparse attention mechanisms be applied directly to sets?

We propose the sampled transformer to address this question, which is distinguished from the original sparse transformer by mapping the permutation of set elements to the permutation of attention matrix elements. Viewing this permutation sampling as attention matrix sampling, the proposed sampled attention approximates O(n^2) dense attention. This is achieved with the proposed random element sampling and Hamiltonian self-attention. To be specific, in random element sampling the input point set is first randomly split into several subsets of $n_s$ points (Fig. 1b), each of which will be processed by shared self-attention layers. In addition, a sparse attention mechanism, namely Hamiltonian self-attention (Fig. 1c), is applied to reduce the complexity of the subset inputs, so that $n_s$ point connections are sampled from the $O(n_s^2)$ connections. The combination of the Hamiltonian self-attention mechanisms of all subsets, namely cycle attention (Fig. 1d), can be viewed as a Hamiltonian cycle in the complete attention graph. As a result, the permutation of set elements is equivalent to the permutation of nodes in a Hamiltonian cycle (Fig. 1e), which in fact amounts to randomly sampling Hamiltonian cycles from the complete graph, thereby yielding the proposed sampled attention (Fig. 1f). Finally, viewing this randomization as a Monte Carlo sample of attention pairs, repeated sampling can be used to approximate the complete O(n^2) dense connections. Furthermore, our proposed sampled transformer is proven to be a universal approximator for set data: any continuous set-to-set function can be approximated to arbitrary precision.

The contributions of this paper are summarized as follows.
• We propose the sampled attention mechanism, which maps the random permutation of set elements to the random sampling of Hamiltonian cycle attention matrices, permitting the direct processing of point sets.
• We prove that the proposed sampled transformer is a universal approximator of continuous set-to-set functions, see Corollary 1.
• Compared to previous transformer architectures, the empirical results show that our proposed sampled transformer achieves comparable (or better) performance with less inductive bias and complexity.

2. RELATED WORK

The transformer (Vaswani et al., 2017) is widely used in language (Raffel et al., 2020; Dai et al., 2019; Yang et al., 2019b) and image (Dosovitskiy et al., 2020; Liu et al., 2021; Touvron et al., 2021; Ramachandran et al., 2019) processing. For example, Raffel et al. (2020) explored the transformer by unifying a suite of text problems into a text-to-text format; Dai et al. (2019) modeled very long-term dependency by reusing previous hidden states; Dosovitskiy et al. (2020) demonstrated that the pure transformer can be effectively applied directly to a sequence of image patches; and Liu et al. (2021) proposed a transformer with a hierarchical structure to learn various scales with linear computational complexity. In addition, the representation power of the transformer has been explored by pre-training and fine-tuning models (Devlin et al., 2018; Bao et al., 2021; Yu et al., 2022; He et al., 2022). Recently, an increasing number of researchers have begun to explore the representation power of the transformer on 3D point cloud (set) data.

Another important line of work seeks to theoretically demonstrate the representation power of the transformer by showing the universal approximation of continuous sequence-to-sequence functions (Yun et al., 2019; 2020; Zaheer et al., 2020; Shi et al., 2021; Kratsios et al., 2021). To be specific, Yun et al. (2019) demonstrated the universal approximation property of the transformer; Yun et al. (2020) and Zaheer et al. (2020) demonstrated that the transformer with a sparse attention matrix remains a universal approximator; Shi et al. (2021) claimed that the transformer without diag-attention is still a universal approximator; and Kratsios et al. (2021) proposed that universal approximation under constraints is possible for the transformer. In comparison with the above works, we propose the O(n) sampled transformer, a universal approximator of continuous set-to-set functions. To our knowledge, approximating dense attention by sampling Hamiltonian cycle attention matrices is new.

3.1. NOTATION

Given an integer a we define $[a] \doteq \{1, \ldots, a\}$. For a matrix $M \in \mathbb{R}^{n \times m}$ and $k \in [m]$, the k-th column is denoted by $M_k$. Given an (ordered) index set $A \subset [m]$, the submatrix $M_A \in \mathbb{R}^{n \times |A|}$ is the matrix generated by concatenating the columns determined by the indices in A. See the notation guide in §A in the supplementary material.

3.2. TRANSFORMER

The transformer X → t(X) (Vaswani et al., 2017; Dosovitskiy et al., 2020) implements a function from point clouds to point clouds with input points $X \in \mathbb{R}^{d \times n}$. It is formally defined by a multi-head self-attention layer and a feed-forward layer:

$$\mathrm{Head}^j(X) = (W_V^j X) \cdot \sigma_S\left[(W_K^j X)^\top W_Q^j X\right] \tag{1a}$$
$$\mathrm{Attn}(X) = X + W_O \begin{bmatrix} \mathrm{Head}^1(X) \\ \vdots \\ \mathrm{Head}^h(X) \end{bmatrix} \tag{1b}$$
$$\mathrm{TB}(X) = \mathrm{Attn}(X) + W_2 \cdot \mathrm{ReLU}(W_1\, \mathrm{Attn}(X)), \tag{1c}$$

where n is the number of points, d is the feature dimension, and $\sigma_S$ is the softmax. Head(•) is the self-attention layer, and Attn(•) is the multi-head self-attention layer with the parameter $W_O \in \mathbb{R}^{d \times mh}$. $W_V^j, W_K^j, W_Q^j \in \mathbb{R}^{m \times d}$ are the value, key, and query parameters; $W_1 \in \mathbb{R}^{r \times d}$ and $W_2 \in \mathbb{R}^{d \times r}$ are the feed-forward layer parameters. We utilize a positional embedding E in the input X, defined by $E = W_P P$, where $P \in \mathbb{R}^{3 \times n}$ holds the (xyz) coordinates and $W_P \in \mathbb{R}^{d \times 3}$ is an MLP layer. To simplify the notation, we overload X to denote X + E, so that all inputs X in this paper include the positional embedding unless specifically stated otherwise.

The attention mechanism for a dense transformer is the n × n attention matrix $(W_K^j X)^\top W_Q^j X$ in Eq. 1a, which is in fact a similarity matrix for n elements/tokens, or a complete attention graph. Sparse attention refers to the same similarity matrix/attention graph but with sparse connections instead. As tokenization may not be necessary in dealing with point clouds, for clarity we use the terms points, elements, and tokens interchangeably to refer to points (which may be thought of as tokens in a traditional transformer context) in a point cloud (set).
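The block defined in Eq. 1 can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the softmax is taken column-wise (so that column k of the similarity matrix holds the scores for query k), the positional embedding is omitted, and the function names, toy sizes, and random weights are hypothetical rather than taken from the paper.

```python
import numpy as np

def softmax_columns(S):
    """Column-wise softmax (assumed orientation): each column sums to one."""
    S = S - S.max(axis=0, keepdims=True)  # subtract the max for numerical stability
    E = np.exp(S)
    return E / E.sum(axis=0, keepdims=True)

def transformer_block(X, W_V, W_K, W_Q, W_O, W_1, W_2):
    """X: (d, n). W_V, W_K, W_Q: (h, m, d). W_O: (d, m*h). W_1: (r, d). W_2: (d, r)."""
    heads = []
    for Wv, Wk, Wq in zip(W_V, W_K, W_Q):
        scores = (Wk @ X).T @ (Wq @ X)                    # (n, n) dense similarity matrix
        heads.append((Wv @ X) @ softmax_columns(scores))  # one head, shape (m, n)
    attn = X + W_O @ np.concatenate(heads, axis=0)        # residual multi-head attention
    return attn + W_2 @ np.maximum(W_1 @ attn, 0.0)       # residual feed-forward layer

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
d, n, h, m, r = 8, 5, 2, 4, 16
X = rng.normal(size=(d, n))
shapes = [(h, m, d), (h, m, d), (h, m, d), (d, m * h), (r, d), (d, r)]
params = [0.1 * rng.normal(size=s) for s in shapes]
Y = transformer_block(X, *params)
```

Without a positional embedding, permuting the columns of `X` permutes the columns of `Y` in the same way, which is the behaviour of dense attention that the sparse variants discussed later do not share in general.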

3.3. UNIVERSAL APPROXIMATION

Let $\mathcal{F}$ be the class of continuous sequence-to-sequence functions $f : \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}$ defined on any compact domain. Further define $\mathcal{T}^{h,m,r}$ as the set of transformer blocks t(•) with h attention heads, each of size m, and with hidden layer width r (Yun et al., 2019; 2020). To measure the distance between functions in $\mathcal{F}$, we define the standard $\ell_p$ distance function by the corresponding norm:

$$d_p(f_1, f_2) = \left( \int \| f_1(X) - f_2(X) \|_p^p \, dX \right)^{1/p}, \tag{2}$$

which is element-wise continuous (w.r.t. the $\ell_p$ norm) for 1 ≤ p < ∞.

Theorem 1 (Universal Approximation, Yun et al. (2019)). Let 1 ≤ p < ∞ and ϵ > 0. Then for any given $f \in \mathcal{F}$, there exists a Transformer network $g \in \mathcal{T}^{2,1,4}$ such that $d_p(f, g) \le \epsilon$.

The proof of Theorem 1 makes three stages of approximations, which are chained together via the triangle inequality to give the ϵ bound (Yun et al., 2019). In particular, ① any $f \in \mathcal{F}$ is approximated by a piece-wise linear function $\bar{f} \in \bar{\mathcal{F}}$ (over a discretized input space). Then ② the piece-wise linear function is approximated by a modified transformer $\bar{g} \in \bar{\mathcal{T}}^{2,1,4}$, where the widely used ReLU and $\sigma_S$ activation functions (as per Eq. 1) are replaced by the hardmax function $\sigma_H$. Finally, ③ it is shown that any modified transformer in $\bar{\mathcal{T}}^{2,1,4}$ can be approximated by a regular transformer $g \in \mathcal{T}^{2,1,4}$. The key step comes in the proof of the second approximation ②. In Yun et al. (2019), the approximation is proved by showing that the multi-head self-attention layers of the modified transformer can implement any contextual map $q_c : \mathbb{R}^{d \times n} \to \mathbb{R}^{n}$.

Definition 3.1 (Contextual Mapping). Consider a finite set $\mathbb{L} \subset \mathbb{R}^{d \times n}$. A map $q : \mathbb{L} \to \mathbb{R}^{1 \times n}$ defines a contextual map if it satisfies the following:
1. For any $L \in \mathbb{L}$, the n entries in q(L) are all distinct.
2. For any $L, L' \in \mathbb{L}$ with $L \neq L'$, all entries of q(L) and q(L′) are distinct.

Intuitively, a contextual map can be thought of as a function that outputs unique "id-values". The only way for a token (column) in $L \in \mathbb{L}$ to share an "id-value" (element of q(L)) is to map the exact same sequence. As each token in the sequence is mapped to a unique value, an appropriately constructed feed-forward neural network can map a sequence to any other desired sequence, providing a universal approximation guarantee. In Yun et al. (2020), such a contextual map is implemented via selective shift operators and all-max-shift operators through careful construction of multi-head self-attention layers.
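The two conditions of the contextual mapping definition can be checked mechanically on a finite collection of outputs. The helper below is a hypothetical illustration, not part of the paper: it assumes the values q(L) have already been computed for each distinct input L and that exact equality is a meaningful test for the "id-values".

```python
def is_contextual_map(q_values):
    """q_values[i] is the tuple q(L_i) for the i-th distinct input L_i.

    Returns True when (1) the entries within each q(L_i) are all distinct and
    (2) no value is shared between q(L_i) and q(L_j) for i != j.
    """
    seen = set()
    for ids in q_values:
        ids_set = set(ids)
        if len(ids_set) != len(ids):  # condition 1: a repeated id inside one sequence
            return False
        if seen & ids_set:            # condition 2: an id already used by another sequence
            return False
        seen |= ids_set
    return True

# Toy id-values, invented for illustration.
ok = is_contextual_map([(1, 2, 3), (4, 5, 6)])
repeated_inside = is_contextual_map([(1, 2, 2)])
shared_across = is_contextual_map([(1, 2, 3), (3, 4, 5)])
```

Here `ok` is true, while the other two cases violate the first and second condition respectively.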

4. METHODOLOGY

We propose a variation of the sparse attention transformer, the sampled sparse attention transformer, applicable to point sets. We deviate from the typical sparse attention transformer in two ways. First, we randomly sub-sample the input point set l times, with each sub-sample being evaluated through a shared multi-head self-attention layer. Second, we propose a simple Hamiltonian self-attention mechanism, a special case of the sparse attention mechanism, to reduce the computational complexity of processing point sets. This ultimately yields a variant of the typical sparse transformer (Eq. 1) which can be interpreted as using a sampled attention mechanism, as depicted in Fig. 1. To study the approximation capabilities of our proposed architecture, we prove that our sampled sparse attention transformer is a universal approximator of set-to-set functions.

4.1. RANDOM ELEMENT SAMPLING

For a point set input $X \in \mathbb{R}^{d \times n}$, instead of directly applying the transformer attention layer to n tokens, we process l many sub-sampled inputs $X_i \in \mathbb{R}^{d \times n_s}$ for $i \in [l]$ and $2 \le n_s \le n$. For simplicity, we assume that $(n_s - 1) \cdot l = n$. The sub-sampled inputs $X_i$ can be defined by taking various column submatrices:

$$X_i = X_{R^i \cup R^{\gamma(i)}_1}; \qquad \gamma(v) = 1 + (v \bmod l), \tag{3}$$

where $R^1, \ldots, R^l$ are randomly selected ordered index sets such that $|R^i| = n_s - 1$ and $R^i \cap R^j = \emptyset$ for $i \neq j$. The index element $R^i_1$ denotes the first index in the ordered set $R^i$. The cycle function $\gamma : [l] \to [l]$ ensures that the edge case of $X_l$ is well defined, i.e., γ(l) = 1.

Intuitively, the sequence of sub-sampled inputs $X_1, \ldots, X_l$ can be interpreted as a rolling window over $(n_s - 1) \cdot l = n$ many sampled point set elements. Indeed, by concatenating the index sets in order, $X_i$ is a sliding window over the elements with size $n_s$ and stride $n_s - 1$ (with wrapping). It should be noted that $X_i$ can be treated as a random variable. As such, a single realization of the sampled elements can be viewed as a Monte Carlo sample over the set of ordered point sequences (Metropolis & Ulam, 1949).

Computationally, by applying a dense self-attention layer to each of the sub-sampled inputs $X_i$, the total complexity of evaluating l many self-attention layers is $O(l \cdot n_s^2)$. We note, however, that the l self-attention layers can be evaluated in parallel, which yields a trade-off between individual self-attention complexity $O(n_s^2)$ and computation time. To gain intuition, consider the "limiting behaviours" of our random element sampling: taking $n_s = n + 1$ can be interpreted as taking the whole sequence with l = 1, i.e., $X_1 = X$, which under dense attention would result in complexity O(n^2). On the other end, if we take $n_s = 2$, we get l = n pairs of points, $|X_i| = 2$; processing every such pair with dense self-attention results in n many O(1) self-attention evaluations. Random element sampling with dense attention layers can be interpreted as an instance of sparse attention, see Fig. 1b.
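The sub-sampling rule of this section can be illustrated with a few lines of standard-library Python. This is a sketch under stated assumptions: indices are 0-based (so the cycle function becomes `(i + 1) % l`), the disjoint index sets are obtained by cutting a single random permutation into consecutive blocks, and the function name is hypothetical.

```python
import random

def random_element_sampling(n, n_s, rng):
    """Return l ordered index lists of length n_s, where (n_s - 1) * l = n.

    Each list is a block R^i of n_s - 1 random indices followed by the first
    index of the next block, wrapping around at the end (0-based indices).
    """
    assert n_s >= 2 and n % (n_s - 1) == 0
    l = n // (n_s - 1)
    perm = list(range(n))
    rng.shuffle(perm)  # one random ordering of all points
    R = [perm[i * (n_s - 1):(i + 1) * (n_s - 1)] for i in range(l)]
    return [R[i] + [R[(i + 1) % l][0]] for i in range(l)]

subsets = random_element_sampling(n=12, n_s=4, rng=random.Random(0))
```

With n = 12 and n_s = 4 this gives l = 4 windows of size 4 and stride 3; the last index of each window equals the first index of the next, which is what later closes the individual paths into a cycle.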

4.2. HAMILTONIAN SELF-ATTENTION

The random element sampling discussed in the previous section reduces the computational complexity of dense self-attention layers from O(n^2) to $O(l \cdot n_s^2) = O(n^2 / l)$ (as $(n_s - 1) \cdot l = n$) by processing each sampled set of points $X_i$ through individual self-attention layers. Despite this improved computational complexity, the quadratic scaling in n can still be costly for point clouds. As such, instead of evaluating each sampled element $X_1, \ldots, X_l$ with a dense self-attention layer, we propose a sparse attention layer.

Sparse attention mechanisms can be formally defined via the attention patterns $\{A_k\}_{k \in [n_s]}$, where $j \in A_k$ implies that the j-th token will attend to the k-th token. We propose the use of an attention mechanism, dubbed Hamiltonian self-attention, which is defined by the following attention patterns:

$$A_k = \begin{cases} \{k, k + 1\} & \text{if } 1 \le k < n_s \\ \{k\} & \text{otherwise } (k = n_s), \end{cases} \tag{4}$$

which ensures that the set of attention patterns $\{A_k\}_{k \in [n_s]}$ defines a Hamiltonian path. Indeed, if we fix a subset of elements $X_i$, by starting at its first element and following the attended elements (ignoring the self-attention $k \in A_k$), we visit every token exactly once. Fig. 1c shows the corresponding attention matrix, where the Hamiltonian path corresponds to the off-diagonal elements and self-attention corresponds to the diagonal elements. For Hamiltonian self-attention, computing the attention mechanism according to Eq. 4 only requires $2 n_s = O(n_s)$ many evaluations. Thus, by using our proposed sparse attention for each $X_1, \ldots, X_l$, in comparison to dense attention, the computational complexity per subset reduces from $O(n^2 / l^2)$ to $O(n / l)$.

The proposed Hamiltonian self-attention mechanism is rather simple, whereas sparsity patterns can be defined more generally. For instance, in the general case sparsity patterns can be defined for each individual layer (resulting in an additional superscript for each $A_k$). Despite this simplicity, the attention patterns $\{A_k\}_{k \in [n_s]}$ satisfy the key assumptions for proving that the attention pattern will result in a sparse transformer that is a universal approximator (Yun et al., 2020, Assumption 1). In particular, by stacking $(n_s - 1)$ many attention layers, our Hamiltonian self-attention allows any element to directly or indirectly attend to all other elements in $X_i$. The proposed Hamiltonian self-attention could also be viewed as a special case of window attention in Zaheer et al. (2020), where elements are linked undirectedly.
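To make the attention patterns concrete, the sketch below builds the sets A_k for a small subset, turns them into a boolean mask, and follows the non-self links to confirm that they trace a path through every token. The helper names are hypothetical, indices are 1-based to match the text, and the mask convention (entry (j, k) is active when j is in A_k) is an assumption made for illustration.

```python
def hamiltonian_patterns(n_s):
    """A_k = {k, k + 1} for k < n_s and A_k = {k} for k = n_s (1-based)."""
    return {k: ({k, k + 1} if k < n_s else {k}) for k in range(1, n_s + 1)}

def attention_mask(patterns, n_s):
    """mask[j - 1][k - 1] is True exactly when j is in A_k."""
    return [[j in patterns[k] for k in range(1, n_s + 1)]
            for j in range(1, n_s + 1)]

def path_from(patterns, start=1):
    """Follow the single non-self link of each pattern until none remains."""
    path, k = [start], start
    while patterns[k] - {k}:
        (k,) = patterns[k] - {k}
        path.append(k)
    return path

patterns = hamiltonian_patterns(5)
mask = attention_mask(patterns, 5)
path = path_from(patterns)                    # visits 1, 2, 3, 4, 5 in order
active = sum(sum(row) for row in mask)        # 2 * n_s - 1 active entries
```

The number of active entries grows linearly with n_s, in contrast to the n_s squared entries of a dense mask of the same size.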

4.3. SAMPLED SPARSE ATTENTION TRANSFORMER

Given the setup of random element sampling and Hamiltonian self-attention, we can define our proposed sampled transformer for continuous set-to-set function approximation:

$$\mathrm{SHead}^j_k(X_i) = \left(W_V^j (X_i)_{A_k}\right) \cdot \sigma_S\left[\left(W_K^j (X_i)_{A_k}\right)^\top W_Q^j (X_i)_k\right] \tag{5a}$$
$$g_i(X_i) = X_i + W_O \begin{bmatrix} \mathrm{SHead}^1(X_i) \\ \vdots \\ \mathrm{SHead}^h(X_i) \end{bmatrix} \tag{5b}$$
$$\mathrm{SAttn}(X) = g_l(X_l) \circ g_{l-1}(X_{l-1}) \circ \cdots \circ g_1(X_1) \tag{5c}$$
$$\mathrm{STB}(X) = \mathrm{SAttn}(X) + W_2 \cdot \mathrm{ReLU}(W_1\, \mathrm{SAttn}(X)). \tag{5d}$$

In Eq. 5c, composition is w.r.t. the induced linear maps from matrices given by Eq. 5b. The learnable parameters of the sampled transformer are the same as those of the usual dense transformer in Eq. 1.

As the attention pattern of each $X_i$ forms a Hamiltonian path, and each $X_i$ shares an element with the succeeding $X_{\gamma(i)}$, the joint attention map forms a Hamiltonian cycle. In other words, the shared index $R^{\gamma(i)}_1$ in Eq. 3 links the individual Hamiltonian paths given by Eq. 4, so that the attention matrix forms a cycle attention as shown in Fig. 1d. Furthermore, the permutation of elements in cycle attention corresponds to the swapping of nodes in the Hamiltonian cycle, with corresponding links and swapping of element values in the attention matrix, see Fig. 1e. As a result, the combined randomization from using random element sampling and Hamiltonian self-attention can be thought of as sampling from the set of Hamiltonian cycle graphs of the complete attention graph, resulting in the sampled attention depicted in Fig. 1f.

Unlike dense attention, sparse attention patterns are not generally permutation invariant. Indeed, if we permute the columns of $X_i$, the elements attended according to $\{A_k\}_{k \in [n_s]}$ are not the same. As such, applying $\{A_k\}_{k \in [n_s]}$ directly to X is not valid for point clouds, which require a permutation invariant operation. However, in our case the sparse attention heads are applied to randomized sub-sampled element sets $X_i$. Ignoring computation, if we continue to sample the randomized elements $X_i$ and average the resulting attention (w.r.t. the entire point set X), the attention will converge to dense attention, because through the randomization of $X_i$ the event that any non-self-edge appears in a sampled attention graph (as per Eq. 4) is equiprobable. This also holds when fixing the order of elements while applying randomly sampled Hamiltonian cycle attention. As such, the sampled transformer can be used to approximate a permutation invariant operator, and thus be used to approximate set-to-set functions.

Of course, sampling sufficiently many realizations of Hamiltonian cycle attention to converge to dense attention is impractical. Instead, in practice, we re-sample the attention pattern only for each batch and epoch. Although this may seem like a crude approximation to dense attention, similar methods are successful in Dropout (Srivastava et al., 2014), which even induces desirable model regularization. Furthermore, our empirical results indicate that sampled sparse attention closely approximates the more expensive (and infeasible at the typical point set scales) dense attention.
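The equiprobability argument can be checked empirically on a toy example. The sketch below assumes that one realization of cycle attention over n points is equivalent to drawing a random ordering of the points and linking each point to its successor (with wrap-around); it then estimates how often each ordered pair is linked. The function names and the toy sizes are hypothetical.

```python
import random

def sample_cycle_links(n, rng):
    """One random Hamiltonian cycle over n >= 2 nodes as a list of directed links."""
    perm = list(range(n))
    rng.shuffle(perm)
    return [(perm[i], perm[(i + 1) % n]) for i in range(n)]

def estimate_link_frequency(n, num_samples, seed=0):
    """Fraction of sampled cycles in which node a is linked to node b."""
    rng = random.Random(seed)
    counts = [[0] * n for _ in range(n)]
    for _ in range(num_samples):
        for a, b in sample_cycle_links(n, rng):
            counts[a][b] += 1
    return [[c / num_samples for c in row] for row in counts]

freq = estimate_link_frequency(n=5, num_samples=2000)
```

For n = 5 every off-diagonal entry of `freq` should be close to 1 / (n - 1) = 0.25, so the average of many sampled sparse patterns covers the off-diagonal entries of the dense pattern uniformly, while each individual sample keeps only n links.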

4.4. SAMPLED TRANSFORMER AS A UNIVERSAL APPROXIMATOR

We formally guarantee the representation power of the proposed sampled transformer by proving universal approximation for set-to-set functions. As our sampled transformer Eq. 5c is similar to the dense / sparse transformers presented by Yun et al. (2019; 2020), we follow their framework (Sec. 3.3) to prove our universal approximation property.

Corollary 1 (Sampled Transformer is a Universal Approximator). There exist sampled (sparse) Transformers that are universal approximators in the sense of Theorem 1.

To prove our Corollary, we extend the proof of Yun et al. (2019; 2020) by showing that our sparse attention mechanisms with random element sampling can also implement a selective shift operator. As a result, we show that the proposed sampled sparse attention transformer is a universal approximator in the context of set-to-set functions. See §E in the supplementary material for the full proof of the universal approximation property.

Table 1. Classification accuracy on ModelNet40.

Method                                        Accuracy
(Qi et al., 2017b)                            90.7%
PointCNN (Li et al., 2018)                    92.5%
KPConv (Thomas et al., 2019)                  92.9%
DGCNN (Wang et al., 2021)                     92.9%
RS-CNN (Liu et al., 2019b)                    92.9%
[T] PCT (Guo et al., 2021)                    93.2%
[T] PVT (Zhang et al., 2021)                  93.6%
[T] PointTransformer (Zhao et al., 2021)      93.7%
[T] Transformer (Yu et al., 2022)             91.4%

Self-Supervised Methods                       Accuracy
OcCo (Wang et al., 2021)                      93.0%
STRL (Huang et al., 2021)                     93.1%
IAE (Yan et al., 2022)                        93.7%
[ST] Transformer-OcCo (Yu et al., 2022)       92.1%
[ST] Point-BERT (Yu et al., 2022)             93.2%
[ST] Point-MAE (Pang et al., 2022)            93.8%
[ST] MAE-dense (ours)                         93.6%
[T] MAE-sampled (ours)                        93.7%

5. EXPERIMENTS

We evaluate our proposed sampled attention in popular transformer-based frameworks as well as a basic setting. We compare our sampled attention (Fig. 1f ) with dense attention via the pre-training and fine-tuning framework (Yu et al., 2022; Pang et al., 2022) , where we pre-train our model on ShapeNet (Chang et al., 2015) via the reconstruction task, and further evaluate the performance on two downstream fine-tuning tasks: classification and few-shot learning in ModelNet40 (Wu et al., 2015) . In addition, to eliminate the influence of other factors, we compared the dense, sparse, sampled, and kNN attention (Definition B.1) in a basic classification setting consisting of a transformer block with a single attention layer for feature aggregation, as well as a minimal number of MLPs for feature mapping. Finally, we compare the sampled attention with the kNN attention in the hierarchical grouping and merging structure following the Point-Transformer (Zhao et al., 2021).

5.1. COMPARISON ON PRE-TRAINING AND FINE-TUNING FRAMEWORK

Pre-training. We adopted the masked auto-encoder (MAE) (He et al., 2022) to process the point cloud data, denoted as MAE-dense, for pre-training. This framework is concurrent work with Point-MAE (Pang et al., 2022). Note that MAE-dense adopts dense-attention layers in its encoder and decoder network. To evaluate the effectiveness of our claimed contribution, we replace the dense-attention layer in MAE-dense with our sampled-attention layer (Fig. 1f) while keeping the other components fixed. The resulting model is denoted as MAE-sampled. To pre-train MAE-dense and MAE-sampled, we follow the standard train-test split of ShapeNet (Chang et al., 2015), which is also adopted by Pang et al. (2022); Yu et al. (2022), and develop the following training strategy. To begin with, each input point cloud consisting of 1024 points was divided into 64 groups / tokens of 32 points each. Farthest point sampling (FPS) and nearest neighbour search were adopted in tokenization (Yu et al., 2022). Tokens were further mapped to 256-dimensional latent vectors by MLP layers and max-pooling. In addition, we have 12 stacked transformer blocks in the encoder (masking ratio of 70%) and a single transformer block in the decoder, both with h = 8, d = 32 and r = 256. The batch size is 64 and the number of epochs is 300. We used the AdamW (Loshchilov & Hutter, 2017) optimizer with cosine learning rate decay (Loshchilov & Hutter, 2016), an initial learning rate of 0.0005, and weight decay of 0.05.
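The tokenization step (1024 points grouped into 64 tokens of 32 points) can be sketched with NumPy as follows. This is a minimal illustration rather than the implementation used in the experiments: it assumes that group centres are chosen by greedy farthest point sampling and that each group consists of the nearest neighbours of its centre; the function names and the random test cloud are hypothetical.

```python
import numpy as np

def farthest_point_sampling(points, num_centers, start=0):
    """Greedy FPS on points of shape (N, 3); returns num_centers indices."""
    dist = np.full(points.shape[0], np.inf)  # squared distance to nearest chosen centre
    chosen = [start]
    for _ in range(num_centers - 1):
        diff = points - points[chosen[-1]]
        dist = np.minimum(dist, np.einsum("ij,ij->i", diff, diff))
        chosen.append(int(dist.argmax()))    # farthest point from all chosen centres
    return np.array(chosen)

def group_tokens(points, num_groups=64, group_size=32):
    """Return centre indices (G,) and grouped points (G, group_size, 3)."""
    centers = farthest_point_sampling(points, num_groups)
    # Squared distances between every centre and every point, shape (G, N).
    d2 = ((points[centers][:, None, :] - points[None, :, :]) ** 2).sum(-1)
    neighbours = np.argsort(d2, axis=1)[:, :group_size]
    return centers, points[neighbours]

points = np.random.default_rng(0).random((1024, 3))  # toy cloud for illustration
centers, groups = group_tokens(points)
```

Each row of `groups` would then be passed through the MLP and max-pooling stage described above to produce one token embedding.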

Classification

The pre-trained MAE-dense and MAE-sampled models are first evaluated on the classification task in ModelNet40 (Wu et al., 2015), with the standard training and testing splits defined in Yu et al. (2022); Pang et al. (2022). Specifically, we build the classifier by keeping the encoder structure and weights of the pre-trained MAE-dense and MAE-sampled models, followed by max-pooling as well as a fully connected layer of dimension [256, 256, 40] to map the global

                                                                      19.9 ± 2.1   16.9 ± 1.5
DGCNN-OcCo (Wang et al., 2021)            90.6 ± 2.8   92.5 ± 1.9   82.9 ± 1.3   86.5 ± 2.2
Transformer-rand (Yu et al., 2022)        87.8 ± 5.2   93.3 ± 4.3   84.6 ± 5.5   89.4 ± 6.3
Transformer-OcCo (Yu et al., 2022)        94.0 ± 3.6   95.9 ± 2.3   89.4 ± 5.1   92.4 ± 4.6
Point-BERT (Yu et al., 2022)              94

5.2. COMPARISON ON BASIC CLASSIFICATION SETTING

Our inputs are clouds of n points with 3D coordinates as positions and their normal information as features. The features and positions are first transformed by two separate MLP layers with hidden dimensions [64, 256], and then added together as the input of a single-layer transformer with h = 8, r = 256, and d = 32, as per Eq. 1 and Eq. 5. The transformer output in $\mathbb{R}^{n \times 256}$ is then summarized by max-pooling to obtain a global feature of dimension 256, followed by a fully connected layer to map it to the category vector. Here we tested this basic pipeline with n ∈ {256, 512, 768, 1024, 2048, 3072, 4096, 8192} for each of the dense, sparse, kNN, and the proposed sampled attention layers, including an additional case without an attention layer (MLP + fully connected layer) as the baseline.

As shown in Tab. 3 and Tab. 4, the model with dense attention layers achieves the best performance, as it considers all O(n^2) connections directly with relatively few parameters to train. However, it runs out of the 24 gigabytes of memory when the number of points n ≥ 3072, due to the quadratic complexity. While both sparse and sampled transformers have a computational complexity of O(n), our model with sampled attention outperformed the sparse one, in line with the strong theoretical guarantees we provide. We conjecture that the improvement of the sampled transformer over the sparse transformer may indicate that the additional randomness (randomly shuffling points, w/o attention) ... (Srivastava et al., 2014). In addition, the transformer with kNN attention layers has the worst performance, as the permutation could not extend its receptive field. Finally, the memory usage comparison in Tab. 4 shows that the dense transformer has the largest memory usage due to its O(n^2) complexity. The sparse transformer and sampled transformer have comparable memory usage due to the same O(n) complexity.
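As a rough illustration of the scaling gap at the tested point counts, the snippet below counts attention-matrix entries per head. It is arithmetic only, not a memory model: it assumes one self link and one cycle link per point for sampled attention (about 2n entries) and n squared entries for dense attention, and the function name is hypothetical.

```python
def attention_entries(n, kind):
    """Number of attention-matrix entries evaluated per head for n points."""
    if kind == "dense":
        return n * n      # every point attends to every point
    if kind == "sampled":
        return 2 * n      # assumed: one self link and one cycle link per point
    raise ValueError(f"unknown attention kind: {kind}")

for n in (256, 512, 768, 1024, 2048, 3072, 4096, 8192):
    dense = attention_entries(n, "dense")
    sampled = attention_entries(n, "sampled")
    print(f"n={n:5d}  dense={dense:10d}  sampled={sampled:6d}  ratio={dense // sampled}")
```

Under these assumptions the ratio between the two counts is n / 2, so it keeps growing with the number of points.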

5.3. COMPARISON ON HIERARCHICAL TRANSFORMER STRUCTURE

We further compare our sampled attention with kNN attention by adopting the hierarchical structure for the classification task under the framework of Zhao et al. (2021). Each hierarchical layer is obtained by FPS, followed by nearest neighbour search for the grouping, MLPs with max-pooling for feature merging, and transformers for feature mapping. The grouping stage within each hierarchical layer summarizes the point cloud into key (subset) points. The total number of hierarchical layers is t = 5, for which we chose the numbers k of nearest neighbours {8, 16, 16, 16, 16}, strides {4, 4, 4, 4, 4}, self-attention feature dimensions {32, 64, 128, 256, 512}, and transformer blocks {2, 3, 4, 6, 3}. The scalar attention (Eq. 1 or Eq. 5) is adopted specifically for comparison. The results in Tab. 5 demonstrate that our sampled attention outperforms the kNN attention, in line with our randomly sampled receptive field. Furthermore, the performance of the kNN layer improved greatly from t = 1 to t = 2 and from t = 2 to t = 3, as its receptive field extends due to the multiple hierarchical layers. Finally, kNN with vector attention (Zhao et al., 2021) (reported in Tab. 1 on the PointTransformer row) achieved better performance, in line with the observation that replacing the softmax with learnable MLPs γ in the transformer can more easily make kNN attention a universal approximator of continuous functions. Detailed analysis is provided in §B.1 in the supplementary material. The performance difference between scalar attention and vector attention is shown in Tab. 7 of Zhao et al. (2021), and is also analyzed in Yun et al. (2020).

6. CONCLUSION

In this paper, we present an O(n) complexity sparse transformer, the sampled transformer, which directly handles point set data. By relating the permutation of set elements to the sampling of Hamiltonian cycle attention, we relieve the model of inappropriate permutation variance. The result is a sampled attention scheme that implements Monte Carlo simulation to approximate a dense attention layer with a prohibitive O(n^2) number of connections. To guarantee the representation power of the proposed sampled transformer, we showed that it is a universal approximator of set-to-set functions. Motivated also by the strong empirical performance that our model achieves, we hope this work will help to shed light on the sparse transformer in dealing with set data.

A NOTATIONS

$\ell_p^p$: norm
$G_\delta$: grid $\{0, \delta, \ldots, 1 - \delta\}^{d \times n}$
$G_\delta^+$: extended grid $\{-\delta^{-nd}, 0, \delta, \ldots,$
$q_c(\cdot)$: contextual mapping
$\Psi(\cdot; b_Q, b'_Q)$: selective shift operation
$\psi(\cdot; b_Q)$: a single-

In addition, in the case of vector attention (Eq. 3 in (Zhao et al., 2021)), universal approximation holds as the learnable mapping γ(•) (an MLP) is a universal approximator. This may help to explain why vector attention could outperform scalar attention in Tab. 7 of (Zhao et al., 2021).

Finally, in Tab. 3, the performance of the kNN transformer drops with the increasing number of points. This is because, as the number of points increases, the fixed number k of nearest neighbours becomes relatively smaller. As a result, the receptive field shrinks and the performance drops.

B.2 IN COMPARISON WITH INDUCING POINTS (SET TRANSFORMER)

We additionally compared the proposed sampled attention with the learnable inducing points strategy (Lee et al., 2019) in Tab. 6. The inducing points are implemented by simply replacing the multi-head self-attention transformer block in Eq. 5e with the Induced Set Attention Block (ISAB) in Eq. (9) of Lee et al. (2019), and the positional embedding is added to the key and value inputs as in our sampled attention. Our implementation of the basic classification in Sec. 5.2 differs from that of Lee et al. (2019) in the data pre-processing: ours is in line with Zhao et al. (2021); Yu et al. (2022), while Lee et al. (2019) follow Zaheer et al. (2017) without positional embedding. As shown in Tab. 6, our proposed sampled attention outperforms the inducing points strategy (Lee et al., 2019) with linear complexity in the attention matrix.

As the performance of Lee et al. (2019) differs considerably between the two implementations, we further compared the sampled attention and the inducing points strategy within the official implementation of Lee et al. (2019). To begin with, our proposed sampled attention can be applied to the inducing points strategy directly to reduce its complexity from O(mn) to O(n), where n is the number of input points and m is the number of learnable inducing points. Specifically, we use the sampled attention to replace the dense attention in the ISAB from Eq. 9 of Lee et al. (2019). However, the inducing points and the input points have different physical meanings, and the number of inducing points m (the queries in the attention) is not equal to the number of input points n (the keys and values), so our Hamiltonian cycle attention cannot be applied directly. We instead applied a different version of sampled attention that randomly samples two elements per row of the dense attention matrix. This is a loose version of sampled attention, as no Hamiltonian cycle is constructed. The results are reported in Tab. 7. Our proposed sampled attention remains comparable with the set transformer at lower computational complexity.
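The loose, row-wise variant described above can be illustrated with a minimal sketch. It assumes plain scaled dot-product attention on Python lists; the function names `sample_row_mask` and `sparse_attention` are hypothetical and are not taken from the official implementation of Lee et al. (2019).

```python
import math
import random


def sample_row_mask(m, n, per_row=2, rng=None):
    """For each of the m query rows, pick `per_row` distinct key indices in [0, n)."""
    rng = rng or random.Random()
    return [sorted(rng.sample(range(n), per_row)) for _ in range(m)]


def sparse_attention(queries, keys, values, mask):
    """Softmax attention in which query row i only attends to the keys in mask[i]."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    out = []
    for q, idx in zip(queries, mask):
        # Scaled dot products restricted to the sampled keys of this row.
        scores = [scale * sum(a * b for a, b in zip(q, keys[j])) for j in idx]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * values[j][c] for w, j in zip(weights, idx))
            for c in range(dim)
        ])
    return out
```

With two sampled keys per row, the number of evaluated query-key pairs is 2m rather than mn, which is where the saving over the dense m × n matrix comes from in this sketch.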

B.3 IN COMPARISON WITH STRATIFIED STRATEGY

The window-based transformer is another important branch of work exploring the representation power of the transformer. Combined with a hierarchical backbone, it has been widely used in processing 2D images, language, and 3D point clouds, e.g., Liu et al. (2021); Lai et al. (2022). The window-based transformer is designed to learn cross-window relationships as well as non-overlapping local relationships. Here we compared our proposed sampled attention with the Stratified strategy from Figure 3 of Lai et al. (2022) in Tab. 6. The Stratified strategy can be viewed as a combination of dense and sparse keys obtained by window partitions of different sizes. It is an efficient design for learning token relationships in a hierarchical backbone. However, in the single-layer setting, directly learning the O(n^2) connections in the attention matrix may be a better solution, as it reaches the full receptive field. As our proposed sampled attention mechanism estimates the O(n^2) connections by Monte Carlo simulation, it outperforms the Stratified strategy in the basic classification setting, as per Tab. 6.

C AMORTIZED CLUSTERING WITH MIXTURE OF GAUSSIANS

We additionally tested the proposed sampled attention on 2D set datasets in the encoding-decoding framework introduced by Lee et al. (2019). The task is to use a neural network to learn the parameters of a mixture of Gaussians from the input set data. The mixture of Gaussians is defined as a weighted sum of k Gaussian distributions. Given a dataset $X = \{x_1, \ldots, x_n\}$, its log-likelihood is defined as follows:

$$\log p(X; \theta) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} \pi_j \, \mathcal{N}\big(x_i; \mu_j, \mathrm{diag}(\sigma_j^2)\big). \tag{6}$$

Generally, the parameters of the mixture are inferred by maximizing the log-likelihood, $\theta^*(X) = \arg\max_\theta \log p(X; \theta)$, using the Expectation-Maximisation (EM) algorithm, as no closed-form solution can be obtained by setting the gradient to zero. Here we instead use the transformer to infer $\theta^*(X)$. Specifically, given the input, the neural network f outputs the mixture parameters $f(X) = \{\pi(X), \{\mu_j(X), \sigma_j(X)\}_{j=1}^{k}\}$ by maximizing the log-likelihood in Eq. 6 (with all parameters replaced by functions of X). The 2D set data X is randomly sampled from a given mixture of Gaussians with k = 4, and the number of elements n is randomly sampled from [100, 500]. Namely, when the dimension of the Gaussian distribution is set to 2, each sampled point can be viewed as a 2D data point, so the sampled collection is a 2D set dataset. The baseline we compare with is the Set Transformer (Lee et al., 2019) with two Induced Set Attention Blocks (ISAB) in the encoder, and one Pooling by Multi-head Attention (PMA) block and two Set Attention Blocks (SAB) in the decoder, as per the official implementation. The inducing points refer to the additional learnable points $I \in \mathbb{R}^{m \times d}$ proposed in Eq. 9 of Lee et al. (2019), with dimension d and m inducing points. Here the terms points, tokens, and elements are used interchangeably for a single sampled data point $x_i$.

As the computational complexity of the inducing points block (ISAB) is O(nm), our sampled attention may be adopted in the ISAB to reduce the computational complexity to O(n). However, the number of inducing points m (regarded as the queries in Lee et al. (2019)) is not equal to the number of input points n (regarded as the keys and values), and inducing points and input points have different physical meanings, so our Hamiltonian cycle attention cannot be applied directly. In fact, the dense attention matrix in the inducing points layer is m × n rather than n × n. We instead applied a different version of sampled attention that randomly samples two elements per row of the attention matrix. This is a loose version of sampled attention, as no Hamiltonian cycle is constructed. As shown in Tab. 8, the sampled attention can serve as a plug-in module that replaces the dense attention in the inducing points structure, with competitive performance and theoretically lower computational complexity.
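The mixture-of-Gaussians log-likelihood that serves as the training objective in this section can be evaluated with a short sketch. It assumes diagonal covariances given as per-dimension standard deviations and uses a log-sum-exp for numerical stability; the function names are hypothetical and not part of any referenced codebase.

```python
import math


def log_normal_diag(x, mu, sigma):
    """Log density of N(x; mu, diag(sigma^2)) for equal-length sequences."""
    return sum(
        -0.5 * math.log(2.0 * math.pi) - math.log(s) - 0.5 * ((xi - m) / s) ** 2
        for xi, m, s in zip(x, mu, sigma)
    )


def mog_log_likelihood(X, pi, mus, sigmas):
    """Sum over points of the log of the weighted sum of component densities."""
    total = 0.0
    for x in X:
        # log(pi_j) + log N(x; mu_j, diag(sigma_j^2)) for every component j.
        terms = [
            math.log(p) + log_normal_diag(x, mu, sg)
            for p, mu, sg in zip(pi, mus, sigmas)
        ]
        top = max(terms)
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total
```

Dividing the returned value by the number of points gives an average log-likelihood per data point, the kind of quantity reported for the amortized clustering results.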

D TRANSFER LEARNING

We additionally included transfer learning as a fine-tuning classification task within the pre-training and fine-tuning framework of Sec. 5.1, to demonstrate that the proposed sampled transformer also transfers well. The fine-tuning task is implemented on the ScanObjectNN (Uy et al., 2019) dataset with 2902 point clouds from 15 categories. We follow the data pre-processing and fine-tuning setting of Point-BERT (Yu et al., 2022) with the same three variants: OBJ-BG, OBJ-ONLY, and PB-T50-RS. As shown in Tab. 9, our sampled attention layer achieved competitive performance in comparison with dense attention while reaching state-of-the-art performance.

E UNIVERSAL APPROXIMATOR PROOF

A proof of Corollary 1 follows the steps described in § 3.3. As we only changed the dense/sparse attention to the sampled attention, steps ① and ③ in § 3.3 remain the same as in Yun et al. (2019; 2020) and can be found in §C and §F of Yun et al. (2020). Here we need only cover the proof of step ②.

(Xu et al., 2018) 77.1 79.5 73.7
PointNet++ (Qi et al., 2017b) 82.3 84.3 77.9
PointCNN (Li et al., 2018) 86.1 85.5 78.5
DGCNN (Wang et al., 2021) 82

Lemma 2 (Modified Universal Approximation). For each $f \in \bar{\mathcal{F}}_S(\delta)$ and $1 \leq q < \infty$, $\exists\, g \in \bar{\mathcal{T}}^{2,1,1}$ such that $f(X) = g(X)$ for all $X \in D$.

Without loss of generality, here $D \in [0, 1)^{d \times n}$, as in (Yun et al., 2019; 2020). The proof of Lemma 2 can then be separated into four steps:

1. Use the positional embedding E in § 3.2 such that the columns $X_k + E_k$ of the input lie in disjoint intervals.
2. The input X + E is quantized into L with values in $\{0, \delta, \ldots, n - \delta\}$ by a series of modified feed-forward layers.
3. The contextual mapping q defined in Definition 3.1 is implemented by a series of modified sampled multi-head self-attention layers (a modified version of Eq. 5c) with input L.
4. Another series of modified feed-forward layers implements the value mapping such that each element in the unique id q(L) is mapped to the desired output $A_X$.

As the modified feed-forward layers are all the same as in Yun et al. (2020), the definition and proof of step 2 are available in §D.2 and §E.1 of (Yun et al., 2020), while the definition and proof of step 4 can be found in §D.4 and §E.3 of (Yun et al., 2020). Here we mainly explain steps 1 and 3.

E.1 POSITIONAL EMBEDDING

The positional input for a point set is its xyz coordinates $P \in \mathbb{R}^{3 \times n}$. We adopt a matrix $W_p \in \mathbb{R}^{d \times 3}$ (a permutation invariant operation) such that the input of the sampled transformer is $X + E = X + W_p P$. There exists a case such that $E_1 = (n - 1)\mathbf{1}_n$ and $E_i = (i - 2)\mathbf{1}_n$ for $i \in [2 : n]$. In this case, the first column satisfies $(X + E)_1 \in [n - 1, n)^d$, and $(X + E)_i \in [i - 2, i - 1)^d$ for $i \in [2 : n]$. The requirement of step 1 is therefore satisfied: each column lies in a disjoint interval.
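The disjointness claim for this special case can be checked with a small sketch. It assumes every entry of X lies in [0, 1), so each column of X + E falls in a unit-length half-open interval starting at its offset; the helper names are hypothetical.

```python
def positional_offsets(n):
    """Offsets from the special case: column 1 gets n-1, column i >= 2 gets i-2."""
    return [n - 1] + [i - 2 for i in range(2, n + 1)]


def column_intervals(n):
    """Half-open interval [lo, lo + 1) containing column i of X + E when X is in [0, 1)."""
    return [(offset, offset + 1) for offset in positional_offsets(n)]


def are_disjoint(intervals):
    """True if no two half-open intervals overlap."""
    ordered = sorted(intervals)
    return all(a_hi <= b_lo for (_, a_hi), (b_lo, _) in zip(ordered, ordered[1:]))
```

The offsets are a cyclic shift of 0, 1, ..., n-1, so the first column ends up in the top interval while columns 2 to n fill the intervals below it in order.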

E.2 CONTEXTUAL MAPPING FOR STACKED MULTI-HEADS SELF-ATTENTION LAYERS

After step 2, the quantized input L will be in the set $H_\delta \subset \mathbb{R}^{d \times n}$, such that $H_\delta := \{G + E \in \mathbb{R}^{d \times n} \mid G \in G_\delta\}$, with $G_\delta := \{0, \delta, \ldots, 1 - \delta\}^{d \times n}$. The adaptive selective shift operation Ψ is then defined so that the learnable parameter $u^T \in \mathbb{R}^d$ maps $u^T \Psi(L)$ to unique scalars (ids). Finally, with the help of the all-max-shift operation Ω, the output of a series of those two operations is a scalar lying in disjoint intervals for each column of L, as well as for different inputs L and L′, thereby implementing the contextual mapping in Definition 3.1.
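The way a single vector can turn a quantized column into a unique scalar id can be seen in a short sketch. It assumes the choice $u^T = (1, \delta^{-1}, \ldots, \delta^{-d+1})$ used later in the proof, ignores the positional offset E, and uses exact fractions to avoid rounding; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def column_id(column, delta):
    """Compute u^T column with u = (1, delta^-1, ..., delta^-(d-1))."""
    return sum(value * delta ** (-j) for j, value in enumerate(column))


def all_ids(d, delta):
    """Ids of every column in the grid {0, delta, ..., 1 - delta}^d."""
    levels = [k * delta for k in range(int(1 / delta))]
    return [column_id(column, delta) for column in product(levels, repeat=d)]
```

Each id is δ times a base-(1/δ) number whose digits are the grid levels of the column, which is why distinct columns receive distinct ids with a gap of at least δ.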



Figure 1: Attention mechanisms: (a) original dense attention; (b) the attention matrix after random element sampling; (c) a special case of sparse attention, Hamiltonian (self-)attention, for each subset; (d) combining all subsets (which have overlapping elements per (b)) connects the individual Hamiltonian attention sub-matrices, giving cycle attention, which is a Hamiltonian cycle; (e) permutation of points permutes the elements in the cycle attention matrix; (f) the resulting sampled attention, viewed as a sampled Hamiltonian cycle from the edges of the complete attention graph.
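The link between permuting the points and sampling a Hamiltonian cycle, as in panels (d) to (f), can be illustrated with a simplified sketch. It skips the subset-splitting step and assumes that each element attends to itself and to its successor along the cycle; that choice of pattern and the function names are assumptions made for illustration only.

```python
import random


def cycle_attention_pattern(order):
    """Map each element to {itself, its successor} along the cycle given by `order`."""
    n = len(order)
    return {order[i]: {order[i], order[(i + 1) % n]} for i in range(n)}


def sampled_pattern(n, rng):
    """Sample a Hamiltonian cycle by shuffling the n elements."""
    order = list(range(n))
    rng.shuffle(order)
    return cycle_attention_pattern(order)


def edge_frequencies(n, trials, rng):
    """Fraction of trials in which each (query, key) pair is connected."""
    counts = [[0] * n for _ in range(n)]
    for _ in range(trials):
        for query, keys in sampled_pattern(n, rng).items():
            for key in keys:
                counts[query][key] += 1
    return [[c / trials for c in row] for row in counts]
```

Under this sketch each pattern has 2n connections, and every off-diagonal pair is connected with probability 1/(n-1), so repeated sampling covers the edges of the dense attention graph uniformly.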

Few-Shot Learning. The pre-trained MAE-dense and MAE-sampled models are also evaluated on a few-shot learning task. Following Sharma & Kaul (2020); Wang et al. (2021); Yu et al. (2022); Pang et al. (2022), the few-shot learning adopts a k-way, m-shot training setting on the ModelNet40 (Wu et al., 2015) dataset, where k is the number of randomly sampled classes and m the number of randomly sampled examples per class. The testing split consists of 20 randomly sampled unseen examples from each class. We set k ∈ {5, 10} and m ∈ {10, 20}, and report the mean accuracy with standard deviation over 10 independent experiments. As shown in Tab. 2, our proposed MAE-sampled model outperformed all state-of-the-art methods on 3 out of 4 settings, while MAE-sampled consistently outperformed MAE-dense.
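The k-way, m-shot protocol can be sketched as an episode sampler. This is a minimal illustration that assumes the dataset is given as a flat list of class labels; the function name `sample_episode` and its arguments are hypothetical and do not come from the cited codebases.

```python
import random


def sample_episode(labels, k, m, n_query=20, rng=None):
    """Pick k classes, then m support and n_query unseen query examples per class."""
    rng = rng or random.Random()
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    classes = rng.sample(sorted(by_class), k)
    support, query = [], []
    for label in classes:
        # Draw without replacement so query examples are never in the support set.
        picked = rng.sample(by_class[label], m + n_query)
        support.extend(picked[:m])
        query.extend(picked[m:])
    return classes, support, query
```

Repeating this with different seeds gives the independent experiments over which a mean and standard deviation can be computed.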

$\mathcal{F}$: the class of continuous sequence-to-sequence functions
$\mathcal{F}_S$: the class of continuous set-to-set functions
$\bar{\mathcal{F}}$: the class of piece-wise constant sequence-to-sequence functions
$\bar{\mathcal{F}}_S$: the class of piece-wise constant set-to-set functions
$\mathcal{T}^{h,m,r}$: the class of (sparse) transformers with h attention heads, head size m, and hidden layer width r
$\bar{\mathcal{T}}^{h,m,r}$: the class of modified transformers with h attention heads, head size m, and hidden layer width r
$\sigma_S$: softmax activation
$\sigma_H$: hardmax activation

head attention in selective shift operation
$d_c(\cdot, \cdot)$: distance between two functions

B ADDITIONAL INFORMATION ON THE BASIC CLASSIFICATION SETTING

B.1 kNN TRANSFORMER

Definition B.1 (kNN Attention). For $k \in [n]$, kNN attention has the attention pattern $A_k = \mathrm{kNN}(k)$ for all points, where kNN(•) represents the Euclidean k-nearest neighbourhood of the input.

Definition B.2 (kNN Transformer). The kNN transformer is the transformer defined as in Eq. 1, but with the kNN attention of Definition B.1.
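The kNN attention pattern of Definition B.1 can be sketched as follows. The brute-force neighbour search and the convention that a point counts as its own nearest neighbour are assumptions of this illustration, and the function name is hypothetical.

```python
import math


def knn_attention_pattern(points, k):
    """For each point, indices of its k nearest points (Euclidean), itself included."""
    pattern = []
    for p in points:
        ranked = sorted(range(len(points)), key=lambda j: math.dist(p, points[j]))
        pattern.append(ranked[:k])
    return pattern
```

With k fixed, each point attends to a fraction k/n of the set, so the receptive field covers a smaller share of the input as the number of points n grows.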

Here $\mathcal{F}_S(\cdot)$ is the class of continuous set-to-set functions, and $\bar{\mathcal{F}}_S(\cdot)$ is the class of piecewise constant set-to-set functions.

Xie et al. (2018) applied multi-layered dense transformers to small-scale point clouds directly; Yang et al. (2019a) further proposed the Group Shuffle attention to deal with size-varying inputs by furthest point sampling; Han et al. (2022) aggregated point-wise and channel-wise features by directly adding two self-attention layers. To avoid the tricky tokenization step, Lee et al. (2019) tried to deal with points directly with O(nm) complexity by introducing inducing points, and proved universal approximation; Mazur & Lempitsky (2021) further proposed a hierarchical point set mapping, grouping, and merging structure with nearest neighbors defining the sparse attention mechanism. Yu et al. (2022) and Pang et al. (2022) further introduced the transformers to the pre-training and fine-tuning pipelines in the area of 3D point clouds. Last but not the

Object classification on ModelNet40. Here [ST] denotes that model adopts the standard (dense) transformer, while [T] denotes all other transformers.

Mean ± std. dev. accuracy (%) for 10 independent Few-shot classification experiments.

Object classification accuracy (%) for different attention mechanisms in the basic setting. OM denotes out of memory.

Memory usage (GB) for different attention mechanisms in the basic classification setting. All are trained on a single RTX 3090 with 24 GB of on-board RAM. OM denotes out of memory.

Classification accuracy (%) for sampled and kNN attention with hierarchical model structure.

Cheng Zhang, Haocheng Wan, Xinyi Shen, and Zizhao Wu. Pvt: Point-voxel transformer for point cloud learning. arXiv preprint arXiv:2108.06076, 2021.

Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16259-16268, 2021.

1 -δ} d×n

Additional object classification accuracy (%) for different attention mechanisms in the basic setting. OM denotes out of memory.

Object classification in the setting of Lee et al. (2019), measured by accuracy (%).

Amortized clustering results. The number in ISAB(•) indicates the number of learnable inducing points used in ISAB as per Lee et al. (2019). The evaluation metric LL0/data is the average log-likelihood value, and LL1/data is the average log-likelihood value after a single EM update (implemented with the scikit-learn package (Buitinck et al., 2013)).

Transfer learning on the classification task, measured by accuracy (%).

annex

Adaptive Selective Shift Operation. With a modified multi-head attention layer with 2 heads and hidden layer width 1, the adaptive selective shift operation $\Psi(\cdot)$ may be defined as: [...] where we assign the query, key, and value parameters as $u^T$, and we introduce the superscript $l$ to denote the self-attention layer $l$. With the help of hardmax, the $k$-th row of the attention matrix will be a one-hot vector that selects the max or min vector in [...] is used to make sure that only the first element in the feature dimension is changed in the selective shift operation. Specifically, the $(1, k)$-th entry of the self-attention output reads: [...]

Without loss of generality, the sampled transformer in § 4.3 may be viewed as a series of stacked masked attentions $A_i$ for $i \in [n]$, such that: [...]. These are in fact the $n$ point pairs in the Hamiltonian cycle, so the stack of all the masked attentions is the cycle attention in Fig. 1d reflected across the diagonal. Then Eq. 5d will be SAttn [...] noting that the updated column from the previous $g_i$ is passed to the next $g_{i+1}$. In conclusion, the contextual mapping holds as the masked attention $A_i$ is designed to aggregate information from all $n$ elements / tokens by applying $g(\cdot)$ about $O(n)$ times, which matches the design of (Yun et al., 2020).

Now consider $u^T = (1, \delta^{-1}, \delta^{-2}, \ldots, \delta^{-d+1})$. The mapping $l_i = p_s(L_i) = u^T L_i$ is bijective, as all input point features $L_i$ are different, with at least one element having a gap of $\delta$. In addition, without loss of generality, the order $l_2 < l_3 < \ldots < l_n < l_1$ holds as in (Yun et al., 2020) because of the positional embedding $E$. Further, as each $l_i$ has $\delta^{-d}$ intervals, and as the $n$ tokens are disjoint from each other, we need $n\delta^{-d}$ adaptive selective shift operations to achieve the bijective mapping to unique ids.

First $\delta^{-d}$ selective shift operations. The first $\delta^{-d}$ layers are all applied to the second column (token) with $l_2 \in [0 : \delta : \delta^{-d+1} - \delta]$; each selective shift operation matches one interval and is empty otherwise. As these $\delta^{-d}$ layers are only applied to the first two token embeddings, the maximum value is $l_1$ and the minimum value is $l_2$. The output after those selective shift operations is: [...] where the constant value is $c = \delta^{-d}$ in Eq. 9b. Note that $\tilde{l}_2 > l_1$ because [...] which is true. So the current order becomes $l_3 < l_4 < \ldots < l_n < l_1 < \tilde{l}_2$, and in the next $\delta^{-d}$ selective shift operations the maximum value will be $\tilde{l}_2$ and the minimum will be $l_3$.

Second $\delta^{-d}$ selective shift operations. The next $\delta^{-d}$ layers will be applied to the third column (token embedding) within intervals [...] which again gives $\tilde{l}_3 > \tilde{l}_2$ because [...] So we have a new maximum $\tilde{l}_3$ and a new minimum $l_4$.

Repeat after $(n - 1)\delta^{-d}$ operations. The next $\delta^{-d}$ layers will operate on the fourth column. After all $(n - 1)\delta^{-d}$ operations we have [...] For the $j$-th column, we will have the output [...] We also know the interval of each $l_i$ [...] with [...] Then the intervals of the outputs are [...] and to check whether the intervals are disjoint, we take the difference between the lower bound of $\tilde{l}_{i+1}$ and the upper bound of $\tilde{l}_i$ [...] which is not guaranteed to be above 0, so additional operations must be introduced.

Further, the adaptive shift operation is a one-to-one map, as the map $L_k \to u^T L_k$ is one-to-one and the permutation of columns is one-to-one, so it suffices to prove that the map [...] $\to \tilde{l}_k$ is also one-to-one. See the detailed analysis in §E.2.3 of (Yun et al., 2020).

Preliminaries. As in (Yun et al., 2020), the upper bound for the unique id $\tilde{l}_i$ is: [...] Similarly, we have [...] Also, for any $n \geq 1$, we have [...]

All-max-shift operations. Following (Yun et al., 2020), to make the intervals of the $\tilde{l}_k$ disjoint from each other, the all-max-shift operation $\Omega^l : \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}$ is a self-attention layer defined as follows: [...] The $(1, k)$-th entry of $\Omega^l(Z; c)$ reads [...] The main idea of the all-max-shift operation is that, in the $i$-th layer, we 'replace' the current 'column' by the maximum column within reach of the sparse attention pattern $A_i$. In the next layer, the shifted max column will again be 'replaced' by the new maximum value within reach of the shifted column. After $n$ steps or layers, the first elements of all columns will be replaced by the one in the maximum column, which is the dominating value. The steps within the dominating element are greater than the intervals of the whole $l_n$. So, for two different inputs $L$, their $n$ entries are distinct, and requirement 2 in Definition 3.1 is satisfied.

Without loss of generality, in contrast with the case of the cycle attention Eq. 12 in the adaptive selective operation, the case of the stacked sampled attention is the same as in Fig. 1d, with $l = 1$.

First layer of all-max-shift. The input of the first all-max-shift operation is $\tilde{L} \in \mathbb{R}^{d \times n}$. Recall that $u^T \tilde{L} = [l_1, \tilde{l}_2, \tilde{l}_3, \ldots, \tilde{l}_n]$ and that $0 < l_1 < \tilde{l}_2 < \tilde{l}_3 < \ldots < \tilde{l}_n < n\delta^{-nd} - \delta$. The last inequality holds as in Eq. 36. Let the output of the first layer be $M^1$. The $k$-th element in the first row reads [...] where the constant value is $c = 2n^2\delta^{-nd-1}$ in Eq. 40, and for each column we will have [...] as the first element of $u$ is 1. Next, we see that $u^T M^1_k$ is dominated by the right term $2n^2\delta^{-nd-1}\, u^T \tilde{L}_{k+1 \bmod n}$, which is defined by, for any $k, k' \in [n]$, [...] This is because the minimum gap between the $u^T \tilde{L}_{k+1}$ is $\delta$, and we have [...] so if $u^T \tilde{L}_{k+1 \bmod n} < u^T \tilde{L}_{k'+1 \bmod n}$, this determines the order $u^T M_k < u^T M_{k'}$, because $u^T \tilde{L}_k$ is within the minimum gap of the right term of Eq. 42, and so cannot change the overall value.

Second layer of all-max-shift. As in the first layer, we define the output of this layer as $M^2$, and the $k$-th element in the first row reads [...] so for each column, we have [...] The last term dominates $u^T M^2_k$, because the minimum gap of $u^T M^2_{k+1 \bmod n}$ is at least $\delta$, and [...] The last inequality holds due to [...] from Eq. 38.

Repeat all-max-shifts. After all $n$ layers we get $M^n$, and $u^T M^n_k$ is dominated by [...] because the remaining terms in $u^T M^n_k$ have the strict upper bound [...] The last inequality used $(1 + 2n)^n - (2n)^n \leq (2n)^n$ from Eq. 38.

Verifying Contextual Mapping. This matches the analysis in §E.2.5 of (Yun et al., 2020). As all selective shift operations and all-max-shift operations are bijective, and $u^T$ maps each column (token) of the input to a unique id, requirement 1 in Definition 3.1 holds. As the $u^T M^n_k$ are all dominated by $(2n^2\delta^{-nd-1})\,\tilde{l}_n$, and different inputs $L$ have different $\tilde{l}_n$ as $\tilde{l}_n$ is influenced by all of $[l_1, l_2, \ldots, l_n]$, not all columns are the same for different inputs $L$, and $u^T$ is the unique mapping. The interval may be written [...] The upper bound holds as the other terms are less than $(2n^2\delta^{-nd-1})^n \cdot \delta$ in total (not the dominating term). So the intervals for all $u^T M^n_k$ are disjoint for different inputs, and requirement 2 in Definition 3.1 holds.
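The propagation idea behind the repeated all-max-shift layers can be illustrated with a scalar toy sketch. It assumes each position only sees itself and its successor on the cycle and simply takes the larger value; it omits the scaling constant c and the actual attention layer Ω, and the function names are hypothetical.

```python
def max_shift_step(values):
    """One layer: every position takes the max of itself and its successor on the cycle."""
    n = len(values)
    return [max(values[i], values[(i + 1) % n]) for i in range(n)]


def layers_until_global_max(values):
    """Number of layers needed until every position holds the global maximum."""
    target = max(values)
    layers = 0
    while any(v != target for v in values):
        values = max_shift_step(values)
        layers += 1
    return layers
```

Because information moves one position per layer along the cycle, at most n - 1 such layers are needed in this toy setting before every position holds the global maximum, in line with the O(n) layer count discussed above.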

