SHUFFLED TRANSFORMERS FOR BLIND TRAINING

Abstract

Conventional split learning faces the challenge of preserving training-data and model privacy, as part of the training is beyond the data owner's control. We tackle this problem by introducing blind training, i.e., training without being aware of either the data or the model, realized by shuffled Transformers. This is attributed to our intriguing finding that the inputs and the model weights of Transformer encoder blocks, the backbone of the Transformer, can be shuffled without degrading model performance. We not only prove the shuffling invariance property in theory, but also design a privacy-preserving split learning framework upon this property, with little modification to the original Transformer architecture. We verify the properties through experiments, and show that our proposed framework successfully defends privacy attacks against split learning.

1. INTRODUCTION

Recent years have witnessed remarkable growth in deep learning applications, as deep neural networks (DNNs) have grown deeper and larger. This poses a dilemma for the thin edge device: on one hand, it lacks the computational power to train the models on its own; on the other, data privacy would be violated if it sent all data to an untrusted party, e.g., the cloud, for processing. A paradigm called split learning (Gupta & Raskar, 2018) emerges as a potential solution: without sharing its raw data, the edge transmits intermediate features to the cloud while offloading partial computation. Typically, the private inputs are transformed into intermediate features by feeding them through the first few layers of the DNN. Vanilla split learning still faces privacy leakage, as an adversary can infer the input from the features (Erdogan et al., 2021; Isola et al., 2017). Hence many works have proposed to remove the sensitive information from the features, by means such as encryption (Lee et al., 2022), adversarial learning (Xiao et al., 2020), and differential privacy (Dong et al., 2019). However, these works mostly sacrifice accuracy or efficiency for the privacy guarantee. More importantly, the privacy threat posed by the model weights trained on the cloud remains an open problem: the trained weights reveal the privacy of the training data (Fredrikson et al., 2015; Carlini et al., 2019; Zhang et al., 2020), and should be proprietary to the data owner, i.e., the edge.

We propose a novel blind training framework on the Transformer (Steiner et al., 2021), a state-of-the-art DNN achieving impressive accuracy on a wide range of tasks. Blind training means that the cloud conducts its part of the computation 'in blind': it is unaware of the data or the model it trains, yet executes valid computation to assist the edge. The framework resembles homomorphic encryption, where the edge encrypts training data with its key and feeds it to the encrypted DNN hosted in the cloud. The cloud trains the DNN in ciphertext, without knowing the input or the model. Different from cryptographic tools, our framework is built entirely in plaintext, and thus avoids the hassle of encryption. The key is to exploit the shuffle invariance property of Transformers. We discovered that Transformers have an intriguing property: each input, be it an image or a sentence, can be randomly permuted within itself and fed through the network, and the network is trained equivalently to the unpermuted case. Although previous work (Naseer et al., 2021) has recognized that the Transformer is ignorant of position information without position embeddings, we non-trivially found, and prove in theory, that even with position embeddings the Transformer is shuffling-invariant. By regarding the permutation order as a 'key,' the edge feeds shuffled training data to the cloud, which performs natural training. Another interesting property we found is that, by training on the shuffled data, we inherently obtain a Transformer encoder block with shuffled weights, which only yields valid results on inputs permuted by the 'key.' Hence the Transformer is 'encrypted' to train on the shuffled data. More importantly, the shuffled model can be 'decrypted' to obtain an equivalent plain network to which normal data can be fed. Highlights of our contributions are: we discovered the intriguing shuffle invariance property of Transformers (and of other models with Transformer encoder blocks as their backbone), and built a privacy-preserving split learning framework on it. The framework provides shuffling-based privacy guarantees for training data, testing data, and the model weights. A variety of experiments verify the properties and demonstrate the superior performance of our scheme in terms of accuracy, privacy, and efficiency.

2. RELATED WORK

Transformer-based models are the state-of-the-art deep neural networks and have attracted great attention in both computer vision and natural language processing. Models with Transformer encoder blocks as their backbone, such as Bert (Devlin et al., 2018), ViT (Dosovitskiy et al., 2020), T2T-ViT (Yuan et al., 2021), ViTGAN (Hirose et al., 2021), BEiT (Wang et al., 2022) and CoCa (Yu et al., 2022), have achieved leading performance on a great many tasks. Transformer encoder blocks, as shown in Fig. 1, mainly contain two critical components: multi-head scaled-dot-product self-attention and a feed-forward network (MLP). Inputs are fed in the form of patches, which are usually embedding vectors for words in Bert, or for fractions of images in ViT. The relative position of patches is learned through position embeddings (Vaswani et al., 2017), which are injected into the model.
Recent studies have found that, with the position embeddings removed, ViT merely loses 4% accuracy on ImageNet (Russakovsky et al., 2015); the same work further reports the shuffling invariance property of ViT through experiments.

Split learning. As deep neural networks grow deeper and wider, training them is hardly feasible for the edge, which lacks computational power but owns abundant data. Hence split learning (Gupta & Raskar, 2018) proposes to let the cloud server shoulder part of the computation without accessing the data. To achieve this, a model is split into two parts, deployed on the edge and the cloud, respectively. The edge processes the first few layers and sends the intermediate features to the cloud, which holds the main body of the model. If the cloud does not own the corresponding labels, it returns the prediction to the edge for computing the loss. In the backward propagation, error gradients are passed between the edge and the cloud instead of the features. Studies have revealed that the untrusted cloud can reconstruct the private data from the intermediate features (Erdogan et al., 2021; Isola et al., 2017). Additionally, split learning allows the cloud to directly touch the model weights, which is also a threat to the privacy of the training data at the edge.

Privacy-preserving split learning. Many efforts have been made to preserve data privacy in split learning, but most have been devoted to inference data rather than training data or model protection. Almost no lightweight protection scheme is feasible for trained model weights, which should be proprietary to the edge and not be taken advantage of by the cloud. Traditional methods include cryptographic ones such as secure multi-party computation and homomorphic encryption, but these typically involve significant overhead in encryption, decryption, computation, and communication. Lee et al. (2022) implemented a polynomial approximation of the nonlinear functions and encrypted the training process with FHE, but it demands 10 to 1000 times more computation compared to plain split learning, and the approximate computation also results in accuracy losses. Xiao et al. (2020) adversarially trained the edge sub-module to produce features that contain no private information but suffice to complete the learning task; however, the method only works once learning converges and thus suffers potential leakage at the early stage of training. Dong et al. (2019) inserted Gaussian noise into the features following the convention of differential privacy. Ryoo et al. (2018) adjusted the image resolution to seek a sweet spot in the tradeoff between utility and privacy. These works have to sacrifice considerable model accuracy to meet the privacy requirement.

Matrix Multiplication Computation (MMC) is a fundamental mathematical operation, and prior works (Lei et al., 2014; Liu et al., 2021) propose random permutation as an encryption scheme for outsourced MMC tasks. To enhance security, the recent work (Liu et al., 2021) introduces additive perturbation besides the multiplicative one; but these works are mostly theoretical and should be viewed as complementary to ours. We investigate the privacy guarantee in a more challenging setting, deep neural networks, and propose a practical protection scheme.

3. PROBLEM FORMULATION

We formally formulate our problem in the setting of split learning. The edge holds a private training data set D_train = {X, Y}, where X are the private data and Y the private labels. The edge aims at training a model with the assistance of the cloud, yet without revealing any private input or the model weights to the cloud. The cloud possesses powerful computing resources but is curious about the private data of the edge and the model it trains. The edge selects a model and splits it into two parts, F and F', deployed on the edge and the cloud, respectively. Referring to the loss function as L_task and the local privacy-preserving mechanism as M, the ultimate goal of the edge is to train F and F' jointly to

minimize_{F, F'} L_task(F'(F(X)), Y),

without revealing X or F' to the cloud or any other third party. Although the cloud does not directly access the input, it is possible to invert X from F(X) by the following attacks. We assume the cloud server is honest-but-curious, meaning that it obeys the protocol and performs the learning task accordingly, but is curious about the private data. Depending on whether the edge model is accessible to the attacker, we divide the attacks into two categories:

Black-box attacks. The attacker is able to obtain an auxiliary data set X_aux and the corresponding features under protection mechanism M, i.e., M(F(X_aux)), which may be collected over multiple training rounds. It trains an inversion model G over (X_aux, F(X_aux)) to invert the raw input from the features. The attack goal can be formulated as

minimize_G L_attack(G(M(F^1(X_aux)), ..., M(F^e(X_aux))), X_aux),    (2)

where the superscript e denotes the number of training iterations over which features are collected. The loss L_attack can be the mean square error (MSE) between the reconstructed input X̂_aux and X_aux. At convergence, G works as a decoder that inverts features into inputs. It should be noted that the attack we model here is different from the feature-space hijacking attack in split learning (Pasquini & Bernaschi, 2021), as the latter destroys model accuracy, inconsistent with the honest-but-curious assumption we made.

White-box attacks. The attacker has full access to the edge model F and the protection mechanism M, and performs gradient descent over M(F(X)) and its guess M(F(X̂)) by

minimize_{X̂} L_attack(M(F(X)), M(F(X̂))).    (3)
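The white-box objective of Eq. 3 can be illustrated with a minimal PyTorch sketch. This is our toy illustration, not the paper's code: the edge model F is a hypothetical single linear layer, and the mechanism M is taken as the identity, i.e., no protection at all.

```python
import torch

torch.manual_seed(0)
# Hypothetical toy edge model F (one linear layer), with M = identity,
# i.e., no protection mechanism applied to the features.
F_edge = torch.nn.Linear(12, 8)
X = torch.randn(1, 12)                 # the private input
target = F_edge(X).detach()            # the feature observed by the attacker

# Eq. 3: gradient descent on a guess X_hat to match the observed feature.
X_hat = torch.randn(1, 12, requires_grad=True)
opt = torch.optim.Adam([X_hat], lr=0.1)
for _ in range(2000):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(F_edge(X_hat), target)
    loss.backward()
    opt.step()
# Without protection, the attacker's feature loss is driven close to zero,
# i.e., X_hat becomes feature-equivalent to the private input.
```

With a non-trivial M, e.g., the shuffling mechanism introduced later, the same optimization no longer pins down the original input.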

4. INTRIGUING PROPERTIES OF TRANSFORMER ENCODER BLOCK

To tackle the privacy issue in split learning, it is important to perform transformations over F(X) for training or inference, while preventing the adversary from inverting X from F(X). In our work, we adopt permutation as the transformation method, and model it by row and column shuffles, which can be expressed by matrix multiplications: given a matrix Z ∈ R^{p×d}, the row shuffle is represented by P_R Z, where P_R ∈ {0, 1}^{p×p} is a permutation matrix. Similarly, the column shuffle is defined as Z P_C, where the permutation matrix P_C ∈ {0, 1}^{d×d}. Further explanations can be found in Appendix A. In this section, we introduce the key properties we discovered on Transformer encoder blocks, which serve as the building blocks of the privacy-preserving split learning framework shown later. Fig. 2 summarizes the properties we found, and each detailed proof is provided in Appendix B.

Denoting the Transformer encoder as Enc, the input matrix of Enc as Z, and the row permutation matrix as P_R, we have the following theorem:

Theorem 1. Transformer encoder blocks are row-permutation-equivalent: Enc(P_R Z) = P_R Enc(Z).

As shown in Fig. 2a, the row permutation P_R can 'pass through' the encoder, so that a reverse operation performed at the output of the encoder returns the original output: P_R^{-1} Enc(P_R Z) = Enc(Z). More interestingly, the gradients of Transformer encoder blocks in the backward propagation are invariant to row shuffle:

Theorem 2. The gradients of Transformer encoder blocks w.r.t. the loss l are row-permutation-invariant: ∂l/∂W^(R) = ∂l/∂W.

Here W generally refers to the weights learned naturally in the multi-head attention or the MLP of the Transformer encoder block, and W^(R) is the corresponding weights learned on P_R Z. Hence, as indicated by Fig. 2a, the learned weights are the same with or without row shuffle on Z. Rigorously speaking, it is not the same W that is learned, since the weights are initialized differently; rather, the row shuffle can be considered transparent to the Transformer encoder blocks, in both training and inference. Similarly, denoting P_C as the column permutation matrix, we have

Theorem 3. If the Transformer encoder is permuted as Enc^(P) = P_C Enc P_C^{-1}, then Enc^(P) is row-column-shuffle-equivalent: Enc^(P)(P_R Z P_C^{-1}) = P_R Enc(Z) P_C^{-1}. The outputs can be reversed by P_R^{-1} Enc^(P)(P_R Z P_C^{-1}) P_C = Enc(Z).

More intriguingly, we have

Theorem 4. The gradients of the Transformer encoder blocks Enc^(P) w.r.t. the loss l are column-permutation-equivalent: ∂l/∂W^(P) = P_C (∂l/∂W) P_C^{-1}.

By induction, we prove that shuffling a normally trained Enc yields an Enc^(P) equivalent to one trained on the shuffled data. Specifically, letting the weights of the normally trained Transformer encoder block be W, the weights of the model trained on P_R Z P_C^{-1} are P_C W P_C^{-1}, as shown in Fig. 2b. If the 'encrypted' model Enc^(P) is deployed in the cloud, the cloud has no clue about the model weights or the inputs, but trains the model 'blindly' for the edge. Compared to cryptographic tools like fully homomorphic encryption (Lee et al., 2022), our 'encryption' method realizes a similar idea with far less computation overhead. Similar to the 'encryption' process, 'decryption' is feasible by reverse shuffling, as in Fig. 2c. Re-encryption is also viable by matrix multiplication, suggesting that Enc can be 'encrypted' not only by training on shuffled data, but also by permuting the trained weights.

To sum up, the row shuffle is transparent to the Transformer encoder blocks. The row-and-column shuffle serves in a similar way to homomorphic encryption, in that the weights of Enc^(P) are not known, while Enc^(P) is only capable of processing shuffled Z. The weights of Enc^(P) can be obtained either by training on shuffled data or by matrix multiplication. The entire process incurs no additional computational burden and sacrifices no accuracy.
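Theorem 1 can be checked numerically. The sketch below is our illustration, using torch.nn.TransformerEncoderLayer as a stand-in for the paper's Enc (eval mode disables dropout so both forward passes are deterministic); it verifies Enc(P_R Z) = P_R Enc(Z) up to floating-point error.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# A standard encoder block as a stand-in for Enc.
enc = nn.TransformerEncoderLayer(d_model=16, nhead=4, dim_feedforward=16,
                                 batch_first=True)
enc.eval()  # disable dropout for deterministic outputs

p, d = 8, 16
Z = torch.randn(1, p, d)       # one input of p patches
perm = torch.randperm(p)       # the secret row order, i.e., the 'key'

with torch.no_grad():
    out_shuffled = enc(Z[:, perm, :])   # Enc(P_R Z)
    out_plain = enc(Z)[:, perm, :]      # P_R Enc(Z)

# Theorem 1: the row permutation 'passes through' the encoder block.
assert torch.allclose(out_shuffled, out_plain, atol=1e-5)
```

Applying the inverse permutation to `out_shuffled` recovers `enc(Z)` exactly, which is the unshuffle step used by the edge.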

5. METHODOLOGY

Inspired by the permutation invariance properties of Transformer encoder blocks, we present our method to preserve training data, inference data and model weights privacy in the split learning framework, followed by the privacy indication of shuffling.

5.1. SHUFFLED TRANSFORMERS

The overall scheme of our method is shown in Fig. 3. We split a typical Transformer-based model into three stubs: the part from the input layer to the patch embeddings, residing at the edge; the Transformer encoder blocks, at the cloud; and the MLP and loss layer, at the edge. The position embedding is optional, depending on the practical situation. The smashed data transmitted between the edge and the cloud in the forward and backward loops are referred to as features and error gradients, respectively. Our method runs at the edge, processing the smashed data transmitted back and forth between the edge and the cloud. We introduce four basic operations: row shuffle and unshuffle, and column shuffle and unshuffle. All shuffling orders are secrets of the edge.

We perform row and column shuffles on the features sent from the edge to the cloud. The features Z = F(X) are expressed as a (p, d) (e.g., (197, 768)) matrix, where p denotes the number of patches and d the dimension of each patch. We leave the batch size out of the modeling, as shuffling takes place within a single input. We shuffle the patches to be sent, and unshuffle the received features, by

Shuffle: M_Z(Z) = P_R Z P_C^{-1},    (8)
Unshuffle: M_Z^{-1}(F'^(P)(M_Z(Z))) = P_R^{-1} F'^(P)(M_Z(Z)) P_C,    (9)

where P_R ∈ {0, 1}^{p×p} and P_C ∈ {0, 1}^{d×d} are row and column permutation matrices, respectively. The row and column shuffles are jointly denoted as the mechanism M_Z. P_R is chosen randomly per Z, whereas P_C is chosen per model, i.e., P_C is the same for all inputs. The model stub F'^(P) on the cloud is trained as a usual network, and the output of the Transformer encoder blocks is sent to the edge for unshuffling by Eq. 9. The backward loop is no different from normal backward propagation.

Privacy for model weights and training data. From Thm. 4, we know that the Transformer encoder blocks at the cloud trained on M_Z(Z) have the following unique property. Letting F'^(P) = {W^(P), b^(P), γ^(P)} and F' = {W, b, γ} denote the {weight matrix, bias, layer normalization parameter} of the cloud model trained on M_Z(Z) and on the normal Z, respectively, we have:

W^(P) = P_C W P_C^{-1} ≜ M_W(W),  b^(P) = b P_C^{-1},  γ^(P) = γ P_C^{-1}.    (10)

Note that Eq. 10 demands the weight matrix to be square, which requires a minor modification to the cloud model, as we elaborate on later. It is an interesting fact that if the model is trained on the shuffled patches M_Z(Z), the randomly initialized weight W_0 learns to become W^(P) instead of W, and thus the true model weights are unknown to the cloud. Privacy for training data is preserved by M_Z, as each input is randomly row-shuffled by P_R and column-shuffled by P_C before being sent. Note that since no particular patch size is specified, any patch size would work, but the finest granularity is recommended for privacy concerns. We give the formal guarantee in Sec. 5.2.

Protection of the inference data. According to Thm. 3, the shuffle (Eq. 8) and unshuffle (Eq. 9) procedures obviously give legitimate testing results. Since the feature sent to the cloud is shuffled, the testing data privacy is preserved. In addition, only the entity who holds P_C^{-1} can produce valid inference results on F'^(P):

F'^(P)(Z) = Invalid Result,  F'^(P)(M_Z(Z)) = M_Z(F'(Z)).

Re-encrypting the model is also available with our method. The edge may pre-train a secret Transformer body with a secret P_C, and later can obtain the unencrypted weights W = P_C^{-1} W^(P) P_C, or re-encrypt the model to authorize another party to use it: W^(P') = P_C' W P_C'^{-1}. Other parties who hold P_C' can then perform privacy-preserving transfer learning or fine-tuning on the model.

Model structure modification. Our method requires all the weight matrices in the Transformer encoder blocks and the MLP layer to be square. For example, the multi-head attention in the classical ViT-Base (Dosovitskiy et al., 2020) introduces non-square weight matrices, and its MLP has an input/output dimension of 768 but a hidden layer of dimension 3072. To implement shuffled Transformers, each head has to be square: the weights of the linear projection layers of Q, K, V are reshaped to (768 × no. of heads, 768), so that each head has a shape of (768, 768). Instead of concatenating the heads, we calculate their average to keep the weight matrices square, which mildly increases the computational overhead but causes no accuracy decline. Alternatively, one can simply replace the multi-head attention with single-head attention, which also has limited impact on model performance. The MLP layer is reshaped with 768 hidden units, which keeps its weight matrices square.
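The weight relation of Eq. 10 can be illustrated on a single (square) linear layer. This is a minimal sketch of ours, not the full encoder: it shows that the 'encrypted' weights P_C W P_C^{-1} map column-shuffled inputs to column-shuffled outputs, and that reverse shuffling 'decrypts' the weights exactly.

```python
import torch

torch.manual_seed(0)
d = 6
W = torch.randn(d, d)                  # a square 'trained' weight matrix
x = torch.randn(3, d)                  # plain features (rows = patches)

# Column permutation matrix; for permutation matrices the inverse is the
# transpose: P_C^{-1} = P_C^T.
P_C = torch.eye(d)[torch.randperm(d)]

W_enc = P_C @ W @ P_C.T                # 'encrypted' weights P_C W P_C^{-1}
x_shuf = x @ P_C.T                     # column-shuffled input x P_C^{-1}

y_plain = x @ W.T                      # output of the plain linear layer
y_enc = x_shuf @ W_enc.T               # output of the 'encrypted' layer

# The encrypted layer produces the plain output in shuffled coordinates...
assert torch.allclose(y_enc, y_plain @ P_C.T, atol=1e-5)
# ...and reverse shuffling 'decrypts' the weights exactly.
assert torch.allclose(P_C.T @ W_enc @ P_C, W, atol=1e-6)
```

The same algebra underlies the re-encryption step: multiplying by a new pair P_C', P_C'^{-1} hands the model to another key holder without ever exposing W.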

5.2. DEFINITION OF PRIVACY

The purpose of shuffling is to prevent the attacker from reconstructing the input X given the feature M_Z(Z). In this work, we consider recovering Z to be equivalent to reconstructing X, as a black-box or white-box attacker can easily invert X from Z. To quantify the likelihood that an adversary rebuilds Z from M_Z(Z), we first define neighboring permutations:

Definition 1. (Neighboring Permutations.) For a feature matrix Z ∈ R^{p×d}, all row-column permutation orders of Z constitute S. Any two permutations σ, σ' ∈ S are neighboring permutations.

Our privacy definition based on shuffling is as follows.

Definition 2. (σ-privacy.) Given the private feature Z and permutation set S, a randomized shuffling mechanism M : M(Z) → Z' ∈ S is σ-private if for all Z, Z', and any neighboring permutations σ, σ', we have Pr[M(σ(Z)) = Z'] = Pr[M(σ'(Z)) = Z'].

σ-privacy suggests that the shuffling mechanism M is agnostic of the relative order of patches. Thus any adversary is incapable of telling the original order from its neighboring permutations, given the perturbed feature Z'. This definition shares some similarities with d_σ-privacy in (Meehan et al., 2021), but ours removes the α in d_σ-privacy, as permutations are sampled from a uniform distribution rather than a Mallows model. With this definition, we can calculate that if all row-column permutations are equally likely for an input, our M_Z(Z) has probability 1/(p!d!) of revealing the true Z, which is negligible. Similarly for weight shuffling: M_W(W) and M_W(P W P^{-1}) (where P is a random permutation matrix) have the same probability of yielding W', which has probability 1/d! of being the true W. Hence the mechanism prevents any adversary from recovering the true weights. Hereby we have

Proposition 1. Input shuffling M_Z(•) is σ-private, with S being the set of all possible P_R, P_C^{-1} permutations.

Proposition 2. Weight shuffling M_W(•) is σ-private, with S being the set of all possible P_C, P_C^{-1} permutations.
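To get a feel for the 1/(p!d!) guessing probability, one can evaluate it for ViT-Base-sized features, (p, d) = (197, 768). The sketch below (our illustration) works in log10 space, since the factorials are far too large for floating point.

```python
import math

# Probability of guessing the true row-column order of a (p, d) feature
# matrix among all p! * d! equally likely permutations, in log10
# (the raw probability underflows any float type).
p, d = 197, 768
log10_prob = -(math.log10(math.factorial(p)) + math.log10(math.factorial(d)))
# log10_prob is below -2000: the guessing probability is negligible.
```

Python's `math.log10` accepts arbitrarily large integers, so the factorials never need to be converted to floats.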

6. EXPERIMENTS AND EVALUATIONS

We first verify the properties of the shuffled Transformer by experiments, and then show its defence capability against attacks.

Setup: Our implementation is built on PyTorch and Torchvision. We use Cifar10 (Krizhevsky et al., 2009), consisting of 60,000 natural images in 10 classes, and CelebA (Liu et al., 2015), containing 2,022,599 faces of 10,177 celebrities. On CelebA, we adopt the timm model vit_base_patch16_224 (ViT-Base), pre-trained on ImageNet, to transfer to a 40-binary-attribute classification task. We adopt a single-head ViT-Base following the modification stated in Sec. 5.1, with the structure: 12 layers, image size 224, patch size 16, embedding dim 768, MLP hidden dim 768, and one head. This model is referred to as the single-head ViT for CelebA later. An SGD optimizer is used with a cosine scheduler, for which the (initial, final) learning rates are set to (0.05, 2 × 10^-4) and (5 × 10^-4, 2 × 10^-6) for the MLP and the encoder blocks, respectively. On Cifar10, we use a smaller ViT with the structure: 6 layers, image size 32, patch size 4, embedding dim 512, MLP hidden dim 512, with 1 head in the row-column-shuffle case and 6 heads otherwise. These models are called the single-head and the multi-head ViT for Cifar10, respectively. They are trained from scratch on Cifar10 and thus provide satisfying accuracy, though inferior to the pre-trained one. An Adam optimizer and a cosine scheduler with learning rate 10^-4 are used.

Baselines: We compare our method with a set of existing privacy-preserving methods. Conventional cryptographic tools are not included, due to their unbearable computational and communication costs. The baselines include unprotected split learning (SL), adversarial learning (adv) (Xiao et al., 2020), and transform-based methods, namely adding Gaussian noise ∼ N(0, 4) (GN) (Dong et al., 2019) and Blur (Ryoo et al., 2018).

Metrics: We evaluate model performance from the accuracy, privacy, and efficiency aspects. The average classification accuracy over 40 attributes and the 10-class classification accuracy are reported for CelebA and Cifar10, respectively. Privacy is gauged by the attacker's capability in reconstructing inputs. We select popular metrics such as Structural Similarity (SSIM), Peak Signal-to-Noise Ratio (PSNR) (Hore & Ziou, 2010), and F-SIM. For F-SIM, we feed the original and reconstructed inputs into a third-party network and compare the cosine similarity between their features. The third-party network for CelebA is the InceptionResNetV1 of FaceNet (Schroff et al., 2015), pre-trained on VggFace2 (Cao et al., 2018), and that for Cifar10 is ResNet18, pre-trained on ImageNet.
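The F-SIM metric described above can be sketched as follows. This is our minimal version: the paper computes the features with the third-party networks named above, while here any per-sample feature vectors work.

```python
import torch
import torch.nn.functional as F

def f_sim(feat_orig: torch.Tensor, feat_recon: torch.Tensor) -> float:
    # Mean cosine similarity between third-party-network features of the
    # original and reconstructed inputs, one flattened vector per sample.
    return F.cosine_similarity(feat_orig.flatten(1),
                               feat_recon.flatten(1)).mean().item()

feats = torch.randn(4, 512)
# Identical feature batches yield a similarity of 1 (up to float error).
sim_identical = f_sim(feats, feats)
```

Lower F-SIM between original and reconstructed inputs indicates a weaker attack, i.e., better privacy.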

6.1. VERIFYING PROPERTIES

We experimentally verify the properties of shuffled Transformers by their accuracy performance.

Row-permutation-equivalence: To verify Thm. 1 and Thm. 2, we train and test models with and without row-shuffled (RS) inputs. No model modification is needed for the row shuffle, and thus the original model structures are used. Testing accuracies are reported in Tab. 1. Whether shuffled or not, the testing data reach approximately the same accuracy for each dataset, verifying that the row shuffle is indeed transparent to the Transformer encoder blocks. The model trained on row-shuffled data has performance equivalent to the normally trained model, verifying that their training processes are equivalent. On CelebA, removing the position embedding from the model merely leads to a 0.506% accuracy drop compared to the unprotected SL (91.908%).

Decryption and re-encryption: We also verify that decrypting and re-encrypting the weight matrices are feasible via Eq. 10. For each model, we 'encrypt' the naturally trained model by multiplying with a random permutation matrix and test it on the shuffled test data. We also 'decrypt' the model trained on row-column-shuffled data and test it on plain data. All testing accuracies are reported in Tab. 2, and the results are almost no different from the benchmarks.

Position embeddings: Our method has no influence on network parameters on the edge, including the position embeddings (see Appendix B.6 and C.6 for more detail). As shown in Tab. 3, shuffling has little impact on accuracy whether the position embeddings exist or not. Further verification experiments on NLP tasks and tabular data can be found in Appendix C.3 & C.4.

6.2. DEFENCE AGAINST ATTACKS

We focus on the privacy performance of our method against the attacks in Sec. 3, in comparison to the baselines. Since, as we observe, position embeddings hurt training data privacy under our black-box attack (which is very strong) while barely affecting accuracy, we choose to remove the position embedding in training.

Defence in inference. The training set of CelebA is adopted as X_aux and the test set is used as the private inference data. All defence methods are applied to the same edge model F, a fixed patch embedding layer. In inference, the black-box attacker can only acquire the smashed data once, and thus we fix the weights of the edge model and train the attacker model to minimize the loss of Eq. 2 with e = 1. We implement the attacker with an MAE decoder G, pre-trained on ImageNet, with an additional position embedding layer at the head of the decoder and a Tanh activation layer at the rear. We report the inference accuracy and attack performance in Tab. 4. It is observed that adv and Blur fail to maintain a privacy guarantee. GN successfully prevents the attacker from reconstructing the private inputs but degrades the accuracy considerably. Our methods achieve satisfying privacy performance across all metrics while sharing a close accuracy to SL. The white-box attack follows Eq. 3 with 100,000 optimization iterations. As we can tell from Tab. 4, the transform-based methods successfully defend against the white-box attack by introducing randomness, as does our method, but they suffer great accuracy losses. This is also verified in the visualization results of Fig. 4, and further results on Cifar10 are left to Appendix C.5.

Defence in training.

We launch an adaptive black-box attack on features collected over 10 rounds (e = 10 in Eq. 2). The attack we launch is much stronger than in practice, as we choose X_aux from the training set; in the real world, the adversary hardly obtains the training data. The final testing accuracy and privacy performance are reported in Tab. 5. It can be found that our RCS is strong in preserving the privacy of the training data without any loss of accuracy.

Efficiency. To see if our method incurs additional overhead to the normal split learning framework, we evaluate its efficiency by the MAC operation counts (Macc, recording the number of multiplication and addition operations) on the edge, the memory consumption, and the convergence curve of pre-training on ImageNet. ImageNet images of size 224 × 224 are adopted with a batch size of 256. From the results of Fig. 5, it can be concluded that our methods are almost as efficient as normal SL.

7. CONCLUSION

We propose a blind training method to realize privacy-preserving split learning, where the cloud trains over unknown data and an unknown model for the edge. The method is founded on the shuffle invariance property we discovered on Transformers and other Transformer-based models. Theoretical proofs, property verification, and real-world performance resisting attacks are provided. Our method successfully defends against black-box and white-box attacks without degrading accuracy or efficiency.

Algorithm (excerpt): 6: get a batch of data X from the data loader; 7: get the patch embedding Z of size (batch_size, p, d); 8: if using row shuffle then 9: get a random permutation matrix P_R of size (p, p) and its inverse P_R^{-1}; ...; 23: until done all batches; 24: until done all epochs.

To further demonstrate the results of shuffling, we visualize the row shuffle and row-column shuffle methods on CelebA pictures in Fig. 7b and 7c. The column shuffle mixes up pixels from the three channels and hence makes the images look like gray ones. The shuffle method is simple, easy to deploy, and effective; we show why it works in the following section.

Figure 7: Visualization of shuffling. Our method works on embedded features instead of images, but here we directly shuffle the images to visualize shuffling. (c) Row-column shuffled pictures of CelebA: each element in a patch comes from a pixel, so shuffling the columns shuffles the pixels of the three channels within the fraction.

B PROOFS

We show the theorems in Sec. 4 hold for Transformer encoder blocks.

B.1 NOTATIONS AND LEMMAS

The Transformer encoder block is denoted as Enc and the loss as ℓ. The patch embedding of a single input X is expressed as Z of shape (p, d). The first layer in the self-attention contains three parallel linear layers projecting Z to Q, K, V:

Z W_Q^T = Q,  Z W_K^T = K,  Z W_V^T = V.

Q, K, V are fed to the following attention operation:

S = Softmax(QK^T / √d),  A = SV,

where S and A are the softmax output and the attention output, respectively. The part following the attention layer is the MLP layer:

A_1 = A W_1^T,  H = a(A_1),  A_2 = H W_2^T,

where A_1, A_2 are the outputs of the linear layers with weights W_1, W_2, respectively, and H is the output of the element-wise activation function a, being ReLU or Tanh.

Element-wise operators, including the shortcut, Hadamard product, matrix addition/subtraction, and other element-wise functions, are permutation-equivalent:

Lemma 1. Element-wise operators are permutation-equivalent: (P_1 A P_2) ⊙ (P_1 B P_2) = P_1 (A ⊙ B) P_2.

On the left-hand side of the equation, a_ij in A and b_ij in B are permuted to the same position before the operation is performed. On the right-hand side, the operation is performed on a_ij and b_ij and the results are then permuted. The two are obviously equivalent. Importantly,

Lemma 2. Softmax is permutation-equivalent: Softmax(P_1 A P_2) = P_1 Softmax(A) P_2.

This is because an element is always normalized with the same group of elements, which permutation does not change. Thus Softmax is permutation-equivalent.
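Lemma 2 is easy to confirm numerically. In this sketch of ours, permutation matrices are built by permuting the rows of identity matrices, and softmax is taken row-wise, as it is over the attention scores.

```python
import torch

torch.manual_seed(0)
A = torch.randn(5, 7)
# Permutation matrices: permuted rows of identity matrices.
P1 = torch.eye(5)[torch.randperm(5)]   # row permutation (left multiply)
P2 = torch.eye(7)[torch.randperm(7)]   # column permutation (right multiply)

# Lemma 2: Softmax(P1 A P2) = P1 Softmax(A) P2 (row-wise softmax).
lhs = torch.softmax(P1 @ A @ P2, dim=-1)
rhs = P1 @ torch.softmax(A, dim=-1) @ P2
assert torch.allclose(lhs, rhs, atol=1e-6)
```

The same one-liner check applies to Lemma 1, since Hadamard product, addition, and activations all act element by element.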

B.2 PROOF OF THEOREM 1

Proof. We prove the row-permutation equivalence of Transformer encoder blocks in forward propagation (Eq. 5). We denote the row-permutation matrix as $P_R$; features are permuted as $Z_{(R)} = P_R Z$. It is worth noting that $P_R^T P_R = E$ holds for a permutation matrix $P_R$, where $E$ is the identity matrix. We first prove the permutation equivalence of attention:
$$A_{(R)} = \mathrm{Softmax}\Big(\frac{Q_{(R)} K_{(R)}^T}{\sqrt d}\Big) V_{(R)} = \mathrm{Softmax}\Big(\frac{Z_{(R)} W_Q^T W_K Z_{(R)}^T}{\sqrt d}\Big) Z_{(R)} W_V^T = \mathrm{Softmax}\Big(\frac{P_R Z W_Q^T W_K Z^T P_R^T}{\sqrt d}\Big) P_R Z W_V^T$$
$$= P_R\, \mathrm{Softmax}\Big(\frac{Z W_Q^T W_K Z^T}{\sqrt d}\Big) P_R^T P_R Z W_V^T = P_R\, \mathrm{Softmax}\Big(\frac{Z W_Q^T W_K Z^T}{\sqrt d}\Big) Z W_V^T = P_R\, \mathrm{Softmax}\Big(\frac{QK^T}{\sqrt d}\Big) V = P_R A,$$
and thus
$$\mathrm{Attention}(P_R Z) = P_R\, \mathrm{Attention}(Z). \quad (23)$$
By Lemma 2, the softmax layer following the attention is permutation-equivalent. The subsequent MLP layer satisfies:
$$A_{2(R)} = a(A_{(R)} W_1^T) W_2^T = a(P_R A W_1^T) W_2^T = P_R\, a(A W_1^T) W_2^T = P_R A_2,$$
meaning
$$MLP(P_R A) = P_R\, MLP(A). \quad (24)$$
Hence we have proved that the Transformer encoder block is row-permutation-equivalent: $\mathrm{Enc}(P_R Z) = P_R\, \mathrm{Enc}(Z)$.
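The key step, $\mathrm{Attention}(P_R Z) = P_R\, \mathrm{Attention}(Z)$, can be spot-checked numerically. A sketch (NumPy, our variable names; single-head attention with square projection weights):

```python
import numpy as np

def softmax(A):
    e = np.exp(A - A.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Z, Wq, Wk, Wv):
    # Q = Z Wq^T, K = Z Wk^T, V = Z Wv^T; A = Softmax(QK^T/sqrt(d)) V
    Q, K, V = Z @ Wq.T, Z @ Wk.T, Z @ Wv.T
    return softmax(Q @ K.T / np.sqrt(Z.shape[1])) @ V

rng = np.random.default_rng(0)
p, d = 6, 8
Z = rng.normal(size=(p, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
P_R = np.eye(p)[rng.permutation(p)]

# Row-shuffled input yields the row-shuffled output of the plain input.
assert np.allclose(attention(P_R @ Z, Wq, Wk, Wv),
                   P_R @ attention(Z, Wq, Wk, Wv))
```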

B.3 PROOF OF THEOREM 2

Proof. To prove that the gradients are the same when the inputs are row-shuffled, we first derive all the gradients from the final layer back to the first. In differential form,
$$dl = \mathrm{tr}\Big(\tfrac{\partial l}{\partial A_2}^T dA_2\Big) = \mathrm{tr}\Big(\tfrac{\partial l}{\partial A_2}^T (dH) W_2^T\Big) + \mathrm{tr}\Big(\tfrac{\partial l}{\partial A_2}^T H\, d(W_2^T)\Big).$$
Let us study $H$ first:
$$dl_1 \triangleq \mathrm{tr}\Big(\tfrac{\partial l}{\partial A_2}^T (dH) W_2^T\Big) = \mathrm{tr}\Big(W_2^T \tfrac{\partial l}{\partial A_2}^T dH\Big) = \mathrm{tr}\Big(\big(\tfrac{\partial l}{\partial A_2} W_2\big)^T dH\Big),$$
indicating
$$\tfrac{\partial l}{\partial H} = \tfrac{\partial l}{\partial A_2} W_2. \quad (26)$$
For $W_2$,
$$dl_2 \triangleq \mathrm{tr}\Big(\tfrac{\partial l}{\partial A_2}^T H\, d(W_2^T)\Big) = \mathrm{tr}\Big(dW_2\, H^T \tfrac{\partial l}{\partial A_2}\Big) = \mathrm{tr}\Big(\big(\tfrac{\partial l}{\partial A_2}^T H\big)^T dW_2\Big),$$
and
$$\tfrac{\partial l}{\partial W_2} = \tfrac{\partial l}{\partial A_2}^T H. \quad (27)$$
For $A_1$:
$$dl_1 = \mathrm{tr}\Big(\big(\tfrac{\partial l}{\partial A_2} W_2\big)^T dH\Big) = \mathrm{tr}\Big(\tfrac{\partial l}{\partial H}^T d(a(A_1))\Big) = \mathrm{tr}\Big(\tfrac{\partial l}{\partial H}^T (a'(A_1) \odot dA_1)\Big) = \mathrm{tr}\Big(\big(\tfrac{\partial l}{\partial H} \odot a'(A_1)\big)^T dA_1\Big),$$
and
$$\tfrac{\partial l}{\partial A_1} = \tfrac{\partial l}{\partial A_2} W_2 \odot a'(A_1). \quad (28)$$
Similarly, we calculate the gradients of $A$ and $W_1$:
$$\tfrac{\partial l}{\partial A} = \tfrac{\partial l}{\partial A_1} W_1, \quad (29) \qquad \tfrac{\partial l}{\partial W_1} = \tfrac{\partial l}{\partial A_1}^T A. \quad (30)$$
In the attention operation:
$$dl_3 \triangleq \mathrm{tr}\Big(\tfrac{\partial l}{\partial A}^T dA\Big) = \mathrm{tr}\Big(\tfrac{\partial l}{\partial A}^T (dS) V\Big) + \mathrm{tr}\Big(\tfrac{\partial l}{\partial A}^T S\, dV\Big) = \mathrm{tr}\Big(\big(\tfrac{\partial l}{\partial A} V^T\big)^T dS\Big) + \mathrm{tr}\Big(\big(S^T \tfrac{\partial l}{\partial A}\big)^T dV\Big),$$
and
$$\tfrac{\partial l}{\partial S} = \tfrac{\partial l}{\partial A} V^T, \quad (31) \qquad \tfrac{\partial l}{\partial V} = S^T \tfrac{\partial l}{\partial A}. \quad (32)$$
First, for $V = ZW_V^T$:
$$dl_4 \triangleq \mathrm{tr}\Big(\tfrac{\partial l}{\partial V}^T dV\Big) = \mathrm{tr}\Big(\tfrac{\partial l}{\partial V}^T (dZ) W_V^T\Big) + \mathrm{tr}\Big(\tfrac{\partial l}{\partial V}^T Z\, dW_V^T\Big),$$
so the gradients of $Z$ and $W_V$ are:
$$\tfrac{\partial l}{\partial Z} = \tfrac{\partial l}{\partial V} W_V, \qquad \tfrac{\partial l}{\partial W_V} = \tfrac{\partial l}{\partial V}^T Z.$$
Now we focus on $S = \mathrm{Softmax}\big(\tfrac{QK^T}{\sqrt d}\big)$:
$$dl_5 \triangleq \mathrm{tr}\Big(\tfrac{\partial l}{\partial S}^T dS\Big) = \mathrm{tr}\Big(\tfrac{\partial l}{\partial S}^T (\mathrm{diag}(S) - S^T S)\, d\big(\tfrac{QK^T}{\sqrt d}\big)\Big) = \mathrm{tr}\Big(\big((\mathrm{diag}(S) - S^T S)^T \tfrac{\partial l}{\partial S}\big)^T d\big(\tfrac{QK^T}{\sqrt d}\big)\Big),$$
and thus
$$\tfrac{\partial l}{\partial Q} = \tfrac{1}{\sqrt d}\big((\mathrm{diag}(S) - S^T S)^T \tfrac{\partial l}{\partial S}\big) K, \qquad \tfrac{\partial l}{\partial K} = \tfrac{1}{\sqrt d}\big((\mathrm{diag}(S) - S^T S)^T \tfrac{\partial l}{\partial S}\big)^T Q.$$
And similarly the gradients of $W_Q$ and $W_K$ are:
$$\tfrac{\partial l}{\partial W_Q} = \tfrac{\partial l}{\partial Q}^T Z, \quad (37) \qquad \tfrac{\partial l}{\partial W_K} = \tfrac{\partial l}{\partial K}^T Z.$$

What happens in backward propagation with RS: first, let us take a black-box view of the Transformer encoder blocks. In our scheme, the input of Enc is the shuffled smashed data $P_R Z$ and the output is $P_R A_2$. The edge client obtains $P_R A_2$ and reverses the shuffling order to get $A_{3(R)} = P_R^T A_{2(R)}$. In the following, we denote all variables involved in our RS method with the subscript $(R)$; variables in vanilla split learning carry no subscript.

According to Eq. 1, $A_{3(R)} = P_R^T A_{2(R)} = P_R^T P_R A_2 = A_2$. We have $A_3 = A_2$ in plain SL, and thus
$$\tfrac{\partial l}{\partial A_{3(R)}} = \tfrac{\partial l}{\partial A_2} = \tfrac{\partial l}{\partial A_3}. \quad (40)$$
Eq. 40 suggests that since the forwarding (from $A_{3(R)}$ to the loss) is the same between the original scheme and ours on the edge, the backward process is also equivalent, i.e., each gradient of the weights between $A_{3(R)}$ and the loss $l$ is the same for our method and the original SL. Hence we only need to focus on the gradients from $A_{2(R)}$ backward:
$$dl = \mathrm{tr}\Big(\tfrac{\partial l}{\partial A_{3(R)}}^T P_R^T\, dA_{2(R)}\Big) = \mathrm{tr}\Big(\big(P_R \tfrac{\partial l}{\partial A_3}\big)^T dA_{2(R)}\Big),$$
and thus
$$\tfrac{\partial l}{\partial A_{2(R)}} = P_R \tfrac{\partial l}{\partial A_3} = P_R \tfrac{\partial l}{\partial A_2}. \quad (41)$$
It is quite an interesting conclusion, and we will soon find it important to the permutation invariance of the gradients of the weight matrices. By substituting Eq. 41 into Eq. 27, we obtain:
$$\tfrac{\partial l}{\partial W_{2(R)}} = \tfrac{\partial l}{\partial A_{2(R)}}^T H_{(R)} = \tfrac{\partial l}{\partial A_2}^T P_R^T P_R H \;(43)\; = \tfrac{\partial l}{\partial A_2}^T H \;(44)\; = \tfrac{\partial l}{\partial W_2}. \;(45)$$
Eq. 45 reveals that the gradients of the weight matrix in our RS scheme are the same as the gradients in vanilla SL. In fact, each gradient of a weight matrix is 'permutation-invariant,' i.e., satisfies Eq. 45, whereas each gradient of an intermediate feature is 'permutation-equivalent,' meeting Eq. 41. The properties of Eq. 45 and Eq. 41 are transductive, carrying from the final layer backward to the first layer. Hence by Eq. 26, $\tfrac{\partial l}{\partial H_{(R)}} = P_R \tfrac{\partial l}{\partial H}$, and by Eq. 28,
$$\tfrac{\partial l}{\partial A_{1(R)}} = P_R \tfrac{\partial l}{\partial A_2} W_2 \odot P_R\, a'(A_1) = P_R \Big(\tfrac{\partial l}{\partial A_2} W_2 \odot a'(A_1)\Big). \quad (47)$$
Thus $\tfrac{\partial l}{\partial A_{1(R)}} = P_R \tfrac{\partial l}{\partial A_1}$, and by Eq. 30, $\tfrac{\partial l}{\partial W_{1(R)}} = \tfrac{\partial l}{\partial W_1}$. By Eq. 29, we have $\tfrac{\partial l}{\partial A_{(R)}} = P_R \tfrac{\partial l}{\partial A}$, which passes the property to $S$. But $S$ is a tricky one since, in the derivation of Eq. 23, we notice that $S_{(R)} = P_R S P_R^T$. According to Eq. 31,
$$\tfrac{\partial l}{\partial S_{(R)}} = \tfrac{\partial l}{\partial A_{(R)}} V_{(R)}^T = P_R \tfrac{\partial l}{\partial A} V^T P_R^T,$$
which interestingly leads to $\tfrac{\partial l}{\partial S_{(R)}} = P_R \tfrac{\partial l}{\partial S} P_R^T$.

This is consistent with $S_{(R)}$'s permutation form, which passes the elegant properties down to $Q, K$:
$$\tfrac{\partial l}{\partial Q_{(R)}} = \tfrac{1}{\sqrt d}\big((\mathrm{diag}(S_{(R)}) - S_{(R)}^T S_{(R)})^T \tfrac{\partial l}{\partial S_{(R)}}\big) K_{(R)} = \tfrac{1}{\sqrt d}\big((P_R\, \mathrm{diag}(S) P_R^T - P_R S^T P_R^T P_R S P_R^T)^T P_R \tfrac{\partial l}{\partial S} P_R^T\big) P_R K$$
$$= \tfrac{1}{\sqrt d}\big(P_R (\mathrm{diag}(S) - S^T S)^T P_R^T P_R \tfrac{\partial l}{\partial S} P_R^T\big) P_R K = P_R \tfrac{1}{\sqrt d}\big((\mathrm{diag}(S) - S^T S)^T \tfrac{\partial l}{\partial S}\big) K,$$
and hence $\tfrac{\partial l}{\partial Q_{(R)}} = P_R \tfrac{\partial l}{\partial Q}$. Similarly, $\tfrac{\partial l}{\partial K_{(R)}} = P_R \tfrac{\partial l}{\partial K}$. By Eq. 37,
$$\tfrac{\partial l}{\partial W_{Q(R)}} = \tfrac{\partial l}{\partial Q_{(R)}}^T Z_{(R)} \;(53)\; = \tfrac{\partial l}{\partial Q}^T P_R^T P_R Z = \tfrac{\partial l}{\partial W_Q}.$$
Similarly, $\tfrac{\partial l}{\partial W_{K(R)}} = \tfrac{\partial l}{\partial W_K}$. And for $V$,
$$\tfrac{\partial l}{\partial V_{(R)}} = S_{(R)}^T \tfrac{\partial l}{\partial A_{(R)}} \;(56)\; = P_R S^T P_R^T \cdot P_R \tfrac{\partial l}{\partial A} \;(57)\; = P_R S^T \tfrac{\partial l}{\partial A} \;(58)\; = P_R \tfrac{\partial l}{\partial V}.$$
Similarly,
$$\tfrac{\partial l}{\partial W_{V(R)}} = \tfrac{\partial l}{\partial V_{(R)}}^T Z_{(R)} \;(60)\; = \tfrac{\partial l}{\partial V}^T P_R^T P_R Z \;(61)\; = \tfrac{\partial l}{\partial W_V}. \;(62)$$
By now we have proved that the gradients of the weights in our RS scheme are exactly the same as the gradients of the weights in vanilla split learning, and by induction we conclude that the Enc learned with the RS method is no different from the Enc learned without it. It is worth mentioning that if the attention is cut into multiple heads, the property remains, because cutting the second dimension (columns) does not affect the permutation of the first dimension (rows).
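The weight-gradient invariance of Eq. 45 can also be checked in closed form without autograd. A minimal sketch (NumPy, our names) for a single linear layer with the row-permutation-invariant loss $l = \|ZW^T\|_F^2$, whose gradient is $\partial l/\partial W = 2(ZW^T)^T Z$:

```python
import numpy as np

rng = np.random.default_rng(0)
p, d = 6, 8
Z = rng.normal(size=(p, d))
W = rng.normal(size=(d, d))
P_R = np.eye(p)[rng.permutation(p)]

def grad_W(Z, W):
    # l = ||Z W^T||_F^2  =>  dl/dW = 2 (Z W^T)^T Z
    A = Z @ W.T
    return 2 * A.T @ Z

# The loss treats rows symmetrically, so shuffling the rows of Z
# leaves the weight gradient unchanged: (P_R A)^T (P_R Z) = A^T Z.
assert np.allclose(grad_W(P_R @ Z, W), grad_W(Z, W))
```

This is exactly the $P_R^T P_R = E$ cancellation in Eqs. 43-45, isolated to one layer.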

B.4 PROOF OF THEOREM 3

Proof. We denote the variables involved in our RCS method with the subscript $(P)$; row shuffling is included as a special case of RCS. First and foremost, we 'encrypt' all the weight matrices by Eq. 10:
$$W_{i(P)} = P_C W_i P_C^T,$$
where $P_C$ is the column permutation matrix, $W_i$ is a weight of a normal Enc, and $i \in \{1, 2, Q, K, V\}$. We denote the Transformer encoder block with such 'encryption' as $\mathrm{Enc}_{(P)}$. Note that this operation requires $W_i$ to be square; we have proposed two ways to achieve this with little modification to the original model and without performance loss. For $Q$:
$$Q_{(P)} = Z_{(P)} W_{Q(P)}^T = P_R Z P_C^T \cdot P_C W_Q^T P_C^T \;(64)\; = P_R Z W_Q^T P_C^T \;(65)\; = P_R Q P_C^T. \;(66)$$
Similarly for $K, V$: $K_{(P)} = P_R K P_C^T$, $V_{(P)} = P_R V P_C^T$. For $S = \mathrm{Softmax}\big(\tfrac{QK^T}{\sqrt d}\big)$:
$$S_{(P)} = \mathrm{Softmax}\Big(\tfrac{Q_{(P)} K_{(P)}^T}{\sqrt d}\Big) \;(69)\; = \mathrm{Softmax}\Big(\tfrac{P_R Q P_C^T \cdot P_C K^T P_R^T}{\sqrt d}\Big) \;(70)\; = \mathrm{Softmax}\Big(\tfrac{P_R Q K^T P_R^T}{\sqrt d}\Big) \;(71)\; = P_R\, \mathrm{Softmax}\Big(\tfrac{QK^T}{\sqrt d}\Big) P_R^T \;(72)\; = P_R S P_R^T. \;(73)$$
So for $A$:
$$A_{(P)} = S_{(P)} V_{(P)} \;(74)\; = P_R S P_R^T \cdot P_R V P_C^T \;(75)\; = P_R S V P_C^T \;(76)\; = P_R A P_C^T. \;(77)$$
Following the attention layer, $A$ is fed to the MLP layer:
$$A_{1(P)} = A_{(P)} W_{1(P)}^T \;(78)\; = P_R A P_C^T \cdot P_C W_1^T P_C^T \;(79)\; = P_R A W_1^T P_C^T \;(80)\; = P_R A_1 P_C^T. \;(81)$$
Similarly for $A_2$: $A_{2(P)} = P_R A_2 P_C^T$. As for the activation in the middle, the element-wise activation function is permutation-equivalent (Lemma 1): $H_{(P)} = P_R H P_C^T$. Overall, we have proved Thm. 3.
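The full forward chain of Thm. 3 can be verified end-to-end on a simplified block. A sketch (NumPy, our names; attention plus a two-layer ReLU MLP, omitting layer norm and shortcuts, which are element-wise and covered by Lemma 1; all weights square as the theorem requires):

```python
import numpy as np

def softmax(A):
    e = np.exp(A - A.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def enc_block(Z, W):
    # Simplified encoder block: attention followed by a 2-layer MLP.
    d = Z.shape[1]
    Q, K, V = Z @ W['Q'].T, Z @ W['K'].T, Z @ W['V'].T
    A = softmax(Q @ K.T / np.sqrt(d)) @ V
    H = np.maximum(A @ W['1'].T, 0.0)     # ReLU, element-wise
    return H @ W['2'].T

rng = np.random.default_rng(0)
p, d = 6, 8
Z = rng.normal(size=(p, d))
W = {k: rng.normal(size=(d, d)) for k in 'QKV12'}
P_R = np.eye(p)[rng.permutation(p)]
P_C = np.eye(d)[rng.permutation(d)]

# 'Encrypt' every weight as in Eq. 10: W_(P) = P_C W P_C^T.
W_P = {k: P_C @ w @ P_C.T for k, w in W.items()}

# Feeding the RCS input to the encrypted block yields the RCS
# output of the plain block: Enc_(P)(P_R Z P_C^T) = P_R Enc(Z) P_C^T.
out_P = enc_block(P_R @ Z @ P_C.T, W_P)
assert np.allclose(out_P, P_R @ enc_block(Z, W) @ P_C.T)
```

Unshuffling `out_P` with $P_R^T$ and $P_C$ on the edge thus recovers exactly the plain output.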

B.5 PROOF OF THEOREM 4

Proof. By row and column unshuffling, the computation of the forward and backward propagation on the edge is no different with or without our RCS method. Hence we only focus on the propagation through the Transformer encoder blocks. Similarly to the proof of RS, we denote by $A_{3(P)}$ the reversed intermediate feature that the edge receives: $A_{3(P)} = P_R^T A_{2(P)} P_C$. By Thm. 3, we have $A_{3(P)} = A_2 = A_3$. We first focus on the MLP layer:
$$dl = \mathrm{tr}\Big(\tfrac{\partial l}{\partial A_3}^T P_R^T\, d(A_{2(P)})\, P_C\Big) = \mathrm{tr}\Big(P_C \tfrac{\partial l}{\partial A_3}^T P_R^T\, dA_{2(P)}\Big) = \mathrm{tr}\Big(\big(P_R \tfrac{\partial l}{\partial A_3} P_C^T\big)^T dA_{2(P)}\Big),$$
that is,
$$\tfrac{\partial l}{\partial A_{2(P)}} = P_R \tfrac{\partial l}{\partial A_2} P_C^T. \quad (87)$$
With $H_{(P)} = P_R H P_C^T$ and Eq. 27, the gradient is
$$\tfrac{\partial l}{\partial W_{2(P)}} = \tfrac{\partial l}{\partial A_{2(P)}}^T H_{(P)} = P_C \tfrac{\partial l}{\partial A_2}^T P_R^T \cdot P_R H P_C^T = P_C \tfrac{\partial l}{\partial A_2}^T H P_C^T = P_C \tfrac{\partial l}{\partial W_2} P_C^T,$$
that is, $\tfrac{\partial l}{\partial W_{2(P)}} = P_C \tfrac{\partial l}{\partial W_2} P_C^T$. By Eq. 28, Eq. 87, and a derivation similar to Eq. 47, we have
$$\tfrac{\partial l}{\partial A_{1(P)}} = \tfrac{\partial l}{\partial A_{2(P)}} W_{2(P)} \odot a'(A_{1(P)}) = \Big[P_R \tfrac{\partial l}{\partial A_2} P_C^T \cdot P_C W_2 P_C^T\Big] \odot \Big[P_R\, a'(A_1) P_C^T\Big] = \Big[P_R \tfrac{\partial l}{\partial A_2} W_2 P_C^T\Big] \odot \Big[P_R\, a'(A_1) P_C^T\Big] = P_R \Big[\tfrac{\partial l}{\partial A_2} W_2 \odot a'(A_1)\Big] P_C^T = P_R \tfrac{\partial l}{\partial A_1} P_C^T,$$
that is, $\tfrac{\partial l}{\partial A_{1(P)}} = P_R \tfrac{\partial l}{\partial A_1} P_C^T$. The weight $W_{1(P)}$ in the MLP has the following gradient by Eq. 30:
$$\tfrac{\partial l}{\partial W_{1(P)}} = \tfrac{\partial l}{\partial A_{1(P)}}^T A_{(P)} = P_C \tfrac{\partial l}{\partial A_1}^T P_R^T \cdot P_R A P_C^T = P_C \tfrac{\partial l}{\partial W_1} P_C^T,$$
that is, $\tfrac{\partial l}{\partial W_{1(P)}} = P_C \tfrac{\partial l}{\partial W_1} P_C^T$. And we come to the attention operation; from Eq. 29, we have
$$\tfrac{\partial l}{\partial A_{(P)}} = \tfrac{\partial l}{\partial A_{1(P)}} W_{1(P)} = P_R \tfrac{\partial l}{\partial A_1} P_C^T \cdot P_C W_1 P_C^T = P_R \tfrac{\partial l}{\partial A_1} W_1 P_C^T = P_R \tfrac{\partial l}{\partial A} P_C^T,$$
that is, $\tfrac{\partial l}{\partial A_{(P)}} = P_R \tfrac{\partial l}{\partial A} P_C^T$. Hence we observe that the permutation rules for the gradients of the intermediate-layer outputs differ from those for the gradients of the weights. As for the gradient of the softmax-layer output, we have
$$\tfrac{\partial l}{\partial S_{(P)}} = \tfrac{\partial l}{\partial A_{(P)}} V_{(P)}^T = P_R \tfrac{\partial l}{\partial A} P_C^T \cdot P_C V^T P_R^T = P_R \tfrac{\partial l}{\partial A} V^T P_R^T = P_R \tfrac{\partial l}{\partial S} P_R^T,$$
that is,
$$\tfrac{\partial l}{\partial S_{(P)}} = P_R \tfrac{\partial l}{\partial S} P_R^T. \quad (91)$$
Since $S_{(P)}$ follows Eq. 73, combining with Eq. 91 we have the gradient of $Q_{(P)}$:
$$\tfrac{\partial l}{\partial Q_{(P)}} = \tfrac{1}{\sqrt d}\Big[(\mathrm{diag}(S_{(P)}) - S_{(P)}^T S_{(P)})^T \tfrac{\partial l}{\partial S_{(P)}}\Big] K_{(P)} = \tfrac{1}{\sqrt d}\Big[(P_R\, \mathrm{diag}(S) P_R^T - P_R S^T P_R^T \cdot P_R S P_R^T)^T P_R \tfrac{\partial l}{\partial S} P_R^T\Big] P_R K P_C^T$$
$$= \tfrac{1}{\sqrt d}\Big[P_R (\mathrm{diag}(S) - S^T S)^T P_R^T \cdot P_R \tfrac{\partial l}{\partial S} P_R^T\Big] P_R K P_C^T = \tfrac{1}{\sqrt d} P_R (\mathrm{diag}(S) - S^T S)^T \tfrac{\partial l}{\partial S} P_R^T \cdot P_R K P_C^T = P_R \tfrac{1}{\sqrt d}\Big[(\mathrm{diag}(S) - S^T S)^T \tfrac{\partial l}{\partial S}\Big] K P_C^T = P_R \tfrac{\partial l}{\partial Q} P_C^T.$$
By a similar derivation for $K$ we obtain $\tfrac{\partial l}{\partial K_{(P)}} = P_R \tfrac{\partial l}{\partial K} P_C^T$. Following a proof similar to that for the gradients of $W_{1(P)}$ or $W_{2(P)}$, we can easily derive:
$$\tfrac{\partial l}{\partial W_{Q(P)}} = P_C \tfrac{\partial l}{\partial W_Q} P_C^T, \qquad \tfrac{\partial l}{\partial W_{K(P)}} = P_C \tfrac{\partial l}{\partial W_K} P_C^T.$$
And by Eq. 32, the gradient of $V_{(P)}$ is
$$\tfrac{\partial l}{\partial V_{(P)}} = S_{(P)}^T \tfrac{\partial l}{\partial A_{(P)}} = P_R S^T P_R^T \cdot P_R \tfrac{\partial l}{\partial A} P_C^T = P_R S^T \tfrac{\partial l}{\partial A} P_C^T = P_R \tfrac{\partial l}{\partial V} P_C^T,$$
thus we have
$$\tfrac{\partial l}{\partial V_{(P)}} = P_R \tfrac{\partial l}{\partial V} P_C^T, \quad (95) \qquad \tfrac{\partial l}{\partial W_{V(P)}} = P_C \tfrac{\partial l}{\partial W_V} P_C^T.$$
So far, we have proved the rule for the gradients of the weight matrices:
$$\tfrac{\partial l}{\partial W_{i(P)}} = P_C \tfrac{\partial l}{\partial W_i} P_C^T, \quad i \in \{1, 2, Q, K, V\}. \quad (97)$$
$W_{i(P)}$ are the weights of $\mathrm{Enc}_{(P)}$ while $W_i$ are the weights of Enc. With some induction, we reach the conclusion that if a Transformer encoder block is randomly initialized and trained on $Z_{(P)}$, it eventually learns to become $\mathrm{Enc}_{(P)}$, which is associated with Enc by Eq. 97.
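The weight-gradient rule of Eq. 97 can also be spot-checked in closed form. A sketch (NumPy, our names) for a single encrypted linear layer under RCS, using the row-and-column-permutation-invariant loss $l = \|HW_2^T\|_F^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
p, d = 6, 8
H = rng.normal(size=(p, d))       # input feature of the linear layer
W2 = rng.normal(size=(d, d))
P_R = np.eye(p)[rng.permutation(p)]
P_C = np.eye(d)[rng.permutation(d)]

def grad_W2(H, W2):
    # l = ||H W2^T||_F^2  =>  dl/dW2 = 2 (H W2^T)^T H
    A2 = H @ W2.T
    return 2 * A2.T @ H

g  = grad_W2(H, W2)                         # plain gradient
gP = grad_W2(P_R @ H @ P_C.T,               # RCS feature H_(P)
             P_C @ W2 @ P_C.T)              # encrypted weight W2_(P)

# The encrypted layer's gradient is the encrypted plain gradient,
# so SGD keeps the weights in 'encrypted form' throughout training.
assert np.allclose(gP, P_C @ g @ P_C.T)
```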

B.6 PROOFS ON PARAMETERS OF THE EDGE

We show in this section that the parameters on the edge, including the weights and the position embeddings, are the same whether or not the shuffling method is used.

Theorem 5. The parameters on the edge trained with or without row-column shuffling are the same.

Proof. We denote the embedded feature in naive SL and in our shuffling scheme as $Z_0$ and $Z_{0(P)}$, respectively, and the feature to be sent to the cloud in the two schemes as $Z$ and $Z_{(P)}$. In naive split learning, $Z_0 = Z$ and $\tfrac{\partial l}{\partial Z_0} = \tfrac{\partial l}{\partial Z}$. In our scheme we have $Z_{(P)} = P_R Z_{0(P)} P_C^T$. To prove the claim is to prove:
$$\tfrac{\partial l}{\partial Z_{0(P)}} = \tfrac{\partial l}{\partial Z_0}. \quad (100)$$
Indeed,
$$\tfrac{\partial l}{\partial Z_{0(P)}} = P_R^T \tfrac{\partial l}{\partial Z_{(P)}} P_C = P_R^T P_R \tfrac{\partial l}{\partial Z} P_C^T P_C = \tfrac{\partial l}{\partial Z} = \tfrac{\partial l}{\partial Z_0}.$$
The second equality holds by an argument similar to Eq. 95. The equivalence between the edge weights in the two schemes follows from Eq. 100 and the fact that their forward procedures are exactly the same.

C SUPPLEMENTARY EXPERIMENTS

C.1 VERIFYING THE PROPERTIES ON BIAS AND LAYER NORM

We do not mention the bias and $\gamma$, the weight in layer normalization, in the proofs in Appendix B. Here is some intuition about why they are encrypted in the way shown in Eq. 10: $b_{(P)} = bP_C^{-1}$, $\gamma_{(P)} = \gamma P_C^{-1}$. Both the bias and $\gamma$ are 1-D vectors, and each element only interacts with the corresponding column of $Z$ in both forward and backward propagation. So if the columns are permuted, the bias and $\gamma$ should be permuted in the same way. During the experiments we find that the encryption/decryption of the bias and $\gamma$ hardly affects the model performance on Cifar10. But on CelebA, if most weight matrices are encrypted by $M_W$ while the bias and $\gamma$ are not permuted, the accuracy is greatly affected, reaching merely 79.056%. But if the encryption is strictly performed as in Eq. 10, the relative accuracy difference between a normally trained model (tested with $Z$) and an encrypted model (tested with $ZP_C^T$) is merely 0.00013% (91.50991% and 91.50979%, respectively).
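The rule $b_{(P)} = bP_C^{-1}$ follows from a one-line check: a bias is broadcast to every row, so only the column permutation touches it. A sketch (NumPy, our names) for a single biased linear layer under RCS:

```python
import numpy as np

rng = np.random.default_rng(0)
p, d = 6, 8
Z = rng.normal(size=(p, d))
W = rng.normal(size=(d, d))
b = rng.normal(size=(d,))
P_R = np.eye(p)[rng.permutation(p)]
P_C = np.eye(d)[rng.permutation(d)]

plain = Z @ W.T + b                        # normal linear layer
# Encrypted layer on the RCS feature; the bias is permuted as
# b P_C^{-1} (= b P_C^T for a permutation matrix).
enc = (P_R @ Z @ P_C.T) @ (P_C @ W @ P_C.T).T + b @ P_C.T

# The encrypted layer's output is the RCS plain output, as the
# row permutation never touches the (row-broadcast) bias.
assert np.allclose(enc, P_R @ plain @ P_C.T)
```

The same argument applies to $\gamma$, which also multiplies $Z$ column-wise.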

C.2 VERIFYING ON ORDER-DEPENDENT TASKS

We consider the classification task to be weakly order-dependent, i.e., the task may rely little on the order of the patches of an input image. To verify the feasibility of our method on strictly order-dependent tasks, we designed a simple task which strongly depends on the input order: we label the shuffled images in CelebA as 1 and the original ones as 0, and train a ViT-Base on them from scratch to distinguish whether the images are shuffled. Within one epoch, the accuracy of ViT-Base reaches around 97%, showing the model and the task are strictly order-dependent. We then conduct property-verifying experiments similar to those in Sec. 6.1, and the conclusion does not change: RS is transparent to the encoder, and encrypted models can only process encrypted data. This demonstrates that the shuffled Transformer also works on tasks strongly associated with the input patch order.

C.3 VERIFYING ON NLP DATA

We implement a small Bert model and fine-tune it from a pre-trained model for natural language inference on the SNLI dataset 2. The small Bert has 2 layers, and the input $Z$ is of shape (batch size, 128, 256). Due to the non-square MLP (with a hidden dimension of 512), the column shuffle is not suitable for this pre-trained model. Note that the position embedding is not removed from the model. With or without the row shuffle method, the small Bert achieves similar accuracies: 76.4% and 76.5%, respectively. The code is provided in the supplementary materials.

C.4 VERIFYING ON TABULAR DATA

We use the DIFM (Lu et al., 2021) model and the Criteo dataset to verify the proved properties on tabular data. The vector-wise part of the DIFM model is a Transformer encoder block. We shuffle the rows of the input to the vector-wise part and unshuffle its output. The model achieves the same AUC, 0.777, with or without the row shuffle method. In summary, our method works on CV, NLP and tabular data, regardless of the specific task. This is attributed to the modeling where $Z$ is a general matrix, so the permutation invariance property holds with no strings attached. Speaking of invariance, a negligible error of $10^{-7}$ per element occurs in the permutation due to floating-point calculation error.

C.5 ADDITIONAL RESULTS ON CIFAR10

We provide additional experimental results on Cifar10 in Fig. 8 and Tab. 6. The black-box attacks described in Sec. 3 are launched. Due to the sparse data distribution, the defence on Cifar10 is more successful than on CelebA. RCS almost eliminates the sketch of the object in each reconstructed image in Fig. 8.

C.6 EXPERIMENTS ON POSITION EMBEDDINGS

To further verify that our method has little impact on the network parameters on the edge, as proved in Appendix B.6, we visualize in Fig. 9 the cosine similarity of position embeddings learned at different places of the model on the CelebA classification task. We find that if the position embedding is placed ahead of shuffling, the cosine similarity shares a similar state to that without shuffling, as shown in Fig. 9, resembling a human face. Hence our shuffling method barely changes the weights on the edge. To verify the impact of removing position embeddings in training, we train the Transformer both from scratch and from pre-trained weights, on both large and small datasets. Except for training from scratch on Cifar10, all model accuracies remain at a similar level to those with position embeddings. In the exceptional case, the accuracy drops by ∼10% (∼80% with position embeddings and ∼70% without). We consider this mainly due to the poor performance of the plain Transformer on small datasets; after all, the Transformer works best with pre-training on a large dataset.

C.7 ABLATION STUDY OF PATCH SIZE

The selection of the patch size has a significant influence on privacy, accuracy and efficiency. For a fixed input image size, increasing the patch size decreases the total number of patches, which are the basic unit of shuffling. As shown in Table 7, increasing the patch size obviously degrades the accuracy, since the performance of ViT suffers from the decreased patch number.

As revealed by the results of JigsawGAN (Li et al., 2021) (which uses a GAN to reassemble shuffled patches, a.k.a. the Jigsaw problem) and the black-box attack, increasing the patch size also harms privacy: larger patches carry more information in each basic unit of shuffling, making it easier for an attacker to recover the relationships between patches. However, a larger patch size and fewer patches significantly reduce the computation cost, as the time complexity of the attention block is quadratic in the number of patches.
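The cost side of this trade-off is simple arithmetic. A sketch (our helper function, assuming a 32x32 input as in the Cifar10 ablation): the patch count is $n = (32/s)^2$ for patch size $s$, and attention cost scales as $n^2$:

```python
# For a 32x32 image, patch size s gives n = (32/s)^2 patches;
# the attention block's time complexity is quadratic in n, while
# n is also the number of basic shuffling units (privacy budget).
def patch_stats(image_size=32, patch_sizes=(2, 4, 8, 16)):
    stats = {}
    for s in patch_sizes:
        n = (image_size // s) ** 2
        stats[s] = {'patches': n, 'attn_cost': n * n}
    return stats

stats = patch_stats()
# Doubling the patch size from 4 to 8 cuts patches 64 -> 16 and
# the attention map 4096 -> 256 (16x cheaper), but leaves far
# fewer units to shuffle.
assert stats[4]['patches'] == 64
assert stats[8]['attn_cost'] == 256
```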



1 https://github.com/rwightman/pytorch-image-models
2 https://nlp.stanford.edu/projects/snli/



Figure 1: Transformer Encoder Block

(a) Training the cloud model on $Z$ or $\{P_{Ri} Z_i\}$ yields the same $W$. (b) Training the cloud model on $\{Z_i P_C^{-1}\}$ or $\{P_{Ri} Z_i P_C^{-1}\}$ returns $W_{(P)} = P_C W P_C^{-1}$. Decryption and re-encryption of the trained model can be implemented by matrix multiplication.

Figure 2: Properties of shuffled Transformers. White blocks denote initialized models, gray blocks denote naturally trained models, and red ones indicate the 'encrypted' model $\mathrm{Enc}_{(P)}$. $O$ represents the original outputs. $R_i$ suggests the order of row shuffle can vary for each input.

Figure 3: The structure of our privacy-preserving split learning over shuffled Transformers.

Figure 4: The visualization effect of input reconstruction on CelebA under black-box attacks (Black), white-box attacks (White) and attacks to training (e = 10).

Figure 5: The computational overhead, memory cost, and convergence curves on ImageNet of the normal split learning and our method. PE stands for position embedding (ViT-Base).

Algorithm: Shuffling at the Edge (excerpt). 1: Initialization: initialize the model; load the permutation matrix $P_C$ of size $(d, d)$ as the key and get its inverse $P_C^{-1}$.

Figure 6: The row shuffle and unshuffle

(a) Original CelebA pictures. (b) Row (patch) shuffled pictures of CelebA. Each row of Z denotes a fraction of the image, and thus to shuffle the rows is to shuffle the fractions.

Figure 8: Results of the black-box attack to the private inference data of Cifar10.

Figure 9: Cosine similarity between every two patches of position embeddings learned in different ways.

Table 1: Accuracies (%) on row-shuffled and row-column-shuffled data. The shuffle method of the testing data corresponds to that of the training data. To verify Thm. 3 and Thm. 4, we train and test on natural or row-column-shuffled (RCS) data, and report the testing accuracies in Tab. 1. In particular, for CelebA we pre-train the single-head ViT on RCS (natural) ImageNet data, and transfer the weights of the 20th epoch to CelebA. 'Training with RCS' means both pre-training and fine-tuning are on RCS data. Note that due to the imbalanced data distribution, the average accuracy reported over the 40 attributes of CelebA is around 66% to 73% for random guesses, and the benchmark of CelebA on pre-trained ImageNet is around 91%. It is clear that if the model is trained on normal data but tested on shuffled data, or vice versa, its performance is close to random guessing. Otherwise, the shuffled Transformer suffers almost no accuracy loss.

Accuracies(%) of decrypted/re-encrypted models.

Accuracies(%) of CelebA models w/ and w/o Position Embedding (PE).

Accuracy and privacy on the inference data, CelebA. ↓ means desirable direction.

Results of black-box attack on CelebA to the training process.


Table 6: Results of the black-box attack on the private training and testing data of Cifar10. ↓ means the desirable direction.

Table 7: The privacy, utility and efficiency when selecting different patch sizes on Cifar10. The width and height of the input image are both 32 pixels. ↓ means the desirable direction.

