MPCFORMER: FAST, PERFORMANT AND PRIVATE TRANSFORMER INFERENCE WITH MPC

Abstract

Enabling private inference is crucial for many cloud inference services that are based on Transformer models. However, existing private inference solutions can increase the inference latency by more than 60× or significantly compromise the inference quality. In this paper, we design the framework MPCFORMER as a practical solution, using Secure Multi-Party Computation (MPC) and Knowledge Distillation (KD). Through extensive evaluations, we show that MPCFORMER significantly speeds up Transformer inference in MPC settings while achieving similar ML performance to the input model. On the IMDb dataset, it achieves similar performance to BERT BASE while being 5.3× faster. On the GLUE benchmark, it achieves 97% of the performance of BERT BASE with a 2.2× speedup. MPCFORMER remains effective with different trained Transformer weights such as ROBERTA BASE and with larger models including BERT LARGE. Code is available at https://github.com/MccRee177/MPCFormer.

1. INTRODUCTION

Pre-trained Transformer models can be easily fine-tuned on various downstream tasks with high performance and have been widely developed as model inference services (Bommasani et al., 2021; Feng et al., 2020; Yang et al., 2019b; Clark et al., 2020). However, these model inference services can pose privacy concerns. For instance, GitHub Copilot, a code-generating engine adapted from pre-trained GPT weights, requires either users to reveal their code prompts to the service provider, or the service provider to release Copilot's trained weights, which are business proprietary, to users (Chen et al., 2021; Brown et al., 2020). Secure Multi-Party Computation (MPC) offers a promising solution by keeping data and model weights private during inference (Evans et al., 2018). However, vanilla Transformer inference in MPC is unacceptably slow. For instance, BERT BASE inference takes <1 second without MPC, but ∼60 seconds with MPC (Figure 2). An intuitive way to accelerate MPC inference is to replace computational operations with faster approximations and retrain the approximated model, an approach that has been adopted for convolutional neural networks (CNNs) (Chou et al., 2018). Unfortunately, adapting this solution to Transformers drastically decreases the model's performance (§5).

In this paper, we take the first step to pursue privacy-preserving Transformer model inference in MPC, while remaining fast and performant. We take inspiration from the approximation approach and attribute the performance degradation to two challenges. First, many MPC-friendly approximations make model training harder. For example, quadratic functions cause the gradient explosion problem in deep neural networks (Mishra et al., 2020). Second, downstream datasets used for Transformer fine-tuning usually contain insufficient data to retrain an approximated Transformer with common task objectives (Zhang & Sabuncu, 2018; Hinton et al., 2012).
To address these two challenges, we resort to the knowledge distillation (KD) framework. KD can ease model training by matching intermediate representations between the teacher and the student model (Romero et al., 2014); this intermediate supervision can alleviate the gradient explosion problem (Lee et al., 2015). At the same time, the KD objective is data-efficient and allows training an approximated Transformer on small downstream datasets (Touvron et al., 2021).

Figure 1: An illustration of our proposed MPCFORMER framework. MPCFORMER takes a trained (or fine-tuned) Transformer model and adopts given MPC-friendly approximations, then uses KD on the downstream datasets to construct high-quality models. During inference time, MPCFORMER leverages an MPC engine to attain private model inference. For ease of illustration, we only show the service provider and the user. MPC systems such as CrypTen (Knott et al., 2021) may also involve a trusted third party (TTP) to help with the joint computation.

Our approach and contributions. In this paper, we build MPCFORMER, an easy-to-adopt framework for privacy-preserving Transformer inference. MPCFORMER takes in an MPC-friendly approximation and a trained Transformer. It returns a Transformer with low inference latency in MPC and high ML performance simultaneously. To do so, MPCFORMER first replaces bottleneck functions in the input Transformer model with the given MPC-friendly approximations. The resulting approximated Transformer model has a faster inference speed in MPC. Next, it applies knowledge distillation to train the approximated Transformer to high performance, using teacher guidance from the original input Transformer. Finally, the model provider can use the distilled approximated Transformer on top of an MPC engine, e.g., CrypTen, for private model inference service. The overall workflow of MPCFORMER is shown in Figure 1. We implement MPCFORMER on an MPC system (Knott et al., 2021) with various MPC-friendly approximations. In the process, we also design a new and faster MPC-friendly approximation to the Softmax function. We extensively evaluate our implementation with various Transformer models.
On the IMDb benchmark, MPCFORMER achieves similar ML performance to BERT BASE with a 5.3× speedup. It achieves similar ML performance to BERT LARGE with a 5.9× speedup. On the GLUE benchmark, it achieves 97% performance of BERT BASE with a 2.2× speedup. MPCFORMER is also effective when given different trained Transformer models, e.g., RoBERTa BASE .

2. BACKGROUND

In this section, we first describe the Transformer model. Then we describe how functions in Transformer models can be implemented in MPC, and analyze performance bottlenecks.

2.1. TRANSFORMER MODELS

An n-layer Transformer model consists of three components: (1) The embedding layer. (2) A stack of n Transformer layers. (3) The prediction layer. The embedding layer maps a token (e.g. a word or an image patch) to a hidden representation (Devlin et al., 2018; Dosovitskiy et al., 2020) . One Transformer layer consists of an attention module and several matrix multiplication modules (Bahdanau et al., 2014) . The prediction layer maps the output of the last Transformer layer to a task output (e.g., a probability distribution for a classification task). A partial illustration of a Transformer model can be found in Figure 3 .

2.2. TRANSFORMER MODELS IN MPC

The Transformer model inference process can be formulated as a two-party computation (2PC). In 2PC, the user party inputs the data, and the model provider party inputs the Transformer model. They jointly compute an inference result. Throughout the entire inference process, 2PC guarantees that each party only learns information about its own input and the result (Yang et al., 2019a). We describe the secret sharing scheme as a means to preserve privacy during the inference process (Damgård et al., 2012; Goldreich et al., 2019). Assuming that the user provides a number $x$ as its input, the secret sharing scheme splits $x$ into two numbers, $x_1$ and $x_2$. It then lets the user party hold $x_1$ and distributes $x_2$ to the model provider party. The shares $x_1$ and $x_2$ have two properties. First, neither $x_1$ nor $x_2$ alone contains any information about $x$. This property allows the user to hide the actual value $x$ from the model provider. Second, together they reconstruct $x$. For instance, $x_1$ and $x_2$ add up to $x$: $x = x_1 + x_2$. The second property allows joint computation.

We take multiplication via Beaver triples as a joint computation example (Beaver, 1991). In multiplication, the user party provides $x$ and the model provider provides $y$, and they secret share $x$ and $y$. Thus, the user gets $x_1$ and $y_1$; the model provider gets $x_2$ and $y_2$. The protocol assumes that a triple $c = ab$ has been generated, for example through oblivious transfer (Keller et al., 2016) or homomorphic encryption (Paillier, 1999). The triple is also secret shared, so that the user party gets $c_1, a_1, b_1$ and the model provider gets $c_2, a_2, b_2$. The user first computes $\epsilon_1 = x_1 - a_1$ and $\delta_1 = y_1 - b_1$ locally. The model provider similarly computes $\epsilon_2 = x_2 - a_2$ and $\delta_2 = y_2 - b_2$ locally. They communicate these four numbers and reconstruct $\epsilon = \epsilon_1 + \epsilon_2$ and $\delta = \delta_1 + \delta_2$. The user then uses these two values to compute $r_1 = c_1 + \epsilon b_1 + \delta a_1 + \delta\epsilon$. The model provider computes $r_2 = c_2 + \epsilon b_2 + \delta a_2$. At this point, the multiplication result can be reconstructed as $xy = r_1 + r_2$, since $xy = (\epsilon + a)(\delta + b) = c + \epsilon b + \delta a + \delta\epsilon$.
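The multiplication protocol above can be sketched in a few lines of Python. This is a minimal single-process simulation under stated assumptions, not the implementation of any MPC system: both parties live in one function, the modulus `Q` and the helper names `share`, `reconstruct`, and `beaver_multiply` are illustrative, arithmetic is carried out modulo `Q`, and the triple is sampled locally instead of being produced by a cryptographic protocol.

```python
import secrets

Q = 2**61 - 1  # illustrative modulus of the ring Z/QZ (an assumption)


def share(value):
    """Split value into two additive shares modulo Q."""
    s1 = secrets.randbelow(Q)
    s2 = (value - s1) % Q
    return s1, s2


def reconstruct(s1, s2):
    """Recombine two additive shares."""
    return (s1 + s2) % Q


def beaver_multiply(x, y):
    """Multiply x and y on secret shares, following the steps in the text."""
    x1, x2 = share(x)  # user holds x1, y1; model provider holds x2, y2
    y1, y2 = share(y)

    # Pre-generated triple c = a * b, secret shared between the parties.
    a, b = secrets.randbelow(Q), secrets.randbelow(Q)
    a1, a2 = share(a)
    b1, b2 = share(b)
    c1, c2 = share(a * b % Q)

    # Local masking by each party.
    eps1, delta1 = (x1 - a1) % Q, (y1 - b1) % Q  # user
    eps2, delta2 = (x2 - a2) % Q, (y2 - b2) % Q  # model provider

    # One round of communication: only masked values are exchanged.
    eps = (eps1 + eps2) % Q
    delta = (delta1 + delta2) % Q

    # Local computation of the result shares.
    r1 = (c1 + eps * b1 + delta * a1 + delta * eps) % Q  # user
    r2 = (c2 + eps * b2 + delta * a2) % Q                # model provider
    return reconstruct(r1, r2)
```

In this sketch the only values that cross between the two parties are the masked differences, which mirrors the extra communication round discussed in the text.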
There are two important observations in the multiplication example: (1) it does not leak information to the other party. For instance, the user does not send $x_1$ to the model provider. Instead, it sends $\epsilon_1 = x_1 - a_1$, where $a_1$ is a random mask; (2) it requires one extra round of communication compared to multiplication without MPC. This partially explains why vanilla Transformer models are slow in MPC. In particular, functions in Transformers (e.g., nonlinear activations) can be mainly implemented by three routines, i.e., addition, multiplication, and comparison (for ease of illustration, we only consider routines that take two secret-shared numbers). Any computational operation composed of these routines results in extra complexity in MPC (we provide more details on implementing routines and functions in MPC in A.1). Empirically, we show these complexities by running BERT BASE (Figure 2) and reporting communication statistics in Table 1 with a secret-sharing-based MPC system (Knott et al., 2021). We observe that the GeLU and Softmax functions in Transformer layers are the major sources of bottlenecks, which echoes findings in a concurrent study (Wang et al., 2022).

$\mathrm{GeLU}(x) = x \times \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$ is slow because the Gaussian error function $\mathrm{erf}(\cdot)$ is evaluated by a high-order Taylor expansion, which requires many multiplication routines. $\mathrm{Softmax}(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$ is slow because (1) the exponential function is evaluated by several iterations of squaring, which requires many multiplication routines; (2) the maximum operation over $x$ is required for numerical stability (Paszke et al., 2019), which requires comparison routines.
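The two sources of Softmax cost named above can be illustrated in plain Python. The sketch below is only an illustration in clear text: it assumes the limit form exp(x) ≈ (1 + x/2^n)^(2^n) as one way to realise "iterations of squaring" (whether a given MPC system uses exactly this form and this n is an assumption), and the function names are hypothetical.

```python
import math


def exp_by_squaring(x, iters=8):
    """Approximate exp(x) as (1 + x / 2**iters) ** (2**iters).

    Each loop step is one multiplication, i.e. one multiplication
    routine per iteration when run on secret shares.
    """
    y = 1.0 + x / (2 ** iters)
    for _ in range(iters):
        y = y * y
    return y


def naive_softmax(xs):
    """Softmax without max subtraction; overflows for large inputs."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def stable_softmax(xs):
    """Softmax with max subtraction; the max needs comparison routines."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

With inputs such as `[1000.0, 1001.0]`, `naive_softmax` fails with an overflow while `stable_softmax` returns a valid distribution, which is why the maximum (and hence comparison) is hard to avoid for the exact Softmax.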

3. RELATED WORK

MPC. Secure Multi-Party Computation (MPC) enables joint computation between parties while keeping inputs private. Its privacy guarantees and rich system support have made it suitable for Transformer inference (Mohassel & Zhang, 2017; Liu et al., 2017; Mohassel & Rindal, 2018; Riazi et al., 2018; Juvekar et al., 2018; Wagh et al., 2019; Mishra et al., 2020; Knott et al., 2021). In this paper, we do not aim to implement a new MPC system. Rather, we aim to develop an algorithmic solution to speed up Transformer inference that is portable across many MPC systems.

Transformer models. Transformer models have achieved great success in language understanding (Yang et al., 2019b; Lan et al., 2019; Raffel et al., 2020; Clark et al., 2020), vision understanding (Dosovitskiy et al., 2020; Liu et al., 2021; Radford et al., 2021), and beyond (Sharir et al., 2021). In particular, the two-stage training strategy for Transformer models has been shown to be effective in extensive settings and has become the dominant paradigm (Liu et al., 2019; Radford et al., 2018; Turc et al., 2019). In this training strategy, Transformer models are first pre-trained on a large dataset for general understanding, and then fine-tuned on a small downstream dataset to learn task-specific features. In this work, we consider this paradigm as the default setting, where we assume that model providers use pre-trained Transformer weights from elsewhere and only have downstream data.

MPC-friendly approximations. Existing research has developed MPC-friendly approximations to speed up CNN computation in MPC. Chou et al. (2018) develop an optimization framework that minimizes the approximation error of an order-2 polynomial to ReLU: $\mathrm{ReLU}(x) \approx 0.125x^2 + 0.25x + 0.5$. This introduces a significant accuracy drop because the quadratic activation causes the Gradient Descent (GD) algorithm to diverge. Mishra et al. (2020) alleviate this problem by using a set of carefully designed heuristics along with Neural Architecture Search (NAS). Mohassel & Zhang (2017) propose an approximation to softmax that replaces the exponential with ReLU functions. We do not focus on developing heuristics for a single pair of bottleneck functions and approximations. Rather, we focus on developing a general framework that can consistently output a performant Transformer model with various approximations.

Knowledge Distillation (KD). KD transfers knowledge from the teacher model to the student model by matching their hidden representations (Hinton et al., 2006). Several works have designed effective objectives for Transformer models (Sanh et al., 2019; Jiao et al., 2019; Dosovitskiy et al., 2020), such as matching the attention matrices. In particular, Sanh et al. (2019) and Jiao et al. (2019) have a different goal than ours and train on the pre-training dataset as well. However, we share the same assumption on the model providers' side: they only have downstream datasets.

4. METHOD

In this section, we present the MPCFORMER framework. MPCFORMER allows the model provider to convert its Transformer model into a faster yet still performant one for private inference service. In §4.1, we introduce the workflow of MPCFORMER (Figure 1), followed by the details of each step in §4.2.

4.1. HIGH-LEVEL WORKFLOW

In the inference service, a model provider holds a Transformer model $T$, and the user holds data $X$. They reach an agreement on an MPC system to perform private inference. In §2, we illustrate that using $T$ to perform inference in the MPC system is slow. Instead of using $T$, the model provider can use MPCFORMER to generate a better-suited model $S$. $S$ runs much faster than $T$ in the MPC setting while having similar ML performance to $T$. To use MPCFORMER, the model provider needs to provide the trained Transformer model $T$, the downstream dataset $D$, and MPC-friendly approximations $A$. These MPC-friendly approximations $A$ need to be fast in MPC and will be used to replace bottleneck functions in $T$. The workflow can be concisely described as:

Convert: $S = \mathrm{MPCFORMER}(T, D, A)$
Inference: $y = \mathrm{MPC}_S(X)$

4.2. MPCFORMER

MPCFORMER is a two-stage framework as shown in Figure 3 . The first stage leverages A and T to construct an MPC-friendly Transformer architecture S ′ , which achieves fast inference in the MPC system. The second stage applies knowledge distillation (KD) to S ′ to learn the output model S, which is fast in MPC and preserves the high performance of T .

4.2.1. STAGE 1: APPROXIMATION

In the first stage, MPCFORMER replaces bottleneck functions in $T$ with the given $A$ to construct an MPC-friendly Transformer architecture $S'$ (Figure 3). Below we show how we construct $A$ for our experiments, i.e., using the MPC system of Knott et al. (2021). In §2, we identify the bottlenecks to be the GeLU and Softmax functions. We thus construct $A$ for these two functions.

Approximating GeLU. Analysis in §2 shows that multiplication in MPC requires extra communication. Thus, quadratics, which need only a small number of multiplications, are the fastest nonlinear activations in MPC. Since the GeLU and ReLU functions share similar function values, we simply take the quadratic designed for the ReLU function (§3) to approximate the GeLU function:

$$\mathrm{GeLU}(x) \approx 0.125x^2 + 0.25x + 0.5.$$

We denote this approximation as "Quad".

Approximating Softmax. Prior works on CNNs have developed an MPC-friendly approximation to the Softmax function (Mohassel & Zhang, 2017):

$$\mathrm{softmax}(x_i) \approx \frac{\mathrm{ReLU}(x_i)}{\sum_j \mathrm{ReLU}(x_j)} \qquad (2)$$

We validate that this has a faster inference speed than the Softmax function in our setting (Figure 4). We denote this approximation as "2ReLU". However, this is not yet satisfactory. Analysis in §2 shows that evaluating the ReLU function requires heavy use of comparison routines, which are very expensive. Thus, we propose a more aggressive approximation for the Softmax by replacing the ReLU in Eq. 2 with a quadratic function:

$$\mathrm{softmax}(x_i) \approx \frac{(x_i + c)^2}{\sum_j (x_j + c)^2} \qquad (3)$$

We denote this as "2Quad". Importantly, "2Quad" and the Softmax function differ considerably in numerical values, while prior works argue that similarity in numerical values is crucial to the model's performance (Chou et al., 2018). We are able to use this aggressive approximation because our next distillation stage is effective enough to bridge the performance gap. Figure 4 compares the running time of the original GeLU and Softmax functions against their approximations. In particular, 2Quad has a much faster inference speed than 2ReLU.
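The three approximations can be written out in clear text as follows. This is a small sketch for intuition only: it operates on plain Python floats rather than secret shares, the function names are illustrative, and the default `c = 5.0` follows the value the paper reports trying for the constant in "2Quad".

```python
import math


def gelu(x):
    """Exact GeLU, for reference."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def quad(x):
    """'Quad': quadratic replacement for GeLU."""
    return 0.125 * x * x + 0.25 * x + 0.5


def two_relu(xs):
    """'2ReLU': ReLU(x_i) / sum_j ReLU(x_j).

    Undefined when no entry is positive (the denominator is zero).
    """
    relus = [max(x, 0.0) for x in xs]
    total = sum(relus)
    return [r / total for r in relus]


def two_quad(xs, c=5.0):
    """'2Quad': (x_i + c)^2 / sum_j (x_j + c)^2, no comparisons needed."""
    squares = [(x + c) ** 2 for x in xs]
    total = sum(squares)
    return [s / total for s in squares]
```

Note that `two_quad` uses only additions, multiplications, and one division, whereas `two_relu` needs a comparison per entry, which is the cost the text attributes to "2ReLU".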

4.2.2. STAGE 2: DISTILLATION

In the second stage, we use KD to make the fast approximated Transformer model $S'$ performant. The benefits of KD are two-fold. First, it allows us to use more aggressive approximations such as "2Quad", which leads to higher speedups. Second, its data efficiency allows us to effectively learn a good $S$ with small downstream datasets. Concretely, we conduct layer-wise distillation to transfer knowledge from the input model $T$ to $S'$ by matching representations at the following four positions: (1) the embedding layer, (2) the attention matrix in each Transformer layer, (3) the hidden states after each Transformer layer, and (4) the final prediction layer. These four positions have been shown to store meaningful information in previous works (Hinton et al., 2015; Jiao et al., 2019; Clark et al., 2019). We use the Mean Square Error (MSE) loss to match the representations between $T$ and $S$ for all positions. We follow the learning procedure of Jiao et al. (2019): we first distill the embedding and Transformer layers (including the attention matrix and the hidden states) and then distill the prediction layer.

Student initialization. An important component of knowledge distillation is the initialization of the student model (Sanh et al., 2019). Taking advantage of the fact that $S'$ and $T$ share the same architecture, we initialize $S'$ using the weights of $T$. We find that this outperforms random weight initialization, especially on smaller datasets (§5.3).
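The two-step distillation objective can be sketched as below. This is a simplified illustration, not the training code: representations are flattened into plain Python lists, the dictionary keys (`"embedding"`, `"attentions"`, `"hiddens"`) are hypothetical, and summing the per-position MSE terms with equal weight is an assumption.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat lists."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


def layerwise_loss(teacher, student):
    """Step 1: match embedding output, attention matrices, hidden states."""
    loss = mse(teacher["embedding"], student["embedding"])
    for t_att, s_att in zip(teacher["attentions"], student["attentions"]):
        loss += mse(t_att, s_att)
    for t_hid, s_hid in zip(teacher["hiddens"], student["hiddens"]):
        loss += mse(t_hid, s_hid)
    return loss


def prediction_loss(teacher_logits, student_logits):
    """Step 2: match the outputs of the prediction layer."""
    return mse(teacher_logits, student_logits)
```

In a training loop, `layerwise_loss` would be minimised first and `prediction_loss` afterwards, following the order described above.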

5. EXPERIMENTS

We design the MPCFORMER framework to be compatible with many MPC-friendly approximations and trained Transformer models, so that model providers can conveniently plug in MPC-friendly approximations based on their MPC systems. Thus, we evaluate MPCFORMER with different MPC-friendly approximations under (1) different datasets (§5.1) and (2) different models, especially larger models (§5.2). In the ablation study, we study (1) the effect of student initialization and (2) the effect of the number of training examples in the distillation stage.

Experimental setup. We use two P3.2x AWS instances to simulate the inference service scenario (one P3.2x for the model provider, and one for the user). Each instance is equipped with one Tesla V100 GPU and 10 GbE Ethernet bandwidth. We place the instances in the same placement group to guarantee a 10 GbE bandwidth in AWS. Time breakdown is measured with CrypTen, which implements secret sharing under the semi-honest adversary assumption (§2) (Knott et al., 2021). We train and evaluate models based on HuggingFace (Wolf et al., 2019). In particular, we find that the implementation of 2Quad requires careful use of HuggingFace source code (§A.2).

Baselines. We identify three important properties during Transformer inference in §1: speed, performance, and privacy. In our workflow, privacy is guaranteed by using MPC systems. Thus, we evaluate MPCFORMER on the other two aspects: speed and performance. Concretely, S shall run faster than T while matching the performance of T. Since there is a limited amount of work on Transformer inference in MPC, we seek a baseline from the CNN literature. In particular, we choose the training strategy in (Chou et al., 2018) and denote it as MPCFORMER w/o{d}. MPCFORMER w/o{d} also constructs the approximated model S′ but trains S′ on D with the task-specific objective, i.e., without distillation. We note that S′ is initialized with weights in T, i.e., weights trained with different functions, whose effect has not been studied. We thus propose a second baseline, MPCFORMER w/o{p,d}, which trains S′ on D without distillation and with random weight initialization. Below we compare the performance of MPCFORMER with MPCFORMER w/o{p,d} and MPCFORMER w/o{d} at the same speedups. We denote the output models of our framework with BERT BASE, Roberta-base, and BERT LARGE as MPCBert-B, MPCRoberta-B, and MPCBert-L for short.

5.1. COMPARISON WITH BASELINES ON DIFFERENT BENCHMARKS

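Settings. In this section, we evaluate our MPCFORMER framework with different approximations and compare it with baselines on the IMDb dataset and the GLUE benchmark (Maas et al., 2011; Wang et al., 2018). For all experiments in this section, we use BERT BASE as the base model. According to the dataset statistics, we use a sequence length of 512 for the IMDb dataset and a sequence length of 128 for GLUE datasets. We note that a longer sequence length generally yields a higher speedup because the Softmax functions can be sped up more. Baselines are trained with learning rates tuned from {1e-6, 5e-6, 1e-5, 1e-4}, the number of epochs from {10, 30, 100}, batch size 32 for IMDb, and batch sizes 64 and 256 for GLUE. MPCBert-B is trained with learning rate 5e-5 for embedding and Transformer layer distillation, and 1e-5 for prediction layer distillation. Further details on hyper-parameter tuning can be found in A.4.

We show the accuracy and speedup on the IMDb dataset in Table 3. MPCBert-B achieves a 5.26× speedup with the "Quad+2Quad" approximation with almost no accuracy drop. In addition, we note that this holds not only for the fastest "Quad+2Quad" approximation but also for the other approximations, whose speedups range up to 2.65×. Baselines have inferior performance for all approximations. For example, both baselines have an accuracy drop of at least 6.8% with 5.26× speedup. Interestingly, MPCBert-B w/o{d} has a moderate accuracy drop for the "GeLU+2ReLU" approximation with 1.76× speedup. However, it does not preserve accuracy with other approximations. In contrast, MPCFORMER consistently preserves accuracy with various kinds of approximations.

To validate the observation with more datasets, we evaluate MPCBert-B on the GLUE benchmark (8 datasets) (Wang et al., 2018). As shown in Table 4, MPCBert-B achieves 1.93× speedup with 98% performance of BERT BASE on the GLUE benchmark, and 2.2× speedup with 97% performance of BERT BASE. Both baselines introduce a severe performance drop, i.e., a 19.5 average score drop for MPCBert-B w/o{d} and a 26.2 average score drop for MPCBert-B w/o{p,d} with 1.93× speedup. Interestingly, we observe that the baseline MPCBert-B w/o{d} consistently outperforms MPCBert-B w/o{p,d}. This indicates that using weights in T as initialization benefits the training of S′ with a task-specific objective, using a 12-layer Transformer backbone.

Additionally, we evaluate MPCFORMER with more approximations using a subset of the GLUE benchmark. Results are shown in the right part of Table 3. We observe similar patterns as in the IMDb dataset. The baseline MPCBert-B w/o{d} performs well with the "GeLU+2Quad" and "GeLU+2ReLU" approximations, but MPCFORMER achieves high ML performance consistently under all approximations.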

5.2. MORE COMPARISONS WITH DIFFERENT MODELS

We evaluate MPCFORMER with trained Transformer models other than BERT BASE, i.e., the ROBERTA BASE model (12 layers) (Liu et al., 2019), and in particular a larger BERT LARGE model (24 layers). Results are shown in Table 5. MPCRoberta-B preserves 98% of the average score of the input model ROBERTA BASE and outperforms baselines by large margins. Comparing the performance of baselines, we again observe that initialization with weights in T helps the training of S′, i.e., MPCRoberta-B w/o{d} performs better than MPCRoberta-B w/o{p,d}.

Table 5: The performance on a subset of the GLUE benchmark with Roberta-base backbone (denoted as "MPCRoberta-B"). MPCRoberta-B and baselines use "Quad+2Quad" approximations with 2.1× speedup. MPCBert-L and baselines use "Quad+2ReLU" approximations with 2.0× speedup.

In the ablation study, the first question we study is the effect of different student initializations. This is of interest because it is not generally known whether initializing with weights in T still benefits the training of S′ after aggressive approximations. We design an experiment with random initialization (denoted as MPCBert-B r in the table). We train MPCBert-B r with 10× more epochs than MPCBert-B and confirm that its distillation objective has converged. We observe that on larger datasets (QNLI and SST-2), the gap between the two initializations is small, but on smaller datasets (STS-B, MRPC, and RTE), initializing with the weights in T is better.

The second question we study is the effect of the number of training examples in the distillation stage. We perform studies on two small (RTE, MRPC) and two medium (SST-2, QNLI) datasets in the GLUE benchmark (Figure 5). We find that roughly 5% of the small datasets and 2% of the medium datasets provide enough data to learn a good S. This shows that KD in our setting is efficient enough to learn a good S in downstream tasks (using GLUE datasets as representatives).

6. CONCLUSION

In this paper, we propose a framework to achieve fast and performant private Transformer model inference with MPC. Evaluations show that it is compatible with various MPC-friendly approximations and trained Transformer models. We suggest two directions of interest: (1) theoretical or empirical analysis on more MPC systems, and (2) extension of the problem formulation to allow a smaller size of S.

In addition, we provide a more complete breakdown in terms of communication and computation load for Figure 2, for a holistic view of the execution pattern in the MPC setting.

We note that implementing "2Quad" to replace the softmax requires attention to the effect of the masking operation. For example, the default implementation by Huggingface (Wolf et al., 2019) would result in an exploding problem due to masking. Therefore, we need a different implementation of masking, which we describe in detail below. The default attention implementation by Huggingface is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M_{\{0,-\infty\}}\right)V = \frac{e^{\frac{QK^T}{\sqrt{d_k}} + M_{\{0,-\infty\}}}}{\sum_{j=1}^{K} e^{\left(\frac{QK^T}{\sqrt{d_k}} + M_{\{0,-\infty\}}\right)_j}} V.$$

If we directly replace $e^x$ with $(x + c)^2$ as in the 2Quad approximation, where $x = \frac{QK^T}{\sqrt{d_k}} + M_{\{0,-\infty\}}$, the value $(x + c)^2$ will explode at the masked positions, causing a problem in the forward pass. To solve this problem, we can simply change the implementation of masking from "adding zero or negative infinity in the exponent" to "multiplying the exponential function by one or zero". That is,

$$\mathrm{Attention}(Q, K, V) = \frac{e^{\frac{QK^T}{\sqrt{d_k}}} \odot M_{\{1,0\}}}{\sum_{j=1}^{K} \left(e^{\frac{QK^T}{\sqrt{d_k}}}\right)_j \odot M_{\{1,0\}}} V \;\rightarrow\; \frac{\left(\frac{QK^T}{\sqrt{d_k}} + c\right)^2 \odot M_{\{1,0\}}}{\sum_{j=1}^{K} \left(\frac{QK^T}{\sqrt{d_k}} + c\right)^2_j \odot M_{\{1,0\}}} V.$$

This is just a different implementation of the same masking purpose, but it avoids exploding at the masked positions. In our experiments, we empirically tried c = 5 and it worked well, indicating that the choice of the constant c can be flexible.
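The difference between the two masking styles can be checked on a single row of attention scores. The sketch below is an illustration in plain Python, not the Huggingface code: `scores` stands for one row of $QK^T/\sqrt{d_k}$, `mask` uses 1 for visible and 0 for masked positions, and both function names are hypothetical.

```python
import math


def additive_mask_two_quad(scores, mask, c=5.0):
    """Naive port of the additive {0, -inf} mask into (x + c)^2.

    Masked positions become (-inf + c)^2 = inf instead of 0.
    """
    shifted = [s + (0.0 if m else -math.inf) for s, m in zip(scores, mask)]
    return [(x + c) ** 2 for x in shifted]


def masked_two_quad(scores, mask, c=5.0):
    """2Quad with a multiplicative {1, 0} mask.

    Masked positions contribute exactly zero to numerator and denominator.
    """
    numer = [((s + c) ** 2) * m for s, m in zip(scores, mask)]
    total = sum(numer)
    return [n / total for n in numer]
```

For example, with `scores = [1.0, 2.0, 3.0]` and `mask = [1, 1, 0]`, the additive version yields an infinite entry at the masked position, while the multiplicative version gives finite weights that sum to one and assign zero weight to the masked position.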

A.3 ROBUSTNESS OF THE STUDENT MODEL

Some approximations may increase the local Lipschitz constants, which decreases robustness. We applied several empirical text adversarial attacks to evaluate the adversarial robustness of the BERT-base model before and after approximations (Zeng et al., 2020). As shown in Table 9, the student model has a moderate increase in attack success rate (ASR) under the three score-based attacks, but a lower ASR under the gradient-based attack HotFlip. Considering these results, the effect of the approximations on robustness is empirically moderate.

For baselines, we study the effect of hyper-parameters by running a grid search on the STS-B dataset (we select STS-B because it is a regression task, where performance varies in a large range), with learning rate from [1e-6, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4], batch size from [256, 128, 64, 32, 16], and epochs from [3, 10, 30, 50, 80, 100, 200]. We show the grid search results with BERT BASE in Figures 6 and 7, and a smaller grid search for BERT LARGE and ROBERTA BASE in Figures 8 and 9. We empirically discover that learning rates from 1e-6, 5e-6, 1e-5, batch sizes from 64 and 256, and epochs from 10, 100 give good performance. To let baselines explore more hyper-parameters, we use learning rate from [1e-6, 5e-6, 1e-5, 1e-4], batch size from [64, 256], and epochs from [10, 30, 100] for all GLUE datasets. Since we use sequence length 512 for the IMDb dataset, we use batch size 32 to fit into our 16GB Tesla V100 GPU. We also empirically discover that (1) MPCBert-B w/o{d} (best 0.43) cannot scale up when the base model scales to BERT LARGE, i.e., MPCBert-L w/o{d} (best 0.08); (2) baselines benefit from using the pre-trained weights, i.e., MPCBert-B w/o{d} (best 0.42) performs better than MPCBert-B w/o{p,d} (best 0.23); (3) MPCFORMER w/o{d} benefits when the base model becomes better, i.e., MPCRoberta-B w/o{d} (best 0.62) performs better than MPCBert-B w/o{d} (best 0.42).

For MPCFORMER, we decide the number of epochs for embedding and Transformer layer distillation according to the MSE loss, and we use 5 epochs for prediction layer distillation, batch size 8 for small datasets (CoLA, MRPC, RTE), and batch size 32 for larger ones (MNLI, QQP, SST2, STS-B). We minimize the hyper-parameter tuning for MPCFORMER, since we would like the reported performance to reflect what future users of MPCFORMER, who may prefer not to tune hyper-parameters, can expect. Specifically, for the embedding and Transformer layer distillation stage we use 5 epochs for MNLI, 5 epochs for QQP, 10 epochs for QNLI, 10 epochs for SST-2, 20 epochs for MRPC, 30 epochs for IMDb, 50 epochs for STS-B, 50 epochs for CoLA, and 50 epochs for RTE.



For example, through oblivious transfer (Keller et al., 2016) or homomorphic encryption (Paillier, 1999).

We only consider routines that take two secret share numbers for the ease of illustration.

We provide more details on implementing routines and functions in MPC at A.1.

x ∈ Z/QZ is required for privacy protocols, where Z/QZ is a ring with Q elements.

We select STS-B because it is a regression task, where performance varies in a large range.



Transformer models. Transformer models have achieved great success in language understanding Yang et al. (2019b); Lan et al. (2019); Raffel et al. (2020); Clark et al. (2020), vision understanding Dosovitskiy et al. (2020); Liu et al. (2021); Radford et al. (

Figure 3: The overview of the MPCFORMER framework. The first stage uses MPC-friendly approximations and T to construct a faster Transformer architecture S ′ . The second stage uses Knowledge Distillation on S ′ to learn a performant and fast Transformer model S.

Figure 4: Running time comparison of different approximations in §4.2.1. Blue areas represent the total running time. Orange areas represent the communication time. MPC-friendly approximations greatly reduce both the communication time and the total time. In particular, our proposed "2Quad" is much faster than the original Softmax function and the previous "2ReLU" approximation.

Figure 5: Ratio of training examples versus performance (normalized by performance with ratio = 1.0) on MRPC, RTE, QNLI, and SST-2.

Figure 6: Grid search results for MPCBert-B w/o{p,d} on STS-B dataset. X-axis is number of epochs, and Y-axis is correlation (unscaled).

Figure 7: Grid search results for MPCBert-B w/o{d} on STS-B dataset. X-axis is number of epochs, and Y-axis is correlation (unscaled).

Figure 9: Grid search results for MPCRoberta-B w/o{d} on STS-B dataset. X-axis is number of epochs, and Y-axis is correlation (unscaled).

; this intermediate supervision can alleviate the gradient explosion

Table 3: Performance and speedup on the IMDb dataset and a part of the GLUE benchmark (QNLI, CoLA and RTE) with different approximations and BERT BASE as the backbone. The input model T is denoted with "*". "p" stands for using weights in T as initialization, "d" stands for applying knowledge distillation with T as the teacher.


Table 4: Performance on the GLUE benchmark with BERT BASE as the backbone. F1 score is reported for QQP and MRPC. Average Pearson and Spearman correlation is reported for STS-B. Matthews correlation is reported for CoLA. Accuracy is reported for other datasets.

24× to 2.65× speedups. Baselines have inferior performance for all approximations. For example, both baselines have an accuracy drop of at least 6.8% with 5.26× speedup. Interestingly, MPCBert-B w/o{d} has a moderate accuracy drop for the "GeLU+2ReLU" approximation with 1.76× speedup. However, it does not preserve accuracy with other approximations. In contrast, MPCFORMER consistently preserves accuracy with various kinds of approximations. To validate the observation with more datasets, we evaluate MPCBert-B on the GLUE benchmark (8 datasets) (Wang et al., 2018). As shown in Table 4, MPCBert-B achieves 1.93× speedup with 98% performance of BERT BASE on the GLUE benchmark, and 2.2× speedup with 97% performance of BERT BASE. Both baselines introduce a severe performance drop, i.e., a 19.5 average score drop for MPCBert-B w/o{d} and a 26.2 average score drop for MPCBert-B w/o{p,d} with 1.93× speedup. Interestingly, we observe that the baseline MPCBert-B w/o{d} consistently outperforms MPCBert-B w/o{p,d}.

To show that our method can scale to different model sizes, we evaluate MPCFORMER with a larger model, BERT LARGE (24-layer). On the IMDb dataset, BERT BASE achieves 95.0% accuracy. MPCBert-L achieves 5.9× speedup with 94.5% accuracy, while the baseline MPCBert-L w/o {p,d} achieves 87.1% accuracy and MPCBert-L w/o {d} achieves 50.0% accuracy. We further select four datasets from the GLUE benchmark where Bert-L* noticeably outperforms Bert-B* (MNLI, QNLI, CoLA, and RTE). Compared with Bert-B* in Table 4, Bert-L* increases the average score from 82.8 to 85.3. MPCFORMER increases the average score from 81.5 to 84.6. This indicates that MPCFORMER can scale with the size of T. On the other hand, the baselines do not scale with the input model: MPCBert-L w/o {p,d} decreases the average score from 60.1 to 59.3; MPCBert-L w/o {d} decreases the average score from 68.5 to 43.5. In particular, we observe that initializing S′ with weights in T without distillation harms performance when the model is larger.

Student model initialized with weights in T versus with random weights. Result for MPCBert-B r is tuned with embedding and Transformer layer distillation learning rates from 5e-5 and 3e-5. Results for both are obtained with the "Quad+2Quad" approximation.

LIMITATION AND FUTURE DIRECTION

We recognize two limitations in our paper. First, our speedups and performance are tested on a single MPC system. We leave theoretical analysis or empirical study on more MPC systems as future work. Second, in our design, T and S only differ by functions, i.e., they have the same model size. We leave extension to a smaller student model as future work.

Functions computation versus communication breakdown (Unit: seconds).

In particular, computation takes only 21% of the running time, while communication takes 79%. Consequently, the number of floating point operations (FLOP), a popular estimator for the running time in plain-text Transformer inference, is no longer accurate in the MPC setting.

The MPC system we use in the paper implements intermediate communication with All-reduce, so both parties have the same communication load. They also have a similar computation load (see the multiplication example, where both parties compute the same function locally with their own secret shares). Thus, the time breakdown is similar for both parties. In the above table, we report the statistics from the model provider.

Sanity accuracy (SA) and attack success rate (ASR) against various text attacks. The TextFooler, PWWS, and BERT-ATTACK are score-based attacks, and HotFlip is a gradient-based attack. For ASR, lower is better.

ACKNOWLEDGEMENT

The authors thank Qirong Ho for allocating the computing resources. This research was supported by NSF IIS1563887, NSF CCF1629559, NSF IIS1617583, NGA HM04762010002, NSF IIS1955532, NSF CNS2008248, NSF IIS2123952, and NSF BCS2040381.

A APPENDIX

A.1 A CONCRETE SYSTEM IMPLEMENTATION OF MPC: CRYPTEN

In this section, we describe in detail how a concrete MPC system (CrypTen) implements routines and functions for Transformer models (Knott et al., 2021). We provide a portion of the details here to help describe the complexity of Transformer inference in MPC. A more complete system overview and privacy proof are available in the CrypTen paper.

Threat model. CrypTen follows Evans et al. (2018) in assuming that parties are semi-honest. Under this assumption, parties are honest in that they follow the system protocols. However, each party is also curious (i.e., semi-honest), meaning it will try to infer information about others' data based on the values it receives.

Secret shares. CrypTen uses secret shares to implement private computation. A floating point value x_f is first scaled to an integer x, then secretly shared with both parties. Secret shares are of type arithmetic or binary. The arithmetic secret shares [x] = {[x]_1, [x]_2} satisfy x = ([x]_1 + [x]_2) mod Q, where Q is the ring size, the first party holds [x]_1, and the second holds [x]_2. They are constructed with a pair of zero-sum random maskings (Cramer et al., 2005). Binary shares ⟨x⟩ are formed by arithmetic secret shares of bits in x, so that the bitwise xor ⟨x⟩_1 ⊕ ⟨x⟩_2 = x.

We study the standard setting where each tensor is represented in 64 bits (i.e., L = 64). Each multiplication requires one round of communication for revealing the intermediate values ϵ, δ. Each conversion from [x] to ⟨x⟩ requires log_2 L = 6 rounds of communication for the adder circuit; each conversion from ⟨x⟩ to [x] requires one round for generating [⟨x⟩^(b)]. Thus, each comparison requires 7 rounds of communication. Each max(•) between N elements requires O(log_2 N) rounds of communication, assuming a tree-reduction algorithm.
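The multiplication and the round counts above can be sketched in plain Python. This is a minimal illustration, not CrypTen's implementation: it assumes the standard Beaver-triple construction (which the text does not name), a trusted dealer that generates the triple, a ring of size Q = 2^L, and non-negative inputs. All function names (`share`, `reconstruct`, `beaver_multiply`, `comparison_rounds`, `max_tree_depth`) are hypothetical.

```python
import math
import secrets

L = 64          # bit width per value, as in the text
Q = 2 ** L      # assumed ring size: arithmetic is done modulo 2^L


def share(x):
    """Split integer x into two additive shares with x = (s1 + s2) mod Q."""
    s1 = secrets.randbelow(Q)
    return s1, (x - s1) % Q


def reconstruct(s1, s2):
    """Recombine two additive shares into the underlying integer."""
    return (s1 + s2) % Q


def beaver_multiply(x_sh, y_sh):
    """Multiply two shared values using one round of communication."""
    # Assumption: a trusted dealer provides shares of a random triple c = a * b.
    a, b = secrets.randbelow(Q), secrets.randbelow(Q)
    a_sh, b_sh, c_sh = share(a), share(b), share(a * b % Q)
    # The single communication round: both parties reveal their shares of
    # eps = x - a and delta = y - b, which leak nothing about x and y.
    eps = reconstruct(*[(x_sh[p] - a_sh[p]) % Q for p in range(2)])
    delta = reconstruct(*[(y_sh[p] - b_sh[p]) % Q for p in range(2)])
    # Local step: [xy] = [c] + eps * [b] + delta * [a] + eps * delta,
    # where the public term eps * delta is added by one party only.
    z = [(c_sh[p] + eps * b_sh[p] + delta * a_sh[p]) % Q for p in range(2)]
    z[0] = (z[0] + eps * delta) % Q
    return tuple(z)


def comparison_rounds(bits=L):
    """Rounds for one comparison: [x] -> <x> conversion plus <x> -> [x]."""
    arithmetic_to_binary = int(math.log2(bits))  # adder circuit, log2(L) rounds
    binary_to_arithmetic = 1                     # one round for the extracted bit
    return arithmetic_to_binary + binary_to_arithmetic


def max_tree_depth(n):
    """Number of sequential levels in a tree reduction over n elements."""
    return (n - 1).bit_length()
```

For instance, `reconstruct(*beaver_multiply(share(6), share(7)))` recovers the product of the two inputs, and `comparison_rounds()` reproduces the 6 + 1 round count for L = 64. Each level of the tree reduction costs a constant number of rounds, which gives the O(log_2 N) bound for max(•).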

Routines and functions

We provide a simple addition example here. The scaling factor and ring size Q are set to small for ease of understanding. 
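One way to sketch addition on arithmetic secret shares in Python is shown below. The scaling factor of 4 and the ring size Q = 256 are illustrative assumptions chosen for readability, and the helper names (`encode`, `decode`, `share`, `add_shares`) are hypothetical rather than CrypTen's API. Addition needs no communication: each party adds its own shares locally.

```python
import secrets

SCALE = 4   # assumed small scaling factor (fixed-point precision of 1/4)
Q = 256     # assumed small ring size


def encode(x_f):
    """Scale a float to an integer in the ring [0, Q)."""
    return round(x_f * SCALE) % Q


def decode(x):
    """Map a ring element back to a float, treating the upper half as negative."""
    if x >= Q // 2:
        x -= Q
    return x / SCALE


def share(x):
    """Split integer x into two additive shares with x = (s1 + s2) mod Q."""
    s1 = secrets.randbelow(Q)
    return s1, (x - s1) % Q


def add_shares(x_sh, y_sh):
    """Each party p adds its own shares locally; no communication is needed."""
    return tuple((x_sh[p] + y_sh[p]) % Q for p in range(2))


# Both inputs are encoded, shared, added share-wise, and only then revealed.
x_sh = share(encode(1.5))
y_sh = share(encode(2.25))
z_sh = add_shares(x_sh, y_sh)
result = decode((z_sh[0] + z_sh[1]) % Q)
```

With these values, 1.5 and 2.25 are encoded as 6 and 9, and the revealed sum 15 decodes to 3.75. Neither party sees the other's share of the inputs at any point.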

