MPCFORMER: FAST, PERFORMANT AND PRIVATE TRANSFORMER INFERENCE WITH MPC

Abstract

Enabling private inference is crucial for many cloud inference services that are based on Transformer models. However, existing private inference solutions can increase the inference latency by more than 60× or significantly compromise the inference quality. In this paper, we design the framework MPCFORMER as a practical solution, using Secure Multi-Party Computation (MPC) and Knowledge Distillation (KD). Through extensive evaluations, we show that MPCFORMER significantly speeds up Transformer inference in MPC settings while achieving ML performance similar to that of the input model. On the IMDb dataset, it achieves performance similar to BERT-Base while being 5.3× faster. On the GLUE benchmark, it achieves 97% of BERT-Base's performance with a 2.2× speedup. MPCFORMER remains effective with different trained Transformer weights such as RoBERTa-Base and with larger models such as BERT-Large. Code is available at https://github.com/MccRee177/MPCFormer.

1. INTRODUCTION

Pre-trained Transformer models can be easily fine-tuned on various downstream tasks with high performance and have been widely deployed as model inference services (Bommasani et al., 2021; Feng et al., 2020; Yang et al., 2019b; Clark et al., 2020). However, these model inference services can pose privacy concerns. For instance, GitHub Copilot, a code-generation engine adapted from pre-trained GPT weights, requires either that users reveal their code prompts to the service provider, or that the service provider release Copilot's trained weights, which are business proprietary, to users (Chen et al., 2021; Brown et al., 2020).

Secure Multi-Party Computation (MPC) offers a promising solution by keeping both the data and the model weights private during inference (Evans et al., 2018). However, vanilla Transformer inference in MPC is unacceptably slow. For instance, BERT-Base inference takes under 1 second without MPC, but roughly 60 seconds with MPC (Figure 2). An intuitive way to accelerate MPC inference is to replace expensive computational operations with faster approximations and retrain the approximated model, an approach that has been adopted for convolutional neural networks (CNNs) (Chou et al., 2018). Unfortunately, adapting this solution to Transformers drastically decreases the model's performance (§ 5).

In this paper, we take the first step toward privacy-preserving Transformer model inference in MPC that remains fast and performant. We take inspiration from the approximation approach and attribute its performance degradation to two challenges. First, many MPC-friendly approximations make model training harder. For example, quadratic functions cause the gradient explosion problem in deep neural networks (Mishra et al., 2020). Second, the downstream datasets used for Transformer fine-tuning usually contain insufficient data to retrain an approximated Transformer with common task objectives (Zhang & Sabuncu, 2018; Hinton et al., 2012).
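To make the approximation idea concrete, the sketch below contrasts the exact GELU activation with a quadratic stand-in built only from additions and multiplications, which are the cheap operations under MPC secret sharing (the `erf` inside exact GELU would require an expensive protocol). The specific coefficients here are illustrative, not the ones used by any particular framework:

```python
import math

def gelu(x):
    # Exact GELU via the Gaussian CDF; erf is expensive to evaluate in MPC.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def quad_gelu(x):
    # Hypothetical MPC-friendly quadratic approximation: only additions and
    # multiplications, both cheap under secret sharing. Coefficients are
    # illustrative and would normally be fit over the input range seen in
    # practice.
    return 0.125 * x * x + 0.25 * x + 0.5

# Compare the two on a few points to see the (loose) fit.
for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(f"x={x:+.1f}  gelu={gelu(x):+.3f}  quad={quad_gelu(x):+.3f}")
```

A model whose activations are swapped this way is much faster in MPC but no longer computes the function it was trained on, which is why the approximated model must be retrained (or, as this paper argues, distilled) to recover accuracy.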
To address these two challenges, we resort to the knowledge distillation (KD) framework. KD can ease model training by matching intermediate representations between the teacher and the student model (Romero et al., 2014); this intermediate supervision can alleviate the gradient explosion

