SAFENET: A SECURE, ACCURATE AND FAST NEURAL NETWORK INFERENCE

Abstract

The advances in neural networks have driven many companies to provide prediction services to users in a wide range of applications. However, current prediction systems raise privacy concerns regarding the users' private data. A cryptographic neural network inference service is an efficient way for two parties to execute neural network inference without revealing either party's data or model. Nevertheless, existing cryptographic neural network inference services suffer from enormous running latency; in particular, the latency of the communication-expensive cryptographic activation function is 3 orders of magnitude higher than that of a plaintext-domain activation function, and activations are necessary components of modern neural networks. Slow cryptographic activations have therefore become the primary obstacle to efficient cryptographic inference. In this paper, we propose a new technique, called SAFENet, to enable a Secure, Accurate and Fast nEural Network inference service. To speed up secure inference while guaranteeing inference accuracy, SAFENet includes channel-wise activation approximation with multiple-degree options: it keeps the most useful activation channels and replaces the remaining, less useful channels with various-degree polynomials. SAFENet also supports mixed-precision activation approximation by automatically assigning different replacement ratios to different layers, further increasing the approximation ratio and reducing inference latency. Our experimental results show that SAFENet obtains state-of-the-art inference latency and performance, reducing latency by 38% ∼ 61% or improving accuracy by 1.8% ∼ 4% over prior techniques on various encrypted datasets.

1. INTRODUCTION

Neural network inference as a service (NNaaS) is an effective way for users to obtain intelligent services from powerful servers. NNaaS powers many emerging client-server applications such as smart speakers, voice assistants, and image classification Mishra et al. (2020). However, to obtain these services, clients must upload their raw data to the model holder. The model holder on the server can access and process the clients' confidential data and observe the raw inference results, which potentially violates the clients' privacy. There is therefore an urgent need to keep users' financial records, healthcare data, and other sensitive information confidential during NNaaS. Modern cryptography, such as Homomorphic Encryption (HE) by Gentry et al. (2009) and Multi-Party Computation (MPC) by Yao (1982), enables secure inference services that protect the user's private data. During secure inference, the provider's model is not released to any user, and the user's private data is protected by HE or MPC. CryptoNets, proposed by Gilad-Bachrach et al. (2016), is the first HE-based secure neural network inference system on encrypted data; however, its practicality is limited by enormous computational overhead. For example, CryptoNets takes ∼ 298 seconds to perform one secure MNIST image inference on a powerful server, a latency 6 orders of magnitude higher than unencrypted inference. MiniONN by Liu et al. (2017) and Gazelle by Juvekar et al. (2018) show that a hybrid of HE and MPC makes low-latency secure inference possible. Although Gazelle reduces the MNIST inference latency of CryptoNets to ∼ 0.3 seconds, it is still far from practical on larger datasets such as CIFAR-10 and CIFAR-100 due to its heavy HE encryption protocol and expensive operations. For instance, Gazelle requires ∼ 240 seconds of latency and ∼ 8.3 GB of communication to perform ResNet-32 inference on the CIFAR-100 dataset. NASS by Bian et al. (2020), CONAD by Shafran et al. (2019) and CryptoNAS by Ghodsi et al. (2020) design cryptography-friendly neural network architectures, but they still suffer from the heavy encryption protocol in the online phase. Delphi by Mishra et al. (2020) significantly reduces online latency by moving most of the heavy cryptographic computation into an offline phase that can be pre-processed in advance. Nevertheless, the state-of-the-art cryptographic inference service Delphi by Mishra et al. (2020) still suffers from enormous online latency, because a large communication overhead between the user and the service provider is required to support cryptographic ReLU activations. Our experiments show that this communication overhead is proportional to the number of ReLU units in the network. Delphi attempts to reduce inference latency by replacing expensive ReLU units with cheap polynomial approximations. Unfortunately, most ReLU units are difficult to substitute without a loss of accuracy: accuracy drops sharply as more ReLU units are approximated by polynomials. Specifically, Delphi replaces only ∼ 42% of the ReLU units in a CNN-7 network (detailed in Section 6.2 of MiniONN by Liu et al. (2017)) and ∼ 20% of the ReLU units in a ResNet-32 network with < 1% accuracy loss. When Delphi approximates more ReLU units, > 3% inference accuracy is lost compared to an all-ReLU model. Under this accuracy constraint, the non-linear layers still occupy roughly 62% to 74% of the total latency in the CNN-7 and ResNet-32 networks. Slow non-linear layers therefore remain the obstacle to fast and accurate secure inference.

Our contribution. One key observation is that the layer-wise activation approximation strategy in Delphi is too coarse-grained to replace the bottleneck layers in which most ReLU units are located; e.g., the first layer of CNN-7 contains > 58% of the ReLU units. The channels in such bottleneck layers are difficult to replace entirely without more than a small accuracy loss. To meet accuracy constraints and speed up secure inference, SAFENet uses a more fine-grained channel-wise activation approximation that keeps the most useful activation channels within each layer and replaces the remaining, less important channels with polynomials. In this way, only part of the channels in each layer is approximated, which makes even bottleneck layers approximation-friendly.

Another contribution of SAFENet is automatic multiple-degree polynomial exploration in each layer, whereas prior works use only degree-2 polynomials. Additionally, SAFENet enables mixed-precision activation approximation by assigning different approximation ratios to different layers, which replaces even more ReLU units with cheap polynomials. Our results show that, under the same accuracy constraints, SAFENet obtains state-of-the-art inference latency, reducing latency by 38% ∼ 61%, or improving accuracy by 1.8% ∼ 4% over prior techniques.

2. BACKGROUND AND RELATED WORK

Threat Model and Cryptographic Primitives. Our threat model is the same as in prior work, Delphi by Mishra et al. (2020). Specifically, we consider the service holder a semi-honest cloud that attempts to infer the clients' input information but follows the protocol. The server holds the Convolutional Neural Network (CNN) model and the client holds the input to the network. For linear computations, the client encrypts its input with a HE scheme by Mishra et al. (2020) and sends it to the server; the server returns the encrypted output to the client, who decrypts and decodes it. As in Delphi, secret sharing (SS) protects the privacy of intermediate results in the hidden layers, garbled circuits (GC) guarantee data privacy in the activation layers, and SS is used to securely combine HE and GC. Besides GC, Beaver's multiplication triples (BT), proposed by Beaver (1995), are used to implement approximated activations as secure polynomials. A BT-based polynomial approximation of ReLU is on average 3 orders of magnitude cheaper than a GC-based ReLU unit, so it is used to construct the approximated secure activation function. At the end of secure inference, the server has learned nothing, while the client learns only the inference result. More details on the cryptographic primitives can be found in Appendix A.1.

2.1. CRYPTOGRAPHIC INFERENCE

Modern neural networks usually consist of linear convolution layers and non-linear activation layers. As Figure 1a shows, the current state-of-the-art cryptographic inference service, Delphi by Mishra et al. (2020),

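To make the BT-based polynomial evaluation from Section 2 concrete, the following is a minimal plaintext simulation of Beaver's multiplication-triple protocol over additive secret shares. It is a sketch, not SAFENet's implementation: the field modulus `P`, the two-party setting, and the trusted in-process triple generation (which Delphi-style systems would precompute cryptographically in the offline phase) are all simplifying assumptions made for illustration.

```python
import random

P = 2**31 - 1  # illustrative prime modulus; real systems use different parameters

def share(x):
    """Additively secret-share x into two shares summing to x mod P."""
    r = random.randrange(P)
    return (r, (x - r) % P)

def beaver_mul(x_sh, y_sh):
    """Securely multiply secret-shared x and y using one Beaver triple.

    The offline phase supplies shares of a random triple (a, b, c) with
    c = a * b; here a trusted dealer is simulated in-process.
    """
    a, b = random.randrange(P), random.randrange(P)
    a_sh, b_sh, c_sh = share(a), share(b), share((a * b) % P)

    # Online phase: the masked differences d = x - a and e = y - b are opened.
    d = ((x_sh[0] - a_sh[0]) + (x_sh[1] - a_sh[1])) % P
    e = ((y_sh[0] - b_sh[0]) + (y_sh[1] - b_sh[1])) % P

    # Each party locally computes its share of z = x*y:
    # z = c + d*b + e*a + d*e  (the d*e term is added by one party only).
    z0 = (c_sh[0] + d * b_sh[0] + e * a_sh[0] + d * e) % P
    z1 = (c_sh[1] + d * b_sh[1] + e * a_sh[1]) % P
    return (z0, z1)

def reconstruct(sh):
    return sum(sh) % P

# Evaluating a degree-2 polynomial w2*x^2 + w1*x + w0 on a shared input
# costs one triple (for x*x); the linear part is computed locally.
x = 1234
x_sh = share(x)
x2_sh = beaver_mul(x_sh, x_sh)
w2, w1, w0 = 3, 5, 7
q_sh = ((w2 * x2_sh[0] + w1 * x_sh[0] + w0) % P,
        (w2 * x2_sh[1] + w1 * x_sh[1]) % P)
assert reconstruct(q_sh) == (w2 * x * x + w1 * x + w0) % P
```

This is why polynomial activations are cheap in the online phase: each multiplication needs only one round of opening two field elements, versus the large garbled-circuit traffic of an exact ReLU.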

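The channel-wise activation approximation described in Section 1 can be sketched in plaintext NumPy as follows. This is a hypothetical illustration, not SAFENet's actual code: the function name, the per-channel keep mask, and the coefficient layout are assumptions, and SAFENet's criterion for choosing which channels to keep (and how polynomial coefficients are trained) is not shown here.

```python
import numpy as np

def channelwise_approx_act(x, keep_mask, coeffs):
    """Sketch of channel-wise mixed activation.

    x:         activations of shape (N, C, H, W)
    keep_mask: boolean array of shape (C,); True = keep the exact ReLU
    coeffs:    per-channel polynomial coefficients, lowest degree first,
               e.g. (c0, c1, c2) for c0 + c1*x + c2*x^2 (degree is per channel)
    """
    out = np.empty_like(x)
    for c in range(x.shape[1]):
        ch = x[:, c]
        if keep_mask[c]:
            # Kept channel: exact ReLU (the expensive GC-based unit when secure).
            out[:, c] = np.maximum(ch, 0.0)
        else:
            # Replaced channel: polynomial via Horner's rule; in the secure
            # setting each multiplication costs one cheap Beaver triple.
            poly = np.zeros_like(ch)
            for a in reversed(coeffs[c]):
                poly = poly * ch + a
            out[:, c] = poly
    return out

# Example: keep ReLU on channel 0, use degree-2 / degree-1 polynomials elsewhere.
x = np.random.randn(2, 3, 4, 4)
mask = np.array([True, False, False])
coeffs = [None, (0.0, 0.5, 0.25), (0.0, 1.0)]
y = channelwise_approx_act(x, mask, coeffs)
assert np.allclose(y[:, 0], np.maximum(x[:, 0], 0.0))
assert np.allclose(y[:, 2], x[:, 2])  # (0.0, 1.0) is the identity polynomial
```

Because only the channels with `keep_mask[c] == True` pay the garbled-circuit cost, a layer-wise replacement ratio becomes a per-layer count of kept channels, which is what allows bottleneck layers to be partially rather than fully approximated.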