SPENCNN: ORCHESTRATING ENCODING AND SPARSITY FOR FAST HOMOMORPHICALLY ENCRYPTED NEURAL NETWORK INFERENCE

Abstract

Homomorphic Encryption (HE) is a promising technology for protecting users' data privacy in Machine Learning as a Service (MLaaS) on public clouds. However, the computational overhead of HE operations, which can be orders of magnitude slower than their plaintext counterparts, can lead to extremely high neural network inference latency, seriously hindering HE's application in practice. While extensive neural network optimization techniques, such as sparsification and pruning, have been proposed for the plaintext domain, they cannot address this problem effectively. In this paper, we propose an HE-based CNN inference framework, SpENCNN, that effectively exploits the single-instruction-multiple-data (SIMD) feature of HE schemes to improve CNN inference latency. In particular, we first develop an HE-group convolution technique that partitions channels into groups based on the data size and ciphertext size, and then encodes them into the same ciphertext in an interleaved manner, so as to dramatically reduce the bottleneck operations in HE convolution. We further develop a sub-block weight pruning technique that removes additional costly HE operations from CNN convolutions. Our experimental results show that SpENCNN-optimized CNN models achieve overall speedups of 8.37x, 12.11x, and 19.26x for LeNet, VGG-5, and HEFNet, respectively, with negligible accuracy loss.

1. INTRODUCTION

Over the past decade, we have witnessed tremendous progress in machine-learning technology and great success in practical applications. Convolutional Neural Network (CNN) models, for example, have been widely used for cognitive tasks such as face recognition, medical imaging, and human action recognition. Meanwhile, there is growing interest in deploying machine learning models on the cloud as a service (MLaaS). While cloud computing is well recognized as an attractive solution, especially for computation-intensive applications such as MLaaS, outsourcing sensitive data and data processing to the cloud can pose a severe threat to users' privacy. Homomorphic Encryption (HE) is a promising technology for protecting users' privacy when deploying MLaaS on the cloud. HE allows computations to be performed on encrypted inputs such that the decrypted output matches the result computed from the original inputs. Thus, a client can encrypt sensitive data locally and send the encrypted ciphertexts to the cloud; all intermediate results remain encrypted, and the encrypted results returned by the cloud can be correctly decrypted using the secret key held by the client. While HE can effectively maintain the confidentiality of the computation process on the cloud, one major problem that must be dealt with is the excessive computational cost of operations over encrypted data: HE operations (e.g., HE multiplications and additions on encrypted data) can be several (i.e., three to seven) orders of magnitude slower than the corresponding operations on plaintexts. This tremendous computational cost has been the largest bottleneck hindering HE's adoption on the cloud.

One of the most effective approaches (e.g., Gilad-Bachrach et al., 2016; Brutzkus et al., 2019; Dathathri et al., 2019; Kim et al., 2022) to reducing the HE computational cost is to take advantage of the single-instruction-multiple-data (SIMD) capability supported by HE schemes such as CKKS and BFV. Smart & Vercauteren (2010) first proposed packing multiple data elements into different "slots" of the same ciphertext, so that computations on data elements at the same slot of two encoded messages can be performed in parallel. The challenge is how to pack data based on the characteristics of the application so that computation can be conducted effectively in a SIMD manner. In particular, a problem arises when the computation must be performed on data elements at different slots of the messages. Re-arranging the location of each individual data element in an encrypted message is out of the question due to its large overhead. A more reasonable solution is to employ the HE-rotation¹ operation, which moves data elements cyclically within the same message. However, HE-rotation has a high latency cost, due to the required permutation and key-switching operations, compared with other HE operations such as the multiplication of a ciphertext with a plaintext (HE-PMult) and the addition of two ciphertexts (HE-Add), as shown in Figure 1(a). Therefore, how to judiciously encode the inputs and perform the SIMD operations plays a key role in reducing HE computation complexity. In this paper, we study how to improve HE-based inference latency when deploying a CNN-based privacy-preserving machine learning (PPML) platform on the cloud. It is well known that the major computational workload of CNN inference comes from the convolution layers. Assuming the user inputs (e.g., images) are encrypted as ciphertexts and the associated CNN models are encoded as plaintext messages, the major HE computations are therefore HE-PMult, HE-Add, and HE-rotation operations.
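To make the slot semantics concrete, the SIMD behavior described above can be simulated on unencrypted vectors: HE-PMult and HE-Add act slot-wise, and HE-rotation cyclically shifts the slots, matching the Rot(ct, k) semantics in the footnote. The following is a plain-Python sketch of the slot arithmetic only; it performs no encryption and uses no actual HE library.

```python
# Plain-Python simulation of the SIMD slot semantics of an HE scheme
# (no encryption is performed; this only mirrors the slot arithmetic).

def he_pmult(ct, pt):
    """HE-PMult analogue: slot-wise product of a 'ciphertext' vector
    and a plaintext vector."""
    return [c * p for c, p in zip(ct, pt)]

def he_add(ct0, ct1):
    """HE-Add analogue: slot-wise sum of two 'ciphertext' vectors."""
    return [a + b for a, b in zip(ct0, ct1)]

def he_rot(ct, k):
    """HE-rotation analogue: cyclic left rotation by k slots,
    (v_0, ..., v_{n-1}) -> (v_k, ..., v_{n-1}, v_0, ..., v_{k-1})."""
    k %= len(ct)
    return ct[k:] + ct[:k]

ct = [0, 1, 2, 3, 4, 5, 6, 7]
print(he_rot(ct, 2))            # [2, 3, 4, 5, 6, 7, 0, 1]
print(he_pmult(ct, [2] * 8))    # [0, 2, 4, 6, 8, 10, 12, 14]
```

In a real scheme each `he_rot` call additionally requires a key-switching step, which is why it dominates the latency comparison in Figure 1(a) even though the slot permutation itself looks trivial here.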
Traditional neural network optimization techniques such as sparsification and pruning (Han et al., 2015b; Wen et al., 2016) help reduce the computational demand of CNN inference in the plaintext domain. However, they may not be effective here, as reducing the computational demand does not necessarily reduce the number of SIMD computations. In addition, note that, as shown in Figure 1(a), the cost of an HE-rotation can be over 43× that of an HE-PMult or HE-Add operation; simply reducing the number of HE operations without optimizing the HE-rotations may not be effective at all in reducing the computational cost. To this end, we develop an HE-based CNN inference framework, SpENCNN, with the goal of effectively exploiting the SIMD feature of HE schemes to improve CNN inference latency. In particular, we develop two techniques to reduce the HE computational cost. First, we develop HE-group convolution and an associated group-interleaved encoding that optimize channel locations within ciphertexts based on the number of convolution groups and the ciphertext size, thus significantly reducing the number of costly HE-rotations. Second, we further optimize the model architecture by iteratively pruning and training the weights in sub-blocks, with the goal of minimizing HE-rotations and accuracy loss. We have conducted extensive experiments with three CNN models on the MNIST and CIFAR-10 datasets; the results show that the optimized CNN models achieve overall speedups of 8.37x, 12.11x, and 19.26x for LeNet, VGG-5, and HEFNet, respectively, with negligible accuracy loss. To the best of our knowledge, this is the first work that builds an optimization framework for CNN model architectures from the perspective of structural sparsity and HE data packing to benefit HE-based PPML inference.
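The kind of rotation savings that grouping enables can be illustrated with a rotate-and-sum reduction, a standard pattern in HE convolutions. The sketch below is illustrative only and is not the paper's actual encoding: it simulates slots with plain Python lists and shows that when several channel groups share one ciphertext, a single sequence of rotations reduces all groups at once, so summing within groups of C/g channels takes log2(C/g) rotations instead of the log2(C) needed to reduce all C channels.

```python
# Illustrative sketch (not SpENCNN's exact encoding): when g channel
# groups share one ciphertext, one rotate-and-sum sequence reduces every
# group simultaneously, cutting the rotation count from log2(C) to log2(C/g).

def he_add(ct0, ct1):
    return [a + b for a, b in zip(ct0, ct1)]

def he_rot(ct, k):  # cyclic left rotation by k slots
    k %= len(ct)
    return ct[k:] + ct[:k]

def rotate_and_sum(ct, span):
    """After log2(span) rotate-and-add steps, slot i holds the cyclic sum
    ct[i] + ct[i+1] + ... + ct[i+span-1]. Returns (result, #rotations)."""
    step, rotations = 1, 0
    while step < span:
        ct = he_add(ct, he_rot(ct, step))
        step *= 2
        rotations += 1
    return ct, rotations

# Two groups of four channel values packed into one 8-slot ciphertext.
ct = [1, 2, 3, 4, 10, 20, 30, 40]
out, nrot = rotate_and_sum(ct, 4)
# Slot 0 holds the first group's sum and slot 4 the second's after only
# 2 rotations; reducing all 8 slots would have taken 3.
print(out[0], out[4], nrot)   # 10 100 2
```

The valid group sums land at the group-leading slots (0 and 4 here); the remaining slots hold sums that straddle group boundaries and are discarded, which mirrors how HE convolution results are typically read out of selected slots.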



¹ For instance, Rot(ct, k) transforms an encryption of (v_0, ..., v_{N/2-1}) into an encryption of (v_k, ..., v_{N/2-1}, v_0, ..., v_{k-1}).



Figure 1: (a) Comparison of the latency of different HE operations. (b) Comparison of HE convolution latency under different pruning methods. (c) Illustration of different pruning methods in the plaintext domain. (d) Multi-channel convolution process in the HE domain. For notation definitions, refer to Section 3.1; pt(k_i) denotes the weight plaintext. The convolution layer used here has 64 input and 64 output channels with a 3 × 3 kernel, and its input feature map size is 32 × 32.

