HYPHEN: A HYBRID PACKING METHOD AND OPTIMIZATIONS FOR HOMOMORPHIC ENCRYPTION-BASED NEURAL NETWORKS

Abstract

Private Inference (PI) enables users to enjoy secure AI inference services while companies comply with regulations. Fully Homomorphic Encryption (FHE) based Convolutional Neural Network (CNN) inference is promising as users can offload the whole computation process to the server while protecting the privacy of sensitive data. Recent advances in AI research have enabled HE-friendly deep CNNs like ResNet. However, FHE-based CNN (HCNN) suffers from high computational overhead. Prior HCNN approaches rely on dense packing techniques that aggregate as many channels as possible into a ciphertext to reduce element-wise operations such as multiplication and bootstrapping. However, these approaches require an excessive number of homomorphic rotations to accumulate channels and maintain the dense data organization, which takes up most of the runtime. To overcome this limitation, we present HyPHEN, a deep HCNN implementation that drastically reduces the number of homomorphic rotations. HyPHEN leverages two convolution algorithms, CAConv and RAConv; alternating between the two leads to a significant reduction in rotation count. Furthermore, we propose a hybrid packing method for HyPHEN, which gathers sparse convolution results into a dense data organization with a marginal increase in the number of rotations. HyPHEN explores the trade-off between the computational costs of rotations and other operations, and finds the optimal point minimizing the execution time. With these optimizations, HyPHEN takes 3.4-4.4× less execution time than the state-of-the-art HCNN implementation and brings the runtime of ResNet inference on CIFAR-10 down to 1.44-13.37s using a GPU-accelerated HEAAN library.

1. INTRODUCTION

Private inference (PI) has recently gained the spotlight in the MLaaS domain as cloud companies must comply with privacy regulations such as the GDPR Regulation (2016) and the HIPAA Act (1996). PI enables inference services at the cloud server while protecting the privacy of the client and the intellectual property of the service provider. For instance, hospitals can provide private medical diagnosis of diseases, and security companies can provide private surveillance systems without accessing clients' sensitive data (Kumar et al., 2020; Bowditch et al., 2020). PI can be achieved using various cryptographic primitives (Gentry, 2009; Yao, 1982; Costan & Devadas, 2016). Fully Homomorphic Encryption (FHE), a set of cryptographic schemes that can directly evaluate a rich set of functions on encrypted data, is especially suited for PI. An FHE-based PI solution uniquely features 1) full offloading of the computation process to the server, 2) succinct data communication requirements, and 3) non-disclosure of any information about the model except the inference result. Such benefits have driven researchers to investigate convolutional neural network (CNN) PI implementations using FHE (Gilad-Bachrach et al., 2016; Brutzkus et al., 2019; Dathathri et al., 2020; Lee et al., 2022a; Aharoni et al., 2020). To implement a CNN using FHE, activation functions must be replaced with polynomials as FHE only supports the arithmetic operations of addition and multiplication. Given this constraint, two classes of polynomial activation functions have been proposed: (i) low-degree polynomials (Gilad-Bachrach et al., 2016; Chabanne et al., 2017) replacing the activation functions when training neural networks, and (ii) more precise high-degree approximations of ReLU (Lee et al., 2021) that replace ReLU during PI without additional training.
However, both approaches lack practicality: low-degree polynomials are not applicable to deep neural networks, and high-degree approximation significantly degrades the runtime of PI. Recently, Park et al. (2022) showed that deep homomorphic CNNs (HCNNs) can be trained with low-degree polynomials even for complex image datasets with their proposal, AESPA, which utilizes orthogonal polynomial bases and fuses activation functions with batch normalization (BN) to turn them into second-degree polynomials. AESPA sacrifices neither runtime nor accuracy, unlike prior approaches, so we employ AESPA in our work. Another line of research lies in implementing an efficient convolution algorithm in FHE. Gazelle (Juvekar et al., 2018) proposed a convolution algorithm that can compute a single Conv layer on FHE. However, Gazelle's method cannot be directly applied to consecutive convolutions as it requires adjusting the arrangement of data by re-encrypting ciphertexts after every Conv layer. Lee et al. (2022a) modified Gazelle's convolution by densely mapping data into a ciphertext before entering the next Conv layer. However, the current state of HCNN is far from being practical. Using the convolution algorithm of Lee et al. (2022a) and approximated ReLU, the inference time of ResNet20 on CIFAR-10 is 1662s with a single thread and 174s with 64 threads in our CPU environment. Despite the unique advantages of FHE-based PI, this huge runtime overhead prevents FHE from being the go-to solution for PI. We propose the Hybrid Packing method and optimizations for Homomorphic Encryption-based neural Networks (HyPHEN), which mitigates the huge overhead of HCNN with an optimized convolution algorithm and packing method. We observe that after AESPA is applied, rotation operations take up the majority of the HCNN runtime (see Appendix A) and most of the rotations (92-99%) are spent summing channels within the same ciphertext and maintaining the data organization.
We design a novel convolution algorithm named RAConv that does not require rotations to accumulate channels. In addition, based on the observation that maintaining a single data organization necessitates massive numbers of unnecessary rotations, we design RAConv to adopt a new data organization based on the replication of images. By alternating between the two data organizations, we remove the rotations previously required to adjust the data organization. HyPHEN also includes a novel Hybrid Packing (HP) method that effectively handles the gap arising from strided convolution (Section 3.2). HyPHEN achieves runtimes of 39.6s and 1.44s for ResNet20 on the CIFAR-10 dataset on CPU and GPU, respectively. The key contributions of the paper are as follows:

• We propose a replication-based convolution method, RAConv, that effectively reduces the two types of unnecessary rotations which are the major bottleneck in HCNN.
• We propose a novel hybrid packing (HP) method that can utilize the entire slots of a ciphertext with a marginal increase in the number of rotations.
• Our experiments show that our HCNN implementation with HyPHEN improves inference latency by 3.4-4.4× over prior state-of-the-art HCNNs for ResNet on CIFAR-10.

2. BACKGROUND

2.1 FULLY HOMOMORPHIC ENCRYPTION

FHE is a set of public-key encryption schemes that can perform computation on encrypted data. Among several popular FHE schemes, RNS-CKKS (Cheon et al., 2018) has been broadly adopted in the PI domain as it supports fixed-point numbers and slot batching. A plaintext in RNS-CKKS is an unencrypted degree-N polynomial in a cyclotomic polynomial ring, R_Q = Z_Q[X]/(X^N + 1). A plaintext maps to a message, which is a vector of N/2 real (or complex) numbers; thus a single plaintext batches N/2 slots, each of which can store a complex or real number. CKKS encrypts a plaintext into a ciphertext in R_Q^2. Q is a ring modulus represented by a product of prime moduli obtained via the Chinese Remainder Theorem (CRT), Q = Π_{i=0}^{l} q_i (1 ≤ l ≤ L). L and l denote the initial and current level of a ciphertext. The level is an HE-specific resource that determines the number of multiplications applicable to a given ciphertext. We denote the ring modulus at a given level using a subscript, as in Q_L or Q_l, and the plaintext and ciphertext of a message a as ⟨a⟩ and [a]. The HE operations of addition, multiplication, and rotation can be described as follows:

• HE.Eval([a], [b], f_l) = HE.Eval([a], ⟨b⟩, f_l) = [f_l(a, b)]
• HE.Rotate([a], r) = [rot(a, r)]

Here f_l denotes a linear operation, either Hadamard addition or multiplication, and rot(a, r) represents cyclically shifting the vector a by r to the left. Unlike addition and rotation, multiplication in RNS-CKKS requires an additional rescale operation, which consumes a level by dividing ct ∈ R_{Q_l} into ct′ ∈ R_{Q_{l-1}}. If a ciphertext has no levels left after a series of multiplications, bootstrapping (Bossuat et al., 2022) is needed to restore the levels and allow further operations. Bootstrapping, the most costly operation in HE, consists of multiple HE operations including rescale operations.
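As a functional sketch of the slot semantics described above (no encryption, noise, or levels; an unencrypted stand-in only), the message-level behavior of these operations can be mimicked in a few lines:

```python
# Plaintext-level simulation of RNS-CKKS message semantics.
# This toy model only mirrors the functional behavior of the operations;
# real CKKS operates on encrypted polynomials and tracks levels and noise.

def he_rotate(msg, r):
    """rot(a, r): cyclically shift the slot vector left by r."""
    r %= len(msg)
    return msg[r:] + msg[:r]

def he_mul(a, b):
    """Hadamard (slot-wise) multiplication; in CKKS this consumes a level via rescale."""
    return [x * y for x, y in zip(a, b)]

def he_add(a, b):
    """Slot-wise addition; level-neutral in CKKS."""
    return [x + y for x, y in zip(a, b)]

msg = [1.0, 2.0, 3.0, 4.0]
assert he_rotate(msg, 1) == [2.0, 3.0, 4.0, 1.0]
assert he_mul(msg, [0, 1, 0, 1]) == [0.0, 2.0, 0.0, 4.0]
```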
After bootstrapping, the level of the resulting ciphertext becomes L′ = (L − L_b), where L_b is the depth of rescale operations in the bootstrapping circuit. As it is beneficial to perform many operations before bootstrapping, L should be sufficiently larger than L_b. However, a large L decreases the security level, which should be high enough to tolerate cryptographic attacks. The security level is roughly proportional to N/L. Considering the security requirement of HE, a large L requires a large N (≥ 2^15); thus prior works on FHE (Bossuat et al., 2021; Jung et al., 2021; Lee et al., 2022b) target N = 2^15 to 2^17. Table 1 shows the execution time of HE operations on the system specified in Section 4.1. We measured the execution time of each operation at the initial (maximum) level of a ciphertext, so the execution time may decrease for ciphertexts at lower levels. Bootstrapping takes over two orders of magnitude longer than other operations, but it does not occur as frequently. Except for bootstrapping, Rotate and MulCt are the most time-consuming operations in HE due to the expensive key-switching procedure.

2.2 CONVOLUTION ON HOMOMORPHIC ENCRYPTION

In this subsection, we describe previous convolution algorithms for FHE. We represent input and output tensors with the tuples {w_i, h_i, c_i} and {w_o, h_o, c_o}, and Conv layers with {f_w, f_h, c_i, c_o}. We denote the stride of the convolution as s and assume padding = 1 for simplicity. Gazelle (Juvekar et al., 2018) proposed efficient SISO (single-input and single-output channel) convolution algorithms on HE. Figures 1b and 1c show SISO convolutions for s = 1 and s = 2. Filter elements are separated into f_w f_h plaintexts.
Each slot in the i-th filter plaintext stores k_i or 0 (punctured) depending on whether k_i is involved in the computation of the output pixel at the same slot. SISO operates as follows: 1) rotate an encrypted input image with different indexes according to the filter plaintexts, 2) perform Hadamard multiplication, and 3) accumulate the multiplication results to obtain the output. Alternatively, if we prepare the filter plaintexts inversely rotated and directly multiply them with the input ciphertext, the rotations can be performed after the MulPt operations (Zhang et al., 2021), which we dub lazy-SISO.

[Figure 1: HE SISO convolution with s = 1 (b) and s = 2 (c); punctured filter plaintexts store zeros in slots that do not contribute to the output.]

We selectively use SISO and lazy-SISO to minimize the rotations for convolution. A detailed explanation of lazy-SISO is provided in Appendix E. In the more general case of multiple channels, convolution on HE is performed in a SIMD manner. If the size of a channel image is smaller than the number of slots in a ciphertext, multiple channels can be placed in one ciphertext. For example, if slot = 2^15 and an input channel is sized w_i h_i = 32×32 as in the input image of the CIFAR-10 dataset, slot/(w_i h_i) = 32 channels can be placed in a single ciphertext in an aligned manner (i.e., channel-aligned). The process of convolution for a channel-aligned input ciphertext storing multiple channels can then be described as follows. Suppose c_i = slot/(w_i h_i). First, SISO is performed on the c_i input channels in a SIMD manner (see Figure 2c), which produces c_i c_o convolution outputs MK^(i,j) (1 ≤ i ≤ c_i, 1 ≤ j ≤ c_o). To compute the result for the k-th output channel, Σ_{i=1}^{c_i} MK^(i,k) is accumulated by RaS (Rotate and Sum), which is repeated until all the output channels are acquired.
Finally, the ciphertexts packed with output channels are realigned to match the next layer's input alignment by IR (Image Realign). Throughout this paper, we refer to this convolution that takes a channel-aligned ciphertext as input as channel-aligned convolution (CAConv). CAConv can be further optimized for the case where the input tensor is not large enough to fill all the slots in a ciphertext: slot/(w_i h_i c_i) repeated copies of the input tensor are placed in a ciphertext, and then slot/(w_i h_i c_i) output channels can be computed together in a single ciphertext (Lee et al., 2022a). Strided convolution (s > 1) using the SISO algorithm creates a gap between valid values (see Figure 1c). A ciphertext with a gap underutilizes its slots, leading to throughput degradation. While Juvekar et al. (2018) remove the gap through a client-aided re-encryption process, non-interactive PI must remove the gap through masking and rotation, which incur additional rotation overhead and also consume more levels. Lee et al. (2022a) proposed a multiplexed packing method that can be combined with CAConv (MP-CAConv) to mitigate these overheads. In the IR stage of MP-CAConv, multiplexed packing fills the gap with other channels (see Figure 3b), which we refer to as the repacking process. Other than the repacking process, MP-CAConv is very similar to CAConv; in MP-CAConv, IR collectively refers to the realigning and repacking processes. Tile-tensoring-based convolution, proposed in Aharoni et al. (2020), is yet another convolution algorithm for FHE. While it can be an efficient alternative to SISO-based convolution when the image size is sufficiently large, our paper mainly focuses on SISO-based convolution, which can be applied more broadly. In RNS-CKKS, the hoisting optimization allows multiple rotations to share common sub-operations composing a rotation operation.
First-level hoisting shares the front decomposition sub-operations when rotating a single ciphertext multiple times with different indexes. Second-level hoisting shares the rear ModDown sub-operations when rotating and summing up multiple ciphertexts. We refer the reader to Bossuat et al. (2021) for further details of the hoisting optimization. First-level hoisting can be applied to SISO, whereas second-level hoisting can be applied to lazy-SISO.
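The SISO procedure above (rotate, punctured Hadamard multiplication, accumulate) can be sketched on unencrypted data as follows; `np.roll` stands in for HE.Rotate and the punctured masks play the role of the filter plaintexts. This is an illustrative simulation of the algorithm, not the paper's implementation:

```python
import numpy as np

def siso_conv(img, kernel):
    """Gazelle-style SISO convolution sketch (plaintext simulation).

    img: (h, w) array packed row-major into a slot vector.
    kernel: (3, 3) filter. Each tap becomes one rotation of the input
    plus a Hadamard product with a punctured mask that zeroes slots
    whose rotated value crossed the image border (zero padding).
    """
    h, w = img.shape
    flat = img.flatten()
    out = np.zeros_like(flat, dtype=float)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            offset = di * w + dj
            rotated = np.roll(flat, -offset)        # HE.Rotate(ct, offset)
            mask = np.zeros(h * w)                  # punctured filter plaintext
            for i in range(h):
                for j in range(w):
                    if 0 <= i + di < h and 0 <= j + dj < w:
                        mask[i * w + j] = kernel[di + 1, dj + 1]
            out += rotated * mask                   # MulPt, then accumulate
    return out.reshape(h, w)
```

With f_w = f_h = 3 this uses f_w f_h − 1 = 8 nontrivial rotations per input ciphertext, matching the SISO rotation count used throughout the paper.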

2.3. POLYNOMIAL ACTIVATION FUNCTION ON HOMOMORPHIC ENCRYPTION

Non-linear activation functions, such as ReLU, cannot be used in HCNN. They must be replaced with polynomial functions as HE only supports addition and multiplication. To directly replace ReLU with approximate polynomials, the approximation error should be negligible over a wide range to retain the original accuracy of a CNN model. Lee et al. (2021) designed a precise approximation of ReLU with an l1-norm approximation error of less than 2^-13 in the range [-50, 50], obtained by a composition of {15, 15, 27}-degree polynomials. This approximation-based approach has the benefit that it can be applied to pretrained neural networks. However, the evaluation of high-degree polynomials imposes a significant runtime overhead on HCNN inference. Another approach is to train neural networks with low-degree polynomial activations as in Ishiyama et al. (2020); Chabanne et al. (2017); Obla et al. (2020); Hesamifard et al. (2019); Thaine et al. (2019). While this approach requires retraining, its operational cost is much cheaper compared to high-degree polynomials. Recently, AESPA (Park et al., 2022) demonstrated that CNNs trained with low-degree polynomials can achieve accuracy equivalent to ReLU-based networks across various CNN architectures and image datasets. AESPA replaces batch normalization (BN) and ReLU with the composition of orthogonal basis polynomials and basis-wise BN as follows:

f(x) = γ · (Σ_{i=0}^{d} f_i h_i(x) − µ) / √(σ² + ε) + β    (1)

Here, h_i are the orthogonal bases, µ and σ² are the mean and variance computed from BN, and γ and β are trainable parameters. For d = 2, AESPA turns into a second-degree polynomial with different coefficients for each channel at inference time. We adopt AESPA in this paper, which leads to better runtime performance.
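To illustrate the coefficient fusion, the sketch below collapses the AESPA form of Eq. (1) into a per-channel quadratic a·x² + b·x + c. The basis (h₀ = 1, h₁ = x, h₂ = x² − 1) is a stand-in orthogonal basis chosen for illustration and may differ from the bases used in AESPA:

```python
import numpy as np

def fuse_aespa(f, gamma, beta, mu, sigma2, eps=1e-5):
    """Collapse f(x) = gamma*(sum_i f_i*h_i(x) - mu)/sqrt(sigma2 + eps) + beta
    into monomial coefficients (a, b, c) of a*x^2 + b*x + c.

    Assumed illustrative basis: h0 = 1, h1 = x, h2 = x^2 - 1, so
    sum_i f_i*h_i(x) = f2*x^2 + f1*x + (f0 - f2).
    """
    s = gamma / np.sqrt(sigma2 + eps)
    a = s * f[2]
    b = s * f[1]
    c = s * (f[0] - f[2] - mu) + beta
    return a, b, c
```

During inference, such fused coefficients can be absorbed into the adjacent convolution and shortcut weights, leaving only the square function x² to evaluate homomorphically, as described in Section 3.3.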

2.4. THREAT MODEL

We adopt the same threat model as prior PI works. A client sends encrypted data to an untrusted server. The server performs CNN inference using HE operations and returns inference results to the client. The client decrypts the resulting ciphertext to obtain the private result. The server only holds the client's public keys and cannot decrypt any intermediate ciphertexts in the inference process. The client does not know any information about the processing at the server other than the result.

3.1. REPLICATION-ALIGNED CONVOLUTION

The main performance bottleneck of CAConv is the massive number of rotations. CAConv requires an enormous number of rotations to implement the sum of the channels within the same ciphertext (RaS) and the relocation of the channels (IR) to match the next layer's input representation. These rotations take up most of the time in CAConv. For example, when N = 2^16, rotations for RaS and IR account for 49% and 43% of the total number of rotations in ResNet20, respectively. Furthermore, IR consumes an additional level for masking to extract the values.

[Figure 2: CAConv and RAConv. (a) RAConv with SISO, (b) RAConv with lazy-SISO, (c) CAConv with SISO. A single superscript denotes a channel and a superscript pair denotes (input channel, output channel). We simplify the notation of M^(a)K^(a,b) as MK^(a,b).]

To mitigate the performance bottleneck caused by rotations in CAConv, we design Replication-Aligned Convolution (RAConv) to receive an alternative data representation. In CAConv, the output ciphertext of RaS contains replications of the channel sum (i.e., it is replication-aligned), as shown in Figure 2c. RAConv receives replication-aligned ciphertexts as input and skips the IR and RaS stages. Figure 2a shows an example of performing RAConv. RAConv takes c_i input ciphertexts, each filled with replications of a single input channel, and weight plaintexts aligned in output-channel order. RAConv operates as follows: 1) perform parallel SISO, which outputs c_i ciphertexts where the i-th ciphertext contains MK^(i,j) for all j values, and 2) accumulate the ciphertexts by simple HE additions. SISO in RAConv actually increases the number of rotations because c_i input ciphertexts require c_i(f_w f_h − 1) rotations during parallel SISO. Instead, we utilize lazy-SISO with RAConv, which requires far fewer, (f_w f_h − 1), rotations for SISO. RAConv produces a densely packed channel-aligned ciphertext that complies with the CAConv input data organization, so we alternate between RAConv and CAConv. The RAConv-CAConv chain halves the RaS rotations previously required in two CAConvs to c_i log(c_o).
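The RaS primitive that both convolutions rely on can be sketched as a log-step rotate-and-add on unencrypted slot vectors, with `np.roll` standing in for HE.Rotate (illustrative simulation only):

```python
import numpy as np

def rotate_and_sum(ct, block):
    """RaS sketch: sum len(ct)//block contiguous blocks of size `block`
    packed in one ciphertext, using log2(#blocks) rotations.
    After the loop, the first `block` slots hold the channel sum."""
    n = len(ct)
    step = block
    while step < n:
        ct = ct + np.roll(ct, -step)   # HE.Rotate by `step`, then HE.Add
        step *= 2
    return ct
```

For example, summing 32 channels of 1024 slots each inside a 2^15-slot ciphertext takes log2(32) = 5 rotations per output channel, which is the log-factor cost that HyPHEN works to avoid repeating.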
[Figure 3: Packing methods: (a) void packing, (b) multiplexed packing, (c) duplicate packing, (d) hybrid packing.]

[Figure 4: The procedure of CAConv and RAConv with hybrid packing: (a) CAConv with hybrid packing (HP-CAConv), (b) RAConv with hybrid packing (HP-RAConv).]

3.2 HYBRID PACKING

SISO convolution suffers from low slot utilization in ciphertexts for two reasons. First, strided convolution creates a gap between valid values (see Figure 3a). Second, small input tensors leave empty slots in ciphertexts: due to the security requirement of FHE, the number of slots in a ciphertext is often larger than the size of an input tensor. Underutilization of slots in SISO leads to severe throughput degradation in HCNN. The prior state-of-the-art HCNN implementation (MP-CAConv) mitigates the underutilization of slots using multiplexed packing and input repetition (see Section 2.2). However, these incur substantial additional rotation overhead to adjust the data organization.
In MP-CAConv, the RaS operation is used to accumulate SISO results in a ciphertext. Under input repetition, RaS returns an output channel at the slots where the input ciphertext stores the first channel of the input tensor. Multiplexed channels are also accumulated through RaS, so the values constituting an output channel exist only in non-gap slots, as shown in Figure 3a. To restore the data organization of MP-CAConv, invalid values are masked off and empty slots are filled with other channels through the IR process. In the presence of input repetition, IR spends O(c_o) rotations to relocate output channels and slot/(w_o h_o c_o) additional rotations to generate the input repetition. To reduce this relocation overhead between convolutional layers, we propose a novel image packing method, hybrid packing (HP). HP fills the gap with duplicates of multiple channels (see Figure 3d). HP is a hybrid of duplicate packing (Figure 3c) and multiplexed packing. We represent a hybrid-packed ciphertext by a pair of numbers: the number of multiplexed channels m and the number of duplicates d. For example, Figure 3d shows (m, d) = (2, 2) HP. The packing of the ciphertext switches between two HP settings while processing CAConv and RAConv, as shown in Figure 4. We denote the HP parameters (m, d) of the input and output ciphertexts as (m_in, d_in) and (m_out, d_out). Input repetition is no longer required, as HP with a larger d_in can be used instead. Duplicates of HP produce different output channels within the gap (see c^(1) and c^(17) in Figure 4b). Then, the IR process adjusts the output ciphertext's organization from (m_out, d_out) = (d_in, 1) to (d_in, m_in), which only requires O(log m_in) rotations; the rotation count decreases from c_o in MP-CAConv to merely log m_in rotations per output ciphertext. Compared to MP-CAConv, the number of rotations for RaS is reduced by about ct_in times for both HP-CAConv and HP-RAConv, and the IR stage of HP repacks the gap with duplicates.

Table 2: The number of input/output ciphertexts and rotations for each convolution method (n denotes the number of slots).

| Method | ct_in | ct_out | SISO | RaS | IR |
| --- | --- | --- | --- | --- | --- |
| (Lee et al., 2022a) | ⌈w_i h_i c_i / n⌉ | ⌈w_o h_o c_o / n⌉ | ct_in (f_w f_h − 1) | (w_i h_i c_i c_o / n) log c_i | c_o + log(n / (w_o h_o c_o)) |
| HP-CAConv | w_i h_i c_i d_in / n | c_o / d_in | ct_min (f_w f_h − 1) | ct_out log(c_i / ct_in) | ct_out log m_in |
| HP-RAConv | c_i / m_in | w_i h_i c_o m_in / n | ct_min (f_w f_h − 1) | ct_out log m_in | ct_out log m_in |

HP convolutions require more rotations than MP-CAConv for SISO; ct_in of MP-CAConv is always smaller than or equal to ct_min of HP-CAConv or HP-RAConv. Nevertheless, hoisting can be applied to SISO, which reduces the significance of the SISO rotation cost, and the reduction of rotations in RaS and IR overwhelms the increase in SISO. All things considered, HP reduces the overall number of rotations required for convolutions. We can also explore various combinations of the (m, d) pair to minimize the total number of rotations. The choice of (m, d) determines the ct_in and ct_out values and creates a trade-off between the SISO, RaS, and IR costs; it also affects the number of ciphertexts we have to bootstrap. We provide an in-depth performance analysis with regard to the choice of (m, d) in Section 4.2.
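To illustrate how the (m, d) choice trades SISO, RaS, and IR rotations against each other, the sketch below sweeps candidate pairs under a simplified, hypothetical cost model. The expressions are loose simplifications for illustration only, not the paper's exact complexity formulas:

```python
import math

def hp_caconv_rotations(wi, hi, ci, co, n, m, d, fw=3, fh=3):
    """Hypothetical rotation-cost model for an HP-CAConv layer.
    The ct_in/ct_out/SISO/RaS/IR expressions are illustrative
    simplifications of the per-stage costs discussed above."""
    ct_in = math.ceil(wi * hi * ci * d / n)                     # input ciphertexts
    ct_out = math.ceil(co / d)                                  # output ciphertexts
    siso = ct_in * (fw * fh - 1)                                # hoistable rotations
    ras = ct_out * math.ceil(math.log2(max(ci // max(ct_in, 1), 2)))
    ir = ct_out * math.ceil(math.log2(max(m, 2)))
    return siso + ras + ir

# Sweep (m, d) pairs, as in Section 4.2, to find the setting
# minimizing rotations for a 32x32x16 -> 16-channel layer with 2^15 slots.
cands = [(m, d) for m in (1, 2, 4) for d in (1, 2, 4)]
best = min(cands, key=lambda md: hp_caconv_rotations(32, 32, 16, 16, 2**15, *md))
```

A real sweep would also account for bootstrapping counts and hoisting, which is why the paper's optimal point can differ from the pure rotation minimum.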

3.3. THE RESNET ARCHITECTURE ON HYPHEN

[Figure 5: The structure of the ResNet basic block built on HyPHEN: Square (1), RAConv (2), CAConv (2), Square (1), Bootstrap; Shortcut (1) / CAConv1x1 (2). The level consumption per block is written in parentheses. In the downsampling block, a pointwise convolution is added to the critical path; otherwise, a simple shortcut is added.]

HyPHEN combines RAConv and HP to build the entire CNN model. Figure 5 shows the basic block of ResNet implemented on HyPHEN. There are three more considerations when deciding the order of operations. First, bootstrapping is cheaper when placed after RAConv rather than CAConv, because the number of ciphertexts is smaller at that point. Second, to match the level between the shortcut path and the main CAConv-RAConv path, bootstrapping should be placed either before the residual connections diverge or after they converge. Last, it is beneficial to perform convolutional layers at the lowest level possible: the complexity of FHE operations such as rotation is proportional to the level l of the ciphertext, so the lower the level at which a layer is computed, the smaller its computational cost. All things put together, our ResNet basic block implementation consumes a total of six levels. The level consumption of each layer is shown in parentheses in Figure 5. CAConv and RAConv use HP and each consume one level for SISO and one for IR. The activation uses AESPA and consumes one level. AESPA is a quadratic polynomial with different coefficients for each channel during training; during inference, we fuse the coefficients into nearby layers, and the activation becomes a simple square function x². We set the initial ciphertext level L to six and perform bootstrapping when the level becomes zero.
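The six-level budget of the basic block can be checked with a small bookkeeping sketch; the operation names and per-operation level costs are taken from Figure 5, while the helper itself is ours:

```python
# Level bookkeeping for the HyPHEN ResNet basic block:
# each entry is (operation, levels consumed). Starting from L = 6,
# the level reaches exactly 0 when bootstrapping is scheduled.
BASIC_BLOCK = [("Square", 1), ("RAConv", 2), ("CAConv", 2), ("Square", 1)]

def remaining_levels(start_level, ops):
    """Walk the block and return the level left; fail if the budget is exceeded."""
    level = start_level
    for name, cost in ops:
        level -= cost
        assert level >= 0, f"{name} would exceed the level budget"
    return level

assert remaining_levels(6, BASIC_BLOCK) == 0  # bootstrap then restores the budget
```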

4.1. EXPERIMENTAL SETUP

We ran HCNN inference in CPU and GPU environments using the RNS-CKKS library HEAAN. The CPU instance is equipped with an AMD EPYC 7452 running at 2.35GHz (64 cores) and 480GB of DRAM. GPU experiments are conducted on the same system with an additional NVIDIA A100 GPU with 80GB of HBM. Our HCNN inference experiments use the CIFAR-10 dataset (Krizhevsky et al., 2009). We evaluate ResNet20/32/44/18 trained with AESPA on PyTorch, applying the fusion technique to all the networks. Our RNS-CKKS parameters satisfy a 128-bit security level (Cheon et al., 2019) with polynomial degree N = 2^16 and Hamming weight 192. Multiplication and bootstrapping primes occupy 48 bits and over 56 bits, respectively. The bootstrapping implementation consumes 17 levels in our setup.

4.2. OPTIMAL POINT

We explore the optimal (m, d) pair for HP to minimize the latency of ResNet20/18. Table 3 shows the operation counts of rotation and bootstrapping, which are the major contributors to the runtime. Rotations are categorized into SISO and non-SISO; only SISO rotations can be optimized with hoisting. Table 3 presents an instance of MP-CAConv and three (or two) instances of our HP-based convolution. The instances of HP-based convolution consist of the settings with minimal bootstrapping, minimal rotations, and optimal latency. In ResNet20, the optimal point coincides with minimal bootstrapping. In ResNet18, the number of rotations rises sharply with the increased channel count, so the optimal point trades additional bootstrapping for fewer rotations. We choose these two settings for evaluation in the following section. A more thorough parameter study on GPU is presented in Appendix B.

4.4. EXECUTION TIME

Table 4 shows the runtimes of various ResNet instances on CIFAR-10. We measured the execution time of running inference on a single CIFAR-10 image in our CPU/GPU environments. Our ResNet20/32/44 implementations on GPU take merely a few seconds to complete. Furthermore, we demonstrate running ResNet18 for the first time. As ResNet18 has 4× more channels than ResNet20/32/44, its execution time largely depends on the convolutional layers.

Table 5: Comparison of the inference accuracies for the CIFAR-10 dataset running ResNet models on our RNS-CKKS-based implementation.

In Table 5, we measured the inference accuracies for CIFAR-10 images running the ResNet models on our RNS-CKKS-based implementation. Although we found an error below the second decimal place in the classifier results, we did not observe any deterioration in the accuracy of ResNet20/32/44. ResNet18 shows a 0.08% degradation in accuracy, which is smaller than the accuracy drop in Lee et al. (2022a). The difference in accuracy drop can be explained by whether the original network is executed as-is or an approximation has been made.

A BOTTLENECK ANALYSIS

Figure 7 shows the runtime analysis of the convolutional layers, conducted on actual layer instances of ResNet20. In Lee et al. (2022a), rotation accounts for 83-94% of the total convolution time, which is reduced to 46-77% with our method. The optimized convolutional layer in HyPHEN leads to a smaller share of rotation in the total execution time. Table 6 shows the runtime breakdown of each operation and a detailed comparison with Lee et al. (2022a) in our CPU environment. As we set both implementations to use AESPA with the same HE parameter set, the speedup is solely due to the different packing schemes. In ResNet20, our implementation shows a slight increase in activation function runtime; however, latency improvements in the convolution layers lead to a 1.87× lower total execution time. Similarly, in ResNet18, our implementation reports increased activation function and bootstrapping times, but the 5.77× lower convolution time again leads to a 3.33× lower total execution time.

B PARAMETER STUDY

We present a parameter study to explore the optimal HP setting that minimizes latency. Tables 7 and 8 show the rotation and bootstrapping counts for the (m, d) values available in ResNet20 and ResNet18, and the execution time of running each network on GPU. We only present the (m, d) pair of CAConv, as m and d are exchanged at RAConv. In ResNet20, we start with (m, d) = (1, 2) to remove input repetition, as the size of the input tensor in the first layer (32 × 32 × 16) is smaller than the number of ciphertext slots (2^15). A larger d has not been considered as it leads to more bootstrapping in our proposed architecture (see Figure 5). As the input ciphertexts go through a downsampling layer, m · d is quadrupled and the size of the intermediate tensor is halved. In ResNet20, HP that doubles d at every downsampling layer yields optimal performance, reducing rotations without increasing bootstrapping. In ResNet18, the impact of the bootstrapping increment is often smaller than that of the rotation decrement: the optimal HP setting requires 15 more bootstrappings but 10687 fewer rotations than the minimum-bootstrapping HP setting.

[Table 10: Parameters (filter size f_w = f_h and stride s) used in the convolution layers of ResNet18.]

D TRAINING DETAILS

The models used in this paper are all trained using PyTorch (Paszke et al., 2019). For ResNet18 and ResNet20, our training settings are mostly identical to those of AESPA: networks are trained for 200 epochs using the SGD optimizer with a 0.1 initial learning rate, batch size 100, 0.0005 weight decay, 0.9 momentum, and a cosine annealing scheduler. We also use soft labels as in Park et al. (2022) to obtain higher accuracy. For ResNet32 and ResNet44, we use knowledge distillation (Hinton et al., 2015) to enhance accuracy, using pretrained ResNet32/44 models with 93.4% and 94.1% accuracy as teachers. As the FC layers of the student and teacher networks are identical, the teacher's FC layer is directly reused in the student network. We train the student networks by minimizing the l2 loss L_kd = ∥f_t − f_s∥²₂. ResNet32 and ResNet44 are trained for 200 epochs using the SGD optimizer with 0.0005 weight decay and 0.9 momentum; we use a 0.05 initial learning rate, and the scheduler decays the learning rate by 0.1 at epochs 150, 180, and 200.
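The distillation objective can be sketched directly from the formula above (a plain NumPy version for illustration; the actual training uses PyTorch tensors):

```python
import numpy as np

def kd_loss(f_t, f_s):
    """Feature-matching distillation loss L_kd = ||f_t - f_s||_2^2,
    comparing teacher features f_t against student features f_s."""
    diff = np.asarray(f_t, dtype=float) - np.asarray(f_s, dtype=float)
    return float(np.sum(diff ** 2))
```

Because the loss matches features right before the shared FC layer, minimizing it drives the student's penultimate representation toward the teacher's, which is what makes reusing the teacher's FC layer sound.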

E LAZY-SISO

Unlike the original SISO convolution, which rotates input ciphertexts before multiplying them with filter plaintexts, lazy-SISO, proposed in (Zhang et al., 2021), multiplies input ciphertexts with inversely rotated filter plaintexts. The process of lazy-SISO for c_i = c_o = 1 is depicted in Figure 9. After the multiplication with filter plaintexts, the postponed rotation is performed to accumulate the intermediate ciphertexts. During this step, ciphertexts sharing the same rotation index are first grouped and accumulated, and then rotated once, reducing the number of rotations (e.g., c_i ciphertexts are grouped in Figure 2b). Lazy-SISO is beneficial when the input channels to be accumulated are distributed across different ciphertexts, as in RAConv.
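Because homomorphic rotation is linear, accumulating a group first and rotating once gives the same result as rotating each ciphertext individually. A toy numpy sketch (np.roll standing in for homomorphic rotation; not the actual HE kernel) illustrates why grouping ciphertexts that share a rotation index saves rotations:

```python
import numpy as np

def rotate_then_accumulate(cts, idx):
    # Original SISO order: rotate every ciphertext, then add.
    # Cost: len(cts) rotations for this rotation index.
    return sum(np.roll(ct, -idx) for ct in cts)

def accumulate_then_rotate(cts, idx):
    # Lazy-SISO grouping: add the group first, rotate once.
    # Cost: 1 rotation for this rotation index.
    return np.roll(sum(cts), -idx)
```

Both functions return identical results, so lazy-SISO trades nothing in correctness for a factor-of-group-size reduction in rotations.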

F MEMORY REQUIREMENT

The memory requirement of HCNN depends on the FHE parameters and the data representation (packing scheme). In FHE, data sizes expand during the encryption and encoding procedures; the resulting ciphertexts and plaintexts are typically orders of magnitude larger than the underlying messages. Table 11 shows the actual sizes of ciphertexts, plaintexts, and evaluation keys for three FHE parameter settings, where dnum denotes the RNS-decomposition number introduced in (Han & Ki, 2019). Given N, the degree of the cyclotomic polynomial ring, a larger dnum increases the maximum level L. ParamSet1 is used for Set1 in Figure 6 to reproduce Lee et al. (2022a); as Set1 uses approximated ReLU for activation, ParamSet1 adopts the maximum dnum to reach L = 16. ParamSet2 has the smallest L of the three and is only used for Set2 on ResNet18. ParamSet3, with L = 6, is the parameter set used to evaluate HyPHEN and Set2 on ResNet20. Some FHE operations, such as MulCt, Rotate, and Conjugate, require a key-switching procedure; Eval key denotes the public key used in this process. A single Eval key occupies 2147, 206, and 176 MB in ParamSet1, 2, and 3, respectively. To support bootstrapping, one relinearization key for MulCt, one conjugation key, and 48 rotation keys (one per rotation index) are required. We additionally load frequently used rotation keys to perform convolution; for instance, we load 68 unique Eval keys for ResNet20, which take up 146, 14, and 12 GB in ParamSet1, 2, and 3, respectively. Once the FHE parameters are fixed, the packing scheme determines the number of ciphertexts and plaintexts needed to run each ResNet block. Tables 12 and 13 show the required number and total memory size of ciphertexts and plaintexts. We further explain the actual computation procedure, from which these results are obtained, in Appendix H.
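The object sizes in Table 11 can be roughly sanity-checked with a back-of-the-envelope formula for RNS-CKKS objects. This is a sketch that ignores key-switching hints, seeds, and metadata, and the 8-byte word size is an assumption:

```python
def ciphertext_bytes(N, level, word_bytes=8):
    # An RNS-CKKS ciphertext holds 2 polynomials of N coefficients,
    # each coefficient stored in (level + 1) RNS limbs of word_bytes bytes.
    return 2 * N * (level + 1) * word_bytes

def plaintext_bytes(N, level, word_bytes=8):
    # An encoded plaintext holds a single polynomial.
    return N * (level + 1) * word_bytes
```

This also makes the level dependence explicit: a ciphertext at its maximum level L is (L + 1)/(l + 1) times larger than the same ciphertext after dropping to level l.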
In a SISO-based HCNN kernel, the size of the filter plaintexts grows by a factor of w_i h_i because each filter element is duplicated over the input image, requiring f_w f_h w_i h_i c_i c_o slots in total for the weight plaintexts. As a consequence, weight plaintexts take up the majority of memory regardless of the ciphertext packing method. In Table 12, our implementation shows up to 14.75% memory overhead compared to Lee et al. (2022a), primarily due to the larger number of ciphertexts and bias plaintexts. In Table 13, our implementation shows up to 36.8% memory overhead compared to Lee et al. (2022a); this larger overhead stems from using different FHE parameters, as we use ParamSet3 for our ResNet18 implementation while Lee et al. (2022a) use ParamSet2. When the same ParamSet3 is used for both, the overhead is reduced to at most 7%. Unlike CPU memory, GPU memory capacity is constrained by current HBM technology, which trades capacity for bandwidth. As GPU memory cannot hold the weights of an entire model such as ResNet18, weight plaintexts must be loaded incrementally: while one stream computes the current ResNet block, another stream concurrently loads the next block's weight plaintexts. Profiling the CPU and GPU activities with NVIDIA Nsight Systems (nvi, 2021), we find that the copy stream always completes before the end of the compute stream, so loading weight plaintexts does not affect the overall execution time. Thus, for running large neural networks, fine-grained multi-streaming can relieve the memory capacity constraint.
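The slot count above can be turned into a quick per-layer estimate of the weight-plaintext footprint (an illustrative helper with names of our own choosing, not our actual allocator):

```python
def weight_slots(fw, fh, wi, hi, ci, co):
    # Each of the fw*fh*ci*co filter elements is broadcast over the
    # wi x hi image plane, giving fw*fh*wi*hi*ci*co slots in total.
    return fw * fh * wi * hi * ci * co
```

For example, a 3×3 convolution over a 32×32 image with 16 input and 16 output channels already occupies over 2.3 million slots, which is why weight plaintexts dominate memory irrespective of the ciphertext packing.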

G IMAGENET EXPERIMENT

We conducted additional experiments to evaluate the ResNet18 model on the ImageNet dataset; the runtime is 81.85 seconds in our GPU environment. We slightly modified the first pooling layer to an average pooling with kernel size 2×2 and stride 2. As shown in Figure 10, the network receives input ciphertexts processed with a modified im2col. The original im2col would transform 224×224×3 images into a 147×12544 matrix. As HCNN prefers image sizes that are powers of 2, the 12544 columns are padded to 16384 columns. We further split the 16384 columns with stride 2 (s_avg in Figure 10) to perform average pooling without rotations, resulting in 4096 columns. We pack slots/4096 = 8 rows into a ciphertext; as the total number of rows is f_h f_w c_i = 147, the number of input ciphertexts is ⌈147/8⌉ × 4. After the initial convolution layer, the number of intermediate ciphertexts becomes 64, and each ciphertext stores 8 channels with m = d = 1. Table 14 shows the runtime of ResNet18 on ImageNet. During the experiment, swap memory is used to make up for the lack of CPU memory; loading weight plaintexts from swap memory, which resides on SSD, causes runtime spikes, which appear as the large Etc. entries in Table 14.

We use different brightness of colors to fill the ciphertexts and plaintexts in Figures 11 and 12 to reflect the actual computation process. [The slot-level diagrams of Figure 11, tracing ciphertexts through the SISO, RaS, Mask and Broadcast, and repack steps of CAConv with HP, are not reproduced here.]



LIMITATIONS

ResNet networks are the only models addressed in this paper. Broader experiments on various models, such as Liu et al. (2022), would demonstrate the practicality of FHE-based PI more clearly.

CONCLUSION

In this paper, we proposed an efficient convolution algorithm, RAConv, and a novel packing method, Hybrid Packing. We presented HyPHEN, an FHE-based ResNet implementation applying the proposed optimizations. Our experiments on real machines show 3.4-4.4× lower latency for ResNet20/32/44/18 compared to Lee et al. (2022a). Using GPU acceleration, HyPHEN achieves execution times of 1.44s/2.21s/2.98s/13.37s for ResNet20/32/44/18 on the CIFAR-10 dataset.



Unencrypted convolution with s = 1

Figure 1: SISO convolution. Ciphertexts and plaintexts are illustrated as 2D matrices but are actually stored in a 1D manner, with the rows of each matrix concatenated.

Figure 3: Comparison of gap packing methods to fill the gap induced by downsampling layers.

) produces a multiplex-packed output ciphertext as in Figure 3b. Second, converting a void-packed ciphertext into a duplicate-packed ciphertext requires fewer rotations than converting it into a multiplex-packed ciphertext: duplicate packing only needs O(log(gap_size)) rotations, while multiplex packing requires O(gap_size) rotations.
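The O(log(gap_size)) bound follows from a doubling rotate-and-add: each step doubles the span of duplicated data. A toy numpy sketch (np.roll standing in for homomorphic rotation; not the actual HE kernel) for a void-packed vector whose valid values sit every gap slots:

```python
import numpy as np

def duplicate_fill(ct, gap):
    # Convert void packing (one valid value every `gap` slots, zeros
    # between) into duplicate packing with log2(gap) rotate-and-add steps.
    stride = 1
    while stride < gap:
        ct = ct + np.roll(ct, stride)  # copy values `stride` slots forward
        stride *= 2                    # doubling: log2(gap) iterations total
    return ct
```

A gap of 4 is filled with 2 rotations instead of the 3 that a slot-by-slot multiplex conversion would need, and the advantage grows with the gap size.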

Figure 7: Comparison of rotation time in ResNet20 convolutional layers

Figure 9: Lazy-SISO convolution.

Figure 10: ImageNet Im2Col

Figure 11: CAConv method with HP


Figure 12: RAConv method with HP

The rotation complexity of the convolutions. We denote the numbers of input and output ciphertexts as ct_in and ct_out, and define ct_min = min(ct_in, ct_out), considering SISO and lazy-SISO.

rotations, which fill the gaps. After performing a series of CAConv and RAConv, the HP organization of the output ciphertexts returns to the initial (m_in, d_in). The complete procedures of RAConv and CAConv with HP are described in more detail in Appendix H. Compared to MP-CAConv, HP significantly reduces the rotations in RaS and IR. The rotation complexities of MP-CAConv (Lee et al., 2022a) and our hybrid-packed convolutions are shown in Table 2. For both HP convolution methods, the product of the numbers of input and output ciphertexts remains constant (ct_in · ct_out = w_i h_i c_i c_o / n

Runtime for the ResNet instances with different (m, d) parameters and packing strategies. CAConv shows a superlinear increase in the number of rotations as the number of channels increases. Overall, HyPHEN achieves a 3.4-4.4× speedup over the prior state-of-the-art implementation of Lee et al. (2022a) (Set1).

Table 4 again demonstrates that RAConv effectively reduces the overall runtime of the convolution layers, as predicted by our operation count analysis in Table 3. A detailed comparison with Lee et al. (2022a) is provided in Appendix A.

HyPHEN inference time of a single CIFAR-10 image using ResNet models on CPU and GPU. As the FC layer and pooling have tiny execution times, we gather them under Etc.

HyPHEN inference time of a single CIFAR-10 image using ResNet models on CPU. As the FC layer and pooling have tiny execution times, we gather them under Etc.

Comparison of the instances of CAConv (m,d) parameters in ResNet20.

Comparison of the instances of CAConv (m, d) parameters in ResNet18.

C RESNET ARCHITECTURE AND PARAMETERS

Figure 8 presents the model architecture of the modified ResNet20 used in the HCNN evaluation. Tables 9 and 10 show the parameters used in the convolution layers of ResNet20/32/44/18. All the parameters (c_i, c_o, w_i, h_i, w_o, h_o, f_w, f_h, s) follow the original ResNet paper (He et al., 2016).

Parameters used in the convolution layers of ResNet20/32/44

FHE parameter settings. dnum is tuned to support the 16, 3, and 6 levels required in Set1, Set2, and Set3. Ciphertext and plaintext memory sizes are reported at the maximum level, l = L.

Total memory size and the number of each object in ResNet20. We abbreviate Downsampling Block and Basic Block as DSB and BB.

Total memory size and the number of each object in ResNet18. DSB and BB refer to Downsampling Block and Basic Block.

HyPHEN inference time of a single ImageNet image using the ResNet18 model on GPU.

As shown in Figures 11 and 12, the intermediate ciphertexts of CAConv and the input ciphertexts of RAConv are c_o/4 and c_i/4 times larger in number than the input ciphertexts of CAConv, respectively. The CAConv, activation, and RAConv of a ResNet block are processed at once to mitigate the huge number of intermediate ciphertexts. To avoid an increase in the memory footprint, the operations on an input ciphertext continue until the ciphertext shrinks again: the tuple of operations {SISO, RaS, IR, Square, SISO_l} is applied to each input ciphertext, and the results are then accumulated. Blocks colored with high brightness show the actual working set, meaning that the c_o/4 times larger set of intermediate ciphertexts (colored with low brightness) is irrelevant to the peak memory consumption.


