AUTOFHE: AUTOMATED ADAPTION OF CNNS FOR EFFICIENT EVALUATION OVER FHE

Abstract

Secure inference of deep convolutional neural networks (CNNs) was recently demonstrated under fully homomorphic encryption (FHE), specifically the full residue number system variant of the Cheon-Kim-Kim-Song scheme (RNS-CKKS). The state-of-the-art solution uses a high-order composite polynomial to approximate non-arithmetic ReLUs and refreshes zero-level ciphertexts through bootstrapping. However, this solution suffers from prohibitively high latency, due both to the levels consumed by the polynomials (47% of the total) and to the inference time consumed by bootstrapping operations (70%). Furthermore, it requires a hand-crafted architecture for homomorphically evaluating CNNs, placing a bootstrapping operation after every Conv-BN layer. To accelerate CNNs on FHE and automatically design a homomorphic evaluation architecture, we propose AutoFHE: Automated adaption of CNNs for evaluation over FHE. AutoFHE exploits the varying sensitivity of approximate activations across different layers in a network and jointly evolves polynomial activations (EvoReLUs) and searches for the placement of bootstrapping operations for evaluation under RNS-CKKS. The salient features of AutoFHE include: i) a multi-objective co-evolutionary (MOCoEv) search algorithm to maximize validation accuracy and minimize the number of bootstrapping operations, ii) a gradient-free search algorithm, R-CCDE, to optimize EvoReLU coefficients, and iii) polynomial-aware training (PAT) to fine-tune polynomial-only CNNs for a few epochs so that the trainable weights adapt to EvoReLUs. We demonstrate the efficacy of AutoFHE through the evaluation of ResNets on encrypted CIFAR-10 and CIFAR-100 under RNS-CKKS. Experimental results on CIFAR-10 indicate that, in comparison to the state-of-the-art solution, AutoFHE can reduce inference time (50 images on 50 threads) by up to 3,297 seconds (43%) while preserving accuracy (92.68%).
AutoFHE also improves the accuracy of ResNet-32 on CIFAR-10 by 0.48% while accelerating inference by 382 seconds (7%).

1. INTRODUCTION

Fully homomorphic encryption (FHE) is a promising solution for secure inference of neural networks (Gilad-Bachrach et al., 2016; Brutzkus et al., 2019; Lou & Jiang, 2021; Lee et al., 2022b;a). However, homomorphically evaluating CNNs on encrypted data is challenging in two respects: 1) the design of a homomorphic evaluation architecture for deep CNNs of arbitrary depth, and 2) non-arithmetic operations like ReLU. Recently, FHE-MP-CNN (Lee et al., 2022a) successfully implemented a homomorphic evaluation architecture for ResNets by using bootstrapping (Cheon et al., 2018a; Bossuat et al., 2021) to refresh zero-level ciphertexts under the full residue number system (RNS) variant of the Cheon-Kim-Kim-Song (RNS-CKKS) scheme (Cheon et al., 2017; 2018b). However, since FHE supports only homomorphic multiplication and addition, non-arithmetic operations are approximated by polynomials (Gilad-Bachrach et al., 2016; Chou et al., 2018; Brutzkus et al., 2019; Lee et al., 2021a;c; 2022a). For example, FHE-MP-CNN adopts a high-precision Minimax composite polynomial (Lee et al., 2021a;c) with degree {15, 15, 27} to approximate ReLUs (AppReLU). A more comprehensive discussion of related work is in Appendix B.

Figure 2: Motivating AutoFHE. Left: depth consumption of AppReLUs based on the ResNet-20 backbone on CIFAR-10. The purple line uses the same-precision AppReLU in all layers, while the red circles show 5,000 randomly sampled combinations of mixed-precision layerwise AppReLUs. Middle: the number of bootstrapping operations, showing the trade-offs of the same AppReLU and mixed AppReLUs as in the left panel; we also show a multi-objective search result using mixed-precision layerwise AppReLUs and the Pareto front of the proposed AutoFHE. Right: distributions of pre-activations (maximum absolute values) of ResNets on CIFAR-10, where the green line corresponds to B, the scale value of AppReLU in FHE-MP-CNN.
FHE-MP-CNN, the state-of-the-art approach, is limited by three main design choices. First, high-precision approximations like AppReLU only consider function-level approximation and neglect the potential for end-to-end optimization of the entire network response. As such, the same high-precision AppReLU replaces all the network's ReLU layers, which necessitates evaluating very deep circuits. Second, due to the high number of levels required by each AppReLU, ciphertexts encrypted with leveled HE schemes like CKKS quickly exhaust their levels; a bootstrapping operation is therefore required for each AppReLU to refresh zero-level ciphertexts. While these design choices are collectively very effective at maintaining the performance of the plaintext networks under FHE, they require many multiplicative levels and, consequently, numerous bootstrapping operations. Third, due to the constraints imposed by the cryptographic scheme (RNS-CKKS in this case), inference of networks under FHE requires the co-design of AppReLU and the homomorphic evaluation architecture: the careful design of AppReLU (the number of composite polynomials and their degrees), cryptographic parameters, the placement of bootstrapping operations, and the choice of network architectures to evaluate. We illustrate the limitations of FHE-MP-CNN's design choices through a case study (Figure 2) of ResNet-20 on CIFAR-10. We consider two plausible solutions to trade off the accuracy and computational burden of FHE-MP-CNN. (i) Same-Precision AppReLU: We replace all ReLU layers with AppReLU of a given precision. We can trade off accuracy and depth consumption (purple line in the left panel) using AppReLUs of different precision. However, as the middle panel shows, these solutions (purple dots) do not necessarily translate into a trade-off between accuracy and the number of bootstrapping operations due to many wasted levels.
All the trade-off solutions collapse to either 15 or 30 bootstrapping operations. (ii) Mixed-Precision AppReLU: Each ReLU layer in the network can be replaced by an AppReLU of any precision. We randomly sample 5,000 combinations of mixed-precision layerwise AppReLUs and show (red dots) their depth consumption and number of bootstrapping operations in the left and middle panels, respectively. Observe that layerwise mixed-precision AppReLU leads to a better trade-off between accuracy and the number of bootstrapping operations. However, FHE-MP-CNN neglects the layerwise sensitivity (range) of ReLU pre-activations (the right panel shows the distribution of the layerwise maximum absolute value of pre-activations) and uses an AppReLU optimized for a ReLU with a large pre-activation range. Therefore, the Pareto front of mixed-precision layerwise AppReLUs optimized by the multi-objective search algorithm NSGA-II (Deb et al., 2002) is still inferior to AutoFHE, our proposed solution, by a significant margin. In summary, while both solutions we considered reduce the number of bootstrapping operations, unlike AutoFHE, they also lead to a severe loss in performance.

In this paper, we relax the design choices of FHE-MP-CNN and accelerate the inference of CNNs over homomorphically encrypted data while maximizing performance. The main premise behind our approach is to directly optimize the end-to-end function represented by the network instead of optimizing the function represented by each activation. This idea allows us to exploit the varying sensitivity of activation-function approximation across different layers in a network. Therefore, theoretically, evolving layerwise polynomial approximations of ReLU (EvoReLU) should reduce the total multiplicative depth required by the resulting polynomial-only networks, and thus the number of time-consuming bootstrapping operations and the inference time on encrypted data.
To this end, we propose AutoFHE, a search-driven approach to jointly optimize layerwise polynomial approximations of ReLU and the placement of bootstrapping operations. Specifically, we propose a multi-objective co-evolutionary (MOCoEv) algorithm that seeks to maximize accuracy while simultaneously minimizing the number of bootstrapping operations. AutoFHE jointly searches for the parameters of the approximate activation functions at all layers, i.e., their degrees and coefficients, and for the optimal placement of the bootstrapping operations in the network. Our contributions are three-fold:

RNS-CKKS:

The full residue number system (RNS) variant of Cheon-Kim-Kim-Song (RNS-CKKS) (Cheon et al., 2017; 2018b) is a leveled homomorphic encryption (HE) scheme for approximate arithmetic. Under RNS-CKKS, a ciphertext c ∈ R²_{Q_ℓ} satisfies the decryption circuit [⟨c, sk⟩]_{Q_ℓ} = m + e, where ⟨·, ·⟩ is the dot product and [·]_Q is the modular reduction function. R_{Q_ℓ} = Z_{Q_ℓ}[X]/(X^N + 1) is the residue cyclotomic polynomial ring. The modulus is Q_ℓ = ∏_{i=0}^{ℓ} q_i, where 0 ≤ ℓ ≤ L. ℓ is a non-negative integer referred to as the level, and it denotes the remaining capacity for homomorphic multiplications. sk is the secret key with Hamming weight h, m is the original plaintext message, and e is a small error that provides security. A ciphertext has N/2 slots to accommodate N/2 complex or real numbers. RNS-CKKS supports homomorphic addition and multiplication:

Homomorphic addition: Decrypt(c ⊕ c′) = Decrypt(c) + Decrypt(c′) ≈ m + m′
Homomorphic multiplication: Decrypt(c ⊗ c′) = Decrypt(c) × Decrypt(c′) ≈ m × m′ (1)

Bootstrapping: Leveled HE only allows a finite number of homomorphic multiplications, with each multiplication consuming one level due to rescaling. Once a ciphertext's level reaches zero, a bootstrapping operation is required to refresh it to a higher level and allow further multiplications. The number of levels needed to evaluate a circuit is known as its depth. RNS-CKKS with bootstrapping (Cheon et al., 2018a) is an FHE scheme that can evaluate circuits of arbitrary depth, which enables us to homomorphically evaluate deep CNNs on encrypted data. Conceptually, bootstrapping homomorphically evaluates the decryption circuit and raises the modulus from Q_0 to Q_L by using the isomorphism R_{Q_L} ≅ R_{q_0} × R_{q_1} × ··· × R_{q_L} (Cheon et al., 2018a; Bossuat et al., 2021). Practically, bootstrapping (Cheon et al., 2018a) homomorphically evaluates the modular reduction [·]_Q by first approximating it with a scaled sine function, which is in turn approximated by polynomials (Cheon et al., 2018a; Lee et al., 2020). Bootstrapping (Bossuat et al., 2021) has four stages: ModRaise, CoeffToSlot, EvalMod, and SlotToCoeff. These stages involve many homomorphic multiplications and rotations, both of which are costly operations, especially the latter. The refreshed ciphertext has level ℓ = L − K, where K levels are consumed by bootstrapping (Bossuat et al., 2021) for the polynomial approximation of modular reduction.

FHE-MP-CNN

FHE-MP-CNN (Lee et al., 2022a) is the state-of-the-art framework for homomorphically evaluating deep CNNs on encrypted data under RNS-CKKS with high accuracy. Its salient features include: 1) Compact Packing: All channels of a tensor are packed into a single ciphertext, and multiplexed parallel (MP) convolution was proposed to process the ciphertext efficiently. 2) Homomorphic Evaluation Architecture: Bootstrapping operations are placed after every Conv-BN, except for the first one, to refresh zero-level ciphertexts. This hand-crafted homomorphic evaluation architecture for ResNets is determined by the choice of cryptographic parameters, the level consumption of operations, and ResNet's architecture. 3) AppReLU: All ReLUs are replaced with the same high-order Minimax composite polynomial (Lee et al., 2021a;c) of degrees {15, 15, 27}. Noting that ReLU(x) = x · (0.5 + 0.5 · sgn(x)), where sgn(x) is the sign function, the approximated ReLU (AppReLU) is modeled as AppReLU(x) = x · (0.5 + 0.5 · p_α(x)), x ∈ [−1, 1], where p_α(x) is the composite Minimax polynomial and the precision α is defined by |p_α(x) − sgn(x)| ≤ 2^{−α}. AppReLU is extended to the arbitrary pre-activation domains x ∈ [−B, B] that arise in CNNs by scaling: B · AppReLU(x/B). However, this reduces the approximation precision to B · 2^{−α}. To estimate the maximum dynamic range B (40 for CIFAR-10 and 65 for CIFAR-100) of ReLUs, FHE-MP-CNN evaluates the pretrained network on the training dataset. FHE-MP-CNN uses the same dynamic range B for all polynomials and neglects the uneven distribution of pre-activations shown in Figure 2. Explicitly accounting for this uneven distribution allows us to use a smaller B′ and α′ at the same precision, i.e., B′ · 2^{−α′} = B · 2^{−α}, with B′ < B and α′ < α. 4) Cryptographic Parameters: FHE-MP-CNN sets N = 2^16, L = 30, and Hamming weight h = 192. These parameters provide 128 bits of security (Cheon et al., 2019). Please refer to Lee et al. (2022a) for the detailed implementation of FHE-MP-CNN and other parameters. 5) Depth Consumption: To reduce level consumption, FHE-MP-CNN folds the scaling parameter B into Conv-BN. The multiplicative depth consumed by bootstrapping (i.e., K), AppReLU, Conv, DownSampling, AvgPool, FC, and BN layers is 14, 14, 2, 1, 1, 1, and 0, respectively. Statistically, when using FHE-MP-CNN to homomorphically evaluate ResNet-20/32/44/56 on CIFAR-10 or CIFAR-100, AppReLUs consume ∼47% of total levels and bootstrapping operations consume ∼70% of inference time.
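These per-operation costs make depth accounting mechanical. The following sketch (our own, purely illustrative; names are not from the FHE-MP-CNN implementation) tallies the multiplicative depth of one homomorphically evaluated ResNet basic block from the costs quoted above:

```python
# Per-operation multiplicative-depth costs quoted in the text:
# bootstrapping refreshes K = 14 levels, AppReLU consumes 14, Conv 2,
# DownSampling/AvgPool/FC 1 each, and BN 0.
DEPTH = {"bootstrap": 14, "apprelu": 14, "conv": 2,
         "downsample": 1, "avgpool": 1, "fc": 1, "bn": 0}

def circuit_depth(ops):
    """Total levels consumed by a sequence of homomorphic operations."""
    return sum(DEPTH[op] for op in ops)

# One ResNet basic block evaluated homomorphically:
# Conv-BN-AppReLU-Conv-BN(-residual add)-AppReLU.
basic_block = ["conv", "bn", "apprelu", "conv", "bn", "apprelu"]
print(circuit_depth(basic_block))  # 32
```

With L = 30 and K = 14, a refreshed ciphertext holds only 16 levels, so a 32-level block cannot be evaluated without interleaved bootstrapping, which is why FHE-MP-CNN bootstraps after every Conv-BN.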

3. AUTOFHE: JOINT EVORELU AND BOOTSTRAPPING SEARCH

To minimize the latency of secure inference, which is dominated by the bootstrapping operations induced by high-degree polynomials, and to automatically design a suitable homomorphic evaluation architecture, we propose AutoFHE. It is designed to search for layerwise polynomial approximations of ReLU jointly with the placement of bootstrapping. Furthermore, we directly optimize the end-to-end objective to facilitate finding the optimal combination of layerwise polynomials.

3.1. EVORELU

EvoReLU is defined as y = EvoReLU(x) = x · (0.5 + p_d(x)), x ∈ [−1, 1], y ∈ [0, 1]. The composite polynomial p_d(x) = (p_K^{d_K} ∘ ··· ∘ p_k^{d_k} ∘ ··· ∘ p_1^{d_1})(x), 1 ≤ k ≤ K, approximates 0.5 · sgn(x). The composite polynomial p_d(x) has K sub-polynomial functions and degree d = ∏_{k=1}^{K} d_k. This structure for EvoReLU bears similarity to the Minimax composite polynomial in Lee et al. (2021c; 2022a); however, the objective for optimizing the coefficients is significantly different. We represent the composite polynomial p_d(x) by its degree vector d = {d_k}_{k=1}^{K}, and each sub-polynomial p_k^{d_k}(x) as a linear combination of Chebyshev polynomials of degree up to d_k, i.e., p_k^{d_k}(x) = β_k · Σ_{i=1}^{d_k} α_i T_i(x), where T_i(x) are the Chebyshev bases of the first kind, α_i are the coefficients of the linear combination, and β_k is a parameter that scales the output. The coefficients α_k = {α_i}_{i=1}^{d_k} control the polynomial's shape, while β_k controls its amplitude. EvoReLU with degree vector d is parameterized by λ = (α_1, β_1, ···, α_k, β_k, ···, α_K, β_K):

y = EvoReLU(x, λ; d) = x · (0.5 + p_d(x)), x ∈ [−B, B], y ∈ [0, B],

where we estimate B values for layerwise EvoReLUs on the training dataset. As shown in Figure 3, FHE-MP-CNN places bootstrapping after every Conv-BN, while AutoFHE searches for the placement of bootstrapping operations by adapting to the different depth consumption of layerwise EvoReLUs. The depth consumption of EvoReLU is 1 + Σ_{k=1}^{K} ⌈log₂(d_k + 1)⌉ when the Baby-Step Giant-Step (BSGS) algorithm (Lee et al., 2020; Bossuat et al., 2021) is used to evaluate p_d(x).
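As a concrete illustration, EvoReLU evaluation and its depth formula can be sketched in a few lines of numpy (a minimal sketch of ours; function names and example degrees are illustrative, not from the released implementation):

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def sub_poly(y, alpha, beta):
    # p_k(y) = beta_k * sum_{i=1}^{d_k} alpha_i * T_i(y); a leading zero is
    # prepended so that alpha_1 multiplies T_1, matching the text's indexing.
    return beta * C.chebval(y, np.concatenate(([0.0], np.asarray(alpha))))

def evorelu(x, params):
    """params = [(alpha_1, beta_1), ..., (alpha_K, beta_K)];
    returns x * (0.5 + p_d(x)) for x in [-1, 1]."""
    y = x
    for alpha, beta in params:
        y = sub_poly(y, alpha, beta)
    return x * (0.5 + y)

def depth(degrees):
    # Depth of EvoReLU under BSGS evaluation: 1 + sum_k ceil(log2(d_k + 1)).
    return 1 + sum(int(np.ceil(np.log2(d + 1))) for d in degrees)

print(depth([7, 7, 27]))  # 12
```

For comparison, FHE-MP-CNN's fixed {15, 15, 27} AppReLU consumes 1 + 4 + 4 + 5 = 14 levels under the same formula, so lower-degree layerwise compositions directly save depth.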

3.2. MOCOEV: MULTI-OBJECTIVE CO-EVOLUTIONARY SEARCH

Search Objectives: Given a neural network function f with L ReLUs and pre-trained weights ω_0, our goal is to maximize the accuracy of the network while minimizing its inference latency on encrypted data. A possible way to achieve this is to maximize validation accuracy while minimizing the total multiplicative depth of the network with EvoReLUs. However, this does not practically accelerate inference, since bootstrapping contributes most of the latency and lower depth does not necessarily lead to fewer bootstrapping operations. Therefore, we optimize the parameters of all EvoReLUs to maximize accuracy and directly minimize the number of bootstrapping operations through a multi-objective optimization problem:

min_D {1 − Acc_val(f(ω*); Λ*(D), D), Boot(D)}
s.t. Λ* = argmax_Λ {Acc_val(f(ω_0); Λ(D), D)},
     ω* = argmin_ω L_train(f(ω); Λ*(D), D),   (3)

where Acc_val is the Top-1 accuracy on the validation dataset, Boot is the number of bootstrapping operations, D = {d_1, d_2, ···, d_L} is the degree vector of all EvoReLUs, Λ = {λ_1, λ_2, ···, λ_L} are the corresponding parameters, f(ω_0) is the neural network with pre-trained weights ω_0, and L_train is the training loss. Given a degree vector D, the number and placement of bootstrapping operations can be determined deterministically. Given D, we can optimize Λ to maximize validation accuracy, and we further fine-tune the network f(·) to minimize the training loss L_train. The objectives in Equation 3 guide the search algorithm to i) explore layerwise EvoReLUs, including their degrees and coefficients; ii) discover a placement of bootstrapping operations that works well with EvoReLUs; and iii) trade off validation accuracy against inference speed to return a diverse set of Pareto-effective solutions. In this paper, we propose MOCoEv to optimize the multi-objective min_D {1 − Acc_val(f(ω*); Λ*(D), D), Boot(D)}.
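The claim that, given D, the number and placement of bootstrapping operations follow deterministically can be illustrated with a level-tracking sketch (our own simplification; the layer sequence and per-layer costs below are illustrative, while L = 30 and K = 14 follow the text):

```python
# Walk the network front to back, subtract each operation's depth from the
# remaining level budget, and insert a bootstrap (refreshing to L - K levels)
# whenever the budget would go negative.
L_MAX, K_BOOT = 30, 14

def place_bootstraps(layer_depths):
    level, placement = L_MAX - K_BOOT, []
    for i, d in enumerate(layer_depths):
        if d > level:                   # not enough levels left: refresh first
            placement.append(i)
            level = L_MAX - K_BOOT
        level -= d
    return placement

# Alternating Conv-BN (depth 2) and a low-degree EvoReLU (depth 4, say):
layers = [2, 4] * 8
print(place_bootstraps(layers))  # [5, 10, 15]
```

With 14-level AppReLUs, the same walk forces a bootstrap at essentially every activation; lower-degree EvoReLUs let several layers share one refresh, which is exactly the lever the search exploits.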
We propose R-CCDE with an evolutionary criterion to maximize Acc_val(f(ω_0); Λ(D), D), and we propose PAT to fine-tune approximated networks with EvoReLUs to minimize L_train(f(ω); Λ*(D), D).

Search Space: Our search space includes the number of sub-polynomials (K) in the composite polynomial, the choice of degree for each sub-polynomial (d_k), and the polynomial coefficients Λ. Table 1a shows the options for each of these variables:

Table 1a. Variable | Option
# polynomials (K) | 6
poly degree (d_k) | {0, 1, 3, 5, 7}
coefficients (Λ) | R

Note that the choice d_k = 0 corresponds to an identity placeholder, so the composite polynomial may effectively have fewer than K sub-polynomials. Furthermore, when the degree of (p_k^{d_k} ∘ p_{k−1}^{d_{k−1}})(x) is less than or equal to 31 (the maximum degree of a polynomial supported on RNS-CKKS (Lee et al., 2021a;c)), we merge the two sub-polynomials into a single sub-polynomial p_k^{d_k}(p_{k−1}^{d_{k−1}}(x)) with degree d_k · d_{k−1} ≤ 31 before computing its depth. This reduces the size of the search space and leads to smoother exploration. Table 1b lists the number of ReLUs in our backbone models and the corresponding dimension and size of the search space for D.

MOCoEv: To overcome the challenge of multi-objective search over the high-dimensional D and explore the massive search space, we propose a multi-objective co-evolutionary (MOCoEv) search algorithm inspired by the divide-and-conquer strategy of cooperative co-evolution (CC) (Yang et al., 2008; Mei et al., 2016; Ma et al., 2018). The key idea of MOCoEv is to decompose the high-dimensional multi-objective search problem into multiple low-dimensional sub-problems. MOCoEv includes i) Decomposition: given a Pareto-effective solution D = {d_1, d_2, ···, d_L}, MOCoEv improves D by locally mutating d_ℓ, 1 ≤ ℓ ≤ L, so that D′ = {d_1, d_2, ···, d′_ℓ, ···, d_L} dominates D = {d_1, d_2, ···, d_ℓ, ···, d_L}; and ii) Cooperative Evaluation (Mei et al., 2016): we evaluate the locally mutated solutions cooperatively with the other variables d_j, j ≠ ℓ, 1 ≤ j ≤ L. Figure 4 shows one step of a MOCoEv iteration; during one iteration, we repeat this step L times until all EvoReLUs are updated.

We design crossover and co-evolutionary mutation operators for MOCoEv to explore and exploit: (1) Crossover: given the current Pareto front, we select mating individuals to generate offspring and cross over the offspring to exchange genes across EvoReLUs. For example, given two mating individuals D_1 and D_2, we cross them to obtain D′_1 = {b_ℓ : b_ℓ ∈ D_1 ∪ D_2, 1 ≤ ℓ ≤ L} and D′_2 = {b_ℓ : b_ℓ ∈ (D_1 ∪ D_2) \ D′_1, 1 ≤ ℓ ≤ L}. (2) Co-Evolutionary Mutation: we mutate the ℓ-th EvoReLU of the offspring, obtain a new Pareto front from the mutated offspring and the current population, and finally update the current population. Then we move on to the (ℓ+1)-th EvoReLU and repeat (1)-(2) until all L EvoReLUs are updated; therefore, we update the Pareto front L times at the end of each iteration. We design three types of operators to mutate a composite polynomial function: i) randomly replace one sub-polynomial with a new polynomial; ii) randomly remove a sub-polynomial; iii) randomly insert a new polynomial. Please refer to Appendix C for background on evolutionary search algorithms. The implementation details of the MOCoEv algorithm are in Appendix D.1.
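The crossover and the three mutation operators described above can be sketched as follows (a simplified illustration of ours; function names, the uniform gene-exchange rule, and the use of `random.Random` are assumptions, not the paper's implementation):

```python
import random

def crossover(D1, D2, rng=random.Random(0)):
    """Gene exchange across EvoReLU positions: child 1 takes each layer's
    degree vector from either parent; child 2 takes the other, so together
    the children partition the parents' genes."""
    C1, C2 = [], []
    for g1, g2 in zip(D1, D2):
        if rng.random() < 0.5:
            C1.append(g1); C2.append(g2)
        else:
            C1.append(g2); C2.append(g1)
    return C1, C2

def mutate(d, options=(0, 1, 3, 5, 7), rng=random.Random(0)):
    """The three mutation operators applied to one composite polynomial
    d = (d_1, ..., d_K): replace, remove, or insert a sub-polynomial.
    Degree 0 acts as the identity placeholder from the search space."""
    d = list(d)
    op = rng.choice(["replace", "remove", "insert"])
    if op == "replace":
        d[rng.randrange(len(d))] = rng.choice(options)
    elif op == "remove" and len(d) > 1:
        d[rng.randrange(len(d))] = 0
    else:  # insert: activate an identity slot with a nonzero degree
        zeros = [i for i, x in enumerate(d) if x == 0]
        if zeros:
            d[rng.choice(zeros)] = rng.choice(options[1:])
    return d
```

Representing "remove" and "insert" as toggling the 0 placeholder keeps every individual at a fixed length K, which matches the fixed-size encoding of the search space.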

3.3. REGULARIZED COOPERATIVE DIFFERENTIABLE CO-EVOLUTION

To solve Λ* = argmax_Λ {Acc_val(f(ω_0); Λ(D), D)} in Equation 3, where D = {d_ℓ | 1 ≤ ℓ ≤ L} and Λ = {λ_ℓ | 1 ≤ ℓ ≤ L}, we propose regularized cooperative co-evolutionary differentiable evolution (R-CCDE). Given a degree vector d_ℓ, it optimizes λ_ℓ at the function-approximation level. However, the function-approximation solution Λ may not be the optimal solution for max_Λ {Acc_val(f(ω); Λ(D), D)}, so we use MOCoEv to update the Pareto front in terms of validation accuracy and the number of bootstrapping operations. R-CCDE decomposes λ_ℓ into {α_1, β_1, ···, α_K, β_K}, corresponding to the polynomial sub-functions y_1 = p_1^{d_1}(x | α_1, β_1), y_2 = p_2^{d_2}(y_1 | α_2, β_2), ···, y = p_K^{d_K}(y_{K−1} | α_K, β_K), by using the forward architecture x → y_1 → y_2 → ··· → y_{K−1} → y. We adopt gradient-free differentiable evolution (DE) (Rauf et al., 2021) to learn α and β; DE uses the difference between individuals for mutation. Given the context vector λ*, we optimize α_k and β_k, 1 ≤ k ≤ K, alternately as:

α*_k = argmin_{α_k} L(α_k | λ*), where α_k | λ* = (α*_1, β*_1, ···, α_k, ···, α*_K, β*_K),
β*_k = argmin_{β_k} L(β_k | λ*) + γ · β_k², where β_k | λ* = (α*_1, β*_1, ···, β_k, ···, α*_K, β*_K),   (4)

where L(·) is the ℓ1 distance between p_d(x) and 0.5 · sgn(x). α*_k and β*_k are then used to update λ*. We introduce a regularization term with scaling decay γ for optimizing the scale parameters: the scale parameters prevent polynomials from growing exponentially during the initial iterations, and the decay gradually guides them toward a value of one, eventually selecting promising coefficients. Our R-CCDE algorithm is detailed in Appendix D.2.
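As an illustration of the inner loop, the following sketch fits one sub-polynomial's Chebyshev coefficients to 0.5 · sgn(x) under the ℓ1 loss with the β regularizer from Equation 4 (a self-contained toy of ours: the DE/rand/1/bin operators, population size, grid of evaluation points, and fixing β = 1 are our simplifications, not the paper's settings):

```python
import numpy as np
from numpy.polynomial import chebyshev as C

rng = np.random.default_rng(0)
xs = np.linspace(-1, 1, 513)
target = 0.5 * np.sign(xs)

def loss(alpha, beta=1.0, gamma=0.01):
    # l1 distance between beta * sum_i alpha_i T_i(x) and 0.5*sgn(x),
    # plus the scaling-decay regularizer gamma * beta^2.
    p = beta * C.chebval(xs, np.concatenate(([0.0], alpha)))
    return np.mean(np.abs(p - target)) + gamma * beta**2

def de_minimize(f, dim, bounds, pop=40, iters=150, F=0.5, CR=0.9):
    """Classic DE: mutation by scaled differences of random individuals,
    binomial crossover, and greedy selection against the parent."""
    lo, hi = bounds
    X = rng.uniform(lo, hi, (pop, dim))
    fit = np.array([f(x) for x in X])
    for _ in range(iters):
        for i in range(pop):
            a, b, c = X[rng.choice(pop, 3, replace=False)]
            v = np.clip(a + F * (b - c), lo, hi)          # mutation
            u = np.where(rng.random(dim) < CR, v, X[i])   # crossover
            fu = f(u)
            if fu <= fit[i]:                              # selection
                X[i], fit[i] = u, fu
    best = int(np.argmin(fit))
    return X[best], float(fit[best])

# Fit the coefficients of a single degree-7 sub-polynomial (beta fixed at 1):
alpha, best = de_minimize(loss, dim=7, bounds=(-5, 5))
```

In R-CCDE the same DE routine would be run alternately over each α_k and β_k against the context vector, rather than over a single block of coefficients as here.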

3.4. PAT: POLYNOMIAL-AWARE TRAINING

Replacing ReLU with EvoReLU in pre-trained neural networks injects minor approximation errors, which lead to performance loss. Fine-tuning can mitigate this loss by allowing the learnable weights to adapt to the approximation error. However, backpropagating through EvoReLU leads to exploding gradients due to the high-degree polynomials. Thanks to the precise forward approximation of EvoReLU, we can instead use gradients from the original non-arithmetic ReLU function for backpropagation. Specifically, during the forward pass, EvoReLU injects slight errors, which are captured by objective functions like the cross-entropy loss. During the backward pass, we bypass EvoReLU and use ReLU to compute gradients to update the weights of the linear trainable layers (e.g., convolution or fully connected). We refer to this procedure, which bears similarity to STE (Bengio et al., 2013) and QAT (Jacob et al., 2018), as polynomial-aware training (PAT). As a simple example, consider EvoReLU(x) = x(0.5 + (f_3 ∘ f_2 ∘ f_1)(x)). In the forward pass, we first scale the coefficients of f_3 by B so that the output range of y is [0, B]. In the backward pass, we compute the gradient ∂y/∂ReLU(x) instead of ∂y/∂EvoReLU(x).

Hyperparameters: For MOCoEv, we use a population size of 50 and run it for 20 generations. We set the probability of polynomial replacement to 0.5, the probability of polynomial removal to 0.4, and the probability of polynomial insertion to 0.1. For R-CCDE, we set the search domain of α to [-5, 5] and that of β to [1, 5]. We set the population size for optimizing β to 20; for α, we set the population size to 10× the number of variables. We set the scaling decay to γ = 0.01 and the number of iterations to 200. For PAT, we use a batch size of 512 and a weight decay of 5 × 10^-3, and we clip gradients to 0.5. We use learning rates of 5 × 10^-4 for CIFAR-10 and 2 × 10^-4 for CIFAR-100.
During the MOCoEv search, we fine-tune for one epoch. After the search is done, we fine-tune the searched polynomial networks for ten epochs.
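The PAT forward/backward substitution described in Section 3.4 can be sketched in numpy as follows (a toy degree-5 polynomial of our choosing stands in for a searched EvoReLU; names are illustrative):

```python
import numpy as np

def p(x):
    # Toy smooth odd approximation of 0.5 * sgn(x) on [-1, 1]; a searched
    # EvoReLU would use a composite Chebyshev polynomial instead.
    return x * (15 - 10 * x**2 + 3 * x**4) / 16

def pat_forward(x):
    # Forward pass evaluates the polynomial activation x * (0.5 + p(x)),
    # so its small approximation error flows into the training loss.
    return x * (0.5 + p(x))

def pat_backward(x, grad_out):
    # Backward pass bypasses the polynomial and uses ReLU's gradient,
    # avoiding the exploding gradients of high-degree polynomials.
    return grad_out * (x > 0)

x = np.array([-0.5, 0.2, 0.8])
y = pat_forward(x)                    # approximately ReLU(x)
g = pat_backward(x, np.ones_like(x))  # exactly ReLU's gradient: [0., 1., 1.]
```

In a deep-learning framework, the same effect is obtained with a custom autograd function whose backward method returns the ReLU gradient, in the spirit of a straight-through estimator.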

4.1. PARETO FRONTS OF AUTOFHE

Figure 5 shows Pareto-effective solutions found by AutoFHE on CIFAR-10 and CIFAR-100 for different ResNet models. The trade-offs are between the Top-1 validation accuracy on plaintext data and the number of bootstrapping operations required by the corresponding homomorphic evaluation architecture. By optimizing the end-to-end network prediction function, AutoFHE adapts to the differing sensitivity of the activation layers to approximation errors and reduces the number of levels required compared to using the same high-degree AppReLU in all layers. Thus, AutoFHE significantly reduces the number of bootstrapping operations. For ResNet-32 on CIFAR-10, AutoFHE removes 10 bootstrapping operations (33.33%) compared to FHE-MP-CNN, with a negligible accuracy loss of 0.08% relative to the original network with ReLUs. Lastly, AutoFHE provides a family of solutions offering different trade-offs rather than a single solution, providing flexible choices for practical deployment. Due to the high computational cost of validating network performance on encrypted data under RNS-CKKS, we select nine solutions for evaluation on a machine with an AMD EPYC 7H12 64-core processor and 1000 GB of RAM. In Table 2, we evaluate three solutions for ResNet-32 on both CIFAR-10 and CIFAR-100, and one solution each for ResNet-20/-44/-56. We estimate the inference time for 50 images on 50 CPU threads; amortized inference time is the runtime per image. We report the Top-1 accuracy of AutoFHE on all 10,000 encrypted validation images under RNS-CKKS. We plot Pareto fronts on CIFAR-10 of AutoFHE versus the baseline in Figure 1, and Figure 6 shows Pareto fronts of AutoFHE for ResNet-32 on encrypted CIFAR-10/-100. We observe that, on CIFAR-10, AutoFHE provides significant acceleration while matching or improving accuracy.
AutoFHE for ResNet-32 with 21 bootstrapping operations has slightly better accuracy than the ResNet-44 baseline and accelerates inference by 3,297 seconds. We also observe that the last EvoReLU, close to the output, does not need a high-precision approximation.

4.3. ABLATION STUDY

1. Evaluating Co-evolution: To evaluate the effectiveness of co-evolution in MOCoEv, we compare MOCoEv with the standard multi-objective evolutionary algorithm NSGA-II (Deb et al., 2002) in Appendix E. The experimental results in Figure 8 and Table 4 show that co-evolution explores the optimization landscape of high-dimensional variables more effectively.

2. Evaluating Layerwise Approximation:

To evaluate the effectiveness of the adaptive layerwise approximation in AutoFHE, we compare it with uniformly distributed Minimax composite polynomials in Appendix F. Figure 9 shows that, by exploiting the varying approximation sensitivity of different layers, AutoFHE achieves a better trade-off than uniformly distributed Minimax polynomials.

3. Evaluating AutoFHE on ImageNet:

We demonstrate the efficacy of AutoFHE on ImageNet (Russakovsky et al., 2015), a large-scale dataset of high-resolution images, in Appendix G. The experimental results in Figure 11 show that AutoFHE can effectively trade off accuracy and depth consumption on large-scale datasets with high-resolution images.

5. CONCLUSION

This paper introduced AutoFHE, an automated approach for accelerating CNNs on FHE and automatically designing a homomorphic evaluation architecture. AutoFHE seeks to approximate the end-to-end function represented by the network instead of approximating each activation function. We exploited the varying sensitivity of approximate activations across different layers in a network to jointly evolve composite polynomial activation functions and search for placement of bootstrapping operations for evaluation under RNS-CKKS. Experimental results over ResNets on CIFAR-10 and CIFAR-100 indicate that AutoFHE can reduce the inference time by up to 3,297 seconds (43%) while preserving the accuracy. AutoFHE also improves the accuracy by up to 0.48%. Although our focus in this paper was on ResNets, and consequently ReLU, AutoFHE is a general-purpose algorithm that is agnostic to the network architecture or the type of activation function.

Notation | Domain | Description
N | Z+ | degree of polynomial rings
ℓ | {0, 1, ···, L} | level
Q_ℓ | Z+ | modulus Q_ℓ = ∏_{i=0}^{ℓ} q_i, where the q_i are primes
R_{Q_ℓ} | - | residue cyclotomic polynomial ring
m | R_{Q_ℓ} | plaintext
e | R_{Q_ℓ} | error
c | R²_{Q_ℓ} | ciphertext
sk | R²_{Q_ℓ} | secret key
⟨·, ·⟩ | - | dot product
[·]_Q | - | modular reduction function
h | Z+ | Hamming weight
x, y | R | scalars
B | R | the maximum absolute value of ReLU pre-activations
p_α(·) | - | a Minimax composite polynomial with precision α
p_d(x) | - | a composite polynomial p_d(x) = (p_K^{d_K} ∘ ··· ∘ p_k^{d_k} ∘ ··· ∘ p_1^{d_1})(x), 1 ≤ k ≤ K, with degree d = ∏_{k=1}^{K} d_k
p_k^{d_k}(x) | - | a sub-polynomial p_k^{d_k}(x) = β_k · Σ_{i=1}^{d_k} α_i T_i(x), with α_i, β_k ∈ R and T_i the Chebyshev bases of the first kind
d | {d_k}_{k=1}^{K} | degrees of all sub-polynomials
α_k | {α_i}_{i=1}^{d_k} | coefficients of a sub-polynomial function
EvoReLU | - | y = EvoReLU(x) = x · (0.5 + p_d(x)), x ∈ [−1, 1], y ∈ [0, 1]
λ | - | EvoReLU's parameters λ = (α_1, β_1, ···, α_K, β_K)

Multi-Objective EA (MOEA): Given two d-dimensional vectors x_1 and x_2 for a minimization problem, if x_{1,i} ≤ x_{2,i} for all i ∈ {1, 2, ···, d} and x_{1,j} < x_{2,j} for some j ∈ {1, 2, ···, d}, then x_1 dominates x_2 (Srinivas & Deb, 1994). This means x_1 is better than x_2, denoted x_1 ≺ x_2.

DE uses the following operators, where F is the scaling factor, CR is the crossover rate, F(·) is the fitness evaluation function, and U(0, 1) is the uniform distribution between 0 and 1:
Mutation: v = x_{π_1} + F · (x_{π_2} − x_{π_3}), 1 ≤ π_1, π_2, π_3 ≤ N_P
Crossover: u_j = v_j if U(0, 1) ≤ CR, otherwise x_{π_1, j}, 1 ≤ j ≤ d
Selection: u = u if F(u) ≥ F(x_{π_1}), otherwise x_{π_1}

Cooperative Co-Evolution (CC) algorithms were proposed to address the challenge of optimizing high-dimensional variables (Yang et al., 2008; Mei et al., 2016; Ma et al., 2018). Co-evolution decomposes a high-dimensional optimization problem into low-dimensional sub-problems; we can then apply EAs or DE to solve the sub-problems for discrete or continuous variables, respectively.
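The dominance relation and Pareto-front selection used throughout can be sketched as follows (an illustrative snippet of ours, not the paper's implementation; the example objective values are made up):

```python
def dominates(x1, x2):
    """x1 dominates x2 for minimization: no worse in every objective and
    strictly better in at least one (Srinivas & Deb, 1994)."""
    return (all(a <= b for a, b in zip(x1, x2))
            and any(a < b for a, b in zip(x1, x2)))

def pareto_front(points):
    """Keep the non-dominated points."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Objectives per individual: (1 - accuracy, number of bootstraps).
pts = [(0.07, 20), (0.08, 15), (0.10, 15), (0.09, 25)]
print(pareto_front(pts))  # [(0.07, 20), (0.08, 15)]
```

A full MOEA such as NSGA-II additionally sorts the dominated points into successive fronts and breaks ties by crowding distance, as the Pareto function in Appendix D.1 does.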
CC includes two major stages: decomposition and cooperative evaluation. Decomposition refers to grouping variables; simple grouping strategies include random grouping and interaction-based (gradient-based) grouping (Mei et al., 2016). When the proposed R-CCDE searches for the parameters λ = (α_1, β_1, ···, α_K, β_K) of EvoReLU(x, λ | d), we decompose λ into α_1, β_1, ···, α_K, β_K, corresponding to the polynomial sub-functions y_1 = p_1^{d_1}(x | α_1, β_1), y_2 = p_2^{d_2}(y_1 | α_2, β_2), ···, y = p_K^{d_K}(y_{K−1} | α_K, β_K). Because the scaling parameter β adjusts the amplitude of the polynomials, we evolve β before α. We maintain 2K populations for α_1, β_1, ···, α_K, β_K separately; these populations are called sub-populations or species in CC. This decomposition in R-CCDE takes advantage of the forward architecture of composite polynomials, x → y_1 → y_2 → ··· → y_{K−1} → y.

When the proposed MOCoEv searches for D = {d_1, d_2, …, d_L} to minimize the objectives {1 − Acc_val(f(ω*); Λ*(D), D), Boot(D)}, we evolve sub-populations for d_1, d_2, …, d_L separately. This decomposes the original problem, with dimension 114 ∼ 330, into sub-problems of dimension six and greatly reduces the search space size from 10^79 ∼ 10^230 to 10^4. Cooperative evaluation refers to cooperatively evaluating an individual's fitness in a sub-population: we must take the other sub-populations into account when evaluating the individual. R-CCDE is a single-objective optimization, so we maintain a context vector Mei et al. (2016) λ* = (α*_1, β*_1, …, α*_K, β*_K). When evaluating α_k or β_k, we simply replace the corresponding α*_k or β*_k and assign α_k or β_k the fitness of (α*_1, β*_1, …, α_k, …, α*_K, β*_K) or (α*_1, β*_1, …, β_k, …, α*_K, β*_K), respectively. In the beginning, the context vector is randomly initialized; in the end, R-CCDE outputs the context vector as the searched result. We extend the context vector to multi-objective optimization: MOCoEv maintains context vectors that form the current Pareto front. Therefore, MOCoEv can effectively improve the Pareto front using CC.
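The context-vector evaluation described above can be sketched as follows; `fitness_fn` and the flat block layout are illustrative placeholders, not the paper's implementation.

```python
def cooperative_fitness(candidate, k, context, fitness_fn):
    """Fitness of one sub-population member, completed by the context vector.

    context: the current best blocks, e.g. [alpha_1*, beta_1*, ..., alpha_K*, beta_K*].
    candidate: a trial value for block k; every other block is taken from context.
    """
    trial = list(context)      # copy so the shared context is left untouched
    trial[k] = candidate       # splice the candidate into position k
    return fitness_fn(trial)   # fitness of the full parameter vector
```

Each sub-population for block k scores its members through this function, and the best candidate then replaces `context[k]`.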

D THE PROPOSED SEARCH ALGORITHMS D.1 MULTI-OBJECTIVE COOPERATIVE EVOLUTION

Algorithm 1 shows the details of our proposed MOCoEv search algorithm. MOCoEv takes as input a neural network f with L ReLUs that will be replaced by EvoReLUs, the number of sub-functions of a composite polynomial K, the population size N_P, the number of iterations T, and the initial population size N_0. We set N_0 ≫ N_P because random initialization will generate invalid individuals, i.e., individuals that lead to negative levels. A dataset such as CIFAR-10, with training and validation splits, is used: we randomly sample a subset of the training split as a minival dataset to guide the search, and the training split is used for fine-tuning. During search and fine-tuning, the validation split is strictly unseen; we report the Top-1 accuracy on the validation split as the final result. MOCoEv outputs the Pareto front, namely the population, which is composed of non-dominated individuals with varying numbers of bootstrapping operations. During initialization, we randomly initialize the population with N_0 individuals {D_1, D_2, …, D_{N_0}}, where D_j = {d_1, …, d_L} ∈ Z^{L×K}, 1 ≤ j ≤ N_0. Each d_i, 1 ≤ i ≤ L, holds the degrees of a composite polynomial and is randomly sampled using the Latin hypercube sampling (LHS) method. The composite polynomials with layer index i constitute the i-th sub-population of CC. The proposed R-CCDE searches for the coefficients of the composite polynomials. The Pareto function first uses nondominated sorting to find Pareto fronts and then uses the crowding distance to select individuals given the population size N_P. In iteration t, we evolve EvoReLUs sequentially, one by one. Given the i-th EvoReLU, we first randomly select mating individuals from the population based on their accuracy on the minival dataset. We crossover mating individuals to generate offspring; crossover operates at the network level. Then, we mutate the i-th sub-population.
We randomly replace, remove, and insert polynomials of the i-th sub-population with probabilities P_replace, P_remove, and P_insert, respectively. We then apply R-CCDE to search for the coefficients of the i-th sub-population and use PAT to fine-tune the network. We evaluate the fine-tuned networks on the minival dataset. Finally, we use the Pareto function to obtain the new Pareto front given the population size N_P; individuals in the Pareto front replace the population.
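The replace/remove/insert mutation on a single EvoReLU's degree list might look like the following sketch; the probability values and the candidate degree set are illustrative placeholders, not the paper's settings.

```python
import random

def mutate_degrees(d, choices, p_replace=0.3, p_remove=0.2, p_insert=0.2):
    """Mutate one EvoReLU's list of sub-polynomial degrees.

    With probability p_replace, replace a random degree with one from `choices`;
    with p_remove, drop a random degree (keeping at least one);
    with p_insert, insert a new degree at a random position.
    """
    d = list(d)
    r = random.random()
    if r < p_replace and d:
        d[random.randrange(len(d))] = random.choice(choices)
    elif r < p_replace + p_remove and len(d) > 1:
        d.pop(random.randrange(len(d)))
    elif r < p_replace + p_remove + p_insert:
        d.insert(random.randrange(len(d) + 1), random.choice(choices))
    return d
```

Changing the length of the degree list changes K for that layer, which is how the search can trade approximation precision against depth consumption.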

D.2 REGULARIZED CO-OPERATIVE DIFFERENTIAL CO-EVOLUTION

Algorithm 2: R-CCDE
input: A composite polynomial p_d(x) = (p_K^{d_K} ∘ p_{K−1}^{d_{K−1}} ∘ ⋯ ∘ p_1^{d_1})(x) with parameters λ = {α_1, β_1, …, α_K, β_K}; the target non-arithmetic function q(x); the number of iterations T; the scaling decay γ
output: The context vector λ*
initial: λ* ← LHS
for t ← 1 to T do
  for k ← 1 to K do
    α⋆_k = argmin_{α_k} L_{p_d,q}(α_k | λ*), where α_k | λ* = (α*_1, β*_1, …, α_k, …, α*_K, β*_K);
    λ* ← (α*_1, β*_1, …, α⋆_k, …, α*_K, β*_K);
    β⋆_k = argmin_{β_k} L_{p_d,q}(β_k | λ*) + γ · β_k², where β_k | λ* = (α*_1, β*_1, …, β_k, …, α*_K, β*_K);
    λ* ← (α*_1, β*_1, …, β⋆_k, …, α*_K, β*_K);

Algorithm 2 details how the proposed R-CCDE searches for the coefficients of a composite polynomial. R-CCDE takes as input a composite polynomial p_d(x), the target non-arithmetic function q(x), the number of iterations T, and the scaling decay parameter γ. In this paper, q(x) = 0.5 · sgn(x). R-CCDE outputs the context vector λ* as its result. The composite polynomial p_d(x) = (p_K^{d_K} ∘ p_{K−1}^{d_{K−1}} ∘ ⋯ ∘ p_1^{d_1})(x) has learnable parameters λ = {α_1, β_1, …, α_K, β_K}, where α_k = {α_1, …, α_{d_k}} and β_k, 1 ≤ k ≤ K, satisfy p_k^{d_k}(x) = β_k Σ_{i=1}^{d_k} α_i T_i(x). We apply LHS to initialize the sub-populations of each α_k and β_k. The population size of an α_k sub-population is 10 × ⌈(d_k + 1)/2⌉, while that of a β_k sub-population is 20. We set β_K = 1 and keep it fixed (not learnable). Given iteration t and sub-population index k, we apply DE to solve argmin_{α_k} L_{p_d,q}(α_k | λ*), where L_{p_d,q} is the ℓ_1 distance between p_d(x) and q(x). Here, α_k | λ* means we use the current context vector λ* but search only for the corresponding α_k. The solution α⋆_k evolved by DE updates the context vector: λ* = (α*_1, β*_1, …, α⋆_k, …, α*_K, β*_K). Similarly, we also evolve β_k.
After obtaining λ*, we can scale α_k by β_k and obtain the coefficients of the composite polynomial in terms of the Chebyshev basis of the first kind.
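In plaintext, the sub-polynomial and composite evaluation from the definitions above can be checked with NumPy's Chebyshev utilities. This is a sketch under the Table 3 definitions (the sum starts at T_1, so the T_0 coefficient is zero), not the homomorphic evaluation routine.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def sub_poly(x, alpha, beta):
    """p(x) = beta * sum_{i=1}^{d} alpha_i * T_i(x), Chebyshev basis of the first kind."""
    coeffs = np.concatenate(([0.0], alpha))   # prepend the zero T_0 coefficient
    return beta * C.chebval(x, coeffs)

def composite(x, params):
    """Evaluate (p_K o ... o p_1)(x) for params = [(alpha_1, beta_1), ..., (alpha_K, beta_K)]."""
    y = x
    for alpha, beta in params:
        y = sub_poly(y, alpha, beta)
    return y
```

For instance, a single sub-polynomial with alpha = [1.0] and beta = 1.0 is just T_1(x) = x, so composing two of them returns the input unchanged.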

E COMPARISON OF SEARCH ALGORITHMS

To solve the high-dimensional search problem min_D {1 − Acc_val(f(ω*); Λ*(D), D), Boot(D)}, we propose MOCoEv, which uses cooperative co-evolution to decompose the high-dimensional optimization problem into low-dimensional sub-problems. Because we adopt nondominated sorting and crowding distance from NSGA-II Deb et al. (2002) to obtain Pareto-effective solutions, NSGA-II is a fair baseline for demonstrating the efficacy of MOCoEv. We conduct search experiments with ResNet-20 and ResNet-32 on plaintext CIFAR-10. When using NSGA-II, we keep the same hyper-parameters except that we increase the population size by the number of ReLUs. However, we cannot control the number of polynomials evaluated, the number of evaluations on the minival dataset, or the number of fine-tuning runs on the training dataset; we therefore use wall-clock search time to make the computation comparable, as shown in Table 4. The upper row of Figure 8 shows the Pareto fronts of NSGA-II and AutoFHE. In terms of the trade-off between Top-1 validation accuracy (> 80%) and the number of bootstrapping operations, AutoFHE outperforms NSGA-II. Hypervolume (HV) (2006) is used to compare NSGA-II and AutoFHE quantitatively: hypervolume denotes the volume dominated by the Pareto front, and the bigger the HV, the better the Pareto front. The bottom row of Figure 8 shows the trade-off between Top-1 validation error and the number of bootstrapping operations. We compute the HV with respect to the reference points (18.00, 18.56) for ResNet-20 and (30.00, 16.06) for ResNet-32, respectively. Table 4 shows that AutoFHE obtains better HV values than NSGA-II. These ablation experiments show that co-evolution facilitates high-dimensional multi-objective search.
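For a two-objective minimization problem, the hypervolume of a Pareto front reduces to a simple sweep over rectangles; this sketch assumes every point strictly dominates the reference point, and the reference points above plug in directly.

```python
def hypervolume_2d(front, ref):
    """Hypervolume of a 2-D Pareto front (minimization).

    front: iterable of (f1, f2) points; ref: reference point (r1, r2),
    with every point satisfying f1 < r1 and f2 < r2.
    """
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in sorted(front):        # ascending in f1; f2 descends on a clean front
        if f2 < prev_f2:                # skip dominated or duplicate points
            hv += (ref[0] - f1) * (prev_f2 - f2)
            prev_f2 = f2
    return hv
```

Each nondominated point contributes the rectangle between itself, the previous point's second objective, and the reference point, so the sum equals the area dominated by the front.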

F EVALUATION OF LAYERWISE RELU APPROXIMATION

To demonstrate the efficacy of the layerwise approximation of EvoReLUs, we compare AutoFHE with uniformly approximated networks. We adopt the Minimax composite polynomials Lee et al. (2021a;c) with precision from 4 to 14 and use them to replace ReLUs uniformly. However, the Minimax polynomials require a re-design of the homomorphic evaluation architecture for each precision; for example, FHE-MP-CNN uses the polynomial with precision 13 and designs a suitable homomorphic evaluation architecture for it. To fairly compare the layerwise EvoReLU and the uniform Minimax polynomial, we use the depth consumed by polynomials, rather than the number of bootstrapping operations, as the criterion. We report the Top-1 validation accuracy on CIFAR-10 as the estimated performance under RNS-CKKS. Hence, we use the search objective min_D {1 − Acc_val(f(ω*); Λ*(D), D), Depth(D)}, where Depth(·) is the total depth consumed by polynomials. In addition, in this ablation we do NOT use PAT to fine-tune networks, to keep the comparison fair. The upper row of Figure 9 shows the Pareto fronts of Minimax and AutoFHE in terms of Top-1 validation accuracy and depth. From the bottom row of Figure 9, we compute the hypervolume values with respect to the reference points (285, 90) for ResNet-20 and (372, 90) for ResNet-32, respectively. AutoFHE attains an HV of 1.58 × 10^4, better than Minimax's 1.06 × 10^4 on ResNet-20, while on ResNet-32 AutoFHE's HV is 1.84 × 10^4 compared with Minimax's 1.00 × 10^4.
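As a rough illustration of the Depth(·) objective: if we assume a degree-d sub-polynomial consumes about ⌈log₂(d + 1)⌉ multiplicative levels (an assumption for illustration; the exact count depends on the evaluation algorithm), the total depth of a composite polynomial can be tallied as follows.

```python
import math

def poly_depth(degrees):
    """Approximate multiplicative depth of a composite polynomial.

    Assumption: a sub-polynomial of degree d costs ceil(log2(d + 1)) levels,
    e.g. via a balanced product tree over Chebyshev terms.
    """
    return sum(math.ceil(math.log2(d + 1)) for d in degrees)
```

Under this assumption, the degree set {15, 15, 27} costs 4 + 4 + 5 = 13 levels, which shows why lowering per-layer degrees directly shrinks the depth objective.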
These experimental results prove that: 1) layerwise approximation is better than uniform approximation; and 2) AutoFHE's approximation is precise, so its performance is not simply due to fine-tuning.

G EVALUATE AUTOFHE ON PLAINTEXT IMAGENET

It is not practical to evaluate high-resolution images under the RNS-CKKS scheme due to the extremely high memory footprint and computational complexity. Practically processing high-resolution images in the encrypted domain will require advances in RNS-CKKS primitives, custom hardware for FHE, more efficient packing algorithms, etc.; this is why all current works use CIFAR to benchmark performance. We evaluate AutoFHE on plaintext ImageNet Russakovsky et al. (2015) to demonstrate its efficacy on a large-scale, high-resolution dataset. We use the ResNet-18 model provided by PyTorch. It has 9 ReLUs and 1 MaxPooling layer. We replace the non-arithmetic MaxPooling with arithmetic AvgPooling and train the model from scratch for 90 epochs; its Top-1 accuracy is 69.62%, slightly lower than that of the MaxPooling version (69.76%). We set the number of generations to 10, the size of the minival dataset to 2,560, and the population size to 30. Other hyper-parameters are the same as in the CIFAR experiments. We turn off fine-tuning during the search and fine-tune the final result for one epoch. The search experiment took 18 hours. We estimate accuracy on the plaintext validation dataset and use depth consumption as the inference cost under RNS-CKKS. We adopt the Minimax polynomials with precision from 4 to 13 Lee et al. (2021c) as our baseline and set B = 100.

H EVORELUS OF RESNET-56 ON CIFAR-10

Figure 12 and Figure 13 show EvoReLUs of different layers of ResNet-56. We include Pareto-effective solutions with the number of bootstrapping operations ranging from 26 to 52. As Figure 12 and Figure 13 show, high-precision solutions consume more depth and approximate ReLUs precisely.
Low-precision solutions use low-degree polynomials to reduce depth consumption. From the EvoReLU approximations, we can see how AutoFHE trades off accuracy against inference speed.



Figure 1: Pareto fronts of AutoFHE versus FHE-MP-CNN on encrypted CIFAR-10 under the RNS-CKKS FHE scheme.

Figure 3: Homomorphic evaluation architectures of the chain and residual connections. Upper: the standard ResNet Conv-BN-ReLU triplet He et al. (2016). Middle: FHE-MP-CNN. Bottom: AutoFHE, where dashed rectangles indicate candidate placements of bootstrapping operations to be searched.

terms of the validation accuracy and the number of bootstrapping; and ii) Cooperative Evaluation: we maintain the Pareto front as the context Mei et al. (

We benchmark AutoFHE on CIFAR-10 and CIFAR-100 Krizhevsky et al. (2009). Both datasets have 50,000 training and 10,000 validation images at a resolution of 32 × 32. The validation images are treated as private data and used only for evaluating the final networks. We randomly select 5,120 images from the training split as a minival Tan & Le (2021) dataset to guide the search process; the Top-1 accuracy on the minival dataset optimizes Equation 3. In addition, PAT uses the training split to fine-tune polynomial networks. Finally, as our final result, we report the Top-1 accuracy on the encrypted validation dataset under RNS-CKKS. To evaluate AutoFHE under RNS-CKKS, we adopt the publicly available code of FHE-MP-CNN and adapt it for inference with layerwise EvoReLUs. During inference, we keep track of the ciphertext levels and call the bootstrapping operation when the level reaches zero, following the placement of bootstrapping operations found by AutoFHE. For a fair comparison between AutoFHE and the baseline FHE-MP-CNN, we use the pre-trained network weights provided by FHE-MP-CNN.
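The level bookkeeping during inference can be sketched as a greedy scan. Here `layer_depths` and the level budget `L` are illustrative placeholders; in AutoFHE the placements come from the multi-objective search rather than this greedy rule.

```python
def schedule_bootstraps(layer_depths, L):
    """Greedy level tracking: consume depth per layer and refresh the
    ciphertext (bootstrap) when the remaining level cannot cover the
    next layer's depth."""
    level, placements = L, []
    for i, depth in enumerate(layer_depths):
        if level < depth:          # not enough levels left: bootstrap first
            placements.append(i)
            level = L
        level -= depth
    return placements
```

With a budget of L = 6 levels and four layers each consuming 3 levels, the scan refreshes once before the third layer.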

Figure 5: Pareto fronts of AutoFHE. We report the accuracy on plaintext validation datasets and the number of bootstrapping operations. Left: ResNet-20/32/44/56 on CIFAR-10; Right: ResNet-32 on CIFAR-100.

Figure 6: Pareto fronts of AutoFHE for ResNet-32 on CIFAR-10/-100 under RNS-CKKS.

The Pareto front, or the set of Pareto-effective solutions, consists of solutions not dominated by any other. Delphi Mishra et al. (2020) and SAFENet Lou et al. (2020) combine two objectives (accuracy and ReLU replacement ratio) into a single objective by a weighted sum. This is a widely used trick to relax a multi-objective problem into a single-objective one; however, it yields only a single solution balancing the objectives and cannot recover the set of Pareto-effective solutions. This is why we apply multi-objective search to obtain Pareto-effective solutions corresponding to different accuracy and latency requirements. EAs are naturally well suited to multi-objective search due to population-based optimization, which allows us to obtain the entire set of Pareto-effective solutions in a single run. NSGA-II Deb et al. (2002) is the most well-known evolutionary multi-objective algorithm. The proposed MOCoEv adopts nondominated sorting and crowding distance from NSGA-II: nondominated sorting returns all Pareto fronts, while the crowding distance selects uniformly distributed individuals within a Pareto front. Differential Evolution (DE) is a gradient-free evolutionary algorithm used to optimize continuous variables Rauf et al. (2021). Given a population X = {x_1, x_2, …, x_{N_P}}, where each individual x_π ∈ R^d, 1 ≤ π ≤ N_P, DE evolves the population through mutation, crossover, and selection operations.

Algorithm 1: MOCoEv
input: The network f with L ReLUs; the number of sub-functions of a composite polynomial K; the population size N_P; the number of iterations T; the initial population size N_0 ≫ N_P; the replace probability P_replace; the remove probability P_remove; the insert probability P_insert; the training dataset Training; the mini-validation dataset Minival
output: The Pareto front Population
initial:
  Population {D_1, D_2, …, D_{N_0}} ← LHS(N_0, L, K), where D = {d_1, …, d_L};
  foreach d in D do λ ← R-CCDE(d);
  foreach D in Population do Acc ← Evaluate(f(D, Λ), Minival);
  Population ← Pareto(Population, Acc, N_P);
  foreach D in Population do ω ← PAT(f(D, Λ), Training);
  foreach D in Population do Acc ← Evaluate(f(ω, D, Λ), Minival);
for t ← 1 to T do
  for i ← 1 to L do
    Offspring ← Select(Population);
    Offspring ← Crossover(Offspring);
    Offspring[:, i] ← Mutate(Offspring[:, i]);
    foreach d in Offspring[:, i] do λ ← R-CCDE(d);
    foreach D in Offspring do ω ← PAT(f(D, Λ), Training);
    foreach D in Offspring do Acc ← Evaluate(f(ω, D, Λ), Minival);
    Population ← Pareto(Population + Offspring, Acc, N_P);
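The Pareto(·) step above relies on nondominated sorting; a minimal O(N²) sketch for a minimization problem (the crowding-distance truncation is omitted):

```python
def dominates(a, b):
    """a dominates b (minimization): no worse in every objective, better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def nondominated_sort(points):
    """Return fronts as lists of indices; the first front is the Pareto set."""
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)]
        fronts.append(sorted(front))
        remaining -= set(front)
    return fronts
```

Peeling off successive nondominated layers is exactly how NSGA-II ranks a merged parent-plus-offspring population before truncating to N_P individuals.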

Figure 9: Evaluation of AutoFHE and layerwise Minimax.

Figure 10: Depth consumption distribution of EvoReLUs of ResNet-56. Upper: depth consumption distributions of layerwise EvoReLUs of different bootstrapping consumption. Bottom: the distribution of scaling parameters (B) of layerwise EvoReLUs. The green dashed lines show the depth consumption or B of AppReLUs of FHE-MP-CNN.

Figure 11: Evaluate AutoFHE over ResNet-18 on plaintext ImageNet.

Figure 10 shows the distributions of depth consumption of EvoReLUs for Pareto-effective solutions with varying numbers of bootstrapping operations. As the upper panel of Figure 10 shows, MOCoEv exploits the layerwise variation in approximation sensitivity and assigns a different depth to each EvoReLU. AutoFHE can thus reduce depth consumption, save bootstrapping operations, and further accelerate inference. The bottom panel of Figure 10 shows that pre-activations follow different distributions across layers. It shows we can use smaller B values and lower-degree polynomials to achieve the same precision, namely B′ · 2^{−α′} = B · 2^{−α}, where B′ < B and α′ < α. The pre-activations of residual EvoReLUs are not normalized, so their B values are bigger than those of chain EvoReLUs; residual EvoReLUs therefore prefer higher-degree polynomial approximations with more depth consumption to maintain approximation precision.

Figure 12: EvoReLUs of ResNet-56 from layer 0 to 26.

Figure 13: EvoReLUs of ResNet-56 from layer 27 to 54.

1. AutoFHE automatically searches for EvoReLUs and bootstrapping operations. It provides a diverse set of Pareto-effective solutions that span the trade-off between accuracy and inference time under RNS-CKKS.
2. From an algorithmic perspective: (a) we propose a simple yet effective multi-objective co-evolutionary (MOCoEv) algorithm to explore and optimize over the large search space (10^79 ∼ 10^230) and the high-dimensional vectors (114 ∼ 330) corresponding to our formulation; (b) we design a gradient-free algorithm, regularized co-operative co-evolutionary differential evolution (R-CCDE), to optimize the coefficients of high-degree composite polynomials; and (c) we introduce polynomial-aware training (PAT) to fine-tune EvoReLU DNNs for a few epochs.
3. Experimental results (Figure 1) on encrypted CIFAR-10 and CIFAR-100 under RNS-CKKS show

λ denotes the learnable parameters of EvoReLU with degree d. To normalize pre-activations from [−B, B] into [−1, 1] while avoiding extra depth consumption for scaling, we scale the plaintext weight and bias of BatchNorm by 1/B in advance for chain connections. For residual connections, however, we cannot fold the scale 1/B into BatchNorm's weight and bias; in this case, we scale the ciphertext output of the residual connection by 1/B at the expense of one level. Finally, we integrate B into the coefficients of p_K^{d_K}.
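In plaintext, folding the 1/B scale into BatchNorm's affine parameters can be verified directly. Here `gamma` and `beta` stand for BatchNorm's weight and bias (hypothetical values, with the running statistics assumed already folded in).

```python
import numpy as np

def fold_scale_into_bn(gamma, beta, B):
    """Fold the 1/B pre-activation scaling into BatchNorm's affine parameters.

    BN output: y = gamma * x_hat + beta; replacing (gamma, beta) with
    (gamma / B, beta / B) yields y / B without any extra ciphertext operation.
    """
    return gamma / B, beta / B

# plaintext sanity check: scaled BN equals the original BN output divided by B
x_hat = np.array([0.2, -1.0, 3.5])                     # normalized activations
gamma = np.array([1.5, 0.7, 2.0])                      # hypothetical BN weight
beta = np.array([0.1, -0.2, 0.0])                      # hypothetical BN bias
B = 40.0                                               # pre-activation bound
g2, b2 = fold_scale_into_bn(gamma, beta, B)
```

Because the folding happens on plaintext weights before encryption, the ciphertext never pays a level for the 1/B multiplication on chain connections.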

(a) Search variables and options; (b) AutoFHE search space for ResNets.

AutoFHE under the RNS-CKKS scheme. Top-1* accuracy for FHE-MP-CNN is as reported in Lee et al. (2022a). The inference time for 50 images is evaluated on an AMD EPYC 7H12 64-core processor using 50 threads. Boldface denotes the best criterion on a backbone network, i.e., the best Top-1 accuracy and the least inference time; underline denotes that AutoFHE outperforms FHE-MP-CNN.

Top-1 accuracy of 91.39% on encrypted CIFAR-10 under RNS-CKKS at an amortized inference latency of under one minute (53 seconds) per image, which brings us closer to practically realizing secure inference of deep CNNs under RNS-CKKS. On CIFAR-100, AutoFHE saves 972 seconds (17%) of inference time while preserving accuracy. These experiments show that AutoFHE finds Pareto-effective solutions that trade off accuracy and inference time. Furthermore, the results validate our premise that directly reducing the number of bootstrapping operations effectively accelerates inference.

SAFENet Lou et al. (2020) adopts a_1 x³ + a_2 x² + a_3 x + a_4 or b_1 x² + b_2 x + b_3 and uses SGD to train the coefficients. When SGD trains low-degree polynomial coefficients and network weights simultaneously, the polynomials easily lead to exploding gradients. Moreover, networks with low-degree polynomials must be trained from scratch, cannot use pre-trained weights, and show a large accuracy gap compared with ReLU networks. More recently, AESPA Park et al. (2022) proposes basis-wise normalization to address the problem of exploding gradients in low-degree polynomial approximated networks. Delphi and SAFENet apply population-based training (PBT) Jaderberg et al. (2017) to search for the placement of polynomials. Because Delphi and SAFENet are evaluated under secure MPC, they keep some ReLUs to preserve accuracy. SAFENet also observed that layerwise and channel-wise mixed-precision approximation can better exploit the varying sensitivity of different layers. Minimax composite polynomials Lee et al. (2021a;c) are specially designed to approximate ReLU under FHE with high precision. FHE-MP-CNN Lee et al. (2022a) applies the Minimax composite polynomial with degrees {15, 15, 27} and proves it can maintain the performance of pretrained ResNets. However, it does not consider 1) fine-tuning networks to adapt to polynomial approximation, 2) the layerwise varying sensitivity, or 3) the combination of all polynomial activations in a network. In this paper, we consider both high-precision approximation and network performance: MOCoEv searches for degrees across all layers and directly optimizes validation accuracy, and we consider function-level approximation using R-CCDE to minimize the ℓ_1 distance.
We use pre-trained ResNets and propose PAT to fine-tune trainable network weights to adapt to EvoReLUs for just a few epochs.

C EVOLUTIONARY SEARCH ALGORITHMS

Evolutionary Algorithms (EAs) are a class of search algorithms inspired by Darwin's natural selection. Each candidate solution is an individual, and N_P individuals constitute a population of size N_P. Individuals are assigned a fitness related to the objective, such as validation accuracy for image classification. Based on fitness, we randomly select mating individuals; crossover combines mating individuals to generate offspring, and offspring can be further mutated to better exploit current knowledge. Finally, the offspring are used to update the current population. This process can be repeated many times; each iteration is called a generation, and the number of generations is a simple stopping criterion for an EA search.

The ablation experiment of search algorithms on plaintext CIFAR-10.

APPENDIX

In this appendix, we include the following:
• Appendix A: Notations;
• Appendix B: An expanded discussion of related work for secure inference;
• Appendix C: Background and related work for evolutionary algorithms;
• Appendix D: Implementation of the MOCoEv and R-CCDE algorithms in D.1 and D.2, respectively;
• Appendix E: Experimental details for evaluating co-evolution;
• Appendix F: Experimental details for evaluating layerwise comparison;
• Appendix G: Evaluation of AutoFHE on plaintext ImageNet;
• Appendix H: EvoReLUs of ResNet-56 on CIFAR-10.

A NOTATIONS

We list the variable notations of RNS-CKKS and EvoReLU in Table 3.

B RELATED WORK FOR SECURE INFERENCE

Secure MPC (2021) requires regular communication between customers and the Cloud. FHE cannot directly evaluate ReLU because it only allows arithmetic homomorphic addition and multiplication. Secure MPC, in contrast, can evaluate ReLU using Garbled Circuits (GC) Yao (1986); Bellare et al. (2012), but it suffers from high online computation and communication costs Mishra et al. (2020). Adapting CNNs to secure inference by polynomial approximation of non-arithmetic functions is therefore a necessary pre-processing stage: polynomial approximation enables homomorphic evaluation of encrypted data under FHE, and it can also greatly reduce the online computation and communication costs of secure MPC.

