LEARNING TO LINEARIZE DEEP NEURAL NETWORKS FOR SECURE AND EFFICIENT PRIVATE INFERENCE

Abstract

The large number of ReLU non-linearity operations in existing deep neural networks makes them ill-suited for latency-efficient private inference (PI). Existing techniques to reduce ReLU operations often involve manual effort and sacrifice significant accuracy. In this paper, we first present a novel measure of a non-linearity layer's ReLU sensitivity, removing the time-consuming manual effort otherwise needed to identify it. Based on this sensitivity, we then present SENet, a three-stage training method that, for a given ReLU budget, automatically assigns per-layer ReLU counts, decides the ReLU locations within each layer's activation map, and trains a model with significantly fewer ReLUs to yield latency- and communication-efficient PI. Experimental evaluations with multiple models on various datasets show SENet's superior performance, both in reduced ReLU count and improved classification accuracy, compared to existing alternatives. In particular, SENet can yield models that require up to ∼2× fewer ReLUs while maintaining similar accuracy. For a similar ReLU budget, SENet can yield models with up to ∼2.32% higher classification accuracy, evaluated on CIFAR-100.

1. INTRODUCTION

With the recent proliferation of AI-driven client-server applications, including image analysis (Litjens et al., 2017), object detection, speech recognition (Hinton et al., 2012), and voice assistance services, the demand for machine learning inference as a service (MLaaS) has grown. Simultaneously, emerging privacy concerns from both users and model developers have made private inference (PI) an important aspect of MLaaS. In PI, the service provider retains the proprietary model in the cloud, where inference is performed on the client's encrypted data (ciphertexts), thus preserving both model (Kundu et al., 2021b) and data privacy (Yin et al., 2020). Existing PI methods rely on various cryptographic protocols, including homomorphic encryption (HE) (Brakerski & Vaikuntanathan, 2014; Gentry, 2009) and additive secret sharing (ASS) (Goldreich et al., 2019), for the linear operations in the convolutional and fully connected (FC) layers. For example, popular methods like Gazelle (Juvekar et al., 2018), DELPHI (Mishra et al., 2020), and Cheetah (Reagen et al., 2021) use HE, while MiniONN (Liu et al., 2017) and CryptoNAS (Ghodsi et al., 2020) use ASS. For the non-linear ReLU operations, PI methods generally use Yao's Garbled Circuits (GC) (Yao, 1986). However, GCs demand orders of magnitude higher latency and communication than the PI of linear operations, making latency-efficient PI an exceedingly difficult task. In contrast, standard (plaintext) inference latency is dominated by the linear operations (Kundu et al., 2022b) and is significantly lower than that of PI. This has motivated the unique problem of reducing the number of ReLU non-linearity operations to lower the communication and latency overhead of PI.

Our contributions. Our contribution is three-fold. We first empirically demonstrate the relation between a layer's sensitivity towards pruning and its associated ReLU sensitivity.
Based on these observations, we introduce an automated layer-wise ReLU sensitivity evaluation strategy and propose SENet, a three-stage training process that yields secure and efficient networks for PI and guarantees meeting a target ReLU budget without any hyperparameter-dependent iterative training. In particular, for a given global ReLU budget, we first determine a sensitivity-driven layer-wise ReLU unit budget. Given this budget, we then perform a layer-wise ReLU allocation mask search: for each layer, we evaluate a binary mask tensor with the size of the corresponding activation map, in which a 1 or 0 signifies the presence or absence of a ReLU unit, respectively. Finally, we use the trained mask to create a partial ReLU (PR) model with ReLUs present only at fixed positions of the non-linearity layers, and fine-tune it via distillation from a trained all-ReLU (AR) model of identical architecture. Importantly, we support ReLU mask allocation at the granularity of both individual pixels and activation channels. To further reduce the compute costs of both the linear and non-linear (ReLU) layers, we extend our approach to SENet++. SENet++ uses a single training loop to train a model with multiple channel dropout rates (DRs) d_r (d_r ≤ 1.0) of the weight tensor, where each d_r yields a sub-model with a MAC-ReLU budget no larger than that of the original model. In particular, inspired by the idea of ordered dropout (Horvath et al., 2021), we train a PR model with multiple dropout rates, where each dropout rate corresponds to a scaled sub-model whose number of channels per layer is ∝ d_r. This essentially allows the server to obtain multiple sub-models for different compute requirements via a single training loop and without a costly memory footprint. Table 1 compares the important characteristics of our methods with existing alternatives.
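The partial ReLU (PR) layer described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: where the binary mask is 1 the ReLU is kept, and where it is 0 the unit is linearized (identity), so during PI only the masked-in positions require costly GC evaluation. The function name and tensor shapes are assumptions for the example.

```python
import numpy as np

def partial_relu(x, mask):
    """Apply ReLU only where mask == 1; pass values through unchanged elsewhere.

    x    : activation map, e.g. shape (C, H, W)
    mask : binary tensor of the same shape (1 = keep the ReLU unit,
           0 = replace it with the identity, i.e. a linearized unit)
    """
    return np.where(mask == 1, np.maximum(x, 0.0), x)

# Toy example: a 1x2x2 activation map with a ReLU kept at a single pixel.
x = np.array([[[-1.0, 2.0], [-3.0, 4.0]]])
mask = np.array([[[1, 0], [0, 0]]])
y = partial_relu(x, mask)
# Only the masked-in position (-1.0) is rectified to 0; the rest pass through,
# so y == [[[0.0, 2.0], [-3.0, 4.0]]].
```

During fine-tuning, gradients flow through both the rectified and the identity positions, which is what allows a PR model to be distilled from its all-ReLU (AR) counterpart.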
We conduct extensive experiments and ablations on various models, including variants of ResNet, Wide Residual Networks, and VGG, on the CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet datasets. Experimental results show that SENet yields a SOTA accuracy-ReLU trade-off, with an accuracy improvement of up to ∼2.32% for similar ReLU budgets. SENet++ (d_r = 0.5) can further improve on SENet's MAC and ReLU costs, with additional savings of 4× and ∼2×, respectively.

2. PRELIMINARIES AND RELATED WORK

Cryptographic primitives. We briefly describe the relevant cryptographic primitives in this section. Additive secret sharing. Given an element x, an ASS of x is the pair (⟨x⟩_1, ⟨x⟩_2) = (x − r, r), where r is a random element and x = ⟨x⟩_1 + ⟨x⟩_2. Since r is random, neither share alone reveals any information about x.
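The sharing and reconstruction above can be sketched in a few lines. This is a toy illustration over the ring Z_{2^32}; the modulus choice and helper names are assumptions for the example, not part of any specific PI protocol.

```python
import secrets

MOD = 2**32  # all arithmetic is done in the ring Z_{2^32}

def share(x):
    """Split x into two additive shares (x - r, r) mod 2^32."""
    r = secrets.randbelow(MOD)
    return ((x - r) % MOD, r)

def reconstruct(s1, s2):
    """Recover x by adding the two shares mod 2^32."""
    return (s1 + s2) % MOD

s1, s2 = share(42)
assert reconstruct(s1, s2) == 42

# Linearity: each party adds its shares of x and y locally, and the sums
# reconstruct to x + y -- this is why ASS handles linear layers cheaply.
x1, x2 = share(10)
y1, y2 = share(7)
assert reconstruct((x1 + y1) % MOD, (x2 + y2) % MOD) == 17
```

The local-addition property at the end is the key reason linear (convolution/FC) operations are inexpensive under ASS, while non-linear ReLUs must fall back to Garbled Circuits.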



Figure 1: Comparison of various methods in the accuracy vs. #ReLU trade-off plot. SENet outperforms the existing approaches with an accuracy improvement of up to ∼4.5% for a similar ReLU budget.

Table 1: Comparison between existing approaches in yielding efficient models for PI. Note that SENet++ can yield a model that can be switched to sub-models of reduced channel sizes.

Recent literature has leveraged neural architecture search (NAS) to optimize both the number and placement of ReLUs (Ghodsi et al., 2020; Cho et al., 2021). However, these methods often incur a significant accuracy drop, particularly when the ReLU budget is low. For example, with a ReLU budget of 86k, CryptoNAS loses ∼9% accuracy compared to the model with all ReLUs (AR) present. DeepReDuce (Jha et al., 2021) used a careful multi-stage optimization and reduced the accuracy drop to ∼3% at similar ReLU budgets. However, DeepReDuce relies heavily on manual effort for the precise removal of ReLU layers, making this strategy exceedingly difficult, particularly for models with many layers. A portion of these accuracy drops can be attributed to the fact that these approaches are constrained to remove ReLUs at the coarser granularity of layers and channels rather than at the pixel level. Only very recently, Cho et al. (2022) proposed l1-regularized pixel-level ReLU reduction. However, such approaches are extremely hyperparameter-sensitive and often do not guarantee meeting a specific ReLU budget. Moreover, the large number of training iterations required for improved accuracy may not be suitable for compute-limited servers (Mishra et al., 2020).
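The budget-guarantee distinction drawn above can be made concrete. With an l1 penalty, the final number of surviving ReLUs depends on the penalty weight and must be tuned; selecting the top-k positions of a per-position importance score, by contrast, meets a hard budget by construction. The sketch below illustrates only this contrast; it is not SENet's actual allocation algorithm, and the score tensor is a made-up example.

```python
import numpy as np

def topk_relu_mask(importance, budget):
    """Build a binary ReLU mask that meets a hard per-layer budget exactly.

    importance : per-position importance scores for one activation map
    budget     : number of ReLU units to keep in this layer

    Unlike an l1 penalty, whose resulting sparsity depends on the penalty
    weight, keeping the top-`budget` scores yields exactly `budget` ones.
    """
    flat = importance.ravel()
    mask = np.zeros_like(flat)
    mask[np.argsort(flat)[-budget:]] = 1  # keep the `budget` largest scores
    return mask.reshape(importance.shape)

scores = np.array([[0.1, 0.9], [0.4, 0.3]])
mask = topk_relu_mask(scores, budget=2)
assert int(mask.sum()) == 2  # the budget is met by construction
```

Whatever the score distribution, `mask.sum()` equals the requested budget, which is the guarantee hyperparameter-tuned l1 regularization cannot provide.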

