GINN: Fast GPU-TEE Based Integrity for Neural Network Training

Abstract

Machine learning models based on Deep Neural Networks (DNNs) are increasingly deployed in a wide range of applications, from self-driving cars to COVID-19 treatment discovery. To supply the computational power necessary to train a DNN, cloud environments with dedicated hardware support have emerged as critical infrastructure. However, outsourcing computation raises many integrity challenges. Various approaches building on trusted execution environments (TEEs) have been developed to address these challenges. Yet, no existing approach scales up to support realistic integrity-preserving DNN model training for heavy workloads (deep architectures and millions of training examples) without sustaining a significant performance hit. To mitigate the time gap between pure TEE (full integrity) and pure GPU (no integrity), we combine random verification of selected computation steps with systematic adjustments of DNN hyperparameters (e.g., a narrow gradient clipping range), thereby limiting how far an attacker can shift the model parameters in any single step that is not selected for verification during training. Experimental results show the new approach achieves a 2x to 20x performance improvement over a pure TEE-based solution while guaranteeing a very high probability of integrity (e.g., 0.999) with respect to state-of-the-art DNN backdoor attacks.

1. INTRODUCTION

Every day, Deep Learning (DL) is incorporated into new aspects of society. As a result, numerous industries increasingly rely on DL models to make decisions, in domains ranging from computer vision to natural language processing. Training these DL models requires a substantial quantity of computational resources (often in a distributed fashion) that traditional CPUs are unable to supply. Hence, special hardware with massive parallel computing capabilities, such as GPUs, is often utilized (Shi et al., 2016). At the same time, the DL model building process is increasingly outsourced to the cloud. This is natural, as using cloud services (e.g., Amazon EC2, Microsoft Azure, or Google Cloud) for DL training can be more fiscally palatable for companies, enabling them to focus on the software aspect of their products. Nevertheless, such outsourcing raises numerous concerns with respect to the privacy and integrity of the learned models. In recognition of the privacy and integrity concerns around DL (and Machine Learning (ML) in general), a considerable amount of research has applied cryptography in three general areas: 1) Multi-Party Computation (MPC) (e.g., Mohassel & Zhang (2017)), 2) Homomorphic Encryption (HE) (e.g., Gilad-Bachrach et al. (2016)), and 3) Trusted Execution Environments (TEE) (e.g., Hunt et al. (2018); Hynes et al. (2018)). However, the majority of these investigations are limited in that: 1) they are only applicable to simple, shallow network models, 2) they are evaluated on datasets with a small number of records (such as MNIST (LeCun & Cortes, 2010) and CIFAR10 (Krizhevsky et al.)), and 3) they incur a substantial amount of overhead that is unacceptable for real-life DL training workloads.
In an effort to mitigate some of these problems and securely move from CPUs to GPUs, Slalom (Tramèr & Boneh, 2019) mainly focuses on computational integrity at the test phase; depending on the application context, it can also support enhanced data privacy, albeit at a much greater performance cost. To address these limitations, we introduce GINN (see Figure 1), a framework for integrity-preserving learning as a service that provides integrity guarantees for outsourced DL model training in TEEs. We assume that only the TEE running in the cloud is trusted, and that all other resources, such as GPUs, can be controlled by an attacker to launch an attack (e.g., insert a trojan). In this context, our goal is to support realistic deep learning training workloads while ensuring data and model integrity. To achieve this goal, we focus on settings where maintaining the learning process's integrity is critical while the training data may not contain privacy-sensitive information. For example, we may want to build a traffic sign detection model on public traffic sign images and still prevent attacks that insert a trojan during the training phase. Furthermore, we want to provide assurances that the model is trained on the specified dataset with known parameters, so that the performance of the model can be replicated and audited for accountability and integrity. The trivial approach of executing the entire learning process inside a TEE is not scalable since TEEs are much slower than GPUs. Furthermore, even existing performance improvement techniques (e.g., the random matrix verification provided in Tramèr & Boneh (2019)) are not enough to scale up to large DL model learning settings. To alleviate the TEE bottleneck, we propose incorporating random verification of the computation steps.
This strategy is based on the observation that it is unnecessary to verify all of the GPU's computation steps. Rather, we only need to verify occasionally to catch any deviation with very high likelihood. Given that random verification may itself be insufficient (theoretically, an attacker can launch a successful attack by modifying only a single unconstrained gradient update), we further show how parts of the DL hyperparameter setting process, such as the clipping rate, should be modified to prevent single-step attacks and force an attacker that controls the GPU to issue a larger number of malicious updates. Simply put, GINN limits the amount of change an adversary can inflict on a model through a single SGD step. As a consequence, the adversary is forced to keep attacking while being randomly verified by the TEE. Using state-of-the-art backdoor attacks, we illustrate that the random verification technique can detect attacks with high probability (e.g., 0.99) while enabling 2x-20x performance gains compared to pure TEE-based solutions. The specific contributions of this paper are as follows:

• We introduce the first approach to support integrity-preserving DL training via random verification of stochastic gradient descent (SGD) steps inside a TEE, ensuring the integrity of the training pipeline (data, parameters, computation function, etc.) with high probability.

• We illustrate how gradient clipping can be used as a defensive measure against single-step (or infrequent-step) attacks in combination with random verification.

• We show the effectiveness of our TEE random verification and gradient clipping through extensive experimentation on DNN backdoor attacks.
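The argument above can be made concrete with a short calculation. If each SGD step is independently verified with probability p, and gradient clipping forces the adversary to tamper with k separate steps, the chance that at least one malicious step is caught is 1 - (1 - p)^k. The following sketch (our illustration; the function name and the specific rates are our assumptions, not the paper's) shows why clipping matters: a single unconstrained update evades a 10% verification rate with probability 0.9, but dozens of forced small updates are caught almost surely.

```python
def detection_probability(p: float, k: int) -> float:
    """Probability that at least one of k tampered SGD steps is selected
    for TEE verification, when each step is independently verified with
    probability p."""
    return 1.0 - (1.0 - p) ** k

# A single unconstrained malicious update slips past a 10% verification
# rate 90% of the time...
single_step = detection_probability(0.10, 1)   # 0.1

# ...but if clipping forces the attacker to tamper with many steps,
# detection becomes near-certain: 66 steps already exceed 0.999.
many_steps = detection_probability(0.10, 66)
assert many_steps > 0.999
```

This is why random verification alone is insufficient against single-step attacks, while its combination with a narrow clipping range yields the high-probability guarantees quoted in the abstract.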



Figure 1: The main architecture of GINN. The TEE handles mini-batch selection, layer-specific randomness, and parameter initialization. The GPU performs forward and backward passes over the mini-batch (items selected by an SGX-provided seed) and reports the computed gradients to the TEE. The TEE then clips the gradients and performs the weight update. The TEE also preserves the MAC-authenticated intermediate gradient reports. During verification, the TEE performs the forward and backward passes on the batch items with the (regenerated) layer-specific randomness and compares the resulting gradients with the GPU's report.
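The clip-then-update step and the verification check described in the caption can be sketched as follows. This is a minimal NumPy illustration under our own assumptions (the function names, clipping range, learning rate, and comparison tolerance are ours; the actual system operates on MAC-authenticated reports inside SGX):

```python
import numpy as np

def tee_update(w, reported_grad, lr=0.1, clip_range=0.01):
    """TEE side: clip each gradient component to [-clip_range, clip_range],
    then apply the SGD weight update. Clipping bounds how far any single
    (possibly malicious) GPU-reported gradient can move the model."""
    clipped = np.clip(reported_grad, -clip_range, clip_range)
    return w - lr * clipped

def tee_verify(recomputed_grad, reported_grad, tol=1e-5):
    """TEE side, on a randomly selected step: recompute the forward/backward
    pass with the regenerated randomness and accept only if the GPU's
    reported gradient matches up to numerical tolerance."""
    return np.allclose(recomputed_grad, reported_grad, atol=tol)

# A gradient report from the GPU: large components get clipped, small pass.
w = np.zeros(3)
g = np.array([1.0, -1.0, 0.005])
w_next = tee_update(w, g)           # -> [-0.001, 0.001, -0.0005]

# An honest report passes verification; a tampered one is rejected.
assert tee_verify(g, g)
assert not tee_verify(g, g + 0.5)
```

The key design point, reflected in the caption, is that the expensive passes run on the GPU while the cheap clip-and-update (and the occasional full recomputation) stay inside the TEE.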

