HOW HARD IS TROJAN DETECTION IN DNNS? FOOLING DETECTORS WITH EVASIVE TROJANS

Abstract

As AI systems become more capable and widely used, a growing concern is the possibility of trojan attacks in which adversaries inject deep neural networks with hidden functionality. Recently, methods for detecting trojans have proven surprisingly effective against existing attacks. However, there is comparatively little work on whether trojans themselves could be rendered hard to detect. To fill this gap, we develop a general method for making trojans more evasive based on several novel techniques and observations. Our method combines distribution matching, specificity, and randomization to eliminate distinguishing features of trojaned networks. Importantly, our method can be applied to various existing trojan attacks and is detector-agnostic. In experiments, we find that our evasive trojans reduce the efficacy of a wide range of detectors across numerous evaluation settings while maintaining high attack success rates. Moreover, we find that evasive trojans are also harder to reverse-engineer, underscoring the importance of developing more robust monitoring mechanisms for neural networks and clarifying the offense-defense balance of trojan detection.

1. INTRODUCTION

A neural trojan attack occurs when adversaries corrupt the training data or model pipeline to implant hidden functionality in neural networks. The resulting networks exhibit a targeted behavior in response to triggers known only to the adversary, yet they retain their performance and properties on benign inputs, allowing them to remain undetected, potentially until after the adversary has accomplished their goal. The threat of trojan attacks is becoming especially salient with the rise of model-sharing libraries and massive datasets that are scraped directly from the Internet and are too large to examine manually.

To combat this threat, an especially promising defense strategy is trojan detection, which seeks to distinguish trojaned networks from clean networks before deployment. Detection has the desirable property of being broadly applicable to different defense settings, and it enables additional defense measures later on, such as removing hidden functionality from networks (Wang et al., 2019). Moreover, the problem of trojan detection is interesting in its own right: being good at detecting trojans implies that one must be able to distinguish subtle properties of networks by inspecting their weights and outputs, making the problem relevant to interpretability research. More broadly, trojan detection could be viewed as a microcosm for identifying deception and hidden intentions in future AI systems (Hendrycks & Mazeika, 2022), highlighting the importance of developing robust trojan detectors.

There is a growing body of work on detecting neural trojans, and recent progress seems to suggest that trojan detection is fairly easy. For example, Liu et al. (2019) and Zheng et al. (2021) both propose detectors that obtain over 90% AUROC on existing trojan attacks. However, there has been comparatively little work investigating whether trojans themselves could be made harder to detect. Very recently, Goldwasser et al. (2022) showed that for single-layer networks one can build trojans that are practically impossible to detect. This is a worrying result for the offense-defense balance of trojan detection, especially if such trojans could be designed for deep neural networks. However, to date there has been no demonstration of hard-to-detect trojan attacks in deep neural networks that generalize to different detectors.

In this paper, we propose a general method for making deep neural network trojans harder to detect. Unlike standard trojan attacks, the evasive trojans inserted by our method are trained with a detector-agnostic loss that specifically encourages them to be indistinguishable from clean networks. The components of our method are intuitively simple, relying primarily on a distribution matching loss inspired by the Wasserstein distance along with specificity and randomization losses. Crucially, we consider a white-box threat model that allows defenders full access to training sets of evasive trojans, which enables gauging whether our evasive trojans are truly harder to detect.

In experiments, we train over 6,000 trojaned neural networks and find that our evasive trojans considerably reduce the performance of a wide range of detection algorithms, in some cases reducing detection performance to chance levels. Surprisingly, we find that in addition to being harder to detect, our evasive trojans are also harder to reverse-engineer: target label prediction and trigger synthesis both become considerably harder. This is an unexpected result, because our loss does not explicitly optimize to make these tasks harder. In light of these results, we hope our work shifts trojan detection research towards a paradigm of constructive adversarial development, where more evasive trojans are developed in order to identify the limits of detectors and improve them. By studying the offense-defense balance of trojan detection in this way, the community could make steady progress towards the ultimate goal of building robust trojan detectors and monitoring mechanisms for neural networks. Experiment code and models are available at [anonymized].
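The high-level structure of such a training objective can be sketched as follows. This is an illustrative sketch only: the function names, loss weights, and the sorted-sample surrogate for a 1-D Wasserstein distance are all assumptions for exposition, not the paper's actual formulation.

```python
import numpy as np

def distribution_matching_loss(trojan_feats, clean_feats):
    """Crude 1-D Wasserstein-style surrogate: average absolute gap
    between the sorted feature values of a trojaned and a clean network."""
    return float(np.mean(np.abs(np.sort(trojan_feats) - np.sort(clean_feats))))

def evasive_trojan_loss(task_loss, trojan_feats, clean_feats,
                        specificity_loss, randomization_loss,
                        lam_dist=1.0, lam_spec=1.0, lam_rand=1.0):
    """Hypothetical combined objective: keep the trojan functional (task_loss)
    while penalizing statistical deviation from clean networks."""
    return (task_loss
            + lam_dist * distribution_matching_loss(trojan_feats, clean_feats)
            + lam_spec * specificity_loss
            + lam_rand * randomization_loss)
```

In an actual attack, `task_loss` would combine clean accuracy and attack success, and the feature statistics would be computed from network activations or parameters during training; here scalar placeholders stand in for those terms.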

2. RELATED WORK

Trojan Attacks on Neural Networks. Trojan attacks, or backdoor attacks, refer to the process of implanting hidden functionalities into a system that affect its safety (Hendrycks et al., 2021). Geigel (2013) devises a method to insert malicious triggers into a neural network, and a wide variety of neural trojan attacks have since been proposed (Li et al., 2022).

Figure 1: Compared to standard trojans, our evasive trojans are significantly harder to detect and reverse-engineer. In this illustrative example, the standard and evasive trojans contain dangerous hidden functionality. A meta-network is able to detect the standard trojan and reverse-engineer its target label and trigger, whereas the evasive trojan bypasses detection.

Gu et al. (2017) show how data poisoning can insert trojans into victim models. They introduce the BadNets attack, which causes targeted misclassification when a trigger pattern appears in test inputs. Chen et al. (2017) introduce a blended attack strategy, which uses triggers that are less conspicuous in the poisoned training set. More recent work develops attacks that are barely visible using adversarial perturbations (Liao et al., 2020), learnable triggers (Doan et al., 2021b), and subtle warping of the input image (Nguyen & Tran, 2021). Others have considered trojan attacks under fine-tuning threat models (Yao et al., 2019), in textual domains (Zhang et al., 2021), and encompassing a diverse range of attack vectors and goals (Bagdasaryan et al., 2020; Carlini & Terzis, 2021).

Trojan Detection. An important part of defending against trojan attacks is detecting whether a given network is trojaned. Wang et al. (2019) propose Neural Cleanse, which reverse-engineers candidate triggers for each classification label; if a small trigger pattern is found, this indicates the presence of a deliberately inserted trojan. Liu et al. (2019) analyze inner neurons for suspicious behavior, then reverse-engineer candidate triggers to confirm whether a neuron is compromised. Kolouri et al. (2020) and Xu et al. (2021) propose training a set of queries to classify a training set of trojaned and clean networks; remarkably, this generalizes well to unseen trojaned networks. Other work uses conditional GANs to model trigger generation (Chen et al., 2019b), adversarial
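The poisoning attacks discussed above can be illustrated with a minimal BadNets-style sketch: stamp a small trigger patch onto a fraction of the training images and relabel them to the attacker's target class. The patch location, size, value, and poison rate below are illustrative assumptions, not the specific settings used by Gu et al. (2017).

```python
import numpy as np

def poison_dataset(images, labels, target_label, poison_rate=0.1,
                   patch_value=1.0, patch_size=3, seed=0):
    """BadNets-style data-poisoning sketch: place a square trigger in the
    bottom-right corner of a random subset of images and relabel them."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = int(len(images) * poison_rate)
    idx = rng.choice(len(images), size=n_poison, replace=False)
    images[idx, -patch_size:, -patch_size:] = patch_value  # stamp trigger patch
    labels[idx] = target_label                             # targeted misclassification label
    return images, labels, idx
```

A model trained on such a dataset learns to associate the patch with `target_label`, behaving normally on clean inputs but misclassifying any test input carrying the trigger.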

