UNICORN: A UNIFIED BACKDOOR TRIGGER INVERSION FRAMEWORK

Abstract

The backdoor attack, where the adversary uses inputs stamped with triggers (e.g., a patch) to activate pre-planted malicious behaviors, is a severe threat to Deep Neural Network (DNN) models. Trigger inversion is an effective way of identifying backdoor models and understanding embedded adversarial behaviors. A challenge of trigger inversion is that there are many ways of constructing the trigger. Existing methods make certain assumptions or impose attack-specific constraints, and therefore cannot generalize to various types of triggers. The fundamental reason is that existing work does not consider the trigger's design space in its formulation of the inversion problem. This work formally defines and analyzes the triggers injected in different spaces and the corresponding inversion problem. It then proposes a unified framework to invert backdoor triggers, based on this formalization and on the inner behaviors of backdoor models identified by our analysis. Our prototype UNICORN is general and effective in inverting backdoor triggers in DNNs.

1. INTRODUCTION

Backdoor attacks against Deep Neural Networks (DNNs) refer to attacks where the adversary creates a malicious DNN that behaves as expected on clean inputs but predicts a predefined target label when the input is stamped with a trigger (Liu et al., 2018b; Gu et al., 2017; Chen et al., 2017; Liu et al., 2019; Wang et al., 2022c; Barni et al., 2019; Nguyen & Tran, 2021). The malicious models can be generated by data poisoning (Gu et al., 2017; Chen et al., 2017) or supply-chain attacks (Liu et al., 2018b; Nguyen & Tran, 2021). The adversary can choose the desired target label(s) and the trigger. Existing work demonstrates that DNNs are vulnerable to various types of triggers. For example, the trigger can be a colored patch (Gu et al., 2017), an image filter (Liu et al., 2019), or a warping effect (Nguyen & Tran, 2021). Such attacks pose a severe threat to DNN-based applications, especially those in security-critical tasks such as malware classification (Severi et al., 2021; Yang et al., 2022; Li et al., 2021a), face recognition (Sarkar et al., 2020; Wenger et al., 2021), speaker verification (Zhai et al., 2021), medical image analysis (Feng et al., 2022), brain-computer interfaces (Meng et al., 2020), and autonomous driving (Gu et al., 2017; Xiang et al., 2021). Due to the threat of backdoor attacks, many countermeasures have been proposed, for example, anti-poisoning training (Li et al., 2021c; Wang et al., 2022a; Hong et al., 2020; Tran et al., 2018; Hayase et al., 2021; Chen et al., 2018) and runtime detection of malicious inputs (Gao et al., 2019; Doan et al., 2020; Zeng et al., 2021). Unlike many methods that only work under specific threat models (e.g., anti-poisoning training only works in the data-poisoning scenario), trigger inversion (Wang et al., 2019; Liu et al., 2019; Guo et al., 2020; Chen et al., 2019; Shen et al., 2021) is practical and general because it applies in both poisoning and supply-chain attack scenarios.
It is a post-training method where the defender aims to detect whether a given model contains backdoors. It reconstructs the backdoor triggers injected in the model as well as the target labels, which helps analyze the backdoors. If there exists an inverted pattern that can control the predictions of the model, the model is determined to be backdoored. Most existing trigger inversion methods (Guo et al., 2020; Liu et al., 2019; Shen et al., 2021; Chen et al., 2019) search for a static trigger mask and pattern in the pixel space and thus cannot handle triggers injected in other input spaces. In this paper, we propose a trigger inversion framework that can generalize to different types of triggers. We first define a backdoor trigger as a predefined perturbation in a particular input space. A backdoor sample x̂ is formalized as x̂ = ϕ⁻¹((1 - m) ⊙ ϕ(x) + m ⊙ t), where m and t are the input-space trigger mask and trigger pattern, x is a benign sample, and ϕ is an invertible input-space transformation function that maps from the pixel space to another input space; ϕ⁻¹ is the inverse of ϕ, i.e., x = ϕ⁻¹(ϕ(x)). As shown in Fig. 1, the critical difference between our framework and existing methods is that we introduce an input-space transformation ϕ to unify backdoor triggers injected in different spaces. Besides searching for the pixel-space trigger mask m and pattern t as existing methods do, our method also searches for the input-space transformation function ϕ to find the input space where the backdoor is injected. We also observe that successful backdoor attacks lead to compromised activation vectors in the intermediate representations when the model recognizes backdoor triggers. Moreover, we find that the benign activation vector does not affect the predictions of the model when the compromised activation values are activated. This observation means the compromised activation vector and the benign activation vector are disentangled in successful backdoor attacks.
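To make this formalization concrete, the stamping operation can be sketched in a few lines of NumPy. The choice of ϕ below (a 2-D DFT, giving a frequency-space trigger) and the toy shapes are illustrative assumptions, not the paper's implementation; any invertible transform can play the role of ϕ.

```python
import numpy as np

def stamp_trigger(x, m, t, phi, phi_inv):
    """Stamp trigger (m, t) on benign sample x in the input space defined by phi.

    Implements x_hat = phi_inv((1 - m) * phi(x) + m * t) from the
    formalization above; '*' is the element-wise (Hadamard) product.
    """
    return phi_inv((1 - m) * phi(x) + m * t)

# phi = identity recovers the classic pixel-space patch/blend trigger.
identity = lambda v: v

# phi = 2-D DFT: the same formalization covers triggers injected in the
# frequency domain (the inverse transform maps the result back to pixels).
dft = np.fft.fft2
idft = lambda z: np.real(np.fft.ifft2(z))

x = np.random.rand(32, 32)                 # toy single-channel image
m = np.zeros((32, 32)); m[:4, :4] = 1.0    # mask: perturb low frequencies only
t = np.full((32, 32), 5.0)                 # trigger pattern

x_pixel = stamp_trigger(x, m, t, identity, identity)   # pixel-space trigger
x_freq  = stamp_trigger(x, m, t, dft, idft)            # frequency-space trigger
```

With m = 0 the sample is left untouched for any invertible ϕ, which is exactly the invertibility condition x = ϕ⁻¹(ϕ(x)) above.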
Based on the formalization of triggers and the observed inner behaviors of backdoor models, we formulate trigger inversion as a constrained optimization problem. Based on this optimization problem, we implemented a prototype, UNICORN (Unified Backdoor Trigger Inversion), in PyTorch and evaluated it on nine different models and eight different backdoor attacks (i.e., Patch attack (Gu et al., 2017), Blend attack (Chen et al., 2017), SIG (Barni et al., 2019), the moon, kelvin, and 1977 filters (Liu et al., 2019), WaNet (Nguyen & Tran, 2021), and BppAttack (Wang et al., 2022c)) on the CIFAR-10 and ImageNet datasets.
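The inversion-as-optimization idea can be sketched with a toy linear "model" and hand-derived gradients in NumPy. This is a deliberately minimal assumption-laden sketch, not the paper's PyTorch implementation: ϕ is fixed to the identity and the mask is fixed, whereas the full framework also optimizes the mask and the transformation ϕ under constraints on the model's internal activations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the suspect classifier: logits = W @ x + b. A real
# defender would backpropagate through the DNN under inspection instead.
D, C, TARGET = 16, 3, 2
W, b = rng.normal(size=(C, D)), rng.normal(size=C)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def invert_trigger(xs, m, steps=500, lr=0.05):
    """Gradient-descent sketch of trigger inversion with phi = identity.

    Searches for a pattern t such that stamping (m, t) on clean samples xs
    drives the classifier to TARGET.
    """
    t = np.zeros(D)
    for _ in range(steps):
        grad = np.zeros(D)
        for x in xs:
            x_hat = (1 - m) * x + m * t        # stamp the candidate trigger
            p = softmax(W @ x_hat + b)
            dz = p - np.eye(C)[TARGET]         # d(cross-entropy)/d(logits)
            grad += m * (W.T @ dz)             # chain rule through stamping
        t -= lr * grad / len(xs)
    return t

xs = rng.normal(size=(8, D))                   # stand-in clean samples
m = np.ones(D)                                 # full-input mask for the sketch
t = invert_trigger(xs, m)
# Fraction of stamped samples classified as TARGET (attack success rate).
asr = np.mean([softmax(W @ ((1 - m) * x + m * t) + b).argmax() == TARGET
               for x in xs])
```

A high attack success rate of the inverted pattern on clean samples is the signal that the model under inspection contains a backdoor toward TARGET.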

2. BACKGROUND & MOTIVATION

Backdoor. Existing works (Turner et al., 2019; Salem et al., 2022; Nguyen & Tran, 2020; Tang et al., 2021; Liu et al., 2020; Lin et al., 2020; Li et al., 2020; Chen et al., 2021; Li et al., 2021d; Doan et al., 2021b; Tao et al., 2022d; Bagdasaryan & Shmatikov, 2022; Qi et al., 2023; Chen et al., 2023) demonstrate that deep neural networks are vulnerable to backdoor attacks. Models infected with backdoors behave as expected on normal inputs but present malicious behaviors (i.e., predicting a certain label) when the input contains the backdoor trigger. Existing methods defend against backdoor attacks during training (Du et al., 2020; Hong et al., 2020; Huang et al., 2022; Li et al., 2021c; Wang et al., 2022a; Hayase et al., 2021; Tran et al., 2018; Zhang et al., 2023), or detect backdoors in trained models.



Fig. 1: Existing trigger inversion methods and our method.

Trigger inversion. Existing inversion methods search for a trigger mask and trigger pattern. They invert the trigger by devising an optimization problem that searches for a small and static pixel-space pattern. Such methods achieve good performance on the specific type of triggers that are static patterns in the pixel space (Gu et al., 2017; Chen et al., 2017). However, they cannot generalize to other types of triggers such as image filters (Liu et al., 2019) and warping effects (Nguyen & Tran, 2021). Existing methods fail to invert various types of triggers because they do not consider the trigger's design space in their formulation of the inversion problem.
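For reference, the pixel-space objective such methods optimize is commonly written in the following form (as in Neural Cleanse (Wang et al., 2019)), where f is the model, y_t the target label, L the classification loss, and λ weighs the mask-size penalty; the exact regularizer varies by method:

```latex
\min_{m,\,t}\;\; \mathbb{E}_{x}\,
  \mathcal{L}\big(f\big((1-m)\odot x + m \odot t\big),\, y_t\big)
  \;+\; \lambda\, \lVert m \rVert_1
```

Because m and t live directly in pixel space, nothing in this objective can express a trigger that is only simple in some transformed input space.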

Results show UNICORN is effective for inverting various types of backdoor triggers. On average, the attack success rate of the inverted triggers is 95.60%, outperforming existing trigger inversion methods. Our contributions are summarized as follows: (1) We formally define the trigger in backdoor attacks; our definition generalizes to different types of triggers. (2) We find that the compromised activations and the benign activations in the model's intermediate representations are disentangled. (3) Based on the formalization of the backdoor trigger and the finding on intermediate representations, we formulate our framework as a constrained optimization problem and propose a new trigger inversion framework. (4) We evaluate our framework on nine different DNN models and eight backdoor attacks; results show that our framework is more general and more effective than existing methods. Our open-source code can be found at https://github.com/RU-System-Software-and-Security/UNICORN.

