UNICORN: A UNIFIED BACKDOOR TRIGGER INVERSION FRAMEWORK

Abstract

The backdoor attack, where the adversary uses inputs stamped with triggers (e.g., a patch) to activate pre-planted malicious behaviors, is a severe threat to Deep Neural Network (DNN) models. Trigger inversion is an effective way of identifying backdoor models and understanding embedded adversarial behaviors. A challenge of trigger inversion is that there are many ways of constructing the trigger. Existing methods make certain assumptions or impose attack-specific constraints and hence cannot generalize to various types of triggers. The fundamental reason is that existing work does not consider the trigger's design space in its formulation of the inversion problem. This work formally defines and analyzes triggers injected in different spaces and the corresponding inversion problem. It then proposes a unified framework to invert backdoor triggers, based on this formalization of triggers and the inner behaviors of backdoor models identified in our analysis. Our prototype UNICORN is general and effective in inverting backdoor triggers in DNNs.

1. INTRODUCTION

Backdoor attacks against Deep Neural Networks (DNNs) refer to attacks in which the adversary creates a malicious DNN that behaves as expected on clean inputs but predicts a predefined target label when the input is stamped with a trigger (Liu et al., 2018b; Gu et al., 2017; Chen et al., 2017; Liu et al., 2019; Wang et al., 2022c; Barni et al., 2019; Nguyen & Tran, 2021). The malicious models can be generated by data poisoning (Gu et al., 2017; Chen et al., 2017) or supply-chain attacks (Liu et al., 2018b; Nguyen & Tran, 2021). The adversary can choose the desired target label(s) and the trigger. Existing work demonstrates that DNNs are vulnerable to various types of triggers. For example, the trigger can be a colored patch (Gu et al., 2017), an image filter (Liu et al., 2019), or a warping effect (Nguyen & Tran, 2021). Such attacks pose a severe threat to DNN-based applications, especially those in security-critical tasks such as malware classification (Severi et al., 2021; Yang et al., 2022; Li et al., 2021a), face recognition (Sarkar et al., 2020; Wenger et al., 2021), speaker verification (Zhai et al., 2021), medical image analysis (Feng et al., 2022), brain-computer interfaces (Meng et al., 2020), and autonomous driving (Gu et al., 2017; Xiang et al., 2021). Due to the threat of backdoor attacks, many countermeasures have been proposed, for example, anti-poisoning training (Li et al., 2021c; Wang et al., 2022a; Hong et al., 2020; Tran et al., 2018; Hayase et al., 2021; Chen et al., 2018) and runtime detection of malicious inputs (Gao et al., 2019; Doan et al., 2020; Zeng et al., 2021). Unlike many methods that only work under specific threat models (e.g., anti-poisoning training applies only to the data-poisoning scenario), trigger inversion (Wang et al., 2019; Liu et al., 2019; Guo et al., 2020; Chen et al., 2019; Shen et al., 2021) is practical and general because it can be applied in both poisoning and supply-chain attack scenarios.
Trigger inversion is a post-training method in which the defender aims to detect whether a given model contains a backdoor. It reconstructs the backdoor triggers injected in the model as well as the target labels, which helps analyze the backdoor. If there exists an inverted pattern that can control the model's predictions, the model is determined to be backdoored. Most existing trigger inversion methods (Guo et al., 2020; Liu et al., 2019; Shen et al., 2021; Chen et al., 2019) are built on Neural Cleanse (Wang et al., 2019), which assumes that backdoor triggers are static patterns in the pixel space. It defines a backdoor sample as x′ = (1 − m) ⊙ x + m ⊙ t, where m and t are the pixel-space trigger mask and trigger pattern, respectively.
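As a minimal illustration of this pixel-space formulation (a NumPy sketch, not the paper's code; the toy image, mask, and pattern values are ours), stamping a trigger onto an input is an elementwise blend controlled by the mask m:

```python
import numpy as np

def stamp_trigger(x, mask, pattern):
    """Pixel-space trigger stamping: x' = (1 - m) * x + m * t.

    mask (m) in [0, 1] selects which pixels the trigger overwrites;
    pattern (t) holds the trigger's pixel values.
    """
    return (1 - mask) * x + mask * pattern

# Toy 4x4 grayscale "image" with a 2x2 patch trigger in the corner.
x = np.zeros((4, 4))                 # clean input
mask = np.zeros((4, 4))
mask[:2, :2] = 1.0                   # trigger occupies the top-left patch
pattern = np.full((4, 4), 0.9)       # trigger pixel values

x_hat = stamp_trigger(x, mask, pattern)
# Pixels inside the mask take the pattern value; the rest are unchanged.
```

In Neural Cleanse-style inversion, m and t become optimization variables: the defender minimizes the classification loss toward each candidate target label on stamped inputs, plus a sparsity penalty on m (e.g., its L1 norm), and flags the model if an anomalously small mask flips predictions.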

