TOWARDS ROBUSTNESS CERTIFICATION AGAINST UNIVERSAL PERTURBATIONS

Abstract

In this paper, we investigate the problem of certifying neural network robustness against universal perturbations (UPs), which have been widely used in universal adversarial attacks and backdoor attacks. Existing robustness certification methods aim to provide robustness guarantees for each sample with respect to the worst-case perturbation given a neural network. However, such sample-wise bounds are loose under the UP threat model, as they overlook the important constraint that the perturbation must be shared across all samples. We propose a method combining linear relaxation-based perturbation analysis and Mixed Integer Linear Programming to establish the first robustness certification method for UPs. In addition, we develop a theoretical framework for computing error bounds on the entire population using the certification results from a randomly sampled batch. Beyond an extensive evaluation of the proposed certification, we further show how the certification facilitates efficient comparison of robustness among different models, or of efficacy among different defenses against universal adversarial attacks, and enables accurate detection of backdoor target classes.

1. INTRODUCTION

As deep neural networks become prevalent in modern performance-critical systems such as self-driving cars and healthcare, it is critical to understand their failure modes and performance guarantees. Universal perturbations (UPs) are an important class of vulnerabilities faced by deep neural networks. Such perturbations can fool a classifier into misclassifying any input from a given distribution with high probability at test time. Past literature has studied two lines of techniques to create UPs: universal adversarial attacks (Moosavi-Dezfooli et al., 2017) and backdoor attacks (Gu et al., 2019; Chen et al., 2017). The former crafts a UP based on a trained model and does not rely on access to training data. The latter, by contrast, prespecifies a pattern as a UP and further alters the training data so that adding the pattern (often known as the trigger in the backdoor attack literature) will change the output of the trained classifier into an attacker-desired target class. Many defenses have been proposed for both universal adversarial attacks (Akhtar & Mian, 2018; Moosavi-Dezfooli et al., 2017; Shafahi et al., 2020; Benz et al., 2021; Liu et al., 2021) and backdoor attacks (Wang et al., 2019; Chen et al., 2019; Guo et al., 2019; Borgnia et al., 2020; Qiu et al., 2021). But empirical evaluation with attacks does not provide a formal guarantee of robustness, as it is infeasible for an attack algorithm to provably cover all concerned perturbations. In contrast, robustness certification aims to verify the output bounds of the model under a certain class of input perturbations and thus provably certify robustness against all such perturbations. Several recent works (Weber et al., 2020; Xie et al., 2021) developed techniques to achieve certified robustness of a classifier against backdoor-attack-induced UPs with a certain norm bound. However, these techniques apply only to specific learning algorithms and require knowledge of the training data.
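As a concrete illustration of this threat model, the sketch below crafts one perturbation shared by an entire batch against a toy linear classifier, via projected gradient ascent within an ℓ∞ ball. All names and hyperparameters here (W, eps, step) are illustrative stand-ins, not the attack algorithms cited above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a fixed linear 2-class classifier W and a batch of inputs X.
d, n, eps, step, iters = 8, 64, 0.25, 0.05, 100
W = rng.normal(size=(2, d))
X = rng.normal(size=(n, d))
y = np.argmax(X @ W.T, axis=1)  # labels = the model's clean predictions

def accuracy(delta):
    """Accuracy when ONE shared perturbation delta is added to every input."""
    return np.mean(np.argmax((X + delta) @ W.T, axis=1) == y)

# Projected gradient ascent on a single shared delta (the universal perturbation):
# push every sample toward the wrong class, then project back into the l_inf ball.
delta = np.zeros(d)
for _ in range(iters):
    wrong = 1 - y  # the other class in this 2-class toy
    # gradient of (logit_wrong - logit_true) w.r.t. delta, summed over the batch
    grad = np.sum(W[wrong] - W[y], axis=0)
    delta = np.clip(delta + step * np.sign(grad), -eps, eps)

print(f"clean accuracy: {accuracy(np.zeros(d)):.2f}")
print(f"accuracy under shared UP: {accuracy(delta):.2f}")
```

The key difference from a sample-wise attack is that the gradient is accumulated over the whole batch before updating the single shared delta.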
It remains an open question: how to certify the robustness of a trained model against a class of UPs in a way that is agnostic to the underlying training algorithm and data, and is general across different UPs (including both universal adversarial attacks and norm-bounded backdoor attacks)? In this paper, we propose a framework to certify the worst-case classification accuracy on a batch of test samples against ℓ∞-norm-bounded UPs. Our approach builds on past work for certifying robustness against sample-wise perturbations that are independently added to each sample. For efficient verification, many recent works linearly relax the nonlinear activation functions in neural networks into linear bounds and then conduct linear bound propagation to obtain output bounds for the whole model (Wong & Kolter, 2018; Wang et al., 2018b; Dvijotham et al., 2018; Zhang et al., 2018; Singh et al., 2019b). This process is also referred to as linear perturbation analysis (Xu et al., 2020a). Since the worst-case model accuracy against sample-wise perturbations is a lower bound on the worst-case accuracy against UPs, these certification techniques can be applied to obtain a certificate against UPs. However, a direct application would overlook the important constraint that a UP is shared across different inputs, thereby producing overly conservative certification results.

Unlike sample-wise perturbations, UPs also call for theoretical reasoning about how certification results generalize. A UP can be applied to any input from the data distribution, and our main interest lies in the expected model accuracy over the entire distribution against UPs; yet a certification procedure can only accept a batch of samples from the distribution and certify the accuracy over those samples. It is therefore crucial to understand the discrepancy between the certified robustness computed from samples and the actual population robustness.
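To see why the shared-perturbation constraint matters, the toy sketch below certifies a small linear classifier exactly: for a linear model, the extremes over an ℓ∞ ball are attained at the corners of the ball, so enumerating corners is exact (nonlinear networks instead require relaxations such as LiRPA). It compares sample-wise certification, where each input may receive its own worst-case perturbation, with universal certification, where one perturbation must be shared. This is a hypothetical illustration, not the paper's algorithm.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

d, n, eps = 4, 40, 0.3
W = rng.normal(size=(2, d))
X = rng.normal(size=(n, d))
y = np.argmax(X @ W.T, axis=1)  # labels = the model's clean predictions

def correct(delta):
    """Per-sample correctness under one perturbation delta added to all inputs."""
    return np.argmax((X + delta) @ W.T, axis=1) == y

# All 2^d corners of the l_inf ball of radius eps (exact worst cases here).
corners = [eps * np.array(s) for s in itertools.product([-1.0, 1.0], repeat=d)]
corr = np.array([correct(c) for c in corners])  # shape: (num_corners, n)

# Sample-wise certification: each input may get its OWN worst-case corner.
samplewise_acc = np.mean(corr.all(axis=0))
# Universal certification: one corner must be SHARED across all inputs.
universal_acc = corr.mean(axis=1).min()

print(f"sample-wise certified accuracy: {samplewise_acc:.2f}")
print(f"universal certified accuracy:   {universal_acc:.2f}")
```

The sample-wise bound can never exceed the universal bound: any shared perturbation is also a valid per-sample perturbation, so adding the sharing constraint only tightens the certificate.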
We summarize our contributions as follows:
• We formulate the problem of robustness certification against UPs. We then generalize linear relaxation-based perturbation analysis (LiRPA) to UPs, and further propose a Mixed Integer Linear Programming (MILP) formulation over the linear bounds from LiRPA to obtain tighter certification of the worst-case accuracy of a given model against UPs within an ℓ∞-norm ball¹.
• We establish a theoretical framework for analyzing how certification results computed on randomly sampled subsets generalize to the entire population.
• We conduct extensive experiments showing that our certification method provides certified lower bounds on the worst-case robust accuracy against both universal adversarial attacks and ℓ∞-bounded backdoor attacks, which are substantially tighter than the results of directly applying existing sample-wise certification.
• We also investigate the implications of robustness certification against UPs: it facilitates easy comparison of robustness among different models, or of efficacy among different empirical defenses, and enables reliable identification of backdoor target classes.
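The sample-to-population step in the second contribution can be illustrated with a standard one-sided Hoeffding bound. This is a generic stand-in sketch under the assumption of an i.i.d. batch, not necessarily the exact framework developed in the paper.

```python
import math

def population_lower_bound(cert_acc, n, alpha=0.05):
    """One-sided Hoeffding bound (illustrative stand-in for the paper's analysis):
    with probability >= 1 - alpha over the draw of n i.i.d. test samples, the
    population certified accuracy is at least the batch certified accuracy
    minus sqrt(ln(1/alpha) / (2n))."""
    return cert_acc - math.sqrt(math.log(1.0 / alpha) / (2.0 * n))

# e.g., certifying 92% of a 1000-sample batch at 95% confidence:
print(round(population_lower_bound(0.92, 1000), 4))
```

The slack term shrinks as O(1/sqrt(n)), so larger certified batches translate into tighter population-level guarantees.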

2. BACKGROUND AND RELATED WORK

Universal Adversarial Perturbation. Neural networks are vulnerable to adversarial examples (Szegedy et al., 2014), which has led to the development of universal adversarial perturbations (UAPs): a single noise pattern that can consistently deceive a target network on most images (Liu et al., 2019; 2020). Existing defenses against UAPs include fine-tuning on pre-computed UAPs (Moosavi-Dezfooli et al., 2017), post-hoc detection (Akhtar et al., 2018), and universal adversarial training with online UAP generation (Mummadi et al., 2019; Shafahi et al., 2020; Benz et al., 2021). However, all existing defenses against UAPs are empirical and offer no efficacy guarantee against new attacks.

Backdoor Attacks. In backdoor attacks, attackers plant a predefined UP (a.k.a. the trigger) in the victim model by manipulating the training procedure (Li et al., 2020c). Attacked models can give adversarially-desired outputs for any input patched with the trigger while still showing good performance on clean inputs. Existing defenses include: poison detection via outlier detection (Gao et al., 2019; Chen et al., 2018; Tran et al., 2018; Zeng et al., 2021), which relies on modeling the clean samples' distribution; poisoned-model identification (Xu et al., 2019; Wang et al., 2020b); trojan removal via trigger synthesis (Wang et al., 2019; Chen et al., 2019; Guo et al., 2019; Zeng et al., 2022a) or via preprocessing and fine-tuning (Li et al., 2020b; Borgnia et al., 2020); and robust training via differential privacy (Du et al., 2019) or redesigned training pipelines (Levine & Feizi, 2020; Jia et al., 2020; Huang et al., 2022; Li et al., 2021). As all these defenses are empirical, existing literature has revealed their limitations against zero-day or adaptive attacks (Zeng et al., 2022b).

Robustness Certification of Neural Networks. Early robustness certifications (Katz et al., 2017; Ehlers, 2017; Tjeng et al., 2017) largely relied on satisfiability modulo theory (SMT) or integer



¹ https://github.com/ruoxi-jia-group/Universal_Pert_Cert

