ROBUSTNESS AGAINST RELATIONAL ADVERSARY

Abstract

Test-time adversarial attacks have posed serious challenges to the robustness of machine-learning models, and in many settings the adversarial perturbation need not be bounded by small ℓp-norms. Motivated by semantics-preserving attacks in the vision and security domains, we investigate relational adversaries, a broad class of attackers who create adversarial examples that lie in the reflexive-transitive closure of a logical relation. We analyze the conditions for robustness and propose normalize-and-predict, a learning framework with a provable robustness guarantee. We compare our approach with adversarial training and derive a unified framework that provides the benefits of both approaches. Guided by our theoretical findings, we apply our framework to image classification and malware detection. Results on both tasks show that attacks by relational adversaries frequently fool existing models, but our unified framework can significantly enhance their robustness.

1. INTRODUCTION

The robustness of machine learning (ML) systems has been challenged by test-time attacks using adversarial examples (Szegedy et al., 2013). These adversarial examples are intentionally manipulated inputs that preserve the essential characteristics of the original inputs, and thus are expected to receive the same label as the originals by human standards; yet they severely degrade the performance of many ML models across different domains (Moosavi-Dezfooli et al., 2016; Eykholt et al., 2018; Qin et al., 2019). As models in high-stakes domains such as system security are also undermined by attacks (Grosse et al., 2017; Rosenberg et al., 2018; Hu & Tan, 2018), robust ML in adversarial test environments has become an imperative task for the ML community.

Existing work on test-time attacks predominantly considers ℓp-norm-bounded adversarial manipulation (Goodfellow et al., 2014; Carlini & Wagner, 2017). However, in many security-critical settings, adversarial examples need not respect an ℓp-norm constraint as long as they preserve the malicious semantics. In malware detection, for example, a malware author can implement the same function using different APIs, or bind a malware within benign software such as video games or office tools. The modified malware preserves the malicious functionality despite drastically different syntactic features. Hence, focusing on adversarial examples of small ℓp-norm in this setting fails to address a sizable attack surface that attackers can exploit to evade detectors.

In addition to security threats, another rising concern about ML models is the spurious correlations they may have learned from a biased data set. Ribeiro et al. (2016) show that a highly accurate wolf-vs-husky classifier in fact bases its predictions on the presence or absence of snow in the background. A reliable model, in contrast, should be robust to changes of this nature.
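The API-based evasion described above can be illustrated on a binary API-presence feature vector. The function `substitute_api` and the feature indices below are hypothetical illustrations, not the actual feature extraction used in the paper: they only show how one semantics-preserving rewrite changes syntactic features without changing behavior.

```python
def substitute_api(v, i, j):
    """Sketch of one semantics-preserving rewrite: replace API i with an
    assumed-equivalent API j in a binary API-presence feature vector.
    Program behavior is unchanged, but the syntactic features differ."""
    if v[i] == 1 and v[j] == 0:
        w = list(v)
        w[i], w[j] = 0, 1
        return tuple(w)
    return v  # rewrite not applicable; input unchanged

original = (1, 0, 1, 0)                    # malware calling APIs 0 and 2
evasive = substitute_api(original, 2, 3)   # swap API 2 for equivalent API 3
# evasive == (1, 0, 0, 1): same functionality, different features
```

A detector that learned to key on feature 2 alone would miss `evasive`, even though the two inputs represent the same program behavior.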
Although dubbed semantic perturbations or manipulations (Mohapatra et al., 2020; Bhattad et al., 2019), these changes do not alter the core semantics of the input data; thus, we still consider them semantics-preserving with respect to the classification task. Since such semantics-preserving changes often result in large ℓp-norms, they are likely to render existing ℓp-norm-based defenses ineffective. In this paper, we consider a general attack framework in which attackers create adversarial examples by transforming the original inputs via a set of rules in a semantics-preserving manner. Unlike prior work (Rosenberg et al., 2018; Hu & Tan, 2018; Hosseini et al., 2017; Hosseini & Poovendran, 2018), which investigates specific adversarial settings, our paper extends the scope of attacks to general logical transformations: we unify the threat models into a powerful relational adversary, which can readily incorporate more complex input transformations.

From the defense perspective, recent work has started to look beyond ℓp-norm constraints, including adversarial training (Grosse et al., 2017; Rosenberg et al., 2019; Lei et al., 2019), verification-loss regularization (Huang et al., 2019) and invariance-induced regularization (Yang et al., 2019). Adversarial training can in principle achieve high robust accuracy when the adversarial example in the training loop maximizes the loss. However, finding such adversarial examples is in general NP-hard (Katz et al., 2017), and we show in Sec 4 that it is even PSPACE-hard for the semantics-preserving attacks considered in this paper. Huang et al. (2019) and Yang et al. (2019) add regularizers that incorporate model robustness into the training objective. However, such regularization cannot be strictly enforced in training, and neither can model robustness. These limitations leave models vulnerable to semantics-preserving attacks.
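The relational adversary's reachable set, i.e., the reflexive-transitive closure of the atomic transformations, can be sketched as a breadth-first enumeration. The rule set below (adding redundant APIs to a binary feature vector) is a toy assumption for illustration, not the paper's formalism:

```python
from collections import deque

def reachable_set(x, rules, max_depth=None):
    """Enumerate the reflexive-transitive closure of a logical relation.

    `x` is an input (here a tuple of features); `rules` is a list of
    functions, each mapping an input to the set of inputs reachable by
    one atomic, semantics-preserving transformation."""
    seen = {x}                     # reflexivity: x itself is reachable
    frontier = deque([(x, 0)])
    while frontier:
        cur, depth = frontier.popleft()
        if max_depth is not None and depth >= max_depth:
            continue
        for rule in rules:
            for nxt in rule(cur):  # transitivity: chain atomic steps
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
    return seen

# Toy relation over binary API-presence vectors: the attacker may *add*
# (never remove) either of two redundant APIs, at indices 1 and 2.
def add_api(i):
    def rule(v):
        if v[i] == 0:
            w = list(v)
            w[i] = 1
            return {tuple(w)}
        return set()
    return rule

closure = reachable_set((1, 0, 0), [add_api(1), add_api(2)])
# closure == {(1,0,0), (1,1,0), (1,0,1), (1,1,1)}
```

Even this tiny relation yields exponentially many reachable variants as the number of addable features grows, which is consistent with the hardness of finding loss-maximizing adversarial examples discussed above.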
Normalize-and-Predict Learning Framework This paper attempts to overcome the limitations of prior work by introducing a learning framework that guarantees robustness by design. In particular, we target a relational adversary, whose admissible manipulations are specified by a logical relation. A logical relation is a set of input pairs, each consisting of the source and target of an atomic, semantics-preserving transformation. We consider a strong adversary who can apply an arbitrary number of transformations. Our paper makes the following contributions towards the theoretical understanding of robust ML against relational adversaries:

1. We formally describe admissible adversarial manipulations using logical relations, and characterize the necessary and sufficient conditions for robustness to relational adversaries.

2. We propose normalize-and-predict (hereinafter abbreviated as N&P), a learning framework that first converts each input to a well-defined, unique normal form and then trains and classifies over the normalized inputs. We show that our framework has a robustness guarantee, and characterize the conditions for different levels of robustness-accuracy trade-off.

3. We compare N&P to the popular adversarial training framework, which directly optimizes for accuracy under attack. We show that N&P has the advantage of an explicit robustness guarantee and reduced training complexity, and in certain cases yields the same model accuracy as adversarial training. Motivated by this comparison, we propose a unified framework, which selectively normalizes over relations that tend to preserve model accuracy and adversarially trains over the rest. Our unified approach inherits the benefits of both frameworks.

We then apply our theoretical findings to malware detection and image classification.
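The normalize-and-predict idea can be sketched as follows. The normalizer below is a minimal illustration under an assumed relation in which the attacker may set certain features to 1: saturating those features collapses every input in an equivalence class to the same maximal element, so any classifier applied to the normal form is invariant to the attack by construction. The names `normalize` and `predict_np` are ours, not the paper's.

```python
def normalize(v, addable=(1, 2)):
    """Map an input to a canonical normal form, assuming the relation
    lets the attacker set any feature in `addable` to 1. All inputs
    reachable from one another normalize to the same vector."""
    w = list(v)
    for i in addable:
        w[i] = 1
    return tuple(w)

def predict_np(model, v):
    """Normalize-and-predict: classify the normal form, not the raw input."""
    return model(normalize(v))

# Any model becomes robust to this relation when wrapped this way:
model = lambda v: "malicious" if sum(v) >= 2 else "benign"
variants = [(1, 0, 0), (1, 1, 0), (1, 0, 1), (1, 1, 1)]  # one equivalence class
# predict_np gives the same label on every variant
```

The trade-off is also visible here: normalization discards the distinction between, say, `(0, 0, 0)` and `(0, 1, 1)`, so accuracy on clean inputs can drop when the relation does not actually preserve labels, which is what motivates the unified framework.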
For the former, we first formulate two types of common program transformation as logical relations: (1) addition of redundant libraries and API calls, and (2) substitution of equivalent API calls. Next, we instantiate our learning framework on these relations, and propose two generic relational adversarial attacks to assess the robustness of a model. Finally, we perform experiments on Sleipnir, a real-world WIN32 malware data set. For image classification, we reuse an attack method proposed in prior work (Hosseini & Poovendran, 2018), shifting the hue in the HSV color space, which can be viewed as a specific instantiation of our attack framework. We then compare the accuracy and robustness of ResNet-32 (He et al., 2016), a common image classification model, trained with the unified framework against standard adversarial training on CIFAR-10 (Krizhevsky et al., 2009). The results we obtain in both tasks show that:

1. Attacks using addition and substitution suffice to evade existing ML malware detectors.

2. Our unified approach combining input normalization and adversarial training achieves the highest robust accuracy among all baselines in malware detection. The drop in accuracy on clean inputs is small, and the computation cost is lower than that of pure adversarial training.

3. When trained with the unified learning framework, ResNet-32 achieves similar clean accuracy but significantly higher robust accuracy compared to adversarial training alone.

Finally, based on our theoretical and empirical results, we conclude that input normalization is vital to robust learning against relational adversaries. We believe techniques that improve the quality of normalization are promising directions for future work.
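The hue-shift transformation used in the image-classification experiments can be sketched per pixel with the standard library's `colorsys` module. This is a single-pixel illustration of the idea, not the paper's attack implementation: it rotates the hue channel in HSV space while keeping saturation and value fixed, a change that preserves shape-based class semantics.

```python
import colorsys

def shift_hue(rgb, delta):
    """Rotate the hue of one RGB pixel (channels in [0, 1]) by `delta`
    turns of the hue circle, leaving saturation and value unchanged."""
    r, g, b = rgb
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    return colorsys.hsv_to_rgb((h + delta) % 1.0, s, v)

# Rotating pure red by half the hue circle yields pure cyan.
shifted = shift_hue((1.0, 0.0, 0.0), 0.5)
# shifted == (0.0, 1.0, 1.0)
```

Applying this to every pixel of an image gives an adversarial variant whose ℓp-distance from the original can be very large even though a human would assign it the same label, which is exactly the regime where ℓp-bounded defenses break down.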

