ROBUSTNESS AGAINST RELATIONAL ADVERSARY

Abstract

Test-time adversarial attacks have posed serious challenges to the robustness of machine-learning models, and in many settings the adversarial perturbation need not be bounded by small ℓp-norms. Motivated by semantics-preserving attacks in the vision and security domains, we investigate relational adversaries, a broad class of attackers who create adversarial examples that lie in the reflexive-transitive closure of a logical relation. We analyze the conditions for robustness and propose normalize-and-predict, a learning framework with provable robustness guarantees. We compare our approach with adversarial training and derive a unified framework that provides the benefits of both approaches. Guided by our theoretical findings, we apply our framework to image classification and malware detection. Results on both tasks show that attacks by relational adversaries frequently fool existing models, but our unified framework can significantly enhance their robustness.

1. INTRODUCTION

The robustness of machine learning (ML) systems has been challenged by test-time attacks using adversarial examples (Szegedy et al., 2013). These adversarial examples are intentionally manipulated inputs that preserve the essential characteristics of the original inputs, and thus are expected to have the same test outcome as the originals by human standards; yet they severely degrade the performance of many ML models across different domains (Moosavi-Dezfooli et al., 2016; Eykholt et al., 2018; Qin et al., 2019). As models in high-stakes domains such as system security are also undermined by attacks (Grosse et al., 2017; Rosenberg et al., 2018; Hu & Tan, 2018), robust ML in adversarial test environments has become an imperative task for the ML community. Existing work on test-time attacks predominantly considers ℓp-norm bounded adversarial manipulation (Goodfellow et al., 2014; Carlini & Wagner, 2017). However, in many security-critical settings, adversarial examples need not respect an ℓp-norm constraint as long as they preserve the malicious semantics. In malware detection, for example, a malware author can implement the same functionality using different APIs, or bind a malware within benign software such as video games or office tools. The modified malware preserves the malicious functionality despite drastically different syntactic features. Hence, focusing on adversarial examples of small ℓp-norm in this setting fails to address a sizable attack surface that attackers can exploit to evade detectors. In addition to security threats, another rising concern about ML models is the spurious correlations they may have learned from a biased data set. Ribeiro et al. (2016) show that a highly accurate wolf-vs-husky classifier in fact bases its prediction on the presence or absence of snow in the background. A reliable model, in contrast, should be robust to changes of this nature.
Although dubbed semantic perturbations or manipulations (Mohapatra et al., 2020; Bhattad et al., 2019), these changes do not alter the core semantics of the input data; thus, we still consider them semantics-preserving with respect to the classification task. Since such semantics-preserving changes often result in perturbations of large ℓp-norm, they are likely to render existing ℓp-norm based defenses ineffective. In this paper, we consider a general attack framework in which attackers create adversarial examples by transforming the original inputs via a set of rules in a semantics-preserving manner. Unlike prior work (Rosenberg et al., 2018; Hu & Tan, 2018; Hosseini et al., 2017; Hosseini & Poovendran, 2018), which investigates specific adversarial settings, our paper extends the scope of attacks to general logical transformations: we unify the threat models into a powerful relational adversary, which can readily incorporate more complex input transformations.
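To make the idea concrete, the following is a minimal toy sketch (not the paper's implementation) of the normalize-and-predict principle: a relational adversary rewrites an input using semantics-preserving rules, and the defense maps every input to a canonical representative of its rewrite class before classification, so all variants reachable via the relation receive the same prediction. The rule table, token names, and base classifier below are purely illustrative assumptions.

```python
# Hypothetical rewrite rules: each token maps to a semantically equivalent
# canonical token. In malware detection these might be interchangeable APIs.
RULES = {"CreateFileA": "open", "CreateFileW": "open", "fopen": "open"}

def normalize(tokens):
    """Map each token to its normal form by following rewrite chains."""
    out = []
    for t in tokens:
        while t in RULES:  # iterate until a fixed point (canonical form)
            t = RULES[t]
        out.append(t)
    return tuple(out)

def normalized_classifier(base_predict):
    """Wrap a classifier so it only ever sees normalized inputs."""
    return lambda tokens: base_predict(normalize(tokens))

# Toy base classifier: flags any trace that both opens and deletes a file.
def base_predict(tokens):
    return "malicious" if ("open" in tokens and "delete" in tokens) else "benign"

f = normalized_classifier(base_predict)
# Syntactically different but semantically equivalent traces agree:
assert f(("CreateFileA", "delete")) == f(("fopen", "delete")) == "malicious"
```

The key property is that the wrapped classifier is constant on each equivalence class induced by the relation, so no sequence of rule applications can change its prediction.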

