STRATA: SIMPLE, GRADIENT-FREE ATTACKS FOR MODELS OF CODE

Abstract

Adversarial examples are imperceptible perturbations to the input of a neural model that result in misclassification. Generating adversarial examples for source code poses an additional challenge compared to the image and natural language domains, because source code perturbations must adhere to strict semantic constraints so that the resulting programs preserve the functional meaning of the original code. We propose a simple and efficient gradient-free method for generating state-of-the-art adversarial examples against models of code that can be applied in a white-box or black-box setting. Our method generates untargeted and targeted attacks, and empirically outperforms competing gradient-based methods while requiring less information and less computational effort.

1. INTRODUCTION

Although machine learning has been shown to be effective at a wide variety of tasks across computing, statistical models are susceptible to adversarial examples. Adversarial examples, first identified in the continuous domain by Szegedy et al. (2014), are imperceptible perturbations to input that result in misclassification. Researchers have developed effective techniques for adversarial example generation in the image domain (Goodfellow et al., 2015; Moosavi-Dezfooli et al., 2017; Papernot et al., 2016a) and in the natural language domain (Alzantot et al., 2018; Belinkov & Bisk, 2018; Cheng et al., 2020; Ebrahimi et al., 2018; Michel et al., 2019; Papernot et al., 2016b), although work in the source code domain is less extensive (see Related Work). The development of adversarial examples for deep learning models has progressed in tandem with the development of methods to make models robust to such attacks, though much is still being learned about model robustness (Goodfellow et al., 2015; Madry et al., 2018; Shafahi et al., 2019; Wong et al., 2019). The threat of adversarial examples poses severe risks for ML-based malware defenses (Al-Dujaili et al., 2018; Grosse et al., 2016; Kaur & Kaur, 2015; Kolosnjaji et al., 2018; Kreuk et al., 2019; Suciu et al., 2019), and allows malicious actors to trick ML-based code-suggestion tools into suggesting bugs to an unknowing developer (Schuster et al., 2020). Thus, developing state-of-the-art attacks and constructing machine learning models that are robust to these attacks are both important for computer security applications. Generating adversarial examples for models of code is harder than in the image and natural language domains, since the input data is discrete and textual, and adversarial perturbations must abide by strict syntactic rules and semantic requirements.
The CODE2SEQ model is a state-of-the-art model of code that has been used to explore adversarial example design and robustness methods for models of code (Rabin & Alipour, 2020; Ramakrishnan et al., 2020). In this work, we propose the Simple TRAined Token Attack (STRATA), a novel and effective method for generating black-box and white-box adversarial attacks against CODE2SEQ. Our method replaces local variable names with high-impact candidates that are identified by dataset statistics. It can also be used effectively for targeted attacks, where the perturbation targets a specific (altered) output classification. Further, we demonstrate that adversarial training, that is, injecting adversarial examples into CODE2SEQ's training set, improves the robustness of CODE2SEQ to adversarial attacks. We evaluate STRATA on CODE2SEQ, though we hypothesize that the method can be applied to other models. The principles underlying STRATA apply not only to models of source code, but also to natural language models in contexts where the vocabulary is large and there is limited training data.
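To make the attack concrete, the following is a minimal sketch of a gradient-free, variable-renaming attack in the spirit described above. All names here (`top_impact_candidates`, `rename_variable`, `untargeted_attack`, the frequency-based ranking, and the regex-based renaming) are illustrative assumptions, not the paper's actual implementation: STRATA's real candidate ranking uses dataset statistics over the model's training corpus, and a real implementation would rename variables via the parse tree rather than textual substitution.

```python
import re
from collections import Counter


def top_impact_candidates(train_tokens, k=3):
    # Hypothetical stand-in for STRATA's dataset-statistics ranking:
    # here we simply rank identifier tokens by raw training frequency.
    return [tok for tok, _ in Counter(train_tokens).most_common(k)]


def rename_variable(src, old, new):
    # Semantics-preserving perturbation: rename a local variable by
    # whole-word substitution. (Simplification: a real tool would use
    # the AST to avoid touching unrelated identifiers or strings.)
    return re.sub(rf"\b{re.escape(old)}\b", new, src)


def untargeted_attack(src, local_var, candidates, predict):
    # Greedy gradient-free search: try each candidate replacement name
    # and return the first renaming that changes the model's prediction,
    # or None if no candidate succeeds. Only model outputs are queried,
    # so this works in a black-box setting.
    original = predict(src)
    for cand in candidates:
        perturbed = rename_variable(src, local_var, cand)
        if predict(perturbed) != original:
            return perturbed
    return None
```

As a usage illustration, `predict` can be any callable that maps source text to a label, e.g. a wrapper around a trained model's inference API; the attack itself never inspects gradients or internals.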

