STRATA: SIMPLE, GRADIENT-FREE ATTACKS FOR MODELS OF CODE

Abstract

Adversarial examples are imperceptible perturbations in the input to a neural model that result in misclassification. Generating adversarial examples for source code poses an additional challenge compared to the domains of images and natural language, because source code perturbations must adhere to strict semantic guidelines so that the resulting programs retain the functional meaning of the original code. We propose a simple and efficient gradient-free method for generating state-of-the-art adversarial examples on models of code that can be applied in a white-box or black-box setting. Our method generates untargeted and targeted attacks, and empirically outperforms competing gradient-based methods with less information and less computational effort.

1. INTRODUCTION

Although machine learning has been shown to be effective at a wide variety of tasks across computing, statistical models are susceptible to adversarial examples. Adversarial examples, first identified in the continuous domain by Szegedy et al. (2014), are imperceptible perturbations to input that result in misclassification. Researchers have developed effective techniques for adversarial example generation in the image domain (Goodfellow et al., 2015; Moosavi-Dezfooli et al., 2017; Papernot et al., 2016a) and in the natural language domain (Alzantot et al., 2018; Belinkov & Bisk, 2018; Cheng et al., 2020; Ebrahimi et al., 2018; Michel et al., 2019; Papernot et al., 2016b), although work in the source code domain is less extensive (see Related Work). The development of adversarial examples for deep learning models has progressed in tandem with the development of methods to make models robust to such attacks, though much is still being learned about model robustness (Goodfellow et al., 2015; Madry et al., 2018; Shafahi et al., 2019; Wong et al., 2019).

The threat of adversarial examples poses severe risks for ML-based malware defenses (Al-Dujaili et al., 2018; Grosse et al., 2016; Kaur & Kaur, 2015; Kolosnjaji et al., 2018; Kreuk et al., 2019; Suciu et al., 2019), and gives malicious actors the ability to trick ML-based code-suggestion tools into suggesting bugs to an unknowing developer (Schuster et al., 2020). Thus, developing state-of-the-art attacks and constructing machine learning models that are robust to these attacks is important for computer security applications.

Generating adversarial examples for models of code poses a challenge compared to the image and natural language domains, since the input data is discrete and textual, and adversarial perturbations must abide by strict syntactical rules and semantic requirements.
The CODE2SEQ model is a state-of-the-art model of code that has been used to explore adversarial example design and robustness methods on models of code (Rabin & Alipour, 2020; Ramakrishnan et al., 2020). In this work, we propose the Simple TRAined Token Attack (STRATA), a novel and effective method for generating black-box and white-box adversarial attacks against CODE2SEQ. Our method replaces local variable names with high-impact candidates that are identified by dataset statistics. It can also be used effectively for targeted attacks, where the perturbation targets a specific (altered) output classification. Further, we demonstrate that adversarial training, that is, injecting adversarial examples into CODE2SEQ's training set, improves the robustness of CODE2SEQ to adversarial attacks. We evaluate STRATA on CODE2SEQ, though we hypothesize that the method can be applied to other models. The principles underlying STRATA apply not only to models of source code, but also to natural language models in contexts where the vocabulary is large and there is limited training data.

Our contributions are as follows:

1. STRATA constructs state-of-the-art adversarial examples using a gradient-free approach that outperforms gradient-based methods;
2. STRATA generates white-box adversarial examples that are extremely effective; black-box attacks that use dictionaries created from unrelated code datasets perform similarly (Appendix C);
3. STRATA does not require the use of a GPU and can be executed more quickly than competing gradient-based attacks (Appendix D.1);
4. STRATA is, to the authors' knowledge, the only available method that performs targeted attacks on CODE2SEQ, the current state-of-the-art model of code.
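To make the replacement strategy concrete, the untargeted attack can be sketched as a simple loop over candidate names. This is an illustrative sketch, not the paper's implementation: the helper name `strata_untargeted` is ours, and `predict` stands in for a black-box query to the model.

```python
import re

def strata_untargeted(code, local_var, candidates, predict):
    """Gradient-free untargeted attack sketch: rename `local_var` to each
    high-impact candidate and return the first rewrite that changes the
    model's predicted method name.  `predict` is a black-box model query."""
    original = predict(code)
    for cand in candidates:
        # Whole-word replacement so substrings of other identifiers are untouched.
        perturbed = re.sub(rf"\b{re.escape(local_var)}\b", cand, code)
        if predict(perturbed) != original:
            return perturbed
    return None  # no candidate flipped the prediction
```

A targeted variant would instead accept the first replacement whose prediction matches a chosen output name rather than any changed prediction.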

2. MOTIVATION

CODE2SEQ, developed by Alon et al. (2019a), is an encoder-decoder model inspired by SEQ2SEQ (Sutskever et al., 2014); it operates on code rather than natural language. CODE2SEQ is the state-of-the-art code model, and therefore it represents a good target for adversarial attacks and adversarial training. The model is tasked with predicting method names from the source code body of a method. The model considers both the structure of an input program's Abstract Syntax Trees (ASTs) and the tokens corresponding to identifiers such as variable names, types, and invoked method names.

To reduce the vocabulary size, identifier tokens are split into subtokens at commonly used boundaries such as camelCase and under_scores; these two identifiers would yield the subtokens "camel", "case", "under", and "scores". CODE2SEQ encodes subtokens into distributed embedding vectors. These subtoken embedding vectors are trained to capture semantic structure, so nearby embedding vectors should correspond to semantically similar subtokens (Bengio et al., 2003). In this paper, we distinguish between subtoken embedding vectors and token embedding vectors. Subtoken embedding vectors are trained model parameters. Token embedding vectors are computed as the sum of the embedding vectors of the constituent subtokens; if a token contains more than five subtokens, only the first five are summed, as per the CODE2SEQ architecture. The full description and architecture of the CODE2SEQ model is given in the original paper by Alon et al. (2019a).

The CODE2SEQ model only updates a subtoken embedding as frequently as that subtoken appears during training, which is proportional to its representation in the training dataset. However, the training datasets have very large vocabularies consisting not only of standard programming language keywords, but also a huge quantity of neologisms.
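The subtokenization and token-embedding scheme described above can be sketched as follows. The embedding table here is a toy stand-in for the trained model parameters; the splitting regex and dimensions are our illustrative choices, not taken from the CODE2SEQ codebase.

```python
import re
import numpy as np

EMBED_DIM = 8       # toy size; the real model uses larger embeddings
MAX_SUBTOKENS = 5   # CODE2SEQ sums at most the first five subtokens

def split_subtokens(token):
    """Split an identifier into lowercase subtokens on camelCase and under_scores."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", token).replace("_", " ")
    return [part.lower() for part in spaced.split()]

# Toy subtoken embedding table (trained parameters in the real model).
rng = np.random.default_rng(0)
embedding = {s: rng.normal(size=EMBED_DIM)
             for s in ["camel", "case", "under", "scores"]}

def token_embedding(token):
    """Token embedding vector = sum of the embeddings of its first five subtokens."""
    subs = split_subtokens(token)[:MAX_SUBTOKENS]
    return np.sum([embedding[s] for s in subs], axis=0)
```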
The frequency at which subtokens appear in the CODE2SEQ java-large training set varies over many orders of magnitude, with the least common subtokens appearing fewer than 150 times and the most common over 10^8 times. Thus, subtoken embedding vectors corresponding to infrequently-appearing subtokens will be modified by the training procedure much less often than those of common subtokens. Figure 1a demonstrates this phenomenon, showing a disparity between the L2 norms of frequent and infrequently-appearing subtoken embedding vectors.

We confirm this empirically. When we initialized embedding vectors uniformly at random and then trained the model as normal, as per Alon et al. (2019a), we found that the vast majority of final, i.e., post-training, embedding vectors change very little from their initialization values. In fact, 90% of subtoken embeddings had an L2 distance of less than 0.05 between the initial vector and the final, post-training vector when trained on a Java dataset. About 10% of subtokens had a large L2 distance between the initial and final embeddings; these subtokens were more frequent in the training dataset and had embedding vectors with a notably larger final L2 magnitude (Figure 1). The observation that high-L2-norm embedding vectors are associated with subtokens that appear sufficiently frequently in the dataset motivates the core intuition of our attack: subtokens with high-L2-norm embedding vectors can be used to construct effective adversarial examples.
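The candidate-selection intuition above can be sketched directly: rank subtokens by the L2 norm of their embedding vectors and keep the top fraction. The toy table below is illustrative; in practice the norms would come from the trained CODE2SEQ embedding matrix.

```python
import numpy as np

def high_norm_candidates(embeddings, top_frac=0.10):
    """Return the subtokens whose embedding L2 norm is in the top fraction,
    mirroring the ~10% of subtokens that move far from initialization."""
    norms = {s: float(np.linalg.norm(v)) for s, v in embeddings.items()}
    k = max(1, int(len(norms) * top_frac))
    return sorted(norms, key=norms.get, reverse=True)[:k]

# Toy table: "parser" mimics a frequently-trained subtoken (large norm);
# the rest stay near a small random initialization.
rng = np.random.default_rng(0)
table = {s: 0.05 * rng.normal(size=8)
         for s in ["foo", "bar", "baz", "qux", "tmp",
                   "idx", "val", "obj", "ptr", "cnt"]}
table["parser"] = 3.0 * rng.normal(size=8)
```

With this table, `high_norm_candidates(table)` selects only "parser", since its embedding norm dwarfs the near-initialization ones.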



We note that very-high-frequency subtokens have small L2 norms. Examples of such subtokens include get, set, string, and void, which appear so often that they are not useful for classification. Although these subtokens are not good adversarial candidates for STRATA, there are so few of them that we expect them to have minimal influence on the effectiveness of our attack.

