PROBABILISTIC CATEGORICAL ADVERSARIAL ATTACK & ADVERSARIAL TRAINING

Abstract

The existence of adversarial examples raises serious concerns about applying Deep Neural Networks (DNNs) to safety-critical tasks. However, generating adversarial examples for categorical data is an important problem that lacks extensive exploration. Previously established methods leverage greedy search, which can be very time-consuming for conducting a successful attack. This also limits the development of adversarial training and potential defenses for categorical data. To tackle this problem, we propose the Probabilistic Categorical Adversarial Attack (PCAA), which transfers the discrete optimization problem into a continuous problem that can be solved efficiently by Projected Gradient Descent. We theoretically analyze its optimality and time complexity to demonstrate its significant advantage over current greedy-based attacks. Moreover, based on our attack, we propose an efficient adversarial training framework. Through a comprehensive empirical study, we justify the effectiveness of our proposed attack and defense algorithms.

1. INTRODUCTION

Adversarial attacks (Goodfellow et al., 2015) have raised great concerns for the applications of Deep Neural Networks (DNNs) in many security-critical domains (Cui et al., 2019; Stringhini et al., 2010; Cao & Tay, 2001). The majority of existing methods focus on differentiable models and continuous input spaces, where gradient-based approaches can be applied to generate adversarial examples. However, there are many machine learning tasks where the input data are categorical. For example, data in ML-based intrusion detection systems (Khraisat et al., 2019) contain records of system operations, and in financial transaction systems, data include categorical information such as the types of transactions. Therefore, exploring potential attacks and corresponding defenses for categorical inputs is also desirable. Existing methods introduce search-based approaches for categorical adversarial attacks (Yang et al., 2020b; Lei et al., 2019a). For example, the method in Yang et al. (2020a) first finds the top-K features of a given sample that have the maximal influence on the model output, and then applies a greedy search to obtain the optimal combination of perturbations over these K features. However, these search-based methods are not guaranteed to find the strongest adversarial examples. Moreover, they can be computationally expensive, especially when the data are high-dimensional and the number of categories for each feature is large. In this paper, we propose a novel Probabilistic Categorical Adversarial Attack (PCAA) algorithm that generates categorical adversarial examples by estimating their probabilistic distribution. In detail, given a clean sample, we assume that (each feature of) the adversarial example follows a categorical distribution satisfying: (1) samples drawn from this distribution have a high expected loss value, and (2) they differ from the original clean sample in only a few features.
(See Section 3 for more details.) In this way, we transfer the categorical adversarial attack in the discrete space to an optimization problem in a continuous probabilistic space. Thus, we are able to apply gradient-based methods such as PGD (Madry et al., 2017) to find adversarial examples. On the one hand, PCAA searches for the distribution of adversarial examples over the whole space of allowed perturbations, which helps it find stronger adversarial examples (with higher loss values) than greedy search methods (Yang et al., 2020b). On the other hand, as the dimension of the input data grows, the computational cost of PCAA increases significantly more slowly than that of search-based methods (Section 3.4). Therefore, our method enjoys good attack optimality and computational efficiency simultaneously. For example, in our experiments in Section 5.1, PCAA is the only attack that achieves the highest (or close to the highest) attack success rate while maintaining a low computational cost. The advantages of PCAA allow us to further devise an adversarial training method (PAdvT), which repeatedly (over mini-batches) generates adversarial examples using PCAA. Empirically, PAdvT achieves promising robustness on different datasets against various attacks. For example, on the AG's News dataset, we outperform representative baselines by significant margins, with approximately 10% improvement in model robustness over the best defense baseline; on the IMDB dataset, we obtain performance comparable to ASCC-defense (Dong et al., 2021a), a state-of-the-art defense method based on embedding-space adversarial training. However, compared to ASCC, our method does not rely on the assumption that similar words are close in the embedding space. Compared to the other defenses, ours improves robustness by more than 15%.
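The continuous relaxation described above can be sketched in code. This is a minimal illustration under stated assumptions, not the paper's exact implementation: we assume each feature's adversarial categorical distribution is parameterized by a row of logits and relaxed with Gumbel-softmax samples, so that gradients of the model loss could flow back to the logits; the function name, shapes, and temperature below are hypothetical.

```python
import numpy as np

def gumbel_softmax(logits, tau=0.5, rng=None):
    """Draw a relaxed (differentiable) one-hot sample from categorical
    distributions parameterized by `logits` (n_features x n_categories).
    As tau -> 0, each row approaches a hard one-hot vector."""
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise via the inverse-CDF trick
    gumbel = -np.log(-np.log(rng.uniform(1e-10, 1.0, size=logits.shape)))
    y = (logits + gumbel) / tau
    y = y - y.max(axis=-1, keepdims=True)   # numerical stability
    expy = np.exp(y)
    return expy / expy.sum(axis=-1, keepdims=True)

# Hypothetical setting: 4 categorical features, 3 categories each,
# starting from a uniform attack distribution.
logits = np.zeros((4, 3))
sample = gumbel_softmax(logits, tau=0.5)
print(sample.shape)   # (4, 3); each row is a probability vector
```

In a full attack, one would ascend the model's loss on such relaxed samples by gradient steps on the logits while penalizing deviation from the clean sample's one-hot encoding, matching conditions (1) and (2) above.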
Our main contributions can be summarized below:
• We propose a time-efficient probabilistic attack method (PCAA) for models with categorical inputs.
• Based on PCAA, we devise a probabilistic adversarial training method to defend against categorical adversarial attacks.
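The adversarial training scheme in the second contribution can be sketched as a standard mini-batch loop. This is a hedged sketch, not the actual PAdvT code: `toy_attack` is a random-perturbation placeholder standing in for PCAA, `toy_step` stands in for a real optimizer step, and all names and shapes are hypothetical.

```python
import numpy as np

def padvt_epoch(model_step, attack, batches):
    """One epoch of probabilistic adversarial training (PAdvT):
    for each mini-batch, craft adversarial examples with the attack,
    then take a training step on them."""
    for x, y in batches:
        x_adv = attack(x, y)
        model_step(x_adv, y)

# --- toy stand-ins so the loop runs end to end (hypothetical) ---
rng = np.random.default_rng(0)

def toy_attack(x, y):
    """Placeholder for PCAA: re-draw one categorical feature per sample.
    The real attack instead samples from an optimized adversarial
    distribution over the allowed perturbations."""
    x_adv = x.copy()
    for i in range(len(x_adv)):
        j = rng.integers(x_adv.shape[1])
        x_adv[i, j] = rng.integers(3)   # 3 categories per feature
    return x_adv

seen = []
def toy_step(x, y):
    seen.append((x, y))                 # record what we trained on

x = rng.integers(3, size=(8, 5))        # 8 samples, 5 categorical features
y = rng.integers(2, size=8)
padvt_epoch(toy_step, toy_attack, [(x, y)])
print(len(seen))                        # 1 training step taken
```

The design point is that the attack is called once per mini-batch inside the training loop, so its per-call cost directly bounds the overhead of adversarial training; this is why an efficient attack like PCAA makes such training practical.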

2. RELATED WORK

Attacks on categorical data. There has been a rise in the importance of the robustness of machine learning in recent years. On the one hand, evasion attacks, poisoning attacks, adversarial training, and other robustness problems with continuous input spaces have been well studied, especially in the image domain (Shafahi et al., 2018; Madry et al., 2017; Ilyas et al., 2019). On the other hand, adversarial attacks focusing on discrete input data, like text data, which have categorical features, are also starting to catch the attention of researchers. Kuleshov et al. (2018) discussed the problems of attacking text data and highlighted the importance of investigating discrete input data. Ebrahimi et al. (2017b) proposed to modify text tokens based on the gradient of the input one-hot vectors. Gao et al. (2018) developed a scoring function to select the most effective attack and a simple character-level transformation to replace projected gradient or multiple linguistic-driven steps on text data. Samanta & Mehta (2017) proposed an algorithm to generate meaningful adversarial samples that are legitimate in the text domain. Yang et al. (2020a) proposed a two-stage probabilistic framework to attack discrete data. Lei et al. (2019b) and Yang et al. (2020a) improved greedy search and proposed two different greedy attack methods. Alzantot et al. (2018) proposed a black-box attack to generate syntactically similar adversarial examples. However, these previous methods usually search in a greedy way, which has exponential time complexity.

Defenses on categorical data. Compared with the discrete domain, there are many more defense methods in the continuous data domain. For example, the most effective one is adversarial training (Madry et al., 2017), which searches for worst-case adversarial examples by PGD during training and trains on them to increase the robustness of the model. Several works have been proposed for categorical adversarial defenses. Pruthi et al. (2019) used a word recognition model to preprocess the input data and increase the robustness of downstream tasks. Zhou et al. (2020) proposed to use randomized smoothing to defend against substitution-based attacks. Wang et al. (2021) detected adversarial examples to defend against synonym substitution attacks, based on the fact that word replacement can destroy mutual interaction. Swenor & Kalita (2022) used random noise to increase model robustness. Xie et al. (2022) used a huge amount of data to detect adversarial examples. Dong et al. (2021b) combined different attack methods, such as HotFlip (Ebrahimi et al., 2017a) and the l2-attack (Miyato et al., 2017), with adversarial training as the defense method. It also proposed its own adversarial method, ASCC, which takes the solution space as the convex hull of word vectors and achieves good performance in sentiment analysis and natural language inference. However, these defense methods mainly focus on NLP and may rely on word embeddings.

3. PROBABILISTIC CATEGORICAL ADVERSARIAL ATTACK (PCAA)

3.1 PROBLEM SETUP

Categorical Attack. We first introduce necessary definitions and notations. In detail, we consider a classifier f that predicts labels y ∈ Y based on categorical inputs x ∈ X. Each input sample x