IMPROVING HIERARCHICAL ADVERSARIAL ROBUSTNESS OF DEEP NEURAL NETWORKS

Abstract

Do all adversarial examples have the same consequences? An autonomous driving system misclassifying a pedestrian as a car may induce far more dangerous, and even potentially lethal, behavior than, for instance, misclassifying a car as a bus. To better address this important problem, we introduce the concept of hierarchical adversarial robustness. Given a dataset whose classes can be grouped into coarse-level labels, we define hierarchical adversarial examples as those leading to a misclassification at the coarse level. To improve the resistance of neural networks to hierarchical attacks, we introduce a hierarchical adversarially robust (HAR) network design that decomposes a single classification task into one coarse and multiple fine classification tasks, each specifically trained with adversarial defence techniques. Compared to an end-to-end learning approach, we show that HAR significantly improves the robustness of the network against ℓ2- and ℓ∞-bounded hierarchical attacks on the CIFAR-100 dataset.

1. INTRODUCTION

Deep neural networks (DNNs) are highly vulnerable to attacks based on small modifications of the input at test time (Szegedy et al., 2013). These adversarial perturbations are carefully crafted so that they are imperceptible to human observers, but, when added to clean images, can severely degrade the accuracy of a neural network classifier. Since their discovery, a vast literature has proposed various attack and defence techniques for the adversarial setting (Szegedy et al., 2013; Goodfellow et al., 2014; Kurakin et al., 2016; Madry et al., 2017; Wong et al., 2020). These methods constitute important first steps in studying the adversarial robustness of neural networks. However, there exists a fundamental flaw in the way we assess a defence or an attack mechanism: we overly generalize the mistakes caused by attacks. In particular, current approaches focus on the scenario where different mistakes caused by attacks are treated equally. We argue that some contexts do not allow mistakes to be considered equal. In CIFAR-100 (Krizhevsky et al., 2009), it is less problematic to misclassify a pine tree as an oak tree than a fish as a truck. We are thus motivated to propose the concept of hierarchical adversarial robustness to capture this notion. Given a dataset whose classes can be grouped into coarse labels, we define hierarchical adversarial examples as those leading to a misclassification at the coarse level, and we present a variant of the projected gradient descent (PGD) adversary (Madry et al., 2017) to find hierarchical adversarial examples. Finally, we introduce a simple and principled hierarchical adversarially robust (HAR) network that decomposes the end-to-end robust learning problem into one coarse and multiple fine classification tasks, each trained with adversarial defence techniques.
Our contributions are:
• We introduce the concept of hierarchical adversarial examples: a special case of standard adversarial examples that cause mistakes at the coarse level (Section 2).
• We present a worst-case targeted PGD attack to find hierarchical adversarial examples. The attack iterates through all candidate fine labels until the input is successfully misclassified into the desired target (Section 2.1).
• We propose a novel architectural approach, the HAR network, for improving the hierarchical adversarial robustness of deep neural networks (Section 3). We empirically show that HAR networks significantly improve hierarchical adversarial robustness against ℓ∞ attacks (ε = 8/255) (Section 4) and ℓ2 attacks (ε = 0.5) (Appendix A.4) on CIFAR-100.
• We benchmark using untargeted PGD20 attacks as well as the proposed iterative targeted PGD attack. In particular, we include an extensive empirical study of the improved hierarchical robustness of HAR by evaluating against attacks with varying PGD iterations and ε. We find that the vast majority of misclassifications from the untargeted attack fall within the same coarse label, resulting in a failed hierarchical attack. The proposed iterative targeted attack provides a better empirical measure of the hierarchical adversarial robustness of a model (Section 4.2).
• We show that iterative targeted attacks formulated on the coarse network alone produce weaker hierarchical adversarial examples than those generated using the entire HAR network (Section 4.3).
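The iterative targeted attack described above can be sketched in code. The PyTorch snippet below is a minimal illustration, not the exact configuration used in the experiments: the step size `alpha`, the restart-free inner loop, and the toy fine-to-coarse mapping in the usage example are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pgd_targeted(model, x, target, eps=8/255, alpha=2/255, steps=20):
    """Targeted l-inf PGD: descend the loss toward `target` while
    staying inside an eps-ball around x (and inside [0, 1])."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), target)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()        # move toward target
            x_adv = x + (x_adv - x).clamp(-eps, eps)   # project to eps-ball
            x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

def hierarchical_attack(model, x, fine_label, fine_to_coarse, **kw):
    """Iterate over every fine label outside the true coarse class;
    return the first perturbed input whose predicted fine label falls
    in a different coarse class, or None if all candidates fail."""
    true_coarse = fine_to_coarse[fine_label]
    candidates = [f for f, c in fine_to_coarse.items() if c != true_coarse]
    for t in candidates:
        x_adv = pgd_targeted(model, x, torch.tensor([t]), **kw)
        pred = model(x_adv).argmax(1).item()
        if fine_to_coarse[pred] != true_coarse:
            return x_adv  # successful hierarchical adversarial example
    return None
```

Note the success test checks the coarse class of the *prediction*, not whether the exact target was hit: any misclassification that crosses the coarse boundary counts as a hierarchical adversarial example.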

2. HIERARCHICAL ADVERSARIAL EXAMPLES

The advancement of DNN image classifiers has been accompanied by the increasing complexity of network design (Szegedy et al., 2016; He et al., 2016), and these intricate networks have provided state-of-the-art results on many benchmark tasks (Deng et al., 2009; Geiger et al., 2013; Cordts et al., 2016; Everingham et al., 2015). Unfortunately, the discovery of adversarial examples has revealed that neural networks are extremely vulnerable to maliciously perturbed inputs at test time (Szegedy et al., 2013). This makes it difficult to apply DNN-based techniques in mission-critical and safety-critical areas. Another important development alongside the advancement of DNNs is the growing complexity of datasets, both in size and in number of classes, e.g., from the 10-class MNIST dataset to the 1000-class ImageNet dataset. As dataset complexity increases, a dataset can often be divided into several coarse classes, where each coarse class consists of multiple fine classes. In this paper, we use the terms label and class interchangeably. The setting in which an input image is first classified into a coarse label and then into a fine label is referred to as hierarchical classification (Tousch et al., 2012). Intuitively, the visual separability between groups of fine labels can be highly uneven within a given dataset, so some coarse labels are more difficult to distinguish than others. This motivates the use of dedicated classifiers for specific groups of classes, allowing the coarse labels to provide information on similarities between the fine labels at an intermediate stage. The class hierarchy can be formed in different ways, and it can be learned strategically for optimal performance on the downstream task (Deng et al., 2011). Note that it is also a valid strategy to create a customized class hierarchy in order to deal with sensitive misclassifications.
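The two-stage inference that hierarchical classification implies can be sketched as follows. This is a toy illustration under assumed shapes, not the HAR architecture proposed later: the `Linear` heads, `in_dim`, and the group layout in the usage example are hypothetical placeholders.

```python
import torch
import torch.nn as nn

class HierarchicalClassifier(nn.Module):
    """Two-stage inference: a coarse head picks a coarse label, then a
    dedicated fine head for that group predicts the fine label."""
    def __init__(self, in_dim, groups):
        # `groups` maps coarse index -> list of global fine-label ids
        super().__init__()
        self.groups = groups
        self.coarse = nn.Linear(in_dim, len(groups))
        self.fine = nn.ModuleList(
            nn.Linear(in_dim, len(fs)) for fs in groups)

    def forward(self, x):
        c = self.coarse(x).argmax(1)  # coarse decision per sample
        fine_preds = []
        for i, ci in enumerate(c.tolist()):
            # route the sample to the fine head for its coarse class
            j = self.fine[ci](x[i:i + 1]).argmax(1).item()
            fine_preds.append(self.groups[ci][j])  # back to global id
        return torch.tensor(fine_preds), c
```

By construction, the predicted fine label always lies inside the predicted coarse group, so a fine-level mistake stays within the coarse class unless the coarse head itself is fooled.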
To illustrate our work, we use the predefined class hierarchy of the CIFAR-10 and CIFAR-100 datasets (Krizhevsky et al., 2009): fine labels are grouped into coarse labels by semantic similarity. All prior work on adversarial examples for neural networks, whether on defences or attacks, focuses on the scenario where all misclassifications are considered equal (Szegedy et al., 2013; Goodfellow et al., 2014; Kurakin et al., 2016; Madry et al., 2017; Wong et al., 2020). In practice, however, this notion overly generalizes the damage caused by different types of attacks. For example, in an autonomous driving system, confusing a perturbed image of a traffic sign with a pedestrian should not be treated the same way as confusing a bus with a pickup truck. The former raises a major security threat for practical machine learning applications, whereas the latter has very little impact on the underlying task. Moreover, misclassification across different coarse labels poses potential ethical concerns when the dataset involves sensitive features such as ethnicity, gender, disability, or age group. Mistakes across coarse classes lead to much more severe consequences than mistakes within coarse classes. To capture this different notion of attacks, we propose the term hierarchical adversarial examples. They are a specific case of adversarial examples in which the resulting misclassification occurs between fine labels that belong to different coarse labels. Here, we provide a clear definition of hierarchical adversarial examples to differentiate them from standard adversarial examples. We begin with notation for the classifier. Consider a neural network F(x) : R^d → R^c with a softmax as its last layer (Hastie et al., 2009), where d and c

