IMBALANCED GRADIENTS: A NEW CAUSE OF OVER-ESTIMATED ADVERSARIAL ROBUSTNESS

Abstract

Evaluating the robustness of a defense model is a challenging task in adversarial robustness research. Obfuscated gradients, a type of gradient masking, have previously been found to exist in many defense methods and to cause false signals of robustness. In this paper, we identify a more subtle situation called Imbalanced Gradients that can also cause overestimated adversarial robustness. Imbalanced gradients occur when the gradient of one term of the margin loss dominates, pushing the attack towards a suboptimal direction. To exploit imbalanced gradients, we formulate a Margin Decomposition (MD) attack that decomposes the margin loss into its individual terms and then explores the attackability of these terms separately via a two-stage process. We examine 12 state-of-the-art defense models and find that models exploiting label smoothing easily cause imbalanced gradients; on such models, our MD attacks decrease the robustness evaluated by the PGD attack by over 23%. For 6 out of the 12 defenses, our attack reduces their PGD-evaluated robustness by at least 9%. These results suggest that imbalanced gradients need to be carefully addressed to obtain more reliable adversarial robustness evaluations.

1. INTRODUCTION

Deep neural networks (DNNs) are vulnerable to adversarial examples: inputs crafted by adding small adversarial perturbations to natural examples. Adversarial examples can fool DNNs into making false predictions with high confidence, and can transfer across different models (Szegedy et al., 2014; Goodfellow et al., 2015). A number of defenses have been proposed to overcome this vulnerability. However, a concerning fact is that many defenses have quickly been shown to have undergone incorrect or incomplete evaluation (Carlini and Wagner, 2017; Athalye et al., 2018; Engstrom et al., 2018; Uesato et al., 2018; Mosbach et al., 2018; He et al., 2018). One common pitfall in adversarial robustness evaluation is the phenomenon of gradient masking (Papernot et al., 2017; Tramèr et al., 2018) or obfuscated gradients (Athalye et al., 2018), which leads to weak or unsuccessful attacks and hence false signals of robustness. To demonstrate "real" robustness, newly proposed defenses claim robustness based on the results of white-box attacks such as PGD (Madry et al., 2018), and at the same time demonstrate that their results are not due to obfuscated gradients. In this work, we show that robustness may still be overestimated even when there are no obfuscated gradients. Specifically, we identify a new situation called Imbalanced Gradients that exists in several state-of-the-art defense models and can cause highly overestimated robustness. Imbalanced gradients are a new type of gradient masking effect in which the gradient of one loss term dominates that of the other terms, causing the attack to move in a suboptimal direction. Unlike obfuscated gradients, imbalanced gradients are more subtle and are not detectable by the detection methods designed for obfuscated gradients.
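To make the effect concrete, the following toy sketch (a hypothetical linear classifier; all values are illustrative and not taken from the paper's experiments) computes the two gradient terms of the margin loss separately and shows how one term can dominate the combined attack direction:

```python
import numpy as np

# Toy linear classifier z = W @ x: the gradient of logit z_i w.r.t. x is row W[i].
# The margin loss z_max - z_y has two gradient terms: +grad(z_max) and -grad(z_y).
W = np.array([[1.0,  0.0],     # true class y = 0
              [0.2,  0.1],
              [30.0, -40.0]])  # this class's logit has a much larger gradient
x = np.array([0.1, 0.05])
y = 0

z = W @ x
z_masked = z.copy()
z_masked[y] = -np.inf
top = int(np.argmax(z_masked))    # runner-up class used by the margin loss

g_pos = W[top]                    # gradient of the z_max term
g_neg = -W[y]                     # gradient of the -z_y term
g_margin = g_pos + g_neg          # full margin-loss gradient

ratio = np.linalg.norm(g_pos) / np.linalg.norm(g_neg)
cos = g_margin @ g_pos / (np.linalg.norm(g_margin) * np.linalg.norm(g_pos))
print(f"||grad z_max|| / ||grad z_y|| = {ratio:.1f}")
print(f"cosine(full gradient, dominant term) = {cos:.4f}")
```

In this toy setup the z_max term's gradient is 50 times larger, so the combined gradient is almost perfectly aligned with it; the direction information carried by the -z_y term is effectively lost, which is the kind of imbalance the MD attack is designed to exploit.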
To exploit imbalanced gradients, we propose a novel attack named Margin Decomposition (MD) that decomposes the margin loss into two separate terms and then exploits the attackability of these terms via a two-stage attacking process. We derive MD variants of existing attacks such as PGD and MultiTargeted (MT) (Gowal et al., 2019), and deploy these MD attacks to re-examine the robustness of 12 adversarial training-based defense models. We find that 6 of them are susceptible to imbalanced gradients, and their robustness as originally evaluated by the PGD attack drops significantly against our MD attacks. Our key contributions are:
• We identify a new type of subtle effect called imbalanced gradients, which can cause highly overestimated adversarial robustness and cannot be detected by the detection methods for obfuscated gradients. In particular, we highlight that label smoothing is one of the major causes of imbalanced gradients.
• We propose Margin Decomposition (MD) attacks to exploit imbalanced gradients. MD leverages the attackability of the individual terms of the margin loss in a two-stage attacking process. We also introduce MD variants of the existing attacks PGD and MT.
• We conduct extensive evaluations on 12 state-of-the-art defense models and find that 6 of them suffer from imbalanced gradients, with their PGD robustness dropping by more than 9% against our MD attacks. Our MD attacks outperform state-of-the-art attacks when imbalanced gradients occur.
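As a rough illustration of the two-stage idea (this is not the paper's exact algorithm: the linear model, step sizes, and stage schedule below are all hypothetical), an attack that first follows a single margin term and then switches to the full margin loss can be sketched as:

```python
import numpy as np

def md_pgd_sketch(W, x, y, eps=0.8, alpha=0.1, n1=5, n2=5):
    """Two-stage sketch on a linear model z = W @ x, under an L_inf eps-ball.
    Stage 1 follows a single margin term (-z_y); stage 2 uses the full margin."""
    x_adv = x.copy()
    for stage in (1, 2):
        for _ in range(n1 if stage == 1 else n2):
            z = W @ x_adv
            z_masked = z.copy()
            z_masked[y] = -np.inf
            top = int(np.argmax(z_masked))
            # stage 1: gradient of the -z_y term only; stage 2: full margin gradient
            g = -W[y] if stage == 1 else W[top] - W[y]
            x_adv = x_adv + alpha * np.sign(g)        # signed ascent step
            x_adv = np.clip(x_adv, x - eps, x + eps)  # project onto the eps-ball
    return x_adv

# Tiny illustrative run: a 2-class identity model, true class 0.
W = np.eye(2)
x = np.array([1.0, 0.0])
x_adv = md_pgd_sketch(W, x, y=0)
print(W @ x_adv, int(np.argmax(W @ x_adv)))
```

The point of the staging is that the early phase is driven purely by one term, so the search is not misled when the other term's gradient would otherwise dominate.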

2. BACKGROUND

We denote a clean sample by x, its class by y ∈ {1, ..., C} with C the number of classes, and a DNN classifier by f. The probability of x belonging to the i-th class is computed as p_i(x) = e^{z_i} / Σ_{j=1}^{C} e^{z_j}, where z_i is the logit for the i-th class. The goal of an adversarial attack is to find an adversarial example x_adv that fools the model into making a false prediction (i.e., f(x_adv) ≠ y), and the perturbation is typically restricted to lie within a small ε-ball around the original example x (i.e., ||x_adv - x||_∞ ≤ ε). Adversarial Attack. Adversarial examples can be crafted by maximizing a classification loss via one or multiple steps of adversarial perturbation, for example the one-step Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015) and the iterative FGSM (I-FGSM) attack (Kurakin et al., 2017). The Projected Gradient Descent (PGD) attack (Madry et al., 2018) is another iterative method, which projects the perturbation back onto the ε-ball centered at x whenever it goes beyond. The Carlini and Wagner (CW) attack (Carlini and Wagner, 2017) generates adversarial examples via an optimization framework. While there exist other attacks such as the Frank-Wolfe attack (Chen et al., 2018a), the distributionally adversarial attack (Zheng et al., 2019) and elastic-net attacks (Chen et al., 2018b), the most commonly used attacks for robustness evaluation are FGSM, PGD, and CW. Several recent attacks have been proposed to produce more accurate robustness evaluations than PGD, including the Fast Adaptive Boundary (FAB) attack (Croce and Hein, 2019), the MultiTargeted (MT) attack (Gowal et al., 2019), the Output Diversified Initialization (ODI) attack (Tashiro et al., 2020), and AutoAttack (AA) (Croce and Hein, 2020). FAB finds the minimal perturbation necessary to change the class of a given input. MT is a PGD-based attack with multiple restarts that picks a new target class at each restart. ODI provides a more effective initialization strategy with diversified logits.
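For concreteness, a minimal PGD loop can be sketched as follows (on a hypothetical linear model with an analytic cross-entropy gradient; real evaluations use automatic differentiation through the defended network):

```python
import numpy as np

def pgd_linf(W, x, y, eps=0.6, alpha=0.05, steps=30, seed=0):
    """Minimal L_inf PGD sketch on a linear model z = W @ x, maximizing CE loss.
    For this model the gradient of -log p_y w.r.t. x is (p - onehot(y)) @ W."""
    rng = np.random.default_rng(seed)
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)  # random start inside the ball
    for _ in range(steps):
        z = W @ x_adv
        p = np.exp(z - z.max())
        p /= p.sum()                                  # numerically stable softmax
        onehot = np.zeros_like(p)
        onehot[y] = 1.0
        g = (p - onehot) @ W                          # analytic CE gradient w.r.t. x
        x_adv = x_adv + alpha * np.sign(g)            # signed ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)      # project back onto the eps-ball
    return x_adv
```

The two PGD ingredients appear on the last two lines of the loop: the signed gradient step, and the projection that keeps x_adv within the ε-ball around x.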
The AA attack is a parameter-free ensemble of four attacks: FAB, two Auto-PGD attacks, and the black-box Square Attack (Andriushchenko et al., 2019). AA has been demonstrated to be one of the state-of-the-art attacks to date (Croce and Hein, 2020). Adversarial Loss. Many attacks use the Cross Entropy (CE) loss as the adversarial loss: ℓ_CE(x, y) = -log p_y. The other commonly used adversarial loss is the margin loss (Carlini and Wagner, 2017): ℓ_margin(x, y) = z_max - z_y, with z_max = max_{i≠y} z_i. As shown in (Gowal et al., 2019), CE can be written in a margin form (i.e., ℓ_CE(x, y) = log(Σ_{i=1}^{C} e^{z_i}) - z_y), and in most cases both losses are effective. While the FGSM and PGD attacks use the CE loss, CW and several recent attacks such as MT and ODI adopt the margin loss. AA has one PGD variant using the CE loss and another PGD variant using the Difference of Logits Ratio (DLR) loss, which can be regarded as a "relative margin" loss. In this paper, we identify a new effect that causes overestimated adversarial robustness from the margin loss perspective, and propose new attacks by decomposing the margin loss. Adversarial Defense. In response to the threat of adversarial attacks, many defenses have been proposed, such as defensive distillation (Papernot et al., 2016), feature/subspace analysis (Xu et al., 2017; Ma et al., 2018), denoising techniques (Guo et al., 2018; Liao et al., 2018; Samangouei et al., 2018), robust regularization (Gu and Rigazio, 2014; Tramèr et al., 2018; Ross and Doshi-Velez, 2018), model compression (Liu et al., 2018; Das et al., 2018; Rakin et al., 2018) and adversarial training (Goodfellow et al., 2015; Madry et al., 2018). Among them, adversarial training via robust min-max optimization has been found to be the most effective approach (Athalye et al., 2018).
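The margin rewriting of the CE loss is just the log-sum-exp identity ℓ_CE(x, y) = log(Σ_i e^{z_i}) - z_y, which can be checked numerically on toy logits (the values here are arbitrary):

```python
import numpy as np

z = np.array([2.0, -1.0, 0.5])                 # toy logits, arbitrary values
y = 0
p = np.exp(z) / np.exp(z).sum()                # softmax probabilities
ce = -np.log(p[y])                             # standard cross-entropy form
margin_form = np.log(np.exp(z).sum()) - z[y]   # margin form: log-sum-exp minus z_y
print(ce, margin_form)
assert np.isclose(ce, margin_form)
```

This identity is why CE-based and margin-based attacks are usually both effective, while still differing in how their gradient terms can become imbalanced.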
A number of new techniques have been proposed to further enhance the adversarial training (Wang et al., 2019; Zhang et al., 2019; Carmon et al., 2019; Alayrac et al., 2019; Wang and Zhang, 2019; Zhang and Wang, 2019; Zhang and Xu, 2020; Wang et al., 2020; Kim and Wang, 2020; Ding et al., 

