BOUNDED ATTACKS AND ROBUSTNESS IN IMAGE TRANSFORM DOMAINS

Abstract

Classical image transforms such as the discrete cosine transform (DCT) and the discrete wavelet transforms (DWTs) provide semantically meaningful representations of images. In this paper we propose a general method for adversarial attacks in such transform domains that, in contrast to prior work, obeys the L∞ constraint in the pixel domain. The key idea is to replace the standard projection-based attack with the barrier method. Experiments with the DCT and DWTs produce adversarial examples that are significantly more similar to the original images than those of prior attacks. Further, through adversarial training we show that robustness against our attacks transfers to robustness against a broad class of common image perturbations.

1. INTRODUCTION

Adversarial attacks Biggio et al. (2013); Szegedy et al. (2014); Papernot et al. (2016a) have raised concerns about the safety and robustness of deploying neural networks in critical decision-making processes. Given a neural network that makes accurate predictions on clean data, these attacks modify inputs in a way indiscernible to humans so as to produce erroneous predictions. Typically, the distance between a clean and a perturbed input is measured by an L_p norm¹. In particular, L0 (a pseudonorm) and L∞ have been argued to be necessary adversarial robustness metrics for images Kotyan & Vargas (2022) since they are easily interpretable: the number of modified pixels and a pixel-wise threshold, respectively. Further, Hendrycks & Dietterich (2019) noticed an interesting interaction between L∞ adversarial robustness and common image corruptions such as motion blur, shot noise, and frost. An additional argument for L2 and L∞ perturbations is the availability of closed-form projections needed in common attacks like Projected Gradient Descent (PGD) Shafahi et al. (2022). However, prior attacks in transform domains did not bound the effect of the change in the pixel domain.

The goal of our work is to provide adversarial attacks in transform domains, which can thus exploit their expressiveness while at the same time obeying the common L∞ bounds in the pixel domain. Doing so makes the amount of change interpretable, enables comparison to prior attacks, and leverages the interaction with various common image corruptions Hendrycks & Dietterich (2019). The challenge lies in the high-dimensional geometry, which makes it difficult to derive the projections needed in PGD-based attacks; thus a different approach is needed. Specifically, we contribute:

• A novel white-box attack based on the barrier method from nonlinear programming that does not require any closed-form projections and can be instantiated for a large class of transforms. Our focus is on the DCT and DWTs.
• An evaluation of our attacks against prior work on ImageNet. In particular, given the same L∞ bound, we show that our attacks consistently yield adversarial examples with significantly higher similarity to the original, as verified by the Learned Perceptual Image Patch Similarity (LPIPS) metric. As a baseline we also include a hand-crafted PGD-based attack for the DCT to illustrate the challenges in obtaining projections that obey L∞ bounds.

Both the DCT and the DWTs are linear and invertible and decompose an image into a notion of frequencies, in which high frequencies capture details that often can be removed with little visual impact. An example is shown in Fig. 1. We aim to leverage such expressive transform representations for adversarial attacks while, in contrast to prior work, obeying the widely used L∞ box defined in pixel space. Our approach is applicable to a large number of transforms; thus, we first present it in a general way before instantiating it for the DCT and DWTs.

Let x0 ∈ [0, 1]^n be a clean image correctly classified as c by a classification model f (pixel color channel values are assumed normalized to [0, 1]), and let l be a loss function, for example the cross-entropy. Let ϕ be an invertible and differentiable image transform that maps the original image from pixel space to a domain with a desirable expressiveness. In this ϕ-domain, we seek a perturbed version y′ of y0 = ϕ(x0) such that x′ = ϕ⁻¹(y′) is misclassified and ‖x′ − x0‖∞ ≤ ε in the pixel domain. y′, and thus x′, can be computed by solving the following constrained optimization problem:

    min_y −l(ϕ⁻¹(y), c)   subject to   ‖ϕ⁻¹(y) − x0‖∞ ≤ ε.    (1)
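To make the barrier idea behind problem (1) concrete, here is a minimal NumPy sketch, not the paper's actual implementation: `grad_loss`, `phi`, and `phi_inv` are hypothetical callables standing in for the loss gradient and an orthonormal transform pair such as the DCT. The log-barrier term replaces the L∞ projection of PGD, so no closed-form projection onto the pre-image of the box is needed.

```python
import numpy as np

def barrier_attack(x0, grad_loss, phi, phi_inv, eps=8/255, t=100.0, lr=1e-2, steps=200):
    """Interior-point sketch of problem (1): gradient ascent on the loss over
    transform coefficients y, with a log-barrier keeping x' = phi_inv(y)
    strictly inside the pixel-domain L-inf box around x0."""
    y = phi(x0)
    for _ in range(steps):
        x = phi_inv(y)
        hi = np.maximum(x0 + eps - x, 1e-12)    # slack to the upper box face
        lo = np.maximum(x - (x0 - eps), 1e-12)  # slack to the lower box face
        # ascent direction on l(x) + (1/t) * sum(log(hi) + log(lo))
        g_x = grad_loss(x) + (1.0 / t) * (1.0 / lo - 1.0 / hi)
        # chain rule through the transform; for an orthonormal phi the
        # adjoint of phi_inv is phi itself
        step = lr
        y_new = y + step * phi(g_x)
        # backtrack until the iterate is strictly inside the box
        while np.max(np.abs(phi_inv(y_new) - x0)) >= eps:
            step *= 0.5
            y_new = y + step * phi(g_x)
        y = y_new
    return phi_inv(y)
```

Because the barrier blows up at the box faces, every iterate stays strictly feasible, which is exactly what removes the need for a projection step; increasing `t` over the iterations (as in the classical barrier method) would tighten the approximation to the constrained optimum.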



¹ All norms in this paper are vector norms, i.e., an H × W RGB image is considered as a vector in R^n with n = 3HW, not as a matrix.
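As a toy illustration of this vector-norm convention (a hypothetical example, not taken from the paper), an RGB image is flattened before any norm is taken, so L0 counts modified entries and L∞ measures the largest per-channel change:

```python
import numpy as np

# Two 2x2 RGB images, treated as vectors in R^n with n = 3*H*W = 12.
x = np.zeros((2, 2, 3))
x_pert = x.copy()
x_pert[0, 0, 0] = 0.05          # perturb one color channel of one pixel

d = (x_pert - x).ravel()        # flatten to a vector before taking norms
linf = np.abs(d).max()          # L-inf: largest per-entry change (0.05)
l0 = np.count_nonzero(d)        # L0 pseudonorm: number of modified entries (1)
```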



Adversarial attacks can be broadly grouped into black-box and white-box attacks Papernot et al. (2016a); Tramèr et al. (2018). White-box attacks have full access to the neural network architecture, its weights, the training data, and the learning algorithm Goodfellow et al. (2015); Kurakin et al. (2017); Papernot et al. (2016a); Madry et al. (2018); Croce & Hein (2020). Black-box attacks are only allowed to query the target network and observe the input-output relationship Narodytska & Kasiviswanathan (2017); Brendel et al. (2017); Su et al. (2019); Andriushchenko et al. (2020). Many approaches have been proposed to detect adversarial examples Xu et al. (2018); Ma et al. (2018); Feinman et al. (2017); Metzen et al. (2017) and to defend against them Gu & Rigazio (2014); Papernot et al. (2016b); Liao et al. (2018); Xie et al. (2019); Zhou et al. (2021). However, most of these defenses can again be broken by suitable adaptive attacks Tramèr et al. (2020); Carlini & Wagner (2017). Adversarial training Kurakin et al. (2017); Madry et al. (2018), a seminal approach that augments the training data with adversarial examples, has proven effective in training empirically Zhang et al. (2019) and provably Salman et al. (2019) robust neural networks. Another approach, proposed by Balunovic & Vechev (2019), combines adversarial training with provable defenses to boost certified robustness. Further, the robustness of trained neural networks can be verified formally through abstract interpretation and relaxations Singh et al. (2019); Xu et al. (2020); Bunel et al. (2020); Müller et al. (2022).

A different set of techniques aims to perturb in semantically more meaningful ways, e.g., by inserting a carefully chosen patch into the image Thys et al. (2019); Zolfi et al. (2021); Eykholt et al. (2018). The high-level idea of bringing image processing knowledge to the problem also motivates our contribution, explained next.

Motivation and Contributions. Images are not random grids of pixels but can be approximately modeled as first-order Gauss-Markov random fields, which is what enables JPEG compression. Concretely, when decomposed into frequencies by the discrete cosine transform (DCT) Rao & Yip (2001) at the heart of JPEG, or by the hierarchical discrete wavelet transforms (DWTs) Daubechies (1992) used in JPEG 2000, most of the norm concentrates in the low frequencies, which is a key characteristic of images. Prior work has used some of these transforms as a defense to attenuate the additive perturbation noise injected by adversarial attacks Das et al. (2017); Guo et al. (2018), or as a form of data augmentation Duan et al. (2021); Hossain et al. (2019). Furthermore, perturbations in the transform domain were used to defend against pixel attacks Bafna et al. (2018) or to carry out attacks in the transform domains Duan et al. (2021); Hossain et al. (2019); Deng & Karam (2020); Shi et al. (2021a); Luo et al.
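The energy-concentration and invertibility properties described above can be checked in a few lines. This sketch assumes SciPy's `scipy.fft.dctn`/`idctn` and uses a synthetic smooth image (standing in for a natural image) rather than real data:

```python
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(0)
# A smooth "image": low-frequency content plus mild noise, loosely mimicking
# the Gauss-Markov model of natural images.
t = np.linspace(0, 1, 32)
img = np.outer(np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)) \
      + 0.1 * rng.standard_normal((32, 32))

coeffs = dctn(img, norm='ortho')       # orthonormal 2D DCT
energy = coeffs ** 2
low = energy[:8, :8].sum()             # lowest-frequency 1/16 of the coefficients
frac = low / energy.sum()              # most of the norm concentrates here

recon = idctn(coeffs, norm='ortho')    # linear and invertible: recon == img
```

For this smooth input, well over half of the energy lands in the lowest 8 × 8 block of coefficients, and the inverse transform recovers the image exactly (up to floating-point error); the same two properties are what the attack in this paper exploits for the DCT and DWTs.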

