EXPLOITING CERTIFIED DEFENCES TO ATTACK RANDOMISED SMOOTHING

Abstract

Certified guarantees of adversarial robustness play an important role in providing assurances regarding a model's output, irrespective of the behaviour of an attacker. However, while the development of such guarantees has drawn upon an improved understanding of attacker behaviour, certified guarantees can themselves be exploited to generate more efficient adversarial attacks. Within this work, we explore this previously unexamined attack surface, while also considering how existing attacks can be applied to models defended by randomised smoothing. In all bar one experiment, our approach generates smaller adversarial perturbations for more than 70% of tested samples, reducing the average magnitude of the adversarial perturbation by 13%.

1. INTRODUCTION

The observation that neural networks exhibit particular sensitivity to adversarial behaviours has motivated numerous recent works. That such networks are frequently deployed within contexts in which incentives for adversarial behaviour exist makes guarding against such interventions of paramount importance. Our understanding of this problem space has been heavily driven by the development of new perspectives on potential attack vectors, stemming from the original work on adversarial examples through to data poisoning, backdoor attacks, model stealing, transfer attacks and more. While uncovering such attack vectors has the potential to compromise deployed models, there is prima facie evidence that any security provided by a lack of knowledge is illusory. As such, it is clear that understanding new attack vectors has the potential to produce stronger defences, due to the implicit coupling between attacks and their defences.

In contrast to this implicit coupling, recent works have focused upon the construction of certified guarantees of adversarial robustness, which verify that a model's output will be unchanged over all adversarial behaviours (subject to a set of loose constraints). Such guarantees are calculated in parallel with inference, and are achieved by way of modifications to the mechanisms of the model. One common backbone for such techniques is randomised smoothing, which can be performed without requiring any modifications to the core training loop beyond a pre-processing and post-processing step. However, while models incorporating certified guarantees should be more adversarially robust than their counterparts, within this work we consider how such models can still be attacked in a fashion that yields small adversarial perturbations.
Such a consideration is important, as we also identify a new attack vector, one which exploits the very nature of the certified guarantee to identify smaller adversarial perturbations than any other tested approach. To explore the nature of this novel attack surface, this work makes the following contributions:

• Demonstrating how general attacks can be constructed against models defended by randomised smoothing. This is achieved by exploiting the Gumbel-Softmax within either white-box or surrogate-model attacks to render the models differentiable. In doing so, we are able to exploit the fact that randomised smoothing inherently smooths the underlying gradients, making them easier to attack.

• Introducing a new attack that exploits certification guarantees in both the original and malicious classes by dynamically optimising the step-size. Correct-class predictions exploit the certification to eliminate parts of the search space, while incorrect-class predictions are used to focus the search process. In doing so, we identify smaller adversarial perturbations than the best tested alternative for over 70% of all tested samples, resulting in a more than 13% reduction in the median perturbation.

• Assessing the performance of certified guarantees of adversarial robustness, through comparison to the magnitude of the best-identified adversarial perturbations in L2-norm space.
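The first contribution above relies on the Gumbel-Softmax trick to replace the non-differentiable argmax inside a smoothed classifier with a differentiable surrogate. The paper's own implementation is not reproduced here; the following is a minimal NumPy sketch of the idea, in which the linear "model", noise level sigma, and temperature tau are all illustrative assumptions:

```python
import numpy as np

def gumbel_softmax(logits, tau=0.5, rng=None):
    """Differentiable relaxation of a categorical argmax: adding Gumbel(0, 1)
    noise to the logits and applying a temperature-scaled softmax yields a
    sample that approaches a one-hot argmax as tau -> 0, while remaining
    differentiable with respect to the logits."""
    rng = np.random.default_rng(0) if rng is None else rng
    u = rng.uniform(1e-9, 1.0 - 1e-9, size=np.shape(logits))
    g = -np.log(-np.log(u))                 # Gumbel(0, 1) noise
    z = (np.asarray(logits) + g) / tau
    e = np.exp(z - z.max())                 # numerically stable softmax
    return e / e.sum()

def smoothed_soft_prediction(x, logit_fn, sigma=0.25, n=128, tau=0.5):
    """Monte Carlo estimate of the smoothed classifier's class distribution,
    with the hard argmax relaxed via Gumbel-Softmax so that the estimate is
    (approximately) differentiable in x."""
    rng = np.random.default_rng(1)
    acc = None
    for _ in range(n):
        logits = logit_fn(x + sigma * rng.normal(size=x.shape))
        y = gumbel_softmax(logits, tau=tau, rng=rng)
        acc = y if acc is None else acc + y
    return acc / n

# Illustrative two-class "model": logits of a fixed linear map.
W = np.array([[1.0, -1.0], [-1.0, 1.0]])
probs = smoothed_soft_prediction(np.array([0.8, 0.2]), lambda v: W @ v)
```

In an actual attack the NumPy operations would be replaced by an autodiff framework, so that gradients of `probs` with respect to `x` drive the perturbation search.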

2. RELATED WORK

It is well known that carefully crafted perturbations can change the output of learned models without requiring any change in the semantic properties of the input sample (Biggio et al., 2013). Such perturbed samples are known as adversarial examples. Most common learned models, including neural networks (Szegedy et al., 2013), consistently misclassify adversarial examples, and furthermore output highly confident but incorrect predictions. One significant driver of this behaviour appears to be the piecewise-linear interactions within neural networks (Goodfellow et al., 2014). While many attacks exist, we focus on key representative approaches. Each of these is a white-box, untargeted attack that uses gradient-based optimisation to construct adversarial perturbations. Of these, the Iterative Fast Gradient Method (Dong et al., 2018) variant of Projected Gradient Descent (PGD) (Carlini & Wagner, 2017) allows adversarial examples to be iteratively constructed by way of

x_{k+1} = P( x_k − ε ∇_x J(θ, x_k, y) / ‖∇_x J(θ, x_k, y)‖_2 ).  (1)

This process exploits gradients of the loss J(θ, x, y) to construct steps, subject to a step-size weighting parameter ε, and a projection operator P that restricts x_{k+1} to the feasible input space, typically [0, 1]^d for a d-dimensional input. Many PGD extensions exist, including momentum-based variants (Dong et al., 2018) and AutoAttack (Croce & Hein, 2020). The latter has been shown to be highly effective in identifying adversarial examples, and as such we also tested performance against AutoAttack within this work. In contrast to PGD, which sets a fixed step-size ε, AutoAttack algorithmically sets the step-size at each stage of its iterative process, and instead attempts to converge upon adversarial examples with a pre-specified L2-norm perturbation magnitude, which, confusingly, is also labelled ε. This is inherently problematic within problem domains for which minimising the perturbation magnitude is important, as the perturbation magnitude must be specified in advance. Our preliminary investigations suggest that the only way to minimise the perturbation magnitude under AutoAttack is to perform a greedy search over a range of possible pre-specified magnitudes.

The Carlini & Wagner (C-W) attack constructs adversarial perturbations by way of the minimisation problem

min_{x′} ‖x′ − x‖_2^2 + max{ max{ f_θ(x′)_j : j ≠ i } − f_θ(x′)_i, −κ },  (2)

in terms of the trained model f_θ(x) (with weights θ). The latter term of Equation 2 compares the logit value of the target class i with that of the next most likely class, subject to the confidence parameter κ. This criterion is then minimised in the fashion of Equation 1, using gradients of Equation 2.
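The iterative scheme of Equation 1 can be sketched as follows. This is an illustrative NumPy implementation against a toy linear classifier; the model, loss, and gradient are stand-ins rather than those used in this work:

```python
import numpy as np

def pgd_l2_step(x, grad, eps):
    """One normalised-gradient step of Equation 1 (projection applied separately)."""
    return x - eps * grad / (np.linalg.norm(grad) + 1e-12)

def project(x):
    """Projection operator P onto the feasible input space [0, 1]^d."""
    return np.clip(x, 0.0, 1.0)

def pgd_attack(x0, grad_fn, eps=0.05, steps=20):
    x = x0.copy()
    for _ in range(steps):
        x = project(pgd_l2_step(x, grad_fn(x), eps))
    return x

# Toy example: gradient of the logistic loss -log(sigmoid(w @ x)) for a
# fixed linear model w. Following Equation 1's descent direction drives the
# prediction sigmoid(w @ x) upwards, i.e. towards the chosen label.
w = np.array([2.0, -1.0, 0.5])
def grad_fn(x):
    p = 1.0 / (1.0 + np.exp(-w @ x))   # sigmoid prediction
    return (p - 1.0) * w               # d/dx of -log(p)

x0 = np.array([0.6, 0.4, 0.5])
x_adv = pgd_attack(x0, grad_fn)
```

The essential ingredients are the L2-normalised gradient step and the projection back into the valid input box after every iteration.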
Another popular attack is DeepFool, an untargeted L2-norm attack (Moosavi-Dezfooli et al., 2016) that alternates between attacking a linearised variant of the model and updating the linearisation based upon gradient steps. This linearisation allows for automatic step-size control across the iterative process.
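For an affine binary classifier f(x) = w · x + b, DeepFool's linearised update has a closed form: the minimal L2 step to the decision boundary is −f(x) w / ‖w‖². A minimal sketch under that simplifying assumption follows; the classifier and the overshoot factor eta are illustrative:

```python
import numpy as np

def deepfool_binary_affine(x, w, b, eta=0.02, max_iter=50):
    """DeepFool for an affine binary classifier f(x) = w @ x + b.
    Because this model is already linear, the linearisation is exact and a
    single projection reaches the boundary; the loop mirrors the general
    algorithm, and the (1 + eta) overshoot pushes the sample just past the
    boundary so that its predicted sign flips."""
    x_adv = x.astype(float).copy()
    orig_sign = np.sign(w @ x + b)
    for _ in range(max_iter):
        f = w @ x_adv + b
        if np.sign(f) != orig_sign:
            break                        # label has flipped: done
        r = -f * w / (w @ w)             # minimal L2 step to the boundary
        x_adv = x_adv + (1 + eta) * r    # overshoot slightly past it
    return x_adv

w = np.array([1.0, 2.0])
b = -0.5
x = np.array([1.0, 1.0])                 # f(x) = 2.5, predicted positive
x_adv = deepfool_binary_affine(x, w, b)
```

For a non-linear model, `w` and `b` would be recomputed at each iterate from the local gradient, which is the re-linearisation step described above.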

3. ATTACKS AGAINST CERTIFIED DEFENCES

Randomised smoothing is a common approach for constructing certified guarantees, in which expected class outputs are estimated through Monte Carlo sampling over copies of the input perturbed by randomised noise. While this is a test-time process, the sensitivity of the model to perturbations may be decreased by performing adversarial training against single draws from the noise distribution. The guarantee of predictive invariance for perturbations up to some magnitude r in a specified Lp-norm space was first shown by way of differential privacy (Lecuyer et al., 2019; Dwork et al., 2006).
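As a concrete illustration of the Monte Carlo procedure, the following sketch estimates the smoothed prediction and a certified L2 radius of the form σ Φ⁻¹(p_low), in the style popularised by Cohen et al. (2019) rather than the differential-privacy analysis above. The base classifier, noise level sigma, and the simplified normal-approximation confidence bound are all illustrative assumptions:

```python
import numpy as np
from statistics import NormalDist

def smoothed_predict_and_radius(x, predict_fn, sigma=0.25, n=1000, alpha=0.001):
    """Monte Carlo estimate of the smoothed classifier's prediction, plus a
    certified L2 radius sigma * Phi^{-1}(p_low), where p_low is a lower
    confidence bound on the top-class probability. A crude normal
    approximation stands in for an exact Clopper-Pearson bound."""
    rng = np.random.default_rng(0)
    counts = {}
    for _ in range(n):
        label = predict_fn(x + sigma * rng.normal(size=x.shape))
        counts[label] = counts.get(label, 0) + 1
    top = max(counts, key=counts.get)
    p_hat = counts[top] / n
    p_low = p_hat - NormalDist().inv_cdf(1 - alpha) * np.sqrt(p_hat * (1 - p_hat) / n)
    p_low = min(max(p_low, 1e-9), 1 - 1e-9)   # keep inv_cdf well defined
    if p_low <= 0.5:
        return top, 0.0                       # abstain: no certificate
    return top, sigma * NormalDist().inv_cdf(p_low)

# Toy base classifier: sign of the coordinate sum.
base = lambda z: int(z.sum() > 0)
label, radius = smoothed_predict_and_radius(np.array([0.2, 0.2]), base)
```

The certified radius is exactly the quantity the attack in this work exploits: at any correctly classified point it rules out adversarial examples within distance `radius`, allowing those regions to be skipped during the search.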
