PROVABLE ROBUSTNESS AGAINST WASSERSTEIN DISTRIBUTION SHIFTS VIA INPUT RANDOMIZATION

Abstract

Certified robustness in machine learning has primarily focused on adversarial perturbations with a fixed attack budget for each sample in the input distribution. In this work, we present provable robustness guarantees on the accuracy of a model under bounded Wasserstein shifts of the data distribution. We show that a simple procedure that randomizes the input of the model within a transformation space is provably robust to distributional shifts under that transformation. Our framework allows the datum-specific perturbation size to vary across different points in the input distribution and is general enough to include fixed-sized perturbations as well. Our certificates produce guaranteed lower bounds on the performance of the model for any shift (natural or adversarial) of the input distribution within a Wasserstein ball around the original distribution. We apply our technique to certify robustness against natural (non-adversarial) transformations of images such as color shifts, hue shifts, and changes in brightness and saturation. We obtain strong performance guarantees for the robust model under clearly visible shifts in the input images. Our experiments establish the non-vacuousness of our certificates by showing that the certified lower bound on a robust model's accuracy is higher than the empirical accuracy of an undefended model under a distribution shift. We also show provable distributional robustness against adversarial attacks. Moreover, our results also imply guaranteed lower bounds (hardness result) on the performance of models trained on so-called "unlearnable" datasets that have been poisoned to interfere with model training. We show that the performance of a robust model is guaranteed to remain above a certain threshold on the test distribution even when the base model is trained on the poisoned dataset.

1. INTRODUCTION

Machine learning models often suffer significant performance loss under minor shifts in the data distribution that do not affect a human's ability to perform the same task, e.g., input noise (Dodge & Karam, 2016; Geirhos et al., 2018), image scaling, shifting and translation (Azulay & Weiss, 2019), spatial (Engstrom et al., 2019) and geometric transformations (Fawzi & Frossard, 2015; Alcorn et al., 2019), blurring (Vasiljevic et al., 2016; Zhou et al., 2017), acoustic corruptions (Pearce & Hirsch, 2000) and adversarial perturbations (Szegedy et al., 2014; Carlini & Wagner, 2017; Goodfellow et al., 2015; Madry et al., 2018; Biggio et al., 2013). Overcoming such robustness challenges is a major hurdle for deploying these models in safety-critical applications where reliability is paramount. Several techniques have been developed to improve the empirical robustness of a model to data shifts, e.g., diversifying datasets (Taori et al., 2020), training with natural corruptions (Hendrycks & Dietterich, 2019), data augmentations (Yang et al., 2019), contrastive learning (Kim et al., 2020; Radford et al., 2021; Ge et al., 2021) and adversarial training (Goodfellow et al., 2015; Madry et al., 2018; Tramèr & Boneh, 2019; Shafahi et al., 2019; Maini et al., 2020). Empirical robustness techniques are designed to protect a model against a particular type of shift or adversary (e.g., by introducing similar shifts during training) and may not be effective against new ones. For instance, adversarial defenses have been shown to break down under newer attacks (Carlini & Wagner, 2017; Athalye et al., 2018; Uesato et al., 2018; Laidlaw & Feizi, 2019; Laidlaw et al., 2021). Certifiable robustness methods, on the other hand, seek to produce provable guarantees on the robustness of a model which hold for any perturbation within a certain neighborhood of the input instance, regardless of the strategy used to generate this perturbation.
A robustness certificate produces a verifiable lower bound on the size of the perturbation required to fool a model. Apart from being a guarantee on robust performance, these certificates may also serve as a metric to compare the robustness of different models that is independent of the mechanism producing the input perturbations. However, the study of provable robustness has mostly focused on perturbations with a fixed size budget (e.g., an ℓ_p-ball of the same size) for all input points (Cohen et al., 2019; Lécuyer et al., 2019; Li et al., 2019; Salman et al., 2019; Gowal et al., 2018; Huang et al., 2019; Wong & Kolter, 2018; Raghunathan et al., 2018; Singla & Feizi, 2019; 2020; Levine & Feizi, 2021; 2020a; b). Among provable robustness methods, randomized smoothing based procedures have been able to successfully scale up to high-dimensional problems (Cohen et al., 2019; Lécuyer et al., 2019; Li et al., 2019; Salman et al., 2019) and have been adapted effectively to other domains such as reinforcement learning (Kumar et al., 2021; Wu et al., 2021) and models with structured outputs (Kumar & Goldstein, 2021), as in segmentation tasks and generative modeling. However, these techniques cannot be extended to certify under distribution shifts, as the perturbation size for each instance in the input distribution need not have a fixed bound. For example, stochastic changes in the input images of a vision model caused by lighting and weather conditions may vary across time and location. Even adversarial attacks may choose to adjust the perturbation size depending on the input instance. A standard way of describing a distribution shift is to constrain the Wasserstein distance between the original distribution D and the shifted distribution D̃ to be bounded by a certain amount ϵ, i.e., W_1^d(D, D̃) ≤ ϵ, for an appropriate distance function d. The Wasserstein distance is the minimum expectation of the distance function d over all possible joint distributions with marginals D and D̃.
The Wasserstein distance is a standard similarity measure for probability distributions and has been extensively used to study distribution shifts (Courty et al., 2017; Damodaran et al., 2018; Lee & Raginsky, 2018; Wu et al., 2019). Certifiable robustness against Wasserstein shifts is an interesting problem to study in its own right and a useful tool to have in the arsenal of provable robustness techniques in machine learning. In this work, we design robustness certificates for distribution shifts bounded by a Wasserstein distance of ϵ. We show that by simply randomizing the input in a transformation space, it is possible to bound the difference between the accuracy of the robust model under the original distribution D and the shifted distribution D̃ as a function of their Wasserstein distance ϵ under that transformation. Given a base model µ, we define a robust model µ̃ which replaces the input of µ with a randomized version sampled from a "smoothing" distribution around the original input. Let h̃ be a function denoting the performance of the robust model µ̃ on an input-output pair (x, y) (see Section 3 for a formal definition). Then, our main theoretical result in Theorem 1 shows that

E_{(x_1,y_1)∼D}[h̃(x_1, y_1)] - E_{(x_2,y_2)∼D̃}[h̃(x_2, y_2)] ≤ ψ(ϵ),

where ψ is a concave function that bounds the total variation between the smoothing distributions at two input points as a function of the distance between them (condition (3) in Section 3). Such an upper bound always exists for any smoothing distribution, as the total variation remains between zero and one as the distance between the two points increases. We discuss how to find the appropriate ψ for different smoothing distributions in Appendix G. We apply our result to certify model performance for families of parameterized distribution shifts which include shifts in the RGB color balance of an image, the hue/saturation balance, the brightness/contrast, and more.
Our method does not make any assumptions on the model and applies to both natural and adversarial shifts of the distribution. It does not increase the computational requirements of the base model, as it only samples one randomized input per robust prediction, making it scalable to high-dimensional problems that require conventional deep neural network architectures. The sample complexity for generating the Wasserstein certificates over the entire distribution is roughly the same as obtaining adversarial certificates for a single input instance using existing randomized smoothing based techniques (Cohen et al., 2019; Salman et al., 2019). Robustness under distribution shifts is a fundamental problem in several areas of machine learning, and our certificates could be applicable to a multitude of learning tasks. We demonstrate the usefulness of our main theoretical result (Theorem 1) in the following domains: (i) Certifying model accuracy under natural shifts (Section 5): We consider three image transformations: color shift, hue shift and changes in brightness and saturation (SV shift). Figure (1) visualizes CIFAR-10 (Krizhevsky et al.) images under each of these transformations and reports the corresponding certified accuracies obtained by our method. Figure (2) plots the accuracy of two base models (trained on CIFAR-10 images with and without noise in the transformation space) under a shifted distribution and compares it with the certified accuracy of a robust model (a noise-trained model smoothed using input randomization). These results demonstrate that our certificates are significant and non-vacuous (see Appendix I for more details). In Figures (3) and (4), we plot the certified accuracies for different values of training and smoothing noise, first for the CIFAR-10 dataset and then confirming our results on the SVHN dataset (Netzer et al., 2011).
(ii) Certifying population-level robustness against adversarial attacks (Section 6): The distribution of instances generated by an adversarial attack can also be viewed as a shift in the input distribution within a Wasserstein bound. Unlike existing certification techniques which assume a fixed perturbation budget across all inputs (Cohen et al., 2019; Lécuyer et al., 2019; Li et al., 2019; Salman et al., 2019), our guarantees work for a more general threat model where the adversary is allowed to choose the perturbation size for each input instance as long as it respects the constraint on the average perturbation size over the entire data distribution. Also, our procedure only requires one sample from the smoothing distribution per input instance, which makes computing population-level certificates significantly more efficient than with existing techniques. The certified accuracy we obtain significantly outperforms the base model under attack (Figure 7a). (iii) Hardness results for generating "unlearnable" datasets (Section 7): Huang et al. (2021) proposed a method to make regular datasets unusable for modern deep learning models by poisoning them with adversarial perturbations that interfere with the training of the model. The intended purpose is to increase privacy for sensitive data such as personal images uploaded to social media sites. The dataset is poisoned in such a way that a model that minimizes the loss on this data distribution will have low accuracy on clean test samples. We show that our framework can obtain verifiable lower bounds on the performance of a model trained on such unlearnable datasets. Our certificates guarantee that the performance of the robust model (using input randomization) will remain above a certain threshold on the test distribution even when the base model is trained on the poisoned dataset with a smoothing noise of suitable magnitude. This demonstrates a fundamental limitation in producing unlearnable datasets.

2. RELATED WORK

Several methods for introducing corruptions during training have been shown to improve the empirical robustness of machine learning models (Hendrycks & Dietterich, 2019; Yang et al., 2019; Goodfellow et al., 2015; Madry et al., 2018). Training with input transformations, such as blurring, cropping and rotations, can improve test accuracy against these corruptions. However, these methods do not produce any guarantees on the performance of the model with respect to the amount of shift added to the distribution. Our method applies random input transformations during inference to make the model provably robust against any distribution shift within a certain Wasserstein distance. It is independent of the model architecture and training procedure, and can be coupled with robust training techniques, such as noise or adversarial training, to improve the certified performance. Randomized smoothing based approaches that aggregate model predictions over a large number of noised samples of the input (Cohen et al., 2019; Lécuyer et al., 2019; Li et al., 2019; Salman et al., 2019) and that use input randomization (Pinot et al., 2021) have been studied in the context of certified adversarial robustness. Provable robustness guarantees for parameterized transformations on images also exist (Fischer et al., 2020). These techniques produce instance-wise, fixed-budget certificates and do not generate robustness guarantees over the entire data distribution or allow varying perturbation sizes for different instances. Our work also differs from instance-wise adversarial attacks and defenses (Wong et al., 2019; Levine & Feizi, 2019) that use the Wasserstein distance (instead of conventional ℓ_p distances) to measure the difference between an image and its perturbed version. In contrast, our certificates consider the Wasserstein distance between the data distributions from which the images themselves are sampled.
Robustness bounds on the population loss against Wasserstein shifts under the ℓ_2-distance (Shen et al., 2018; Sinha et al., 2018) have been derived assuming Lipschitz-continuity of the base model. These bounds depend on the Lipschitz constant of the underlying model, which can grow rapidly for deep neural networks. We produce guarantees on the accuracy of an arbitrary model without requiring any restrictive assumptions or a global Lipschitz bound. Additionally, our approach can certify robustness against non-ℓ_p changes, such as visible color shifts, for which the ℓ_2-norm of the perturbation in the image space will be very large. Another line of work proves generalization bounds for divergence-based measures of distribution shift (Ben-David et al., 2006; Zhao et al., 2019; Mehra et al., 2021; Weber et al., 2022) like the KL-divergence, total variation distance and Hellinger distance. Divergence measures between two distributions become arbitrarily large (e.g., the KL-divergence becomes infinite) or attain their maximal value (e.g., the total variation and Hellinger distances become equal to one) when their supports do not coincide. This drawback makes them unsuitable for measuring out-of-distribution data shifts, which by definition have non-overlapping support. The Wasserstein distance, on the other hand, captures the spatial separation of two distributions and produces a more meaningful measure of the distance even when their supports are disjoint.

3. PRELIMINARIES AND NOTATIONS

Let D be the data distribution representing a machine learning task over an input space X and an output space Y. We define a distribution shift as a covariate shift that only changes the distribution of the input element in samples (x, y) ∈ X × Y drawn from D and leaves the output element unchanged, i.e., (x, y) changes to (x̃, y) under the shift. Given a distance function d_X : X × X → R_{≥0} over the input space, we define the following distance function between two tuples τ_1 = (x_1, y_1) and τ_2 = (x_2, y_2) to capture the above shift:

d(τ_1, τ_2) = d_X(x_1, x_2) if y_1 = y_2, and ∞ otherwise. (1)

Let D̃ denote a shift of the original data distribution D such that the Wasserstein distance under d between D and D̃ is bounded by ϵ (i.e., W_1^d(D, D̃) ≤ ϵ). Define the set of all joint probability distributions with marginals µ_D and µ_D̃ as follows:

Γ(D, D̃) = { γ s.t. ∫_{X×Y} γ(τ_1, τ_2) dτ_2 = µ_D(τ_1) and ∫_{X×Y} γ(τ_1, τ_2) dτ_1 = µ_D̃(τ_2) }.

The Wasserstein bound implies that there exists an element γ* ∈ Γ(D, D̃) such that

E_{(τ_1, τ_2)∼γ*}[d(τ_1, τ_2)] ≤ ϵ. (2)

Let S : X → ∆(X) be a function mapping each element x ∈ X to a smoothing distribution S(x), where ∆(X) is the set of all probability distributions over X. For example, smoothing with an isometric Gaussian noise distribution with variance σ² can be denoted as S(x) = N(x, σ²I). Let the total variation between the smoothing distributions at two points x_1 and x_2 be bounded by a concave increasing function ψ of the distance between them, i.e.,

TV(S(x_1), S(x_2)) ≤ ψ(d_X(x_1, x_2)). (3)

For example, when the distance function d_X is the ℓ_2-norm of the difference of x_1 and x_2, and the smoothing distribution is an isometric Gaussian N(x, σ²I) with variance σ², ψ(t) = erf(t / (2√2 σ)) is a valid upper bound on the above total variation that is concave in the positive domain (see Appendix G for more examples).
Consider a function h : X × Y → [0, 1] that represents the performance (e.g., accuracy) of a model µ over all possible input-output pairs. For example, in the case of a classifier µ : X → Y that maps inputs from the space X to a class label in Y, h(x, y) := 1{µ(x) = y} could indicate whether the prediction of µ on x matches the desired output label y or not. Another example could be that of segmentation/detection tasks, where y represents a region on an input image x. Then, h(x, y) := IoU(µ(x), y) could represent the overlap between the predicted region µ(x) and the ground truth y. The overall accuracy of the model µ under D is then given by E_{(x,y)∼D}[h(x, y)]. Now, define a robust model µ̃(x) = µ(x′) where x′ ∼ S(x), which simply applies the base model µ on a randomized version of the input x sampled from a smoothing distribution S(x). Our goal is to bound the difference in the expected performance of the robust model between the original distribution D and the shifted distribution D̃. Let h̃ be the performance function for the robust model µ̃, defined as

h̃(x, y) = E_{x′∼S(x)}[h(x′, y)]. (4)

Then, the accuracy of the robust model µ̃ under D is given by E_{(x,y)∼D}[h̃(x, y)]. Our result in Theorem 1 bounds the difference between the expectation of h̃ under D and D̃ by ψ(ϵ).
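As a toy illustration of these definitions, the smoothed performance h̃(x, y) can be estimated by averaging h over samples drawn from S(x). The sketch below uses isometric Gaussian smoothing and a hypothetical threshold "classifier"; none of the names or values come from the paper's implementation.

```python
import random

def smoothed_performance(h, x, y, sigma, n_samples=2000, seed=0):
    # Monte Carlo estimate of h_tilde(x, y) = E_{x' ~ N(x, sigma^2 I)}[h(x', y)].
    # The robust model itself draws a single noised sample per prediction;
    # certification averages the performance function h over many draws.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x_noised = [xi + rng.gauss(0.0, sigma) for xi in x]
        total += h(x_noised, y)
    return total / n_samples

# Toy performance function: a "classifier" predicting class 1 iff the mean
# input coordinate is positive, scored by 0/1 accuracy as in h(x, y) = 1{mu(x) = y}.
h = lambda x, y: float((sum(x) / len(x) > 0) == (y == 1))
acc = smoothed_performance(h, x=[0.5, 0.7], y=1, sigma=0.1)
```

With a clean margin this large relative to the noise level, the estimate is close to one; near the decision boundary it would fall towards 1/2.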

3.1. PARAMETERIZED TRANSFORMATIONS

We apply our distributional certificates to produce guarantees on the accuracy of an image classifier under natural transformations such as color shifts, hue shifts and changes in brightness and saturation. We model each transformation as a function T : X × P → X over the image space X and a parameter space P. It takes an image x ∈ X and a parameter vector θ ∈ P as inputs and outputs a transformed image x′ = T(x, θ) ∈ X. An example of such a transformation could be a color shift in an RGB image x = ({x^R_ij}, {x^G_ij}, {x^B_ij}) produced by scaling the intensities in the red, green and blue channels, defined as CS(x, θ) = (2^{θ_R} {x^R_ij}, 2^{θ_G} {x^G_ij}, 2^{θ_B} {x^B_ij}) / MAX for a tuple θ = (θ_R, θ_G, θ_B), where MAX is the maximum of all the RGB values after scaling. Additive perturbations in the input space can also be captured as parameterized transformations, e.g., VT(x, θ) = x + θ. We assume that the transformation returns x if the parameters are all zero, i.e., T(x, 0) = x, and that the composition of two transformations with parameters θ_1 and θ_2 is a transformation with parameters θ_1 + θ_2 (additive composability), i.e.,

T(T(x, θ_1), θ_2) = T(x, θ_1 + θ_2). (5)

Given a norm ∥·∥ in the parameter space P, we define a distance function in the input space X as follows:

d_T(x_1, x_2) = min{∥θ∥ | T(x_1, θ) = x_2} if ∃θ s.t. T(x_1, θ) = x_2, and ∞ otherwise. (6)

Now, define a smoothing distribution S(x) = T(x, Q(0)) for some distribution Q in the parameter space of T such that, ∀θ ∈ P, Q(θ) = θ + Q(0) is the distribution of θ + δ where δ ∼ Q(0), and TV(Q(0), Q(θ)) ≤ ψ(∥θ∥) for a concave function ψ. For example, Q(·) = N(·, σ²I) satisfies these properties for ψ(t) = erf(t / (2√2 σ)). Then, the following lemma holds (proof in Appendix B):

Lemma 1. For two points x_1, x_2 ∈ X such that d_T(x_1, x_2) is finite, TV(S(x_1), S(x_2)) ≤ ψ(d_T(x_1, x_2)).
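The Gaussian example above can be sanity-checked numerically: in one dimension, the total variation between Q(0) = N(0, σ²) and Q(θ) = N(θ, σ²) equals erf(|θ|/(2√2 σ)), so the stated ψ is tight there. The sketch below (illustrative parameter values, not from the paper) verifies this by direct integration of the two densities:

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def tv_numeric(theta, sigma, lo=-10.0, hi=10.0, n=200001):
    # TV(Q(0), Q(theta)) = 0.5 * integral of |p1 - p2|, via a Riemann sum
    step = (hi - lo) / (n - 1)
    total = sum(abs(gaussian_pdf(lo + i * step, 0.0, sigma)
                    - gaussian_pdf(lo + i * step, theta, sigma)) for i in range(n))
    return 0.5 * total * step

def psi(d, sigma):
    # the concave bound on the total variation used in the text
    return math.erf(d / (2 * math.sqrt(2) * sigma))

sigma, theta = 0.5, 0.8
tv = tv_numeric(theta, sigma)
```

In higher dimensions the same bound holds because the total variation between two isometric Gaussians depends only on the ℓ_2-distance between their means.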

4. CERTIFIED DISTRIBUTIONAL ROBUSTNESS

In this section, we state our main theoretical result, which shows that the difference in the expectation of the performance function h̃ of the robust model (equation (4)) under the original distribution D and any shifted distribution D̃ within a Wasserstein distance of ϵ from D is bounded by ψ(ϵ), where ψ is the concave upper bound on the total variation between the smoothing distributions at two points x_1 and x_2 as defined in condition (3).

Theorem 1. Given a function h : X × Y → [0, 1], define its smoothed version as h̃(x, y) = E_{x′∼S(x)}[h(x′, y)]. Then, ∀ D̃ s.t. W_1^d(D, D̃) ≤ ϵ,

E_{(x_1,y_1)∼D}[h̃(x_1, y_1)] - E_{(x_2,y_2)∼D̃}[h̃(x_2, y_2)] ≤ ψ(ϵ).

We defer the proof to Appendix A. Note that this certificate does not require us to compute the Wasserstein distance between D and D̃. Given a value for ϵ, it holds for all distributions D̃ such that W_1^d(D, D̃) ≤ ϵ. Our certified guarantees hold for the entire input distribution (potentially continuous) and not just for a finite set of samples. The intuition behind the above bound is that if the overlap between the smoothing distributions at two individual points does not decrease rapidly with the distance between them, then the overlap between D and D̃, each augmented with the smoothing distribution, is high when the Wasserstein distance between them is small. The key observation here is that the total variation between the individual smoothing distributions can be upper bounded by a concave function ψ, and this upper bound can then be generalized over the entire distribution using Jensen's inequality. The above guarantee implies that for any distribution D̃ that is within a Wasserstein distance of ϵ from the original distribution D, the accuracy of the model under D̃ can be bounded as

E_{(x_2,y_2)∼D̃}[h̃(x_2, y_2)] ≥ E_{(x_1,y_1)∼D}[h̃(x_1, y_1)] - ψ(ϵ).
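Computationally, the theorem turns a single number, the smoothed model's accuracy on D, into a certificate by subtracting ψ(ϵ). A minimal sketch for the Gaussian-smoothing case, where ψ(t) = erf(t/(2√2 σ)); the 0.80 accuracy and the parameter values are hypothetical:

```python
import math

def certified_accuracy_lower_bound(clean_smoothed_acc, eps, sigma):
    # Lower bound on the robust model's accuracy under ANY shift D~ with
    # W_1(D, D~) <= eps, using the Gaussian-smoothing bound
    # psi(t) = erf(t / (2 * sqrt(2) * sigma)).
    psi_eps = math.erf(eps / (2 * math.sqrt(2) * sigma))
    return max(0.0, clean_smoothed_acc - psi_eps)

# hypothetical numbers: 80% smoothed accuracy on D, sigma = 0.5, budget eps = 0.25
lb = certified_accuracy_lower_bound(0.80, eps=0.25, sigma=0.5)
```

Larger σ flattens ψ and so certifies larger shifts, at the cost of a lower smoothed clean accuracy, which is the usual robustness/accuracy trade-off of randomized smoothing.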

4.1. COMPUTING THE CERTIFICATE AND EMPIRICAL EVALUATIONS

Given a target Wasserstein bound ϵ and an appropriate function ψ, we simply need to calculate the expected performance of the robust model over the original distribution D, i.e., E_{(x_1,y_1)∼D}[h̃(x_1, y_1)]. Since we only have sample access to the original distribution D, we estimate this expectation using a finite number of samples. In our experiments, we compute a high-confidence lower bound on this quantity using the Clopper-Pearson method (Clopper & Pearson, 1934) that holds with probability 1 - α, for some α > 0 (usually 0.001). Note that although we calculate the bound with a finite number of samples from the distribution D, this lower bound holds for the expectation over the entire distribution and not just for the samples. See Appendix C for pseudocode of the prediction and certification steps. To compare our certified guarantees against the empirical performance of an undefended model under distribution shifts, we design shifted distributions using natural and adversarial transformations on the original distribution. We ensure that the constructed distribution shift is within the desired Wasserstein distance using two methods: 1. By construction: We analytically guarantee beforehand that the applied transformation does not exceed the Wasserstein bound. For example, in Figure 2, we report the empirical performance of the base models under distribution shifts constructed by adding a noise vector.

In the following sections, we apply our main theoretical result to obtain certified robustness guarantees against several different distribution shifts: natural shifts, unlearnable distributions and adversarial shifts. We experiment on two image classification datasets, namely CIFAR-10 (Krizhevsky et al.) and SVHN (Netzer et al., 2011), and observe that our certificates obtain meaningful performance guarantees and exhibit similar trends for both datasets.
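The Clopper-Pearson lower bound used here can be computed without any statistics library by inverting the binomial tail with a bisection search; a sketch under illustrative sample counts (the paper's experiments use many more samples):

```python
from math import comb

def binom_upper_tail(k, n, p):
    # P[X >= k] for X ~ Binomial(n, p)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def clopper_pearson_lower(k, n, alpha=0.001, tol=1e-9):
    # One-sided Clopper-Pearson lower confidence bound on a binomial proportion:
    # the largest p such that P[X >= k | n, p] <= alpha, found by bisection.
    # With probability at least 1 - alpha, the true proportion exceeds this bound.
    if k == 0:
        return 0.0
    lo, hi = 0.0, k / n
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom_upper_tail(k, n, mid) <= alpha:
            lo = mid
        else:
            hi = mid
    return lo

# e.g. the robust model is correct on 900 of 1000 noised evaluation samples
p_lo = clopper_pearson_lower(k=900, n=1000, alpha=0.001)
```

The bound returned here would play the role of E_{(x_1,y_1)∼D}[h̃(x_1, y_1)] in the certificate before ψ(ϵ) is subtracted.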

5. CERTIFIED ACCURACY AGAINST NATURAL TRANSFORMATIONS

We certify the accuracy of a ResNet-110 model and a ResNet-20 model, trained on CIFAR-10 and SVHN images respectively, under three types of transformations: color shifts, hue shifts and variations in brightness and saturation (SV shift). We train our models with varying levels of noise in the transformation space and evaluate their certified performance using smoothing distributions of different standard deviations. For color and SV shifts, we show how the certified accuracy varies as a function of the Wasserstein distance as we change the training and smoothing noise. For hue shift, we use a smoothing distribution (with a fixed noise level) that is invariant to rotations in hue space, because of which the certified accuracy remains constant with respect to the corresponding Wasserstein distance. We train the ResNet-110 models for 90 epochs, which takes a few hours on a single NVIDIA GeForce RTX 2080 Ti GPU, and the ResNet-20 models for 40 epochs, which takes around twenty minutes on the same GPU. Once the models have been trained, computing the distribution-level Wasserstein certificates using 10^5 samples with 99.9% confidence takes only about 25 seconds for each model.

5.1. COLOR SHIFTS

Denote an RGB image x as an H × W array of pixels where the red, green and blue components of the pixel in the ith row and jth column are given by the tuple x_ij = (r, g, b)_ij. Let r_max, g_max and b_max be the maximum values of the red, green and blue channels, respectively. Assume that the RGB values are in the interval [0, 1], normalized such that the maximum over all intensity values is one, i.e., max(r_max, g_max, b_max) = 1. Define a color shift of the image x for a parameter vector θ = (θ_R, θ_G, θ_B) ∈ R³ as

CS(x, θ) = [ (2^{θ_R} r, 2^{θ_G} g, 2^{θ_B} b)_ij / max(2^{θ_R} r_max, 2^{θ_G} g_max, 2^{θ_B} b_max) ]_{H×W},

which scales the intensities of each channel by two raised to the power of the corresponding component of θ and then normalizes the scaled image so that the maximum intensity is one. For example, θ = (1, -1, 0) would first double all the red intensities, halve the green intensities and leave the blue intensities unchanged, and then normalize the image so that the maximum intensity value over all the channels is equal to one. The above transformation can be shown to satisfy the additive composability property in condition (5); see Appendix H for a proof. Given an image x, we define a smoothing distribution around x in the parameter space as CS(x, δ) where δ ∼ N(0, σ²I_{3×3}). Define the distance function d_CS as described in (6) using the ℓ_2-norm in the parameter space. For a distribution D̃ within a Wasserstein distance of ϵ from the original distribution D, the performance of the smoothed model on D̃ can be bounded as

E_{(x_2,y_2)∼D̃}[h̃(x_2, y_2)] ≥ E_{(x_1,y_1)∼D}[h̃(x_1, y_1)] - erf(ϵ / (2√2 σ)).

Figure 3 plots the certified accuracy under color shift with respect to the Wasserstein bound ϵ for different values of training and smoothing noise. In Appendix K, we consider a smoothing distribution that randomly picks one color channel, achieving a constant certified accuracy of 87.1% with respect to ϵ.
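A small sketch of the CS transformation, checking additive composability numerically on a hypothetical two-pixel image. Intuitively, the intermediate normalization constants cancel, so composing two shifts matches a single shift by θ_1 + θ_2:

```python
def color_shift(x, theta):
    # CS(x, theta): scale each RGB channel c by 2**theta_c, then renormalize so
    # the maximum intensity over all channels is one. `x` is a list of (r, g, b)
    # tuples with values in [0, 1] and overall maximum equal to 1.
    scaled = [(2 ** theta[0] * r, 2 ** theta[1] * g, 2 ** theta[2] * b)
              for r, g, b in x]
    m = max(max(px) for px in scaled)
    return [(r / m, g / m, b / m) for r, g, b in scaled]

# hypothetical two-pixel image (maximum intensity is 1, as assumed in the text)
img = [(1.0, 0.5, 0.25), (0.8, 1.0, 0.1)]
t1, t2 = (1.0, -1.0, 0.0), (0.5, 0.25, -0.5)
composed = color_shift(color_shift(img, t1), t2)  # CS(CS(x, t1), t2)
direct = color_shift(img, tuple(a + b for a, b in zip(t1, t2)))  # CS(x, t1 + t2)
```

Since 2^{θ_1} 2^{θ_2} = 2^{θ_1+θ_2} and renormalization is scale-invariant, the two results agree up to floating-point error.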

5.2. BRIGHTNESS AND SATURATION CHANGES

Define the following transformation in the HSV space of an image that shifts the mean of the saturation (S) and brightness (V) values of each pixel by a certain amount:

SV(x, θ) = [ ( h, (s + (2^{θ_S} - 1) s_mean) / MAX, (v + (2^{θ_V} - 1) v_mean) / MAX )_ij ]_{H×W},

where s_mean, s_max, v_mean and v_max are the means and maximums of the saturation and brightness values before the shift is applied, and MAX = max(s_max + (2^{θ_S} - 1) s_mean, v_max + (2^{θ_V} - 1) v_mean) is the maximum of the brightness and saturation values after the shift. Similar to the color shift, the SV transformation can also be shown to satisfy additive composability (Appendix H). Figure 4 plots the certified accuracy under saturation and brightness changes with respect to ϵ for different values of training and smoothing noise. The smoothing distribution is uniform in the range [0, a]² in the parameter space, the distance function is the ℓ_1-norm and ψ(ϵ) = min(ϵ/a, 1).
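For the uniform smoothing distribution on [0, a]², the total variation under a parameter shift θ has a closed form (one minus the overlapping volume fraction), so the bound ψ(∥θ∥_1) = min(∥θ∥_1/a, 1) can be sanity-checked directly; the values below are illustrative:

```python
def tv_uniform_shift(theta, a):
    # Exact total variation between Uniform([0, a]^2) and its shift by theta:
    # one minus the fraction of overlapping volume of the two squares.
    overlap = 1.0
    for t in theta:
        overlap *= max(0.0, a - abs(t)) / a
    return 1.0 - overlap

def psi(l1, a):
    # the concave bound used in the text
    return min(l1 / a, 1.0)

a = 2.0
shifts = [(0.1, 0.3), (1.0, 0.5), (2.5, 0.0)]
checks = [tv_uniform_shift(th, a) <= psi(abs(th[0]) + abs(th[1]), a) + 1e-12
          for th in shifts]
```

In one dimension the bound is exact; in two dimensions the exact TV is u + v - uv for u = |θ_S|/a, v = |θ_V|/a (when both are below one), which never exceeds u + v.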

6. POPULATION-LEVEL CERTIFICATES AGAINST ADVERSARIAL ATTACKS

In this section, we consider the ℓ_2-distance in the image space to measure the Wasserstein distance instead of a parameterized transformation (see Appendix D for a detailed version). We use a pixel-space Gaussian smoothing distribution S(x) = N(x, σ²I) to obtain robustness guarantees under this metric. To motivate this, consider an adversarial attacker Adv : X → X, which takes an image x and computes a perturbed version Adv(x) to try to fool a model into misclassifying the input. If (x, y) ∼ D, define D̃ to be the distribution of the tuples (Adv(x), y). Defining d as in (1) with d_X = ℓ_2, it is easy to show that:

W_1^d(D, D̃) ≤ E_{x∼D}[∥Adv(x) - x∥_2]. (7)

Figure 5: Distributional certificates against adversarial attacks on CIFAR-10.

Results on CIFAR-10 are presented in Figure 5 and results on SVHN are available in Appendix D. For CIFAR-10, we use ResNet-110 models trained under noise from Cohen et al. (2019). The solid lines represent certified accuracies for different smoothing noises and the black dashed line represents the empirical performance of an undefended model under attack. For the undefended baseline, we give the performance of an undefended model against a strategic attacker, which first finds a minimal ℓ_2 attack for each sample via Carlini & Wagner (2017). If this attack is too large in magnitude (ℓ_2-norm greater than a threshold γ), the attacker instead chooses not to attack the sample. This "saves" the attack budget (i.e., the average attack magnitude and therefore the Wasserstein shift) for easier samples. The size of the Wasserstein shift can be adjusted by varying γ.
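Equation (7) says the relevant attack budget is the average, not the per-sample, ℓ_2 perturbation. A sketch of how this Wasserstein budget would be measured on a batch of hypothetical inputs, mirroring the strategic attacker that spends its budget unevenly:

```python
import math

def wasserstein_budget(clean_inputs, attacked_inputs):
    # Upper bound on W_1(D, D~) from eq. (7): the AVERAGE l2 perturbation size.
    # An attacker may spend a large budget on some inputs and none on others,
    # as long as this average stays below the certified eps.
    norms = [math.sqrt(sum((a - c) ** 2 for c, a in zip(x, adv)))
             for x, adv in zip(clean_inputs, attacked_inputs)]
    return sum(norms) / len(norms)

# hypothetical 2-D "images": only the first sample is attacked (by (0.3, 0.4))
clean = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
attacked = [[0.3, 0.4], [1.0, 1.0], [2.0, 0.0]]
eps = wasserstein_budget(clean, attacked)
```

Here one sample receives an ℓ_2 perturbation of 0.5 and the others none, so the induced Wasserstein shift is at most 0.5/3, even though the per-sample worst case is 0.5.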

7. HARDNESS RESULTS ON UNLEARNABILITY

In this section, we show that the pixel-space ℓ_2-Wasserstein distributional robustness certificate shown above can also be applied to establish a hardness result in creating provably "unlearnable" datasets (Huang et al., 2021). These datasets contain "poisoned" samples which make any classifier trained on the released data achieve a high training and validation accuracy, but a low test accuracy on non-poisoned samples from the original data distribution. This technique has legitimate applications, such as protecting privacy by preventing one's personal data from being learned, but may also have malicious uses (e.g., a malicious actor could sell a useless classifier that nevertheless has good performance on a provided validation set). We can view the "clean" data distribution as D, and the distribution of the poisoned samples (i.e., the unlearnable distribution) as D̃. If the magnitude of the perturbations is limited, Theorem 1 implies that the accuracy on D and D̃ must be similar, implying that our algorithm is provably resistant to unlearnability attacks, effectively establishing provable hardness results for creating unlearnable datasets. In order to apply our guarantees, we make a few modifications to the attack proposed in Huang et al. (2021). First, we bound each poisoning perturbation on the released dataset to within an ℓ_2 ball of radius ϵ, rather than an ℓ_∞ ball. From Equation 8, this ensures that W_1^d(D, D̃) ≤ ϵ. Second, we consider an "offline" version of the attack. In the original attack (Huang et al., 2021), perturbations for the entire dataset are optimized simultaneously with a proxy classifier model in an iterative manner. This makes the perturbations applied to each sample non-i.i.d. (they may depend on each other through the proxy-model parameters), which makes deriving generalizable guarantees for it difficult. However, this simultaneous proxy-model training and poisoning may not always represent a realistic threat model.
In particular, an actor releasing "unlearnable" data at scale may not be able to constantly update the proxy model being used. For example, consider an "unlearnability" module in a camera, which would make photos unusable as training data. Because the camera itself has access to only a small number of photographs, such a module would likely rely on a fixed, pre-trained proxy classifier model to create the poisoning perturbations. To model this, we consider a threat model where the proxy classifier is first optimized using an unreleased dataset: the released "unlearnable" samples are then perturbed independently using this fixed proxy model. We see in Figure 6 that our modified attack is still highly effective at making data unlearnable, as shown by the high validation and low test accuracy of the undefended baseline. 

A PROOF OF THEOREM 1

Statement. Given a function h : X × Y → [0, 1], define its smoothed version as h̄(x, y) = E_{x′∼S(x)}[h(x′, y)]. Then,
$$\mathbb{E}_{(x_1,y_1) \sim D}[\bar h(x_1, y_1)] - \mathbb{E}_{(x_2,y_2) \sim \tilde D}[\bar h(x_2, y_2)] \le \psi(\epsilon).$$

Proof. Let τ₁ = (x₁, y₁) and τ₂ = (x₂, y₂) denote the input-output tuples sampled from D and D̃ respectively. For the joint distribution γ* ∈ Γ(D, D̃) in (2), we have E_{τ₁∼D}[h̄(τ₁)] = E_{(τ₁,τ₂)∼γ*}[h̄(τ₁)] and E_{τ₂∼D̃}[h̄(τ₂)] = E_{(τ₁,τ₂)∼γ*}[h̄(τ₂)]. This is because when (τ₁, τ₂) is sampled from the coupling γ*, τ₁ and τ₂ individually have distributions D and D̃ respectively. Also, since the expected distance between τ₁ and τ₂ is finite, the output elements of the sampled tuples must agree almost surely, i.e., y₁ = y₂ = y (say); see Lemma 2 below. Then,
$$\mathbb{E}_{(x_1,y_1) \sim D}[\bar h(x_1, y_1)] - \mathbb{E}_{(x_2,y_2) \sim \tilde D}[\bar h(x_2, y_2)] = \mathbb{E}_{(\tau_1,\tau_2) \sim \gamma^*}[\bar h(\tau_1) - \bar h(\tau_2)] \le \mathbb{E}_{(\tau_1,\tau_2) \sim \gamma^*}\big[|\bar h(\tau_1) - \bar h(\tau_2)|\big].$$
Now, from definition (4) and for i = 1, 2, h̄(τᵢ) = h̄(xᵢ, y) = E_{x′ᵢ∼S(xᵢ)}[h(x′ᵢ, y)] = E_{x′ᵢ∼S(xᵢ)}[g(x′ᵢ)] can be expressed as the expected value of a function g : X → [0, 1] under the distribution S(xᵢ). Without loss of generality, assume E_{x′₁∼S(x₁)}[g(x′₁)] ≥ E_{x′₂∼S(x₂)}[g(x′₂)]. Then, with μ₁ and μ₂ denoting the PDFs of S(x₁) and S(x₂),
$$\begin{aligned}
\mathbb{E}_{x_1' \sim S(x_1)}[g(x_1')] - \mathbb{E}_{x_2' \sim S(x_2)}[g(x_2')]
&= \int_{\mathcal{X}} g(x)\mu_1(x)\,dx - \int_{\mathcal{X}} g(x)\mu_2(x)\,dx \\
&= \int_{\mu_1 > \mu_2} g(x)(\mu_1(x) - \mu_2(x))\,dx - \int_{\mu_2 > \mu_1} g(x)(\mu_2(x) - \mu_1(x))\,dx \\
&\le \max_{x' \in \mathcal{X}} g(x') \int_{\mu_1 > \mu_2} (\mu_1(x) - \mu_2(x))\,dx - \min_{x' \in \mathcal{X}} g(x') \int_{\mu_2 > \mu_1} (\mu_2(x) - \mu_1(x))\,dx \\
&\le \int_{\mu_1 > \mu_2} (\mu_1(x) - \mu_2(x))\,dx \qquad \Big(\text{since } \max_{x' \in \mathcal{X}} g(x') \le 1 \text{ and } \min_{x' \in \mathcal{X}} g(x') \ge 0\Big) \\
&= \frac{1}{2} \int_{\mathcal{X}} |\mu_1(x) - \mu_2(x)|\,dx = \mathrm{TV}(S(x_1), S(x_2)),
\end{aligned}$$
where the last line uses ∫_{μ₁>μ₂}(μ₁ − μ₂)dx = ∫_{μ₂>μ₁}(μ₂ − μ₁)dx = ½∫|μ₁ − μ₂|dx. Thus, from (1) and (3), we have |h̄(τ₁) − h̄(τ₂)| ≤ ψ(d_X(x₁, x₂)) = ψ(d(τ₁, τ₂)), and therefore,
$$\mathbb{E}_{(x_1,y_1) \sim D}[\bar h(x_1, y_1)] - \mathbb{E}_{(x_2,y_2) \sim \tilde D}[\bar h(x_2, y_2)] \le \mathbb{E}_{(\tau_1,\tau_2) \sim \gamma^*}[\psi(d(\tau_1, \tau_2))] \le \psi\big(\mathbb{E}_{(\tau_1,\tau_2) \sim \gamma^*}[d(\tau_1, \tau_2)]\big),$$
by Jensen's inequality, since ψ is concave. Hence, from (2) and since ψ is non-decreasing, we have
$$\mathbb{E}_{(x_1,y_1) \sim D}[\bar h(x_1, y_1)] - \mathbb{E}_{(x_2,y_2) \sim \tilde D}[\bar h(x_2, y_2)] \le \psi(\epsilon).$$

Lemma 2. Let Ω = {(τ₁, τ₂) s.t. y₁ ≠ y₂, where τ₁ = (x₁, y₁) and τ₂ = (x₂, y₂)}. Then P_{(τ₁,τ₂)∼γ*}[(τ₁, τ₂) ∈ Ω] = 0.

Proof. Assume, for the sake of contradiction, that P_{(τ₁,τ₂)∼γ*}[(τ₁, τ₂) ∈ Ω] ≥ p for some p > 0. (We replace (τ₁, τ₂) ∼ γ* with just γ* in the subscripts for brevity.) From condition (2), we have E_{γ*}[d(τ₁, τ₂)] ≤ ϵ. By the law of total expectation,
$$\mathbb{E}_{\gamma^*}[d(\tau_1, \tau_2)] = \mathbb{E}_{\gamma^*}[d(\tau_1, \tau_2) \mid (\tau_1, \tau_2) \in \Omega]\, \mathbb{P}_{\gamma^*}[(\tau_1, \tau_2) \in \Omega] + \mathbb{E}_{\gamma^*}[d(\tau_1, \tau_2) \mid (\tau_1, \tau_2) \notin \Omega]\, \mathbb{P}_{\gamma^*}[(\tau_1, \tau_2) \notin \Omega].$$
Since both summands are non-negative, E_{γ*}[d(τ₁, τ₂) | (τ₁, τ₂) ∈ Ω] P_{γ*}[(τ₁, τ₂) ∈ Ω] ≤ ϵ. Consider a real number l > ϵ/p. Then, for any (τ₁, τ₂) ∈ Ω, from definition (1) and because y₁ ≠ y₂, we have d(τ₁, τ₂) ≥ l. Therefore, E_{γ*}[d(τ₁, τ₂) | (τ₁, τ₂) ∈ Ω] ≥ l, and
$$l\, \mathbb{P}_{\gamma^*}[(\tau_1, \tau_2) \in \Omega] \le \mathbb{E}_{\gamma^*}[d(\tau_1, \tau_2) \mid (\tau_1, \tau_2) \in \Omega]\, \mathbb{P}_{\gamma^*}[(\tau_1, \tau_2) \in \Omega] \le \epsilon \;\; \Longrightarrow \;\; \mathbb{P}_{\gamma^*}[(\tau_1, \tau_2) \in \Omega] \le \epsilon/l < p,$$
which contradicts our initial assumption.

Algorithm 1 Prediction

Input: Model µ, input instance x.
Output: Robust prediction y.
1. Randomize the input: x′ ∼ S(x).
2. Evaluate the model: y = µ(x′).
3. Return y.
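Algorithm 1 amounts to a single randomized forward pass. A minimal Python sketch, where `model` is a placeholder and pixel-space Gaussian noise stands in for one possible smoothing distribution S(x):

```python
import numpy as np

def smoothed_predict(model, x, sigma, rng):
    """Algorithm 1: draw one sample x' ~ S(x) and evaluate the model on it.

    Here S(x) = N(x, sigma^2 I) as an example; smoothing distributions in a
    transformation space work the same way (sample a parameter, transform x).
    """
    x_rand = x + rng.normal(scale=sigma, size=x.shape)  # randomize input: x' ~ S(x)
    return model(x_rand)                                # evaluate model: y = mu(x')
```

Note that, unlike sample-wise randomized smoothing, only one noisy forward pass is needed per input.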

Algorithm 2 Certification

Input: Accuracy function h, data distribution D, Wasserstein bound ϵ, number of samples n, confidence parameter α > 0.
Output: Certified accuracy for bound ϵ.
1. sum = 0.
2. for i = 1, …, n:
   (a) Sample (x, y) ∼ D and x′ ∼ S(x).
   (b) Compute h(x′, y) and set sum = sum + h(x′, y).
3. Compute a (1 − α)-confidence lower bound h̲ on E_{(x,y)∼D}[h̄(x, y)] using sum and n.
4. Return h̲ − ψ(ϵ).

B PROOF OF LEMMA 1

Statement. For two points x₁, x₂ ∈ X such that d_T(x₁, x₂) is finite, TV(S(x₁), S(x₂)) ≤ ψ(d_T(x₁, x₂)).

Proof. Consider the θ for which d_T(x₁, x₂) = ∥θ∥; then T(x₁, θ) = x₂, and
$$\begin{aligned}
\mathrm{TV}(S(x_1), S(x_2)) &= \mathrm{TV}\big(T(x_1, Q(0)),\; T(x_2, Q(0))\big) \\
&= \mathrm{TV}\big(T(x_1, Q(0)),\; T(T(x_1, \theta), Q(0))\big) \\
&= \mathrm{TV}\big(T(x_1, Q(0)),\; T(x_1, \theta + Q(0))\big) \qquad \text{(additive composability, equation (5))} \\
&= \mathrm{TV}\big(T(x_1, Q(0)),\; T(x_1, Q(\theta))\big). \qquad \text{(definition of } Q\text{)}
\end{aligned}$$
Let A be the event in the space M that maximizes the difference in the probabilities assigned to A by T(x₁, Q(0)) and T(x₁, Q(θ)). Let u : P → [0, 1] be the function that returns the probability (over the randomness of T) of a parameter η ∈ P being mapped to a point in A, i.e., u(η) = P{T(x₁, η) ∈ A}. For a deterministic transformation T, u is a 0/1-valued function. The probabilities assigned by T(x₁, Q(0)) and T(x₁, Q(θ)) to A are then E_{η∼Q(0)}[u(η)] and E_{η∼Q(θ)}[u(η)]. Therefore,
$$\mathrm{TV}(S(x_1), S(x_2)) = \big|\mathbb{E}_{\eta \sim Q(0)}[u(\eta)] - \mathbb{E}_{\eta \sim Q(\theta)}[u(\eta)]\big| \le \mathrm{TV}(Q(0), Q(\theta)) \le \psi(\|\theta\|) = \psi(d_T(x_1, x_2)). \qquad \text{(definition of } Q \text{ and } d_T\text{)}$$

C PSEUDOCODE FOR PREDICTION AND CERTIFICATION

Algorithm 1 and Algorithm 2 describe the prediction and certification steps of our method.
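A minimal Python sketch of the certification procedure (Algorithm 2), using Hoeffding's inequality for the (1 − α) lower confidence bound; the data sampler, smoothing sampler, accuracy function and ψ are placeholders supplied by the caller:

```python
import numpy as np

def certify(h, sample_data, sample_smooth, psi, eps, n, alpha, rng):
    """Algorithm 2: certified accuracy lower bound at Wasserstein radius eps.

    h(x, y) takes values in [0, 1]; sample_smooth draws x' ~ S(x); psi is the
    concave function bounding TV(S(x1), S(x2)).  Since h is bounded in [0, 1],
    Hoeffding's inequality gives the (1 - alpha) lower confidence bound.
    """
    total = 0.0
    for _ in range(n):
        x, y = sample_data(rng)
        total += h(sample_smooth(x, rng), y)
    # P(mean - E > t) <= exp(-2 n t^2) = alpha  =>  t = sqrt(ln(1/alpha)/(2n))
    h_lower = total / n - np.sqrt(np.log(1.0 / alpha) / (2.0 * n))
    return h_lower - psi(eps)
```

The returned value lower-bounds the expected accuracy under any distribution within Wasserstein distance ϵ of D, with probability 1 − α over the n samples.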

D POPULATION-LEVEL CERTIFICATES AGAINST ADVERSARIAL ATTACKS

In this section, we consider the ℓ2-distance in the image space to measure the Wasserstein distance, instead of a parameterized transformation. We use a pixel-space Gaussian smoothing distribution S(x) = N(x, σ²I) to obtain robustness guarantees under this metric. To motivate this, consider an adversarial attacker Adv : X → X, which takes an image x and computes a perturbed image Adv(x) to try to fool a model into misclassifying the input. If (x, y) ∼ D, define D̃ to be the distribution of the tuples (Adv(x), y). Defining d in (1) with d_X = ℓ2, it is easy to show that
$$W_1^d(D, \tilde D) \le \mathbb{E}_{x \sim D}\big[\|\mathrm{Adv}(x) - x\|_2\big].$$
So, if the average magnitude of the perturbations induced by Adv is less than ϵ (i.e., E_{x∼D}[∥Adv(x) − x∥₂] < ϵ), then W₁ᵈ(D, D̃) < ϵ, which means that we can apply Theorem 1: the gap in expected accuracy between x ∼ D and Adv(x) ∼ D̃ will be at most ψ(ϵ). Note that, under this threat model, Adv can be strategic in its use of the average perturbation "budget": if a certain point x would require a very large perturbation to be misclassified, or is already misclassified, then Adv(x) can save the budget by simply returning x, and use it to attack a greater number of more vulnerable samples. Note also that our method differs from sample-wise certificates against ℓ2 adversarial attacks based on randomized smoothing, such as Cohen et al. (2019). Specifically, we use only one smoothing perturbation (and therefore only one forward pass) per sample. Our guarantees are on the overall accuracy of the classifier, not on the stability of any particular prediction. Finally, as discussed, our threat model is different, because we allow the adversary to strategically choose which samples to attack, with the certificate depending on the Wasserstein magnitude of the distributional attack. Results on CIFAR-10 and SVHN are presented in Figure 7a. For CIFAR-10, we use ResNet-110 models trained under noise from Cohen et al. (2019).
For SVHN, we train our own models using the same training schedule as used for CIFAR-10 by Cohen et al. (2019), but with ResNet-20 in place of ResNet-110. The solid lines represent certified accuracies for different smoothing noises and the black dashed line represents the empirical performance of an undefended model under attack. For the undefended baseline (on an undefended classifier g), we first apply a Carlini and Wagner ℓ2 attack to each sample x (Carlini & Wagner, 2017), generating adversarial examples x′. Define this attack as the function CW(·), such that x′ = CW(x, y; g), where y is the ground-truth label (if the attack fails, CW(x, y; g) = x). We then define a strategic adversary Adv_γ that returns CW(x, y; g) if ∥CW(x, y; g) − x∥₂ < γ, and otherwise returns x. By not attacking samples that would require the largest ℓ2 perturbations to cause misclassification, this attack efficiently balances maximizing the misclassification rate against minimizing the Wasserstein distance between D and D̃. The threshold parameter γ controls the tradeoff between the misclassification rate and the Wasserstein perturbation magnitude. Note that our attacker here takes more advantage of the distributional threat model than simply finding the minimal perturbation for each sample: by choosing not to attack robust samples at all, it can successfully attack a larger number of more vulnerable samples. The 'Undefended' baseline in Figure 7a plots the accuracy on attacked test samples under the adversary Adv_γ, for a sweep of values of γ, against an upper bound on the Wasserstein distance, given by E_{x∼D}[∥Adv_γ(x) − x∥₂]. (To estimate E_{x∼D}[∥Adv_γ(x) − x∥₂], we compute the average perturbation size over the test set and use Hoeffding's inequality to upper-bound the population expectation with 99% confidence.)
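The strategic adversary Adv_γ described above can be sketched in a few lines; here `cw_attack` is a stand-in for the Carlini-Wagner ℓ2 attack (assumed to return x itself when it fails):

```python
import numpy as np

def strategic_adv(x, y, cw_attack, gamma):
    """Adv_gamma: spend the average perturbation budget only where
    misclassification is cheap.

    Returns the adversarial example when its l2 perturbation is below the
    threshold gamma; otherwise returns x unchanged, saving budget for more
    vulnerable samples."""
    x_adv = cw_attack(x, y)
    if np.linalg.norm(x_adv - x) < gamma:
        return x_adv
    return x
```

Sweeping γ traces out the tradeoff between the misclassification rate and the average (Wasserstein) perturbation magnitude.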
We observe a large gap between this undefended model performance under attack and the certified robustness of our model, showing that our certificate is highly non-vacuous. In Appendix E, we include results showing the empirical robustness of the smoothed classifiers under an "adaptive" attack, based on the attack on sample-wise ℓ2 smoothing proposed by Salman et al. (2019). We also test an alternate form of strategic attacker on the baseline model that does not require us to estimate the average perturbation size empirically (Appendix F).

E EMPIRICAL ATTACKS ON ℓ2-DISTRIBUTIONAL ROBUSTNESS

In this section, we describe an empirical attack on ℓ2-distributional smoothing. Our attack is based on the attack from Salman et al. (2019), and we use the code for the PGD attack against smoothed classifiers from that work as a base, but there are a few considerations we must make. First, while the goal of the attacker in Salman et al. (2019) is to change the output of a classifier that uses the expected logits, the goal in our case is instead to reduce the average classification accuracy over noise instances. Concretely, Salman et al. (2019) uses an attacker loss function for each sample (x, y) of the form
$$\max_\epsilon \; L_{\text{Cross Ent.}}\Big( \mathbb{E}_{\delta \sim N(0, \sigma^2 I)}\big[\hat f_\theta(x + \epsilon + \delta)\big],\; y \Big), \qquad (9)$$
where we use f̂ to represent the SoftMax-ed logit function. However, because in our case the classifier under attack is not E_{δ∼N(0,σ²I)}[f̂_θ(x + ϵ + δ)] but rather f̂_θ(x + ϵ + δ) itself, we instead considered the loss function
$$\max_\epsilon \; \mathbb{E}_{\delta \sim N(0, \sigma^2 I)}\Big[ L_{\text{Cross Ent.}}\big( \hat f_\theta(x + \epsilon + \delta),\; y \big) \Big]. \qquad (10)$$
Empirically, we find that the choice of loss function makes very little difference: see Figures 8 and 9. We must also consider how to correctly make the attacker "strategic": that is, how to allocate attack magnitude so as to attack most effectively while minimizing the Wasserstein distance.
This is more difficult than in the undefended case because it is no longer true that, for each sample x, we can identify a magnitude ∥CW(x, y; g) − x∥₂ such that an attack of this magnitude is guaranteed to succeed while a smaller attack fails and hence is not attempted. Rather, for a given attack magnitude, there is instead a probability of success over the distribution of δ. To deal with this, we perform PGD at a range of attack magnitudes, specifically E = {i/8 : i ∈ {1, …, 16}}. Let PGD_e(x, y; g) be the result of the attack at magnitude e ∈ E. We then define the adaptive attacker as
$$\mathrm{Adv}_\gamma(x) := \mathrm{PGD}_{e^*}(x, y; g), \qquad (11)$$
where
$$e^* := \max\Big\{ e \in E \;:\; \frac{\mathbb{E}_\delta\big[L_{0/1}\big(\hat f_\theta(\mathrm{PGD}_e(x, y; g) + \delta),\, y\big)\big] - \mathbb{E}_\delta\big[L_{0/1}\big(\hat f_\theta(x + \delta),\, y\big)\big]}{e} > \gamma \Big\}. \qquad (12)$$
In other words, we use the largest attack such that the increase in misclassification rate per unit of attack magnitude is above the threshold γ. If no e ∈ E satisfies this condition, we elect not to attack and set Adv_γ(x) := x. As described in the main text for the baseline case, we sweep over a range of threshold values γ when reporting results. When evaluating the expectations in Equation 12, we use a sample of 100 noise instances. However, once e* is identified, we use a different sample of 100 noise instances per sample x when reporting the final accuracy, in order to de-correlate the generation of the attack Adv_γ(x) from its evaluation. (The noise instances are kept constant over the sweep of γ.) When reporting the upper bounds on the Wasserstein distances, we use e* as an upper bound on ∥PGD_{e*}(x, y; g) − x∥₂, rather than using ∥PGD_{e*}(x, y; g) − x∥₂ directly. We also upper-bound the population expectation of e* (and therefore of ∥PGD_{e*}(x, y; g) − x∥₂) for each γ with 99% confidence, using the empirical expectation on the test set and a Hoeffding bound together with the fact that 0 ≤ e* ≤ min(2, 1/γ).
Attack hyperparameters are taken from Salman et al. (2019): we use 20 attack steps, a step size of e/10, and 128 noise instances when computing gradients. We evaluate using 10% of each dataset.

F EXPERIMENT DETAILS FOR SECTION 6

As mentioned, for the certified models, we use the released pre-trained ResNet-110 models from Cohen et al. (2019) for CIFAR-10 and train ResNet-20 models in a similar manner for SVHN, using the same level of Gaussian noise for training and testing. For the empirical results, we use the implementation of the ℓ2 Carlini and Wagner attack (Carlini & Wagner, 2017) provided by the IBM ART package (Nicolae et al., 2018) with default parameters (except for the batch size, which we set to 256 to increase processing speed). We also tested an alternative attack, which is still strategic but does not require that we measure the Wasserstein distance empirically. In this attack, we define Adv′_γ, which always returns CW(x, y; g) if ∥CW(x, y; g) − x∥₂ ≤ γ; if ∥CW(x, y; g) − x∥₂ > γ, it instead returns x with probability 1 − γ/∥CW(x, y; g) − x∥₂ (and CW(x, y; g) otherwise). Note that in this case the expected perturbation size ∥Adv′_γ(x, y; g) − x∥₂ is guaranteed to be at most γ for every x, so γ can be used as an upper bound on the Wasserstein distance. Results are shown in Figure 10.
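The attacker Adv′_γ described above can be sketched as follows (again with `cw_attack` a stand-in for the CW attack); the key property is that the expected perturbation size is at most γ for every x, so γ itself upper-bounds the Wasserstein distance:

```python
import numpy as np

def randomized_strategic_adv(x, y, cw_attack, gamma, rng):
    """Adv'_gamma: always attack when the perturbation is within gamma;
    otherwise attack only with probability gamma / ||perturbation||_2.

    In the second case E[perturbation size] = (gamma / dist) * dist = gamma,
    so E[||Adv'(x) - x||_2] <= gamma holds for every individual x."""
    x_adv = cw_attack(x, y)
    dist = np.linalg.norm(x_adv - x)
    if dist <= gamma or rng.random() < gamma / dist:
        return x_adv
    return x
```

This removes the need for the Hoeffding estimate used by the first baseline, at the cost of a randomized (rather than deterministic) attack decision.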

G FUNCTION ψ FOR DIFFERENT DISTRIBUTIONS

For an isometric Gaussian distribution N(0, σ²I),
$$\mathrm{TV}\big(N(0, \sigma^2 I),\; N(\theta, \sigma^2 I)\big) = \mathrm{erf}\big(\|\theta\|_2 / (2\sqrt{2}\,\sigma)\big).$$
Proof. Due to the isometric symmetry of the Gaussian distribution and the ℓ2-norm, we may assume, without loss of generality, that N(θ, σ²I) is obtained by shifting N(0, σ²I) only along the first dimension. The total variation of the two distributions is then equal to the probability of a normal random variable with variance σ² lying between −∥θ∥₂/2 and ∥θ∥₂/2, i.e., Φ(∥θ∥₂/2σ) − Φ(−∥θ∥₂/2σ), where Φ is the standard normal CDF, which equals erf(∥θ∥₂/(2√2σ)).

The following substitution step concludes the proof of Lemma 4 in Appendix H. Substituting s′ᵢⱼ, s′_mean and MAX′ in the expression for s″ᵢⱼ, we get
$$s''_{ij} = \frac{s_{ij} + (2^{\theta_1^S} - 1)s_{\text{mean}} + (2^{\theta_2^S} - 1)2^{\theta_1^S} s_{\text{mean}}}{\mathrm{MAX}' \cdot \mathrm{MAX}} = \frac{s_{ij} + (2^{\theta_1^S + \theta_2^S} - 1)s_{\text{mean}}}{\max\big(s_{\max} + (2^{\theta_1^S + \theta_2^S} - 1)s_{\text{mean}},\; v_{\max} + (2^{\theta_1^V + \theta_2^V} - 1)v_{\text{mean}}\big)}.$$
Similarly,
$$v''_{ij} = \frac{v_{ij} + (2^{\theta_1^V + \theta_2^V} - 1)v_{\text{mean}}}{\max\big(s_{\max} + (2^{\theta_1^S + \theta_2^S} - 1)s_{\text{mean}},\; v_{\max} + (2^{\theta_1^V + \theta_2^V} - 1)v_{\text{mean}}\big)}.$$
Hence, x″ = SV(x, θ₁ + θ₂).
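The Gaussian total-variation formula above is easy to sanity-check numerically: by symmetry the problem is one-dimensional, and the densities of N(0, σ²) and N(m, σ²) cross at m/2, so TV = P(|X| < m/2) for X ∼ N(0, σ²). A Monte-Carlo check we add for illustration:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
sigma, theta_norm = 1.0, 0.8
# Claimed closed form: erf(||theta||_2 / (2 * sqrt(2) * sigma)).
closed_form = erf(theta_norm / (2 * sqrt(2) * sigma))

# In 1-d, TV(N(0, s^2), N(m, s^2)) = P(-m/2 < X < m/2) for X ~ N(0, s^2),
# since the two densities cross at m/2.
x = rng.normal(scale=sigma, size=1_000_000)
monte_carlo = np.mean(np.abs(x) < theta_norm / 2)
assert abs(monte_carlo - closed_form) < 1e-2
```

The identity 2Φ(z) − 1 = erf(z/√2), applied at z = ∥θ∥₂/2σ, connects the CDF form in the proof to the erf form of the statement.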

I DETAILS FOR PLOTS IN FIGURE 2

The distribution shifts used to evaluate the empirical performance of the base models in Figure 2 are generated by first sampling an image x from the original distribution D and then randomly transforming it by adding noise in the corresponding transformation space. The Wasserstein bound of these shifts can be calculated by computing the expected perturbation size of the smoothing distribution. For example, the expected ℓ2-norm of a 3-dimensional Gaussian vector with covariance σ²I is 2√2σ/√π, and the expected ℓ1-norm of a 2-dimensional vector sampled uniformly from [0, b]² is b. The training and smoothing noise levels used for color shift, hue shift and SV shift are (0.8, 10.0), (180°, 180°) and (8.0, 12.0) respectively.
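The two expectations quoted above are straightforward to verify by Monte Carlo (our own sanity check, not from the paper's code):

```python
import numpy as np

def expected_l2_norm_gauss3(sigma):
    """Closed form E||N(0, sigma^2 I_3)||_2 = 2*sqrt(2)*sigma/sqrt(pi)
    (mean of a chi distribution with 3 degrees of freedom, scaled by sigma)."""
    return 2 * np.sqrt(2) * sigma / np.sqrt(np.pi)

rng = np.random.default_rng(0)
sigma, b = 0.8, 8.0

# Expected l2-norm of a 3-d Gaussian vector.
g = rng.normal(scale=sigma, size=(1_000_000, 3))
assert abs(np.linalg.norm(g, axis=1).mean() - expected_l2_norm_gauss3(sigma)) < 5e-3

# Expected l1-norm of a uniform sample from [0, b]^2 is b
# (each coordinate contributes b/2 on average).
u = rng.uniform(0, b, size=(1_000_000, 2))
assert abs(u.sum(axis=1).mean() - b) < 5e-2
```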

J HUE SHIFT

Any RGB image can alternatively be represented in the HSV format by mapping the (r, g, b) tuple for each pixel to a point (h, s, v) in a cylindrical coordinate system, where the values h, s and v represent the hue, saturation and brightness (value) of the pixel. The mapping from RGB to HSV coordinates takes the [0, 1]³ color cube and transforms it into a cylinder of unit radius and height. The hue values are represented as angles in [0, 2π), and the saturation and brightness values lie in [0, 1]. Define a hue shift of an H × W sized image x by an angle θ ∈ [−π, π] in the HSV space as
$$\mathrm{HS}(x, \theta) = \big( w(h_{ij} + \theta),\; s_{ij},\; v_{ij} \big)_{H \times W}, \qquad \text{where} \quad w(a) = a - 2\pi \lfloor a / 2\pi \rfloor,$$
which rotates each hue value by an angle θ and wraps it around to the [0, 2π) range. This transformation satisfies additive composability in condition (5), as shown in Lemma 5 below. The Wasserstein distance is defined using the corresponding distance function d_HS, given by the absolute value |θ| of the hue shift. The certified accuracy under hue shifts does not depend on the Wasserstein distance of the shifted distribution; Figure 11 reports the certified accuracies obtained by base models trained under different noise levels.

Lemma 5. The transformation HS satisfies the additive composability property, i.e., ∀x ∈ M and θ₁, θ₂ ∈ [−π, π], HS(HS(x, θ₁), θ₂) = HS(x, θ₁ + θ₂).

Proof. Let h be the hue value of the (i, j)-th pixel of the image x. Since the transformation only affects the hue values, we ignore the other coordinates. The hue value after applying the two transformations in sequence is w(w(h + θ₁) + θ₂). Since w(a) differs from a by an integer multiple of 2π, we have w(w(a) + b) = w(a + b) for all a, b, and hence w(w(h + θ₁) + θ₂) = w(h + θ₁ + θ₂), which is the hue value of the (i, j)-th pixel of HS(x, θ₁ + θ₂).
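The wrap function w and the composability property it induces can be checked numerically (our own illustration):

```python
import numpy as np

def wrap(a):
    """w(a) = a - 2*pi*floor(a / (2*pi)): wrap an angle into [0, 2*pi)."""
    return a - 2 * np.pi * np.floor(a / (2 * np.pi))

# Additive composability of the hue shift: shifting hues by theta1 and then
# theta2 equals a single shift by theta1 + theta2, because
# wrap(wrap(a) + b) == wrap(a + b) (wrap(a) differs from a by a multiple of 2*pi).
rng = np.random.default_rng(0)
h = rng.uniform(0, 2 * np.pi, size=1000)
t1, t2 = -2.5, 1.7
assert np.allclose(wrap(wrap(h + t1) + t2), wrap(h + t1 + t2))
```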

K RANDOM CHANNEL SELECTION

Consider a smoothing distribution that randomly picks one of the RGB channels with equal probability, scales it so that the maximum pixel value in that channel is one, and sets all the other channels to zero. This smoothing distribution is invariant to the color shift transformation CS and thus satisfies ψ(d_T(x₁, x₂)) = 0 whenever d_T(x₁, x₂) is finite. Therefore, from Theorem 1, we have E_{z∼D̃}[h̄(z)] ≥ E_{x∼D}[h̄(x)] under this smoothing distribution for all Wasserstein bounds ϵ with respect to d_CS. Figure 12 plots the certified accuracies, using random channel selection for smoothing, achieved by models trained using Gaussian distributions of varying noise levels in the transformation space. The certified accuracy roughly increases with the training noise, achieving a maximum of 87.1% for a training noise of 0.8.
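A sketch of the random-channel-selection smoothing (our own illustration; it assumes, as one plausible reading of CS, that a color shift acts by positively rescaling individual channels, so the per-channel normalization cancels the shift):

```python
import numpy as np

def channel_select_smooth(x, rng):
    """Pick one of the 3 channels uniformly at random, rescale it so its
    maximum pixel value is 1, and zero the other channels.

    x has shape (3, H, W); the chosen channel is assumed not identically zero."""
    c = rng.integers(3)
    out = np.zeros_like(x, dtype=float)
    out[c] = x[c] / x[c].max()
    return out

# Invariance check: rescaling each channel by a positive factor (a color
# shift under the stated assumption) leaves the smoothed output unchanged,
# since (k * ch) / (k * ch).max() == ch / ch.max() for k > 0.
x = np.random.default_rng(0).uniform(0.1, 1.0, size=(3, 4, 4))
shifted = x * np.array([2.0, 0.5, 1.3]).reshape(3, 1, 1)
for seed in range(6):
    a = channel_select_smooth(x, np.random.default_rng(seed))
    b = channel_select_smooth(shifted, np.random.default_rng(seed))
    assert np.allclose(a, b)
```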

L EXPERIMENTAL DETAILS FOR SECTION 7

Our experimental setting is adapted from the "sample-wise perturbation" CIFAR-10 experiments in Huang et al. (2021): hyperparameters are the same as in that work unless otherwise stated. After each optimization step, we project the ϵᵢ's onto an ℓ2 ball (of radius given by the Wasserstein bound ϵ) rather than an ℓ∞ ball. We also use an ℓ2 PGD step:
$$\epsilon_i' = \epsilon_i + \tau \, \frac{\nabla_{\epsilon_i} L(\cdot)}{\|\nabla_{\epsilon_i} L(\cdot)\|_2}. \qquad (14)$$
The step size τ was set to 0.1 times the total ℓ2 bound ϵ.
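One update of this procedure can be sketched as follows: the normalized ascent step of Equation 14, followed by projection onto the ℓ2 ball (here `grad` stands for ∇_{ϵᵢ}L(·), and the small constant guards against a zero gradient):

```python
import numpy as np

def l2_pgd_step(delta, grad, eps, tau):
    """Normalized l2 PGD ascent step (Equation 14), then projection onto the
    l2 ball of radius eps (the Wasserstein bound on the poisoning)."""
    delta = delta + tau * grad / (np.linalg.norm(grad) + 1e-12)
    norm = np.linalg.norm(delta)
    if norm > eps:
        delta = delta * (eps / norm)  # project back onto the eps-ball
    return delta
```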

L.2 ADAPTATION TO OFFLINE SETTING

As discussed in the text, we modify the algorithm so that the simultaneous training of the proxy model and the generation of perturbations does not introduce statistical dependencies between perturbed training samples. This is especially important because, if the victim later makes a train-validation split, such dependencies between training and validation samples would make it hard to generalize certificates to a test set. To avoid this, we construct four data splits:
• Test set (10000 samples): the original CIFAR-10 test set. Never perturbed; only used in the final model evaluation.
• Proxy training set (20000 samples): used for the optimization of the proxy classifier model parameters θ in Equation 13 and discarded afterward.
• Training set (20000 samples): perturbed using one round of the standard 20 steps of the inner optimization of Equation 13, while keeping θ fixed.
• Validation set (10000 samples): perturbed using the same method as the training set.
The victim model is trained on the training set and evaluated on the validation set and test set. We also tested on the clean (unperturbed) version of the validation set.

L.3 ADAPTIVE ATTACK SETTING

When testing our smoothing algorithm, we tested two types of attacks:
• Non-adaptive attack: the proxy model is trained and the perturbations are generated using undefended models without smoothing; only the victim applies smoothing noise during training and evaluation.
• Adaptive attack: in the minimization of Equation 13, the loss term L(f_θ(xᵢ + ϵᵢ), yᵢ) is replaced by the expectation E_{δ∼N(0,σ²I)}[L(f_θ(xᵢ + ϵᵢ + δ), yᵢ)]. In other words, this models the expectation of a smoothed model, like the victim classifier. This smoothed optimization is used both in the proxy model training and in the generation of the training and validation sets.
Following Salman et al. (2019), which proposed a similar adaptive attack for sample-wise smoothed classifiers, we approximate the expectation using a small number of random perturbations, which are held fixed for the 20 steps of the inner optimization. In our experiments, we use 8 samples for the approximation. Because, at large smoothing noises, this makes the attack much less effective, we cut off training after 20 steps of the outer maximization, rather than relying on the accuracy reaching 99% (the maximum number of steps to converge that we observed for the non-adaptive attack was 15).

L.4 RESULTS

Complete experimental results are presented in Figure 13 . All results are means of 5 independent runs, and error bars represent standard errors of the means.

M CONCLUSION

We show that it is possible to certify the distributional robustness of a general deep neural network without increasing its computational requirements. We obtain robustness guarantees with respect to the Wasserstein distance of the distribution shift, which is a more suitable metric for out-of-distribution shifts than divergence measures such as the KL-divergence and total variation. We only consider predefined distance functions in this work, which may not be suitable for capturing more sophisticated distribution shifts such as perceptual changes. A future direction of research could be to adapt our certificates to learnable transformations for domain generalization and adaptation.






Figure 1: Certified accuracies obtained for different natural transformations of CIFAR-10 images such as color shifts, hue shifts and changes in brightness and saturation. The Wasserstein distance of each distribution shift from the original distribution is defined with respect to the corresponding distance function.

Figure 2: Comparison between the empirical performance (dashed lines) of two base models (trained on CIFAR-10 images with and without noise in transformation space) and the certified accuracy (solid line) of a robust model (noise-trained model smoothed using input randomization) under distribution shifts. The certified accuracy often outperforms the undefended model and remains reasonably close (almost overlaps for hue shift) to the model trained under noise for small shifts in the distribution.

Figure 3: Certified accuracy under color shifts for (a) CIFAR-10 and (b) SVHN. Each plot corresponds to a particular training noise and each curve corresponds to a particular smoothing noise.

… from a fixed distribution, such as a Gaussian distribution of a certain variance in the transformation space (see Appendix I). 2. By estimation: we compute a high-confidence bound on the average perturbation added to a finite number of samples to bound the Wasserstein distance. For example, in Section 6, when reporting the undefended baseline performance, we measure E[∥Adv(x) − x∥₂] on the test set and use Hoeffding's inequality to derive from this a 99%-confidence upper bound on the true population expectation E_{x∼D}[∥Adv(x) − x∥₂]. By Equation 8, this is a (high-probability) upper bound on the Wasserstein distance of the distribution shift.

Figure 4: Certified accuracy under brightness and saturation changes for (a) CIFAR-10 and (b) SVHN images. Each plot corresponds to a particular training noise and each curve corresponds to a particular smoothing noise.

Figure 6: Distributional certificates for unlearnable datasets on CIFAR-10. The smoothing noise used is 0.4. Results for other values are reported in the appendix.

Figure 7: Distributional certificates against adversarial attacks on (a) CIFAR-10 and (b) SVHN. The solid lines represent certified accuracy of the robust models and the dashed lines represent the adversarial accuracy of undefended models.

Figure 8: Adversarial attack on distributionally-smoothed classifiers, for CIFAR-10. For smoothed classifiers, we use the PGD attack described in this section; see Section 6 for details on the baseline. The dashed lines represent the empirical performance of the smoothed model for different noise levels. In plot (a), we use the loss function in Equation 9, while in (b) we use Equation 10.

Figure 11: Certified accuracy under hue shift for different levels of training noise. Since the certified accuracy remains constant with respect to the Wasserstein distance (ϵ) of the shifted distribution, we plot the certified accuracy of models trained with different noise levels β.

Figure 12: Certified robustness against color shift using random channel selection as the smoothing distribution. Since the certified accuracy remains constant with respect to the Wasserstein distance (ϵ) of the shifted distribution, we plot the certified accuracy of models trained with various levels of Gaussian noise in the transformation space.


ACKNOWLEDGEMENTS

This project was supported in part by Meta grant 23010098, NSF CAREER Award 1942230, HR001119S0026 (GARD), ONR YIP award N00014-22-1-2271, Army Grant No. W911NF2120076, a Capital One grant, NIST 60NANB20D134, and NSF award CCF2212458.


Figure 10: Certified robustness to ℓ2 Wasserstein distributional attacks. The undefended baseline is here attacked using the alternative attack formulation Adv′ described in Appendix F.

H ADDITIVE COMPOSABILITY OF NATURAL TRANSFORMATIONS

In this section, we prove that the natural transformations CS, HS and SV defined in the paper all satisfy the additive composability property in condition (5).

Lemma 3. The transformation CS satisfies the additive composability property, i.e., ∀x ∈ M and all valid θ₁, θ₂, CS(CS(x, θ₁), θ₂) = CS(x, θ₁ + θ₂).

Proof. Let x′ = CS(x, θ₁) and x″ = CS(x′, θ₂); we need to show that x″ = CS(x, θ₁ + θ₂). Let r_max, g_max and b_max be the maximum values of the red, green and blue channels respectively of x, and let r′_max, g′_max and b′_max be the same for x′. From the definition of CS in Section 5.1, expressing r′_max (and similarly g′_max, b′_max) in terms of the corresponding quantities of x, and then substituting r′ᵢⱼ and MAX′ into the expression for r″ᵢⱼ (and likewise for the green and blue channels), yields x″ = CS(x, θ₁ + θ₂).

Lemma 4. The transformation SV satisfies the additive composability property, i.e., ∀x ∈ M and all valid θ₁, θ₂, SV(SV(x, θ₁), θ₂) = SV(x, θ₁ + θ₂).

Proof. Let x′ = SV(x, θ₁) and x″ = SV(x′, θ₂); we need to show that x″ = SV(x, θ₁ + θ₂). Let s_mean, s_max, v_mean and v_max be the means and maximums of the saturation and brightness values of x, and let s′_mean, s′_max, v′_mean and v′_max be the same for x′. From the definition of SV in Section 5.2, expressing s′_mean and s′_max (and similarly v′_mean, v′_max) in terms of the corresponding quantities of x, and then substituting s′ᵢⱼ, s′_mean and MAX′ into the expression for s″ᵢⱼ (and likewise for the brightness values, as shown in Appendix G), yields x″ = SV(x, θ₁ + θ₂).

