PROVABLE ROBUSTNESS AGAINST WASSERSTEIN DISTRIBUTION SHIFTS VIA INPUT RANDOMIZATION

Abstract

Certified robustness in machine learning has primarily focused on adversarial perturbations with a fixed attack budget for each sample in the input distribution. In this work, we present provable robustness guarantees on the accuracy of a model under bounded Wasserstein shifts of the data distribution. We show that a simple procedure that randomizes the input of the model within a transformation space is provably robust to distributional shifts under that transformation. Our framework allows the datum-specific perturbation size to vary across different points in the input distribution and is general enough to include fixed-sized perturbations as well. Our certificates produce guaranteed lower bounds on the performance of the model for any shift (natural or adversarial) of the input distribution within a Wasserstein ball around the original distribution. We apply our technique to certify robustness against natural (non-adversarial) transformations of images such as color shifts, hue shifts, and changes in brightness and saturation. We obtain strong performance guarantees for the robust model under clearly visible shifts in the input images. Our experiments establish the non-vacuousness of our certificates by showing that the certified lower bound on a robust model's accuracy is higher than the empirical accuracy of an undefended model under a distribution shift. We also show provable distributional robustness against adversarial attacks. Moreover, our results also imply guaranteed lower bounds (hardness result) on the performance of models trained on so-called "unlearnable" datasets that have been poisoned to interfere with model training. We show that the performance of a robust model is guaranteed to remain above a certain threshold on the test distribution even when the base model is trained on the poisoned dataset.

1. INTRODUCTION

Machine learning models often suffer significant performance loss under minor shifts in the data distribution that do not affect a human's ability to perform the same task, e.g., input noise (Dodge & Karam, 2016; Geirhos et al., 2018), image scaling, shifting and translation (Azulay & Weiss, 2019), spatial (Engstrom et al., 2019) and geometric transformations (Fawzi & Frossard, 2015; Alcorn et al., 2019), blurring (Vasiljevic et al., 2016; Zhou et al., 2017), acoustic corruptions (Pearce & Hirsch, 2000) and adversarial perturbations (Szegedy et al., 2014; Carlini & Wagner, 2017; Goodfellow et al., 2015; Madry et al., 2018; Biggio et al., 2013). Overcoming such robustness challenges is a major hurdle for deploying these models in safety-critical applications where reliability is paramount. Several techniques have been developed to improve the empirical robustness of a model to data shifts, e.g., diversifying datasets (Taori et al., 2020), training with natural corruptions (Hendrycks & Dietterich, 2019), data augmentations (Yang et al., 2019), contrastive learning (Kim et al., 2020; Radford et al., 2021; Ge et al., 2021) and adversarial training (Goodfellow et al., 2015; Madry et al., 2018; Tramèr & Boneh, 2019; Shafahi et al., 2019; Maini et al., 2020). Empirical robustness techniques are designed to protect a model against a particular type of shift or adversary (e.g., by introducing similar shifts during training) and may not be effective against new ones. For instance, adversarial defenses have been shown to break down under newer attacks (Carlini & Wagner, 2017; Athalye et al., 2018; Uesato et al., 2018; Laidlaw & Feizi, 2019; Laidlaw et al., 2021). Certifiable robustness methods, on the other hand, seek to produce provable guarantees on the robustness of a model that hold for any perturbation within a certain neighborhood of the input instance, regardless of the strategy used to generate the perturbation.
A robustness certificate produces a verifiable lower bound on the size of the perturbation required to fool a model. Apart from being a guarantee on robust performance, these certificates may also serve as a metric for comparing the robustness of different models that is independent of the mechanism producing the input perturbations. However, the study of provable robustness has mostly focused on perturbations with a fixed size budget (e.g., an ℓ_p-ball of the same size) for all input points (Cohen et al., 2019; Lécuyer et al., 2019; Li et al., 2019; Salman et al., 2019; Gowal et al., 2018; Huang et al., 2019; Wong & Kolter, 2018; Raghunathan et al., 2018; Singla & Feizi, 2019; 2020; Levine & Feizi, 2021; 2020a;b). Among provable robustness methods, randomized smoothing based procedures have been able to successfully scale up to high-dimensional problems (Cohen et al., 2019; Lécuyer et al., 2019; Li et al., 2019; Salman et al., 2019) and have been adapted effectively to other domains such as reinforcement learning (Kumar et al., 2021; Wu et al., 2021) and models with structured outputs (Kumar & Goldstein, 2021), as in segmentation tasks and generative modeling. However, these techniques cannot be extended to certify under distribution shifts, as the perturbation size for each instance in the input distribution need not have a fixed bound. For example, stochastic changes in the input images of a vision model caused by lighting and weather conditions may vary across time and location. Even adversarial attacks may choose to adjust the perturbation size depending on the input instance. A standard way of describing a distribution shift is to constrain the Wasserstein distance between the original distribution D and the shifted distribution D̃ to be bounded by a certain amount ϵ, i.e., W_1^d(D, D̃) ≤ ϵ, for an appropriate distance function d. The Wasserstein distance W_1^d(D, D̃) is the minimum expectation of the distance function d over all possible joint distributions (couplings) with marginals D and D̃.
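To make the constraint W_1^d(D, D̃) ≤ ϵ concrete, the following minimal sketch (not from the paper) computes the Wasserstein-1 distance between two equal-weight 1-D empirical samples, where the optimal coupling simply matches sorted values; the sample values are invented for illustration:

```python
# Wasserstein-1 distance between two equal-size empirical samples in 1-D.
# For equal-weight samples, the optimal coupling matches sorted order,
# so W_1 reduces to the mean absolute difference of the sorted values.
def wasserstein1(u, v):
    assert len(u) == len(v), "sketch assumes equal-weight samples"
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)

# Per-point shifts that vary across the distribution (here 0, 0.1, 0, 0.4)
# still give a small W_1 distance: only the average shift matters.
orig    = [0.0, 0.2, 0.5, 0.9]
shifted = [0.0, 0.3, 0.5, 1.3]
print(round(wasserstein1(orig, shifted), 6))  # → 0.125
```

This illustrates why a Wasserstein ball is a natural relaxation of a fixed per-sample budget: the shift of 0.4 on the last point would violate an ℓ_p-ball of radius 0.125, but the distributional constraint only bounds the shifts on average.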
Wasserstein distance is a standard similarity measure for probability distributions and has been extensively used to study distribution shifts (Courty et al., 2017; Damodaran et al., 2018; Lee & Raginsky, 2018; Wu et al., 2019). Certifiable robustness against Wasserstein shifts is an interesting problem to study in its own right and a useful tool to have in the arsenal of provable robustness techniques in machine learning. In this work, we design robustness certificates for distribution shifts bounded by a Wasserstein distance of ϵ. We show that by simply randomizing the input in a transformation space, it is possible to bound the difference between the accuracy of the robust model under the original distribution D and the shifted distribution D̃ as a function of their Wasserstein distance ϵ under that transformation. Given a base model µ, we define a robust model µ̃ which replaces the input of µ with a randomized version sampled from a "smoothing" distribution around the original input. Let h̃ be a function denoting the performance of the robust model µ̃ on an input-output pair (x, y) (see Section 3 for a formal definition). Then, our main theoretical result in Theorem 1 shows that E_(x₁,y₁)∼D [h̃(x₁, y₁)] − E_(x₂,y₂)∼D̃ [h̃(x₂, y₂)] ≤ ψ(ϵ), where ψ is a concave function that bounds the total variation between the smoothing distributions at two input points as a function of the distance between them (condition (3) in Section 3). Such an upper bound always exists for any smoothing distribution, since the total variation distance stays between zero and one as the distance between the two points grows. We discuss how to find an appropriate ψ for different smoothing distributions in Appendix G. We apply our result to certify model performance for families of parameterized distribution shifts, which include shifts in the RGB color balance of an image, the hue/saturation balance, the brightness/contrast, and more.
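As one concrete instance of the bound, consider smoothing a scalar transformation parameter (e.g., a brightness offset) with Gaussian noise N(0, σ²). A standard calculation gives the total variation between N(θ₁, σ²) and N(θ₂, σ²) as erf(|θ₁ − θ₂| / (2√2 σ)), which is concave in the distance and can therefore serve as ψ. The sketch below (the function names are ours, not the paper's) turns this into a Theorem-1-style certified lower bound:

```python
import math

# For Gaussian smoothing N(theta, sigma^2) of a scalar transformation
# parameter, the total variation between the smoothing distributions at
# parameters a distance d apart is erf(d / (2*sqrt(2)*sigma)).
# This is concave in d, so it satisfies the requirement on psi.
def psi(eps, sigma):
    return math.erf(eps / (2.0 * math.sqrt(2.0) * sigma))

def certified_lower_bound(smoothed_acc_on_D, eps, sigma):
    # Accuracy on any shifted distribution within Wasserstein distance eps
    # is at least the smoothed accuracy on D minus psi(eps).
    return smoothed_acc_on_D - psi(eps, sigma)

# E.g., a smoothed accuracy of 92% on D with sigma = 0.5 certifies:
print(round(certified_lower_bound(0.92, eps=0.1, sigma=0.5), 4))  # → 0.8403
```

The familiar trade-off from randomized smoothing appears here as well: a larger σ makes ψ grow more slowly in ϵ (a stronger certificate), but typically lowers the smoothed accuracy on D itself.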
Our method does not make any assumptions about the model and applies to both natural and adversarial shifts of the distribution. It does not increase the computational requirements of the base model, as it only samples one randomized input per robust prediction, making it scalable to high-dimensional problems that require conventional deep neural network architectures. The sample complexity of generating the Wasserstein certificates over the entire distribution is roughly the same as that of obtaining adversarial certificates for a single input instance using existing randomized smoothing based techniques (Cohen et al., 2019; Salman et al., 2019).
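The one-sample-per-prediction procedure can be sketched as follows; the toy 1-D classifier and the additive "brightness" smoothing are hypothetical stand-ins for a real base model and transformation, and a Hoeffding bound converts the empirical smoothed accuracy over n points into a high-confidence lower bound:

```python
import math
import random

# One randomized input per prediction: certifying over a dataset of n
# points costs n forward passes of the base model.
def smoothed_accuracy(base_model, data, sigma, rng):
    correct = 0
    for x, y in data:
        theta = rng.gauss(0.0, sigma)               # sampled transformation
        correct += int(base_model(x + theta) == y)  # e.g., brightness shift
    return correct / len(data)

def hoeffding_lower_bound(acc_hat, n, alpha=0.001):
    # With probability >= 1 - alpha, the true smoothed accuracy on D is at
    # least acc_hat - sqrt(log(1/alpha) / (2n)).
    return acc_hat - math.sqrt(math.log(1.0 / alpha) / (2.0 * n))

# Hypothetical 1-D threshold classifier and toy labeled data.
toy_model = lambda x: int(x > 0.5)
data = [(0.2, 0), (0.9, 1)] * 500  # 1000 points
acc = smoothed_accuracy(toy_model, data, sigma=0.1, rng=random.Random(0))
print(hoeffding_lower_bound(acc, len(data)))
```

Subtracting ψ(ϵ) from the Hoeffding lower bound would then give the certified accuracy under any shift of Wasserstein distance at most ϵ, which is the sense in which the distributional certificate costs roughly one smoothing pass over the dataset.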

