PROVABLE ROBUSTNESS AGAINST WASSERSTEIN DISTRIBUTION SHIFTS VIA INPUT RANDOMIZATION

Abstract

Certified robustness in machine learning has primarily focused on adversarial perturbations with a fixed attack budget for each sample in the input distribution. In this work, we present provable robustness guarantees on the accuracy of a model under bounded Wasserstein shifts of the data distribution. We show that a simple procedure that randomizes the input of the model within a transformation space is provably robust to distributional shifts under that transformation. Our framework allows the datum-specific perturbation size to vary across different points in the input distribution and is general enough to include fixed-sized perturbations as well. Our certificates produce guaranteed lower bounds on the performance of the model for any shift (natural or adversarial) of the input distribution within a Wasserstein ball around the original distribution. We apply our technique to certify robustness against natural (non-adversarial) transformations of images such as color shifts, hue shifts, and changes in brightness and saturation. We obtain strong performance guarantees for the robust model under clearly visible shifts in the input images. Our experiments establish the non-vacuousness of our certificates by showing that the certified lower bound on a robust model's accuracy is higher than the empirical accuracy of an undefended model under a distribution shift. We also show provable distributional robustness against adversarial attacks. Moreover, our results imply guaranteed lower bounds (a hardness result) on the performance of models trained on so-called "unlearnable" datasets that have been poisoned to interfere with model training. We show that the performance of a robust model is guaranteed to remain above a certain threshold on the test distribution even when the base model is trained on the poisoned dataset.
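To make the input-randomization procedure described above concrete, the following is a minimal sketch of a smoothed classifier that averages a base model's votes over random transformations of the input. The function names (`smoothed_predict`), the Gaussian sampling of the transformation parameter, and the additive-brightness transform used in the example are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def smoothed_predict(model, x, transform, num_samples=100, sigma=0.1, rng=None):
    """Illustrative sketch of prediction via input randomization.

    Samples a transformation parameter from a Gaussian, applies the
    transformation to the input, queries the base model, and returns
    the majority-vote class over all randomized samples.
    (Hypothetical interface; the paper's procedure is defined later.)
    """
    rng = np.random.default_rng(0) if rng is None else rng
    votes = {}
    for _ in range(num_samples):
        theta = rng.normal(0.0, sigma)   # random parameter in the transformation space
        y = model(transform(x, theta))   # base model's prediction on the randomized input
        votes[y] = votes.get(y, 0) + 1
    return max(votes, key=votes.get)     # majority vote defines the smoothed classifier

# Toy usage: a "model" that thresholds mean brightness, smoothed over
# random additive brightness shifts of the image.
toy_model = lambda img: int(img.mean() > 0.5)
brightness = lambda img, t: img + t
x = np.full(4, 0.8)                      # a bright 4-pixel "image"
print(smoothed_predict(toy_model, x, brightness))
```

Because the smoothed classifier's output depends on the distribution of transformed inputs rather than a single point, its accuracy changes gracefully when the input distribution itself is shifted within the same transformation space, which is the property the certificates bound.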

1. INTRODUCTION

Machine learning models often suffer significant performance loss under minor shifts in the data distribution that do not affect a human's ability to perform the same task, e.g., input noise (Dodge & Karam, 2016; Geirhos et al., 2018), image scaling, shifting and translation (Azulay & Weiss, 2019), spatial (Engstrom et al., 2019) and geometric transformations (Fawzi & Frossard, 2015; Alcorn et al., 2019), blurring (Vasiljevic et al., 2016; Zhou et al., 2017), acoustic corruptions (Pearce & Hirsch, 2000) and adversarial perturbations (Szegedy et al., 2014; Carlini & Wagner, 2017; Goodfellow et al., 2015; Madry et al., 2018; Biggio et al., 2013). Overcoming such robustness challenges is a major hurdle for deploying these models in safety-critical applications where reliability is paramount. Several techniques have been developed to improve the empirical robustness of a model to data shifts, e.g., diversifying datasets (Taori et al., 2020), training with natural corruptions (Hendrycks & Dietterich, 2019), data augmentations (Yang et al., 2019), contrastive learning (Kim et al., 2020; Radford et al., 2021; Ge et al., 2021) and adversarial training (Goodfellow et al., 2015; Madry et al., 2018; Tramèr & Boneh, 2019; Shafahi et al., 2019; Maini et al., 2020). Empirical

