ROBUST TRAINING THROUGH ADVERSARIALLY SELECTED DATA SUBSETS

Abstract

Robustness to adversarial perturbations often comes at the cost of a drop in accuracy on unperturbed or clean instances. Most existing defense mechanisms attempt to defend the learner from attacks on all possible instances, which often degrades clean accuracy significantly. However, in practice, an attacker might select only a small subset of instances to attack; e.g., in facial recognition systems an adversary might aim to target specific faces. Moreover, the subset selection strategy of the attacker is seldom known to the defense mechanism a priori, making it challenging to attune the mechanism beforehand. This motivates designing defense mechanisms which can (i) defend against attacks on subsets instead of all instances to prevent degradation of clean accuracy and (ii) ensure good overall performance for attacks on any selected subset. In this work, we take a step towards solving this problem. We cast the training problem as a min-max game involving worst-case subset selection along with optimization of model parameters, which renders the problem NP-hard. To tackle this, we first show that, for a given learner's model, the objective can be expressed as the difference between a γ-weakly submodular function and a modular function. We use this property to propose ROGET, an iterative algorithm which admits approximation guarantees for a class of loss functions. Our experiments show that ROGET obtains better overall accuracy than several state-of-the-art defense methods under different adversarial subset selection techniques.

1. INTRODUCTION

Recent years have witnessed a dramatic improvement in the predictive power of machine learning models across several applications such as computer vision, natural language processing, and speech processing. This has led to their widespread usage in safety-critical systems like autonomous driving (Janai et al., 2020; Alvarez et al., 2010; Sallab et al., 2017), face recognition (Hu et al., 2015; Kemelmacher-Shlizerman et al., 2016; Wang & Deng, 2021), and voice recognition (Myers, 2000; Yuan et al., 2018), which in turn requires the underlying models to be security compliant. However, most existing machine learning models are significantly vulnerable to adversarial attacks (Szegedy et al., 2014; Carlini & Wagner, 2017; Goodfellow et al., 2015; Baluja & Fischer, 2018; Xiao et al., 2018; Kurakin et al., 2017; Xie & Yuille, 2019; Kannan et al., 2018; Croce & Hein, 2020; Yuan et al., 2019; Tramèr et al., 2018), where instances are contaminated with small and often indiscernible perturbations to delude the model at test time. This may result in catastrophic consequences when the underlying ML model is deployed in practice. Driven by this motivation, a flurry of recent works (Madry et al., 2017; Zhang et al., 2019b; 2021b; Athalye et al., 2018; Andriushchenko & Flammarion, 2020; Shafahi et al., 2019; Rice et al., 2020) have focused on designing adversarial training methods, whose goal is to maintain the accuracy of ML models in the presence of adversarial attacks. In principle, they are closely connected to robust machine learning methods that seek to minimize the worst-case loss of ML models under adversarial perturbations. In general, these approaches assume an equal likelihood of adversarial attack for every instance. However, in several applications, an adversary might selectively attack a specific subset of instances, which may be unknown to the learner.
For example, an adversary may be interested in perturbing only the images of specific persons to evade facial recognition systems (Xiao et al., 2021; Vakhshiteh et al., 2021; Zhang et al., 2021b; Sarkar et al., 2021; Venkatesh et al., 2021); in traffic sign classification, the adversary may wish to perturb only stop signs, which can have a more adverse impact during deployment. Therefore, existing adversarial training methods can be overly pessimistic in terms of their predictive power, since they consider adversarial perturbation of every instance. We discuss related work in more detail in Appendix B.

1.1 OUR CONTRIBUTIONS

Responding to the above limitations, we propose a novel robust learning framework which is able to defend against adversarial attacks targeted at any chosen subset of examples. Specifically, we make the following contributions.

Learning in the presence of perturbations on an adversarially selected subset. We consider an attack model where the adversary selectively perturbs a subset of instances, rather than drawing them uniformly at random. However, the exact choice of the subset or its properties remains unknown to the learner during training and validation. Consequently, a learner cannot adapt to such a specific attack well in advance through training or cross-validation. To defend against these attacks, we introduce a novel adversarial training method, where the learner aims at minimizing the worst-case loss across all data subsets. Our defense strategy is agnostic to any specific selectivity of the attacked subset. Its key goal is to maintain high accuracy under attacks on any selected subset, rather than providing optimal accuracy for one specific subset. To this end, we pose our adversarial training task as a min-max optimization problem, where the inner optimization problem seeks the data subset that maximizes the training loss, and the outer optimization problem then minimizes this loss with respect to the model parameters.
While training the model, the outer problem also penalizes the loss on the unperturbed instances. This allows us to optimize for the overall accuracy across both perturbed and unperturbed instances.

Theoretical characterization of our defense objective. Existing adversarial training methods (Madry et al., 2017; Zhang et al., 2019b; Robey et al., 2021) involve only continuous optimization variables: the model parameters and the amount of perturbation. In contrast, the inner maximization problem in our proposal searches over the worst-case data subset. This turns our optimization task into a parameter estimation problem in conjunction with a subset selection problem, which renders it NP-hard. We provide a useful characterization of the underlying training objective that helps us design an approximation algorithm to solve the problem. Given a fixed ML model, we show that the training objective can be expressed as the difference between a monotone γ-weakly submodular function and a modular function (Theorem 2). This allows us to leverage the distorted greedy algorithm (Harshaw et al., 2019) to optimize the underlying objective.

Approximation algorithms. We provide ROGET (RObust aGainst adversarial subsETs), a family of algorithms to solve our optimization problem, building upon the proposal of Adibi et al. (2021), that admits approximation guarantees. In each iteration, ROGET first applies a gradient descent (GD) or stochastic gradient descent step to update the estimate of the model parameters and then applies the distorted greedy algorithm to update the estimate of the attacked subset of instances. We show that ROGET admits approximation guarantees for convex and non-convex training objectives (Theorem 5), where in the latter case we require that the objective satisfies the Polyak-Łojasiewicz (PL) condition (Theorem 4).
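As a rough illustration of the subset-selection subroutine, the distorted greedy scheme of Harshaw et al. (2019) for maximizing the difference between a monotone γ-weakly submodular function g and a modular cost c can be sketched as follows; the function names and signatures are illustrative assumptions, not the paper's implementation:

```python
def distorted_greedy(elements, g_marginal, cost, k, gamma=1.0):
    """Sketch of distorted greedy (after Harshaw et al., 2019) for
    maximizing g(S) - c(S), with g monotone gamma-weakly submodular
    and c modular (a sum of per-element costs).

    g_marginal(e, S): marginal gain of adding e to S under g.
    cost(e): modular cost of element e.
    """
    S = set()
    for i in range(k):
        # Distortion factor grows toward 1 as iterations progress,
        # discounting early marginal gains against their costs.
        distortion = (1.0 - gamma / k) ** (k - i - 1)
        best, best_val = None, 0.0
        for e in elements - S:
            val = distortion * g_marginal(e, S) - cost(e)
            if val > best_val:
                best, best_val = e, val
        # Only add an element whose distorted net gain is positive.
        if best is not None:
            S.add(best)
    return S
```

With a modular g (marginal gains independent of S), the routine simply keeps elements whose discounted gain exceeds their cost.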
Our analysis applies to any min-max optimization setup where the inner optimization problem seeks to maximize the difference between a monotone γ-weakly submodular function and a modular function, and is therefore of independent interest. Finally, we provide a comprehensive experimental evaluation of ROGET, comparing it against seven state-of-the-art defense methods. Here, in addition to the hyperparameters set by the baselines in their papers, we also use a new hyperparameter selection method which is better suited to our setup. Unlike our proposal, the baselines are not trained to optimize for the worst-case accuracy. To reduce this gap between the baselines and our method, we tune the hyperparameters of the baselines to maximize the minimum accuracy across a large number of subsets chosen for attack. We observe that ROGET outperforms the state-of-the-art defense methods in terms of overall accuracy across different hyperparameter selection methods and different subset selection strategies.
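A minimal sketch of the alternating scheme described above, i.e., a gradient step on the model parameters followed by a distorted-greedy re-selection of the attacked subset. All names, signatures, and the scalar-parameter setup are illustrative assumptions, not the actual ROGET implementation:

```python
def roget_sketch(theta, data, grad_loss, g_marginal, cost, k,
                 steps=100, lr=0.1, gamma=1.0):
    """Hypothetical sketch of the alternating min-max scheme:
    an outer gradient-descent step on the parameters, then an inner
    distorted-greedy update of the worst-case attacked subset.

    grad_loss(theta, S, data): gradient of the (worst-case + clean)
        loss at theta, given the current attacked subset S.
    g_marginal(theta, e, S): marginal gain of attacking instance e.
    cost(e): modular penalty for including instance e.
    """
    S = set()
    for _ in range(steps):
        # Outer step: descend on the loss for the current subset.
        theta = theta - lr * grad_loss(theta, S, data)
        # Inner step: re-select the worst-case subset for new theta
        # via distorted greedy (inlined for self-containment).
        S = set()
        for i in range(k):
            distortion = (1.0 - gamma / k) ** (k - i - 1)
            best, best_val = None, 0.0
            for e in set(range(len(data))) - S:
                val = distortion * g_marginal(theta, e, S) - cost(e)
                if val > best_val:
                    best, best_val = e, val
            if best is not None:
                S.add(best)
    return theta, S
```

The sketch omits the SGD variant and the clean-loss penalty weighting; it is meant only to convey the alternation between the continuous outer update and the discrete inner update.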

2. PROBLEM FORMULATION

Instances, learner's model and the loss function. We consider a classification setup where x ∈ X = R^d are the features and y ∈ Y are the discrete labels. We denote by {(x_i, y_i)}_{i∈D} the training instances, where D denotes the training dataset. We use h_θ ∈ H to indicate the learner's model,

