ROBUST TRAINING THROUGH ADVERSARIALLY SELECTED DATA SUBSETS

Abstract

Robustness to adversarial perturbations often comes at the cost of a drop in accuracy on unperturbed or clean instances. Most existing defense mechanisms attempt to defend the learner from attack on all possible instances, which often degrades the accuracy on clean instances significantly. However, in practice, an attacker might only select a small subset of instances to attack, e.g., in facial recognition systems an adversary might aim to target specific faces. Moreover, the subset selection strategy of the attacker is seldom known to the defense mechanism a priori, making it challenging to attune the mechanism beforehand. This motivates designing defense mechanisms which can (i) defend against attacks on subsets instead of all instances to prevent degradation of clean accuracy, and (ii) ensure good overall performance for attacks on any selected subset. In this work, we take a step towards solving this problem. We cast the training problem as a min-max game involving worst-case subset selection along with optimization of model parameters, rendering the problem NP-hard. To tackle this, we first show that, for a given learner's model, the objective can be expressed as a difference between a γ-weakly submodular and a modular function. We use this property to propose ROGET, an iterative algorithm, which admits approximation guarantees for a class of loss functions. Our experiments show that ROGET obtains better overall accuracy compared to several state-of-the-art defense methods for different adversarial subset selection techniques.

1. INTRODUCTION

Recent years have witnessed a dramatic improvement in the predictive power of machine learning models across several applications such as computer vision, natural language processing, and speech processing. This has led to their widespread usage in several safety-critical systems like autonomous driving (Janai et al., 2020; Alvarez et al., 2010; Sallab et al., 2017), face recognition (Hu et al., 2015; Kemelmacher-Shlizerman et al., 2016; Wang & Deng, 2021), voice recognition (Myers, 2000; Yuan et al., 2018), etc., which in turn requires the underlying models to be security compliant. However, most existing machine learning models are significantly vulnerable to adversarial attacks (Szegedy et al., 2014; Carlini & Wagner, 2017; Goodfellow et al., 2015; Baluja & Fischer, 2018; Xiao et al., 2018; Kurakin et al., 2017; Xie & Yuille, 2019; Kannan et al., 2018; Croce & Hein, 2020; Yuan et al., 2019; Tramèr et al., 2018), where instances are contaminated with small and often indiscernible perturbations to delude the model at test time. This may result in catastrophic consequences when the underlying ML model is deployed in practice. Driven by this motivation, a flurry of recent works (Madry et al., 2017; Zhang et al., 2019b; 2021b; Athalye et al., 2018; Andriushchenko & Flammarion, 2020; Shafahi et al., 2019; Rice et al., 2020) have focused on designing adversarial training methods, whose goal is to maintain the accuracy of ML models in the presence of adversarial attacks. In principle, they are closely connected to robust machine learning methods that seek to minimize the worst-case performance of ML models under adversarial perturbations. In general, these approaches assume an equal likelihood of adversarial attack across instances. However, in several applications, an adversary might selectively attack a specific subset of instances, which may be unknown to the learner.
For example, an adversary may be interested only in perturbing images of specific persons to evade facial recognition systems (Xiao et al., 2021; Vakhshiteh et al., 2021; Zhang et al., 2021b; Sarkar et al., 2021; Venkatesh et al., 2021); in traffic sign classification, the adversary may wish to perturb only the stop signs, which could have a more severe impact during deployment. Therefore, existing adversarial training methods can be overly pessimistic in terms of their predictive power, since they consider adversarial perturbation of every instance. We discuss the related works in more detail in Appendix B.

1.1 OUR CONTRIBUTIONS

Responding to the above limitations, we propose a novel robust learning framework, which is able to defend against adversarial attacks targeted at any chosen subset of examples. Specifically, we make the following contributions.

Learning in the presence of perturbation on an adversarially selected subset. We consider an attack model where the adversary selectively perturbs a subset of instances, rather than drawing them uniformly at random. However, the exact choice of the subset or its properties remains unknown to the learner during training and validation. Consequently, a learner cannot adapt to such a specific attack in advance through training or cross-validation. To defend against these attacks, we introduce a novel adversarial training method, where the learner aims at minimizing the worst-case loss across all data subsets. Our defense strategy is agnostic to any specific selectivity of the attacked subset. Its key goal is to maintain high accuracy during attacks on any selected subset, rather than providing optimal accuracy for one specific subset. To this end, we cast our adversarial training task as a min-max optimization problem, where the inner optimization problem seeks the data subset that maximizes the training loss, and the outer optimization problem then minimizes this loss with respect to the model parameters.
While training the model, the outer problem also penalizes the loss on the unperturbed instances. This allows us to optimize the overall accuracy across both perturbed and unperturbed instances.

Theoretical characterization of our defense objective. Existing adversarial training methods (Madry et al., 2017; Zhang et al., 2019b; Robey et al., 2021) involve only continuous optimization variables-the model parameters and the amount of perturbation. In contrast, the inner maximization problem in our proposal searches over the worst-case data subset. This turns our optimization task into a parameter estimation problem coupled with a subset selection problem, which renders it NP-hard. We provide a useful characterization of the underlying training objective that helps us design an approximation algorithm to solve the problem. Given a fixed ML model, we show that the training objective can be expressed as the difference between a monotone γ-weakly submodular function and a modular function (Theorem 2). This allows us to leverage the distorted greedy algorithm (Harshaw et al., 2019) to optimize the underlying objective.

Approximation algorithms. We provide ROGET (RObust aGainst adversarial subsETs), a family of algorithms to solve our optimization problem, built upon the proposal of Adibi et al. (2021), that admits approximation guarantees. In each iteration, ROGET first applies a gradient descent (GD) or stochastic gradient descent step to update the estimate of the model parameters and then applies the distorted greedy algorithm to update the estimate of the attacked subset of instances. We show that ROGET admits approximation guarantees for convex training objectives (Theorem 5) as well as non-convex ones, where in the latter case we require that the objective satisfies the Polyak-Lojasiewicz (PL) condition (Theorem 4).
Our analysis applies to any min-max optimization setup where the inner optimization problem seeks to maximize the difference between a monotone γ-weakly submodular function and a modular function and, therefore, is of independent interest. Finally, we provide a comprehensive experimental evaluation of ROGET by comparing it against seven state-of-the-art defense methods. Here, in addition to the hyperparameters set by the baselines in their papers, we also use a new hyperparameter selection method which is better suited to our setup. Unlike our proposal, the baselines are not trained to optimize for worst-case accuracy. To reduce this gap between the baselines and our method, we tune the hyperparameters of the baselines to maximize the minimum accuracy across a large number of subsets chosen for attack. We observe that ROGET outperforms the state-of-the-art defense methods in terms of overall accuracy across different hyperparameter selection and subset selection strategies.

2. PROBLEM FORMULATION

Instances, learner's model and the loss function. We consider a classification setup where x ∈ X = R^d are the features and y ∈ Y are the discrete labels. We denote {(x_i, y_i)}_{i∈D} to be the training instances, where D denotes the training dataset. We use h_θ ∈ H to indicate the learner's model, where H is the hypothesis class and θ ∈ Θ is the parameter vector of the model. We use the cross-entropy loss ℓ(h_θ(x), y) in this paper.

Adversarial perturbation on a selected subset. We assume that the adversary's goal is to selectively attack a specific subset of instances S_latent, instead of attacking every instance in the data or drawing instances uniformly at random. The adversary then uses an adversarial perturbation method to generate x_i^adv from x_i for all i ∈ S_latent such that x_i^adv is close to x_i, but the model misclassifies x_i^adv. It is important to note that neither the strategy behind selecting S_latent nor the adversarial perturbation method is known to the learner. Hence, during training, we use a_φ : X → X as the learner's belief about the adversarial network or the perturbation method, with parameter vector φ ∈ Φ, similar to (Baluja & Fischer, 2018; Xiao et al., 2018; Mopuri et al., 2018), where Φ is the domain of φ. Many popular attacks like FGSM (Goodfellow et al., 2015) and PGD (Madry et al., 2017) are un-parameterized and perturb each point independently of the others. Still, we assume a parameterized adversary model (Baluja & Fischer, 2018; Xiao et al., 2018; Mopuri et al., 2018) to make the formulation more general, as one can always overparameterize such a model to induce enough capacity and mimic pointwise attacks.

Proposed adversarial training problem. Let us assume, for instance, that the subset selection strategy of the adversary is revealed to the learner.
Following this strategy, the learner could easily compute the underlying subset S ⊂ D to mimic the adversary and minimize the sum of the loss on the perturbed instances i ∈ S and the unperturbed instances j ∈ D\S. However, in practice, the learner may not have any knowledge about the underlying subset selection strategy. In such a case, the goal of a defense algorithm should be to ensure high overall accuracy in the face of attacks on all possible subsets. To this end, we design an adversarial training framework which attempts to minimize the worst-case loss across all subsets, as described below. Given a set of training instances {(x_i, y_i)}_{i∈D}, we defend against attacks on a selected subset of instances by training a new model h_θ which minimizes the highest possible loss on the perturbed instances a_φ(x_i) with i ∈ S across subsets S ⊂ D of size at most b, while ensuring that the new predictions h_θ(x_j) and the labels y_j remain close on the unperturbed instances j ∈ D\S. Given the learner's belief about the adversarial network a_φ, we define the learner's loss function as follows:

F(h_θ, S | a_φ) = (1/|D|) [ Σ_{i∈S} ℓ(h_θ(a_φ(x_i)), y_i) + Σ_{j∈D\S} ρ ℓ(h_θ(x_j), y_j) ].   (1)

The parameter ρ is a regularization parameter which gives the learner additional flexibility to trade off the loss on perturbed and unperturbed instances. The learner then solves the min-max problem

min_θ max_{S⊂D, |S|≤b} F(h_θ, S | a_{φ*(θ,S)}),   (2)

where

φ*(θ, S) = argmax_{φ∈Φ} Σ_{i∈S} [ℓ(h_θ(a_φ(x_i)), y_i) − µ C(a_φ(x_i), x_i)].   (3)

The optimization problem (2) is a min-max game where the inner optimization problem aims to find the subset S of size at most b which incurs the highest loss, and the outer minimization problem aims to find the model h_θ that minimizes this loss. The optimization problem (3) encodes the learner's belief about the adversary's strategy, which need not hold in practice. In Section 4, we perform experiments where the true adversary's model differs from a_φ during test. Eq. (3) provides the learner's estimate of the parameters of the adversarial network.
Here, C(a_φ(x_i), x_i) is the cost of perturbing x_i to a_φ(x_i), often measured using different notions of distance, e.g., normed differences or their squares. In such a case, the optimization problem (3) can also be seen as the dual of the constrained optimization problem (Madry et al., 2017; Robey et al., 2021) given by

max_φ Σ_{i∈S} ℓ(h_θ(a_φ(x_i)), y_i)  such that  C(a_φ(x_i), x_i) ≤ ξ(µ),

where ξ depends on µ. Note that the adversary can select the subset S in either a deterministic or a probabilistic manner. For example, it can attack images of specific persons in facial recognition systems, perturb the instances with stop signs in traffic sign classification systems, etc. On the other hand, it can select instances with probability proportional to the uncertainty of a classifier. Similar to the optimization problems (2)-(3), one can derive a continuous min-max optimization problem when S is drawn from a probability distribution. We show the connection between this continuous optimization problem and our discrete-continuous optimization problem in Appendix C.

Hardness analysis. The inner optimization problem in Eq. (2) involves a combinatorial search over S for a fixed θ. Moreover, while doing so, it requires computing φ*(θ, S). This makes our adversarial training problem NP-hard (see Appendix C for a proof).
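To make the objective concrete, here is a minimal, framework-agnostic sketch of the learner's loss F(h_θ, S | a_φ) defined above. The names `model`, `perturb`, and `loss` are our stand-ins for h_θ, the learner's belief a_φ, and the cross-entropy loss; this is an illustration, not the paper's implementation.

```python
def learner_objective(model, perturb, X, y, S, rho, loss):
    """Sketch of the learner's loss F(h, S | a): perturbed loss on the
    attacked subset S plus a rho-weighted clean loss on the rest."""
    total = 0.0
    for i in range(len(X)):
        if i in S:
            # attacked instance: evaluate on the perturbed input a(x_i)
            total += loss(model(perturb(X[i])), y[i])
        else:
            # clean instance, weighted by the regularization parameter rho
            total += rho * loss(model(X[i]), y[i])
    return total / len(X)
```

The inner maximization in (2) then searches for the subset S of size at most b on which this quantity is largest.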

3. PROPOSED APPROACH

In this section, we provide algorithms to solve the optimization problem (2). We first characterize it as a difference between a γ-weakly submodular function and a modular function. Next, we design ROGET, a family of algorithms to solve our adversarial training problem (2).

3.1. SET FUNCTION THEORETIC CHARACTERIZATION OF F

Here, we provide a characterization of the objective function F(h_θ, S | a_φ) using the notions of monotonicity and γ-weak submodularity, which will lead us to an approximation algorithm for our training problem (2). To do so, we first formally state the definitions of these notions.

Definition 1. Given a set function Q : 2^D → R, we define the marginal gain of Q as Q(k | S) := Q(S ∪ {k}) − Q(S). The function Q is monotone (non-decreasing) if Q(k | S) ≥ 0 for all k ∈ D\S. The function Q is called γ-weakly submodular if for some γ ∈ (0, 1], we have Σ_{k∈T\S} Q(k | S) ≥ γ[Q(T ∪ S) − Q(S)] whenever S ⊂ T ⊆ D. Here, γ is called the submodularity ratio of Q. The function Q is modular if Q(k | S) = Q(k | T) for all S ⊂ T ⊂ D and k ∈ D\T.

Alternative representation of F(h_θ, S | a_{φ*(θ,S)}). Given a regularization function R(θ) and a regularization parameter λ > 0, we define two set functions G_λ(θ, S) and m_λ(θ, S) as follows:

G_λ(θ, S) = (1/|D|) [ Σ_{i∈S} [λR(θ) + ℓ(h_θ(a_φ(x_i)), y_i)] + Σ_{j∈D} ρ ℓ(h_θ(x_j), y_j) ],   (4)

m_λ(θ, S) = (1/|D|) Σ_{i∈S} [λR(θ) + ρ ℓ(h_θ(x_i), y_i)].   (5)

Then, we represent F(h_θ, S | a_{φ*(θ,S)}) as the difference between the above functions, i.e.,

F(h_θ, S | a_{φ*(θ,S)}) = G_λ(θ, S) − m_λ(θ, S).   (6)

Here, m_λ is a modular function. Note that the above equality holds for any value of λ > 0. However, as we shall see next, the submodularity ratio of G_λ depends on λ, which affects the performance of the approximation algorithm designed in Section 3.2. Next, we present some assumptions which will be used to characterize the above representation of F.

Assumptions 1. (1) Lipschitz continuity: (a) The loss function ℓ(h(x), y) is L_h-Lipschitz with respect to h, (b) h_θ(x) is L_x-Lipschitz with respect to x, (c) the adversarial network a_φ(.) is L_φ-Lipschitz with respect to φ.
(2) Stability of φ*(θ, S): The learner's estimate of the parameters of the adversarial network is stable (Bousquet & Elisseeff, 2002; Charles & Papailiopoulos, 2018; Hardt et al., 2016), i.e., the solution φ*(θ, S) of the optimization (3) satisfies ||φ*(θ, S ∪ {k}) − φ*(θ, S)|| ≤ β/|S| for some β > 0 and all θ. Stability holds for a wide variety of loss functions, including convex losses (Bousquet & Elisseeff, 2002). (3) Metric property of C: The cost of perturbation C in (3) is a distance metric. Specifically, it follows the triangle inequality, i.e., C(x'', x) ≤ C(x'', x') + C(x', x). (4) Norm-boundedness of C: The cost of perturbation C is bounded by an ℓ2 norm, i.e., C(x', x) ≤ q||x' − x||. In this context, several prior works (Goodfellow et al., 2015; Madry et al., 2017) use the ℓ∞ distance, which is bounded by the ℓ2 norm. (5) Boundedness of Θ and Φ: We assume that the parameter spaces of both the learning model and the adversarial network are bounded, i.e., ||θ||_2 ≤ θ_max and ||φ||_2 ≤ φ_max.

Monotonicity and γ-weak submodularity of G_λ. We now present our results on the monotonicity and γ-weak submodularity of G_λ (see Appendix E.1 for a proof).

Theorem 2. Given Assumptions 1, let there be a minimum non-zero constant λ_min > 0 such that ℓ* = min_{x∈X, y∈Y} min_θ [λ_min R(θ) + ℓ(h_θ(x), y) − 2qµβL_φ] > 0, where q and β are given in Assumptions 1. Then, for λ > λ_min, we have the following statements: (1) G_λ(S) is monotone in S. (2) G_λ(S) is γ-weakly submodular with γ > γ* = ℓ* [ℓ* + 2 L_h L_x L_φ φ_max + 3qµβL_φ]^{-1}.
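For intuition, the submodularity ratio in Definition 1 can be computed by brute force for a toy set function. The sketch below (our illustration, not part of the paper) verifies that a small coverage function, which is submodular, attains γ = 1; it is only feasible for tiny ground sets.

```python
from itertools import combinations

def submodularity_ratio(Q, ground):
    """Brute-force the ratio in Definition 1: the sum of singleton
    marginal gains vs. the joint gain, over all pairs S strict-subset T."""
    def all_subsets(items):
        for r in range(len(items) + 1):
            yield from combinations(items, r)
    gamma = 1.0
    for S in map(set, all_subsets(ground)):
        for T in map(set, all_subsets(ground)):
            if not S < T:
                continue
            joint = Q(T | S) - Q(S)
            if joint <= 0:
                continue  # the bound is vacuous for this pair
            singles = sum(Q(S | {k}) - Q(S) for k in T - S)
            gamma = min(gamma, singles / joint)
    return gamma

# Coverage function Q(S) = |union of the sets indexed by S| (submodular).
cover = {0: {'a'}, 1: {'a', 'b'}, 2: {'c'}}
Q = lambda S: len(set().union(*[cover[i] for i in S])) if S else 0
```

A γ-weakly submodular function with γ < 1 would make this ratio drop below 1, which is exactly the slack that the distorted greedy analysis has to absorb.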

3.2. ROGET: PROPOSED ALGORITHM TO SOLVE THE ADVERSARIAL TRAINING PROBLEM (2)

In this section, we develop ROGET to solve our optimization problem (2) by building upon the proposal of Adibi et al. (2021). However, they design algorithms to solve min-max problems on functions that are submodular in S and convex in θ. In contrast, ROGET applies to the proposed objective F(h_θ, S | a_{φ*(θ,S)}) in (1), which is the difference between a γ-weakly submodular function and a modular function, and may also be nonconvex in θ. In the following, we describe the algorithm in detail, beginning with an outline.

Algorithm 1 ROGET Algorithm
Require: Training set D, regularization parameter λ, budget b, number of iterations T, learning rate η.
1: INIT(h_θ); S_0 ← ∅
2: for t = 0 to T − 1 do
3:   θ̂_{t+1} ← TRAIN(θ̂_t, S_t, η_t)
4:   S_{t+1} ← SDG(G_λ, m_λ, θ̂_{t+1}, b)
5: θ̂ ← θ̂_T
6: return θ̂, S_T

1: procedure SDG(G_λ, m_λ, θ, b)
2:   S ← ∅
3:   for s ∈ [b] do
4:     γ_s ← (1 − γ/b)^{b−s−1}
5:     Randomly draw a subset B from D
6:     M ← [γ_s G_λ(θ, e | S) − m_λ(θ, {e})]_{e∈B}
7:     e* ← argmax_{e∈B} M[e]
8:     if γ_s G_λ(θ, e* | S) − m_λ(θ, {e*}) ≥ 0 then
9:       S ← S ∪ {e*}
10: return S

1: procedure TRAIN(θ̂, S, η)
2:   // SGD for k steps
3:   for i ∈ [k] do
4:     Draw i ∼ D uniformly at random
5:     if i ∈ S then
6:       θ̂ ← θ̂ − η ∂ℓ(h_θ(a_φ(x_i)), y_i)/∂θ |_{θ=θ̂}
7:     else
8:       θ̂ ← θ̂ − η ρ ∂ℓ(h_θ(x_i), y_i)/∂θ |_{θ=θ̂}
9: return θ̂
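For illustration, the SDG procedure of Algorithm 1 can be sketched for generic set functions as follows. The signatures and the choice to sample B from the remaining elements are our simplifications, not the paper's implementation; `G` and `m` here map sets to floats.

```python
import random

def sdg(G, m, D, b, gamma, sample_size):
    """Stochastic distorted greedy (Harshaw et al., 2019) sketch for
    max_S G(S) - m(S) with |S| <= b, where G is monotone and
    gamma-weakly submodular and m is modular."""
    S = set()
    for s in range(b):
        distort = (1.0 - gamma / b) ** (b - s - 1)
        remaining = sorted(D - S)
        B = random.sample(remaining, min(sample_size, len(remaining)))

        def gain(e):
            # distorted marginal gain of candidate element e
            return distort * (G(S | {e}) - G(S)) - m({e})

        e_star = max(B, key=gain)
        if gain(e_star) >= 0:  # add only if the distorted gain is nonnegative
            S = S | {e_star}
    return S
```

Setting `sample_size = |D|` recovers the deterministic distorted greedy algorithm; smaller samples trade off the δ term in the guarantee against per-step cost.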

Development of ROGET.

Our key goal is to solve min_θ max_S F(h_θ, S | a_{φ*(θ,S)}). We aim to develop an algorithm which iterates over the inner and outer optimization problems and gradually refines S and θ.

Iterative method for the inner optimization on S: Given a fixed θ, the inner maximization problem becomes a set function optimization problem. If F were a monotone submodular function, then we could have applied the well-known greedy algorithm (Nemhauser et al., 1978). At each step, such a greedy algorithm would seek an element e that maximizes the marginal gain F(h_θ, S ∪ {e} | a_{φ*(θ,S∪{e})}) − F(h_θ, S | a_{φ*(θ,S)}) and update S → S ∪ {e}. However, in our context, the function F may be neither monotone nor submodular and thus, we cannot apply the greedy algorithm for the inner optimization loop. Nevertheless, we note that F can be expressed as the difference between the γ-weakly submodular function G_λ(θ, S) and the modular function m_λ(θ, S), as suggested by Eq. (6). As a result, we can adopt the stochastic distorted greedy algorithm (Harshaw et al., 2019) which, instead of maximizing the exact marginal gain, maximizes at step s a distorted marginal gain (1 − γ/b)^{b−s−1} G_λ(θ, e | S) − m_λ(θ, {e}). This yields the guarantee

E[G_λ(θ, S) − m_λ(θ, S)] ≥ [1 − exp(−γ)] G_λ(θ, OPT) − m_λ(θ, OPT),   (7)

where OPT is the optimal solution of the inner optimization problem and the expectation is taken over the random selection of B.

Iterative routine for the outer optimization on θ: Having updated S using the distorted greedy algorithm described above, we minimize F(h_θ, S | a_{φ*(θ,S)}) with respect to θ using a few steps of stochastic gradient descent (termed k-SGD) per round of update.

Outline of ROGET: We sketch the pseudocode of ROGET in Algorithm 1. It updates θ and S in an iterative manner.
During iteration t, ROGET updates θ̂_t → θ̂_{t+1} by running a few steps of gradient descent (line 3), where we fix S at S_t and attempt to reduce the loss F(h_θ, S_t | a_{φ*(θ,S_t)}) with respect to θ. In the next step, we fix θ̂_{t+1} and update S_t → S_{t+1} using the stochastic distorted greedy algorithm (SDG, line 4). Note that at each iteration t, we compute the new subset for a fixed θ sequentially in b steps. At each step s < b, we find the element e with the highest distorted marginal gain (1 − γ/b)^{b−s−1} G_λ(θ, e | S) − m_λ(θ, {e}) and, if this gain is nonnegative, include it in S to obtain S ∪ {e}.

Approximation guarantees. In general, F is a highly non-convex function of θ and, therefore, obtaining an approximation guarantee for arbitrary F is extremely challenging. Hence, we derive approximation guarantees for a restricted class of loss functions, namely Polyak-Lojasiewicz (PL) loss functions. A function f is Polyak-Lojasiewicz if ||∇_θ f(θ)||² ≥ σ[f(θ) − min_{θ'} f(θ')]. In Appendix E.2, we also present our results when F is convex in θ. Next, we state a few more assumptions in addition to Assumptions 1.

Assumptions 2. (1) L-smoothness of F: For all θ, θ' ∈ Θ and for all S ∈ 2^D, we have ||∇_θ F(h_θ, S | a_{φ*(θ,S)}) − ∇_θ F(h_{θ'}, S | a_{φ*(θ',S)})|| ≤ L||θ − θ'||. (2) Boundedness of gradients of F: We have ||∇_θ F(h_θ, S | a_{φ*(θ,S)})|| ≤ ∇_max for all θ ∈ Θ and for all S ∈ 2^D. (3) Boundedness of the loss and the model h_θ: For all θ ∈ Θ, |ℓ(h_θ(x_i), y_i)| < ℓ_max and ||h_θ|| < h_max. (4) Size of B in the SDG procedure: The size of the set B is |B| = (|D|/b) log(1/δ) (line 5 in procedure SDG used in Algorithm 1). (5) The adversarial network always perturbs a feature: The cost of perturbation satisfies C(a_{φ*(θ,S)}(x_i), x_i) > C_min > 0. Moreover, C_min > λθ²_max (e^{−γ*} + δ)/(1 − e^{−γ*} − δ), where γ* is the submodularity ratio (Theorem 2). We provide justification for all the assumptions in Appendix D.
Now we state the approximation guarantee of Algorithm 1 for Polyak-Lojasiewicz (PL) losses (see Appendix E.2 for the proof).

Theorem 4. Given the conditions of Theorem 2 and Assumptions 2, let F(h_θ, S | a_{φ*(θ,S)}) be a PL function in θ for all S, i.e., ||∇_θ F(h_θ, S | a_{φ*(θ,S)})||² ≥ σ[F(h_θ, S | a_{φ*(θ,S)}) − min_{θ'} F(h_{θ'}, S | a_{φ*(θ',S)})] for some constant σ. If we set the learning rate η = 1/kT in Algorithm 1, then for T = O(1/kε) iterations, δ < 1 − e^{−γ*} and ρ < [C_min((e^{−γ*} + δ)^{−1} − 1) − λθ²_max]/ℓ_max, we have the following approximation guarantee for Algorithm 1:

min_t E[F(h_{θ̂_t}, S_T | a_{φ*(θ̂_t, S_T)})] ≤ max_S min_t E[F(h_{θ̂_t}, S | a_{φ*(θ̂_t, S)})] ≤ [1 − (e^{−γ*} + δ)/κ]^{−1} OPT + 2L² h_max/σ + ε,   (8)

where θ̂_t is the iterate in line 3 of Algorithm 1, OPT is the value at the optimal solution of our adversarial training problem (2), and κ = C_min/(λθ²_max + ρℓ_max + C_min). Here, θ_max is defined in Assumptions 1, and ℓ_max, C_min are defined in Assumptions 2.

Note that, due to non-convexity, we do not provide a guarantee on the final θ̂ = θ̂_T, but on the iterates θ̂_t; moreover, the approximation bound suffers from an additional offset 2L² h_max/σ, even as ε → 0. In Appendix E.2, we present our results when F is convex, where this bound becomes stronger.
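For intuition about the PL requirement in Theorem 4, a standard example from the optimization literature (ours, not from the paper): least squares satisfies the PL condition without being strongly convex whenever A is rank-deficient. With λ⁺_min denoting the smallest nonzero eigenvalue of AᵀA,

```latex
f(\theta) = \tfrac{1}{2}\lVert A\theta - b\rVert^{2},
\qquad
\nabla f(\theta) = A^{\top}(A\theta - b),
\qquad
\lVert \nabla f(\theta)\rVert^{2}
  \;\ge\; 2\,\lambda^{+}_{\min}(A^{\top}A)\,
  \bigl[f(\theta) - \min_{\theta'} f(\theta')\bigr],
```

so f is PL with σ = 2λ⁺_min(AᵀA), even though f is not strongly convex when AᵀA is singular. This illustrates why the PL class is strictly broader than strong convexity while still excluding arbitrary non-convex losses.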

4. EXPERIMENTS

In this section, we conduct experiments with real-world datasets, which show that ROGET achieves better overall accuracy than the state-of-the-art methods for both white-box and black-box attacks. Appendix H contains additional results.

4.1. EXPERIMENTAL SETUP

Datasets and state-of-the-art competitors. We experiment with CIFAR10 (Krizhevsky et al., 2014) and Fashion MNIST (Xiao et al., 2017) (FMNIST) in this section. In Appendix H, we also report results on CIFAR100. We compare our method against seven state-of-the-art methods: GAT (Sriramanan et al., 2020), FBF (Wong et al., 2019), TRADES (Zhang et al., 2019b), Nu-AT (Sriramanan et al., 2021), MART (Wang et al., 2019), PGD-AT (Madry et al., 2017) and RFGSM-AT (Tramèr et al., 2018). Appendix G contains more details about the baselines.

Evaluation protocol. We split the datasets into training (D_Tr), validation (D_Val), and test (D_Test) sets in the ratios 4:1:1 and 5:1:1 for CIFAR10 and FMNIST, respectively. We use the validation set for hyperparameter selection (described later), early stopping (see Appendix G for details), etc. We report three types of accuracy, viz., (i) accuracy on clean examples A_clean = P(y = ŷ | x is not chosen for attack), (ii) robustness to adversarial perturbations measured using the accuracy on the perturbed examples A_robust = P(y = ŷ | x is chosen for attack), and (iii) overall accuracy A = P(y = ŷ).

Models for the learner. We consider two candidates for a_φ, which model the learner's belief about the adversary during training. Specifically, we set either a_φ = PGD (Madry et al., 2017) or a_φ = AdvGAN (Xiao et al., 2018). This gives rise to two variants of our model, viz., ROGET (a_φ = PGD) and ROGET (a_φ = AdvGAN). We use the ResNet-18 (He et al., 2016) and LeNet-5 (LeCun et al., 2015) architectures for the CIFAR10 and FMNIST datasets, respectively.

Model for the adversarial perturbation. We consider two white-box attacks, viz., PGD (Madry et al., 2017) and Auto Attack (AA) (Croce & Hein, 2020), as well as three black-box attacks, viz., Square (Andriushchenko et al., 2020), black-box MI-FGSM (Dong et al., 2018) and AdvGAN (Xiao et al., 2018), to perturb test samples.
The exact details of the attacks are given in Appendix G.

The subset selection strategy of the adversary. In addition to the adversarial perturbation mechanism, the adversary also has a strategy to select a subset S_latent of test instances to attack. We experiment with two latent subset selection mechanisms. (1) Uncertainty based subset selection: Here, the adversary selects the top 10% of instances in terms of the prediction uncertainty of a classifier trained on clean examples from D_Tr. The prediction uncertainty for an instance x is computed as u(x) = 1 − max_y h(x)[y]. (2) Label based subset selection: Here, the adversary selects instances that have a specific label y ∈ Y to attack, e.g., S_latent = {i | y_i = aeroplane}. Note that the underlying subset selection strategy is realized only at test time-it is not revealed to the learner during training and validation. Appendix H also presents other strategies.

Hyperparameter selection. Suppose, for instance, that the adversary's latent strategy of selecting a subset to attack were known during the validation stage. Then one could easily simulate this strategy to create the perturbed instances in the validation set and use the resulting validation set to cross-validate the underlying hyperparameters. However, the subset selection strategy is never revealed to the learner during the training and validation stages. Thus, the selection of hyperparameters becomes challenging and depends entirely on the underlying assumption about the adversary. In this situation, we experiment with two methods for hyperparameter selection. (1) Default selection: Here, we use the hyperparameters of the baselines directly as used in their original papers and codes (details in Appendix G). (2) Worst case selection: The key goal of ROGET is to learn a model which minimizes the worst-case loss across all data subsets.
In a similar spirit, here we aim to select the hyperparameters which maximize the minimum accuracy across a large number of subsets that underwent attack. Formally, if we denote A(β, S) to be the overall accuracy of the trained model on the subset S with hyperparameters β, we aim to estimate β* = argmax_β min_S A(β, S). To this end, we draw R = 10000 subsets {S_j}_{j=1}^R uniformly at random from D_Val, each of size |S_j| = 0.05|D_Val|, and search over the hyperparameters β to maximize the minimum accuracy across these R subsets. For our method, we tune ρ in a similar manner to obtain ρ*. Hence, the goal of this type of hyperparameter selection is the same as the key goal of ROGET, and thus it provides a fair comparison between all the methods.
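The worst-case selection rule β* = argmax_β min_S A(β, S) described above can be sketched as follows. Here `accuracy_on` is a hypothetical stand-in for training a model with hyperparameters β and measuring its overall accuracy on subset S; all names are ours.

```python
import random

def worst_case_select(candidates, accuracy_on, val_idx, R=10000, frac=0.05):
    """Pick beta* = argmax_beta min_S A(beta, S), approximating the min
    over R random validation subsets of size frac * |D_Val|."""
    size = max(1, int(frac * len(val_idx)))
    subsets = [random.sample(val_idx, size) for _ in range(R)]
    best_beta, best_worst = None, float('-inf')
    for beta in candidates:
        # worst-case accuracy of this hyperparameter over the sampled subsets
        worst = min(accuracy_on(beta, S) for S in subsets)
        if worst > best_worst:
            best_beta, best_worst = beta, worst
    return best_beta
```

Sampling the same R subsets for every candidate β keeps the comparison between hyperparameter settings consistent.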

4.2. RESULTS

Uncertainty based subset selection with default hyperparameters. Here, we compare our method against all the state-of-the-art defense methods under the default hyperparameter selection on CIFAR10 and FMNIST, where all the hyperparameters of the baselines are set as reported in their respective works, and for our method, we experiment with different values of ρ. Moreover, the adversary adopts uncertainty based subset selection, where it selects the top 10% of the test set based on the uncertainty of a classifier trained on all clean examples. We report the results in Table 1 and make the following observations. (1) ROGET (a_φ = PGD) and ROGET (a_φ = AdvGAN) achieve better overall accuracy A than the existing methods for all attacks on both datasets (except for the Square attack on CIFAR10). Among the two variants of ROGET, ROGET (a_φ = AdvGAN) is the predominant winner. (2) On CIFAR10, ROGET outperforms the baselines (except RFGSM-AT) in terms of the clean accuracy A_clean. This is because ROGET is trained to defend much better when the adversary plans to attack a subset of instances rather than every instance. In contrast, the baselines are often trained in a pessimistic manner-they assume an attack on every possible instance and consequently show sub-optimal accuracy on the clean examples. (3) There is no consistent winner among the baselines in terms of the robustness A_robust. The robust accuracy of ROGET is often competitive, e.g., on CIFAR10, ROGET (a_φ = AdvGAN) is the best performer for the AdvGAN attack and ROGET (a_φ = PGD) is the second-best performer for the Square attack. On FMNIST, ROGET (a_φ = AdvGAN) is the second-best performer for the MI-FGSM and AdvGAN attacks.

Uncertainty based subset selection with worst case hyperparameter setup. Next, we tune the hyperparameters of all the methods using the worst case hyperparameter tuning, where we select the hyperparameters that maximize the minimum accuracy across a large number of subsets of the validation set.
For ROGET, we tune ρ in the same manner to obtain ρ*. We present the results in Table 2 for CIFAR100 and make the following observations. (1) ROGET (a_φ = PGD, ρ = ρ*) and ROGET (a_φ = AdvGAN, ρ = ρ*) outperform all the baselines in terms of the overall accuracy A. (2) In most cases, the overall accuracy A of ROGET with the worst case hyperparameter ρ = ρ* is better than A with the default hyperparameter ρ = 1. In contrast, among the baselines, the worst case hyperparameter selection improves A only for Nu-AT and TRADES over their performance under the default hyperparameter setup (Table 1 vs. Table 2).

Evaluation on the label based subset selection strategy. We now consider the label based subset selection strategy of the adversary. We report the results for the AA attack in Table 3, where we also consider class focused online learning (CFOL) (Anonymous, 2022) as an additional baseline, which provides a guarantee on the worst class loss in the presence of adversarial attacks. We observe that ROGET (a_φ = AdvGAN) performs the best, followed by ROGET (a_φ = PGD), for all classes. Additional results can be found in Appendix H.

Impact of revealing the true subset selection strategy during validation. In practice, the learner does not know the adversary's true (uncertainty based) subset selection strategy during training and validation. Here, we leak this information to the learner during validation. Then, we mimic the adversary's true strategy to select the subsets from the validation set and attack them. Next, we select the hyperparameters resulting in the highest overall validation accuracy. Table 4 compares this strategy ("oracle") with the previous hyperparameter selection strategies (default and worst case). We make the following observations. (1) ROGET shows very stable performance across different hyperparameter selection methods. (2) GAT, TRADES, Nu-AT and MART significantly improve their performance, and Nu-AT outperforms ROGET by a small margin.
(3) ROGET's focus is to maintain good accuracy across all subsets. Hence, the performance of ROGET, in the absence of any knowledge about the adversary's selected subset, becomes very close to, or even better than, that of the baselines having full knowledge of the oracle selection.

Comparing robustness subject to a minimum overall accuracy. In Tables 1, 2 and 3, we reported robust and overall accuracy for fixed sets of hyperparameters. By tuning these hyperparameters, one can improve robust accuracy by sacrificing overall accuracy. Therefore, here we aim to compare the robust accuracy subject to the condition that the overall accuracy of all methods crosses some threshold. Specifically, we first tune the hyperparameters of all the methods to ensure that the overall accuracy of each method reaches a given threshold and then compare their robustness. If P indicates the hyperparameters, then we find max_P A_robust(P) such that A(P) ≥ a for some given a. Results on CIFAR10 for a = 0.81 are shown in Table 6. We observe that ROGET (PGD) is the best performer in terms of robust accuracy and ROGET (AdvGAN) is the best performer in terms of overall accuracy. ROGET (AdvGAN) is the second-best performer in terms of robust accuracy. More related results are in Appendix H.
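The constrained tuning rule above (find max_P A_robust(P) such that A(P) ≥ a over a finite hyperparameter grid) can be sketched as a simple filter-then-argmax. The grid entries, hyperparameter names, and accuracy values below are purely hypothetical and serve only to illustrate the selection rule:

```python
# Hypothetical sketch of the constrained tuning rule:
#   max_P A_robust(P)  subject to  A(P) >= a
# over a finite grid of hyperparameter settings with precomputed
# validation accuracies (all numbers illustrative).

def tune(grid, a):
    """grid: list of (params, overall_acc, robust_acc) triples."""
    # Keep only settings whose overall accuracy crosses the threshold a.
    feasible = [(p, ov, rb) for p, ov, rb in grid if ov >= a]
    if not feasible:
        return None  # no setting reaches the overall-accuracy threshold
    # Among feasible settings, pick the most robust one.
    return max(feasible, key=lambda t: t[2])

grid = [
    ({"rho": 0.5}, 0.83, 0.41),
    ({"rho": 1.0}, 0.81, 0.47),
    ({"rho": 2.0}, 0.78, 0.52),  # infeasible for a = 0.81
]
best = tune(grid, a=0.81)
print(best)  # -> ({'rho': 1.0}, 0.81, 0.47)
```

Raising the threshold a shrinks the feasible set, which is exactly the robustness/overall-accuracy trade-off discussed above.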

5. CONCLUSION

In this paper, we motivated a novel setting in adversarial attacks, in which an attacker aims to perturb only a subset of the dataset instead of the entire dataset. We presented a defense strategy, ROGET, which trains a robust model via a min-max game involving worst-case subset selection along with optimization of the model parameters. To solve the optimization problem, we designed a framework of efficient algorithms which admits approximation guarantees for convex and Polyak-Lojasiewicz loss functions. Finally, our experiments showed that ROGET achieves better overall accuracy than several state-of-the-art defense methods across several subset selection strategies. Our work opens several avenues of future research. We can extend our work to a slightly different setting where each instance has an importance score assigned to it. Another extension is to design a differentiable method for computing the worst-case attacked set, instead of using a greedy selection algorithm.

6. ETHICS STATEMENT

Our work helps ML models achieve a better trade-off between robustness against adversarial attacks and performance on unperturbed (clean) instances. Due to the vulnerability of ML models to adversarial perturbations, they have not been widely used in high-stakes real world scenarios. Furthermore, the defense methods proposed so far to achieve robustness against attacks face a considerable drop in accuracy on clean (unperturbed) instances. Here, our framework takes a step towards improving the performance on clean instances, while remaining robust against attacks on any subset of the dataset. On the flip side, our method discusses a different type of adversarial attack, where the attacks are made on a subset of instances. If one uses such attacks in practice, the attacked ML systems can become vulnerable. However, these systems can use the defense method proposed in this paper, which provides notable defense irrespective of the subset selection strategy of the adversary. The capability of our method to achieve good performance without being aware of the adversary's strategy makes it suitable for a wide range of applications.

Robust Training through Adversarially Selected Data Subsets (Appendix)

A FUTURE WORK

Our method selects a worst-case subset and trains the model parameters in an end-to-end manner to output a robust defense model. However, it might be more interesting to take into account an importance score for each instance, to decide which instances need more protection against attacks. Additionally, it will be interesting to design a completely differentiable training method for computing the worst-case subset of attacked instances, instead of using a greedy selection algorithm along with gradient descent.

B RELATED WORK

Adversarial attacks. The attack methods (Szegedy et al., 2014; Kurakin et al., 2018; Madry et al., 2017; Goodfellow et al., 2015; Kurakin et al., 2017; Carlini & Wagner, 2017) discussed in the main paper can be broadly classified into three different settings: (i) white-box attacks, (ii) black-box attacks, and (iii) transfer-based attacks. In white-box attacks, the attacker assumes full knowledge about the defense model, and hence can access the gradients of the defense model. This has led to the design of several gradient-based attacks (Szegedy et al., 2014; Madry et al., 2017; Goodfellow et al., 2015).

Subset selection. Another related line of work concerns data subset selection (Campbell & Broderick, 2018; Lucic et al., 2017; Durga et al., 2021; Killamsetty et al., 2021; De et al., 2020; 2021). As mentioned in the main paper, these existing works focus more on efficient learning (Durga et al., 2021; Mirzasoleiman et al., 2020), human-centric learning (De et al., 2020; 2021) and active learning (Wei et al., 2015; Kaushal et al., 2019). As a result, these works operate in a completely different setting and consequently require different solution techniques. For example, De et al. (2020) employ data subset selection in order to distribute instances between human and machine to generate semi-automated ML models. Also, the works in (Durga et al., 2021; Mirzasoleiman et al., 2020) use subset selection to reduce the training time of ML models by reducing the size of the effective training dataset.

C ILLUSTRATION OF OUR ADVERSARIAL TRAINING SETUP AND THE HARDNESS ANALYSIS

Probabilistic generation of S. Let us assume that the adversary follows a probabilistic strategy, i.e., π(x_i, y_i) = P(i ∈ S). Thus, π(x_i, y_i) indicates the probability that the instance x_i is chosen for attack. Here, one may wish to minimize the following loss:

min_θ max_{{π(x_i,y_i) | i∈D}} Σ_{i∈D} [π(x_i, y_i) ℓ(h_θ(a_φ(x_i)), y_i) + (1 − π(x_i, y_i)) ρ ℓ(h_θ(x_i), y_i)]   (9)
such that Σ_{i∈D} π(x_i, y_i) ≤ b, π(x_i, y_i) ∈ [0, 1].   (10)

The inner optimization problem is a linear optimization problem in each π(x_i, y_i), so there is an optimal solution with π(x_i, y_i) ∈ {0, 1}. Then, if we define S = {i | π(x_i, y_i) = 1}, we recover

min_θ max_{S : |S| ≤ b} Σ_{i∈S} ℓ(h_θ(a_φ(x_i)), y_i) + Σ_{j∈D\S} ρ ℓ(h_θ(x_j), y_j)

Hardness analysis. Let us consider a specific instance of the problem where y ∈ {0, +1, −1} and h_θ is a fixed function independent of θ, given as

h_θ(x) = Λ e^{(1 − y·θ_0^T x)²}   if y ∈ {+1, −1}
h_θ(x) = 1 − Σ_{y'∈{−1,+1}} Λ e^{(1 − y'·θ_0^T x)²}   if y = 0

where Λ ≪ 1, θ_0 is a constant vector, and x is bounded such that 1 − Σ_{y'∈{−1,+1}} Λ e^{(1 − y'·θ_0^T x)²} > 0. Similar (though not identical) distributions were also used for instantiating anecdotal examples in (Zhang et al., 2021a). Let us further assume that the training set D consists only of instances with y ∈ {+1, −1} and contains no instance with y = 0. Additionally, let a_φ(x) = φ ⊙ x, where ⊙ denotes the element-wise multiplication operation. Here, φ is restricted to a set such that 1 − Σ_{y'∈{−1,+1}} Λ e^{(1 − y'·θ_0^T a_φ(x))²} > 0. Also, consider that ρ = 0 and C = 0 in this specific setting.
Since h_θ is independent of θ, the optimization problem (2) in this setting reduces to

max_{φ, S : |S| ≤ b} Σ_{i∈S} ℓ(h_θ(a_φ(x_i)), y_i)   (13)
= max_{φ, S : |S| ≤ b} Σ_{i∈S} −log(Λ e^{(1 − y_i·θ_0^T a_φ(x_i))²})   (14)
= −min_{φ, S : |S| ≤ b} Σ_{i∈S} [log Λ + (1 − y_i·θ_0^T a_φ(x_i))²]   (15)
= −min_{φ, S : |S| = b} Σ_{i∈S} [log Λ + (1 − y_i·θ_0^T a_φ(x_i))²]   (16)
= −min_{φ, S : |S| = b} [b log Λ + Σ_{i∈S} (1 − y_i·θ_0^T a_φ(x_i))²]   (17)

Equality (16) holds because each summand in (15) is non-positive (since Λ ≪ 1 and x is bounded), so the minimum is attained at a set of the maximum size b. Thus, the solution of the above optimization (17) is equal to that of the following optimization problem:

min_{φ, S : |S| = b} Σ_{i∈S} (1 − y_i·θ_0^T a_φ(x_i))²   (18)

which, using a_φ(x) = φ ⊙ x and hence θ_0^T a_φ(x_i) = φ^T(θ_0 ⊙ x_i), is equivalent to the following optimization problem:

min_{φ, S : |S| = b} Σ_{i∈S} (1 − y_i·φ^T(θ_0 ⊙ x_i))²   (19)

The next steps follow directly from (De et al., 2021, Proof of Theorem 1). We describe them below to make the proof self-contained. Assume that X = [(θ_0 ⊙ x_1)^T; …; (θ_0 ⊙ x_{|D|})^T] has full row rank |D|. Now, let y = [y_1, …, y_i, …, y_{|D|}], let r ∈ R^{|D|} be an arbitrary vector of real numbers, and let X_R^{−1} be the right inverse of X (which exists because X has full row rank). By definition, (X_R^{−1})^T(θ_0 ⊙ x_i) = e_i, where e_i is a column vector with entry 1 at position i and 0 elsewhere. Further, let φ' = φ − X_R^{−1}(y − r). We rewrite the objective of the optimization (19) in terms of φ' to obtain

Σ_{i∈S} (1 − y_i·(φ' + X_R^{−1}(y − r))^T(θ_0 ⊙ x_i))²   (20)
= Σ_{i∈S} (1 − y_i·φ'^T(θ_0 ⊙ x_i) − y_i·(y − r)^T e_i)²   (21)
= Σ_{i∈S} (1 − y_i·φ'^T(θ_0 ⊙ x_i) − y_i² + y_i·r_i)²   (22)
= Σ_{i∈S} (r_i − φ'^T(θ_0 ⊙ x_i))²   (23)

where (23) uses y_i² = 1. Hence, our optimization problem reduces to the following optimization problem:

min_{φ', S : |S| = b} Σ_{i∈S} (r_i − φ'^T(θ_0 ⊙ x_i))²   (24)

Since the above optimization (24) is known to be NP-hard (Bhatia et al., 2017), our optimization problem (2) is also NP-hard.
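The first reduction above, from the probabilistic strategy π to a discrete subset S, hinges on the inner problem being linear in each π(x_i, y_i). This extreme-point argument can be checked numerically on a toy instance: with hypothetical per-instance losses, greedily attacking the b instances with the largest gain ℓ_adv − ρℓ_clean dominates every budget-feasible fractional π. All numbers below are illustrative:

```python
import random

random.seed(0)

# Hypothetical per-instance losses: l_adv[i] = attacked loss, l_clean[i] = clean loss.
l_adv   = [2.0, 0.9, 1.5, 0.3, 1.1]
l_clean = [0.5, 0.8, 0.2, 0.7, 0.4]
rho, b = 1.0, 2

def objective(pi):
    # Sum_i [ pi_i * l_adv_i + (1 - pi_i) * rho * l_clean_i ]
    return sum(p * la + (1 - p) * rho * lc
               for p, la, lc in zip(pi, l_adv, l_clean))

# Integral solution: attack the b instances with the largest positive gain.
order = sorted(range(len(l_adv)),
               key=lambda i: l_adv[i] - rho * l_clean[i], reverse=True)
pi_star = [0.0] * len(l_adv)
for i in order[:b]:
    if l_adv[i] - rho * l_clean[i] > 0:
        pi_star[i] = 1.0

# No fractional pi respecting the budget does better, since the objective
# is linear in each pi_i (the maximum sits at a vertex of the polytope).
for _ in range(1000):
    pi = [random.random() for _ in l_adv]
    s = sum(pi)
    if s > b:  # scale onto the budget constraint
        pi = [p * b / s for p in pi]
    assert objective(pi) <= objective(pi_star) + 1e-9
print(pi_star, objective(pi_star))
```

The sampled fractional strategies never beat the top-b integral selection, matching the argument that an optimal π is {0, 1}-valued.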

D EXPLANATION OF THE ASSUMPTIONS

In this section, we justify the assumptions mentioned in Sections 3.1 and 3.2.

ASSUMPTIONS 1. (1, 5) Lipschitz continuity and boundedness of the parameters: Lipschitz continuity of the loss and the models is a very common assumption, and it holds for most practical and well-behaved ML algorithms. For any differentiable network with bounded parameters and no singularities, these conditions usually hold; if they do not, gradients may blow up during training. (2) Stability: Algorithmic stability is also a desirable property in ML. It ensures that if we change one instance, the parameters do not change significantly. L2 regularization, dropout, and even the SGD algorithm itself encourage stability. (3, 4) Metric property and norm boundedness of the cost of perturbation: Most existing works use the L_∞ distance as the cost of perturbation. Our method considers a general set of metrics which is not limited to L_p distances. In finite dimensions, L_p norms bound each other up to constant factors; we bound our metric by the L2 norm to standardize the analysis. We do not foresee this condition failing in a practical scenario.

ASSUMPTIONS 2. (1) L-smoothness of F: This simply ensures that the gradients of F are Lipschitz too; indeed, this is a slightly stronger condition than Lipschitzness of F itself. A wide variety of smooth activation functions, such as linear and sigmoid, yield L-smooth objectives, and even non-smooth functions like ReLU are L-smooth almost everywhere. (2, 3) Boundedness of F, h and ℓ: As long as there is no inherent singularity in the interior of these functions (e.g., unlike 1/(x − a)), this is a redundant condition given the boundedness of θ. We keep it simply to keep our notation simple during the analysis; it does not impose any additional restriction if the underlying function has no singularity in the interior.
(4) Size of B: This is a parameter used in our algorithm; it does not put any restriction on the underlying setup, neither on the loss function nor on the models. (5) The adversarial network always perturbs a feature: This imposes the notion of a very strong adversary. Even if we deviate from it, the theoretical bounds would only become stronger. We keep this condition for the sake of brevity.
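As a small illustration of the L-smoothness property discussed above (Assumption 2(1)), the sketch below numerically checks that the derivative of a one-dimensional logistic loss is Lipschitz with constant x²/4. The toy loss and constants are ours, not the paper's F:

```python
import math
import random

random.seed(0)

# Illustrative check: the 1-d logistic loss f(w) = log(1 + exp(-y*w*x))
# is L-smooth with L = x^2 / 4, i.e. its derivative is Lipschitz.
x, y = 1.5, 1.0
L = x * x / 4.0  # 0.5625 for x = 1.5

def grad(w):
    # d/dw log(1 + exp(-y*w*x)) = -y*x / (1 + exp(y*w*x))
    return -y * x / (1.0 + math.exp(y * w * x))

# Verify |grad(a) - grad(b)| <= L * |a - b| on many random pairs.
for _ in range(10000):
    a, b = random.uniform(-5, 5), random.uniform(-5, 5)
    assert abs(grad(a) - grad(b)) <= L * abs(a - b) + 1e-12
print("L-smoothness check passed, L =", L)
```

A network whose layers each satisfy such a bound composes into an L-smooth objective on a bounded parameter set, which is what the analysis relies on.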

E PROOFS OF THE TECHNICAL RESULTS IN SECTION 3

E.1 MONOTONICITY AND γ-WEAK SUBMODULARITY OF G_λ: THEOREM 2

Theorem 2. Given Assumption 1, let there be a minimum non-zero constant λ_min > 0 such that

ℓ* = min_{x∈X, y∈Y} min_θ [λ_min R(θ) + ℓ(h_θ(x), y) − 2qµβL_φ] > 0

where q and β are given in Assumption 1. Then, for λ > λ_min, we have the following statements: (1) G_λ(S) is monotone in S; (2) G_λ(S) is γ-weakly submodular with γ > γ* = ℓ*·[ℓ* + 2L_h L_x L_φ φ_max + 3qµβL_φ]^{−1}.

Proof sketch. To prove monotonicity, we first show that

G_λ(k | S) > ℓ(h_θ(a_{φ*(θ,S)}(x_k)), y_k) + Σ_{i∈S} µ[C(a_{φ*(θ,S∪k)}(x_i), x_i) − C(a_{φ*(θ,S)}(x_i), x_i)] + λR(θ),

which is more than λR(θ) + ℓ(h_θ(a_{φ*(θ,S)}(x_k)), y_k) − qµL_φ|S|·||φ*(θ, S∪k) − φ*(θ, S)||. Next, we use Assumption 1 to show that this quantity is more than ℓ*. To prove γ-weak submodularity, we first show that G_λ(k | T) ≤ ℓ* + 2L_h L_x L_φ φ_max + 3qµβL_φ. To that aim, we show that

G_λ(k | T) ≤ λR(θ) + ℓ(h_θ(a_{φ*(θ,T∪k)}(x_k)), y_k) + qµβL_φ ≤ λR(θ) + ℓ(h_θ(a_{φ*(θ,S)}(x_k)), y_k) + |ℓ(h_θ(a_{φ*(θ,T∪k)}(x_k)), y_k) − ℓ(h_θ(a_{φ*(θ,S)}(x_k)), y_k)| + qµβL_φ.

Next, we use the Lipschitzness of the different functions to show that this quantity is less than ℓ* + 2L_h L_x L_φ φ_max + 3qµβL_φ. This, together with the fact that G_λ(k | S) > ℓ* derived during the proof of monotonicity, results in the inequality G_λ(k | S)/G_λ(k | T) ≥ ℓ*·[ℓ* + 2L_h L_x L_φ φ_max + 3qµβL_φ]^{−1}. Finally, we use the result of Proposition 6 (in Appendix F) to complete the proof.

Proof. Monotonicity of G_λ. Let S ⊂ D and let k ∈ D \ S. For any θ ∈ Θ,

G_λ(θ, S∪k) − G_λ(θ, S) = (1/|D|) Σ_{i∈S∪k} [λR(θ) + ℓ(h_θ(a_{φ*(θ,S∪k)}(x_i)), y_i)] − (1/|D|) Σ_{i∈S} [λR(θ) + ℓ(h_θ(a_{φ*(θ,S)}(x_i)), y_i)]
= (1/|D|) [λR(θ) + Σ_{i∈S∪k} ℓ(h_θ(a_{φ*(θ,S∪k)}(x_i)), y_i) − Σ_{i∈S} ℓ(h_θ(a_{φ*(θ,S)}(x_i)), y_i)]   (25)

We now derive a bound on Σ_{i∈S∪k} ℓ(h_θ(a_{φ*(θ,S∪k)}(x_i)), y_i) − Σ_{i∈S} ℓ(h_θ(a_{φ*(θ,S)}(x_i)), y_i). Using the definition of a_{φ*(θ,S∪k)} from Eq.
(3), we obtain:

Σ_{i∈S∪k} [ℓ(h_θ(a_{φ*(θ,S∪k)}(x_i)), y_i) − µC(a_{φ*(θ,S∪k)}(x_i), x_i)] ≥ Σ_{i∈S∪k} [ℓ(h_θ(a_{φ*(θ,S)}(x_i)), y_i) − µC(a_{φ*(θ,S)}(x_i), x_i)]   (26)
⟹ Σ_{i∈S∪k} [ℓ(h_θ(a_{φ*(θ,S∪k)}(x_i)), y_i) − ℓ(h_θ(a_{φ*(θ,S)}(x_i)), y_i)] ≥ µ Σ_{i∈S∪k} [C(a_{φ*(θ,S∪k)}(x_i), x_i) − C(a_{φ*(θ,S)}(x_i), x_i)]   (27)

Substituting inequality (27) in Eq. (25), we obtain:

G_λ(θ, S∪k) − G_λ(θ, S) ≥ (1/|D|) [λR(θ) + ℓ(h_θ(a_{φ*(θ,S)}(x_k)), y_k) + Σ_{i∈S∪k} µ[C(a_{φ*(θ,S∪k)}(x_i), x_i) − C(a_{φ*(θ,S)}(x_i), x_i)]]   (28, 29)
= (1/|D|) [λR(θ) + ℓ(h_θ(a_{φ*(θ,S)}(x_k)), y_k) − µ Σ_{i∈S∪k} [C(a_{φ*(θ,S)}(x_i), x_i) − C(a_{φ*(θ,S∪k)}(x_i), x_i)]]   (30)
≥ (1/|D|) [λR(θ) + ℓ(h_θ(a_{φ*(θ,S)}(x_k)), y_k) − µ Σ_{i∈S∪k} C(a_{φ*(θ,S∪k)}(x_i), a_{φ*(θ,S)}(x_i))]   (31)
≥ (1/|D|) [λR(θ) + ℓ(h_θ(a_{φ*(θ,S)}(x_k)), y_k) − qµβL_φ·|S∪k|/|S|]   (32)
≥ (1/|D|) [λR(θ) + ℓ(h_θ(a_{φ*(θ,S)}(x_k)), y_k) − 2qµβL_φ]   (33)

Here, inequality (31) follows from the triangle inequality in Assumption 1 and inequality (32) follows from the stability assumption in Assumption 1. We use the assumption on ℓ* to conclude that the right-hand side of inequality (33) is non-negative. This shows that G_λ is monotone in S.

γ-weak submodularity of G_λ. We first provide an upper bound on G_λ(θ, T∪k) − G_λ(θ, T).
G_λ(θ, T∪k) − G_λ(θ, T) = (1/|D|) Σ_{i∈T∪k} [λR(θ) + ℓ(h_θ(a_{φ*(θ,T∪k)}(x_i)), y_i)] − (1/|D|) Σ_{i∈T} [λR(θ) + ℓ(h_θ(a_{φ*(θ,T)}(x_i)), y_i)]
= (1/|D|) [λR(θ) + Σ_{i∈T∪k} ℓ(h_θ(a_{φ*(θ,T∪k)}(x_i)), y_i) − Σ_{i∈T} ℓ(h_θ(a_{φ*(θ,T)}(x_i)), y_i)]   (34)
= (1/|D|) [λR(θ) + Σ_{i∈T} [ℓ(h_θ(a_{φ*(θ,T∪k)}(x_i)), y_i) − µC(a_{φ*(θ,T∪k)}(x_i), x_i)] − Σ_{i∈T} [ℓ(h_θ(a_{φ*(θ,T)}(x_i)), y_i) − µC(a_{φ*(θ,T)}(x_i), x_i)] + ℓ(h_θ(a_{φ*(θ,T∪k)}(x_k)), y_k) + µ Σ_{i∈T} [C(a_{φ*(θ,T∪k)}(x_i), x_i) − C(a_{φ*(θ,T)}(x_i), x_i)]]   (35)
≤ (1/|D|) [λR(θ) + ℓ(h_θ(a_{φ*(θ,T∪k)}(x_k)), y_k) + qµβL_φ]   (36)

Here, the last inequality is due to the fact that

Σ_{i∈T} [ℓ(h_θ(a_{φ*(θ,T∪k)}(x_i)), y_i) − µC(a_{φ*(θ,T∪k)}(x_i), x_i)] − Σ_{i∈T} [ℓ(h_θ(a_{φ*(θ,T)}(x_i)), y_i) − µC(a_{φ*(θ,T)}(x_i), x_i)] ≤ 0   (37, 38)

since φ*(θ, T) maximizes the second term, and the fact that

µ Σ_{i∈T} [C(a_{φ*(θ,T∪k)}(x_i), x_i) − C(a_{φ*(θ,T)}(x_i), x_i)] ≤ qµ Σ_{i∈T} ||a_{φ*(θ,T∪k)}(x_i) − a_{φ*(θ,T)}(x_i)||   (39)
≤ |T|·qµL_φ·β/|T| = qµβL_φ   (40, by the stability of φ*)

Lower bounding the ratio. Using inequality (33) and inequality (36), we get

[G_λ(θ, S∪k) − G_λ(θ, S)] / [G_λ(θ, T∪k) − G_λ(θ, T)]
≥ [λR(θ) + ℓ(h_θ(a_{φ*(θ,S)}(x_k)), y_k) − 2qµβL_φ] / [λR(θ) + ℓ(h_θ(a_{φ*(θ,T∪k)}(x_k)), y_k) + qµβL_φ]
= [λR(θ) + ℓ(h_θ(a_{φ*(θ,S)}(x_k)), y_k) − 2qµβL_φ] / [λR(θ) + ℓ(h_θ(a_{φ*(θ,S)}(x_k)), y_k) + (ℓ(h_θ(a_{φ*(θ,T∪k)}(x_k)), y_k) − ℓ(h_θ(a_{φ*(θ,S)}(x_k)), y_k)) + qµβL_φ]
≥ ℓ* / [ℓ* + 2L_h L_x L_φ φ_max + 3qµβL_φ] = γ*   (41)

The above inequality (41) follows from the Lipschitz continuity assumption and the upper-boundedness of Φ mentioned in Assumption 1. This, together with Proposition 6, gives us the result.
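The per-element marginal-gain ratio used above (via Proposition 6) can be measured by brute force on a small ground set. The toy monotone function below (the square root of a modular function) is hypothetical and stands in for G_λ only to illustrate how such a ratio γ is computed:

```python
from itertools import combinations
from math import sqrt

# Toy monotone set function on a small ground set: g(S) = sqrt(sum of weights).
# This is NOT the paper's G_lambda; it only illustrates measuring the
# per-element ratio  gamma = min g(k|S) / g(k|T)  over S subset of T, k not in T.
w = [1.0, 2.0, 3.0, 4.0]
ground = range(len(w))

def g(S):
    return sqrt(sum(w[i] for i in S))

def marginal(k, S):
    return g(set(S) | {k}) - g(S)

ratios = []
for r in range(len(w)):
    for T in combinations(ground, r):
        for s in range(r + 1):
            for S in combinations(T, s):  # S subset of T
                for k in ground:
                    if k not in T:
                        num, den = marginal(k, set(S)), marginal(k, set(T))
                        if den > 0:
                            ratios.append(num / den)
gamma = min(ratios)
print(gamma)
assert gamma >= 1.0 - 1e-9  # concave-of-modular is submodular, so gamma = 1 here
```

For this concave-of-modular g, marginal gains shrink as the set grows, so the measured ratio bottoms out at 1; a γ-weakly submodular function such as G_λ would instead yield a ratio bounded below by some γ* < 1.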

E.2 APPROXIMATION GUARANTEES

Theorem 4. Given the conditions of Theorem 2 and Assumption 2, let F(h_θ, S | a_{φ*(θ,S)}) be a PL function in θ for all S, i.e.,

||∇_θ F(h_θ, S | a_{φ*(θ,S)})||² ≥ σ[F(h_θ, S | a_{φ*(θ,S)}) − min_{θ'} F(h_{θ'}, S | a_{φ*(θ',S)})]

for some constant σ. If we set the learning rate η = 1/(kT) in Algorithm 1, then for T = O(1/(kε)) iterations, δ < (1 − e^{−γ*}) and ρ < [µC_min((e^{−γ*} + δ)^{−1} − 1) − λθ²_max]/ℓ_max, we have the following approximation guarantee for Algorithm 1:

min_t E[F(h_{θ_t}, Ŝ_T | a_{φ*(θ_t,Ŝ_T)})] ≤ max_S min_t E[F(h_{θ_t}, S | a_{φ*(θ_t,S)})] ≤ [1 − (e^{−γ*} + δ)/κ]^{−1} (OPT + 4Lθ²_max/σ + ε)   (42)

where θ_t is the iterate in Line 3 of Algorithm 1, OPT is the value at the optimum solution of our adversarial training problem (2), and κ = µC_min/(λθ²_max + ρℓ_max + µC_min). Here, θ_max is defined in Assumption 1 and ℓ_max, C_min are defined in Assumption 2.

Proof. Under the assumptions, Lemmas 13 and 14 hold. Using Lemma 13, we have that for all S,

[1 − (e^{−γ*} + δ)/κ] F(h_{θ_t}, S | a_{φ*(θ_t,S)}) − E[F(h_{θ_t}, Ŝ_t | a_{φ*(θ_t,Ŝ_t)})] ≤ 4kη∇²_max   (43)

where the expectation above is taken over the randomness in the stochastic distorted greedy algorithm. Now, taking the expectation over the randomness in k-SGD gives us

[1 − (e^{−γ*} + δ)/κ] E[F(h_{θ_t}, S | a_{φ*(θ_t,S)})] − E[F(h_{θ_t}, Ŝ_t | a_{φ*(θ_t,Ŝ_t)})] ≤ 4kη∇²_max   (44)

where the expectation is computed over the total randomness of the algorithm. We now sum the above equation over all T to get

Σ_{t=1}^T ([1 − (e^{−γ*} + δ)/κ] E[F(h_{θ_t}, S | a_{φ*(θ_t,S)})] − E[F(h_{θ_t}, Ŝ_t | a_{φ*(θ_t,Ŝ_t)})]) ≤ 4Tkη∇²_max   (45)

Using Lemma 14, we have that for all θ,

E[F(h_{θ_{t−1}}, Ŝ_{t−1} | a_{φ*(θ_{t−1},Ŝ_{t−1})})] − E[F(h_θ, Ŝ_{t−1} | a_{φ*(θ,Ŝ_{t−1})})] ≤ 2kη∇²_max + 4Lθ²_max(1 − ησ)^k/σ + Lη∇²_max/(2σ)   (46)

where the expectation is w.r.t. the total randomness of the algorithm.
Summing over all T, we obtain:

Σ_{t=1}^T (E[F(h_{θ_t}, Ŝ_t | a_{φ*(θ_t,Ŝ_t)})] − E[F(h_θ, Ŝ_t | a_{φ*(θ,Ŝ_t)})]) ≤ 2Tkη∇²_max + 4TLθ²_max(1 − ησ)^k/σ + LTη∇²_max/(2σ)   (47)

Combining inequality (45) and inequality (47) and then dividing by T throughout, we obtain:

[1 − (e^{−γ*} + δ)/κ] (1/T) Σ_{t=1}^T E[F(h_{θ_t}, S | a_{φ*(θ_t,S)})] ≤ max_S E[F(h_θ, S | a_{φ*(θ,S)})] + 6kη∇²_max + 4Lθ²_max(1 − ησ)^k/σ + Lη∇²_max/(2σ)

Taking the minimum over θ on both sides and noting that all terms except one are independent of θ, we get

[1 − (e^{−γ*} + δ)/κ] (1/T) Σ_{t=1}^T E[F(h_{θ_t}, S | a_{φ*(θ_t,S)})] ≤ min_θ max_S E[F(h_θ, S | a_{φ*(θ,S)})] + 6kη∇²_max + e^{−kησ}·4Lθ²_max/σ + Lη∇²_max/(2σ)

where we used (1 − ησ)^k ≤ e^{−kησ}. Putting η = 1/(kT), we obtain:

[1 − (e^{−γ*} + δ)/κ] (1/T) Σ_{t=1}^T E[F(h_{θ_t}, S | a_{φ*(θ_t,S)})] ≤ OPT + e^{−σ/T}·4Lθ²_max/σ + 6∇²_max/T + L∇²_max/(2kTσ)
⟹ (1/T) Σ_{t=1}^T E[F(h_{θ_t}, S | a_{φ*(θ_t,S)})] ≤ [1 − (e^{−γ*} + δ)/κ]^{−1} (OPT + e^{−σ/T}·4Lθ²_max/σ + ε)

This gives us the statement of the theorem.

Algorithm 2: ROGET Algorithm (with additional variants of gradient descent for convex F).

Approximation guarantee for convex F. Next, we consider the case when F is convex. Here, we present the approximation guarantees of our algorithm (restated in Algorithm 2) with two different variants of stochastic gradient descent, viz., simple gradient descent (GD) and one-step stochastic gradient descent (SGD) (instead of k-SGD), which allow us to derive approximation guarantees under convexity. These guarantees generalize the results of Adibi et al. (2021).

Theorem 5. Given the conditions of Theorem 2 and Assumption 2, let F(h_θ, S | a_{φ*(θ,S)}) be convex in θ for all S, and let the learning rate be η = 1/√T.
Now, suppose we set either method = GD or method = SGD in Line 3 of Algorithm 2, i.e., we use either one-step gradient descent or one-step stochastic gradient descent while training θ for a fixed S. Then, for T = O(1/ε²) iterations, δ < (1 − e^{−γ*}), ρ > λθ²_max(1 + e^{γ*})/(e^{γ*}(1 − δ)C_δ) and ρ < [µC_min((e^{−γ*} + δ)^{−1} − 1) − λθ²_max]/ℓ_max, we have the following approximation guarantee:

E[F(h_{θ̄}, Ŝ_T | a_{φ*(θ̄,Ŝ_T)})] ≤ max_S E[F(h_{θ̄}, S | a_{φ*(θ̄,S)})] ≤ [1 − (e^{−γ*} + δ)/κ]^{−1} (OPT + ε)

where OPT is the value at the optimum solution of our adversarial training problem (2) and κ = µC_min/(λθ²_max + ρℓ_max + µC_min). Here, θ_max is defined in Assumption 1 and ℓ_max, C_min are defined in Assumption 2.

Proof. Under Assumption 2 and the convexity of F, Lemma 11 holds. Hence, we have, for any θ ∈ Θ,

Σ_{t=1}^T (E[F(h_{θ_t}, Ŝ_t | a_{φ*(θ_t,Ŝ_t)})] − E[F(h_θ, Ŝ_t | a_{φ*(θ,Ŝ_t)})]) ≤ 2Tη∇²_max + 2θ²_max/η   (51)

Under Assumptions 1 and 2, Lemma 9 also holds, which gives us, for all S and t,

[1 − (e^{−γ*} + δ)/κ] F(h_{θ_t}, S | a_{φ*(θ_t,S)}) − E[F(h_{θ_t}, Ŝ_t | a_{φ*(θ_t,Ŝ_t)})] ≤ 2η∇²_max   (52)

We note that in Lemma 9, the expectation was taken over the randomness in the stochastic distorted greedy algorithm.
We now include the randomness due to stochastic gradient descent and take the total expectation to obtain:

[1 − (e^{−γ*} + δ)/κ] E[F(h_{θ_t}, S | a_{φ*(θ_t,S)})] − E[F(h_{θ_t}, Ŝ_t | a_{φ*(θ_t,Ŝ_t)})] ≤ 2η∇²_max   (53)

Under review as a conference paper at ICLR 2023

Summing the above equation over t = 1 to T gives us, for all θ ∈ Θ and all S,

Σ_{t=1}^T ([1 − (e^{−γ*} + δ)/κ] E[F(h_{θ_t}, S | a_{φ*(θ_t,S)})] − E[F(h_{θ_t}, Ŝ_t | a_{φ*(θ_t,Ŝ_t)})]) ≤ 2Tη∇²_max   (54)

Summing inequality (51) and inequality (54) and dividing by T, we obtain:

[1 − (e^{−γ*} + δ)/κ] (1/T) Σ_{t=1}^T E[F(h_{θ_t}, S | a_{φ*(θ_t,S)})] ≤ (1/T) Σ_{t=1}^T E[F(h_θ, Ŝ_t | a_{φ*(θ,Ŝ_t)})] + 4η∇²_max + 2θ²_max/(Tη)

Since the above holds for all S, we take the maximum over all possible S on both sides to obtain:

[1 − (e^{−γ*} + δ)/κ] max_S (1/T) Σ_{t=1}^T E[F(h_{θ_t}, S | a_{φ*(θ_t,S)})] ≤ (1/T) Σ_{t=1}^T max_S E[F(h_θ, S | a_{φ*(θ,S)})] + 4η∇²_max + 2θ²_max/(Tη)
≤ max_S F(h_θ, S | a_{φ*(θ,S)}) + 4η∇²_max + 2θ²_max/(Tη)

Since F is convex in θ, we get that for all S and for θ̄ = (1/T) Σ_{t=1}^T θ_t,

F(h_{θ̄}, S | a_{φ*(θ̄,S)}) ≤ (1/T) Σ_{t=1}^T F(h_{θ_t}, S | a_{φ*(θ_t,S)})   (55)

Along with the linearity of expectation, this gives us that for all θ ∈ Θ,

[1 − (e^{−γ*} + δ)/κ] max_S E[F(h_{θ̄}, S | a_{φ*(θ̄,S)})] ≤ max_S F(h_θ, S | a_{φ*(θ,S)}) + 4η∇²_max + 2θ²_max/(Tη)

Finally, we take the minimum over θ on both sides. Noting that the LHS is independent of θ, we obtain:

[1 − (e^{−γ*} + δ)/κ] max_S E[F(h_{θ̄}, S | a_{φ*(θ̄,S)})] ≤ min_θ max_S F(h_θ, S | a_{φ*(θ,S)}) + 4η∇²_max + 2θ²_max/(Tη) = OPT + 4η∇²_max + 2θ²_max/(Tη)

Setting η = 1/√T, we obtain:

[1 − (e^{−γ*} + δ)/κ] max_S E[F(h_{θ̄}, S | a_{φ*(θ̄,S)})] ≤ OPT + (1/√T)(4∇²_max + 2θ²_max)   (58)
= OPT + ε   (59)

Rearranging, we get the statement in the theorem. This completes the proof.
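Both guarantees invoke the stochastic distorted greedy subroutine (Lemma 7 in Appendix F). A minimal sketch of the deterministic distorted greedy rule of Harshaw et al. (2019), applied to a toy monotone submodular g and modular cost c (both hypothetical, not the paper's G_λ and m_λ), is:

```python
import math

# Minimal sketch of distorted greedy (Harshaw et al., 2019) for
#   max_{|S| <= b} g(S) - c(S),
# with g monotone (gamma-weakly) submodular and c modular.
def distorted_greedy(ground, g, c, b, gamma=1.0):
    S = set()
    for i in range(b):
        # The distortion factor down-weights g early on, so cheap elements
        # with solid gains are preferred over expensive high-gain ones.
        factor = (1 - gamma / b) ** (b - i - 1)
        def score(e):
            return factor * (g(S | {e}) - g(S)) - c(e)
        e = max((e for e in ground if e not in S), key=score)
        if score(e) > 0:  # only add elements with positive distorted gain
            S.add(e)
    return S

# Toy instance: g(S) = sqrt(total weight) is monotone submodular; c is modular.
w = {0: 3.0, 1: 2.0, 2: 1.0, 3: 0.1}
cost = {0: 0.2, 1: 0.1, 2: 0.9, 3: 0.5}
g = lambda T: math.sqrt(sum(w[e] for e in T))
S = distorted_greedy(w.keys(), g, lambda e: cost[e], b=2)
print(sorted(S))  # -> [0, 1]
```

The stochastic variant analyzed in Lemma 7 evaluates each round's argmax over a random sample of the ground set, which is what reduces the number of g evaluations to O(|D| log(1/δ)).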

F AUXILIARY LEMMAS

Proposition 6. If a function Q satisfies α-submodularity, i.e., Q(k | S) ≥ αQ(k | T) for all k ∈ D \ T with S ⊆ T, then Q satisfies γ-weak submodularity with γ ≥ α (El Halabi et al., 2018, Proposition 8, Appendix).

Lemma 7 (Guarantees from Stochastic Distorted Greedy). Let g, c : 2^D → R_+ be two non-negative, monotone set functions, such that g is γ-weakly submodular for some γ ∈ (0, 1] and c is modular (see Definition 1). Furthermore, suppose that for all S, c(S)/g(S) ≤ 1 − κ for some κ ∈ [0, 1). Suppose we wish to solve the following problem:

max_{|S| ≤ b} [g(S) − c(S)]   (60)

Let S* denote the set at which the maximum in optimization (60) is attained. Then, for a given value of δ > 0 such that δ + e^{−γ} < 1, the Stochastic Distorted Greedy algorithm makes O(|D| log(1/δ)) evaluations of g and returns a set S' of size |S'| ≤ b which satisfies

g(S*) − c(S*) ≤ [κ/(κ − e^{−γ} − δ)] E[g(S') − c(S')]   (61)

As a corollary, we observe that for all S such that |S| ≤ b, it holds that

g(S) − c(S) ≤ g(S*) − c(S*) ≤ [κ/(κ − e^{−γ} − δ)] E[g(S') − c(S')]   (62)

Proof. We begin with the approximation guarantee on the Stochastic Distorted Greedy algorithm as stated in (Harshaw et al., 2019, Theorem 3). Harshaw et al. (2019) show that on running Stochastic Distorted Greedy with b iterations for the optimization problem (60), the output is a set S' with size |S'| ≤ b and

(1 − e^{−γ} − δ) g(S*) − c(S*) ≤ E[g(S') − c(S')]   (63)
⟹ g(S*) ≤ E[g(S') − c(S')]/(1 − e^{−γ} − δ) + c(S*)/(1 − e^{−γ} − δ)   (64)
⟹ g(S*) − c(S*) ≤ E[g(S') − c(S')]/(1 − e^{−γ} − δ) + [(e^{−γ} + δ)/(1 − e^{−γ} − δ)] c(S*)   (65)
⟹ g(S*) − c(S*) ≤ E[g(S') − c(S')]/(1 − e^{−γ} − δ) + [(e^{−γ} + δ)(1 − κ)/((1 − e^{−γ} − δ)κ)] (g(S*) − c(S*))   (66)

where in the last inequality we use the definition of κ, which implies c(S*) ≤ [(1 − κ)/κ] (g(S*) − c(S*)). Collecting the terms involving g(S*) − c(S*) on the left and g(S') − c(S') on the right, and then simplifying, gives us the result of the lemma.

Lemma 8 (Derivation of κ). Let G_λ(θ, S) and m_λ(θ, S) be as defined in Definitions 4 and 5, respectively.
Let Assumption 1 hold, suppose max_{θ∈Θ} R(θ) = θ²_max, and let ℓ_max, C_min, θ_max be defined as in Assumption 1. Then, for all θ and all S, it holds that

m_λ(θ, S)/G_λ(θ, S) ≤ (λθ²_max + ρℓ_max)/(λθ²_max + ρℓ_max + µC_min)   (67)

Proof. We first note that:

m_λ(θ, S)/G_λ(θ, S) = [λR(θ)|S| + ρ Σ_{i∈S} ℓ(h_θ(x_i), y_i)] / [λR(θ)|S| + Σ_{i∈S} ℓ(h_θ(a_{φ*(θ,S)}(x_i)), y_i) + ρ Σ_{j∈D} ℓ(h_θ(x_j), y_j)]   (68)

Using the definition of a_{φ*(θ,S)} from Eq. (3), we have, for all φ ∈ Φ,

Σ_{i∈S} [ℓ(h_θ(a_{φ*(θ,S)}(x_i)), y_i) − µC(a_{φ*(θ,S)}(x_i), x_i)] ≥ Σ_{i∈S} [ℓ(h_θ(a_φ(x_i)), y_i) − µC(a_φ(x_i), x_i)]   (69)

In particular, this also holds for a φ such that a_φ(x) = x for all x. This gives:

Σ_{i∈S} [ℓ(h_θ(a_{φ*(θ,S)}(x_i)), y_i) − µC(a_{φ*(θ,S)}(x_i), x_i)] ≥ Σ_{i∈S} [ℓ(h_θ(x_i), y_i) − µC(x_i, x_i)]   (70)
⟹ Σ_{i∈S} ℓ(h_θ(a_{φ*(θ,S)}(x_i)), y_i) ≥ Σ_{i∈S} [ℓ(h_θ(x_i), y_i) + µC_min]   (71)

The last line uses the definition of C_min from Assumption 1 and the fact that C(x, x) = 0 for all x. Plugging inequality (71) back into Eq. (68), we obtain:

m_λ(θ, S)/G_λ(θ, S) ≤ [λR(θ)|S| + ρ Σ_{i∈S} ℓ(h_θ(x_i), y_i)] / [λR(θ)|S| + Σ_{i∈S} [ℓ(h_θ(x_i), y_i) + µC_min] + ρ Σ_{j∈D} ℓ(h_θ(x_j), y_j)]   (72)
≤ [λR(θ)|S| + ρ Σ_{i∈S} ℓ(h_θ(x_i), y_i)] / [λR(θ)|S| + µC_min|S| + ρ Σ_{j∈S} ℓ(h_θ(x_j), y_j)]   (73, 74)
≤ [λR(θ)|S| + ρℓ_max|S|] / [λR(θ)|S| + µC_min|S| + ρℓ_max|S|]   (75)
= (λR(θ) + ρℓ_max)/(λR(θ) + ρℓ_max + µC_min)   (76)

Here, inequality (73, 74) follows from the non-negativity of ℓ (we drop the non-negative terms Σ_{i∈S} ℓ(h_θ(x_i), y_i) and ρ Σ_{j∈D\S} ℓ(h_θ(x_j), y_j) from the denominator), and inequality (75) follows because the map u ↦ (λR(θ)|S| + u)/(λR(θ)|S| + µC_min|S| + u) is increasing in u and ρ Σ_{j∈S} ℓ(h_θ(x_j), y_j) ≤ ρℓ_max|S|. Finally, substituting R(θ) ≤ θ²_max (the bound is likewise increasing in R(θ)), we get the statement of the lemma.

Lemma 9. Suppose the assumptions of Lemma 8 and Theorem 2 hold. Let κ = µC_min/(λθ²_max + ρℓ_max + µC_min), where the symbols are the same as defined in the statement of Lemma 8, and let γ* be defined as in Theorem 2. Furthermore, let δ < (1 − e^{−γ*}) and let ρ < [µC_min((e^{−γ*} + δ)^{−1} − 1) − λθ²_max]/ℓ_max.
Then, for any S such that |S| ≤ b,

[1 − (e^{−γ*} + δ)/κ] F(h_{θ_t}, S | a_{φ*(θ_t,S)}) − E[F(h_{θ_t}, Ŝ_t | a_{φ*(θ_t,Ŝ_t)})] ≤ 2η∇²_max   (77)

where the expectation is w.r.t. the randomness in Stochastic Distorted Greedy.

Proof. To obtain our guarantees, we require 1 − (e^{−γ*} + δ)/κ ≥ 0. Substituting κ = µC_min/(λθ²_max + ρℓ_max + µC_min) and simplifying gives us ρ < [µC_min((e^{−γ*} + δ)^{−1} − 1) − λθ²_max]/ℓ_max. To ensure that this bound is positive, we require δ < (1 − e^{−γ*}). In our algorithm, we obtain Ŝ_t by applying Stochastic Distorted Greedy using θ_{t−1}. Therefore, by our characterization of F as stated in Eq. (6) and using Lemma 7, we have that for any S,

F(h_{θ_{t−1}}, S | a_{φ*(θ_{t−1},S)}) ≤ [κ/(κ − e^{−γ*} − δ)] E[F(h_{θ_{t−1}}, Ŝ_t | a_{φ*(θ_{t−1},Ŝ_t)})]   (78)

Using the ∇_max-Lipschitzness of F, we get

|F(h_{θ_t}, S | a_{φ*(θ_t,S)}) − F(h_{θ_{t−1}}, S | a_{φ*(θ_{t−1},S)})| ≤ ∇_max ||θ_t − θ_{t−1}||   (79)
= ∇_max η ||∇̃_{θ_{t−1}} F(h_{θ_{t−1}}, S | a_{φ*(θ_{t−1},S)})||   (80)
≤ η∇²_max   (81)

Now, using the Lipschitz condition, from inequality (81) we get, for all t and S,

F(h_{θ_t}, S | a_{φ*(θ_t,S)}) ≤ F(h_{θ_{t−1}}, S | a_{φ*(θ_{t−1},S)}) + η∇²_max   (82)
≤ [κ/(κ − e^{−γ*} − δ)] E[F(h_{θ_{t−1}}, Ŝ_t | a_{φ*(θ_{t−1},Ŝ_t)})] + η∇²_max   (83)
⟹ [1 − (e^{−γ*} + δ)/κ] F(h_{θ_t}, S | a_{φ*(θ_t,S)}) ≤ E[F(h_{θ_{t−1}}, Ŝ_t | a_{φ*(θ_{t−1},Ŝ_t)})] + [1 − (e^{−γ*} + δ)/κ] η∇²_max   (84)

Here, inequality (83) is obtained by using inequality (78). Subtracting E[F(h_{θ_t}, Ŝ_t | a_{φ*(θ_t,Ŝ_t)})] from both sides of inequality (84), we obtain:

[1 − (e^{−γ*} + δ)/κ] F(h_{θ_t}, S | a_{φ*(θ_t,S)}) − E[F(h_{θ_t}, Ŝ_t | a_{φ*(θ_t,Ŝ_t)})]   (85)
≤ E[F(h_{θ_{t−1}}, Ŝ_t | a_{φ*(θ_{t−1},Ŝ_t)}) − F(h_{θ_t}, Ŝ_t | a_{φ*(θ_t,Ŝ_t)})] + [1 − (e^{−γ*} + δ)/κ] η∇²_max   (86)

Using inequality (81), we conclude that the above quantity is at most

η∇²_max + [1 − (e^{−γ*} + δ)/κ] η∇²_max = [2 − (e^{−γ*} + δ)/κ] η∇²_max ≤ 2η∇²_max   (87)

This completes the proof of the lemma.

Lemma 10. Suppose Assumptions 1 and 2 hold.
Let ∇̃_θ F(h_θ, S | a_{φ*(θ,S)}) denote a stochastic gradient of F at (h_θ, S) such that E[∇̃_θ F(h_θ, S | a_{φ*(θ,S)})] = ∇_θ F(h_θ, S | a_{φ*(θ,S)}) for all θ, S. Furthermore, suppose that for all θ, S, we have ||∇̃_θ F(h_θ, S | a_{φ*(θ,S)})|| ≤ ∇_max. For any θ̂ and S, let θ' = θ̂ − η ∇̃_θ F(h_{θ̂}, S | a_{φ*(θ̂,S)}). Then, for any θ ∈ Θ,

E[F(h_{θ'}, S | a_{φ*(θ',S)}) | θ̂] − F(h_θ, S | a_{φ*(θ,S)}) ≤ (η/2)(Lη + 1)∇²_max + (1/2η)(||θ̂ − θ||² − E[||θ' − θ||² | θ̂])

where the expectation is over the randomness in computing the stochastic gradient ∇̃_θ F(h_θ, S | a_{φ*(θ,S)}).

Proof.

In what follows, we fix S and denote F(h_θ, S | a_{φ*(θ,S)}) ≡ F(h_θ) for brevity and succinctness. Using first the L-smoothness of F and then θ' = θ̂ − η ∇̃_θ F(h_{θ̂}), we obtain:

F(h_{θ'}) ≤ F(h_{θ̂}) + ⟨∇_θ F(h_{θ̂}), θ' − θ̂⟩ + (L/2)||θ' − θ̂||²   (88)
⟹ F(h_{θ'}) − F(h_{θ̂}) ≤ −η ⟨∇_θ F(h_{θ̂}), ∇̃_θ F(h_{θ̂})⟩ + (Lη²/2)||∇̃_θ F(h_{θ̂})||²   (89)

Taking expectations on both sides over the randomness in computing the stochastic gradient, we obtain:

E[F(h_{θ'}) − F(h_{θ̂}) | θ̂] ≤ −η E[⟨∇_θ F(h_{θ̂}), ∇̃_θ F(h_{θ̂})⟩ | θ̂] + (Lη²/2) E[||∇̃_θ F(h_{θ̂})||² | θ̂]   (90)
⟹ E[F(h_{θ'}) | θ̂] − F(h_{θ̂}) ≤ −η ⟨∇_θ F(h_{θ̂}), E[∇̃_θ F(h_{θ̂}) | θ̂]⟩ + (Lη²/2) E[||∇̃_θ F(h_{θ̂})||² | θ̂]   (91)
= −η ||∇_θ F(h_{θ̂})||² + (Lη²/2) E[||∇̃_θ F(h_{θ̂})||² | θ̂]   (92)

Since F is convex in θ, we have:

F(h_{θ̂}) − F(h_θ) ≤ ⟨∇_θ F(h_{θ̂}), θ̂ − θ⟩   (93)

Adding inequality (93) and inequality (92), we obtain:

E[F(h_{θ'}) | θ̂] − F(h_θ) ≤ −η ||∇_θ F(h_{θ̂})||² + (Lη²/2) E[||∇̃_θ F(h_{θ̂})||² | θ̂] + ⟨∇_θ F(h_{θ̂}), θ̂ − θ⟩   (94)

We derive a value for ⟨∇_θ F(h_{θ̂}), θ̂ − θ⟩ using the following:

||θ' − θ||² = ||θ' − θ̂ + θ̂ − θ||²   (95)
= ||θ' − θ̂||² + ||θ̂ − θ||² + 2⟨θ' − θ̂, θ̂ − θ⟩   (96)
= ||−η ∇̃_θ F(h_{θ̂})||² + ||θ̂ − θ||² + 2⟨−η ∇̃_θ F(h_{θ̂}), θ̂ − θ⟩   (97)
= η² ||∇̃_θ F(h_{θ̂})||² + ||θ̂ − θ||² − 2η ⟨∇̃_θ F(h_{θ̂}), θ̂ − θ⟩   (98)

Taking expectation on both sides, given θ̂ (noting that θ does not depend on θ̂), we get

E[||θ' − θ||² | θ̂] − ||θ̂ − θ||² = η² E[||∇̃_θ F(h_{θ̂})||² | θ̂] − 2η E[⟨∇̃_θ F(h_{θ̂}), θ̂ − θ⟩ | θ̂]   (99)
= η² E[||∇̃_θ F(h_{θ̂})||² | θ̂] − 2η ⟨∇_θ F(h_{θ̂}), θ̂ − θ⟩   (100)

Rearranging, we obtain:

⟨∇_θ F(h_{θ̂}), θ̂ − θ⟩ = (η/2) E[||∇̃_θ F(h_{θ̂})||² | θ̂] − (1/2η)(E[||θ' − θ||² | θ̂] − ||θ̂ − θ||²)   (101)

Plugging equality (101) back into inequality (94), we get

E[F(h_{θ'}) | θ̂] − F(h_θ) ≤ (η/2)(Lη + 1) E[||∇̃_θ F(h_{θ̂})||² | θ̂] − η ||∇_θ F(h_{θ̂})||² + (1/2η)(||θ̂ − θ||² − E[||θ' − θ||² | θ̂])
≤ (η/2)(Lη + 1)∇²_max + (1/2η)(||θ̂ − θ||² − E[||θ' − θ||² | θ̂])

Lemma 11. Suppose the assumptions of Lemma 10 hold. Furthermore, let θ_t, Ŝ_t denote the iterates of our algorithm. Then, for any θ ∈ Θ,

Σ_{t=1}^T (E[F(h_{θ_t}, Ŝ_t | a_{φ*(θ_t,Ŝ_t)})] − E[F(h_θ, Ŝ_t | a_{φ*(θ,Ŝ_t)})]) ≤ 2Tη∇²_max + 2θ²_max/η

Here the expectation is taken w.r.t.
the total randomness of the algorithm. This includes the randomness in stochastic gradient descent as well as in stochastic distorted greedy.

Proof. Putting θ' = θ_t, θ̂ = θ_{t−1} and S = Ŝ_{t−1} in the statement of Lemma 10, we obtain, for any θ ∈ Θ:

E[F(h_{θ_t}, Ŝ_{t−1} | a_{φ*(θ_t,Ŝ_{t−1})}) | θ_{t−1}] − F(h_θ, Ŝ_{t−1} | a_{φ*(θ,Ŝ_{t−1})}) ≤ (η/2)(Lη + 1)∇²_max + (1/2η)(||θ_{t−1} − θ||² − E[||θ_t − θ||² | θ_{t−1}])   (104)

We now take the expectation w.r.t. θ_0, θ_1, …, θ_t. Let θ_{0:τ} := θ_0, …, θ_τ. Using the law of total expectation and noting that, given θ_{t−1}, θ_t is independent of θ_τ for τ < t − 1, we get

E_{θ_{0:t}}[F(h_{θ_t}, Ŝ_{t−1} | a_{φ*(θ_t,Ŝ_{t−1})})] = E_{θ_{0:t−1}}[E_{θ_{0:t}|θ_{0:t−1}}[F(h_{θ_t}, Ŝ_{t−1} | a_{φ*(θ_t,Ŝ_{t−1})}) | θ_0, …, θ_{t−1}]]   (105)
= E_{θ_{0:t−1}}[E_{θ_t|θ_{0:t−1}}[F(h_{θ_t}, Ŝ_{t−1} | a_{φ*(θ_t,Ŝ_{t−1})}) | θ_0, …, θ_{t−1}]]   (106)
= E_{θ_{0:t−1}}[E_{θ_t|θ_{t−1}}[F(h_{θ_t}, Ŝ_{t−1} | a_{φ*(θ_t,Ŝ_{t−1})}) | θ_{t−1}]]   (107)

Finally, we note that E_{θ_{0:T}}[F(h_{θ_t}, Ŝ_{t−1} | a_{φ*(θ_t,Ŝ_{t−1})})] = E_{θ_{0:t}}[F(h_{θ_t}, Ŝ_{t−1} | a_{φ*(θ_t,Ŝ_{t−1})})], as later iterations cannot impact previous ones. We finally take the expectation w.r.t. the randomness in stochastic distorted greedy as well. Thus, we obtain:

E[F(h_{θ_t}, Ŝ_{t−1} | a_{φ*(θ_t,Ŝ_{t−1})})] − E[F(h_θ, Ŝ_{t−1} | a_{φ*(θ,Ŝ_{t−1})})] ≤ (η/2)(Lη + 1)∇²_max + (1/2η)(E[||θ_{t−1} − θ||²] − E[||θ_t − θ||²])

where E now denotes the expectation w.r.t. the randomness of the entire procedure (stochastic gradient descent and stochastic distorted greedy).
Using Lipschitzness of $F$, we get that:
$$|F(h_{\widehat\theta_t}, \widehat{S}_{t-1} \mid a_{\phi^*(\widehat\theta_t, \widehat{S}_{t-1})}) - F(h_{\widehat\theta_{t-1}}, \widehat{S}_{t-1} \mid a_{\phi^*(\widehat\theta_{t-1}, \widehat{S}_{t-1})})| \le \nabla_{\max}\|\widehat\theta_t - \widehat\theta_{t-1}\|$$
$$\implies F(h_{\widehat\theta_{t-1}}, \widehat{S}_{t-1} \mid a_{\phi^*(\widehat\theta_{t-1}, \widehat{S}_{t-1})}) \le F(h_{\widehat\theta_t}, \widehat{S}_{t-1} \mid a_{\phi^*(\widehat\theta_t, \widehat{S}_{t-1})}) + \eta\nabla_{\max}^2$$
Plugging this into the inequality above, we get:
$$\mathbb{E}[F(h_{\widehat\theta_{t-1}}, \widehat{S}_{t-1} \mid a_{\phi^*(\widehat\theta_{t-1}, \widehat{S}_{t-1})})] - \mathbb{E}[F(h_{\bar\theta}, \widehat{S}_{t-1} \mid a_{\phi^*(\bar\theta, \widehat{S}_{t-1})})] \le \eta\nabla_{\max}^2 + \tfrac{\eta}{2}(L\eta+1)\nabla_{\max}^2 + \tfrac{1}{2\eta}\big(\mathbb{E}[\|\widehat\theta_{t-1}-\bar\theta\|^2] - \mathbb{E}[\|\widehat\theta_t - \bar\theta\|^2]\big)$$
Setting $\eta < 1/L$, re-indexing $t-1 \to t$, and simplifying gives us:
$$\mathbb{E}[F(h_{\widehat\theta_t}, \widehat{S}_t \mid a_{\phi^*(\widehat\theta_t, \widehat{S}_t)})] - \mathbb{E}[F(h_{\bar\theta}, \widehat{S}_t \mid a_{\phi^*(\bar\theta, \widehat{S}_t)})] \le 2\eta\nabla_{\max}^2 + \tfrac{1}{2\eta}\big(\mathbb{E}[\|\widehat\theta_t - \bar\theta\|^2] - \mathbb{E}[\|\widehat\theta_{t+1} - \bar\theta\|^2]\big)$$
Summing over $t = 1, \ldots, T$ gives a telescoping sum on the right. Simplifying, we obtain:
$$\sum_{t=1}^{T} \mathbb{E}[F(h_{\widehat\theta_t}, \widehat{S}_t \mid a_{\phi^*(\widehat\theta_t, \widehat{S}_t)})] - \mathbb{E}[F(h_{\bar\theta}, \widehat{S}_t \mid a_{\phi^*(\bar\theta, \widehat{S}_t)})] \le 2T\eta\nabla_{\max}^2 + \tfrac{1}{2\eta}\,\mathbb{E}[\|\widehat\theta_1 - \bar\theta\|^2] \quad (108)$$
We upper bound $\|\widehat\theta_1 - \bar\theta\|^2$ as follows:
$$\|\widehat\theta_1 - \bar\theta\|^2 \le (\|\widehat\theta_1\| + \|\bar\theta\|)^2 \le (2\theta_{\max})^2 = 4\theta_{\max}^2 \quad (109)$$
Substituting the upper bound (109) into inequality (108) gives us the statement of the lemma:
$$\sum_{t=1}^{T} \mathbb{E}[F(h_{\widehat\theta_t}, \widehat{S}_t \mid a_{\phi^*(\widehat\theta_t, \widehat{S}_t)})] - \mathbb{E}[F(h_{\bar\theta}, \widehat{S}_t \mid a_{\phi^*(\bar\theta, \widehat{S}_t)})] \le 2T\eta\nabla_{\max}^2 + \frac{2\theta_{\max}^2}{\eta}$$

F.1 AUXILIARY LEMMAS FOR THEOREM 4

Lemma 12 ($k$-SGD guarantee for a fixed $S$). Suppose Assumptions 1 and 2 hold and $F$ is a non-convex function that satisfies the PL condition. Fix $S$ and suppose $\theta^{(1)}, \ldots, \theta^{(k)}$ are obtained using $k$-step stochastic gradient descent, for the fixed $S$, starting from $\theta^{(0)}$. Then, for any $\bar\theta \in \Theta$, we have:
$$\mathbb{E}[F(h_{\theta^{(k)}}, S \mid a_{\phi^*(\theta^{(k)}, S)}) \mid \theta^{(0)}, \ldots, \theta^{(k-1)}] - F(h_{\bar\theta}, S \mid a_{\phi^*(\bar\theta, S)}) \le \frac{4L\theta_{\max}^2(1-\eta\sigma)^k}{\sigma} + \frac{L\eta\nabla_{\max}^2}{2\sigma}$$
where the expectation is w.r.t. the randomness in computing the stochastic gradient.

Proof. Since $S$ is fixed, we drop the second argument and denote $F(h_\theta, S \mid a_{\phi^*(\theta, S)}) \equiv F_S(h_\theta)$ for succinctness and brevity. For $i = 0, \ldots, k-1$, we have $\theta^{(i+1)} = \theta^{(i)} - \eta \widehat{\nabla}_{\theta^{(i)}} F_S(h_{\theta^{(i)}})$.
Using $L$-smoothness, we obtain:
$$F_S(h_{\theta^{(k)}}) \le F_S(h_{\theta^{(k-1)}}) + \langle \nabla_{\theta^{(k-1)}} F_S(h_{\theta^{(k-1)}}), \theta^{(k)} - \theta^{(k-1)}\rangle + \tfrac{L}{2}\|\theta^{(k)} - \theta^{(k-1)}\|^2 \quad (110)$$
Taking expectation (given $\theta^{(k-1)}$) on both sides w.r.t. the randomness in stochastic gradient descent, we obtain:
$$\mathbb{E}[F_S(h_{\theta^{(k)}}) \mid \theta^{(k-1)}] \le F_S(h_{\theta^{(k-1)}}) - \eta\langle \nabla_{\theta^{(k-1)}} F_S(h_{\theta^{(k-1)}}), \mathbb{E}[\widehat{\nabla}_{\theta^{(k-1)}} F_S(h_{\theta^{(k-1)}}) \mid \theta^{(k-1)}]\rangle \quad (111)$$
$$+ \tfrac{L\eta^2}{2}\,\mathbb{E}[\|\widehat{\nabla}_{\theta^{(k-1)}} F_S(h_{\theta^{(k-1)}})\|^2 \mid \theta^{(k-1)}] \quad (112)$$
$$\le F_S(h_{\theta^{(k-1)}}) - \eta\|\nabla_{\theta^{(k-1)}} F_S(h_{\theta^{(k-1)}})\|^2 + \tfrac{L\eta^2}{2}\nabla_{\max}^2 \quad (113)$$
The PL condition implies that:
$$\|\nabla_{\theta^{(k-1)}} F_S(h_{\theta^{(k-1)}})\|^2 \ge \sigma\big[F_S(h_{\theta^{(k-1)}}) - F_S(h_{\theta_S^*})\big] \quad (114)$$
where $\theta_S^* = \arg\min_{\theta'} F(h_{\theta'}, S \mid a_{\phi^*(\theta', S)})$. Substituting this into inequality (113), we obtain:
$$\mathbb{E}[F_S(h_{\theta^{(k)}}) \mid \theta^{(k-1)}] \le F_S(h_{\theta^{(k-1)}}) - \eta\sigma\big[F_S(h_{\theta^{(k-1)}}) - F_S(h_{\theta_S^*})\big] + \tfrac{L\eta^2\nabla_{\max}^2}{2} \quad (115)$$
Subtracting $F_S(h_{\theta_S^*})$ from both sides and simplifying gives us:
$$\mathbb{E}[F_S(h_{\theta^{(k)}}) \mid \theta^{(k-1)}] - F_S(h_{\theta_S^*}) \le (1-\eta\sigma)\big[F_S(h_{\theta^{(k-1)}}) - F_S(h_{\theta_S^*})\big] + \tfrac{L\eta^2\nabla_{\max}^2}{2} \quad (116)$$
Now, taking expectation given $\theta^{(k-2)}$, we get:
$$\mathbb{E}[F_S(h_{\theta^{(k)}}) \mid \theta^{(k-1)}, \theta^{(k-2)}] - F_S(h_{\theta_S^*}) \le (1-\eta\sigma)\big[\mathbb{E}[F_S(h_{\theta^{(k-1)}}) \mid \theta^{(k-2)}] - F_S(h_{\theta_S^*})\big] + \tfrac{L\eta^2\nabla_{\max}^2}{2} \quad (117)$$
Note that $\mathbb{E}[F_S(h_{\theta^{(k)}}) \mid \theta^{(k-1)}, \theta^{(k-2)}] = \mathbb{E}[F_S(h_{\theta^{(k)}}) \mid \theta^{(k-1)}]$, because $\theta^{(k)}$ is conditionally independent of $\theta^{(k-2)}$ when $\theta^{(k-1)}$ is given. Moreover, the term $\mathbb{E}[F_S(h_{\theta^{(k-1)}}) \mid \theta^{(k-2)}] - F_S(h_{\theta_S^*})$ can be bounded in the same manner as above to get:
$$\mathbb{E}[F_S(h_{\theta^{(k-1)}}) \mid \theta^{(k-2)}] - F_S(h_{\theta_S^*}) \le (1-\eta\sigma)\big[\mathbb{E}[F_S(h_{\theta^{(k-2)}}) \mid \theta^{(k-3)}] - F_S(h_{\theta_S^*})\big] + \tfrac{L\eta^2\nabla_{\max}^2}{2} \quad (118)$$
Let $\theta^{(0):(\tau)}$ denote $\theta^{(0)}, \ldots, \theta^{(\tau)}$.
Then, repeating the above procedure and simplifying yields:
$$\mathbb{E}_{\theta^{(0):(k-1)}}[F_S(h_{\theta^{(k)}})] - F_S(h_{\theta_S^*}) \le (1-\eta\sigma)^k\big[F_S(h_{\theta^{(0)}}) - F_S(h_{\theta_S^*})\big] + \tfrac{L\eta^2\nabla_{\max}^2}{2}\sum_{\tau=1}^{k}(1-\eta\sigma)^{\tau-1} \quad (119)$$
$$\le (1-\eta\sigma)^k\big[F_S(h_{\theta^{(0)}}) - F_S(h_{\theta_S^*})\big] + \tfrac{L\eta^2\nabla_{\max}^2}{2}\cdot\frac{1}{\eta\sigma} \quad (120)$$
$$\le \frac{(1-\eta\sigma)^k}{\sigma}\|\nabla_{\theta^{(0)}} F_S(h_{\theta^{(0)}})\|^2 + \frac{L\eta\nabla_{\max}^2}{2\sigma} \quad (121)$$
$$\le \frac{L(1-\eta\sigma)^k}{\sigma}\|\theta^{(0)} - \theta_S^*\|^2 + \frac{L\eta\nabla_{\max}^2}{2\sigma} \quad (122)$$
$$\le \frac{4L\theta_{\max}^2(1-\eta\sigma)^k}{\sigma} + \frac{L\eta\nabla_{\max}^2}{2\sigma} \quad (123)$$
where in inequality (121) we have used the PL condition, and in inequality (122) we have used the fact that $\|\nabla_{\theta^{(0)}} F_S(h_{\theta^{(0)}})\| = \|\nabla_{\theta^{(0)}} F_S(h_{\theta^{(0)}}) - \nabla_{\theta_S^*} F_S(h_{\theta_S^*})\| \le L\|\theta^{(0)} - \theta_S^*\|$ (using Lipschitz gradients). Finally, we note that by the definition of $\theta_S^*$, $F_S(h_{\bar\theta}) \ge F_S(h_{\theta_S^*})$. Thus, we may replace $F_S(h_{\theta_S^*})$ by $F_S(h_{\bar\theta})$ in the LHS to get the statement of the lemma. This completes the proof of the lemma.

Lemma 13 ($k$-SGD guarantee). Suppose Assumptions 1 and 2 hold and $F$ is a non-convex function that satisfies the PL condition. Let $\widehat\theta_t, \widehat{S}_t$ denote the iterates of our algorithm and suppose METHOD = $k$-SGD is used. Then, for any $\bar\theta \in \Theta$:
$$\mathbb{E}[F(h_{\widehat\theta_{t-1}}, \widehat{S}_{t-1} \mid a_{\phi^*(\widehat\theta_{t-1}, \widehat{S}_{t-1})})] - \mathbb{E}[F(h_{\bar\theta}, \widehat{S}_{t-1} \mid a_{\phi^*(\bar\theta, \widehat{S}_{t-1})})] \le 2k\eta\nabla_{\max}^2 + \frac{4L\theta_{\max}^2(1-\eta\sigma)^k}{\sigma} + \frac{L\eta\nabla_{\max}^2}{2\sigma}$$
where the expectation is taken w.r.t. the total randomness of the algorithm (stochastic gradient descent and stochastic distorted greedy).

Proof. In our algorithm with METHOD = $k$-SGD, $\widehat\theta_t$ is derived from $\widehat\theta_{t-1}$ using $k$ steps of SGD with the fixed set $\widehat{S}_{t-1}$. In this case, $\widehat\theta_{t-1} = \theta^{(0)}$, $\widehat\theta_t = \theta^{(k)}$, $S = \widehat{S}_{t-1}$, and $\theta^{(1)}, \ldots, \theta^{(k-1)}$ denote the intermediate $k$-SGD iterates. Hence, Lemma 12 holds and we obtain:
$$\mathbb{E}[F(h_{\widehat\theta_t}, \widehat{S}_{t-1} \mid a_{\phi^*(\widehat\theta_t, \widehat{S}_{t-1})}) \mid \widehat\theta_{t-1}, \theta^{(1)}, \ldots, \theta^{(k-1)}] - F(h_{\bar\theta}, \widehat{S}_{t-1} \mid a_{\phi^*(\bar\theta, \widehat{S}_{t-1})}) \le \frac{4L\theta_{\max}^2(1-\eta\sigma)^k}{\sigma} + \frac{L\eta\nabla_{\max}^2}{2\sigma}$$
Taking total expectation on both sides w.r.t. $\theta^{(1)}, \ldots$
$\theta^{(k-1)}$, using the law of total expectation and observing that the other terms in the equation are independent of these random variables, we obtain:
$$\mathbb{E}[F(h_{\widehat\theta_t}, \widehat{S}_{t-1} \mid a_{\phi^*(\widehat\theta_t, \widehat{S}_{t-1})}) \mid \widehat\theta_{t-1}] - F(h_{\bar\theta}, \widehat{S}_{t-1} \mid a_{\phi^*(\bar\theta, \widehat{S}_{t-1})}) \le \frac{4L\theta_{\max}^2(1-\eta\sigma)^k}{\sigma} + \frac{L\eta\nabla_{\max}^2}{2\sigma} \quad (126)$$
Using Lipschitzness of $F$ from Assumption 1, we obtain:
$$|F(h_{\widehat\theta_{t-1}}, \widehat{S}_{t-1} \mid a_{\phi^*(\widehat\theta_{t-1}, \widehat{S}_{t-1})}) - F(h_{\widehat\theta_t}, \widehat{S}_{t-1} \mid a_{\phi^*(\widehat\theta_t, \widehat{S}_{t-1})})| \le \nabla_{\max}\|\widehat\theta_{t-1} - \widehat\theta_t\|$$
$$= \nabla_{\max}\|\widehat\theta_{t-1} - \theta^{(1)} + \theta^{(1)} - \cdots + \theta^{(k-1)} - \widehat\theta_t\| \le 2k\eta\nabla_{\max}^2 \quad (127)$$
Adding and subtracting $F(h_{\widehat\theta_{t-1}}, \widehat{S}_{t-1} \mid a_{\phi^*(\widehat\theta_{t-1}, \widehat{S}_{t-1})})$ in inequality (126) gives us:
$$F(h_{\widehat\theta_{t-1}}, \widehat{S}_{t-1} \mid a_{\phi^*(\widehat\theta_{t-1}, \widehat{S}_{t-1})}) - F(h_{\bar\theta}, \widehat{S}_{t-1} \mid a_{\phi^*(\bar\theta, \widehat{S}_{t-1})})$$
$$\le F(h_{\widehat\theta_{t-1}}, \widehat{S}_{t-1} \mid a_{\phi^*(\widehat\theta_{t-1}, \widehat{S}_{t-1})}) - \mathbb{E}[F(h_{\widehat\theta_t}, \widehat{S}_{t-1} \mid a_{\phi^*(\widehat\theta_t, \widehat{S}_{t-1})}) \mid \widehat\theta_{t-1}] + \frac{4L\theta_{\max}^2(1-\eta\sigma)^k}{\sigma} + \frac{L\eta\nabla_{\max}^2}{2\sigma}$$
Now, using inequality (127), we obtain:
$$F(h_{\widehat\theta_{t-1}}, \widehat{S}_{t-1} \mid a_{\phi^*(\widehat\theta_{t-1}, \widehat{S}_{t-1})}) - F(h_{\bar\theta}, \widehat{S}_{t-1} \mid a_{\phi^*(\bar\theta, \widehat{S}_{t-1})}) \le 2k\eta\nabla_{\max}^2 + \frac{4L\theta_{\max}^2(1-\eta\sigma)^k}{\sigma} + \frac{L\eta\nabla_{\max}^2}{2\sigma}$$
Finally, taking expectation w.r.t. the total randomness of the algorithm (stochastic gradient descent and stochastic distorted greedy), we obtain:
$$\mathbb{E}[F(h_{\widehat\theta_{t-1}}, \widehat{S}_{t-1} \mid a_{\phi^*(\widehat\theta_{t-1}, \widehat{S}_{t-1})})] - \mathbb{E}[F(h_{\bar\theta}, \widehat{S}_{t-1} \mid a_{\phi^*(\bar\theta, \widehat{S}_{t-1})})] \le 2k\eta\nabla_{\max}^2 + \frac{4L\theta_{\max}^2(1-\eta\sigma)^k}{\sigma} + \frac{L\eta\nabla_{\max}^2}{2\sigma}$$

Lemma 14. Suppose the assumptions of Lemma 8 and Theorem 2 hold. Let $\kappa = \mu C_{\min}/(\lambda\theta_{\max}^2 + \rho\ell_{\max} + \mu C_{\min})$, where the symbols are the same as defined in the statement of Lemma 8, and $\gamma^*$ is as defined in Theorem 2. Then, for any $S$:
$$\Big(1 - \frac{e^{-\gamma^*} + \delta}{\kappa}\Big) F(h_{\widehat\theta_t}, S \mid a_{\phi^*(\widehat\theta_t, S)}) - \mathbb{E}[F(h_{\widehat\theta_t}, \widehat{S}_t \mid a_{\phi^*(\widehat\theta_t, \widehat{S}_t)})] \le 4k\eta\nabla_{\max}^2$$
where the expectation is w.r.t. the randomness in Stochastic Distorted Greedy.

Proof. To obtain our guarantees, we require $1 - (e^{-\gamma^*} + \delta)/\kappa \ge 0$. Substituting $\kappa = \mu C_{\min}/(\lambda\theta_{\max}^2 + \rho\ell_{\max} + \mu C_{\min})$ and simplifying gives us $\rho < [\mu C_{\min}((e^{-\gamma^*} + \delta)^{-1} - 1) - \lambda\theta_{\max}^2]/\ell_{\max}$. To ensure that this bound is positive, we require $\delta < 1 - e^{-\gamma^*}$.
In our algorithm, we obtain $\widehat{S}_t$ by applying Stochastic Distorted Greedy using $\widehat\theta_{t-1}$. Therefore, by our characterization of $F$ as stated in Eq. (6) and using Lemma 7, we have that for any $S$:
$$F(h_{\widehat\theta_{t-1}}, S \mid a_{\phi^*(\widehat\theta_{t-1}, S)}) \le \frac{\kappa}{\kappa - e^{-\gamma^*} - \delta}\, \mathbb{E}[F(h_{\widehat\theta_{t-1}}, \widehat{S}_t \mid a_{\phi^*(\widehat\theta_{t-1}, \widehat{S}_t)})] \quad (131)$$
Since $\widehat\theta_t$ is derived from $\widehat\theta_{t-1}$ using METHOD = $k$-SGD, we cannot use Lipschitzness directly. Instead, we use inequality (127). Doing so, we get that for all $t$ and $S$:
$$F(h_{\widehat\theta_t}, S \mid a_{\phi^*(\widehat\theta_t, S)}) \le F(h_{\widehat\theta_{t-1}}, S \mid a_{\phi^*(\widehat\theta_{t-1}, S)}) + 2k\eta\nabla_{\max}^2 \quad (132)$$
$$\le \frac{\kappa}{\kappa - e^{-\gamma^*} - \delta}\, \mathbb{E}[F(h_{\widehat\theta_{t-1}}, \widehat{S}_t \mid a_{\phi^*(\widehat\theta_{t-1}, \widehat{S}_t)})] + 2k\eta\nabla_{\max}^2 \quad (133)$$
$$\implies \Big(1 - \frac{e^{-\gamma^*} + \delta}{\kappa}\Big) F(h_{\widehat\theta_t}, S \mid a_{\phi^*(\widehat\theta_t, S)}) \le \mathbb{E}[F(h_{\widehat\theta_{t-1}}, \widehat{S}_t \mid a_{\phi^*(\widehat\theta_{t-1}, \widehat{S}_t)})] + \Big(1 - \frac{e^{-\gamma^*} + \delta}{\kappa}\Big) 2k\eta\nabla_{\max}^2 \quad (134)$$
where in inequality (133) we use inequality (131). Subtracting $\mathbb{E}[F(h_{\widehat\theta_t}, \widehat{S}_t \mid a_{\phi^*(\widehat\theta_t, \widehat{S}_t)})]$ from both sides, we obtain:
$$\Big(1 - \frac{e^{-\gamma^*} + \delta}{\kappa}\Big) F(h_{\widehat\theta_t}, S \mid a_{\phi^*(\widehat\theta_t, S)}) - \mathbb{E}[F(h_{\widehat\theta_t}, \widehat{S}_t \mid a_{\phi^*(\widehat\theta_t, \widehat{S}_t)})]$$
$$\le \mathbb{E}[F(h_{\widehat\theta_{t-1}}, \widehat{S}_t \mid a_{\phi^*(\widehat\theta_{t-1}, \widehat{S}_t)}) - F(h_{\widehat\theta_t}, \widehat{S}_t \mid a_{\phi^*(\widehat\theta_t, \widehat{S}_t)})] + \Big(1 - \frac{e^{-\gamma^*} + \delta}{\kappa}\Big) 2k\eta\nabla_{\max}^2 \quad (135)$$
Using Lipschitzness of $F$ from Assumption 1 (as in inequality (127)), we obtain:
$$|F(h_{\widehat\theta_{t-1}}, \widehat{S}_t \mid a_{\phi^*(\widehat\theta_{t-1}, \widehat{S}_t)}) - F(h_{\widehat\theta_t}, \widehat{S}_t \mid a_{\phi^*(\widehat\theta_t, \widehat{S}_t)})| \le \nabla_{\max}\|\widehat\theta_{t-1} - \widehat\theta_t\| \le 2k\eta\nabla_{\max}^2 \quad (136)$$
Using inequality (136) in inequality (135), we have:
$$\Big(1 - \frac{e^{-\gamma^*} + \delta}{\kappa}\Big) F(h_{\widehat\theta_t}, S \mid a_{\phi^*(\widehat\theta_t, S)}) - \mathbb{E}[F(h_{\widehat\theta_t}, \widehat{S}_t \mid a_{\phi^*(\widehat\theta_t, \widehat{S}_t)})] \le 2k\eta\nabla_{\max}^2 + \Big(1 - \frac{e^{-\gamma^*} + \delta}{\kappa}\Big) 2k\eta\nabla_{\max}^2$$
$$= \Big(2 - \frac{e^{-\gamma^*} + \delta}{\kappa}\Big) 2k\eta\nabla_{\max}^2 \le 4k\eta\nabla_{\max}^2$$
This completes the proof of the lemma.

G.2 IMPLEMENTATION DETAILS OF BASELINES

In all baselines, we used the ResNet18 architecture for CIFAR10 and ResNet9 for CIFAR100, with the last layers having 10 and 100 neurons respectively. For FMNIST, we use the LeNet architecture. For all the methods (including ours) which use the PGD attack during training, we keep the PGD attack parameters the same as ROGET (a φ = PGD) (details in the following subsection). Similar to ρ in our method, GAT, TRADES, Nu-AT, and MART also offer specific hyperparameters which can control the tradeoff between A clean and A robust. For all methods (including ours), we use the PGD attack as the assumed adversarial perturbation method during hyperparameter selection on the validation set.

GAT (Sriramanan et al., 2020). We used the code from the official repositoryfoot_1 . They provide two different codebases for CIFAR10 and MNIST. We used their MNIST code for experiments with the FMNIST dataset. For default hyperparameter selection (Table 1), we refer to the official repositoryfoot_2 in which they provide the value of l2_reg = 10 for CIFAR10 and l2_reg = 15 for MNIST (which we use for FMNIST). For worst-case hyperparameter selection (Table 2), we train GAT on a range of l2_reg values: {2.5, 5.0, 10.0, 15.0, 20.0, 30.0} for CIFAR10 and FMNIST, and {2.0, 5.0, 10.0, 20.0, 30.0} for CIFAR100. In their official code, they train GAT for 100 epochs on CIFAR10 and for 50 epochs on MNIST. Hence we train GAT for 100, 50, and 100 epochs on CIFAR10, FMNIST, and CIFAR100 respectively.

FBF (Wong et al., 2019). For CIFAR10 we used the code from the official repositoryfoot_3 . For FMNIST and CIFAR100, we implemented their code parallel to CIFAR10. The only changes (other than the architecture) were the mean and standard deviation, which we computed separately for FMNIST and CIFAR100. FBF does not have any tunable parameter and hence it does not undergo any hyperparameter selection. For CIFAR10 and CIFAR100, we train FBF for 80 epochs (as used in the official code for CIFAR10). For FMNIST, we train FBF for 10 epochs (as used in the official code for MNIST).

TRADES (Zhang et al., 2019b). We used the code from the official repositoryfoot_4 . They provide two different codebases for CIFAR10 and MNIST. We used their MNIST code for experiments with the FMNIST dataset.
For default hyperparameter selection (Table 1), we refer to the official repositoryfoot_5 in which they use β = 6.0 for CIFAR10 and β = 1.0 for MNIST (which we use for FMNIST). For worst-case hyperparameter selection (Table 2), we train TRADES on a range of β values: {0.1, 0.2, 0.4, 0.6, 0.8, 2.0, 4.0, 6.0} for CIFAR10, {0.4, 0.6, 0.8, 1.0, 2.0, 4.0, 6.0} for FMNIST, and {1.0, 2.0, 4.0, 6.0, 8.0} for CIFAR100. The optimizer, batch size, and learning rate are the same as those used in the official repository. We train TRADES for 120, 100, and 120 epochs on CIFAR10, FMNIST, and CIFAR100 respectively.

Nu-AT (Sriramanan et al., 2021). We used the code from the official repositoryfoot_6 . They only provide code for CIFAR10; hence, for running it on FMNIST, we modify the PGD parameters to those we used for FMNIST and set the number of epochs to 100. For default hyperparameter selection (Table 1), we refer to the supplementary material of Nu-AT, in which the authors mention that "We use a λ max of 4.5 for CIFAR-10 on ResNet-18 architecture and 4 for WideResNet-34-10. For MNIST we use λ max of 1...". Hence we use λ max = 4.5 for CIFAR10 and λ max = 1.0 for FMNIST. For worst-case hyperparameter selection (Table 2), we train Nu-AT on a range of λ max values: {2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0} for CIFAR10 and FMNIST, and {1.0, 2.0, 4.5, 6.0, 8.0} for CIFAR100. The optimizer, batch size, and learning rate are the same as those used in the official repository. We train Nu-AT for 120, 100, and 120 epochs on CIFAR10, FMNIST, and CIFAR100 respectively.

MART (Wang et al., 2019). We used the code from the official repositoryfoot_7 . They only provide code for CIFAR10; hence, for running it on FMNIST, we modify the PGD parameters and the number of epochs. For default hyperparameter selection (Table 1), we refer to the official repositoryfoot_8 in which they use β = 5.0 for CIFAR10. We were not able to find any mention of their hyperparameter values for MNIST either in their paper or their code.
Hence, using Figure 2(d) of their paper as a reference, we trained MART on FMNIST for β = {0.5, 1.0, 2.5, 5.0, 7.5, 10.0} and found that only β = 1.0 underwent effective training and gave A clean above 50%. Hence we chose β = 1.0 as the default value for FMNIST. For worst-case hyperparameter selection (Table 2), we train MART on a range of β values: {0.5, 1.0, 2.5, 5.0, 7.5, 10.0} for CIFAR10 and FMNIST, and {0.5, 1.0, 2.5, 5.0, 10.0} for CIFAR100. We train MART for 120, 100, and 120 epochs on CIFAR10, FMNIST, and CIFAR100 respectively.

PGD-AT (Madry et al., 2017). We could not find any official PyTorch implementation of PGD-AT. Therefore, we implemented it ourselves using the architecture mentioned above for each dataset. We use the SGD optimizer with a batch size of 128 for all datasets. For FMNIST, we use an initial learning rate of 0.01 and momentum of 0.9. For CIFAR10 and CIFAR100, we use an initial learning rate of 0.1 and momentum of 0.9. PGD-AT has no tunable parameter and hence it undergoes no hyperparameter selection. We train PGD-AT for 100 epochs on all three datasets.

RFGSM-AT (Tramèr et al., 2018). We could not find any official PyTorch implementation of RFGSM-AT either, so we implemented it ourselves using the architecture mentioned above for each dataset. RFGSM-AT also does not have a tunable parameter that controls the tradeoff between A robust and A clean, but it has a parameter α which affects its robust accuracy to a significant extent. We refer to their paper, in which they use α = ε/2 for ImageNet and MNIST. Hence, for default hyperparameter selection (Table 1), we use α = ε/2 for CIFAR10 and FMNIST. This value, however, gives very low robust accuracy for all the attacks. Hence, for worst-case hyperparameter selection (Table 2), we tune α such that RFGSM-AT achieves A robust above 40% on the PGD attack, which gives us α = 0.05. We use the SGD optimizer with a batch size of 128 for all datasets. For FMNIST, we use an initial learning rate of 0.01 and momentum of 0.9. For CIFAR10 and CIFAR100, we use an initial learning rate of 0.1 and momentum of 0.9. We train RFGSM-AT for 100 epochs on all three datasets.
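As a rough illustration of the R+FGSM perturbation described above (a random start followed by a single signed gradient step), here is a minimal pure-Python sketch on a toy logistic model. The model weights `W`, the helper `logistic_grad`, and all parameter values are illustrative assumptions, not the paper's actual implementation:

```python
import math
import random

# toy logistic model w.x with fixed, illustrative weights
W = [0.8, -0.5, 0.3]

def logistic_grad(x, y):
    """Gradient of the logistic loss w.r.t. the input x (toy model)."""
    z = sum(wi * xi for wi, xi in zip(W, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * wi for wi in W]

def rfgsm(x, y, grad_fn, eps=0.1, alpha=0.05, seed=0):
    """R+FGSM sketch: random start of size alpha, then one FGSM step of
    size eps - alpha, projected back into the L-infinity eps-ball."""
    rng = random.Random(seed)
    # random signed start inside the ball
    x0 = [xi + alpha * (1 if rng.random() < 0.5 else -1) for xi in x]
    g = grad_fn(x0, y)
    # one signed gradient-ascent step on the loss
    x_adv = [xi + (eps - alpha) * (1 if gi >= 0 else -1)
             for xi, gi in zip(x0, g)]
    # project to the eps-ball around the clean input
    return [min(max(xi, ci - eps), ci + eps) for xi, ci in zip(x_adv, x)]
```

The projection at the end guarantees the perturbation budget is respected regardless of the random start.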

G.3 IMPLEMENTATION DETAILS OF OUR METHOD

Hyperparameters of our method. For default hyperparameter selection (Table 1), we simply use ρ = 1 for all the datasets. For worst-case hyperparameter selection (Table 2), we train ROGET (a φ = PGD) on a range of ρ values: {0.01, 0.05, 0.5, 1.0, 2.0, 5.0, 8.0, 10.0} for CIFAR10, {0.01, 0.1, 0.5, 1.0, 1.5, 2.0, 4.0, 8.0} for FMNIST, and {2.0, 5.0, 10.0, 20.0, 30.0} for CIFAR100. For ROGET (a φ = AdvGAN), we train on ρ values: {0.01, 0.25, 0.5, 1.0, 1.5, 2.0} for CIFAR10, {0.001, 0.005, 0.01, 0.1, 0.5, 1.0, 1.5, 2.0} for FMNIST, and {0.005, 0.01, 0.25, 0.5, 1.0} for CIFAR100. The batch size was set to 128 for all datasets. We use k-SGD to train our method. Moreover, we use early stopping as follows: while running multiple training epochs for a fixed S, we stop training for the current S if the attacked set accuracy, A robust, drops on the validation set. Details about a φ . For ROGET (a φ = PGD), we set ε = 0.031, number of steps = 20, and step size = 0.007 for CIFAR10 and CIFAR100, while for FMNIST, we use ε = 0.3, number of steps = 40, and step size = 0.01. For ROGET (a φ = AdvGAN), we used the PyTorch implementation available in an online repositoryfoot_9 . The same architecture was used for all the datasets, with the only change being the number of input channels, which was set to 1 for FMNIST and 3 for CIFAR10 and CIFAR100. At the end of every iteration, we retrain AdvGAN for 15 epochs on the set S output by the stochastic distorted greedy (SDG) algorithm. Retraining AdvGAN every time we add a point to the set S inside SDG would increase the running time of SDG to such an extent that it becomes infeasible. For this reason, we instead retrain AdvGAN after SDG outputs a set S for the current iteration. This is equivalent to fine-tuning AdvGAN to attack the set S output by SDG.
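The PGD perturbation a φ used above (ε = 0.031, 20 steps, step size 0.007 for CIFAR10) can be sketched in a few lines of pure Python on a toy differentiable model. This is a minimal illustration of the projected-gradient loop, not the actual training code; the toy weights `W` and helper `logistic_grad` are assumptions for the sake of a runnable example:

```python
import math

def pgd_linf(x, y, grad_fn, eps=0.031, alpha=0.007, steps=20):
    """PGD sketch under an L-infinity ball: repeated signed gradient
    ascent on the loss, projecting back into the eps-ball each step.

    x: clean input (list of floats), y: label in {0, 1},
    grad_fn(x_adv, y): gradient of the loss w.r.t. the input.
    """
    x_adv = list(x)
    for _ in range(steps):
        g = grad_fn(x_adv, y)
        # signed ascent step on the loss
        x_adv = [xi + alpha * (1 if gi >= 0 else -1)
                 for xi, gi in zip(x_adv, g)]
        # project each coordinate back into [ci - eps, ci + eps]
        x_adv = [min(max(xi, ci - eps), ci + eps)
                 for xi, ci in zip(x_adv, x)]
    return x_adv

# toy logistic model w.x with fixed, illustrative weights
W = [0.8, -0.5, 0.3]

def logistic_grad(x, y):
    z = sum(wi * xi for wi, xi in zip(W, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * wi for wi in W]   # d loss / d x
```

In the actual method, `grad_fn` would be the input gradient of the network's loss computed by backpropagation; the loop structure is the same.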

G.4 DETAILS ABOUT ADVERSARIAL PERTURBATION

For the PGD attack, we used the same specifications used to train ROGET (a φ = PGD). More specifically, we set ε = 0.031, number of steps = 20, and step size = 0.007 for CIFAR10 and CIFAR100, while for FMNIST, we use ε = 0.3, number of steps = 40, and step size = 0.01. For Auto Attack, we use the standard version, which consists of untargeted APGD-CE, targeted APGD-DLR, targeted FAB, and Square attacks, each with the default parameters. For the Square attack, we set the number of queries to 1000 and use the ℓ∞ norm with ε = 0.031 for CIFAR10 and ε = 0.3 for FMNIST (the same ε as for the PGD attack above). For applying MI-FGSM in the black box setting, we take a source model trained on the chosen dataset, compute the perturbed sample using gradients of the source model, and then test the method on the obtained perturbed sample. Transfer-based black box attacks are weaker than white-box attacks; hence, we set the parameters of MI-FGSM to be stronger than those of the PGD attack. We set ε = 0.2, number of steps = 60, and step size = 0.007 for CIFAR10 and CIFAR100, and ε = 0.305, number of steps = 80, and step size = 0.01 for FMNIST.
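The MI-FGSM attack used above differs from PGD mainly in its L1-normalized gradient momentum. A minimal pure-Python sketch on a toy logistic model follows; the toy weights `W`, the helper `logistic_grad`, and the parameter defaults are illustrative assumptions rather than the exact attack code:

```python
import math

# toy logistic model w.x with fixed, illustrative weights
W = [0.8, -0.5, 0.3]

def logistic_grad(x, y):
    """Gradient of the logistic loss w.r.t. the input x (toy model)."""
    z = sum(wi * xi for wi, xi in zip(W, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * wi for wi in W]

def mi_fgsm(x, y, grad_fn, eps=0.2, alpha=0.007, steps=60, mu=1.0):
    """MI-FGSM sketch: accumulate an L1-normalized gradient momentum,
    take signed steps, and project back into the L-infinity eps-ball."""
    x_adv, g = list(x), [0.0] * len(x)
    for _ in range(steps):
        grad = grad_fn(x_adv, y)
        norm = sum(abs(v) for v in grad) or 1.0
        # momentum update with L1-normalized current gradient
        g = [mu * gi + vi / norm for gi, vi in zip(g, grad)]
        x_adv = [xi + alpha * (1 if gi >= 0 else -1)
                 for xi, gi in zip(x_adv, g)]
        # projection keeps the perturbation within the budget
        x_adv = [min(max(xi, ci - eps), ci + eps)
                 for xi, ci in zip(x_adv, x)]
    return x_adv
```

In the black box transfer setting described above, `grad_fn` would come from the source model rather than the model under attack.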

G.5 INFRASTRUCTURE DETAILS

We implement ROGET using Python 3.8 and PyTorch 1.10.1. The experiments were run on servers equipped with 2.9GHz CPUs, NVIDIA Quadro (48GB), NVIDIA RTX A6000 (48 GB), NVIDIA A40 (46 GB), NVIDIA Quadro RTX 8000 (49 GB) and NVIDIA TITAN RTX (24 GB) GPUs.

G.6 LICENSE

We collect the datasets from https://www.cs.toronto.edu/∼kriz/cifar.html (CIFAR10 and CI-FAR100), and https://www.kaggle.com/datasets/zalando-research/fashionmnist (Fashion-MNIST). These sources allow the use of datasets for research purposes. Furthermore, we use the following publicly available repositories-https://github.com/val-iisc/GAMA-GAT/ (GAT), https://github.com/locuslab/fast_adversarial/ (FBF), https://github.com/yaodongyu/TRADES/ (TRADES), https://github.com/val-iisc/NuAT/ (Nu-AT), https://github.com/YisenWang/MART (MART) to implement the baselines, https://github.com/mathcbc/advGAN_pytorch (AdvGAN) to implement AdvGAN, and https://submodlib.readthedocs.io/en/latest/functions/facilityLocation.html to solve the Facility Location problem in one of our additional experiments mentioned in section H below.

H ADDITIONAL EXPERIMENTS

H.1 RESULTS ON LOSS BASED HYPERPARAMETER SELECTION TECHNIQUE ON CIFAR10 AND FMNIST

We also explore another hyperparameter selection technique in which the learner assumes a subset selection strategy of the adversary. More specifically, the learner assumes a distribution over validation samples that is inversely proportional to the loss of a classifier trained on clean samples. Based on this strategy, an attack is simulated on the validation set and the hyperparameter giving the best overall accuracy is selected. We present the results for CIFAR10 and FMNIST in Table 7. For CIFAR10, we observe that our method achieves the highest overall accuracy for all attacks. For FMNIST, TRADES has a better overall accuracy on the AA and Square attacks by a small margin.

In the next experiment, we evaluate various hyperparameters of the baselines and plot their A clean vs A robust in Figures 12 and 13. Each point on the plot represents one hyperparameter value of the method. Here the adversary uses uncertainty based subset selection, where the true subset chosen for attack S latent consists of the top 10% test instances in terms of the uncertainty of a classifier trained on all the clean examples. We observe that our method forms a Pareto-optimal front for all the attacks and hence achieves a better trade-off between A clean and A robust. We also notice that the baselines do not follow a regular trend and are highly sensitive to their tunable parameter. In that respect, our method is relatively stable with respect to ρ.

On the CIFAR10 dataset, we try different values of ρ for the two variants of our method, i.e., ROGET (a φ = PGD) and ROGET (a φ = AdvGAN). We probe the variation of A clean, A robust, and A vs. ρ for two attacks: AA (standard) and black-box MI-FGSM. The plot for ROGET (a φ = PGD) is shown in Figure 14. For each value of ρ, we test using 5 random attack-clean (1:9) splits of the test set and report the mean accuracy with standard deviation. The plot for ROGET (a φ = AdvGAN) is shown in Figure 15. We make the following observations: (1) For the white box attack (AA), the standard deviation from the mean is ±1.07% across all values of ρ and across all models. For the black box MI-FGSM attack, the standard deviation rises to ±3.34%. (2) For ROGET (a φ = PGD), A robust decreases as ρ increases (except at ρ = 8). (3) For ROGET (a φ = AdvGAN), we observe that A robust rises slightly before decreasing as ρ increases.

Next, we present results on the CIFAR100 dataset, where the model is ResNet9. In the default hyperparameter setting, we use ρ = 0.5 for ROGET (a φ = PGD) and ρ = 2.0 for ROGET (a φ = AdvGAN). For the baselines, we choose the same default hyperparameters as for CIFAR10. The results can be seen in Table 19. We also report the results of the worst-case hyperparameter setting in Table 20. We observe that our method outperforms all the baselines in terms of clean, robust (except on the PGD attack), and overall accuracies.

Finally, we present more results related to Table 21. Specifically, we first tune the hyperparameters of all the methods so that the overall accuracy of every method reaches a given threshold, and then compare their robustness. If P denotes the hyperparameters, then we find max_P A robust(P) such that A(P) ≥ a for a given a. We present the results for CIFAR10 with a = 0.81 and FMNIST with a = 0.83 in Table 21 for different attacks. For CIFAR10, we find that ROGET (a φ = PGD) is the best performer in terms of robust accuracy and ROGET (a φ = AdvGAN) is the best performer in terms of overall accuracy (except for the MI-FGSM attack). Moreover, ROGET (a φ = AdvGAN) is the second-best performer in terms of robust accuracy. For FMNIST, ROGET (a φ = AdvGAN) achieves the highest robust accuracy A robust (except for the PGD attack).







Figure 5: A vs. |S latent |

Algorithm (ROGET). Input: training instances D, regularization parameter λ, budget b, number of iterations T, learning rate η, METHOD ∈ {GD, SGD, k-SGD}.
1: INIT(h_θ), S_0 ← ∅
2: for t = 0 to T − 1 do
3:   θ_{t+1} ← TRAIN(θ_t, S_t, η_t, METHOD)
4:   S_{t+1} ← SDG(G_λ, m_λ, θ_{t+1}, b)
5: θ̄ ← Σ_{t=1}^{T} η_t θ_t / Σ_{t=1}^{T} η_t
6: return θ̄, S_T
1: procedure SDG(G_λ, m_λ, θ, b) ← (1 − γ/b)
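The SDG routine in the listing above is truncated here. As a rough illustration only (not the paper's exact procedure), a stochastic distorted greedy step for maximizing a difference g(S) − m(S) under a cardinality budget can be sketched as follows; the set functions `g`, `m` and all parameter names are illustrative stand-ins:

```python
import math
import random

def stochastic_distorted_greedy(V, g, m, b, gamma=1.0, delta=0.1, seed=0):
    """Sketch of stochastic distorted greedy for max g(S) - m(S), |S| <= b.

    V: ground set (list), g: monotone set function, m: per-element
    modular cost. At step i, a random sample of candidates is scored by
    the distorted marginal gain (1 - gamma/b)^(b-i-1) * gain - cost.
    """
    rng = random.Random(seed)
    # sample size (n/b) * log(1/delta), as in stochastic greedy variants
    s = max(1, int(math.ceil((len(V) / b) * math.log(1.0 / delta))))
    S = set()
    for i in range(b):
        rest = [v for v in V if v not in S]
        if not rest:
            break
        sample = rng.sample(rest, min(s, len(rest)))
        distortion = (1.0 - gamma / b) ** (b - i - 1)
        gS = g(S)
        best = max(sample, key=lambda v: distortion * (g(S | {v}) - gS) - m(v))
        # add only if the distorted gain beats the modular cost
        if distortion * (g(S | {best}) - gS) - m(best) > 0:
            S.add(best)
    return S
```

The distortion factor shrinks early marginal gains, which is what yields approximation guarantees for differences of a weakly submodular function and a modular one.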

Figure 12: Trade off between A clean vs A robust for all methods for whitebox attacks (PGD, AA) on CIFAR10. Here, the adversary adopts uncertainty based subset selection to perform attack, where the true subset chosen for attack S latent consists of top 10% test instances in terms of the uncertainty of a classifier trained on all the clean examples.

Figure 14: Variation of A robust, A clean, and A vs. ρ of ROGET (a φ = PGD) on the CIFAR10 dataset for two attacks, viz., Auto Attack and MI-FGSM. For each value of ρ, we test using uncertainty based subset selection and report the mean of A robust, A clean, and A, with error bars showing the standard deviation.



A for label based subset selection with AA attack on CIFAR10. Green (Yellow) indicates the (second) best method.

Revealing the oracle subset selection strategy (uncertainty) on A for PGD attack on CIFAR10.

Comparison of A robust, subject to A clean > 0.81.

Zachary Charles and Dimitris Papailiopoulos. Stability and generalization of learning algorithms that converge to global optima. In International Conference on Machine Learning, pp. 745-754. PMLR, 2018.
Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In ACM Workshop on Artificial Intelligence and Security (AISec@CCS), pp. 15-26, 2017.
Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pp. 1310-1320. PMLR, 2019.
Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International Conference on Machine Learning (ICML), pp. 2206-2216. PMLR, 2020.
Abir De, Paramita Koley, Niloy Ganguly, and Manuel Gomez-Rodriguez. Regression under human assistance. AAAI, 2020.

The idea is motivated by the assumption that adversarially perturbed instances follow a different distribution than natural (unperturbed) instances. Recent works have considered assigning different weights to different classes based on the underlying loss to ensure robustness against adversarial perturbation (Anonymous, 2022; Tian et al., 2021; Wang et al., 2021; Leavitt & Morcos, 2020). Tian et al. (2021) pointed out that accuracy across different classes varies significantly during adversarial training. Leavitt & Morcos (2020) showed that increasing class selectivity improves worst-case perturbation robustness, while decreasing class selectivity improves average-case perturbation robustness. Wang et al. (2021) proposed Separable Reweighted Adversarial Training (SRAT), which assigns weights to different instances to learn separable features for imbalanced datasets. A recent work (Anonymous, 2022) develops adversarial training approaches that attempt to control the worst possible loss across different classes.

For CIFAR10, we use 40000 training examples, 10000 validation examples, and 10000 test examples. For Fashion MNIST, we use 50000 training examples, 10000 validation examples, and 10000 test examples. For CIFAR100, we use 40000 training examples, 10000 validation examples, and 10000 test examples. The same train, validation, and test split is used for the baselines as well. In all cases, unless otherwise mentioned, we consider |S| ≤ b = 0.1|D| during training. Similarly, during test we use 10% of the test instances for attack. The exact way of drawing these 10% test instances varies across experiments and is mentioned therein.

Performance comparison under loss based hyperparameter selection. Here, the adversary adopts uncertainty based subset selection to perform the attack, where the true subset chosen for attack S latent consists of the top 10% test instances in terms of the uncertainty of a classifier trained on all the clean examples. Numbers in green (yellow) indicate the best (second best) performers.

H.2 EVALUATION ON LABEL BASED SUBSET SELECTION STRATEGY

We present the complete set of results for the label based subset selection strategy on CIFAR10 in Tables 8 and 9. Note that we use the worst-case hyperparameter setting for all the methods. We observe that our method achieves the highest overall accuracy for all classes and for all attacks.

A on label based subset selection strategy on CIFAR10 under white box (PGD, AA) attacks. Here, the attacked subset selection is based on the uncertainty of a vanilla classifier (h v ) on the test samples. For all the methods, we perform worst-case hyperparameter selection.

A on label based subset selection strategy on CIFAR10 under black box (Square, MI-FGSM) attacks. Here, the attacked subset selection is based on the uncertainty of a vanilla classifier (h v ) on the test samples. For all the methods, we perform worst-case hyperparameter selection.

H.5 TRADE OFF BETWEEN A clean AND A robust

Performance comparison under default hyperparameter setting on CIFAR100. We report (percentage) (i) accuracy on the clean examples A clean , (ii) robustness to the adversarial perturbations A robust and (iii) overall accuracy A. Here, the adversary adopts uncertainty-based subset selection to perform attack, where the true subset chosen for attack S latent consists of top 10% test instances in terms of the uncertainty of a classifier trained on all the clean examples.

Performance comparison under worst-case hyperparameter setting on CIFAR100. We report (percentage) (i) accuracy on the clean examples A clean, (ii) robustness to the adversarial perturbations A robust, and (iii) overall accuracy A. Here, the adversary adopts uncertainty-based subset selection to perform the attack, where the true subset chosen for attack S latent consists of the top 10% test instances in terms of the uncertainty of a classifier trained on all the clean examples.

H.11 COMPARISON OF ROBUST ACCURACY SUBJECT TO A MINIMUM OVERALL ACCURACY

7. REPRODUCIBILITY STATEMENT

We have provided our code in the supplementary material; the implementation details are given in Appendix G.

H.3 DISCLOSING THE SUBSET SELECTION STRATEGY TO THE BASELINES

In this experiment, we reveal the adversary's subset selection strategy (uncertainty based) to the baselines during hyperparameter selection. We select the hyperparameter of each baseline which has the best overall accuracy on the validation set under the revealed subset selection strategy. The results are presented in Table 10. Comparing with the results in Table 2, we see that GAT, TRADES, Nu-AT, and MART have improved, i.e., all the baselines which had a tunable hyperparameter have become better. More importantly, our method still achieves the best overall accuracy across all attacks except PGD.

H.4 WORST CASE OVERALL ACCURACY

In this experiment, we choose R = 10000 subsets {S_j}_{j=1}^R uniformly at random from D_Test and report the minimum A along with the corresponding A clean and A robust. We use default hyperparameter selection and report the results for CIFAR10 and FMNIST in Table 11. We make the following observations: (1) Our method achieves the best min A for the PGD, AA, and MI-FGSM attacks on CIFAR10, and the PGD, Square, and MI-FGSM attacks on FMNIST. (2) There is no clear winner among the baselines. RFGSM-AT has a good A on CIFAR10 but poor A robust. For FMNIST, GAT achieves the highest A for AA; however, this is purely because of its A clean, as its A robust is 0%.
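The worst-case overall accuracy above can be computed cheaply once per-example clean and robust correctness indicators are available: on a random attacked subset S, the overall accuracy swaps clean correctness for robust correctness on S. A minimal sketch (the function name and inputs are illustrative, not the paper's code):

```python
import random

def worst_case_overall_accuracy(clean_ok, robust_ok, b, R=1000, seed=0):
    """Estimate min over R random attacked subsets S (|S| = b) of the
    overall accuracy A = (# robust-correct in S + # clean-correct
    outside S) / n. clean_ok / robust_ok are 0/1 correctness lists."""
    rng = random.Random(seed)
    n = len(clean_ok)
    total_clean = sum(clean_ok)
    worst = 1.0
    for _ in range(R):
        S = rng.sample(range(n), b)
        # swap clean correctness for robust correctness on the subset
        a = (total_clean + sum(robust_ok[i] - clean_ok[i] for i in S)) / n
        worst = min(worst, a)
    return worst
```

The swap form avoids re-scoring the model per subset, so even R = 10000 subsets cost only R·b index operations.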

H.7 RUN-TIME AND MEMORY ANALYSIS

In this section, we present details about the training time and the maximum GPU memory required for all methods while training on CIFAR10. Note that the time reported for our method includes both the time taken for a gradient step and for a stochastic greedy step. Here, the size of the attacked subset was |S| = 0.1|D|, where D denotes the training set. The results are presented in Table 16.

In the next experiment, we take the final set S_T from our algorithm for CIFAR10 and find the b = 0.1|D| samples with the highest loss. Then, we compute the corresponding nearest samples from the test set to obtain the most vulnerable test set. We attack this set for all the methods and show the results in Table 17. We observe that our method makes a good trade-off between accuracy and robustness. Although RFGSM-AT is best in terms of overall accuracy, its robustness is very poor.

We present the results of worst-case hyperparameter selection for FMNIST in Table 18. Here, we achieve the best overall accuracy A for three attacks and the second-best overall accuracy for the rest.

Finally, we train both variants of ROGET using b = 0.1|D Tr| and evaluate using a varying number of instances |S latent| perturbed during test. We already reported the results for CIFAR10 in the main paper. In Figure 22, we report the results for FMNIST. In a related setting, we provide a little information about the proportion of instances that are going to be attacked. Figure 23 summarizes the results, which show that ROGET (a φ = AdvGAN) outperforms the baselines for a wide range of sizes of attacked test instances.

