ROBUSTNESS GUARANTEES FOR ADVERSARIALLY TRAINED NEURAL NETWORKS

Abstract

We study robust adversarial training of two-layer neural networks with the Leaky ReLU activation function as a bi-level optimization problem. In particular, for the inner loop that implements the PGD attack, we propose maximizing a lower bound on the 0/1-loss by reflecting a surrogate loss about the origin. This allows us to give a convergence guarantee for the inner-loop PGD attack and precise iteration complexity results for end-to-end adversarial training, which hold for any width and initialization in a realizable setting. We provide empirical evidence to support our theoretical results.

1. INTRODUCTION

Despite the tremendous success of deep learning, neural network-based models are highly susceptible to small, imperceptible, adversarial perturbations of data at test time (Szegedy et al., 2014). Such vulnerability to adversarial examples imposes severe limitations on the deployment of neural network-based systems, especially in critical, high-stakes applications such as autonomous driving, where safe and reliable operation is paramount. An abundance of studies demonstrating adversarial examples across different tasks and application domains (Goodfellow et al., 2014; Moosavi-Dezfooli et al., 2016; Carlini & Wagner, 2017) has led to a renewed focus on robust learning as an active area of research within machine learning. The goal of robust learning is to find models that yield reliable predictions on test data notwithstanding adversarial perturbations. A principled approach to training models that are robust to adversarial examples, which has emerged in recent years, is adversarial training (Madry et al., 2018). Adversarial training formulates learning as a min-max optimization problem wherein the 0/1 classification loss is replaced by a convex surrogate such as the cross-entropy loss, and alternating optimization techniques are used to solve the resulting saddle-point problem.

Despite the empirical success of adversarial training, our understanding of its theoretical underpinnings remains limited. From a practical standpoint, it is remarkable that gradient-based techniques can efficiently solve both the inner maximization problem, to find adversarial examples, and the outer minimization problem, to impart robust generalization. On the other hand, a theoretical analysis is challenging because (1) both the inner- and outer-level optimization problems are non-convex, and (2) it is unclear a priori whether solving the min-max optimization problem would even guarantee robust generalization. In this work, we seek to understand adversarial training better.
In particular, under a margin separability assumption, we provide robust generalization guarantees for two-layer neural networks with Leaky ReLU activation trained using adversarial training. Our key contributions are as follows.
1. We identify a disconnect between the robust learning objective and the min-max formulation of adversarial training. This observation inspires a simple modification of adversarial training: we propose reflecting the surrogate loss about the origin in the inner maximization phase when searching for an "optimal" perturbation vector to attack the current model.
2. We provide convergence guarantees for PGD attacks on two-layer neural networks with Leaky ReLU activation. To the best of our knowledge, this is the first result of its kind.
3. We give global convergence guarantees and establish learning rates for adversarial training of two-layer neural networks with the Leaky ReLU activation function. Notably, our guarantees hold for any bounded initialization and any width, a property absent from previous works in the neural tangent kernel (NTK) regime (Gao et al., 2019; Zhang et al., 2020).
4. We provide extensive empirical evidence showing that reflecting the surrogate loss in the inner loop does not have a significant impact on the test-time performance of adversarially trained models.
Notation. We denote matrices, vectors, scalar variables, and sets by Roman capital letters, Roman lowercase letters, lowercase letters, and uppercase script letters, respectively (e.g., X, x, x, and 𝒳). For any integer d, we represent the set {1, . . . , d} by [d]. The ℓ2-norm of a vector x and the Frobenius norm of a matrix X are denoted by ‖x‖ and ‖X‖_F, respectively. Given a set 𝒞, the operator Π_𝒞(x) = argmin_{x′∈𝒞} ‖x − x′‖ projects onto the set 𝒞 with respect to the ℓ2-norm.
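To make the reflection idea concrete, the sketch below uses the base-2 logistic surrogate as an illustrative stand-in (the paper's exact surrogate and reflection are specified in Section 2, so the specific functions here are our assumption). Point-reflecting a normalized surrogate upper bound about the origin yields a lower bound on the 0/1 loss, so the inner loop then maximizes a lower bound on the attacker's true objective:

```python
import math

def ell(z):
    # base-2 logistic surrogate: an upper bound on the 0/1 loss of the margin z
    return math.log2(1.0 + 2.0 ** (-z))

def ell_reflected(z):
    # point reflection of ell about the origin: a lower bound on the 0/1 loss
    return -ell(-z)

def zero_one(z):
    # 0/1 loss as a function of the margin
    return 1.0 if z <= 0 else 0.0

# sandwich check: reflected <= 0/1 <= surrogate at every margin value
for z in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    assert ell_reflected(z) <= zero_one(z) <= ell(z)
    print(f"z={z:+.1f}  reflected={ell_reflected(z):+.3f}  0/1={zero_one(z):.0f}  surrogate={ell(z):+.3f}")
```

Note that both the surrogate and its reflection are monotonically decreasing in the margin, so maximizing the reflected loss over perturbations still drives the margin down, while providing a guaranteed lower bound on the 0/1 loss the attack achieves.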
Under specific initialization and width requirements, Gao et al. (2019) and Zhang et al. (2020) guarantee small robust training error with respect to the attack that is used in the inner loop, without explicitly analyzing the convergence of the attack. Gao et al. (2019) assume that the activation function is smooth and require that the width of the network, as well as the overall computational cost, be exponential in the input dimension. The work of Zhang et al. (2020) partially addresses these issues: their results hold for ReLU neural networks, and they only require the width and the computational cost to be polynomial in the input parameters. Our work differs from that of Gao et al. (2019) and Zhang et al. (2020) in several ways; here we highlight three key differences.
• First, while the prior work analyzes convergence in the NTK setting with specific initialization and width requirements, our results hold for any initialization and width.
• Second, none of the prior works studies the computational aspects of finding an optimal attack vector in the inner loop; instead, they assume oracle access to optimal attack vectors. We provide precise iteration complexity results for the projected gradient method (i.e., for the PGD attack) for finding near-optimal attack vectors.
• Third, the prior works focus on minimizing the robust training loss, whereas we provide computational learning guarantees on the robust generalization error.
The rest of the paper is organized as follows. In Section 2, we give the problem setup and introduce the adversarial training procedure with the reflected surrogate loss in the inner loop. In Section 3, we present our main results, discuss their implications, and give a proof sketch. We support our theory with empirical results in Section 4 and conclude with a discussion in Section 5.

2. PRELIMINARIES

We focus on two-layer networks with m hidden nodes computing f(x; a, W) = aᵀσ(Wx), where W ∈ ℝ^{m×d} and a ∈ ℝ^m are the weights of the first and second layers, respectively, and σ(z) = max{αz, z} is the Leaky ReLU activation function. We randomly initialize the weights a and W such that ‖a‖_∞ ≤ λ and ‖W‖_F ≤ ω. The top linear layer (i.e., the weights a) is kept fixed, and the hidden layer (i.e., W) is trained using stochastic gradient descent (SGD). For simplicity of notation, we represent the network as f(x; W), suppressing the dependence on the top-layer weights. Further, with a slight abuse of notation, we also denote the function by f_W(x).
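As a minimal illustration of the inner loop on this architecture, the pure-Python sketch below runs a PGD attack as projected gradient ascent over an ℓ2 ball. The step size, radius, leaky slope, and toy weights are illustrative assumptions; for simplicity the sketch steps directly against the margin y·f(x + δ), which has the same ascent direction as maximizing a reflected surrogate that is monotone in the margin.

```python
import math

ALPHA = 0.1  # Leaky ReLU slope alpha (illustrative value)

def leaky(z):
    return z if z > 0 else ALPHA * z

def f(x, a, W):
    # two-layer network f(x; a, W) = a^T sigma(W x)
    return sum(a[j] * leaky(sum(W[j][i] * x[i] for i in range(len(x))))
               for j in range(len(a)))

def grad_x(x, a, W):
    # gradient of f w.r.t. the input: sum_j a_j * sigma'(<w_j, x>) * w_j
    # (at the kink we take the subgradient with slope ALPHA)
    g = [0.0] * len(x)
    for j in range(len(a)):
        pre = sum(W[j][i] * x[i] for i in range(len(x)))
        slope = 1.0 if pre > 0 else ALPHA
        for i in range(len(x)):
            g[i] += a[j] * slope * W[j][i]
    return g

def project(delta, eps):
    # Pi_C: Euclidean projection onto the l2 ball of radius eps
    n = math.sqrt(sum(d * d for d in delta))
    return delta if n <= eps else [eps * d / n for d in delta]

def pgd_attack(x, y, a, W, eps, eta=0.1, steps=50):
    # projected gradient method: each step moves delta against the
    # margin y * f(x + delta), then projects back onto the eps-ball
    delta = [0.0] * len(x)
    for _ in range(steps):
        g = grad_x([xi + di for xi, di in zip(x, delta)], a, W)
        delta = project([di - eta * y * gi for di, gi in zip(delta, g)], eps)
    return delta

# toy instance: m = 2 hidden units, d = 2 inputs
W = [[1.0, 0.0], [0.0, 1.0]]
a = [1.0, -1.0]
x, y, eps = [1.0, 0.5], 1, 0.5

delta = pgd_attack(x, y, a, W, eps)
x_adv = [xi + di for xi, di in zip(x, delta)]
print(f"clean margin    {y * f(x, a, W):+.3f}")      # +0.500
print(f"attacked margin {y * f(x_adv, a, W):+.3f}")  # about -0.207: the label flips
```

On this toy instance the iterates settle on the boundary of the ε-ball in the steepest-descent direction of the margin, and the attacked margin becomes negative, i.e., the attack succeeds.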



1.1 RELATED WORK

Linear models. Adversarial training of linear models was recently studied by Charles et al. (2019); Li et al. (2020); Zou et al. (2021). In particular, Charles et al. (2019) and Li et al. (2020) give robust generalization error guarantees for adversarially trained linear models under a margin separability assumption. The hard-margin assumption was relaxed by Zou et al. (2021), who give robust generalization guarantees for distributions with agnostic label noise. We note that the optimal attack for linear models has a simple closed-form expression, which mitigates the challenge of analyzing the inner-loop PGD attack. In contrast, one of our main contributions is to give convergence guarantees for the PGD attack. Nonetheless, as the Leaky ReLU activation function realizes the identity map for α = 1, our results also provide robust generalization error guarantees for training linear models.

Non-linear models. Wang et al. (2019) propose a first-order stationarity condition to evaluate the convergence quality of adversarial attacks found in the inner loop. Zhang et al. (2021) study adversarial training as a bi-level optimization problem and propose a principled approach towards the design of fast adversarial training algorithms. Most related to our results are the works of Gao et al. (2019) and Zhang et al. (2020), which study the convergence of adversarial training in non-linear neural networks.
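The closed-form attack for linear models mentioned above is easy to check numerically. The sketch below (with our own toy numbers) verifies that for f(x) = ⟨w, x⟩, the perturbation δ = −εyw/‖w‖ minimizes the margin y⟨w, x + δ⟩ over the ℓ2 ball of radius ε, which follows from the Cauchy–Schwarz inequality:

```python
import math
import random

def margin(w, x, y):
    # margin of the linear model f(x) = <w, x> on example (x, y)
    return y * sum(wi * xi for wi, xi in zip(w, x))

def optimal_linear_attack(w, y, eps):
    # closed form: y<w, x + delta> is minimized at delta = -eps * y * w / ||w||
    n = math.sqrt(sum(wi * wi for wi in w))
    return [-eps * y * wi / n for wi in w]

random.seed(0)
w = [0.8, -0.6, 0.2]
x = [1.0, 0.3, -0.5]
y, eps = 1, 0.25

delta_star = optimal_linear_attack(w, y, eps)
best = margin(w, [xi + di for xi, di in zip(x, delta_star)], y)

# sanity check: no random perturbation on the eps-ball boundary does better
for _ in range(1000):
    d = [random.gauss(0, 1) for _ in w]
    n = math.sqrt(sum(di * di for di in d))
    d = [eps * di / n for di in d]
    m = margin(w, [xi + di for xi, di in zip(x, d)], y)
    assert best <= m + 1e-9

print(f"clean margin {margin(w, x, y):.3f} -> attacked margin {best:.3f}")
```

The attacked margin equals the clean margin minus ε‖w‖ exactly, which is what makes the linear inner loop trivial compared to the non-linear case analyzed in this paper.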

