IMPROVING LOCAL EFFECTIVENESS FOR GLOBAL ROBUST TRAINING

Anonymous

Abstract

Despite their popularity, deep neural networks are easily fooled. To alleviate this deficiency, researchers are actively developing new training strategies that encourage models to be robust to small input perturbations. Several successful robust training methods have been proposed. However, many of them rely on strong adversaries, which can be prohibitively expensive to generate when the input dimension is high and the model structure is complicated. We adopt a new perspective on robustness and propose a novel training algorithm that makes more effective use of adversaries. Our method improves model robustness on each local ball centered around an adversary and then, by combining these local balls through a global term, achieves overall robustness. We demonstrate that, by maximizing the use of adversaries via focusing on local balls, we achieve high robust accuracy with weak adversaries. Specifically, our method reaches a robust accuracy level similar to that of state-of-the-art approaches trained on strong adversaries on MNIST, CIFAR-10 and CIFAR-100. As a result, the overall training time is reduced. Furthermore, when trained with strong adversaries, our method matches the current state of the art on MNIST and outperforms it on CIFAR-10 and CIFAR-100.

1. INTRODUCTION

With the proliferation of deep neural networks (DNNs) in areas including computer vision, natural language processing and speech recognition, there has been a growing concern over their safety. For example, Szegedy et al. (2013) demonstrated that naturally trained DNNs are in fact fragile. By adding to each input a perturbation that is carefully designed but imperceptible to humans, DNNs that previously reached almost 100% accuracy could hardly make a correct prediction any more. This could cause serious issues in areas such as autonomous navigation or personalised medicine, where an incorrect decision can endanger life. To tackle these issues, training DNNs that are robust to small perturbations has become an active area of research in machine learning. Various algorithms have been proposed (Papernot et al., 2016; Kannan et al., 2018; Zhang et al., 2019b; Qin et al., 2019; Moosavi-Dezfooli et al., 2020; Madry et al., 2018; Ding et al., 2020). Among them, adversarial training (ADV) (Madry et al., 2018) and TRADES (Zhang et al., 2019b) are two of the most frequently used training methods so far. Although developed from different ideas, both methods require strong adversarial attacks, generally computed through several steps of projected gradient descent. Such attacks can quickly become prohibitive as model complexity and input dimension increase, thereby limiting their applicability. Since the cost of finding strong adversaries is mainly due to the high number of gradient steps performed, one potential way to alleviate the problem is to use cheap but weak adversaries. Weak adversaries are obtained using fewer gradient steps, in the extreme case a single gradient step. Based on this idea, Wong et al. (2020) argue that, with random initialization and a larger step size, adversarial training with weak adversaries found via one gradient step is sufficient to achieve a satisfactory level of robustness.
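As a rough illustration (not the authors' implementation), the single-gradient-step attack with random initialization described by Wong et al. (2020) can be sketched in plain Python. The function name `one_step_attack` and the toy linear loss are our own illustrative choices; `epsilon` bounds the l_inf perturbation and `alpha` is the (larger) step size.

```python
import random

def one_step_attack(x, grad_fn, epsilon, alpha):
    """Sketch of a one-step l_inf attack with random initialization.

    x: input as a list of floats.
    grad_fn(x_adv): returns the gradient of the loss w.r.t. the input.
    epsilon: radius of the l_inf ball; alpha: step size (may exceed epsilon).
    """
    # Random start drawn uniformly inside the epsilon-ball
    # (the key ingredient highlighted by Wong et al., 2020).
    delta = [random.uniform(-epsilon, epsilon) for _ in x]
    g = grad_fn([xi + di for xi, di in zip(x, delta)])
    sign = lambda v: (v > 0) - (v < 0)
    # One signed-gradient ascent step, then project back onto the ball.
    delta = [max(-epsilon, min(epsilon, di + alpha * sign(gi)))
             for di, gi in zip(delta, g)]
    return [xi + di for xi, di in zip(x, delta)]

# Toy usage: a linear "loss" whose gradient is the constant vector w,
# so grad_fn ignores its argument.
w = [1.0, -2.0, 0.5]
x = [0.0, 0.0, 0.0]
x_adv = one_step_attack(x, lambda z: w, epsilon=0.1, alpha=0.125)
```

Because `alpha` exceeds `epsilon` here, each coordinate of the resulting perturbation ends up clipped to the boundary region of the ball with the sign of the corresponding gradient entry; a multi-step PGD attack would repeat the step-and-project loop several times instead.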
We term this method one-step ADV from now on. While one-step ADV does indeed exhibit robustness, there is still a noticeable gap when compared with its multi-step counterpart. In this paper, we further bridge this gap by proposing a new robust training algorithm: Adversarial Training via LocAl Stability (ATLAS). Local stability, in our context, refers to stability of prediction and coincides with local robustness. Specifically, we make the following contributions:

