ARMOURED: ADVERSARIALLY ROBUST MODELS USING UNLABELED DATA BY REGULARIZING DIVERSITY

Abstract

Adversarial attacks pose a major challenge for modern deep neural networks. Recent advances show that adversarially robust generalization requires a large amount of labeled data for training. If annotation becomes a burden, can unlabeled data help bridge the gap? In this paper, we propose ARMOURED, an adversarially robust training method based on semi-supervised learning that consists of two components. The first component applies multi-view learning to simultaneously optimize multiple independent networks and utilizes unlabeled data to enforce labeling consistency. The second component reduces adversarial transferability among the networks via diversity regularizers inspired by determinantal point processes and entropy maximization. Experimental results show that under small perturbation budgets, ARMOURED is robust against strong adaptive adversaries. Notably, ARMOURED does not rely on generating adversarial samples during training. When used in combination with adversarial training, ARMOURED yields performance competitive with state-of-the-art adversarially robust benchmarks on SVHN and outperforms them on CIFAR-10, while offering higher clean accuracy.

1. INTRODUCTION

Modern deep neural networks have met or even surpassed human-level performance on a variety of image classification tasks. However, they are vulnerable to adversarial attacks, where small, calculated perturbations of the input can fool a network into unintended behavior, e.g., misclassification (Szegedy et al., 2014; Biggio et al., 2013). Such adversarial attacks have been found to transfer between different network architectures (Papernot et al., 2016) and are a serious concern, especially when neural networks are used in real-world applications. As a result, much work has been done to improve the robustness of neural networks against adversarial attacks (Miller et al., 2020). Of these techniques, adversarial training (AT) (Goodfellow et al., 2015; Madry et al., 2018) is widely used and has been found to provide the most robust models in recent evaluation studies (Dong et al., 2020; Croce & Hein, 2020). Nonetheless, even models trained with AT have markedly reduced performance on adversarial samples in comparison to clean samples. Models trained with AT also have worse accuracy on clean samples than models trained with standard classification losses.

Schmidt et al. (2018) suggest that one reason for such reductions in model accuracy is that training adversarially robust models requires substantially more labeled data. Due to the high cost of obtaining such labeled data in real-world applications, recent work has explored semi-supervised AT-based approaches that leverage unlabeled data instead (Uesato et al., 2019; Najafi et al., 2019; Zhai et al., 2019; Carmon et al., 2019). Orthogonal to AT-based approaches that focus on training robust single models, a few works have explored the use of diversity regularization for learning adversarially robust classifiers.
These works rely on encouraging ensemble diversity through regularization terms, whether on model predictions (Pang et al., 2019) or model gradients (Dabouei et al., 2020), guided by the intuition that diversity amongst the model ensemble makes it difficult for adversarial attacks to transfer between individual models, thus making the ensemble as a whole more resistant to attack.

In this work, we propose ARMOURED: Adversarially Robust MOdels using Unlabeled data by REgularizing Diversity, a novel algorithm for adversarially robust model learning that elegantly unifies semi-supervised learning and diversity regularization through a multi-view learning framework. ARMOURED applies a pseudo-label filter similar to co-training (Blum & Mitchell, 1998) to enforce consistency of the different networks' predictions on the unlabeled data. In addition, we derive a regularization term inspired by determinantal point processes (DPP) (Kulesza & Taskar, 2012) that encourages the two networks to predict differently on the non-target classes. Lastly, ARMOURED maximizes the entropy of the combined multi-view output on the non-target classes. We show in empirical evaluations that ARMOURED achieves state-of-the-art robustness against strong adaptive adversaries as long as the perturbations are within small ℓ∞ or ℓ2 norm-bounded balls. Notably, unlike previous semi-supervised methods, ARMOURED does not use adversarial samples during training. When used in combination with AT, ARMOURED is competitive with the state-of-the-art methods on SVHN and outperforms them on CIFAR-10, while offering higher clean accuracy.

In summary, the major contributions of this work are as follows:

1. We propose ARMOURED, a novel semi-supervised method based on multi-view learning and diversity regularization for training adversarially robust models.
2. We perform an extensive comparison, including standard semi-supervised learning approaches in addition to methods for learning adversarially robust models.
3. We show that ARMOURED+AT achieves state-of-the-art adversarial robustness while maintaining high accuracy on clean data.
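To make the two diversity regularizers concrete, the following is a minimal NumPy sketch for a single sample and two networks: a DPP-inspired term (log-determinant of the Gram matrix of the non-target prediction vectors, larger when the two predictions are more orthogonal) and the entropy of the combined non-target distribution. The function name, interface, and exact normalization here are our own illustrative assumptions, not the paper's specification.

```python
import numpy as np

def diversity_regularizers(probs_a, probs_b, target, eps=1e-8):
    """Sketch of a DPP-style diversity term and a non-target entropy term
    for two networks' softmax outputs on one sample (illustrative only)."""
    # Drop the target class and renormalize the non-target distributions.
    mask = np.ones_like(probs_a, dtype=bool)
    mask[target] = False
    pa = probs_a[mask]; pa = pa / (pa.sum() + eps)
    pb = probs_b[mask]; pb = pb / (pb.sum() + eps)
    # DPP-inspired term: log-det of the Gram matrix of the unit-normalized
    # non-target vectors; it is maximal when the predictions are orthogonal
    # and tends to -inf when they are identical.
    V = np.stack([pa / (np.linalg.norm(pa) + eps),
                  pb / (np.linalg.norm(pb) + eps)])
    dpp_term = np.log(np.linalg.det(V @ V.T) + eps)
    # Entropy of the combined (averaged) non-target distribution, to be
    # maximized so mass is spread over the non-target classes.
    p_avg = 0.5 * (pa + pb)
    entropy_term = -np.sum(p_avg * np.log(p_avg + eps))
    return dpp_term, entropy_term
```

In a training loop, both terms would be added (with suitable weights) to the semi-supervised classification loss; here they are computed standalone so their behavior is easy to inspect: identical non-target predictions give a strongly negative DPP term, while disagreeing predictions raise it.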

2. RELATED WORK

To set the stage for ARMOURED, this section briefly reviews adversarially robust learning and semi-supervised learning, two paradigms in the literature that are related to our work.

2.1. ADVERSARIALLY ROBUST LEARNING

Adversarial attacks: We consider attacks where adversarial samples stay within an ℓp ball of fixed radius around the clean sample. In this setting, the two standard white-box attacks are the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015), which computes a one-step perturbation that maximizes the cross-entropy loss, and Projected Gradient Descent (PGD) (Madry et al., 2018), a stronger attack that performs multiple iterations of gradient updates to maximize the loss and may be seen as a multi-step version of FGSM. The Auto-PGD attack (APGD) (Croce & Hein, 2020) is a parameter-free, budget-aware variant of PGD that aims at better convergence. However, robustness against these gradient-based attacks may give a false sense of security due to gradient masking, a phenomenon in which the defense does not produce useful gradients for generating adversarial samples (Athalye et al., 2018). Gradient masking is known to affect PGD by preventing its convergence to the actual adversarial samples (Tramèr & Boneh, 2019). There exist gradient-based attacks, such as the Fast Adaptive Boundary attack (FAB) (Croce & Hein, 2019), that are invariant to rescaling and thus unaffected by gradient masking; FAB minimizes the perturbation norm while still achieving misclassification. Black-box attacks that rely on random search alone, without gradient information, such as the Square attack (Andriushchenko et al., 2020), are also unaffected by gradient masking. Finally, AutoAttack (Croce & Hein, 2020) is a strong ensemble adversary that applies four attacks sequentially: APGD with cross-entropy loss, followed by targeted APGD with difference-of-logits-ratio loss, targeted FAB, and then Square.

Adversarial training: Adversarial training (AT) is a popular approach that performs well in practice (Dong et al., 2020). Madry et al. (2018) formulate AT as a min-max problem, where the model is trained with adversarial samples found via PGD. Variants of this method such as TRADES (Zhang
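As a concrete illustration of the PGD attack discussed above (the inner maximization of Madry et al.'s min-max formulation, $\min_\theta \mathbb{E}_{(x,y)}[\max_{\|\delta\|_\infty \le \epsilon} \mathcal{L}(f_\theta(x+\delta), y)]$), here is a minimal NumPy sketch of ℓ∞-bounded PGD with random start. The function name and the `grad_fn` interface are illustrative assumptions, not taken from any cited implementation.

```python
import numpy as np

def pgd_linf(x, y, grad_fn, eps=0.03, alpha=0.01, steps=10):
    """Sketch of an l_inf-bounded PGD attack. grad_fn(x_adv, y) must return
    the gradient of the classification loss w.r.t. x_adv (interface is
    illustrative)."""
    # Random start inside the eps-ball around the clean input.
    x_adv = x + np.random.uniform(-eps, eps, size=x.shape)
    for _ in range(steps):
        g = grad_fn(x_adv, y)
        # Gradient *ascent* on the loss using the gradient sign; a single
        # iteration of this loop with alpha = eps corresponds to FGSM.
        x_adv = x_adv + alpha * np.sign(g)
        # Project back into the l_inf ball around the clean input.
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv
```

For image data one would additionally clip `x_adv` to the valid pixel range; APGD replaces the fixed step size `alpha` with an adaptive, budget-aware schedule.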

