CAT: COLLABORATIVE ADVERSARIAL TRAINING

Abstract

Adversarial training can improve the robustness of neural networks. Previous adversarial training methods focus on a single training strategy and do not consider collaboration between different strategies. In this paper, we find that different adversarial training methods have distinct robustness on individual sample instances. For example, an instance can be correctly classified by a model trained with standard adversarial training (AT) but not by a model trained with TRADES, and vice versa. Based on this phenomenon, we propose a collaborative adversarial training framework to improve the robustness of neural networks. Specifically, we simultaneously train two robust models from scratch using different adversarial training methods. The adversarial examples generated by each network are fed to the peer network, and the peer network's logits are used to guide that network's training. Collaborative Adversarial Training (CAT) improves both robustness and accuracy. Extensive experiments on CIFAR-10 and CIFAR-100 validate the effectiveness of our method: CAT achieves new state-of-the-art robustness on CIFAR-10 under the Auto-Attack benchmark¹ without using any additional data.

1. INTRODUCTION

With the development of deep learning, Deep Neural Networks (DNNs) have achieved state-of-the-art performance in various fields, such as image classification (He et al., 2016), object detection (Redmon et al., 2016), and semantic segmentation (Pal & Pal, 1993). However, recent research has found that DNNs are vulnerable to adversarial perturbations (Goodfellow et al., 2014): a finely crafted perturbation by a malicious agent can easily fool a neural network. This raises security concerns about the deployment of neural networks in security-critical areas such as autonomous driving (Chen et al., 2019) and medical diagnostics (Kong et al., 2017). To cope with the vulnerability of DNNs, different types of methods have been proposed to improve robustness, including adversarial training (Madry et al., 2017), defensive distillation (Papernot et al., 2016), feature denoising (Xie et al., 2019), and model pruning (Madaan et al., 2020). Among them, Adversarial Training (AT) is the most effective. AT can be regarded as a data augmentation strategy that trains neural networks on adversarial examples crafted from natural examples. It is usually formulated as a min-max optimization problem, where the inner maximization generates adversarial examples and the outer minimization optimizes the model parameters on the adversarial examples the inner maximization produces. Previous methods have focused on improving the model's adversarial accuracy, attending only to the numerical improvement rather than to the characteristics of the different methods. We ask: do different adversarial training methods perform the same on individual sample instances?
We analyzed different adversarial training methods (taking AT (Madry et al., 2017) and TRADES (Zhang et al., 2019) as examples) and found that they behave differently on individual sample instances, as illustrated in Figure 1. Specifically, for the same adversarial example, the network trained with AT may classify it correctly while the network trained with TRADES misclassifies it; likewise, some examples are correctly classified by the TRADES-trained network but not by the AT-trained one. That is, although AT and TRADES reach the same numerical adversarial accuracy, they behave differently at the instance level. This raises the question: do two networks learn better if they collaborate? Based on this observation, we propose a Collaborative Adversarial Training (CAT) framework to improve the robustness of neural networks; our framework is shown in Figure 2. Specifically, we simultaneously train two deep neural networks from scratch, each with a different adversarial training method. The adversarial examples generated by each network are also fed to the peer network to obtain the corresponding logits, and those peer logits guide the network's learning together with its own adversarial training objective function. We expect this collaborative learning, in which the peers learn from each other, to improve the robustness of both networks. Extensive experiments on different neural networks and datasets demonstrate the effectiveness of our approach: CAT achieves new state-of-the-art robustness on CIFAR-10 under the Auto-Attack benchmark without any additional synthetic or real data. In summary, our contribution is threefold. • We find that models obtained using different adversarial training methods have different representations for individual sample instances.
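As we read the description above, each network's objective combines its own adversarial training loss with a term that matches its prediction on its adversarial example to the peer's logits. The sketch below is a minimal NumPy illustration under that reading; the use of a KL term, its direction, and the weight `lam` are our assumptions, not the paper's stated formula.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl_div(p_logits, q_logits):
    """KL(p || q) between the softmax distributions of two logit vectors."""
    p = softmax(p_logits)
    q = softmax(q_logits)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def cat_loss(own_at_loss, own_logits, peer_logits, lam=1.0):
    """Collaborative objective for one network: its own adversarial training
    loss plus a distillation term pulling its prediction on its adversarial
    example toward the peer network's prediction on the same example."""
    return own_at_loss + lam * kl_div(peer_logits, own_logits)
```

In a full training loop, each network would generate its own adversarial examples, query the peer for logits on them, and minimize this combined loss; the two networks are updated simultaneously.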



¹ https://github.com/fra31/auto-attack



Figure 1: Classification results of different adversarial training methods on sample instances. The first row shows the classification results of the model trained with AT and the second row those of the model trained with TRADES; 1 denotes a correct classification and 0 an incorrect one, and 10000 is the size of the CIFAR-10 test set. The third row marks, in red, the examples classified correctly by both AT and TRADES. Models trained with different methods clearly perform differently on individual sample instances.
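The instance-level comparison in Figure 1 amounts to intersecting two per-example correctness masks. A small helper (illustrative only, not from the paper) makes the four categories explicit:

```python
import numpy as np

def agreement_stats(correct_a, correct_b):
    """Compare per-example robustness of two models.
    correct_a / correct_b: boolean arrays over the test set, True where
    the model classifies the adversarial example correctly."""
    correct_a = np.asarray(correct_a, dtype=bool)
    correct_b = np.asarray(correct_b, dtype=bool)
    return {
        "both":    int(np.sum(correct_a & correct_b)),    # red row in Figure 1
        "only_a":  int(np.sum(correct_a & ~correct_b)),   # e.g. AT right, TRADES wrong
        "only_b":  int(np.sum(~correct_a & correct_b)),   # e.g. TRADES right, AT wrong
        "neither": int(np.sum(~correct_a & ~correct_b)),
    }
```

Nonzero `only_a` and `only_b` counts are exactly the complementarity that motivates collaboration between the two models.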

• We propose a novel adversarial training framework, Collaborative Adversarial Training (CAT), which simultaneously trains two neural networks from scratch using different adversarial training methods and lets them collaborate to improve the robustness of the models.

• We conduct extensive experiments on a variety of datasets and networks and evaluate them under state-of-the-art attacks. We demonstrate that CAT can substantially improve the robustness of neural networks and obtains new state-of-the-art performance without any additional data.

2. RELATED WORK

Since DNNs are vulnerable to adversarial examples, a large number of works have been proposed to craft them. Based on the attacker's access to knowledge of the target model, attacks can be divided into white-box and black-box attacks: white-box attacks craft adversarial examples using knowledge of the target model, while black-box attacks are agnostic to it. White-box Attack: Goodfellow et al. (2014) proposed FGSM, which crafts adversarial examples efficiently in a single step. Later, FGSM was extended to different iterative

