ONLINE ADVERSARIAL PURIFICATION BASED ON SELF-SUPERVISED LEARNING

Abstract

Deep neural networks are known to be vulnerable to adversarial examples, where a perturbation in the input space leads to an amplified shift in the latent network representation. In this paper, we combine canonical supervised learning with self-supervised representation learning, and present Self-supervised Online Adversarial Purification (SOAP), a novel defense strategy that uses a self-supervised loss to purify adversarial examples at test-time. Our approach leverages the label-independent nature of self-supervised signals, and counters the adversarial perturbation with respect to the self-supervised tasks. SOAP yields competitive robust accuracy against state-of-the-art adversarial training and purification methods, with considerably less training complexity. In addition, our approach is robust even when adversaries are given knowledge of the purification defense strategy. To the best of our knowledge, our paper is the first to generalize the idea of using self-supervised signals to perform online test-time purification.

1. INTRODUCTION

Deep neural networks have achieved remarkable results in many machine learning applications. However, these networks are known to be vulnerable to adversarial attacks, i.e., strategies that aim to find adversarial examples that are close to, or even perceptually indistinguishable from, their natural counterparts yet are easily mis-classified by the networks. This vulnerability raises theoretical concerns about the interpretability of deep learning, as well as practical concerns when deploying neural networks in security-sensitive applications.

Many strategies have been proposed to empower neural networks to defend against these adversaries. The most widely used family of defense strategies is adversarial training, an on-the-fly data augmentation method that improves robustness by training the network not only on clean examples but on adversarial ones as well. For example, Madry et al. (2017) propose projected gradient descent (PGD) as a universal first-order attack and strengthen the network by presenting it with such adversarial examples during training (i.e., adversarial training). However, this method is computationally expensive, as finding these adversarial examples involves sample-wise gradient computation at every epoch.

Self-supervised representation learning aims to learn meaningful representations of unlabeled data, where the supervision comes from the data itself. While this seems orthogonal to the study of adversarial vulnerability, recent works use representation learning as a lens to understand, as well as improve, adversarial robustness (Hendrycks et al., 2019; Mao et al., 2019; Chen et al., 2020a; Naseer et al., 2020). This line of research suggests that self-supervised learning, which often leads to a more informative and meaningful data representation, can benefit the robustness of deep networks. In this paper, we study how self-supervised representation learning can improve adversarial robustness.
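The PGD attack referenced above can be sketched as follows. This is a minimal numpy illustration on a logistic-regression model, where the input gradient is available in closed form; the model, step size, and perturbation budget are illustrative choices, not the paper's setup:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pgd_attack(x, y, w, b, eps=0.1, alpha=0.02, steps=10):
    """L-infinity PGD: repeated signed-gradient ascent on the loss,
    projected back into the eps-ball around the clean input x."""
    x_adv = x.copy()
    for _ in range(steps):
        p = sigmoid(x_adv @ w + b)                 # model prediction
        grad = (p - y) * w                         # d(cross-entropy)/d(input)
        x_adv = x_adv + alpha * np.sign(grad)      # ascent step on the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project onto the eps-ball
    return x_adv
```

Adversarial training then mixes such `x_adv` into every minibatch, which is exactly why it requires per-sample gradient computations at each epoch.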
We present Self-supervised Online Adversarial Purification (SOAP), a novel defense strategy that uses an auxiliary self-supervised loss to purify adversarial examples at test-time, as illustrated in Figure 1. During training, besides the classification task, we jointly train the network on a carefully selected self-supervised task. This multi-task learning improves the robustness of the network and, more importantly, enables us to counter the adversarial perturbation at test-time by leveraging the label-independent nature of self-supervised signals. Experiments demonstrate that SOAP performs competitively on various architectures across different datasets, with only a small computation overhead compared with vanilla training. Furthermore, we design a new attack strategy that targets both the classification and the auxiliary tasks, and show that our method is robust to this adaptive adversary as well. Code is available at https://github.com/Mishne-Lab/SOAP.

Figure 1: An illustration of self-supervised online adversarial purification (SOAP). Left: joint training of the classification and the auxiliary task. Right: an input adversarial example is purified iteratively to counter the representational shift, then classified. Note that the encoder is shared by both classification and purification.

2. RELATED WORK

Adversarial purification Another genre of robust learning, namely purification, focuses on shifting adversarial examples back to the clean data representation. Gu & Rigazio (2014) explored using a general denoising autoencoder (DAE) (Vincent et al., 2008) to remove adversarial noise; Meng & Chen (2017) train a reformer network, a collection of autoencoders, to move adversarial examples towards the clean manifold; Liao et al. (2018) train a U-Net that denoises adversarial examples to their clean counterparts; Samangouei et al. (2018) train a GAN on clean examples and project adversarial examples onto the manifold of the generator; Song et al. (2018) assume adversarial examples have lower probability and learn the image distribution with a PixelCNN so that they can maximize the probability of a given test example; Naseer et al. (2020) train a conditional GAN by letting it play a min-max game with a critic network in order to differentiate between clean and adversarial examples. In contrast to the above approaches, SOAP achieves better robust accuracy and does not require a GAN, which is hard and inefficient to train. More importantly, our approach exploits a wider range of self-supervised signals for purification and can conceptually be applied to any format of data, not just images, given an appropriate self-supervised task.

Self-supervised learning Self-supervised learning aims to learn intermediate representations of unlabeled data that are useful for unknown downstream tasks. This is done by solving a self-supervised task, or pretext task, where the supervision of the task comes from the data itself. Recently, a variety of self-supervised tasks have been proposed on images, including data reconstruction (Vincent et al., 2008; Rifai et al., 2011), relative positioning of patches (Doersch et al., 2015; Noroozi & Favaro, 2016), colorization (Zhang et al., 2016), transformation prediction (Dosovitskiy et al., 2014; Gidaris et al., 2018), and combinations of tasks (Doersch & Zisserman, 2017). More recently, studies have shown how self-supervised learning can improve adversarial robustness. Mao et al. (2019) find that adversarial attacks fool the networks by shifting latent representation to
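Conceptually, purification by an auxiliary objective amounts to descending a self-supervised loss with respect to the input at test-time. Below is a minimal numpy sketch using a fixed linear encoder/decoder pair as a stand-in for a trained auxiliary branch; the matrices, step size, and iteration count are illustrative only:

```python
import numpy as np

def purify(x_adv, E, D, alpha=0.5, steps=20):
    """Iteratively update the input to reduce the reconstruction loss
    0.5 * ||D @ E @ x - x||^2, moving it back towards the region that
    the (here linear) autoencoder reconstructs well."""
    x = x_adv.copy()
    P = D @ E - np.eye(x.size)        # residual operator: P @ x = recon - x
    for _ in range(steps):
        r = P @ x                     # reconstruction residual
        x = x - alpha * (P.T @ r)     # gradient step w.r.t. the input
    return x
```

When `D @ E` is an orthogonal projector onto the clean-data subspace, each step shrinks the off-subspace (perturbation) component geometrically while leaving the on-subspace component untouched, which is the intuition behind gradient-based test-time purification.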


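Among the pretext tasks surveyed above, transformation prediction is particularly simple to set up: transform each image and ask the network to predict which transformation was applied. A minimal sketch of the label generation for four-way rotation (in the style of Gidaris et al. (2018); the helper name and array shapes are our illustrative choices):

```python
import numpy as np

def make_rotation_batch(images):
    """Build a transformation-prediction pretext batch: each (H, W) image
    is rotated by 0/90/180/270 degrees, and the rotation index (0..3)
    serves as the self-supervised label -- no human annotation needed."""
    xs, ys = [], []
    for img in images:
        for k in range(4):
            xs.append(np.rot90(img, k))   # rotate by k * 90 degrees
            ys.append(k)                  # label is the rotation itself
    return np.stack(xs), np.array(ys)
```

Because the labels come from the data itself, the same loss can be evaluated on an unlabeled test input, which is what makes such tasks usable as a purification signal at test-time.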