TEST-TIME ADAPTATION FOR BETTER ADVERSARIAL ROBUSTNESS

Abstract

Standard adversarial training and its variants have been widely adopted in practice to achieve robustness against adversarial attacks. However, we show in this work that such an approach does not necessarily achieve near-optimal generalization performance on test samples. Specifically, we show that, under suitable assumptions, the Bayesian optimal robust estimator requires test-time adaptation, and that such adaptation can lead to a significant performance boost over standard adversarial training. Motivated by this observation, we propose a practically easy-to-implement method to improve the generalization performance of adversarially-trained networks via an additional self-supervised test-time adaptation step. We further employ a meta adversarial training method to find a good starting point for test-time adaptation, which incorporates the test-time adaptation procedure into the training phase and strengthens the correlation between the pretext tasks in self-supervised learning and the original classification task. Extensive experiments on CIFAR10, STL10 and Tiny ImageNet using several different self-supervised tasks show that our method consistently improves the robust accuracy of standard adversarial training under different white-box and black-box attack strategies.

1. INTRODUCTION

Adversarial Training (AT) (Madry et al., 2018) and its variants (Wang et al., 2019; Zhang et al., 2019) are currently recognized as the most effective defense mechanisms against adversarial attacks. However, AT generalizes poorly: the robust accuracy gap between the training and test sets in AT is much larger than the training-test gap in standard training of deep networks (Neyshabur et al., 2017; Zhang et al., 2017). Unfortunately, classical techniques to overcome overfitting in standard training, including regularization and data augmentation, have little effect in AT (Rice et al., 2020).

Theoretically, as will be shown in Section 3, the loss objective of AT does not achieve optimal robustness. Instead, under suitable assumptions, the Bayesian optimal robust estimator, which represents the statistically optimal model that can be obtained from the training data, requires test-time adaptation. Compared with fixed restricted Bayesian robust estimators, test-time adapted estimators largely improve robustness. Therefore, we should perform test-time adaptation for each test input to boost robustness.

To this end, we propose to fine-tune the model parameters for each test mini-batch. Since the labels of the test images are not available, we exploit self-supervision, which is widely used in the standard training of networks (Chen et al., 2020b; Gidaris et al., 2018; He et al., 2020). Fine-tuning on the self-supervised tasks has a high gradient correlation with fine-tuning on the classification task, so it serves as a substitute for fine-tuning the classification loss at inference time. Thus, we expect that minimizing this self-supervised loss function yields better generalization on the test set. To make our test-time adaptation strategy effective, we need to search for a good starting point that achieves good robust accuracy after fine-tuning. As will be shown in our experiments, AT itself does not provide the optimal starting point.
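To make the procedure concrete, the following is a minimal sketch of self-supervised test-time fine-tuning on a linear model with a rotation-prediction pretext head. The model decomposition (shared weights `W_feat`, rotation head `W_ss`), the number of adaptation steps and the learning rate are illustrative assumptions, not the paper's actual architecture or hyperparameters; the key points are that only the self-supervised loss is used (no test labels) and that adaptation restarts from the trained weights for every test mini-batch.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def rotations(x_batch):
    """Create the four rotated copies of each image, labeled 0..3 (the pretext task)."""
    imgs, labels = [], []
    for k in range(4):
        imgs.append(np.rot90(x_batch, k=k, axes=(1, 2)))
        labels.append(np.full(len(x_batch), k))
    return np.concatenate(imgs), np.concatenate(labels)

def adapt_on_batch(W_feat, W_ss, x_batch, steps=5, lr=0.1):
    """Fine-tune the shared weights on rotation prediction for one test mini-batch.

    Starts from a copy of the trained weights, so each batch is adapted
    independently (no incremental drift across batches).
    """
    W = W_feat.copy()
    x_rot, y_rot = rotations(x_batch)
    X = x_rot.reshape(len(x_rot), -1)      # flatten images to feature vectors
    Y = np.eye(4)[y_rot]                   # one-hot rotation labels
    for _ in range(steps):
        P = softmax((X @ W) @ W_ss)        # shared features -> rotation head
        # gradient of the mean cross-entropy w.r.t. the shared weights W
        grad = X.T @ ((P - Y) @ W_ss.T) / len(X)
        W -= lr * grad
    return W                               # adapted weights, used only for this batch
```

After `adapt_on_batch`, predictions for the batch would be made with the adapted weights and the (frozen) classification head, and the next batch again starts from `W_feat`.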
We therefore formulate the search for such a starting point as a bilevel optimization problem. Specifically, we introduce a Meta Adversarial Training (MAT) strategy dedicated to our self-supervised fine-tuning, inspired by the model-agnostic meta-learning (MAML) framework (Finn et al., 2017). To this end, we treat the classification of each batch of adversarial images as one task and minimize the corresponding classification error of the fine-tuned network. MAT strengthens the correlation between the self-supervised and classification tasks so that self-supervised test-time adaptation can further improve robust accuracy. In order to reliably evaluate our method, we follow the suggestions of Tramer et al. (2020) and design an adaptive attack that is fully aware of the test-time adaptation. Using rotation and vertical flip as the self-supervised tasks, we empirically demonstrate the effectiveness of our method on the commonly used CIFAR10 (Krizhevsky et al., 2009), STL10 (Coates et al., 2011) and Tiny ImageNet (Le & Yang, 2015) datasets under both standard (Andriushchenko et al., 2020; Croce & Hein, 2020a; Madry et al., 2018) and adaptive attacks, in both white-box and black-box settings. The experiments evidence that our method consistently improves the robust accuracy under all attacks. Our contributions can be summarized as follows:

1. We show that estimators should be test-time adapted in order to achieve the Bayesian optimal adversarial robustness, even for simple models such as linear models, and that test-time adaptation largely improves robustness compared with optimal restricted estimators.

2. We introduce the framework of self-supervised test-time fine-tuning for adversarially-trained networks, showing that it improves the robust accuracy on test data.

3. We propose a meta adversarial training strategy based on the MAML framework to find a good starting point and strengthen the correlation between the self-supervised and classification tasks.

4. Our experiments show that our approach is effective against diverse attack strategies, including an adaptive attack that is fully aware of our test-time adaptation, in both white-box and black-box settings.
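The bilevel structure described above can be sketched as a single MAT outer step on a linear model: the inner level takes one self-supervised gradient step from the current starting point, and the outer level updates the starting point with the classification gradient evaluated at the adapted weights. This sketch uses a first-order MAML approximation (the second-order term through the inner step is dropped) and shares one input batch between the pretext and classification losses for brevity; the weight shapes, learning rates, and heads `W_ss`/`W_cls` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_grad(X, W_head, P, Y):
    """Gradient of the mean softmax cross-entropy w.r.t. the shared weights."""
    return X.T @ ((P - Y) @ W_head.T) / len(X)

def mat_step(W, W_ss, W_cls, X, Y_ss, Y_cls, inner_lr=0.1, outer_lr=0.05):
    """One MAT outer step (first-order MAML approximation).

    Inner: adapt the shared weights W on the self-supervised loss.
    Outer: update the starting point W with the classification gradient
    taken at the adapted weights, so that the point reached *after*
    self-supervised fine-tuning classifies well.
    """
    # inner step: self-supervised (pretext) loss
    P_ss = softmax((X @ W) @ W_ss)
    W_adapt = W - inner_lr * ce_grad(X, W_ss, P_ss, Y_ss)
    # outer step: classification loss evaluated after adaptation
    P_cls = softmax((X @ W_adapt) @ W_cls)
    return W - outer_lr * ce_grad(X, W_cls, P_cls, Y_cls)
```

In the full method, the inner step would run on self-supervised copies (e.g. rotations) of adversarial examples and the outer loss on their true labels; repeating `mat_step` over training batches drives down the post-adaptation classification loss, which is exactly the quantity that matters at test time.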

2. RELATED WORK

Adversarial Training. In recent years, many approaches have been proposed to defend networks against adversarial attacks (Guo et al., 2018; Liao et al., 2018; Song et al., 2018). Among them, Adversarial Training (AT) (Madry et al., 2018) stands out as one of the most robust and popular methods, even under various strong attacks (Athalye et al., 2018; Croce & Hein, 2020a). AT optimizes the loss on adversarial examples to find parameters that are robust to adversarial attacks. Several variants of AT (Wang et al., 2019; Zhang et al., 2019) achieve similar performance to AT (Rice et al., 2020). One important problem that limits the robust accuracy of AT is overfitting: compared with training on clean images, the robust accuracy gap between the training and test sets is much larger in AT (Rice et al., 2020). Moreover, traditional techniques to prevent overfitting, such as regularization and data augmentation, have little effect. Recently, some methods have attempted to flatten the weight loss landscape to improve the generalization of AT. In particular, Adversarial Weight Perturbation (AWP) (Wu et al., 2020) achieves this by designing a double-perturbation mechanism that adversarially perturbs both inputs and weights. In addition, learning-based smoothing can flatten the landscape and improve performance (Chen et al., 2021b).

Self-supervised Learning. In the context of non-adversarial training, many self-supervised strategies have been proposed, such as rotation prediction (Gidaris et al., 2018), region/component filling (Criminisi et al., 2004), patch-based spatial composition prediction (Trinh et al., 2019) and contrastive learning (Chen et al., 2020b; He et al., 2020). While self-supervision has also been employed in AT (Chen et al., 2020a; Kim et al., 2020; Yang & Vondrick, 2020; Hendrycks et al., 2019), these methods only use self-supervised learning at training time to regularize the parameters and improve the robust accuracy.
By contrast, we propose to perform self-supervised fine-tuning at test time, which we demonstrate to significantly improve the robust accuracy on test images. As will be shown in the experiments, self-supervised test-time adaptation yields larger and complementary improvements over training-time self-supervision.

Test-time Adaptation. Test-time adaptation has been used in various fields, such as image super-resolution (Shocher et al., 2018) and domain adaptation (Sun et al., 2020; Wang et al., 2021). While our work is thus closely related to Test-Time Training (TTT) (Sun et al., 2020), we target a significantly different scenario. TTT assumes that all test samples have been subject to the same distribution shift compared with the training data. As a consequence, it incrementally updates the model parameters when receiving new test images. By contrast, in our scenario, there is no systematic distribution shift, and it is therefore more effective to fine-tune the parameters of the original model for every new test mini-batch. This motivates our MAT strategy, which searches for initial model parameters that can be effectively fine-tuned in a self-supervised manner.

