TEST-TIME ADAPTATION FOR BETTER ADVERSARIAL ROBUSTNESS

Abstract

Standard adversarial training and its variants have been widely adopted in practice to achieve robustness against adversarial attacks. However, we show in this work that such an approach does not necessarily achieve near-optimal generalization performance on test samples. Specifically, we show that under suitable assumptions, the Bayes optimal robust estimator requires test-time adaptation, and such adaptation can lead to a significant performance boost over standard adversarial training. Motivated by this observation, we propose a practical, easy-to-implement method to improve the generalization performance of adversarially trained networks via an additional self-supervised test-time adaptation step. We further employ a meta adversarial training method to find a good starting point for test-time adaptation; it incorporates the test-time adaptation procedure into the training phase and strengthens the correlation between the pretext tasks in self-supervised learning and the original classification task. Extensive empirical experiments on CIFAR10, STL10 and Tiny ImageNet using several different self-supervised tasks show that our method consistently improves the robust accuracy of standard adversarial training under different white-box and black-box attack strategies.

1. INTRODUCTION

Adversarial Training (AT) (Madry et al., 2018) and its variants (Wang et al., 2019; Zhang et al., 2019) are currently recognized as the most effective defense mechanism against adversarial attacks. However, AT generalizes poorly: the robust accuracy gap between the training and test sets in AT is much larger than the training-test gap in standard training of deep networks (Neyshabur et al., 2017; Zhang et al., 2017). Unfortunately, classical techniques for overcoming overfitting in standard training, including regularization and data augmentation, have little effect in AT (Rice et al., 2020).

Theoretically, as will be shown in Section 3, the loss objective of AT does not achieve optimal robustness. Instead, under suitable assumptions, the Bayes optimal robust estimator, which represents the statistically optimal model obtainable from the training data, requires test-time adaptation. Compared with fixed, restricted Bayesian robust estimators, test-time adapted estimators substantially improve robustness. We should therefore perform test-time adaptation for each test input to boost robustness. To this end, we propose to fine-tune the model parameters on each test mini-batch. Since the labels of the test images are not available, we exploit self-supervision, which is widely used in the standard training of networks (Chen et al., 2020b; Gidaris et al., 2018; He et al., 2020). Fine-tuning on the self-supervised tasks has a high gradient correlation with fine-tuning on the classification task, so it serves as a substitute for fine-tuning the classification loss at inference time. We thus expect that minimizing this self-supervised loss yields better generalization on the test set.

To make our test-time adaptation strategy effective, we need to search for a good starting point that achieves good robust accuracy after fine-tuning. As will be shown in our experiments, AT itself does not provide the optimal starting point.
We therefore formulate the search for such a starting point as a bilevel optimization problem. Specifically, we introduce a Meta Adversarial Training (MAT) strategy dedicated to our self-supervised fine-tuning, inspired by the model-agnostic meta-learning (MAML) framework (Finn et al., 2017). To this end, we treat the classification of each batch of adversarial images as one task and minimize the corresponding classification error of the fine-tuned

