TEST-TIME ADAPTATION FOR BETTER ADVERSARIAL ROBUSTNESS

Abstract

Standard adversarial training and its variants have been widely adopted in practice to achieve robustness against adversarial attacks. However, we show in this work that such an approach does not necessarily achieve near-optimal generalization performance on test samples. Specifically, we show that under suitable assumptions, the Bayesian optimal robust estimator requires test-time adaptation, and that such adaptation can lead to a significant performance boost over standard adversarial training. Motivated by this observation, we propose a practically easy-to-implement method to improve the generalization performance of adversarially-trained networks via an additional self-supervised test-time adaptation step. We further employ a meta adversarial training method to find a good starting point for test-time adaptation, which incorporates the test-time adaptation procedure into the training phase and strengthens the correlation between the pre-text tasks in self-supervised learning and the original classification task. Extensive empirical experiments on CIFAR10, STL10 and Tiny ImageNet using several different self-supervised tasks show that our method consistently improves the robust accuracy of standard adversarial training under different white-box and black-box attack strategies.

Under review as a conference paper at ICLR 2023

1. INTRODUCTION

Adversarial Training (AT) (Madry et al., 2018) and its variants (Wang et al., 2019; Zhang et al., 2019) are currently recognized as the most effective defense mechanism against adversarial attacks. However, AT generalizes poorly: the robust accuracy gap between the training and test set in AT is much larger than the training-test gap in standard training of deep networks (Neyshabur et al., 2017; Zhang et al., 2017). Unfortunately, classical techniques to overcome overfitting in standard training, including regularization and data augmentation, have little effect in AT (Rice et al., 2020). Theoretically, as will be shown in Section 3, the loss objective of AT does not achieve optimal robustness. Instead, under suitable assumptions, the Bayesian optimal robust estimator, which represents the statistically optimal model that can be obtained from the training data, requires test-time adaptation. Compared with the fixed restricted Bayesian robust estimators, the test-time adapted estimators largely improve the robustness. Therefore, we should perform test-time adaptation for each test input to boost the robustness.

To this end, we propose to fine-tune the model parameters for each test mini-batch. Since the labels of the test images are not available, we exploit self-supervision, which is widely used in the standard training of networks (Chen et al., 2020b; Gidaris et al., 2018; He et al., 2020). Fine-tuning on the self-supervised tasks has a high gradient correlation with fine-tuning on the classification task, so it forms a substitute for fine-tuning the classification loss at inference time. Thus, we expect that minimizing this self-supervised loss function yields better generalization on the test set. To make our test-time adaptation strategy effective, we need to search for a good starting point that achieves good robust accuracy after fine-tuning. As will be shown in our experiments, AT itself does not provide the optimal starting point.

We therefore formulate the search for such a starting point as a bilevel optimization problem. Specifically, we introduce a Meta Adversarial Training (MAT) strategy dedicated to our self-supervised fine-tuning, inspired by the model-agnostic meta-learning (MAML) framework (Finn et al., 2017). To this end, we treat the classification of each batch of adversarial images as one task and minimize the corresponding classification error of the fine-tuned network. MAT strengthens the correlation between the self-supervised and classification tasks so that self-supervised test-time adaptation can further improve robust accuracy. In order to reliably evaluate our method, we follow the suggestions of Tramer et al. (2020) and design an adaptive attack that is fully aware of the test-time adaptation.

Using rotation and vertical flip as the self-supervised tasks, we empirically demonstrate the effectiveness of our method on the commonly used CIFAR10 (Krizhevsky et al., 2009), STL10 (Coates et al., 2011) and Tiny ImageNet (Le & Yang, 2015) datasets under both standard (Andriushchenko et al., 2020; Croce & Hein, 2020a; Madry et al., 2018) and adaptive attacks, in both white-box and black-box settings. The experiments evidence that our method consistently improves the robust accuracy under all attacks. Our contributions can be summarized as follows:

1. We show that estimators should be test-time adapted in order to achieve the Bayesian optimal adversarial robustness, even for simple models such as linear models, and that test-time adaptation largely improves the robustness compared with optimal restricted estimators.
2. We introduce the framework of self-supervised test-time fine-tuning for adversarially-trained networks, showing that it improves the robust accuracy on the test data.
3. We propose a meta adversarial training strategy based on the MAML framework to find a good starting point and strengthen the correlation between the self-supervised and classification tasks.
4. Our experiments show that our approach is effective against diverse attack strategies, including an adaptive attack that is fully aware of our test-time adaptation, in both white-box and black-box settings.

2. RELATED WORK

3. THEORY OF TEST-TIME ADAPTATION

In this section, we study the relationship between test-time adaptation and Bayesian optimal robustness, which represents the optimal robustness achievable from the training data, showing that test-time adaptation can extend the function classes and improve the robustness of the model.

Definition 3.1. For a model $F(x)$ with $\ell_2$ adversarial constraint $\|x' - x\| < \varepsilon$, we define its natural risk and adversarial risk at point $x$ as
$$R^{nat}_x(F) = (F(x) - \mathbb{E}[y|x])^2, \qquad R^{adv}_x(F) = R^{nat}_x(F) + \max_{\|x' - x\| < \varepsilon} (F(x') - F(x))^2.$$

Remarks. We use the MSE loss to define the natural risk $R^{nat}_x(F)$ and adversarial risk $R^{adv}_x(F)$ at point $x$. Similar to TRADES (Zhang et al., 2019), the adversarial risk is defined as the sum of the natural risk and the loss change under adversarial attack, and it can be bounded by the maximum MSE loss within the adversarial budget:
$$\mathbb{E}_x R^{adv}_x(F) \le \mathbb{E}_x \max_{\|x' - x\| \le \varepsilon} (F(x') - \mathbb{E}[y|x])^2 \le 2\,\mathbb{E}_x R^{adv}_x(F).$$
Therefore, for an adversarial input $x'$ with $\|x' - x\| < \varepsilon$, a small $R^{adv}_x(F)$ guarantees a small test error on $x'$. A small $R^{nat}_x(F)$ and a small $R^{adv}_x(F)$ represent good clean performance and high adversarial robustness, respectively. In the following definitions, we define three algorithms to obtain adversarially robust functions and compare their adversarial risks.

Definition 3.2 (Adversarial Training with TRADES). We define $\hat F_{AT}$ as
$$\hat F_{AT} = \arg\min_F \frac{1}{n}\sum_{i=1}^n \Big[(y_i - F(x_i))^2 + \max_{\|x_i' - x_i\| < \varepsilon} (F(x_i') - F(x_i))^2\Big],$$
where $x_i$ represents the $i$-th clean training input and $y_i$ the corresponding clean training label.

Remark. Empirically, adversarial training is a very popular method to achieve robustness. It minimizes the adversarial risk on the training data.

Then we consider Bayesian optimal robustness. Let $\mathcal{F}$ represent a function class. We assume the response is generated by $y = F^*(x) + \xi$ with prior distribution $F^* \in \mathcal{F} \sim P_F$ and $\xi \sim P_\xi$. Denote by $X \in \mathbb{R}^{n \times d}$, $Y \in \mathbb{R}^n$ the training data and training responses generated by $y_i = F^*(x_i) + \xi$.

For problems like Bayesian linear regression (Bishop & Nasrabadi, 2006), a function $\hat F \in \mathcal{F}$ is able to achieve the Bayesian optimal natural risk $\mathbb{E}_{X,Y,y}(\hat F(x) - y)^2$. However, the function class $\mathcal{F}$ is not rich enough to achieve the Bayesian optimal adversarial risk. The adversarial risk depends on the local Lipschitz constant of the function $F$; in order to better trade off between the Lipschitz constant and the natural risk of $F$, much more complicated function classes than $\mathcal{F}$ are needed to achieve optimal adversarial robustness. We define two Bayesian functions, $\hat F_{RB}$ and $\hat F_{AB}$, that minimize the global adversarial risk and the adversarial risk at a specific point $x$, respectively. $\hat F_{AB}$ extends the function class beyond $\mathcal{F}$ and achieves better robustness.

Definition 3.3 (Restricted Bayesian Robust Function $\hat F_{RB}$). The restricted Bayesian robust function $\hat F_{RB}$ minimizes the global adversarial risk inside the function class $\mathcal{F}$:
$$\min_{\hat F \in \mathcal{F}}\ \mathbb{E}_x\, \mathbb{E}_{y,X,Y|x}\Big[(\hat F(x) - y)^2 + \max_{\|x' - x\| < \varepsilon} (\hat F(x') - \hat F(x))^2\Big].$$

Remark. The restricted Bayesian function represents the best robust function inside the function class $\mathcal{F}$: for any $\hat F \in \mathcal{F}$, no training algorithm can achieve a better average adversarial risk than $\hat F_{RB}$.

Definition 3.4 (Adaptive Bayesian Robust Function $\hat F_{AB}$). The adaptive Bayesian robust function $\hat F_{AB}$ inside the function class $\mathcal{F}$ that minimizes the adversarial risk at point $x$ is given by
$$\min_{\hat F \in \mathcal{F}}\ \mathbb{E}_{y,X,Y|x}\Big[(\hat F(x) - y)^2 + \max_{\|x' - x\| < \varepsilon} (\hat F(x') - \hat F(x))^2\Big].$$

Remark. Instead of minimizing the global average of $R^{adv}_x(F)$, $\hat F_{AB}$ minimizes the adversarial risk at the given input point $x$. The function depends on the input $x$, so the model extends the function class beyond $\mathcal{F}$: for different test inputs, we can use different functions to achieve the optimal adversarial risk. Therefore, we refer to $\hat F_{AB}$ as the test-time adapted function. In the following theorem, we show the difference between the three functions in a linear model, where the test-time adapted function $\hat F_{AB}$ significantly improves the robustness.

Theorem 3.1 (Linear Models). We consider the linear function class $\mathcal{F}_{Lin} = \{F_{Lin} \mid F_{Lin}(x;\theta) = x^\top \theta,\ \theta \in \mathbb{R}^d\}$. The output $y$ is generated by $y = x^\top \theta^* + \xi$, where $\theta^*$ is independent of $x$ with $\theta^* \sim \mathcal{N}(0, \tau^2 I)$, and the noise $\xi \sim \mathcal{N}(0, \sigma^2)$. Let $X \in \mathbb{R}^{n \times d}$, $Y \in \mathbb{R}^n$ denote the training data and responses, respectively. For the linear model $F_{Lin}(x;\theta)$, the three estimators in Definitions 3.2 to 3.4 are
$$\hat\theta^{Lin}_{AT} = X^\top (XX^\top + n\varepsilon^2 I_n)^{-1} Y, \qquad \hat\theta^{Lin}_{RB} = \frac{1}{\varepsilon^2 d + 1}\, \hat\theta^{Lin}_{nat}, \qquad \hat\theta^{Lin}_{AB} = (xx^\top + \varepsilon^2 I_d)^{-1} x x^\top \hat\theta^{Lin}_{nat},$$
where $\hat\theta^{Lin}_{nat} = X^\top (XX^\top + \lambda^* I_n)^{-1} Y$ and $\lambda^* = \sigma^2/\tau^2$. Furthermore, if each dimension of $x$ is i.i.d. with $\mathbb{E}x = 0$, $\mathrm{Cov}(x) = I_d/d$ and $\mathbb{E}[\sqrt{d}\,x_i]^4 \le M$ for some universal constant $M < \infty$, then, denoting $\Delta = (1 + c + \lambda^*)^2 - 4c$, when $n, d \to \infty$ with $n/d = c \in (0,1)$,
$$R^{adv}_x(\hat\theta^{Lin}_{AT}) = \tau^2, \qquad R^{adv}_x(\hat\theta^{Lin}_{RB}) = \tau^2, \qquad R^{adv}_x(\hat\theta^{Lin}_{AB}) = \tau^2\Big(1 - \frac{1 + c + \lambda^* - \sqrt{\Delta}}{2(\varepsilon^2 + 1)}\Big),$$
and when $c \to 1$ with $\mathrm{SNR} = \sigma^2/\tau^2 \le 1$,
$$R^{adv}_x(\hat\theta^{Lin}_{AB}) < R^{adv}_x(\hat\theta^{Lin}_{RB})\Big(1 - \frac{2}{3(\varepsilon^2 + 1)}\Big).$$

Remarks. In this theorem, we provide the form of the three estimators and their adversarial risks. The gap in adversarial risk between $\hat\theta^{Lin}_{AT}$ and $\hat\theta^{Lin}_{RB}$ vanishes when $n, d \to \infty$. The estimator $\hat\theta^{Lin}_{RB}$ achieves the optimal robust risk among all linear models. However, for an arbitrary ratio $c = n/d$, $R^{adv}_x(\hat\theta^{Lin}_{AB}) < R^{adv}_x(\hat\theta^{Lin}_{RB})$, indicating that adaptation to each test input $x$ can improve the robustness of the model even compared with the best linear model. Theorem 3.1 provides the optimal test-time adapted estimator in the linear function class $\mathcal{F}_{Lin}$, which depends on the clean input $x$.

[Figure 1: adversarial risks of $\hat\theta^{Lin}_{AT}$, $\hat\theta^{Lin}_{RB}$, $\hat\theta^{Lin}_{AB}$ and $\hat\theta^{Lin}_{AB,\varepsilon}$ for different adversarial budgets; $\tau^2 = 1$, $\sigma^2 = 0.2$, $d = 250000$.]

In Figure 1, we plot the adversarial risk of the three estimators for different adversarial budgets, which clearly shows that our adaptation can significantly increase the robustness. When the input is corrupted with adversarial noise, the same form of the test-time adapted estimator also significantly improves the adversarial risk, as shown in the following theorem.

Theorem 3.2 (Corrupted Input). We assume the oracle parameter $\theta^*$ is independent of $x$ and has prior distribution $\theta^* \sim \mathcal{N}(0, \tau^2 I)$, and the noise $\xi \sim \mathcal{N}(0, \sigma^2)$. Furthermore, each dimension of $x$ is i.i.d. with $\mathbb{E}x = 0$, $\mathrm{Cov}(x) = I_d/d$ and $\mathbb{E}[\sqrt{d}\,x_i]^4 \le M$ for some universal constant $M < \infty$, and $n, d \to \infty$ with $n/d = c \in (0,1)$. Given the corrupted input $x' = x + \varepsilon\,\theta/\|\theta\|$ with $\varepsilon < 1$, the adversarial risk of $\hat\theta^{Lin}_{AB,\varepsilon} = (x'x'^\top + \varepsilon^2 I_d)^{-1} x' x'^\top X^\top (XX^\top + \lambda^* I_n)^{-1} Y$ is
$$R^{adv}_x(\hat\theta^{Lin}_{AB,\varepsilon}) = \tau^2\Big(1 - \Big(1 - \varepsilon^2 + \frac{2\varepsilon^2}{(1+\varepsilon)^2}\Big)\,\frac{2c}{(1 + c + \lambda^* + \sqrt{\Delta})\,(\varepsilon^2/(1+\varepsilon)^2 + 1)}\Big) < R^{adv}_x(\hat\theta^{Lin}_{RB}).$$

Remarks. The theorem shows that even when the given input is adversarial, test-time adaptation still lowers the adversarial risk of the model, as $R^{adv}_x(\hat\theta^{Lin}_{AB}) < R^{adv}_x(\hat\theta^{Lin}_{AB,\varepsilon}) < R^{adv}_x(\hat\theta^{Lin}_{RB})$.
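The closed forms in Theorem 3.1 can be checked numerically at moderate size. The sketch below uses hypothetical values ($n = 200$, $d = 400$, $\varepsilon = 0.5$), draws data from the model $y = x^\top\theta^* + \xi$, and compares the average adversarial risk of the restricted estimator $\hat\theta^{Lin}_{RB}$ with that of the test-time adapted $\hat\theta^{Lin}_{AB}$, where for a linear model $R^{adv}_x(\theta) = (x^\top(\theta - \theta^*))^2 + \varepsilon^2\|\theta\|^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 400                      # aspect ratio c = n/d = 0.5 (hypothetical sizes)
tau2, sigma2, eps = 1.0, 0.2, 0.5
lam_star = sigma2 / tau2             # lambda* = sigma^2 / tau^2

# Data model of Theorem 3.1: rows of X have Cov(x) = I_d / d.
X = rng.standard_normal((n, d)) / np.sqrt(d)
theta_star = rng.standard_normal(d) * np.sqrt(tau2)
Y = X @ theta_star + rng.standard_normal(n) * np.sqrt(sigma2)

K = X @ X.T
theta_nat = X.T @ np.linalg.solve(K + lam_star * np.eye(n), Y)
theta_RB = theta_nat / (eps ** 2 * d + 1)        # restricted Bayesian robust estimator

def adv_risk(theta, x):
    # Adversarial risk at x for a linear model: natural risk + eps^2 ||theta||^2.
    return (x @ (theta - theta_star)) ** 2 + eps ** 2 * np.sum(theta ** 2)

risks_RB, risks_AB = [], []
for _ in range(2000):
    x = rng.standard_normal(d) / np.sqrt(d)
    # Adaptive estimator (x x^T + eps^2 I)^{-1} x x^T theta_nat, via Sherman-Morrison.
    theta_AB = x * (x @ theta_nat) / (eps ** 2 + x @ x)
    risks_RB.append(adv_risk(theta_RB, x))
    risks_AB.append(adv_risk(theta_AB, x))

mean_RB, mean_AB = np.mean(risks_RB), np.mean(risks_AB)
# The adaptive estimator should achieve a visibly lower average adversarial risk.
```

At these sizes the averages should sit near the asymptotic values $\tau^2$ and $\tau^2(1 - (1 + c + \lambda^* - \sqrt{\Delta})/(2(\varepsilon^2 + 1)))$, with the adaptive estimator clearly ahead, consistent with the theorem.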
In the statistical Bayesian model, we have shown that test-time adaptation can extend the function classes and achieve a significantly lower adversarial risk than any fixed model. In a practical non-Bayesian classification task, explicit calculation of the optimal model is difficult. Nevertheless, test-time adaptation still helps to improve the robustness of the model. As will be shown in the following section, we perform self-supervised test-time fine-tuning to adapt the model to each input, which largely improves the robust accuracy of the test-time adapted model.

4. METHODOLOGY

We follow the traditional multitask learning formulation (Caruana, 1997) and consider a neural network with a backbone $z = E(x; \theta_E)$ and $K+1$ heads. One head $f(z; \theta_f)$ outputs the classification result, while the other $K$ heads $g_1(z; \theta_{g_1}), \dots, g_K(z; \theta_{g_K})$ correspond to $K$ auxiliary self-supervised tasks. $\theta = (\theta_E, \theta_f, \theta_{g_1}, \dots, \theta_{g_K})$ encompasses all trainable parameters, and we further define
$$F = f \circ E, \qquad G_k = g_k \circ E, \quad k = 1, 2, \dots, K. \tag{1}$$
Furthermore, let $D = \{(x_i, y_i)\}_{i=1}^n$ denote the training set and $\hat D = \{(\hat x_i, \hat y_i)\}_{i=1}^m$ the test set. For illustration purposes, we write out the labels of the test set; however, they are unknown to the network at test time. We denote the adversarial example of $x$ by $x'$; it satisfies $\|x' - x\| \le \varepsilon$, where $\varepsilon$ is the size of the adversarial budget. For any set $S$, we write its average loss as
$$L(S) = \frac{1}{|S|} \sum_{s_i \in S} L(s_i), \tag{2}$$
where $|S|$ is the number of elements in $S$. The general classification loss, such as the cross-entropy, is denoted by $L_{cls}$. We use the superscript "AT" to denote the adversarial training loss. For example,
$$L^{AT}_{cls}(S) = \frac{1}{|S|} \sum_{(x_i, y_i) \in S} \max_{\|x_i' - x_i\| \le \varepsilon} L_{cls}(F(x_i'), y_i). \tag{3}$$
For a mini-batch $\hat B'$ of $b$ adversarial test images, our self-supervised fine-tuning loss is
$$L_{SS}(\hat B') = \frac{1}{b} \sum_{k=1}^K C_k \sum_{i=1}^b L_{SS,k}(G_k(\hat x_i'); \theta_E, \theta_{g_k}), \tag{4}$$
which encompasses $K$ self-supervised tasks. Here, $L_{SS,k}$ represents the loss function of the $k$-th task, and $\{C_k\}_{k=1}^K$ are weights balancing the contribution of each task. In our experiments, each $L_{SS,k}$ is the cross-entropy loss for predicting rotation or vertical flip. The number of images $b$ may vary from $1$ to $m$. $b = 1$ corresponds to the online setting, where only one adversarial image is available at a time and the backbone parameters $\theta_E$ are adapted to every new image. The online setting is the most practical one, as it makes no assumptions about the number of adversarial test images the network receives. By contrast, $b = m$ corresponds to the offline setting, where all adversarial test examples are available at once. This is similar to transductive learning (Gammerman et al., 1998; Vapnik, 2013). Note that our online setting differs from the online test-time training described in TTT (Sun et al., 2020): we do not incrementally update the network parameters as new samples come in, but instead initialize fine-tuning from the same starting point $\theta_0$ for each new test image.

Eqn (4) encourages $\theta_E$ to update in favor of the self-supervised tasks. However, as the classification head $f$ was only optimized for the old backbone $E(\cdot; \theta^0_E)$, it will typically be ill-adapted to the new parameters $\theta^*_E$, resulting in degraded robust accuracy. Furthermore, for a small $b$, the model tends to overfit to the test data, reducing $L_{SS}$ to $0$ but extracting features that are only useful for the self-supervised tasks. To overcome these problems, we add an additional loss function acting on the training data that both regularizes the backbone $E$ and optimizes the classification head $f$ so that $f$ remains adapted to the fine-tuned backbone $E(\cdot; \theta^*_E)$. Specifically, let $B \subset D$ denote a subset of the training set. We then add the regularizer
$$L_R(B) = L^{AT}_{cls}(B) = \frac{1}{|B|} \sum_{(x_i, y_i) \in B} \max_{\|x_i' - x_i\| \le \varepsilon} L_{cls}(F(x_i'), y_i) \tag{5}$$
to the fine-tuning process. In short, Eqn (5) evaluates the AT loss on the training set to fine-tune the parameters $\theta_f$ of the classification head. It also forces the backbone $E$ to extract features that can be used to make correct predictions, i.e., it prevents $\theta_E$ from being misled by $L_{SS}$ when $b$ is small. Combining Eqn (4) and Eqn (5), our final test-time adaptation loss is
$$L_{test}(\hat B', B) = L_{SS}(\hat B') + C\, L_R(B), \tag{6}$$
where $C$ sets the influence of $L_R$. The algorithms that describe our test-time self-supervised learning are deferred to Appendix D. As SGD is more efficient for larger amounts of data, we use SGD to optimize $\theta$ when $b$ is large (e.g., in the offline setting).
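To make the structure of Eqn (6) concrete, here is a minimal sketch of a single test-time adaptation step on a toy linear backbone with quadratic losses. Everything here (the linear model, the squared losses, the dimensions, the targets) is a hypothetical stand-in for the network and cross-entropy losses used in the paper; the point is only the update pattern: a self-supervised loss on an unlabeled test image plus $C$ times the regularizer on labeled training data, applied to $\theta_E$ and $\theta_f$.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_feat = 8, 4
C, lr = 0.5, 0.05

# Toy parameters: shared backbone W_E, classification head w_f, self-supervised head w_g.
W_E = rng.standard_normal((d_feat, d_in)) * 0.1
w_f = rng.standard_normal(d_feat) * 0.1
w_g = rng.standard_normal(d_feat) * 0.1

# One unlabeled adversarial test image (with a self-supervised target, e.g. rotation)
# and one labeled training image.
x_test, r_test = rng.standard_normal(d_in), 1.0
x_train, y_train = rng.standard_normal(d_in), -1.0

def L_test(W_E, w_f):
    # Eqn (6): L_SS on the test image + C * L_R on the training image (squared losses).
    L_SS = (w_g @ (W_E @ x_test) - r_test) ** 2
    L_R = (w_f @ (W_E @ x_train) - y_train) ** 2
    return L_SS + C * L_R

before = L_test(W_E, w_f)
# Analytic gradients of the two quadratic terms.
e_ss = w_g @ (W_E @ x_test) - r_test
e_r = w_f @ (W_E @ x_train) - y_train
grad_WE = 2 * e_ss * np.outer(w_g, x_test) + C * 2 * e_r * np.outer(w_f, x_train)
grad_wf = C * 2 * e_r * (W_E @ x_train)

# One adaptation step on the backbone and the classification head.
W_E = W_E - lr * grad_WE
w_f = w_f - lr * grad_wf
after = L_test(W_E, w_f)
```

With a small step size, `after` is smaller than `before`: the self-supervised gradient moves the backbone while the regularizer keeps the classification head adapted to it, mirroring the roles of Eqns (4) and (5).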

4.2. META ADVERSARIAL TRAINING

To make the best of optimizing $L_{test}$ at test time, we should find a suitable starting point $\theta_0$, i.e., a starting point such that test-time self-supervised learning yields better robust accuracy. We translate this into a meta-learning scheme, which entails a bilevel optimization problem. Specifically, we divide the training data into $s$ small exclusive subsets $D = \cup_{j=1}^s B_j$ and let $\tilde B_j$ be the adversaries of $B_j$. We then formulate meta adversarial learning as the bilevel minimization of
$$L_{meta}(D; \theta) = \frac{1}{s} \sum_{B_j \subset D} L^{AT}_{cls}(B_j; \theta^*_j(\theta)), \quad \text{where } \theta^*_j = \arg\min_\theta L_{SS}(\tilde B_j; \theta), \tag{7}$$
where $L_{SS}$ is the self-supervised loss function defined in Eqn (4) and $L^{AT}_{cls}$ is the AT loss defined in Eqn (3). As bilevel optimization is time-consuming, following MAML (Finn et al., 2017), we use a single gradient step from the current model parameters $\theta$ to approximate $\theta^*_j$:
$$\theta^*_j \approx \theta - \alpha \nabla_\theta L_{SS}(\tilde B_j; \theta). \tag{8}$$
In essence, this Meta Adversarial Training (MAT) scheme searches for a starting point such that fine-tuning with $L_{SS}$ leads to good robust accuracy. If this holds for all training subsets, then we can expect the robust accuracy after fine-tuning at test time to increase as well. Note that, because the meta-learning objective of Eqn (7) already accounts for classification accuracy, the regularizer $L_R$ is not needed during meta adversarial learning.

Accelerating Training. To compute the gradient $\nabla_\theta L_{meta}(D; \theta)$, we need to calculate the time-consuming second-order derivatives $-\alpha \nabla^2_\theta L_{SS}(\tilde B_j; \theta)\, \nabla_{\theta^*_j} L^{AT}_{cls}(B_j; \theta^*_j)$. Considering that AT is already much slower than standard training (Shafahi et al., 2019), we cannot afford another significant training overhead. Fortunately, as shown in (Finn et al., 2017), second-order derivatives have little influence on the performance of MAML. We therefore ignore them and take the gradient to be
$$\nabla_\theta L_{meta}(D; \theta) \approx \frac{1}{s} \sum_{B_j \subset D} \nabla_{\theta^*_j} L^{AT}_{cls}(B_j; \theta^*_j). \tag{9}$$
However, by ignoring the second-order gradient, only the parameters on the forward path of the classifier $F$, i.e., $\theta_E$ and $\theta_f$, will be updated. In other words, optimizing Eqn (7) in this fashion will not update $\{\theta_{g_k}\}_{k=1}^K$. To nonetheless encourage each self-supervised head $G_k$ to output the correct prediction, we incorporate an additional loss function encoding the self-supervised tasks,
$$L^{AT}_{SS}(D) = \sum_k C_k L^{AT}_{SS,k}(D) = \sum_k \frac{C_k}{|D|} \sum_{x_i \in D} \max_{\|x_i' - x_i\| \le \varepsilon} L_{SS,k}(G_k(x_i')). \tag{10}$$
Note that we use the adversarial version of $L_{SS}$ to provide robustness to the self-supervised tasks, which, as shown in (Chen et al., 2020a; Hendrycks et al., 2019; Yang & Vondrick, 2020), is beneficial for the classifier. The final meta adversarial learning objective therefore is
$$L_{train}(D) = L_{meta}(D) + C'\, L^{AT}_{SS}(D), \tag{11}$$
where $C'$ balances the two losses. Algorithm 1 shows the complete MAT algorithm.
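The first-order scheme of Eqns (8)-(9) can be illustrated on a toy problem in which $L_{SS}$ and $L^{AT}_{cls}$ are replaced by hypothetical quadratics (all matrices, dimensions and step sizes below are illustrative, not the paper's): take one inner gradient step on the stand-in for $L_{SS}$, then update the starting point with the gradient of the stand-in for $L^{AT}_{cls}$ evaluated at the inner point, ignoring second-order terms.

```python
import numpy as np

rng = np.random.default_rng(2)
dim, alpha, beta = 6, 0.05, 0.05

# Hypothetical quadratic stand-ins for the self-supervised loss L_SS
# and the adversarial classification loss on one subset B_j.
A, a = rng.standard_normal((dim, dim)) / np.sqrt(dim), rng.standard_normal(dim)
B, b = rng.standard_normal((dim, dim)) / np.sqrt(dim), rng.standard_normal(dim)
L_cls = lambda t: np.sum((B @ t - b) ** 2)
grad_cls = lambda t: 2 * B.T @ (B @ t - b)
grad_ss = lambda t: 2 * A.T @ (A @ t - a)

def meta_loss(theta):
    # Eqns (7)-(8): classification loss after one inner step on L_SS.
    return L_cls(theta - alpha * grad_ss(theta))

theta = rng.standard_normal(dim) * 0.1
losses = []
for _ in range(100):
    losses.append(meta_loss(theta))
    inner = theta - alpha * grad_ss(theta)      # Eqn (8): inner adaptation step
    theta = theta - beta * grad_cls(inner)      # Eqn (9): first-order meta gradient
```

The meta loss (classification loss *after* the self-supervised step) decreases over iterations even though the second-order term is dropped, which is the behavior MAT relies on.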

5. EXPERIMENTS

Experimental Settings. Following previous works (Cui et al., 2020; Huang et al., 2020), we consider $\ell_\infty$-norm attacks with an adversarial budget $\varepsilon = 0.031\ (\approx 8/255)$. We evaluate our method on three datasets: CIFAR10 (Krizhevsky et al., 2009), STL10 (Coates et al., 2011) and Tiny ImageNet (Le & Yang, 2015). We also use two different network architectures: WideResNet-34-10 (Zagoruyko & Komodakis, 2016) for CIFAR10, and ResNet18 (He et al., 2016) for STL10 and Tiny ImageNet. The hyperparameters are provided in Appendix D.

Algorithm 1: Meta Adversarial Training (MAT)
1: for each training iteration do
2:   Sample $q$ exclusive batches of training images $B_1, B_2, \dots, B_q \subset D$
3:   Use PGD to find the adversaries $\tilde B_j$: $\tilde x_{j,i} = \arg\max_{\|x'_{j,i} - x_{j,i}\| \le \varepsilon} L_{cls}(F(x'_{j,i}), y_{j,i})$
4:   for batches $B_1, B_2, \dots, B_q$ do
5:     $\theta^*_j = \theta - \alpha \nabla_\theta L_{SS}(\tilde B_j; \theta)$
6:     $l_{meta,j} = L^{AT}_{cls}(B_j; \theta^*_j)$
7:   end for
8:   $\theta = \theta - \frac{\beta}{q} \sum_{B_j} \big[\nabla_{\theta^*_j} l_{meta,j} + C' \nabla_\theta L^{AT}_{SS}(B_j; \theta)\big]$
9: end for
10: return trained parameters $\theta_0 = \theta$

Self-Supervised Tasks. In principle, any self-supervised task can be used for test-time fine-tuning, as long as it is positively correlated with the robust accuracy. However, for the test-time fine-tuning to remain efficient, we should not use too many self-supervised tasks. Furthermore, as we aim to support the fully online setting, where only one image is available at a time, we cannot incorporate a contrastive loss (Chen et al., 2020b; He et al., 2020; Kim et al., 2020) into $L_{SS}$. In our experiments, we therefore use two self-supervised tasks that have been shown to be useful for improving classification accuracy: rotation prediction and vertical flip prediction.

Attack Methods. In white-box attacks, the attacker knows every detail of the defense method. Therefore, we need to assume that the attacker is aware of our test-time adaptation method and will adjust its strategy for generating adversarial examples accordingly. Suppose the attacker is fully aware of the hyperparameters for test-time adaptation. Then, finding the adversaries $\hat B'$ of a clean subset $\hat B$ can be achieved by maximizing the adaptive loss
$$\hat x'_i = \arg\max_{\|\hat x'_i - \hat x_i\| \le \varepsilon} L_{attack}(F(\hat x'_i), y; \theta_T(\hat B')), \tag{12}$$
where $L_{attack}$ refers to a general attack loss, such as the cross-entropy or the difference of logits ratio (DLR) (Croce & Hein, 2020a). We call the objective in Eqn (12) the adaptive attack, which can be performed in either the white-box or the black-box setting. We consider four common white-box and black-box attack methods: PGD-20 (Madry et al., 2018), AutoPGD (with both cross-entropy and DLR loss) (Croce & Hein, 2020a), FAB (Croce & Hein, 2020b) and Square Attack (Andriushchenko et al., 2020). We apply both the standard and adaptive versions of these methods. In particular, the AutoPGD we use is a strong version that continues maximizing the loss even after adversarial examples are found (Croce et al., 2022). More details are provided in Appendix E.

Baselines. We compare our method with the following methods: 1) regular AT, which uses $L^{AT}_{cls}$ in Eqn (3); 2) regular AT with an additional self-supervised loss, i.e., using $L^{AT}_{cls} + C' L^{AT}_{SS}$ for AT, where $L^{AT}_{SS}$ is given in Eqn (10), which corresponds to the formulation of (Hendrycks et al., 2019); 3) MAT (Algorithm 1) without test-time fine-tuning.
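For reference, a minimal $\ell_\infty$ PGD loop of the kind underlying these attacks, sketched on a hypothetical linear classifier with a logistic loss (the real attacks operate on the network with cross-entropy or DLR losses; the classifier, loss and budget here are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
d, eps, step, n_steps = 16, 8 / 255, 2 / 255, 20

w = rng.standard_normal(d)           # toy linear classifier: score = w @ x, label y in {-1, +1}
x, y = rng.standard_normal(d), 1.0

def loss(x_adv):
    # Logistic loss as a stand-in for the cross-entropy attack loss.
    return np.log1p(np.exp(-y * (w @ x_adv)))

x_adv = x + rng.uniform(-eps, eps, d)            # random start inside the budget
for _ in range(n_steps):
    g = -y * w / (1 + np.exp(y * (w @ x_adv)))   # gradient of the loss w.r.t. the input
    x_adv = x_adv + step * np.sign(g)            # ascent step on the loss
    x_adv = x + np.clip(x_adv - x, -eps, eps)    # project back onto the l_inf ball
```

After the loop, the perturbation stays within the $\varepsilon$-ball while the attack loss at `x_adv` exceeds the loss at the clean `x`. An adaptive attack in the sense of Eqn (12) additionally runs the defender's test-time adaptation inside the loss evaluation.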

5.1. ROBUST ACCURACY

CIFAR10. Table 1a shows the robust accuracy for different attacks and using two different tasks for fine-tuning. The adaptive attack is not applicable to models without fine-tuning. As we inject different self-supervised tasks into the AT stage, and as different self-supervised tasks may impact the robust accuracy differently (Chen et al., 2020a), the robust accuracy without fine-tuning still varies. The vertical flipping task yields better robust accuracy before fine-tuning, but its improvement after fine-tuning is small. By contrast, rotation prediction achieves low robust accuracy before fine-tuning, but its improvement after fine-tuning is the largest. Using both tasks together combines their effects and yields the highest overall accuracy after test-time adaptation. Note that our self-supervised test-time fine-tuning, together with meta adversarial learning, consistently improves the robust accuracy under different attack methods. Under the strongest adaptive AutoPGD, test-time fine-tuning using both tasks achieves a robust accuracy of 57.70%, significantly outperforming regular AT.

[Table 1: Robust accuracy on CIFAR10, STL10 and Tiny ImageNet of the test-time fine-tuning in both the online and offline settings, with an $\ell_\infty$ budget $\varepsilon = 0.031$. FT stands for fine-tuning. We underline the accuracy under the strongest attack and highlight the highest accuracy among them. (a) CIFAR10 with WideResNet-34-10.]

STL10 and Tiny ImageNet. As using both rotation and vertical flip prediction led to the highest overall accuracy on CIFAR10, we focus on this strategy for STL10 and Tiny ImageNet. Tables 1b and 1c show the robust accuracy on STL10 and Tiny ImageNet using a ResNet18. Our approach also significantly outperforms regular AT on these datasets.

Offline Test-time Adaptation. As shown in Tables 1a, 1b and 1c, offline fine-tuning further improves the robust accuracy over the online version.

Diverse Attacks. As recommended by (Croce et al., 2022), in Appendix C.1 we evaluate our method under diverse attacks, including transfer, expectation and boundary attacks; test-time adaptation improves the robustness of the model against all of them.

5.2. METHOD ANALYSIS

We observe a significant positive correlation between the gradients of the self-supervised loss $L_{SS}$ and the classification loss $L_{cls}$. Define
$$\rho(\hat x'_i) = \frac{\nabla_{\theta_E} L_{cls}(\hat x'_i, \hat y_i)^\top\, \nabla_{\theta_E} L_{SS}(\hat x'_i)}{\|\nabla_{\theta_E} L_{cls}(\hat x'_i, \hat y_i)\|_2\, \|\nabla_{\theta_E} L_{SS}(\hat x'_i)\|_2}, \tag{13}$$
and approximate $L_{cls}$ by the Taylor expansion
$$L_{cls}(\hat x'_i, \hat y_i; \theta_E - \eta \nabla_{\theta_E} L_{SS}(\hat x'_i)) - L_{cls}(\hat x'_i, \hat y_i; \theta_E) \approx -\eta\, \rho(\hat x'_i)\, \|\nabla_{\theta_E} L_{cls}(\hat x'_i, \hat y_i)\|_2\, \|\nabla_{\theta_E} L_{SS}(\hat x'_i)\|_2. \tag{14}$$
As $\theta_E$ contains millions of parameters, its gradient norm is typically large. Therefore, gradient descent w.r.t. $L_{SS}$ should act as a good substitute for optimizing $L_{cls}$ when $\rho(\hat x'_i)$ is significantly larger than $0$. We further confirm this empirically. For all adversarial test inputs $\hat x' \sim \hat D'$, we regard $\rho(\hat x')$ as a random variable and calculate its empirical statistics on the test set. Table 2 shows the empirical statistics for an adversarially-trained model on CIFAR10, and Figure 2 shows the c.d.f. of $\rho(\hat x')$. The mean of $\rho(\hat x')$ is indeed significantly larger than $0$, and $P(\rho(\hat x') > 0)$ is larger than the robust accuracy of the adversarially-trained network (50%-60%), which implies that self-supervised test-time fine-tuning helps to correctly classify the adversarial test images. We further provide a theoretical analysis for a linear model in Theorem B.1, which shows that the correlated gradient significantly strengthens the robustness and lowers the natural risk. Besides, the correlated gradient also helps the model move closer to the Bayesian robust estimator $\hat\theta_{AB}$.
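Eqns (13)-(14) can be checked directly on a toy objective. With hypothetical quadratic stand-ins for $L_{cls}$ and $L_{SS}$ (all matrices and sizes below are illustrative), the first-order prediction from the gradient correlation $\rho$ matches the actual change of the classification loss after a small step on the self-supervised gradient, up to an $O(\eta^2)$ remainder:

```python
import numpy as np

rng = np.random.default_rng(4)
p, eta = 50, 1e-4

# Hypothetical quadratic stand-ins for L_cls and L_SS as functions of theta_E.
A, a = rng.standard_normal((p, p)) / np.sqrt(p), rng.standard_normal(p)
Bm, b = rng.standard_normal((p, p)) / np.sqrt(p), rng.standard_normal(p)
L_cls = lambda t: np.sum((A @ t - a) ** 2)
grad_cls = lambda t: 2 * A.T @ (A @ t - a)
grad_ss = lambda t: 2 * Bm.T @ (Bm @ t - b)

theta = rng.standard_normal(p)
g_c, g_s = grad_cls(theta), grad_ss(theta)
rho = (g_c @ g_s) / (np.linalg.norm(g_c) * np.linalg.norm(g_s))   # Eqn (13)

# Eqn (14): predicted first-order change of L_cls after a step on L_SS.
predicted = -eta * rho * np.linalg.norm(g_c) * np.linalg.norm(g_s)
actual = L_cls(theta - eta * g_s) - L_cls(theta)
```

For a quadratic $L_{cls}$, the gap `actual - predicted` is exactly the second-order term $\eta^2 \|A g_s\|^2$, so a positive $\rho$ indeed implies that descending on the self-supervised gradient also lowers the classification loss.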

5.3. ABLATION STUDY

Meta Adversarial Training. To show the effectiveness of MAT, we perform an ablation study that fine-tunes a model trained with regular AT (i.e., setting $\alpha = 0$ in line 5 of Algorithm 1). Table 7 shows that both the robust accuracy and the improvement from fine-tuning are consistently worse without MAT.

Accuracy Improvement on Inputs with a Different Adversarial Budget. As shown in Table 8, we set $\varepsilon = 0.015$ and perform online test-time fine-tuning, showing that our method is also able to improve the robust accuracy on inputs with different adversarial budgets.

Removing $L_{SS}$ or $L_R$. To study the effect of $L_{SS}$ and $L_R$ in $L_{test}$, we report the robust accuracy after online fine-tuning using only $L_R$ or only $L_{SS}$ in Table 9. As expected, removing $L_{SS}$ reduces the accuracy more than removing $L_R$ does, which shows the benefit of our self-supervised test-time fine-tuning strategy. Nevertheless, the best results are obtained by exploiting both loss terms.

Improvement on Clean Images. As predicted by Theorems 3.1 and B.1, our method is able to improve not only the robust accuracy but also the natural accuracy. As shown in Table 10, our approach increases the clean-image accuracy through test-time adaptation. This phenomenon further supports our conjecture that the improvement in robust accuracy is due to improved generalization rather than gradient masking.

6. CONCLUSION

In linear models and two-layer random networks, we theoretically demonstrate the necessity of test-time adaptation for the model to achieve optimal robustness. To this end, we propose self-supervised test-time fine-tuning on adversarially-trained models to improve their generalization ability. Furthermore, we introduce a MAT strategy to find a good starting point for our self-supervised fine-tuning process. Our extensive experiments on CIFAR10, STL10 and Tiny ImageNet demonstrate that our method consistently improves the robust accuracy under different attack strategies, including strong adaptive attacks where the attacker is aware of our test-time adaptation technique. In these experiments, we utilize three different sources of self-supervision: rotation prediction, vertical flip prediction, and their ensemble.

REFERENCES

Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning, pp. 7472-7482. PMLR, 2019.

A PROOFS OF THEOREMS

A.1 PRELIMINARY: MARCHENKO-PASTUR LAW AND TRANSFORMATION OF EIGENVALUES Before the proof linear models, we first give the asymptotic spectrum of the matrix K = X X, and some useful results of the trace of the transformation of K. Lemma A.1 (Marchenko-Pastur Law, Theorem 3.4 in (Bai & Silverstein, 2010) ). We define the eigenvalues of X X: λ 1 > λ 2 > • • • λ n has distribution with c.d.f. H n (s) = 1 n n i=1 1 λi≤s . If each dimension of x is i.i.d. with Ex = 0, Cov(x) = I d /d and E[ √ dx i ] 4 ≤ M for some universal constant M < ∞, then when n, d → ∞ with n/d = c ∈ (0, 1), for any bounded function g, when n, d → ∞ with c = n/d ∈ [0, ∞) g(s)dH n (s) → g(s)dH(s), with p.d.f. dH(s) dH(s) = 1 2πc (λ + -s) (s -λ -) s 1 s∈[λ-,λ+] ds, where λ -= (1 - √ c) 2 and λ + = (1 + √ c) 2 . Lemma A.2. If each dimension of x is i.i.d. with Ex = 0, Cov(x) = I d /d and E[ √ dx i ] 4 ≤ M for some universal constant M < ∞, then when n, d → ∞ with n/d = c ∈ (0, 1), Tr(K(K + λI n ) -2 ) = d 1 + c + λ - √ ∆ 2 √ ∆ , Tr(K 2 (K + λI n ) -2 ) = d (1 + c + λ) - √ ∆ 2 (1 - λ √ ∆ ), Tr((K + λI n ) -2 ) = d (1 + c)(1 + c + λ) -4c -(1 -c) √ ∆ 2λ 2 √ ∆ , where ∆ = (1 + c + z) 2 -4c. Proof. We define three transformations of the eigenvalue of dH(s) t 1 (z) = s (s + z) 2 dH(s), t 2 (z) = s 2 (s + z) 2 dH(s), t 3 (z) = 1 (s + z) 2 dH(s), for z ∈ [0, ∞). They can be calculated by the Stieltjes transformation of Marchenko-Pastur law. By Marchenko-Pastur semicircular law, the Stieltjes transformation of dH(s) is (Lemma 3.11 in (Bai & Silverstein, 2010) and Lemma 4.4 in (Cheng et al., 2022) ) m(z) = 1 s -z dH(s) = 1 -c -z -(1 + c -z) 2 -4c Let ∆ = (1 + c + z) 2 -4c t(z) = m(-z) = 1 -c + z -(1 + c + z) 2 -4c -2cz and 1 (s + z) 2 dH(s) = - d dz 1 s + z dH(s) = - dt(z) dz . 
Then t 1 (z) = 1 s + z dH(s) - z (s + z) 2 dH(s) = t(z) + z dt(z) dz = 1 + c + z - √ ∆ 2c √ ∆ , t 2 (z) =1 - 2z s + z dH(s) + z 2 (s + z) 2 dH(s) = 1 -2zt(z) -z 2 dt(z) dz = (1 + c + z) - √ ∆ 2c (1 - z √ ∆ ), t 3 (z) = - d dz 1 s + z dH(s) = dt(z) dz = (1 + c)(1 + c + z) -4c -(1 -c) √ ∆ 2cz 2 √ ∆ . ( ) The trace operation can be translated into: Tr(K(K + λI n ) -2 ) = n s (s + λ) 2 dH n (s), Tr(K 2 (K + λI n ) -2 ) = n s 2 (s + λ) 2 dH n (s), Tr((K + λI n ) -2 ) = n 1 (s + λ) 2 dH n (s). Therefore, when n, d → ∞ with c = n/d ∈ (0, 1), Tr(K(K + λI n ) -2 ) →n 1 + c + λ - √ ∆ 2c √ ∆ = d 1 + c + λ - √ ∆ 2 √ ∆ , Tr(K 2 (K + λI n ) -2 ) →n (1 + c + λ) - √ ∆ 2c (1 - λ √ ∆ ) = d (1 + c + λ) - √ ∆ 2 (1 - λ √ ∆ ), Tr((K + λI n ) -2 ) →n (1 + c)(1 + c + λ) -4c -(1 -c) √ ∆ 2cλ 2 √ ∆ =d (1 + c)(1 + c + λ) -4c -(1 -c) √ ∆ 2λ 2 √ ∆ . where ∆ = (1 + c + λ) 2 -4c. A.2 PROOF OF THEOREM 3.1 We separate the proof into three parts for three estimators separately. Proof of θ Lin AT . For θ AT Lin , taking gradient with respect to the objective function 1 n n i=1 (y i -x i θ) + ε 2 θ 2 , we obtain 2 n (Y -X θ) + 2nεθ = 0. Therefore, θ Lin AT = X(X X + nε 2 I n ) -1 Y Its natural risk is R nat x ( θ Lin AT ) =E ξ,θ * ,x (x ( θ Lin AT -θ * )) 2 = 1 d θ Lin AT -θ * 2 = 1 d E ξ,θ * θ * -X(X X + λI n ) -1 (X θ * + ξ) 2 =τ 2 (1 - n d ) + λ 2 τ 2 d Tr((K + λI n ) -2 ) + σ 2 d Tr(K(K + λI n ) -2 ) (18) And the Lipschitz constant is L( θ Lin AT ) 2 = θ Lin AT 2 =E ξ,θ * (XX + λI d ) -1 X(X θ * + ξ) 2 =τ 2 Tr (K + λI n ) -2 K 2 + σ 2 Tr (K + λI n ) -2 K . When λ = nε 2 → ∞, by Lemma A.2, Tr(K(K + λI n ) -2 ) → 0, Tr(K 2 (K + λI n ) -2 ) → 0, λ 2 Tr((K + λI n ) -2 ) → n d (20) Therefore, R adv x ( θ Lin AT ) = R nat x ( θ Lin AT ) + ε 2 L( θ Lin AT ) → τ 2 Lemma A.3. If the oracle parameter θ * is independent of x and has the prior distribution θ * ∼ N(0, τ 2 I), and the noise ξ ∼ N(0, σ 2 ). Furthermore, if we assume each dimension of x is i.i.d. 
with $\mathbb{E}x = 0$, $\mathrm{Cov}(x) = I_d/d$ and $\mathbb{E}[\sqrt{d}\,x_i]^4 \le M$ for some universal constant $M < \infty$. When $n, d \to \infty$ with $n/d = c \in (0, 1)$, then for $\lambda^* = \sigma^2/\tau^2$ and $\hat\theta = \frac{1}{A}X(X^\top X + \lambda^* I_n)^{-1}Y$,
$$R^{nat}_x(\hat\theta) = \tau^2 - \frac{\tau^2}{A}\big((1 + c + \lambda^*) - \sqrt{\Delta}\big)\Big(1 - \frac{1}{2A}\Big), \quad L(\hat\theta)^2 = \frac{d\tau^2}{2A^2}\big((1 + c + \lambda^*) - \sqrt{\Delta}\big),$$
where $\Delta = (1 + c + \lambda^*)^2 - 4c$.

Proof.
$$R^{nat}_x(\hat\theta) = \frac{1}{d}\mathbb{E}_{\xi,\theta^*}\Big\|\theta^* - \frac{1}{A}X(X^\top X + \lambda^* I_n)^{-1}(X^\top\theta^* + \xi)\Big\|^2 = \frac{\tau^2}{d}\mathrm{Tr}\Big(\big(I_d - \tfrac{1}{A}X(K + \lambda^* I_n)^{-1}X^\top\big)^2\Big) + \frac{\sigma^2}{d}\mathrm{Tr}\Big(\tfrac{1}{A^2}X(K + \lambda^* I_n)^{-2}X^\top\Big)$$
$$= \frac{\tau^2}{d}\Big(d - \frac{2n}{A}\Big) + \frac{2\tau^2\lambda^*}{dA}\mathrm{Tr}\big((K + \lambda^* I_n)^{-1}\big) + \frac{\tau^2}{dA^2}\mathrm{Tr}\big(K^2(K + \lambda^* I_n)^{-2}\big) + \frac{\sigma^2}{dA^2}\mathrm{Tr}\big((K + \lambda^* I_n)^{-2}K\big). \quad (22)$$
According to Lemma A.2 and the Stieltjes transform $t(\lambda^*)$, when $n, d \to \infty$ with $c = n/d \in (0, 1)$,
$$\mathrm{Tr}\big((K + \lambda^* I_n)^{-1}\big) \to d\,\frac{\sqrt{\Delta} - (1 - c + \lambda^*)}{2\lambda^*}, \quad \mathrm{Tr}\big(K(K + \lambda^* I_n)^{-2}\big) \to d\,\frac{1 + c + \lambda^* - \sqrt{\Delta}}{2\sqrt{\Delta}}, \quad \mathrm{Tr}\big(K^2(K + \lambda^* I_n)^{-2}\big) \to d\,\frac{(1 + c + \lambda^*) - \sqrt{\Delta}}{2}\Big(1 - \frac{\lambda^*}{\sqrt{\Delta}}\Big),$$
where $\lambda^* = \sigma^2/\tau^2$ and $\Delta = (1 + c + \lambda^*)^2 - 4c$. Combining the terms,
$$R^{nat}_x(\hat\theta) = \tau^2 - \frac{\tau^2}{A}\big((1 + c + \lambda^*) - \sqrt{\Delta}\big)\Big(1 - \frac{1}{2A}\Big).$$
And
$$L(\hat\theta)^2 = \mathbb{E}_{\xi,\theta^*}\Big\|\frac{1}{A}X(X^\top X + \lambda^* I_n)^{-1}(X^\top\theta^* + \xi)\Big\|^2 = \frac{\tau^2}{A^2}\mathrm{Tr}\big(K^2(K + \lambda^* I_n)^{-2}\big) + \frac{\sigma^2}{A^2}\mathrm{Tr}\big((K + \lambda^* I_n)^{-2}K\big) = \frac{d\tau^2}{2A^2}\big((1 + c + \lambda^*) - \sqrt{\Delta}\big). \qquad \square$$

Proof for $\hat\theta^{Lin}_{RB}$. It is well known that the posterior distribution of $\theta^*$ is
$$\theta^* \mid X, Y \sim N\big(X(X^\top X + \lambda^* I_n)^{-1}Y,\ \sigma^2(XX^\top + \lambda^* I_d)^{-1}\big),$$
where $\lambda^* = \sigma^2/\tau^2$. According to the definition of the Bayesian estimator,
$$\hat\theta^{Lin}_{RB} = \arg\min_\theta \mathbb{E}_{\theta^*|X,Y}\Big[\mathbb{E}_x\big(x^\top(\theta - \theta^*)\big)^2 + \varepsilon^2\|\theta\|^2\Big].$$
As the linear model only allows a fixed $\theta$ for all $x$, setting the derivative to zero,
$$\frac{d}{d\theta}\mathbb{E}_{\theta^*|X,Y}\Big[\frac{1}{d}\|\theta - \theta^*\|^2 + \varepsilon^2\|\theta\|^2\Big] = \mathbb{E}_{\theta^*|X,Y}\Big[\frac{2}{d}(\theta - \theta^*) + 2\varepsilon^2\theta\Big] = 0.$$
Therefore,
$$\hat\theta^{Lin}_{RB} = \frac{1}{\varepsilon^2 d + 1}\mathbb{E}_{\theta^*|X,Y}[\theta^*] = \frac{1}{\varepsilon^2 d + 1}X(X^\top X + \lambda^* I_n)^{-1}Y.$$
Using Lemma A.3 with $A = \varepsilon^2 d + 1$, when $d \to \infty$,
$$R^{nat}_x(\hat\theta^{Lin}_{RB}) = \tau^2 - \frac{\tau^2}{\varepsilon^2 d + 1}\big((1 + c + \lambda^*) - \sqrt{\Delta}\big)\frac{2\varepsilon^2 d + 1}{2(\varepsilon^2 d + 1)} \to \tau^2, \quad L(\hat\theta^{Lin}_{RB})^2 = \frac{d\tau^2}{2(\varepsilon^2 d + 1)^2}\big((1 + c + \lambda^*) - \sqrt{\Delta}\big) \to 0.$$
Summarizing the two results,
$$R^{adv}_x(\hat\theta^{Lin}_{RB}) = R^{nat}_x(\hat\theta^{Lin}_{RB}) + \varepsilon^2 L(\hat\theta^{Lin}_{RB})^2 \to \tau^2. \quad (27)$$

Proof for $\hat\theta^{Lin}_{AB}$. The Bayesian robust estimator for each $x$, which optimizes $(x^\top(\theta - \theta^*))^2 + \varepsilon^2\|\theta\|^2$, is
$$\hat\theta^{Lin}_{AB}(x) = \arg\min_\theta \mathbb{E}_{\theta^*|X,Y}\Big[\big(x^\top(\theta - \theta^*)\big)^2 + \varepsilon^2\|\theta\|^2\Big],$$
where, as above, the posterior distribution is $\theta^* \mid X, Y \sim N(X(X^\top X + \lambda^* I_n)^{-1}Y, \sigma^2(XX^\top + \lambda^* I_d)^{-1})$. Setting the gradient w.r.t. $\theta$ to zero gives the solution
$$\hat\theta^{Lin}_{AB}(x) = (xx^\top + \varepsilon^2 I_d)^{-1}xx^\top X(X^\top X + \lambda^* I_n)^{-1}Y = \frac{xx^\top}{\varepsilon^2 + x^\top x}\hat\theta^{Lin}_{nat},$$
where $\hat\theta^{Lin}_{nat} = X(X^\top X + \lambda^* I_n)^{-1}(X^\top\theta^* + \xi)$ with $\lambda^* = \sigma^2/\tau^2$ is the Bayesian estimator for the natural risk. For its natural risk,
$$R^{nat}_x(\hat\theta^{Lin}_{AB}) = \mathbb{E}_{\theta^*,\xi}\mathbb{E}_x\big(x^\top(\hat\theta^{Lin}_{AB} - \theta^*)\big)^2.$$
As
$$x^\top\hat\theta^{Lin}_{AB} = x^\top\frac{xx^\top}{\varepsilon^2 + x^\top x}\hat\theta^{Lin}_{nat} = \frac{x^\top x}{\varepsilon^2 + x^\top x}\,x^\top\hat\theta^{Lin}_{nat},$$
we have
$$R^{nat}_x(\hat\theta^{Lin}_{AB}) = \mathbb{E}_{\theta^*,\xi}\mathbb{E}_x\Big(x^\top\Big(\frac{x^\top x}{\varepsilon^2 + x^\top x}\hat\theta^{Lin}_{nat} - \theta^*\Big)\Big)^2.$$
When $d \to \infty$, $x^\top x \to 1$ in probability. Therefore,
$$\frac{x^\top x}{\varepsilon^2 + x^\top x} \to \frac{1}{1 + \varepsilon^2}.$$
Then when $d \to \infty$,
$$R^{nat}_x(\hat\theta^{Lin}_{AB}) = \frac{1}{d}\mathbb{E}_{\theta^*,\xi}\Big\|\frac{1}{\varepsilon^2 + 1}\hat\theta^{Lin}_{nat} - \theta^*\Big\|^2.$$
By Lemma A.3 with $A = \varepsilon^2 + 1$,
$$R^{nat}_x(\hat\theta^{Lin}_{AB}) = \tau^2 - \frac{\tau^2}{\varepsilon^2 + 1}\big((1 + c + \lambda^*) - \sqrt{\Delta}\big)\Big(1 - \frac{1}{2(\varepsilon^2 + 1)}\Big). \quad (30)$$
As $x^\top x \to 1$ in probability when $d \to \infty$, $\frac{x^\top x}{(\varepsilon^2 + x^\top x)^2} \to \frac{1}{(\varepsilon^2 + 1)^2}$. Then
$$L(\hat\theta^{Lin}_{AB})^2 \to \frac{1}{d(\varepsilon^2 + 1)^2}\mathbb{E}_{\theta^*,\xi}\|\hat\theta^{Lin}_{nat}\|^2. \quad (31)$$
From Lemma A.3,
$$\mathbb{E}_{\theta^*,\xi}\|\hat\theta^{Lin}_{nat}\|^2 = d\tau^2\,\frac{(1 + c + \lambda^*) - \sqrt{\Delta}}{2}.$$
Summarizing the two parts, the adversarial risk is
$$R^{adv}_x(\hat\theta^{Lin}_{AB}) = \tau^2 - \frac{\tau^2}{2(\varepsilon^2 + 1)}\big((1 + c + \lambda^*) - \sqrt{\Delta}\big) \le \tau^2\Big(1 - \frac{c}{(\varepsilon^2 + 1)(1 + c + \lambda^*)}\Big). \qquad \square$$

A.3 PROOF OF THEOREM 3.2

Proof. As $\hat\theta^{Lin}_{AB} = \frac{xx^\top\hat\theta^{Lin}_{nat}}{\varepsilon^2 + x^\top x}$ is parallel to $x$, the adversarial input is
$$\tilde{x} = x + \varepsilon\frac{\hat\theta^{Lin}_{AB}}{\|\hat\theta^{Lin}_{AB}\|} = x + \varepsilon\frac{x}{\|x\|}.$$
As $d \to \infty$, $\|x\| \to 1$ in probability. Therefore, $\tilde{x} = (1 + \varepsilon)x$.
Plugging $\tilde{x}$ into $\hat\theta^{Lin}_{AB,adv} = (\tilde{x}\tilde{x}^\top + \varepsilon^2 I_d)^{-1}\tilde{x}\tilde{x}^\top X(X^\top X + \lambda^* I_n)^{-1}Y$, we obtain
$$\hat\theta^{Lin}_{AB,adv} = \frac{(1 + \varepsilon)^2 xx^\top\hat\theta^{Lin}_{nat}}{\varepsilon^2 + (1 + \varepsilon)^2 x^\top x}.$$
From Eqn. (30) and (31), applying Lemma A.3 with $1/A = (1 + \varepsilon)^2/(\varepsilon^2 + (1 + \varepsilon)^2)$,
$$R^{nat}_x(\hat\theta^{Lin}_{AB,adv}) = \tau^2\Big(1 - \frac{(1 + \varepsilon)^2}{\varepsilon^2 + (1 + \varepsilon)^2}\big(1 + c + \lambda^* - \sqrt{\Delta}\big)\Big) + \frac{\tau^2(1 + \varepsilon)^4}{2(\varepsilon^2 + (1 + \varepsilon)^2)^2}\big(1 + c + \lambda^* - \sqrt{\Delta}\big),$$
$$L(\hat\theta^{Lin}_{AB,adv})^2 = \frac{\tau^2(1 + \varepsilon)^4}{2(\varepsilon^2 + (1 + \varepsilon)^2)^2}\big(1 + c + \lambda^* - \sqrt{\Delta}\big).$$
Therefore,
$$R^{adv}_x(\hat\theta^{Lin}_{AB,adv}) = R^{nat}_x(\hat\theta^{Lin}_{AB,adv}) + \varepsilon^2 L(\hat\theta^{Lin}_{AB,adv})^2 = \tau^2\Bigg(1 - \Big(1 - \varepsilon^2 + \frac{2\varepsilon^2}{(1 + \varepsilon)^2}\Big)\frac{1 + c + \lambda^* - \sqrt{\Delta}}{2\big(\varepsilon^2/(1 + \varepsilon)^2 + 1\big)^2}\Bigg). \qquad \square$$
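The random-matrix limits of Lemma A.2 and the push-through identity $X(X^\top X + aI_n)^{-1}Y = (XX^\top + aI_d)^{-1}XY$ used to rewrite the estimators above can be sanity-checked numerically. The sketch below uses finite $n, d$, so only approximate agreement with the asymptotic trace formulas is expected; all constants are illustrative.

```python
import numpy as np

# Numeric sanity check of the Lemma A.2 trace limits and of the identity
# X (X^T X + a I_n)^{-1} Y = (X X^T + a I_d)^{-1} X Y used for the estimators.
# Finite-size check, so the trace formulas hold only approximately.

rng = np.random.default_rng(0)
d, n, lam = 2000, 1000, 1.0                # c = n/d = 0.5
c = n / d
X = rng.normal(scale=1.0 / np.sqrt(d), size=(d, n))  # Cov(x) = I_d / d
s = np.linalg.eigvalsh(X.T @ X)            # eigenvalues of K = X^T X

delta = (1 + c + lam) ** 2 - 4 * c
sq = np.sqrt(delta)
pred1 = d * (1 + c + lam - sq) / (2 * sq)                # Tr K (K+lam)^{-2}
pred3 = d * ((1 + c) * (1 + c + lam) - 4 * c
             - (1 - c) * sq) / (2 * lam ** 2 * sq)       # Tr (K+lam)^{-2}
emp1 = np.sum(s / (s + lam) ** 2)
emp3 = np.sum(1.0 / (s + lam) ** 2)
print(abs(emp1 / pred1 - 1) < 0.05, abs(emp3 / pred3 - 1) < 0.05)

# Push-through identity behind theta_AT and the posterior mean:
Y = rng.normal(size=n)
lhs = X @ np.linalg.solve(X.T @ X + lam * np.eye(n), Y)
rhs = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ Y)
print(np.allclose(lhs, rhs))
```

At $d = 2000$ the empirical traces typically land within a few percent of the asymptotic predictions, consistent with the $O(1/d)$ fluctuations of linear eigenvalue statistics.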

B CORRELATED GRADIENTS

In the following theorem, we show that with a correlated gradient, one gradient descent step, as in our method, largely improves both the natural and adversarial risks of the model.

Theorem B.1. Assume the oracle parameter $\theta^*$ is independent of $x$ and has the prior distribution $\theta^* \sim N(0, \tau^2 I)$, and the noise $\xi \sim N(0, \sigma^2)$. Let $\theta_0 = \hat\theta_{AT}$ be the estimator from adversarial training of the linear model $F_{Lin}(x; \theta)$:
$$\theta_0 = X(X^\top X + n\varepsilon^2 I_n)^{-1}Y.$$
When receiving a new test data point $(\tilde{x}, y)$, where $\|\tilde{x} - x\| \le \varepsilon$ is an adversarial example of $x$, we take one gradient descent step with a correlated gradient $\hat{g}$:
$$\theta_1 = \theta_0 - \eta\hat{g}, \quad \text{where } \mathrm{Corr}\big(\hat{g}, \nabla_{\theta_0}L(F_{Lin}(\tilde{x}; \theta_0), y)\big) = \rho > 0$$
and $\eta$ is the learning rate. Let $x = (x_1, \cdots, x_d)$, where $x_i$ is the $i$-th element of $x$. We further assume that $\{x_i\}_{i=1}^d$ are i.i.d. with $\mathbb{E}[x_i] = 0$, $\mathrm{Var}[x_i] = 1/d$ and $\mathbb{E}[\sqrt{d}\,x_i]^4 \le M$ for some universal constant $M < \infty$. When $n, d \to \infty$ with $c = n/d \in (0, 1)$, with the optimal learning rate,
$$R^{nat}_x(\theta_0) - R^{nat}_x(\theta_1) \ge \frac{2\tau^2\rho^2(1 - \varepsilon)^4}{\big((1 + \varepsilon)^2 + \sigma^2/\tau^2\big)(1 + \varepsilon)^2(1 + \varepsilon^2)} - \frac{\tau^2\rho^2(1 - \varepsilon)^4}{\big((1 + \varepsilon)^2 + \sigma^2/\tau^2\big)(1 + \varepsilon)^2(1 + \varepsilon^2)^2},$$
$$R^{adv}_x(\theta_0) - R^{adv}_x(\theta_1) \ge \frac{\tau^2\rho^2(1 - \varepsilon)^4}{\big((1 + \varepsilon)^2 + \sigma^2/\tau^2\big)(1 + \varepsilon)^2(1 + \varepsilon^2)}.$$

Remarks. With correlated gradients, the improvements of $R^{nat}_x$ and $R^{adv}_x$ are both positive whenever $\rho > 0$. By taking a correlated gradient descent step on the parameters, we obtain large improvements of both the natural and adversarial risks: Theorem B.1 shows that fine-tuning with a correlated gradient largely improves both the clean performance and the robustness of the model. In addition, for linear models, the Bayesian optimal robust estimator $\hat\theta^{Lin}_{AB}$ is parallel to $x$, while $\theta_1$ is parallel to $\hat{g}$ with $\mathrm{Corr}(\tilde{x}, \hat{g}) = \rho$. As $\tilde{x}$ is close to $x$, with a proper learning rate, correlated gradient descent brings us close to the Bayesian robust estimator.
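The mechanism behind Theorem B.1, namely that a single step along a direction correlated with the true test-point gradient reduces the pointwise risk, can be illustrated with a small simulation. The toy below checks only the clean case ($\varepsilon = 0$, $\theta_0 \approx 0$ as heavy regularization shrinks the AT estimator); the constants $\tau$, $\sigma$, $\rho$, $\eta$ are illustrative, not the theorem's optimal values under perturbation.

```python
import numpy as np

# Toy illustration of Theorem B.1's mechanism for eps = 0 (clean input):
# starting from theta_0 ~ 0, one step along g_hat = rho * x + sqrt(1-rho^2) * z
# reduces the risk at the test point x. All constants are illustrative.

rng = np.random.default_rng(0)
d, trials = 200, 2000
tau, sigma, rho = 1.0, 0.5, 0.8
eta = rho * tau ** 2 / (tau ** 2 + sigma ** 2)  # near-optimal step for eps = 0

risk0, risk1 = 0.0, 0.0
for _ in range(trials):
    theta_star = rng.normal(scale=tau, size=d)
    x = rng.normal(scale=1.0 / np.sqrt(d), size=d)   # Var[x_i] = 1/d
    z = rng.normal(scale=1.0 / np.sqrt(d), size=d)   # independent direction
    xi = rng.normal(scale=sigma)
    y = x @ theta_star + xi
    g_hat = rho * x + np.sqrt(1 - rho ** 2) * z      # correlated gradient dir.
    theta1 = eta * y * g_hat                         # theta_0 = 0, one step
    risk0 += (x @ theta_star) ** 2                   # pointwise risk of theta_0
    risk1 += (x @ (theta1 - theta_star)) ** 2        # pointwise risk of theta_1
print(risk1 < 0.7 * risk0)  # large pointwise risk reduction from one step
```

With $\rho = 0.8$ the average pointwise risk drops to roughly a third of its starting value, in line with the theorem's prediction that the gain scales with $\rho^2$.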

B.1 PROOF OF THEOREM B.1

Proof. For a new test input $(\tilde{x}, y)$, where $\|\tilde{x} - x\| \le \varepsilon$ is an adversarial example near the input, the MSE loss is
$$L(\tilde{x}, y, \theta) = \frac{1}{2}\big(x^\top\theta^* + \xi - \tilde{x}^\top\theta\big)^2.$$
Taking the gradient w.r.t. $\theta$,
$$\nabla_\theta L(\tilde{x}, y, \theta) = (\tilde{x}^\top\theta - x^\top\theta^* - \xi)\tilde{x}.$$
Suppose the self-supervised task gives a correlated version of this gradient and updates $\theta_0$ with one step of gradient descent:
$$\theta_1 = \theta_0 - \eta(\tilde{x}^\top\theta_0 - x^\top\theta^* - \xi)\hat{g},$$
where $\mathrm{Corr}(\hat{g}, \tilde{x}) = \rho$ and $\mathbb{E}[\hat{g}^\top\hat{g}] = \mathbb{E}[\tilde{x}^\top\tilde{x}]$. From Theorem 3.1, when $\lambda = n\varepsilon^2 \to \infty$, $R^{nat}_x(\theta_0)$ and $\|\theta_0\|^2$ simplify to
$$\mathbb{E}_{\theta^*,\xi}\|\theta_0\|^2 = \frac{\tau^2(1 + c) + \sigma^2}{\varepsilon^4 d} + o\Big(\frac{1}{d}\Big) \to 0, \quad R^{nat}_x(\theta_0) = \frac{1}{d}\mathbb{E}_{\xi,\theta^*}\|\theta_0 - \theta^*\|^2 \to \tau^2.$$
Therefore, when $d \to \infty$,
$$\theta_1 \to \eta(x^\top\theta^* + \xi)\hat{g}.$$
Then when $d \to \infty$,
$$\mathbb{E}_x\|\theta_1\|^2 = \eta^2\theta^{*\top}\mathbb{E}_x[\hat{g}^\top\hat{g}\,xx^\top]\theta^* + \eta^2\xi^2\mathbb{E}_x[\hat{g}^\top\hat{g}] + 2\eta^2\xi\,\theta^{*\top}\mathbb{E}_x[\hat{g}^\top\hat{g}\,x].$$
Therefore,
$$\mathbb{E}_{x,\theta^*,\xi}\|\theta_1\|^2 = \eta^2\big(\tau^2\mathbb{E}_x[\hat{g}^\top\hat{g}\,x^\top x] + \sigma^2\mathbb{E}_x[\hat{g}^\top\hat{g}]\big).$$
By decomposing $\hat{g} = \rho\tilde{x} + \sqrt{1 - \rho^2}\,z$ with $z$ independent of $\tilde{x}$, we obtain
$$\mathbb{E}_{x,\theta^*,\xi}\|\theta_1\|^2 = \eta^2\big(\tau^2\mathbb{E}_x[(\tilde{x}^\top\tilde{x})^2] + \sigma^2\mathbb{E}_x[\tilde{x}^\top\tilde{x}]\big).$$
As $\|\tilde{x} - x\| \le \varepsilon$,
$$\mathbb{E}_{x,\theta^*,\xi}\|\theta_1\|^2 \le \eta^2\big(\tau^2(1 + \varepsilon)^4 + \sigma^2(1 + \varepsilon)^2\big).$$
For the natural risk,
$$R^{nat}_x(\theta_1) = \mathbb{E}_{x,\xi,\theta^*}\big(x^\top(\theta^* - \theta_1)\big)^2 \to \tau^2 + \eta^2\tau^2\mathbb{E}_x[x^\top x\,(\tilde{x}^\top\hat{g})^2] + \eta^2\sigma^2\mathbb{E}_x[(\tilde{x}^\top\hat{g})^2] - 2\eta\tau^2\mathbb{E}_x[x^\top\tilde{x}\;\tilde{x}^\top\hat{g}].$$
By $\|\tilde{x} - x\| \le \varepsilon$, $\mathrm{Corr}(\hat{g}, \tilde{x}) = \rho$ and $\mathbb{E}[\hat{g}^\top\hat{g}] = \mathbb{E}[\tilde{x}^\top\tilde{x}]$,
$$R^{nat}_x(\theta_1) \le \tau^2 + \eta^2\tau^2(1 + \varepsilon)^4 + \eta^2\sigma^2(1 + \varepsilon)^2 - 2\eta\tau^2\rho(1 - \varepsilon)^2.$$
Therefore,
$$R^{adv}_x(\theta_1) \le \tau^2 + \eta^2\tau^2(1 + \varepsilon^2)(1 + \varepsilon)^4 + \eta^2\sigma^2(1 + \varepsilon^2)(1 + \varepsilon)^2 - 2\eta\tau^2\rho(1 - \varepsilon)^2.$$
Optimizing over $\eta$ gives
$$\eta^* = \frac{\rho\tau^2(1 - \varepsilon)^2}{\big(\tau^2(1 + \varepsilon)^4 + \sigma^2(1 + \varepsilon)^2\big)(1 + \varepsilon^2)}.$$
With $\eta^*$,
$$R^{adv}_x(\theta_1) \le \tau^2\Big(1 - \frac{\rho^2(1 - \varepsilon)^4}{\big((1 + \varepsilon)^2 + \sigma^2/\tau^2\big)(1 + \varepsilon)^2(1 + \varepsilon^2)}\Big),$$
and
$$R^{nat}_x(\theta_1) \le \tau^2 - \frac{2\tau^2\rho^2(1 - \varepsilon)^4}{\big((1 + \varepsilon)^2 + \sigma^2/\tau^2\big)(1 + \varepsilon)^2(1 + \varepsilon^2)} + \frac{\tau^2\rho^2(1 - \varepsilon)^4}{\big((1 + \varepsilon)^2 + \sigma^2/\tau^2\big)(1 + \varepsilon)^2(1 + \varepsilon^2)^2}. \quad (34)$$
Compared with $\theta_0$, where $R^{adv}_x(\theta_0) = \tau^2$ and $R^{nat}_x(\theta_0) = \tau^2$, we have the improvements
$$R^{nat}_x(\theta_0) - R^{nat}_x(\theta_1) \ge \frac{2\tau^2\rho^2(1 - \varepsilon)^4}{\big((1 + \varepsilon)^2 + \sigma^2/\tau^2\big)(1 + \varepsilon)^2(1 + \varepsilon^2)} - \frac{\tau^2\rho^2(1 - \varepsilon)^4}{\big((1 + \varepsilon)^2 + \sigma^2/\tau^2\big)(1 + \varepsilon)^2(1 + \varepsilon^2)^2} \quad (35)$$
and
$$R^{adv}_x(\theta_0) - R^{adv}_x(\theta_1) \ge \frac{\tau^2\rho^2(1 - \varepsilon)^4}{\big((1 + \varepsilon)^2 + \sigma^2/\tau^2\big)(1 + \varepsilon)^2(1 + \varepsilon^2)}. \quad (36)$$

C ADDITIONAL EXPERIMENTS

C.1 DIVERSE ATTACKS

Transfer Attack. In Table 3, we perform a transfer attack from a static adversarial defense: we use robust networks with the same architecture as the substitute models. The test-time adaptation also improves the robust accuracy in this setting.

Expectation Attack. In Table 4, we show the results of the expectation attack. We modify the adaptive attack to average the gradients of 10 fine-tuned models whose training batches differ. We evaluate the model using the ensemble of rotation and vertical flip as the self-supervised task on CIFAR10, under Adaptive-AutoPGD-EOT and Adaptive-SquareAttack-EOT; the former is the strongest attack against our method and the latter is a black-box attack that is less likely to be affected by gradient masking. The experiment shows that the expectation attack has little influence on the improvement brought by our test-time adaptation.

Boundary Attack. We use one of the state-of-the-art decision-based attacks, RayS (Chen & Gu, 2020), on CIFAR10 with the ensemble of rotation and vertical flip. Table 5 shows that our method also improves the robust accuracy under this decision-based attack.

GMSA (Chen et al., 2021a) with AutoPGD. GMSA is a recently proposed attack algorithm targeted at test-time model adaptation. We use GMSA with AutoPGD to attack our method; the results are shown in Table 6. Under GMSA, our test-time adaptation still significantly improves the robust accuracy. Moreover, Table 6 also demonstrates the strength of our adaptive attack strategy, as it achieves a higher success rate than GMSA.

C.2 ABLATION STUDY

Meta Adversarial Training. Our meta training strategy in Algorithm 1 aims to strengthen the correlation between the self-supervised tasks and classification. To show its effectiveness, we perform an ablation study where we train the model with regular AT (i.e., setting α = 0 in line 5 of Algorithm 1). We then perform the same test-time fine-tuning on the model trained without MAT, using the same hyperparameters as in the MAT case. As shown in Table 7, the robust accuracy and the improvements from fine-tuning are consistently worse without MAT.

Accuracy Improvement on Inputs with Different Adversarial Budgets. Our method also improves the robust accuracy on inputs crafted with different adversarial budgets. As shown in Table 8, we set the ℓ∞ budget of the adversarial inputs to 0.015 and perform the online test-time fine-tuning; the robust accuracy is indeed improved.

Removing L_SS or L_R. In our previous experiments, test-time fine-tuning used a combination of two loss functions: L_SS and L_R. To study the effect of each term separately, we remove either one of them from L_test. In Table 9, we report the robust accuracy after online fine-tuning using only L_R and using only L_SS. As expected, removing L_SS reduces the accuracy more than removing L_R, which shows the benefit of our self-supervised test-time fine-tuning strategy. Nevertheless, the best results are obtained by exploiting both loss terms.

Accuracy Improvement on Clean Images. As shown in Eqn (30) and Theorem B.1, our method can improve not only the robust accuracy but also the natural accuracy of adversarially-trained models on clean images. To evidence this, we keep all the components of our method and simply replace the adversarial input images with clean images (i.e., replacing the adversarial batch with clean inputs B in Algorithm 2) and perform the same self-supervised test-time fine-tuning. As shown in Table 10, our approach increases the clean-image accuracy.
This phenomenon further strengthens our conjecture that the improvement in robust accuracy comes from improved generalization rather than from perturbing the model parameters, because randomly perturbing the parameters usually lowers the natural accuracy of the model.

Attacking Objectives. The improvement from test-time adaptation is not affected by the attack objective. Even if no information about the ground-truth label is incorporated into the attack, test-time adaptation improves the robust accuracy. Consider an attacker that randomly lowers the score of a false label to craft the adversarial example: if our method exploited label information leaked by the attack, it would predict that false label and reduce the accuracy. However, as shown in Table 11, self-supervised test-time fine-tuning improves the robust accuracy on these "adversarial" images. Besides, the previous experiments on clean images already show that test-time fine-tuning is effective even when no ground-truth label information is present.

C.3 ADDITIONAL COMPARISON

Comparison with SOAP (Shi et al., 2020). Our method differs from SOAP in that we fine-tune the model to adapt it to new examples instead of purifying the input. We apply SOAP-RP to the adversarially-trained model and find that its improvement is marginal: under AutoPGD, the accuracy improves from 53.09% to 53.57%. This improvement is much smaller than that of our method, which improves the accuracy from 53.09% to 57.93%. SOAP thus has little effect when combined with commonly used AT.

Combination with (Gowal et al., 2020). We combine our test-time adaptation with AT using additional data (Gowal et al., 2020), applying our Meta AT with the ensemble of rotation and vertical flip. Using a WideResNet-28-10, the baseline achieves a robust accuracy of 62.07% under AutoPGD. With our test-time adaptation, the robust accuracy improves to 64.34%, an improvement of 2.27%.

Robust Accuracy vs. Fine-tuning Steps. Figure 3 shows the robust accuracy at each step of the test-time fine-tuning for different self-supervised tasks and attack methods. When using the standard version of the attacks, the robust accuracy gradually increases as fine-tuning proceeds. When using our adaptive attacks, the adversarial examples are generated to attack the network with θ_T (T = 10) instead of θ_0; thus, as the parameters gradually change from θ_0 to θ_T, the accuracy drops.

Inference Time. Table 12 shows the inference time for different methods. While the inference time of our method is larger than that of SOAP and the normal method when the batch size is 1, the gap narrows for larger batch sizes, and a batch size of 20 or more is a common inference scenario. In order to achieve the statistically optimal adversarial risk, additional inference time is required.

Combination with TRADES (Zhang et al., 2019). Table 13 shows the robust accuracy when combining our test-time adaptation with TRADES.
Our test-time adaptation improves the robust accuracy by about 4%, which shows that our approach can improve various types of robust training methods.

C.4 VISUALIZATION

In Figure 5, we show the histograms of the loss values for the successfully and unsuccessfully test-time adapted models. For each input instance, if the test-time adaptation corrects the wrong prediction, we count it as successful; if the misclassified instance is still not correctly predicted after our test-time adaptation, we count it as unsuccessful. The figure illustrates that our method can adapt the model to correctly classify instances close to the decision boundary (with medium loss values). However, for highly misclassified instances (with large loss values), which lie far away from the decision boundary, our test-time adaptation cannot change the model enough to predict the correct labels.

We consider an ℓ∞ norm with an adversarial budget ε = 0.031. We use two different network architectures: WideResNet-34-10 for CIFAR10 and ResNet18 for STL10 and Tiny ImageNet. Following the common settings for AT, we train the network for 100 epochs using SGD with a momentum factor of 0.9 and a weight decay factor of 5 × 10^-4. The learning rate β starts at 0.1 and is divided by a factor of 10 after the 50-th and again after the 75-th epoch. The step size α in Eqn (8) is the same as β. The factor C in Eqn (11) is set to 1.0. We use 10-iteration PGD (PGD-10) with a step size of 0.007 to find the adversarial images B̃_j at training time. The weight of each self-supervised task is set to C_k = 1/K. We set |B_j| = 32 and sample 8 batches B_1, ..., B_8 in each iteration. Furthermore, we save the model after the 51-st epoch for further evaluation, as the model obtained right after the first learning rate decay usually yields the best performance (Rice et al., 2020). We use PGD with the standard cross-entropy loss to generate adversarial examples at training time in lines 3, 6 and 8 of Algorithm 1. The hyperparameters of these attacks are as follows:
• Line 3: PGD-10 with a step size of 0.007.
• Line 6: As θ*_j is similar to θ, the adversarial examples at this step are similar to those at Line 4. To save training time, we therefore use the adversarial examples from Line 4 as the starting point of the attack and run PGD-2 with a step size of 0.005.
• Line 8: PGD-3 with a step size of 0.02.

Online Test-time Fine-tuning. The algorithm for online fine-tuning is shown in Algorithm 2. We fine-tune the network for T = 10 steps with a momentum of 0.9 and a learning rate of η = 5 × 10^-4. We set C_k = 1/K and C = 15.0. In line 2 of Algorithm 2, we sample a batch B ⊂ D containing 20 training images. In line 3, we use PGD-10 with a step size of 0.007.

Offline Test-time Fine-tuning. The algorithm for offline fine-tuning is shown in Algorithm 3. As stochastic gradient descent is more efficient for a large amount of data, we use stochastic gradient descent in the offline fine-tuning; this is the main difference between Algorithm 2 (online fine-tuning) and Algorithm 3. In both cases, each fine-tuning step updates
$$\theta_t = \theta_{t-1} - \eta\nabla_{\theta_{t-1}}L_{test}(\tilde{B}, B; \theta_{t-1}),$$
and the final prediction is $\hat{y}_i = \arg\max_j F(\tilde{x}_i; \theta_T)_j$.

Attacks. The detailed settings of each attack are provided below:
• PGD-20. We use 20 iterations of PGD with step size γ = 0.003. The attack loss is the cross-entropy.
• AutoPGD. We use both the cross-entropy and the difference of logits ratio (DLR) as the attack loss. The hyperparameters are the same as in (Croce & Hein, 2020a).
• FAB. We use the code from (Croce & Hein, 2020a) and keep the same hyperparameters.
• Square Attack. We set T = 2000 and the initial fraction of modified elements p = 0.3. The other hyperparameters are the same as in (Andriushchenko et al., 2020).
For the adaptive versions, we set the interval u = T/5.
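The T-step test-time fine-tuning loop described above (T = 10 steps of SGD with momentum 0.9 and learning rate 5 × 10^-4 on L_test) can be sketched on a toy model. The quadratic "self-supervised" loss and the regularizer toward the initial parameters below are hypothetical stand-ins, purely to show the shape of the update loop, not the paper's actual L_test.

```python
import numpy as np

# Toy sketch of the online test-time fine-tuning loop (Algorithm 2 style).
# The real method fine-tunes a network backbone on L_test; here we use a
# hypothetical quadratic "self-supervised" loss on a test batch plus a
# regularizer toward the initial parameters, purely for illustration.

rng = np.random.default_rng(0)
theta0 = rng.normal(size=5)          # adversarially-trained starting point
B_tilde = rng.normal(size=(20, 5))   # adversarial test batch (toy features)

def loss_and_grad(theta):
    # hypothetical L_test: self-supervised fit term + C * distance to theta0
    C = 15.0
    residual = B_tilde @ theta       # pretend the SS task wants this near 0
    g_ss = B_tilde.T @ residual / len(B_tilde)
    l = 0.5 * np.mean(residual ** 2) + 0.5 * C * np.sum((theta - theta0) ** 2)
    return l, g_ss + C * (theta - theta0)

theta, velocity = theta0.copy(), np.zeros_like(theta0)
eta, momentum, T = 5e-4, 0.9, 10     # hyperparameters stated above
l_start, _ = loss_and_grad(theta)
for _ in range(T):                   # T fine-tuning steps with SGD momentum
    _, grad = loss_and_grad(theta)
    velocity = momentum * velocity - eta * grad
    theta = theta + velocity
l_end, _ = loss_and_grad(theta)
print(l_start > l_end)               # loss decreases over the T steps → True
```

Since the toy loss is convex and the step size is small, the T momentum steps strictly decrease the loss, mirroring how the real fine-tuning adapts θ while staying close to θ_0.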

D.2 SELF-SUPERVISED TASKS

Rotation Prediction is a widely used self-supervised task proposed in (Gidaris et al., 2018) and has been employed in AT as an auxiliary task to improve robust accuracy (Chen et al., 2020a; Hendrycks et al., 2019). Following (Gidaris et al., 2018), we create 4 copies of the input image by rotating it with angles Ω = {0°, 90°, 180°, 270°}, and the loss is the average cross-entropy over the 4 copies, given by
$$L_{rotate}(x) = -\frac{1}{4}\sum_{\omega\in\Omega}\log\big(G_{rotate}(x_\omega)_\omega\big),$$
where $x_\omega$ is the image rotated by angle $\omega \in \Omega$, $G_{rotate} = g_{rotate} \circ E$ denotes the classifier for rotation prediction, and $G_{rotate}(\cdot)_\omega$ is the predicted probability of the angle $\omega$. The head $g_{rotate}$ is a fully-connected layer followed by a softmax layer.

Vertical Flip (VFlip) Prediction is a self-supervised task similar to rotation prediction and has also been used for self-supervised learning (Saito et al., 2020). In essence, we make two copies of the input image and flip one copy vertically. The head $g_{vflip}$ then contains a 2-way fully-connected layer followed by a softmax layer and predicts whether the image is vertically flipped or not. The corresponding loss for an image $x$ is
$$L_{vflip}(x) = -\frac{1}{2}\sum_{v\in V}\log\big(G_{vflip}(x_v)_v\big),$$
where $V = \{\text{flipped}, \text{not flipped}\}$ is the operation set and $G_{vflip} = g_{vflip} \circ E$; $x_v$ denotes the transformed input and $G_{vflip}(\cdot)_v$ is the predicted probability of operation $v$. Note that we do not flip the image horizontally, as horizontal flipping is a common data augmentation technique and classifiers typically seek to be invariant to it.
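The rotation-prediction loss above can be sketched in a few lines: build the four rotated copies and average the cross-entropy of a 4-way head. The predictor below is a hypothetical stand-in for $G_{rotate} = g_{rotate} \circ E$.

```python
import numpy as np

# Sketch of the rotation-prediction loss L_rotate: average cross-entropy of a
# 4-way head over the four rotated copies of an image. `predict_probs` is a
# hypothetical stand-in for G_rotate = g_rotate o E.

ANGLES = (0, 1, 2, 3)  # number of 90-degree rotations: 0°, 90°, 180°, 270°

def rotation_loss(image, predict_probs):
    """predict_probs(img) -> length-4 probability vector over ANGLES."""
    losses = []
    for k in ANGLES:
        rotated = np.rot90(image, k=k)      # create the rotated copy x_omega
        probs = predict_probs(rotated)
        losses.append(-np.log(probs[k]))    # cross-entropy for angle omega
    return float(np.mean(losses))

# A predictor with no information outputs the uniform distribution, so the
# loss equals log 4; a perfect predictor would drive the loss to 0.
uniform = lambda img: np.full(4, 0.25)
img = np.arange(16.0).reshape(4, 4)
print(np.isclose(rotation_loss(img, uniform), np.log(4)))  # → True
```

The VFlip loss has the same shape with two copies (identity and vertical flip) and a 2-way head.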

E ADAPTIVE ATTACKS

In white-box attacks, the attacker knows every detail of the defense method. Therefore, we need to assume that the attacker is aware of our test-time adaptation and will adjust its strategy for generating adversarial examples accordingly. Here, we discuss one such strong adaptive strategy targeted at our method. Suppose that the attacker is fully aware of the hyperparameters of the test-time adaptation. Then, finding adversaries B̃ of the clean subset B can be achieved by maximizing the adaptive loss
$$\tilde{x}_i = \arg\max_{\|\tilde{x}_i - x_i\| \le \varepsilon} L_{attack}\big(F(\tilde{x}_i), y_i; \theta_T(\tilde{B})\big), \quad (12)$$
where $L_{attack}$ refers to a general attack loss, such as the cross-entropy or the difference of logits ratio (DLR) (Croce & Hein, 2020a), and $\theta_T$ denotes the fine-tuned test-time parameters obtained by Algorithm 2. At the $k$-th step of the attack, $\theta_T$ depends on the input $\tilde{B}^{(k)} = \{(\tilde{x}^{(k)}_j, y_j)\}_{j=1}^b$ via the updates
$$\theta_{t+1} = \theta_t - \eta\nabla_{\theta_t}L_{test}\big(\tilde{B}^{(k)}, B\big),$$
where $L_{test}$ and $B$ are the loss function and the subset of training images mentioned in Eqn (6). As $\theta_T$ is a function of the input $\tilde{B}^{(k)}$, we can in principle calculate the end-to-end gradient $\nabla_{\tilde{x}^{(k)}_i}L_{attack}(F(\tilde{x}^{(k)}_i); \theta_T(\tilde{B}^{(k)}))$ of $\tilde{x}^{(k)}_i \in \tilde{B}^{(k)}$. However, $\theta_T$ goes through $T$ gradient descent steps, so calculating the gradient $\nabla_{\tilde{x}^{(k)}_i}\theta_T(\tilde{B}^{(k)})$ requires $T$-th order derivatives of the backbone $E$, which is virtually intractable if $T$ or the dimension of $\theta_E$ is large. We therefore approximate the gradient as
$$\mathrm{Grad}(\tilde{x}^{(k)}_i) \approx \nabla_{\tilde{x}^{(k)}_i}L_{attack}\big(F(\tilde{x}^{(k)}_i); \theta_T\big),$$
which treats $\theta_T$ as a fixed variable so that high-order derivatives from $\theta_T(\tilde{B}^{(k)}(\tilde{x}^{(k)}_i))$ are avoided. Although this approximation makes $\mathrm{Grad}(\tilde{x}^{(k)}_i)$ inaccurate, common white-box attacks use projected gradients, which are robust to such inaccuracies; for example, PGD only uses the sign of the gradient under an ℓ∞ adversarial budget. Note that solving the maximization in Eqn (12) does not necessarily require calculating the gradient $\mathrm{Grad}(\tilde{x}^{(k)}_i)$.
For instance, we also use Square Attack (Andriushchenko et al., 2020), a strong score-based black-box attack, to maximize Eqn (12) and generate adversaries for B̃. As another approximation to save time, one can fix θ_T for several iterations. This leverages the intuition that attack strategies often make small changes to the input x̃, so for the intermediate images at the k-th and (k+1)-th steps, θ_T(B̃^(k)) and θ_T(B̃^(k+1)) should be close. Therefore, a general version of our adaptive attacks only updates θ_T every u iterations, where u is a hyperparameter. In Algorithms 4, 5, 6 and 7, we show the algorithms for ℓ∞ norm-based adaptive PGD, AutoPGD, Square Attack and FAB, respectively. The main difference between the original and adaptive versions is the target loss function being maximized: each adaptive attack recomputes the fine-tuned parameters θ_T every u iterations and evaluates L_attack under them. For example, the core adaptive PGD update is
$$\mathrm{Grad}(\tilde{x}_i) = \nabla_{\tilde{x}_i}L_{attack}\big(F(\tilde{x}_i), y_i; \theta\big), \quad \tilde{x}_i = \mathrm{Clip}_{[x_i - \varepsilon,\, x_i + \varepsilon]}\big(\tilde{x}_i + \gamma\,\mathrm{Sign}(\mathrm{Grad}(\tilde{x}_i))\big).$$
The reader may refer to (Andriushchenko et al., 2020; Croce & Hein, 2020a;b) for a more detailed description of the steps in these algorithms (e.g., the condition for decreasing the learning rate in AutoPGD).
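The adaptive attack loop, sign-gradient PGD steps under a frozen θ_T that is refreshed every u iterations, can be sketched on a toy model. The linear scorer, the attack loss, and the `finetune` stub below are hypothetical placeholders (the stub stands in for running Algorithm 2); only the loop structure mirrors the adaptive PGD described above.

```python
import numpy as np

# Toy sketch of the adaptive PGD loop: every u iterations the defender's
# fine-tuned parameters theta_T are recomputed from the current adversarial
# input (stubbed here), and PGD takes sign-gradient steps under a frozen
# theta in between. The linear "model" and losses are illustrative only.

rng = np.random.default_rng(1)
x0 = rng.normal(size=8)                 # clean input
theta0 = rng.normal(size=8)             # adversarially-trained parameters
eps, gamma, T, u = 0.1, 0.02, 20, 4

def finetune(theta, x_adv):
    # stub for Algorithm 2: nudge parameters using the current adversary
    return theta - 0.05 * x_adv

def attack_grad(x, theta):
    # gradient of a toy attack loss L_attack = -theta^T x w.r.t. x
    return -theta

x_adv, theta = x0.copy(), theta0.copy()
for t in range(1, T + 1):
    if t % u == 0:                      # periodic refresh of theta_T
        theta = finetune(theta0, x_adv)
    g = attack_grad(x_adv, theta)       # theta treated as a fixed variable
    x_adv = np.clip(x_adv + gamma * np.sign(g), x0 - eps, x0 + eps)

print(np.max(np.abs(x_adv - x0)) <= eps + 1e-12)  # stays in the eps-ball
```

The clip after each step enforces the ℓ∞ budget, so the final adversary always lies in the ε-ball around the clean input regardless of how θ_T evolves.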



Figure 1: The comparison of R adv

4.1 SELF-SUPERVISED TEST-TIME FINE-TUNING

Our goal is to perform self-supervised learning on the test examples to mitigate the overfitting problem of AT and adapt the model to each data point. To this end, suppose that an adversarially-trained network with parameters θ_0 receives a mini-batch of b adversarial test examples B̃ = {(x̃_1, y_1), ..., (x̃_b, y_b)}. As the labels {y_i}_{i=1}^b are not available, we propose to fine-tune the backbone parameters θ_E by optimizing the loss function

Figure 2: Empirical c.d.f. of ρ(x̃_i) on CIFAR10 with WideResNet-34-10. Adversarial budget ε = 0.031.

In Figure 4, we show the visualization of several examples where our test-time adaptation successfully corrects the misclassified examples. The input examples are generated by AutoPGD on CIFAR10, and we fine-tune the network with the ensemble of the Rotation and VFlip tasks. The visualization shows that our test-time adaptation reduces the loss over the whole neighbourhood of the input examples, thereby increasing the accuracy of the model.

D DETAILS OF OUR EXPERIMENTAL SETTING

D.1 HYPERPARAMETERS

Meta Adversarial Training. The algorithm of Meta Adversarial Training is shown in Algorithm 1.

Figure 3: Robust accuracy at different steps of the online test-time fine-tuning on CIFAR10.

Figure 4: Visualization of several examples where our test-time adaptation successfully changes the prediction. Each row shows, for one example, the loss surface before fine-tuning, the loss surface after fine-tuning, and the loss change induced by our fine-tuning. The origin represents the clean example. Following (Kim et al., 2021), the x-axis represents the direction of the adversarial example and the y-axis is a random direction. The white line is the decision boundary. As the fine-tuned model correctly classifies the input example, the decision boundary does not appear in the neighbourhood of the clean input for the fine-tuned model.

Figure 5: Histograms of the loss values for the successful and unsuccessful test-time adapted models.

Algorithm 4 ℓ∞ Norm Adaptive PGD Attack
Input: Test images B̃ = {(x̃_i, y_i)}; attack loss L_attack; step size γ; iterations T; interval u; adversarial budget ε; trained parameters of the network θ_0.
Output: Adversarial images B̃ = {x̃_i}
1: Add random noise to each x̃_i in B̃
2: for t = 1 to T do
3:   if t mod u = 0 then
4:     Get the fine-tuned parameters θ_T by taking B̃ as the input images for Algorithm 2; set θ = θ_T
5:   end if
6:   for x̃_i in B̃ do
7:     Grad(x̃_i) = ∇_{x̃_i} L_attack(F(x̃_i), y_i; θ)
8:     x̃_i = Clip_{[x_i−ε, x_i+ε]}(x̃_i + γ Sign(Grad(x̃_i)))
9:   end for
10: end for
11: return Adversarial images B̃ = {x̃_i}

Algorithm 5 ℓ∞ Norm Adaptive AutoPGD. Input: test images B̃ = {(x̃_i, y_i)}; attack loss L_attack; step size γ; iterations T; interval u; adversarial budget ε; parameters of the adversarially-trained network θ_0; decay iterations W = {w_0, ..., w_n}; momentum ξ. Output: adversarial images B̃ = {x̃_i}. The attack follows standard AutoPGD (initial step, momentum updates of the iterates, step-size decay at the iterations in W, and restarts from the best point x̃*_i found so far), except that the fine-tuned parameters θ_T are recomputed by running Algorithm 2 on the current iterates every u iterations and are used in L_attack.

Algorithm 6 ℓ∞ Norm Adaptive Square Attack. Input: test images B̃ = {(x̃_i, y_i)}; attack loss L_attack; iterations T; interval u; image size w; color channels c; adversarial budget ε; parameters of the adversarially-trained network θ_0. Output: adversarial images B̃ = {x̃_i}. The attack follows standard Square Attack (choose a square of side h_t according to a schedule, sample a per-channel perturbation value in {−2ε, 2ε}, apply the square perturbation δ, clip to the ε-ball around x_i, and accept the candidate x̃^new_i if it increases L_attack), with θ_T recomputed via Algorithm 2 every u iterations.



Accuracy on transfer attack on CIFAR10.

Accuracy on RayS on CIFAR10 with the ensemble of rotation and vertical flip task.

Ablation study on the online test-time fine-tuning. The dataset is CIFAR10 and the task is the "Rotation + VFlip". All attacks are standard attacks. SA stands for Square Attack.

Robust test accuracy on CIFAR10 of the online test-time fine-tuning. We use the same WideResNet-34-10 as in Table 1a, which is trained with an ℓ∞ budget of 0.031. The inputs are in the ℓ∞ ball of ε = 0.015. The self-supervised task is the ensemble of rotation and vertical flip.

Ablation study on the online test-time fine-tuning. The dataset is CIFAR10 and the task is the "Rotation + VFlip". All attacks are standard attacks. Removing the L SS or L R results in lower robust accuracy than the full method. SA stands for Square Attack.

Accuracy on clean images. Networks are trained with corresponding meta adversarial training.

Experiments to rule out the possibility of label leaking. We use the WideResNet-34-10 trained with ∞ budget ε = 0.031 and show the robust test accuracy on CIFAR10 of the online test-time fine-tuning. The self-supervised task is the ensemble of rotation and vertical flip.

Average inference time for each instance using different methods.

Combination of our test-time adaptation with TRADES on CIFAR10 with the ensemble of rotation and vertical flip tasks.

Algorithm 3 Self-supervised Test-time Fine-tuning with SGD. Input: initial parameters θ_0; adversarial test images B̃ = {x̃_i}_{i=1}^b; training data D; learning rate η; steps T; weights C_k and C. Output: predictions ŷ_i of the x̃_i. At each of the T steps, the algorithm finds adversarial examples of the sampled training images x_i ∈ B by a PGD attack and updates θ by stochastic gradient descent on L_test.


Algorithm 7 ℓ∞ Norm Adaptive FAB. Input: test images B̃ = {(x̃_i, y_i)}; step size γ; iterations T; interval u; adversarial budget ε; trained parameters of the network θ_0; hyperparameters α_max, η, β. Output: adversarial images B̃ = {x̃_i}. The attack follows standard FAB (including the check of whether x̃_i is still classified as y_i), except that the fine-tuned parameters θ_T are recomputed by running Algorithm 2 on B̃ every u iterations and are used when computing the attack updates.

