WHEN IS ADVERSARIAL ROBUSTNESS TRANSFERABLE?

Abstract

Knowledge transfer is an effective tool for learning, especially when labeled data is scarce or when training from scratch is prohibitively costly. The overwhelming majority of the transfer learning literature focuses on obtaining accurate models, neglecting the issue of adversarial robustness. Yet, robustness is essential, particularly when transferring to safety-critical domains. We analyze and compare how different training procedures on the source domain and different fine-tuning strategies on the target domain affect robustness. More precisely, we study 10 training schemes for source models and 3 for target models, including normal, adversarial, contrastive, and Lipschitz-constrained variants. We quantify model robustness via randomized smoothing and adversarial attacks. Our results show that improving model robustness on the source domain increases robustness on the target domain, while target retraining has only a minor influence on target model robustness. These results indicate that model robustness is preserved during target retraining and transferred from the source domain to the target domain.

1. INTRODUCTION

Since their proposal, neural networks have been constantly evolving as they are adapted for many diverse tasks. They tend to become larger and more complex, since, e.g., overparametrization has proven to be highly beneficial. Training such large and complex neural networks usually requires a huge amount of (labeled) high-quality data. Since this amount of data is not available in all domains, transfer learning was proposed. The idea is to transfer the knowledge of a trained model from the so-called source domain to a similar, related task in a target domain for which only a small amount of data exists. Usually, the transfer is considered successful if the model achieves high accuracy on the target domain. However, accuracy is not the only desired property of neural networks. Adversarial robustness is often equally important, especially in safety-critical domains. Some techniques applied in transfer learning (Shafahi et al., 2020; Chen et al., 2021) claim to improve the robustness of transfer learning. However, there is no study that directly compares these techniques to standard methods for improving robustness such as adversarial training or training with a (local) Lipschitz constant. We fill this gap by answering the following questions: 1. Which training procedure results in the most robust source models? 2. Is robustness preserved during target retraining? 3. Does robust retraining on the target domain improve robustness? 4. Which training/target retraining provides models that are robust against distribution shifts? 5. Does transferability correlate with model robustness? To answer these questions, we use a popular transfer learning framework consisting of two parts (see Figure 1): a feature extractor f, which extracts representations from the inputs and is trained on the source domain, and a classifier h, which maps extracted representations to predictions and is retrained on the target domain.
We investigate and compare how different training procedures and target retraining techniques affect the performance and robustness of this model. More specifically, we compare 10 training procedures that can be grouped into three categories. Category one consists of training methods that aim at achieving robustness by changing the inputs, i.e. (1) training on clean inputs (ce), (2) on randomly perturbed inputs (ceN), (3) on adversarially perturbed inputs (ceA), (4) supervised contrastive learning (con) (Khosla et al., 2020), (5) supervised contrastive learning on randomly perturbed inputs (conN), and (6) on adversarially perturbed inputs (conA). The second category consists of methods that change the latent space of the model to achieve robustness, i.e. (7) latent adversarial training (feA) (Singh et al., 2019), (8) adversarial representation loss minimization (feD) (Chen et al., 2021), and (9) a combination of supervised contrastive learning and adversarial representation loss minimization (conF). Our third category of methods uses constraints on the whole model to improve robustness; these constraints are realized by (10) training with a local Lipschitz constant (llc) (Huang et al., 2021). In order to analyze how target retraining affects model robustness, we compare target retraining on (a) clean (R_ce), (b) randomly perturbed (R_ceN), and (c) adversarially perturbed inputs (R_ceA).

Figure 1: Transfer learning framework consisting of a feature extractor f, a classifier h_S on the source domain, and h_T on the target domain. For input x, f(x) = z provides the features and h_S(z) = h_S(f(x)) or h_T(z) = h_T(f(x)) the output. Source training procedures are grouped into methods that change the inputs (ce, ceN, ceA, con, conN, conA), methods that change the latent space (feD, feA, conF), and methods that constrain the whole model (llc).
To provide a more complete picture of robustness, we consider robustness certification, performance against a variety of attacks, and performance under distribution shift. Namely, we employ (i) certification based on randomized smoothing as well as (ii) Fast Gradient Sign Method (FGSM), (iii) Projected Gradient Descent (PGD), and (iv) DeepFool (DF) attacks on the source domain and the target domain. In terms of distribution shift, we determine source and target accuracy under different shifts based on random noise, changes of contrast, and Gaussian blur. Next, we investigate whether there is a correlation between transferability and model robustness. We compute a transferability metric and analyze it together with model robustness and zero-shot performance. To quantify transferability we use the H-score, proposed by Bao et al. (2019) to quantify the usability of representations learned on a source domain for learning a target task. This battery of robustness tests can tell us when adversarial robustness is transferable. As we will show in Section 4, target models inherit robustness from the source models, while target retraining has a minor impact. Our findings suggest that model robustness is transferable when source models are trained with a procedure that enhances model robustness without being too focused on data-specific adversarial examples.

2. BACKGROUND AND RELATED WORK

Robustness is widely studied for standard tasks such as classification and regression, but there are few works that analyze how robustness properties can be transferred from the source to the target domain. There are different aspects of robustness. One aspect is the vulnerability to adversarial examples (Szegedy et al., 2014), i.e. small input perturbations that are carefully crafted to manipulate the predictions of a model (e.g. cause misclassification). Finding attacks for a given model has been widely studied for different threat models. These attacks can be used to compute an upper bound on the accuracy under adversarial perturbations. However, this bound can be loose, since properly evaluating adversarial robustness is challenging: while a model may be robust against a particular attack, there is usually no guarantee that it will not fail against a better and stronger attack. The lesson that seemingly robust models can be broken has been learned more than once (Carlini & Wagner, 2017; Athalye et al., 2018; Tramèr et al., 2020). A complementary strategy to evaluate adversarial robustness is via verification/certification. Robustness certificates provide guarantees that the prediction of a model will not change for the specified perturbation set. Since certificates are NP-hard to compute in general, verification uses tractable (but sound) relaxations to provide a lower bound on adversarial robustness. Verification methods can be grouped into different categories, such as methods based on smoothing (Cohen et al., 2019), Lipschitz bounds (Fazlyab et al., 2019), interval bounds (Mirman et al., 2018), and optimization (Wong & Kolter, 2018). Randomized smoothing (Cohen et al., 2019) is widely used due to its generality: it treats the model as a black box and only requires access to model inputs and outputs.
Since most of the other techniques do not scale to the models typically used in transfer learning and/or are only applicable to specific families of models, we adopt randomized smoothing in our evaluations. Another aspect is robustness to distribution shift, which arises from (natural) variations in the data such as noise, changes in contrast, lighting conditions, or object composition. For example, Hendrycks & Dietterich (2019) investigate robustness using synthetic distribution shifts by introducing noise (Gaussian, shot noise), blurring, simulated weather conditions, contrast changes, and corruptions from compression. To investigate this aspect we adopt a similar procedure, focusing on three representative types of shifts (noise, blurring, and contrast). The literature on the robustness of transfer learning is scarce, especially relative to standard supervised learning. The few existing studies are disconnected, their findings are not comparable with each other, and they even lead to contradictory conclusions. While Salman et al. (2020) show that robust training improves the accuracy on unperturbed target domain data, Shafahi et al. (2020) show the opposite: adversarial training increases robustness but decreases accuracy. This is one motivation for our study: we intend to provide a fair and comparable evaluation of the most promising techniques aiming at improving robustness. One of the strongest empirical defenses is adversarial training (Goodfellow et al., 2015; Madry et al., 2018). Until now, transfer learning techniques have mainly used adversarial training to obtain feature representations that generalize better. Salman et al. (2020) and Utrera et al. (2021) show that adversarially trained/robust models indeed transfer better than their standard-trained counterparts, especially if the target domain has limited data. However, the primary goal of these works is to improve accuracy on unperturbed (rather than adversarial) target data. Similarly, Engstrom et al.
(2019), Ilyas et al. (2019), and Allen-Zhu & Li (2021) show that adversarial training improves feature learning and results in representations that are more aligned with humans. Goldblum et al. (2020) and Vaishnavi et al. (2022) investigate whether robustness can transfer from an adversarially trained teacher to a student within the same domain via knowledge distillation. The goal for the student is to match the model output (Goldblum et al., 2020) or the learned representations (Vaishnavi et al., 2022). In contrast, we investigate robustness transferability across domains. Chan et al. (2020) argue that matching input gradients is important for robustness transfer. Yamada & Otani (2022) investigate whether robustness transfers to downstream tasks such as object detection and semantic segmentation. They find that in the fixed-feature setting robustness is partially preserved and, in contrast to previous findings, show that an adversarial prior does not help for robustness transfer. Nern & Sharma (2022) investigate transfers from pre-trained models and theoretically show that downstream robustness is bounded by the robustness of the underlying representation (irrespective of the pre-training protocol). Finally, most closely related to our work is the study by Shafahi et al. (2020), showing that adversarial training of the feature extractor coupled with a one-layer classifier improves robustness on the target domain. With a similar goal, Chen et al. (2021) propose an adversarial training procedure that minimizes the distance between adversarial and unperturbed representations (i.e. the output of the feature extractor f) and propose to use a classifier with a fixed Lipschitz constant. In all of the above works, robustness (when evaluated) is considered only w.r.t. a small set of attacks. To provide a more complete picture, we additionally consider verification and robustness to distribution shift.

3. MODEL, TRAINING PROCEDURES & TARGET RETRAINING

A simple but popular transfer learning framework, which we use for this work, consists of two parts: a so-called feature extractor f and a classifier h (see Figure 1). The prediction for input x is obtained as y = h(f(x)). The feature extractor is trained on the source domain and is then frozen, i.e. not changed during target retraining, while the classifier is retrained on the target domain. The idea behind this model is that similar features are important in related tasks, so the feature extractor can be transferred from the source domain to the target domain without adaptation, while the classifier maps extracted representations/features to classes and must be adapted to the target domain. We compare the following 10 training schemes to determine the best way of obtaining robust source models and preserving robustness during transfer to the target domain (see Appendix A.1 for details). 1. Standard supervised learning (ce). As a baseline, the whole model h ∘ f is trained on clean input data using cross-entropy (CE) as the loss function. 2. Randomly perturbed inputs (ceN). We train the whole model h ∘ f on randomly perturbed inputs. The noisy inputs are obtained by randomly sampling the perturbation ϵ from a Gaussian distribution N(0, δ²I) and adding it to the clean input x. 3. Adversarially perturbed inputs (ceA). During adversarial training, an attack is used to compute an adversarial perturbation δ for each input x. The model h ∘ f is trained on the perturbed inputs x + δ. We use a 10-step projected gradient descent (PGD) attack to obtain δ during training. 4. Minimizing adversarial feature loss (feD). Chen et al. (2021) proposed a method explicitly aimed at improving the robustness of transfer learning models. This approach is based on a loss function that linearly combines the cross-entropy loss L_CE with a so-called representation distance loss L_R = ||f(x_adv) − f(x)||_2, where ||·||_2 is the L2-norm.
The L_R loss minimizes the distance between clean and adversarially perturbed inputs in representation space. The adversarial inputs x_adv are obtained with the attack described in paragraph 3 (ceA). The final loss L = L_CE + (λ/D_R) L_R is a linear combination of the two, where λ ∈ [0, 1] is a hyper-parameter and D_R is the dimensionality of the representation space.
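As a concrete illustration, here is a minimal NumPy sketch of the feD objective. We read the λ/D_R factor as a normalization of the representation-distance term by the dimensionality D_R (an assumption on our part; λ = 0.1 follows Appendix A.1), and all function names are ours, not from the cited implementation.

```python
import numpy as np

def cross_entropy(logits, label):
    # numerically stable softmax cross-entropy for a single example
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[label]

def fed_loss(logits, label, z_clean, z_adv, lam=0.1):
    # L = L_CE + (lam / D_R) * L_R with L_R = ||f(x_adv) - f(x)||_2;
    # the division by D_R is our reading of the paper's lambda/D_R factor
    d_r = z_clean.shape[0]                     # representation dimensionality D_R
    l_r = np.linalg.norm(z_adv - z_clean)      # representation distance loss L_R
    return cross_entropy(logits, label) + (lam / d_r) * l_r
```

When the adversarial and clean representations coincide, the loss reduces to plain cross-entropy; the penalty grows with the representation distance, which is exactly what pushes f towards adversarially invariant features.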

5. Latent perturbations (feA).

The previous training procedures focused on adversarial examples in the input space and/or the representation space. However, non-robustness can be caused by any layer of the neural network that maps close points far apart from each other. To address this issue, latent adversarial training was proposed (Singh et al., 2019). The authors choose a layer l and split the neural network n at that layer into two parts n_1 and n_2 such that n = n_2 ∘ n_1. Then they use the fast gradient sign method (FGSM) on the sub-network n_2, which results in an adversarial example in the input space of n_2, i.e. the latent space of the whole neural network n. For training, the authors additionally compute an adversarial example in the input space of n and combine the gradients of the adversarial input and the latent adversarial example to update the neural network parameters. Singh et al. (2019) propose to use this procedure for fine-tuning after training. Since standard adversarial training is not used after (but during) training, we modify their approach to ensure a fair comparison. First, we use latent adversarial training from the beginning of the training. Second, for each training step we randomly choose a latent layer l for splitting the network and computing latent adversarial examples. Finally, we extend the method such that latent adversarial examples can be computed by any attack strategy. To ensure comparability, we compute latent adversarial examples and adversarial inputs using 10-step PGD attacks.
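The FGSM step that generates the (latent) adversarial examples above can be illustrated on a toy logistic model, where the input gradient is available in closed form. This is a sketch of the attack step itself (applied to a hypothetical one-layer model), not of the sub-network splitting of Singh et al. (2019); all names and values are illustrative.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fgsm(x, y, w, b, eps):
    # FGSM: x_adv = x + eps * sign(grad_x L); for binary cross-entropy on a
    # logistic model the input gradient is analytic: (sigmoid(w.x + b) - y) * w
    grad = (sigmoid(np.dot(w, x) + b) - y) * w
    return x + eps * np.sign(grad)

# a point correctly classified as positive, then attacked with eps = 0.2
w, b = np.array([1.0, 0.0]), 0.0
x = np.array([0.1, 0.0])
x_adv = fgsm(x, 1.0, w, b, eps=0.2)
```

In latent adversarial training, the same one-step sign update is applied to the input of the sub-network n_2, i.e. to the latent representation, rather than to the raw input.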

6. Local Lipschitz constant (llc).

Bounding the Lipschitz constant of a neural network is known to improve model robustness and can even be used to obtain guarantees. Since bounds on global Lipschitz constants are often loose and might lead to over-regularization, we train a neural network based on an upper bound on a trainable local Lipschitz constant, as proposed by Huang et al. (2021). The local Lipschitz constant is obtained by taking interactions between weight matrices and activation functions into account. It can be proven that the obtained bound is tighter than the global Lipschitz bound. The details of this training approach are explained in Huang et al. (2021). 7. Supervised contrastive learning (con). Neural networks trained with fully-supervised contrastive learning (Khosla et al., 2020) consist of the same two parts as our transfer learning model: a feature extractor f that computes a representation f(x) for each input x and a classifier h that maps the representation to the output space h(f(x)). The idea of contrastive learning is to compute representations for a batch of samples and train the feature extractor by pulling representations that correspond to the same class (positive samples) together and pushing representations of different classes (negative samples) apart from each other. To obtain positive samples, we apply two different realizations of random data augmentations aug(·), such as random crops, random grey-scale changes, etc., to each input in a batch B, which results in two new training batches B_1 = [aug(x) ∀x ∈ B] and B_2 = [aug(x) ∀x ∈ B]. We train on B_1 ∪ B_2. We include this training procedure since it is directly based on a desirable property of the representations: inputs of the same class should result in close representations. Enforcing this might affect source or target model robustness. In each training epoch, we alternatingly update the parameters of the feature encoder f and the classifier h.
First, the feature encoder is updated based on the representations computed for the inputs using the contrastive loss (which minimizes the distance between positive samples and maximizes the distance between negative samples). Second, the representations are propagated through the classifier and the classifier is updated using the cross-entropy loss. 8. Supervised contrastive learning on randomly perturbed inputs (conN). This training procedure is exactly like supervised contrastive learning, except for the data augmentation of each batch. We use an augmented version of the batch and a version that contains randomly perturbed samples: B_1 = [aug(x) ∀x ∈ B] and B_2 = [x + ϵ ∀x ∈ B_1]. The perturbation ϵ is obtained by sampling from a Gaussian distribution N(0, δ²I). 9. Supervised contrastive learning on adversarially perturbed inputs (conA). This training procedure is again exactly like supervised contrastive learning, except for the data augmentation. We use an augmented version of the batch and an adversarially perturbed version: B_1 = [aug(x) ∀x ∈ B] and B_2 = [x + δ ∀x ∈ B_1]. The adversarial perturbation δ is obtained by computing a 10-step PGD attack on each input x ∈ B_1. 10. Fine-tuned contrastive learning (conF). We propose to combine supervised contrastive learning as described in paragraph 7 with fine-tuning based on minimizing the adversarial feature loss. Since standard contrastive learning operates in the representation space but does not see any adversarial examples during training, we add fine-tuning on the source dataset to increase robustness. After training, we retrain the whole model (on the source dataset) by minimizing the feature loss (see paragraph 4). Target retraining. Since the feature extractor f is fixed during retraining on the target domain, techniques such as contrastive learning, minimizing the adversarial feature loss, or latent adversarial training are not applicable to adapt the classifier to the target domain.
We use and compare three different retraining procedures for the classifier: standard supervised learning (R_ce, see paragraph 1), training on randomly perturbed inputs (R_ceN, see paragraph 2), and training on adversarially perturbed inputs (R_ceA, see paragraph 3).
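The overall transfer setup can be sketched in a few lines: the feature extractor f stays frozen and only the linear classifier h is retrained with cross-entropy (the R_ce scheme). The toy data, the random-projection extractor, and all hyper-parameters below are hypothetical stand-ins for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def feature_extractor(x, w_f):
    # frozen "source-trained" extractor: a fixed projection followed by ReLU
    return np.maximum(x @ w_f, 0.0)

def retrain_classifier(z, y, n_classes, lr=0.05, epochs=300):
    # R_ce retraining sketch: gradient descent on the linear classifier only,
    # with softmax cross-entropy; the extractor weights are never touched
    w_h = np.zeros((z.shape[1], n_classes))
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        logits = z @ w_h
        logits -= logits.max(axis=1, keepdims=True)
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        w_h -= lr * z.T @ (p - onehot) / len(y)   # update h, not f
    return w_h

# toy "target domain": two well-separated Gaussian blobs
x = np.vstack([rng.normal(-2.0, 1.0, (50, 4)), rng.normal(2.0, 1.0, (50, 4))])
y = np.array([0] * 50 + [1] * 50)
w_f = rng.normal(size=(4, 8))                # frozen weights of f
z = feature_extractor(x, w_f)
w_h = retrain_classifier(z, y, 2)
acc = float(np.mean((z @ w_h).argmax(axis=1) == y))
```

The same structure underlies all three retraining schemes; R_ceN and R_ceA would only differ in whether z is computed from noisy or adversarially perturbed inputs.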

4.1. WHICH TRAINING PROCEDURE RESULTS IN THE MOST ROBUST SOURCE MODELS?

Our goal is to analyze if and how robustness can be preserved during transfer from the source domain to the target domain. Thus, we first need to obtain robust source models. To achieve this, we train models based on the 10 procedures discussed in Section 3. All models are trained on three different datasets, i.e. SVHN, EMNIST, and CIFAR10. Details on the experimental setup can be found in Appendix A.1. We evaluate model robustness in two complementary ways, using attacks and verification/certification. More specifically, we use attacks of different strength, i.e. the fast gradient sign method (FGSM) as a weak attack, and projected gradient descent (PGD) and DeepFool (DF) as strong attacks. For certification we use randomized smoothing, since this verification technique treats the model as a black box and thus is applicable to all model architectures and activation functions. Figure 2 illustrates the results of our robustness evaluation. The exact numbers and the verifiable radius can be found in Table 1 (Appendix A.2). First, we observe that all training procedures, except ceN and llc, result in comparably high base accuracy (A_base, rose/gray). Training on randomly perturbed inputs (ceN) can reduce the base accuracy by 0-5% and training with a local Lipschitz constant (llc) by 1-20%, depending on the dataset. The verifiable accuracy (A_cert.), i.e. the portion of points for which randomized smoothing could certify the prediction, varies between the different training procedures. Not surprisingly, models trained on randomly perturbed inputs (ceN and conN) have the highest verifiable accuracy. Robust training that includes attacks (ceA, feD, feA, conA, conF) or a local Lipschitz constant (llc) results in models with medium to high verifiable accuracy, and training without perturbations (ce, con) results in the least robust models according to randomized smoothing.
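To make the certification procedure concrete, here is a simplified Monte-Carlo sketch of randomized smoothing in the spirit of Cohen et al. (2019), for a hypothetical binary toy classifier. The normal-approximation confidence bound is our simplification; the original method uses an exact Clopper-Pearson binomial bound and separate selection/estimation samples.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)

def certify(classifier, x, sigma=0.25, n=1000, alpha=0.001):
    # classify n noisy copies of x under N(0, sigma^2 I) noise (binary case)
    counts = np.zeros(2, dtype=int)
    for _ in range(n):
        counts[classifier(x + rng.normal(0.0, sigma, size=x.shape))] += 1
    top = int(counts.argmax())
    p_hat = counts[top] / n
    # one-sided lower confidence bound on the top-class probability
    # (normal approximation; Cohen et al. use an exact binomial bound)
    z = NormalDist().inv_cdf(1.0 - alpha)
    p_low = min(p_hat - z * np.sqrt(p_hat * (1.0 - p_hat) / n), 1.0 - 1e-6)
    if p_low <= 0.5:
        return top, 0.0                                  # abstain: no certificate
    return top, sigma * NormalDist().inv_cdf(p_low)      # certified L2 radius

# toy binary classifier: sign threshold on the first coordinate
clf = lambda v: int(v[0] > 0.0)
label, radius = certify(clf, np.array([1.0]))
```

A point far from the decision boundary receives a large certified L2 radius, while a point near the boundary yields a low top-class probability and the procedure abstains (radius 0).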
Our attack analysis (Figure 2) shows that robust training (ceN, ceA, feD, feA, llc, conN, conA, conF) results in models that are significantly more robust than normally trained models (ce, con). Thus, to obtain high accuracy and high robustness on the source dataset, we recommend using a robust training procedure. The best performing training procedure w.r.t. base accuracy, verifiable accuracy, and robustness against adversarial attacks is conF, i.e. supervised contrastive learning followed by a fine-tuning step that uses adversarial attacks and a loss function that minimizes the distance between clean and adversarial representations in the feature space (see Table 1).

4.2. IS ROBUSTNESS PRESERVED DURING TARGET RETRAINING?

In the previous section we analyzed how to obtain (the most) robust source models. Now we analyze whether robustness is preserved during transfer from the source domain to the target domain, i.e. during target retraining. Since we focus on inherited robustness properties in this section, we perform the target retraining on clean inputs (R_ce). Figure 3 shows the target robustness versus the source robustness of the models. Robustness is quantified by determining the verifiable accuracy using randomized smoothing (first row) and the accuracy under the most successful (FGSM, PGD, DeepFool) attack, i.e. the attack which results in the largest accuracy decrease (second row). If source robustness were fully preserved during transfer, the measurements would fall on the line y = x. However, we observe a more complex and dataset-dependent correlation between target robustness and source robustness. First, models with low source robustness, such as ce, result in models with low target robustness. Models with high source robustness differ in their ability to preserve this robustness during target retraining. The amount of robustness that can be preserved depends on the source training procedure and the transfer learning task (i.e. source and target domain). On SVHN → MNIST we observe a clear target robustness ranking among training procedures that result in similarly robust source models. Using a local Lipschitz constant (llc) or contrastive learning on clean (con), randomly (conN), or adversarially perturbed inputs (conA) results in robust target models which are even more robust than the source models (measurements above the y = x line). On EMNIST → KMNIST the highest target robustness is observed for llc and for models trained with a loss function that minimizes representation distances corresponding to clean and adversarial inputs (feD), but target models are less robust than source models.
If models are transferred from CIFAR10 to FMNIST, the highest target robustness is achieved by llc. The other approaches are able to preserve (most of) the robustness during transfer. Thus, even though they are not the most robust methods on the source domain, the most robust methods on the target domain are the contrastive learning techniques and llc, which uses a local Lipschitz constant. These training procedures achieve robustness without being too focused on adversarial examples. While llc aims at mapping close inputs to close outputs, the contrastive approaches minimize representation distances of inputs corresponding to the same class. Thus, both approaches achieve robustness via more general concepts than computing worst-case perturbations/adversarial examples, which might be too data-specific to ensure that robustness is passed on to the target models.

4.3. DOES ROBUST RETRAINING ON THE TARGET DOMAIN IMPROVE ROBUSTNESS?

In the previous section we showed that the training procedure significantly affects model robustness on the target domain. In this section we analyze the influence of target retraining on target robustness. To this end, we compare 3 different target retraining procedures, i.e. target retraining on clean inputs (R_ce), on randomly perturbed inputs (R_ceN), and on adversarially perturbed inputs (R_ceA). Figure 4 compares the accuracy, verifiable accuracy, and accuracy decrease under attacks for these three target retraining schemes. Detailed numbers corresponding to these plots (and further plots) can be found in Table 2. Comparing the three target retraining schemes, training on randomly perturbed inputs (R_ceN) results in higher verifiable accuracy (first row, dark brown bars) but decreases the accuracy on the clean dataset. Our attack analysis shows small differences in model robustness between the three target retraining schemes. Both analyses clearly illustrate that the training procedure on the source domain has a major influence on target model robustness, while the target retraining has a minor effect on it. Thus, in order to obtain robust target models, we require robust training of the source model.

4.4. WHICH TRAINING/TARGET RETRAINING PROVIDES MODELS THAT ARE ROBUST AGAINST DISTRIBUTION SHIFTS?

In the previous sections, we quantified and analyzed adversarial robustness, i.e. robustness against small input perturbations that aim at fooling the model into making a wrong prediction. Another type of perturbation that occurs in real-life transfer learning are distribution shifts, i.e. changes of the dataset such as random noise, changes in contrast, or Gaussian blur. We analyze the robustness of our source models (see Figure 5 and Table 5) and target models (see Figure 5, Tables 6, 7 and 8) against these data shifts. The robustness of source models against distribution shifts depends on the dataset and the shift. The normally trained model (ce) is the least robust model. Models trained on randomly perturbed inputs (ceN, conN) are robust to noise shifts, while contrastive learning (con, conN, conA, conF) results in the most robust models w.r.t. changes of the contrast (see Table 5). Robustness against Gaussian blur shifts can be increased by robust training or contrastive learning. Target model robustness against distribution shifts mainly depends on the training procedure, while target retraining has a minor effect. Adversarial training and contrastive learning improve robustness against distribution shifts. On the SVHN → MNIST task, training with a local Lipschitz constant (llc) and contrastive learning techniques without fine-tuning (con, conN, conA) result in the target models that are most robust against distribution shifts caused by noise, contrast changes, or Gaussian blurring. On EMNIST → KMNIST, adversarial training with a loss that minimizes the distance between representations corresponding to clean and adversarially perturbed inputs (feD) yields the most robust models against all three analyzed distribution shifts.
On CIFAR10 → FMNIST, contrastive learning on randomly perturbed inputs (conN) and training with a local Lipschitz constant result in the strongest performing models against noise, contrast change, and Gaussian blur distribution shifts. Thus, models that are robust against adversarial attacks are also more robust against distribution shifts than non-robust models.
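The three shift types can be sketched as simple NumPy image transforms; the parameters below are illustrative and not the ones used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_shift(img, sigma=0.1):
    # additive Gaussian noise, clipped to the valid pixel range [0, 1]
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def contrast_shift(img, factor=0.5):
    # rescale pixel values around mid-gray; factor < 1 lowers contrast
    return np.clip(0.5 + factor * (img - 0.5), 0.0, 1.0)

def blur_shift(img, sigma=1.0, radius=2):
    # separable Gaussian blur for a 2-D grayscale image
    t = np.arange(-radius, radius + 1)
    k = np.exp(-t**2 / (2.0 * sigma**2))
    k /= k.sum()
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, out)

img = rng.uniform(0.0, 1.0, (8, 8))
```

Unlike adversarial perturbations, these shifts are label-independent and model-agnostic: the same transform is applied to the whole test set, so accuracy under the shift measures robustness to natural corruption rather than to worst-case attack.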

4.5. DOES TRANSFERABILITY CORRELATE WITH MODEL ROBUSTNESS?

One key requirement of transfer learning models is transferability, i.e. the potential of a model to benefit the target task. We analyze whether there is a correlation between robustness and transferability of our source models (see Figure 7, Figure 13, Tables 9 and 10 in the Appendix). In order to quantify transferability, we use the H-score as proposed by Bao et al. (2019) and determine the zero-shot performance of the source models. The zero-shot performance is the accuracy a source model achieves on the target dataset before any target retraining. The absolute value of the H-score depends on the dataset. Considering the ranking, the contrastive learning approaches (con, conN, conA, conF) have the highest H-scores on all three transfer learning tasks. A reason might be that these methods train the feature extractor by pulling together positive anchors (representations corresponding to the same class) and pushing apart negative anchors (representations corresponding to different classes) in the feature space. Since con performs similarly to the more robust contrastive approaches, we could not find a correlation between robustness and transferability. On the SVHN → MNIST task, the source and target dataset share all ten classes and the transferability estimates are consistent with the zero-shot performance. The contrastive approaches have a high zero-shot accuracy of up to 66%. On EMNIST → KMNIST and CIFAR10 → FMNIST, the source dataset and the target dataset contain different classes, so the zero-shot accuracy is similarly low for all methods, as expected (see Figure 13 in the Appendix). Thus, if the source dataset and target dataset contain the same classes, the supervised contrastive learning schemes achieve the highest transferability and zero-shot accuracy, and can preserve robustness during transfer.
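A minimal sketch of the H-score as we understand it from Bao et al. (2019): the trace of the (pseudo-)inverse feature covariance multiplied by the between-class covariance of the class-conditional feature means. The use of a pseudo-inverse and the toy data below are our choices for illustration.

```python
import numpy as np

def h_score(z, y):
    # H(f) = tr(cov(Z)^+ @ cov_b), where cov_b is the covariance of the
    # class-conditional feature means; pseudo-inverse for numerical safety
    cov_z = np.cov(z, rowvar=False)
    z_bar = np.zeros_like(z)
    for c in np.unique(y):
        z_bar[y == c] = z[y == c].mean(axis=0)   # replace features by class means
    cov_b = np.cov(z_bar, rowvar=False)
    return float(np.trace(np.linalg.pinv(cov_z) @ cov_b))

rng = np.random.default_rng(0)
# features that separate the two classes well vs. the same features
# scored against randomly shuffled labels
z = np.vstack([rng.normal(0.0, 1.0, (50, 3)), rng.normal(3.0, 1.0, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
informative = h_score(z, y)
uninformative = h_score(z, rng.permutation(y))
```

Intuitively, the score is large when the class-conditional means are well separated relative to the overall feature spread, which matches the intuition that contrastive training (which pulls same-class representations together) should yield high H-scores.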

5. CONCLUSION

This work analyzes how different training procedures on the source domain and fine-tuning strategies on the target domain affect model robustness. We show that the training procedure on the source domain has a major effect on target model robustness, while target retraining has a minor effect. Our results indicate that contrastive learning and training with a local Lipschitz constant best preserve robustness during target retraining. Furthermore, robustness to adversarial attacks also provides robustness against distribution shifts. Transferability and zero-shot performance depend on the relatedness between the source and the target domain and on the source training process. The highest transferability and zero-shot performance are achieved by the contrastive learning approaches, which are also among the strongest in preserving robustness during transfer.

A APPENDIX

A.1 DETAILS OF THE EXPERIMENTAL SETUP

Models. All models share a similar base architecture, proposed by Huang et al. (2021). The encoder consists of 6 convolutional layers (kernel sizes: 3, 3, 4, 3, 3, 4), each followed by an activation layer with ReLU or ReLUx (llc) as activation function, a flattening layer to reshape the representations (which is required for contrastive learning), followed by a linear layer (size: 512) and another activation layer. We include the flattening layer for all models since it is essential for contrastive learning; it did not have an effect on normally trained classifiers, and we wanted to keep the architectures of the approaches as close as possible. The classifier consists of one linear layer. The implementation is done in PyTorch (Paszke et al., 2019). For contrastive learning (con, conN, conA and conF) we use stochastic gradient descent as optimizer, as proposed by Khosla et al. (2020); all other models are optimized using the Adam optimizer. As mixing coefficient for the CE loss and the feature-distance loss (feD training) we chose λ = 0.1, as proposed by Chen et al. (2021). Hyper-parameters are determined by a grid search. For training the contrastive models (con, conN, conA, conF) the learning-rate grid search is done in [0.005; 0.1] and for the other models in [0.0001; 0.001]. All models are trained for 800 epochs and retrained for 200 epochs. The target retraining learning rate is searched in [0.0001; 0.5] and we do a warm start, i.e. we do not randomly initialize the weights of the classifier before target retraining. Model fine-tuning (conF) is done for 100 epochs with a learning rate in [0.0001; 0.001]. Datasets. We use six different datasets: The SVHN dataset (Netzer et al., 2011) contains images of street view house numbers. The MNIST dataset (LeCun & Cortes, 2010) consists of grayscale images of handwritten digits.
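The base architecture described above can be sketched in PyTorch as follows. Only the kernel sizes (3, 3, 4, 3, 3, 4), the per-layer activations, the flattening layer, the 512-unit linear layer and the one-linear-layer classifier are taken from the description; channel widths, strides and padding are not specified in the text and are assumptions made here so the example runs:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    # Sketch of the encoder: 6 conv layers (kernels 3, 3, 4, 3, 3, 4), each with
    # a ReLU activation, then flatten -> linear(512) -> ReLU. Channel widths,
    # strides and padding are ASSUMPTIONS, not taken from the paper.
    def __init__(self, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
        )
        self.flatten = nn.Flatten()                      # required for contrastive learning
        self.head = nn.Sequential(nn.LazyLinear(512), nn.ReLU())

    def forward(self, x):
        return self.head(self.flatten(self.features(x)))

class Classifier(nn.Module):
    # The classifier h: a single linear layer, as described in the text.
    def __init__(self, num_classes=10):
        super().__init__()
        self.fc = nn.LazyLinear(num_classes)

    def forward(self, z):
        return self.fc(z)
```

In the transfer setting of this paper, the encoder f is trained on the source domain and only the classifier h is retrained (warm-started) on the target domain.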
The CIFAR10 dataset (Krizhevsky et al., 2009) contains 3 × 32 × 32 images of objects (airplane, bird, car, cat, deer, dog, frog, horse, ship and truck). The FashionMNIST/FMNIST dataset (Xiao et al., 2017) contains gray-scale images of clothes. The EMNIST dataset (Cohen et al., 2017) consists of the 26 letters of the alphabet, and the KMNIST dataset (Clanuwat et al., 2018) consists of gray-scale images of Japanese characters. Each dataset comes with a predefined training set and test set. We further split the training sets into training data (90%) and validation data (10%). The validation set is used during training/retraining to check the accuracy and determine the best model. The following data augmentations are used on all datasets and models: random horizontal flips, random crops, and random rotations (≤ 15°). For the contrastive models (con, conN, conA and conF) additional augmentations, i.e. random resized crop, color jitter and random gray-scale, are used. Transfer Learning Tasks. Based on the six datasets discussed above we create three transfer learning tasks of different relatedness. We consider the following transfer tasks (source domain → target domain): SVHN → MNIST (highly related), CIFAR10 → FMNIST (related), EMNIST → KMNIST (related). Perturbations and Attacks. We use four different attack types: noise attacks, the Fast Gradient Sign Method (FGSM), Projected Gradient Descent (PGD) attacks and DeepFool attacks, all with an attack radius of 0.1. We chose an attack radius of 0.1 since it is a popular perturbation size in the randomized smoothing literature as well as for attacks. The perturbation is bounded by the L2-norm and applied to the input after data normalization. For adversarial training we use 10-step PGD attacks, while the robustness analysis uses 1-step FGSM attacks, 100-step PGD attacks and 100-step DeepFool attacks based on the implementation provided by Rauber et al. (2020). Randomized Smoothing.
Randomized smoothing techniques (Cohen et al., 2019) draw samples x_i ∼ N(x, σ²I) from the close neighborhood of input x, propagate them through the neural network and aggregate the outputs to obtain a smooth prediction. We use σ = 0.1, draw 500 samples for each input and bound the probability of returning an incorrect prediction by α = 10^-4. If a prediction cannot be certified, randomized smoothing abstains, i.e. returns -1 instead of a label. Please note that the certified prediction mainly depends on the neighborhood of the input sample x (i.e. the 500 drawn samples) and might differ from the prediction of the base model, which only depends on the input x. Distribution shifts. We generate different distribution shifts on each dataset based on random perturbations (noise), changes of the contrast (contrast) and Gaussian blur (blur). More specifically, we use a Gaussian noise shift (Noise), a uniform noise shift (UNoise), a contrast reduction shift (Contrast), a contrast reduction shift based on a binary search (ContrastBin), a contrast reduction shift based on a linear search (ContrastLin), Gaussian blur (Blur) and a salt-and-pepper (SaltPepper) shift. The perturbation or shift size is bounded by the L2-norm with 5 as upper bound. To compute these perturbations we use the implementation provided by Rauber et al. (2020). Transferability. In order to estimate transferability of source models we compute the H-score as proposed by Bao et al. (2019).
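The prediction side of randomized smoothing described above can be sketched as follows, along the lines of the PREDICT procedure of Cohen et al. (2019), using our σ = 0.1, 500 samples and α = 10^-4 as defaults. Here `model` is assumed to map a batch of inputs to hard class labels; the exact binomial tail is computed with `math.comb` to keep the sketch dependency-free:

```python
import math
import numpy as np

def binom_p_value(k, n, p=0.5):
    # Exact upper tail P[Bin(n, p) >= k].
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def smoothed_predict(model, x, sigma=0.1, n=500, alpha=1e-4, rng=None):
    # Classify n noisy copies x_i ~ N(x, sigma^2 I) and return the majority
    # class if a binomial test on the top-two vote counts rules out error
    # with probability > 1 - alpha; otherwise abstain and return -1.
    rng = rng if rng is not None else np.random.default_rng()
    noisy = x[None] + sigma * rng.standard_normal((n, *x.shape))
    counts = np.bincount(np.asarray(model(noisy)))
    order = np.argsort(counts)[::-1]                       # classes by vote count
    n_a = int(counts[order[0]])                            # top-class votes
    n_b = int(counts[order[1]]) if len(order) > 1 else 0   # runner-up votes
    if binom_p_value(n_a, n_a + n_b) <= alpha:
        return int(order[0])
    return -1                                              # abstain
```

This mirrors the behavior noted in the text: the returned label depends on the drawn neighborhood of x, so it can differ from the base model's prediction on x itself.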

A.2 ADDITIONAL EXPERIMENTAL RESULTS

Which training procedure results in the most robust source models? Table 1 shows how robust target models retrained on normal (ce), randomly perturbed (ceN) and adversarially perturbed (ceA) target inputs are. To quantify robustness we compute the verifiable accuracy using randomized smoothing and the accuracy under FGSM, PGD and DeepFool attacks. Is robustness preserved during target retraining? Table 8 shows how target retraining affects robustness; robustness is again quantified via the verifiable accuracy under randomized smoothing and the accuracy under FGSM, PGD and DeepFool attacks. Which training/target retraining provides models that are robust against distribution shifts? Does transferability correlate with model robustness? Table 9 shows the verifiable accuracy (A Smooth), the accuracy under the strongest (FGSM, PGD, DeepFool) attack (A str. attack) and quantifies transferability using the H-score. Table 10 shows the source accuracy and the zero-shot target accuracy, i.e. the accuracy a source model achieves on the target dataset before target retraining. The results of both tables are visualized in Figure 13.
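The L2-bounded PGD attack used above for adversarial training (10 steps) and robustness evaluation (100 steps) can be sketched as follows; the step-size heuristic 2.5·ε/steps is a common default and an assumption of this sketch, not a value taken from the paper (with steps=1 and a full-radius step this reduces to an L2 variant of FGSM):

```python
import torch
import torch.nn.functional as F

def pgd_l2(model, x, y, eps=0.1, steps=10, step_size=None):
    # Projected gradient descent inside the L2 ball of radius eps around x.
    step_size = step_size if step_size is not None else 2.5 * eps / steps
    delta = torch.zeros_like(x)
    for _ in range(steps):
        delta.requires_grad_(True)
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            # ascend along the per-example normalized gradient direction
            g_norm = grad.flatten(1).norm(dim=1).clamp_min(1e-12)
            delta = delta + step_size * grad / g_norm.view(-1, *([1] * (x.dim() - 1)))
            # project back onto the L2 ball of radius eps
            d_norm = delta.flatten(1).norm(dim=1).clamp_min(1e-12)
            delta = delta * (eps / d_norm).clamp(max=1.0).view(-1, *([1] * (x.dim() - 1)))
    return (x + delta).detach()
```

The experiments instead rely on the Foolbox implementations of Rauber et al. (2020); this sketch only illustrates the projected-ascent mechanism.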



The details are important since adversarial training against a weak attack produces a weak defense. Detailed information about the datasets, the transfer learning tasks, the models, the attacks and randomized smoothing is provided in the subsections of Section A.1.




Figure2: Accuracy on the clean test set (A base , rose or gray), verifiable accuracy (A cert. , red) and accuracy decrease under FGSM (green), DeepFool (DF, blue) and PGD attacks (purple) of source models trained on SVHN , EMNIST and CIFAR10.

Figure 3: Target vs. source robustness quantified by verification or the most successful (FGSM, PGD, DeepFool) attack. Labels refer to the source training, target retraining is done on clean inputs.

Figure 4: Accuracy on the clean test set (A base , bright colors), verifiable accuracy (A cert. , dark colors) and accuracy decrease of the strongest (FGSM, PGD, DeepFool) attack of target retraining on clean (R ce ), randomly perturbed (R ceN ) and adversarially perturbed (R ceA ) inputs.

Figure 5: Accuracy on the clean test set (A base , gray) and accuracy under distribution shifts based on random noise, changes of the contrast and Gaussian blur of source models.

Figure 7: Transferability (H-Score) versus robustness of the source models quantified as verifiable accuracy or accuracy under the strongest attack and zero-shot performance on SVHN → MNIST.

Figure 9: Accuracy on the clean test set (A base , bright colors) and accuracy under distribution shifts based on random noise, changes of the contrast and Gaussian blur of target model trained on SVHN, EMNIST or CIFAR10.

Figure 12: Accuracy on the clean test set (A base , bright colors) and accuracy under distribution shifts based on random noise, changes of the contrast and Gaussian blur of target models trained on CIFAR10 and retrained on clean (R ce ), randomly (R ceN ) or adversarially perturbed (R ceA ) inputs (FMNIST).

Figure 13: Transferability quantified by the H-Score versus robustness of the source models quantified as verifiable accuracy (Verification) or accuracy under the strongest (FGSM, PGD, DeepFool) attack (Attacking) and zero-shot performance on SVHN → MNIST, EMNIST → KMNIST and CIFAR10 → FMNIST.

3, 4 and Figure 8 in Appendix A.2.

Accuracy on the clean test set (base), verifiable accuracy (cert.) and certified radius, accuracy under FGSM, PGD and DeepFool attacks of source models trained on SVHN, EMNIST and CIFAR10. The best values (highest accuracy) are highlighted in bold.

SVHN → MNIST: Train | Retrain | A base [%] | A cert. [%] | R cert. | A FGSM [%] | A PGD [%] | A DeepFool [%]

Accuracy on the clean test set (base), verifiable accuracy (cert.) and certified radius, accuracy under FGSM, PGD and DeepFool attacks of target models trained on SVHN and retrained on MNIST. The best values (highest accuracy) are highlighted in bold.

SVHN → MNIST: Train | Retrain | A base [%] | A cert. [%] | R cert. | A FGSM [%] | A PGD [%] | A DeepFool [%]

Accuracy on the clean test set (base), verifiable accuracy (cert.) and certified radius, accuracy under FGSM, PGD and DeepFool attacks of target models trained on EMNIST and retrained on KMNIST. The best values (highest accuracy) are highlighted in bold.

EMNIST → KMNIST: Train | Retrain | A base [%] | A cert. [%] | R cert. | A FGSM [%] | A PGD [%] | A DeepFool [%]

Accuracy on the clean test set (base), verifiable accuracy (cert.) and certified radius, accuracy under FGSM, PGD and DeepFool attacks of target models trained on CIFAR10 and retrained on FMNIST. The best values (highest accuracy) are highlighted in bold.

CIFAR10 → FMNIST: Train | Retrain | A base [%] | A cert. [%] | R cert. | A FGSM [%] | A PGD [%] | A DeepFool [%]

Accuracy on the clean test set (base, bright colors), verifiable accuracy (cert., dark colors), accuracy under FGSM (dark green), DeepFool (DF, dark blue) and PGD attacks (dark purple) of target models retrained on normal (R ce ), randomly (R ceN ) or adversarially perturbed (R ceA ) target data.



Accuracy on the clean test set (A base ) and under distribution shifts of source models trained on SVHN (first row), EMNIST (second row) and CIFAR10 (third row). The best values (highest accuracy) are highlighted in bold.

A Noise [%] | A UNoise [%] | A Contrast [%] | A ContrastBin [%] | A ContrastLin [%] | A Blur [%] | A SaltPepper [%]

Accuracy on the clean test set (A base ) and under distribution shifts of the target models trained on SVHN and retrained on MNIST. The best values/highest accuracy are highlighted using bold letters.

Accuracy on the clean test set (A base ) and under distribution shifts of the target models trained on EMNIST and retrained on KMNIST. The best values (highest accuracy) are highlighted in bold.

A Noise [%] | A UNoise [%] | A Contrast [%] | A ContrastBin [%] | A ContrastLin [%] | A Blur [%] | A SaltPepper [%]

Accuracy on the clean test set (A base ) and under distribution shifts of the target models trained on CIFAR10 and retrained on FMNIST. The best values/highest accuracy are highlighted using bold letters.

Transferability, verifiable accuracy and accuracy under attack of source models versus robustness on the source dataset. Transferability is measured by the H-Score, while robustness is quantified using verifiable accuracy and accuracy under the strongest attack.

Accuracy of source models (before target retraining) on the target dataset versus accuracy on the source dataset.

