WHEN IS ADVERSARIAL ROBUSTNESS TRANSFERABLE?

Abstract

Knowledge transfer is an effective tool for learning, especially when labeled data is scarce or when training from scratch is prohibitively costly. The overwhelming majority of the transfer learning literature focuses on obtaining accurate models, neglecting the issue of adversarial robustness. Yet robustness is essential, particularly when transferring to safety-critical domains. We analyze and compare how different training procedures on the source domain and different fine-tuning strategies on the target domain affect robustness. More precisely, we study 10 training schemes for source models and 3 for target models, including normal, adversarial, contrastive, and Lipschitz-constrained variants. We quantify model robustness via randomized smoothing and adversarial attacks. Our results show that improving model robustness on the source domain increases robustness on the target domain. Target retraining, in contrast, has only a minor influence on target model robustness. These results indicate that model robustness is preserved during target retraining and transferred from the source domain to the target domain.

1. INTRODUCTION

Since their proposal, neural networks have been constantly evolving as they are adapted to many diverse tasks. They tend to become larger and more complex, since, e.g., overparametrization has proven to be highly beneficial. Training such large and complex neural networks usually requires a huge amount of (labeled) high-quality data. Since this amount of data is not available in all domains, transfer learning was proposed. The idea is to transfer the knowledge of a trained model from the so-called source domain to a similar, related task in a target domain for which only a small amount of data exists. Usually, the transfer is considered successful if the model achieves high accuracy on the target domain. However, accuracy is not the only desired property of neural networks. Adversarial robustness is often equally important, especially in safety-critical domains. Some techniques applied in transfer learning (Shafahi et al., 2020; Chen et al., 2021) claim to improve the robustness of transfer learning. However, there is no study that directly compares these techniques to standard methods for improving robustness, such as adversarial training or training with a (local) Lipschitz constant. We fill this gap by answering the following questions: 1. Which training procedure results in the most robust source models? 2. Is robustness preserved during target retraining? 3. Does robust retraining on the target domain improve robustness? 4. Which combination of training and target retraining provides models that are robust against distribution shifts? 5. Does transferability correlate with model robustness? To answer these questions, we use a popular transfer learning framework consisting of two parts (see Figure 1): a feature extractor f, which extracts representations from the inputs and is trained on the source domain, and a classifier h, which maps extracted representations to predictions and is retrained on the target domain.
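The two-part framework above can be sketched in a few lines. The following toy example stands in for the paper's setup: a frozen "source" feature extractor f (here just a fixed random projection with a tanh nonlinearity, whereas the paper uses a network trained on the source domain) and a linear classifier head h that is retrained on a small labeled target set. All names and the toy data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Frozen "source-trained" feature extractor f: x -> z.
# In the paper this is a deep network; here it is a fixed random projection.
W_f = rng.normal(size=(2, 4))

def f(X):
    return np.tanh(X @ W_f)

def retrain_head(X_tgt, y_tgt, epochs=500, lr=0.5):
    # Target retraining: only the classifier head h is updated,
    # f stays fixed, so source knowledge is preserved.
    Z = f(X_tgt)
    w_h = np.zeros(Z.shape[1])
    for _ in range(epochs):
        p = sigmoid(Z @ w_h)
        w_h -= lr * Z.T @ (p - y_tgt) / len(y_tgt)  # logistic-loss gradient
    return w_h

# Small labeled target set: two well-separated 2-D classes.
X_tgt = np.vstack([rng.normal(1.0, 0.4, (15, 2)),
                   rng.normal(-1.0, 0.4, (15, 2))])
y_tgt = np.concatenate([np.ones(15), np.zeros(15)])

w_h = retrain_head(X_tgt, y_tgt)
acc = np.mean((sigmoid(f(X_tgt) @ w_h) > 0.5) == y_tgt)
```

The point of the sketch is the division of labor: the target domain only has enough data to fit the small head h, while the representation z = f(x) is inherited wholesale from the source domain.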
We investigate and compare how different training procedures and target retraining techniques affect the performance and robustness of this model. More specifically, we compare 10 training procedures that can be grouped into three categories. The first category consists of training methods that aim at achieving robustness by changing the inputs, i.e., (1) training on clean inputs (ce), (2) on randomly perturbed inputs (ceN), and (3) on adversarially perturbed inputs (ceA), as well as (4) supervised contrastive learning (con) (Khosla et al., 2020), (5) supervised contrastive learning on randomly perturbed inputs (conN), and (6) on adversarially perturbed inputs (conA). The second category consists of methods that change the latent space of the model to achieve robustness, i.e., (7) latent adversarial training (feA) (Singh et al., 2019), (8) adversarial representation loss minimization (feD) (Chen et al., 2021), and (9) a combination of supervised contrastive learning and adversarial representation loss minimization (conF). Our third category uses constraints on the whole model to improve robustness, realized by (10) training with a local Lipschitz constant (llc) (Huang et al., 2021). To analyze how target retraining affects model robustness, we compare target retraining on (a) clean (R_ce), (b) randomly perturbed (R_ceN), and (c) adversarially perturbed inputs (R_ceA).

[Figure 1: A feature extractor f maps an input x (or a perturbed input x + δ) to a representation z; a source classifier h_S or a target classifier h_T maps z to the output h_S(f(x)) or h_T(f(x)). Source training procedures are grouped into methods that change the inputs (ce, ceN, ceA, con, conN, conA), methods that change the latent space (feD, feA, conF), and methods that constrain the whole model (llc).]

To provide a more complete picture of robustness, we consider robustness certification, performance against a variety of attacks, and performance under distribution shift.
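To make the ceA scheme concrete, the following is a minimal numpy sketch of adversarial training via Projected Gradient Descent on a toy logistic-regression model: the inner loop crafts a worst-case perturbation inside an L-infinity ball, and the outer loop fits the model on the perturbed inputs. The model, step sizes, and data are illustrative assumptions; the paper applies this idea to deep feature extractors.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def loss_grad_x(w, x, y):
    # Gradient of the logistic loss w.r.t. the *input* x (for crafting perturbations).
    return (sigmoid(w @ x) - y) * w

def pgd_perturb(w, x, y, eps=0.3, alpha=0.1, steps=5):
    # Projected Gradient Descent: ascend the loss inside an L-inf ball of radius eps.
    delta = np.zeros_like(x)
    for _ in range(steps):
        delta += alpha * np.sign(loss_grad_x(w, x + delta, y))
        delta = np.clip(delta, -eps, eps)  # project back onto the ball
    return x + delta

def adversarial_train(X, y, epochs=200, lr=0.5):
    # ceA-style training: each update uses an adversarially perturbed input.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            x_adv = pgd_perturb(w, xi, yi)
            w -= lr * (sigmoid(w @ x_adv) - yi) * x_adv  # SGD step on the adv. input
    return w

# Toy data: two well-separated Gaussian classes.
X = np.vstack([rng.normal(2.0, 0.5, (20, 2)), rng.normal(-2.0, 0.5, (20, 2))])
y = np.concatenate([np.ones(20), np.zeros(20)])

w = adversarial_train(X, y)
acc = np.mean((sigmoid(X @ w) > 0.5) == y)
```

Replacing the PGD inner loop with additive random noise would give the ceN variant, and dropping it entirely gives plain ce training.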
Namely, we employ (i) randomized-smoothing-based certification as well as (ii) Fast Gradient Sign Method (FGSM), (iii) Projected Gradient Descent (PGD), and (iv) DeepFool (DF) attacks on both the source and the target domain. In terms of distribution shift, we determine source and target accuracy under different shifts based on random noise, changes of contrast, and Gaussian blur. Next, we investigate whether there is a correlation between transferability and model robustness. We compute a transferability metric and analyze it together with model robustness and zero-shot performance. To quantify transferability, we use the H-score, proposed by Bao et al. (2019) to measure how useful representations learned on a source domain are for learning a target task. This battery of robustness tests can tell us when adversarial robustness is transferable. As we will show in Section 4, target models inherit robustness from the source models, while target retraining has only a minor impact. Our findings suggest that model robustness is transferable when source models are trained with a procedure that enhances model robustness without being too focused on data-specific adversarial examples.
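The H-score of Bao et al. (2019) can be written as tr(Cov(Z)^{-1} Cov(E[Z|Y])), i.e., the between-class feature covariance measured relative to the total feature covariance; higher values indicate representations that are more useful for the target task. Below is a small numpy sketch of that formula (the small ridge term on the covariance is our own assumption added for numerical stability, and the synthetic data is purely illustrative):

```python
import numpy as np

def h_score(Z, y):
    """H-score of features Z (n x d) for labels y:
    tr( Cov(Z)^{-1} Cov( E[Z | Y] ) ).  Higher = more transferable features."""
    Z = Z - Z.mean(axis=0)
    cov_z = np.cov(Z, rowvar=False) + 1e-6 * np.eye(Z.shape[1])  # ridge for stability
    # Replace each sample by its class-conditional mean feature.
    Zbar = np.zeros_like(Z)
    for c in np.unique(y):
        Zbar[y == c] = Z[y == c].mean(axis=0)
    cov_zbar = np.cov(Zbar, rowvar=False)
    return np.trace(np.linalg.solve(cov_z, cov_zbar))

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
# Features whose class means are well separated vs. pure noise features.
informative = rng.normal(0, 0.3, (200, 2)) + np.outer(2 * y - 1, [1.0, 1.0])
noise = rng.normal(0, 1.0, (200, 2))

score_informative = h_score(informative, y)
score_noise = h_score(noise, y)
```

As expected, class-separating features receive a substantially higher H-score than label-independent noise, which is why the metric serves as a cheap proxy for transferability without retraining the head.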

2. BACKGROUND AND RELATED WORK

Robustness is widely studied for standard tasks such as classification and regression, but there are few works that analyze how robustness properties can be transferred from the source to the target domain. There are different aspects of robustness. One aspect is the vulnerability to adversarial examples (Szegedy et al., 2014): small input perturbations that are carefully crafted to manipulate the predictions of a model (e.g., cause misclassification). Finding attacks for a given model has been widely studied for different threat models. These attacks can be used to compute an upper bound on the accuracy under adversarial perturbations. However, this bound can be loose, since properly evaluating adversarial robustness is challenging. While a model may be robust against a particular attack, there is usually no guarantee that it will not fail against a better and stronger attack. The lesson that seemingly robust models can be broken has been learned more than once (Carlini & Wagner, 2017; Athalye et al., 2018; Tramèr et al., 2020). A complementary strategy for evaluating adversarial robustness is verification/certification. Robustness certificates provide guarantees that the prediction of a model will not change for the specified

