WHEN IS ADVERSARIAL ROBUSTNESS TRANSFERABLE?

Abstract

Knowledge transfer is an effective tool for learning, especially when labeled data is scarce or when training from scratch is prohibitively costly. The overwhelming majority of the transfer learning literature focuses on obtaining accurate models and neglects adversarial robustness. Yet robustness is essential, particularly when transferring to safety-critical domains. We analyze and compare how different training procedures on the source domain and different fine-tuning strategies on the target domain affect robustness. More precisely, we study 10 training schemes for source models and 3 for target models, including normal, adversarial, contrastive, and Lipschitz-constrained variants. We quantify model robustness via randomized smoothing and adversarial attacks. Our results show that improving model robustness on the source domain increases robustness on the target domain, whereas target retraining has only a minor influence on target model robustness. These results indicate that model robustness is preserved during target retraining and transferred from the source domain to the target domain.

1. INTRODUCTION

Since their introduction, neural networks have been continuously evolving as they are adapted to many diverse tasks. They tend to become larger and more complex, since, e.g., overparametrization has proven to be highly beneficial. Training such large and complex neural networks usually requires a huge amount of (labeled) high-quality data. Since this amount of data is not available in all domains, transfer learning was proposed. The idea is to transfer the knowledge of a trained model from the so-called source domain to a similar, related task in a target domain for which only a small amount of data exists. Usually, the transfer is considered successful if the model achieves high accuracy on the target domain. However, accuracy is not the only desired property of neural networks. Adversarial robustness is often equally important, especially in safety-critical domains. Some techniques applied in transfer learning (Shafahi et al., 2020; Chen et al., 2021) claim to improve the robustness of transfer learning. However, there is no study that directly compares these techniques to standard methods for improving robustness, such as adversarial training or training with a (local) Lipschitz constraint. We fill this gap by answering the following questions:

1. Which training procedure results in the most robust source models?
2. Is robustness preserved during target retraining?
3. Does robust retraining on the target domain improve robustness?
4. Which training/target retraining provides models that are robust against distribution shifts?
5. Does transferability correlate with model robustness?

To answer these questions, we use a popular transfer learning framework consisting of two parts (see Figure 1): a feature extractor f, which extracts representations from the inputs and is trained on the source domain, and a classifier h, which maps extracted representations to predictions and is retrained on the target domain.
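The two-part framework above (frozen feature extractor f, retrained classifier head h) can be sketched as follows. This is a toy illustration with made-up components, not the paper's actual architecture: f is a fixed random ReLU projection standing in for a source-trained extractor, and h is a linear head fit by least squares on the small target set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a source-trained feature extractor f;
# its weights are frozen during target retraining.
W_f = rng.normal(size=(32, 8))

def f(x):
    """Extract representations (frozen after source-domain training)."""
    return np.maximum(x @ W_f, 0.0)  # ReLU features

def retrain_head(x_target, y_target, n_classes=3):
    """Retrain only the classifier h on the (small) target-domain set."""
    feats = f(x_target)
    onehot = np.eye(n_classes)[y_target]
    # least-squares fit of a linear head on one-hot labels
    W_h, *_ = np.linalg.lstsq(feats, onehot, rcond=None)
    return W_h

def predict(x, W_h):
    """h maps extracted representations to class predictions."""
    return np.argmax(f(x) @ W_h, axis=1)

# Small target-domain dataset (synthetic)
x_tgt = rng.normal(size=(64, 32))
y_tgt = rng.integers(0, 3, size=64)

W_h = retrain_head(x_tgt, y_tgt)
preds = predict(x_tgt, W_h)
print(preds.shape)  # (64,)
```

The key property mirrored here is that only h changes during target retraining; any robustness behavior of f is inherited from the source domain.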
We investigate and compare how different training procedures and target retraining techniques affect the performance and robustness of this model. More specifically, we compare 10 training procedures that can be grouped into three categories. Category one consists of training methods that aim to achieve robustness by changing the inputs, i.e., (1) training on clean inputs (ce), (2) on randomly perturbed inputs (ceN), and (3) on adversarially perturbed inputs (ceA), (4) supervised
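The three input variants in category one can be illustrated in a few lines. This sketch uses a hypothetical linear classifier with weight vector w and an FGSM-style one-step attack (Goodfellow et al.'s fast gradient sign method) as the adversarial perturbation; the paper's actual attack and models may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
eps = 0.1  # perturbation budget (illustrative value)

# Toy linear classifier used only to show the input variants;
# w is a made-up weight vector, not a trained source model.
w = rng.normal(size=16)

def loss_grad_x(x, y):
    """Gradient w.r.t. x of the logistic loss log(1 + exp(-y * w.x))."""
    margin = y * (x @ w)
    return -y * w / (1.0 + np.exp(margin))

x, y = rng.normal(size=16), 1

x_clean = x                                    # (1) clean input (ce)
x_noisy = x + rng.normal(scale=eps, size=16)   # (2) randomly perturbed (ceN)
x_adv = x + eps * np.sign(loss_grad_x(x, y))   # (3) adversarially perturbed (ceA)

# The FGSM step stays inside an L-infinity ball of radius eps around x
print(np.max(np.abs(x_adv - x_clean)))  # 0.1
```

Training then proceeds with the usual cross-entropy loss, but on the chosen input variant instead of (or in addition to) the clean input.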

