DOES ADVERSARIAL TRANSFERABILITY INDICATE KNOWLEDGE TRANSFERABILITY?

Abstract

Despite the immense success of deep neural networks (DNNs), adversarial examples, i.e., perturbed inputs designed to mislead a model, have raised serious concerns. At the same time, adversarial examples exhibit intriguing phenomena such as adversarial transferability. DNNs also exhibit knowledge transfer, which is critical for improving learning efficiency and for learning in domains that lack high-quality training data. To uncover the fundamental connections between these phenomena, we investigate and give an affirmative answer to the question: does adversarial transferability indicate knowledge transferability? We theoretically analyze the relationship between adversarial transferability and knowledge transferability, and outline easily checkable sufficient conditions under which adversarial transferability indicates knowledge transferability. In particular, we show that composition with an affine function is sufficient to reduce the difference between two models when they possess high adversarial transferability. Furthermore, we provide empirical evaluation for diverse transfer learning scenarios on multiple datasets, showing a strong positive correlation between adversarial transferability and knowledge transferability, thus illustrating that our theoretical insights are predictive of practice.

1. INTRODUCTION

Knowledge transferability and adversarial transferability are two fundamental properties of a learned model when it is transferred to other domains.

Knowledge transferability, also known as learning transferability, has attracted extensive study in machine learning. Long before it was formally defined, the computer vision community exploited it to perform important visual manipulations (Johnson et al., 2016), such as style transfer and super-resolution, where pretrained VGG networks (Simonyan & Zisserman, 2014) are used to encode images into semantically meaningful features. After the release of ImageNet (Russakovsky et al., 2015), pretrained ImageNet models (e.g., on TensorFlow Hub or PyTorch Hub) quickly became the default choice of transfer source because of their broad coverage of visual concepts and compatibility with various visual tasks (Huh et al., 2016).

Adversarial transferability, on the other hand, is the phenomenon whereby adversarial examples can attack not only the model they were generated against but also other models (Goodfellow et al., 2014; Papernot et al., 2016). It is therefore widely exploited to mount black-box attacks (Ilyas et al., 2018; Liu et al., 2016), and several theoretical analyses have established sufficient conditions for adversarial transferability (Demontis et al., 2019; Ma et al., 2018).

Both phenomena reveal something about the nature of machine learning models and the corresponding data distributions, and the relation between the two interests us the most. We begin by showing that adversarial transferability can indicate knowledge transferability. This tie can potentially provide a similarity measure between data distributions, an identifier of the important features a complex model focuses on, and an affinity map between complicated tasks.
Thus, we believe our results have further implications for model interpretability and verification, fairness, and robust and efficient transfer learning. To the best of our knowledge, this is the first work studying the fundamental relationship between adversarial transferability and knowledge transferability both theoretically and empirically. Our main contributions are as follows.

• We formally define two quantities, $\tau_1$ and $\tau_2$, that measure adversarial transferability from different aspects, enabling an in-depth, geometric understanding of adversarial transferability in the feature representation space.

• We derive an upper bound on knowledge transferability with respect to adversarial transferability. We rigorously depict their underlying relation and show that adversarial transferability can indicate knowledge transferability.

• We conduct thorough controlled experiments for diverse knowledge transfer scenarios (e.g., knowledge transfer among data distributions, attributes, and tasks) on benchmark datasets including STL-10, CIFAR-10, CelebA, Taskonomy data, and four language datasets. Our empirical results show a strong positive correlation between adversarial and knowledge transferability, validating our theoretical prediction.

2. RELATED WORK

Knowledge transferability has been widely applied in scenarios where the available data for a target domain is limited, and has achieved great success (Van Opbroek et al., 2014; Wurm et al., 2019; Wang et al., 2017; Kim & Park, 2017; Maqueda et al., 2018; Devlin et al., 2018). Several studies have examined the factors that affect knowledge transferability (Yosinski et al., 2014; Long et al., 2015b; Wang et al., 2019; Xu et al., 2019; Shinya et al., 2019); empirical observations show that the correlation between learning tasks (Achille et al., 2019; Zamir et al., 2018), the similarity of model architectures, and the data distributions all correlate with knowledge transfer effects.

Adversarial transferability has been observed by several works (Papernot et al., 2016; Goodfellow et al., 2014; Joon Oh et al., 2017). Since these early works, many studies have sought to further understand the phenomenon and to design more transferable adversarial attacks. Across threat models, many attack methods have been proposed to boost adversarial transferability (Zhou et al., 2018; Demontis et al., 2019; Dong et al., 2019; Xie et al., 2019). Naseer et al. (2019) propose to produce adversarial examples that transfer cross-domain via a generative adversarial network. In addition to efficacy, efficiency (Ilyas et al., 2018) and practicality (Papernot et al., 2017) have also been optimized. Beyond these empirical studies, some work analyzes the phenomenon itself, showing different conditions that may enhance adversarial transferability (Athalye et al., 2018; Tramèr et al., 2017; Ma et al., 2018; Demontis et al., 2019).
Building upon these observations, it is clear that certain connections exist between adversarial transferability and knowledge transfer. Here we aim to provide the first theoretical justification for this connection and to design systematic empirical studies measuring the correlation.

3. ADVERSARIAL TRANSFERABILITY VS. KNOWLEDGE TRANSFERABILITY

In this section, we rigorously establish connections between adversarial examples and knowledge transferability. We first formally state the problem studied in this section, then introduce in subsection 3.1 two metrics that encode information about adversarial attacks, and finally present in subsection 3.2 our theoretical results on the relationship between adversarial and knowledge transferability.

Notations. We use blackboard bold to denote sets, e.g., $\mathbb{R}$. We use calligraphy to denote distributions, e.g., $\mathcal{D}$; the support of a distribution $\mathcal{D}$ is denoted $\mathrm{supp}(\mathcal{D})$. We use bold lower-case letters to denote vectors, e.g., $x \in \mathbb{R}^n$, and bold upper-case letters to denote matrices, e.g., $A$, with $A^\dagger$ the Moore-Penrose inverse of $A$. We use $\circ$ to denote function composition, i.e., $g \circ f(x) = g(f(x))$, and $\|\cdot\|_2$ to denote the Euclidean norm induced by the standard inner product $\langle\cdot,\cdot\rangle$. Given a function $f$, we write $f(x)$ for its value at $x$ and $f$ for the function itself in function space. We write $\langle f_1, f_2\rangle_{\mathcal{D}} = \mathbb{E}_{x\sim\mathcal{D}}\,\langle f_1(x), f_2(x)\rangle$ for the inner product induced by $\mathcal{D}$, and $\|f\|_{\mathcal{D}} = \sqrt{\langle f, f\rangle_{\mathcal{D}}}$ for the induced norm. For a matrix-valued function $F: \mathrm{supp}(\mathcal{D}) \to \mathbb{R}^{d\times m}$, we define its $L^2(\mathcal{D})$-norm in accordance with the matrix 2-norm as $\|F\|_{\mathcal{D},2} = \sqrt{\mathbb{E}_{x\sim\mathcal{D}}\|F(x)\|_2^2}$. We define the projection operator $\mathrm{proj}(\cdot, r)$, which projects a matrix onto the spectral-norm ball of radius $r$: $\mathrm{proj}(A, r) = A$ if $\|A\|_2 \le r$, and $rA/\|A\|_2$ if $\|A\|_2 > r$.

Setting. Assume we are given a target problem defined by a data distribution $x\sim\mathcal{D}$, where $x\in\mathbb{R}^n$, and a ground-truth labeling function $y:\mathbb{R}^n\to\mathbb{R}^d$. As a reference, a model $f_T:\mathbb{R}^n\to\mathbb{R}^d$ trained on the target data is obtained by optimizing over a function class $f_T\in\mathcal{F}_T$. Now suppose we have a source model $f_S:\mathbb{R}^n\to\mathbb{R}^m$ pretrained on source data; how well would $f_S$ transfer to the target distribution $\mathcal{D}$?
Knowledge transferability. Given a trainable function $g:\mathbb{R}^m\to\mathbb{R}^d$, where $g\in\mathcal{G}$ is drawn from a small function class for efficiency, we ask whether $f_S$ can achieve a low loss $L(\cdot;y,\mathcal{D})$, e.g., mean squared error, after being composed with a trainable $g$; that is, we compare $\min_{g\in\mathcal{G}} L(g\circ f_S;\, y, \mathcal{D})$ with $L(f_T;\, y, \mathcal{D})$. Clearly, the solution to this optimization problem depends on the choice of $\mathcal{G}$. Since in practice it is common to stack and fine-tune a linear layer on top of a pretrained feature extractor, we consider the class of affine functions. Formally, the problem studied in our theory is stated as follows.

Problem 1. Given a reference model $f_T$ trained on the target distribution $\mathcal{D}$ and a source model $f_S$ pretrained on source data, can we predict the best possible performance on $\mathcal{D}$ of the composite function $g\circ f_S$, where $g$ is from a bounded affine function class, given the adversarial transferability between $f_S$ and $f_T$?
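Because $\mathcal{G}$ is the class of affine functions and the loss is mean squared error, the inner minimization $\min_{g\in\mathcal{G}} L(g\circ f_S; y, \mathcal{D})$ has a closed-form least-squares solution over a finite sample. A minimal numpy sketch; the toy feature extractor and all names (`f_S`, `A_true`, etc.) are illustrative stand-ins, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: f_S maps R^n -> R^m features; labels live in R^d.
n, m, d, N = 5, 8, 3, 200
W = rng.normal(size=(m, n))
f_S = lambda X: np.tanh(X @ W.T)          # pretrained source "features"
A_true = rng.normal(size=(d, m))
X = rng.normal(size=(N, n))
Y = f_S(X) @ A_true.T + 0.7               # labels that ARE affine in the features

# Fit g(z) = A z + b by least squares on [features, 1].
F = np.hstack([f_S(X), np.ones((N, 1))])  # append a bias column
sol, *_ = np.linalg.lstsq(F, Y, rcond=None)
A_hat, b_hat = sol[:-1].T, sol[-1]

mse = np.mean((F @ sol - Y) ** 2)
print(mse)  # near zero: an affine head recovers these labels exactly
```

Since the labels here are exactly affine in the source features, the fitted head recovers them; for real models the residual of this fit is precisely the transferred loss the theory bounds.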

3.1. ADVERSARIAL TRANSFERABILITY

We use the $\ell_2$-norm to characterize the effectiveness of an attack.

Definition 1 (Virtual Adversarial Attack (Miyato et al., 2018)). Given a model $f:\mathbb{R}^n\to\mathbb{R}^d$, the attack on point $x$ within the $\epsilon$-ball is defined as $\arg\max_{\|\delta\|\le\epsilon} \|f(x) - f(x+\delta)\|_2$. As this is intractable in practice, we use the tangent function to approximate the difference: $\delta_{f,\epsilon}(x) = \arg\max_{\|\delta\|\le\epsilon} \|\nabla f(x)^\top \delta\|_2$, where $\nabla f(x)\in\mathbb{R}^{n\times d}$ is the Jacobian matrix. The subscript $\epsilon$ is dropped when it is clear from context or irrelevant.

To provide a quantitative view of adversarial transferability, we define two metrics, $\tau_1$ and $\tau_2$. Both lie in the range $[0,1]$, where higher values indicate more adversarial transferability.

Definition 2 (Adversarial Transferability (Angle)). Given two functions $f_1, f_2$ with the same input dimension (their output dimensions may differ), the adversarial transferability (angle) of $f_1$ and $f_2$ at point $x$ is defined as the squared cosine of the angle between the two attacks, i.e.,
$\tau_1(x) = \frac{\langle \delta_{f_1}(x),\, \delta_{f_2}(x)\rangle^2}{\|\delta_{f_1}(x)\|_2^2 \cdot \|\delta_{f_2}(x)\|_2^2}$.
We denote its expected value by $\tau_1 = \mathbb{E}_{x\sim\mathcal{D}}[\tau_1(x)]$.

Intuitively, $\tau_1$ characterizes the similarity of the two attacks: the higher the cosine similarity, the better the two models can be attacked together. Note that we use the square of the cosine, so a cosine of either $1$ or $-1$ gives the same indication of high knowledge transferability; fine-tuning the last layer can rectify such a difference by flipping the sign of the final linear layer. However, knowing only the angle between the two attack directions is not sufficient to fully characterize how well $f_S$ will perform. For example, it is not difficult to construct two functions with the highest possible value $\tau_1 = 1$ that are nevertheless not transferable via affine functions. Our experiments also show that $\tau_1$ alone is not sufficient.
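Definition 1's linearized attack is the top right-singular vector of the Jacobian, scaled to the $\epsilon$-ball, and $\tau_1$ is then a squared cosine between two such directions. A numpy sketch with a finite-difference Jacobian; the linear models below are toy illustrations, not the paper's networks:

```python
import numpy as np

def jacobian(f, x, h=1e-5):
    """Finite-difference Jacobian; rows = output dims, cols = input dims."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        J[:, i] = (f(x + e) - fx) / h
    return J

def attack(f, x, eps=0.1):
    """Virtual adversarial direction: eps times the top right-singular vector."""
    _, _, Vt = np.linalg.svd(jacobian(f, x))
    return eps * Vt[0]

def tau1(f1, f2, x):
    """Squared cosine between the two attack directions at x."""
    d1, d2 = attack(f1, x), attack(f2, x)
    c = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
    return c ** 2

# Two linear models with proportional weights attack identically: tau1 = 1.
W = np.array([[1.0, 2.0, -1.0]])
f1 = lambda x: W @ x
f2 = lambda x: (-3.0 * W) @ x          # a sign flip is fine: tau1 squares the cosine
x0 = np.array([0.3, -0.2, 0.5])
print(tau1(f1, f2, x0))                # close to 1.0
```

The sign-flipped model illustrates why the cosine is squared: an affine head can absorb the flip.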
Therefore, in addition to the information about the attacks $\delta_f$ captured by $\tau_1$, we also need information about the deviation of a function under attack. We denote the deviation of a function $f$ under attack $\delta(x)$ as $f(x+\delta(x)) - f(x)$, and define its first-order approximation as
$\Delta_{f,\delta}(x) = \nabla f(x)^\top \delta(x). \quad (1)$
Accordingly, we define a second metric to answer the following question: applying $f_1$'s adversarial attacks to both models, how well can the deviations of their function values be aligned by an affine transformation?

Definition 3 (Adversarial Transferability (Deviation)). Given two functions $f_1, f_2$ with the same input dimension and potentially different output dimensions, the adversarial transferability (deviation) of adversarial attacks from $f_1$ to $f_2$ on data distribution $\mathcal{D}$ is defined as
$\tau_2^{f_1\to f_2} = \frac{\langle 2\Delta_{f_2,\delta_{f_1}} - A\Delta_{f_1,\delta_{f_1}},\; A\Delta_{f_1,\delta_{f_1}}\rangle_{\mathcal{D}}}{\|\Delta_{f_2,\delta_{f_1}}\|_{\mathcal{D}}^2}$,
where $A$ is the constant matrix
$A = \mathrm{proj}\!\left(\mathbb{E}_{x\sim\mathcal{D}}\big[\Delta_{f_2,\delta_{f_1}}(x)\Delta_{f_1,\delta_{f_1}}(x)^\top\big]\; \mathbb{E}_{x\sim\mathcal{D}}\big[\Delta_{f_1,\delta_{f_1}}(x)\Delta_{f_1,\delta_{f_1}}(x)^\top\big]^\dagger,\; \frac{\|\Delta_{f_2,\delta_{f_1}}\|_{\mathcal{D}}}{\|\Delta_{f_1,\delta_{f_1}}\|_{\mathcal{D}}}\right)$.

We note that $A$ is the best linear map aligning the two deviations ($\Delta_{f_2,\delta_{f_1}}$ and $\Delta_{f_1,\delta_{f_1}}$) in function space. It serves as a guess, using only information from adversarial attacks, at the best linear map aligning $f_1$ and $f_2$. To give a better sense of $\tau_2$ and its relationship with the other quantities, we present a visual illustration in Figure 1. Note that high $\tau_2$ does not require $\Delta_{f_1,\delta_{f_1}}$ and $\Delta_{f_2,\delta_{f_1}}$ to be similar, only that they can be well aligned by the constant linear transformation $A$. We refer to the proof of Proposition 1 in Appendix B for a detailed explanation of $\tau_2$.

Proposition 1. Both $\tau_1$ and $\tau_2$ are in $[0,1]$.

In subsection 3.2, we provide our theoretical results. First, for better intuition, we show a special case in which the theorems simplify, i.e., where $f_S$ and $f_T$ are both $\mathbb{R}^n\to\mathbb{R}$. Then we present the general case where the outputs of $f_S$ and $f_T$ are multi-dimensional.
Note that their output dimensions are not necessarily the same.
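The alignment matrix $A$ and the resulting $\tau_2$ of Definition 3 can be estimated from samples, replacing expectations with Monte-Carlo averages over paired deviations. A numpy sketch under that assumption; the random deviation samples are stand-ins for real model evaluations:

```python
import numpy as np

def proj(A, r):
    """Project a matrix onto the spectral-norm ball of radius r."""
    s = np.linalg.norm(A, 2)
    return A if s <= r else (r / s) * A

def tau2(D1, D2):
    """Estimate tau2 from paired deviation samples.

    D1: (N, m) rows of Delta_{f1,delta_{f1}}(x_i);
    D2: (N, d) rows of Delta_{f2,delta_{f1}}(x_i).
    """
    N = D1.shape[0]
    C21 = D2.T @ D1 / N                      # E[Delta_f2 Delta_f1^T]
    C11 = D1.T @ D1 / N                      # E[Delta_f1 Delta_f1^T]
    B = C21 @ np.linalg.pinv(C11)            # least-squares alignment map
    r = np.sqrt((D2 ** 2).sum(1).mean() / (D1 ** 2).sum(1).mean())
    A = proj(B, r)
    AD1 = D1 @ A.T
    num = ((2 * D2 - AD1) * AD1).sum(1).mean()
    den = (D2 ** 2).sum(1).mean()
    return num / den

# Deviations of f2 that are an exact linear image of those of f1 score high;
# independent deviations score near zero.
rng = np.random.default_rng(1)
D1 = rng.normal(size=(500, 4))
M = rng.normal(size=(3, 4))
print(tau2(D1, D1 @ M.T))                    # high
print(tau2(D1, rng.normal(size=(500, 3))))   # near zero
```

The two printed values illustrate Definition 3's intent: $\tau_2$ rewards linear alignability of the deviations, not pointwise similarity.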

3.2. ADVERSARIAL TRANSFERABILITY INDICATES KNOWLEDGE TRANSFERABILITY

When $f_S$ and $f_T$ are both $\mathbb{R}^n\to\mathbb{R}$, $\tau_1$ and $\tau_2$ take a surprisingly elegant form, which gives further intuition about what they characterize. First, consider the attack in this case. Since $f$ has one-dimensional output, its gradient is a vector $\nabla f\in\mathbb{R}^n$, so
$\delta_{f,\epsilon}(x) = \arg\max_{\|\delta\|\le\epsilon} \|\nabla f(x)^\top \delta\|_2 = \epsilon\,\frac{\nabla f(x)}{\|\nabla f(x)\|_2}$
is simply the gradient with its scale normalized. Then $\tau_1$ becomes
$\tau_1(x) = \frac{\langle \nabla f_S(x),\, \nabla f_T(x)\rangle^2}{\|\nabla f_S(x)\|_2^2 \cdot \|\nabla f_T(x)\|_2^2}$,
the squared cosine of the angle between the two gradients. For $\tau_2$, the matrix $A$ degenerates to a scalar constant, which simplifies $\tau_2$ as well:
$A = \frac{\langle \Delta_{f_T,\delta_{f_S}},\, \Delta_{f_S,\delta_{f_S}}\rangle_{\mathcal{D}}}{\|\Delta_{f_S,\delta_{f_S}}\|_{\mathcal{D}}^2}, \qquad \tau_2^{f_S\to f_T} = \frac{\langle \Delta_{f_S,\delta_{f_S}},\, \Delta_{f_T,\delta_{f_S}}\rangle_{\mathcal{D}}^2}{\|\Delta_{f_S,\delta_{f_S}}\|_{\mathcal{D}}^2 \cdot \|\Delta_{f_T,\delta_{f_S}}\|_{\mathcal{D}}^2}$.
Interestingly, in this case $\tau_2$ has the same form as the first metric $\tau_1$. We simply write $\tau_2$ for $\tau_2^{f_S\to f_T}$ hereafter. Accordingly, when $f_S$ and $f_T$ are both $\mathbb{R}^n\to\mathbb{R}$, the result also takes an elegant form. In this case, adversarial attacks reflect all the information in the gradients of the two models, enabling $\tau_1$ and $\tau_2$ to encode everything needed to prove the following theorem.

Theorem 1. For two functions $f_S, f_T: \mathbb{R}^n\to\mathbb{R}$, there is an affine function $g:\mathbb{R}\to\mathbb{R}$ such that
$\|\nabla f_T - \nabla(g\circ f_S)\|_{\mathcal{D}}^2 = \mathbb{E}_{x\sim\mathcal{D}}\big[(1-\tau_1(x)\tau_2)\,\|\nabla f_T(x)\|_2^2\big]$,
where $g(x) = Ax + \mathrm{Const}$. Moreover, if we additionally assume that $f_T$ is $L$-Lipschitz continuous, i.e., $\|\nabla f_T(x)\|_2 \le L$ for all $x\in\mathrm{supp}(\mathcal{D})$, we have the simpler statement
$\|\nabla f_T - \nabla(g\circ f_S)\|_{\mathcal{D}}^2 \le (1-\tau_1\tau_2)\,L^2$.

The theorem suggests that if adversarial transferability is high, there exists an affine transformation with bounded norm such that $g\circ f_S$ is close to $f_T$.
As intuition for the proof: the difference between the two gradients can be decomposed into the angle between them, characterized by $\tau_1$, and the difference in their norms, characterized by $\tau_2$.

For the general case, we consider output dimensions that are multi-dimensional and not necessarily equal. In this scenario, adversarial attacks correspond to the largest singular value of the Jacobian matrix, so we introduce the following definition to capture the information that adversarial attacks do not reveal.

Definition 4 (Singular Value Ratio). For any function $f$, the singular value ratio of the function gradient at $x$ is defined as $\lambda_f(x) = \frac{\sigma_2(x)}{\sigma_1(x)}$, where $\sigma_1(x)$ and $\sigma_2(x)$ are the largest and second largest singular values of $\nabla f(x)$, respectively. In addition, we define the worst-case singular value ratio as $\lambda_f = \max_{x\in\mathrm{supp}(\mathcal{D})} \lambda_f(x)$.

Theorem 2. For two functions $f_S:\mathbb{R}^n\to\mathbb{R}^m$ and $f_T:\mathbb{R}^n\to\mathbb{R}^d$, assuming that $f_T$ is $L$-Lipschitz continuous, i.e., $\|\nabla f_T(x)\|_2\le L$ for all $x\in\mathrm{supp}(\mathcal{D})$, there is an affine function $g:\mathbb{R}^m\to\mathbb{R}^d$ such that
$\|\nabla f_T - \nabla(g\circ f_S)\|_{\mathcal{D}}^2 \le \Big[(1-\tau_1\tau_2) + (1-\tau_1)(1-\tau_2)\lambda_{f_T}^2 + (\lambda_{f_T}+\lambda_{f_S})^2\Big]\, 5L^2$,
where $g$ is defined as $g(z) = Az + \mathrm{Const}$. This theorem also has a version with a tighter bound that does not assume Lipschitz continuity; the full statement is provided in the appendix.

Theorem 2 suggests that large $\tau_1$ and $\tau_2$ indicate a potentially small difference between the gradients of the target model and the transferred model. Intuitively, given the right constant shift, a minimal difference in gradients implies a minimal difference in function values, which should result in bounded loss. Indeed, assuming $\beta$-smoothness of both functions, we prove in Theorem 3 that the squared loss of the transferred model $g\circ f_S$ is bounded in terms of the loss of $f_T$ and their gradient difference.
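The singular value ratio of Definition 4 is directly computable from the Jacobian. A small numpy illustration; the example matrices are arbitrary, chosen only to show the two extremes:

```python
import numpy as np

def singular_value_ratio(J):
    """lambda_f(x) = sigma_2 / sigma_1 of the Jacobian at x."""
    s = np.linalg.svd(J, compute_uv=False)
    return s[1] / s[0] if len(s) > 1 else 0.0

# A near-rank-one Jacobian has a small ratio: the attack direction (top
# singular vector) then captures most of the gradient information.
u, v = np.array([[1.0], [2.0]]), np.array([[1.0, 0.0, -1.0]])
J = u @ v + 1e-3 * np.ones((2, 3))
print(singular_value_ratio(J))           # small, close to 0

# An isotropic Jacobian has ratio 1: the attack reveals little.
print(singular_value_ratio(np.eye(3)))   # 1.0
```

A small $\lambda_f$ means the Jacobian is close to rank one, which is exactly the regime where a single attack direction carries most of the gradient information and Theorem 2's extra terms stay small.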
Definition 5 ($\beta$-smoothness). A function $f$ is $\beta$-smooth if for all $x, y$, $\|\nabla f(x) - \nabla f(y)\|_2 \le \beta\,\|x-y\|_2$.

For the target data distribution $\mathcal{D}$ and its ground-truth labeling function $y$, the mean squared loss of the transferred model is $\mathbb{E}_{x\sim\mathcal{D}}\|g\circ f_S(x) - y(x)\|_2^2 = \|g\circ f_S - y\|_{\mathcal{D}}^2$. The following theorem presents an upper bound on this loss.

Theorem 3. Without loss of generality, assume $\|x\|_2\le 1$ for all $x\in\mathrm{supp}(\mathcal{D})$. Consider functions $f_S:\mathbb{R}^n\to\mathbb{R}^m$, $f_T:\mathbb{R}^n\to\mathbb{R}^d$, and an affine function $g:\mathbb{R}^m\to\mathbb{R}^d$ as suggested by Theorem 1 or Theorem 2, with the constant chosen so that $g(f_S(0)) = f_T(0)$. If both $f_T$ and $f_S$ are $\beta$-smooth, then
$\|g\circ f_S - y\|_{\mathcal{D}}^2 \le \left[\|f_T - y\|_{\mathcal{D}} + \|\nabla f_T - \nabla(g\circ f_S)\|_{\mathcal{D}} + \left(1 + \frac{\|\nabla f_T\|_{\mathcal{D},2}}{\|\nabla f_S\|_{\mathcal{D},2}}\right)\frac{\beta}{2}\right]^2$.

3.3. PRACTICAL MEASUREMENT OF ADVERSARIAL TRANSFERABILITY

Existing studies have shown that similar models share high adversarial transferability (Liu et al., 2016; Papernot et al., 2016; Tramèr et al., 2017). In previous work, it is common to use the cross adversarial loss as an indication of adversarial transferability, e.g., the loss of $f_T$ under attacks generated against $f_S$: intuitively, the higher the cross adversarial loss, the higher the adversarial transferability. However, this measure has a drawback compared with the $\tau_1, \tau_2$ defined in this work.

Definition 6 (Cross Adversarial Loss). Given a loss function $\ell_T(\cdot, y)$ on the target domain, where $y$ is the ground truth, the adversarial loss of $f_T$ under attack $\delta_{f_S}$ generated against the source model $f_S$ is
$L_{adv}(f_T, \delta_{f_S};\, y, \mathcal{D}) = \mathbb{E}_{x\sim\mathcal{D}}\,\ell_T\big(f_T(x + \delta_{f_S}(x)),\, y(x)\big)$.

The cross adversarial loss depends on the choice of loss function, the output dimension, etc. Thus, it can be incomparable when testing adversarial transferability across different $f_T$, whereas $\tau_1, \tau_2$ always lie in $[0,1]$. To investigate the relationship between the adversarial loss and the adversarial transferability we defined, the following proposition shows that the cross adversarial loss behaves similarly to $\tau_1$.

Proposition 2. If $\ell_T$ is the mean squared loss and $f_T$ achieves zero loss on $\mathcal{D}$, then the adversarial loss in Definition 6 is approximately bounded as
$L_{adv}(f_T, \delta_{f_S};\, y, \mathcal{D}) \ge \epsilon^2\, \mathbb{E}_{x\sim\mathcal{D}}\big[\tau_1(x)\,\|\nabla f_T(x)\|_2^2\big] + O(\epsilon^3)$,
$L_{adv}(f_T, \delta_{f_S};\, y, \mathcal{D}) \le \epsilon^2\, \mathbb{E}_{x\sim\mathcal{D}}\big[\big(\lambda_{f_T}^2 + (1-\lambda_{f_T}^2)\,\tau_1(x)\big)\,\|\nabla f_T(x)\|_2^2\big] + O(\epsilon^3)$,
where $O(\epsilon^3)$ denotes a cubic error term.

In the next section, we verify these theoretical predictions through thorough experiments.
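Definition 6 can be estimated by attacking the source model and evaluating the target model's squared loss at the perturbed inputs. A scalar-output numpy sketch using the linearized attack from Definition 1; the linear models are toy stand-ins, not the paper's networks:

```python
import numpy as np

def grad(f, x, h=1e-5):
    """Central-difference gradient of a scalar-valued f."""
    return np.array([(f(x + h * e) - f(x - h * e)) / (2 * h)
                     for e in np.eye(x.size)])

def cross_adv_loss(f_T, f_S, y, X, eps=0.1):
    """L_adv: squared loss of f_T under attacks generated against f_S."""
    losses = []
    for x in X:
        g = grad(f_S, x)
        delta = eps * g / np.linalg.norm(g)   # linearized attack on the source
        losses.append((f_T(x + delta) - y(x)) ** 2)
    return np.mean(losses)

# When the source IS the target (gradients perfectly aligned), the cross loss
# hits the self-attack value eps^2 * ||w||^2; a mismatched source attacks worse.
w = np.array([1.0, -2.0, 0.5])
f_T = lambda x: w @ x
y = f_T                                       # f_T has zero clean loss
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
print(cross_adv_loss(f_T, f_T, y, X))               # self attack
print(cross_adv_loss(f_T, lambda x: x.sum(), y, X)) # weaker cross attack
```

For these linear models the self-attack value equals $\epsilon^2\|w\|_2^2$, matching Proposition 2 with $\tau_1 = 1$ and $\lambda_{f_T} = 0$.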

4. EXPERIMENTAL EVALUATION

We empirically evaluate the relationship between adversarial transferability and knowledge transferability in four sets of experiments. We first present a synthetic experiment that verifies our theoretical study, and then our empirical studies on real-world datasets with models widely used in practice, covering three knowledge transfer scenarios: knowledge transfer among data distributions, attributes, and tasks. The three scenarios are elaborated below; all training details are deferred to the Appendix.

Knowledge transfer among data distributions is the most common transfer learning setting: the knowledge a model gains on one data domain is transferred to other domains. For instance, Shie et al. (2015) use pretrained ImageNet representations to achieve state-of-the-art accuracy in medical data analysis. The relation between adversarial and knowledge transferability can not only help determine the best pretrained model to use, but also detect distribution shifts, which is crucial for learning agents deployed in a continual setting (Diethe et al., 2019).

Knowledge transfer among attributes is a popular method for zero-shot and few-shot learning (Jayaraman & Grauman, 2014; Romera-Paredes & Torr, 2015): the knowledge learned from the attributes of the source problem is transferred to a new target problem (Russakovsky & Fei-Fei, 2010). The relation between adversarial and knowledge transferability can be used to probe deployed classification models and verify which attributes their decisions are based on, with profound implications for fairness and interpretability.

Knowledge transfer among tasks is widely applied across various vision tasks, such as super-resolution (Johnson et al., 2016), style transfer (Gatys et al., 2016), and semantic and instance segmentation (Girshick, 2015; He et al., 2017; Long et al., 2015a).
It transfers the knowledge a model gains by learning one task to another, novel task. As in many recent works (Achille et al., 2019; Standley et al., 2019; Zamir et al., 2018), the relation between adversarial and knowledge transferability can be used to chart the affinity map between tasks and guide potential transfers.

4.1. SYNTHETIC EXPERIMENT ON RADIAL BASIS FUNCTIONS REGRESSION

In the synthetic experiment, we compute quantities that would otherwise be inefficient to compute, in order to verify our theoretical results, and we vary the settings to see how other factors affect the results. Details follow.

Models. Both the source model $f_S$ and the target model $f_T$ are one-hidden-layer neural networks with sigmoid activation.

Overall steps. First, sample $D = \{(x_i, y_i)\}_{i=1}^N$ from a distribution (details below), where $x$ is $n$-dimensional, $y$ is $d$-dimensional, and there are $N$ samples. Then we train a target model $f_T$ on $D$. Denoting the weights of $f_T$ by $W$, we randomly sample a direction $V$, where each entry of $V$ is drawn from $U(-0.5, 0.5)$, and choose a scale $t\in[0,1]$. To derive the source model, we perturb the target model as $W' := W + tV$ and define $f_S$ to be the one-hidden-layer neural network with weights $W'$. We then compute each quantity of interest, including $\tau_1$, $\tau_2$, the cross adversarial loss (Definition 6), and the upper bound in Theorem 2 on the gradient difference. Note that we report the cross adversarial loss normalized by the self-attack adversarial loss, defined as
$\alpha = \|\Delta_{f_T,\delta_{f_S}}\|_{\mathcal{D}}^2 \,/\, \|\Delta_{f_T,\delta_{f_T}}\|_{\mathcal{D}}^2 \approx L_{adv}(f_T, \delta_{f_S};\, y, \mathcal{D}) / L_{adv}(f_T, \delta_{f_T};\, y, \mathcal{D})$
when $f_T$ achieves low error; note that $\alpha\in[0,1]$. Finally, we fine-tune the last layer of $f_S$ and obtain the true transferred loss.
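The perturb-then-refit pipeline above can be imitated in a few lines of numpy, assuming closed-form least-squares fitting of the last layer in place of gradient training; the architecture sizes and data are illustrative, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

n, hidden, d, N = 10, 32, 4, 500
X = rng.normal(size=(N, n))
W_true = rng.normal(size=(hidden, n))
Y = sigmoid(X @ W_true.T) @ rng.normal(size=(hidden, d))   # synthetic targets

# "Target model": fixed first layer W1, last layer fitted in closed form.
W1 = rng.normal(size=(hidden, n)) * 0.5
H_T = sigmoid(X @ W1.T)
W2, *_ = np.linalg.lstsq(H_T, Y, rcond=None)
loss_T = np.mean((H_T @ W2 - Y) ** 2)

def transfer_loss(t):
    """Perturb the first layer by t*V, refit only the last layer, report MSE."""
    V = rng.uniform(-0.5, 0.5, size=W1.shape)
    H_S = sigmoid(X @ (W1 + t * V).T)                # source features
    W2_ft, *_ = np.linalg.lstsq(H_S, Y, rcond=None)
    return np.mean((H_S @ W2_ft - Y) ** 2)

# Larger perturbations of the source weights tend to yield higher transferred loss.
for t in (0.0, 0.5, 1.0):
    print(t, transfer_loss(t))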

Dataset. Denote a radial basis function by $\phi_i(x) = e^{-\|x-\mu_i\|_2^2/(2\sigma_i^2)}$, and set the target ground-truth function to be the sum of $M = 100$ basis functions, $f = \sum_{i=1}^M \phi_i$, where each entry of the parameters is sampled once from $U(-0.5, 0.5)$. We set the dimension of $x$ to 30 and the dimension of $y$ to 10. We generate $N = 1200$ samples of $x$ from a Gaussian mixture formed by three Gaussians with different centers but the same covariance matrix $\Sigma = I$; the centers are sampled randomly from $U(-0.5, 0.5)^n$. We use the ground-truth regressor $f$ to derive the corresponding $y$ for each $x$. That is, we want our neural networks to approximate $f$ on the Gaussian mixture.

Results. We present two sets of experiments in Figure 2. Correlations between the adversarial transferability measures ($\tau_1$, $\tau_2$, $\alpha$) and knowledge transferability (the transferred loss) are observed. The upper bound on the gradient difference (Theorem 2) closely tracks its true value. Although the absolute value of the upper bound on the transferred loss (Theorem 3) can be large compared to the true transferred loss, their trends are similar; the large gap in absolute value is due to the use of $\beta$-smoothness, which considers the worst case. It is also observed that $\tau_1$ tracks the normalized cross adversarial loss $\alpha$, as Proposition 2 suggests.

4.2. ADVERSARIAL TRANSFERABILITY INDICATING KNOWLEDGE-TRANSFER AMONG DATA DISTRIBUTIONS

In this experiment, we show that the closer the source data distribution is to the target distribution, the more adversarially transferable the source model is to the reference model, and correspondingly the more knowledge-transferable the source model is to the target dataset. We demonstrate this in both the image and natural language domains.

Dataset. Image: 5 source datasets (5 source models) are constructed from CIFAR-10 (Krizhevsky, 2009) and a single target dataset (1 reference model) from STL-10 (Coates et al., 2011). Each source dataset consists of 4 classes from CIFAR-10, and the target dataset consists of 4 classes from STL-10.
Natural Language: We select 4 diverse natural language datasets: AG's News (AG), Fake News Detection (Fake), IMDB, and Yelp Polarity (Yelp). We pick IMDB as the target and the rest as sources.

Adversarial Transferability. Image: We take 1000 images (STL-10) from the target dataset and generate 1000 adversarial examples on each of the five source models, running 10-step PGD $L_\infty$ attacks with $\epsilon = 0.1$. We then measure the effectiveness of the adversarial examples by the cross-entropy loss on the reference model. Natural Language: We take 100 sample sentences from the target dataset (IMDB) and generate adversarial sentences on each of the source models (AG, Fake, Yelp) with TextFooler (Jin et al., 2019), constraining the ratio of changed words to at most 0.1. We then measure their adversarial transferability against the reference model (IMDB).

Knowledge Transferability. To measure knowledge transferability, we fine-tune a new linear layer on the target dataset to replace the last layer of each source model, producing the corresponding transferred models. We then measure the performance of the transferred models on the target dataset using standard accuracy and cross-entropy loss.

Results. From Figure 4.2, it is clear that the source model with the highest adversarial transferability yields the transferred model with the highest transferred accuracy. This phenomenon is prominent in both the image and natural language domains. The results in Figure 4.2 (b) also support the implication of our theory that $\tau_1$ alone is not sufficient to indicate knowledge transferability.
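The cross-attack measurement for the image setting can be sketched as follows, with linear softmax classifiers standing in for the paper's deep models; $\epsilon = 0.1$ and 10 PGD steps match the text, while everything else (model shapes, the step-size heuristic, random labels) is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

# Toy linear classifiers standing in for source/reference models.
n, k = 20, 4
W_src = rng.normal(size=(k, n))
W_ref = W_src + 0.1 * rng.normal(size=(k, n))    # a "similar" reference model
W_far = rng.normal(size=(k, n))                  # an unrelated reference model

def ce_loss(W, X, y):
    p = softmax(X @ W.T)
    return -np.mean(np.log(p[np.arange(len(y)), y] + 1e-12))

def ce_grad_x(W, x, yi):
    """Gradient of the cross-entropy w.r.t. the input, one example."""
    p = softmax(W @ x)
    return W.T @ (p - np.eye(len(p))[yi])

def pgd_linf(W, X, y, eps=0.1, steps=10):
    """10-step L_inf PGD against model W (sign-gradient ascent, clipped)."""
    alpha = 2.5 * eps / steps
    X_adv = X.copy()
    for _ in range(steps):
        G = np.stack([ce_grad_x(W, x, yi) for x, yi in zip(X_adv, y)])
        X_adv = np.clip(X_adv + alpha * np.sign(G), X - eps, X + eps)
    return X_adv

X = rng.normal(size=(64, n))
y = rng.integers(0, k, size=64)
X_adv = pgd_linf(W_src, X, y)

# Cross-entropy of each reference model on examples crafted against the source:
print(ce_loss(W_ref, X_adv, y), ce_loss(W_far, X_adv, y))
```

The similar reference model suffers a clear loss increase under the transferred attack, which is the quantity the experiment reads off as adversarial transferability.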

4.3. ADVERSARIAL TRANSFERABILITY INDICATING KNOWLEDGE-TRANSFER AMONG ATTRIBUTES

In addition to data distributions, we validate our theory along another dimension: attributes. This experiment suggests that the more adversarially transferable a source model trained on certain attributes is to the reference model, the better the model performs on the target task of learning the target attributes.

Dataset. CelebA (Liu et al., 2018) and 7 facial recognition benchmarks (Cao et al., 2018). We report the average classification accuracy on the target datasets.

Results. In Table 1, we list the top-5 attribute source models with the highest adversarial transferability and the performance of their transferred models on the 7 target facial recognition benchmarks. We observe that the attribute "Young" has the highest adversarial transferability; correspondingly, it also achieves the highest average classification performance across the 7 benchmarks.

4.4. ADVERSARIAL TRANSFERABILITY INDICATING KNOWLEDGE-TRANSFER AMONG TASKS

In this experiment, we show that adversarial transferability can also indicate knowledge transferability among different machine learning tasks. Zamir et al. (2018) show that models trained on different tasks can transfer well to other tasks, especially when the tasks belong to the same "category". Here we leverage the same dataset and pick 15 single-image tasks from the task pool, including Autoencoding, 2D Segmentation, 3D Keypoint detection, etc. Intuitively, these tasks can be grouped into 3 categories: semantic tasks, 2D tasks, and 3D tasks. Leveraging the tasks within the same category, which would hypothetically have higher adversarial transferability, we evaluate the corresponding knowledge transferability.

Dataset. The Taskonomy data consist of 4 million images of indoor scenes from about 600 buildings, each annotated for every task in the pool. We use a public subset of these images to validate our theory.

Adversarial Transferability. An Adversarial Transferability Matrix (ATM), modified from the Affinity Matrix of Zamir et al. (2018), is used here to measure the adversarial transferability between multiple tasks.
To generate the corresponding "task categories" for comparison, we sample 1000 images from the public dataset and perform a virtual adversarial attack on each of the 15 source models. Adversarial perturbations with $L_\infty$ norm $\epsilon \in \{0.03, 0.06\}$ are used, and we run a 10-step PGD-based attack for efficiency. Detailed settings for adversarial transferability are deferred to the Appendix.

Knowledge Transferability. We use the provided affinity scores, a $15\times 15$ affinity matrix, to compute the task categories: we take the columns of this matrix as features for each task and perform agglomerative clustering to obtain the task similarity tree.

Results. Figure 4 compares the task-category predictions generated from adversarial transferability and from knowledge transferability in Taskonomy. Three intuitive categories form in both cases, i.e., 2D, 3D, and semantic tasks. To quantify the similarity, we also compute the average inner-category entropy of the adversarial-transferability-based categories, with the Taskonomy categories as ground truth (lower entropy indicates higher correlation between adversarial and knowledge transferability). In Figure 5 (Appendix), the adversarial-transferability-based category prediction shows low entropy when the number of categories is greater than or equal to 3, indicating that adversarial transferability is faithful to the category prediction in Taskonomy. This result shows a strong positive correlation between adversarial and knowledge transferability among learning tasks in terms of predicting similar task categories.
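The clustering step can be reproduced with scipy's hierarchical clustering on the columns of an affinity matrix; here a $6\times 6$ toy matrix with an obvious two-block structure stands in for the paper's $15\times 15$ ATM:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy transferability matrix: two groups of three mutually-similar tasks.
A = np.full((6, 6), 0.1)
A[:3, :3] = 0.9
A[3:, 3:] = 0.9

# Columns as per-task features, then agglomerative (hierarchical) clustering.
Z = linkage(A.T, method="average", metric="euclidean")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # tasks 0-2 share one label, tasks 3-5 the other
```

Cutting the resulting tree at 3 clusters on the real ATM is what yields the 2D/3D/semantic grouping described above.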

5. CONCLUSION

We theoretically analyze the relationship between adversarial transferability and knowledge transferability, and provide thorough experimental justification in diverse scenarios.

A. DISCUSSION ABOUT VALIDITY OF THE NOTATION

Before proving our theory, we must show that our mathematical tools are indeed valid. It is easy to verify that $\langle\cdot,\cdot\rangle_{\mathcal{D}}$ is a valid inner product inherited from the standard Euclidean inner product; therefore, the norm $\|\cdot\|_{\mathcal{D}}$ induced by it is also valid. What does not follow directly is the validity of the norm $\|\cdot\|_{\mathcal{D},2}$, in particular whether it satisfies the triangle inequality. Recall that for a matrix-valued function $F:\mathrm{supp}(\mathcal{D})\to\mathbb{R}^{d\times m}$, its $L^2(\mathcal{D})$-norm in accordance with the matrix 2-norm is defined as $\|F\|_{\mathcal{D},2} = \sqrt{\mathbb{E}_{x\sim\mathcal{D}}\|F(x)\|_2^2}$. For two functions $F, G:\mathrm{supp}(\mathcal{D})\to\mathbb{R}^{d\times m}$, we verify the triangle inequality as follows. Applying the triangle inequality of the spectral norm, it holds that
$\|F+G\|_{\mathcal{D},2}^2 = \mathbb{E}_{x\sim\mathcal{D}}\|F(x)+G(x)\|_2^2 \le \mathbb{E}_{x\sim\mathcal{D}}\big(\|F(x)\|_2 + \|G(x)\|_2\big)^2 = \|F\|_{\mathcal{D},2}^2 + \|G\|_{\mathcal{D},2}^2 + 2\,\mathbb{E}_{x\sim\mathcal{D}}\|F(x)\|_2\|G(x)\|_2. \quad (2)$
Applying the Cauchy-Schwarz inequality,
$\mathbb{E}_{x\sim\mathcal{D}}\|F(x)\|_2\|G(x)\|_2 \le \sqrt{\mathbb{E}_{x\sim\mathcal{D}}\|F(x)\|_2^2}\cdot\sqrt{\mathbb{E}_{x\sim\mathcal{D}}\|G(x)\|_2^2} = \|F\|_{\mathcal{D},2}\cdot\|G\|_{\mathcal{D},2}$.
Plugging this into (2) completes the proof:
$\|F+G\|_{\mathcal{D},2}^2 \le \|F\|_{\mathcal{D},2}^2 + \|G\|_{\mathcal{D},2}^2 + 2\,\|F\|_{\mathcal{D},2}\|G\|_{\mathcal{D},2} = \big(\|F\|_{\mathcal{D},2} + \|G\|_{\mathcal{D},2}\big)^2$.
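The triangle inequality for $\|\cdot\|_{\mathcal{D},2}$ can be sanity-checked numerically by replacing the expectation with a sample mean over random matrix-valued evaluations; the sampled matrices are arbitrary stand-ins for $F(x)$ and $G(x)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def norm_D2(Fs):
    """||F||_{D,2} = sqrt(E_x ||F(x)||_2^2), estimated over sampled matrices."""
    return np.sqrt(np.mean([np.linalg.norm(F, 2) ** 2 for F in Fs]))

# Sample matrix-valued "functions" at N points and check the triangle inequality.
N, d, m = 200, 4, 6
Fs = rng.normal(size=(N, d, m))
Gs = rng.normal(size=(N, d, m))
lhs = norm_D2(Fs + Gs)
rhs = norm_D2(Fs) + norm_D2(Gs)
print(lhs <= rhs)   # True
```

The inequality holds for every sample, as the proof above guarantees; the numerical check simply exercises the same spectral-norm and Cauchy-Schwarz steps.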

B PROOF OF PROPOSITION 1

Proposition 1 (Restated). Both $\tau_1$ and $\tau_2$ are in $[0, 1]$.

Proof. We are to prove that $\tau_1$ and $\tau_2$ are both in the range $[0,1]$. As $\tau_1$ is a squared cosine, it is trivial that $\tau_1\in[0,1]$; we therefore focus on $\tau_2$ in the following. Recall that $\tau_2$ from $f_1$ to $f_2$ is defined as
$$\tau_2^{f_1\to f_2} = \frac{\big\langle 2\Delta_{f_2,\delta_{f_1}} - A\Delta_{f_1,\delta_{f_1}},\ A\Delta_{f_1,\delta_{f_1}}\big\rangle_D}{\|\Delta_{f_2,\delta_{f_1}}\|_D^2},$$
where $A$ is a constant matrix defined as
$$A = \mathrm{proj}\Big(\mathbb{E}_{x\sim D}\big[\Delta_{f_2,\delta_{f_1}}(x)\Delta_{f_1,\delta_{f_1}}(x)^\top\big]\,\mathbb{E}_{x\sim D}\big[\Delta_{f_1,\delta_{f_1}}(x)\Delta_{f_1,\delta_{f_1}}(x)^\top\big]^\dagger,\ \frac{\|\Delta_{f_2,\delta_{f_1}}\|_D}{\|\Delta_{f_1,\delta_{f_1}}\|_D}\Big).$$
For notational convenience, we simply write $\tau_2$ for $\tau_2^{f_1\to f_2}$ in this proof. $\tau_2$ characterizes how similar the changes in the function values of $f_1:\mathbb{R}^n\to\mathbb{R}^m$ and $f_2:\mathbb{R}^n\to\mathbb{R}^d$ are, in the sense of being linearly transformable, given an attack generated on $f_1$. That is, it is associated with the function
$$h(B) = \|\Delta_{f_2,\delta_{f_1}} - B\Delta_{f_1,\delta_{f_1}}\|_D^2 = \mathbb{E}_{x\sim D}\|\Delta_{f_2,\delta_{f_1}}(x) - B\Delta_{f_1,\delta_{f_1}}(x)\|_2^2,$$
where $\Delta_{f_1,\delta_{f_1}}\in\mathbb{R}^m$, $\Delta_{f_2,\delta_{f_1}}\in\mathbb{R}^d$, and $B\in\mathbb{R}^{d\times m}$. As $\|\Delta_{f_2,\delta_{f_1}}(x) - B\Delta_{f_1,\delta_{f_1}}(x)\|_2^2$ is convex with respect to $B$, its expectation $h(B)$ is also convex. Therefore, $h(B)$ achieves its global minimum when $\frac{\partial h}{\partial B} = 0$. We compute
$$\frac{\partial h}{\partial B} = \mathbb{E}_{x\sim D}\frac{\partial}{\partial B}\big\|\Delta_{f_2,\delta_{f_1}}(x) - B\Delta_{f_1,\delta_{f_1}}(x)\big\|_2^2 = 2\,\mathbb{E}_{x\sim D}\big(B\Delta_{f_1,\delta_{f_1}}(x) - \Delta_{f_2,\delta_{f_1}}(x)\big)\Delta_{f_1,\delta_{f_1}}(x)^\top$$
$$= 2B\,\mathbb{E}_{x\sim D}\Delta_{f_1,\delta_{f_1}}(x)\Delta_{f_1,\delta_{f_1}}(x)^\top - 2\,\mathbb{E}_{x\sim D}\Delta_{f_2,\delta_{f_1}}(x)\Delta_{f_1,\delta_{f_1}}(x)^\top.$$
Setting $\frac{\partial h}{\partial B} = 0$ and denoting the solution as $B^*$, we have
$$B^* = \mathbb{E}_{x\sim D}\big[\Delta_{f_2,\delta_{f_1}}(x)\Delta_{f_1,\delta_{f_1}}(x)^\top\big]\,\mathbb{E}_{x\sim D}\big[\Delta_{f_1,\delta_{f_1}}(x)\Delta_{f_1,\delta_{f_1}}(x)^\top\big]^\dagger.$$
Noting that $A = \mathrm{proj}\big(B^*,\ \|\Delta_{f_2,\delta_{f_1}}\|_D/\|\Delta_{f_1,\delta_{f_1}}\|_D\big)$ is a scaled $B^*$, we write $A = \psi B^*$, where $\psi$ is a scaling factor depending on $B^*$ and $\|\Delta_{f_2,\delta_{f_1}}\|_D/\|\Delta_{f_1,\delta_{f_1}}\|_D$. According to the definition of the projection operator, we can see that $0 < \psi \le 1$.
Replacing $B$ by $A$, we have
$$h(A) = \|\Delta_{f_2,\delta_{f_1}} - A\Delta_{f_1,\delta_{f_1}}\|_D^2 = \big\langle\Delta_{f_2,\delta_{f_1}} - A\Delta_{f_1,\delta_{f_1}},\ \Delta_{f_2,\delta_{f_1}} - A\Delta_{f_1,\delta_{f_1}}\big\rangle_D = \|\Delta_{f_2,\delta_{f_1}}\|_D^2 - \big\langle 2\Delta_{f_2,\delta_{f_1}} - A\Delta_{f_1,\delta_{f_1}},\ A\Delta_{f_1,\delta_{f_1}}\big\rangle_D = (1-\tau_2)\|\Delta_{f_2,\delta_{f_1}}\|_D^2.$$

It is obvious that

$$h(A) = \|\Delta_{f_2,\delta_{f_1}} - A\Delta_{f_1,\delta_{f_1}}\|_D^2 \ge 0,$$
thus we have $\tau_2 \le 1$. For the lower bound on $\tau_2$, we will need properties of $B^*$. Denoting by $O$ the all-zero matrix, it holds that
$$h(B^*) = \min_B h(B) \le h(O). \quad (3)$$
For $A = \psi B^*$, according to the convexity of $h(\cdot)$ and the fact that $\psi\in(0,1]$, we can see that
$$h(A) = h(\psi B^*) = h\big(\psi B^* + (1-\psi)O\big) \le \psi\,h(B^*) + (1-\psi)\,h(O).$$
Applying (3) to the above, we can see that $h(A) \le h(O)$. Noting that $h(A) = (1-\tau_2)\|\Delta_{f_2,\delta_{f_1}}\|_D^2$ and $h(O) = \|\Delta_{f_2,\delta_{f_1}}\|_D^2$, the above inequality gives
$$(1-\tau_2)\|\Delta_{f_2,\delta_{f_1}}\|_D^2 \le \|\Delta_{f_2,\delta_{f_1}}\|_D^2, \quad\text{i.e.,}\quad 0 \le \tau_2.$$
Therefore, $\tau_2$ is upper bounded by 1 and lower bounded by 0. ∎
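The least-squares structure of $B^*$ and the resulting bounds on $\tau_2$ can be checked numerically. The sketch below is ours; it sets the projection scale $\psi = 1$ for illustration, so $A = B^*$. It computes $B^*$ in closed form from sampled responses $\Delta_{f_1}, \Delta_{f_2}$, and asserts both that $\tau_2 \in [0,1]$ and that $B^*$ is a global minimizer of $h$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, m, d = 200, 3, 2

# rows: per-sample responses Delta_{f1}(x) in R^m and Delta_{f2}(x) in R^d
D1 = rng.standard_normal((N, m))
D2 = D1 @ rng.standard_normal((m, d)) + 0.3 * rng.standard_normal((N, d))

# closed-form minimizer B* = E[D2 D1^T] (E[D1 D1^T])^+
Bstar = (D2.T @ D1 / N) @ np.linalg.pinv(D1.T @ D1 / N)

def h(B):
    # h(B) = E_x ||Delta_{f2}(x) - B Delta_{f1}(x)||_2^2
    R = D2 - D1 @ B.T
    return np.mean(np.sum(R ** 2, axis=1))

# tau_2 with A = B* (psi = 1): h(A) = (1 - tau_2) ||Delta_{f2}||_D^2
tau2 = 1.0 - h(Bstar) / np.mean(np.sum(D2 ** 2, axis=1))
assert -1e-12 <= tau2 <= 1.0 + 1e-12
# B* is the global minimizer: a perturbed matrix cannot do better
assert h(Bstar) <= h(Bstar + 0.1 * rng.standard_normal((d, m))) + 1e-12
```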

C PROOF OF THEOREM 1

Before actually proving the theorem, let us examine what $\tau_1$ and $\tau_2$ are in the case where $f_S$ and $f_T$ are both $\mathbb{R}^n\to\mathbb{R}$. In this case, both metrics come out in an elegant form, which gives further intuition about what $\tau_1$ and $\tau_2$ characterize. First, consider the attack in this case. As the function $f$ has one-dimensional output, its gradient is a vector $\nabla f\in\mathbb{R}^n$. Thus,
$$\delta_{f,\epsilon}(x) = \arg\max_{\|\delta\|\le\epsilon}\|\nabla f(x)^\top\delta\|_2 = \epsilon\,\frac{\nabla f(x)}{\|\nabla f(x)\|_2}.$$
Then $\tau_1$ becomes
$$\tau_1(x) = \frac{\langle\nabla f_S(x), \nabla f_T(x)\rangle^2}{\|\nabla f_S(x)\|_2^2\cdot\|\nabla f_T(x)\|_2^2},$$
which is the squared cosine of the angle between the two gradients. For $\tau_2$, the matrix $A$ degenerates to a scalar constant
$$A = \frac{\langle\Delta_{f_T,\delta_{f_S}},\ \Delta_{f_S,\delta_{f_S}}\rangle_D}{\|\Delta_{f_S,\delta_{f_S}}\|_D^2},$$
and the second metric becomes
$$\tau_2^{f_S\to f_T} = \frac{\langle\Delta_{f_S,\delta_{f_S}},\ \Delta_{f_T,\delta_{f_S}}\rangle_D^2}{\|\Delta_{f_S,\delta_{f_S}}\|_D^2\cdot\|\Delta_{f_T,\delta_{f_S}}\|_D^2}.$$
Interestingly, this is of the same form as the first metric $\tau_1$. We simply write $\tau_2$ for $\tau_2^{f_S\to f_T}$ afterwards.

Theorem 1 (Restated). For two functions $f_S$ and $f_T$ that are both $\mathbb{R}^n\to\mathbb{R}$, there is an affine function $g:\mathbb{R}\to\mathbb{R}$ such that
$$\|\nabla f_T - \nabla(g\circ f_S)\|_D^2 = \mathbb{E}_{x\sim D}\big[(1-\tau_1(x)\tau_2)\|\nabla f_T(x)\|_2^2\big],$$
where $g$ is defined as $g(z) = Az + \mathrm{Const}$. Moreover, if we assume that $f_T$ is $L$-Lipschitz continuous, i.e., $\|\nabla f_T(x)\|_2 \le L$ for $\forall x\in\mathrm{supp}(D)$, we have the more elegant statement
$$\|\nabla f_T - \nabla(g\circ f_S)\|_D^2 \le (1-\tau_1\tau_2)L^2.$$

Proof. In the case where $g$ is a one-dimensional affine function, we write it as $g(z) = Az + b$, where $A$ is defined in the definition of $\tau_2$ (Definition 3). In this case, it enjoys the simple form $A = \langle\Delta_{f_T,\delta_{f_S}}, \Delta_{f_S,\delta_{f_S}}\rangle_D / \|\Delta_{f_S,\delta_{f_S}}\|_D^2$. Then we can see that
$$\|\nabla f_T - \nabla(g\circ f_S)\|_D^2 = \|\nabla f_T - A\nabla f_S\|_D^2 = \mathbb{E}_{x\sim D}\|\nabla f_T(x) - A\nabla f_S(x)\|_2^2. \quad (4)$$
To continue, we split $\nabla f_T$ into two terms, i.e., one in the direction of $\nabla f_S$ and one orthogonal to $\nabla f_S$.
Denoting by $\varphi(x)$ the angle between $\nabla f_T(x)$ and $\nabla f_S(x)$ in Euclidean space, we have
$$\nabla f_T(x) = \cos(\varphi(x))\frac{\|\nabla f_T(x)\|_2}{\|\nabla f_S(x)\|_2}\nabla f_S(x) + \Big(\nabla f_T(x) - \cos(\varphi(x))\frac{\|\nabla f_T(x)\|_2}{\|\nabla f_S(x)\|_2}\nabla f_S(x)\Big) = \cos(\varphi(x))\frac{\|\nabla f_T(x)\|_2}{\|\nabla f_S(x)\|_2}\nabla f_S(x) + v(x), \quad (5)$$
where we denote $v(x) = \nabla f_T(x) - \cos(\varphi(x))\frac{\|\nabla f_T(x)\|_2}{\|\nabla f_S(x)\|_2}\nabla f_S(x)$ for notational convenience. We can see that $v(x)$ is orthogonal to $\nabla f_S(x)$, thus
$$\|v(x)\|_2 = \sqrt{1-\cos^2(\varphi(x))}\,\|\nabla f_T(x)\|_2.$$
Recalling that $\tau_1(x) = \cos^2(\varphi(x))$, this can be written as $\|v(x)\|_2 = \sqrt{1-\tau_1(x)}\,\|\nabla f_T(x)\|_2$. Then, plugging (5) into (4), we have
$$(4) = \mathbb{E}_{x\sim D}\Big\|\cos(\varphi(x))\frac{\|\nabla f_T(x)\|_2}{\|\nabla f_S(x)\|_2}\nabla f_S(x) + v(x) - A\nabla f_S(x)\Big\|_2^2 = \mathbb{E}_{x\sim D}\Big\|\Big(\cos(\varphi(x))\frac{\|\nabla f_T(x)\|_2}{\|\nabla f_S(x)\|_2} - A\Big)\nabla f_S(x) + v(x)\Big\|_2^2$$
$$= \mathbb{E}_{x\sim D}\Big[\Big(\cos(\varphi(x))\frac{\|\nabla f_T(x)\|_2}{\|\nabla f_S(x)\|_2} - A\Big)^2\|\nabla f_S(x)\|_2^2 + \|v(x)\|_2^2\Big] = \mathbb{E}_{x\sim D}\Big(\cos(\varphi(x))\frac{\|\nabla f_T(x)\|_2}{\|\nabla f_S(x)\|_2} - A\Big)^2\|\nabla f_S(x)\|_2^2 + \mathbb{E}_{x\sim D}(1-\tau_1(x))\|\nabla f_T(x)\|_2^2. \quad (6)$$
Now let us deal with the first term by plugging in $A = \langle\Delta_{f_T,\delta_{f_S}}, \Delta_{f_S,\delta_{f_S}}\rangle_D / \|\Delta_{f_S,\delta_{f_S}}\|_D^2$, where $\Delta_{f_T,\delta_{f_S}}(x) = \epsilon\cos(\varphi(x))\|\nabla f_T(x)\|_2$ and $\Delta_{f_S,\delta_{f_S}}(x) = \epsilon\|\nabla f_S(x)\|_2$, and we have
$$\mathbb{E}_{x\sim D}\Big(\cos(\varphi(x))\frac{\|\nabla f_T(x)\|_2}{\|\nabla f_S(x)\|_2} - A\Big)^2\|\nabla f_S(x)\|_2^2 = \mathbb{E}_{x\sim D}\big(\cos(\varphi(x))\|\nabla f_T(x)\|_2 - A\|\nabla f_S(x)\|_2\big)^2 = \frac{1}{\epsilon^2}\,\mathbb{E}_{x\sim D}\big(\Delta_{f_T,\delta_{f_S}}(x) - A\Delta_{f_S,\delta_{f_S}}(x)\big)^2$$
$$= \frac{1}{\epsilon^2}\Big(\|\Delta_{f_T,\delta_{f_S}}\|_D^2 + A^2\|\Delta_{f_S,\delta_{f_S}}\|_D^2 - 2A\langle\Delta_{f_T,\delta_{f_S}}, \Delta_{f_S,\delta_{f_S}}\rangle_D\Big) = \frac{1}{\epsilon^2}\Big(\|\Delta_{f_T,\delta_{f_S}}\|_D^2 + \frac{\langle\Delta_{f_T,\delta_{f_S}}, \Delta_{f_S,\delta_{f_S}}\rangle_D^2}{\|\Delta_{f_S,\delta_{f_S}}\|_D^2} - 2\frac{\langle\Delta_{f_T,\delta_{f_S}}, \Delta_{f_S,\delta_{f_S}}\rangle_D^2}{\|\Delta_{f_S,\delta_{f_S}}\|_D^2}\Big)$$
$$= \frac{\|\Delta_{f_T,\delta_{f_S}}\|_D^2}{\epsilon^2}\Big(1 - \frac{\langle\Delta_{f_T,\delta_{f_S}}, \Delta_{f_S,\delta_{f_S}}\rangle_D^2}{\|\Delta_{f_S,\delta_{f_S}}\|_D^2\cdot\|\Delta_{f_T,\delta_{f_S}}\|_D^2}\Big) = (1-\tau_2)\,\mathbb{E}_{x\sim D}\cos^2(\varphi(x))\|\nabla f_T(x)\|_2^2 = (1-\tau_2)\,\mathbb{E}_{x\sim D}\tau_1(x)\|\nabla f_T(x)\|_2^2. \quad (7)$$
Plugging (7) into (6), we finally have
$$\|\nabla f_T - \nabla(g\circ f_S)\|_D^2 = (1-\tau_2)\,\mathbb{E}_{x\sim D}\tau_1(x)\|\nabla f_T(x)\|_2^2 + \mathbb{E}_{x\sim D}(1-\tau_1(x))\|\nabla f_T(x)\|_2^2 = \mathbb{E}_{x\sim D}(1-\tau_2\tau_1(x))\|\nabla f_T(x)\|_2^2 \le (1-\tau_1\tau_2)L^2,$$
which completes the proof. ∎
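The identity proved above can be verified empirically in the one-dimensional-output case. The following sketch is ours: it samples gradients of $f_S$ and $f_T$ at finitely many points, forms $\Delta_{f_S,\delta_{f_S}}(x) \propto \|\nabla f_S(x)\|_2$ and $\Delta_{f_T,\delta_{f_S}}(x) \propto \cos(\varphi(x))\|\nabla f_T(x)\|_2$ as in the proof, and checks that $\|\nabla f_T - A\nabla f_S\|_D^2$ equals $\mathbb{E}_{x\sim D}(1-\tau_1(x)\tau_2)\|\nabla f_T(x)\|_2^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 500, 5

# empirical gradients of f_S, f_T at N points (rows); both maps R^n -> R
GS = rng.standard_normal((N, n))
GT = 0.8 * GS + 0.5 * rng.standard_normal((N, n))

ns = np.linalg.norm(GS, axis=1)   # ||grad f_S(x)||_2
nt = np.linalg.norm(GT, axis=1)   # ||grad f_T(x)||_2
cos = np.sum(GS * GT, axis=1) / (ns * nt)
tau1 = cos ** 2                   # per-sample squared cosine

# Delta responses (up to a common epsilon factor, which cancels)
dT, dS = cos * nt, ns
A = np.mean(dT * dS) / np.mean(dS ** 2)   # scalar affine slope
tau2 = np.mean(dS * dT) ** 2 / (np.mean(dS ** 2) * np.mean(dT ** 2))

lhs = np.mean(np.sum((GT - A * GS[:, :]) ** 2, axis=1))
rhs = np.mean((1 - tau1 * tau2) * nt ** 2)
assert abs(lhs - rhs) < 1e-8 * max(1.0, lhs)  # Theorem 1 identity
```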

D PROOF OF THEOREM 2

Theorem 2 (Restated). For two functions $f_S:\mathbb{R}^n\to\mathbb{R}^m$ and $f_T:\mathbb{R}^n\to\mathbb{R}^d$, there is an affine function $g:\mathbb{R}^m\to\mathbb{R}^d$ such that
$$\|\nabla f_T - \nabla(g\circ f_S)\|_{D,2}^2 \le 5\,\mathbb{E}_{x\sim D}\Big[\big((1-\tau_1(x)\tau_2) + (1-\tau_1(x))(1-\tau_2)\lambda_{f_T}(x)^2\big)\|\nabla f_T(x)\|_2^2 + (\lambda_{f_T}(x)+\lambda_{f_S}(x))^2\|\nabla f_S(x)\|_2^2\,\frac{\|\nabla f_T\|_{D,2}^2}{\|\nabla f_S\|_{D,2}^2}\Big],$$
where $g$ is defined as $g(z) = Az + \mathrm{Const}$. Moreover, if we assume that $f_T$ is $L$-Lipschitz continuous, i.e., $\|\nabla f_T(x)\|_2 \le L$ for $\forall x\in\mathrm{supp}(D)$, and consider the worst-case singular value ratio $\lambda$, we have the more elegant statement
$$\|\nabla f_T - \nabla(g\circ f_S)\|_{D,2}^2 \le \big((1-\tau_1\tau_2) + (1-\tau_1)(1-\tau_2)\lambda_{f_T}^2 + (\lambda_{f_T}+\lambda_{f_S})^2\big)\,5L^2.$$

Proof. Recall that the matrix $A$ is defined in Definition 3, i.e.,
$$A = \mathrm{proj}\Big(\mathbb{E}_{x\sim D}\big[\Delta_{f_T,\delta_{f_S}}(x)\Delta_{f_S,\delta_{f_S}}(x)^\top\big]\,\mathbb{E}_{x\sim D}\big[\Delta_{f_S,\delta_{f_S}}(x)\Delta_{f_S,\delta_{f_S}}(x)^\top\big]^\dagger,\ \frac{\|\Delta_{f_T,\delta_{f_S}}\|_D}{\|\Delta_{f_S,\delta_{f_S}}\|_D}\Big),$$
and we can see
$$\|\nabla f_T - \nabla(g\circ f_S)\|_{D,2}^2 = \|\nabla f_T - A\nabla f_S\|_{D,2}^2 = \mathbb{E}_{x\sim D}\|\nabla f_T(x)^\top - A\nabla f_S(x)^\top\|_2^2 = \mathbb{E}_{x\sim D}\max_{\|t\|_2=1}\|\nabla f_T(x)^\top t - A\nabla f_S(x)^\top t\|_2^2, \quad (8)$$
where the last equality is due to the definition of the matrix spectral norm. Writing $\nabla f$ for either Jacobian $\nabla f_T$ or $\nabla f_S$, the Singular Value Decomposition (SVD) gives $\nabla f(x)^\top = U\Sigma V^\top$, where $\Sigma$ is a diagonal matrix containing all singular values ordered by their absolute values. Let $\sigma_1,\dots,\sigma_n$ denote the ordered singular values. Noting that the number of non-zero singular values may be less than $n$, we pad with zeros so that each of them has a corresponding singular vector, i.e., the column vectors $v_1,\dots,v_n$ of $V$. That is, for $\forall i\in[n]$ we have $\|\nabla f(x)^\top v_i\|_2 = |\sigma_i|$. In the following, let $v_i$ denote the singular vectors of $\nabla f_S(x)^\top$. Noting that $\{v_i\}_{i=1}^n$ form an orthonormal basis of $\mathbb{R}^n$, we can represent
$$t = \sum_{i=1}^n\theta_i v_i, \quad\text{where}\quad \sum_{i=1}^n\theta_i^2 = 1. \quad (9)$$
As the adversarial attack concerns the top singular direction of the gradient, plugging (9) into (8), we can split it into two parts, i.e.,
$$(8) = \mathbb{E}_{x\sim D}\max_{\|t\|_2=1}\Big\|\nabla f_T(x)^\top\sum_{i=1}^n\theta_i v_i - A\nabla f_S(x)^\top\sum_{i=1}^n\theta_i v_i\Big\|_2^2 = \mathbb{E}_{x\sim D}\max_{\|t\|_2=1}\Big\|\nabla f_T(x)^\top(\theta_1 v_1) - A\nabla f_S(x)^\top(\theta_1 v_1) + \nabla f_T(x)^\top\sum_{i=2}^n\theta_i v_i - A\nabla f_S(x)^\top\sum_{i=2}^n\theta_i v_i\Big\|_2^2. \quad (10)$$
Denoting $u = \sum_{i=2}^n\theta_i v_i$, we can see this vector is orthogonal to $v_1$. Let $v_1'$ denote the singular vector with the largest absolute singular value of $\nabla f_T(x)^\top$, parallel to the attack $\delta_{f_T}$. Now we split $u = u_1 + u_2$, where $u_1$ is parallel to $v_1'$ and $u_2$ is orthogonal to $u_1$. As $u_1$ lies in the orthogonal complement of $v_1$ while being parallel to $v_1'$, its length is bounded by the sine of the angle between $v_1$ and $v_1'$, i.e., $\sqrt{1-\tau_1(x)}$. Hence, noting that $u$ is part of the unit vector $t$,
$$\|u_1\|_2 \le \sqrt{1-\tau_1(x)}\,\|u\|_2 \le \sqrt{1-\tau_1(x)}. \quad (11)$$
Plugging $u$ into (10), we have
$$(10) = \mathbb{E}_{x\sim D}\max_{\|t\|_2=1}\big\|\nabla f_T(x)^\top(\theta_1 v_1) - A\nabla f_S(x)^\top(\theta_1 v_1) + \nabla f_T(x)^\top(u_1+u_2) - A\nabla f_S(x)^\top u\big\|_2^2$$
$$\le \mathbb{E}_{x\sim D}\max_{\|t\|_2=1}\Big(\underbrace{\|\nabla f_T(x)^\top(\theta_1 v_1) - A\nabla f_S(x)^\top(\theta_1 v_1)\|_2}_{X_1} + \underbrace{\|\nabla f_T(x)^\top u_1\|_2}_{X_2} + \underbrace{\|\nabla f_T(x)^\top u_2 - A\nabla f_S(x)^\top u\|_2}_{X_3}\Big)^2, \quad (12)$$
where the inequality is due to the triangle inequality. There are three terms we have to deal with, i.e., $X_1$, $X_2$, and $X_3$. Regarding the first term, $v_1$ in $X_1$ aligns with the attack $\delta_{f_S}(x)$, which we know through the adversarial attack. The second term $X_2$ is trivially bounded via (11). Although adversarial attacks tell us nothing about $X_3$, it can be bounded by the second largest singular values. Let us first deal with the two easier terms, $X_2$ and $X_3$. Applying (11) to $X_2$ directly, we have
$$X_2 \le \|\nabla f_T(x)\|_2\cdot\|u_1\|_2 \le \sqrt{1-\tau_1(x)}\,\|\nabla f_T(x)\|_2.$$
For $X_3$, noting that $u_2$ is orthogonal to $v_1'$, and $u$ is orthogonal to $v_1$, we can see that $u_2$ has no component along the largest absolute singular vector of $\nabla f_T(x)^\top$, and $u$ has no component along the largest absolute singular vector of $\nabla f_S(x)^\top$.
Therefore,
$$X_3 \le \|\nabla f_T(x)^\top u_2\|_2 + \|A\nabla f_S(x)^\top u\|_2 \le \sigma_{f_T,2}(x)\|u_2\|_2 + \sigma_{f_S,2}(x)\|A\|_2\|u\|_2 = \lambda_{f_T}(x)\|\nabla f_T(x)\|_2\|u_2\|_2 + \lambda_{f_S}(x)\|\nabla f_S(x)\|_2\|A\|_2\|u\|_2 \le \lambda_{f_T}(x)\|\nabla f_T(x)\|_2 + \lambda_{f_S}(x)\|\nabla f_S(x)\|_2\|A\|_2,$$
where the first inequality is due to the triangle inequality, the second inequality follows from the properties of singular values and the definition of the matrix 2-norm, the equality simply applies the definition of the singular value ratio (Definition 4), and the third inequality is due to the fact that $\|u_2\|_2 \le \|u\|_2 \le 1$. Before dealing with $X_1$, let us simplify (12) by relaxing the square of summed terms to a sum of squared terms, as follows:
$$(12) = \mathbb{E}_{x\sim D}\max_{\|t\|_2=1}(X_1+X_2+X_3)^2 = \mathbb{E}_{x\sim D}\max_{\|t\|_2=1}\big(X_1^2+X_2^2+X_3^2+2X_1X_2+2X_2X_3+2X_1X_3\big)$$
$$\le \mathbb{E}_{x\sim D}\max_{\|t\|_2=1}\big(X_1^2+X_2^2+X_3^2+2\max\{X_1^2,X_2^2\}+2\max\{X_2^2,X_3^2\}+2\max\{X_1^2,X_3^2\}\big) \le \mathbb{E}_{x\sim D}\max_{\|t\|_2=1}\big(X_1^2+X_2^2+X_3^2+2(X_1^2+X_2^2)+2(X_2^2+X_3^2)+2(X_1^2+X_3^2)\big) = \mathbb{E}_{x\sim D}\max_{\|t\|_2=1}5\big(X_1^2+X_2^2+X_3^2\big). \quad (13)$$
We note that this relaxation is not necessary, but it keeps the final results simple without changing what our theory suggests. Bringing in what we have for $X_2$ and $X_3$, and noting that $\theta_1 \le 1$ depends on $t$, we can drop the max operation:
$$(13) = \mathbb{E}_{x\sim D}\max_{\|t\|_2=1}5\big(\|\nabla f_T(x)^\top(\theta_1 v_1) - A\nabla f_S(x)^\top(\theta_1 v_1)\|_2^2 + X_2^2 + X_3^2\big) \le 5\,\mathbb{E}_{x\sim D}\Big[\|\nabla f_T(x)^\top v_1 - A\nabla f_S(x)^\top v_1\|_2^2 + (1-\tau_1(x))\|\nabla f_T(x)\|_2^2 + (\lambda_{f_T}(x)+\lambda_{f_S}(x))^2\|\nabla f_S(x)\|_2^2\|A\|_2^2\Big]. \quad (14)$$
Now, let us deal with the first term. As $v_1$ is a unit vector and is in fact the direction of $f_S$'s adversarial attack at $x$, we can write $\delta_{f_S,\epsilon}(x) = \epsilon v_1$. Hence,
$$\mathbb{E}_{x\sim D}\|\nabla f_T(x)^\top v_1 - A\nabla f_S(x)^\top v_1\|_2^2 = \frac{1}{\epsilon^2}\,\mathbb{E}_{x\sim D}\|\nabla f_T(x)^\top\delta_{f_S,\epsilon}(x) - A\nabla f_S(x)^\top\delta_{f_S,\epsilon}(x)\|_2^2 = \frac{1}{\epsilon^2}\,\mathbb{E}_{x\sim D}\|\Delta_{f_T,\delta_{f_S}}(x) - A\Delta_{f_S,\delta_{f_S}}(x)\|_2^2, \quad (15)$$
where the last equality is derived by applying the definition of $\Delta(x)$, i.e., equation (1).
Note that we omit the $\epsilon$ in $\delta_{f_S,\epsilon}$ for notational simplicity. The matrix $A$ is designed to minimize (15), as shown in the proof of Proposition 1. Expanding the term, we have
$$(15) = \frac{1}{\epsilon^2}\,\mathbb{E}_{x\sim D}\Big[\|\Delta_{f_T,\delta_{f_S}}(x)\|_2^2 + \|A\Delta_{f_S,\delta_{f_S}}(x)\|_2^2 - 2\langle\Delta_{f_T,\delta_{f_S}}(x),\ A\Delta_{f_S,\delta_{f_S}}(x)\rangle\Big] = \frac{1}{\epsilon^2}\Big[\|\Delta_{f_T,\delta_{f_S}}\|_D^2 + \|A\Delta_{f_S,\delta_{f_S}}\|_D^2 - 2\langle\Delta_{f_T,\delta_{f_S}},\ A\Delta_{f_S,\delta_{f_S}}\rangle_D\Big] = \frac{\|\Delta_{f_T,\delta_{f_S}}\|_D^2}{\epsilon^2}(1-\tau_2) = (1-\tau_2)\,\mathbb{E}_{x\sim D}\|\nabla f_T(x)^\top v_1\|_2^2. \quad (16)$$
Recall that $v_1$ is a unit vector aligned with the direction of $\delta_{f_S}$, and we have used $v_1'$ to denote a unit vector aligned with the direction of $\delta_{f_T}$. As $\tau_1$ tells us about the angle between the two, let us split $v_1$ into two orthogonal vectors, i.e.,
$$v_1 = \sqrt{\tau_1(x)}\,v_1' + \sqrt{1-\tau_1(x)}\,v_{1,\perp}',$$
where $v_{1,\perp}'$ is a unit vector orthogonal to $v_1'$. Plugging this into (16), we have
$$(16) = (1-\tau_2)\,\mathbb{E}_{x\sim D}\big\|\nabla f_T(x)^\top\big(\sqrt{\tau_1(x)}\,v_1' + \sqrt{1-\tau_1(x)}\,v_{1,\perp}'\big)\big\|_2^2 = (1-\tau_2)\,\mathbb{E}_{x\sim D}\Big[\big\|\sqrt{\tau_1(x)}\,\nabla f_T(x)^\top v_1'\big\|_2^2 + \big\|\sqrt{1-\tau_1(x)}\,\nabla f_T(x)^\top v_{1,\perp}'\big\|_2^2\Big] \le (1-\tau_2)\,\mathbb{E}_{x\sim D}\Big[\tau_1(x)\|\nabla f_T(x)\|_2^2 + (1-\tau_1(x))\lambda_{f_T}(x)^2\|\nabla f_T(x)\|_2^2\Big],$$
where the second equality is because the images of $v_1'$ and $v_{1,\perp}'$ under the linear transformation $\nabla f_T(x)^\top$ are orthogonal, which can be easily observed through the SVD. Plugging this into (14), and with some regular algebraic manipulation, we finally have
$$(14) \le 5\,\mathbb{E}_{x\sim D}\Big[(1-\tau_2)\big(\tau_1(x)\|\nabla f_T(x)\|_2^2 + (1-\tau_1(x))\lambda_{f_T}(x)^2\|\nabla f_T(x)\|_2^2\big) + (1-\tau_1(x))\|\nabla f_T(x)\|_2^2 + (\lambda_{f_T}(x)+\lambda_{f_S}(x))^2\|\nabla f_S(x)\|_2^2\|A\|_2^2\Big]$$
$$= 5\,\mathbb{E}_{x\sim D}\Big[(1-\tau_1(x)\tau_2)\|\nabla f_T(x)\|_2^2 + (1-\tau_1(x))(1-\tau_2)\lambda_{f_T}(x)^2\|\nabla f_T(x)\|_2^2 + (\lambda_{f_T}(x)+\lambda_{f_S}(x))^2\|\nabla f_S(x)\|_2^2\|A\|_2^2\Big]. \quad (17)$$
Recall that $A$ comes from a norm-restricted matrix space: $A$ is scaled so that its spectral norm is no greater than $\|\Delta_{f_T,\delta_{f_S}}\|_D / \|\Delta_{f_S,\delta_{f_S}}\|_D$, thus
$$\|A\|_2^2 \le \frac{\|\Delta_{f_T,\delta_{f_S}}\|_D^2}{\|\Delta_{f_S,\delta_{f_S}}\|_D^2} \le \frac{\|\Delta_{f_T,\delta_{f_T}}\|_D^2}{\|\Delta_{f_S,\delta_{f_S}}\|_D^2} = \frac{\mathbb{E}_{x\sim D}\|\Delta_{f_T,\delta_{f_T}}(x)\|_2^2}{\mathbb{E}_{x\sim D}\|\Delta_{f_S,\delta_{f_S}}(x)\|_2^2} = \frac{\mathbb{E}_{x\sim D}\|\nabla f_T(x)\|_2^2}{\mathbb{E}_{x\sim D}\|\nabla f_S(x)\|_2^2} = \frac{\|\nabla f_T\|_{D,2}^2}{\|\nabla f_S\|_{D,2}^2}. \quad (18)$$
Hence, plugging the above inequality into (17), the first statement of the theorem is proven, i.e.,
$$(17) \le 5\,\mathbb{E}_{x\sim D}\Big[(1-\tau_1(x)\tau_2)\|\nabla f_T(x)\|_2^2 + (1-\tau_1(x))(1-\tau_2)\lambda_{f_T}(x)^2\|\nabla f_T(x)\|_2^2 + (\lambda_{f_T}(x)+\lambda_{f_S}(x))^2\|\nabla f_S(x)\|_2^2\,\frac{\|\nabla f_T\|_{D,2}^2}{\|\nabla f_S\|_{D,2}^2}\Big]. \quad (19)$$
To see the second statement of the theorem, we assume $f_T$ is $L$-Lipschitz continuous, i.e., $\|\nabla f_T(x)\|_2 \le L$ for $\forall x\in\mathrm{supp}(D)$, and consider the worst-case singular value ratio $\lambda_f = \max_{x\in\mathrm{supp}(D)}\lambda_f(x)$ for each of $f_S$ and $f_T$. Noting that $\mathbb{E}_{x\sim D}\|\nabla f_S(x)\|_2^2\cdot\frac{\|\nabla f_T\|_{D,2}^2}{\|\nabla f_S\|_{D,2}^2} = \|\nabla f_T\|_{D,2}^2 \le L^2$, we can continue as
$$(19) \le \big((1-\tau_1\tau_2) + (1-\tau_1)(1-\tau_2)\lambda_{f_T}^2 + (\lambda_{f_T}+\lambda_{f_S})^2\big)\,5L^2,$$
which completes the proof. ∎

E PROOF OF THEOREM 3

The idea for proving Theorem 3 is straightforward: a bounded gradient difference implies a bounded function difference, and a bounded function difference implies a bounded loss difference. To begin with, let us prove the following lemma.

Lemma 1. Without loss of generality, we assume $\|x\|_2 \le 1$ for $\forall x\in\mathrm{supp}(D)$. Consider functions $f_S:\mathbb{R}^n\to\mathbb{R}^m$, $f_T:\mathbb{R}^n\to\mathbb{R}^d$, and an affine function $g:\mathbb{R}^m\to\mathbb{R}^d$ as suggested by Theorem 1 or Theorem 2, such that $g(f_S(0)) = f_T(0)$. If both $f_T$ and $f_S$ are $\beta$-smooth in $\{x \mid \|x\|\le 1\}$, we have
$$\|f_T - g\circ f_S\|_D \le \|\nabla f_T - \nabla(g\circ f_S)\|_{D,2} + \Big(1 + \frac{\|\nabla f_T\|_{D,2}}{\|\nabla f_S\|_{D,2}}\Big)\beta.$$

Proof. Let us denote $v(x) = f_T(x) - g\circ f_S(x)$, and show the smoothness of $v(\cdot)$. As $g(\cdot)$ is an affine function satisfying $g(f_S(0)) = f_T(0)$, it can be written as $g(z) = A(z - f_S(0)) + f_T(0)$, where $A$ is the matrix suggested by Theorem 1 or Theorem 2. Therefore, denoting the unit ball $B_1 = \{x \mid \|x\|\le 1\}$, for $\forall x, y\in B_1$ it holds that
$$\|\nabla v(x)^\top - \nabla v(y)^\top\|_2 = \big\|\nabla f_T(x)^\top - \nabla f_T(y)^\top - A\big(\nabla f_S(x)^\top - \nabla f_S(y)^\top\big)\big\|_2 \le \|\nabla f_T(x)^\top - \nabla f_T(y)^\top\|_2 + \|A\big(\nabla f_S(x)^\top - \nabla f_S(y)^\top\big)\|_2 \le \|\nabla f_T(x) - \nabla f_T(y)\|_2 + \|A\|_2\|\nabla f_S(x) - \nabla f_S(y)\|_2, \quad (20)$$
where the first inequality is due to the triangle inequality, and the last inequality is by the property of the spectral norm.
Applying the $\beta$-smoothness of $f_S$ and $f_T$, and noting that $\|A\|_2 \le \|\nabla f_T\|_{D,2}/\|\nabla f_S\|_{D,2}$ as shown in (18), we can continue as
$$(20) \le \beta\|x-y\|_2 + \|A\|_2\,\beta\|x-y\|_2 \le \beta\|x-y\|_2 + \frac{\|\nabla f_T\|_{D,2}}{\|\nabla f_S\|_{D,2}}\beta\|x-y\|_2 = \Big(1 + \frac{\|\nabla f_T\|_{D,2}}{\|\nabla f_S\|_{D,2}}\Big)\beta\|x-y\|_2,$$
which suggests that $v(\cdot)$ is $\big(1 + \|\nabla f_T\|_{D,2}/\|\nabla f_S\|_{D,2}\big)\beta$-smooth. We are now ready to prove the lemma. Applying the mean value theorem, for $\forall x\in B_1$ we have
$$v(x) - v(0) = \nabla v(\xi x)^\top x,$$
where $\xi\in(0,1)$ is a scalar. Subtracting $\nabla v(x)^\top x$ on both sides gives
$$v(x) - v(0) - \nabla v(x)^\top x = \big(\nabla v(\xi x) - \nabla v(x)\big)^\top x, \qquad \|v(x) - v(0) - \nabla v(x)^\top x\|_2 \le \|\nabla v(\xi x)^\top - \nabla v(x)^\top\|_2\,\|x\|_2.$$
Let us denote $\beta_1 = \big(1 + \|\nabla f_T\|_{D,2}/\|\nabla f_S\|_{D,2}\big)\beta$ for notational convenience, and apply the definition of smoothness:
$$\|v(x) - v(0) - \nabla v(x)^\top x\|_2 \le \beta_1(1-\xi)\|x\|_2^2 \le \beta_1. \quad (21)$$
Noting that $v(0) = 0$ and applying the triangle inequality, we have
$$\|v(x) - v(0) - \nabla v(x)^\top x\|_2 \ge \|v(x)\|_2 - \|\nabla v(x)^\top x\|_2 \ge \|v(x)\|_2 - \|\nabla v(x)\|_2.$$
Plugging this into (21), we have
$$\|v(x)\|_2 \le \beta_1 + \|\nabla v(x)\|_2, \qquad \|v(x)\|_2^2 \le \beta_1^2 + \|\nabla v(x)\|_2^2 + 2\beta_1\|\nabla v(x)\|_2,$$
and hence
$$\mathbb{E}_{x\sim D}\|v(x)\|_2^2 \le \beta_1^2 + \mathbb{E}_{x\sim D}\|\nabla v(x)\|_2^2 + 2\beta_1\,\mathbb{E}_{x\sim D}\|\nabla v(x)\|_2, \qquad \|v\|_D^2 \le \beta_1^2 + \|\nabla v\|_{D,2}^2 + 2\beta_1\,\mathbb{E}_{x\sim D}\|\nabla v(x)\|_2.$$
Applying Jensen's inequality to the last term, we get
$$\|v\|_D^2 \le \beta_1^2 + \|\nabla v\|_{D,2}^2 + 2\beta_1\sqrt{\mathbb{E}_{x\sim D}\|\nabla v(x)\|_2^2} = \beta_1^2 + \|\nabla v\|_{D,2}^2 + 2\beta_1\|\nabla v\|_{D,2} = \big(\|\nabla v\|_{D,2} + \beta_1\big)^2.$$
Plugging $\beta_1 = \big(1 + \|\nabla f_T\|_{D,2}/\|\nabla f_S\|_{D,2}\big)\beta$ and $v = f_T - g\circ f_S$ into the above inequality completes the proof. ∎

With the above lemma, it is easy to show that the mean squared loss of the transferred model is also bounded.

Theorem 3 (Restated). Without loss of generality, we assume $\|x\|_2 \le 1$ for $\forall x\in\mathrm{supp}(D)$. Consider functions $f_S:\mathbb{R}^n\to\mathbb{R}^m$, $f_T:\mathbb{R}^n\to\mathbb{R}^d$, and an affine function $g:\mathbb{R}^m\to\mathbb{R}^d$ as suggested by Theorem 1 or Theorem 2, such that $g(f_S(0)) = f_T(0)$. If both $f_T$ and $f_S$ are $\beta$-smooth, then
$$\|g\circ f_S - y\|_D^2 \le \Big(\|f_T - y\|_D + \|\nabla f_T - \nabla(g\circ f_S)\|_{D,2} + \Big(1 + \frac{\|\nabla f_T\|_{D,2}}{\|\nabla f_S\|_{D,2}}\Big)\beta\Big)^2.$$

Proof.
Let us denote $\beta_1 = \big(1 + \|\nabla f_T\|_{D,2}/\|\nabla f_S\|_{D,2}\big)\beta$. According to Lemma 1, we can see
$$\|f_T - g\circ f_S\|_D \le \|\nabla f_T - \nabla(g\circ f_S)\|_{D,2} + \beta_1. \quad (22)$$
Applying standard algebraic manipulation to the left-hand side, and then applying the triangle inequality, we have
$$\|f_T - g\circ f_S\|_D = \|f_T - y + y - g\circ f_S\|_D \ge \|y - g\circ f_S\|_D - \|f_T - y\|_D.$$
Plugging this directly into (22), it holds that
$$\|y - g\circ f_S\|_D - \|f_T - y\|_D \le \|\nabla f_T - \nabla(g\circ f_S)\|_{D,2} + \beta_1, \qquad \|y - g\circ f_S\|_D \le \|f_T - y\|_D + \|\nabla f_T - \nabla(g\circ f_S)\|_{D,2} + \beta_1.$$
Replacing $\beta_1$ by $\big(1 + \|\nabla f_T\|_{D,2}/\|\nabla f_S\|_{D,2}\big)\beta$ and squaring both sides, we can see that Theorem 3 is proven. ∎

F PROOF OF PROPOSITION 2

Proposition 2 (Restated). If $\ell_T$ is the mean squared loss and $f_T$ achieves zero loss on $D$, then the adversarial loss defined in Definition 6 is approximately upper and lower bounded by
$$L_{adv}(f_T, \delta_{f_S}; y, D) \ge \epsilon^2\,\mathbb{E}_{x\sim D}\,\tau_1(x)\|\nabla f_T(x)\|_2^2 + O(\epsilon^3),$$
$$L_{adv}(f_T, \delta_{f_S}; y, D) \le \epsilon^2\,\mathbb{E}_{x\sim D}\big(\lambda_{f_T}^2 + (1-\lambda_{f_T}^2)\tau_1(x)\big)\|\nabla f_T(x)\|_2^2 + O(\epsilon^3),$$
where $O(\epsilon^3)$ denotes a cubic error term.

Proof. Recall that the empirical adversarial transferability is defined as a loss
$$L_{adv}(f_T, \delta_{f_S,\epsilon}; y, D) = \mathbb{E}_{x\sim D}\,\ell_T\big(f_T(x + \delta_{f_S,\epsilon}(x)),\ y(x)\big).$$
As $\ell_T$ is the mean squared loss and $f_T$ achieves zero loss, i.e., $f_T = y$, we have
$$L_{adv}(f_T, \delta_{f_S,\epsilon}; y, D) = \mathbb{E}_{x\sim D}\|f_T(x + \delta_{f_S,\epsilon}(x)) - y(x)\|_2^2 = \mathbb{E}_{x\sim D}\|f_T(x + \delta_{f_S,\epsilon}(x)) - f_T(x)\|_2^2.$$
Writing $\delta_{f_S,\epsilon}(x) = \epsilon\,\delta_{f_S,1}(x)$ and defining the auxiliary function $h(t) = f_T(x + t\,\delta_{f_S,1}(x)) - f_T(x)$, we can see that $\|f_T(x + \delta_{f_S,\epsilon}(x)) - f_T(x)\|_2^2 = \|h(\epsilon)\|_2^2$. We can then apply a Taylor expansion to approximate $h(\epsilon)$ with a second-order error term $O(\epsilon^2)$, i.e.,
$$h(\epsilon) = \epsilon\,\frac{\partial h}{\partial t}\Big|_{t=0} + O(\epsilon^2) = \epsilon\,\nabla f_T(x)^\top\delta_{f_S,1}(x) + O(\epsilon^2).$$
Therefore, assuming that $\|\nabla f_T(x)\|_2$ is bounded for $x\in\mathrm{supp}(D)$, we have
$$\|f_T(x + \delta_{f_S,\epsilon}(x)) - f_T(x)\|_2^2 = \|h(\epsilon)\|_2^2 = \epsilon^2\|\nabla f_T(x)^\top\delta_{f_S,1}(x)\|_2^2 + O(\epsilon^3), \quad (23)$$
where we have omitted the higher-order error term $O(\epsilon^4)$. Next, let us deal with the term $\|\nabla f_T(x)^\top\delta_{f_S,1}(x)\|_2^2$. Using the same technique as in the proof of Theorem 2, we split $\delta_{f_S,1}(x) = v_1 + v_2$, where $v_1$ aligns with the direction of $\delta_{f_T,1}(x)$ and $v_2$ is orthogonal to $v_1$. Noting that $\tau_1(x)$ is the squared cosine of the angle between $\delta_{f_S,1}(x)$ and $\delta_{f_T,1}(x)$, we can see that
$$\|v_1\|_2^2 = \tau_1(x)\|\delta_{f_S,1}(x)\|_2^2 = \tau_1(x), \qquad \|v_2\|_2^2 = (1-\tau_1(x))\|\delta_{f_S,1}(x)\|_2^2 = 1-\tau_1(x).$$
Therefore, we can continue as
$$\|\nabla f_T(x)^\top\delta_{f_S,1}(x)\|_2^2 = \|\nabla f_T(x)^\top(v_1+v_2)\|_2^2 = \|\nabla f_T(x)^\top v_1\|_2^2 + \|\nabla f_T(x)^\top v_2\|_2^2 = \tau_1(x)\|\nabla f_T(x)\|_2^2 + \|\nabla f_T(x)^\top v_2\|_2^2, \quad (24)$$
where the second equality holds because $v_1$ corresponds to the largest singular value of $\nabla f_T(x)^\top$ and $v_2$ is orthogonal to $v_1$. Next, we derive the lower and upper bounds for (24). The lower bound can be derived as
$$\tau_1(x)\|\nabla f_T(x)\|_2^2 + \|\nabla f_T(x)^\top v_2\|_2^2 \ge \tau_1(x)\|\nabla f_T(x)\|_2^2,$$
and the upper bound can be derived as
$$\tau_1(x)\|\nabla f_T(x)\|_2^2 + \|\nabla f_T(x)^\top v_2\|_2^2 \le \tau_1(x)\|\nabla f_T(x)\|_2^2 + \lambda_{f_T}(x)^2\|\nabla f_T(x)\|_2^2\|v_2\|_2^2 = \tau_1(x)\|\nabla f_T(x)\|_2^2 + \lambda_{f_T}(x)^2\|\nabla f_T(x)\|_2^2(1-\tau_1(x)) \le \big(\lambda_{f_T}^2 + (1-\lambda_{f_T}^2)\tau_1(x)\big)\|\nabla f_T(x)\|_2^2,$$
where $\lambda_{f_T}(x)$ is the singular value ratio of $f_T$ at $x$, and $\lambda_{f_T}$ is the maximal singular value ratio of $f_T$. Applying the lower and upper bounds to (23), we finally have
$$\|f_T(x + \delta_{f_S,\epsilon}(x)) - f_T(x)\|_2^2 \ge \epsilon^2\tau_1(x)\|\nabla f_T(x)\|_2^2 + O(\epsilon^3),$$
$$\|f_T(x + \delta_{f_S,\epsilon}(x)) - f_T(x)\|_2^2 \le \epsilon^2\big(\lambda_{f_T}^2 + (1-\lambda_{f_T}^2)\tau_1(x)\big)\|\nabla f_T(x)\|_2^2 + O(\epsilon^3). \quad (25)$$
Noting that $L_{adv}(f_T, \delta_{f_S,\epsilon}; y, D) = \mathbb{E}_{x\sim D}\|f_T(x + \delta_{f_S,\epsilon}(x)) - f_T(x)\|_2^2$, we can see that taking the expectation of (25) completes the proof. ∎
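The pointwise lower and upper bounds on $\|\nabla f_T(x)^\top\delta_{f_S,1}(x)\|_2^2$ derived above can be checked numerically for a single point $x$. In the sketch below (ours), a random matrix stands in for $\nabla f_T(x)^\top$, $\sigma_1$ plays the role of $\|\nabla f_T(x)\|_2$, and $\lambda = \sigma_2/\sigma_1$ is the singular value ratio.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 6, 4

J = rng.standard_normal((d, n))   # stand-in for grad f_T(x)^T
U, s, Vt = np.linalg.svd(J)
v1, sigma1, sigma2 = Vt[0], s[0], s[1]
lam = sigma2 / sigma1             # singular value ratio at x

# a unit attack direction delta_{f_S,1}(x)
delta = rng.standard_normal(n)
delta /= np.linalg.norm(delta)
tau1 = np.dot(delta, v1) ** 2     # squared cosine with the top singular direction

val = np.sum((J @ delta) ** 2)    # ||grad f_T(x)^T delta||^2
lower = tau1 * sigma1 ** 2
upper = (lam ** 2 + (1 - lam ** 2) * tau1) * sigma1 ** 2
assert lower - 1e-9 <= val <= upper + 1e-9
```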

G EXPERIMENT DETAILS

All experiments are conducted on 4 RTX 2080 Ti GPUs in a Python 3 environment on Ubuntu 16.04.

G.1 ATTACK METHODS

PGD Attack is generated iteratively. Denote the step size as $\xi$, the source model as $f_S$, and the loss function on the source problem as $\ell_S(\cdot,\cdot)$. We initialize $x_0$ to be uniformly sampled from the $\epsilon$-ball $B_\epsilon(x)$ of radius $\epsilon$ centered at instance $x$, and then generate the adversarial instance iteratively: at step $t$ we compute
$$x_{t+1} = x_t + \xi\cdot\mathrm{sign}\big(\nabla_{x_t}\ell_S(f_S(x_t), f_S(x))\big).$$
Denoting the adversarial example at instance $x$ generated by PGD on source model $f_S$ as $PGD_{f_S}(x)$, we measure the adversarial loss from $f_S$ to $f_T$ based on the loss $\ell_T(\cdot, y)$ of $f_T$ on target data $D$ given attacks generated on $f_S$, i.e.,
$$L_T(f_T\circ PGD_{f_S}; y, D) = \mathbb{E}_{x\sim D}\,\ell_T\big(f_T(PGD_{f_S}(x)),\ y(x)\big).$$
TextFooler iteratively replaces words in target sentences by looking up similar words in a dictionary. It stops when the predicted label changes or the attack budget runs out. We modify it so that it stops when the percentage of changed words reaches 10%.
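The PGD loop above can be sketched in a few lines. The toy example below is ours: it uses a linear source model with a squared-error loss so the gradient is available in closed form, and it includes the usual projection back onto the $\epsilon$-ball to keep the iterates feasible; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, eps, xi, steps = 10, 0.3, 0.08, 10

W = rng.standard_normal((3, n))   # toy linear source model f_S(x) = W x
x = rng.standard_normal(n)        # clean instance

def loss_grad(xt):
    # gradient of l_S(f_S(xt), f_S(x)) = ||W xt - W x||^2 w.r.t. xt
    return 2 * W.T @ (W @ xt - W @ x)

# random start in the L_inf eps-ball around x, then signed-gradient steps
xt = x + rng.uniform(-eps, eps, size=n)
for _ in range(steps):
    xt = xt + xi * np.sign(loss_grad(xt))
    xt = np.clip(xt, x - eps, x + eps)   # project back onto B_eps(x)

assert np.max(np.abs(xt - x)) <= eps + 1e-12  # perturbation stays in budget
```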

G.2 ADVERSARIAL TRANSFERABILITY INDICATES KNOWLEDGE-TRANSFER AMONG DATA DISTRIBUTIONS

Details of Dataset Construction For the image domain, we divide the classes of the original datasets into two categories: animals (bird, cat, deer, dog) and transportation vehicles (airplane, automobile, ship, truck). Each source dataset consists of a different percentage of animals and transportation vehicles, while the target dataset contains only transportation vehicles; this controls the closeness of the two data distributions.
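The mixture construction above can be sketched as follows (ours; class indices and dataset sizes are placeholders): given labels where classes 0–3 are animals and 4–7 are transportation vehicles, we subsample a source dataset with a prescribed animal fraction.

```python
import numpy as np

rng = np.random.default_rng(5)

# placeholder labels: classes 0-3 animals, 4-7 vehicles (sizes are ours)
labels = rng.integers(0, 8, size=10_000)
animal = np.flatnonzero(labels < 4)
vehicle = np.flatnonzero(labels >= 4)

def mixture_indices(p_animal, size=2_000):
    """Indices for a source dataset with a given fraction of animal images."""
    k = int(round(p_animal * size))
    return np.concatenate([rng.choice(animal, k, replace=False),
                           rng.choice(vehicle, size - k, replace=False)])

idx = mixture_indices(0.25)
assert len(idx) == 2_000
assert np.isclose(np.mean(labels[idx] < 4), 0.25, atol=0.01)
```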

Details of Model Training

Image: we train five source models on the five source datasets, ranging from 0% animals to 100% animals, and one reference model on STL-10, all with identical architectures and hyperparameters. We use the SGD optimizer and the standard cross-entropy loss with learning rate 0.1, momentum 0.9, and weight decay 10^-4. Each model is trained for 300 epochs. Natural language: we fine-tune a BERT model on each of the datasets with Adam and learning rate 0.0003 for 100 epochs. For transferred models, we run Adam with a smaller learning rate of 0.0001 for 3 epochs.

Details of Model Training

We train 40 binary source classifiers, one for each of the 40 attributes of CelebA, with ResNet18 (He et al., 2016). All classifiers are trained with the Adadelta optimizer with a learning rate of 1.0 for 14 epochs. As a controlled experiment, we also train a facial recognition model on CelebA with its 10,177 identities, using ResNet18, as the reference model. The reference facial recognition model is optimized with SGD and an initial learning rate of 0.1 on the ArcFace loss (Deng et al., 2019) with focal loss (Lin et al., 2017) for 125 epochs. For each source model, we construct a transferred model by stripping off the last layers and attaching a parameter-free facial recognition head. We then use the 40 transferred models to evaluate knowledge transferability on 7 facial recognition benchmarks. We use the 15 pretrained models released in the task bank (Zamir et al., 2018) as the source models. Each source model consists of two parts: an encoder and a decoder. The encoder is a modified ResNet50 without pooling, homogeneous across all tasks, whereas the decoder is customized to suit the output of each task. When measuring adversarial transferability, we use each source model as a reference model and compute the transferability matrix as described below. The Adversarial Transferability Matrix (ATM) is used here to measure the adversarial transferability between multiple tasks, modified from the affinity matrix in (Zamir et al., 2018). In the experiment of determining similarity among tasks, it is hard to compare directly and fairly, since each task has a different loss function, usually on a very different scale from the others. To solve this problem, we take the same ordinal normalization approach as Zamir et al. (2018).
Suppose we have N tasks in the pool. A tournament matrix M_T is constructed for each task T, where the element m_{i,j} represents the percentage of adversarial examples generated from the i-th task that transfer better to task T than those of the j-th task (the untargeted attack success rate is used here). We then take the principal eigenvectors of the N tournament matrices and stack them together to build the N × N adversarial transferability matrix. To generate the corresponding "task categories" for comparison, we sample 1,000 images from the public dataset and perform a virtual adversarial attack on each of the 15 source models. Adversarial perturbations with $L_\infty$-norm budgets of 0.03 and 0.06 are used, and we run a 10-step PGD-based attack for efficiency. We then measure the effectiveness of these adversarial examples on each of the 15 tasks via the corresponding loss functions. After obtaining the 15 × 15 ATM, we take the columns of this matrix as features for each task and perform agglomerative clustering to obtain the Task Similarity Tree. Figure 5: We also quantitatively compare our prediction with the Taskonomy (Zamir et al., 2018) prediction when different numbers of categories are enforced. We find that our prediction is similar to theirs when n ≥ 3.
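The tournament-matrix and ATM construction described above can be sketched as follows (ours; the transfer scores are random placeholders standing in for measured untargeted attack success rates). The principal eigenvector of each tournament matrix is obtained by power iteration.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 15  # number of tasks in the pool

# placeholder scores: score[i, t] = success rate of attacks generated on
# source task i when evaluated on task t (hypothetical data)
score = rng.random((N, N))

atm_cols = []
for t in range(N):
    # tournament matrix for task t: m[i, j] = 1 if source i transfers
    # better to t than source j (deterministic 0/1 stand-in for percentages)
    M = (score[:, t][:, None] > score[:, t][None, :]).astype(float)
    M += 0.5 * np.eye(N)  # ties on the diagonal keep M @ v strictly positive
    v = np.ones(N) / N
    for _ in range(200):  # power iteration for the principal eigenvector
        v = M @ v
        v /= v.sum()
    atm_cols.append(v)

ATM = np.stack(atm_cols, axis=1)  # N x N adversarial transferability matrix
assert ATM.shape == (N, N)
assert np.allclose(ATM.sum(axis=0), 1.0)
```

The columns of `ATM` then serve as per-task features for the agglomerative clustering step.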



Figure 1: Illustration of the key variables.

Figure 2: The x-axis is t ∈ [0, 1], which controls how much the source model deviates from the target model. In total, 7 quantities are reported, placed under 4 y-axes. Specifically, τ1, τ2, and the normalized cross adversarial loss α are plotted as green curves with the green y-axis; the upper bound in Theorem 2 on the transferred gradient difference is shown as blue curves with the blue y-axis; the true transferred gradient difference is shown as red curves with the red y-axis; the upper bound in Theorem 3 on the transferred loss is shown as magenta curves with the magenta y-axis; and the true transferred loss is shown as black curves with the black y-axis.

Figure 4: Left: Empirically confirmed Taskonomy prediction of task categories (Zamir et al., 2018). Right: Task category prediction based on adversarial transferability. Different colors represent different task categories, including 2D, 3D, and Semantic. Adversarial transferability clearly predicts task categories aligned with the purely knowledge-transfer-based empirical observation.

Top 5 Attributes with the highest adversarial transferability and their corresponding average accuracy on the validation benchmarks.

consists of 202,599 face images of 10,177 identities. A reference facial recognition model is trained on these identities. Each image also comes with 40 binary attributes, on which we train 40 source models. Our goal is to test whether source models trained on these attributes can transfer to perform facial recognition. Adversarial Transferability We sample 1000 images from CelebA and perform a virtual adversarial attack, as described in Section 3, on each of the 40 attribute classifiers. Then we measure the adversarial transfer effectiveness of these adversarial examples on the reference facial recognition model. Knowledge Transferability To fairly assess knowledge transferability, we test the 40 transferred models on 7 well-known facial recognition benchmarks: LFW (Huang et al., 2007), CFP-FF, CFP-FP

Amir R. Zamir, Alexander Sax, William Shen, Leonidas J. Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3712-3722, 2018.

