ROBUST TRANSFER LEARNING BASED ON MINIMAX PRINCIPLE

Abstract

The similarity between target and source tasks is a crucial quantity for theoretical analyses and algorithm design in transfer learning studies. However, this quantity is often difficult to capture precisely. To address this issue, we make a boundedness assumption on the task similarity and then propose a mathematical framework based on the minimax principle, which minimizes the worst-case expected population risk under this assumption. Furthermore, the proposed minimax problem can be solved analytically, which provides a guideline for designing robust transfer learning models. Based on the analytical expression, we interpret the influences of sample sizes, task distances, and the model dimensionality on knowledge transfer. Practical algorithms are then developed from the theoretical results. Finally, experiments conducted on image classification tasks show that our approaches achieve robust and competitive accuracies under random selections of training sets.

1. INTRODUCTION

The goal of transfer learning is to solve target tasks using the learning results from some source tasks. To study the fundamental aspects of transfer learning problems, it is important to define and quantify the similarity between source and target tasks (Pan & Yang, 2009). While transfer learning problems assume that the source and target tasks are similar to some extent (Weiss et al., 2016), the joint structures and similarity between the tasks can only be learned from the training data, which is hard to compute in practice due to the limited availability of labeled target samples. Therefore, in order to conduct meaningful theoretical analyses, extra assumptions are often necessary, such as the linear combination of learning results (Ben-David et al., 2010) and linear regression transferring (Kuzborskij & Orabona, 2013), which can be limiting in many applications. In this paper, we instead study transfer learning theoretically by only assuming that the similarity between the source and target tasks is bounded, which is a weaker assumption and often valid in transfer learning problems. Under such an assumption, the minimax principle can be applied (Verdu & Poor, 1984) to estimate the target distribution. Based on this principle, the estimator minimizes the worst-case expected population risk (EPR) (Jin et al., 2018) under the bounded task distance constraint, which maintains robustness against the weakness of the assumption. Many empirical works have also followed the minimax setting and verified its validity (Zhang et al., 2019), while the theoretical analyses lag behind. The main challenge in analyzing general minimax problems in transfer learning lies in the difficulty of computing expectations of the population risk under popular distance measures, such as the Kullback-Leibler (K-L) divergence (Thomas & Joy, 2006).
To deal with this difficulty, we adopt the widely used χ²-distance and Hellinger distance (Csiszár & Shields, 2004) as distance measures between the data distributions of the tasks, and present a minimax formulation of transfer learning. With these measures, the proposed minimax problems can be solved analytically. In particular, we show that the optimal estimate is a linear combination of the learning results of the two tasks, where the combining coefficient can be computed from the training data. This provides a theoretical justification for many existing analysis frameworks and algorithms (Ben-David et al., 2010; Garcke & Vanck, 2014). Note that the recent work (Tong et al., 2021) also analytically evaluates the combining coefficients, but these rely on the underlying task distributions, which are not available in real applications. Our work provides a combining coefficient that is both theoretically optimal and computable from data, which is more appealing in practice. Moreover, the analyses of the minimax transfer learning problem on discrete data can be extended to continuous data for real applications. In the continuous case, we consider a transfer learning scheme similar to (Nguyen et al., 2020), which transfers the topmost layer of the neural network between source and target tasks. In particular, we derive the analytical solution for the optimal weights of the topmost layer, which is again a linear combination of the weights of the source and target problems. Furthermore, we propose transfer learning algorithms guided by the theoretical results, whose robustness and performance are validated by several experiments on real datasets. The contributions of this paper can be summarized as follows:

• We make mild assumptions on the task distance and propose a minimax framework for analyzing transfer learning. Additionally, we establish the analytical solutions of the minimax problems in the discrete data space.

• We extend the analyses to continuous data and establish similar results for learning models with neural networks. Furthermore, we apply our theoretical results to develop robust transfer learning algorithms.

• Experiments on real datasets validate our proposed algorithms, showing that our approaches achieve higher robustness and competitive accuracy.

Due to space limitations, the proofs of theorems are presented in the supplemental materials.

2.1. NOTATIONS AND DEFINITIONS

We denote $X$ and $Y$ as the random variables of data and label with domains $\mathcal{X}$ and $\mathcal{Y}$, respectively. For ease of illustration, the data $X$ is treated as a discrete random variable in sections 2 and 3. We consider a transfer learning problem with one target task and one source task, denoted as task $T$ and $S$, respectively. For each task $i = T, S$, there are $n_i$ training samples $\{(x^{(i)}_\ell, y^{(i)}_\ell)\}_{\ell=1}^{n_i}$ i.i.d. generated from the underlying joint distribution $P^{(i)}_{XY}$ with $P^{(i)}_{XY}(x, y) > 0$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$.¹ The empirical distributions $\hat{P}^{(i)}_{XY}$ ($i = T, S$) of the samples are defined as $\hat{P}^{(i)}_{XY}(x, y) \triangleq \frac{1}{n_i}\sum_{\ell=1}^{n_i} \mathbb{1}\{x^{(i)}_\ell = x, y^{(i)}_\ell = y\}$, where $\mathbb{1}\{\cdot\}$ denotes the indicator function (Feller, 2008), and let $\mathcal{P}_n$ be the set of all possible empirical distributions supported on $\mathcal{X} \times \mathcal{Y}$ with $n$ samples. In this paper, we employ the following two distance measures for probability distributions, which are widely used in statistics (Csiszár & Shields, 2004) and convenient for our analyses.

Definition 2.1 (Referenced χ²-distance). Let $R(z)$, $P(z)$, and $Q(z)$ be distributions supported on $\mathcal{Z}$. The χ²-distance between $P(z)$ and $Q(z)$ referenced by $R(z)$ is defined as
$$\chi^2_R(P, Q) \triangleq \sum_{z \in \mathcal{Z}} \frac{(P(z) - Q(z))^2}{R(z)}. \quad (1)$$

Definition 2.2 (Hellinger distance). Let $P(z)$ and $Q(z)$ be distributions supported on $\mathcal{Z}$. The Hellinger distance between $P(z)$ and $Q(z)$ is defined as
$$H^2(P, Q) \triangleq \frac{1}{2}\sum_{z \in \mathcal{Z}} \Big(\sqrt{P(z)} - \sqrt{Q(z)}\Big)^2, \quad \text{and} \quad H(P, Q) \triangleq \sqrt{H^2(P, Q)}. \quad (2)$$

¹ This assumption comes from the fact that in practice such joint distributions are typically modeled by some positive parameterized families, e.g., the softmax function.
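As a concrete illustration of Definitions 2.1 and 2.2, both distances can be computed directly for discrete distributions. This is a minimal sketch with our own function names, not code from the paper:

```python
import numpy as np

def chi2_ref(p, q, r):
    """Referenced chi^2-distance (Definition 2.1): sum_z (p(z) - q(z))^2 / r(z)."""
    p, q, r = (np.asarray(a, dtype=float) for a in (p, q, r))
    return float(np.sum((p - q) ** 2 / r))

def hellinger_sq(p, q):
    """Squared Hellinger distance (Definition 2.2): 0.5 * sum_z (sqrt(p) - sqrt(q))^2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
```

Both quantities vanish exactly when $P = Q$, and the squared Hellinger distance is bounded by 1, which is one reason these measures are more convenient than the K-L divergence in the minimax analysis below.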

2.2. MINIMAX FORMULATION

Since estimating the similarity between target and source tasks from data is challenging, we only assume that the distance between the two tasks is bounded by some constant $D$ under a distance measure $d(\cdot, \cdot)$, i.e.,
$$d\big(P^{(T)}_{XY}, P^{(S)}_{XY}\big) \leq D. \quad (3)$$
Under this assumption, we formulate the following minimax problem:
$$\min_{Q_{XY}(\hat{P}^{(T)}_{XY}, \hat{P}^{(S)}_{XY})} \; \max_{P^{(S)}_{XY}:\, d(P^{(T)}_{XY}, P^{(S)}_{XY}) \leq D} \; \mathbb{E}\Big[d\big(P^{(T)}_{XY}, Q_{XY}\big)\Big], \quad (4)$$
where the expectation is taken over all possible $\hat{P}^{(T)}_{XY}$ and $\hat{P}^{(S)}_{XY}$ in $\mathcal{P}_{n_T}$ and $\mathcal{P}_{n_S}$. This formulation has two parts: (1) for a given estimator $Q_{XY}$, we consider the largest expected population risk (EPR), measured by the distance between the underlying target distribution and the estimator; (2) we find the best estimator $Q_{XY}$, as a function of the training data, that minimizes this worst-case risk. Note that the empirical distributions are sufficient statistics for the underlying distributions (Van der Vaart, 2000); we therefore consider $Q_{XY}$ as a function of both empirical distributions. Accordingly, the EPR of the derived estimator under the true similarity is always smaller than the value of formulation (4). In other words, we design an estimator with an upper-bounded EPR, which leads to robustness. Formulation (4) is generally difficult to solve analytically because: (i) the distance measure $d(\cdot, \cdot)$ can cause computational difficulty, e.g., the logarithm function in the K-L divergence; (ii) the expectations over $\hat{P}^{(T)}_{XY}$ and $\hat{P}^{(S)}_{XY}$ follow multinomial distributions (Csiszár, 1998), i.e., the probability of an empirical distribution satisfies $P(\hat{P}^{(i)}_{XY}; P^{(i)}_{XY}) \propto \exp(-n_i D(\hat{P}^{(i)}_{XY} \| P^{(i)}_{XY}))$, which is complicated to analyze. To address issue (i), we choose the χ²-distance and the Hellinger distance, which are more convenient for minimax analyses. For issue (ii), we study the surrogate problem that replaces the expectation computation in (4) by the integral
$$\int d\big(P^{(T)}_{XY}, Q_{XY}\big) \prod_{i=T,S} \exp\Big(-n_i\, d\big(\hat{P}^{(i)}_{XY}, P^{(i)}_{XY}\big)\Big)\, \mathrm{d}\hat{P}^{(i)}_{XY}. \quad (5)$$
Note that (5) is an asymptotic approximation of the expectation over multinomial distributions, with the additional surrogation that the exponent in (5) can be chosen to differ from the K-L divergence. Such asymptotic approximations are also applied in theoretical analyses in high-dimensional statistics (Morris, 1975). The goal of this paper is then to study the following minimax problems for transfer learning.

Formulation 1 (referenced χ²-distance):
$$\min_{Q_{XY}(\hat{P}^{(T)}_{XY}, \hat{P}^{(S)}_{XY})} \; \max_{P^{(S)}_{XY}:\, \chi^2_R(P^{(T)}_{XY}, P^{(S)}_{XY}) \leq D^2} \; \mathbb{E}\big[\chi^2_R\big(P^{(T)}_{XY}, Q_{XY}\big)\big], \quad (6)$$
where the expectation is the integral over
$$P\big(\hat{P}^{(i)}_{XY}; P^{(i)}_{XY}\big) \propto \exp\Big(-\frac{n_i}{2}\chi^2_R\big(\hat{P}^{(i)}_{XY}, P^{(i)}_{XY}\big)\Big), \quad i = T, S. \quad (7)$$
Note that the referenced χ²-distance can be recognized as an asymptotic approximation of the K-L divergence by Lemma A.1 in Appendix A. Here the reference distribution $R$ is selected as $\hat{P}^{(S)}_{XY}$.²

Formulation 2 (Hellinger distance):
$$\min_{Q_{XY}(\hat{P}^{(T)}_{XY}, \hat{P}^{(S)}_{XY})} \; \max_{P^{(S)}_{XY}:\, H(P^{(T)}_{XY}, P^{(S)}_{XY}) \leq D} \; \mathbb{E}\big[H^2\big(P^{(T)}_{XY}, Q_{XY}\big)\big], \quad (8)$$
where the expectation is the integral over
$$P\big(\hat{P}^{(i)}_{XY}; P^{(i)}_{XY}\big) \propto \exp\Big(-2 n_i H^2\big(\hat{P}^{(i)}_{XY}, P^{(i)}_{XY}\big)\Big), \quad i = T, S. \quad (9)$$
Note that the Hellinger distance provides a lower bound of the K-L divergence by Lemma A.2 in Appendix A, and thus Formulation 2 computes a lower bound of the population risk in (4).

3. ANALYSES FOR DISCRETE DATA

In this section, we provide the analytical solutions of formulations (6) and (8). Similar minimax estimation problems have been studied in early works (Trybula, 1958; Berry, 1990). We directly give the detailed expressions here; the proofs are provided in the supplementary material.

3.1. ANALYTICAL SOLUTION OF FORMULATION 1

Theorem 3.1. Let $Q^{(1)}_{XY}$ be the estimator that achieves the minimax solution of problem (6). Then³
$$Q^{(1)}_{XY}(x, y) = (1 - \alpha_1)\hat{P}^{(T)}_{XY}(x, y) + \alpha_1 \hat{P}^{(S)}_{XY}(x, y), \quad (10)$$
for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$, where
$$\alpha_1 \triangleq \frac{n_S}{n_T + n_S}\left(1 - \frac{I_{\frac{|\mathcal{X}||\mathcal{Y}|}{2}}\big(\frac{n_T n_S D \hat{D}_1}{n_T + n_S}\big)}{I_{\frac{|\mathcal{X}||\mathcal{Y}|}{2} - 1}\big(\frac{n_T n_S D \hat{D}_1}{n_T + n_S}\big)}\cdot\frac{D}{\hat{D}_1}\right), \quad \text{and} \quad \hat{D}_1^2 \triangleq \chi^2_R\big(\hat{P}^{(T)}_{XY}, \hat{P}^{(S)}_{XY}\big). \quad (11)$$
Specifically, $I_\nu(\cdot)$ denotes the modified Bessel function of the first kind of order $\nu$ (Abramowitz et al., 1988), whose definition is given in Appendix B. For ease of presentation, in the remainder of this paper we denote $J_\nu(x) \triangleq I_{\frac{\nu}{2}}(x)/I_{\frac{\nu}{2}-1}(x)$.

Theorem 3.1 implies that linearly combining the learning results of different tasks is a preferable method for robust transfer learning, a strategy also widely used in existing algorithms and theoretical frameworks (Ben-David et al., 2010). While those works assume it intuitively, we provide theoretical support that helps explain its rationality.

Remark 3.2. To understand the factors in (11), consider the special regime $\nu \gg x$, where $J_\nu(x) \sim x/\nu$. Then (11) can be approximated by
$$\alpha_1 \sim \frac{n_S}{n_T + n_S}\left(1 - \frac{n_T n_S D^2}{(n_S + n_T)|\mathcal{X}||\mathcal{Y}|}\right). \quad (12)$$
This result is consistent with Eq. (6) in (Tong et al., 2021) under this special regime, as explained in Appendix C. The coefficient $\alpha_1$, which represents the reliance on source samples, is positively associated with the model dimensionality $|\mathcal{X}||\mathcal{Y}|$, since we learn all $|\mathcal{X}||\mathcal{Y}|$ entries of the target distribution, and negatively associated with the target sample size $n_T$ and the task distance $D$. These relationships are examined in the experimental part. We also provide an interesting geometric explanation of this pattern in Appendix C.
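To make Theorem 3.1 concrete, the coefficient $\alpha_1$ and the combined estimator can be evaluated numerically. This is a sketch with our own naming, using scipy's `iv` routine for the modified Bessel function $I_\nu$:

```python
import numpy as np
from scipy.special import iv  # modified Bessel function of the first kind, I_nu

def J(nu, x):
    """Bessel ratio J_nu(x) = I_{nu/2}(x) / I_{nu/2 - 1}(x) from Theorem 3.1."""
    return iv(nu / 2.0, x) / iv(nu / 2.0 - 1.0, x)

def alpha1(n_t, n_s, D, D1, dim):
    """Combining coefficient of (11); dim plays the role of |X||Y|."""
    x = n_t * n_s * D * D1 / (n_t + n_s)
    return n_s / (n_t + n_s) * (1.0 - (D / D1) * J(dim, x))

def combine(p_t, p_s, a):
    """Minimax estimate (10): (1 - alpha_1) * target + alpha_1 * source."""
    return (1.0 - a) * np.asarray(p_t) + a * np.asarray(p_s)
```

With few target samples and a small task distance, $\alpha_1$ stays close to $n_S/(n_T + n_S)$, i.e., the estimate leans on the source task, matching the interpretation in Remark 3.2.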

3.2. ANALYTICAL SOLUTION OF FORMULATION 2

In the following, we provide the solution of problem (8) based on the Hellinger distance.

Theorem 3.3. Let $Q^{(2)}_{XY}$ be the estimator that achieves the minimax solution of problem (8). Then⁴
$$Q^{(2)}_{XY}(x, y) = \Big((1 - \alpha_2)\sqrt{\hat{P}^{(T)}_{XY}(x, y)} + \alpha_2\sqrt{\hat{P}^{(S)}_{XY}(x, y)}\Big)^2,$$
for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$, where
$$\alpha_2 = \frac{n_S}{n_T + n_S}\left(1 - \frac{D}{\hat{D}_2} J_{|\mathcal{X}||\mathcal{Y}|}\Big(\frac{4 n_S n_T D \hat{D}_2}{n_T + n_S}\Big)\right), \quad \text{and} \quad \hat{D}_2 \triangleq H\big(\hat{P}^{(T)}_{XY}, \hat{P}^{(S)}_{XY}\big).$$
Accordingly, we obtain a similar interpretation of the affecting factors as in section 3.1.
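The Hellinger-distance solution can likewise be evaluated numerically. A sketch under our own naming; note the combination happens in the square-root (Hellinger) domain, as in Theorem 3.3:

```python
import numpy as np
from scipy.special import iv  # modified Bessel function of the first kind, I_nu

def J(nu, x):
    """Bessel ratio J_nu(x) = I_{nu/2}(x) / I_{nu/2 - 1}(x)."""
    return iv(nu / 2.0, x) / iv(nu / 2.0 - 1.0, x)

def alpha2(n_t, n_s, D, D2, dim):
    """Combining coefficient of Theorem 3.3; dim plays the role of |X||Y|."""
    x = 4.0 * n_t * n_s * D * D2 / (n_t + n_s)
    return n_s / (n_t + n_s) * (1.0 - (D / D2) * J(dim, x))

def hellinger_combine(p_t, p_s, a):
    """Combine the empirical distributions in the square-root domain, then square."""
    return ((1.0 - a) * np.sqrt(np.asarray(p_t)) + a * np.sqrt(np.asarray(p_s))) ** 2
```

When the two empirical distributions coincide, the combination returns them unchanged for any coefficient, since the square-root mixture collapses to the common square root.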

4. CONTINUOUS CASE AND ALGORITHM

In this section, we extend the previous analyses from discrete to continuous data, which conforms to practical settings. In such cases, the previously adopted empirical distributions cannot be viewed as valid observations due to the infinite cardinality of $\mathcal{X}$, where most of the possible data points are never sampled. To apply the previous analysis framework, we consider the retrain-head method (Nguyen et al., 2020), a commonly used transfer learning technique. With this method, a pre-trained network extracts features of the data, and then the topmost layer (known as the "head") is retrained with the observed samples. Under this setting, we can treat the weights of the retrained topmost layers as observations from the corresponding tasks. Note that, compared with the cardinality of the large data space, the topmost layer has far fewer parameters. In detail, a pre-trained network is composed of two parts: (a) the previous layers, whose input is the data $x$ and output is the $d$-dimensional feature $f(x) \in \mathbb{R}^d$, and (b) the topmost layer for linear classification, with weights $g(y) \in \mathbb{R}^d$ for each label $y$. Each task provides its learned weights to estimate the optimal weights for the target task. Moreover, the learned weights are determined by the model proposed for describing the data. For the theoretical analyses, we introduce discriminative models for the χ²-distance and the Hellinger distance, from which we design practical algorithms. For convenience, we use the notation $h$ to represent the topmost layer in section 4.2.

4.1. REVISED FORMULATION 1 AND MM-χ 2 ALGORITHM

When the χ²-distance is chosen as the distance measure, we consider the discriminative model for the target distribution in the factorization form
$$Q^{(f,g)}_{Y|X}(y|x) \triangleq P^{(T)}_Y(y)\big(1 + f^{\mathrm{T}}(x)\, g(y)\big), \quad (13)$$
which gives the probability of each label $y \in \mathcal{Y}$ for any data $x$. Such a model has been introduced in factorization machines (Rendle, 2010) and is commonly used in natural language processing (Levy & Goldberg, 2014). Given the pre-trained feature extractor $f^*(\cdot)$, the learned weights of the topmost layers are obtained by minimizing the distance between the empirical distribution and the model. For computation, we avoid using the joint distribution as the reference and define the χ²-distance referenced by the product of marginal distributions, i.e., $\chi^2_M(\cdot, \cdot) \triangleq \chi^2_{P^{(T)}_X P^{(T)}_Y}(\cdot, \cdot)$. Then, the learned weights $\hat{g}_i$ of each task $i = T, S$ are defined as
$$\hat{g}_i \triangleq \arg\min_{g} \chi^2_M\big(\hat{P}^{(i)}_{XY},\, P^{(T)}_X Q^{(f^*,g)}_{Y|X}\big). \quad (14)$$
Now $\hat{g}_T$ and $\hat{g}_S$ are the observations used to generate the minimax solution, where the expectations are defined as $g_i(y) = \mathbb{E}[\hat{g}_i(y)]$. Note that the parameters of $g_T$ are exactly the topmost weights we hope to obtain. The minimax problem can then be defined as follows [cf. (6)]:
$$g^* = \arg\min_{g(\hat{g}_T, \hat{g}_S)} \max_{g_S \in \mathcal{G}} \mathbb{E}\big[\chi^2_M\big(P^{(T)}_X Q^{(f^*,g_T)}_{Y|X},\, P^{(T)}_X Q^{(f^*,g)}_{Y|X}\big)\big], \quad (15)$$
where $\mathcal{G} \triangleq \big\{g_S : \chi^2_M\big(P^{(T)}_X Q^{(f^*,g_T)}_{Y|X},\, P^{(T)}_X Q^{(f^*,g_S)}_{Y|X}\big) \leq D^2\big\}$. We can directly apply Theorem 3.1 and obtain the following theorem.

Theorem 4.1. When the empirical distributions $\hat{P}^{(T)}_{XY}$ and $\hat{P}^{(S)}_{XY}$ follow the density function (7), the minimax solution of (15) is⁵
$$g^* = (1 - \hat{\alpha}_1)\hat{g}_T + \hat{\alpha}_1 \hat{g}_S, \quad (16)$$
where
$$\hat{\alpha}_1 = \frac{n_S}{n_T + n_S}\left(1 - \frac{D}{\hat{D}_1} J_{d|\mathcal{Y}|}\Big(\frac{n_T n_S D \hat{D}_1}{n_T + n_S}\Big)\right), \quad \text{and} \quad \hat{D}_1^2 \triangleq \chi^2_M\big(P^{(T)}_X Q^{(f^*,\hat{g}_T)}_{Y|X},\, P^{(T)}_X Q^{(f^*,\hat{g}_S)}_{Y|X}\big).$$
Algorithm 1 Minimax χ²-Algorithm (MM-χ²)
1: Input: target and source data samples $\{(x^{(i)}_\ell, y^{(i)}_\ell)\}_{\ell=1}^{n_i}$ ($i = T, S$), learning rate $\eta$
2: Randomly initialize $\alpha$, $f^*$, $g^*$
3: repeat
4:   $(f^*, g^*) \leftarrow (f^*, g^*) - \eta\nabla_{(f,g)} L_1(\alpha, f^*, g^*)$
5:   $\alpha \leftarrow \frac{n_S}{n_T+n_S}\Big(1 - \frac{D}{\hat{D}_1} J_{d|\mathcal{Y}|}\big(\frac{n_T n_S D \hat{D}_1}{n_T+n_S}\big)\Big)$
6: until $f^*$, $g^*$ converge
7: return $f^*$, $g^*$

Accordingly, we can design an algorithm based on Theorem 4.1. Although the theoretical analysis fixes the feature extractor, our algorithm jointly optimizes the feature extractor $f$ and the topmost layer $g$, which is a typical retraining procedure. It can be shown that the linearly combined weights (16) are obtained by minimizing the linearly combined training loss with the same coefficient. We therefore define
$$L_1(\alpha, f, g) \triangleq (1 - \alpha)\,\chi^2_M\big(\hat{P}^{(T)}_{XY},\, P^{(T)}_X Q^{(f,g)}_{Y|X}\big) + \alpha\,\chi^2_M\big(\hat{P}^{(S)}_{XY},\, P^{(T)}_X Q^{(f,g)}_{Y|X}\big). \quad (17)$$
The MM-χ² algorithm is given in Algorithm 1. In practice, the loss $L_1(\alpha, f, g)$ and the related quantities $D$ and $\hat{D}_1$ in Theorem 4.1 can be estimated from the empirical means of the sample features. Detailed implementations are provided in the supplementary material. With the $f^*$ and $g^*$ computed by Algorithm 1, the predicted label $\hat{y}(x)$ for a sample $x$ is given by the maximum a posteriori (MAP) decision rule $\hat{y}(x) = \arg\max_{y \in \mathcal{Y}} Q^{(f^*, g^*)}_{Y|X}(y|x)$.
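The loss in line 4 can be implemented through the surrogate objective of Xu & Huang (2020) referenced in the supplementary material, which replaces each χ² term with a maximizable score of centered features. A minimal numpy sketch with our own naming (in the actual algorithm, F and G come from neural network layers updated by gradient steps):

```python
import numpy as np

def h_score(F, G):
    """Surrogate of the chi^2_M objective: E[f~(X)^T g~(Y)] - 0.5 tr(Cov_f Cov_g),
    where f~ and g~ are the centered features/label embeddings; maximizing this
    score corresponds to minimizing the chi^2 loss.
    F: (n, d) sample features; G: (n, d) label embeddings g(y_l)."""
    F = F - F.mean(axis=0)
    G = G - G.mean(axis=0)
    corr = float(np.mean(np.sum(F * G, axis=1)))
    cov_f = F.T @ F / len(F)
    cov_g = G.T @ G / len(G)
    return corr - 0.5 * float(np.trace(cov_f @ cov_g))

def l1_objective(alpha, F_t, G_t, F_s, G_s):
    """Combined objective for line 4: (1 - alpha) H^(T) + alpha H^(S)."""
    return (1.0 - alpha) * h_score(F_t, G_t) + alpha * h_score(F_s, G_s)
```

The coefficient α weights the source term exactly as in (16), so maximizing this combined score yields the linearly combined head weights of Theorem 4.1.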

4.2. REVISED FORMULATION 2 AND MM-HEL ALGORITHM

When the Hellinger distance is chosen as the distance measure, we consider the discriminative model for the target distribution in the following form. For each $i = T, S$,
$$\tilde{Q}^{(i,f,h)}_{Y|X}(y|x) \triangleq P^{(i)}_Y(y)\big(1 + f^{\mathrm{T}}(x)\, h(y)\big). \quad (18)$$
This model is a deformation of (13), which makes the model trainable under the Hellinger distance. Similarly, given the pre-trained feature extractor $f^*$, the learned weights $\hat{h}_i$ of each task $i = T, S$ are defined as
$$\hat{h}_i \triangleq \arg\min_{h} H^2\big(\hat{P}^{(i)}_{XY},\, P^{(i)}_X \tilde{Q}^{(i,f^*,h)}_{Y|X}\big). \quad (19)$$
Now $\hat{h}_T$ and $\hat{h}_S$ are the observations used to generate the minimax solution, where the expectations are defined as $h_i(y) = \mathbb{E}[\hat{h}_i(y)]$, $i = T, S$. The minimax problem can be defined as follows [cf. (8)]:
$$h^* \triangleq \arg\min_{h(\hat{h}_T, \hat{h}_S)} \max_{h_S \in \mathcal{H}} \mathbb{E}\Big[H^2\big(P^{(T)}_X \tilde{Q}^{(T,f^*,h_T)}_{Y|X},\, P^{(T)}_X \tilde{Q}^{(T,f^*,h)}_{Y|X}\big)\Big], \quad (20)$$
where
$$\mathcal{H} \triangleq \Big\{h : \frac{1}{2}\sum_{y \in \mathcal{Y}} \Big\|\sqrt{P^{(T)}_Y(y)}\,\Lambda_T^{\frac{1}{2}} h_T(y) - \sqrt{P^{(S)}_Y(y)}\,\Lambda_S^{\frac{1}{2}} h(y)\Big\|^2 \leq D^2\Big\},$$
and $\Lambda_i \triangleq \mathbb{E}_{P^{(i)}_X}[f^*(X) f^{*\mathrm{T}}(X)]$ for $i = T, S$. We can directly apply Theorem 3.3 and obtain the following theorem.

Theorem 4.2. When the empirical distributions $\hat{P}^{(T)}_{XY}$ and $\hat{P}^{(S)}_{XY}$ follow the density function (9), the minimax solution of (20) is
$$h^*(y) = (1 - \hat{\alpha}_2)\hat{h}_T(y) + \hat{\alpha}_2 \sqrt{P^{(S)}_Y(y)/P^{(T)}_Y(y)}\;\Lambda_T^{-\frac{1}{2}}\Lambda_S^{\frac{1}{2}}\, \hat{h}_S(y)$$
for all $y \in \mathcal{Y}$, where
$$\hat{\alpha}_2 = \frac{n_S}{n_T + n_S}\left(1 - \frac{D}{\hat{D}_2} J_{d|\mathcal{Y}|}\Big(\frac{4 n_T n_S D \hat{D}_2}{n_T + n_S}\Big)\right), \;\; \text{and} \;\; \hat{D}_2^2 \triangleq \frac{1}{2}\sum_{y \in \mathcal{Y}} \Big\|\sqrt{P^{(T)}_Y(y)}\,\Lambda_T^{\frac{1}{2}} \hat{h}_T(y) - \sqrt{P^{(S)}_Y(y)}\,\Lambda_S^{\frac{1}{2}} \hat{h}_S(y)\Big\|^2. \quad (21)$$

Algorithm 2 Minimax Hellinger-Algorithm (MM-Hel)
1: Input: target and source data samples $\{(x^{(i)}_\ell, y^{(i)}_\ell)\}_{\ell=1}^{n_i}$ ($i = T, S$)
2: $(f^*, h^*_1, h^*_2) \leftarrow \arg\min_{f, h_1, h_2} L_2(f, h_1, h_2)$
3: $\alpha \leftarrow \frac{n_S}{n_T+n_S}\Big(1 - \frac{D}{\hat{D}_2} J_{d|\mathcal{Y}|}\big(\frac{4 n_T n_S D \hat{D}_2}{n_T+n_S}\big)\Big)$
4: $h^*(y) \leftarrow (1 - \alpha) h^*_1(y) + \alpha\sqrt{\frac{P^{(S)}_Y(y)}{P^{(T)}_Y(y)}}\,\Lambda_T^{-\frac{1}{2}}\Lambda_S^{\frac{1}{2}} h^*_2(y)$
5: return $f^*$, $h^*$

Similarly, we can design an algorithm based on Theorem 4.2. Note that we cannot apply the linearly combined training loss in the Hellinger distance setting, due to the two different distribution models $\tilde{Q}^{(T,f,h)}_{Y|X}$ and $\tilde{Q}^{(S,f,h)}_{Y|X}$. We instead jointly train the shared feature extractor $f$ and the individual topmost layers of the target and source tasks. The training loss is defined as
$$L_2(f, h_1, h_2) \triangleq H^2\big(\hat{P}^{(T)}_{XY},\, P^{(T)}_X \tilde{Q}^{(T,f,h_1)}_{Y|X}\big) + H^2\big(\hat{P}^{(S)}_{XY},\, P^{(S)}_X \tilde{Q}^{(S,f,h_2)}_{Y|X}\big). \quad (22)$$
The MM-Hel algorithm is given in Algorithm 2. Similarly, we provide the estimation of the related quantities in the supplementary material. With the computed $f^*$ and $h^*$, the predicted label $\hat{y}(x)$ for a sample $x$ is given by the MAP decision rule $\hat{y}(x) = \arg\max_{y \in \mathcal{Y}} \tilde{Q}^{(T,f^*,h^*)}_{Y|X}(y|x)$.
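Line 4 of Algorithm 2 aligns the source head with the target feature geometry before mixing. A numpy sketch under our own naming, with the Λ matrices estimated from sample features and matrix square roots computed by eigendecomposition:

```python
import numpy as np

def sqrtm_psd(M):
    """Symmetric PSD matrix square root via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

def combine_heads(h_t, h_s, alpha, F_t, F_s, p_y_t, p_y_s):
    """h*(y) = (1 - alpha) h_T(y)
             + alpha sqrt(P_S(y)/P_T(y)) Lambda_T^{-1/2} Lambda_S^{1/2} h_S(y).
    h_t, h_s: (|Y|, d) head weights; F_t, F_s: (n_i, d) features of each task."""
    lam_t = F_t.T @ F_t / len(F_t)   # Lambda_T = E[f*(X) f*(X)^T] on target data
    lam_s = F_s.T @ F_s / len(F_s)
    W = np.linalg.inv(sqrtm_psd(lam_t)) @ sqrtm_psd(lam_s)
    scale = np.sqrt(np.asarray(p_y_s) / np.asarray(p_y_t))[:, None]
    return (1.0 - alpha) * h_t + alpha * scale * (h_s @ W.T)
```

When both tasks share the same feature covariance and label marginals, the transform reduces to the identity and line 4 collapses to a plain linear combination, mirroring the discrete case of Theorem 3.3.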

5. EXPERIMENTS

To validate the theoretical analyses in Theorem 3.1 and Theorem 3.3, as well as the robustness of our algorithms, we conduct a series of experiments on common image recognition datasets, including CIFAR-10 (Krizhevsky et al., 2009), Office-31, and Office-Caltech (Gong et al., 2012b). For convenience, different transfer settings are denoted by "source→target".

5.1. CIFAR-10

We conduct transfer learning experiments on the CIFAR-10 dataset to verify the theoretical interpretations of the related factors in Remark 3.2, mainly the sample size and the task distance. The CIFAR-10 dataset contains 50,000 training images and 10,000 testing images in 10 classes. We first construct the source and target tasks by dividing the original CIFAR-10 dataset into five disjoint sub-datasets, each containing two classes of the original data and corresponding to a binary classification task. We choose one as our target task (task 1) and use the other four as source tasks, referred to as tasks 2, 3, 4, and 5, establishing four transfer learning tasks. In each transfer learning task, we use 2000 source images, with 1000 images per binary class. The target sample size n is set to n = 12, 20, 60, 200 for the four sub-tasks. Throughout this experiment, the feature f has dimensionality d = 10 and is generated by GoogLeNet (Szegedy et al., 2015) pre-trained on ImageNet (Russakovsky et al., 2015), followed by a fully connected layer. The accuracies on the target testing images of the MM-χ² and MM-Hel algorithms are summarized in Table 1 and Table 2. In each task, target samples are randomly picked from the target training set. All accuracies and standard deviations are reported over five random selections of target samples. Figure 1 shows the changes in accuracy and in the coefficient $\hat{\alpha}_1$ defined in (16) (averaged over 5 tests) for different target sample sizes. The coefficient $\hat{\alpha}_1$ represents how much the final model relies on the source task. Consistent with our interpretation in Remark 3.2, a larger target sample size leads to less dependency on the source task, while the accuracies become higher and more stable. We also analyze the effect of task distances in Figure 2.
Figure 2 shows the changes in accuracy and in the task distance $\hat{D}_2$ defined in (21) (averaged over 5 tests) for different source tasks, where a larger task distance leads to worse accuracy and stability.

5.2. OFFICE-31

The Office-31 dataset contains images of 31 categories in 3 sub-datasets: Amazon (A, 2817 images), Dslr (D, 498 images), and Webcam (W, 795 images). Six transfer tasks can be established: A→D, A→W, D→W, D→A, W→A, and W→D. We adopt the transfer learning setting of (Tzeng et al., 2015), illustrated as follows. Specifically, 3 target samples per category are used for training, and the training sample size (per category) for the source task is set to 20 or 8, depending on whether the source task is Amazon or not. In this experiment, the feature f is extracted by the VGG-16 (Simonyan & Zisserman, 2014) network pre-trained on ImageNet, followed by fully connected layers with a 64-dimensional output. We introduce the UDDA (Motiian et al., 2017) algorithm as the typical baseline, and the iterative linear combination method (ILCM) (Tong et al., 2021), which employs a similar linear combination strategy, for comparison. Table 3 summarizes test accuracies under different transfer settings, where all reported accuracies and standard deviations are averaged over five train-test splits. The results indicate that our algorithms generally achieve higher robustness and competitive accuracies.

5.3. OFFICE-CALTECH

The Office-Caltech dataset is composed of 10 categories in four sub-datasets: Amazon (A, 958 images), Caltech (C, 1123 images), Webcam (W, 295 images), and Dslr (D, 157 images). We focus on the six transfer settings involving C, i.e., A→C, W→C, D→C, C→A, C→W, and C→D. The train-test split is as introduced in (Gong et al., 2012a). The feature f is based on the pre-trained DeCAF network (Donahue et al., 2014), followed by fully connected layers with output dimension d = 10. Table 4 shows the performances in comparison with the CPNN (Ding et al., 2018) and ILCM (Tong et al., 2021) algorithms, where CPNN is chosen as the baseline to be consistent with ILCM.

6. RELATED WORK

6.1. MINIMAX ESTIMATION

The minimax estimator is a significant theme in statistical decision theory, which deals with the problem of estimating a deterministic parameter in a certain family (Hodges & Lehmann, 2012). Under the special setting of a bounded normal mean, many works study the analytical solution when the centers of Gaussian observations are restricted, including analytical solutions for 1-dimensional observations (Casella & Strawderman, 1981) and high-dimensional observations (Berry, 1990; Marchand & Perron, 2002). Moreover, the objective function can also be measured by norms other than the mean square error (Bischoff et al., 1995), which allows applications in machine learning scenarios and helps derive the solutions of this paper's formulations.

6.2. SELECTION OF DISTANCE MEASURE

In this paper, we select the χ²-distance and the Hellinger distance as the distance measures in (3), which allow the minimax problem to be solved analytically. Both measures belong to the family of f-divergences (Csiszár & Shields, 2004) and are widely used in machine learning. Specifically, the χ²-distance leads to the typical alternating conditional expectation algorithm (Xu & Huang, 2020), and the Hellinger distance is also used to evaluate domain adaptation in transfer learning (Baktashmotlagh et al., 2014; 2016). Moreover, most existing measures involving non-linear functions, e.g., the K-L divergence containing the logarithm, are intractable in our minimax analysis, as explained in section 2.2.

6.3. MINIMAX TRANSFER LEARNING AND ROBUSTNESS

The minimax principle has been widely used in transfer learning to promote the robustness of algorithms (Verdu & Poor, 1984). The most common empirical methods are connected to adversarial learning (Shafahi et al., 2019), including maximizing the training loss of an adversarial classifier (Tzeng et al., 2017) and maximizing the discrepancy between classifiers' outputs (Saito et al., 2018). Meanwhile, transfer learning settings naturally imply minimax optimization problems in view of the relationship between target and source tasks (Zhang et al., 2019). Recent research reveals that the maximization can be taken over constraints on the distribution shift (Lei et al., 2021), the similarity between neural network parameters (Kalan et al., 2020), and optional source tasks (Cai & Wei, 2021).

7. CONCLUSION

This paper introduces a minimax framework for transfer learning based on a bounded task distance assumption. We provide the analytical solution of the minimax problem and characterize the roles of sample sizes, task distance, and model dimensionality in knowledge transfer. In addition, we develop robust transfer learning algorithms based on the theoretical analyses. Experiments on practical tasks show the robustness and effectiveness of our proposed algorithms.

B PROOF OF THEOREM 3.1 AND THEOREM 3.3

First, the standard definition of the modified Bessel function of the first kind is
$$I_\nu(x) = \sum_{m=0}^{\infty} \frac{1}{m!\,\Gamma(m + \nu + 1)}\Big(\frac{x}{2}\Big)^{2m+\nu},$$
where $\Gamma(\cdot)$ denotes the gamma function. To solve the minimax problems (6) and (8), we apply the minimax estimation of a bounded normal mean vector (Berry, 1990). The following lemma is a direct extension of the bounded normal mean results.

Lemma B.1. Let $x^*$ be the minimax estimator of the problem
$$x^* = \arg\min_{\hat{x}(y, w)} \max_{z:\, \|x - z\|^2 \leq D^2} \mathbb{E}\big[\|\hat{x}(y, w) - x\|^2\big].$$
Then the expression of $x^*$ is
$$x^* = \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2}\, w + \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2}\, y + \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2}\cdot\frac{I_{\frac{k}{2}}\big(\frac{D}{\sigma_1^2 + \sigma_2^2}\|y - w\|\big)}{I_{\frac{k}{2}-1}\big(\frac{D}{\sigma_1^2 + \sigma_2^2}\|y - w\|\big)}\cdot\frac{D}{\|y - w\|}\,(y - w).$$

Proof. First, we derive the posterior MMSE estimator for $x$ under the uniform prior distribution on the surface of the sphere.

Lemma B.2 (MMSE estimator).

When the means and variances are finite, the MMSE estimator for the parameter $x$ given the observation $y$ is uniquely defined and given by $\hat{x}(y) = \mathbb{E}[x \mid y]$. The likelihood of the observations $y, w$ is
$$P(y, w \mid x, z) = P(y \mid x)\, P(w \mid z) \propto \exp\Big(-\frac{\|y - x\|^2}{2\sigma_1^2}\Big)\exp\Big(-\frac{\|z - w\|^2}{2\sigma_2^2}\Big).$$
Let $\pi(x \mid z)$ be the uniform prior distribution on the surface of the sphere in $k$ dimensions with center $z$ and radius $D$. Let $t \triangleq x - z$, and denote the prior as $\pi(t) = \mathbb{1}_{\{\|t\| = D\}}(t)$. Under such a prior distribution,
$$P(y, w \mid x) \propto \int_{\mathbb{R}^k} \exp\Big(-\frac{\|y - x\|^2}{2\sigma_1^2}\Big)\exp\Big(-\frac{\|z - w\|^2}{2\sigma_2^2}\Big)\mathbb{1}_{\{\|x - z\| = D\}}(x - z)\,\mathrm{d}z.$$
Then, the posterior distribution is
$$P(x \mid y, w) \propto \int_{\mathbb{R}^k} \exp\Big(-\frac{\|y - x\|^2}{2\sigma_1^2}\Big)\exp\Big(-\frac{\|z - w\|^2}{2\sigma_2^2}\Big)\mathbb{1}_{\{\|x - z\| = D\}}(z)\,\mathrm{d}z$$
$$\propto \int_{\mathbb{R}^k} \exp\Big(-\frac{\|x - y\|^2}{2\sigma_1^2}\Big)\exp\Big(-\frac{\|x - (w + t)\|^2}{2\sigma_2^2}\Big)\mathbb{1}_{\{\|t\| = D\}}(t)\,\mathrm{d}t$$
$$\propto \int_{\mathbb{R}^k} \exp\Big(-\frac{1}{2}\Big(\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}\Big)\Big\|x - \frac{\sigma_2^2 y + \sigma_1^2(w + t)}{\sigma_1^2 + \sigma_2^2}\Big\|^2\Big)\cdot\exp\Big(-\frac{1}{2}\cdot\frac{\|y - (w + t)\|^2}{\sigma_1^2 + \sigma_2^2}\Big)\mathbb{1}_{\{\|t\| = D\}}(t)\,\mathrm{d}t,$$
which yields a Bayes estimator of the form
$$\hat{x}(y, w) = \frac{\sigma_2^2 y + \sigma_1^2 w}{\sigma_1^2 + \sigma_2^2} + \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2}\cdot\frac{A}{A'}\cdot\frac{D}{\|y - w\|}\,(y - w),$$
where $A$ and $A'$ are the normalization constants. Then, we prove that $x^*$ is the minimax estimator. Specifically, we first show that $\sigma_2^2 y + \sigma_1^2 w$ is independent of $y - w$. Since both random vectors are normal, we only need to show $\mathrm{Cov}(\sigma_2^2 y + \sigma_1^2 w,\, y - w) = 0$, i.e.,

$$\mathrm{Cov}(\sigma_2^2 y + \sigma_1^2 w,\, y - w) = \sigma_2^2 \mathrm{Var}(y) - \sigma_1^2 \mathrm{Var}(w) = \sigma_1^2\sigma_2^2 - \sigma_1^2\sigma_2^2 = 0.$$
We then define the risk function
$$R_{x^*}(t) \triangleq \mathbb{E}\Bigg[\bigg\|\frac{\sigma_2^2 y + \sigma_1^2 w}{\sigma_1^2 + \sigma_2^2} + \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2}\cdot\frac{I_{\frac{k}{2}}\big(\frac{D}{\sigma_1^2+\sigma_2^2}\|y - w\|\big)}{I_{\frac{k}{2}-1}\big(\frac{D}{\sigma_1^2+\sigma_2^2}\|y - w\|\big)}\cdot\frac{D}{\|y - w\|}(y - w) - x\bigg\|^2\Bigg]$$
$$= \frac{2\sigma_1^2\sigma_2^2}{\sigma_1^2 + \sigma_2^2} + \Big(\frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2}\Big)^2\cdot\mathbb{E}_{y-w\sim\mathcal{N}(t,\,\sigma_1^2+\sigma_2^2)}\Bigg[\bigg\|\frac{I_{\frac{k}{2}}\big(\frac{D}{\sigma_1^2+\sigma_2^2}\|y - w\|\big)}{I_{\frac{k}{2}-1}\big(\frac{D}{\sigma_1^2+\sigma_2^2}\|y - w\|\big)}\cdot\frac{D}{\|y - w\|}(y - w) - t\bigg\|^2\Bigg].$$

Lemma B.3 (Minimax Theorem, Marchand & Perron (2002)). The unique Bayes estimator $x^*$ is also the unique minimax estimator when $\max_t R_{x^*}(t) = \int R_{x^*}(t)\,\mathrm{d}\pi(t)$.

Let
$$R'(t) \triangleq \mathbb{E}_{(y-w)\sim\mathcal{N}(t,\,\sigma_1^2+\sigma_2^2)}\Bigg[\bigg\|\frac{I_{\frac{k}{2}}\big(\frac{D}{\sigma_1^2+\sigma_2^2}\|y-w\|\big)}{I_{\frac{k}{2}-1}\big(\frac{D}{\sigma_1^2+\sigma_2^2}\|y-w\|\big)}\cdot\frac{D}{\|y-w\|}(y - w) - t\bigg\|^2\Bigg].$$

Lemma B.4 (Berry (1990)). When $D/\sqrt{\sigma_1^2 + \sigma_2^2} \leq \sqrt{k}$, $\max_t R'(t) = \int R'(t)\,\mathrm{d}\pi(t)$.

Based on Lemma B.3 and Lemma B.4, $x^*$ is the minimax estimator. We denote $\Lambda_T \triangleq \mathbb{E}_{P^{(T)}_X}[f^*(X) f^{*\mathrm{T}}(X)]$ and $\Lambda_S \triangleq \mathbb{E}_{P^{(S)}_X}[f^*(X) f^{*\mathrm{T}}(X)]$.

D.1 REVISED FORMULATION 1

For revised Formulation 1 (15), we first give the expression of $\hat{g}_i$ as defined in (14) (Tong et al., 2021):
$$\hat{g}_i(y) = \frac{1}{P^{(T)}_Y(y)}\,\Lambda_T^{-1}\sum_{x\in\mathcal{X}} \hat{P}^{(i)}_{XY}(x, y)\, f^*(x).$$
Then, we define random vectors $u, v \in \mathbb{R}^{d|\mathcal{Y}|}$, where
$$u \triangleq \Big(\sqrt{P^{(T)}_Y(1)}\,\big(\Lambda_T^{\frac{1}{2}}\hat{g}_T(1)\big)^{\mathrm{T}}, \cdots, \sqrt{P^{(T)}_Y(|\mathcal{Y}|)}\,\big(\Lambda_T^{\frac{1}{2}}\hat{g}_T(|\mathcal{Y}|)\big)^{\mathrm{T}}\Big)^{\mathrm{T}},$$
$$v \triangleq \Big(\sqrt{P^{(T)}_Y(1)}\,\big(\Lambda_T^{\frac{1}{2}}\hat{g}_S(1)\big)^{\mathrm{T}}, \cdots, \sqrt{P^{(T)}_Y(|\mathcal{Y}|)}\,\big(\Lambda_T^{\frac{1}{2}}\hat{g}_S(|\mathcal{Y}|)\big)^{\mathrm{T}}\Big)^{\mathrm{T}}.$$
Their centers $u_0$ and $v_0$ are defined analogously with $g_T$ and $g_S$ in place of $\hat{g}_T$ and $\hat{g}_S$. With (7), we have
$$u \sim \mathcal{N}\Big(u_0, \frac{1}{n_T} I_{d|\mathcal{Y}|}\Big), \quad v \sim \mathcal{N}\Big(v_0, \frac{1}{n_S} I_{d|\mathcal{Y}|}\Big).$$
Problem (15) can then be re-defined as
$$\min_{w(u, v)} \max_{v_0:\, \|v_0 - u_0\|^2 \leq D^2} \mathbb{E}\big[\|u_0 - w\|^2\big],$$
and thus Theorem 4.1 is proved.
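The closed-form head weights above can be estimated directly from samples. A sketch with our own naming, in which all expectations are replaced by empirical averages:

```python
import numpy as np

def fit_head(F, y, n_classes):
    """Closed-form head g_i(y) = (1 / P_Y(y)) * Lambda^{-1} sum_x P(x, y) f(x),
    with all distributions replaced by empirical averages over the samples.
    F: (n, d) features f*(x_l); y: (n,) integer labels."""
    n, d = F.shape
    lam = F.T @ F / n                      # Lambda = E[f*(X) f*(X)^T]
    lam_inv = np.linalg.inv(lam)
    G = np.zeros((n_classes, d))
    for c in range(n_classes):
        p_y = np.mean(y == c)              # empirical P_Y(c)
        s = F[y == c].sum(axis=0) / n      # empirical sum_x P(x, c) f*(x)
        G[c] = lam_inv @ s / p_y
    return G
```

Running this on both tasks' samples yields the observations $\hat{g}_T$ and $\hat{g}_S$ to which the linear combination of Theorem 4.1 is applied.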

D.2 REVISED FORMULATION 2

For revised Formulation 2 (15), we give the expression of $\hat{h}_i$ as defined in (19),
$$\hat{h}_i(y) = \frac{1}{P^{(i)}_Y(y)}\Lambda_i^{-1}\sum_{x\in\mathcal{X}} P^{(i)}_{XY}(x,y)P^{(i)}_X(x)f^*(x).$$
Then, we can define random vectors $u, v \in \mathbb{R}^{d|\mathcal{Y}|}$ analogously to Appendix D.1, with $\hat{h}_T$ and $\hat{h}_S$ in place of $\hat{g}_T$ and $\hat{g}_S$, and with corresponding centers $u_0$ and $v_0$. With (7), we have $u \sim \mathcal{N}(u_0, \frac{1}{2n_T}I_{d|\mathcal{Y}|})$ and $v \sim \mathcal{N}(v_0, \frac{1}{2n_S}I_{d|\mathcal{Y}|})$. Problem (15) can be re-defined as
$$\min_{w(u,v)}\ \max_{v_0:\,\frac{1}{2}\|v_0-u_0\|^2 \le D^2}\ \mathbb{E}\Big[\frac{1}{2}\|u_0 - w\|^2\Big],$$
and the optimal estimator is $(1-\tilde{\alpha}_2)u + \tilde{\alpha}_2 v$. Note that $w$ here refers to the vector estimate of $u_0$.

Then, we define $\tilde{\Lambda}_f$ and $\tilde{\Lambda}_g$ as the covariance matrices of the features on target samples: $\tilde{\Lambda}_f \triangleq \mathbb{E}_{P^{(T)}_X}[\bar{f}(X)\bar{f}^{\mathrm{T}}(X)]$ and $\tilde{\Lambda}_g \triangleq \mathbb{E}_{P^{(T)}_Y}[\bar{g}(Y)\bar{g}^{\mathrm{T}}(Y)]$. In our implementations, all the computations in terms of the underlying distributions are replaced by the corresponding empirical distributions. As proved in Xu & Huang (2020), minimizing $\chi^2_M\big(P^{(i)}_{XY}, P^{(T)}_X Q^{(\bar{f},\bar{g})}_{Y|X}\big)$, where $Q^{(\bar{f},\bar{g})}_{Y|X}(y|x) \triangleq P^{(T)}_Y(y)\big(1+\bar{f}^{\mathrm{T}}(x)\bar{g}(y)\big)$, is equivalent to maximizing
$$H^{(i)}(\bar{f},\bar{g}) \triangleq \mathbb{E}_{P^{(i)}_{XY}}\big[\bar{f}^{\mathrm{T}}(X)\bar{g}(Y)\big] - \frac{1}{2}\mathrm{tr}\big(\tilde{\Lambda}_f\tilde{\Lambda}_g\big),$$
where $i = T, S$. Then, line 4 in Algorithm 1 can be implemented by
$$L_1(\alpha, f, g) \leftarrow (1-\alpha)H^{(T)}(\bar{f},\bar{g}) + \alpha H^{(S)}(\bar{f},\bar{g}). \quad (60)$$
Meanwhile, the distance bound $D$ is also estimated from samples. In the simplest way, we let $D = \tilde{D}_1$ ($D$ can actually be adjusted), where
$$\tilde{D}_1^2 = \chi^2_M\big(P^{(T)}_X Q^{(f^*,\hat{g}_T)}_{Y|X},\ P^{(T)}_X Q^{(f^*,\hat{g}_S)}_{Y|X}\big).$$
Then the covariance matrices are $\tilde{\Lambda}_{f_1} \triangleq \mathbb{E}_{P^{(T)}_X}[\bar{f}_1(X)\bar{f}_1^{\mathrm{T}}(X)]$, $\tilde{\Lambda}_{f_2} \triangleq \mathbb{E}_{P^{(S)}_X}[\bar{f}_2(X)\bar{f}_2^{\mathrm{T}}(X)]$, $\tilde{\Lambda}_{g_1} \triangleq \mathbb{E}_{P^{(T)}_Y}[\bar{g}_1(Y)\bar{g}_1^{\mathrm{T}}(Y)]$, (64) and $\tilde{\Lambda}_{g_2} \triangleq \mathbb{E}_{P^{(S)}_Y}[\bar{g}_2(Y)\bar{g}_2^{\mathrm{T}}(Y)]$. Still, all the computations in terms of the underlying distributions are replaced by the corresponding empirical distributions.
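The objective $H^{(i)}(\bar{f},\bar{g}) = \mathbb{E}[\bar{f}^{\mathrm{T}}(X)\bar{g}(Y)] - \frac{1}{2}\mathrm{tr}(\tilde{\Lambda}_f\tilde{\Lambda}_g)$ and the interpolated loss in (60) translate directly into code over empirical samples. The sketch below is an illustration, not the paper's implementation: the function names and the random feature arrays (stand-ins for learned network outputs) are my own.

```python
import numpy as np

def h_score(F, G):
    """Empirical H(f, g) = E[f(X)^T g(Y)] - 0.5 * tr(Lambda_f Lambda_g),
    where rows F[i] = f(x_i) and G[i] = g(y_i) are d-dimensional features."""
    corr = np.mean(np.sum(F * G, axis=1))   # E[f^T(X) g(Y)]
    cov_f = F.T @ F / len(F)                # empirical Lambda_f
    cov_g = G.T @ G / len(G)                # empirical Lambda_g
    return corr - 0.5 * np.trace(cov_f @ cov_g)

def loss_L1(alpha, F_t, G_t, F_s, G_s):
    """Line 4 of Algorithm 1, eq. (60): the interpolated H-score,
    negated so that minimizing the loss maximizes the score."""
    return -((1 - alpha) * h_score(F_t, G_t) + alpha * h_score(F_s, G_s))

# Illustrative random features: small target batch, larger source batch.
rng = np.random.default_rng(1)
F_t, G_t = rng.normal(size=(64, 8)), rng.normal(size=(64, 8))
F_s, G_s = rng.normal(size=(256, 8)), rng.normal(size=(256, 8))
val = loss_L1(0.3, F_t, G_t, F_s, G_s)
```

At `alpha = 0` the loss reduces to the target-only score, and at `alpha = 1` to the source-only score, matching the two extremes of the interpolation in (60).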



Since the source samples are sufficient, with high probability, all the entries of $\hat{P}^{(S)}_{XY}$ are positive. This solution actually requires $D/\sqrt{1/n_T + 1/n_S} \le \sqrt{|\mathcal{X}||\mathcal{Y}|}$, which can be easily guaranteed. For Formulation 2, it requires $D/\sqrt{1/(4n_T) + 1/(4n_S)} \le \sqrt{|\mathcal{X}||\mathcal{Y}|}$. In practice, when the assumption $D/\sqrt{1/n_T + 1/n_S} < \sqrt{d|\mathcal{Y}|}$ does not hold, the estimator can still provide a sub-optimal solution.



Figure 1: The accuracies and the coefficient $\alpha_1$ in transferring task 5 → 1 based on the MM-χ² algorithm under different target sample sizes.

Figure 2: The accuracies and the distance measure $\tilde{D}_2$ for target sample size $n_T = 12$ based on the MM-Hel algorithm under different source tasks.

Given two observations $y \sim \mathcal{N}(x, \sigma_1^2 I_k)$ and $w \sim \mathcal{N}(z, \sigma_2^2 I_k)$, where $I_k$ denotes the $k\times k$ identity matrix, suppose their centers satisfy $\|x-z\| \le D$ and $D/\sqrt{\sigma_1^2+\sigma_2^2} \le \sqrt{k}$. Let $x^*$ be the minimax estimator for the following minimax problem, i.e.,
$$x^* = \arg\min_{\hat{x}(y,w)}\ \max_{z:\,\|x-z\|\le D}\ \mathbb{E}\big[\|x-\hat{x}\|^2\big].$$
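In this two-observation setting, restricting attention to linear combinations $\hat{x} = (1-\alpha)y + \alpha w$ gives a worst-case risk that is a simple quadratic in $\alpha$, which can be minimized in closed form. The sketch below uses this simplified linear coefficient as an illustration only; the exact minimax coefficient in the paper is more involved (the implementation notes mention Bessel-function approximations), and all numbers here are hypothetical.

```python
import numpy as np

def worst_case_risk(alpha, s1, s2, D, k):
    # Worst-case E||x - x_hat||^2 over ||x - z|| <= D for the linear
    # estimator x_hat = (1 - alpha) * y + alpha * w:
    # (1-a)^2 * s1^2 * k + a^2 * (D^2 + s2^2 * k).
    return (1 - alpha) ** 2 * s1 ** 2 * k + alpha ** 2 * (D ** 2 + s2 ** 2 * k)

def best_linear_alpha(s1, s2, D, k):
    # Minimizer of the quadratic above -- a simplified stand-in for the
    # exact minimax coefficient. It grows with the variance of y and
    # shrinks as the center distance D or the variance of w grows.
    return s1 ** 2 * k / (s1 ** 2 * k + s2 ** 2 * k + D ** 2)

# Illustrative numbers: noisy target observation, cleaner source observation.
s1, s2, D, k = 1.0, 0.2, 1.5, 10
a = best_linear_alpha(s1, s2, D, k)
# Combining strictly beats using y alone (alpha = 0) in the worst case.
print(worst_case_risk(a, s1, s2, D, k) < worst_case_risk(0.0, s1, s2, D, k))  # prints True
```

This mirrors the qualitative statement of Lemma B.1: the combining coefficient depends on the two variances, the maximum center distance, and the dimension.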

Figure 3: A geometrical explanation of the minimax setting (6). Two balls centered at the underlying distributions represent all possible empirical distributions.

In this section, we provide the details of the loss function $L_1(\alpha, f, g)$ in line 4 of Algorithm 1, and the quantities $D$ and $\tilde{D}_1$ in line 5. Our main procedures follow the results in Xu & Huang (2020). As for the zero-mean assumption of $f^*$, which results in zero-mean weights $g$, we first define $\bar{f}(X) \triangleq f(X) - \mathbb{E}_{P^{(T)}_X}[f(X)]$ and $\bar{g}(Y) \triangleq g(Y) - \mathbb{E}_{P^{(T)}_Y}[g(Y)]$.

In this section, we provide the details of the loss function $L_2(f, h_1, h_2)$ in line 2 of Algorithm 2, and the quantities $D$ and $\tilde{D}_2$ in line 3. Similarly, we define $\bar{f}_1(X) \triangleq f(X) - \mathbb{E}_{P^{(T)}_X}[f(X)]$, $\bar{f}_2(X) \triangleq f(X) - \mathbb{E}_{P^{(S)}_X}[f(X)]$, $\bar{h}_1(Y) \triangleq h_1(Y) - \mathbb{E}_{P^{(T)}_Y}[h_1(Y)]$, and $\bar{h}_2(Y) \triangleq h_2(Y) - \mathbb{E}_{P^{(S)}_Y}[h_2(Y)]$.

Then, line 2 in Algorithm 2 can be implemented by $L_2(f, h_1, h_2) \leftarrow H^{(T)}(\bar{f}_1, \bar{h}_1) + H^{(S)}(\bar{f}_2, \bar{h}_2)$.
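The centering in the definitions above and the two-term loss of Algorithm 2 can be sketched as follows; the function names, feature shapes, and random arrays are illustrative assumptions, not the paper's code.

```python
import numpy as np

def center(F):
    """Subtract the empirical mean, e.g. f_bar(X) = f(X) - E_P[f(X)]."""
    return F - F.mean(axis=0, keepdims=True)

def h_score(F, G):
    # Empirical H(f, h) = E[f^T h] - 0.5 * tr(Lambda_f Lambda_h),
    # with second-moment matrices of the (already centered) features.
    cov_f = F.T @ F / len(F)
    cov_g = G.T @ G / len(G)
    return np.mean(np.sum(F * G, axis=1)) - 0.5 * np.trace(cov_f @ cov_g)

def loss_L2(F_t, H1_t, F_s, H2_s):
    """Line 2 of Algorithm 2: sum of target and source H-scores, each
    computed with features centered under their own distribution,
    negated so that minimizing the loss maximizes the scores."""
    return -(h_score(center(F_t), center(H1_t))
             + h_score(center(F_s), center(H2_s)))

# Illustrative random features for target and source batches.
rng = np.random.default_rng(2)
F_t, H1_t = rng.normal(size=(32, 4)), rng.normal(size=(32, 4))
F_s, H2_s = rng.normal(size=(128, 4)), rng.normal(size=(128, 4))
val = loss_L2(F_t, H1_t, F_s, H2_s)
```

Centering under the respective distributions is exactly the role of $\bar{f}_1, \bar{f}_2, \bar{h}_1, \bar{h}_2$ defined above.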


Table 1: Accuracies (%) of CIFAR-10 transfer learning tasks based on the MM-χ² algorithm, where $n_T$ represents the target sample size. The baseline is trained with only target samples.

Table 2: Accuracies (%) of CIFAR-10 transfer learning tasks based on the MM-Hel algorithm, where $n_T$ represents the target sample size; it shares the same baseline as Table 1.

Test accuracies for target tasks under different transfer settings on Office-31.

Test accuracies for target tasks under different transfer settings on Office-Caltech.

APPENDIX

A APPROXIMATION OF K-L DIVERGENCE

Firstly, we provide the following approximation of the K-L divergence, under the assumption that $R(z)$, $P(z)$, and $Q(z)$ are close to each other.

Lemma A.1. Suppose that $|P(z) - R(z)| < \epsilon$ and $|Q(z) - R(z)| < \epsilon$ for each $z \in \mathcal{Z}$, where $\epsilon/|\mathcal{Z}| \ll 1$. Then the K-L divergence between $P(z)$ and $Q(z)$ admits the approximation
$$D(P\|Q) = \frac{1}{2}\sum_{z\in\mathcal{Z}}\frac{\big(P(z)-Q(z)\big)^2}{R(z)} + o(\epsilon^2).$$
The proof follows from a second-order Taylor expansion of the logarithm and is omitted here.

Secondly, we consider the Hellinger distance, which is closely connected with the K-L divergence as a lower bound, as explained in the following lemma.

Lemma A.2. Let $P(z)$ and $Q(z)$ be distributions supported on $\mathcal{Z}$. We have
$$D(P\|Q) \ge 2H^2(P, Q).$$
The proof of this lemma is easy to find and omitted here.

Lemma B.1 reveals that when two Gaussian observations come with the prior knowledge that their centers are within a maximum distance, the optimal estimator is a linear combination of the observations, where the combining coefficient depends on the variances of the two observations, the maximum center distance, and the dimension of the observations.

For Formulation 1 (6), we can define random vectors $u, v \in \mathbb{R}^{|\mathcal{X}||\mathcal{Y}|}$, where for all $(x,y) \in \mathcal{X}\times\mathcal{Y}$, $u(x,y) \triangleq \hat{P}^{(T)}_{XY}(x,y)$ and $v(x,y) \triangleq \hat{P}^{(S)}_{XY}(x,y)$. Their centers are $u_0(x,y) \triangleq P^{(T)}_{XY}(x,y)$ and $v_0(x,y) \triangleq P^{(S)}_{XY}(x,y)$. According to (7), we have $u \sim \mathcal{N}(u_0, \frac{1}{n_T}I)$ and $v \sim \mathcal{N}(v_0, \frac{1}{n_S}I)$, and problem (6) can be re-defined as $\min_{w(u,v)}\max_{v_0:\|v_0-u_0\|^2\le D^2}\mathbb{E}[\|u_0-w\|^2]$. With Lemma B.1, we derive Theorem 3.1.

For Formulation 2 (8), we can define random vectors $u, v \in \mathbb{R}^{|\mathcal{X}||\mathcal{Y}|}$, where for all $(x,y) \in \mathcal{X}\times\mathcal{Y}$, $u(x,y) \triangleq \sqrt{\hat{P}^{(T)}_{XY}(x,y)}$ and $v(x,y) \triangleq \sqrt{\hat{P}^{(S)}_{XY}(x,y)}$, with centers $u_0(x,y) \triangleq \sqrt{P^{(T)}_{XY}(x,y)}$ and $v_0(x,y) \triangleq \sqrt{P^{(S)}_{XY}(x,y)}$.

Firstly, Remark 3.2 leads to an expression that is close to Eq. (6) in Tong et al. (2021).

A geometric explanation for Theorem 3.1 is depicted in Figure 3, where the entire space represents all the distributions supported on $\mathcal{X}\times\mathcal{Y}$. Two balls centered at the target and source distributions represent the empirical distribution sets $\mathcal{P}_{n_T}$ and $\mathcal{P}_{n_S}$. In particular, the radii of the balls reflect the variances of the empirical distributions, which are inversely proportional to the sample sizes. The area with the deeper color contains the paracentral empirical distributions, which have higher probability according to (7).
In addition, the distance between the centers is determined by the task distance assumption. The minimax problem (6) aims to find the estimator that is closest to the target distribution on average, where the dashed line indicates the linear family spanned by $\hat{P}^{(T)}_{XY}$ and $\hat{P}^{(S)}_{XY}$. When the target sample size increases, the blue ball shrinks. When the distance $D$ increases, the two balls move farther apart. In both cases, the estimator should move closer to the empirical target distribution, which implies a smaller coefficient $\alpha_1$. In our implementation, we set $\Lambda_T \leftarrow \tilde{\Lambda}_{f_1}$ and $\Lambda_S \leftarrow \tilde{\Lambda}_{f_2}$. Specifically, to compute the value of the Bessel functions, we make the following approximations when needed: when $x \ll \nu$, $J_\nu(x) \sim x/\nu$; when $x \gg \nu$, $J_\nu(x) \sim (2x-\nu)/(2x)$.
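The divergence relations underlying Appendix A can be checked numerically. The snippet below verifies that the K-L divergence dominates twice the squared Hellinger distance, and that for nearby distributions it behaves like a χ²-type quadratic; the normalization $H^2(P,Q) = \frac{1}{2}\sum_z(\sqrt{P(z)}-\sqrt{Q(z)})^2$ is my assumed form of the paper's definition, and the alphabet size and distributions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def kl(P, Q):
    # K-L divergence D(P || Q) in nats, for strictly positive P, Q.
    return np.sum(P * np.log(P / Q))

def hellinger_sq(P, Q):
    # Assumed normalization: H^2(P, Q) = 0.5 * sum (sqrt P - sqrt Q)^2.
    return 0.5 * np.sum((np.sqrt(P) - np.sqrt(Q)) ** 2)

Z = 8
# P bounded away from zero so the local expansion is well conditioned.
P = rng.uniform(0.5, 1.5, Z)
P /= P.sum()
Q = rng.dirichlet(np.ones(Z))

# Lemma A.2-style lower bound: D(P || Q) >= 2 H^2(P, Q).
print(kl(P, Q) >= 2 * hellinger_sq(P, Q))  # prints True

# Local behaviour: for Q close to P, D(P || Q) ~ 0.5 * sum (P - Q)^2 / P.
eps = 1e-3 * (rng.dirichlet(np.ones(Z)) - 1.0 / Z)  # small, sums to zero
Q2 = P + eps
chi2 = np.sum((P - Q2) ** 2 / P)
print(abs(kl(P, Q2) - 0.5 * chi2) / (0.5 * chi2))  # small relative error
```

The second check mirrors the role of Lemma A.1: near the reference distribution, minimizing the K-L divergence and minimizing a χ²-type quantity are nearly equivalent.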

F INSTRUCTIONS FOR THE CODE

We provide code examples in "supplementary material.zip". In the folder "./cifar10 

