ROBUST TRANSFER LEARNING BASED ON MINIMAX PRINCIPLE

Abstract

The similarity between target and source tasks is a crucial quantity for theoretical analyses and algorithm design in transfer learning studies. However, this quantity is often difficult to capture precisely. To address this issue, we make a boundedness assumption on the task similarity and then propose a mathematical framework based on the minimax principle, which minimizes the worst-case expected population risk under this assumption. Furthermore, our proposed minimax problem can be solved analytically, which provides a guideline for designing robust transfer learning models. Based on the analytical expression, we interpret the influence of sample sizes, task distances, and model dimensionality on knowledge transfer. Practical algorithms are then developed from the theoretical results. Finally, experiments conducted on image classification tasks show that our approaches achieve robust and competitive accuracies under random selections of training sets.

1. INTRODUCTION

The goal of transfer learning is to solve target tasks using the learning results from some source tasks. In order to study the fundamental aspects of transfer learning problems, it is important to define and quantify the similarity between source and target tasks (Pan & Yang, 2009). While the source and target tasks are assumed to share some similarity in transfer learning problems (Weiss et al., 2016), the joint structures and the similarity between the tasks can only be learned from the training data, which is challenging to compute in practice due to the limited availability of labeled target samples. Therefore, in order to conduct meaningful theoretical analyses, it is often necessary to make extra assumptions, such as a linear combination of learning results (Ben-David et al., 2010) or linear regression transfer (Kuzborskij & Orabona, 2013), which can be restrictive in many applications. In this paper, we instead study transfer learning theoretically by only assuming that the similarity between the source and target tasks is bounded, which is a weaker assumption and is often valid in transfer learning problems. Under such an assumption, the minimax principle can be applied (Verdu & Poor, 1984) to estimate the target distribution. Following this principle, the estimator minimizes the worst-case expected population risk (EPR) (Jin et al., 2018) under the bounded task-distance constraint, which maintains robustness under this weak assumption. In practice, many empirical works have also followed the minimax setting and verified its validity (Zhang et al., 2019), while the theoretical analyses lag behind. The main challenge in analyzing general minimax problems in transfer learning lies in the difficulty of computing the expectations of the population risk under popular distance measures, such as the Kullback-Leibler (K-L) divergence (Thomas & Joy, 2006).
To deal with this difficulty, we adopt the widely used χ²-distance and Hellinger distance (Csiszár & Shields, 2004) as the distance measures between the data distributions of the tasks, and present a minimax formulation of transfer learning. Under these measures, the proposed minimax problems can be solved analytically. In particular, we show that the optimal estimate linearly combines the learning results of the two tasks, where the combining coefficient can be computed from the training data. This provides a theoretical justification for many existing analysis frameworks and algorithms (Ben-David et al., 2010; Garcke & Vanck, 2014). Note that the recent work (Tong et al., 2021) also analytically evaluates the combining coefficients, but they rely on the underlying task distributions, which are not available in real applications. Our work provides combining coefficients that are both theoretically optimal and computable from data, which is more appealing in practical applications. Moreover, the analyses of the minimax transfer learning problem on discrete data can be extended to continuous data for real applications. In the continuous case, we consider a transfer learning scheme similar to that of (Nguyen et al., 2020), which transfers the topmost layer of the neural network between source and target tasks. In particular, we derive the analytical solution for the optimal weights in the topmost layer, which is again a linear combination of the weights of the source and target problems. Furthermore, we propose transfer learning algorithms guided by the theoretical results, whose robustness and performance are validated by several experiments on real datasets. The contributions of this paper can be summarized as follows:

• We make mild assumptions on the task distance and propose a minimax framework for analyzing transfer learning. Additionally, we establish the analytical solutions of the minimax problems in the discrete data space.

• We extend the analyses to continuous data and establish similar results for learning models with neural networks. Furthermore, we apply our theoretical results to develop robust transfer learning algorithms.

• Experiments on real datasets validate our proposed algorithms, which achieve higher robustness and competitive accuracy.

Due to space limitations, the proofs of the theorems are presented in the supplementary materials.
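To make the linear-combination form of the solution concrete, the following sketch combines the empirical estimates of the two tasks. The function names are our own, and `placeholder_alpha` is only an illustrative coefficient (weighting the target more as its sample size and the task distance grow), not the optimal coefficient derived later in the paper.

```python
import numpy as np

def combine_estimates(p_target, p_source, alpha):
    """Linearly combine target and source estimates.

    alpha = 1 recovers pure target learning; alpha = 0 pure source transfer.
    """
    return alpha * p_target + (1.0 - alpha) * p_source

def placeholder_alpha(n_target, dist):
    """Hypothetical combining coefficient from the target sample size and an
    estimated task distance bound; a placeholder for the paper's derived
    optimal coefficient."""
    return (n_target * dist) / (n_target * dist + 1.0)
```

When the estimated task distance is zero, this placeholder reduces to pure source transfer; when the target sample size is large, it approaches pure target learning, matching the qualitative behavior one expects of the optimal coefficient.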

2.1. NOTATIONS AND DEFINITIONS

We denote X and Y as the random variables of data and label with domains X and Y, respectively. For ease of illustration, the data X is set as a discrete random variable in Section 2 and Section 3. We consider the transfer learning problem that has a target task and a source task, denoted as task T and S, respectively. For each task i = T, S, there are n_i training samples {(x_ℓ^(i), y_ℓ^(i))}_{ℓ=1}^{n_i} i.i.d. generated from the underlying joint distribution P_XY^(i), with P_XY^(i)(x, y) > 0 for all (x, y) ∈ X × Y. (This positivity assumption comes from the fact that in practice such joint distributions are typically modeled by positive parameterized families, e.g., the softmax function.) The empirical distributions P̂_XY^(i) (i = T, S) of the samples are defined as

P̂_XY^(i)(x, y) ≜ (1/n_i) Σ_{ℓ=1}^{n_i} 1{x_ℓ^(i) = x, y_ℓ^(i) = y},

where 1{·} denotes the indicator function (Feller, 2008), and we let P_n be the set of all possible empirical distributions supported on X × Y with n samples. In this paper, we employ the following two distance measures for probability distributions, which are widely used in statistics (Csiszár & Shields, 2004) and more convenient for our analyses.

Definition 2.1 (Referenced χ²-distance). Let R(z), P(z), and Q(z) be distributions supported on Z. The χ²-distance between P(z) and Q(z) referenced by R(z) is defined as

χ²_R(P, Q) ≜ Σ_{z∈Z} (P(z) − Q(z))² / R(z). (1)

Definition 2.2 (Hellinger distance). Let P(z) and Q(z) be distributions supported on Z. The squared Hellinger distance between P(z) and Q(z) is defined as

H²(P, Q) ≜ (1/2) Σ_{z∈Z} (√P(z) − √Q(z))², and H(P, Q) ≜ √(H²(P, Q)). (2)
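The quantities above can be computed directly from finite samples. The sketch below (function names are our own, assuming data and labels are encoded as small non-negative integers) forms the empirical joint distribution and evaluates the two distance measures of Definitions 2.1 and 2.2.

```python
import numpy as np

def empirical_joint(xs, ys, num_x, num_y):
    """Empirical joint distribution P_hat(x, y) from i.i.d. samples,
    with x in {0,...,num_x-1} and y in {0,...,num_y-1}."""
    p = np.zeros((num_x, num_y))
    for x, y in zip(xs, ys):
        p[x, y] += 1.0
    return p / len(xs)

def chi2_ref(p, q, r):
    """Referenced chi^2-distance: sum_z (P(z) - Q(z))^2 / R(z)  (Definition 2.1)."""
    return float(np.sum((p - q) ** 2 / r))

def hellinger2(p, q):
    """Squared Hellinger distance:
    (1/2) * sum_z (sqrt(P(z)) - sqrt(Q(z)))^2  (Definition 2.2)."""
    return 0.5 * float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
```

Note that `chi2_ref` assumes the reference distribution R is strictly positive on the support, consistent with the positivity assumption on P_XY^(i) above.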

