CONTINUOUS TRANSFER LEARNING

Abstract

Transfer learning has been successfully applied across many high-impact applications. However, most existing work focuses on the static transfer learning setting, and very little is devoted to modeling the time evolving target domain, such as the online reviews for movies. To bridge this gap, in this paper, we focus on the continuous transfer learning setting with a time evolving target domain. One major challenge associated with continuous transfer learning is the time evolving relatedness of the source domain and the current target domain as the target domain evolves over time. To address this challenge, we first derive a generic generalization error bound on the current target domain with flexible domain discrepancy measures. Furthermore, a novel label-informed C-divergence is proposed to measure the shift of joint data distributions (over input features and output labels) across domains. It could be utilized to instantiate a tighter error upper bound in the continuous transfer learning setting, thus motivating us to develop an adversarial Variational Auto-encoder algorithm named CONTE by minimizing the C-divergence based error upper bound. Extensive experiments on various data sets demonstrate the effectiveness of our CONTE algorithm.

1. INTRODUCTION

Transfer learning has achieved significant success across multiple high-impact application domains (Pan & Yang, 2009). Compared to conventional machine learning methods, which assume that training and test data follow the same distribution, transfer learning allows us to learn the target domain with limited label information by leveraging a related source domain with abundant label information (Ying et al., 2018). However, in many real applications, the target domain is constantly evolving over time. For example, online movie reviews change over the years: some famous movies were not well received by the mainstream audience when they were first released, but became famous only years later (e.g., Citizen Kane, Fight Club, and The Shawshank Redemption); online book reviews, in contrast, typically do not exhibit this type of dynamics. It is challenging to transfer knowledge from the static source domain (e.g., the book reviews) to the time evolving target domain (e.g., the movie reviews). Therefore, in this paper, we study the transfer learning setting with a static source domain and a continuously evolving target domain (see Figure 1), which has not attracted much attention from the research community and yet is commonly seen across many real applications. The unique challenge of continuous transfer learning lies in the time evolving nature of the task relatedness between the static source domain and the time evolving target domain. Although the change in the target data distribution between consecutive time stamps might be small, over time the cumulative change in the target domain might even lead to negative transfer (Rosenstein et al., 2005). Existing theoretical analyses of transfer learning (Ben-David et al., 2010; Mansour et al., 2009) showed that the target error is typically bounded by the source error, the domain discrepancy of marginal data distributions, and the difference of labeling functions.
However, it has been observed (Zhao et al., 2019; Wu et al., 2019) that marginal feature distribution alignment might not guarantee the minimization of the target error in real world scenarios. This indicates that in the context of continuous transfer learning, marginal feature distribution alignment would lead to a sub-optimal solution (or even negative transfer) with undesirable predictive performance when directly transferring from $D_S$ to the target domain $D_{T_t}$ at the $t$-th time stamp. This paper aims to bridge the gap in terms of both the theoretical analysis and the empirical solutions for a target domain with a time evolving distribution, which leads to a novel continuous transfer learning algorithm as well as a characterization of negative transfer. The main contributions of this paper are summarized as follows. (1) We derive a generic error bound for the continuous transfer learning setting with flexible domain divergence measures; (2) We propose a label-informed domain discrepancy measure (C-divergence) with its empirical estimate, which instantiates a tighter error bound for the continuous transfer learning setting; (3) Based on the proposed C-divergence, we design a novel adversarial Variational Auto-encoder algorithm (CONTE) for continuous transfer learning; (4) Extensive experimental results on various data sets verify the effectiveness of the proposed CONTE algorithm. The rest of the paper is organized as follows. Section 2 introduces the notation and our problem definition. We derive a generic error bound for the continuous transfer learning setting in Section 3. Then we propose a novel C-divergence in Section 4, followed by an instantiated error bound and a novel continuous transfer learning algorithm in Section 5. The experimental results are provided in Section 6. We summarize the related work in Section 7, and conclude the paper in Section 8.

2. PRELIMINARIES

In this section, we introduce the notation and problem definition of continuous transfer learning.

2.1. NOTATION

We use $\mathcal{X}$ and $\mathcal{Y}$ to denote the input space and label space. Let $D_S$ and $D_T$ denote the source and target domains with data distributions $p_S(x, y)$ and $p_T(x, y)$ over $\mathcal{X} \times \mathcal{Y}$, respectively. Let $\mathcal{H}$ be a hypothesis class on $\mathcal{X}$, where a hypothesis is a function $h: \mathcal{X} \to \mathcal{Y}$. The notation is summarized in Table 3 in the appendices.

2.2. PROBLEM DEFINITION

Transfer learning (Pan & Yang, 2009) refers to the knowledge transfer from a source domain to a target domain such that the prediction performance on the target domain is significantly improved as compared to learning from the target domain alone. However, in some applications, the target domain changes over time, and hence so does the relatedness between the source and target domains. This motivates us to consider the transfer learning setting with a time evolving target domain, which is much less studied than the static transfer learning setting. We formally define the continuous transfer learning problem as follows.

Definition 2.1. (Continuous Transfer Learning) Given a source domain $D_S$ (available at time stamp $j = 1$) and a time evolving target domain $\{D_{T_j}\}_{j=1}^{n}$ with time stamp $j$, continuous transfer learning aims to improve the prediction function for the target domain $D_{T_{t+1}}$ using the knowledge from the source domain $D_S$ and the historical target domains $D_{T_j}$ ($j = 1, \cdots, t$).

Notice that the source domain $D_S$ can be considered a special initial domain for the time evolving target domain. Therefore, for notational simplicity, we will use $D_{T_0}$ to represent the source domain in this paper. We assume that there are $m_{T_0}$ labeled source examples drawn independently from the source domain $D_{T_0}$ and $m_{T_j}$ labeled target examples drawn independently from the target domain $D_{T_j}$ at time stamp $j$.

3. A GENERIC ERROR BOUND

Given a static source domain and a time evolving target domain, continuous transfer learning aims to improve the target predictive function over $D_{T_{t+1}}$ using the source domain and historical target domains. We begin by considering the binary classification setting, i.e., $\mathcal{Y} = \{0, 1\}$. The source error of a hypothesis $h$ is defined as $\epsilon_{T_0}(h) = \mathbb{E}_{(x,y) \sim p_{T_0}(x,y)}[L(h(x), y)]$, where $L(\cdot, \cdot)$ is the loss function. Its empirical estimate using labeled source examples is denoted $\hat{\epsilon}_{T_0}(h)$. Similarly, we define the target error $\epsilon_{T_j}(h)$ and its empirical estimate $\hat{\epsilon}_{T_j}(h)$ over the target distribution $p_{T_j}(x, y)$ at time stamp $j$. A natural domain discrepancy measure over the joint distributions on $\mathcal{X} \times \mathcal{Y}$ can be defined as follows:

$d_1(D_{T_0}, D_T) = \sup_{Q \in \mathcal{Q}} \left| \Pr_{D_{T_0}}[Q] - \Pr_{D_T}[Q] \right| \quad (1)$

where $\mathcal{Q}$ is the set of measurable subsets under $p_{T_0}(x, y)$ and $p_T(x, y)$ [1]. Then the error bound of continuous transfer learning is given by the following theorem.

Theorem 3.1. Assume the loss function $L$ is bounded with $0 \le L \le M$. Given a source domain $D_{T_0}$ and historical target domains $\{D_{T_i}\}_{i=1}^{t}$, for $h \in \mathcal{H}$, the target error $\epsilon_{T_{t+1}}$ on $D_{T_{t+1}}$ is bounded as follows:

$\epsilon_{T_{t+1}}(h) \le \frac{1}{\bar{\mu}} \left( \sum_{j=0}^{t} \mu^{t-j} \epsilon_{T_j}(h) + M \sum_{j=0}^{t} \mu^{t-j} d_1(D_{T_j}, D_{T_{t+1}}) \right)$

where $\mu \ge 0$ is the domain decay rate [2] indicating the importance of the source or historical target domains relative to $D_{T_{t+1}}$, and $\bar{\mu} = \sum_{j=0}^{t} \mu^{t-j}$.

Remark. In particular, we have the following arguments. (1) It is not tractable to accurately estimate $d_1$ from finite examples in real scenarios (Ben-David et al., 2010); (2) This error bound could be much tighter when considering other advanced domain discrepancy measures, e.g., the $\mathcal{A}$-distance (Ben-David et al., 2007), the discrepancy distance (Mansour et al., 2009), etc.
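As a small illustration of Theorem 3.1, the bound is a decayed weighted average over time stamps. The following helper is our own illustrative sketch (not from the paper), computing the bound from per-stamp errors and divergences:

```python
import numpy as np

def continuous_bound(errors, divergences, mu, M=1.0):
    """Evaluate (1/mu_bar) * sum_j mu^(t-j) * (eps_j + M * d_j) for j = 0..t."""
    t = len(errors) - 1
    weights = np.array([mu ** (t - j) for j in range(t + 1)])  # decayed weights mu^(t-j)
    mu_bar = weights.sum()                                     # normalizer mu_bar
    terms = np.asarray(errors) + M * np.asarray(divergences)
    return float(np.dot(weights, terms) / mu_bar)
```

With $\mu = 0$ only the latest time stamp contributes (since $\mu^0 = 1$), and as $\mu \to \infty$ the source term $j = 0$ dominates, matching the two special cases discussed in the remark.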
(3) There are two special cases: when $\mu = 0$, the error bound on $D_{T_{t+1}}$ is determined solely by the latest historical target data $D_{T_t}$; and as $\mu$ goes to infinity, $D_{T_{t+1}}$ is determined solely by the source data $D_{T_0}$, because the coefficients $\mu^{t-j}/\bar{\mu}$ of the historical target domain data $D_{T_j}$ ($j = 1, \cdots, t$) converge to zero.

Corollary 3.2. With the assumptions in Theorem 3.1, and assuming further that the loss function $L$ is symmetric (i.e., $L(y_1, y_2) = L(y_2, y_1)$ for $y_1, y_2 \in \mathcal{Y}$) and obeys the triangle inequality:

(1) If the $\mathcal{A}$-distance (Ben-David et al., 2007) is adopted to measure the distribution shift, i.e., $d_{\mathcal{H}\Delta\mathcal{H}} = \sup_{h, h' \in \mathcal{H}} \left| \Pr_{D_{T_0}}[h(x) \ne h'(x)] - \Pr_{D_T}[h(x) \ne h'(x)] \right|$, we have:

$\epsilon_{T_{t+1}}(h) \le \frac{1}{\bar{\mu}} \left( \sum_{j=0}^{t} \mu^{t-j} \epsilon_{T_j}(h) + M \sum_{j=0}^{t} \mu^{t-j} \left( d_{\mathcal{H}\Delta\mathcal{H}}(D_{T_j}, D_{T_{t+1}}) + \frac{\lambda_j^*}{M} \right) \right)$

where $\lambda_j^* = \min_{h \in \mathcal{H}} \epsilon_{T_j}(h) + \epsilon_{T_{t+1}}(h)$.

(2) If the discrepancy distance (Mansour et al., 2009) is adopted to measure the distribution shift, i.e., $d_{disc}(D_{T_0}, D_T) = \max_{h, h' \in \mathcal{H}} \left| \mathbb{E}_{D_{T_0}}[L(h(x), h'(x))] - \mathbb{E}_{D_T}[L(h(x), h'(x))] \right|$, we have:

$\epsilon_{T_{t+1}}(h) \le \frac{1}{\bar{\mu}} \left( \sum_{j=0}^{t} \mu^{t-j} \epsilon_{T_j}(h) + \sum_{j=0}^{t} \mu^{t-j} \left( d_{disc}(D_{T_j}, D_{T_{t+1}}) + \Omega_j \right) \right)$

where $\Omega_j = \mathbb{E}_{D_{T_j}}[L(h_j^*(x), y)] + \mathbb{E}_{D_{T_{t+1}}}[L(h_j^*(x), h_{t+1}^*(x))] + \mathbb{E}_{D_{T_{t+1}}}[L(h_{t+1}^*(x), y)]$, and $h_j^* = \arg\min_{h \in \mathcal{H}} \epsilon_{T_j}(h)$ for $j = 0, \cdots, t, t+1$.

The aforementioned domain discrepancy measures mainly focus on the marginal distribution over input features and have inspired a line of practical transfer learning algorithms (Ganin et al., 2016; Chen et al., 2019). However, recent work (Wu et al., 2019; Zhao et al., 2019) observed that the alignment of marginal distributions cannot guarantee the success of transfer learning in real scenarios. We propose to address this problem by incorporating the label information in the domain discrepancy measure (see the next section).

4. LABEL-INFORMED DOMAIN DISCREPANCY

In this section, we introduce a novel label-informed domain discrepancy measure between the source domain $D_{T_0}$ and the target domain $D_T$, its empirical estimate, and a transfer signature based on this measure to identify potential negative transfer. The use of this discrepancy measure in continuous transfer learning is discussed in the next section.

4.1. C-DIVERGENCE

For a hypothesis $h \in \mathcal{H}$, we denote by $I(h)$ the subset of $\mathcal{X}$ such that $x \in I(h) \Leftrightarrow h(x) = 1$. In order to estimate the label-informed domain discrepancy from finite samples in practice, instead of Eq. (1), we propose the following C-divergence between $D_{T_0}$ and $D_T$, taking into consideration the joint distribution over features and class labels:

$d_C(D_{T_0}, D_T) = \sup_{h \in \mathcal{H}} \left| \Pr_{D_{T_0}}[\{I(h), y = 1\} \cup \{\bar{I}(h), y = 0\}] - \Pr_{D_T}[\{I(h), y = 1\} \cup \{\bar{I}(h), y = 0\}] \right| \quad (2)$

where $\bar{I}(h)$ is the complement of $I(h)$. We show that some existing domain discrepancy measures (e.g., Ben-David et al. (2007)) can be seen as special cases of this definition under the following relaxed covariate shift assumption.

Definition 4.1. (Relaxed Covariate Shift Assumption) The source and target domains satisfy the relaxed covariate shift assumption if for any $h \in \mathcal{H}$, $\Pr_{D_{T_0}}[y \mid I(h)] = \Pr_{D_T}[y \mid I(h)] = \Pr[y \mid I(h)]$.

Notice that this is equivalent to the covariate shift assumption (Shimodaira, 2000; Johansson et al., 2019) when $I(h)$ consists of only one example for all $h \in \mathcal{H}$ (see Lemma A.6 for details).

Lemma 4.2. Under the relaxed covariate shift assumption, for any $h \in \mathcal{H}$, we have:

$d_C(D_{T_0}, D_T) = \sup_{h \in \mathcal{H}} \left| \left( \Pr_{D_{T_0}}[I(h)] - \Pr_{D_T}[I(h)] \right) \cdot S_h + \Pr_{D_T}[y = 1] - \Pr_{D_{T_0}}[y = 1] \right|$

where $S_h = \Pr[y = 1 \mid I(h)] - \Pr[y = 0 \mid \bar{I}(h)]$.

Remark. From Lemma 4.2, we can see that in the special case where $S_h$ is a constant for all $h \in \mathcal{H}$ and $\Pr_{D_T}[y = 1] = \Pr_{D_{T_0}}[y = 1]$, the proposed C-divergence reduces to the $\mathcal{A}$-distance (Ben-David et al., 2007) defined on the marginal distribution of features. More generally, the C-divergence can be considered a weighted version of the $\mathcal{A}$-distance, where a hypothesis whose characteristic function has larger class-separability (i.e., $|S_h|$) receives a higher weight. Intuitively, compared to the $\mathcal{A}$-distance, the C-divergence pays less attention to class-inseparable regions in the input feature space, which provide irrelevant information for learning the prediction function in the target domain.

Moreover, the following theorem states that in the conventional transfer learning scenario with a static source domain and a static target domain, the target error is bounded in terms of the C-divergence across domains and the expected source error.

Theorem 4.3. Assume that the loss function $L$ is bounded, i.e., there exists a constant $M > 0$ such that $0 \le L \le M$. For a hypothesis $h \in \mathcal{H}$, we have the following bound: $\epsilon_T(h) \le \epsilon_{T_0}(h) + M \cdot d_C(D_{T_0}, D_T)$.

4.2. EMPIRICAL ESTIMATE OF C-DIVERGENCE

In practice, it is difficult to calculate the proposed C-divergence from Eq. (2), as it involves the true underlying distributions. Therefore, we propose the following empirical estimate of the C-divergence between $D_{T_0}$ and $D_T$. Assuming that the hypothesis class $\mathcal{H}$ is symmetric (i.e., $1 - h \in \mathcal{H}$ if $h \in \mathcal{H}$), the empirical C-divergence is:

$\hat{d}_C(\hat{D}_{T_0}, \hat{D}_T) = 1 - \min_{h \in \mathcal{H}} \left[ \frac{1}{m_{T_0}} \sum_{(x,y): h(x) \ne y} I[(x, y) \in \hat{D}_{T_0}] + \frac{1}{m_T} \sum_{(x,y): h(x) = y} I[(x, y) \in \hat{D}_T] \right] \quad (4)$

where $\hat{D}_{T_0}$ and $\hat{D}_T$ denote the source and target domains with finite samples, respectively, and $I[a]$ is the binary indicator function, which is 1 if $a$ is true and 0 otherwise. The following lemma provides an upper bound on the true C-divergence using its empirical estimate.

Lemma 4.4. For any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over $m_{T_0}$ labeled source examples $B_{T_0}$ and $m_T$ labeled target examples $B_T$, we have:

$d_C(D_{T_0}, D_T) \le \hat{d}_C(\hat{D}_{T_0}, \hat{D}_T) + \left( \hat{\mathfrak{R}}_{B_{T_0}}(\mathcal{L}_\mathcal{H}) + \hat{\mathfrak{R}}_{B_T}(\mathcal{L}_\mathcal{H}) \right) + 3 \left( \sqrt{\frac{\log(4/\delta)}{2 m_{T_0}}} + \sqrt{\frac{\log(4/\delta)}{2 m_T}} \right)$

where $\hat{\mathfrak{R}}_{B}(\mathcal{L}_\mathcal{H})$ ($B \in \{B_{T_0}, B_T\}$) denotes the Rademacher complexity (Mansour et al., 2009) over $B$, and $\mathcal{L}_\mathcal{H} = \{(x, y) \to I[h(x) = y] : h \in \mathcal{H}\}$ is a class of functions mapping $Z = \mathcal{X} \times \mathcal{Y}$ to $\{0, 1\}$.
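Eq. (4) is straightforward to evaluate over a finite hypothesis class: the minimization trades off the source error of a hypothesis against its target agreement. The sketch below is our own illustration (hypotheses are passed in as plain callables, an assumption for simplicity):

```python
import numpy as np

def empirical_c_divergence(Xs, ys, Xt, yt, hypotheses):
    """Empirical C-divergence per Eq. (4): 1 minus the minimum over h of
    (source error rate of h) + (target agreement rate of h)."""
    best = min(
        np.mean(h(Xs) != ys) + np.mean(h(Xt) == yt)  # h(x)!=y on source, h(x)=y on target
        for h in hypotheses
    )
    return 1.0 - best
```

When the source and target samples are identical, any hypothesis and its complement give a combined term of exactly 1, so the divergence is 0; when the target labels are flipped relative to the source, the divergence reaches 1.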

4.3. NEGATIVE TRANSFER CHARACTERIZATION

Informally, negative transfer is the situation where transferring knowledge from the source domain has a negative impact on the target learner (Wang et al., 2019): $\epsilon_T(A(D_{T_0}, D_T)) > \epsilon_T(A(\emptyset, D_T))$, where $A$ is the learning algorithm, $\epsilon_T$ is the target error induced by algorithm $A$, and $\emptyset$ indicates that only the target data set is used for the target learner. In this paper, we define a transfer signature to measure the transferability from the source domain to the target domain as follows.

$TS(D_T \| D_{T_0}) = \inf_{A \in \mathcal{G}} \left( \epsilon_T(A(D_{T_0}, D_T)) - \epsilon_T(A(\emptyset, D_T)) \right)$

where $\mathcal{G}$ is the set of all learning algorithms. We say that the source domain knowledge is not transferable to the target domain when $TS(D_T \| D_{T_0}) > 0$. Specifically, since $A(D_{T_0}, D_T)$ learns an optimal classifier using both source and target data, we can define $\epsilon_T(A(D_{T_0}, D_T)) = \epsilon_T(h_\alpha^*)$ where $h_\alpha^* = \arg\min_{h \in \mathcal{H}(A)} \alpha \epsilon_T(h) + (1 - \alpha) \epsilon_{T_0}(h)$ and $\mathcal{H}(A)$ is the hypothesis space induced by $A$. When we only consider the target domain with $\alpha = 1$, $\epsilon_T(A(\emptyset, D_T)) = \epsilon_T(h_T^*)$ where $h_T^* = \arg\min_{h \in \mathcal{H}(A)} \epsilon_T(h)$. Then we have the following theorem regarding the transfer signature.

Theorem 4.5. Assuming the loss function $L$ is bounded with $0 \le L \le M$, we have

$\epsilon_T(h_\alpha^*) \le \epsilon_T(h_T^*) + 2(1 - \alpha) M d_C(D_{T_0}, D_T)$

and furthermore, $TS(D_T \| D_{T_0}) \le 2(1 - \alpha) M d_C(D_{T_0}, D_T)$.

Remark. We have the following observations: (1) A larger C-divergence between domains is often associated with a higher transfer signature, which indicates that negative transfer can be characterized using the proposed C-divergence; (2) Empirically, a larger amount of labeled target data could increase the value of $\alpha$, making the learned classifier rely more on the target data, which is consistent with the observation in (Wang et al., 2019). One extreme case is $\alpha = 1$, implying that we have adequate labeled target examples for standard supervised learning on the target domain without transferring knowledge from the source domain.

5. PROPOSED ALGORITHM

In this section, we derive the continuous error bound based on our proposed C-divergence, followed by a novel continuous transfer learning algorithm (CONTE) that minimizes the error upper bound. Notice that in the context of continuous transfer learning, we also use the proposed C-divergence between the target domains at adjacent time stamps to measure the change in distribution over time.

5.1. CONTINUOUS ERROR BOUND WITH EMPIRICAL C-DIVERGENCE

The following theorem states that for a bounded loss function $L$, the target error in continuous transfer learning can be bounded in terms of the empirical classification errors on the source and historical target domains, the empirical C-divergence across domains, and the empirical Rademacher complexity of the function class $\mathcal{L}_\mathcal{H} = \{(x, y) \to I[h(x) = y] : h \in \mathcal{H}\}$.

Theorem 5.1. (Continuous Error Bound) Assume the loss function $L$ is bounded with $0 \le L \le M$. Given a source domain $D_{T_0}$ and historical target domains $\{D_{T_i}\}_{i=1}^{t}$, for $h \in \mathcal{H}$ and $\delta \in (0, 1)$, with probability at least $1 - \delta$, the target error $\epsilon_{T_{t+1}}$ on $D_{T_{t+1}}$ is bounded as follows:

$\epsilon_{T_{t+1}}(h) \le \frac{1}{\bar{\mu}} \left( \sum_{j=0}^{t} \mu^{t-j} \hat{\epsilon}_{T_j}(h) + M \sum_{j=0}^{t} \mu^{t-j} \hat{d}_C(\hat{D}_{T_j}, \hat{D}_{T_{t+1}}) + M \Lambda \right)$

where $\Lambda = \sum_{j=0}^{t} \left( \hat{\mathfrak{R}}_{B_{T_j}}(\mathcal{L}_\mathcal{H}) + \hat{\mathfrak{R}}_{B_{T_{t+1}}}(\mathcal{L}_\mathcal{H}) + 3\sqrt{\frac{\log(8/\delta)}{2 m_{T_j}}} + 3\sqrt{\frac{\log(8/\delta)}{2 m_{T_{t+1}}}} + \sqrt{\frac{M^2 \log(4/\delta)}{2 m_{T_j}}} \right)$.

Remark. Compared to the continuous error bounds in Corollary 3.2 using existing domain divergence measures (Ben-David et al. (2007); Mansour et al. (2009)), our bound consists of only data-dependent terms (e.g., the empirical source error and the empirical C-divergence), whereas the previous error bounds involve error terms depending on the intractable labeling function or the optimal target hypothesis (see Corollary 3.2).

5.2. CONTE ALGORITHM

For continuous transfer learning, we leverage both the source domain and historical target domain data to learn the predictive function at the current time stamp. To this end, we propose to minimize the error bound in Theorem 5.1 when learning the predictive function on $D_{T_{t+1}}$. Furthermore, we aim to learn a domain-invariant and time-invariant latent feature space such that the C-divergence across domains and across time stamps is minimized. Therefore, we present an adversarial Variational Auto-encoder (VAE) algorithm with the following overall objective function:

$J(T_0, T_1, \cdots, T_{t+1}) = \sum_{j=0}^{t} \mu^{t-j} \left( L_{clc}(T_j, T_{t+1}) + \hat{d}_C(\hat{D}_{T_j}, \hat{D}_{T_{t+1}}) + \beta L_{ELBO}(T_j, T_{t+1}) \right) \quad (6)$

where $L_{clc}(T_j, T_{t+1})$ represents the classification error over the labeled examples from $D_{T_j}$ and $D_{T_{t+1}}$, and $\hat{d}_C(\hat{D}_{T_j}, \hat{D}_{T_{t+1}})$ is the empirical estimate of the C-divergence across domains. Thus the first two terms of Eq. (6) correspond to $\hat{\epsilon}_{T_j}(h) + \hat{d}_C(\hat{D}_{T_j}, \hat{D}_{T_{t+1}})$ in the error bound of Theorem 5.1. The third term $L_{ELBO}(T_j, T_{t+1})$ is the variational bound in the VAE framework (see Figure 4) for learning the latent feature space, and $\beta > 0$ is a hyper-parameter. In this case, we take $\mu \in [0, 1]$ because we assume that the data distribution of the time evolving target domain shifts smoothly over time. We instantiate the terms of Eq. (6) as follows. Inspired by the semi-supervised VAE (Kingma et al., 2014), we propose to learn the feature space by maximizing the following likelihood across domains:

$\log p_\theta(x, y) = KL\left( q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x, y) \right) + \mathbb{E}_{q_\phi(z \mid x,y)}\left[ \log p_\theta(x, y, z) - \log q_\phi(z \mid x, y) \right]$

where $\phi$ and $\theta$ are the learnable parameters of the encoder and decoder respectively, $z$ is the latent feature representation of the input example $(x, y)$, and $KL(\cdot \| \cdot)$ is the Kullback-Leibler divergence. The evidence lower bound (ELBO) on this log-likelihood can be written as follows.
$E_{\theta,\phi}(x, y) = \mathbb{E}_{q_\phi(z \mid x,y)}[\log p_\theta(x, y \mid z)] - KL\left( q_\phi(z \mid x, y) \,\|\, p(z) \right) \quad (7)$

where $E_{\theta,\phi}(x, y) \le \log p_\theta(x, y)$. Similarly, we have the following ELBO to maximize the log-likelihood of $p_\theta(x)$ when the label is not available:

$U_{\theta,\phi}(x) = \sum_{y} q_\phi(y \mid x) \cdot E_{\theta,\phi}(x, y) - \mathbb{E}_{q_\phi(y \mid x)}[\log q_\phi(y \mid x)] \quad (8)$

where $p_\theta(x, y, z) = p_\theta(x \mid y, z) p_\theta(y \mid z) p(z)$ with Gaussian prior $p(z) = \mathcal{N}(0, I)$. The variational bound $L_{ELBO}(T_j, T_{t+1})$ is then given by:

$L_{ELBO}(T_j, T_{t+1}) = -\sum_{i=1}^{m_{T_j} + m_{T_{t+1}}} E_{\theta,\phi}(x_i, y_i) - \sum_{i=1}^{u_{T_{t+1}}} U_{\theta,\phi}(x_i) \quad (9)$

where $u_{T_{t+1}}$ is the number of unlabeled training examples from $D_{T_{t+1}}$. The classification error $L_{clc}(T_j, T_{t+1})$ can be expressed as:

$L_{clc}(T_j, T_{t+1}) = \sum_{i=1}^{m_{T_j} + m_{T_{t+1}}} L\left( y_i, q_\phi(\cdot \mid x_i) \right) \quad (10)$

where $q_\phi(\cdot)$ is the discriminative classifier formed by the distribution $q_\phi(y \mid x)$ in Eq. (8), and $L(\cdot, \cdot)$ is the cross-entropy loss function in our experiments. To estimate the C-divergence, we first define $\bar{h}$ to be a two-dimensional characteristic function with $\bar{h}(x, y) = 1 \Leftrightarrow h(x) = y \Leftrightarrow \{h(x) = 1, y = 1\} \vee \{h(x) = 0, y = 0\}$ for $h \in \mathcal{H}$. Then the empirical C-divergence in Eq. (4) can be rewritten as follows:

$\hat{d}_C(\hat{D}_{T_j}, \hat{D}_{T_{t+1}}) = 1 - \min_{\bar{h}} \left[ \frac{1}{m_{T_j}} \sum_{(x,y): \bar{h}(x,y) = 0} I[(x, y) \in \hat{D}_{T_j}] + \frac{1}{m_{T_{t+1}}} \sum_{(x,y): \bar{h}(x,y) = 1} I[(x, y) \in \hat{D}_{T_{t+1}}] \right]$

Note that the latent feature representation $z$ learned by $q_\phi(z \mid x, y)$ captures the label-informed information of an example $(x, y)$. Thus, the hypothesis $\bar{h}$ can be considered the composition of the feature extractor $q_\phi$ and a domain classifier $F_j$, i.e., $\bar{h}(x, y) = F_j(q_\phi(z \mid x, y))$. Formally, the empirical estimate of the C-divergence becomes:

$\hat{d}_C(\hat{D}_{T_j}, \hat{D}_{T_{t+1}}) = 1 - \min_{F_j} \left[ \frac{1}{m_{T_j}} \sum_{z: F_j(z) = 0} I[z \in \hat{D}_{T_j}] + \frac{1}{m_{T_{t+1}}} \sum_{z: F_j(z) = 1} I[z \in \hat{D}_{T_{t+1}}] \right]$

The benefits of CONTE are twofold: first, it learns the latent feature space using both the input $x$ and the output $y$; second, it minimizes a tighter error upper bound based on the C-divergence in Theorem 5.1.
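For concreteness, the KL term in Eq. (7) has a closed form when $q_\phi(z \mid x, y)$ is a diagonal Gaussian and $p(z) = \mathcal{N}(0, I)$. A minimal sketch of this standard VAE identity (our illustration, not code from the paper):

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)
```

The term vanishes exactly when the posterior matches the prior ($\mu = 0$, $\log \sigma^2 = 0$) and grows as the encoder output drifts away from it.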
This framework can also be interpreted as a minimax game: the VAE learns a domain-invariant and time-invariant latent feature space, whereas the domain classifier $F_j$ aims to distinguish the examples from different domains and different time stamps. In this paper, we adopt the gradient reversal layer (Ganin et al., 2016) when updating the parameters of the domain classifier $F_j$, and thus CONTE can be optimized by back-propagation in an end-to-end manner (see Algorithm 1 in the appendices). However, we observe that (1) it is difficult to estimate the C-divergence with only limited labeled target examples from $D_{T_{t+1}}$; and (2) when learning the latent features $z$, combining the data $x$ (e.g., one image) and the class label $y$ directly might over-emphasize the data itself due to its high dimensionality compared to $y$. To address these problems, we propose the following Pseudo-label Inference: we infer the pseudo labels of unlabeled examples using the classifier $q_\phi(y \mid x)$ at each training epoch. Using labeled source and target examples as well as unlabeled target examples with inferred pseudo labels, the C-divergence can be estimated in a balanced setting. Furthermore, to enforce the compatibility between the features $x$ and the label $y$, we adopt a pre-encoder step to learn a dense representation of the input $x$, and then learn the label-informed latent features $z$.
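The gradient reversal trick can be illustrated without an autograd framework: for a logistic domain classifier on latent codes $z$, the classifier weights receive the usual cross-entropy gradient, while the gradient passed back to the encoder is sign-flipped. A hand-rolled sketch (our own illustration; `lam` is an assumed reversal strength, not a symbol from the paper):

```python
import numpy as np

def grl_backward(z, d, w, lam=1.0):
    """One backward step of a logistic domain classifier with gradient reversal.
    z: (n, k) latent codes; d: (n,) domain labels in {0, 1}; w: (k,) weights.
    Returns (grad_w, grad_z): grad_w updates the classifier as usual, while
    grad_z is negated so that a descent step on the encoder increases domain confusion."""
    p = 1.0 / (1.0 + np.exp(-(z @ w)))   # P(domain = 1 | z)
    g = p - d                             # d(cross-entropy)/d(logit)
    grad_w = z.T @ g / len(d)             # ordinary gradient for the classifier
    grad_z = np.outer(g, w) / len(d)      # gradient w.r.t. the latent codes
    return grad_w, -lam * grad_z          # reversal layer flips the encoder signal
```

Descending along the returned `-lam * grad_z` (instead of `grad_z`) realizes the minimax game in a single backward pass, which is why this kind of objective can be trained end-to-end.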

6. EXPERIMENTAL RESULTS

Synthetic Data: We generate a synthetic data set in which each domain has 1000 positive examples and 1000 negative examples, randomly drawn from the Gaussian distributions $\mathcal{N}([1.5\cos\theta, 1.5\sin\theta]^T, 0.5 \cdot I_{2\times 2})$ and $\mathcal{N}([1.5\cos(-\theta), 1.5\sin(-\theta)]^T, 0.5 \cdot I_{2\times 2})$, respectively. We let $\theta = 0$ for the source domain (denoted S1), and $\theta = i\pi/t$ ($i = 1, \cdots, t$) for the time evolving target domain with $t = 8$ time stamps (denoted T1, $\cdots$, T8). Image Data: We consider the following two tasks: digit classification (MNIST, SVHN) and image classification (Office-31 with three domains: Amazon, DSLR and Webcam; and Office-Home with four domains: Art, Product, Clipart and Real World). Since the standard domains in these data sets are static, we simulate the time evolving distribution shift on the target domain by adding noise (e.g., random salt & pepper noise, adversarial noise, rotation). Taking SVHN→MNIST as an example, we use SVHN as the static source domain and MNIST as the target domain at the first time stamp. By adding adversarial noise to the MNIST images, we obtain a time evolving target domain (denoted T1, $\cdots$, T11 in Table 1). For Office-31 and Office-Home, we add random salt & pepper noise and rotation to generate the evolving target domain. More details can be found in the appendices. Baselines: The baseline methods are as follows. (1) SourceOnly: training with only source data; (2) TargetERM: empirical risk minimization (ERM) on only the target domain; (3) DAN (Long et al., 2015), CORAL (Sun & Saenko, 2016), DANN (Ganin et al., 2016), ADDA (Tzeng et al., 2017), WDGRL (Shen et al., 2018), DIFA (Volpi et al., 2018) and MDD (Zhang et al., 2019): training with feature distribution alignment; (4) CONTE: training with label-informed distribution alignment on the evolving target domain with $\mu \in \{0, 0.2, 0.4, 0.6, 0.8, 1\}$;
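The rotating-Gaussian construction above is easy to reproduce; a minimal generator following the stated means and covariance $0.5 \cdot I$ (our own sketch of the described setup):

```python
import numpy as np

def make_domain(theta, n=1000, seed=0):
    """Sample one domain: n positive points around [1.5cos(theta), 1.5sin(theta)]
    and n negative points around [1.5cos(-theta), 1.5sin(-theta)], cov = 0.5*I."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(0.5)  # covariance 0.5*I -> per-axis standard deviation sqrt(0.5)
    pos = rng.normal([1.5 * np.cos(theta), 1.5 * np.sin(theta)], std, size=(n, 2))
    neg = rng.normal([1.5 * np.cos(-theta), 1.5 * np.sin(-theta)], std, size=(n, 2))
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(n), np.zeros(n)])
    return X, y

# Evolving target domains T1..T8 at theta = i*pi/8
domains = [make_domain(i * np.pi / 8, seed=i) for i in range(1, 9)]
```

As $\theta$ sweeps from $\pi/8$ toward $\pi$, the class means rotate in opposite directions, so the joint distribution drifts steadily away from the source configuration.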
(5) CONTE$_1$: a one-time transfer learning variant of CONTE that directly transfers from the source domain to the current target domain. We fix $\beta = 0.1$, and all methods use the same neural network architecture for feature extraction. We compare the proposed C-divergence with the conventional domain discrepancy measure, the $\mathcal{A}$-distance (Ben-David et al., 2007), on a synthetic data set with an evolving target domain. We assume that the hypothesis space $\mathcal{H}$ consists of linear classifiers in the feature space. Figure 2 shows the domain discrepancy and target classification accuracy for each pair of source and target domains. We have the following observations. (1) The classification accuracy on the target domain decreases significantly from target domain T1 to T8. One explanation is that the joint distribution $p(x, y)$ of the time evolving target domain gradually shifts. (2) The $\mathcal{A}$-distance increases from S1→T1 to S1→T4, and then decreases from S1→T4 to S1→T8. That is because it only estimates the difference of the marginal feature distribution $p(x)$ between the source and target domains. (3) The C-divergence keeps increasing from S1→T1 to S1→T8, which indicates the decreasing task relatedness between the source and target domains. Therefore, compared with the $\mathcal{A}$-distance [3], the proposed C-divergence better characterizes the transferability from the source to the target domains.

6.2. EVALUATION OF ERROR BOUND

When there is only one time stamp involved in the target domain, Theorem 5.1 reduces to the standard error bound in the conventional static transfer learning setting. We empirically compare this reduced error bound with the existing Rademacher complexity based error bound in (Mansour et al., 2009) (see Theorem A.4 in the appendices for self-containedness). We use the 0-1 loss function as $L$ and assume that the hypothesis space $\mathcal{H}$ consists of linear classifiers in the feature space. Figure 3 shows the estimated error bounds and the target error with the time evolving target domain (i.e., S1→T1, $\cdots$, S1→T8 in a new synthetic data set with a more slowly evolving target domain, to ensure that the baseline bound is meaningful most of the time), where we choose $h = h_{T_0}^*$. It demonstrates that our C-divergence based error bound is much tighter than the baseline. Notice that when transferring from source domain S1 to target domain T8, our error bound is largely determined by the C-divergence, whereas the baseline is determined by the difference between the optimal source and target hypotheses. Furthermore, given an arbitrary hypothesis $h \in \mathcal{H}$, we may not be able to estimate the baseline bound when the optimal hypothesis is not available. Tables 1 and 2 provide the continuous transfer learning results on the digit and Office-31 data sets, where the classification accuracy on the target domain is reported (the best results are highlighted in bold).
It is observed that (1) the classification accuracy of the SourceOnly algorithm decreases significantly on the evolving target domain due to the shift of the joint data distribution $p(x, y)$; (2) the performance of the static baseline algorithms is largely affected by the distribution shift in the evolving target domain, and is even worse than TargetERM in some cases (e.g., on T6-T11 from SVHN to evolving MNIST); (3) CONTE significantly outperforms CONTE$_1$ as well as the other competitors on the target domain by a large margin (i.e., up to 30% improvement on the last time stamp of the target domain), because it effectively leverages the historical target domain information to smoothly re-align the target distribution when the change of the target distribution between consecutive time stamps is small.

7. RELATED WORK

Transfer Learning: Transfer learning (Ying et al., 2018; Jang et al., 2019) improves the performance of a learning algorithm on the target domain by using knowledge from the source domain. The target error has been theoretically shown to be well bounded (Ben-David et al., 2010; Mansour et al., 2009), followed by a line of practical algorithms (Shen et al., 2018; Long et al., 2017; 2018; Saito et al., 2018; Chen et al., 2019) under the covariate shift assumption. However, it has been observed that this assumption does not always hold in real-world scenarios (Rosenstein et al., 2005; Wang et al., 2019). Multi-source Domain Adaptation: Multi-source domain adaptation improves the target prediction function using multiple source domains (Zhao et al., 2018; Hoffman et al., 2018; Wen et al., 2020). It is similar to our problem setting, as the source and historical target domains can be considered multiple "source" domains when modeling the target domain at the current time stamp. However, only limited labeled target examples are provided in our problem setting, whereas multi-source domain adaptation requires that all source domains have adequate labeled examples. Continual Learning: Continual lifelong learning (Parisi et al., 2019; Rusu et al., 2016; Hoffman et al., 2014; Bobu et al., 2018) involves sequential learning tasks with the goal of learning a predictive function on a new task using knowledge from historical tasks. Most such work focuses on mitigating catastrophic forgetting when learning new tasks from a single evolving domain, whereas our work studies the transferability between a source domain and a time evolving target domain.

8. CONCLUSION

In this paper, we study continuous transfer learning with a time evolving target domain, which has not been widely studied and yet is commonly seen in many real applications. We start by deriving a generic error bound of continuous transfer learning with flexible domain discrepancy measures. Then we propose a novel label-informed C-divergence to measure the domain discrepancy incorporating the label information, and study its application in continuous transfer learning, which leads to an improved error bound. Based on this bound, we further propose a generic adversarial Variational Auto-encoder algorithm named CONTE for continuous transfer learning. Extensive experiments on both synthetic and real data sets demonstrate the effectiveness of our CONTE algorithm.



Footnotes. [1] Note that $d_1$ is slightly different from the $L^1$ or variation divergence in (Ben-David et al., 2010), which involves only the marginal distribution of features. [2] In this case, we assume $\mu^0 = 1$ for any $\mu \ge 0$. [3] The results for other existing discrepancy measures follow a similar pattern and are thus omitted for brevity.



Figure 1: Illustration of continuous transfer learning. It learns a predictive function on $D_{T_t}$ using knowledge from both the source domain $D_S$ and the historical target domains $D_{T_i}$ ($i = 1, \cdots, t-1$). Directly transferring from the source domain $D_S$ to the target domain $D_{T_t}$ might lead to negative transfer with undesirable predictive performance.

Figure 2: Comparison of domain discrepancy and target accuracy

Figure 3: Comparison of error bounds

Table 1: Transfer learning accuracy from SVHN (source) to the time evolving MNIST (target)

Table 2: Transfer learning accuracy on Office-31

