DIVERSITY OF GENERATED UNLABELED DATA MATTERS FOR FEW-SHOT HYPOTHESIS ADAPTATION

Anonymous

Abstract

Generating unlabeled data has recently been shown to help address the few-shot hypothesis adaptation (FHA) problem, where we aim to train a classifier for the target domain with a few labeled target-domain data and a well-trained source-domain classifier (i.e., a source hypothesis), since the highly compatible unlabeled data provide additional information. However, the data generated by existing methods are extremely similar or even identical. The strong dependency among the generated data can cause learning to fail. In this paper, we propose a diversity-enhancing generative network (DEG-Net) for the FHA problem, which can generate diverse unlabeled data with the help of a kernel independence measure: the Hilbert-Schmidt independence criterion (HSIC). Specifically, DEG-Net generates data by minimizing the HSIC value (i.e., maximizing the independence) among the semantic features of the generated data. With DEG-Net, the generated unlabeled data are more diverse and more effective for addressing the FHA problem. Experimental results show that DEG-Net outperforms existing FHA baselines and further verify that generating diverse data plays an important role in addressing the FHA problem.

1. INTRODUCTION

Data and expert knowledge are always scarce in newly emerging fields, so it is both important and challenging to study how to leverage knowledge from other, similar fields to help complete tasks in the new fields. To cope with this challenge, transfer learning methods were proposed to leverage knowledge of source domains (e.g., data in source domains or models trained with such data) to help complete tasks in other similar domains (a.k.a. the target domains) (Fang et al., 2020; Jing et al., 2020; Pan & Yang, 2009; Sun et al., 2019; Teshima et al., 2020; Zamir et al., 2018). Among transfer learning methods, hypothesis transfer learning (HTL) methods have received a lot of attention since they do not require access to the data in source domains, which prevents data leakage and protects data privacy (Chi et al., 2021a; Du et al., 2017; Liang et al., 2020; Yang et al., 2021a;b). Recently, the few-shot hypothesis adaptation (FHA) problem has been formulated to make HTL more realistic and suitable for solving many practical problems (Liu et al., 2021; Snell et al., 2017; Wang et al., 2020; Yang et al., 2020). In FHA, only a well-trained source-domain classifier (i.e., a source hypothesis) and a few labeled target-domain data are available (Chi et al., 2021a). Similar to HTL, FHA aims to obtain a good target-domain classifier with the help of a source hypothesis and a few target-domain data (Chi et al., 2021a; Motiian et al., 2017). Recently, generating unlabeled data has been shown to be an effective strategy for addressing FHA (Chi et al., 2021a). The target-oriented hypothesis adaptation network (TOHAN), a one-step solution to the FHA problem, constructs an intermediate domain to enrich the training data. The data in the intermediate domain are highly compatible with both the source domain and the target domain (Balcan & Blum, 2010).
By the generated unlabeled data in the intermediate domain, TOHAN partially overcame the problems caused by data scarcity in the target domain. However, existing methods ignore the diversity of the generated data (i.e., the independence among the generated data), so the generated data are extremely similar or even identical. This lack of diversity makes the data less effective for addressing the FHA problem. Taking the FHA task on digits datasets as an example, we found that the data generated by TOHAN suffer from a copy issue: the generator tends to copy the target data (Figure 1(a)).

[Figure 1: Subfigure (a) illustrates the labeled data (left) drawn from the target domain and the unlabeled data (right) generated by TOHAN on the MNIST→SVHN (M → S) task. The generated data are similar to each other and appear to copy the original target data. Subfigure (b) illustrates the accuracy of models trained with data drawn from different domains at different data volumes on the task M → S. For source data and target data, the accuracy of the trained model increases with the number of data. For generated data, growth in data volume improves accuracy only when the volume is small.]

To show how diversity matters in the FHA problem, we conduct experiments on the digits datasets. We use a few labeled target data and an increasing amount of unlabeled data to train the target model. The result is shown in Figure 1(b). For source data and target data, the accuracy of the trained model clearly increases with the number of data. For generated data, the growth of data volume helps improve accuracy only when the volume is small (e.g., fewer than 45 in Figure 1(b)). However, once the number of data exceeds 35, the accuracy of the model fluctuates around 33% regardless of the increase in unlabeled data.
This result shows that models trained with generated data converge much faster, i.e., their accuracy plateaus earlier, than those trained with source or target data, since the generated data have less diversity. In this paper, to show how the diversity of unlabeled data (i.e., the independence among unlabeled data) affects FHA methods, we theoretically analyze the sample complexity of the FHA problem (Theorem 1). In this analysis, we adopt the log-coefficient α (Dagan et al., 2019) to measure the dependency among unlabeled data. Our results show that we can still count on the unlabeled data to help address the FHA problem as long as the unlabeled data are weakly dependent (α < 0.5). Nevertheless, once α ≥ 0.5, the results in Theorem 1 may not hold, and learning can fail theoretically. In addition, we find that high dependency among unlabeled data usually means that we need more unlabeled data to obtain a good target-domain classifier. From the above analysis and Figure 1, we argue that diversity matters in addressing the FHA problem. To this end, we propose the diversity-enhancing generative network (DEG-Net) for the FHA problem, a weight-shared conditional generative method equipped with a kernel independence measure: the Hilbert-Schmidt independence criterion (HSIC) (Gretton et al., 2005; Ma et al., 2020; Pogodin & Latham, 2020), which is used in various situations, e.g., clustering (Song et al., 2007; Blaschko & Gretton, 2008), independence testing (Gretton et al., 2007), and self-supervised classification (Li et al., 2021). Although the log-coefficient is used to analyze the sample complexity of the FHA problem, its calculation requires knowing the distribution of the target-domain data, which is unknown in practice. In contrast, HSIC can be easily estimated from data samples. Thus, we adopt HSIC to calculate the dependency among the generated unlabeled data.
The overview of DEG-Net is in Figure 2, which shows that DEG-Net has two modules: the generation module and the adaptation module. In the generation module, we train the conditional generator with a well-trained source classifier and a few target-domain data. To train the generator with knowledge of both the source domain and the target domain while improving the diversity of the generated data, the generative loss of DEG-Net consists of three parts: the classification loss, the similarity loss, and the diversity loss. More specifically, DEG-Net generates data by minimizing the HSIC value (i.e., maximizing the independence) between the semantic features of the target data and the generated data, where the semantic features are the hidden-layer outputs of the well-trained source hypothesis. To exploit the generalization knowledge in the semantic features of data that is shared by different classes (Chen et al., 2020; Chi et al., 2021b; Yao et al., 2021), the generator is a weight-shared network. In the adaptation module, the source classifier is trained to work well on the target domain. The adaptation module consists of a classifier and a group discriminator (Chi et al., 2021a; Motiian et al., 2017). With the help of the group discriminator, which the classifier tries to confuse so that data from different domains become indistinguishable, the classifier is trained to classify target-domain data using the generated data and the few labeled target data. We verify the effectiveness of DEG-Net on 8 FHA benchmark tasks (Chi et al., 2021a). Experimental results show that DEG-Net outperforms existing FHA methods and achieves state-of-the-art performance. Besides, due to the weight-shared mechanism, DEG-Net trains much faster than previous generative FHA methods.
We also conduct an ablation study verifying that each component of DEG-Net is useful. It shows that diverse generated data help improve performance when addressing the FHA problem, which lights up a novel road for the FHA problem.

2. PRELIMINARY AND RELATED WORKS

Problem Setting. In this section, we formalize the FHA problem mathematically. Denote by $\mathcal{X} \subset \mathbb{R}^n$ the input space and by $\mathcal{Y} := \{1, \dots, K\}$ the output space, where $K$ is the number of classes. The source domain (target domain) can be defined as a joint probability distribution $\mu_s$ on $\mathcal{X} \times \mathcal{Y}$ ($\mu_t$ on $\mathcal{X} \times \mathcal{Y}$). Besides, we assume that there is a well-trained model $f_s : \mathcal{X} \to \mathcal{Y}$. The model $f_s$ is trained with data $\{(x_i^s, y_i^s)\}_{i=1}^{n}$ drawn independently and identically distributed (i.i.d.) from $\mu_s$, with the aim of minimizing $\hat{\mathbb{E}}_{(x,y) \sim \mu_s}[\ell(h(x), y)]$, where $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is a loss function measuring whether two elements of $\mathcal{Y}$ are close, and $h$ is an element of a hypothesis set $\mathcal{H} := \{h : \mathcal{X} \to \mathcal{Y}\}$. Thus, $f_s$ can be defined as

$$f_s := \arg\min_{h \in \mathcal{H}} \hat{\mathbb{E}}_{(X^s, Y^s)}[\ell(h(X^s), Y^s)]. \quad (1)$$

$f_s$ is also called the source hypothesis in this paper. Hence, the FHA problem is defined as follows.

Problem 1 (FHA). Given the source hypothesis $f_s$ defined in Eq. (1) and the labeled dataset $S_t := \{(x_i^t, y_i^t)\}_{i=1}^{m_l}$ ($m_l \leq 7K$, according to (Chi et al., 2021a; Park et al., 2019)), drawn i.i.d. from the target domain $\mu_t$, FHA methods aim to train a classifier $f_t \in \mathcal{H}$ with $f_s$ and $S_t$ that minimizes $\mathbb{E}_{(X^t, Y^t)}[\ell(h(X^t), Y^t)]$ over $h \in \mathcal{H}$, namely $f_t = \arg\min_{h \in \mathcal{H}} \mathbb{E}_{(X^t, Y^t)}[\ell(h(X^t), Y^t)]$.

Hypothesis Transfer Learning Methods. Hypothesis transfer learning (HTL) aims to train a classifier with only a well-trained classifier and a small amount of labeled data or abundant unlabeled data over the target domain (Kuzborskij & Orabona, 2013; Tommasi et al., 2010; Liang et al., 2020). Kuzborskij & Orabona (2013) used the leave-one-out error to find the optimal transfer parameters based on regularized least squares with biased regularization. SHOT (Liang et al., 2020) froze the source hypothesis and trained a domain-specific encoding module using abundant unlabeled data.
Later, neighborhood reciprocity clustering (Yang et al., 2021a) was proposed to address HTL by encouraging reciprocal neighbors to concord in their label predictions. Different from the FHA problem, the HTL problem does not limit the number of labeled target-domain data.

Target-oriented Hypothesis Adaptation Network. TOHAN (Chi et al., 2021a) is a one-step solution to the FHA problem. It performs well due to using generated unlabeled data in the adaptation process. Motivated by learnability results in semi-supervised learning (SSL), TOHAN found that unlabeled data in an intermediate domain, which is compatible with both the source classifier and the target classifier, can address the FHA problem by providing additional information during training. Guided by this principle, the key module of TOHAN generates unlabeled data drawn from the probability distribution $\mu_m$:

$$\mu_m = \arg\min_{\mu} \chi(h_s, \mu) + \chi(h_t, \mu), \quad (2)$$

where $\chi(h_s, \mu)$ (resp. $\chi(h_t, \mu)$) measures how compatible $h_s$ (resp. $h_t$) is with the unlabeled data distribution $\mu$ (Balcan & Blum, 2010).

3. THEORETICAL ANALYSIS REGARDING THE DATA DIVERSITY IN FHA

In previous works, researchers have shown that generated highly compatible data can help address the FHA problem. However, as discussed in Section 1, the diversity of the generated data matters in addressing FHA. Besides, previous studies assume in their theory that the generated data are independent, which is inconsistent with their methods. In this section, we show how the dependency among the generated data affects the performance of FHA methods. Similar to (Chi et al., 2021a), our theory is also based on the theory of SSL (Webb & Sammut, 2010).

Dependency Measure. Following Dagan et al. (2019), we use the log-coefficient, which measures the dependency among observations of a random variable $Z$, to theoretically analyze the data diversity.

Definition 1 (Log-influence and log-coefficient (Dagan et al., 2019)). Let $Z = (Z_1, \dots, Z_m)$ be a random variable over $(\mathcal{X} \times \mathcal{Y})^m$ and let $\mu_Z$ denote either its probability distribution if discrete or its density if continuous. Assume that $\mu_Z > 0$ on all of $(\mathcal{X} \times \mathcal{Y})^m$. For any $i \neq j \in \{1, 2, \dots, m\}$, define the log-influence between $j$ and $i$ as

$$I^{\log}_{j,i}(Z) = \frac{1}{4} \sup_{\substack{Z_{-i-j} \in (\mathcal{X} \times \mathcal{Y})^{m-2} \\ Z_i, Z_i', Z_j, Z_j' \in \mathcal{X} \times \mathcal{Y}}} \log \frac{\mu_Z[Z_i Z_j Z_{-i-j}] \, \mu_Z[Z_i' Z_j' Z_{-i-j}]}{\mu_Z[Z_i' Z_j Z_{-i-j}] \, \mu_Z[Z_i Z_j' Z_{-i-j}]}. \quad (3)$$

Then the log-coefficient of $Z$ is defined as $\alpha_{\log}(Z) = \max_{i=1,\dots,m} \sum_{j \neq i} I^{\log}_{j,i}(Z)$.

From Definition 1, it is clear that $\alpha_{\log}(Z)$ will be zero if $Z_i$ and $Z_j$ are independent (for any $i \neq j$).

Sample Complexity Analysis for FHA. Since the generated data are unlabeled, we follow the theory of SSL to analyze how the generated unlabeled data can help address the FHA problem. More importantly, we analyze how the dependency among the generated data affects the performance of FHA methods. For simplicity, we consider a binary SSL problem (i.e., $K = 2$). Let $f^* : \mathcal{X} \to \{0, 1\}$ be the optimal target classifier.
Let $\mathrm{err}(h) = \mathbb{E}_{x \sim \mu^t_X}[\mathbb{1}(h(x) \neq f^*(x))]$ be the true error rate of a hypothesis $h \in \mathcal{H}$ over a marginal distribution $\mu^t_X$. In FHA, learnability mainly depends on the compatibility $\chi : \mathcal{H} \times \mathcal{X} \to [0, 1]$ measuring how "compatible" $h$ is with one unlabeled datum $x$. In the following, we use $\chi(h, \mu^t_X) = \mathbb{E}_{x \sim \mu^t_X}[\chi(h, x)]$ to represent the expected compatibility of a classifier $h$ with data from $\mu^t_X$, and let $S^{(m_u)}_X$ be an observation of a random variable $X^{t, m_u} = (X^t_1, \dots, X^t_{m_u})$, where the distribution of $X^t_i$ is $\mu^t_X$ for $i = 1, \dots, m_u$. The following theorem shows that, under some conditions, we can still learn a good $f_t$ even when the unlabeled target data are dependent.

Theorem 1. Let $\hat{\chi}(h, S^{(m_u)}_X) = \frac{1}{m_u} \sum_{x \in S^{(m_u)}_X} \chi(h, x)$ be the empirical compatibility over $S^{(m_u)}_X$ and $\mathcal{H}_0 = \{h \in \mathcal{H} : \mathrm{err}(h) = 0\}$. If $f^* \in \mathcal{H}$, $\chi(f^*, \mu^t_X) = 1 - t$, and $\alpha_{\log}(X^{t,m_u}) < 1/2$, then $m_u$ unlabeled data and $m_l$ labeled data are sufficient to learn to error $\epsilon$ with probability $1 - \delta$, for

$$m_u = \max\left( O\!\left( \frac{1}{(1 - \alpha_{\log}(X^{t,m_u})) \epsilon^2} \log \frac{2}{\delta} \right), \; O\!\left( \frac{\mathrm{VCdim}(\chi(\mathcal{H}))}{(1 - 2\alpha_{\log}(X^{t,m_u})) \epsilon^2} \right) \right) \quad (4)$$

and

$$m_l = \frac{2}{\epsilon} \left[ \ln\!\left( 2 \mathcal{H}_{\mu^t_X, \chi}(t + 2\epsilon)[2m_l, \mu^t_X] \right) + \ln \frac{4}{\delta} \right],$$

where $\chi(\mathcal{H}) = \{\chi_h : h \in \mathcal{H}\}$ is assumed to have a finite VC dimension, $\chi_h(\cdot) = \chi(h, \cdot)$, and $\mathcal{H}_{\mu^t_X, \chi}(t + 2\epsilon)[2m_l, \mu^t_X]$ is the expected number of splits of $2m_l$ data drawn from $\mu^t_X$ using hypotheses in $\mathcal{H}$ of compatibility greater than $1 - t - 2\epsilon$. In particular, with probability at least $1 - \delta$, we have $\mathrm{err}(\hat{h}) \leq \epsilon$, where $\hat{h} = \arg\max_{h \in \mathcal{H}_0} \hat{\chi}(h, S^{(m_u)}_X)$.

The proof of Theorem 1 is presented in Appendix A, and mainly follows the recent result on learning from dependent data (Dagan et al., 2019). Theorem 1 shows that when the dependency among the unlabeled data is weak (i.e., $\alpha_{\log}(X^{t,m_u}) < 1/2$), we obtain a result similar to the classical result in SSL theory (Balcan & Blum, 2010).
Namely, if we can generate unlabeled data that are highly compatible with $f^*$, then $t$ is very small and thus $\mathcal{H}_{\mu^t_X, \chi}(t + 2\epsilon)[2m_l, \mu^t_X]$ is small, so we do not need many labeled data to learn a good $f_t$ (Chi et al., 2021a).

Diversity Matters in FHA. Theorem 1 also shows that the diversity of unlabeled data matters in FHA, for two reasons. First, Theorem 1 might not hold if there is strong dependency among the unlabeled data (e.g., $\alpha_{\log}(X^{t,m_u}) > 1/2$). This directly deprives previous work of its theoretical foundation for addressing the FHA problem. Second, we need more unlabeled data to reach the same error $\epsilon$ if the dependency among the unlabeled data increases. Specifically, if $\alpha_{\log}(X^{t,m_u})$ is very close to $1/2$, then $m_u$ can be very large. These reasons motivate us to control the dependency among the generated data. To weaken such dependency, we propose our diversity-enhancing generative network for the FHA problem.
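As a toy illustration of Definition 1 (our own sketch, not part of the paper's method), the log-influence can be computed by brute force for a pair of binary random variables; with $m = 2$ there is no $Z_{-i-j}$, so the expression reduces to a supremum over the four coordinates. A product distribution yields zero, and the value grows with dependence. The function name is hypothetical:

```python
import itertools
import math

def log_coefficient_pair(p):
    """Log-coefficient of a pair (Z_1, Z_2) of binary variables with joint
    pmf p[(z1, z2)] > 0. For m = 2 the log-influence in Definition 1
    reduces to (1/4) * sup |log[p(a,b) p(a',b') / (p(a',b) p(a,b'))]|."""
    best = 0.0
    for a, a2, b, b2 in itertools.product([0, 1], repeat=4):
        ratio = (p[(a, b)] * p[(a2, b2)]) / (p[(a2, b)] * p[(a, b2)])
        best = max(best, abs(math.log(ratio)) / 4.0)
    return best
```

For an independent pair the ratio inside the logarithm is always 1, so the log-coefficient is exactly zero, matching the remark after Definition 1; a strongly correlated pair yields a large value.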

4. DIVERSITY-ENHANCING GENERATIVE NETWORK FOR FHA PROBLEM

In this section, we propose the diversity-enhancing generative network (DEG-Net) for the FHA problem. DEG-Net has two modules: the generation module, which generates diverse unlabeled data, and the adaptation module, which trains the classifier for the target domain.

4.1. DIVERSITY-ENHANCING GENERATION

To overcome the shortcomings of the current generative method for the FHA problem, we propose solutions for both the generative architecture and the loss function. For the generative architecture, we propose a weight-shared conditional generative network for generating data of a specific category. For the generative loss function, we design a novel loss that constrains the similarity and diversity of the semantic information of the generated data.

Weight-shared Conditional Generative Network. As discussed before, to use generalized features shared among different classes to improve the quality of the generated data and reduce training time, a weight-shared conditional generative network is promising. Following Chi et al. (2021a), the generator aims to generate data of specific categories from Gaussian random noise. The encoder outputs the semantic feature and the class probability distribution of the generated data. To achieve the aim of the generative method, we design a two-part loss: the classification loss and the semantic-feature similarity loss. We denote by $x_i^{g_n} = G(z_i, c_n)$ the generated data of a specific category, where the inputs of the generator $G$ are the Gaussian random noise $z_i$ and the categorical information $c_n$. Specifically, we use the one-hot encoded label as the categorical information. The generated data $x_i^{g_n}$ are fed into the well-trained source-domain classifier $f_s = h_s \circ g_s$, where the output of $h_s$ is the group discriminator feature, which will be used in the adaptation module, and the output of $g_s$ is the probability feature $p_i = (p_i^1, \dots, p_i^n, \dots, p_i^K)$, where $p_i^n$ is the probability of the generated data belonging to the $n$-th class. The semantic feature $s_i^{g_n}$ used in the similarity loss and the diversity loss is the hidden-layer output of $h_s$ (details of the hidden-layer selection can be found in Appendix C).
[Algorithm 1 (excerpt): Update $\theta_f \leftarrow \theta_f - \alpha_f \nabla \mathcal{L}_f(\{G_i\}_{i=1}^{4}, x)$ using Eq. (13); update $\theta_D \leftarrow \theta_D - \alpha_D \nabla \mathcal{L}_D(\{G_i\}_{i=1}^{4})$ using Eq. (12). Output: a well-trained classifier $f_\theta$.]

We aim to update the parameters of the generator so that the generated data $x_i^{g_n}$ with categorical information $c_n$ belong to the $n$-th class, i.e., so that $p_i^n$ is close to 1. Specifically, we minimize the following loss to generate data of a specific category $n$:

$$\mathcal{L}_c = \frac{1}{K} \sum_{n=1}^{K} \frac{1}{B_n} \sum_{i=1}^{B_n} \| p_i^n - 1 \|_2^2, \quad (6)$$

where $B_n$ is the batch size of the generator. To make the generated data closer to the data in the target domain, we need a loss function that measures the difference between data of two different domains. Motivated by Zheng & Zhou (2021), DEG-Net uses the semantic features to calculate the similarities. To avoid the copy issue, we use the weighted $\ell_1$ distance $\|x - y\|_1 = \sum_i \omega_i |x_i - y_i|$, where $\omega_i = |x_i - y_i|^2 / \|x - y\|_2$, since it encourages larger gradients for feature dimensions with higher residual errors. Compared to the $\ell_2$ norm, it better measures the similarity of the semantic features between the generated images and the target images, since the $\ell_1$ distance is more robust to outliers (Oneto et al., 2020). Thus, the similarity loss is defined as follows:

$$\mathcal{L}_s = \frac{1}{K} \sum_{n=1}^{K} \frac{1}{m_l M B_n} \sum_{i=1}^{B_n} \sum_{j=1}^{m_l} \| s_i^{g_n} - s_j^t \|_1, \quad (7)$$

where $M = \max_{s_1, s_2 \in \mathcal{X}} \|s_1 - s_2\|_1$ ($\mathcal{X}$ is compact and $\|\cdot\|_1$ is continuous), $m_l$ is the number of labeled data drawn from the target domain, and $s_j^t$ and $s_i^{g_n}$ are the semantic features of the target data and the generated data, respectively. Combining Eq. (6) and Eq. (7), we obtain the loss to train the weight-shared conditional generative network:

$$\mathcal{L}_G = \mathcal{L}_c + \lambda \mathcal{L}_s, \quad (8)$$

where $\lambda \geq 0$ is a hyper-parameter balancing the two losses. Note that optimizing Eq. (8) corresponds to TOHAN's principle in Eq. (2), where Eq. (6) (resp. Eq. (7)) corresponds to $\chi(h_s, \mu)$ (resp. $\chi(h_t, \mu)$).
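The residual-weighted $\ell_1$ distance used in the similarity loss can be sketched as follows (a minimal NumPy version; the function name and the zero-distance guard are our own assumptions):

```python
import numpy as np

def weighted_l1(x, y, eps=1e-12):
    """Residual-weighted l1 distance: sum_i w_i |x_i - y_i|,
    with w_i = |x_i - y_i|^2 / ||x - y||_2, so dimensions with larger
    residual errors receive larger weights (and larger gradients)."""
    r = np.abs(x - y)
    l2 = np.linalg.norm(x - y)
    if l2 < eps:          # identical inputs: distance is zero
        return 0.0
    w = r ** 2 / l2
    return float(np.sum(w * r))
```

Note the cubic dependence on each per-dimension residual (normalized by the overall $\ell_2$ residual), which is what pushes the generator hardest on the feature dimensions that differ most from the target features.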
To ensure that the conditional generator can generate images with the correct class label, we pretrain the generator with the well-trained source model for some epochs.

Generative Function with Diversity. As discussed above, weak dependence among unlabeled data is an important condition for using generated unlabeled data to address the FHA problem. To ensure that the generated unlabeled data are weakly dependent (i.e., to generate more diverse unlabeled data), it is necessary to use a diversity regularization when training the generator. Unfortunately, the log-coefficient, the dependence measure used to analyze the sample complexity, is hard to calculate, since its calculation requires the unknown distribution of the target-domain data. HSIC, a kernel independence measure, can also measure the dependency of the generated data. Different from the log-coefficient, HSIC can be easily estimated (Gretton et al., 2005; Song et al., 2012):

$$\mathrm{HSIC}(X, Y) = \frac{1}{(N-1)^2} \mathrm{Tr}(KHLH), \quad (9)$$

where $K = (k_{ij})$ with $k_{ij} = k(x_i, x_j)$ and $L = (l_{ij})$ with $l_{ij} = k(y_i, y_j)$ are kernel matrices ($k(\cdot, \cdot)$ is the kernel function) and $H = I - \frac{1}{N} \mathbf{1}\mathbf{1}^\top$ is the centering matrix. We minimize the HSIC measure of the generated data's semantic features to obtain weakly dependent data. Specifically, we use the Gaussian kernel as the kernel function and minimize the following loss to generate more diverse data:

$$\mathcal{L}_d = \frac{1}{K} \sum_{n=1}^{K} \mathrm{HSIC}(s^{g_n}, s^{g_n}) = \frac{1}{K} \sum_{n=1}^{K} \frac{1}{(B_n - 1)^2} \mathrm{Tr}(S_n H S_n H), \quad (10)$$

where $S_n = (s^n_{ij})$ with $s^n_{ij} = k(s_i^{g_n}, s_j^{g_n})$ is the kernel matrix of the semantic features of the generated data of a specific class. Hence, we obtain the total loss to train the generator with diversity enhancement:

$$\mathcal{L}_{G_d} = \mathcal{L}_G + \beta \mathcal{L}_d, \quad (11)$$

where $\beta \geq 0$ is a hyper-parameter balancing the generative loss and the diversity regularization. Note that the diversity regularization and the similarity loss restrict each other.
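The biased HSIC estimator $\mathrm{Tr}(KHLH)/(N-1)^2$ with a Gaussian kernel can be sketched as follows (a minimal NumPy version; the function names and the default bandwidth $\sigma$ are our own assumptions, not from the paper):

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    # Gram matrix K with k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)).
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def hsic(X, Y, sigma=1.0):
    # Biased HSIC estimate: Tr(K H L H) / (N - 1)^2, with the
    # centering matrix H = I - (1/N) 11^T.
    n = X.shape[0]
    K = gaussian_gram(X, sigma)
    L = gaussian_gram(Y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(K @ H @ L @ H)) / (n - 1) ** 2
```

In the diversity loss, the role of both arguments is played by the batch of semantic features of one class, so minimizing the estimate pushes the features within a batch toward mutual independence.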

4.2. ADAPTATION MODULE USING GENERATED DATA

Following Chi et al. (2021a) , we create paired data using the labeled data in the target domain and the generated data, and assign the group labels to the paired groups under the following rules: G 1 pairs the generated data with the same class label; G 2 pairs the generated data and the data in the target domain with the same class label; G 3 pairs the generated data but with different class label; G 4 pairs the generated data and the data in the target domain and also with different class label. By using adversarial learning, we train a discriminator D which could distinguish between the data in different domains while maintaining high classification accuracy on generated data. The discriminator D is a four-class classifier with the inputs of the above paired group data. Different from classical adversarial domain adaptation (Ganin et al., 2016; Jiang et al., 2020) , the group discriminator D decides which of the four groups a given data pair belongs to. By freezing the encoder, we train D with the cross-entropy loss: L D = -Ê 4 i=1 y Gi log(D(ϕ(G i ))) , where Ê(•) represents the empirical mean value, y Gi is the group label of group G i and ϕ(G i ) := [g(x 1 ), g(x 2 )] is the output of the encoder with the paired data input. Next, we will train the classifier f t = h t • g t while freezing the group discriminator, which is initialized with the same weight as that in the source classifier f s = h s • g s . Motivated by nonsaturating games (Goodfellow, 2016) , we minimize the following loss to update f t : L f = -γ Ê [y G1 log (D (ϕ (G 2 ))) -y G3 log (D (ϕ (G 4 )))] + Ê [ℓ (f t (x t ) , f * t (x t ))] , where γ ≥ 0 is a hyper-parameter, l is the cross-entropy loss, and f * t is the optimal target model. Note that, the label information of generated data has as certain noise. As demonstrated in Theorem 1, it is only necessary to use generated unlabeled data for addressing the FHA problem. Thus, we only use labeled target data for target supervised loss in Eq. 
( 13).
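The four grouping rules above can be sketched as a simple labeling function (a hypothetical helper following the stated rules; here `'g'` marks generated data and `'t'` target data, and the pairs are always anchored on a generated sample):

```python
def group_label(domain_a, label_a, domain_b, label_b):
    """Return the group index of a data pair:
    G1: generated + generated, same class;  G2: generated + target, same class;
    G3: generated + generated, diff class;  G4: generated + target, diff class."""
    same_class = (label_a == label_b)
    cross_domain = (domain_a != domain_b)
    if same_class:
        return 2 if cross_domain else 1
    return 4 if cross_domain else 3
```

The adversarial term in the classifier loss then rewards making cross-domain pairs ($G_2$, $G_4$) look like their same-domain counterparts ($G_1$, $G_3$), which aligns the generated and target feature distributions class by class.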

5. EXPERIMENTS

We compare DEG-Net with previous FHA methods on digits datasets (i.e., MNIST ($M$), USPS ($U$), and SVHN ($S$)) and object datasets (i.e., CIFAR-10 ($CF$) and STL-10 ($SL$)), following Chi et al. (2021a).

Digits Datasets. We conduct 6 adaptation tasks among the 3 digits datasets and choose the number of target data from 1 to 7 per class. The target-domain classification accuracy of our method over the 6 tasks is shown in Table 1. The results show that DEG-Net performs best on almost all tasks. The accuracy of DEG-Net is lower than TOHAN only when the amount of target data is very small. The diversity regularization and the similarity loss restrict each other to avoid the copy issue. However, when the amount of target data is very small, the target-domain information is limited, so the generator is less likely to generate data similar to the target domain. The diversity loss strengthens this adversarial effect, causing DEG-Net to degrade toward TOHAN and SHOT. Another improvement of DEG-Net over TOHAN is the faster training of the generator: DEG-Net needs 0.93 s per training epoch, while TOHAN needs 1.35 s.

Objects Datasets. Following Chi et al. (2021a), we examine the performance of DEG-Net on 2 object tasks and choose the number of target data as 10 per class. The classification accuracies on the object tasks are shown in Table 2. DEG-Net clearly outperforms the baselines. In $CF \to SL$, we achieve a 1.5% improvement over TOHAN. In $SL \to CF$, we achieve an accuracy of 57.2%, a 0.3% improvement over S+FADA. The improvement of DEG-Net is less pronounced on the object tasks, which may be caused by the simple structure of the generative networks and the complexity of the datasets.

DEG-Net Generates More Diverse Data Than TOHAN. In this part, we analyze the diversity of the data generated by DEG-Net and TOHAN to see whether our generation process produces more diverse data than TOHAN's.
We choose the square root of the HSIC value to measure the diversity of the generated data on the task $M \to S$, and calculate the HSIC value among the target-domain data as a reference, which is 0.0013 (a lower value indicates weaker dependence, i.e., higher diversity). The average diversity measure of DEG-Net is 0.0019, while that of TOHAN is 0.0027. Hence DEG-Net generates more diverse data than TOHAN. The detailed diversity analysis can be found in Appendix D.

Ablation Study. To show the advantages of the weight-shared architecture and the diversity loss, we conduct two experiments: (1) the weight-shared architecture is the same as DEG-Net but uses Eq. (8) to train the generator (DEG-Net without diversity); (2) the separate generative method, similar to TOHAN, has $K$ generators and uses the semantic features to calculate the similarity loss for training each generator:

$$\mathcal{L}^n_{G_s} = \frac{1}{B_n} \| p^n - 1 \|_2^2 + \lambda \frac{1}{N_y M B_n} \sum_{i=1}^{B_n} \sum_{j=1}^{N_y} \| s_i^{g_n} - s_j^t \|_1.$$

As shown in Table 3, DEG-Net works better than both methods introduced above, and the weight-shared architecture works better than the separate generative method. This reveals that both the weight-shared architecture and the diversity loss can improve the quality of the generated data and thus achieve higher accuracy. Specifically, compared to the modified DEG-Net without the diversity loss, the separate generative method ignores the generalization knowledge in the semantic features of data that is shared by all classes. The modified DEG-Net discards the diversity loss, and thus generates less diverse data, resulting in worse performance. However, the HSIC diversity loss does not work in all situations: DEG-Net achieves similar or even worse accuracy than the modified DEG-Net without diversity when the amount of labeled data is very small (i.e., $m_l \leq 2$). This phenomenon may be caused by worse data generated under the diversity constraint.
Since the diversity loss restricts the similarity loss, the generator is less likely to generate data similar to the target domain (i.e., the distribution of the generated data is far from the target domain).

6. CONCLUSION

In this paper, we focus on generating more diverse unlabeled data for addressing the FHA problem. We show experimentally and theoretically that the diversity of the generated data (i.e., the independence among the generated data) matters in addressing the FHA problem. To address FHA, we propose a diversity-enhancing generative network (DEG-Net), which consists of a generation module and an adaptation module. With a weight-shared conditional generative method equipped with a kernel independence measure, HSIC, DEG-Net can generate more diverse unlabeled data and achieve better performance. Experiments show that the data generated by DEG-Net are more diverse, and DEG-Net thus achieves state-of-the-art performance when addressing the FHA problem, which lights up a novel and theoretically guaranteed road for the FHA problem in the future.

7. REPRODUCIBILITY STATEMENT

We implement all methods in PyTorch 1.7.1 and Python 3.7.6, and conduct all experiments on two NVIDIA RTX 2080Ti GPUs. We use the standard DCGAN network (Radford et al., 2015) as the generator architecture. We adopt the backbone network of LeNet-5 (LeCun et al., 1998) with batch normalization and dropout as the encoder. We employ fully connected layers with a softmax function as the classifier. The semantic feature in the digits tasks is the output of the first fully connected layer. We adopt 3 fully connected layers with a softmax function as the group discriminator $D$. We choose the Gaussian kernel as the kernel function to calculate the HSIC measure. The hyper-parameter settings can be found in Appendix C. Details on the datasets used in the paper can be found in Appendix B. For clear explanations of the theoretical results, we prove Theorem 1 in Appendix A.

A PROOF OF THEOREM 1

Before proving Theorem 1, we first introduce a McDiarmid-like inequality under the log-coefficient of a random vector $Z$.

Lemma 1 (McDiarmid-like inequality under the log-coefficient of $Z$). Let $\mu^{(m)}$ be a distribution defined over $\mathcal{Z}^m$, where $\mathcal{Z}^m = (\mathcal{X} \times \mathcal{Y})^m$, let $Z = (Z_1, \dots, Z_m) \sim \mu^{(m)}$ be a random vector, and let $g : \mathcal{Z}^m \to \mathbb{R}$ satisfy the following bounded differences property with parameters $\lambda_1, \dots, \lambda_m > 0$:

$$\forall z, z' \in \mathcal{Z}^m: \; |g(z) - g(z')| \leq \sum_{i=1}^{m} \mathbb{1}_{z_i \neq z'_i} \lambda_i.$$

If $\alpha_{\log}(Z) < 1$, then, for all $t > 0$,

$$\Pr\big[|g(Z) - \mathbb{E}[g(Z)]| \geq t\big] \leq 2 \exp\left( -\frac{(1 - \alpha_{\log}(Z)) t^2}{2 \sum_{i=1}^{m} \lambda_i^2} \right).$$

Proof. Based on Definition 2.2 and Lemma 5.2 in (Dagan et al., 2019), we know that $\mu^{(m)}$ satisfies Dobrushin's condition with a coefficient $\alpha < 1$. Thus, based on Theorem 2.3 in (Dagan et al., 2019), we have

$$\Pr\big[|g(Z) - \mathbb{E}[g(Z)]| \geq t\big] \leq 2 \exp\left( -\frac{(1 - \alpha) t^2}{2 \sum_{i=1}^{m} \lambda_i^2} \right).$$

Since $\alpha \leq \alpha_{\log}(Z)$, the lemma follows.

Then, we introduce a recent result bounding the expected supremum of an empirical process by the corresponding Gaussian complexity.

Theorem 2 (Dagan et al., 2019).
Let Z be a random vector over some domain Z^m and let G be a class of functions from Z to R. If α_log(Z) < 1/2, then

E_{S∼Z}[ sup_{g∈G} | (1/m) Σ_{i=1}^m g(s_i) − E_S[(1/m) Σ_{i=1}^m g(s_i)] | ] ≤ C G_Z(G) / (1 − 2 α_log(Z)),

where C > 0 is a universal constant and S = (s_1, ..., s_m) is a sample of Z. Note that the above result is very general: it does not assume that the m marginals of the distribution of Z are identical. Based on the above theorem and lemma, we can prove the following lemma.

Lemma 2. Let Z be a random vector over some domain Z^m and let G be a class of functions from Z to R. If α_log(Z) < 1/2, and there exists L > 0 such that |g(Z_i)| ≤ L for any g ∈ G and any Z_i, then, for any t > 0,

Pr_{S∼Z}[ sup_{g∈G} | (1/m) Σ_{i=1}^m g(s_i) − E_S[(1/m) Σ_{i=1}^m g(s_i)] | ≥ C G_Z(G) / (1 − 2 α_log(Z)) + t ] ≤ e^{−(1 − α_log(Z)) m t² / (C′ L²)}

for some universal constants C, C′ > 0.

Proof. We first prove the non-absolute-value version. Let M(S) = sup_{g∈G} ( (1/m) Σ_{i=1}^m g(s_i) − E_S[(1/m) Σ_{i=1}^m g(s_i)] ). For any S ∼ Z and S′ ∼ Z, we have |M(S) − M(S′)| ≤ Σ_{i=1}^m 2L 1_{s_i ≠ s′_i} / m. According to Lemma 1, we have

Pr_{S∼Z}[M(S) − E[M(S)] ≥ t] ≤ exp( −(1 − α_log(Z)) m t² / (C′ L²) )

C DETAILS REGARDING EXPERIMENTS

Baselines. We follow the standard domain-adaptation protocols (Shu et al., 2018) and compare DEG-Net with the following 6 baselines: (1) Without adaptation (WA): classify the target domain with the well-trained source-domain classifier. (2) Fine tuning (FT): train the last fully connected layer of the classifier with the few accessible labeled data. (3) SHOT: an HTL method, which we modify to use both the labeled and unlabeled target data (Liang et al., 2020). (4) S+FADA: generate unlabeled data using the loss L_c with the well-trained source classifier and feed them into DANN (Ganin et al., 2016). (5) T+FADA: generate unlabeled data using the loss L_s with the few labeled target data and feed them into DANN.
(6) TOHAN: a recent FHA method, which generates category-specific unlabeled data separately (Chi et al., 2021a).

Implementation Details. We implement all methods in PyTorch 1.7.1 and Python 3.7.6, and conduct all the experiments on NVIDIA RTX 2080Ti GPUs. Due to the limitation of the accessible computing resources, we cannot choose more complex networks as the backbone of the generator. Our conditional generator G uses the standard DCGAN network (Radford et al., 2015). We adopt a LeNet-5 backbone with batch normalization and dropout to extract the group-discriminator feature. We employ fully connected layers with a softmax function as the classifier to obtain the classification probability. The semantic feature in the digit tasks is the output of the first fully connected layer. We adopt 3 fully connected layers with a softmax function as the group discriminator D.

Hyper-parameter Settings. Following the common protocol of domain adaptation (Shu et al., 2018), we set fixed hyper-parameters for the different datasets. We pretrain the conditional generator for 300 epochs and pretrain the group discriminator for 100 epochs. The number of training steps of the classifier (i.e., the adaptation module) is set to 50. For the generator and the group discriminator, the learning rate of the Adam optimizer is set to 1 × 10⁻³; for the classifier, the learning rate of the Adam optimizer is set to 1 × 10⁻². The tradeoff parameter λ in Eq. (8) is set to 0.9 and the tradeoff parameter β in Eq. (11) is set to 0.1. Following (Long et al., 2018), the tradeoff parameter γ in Eq. (13) is set to 2 / (1 + exp(−10q)) − 1.
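As a minimal sketch (not the paper's code), the annealing schedule for γ above can be written as follows; the function name and the assumption that q is the normalized training progress in [0, 1] are ours.

```python
import math

def gamma_schedule(q: float) -> float:
    """Tradeoff parameter gamma in Eq. (13), following Long et al. (2018).

    q is the normalized training progress in [0, 1]; gamma ramps smoothly
    from 0 at the start of training toward 1 at the end.
    """
    return 2.0 / (1.0 + math.exp(-10.0 * q)) - 1.0

print(gamma_schedule(0.0))           # 0.0
print(round(gamma_schedule(1.0), 4)) # 0.9999
```

This schedule suppresses the γ-weighted term early in training, when the generator is still unreliable, and lets it dominate once training stabilizes.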

D ADDITIONAL ANALYSIS

Augmentation Techniques on the FHA Problem. In this section, we compare the accuracy of the target classifier trained by TOHAN with that of TOHAN combined with basic geometric data augmentation for the FHA problem over the digit tasks. Geometric data augmentation has been widely explored to diversify image data (Shorten & Khoshgoftaar, 2019). In our experiment, we randomly choose one or more of the following augmentation techniques for the data generated by TOHAN: resizing, shifting, cropping, and slight rotations (1 to 20 degrees and -1 to -20 degrees). The classifier accuracy on the target domain over 4 experiments, together with the average accuracy, is shown in Table 4. It is clear that the augmentation techniques perform worse than our method in general. This may be caused by the fact that the generated images are similar to, or even the same as, the few target data: the diversity remains low even with data augmentation. The accuracy with augmentation is basically the same as TOHAN's, and the improvement brought by augmentation becomes more obvious as the number of target data increases.

Diversity Analysis of DEG-Net. In this section, we compare the diversity of the data generated by DEG-Net with that of TOHAN and of the target data. Because of the difficulty of calculating the log-influence, we use HSIC to measure the diversity of the data. Since the generation batch size in the training process is 32, we calculate the HSIC measure on 32 samples. Table 5 shows the diversity of the different data. It is clear that the diversity loss in DEG-Net works well to make the generated data more diverse.

Data Efficiency Analysis of DEG-Net. In this section, we conduct experiments on the tasks M → S and M → U to analyze the efficiency of the generated data. Following the architecture of DEG-Net, we use Eq.
(11) to train the conditional generator, and obtain a training loss in which x_g denotes the generated data and f_t^*(x_g) its label. We use different numbers of data generated by TOHAN (Chi et al., 2021a) and by DEG-Net to train the classifier; the classification accuracy is shown in Table 6. It is clear that the performance of using data generated by TOHAN is almost the same as using only the labeled data. In addition, the data generated by DEG-Net cannot improve the performance of the model while the number of target data per class is small. This may be because the generated data are similar to the labeled target data, so adding nearly identical data to the training brings little improvement. However, it is worth noting that the improvement becomes large once the number of data generated by DEG-Net exceeds 5 per class. This phenomenon indicates that the data generated by DEG-Net are more independent of the existing target data and can, to some degree, be treated as new data.
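A minimal sketch of such a diversity measurement is given below. The paper fixes the Gaussian kernel but does not state a bandwidth here, so the common median-distance heuristic is our assumption; the batch size of 32 matches the generation batch size above, while the feature dimension and data are toy values.

```python
import numpy as np

def gaussian_kernel(X: np.ndarray) -> np.ndarray:
    """Gaussian kernel matrix with the median-distance bandwidth heuristic."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    bandwidth = np.median(d2[d2 > 0])  # assumption: median heuristic
    return np.exp(-d2 / bandwidth)

def hsic_biased(X: np.ndarray, Y: np.ndarray) -> float:
    """Biased empirical HSIC: (1/m^2) * trace(K H L H), H = I - (1/m) 1 1^T.

    Smaller values mean the two feature sets are closer to independent,
    i.e., more diverse in the sense used by DEG-Net's diversity loss.
    """
    m = X.shape[0]
    K, L = gaussian_kernel(X), gaussian_kernel(Y)
    H = np.eye(m) - np.ones((m, m)) / m  # centering matrix
    return float(np.trace(K @ H @ L @ H)) / (m ** 2)

rng = np.random.default_rng(0)
m, d = 32, 8                                # batch of 32, toy feature dimension
X = rng.normal(size=(m, d))                 # semantic features of one group
Y_indep = rng.normal(size=(m, d))           # independent features
Y_dup = X + 0.01 * rng.normal(size=(m, d))  # near-duplicate features
print(hsic_biased(X, Y_indep) < hsic_biased(X, Y_dup))  # True
```

Near-duplicate features yield a large HSIC value while independent features yield a value near zero, which is why minimizing HSIC among semantic features pushes the generator toward diverse samples.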



Figure 1: The low-diversity issue of generated unlabeled data when solving the FHA problem.

Figure 2: Overview of the diversity-enhancing generative network (DEG-Net). It consists of a generator G, a classifier f_t = h_t ∘ g_t (initialized as f_t = f_s), and a group discriminator D. (a) In the generation module, we train the generator G while freezing the classifier f_t = f_s. (b) In the adaptation module, we first pair the generated data with the labeled data and use the paired data to train the discriminator D while freezing the encoder h_t. Then, we freeze the discriminator D and train the classifier f_t.

Algorithm 1 Diversity-enhancing Generative Network (DEG-Net)
Input: conditional generator G parameterized by θ_G, a group discriminator D parameterized by θ_D, a classifier f parameterized by θ_f, kernel function k, generation batch size B_n, learning rates α_G, α_D and α_f, total epochs T_max, group-discriminator pretraining epochs T_d.
for t = 1, 2, ..., T_max do
  for n = 0, 1, ..., K − 1 do
    1: Generate random noise z and categorical information c_n;
    2: Generate data G_n(z) and then add them to D_m;
    3: Calculate the semantic feature s and classification probability p;
    4: Calculate the kernel matrix of the semantic feature S_n using kernel function k;
  end
  5: Update θ_G ← θ_G − α_G ∇L_{G_d}(s, p) using Eq. (11);
  if t = T_max − T_f then
    for i = 1, 2, ..., T_d do
      6: Sample G_1, G_3 from D_m × D_m;

Classification accuracy±standard deviation (%) on 6 digits FHA tasks. Bold value represents the highest accuracy on each column.

Classification accuracy±standard deviation (%) on 2 objects FHA tasks. Bold value represents the highest accuracy on each row.

Ablation study. Classification accuracy±standard deviation (%) on M → U. Bold value represents the highest accuracy on each column.

Classification accuracy±standard deviation (%) on digits FHA tasks of the data augmentation. Bold value represents the highest average accuracy on each column.

The diversity of the target data and of the data generated by different methods.

Classification accuracy (%) on digits FHA tasks using the generated data. Bold value represents the highest average accuracy on each column.


for some universal constant C′ > 0. Then, combining this with the bound on E[M(S)] (based on Theorem 2), we obtain the one-sided bound. For the opposite inequality (the −M(S) part), following (Dagan et al., 2019), we can apply the same arguments to −G. Note that G(−G) = G(G), which concludes the bound. The above lemma is a slightly more general version of Theorem 6.7 in (Dagan et al., 2019), obtained by considering the influence of α_log(Z). Based on Lemma 2, we can prove Theorem 1 below.

Theorem 1. Let χ(h, S_X^(m_u)) be the empirical compatibility over S_X^(m_u). If α_log(X^{t,m_u}) < 1/2, then m_u unlabeled data and m_l labeled data are sufficient to learn to error ϵ with probability 1 − δ, where the labeled-sample bound depends on the expected number of splits of 2m_l data drawn from μ_X^t using hypotheses in H of compatibility more than 1 − t − 2ϵ. In particular, with probability at least 1 − δ, we have err(ĥ) ≤ ϵ, where ĥ = arg max_{h∈H_0} χ(h, S_X^(m_u)).

Proof. Let S be the set of m_u unlabeled data. Based on the relation between the VC dimension and the Gaussian complexity, Lemma 2 gives that, with probability at least 1 − δ/2, |χ(h, D) − χ(h, S)| ≤ ϵ for all h ∈ H, where the empirical compatibility is taken over the uniform distribution over S and χ_h(x) = χ(h, x). Therefore, the set of hypotheses with χ(h, S) ≥ 1 − t − ϵ is contained in H_{μ_X^t, χ}(t + 2ϵ). The bound on the number of labeled data now follows directly from known concentration results, using the expected number of partitions instead of the maximum in the standard VC-dimension bounds. This bound ensures that, with probability 1 − δ/2, none of the functions h ∈ H_{μ_X^t, χ}(t + 2ϵ) with err(h) ≥ ϵ have zero empirical error.

The above two arguments together imply that, with probability 1 − δ, all h ∈ H with zero empirical error and χ(h, S) ≥ 1 − t − ϵ have err(h) ≤ ϵ, and furthermore f* has χ(f*, S) ≥ 1 − t − ϵ. This in turn implies that, with probability at least 1 − δ, we have err(ĥ) ≤ ϵ, where ĥ = arg max_{h∈H_0} χ(h, S).

B DATASETS

Digits. Following TOHAN (Chi et al., 2021a), we conduct 6 adaptation experiments on digits datasets: M → U, M → S, S → U, S → M, U → M and U → S. MNIST (M) (LeCun et al., 1998) is a handwritten digits dataset whose images are size-normalized and centered in 28 × 28 pixels. SVHN (S) (Netzer et al., 2011) is a real-world image digits dataset whose images are 32 × 32 pixels with 3 channels. USPS (U) (Hull, 1994) images are 16 × 16 grayscale pixels. The SVHN and USPS images are resized to 28 × 28 grayscale pixels in the adaptation tasks (Chi et al., 2021a).

Objects. Following (Sun et al., 2019), we compare DEG-Net and the baselines on CIFAR-10 and STL-10. The CIFAR-10 (Krizhevsky et al., 2009) dataset contains 60,000 32 × 32 color images in 10 categories, while the STL-10 (Coates et al., 2011) dataset is inspired by the CIFAR-10 dataset with some modifications. These two datasets only share nine overlapping classes, so we removed the non-overlapping classes ("frog" and "monkey") (Shu et al., 2018).
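An illustrative sketch of this preprocessing follows. The paper does not specify the interpolation method or grayscale conversion it uses, so the nearest-neighbor resize and ITU-R BT.601 grayscale weights below are our assumptions.

```python
import numpy as np

def to_grayscale(img: np.ndarray) -> np.ndarray:
    """RGB image (H, W, 3) -> grayscale (H, W) using ITU-R BT.601 weights."""
    return img @ np.array([0.299, 0.587, 0.114])

def resize_nearest(img: np.ndarray, size: int = 28) -> np.ndarray:
    """Nearest-neighbor resize of a (H, W) grayscale image to (size, size).

    Works for both downsampling (SVHN, 32x32) and upsampling (USPS, 16x16).
    """
    h, w = img.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[np.ix_(rows, cols)]

# Example: an SVHN-style 32x32 RGB image becomes a 28x28 grayscale image.
svhn_like = np.random.rand(32, 32, 3)
out = resize_nearest(to_grayscale(svhn_like), 28)
print(out.shape)  # (28, 28)
```

The same pipeline applied to a 16 × 16 USPS-style image also yields a 28 × 28 grayscale output, matching the common input shape used for the digit tasks.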

