DIVERSITY OF GENERATED UNLABELED DATA MATTERS FOR FEW-SHOT HYPOTHESIS ADAPTATION

Anonymous

Abstract

Generating unlabeled data has recently been shown to help address the few-shot hypothesis adaptation (FHA) problem, where we aim to train a classifier for the target domain with only a few labeled target-domain data and a well-trained source-domain classifier (i.e., a source hypothesis), using the additional information carried by highly compatible unlabeled data. However, the data generated by existing methods are extremely similar or even identical. The strong dependency among such generated data can cause learning to fail. In this paper, we propose a diversity-enhancing generative network (DEG-Net) for the FHA problem, which can generate diverse unlabeled data with the help of a kernel independence measure: the Hilbert-Schmidt independence criterion (HSIC). Specifically, DEG-Net generates data by minimizing the HSIC value (i.e., maximizing the independence) among the semantic features of the generated data. With DEG-Net, the generated unlabeled data are more diverse and more effective for addressing the FHA problem. Experimental results show that DEG-Net outperforms existing FHA baselines and further verify that generating diverse data plays an important role in addressing the FHA problem.

1. INTRODUCTION

Data and expert knowledge are always scarce in newly-emerging fields, so it is both important and challenging to study how to leverage knowledge from other similar fields to help complete tasks in the new fields. To cope with this challenge, transfer learning methods were proposed to leverage knowledge of source domains (e.g., data in source domains or models trained with such data) to help complete tasks in other similar domains (a.k.a. the target domains) (Fang et al., 2020; Jing et al., 2020; Pan & Yang, 2009; Sun et al., 2019; Teshima et al., 2020; Zamir et al., 2018). Among the many transfer learning methods, hypothesis transfer learning (HTL) methods have received a lot of attention since they do not require access to the data in source domains, which prevents data leakage and protects data privacy (Chi et al., 2021a; Du et al., 2017; Liang et al., 2020; Yang et al., 2021a; b). Recently, the few-shot hypothesis adaptation (FHA) problem has been formulated to make HTL more realistic, making it suitable for many practical problems (Liu et al., 2021; Snell et al., 2017; Wang et al., 2020; Yang et al., 2020). In FHA, only a well-trained source-domain classifier (i.e., a source hypothesis) and a few labeled target-domain data are available (Chi et al., 2021a). Similar to HTL, FHA aims to obtain a good target-domain classifier with the help of a source hypothesis and a few target-domain data (Chi et al., 2021a; Motiian et al., 2017). Recently, generating unlabeled data has been shown to be an effective strategy for addressing FHA (Chi et al., 2021a). The target-oriented hypothesis adaptation network (TOHAN), a one-step solution to the FHA problem, constructed an intermediate domain to enrich the training data. The data in the intermediate domain are highly compatible with both the source domain and the target domain (Balcan & Blum, 2010).
With the generated unlabeled data in the intermediate domain, TOHAN partially overcame the problems caused by data scarcity in the target domain. However, existing methods ignore the diversity of the generated data, i.e., the independence among the generated data, so the generated data are extremely similar or even identical. This lack of diversity makes the data less effective for addressing the FHA problem. Taking the FHA task on digits datasets as an example, we found that the data generated by TOHAN suffer from the issue that the generator tends to copy target data (Figure 1(a)). To show how diversity matters in the FHA problem, we conducted experiments on the digits datasets: we used a few labeled target data together with an increasing amount of unlabeled data to train the target model. The results are shown in Figure 1(b). For the source data and target data, the accuracy of the trained model clearly increases as the number of data grows. For the generated data, the growth of data volume only helps to improve the accuracy of the model when the volume is small (e.g., fewer than 45 in Figure 1(b)). However, once the number of data exceeds 35, the accuracy of the model fluctuates around 33% regardless of further increases in the unlabeled data. This result shows that the model trained with generated data saturates faster than those trained with the source data or target data, since the generated data have less diversity. In this paper, to show how the diversity of unlabeled data (i.e., the independence among unlabeled data) affects FHA methods, we theoretically analyze its effect on the sample complexity of the FHA problem (Theorem 1). In this analysis, we adopt the log-coefficient score α (Dagan et al., 2019) to measure the dependency among unlabeled data. Our results show that we can still count on the unlabeled data to help address the FHA problem as long as the unlabeled data are weakly dependent (α < 0.5).
Nevertheless, once α ≥ 0.5, the results in Theorem 1 may no longer hold, meaning that learning may fail theoretically. In addition, we find that high dependency among unlabeled data usually means that more unlabeled data are needed to obtain a good target-domain classifier. From the above analysis and Figure 1, we argue that diversity matters in addressing the FHA problem. To this end, we propose the diversity-enhancing generative network (DEG-Net) for the FHA problem, a weight-shared conditional generative method equipped with a kernel independence measure: the Hilbert-Schmidt independence criterion (HSIC) (Gretton et al., 2005; Ma et al., 2020; Pogodin & Latham, 2020), which has been used in various settings, e.g., clustering (Song et al., 2007; Blaschko & Gretton, 2008), independence testing (Gretton et al., 2007), and self-supervised classification (Li et al., 2021). Although the log-coefficient score is used to analyze the effect of dependency on the sample complexity of the FHA problem, computing it requires knowledge of the target-domain data distribution, which is unknown in practice. In contrast, HSIC can be estimated easily from data samples. Thus, we adopt HSIC to measure the dependency among the generated unlabeled data. An overview of DEG-Net is shown in Figure 2: DEG-Net consists of two modules, a generation module and an adaptation module. In the generation module, we train the conditional generator with a well-trained source classifier and a few target-domain data. To train the generator with both source-domain and target-domain knowledge while simultaneously improving the diversity of the generated data, the generative loss of DEG-Net consists of three parts: a classification loss, a similarity loss, and a diversity loss.
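To make concrete why HSIC is easy to estimate from samples, the (biased) empirical estimator of Gretton et al. (2005) can be computed from kernel matrices alone. The following is a minimal NumPy sketch; the RBF kernel and its bandwidth here are illustrative assumptions, not DEG-Net's exact settings:

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    # Gaussian (RBF) kernel matrix built from pairwise squared distances.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC estimate: tr(K H L H) / (n - 1)^2.

    X, Y are (n, d) arrays of paired samples (e.g., semantic features).
    Larger values indicate stronger dependence; values near 0 indicate
    (approximate) independence.
    """
    n = X.shape[0]
    K = rbf_kernel(X, sigma)
    L = rbf_kernel(Y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```

Minimizing such an estimate over batches of semantic features pushes the samples toward mutual independence (hence diversity); since the estimator is a differentiable function of the features, it can serve directly as a diversity loss during generator training.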
More specifically, DEG-Net generates data by minimizing the HSIC value (i.e., maximizing the independence) between the semantic features of the target data and the generated data, where the semantic features are the hidden-layer outputs of the well-trained source hypothesis. To use the generalization knowledge in the semantic features of data that is shared by different classes



Figure 1: The low-diversity issue of generated unlabeled data when solving the FHA problem. Subfigure (a) illustrates the labeled data (left) drawn from the target domain and the unlabeled data (right) generated by TOHAN on the MNIST→SVHN (M → S) task. It is clear that the generated data are similar to each other and seem to copy the original target data (middle, left). Subfigure (b) illustrates the accuracy of models trained with data drawn from different domains at different data volumes on the task M → S. For the source data and target data, the accuracy of the trained model increases as the number of data grows. For the generated data, the growth of data volume only helps to improve the accuracy of the model when the volume is small.

