DIVERSITY OF GENERATED UNLABELED DATA MATTERS FOR FEW-SHOT HYPOTHESIS ADAPTATION

Anonymous

Abstract

Generating unlabeled data has recently been shown to help address the few-shot hypothesis adaptation (FHA) problem, where we aim to train a classifier for the target domain using only a few labeled target-domain data and a well-trained source-domain classifier (i.e., a source hypothesis), by exploiting the additional information carried by highly compatible unlabeled data. However, the data generated by existing methods are extremely similar or even identical, and the strong dependency among the generated data causes the learning to fail. In this paper, we propose a diversity-enhancing generative network (DEG-Net) for the FHA problem, which can generate diverse unlabeled data with the help of a kernel independence measure: the Hilbert-Schmidt independence criterion (HSIC). Specifically, DEG-Net generates data by minimizing the HSIC value (i.e., maximizing the independence) among the semantic features of the generated data. As a result, the unlabeled data generated by DEG-Net are more diverse and more effective for addressing the FHA problem. Experimental results show that DEG-Net outperforms existing FHA baselines and further verify that generating diverse data plays an important role in addressing the FHA problem.
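To make the HSIC-based diversity penalty above concrete, the following is a minimal sketch of the biased empirical HSIC estimator between two feature batches, using RBF kernels. This is an illustrative implementation of the standard estimator, not the paper's code; the function names, the kernel choice, and the bandwidth parameter are assumptions made here for the sketch.

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    """Gram matrix of the RBF (Gaussian) kernel over the rows of X."""
    sq_dists = (np.sum(X**2, axis=1, keepdims=True)
                + np.sum(X**2, axis=1) - 2.0 * X @ X.T)
    return np.exp(-sq_dists / (2.0 * sigma**2))

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC estimate between feature batches X and Y.

    HSIC(X, Y) = trace(K H L H) / (n - 1)^2, where K and L are the
    kernel Gram matrices and H = I - (1/n) 11^T is the centering matrix.
    A value near zero indicates (approximate) independence.
    """
    n = X.shape[0]
    K = rbf_kernel(X, sigma)
    L = rbf_kernel(Y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```

Under this formulation, near-duplicate feature batches yield a large HSIC value, while independently drawn batches yield a value close to zero, so using HSIC among generated features as a penalty term pushes the generator toward diverse outputs.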

1. INTRODUCTION

Data and expert knowledge are always scarce in newly-emerging fields, so it is both important and challenging to study how to leverage knowledge from similar fields to help complete tasks in the new fields. To cope with this challenge, transfer learning methods were proposed to leverage knowledge of source domains (e.g., data in source domains or models trained with data in source domains) to help complete tasks in other similar domains (a.k.a. the target domains) (Fang et al., 2020; Jing et al., 2020; Pan & Yang, 2009; Sun et al., 2019; Teshima et al., 2020; Zamir et al., 2018). Among the many transfer learning methods, hypothesis transfer learning (HTL) methods have received a lot of attention since they do not require access to the data in source domains, which prevents data leakage and protects data privacy (Chi et al., 2021a; Du et al., 2017; Liang et al., 2020; Yang et al., 2021a;b).

Recently, the few-shot hypothesis adaptation (FHA) problem has been formulated to make HTL more realistic and applicable to a wider range of problems (Liu et al., 2021; Snell et al., 2017; Wang et al., 2020; Yang et al., 2020). In FHA, only a well-trained source-domain classifier (i.e., a source hypothesis) and a few labeled target-domain data are available (Chi et al., 2021a). Similar to HTL, FHA aims to obtain a good target-domain classifier with the help of the source hypothesis and the few target-domain data (Chi et al., 2021a; Motiian et al., 2017). Recently, generating unlabeled data has been shown to be an effective strategy for addressing FHA (Chi et al., 2021a). The target-oriented hypothesis adaptation network (TOHAN), a one-step solution to the FHA problem, constructs an intermediate domain to enrich the training data. The data in the intermediate domain are highly compatible with both the source domain and the target domain (Balcan & Blum, 2010).
With the generated unlabeled data in the intermediate domain, TOHAN partially overcame the problems caused by data scarcity in the target domain. However, existing methods ignore the diversity of the generated data, i.e., the independence among them, so the generated data are extremely similar or even identical. This lack of diversity makes the data less effective for addressing the FHA problem. Taking the FHA task on digits datasets as an example, we found that the data generated by TOHAN suffer from this issue: the generator tends to copy target data (Figure 1 (a)). To show how diversity matters in the FHA problem, we conduct the

