ADDRESSING PARAMETER CHOICE ISSUES IN UNSUPERVISED DOMAIN ADAPTATION BY AGGREGATION

Abstract

We study the problem of choosing algorithm hyper-parameters in unsupervised domain adaptation, i.e., with labeled data in a source domain and unlabeled data in a target domain drawn from a different input distribution. We follow the strategy of computing several models using different hyper-parameters and subsequently computing a linear aggregation of these models. While several heuristics follow this strategy, methods that rely on thorough theories for bounding the target error are still missing. To this end, we propose a method that extends weighted least squares to vector-valued functions, e.g., deep neural networks. We show that the target error of the proposed algorithm is asymptotically not worse than twice the error of the unknown optimal aggregation. We also perform a large-scale empirical comparative study on several datasets, including text, images, electroencephalogram, body sensor signals and signals from mobile phones. Our method¹ outperforms deep embedded validation (DEV) and importance weighted validation (IWV) on all datasets, setting a new state-of-the-art performance for solving parameter choice issues in unsupervised domain adaptation with theoretical error guarantees. We further study several competitive heuristics, all outperforming IWV and DEV on at least five datasets. However, our method outperforms each heuristic on at least five of seven datasets.

1. INTRODUCTION

The goal of unsupervised domain adaptation is to learn a model on unlabeled data from a target input distribution using labeled data from a different source distribution (Pan & Yang, 2010; Ben-David et al., 2010). If this goal is achieved, medical diagnostic systems can successfully be trained on unlabeled images using labeled images with a different modality (Varsavsky et al., 2020; Zou et al., 2020); segmentation models for natural images can be learned using only labeled data from computer simulations (Peng et al., 2018); natural language models can be learned from unlabeled biomedical abstracts by means of labeled data from financial journals (Blitzer et al., 2006); industrial quality inspection systems can be learned on unlabeled data from new products using data from related products (Jiao et al., 2019; Zellinger et al., 2020). However, missing target labels combined with distribution shift makes parameter choice a hard problem (Sugiyama et al., 2007; You et al., 2019; Saito et al., 2021; Zellinger et al., 2021; Musgrave et al., 2021). Often, one ends up with a sequence of models, e.g., originating from different hyper-parameter configurations (Ben-David et al., 2007; Saenko et al., 2010; Ganin et al., 2016; Long et al., 2015; Zellinger et al., 2017; Peng et al., 2019). In this work, we study the problem of constructing an optimal aggregation using all models in such a sequence. Our main motivation is that the error of such an optimal aggregation is never larger than, and often considerably smaller than, the error of the best single model in the sequence. Although methods with mathematical error guarantees have been proposed to select the best model in the sequence (Sugiyama et al., 2007; Kouw et al., 2019; You et al., 2019; Zellinger et al., 2021), methods for learning aggregations of the models are either heuristics or their theoretical guarantees are limited by severe assumptions (cf. Wilson & Cook (2020)). Typical aggregation approaches are (a) to learn an aggregation on source data only (Nozza et al., 2016); (b) to learn an aggregation on a set of (unknown) labeled target examples (Xia et al., 2013; Dai et al., 2007; Daumé III & Marcu, 2006; Duan et al., 2012); (c) to learn an aggregation on target examples (pseudo-)labeled based on confidence measures of the given models (Zhou et al., 2021; Ahmed et al., 2022; Sun, 2012; Zou et al., 2018; Saito et al., 2017); (d) to aggregate the models based on data-structure-specific transformations (Yang et al., 2012; Ha & Youn, 2021); and (e) to use specific (possibly unavailable) knowledge about the given models, such as information obtained at different time steps of their gradient-based optimization process (French et al., 2018; Laine & Aila, 2017; Tarvainen & Valpola, 2017; Athiwaratkun et al., 2019; Al-Stouhi & Reddy, 2011) or the information that the given models are trained on different (source) distributions (Hoffman et al., 2018; Rakshit et al., 2019; Xu et al., 2018; Kang et al., 2020; Zhang et al., 2015).
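To make this motivation concrete, here is a minimal numeric sketch (our own illustrative example, not taken from the paper): two biased candidate models whose optimal linear aggregation recovers the target function exactly, while each single model has constant error.

```python
import numpy as np

# Hypothetical toy example: the true target function is f(x) = x.
# Two candidate models (e.g., from different hyper-parameters) are
# both biased, but in opposite directions.
f1 = lambda x: x + 1.0
f2 = lambda x: x - 1.0

x = np.linspace(-2.0, 2.0, 101)  # evaluation points
y = x                            # ground-truth labels

mse = lambda pred: np.mean((pred - y) ** 2)

# Each single model has squared error 1 everywhere ...
best_single = min(mse(f1(x)), mse(f2(x)))

# ... but the linear aggregation 0.5*f1 + 0.5*f2 recovers f exactly.
agg = 0.5 * f1(x) + 0.5 * f2(x)
print(best_single, mse(agg))  # 1.0 0.0
```

No model-selection rule that returns one model of the sequence can reach the aggregation's error here, which is exactly the gap the paper targets.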
One problem shared among all methods mentioned above is that they cannot guarantee a small error, even if the sample size grows to infinity. See Figure 1 for a simple illustrative example. In this work, we propose (to the best of our knowledge) the first algorithm for computing aggregations of vector-valued models for unsupervised domain adaptation with target error guarantees. We extend the importance weighted least squares algorithm (Shimodaira, 2000) and corresponding recently proposed error bounds (Gizewski et al., 2022) to linear aggregations of vector-valued models. The importance weights are the values of an estimated ratio between the target and source densities, evaluated at the examples. Every method for density-ratio estimation can be used as a basis for our approach, e.g., Sugiyama et al. (2012); Kanamori et al. (2012) and references therein. Our error bound proves that the target error of the computed aggregation is asymptotically at most twice the target error of the optimal aggregation.

¹Large scale benchmark experiments are available at https://github.com/Xpitfire/iwa; dinu@ml.jku.at, werner.zellinger@ricam.oeaw.ac.at

Figure 1: Unsupervised domain adaptation problem (Shimodaira, 2000; Sugiyama et al., 2007; You et al., 2019). Left: Source distribution (solid) and target distribution (dashed). Right: A sequence of different linear models (dashed) is used to find the optimal linear aggregation of the models (solid). Model selection methods (Sugiyama et al., 2007; Kouw et al., 2019; You et al., 2019; Zellinger et al., 2021) cannot outperform the best single model in the sequence; confidence values as used in Zou et al. (2018) are not available; and approaches based on averages or tendencies of majorities of models (Saito et al., 2017) suffer from a high fraction of large-error models in the sequence. In contrast, our approach (dotted-dashed) is nearly optimal. In addition, the model computed by our method provably approaches the optimal linear aggregation for increasing sample size. For further details on this example we refer to Section C in the Supplementary Material.
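The core computation can be sketched as follows (a minimal illustration under our own assumptions, not the paper's implementation: the setup is 1-D, the candidate models are fixed basis functions, and the density ratio is known in closed form here, whereas in practice it must be estimated as discussed above).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: source inputs ~ N(0, 1), target inputs ~ N(1, 0.5^2),
# ground truth y = sin(x); all names and densities are illustrative choices.
n = 2000
x_s = rng.normal(0.0, 1.0, n)                 # labeled source sample
y_s = np.sin(x_s) + 0.1 * rng.normal(size=n)  # noisy source labels

# A "sequence" of candidate models, e.g., obtained with different
# hyper-parameters; here simply fixed basis models.
models = [lambda x: x, lambda x: x**3, lambda x: x**5]

def density_ratio(x):
    """True ratio q(x)/p(x) of the two Gaussians above. In practice this
    ratio is unknown and must be estimated, e.g., by KLIEP or uLSIF."""
    p = np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)
    q = np.exp(-(x - 1.0)**2 / (2.0 * 0.25)) / np.sqrt(2.0 * np.pi * 0.25)
    return q / p

w = density_ratio(x_s)                         # importance weights
F = np.column_stack([m(x_s) for m in models])  # model outputs as features

# Importance-weighted least squares for the aggregation coefficients c:
#   min_c  sum_i w_i * (y_i - F_i @ c)^2
c = np.linalg.solve(F.T @ (w[:, None] * F), F.T @ (w * y_s))

# Evaluate on a target sample: compare the aggregation to each single model.
x_t = rng.normal(1.0, 0.5, n)
F_t = np.column_stack([m(x_t) for m in models])
agg_mse = np.mean((F_t @ c - np.sin(x_t))**2)
single_mse = [np.mean((m(x_t) - np.sin(x_t))**2) for m in models]
print(agg_mse < min(single_mse))  # expected: True on this toy problem
```

The importance weights reweight the labeled source sample so that the least-squares fit of the aggregation coefficients targets the (unlabeled) target distribution; the guarantees in the paper apply to the vector-valued generalization of this scheme with estimated weights.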

