f-DOMAIN-ADVERSARIAL LEARNING: THEORY AND ALGORITHMS FOR UNSUPERVISED DOMAIN ADAPTATION WITH NEURAL NETWORKS

Abstract

The problem of unsupervised domain adaptation arises in a variety of practical applications where the distribution of the training samples differs from that of the test samples. Existing domain-adaptation theory derives generalization bounds from divergence measures that are hard to optimize in practice, which has led to a large disconnect between theory and state-of-the-art methods. In this paper, we propose a novel domain-adversarial framework that introduces new theory for domain adaptation and leads to practical learning algorithms with neural networks. In particular, we derive a novel generalization bound that relies on a new measure of discrepancy between distributions based on a variational characterization of f-divergences. We show that our bound recovers the theoretical results of Ben-David et al. (2010a) as a special case for a particular choice of divergence, and also supports the divergences typically used in practice. We derive a general domain-adversarial learning algorithm for the complete family of f-divergences. We provide empirical results for several f-divergences and show that some, not previously considered in domain-adversarial learning, achieve state-of-the-art results in practice. We also provide empirical insights into how the choice of divergence affects transfer performance on real-world datasets. By further recognizing the optimization problem as a Stackelberg game, we utilize the latest optimizers from the game-optimization literature, achieving additional performance boosts in our training algorithm. We show that our f-domain-adversarial framework achieves state-of-the-art results on the challenging Office-31 and Office-Home datasets without extra hyperparameters.

1. INTRODUCTION

Figure 1: Domain Adaptation. A learner is trained on abundant labeled data and is expected to perform well in the target domain (marked as +). Decision boundaries correspond to a 2-layer neural net trained using f-DAL.

The ability to learn new concepts and skills from general-purpose data and transfer them to similar scenarios is critical in many modern applications. For example, it is often the case that the learner has access to only a small (unlabeled) subset of data in its domain of interest, but has access to a larger labeled dataset (for the same task) in a similar domain. If the gap between these two domains is not considerable, we may expect to train a model using the labeled and unlabeled data together and have it generalize well to the target dataset. This scenario is called unsupervised domain adaptation, and it is the focus of this paper.

The paramount importance of domain adaptation (DA) has led to remarkable advances in the field. From a theoretical point of view, the seminal works of Ben-David et al. (2007; 2010a;b) and Mansour et al. (2009) provided generalization bounds for unsupervised DA based on discrepancy measures that are reductions of the Total Variation (TV). More recently, Zhang et al. (2019) took one step further and proposed the Margin Disparity Discrepancy (MDD) with the aim of closing the gap between theory and algorithms. Their notion of discrepancy is tailored to margin losses and builds on the observation that taking a single supremum over the hypothesis class makes optimization easier. Theories based on weighted combinations of hypotheses for multi-source DA have also been developed (Hoffman et al., 2018a). From an algorithmic perspective, specifically in the context of neural networks, Ganin & Lempitsky (2015) and Ganin et al. (2016) proposed learning domain-invariant representations as a two-player zero-sum game. This approach led to a plethora of methods, including state-of-the-art approaches such as Shu et al. (2018); Long et al. (2018); Hoffman et al. (2018b); Zhang et al. (2019).

While these methods were explained with insights from the theory of Ben-David et al. (2010a), and more recently through MDD (Zhang et al., 2019), in deep neural networks both the H∆H-divergence (Ben-David et al., 2010a) and MDD are hard to optimize, and ad-hoc objectives have been introduced to minimize the divergence between source and target distributions in a representation space. This has led to a disconnect between theory and the current SoTA methods. Specifically, approaches that follow Ganin et al. (2016) minimize a Jensen-Shannon (JS) divergence, while the practical objective of MDD can be interpreted as minimizing a γ-weighted JS divergence. From the optimization perspective, the game nature of the problem has been ignored, and these min-max objectives are usually optimized with Gradient Descent Ascent (GDA) (referred to as Gradient Descent (GD) with the Gradient Reversal Layer (GRL)). Paradoxically, the last iterate of GDA is known not to converge to a Nash equilibrium even in simple bilinear games (Nemirovsky & Yudin, 1983).

The aim of this paper is to provide a novel perspective on the domain-adversarial problem by deriving theory that generalizes previous seminal works and translates into a new general framework that supports the complete family of f-divergences and is practical for modern neural networks. In particular, we introduce a novel measure of discrepancy between distributions and derive its corresponding learning bounds. Our notion of discrepancy is based on a variational characterization of f-divergences and covers both previous theoretical results (i.e., those based on reductions of the TV) and practical objectives (i.e., those based on JS). We empirically show that any f-divergence can be used to learn invariant representations. Most importantly, we show that several divergences not previously considered in domain-adversarial learning achieve SoTA results in practice.
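The variational characterization referred to above can be stated, in the standard form (our notation; T ranges over an auxiliary class of measurable functions and f* denotes the convex conjugate of the generator f), as:

```latex
D_f(P \,\|\, Q) \;=\; \sup_{T}\; \mathbb{E}_{x \sim P}\big[T(x)\big] \;-\; \mathbb{E}_{x \sim Q}\big[f^{*}\!\big(T(x)\big)\big],
\qquad
f^{*}(t) := \sup_{u \in \operatorname{dom} f} \big\{\, u\,t - f(u) \,\big\}.
```

Restricting the supremum to a fixed function class yields a lower bound on D_f that can be estimated from samples of P and Q, which is what makes this characterization amenable to adversarial training.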
From an optimization point of view, we observe that, under mild conditions, the optimal solution of our framework is a Stackelberg equilibrium. This allows us to plug in the latest optimizers from the recent min-max optimization literature within our framework. We also discuss practical considerations in deep networks, and compare how learning invariant representations under different choices of divergence affects transfer performance on real-world datasets. We further discuss the practical gains (for popular f-divergences) that can be achieved by introducing more advanced optimizers. We will release code upon acceptance.
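The non-convergence of GDA mentioned above is easy to reproduce. The following minimal sketch (our own illustration, not part of the proposed framework) runs simultaneous gradient descent ascent on the bilinear game min_x max_y xy, whose unique equilibrium is (0, 0), and shows the iterates moving monotonically away from it:

```python
# Simultaneous Gradient Descent Ascent (GDA) on the bilinear game
#   min_x max_y  f(x, y) = x * y.
# The unique Nash equilibrium is (0, 0), yet each simultaneous GDA step
# multiplies the squared distance x^2 + y^2 by (1 + lr^2), so the last
# iterate spirals outward instead of converging.

def gda_bilinear(x, y, lr=0.1, steps=100):
    """Run `steps` simultaneous GDA updates; return distances to (0, 0)."""
    norms = [(x * x + y * y) ** 0.5]
    for _ in range(steps):
        gx, gy = y, x                       # grad_x f = y, grad_y f = x
        x, y = x - lr * gx, y + lr * gy     # descent in x, ascent in y
        norms.append((x * x + y * y) ** 0.5)
    return norms

norms = gda_bilinear(1.0, 1.0)
print(norms[0], norms[-1])  # the distance to the equilibrium only grows
```

The learning rate and step count here are arbitrary; any positive learning rate produces the same divergence, which is why the game-aware optimizers discussed later are needed.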

2. PRELIMINARIES

In this paper, we focus on the unsupervised domain adaptation scenario. During training, we assume that the learner has access to a source dataset of n_s labeled examples S = {(x_i^s, y_i^s)}_{i=1}^{n_s}, and a target dataset of n_t unlabeled examples T = {x_i^t}_{i=1}^{n_t}, where the source inputs x_i^s are sampled i.i.d. from a distribution P_s (source distribution) over the input space X, and the target inputs x_i^t are sampled i.i.d. from a distribution P_t (target distribution) over X. In the case of binary classification we have Y = {0, 1}, and in the multiclass scenario Y = {1, ..., k}. When X or Y cannot be inferred from the context, or other assumptions are required, we mention it explicitly. We denote a labeling function by f : X → Y, and the source and target labeling functions by f_s and f_t, respectively. The task of unsupervised domain adaptation is to find a hypothesis h : X → Y that generalizes to the target dataset T (i.e., that makes as few errors as possible when compared against the ground-truth labels f_t(x_i^t)). The risk of a hypothesis h w.r.t. a labeling function f, using a loss function ℓ : Y × Y → R_+ under a distribution D, is defined as R_D(h, f) := E_{x∼D}[ℓ(h(x), f(x))]. We also assume that ℓ satisfies the triangle inequality. For simplicity of notation, we define R_S(h) := R_{P_s}(h, f_s) and R_T(h) := R_{P_t}(h, f_t), where the indices S and T refer to the source and target domains, respectively. We additionally use R̂_S, R̂_T to refer to the empirical risks over the source dataset S and the target dataset T.
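As a concrete illustration of the empirical risk R̂_S defined above, the sketch below computes it under the 0-1 loss, which satisfies the triangle inequality as required; the threshold hypothesis and the toy 1-D sample are hypothetical stand-ins for this illustration, not part of the paper's method:

```python
# Empirical source risk  \hat{R}_S(h) = (1/n_s) * sum_i  loss(h(x_i^s), y_i^s)
# under the 0-1 loss (which satisfies the triangle inequality).

def zero_one_loss(y_pred, y_true):
    """0-1 loss: 0 on a correct prediction, 1 on a mistake."""
    return 0.0 if y_pred == y_true else 1.0

def empirical_risk(h, inputs, labels, loss=zero_one_loss):
    """Average loss of hypothesis h over a labeled sample."""
    return sum(loss(h(x), y) for x, y in zip(inputs, labels)) / len(inputs)

# Toy 1-D labeled source sample and a threshold hypothesis (illustrative only).
xs = [-2.0, -1.0, 0.5, 1.5, 3.0]
ys = [0, 0, 1, 1, 1]
h = lambda x: int(x > 0)

print(empirical_risk(h, xs, ys))  # 0.0: h labels every source point correctly
```

The target risk R̂_T has the same form but would require the (unavailable) target labels f_t(x_i^t), which is precisely why the bounds in this paper control it through quantities measurable on unlabeled target data.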




