f-DOMAIN-ADVERSARIAL LEARNING: THEORY AND ALGORITHMS FOR UNSUPERVISED DOMAIN ADAPTATION WITH NEURAL NETWORKS

Abstract

The problem of unsupervised domain adaptation arises in a variety of practical applications where the distribution of the training samples differs from that of the samples used at test time. The existing theory of domain adaptation derived generalization bounds based on divergence measures that are hard to optimize in practice, which has led to a large disconnect between theory and state-of-the-art methods. In this paper, we propose a novel domain-adversarial framework that introduces new theory for domain adaptation and leads to practical learning algorithms with neural networks. In particular, we derive a novel generalization bound that utilizes a new measure of discrepancy between distributions based on a variational characterization of f-divergences. We show that our bound recovers the theoretical results of Ben-David et al. (2010a) as a special case with a particular choice of divergence, and also supports divergences typically used in practice. We derive a general algorithm for domain-adversarial learning for the complete family of f-divergences. We provide empirical results for several f-divergences and show that some, not previously considered in domain-adversarial learning, achieve state-of-the-art results in practice. We provide empirical insights into how the choice of a particular divergence affects transfer performance on real-world datasets. By further recognizing the optimization problem as a Stackelberg game, we utilize the latest optimizers from the game-optimization literature, achieving additional performance boosts in our training algorithm. We show that our f-domain-adversarial framework achieves state-of-the-art results on the challenging Office-31 and Office-Home datasets without extra hyperparameters.

1. INTRODUCTION

Figure 1: Domain Adaptation. A learner is trained on abundant labeled data and is expected to perform well in the target domain (marked as +). Decision boundaries correspond to a 2-layer neural net trained using f-DAL.

The ability to learn new concepts and skills from general-purpose data and transfer them to similar scenarios is critical in many modern applications. For example, it is often the case that the learner has access to only a small (unlabeled) subset of data in its domain of interest, but has access to a larger labeled dataset (for the same task) in a domain that is similar to the target domain. If the gap between these two domains is not considerable, we may expect to train a model using both the labeled and unlabeled data and to generalize well to the target dataset. This scenario is called unsupervised domain adaptation, and it is the focus of this paper. The paramount importance of domain adaptation (DA) has led to remarkable advances in the field. From a theoretical point of view, the seminal works of Ben-David et al. (2007; 2010a;b); Mansour et al. (2009) provided generalization bounds for unsupervised DA based on discrepancy measures that are reductions of the Total Variation (TV). More recently, Zhang et al. (2019) went one step further and proposed the Margin Disparity Discrepancy (MDD) with the aim of closing the gap between theory and algorithms. Their notion of discrepancy is tailored to margin losses and builds on the observation that taking only a single supremum over the class set makes optimization easier. Moreover, theories based on weighted combinations of hypotheses for multiple-source DA have also been developed (Hoffman et al., 2018a). From an algorithmic perspective, specifically in the context of neural networks, Ganin & Lempitsky (2015); Ganin et al. (2016) proposed the idea of learning domain-invariant representations as a two-player zero-sum game.
This approach led to a plethora of methods, including state-of-the-art approaches such as Shu et al. (2018); Long et al. (2018); Hoffman et al. (2018b); Zhang et al. (2019). While these methods were explained with insights from the theory of Ben-David et al. (2010a), and more recently through MDD (Zhang et al., 2019), in deep neural networks both the $\mathcal{H}\Delta\mathcal{H}$ divergence (Ben-David et al., 2010a) and MDD are hard to optimize, and ad-hoc objectives have been introduced to minimize the divergence between source and target distributions in a representation space. This has led to a disconnect between theory and the current SoTA methods. Specifically, approaches that follow Ganin et al. (2016) minimize a Jensen-Shannon (JS) divergence, while the practical objective of MDD can be interpreted as minimizing a γ-weighted JS divergence. From the optimization perspective, the game-optimization nature of the problem has been ignored, and these min-max objectives are usually optimized using Gradient Descent Ascent (GDA) (referred to as Gradient Descent (GD) with the Gradient Reversal Layer (GRL)). Paradoxically, the last iterate of GDA is known not to converge to a Nash equilibrium even in simple bilinear games (Nemirovsky & Yudin, 1983). The aim of this paper is to provide a novel perspective on the domain-adversarial problem by deriving theory that generalizes previous seminal works and translates into a new general framework that supports the complete family of f-divergences and is practical for modern neural networks. In particular, we introduce a novel measure of discrepancy between distributions and derive its corresponding learning bounds. Our notion of discrepancy is based on a variational characterization of f-divergences and includes both previous theoretical results (i.e., those based on reductions of the TV) and practical results (i.e., those based on JS). We empirically show that any f-divergence can be used to learn invariant representations.
Most importantly, we show that several divergences that were not considered previously in domain-adversarial learning achieve SoTA results in practice. From an optimization point of view, we observe that under mild conditions, the optimal solution of our framework is a Stackelberg equilibrium. This allows us to plug-and-play the latest optimizers from the recent min-max optimization literature within our framework. We also discuss practical considerations in deep networks, and compare how learning invariant representations for different choices of divergence affects the transfer performance on real-world datasets. We further discuss the practical gains (for popular f -divergences) that can be achieved by introducing more advanced optimizers. We will release code upon acceptance.

2. PRELIMINARIES

In this paper, we focus on the unsupervised domain adaptation scenario. During training, we assume that the learner has access to a source dataset of $n_s$ labeled examples $S = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ and a target dataset of $n_t$ unlabeled examples $T = \{x_i^t\}_{i=1}^{n_t}$, where the source inputs $x_i^s$ are sampled i.i.d. from a distribution $P_s$ (source distribution) over the input space $\mathcal{X}$ and the target inputs $x_i^t$ are sampled i.i.d. from a distribution $P_t$ (target distribution) over $\mathcal{X}$. Usually, in the case of binary classification, we have $\mathcal{Y} = \{0, 1\}$, and in the multiclass classification scenario, $\mathcal{Y} = \{1, \dots, k\}$. When $\mathcal{X}$ or $\mathcal{Y}$ cannot be inferred from the context or other assumptions are required, we mention it explicitly. We denote a labeling function by $f : \mathcal{X} \to \mathcal{Y}$, and the source and target labeling functions by $f_s$ and $f_t$, respectively. The task of unsupervised domain adaptation is to find a hypothesis function $h : \mathcal{X} \to \mathcal{Y}$ that generalizes to the target dataset $T$ (i.e., that makes as few errors as possible when compared with the ground-truth labels $f_t(x_i^t)$). The risk of a hypothesis $h$ w.r.t. the labeling function $f$, using a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ under distribution $D$, is defined as $R_D(h, f) := \mathbb{E}_{x \sim D}[\ell(h(x), f(x))]$. We also assume that $\ell$ satisfies the triangle inequality. For simplicity of notation, we define $R_S(h) := R_{P_s}(h, f_s)$ and $R_T(h) := R_{P_t}(h, f_t)$, where the indices $S$ and $T$ refer to the source and target domains, respectively. We additionally use $\hat{R}_S, \hat{R}_T$ to refer to the empirical risks over the source dataset $S$ and the target dataset $T$.
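As a toy illustration of these risk definitions (ours, not part of the paper's setup): the empirical risk is just a sample average of the loss; the hypothesis, labeling function, and data below are illustrative stand-ins.

```python
# Toy sketch of the risk definitions above: R_D(h, f) = E_{x~D}[l(h(x), f(x))],
# with the empirical risk a plain sample average. The hypothesis h, labeling
# function f_s, and data xs are illustrative stand-ins.

def empirical_risk(h, f, xs, loss):
    return sum(loss(h(x), f(x)) for x in xs) / len(xs)

zero_one = lambda a, b: float(a != b)   # 0/1 loss (satisfies the triangle inequality)
f_s = lambda x: int(x >= 0.5)           # toy source labeling function
h = lambda x: int(x >= 0.4)             # toy hypothesis

xs = [0.1, 0.45, 0.6, 0.9]
assert empirical_risk(h, f_s, xs, zero_one) == 0.25   # only x = 0.45 disagrees
```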

| Divergence | $\phi(x)$ | $\phi^*(t)$ | $\phi'(1)$ | $g(x)$ |
|---|---|---|---|---|
| Kullback-Leibler (KL) | $x \log x$ | $\exp(t-1)$ | $1$ | $x$ |
| Reverse KL (KL-rev) | $-\log x$ | $-1 - \log(-t)$ | $-1$ | $-\exp x$ |
| Jensen-Shannon (JS) | $-(x+1)\log\frac{1+x}{2} + x \log x$ | $-\log(2 - e^t)$ | $0$ | $\log\frac{2}{1+\exp(-x)}$ |
| Pearson $\chi^2$ | $(x-1)^2$ | $t^2/4 + t$ | $0$ | $x$ |
| Total Variation (TV) | $\frac{1}{2}\lvert x-1\rvert$ | $t$, for $-1/2 \le t \le 1/2$ | $[-1/2, 1/2]$ | $\frac{1}{2}\tanh x$ |

Table 1: Popular f-divergences, their conjugate functions, and choices of $g$. We take $\hat{\ell}(a, b) = g(b_{\operatorname{argmax} a})$.
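As a quick sanity check (ours, not from the paper), the closed-form conjugates in Table 1 can be verified numerically against the definition $\phi^*(t) = \sup_{x > 0}\{xt - \phi(x)\}$; here for the Pearson $\chi^2$ row.

```python
# Sketch: numerically verify that phi_star is the Fenchel conjugate of phi,
# i.e. phi_star(t) ~= sup_{x>0} { x*t - phi(x) }, for the Pearson chi^2 row
# of Table 1. The helper names are ours.

def phi_pearson(x):          # phi(x) = (x - 1)^2, with phi(1) = 0 as required
    return (x - 1.0) ** 2

def phi_star_pearson(t):     # closed form from Table 1: t^2/4 + t
    return t * t / 4.0 + t

def conjugate_numeric(phi, t, grid):
    # brute-force sup over a grid of x > 0
    return max(x * t - phi(x) for x in grid)

grid = [i / 1000.0 for i in range(1, 20000)]  # x in (0, 20)
assert phi_pearson(1.0) == 0.0
for t in [-0.5, 0.0, 0.5, 1.0]:
    assert abs(conjugate_numeric(phi_pearson, t, grid) - phi_star_pearson(t)) < 1e-3
```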

2.1. COMPARING SOURCE AND TARGET DOMAINS WITH f -DIVERGENCES

A key component of domain adaptation is the study of the discrepancy between the source and target distributions. This differentiates transductive approaches, and more generally transfer learning, from traditional supervised learning methods. In our work, we derive generalization bounds that capture the entire family of f-divergences. We define new discrepancies between the source and target distributions based on the variational characterization of popular choices of f-divergences. These new discrepancies play a fundamental role in our work.

Definition 1 (f-divergence, Csiszár (1967); Ali & Silvey (1966)). Let $P_s$ and $P_t$ be two distribution functions with densities $p_s$ and $p_t$, respectively. Let $p_s$ and $p_t$ be absolutely continuous with respect to a base measure $dx$. Let $\phi : \mathbb{R}_+ \to \mathbb{R}$ be a convex, lower semi-continuous function that satisfies $\phi(1) = 0$. The f-divergence $D_\phi$ is defined as:

$$D_\phi(P_s \| P_t) := \int p_t(x)\, \phi\!\left(\frac{p_s(x)}{p_t(x)}\right) dx. \quad (2.1)$$

Variational characterization of f-divergences: Nguyen et al. (2010) derive a general variational method that estimates f-divergences from samples by turning the estimation problem into variational optimization. They show that any f-divergence can be written as (see details in Appendix A.2):

$$D_\phi(P_s \| P_t) \ge \sup_{T \in \mathcal{T}} \; \mathbb{E}_{x \sim P_s}[T(x)] - \mathbb{E}_{x \sim P_t}[\phi^*(T(x))], \quad (2.2)$$

where $\phi^*$ is the (Fenchel) conjugate function of $\phi : \mathbb{R}_+ \to \mathbb{R}$, defined as $\phi^*(y) := \sup_{x \in \mathbb{R}_+} \{xy - \phi(x)\}$, and $T : \mathcal{X} \to \operatorname{dom} \phi^*$. Equality holds if $\mathcal{T}$ is the set of all measurable functions. Many popular divergences that are heavily used in machine learning and information theory are special cases of f-divergences. We summarize them and their conjugate functions in Table 1. For simplicity, we assume in the following that $\mathcal{X} \subseteq \mathbb{R}^n$ and that each density (i.e., $p_s$ and $p_t$) is absolutely continuous.
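The variational lower bound in Eq. (2.2) can be made concrete with a small Monte Carlo sketch (our illustration; the Gaussian pair and critic are assumptions, not from the paper). For the KL case ($\phi^*(t) = \exp(t-1)$), the bound is tight at the critic $T^*(x) = 1 + \log(p_s(x)/p_t(x))$.

```python
import math, random

# Sketch of Eq. (2.2): a Monte Carlo lower bound on D_phi(Ps || Pt) with a
# fixed critic T, for the KL case (phi*(t) = exp(t - 1)).
# Ps = N(0,1), Pt = N(1,1); the true KL is 0.5. The critic below is the
# maximizer T*(x) = 1 + log(ps/pt), so the bound should be tight up to noise.

random.seed(0)

def log_ratio(x):            # log ps(x)/pt(x) for N(0,1) vs N(1,1)
    return -x * x / 2.0 + (x - 1.0) ** 2 / 2.0

def T(x):                    # optimal critic for the KL case
    return 1.0 + log_ratio(x)

n = 100_000
xs_s = [random.gauss(0.0, 1.0) for _ in range(n)]
xs_t = [random.gauss(1.0, 1.0) for _ in range(n)]

lower_bound = (sum(T(x) for x in xs_s) / n
               - sum(math.exp(T(x) - 1.0) for x in xs_t) / n)
# true KL(N(0,1) || N(1,1)) = 0.5; the estimate should be close
assert abs(lower_bound - 0.5) < 0.05
```

With a suboptimal critic, the same estimator stays a (looser) lower bound, which is the property the discrepancies of Section 3 exploit.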

3. DOMAIN ADAPTATION THEORY

Domain adaptation approaches generally build upon the idea of bounding the gap between the source and target domains' error functions in terms of the discrepancy between their probability distributions. Measuring the similarity between the distributions $P_s$ and $P_t$ is thus critical in the derivation of generalization bounds and/or the design of algorithms. We remind the reader of the seminal work of Ben-David et al. (2010a), which bounds the risk of any binary classifier in the hypothesis class $\mathcal{H}$ with the following theorem:

Theorem 1. If $\ell(x, y) = |h(x) - y|$ and $\mathcal{H}$ is a class of functions, then for any $h \in \mathcal{H}$ we have:

$$R_T(h) \le R_S(h) + D_{\mathrm{TV}}(P_s \| P_t) + \min\{\mathbb{E}_{x \sim P_s}[|f_t(x) - f_s(x)|], \; \mathbb{E}_{x \sim P_t}[|f_t(x) - f_s(x)|]\}. \quad (3.1)$$

Here $D_{\mathrm{TV}}(P_s \| P_t) := \sup_{T \in \mathcal{T}} |\mathbb{E}_{x \sim P_s}[T(x)] - \mathbb{E}_{x \sim P_t}[T(x)]|$ is the TV, and $\mathcal{T}$ is the set of measurable functions. TV is an f-divergence with $\phi(x) = |x - 1|$ in Definition 1. For any function $\phi(x) \ge |x - 1|$, one can replace $D_{\mathrm{TV}}(P_s \| P_t)$ in Eq. 3.1 with $D_\phi(P_s \| P_t)$. Theorem 1 thus bounds a classifier's target error in terms of the source error, the divergence between the two domains, and the dissimilarity of the labeling functions. Unfortunately, $D_{\mathrm{TV}}(P_s \| P_t)$ cannot be estimated from finite samples of arbitrary distributions (Kifer et al., 2004). It is also a very loose upper bound, as it involves the supremum over all measurable functions and does not account for the hypothesis class.
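For discrete distributions, the TV term of Theorem 1 is easy to illustrate (our toy check, with critics restricted to $[-1/2, 1/2]$, the domain of $\phi^*$ for TV in Table 1): the supremum in the variational form is attained by a sign-like function and equals $\frac{1}{2}\sum_x |p_s(x) - p_t(x)|$.

```python
from itertools import product

# Sketch (ours): for discrete distributions, the variational form of TV with
# critics T(x) in [-1/2, 1/2] (the domain of phi* for TV in Table 1) is
# attained at an extreme point and equals (1/2) * sum_x |ps(x) - pt(x)|.

ps = [0.5, 0.3, 0.2]
pt = [0.2, 0.3, 0.5]

closed_form = 0.5 * sum(abs(a - b) for a, b in zip(ps, pt))

best = max(
    abs(sum(t * a for t, a in zip(T, ps)) - sum(t * b for t, b in zip(T, pt)))
    for T in product((-0.5, 0.5), repeat=3)   # extreme points suffice
)
assert abs(best - closed_form) < 1e-12
```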

3.1. MEASURING DISCREPANCY WITH f -DIVERGENCES

This section introduces a new discrepancy that aims to solve the two aforementioned problems, namely (1) estimation of the divergence from finite samples of arbitrary distributions (Lemma 2), and (2) restriction of the discrepancy to a set that includes the hypothesis class $\mathcal{H}$ (Defs. 2 and 3). We show in Sec. 3.2 how these allow us to extend the bounds studied in Ben-David et al. (2010a).

Definition 2 ($D^\phi_{\mathcal{H}}$ discrepancy). Let $\phi^*$ be the Fenchel conjugate of a convex, lower semi-continuous function $\phi$ that satisfies $\phi(1) = 0$, and let $\mathcal{T}$ be a set of measurable functions such that $\mathcal{T} = \{\ell(h(x), h'(x)) : h, h' \in \mathcal{H}\}$. We define the discrepancy between $P_s$ and $P_t$ as:

$$D^\phi_{\mathcal{H}}(P_s \| P_t) := \sup_{h, h' \in \mathcal{H}} \left| \mathbb{E}_{x \sim P_s}[\ell(h(x), h'(x))] - \mathbb{E}_{x \sim P_t}[\phi^*(\ell(h(x), h'(x)))] \right|. \quad (3.2)$$

The $D^\phi_{\mathcal{H}}$ discrepancy can be interpreted as a lower-bound estimator of a general class of f-divergences (Lemma 1). Therefore, for any hypothesis class $\mathcal{H}$ and choice of $\phi$, $D^\phi_{\mathcal{H}}$ is never larger than its corresponding f-divergence. We show in Lemma 2 that its computation can be bounded in terms of finite examples. Finally, we recover the $\mathcal{H}\Delta\mathcal{H}$ divergence (Ben-David et al., 2010a) if we consider $\phi^*(t) = t$ and $\ell(h(x), h'(x)) = 1[h(x) \ne h'(x)]$, which corresponds to the TV.
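To make Definition 2 concrete, here is a brute-force toy computation (ours) of the empirical discrepancy for the TV case ($\phi^*(t) = t$) with 1-D threshold classifiers and the 0/1 disagreement loss, the setting in which $D^\phi_{\mathcal{H}}$ recovers the $\mathcal{H}\Delta\mathcal{H}$ divergence.

```python
# Sketch (ours): empirical D^phi_H (Eq. 3.2) for the TV case (phi*(t) = t),
# with a toy hypothesis class of 1-D thresholds h_c(x) = 1[x >= c] and the
# 0/1 disagreement loss l(a, b) = 1[a != b]. All names are illustrative.

def h(x, c):
    return 1 if x >= c else 0

def disc_tv(src, tgt, thresholds):
    # sup over pairs (h, h') in H of | E_src[l(h,h')] - E_tgt[l(h,h')] |
    best = 0.0
    for c1 in thresholds:
        for c2 in thresholds:
            es = sum(h(x, c1) != h(x, c2) for x in src) / len(src)
            et = sum(h(x, c1) != h(x, c2) for x in tgt) / len(tgt)
            best = max(best, abs(es - et))
    return best

src = [0.1, 0.2, 0.3, 0.4]        # source samples clustered low
tgt = [0.7, 0.8, 0.9, 1.0]        # target samples clustered high
thresholds = [i / 10 for i in range(11)]
assert disc_tv(src, tgt, thresholds) == 1.0   # a pair of thresholds separates the domains
assert disc_tv(src, src, thresholds) == 0.0   # identical domains have zero discrepancy
```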

Definition 3 ($D^\phi_{h,\mathcal{H}}$ discrepancy). Under the same conditions as above, the discrepancy between two distributions $P_s$ and $P_t$ is defined by:

$$D^\phi_{h,\mathcal{H}}(P_s \| P_t) := \sup_{h' \in \mathcal{H}} \left| \mathbb{E}_{x \sim P_s}[\ell(h(x), h'(x))] - \mathbb{E}_{x \sim P_t}[\phi^*(\ell(h(x), h'(x)))] \right|. \quad (3.3)$$

Taking the supremum of $D^\phi_{h,\mathcal{H}}$ over $h \in \mathcal{H}$, we obtain $D^\phi_{\mathcal{H}}$, and thus $D^\phi_{h,\mathcal{H}}(P_s \| P_t) \le D^\phi_{\mathcal{H}}(P_s \| P_t)$. This bound will be useful when deriving practical algorithms.

Lemma 1 (lower bound). For any two functions $h, h'$ in $\mathcal{H}$, we have:

$$|R_S(h, h') - R^{\phi^* \circ \ell}_T(h, h')| \le D^\phi_{h,\mathcal{H}}(P_s \| P_t) \le D^\phi_{\mathcal{H}}(P_s \| P_t) \le D_\phi(P_s \| P_t). \quad (3.4)$$

Lemma 1 is fundamental in the derivation of divergence-based generalization bounds for DA. Specifically, it bounds the gap between the source and target domains' error functions in terms of the discrepancy between their distributions using f-divergences. We now show that $D^\phi_{h,\mathcal{H}}$ can be estimated from finite samples.

Lemma 2. Suppose $\ell : \mathcal{Y} \times \mathcal{Y} \to [0, 1]$, $\phi^*$ is $L$-Lipschitz, and $[0, 1] \subset \operatorname{dom} \phi^*$. Let $S$ and $T$ be two empirical distributions corresponding to datasets containing $n$ datapoints sampled i.i.d. from $P_s$ and $P_t$, respectively. Let $\mathfrak{R}$ denote the Rademacher complexity of a given class of functions, and let $\ell \circ \mathcal{H} := \{x \mapsto \ell(h(x), h'(x)) : h, h' \in \mathcal{H}\}$. For all $\delta \in (0, 1)$, we have with probability at least $1 - \delta$:

$$|D^\phi_{h,\mathcal{H}}(P_s \| P_t) - D^\phi_{h,\mathcal{H}}(S \| T)| \le 2\mathfrak{R}_{P_s}(\ell \circ \mathcal{H}) + 2L\,\mathfrak{R}_{P_t}(\ell \circ \mathcal{H}) + 2\sqrt{(-\log \delta)/(2n)}. \quad (3.5)$$

In Lemma 2, we show that the empirical $D^\phi_{h,\mathcal{H}}$ converges to the true $D^\phi_{h,\mathcal{H}}$ discrepancy. It can therefore be estimated using a finite set of samples from the two distributions. The gap is bounded by the complexity of the hypothesis class and the number of examples ($n$). This result will also be important in the derivation of Theorem 3.
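The Rademacher complexity terms controlling the estimation error in Eq. (3.5) can themselves be estimated by Monte Carlo. Below is a toy sketch (ours, for a class of 1-D threshold functions, not the paper's deep-net class) illustrating how this complexity term shrinks as $n$ grows.

```python
import random

# Sketch (ours): Monte Carlo estimate of the empirical Rademacher complexity
# R(F) = E_sigma sup_{f in F} (1/n) sum_i sigma_i f(x_i) for the toy class
# F = {x -> 1[x >= c] : c in R}. Over sorted points, the supremum over c picks
# the best suffix sum of the Rademacher signs (or 0 for the empty suffix).

random.seed(0)

def rademacher_thresholds(n, trials=2000):
    total = 0.0
    for _ in range(trials):
        sigma = [random.choice((-1.0, 1.0)) for _ in range(n)]
        best = suffix = 0.0
        for s in reversed(sigma):       # scan suffixes of the sign sequence
            suffix += s
            if suffix > best:
                best = suffix
        total += best / n
    return total / trials

# the complexity term shrinks as the sample grows, tightening Eq. (3.5)
assert rademacher_thresholds(8) > rademacher_thresholds(128)
```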

3.2. DOMAIN ADAPTATION: GENERALIZATION BOUNDS

We now provide a novel generalization bound that estimates the error of a classifier in the target domain using the proposed $D^\phi_{h,\mathcal{H}}$ divergence and the results from the previous section. We also provide a generalization bound, based on Rademacher complexity, for a binary classifier, using the estimation of $D^\phi_{h,\mathcal{H}}$ from finite samples. We show in Appendix D that our bound generalizes previous existing results (e.g., those of Ben-David et al. (2010a)) and also includes popular divergences typically used in practice.

Theorem 2. Let $h^* \in \mathcal{H}$ be the ideal joint hypothesis and $\lambda^*_\phi := R_S(h^*) + R_T(h^*)$. For any $h \in \mathcal{H}$:

$$R_T(h) \le R_S(h) + D^\phi_{h,\mathcal{H}}(P_s \| P_t) + \lambda^*_\phi. \quad (3.6)$$

Intuitively, the first term in the bound accounts for the source error, the second corresponds to the discrepancy between the marginal distributions, and the third measures the ideal joint hypothesis ($\lambda^*_\phi$). If $\mathcal{H}$ is expressive enough and the labeling functions are similar, this last term can be reduced to a small value. The ideal joint hypothesis incorporates the notion of adaptability: when the optimal hypothesis performs poorly in either domain, we cannot expect successful adaptation.

Theorem 3. Under the assumptions of Lemma 2 and with $\lambda^*_\phi := R_S(h^*) + R_T(h^*)$, for all $\delta \in (0, 1)$, we have with probability at least $1 - \delta$:

$$R_T(h) \le R_S(h) + D^\phi_{h,\mathcal{H}}(S \| T) + \lambda^*_\phi + 6\mathfrak{R}_S(\ell \circ \mathcal{H}) + 2(1 + L)\mathfrak{R}_T(\ell \circ \mathcal{H}) + 5\sqrt{(-\log \delta)/(2n)}. \quad (3.7)$$

In Theorem 3, we compute our generalization bound for a binary classifier in terms of the Rademacher complexity of the class $\mathcal{H}$. We see that, under the assumption of an ideal joint hypothesis (small $\lambda^*_\phi$), the generalization error can be reduced by jointly minimizing the risk in the source domain and the discrepancy between the two distributions, while regularizing the model to limit the complexity of the hypothesis class. We take all of these into account when deriving practical algorithms in the next sections. We interpret $h : \mathcal{X} \to \mathcal{Y}$ as the composition of two networks $h = \hat{h} \circ g$, where $g : \mathcal{X} \to \mathcal{Z}$ and $\hat{h}$ is a classifier that operates in a representation space $\mathcal{Z}$. Inspired by the theory, we let $\hat{h}'$ be another network with the same topology as $\hat{h}$. This can be intuitively interpreted as a per-category domain classifier.
Our framework is different from domain-adversarial frameworks that follow from Ganin et al. (2016), since those use a single global domain classifier or discriminator.

4. f-DOMAIN ADVERSARIAL LEARNING (f-DAL)

We now use the theory
presented in the previous sections to derive a novel generalized domain-adversarial learning framework. The key idea of domain-adversarial training is to simultaneously minimize the source error and align the two distributions in a representation space $\mathcal{Z}$. Specifically, we let a hypothesis $h$ be the composition $h = \hat{h} \circ g$ (i.e., let $\mathcal{H} := \{\hat{h} \circ g : \hat{h} \in \hat{\mathcal{H}}, g \in \mathcal{G}\}$ with $\hat{\mathcal{H}}$ another function class), where $g : \mathcal{X} \to \mathcal{Z}$. This can be interpreted as a mapping that pushes forward the two densities $p_s$ and $p_t$ to a representation space $\mathcal{Z}$ where a classifier $\hat{h} \in \hat{\mathcal{H}}$ operates. Consequently, we refer to $p^z_s := g \# p_s$ and $p^z_t := g \# p_t$ as the push-forwards of the source and target domain densities, respectively. Figure 2 illustrates the f-DAL framework. Clearly from Theorem 2, for adaptation to be possible in the representation space $\mathcal{Z}$, there has to be an $\hat{h} \in \hat{\mathcal{H}}$ such that the ideal joint risk $\lambda^*$ is negligible. This condition is necessary even if $p^z_s = p^z_t$. In other words, we need both the difference between $p^z_s$ and $p^z_t$ to be small and the ideal joint risk $\lambda^*$ to be negligible; together these are sufficient and necessary conditions. We refer the reader to Ben-David et al. (2010b) for details on the impossibility theorems for DA. Consequently, we state the following:

Assumption 1. There exist $g \in \mathcal{G}$ and $\hat{h}^* \in \hat{\mathcal{H}}$ such that the ideal joint risk ($\lambda^*$) is negligible. We also assume that the class-conditional distributions between source and target are similar.

While these assumptions may seem restrictive, they are ubiquitous in modern DA methods, including SoTA methods, e.g., Ganin et al. (2016); Long et al. (2018); Hoffman et al. (2018b); Zhang et al. (2019) (sometimes not explicitly mentioned). Moreover, neural networks are generally known to be able to learn rich and powerful representations, and in practical scenarios, $g$ and $\hat{h}$ are both neural networks.
From Theorem 2 and Assumption 1, the target risk $R_T(h)$ can be optimized by jointly minimizing the error in the source domain and the discrepancy between the two distributions. Letting $y := f_s(x)$, an optimization objective can be written as:

$$\min_{\hat{h} \in \hat{\mathcal{H}}} \; \mathbb{E}_{z \sim p^z_s}[\ell(\hat{h}(z), y)] + D^\phi_{\hat{h}, \hat{\mathcal{H}}}(p^z_s \| p^z_t). \quad (4.1)$$

Here, $\ell$ is a surrogate loss function used to minimize the empirical risk in the source domain; it does not have to be the binary classification loss (e.g., it can be the cross-entropy loss). Under some assumptions (Proposition 1) and using Lemma 1, the minimization problem in Eq. 4.1 can be upper bounded (hence replaced) by the following min-max objective:

$$\min_{\hat{h} \in \hat{\mathcal{H}}} \max_{\hat{h}' \in \hat{\mathcal{H}}} \; \mathbb{E}_{z \sim p^z_s}[\ell(\hat{h}(z), y)] + \underbrace{\mathbb{E}_{z \sim p^z_s}[\hat{\ell}(\hat{h}'(z), \hat{h}(z))] - \mathbb{E}_{z \sim p^z_t}[(\phi^* \circ \hat{\ell})(\hat{h}'(z), \hat{h}(z))]}_{d_{s,t}} \quad (4.2)$$

where we refer to the difference between the last two terms as $d_{s,t}$. We now formalize this result.

Proposition 1. Suppose $d_{s,t}$ takes the form shown in Eq. 4.2 with $\hat{\ell}(\hat{h}'(z), \hat{h}(z)) \in \operatorname{dom} \phi^*$, and that for any $\hat{h} \in \hat{\mathcal{H}}$ there exists $\hat{h}' \in \hat{\mathcal{H}}$ s.t. $\hat{\ell}(\hat{h}'(z), \hat{h}(z)) = \phi'\!\left(\frac{p^z_s(z)}{p^z_t(z)}\right)$ for any $z \in \operatorname{supp}(p^z_t)$, with $\phi'$ the derivative of $\phi$. Then the optimal $d_{s,t}$ is $D_\phi(P^z_s \| P^z_t)$ (i.e., $\max_{\hat{h}' \in \hat{\mathcal{H}}} d_{s,t} = D_\phi(P^z_s \| P^z_t)$).

If we let the feature extractor $g \in \mathcal{G}$ be the one that minimizes both the source error and the discrepancy term, Eq. 4.2 can be rewritten as:

$$\min_{\hat{h} \in \hat{\mathcal{H}},\, g \in \mathcal{G}} \max_{\hat{h}' \in \hat{\mathcal{H}}} \; \mathbb{E}_{x \sim p_s}[\ell(\hat{h} \circ g(x), y)] + \mathbb{E}_{x \sim p_s}[\hat{\ell}(\hat{h}' \circ g(x), \hat{h} \circ g(x))] - \mathbb{E}_{x \sim p_t}[(\phi^* \circ \hat{\ell})(\hat{h}' \circ g(x), \hat{h} \circ g(x))]. \quad (4.3)$$

The choice of $\hat{\ell}$ is "somewhat arbitrary," as stated for GANs in Nowozin et al. (2016). For the multiclass scenario, we let $\hat{\ell}(a, b) = g(b_{\operatorname{argmax} a})$, where $\operatorname{argmax} a$ is the index of the largest element of vector $a$. For the binary case, we define $\hat{\ell}(\cdot, b) = g(b)$. This implies that we choose the domain of $\hat{\ell}$ to be $\mathbb{R}^k \times \mathbb{R}^k$, with $k$ categories, for the multiclass scenario, and $\mathbb{R}$ for binary classification. Intuitively, $\hat{h}'$ is an auxiliary per-category domain classifier.
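To fix ideas, the objective in Eq. 4.3 can be evaluated on toy data as follows. This is a minimal sketch of ours, not the paper's PyTorch implementation: it uses the Pearson $\chi^2$ row of Table 1 (where $g$ in $\hat{\ell}$ is the identity) and plain Python callables as stand-ins for the networks $g$, $\hat{h}$, and $\hat{h}'$.

```python
import math

# Minimal sketch (ours) of the f-DAL objective in Eq. 4.3 on toy 1-D data,
# for the Pearson chi^2 row of Table 1: phi*(t) = t^2/4 + t with g = identity,
# so lhat(a, b) = b[argmax a] reads off the auxiliary classifier's output at
# the class predicted by the main classifier.

def phi_star(t):                       # Pearson chi^2 conjugate (Table 1)
    return t * t / 4.0 + t

def lhat(a, b):                        # lhat(a, b) = g(b_{argmax a}), g = id
    return b[max(range(len(a)), key=lambda i: a[i])]

def cross_entropy(logits, y):          # surrogate source loss
    return math.log(sum(math.exp(v) for v in logits)) - logits[y]

def fdal_objective(src, ys, tgt, g, h, h_aux):
    cls = sum(cross_entropy(h(g(x)), y) for x, y in zip(src, ys)) / len(src)
    d_s = sum(lhat(h(g(x)), h_aux(g(x))) for x in src) / len(src)
    d_t = sum(phi_star(lhat(h(g(x)), h_aux(g(x)))) for x in tgt) / len(tgt)
    return cls + (d_s - d_t)           # source risk + adversarial term d_{s,t}

g = lambda x: x                        # identity "feature extractor"
h = lambda z: [z[0], 0.0]              # main classifier logits (stand-in)
h_zero = lambda z: [0.0, 0.0]          # auxiliary classifier outputting zeros

src, ys, tgt = [[1.0], [2.0]], [0, 0], [[0.0]]
obj = fdal_objective(src, ys, tgt, g, h, h_zero)
# with a zero auxiliary classifier, d_{s,t} vanishes (phi*(0) = 0) and the
# objective reduces to the source cross-entropy term
assert abs(obj - 0.2201) < 1e-3
```

In training, $\hat{h}'$ would be maximized (via the GRL) rather than held fixed as here.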
Note that this is different from Ganin et al. (2016), where there is a unique domain classifier or discriminator. For the choice of $g$, we follow Nowozin et al. (2016) and choose it to be a monotonically increasing function when possible. We summarize our choices of $\hat{\ell}$ for different f-divergences in Table 1. We also show other choices of $\hat{\ell}$, including generalizations of previous methods, in Appendix D.

γ-weighted JS divergence. If we relax the requirement $\phi(1) = 0$ in Proposition 1, the new objective is only shifted by a constant, i.e., $\max_{\hat{h}' \in \hat{\mathcal{H}}} d_{s,t} = D_{\tilde{\phi}}(P^z_s \| P^z_t) + \phi(1)$ with $\tilde{\phi}(x) := \phi(x) - \phi(1)$. By Lemma 4 (Appendix D), we can then rescale $\phi^*$, and $\phi$ will change accordingly. This allows us to include the practical objective from Zhang et al. (2019) as part of our framework (i.e., the γ-weighted JS; see Appendix D). While this can be done for the general family of divergences, we do not pursue this direction in practice as it requires additional hyperparameter tuning of γ.
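The constant-shift argument above can be checked numerically (our sketch; the shifted $\phi$ below is illustrative): adding a constant to $\phi$ subtracts the same constant from its Fenchel conjugate, which is why relaxing $\phi(1) = 0$ only moves the objective by $\phi(1)$.

```python
# Numerical check (ours, not from the paper) of the shift argument above:
# (phi + c)* = phi* - c, so a nonzero phi(1) only shifts the objective.

def phi_shifted(x):            # Pearson-like phi with phi(1) = 0.7 (illustrative)
    return (x - 1.0) ** 2 + 0.7

def conj(phi, t, grid):        # brute-force phi*(t) = sup_{x>0} { x*t - phi(x) }
    return max(x * t - phi(x) for x in grid)

grid = [i / 1000.0 for i in range(1, 20000)]
for t in [-0.5, 0.0, 1.0]:
    # the conjugate of (x - 1)^2 is t^2/4 + t (Table 1); the +0.7 flips sign
    assert abs(conj(phi_shifted, t, grid) - (t * t / 4.0 + t - 0.7)) < 1e-3
```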

4.1. OPTIMALITY IN f -DAL

The main objective of our framework (i.e., Eq. 4.3) is a minimax optimization problem, and our desired (optimal) solution is, under mild assumptions, a Stackelberg equilibrium. This key observation allows us to incorporate into our framework the latest optimizers from the game-optimization literature. We now formalize and prove this concept. Based on it, we propose to use the extra-gradient algorithm and its aggressive version within our framework. We show their effectiveness through a toy example (Figure 3, Appendix E) and also empirically in our large-scale experiments.

Stackelberg equilibria in f-DAL. We show the existence of Stackelberg equilibria in f-Domain Adversarial Learning (f-DAL). Let $\mathcal{G}$ and $\hat{\mathcal{H}}$ be classes of functions defined by a fixed parametric functional (i.e., neural networks with a fixed architecture), and define $\omega_1$ as the vector composed of the parameters of the feature extractor $g$ and the source classifier $\hat{h}$. Similarly, let $\omega_2$ be the parameters of the auxiliary classifier $\hat{h}'$, and let $\Omega_1$ and $\Omega_2$ denote their respective domains. Equation (4.2) can then be rewritten as:

$$\min_{\omega_1 \in \Omega_1} \max_{\omega_2 \in \Omega_2} V(\omega_1, \omega_2). \quad (4.4)$$

In general, $V$ is nonconvex in $\omega_1$ and nonconcave in $\omega_2$, and for the min-max game in Eq. 4.4, Nash equilibria may not exist (Farnia & Ozdaglar, 2020). A Stackelberg equilibrium is more general than a Nash equilibrium (see Definition 5) and reflects the sequential nature of our zero-sum game in Eq. 4.4. We now show that the optimal solution of f-DAL is a Stackelberg equilibrium. Such an equilibrium is a stationary point under the assumption that $V(\omega_1, \cdot)$ is (locally) strongly concave in $\omega_2$ (Evtushenko, 1974), and we can then use gradient algorithms to search for such a desirable solution. In the following theorem, we use the explicit form of the push-forward to emphasize the dependence on the feature extractor $g$, rather than on $p^z_s, p^z_t$.

Theorem 4 (Stackelberg equilibrium, informal).
Suppose $d_{s,t}$ takes the form shown in Eq. 4.2, and assume that (a) there exists an optimal $g^* \in \mathcal{G}$ that maps both the source and the target distribution to the same distribution; (b) there exists an optimal classifier that yields the ground truth in a neighborhood; and (c) for any $g \in \mathcal{G}$ and $\hat{h} \in \hat{\mathcal{H}}$, there exists $\hat{h}'$ that achieves $\hat{\ell}(\hat{h}'(z), \hat{h}(z)) = \phi'((g \# p_s)(z) / (g \# p_t)(z))$. Then the objective of f-adversarial learning has a Stackelberg equilibrium at $(\hat{h}^*, g^*, \hat{h}'^*)$. When $\hat{\ell}(\hat{h}, \hat{h}') = \hat{\ell}(\hat{h}')$ (e.g., in a binary classification scenario or in Ganin et al. (2016)), the Stackelberg equilibrium can be shown to be a Nash equilibrium (see Theorem 6).

Extra-gradient algorithms. We have shown that the optimal solution of f-DAL is a Stackelberg equilibrium, which is more general than a Nash equilibrium. For convergence to a Nash equilibrium, the simplest method is GDA. However, the last iterate of GDA does not converge even in the bilinear case (e.g., Nemirovsky & Yudin, 1983). To accelerate and stabilize convergence, the extra-gradient (EG) method was proposed in Korpelevich (1976). It was recently shown (Zhang et al., 2020; Hsieh et al., 2020) that taking an aggressive extra-step is even more stable than vanilla EG and is more suitable for convergence to Stackelberg equilibria (Zhang et al., 2020). With the aim of quantifying whether exploiting Theorem 4 leads to practical gains, we follow those works and let the extrapolation step be larger. We refer to this algorithm as Aggressive Extra-Gradient (AExG). We illustrate it on a simple example (Appendix E), whose convergence trajectories are shown in Figure 3, and we explore AExG further in the experimental section.
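The difference between GDA and the extra-gradient method is easy to see on the classic bilinear game $\min_x \max_y\, xy$ (a standard illustration, not from the paper): GDA's last iterate spirals away from the equilibrium at the origin, while EG's look-ahead step makes it converge.

```python
# Sketch: on the bilinear game min_x max_y x*y, plain GDA's last iterate
# diverges, while the extra-gradient (EG) method of Korpelevich (1976)
# contracts toward the equilibrium (0, 0).

def gda(x, y, lr, steps):
    for _ in range(steps):
        x, y = x - lr * y, y + lr * x          # simultaneous gradient step
    return x, y

def eg(x, y, lr, steps):
    for _ in range(steps):
        xh, yh = x - lr * y, y + lr * x        # extrapolation (look-ahead) step
        x, y = x - lr * yh, y + lr * xh        # update with look-ahead gradients
    return x, y

def norm(p):
    return (p[0] ** 2 + p[1] ** 2) ** 0.5

start = (1.0, 1.0)
assert norm(gda(*start, lr=0.1, steps=2000)) > 2.0    # GDA drifts away
assert norm(eg(*start, lr=0.1, steps=2000)) < 1e-2    # EG converges
```

The AExG variant used in the paper additionally scales up the extrapolation step; the sketch above only contrasts plain GDA with vanilla EG.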

5. EXPERIMENTAL RESULTS

We present experimental results of our framework in practical scenarios. In these scenarios, the learner is a neural network and the input domain is the set of natural images. Specifically, we aim to answer the following questions: (1) How does choosing a particular divergence affect the domain adaptation performance across different datasets? (2) Is there a better universal notion of f-divergence that achieves significant performance gains across different datasets and thus helps generalization? (3) Are there considerable practical gains from exploiting the fact that the optimal solution of f-DAL is a Stackelberg equilibrium? (4) How does our theoretical framework compare in practice to existing SoTA methods? We also compare in Figure 4 the difference in interpretation of the auxiliary classifier of f-DAL vs. Ganin et al. (2016); the comparison shows a significant performance boost for the same divergence, i.e., DANN vs. f-DAL (JS).

Datasets. In our experiments we use two main datasets: (1) the Office-31 dataset (Saenko et al., 2010), which contains 4,652 images and 31 categories collected from three distinct domains: Amazon (A), Webcam (W) and DSLR (D); and (2) the Office-Home dataset (Venkateswara et al., 2017), a more complex dataset containing 15,500 images from four different domains: Artistic images, Clip Art, Product images, and Real-world images. In each of our experiments, we report the average over 3 different seeds.

Implementation Details: We implement our algorithm in PyTorch. We use ResNet-50 (He et al., 2016) pretrained on ImageNet (Deng et al., 2009) as the feature extractor. The main classifier ($\hat{h}$) and auxiliary classifier ($\hat{h}'$) are both two-layer neural networks with Leaky-ReLU activation functions. We use spectral normalization (SN) as in Miyato et al. (2018) only for these two (i.e., $\hat{h}$ and $\hat{h}'$). We did not see any transfer improvement from using SN, nor from using Leaky-ReLU activation functions instead of ReLU.
The reason for using them was instead to avoid gradient issues and instabilities during the first epochs of training for some divergences (e.g., KL, TV). For simplicity, and for fair comparison with previous work, we perform simultaneous updates using the GRL. We also use the GRL warm-up strategy; this is standard in most DA frameworks and follows Eq. (14) in Ganin & Lempitsky (2015). For optimization, we use (1) mini-batch (32) SGD (i.e., GDA) with Nesterov momentum (0.9); (2) for experiments using AExG, we take inspiration from Gidel et al. (2019) and implement a version of the extra-gradient method with momentum (0.9). For the aggressive step, we use a multiplier decayed from 10 to 1 with a polynomial decay rate of power 0.5 over the first 10K iterations. In all cases, the learning rate of the classifiers is set 10 times larger than that of the feature extractor (0.01), whose value is adjusted according to Ganin et al. (2016), which is standard practice. We will release source code. When DANN is compared with f-DAL (JS) (Table 2), we see a significant performance boost. This is in line with our theory, which suggests the use of a per-category domain classifier rather than a single discriminator.

Comparing f-divergences. We first compare the performance of f-divergences on Office-31. Specifically, we evaluate the performance of the model on the six combinations of transfer tasks with different divergences. The optimizer is SGD with Nesterov momentum. All hyperparameters are kept constant across divergences. As shown in Figure 4, the Pearson $\chi^2$ divergence achieves the best overall result among all the transfer tasks on this benchmark. This divergence had never before been used to learn invariant representations in the context of DA. Interestingly, a similar trend was observed for GANs in Nowozin et al. (2016).
This observation is also reminiscent of histogram-based (visual) bag-of-words representations, which were shown to work better with χ² distances than with ℓ₂ and ℓ₁ distances for image and text classification tasks (e.g. Li et al. (2013) and references therein). The KL divergence performs well in some transfer tasks but significantly worse than the rest in others (e.g. D → A). The reason might be that, unlike the JS, TV and Pearson χ² divergences, which are lower and upper bounded by finite values, the KL divergence can grow exponentially and tend to +∞ even when the densities p_s and p_t are nonzero (Nielsen & Nock, 2013). The lack of an upper bound on the KL divergence might lead to numerical instability of the optimizer and explain the inconsistency in performance.

What do we get from the "extra" gradient? We now compare the AExG method against GDA with Nesterov momentum. The main idea is to evaluate whether the characterization of the optimal solution of f-DAL as a Stackelberg equilibrium leads to practical gains by exploiting more suitable optimizers (Section 4.1). In both cases, we use a momentum coefficient of 0.9. The experiments are performed on Office-31 and hyperparameters are kept constant for all divergences. In Figure 5, we observe that using AExG significantly improves the performance in some transfer tasks for some divergences (e.g. JS in A → W). Overall, we also observe gains in performance across all divergences. Figure 5 also illustrates the transfer curves of AExG vs. GDA with Nesterov momentum for the task A → W. For this divergence and pair of datasets, AExG converges faster and also obtains slightly better accuracy. This is in line with the insights obtained from the theoretical results presented in Section 4.1 and Appendix E. If computation is not an issue, we encourage the use of AExG. That said, f-DAL achieves comparable performance with GDA in terms of accuracy (see Tables 2 and 3).

A look at the state-of-the-art arena.
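The benefit of the extrapolation step can be seen on the classic bilinear game min_x max_y xy, for which simultaneous GDA diverges while ExtraGradient converges. The following toy sketch (plain Python, without momentum; step size and iteration count are arbitrary choices of ours) only illustrates the optimizer dynamics, not our full AExG implementation:

```python
import math

def gda_step(x, y, lr):
    # Simultaneous gradient descent-ascent on min_x max_y x*y.
    return x - lr * y, y + lr * x

def extragrad_step(x, y, lr):
    # ExtraGradient: extrapolate first, then update using the
    # gradients evaluated at the extrapolated (lookahead) point.
    xe, ye = x - lr * y, y + lr * x
    return x - lr * ye, y + lr * xe

x1 = y1 = x2 = y2 = 1.0
for _ in range(500):
    x1, y1 = gda_step(x1, y1, 0.1)
    x2, y2 = extragrad_step(x2, y2, 0.1)

# GDA spirals away from the equilibrium (0, 0); ExtraGradient contracts to it.
assert math.hypot(x2, y2) < 0.2 < math.hypot(x1, y1)
```

In closed form, each GDA step multiplies the distance to the equilibrium by sqrt(1 + lr²) > 1, whereas each ExtraGradient step multiplies it by sqrt((1 − lr²)² + lr²) < 1, which is exactly the contraction visible in the assertion.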
We also compare our best methods (i.e. Pearson χ² + AExG and Pearson χ² + GDA) with current SoTA unsupervised DA methods, in a leaderboard-like fashion. For fair comparison, we use the same network architecture (i.e. ResNet-50), training strategy, and set of hyperparameters as Long et al. (2018); Zhang et al. (2019), from which we took the baseline results. Our approach achieves SoTA results on the Office-31 and Office-Home datasets.

Remark on SoTA comparison: We compare with SoTA methods that rely on adversarial training, since this is the focus of our work. Therefore, methods such as Kang et al. (2019) are not included, as they rely on additional techniques that neither our method nor the proposed baselines use, and which could be added to improve performance further. Our goal is to propose a unifying framework that connects the theory used to explain DANN (Ganin et al., 2016) (and similar SoTA algorithms) with the algorithms themselves. The new theory results in a new adversarial framework (Sec. 4), which impressively outperforms the previous SoTA (Tables 2 and 3). This follows from connecting theory and algorithms and from a proper interpretation of the former. Our results can be further improved with additional tuning or techniques (e.g. CDAN), since most SoTA methods either follow from Ganin et al. (2016) or are part of our framework (e.g. MDD = γ-JS). This is deferred to future work.

6. CONCLUSIONS

We have provided a novel perspective on the domain-adversarial problem by deriving new theory and learning algorithms that support the complete family of f-divergences and that are practical for modern neural networks. We further recognized the learning objective of our framework as a Stackelberg game and borrowed the latest optimizers from the game-optimization literature, achieving additional performance boosts. We showed through large-scale experiments that any f-divergence can be used to minimize the discrepancy between source and target domains in a representation space. We also showed that some divergences, not considered previously in domain-adversarial learning, achieve SoTA results in practice, reducing the need for the additional techniques and hyperparameter tuning required by previous methods.

A RELATED WORK

A.1 DOMAIN ADAPTATION

Our approach was briefly positioned with respect to the related work discussed throughout the paper. We refer the reader to Redko et al. (2019) and Wang & Deng (2018) for a comprehensive survey.

A.2 DIVERGENCES BETWEEN PROBABILITY MEASURES

As explained above, the discrepancy term between the source and target domains is important in bounding the target loss. We now provide more details about the H∆H-divergence and the f-divergences that are used to compare the two domains.

H∆H-divergence

The H-divergence is a restriction of the total variation. For binary classification, define I(h) := {x ∈ X : h(x) = 1}; then the H-divergence between two measures µ and ν given the hypothesis class H is (Ben-David et al., 2010a):

d_H(µ, ν) = 2 sup_{h∈H} |µ(I(h)) − ν(I(h))|.   (A.1)

Define H∆H := {h ⊕ h′ : h, h′ ∈ H} (⊕: XOR); then d_{H∆H}(µ, ν) can be used to bound the difference between the source and target errors. The H∆H-divergence has been extended to general loss functions (Mansour et al., 2009) and to the Margin Disparity Discrepancy (Zhang et al., 2019).

f-divergence. Given two measures µ and ν with µ ≪ ν (µ absolutely continuous w.r.t. ν), the f-divergence D_φ(µ‖ν) is defined as (Csiszár, 1967; Ali & Silvey, 1966):

D_φ(µ‖ν) = ∫ φ(dµ/dν) dν,   (A.2)

where dµ/dν is known as the Radon-Nikodym derivative (e.g. Billingsley, 2008). Assume φ is convex and lower semi-continuous; then, from the Fenchel-Moreau theorem, φ** = φ, with φ* known as the Fenchel conjugate of φ:

φ*(y) = sup_{x∈dom φ} ⟨x, y⟩ − φ(x),   (A.3)

which is convex since it is a supremum of affine functions. For the supremum to be attained at x, it is necessary and sufficient that y ∈ ∂φ(x), by the stationarity condition. Therefore, combining equation A.2 and equation A.3, D_φ(µ‖ν) can be written as:

D_φ(µ‖ν) = sup_{T∈T̄} E_{X∼µ}[T(X)] − E_{Z∼ν}[φ*(T(Z))],

where T̄ = {T : T is a measurable function, T : X → dom φ*}. In practice we restrict T̄ to a subset as in Definition 2. For different choices of φ, see Table 4. Nguyen et al. (2010) derived a general variational method to estimate f-divergences given only samples. Nowozin et al. (2016) extended this method from merely estimating a divergence for a fixed model to estimating model parameters. While our method builds on this variational formulation, we use it in the context of domain adaptation.
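As a sanity check of this variational characterization, one can verify numerically on discrete distributions that the objective E_µ[T] − E_ν[φ*(T)] never exceeds D_φ(µ‖ν), and that it is tight at the optimal critic T = φ′(dµ/dν). A small sketch using the Pearson χ² generator from Table 4 (the toy distributions are ours):

```python
import numpy as np

# Pearson chi^2: phi(x) = (x-1)^2, conjugate phi*(t) = t^2/4 + t, phi'(x) = 2(x-1).
phi       = lambda x: (x - 1.0) ** 2
phi_star  = lambda t: t ** 2 / 4.0 + t
phi_prime = lambda x: 2.0 * (x - 1.0)

p = np.array([0.2, 0.5, 0.3])   # toy "source" distribution
q = np.array([0.4, 0.4, 0.2])   # toy "target" distribution

d_true = np.sum(q * phi(p / q))             # D_phi(p || q) by definition (A.2)

def lower_bound(T):
    # Variational objective E_p[T] - E_q[phi*(T)] from the Fenchel duality.
    return np.sum(p * T) - np.sum(q * phi_star(T))

T_opt = phi_prime(p / q)                    # optimal critic T = phi'(p/q)
assert np.isclose(lower_bound(T_opt), d_true)   # the bound is tight at T_opt

rng = np.random.default_rng(0)
for _ in range(100):                        # any other critic stays below D_phi
    assert lower_bound(rng.normal(size=3)) <= d_true + 1e-12
```

The inequality holds coordinate-wise because sup_t (r·t − φ*(t)) = φ**(r) = φ(r), which is exactly the Fenchel-Moreau identity used above.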

B PROOFS

In this section, we provide the proofs for the different theorems and lemmas.

Theorem 1. If ℓ(x, y) = |h(x) − y| and H is a class of functions, then for any h ∈ H we have:

R_T(h) ≤ R_S(h) + D_TV(P_s‖P_t) + min{E_{x∼P_s}[|f_t(x) − f_s(x)|], E_{x∼P_t}[|f_t(x) − f_s(x)|]}.   (3.1)

Proof. Rewriting the target loss we have:

R_T(h) = R_T(h) − R_S(h, f_t) + R_S(h, f_t) − R_S(h) + R_S(h)
       ≤ R_S(h) + |R_S(h) − R_S(h, f_t)| + |R_T(h) − R_S(h, f_t)|,

where:

|R_S(h) − R_S(h, f_t)| = |R_S(h, f_s) − R_S(h, f_t)| = |E_{x∼P_s}[|h(x) − f_t(x)| − |h(x) − f_s(x)|]| ≤ E_{x∼P_s}[|f_t(x) − f_s(x)|],

and:

|R_T(h) − R_S(h, f_t)| = |R_T(h, f_t) − R_S(h, f_t)| ≤ ∫ |p_t(x) − p_s(x)| · |h(x) − f_t(x)| dx ≤ ∫ |p_t(x)/p_s(x) − 1| p_s(x) dx = D_φ(P_s‖P_t),

with φ(x) = |x − 1|, which corresponds to the total variation.

Lemma 1 (lower bound). For any two functions h, h′ in H, we have:

|R_S(h, h′) − R_T^{φ*∘ℓ}(h, h′)| ≤ D_φ^{h,H}(P_s‖P_t) ≤ D_φ^H(P_s‖P_t) ≤ D_φ(P_s‖P_t).   (3.4)

Proof.

D_φ^H(P_s‖P_t) = sup_{h∈H} D_φ^{h,H}(P_s‖P_t) ≥ D_φ^{h,H}(P_s‖P_t)   (B.1)
= sup_{h′∈H} |E_{x∼P_s}[ℓ(h(x), h′(x))] − E_{x∼P_t}[φ*(ℓ(h(x), h′(x)))]|   (B.2)
≥ |E_{x∼P_s}[ℓ(h(x), h′(x))] − E_{x∼P_t}[φ*(ℓ(h(x), h′(x)))]|   (B.3)
= |R_S(h, h′) − R_T^{φ*∘ℓ}(h, h′)|.

Table 4: Popular f-divergences, their conjugate functions, and choices of g. We take ℓ(a, b) = g(b_{argmax a}).

- MDD: φ(x) = x log(γx/(1+γx)) + (1/γ) log(1/(1+γx)); φ*(t) = −log(1 − e^t)/γ; φ′(1) = log(γ/(1+γ)); g(x) = log x.
- Kullback-Leibler (KL): φ(x) = x log x; φ*(t) = exp(t − 1); φ′(1) = 1; g(x) = x.
- Reverse KL (KL-rev): φ(x) = −log x; φ*(t) = −1 − log(−t); φ′(1) = −1; g(x) = −exp(x).
- Jensen-Shannon (JS): φ(x) = −(x+1) log((1+x)/2) + x log x; φ*(t) = −log(2 − e^t); φ′(1) = 0; g(x) = log(2/(1+exp(−x))).
- Pearson χ²: φ(x) = (x − 1)²; φ*(t) = t²/4 + t; φ′(1) = 0; g(x) = x.
- Squared Hellinger (SH): φ(x) = (√x − 1)²; φ*(t) = t/(1 − t); φ′(1) = 0; g(x) = 1 − exp(x).
- γ-weighted Pearson χ²: φ(x) = (γx − 1)²/γ; φ*(t) = (t²/4 + t)/γ; φ′(1) = 0; g(x) = x.
- Neyman χ²: φ(x) = (1 − x)²/x; φ*(t) = 2 − 2√(1 − t); φ′(1) = 0; g(x) = 1 − exp(x).
- γ-weighted total variation: φ(x) = (1/(2γ))|γx − 1|; φ*(t) = (t/γ)·1_{−1/2≤t≤1/2}; ∂φ(1) = [−1/2, 1/2]; g(x) = (1/2) tanh x.
- Total Variation (TV): φ(x) = (1/2)|x − 1|; φ*(t) = t·1_{−1/2≤t≤1/2}; ∂φ(1) = [−1/2, 1/2]; g(x) = (1/2) tanh x.
(B.4)

For the rightmost inequality in equation 3.4, it is well known that the f-divergence D_φ is nonnegative (e.g. Sason & Verdú, 2016), and thus

D_φ(P_s‖P_t) = sup_{T∈T̄} |E_{x∼P_s}[T(x)] − E_{x∼P_t}[φ*(T(x))]|.   (B.5)

Restricting T̄ to a subset as in Definition 2, we obtain D_φ(P_s‖P_t) ≥ D_φ^H(P_s‖P_t).

Lemma 2. Suppose ℓ : Y × Y → [0, 1], φ* is L-Lipschitz, and [0, 1] ⊂ dom φ*. Let S and T be two empirical distributions corresponding to datasets containing n datapoints sampled i.i.d. from P_s and P_t, respectively. Let R denote the Rademacher complexity of a given class of functions, and ℓ∘H := {x ↦ ℓ(h(x), h′(x)) : h, h′ ∈ H}. For any δ ∈ (0, 1), with probability of at least 1 − δ:

|D_φ^{h,H}(P_s‖P_t) − D_φ^{h,H}(S‖T)| ≤ 2R_{P_s}(ℓ∘H) + 2L R_{P_t}(ℓ∘H) + 2√(log(1/δ)/(2n)).   (3.5)

Proof. For reference, we refer the reader to Chapter 3 of Mohri et al. (2018). Using the notation R and R̂ for the true and empirical risks, we have:

D_φ^{h,H}(P_s‖P_t) − D_φ^{h,H}(S‖T)
= sup_{h′∈H} |R_S(h, h′) − R_T^{φ*∘ℓ}(h, h′)| − sup_{h′∈H} |R̂_S(h, h′) − R̂_T^{φ*∘ℓ}(h, h′)|   (B.6)
≤ sup_{h′∈H} | |R_S(h, h′) − R_T^{φ*∘ℓ}(h, h′)| − |R̂_S(h, h′) − R̂_T^{φ*∘ℓ}(h, h′)| |
≤ sup_{h′∈H} |R_S(h, h′) − R_T^{φ*∘ℓ}(h, h′) − R̂_S(h, h′) + R̂_T^{φ*∘ℓ}(h, h′)|
≤ sup_{h′∈H} |R_S(h, h′) − R̂_S(h, h′)| + |R_T^{φ*∘ℓ}(h, h′) − R̂_T^{φ*∘ℓ}(h, h′)|
≤ 2R_{P_s}(ℓ∘H) + √(log(1/δ)/(2n)) + 2R_{P_t}(φ*∘ℓ∘H) + √(log(1/δ)/(2n)),

where |R_S(h, h′) − R̂_S(h, h′)| ≤ 2R_{P_s}(ℓ∘H) + √(log(1/δ)/(2n)) (Theorem 3.3 of Mohri et al. (2018)), and, by Talagrand's lemma, R_{P_t}(φ*∘ℓ∘H) ≤ L R_{P_t}(ℓ∘H).

Proof (of Theorem 2). We first introduce the following lemma for our proof:

Lemma 3. For any function φ that satisfies φ(1) = 0, we have φ*(t) ≥ t, where φ* is the Fenchel conjugate of φ.

Proof. From the definition of the Fenchel conjugate, φ*(t) = sup_{x∈dom φ}(xt − φ(x)) ≥ 1·t − φ(1) = t.
With the triangle inequality of ℓ, we can write:

R_T(h, f_t) ≤ R_T(h, h*) + R_T(h*, f_t)   (B.7)
= R_T(h, h*) + R_T(h*, f_t) − R_S(h, h*) + R_S(h, h*)   (B.8)
≤ R_T^{φ*∘ℓ}(h, h*) − R_S(h, h*) + R_S(h, h*) + R_T(h*, f_t)   (Lemma 3)   (B.9)
≤ |R_T^{φ*∘ℓ}(h, h*) − R_S(h, h*)| + R_S(h, h*) + R_T(h*, f_t)   (B.10)
≤ D_φ^{h,H}(P_s‖P_t) + R_S(h, h*) + R_T(h*, f_t)   (Lemma 1)   (B.11)
≤ D_φ^{h,H}(P_s‖P_t) + R_S(h, f_s) + [R_S(h*, f_s) + R_T(h*, f_t)] = D_φ^{h,H}(P_s‖P_t) + R_S(h) + λ*.   (B.12)

Theorem 3 (generalization bound with Rademacher complexity). Suppose ℓ : Y × Y → [0, 1] and φ* is L-Lipschitz. Let S and T be two empirical distributions corresponding to datasets containing n datapoints sampled i.i.d. from P_s and P_t, respectively. With probability of at least 1 − δ:

R_T(h) ≤ R̂_S(h) + D_φ^{h,H}(S‖T) + λ*_φ + 6R_S(ℓ∘H) + 2(1 + L)R_T(ℓ∘H) + 5√(log(1/δ)/(2n)).   (3.7)

Proof. This follows from Theorem 2, where R_T(h) ≤ R_S(h) + D_φ^{h,H}(P_s‖P_t) + R_S(h*) + R_T(h*). We also have |R_D(h) − R̂_D(h)| ≤ 2R_D(ℓ∘H) + √(log(1/δ)/(2n)) (Theorem 3.3 of Mohri et al. (2018)). From Lemma 2, D_φ^{h,H}(P_s‖P_t) ≤ D_φ^{h,H}(S‖T) + 2R_{P_s}(ℓ∘H) + 2L R_{P_t}(ℓ∘H) + 2√(log(1/δ)/(2n)). Plugging in and rearranging gives the desired result.

Proposition 1. Suppose d_{s,t} takes the form shown in equation 4.2 with ℓ̂(ĥ′(z), ĥ(z)) ∈ dom φ*, and that for any ĥ ∈ Ĥ there exists ĥ′ s.t. ℓ̂(ĥ′(z), ĥ(z)) = φ′(p_s^z(z)/p_t^z(z)) for any z ∈ supp(p_t^z), with φ′ the derivative of φ. Then the optimal d_{s,t} is D_φ(P_s^z‖P_t^z) (i.e. max_{ĥ′} d_{s,t} = D_φ(P_s^z‖P_t^z)).

Proof. We first rewrite, from the definition of d_{s,t} in equation 4.2:

d_{s,t} = E_{z∼p_s^z}[ℓ̂(ĥ′(z), ĥ(z))] − E_{z∼p_t^z}[(φ*∘ℓ̂)(ĥ′(z), ĥ(z))]   (B.15)
= ∫ [p_s^z(z) ℓ̂(ĥ′(z), ĥ(z)) − p_t^z(z)(φ*∘ℓ̂)(ĥ′(z), ĥ(z))] dz   (B.16)
= ∫ p_t^z(z) [ (p_s^z(z)/p_t^z(z)) ℓ̂(ĥ′(z), ĥ(z)) − (φ*∘ℓ̂)(ĥ′(z), ĥ(z)) ] dz.   (B.17)

Maximizing w.r.t. ĥ′ and assuming the auxiliary class is unconstrained, we have p_s^z(z)/p_t^z(z) ∈ (∂φ*)(ℓ̂(ĥ′(z), ĥ(z))) for any z ∈ supp(p_t^z). From the definition of the Fenchel conjugate we have:

x ∈ ∂φ*(t) ⟺ φ(x) + φ*(t) = xt ⟺ φ′(x) = t.
Plugging in x = p_s^z(z)/p_t^z(z) and t = ℓ̂(ĥ′(z), ĥ(z)), we obtain ℓ̂(ĥ′(z), ĥ(z)) = φ′(p_s^z(z)/p_t^z(z)). Hence, from the definition of f-divergences (Definition 1) and its variational characterization (eq. 2.2), we write:

max_{ĥ′} d_{s,t} = D_φ(P_s^z‖P_t^z).   (B.18)

Definition 4 (Stackelberg equilibrium). A Stackelberg equilibrium (ω₁*, ω₂*) ∈ Ω₁ × Ω₂ of the min-max game satisfies, for all (ω₁, ω₂) ∈ Ω₁ × Ω₂, V(ω₁*, ω₂) ≤ V(ω₁*, ω₂*) ≤ max_{ω₂∈Ω₂} V(ω₁, ω₂).

Definition 5 (Nash equilibrium). A Nash equilibrium (ω₁*, ω₂*) ∈ Ω₁ × Ω₂ of the min-max game in equation 4.4 is defined such that, for all (ω₁, ω₂) ∈ Ω₁ × Ω₂, V(ω₁*, ω₂) ≤ V(ω₁*, ω₂*) ≤ V(ω₁, ω₂*).

Theorem 5 (Stackelberg equilibrium). Suppose d_{s,t} takes the form shown in equation 4.2, and assume that: (a) there is an optimal feature extractor g* ∈ G that maps both the source and the target distribution to the same distribution, i.e. g*#p_s = g*#p_t; (b) there is an optimal classifier s.t. ĥ*∘g* = f_s is the ground truth, and ĥ∘g ≠ f_s for any other (ĥ, g) in a neighborhood of (ĥ*, g*); (c) for any g ∈ G and ĥ ∈ Ĥ, there exists ĥ′ s.t. for any z ∈ supp(g#p_s), one has ℓ̂(ĥ′(z), ĥ(z)) = φ′((g#p_s)(z)/(g#p_t)(z)). Then the objective of f-adversarial learning has a Stackelberg equilibrium at (ĥ*, g*, ĥ′*), where for all z ∈ supp(g*#p_s), ℓ̂(ĥ′*(z), ĥ*(z)) = φ′(1).

Proof. At g* and ĥ*, we have

d_{s,t}(ĥ*, g*, ĥ′) = E_{z∼g*#p_s}[ℓ̂(ĥ′(z), ĥ*(z)) − φ*(ℓ̂(ĥ′(z), ĥ*(z)))],   (C.1)

since g*#p_s = g*#p_t. Maximizing over ℓ̂(ĥ′(z), ĥ*(z)) (pointwise, t − φ*(t) is maximized at t = φ′(1)) yields:

ℓ̂(ĥ′*(z), ĥ*(z)) = φ′(1), for all z ∈ supp(g*#p_s).   (C.2)

In other words, d_{s,t}(ĥ*, g*, ĥ′) ≤ d_{s,t}(ĥ*, g*, ĥ′*) for any ĥ′. Now let us prove that max_{ĥ′} d_{s,t}(ĥ, g, ĥ′) ≥ d_{s,t}(ĥ*, g*, ĥ′*) = φ′(1) − φ*(φ′(1)) = φ(1) = 0. This holds because, from our assumptions and Proposition 1, max_{ĥ′} d_{s,t}(ĥ, g, ĥ′) = D_φ(g#p_s‖g#p_t) ≥ 0. So far, we have shown that

d_{s,t}(ĥ*, g*, ĥ′) ≤ d_{s,t}(ĥ*, g*, ĥ′*) ≤ max_{ĥ′} d_{s,t}(ĥ, g, ĥ′)   (C.3)

for any ĥ, ĥ′ and any g ∈ G.
Also, (ĥ*, g*) is an optimal pair for the source loss, namely:

R_s(ĥ*, g*) ≤ R_s(ĥ, g),   (C.4)

for any ĥ ∈ Ĥ and g ∈ G. Combining equation C.3 and equation C.4, we conclude that (ĥ*, g*, ĥ′*) is a Stackelberg equilibrium.

Theorem 6 (Nash equilibrium). Under the same assumptions as in Theorem 5, assume also that ℓ̂ depends only on the auxiliary classifier, i.e., ℓ̂(ĥ′, ĥ) = ℓ̂(ĥ′). Then the objective of f-adversarial learning has a Nash equilibrium at (ĥ*, g*, ĥ′*) where, for all z ∈ ∪_{g∈G} supp(g#p_s), ℓ̂(ĥ′*(z)) = φ′(1).   (C.5)

Proof. The proof parallels that of Theorem 5, except that at ĥ′*,

d_{s,t}(g, ĥ′*) = E_{z∼p_s^z}[ℓ̂(ĥ′*(z))] − E_{z∼p_t^z}[(φ*∘ℓ̂)(ĥ′*(z))] = φ′(1) − φ*(φ′(1))   (C.6)

is a constant in terms of g, and thus we have:

d_{s,t}(g, ĥ′*) ≥ d_{s,t}(g*, ĥ′*) ≥ d_{s,t}(g*, ĥ′),   (C.7)

for any g ∈ G and any ĥ′. Also, (ĥ*, g*) is an optimal pair for the source loss, namely:

R_s(ĥ*, g*) ≤ R_s(ĥ, g),   (C.8)

for any ĥ ∈ Ĥ and g ∈ G. Combining equation C.7 and equation C.8, we conclude that (ĥ*, g*, ĥ′*) is a Nash equilibrium of the objective in f-DAL.

D CONNECTION TO PREVIOUS FRAMEWORKS

In this appendix we show that f-DAL encompasses previous frameworks for domain adaptation, including the H∆H-divergence, DANN (Ganin et al., 2016), and MDD (Zhang et al., 2019).

D.2 DANN FORMULATION AND JS DIVERGENCE

The DANN formulation of Ganin & Lempitsky (2015) can also be incorporated into our framework by taking ℓ̂(a, b) = log b and φ*(t) = −log(1 − e^t). Effectively, this formulation ignores the contribution of the source classifier, and experimentally we found it had inferior performance compared to using ℓ̂(a, b) = g(b_{argmax a}).

The original idea of domain-adversarial training was introduced by Ganin et al. (2016), where the authors defined the following surrogate function to measure the discrepancy between the two domains:

d_{s,t} := E_{x_s∼p_s}[log ĥ′(g(x_s))] + E_{x_t∼p_t}[log(1 − ĥ′(g(x_t)))].   (D.1)

In this context, ĥ′ was defined to be a domain classifier, that is, ĥ′ : Z → {0, 1} with 0 and 1 corresponding to the source and target domain pseudo-labels. The following proposition shows that, under the assumption of an optimal domain classifier ĥ′, d_{s,t} achieves the JS divergence (up to a constant shift), which upper bounds D_JS^{h,H}.

Proposition 2. Suppose d_{s,t} follows the form of eq. D.1 and ĥ′ is the optimal domain classifier, which is unconstrained. Then max_{ĥ′} d_{s,t} = D_JS(S‖T) − 2 log 2.

Proof. From the definition, we have:

d_{s,t}(ĥ′, g) = ∫_Z [p_s^z(z) log ĥ′(z) + p_t^z(z) log(1 − ĥ′(z))] dz.   (D.2)

Taking derivatives and solving for the optimal ĥ′*(z), we get ĥ′*(z) = p_s^z(z)/(p_s^z(z) + p_t^z(z)). Plugging ĥ′*(z) into equation D.1, rearranging, and using the definition of the Jensen-Shannon (JS) divergence, we get the desired result. It is worth noting that the additional negative constant −2 log 2 does not affect the optimization.
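Proposition 2 can be checked numerically on discrete feature distributions. Note that D_JS here denotes the f-divergence generated by the JS entry of Table 4, which equals twice the usual (1/2, 1/2)-weighted Jensen-Shannon divergence; the toy densities below are ours:

```python
import numpy as np

p_s = np.array([0.2, 0.5, 0.3])   # toy source feature density
p_t = np.array([0.4, 0.4, 0.2])   # toy target feature density

# Optimal domain classifier from the proof of Proposition 2.
h_opt = p_s / (p_s + p_t)

# DANN surrogate d_{s,t} (eq. D.2) evaluated at the optimal classifier.
d_st = np.sum(p_s * np.log(h_opt)) + np.sum(p_t * np.log(1.0 - h_opt))

# JS divergence induced by phi(x) = -(x+1) log((1+x)/2) + x log x (Table 4),
# which works out to KL(p_s || m) + KL(p_t || m) with m the mid-point density.
m = 0.5 * (p_s + p_t)
kl = lambda a, b: np.sum(a * np.log(a / b))
d_js = kl(p_s, m) + kl(p_t, m)

assert np.isclose(d_st, d_js - 2.0 * np.log(2.0))   # max d_{s,t} = D_JS - 2 log 2
```

Since D_JS ≤ 2 log 2, the surrogate is always nonpositive and vanishes (up to the constant) exactly when the two feature densities coincide.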

D.3 MDD FORMULATION AND γ-WEIGHTED JS DIVERGENCE

Now let us demonstrate how our f-DAL framework naturally incorporates MDD. Suppose φ*(t) = −(1/γ) log(1 − e^t) and ℓ̂(ĥ(z), ĥ′(z)) = log ĥ′(z)_{argmax ĥ(z)}. We retrieve the following result from Zhang et al. (2019):

Proposition 3 (Zhang et al. (2019)). Suppose d_{s,t} takes the form of MDD, i.e.,

γ d_{s,t} = γ E_{z∼p_s^z}[log ĥ′(z)_{argmax ĥ(z)}] + E_{z∼p_t^z}[log(1 − ĥ′(z)_{argmax ĥ(z)})].

(D.3)

With an unconstrained auxiliary function class, the optimal d_{s,t} satisfies:

max_{ĥ′} γ d_{s,t} = (γ + 1) JS_γ(p_s^z‖p_t^z) + γ log γ − (γ + 1) log(γ + 1),   (D.4)

where JS_γ(p_s^z‖p_t^z) is the γ-weighted Jensen-Shannon divergence (Huszár, 2015; Nowozin et al., 2016). We remark that when γ = 1, JS_γ(p_s^z‖p_t^z) is the original Jensen-Shannon divergence. One should also note that the additional negative constant γ log γ − (γ + 1) log(γ + 1), which accounts for the negativity of MDD, does not affect the optimization. The conjugate φ*(t) = −(1/γ) log(1 − e^t) can be obtained by rescaling the φ* of the usual JS divergence (see Table 4). In general, we can rescale φ* for any f-divergence using the following lemma:

Lemma 4 (Boyd & Vandenberghe (2004)). For any λ > 0, the Fenchel conjugate of λφ is (λφ)*(t) = λφ*(t/λ), with dom(λφ)* = λ dom φ*.
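Lemma 4 is easy to verify numerically by computing the conjugate as a supremum over a grid. A small sketch for the Pearson χ² generator (the grid range and test points are arbitrary choices of ours):

```python
import numpy as np

phi      = lambda x: (x - 1.0) ** 2      # Pearson chi^2 generator
phi_star = lambda t: t ** 2 / 4.0 + t    # its closed-form conjugate (Table 4)

def conjugate(f, t, xs):
    # Numerical Fenchel conjugate: sup_x (x*t - f(x)) over a grid of x values.
    return np.max(xs * t - f(xs))

xs = np.linspace(-50.0, 50.0, 200_001)
for lam in (0.5, 2.0, 3.0):
    for t in (-1.0, 0.0, 1.5):
        lhs = conjugate(lambda x: lam * phi(x), t, xs)
        rhs = lam * phi_star(t / lam)    # Lemma 4: (lam*phi)*(t) = lam*phi*(t/lam)
        assert abs(lhs - rhs) < 1e-4
```

The maximizer x = 1 + t/(2λ) lies well inside the grid for the chosen test points, so the grid supremum matches the closed form up to the grid resolution.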



Also referred to as the covariate-shift assumption (Shimodaira, 2000): the source and target domains differ only in their marginals over the input space. Indeed, under these assumptions, d_{s,t} can be seen as an upper bound on the D_φ^{h,H} discrepancy.



Theorem 2 (generalization bound). Suppose ℓ : Y × Y → [0, 1] ⊂ dom φ* and that ℓ(a, b) ≤ ℓ(a, c) + ℓ(c, b) for any a, b, c ∈ Y. Let h* be the ideal joint hypothesis and denote λ* := R_S(h*) + R_T(h*). We have:

R_T(h) ≤ R_S(h) + D_φ^{h,H}(P_s‖P_t) + λ*.   (3.6)

The three terms in this upper bound share similarity with the bounds proposed by Ben-David et al. (2010a) and, more recently, by Zhang et al. (2019). The main difference lies in the discrepancy used to compare the two marginal distributions. Ben-David et al. (2010a) use the H∆H-divergence (a reduction of the TV), and Zhang et al. (2019) use the MDD. In our case, we use a reduction of a lower-bound estimator of a variational characterization of general f-divergences. This generalizes the TV (and thus Ben-David et al. (2010a)).

Figure 2: f-DAL framework. We interpret h : X → Y as the composition of two networks h = ĥ∘g, where g : X → Z and ĥ is a classifier that operates in a representation space Z. Inspired by the theory, we let ĥ′ be another network with the same topology as ĥ. This can intuitively be interpreted as a per-category domain classifier. Our framework differs from domain-adversarial frameworks that follow from Ganin et al. (2016), since those use a global domain classifier, or discriminator.

Figure 3: Comparison of GDA vs AExG in a toy task with JS as divergence. G is the class of quadratic functions and Ĥ is linear. AExG can accelerate the convergence to the optimal solution. (Appendix E)

Figure 4: Transfer performance of a model trained using f-DAL for different choices of divergence and different transfer tasks on the Office-31 benchmark. The baseline is ResNet-50 w/o f-DAL. We additionally show the performance of DANN (Table 2). When compared with f-DAL (JS), we see a significant performance boost. This is in line with our theory, which suggests the use of a per-category domain classifier over a discriminator.

(Theorem 3.3 of Mohri et al. (2018)). Similarly, by Talagrand's lemma (Lemma 5.7 and Definition 3.2 of Mohri et al. (2018)), we have R_{P_t}(φ*∘ℓ∘H) ≤ L R_{P_t}(ℓ∘H), with φ*∘ℓ∘H := {x ↦ φ*(ℓ(h(x), h′(x))) : h, h′ ∈ H}.

D.1 H∆H-DIVERGENCE

We now show that Theorem 2 generalizes the bound proposed in Ben-David et al. (2010a). Let the pair {φ(x), φ*(t)} = {(1/2)|x − 1|, t} for t ∈ [0, 1], such that D_φ^{h,H} = D_TV^{h,H} and sup_{h∈H} D_TV^{h,H} = D_TV^H = (1/2) d_{H∆H}, with d_{H∆H} defined in Ben-David et al. (2010a) (see also equation A.1). Theorem 2 then gives R_T(h) ≤ R_S(h) + (1/2) d_{H∆H} + λ*, recovering Theorem 2 of Ben-David et al. (2010a).


Table 2: Comparison vs. previous unsupervised domain adaptation approaches on the Office-31 benchmark. Accuracy (%) with average and standard deviation. Impressively, our approach achieves SoTA results without the need for additional techniques (e.g. CDAN) or additional hyperparameters (e.g. MDD/γ-JS).

Figure 5: (left) Relative improvement (%) of AExG vs. GDA (SGD) for different choices of divergence and transfer tasks on the Office-31 benchmark. Overall, we observe gains in performance across all divergences. (right) Transfer curves for Pearson χ² on the task A → W on the Office-31 benchmark (# iterations vs. accuracy). AExG converges faster and also obtains slightly better results, in line with the insights from the theoretical results presented in Sec. 4.1 and Appendix E.

Table 3: Accuracy (%) on Office-Home for unsupervised DA. Impressively, our approach achieves SoTA without additional techniques (e.g. CDAN) or additional hyperparameters (e.g. MDD).

C THE EXISTENCE OF THE STACKELBERG/NASH EQUILIBRIUM IN f-DAL

In this appendix, we formally define the Nash and Stackelberg equilibria and show that they exist in our f-DAL framework under mild assumptions.

E A TOY EXAMPLE FOR THE TRAINING DYNAMICS

Suppose that the source dataset S and target dataset T contain one sample each: S = {(0.5, 1)} and T = {0.55}. Let the feature extractor be a quadratic function and the predictors be linear, i.e.:

g(x) = w₁x² + x,  ĥ(x) = w₂x,  ĥ′(x) = σ(w₃x).   (E.1)

We consider the regression task with the JS divergence used to compare S and T:

min_{w₁,w₂∈ℝ} max_{w₃∈ℝ} E_{x∼P_s}[(f_s(x) − ĥ(g(x)))²] + E_{x∼P_s}[log ĥ′(g(x))] + E_{x∼P_t}[log(1 − ĥ′(g(x)))].

If w₁ is chosen such that g(0.5) = g(0.55), w₂ such that ĥ(g(0.5)) = 1, and w₃* = 0, then the optimal solution (w₁*, w₂*, w₃*) satisfies the assumptions in Theorem 5. We plot the trajectories of GDA and AExG in Figure 3 and show that AExG accelerates convergence to the optimal solution.
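The two conditions above can be solved in closed form, and one can check that at w₃ = 0 the source loss is zero and the discriminator's gradient vanishes because both samples map to the same feature. A quick numerical sketch (the variable names are ours):

```python
import math

# Solve g(0.5) = g(0.55) for w1, then pick w2 so that h(g(0.5)) = 1,
# reproducing the optimum described in the text (with w3* = 0).
w1 = (0.55 - 0.5) / (0.5 ** 2 - 0.55 ** 2)   # from w1*0.25 + 0.5 = w1*0.3025 + 0.55
g  = lambda x: w1 * x ** 2 + x
w2 = 1.0 / g(0.5)                            # makes the regression loss zero

sigma = lambda t: 1.0 / (1.0 + math.exp(-t))

assert abs(g(0.5) - g(0.55)) < 1e-9          # both samples share one feature
assert abs(w2 * g(0.5) - 1.0) < 1e-9         # source prediction matches label 1

# At w3 = 0 the discriminator outputs 1/2 on both domains; its gradient
# d/dw3 [log sigma(w3*z_s) + log(1 - sigma(w3*z_t))] vanishes since z_s = z_t.
z_s, z_t = g(0.5), g(0.55)
grad_w3 = z_s * (1.0 - sigma(0.0)) - z_t * sigma(0.0)
assert abs(grad_w3) < 1e-9

# The adversarial term equals log(1/2) + log(1/2) = -2 log 2, i.e. JS - 2 log 2
# with zero JS divergence between the (identical) pushed-forward distributions.
val = math.log(sigma(0.0)) + math.log(1.0 - sigma(0.0))
assert abs(val + 2.0 * math.log(2.0)) < 1e-12
```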

