GATED DOMAIN UNITS FOR MULTI-SOURCE DOMAIN GENERALIZATION

Anonymous

Abstract

Distribution shift (DS) is a common problem that deteriorates the performance of learning machines. To tackle this problem, we postulate that real-world distributions are composed of elementary distributions that remain invariant across different environments. We call this the invariant elementary distribution (I.E.D.) assumption. The I.E.D. assumption implies an invariant structure in the solution space that enables knowledge transfer to unseen domains. To exploit this property in domain generalization (DG), we developed a modular neural network layer that consists of Gated Domain Units (GDUs). Each GDU learns an embedding of an individual elementary distribution that allows us to encode the domain similarities during training. During inference, the GDUs compute similarities between an observation and each of the corresponding elementary distributions, which are then used to form a weighted ensemble of learning machines. Because our layer is trained with backpropagation, it can naturally be integrated into existing deep learning frameworks. Our evaluation on image, text, graph, and time-series data shows a significant improvement in performance on out-of-training target domains, without domain information or any access to data from the target domains. This finding supports the practicality of the I.E.D. assumption and demonstrates that our GDUs can learn to represent these elementary distributions.

1. INTRODUCTION

A fundamental assumption in machine learning is that training and test data are independently and identically distributed (I.I.D.). This assumption underlies consistency results from statistical learning theory, meaning that the learning machine obtained from empirical risk minimization (ERM) attains the lowest achievable risk as the sample size grows (Vapnik, 1998; Schölkopf, 2019). Unfortunately, a considerable amount of research and real-world applications over the past decades have provided staggering evidence against this assumption (Zhao et al., 2018; 2020; Ren et al., 2019; Taori et al., 2020) (see D'Amour et al. (2020) for case studies). The violation of the I.I.D. assumption is usually caused by a distribution shift (DS) and can result in inconsistent learning machines (Sugiyama & Kawanabe, 2012), implying the loss of performance guarantees of machine learning models in the real world. Therefore, to tackle DS, recent work advocates for domain generalization (DG) (Blanchard et al., 2011; Muandet et al., 2013; Li et al., 2017; 2018b; Zhou et al., 2021a). This generalization to entirely unseen domains is crucial for the robust deployment of models in practice, especially when new, unforeseeable domains emerge after model deployment. However, the most important question that DG seeks to answer is how to identify the right invariance that allows for generalization. The contribution of this work is twofold. First, we advocate that real-world distributions are composed of smaller "units" called invariant elementary distributions that remain invariant across different domains; see Section 2.1. Second, we propose to implement this hypothesis through so-called Gated Domain Units (GDUs). Specifically, we developed a modular neural network layer that consists of GDUs. Each GDU learns an embedding of an individual elementary distribution that allows us to express the domain similarities during training.
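To make the gating mechanism concrete, the following is a minimal NumPy sketch of the idea behind a GDU layer, not the paper's actual implementation: each unit holds a learned domain embedding and an associated learning machine, and predictions are combined with similarity-based convex weights. All names (`GDULayer`, `machines`, the dot-product similarity) are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax producing convex (non-negative, sum-to-one) weights.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class GDULayer:
    """Hypothetical sketch: each GDU holds a domain embedding v_j and a
    downstream learning machine f_j; the layer weights the machines by the
    similarity between the input and each v_j."""
    def __init__(self, domain_embeddings, machines):
        self.V = domain_embeddings   # shape (K, d), one embedding per GDU
        self.machines = machines     # list of K callables f_j

    def forward(self, x):
        sims = self.V @ x            # dot-product similarity to each embedding
        beta = softmax(sims)         # convex combination weights over the K GDUs
        preds = np.array([f(x) for f in self.machines])
        return beta @ preds, beta    # weighted ensemble prediction and weights

rng = np.random.default_rng(0)
V = rng.normal(size=(3, 4))
# Three toy "learning machines", here simple linear predictors.
machines = [lambda x, w=rng.normal(size=4): w @ x for _ in range(3)]
layer = GDULayer(V, machines)
y, beta = layer.forward(rng.normal(size=4))
```

In a trained network, the embeddings and machines would be learned jointly by backpropagation; the sketch only shows the forward gating computation.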
For this purpose, we adopt the theoretical framework of reproducing kernel Hilbert spaces (RKHS) to retrieve a geometrical representation of each distribution in the form of a kernel mean embedding (KME) without information loss (Berlinet & Thomas-Agnan, 2004; Smola et al., 2007; Sriperumbudur et al., 2010; Muandet et al., 2017). This representation accommodates methods based on analytical geometry to measure similarities between distributions. We show that these similarity measures can be learned and utilized to improve the generalization capability of deep learning models to previously unseen domains. The remainder of this paper is organized as follows: Our theoretical framework is laid out in Section 2, with our modular DG layer implementation shown in Section 3. In Section 4, we outline related work. Our experimental evaluations are presented in Section 5. Finally, we discuss potential limitations of our approach and future work in Section 6.
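The geometric view of distributions via KMEs can be illustrated with a small sketch: the RKHS inner product of two empirical mean embeddings reduces to the mean of pairwise kernel evaluations, and is larger for sample sets drawn from similar distributions. The kernel choice (Gaussian RBF) and the parameter `gamma` below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    # Gaussian RBF kernel k(a, b) = exp(-gamma * ||a - b||^2).
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kme_inner(X, Z, gamma=0.5):
    # Inner product <mu_X, mu_Z> of empirical kernel mean embeddings,
    # which equals the average of all pairwise kernel evaluations.
    return np.mean([[rbf(x, z, gamma) for z in Z] for x in X])

rng = np.random.default_rng(1)
A = rng.normal(0.0, 1.0, size=(50, 2))   # samples from one "domain"
B = rng.normal(0.0, 1.0, size=(50, 2))   # same underlying distribution
C = rng.normal(4.0, 1.0, size=(50, 2))   # shifted distribution

sim_same = kme_inner(A, B)
sim_diff = kme_inner(A, C)
# Embeddings of alike distributions yield a larger inner product.
```

This is the kind of similarity the GDUs learn and exploit, although in the model the embeddings live in a learned feature space rather than the raw input space.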

2. DOMAIN GENERALIZATION WITH INVARIANT ELEMENTARY DISTRIBUTIONS

We assume a mixture component shift for the multi-source DG setting. This shift refers to the most common DS, in which the data is made up of different sources, each with its own characteristics, whose proportions vary between the training and test scenarios (Quinonero-Candela et al., 2022). Our work thus differs in its assumption from related work in DG, in which the central assumption is the covariate shift (i.e., the conditional distribution of the source and test data stays the same) (David et al., 2010). In the following, let $\mathcal{X}$ and $\mathcal{Y}$ be the input and output space, with a joint distribution $P$. We are given a set of $D$ labeled source datasets $\{\mathcal{D}^s_i\}_{i=1}^{D}$ with $\mathcal{D}^s_i \subseteq \mathcal{X} \times \mathcal{Y}$. Each of the source datasets is assumed to be generated I.I.D. by a joint distribution $P^s_i$ with support on $\mathcal{X} \times \mathcal{Y}$, henceforth denoted a domain. The set of probability measures with support on $\mathcal{X} \times \mathcal{Y}$ is denoted by $\mathcal{P}$. The multi-source dataset $\mathcal{D}^s$ comprises the merged individual source datasets $\{\mathcal{D}^s_j\}_{j=1}^{D}$. We aim to minimize the empirical risk; see Section 3.3 for details. Important notation is summarized in Table 1. In contrast to prior work, we generalize this problem description: We express the distribution of each domain as a convex combination of $K$ elementary distributions $\{P_j\}_{j=1}^{K} \subset \mathcal{P}$, meaning that $P^s = \sum_{j=1}^{K} \alpha_j P_j$ where $\alpha \in \Delta^K$. Our main assumption is that these elementary distributions remain invariant across the domains. The advantage is that we can find an invariant subspace at a more elementary level, as opposed to considering the source domains themselves as a basis for all unseen domains. Figure 1 illustrates this idea.
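A toy numerical sketch of the mixture component shift may help: two domains share the same fixed elementary distributions and differ only in their simplex weights $\alpha$, so any domain statistic such as the mean is a convex combination of elementary statistics. The Gaussian components and weights below are illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two fixed elementary distributions (here: 1-d Gaussians), assumed invariant
# across domains under the I.E.D. assumption.
elementary = [lambda n: rng.normal(-2.0, 0.5, n),
              lambda n: rng.normal(3.0, 0.5, n)]
means = np.array([-2.0, 3.0])

def sample_domain(alpha, n=100_000):
    # Draw from P^s = sum_j alpha_j P_j: choose an elementary component per
    # sample with probability alpha_j, then sample from that component.
    choice = rng.choice(len(alpha), size=n, p=alpha)
    return np.concatenate([elementary[j](np.sum(choice == j))
                           for j in range(len(alpha))])

# Two domains differ only in their mixture weights on the simplex.
alpha_1 = np.array([0.8, 0.2])
alpha_2 = np.array([0.3, 0.7])
m1 = sample_domain(alpha_1).mean()   # approx alpha_1 @ means = -1.0
m2 = sample_domain(alpha_2).mean()   # approx alpha_2 @ means =  1.5
```

The two domains look quite different in aggregate, yet are built from the same invariant pieces, which is exactly the structure the GDUs aim to recover.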

2.1. INVARIANT ELEMENTARY DISTRIBUTIONS

Theoretically speaking, the I.E.D. assumption is appealing because it implies an invariant structure in the solution space, as shown in the following lemma. The proof is given in Appendix A.1.

Lemma 1. Let $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ be a non-negative loss function, $\mathcal{F}$ a hypothesis space of functions $f : \mathcal{X} \to \mathcal{Y}$, and $P^s(X, Y)$ a data distribution. Suppose that the I.E.D. assumption holds, i.e., there exist $K$ elementary distributions $P_1, \ldots, P_K$ such that any data distribution can be expressed as $P^s(X, Y) = \sum_{j=1}^{K} \alpha_j P_j(X, Y)$ for some $\alpha \in \Delta^K$. Then, the corresponding Bayes predictor $f^* \in \arg\min_{f \in \mathcal{F}} \mathbb{E}_{(X,Y) \sim P^s}[L(Y, f(X))]$ is Pareto-optimal with respect to a vector of elementary risk functionals $(\mathcal{R}_1, \ldots, \mathcal{R}_K) : \mathcal{F} \to \mathbb{R}_+^K$, where $\mathcal{R}_j(f) := \mathbb{E}_{(X,Y) \sim P_j}[L(Y, f(X))]$.

Lemma 1 implies that, under the I.E.D. assumption, Bayes predictors must belong to a subspace of $\mathcal{F}$ called the Pareto set $\mathcal{F}_{\text{Pareto}} \subset \mathcal{F}$, which consists of Pareto-optimal models. A model $f$ is said to be Pareto-optimal if there exists no $g \in \mathcal{F}$ such that $\mathcal{R}_j(g) \leq \mathcal{R}_j(f)$ for all $j \in \{1, \ldots, K\}$ with $\mathcal{R}_j(g) < \mathcal{R}_j(f)$ for some $j$; see, e.g., Sener & Koltun (2018, Definition 1). In other words, the I.E.D. assumption allows us to translate the invariance property of data distributions to the solution space. Since Bayes predictors of all future test domains must lie within the Pareto set, which is a
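The Pareto-optimality notion used in the lemma can be checked numerically on a finite set of candidate models; the following is a small illustrative sketch with made-up elementary risk vectors, not part of the paper's method.

```python
import numpy as np

def dominates(g, f):
    # g dominates f if g is no worse on every elementary risk R_j and
    # strictly better on at least one.
    return bool(np.all(g <= f) and np.any(g < f))

def is_pareto_optimal(f, candidates):
    # f is Pareto-optimal w.r.t. (R_1, ..., R_K) if no candidate dominates it.
    return not any(dominates(g, f) for g in candidates)

# Rows: hypothetical risk vectors (R_1(f), R_2(f)) of four candidate models.
risks = np.array([
    [0.1, 0.9],   # best on R_1
    [0.9, 0.1],   # best on R_2
    [0.5, 0.5],   # a trade-off; still Pareto-optimal
    [0.6, 0.6],   # dominated by the trade-off model
])
pareto = [is_pareto_optimal(f, risks) for f in risks]
# pareto == [True, True, True, False]
```

Note that minimizing the mixed risk $\sum_j \alpha_j \mathcal{R}_j$ for any simplex weights $\alpha$ always lands in this Pareto set, which is the content of the lemma.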



[Table 1 — Important notation: $\beta_{ij}$, the coefficient for sample $x_i$ and $\mu_{\mathcal{V}_j}$]

Similar to Mansour et al. (2009; 2012) and Hoffman et al. (2018a), we assume that the distribution of the source dataset can be described as a convex combination $P^s = \sum_{j=1}^{D} \alpha^s_j P^s_j$, where $\alpha^s = (\alpha^s_1, \ldots, \alpha^s_D)$ is an element of the probability simplex, i.e., $\alpha^s \in \Delta^D := \{\alpha \in \mathbb{R}^D \mid \alpha_j \geq 0 \wedge \sum_{j=1}^{D} \alpha_j = 1\}$. In other words, $\alpha^s_j$ quantifies the contribution of each individual source domain to the combined source domain.

