LEARNING FROM CONFLICTING DATA WITH HIDDEN CONTEXTS

Abstract

Classical supervised learning assumes a stable relation between inputs and outputs. However, this assumption is often invalid in real-world scenarios where the input-output relation in the data depends on some hidden contexts. We formulate a more general setting where the training data is sampled from multiple unobservable domains, while different domains may possess semantically distinct input-output maps. Training data exhibits inherent conflict in this setting, rendering vanilla empirical risk minimization problematic. We propose to tackle this problem by introducing an allocation function that learns to allocate conflicting data to different prediction models, resulting in an algorithm that we term LEAF. We draw an intriguing connection between our approach and a variant of the Expectation-Maximization algorithm. We provide theoretical justifications for LEAF regarding its identifiability, learnability, and generalization error. Empirical results demonstrate the efficacy and potential applications of LEAF in a range of regression and classification tasks on both synthetic data and real-world datasets.

1. INTRODUCTION

Classical supervised learning assumes a stable relation between inputs and outputs (Vapnik, 1999). On the other hand, real-world objects often have a variety of semantic properties, and which one is of interest depends on the context. Such contexts are implicitly embedded in the datasets during the labeling process in conventional supervised learning. However, contexts may not be accessible in many open-ended real-world scenarios due to their innate unobservability (Harries et al., 1998) or privacy and security constraints (McMahan et al., 2017; Mitchell et al., 2021). With hidden contexts, the same input may correspond to distinct outputs due to the change of the context, violating the stable relation assumption. For example, when collecting uncurated image-label pairs from the Internet for a general recognition task, an image of a red sphere may be sometimes labeled as "red" and sometimes "sphere". Similar cases are also identified across datasets (Taori et al., 2020): an image of a scuba diver can be labeled as "scuba diver" in ImageNet (Deng et al., 2009) while being labeled as "person" in Youtube-BB (Real et al., 2017). Contexts can also reflect human preferences whose identities are not available in privacy-sensitive applications, whilst influencing predictions (Kairouz et al., 2021). It is easy to see that when the stable relation assumption is invalid, the common practice of performing global Empirical Risk Minimization (ERM) is problematic since the data exhibit inherent conflict: the outputs conditioned on the same input can be semantically distinct across different examples, thus mutually interfering during training. Existing machine learning models usually rely on human efforts to filter or re-label the data to eliminate this conflict (Krause et al., 2016; Vo et al., 2017; Taori et al., 2020).
These workarounds typically require domain knowledge of the specific problem, and can be ineffective when crucial side-information about the data is hard to define or collect in practice (Hanna et al., 2020). It is then natural to ask: can we empower the learner to automatically eliminate the conflict in the training data without additional human intervention? In this work, we formulate the problem of training on conflicting data by assuming that the data comes from multiple unobservable domains; different domains may have semantically distinct input-output maps, while the map in each domain is stable. In particular, instead of learning a global model, our goal is to learn multiple local models that separately capture mutually conflicting input-output relations in the data. To this end, we propose LEAF (LEarning from conflicting data by Allocation Function), a learning framework that comprises a high-level allocation function and a low-level model set. The allocation function learns to allocate the data among the models to eliminate the conflict, while the models learn local predictions. Our key insight is that the conflict itself can be exploited to learn a good allocation strategy: if the allocation function assigns conflicting examples to the same model, then this hinders the global minimization of training error, which can in turn be used to improve the allocations. Using a probabilistic reinterpretation of our framework, we establish a connection between LEAF and a variant of Expectation-Maximization (EM) (Dempster et al., 1977), showing that an analytic form of the allocation function can be explicitly derived. We provide theoretical justifications for LEAF by analyzing its identifiability, learnability, and generalization error.
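The alternating allocate-then-minimize idea above can be sketched as a hard-allocation, EM-style loop. The toy example below is our own minimal illustration, not the paper's exact algorithm: it uses K = 2 constant predictors (a deliberately trivial hypothesis class) and two hidden domains that label the same inputs as +1 and -1, showing how routing each example to its best-fitting model resolves the conflict.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hidden domains over the same input support with conflicting labels:
# one domain labels every x as +1, the other labels the same x as -1.
X = rng.uniform(-1, 1, size=200)
Y = np.concatenate([np.ones(100), -np.ones(100)])

# Model set: K = 2 constant predictors; initial parameters break symmetry.
theta = np.array([0.2, -0.2])

for _ in range(10):
    # Allocation step (E-like): route each example to the model
    # whose prediction incurs the smallest squared loss.
    losses = (theta[:, None] - Y[None, :]) ** 2  # shape (K, n)
    alloc = losses.argmin(axis=0)
    # Minimization step (M-like): each model refits on its allocated subset,
    # which is internally conflict-free.
    for k in range(2):
        if np.any(alloc == k):
            theta[k] = Y[alloc == k].mean()

print(sorted(theta.tolist()))  # → [-1.0, 1.0]
```

The two models recover the two conflicting target functions; a single global ERM model would instead average the labels and predict 0 for every input.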
Our theoretical results generalize the analysis of classical single-domain supervised learning with no conflict (Vapnik, 1999), revealing the conditions under which LEAF can provably recover all conflicting concepts and providing a generalization error upper bound. Empirically, we conduct extensive experiments that span regression and classification tasks on both synthetic data and real-world datasets. Experimental results show that LEAF effectively resolves the conflict in the training data, significantly outperforms task-specific approaches in three settings reflecting practical scenarios with different label space structures, and achieves results competitive with fully supervised learning methods that use domain index or label set oracles. In summary, our contributions are three-fold:
• We formulate a supervised learning setting of learning from conflicting data, which captures the key difficulty in training on data with multiple hidden contexts.
• We introduce a theoretically grounded learning framework termed LEAF that automatically eliminates the conflict in data, and establish its connection with a variant of the EM algorithm.
• We empirically evaluate LEAF on a wide range of tasks, in which it effectively resolves the conflict in data without additional human intervention, outperforms task-specific approaches, and achieves competitive results even compared with fully supervised learning methods.

2. PROBLEM FORMULATION AND THE LEAF FRAMEWORK

In this section, we formulate our learning problem and introduce the LEAF framework. We adhere to the conventional supervised learning terminology: let X ⊆ R^L be an input space, Y an output space, H a hypothesis space where each hypothesis (model) is a function from X to Y, and ℓ : Y × Y → [0, 1] a non-negative and bounded loss function. We denote by λ(S) the Lebesgue measure of S for any S ⊆ X. We use [k] = {1, 2, …, k} for positive integers k and use 1 as the indicator function. For a probability density P_X over X, we denote by supp(P_X) := {x ∈ X | P_X(x) > 0} its support. For a set S, we use |S| to denote its cardinality. We use superscripts to denote sampling indices (e.g., d^i and x^{ij}) and subscripts to denote element indices in a given ordered set (e.g., d_i).

2.1. PROBLEM FORMULATION

We begin by formalizing the notion of conflict. Following the seminal works in domain adaptation (Ben-David et al., 2010a; Mansour et al., 2009; 2008), we define a domain d as a pair ⟨P_X, c⟩ consisting of an input distribution P_X over X and a target function c : X → Y.¹ We assume that the training data is sampled from a domain set D = {d_i}_{i=1}^N = {⟨P_i, c_i⟩}_{i=1}^N containing N (agnostic to the learner) domains, and the goal of the learner is to learn all target functions {c_i}_{i=1}^N. We focus on the setting where there exists conflict between domains, which we make formal by Definition 1.

Definition 1 (Conflicting domains). For two domains ⟨P, c⟩ and ⟨P′, c′⟩, let X_int = supp(P) ∩ supp(P′) be the intersection of the supports of their input distributions. We say that the two domains are conflicting if there exists S ⊆ X_int such that λ(S) > 0 and c(x) ̸= c′(x) for every x ∈ S.

We assume that there exist conflicting domains in the domain set D. Since conflicting domains involve different target functions, one needs multiple models to capture those different input-output maps in the training data. We thus equip the learner with a model set H = {h_i}_{i=1}^K parameterized by Θ = {θ_i}_{i=1}^K with cardinality K > 1, in contrast to conventional supervised learning that only trains a global model. Intuitively, K should be sufficiently large to fully resolve the conflict in the training data; this depends not only on the number of domains, but also on how "severe" the conflict is among all domains. We formalize this notion by introducing the conflict rank of the domain set.

Definition 2 (Conflict rank). For a domain set D = {⟨P_i, c_i⟩}_{i=1}^N, its conflict rank is defined as

R(D) := max_{S ⊆ X, λ(S) > 0} min_{x ∈ S} |{c(x) | ⟨P, c⟩ ∈ D, x ∈ supp(P)}|.   (1)

¹For brevity, we omit the subscript X in P_X when it is clear from the context.
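To build intuition for Definition 2, the conflict rank can be approximated on a finite grid: at each input x, count the distinct target values c(x) among the domains whose support contains x, and take the largest count attained (any grid point stands in for a positive-measure set here). The sketch below is our own illustration with hypothetical domains, not part of the paper.

```python
import numpy as np

# Illustrative domain set: each domain is a (support predicate, target function)
# pair; these three domains are our own hypothetical example.
domains = [
    (lambda x: x >= -1.0, lambda x: np.sign(x)),   # c1(x) = sign(x) on [-1, inf)
    (lambda x: x <= 1.0,  lambda x: -np.sign(x)),  # c2(x) = -sign(x) on (-inf, 1]
    (lambda x: x >= 0.5,  lambda x: np.sign(x)),   # c3 agrees with c1 on [0.5, inf)
]

grid = np.linspace(-0.9, 0.9, 181)

def conflict_rank(domains, grid):
    # For each grid point x, count the distinct target values c(x) over the
    # domains whose support contains x; the conflict rank is the largest
    # such count over the grid.
    counts = []
    for x in grid:
        values = {float(c(x)) for supp, c in domains if supp(x)}
        counts.append(len(values))
    return max(counts)

print(conflict_rank(domains, grid))  # → 2
```

Although there are N = 3 domains, R(D) = 2: c1 and c3 agree wherever both are defined, so at most two mutually conflicting targets meet at any input, and K = 2 models would suffice to resolve the conflict.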

