LEARNING FROM CONFLICTING DATA WITH HIDDEN CONTEXTS

Abstract

Classical supervised learning assumes a stable relation between inputs and outputs. However, this assumption is often invalid in real-world scenarios where the input-output relation in the data depends on some hidden contexts. We formulate a more general setting where the training data is sampled from multiple unobservable domains, and different domains may possess semantically distinct input-output maps. Training data exhibits inherent conflict in this setting, rendering vanilla empirical risk minimization problematic. We propose to tackle this problem by introducing an allocation function that learns to allocate conflicting data to different prediction models, resulting in an algorithm that we term LEAF. We draw an intriguing connection between our approach and a variant of the Expectation-Maximization algorithm. We provide theoretical justifications for LEAF on its identifiability, learnability, and generalization error. Empirical results demonstrate the efficacy and potential applications of LEAF in a range of regression and classification tasks on both synthetic data and real-world datasets.

1. INTRODUCTION

Classical supervised learning assumes a stable relation between inputs and outputs (Vapnik, 1999). On the other hand, real-world objects often have a variety of semantic properties, and which one is of interest depends on the context. Such contexts are implicitly embedded in the datasets during the labeling process in conventional supervised learning. However, contexts may not be accessible in many open-ended real-world scenarios due to their innate unobservability (Harries et al., 1998) or privacy and security constraints (McMahan et al., 2017; Mitchell et al., 2021). With hidden contexts, the same input may correspond to distinct outputs due to the change of the context, violating the stable relation assumption. For example, when collecting uncurated image-label pairs from the Internet for a general recognition task, an image of a red sphere may be sometimes labeled as "red" and sometimes "sphere". Similar cases are also identified across datasets (Taori et al., 2020): an image of a scuba diver can be labeled as "scuba diver" in ImageNet (Deng et al., 2009) while being labeled as "person" in Youtube-BB (Real et al., 2017). Contexts can also reflect human preferences whose identities are not available in privacy-sensitive applications, whilst influencing predictions (Kairouz et al., 2021). When the stable relation assumption is invalid, the common practice of performing global Empirical Risk Minimization (ERM) is problematic since the data exhibit inherent conflict: the outputs conditioned on the same input can be semantically distinct across different examples, thus mutually interfering during training. Existing machine learning models usually rely on human efforts to filter or re-label the data to eliminate this conflict (Krause et al., 2016; Vo et al., 2017; Taori et al., 2020).
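To make the failure mode of global ERM concrete, the following is a minimal synthetic sketch (ours, not the paper's algorithm; all names hypothetical). The same input x is labeled under two hidden contexts with conflicting maps y = 2x and y = -2x: a single least-squares fit averages the two maps and collapses toward zero, while two local models trained with a simple hard allocation step recover both relations.

```python
import numpy as np

# Synthetic conflicting data: the context deciding which map applies is hidden.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=400)
ctx = rng.integers(0, 2, size=400)                  # hidden context (unobserved)
y = np.where(ctx == 0, 2 * x, -2 * x) + rng.normal(0, 0.05, size=400)

# Global ERM: one least-squares line over all data averages the two
# conflicting maps, so its slope collapses toward zero and the error is large.
slope_global = (x @ y) / (x @ x)
mse_global = np.mean((y - slope_global * x) ** 2)

# Hard allocation between two local linear models: each example is assigned
# to the model with the smaller loss, then each model refits on its share.
slopes = np.array([1.0, -1.0])
for _ in range(20):
    errs = (y[:, None] - x[:, None] * slopes[None, :]) ** 2
    assign = errs.argmin(axis=1)                    # allocation step
    for k in range(2):                              # refit step
        m = assign == k
        if m.any():
            slopes[k] = (x[m] @ y[m]) / (x[m] @ x[m])

errs = (y[:, None] - x[:, None] * slopes[None, :]) ** 2
mse_local = np.mean(errs.min(axis=1))

print(slope_global, mse_global)    # slope near 0, large error
print(np.sort(slopes), mse_local)  # slopes near (-2, 2), small error
```

The alternation between an assignment step and a refit step is what gives such schemes their EM-like flavor; LEAF's allocation function plays the role of the assignment step here, but is itself learned.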
These workarounds typically require domain knowledge of the specific problem, and can be ineffective when crucial side-information about the data is hard to define or collect in practice (Hanna et al., 2020). It is then natural to ask: can we empower the learner to automatically eliminate the conflict in the training data without additional human intervention? In this work, we formulate the problem of training on conflicting data by assuming that the data comes from multiple unobservable domains; different domains may have semantically distinct input-output maps, while the map within each domain is stable. In particular, instead of learning a global model, our goal is to learn multiple local models that separately capture mutually conflicting input-output relations in the data. To this end, we propose LEAF (LEarning from conflicting data by Allocation Function), a learning framework that comprises a high-level allocation function and a low-level model set. The allocation function learns to allocate the data among the models to eliminate the conflict,

