CONTEXT-AGNOSTIC LEARNING USING SYNTHETIC DATA

Abstract

We propose a novel setting for learning, where the input domain is the image of a map defined on the product of two sets, one of which completely determines the labels. Given the ability to sample from each set independently, we present an algorithm that learns a classifier over the input domain more efficiently than sampling from the input domain directly. We apply this setting to visual classification tasks, where our approach enables us to train classifiers on datasets that consist entirely of a single example of each class. On several standard benchmarks for real-world image classification, our approach achieves performance competitive with state-of-the-art results from the few-shot learning and domain transfer literature, while using significantly less data.

1. INTRODUCTION

Despite recent advances in deep learning, one central challenge is the large amount of labelled training data required to achieve state-of-the-art performance. Procuring such volumes of high-quality, reliably annotated data can be costly or even close to impossible (e.g., obtaining data to train an autonomous navigation system for a lunar probe). Additional hurdles include hidden biases in large datasets (Tommasi et al., 2017) and maliciously perturbed training data (Biggio et al., 2012).

Synthetically generated data has seen growing adoption in response to these problems, since the marginal cost of producing new training data is generally very low, and one has full control over the generation process. This is particularly true for applications with a physical component, such as autonomous navigation (Gaidon et al., 2016) or robotics (Todorov et al., 2012). However, training with purely synthetic data suffers from the so-called "reality gap", whereby good performance on synthetic data does not necessarily yield good performance in the real world (Jakobi et al., 1995). In particular, the difficulty of generating realistic training images scales not just with the objects of interest, but also with the real-world contexts in which the learned model is expected to operate.

This work begins with the simple observation that, for many classification tasks, the label of an input is determined entirely by the object; however, this additional structure is discarded by current synthetic data pipelines. Our goal is to leverage this decomposition to develop more efficient methods for the related problems of generating training data and learning from a synthetic domain. Our contributions are two-fold: first, we formally introduce the setting of context-agnostic learning, where the input space is decomposed into object and context spaces, and the labels are independent of contexts when conditioned on the objects.
Second, we propose an algorithm to efficiently train a classifier in the context-agnostic setting, which relies on the ability to sample from the object and context spaces independently. We apply our methods to train deep neural networks for real-world image classification using only a single synthetic example of each class, obtaining performance comparable to existing methods for domain adaptation and few-shot learning while using substantially less data. Our results show that it is possible to train classifiers, in the absence of any contextual training data, that nonetheless generalize to real-world domains.

2. RELATED WORK

Domain shift refers to the problem that occurs when the training set (source domain) and test set (target domain) are drawn from different distributions. In this setting, a classifier which performs well on the source domain may not generalize well in the target domain. A standard method for addressing this challenge is domain adaptation, which leverages a small amount of data from the target domain to adapt a function that is learned over the source domain (Blitzer et al., 2006). In the context of learning from synthetic data, the domain shift that occurs between synthetic and real-world data is known as the reality gap (Jakobi et al., 1995). State-of-the-art rendering engines, such as those used for video games, can help narrow this gap by generating photorealistic data for training (Dosovitskiy et al., 2017; Johnson-Roberson et al., 2016; Qiu and Yuille, 2016). Another technique is domain randomization, which generates the source domain with more variability than is expected in the target domain (e.g., extreme lighting conditions and camera angles), so as to make real images appear as just another variant (Tobin et al., 2017; Tremblay et al., 2018); in particular, Torres et al. (2019) apply domain randomization to traffic sign detection and find that arbitrary natural images suffice for the task. Another body of work exploits generative adversarial networks (Goodfellow et al., 2014a) to generate synthetic domains (Hoffman et al., 2017; Liu et al., 2017; Shrivastava et al., 2016; Taigman et al., 2016; Tzeng et al., 2017).
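To illustrate the flavor of domain randomization, the following minimal sketch (our own illustration; it does not reproduce any of the cited systems, and the template image and perturbation ranges are arbitrary choices) generates many randomized variants of a single synthetic template by perturbing lighting, sensor noise, and pose:

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical 16x16 synthetic template image with values in [0, 1].
template = np.zeros((16, 16))
template[4:12, 4:12] = 1.0

def randomize(img):
    """Produce one domain-randomized variant of img."""
    out = img * rng.uniform(0.5, 1.5)                # lighting variation
    out = out + rng.normal(0.0, 0.05, img.shape)     # sensor noise
    dx, dy = rng.integers(-2, 3, size=2)             # small translation
    out = np.roll(out, (dx, dy), axis=(0, 1))        # camera jitter
    return np.clip(out, 0.0, 1.0)

# A training batch of randomized variants, all sharing the template's label.
batch = np.stack([randomize(template) for _ in range(32)])
```

The hope, as in the works cited above, is that a model trained on sufficiently varied randomizations treats a real image as just one more variant.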
Finally, several works have explored using synthetic data for natural image text recognition (Gupta et al., 2016; Jaderberg et al., 2014). These works use an approach that is roughly analogous to our baseline models, and test their techniques on the target domain of street signs rather than handwritten characters (as we do).

A different paradigm for the low-data regime is few-shot learning. In contrast to domain adaptation, few-shot learning operates under the assumption that the target and source distributions are the same, but the ability to sample certain classes is limited in the source domain. Early approaches emphasized capturing knowledge in a Bayesian framework (Fe-Fei et al., 2003), which was later formulated as Bayesian program learning (Lake et al., 2015). Another approach, based on metric learning, is to find a nonlinear embedding for objects where closeness in the geometry of the embedding generalizes to unseen classes (Koch, 2015; Snell et al., 2017; Sung et al., 2018; Vinyals et al., 2016). Meta-learning approaches aim to extract higher-level concepts which can be applied to learn new classes from a few examples (Finn et al., 2017; Munkhdalai and Yu, 2017; Nichol et al., 2018; Ravi and Larochelle, 2016). A conceptually related method that leverages synthetic training data is learning how to generate new data from a few examples of unseen classes; in contrast to our work, however, these methods still require a large number of samples to learn the synthesizer (Schwartz et al., 2018; Zhang et al., 2019). Finally, some works combine domain adaptation with few-shot learning to learn under both domain shift and limited samples (Motiian et al., 2017). The main characteristic that differentiates our work from these approaches is that we are interested in learning classifiers that are context-agnostic, i.e., that do not rely on background signals.
As such, while we find our approach is applicable to many of the same tasks as the aforementioned works, our theoretical setting and objectives differ significantly. From a practical perspective, we demonstrate our techniques when the entire training set consists solely of a single synthetic image of each class, though they can certainly be applied when more data is available; we do not, however, expect the reverse to hold for domain adaptation or few-shot learning in our setting. Indeed, we consider this work to be complementary: we are concerned with exploiting the additional structure that is inherent in certain source domains, while the goal of domain adaptation and few-shot learning is to achieve good performance under various downstream domain-shift assumptions.

3. SETTING

The standard supervised learning setting consists of an input space X, an output space Y, and a hypothesis space H of functions mapping X to Y. A domain P_D is a probability distribution over (X, Y). Given a target domain P_T and a loss function ℓ, the goal is to learn a classifier h ∈ H that minimizes the risk, i.e., the expected loss R_{P_T}(h) := E_{P_T}[ℓ(h(x), y)]. The training procedure consists of n samples (x_1, y_1), ..., (x_n, y_n) from a source domain P_S. A standard approach is empirical risk minimization, which takes the classifier that minimizes R_emp(h) = (1/n) Σ_i ℓ(h(x_i), y_i); if P_S is close to P_T, then with enough samples, such a classifier also achieves low risk in the target domain.
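As a concrete toy illustration of empirical risk minimization, the following sketch fits a one-dimensional threshold classifier by minimizing the empirical 0-1 risk over a finite grid of hypotheses. The data distribution, hypothesis class, and grid are our own illustrative choices, not part of the paper's setup:

```python
import numpy as np

def empirical_risk(h, xs, ys, loss):
    """R_emp(h) = (1/n) * sum_i loss(h(x_i), y_i)."""
    return np.mean([loss(h(x), y) for x, y in zip(xs, ys)])

# Hypothesis space H: threshold classifiers h_t(x) = 1[x > t].
def make_threshold(t):
    return lambda x: int(x > t)

# 0-1 loss.
zero_one = lambda y_hat, y: int(y_hat != y)

# Source-domain samples: x uniform on [0, 1], label 1 iff x > 0.5.
rng = np.random.default_rng(0)
xs = rng.uniform(0, 1, 200)
ys = (xs > 0.5).astype(int)

# Empirical risk minimization over a grid of 101 thresholds.
thresholds = np.linspace(0, 1, 101)
risks = [empirical_risk(make_threshold(t), xs, ys, zero_one)
         for t in thresholds]
best_t = thresholds[int(np.argmin(risks))]
```

Since the labels here are a deterministic function of x, the minimizer attains zero empirical risk; with noisy labels or a mismatch between P_S and P_T, the gap between empirical and target risk becomes the central object of study.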

3.1. CONTEXT-AGNOSTIC LEARNING

In general, we can frame the goal of classification as learning to extract reliable signals for the label y from points x ∈ X . This task is often complicated by the presence of noise or other spurious signals.
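To make the object/context decomposition concrete, the following toy sketch builds training pairs by sampling an object and a context independently and composing them, so that the label depends only on the object factor. This is our own illustration: the class templates, the context distribution, and the composition map are hypothetical stand-ins for an actual rendering pipeline, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Object space: one fixed 8x8 template per class (a single synthetic
# example of each class). Class k is encoded as a bright row k.
def make_object(label, size=8):
    img = np.zeros((size, size))
    img[label % size, :] = 1.0
    return img

# Context space: random backgrounds, sampled independently of the label.
def sample_context(size=8):
    return rng.uniform(0, 0.5, (size, size))

# Composition map from (object, context) pairs into the input space.
# The label depends only on the object, so y is independent of the
# context once the object is fixed.
def compose(obj, ctx):
    return np.maximum(obj, ctx)

def sample_training_pair(num_classes=4):
    y = int(rng.integers(num_classes))
    x = compose(make_object(y), sample_context())
    return x, y

x, y = sample_training_pair()
```

A classifier trained on such pairs is pushed to extract the label signal from the object factor alone, since the context carries no information about y.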


