CONTEXT-AGNOSTIC LEARNING USING SYNTHETIC DATA

Abstract

We propose a novel setting for learning, where the input domain is the image of a map defined on the product of two sets, one of which completely determines the labels. Given the ability to sample from each set independently, we present an algorithm that learns a classifier over the input domain more efficiently than sampling from the input domain directly. We apply this setting to visual classification tasks, where our approach enables us to train classifiers on datasets that consist entirely of a single example of each class. On several standard benchmarks for real-world image classification, our approach achieves performance competitive with state-of-the-art results from the few-shot learning and domain transfer literature, while using significantly less data.

1. INTRODUCTION

Despite recent advances in deep learning, one central challenge is the large amount of labelled training data required to achieve state-of-the-art performance. Procuring such volumes of high quality, reliably annotated data can be costly or even close to impossible (e.g., obtaining data to train an autonomous navigation system for a lunar probe). Additional hurdles include hidden biases in large datasets (Tommasi et al., 2017) and maliciously perturbed training data (Biggio et al., 2012) . Synthetically generated data has seen growing adoption in response to these problems, since the marginal cost of producing new training data is generally very low, and one has full control over the generation process. This is particularly true for applications with a physical component, such as autonomous navigation (Gaidon et al., 2016) or robotics (Todorov et al., 2012) . However, training with purely synthetic data suffers from the so-called "reality gap", whereby good performance on synthetic data does not necessarily yield good performance in the real world (Jakobi et al., 1995) . In particular, the difficulty of generating realistic training images scales not just with the objects of interest, but also the real-world contexts in which the learned model is expected to operate. This work begins with the simple observation that, for many classification tasks, the label of an input is determined entirely by the object; however, this additional structure is discarded by current synthetic data pipelines. Our goal is to leverage this decomposition to develop more efficient methods for the related problems of generating training data and learning from a synthetic domain. Our contributions are two-fold: first, we formally introduce the setting of context-agnostic learning, where the input space is decomposed into object and context spaces, and the labels are independent of contexts when conditioned on the objects. 
Second, we propose an algorithm to efficiently train a classifier in the context-agnostic setting, which relies on the ability to sample from the object and context spaces independently. We apply our methods to train deep neural networks for real-world image classification using only a single synthetic example of each class, obtaining performance comparable to existing methods for domain adaptation and few-shot learning while using substantially less data. Our results show that it is possible to train classifiers in the absence of any contextual training data that nonetheless generalize to real-world domains.

2. RELATED WORK

Domain shift refers to the problem that occurs when the training set (source domain) and test set (target domain) are drawn from different distributions. In this setting, a classifier which performs well on the source domain may not generalize well in the target domain. A standard method for addressing this challenge is domain adaptation, which leverages a small amount of data from the target domain to adapt a function that is learned over the source domain (Blitzer et al., 2006). In the context of learning from synthetic data, the domain shift that occurs between synthetic and real-world data is known as the reality gap (Jakobi et al., 1995). State-of-the-art rendering engines, such as those used for video games, can help narrow this gap by generating photorealistic data for training (Dosovitskiy et al., 2017; Johnson-Roberson et al., 2016; Qiu and Yuille, 2016). Another technique is domain randomization, which generates the source domain with more variability than is expected in the target domain (e.g., extreme lighting conditions and camera angles), so as to make real images appear as just another variant (Tobin et al., 2017; Tremblay et al., 2018); in particular, Torres et al. (2019) apply domain randomization to traffic sign detection and find that arbitrary natural images suffice for the task. Another body of work exploits generative adversarial networks (Goodfellow et al., 2014a) to generate synthetic domains (Hoffman et al., 2017; Liu et al., 2017; Shrivastava et al., 2016; Taigman et al., 2016; Tzeng et al., 2017).
Finally, several works have explored using synthetic data for natural image text recognition (Gupta et al., 2016; Jaderberg et al., 2014). These works use an approach that is roughly analogous to our baseline models, and test their techniques on the target domain of street signs rather than handwritten characters (as we do). A different paradigm for the low-data regime is few-shot learning. In contrast to domain adaptation, few-shot learning operates under the assumption that the target and source distributions are the same, but the ability to sample certain classes is limited in the source domain. Early approaches emphasized capturing knowledge in a Bayesian framework (Fe-Fei et al., 2003), which was later formulated as Bayesian program learning (Lake et al., 2015). Another approach, based on metric learning, is to find a nonlinear embedding for objects where closeness in the geometry of the embedding generalizes to unseen classes (Koch, 2015; Snell et al., 2017; Sung et al., 2018; Vinyals et al., 2016). Meta-learning approaches aim to extract higher-level concepts which can be applied to learn new classes from a few examples (Finn et al., 2017; Munkhdalai and Yu, 2017; Nichol et al., 2018; Ravi and Larochelle, 2016). A conceptually related method that leverages synthetic training data is learning how to generate new data from a few examples of unseen classes; in contrast to our work, however, these methods still require a large number of samples to learn the synthesizer (Schwartz et al., 2018; Zhang et al., 2019). Finally, some works combine domain adaptation with few-shot learning to learn under domain shift and limited samples (Motiian et al., 2017). The main characteristic that differentiates our work from these approaches is that we are interested in learning classifiers that are context-agnostic, i.e., do not rely on background signals.
As such, while we find our approach is applicable to many of the same tasks as the aforementioned works, our theoretical setting and objectives differ significantly. From a practical perspective, we demonstrate our techniques when the entire training set consists solely of a single synthetic image of each class, though our techniques can certainly be applied when more data is available; however, we do not expect the reverse to hold for domain adaptation or few-shot learning in our setting. Indeed, we consider this work to be complementary in that we are concerned with exploiting the additional structure that is inherent in certain source domains, while the goal of domain adaptation and few-shot learning is to achieve good performance under various downstream domain shift assumptions.

3. SETTING

The standard supervised learning setting consists of an input space X, an output space Y, and a hypothesis space H of functions mapping X to Y. A domain P_D is a probability distribution over (X, Y). Given a target domain P_T and a loss function ℓ, the goal is to learn a classifier h ∈ H that minimizes the risk, i.e., the expected loss R_{P_T}(h) := E_{P_T}[ℓ(h(x), y)]. The training procedure consists of n samples (x_1, y_1), ..., (x_n, y_n) from a source domain P_S. A standard approach is empirical risk minimization, which takes the classifier that minimizes R_emp(h) = (1/n) Σ_i ℓ(h(x_i), y_i); if P_S is close to P_T, then with enough samples, such a classifier also achieves low risk in the target domain.
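As a toy illustration (our own, not from the paper), empirical risk minimization over a small finite hypothesis space of threshold classifiers might look like the following; the data and hypotheses are purely illustrative:

```python
# Minimal empirical risk minimization over a finite hypothesis space.
# Hypotheses are threshold classifiers h_t(x) = sign(x - t).

def zero_one_loss(pred, label):
    # 0-1 loss: 0 if the prediction matches the label, else 1.
    return 0.0 if pred == label else 1.0

def empirical_risk(h, samples):
    # Average training loss R_emp(h) = (1/n) sum_i loss(h(x_i), y_i).
    return sum(zero_one_loss(h(x), y) for x, y in samples) / len(samples)

def erm(hypotheses, samples):
    # Return the hypothesis with the lowest empirical risk.
    return min(hypotheses, key=lambda h: empirical_risk(h, samples))

samples = [(-2, -1), (-1, -1), (1, 1), (2, 1)]
hypotheses = [lambda x, t=t: 1 if x > t else -1 for t in (-3, 0, 3)]
best = erm(hypotheses, samples)
print(empirical_risk(best, samples))  # 0.0: the threshold t=0 separates the data
```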

3.1. CONTEXT-AGNOSTIC LEARNING

In general, we can frame the goal of classification as learning to extract reliable signals for the label y from points x ∈ X. This task is often complicated by the presence of noise or other spurious signals. However, for input spaces generated by physical processes, such signals are generally produced by distinct physical entities and can thus be thought of as independent signals that become mixed via the observation process. We aim to capture this additional structure in our setting. Concretely, we have an object space O, a context space C, and an observation function γ on O × C. The input space X is defined as the image of γ : O × C → X. We will assume that points in O are associated with a unique label in Y, and require that γ preserve this property when passing to X. Note that this setting can easily be generalized to the case where the image of γ is a subdomain of X. In this work, we will consider the special case when X ⊆ C. Conceptually, the context space is an "ambient space" containing not only valid inputs, but also random noise or irrelevant classes; the input space is the subset of the context space for which there exists a well-defined label. For example, in our experiments we explore such a decomposition for the task of traffic sign recognition, where the object space O consists of traffic signs viewed from different angles, the context space C is unconstrained pixel space, and the input space X is the set of images that contain a traffic sign. Recall that the standard objective of learning is to find a good classifier for an unknown subdomain X_{P_T} ⊆ X. We consider instead the task of learning a classifier on the entire input space X. To sample from X, we are given oracle access to the observation function and draw (labelled) samples from O and C independently.
Clearly, if this problem is realizable, i.e., there exists h* ∈ H for which R_X(h*) = 0, then we do not even need to know the target domain P_T, since

X_{P_T} ⊆ X =⇒ R_X(h*) = 0 =⇒ R_{P_T}(h*) = 0.

Assuming access to X through γ, we can learn h* simply by taking the number of samples to infinity. Unfortunately, learning a classifier on X generally requires many more samples than learning a classifier on X_{P_T}. Thus we aim to learn h* using as few samples as possible. Our new goal will be to learn a classifier over X which depends only on signals from O; more precisely, we have the following definitions:

Definition 3.1. A function f on X is context-agnostic if

Pr[f ∘ γ(o, c) = x] = Pr[f ∘ γ(o, c′) = x]   ∀c, c′ ∈ C, o ∈ O, x ∈ Im(f).

Definition 3.2. Given a context-agnostic label function y*, the objective of context-agnostic learning is to find h ∈ H such that h achieves the lowest risk of all context-agnostic classifiers.

The hope is that, since y* is context-agnostic, we can learn y* through the lower dimensional structure of O using fewer samples. Note, however, that while we only need max(|O|, |C|) samples to observe every object and context once, we need |O| · |C| samples to observe every object in every context. Hence the main challenge when the number of samples is low will be avoiding spurious signals, i.e., statistical correlations between contexts and objects (and by extension, labels) which are artifacts of the sampling process and do not generalize outside the training set. We conclude with some high-level remarks about this setting. First, note that if the problem is realizable, then the lowest risk classifier is also context-agnostic. Second, we recover the standard supervised setting for the trivial context space C = ∅.
Conversely, classification remains well-defined even in the trivial object space O = {y_i}, the set of classes; however, this pushes all the complexity to the observation function γ, which may be hard to define or intractable to compute. Finally, we do not preclude the existence of useful signals originating from the context for certain domains. For instance, a great deal of information can often be gleaned from the backgrounds of photos, e.g., stop signs are more often found in cities than on highways. Our theoretical setting avoids this issue by assuming realizability and uniqueness of labels; more practically, we argue that a "good" classifier should nonetheless recognize stop signs on the highway, and our experimental results provide evidence that over-reliance on such background signals leads to brittle classifiers.

3.2. EFFICIENT SAMPLING FOR OBJECT-CONTEXT DECOMPOSED INPUT SPACES

In this section, we present an algorithm for context-agnostic learning. We first develop a formal notion of contextual bias for this setting. We assume a binary classifier h and slightly abuse notation, writing h for h ∘ γ, i.e., h : O × C → {−1, 1}. For an object o, denote the correct label o*, the expected classification ō := E_{c∼C}[h(o, c)], and the object error ô := |o* − ō|.

Definition 3.3. The context bias B(h, c) of a classifier h on the context c is defined as

sgn(B(h, c)) := sgn(E_{o∼O}[h(o, c) − ō]),    ‖B(h, c)‖ := E_{o∼O}[ℓ(h(o, c), ō)],

where ℓ is the hinge loss ℓ(i, j) := max(0, 1 − i · j).

Intuitively, the sign of the bias corresponds to the label toward which the classifier is biased by a given context; the magnitude measures the strength of this bias. Clearly, the classifier is context-agnostic exactly when the bias is zero. We are now ready to state our main theoretical result, which gives an upper bound on the risk in terms of the context bias over C and the object error over O.

Theorem 3.1. Let h be a classifier with average bias K and object error for all objects bounded from above by α < 1. Then the risk is bounded from above by K/(2 − α). Furthermore, equality holds if and only if all object errors equal α.

We give a proof in Appendix A. The assumption α < 1 is fairly weak, being equivalent to the classifier performing better than random guessing. Note that the error bound α and bias bound K are not independent; in particular, α = 0 if and only if K = 0 and α < 1. Observe also that when C = ∅, K = 0 holds trivially, but α < 1 for all objects means the classifier is correct on all inputs. The central idea behind Theorem 3.1 is leveraging the fact that labels depend only on objects to factor the risk into separate terms for object error and context bias. This factorization enables us to exploit our ability to sample independently from the object and context spaces. More specifically, we can use samples from O to minimize the object error, and samples from C to minimize the context bias. Since we only need α < 1, we continue to draw objects randomly; however, given an object o, we aim to observe it with the context for which the classifier has the strongest opposing bias.
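Definition 3.3 can be evaluated directly on finite object and context spaces. The following toy example (our own, with illustrative names and data) computes the sign and magnitude of the context bias for a context-agnostic classifier and for one that ignores the object entirely:

```python
# Context bias of a binary classifier h : O x C -> {-1, +1} over finite
# object/context spaces, following Definition 3.3 (toy example only).

def hinge(i, j):
    # Hinge loss l(i, j) = max(0, 1 - i*j).
    return max(0.0, 1.0 - i * j)

def expected_classification(h, o, contexts):
    # o-bar := E_{c ~ C}[h(o, c)] under the uniform distribution on C.
    return sum(h(o, c) for c in contexts) / len(contexts)

def bias_sign(h, c, objects, contexts):
    # sgn(B(h, c)) := sgn(E_{o ~ O}[h(o, c) - o-bar]).
    s = sum(h(o, c) - expected_classification(h, o, contexts) for o in objects)
    return (s > 0) - (s < 0)

def bias_magnitude(h, c, objects, contexts):
    # ||B(h, c)|| := E_{o ~ O}[l(h(o, c), o-bar)].
    return sum(hinge(h(o, c), expected_classification(h, o, contexts))
               for o in objects) / len(objects)

objects = [0, 1]
contexts = ["day", "night"]
h_agnostic = lambda o, c: 1 if o == 1 else -1   # depends on the object only
h_biased = lambda o, c: 1 if c == "day" else -1  # ignores the object

print([bias_magnitude(h_agnostic, c, objects, contexts) for c in contexts])  # [0.0, 0.0]
print(bias_sign(h_biased, "day", objects, contexts))  # 1: biased toward +1 by "day"
```

As the output illustrates, the bias vanishes on every context exactly when the classifier is context-agnostic.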
Intuitively, this allows the classifier to "correct" its bias and unlearn the spurious signals, thereby minimizing the bias and also the risk. Adopting this approach without modification requires computing the bias of every context in C. In most cases, however, even estimating a single bias may be prohibitively expensive. Thus, rather than solve for the maximum bias explicitly, we instead propose a heuristic for identifying contexts with large biases. Note that since X ⊆ C, a reasonable assumption is that the classifier learns a strong bias on recent training inputs when taken as contexts. This suggests a simple greedy approach for correcting biases by repurposing recent training inputs as contexts; we call this algorithm Greedy Bias Correction and present a description in Algorithm 1.
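Since Algorithm 1 itself is not reproduced in this chunk, the following schematic sketch conveys the greedy loop under stated assumptions: `sample_object`, `sample_context`, `observe`, and `train_step` are hypothetical stand-ins for the labelled object oracle, the context oracle, the observation function γ, and a gradient update, and `p_resample` is an assumed mixing probability.

```python
# Schematic sketch of Greedy Bias Correction: recent training inputs are
# reused as contexts to let the classifier unlearn spurious signals.
# All callables here are placeholders, not the paper's implementation.

import random

def greedy_bias_correction(model, sample_object, sample_context, observe,
                           train_step, steps, p_resample=0.1):
    prev_input = None
    for _ in range(steps):
        obj, label = sample_object()          # draw a labelled object
        if prev_input is None or random.random() < p_resample:
            ctx = sample_context()            # occasionally, a fresh context
        else:
            ctx = prev_input                  # reuse recent input as context
        x = observe(obj, ctx)                 # x = gamma(o, c)
        train_step(model, x, label)
        prev_input = x                        # heuristic: bias is strong here
    return model

# Smoke test with trivial stubs:
calls = []
greedy_bias_correction(
    model=None,
    sample_object=lambda: (1, +1),
    sample_context=lambda: "background",
    observe=lambda o, c: (o, c),
    train_step=lambda m, x, y: calls.append((x, y)),
    steps=5,
)
print(len(calls))  # 5 training steps performed
```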

4. LEARNING VISUAL TASKS USING CONTEXT-AGNOSTIC SYNTHETIC DATA

We introduce an instantiation of Greedy Bias Correction for learning visual tasks using synthetic data. We are given a function which takes a label y and outputs a rendering of the corresponding class in a random pose without any background. The context is the background of the image, on which we place no restrictions. The observation function γ superimposes an object over a background.
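The observation function described above can be sketched as simple alpha compositing. The array shapes, value ranges, and function name below are illustrative assumptions rather than the paper's rendering pipeline:

```python
# One plausible observation function for images: paste a rendered object
# (RGBA, transparent outside the object) over an arbitrary background.

import numpy as np

def observe(obj_rgba, context_rgb):
    """Superimpose an object over a background.

    obj_rgba: (H, W, 4) float array in [0, 1]; alpha = 0 outside the object.
    context_rgb: (H, W, 3) float array in [0, 1].
    """
    rgb, alpha = obj_rgba[..., :3], obj_rgba[..., 3:4]
    return alpha * rgb + (1.0 - alpha) * context_rgb

obj = np.zeros((4, 4, 4))
obj[1:3, 1:3] = [1, 0, 0, 1]              # opaque red square in the middle
context = np.full((4, 4, 3), 0.5)          # flat grey background
x = observe(obj, context)
print(x[2, 2], x[0, 0])  # [1. 0. 0.] [0.5 0.5 0.5]
```

Note that because the background is unconstrained, any image (including a previous training input) can serve as the context here.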

Local refinement via robustness training

We note that our observation function γ is fairly restrictive; for instance, we do not support occlusions. Because our ultimate goal is to perform well on data taken from a real-world context, we aim to capture this discrepancy using robustness training.1 In particular, we assume that the image of γ is an ε-covering of X, where a set A is said to be an ε-covering of another set B iff for all points b ∈ B, there exists a point a ∈ A such that ‖a − b‖ ≤ ε. Then for a given sample, we will instead add the point in the ε-neighborhood of x which maximizes the training loss, i.e., for a classifier h and a sample x = γ(o, c), we use x′ = arg max_{x′ ∈ N_ε(x)} ℓ(h(x′), y). This formulation is often used to train models which are robust against local perturbations. An empirically effective method for finding approximations to x′ is Projected Gradient Descent (PGD) (Goodfellow et al., 2014b; Madry et al., 2017). The algorithm can be summarized as

x_0 ← x + δ,
x_i ← Π_{x+ε}(x_{i−1} + η · sgn(∇_x ℓ(h(x_{i−1}), y))),   i = 1, ..., n,

where δ is a small amount of random noise, Π is the projection back onto the ε-ball, η is the step size, and n is the number of iterations. As is standard for robustness training, we use the ℓ∞ norm defined as ‖(x_1, ..., x_n)‖∞ = max_i |x_i|. Our choice of ε will depend on the task at hand, and we also use different ε for the portions of the image corresponding to the object and context. Additionally, since we are no longer in a binary setting, we sample a random permutation on labels instead of flipping the label deterministically. The full algorithm is presented as Algorithm 2 in Appendix B; Figure 1 provides a visualization of the key generative process, with images taken from a real step of training a deep neural network to perform classification of traffic signs. From a practical standpoint, this algorithm makes concrete several benefits of our approach. First, rendering object classes, i.e., sampling from O, is often relatively easy.
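The PGD update above can be sketched in a few lines. Here the "model" is a toy linear classifier with hinge loss and an analytic input gradient; the paper applies PGD to deep networks, where the gradient would come from autodiff, so all names and values below are illustrative:

```python
# Minimal l-infinity PGD matching the update rule in the text:
#   x_0 <- x + delta;  x_i <- Proj_{x +/- eps}(x_{i-1} + eta * sgn(grad)).

import numpy as np

def pgd_linf(x, y, w, eps, eta, n_steps, seed=0):
    rng = np.random.default_rng(seed)

    def grad_loss(z):
        # Input gradient of the hinge loss l(w.z, y) = max(0, 1 - y * w.z).
        return -y * w if 1.0 - y * np.dot(w, z) > 0 else np.zeros_like(w)

    x_adv = x + rng.uniform(-eps, eps, size=x.shape)  # random start delta
    for _ in range(n_steps):
        x_adv = x_adv + eta * np.sign(grad_loss(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)      # project onto eps-ball
    return x_adv

x = np.array([1.0, 1.0])
y, w = 1, np.array([0.3, 0.3])
x_adv = pgd_linf(x, y, w, eps=0.25, eta=0.1, n_steps=10)
print(np.max(np.abs(x_adv - x)) <= 0.25 + 1e-9)  # True: stays in the ball
```

Because the gradient ascent step only increases the loss within the ε-ball, the adversarial point is at least as hard as the clean sample.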
In the case of two-dimensional rigid body objects, this can be captured using standard data augmentation such as rotations, flips, and perspective distortions. Indeed, in this setting, our work can be viewed as a form of minimal one-shot learning, where the training data consists solely of a single unobstructed straight-on shot for each object class. Second, there is no requirement to perform realistic rendering of contexts C, avoiding an additional layer of complexity. Finally, because our approach is context-agnostic, our functions are learned without any reference to target domains. In the formal setting, we assumed that the target domain was contained in the image of the observation function; however, synthetic images will always be subject to the reality gap. Our experiments suggest that our approach overcomes this barrier and successfully generalizes to natural images while training on synthetic data only.

5. EXPERIMENTS

We evaluate our approach to learning visual tasks using synthetic data on three benchmarks for image recognition. Our training sets consist of a single synthetic image for each object class with no additional information about the target domain; Figure 2 shows examples of the training and test images from two of the datasets. On all three benchmarks, our models perform comparably with previous state-of-the-art results from related settings using few-shot learning and domain adaptation.

5.1. TRAFFIC SIGN RECOGNITION

GTSRB consists of real-world training images and 12,630 test images of 43 classes of German traffic signs. Our training set consists of a single, canonical pictogram of each class taken from the visualization software accompanying the dataset, which we refer to as Picto. We achieve 95.9% accuracy on the GTSRB test set training only on Picto, against a human baseline of 98.8%. A comprehensive comparison with existing approaches can be found in Appendix D, Table 2. For domain adaptation, all approaches train on the full 100,000 images in SynSign plus part of the GTSRB training set. ATT (Saito et al., 2017) is the only method with better performance than ours, achieving 0.3% higher accuracy; however, they use 31,367 unlabelled images from the GTSRB training set (in addition to SynSign). Methods using few-shot learning train on roughly half of the data (22 classes) from the GTSRB training set. The leading few-shot learning approach, VPE (Kim et al., 2019), adds a pictographic dataset similar to Picto, but achieves only 83.79% accuracy. In comparison, our training set consists of only 43 images, none of which are from GTSRB.

5.2. HANDWRITTEN CHARACTER RECOGNITION

MNIST (LeCun) consists of 60,000 training and 10,000 test images of handwritten Arabic numerals in grayscale against a blank background. Our training set, Digit, consists of a single example of each digit taken from a standard digital font. Omniglot (Lake et al., 2015) consists of 1623 handwritten characters from 50 different alphabets, with 20 samples each. The samples were sourced online from 20 workers on Amazon's Mechanical Turk, who were asked to copy each character from a single font-based example using digital input (e.g., a mouse). We obtained the original representations for our dataset, OmniFont. On MNIST, we achieve 90.2% accuracy training only on Digit, compared to human accuracy of 98%; on Omniglot, we achieve 92.2% 20-way accuracy training only on OmniFont, compared to human accuracy of 95.5%. Tables 3 and 4 in Appendix D compare these results with approaches using few-shot learning and domain adaptation. Handwritten characters and GTSRB present conceptually opposed challenges for learning: in GTSRB, the objects are rigid two-dimensional objects and the backgrounds are complex settings in the natural world; in Omniglot and MNIST, backgrounds are uniform, but classes no longer have a strict specification and individual examples exhibit high variability. Thus, the main challenge of these tasks is learning how to generalize over the object class. Despite the inherent variation, a baseline model trained on Digit with plain data augmentation was able to achieve 81.9% accuracy on MNIST, exceeding many domain adaptation approaches and all the one-shot learning results; Omniglot is more difficult, with an OmniFont plus data augmentation baseline accuracy of 71.9%. On MNIST, every approach using domain adaptation uses the full Street View House Numbers (SVHN) training set of 73,257 images of house numbers obtained from Google Street View (Netzer et al., 2011), plus varying amounts of data from MNIST.
The domain transfer problem faces a challenge similar to ours with Digit: handwriting exhibits different characteristics than house-number fonts. Nevertheless, we note that SVHN contains far more examples of each digit. The only non-baseline approach to exceed our performance is CyCADA (Hoffman et al., 2017), which achieves 0.2% better performance by performing domain adaptation using 60,000 unlabelled images from the MNIST training set (in addition to training on SVHN). All approaches using few-shot learning (except FADA) train on 32,460 images from Omniglot and use as few as one image per class from MNIST; the best result achieves accuracy 3% below ours using 70 images from MNIST. In contrast, we use only 10 images, none of which are from MNIST. Omniglot is often described as an MNIST-transpose, where the goal is to learn handwriting rather than specific symbols, and it is widely used as a benchmark for few-shot learning. We reproduce the most common split given in Lake et al. (2015), which uses a predefined set of 30 alphabets, with 19,280 images for training. Test performance is reported as an average over random subsets of n = 5, 20 unseen classes for the n-way task (given one labelled example). In comparison, for each test run, we retrain a model using only the corresponding n images from OmniFont. As expected, our method finds 5-way classification easier than 20-way classification (95.8% vs. 92.2%). In both cases, our performance lags behind the state of the art for few-shot learning (>99%), though we emphasize that our experimental setup differs significantly in both the type and amount of training data used. Finally, several approaches apply few-shot learning from Omniglot to MNIST, with the idea of transferring extracted features from human handwriting. However, the one-shot experiments all perform worse than even our baseline approach.
We hypothesize that, in comparison to Omniglot, where all the samples come from the same 20 subjects, MNIST may be particularly difficult for transfer one-shot learning, since any two examples will likely exhibit high "variance"; conversely, our approach benefits from using a canonical form which might be closer to the "mean" representation.

5.3. ABLATION STUDIES

We conduct two sets of ablation studies to better understand our approach to context-agnostic learning. The first study tests the individual components of our algorithm for their contributions to generalization over the real world dataset. All strategies employ the same data augmentation and use the following sampling procedures: baseline picks a fresh random background for each training point, and measures the performance of training on our synthetic dataset with plain data augmentation; random-context reuses random backgrounds as contexts; bias-correction reuses previous training images as contexts; refinement-only is the same as random-context with the addition of PGD-based refinement; full is the full algorithm as described in Algorithm 2. The results are in Table 1 . In all cases, we observe that both bias correction and local refinement contribute individually and jointly to the performance of our models. For GTSRB, a particularly interesting comparison is training on SynSign, a dataset designed to provide synthetic training data with realistic backgrounds for GTSRB, which yields 79.2% accuracy (Saito et al., 2017) . Though this is an improvement over our baseline of using random backgrounds at 72.0% accuracy, refinement-only and bias-correction achieve higher accuracy at 86.4% and 87.3%, respectively. Both methods leverage the background of training images to combat spurious signals, generating completely unrealistic backgrounds; this suggests that learning context-agnostic features is more effective than using realistic backgrounds. The second study measures classification performance in a context-agnostic setting on the synthetic Picto dataset. By definition, the performance of a context-agnostic classifier should not degrade under perturbations of the background. 
We thus run an adaptive attack using a PGD adversary which fixes the foreground pixels and ranges ε from fixed to unbounded on the background pixels, effectively searching the context space for a background that causes a misclassification on the given object. We also consider two initialization strategies for the PGD adversary: a standard random initialization, and initializing to the previous image, inspired by our bias heuristic. We test the same set of strategies as before, plus a classifier trained directly on the GTSRB training set achieving 98% performance on the GTSRB test set (real2sim). Appendix E.2 contains samples of the generated images, and the results are plotted in Figure 3. Across all experiments, the models have worse (or very close) performance when using our bias heuristic for initialization. We believe this supports our usage of the bias heuristic for context-agnostic learning. Additionally, in the last column of Figure 3b, only our full method maintains passable accuracy, which suggests the gap between models is larger than performance on GTSRB indicates. We also note that real2sim seems to suffer from a "synthetic gap" even at ε = 0/255, which is not entirely unexpected. However, in both settings, performance degrades very quickly as ε increases: the effect is most pronounced when the bias heuristic is used to initialize the PGD adversary, though in both cases the accuracy eventually drops to 0. We emphasize that all of the experiments leave the foreground objects completely unperturbed (and easily human-identifiable); our results thus suggest that classifiers trained on natural images can become over-reliant on contextual signals, leading to surprisingly brittle behavior even given unambiguous foregrounds.
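The background-only adaptive attack can be sketched as ordinary PGD with the update and projection masked to background pixels. The toy gradient function, shapes, and parameter values below are illustrative assumptions, not the experimental model:

```python
# Sketch of the background-only PGD attack: a boolean foreground mask
# freezes the object pixels while the background is perturbed.

import numpy as np

def masked_pgd(x, foreground_mask, grad_fn, eps, eta, n_steps):
    bg = ~foreground_mask                     # only background may move
    x_adv = x.copy()
    for _ in range(n_steps):
        step = eta * np.sign(grad_fn(x_adv))
        x_adv = x_adv + step * bg             # ascend loss on background only
        # Project background into the eps-ball; reset foreground exactly.
        x_adv = np.where(bg, np.clip(x_adv, x - eps, x + eps), x)
    return x_adv

x = np.full((4, 4), 0.5)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True                         # center square is foreground
grad = lambda z: np.ones_like(z)              # toy loss increasing in z
x_adv = masked_pgd(x, mask, grad, eps=0.2, eta=0.05, n_steps=10)
print(x_adv[1, 1], x_adv[0, 0])  # object pixel unchanged; background at the eps boundary
```

Setting eps to the full pixel range recovers the unbounded search over backgrounds described above.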

6. CONCLUSION

We introduce the task of context-agnostic learning, a theoretical setting for learning models whose predictions are independent of background signals. Leveraging the ability to sample objects and contexts independently, we propose an approach to context-agnostic learning by minimizing a formally defined notion of context bias. Our algorithm has a natural interpretation for training classifiers on vision-based tasks using synthetic data, with the distinct advantage that we do not need to model the background. We evaluate our methods on several real-world domains; our results suggest that our approach succeeds in learning context-agnostic classifiers that generalize to natural images using only a single synthetic image of each class, while training with natural images can lead to brittleness in the context-agnostic setting. Our performance is competitive with existing methods for learning when data is limited, while using significantly less data. More broadly, the ability to learn from single synthetic examples of each class also affords fine-grained control over the data used to train our models, allowing us to sidestep issues of data provenance and integrity entirely.

A PROOFS

Proof of Theorem 3.1. By the assumption that α < 1, we have that for all o, the signs of the expected classification ō and the correct classification o* match, so that α ≥ ô = |o* − ō| = 1 − |ō|. Then for all o,

ℓ(ō, o*) = 1 − ō · o* = 1 − |ō| = ((1 + |ō|)/(1 + |ō|)) (1 − |ō|) = (1 − ō · ō)/(1 + |ō|) = ℓ(ō, ō)/(1 + |ō|) ≤ ℓ(ō, ō)/(2 − α),

where the last step uses 1 + |ō| = 2 − ô ≥ 2 − α. Now to bound the risk, we can write

R(h) := E_{o∼O, c∼C}[ℓ(h(o, c), o*)]
  = (1/(|C||O|)) Σ_c Σ_o ℓ(h(o, c), o*)
  = (1/|O|) Σ_o (1/|C|) Σ_c (1 − h(o, c) · o*)
  = (1/|O|) Σ_o (1 − ō · o*)
  ≤ (1/|O|) Σ_o (1 − ō · ō)/(2 − α)
  = (1/((2 − α)|O|)) Σ_o (1/|C|) Σ_c (1 − h(o, c) · ō)
  = (1/((2 − α)|C||O|)) Σ_c Σ_o ℓ(h(o, c), ō)
  = (1/((2 − α)|C|)) Σ_c ‖B(h, c)‖
  = K/(2 − α),

as desired. It also follows that equality holds if and only if ô = α for all o.
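The bound can also be sanity-checked numerically (our own check, not part of the paper): sample random classifiers over a finite O × C that are correlated with the labels, skip any that violate the assumption α < 1, and compare the empirical risk against K/(2 − α):

```python
# Numerical sanity check of Theorem 3.1 over finite O x C with uniform
# sampling and the hinge loss from Section 3.2 (illustrative sizes).

import itertools
import random

def hinge(i, j):
    return max(0.0, 1.0 - i * j)

random.seed(0)
O, C = range(6), range(8)
labels = {o: random.choice([-1, 1]) for o in O}

checked = 0
for _ in range(200):
    # Random classifier correlated with the labels (flip probability 1/4).
    h = {(o, c): labels[o] * random.choice([1, 1, 1, -1])
         for o, c in itertools.product(O, C)}
    obar = {o: sum(h[o, c] for c in C) / len(C) for o in O}
    # Skip classifiers violating alpha < 1 (sign mismatch or obar = 0).
    if any(obar[o] * labels[o] <= 0 for o in O):
        continue
    n = len(O) * len(C)
    alpha = max(abs(labels[o] - obar[o]) for o in O)        # max object error
    K = sum(hinge(h[o, c], obar[o]) for o in O for c in C) / n   # avg bias
    risk = sum(hinge(h[o, c], labels[o]) for o in O for c in C) / n
    assert risk <= K / (2 - alpha) + 1e-9
    checked += 1

print("bound held on", checked, "sampled classifiers")
```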

C EXPERIMENTAL SETUP

We used PyTorch 1.5.0 (Paszke et al., 2019), OpenCV 4.2.0 (Bradski, 2000), and scikit-image 0.17.2 (van der Walt et al., 2014) for all experiments. In setting the number of epochs, we did not observe any significant degradation or improvement in performance when training for longer. We use fewer epochs in the case of Omniglot due to computational constraints, as the model is retrained for each test split.

For GTSRB, we use a 5-layer convolutional neural network adapted from the official PyTorch tutorials. To train with Picto, the data augmentation consists of the PyTorch transforms RandomAffine with translate=(.15, .15), scale=(0.65, 1.05), shear=5, RandomPerspective(0.5, p=1), and ColorJitter with contrast=.8, saturation=.8, hue=.05; OpenCV box blur with a random kernel size between 1 and 6 in both dimensions (independently sampled, so not necessarily square); and a random exposure adjustment by adjusting all pixels by the same random amount between -30% and 50%. For refinement, we used step sizes of α = 2/255 with 8 steps and an epsilon of ε = 4/255 for the foreground only. For the observation function, we superimpose the segmented foreground of the transformed pictographic sign over the context. We train for 300 epochs using the Adam optimizer (learning rate 1e-4, weight decay 1e-4), with 5 examples of each class per batch and 20 batches per epoch. We report results for the model that achieves the best performance on the training set, checking every 5 epochs.

For MNIST, we use the two-layer convolutional neural network from the official PyTorch examples for MNIST, with Dropout regularization replaced with pre-activation BatchNorm. To train with Digit, the data augmentation consists of the PyTorch transforms RandomAffine(15, translate=(.15, .15), scale=(0.75, 1.05), shear=40) and RandomPerspective(0.5, p=1); OpenCV box blur with a random kernel size between 1 and 6 in both dimensions (independently sampled, so not necessarily square); we then set the foreground to all pixels with value greater than 0.2.
For refinement, we used step sizes of α = 1.6/255 with 8 iterations and no projection (ε = ∞). For the observation function, we blend the object with the context at a 2:1 ratio; this ensures that inputs have a well-defined ground truth label. We train for 300 epochs using the Adam optimizer (learning rate 1e-4, weight decay 1e-4), with 5 examples of each class per batch and 20 batches per epoch. We report results for the model that achieves the best performance on the training set, checking every 5 epochs.

For Omniglot, we use the pre-activation variant of ResNet18 (He et al., 2015). To train with OmniFont, we first preprocess with scikit-image skeletonize and dilation to standardize stroke widths. Data augmentation consists of the PyTorch transforms RandomAffine(15, translate=(.15, .15), scale=(0.75, 1.1), shear=20) and RandomPerspective(0.25, p=1); OpenCV box blur with a random kernel size between 1 and 3 in both dimensions (independently sampled, so not necessarily square); we then resize the images to 28 by 28. For refinement, we used step sizes of α = 1.6/255 with 8 iterations and no projection (ε = ∞). For the observation function, we blend the object with the context at a 2:1 ratio; this ensures that inputs have a well-defined ground truth label. For the n-way classification task, we randomly sample n characters from the Omniglot test set, and use the corresponding characters from the OmniFont dataset as our training set. We then train a fresh model for 150 epochs using the Adam optimizer (learning rate 1e-4, weight decay 1e-4), and report performance on all 20n images in the Omniglot test set, averaged over 20 runs (10 runs for the ablation studies).

The exact setup of the one-shot classification task often varies between authors. We believe the broad performance numbers are still useful for contextualizing our approach, and refer the reader to the original works for details.
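Two of the custom augmentations above, the random box blur with independently sampled kernel dimensions and the global exposure adjustment, can be sketched in NumPy. The zero padding and parameter handling here are assumptions for illustration rather than the paper's exact OpenCV calls:

```python
# NumPy sketch of two augmentations: a box blur with an independently
# sampled (possibly non-square) kernel, and a global exposure shift.

import numpy as np

def box_blur(img, kh, kw):
    # Mean filter with a kh x kw kernel and zero padding at the borders.
    padded = np.pad(img, ((kh // 2, kh - 1 - kh // 2),
                          (kw // 2, kw - 1 - kw // 2)))
    out = np.zeros_like(img, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out / (kh * kw)

def augment(img, rng):
    # Kernel sizes sampled independently in 1..6, so not necessarily square.
    kh, kw = rng.integers(1, 7, size=2)
    img = box_blur(img, int(kh), int(kw))
    exposure = rng.uniform(-0.3, 0.5)        # -30% .. +50% exposure shift
    return np.clip(img * (1.0 + exposure), 0.0, 1.0)

rng = np.random.default_rng(0)
img = rng.random((8, 8))
out = augment(img, rng)
print(out.shape, float(out.min()) >= 0.0 and float(out.max()) <= 1.0)
```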



RELATED WORK

Domain shift refers to the problem that occurs when the training set (source domain) and test set (target domain) are drawn from different distributions. In this setting, a classifier that performs well on the source domain may nevertheless perform poorly on the target domain.

Robustness training is more commonly referred to as adversarial training in the adversarial robustness community, whence we borrow this technique. We use the nonstandard term to avoid confusion with the unrelated (generative) adversarial methods found in the few-shot learning literature.



For an object o, denote the correct label o*, the expected classification ō := E_{c∼C}[h(o, c)], and the object error ô := |o* − ō|.

Definition 3.3. The context bias B(h, c) of a classifier h on the context c is defined as sgn
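The expected classification ō and object error ô can be estimated by Monte Carlo sampling over contexts. A minimal sketch under stated assumptions: the classifier `h`, the context sampler, and the function name `object_error` are all hypothetical stand-ins, not the paper's implementation.

```python
import random

def object_error(h, o, o_star, sample_context, n=5000, rng=random):
    """Estimate o_bar = E_{c~C}[h(o, c)] by averaging h over n sampled
    contexts, then return the object error |o_star - o_bar|."""
    o_bar = sum(h(o, sample_context(rng)) for _ in range(n)) / n
    return abs(o_star - o_bar)

# Toy example: a binary classifier that misfires whenever the
# (scalar) context falls below 0.2, regardless of the object.
h = lambda o, c: 1.0 if c > 0.2 else 0.0
err = object_error(h, o=None, o_star=1.0,
                   sample_context=lambda rng: rng.random(),
                   rng=random.Random(0))
# err is close to 0.2: roughly 20% of contexts flip the prediction
```

In the toy example the error is driven entirely by the context distribution, which is exactly the dependence the context-bias definition is meant to capture.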

Figure 1: A graphical representation of the generative loop in Algorithm 2 using real training data. (1) Sample from object space. (2) Observe object and context. (3) Perform local refinement. (4) Add to training set. (5) Previous image becomes next context (resample from C with probability p).
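The five steps in the caption can be sketched as a loop. This is an illustrative skeleton, not the paper's Algorithm 2: `sample_object`, `sample_context`, `observe`, and `refine` are stand-ins for the components described in the experimental details, and `p` is the context-resampling probability.

```python
import random

def generate(sample_object, sample_context, observe, refine,
             n_images, p, rng=random):
    """Generative loop (sketch): each iteration samples an object,
    observes it against the current context, locally refines the result,
    and adds it to the training set; the refined image then becomes the
    next context, except that with probability p a fresh context is
    sampled from the context space instead."""
    dataset = []
    context = sample_context(rng)
    for _ in range(n_images):
        obj = sample_object(rng)        # (1) sample from object space
        image = observe(obj, context)   # (2) observe object and context
        image = refine(image)           # (3) perform local refinement
        dataset.append(image)           # (4) add to training set
        if rng.random() < p:            # (5) resample context with prob. p,
            context = sample_context(rng)
        else:                           #     else previous image is the
            context = image             #     next context
    return dataset
```

Feeding each output back in as the next context is what lets generated images accumulate realistic clutter without an explicit model of real-world backgrounds.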

Figure 2: Images from the training (top) and test (bottom) set for GTSRB (left) and MNIST (right).

Figure 3: Context-agnostic performance on Picto using a PGD adversary on the background.

Figure 6: Training images from the first ablation study using Picto dataset. From top to bottom: baseline, random-context, refinement-only, bias-correction, full.

Figure 7: Test images from the second ablation study using the Picto dataset. Examples of test images generated using a PGD adversary initialized randomly (top) and with the bias heuristic (bottom) at ε = 255/255.

Figure 8: Training images from the first ablation study for the Digit dataset. From top to bottom: baseline, random-context, refinement-only, bias-correction, full.

Performance of Algorithm 2 on various benchmarks, plus ablation studies.

The GTSRB test set comprises 12,630 real-world images of 43 classes of German traffic signs. Our training set consists of a single, canonical pictogram of each class, taken from the visualization software accompanying the dataset, which we refer to as Picto. We achieve 95.9% accuracy on the GTSRB test set training only on Picto, against a human baseline of 98.8%. A comprehensive comparison with existing approaches can be found in Appendix D, Table 2.

MNIST results.

Omniglot results for one-shot classification. ‡

D FULL EXPERIMENTAL RESULTS

We compare a model trained using our methods with previous state-of-the-art results from related settings using few-shot learning and domain adaptation on GTSRB (Table 2), MNIST (Table 3), and Omniglot (Table 4). When multiple experiments are reported for the same approach, we compare against both the most accurate result as well as the result using the least amount of target data. We distinguish between labelled (L) and unlabelled (UL) data; experiments for which the training data is not known are marked (?).

