CURI: A BENCHMARK FOR PRODUCTIVE CONCEPT LEARNING UNDER UNCERTAINTY

Abstract

Humans can learn and reason under substantial uncertainty in a space of infinitely many concepts, including structured relational concepts ("a scene with objects that have the same color") and ad-hoc categories defined through goals ("objects that could fall on one's head"). In contrast, standard classification benchmarks: 1) consider only a fixed set of category labels, 2) do not evaluate compositional concept learning, and 3) do not explicitly capture a notion of reasoning under uncertainty. We introduce a new few-shot, meta-learning benchmark, Compositional Reasoning Under Uncertainty (CURI), to bridge this gap. CURI evaluates different aspects of productive and systematic generalization, including disentangling, productive generalization, learning boolean operations, and variable binding. Importantly, it also defines a model-independent "compositionality gap" to evaluate the difficulty of generalizing out-of-distribution along each of these axes. Extensive evaluations across a range of modeling choices, spanning different modalities (images, schemas, and sounds), splits, privileged auxiliary concept information, and choices of negatives, reveal substantial scope for modeling advances on the proposed task. All code and datasets will be available online.

1. INTRODUCTION

Human concept learning is more flexible than today's AI systems. Human conceptual knowledge is productive: people can understand and generate novel concepts via compositions of existing concepts ("an apartment dog") (Murphy, 2002), unlike standard machine classifiers that are limited to a fixed set of classes ("dog", "cat", etc.). Further, humans can induce goal-based, "ad hoc" categories such as "things to take from one's apartment in a fire" (children, dogs, keepsakes, etc.) (Barsalou, 1983). Thus, unlike AI systems, humans reason seamlessly in large, essentially "unbounded" concept spaces.

Beyond unboundedness, a natural challenge in such concept spaces is uncertainty: the right concept to be inferred is underdetermined, as a plethora of candidate concepts could explain the observations. For example, in Figure 1 (top, image panel), the "right" concept could be that "all objects are blue and have the same size", but it could also be "there are fewer than four objects in the scene", or "all objects have the same color". Humans gracefully handle such uncertainty and underdetermination (Tenenbaum & Griffiths, 2001; Xu & Tenenbaum, 2007; Goodman et al., 2008; Piantadosi et al., 2016). Popular compositional reasoning benchmarks such as CLEVR (Johnson et al., 2017) for visual question answering and Raven's Progressive Matrices (Barrett et al., 2018) for deductive, analogical reasoning are compositionally rich and challenging, but do not tackle ambiguity and underdetermination.

We address this gap in the literature and propose the Compositional Reasoning Under Uncertainty (CURI) benchmark to study how modern machine learning systems can learn concepts spanning a large, productively defined space (Figure 1). In pursuit of this goal, we instantiate a meta-learning task where a model must acquire a compositional concept from finite samples. A signature of productivity in human thought is our ability to handle novel combinations of known, atomic components.
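To make this ambiguity concrete, the following sketch (a hypothetical illustration, not CURI's actual data-generation code) enumerates a few hand-written candidate concepts over a schematic scene and shows that a single observation can be consistent with several of them at once, leaving the learner genuinely uncertain about which concept was intended:

```python
# Hypothetical illustration of concept ambiguity (not code from the paper).
# A "scene" is a list of objects; each object is a dict of properties.
scene = [
    {"color": "blue", "size": "small", "shape": "cube"},
    {"color": "blue", "size": "small", "shape": "sphere"},
    {"color": "blue", "size": "small", "shape": "cylinder"},
]

# Hand-written candidate concepts (scene -> bool), mimicking the examples
# in the text; each is a plausible explanation of the observed scene.
concepts = {
    "all objects are blue and have the same size":
        lambda s: all(o["color"] == "blue" for o in s)
                  and len({o["size"] for o in s}) == 1,
    "there are fewer than four objects":
        lambda s: len(s) < 4,
    "all objects have the same color":
        lambda s: len({o["color"] for o in s}) == 1,
    "all objects are red":
        lambda s: all(o["color"] == "red" for o in s),
}

# Every concept consistent with the positive observation survives, so the
# single scene does not pin down a unique "right" concept.
consistent = [name for name, f in concepts.items() if f(scene)]
print(consistent)  # three of the four candidate concepts remain
```

A Bayesian learner would place a posterior over the surviving hypotheses rather than commit to one; CURI evaluates whether models can manage exactly this kind of underdetermination from few examples.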
Thus, in CURI we instantiate different systematic train-test splits to analyze different forms of generalization in concept learning, involving novel combinations of intrinsic properties (e.g. color, shape) with boolean operators, counting, extrinsic object properties (e.g. object location), and a novel test of variable binding in the context of compositional learning. While related systematic splits have been proposed in prior work in the context of other tasks such as question answering and analogical reasoning (Barrett et al., 2018; Hill et al., 2019; Agrawal et al.,

