CURI: A BENCHMARK FOR PRODUCTIVE CONCEPT LEARNING UNDER UNCERTAINTY

Abstract

Humans can learn and reason under substantial uncertainty in a space of infinitely many concepts, including structured relational concepts ("a scene with objects that have the same color") and ad-hoc categories defined through goals ("objects that could fall on one's head"). In contrast, standard classification benchmarks: 1) consider only a fixed set of category labels, 2) do not evaluate compositional concept learning, and 3) do not explicitly capture a notion of reasoning under uncertainty. We introduce a new few-shot, meta-learning benchmark, Compositional Reasoning Under Uncertainty (CURI), to bridge this gap. CURI evaluates different aspects of productive and systematic generalization, including abstract understandings of disentangling, productive generalization, learning boolean operations, variable binding, etc. Importantly, it also defines a model-independent "compositionality gap" to evaluate the difficulty of generalizing out-of-distribution along each of these axes. Extensive evaluations across a range of modeling choices spanning different modalities (images, schemas, and sounds), splits, privileged auxiliary concept information, and choices of negatives reveal substantial scope for modeling advances on the proposed task. All code and datasets will be available online.

1. INTRODUCTION

Human concept learning is more flexible than today's AI systems. Human conceptual knowledge is productive: people can understand and generate novel concepts via compositions of existing concepts ("an apartment dog") (Murphy, 2002), unlike standard machine classifiers that are limited to a fixed set of classes ("dog", "cat", etc.). Further, humans can induce goal-based, "ad hoc" categories such as "things to take from one's apartment in a fire" (children, dogs, keepsakes, etc.) (Barsalou, 1983). Thus, unlike AI systems, humans reason seamlessly in large, essentially "unbounded" concept spaces. Beyond unboundedness, a natural challenge in such concept spaces is uncertainty: the right concept to be inferred is underdetermined, as a plethora of candidate concepts could explain the observations. For example, in Figure 1 (top, image panel), the "right" concept could be that "All objects are blue and have the same size", but it could also be "There are fewer than four objects in the scene" or "All objects have the same color". Humans gracefully handle such uncertainty and underdetermination (Tenenbaum & Griffiths, 2001; Xu & Tenenbaum, 2007; Goodman et al., 2008; Piantadosi et al., 2016). Popular compositional reasoning benchmarks such as CLEVR (Johnson et al., 2016) for visual question answering and Raven's Progressive Matrices (Santoro et al., 2017) for deductive, analogical reasoning are compositionally rich and challenging, but do not tackle ambiguity and underdetermination. We address this gap in the literature and propose the Compositional Reasoning Under Uncertainty (CURI) benchmark to study how modern machine learning systems can learn concepts spanning a large, productively defined space (Figure 1). In pursuit of this goal, we instantiate a meta-learning task where a model must acquire a compositional concept from finite samples. A signature of productivity in human thought is our ability to handle novel combinations of known, atomic components.
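This kind of underdetermination can be made concrete: several distinct hypotheses are all consistent with the same handful of examples. The following is a minimal sketch, assuming a toy schema representation of scenes and hand-written illustrative predicates; it is not the actual CURI hypothesis space.

```python
# A scene is a list of objects; each object is a dict of attributes.
# The scene below plays the role of a single positive support example.
support_scene = [
    {"color": "blue", "size": "small"},
    {"color": "blue", "size": "small"},
    {"color": "blue", "size": "small"},
]

# Candidate concepts, expressed as predicates over a whole scene.
# Names and predicates are illustrative only.
hypotheses = {
    "all blue and same size": lambda S: all(o["color"] == "blue" for o in S)
        and len({o["size"] for o in S}) == 1,
    "fewer than four objects": lambda S: len(S) < 4,
    "all same color": lambda S: len({o["color"] for o in S}) == 1,
}

# Every hypothesis above labels the support scene positive, so the
# "right" concept cannot be identified from this evidence alone.
consistent = [name for name, h in hypotheses.items() if h(support_scene)]
print(consistent)  # all three hypotheses survive
```

An ideal learner therefore cannot commit to a single hypothesis and must weigh all surviving candidates when classifying new scenes.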
Thus, in CURI we instantiate different systematic train-test splits to analyze different forms of generalization in concept learning, involving novel combinations of intrinsic object properties (e.g. color, shape) with boolean operators, counting, extrinsic object properties (e.g. object location), and a novel test of variable binding in the context of compositional learning. While related systematic splits have been proposed in prior work in the context of other tasks such as question answering and analogical reasoning (Barrett et al., 2018; Hill et al., 2019; Agrawal et al., 2017; Johnson et al., 2016; Vedantam et al., 2017; Higgins et al., 2017; Bakhtin et al., 2019; Lake & Baroni, 2018; Ruis et al., 2020), ours is the first benchmark that tests different qualitative aspects of reasoning about productive concepts under uncertainty.

Figure 1: Concept Space. Three example concepts (rows) along with schematic positive examples. Actual scenes are rendered in multiple ways including the CLEVR renderer (Johnson et al., 2016) (see Figure 2). Right: the context-free grammar of variables, quantifiers, functions and operators used to induce compositional concepts, e.g.:

    for-all x \in S: (color?(x) = "blue") and (all (size?(S) = size?(x)))
    for-all x \in S: (all (color?(x) = color?(S)))
    exists x \in S: (color?(x) = "blue") and (all (shape?(S_{-x}) = "square"))

Compositional Reasoning Under Uncertainty (CURI) Task. Concretely, the CURI task tests few-shot learning of relational concepts in a large compositional conceptual space, with design inspiration from studies in cognitive modeling using a language of thought (LOT) approach (Fodor, 1975; Piantadosi, 2011; Kemp et al., 2005). CURI includes scene-based concepts such as "All objects have the same color" and "There exists a blue object while the rest are triangles" (Figure 1), but unlike CLEVR (Johnson et al., 2016) there are too few examples to deduce answers with certainty. Our benchmark is defined through a series of meta-learning episodes (see example in Figure 2): given positive and negative examples of a new concept D_supp (known as the "support set"), the goal of an episode is to classify new examples D_query (the "query set"). As in few-shot classification (Fei-Fei et al., 2006), meta-learning (Vinyals et al., 2016), and other open-set tasks (Lampert et al., 2014), models are evaluated on novel classes outside the (meta-)training set. Unlike previous work (e.g. Triantafillou et al., 2019; Lake et al., 2019) that focuses on atomic concepts, our benchmark concerns more structured, relational concepts built compositionally from a set of atomic concepts, and involves reasoning under uncertainty: an ideal learner must marginalize over many hypotheses when making predictions (Gelman et al., 2004; Xu & Tenenbaum, 2007; Piantadosi et al., 2016). We also vary the modality in which scenes are presented, rendering them as images, symbolic schemas, and sounds, enabling future research on modality-specific representational choices for compositional reasoning under uncertainty. Finally, we vary the concepts learned by the model during meta-training and meta-testing to test different aspects of systematic generalization.

Compositionality Gap. In addition to defining systematic splits, we also characterize (for the first time, to our knowledge) the difficulty of generalization entailed by each split by introducing the notion of a model-independent "compositionality gap". Concretely, the compositionality gap is the difference in test performance between an ideal Bayesian learner with access to the full hypothesis space and a Bayesian learner with access to only a (potentially large) list of the hypotheses examined during meta-training. A large gap indicates that any learner must extrapolate compositionally from the training hypotheses to solve the task; additionally, models can be compared to ideal learners that either do or do not engage in such extrapolation. We anticipate that this tool will be more broadly useful for analyzing other benchmarks with compositional splits.

Models. We evaluate models along various dimensions which concern the difficulty of learning productive concepts under uncertainty, including: 1) the modality in which the input is rendered (images, schemas, sounds), 2) the method used for reasoning across objects in a scene (transformer,
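The compositionality gap can be sketched in a few lines. The toy example below, a hedged illustration rather than the CURI implementation, compares a Bayesian learner that filters the full hypothesis space against one restricted to hypotheses from meta-training; the hypothesis space, scene encoding (a tuple of object colors), and the function name `posterior_predict` are all assumptions made for illustration.

```python
# Each hypothesis is a predicate over a scene (here: a tuple of colors).
def h_all_blue(S):   return all(c == "blue" for c in S)
def h_same_color(S): return len(set(S)) == 1
def h_has_red(S):    return "red" in S

FULL_SPACE = [h_all_blue, h_same_color, h_has_red]
TRAIN_SPACE = [h_has_red]  # hypotheses seen during meta-training only

def posterior_predict(space, support, query_scene):
    """Keep hypotheses consistent with the labelled support set (a uniform
    prior with likelihood 0/1), then average their votes on the query."""
    consistent = [h for h in space
                  if all(h(S) == y for S, y in support)]
    if not consistent:   # the restricted learner may rule out everything
        return 0.5       # fall back to chance
    return sum(h(query_scene) for h in consistent) / len(consistent)

# Support set: one positive and one negative example of the target concept.
support = [(("blue", "blue"), True), (("red", "green"), False)]
query_scene = ("blue", "blue", "blue")

p_full = posterior_predict(FULL_SPACE, support, query_scene)
p_train = posterior_predict(TRAIN_SPACE, support, query_scene)
print(p_full, p_train)  # 1.0 0.5 -> a compositionality gap of 0.5 here
```

In the real benchmark the same comparison is averaged over query sets and episodes; the point of the sketch is only that the gap is defined by the hypothesis space available to each ideal learner, independent of any trained model.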

