CONCEPTUAL SCAN: LEARNING WITH AND ABOUT RULES

Anonymous authors
Paper under double-blind review

Abstract

The ability to learn from a mix of rules and examples and to reflect on the learned abstractions is an important aspect of human intelligence. At the same time, there is a lack of benchmarks that systematically test for this ability, which makes it hard to evaluate the degree to which it is present in state-of-the-art ML architectures. We introduce a novel task format for such benchmarks by using an example structure that allows us to explicitly provide and ask about rules that are relevant for the given task. We present a simple dataset illustrating this format, and we use it to analyze the performance of a variety of T5-based ML models. We identify three challenge areas in this setup: maintaining consistency between learned rules and their application, scaling to larger rule sets, and compositional generalization.

1. INTRODUCTION

Machine learning algorithms are typically designed to learn functions from examples. This is a very general paradigm, but it does not explicitly capture some aspects of human learning. Humans, in contrast, are able to learn both by being shown examples of the task to accomplish and by being told rules or instructions about this task. They can even provide relevant rules to others once they have learned the task from examples. As a realistic illustration of this ability, consider the task of a personal assistant who, among other things, is expected to make movie recommendations based on the age and interests of a user. Even for a task such as this, which would currently be considered a standard use case for an example-based recommender system, we as humans do not learn how to perform the task exclusively by observing examples of movie recommendations. Instead, we can accomplish it much more efficiently by also taking into account relevant knowledge in the form of rules that have been communicated to us explicitly, i.e., by "learning with rules". For recommending a movie to a girl called Anna, we may, among others, use the rules (and facts, which we consider a special case of rules) illustrated on the left side of Figure 1. In addition to the ultimate goal of providing movie recommendations (e.g., "What movie could Anna watch?"), we would also expect a human to be able to answer the intermediate questions shown on the right side of Figure 1, i.e., to "learn about rules" that are relevant to the ultimate task, and we would expect these questions to be answered consistently w.r.t. the provided knowledge and ultimate recommendation. These questions allow us to introspect the understanding of the assistant, e.g., to debug why a movie recommendation was not as expected.

[Figure 1: Personal assistants answer questions using knowledge consisting of rules and facts. Note that the last knowledge bullet point can also be construed as an example of the underlying "movie recommendation" task, while the other bullet points represent other relevant knowledge. The first bullet point is a traditional "rule" that states conditional knowledge that can apply to many different movies. The second is a concept definition, which can be equivalently construed as a rule relating two different pieces of information about a person. The remaining bullet points are facts stated at varying levels of granularity.]

Similar interactions between learning from examples and learning with and about rules can also be observed in simpler synthetic tasks. Consider, for instance, the SCAN task of Lake & Baroni (2017), which our work builds on. This task requires the learner to translate natural language commands (such as "jump left twice") to corresponding actions (such as "LTURN JUMP LTURN JUMP"). The learner is presented with a certain subset of some thousands of (command, action sequence) pairs during training and is then expected to translate unseen commands. This focus on learning purely from examples, while typical of most traditional ML tasks, differs from the way one would "teach" a human the task, and indeed from the way the authors of the SCAN paper "teach" the task to their readers. On the one hand, while humans are also adept at guessing rules from examples, they can often grasp the relevant rule from just a handful of examples rather than depending on thousands (Lake et al., 2015), as we as readers may find ourselves doing when seeing the handful of illustrative SCAN examples that the authors provide in their figures. More fundamentally, however, rather than expecting readers to learn the translation function purely from examples, the authors provide this function in a much more direct and efficient fashion using a set of interpretation rules like those in Figure 2.
The explicit nature of the provided rules has the additional advantage that it allows us to deduce the translation of a given command by applying the translation rules, rather than always having to speculatively induce the translation by generalizing from a set of examples. In this paper, we introduce conceptual learning tasks (CLTs), a type of learning task that is specifically designed to evaluate the combination of such inductive and deductive learning, and we make the following contributions:

• We define the notion of a CLT (Section 2).
• We present a first simple instance of a CLT called Conceptual SCAN (cSCAN), a synthetically constructed conceptual learning variation of SCAN (Section 3).
• We formalize metrics to measure a learner's performance on cSCAN, including a novel measurement of consistency between learned rules and their application (Section 4).
• We analyze the performance of baseline ML architectures on cSCAN and identify three challenge areas: consistency, rule set size, and compositional generalization (Section 6).
• We provide the code used for generating cSCAN, constructing compositional generalization splits, and calculating consistency from experiment results.

2. CONCEPTUAL LEARNING TASKS (CLTS)

2.1. DESIRED PROPERTIES

Motivated by the use case from the introduction and our goal of evaluating learning with and about rules, we interest ourselves in tasks with the following properties, which we formalize in the next section.

1. Context. The learner answers requests based on some explicit knowledge (the context), which consists of "examples" that directly determine the replies to certain requests and "rules" that provide indirect information for doing so. Part of this knowledge varies across contexts (e.g., transient knowledge about Anna's preferences or country-specific rules about movie ratings).

2. Request. In addition to requests corresponding to the ultimate goal (e.g., "What movie could Anna watch?"), we can ask the learner whether an intermediate rule holds given the context (e.g., "Is Jerry Maguire appropriate for 14-year-olds?"). This allows us to test whether the learner "understands" the rules by checking for consistency between rules that the learner claims to be true (or false) and their application when answering the ultimate requests.

3. Output. The learner needs to indicate in the output whether a reply follows deductively, and thus monotonically, from the context (e.g., "How old is Anna? ⇒ 14") or whether it requires generalizing the context inductively (e.g., "Will Anna like Mission Impossible 6? ⇒ Probably"), which would imply that the reply could change if more information were to become available (i.e., that the reply is defeasible). The learner also needs to identify when a request cannot be answered based on the given context (e.g., "Who is Anna's best friend? ⇒ I don't know").

4. Compositionality of rules. The rules have a compositional structure, which means that it is possible to determine the meaning of an unknown rule from its syntax, given the meaning of a sufficient subset of rules. (E.g., we as humans understand the meaning of the rules shown in Figure 1 because of the compositional syntax and semantics of natural language. Even if we have never seen these exact sentences before, we know what they mean based on our exposure to other sentences built from the same building blocks.)

2.2. STRUCTURAL DEFINITION

As a way of providing a concrete but generic task format that satisfies these properties, we define a conceptual learning task (CLT) as a supervised learning task T with the following structure. (See Appendix A for a discussion of the design choices.)

• The task T = {e_1, ..., e_N} is a finite set of examples e_k ∈ E, where E = 2^(Q×R) × Q × R × U denotes the set of possible examples.

• Each example e_k = ⟨C_k, q_k, r_k, u_k⟩ is a quadruple consisting of a context C_k ∈ 2^(Q×R), a request q_k ∈ Q, a reply r_k ∈ R, and a qualifier u_k ∈ U. Of these, the context and request together form the input of the ML system, while the reply and qualifier together form the output. (See Appendix O for details on the exact format in which these are input and output by our T5 baseline systems.)

• The set of possible requests Q can be partitioned into rule requests Q_R ⊆ Q (i.e., requests that ask whether a certain rule holds) and non-rule requests Q_N ⊆ Q.

• The set of possible replies R must contain dedicated elements representing true (1), false (0), and unknown (?), which are the only valid replies for rule requests q ∈ Q_R.

• The set of qualifiers U = {M, D} indicates whether the reply follows monotonically from the context (M) or whether it is defeasible (D).

• Each context C ∈ 2^(Q×R) consists of a set of context examples e_i = ⟨q_i, r_i⟩, which represent either examples of an underlying task or other relevant knowledge expressed in "example" form. Note that unlike the top-level examples, these context examples do not themselves contain a context or qualifier, as for the purposes of this paper we take all context examples to be unconditional and monotonic. (See Appendix C for a possible generalization of this definition.)

Note that even though the context is strictly a set of examples, it may still contain any rule q ∈ Q_R by means of including the example ⟨q, 1⟩, which asserts the rule q to be true.
As a more complete illustration, Figure 3 shows a few examples from the cSCAN task, which is a CLT that we introduce in Section 3, while Appendix B shows the motivating example represented as a CLT.
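For concreteness, the example structure above can be sketched as Python data types. This is only an illustrative sketch; the class and field names below are our own and not part of the task definition.

```python
from dataclasses import dataclass
from typing import FrozenSet

Qualifier = str  # "M" (monotonic) or "D" (defeasible)

@dataclass(frozen=True)
class ContextExample:
    """A context example <q_i, r_i>: no context or qualifier of its own."""
    request: str
    reply: str  # for rule requests: "1", "0", or "?"

@dataclass(frozen=True)
class Example:
    """A top-level example <C_k, q_k, r_k, u_k>."""
    context: FrozenSet[ContextExample]
    request: str
    reply: str
    qualifier: Qualifier

# A rule is asserted in the context by including the example <q, 1>:
rule = ContextExample(request="[[x1 twice]] = [[x1]] [[x1]]", reply="1")
ex = Example(
    context=frozenset({rule, ContextExample("walk", "WALK")}),
    request="walk twice",
    reply="WALK WALK",
    qualifier="M",
)
```

Since contexts are unordered sets, a frozen (hashable) set type is a natural fit for representing them.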

2.3. CONSISTENCY REQUIREMENTS

In addition to satisfying the structural definition, for the purposes of this paper we require CLTs to be logically consistent, which means that each example and the example set as a whole must be non-contradictory. For instance, if a CLT contains a monotonic example ⟨C, q, r, M⟩, it may not contain an example ⟨C′, q, r′, M⟩ where C ⊆ C′ and r′ ≠ r, as this would violate monotonicity. (Using the example from the introduction, this means, for instance, that if for a given context the task contains "How old is Anna? ⇒ 14" and "Is Jerry Maguire appropriate for 14-year-olds? ⇒ No", it should not contain "Is Jerry Maguire appropriate for Anna? ⇒ Yes".) While this requirement could be relaxed in practice, consistency of the task is helpful, as it enables us to precisely measure the consistency of the learner. For the task to be logically consistent, it is required at a minimum that the monotonic examples adhere to the axioms of propositional logic. (See Appendix D for a formalization of the consistency requirements.) Note that even while requiring a CLT to be logically consistent, we still allow for the common phenomenon in which rules are stated in a form that is only "usually" true. Exceptions to a rule are allowed so long as the rule is assigned the qualifier "defeasible" (D).
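The monotonicity requirement can be checked mechanically, as in the following sketch. The tuple layout (context, request, reply, qualifier) and the helper name are our own; Appendix D gives the full formalization.

```python
from itertools import combinations

def monotonicity_violations(examples):
    """Find pairs of monotonic examples <C, q, r, M> and <C', q, r', M>
    with the same request, one context a subset of the other, and
    conflicting replies. Each example is (context, request, reply, qualifier),
    with context given as a frozenset of (request, reply) pairs."""
    violations = []
    for (c1, q1, r1, u1), (c2, q2, r2, u2) in combinations(examples, 2):
        if u1 != "M" or u2 != "M" or q1 != q2 or r1 == r2:
            continue
        if c1 <= c2 or c2 <= c1:  # one context extends the other
            violations.append(((c1, q1, r1), (c2, q2, r2)))
    return violations

ctx = frozenset({("Anna was born in January of 2008.", "1")})
bigger = ctx | {("The current date is June 3, 2021.", "1")}
ok = [(ctx, "How old is Anna?", "14", "M")]
bad = ok + [(bigger, "How old is Anna?", "15", "M")]
```

Here `ok` is consistent, while `bad` violates monotonicity: enlarging the context flips a monotonic reply.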

2.4. CLASSIFICATION OF EXAMPLES

The structure of CLTs allows us to classify examples according to the following dimensions.

• Request q ∈ Q: rule request (q ∈ Q_R) vs. non-rule request (q ∈ Q_N)
• Reply r ∈ R: known vs. unknown (?), and, for rule requests, true (1) vs. false (0)
• Qualifier u ∈ U: monotonic (M) vs. defeasible (D)

Each class of example should be reasonably well represented in a conceptual learning dataset.
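These three dimensions can be captured by a small labeling helper, sketched below. The function name and label format are our own.

```python
def classify(request_is_rule, reply, qualifier):
    """Label an example along the three classification dimensions:
    request kind, reply knowledge status, and qualifier."""
    kind = "rule" if request_is_rule else "non-rule"
    known = "unknown" if reply == "?" else "known"
    return (kind, known, qualifier)
```

For instance, a true rule example that follows monotonically from the context would be labeled `("rule", "known", "M")`, while a non-rule request with an unknown, defeasible reply would be labeled `("non-rule", "unknown", "D")`.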

3. CONCEPTUAL SCAN (CSCAN)

One benefit of abstracting the key properties of interest from our motivating use case into the definition of a CLT is that we can now study the capabilities of current ML methods on this family of tasks by starting with a very simple CLT instance that illustrates the key dynamics of a CLT in as pure a form as possible. In particular, it gives us a basis for constructing a simple synthetic task that enables exploration of ML system performance on CLTs while carefully controlling the task complexity. In this section, we present one such "simplest possible" CLT, named Conceptual SCAN (cSCAN). cSCAN is a conceptual learning adaptation of the SCAN task (Lake & Baroni, 2017), which was itself originally presented as a kind of "simplest possible" compositional generalization task. We construct cSCAN according to a recipe illustrating one possible way of deriving a CLT from a base task.

3.1. BASE TASK: SCAN

SCAN is a task where natural language commands (such as "jump left twice") are translated into sequences of actions (such as "LTURN JUMP LTURN JUMP"). SCAN was designed to evaluate compositional generalization using non-standard train-test splits, which is one of the themes we explore in this paper. In addition, SCAN is an attractive choice of base task for a first conceptual learning benchmark because it can be generated automatically from a simple set of rules: a phrase-structure grammar for generating the valid input commands and a set of interpretation rules that specifies how to compute the action sequence for each command (see Figure 2 and Appendix E.1).
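As a rough illustration of how such interpretation rules compose, the following sketch implements a small subset of them recursively. The rule set here is illustrative only, not the full SCAN grammar of Figure 2.

```python
# Primitive interpretation rules, e.g. [[jump]] = JUMP.
PRIMITIVES = {"walk": "WALK", "run": "RUN", "jump": "JUMP", "look": "LOOK"}

def interpret(command):
    """Recursively apply SCAN-style interpretation rules such as
    [[x1 twice]] = [[x1]] [[x1]] and [[x1 left]] = LTURN [[x1]]."""
    words = command.split()
    if words[-1] == "twice":
        inner = interpret(" ".join(words[:-1]))
        return " ".join([inner] * 2)
    if words[-1] == "thrice":
        inner = interpret(" ".join(words[:-1]))
        return " ".join([inner] * 3)
    if words[-1] == "left":
        return "LTURN " + interpret(" ".join(words[:-1]))
    if words[-1] == "right":
        return "RTURN " + interpret(" ".join(words[:-1]))
    return PRIMITIVES[command]
```

Applied to the paper's running example, `interpret("jump left twice")` composes the "left" and "twice" rules around the "jump" primitive to yield "LTURN JUMP LTURN JUMP".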

3.2. CONSTRUCTING CSCAN

In contrast to SCAN, which tests whether a learner can learn one specific translation from commands to actions, the goal of cSCAN is to test whether the learner can learn to perform a family of SCAN-like tasks using knowledge that consists of an arbitrary mix of rules and examples. Note that moving from a single task to a family of related tasks is essential for cSCAN because it forces the learner to take into account the knowledge provided by the context rather than just memorize behavior that is constant across all examples. In our original motivating example, this corresponds to the fact that we do not want a learner to make movie recommendations in a single, fixed scenario, but rather based on knowledge that may differ from person to person and evolve over time. cSCAN is constructed using the following recipe.

Step 1: Accommodate the base task. A CLT subsumes the base task. This means that any input of SCAN is a valid request q ∈ Q, and any output of SCAN is a valid reply r ∈ R.

Step 2: Make some of the relevant knowledge available as explicit rules. We choose which part of the relevant knowledge we want to be able to talk about explicitly and vary across examples. For cSCAN, we choose to talk only about interpretation rules like those shown in Figure 2. This means that any interpretation rule is a valid rule request q ∈ Q_R, which allows us to assert such a rule in the context (by adding the example ⟨q, 1⟩ to the context) and to teach or ask the learner whether any such rule holds. CLTs require that explicit rules have a compositional structure (Section 2.1). This is the case for cSCAN because the meaning of the interpretation rules can be determined using a compositional grounding function that we provide in Appendix E.

Step 3: Generate the examples. We make sure that the different example classes are evenly represented (see the statistics in Section 3.3) and that, for simplicity, contexts contain only unconditional, monotonic examples that do not conflict with one another.
We provide a detailed specification of cSCAN in Appendix E.

3.3. THE CSCAN DATASET

Examples. The examples in Figure 3 are based on two contexts (C1 and C2) that are incompatible (e.g., they define the meaning of "walk" differently) and correspond to different SCAN variations. Each context contains some explicit rules (e.g., ⟦x1 and x2⟧ = ⟦x1⟧ ⟦x2⟧ in C1) as well as rules that are illustrated only implicitly via examples (e.g., ⟦x1 twice⟧ = ⟦x1⟧ ⟦x1⟧ in C1). The qualifier of an example indicates whether any implicit rules are needed to determine the reply (in which case the qualifier is D). In context C2, we provide explicit rules for neither ⟦x1 twice⟧ nor ⟦x1 thrice⟧. While the rule ⟦x1 twice⟧ = ⟦x1⟧ ⟦x1⟧ ⟦x1⟧ is expected to be induced from the provided examples, there is no obvious rule that can be induced for ⟦x1 thrice⟧. As a consequence, we expect the learner to reply "unknown" (?) to the request "jump thrice". At the same time, we expect the learner to reply "false" (0) for the rule ⟦x1 thrice⟧ = ⟦x1⟧ ⟦x1⟧ because there is an example in the context that contradicts it.
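The kind of reasoning expected here, rejecting a candidate rule when a context example contradicts it and otherwise answering "unknown", can be illustrated with a toy checker. This is a hypothetical helper for illustration only; in the actual task the learned model, not hand-written code, is expected to perform this reasoning.

```python
def check_thrice_rule(context_examples, base_meanings):
    """Decide the reply for the candidate rule
    [[x1 thrice]] = [[x1]] [[x1]]: return "0" if some context
    example (command, actions) contradicts it, else "?".
    base_meanings maps primitive commands to their action tokens."""
    for command, actions in context_examples:
        head, _, mod = command.rpartition(" ")
        if mod == "thrice" and head in base_meanings:
            predicted = " ".join([base_meanings[head]] * 2)
            if actions != predicted:
                return "0"  # contradicted by a context example
    return "?"  # no evidence either way

# Hypothetical C2-style context in which "thrice" means four repetitions:
c2 = {("walk", "WALK"), ("walk thrice", "WALK WALK WALK WALK")}
```

With this context, the candidate rule is contradicted ("0"); with no "thrice" example in the context at all, the checker has no evidence and returns "?".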

Rule space variants.

In order to cover a range of rule set complexities, we construct two versions of cSCAN (cSCAN-B and cSCAN-X) using different sizes of rule space (Table 1). cSCAN-B (short for "cSCAN-Base") uses a fixed phrase-structure grammar equivalent to that of the original SCAN task, as reformulated in Nye et al. (2020) using 14 interpretation rules. Action sequences are kept short by allowing a token or variable to be repeated at most twice in an interpretation rule's output sequence. This ensures that cSCAN-B examples do not exceed an input size of 2048 tokens or an output size of 256 tokens in our baseline models. cSCAN-X (short for "cSCAN-Extended") is based on a richer grammar space, which extends the original SCAN phrase-structure grammar with additional terminals, variations of existing rules, and an additional level that enables adverbs. Output sequences for individual interpretation rules are allowed to contain up to 4 repetitions of any given token or variable, which is the same as in original SCAN but longer than in cSCAN-B. To keep context sizes manageable, we apply rule sampling for each context, so that the number of interpretation rules actually used in any given context is the same as in cSCAN-B, and we apply a limit to the context size.

[Table 1: cSCAN rule space variants. cSCAN-B uses a fixed phrase-structure grammar equivalent to that of the original SCAN task, while keeping action sequences short. cSCAN-X is based on a richer grammar space, while using rule sampling to keep the number of rules used in any given context equivalent to the number of rules in cSCAN-B and the reformulation of original SCAN.]

Splitting methods. For each of the two sizes of rule space, we prepare datasets based on two splitting methods: random and maximum compound divergence (MCD) (Keysers et al., 2020). For the cSCAN Random variants, we generate 1200 contexts and split these contexts randomly into a train set of 1000 contexts and validation and test sets of 100 contexts each.
For the cSCAN MCD variants, we apply a variation of the MCD splitting algorithm of Keysers et al. (2020) in order to evaluate the ability of the system to compositionally generalize to new rule combinations, which we consider a stronger test of the system's ability to "understand" the meaning of the rules and to apply them correctly in new situations. Note that compared to the random split, the splitting method used in the cSCAN MCD variants leads to a somewhat less balanced dataset in terms of example classes, although each of the classes is still well covered. Also, while it leads to a challenging split in terms of generalization to new top-level request patterns, it is potentially easier than the random split in terms of the contexts shown: we do not prevent the same context from appearing in both train and test, and due to the effect of down-sampling from a larger context pool, the total number of contexts shown in the MCD train set is an order of magnitude greater than the number shown in the random train set.

4. METRICS

Accuracy. Our primary accuracy metric is example-level accuracy, where credit is given only when the reply and qualifier together exactly match the ground truth. For more nuanced error analysis, we secondarily track several accuracy variants that give credit for partially correct outputs (see Appendix N).

Consistency. A key aspect of learning with rules is that a learner does not just memorize and recite rules, but is actually able to combine and apply them consistently. For instance, in cSCAN, if the learner believes that the rules ⟦x1 twice⟧ = ⟦x1⟧ ⟦x1⟧ ⟦x1⟧ and ⟦walk⟧ = JUMP hold, a consistent learner should also believe that all combinations and applications of these rules hold, such as ⟦walk twice⟧ = JUMP JUMP JUMP. Note that unlike accuracy, this notion of consistency does not judge the correctness of individual predictions. Instead, it judges the degree to which a whole set of predictions is consistent within itself. While a perfectly accurate learner would also be perfectly consistent, when accuracy is low to moderate, consistency can be quite orthogonal to accuracy. As an illustration, consider Table 3, which shows a golden example set as well as the predictions of two learners A and B. Learner A is more accurate because it gets 2 examples correct, whereas learner B gets none of the examples correct. At the same time, learner A is not consistent, because it is not able to correctly apply the two rules that it believes in to derive the example ⟨walk twice, WALK WALK⟩. In contrast, learner B is perfectly consistent, as it correctly combines the rules it believes in to derive the example ⟨walk twice, JUMP JUMP JUMP⟩. To capture this notion of consistency, we introduce for any set of predictions E′ ⊆ E the consistency metric C(E′), which is the percentage of subsets of E′ that contain a logical implication, relative to the number of subsets of E′ that contain an implication or a contradiction. This means that C(E′) is a value in [0, 100]: C(E′) = 100 says that the set E′ is perfectly consistent, while C(E′) = 0 says that E′ is completely inconsistent. (See Appendix D for a formalization of this metric and Appendix G for practicalities of its calculation.) For learner A, the example ⟨C1, walk twice, JUMP JUMP⟩ contradicts the rules ⟨C1, ⟦x twice⟧ = ⟦x⟧ ⟦x⟧, 1⟩ and ⟨C1, ⟦walk⟧ = WALK, 1⟩, which means that the consistency of learner A is 100 · 0/1 = 0. For learner B, the example ⟨C1, walk twice, JUMP JUMP JUMP⟩ is implied by the rules ⟨C1, ⟦x twice⟧ = ⟦x⟧ ⟦x⟧ ⟦x⟧, 1⟩ and ⟨C1, ⟦walk⟧ = JUMP, 1⟩, which means that the consistency of learner B is 100 · 1/1 = 100.
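For the single relevant rule combination in the Table 3 illustration, the metric boils down to the following toy computation. This helper is a sketch of ours; the general metric over all subsets of predictions is formalized in Appendix D.

```python
def toy_consistency(believed_twice_reps, believed_walk, predicted):
    """Consistency over the one relevant subset: the learner's two
    believed rules ("twice" repeats its argument believed_twice_reps
    times; "walk" maps to believed_walk) either imply or contradict
    the learner's prediction for "walk twice"."""
    derived = " ".join([believed_walk] * believed_twice_reps)
    implications = 1 if predicted == derived else 0
    contradictions = 1 - implications
    # C = 100 * implications / (implications + contradictions)
    return 100 * implications / (implications + contradictions)
```

Learner A believes ⟦x twice⟧ = ⟦x⟧ ⟦x⟧ and ⟦walk⟧ = WALK yet predicts "JUMP JUMP" for "walk twice", giving a contradiction and consistency 0; learner B believes ⟦x twice⟧ = ⟦x⟧ ⟦x⟧ ⟦x⟧ and ⟦walk⟧ = JUMP and predicts "JUMP JUMP JUMP", giving an implication and consistency 100.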

5. BASELINES

As baselines, we evaluate variations of T5 (Raffel et al., 2019), a Transformer encoder-decoder model (Vaswani et al., 2017), which, when pre-trained on natural language, served as a strong baseline on the original SCAN task (Furrer et al., 2020; Csordás et al., 2021; Ontañón et al., 2021). The most computationally and memory intensive of the T5 architectures that we evaluate is the standard T5 architecture, which applies full self-attention in the encoder, in addition to self-attention and cross-attention in the decoder. We refer to these models simply as T5. Motivated by the potentially large context size associated with CLTs, we further evaluate two variants of T5 that were designed to scale more efficiently to longer input sequences. LongT5 (Guo et al., 2022) reduces the computational load of attention in the encoder by applying local attention within a sliding window (Ainslie et al., 2020). LongT5-TGlobal (Guo et al., 2022) extends LongT5 with a local-global mechanism in which local attention is supplemented by transient global attention.

6. EXPERIMENTS

6.1. RANDOM SPLITS

As a first set of experiments, we compare baseline performance on the random splits of the cSCAN-B and cSCAN-X datasets. The results are shown in Table 4. On the smaller cSCAN-B rule space, it can be seen that the provided 100K examples are sufficient for a pre-trained full-attention Transformer to achieve accuracies in excess of 90%, with accuracy increasing steadily with model size. Even in this relatively simple setting, however, several challenges can be observed. First, appropriate pre-training of the model appears to be critical: all the Transformer variants, when trained from scratch, managed to learn only superficial statistical correlations, as evidenced by their failure to outperform a naive baseline. Second, regarding LongT5 and LongT5-TGlobal, while it is possible that performance could be improved through more thorough hyperparameter tuning, our initial results show that these models struggle on the conceptual learning task.
Specifically, the non-pretrained versions suffer from the same poor performance as non-pretrained T5, while when fine-tuning from an existing checkpoint, we were not able to find a converging setup. One possible explanation is that unlike document summarization tasks, for which LongT5 produced strong results (Guo et al., 2022), CLTs may depend heavily on full attention over the context. If so, this could pose challenges in scaling to real-world conceptual learning tasks with even larger contexts. Third, while consistency scores for the evaluated models correlate roughly with accuracy, significantly more headroom remains in consistency, with even the best-performing T5-Large scoring under 0.9, while the naive baseline and all non-pretrained models score close to 0. On the cSCAN-X rule space, accuracy drops significantly for all sizes of T5, suggesting that scaling to larger rule sets will be a challenge. Consistency continues to correlate with accuracy for these models but drops rapidly as performance degrades. Non-pretrained models continue to fail to outperform the naive baseline.

6.2. COMPOSITIONAL GENERALIZATION

As a second set of experiments, we evaluate the ability of baseline solutions to compositionally generalize on CLTs using the cSCAN-B MCD and cSCAN-X MCD datasets (Table 5). Prior research on semantic parsing tasks has shown that while pre-trained Transformers exhibit strong performance on specialized cases of compositional generalization, they tend to struggle with more complex forms of compositional generalization, as reflected in low performance on MCD splits when an appropriate notion of "atom" and "compound" is identified (Furrer et al., 2020). Here we show that in the context of a conceptual learning task, one form of compositional generalization that is challenging for T5-based models is generalization to new syntactic patterns in the request, even in the relatively easy setting where the same contexts can appear in train as in test. Specifically, as can be seen in the cSCAN-B MCD results, when switching from the random to the MCD split, accuracy drops from over 90% to less than 70%, even for the largest of the pre-trained T5 models, illustrating that compositional generalization is a challenge for these models independently of the size of the rule space. Accuracy on cSCAN-X MCD is roughly similar to both cSCAN-B MCD and cSCAN-X Random, suggesting that the challenges of compositional generalization and rule space size do not necessarily compound. Note also that while accuracies for non-pretrained models are somewhat higher on the MCD splits than on the random splits, this is not actually a sign of stronger performance, but rather an artifact of the mix of example classes that occur in the different splits, due to the down-sampling performed when constructing the MCD split. As shown in Table 8, a side effect of this splitting algorithm was a relative increase in the proportion of defeasible rule examples vs. monotonic or non-rule examples, and in the proportion of examples with "unknown" as the reply.
This leads to an increase in the accuracy achievable by the naive "T5 w/o Context" baseline. On the MCD splits, as on the random splits, none of the non-pretrained models manages to outperform the naive baseline. For the MCD splits, we do not report consistency metrics: due to technicalities of the MCD splitting algorithm, there ended up being insufficient signal for logical implications among the test examples, leaving the consistency metric undefined in most cases. (See Appendix F for details.)

7. RELATED WORK

Here we discuss the most closely related lines of research. See Appendix K for more related work.

Tasks providing knowledge as context. In representing the input of a CLT as a request paired with a context, we build on a long tradition of QA and reasoning task formulations that provide knowledge relevant to a task via various forms of context, such as a text passage (Kwiatkowski et al., 2019; Weston et al., 2015), a set of natural language statements (Talmor et al., 2020), a knowledge graph fragment (Sinha et al., 2020), or a grid world (Ruis et al., 2020).

8. CONCLUSIONS AND FUTURE WORK

In this paper, we presented the cSCAN benchmark as a first instance of a "conceptual learning task" (CLT), following a task format motivated by a personal assistant use case. Through experiments on baseline solutions, we identified several challenge areas with headroom for improvement. As next steps, we are interested in exploring solutions to CLTs, including prompting of large language models, neuro-symbolic solutions, and improved ways of handling large set-like contexts. In parallel, we are interested in exploring CLTs based on a wider range of base tasks and rule formats, including non-synthetic tasks and tasks that draw on a full KB as context.

9. REPRODUCIBILITY STATEMENT

Experiments Appendix L describes the details of the baseline configurations that we used, together with other details of the environment in which we ran the experiments reported in this paper, while Appendix O provides details of the input-output format. Upon paper acceptance, we plan to release on GitHub both the cSCAN datasets and the code needed to reproduce the experiments.

Dataset generation

The cSCAN datasets themselves were synthetically generated using a configuration-driven Python program described in Appendix F, which we also plan to open-source, together with the specific configurations used for each of the cSCAN datasets. While regeneration of the datasets is not necessary for reproducing our experimental results, researchers can use this code to generate new conceptual learning datasets based either on the existing cSCAN grammar spaces or on modified grammars. When run with the provided configurations, the code reproduces the generation of datasets with statistically comparable content to the official cSCAN datasets.

APPENDIX A CLT DESIGN CHOICES

Here we discuss the rationale behind the design choices made in our definition of a conceptual learning task (CLT).

Splitting the input into a set-like context and a request. The main goal of CLTs is to test whether a learner is capable of learning from explicitly provided knowledge consisting of rules and examples. Assuming a supervised learning setup as a basis, the applicable knowledge must somehow be provided in the input of each example. It is therefore quite natural to split the input into two parts: context and request. Since the examples and rules that form the background knowledge do not usually have a specific order (in the movie recommendation example, for instance, it does not matter whether we are first told that Top Gun is rated PG-13 or that Jerry Maguire is rated R), representing the context as an unordered set is a natural choice.

Representing rule assertions as examples. Another important property of CLTs is the ability to ask the learner explicitly whether a certain rule holds (in a given context). One straightforward way to achieve this is to include examples where the request asks for the validity of a certain rule and the output provides the corresponding truth value. These kinds of examples also provide us with a natural way to assert (or refute) rules in the context, which allows us to represent the context simply as a set of examples (rather than a heterogeneous set containing both examples and a dedicated representation of rule assertions and refutations).

Distinguishing monotonic and defeasible replies. Once we include the context as part of the input, we can distinguish two different methods by which the learner may determine the reply for a given request: deduction and induction. For deduction, the learner infers the reply for a given request from the context using deductive reasoning alone.
As an illustration, consider an example where we assert in the context that Top Gun is a PG-13 movie and then ask for the rating of Top Gun in the request (see Appendix D for a formalization of the logic axioms underlying deductive reasoning). For induction, the information provided by the context is not sufficient to unambiguously determine the reply for a given request. As an illustration, suppose that we ask for the rating of Jerry Maguire in an example whose context asserts that both Mission Impossible 1 and Jerry Maguire are movies starring Tom Cruise and that Mission Impossible 1 is rated PG-13. This information is not sufficient for us to deduce the answer. Instead, the learner needs to rely on inductive bias to determine whether it should speculatively generalize the PG-13 rating from Mission Impossible 1 to Jerry Maguire or whether it should play it safe and say that it doesn't know. Deductive reasoning is always monotonic w.r.t. the context, i.e., its conclusions remain valid when the context is extended, whereas inductively derived replies are defeasible.
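To make this structure concrete, the following minimal Python sketch (the class and field names are our own illustration, not part of the task definition) shows how the deductive and inductive cases above could be encoded as examples with a set-like context, a request, a reply, and a monotonic/defeasible qualifier:

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

# An unconditional context example is simply a (request, reply) pair.
Fact = Tuple[str, str]

@dataclass(frozen=True)
class Example:
    context: FrozenSet[Fact]
    request: str
    reply: str
    qualifier: str  # "M" (monotonic) or "D" (defeasible)

# Deductive case: the rating of Top Gun is asserted in the context,
# so the reply follows monotonically.
deductive = Example(
    context=frozenset({("Top Gun is rated PG-13.", "1")}),
    request="What is the rating of Top Gun?",
    reply="PG-13",
    qualifier="M",
)

# Inductive case: the reply generalizes speculatively from the
# context, so it is marked defeasible.
inductive = Example(
    context=frozenset({
        ("Mission Impossible 1 is a movie starring Tom Cruise.", "1"),
        ("Jerry Maguire is a movie starring Tom Cruise.", "1"),
        ("Mission Impossible 1 is rated PG-13.", "1"),
    }),
    request="What is the rating of Jerry Maguire?",
    reply="PG-13",
    qualifier="D",
)
```

Representing the context as a frozenset mirrors the design choice above that background knowledge is an unordered collection.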

B MOTIVATING EXAMPLE AS CLT

Here we show what the motivating example from the introduction (Figure 1) could look like when formulated in the syntax of a CLT. Here we take Q to be the space of natural language statements and questions.

Context containing assertions of relevant knowledge:

C1 = {
⟨"R rated movies are not appropriate for kids who are less than 17 years old.", 1⟩,
⟨"The age of a person is the time that has passed since the person was born.", 1⟩,
⟨"The current date is June 3, 2021.", 1⟩,
⟨"Mission Impossible 1-6 are PG-13 rated action movies starring Tom Cruise.", 1⟩,
⟨"Jerry Maguire is an R rated comedy starring Tom Cruise.", 1⟩,
⟨"Anna was born in January of 2008 and Tom Cruise is her favorite actor.", 1⟩,
⟨"Anna saw Mission Impossible 1-5 and liked all of them.", 1⟩,
... }

Top-level example set containing a mixture of non-rule examples (natural language QA for movie recommendations) and rule examples (which probe the model's understanding of intermediate steps in the recommendation process):

E = {
⟨C1, "What movie could Anna watch?", "Mission Impossible 6", D⟩,
⟨C1, "Anna is 14 years old.", 1, M⟩,
⟨C1, "Anna is at least 17 years old.", 0, M⟩,
⟨C1, "Jerry Maguire is appropriate for 14-year-olds.", 0, M⟩,
⟨C1, "Anna will like Mission Impossible 6.", 1, D⟩,
⟨C1, "Tom Cruise is Anna's best friend.", ?, D⟩,
... }

The above is a relatively straightforward translation of the motivating example into CLT syntax, while using the format of rule examples for all of the context examples and for all of the intermediate questions. If we assume that the request space Q and reply space R include natural language questions and answers for querying about background knowledge, as well as for the end goal of providing movie recommendations, the above example could alternatively be expanded to represent some of the background knowledge in non-rule format (e.g., ⟨"What is the rating of Jerry Maguire?", "R"⟩) and/or to represent some of the intermediate questions as non-rule top-level examples (e.g., ⟨C1, "How old is Anna?", "14 years", M⟩).

C GENERALIZATION OF CLTS TO SUPPORT NESTED CONTEXTS

As discussed in Section 2, since the context of each top-level example in a CLT is itself represented as a set of "examples", we have similar but slightly different notions of "example" at two different levels in a CLT:

• Top-level examples, which we represent as quadruples ⟨context, request, reply, qualifier⟩.
• Context examples, which in Section 2 we require to be unconditional and monotonic, and which we thus represent as simple ⟨request, reply⟩ pairs.

While for the simple cSCAN task, it was sufficient to provide only unconditional monotonic examples in the context, in the general case, one could imagine extensions of the notion of a CLT to allow inclusion of conditional and/or defeasible examples in the context as well. In this more general view of a CLT, we can drop the distinction between top-level examples and context examples, and instead adopt a recursive structure in which each example is a quadruple ⟨context, request, reply, qualifier⟩, while each context is a set of examples.

One motivation for this more general view of CLTs is if we were to think of each top-level CLT example as representing one observation of the behavior of a personal assistant with very strong decision-making capabilities (which we might call the "teacher"), whose behavior at all times is conditioned on the knowledge available to it, and whose knowledge is stored in a large and growing set-like knowledge base. In this view, the context of the top-level example can be thought of as a snapshot of the relevant contents of the assistant's knowledge base at the time that the assistant received the given request and output the given reply and qualifier. Now let us further suppose that we have another personal assistant (which we might call the "student"), which itself has some kind of growing set-like knowledge base, whose contents may or may not agree with the contents of the teacher's knowledge base.
One interesting question is how the student can go about selectively "learning" from the teacher, so as to emulate its decision-making capabilities, without necessarily accepting wholesale the full contents of its knowledge base, which may also include information that is transient, situation-specific, or debatable, and which may thus not be relevant or appropriate for the student to adopt.

One natural approach could be to simply select the top-level examples of interest (i.e., to select the specific instances of the teacher's behavior that the student wishes to emulate) and assert those examples in the student's knowledge base. This would be the equivalent of storing the knowledge that "if such and such things (i.e., the contents of the example's context) were true, then when faced with the given request, this would be the appropriate reply and the appropriate qualifier". Under this approach, the student's set-like knowledge base would now come to contain full CLT examples (including context and qualifier).

Continuing with the view of a "context" as being a snapshot of some or all of an assistant's knowledge base, our contexts could now contain examples that themselves contain non-empty contexts. The maximum depth to which we allow such recursive structures to continue could be considered an arbitrary choice in the design of a conceptual learning system or task.

C.1 GENERALIZED CLT

Below is a formalization of this more general definition of a CLT, along with some shorthand notations, by which the simpler CLT definition from Section 2 can be seen as just a special case of the general definition.

• Each input i_k = ⟨C, q⟩ is a pair consisting of a context C ∈ 2^E and a request q ∈ Q. The set of possible requests Q can be partitioned into rule requests Q_R ⊆ Q (i.e., requests that ask whether a certain rule holds) and non-rule requests Q_N ⊆ Q.
• Each output o_k = ⟨r, u⟩ is a pair consisting of the reply r ∈ R and a qualifier u ∈ U. The set of possible replies R must contain dedicated elements representing true (1), false (0), and unknown (?), which are the only valid replies for rule requests q ∈ Q_R. The set of qualifiers U = {M, D} indicates whether the reply follows monotonically from the context (M) or whether it is defeasible (D).

For conciseness, we use the flat notation ⟨C, q, r, u⟩ to mean the nested pairs ⟨⟨C, q⟩, ⟨r, u⟩⟩, we use ⟨C, q, r⟩ as a shorthand for the monotonic example ⟨C, q, r, M⟩, and we use ⟨q, r⟩ as a shorthand for the unconditional monotonic example ⟨∅, q, r, M⟩. Hence, the example ⟨C, q, r⟩ means that given the context C or any superset of C, the request q is translated to the reply r, while ⟨q, r⟩ means that under all circumstances, the request q is translated to the reply r.

Note that even though the context is strictly a set of examples, it may still contain any rule q ∈ Q_R by means of including the example ⟨∅, q, 1, M⟩, which asserts the rule q to be true unconditionally. Similarly, we can express that the rule q does not hold in a context by including the example ⟨∅, q, 0, M⟩. We refer to an example as unconditional if it has an empty context. Note also that since examples within a context are of the same form as top-level examples, they may in principle themselves contain contexts up to arbitrary levels of nesting. We can, however, for any given CLT, choose a maximum level to which we allow such nesting to occur.
In cSCAN, for example, contexts contain only unconditional examples, so that there is no nesting of contexts.

C.2 ADDITIONAL SHORTHAND FOR RULE ASSERTIONS IN A CONTEXT

To express the assertion of a rule q ∈ Q_R in contexts more concisely, we write the bare rule q to mean its unconditional monotonic assertion ⟨∅, q, 1, M⟩. For instance, ⟨{q}, q′, r⟩ stands for ⟨{⟨∅, q, 1, M⟩}, q′, r⟩. In the example from Figure 1, if we take Q to be the space of natural language statements and questions, then with the above shorthand the following would be equivalent:

• ⟨{⟨∅, "Anna was born in January of 2008.", 1, M⟩}, "Who is Anna's best friend?", ?, D⟩
• ⟨{"Anna was born in January of 2008."}, "Who is Anna's best friend?", ?, D⟩
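As an illustrative sketch (the tuple encoding and helper names are our own, hypothetical choices), this shorthand expansion can be expressed mechanically: a bare rule in a context expands to its unconditional monotonic assertion.

```python
# Encode an example as a tuple (context, request, reply, qualifier),
# where a context is a frozenset of examples. These helpers expand the
# two shorthands of Section C.2.

EMPTY = frozenset()

def assert_rule(q):
    """Shorthand: bare rule q  ->  <EMPTY, q, 1, M>."""
    return (EMPTY, q, "1", "M")

def uncond(q, r):
    """Shorthand: <q, r>  ->  <EMPTY, q, r, M>."""
    return (EMPTY, q, r, "M")

# The two equivalent forms of the example from Figure 1:
short = (frozenset({assert_rule("Anna was born in January of 2008.")}),
         "Who is Anna's best friend?", "?", "D")
full = (frozenset({(EMPTY, "Anna was born in January of 2008.", "1", "M")}),
        "Who is Anna's best friend?", "?", "D")
```

After expansion, the two notations denote literally the same object (`short == full` holds), which is the sense in which the shorthand is purely syntactic.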

D FORMALIZATION OF CONSISTENCY REQUIREMENTS AND METRICS

In this appendix, we formalize the consistency requirements and consistency metrics of CLTs by mapping example sets into classical propositional logic. We start by summarizing the definitions and axioms of propositional logic, and we specify functions to capture the meaning of rules and the inductive bias. This allows us to map example sets to logical formulas, which in turn allows us to formalize the consistency requirements as well as the metrics for measuring consistency of a learner. For the purposes of this formalization, we assume the more general form of a CLT described in Appendix C, for which the simpler CLT form used for cSCAN follows as a special case. Note that while we provide this formalization for reference purposes and to facilitate future investigation into the formal properties of CLTs, it is possible in practice to use CLTs and the related consistency metric without detailed consideration of this formalization.

D.1 TERMINOLOGY

If not otherwise indicated, we assume that C, C′ ∈ 2^E are arbitrary contexts, q, q′ ∈ Q are arbitrary requests, r, r′ ∈ R are arbitrary replies, u, u′ ∈ U are arbitrary qualifiers, E, E′, F ⊆ E are arbitrary example sets, and e, e′, f are arbitrary examples. Furthermore, we assume that M(e) and D(e) denote the monotonic and defeasible variants of the example e, respectively. This means that for any u ∈ U, M(⟨C, q, r, u⟩) = ⟨C, q, r, M⟩ and D(⟨C, q, r, u⟩) = ⟨C, q, r, D⟩.

D.2 CLASSICAL PROPOSITIONAL LOGIC

For reference in later proofs, we summarize the basic definitions and axioms of Łukasiewicz's classical propositional logic (Klement, 2004), which we adopt as-is, with only a change of symbols for the logical connectives to avoid ambiguity with the ∧, ∨, and ¬ symbols that we use elsewhere in this document in first-order logic statements. We assume a set of propositional variables V which represent atomic formulas. General logical formulas L are recursively constructed from these atomic formulas using the logical connectives → (implication), ¬ (negation), ∧ (and), ∨ (or), and ≡ (equivalence). In our formalization, we assume the primitive connective → and the constant ⊥ (falsum), and we define the other connectives as follows (assuming that x, y, z ∈ L).

¬x := x → ⊥ (1)
x ∨ y := ¬x → y (2)
x ∧ y := ¬(x → ¬y) (3)
x ≡ y := (x → y) ∧ (y → x) (4)
⊤ := ¬⊥ (5)

To formulate the propositional logic axioms and inference rules, we use the notation x ⊢ y to express that we can infer y from x, and we use ⊢ x to express that x is a tautology.

⊢ x → (y → x) (6)
⊢ (x → (y → z)) → ((x → y) → (x → z)) (7)
⊢ (¬x → ¬y) → (y → x) (8)
x, (x → y) ⊢ y (9)

The axioms (6), (7), (8) form the Łukasiewicz system, while the inference rule (9) is modus ponens.
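Since the connectives (1)-(3) are defined from → and ⊥ alone, both their classical behavior and the tautology of the axiom schemes (6)-(8) can be checked mechanically by truth tables. The following Python sketch (our own illustration, not part of the formalization) performs this check:

```python
from itertools import product

def imp(x, y):   # material implication
    return (not x) or y

def neg(x):      # (1)  not-x := x -> falsum
    return imp(x, False)

def lor(x, y):   # (2)  x or y := not-x -> y
    return imp(neg(x), y)

def land(x, y):  # (3)  x and y := not-(x -> not-y)
    return neg(imp(x, neg(y)))

for x, y, z in product([False, True], repeat=3):
    # Derived connectives agree with their classical semantics.
    assert lor(x, y) == (x or y)
    assert land(x, y) == (x and y)
    # Axiom (6): x -> (y -> x)
    assert imp(x, imp(y, x))
    # Axiom (7): (x -> (y -> z)) -> ((x -> y) -> (x -> z))
    assert imp(imp(x, imp(y, z)), imp(imp(x, y), imp(x, z)))
    # Axiom (8): (not-x -> not-y) -> (y -> x)
    assert imp(imp(neg(x), neg(y)), imp(y, x))
```

All assertions pass over every truth-value assignment, confirming that the definitions (1)-(3) and axioms (6)-(8) behave classically.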

D.3 SEMANTICS OF RULES AND INDUCTIVE BIAS

To map example sets to logical formulas, we need to be able to refer to the semantics of rules and the inductive bias.

For the semantics of rules, we assume a grounding function G : Q_R → 2^E, which maps each rule to an equivalent set of examples. This means that in a consistent task, a rule q ∈ Q_R holds (i.e., ⟨∅, q, 1⟩ is a valid example) if and only if each example in G(q) is valid. It also means that if the context of an example e ∈ E contains the assertion ⟨∅, q, 1⟩ of a rule q ∈ Q_R, we can obtain an equivalent example e′ by replacing the assertion of q with the examples G(q). We provide the grounding function of cSCAN in Appendix E.3.

Since a CLT requires a learner to induce as well as deduce rules, every CLT inherently assumes some form of inductive bias, which determines the criteria based upon which a learner is expected to induce a rule to be "true" vs. considering it to be "unknown". While the inductive bias of the task is in principle arbitrary (being dependent on the choice of the task designer or the needs of the real-world use case), the learner will need to be able to emulate this bias in order to perform well on the task. We formalize this domain-specific inductive bias through a bias function B : 2^E → 2^{E_D}, which maps each set of examples to the set of examples that are expected to be induced. We use E_M, E_D ⊆ E to denote the subsets of all monotonic and defeasible examples, respectively. We provide the bias function of cSCAN in Appendix E.4.

While the grounding and bias functions are needed to precisely formalize the consistency requirements and metrics, they are not a requirement for a conceptual learning task. This is important because providing complete grounding and bias functions may not be feasible or practical for more realistic CLTs. Instead, we may only provide partial functions, which means that the formalization in this section will only approximate the true consistency requirements and consistency metrics.

D.4 MAPPING EXAMPLE SETS TO LOGICAL FORMULAS

We treat each example as a propositional variable (i.e., ⟨C, q, r, u⟩ ∈ V). This allows us to define the embedding function M : 2^E → L as follows.

M(∅) := ⊤ (10)
M({⟨C, q, r, u⟩}) := ⟨C, q, r, u⟩ (11)
M(E ∪ E′) := M(E) ∧ M(E′) (12)

The empty set maps to true (10), a set of size one maps to its only element (11), and union maps to logical conjunction (12). In addition, our embedding adheres to the following axioms.

⟨C, q, r, M⟩ ≡ (M(C) → ⟨∅, q, r, M⟩) (13)
⟨C, q, r, M⟩ → ⟨C, q, r, D⟩ (14)
M(C) ≡ M(C′) :  ⟨C, q, r, D⟩ ≡ ⟨C′, q, r, D⟩ (15)
r ≠ r′ :  ⟨C, q, r, D⟩ ∧ ⟨C, q, r′, D⟩ → (M(C) → ⊥) (16)
⟨∅, q, 1, M⟩ ≡ M(G(q)) (17)
M(E) → M(B(E)) (18)

Axiom (13) says that the context of a monotonic example becomes the antecedent of a logical implication, and axiom (14) specifies that a monotonic example implies the corresponding defeasible example. Axiom (15) says that defeasible examples that differ only in equivalent contexts are equivalent. Axiom (16) specifies that defeasible examples are functional or the context must be contradictory, which means that each request with a non-contradictory context must have a unique reply. Finally, axiom (17) specifies that the assertion of a rule is equivalent to its grounding, and axiom (18) says that each set of examples implies the set of examples that can be induced using the inductive bias.

Together with the axioms of propositional logic, we obtain the following theorems.

⟨C, q, r, M⟩ → ⟨C ∪ C′, q, r, M⟩ (19)
M({⟨C, q, 1, M⟩}) ≡ M({⟨C ∪ C′, q′, r′, u′⟩ : ⟨C′, q′, r′, u′⟩ ∈ G(q)}) (20)
r ≠ r′ :  ⟨C, q, r, u⟩ ∧ ⟨C, q, r′, u′⟩ → (M(C) → ⊥) (21)

Theorem (19) is obtained from axioms (6), (13) and (12), and it says that monotonic examples behave monotonically w.r.t. their context. This allows us to rewrite the grounding axiom (17) to obtain theorem (20). The last theorem (21) says that functionality applies independently of the qualifier.

D.5 CONSISTENCY METRIC

As discussed in Section 4, the consistency metric C(E) differs from the standard accuracy metric in that it does not judge the correctness of a learner's predictions on individual examples, but rather measures the degree to which a complete set of examples E is consistent w.r.t. the axioms of classical propositional logic. In essence, we define C(E) to be the percentage of subsets of E that contain a logical implication, in comparison to the number of subsets of E that contain an implication or a contradiction.

Note that the definition of the consistency metric provided in equation (26) could essentially be applied as-is to any arbitrary CLT, provided there is some way to identify the "implications" (impl(E)) and "contradictions" (cont(E)) among the learner's predictions. In practice, task designers may be free to apply any reasonable heuristic to identify such implications and contradictions. The formal definition of the consistency metric below can be considered an ideal, which we seek to emulate closely in the consistency metric implementation provided for cSCAN, as described in Appendix G.

E → f :⇔ ⊬ ¬M(E) ∧ ⊢ M(E) → M({f}) ∧ (∄E′ ⊊ E : ⊢ M(E′) → M({f})) ∧ ¬(f ∈ E_D ∧ ⊢ M(E) → M({M(f)})) (22)

E ↛ f :⇔ (⊬ ¬M(E) ∧ ⊢ M(E) → ¬M({f}) ∧ (∄E′ ⊊ E : ⊢ M(E′) → ¬M({f}))) ∨ (f ∈ E_D ∧ E → M(f)) (23)

impl(E) := {F ⊆ E : (∃f ∈ F : F \ {f} → f)} (24)
cont(E) := {F ⊆ E : (∃f ∈ F : F \ {f} ↛ f)} (25)

C(E) := 100 · |impl(E)| / (|impl(E)| + |cont(E)|) if |impl(E)| + |cont(E)| > 0, and NaN otherwise (26)

Minimal implication E → f means that the non-contradictory example set E implies the example f and that there is no strict subset of E that has the same property. Furthermore, we require matching qualifiers, which means that E must not imply the monotonic variant of f if the latter is marked as defeasible (22). The minimal contradiction E ↛ f can be met by either of the following two conditions (23).
First, E ↛ f holds if the non-contradictory example set E implies the negation of f and there is no strict subset of E that has the same property. Second, E ↛ f holds if there is a qualifier mismatch, i.e., if f is marked defeasible but E minimally implies the monotonic variant of f. Note that we do not consider it to be a qualifier mismatch if f is marked as monotonic and E implies only the defeasible variant, because there may be other evidence outside of E that may justify the monotonic qualifier. This allows us to define impl(E) to be the set of subsets F ⊆ E that contain a minimal implication (24) and cont(E) to be the set of subsets F ⊆ E that contain a minimal contradiction (25). Finally, the consistency metric C(E) is the percentage of subsets of E that contain implications, in comparison with the number of subsets of E that contain implications or contradictions (26). If the set E does not contain any implications or contradictions, then the consistency metric is not defined.

Illustration. As an illustration, consider the following example set E, which consists of 4 rule assertions and 8 examples. (We assume the syntax and semantics of cSCAN, which is formally specified in Appendix E.)
E := {
⟨C_1, u twice = u u, 1⟩, (a)
⟨C_1, x after y = y x, 1⟩, (b)
⟨C_1, walk = WALK, 1⟩, (c)
⟨C_1, jump = JUMP, 1⟩, (d)
⟨C_1, walk, WALK⟩, (e)
⟨C_1, eat twice, EAT EAT⟩, (f)
⟨C_1, walk twice, WALK WALK WALK WALK⟩, (g)
⟨C_1, jump twice, JUMP JUMP, D⟩, (h)
⟨C_1, walk after walk, WALK WALK⟩, (i)
⟨C_1, jump after walk, WALK JUMP, D⟩, (j)
⟨C_1, walk after jump, WALK JUMP⟩, (k)
⟨C_1, walk twice after jump, JUMP WALK WALK⟩ (l)
}

In this example set there are a total of 5 minimal implication sets:

• {c, e}
• {b, c, i}
• {b, e, i}
• {a, b, c, d, l}
• {a, b, e, d, l}

And there are a total of 8 minimal contradiction sets, of which the following are due to an inconsistent reply:

• {a, c, g}
• {a, e, g}
• {b, c, d, k}
• {b, e, d, k}
• {b, d, g, l}

while the following are due to an inconsistent qualifier:

• {a, d, h}
• {b, c, d, j}
• {b, e, d, j}

Based on this, the consistency of example set E would be C(E) = 100 · 5/13 ≈ 38.5. Note that example d is both implied and contradicted. Also, since examples c and e are semantically equivalent according to cSCAN's grounding function (Appendix E.3), each minimal implication or contradiction set that involves example c can be written alternatively using example e, causing such sets to appear in pairs in the lists above.
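The arithmetic of this illustration can be reproduced with a short Python sketch (our own; note that, matching the counts in the illustration, it works directly from the minimal implication and contradiction sets rather than enumerating all subsets as in equations (24)-(26)):

```python
# Minimal implication and contradiction sets for the example set E
# above, referring to the examples by their labels a-l.
minimal_implications = [
    {"c", "e"}, {"b", "c", "i"}, {"b", "e", "i"},
    {"a", "b", "c", "d", "l"}, {"a", "b", "e", "d", "l"},
]
minimal_contradictions = [
    # inconsistent reply:
    {"a", "c", "g"}, {"a", "e", "g"}, {"b", "c", "d", "k"},
    {"b", "e", "d", "k"}, {"b", "d", "g", "l"},
    # inconsistent qualifier:
    {"a", "d", "h"}, {"b", "c", "d", "j"}, {"b", "e", "d", "j"},
]

def consistency(impl, cont):
    """Percentage of implications among implications + contradictions,
    NaN when neither occurs (cf. equation (26))."""
    total = len(impl) + len(cont)
    return float("nan") if total == 0 else 100 * len(impl) / total

print(round(consistency(minimal_implications, minimal_contradictions), 1))
# -> 38.5
```

With 5 implications and 8 contradictions, the metric evaluates to 100 · 5/13 ≈ 38.5, as stated above.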

D.6 CLT: DEFINITION AND CONSISTENCY REQUIREMENTS

To obtain a precise measure for the consistency of the predictions produced by a given learner, it is important that the task T itself be consistent. Specifically, this means that each example e ∈ T must be consistent on its own, and the example set T as a whole must be consistent. For the purposes of this paper, we further require that contexts are non-contradictory. We capture these requirements using a predicate consistent_CLT(T), which is recursively defined as follows.

consistent_CLT({⟨C, q, r, u⟩}) :⇔ consistent_CLT(C) ∧ ⊬ ¬⟨C, q, r, u⟩ (27)
∧ (u = M ⇔ ⊢ M({⟨C, q, r, u⟩})) (28)
consistent_CLT(E) :⇔ (∀e ∈ E : consistent_CLT({e})) ∧ (C(E) = 100 ∨ C(E) = NaN) (29)

We first define the consistency of an individual example ⟨C, q, r, u⟩ ∈ E, which requires that its context C is consistent (and therefore non-contradictory), that the example itself is not a contradiction (27), and that it is qualified as monotonic if and only if it maps to a tautology (28). Then, we define that an example set E ⊆ E is consistent if and only if each individual example e ∈ E is consistent and the example set is consistent as a whole (29) (i.e., C(E) must be 100 or undefined).

Note that while the above formulation of CLT consistency requirements is sufficient for cSCAN, it could be desirable to adjust this for more complex tasks, which we leave for future work. In particular, while this definition of consistency provides only basic control over the behavior of defeasible examples, we could imagine defining stricter consistency requirements for defeasible examples, e.g., by requiring them to adhere to the KLM properties (Kraus et al., 1990; Casini et al., 2021). Also, while in this paper, for simplicity, we require contexts to be non-contradictory, this requirement is not strictly necessary for the task to be consistent or for the consistency metrics to be meaningful, as long as the top-level examples in the dataset do not contain a contradiction or contradict each other.
Taken a step further, for dealing with real-world datasets, which may contain noise, it may be desirable to relax the requirement that the task be strictly consistent. In such an approach, the "requirements" of a CLT can be considered more as an aspiration than as strict requirements. The consistency metric could in that case still be calculated, but it would be a less reliable indicator of learner consistency than in the case where the task is consistent.

E SPECIFICATION OF CSCAN

This appendix contains a specification of the cSCAN task. It consists of the phrase-structure grammars to generate valid requests and replies, the compositional grounding function (which defines the meaning of the explicit rules), and the bias function (which defines the inductive bias). Together with the formalization of simple CLTs provided in Appendix D, this specifies the complete behavior of cSCAN. Note that in this section, as in Appendix D, we assume the more general CLT formalism described in Appendix C, in which contexts can contain examples of the same form as the top-level examples. While the cSCAN specification could be expressed equivalently in terms of the simplified CLT formalism of Section 2, the more general formalism allows us to express some aspects of the specification more concisely, as we can thus describe the semantics of both top-level examples and context examples using the same grounding function. For readability, we make use of the shorthand notation described in Section C.2 to allow expressing some of the more verbose examples more concisely. Note also that while we provide the complete formal specification of cSCAN here as a reference, it is not necessary in general to provide a specification at this level of detail when defining future CLTs.

E.1 RULE SPACES

Most of the specification of cSCAN is identical between cSCAN-B and cSCAN-X. In this section, we summarize the points that differ between the two, together with some notes on the original SCAN task for comparison.

Original SCAN

In the original SCAN task, natural language commands are generated by a fixed phrase-structure grammar as described in Lake & Baroni (2017), which is equivalent to the phrase-structure grammar shown for cSCAN-B in the top left in Figure 5 . (Note that for readability, we renamed the non-terminals in this presentation to proceed alphabetically, beginning with "S" as the traditional start symbol.) The mapping from command to action sequence follows the fixed set of interpretation rules shown in Figure 4 . As can be seen in the figure, action sequences in the original SCAN task are constructed from a set of 6 possible actions. A:={WALK,RUN,JUMP,LOOK,LTURN,RTURN}
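For intuition, the interpretation rules of Figure 4 can be approximated by a toy interpreter. The following Python sketch (our own illustration, not the SCAN reference implementation; it covers only a few primitives and the "twice"/"thrice"/"and"/"after" modifiers, and ignores directions such as "left" and "around") shows how commands compose:

```python
# Toy interpreter for a fragment of SCAN-style interpretation rules.
# Precedence (loosest to tightest): "after", "and", "twice"/"thrice".
PRIMITIVES = {"walk": "WALK", "run": "RUN", "jump": "JUMP", "look": "LOOK"}

def interpret(command: str) -> str:
    if " after " in command:
        x, y = command.split(" after ", 1)
        return interpret(y) + " " + interpret(x)   # "x after y" -> y x
    if " and " in command:
        x, y = command.split(" and ", 1)
        return interpret(x) + " " + interpret(y)
    if command.endswith(" twice"):
        body = interpret(command[: -len(" twice")])
        return " ".join([body] * 2)
    if command.endswith(" thrice"):
        body = interpret(command[: -len(" thrice")])
        return " ".join([body] * 3)
    return PRIMITIVES[command]

print(interpret("walk twice"))              # WALK WALK
print(interpret("jump after walk"))         # WALK JUMP
print(interpret("walk twice after jump"))   # JUMP WALK WALK
```

Because the modifiers bind more tightly than "and", which in turn binds more tightly than "after", "walk twice after jump" yields "JUMP WALK WALK", matching the precedence behavior that the cSCAN grounding function is careful to preserve.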

cSCAN-B

In cSCAN-B, the set of non-rule requests Q N consists of all natural language commands that are generated by the phrase-structure grammar shown on the top left in Figure 5 , which is equivalent to the original SCAN phrase structure grammar, as reformulated in Nye et al. (2020) . The mapping from command to action sequence varies from context to context. cSCAN-B constructs action sequences from the same 6 actions used in the original SCAN task, plus several additional ones provided for diversity. (In a nod to earlier research on the SCAN task, we follow the lead of Nye et al. (2020) in using as these additional actions the ones that appear in the "MiniSCAN" task of Lake et al. (2019).) A:={WALK,RUN,JUMP,LOOK,LTURN,RTURN, RED,YELLOW,GREEN,BLUE,PURPLE,PINK,BLACK,WHITE}

cSCAN-X

In cSCAN-X, the set of non-rule requests Q N consists of all natural language commands that are generated by the phrase-structure grammar shown on the top left in Figure 6 . The mapping from command to action sequence varies from context to context, similarly to cSCAN-B. cSCAN-X constructs action sequences from a set of 13 possible actions. A:={WALK,RUN,JUMP,LOOK,LTURN,RTURN, DRIVE,RIDE,FLY,LEAP,PEEK,UTURN,DTURN}

E.2 REQUESTS AND REPLIES

For simplicity, this section focuses on a formal description of cSCAN-B. The specification of cSCAN-X follows the same form, except for the differences described in Appendix E.1 above.

Non-rule requests. In cSCAN, the set of non-rule requests Q_N consists of all natural language commands that are generated by the phrase-structure grammar shown on the top left in Figure 5. Note that for convenience of generation, the grammar adopted here is based on the alternative formulation of the SCAN grammar from Nye et al. (2020). This generates a slightly larger set of commands than the original SCAN grammar from Lake & Baroni (2017), as it includes commands such as "turn" and "turn and turn".

Replies. Since cSCAN is a CLT, the set of replies R includes the dedicated replies 1 (true), 0 (false), and ? (unknown). In addition, R contains all sequences consisting of actions from A (described in Appendix E.1 above) separated by spaces, with a maximum sequence length N: {a_1 a_2 ... a_K : 1 ≤ K < N, a_i ∈ A} ⊂ R.

Rule requests. RHS expressions are sequences that may mix actions and LHS expressions (where Q*_N denotes the set of LHS expressions, i.e., commands that may contain variables from X): R* := {a_1 a_2 ... a_K : 1 ≤ K < N, a_i ∈ A ∪ Q*_N}. We assume a function var : R* → 2^X, which returns the set of variables used by a given RHS expression. This allows us to define the set of rule requests Q_R as pairs of an LHS and an RHS expression with matching variables: Q_R := {q = r : q ∈ Q*_N, r ∈ R*, var(q) = var(r)}.
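The matching-variables condition var(q) = var(r) is straightforward to check mechanically. The following Python sketch (our own; it assumes, purely for illustration, that variables are written x1, x2, ...) extracts the variable sets and validates a candidate rule request:

```python
import re

def var(expr: str) -> set:
    """Return the set of variables (tokens of the form x1, x2, ...)
    occurring in an LHS or RHS expression."""
    return set(re.findall(r"\bx\d+\b", expr))

def is_rule_request(lhs: str, rhs: str) -> bool:
    """A rule "lhs = rhs" is well-formed only if both sides use
    exactly the same variables: var(q) = var(r)."""
    return var(lhs) == var(rhs)

print(is_rule_request("x1 twice", "x1 x1"))  # True
print(is_rule_request("x1 twice", "x1 x2"))  # False: x2 unbound on the LHS
```

Rules like "x1 twice = x1 x1" are accepted, while a rule whose RHS introduces a variable not present on the LHS (or vice versa) is rejected.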

E.3 GROUNDING OF RULES

We specify the semantics of cSCAN rules via a grounding function G : Q_R → 2^E, which maps each rule to an equivalent set of examples. For rules where the LHS consists of a command without any variables, we can define the grounding as an example that provides the interpretation of this command. For example, the rule walk = WALK is grounded as follows:

G(walk = WALK) := {⟨walk, WALK⟩}

Note that because we determine consistency based on propositional logic equivalence (see Appendix D), we could specify equivalent groundings that include additional examples that are logically implied by {⟨walk, WALK⟩}. For instance, we could add variants with non-empty contexts (e.g., ⟨{run = RUN}, walk, WALK⟩), which follow from monotonicity.

For rules containing variables, the grounding may consist of hundreds or thousands of examples, even if we do not include any examples that are logically implied. This is because each variable can be replaced with any command that leads to a valid LHS. For example, the rule x_1 twice = x_1 x_1 can be grounded as follows:

G(x_1 twice = x_1 x_1) := {⟨{run = RUN}, run twice, RUN RUN⟩, ...}

The first example in this grounding can be read as: if run is translated to "RUN", then "run twice" is translated to "RUN RUN".

Compositional grounding function of cSCAN. Since the semantics of cSCAN rules is compositional (which is a requirement for all CLTs), we are able to specify the grounding function in a complete yet concise fashion using the following helper constructs. For simplicity, we focus here on describing the compositional grounding function for cSCAN-B. The grounding function for cSCAN-X follows the same general form.

• For any RHS sequence r ∈ R*, we use lhs(r) to denote the set of elements of r that are LHS expressions in Q*_N. For example, lhs(WALK JUMP "run left" "look thrice") = {"run left", "look thrice"}.
• For any subset of LHS expressions Q ⊆ Q*_N, we use c2a(Q) to denote the set of all possible functions f : Q → A* that map each expression in Q to an action sequence in A*. (Note that this also applies to the empty set, i.e., there is exactly one function in c2a(∅).)
• We define subsets of the commands Q_N generated by the phrase-structure grammar in Figure 5. For each z ∈ {T, U, V, W, X}, the set Q^z_N ⊂ Q_N denotes the commands that are generated when starting from the symbol z (rather than S).
• For each z ∈ {T, U, V, W, X}, we assume a function var_z : Q*_N → 2^X, which maps each LHS expression q ∈ Q*_N to a subset of its variables var(q). For any expression q ∈ Q*_N, the following holds: var_T(q) consists of the variables that are generated using the rule path T → U, U → V, V → W, W → Y; var_U(q) consists of the variables that are generated using the rule path U → V, V → W, W → Y; var_V(q) consists of the variables that are generated using the rule path V → W, W → Y; var_W(q) consists of the variables that are generated using the rule W → Y; and var_X(q) consists of the variables that are generated using the rule X → Y.
• For all LHS expressions q ∈ Q*_N, we use v2c(q) to denote the set of all possible functions f : var(q) → Q_N that map each variable in var(q) to a command in Q_N such that ∀z ∈ {T, U, V, W, X} : f(var_z(q)) ⊆ Q^z_N. Note that this restriction of the mapping ensures that variable substitution does not break the implicit precedence rules that apply to the interpretation of commands (see the discussion below for more detail).
• For any RHS expression r ∈ R* and any partial function f : R* ∪ X → R*, we use subst(r, f) to denote the RHS expression that we obtain by lexically substituting in r all occurrences of r′ ∈ dom(f) with f(r′).
These constructs allow us to define the grounding of all rules (q = r) ∈ Q_R as follows:

G(q = r) :=
∪_{f ∈ c2a(lhs(r))} { ⟨ ∪_{q′ ∈ dom(f)} {⟨∅, q′ = f(q′), 1⟩}, q = subst(r, f), 1 ⟩ },  if var(q) = ∅
∪_{f ∈ v2c(q)} { ⟨∅, subst(q, f) = subst(r, f), 1⟩ },  otherwise (34)

Discussion and illustration. The first case of definition (34) applies when the rule q = r does not contain any variables. It says that whenever an LHS expression q′ occurs in the RHS expression r, we can substitute q′ with an arbitrary action sequence f(q′), as long as we make sure that the context contains a rule asserting that q′ is interpreted as f(q′). As an illustration, consider the grounding of the rule run twice and jump = run run jump LTURN that is shown below. The set of LHS expressions on the right-hand side of this rule is {run, jump}. The first example of the grounding can be read as: if run maps to "RUN" and jump maps to "JUMP", then the RHS of the rule becomes "RUN RUN JUMP LTURN". The second case of definition (34) applies when the rule q = r contains a set of variables. It says that we can substitute any variable x with a command f(x) that corresponds to the same non-terminal in the parse tree.

G(run twice and jump = run run jump LTURN) =
{ ⟨{⟨∅, run = RUN, 1⟩, ⟨∅, jump = JUMP, 1⟩}, run twice and jump = RUN RUN JUMP LTURN, 1⟩, ... }

As an illustration of the second case, consider the grounding of the rule x_1 twice = x_1 x_1 LTURN shown below. In the first example of the grounding, we replace the variable x_1 with the command "run", in the second example we replace it with "jump around left", and in the last example we replace it with "look thrice".

G(x_1 twice = x_1 x_1 LTURN) = {
⟨∅, run twice = run run LTURN, 1⟩,
⟨∅, jump around left twice = jump around left jump around left LTURN, 1⟩,
⟨∅, look thrice twice = look thrice look thrice LTURN, 1⟩,
... }

Note that because x_1 ∈ var_U(x_1 twice) and x_1 ∉ var_T(x_1 twice), the variable x_1 cannot be substituted with commands such as "walk and jump" that are in Q^T_N but not in Q^U_N. This is important because the grounding would otherwise contain examples that violate the higher precedence of "twice" when compared to "and", which is not what we intended. I.e., our formalization makes sure that:

⟨∅, walk and jump twice = walk and jump walk and jump LTURN, 1⟩ ∉ G(x_1 twice = x_1 x_1 LTURN)

E.4 INDUCTIVE BIAS

As discussed in Appendix D.3, the inductive bias of a CLT is the set of criteria based upon which a learner is expected to induce a rule to be "true" as opposed to considering it to be "unknown". For cSCAN, we adopt a simple set of criteria, based loosely on our own intuition, in which we consider there to be sufficient evidence to support induction of a rule if there are examples in the context that could be explained by the given rule (in combination with other rules that are explicitly or implicitly provided in the context) and which in total illustrate at least 4 different substitutions of each of the rule's variables. The number 4 is arbitrary, but based on our intuition that we would be comfortable generalizing rules from a relatively small number of examples, but that 2 examples is not quite enough to justify inducing a general pattern.
We chose the threshold of 4 rather than 3 because we use the inductive bias criteria internally for determining the minimum number of illustrating examples to include in the context for each rule that we intend to be induced to be "true", and we wanted to avoid penalizing a learner that is slightly cautious in its inductions. At the same time, we took care to avoid illustrating any rule's variable substitutions exactly three times, so as to avoid penalizing a learner that is just slightly on the aggressive side in its inductions. In this way, while we describe the task's inductive bias formally as requiring a minimum of 4 illustrated variable substitutions, a learner could succeed on the cSCAN task by adopting a minimum threshold of either 3 or 4 substitutions. We formalize this domain-specific inductive bias of cSCAN with the bias function B : 2^E → 2^E_D. To avoid circularity, we assume a variation M* of the logic embedding M defined in Appendix D.4, with the only difference being that M* does not use axiom (18), which depends on the inductive bias. Similarly, we use →* to denote the variant of the minimal implication → that is based on M* (rather than M). As we mentioned in Section 3.2, we expect the learner to induce a rule if it has been illustrated with a sufficient number of variable substitutions. To formalize this, we define the function num_subst : X × Q_R × 2^Q → N such that num_subst(x,q,Q) is the number of different expressions that can be substituted for the variable x in the rule q to obtain a rule in Q (35). We also define the function min_subst : Q_R × 2^Q → N such that min_subst(q,Q) is the minimum number of such substitutions for any variable used in q (36).
num_subst(x,q,Q) := |{f(x) ∈ Q_N : (∃ f ∈ X → R* : subst(q,f) ∈ Q)}|     (35)

min_subst(q,Q) := min_{x ∈ var(q)} num_subst(x,q,Q),  if |var(q)| > 0;  0, otherwise     (36)

For any E ⊆ E, the bias function B(E) is then defined as follows:

Q1(E,C) := ∪{Q ⊆ Q_R : (∃ e ∈ E : (E \ {e}) ∪ {⟨C,q,1⟩ : q ∈ Q} →* e)}     (37)

B1(E) := {⟨C,q,1,D⟩ ∈ E_D : q ∈ Q1(E,C) ∧ min_subst(q, Q1(E,C)) ≥ 4}     (38)

B?(E) := {⟨C,q,?,D⟩ ∈ E_D : (∀ r ∈ {1,0} : ¬(M*(E ∪ B1(E)) →* M*(⟨C,q,r,D⟩)))}     (39)

B(E) := B1(E) ∪ B?(E)     (40)

The set Q1(E,C) is the set of candidate rules with context C that we may want to induce from the example set E. Specifically, it is the union of all sets of rules Q that, together with E \ {e}, allow us to (minimally) explain some example e ∈ E (37). The set B1(E) contains the assertions of all rules that are expected to be induced to be true based on the examples provided by E. Specifically, it consists of the assertions ⟨C,q,1,D⟩ of all the candidate rules q for which Q1(E,C) contains instances with at least 4 different substitutions for each variable (38). B?(E) contains the assertions of all rules for which E does not provide any evidence about whether or not they hold. Specifically, it contains an example ⟨C,q,?,D⟩ for all contexts C and rules q for which E does not imply a clear answer, i.e., 1 or 0 (39). Finally, B(E) is the union of all the rules that should be induced to be true and all the rules that should be induced to be unknown (40).

Discussion. Because all cSCAN contexts exclusively contain unconditional, monotonic examples, it is sufficient to induce rules with the empty context, i.e., we only need to consider Q1(E,∅). The consistency criterion in B1 is relatively simple, which makes the task easier for humans. Indeed, we only need to make sure that each induced rule q is consistent with E, but we do not need to check whether the induced rules are consistent with each other. This allows us to greedily induce rules one at a time without worrying about potential conflicts among them.

F DATASET GENERATION

The dataset generation process is config-driven, with a different dataset spec config being defined for each of the cSCAN datasets. All cSCAN examples are automatically generated by a Python program built on the NLTK library (Bird et al., 2009). Generation is performed via the following steps.

Context generation. We first generate an initial set of rules, analogous to the rule set shown in Figure 4 for the original SCAN task. While many of these rules may never be shown directly to the learner, this initial rule set serves as a kind of "basis rule set" from which the truth value of all other possible rules can be derived. We then randomly choose which of those basis rules to (a) provide explicitly in the context via an example that asserts that rule to be "true", (b) illustrate implicitly through some set of context examples sufficient to satisfy the task's inductive bias (see Appendix E.4 for details), (c) illustrate insufficiently or not at all, or (d) contradict in one or more cases, so that it should be inferable to be "false". By generating contexts via the above procedure, we ensure that the basis rules cover each of the different possible replies and qualifiers for rule examples: monotonically "true" (case a), defeasibly "true" (case b), "unknown" (case c), and either monotonically or defeasibly "false" (case d). By extension, this also ensures that we achieve a diverse mixture of possible replies and qualifiers across the much larger set of rules that could be derived from different combinations of those basis rules. We ensure that the exact ratio between the above cases varies randomly from context to context, while achieving across the dataset as a whole the desired ratio that is configured in the dataset spec.

Top-level example generation. Once we have fixed a context, we then randomly generate a set of (request, reply, qualifier) triples corresponding to the top-level examples that we wish to generate using the given context.
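The random assignment of basis rules to presentation cases (a)-(d) described above might be sketched as follows. All names, ratios, and the jitter mechanism here are illustrative assumptions, not the actual pipeline code:

```python
# Illustrative sketch: assign each basis rule to one of the cases
# (a) explicit, (b) implicitly illustrated, (c) insufficient/absent,
# (d) contradicted, with per-context ratios jittered around the
# dataset-level targets from the (hypothetical) dataset spec.
import random

CASES = ["explicit", "implicit", "insufficient", "contradicted"]

def assign_cases(basis_rules, target_ratios, rng):
    """Randomly assign each basis rule to one case, weighted by the targets."""
    # Per-context jitter so the exact ratio varies from context to context.
    jitter = [max(1e-6, r + rng.uniform(-0.05, 0.05)) for r in target_ratios]
    return {rule: rng.choices(CASES, weights=jitter)[0] for rule in basis_rules}

rng = random.Random(0)
assignment = assign_cases([f"rule_{i}" for i in range(20)],
                          [0.4, 0.3, 0.2, 0.1], rng)
assert set(assignment.values()) <= set(CASES)
```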
To aid in sampling such examples, we first construct a pair of inference engines, in which we exhaustively generate all possible examples that would be considered "true" either monotonically or defeasibly (respectively) based on the given context. To generate a non-rule example or a positive rule example, we then randomly sample a "true" example from the full set of examples that were inferred by one of those inference engines. To construct a negative rule example, we first sample a positive example and then apply one of a number of different heuristics to construct an example that is similar to that positive example, but which is not among the examples inferred by the inference engine. These two inference engines encapsulate the logic needed to ensure that each top-level example satisfies the consistency criteria stated in Section 2.3 and agrees with the domain-specific inductive bias.

Sub-sampling. To ensure that the different example classes are evenly represented in the dataset, we perform the top-level example generation process described above in separate streams, each of which is dedicated to generating one specific example class (e.g., "positive monotonic rule examples" or "non-rule examples with reply of unknown"). We then sub-sample examples from each of the different streams in order to achieve the desired ratio of examples from each of the different classes. We ensure that the exact ratio between the different classes varies randomly from context to context, while matching the desired ratio in the dataset overall.

Splitting. Due to interdependencies between the splitting algorithm and the example generation process (see, e.g., the notes on "additional top-level example generation" below), we perform the train-validation-test split as one step of the dataset generation process, rather than generating a single dataset and then splitting it in multiple ways.
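The per-class stream sub-sampling described above can be sketched as follows (a simplified illustration with hypothetical class names and ratios; the real pipeline additionally jitters the per-context ratios):

```python
# Illustrative sketch: draw from class-specific generation streams so that
# the resulting dataset matches a target per-class ratio.
import random

def subsample_streams(streams, class_ratios, total, rng):
    """Sample round(total * ratio) examples from each class stream."""
    dataset = []
    for cls, ratio in class_ratios.items():
        k = min(len(streams[cls]), round(total * ratio))
        dataset.extend(rng.sample(streams[cls], k))
    rng.shuffle(dataset)
    return dataset

rng = random.Random(0)
streams = {"pos_rule": list(range(100)), "neg_rule": list(range(100)),
           "non_rule": list(range(100))}
data = subsample_streams(streams, {"pos_rule": 0.5, "neg_rule": 0.25,
                                   "non_rule": 0.25}, 80, rng)
assert len(data) == 80
```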
For performing MCD splits, we build on the open-sourced Python implementation of the MCD algorithm from Shaw et al. (2020). Their version of the MCD algorithm initially performs a random split and then iteratively swaps examples to increase compound divergence; however, we found that we were able to achieve higher compound divergence on the cSCAN dataset by implementing an algorithm closer to the original one described in Keysers et al. (2020). In this approach, we begin with empty train and test sets and then iteratively select examples from a large example pool to add to one of the two sets, while once in every three steps selecting an example from one of the two sets to remove and put back in the pool. At each addition or removal step, we select from among a random sample of 200 examples the one whose addition or removal would maximize compound divergence at that stage, while keeping atom divergence low. One of the advantages of the insertion/deletion approach is that, in cases where it is acceptable to use only a portion of the available examples, the process can be stopped early, which can result in train and test splits with significantly higher compound divergence than would be possible if the algorithm were constrained to use all of the examples from the example pool. As described in Section 3.3, we do perform such sub-sampling when constructing our MCD splits, beginning with a larger pool of generated examples.

While our above algorithm largely emulates the one described in Keysers et al. (2020), we do introduce one additional enhancement to enable generating a 3-way compound divergence split between train, validation, and test. This is in contrast to Keysers et al. (2020), which performs only a single stage of MCD splitting for maximizing compound divergence between train and test.
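The greedy insertion step of this divergence-maximizing split might be sketched as follows. This is a heavily simplified illustration: the periodic removal step and the atom-divergence constraint are omitted, each pool example is represented only by its set of compounds, and all names are ours, not those of the open-sourced implementation:

```python
# Simplified sketch of greedy divergence-maximizing splitting in the spirit
# of Keysers et al. (2020): alternately grow train and test, each time adding
# the sampled candidate whose inclusion most increases compound divergence.
import random
from collections import Counter

def chernoff_divergence(train_c, test_c, alpha=0.1):
    """1 - Chernoff coefficient between normalized compound distributions."""
    t, s = sum(train_c.values()), sum(test_c.values())
    if not t or not s:
        return 0.0
    coeff = sum((train_c[c] / t) ** alpha * (test_c[c] / s) ** (1 - alpha)
                for c in set(train_c) & set(test_c))
    return 1.0 - coeff

def greedy_split(pool, steps, sample_size, rng):
    train, test = [], []
    train_c, test_c = Counter(), Counter()
    for i in range(steps):
        target, counts = (train, train_c) if i % 2 else (test, test_c)

        def gain(example):
            grown = counts + Counter(example)
            if counts is train_c:
                return chernoff_divergence(grown, test_c)
            return chernoff_divergence(train_c, grown)

        # Pick the best addition among a random sample of candidates.
        best = max(rng.sample(pool, min(sample_size, len(pool))), key=gain)
        pool.remove(best)
        target.append(best)
        counts.update(best)
    return train, test

rng = random.Random(0)
pool = [frozenset({f"c{rng.randrange(6)}"}) for _ in range(40)]
train, test = greedy_split(pool, 20, 10, rng)
```

Because the pool is never required to empty, stopping after a fixed number of steps naturally realizes the early-stopping/sub-sampling behavior described above.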
In our approach, we perform two stages of MCD splitting, with the goal of constructing a train set, validation set, and test set with high pairwise compound divergence between any two of the three. In the first stage, we split the example pool into a train+validation pool and a test set, with the objective of maximizing the compound divergence between the two. In the second stage, we keep the test set fixed, while further splitting the train+validation pool into a train set and a validation set, with the joint objective of maximizing compound divergence between train and validation and maximizing compound divergence between train and test. We perform down-sampling in each of the two stages to increase the compound divergences that we are able to achieve. Appendix H contains statistics on the MCD splits, along with other details of the datasets.

Additional top-level example generation. For the cSCAN Random variants, after performing a random split, we augment the validation and test sets by generating additional top-level examples for each validation and test context, using the same example generation logic described above. This allows us to achieve a higher density of logically related examples in the validation and test sets, so as to yield a larger number of potential implications and contradictions for use in calculating the consistency metric. We do not perform this step for the cSCAN MCD variants, however, to avoid impacting the compound divergences between the train, validation, and test sets. For this reason, we focus our investigation of consistency in this paper on the cSCAN random splits, as it is only in the random splits that we are able to identify a significant number of implications and contradictions. Improving the sampling and splitting algorithms to enable investigation of consistency in MCD splits could be a topic for future work.

G CONSISTENCY METRIC CALCULATION

While the consistency metric as described in Appendix D can in its most general form be prohibitively expensive to calculate if approached naively, a close approximation of it can be calculated efficiently through the application of several task-specific assumptions. Even under these assumptions, however, the cSCAN validation and test sets contain up to 1000 top-level examples per context, so it would still be prohibitively expensive to enumerate each of the possible subsets of these examples to identify the sets that involve a "minimal" implication or contradiction. Instead, we find that we are able to identify the implications and contradictions much more efficiently by seeding an inference engine with the rule assertions that correspond to the up to 1000 top-level request-reply pairs (ignoring negative rule replies and unknown replies for simplicity) and then performing exhaustive forward inference to determine all possible rules that could be inferred from combinations of these asserted rules. This is essentially the same inference process that is used in top-level example generation (as described in Appendix F), except that we take the additional step of tracking the provenance of each inferred rule, and we continue the inference process so as to generate all possible provenances of each rule (rather than omitting reprocessing of rules that have already been inferred via a different route). For the purposes of this consistency calculation, it is sufficient to consider as provenance the set of asserted rules that led to the given inference. Once the exhaustive inference process is complete, we then check each of the asserted rules (i.e., each of the top-level predictions) against the contents of the inference engine.
If the asserted rule was also inferred from some other rules, then we take the full set of inference provenances for that rule, filter out any provenances that are supersets of some other provenance, and then treat each of the remaining provenances together with the asserted rule as one "minimal implication" (i.e., as one minimal implication set F ∈ impl(E) as defined in Equation 24). Similarly, if a rule was inferred that shares the same left-hand side as the asserted rule but contains a different right-hand side, then we look at the provenances of each such inferred rule, filter out any that are supersets of some other provenance of the same rule, and then treat each of the remaining provenances together with the asserted rule as one "minimal contradiction" (i.e., as one minimal contradiction set F ∈ cont(E) as defined in Equation 25). Despite the relatively large number of rules with which we seed each inference engine and the extra expense of tracking multiple rule provenances, we find that exhaustive inference completes quickly in practice: the majority of the rules asserted in top-level examples tend to be more specific than the rules typically asserted inside of a context, and thus lead to only a limited number of interactions. The resulting cost ends up being comparable to or less than the computational cost of the inference involved in constructing the contexts in the first place. In practice, when calculating the consistency metric in parallel using a different work unit for each context, we are able to calculate the consistency metric for a full cSCAN experiment within a few minutes.
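The provenance-tracked forward inference described above can be sketched in miniature as follows. The domain here is a toy one (facts are integers, and the one-step `combine` function stands in for the task's real inference rules); what carries over is the bookkeeping: every inferred fact keeps all sets of asserted facts it was derived from, and provenances that strictly contain another provenance are discarded:

```python
# Sketch of forward inference with provenance tracking. `combine` is a
# stand-in for the real one-step inference; all names are illustrative.

def minimal_sets(sets):
    """Drop any provenance that is a strict superset of another."""
    return [s for s in sets if not any(o < s for o in sets)]

def forward_infer(asserted, combine):
    # provenances: fact -> list of frozensets of asserted facts
    prov = {f: [frozenset([f])] for f in asserted}
    frontier = list(asserted)
    while frontier:
        fact = frontier.pop()
        for other in list(prov):
            for new in combine(fact, other):
                sources = [p | q for p in prov[fact] for q in prov[other]]
                known = prov.setdefault(new, [])
                fresh = [s for s in sources if s not in known]
                if fresh:
                    known.extend(fresh)
                    frontier.append(new)  # re-process to propagate provenances
    return {f: minimal_sets(ps) for f, ps in prov.items()}

# Toy rule: combining facts a and b yields a+b (capped to keep inference finite).
combine = lambda a, b: [a + b] if a + b <= 3 else []
prov = forward_infer({1, 2}, combine)
# Fact 3 is derivable from {1, 2} directly, but also from {1} alone
# (1+1=2, then 1+2=3), so only the provenance {1} survives minimization.
```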

H DATASET DETAILS

Table 6 shows atom divergence and compound divergence between pairs of splits. Table 7 shows the minimum number of times an atom occurs in each split, the number of compounds in each split, and the number of compounds in the test split that are held out from the train split. Table 8 shows details of all splits of all datasets used in the cSCAN experiments.

I ANALYSIS OF REALISTIC CSCAN EXAMPLES

In this appendix, we provide realistic examples from a slightly earlier version of the cSCAN dataset and outline a systematic strategy for solving them. We selected a total of 22 examples (Figure 8 ) that are based on a single context of size 23 (Figure 7 ). We made sure that the different classes (see Section 2.4) are well represented among the selected examples. This means that the examples cover all valid combinations of request types, replies, and qualifiers. To illustrate how cSCAN can be solved by humans, we outline one possible strategy below. This strategy is based on the assumption that we are aware of the grammar (Appendix E.2 and Figure 5 ), the rule semantics (Appendix E.3), and the inductive bias (Appendix E.4) of cSCAN.

I.1 MONOTONIC EXAMPLES

In a first step, we check whether a given example can be deduced directly from the context. This is the case for all requests that exclusively contain syntactic constructs whose behavior is completely determined by explicit rules in the context. In our context C, the rules (C1) through (C11) completely determine the behavior of all constructs except "look", "thrice", and "and". This means that we can deduce the reply for the examples (E1), (E2), (E3), (E4), (E5), and (E7), which we consequently mark as monotonic. As an illustration, consider the request "jump around left twice" from example (E1). Once we know that jump = JUMP (C3) and left = PURPLE (C1), the reply follows from the remaining explicit rules for "around" and "twice".

The examples (E6), (E8), and (E9) contain some of the syntactic constructs that are not completely determined (i.e., "look", "thrice", and "and"), but they can still be deduced from the context and are thus marked as monotonic. The example ⟨C, look left = WHITE WHITE PURPLE, 1, M⟩ (E6) follows directly from the rules left = PURPLE (C1) and look x1 = WHITE WHITE x1 (C12). Similarly, the example ⟨C, look = WHITE YELLOW, 0, M⟩ (E8) can be deduced from the examples (C7) and (C12). Indeed, if look = WHITE YELLOW were true, (C7) would allow us to deduce the rule look x1 = WHITE YELLOW WHITE YELLOW x1, which contradicts (C12).
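The kind of deduction performed here can be illustrated with a toy interpreter. The rule table and the semantics of "around" and "twice" below are hypothetical stand-ins in the spirit of rules (C1)-(C12), not the actual cSCAN context:

```python
# Toy interpreter sketch: recursively rewrite a command into an action
# sequence using explicit rules. The semantics assumed here for "around"
# (repeat action+direction four times) and "twice" (repeat twice) are
# illustrative assumptions, not cSCAN's actual grammar.

def interpret(tokens, rules):
    """Recursively deduce the output action sequence for a command."""
    tokens = list(tokens)
    if len(tokens) >= 2 and tokens[-1] == "twice":        # x twice -> x x
        return 2 * interpret(tokens[:-1], rules)
    if len(tokens) >= 3 and tokens[1] == "around":        # x around d -> (x d) * 4
        return 4 * (interpret([tokens[0]], rules) + interpret([tokens[2]], rules))
    return [rules[" ".join(tokens)]]                      # primitive lookup

rules = {"jump": "JUMP", "left": "PURPLE"}  # cf. rules (C3) and (C1)
reply = interpret("jump around left twice".split(), rules)
assert reply == ["JUMP", "PURPLE"] * 8
```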

I.2 DEFEASIBLE EXAMPLES

To determine the reply of examples that cannot be deduced from the context, we check whether the context allows us to induce some rules for the constructs "look", "thrice", and "and", which are not fully determined.

Inducing rules for "look", "thrice", and "and". We start with the primitive command "look" and identify all context examples that contain "look" but none of the other partially-determined constructs "thrice" and "and". This yields the rules (C12) and (C13). Together with (C7), the rule (C12) tells us that "look" must either be "WHITE" or undefined (i.e., ?), and the same holds for (C13). While these two examples alone are not sufficient under the task's inductive bias (Appendix E.4) to justify inducing the general form of the "look" rule, we proceed with the assumption that look = WHITE for the time being.

Next, we apply the same process to the syntactic construct "thrice". This means that we identify all context examples that contain "thrice" but not "look" and "and", which are (C14), (C15), and (C16). The first two of them tell us that "x1 thrice" must either translate to "x1 x1 x1" or be undefined. However, example (C16) is incompatible with the rule x1 thrice = x1 x1 x1. In accordance with the inductive bias, we therefore cannot induce a general rule for "thrice".

Now, we apply the process to "and", which means that we look at the rules (C17) and (C18). Together with the example (C10), they both tell us that "x1 and x2" must either translate to "x2 x2 x1" or be undefined. So, we proceed with the assumption that x1 and x2 = x2 x2 x1. As a next step, we look at the examples (C19) and (C20), whose requests contain both "and" and "look" but not "thrice". Both of them follow from our current assumptions look = WHITE and x1 and x2 = x2 x2 x1. This means that we have now found four examples in the context that agree with our assumed rules for "look" and "and", and there are no examples that contradict them.
Based on our inductive bias, this means that we induce these rules. All the remaining examples (i.e., (C21), (C22), and (C23)) contain the construct "thrice", for which we cannot induce a general rule.

Applying the induced rules for "look" and "and". Using the induced rules for "look" and "and", we can determine a concrete reply for the examples (E10) through (E12) as well as (E16) through (E19). These examples are marked as defeasible because they are based (among others) on induced rules, which means that the reply might change if the context is expanded.

Using reply "unknown" for examples containing "thrice". Because we are not able to induce a generic rule for "thrice", we are not able to determine a concrete reply for the examples (E13) through (E15) and (E20) through (E22). Instead, we use the reply "unknown" (?) and mark them as defeasible because the reply might again change if the context is expanded.

In representing the input of a CLT as a request paired with a context, we build on a long tradition of QA and reasoning task formulations that provide knowledge relevant to a task via various forms of context, such as a text passage (Kwiatkowski et al., 2019; Weston et al., 2015; Dua et al., 2019; Sinha et al., 2019; Yang et al., 2018; Rajpurkar et al., 2016; Levesque et al., 2012), logical premise (Roemmele et al., 2011; Dagan et al., 2005; Bowman et al., 2015), set of natural language statements (Talmor et al., 2020), knowledge graph fragment (Sinha et al., 2020), antecedent description grammar (Cohen, 1994), dialog (Semantic Machines et al., 2020; Budzianowski et al., 2018), image (Antol et al., 2015; Johnson et al., 2017; Hudson & Manning, 2019; Bahdanau et al., 2019a; b), grid world (Ruis et al., 2020), or DB schema (Yu et al., 2018).

K.1 COMPARISON WITH RELATED TASK FORMULATIONS

Here, to give a better sense of how a CLT is similar to and different from these existing task formulations, we make a closer comparison of cSCAN with several representative NLU tasks that provide explicit knowledge as part of the input and satisfy some of the desired properties for a CLT formulated in Section 2.1. This comparison is illustrated in Figure 9. For the context (explicit knowledge), we distinguish whether it is organized as a set of independent units rather than a monolithic block (column 1) and whether different examples use different and sometimes contradictory knowledge (column 2). We also distinguish whether the context contains both rules (column 3) as well as examples (column 4). For the request, we indicate whether the task contains examples that explicitly test the truth value of rules (column 5). For the output, we indicate whether the task requires identifying whether a request cannot be answered based on the given context (column 6) and whether it distinguishes between inductive and deductive reasoning. Note that we focus our comparison here specifically on the defining features of a CLT. This should not be construed as a commentary on the overall usefulness or quality of these benchmarks. Indeed, many of the benchmarks described here have complementary strengths which cSCAN lacks, such as being based on true natural language, supporting multi-modal input, illustrating specific domains of reasoning, or covering more complex reasoning or syntax.

Talmor et al. (2020) present a series of "Leap-of-Thought" tasks where the learner needs to judge yes/no hypotheses by performing inference over knowledge that is obtained implicitly from language model pretraining, while part of the knowledge is also provided explicitly using a context containing natural language statements. This setup satisfies various CLT properties.
In particular, arbitrary hypotheses can be asserted in the context, and the requests can ask for the truth value of any context statement (examples and rules coincide in this task because all examples are yes/no hypotheses). However, the knowledge is constant across all examples (albeit different parts of this knowledge are provided explicitly for different examples). Therefore, Leap-of-Thought tasks do not investigate whether a learner is able to adapt to transient knowledge, e.g., explicit knowledge that may contradict the implicit knowledge obtained during language model pretraining (e.g., an actor was married and is now divorced). Similarly, these tasks exclusively focus on monotonic inference: they do not test whether the learner is able to induce defeasible hypotheses from the explicit knowledge, nor do they test the ability to identify certain hypotheses as "unknown".

GraphLog (Sinha et al., 2020) is a benchmark suite based on tasks that are quite similar to CLTs. The learner is presented with part of a graph consisting of labeled edges (the context) and then needs to predict the label for an edge that is not part of the context. For example, the context may contain two "father-child" relations and the learner needs to predict the "grandfather-grandchild" relation. As with cSCAN, the graph is constructed automatically based on a set of first-order logic rules, which ensure that each task is consistent. However, in contrast to cSCAN, the underlying rules cannot be part of the context, nor are they expressible as requests. In CLT terminology, this means that GraphLog tests only the case where the context consists of examples and the learner has to induce new examples. It does not support the case where the context contains rules and the learner has to consistently combine and apply these rules deductively.

The bAbI tasks (Weston et al., 2015) also require the learner to answer a question using a variable context consisting of natural language statements.
Each context consists of a sequence of relatively simple factual statements that are, at least for some of the tasks, order-dependent (Dehghani et al., 2018). This means that a bAbI context does not directly correspond to the set-based context that we use with CLTs. One way to bridge this gap is to consider a sequence of bAbI statements as a single "macro rule", but the truth value of these rules cannot be requested. bAbI contains some tasks that require deductive reasoning and some tasks that require inductive reasoning.

The gSCAN task (Ruis et al., 2020) is an extension of SCAN where the learner is provided with a context that describes a spatial configuration of objects in order to translate commands into sequences of actions. However, as with bAbI, the context consists of a single dedicated structure describing the spatial configuration as a whole, rather than a set of rules that describe the different objects in the spatial configuration one by one. This makes the language used to specify the context disjoint from the request language.

As an example of a reading comprehension task, Natural Questions (Kwiatkowski et al., 2019) is a benchmark where the learner is given a Wikipedia page as context and then needs to answer a natural language question by outputting a long answer (e.g., the paragraph containing the answer) as well as a short answer. As with bAbI, the context is not a set of independent rules but instead a sequence of inter-dependent statements, which makes this benchmark quite different from a CLT.

K.2 OTHER RELATED WORK

Compositional generalization. Our evaluation on the cSCAN MCD splits builds on existing research in measuring the ability of machine learning models to generalize compositionally (Keysers et al., 2020; Lake & Baroni, 2017). In response to compositional generalization benchmarks such as SCAN, a range of techniques have been proposed that have not yet been evaluated on cSCAN. Some of these solutions involve specialized architectures for enforcing a compositional bias (Qiu et al., 2022; Chen et al., 2020; Liu et al., 2020; Nye et al., 2021). Such architectures are appealing due to their potential for achieving a principled solution to compositional generalization, but they would require some effort to adapt to the context and heterogeneous output of a CLT. In the past, some specialized architectures have shown limited success when transferring to new tasks, compared to more general techniques such as language model pre-training (Furrer et al., 2020). In the latter category, there is some promise shown by recent developments in decompositional prompting techniques, which have led to strong results on the SCAN MCD splits using off-the-shelf large language models (Zhou et al., 2022).

Instruction following. In evaluating the ability of a system to apply rules to a task, our work relates to research in building systems that learn to follow instructions (Goldwasser & Roth, 2014), including recent research on the instruction-following capabilities of large language models (Wei et al., 2022; Ouyang et al., 2022; Wang et al., 2022). Our approach differs in that we provide in the context a set of rules, each of which may be applicable to some part of the task, and we evaluate the ability to infer new rules in addition to applying the rules to an underlying task.

Meta-learning.
Meta-learning or "learning to learn" generally refers to a setup where the learner is provided with a family of tasks, which are also called episodes, each of which comes with its own set of training examples and test examples. A learner is able to "learn to learn" if its performance for each task increases both with increasing training data and with an increasing number of tasks (Thrun & Pratt, 1998; Hospedales et al., 2021; Finn et al., 2017) . The Closely related to cSCAN are two pieces of work that apply meta-learning to SCAN, both of which generate large numbers of SCAN-like grammars, from which they construct meta-learning episodes, similarly to how we generate cSCAN contexts based on SCAN-like grammars. Lake (2019) uses a memory-augmented network to attend to the train examples of each episode, while Nye et al. (2020) trains for each episode a program synthesis model that outputs the underlying rules of the task. Our approach differs in that we include in the context a mixture of rules and examples, rather than just examples of the underlying task, and we use the synthetically-generated contexts to define a new task for evaluating the ability of the system to generalize to many different rule sets, rather than using meta-learning techniques as a means to improve accuracy on the original SCAN task. Rule induction and logic deduction tasks. CLTs require the learner to judge whether a certain rule can be induced from a given set of observations (i.e., examples provided as part of the context). This is similar to a rule induction task (Cohen, 1995; Reddy & Tadepalli, 1998; Grzymala-Busse, 2010) , with the main caveat that the learner only needs to verify rules rather than generate them. CLTs also require the learner to apply rules and judge whether a rule may be deductively obtained from a set of other rules. Interpretable ML models. 
CLTs make part of the rules that govern the underlying tasks explicit, which allows us to "introspect" the behavior of the learner by asking, as part of the task, whether or not a certain rule holds (inductively or deductively). This means that the question of whether and why a certain model behaves correctly or incorrectly can be broken down into two parts: (a) did the model learn the right rules, and (b) is it able to apply these rules consistently? This is related to, yet different from, other efforts to make ML models more interpretable. For example, Sushil et al. (2018) propose a method to induce if-then-else rules to explain the behavior of ML models. However, unlike for CLTs, this method is external to the actual task: it does not reflect whether the model claims a certain rule to be true, but instead identifies the if-then-else rules between different input features and class labels that are most important for classification according to the model.

Consistency.

Our consistency metric is related to research into evaluating and improving the consistency of neural networks. One closely related work is Li et al. (2019), which evaluates consistency by generating clusters of examples that by construction are related via logical symmetry, transitivity, or some other logical constraint. Based on these examples, they calculate a "conditional violation" metric, which, similarly to our consistency metric, is a ratio of violated (in our case, satisfied) constraints vs. the total number of logical constraints. Our approach differs in that we gather the logical constraints automatically using symbolic inference over the generated examples, rather than depending on a specific algorithm for generating related examples.

Handling of large contexts. Due to the potentially large context size in CLTs, another relevant area of research is how to deal with very large contexts, including large set-like contexts. One line of research in this area involves modifying the Transformer architecture to handle longer inputs more efficiently (Tay et al., 2020a; b). We evaluated two such architecture variants in our LongT5 and LongT5-TGlobal baselines (Guo et al., 2022), but many other such variants have been proposed (Gu et al., 2021; Zaheer et al., 2020; Choromanski et al., 2020; Wang et al., 2020). In cases where the context takes the form of a set or a graph, some approaches seek to explicitly take this structure into account by encoding the input structure in positional embeddings (Herzig et al., 2020), guiding the Transformer's attention via the structural relations within the input (Ainslie et al., 2020), or message-passing in graph neural networks (Gilmer et al., 2017; Battaglia et al., 2018).
Other lines of research seek to more efficiently deal with large pools of potentially relevant knowledge by either performing cross-attention from portions of the input to knowledge stored in neural memory (Verga et al., 2020) or by retrieving only the most relevant material from a knowledge base or text corpus for concatenation to the input (Guu et al., 2020; Pasupat et al., 2021) .

L REPRODUCIBILITY

Hardware and training period. Table 10 shows the different hardware used for each experiment and the training period (in steps).

T5 versions. For our T5 baselines, we use T5X (Roberts et al., 2022), which is a re-implementation of T5 in JAX (Bradbury et al., 2018) using Flax (Heek et al., 2020). Table 11 presents the configurations of the different T5 variants that we used in our experiments. For the full-attention version of T5, we experiment with both fine-tuning from a standard pre-trained checkpoint and training from scratch. For LongT5 and LongT5-TGlobal, while we initially evaluated both fine-tuning from a standard pre-trained checkpoint and training from scratch, when fine-tuning, we failed to find a setup in which the models converge on the train set, possibly due to poor compatibility between the cSCAN task and the summarization-oriented PEGASUS Principle Sentences Generation pre-training objective (Zhang et al., 2019) used in LongT5. For this reason, we only report results on LongT5 and LongT5-TGlobal models trained from scratch. For each of the architectures, we evaluate at minimum two sizes: Small (60M parameters) and Base (220M parameters). For the best-performing T5 architecture, we further evaluate on size Large (770M parameters). We omitted experiments on Large variants of the other architectures for reasons of computational cost, as the poor performance at the Small and Base sizes suggests that it is unlikely for the performance of LongT5, LongT5-TGlobal, or the non-pretrained version of T5 to improve significantly with model size alone.

Hyperparameters. Table 12 summarizes the hyperparameters used for each of the baselines. The reasons for choosing those hyperparameters are:

• Config: We chose the config version based on experiments on an earlier version of the dataset. In that version, we found that T5 achieved higher performance using the T5.1.0 config, while LongT5 and LongT5-TGlobal performed better with the T5.1.1 config.
• Learning Rate: A constant learning rate is the standard way to fine-tune pre-trained T5 models. For non-pretrained models, we found that a constant learning rate performed as well as the inverse square root decay with linear warmup scheduler (the standard learning rate scheduler for pre-training T5), so we use a constant learning rate for all models to maintain consistency. Note that for the small models using the T5.1.1 config, we set the number of layers and heads to 6 and 8, respectively, to match those set in the T5.1.0 config.

Tokenization. All models use the pretrained SentencePiece tokenizer (Kudo & Richardson, 2018) provided by T5, which is pretrained to cover the English, French, German, and Romanian languages with 32,000 tokens. We also tried using a simple whitespace tokenizer, which resulted in similar performance when compared to the pretrained T5 tokenizer.

In contrast, when looking at the results of the models that failed to outperform the naive baseline, we can see that for these, accuracy on rule examples is significantly higher than that on non-rule examples, with the highest accuracy on examples with the reply of "unknown", consistent with the view that these models are relying on guessing based on superficial example characteristics. For the models that failed to outperform the naive baseline, the pattern is similar on cSCAN-X as on cSCAN-B.

M.3 CSCAN-B MCD

Table 15 shows the breakdown of accuracy by example characteristics on cSCAN-B MCD. From these results, we can see the following trends:

• All models achieve particularly high accuracy on examples with reply "unknown", as would be expected from the dataset stats shown in Table 2, where we can see that this class of examples makes up over 50% of the examples in the cSCAN-B MCD train set. This makes the answer of "unknown" a natural guess in any situation where the model is unsure.

• Given that T5 w/o Context is able to achieve significantly higher than zero accuracy on rule examples with replies other than "unknown", however, it is clear that the model is doing more than simply predicting "unknown" every time. Rather, it appears that a moderate amount of statistical clues must be available in the request itself to allow some degree of "educated guessing" of the reply, particularly in the case of rule examples.

• For all models, accuracy on non-rule examples lags significantly behind accuracy on rule examples. This is in contrast to the cSCAN Random datasets, where the stronger-performing pre-trained models frequently performed better on non-rule examples than on rule examples. One reason for this difference is likely the fact that the train set for cSCAN-B MCD is skewed toward rule examples, which make up somewhat over 70% of the dataset. Taken in light of the observation above about the naive T5 w/o Context baseline, however, the poor performance on non-rule examples also suggests that the T5 baselines may be achieving even less proper "understanding" of the examples than one would have thought from looking at the overall accuracy numbers alone, and are likely relying to a large degree on "educated guessing" based on statistical clues from the request, which is much easier to do on rule examples than on non-rule examples, due to the smaller space of possible replies for rule examples.

M.4 CSCAN-X MCD

Table 16 shows the breakdown of accuracy by example characteristics on cSCAN-X MCD. From these results, we can see the following trends:

• Similarly to cSCAN-B MCD, all models achieve high accuracy on examples with reply "unknown", which is again the most commonly occurring class of examples in this dataset (around 40% of examples in the train set).

• For pre-trained T5, however, accuracy on non-rule examples is significantly higher than on cSCAN-B MCD, suggesting that these models are likely benefiting from the more balanced distribution of examples in the cSCAN-X MCD dataset, where around 40% of the train examples are non-rule examples, compared with less than 30% in cSCAN-B MCD.

• The large gap in accuracy between negative and possible rule examples on pre-trained T5 suggests that while these models are able to use information from the context to do a better job than the naive T5 w/o Context at distinguishing between "unknown" and "not unknown" rules, they are still relying largely on guessing for determining the rules' actual truth value.

M.5 EFFECT OF EXAMPLE AND CONTEXT CHARACTERISTICS

The features being considered are:

• num rules: The number of distinct rules used to create an example. For example, for the request "walk and walk", num rules is 2: it is created with the rules x1 and x2 = ... and walk = ....

• num variables: The number of variables in the rule. For example, for the request "walk and x1", num variables is 1.

• derivation level: The number of compositions used to build an example. For example, for the request "walk and walk", derivation level is 2: it is created by first composing x1 and x2 = ... with walk = ... to get walk and x2 = ..., followed by another composition with walk = ....

• frac explicit rules bucket: The fraction of explicit rules among all distinct rules used to create an example. The fractions are bucketed for legibility: frac explicit rules bucket=0.5 includes all examples with a fraction of explicit rules at least 0.5 and less than 0.6.

• input length bucket: The length of the input (context + request) in tokens. The lengths are bucketed for legibility: input length bucket=500 includes all examples with length at least 500 and less than 600.

• context num explicit rules: The number of rules explicitly asserted in the context. Every context is based on 14 grammar rules, so, for example, context num explicit rules=5 means that the context contains explicit assertions of 5 of these rules, while the other 9 rules are either illustrated indirectly via examples (such that the learner is expected to induce the rule to be true) or are not illustrated sufficiently (such that the learner is expected to consider the rule to be false or unknown).

In all cases, the accuracy appears negatively correlated with num rules and derivation level. See text below each figure for additional observations.
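Such breakdowns amount to a grouped exact-match accuracy. The sketch below is purely illustrative (the field names and helper are assumptions, not the paper's actual evaluation code):

```python
from collections import defaultdict

def accuracy_by_feature(examples, feature):
    """Map each value of `feature` to exact-match accuracy over its examples."""
    hits = defaultdict(int)    # correct predictions per feature value
    counts = defaultdict(int)  # total examples per feature value
    for ex in examples:
        key = ex[feature]
        counts[key] += 1
        hits[key] += ex["prediction"] == ex["target"]
    return {k: hits[k] / counts[k] for k in counts}

# Toy data illustrating accuracy dropping as num_rules grows.
examples = [
    {"num_rules": 1, "prediction": "WALK", "target": "WALK"},
    {"num_rules": 2, "prediction": "WALK WALK", "target": "WALK WALK"},
    {"num_rules": 2, "prediction": "WALK RUN", "target": "WALK WALK"},
]
print(accuracy_by_feature(examples, "num_rules"))  # {1: 1.0, 2: 0.5}
```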

N FINE-GRAINED EVALUATION METRICS

In addition to exact match accuracy, for more nuanced error analysis, we track several additional finer-grained metrics, including partial accuracy metrics, edit distance, and counts of implications and contradictions related to the consistency metric.

N.1 PARTIAL ACCURACY METRICS

Sequence-level accuracy is a hard metric where a single wrong token leads the entire prediction to be labeled as wrong. For this reason, we define additional accuracy measures that take into account partial success in solving the actual task. These measures are:

• Reply Accuracy: A prediction is considered correct if the reply portion is correct.

• Qualifier Accuracy: A prediction is considered correct if the qualifier portion is correct.

• Pattern Accuracy: Pattern accuracy assigns each token an incremental ID based on the order in which it appears in the sequence, thus ignoring the specific predicted tokens and focusing on the token variation pattern. E.g., the sequences JUMP JUMP RUN JUMP and WALK WALK EAT WALK both have the same pattern of A A B A, where A replaces JUMP and WALK in the first and second sequences respectively, while B replaces RUN and EAT in the first and second sequences respectively.

• Naive Accuracy: A prediction is considered correct if it produces the same set of unique tokens as the target, regardless of order or count. E.g., the sequences JUMP JUMP RUN JUMP and RUN JUMP both have the same set of unique tokens (JUMP and RUN).

• Token Accuracy: Token-wise accuracy between the prediction and the target. The two sequences are aligned at the start token, and the shorter sequence is padded to the length of the longer sequence, with the padded tokens counted as wrong predictions.

Tables 17, 18, 19, and 20 show the performance of each baseline on all of these metrics.

Edit distance is the number of edits that would need to be applied to the predicted sequence to transform it into the target sequence. There are three types of edits: Substitutions (S), Insertions (I), and Deletions (D). Tables 23, 24, 25, and 26 show the edit distance and constituent metrics for each baseline.
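For concreteness, the pattern, naive, and token accuracy measures can be sketched in a few lines of Python; this is an illustrative re-implementation, and the actual evaluation code may differ in edge-case handling:

```python
def pattern(tokens):
    """Replace each distinct token by an incremental ID: JUMP JUMP RUN JUMP -> (0, 0, 1, 0)."""
    ids = {}
    return tuple(ids.setdefault(t, len(ids)) for t in tokens)

def pattern_match(pred, target):
    """Pattern accuracy: prediction and target share the same token variation pattern."""
    return pattern(pred) == pattern(target)

def naive_match(pred, target):
    """Naive accuracy: same set of unique tokens, ignoring order and count."""
    return set(pred) == set(target)

def token_accuracy(pred, target):
    """Position-wise accuracy; the shorter sequence is padded with mismatches."""
    length = max(len(pred), len(target))
    hits = sum(p == t for p, t in zip(pred, target))
    return hits / length if length else 1.0

a = "JUMP JUMP RUN JUMP".split()
b = "WALK WALK EAT WALK".split()
print(pattern_match(a, b))                         # True  (both are A A B A)
print(naive_match(a, ["RUN", "JUMP"]))             # True  ({JUMP, RUN})
print(token_accuracy(a, ["JUMP", "JUMP", "RUN"]))  # 0.75
```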

O INPUT-OUTPUT FORMAT

We evaluate T5 variants, which are all encoder-decoder architectures. Here, we show how we prepare the input (that is fed to the encoder) and output (that the decoder generates) of our models, given an example from cSCAN. The example below is selected from the cSCAN-B dataset. Note that the request and the context examples are all concatenated into a single newline-separated string to form the input, and the bullet-points here (and in Appendix P.1) are just added to improve the readability of the examples. Note that we concatenate request + context, rather than context + request, so as to make the system more robust to truncation of the example string, if any example were to exceed the maximum length of T5's input buffer (although in our experiments we made sure that the example lengths did not exceed this buffer size). In representing the context for T5, it can be noted that we omitted special syntactic tokens such as braces, angle brackets, and commas wherever possible, so as to reduce the token count and keep the format closer to natural language, to the extent that this could be done without introducing ambiguity. 
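The input assembly described above can be sketched as follows; the helper name and the toy context strings are illustrative assumptions, not the actual preprocessing code:

```python
def build_input(request, context_examples):
    """Flatten request + context into one newline-separated encoder input.

    The request comes first so that, if truncation ever occurs, it drops
    context examples rather than the request itself.
    """
    return "\n".join([request] + context_examples)

context = ["[run] = JUMP", "run twice JUMP JUMP"]
inp = build_input("walk and run", context)
print(inp.split("\n")[0])  # 'walk and run' survives even if the tail is cut
```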



To be released on GitHub upon paper acceptance. For an overview, see Appendices F and G.



Figure 2: Examples of SCAN interpretation rules from Lake & Baroni (2017).

Figure 3: Hypothetical cSCAN examples based on two different contexts that contradict each other.

3: Generate or curate examples. For cSCAN, we generate examples automatically using a Python program, which we summarize here and describe in detail in Appendix F. As a first step of generating each example, we create a context by first picking a coherent set of interpretation rules like those shown in Figure 2 (or in Figure 4 from the appendix) and then choosing which of those rules to (a) provide explicitly in the context, (b) illustrate implicitly through context examples, (c) illustrate insufficiently or not at all, or (d) contradict in one or more cases. Once we have fixed a context, we choose a request, reply, and qualifier that, together with the context, satisfy the consistency criteria stated in Section 2.3 and agree with the task's assumed inductive bias for distinguishing between a rule being "defeasibly true" vs. "unknown" (see Appendix E.4).
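The rule-treatment step above can be sketched schematically as follows; the treatment names and the uniform sampling are illustrative assumptions rather than the generator's actual logic:

```python
import random

# Treatments (a)-(d) for each interpretation rule in a sampled rule set.
TREATMENTS = ["explicit", "illustrated", "insufficient", "contradicted"]

def assign_treatments(rules, rng):
    """Randomly decide how each rule will appear in the generated context."""
    return {rule: rng.choice(TREATMENTS) for rule in rules}

rng = random.Random(0)  # seeded for reproducibility
rules = ["[run] = JUMP", "[x1 twice] = [x1] [x1]"]
print(assign_treatments(rules, rng))
```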

Figure 3 shows some hypothetical cSCAN examples in human-friendly notation. (We chose these examples for simplicity and conciseness. For a sample of actual cSCAN examples, see Appendix I. For examples in the exact format in which they are presented to the baseline systems, see Appendix O.)

The task T = {e_1, ..., e_N} is a finite set of examples e_k ∈ E, where E = I × O denotes the set of possible examples, with I being the set of possible inputs and O the set of possible outputs.

• Each example e_k = (i_k, o_k) is a pair consisting of an input i_k ∈ I and an output o_k ∈ O.

[[walk]] = WALK
[[x1 and x2]] = [[x1]] [[x2]]
[[run]] = RUN
[[x1 after x2]] = [[x2]] [[x1]]
[[jump]] = JUMP
[[turn opposite left]] = LTURN LTURN
[[look]] = LOOK
[[turn opposite right]] = RTURN RTURN
[[turn left]] = LTURN
[[turn around left]] = LTURN LTURN LTURN LTURN
[[turn right]] = RTURN
[[turn around right]] = RTURN RTURN RTURN RTURN
[[x1 left]] = LTURN [[x1]]
[[x1 opposite left]] = [[turn opposite left]] [[x1]]
[[x1 right]] = RTURN [[x1]]
[[x1 opposite right]] = [[turn opposite right]] [[x1]]
[[x1 twice]] = [[x1]] [[x1]]
[[x1 around left]] = LTURN [[x1]] LTURN [[x1]] LTURN [[x1]] LTURN [[x1]]
[[x1 thrice]] = [[x1]] [[x1]] [[x1]]
[[x1 around right]] = RTURN [[x1]] RTURN [[x1]] RTURN [[x1]] RTURN [[x1]]

Figure 4: SCAN interpretation rules as provided by Lake & Baroni (2017). Double brackets denote the interpretation function translating SCAN's linguistic commands into sequences of actions. Symbols x1 and x2 denote variables.

Figure 5: The green area on the top left is the phrase-structure grammar to generate the cSCAN-B commands. The red area on the lower right is the extension to generate the left-hand side of cSCAN-B rules.

Figure 6: The green area on the top left is the phrase-structure grammar to generate the cSCAN-X commands. The red area on the lower right is the extension to generate the left-hand side of cSCAN-X rules. The items in blue boldface are those which do not appear in cSCAN-B.

around x2] = [x1] [x2] [x1] (C8), and [x1 twice] = [x1] (C10), we can immediately deduce that "jump around left twice" is translated to "JUMP PURPLE JUMP". Note that some of the examples are closely related to one another. For example, the example (E1) is an instance of the rule [x1 around x2 twice] = [x1] [x2] [x1] from example (E5).

Finally, (E9) can be deduced from (C2), (C8), and (C14) because [x1 thrice] = [x1] [x1] would imply that [x1 around right thrice] = [x1] GREEN [x1] [x1] GREEN [x1], which contradicts (C14).
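As a sanity check of this style of deduction, a toy interpreter for the perturbed rules discussed above can reproduce the derivation mechanically. The primitive mappings jump → JUMP and left → PURPLE are assumptions made for illustration, and only the constructs needed for this example are handled:

```python
# Assumed primitive mappings for this illustrative, perturbed context.
PRIMITIVES = {"jump": "JUMP", "left": "PURPLE"}

def interpret(phrase):
    """Recursively apply [x1 twice] = [x1] and [x1 around x2] = [x1] [x2] [x1]."""
    words = phrase.split()
    if words[-1] == "twice":           # [x1 twice] = [x1]  (identity here)
        return interpret(" ".join(words[:-1]))
    if "around" in words:              # [x1 around x2] = [x1] [x2] [x1]
        i = words.index("around")
        x1 = interpret(" ".join(words[:i]))
        x2 = interpret(" ".join(words[i + 1:]))
        return f"{x1} {x2} {x1}"
    return PRIMITIVES[phrase]

print(interpret("jump around left twice"))  # JUMP PURPLE JUMP
```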

Figure 7: A context C from the cSCAN dataset. The examples that make up the context are sorted, starting with 11 explicit rules that specify the complete behavior of all syntactic constructs except "look", "thrice", and "and". The remaining examples are sorted such that we first have the examples that do not contain "thrice" and "and", then the examples that do not contain "and", and finally the remaining examples.

Figure 8: Examples from the cSCAN dataset. The examples are based on context C shown in Figure 7 above, and they are grouped by request type and qualifier.

Figure 9: Comparison of cSCAN and other NLU tasks against the key features of conceptual learning. For the context (explicit knowledge), we distinguish whether it is organized as a set of independent units rather than a monolithic block (column 1) and whether different examples use different and sometimes contradictory knowledge (column 2). We also distinguish whether the context contains rules (column 3) as well as examples (column 4). For the request, we indicate whether the task contains examples that explicitly test the truth value of rules (column 5). For the output, we indicate whether the task requires identifying whether a request cannot be answered based on the given context (column 6) and whether it distinguishes between inductive and deductive reasoning.

The presence of examples within the context of an example gives CLTs a nested structure that allows us to view CLTs through the lens of the meta-learning setup. In this view, top-level examples that share the same context correspond to an episode where the context examples are the training examples and the top-level examples (w/o the context) are the test examples.

Figures 10, 11, 12, and 13 show T5's performance on the test splits of cSCAN-B, cSCAN-X, cSCAN-B MCD, and cSCAN-X MCD datasets with respect to various features of the examples and the contexts, broken down into rule and non-rule examples.

Figure 10: T5 accuracy on cSCAN-B examples. For non-rule examples, the accuracy appears positively correlated with the fraction of explicit rules (bottom-left).

Figure 11: T5 accuracy on cSCAN-X examples. For non-rule examples, the accuracy appears positively correlated with the number of explicit examples in the context (bottom-right), and negatively correlated with the input length (bottom-center).

Figure 12: T5 accuracy on cSCAN-B MCD examples.

Figure 13: T5 accuracy on cSCAN-X MCD examples. For both rule and non-rule examples, the accuracy appears negatively correlated with the input length (bottom-center).

Figure 14: Consistency sets size distribution for cSCAN-B Random

• run around left and turn opposite right twice PURPLE PURPLE JUMP JUMP YELLOW RTURN RTURN YELLOW RTURN RTURN YELLOW RTURN RTURN
• walk opposite left twice after run thrice JUMP RUN LTURN PURPLE PURPLE LTURN PURPLE PURPLE
• [run opposite left] = JUMP PURPLE PURPLE
• [run opposite x1 twice] = JUMP [x1] [x1]
• turn opposite right twice and turn around left thrice YELLOW RTURN RTURN YELLOW RTURN RTURN YELLOW RTURN RTURN PURPLE PURPLE YELLOW YELLOW
• run right thrice JUMP RTURN RTURN
• [x1 around x2 thrice] = [x2] [x2] [x1] [x1]
• [look around x1 thrice] = [x1] [x1] PINK PINK
• look around right thrice and run opposite left RTURN RTURN PINK PINK JUMP PURPLE PURPLE
• look around left thrice PURPLE PURPLE PINK PINK
• [look left thrice] = PINK PURPLE PURPLE
• [look opposite x1] = PINK [x1] [x1]
• run opposite right twice and turn around left twice JUMP RTURN RTURN PURPLE PURPLE YELLOW YELLOW
• turn around left and walk twice PURPLE PURPLE YELLOW YELLOW LTURN LTURN JUMP
• [x1 and x2 thrice] = [x1] [x2]
• [x1 thrice and x2] = [x1] [x2]
• look thrice and run around right thrice PINK RTURN RTURN JUMP JUMP
• turn opposite right YELLOW RTURN RTURN
• [jump right] = RUN RTURN RTURN
• [x1 right] = [x1] RTURN RTURN
• turn left after look opposite right PINK RTURN RTURN YELLOW PURPLE PURPLE
• look right twice YELLOW PINK RTURN RTURN PINK RTURN RTURN
• [look x1] = PINK [x1] [x1]
• [x1 left] = [x1] PURPLE PURPLE
• walk around left twice after walk around left thrice PURPLE PURPLE LTURN LTURN PURPLE PURPLE LTURN LTURN PURPLE PURPLE LTURN LTURN YELLOW
• jump right twice RUN RTURN RTURN RUN RTURN RTURN RTURN
• [look around x1 twice] = RUN [x1] [x1] PINK PINK [x1] [x1] PINK PINK
• [jump around left twice] = PINK PURPLE PURPLE RUN RUN PURPLE PURPLE RUN RUN

Output: <Reply, Qualifier> YELLOW PURPLE PURPLE PINK PURPLE PURPLE RUN RUN PURPLE PURPLE RUN RUN (Reasoning: Defeasible)

Example of T5-Large failing at acknowledging lack of information to reply.

Input
turn around right thrice and jump opposite left thrice
• [x1 and x2] = [x1] [x2] [x1]
• [x1 around x2] = [x1] [x2] [x2]
• [x1 opposite x2] = [x1] [x2] [x2]
• [x1 twice] = [x1]
• [x1 after x2] = [x2] [x1] [x2]
• [left] = BLUE
• [x1 thrice] = [x1] [x1]
• [look] = PINK
• [run] = RED
• walk opposite right after run opposite left twice RED BLUE BLUE WHITE YELLOW YELLOW RED BLUE BLUE
• walk opposite left twice and run WHITE BLUE BLUE RED WHITE BLUE BLUE
• [walk opposite x1] = WHITE [x1] [x1]
• [walk x1] = [x1] WHITE
• jump opposite left thrice and turn left thrice RTURN BLUE BLUE RTURN BLUE BLUE BLUE RUN BLUE RUN RTURN BLUE BLUE RTURN BLUE BLUE
• turn opposite right twice after look opposite right PINK YELLOW YELLOW RUN YELLOW YELLOW PINK YELLOW YELLOW
• [turn left] = BLUE RUN
• [turn around x1 twice] = RUN [x1] [x1]
• jump left twice after turn around left RUN BLUE BLUE BLUE LTURN RUN BLUE BLUE
• jump around right thrice WALK WHITE YELLOW YELLOW WALK WHITE YELLOW YELLOW
• [jump right] = YELLOW WALK WALK
• [jump opposite left] = YELLOW BLUE BLUE
• walk right twice and walk around left thrice YELLOW WHITE WHITE BLUE BLUE WHITE BLUE BLUE YELLOW WHITE
• turn around right twice after jump around left WALK WALK BLUE BLUE RUN YELLOW YELLOW WALK WALK BLUE BLUE
• [jump opposite right] = RED WALK YELLOW YELLOW
• [turn opposite right] = RUN YELLOW YELLOW
• jump right twice and jump thrice YELLOW BLACK BLACK BLACK YELLOW BLACK
• run right thrice after look around left twice PINK BLUE BLUE YELLOW RED YELLOW RED PINK BLUE BLUE
• RUN YELLOW YELLOW YELLOW BLUE BLUE YELLOW BLUE BLUE RUN YELLOW YELLOW RUN YELLOW YELLOW (Reasoning: Defeasible)

Commentary
[jump] = ... is not well illustrated by at least 2 unique substitutions; therefore, the mapping of [jump] is unknown.

Example of T5-Large failing at drawing correct information to compose the reply.

Input
turn opposite left after turn around left thrice
• [x1 and x2] = [x1] [x2]
• [jump] = RTURN
• [x1 opposite x2] = [x1] [x2] [x2] [x1]
• [left] = YELLOW
• [x1 twice] = [x1]
• [turn] = RED
• [right] = WHITE
• [x1 thrice] = [x1] [x1]
• [walk] = GREEN
• [x1 after x2] = [x1] [x1] [x2] [x2]
• [run] = JUMP
• [x1 around x2] = [x2] [x1] [x2] [x1]
• look around left and walk twice YELLOW LOOK YELLOW LOOK GREEN
• look opposite left thrice WALK WALK YELLOW YELLOW WALK WALK WALK WALK YELLOW YELLOW WALK WALK
• [look opposite left thrice] = WALK WALK YELLOW YELLOW WALK WALK WALK WALK YELLOW YELLOW WALK WALK
• [look right] = WHITE WHITE RTURN RTURN
• walk right twice WHITE WHITE GREEN GREEN
• run left thrice and run YELLOW YELLOW JUMP JUMP YELLOW YELLOW JUMP JUMP JUMP
• [x1 left] = YELLOW YELLOW [x1] [x1]
• [turn x1] = [x1] [x1] RED RED

target RED YELLOW YELLOW RED RED YELLOW YELLOW RED YELLOW RED YELLOW RED YELLOW RED YELLOW RED YELLOW RED YELLOW RED YELLOW RED YELLOW RED YELLOW RED (Reasoning: Monotonic)
prediction RED YELLOW YELLOW RED RED YELLOW YELLOW RED YELLOW RED YELLOW RED YELLOW RED YELLOW RED YELLOW RED YELLOW RED YELLOW RED (Reasoning: Monotonic)

Commentary
• Composing [x1 around x2] = [x2] [x1] [x2] [x1] into [x1 thrice] = [x1] [x1] gives [x1 around x2 thrice] = [x2] [x1] [x2] [x1] [x2] [x1] [x2] [x1].
• Composing the above rule and [x1 opposite x2] = [x1] [x2] [x2] [x1] into x1 and x2 of [x1 after x2] = [x1] [x1] [x2] [x2] respectively results in the rule [x1 opposite x2 after x3 around x4 thrice] = [x1] [x2] [x2] [x1] [x1] [x2] [x2] [x1] [x4] [x3] [x4] [x3] [x4] [x3] [x4] [x3] [x4] [x3] [x4] [x3] [x4] [x3] [x4] [x3].
• Substituting x1 and x3 by [turn] = RED and x2 and x4 by [left] = YELLOW results in the rule [turn opposite left after turn around left thrice] = RED YELLOW YELLOW RED RED YELLOW YELLOW RED YELLOW RED YELLOW RED YELLOW RED YELLOW RED YELLOW RED YELLOW RED YELLOW RED YELLOW RED, which varies from the predicted reply by the underlined tokens.

Key statistics of the main cSCAN datasets. For full details, see Appendix H.

Specifically, we start by generating 12K contexts with 100 top-level examples each, yielding a pool of 1.2M top-level examples. We then annotate each top-level example with a set of atoms and compounds based on the phrase-structure grammar rules that were composed to form the top-level example request, and we split the set of top-level examples in such a way as to maximize the divergence in the distribution of compounds between train, validation, and test, while keeping the distribution of atoms nearly the same. Similarly to Keysers et al. (2020), we down-sample during the splitting process for more effective control of the distribution divergences, leading to a total of 100K top-level examples in train (comparable to cSCAN Random) and 10K top-level examples in each of validation and test.

Statistics. Table 2 gives an overview of the key statistics of the cSCAN Random datasets and representative MCD datasets. (See Appendix H for details of other cSCAN dataset variants.)
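The divergence being maximized can be sketched with the Chernoff-coefficient-based measure used in DBCA-style splitting (Keysers et al., 2020): divergence(P, Q) = 1 − Σ_k p_k^α q_k^(1−α), with a small α (e.g., 0.1) for compounds and α = 0.5 for atoms. The toy distributions below are illustrative:

```python
def divergence(p, q, alpha):
    """Chernoff-based divergence between two discrete distributions (dicts)."""
    keys = set(p) | set(q)
    return 1.0 - sum(p.get(k, 0.0) ** alpha * q.get(k, 0.0) ** (1 - alpha)
                     for k in keys)

# Toy compound distributions for a train and a test split.
train = {"walk+twice": 0.5, "run+twice": 0.5}
test = {"walk+twice": 0.5, "walk+thrice": 0.5}
print(round(divergence(train, test, 0.1), 3))   # partial overlap -> ~0.5
print(round(divergence(train, train, 0.1), 3))  # identical -> 0.0
```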

Accuracy vs. consistency. Learner A is more accurate, but learner B is more consistent.

Test results on cSCAN random splits by model size: S (Small), B (Base), and L (Large).

Test accuracy on different cSCAN MCD splits by model size: S (Small), B (Base), and L (Large).

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In EMNLP, 2018. URL https://arxiv.org/pdf/1809.08887.pdf.

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33, 2020.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization, 2019.

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.

⟨run twice, RUN RUN⟩, ⟨{run = JUMP}, run twice, JUMP JUMP⟩, ⟨{run left = LTURN RUN}, run left twice, LTURN RUN LTURN RUN⟩, ... }

1⟩ }, ⟨run twice and jump = RUN RUN JUMP LTURN, 1⟩, { ⟨∅, run = LOOK, 1⟩, ⟨∅, jump = WALK, 1⟩ }, ⟨run twice and jump = LOOK LOOK WALK LTURN, 1⟩,

Context generation. For efficiency, and to aid in generating clusters of related examples for calculation of the consistency metric, we generate examples in batches, in which we first generate a context and then generate multiple top-level examples that share that same context. Each context is created by first randomly generating a coherent set of interpretation rules of similar form to those shown in Figure

For cSCAN, we consider for simplicity only implications and contradictions among predictions for top-level examples that share the same context. While in general implications and contradictions can occur even among examples with different contexts (particularly if the examples are monotonic and if one context is a superset of the other), due to the way in which we construct the cSCAN dataset, such situations are extremely unlikely to occur. By focusing only on identifying implications and contradictions among examples sharing the same context, we are able to cleanly partition the dataset into independent clusters of examples, such that we can analyze each cluster efficiently in parallel. Dealing with clusters of examples that share the same context also simplifies analysis in that we now only need to consider the implications and contradictions among the request-reply pairs, while effectively ignoring the context.
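A cluster-based consistency computation of this kind can be sketched as follows; the data structures and constraint encoding are illustrative assumptions, not our actual implementation:

```python
def consistency(clusters):
    """Fraction of satisfied logical constraints across all clusters.

    Each cluster pairs the model's predictions (request -> reply) with a list
    of constraint predicates derived by symbolic inference over that cluster.
    """
    satisfied = total = 0
    for predictions, constraints in clusters:
        for constraint in constraints:
            total += 1
            if constraint(predictions):
                satisfied += 1
    return satisfied / total if total else 1.0

# Toy cluster: an implication constraint that holds and a contradiction
# constraint that is violated by the predictions.
preds = {"run": "JUMP", "run twice": "JUMP JUMP", "walk": "JUMP"}
constraints = [
    lambda p: p["run twice"] == p["run"] + " " + p["run"],  # satisfied
    lambda p: p["walk"] != p["run"],                        # violated
]
print(consistency([(preds, constraints)]))  # 0.5
```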

Atom and compound divergence of the cSCAN MCD datasets between pairs of splits.

Atom and compound counts of the cSCAN MCD datasets. The "held out" column records the number of compounds that appear in the test split but not in the train split.

Key statistics of the cSCAN datasets.

Accuracy as a function of training data size. Results from experiments with pre-trained T5-Base (220M parameters) on datasets of similar form to cSCAN-B, but of varying size.

As an additional set of experiments, we investigate the impact of training data size on the performance of the relatively strong pre-trained T5-Base (220M parameters) baseline. In each experiment, we use a dataset generated with a similar mix of examples as in cSCAN-B, but with a varying number of contexts and examples in the train set. As can be seen in the results in Table 9, while the model is able to achieve high accuracy and relatively high consistency as the train size approaches 800K examples, performance drops dramatically as the train size decreases from 100K examples down to 10K examples. This suggests that finding ways to reduce models' dependency on large amounts of task-specific training data will be an additional important theme for future work in conceptual learning.

Hyperparameters used in the cSCAN baselines.

Breakdown of accuracy by example characteristics on cSCAN-B Random.

Breakdown of accuracy by example characteristics on cSCAN-X Random.

Breakdown of accuracy by example characteristics on cSCAN-B MCD.

Breakdown of accuracy by example characteristics on cSCAN-X MCD.

Partial accuracy measures on cSCAN-B Random

Partial accuracy measures on cSCAN-X Random

Partial accuracy measures on cSCAN-B MCD

Partial accuracy measures on cSCAN-X MCD

Edit distance measures for cSCAN-B Random

Edit distance measures for cSCAN-B MCD

Edit distance measures for cSCAN-X MCD


In this section, we show the number of implications and contradictions used to calculate the consistency metric. Tables 21 and 22 show this breakdown. To further illustrate the type of inconsistencies the model is making, we show the consistency set (implications and contradictions) size distribution in Figures 14 and 15. This shows that sets of size 2, where one prediction implies or contradicts another, are the most common. Appendix P.2 expands on this by providing examples of the contradictions. As can be seen in these tables, the consistency metric for each cSCAN Random experiment is calculated based on a minimum of 500 implications and contradictions.

As discussed in Section 6.2 and Appendix F, we do not report consistency metrics for the MCD datasets, as we are only able to achieve a high enough density of potential implications in the Random datasets, where, after splitting, we augment each context with additional top-level examples from the same distribution. We do not perform this additional example generation step for the MCD datasets to avoid impacting the compound divergences between the train and test sets.

For context examples that represent rule assertions, we also adopted a simplified syntax, similar to the shorthand described in Appendix C.2, consisting of a single line containing the rule request alone, while omitting the reply, since in cSCAN we only include positive rule assertions in the context (i.e., never rule examples with reply of 0 or ?). Note also that we do not perform any clustering of top-level examples by their context, but rather represent each top-level example in flattened form as shown here (with its context included). We then shuffle the full set of top-level examples before batching them for input into T5. This means that even when there may be 100 or more top-level examples with the same context, T5 will in general not see them all in sequence or in the same batch, but rather intermixed with top-level examples with different contexts.

P QUALITATIVE ERROR ANALYSIS P.1 ACCURACY ERROR ANALYSIS

Here, we showcase examples where our best model (T5-Large with pre-training) fails to produce accurate results when evaluated on examples from cSCAN-B. We show different cases where the target and the prediction from the model differ.

Example of T5-Large failing at rule assertion (when the request is False).

Example of T5-Large failing at rule assertion (when the request is True).

P.2 CONSISTENCY ERROR ANALYSIS

All the above examples show inaccurate output, where there is a mismatch between the expected target and the model prediction. However, as discussed in Section 4, regardless of being accurate, a model may fail to stay consistent in replying to different requests. In this section, we sample contradictions made by T5-Large on cSCAN-X Random and categorize the types of contradictions the model makes. Note that consistency is independent of the context, and the context is therefore omitted when presenting the examples. As seen in Appendix N.2, the models produce contradictory sets of different sizes. We find it easier to analyze each size independently, as the types of mistakes vary between them.

Contradictions of size 2. At this level, the inconsistencies are 1:1 relationships between two contradictory predictions P1 and P2. There are two types at this level:

• Type 1: Both P1 and P2 are asserted rules with matching left-hand sides and different right-hand sides.

• Type 2: One prediction is an asserted rule with no variables, and the other is a rule application prediction whose request matches the left-hand side of the asserted rule while its reply does not match the rule's right-hand side.

From a sample of 20 contradictory sets of size 2, 11 sets were of type 1 while 9 were found to be of type 2. Here are examples of both types.

Contradictory set of size 2: Type 1
Reply 1: 1 (Reasoning: Monotonic)
Request 2: [drive left 5x framing
Reply 2: 1 (Reasoning: Monotonic)

Commentary

The two right-hand sides differ in the underlined tokens.

Contradictions of size 3. At this level, the inconsistencies are 2:1 relationships with at least two rule assertions:

• Type 1: The composition of two rules directly contradicts a third rule.

• Type 2: The composition of two rules forms a rule with no variables whose left-hand side matches the request of a rule application prediction and whose right-hand side contradicts the reply of that prediction.

From a sample of 20 contradictory sets of size 3, Type 1 and Type 2 had 10 sets each. Here are examples of both types.

Contradictory set of size 3: Type 1
Request 1
Reply

Contradictions of size 4. At this level, there are more complex contradictory relations, as the space of possible compositions grows. We categorize three types of contradictions:

• Type 1: Composing 3 rules to form a rule with no variables that contradicts a rule application prediction.

• Type 2: The composition of one subset of rules contradicts the composition of another set of rules.

• Type 3: The composition of 3 rules directly contradicts another rule.

For this size, the model made only 10 contradictions: 5 were of type 2, 3 were of type 3, and 2 were of type 1.

Reply
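The size-2 Type 1 check above (two asserted rules with equal left-hand sides but unequal right-hand sides) can be sketched as follows; the names and rule encoding are illustrative, not our actual analysis code:

```python
def type1_contradictions(asserted_rules):
    """Return pairs of (lhs, rhs) rule predictions with equal LHS and unequal RHS."""
    pairs = []
    for i, (lhs1, rhs1) in enumerate(asserted_rules):
        for lhs2, rhs2 in asserted_rules[i + 1:]:
            if lhs1 == lhs2 and rhs1 != rhs2:
                pairs.append(((lhs1, rhs1), (lhs2, rhs2)))
    return pairs

# Toy predictions: the model asserts two conflicting meanings for [run].
rules = [("[run]", "JUMP"), ("[run]", "WALK"), ("[look]", "PINK")]
print(type1_contradictions(rules))  # [(('[run]', 'JUMP'), ('[run]', 'WALK'))]
```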

