A MINIMALIST DATASET FOR SYSTEMATIC GENERALIZATION OF PERCEPTION, SYNTAX, AND SEMANTICS

Abstract

Inspired by humans' exceptional ability to master arithmetic and generalize to new problems, we present a new dataset, Handwritten arithmetic with INTegers (HINT), to examine machines' capability of learning generalizable concepts at three levels: perception, syntax, and semantics. In HINT, machines are tasked with learning how concepts are perceived from raw signals such as images (i.e., perception), how multiple concepts are structurally combined to form a valid expression (i.e., syntax), and how concepts are realized to afford various reasoning tasks (i.e., semantics), all in a weakly supervised manner. Focusing on systematic generalization, we carefully design a five-fold test set to evaluate both the interpolation and the extrapolation of learned concepts w.r.t. the three levels. Further, we design a few-shot learning split to determine whether models can rapidly learn new concepts and generalize them to more complex scenarios. To understand the limitations of existing models, we conduct extensive experiments with various sequence-to-sequence models, including RNNs, Transformers, and GPT-3 (with chain of thought prompting). The results indicate that current models struggle to extrapolate to long-range syntactic dependency and semantics. Models exhibit a considerable gap toward human-level generalization when evaluated with new concepts in a few-shot setting. Moreover, we find that HINT cannot be solved by merely scaling up the dataset and the model size; this strategy contributes little to the extrapolation of syntax and semantics. Finally, in zero-shot GPT-3 experiments, chain of thought prompting significantly boosts the test accuracy. We believe the HINT dataset and the experimental findings are of great interest to the learning community on systematic generalization.

1. INTRODUCTION

Humans possess a versatile mechanism for learning concepts from data (Firestone & Scholl, 2016). Suppose, for example, that we were tasked with deciphering ancient Egyptian signs based on the examples in Table 1. Given sufficient time, we may comprehend these signs by learning how to recognize them (what each sign looks like, at the perceptual level), how to compose them into valid sequences (at the syntactic level), and how to predict the results (at the semantic level). Concept learning relies heavily on these three interwoven levels of meaning. Such an observation is also consistent with the classic view of human cognition, which postulates at least three distinct levels of organization in computational systems (Pylyshyn, 1984). Interested readers can refer to the website, https://liqing-ustc.github.io/HINT/Egyptian, for more training and test samples with the ground-truth meaning for each sign. We strongly encourage the readers to play this game prior to reviewing the answers.

Another appealing characteristic of human concept learning is its systematic compositionality (Chomsky, 1957; Montague, 1970): the algebraic capacity to understand and construct an endless number of novel combinations from a finite set of known components, i.e., "infinite use of finite means" (Chomsky, 1965). As illustrated in Table 1, this form of compositionality is essential to the human ability to make strong generalizations from simple examples to complex ones. Various benchmarks (Lake & Baroni, 2018; Hupkes et al., 2020; Keysers et al., 2020) and methods (Lake, 2019; Gordon et al., 2019; Csordás et al., 2021) have been introduced by the emerging community of learning models that capture human-like systematic compositionality. As it is difficult to collect real data with systematic compositionality, the majority of existing benchmarks are derived from artificial domains using synthetic data and tasks, covering only a subset of the concept learning spectrum; see Table 2 for a detailed comparison.
When evaluating systematic compositionality, prior datasets frequently conflate syntax and semantics. For instance, the SCAN dataset (Lake & Baroni, 2018) is a semantic parsing task from natural language commands to action sequences; when a model fails on a command longer than those in the training set, the root cause could be either a misinterpretation of the complex syntactic relations in a long input sequence (command) or an inability to generate a long output sequence (actions), e.g., as a result of the EOS decision problem (Newman et al., 2020). In addition, previous benchmarks frequently incorporated simple semantics (e.g., a simple mapping or repetition), resulting in an undesired bias toward syntactic generalization.

To expand systematic compositionality to a full-spectrum systematic generalization w.r.t. perception, syntax, and semantics, we draw inspiration from arithmetic and present a new benchmark called HINT, Handwritten arithmetic with INTegers. The HINT task is intuitive: machines accept images of handwritten expressions as input and predict the final results of the expressions, restricted to the integers. Since there is no intermediate supervision, the three levels of meaning are intertwined during learning, and models are expected to acquire all three simultaneously to make correct predictions. To provide a comprehensive and rigorous test of how models generalize the learned concepts, we introduce a carefully structured evaluation scheme with five subsets, focusing on generalization patterns (i.e., interpolation and extrapolation) at various levels (i.e., perception, syntax, and semantics). In addition, we build a few-shot learning split to determine if models can rapidly learn new concepts from few examples and generalize them to more complicated scenarios.
Being minimal yet comprehensive in terms of systematic generalization, HINT is fundamentally more difficult than earlier datasets because: (i) the images are of actual handwriting with considerable visual variation; (ii) the syntactic relations between the tokens in the expressions are more complex, with long-range dependency; and (iii) the semantics of arithmetic concepts are more complex than the simple mappings in prior datasets.

To facilitate future research in this direction, we conduct extensive experiments with various sequence-to-sequence (seq2seq) models, including Recurrent Neural Networks (Hochreiter & Schmidhuber, 1997; Chung et al., 2014), Transformers (Vaswani et al., 2017), and GPT-3 (Brown et al., 2020) with chain of thought prompting (Wei et al., 2022). Our experiments indicate that all models still struggle on HINT; even the state-of-the-art model, Universal Transformer (Dehghani et al., 2018) with relative positional encoding (Shaw et al., 2018; Dai et al., 2019), achieves just 54% accuracy on HINT, although it achieves virtually perfect accuracy on prior datasets such as SCAN (Csordás et al., 2021). An in-depth analysis of the results on each test subset reveals that current models still struggle with extrapolation to long-range syntactic dependency and semantics. In the GPT-3 experiments, chain of thought prompting increases the zero-shot test accuracy from 8.6% to 27.6%. By examining the scaling trends of the test accuracy w.r.t. the size of the model and the dataset, we find that it is impractical to solve HINT by simply scaling up the size of the dataset or the model, as is typically done in NLP tasks (Kaplan et al., 2020; Henighan et al., 2020); more data and parameters do not significantly improve the extrapolation over syntax and semantics.
The few-shot learning experiments demonstrate that, although the top-performing models exhibit decent capabilities for learning new concepts, they are still far from human-level generalization, which requires only examples of a new concept in its primitive form and readily generalizes to more complex compositions of the learned concept.

In short, we introduce the HINT dataset for investigating systematic generalization across three levels: perception, syntax, and semantics. By benchmarking various seq2seq models on HINT, we uncover their primary weaknesses in systematic generalization. We hope the HINT dataset and our experimental findings will stimulate future developments of systematic generalization.

Table 2:
(Ruis et al., 2020) | synthetic | SP | i&t | ✓* ✓ ✓ | systematic | 300K
PCFG (Hupkes et al., 2020) | synthetic | SP | text | ✓ ✓ | systematic | 100K
CFQ (Keysers et al., 2020) | real | SP | text | ✓ ✓ | systematic | 239K
CURI (Vedantam et al., 2021) | synthetic | IC | image | ✓ ✓ | systematic | 15K
COGS (Kim & Linzen, 2020) | real | SP | text | ✓ ✓ | systematic | 30K
Mathematics (Saxton et al., 2018) | real | QA | text | ✓ ✓ | systematic | 2M
PGM (Barrett et al., 2018)

2. RELATED WORK

Benchmarks on Systematic Generalization Although several benchmarks (Lake & Baroni, 2018; Hupkes et al., 2020; Barrett et al., 2018; Zhang et al., 2019; Teney et al., 2020; Keysers et al., 2020; Bahdanau et al., 2019; Ruis et al., 2020; Kim & Linzen, 2020) have advanced systematic generalization, the majority of them are based on artificial domains with synthetic tasks, involve just one or two aspects of concept learning, and often mix the generalization over syntax and semantics. SCAN (Lake & Baroni, 2018) is tasked with translating a natural language command into a sequence of operations in a simplified navigation domain using only syntax and semantics. CLEVR (Johnson et al., 2017) requires parsing questions (syntax) and grounding visual objects (perception), although objects themselves lack functional semantics. We refer readers to Table 2 for detailed comparisons of related datasets. In contrast, the proposed HINT benchmark stems from the area of arithmetic reasoning with real handwriting images (at the primitive level, rather than the expression level) and requires joint learning of perception, syntax, and semantics. The precise definitions and boundaries of these meanings in HINT make it possible to build test splits that evaluate specific generalizations. Notably, HINT possesses more complex semantics, which eliminates the undesirable bias towards syntactic generalization present in earlier datasets. The task of the HINT benchmark is inspired by the HWF dataset (Li et al., 2020) but requires full-spectrum learning of perception, syntax, and semantics. By going beyond the i.i.d. train/test split in Li et al. (2020), HINT focuses on examining systematic generalization across many aspects of concepts.
Methods on Systematic Generalization To capture systematic generalization, new training regimes (Lake, 2019; Andreas, 2020; Akyürek et al., 2020; Zhang et al., 2022) and model architectures (Dessì & Baroni, 2019; Russin et al., 2019; Csordás et al., 2021; Gordon et al., 2019; Bergen et al., 2021) have been developed. Russin et al. (2019), for instance, expand a seq2seq model by segregating syntactic and semantic information. Csordás et al. (2021) investigate a variety of Transformer configurations to enhance its systematic compositionality. Andreas (2020) and Akyürek et al. (2020) investigate data augmentation for compositional generalization. In particular, several neural-symbolic methods with domain-specific designs (Chen et al., 2020; Nye et al., 2020; Liu et al., 2020) achieve near-perfect accuracy on prior systematic generalization datasets like SCAN (Lake & Baroni, 2018). However, these neural-symbolic methods introduce non-trivial domain-specific symbolic components, so their flexibility and transferability to other domains are unclear. In this paper, we benchmark on HINT with prevailing seq2seq frameworks, including RNNs, Transformers, and GPT-3, which require minimal domain-specific design and may be of broad interest to the learning community. We reserve for future research the investigation of more sophisticated methods, such as data augmentation and neural-symbolic approaches.

3. THE HINT DATASET

In this section, we present the specifics of the HINT benchmark, devised to evaluate models' capability of learning generalizable concepts at three distinct levels: perception, syntax, and semantics.

3.1. THE DEFINITIONS OF PERCEPTION, SYNTAX, AND SEMANTICS

We first define the perception, syntax, and semantics in the domain of HINT, as shown in Table 3. Perception refers to the mapping from image pixels into meaningful patterns, such as mapping an image of a handwritten expression to a symbolic sequence. Syntax refers to the mechanism by which the concepts in one sample are structurally organized, e.g., parsing the symbolic sequence into a tree; the syntax in Table A2 is expressed by a phrase-structure grammar. Semantics refers to the functional meanings of these arithmetic concepts, e.g., what value '5' represents and what value '+' produces when given two arguments 1 and 1.

Table 3: The definitions of perception, syntax, and semantics. In syntax, number, op1, and op2 are the HINT grammar's pre-terminals in Table A2. In semantics, $i$ and $j$ are the operator's inputs. $-$ is defined as $\max(0, i - j)$ to prevent negative results, and $\div$ is defined as $\mathrm{ceil}(i \div j)$ to remove the decimal portions of the results.

Notably, although these three levels have a clear boundary by their definitions, a model need not necessarily represent them by separate and individual modules. An end-to-end neural network trained on this domain, for instance, will likely contain neurons and parameters from all three levels. The notion of perception, syntax, and semantics simply requires the models to capture these meanings during evaluation, regardless of how the models finish the tasks, implicitly or explicitly.

Task The task of HINT is intuitive: predict the final results of handwritten arithmetic expressions in a weakly-supervised manner. That is, only the final results are given as supervision; all the symbolic expressions, parse trees, and intermediate values are latent. In such a setting, any model must simultaneously master perception, syntax, and semantics to solve this task successfully.

The data generation process consists of three steps.
First, we extract handwritten images for each concept from CROHME (Mahdavi et al., 2019), including digits 0 through 9, operators $+$, $-$, $\times$, $\div$, and parentheses $($, $)$. Second, we randomly sample prefix expressions and convert them to infix expressions with necessary parentheses based on the operator precedence; only single-digit numbers are permitted. The symbolic expressions are fed into a solver to calculate the final results. Third, we randomly sample handwritten images for symbols in an expression and concatenate them to construct the final handwritten expression. We only retain the handwritten expressions as input and the corresponding final results as supervision; all intermediate results are discarded.
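The three generation steps can be illustrated with a minimal Python sketch. It is not the authors' generator: the ASCII tokens `+ - * /` stand in for the four operators, `image_pool` is a hypothetical mapping from each symbol to the ids of its handwritten images, and resampling expressions that divide by zero is an assumption, since the text does not say how such cases are handled.

```python
import random

# Operator table: symbol -> (precedence, semantics). Subtraction is clamped at
# zero and division is rounded up, following the definitions in Section 3.1.
OPS = {
    "+": (1, lambda i, j: i + j),
    "-": (1, lambda i, j: max(0, i - j)),
    "*": (2, lambda i, j: i * j),
    "/": (2, lambda i, j: -(-i // j)),  # integer ceiling of i / j
}


def sample_prefix(num_ops, rng):
    """Sample a random prefix expression (token list) with `num_ops` operators."""
    if num_ops == 0:
        return [str(rng.randint(0, 9))]  # only single-digit numbers
    left_ops = rng.randint(0, num_ops - 1)
    op = rng.choice(list(OPS))
    left = sample_prefix(left_ops, rng)
    right = sample_prefix(num_ops - 1 - left_ops, rng)
    return [op] + left + right


def to_infix(prefix):
    """Return (infix tokens, final result, all intermediate values).

    Raises ZeroDivisionError if any sub-expression divides by zero.
    """
    pos = 0
    values = []

    def parse():
        nonlocal pos
        tok = prefix[pos]
        pos += 1
        if tok not in OPS:
            return [tok], int(tok), 3  # digits bind tighter than any operator
        prec, fn = OPS[tok]
        l_toks, l_val, l_prec = parse()
        r_toks, r_val, r_prec = parse()
        # Parenthesize only where precedence alone would change the parse.
        if l_prec < prec:
            l_toks = ["("] + l_toks + [")"]
        if r_prec <= prec:  # operators are left-associative
            r_toks = ["("] + r_toks + [")"]
        val = fn(l_val, r_val)
        values.append(val)
        return l_toks + [tok] + r_toks, val, prec

    tokens, result, _ = parse()
    return tokens, result, values


def make_sample(num_ops, image_pool, rng):
    """Build one (handwritten image sequence, final result) pair."""
    while True:
        try:
            tokens, result, _ = to_infix(sample_prefix(num_ops, rng))
        except ZeroDivisionError:
            continue  # assumption: expressions dividing by zero are resampled
        # One randomly chosen handwritten image per symbol; intermediate
        # values are discarded, only the final result is kept as supervision.
        return [rng.choice(image_pool[t]) for t in tokens], result
```

For example, `to_infix(["*", "+", "1", "2", "3"])` yields the tokens of `(1 + 2) * 3` with result 9, while `to_infix(["-", "3", "-", "5", "9"])` yields `3 - (5 - 9)` with result 3 because the inner subtraction is clamped to 0.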

Full-Spectrum Systematic Generalization

To rigorously evaluate the systematic generalization of the learned concepts, we substitute the standard i.i.d. split with a meticulously crafted evaluation scheme. We randomly divide all handwritten images into three splits: training (75%), validation (5%), and test (20%). First, we limit the maximum number of operators in the training set to 10 and the maximum intermediate values to 100:

$$D_{\text{train}} \subset T_{\text{train}} = \{(x, y) : |x| \le 10, \max(v) \le 100\}, \tag{1}$$

where $x$ is the expression, $|x|$ its number of operators, $y$ the final result, and $v$ all the intermediate values and the final results. To ensure diversity in the training set, we sample a maximum of 100,000 distinct expressions with the same number of operators. To prevent bias in the final results, we cap the percentage of a certain result at less than 5%. Next, we carefully curate the test set to evaluate different generalization capabilities (i.e., interpolation and extrapolation) on different levels of meaning (i.e., perception, syntax, and semantics). Specifically, the test set comprises five subsets, formally defined as:

$$D_{\text{test}} = I \cup SS \cup LS \cup SL \cup LL, \quad \text{where} \tag{2}$$

$I \subset D_{\text{train}}$: generalization on perception only
$SS \subset T_{\text{train}} \setminus D_{\text{train}}$: interpolation on both syntax and semantics
$LS \subset \{(x, y) : |x| > 10, \max(v) \le 100\}$: extrapolation on syntax and interpolation on semantics
$SL \subset \{(x, y) : |x| \le 10, \max(v) > 100\}$: interpolation on syntax and extrapolation on semantics
$LL \subset \{(x, y) : |x| > 10, \max(v) > 100\}$: extrapolation on both syntax and semantics

All subsets of the test set require generalization on perception since all images in the test set are unseen during training. For the test set, we sample no more than 1,000 unique expressions with the same number of operators, and the final results are also balanced. The maximum number of operators is set to 20, and the maximum intermediate value to 10,000. We also build a small validation set for hyperparameter tuning.
See Table 4 for training and test examples and refer to Appendix A for further dataset statistics.
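The five-way partition of the test set follows directly from the two thresholds above. The sketch below restates it as a small Python function; the name `assign_subset` and its arguments are illustrative and not part of the released dataset code.

```python
def assign_subset(num_ops, max_value, in_train, max_ops=10, value_cap=100):
    """Name the test subset an expression belongs to.

    num_ops   -- |x|, the number of operators in the expression
    max_value -- max(v), the largest intermediate or final value
    in_train  -- True if the symbolic expression also occurs in D_train
    """
    short = num_ops <= max_ops        # interpolation on syntax
    small = max_value <= value_cap    # interpolation on semantics
    if short and small:
        # Same symbolic expression as training (new images only) -> I,
        # otherwise an unseen expression from T_train \ D_train -> SS.
        return "I" if in_train else "SS"
    if small:
        return "LS"
    if short:
        return "SL"
    return "LL"
```

For instance, an expression with 12 operators whose values never exceed 100 falls into LS, whereas one with 5 operators and an intermediate value of 500 falls into SL.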

Few-shot Learning and Generalization

To determine if models can rapidly learn new concepts, we constructed a few-shot learning split to learn six new concepts, as shown in Table 3 . These six concepts have different meanings in terms of perception, syntax, and semantics: two new numbers 

4. DEEP SEQUENCE-TO-SEQUENCE BASELINES

The task of HINT can be naturally formulated as a sequence-to-sequence (seq2seq) problem: The input is a handwritten expression, segmented into a sequence of images by a sliding window, and the output is an integer, converted into a sequence of digits. We benchmark deep seq2seq frameworks on HINT; see Figure 1 for an illustration using a detailed example.

4.1. IMAGE TOKENIZING AND EMBEDDING

Existing seq2seq frameworks typically accept a sequence of tokens as input. To tokenize a handwritten expression, its height is first resized to 32, and a 32-pixel sliding window is applied along the horizontal axis to render a sequence of images. Next, each image in the sequence is encoded by ResNet-18 (He et al., 2016) , sufficient to handle the visual variance in handwriting.
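The tokenizing step can be sketched in plain Python, with a grayscale image represented as a list of pixel rows. Several details here are assumptions not stated in the text: the aspect ratio is preserved when resizing, the window stride equals the window width (non-overlapping patches), and the last patch is padded with a white value of 255. A real pipeline would use an image library and feed each patch to ResNet-18.

```python
def resize_height(image, target_h=32):
    """Nearest-neighbour resize of a grayscale image (list of rows) so that
    its height becomes `target_h`; the width is scaled by the same factor."""
    h, w = len(image), len(image[0])
    target_w = max(1, round(w * target_h / h))
    return [
        [image[r * h // target_h][c * w // target_w] for c in range(target_w)]
        for r in range(target_h)
    ]


def sliding_windows(image, window=32, stride=32, pad_value=255):
    """Cut the image into `window`-wide patches along the horizontal axis."""
    width = len(image[0])
    patches = []
    for start in range(0, width, stride):
        patch = [row[start:start + window] for row in image]
        # Right-pad the final patch so that every patch is `window` wide.
        patch = [row + [pad_value] * (window - len(row)) for row in patch]
        patches.append(patch)
    return patches


def tokenize_expression(image, size=32):
    """Resize to height `size`, then return the sequence of square patches."""
    return sliding_windows(resize_height(image, size), window=size, stride=size)
```

Under these assumptions, a 64 x 200 input is resized to 32 x 100 and split into four 32 x 32 patches, the last of which is padded on the right.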

4.2. ENCODER-DECODER ARCHITECTURES

RNNs Recurrent neural networks (RNNs) have long been a dominant choice for sequence modeling tasks. We test two popular RNNs in the literature: long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997) and gated recurrent units (GRU) (Chung et al., 2014). Each model is evaluated both with and without attention (Bahdanau et al., 2015).

Transformers Since their inception (Vaswani et al., 2017), Transformers have gradually supplanted recurrent or convolutional neural networks as the de facto choice for various sequence modeling tasks (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020). Nevertheless, prior work (Dehghani et al., 2018; Hupkes et al., 2020; Kim & Linzen, 2020) suggests that the vanilla Transformer fails substantially in many tasks requiring systematic generalization when the sequence lengths exceed those observed during training. Recently, several simple tricks have been proposed (Csordás et al., 2021) to improve the generalization capability of Transformers; two of them work particularly well: (i) using relative positional encoding (Shaw et al., 2018; Dai et al., 2019), and (ii) sharing weights across the blocks in the Transformer, a.k.a. Universal Transformer (Dehghani et al., 2018). Therefore, we benchmark three Transformer variants: the vanilla Transformer, Transformer with relative positional encoding, and Universal Transformer with relative positional encoding.

GPT-3 Since the release of GPT-3 (Brown et al., 2020), there have been intense debates and different perspectives regarding the mathematical reasoning capacity of pre-trained large language models (see, e.g., "Can GPT-3 do math?", https://www.youtube.com/watch?v=TMxAbNAVrzI). To systematically and comprehensively evaluate GPT-3's competence in arithmetic reasoning, we test it on the proposed HINT benchmark using symbolic expressions as input. Since all tokens of HINT are in the vocabulary of GPT-3, we directly evaluate GPT-3 via zero-shot prompting using the OpenAI API (https://openai.com/api/). We construct the prompt in the following form: "Q: What is <Expression>? A: The answer is", similar to the practice in Brown et al. (2020), but with more complex expressions. Recently, chain of thought (CoT) prompting (Wei et al., 2022) has been extended to the zero-shot setting (Kojima et al., 2022) by adding a simple prompt, "Let's think step by step," to facilitate step-by-step thinking prior to answering each question. Zero-shot CoT surpasses the standard zero-shot prompting by a significant margin in various reasoning tasks. Therefore, we also apply zero-shot CoT prompting to evaluate GPT-3 on HINT; we refer the readers to Appendix B.2 for the details of zero-shot CoT.

4.3. TRAINING AND EVALUATION

Training All models are trained using the Adam optimizer (Kingma & Ba, 2014); the gradients exceeding 5.0 are clipped. Dropout (Srivastava et al., 2014) is applied to each recurrent layer of RNNs and each sub-layer of Transformers, including both the multi-head attention layers and the feedforward layers. No training is required for zero-shot experiments on GPT-3; instead, 100 samples from each test subset are selected and fed to GPT-3 through zero-shot or zero-shot-CoT prompting. Hyperparameter Tuning To produce reliable results, a thorough hyperparameter tuning is performed w.r.t. the number of layers in the encoder and the decoder, the dimension of the token embedding, the number of hidden units per layer, the number of attention heads in Transformers, the dropout ratio, and the learning rate. We refer the readers to Table A3 for further information. Evaluation Metric We report the accuracy of the final results. A predicted result is considered correct only when it exactly matches the ground-truth answer. 
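As a small illustration of the output format and the metric, the sketch below converts integer targets into digit sequences (the decoder's output format described in Section 4) and computes exact-match accuracy per test subset. The function names and the record layout are hypothetical.

```python
from collections import defaultdict


def to_digit_sequence(value):
    """Encode a non-negative integer as the digit sequence a decoder emits."""
    return list(str(value))


def is_exact_match(predicted_digits, target_value):
    """A prediction is correct only if every digit matches the ground truth."""
    return predicted_digits == to_digit_sequence(target_value)


def accuracy_by_subset(records):
    """Compute exact-match accuracy for each test subset.

    records -- iterable of (subset_name, predicted_digits, target_value)
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subset, predicted, target in records:
        total[subset] += 1
        correct[subset] += is_exact_match(predicted, target)
    return {name: correct[name] / total[name] for name in total}
```

Note that a prediction such as `["1", "2"]` for the target 120 counts as wrong: there is no partial credit for a correct prefix.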

5.1. JOINT LEARNING OF PERCEPTION, SYNTAX, AND SEMANTICS

Tables 5 and 6 summarize the results of all models on HINT using image inputs and symbol inputs, respectively. Among all models, the Universal Transformer with relative positional encoding ("Transformer rel. uni.") has the highest average accuracy on the test set. Upon careful examination of the results, the following observations and insights can be made:
• Models attain high accuracy on the subset I. In particular, Transformer rel. uni. using image inputs achieves an accuracy of 88.4%. The test subset I shares the symbolic expressions with training but uses different handwritten images for the symbols. This indicates that Transformers and RNNs, jointly trained with ResNet-18, generalize well over perception. As depicted in Figure 2, the model forms meaningful clusters for each concept and captures syntactic roles to some extent without direct supervision on perception.
• Transformers achieve high accuracy on the subset SS. The expressions in SS share the same length and value distribution as training. This result indicates that Transformers exhibit robust interpolation over syntax and semantics.
• The accuracy of Transformer rel. uni. on LS is substantially lower than its accuracy on SS or I (see Table 6). Note that the identical model yields perfect accuracy on the length cutoff splits of SCAN (Csordás et al., 2021). This result, although rather unexpected, may be explained by the syntax difference between HINT and SCAN shown in Table A2: the expressions in HINT may have a longer-range dependency and greater tree depth than the commands in SCAN. This observation suggests that present Transformers, which have finite depth, are incapable of adequately capturing syntax with long dependencies and large depth.
• Transformer with relative positional encoding achieves similar performance on I and SS as the vanilla Transformer with absolute positional encoding, yet relative positional encoding doubles the Transformer's accuracy on LS (see Table 6). This contrast implies that relative positional encoding is essential for Transformers to generalize to long expressions. Sharing weights between the layers using the Universal Transformer can further enhance performance.
• Models perform poorly on the subsets SL and LL. The accuracy on SL and LL is significantly lower than that on I and SS. All models exhibit near-zero accuracy on samples whose answers are over 100 (the maximum final result in the training set). This finding suggests that neither RNNs nor Transformers are able to extrapolate to larger numbers beyond those in the training set.
• While GPT-3 with zero-shot prompting performs poorly, chain of thought (CoT) prompting significantly improves the accuracy. Notably, GPT-3 with zero-shot CoT achieves an accuracy of 49.0% on SL, which is superior to other fine-tuned models. We believe this is because GPT-3 has been pre-trained on data with larger numbers, and CoT improves the reasoning process. Despite CoT prompting, GPT-3 performs poorly on long expressions in LS and LL.

Summary We observe significant room for improvement on HINT. Even the best model, Universal Transformer with relative positional encoding, can only achieve an accuracy of 54.3% on HINT, while the same model achieves virtually perfect accuracy on earlier datasets of systematic generalization, such as SCAN. The challenges of HINT stem from the fact that it requires joint learning and generalization of perception, syntax, and semantics: the perception has a large variance in real handwriting, the syntax supports long dependency between symbols, and the semantic complexity is well beyond the capability of the state-of-the-art models.

Scaling Laws Since HINT can generate an endless amount of data for training, one may wonder if merely increasing the dataset and the model size can solve the problem, akin to certain NLP tasks (Kaplan et al., 2020; Henighan et al., 2020). Empirically, Figure 3 

5.2. FEW-SHOT LEARNING AND GENERALIZATION

In this section, we fine-tune the top two models on six new concepts; Table 7 summarizes the results. Transformer rel. uni. outperforms LSTM w/ attn across all concepts by a significant margin, which is greater than six times their performance gap in Table 5. This discrepancy suggests that with limited data, Transformer is superior to LSTM at learning new concepts. Figure 4 depicts the test accuracy of Transformer rel. uni. while using varied maximum operators for training. In general, the more data and longer expressions used for training, the higher the model's performance. One test case for learning new numbers ("xy") is (0, 26.5), where the model is only exposed to the primitive concept during training and is expected to generalize to complex compositions during testing. The classic thought experiments (Fodor, 1975) indicate that this is straightforward for humans: If you grasp the meanings of "1," "1 + 1," and "x," you should also comprehend the meaning of "1 + x". A similar test case for learning new operators ("abcd") is (2, 24.1), since expressions comprising at least two operators are required to capture the syntax of a new operator. Transformer performs poorly on both of these tasks, demonstrating that it is still far from human-level generalization.

6. DISCUSSIONS: CONCLUSIONS AND LIMITATIONS

In this paper, we took inspiration from arithmetic and introduced a new challenge for the learning community, Handwritten arithmetic with INTegers (HINT), which serves as a minimal yet comprehensive benchmark for examining the full-spectrum systematic generalization of concept learning w.r.t. perception, syntax, and semantics. HINT is intrinsically more challenging than previous datasets on systematic generalization due to its substantial perceptual diversity in real handwriting, complex syntax, and sophisticated semantics. We benchmark on HINT with the state-of-the-art seq2seq models, including RNNs, Transformers, and GPT-3; the results point out their inability to extrapolate over syntax and semantics. The scaling trends of test accuracy w.r.t. dataset size and model size indicate that it is impractical to solve HINT by only increasing the size of the dataset and model. We believe that the HINT dataset and our experimental findings will inspire new advances in systematic generalization, particularly extrapolation over syntax and semantics. Limitations and Future Work Despite a large visual variance, the handwritten expressions are rather basic in terms of spatial locations and visual complexity. It would be more intriguing if we could further increase the perceptual complexity w.r.t. spatial relations like natural images (Lin et al., 2014) . Although syntax and semantics in HINT are already more complex than those of prior datasets, they remain context-free. Extending our findings to context-dependent syntax and semantics would be of practical value given their prevalence in natural languages; e.g., a word might have different syntactic roles or semantic meanings in different contexts. Regarding model development on HINT, our findings reveal that current seq2seq models, including Transformers, are unable to extract the systematic rules for both syntax and semantics from the training data. 
Improving the systematic generalization of Transformers, particularly extrapolation over semantics, is a crucial future direction. We also intend to investigate more advanced methods, such as meta-learning (Lake, 2019), data augmentation (Andreas, 2020; Akyürek et al., 2020), Edge Transformer (Bergen et al., 2021), and Neural-Symbolic Stack Machines (Chen et al., 2020). In addition, understanding the systematic generalization of large language models by evaluating them in few-shot or fine-tuning settings will be beneficial.

Figure A1: The number of handwritten images for each symbol. There are 82 arithmetic symbols (the top 50 are shown here) and 83,501 images in total. We use the handwritten images for digits 0 through 9, operators $+$, $-$, $\times$, $\div$, and parentheses $($, $)$ in this work; others are for potential future use.

B IMPLEMENTATION DETAILS

We benchmark deep sequence-to-sequence (seq2seq) frameworks on HINT, as illustrated by Figure 1 . All models are implemented in PyTorch (Paszke et al., 2019) .

B.1 IMAGE TOKENIZER AND EMBEDDING

To tokenize a handwritten expression, we first resize it to a height of 32 and apply a sliding window of size 32 along the horizontal axis to render a sequence of images. Next, each image in the sequence is encoded by ResNet-18 (He et al., 2016). We found in preliminary experiments that pre-training on ImageNet does not help, likely due to the domain gap between ImageNet and HINT. Therefore, we use a random initialization for ResNet-18 in our experiments.

B.2 ENCODER-DECODER ARCHITECTURES

We consider the following three choices for the encoder-decoder architecture in a seq2seq framework: Recurrent Neural Networks (RNNs), Transformers, and GPT-3.

RNNs We test two popular RNNs: long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997) and gated recurrent units (GRU) (Chung et al., 2014). Both networks are evaluated with and without attention (Bahdanau et al., 2015). Our implementations of RNNs are adapted from a seq2seq tutorial (https://github.com/bentrevett/pytorch-seq2seq).

GPT-3 To test GPT-3's ability to perform simple arithmetic operations without task-specific training, Brown et al. (2020) developed a small battery of 10 tests that involve asking GPT-3 a simple arithmetic problem in natural language; see Section 3.9.1 and Table 3.9 in Brown et al. (2020) for the results. In these tests, GPT-3 displays reasonable proficiency at simple arithmetic in the few-shot setting. However, they do not evaluate the multi-hop reasoning capability required by complex arithmetic expressions, which usually involve more operators and larger numbers.

Transformers

To systematically and comprehensively evaluate GPT-3's capability of arithmetic reasoning, we test GPT-3 on the proposed HINT benchmark using symbolic expressions as input. Since all tokens of HINT are in the vocabulary of GPT-3, we directly evaluate GPT-3 via zero-shot prompting using the OpenAI API (https://openai.com/api/). We construct the prompt in the following form: "Q: What is <Expression>? A: The answer is," similar to the practice in Brown et al. (2020) but with more complex expressions.

Via task-specific zero-shot or few-shot prompting, pre-trained large language models (LLMs) achieve excellent performance in intuitive and single-step System 1 tasks (Kahneman, 2011). However, LLMs struggle on System 2 tasks that require slow thinking and multi-hop reasoning (Rae et al., 2021), even at the scale of over 100B parameters like GPT-3. To address this shortcoming, chain of thought (CoT) prompting (Wei et al., 2022), which feeds LLMs with intermediate step-by-step reasoning to augment the final answer in a few-shot setting, has been proposed to elicit multi-hop reasoning in LLMs. Very recently, chain of thought prompting has been extended to the zero-shot setting (Kojima et al., 2022) by adding a simple prompt, "Let's think step by step", to facilitate step-by-step thinking before answering each question. Zero-shot CoT outperforms standard zero-shot prompting by a large margin in a variety of reasoning tasks.

Therefore, we also apply zero-shot CoT prompting to evaluate GPT-3 on HINT. More concretely, we follow a two-stage prompting strategy similar to Kojima et al. (2022):

1st prompt "Q: What is <Expression>? A: Let's think step-by-step." This prompt extracts the step-by-step reasoning process in the form of natural language from GPT-3, which is denoted by <Z>.

2nd prompt "Q: What is <Expression>? A: Let's think step-by-step. <Z> Therefore, the answer (arabic numerals) is" In the second stage, the response <Z> generated in the first stage is appended to the initial prompt along with an answer trigger sentence. This second prompt is then fed into GPT-3 to predict the final answer.

In our experiments, we use the 'text-davinci-002' engine in the OpenAI API, the most capable GPT-3 model at the time of writing with approximately 175 billion parameters (see https://blog.eleuther.ai/gpt3-model-sizes).
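The two-stage prompting procedure can be sketched as plain string construction. This is a minimal illustration, not the authors' code: the function names are hypothetical, and `complete` is an assumed stand-in for whatever text-completion call is used (the sketch makes no actual API request). Only the prompt templates are taken from the text above.

```python
from typing import Callable

# Answer trigger sentence quoted from the second-stage prompt.
ANSWER_TRIGGER = "Therefore, the answer (arabic numerals) is"


def direct_prompt(expression: str) -> str:
    """Standard zero-shot prompt."""
    return f"Q: What is {expression}? A: The answer is"


def cot_first_prompt(expression: str) -> str:
    """Stage 1: elicit the step-by-step reasoning <Z>."""
    return f"Q: What is {expression}? A: Let's think step-by-step."


def cot_second_prompt(expression: str, reasoning: str) -> str:
    """Stage 2: append <Z> and the answer trigger to the first prompt."""
    return f"{cot_first_prompt(expression)} {reasoning.strip()} {ANSWER_TRIGGER}"


def zero_shot_cot(expression: str, complete: Callable[[str], str]) -> str:
    """Run both stages; `complete` is a hypothetical prompt -> text function."""
    reasoning = complete(cot_first_prompt(expression))
    return complete(cot_second_prompt(expression, reasoning)).strip()
```

Passing the completion function as an argument keeps the prompt logic testable without network access; any client that maps a prompt string to generated text could be plugged in.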

B.3 TRAINING

Table A3 shows the tuned hyperparameters for the baselines. Our choices for each model are underlined, and the performance is reported under these settings unless explicitly stated otherwise. When generating the output, we use greedy decoding in all models for simplicity. All models reported in our paper can be trained on a single NVIDIA TITAN V GPU with 12 GB of memory. Training a model takes at most eight hours.
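For readers unfamiliar with greedy decoding, the following minimal sketch shows the idea: at each step the highest-scoring token is emitted until the end token appears. The `step` callable is a hypothetical stand-in for a trained decoder that maps the current prefix to per-token scores; the token ids and `max_len` default are illustrative assumptions.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    step: Callable[[List[int]], Sequence[float]],
    sos: int,
    eos: int,
    max_len: int = 20,
) -> List[int]:
    """Repeatedly pick the arg-max token until `eos` or `max_len` is reached."""
    tokens = [sos]
    for _ in range(max_len):
        scores = step(tokens)  # one score per vocabulary entry
        next_token = max(range(len(scores)), key=scores.__getitem__)
        if next_token == eos:
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start token
```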

B.4 ADDITIONAL EXPERIMENTAL RESULTS

Figure A3 shows the test accuracy as a function of several sample properties. Figure A4 shows the importance of these properties.

C HUMAN STUDY FOR FEW-SHOT LEARNING AND GENERALIZATION

We conduct a preliminary human study to evaluate human performance in the few-shot learning experiment. Specifically, we test ten human subjects on six concepts that are unknown to them, so as to reduce the influence of human priors as much as possible.



Can GPT-3 do math? https://www.youtube.com/watch?v=TMxAbNAVrzI
https://openai.com/api/
https://www.kaggle.com/datasets/xainano/handwrittenmathsymbols
https://www.cs.rit.edu/~crohme2019/
https://github.com/bentrevett/pytorch-seq2seq
https://github.com/RobertCsordas/transformer_generalization
https://openai.com/api/
OpenAI API GPT-3 model sizes: https://blog.eleuther.ai/gpt3-model-sizes



(x and y, representing 11 and 12, respectively), two operators of precedence 1 (a and b, representing max and min), and two operators of precedence 2 (c and d, representing arithmetic mean and harmonic mean). The train, validation, and test splits are constructed using the same strategy as in the full-spectrum generalization. Expressions are sampled to guarantee that the corresponding new concept appears at least once in the expression. This few-shot learning split is used to determine whether the models pre-trained on the training set can rapidly learn a new concept by fine-tuning on only a handful of examples involving the new concept. In this context, "few-shot" implies that the examples used to acquire a new concept are significantly fewer than those of the training set, but still exceed the number of examples required by humans to learn a new concept.

Figure 1: The seq2seq framework applied to an example in HINT. <SOS>: start-of-sentence token. <EOS>: end-of-sentence token. A sliding window segments the handwritten expression into a sequence of images, which are then separately encoded by ResNet-18. The expected output is a sequence of digits in reverse order.
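The reversed-digit output format described in the caption can be illustrated with a short sketch. The helper names are hypothetical, and wrapping the target in both <SOS> and <EOS> is an assumption made for illustration; only the reverse digit order is stated in the caption.

```python
SOS, EOS = "<SOS>", "<EOS>"


def encode_target(result: int) -> list:
    """Turn a non-negative integer result into reversed digit tokens."""
    digits = list(str(result))[::-1]  # least significant digit first
    return [SOS] + digits + [EOS]


def decode_target(tokens) -> int:
    """Invert encode_target: drop special tokens and restore digit order."""
    digits = [t for t in tokens if t not in (SOS, EOS)]
    return int("".join(reversed(digits)))
```

Under this encoding a result of 125 becomes the token sequence <SOS> 5 2 1 <EOS>.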

Figure 2: The t-SNE visualization of the embeddings (the outputs of ResNet-18) of handwritten images using the Transformer rel. uni. model. The image embeddings form clear clusters for each concept based on visual appearance. In addition, these clusters reflect the concepts' syntactic roles: the majority of digits are towards the bottom, operators are around the center, and parentheses are near the top.

Figure 3: Scaling trends w.r.t. model size and dataset size when training Transformer rel. uni. on the test subset LL with symbol inputs.

We benchmark three variants of Transformer: the vanilla Transformer, Transformer with relative positional encoding, and Universal Transformer with relative positional encoding. The implementations of these Transformers are adapted from Csordás et al. (2021) (https://github.com/RobertCsordas/transformer_generalization).

Figure A3: Test accuracy (avg.) of Transformer rel. uni. using symbol inputs as a function of several properties of samples: the expression's length, the depth of the expression's parse tree, the expression's maximum dependency range, the number of operators in the expression, and the final result.

In the few-shot learning experiments, models are first pre-trained on the main training set and then fine-tuned on the training set of each new concept individually. Models are fine-tuned for 1000 iterations using a batch size of 128, with half of the examples in each batch drawn from the main training set to prevent forgetting. The learning rates are $10^{-5}$ and $10^{-3}$ for Transformers and RNNs, respectively.
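The mixed-batch fine-tuning schedule can be sketched as a simple batch generator. This is an illustrative assumption about how such batches might be assembled, not the authors' implementation: the function name is hypothetical, and sampling with replacement and shuffling within a batch are choices made here for simplicity.

```python
import random


def mixed_batches(new_examples, main_examples, batch_size=128, iterations=1000, seed=0):
    """Yield batches with half new-concept and half main-training-set examples."""
    rng = random.Random(seed)
    half = batch_size // 2
    for _ in range(iterations):
        # Sampling with replacement: the new-concept set may be smaller than half a batch.
        batch = rng.choices(new_examples, k=batch_size - half)
        batch += rng.choices(main_examples, k=half)
        rng.shuffle(batch)
        yield batch
```

Replaying examples from the main training set in every batch is a common way to limit forgetting of previously learned concepts while the new concept is acquired.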

Dataset categorization and comparison. SP: semantic parsing, IC: image classification, QA: question answering, i&t: image & text. Perception/Syntax/Semantics: whether the task requires models to learn perception/syntax/semantics. Generalization: the type of generalization required for test examples. *: the generated images in these datasets have little variance.

Examples from the training set and the test subsets of HINT.

The accuracy on the test set using image inputs. All models are jointly trained with a randomly initialized ResNet-18. Reported accuracy (%) is the median and standard deviation of 5 runs. "rel." denotes Transformer with relative positional encoding, and "uni." denotes Universal Transformer.

The accuracy on the test set using symbol inputs.

The few-shot learning performance of the top two models: LSTM w/ attn (left) and Transformer rel. uni. (right). Reported results are the median of 5 runs. See Table 3 for the meanings of these concepts. *Please refer to Appendix C for the details regarding the human study.

Hyperparameter tuning. Our choices are underlined.

The human subjects are asked to determine each concept's meaning from 10 training examples and answer 4 test questions. We report the accuracy of test questions as human performance.

Figure A4: The importance of sample properties w.r.t. the test accuracy of Transformer rel. uni. using symbol inputs. Normalized permutation feature importance is reported here using a k-nearest neighbors classifier (k=3) to predict if the model can generate correct results.

Acknowledgements. The authors would like to thank the four anonymous reviewers for their constructive feedback. This work is supported in part by the National Key R&D Program of China (2021ZD0150200) and the Beijing Nova Program.

A DATASET STATISTICS

The handwritten images for each arithmetic concept originate from the handwritten math symbols dataset (https://www.kaggle.com/datasets/xainano/handwrittenmathsymbols) hosted on Kaggle under the "CC0: Public Domain" license, parsed and extracted from the Competition on Recognition of Online Handwritten Mathematical Expressions (CROHME) (Mahdavi et al., 2019; https://www.cs.rit.edu/~crohme2019/). We further clean the dataset by removing duplicate images, resulting in statistics shown in Figure A1.

We conduct a detailed analysis of the collected data to demonstrate the validity of the HINT dataset as a benchmark for systematic generalization. Table A1 shows the size of each split in HINT, and Table A2 shows a comparison between the grammars of HINT and SCAN. For each split, we plot the frequency distributions of various aspects, including symbol, number of operators, expression length, tree depth, maximum dependency range, and result, as shown in Figure A2. The symbol distributions are similar across different splits, and the Kullback-Leibler divergence between train and test is low (0.0055). The digits and operators are approximately equally distributed, except for the test-SL split. The test-SL split has a relatively higher portion of multiplication ('*') since generating large numbers generally requires more multiplication for short expressions.

The test set's result distributions differ from the train set. All results in the training set are smaller than 100 as desired; about half are in [0, 10). In comparison, 29% of the results in the test set are larger than 100.

Several properties of an input expression, including length, number of operators, tree depth, and maximum dependency range, are indicators of the difficulty of calculating the expression. We plot the frequency distributions w.r.t. these input properties in Figure A2. These distributions demonstrate significant differences between train and test.
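A Kullback-Leibler divergence between the symbol distributions of two splits, as mentioned above, could be computed along the following lines. This is a minimal sketch under stated assumptions: the function names are hypothetical, expressions are treated as strings of single-character symbols, and the direction KL(train || test), the natural logarithm, and the small smoothing constant are choices made here because the text does not specify them.

```python
import math
from collections import Counter


def symbol_distribution(expressions, vocab, smoothing=1e-9):
    """Relative frequency of each vocabulary symbol over a list of expressions."""
    counts = Counter(sym for expr in expressions for sym in expr)
    total = sum(counts[s] for s in vocab) + smoothing * len(vocab)
    return {s: (counts[s] + smoothing) / total for s in vocab}


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same symbols."""
    return sum(p[s] * math.log(p[s] / q[s]) for s in p if p[s] > 0)
```

A value near zero indicates that the two splits use symbols with nearly the same frequencies.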

