NEURAL-SYMBOLIC RECURSIVE MACHINE FOR SYSTEMATIC GENERALIZATION

Abstract

Despite their tremendous success, existing machine learning models still fall short of human-like systematic generalization: learning compositional rules from limited data and applying them to unseen combinations in various domains. We propose the Neural-Symbolic Recursive Machine (NSR) to tackle this deficiency. The core representation of NSR is a Grounded Symbol System (GSS) with combinatorial syntax and semantics, which entirely emerges from training data. NSR implements a modular design for neural perception, syntactic parsing, and semantic reasoning, which are jointly learned by a deduction-abduction algorithm. We prove that NSR is expressive enough to model various sequence-to-sequence tasks. Superior systematic generalization is achieved via the inductive biases of equivariance and recursiveness embedded in NSR. In experiments, NSR achieves state-of-the-art performance on three benchmarks from different domains: SCAN for semantic parsing, PCFG for string manipulation, and HINT for arithmetic reasoning. Specifically, NSR achieves 100% generalization accuracy on SCAN and PCFG and outperforms state-of-the-art models on HINT by about 23%. NSR demonstrates stronger generalization than pure neural networks due to its symbolic representation and inductive biases, and better transferability than existing neural-symbolic approaches because it requires less domain-specific knowledge.

1. INTRODUCTION

A remarkable property underlying human intelligence is its systematic compositionality: the algebraic capacity to interpret an infinite number of novel combinations from finite known components (Chomsky, 1957), the "infinite use of finite means" (Chomsky, 1965). This type of compositionality is central to the human ability to generalize from limited data to novel combinations (Lake et al., 2017).
Recently, several datasets have been proposed to test the systematic generalization of machine learning models: SCAN (Lake & Baroni, 2018), PCFG (Hupkes et al., 2020), CFQ (Keysers et al., 2020), and HINT (Li et al., 2021), to name a few. While conventional neural networks fail dramatically on these datasets, certain inductive biases have been explored to improve systematic generalization. Csordás et al. (2021) and Ontanón et al. (2022) improve Transformers' generalization performance by using relative positional encoding and sharing weights between layers. Chen et al. (2020) introduce a neural-symbolic stack machine to achieve nearly perfect accuracy on SCAN-like datasets. Despite the improved performance, these neural-symbolic methods often require domain-specific knowledge to design non-trivial symbolic components and are difficult to transfer to other domains.

To achieve human-like systematic generalization in a wide range of domains, we propose the Neural-Symbolic Recursive Machine (NSR), which integrates the joint learning of perception, syntax, and semantics in a principled framework. The core representation of NSR is a Grounded Symbol System (GSS) (see Fig. 1), which entirely emerges from training data without domain-specific knowledge. NSR implements a modular design for neural perception, syntactic parsing, and semantic reasoning. Specifically, we first utilize a neural network as the perception module to ground symbols on the raw inputs. Next, the symbols are parsed into a syntax tree of the GSS by a transition-based neural dependency parser (Chen & Manning, 2014). Finally, we adopt functional programs to realize the semantic meaning of symbols (Ellis et al., 2021). Theoretically, we show that the proposed NSR is expressive enough to model various sequence-to-sequence tasks. Critically, the inductive biases of equivariance and recursiveness, encoded in each module, enable NSR to break down a long input into small components, process them progressively, and compose the results, encouraging the model to learn meaningful symbols and their compositional rules. Such inductive biases are the crux of NSR's superb systematic generalization.

It is challenging to optimize NSR end-to-end, since annotations for the internal GSS are oftentimes unavailable and NSR is not fully differentiable. To tackle this issue, we present a probabilistic learning framework and derive a novel deduction-abduction algorithm to coordinate the joint learning of the modules. In the learning phase (see Fig. 2), the model first performs greedy deduction over the modules to propose an initial GSS, which may yield a wrong result. Next, a search-based abduction is applied top-down to search the neighborhood of the initial GSS for plausible solutions; the abduction revises the GSS until it generates the correct result. As a plausible solution, the revised GSS provides pseudo supervision to train each module, facilitating the learning of the individual components of NSR.

We evaluate NSR on three benchmarks from various domains to study systematic generalization: (1) SCAN (Lake & Baroni, 2018), mapping natural language commands to action sequences; (2) PCFG (Hupkes et al., 2020), predicting the output sequences of string manipulation commands; and (3) HINT (Li et al., 2021), predicting the results of handwritten arithmetic expressions. All these datasets include multiple splits for evaluating different aspects of systematic generalization. NSR achieves state-of-the-art performance on all three benchmarks. Specifically, NSR obtains 100% generalization accuracy on SCAN and PCFG and improves the state-of-the-art accuracy on HINT by about 23%. Result analyses reveal that NSR possesses stronger generalization than pure neural networks due to its symbolic representation and inductive biases. It also demonstrates better transferability than existing neural-symbolic approaches, since it requires less domain-specific knowledge. We also evaluate NSR on a proof-of-concept machine translation task from Lake & Baroni (2018); the results demonstrate the promise of applying NSR to realistic domains.

2. RELATED WORK

There has been increasing interest in studying the systematic generalization of deep neural networks. Starting with the SCAN dataset (Lake & Baroni, 2018), multiple benchmarks across various domains have been proposed, including semantic parsing (Keysers et al., 2020; Kim & Linzen, 2020), string manipulation (Hupkes et al., 2020), visual question answering (Bahdanau et al., 2019), grounded language understanding (Ruis et al., 2020), and mathematical reasoning (Saxton et al., 2018; Li et al., 2021). These datasets serve as test beds for evaluating different aspects of generalization, including systematicity and productivity. A line of research has developed techniques for these datasets by injecting various inductive biases into deep neural networks. We categorize previous approaches into three classes by how they inject the inductive bias.

Architectural Prior

The first class of methods explores different architectures of deep neural networks for compositional generalization. Dessì & Baroni (2019) found that convolutional networks are significantly better than recurrent networks on the "jump" split of SCAN. Russin et al. (2019) improved standard RNNs by learning separate modules for syntax and semantics. Gordon et al. (2019) proposed an equivariant seq2seq model by incorporating convolution operations into RNNs to achieve local equivariance over permutation symmetries of interest, which are provided beforehand. Csordás et al. (2021) and Ontanón et al. (2022) observed that relative positional encoding and sharing weights across layers significantly improve the systematic generalization of Transformers.

Data Augmentation

The second class of methods designs different schemes to generate auxiliary training data for encouraging compositional generalization. Andreas (2020) performed data augmentation by replacing fragments of training samples with other fragments from similar samples, and Akyürek et al. (2020) trained a generative model to recombine and resample training data. The meta sequence-to-sequence model (Lake, 2019) and the rule synthesizer (Nye et al., 2020) are trained with samples drawn from a meta-grammar with a format close to the SCAN grammar.

Symbolic Scaffolding

The third class of methods bakes symbolic components into neural architectures to improve compositional generalization. Liu et al. (2020) connected a memory-augmented neural model with analytical expressions, simulating the reasoning process. Chen et al. (2020) integrated a symbolic stack machine into a seq2seq framework and learned a neural controller to operate the machine. Kim (2021) learned latent neural grammars for both the encoder and the decoder in a seq2seq framework. These methods achieve strong generalization by injecting a symbolic scaffolding into the models. However, the symbolic components require domain knowledge for specialized modules and complicated training procedures, such as the hierarchical reinforcement learning in Liu et al. (2020) and the expensive search over execution traces in Chen et al. (2020), which often have to be constrained under a carefully designed curriculum in practice. In contrast, the proposed NSR requires less domain-specific knowledge to design its modules, and the proposed deduction-abduction algorithm for learning NSR does not require a specialized curriculum, leading to better transferability and easier optimization.

(Figure 1, PCFG panel: for the input "append swap A B C , repeat B E", the node values are A → [A], B → [B], C → [C], E → [E], swap → [C, B, A], repeat → [B, E, B, E], and append → [C, B, A, B, E, B, E].)

3. NEURAL-SYMBOLIC RECURSIVE MACHINE

3.1. REPRESENTATION: GROUNDED SYMBOL SYSTEM (GSS)

Two schools of thought, connectionism and symbolism, have long debated the proper representation of the human mind (Fodor et al., 1988; Fodor & Lepore, 2002; Marcus, 2018). Central to connectionism are distributed representations (Hinton, 1984), which hold that a concept or meaning is represented by a pattern of activity across many neurons. In contrast, symbolism postulates a physical symbol system (Newell, 1980), wherein each symbol alone represents an atomic concept, and more complicated concepts are formed by combining multiple symbols under certain syntax (Chomsky, 1965; Hauser et al., 2002; Evans & Levinson, 2009). Symbol systems are more interpretable and support stronger abstraction and generalization than distributed representations (Launchbury, 2017). However, handcrafting a symbol system for a domain requires a lot of domain-specific knowledge, and the constructed system is fragile and suffers from the notorious symbol grounding problem (Harnad, 1990).

In this work, we propose a grounded version of the symbol system as the internal representation for systematic generalization. Such a Grounded Symbol System (GSS) provides a principled integration of perception, syntax, and semantics, as exemplified by Fig. 1. Formally, a GSS is defined as a directed tree T = ⟨(x, s, v), e⟩. Each node is a triplet of the grounded input x, the abstract symbol s, and the semantic meaning v. Each edge denotes a semantic dependency between a parent and a child node: an edge i → j denotes that node i's semantic meaning depends on node j's. Despite their nice properties, handcrafted symbol systems are inevitably fragile and labor-intensive. It is therefore essential to ground symbol systems on raw inputs and learn their syntax and semantics from training examples, which we discuss next.
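The GSS tree and its bottom-up evaluation can be sketched in a few lines. This is a minimal illustration; the names (`GSSNode`, `evaluate`) are ours and not from the paper's implementation:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# A GSS is a directed tree T = <(x, s, v), e>; each node carries the grounded
# input x, the abstract symbol s, and the semantic value v. Edges are the
# `children` lists, encoding the dependency i -> j.
@dataclass
class GSSNode:
    x: Any                    # grounded raw input (e.g., an image patch or word)
    s: str                    # abstract symbol predicted by perception
    v: Any = None             # semantic value, filled in by evaluation
    children: List["GSSNode"] = field(default_factory=list)

def evaluate(node: GSSNode, semantics: Dict[str, Callable]) -> Any:
    """Bottom-up evaluation: a node's value depends only on its symbol and
    its children's values, mirroring the edge semantics i -> j."""
    child_values = [evaluate(c, semantics) for c in node.children]
    node.v = semantics[node.s](child_values)
    return node.v
```

For a HINT-style expression "3 + 4", the root "+" has two leaf children, and evaluation returns 7 while filling in each node's `v`.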

3.2. MODEL: NEURAL-SYMBOLIC RECURSIVE MACHINE (NSR)

We now describe the proposed NSR, which induces a GSS from training data. As illustrated in Fig. 2, NSR consists of three learnable modules: a neural perception module for grounding symbols on the raw input, a dependency parser for predicting the dependencies between symbols, and a program synthesizer for predicting the semantic meanings. Since the ground-truth GSS is unavailable during training, these three modules ought to be learned end-to-end without intermediate supervision. Below, we first describe the three modules of NSR and then discuss how to learn NSR end-to-end with our proposed deduction-abduction algorithm.

Neural Perception The role of the perception module is to map a raw input x (e.g., a handwritten expression) into a symbolic sequence s (denoted by a list of indices). The perception module handles the perceptual variance of raw input signals, such that each predicted token w_i ∈ s is grounded in a certain part of the input x_i ∈ x. Formally, we have:

p(s|x; θ_p) = ∏_i p(w_i|x_i; θ_p) = ∏_i softmax(φ(w_i, x_i; θ_p)),   (1)

where φ(w_i, x_i; θ_p) is a scoring function parameterized by a neural network with parameters θ_p. The architecture of the perceptual network depends on the format of the raw input and can be pre-trained; for example, we use a pre-trained convolutional neural network for images.

Dependency Parsing To predict the dependencies between symbols, we adopt a transition-based neural dependency parser (Chen & Manning, 2014), commonly used for parsing natural language sentences. The transition-based parser relies on a state machine that defines the possible transitions for parsing the input sequence into a dependency tree. The parser constructs the parse tree by recursively applying the predicted transition to the state machine until the parsing process ends; see Fig. A1 for an illustration.
At each step, the parser predicts one transition based on the state representation. The state representation is constructed from a local window and contains: (i) the top three words in the stack and buffer, (ii) the first and second leftmost/rightmost children of the top two words in the stack, and (iii) the leftmost of the leftmost and rightmost of the rightmost children of the top two words in the stack. Given the state representation, we adopt a two-layer feed-forward neural network to predict the transition. Formally, given the input sentence s, we have:

p(e|s; θ_s) = p(T|s; θ_s) = ∏_{t_i ∈ T} p(t_i|c_i; θ_s),   (2)

where θ_s denotes the parser's parameters, T = {t_1, t_2, ..., t_l} the transition sequence that generates the dependencies e, and c_i the state representation at the i-th step. In practice, we use a greedy decoding strategy to predict the transition sequence for the input sentence.

Program Induction Inspired by recent advances in program induction (Ellis et al., 2021; Balog et al., 2017; Devlin et al., 2017), we adopt functional programs to represent the semantics of symbols and view learning as program induction. Compared to purely statistical approaches, symbolic programs exhibit better generalizability and interpretability, and their learning is usually more sample-efficient. Formally, given the input symbols s and their dependencies e, we have:

p(v|e, s; θ_l) = ∏_i p(v_i|s_i, children(s_i); θ_l),   (3)

where θ_l is the set of programs induced for every symbol. In practice, we use symbolic programs, yielding a deterministic reasoning process. Learning the semantics of a symbol is equivalent to searching for a program that matches the given examples. Candidate programs are composed from a set of pre-defined primitives. Based on the Peano axioms (Peano, 1889), we adopt a universal set of primitives: (1) 0; (2) inc: i → i + 1; (3) dec: i → i − 1; (4) ==; (5) if: (a, b, c) → if a is true, return b, else c.
Such a simple set of primitives is provably sufficient to represent any symbolic function, as we further demonstrate in Sec. 3.4. To speed up the search and facilitate generalization, we augment this set with a minimal subset of Lisp primitives for list processing (cons, car, cdr, etc.) and the recursion primitive, the Y-combinator (Peyton Jones, 1987). The Y-combinator can represent any recursive function and is the crux of extrapolation on semantics. In practice, we adopt DreamCoder (Ellis et al., 2021) for program induction, which efficiently synthesizes programs from input-output pairs across a wide range of domains. We revise the original DreamCoder to handle noise in the given examples during search.
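As a concrete illustration, the primitive set {0, inc, dec, ==, if} can be transcribed into Python (an assumed transcription, not the paper's code), with native recursion standing in for the Y-combinator; `add` mirrors the recursive-counting program for `+` shown later in Fig. 4:

```python
# Sketch of the universal primitive set {0, inc, dec, ==, if}.
ZERO = 0
inc = lambda i: i + 1
dec = lambda i: i - 1
eq = lambda a, b: a == b

def if_(a, b, c):
    # Eager in Python (both branches evaluated before the call), so the
    # recursive definition below uses Python's lazy `if/else` instead.
    return b if a else c

def add(x, y):
    # +: if(y == 0, x, +(inc x, dec y)) -- addition as recursive counting
    return x if eq(y, ZERO) else add(inc(x), dec(y))
```

Here Python recursion plays the role the Y-combinator plays in the induced lambda-calculus programs.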

3.3. LEARNING

Since the intermediate GSS is latent and non-differentiable, back-propagation is not applicable to learning NSR. Previous methods usually learn such neural-symbolic models with policy gradient algorithms such as REINFORCE (Williams, 1992). However, policy gradient methods have been shown to converge slowly or even fail to converge (Liang et al., 2018; Li et al., 2020): they generate a large number of samples over a large latent space with sparse rewards, hoping that some samples are lucky enough to hit high rewards for updating the policy. Due to the prohibitively large space of GSS, we seek an alternative, more efficient learning algorithm.

Formally, let Θ = (θ_p, θ_s, θ_l) denote the parameters of the three modules. The learning objective is to maximize the likelihood p(y|x; Θ) = Σ_T p(T, y|x; Θ) of the observed output, whose gradient is an expectation over the posterior:

∇_Θ log p(y|x; Θ) = E_{T ∼ p(T|x,y)} [∇_Θ log p(T|x; Θ)],   (5)

where p(T|x, y) is the posterior distribution of T given (x, y). Since p(y|v) can only be 0 or 1, p(T|x, y) can be rewritten as:

p(T|x, y) = p(T, y|x; Θ) / Σ_{T'} p(T', y|x; Θ) = { 0, for T ∉ Q;   p(T|x; Θ) / Σ_{T' ∈ Q} p(T'|x; Θ), for T ∈ Q },   (6)

where Q = {T : p(y|T) = 1, T ∈ Ω_T} is the set of T that match y. Usually, Q is a very small subset of the entire space of (s, e, v), i.e., Q ⊆ Ω_T. Since taking the expectation w.r.t. this posterior distribution (Eq. (5)) is intractable, we use Monte Carlo sampling to approximate it. The optimization procedure for an example (x, y) is as follows:

1. Sample a plausible solution T̂ = ⟨(x, ŝ, v̂), ê⟩ ∼ p(T|x, y).
2. Use (x, ŝ) to update the perception module (θ_p).
3. Use (ŝ, ê) to update the syntax module (θ_s).
4. Use (ŝ, ê, v̂) to update the semantic module (θ_l).

Steps 2-4 are supervised training for each module and no longer require the continuity assumption used in deriving Eq. (5). Step 1 requires sampling from a highly sparse distribution over a large space, which we describe below.
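On a toy space where candidates can be enumerated, the posterior of Eq. (6) is easy to illustrate. This sketch assumes each candidate comes with its prior mass p(T|x) and predicted value v; the function name `posterior` is ours:

```python
# Toy illustration of Eq. (6): candidates T with p(y|T) = 1 keep their prior
# mass p(T|x) and are renormalized over Q; all other candidates get 0.
def posterior(candidates, y):
    """candidates: list of (T, prior p(T|x), predicted output v)."""
    Q = [(T, p) for (T, p, v) in candidates if v == y]   # p(y|T) = 1
    Z = sum(p for _, p in Q)                              # normalizer over Q
    # candidates outside Q implicitly have posterior 0
    return {T: p / Z for T, p in Q}
```

In the real model the space of T is far too large to enumerate, which is exactly why the deduction-abduction sampler of the next subsection is needed.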

Deduction-Abduction

The key to the above learning procedure is efficient sampling from the posterior distribution p(T|x, y). To this end, we develop a novel deduction-abduction algorithm, summarized in Alg. A1. Concretely, for a given example (x, y), we first perform greedy deduction from x to obtain an initial GSS T̂ = ⟨(x, ŝ, v̂), ê⟩, which may yield a wrong result. To find a revised T* that matches the correct result y, we search the neighborhood of T̂ in a top-down manner by performing abduction over perception, syntax, and semantics; see Alg. A1 and Figs. A2 and A3. Theoretically, this deduction-abduction process behaves as a Metropolis-Hastings sampler for the posterior distribution p(T|x, y), as proven by Li et al. (2020).

3.4. EXPRESSIVENESS AND GENERALIZATION OF NSR

We now discuss NSR's properties. Overall, NSR has the expressive power to model various seq2seq tasks, and the inductive biases in NSR enhance its systematic generalization.

Expressiveness We prove that NSR is expressive enough to model various seq2seq tasks.

Lemma 1. For a finite set of unique elements {x_i : i = 0, ..., N}, there exists a neural network f_p with enough capacity satisfying: ∀ x_i, f_p(x_i) = i.

Provably, there exists a neural network that maps every element in a finite set to a unique index, i.e., x_i → i (Hornik et al., 1989; Lu et al., 2017). The parsing process is trivial since every input is mapped to a single token.

Lemma 2. Any index space can be created from the primitives {0, inc}.

All indices are natural numbers, which can be recursively defined by {0, inc}: 0 → 0, 1 → inc(0), 2 → inc(inc(0)), 3 → inc(inc(inc(0))), .... We can create indices for both inputs and outputs via this lemma. Equipped with the above lemmas, we have the following theorem on the expressiveness of NSR:

Theorem 3. For any finite dataset D = {(x_i, y_i) : i = 0, ..., N}, there exists an NSR that can sufficiently express D using the four primitives {0, inc, ==, if}.

The theorem can be proven by constructing a degenerate NSR that "memorizes" all examples in D via a long program:

NSR(x) = if(f_p(x)==0, y_0, if(f_p(x)==1, y_1, ... if(f_p(x)==N, y_N, ∅) ...)),   (7)

which is akin to a lookup table, composed of the primitives {if, ==} and the index space created by the primitives {0, inc}. Since these four primitives are universal across domains, NSR is expressive enough to model various seq2seq tasks and thus enjoys better transferability than previous neural-symbolic approaches.

Generalization Although the degenerate program defined by Eq. (7) achieves perfect accuracy on the training set, it can hardly generalize beyond the training examples.
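The lookup-table construction of Eq. (7) can be sketched as follows. `build_lookup` and the dictionary-based `f_p` are illustrative stand-ins for the neural index mapper of Lemma 1:

```python
# The degenerate "lookup table" NSR of Eq. (7): f_p maps each input to its
# index (Lemma 1), and a chain of if(f_p(x) == j, y_j, ...) checks returns the
# memorized output, with None standing in for the empty result.
def build_lookup(dataset, f_p):
    def nsr(x):
        i = f_p(x)
        for j, (_, y_j) in enumerate(dataset):  # if(f_p(x) == j, y_j, ...)
            if i == j:
                return y_j
        return None                             # the empty result
    return nsr
```

The construction memorizes the training set perfectly yet generalizes to nothing, which is exactly the point of the Generalization discussion that follows.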
To achieve strong generalization, we have to impose certain inductive biases; these inductive biases should be minimal and applicable to most domains. Inspired by previous progress on compositional generalization (Gordon et al., 2019; Chen et al., 2020), we hypothesize two necessary inductive biases: equivariance and recursiveness. Intuitively, equivariance enables a model's systematicity, generalizing from {"run", "run twice", "jump"} to "jump twice," whereas recursiveness enables a model's productivity, generalizing to longer sequences like "run and jump twice."

Following Gordon et al. (2019), we formalize the hypothesis of equivariance and recursiveness from the perspective of group theory. A discrete group G is a set of elements {g_1, ..., g_|G|} with a binary group operation "·" satisfying the group axioms (closure, associativity, identity, and invertibility). For equivariance, we consider a permutation group P, whose elements are permutations of a set X; for p ∈ P, we define the permutation operation T_p : X → X. For recursiveness, we consider a composition operation C and define T_c : (X, X) → X. Formally, we have:

Definition 1 (Equivariance). Given two sets X and Y, let P be a permutation group whose group operations on X and Y are denoted by T_p : X → X and T'_p : Y → Y, respectively. A mapping Φ : X → Y is equivariant if and only if ∀ x ∈ X, p ∈ P: Φ(T_p x) = T'_p Φ(x).

Definition 2 (Recursiveness). Given two sets X and Y, let T_c : (X, X) → X and T'_c : (Y, Y) → Y be two composition operations on X and Y, respectively. A mapping Φ : X → Y is recursive if and only if ∃ T_c, T'_c, ∀ x_1 ∈ X, x_2 ∈ X: Φ(T_c(x_1, x_2)) = T'_c(Φ(x_1), Φ(x_2)).

It is straightforward to see that the three modules of NSR, i.e., neural perception (Eq. (1)), dependency parsing (Eq. (2)), and program induction (Eq. (3)), are equivariant and recursive, because they are all pointwise transformations according to their formulations.
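Definition 1 can be checked mechanically on small inputs. The sketch below tests a pointwise token mapping (equivariant, since permuting the input permutes the output the same way) against a sorting function (not equivariant); the helper name `is_equivariant` is ours:

```python
from itertools import permutations

# Property-style check of Definition 1: Phi(T_p x) == T'_p Phi(x) for every
# permutation p of the positions, with T_p and T'_p both acting by reindexing.
def is_equivariant(phi, xs):
    base = phi(xs)
    for p in permutations(range(len(xs))):
        permuted_in = [xs[i] for i in p]       # T_p x
        permuted_out = [base[i] for i in p]    # T'_p Phi(x)
        if phi(permuted_in) != permuted_out:
            return False
    return True
```

A pointwise map like token-wise uppercasing passes the check; a global operation like sorting fails it, since sorting ignores the permutation of positions.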
Empirically, models that achieve superb compositional generalization on certain reasoning tasks, including NeSS (Chen et al., 2020), LANE (Liu et al., 2020), and the proposed NSR, are all equivariant and recursive. We therefore propose the following hypothesis about compositional generalization:

Hypothesis 1. If a model achieves compositional generalization, the mapping Φ : X → Y instantiated by the model is equivariant and recursive.

4. EXPERIMENTS

4.1. SCAN

We first evaluate NSR on the SCAN dataset (Lake & Baroni, 2018), which maps natural language commands to action sequences. Among the baselines we compare with is NeSS (Chen et al., 2020), which integrates a symbolic stack machine into a seq2seq framework.

Results

Tab. 1 summarizes the results. Both NSR and NeSS achieve 100% accuracy on the splits requiring systematic generalization, while the best performance of the other models on the Length split is only 20%. This contrast demonstrates the superiority of symbolic components (i.e., the symbolic stack machine in NeSS and the Grounded Symbol System in NSR) for systematic generalization. While both NeSS and NSR achieve perfect generalization on SCAN, we highlight some key differences. First, NeSS requires a considerable amount of domain-specific knowledge to design the stack machine's components, such as the stack operations and the category predictors; NeSS without category predictors drops to around 20% accuracy in 3 out of 5 runs. Second, training NeSS requires a manually defined curriculum with customized training procedures for the latent category predictors. In contrast, our proposed NSR embraces a modular design with little domain-specific knowledge and does not require any special training scheme such as a pre-defined curriculum.

Fig. 3 visualizes the syntax and semantics learned by NSR from the SCAN Length split. The dependency parser of NSR, which is equivariant as discussed in Sec. 3.4, forms clear permutation-equivariant groups of input words in terms of syntax: {turn, walk, look, run, jump}, {left, right, opposite, around, twice, thrice}, {and, after}. Note that we do not provide any prior information about these groups; they entirely emerge from the training data, unlike providing equivariant groups beforehand (Gordon et al., 2019) or explicitly incorporating a category induction procedure from execution traces (Chen et al., 2020). In the learned programs, the program synthesizer of NSR creates an index space for the target language and finds the correct programs representing the semantics of each source word.
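To make the learned semantics concrete, here is a hypothetical transcription of word programs for the Fig. 1 example "jump left after walk twice", evaluated bottom-up over its dependency tree. The exact programs NSR induces may differ; this sketch only illustrates the tree-structured evaluation:

```python
# Hypothetical word semantics for a SCAN fragment: each word's program maps
# its children's action lists to its own, as in the Fig. 1 GSS example.
SEM = {
    "jump":  lambda ch: ["JUMP"],
    "walk":  lambda ch: ["WALK"],
    "left":  lambda ch: ["LTURN"] + ch[0],   # modifier: turn left, then act
    "twice": lambda ch: ch[0] * 2,           # repeat the child's actions
    "after": lambda ch: ch[1] + ch[0],       # "x after y": execute y, then x
}

def eval_tree(word, children):
    # children is a list of (word, children) subtrees
    return SEM[word]([eval_tree(w, c) for (w, c) in children])
```

Evaluating the tree for "jump left after walk twice" (root "after" with subtrees "jump left" and "walk twice") reproduces the node values shown in Fig. 1.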

4.2. PCFG

Next, we conduct experiments on the PCFG dataset (Hupkes et al., 2020), in which the model learns to predict the output of a string manipulation command, e.g., append swap F G H , repeat I J → H G F I J I J. The input sequences in PCFG are generated by a probabilistic context-free grammar, and the output sequences are constructed by recursively applying the string edit operations specified in the input sequences. The input samples are selected to match the statistical properties of a natural language corpus, such as the lengths of the sentences and the depths of the parse trees.

Evaluation We evaluate models on the following splits: (1) IID: samples are randomly split for training and testing; (2) Systematicity: this split explicitly focuses on the model's ability to interpret pairs of functions that were never seen together during training; (3) Productivity: this split focuses on generalization to longer sequences, i.e., the training samples contain up to 8 functions, while the test samples contain at least 9 functions. We compare with the following baselines: LSTM and CNN (Hupkes et al., 2020), Transformer (Csordás et al., 2021; Ontanón et al., 2022), and NeSS (Chen et al., 2020).

Results Tab. 1 summarizes the results. NSR achieves 100% accuracy on all splits of PCFG and outperforms the previous state-of-the-art model (Transformer) by 4% on the Systematicity split and 15% on the Productivity split. Notably, NeSS completely fails on PCFG, although it achieves perfect accuracy on SCAN. Inspecting the training process of NeSS on PCFG, we find that the stack operations of NeSS cannot represent the binary functions in PCFG, and the trace search fails due to PCFG's large vocabulary and long sequences. Domain-specific knowledge and significant effort would be required to re-design the stack machine and the training procedure.
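The string-edit semantics behind the running example can be sketched as follows, assuming the standard PCFG function definitions (swap exchanges the first and last tokens, repeat duplicates its argument, append concatenates its two arguments):

```python
# Assumed PCFG string-edit semantics for the three functions in the example.
def swap(xs):
    # exchange first and last tokens: [F, G, H] -> [H, G, F]
    return xs[-1:] + xs[1:-1] + xs[:1] if len(xs) > 1 else xs

def repeat(xs):
    # duplicate the sequence: [I, J] -> [I, J, I, J]
    return xs + xs

def append(xs, ys):
    # concatenate the two sequences
    return xs + ys

out = append(swap("F G H".split()), repeat("I J".split()))
```

Composing the functions exactly as the command specifies reproduces the target output H G F I J I J.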

4.3. HINT

Finally, we evaluate NSR on the HINT dataset (Li et al., 2021), in which the model learns to predict the integer result of a handwritten arithmetic expression without any intermediate supervision (e.g., mapping an image of a handwritten expression to its value, such as 40). Compared with SCAN and PCFG, HINT is much more challenging: its perception deals with real handwritten images with high variance and ambiguity, its syntax is more complicated due to parentheses, and its semantics involves recursive functions. The HINT dataset comes with a single training set and five test subsets for evaluating different generalizations across perception, syntax, and semantics.

Evaluation Following Li et al. (2021), we train models on the single training set and evaluate them on the following five test subsets: (1) "I": expressions are seen in the training set but composed of unseen handwritten images; (2) "SS": expressions are unseen, but their lengths and values are within the same range as training; (3) "LS": expressions are longer than training, but their values are within the same range; (4) "SL": expressions have larger values, but their lengths are the same as training; (5) "LL": expressions are longer, and their values are larger than training. A prediction is considered correct if and only if it exactly equals the ground-truth result.

Baselines We compare the proposed NSR with the following baselines: the seq2seq neural models (i.e., GRU, LSTM, and Transformer) (Li et al., 2021) and NeSS (Chen et al., 2020). Each model is equipped with a ResNet-18 as the image encoder.
Figure 4: The evolution of the learned programs in NSR on HINT. An early snapshot learns only the digits as counting, with all operators still empty:

0: 0   1: inc 0   2: inc inc 0   ...   9: inc inc ... inc 0
+: Null   -: Null   ×: Null   ÷: Null

An intermediate snapshot has partially correct operators:

+: if(y == 0, x, +(inc x, dec y))
-: if(y == 0, x, +(dec x, dec y))
×: if(y == 0, y, x)
÷: if(y == inc 0, x, if(x == 0, x, inc inc 0))

The final snapshot recovers the correct recursive semantics:

+: if(y == 0, x, (inc x) + (dec y))
-: if(y == 0, x, (dec x) + (dec y))
×: if(x == 0, 0, y × (dec x) + y)
÷: if(x == 0, 0, inc((x − y) ÷ y))

The recursive programs in DreamCoder are represented by lambda calculus (a.k.a. λ-calculus) with the Y-combinator; here we translate the induced programs into pseudo-code for easier interpretation. Note that there may be different yet functionally equivalent programs representing the semantics of a symbol; we only visualize one plausible program.

Results Tab. 2 summarizes the results on HINT. Comparing NSR with the seq2seq neural models, we see that NSR improves the state-of-the-art performance (Transformer) by about 23%. Inspecting the results on each test subset, we find that the improvement primarily comes from better extrapolation over syntax and semantics: NSR boosts the accuracy on "LL" from 19.0% to 58.2%, while the accuracy gains on "I" and "SS" are only around 2%. Please refer to Fig. A5 for qualitative examples of NSR's predictions on HINT. Again, NeSS completely fails on HINT for reasons similar to its failure on PCFG, as discussed in Sec. 4.2.

Fig. 4 illustrates the evolution of semantics during the training of NSR on HINT. The pattern closely matches how children learn arithmetic in developmental psychology (Carpenter et al., 1999): the model first masters the semantics of digits as counting, then learns + and − as recursive counting, and finally figures out how to define × and ÷ based on + and −. Crucially, × and ÷ cannot be correctly learned before mastering + and −. The model is endowed with such an incremental learning capability because the program induction module allows the semantics of concepts to be built compositionally from those learned earlier (Ellis et al., 2021).
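The final-stage programs of Fig. 4 can be transcribed into Python recursion over the Peano primitives. This is our transcription for illustration; the division program assumes exact division, as in the figure:

```python
# Final-stage HINT semantics from Fig. 4, with Python recursion standing in
# for the Y-combinator.
def inc(i): return i + 1
def dec(i): return i - 1

def add(x, y):
    # +: if(y == 0, x, (inc x) + (dec y))
    return x if y == 0 else add(inc(x), dec(y))

def sub(x, y):
    # -: if(y == 0, x, (dec x) + (dec y)), i.e., subtraction via counting down
    return x if y == 0 else sub(dec(x), dec(y))

def mul(x, y):
    # x: if(x == 0, 0, y * (dec x) + y) -- multiplication as repeated addition
    return 0 if x == 0 else add(mul(dec(x), y), y)

def div(x, y):
    # /: if(x == 0, 0, inc((x - y) / y)) -- exact division as repeated subtraction
    return 0 if x == 0 else inc(div(sub(x, y), y))
```

The dependency structure mirrors the curriculum the model discovers on its own: `mul` and `div` only work once `add` and `sub` do.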

5. CONCLUSION AND DISCUSSION

In this paper, we presented NSR, which learns a Grounded Symbol System from data to achieve systematic generalization. The GSS affords a generalizable, interpretable representation and provides a principled integration of perception, syntax, and semantics. NSR adopts a modular design and encodes the inductive biases of equivariance and recursiveness in each module, which are critical for achieving compositional generalization. To train NSR without supervision on the GSS, we presented a probabilistic learning framework and proposed a novel deduction-abduction algorithm to facilitate efficient learning. NSR achieves state-of-the-art performance on three benchmarks ranging from semantic parsing and string manipulation to arithmetic reasoning, and a proof-of-concept machine translation experiment demonstrates the promise of applying NSR to realistic domains. At the same time, we anticipate potential challenges in applying NSR to real-world tasks: (1) the noisy and numerous concepts in real-world tasks induce a large space of grounded symbol systems and might slow the training of NSR; (2) the functional programs in NSR are deterministic and thus cannot represent probabilistic semantics in real-world tasks; e.g., in machine translation, there may be multiple ways to translate a single sentence. We leave these challenges to future work.

A. TRANSITION-BASED DEPENDENCY PARSER

The initial state for a sequence s = w_0 w_1 ... w_n is α = [Root], β = [w_0, w_1, ..., w_n], A = ∅. A state is regarded as terminal if the buffer is empty and the stack contains only the node Root. The parse tree can be derived from the dependency arcs A. Let α_i denote the i-th top element on the stack, and β_i the i-th element on the buffer. The parser defines three types of transitions between states:

• LEFT-ARC: add an arc α_1 → α_2 to A and remove α_2 from the stack α. Precondition: |α| ≥ 2.
• RIGHT-ARC: add an arc α_2 → α_1 to A and remove α_1 from the stack α. Precondition: |α| ≥ 2.
• SHIFT: move β_1 from the buffer β to the stack α. Precondition: |β| ≥ 1.

The goal of the parser is to predict a transition sequence from an initial state to a terminal state. The parser predicts one transition from T = {LEFT-ARC, RIGHT-ARC, SHIFT} at a time, based on the current state c = (α, β, A). The state representation is constructed from a local window and contains the following three elements: (i) the top three words on the stack and the buffer: α_i, β_i, i = 1, 2, 3; (ii) the first and second leftmost/rightmost children of the top two words on the stack: lc_1(α_i), rc_1(α_i), lc_2(α_i), rc_2(α_i), i = 1, 2; (iii) the leftmost of the leftmost and the rightmost of the rightmost children of the top two words on the stack: lc_1(lc_1(α_i)), rc_1(rc_1(α_i)), i = 1, 2. We use a special Null token for non-existent elements. Each element in the state representation is embedded into a d-dimensional vector e ∈ R^d, and the full embedding matrix is denoted as E ∈ R^(|Σ|×d), where Σ is the concept space. The embedding vectors for all elements in the state are concatenated as its representation: c = [e_1 e_2 ... e_n] ∈ R^(nd).

function DEDUCE(x, θ_p, θ_s, θ_l)
    sample ŝ ~ p(s|x; θ_p), ê ~ p(e|ŝ; θ_s), v̂ = f(ŝ, ê; θ_l)
    return T = ⟨(x, ŝ, v̂), ê⟩
end function

function ABDUCE(T, y)    ▷ excerpt
    Q = PriorityQueue()
    Q.push(root(T), y, 1.0)
    while A, y_A, p = Q.pop() do
        (x, w, v, arcs) = A
        for w′ ∈ Σ do    ▷ abduce perception
            A′ = A(w → w′)
            if A′.v == y_A then
                Q.push(A′, y_A, p(A′))
        ...
        A′ = A(v → y_A)    ▷ abduce semantics
        Q.push(A′, y_A, p(A′))
        ...    ▷ top-down search
end function

Figure A3: "GT" and "PD" denote "ground-truth" and "prediction," respectively. Each node in the tree is a tuple of (symbol, value).
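The transition system described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: a state is a (stack, buffer, arcs) triple, and the three transitions move it toward a terminal state.

```python
# Arc-standard transition system: state = (stack alpha, buffer beta, arcs A).
ROOT = "Root"

def initial_state(words):
    return ([ROOT], list(words), [])

def left_arc(stack, buffer, arcs):
    assert len(stack) >= 2
    head, dep = stack[-1], stack[-2]          # arc alpha_1 -> alpha_2
    return (stack[:-2] + [head], buffer, arcs + [(head, dep)])

def right_arc(stack, buffer, arcs):
    assert len(stack) >= 2
    head, dep = stack[-2], stack[-1]          # arc alpha_2 -> alpha_1
    return (stack[:-1], buffer, arcs + [(head, dep)])

def shift(stack, buffer, arcs):
    assert len(buffer) >= 1
    return (stack + [buffer[0]], buffer[1:], arcs)

def is_terminal(stack, buffer, arcs):
    return not buffer and stack == [ROOT]

# Parse "3 + 4" with "+" as the head of both digits (HINT-style):
s = initial_state(["3", "+", "4"])
for t in [shift, shift, left_arc, shift, right_arc, right_arc]:
    s = t(*s)
assert is_terminal(*s)                        # arcs: +->3, +->4, Root->+
```

The predicted transition sequence fully determines the dependency tree, which is why the parser only needs a 3-way classifier over transitions rather than a structured decoder.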



https://github.com/ellisk42/ec



learned latent neural grammars for both the encoder and the decoder in a seq2seq framework. These methods achieved strong generalization by injecting symbolic scaffolding into their models. However, these symbolic components require domain knowledge for specialized modules and complicated training procedures, such as the hierarchical reinforcement learning in Liu et al. (2020) and the expensive search process for execution traces in Chen et al. (2020), which often have to be constrained under a carefully designed curriculum in practice. In contrast, the proposed NSR model requires less domain-specific knowledge for designing its modules, and the proposed deduction-abduction algorithm for learning NSR does not require a specialized curriculum, leading to better transferability and easier optimization.

Figure 1: Examples of Grounded Symbol Systems, which are interpretable and illustrate a human-like reasoning process. (1) SCAN: each node is a triplet of (word, symbol, action sequence), e.g., "jump left after walk twice" grounds to jump: 4 [JUMP]; left: 5 [LTURN, JUMP]; walk: 1 [WALK]; twice: 9 [WALK, WALK]; after: 12 [WALK, WALK, LTURN, JUMP]. (2) PCFG: each node is a triplet of (word, symbol, letter list). (3) HINT: each node is a triplet of (image, symbol, value). Each symbol is denoted by its index.

Figure 2: The pipeline of inference and learning in NSR.

Figure 3: Left: the syntactic similarity between input words of NSR trained on the SCAN Length split. The similarity between word i and word j is measured by the percentage of test samples in which replacing i with j, or vice versa, does not change the dependencies predicted by the dependency parser. Right: the induced programs for input words, where x and y denote the inputs, ∅ denotes empty inputs, cons inserts an item to the first position of a list, car returns the first item in a list, and + concatenates two lists.
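The induced programs in Fig. 3 can be mimicked with ordinary functions. The sketch below is illustrative, not the induced code itself: each Python function stands in for one word's program, and list concatenation plays the role of the `+` primitive from the figure.

```python
# Hand-written stand-ins for the induced SCAN word programs (Fig. 3 style).
def jump():       return ["JUMP"]
def walk():       return ["WALK"]
def left(x):      return ["LTURN"] + x    # turn left, then perform x
def twice(x):     return x + x            # repeat the action sequence
def after(x, y):  return y + x            # "x after y": y first, then x

# "jump left after walk twice" -> WALK WALK LTURN JUMP (cf. Fig. 1)
assert after(left(jump()), twice(walk())) == ["WALK", "WALK", "LTURN", "JUMP"]
```

Because each word's semantics is a small, compositional function, novel combinations of words evaluate correctly by construction, which is the mechanism behind the recursiveness inductive bias.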

Figure A1: Applying the transition-based dependency parser to an example from HINT. The procedure is similar for SCAN and PCFG.



Figure A2: Abduction over perception, syntax, and semantics in HINT. Parts revised during abduction are highlighted in red.

Figure A4: Examples of NSR predictions on the test set of the SCAN Length split. We use * (repeating a list) and + (concatenating two lists) to shorten the outputs for easier interpretation.

Let x denote the input, T = ⟨(x, s, v), e⟩ the intermediate GSS, and y the output. During learning, (x, y) is observed but T is latent. The likelihood of the observation (x, y) marginalized over T can be decomposed as:

p(y | x; Θ) = Σ_T p(y, T | x; Θ) = Σ_{s,e} p(s | x; θ_p) p(e | s; θ_s) p(y | s, e; θ_l),

where s|x denotes the process of grounding symbols on raw signals, guided by the perception module θ_p; e|s denotes the process of parsing the symbol sequence into a parse tree, guided by the syntax module θ_s; and v|s, e denotes the process of reasoning over the parse tree, guided by the semantic module θ_l.
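The factorized marginalization can be made concrete on a toy example. The numbers and the restriction to a single parse below are illustrative assumptions, not values from the paper: we enumerate symbol hypotheses for each input image, weight each grounded interpretation by its perception probability, and sum the weight of interpretations whose execution matches y.

```python
from itertools import product

# Toy p(s|x): each "image" (here just a string key) has a few symbol hypotheses.
p_sym = {"3": {"3": 0.9, "8": 0.1}, "+": {"+": 1.0}, "4": {"4": 1.0}}

def execute(symbols):
    # Deterministic semantics for a flat "a op b" expression, so
    # p(v = y | s, e) is an indicator function.
    a, op, b = symbols
    return int(a) + int(b) if op == "+" else None

def likelihood(images, y):
    # With one fixed parse, p(e|s) = 1; marginalize over symbol sequences s.
    total = 0.0
    for s in product(*(p_sym[img].keys() for img in images)):
        p_s = 1.0
        for img, sym in zip(images, s):
            p_s *= p_sym[img][sym]
        if execute(s) == y:
            total += p_s
    return total

assert abs(likelihood(["3", "+", "4"], 7) - 0.9) < 1e-9    # via s = (3, +, 4)
assert abs(likelihood(["3", "+", "4"], 12) - 0.1) < 1e-9   # via s = (8, +, 4)
```

In the full model the sum also ranges over parse trees e, which is why exact marginalization is intractable and the deduction-abduction search is used instead.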

The test accuracy on different splits of SCAN and PCFG. "A. R." stands for "Around Right," "Sys." for "Systematicity," and "Prod." for "Productivity." The results of NeSS on PCFG are reported by adapting the source code provided by Chen et al. (2020) to PCFG.

The test accuracy on HINT. The results of GRU, LSTM, and Transformer are cited from Li et al. (2021), and the results of NeSS are reported by adapting its source code to HINT.

Given the state representation, we adopt a two-layer feed-forward neural network to predict the transition.
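The classifier can be sketched as follows. The dimensions (d = 50 embeddings, hidden size 200, 18 state features, 3 transitions) come from the text; the random weights, vocabulary size, and ReLU activation are placeholder assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_feats, hidden, n_trans = 50, 18, 200, 3   # 18 features: (i)+(ii)+(iii)

E = rng.normal(size=(100, d))                  # embedding matrix, |Sigma| = 100
W1 = rng.normal(size=(n_feats * d, hidden)) * 0.01
b1 = np.zeros(hidden)
W2 = rng.normal(size=(hidden, n_trans)) * 0.01
b2 = np.zeros(n_trans)

def predict_transition(feature_ids):
    c = E[feature_ids].reshape(-1)             # concatenated embeddings, R^{nd}
    h = np.maximum(0, c @ W1 + b1)             # hidden layer (ReLU assumed)
    logits = h @ W2 + b2
    return int(np.argmax(logits))              # 0: LEFT-ARC, 1: RIGHT-ARC, 2: SHIFT

t = predict_transition(rng.integers(0, 100, size=n_feats))
assert t in (0, 1, 2)
```

Note that the 18 features are the 6 stack/buffer tokens, the 8 first/second children, and the 4 grandchildren listed above; non-existent positions map to the Null token's embedding.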



To learn the semantics of a symbol c from a set of examples D_c is to find a program ρ_c, composed from a set of primitives L, that maximizes the following objective:

ρ_c = argmax_ρ p(D_c | ρ) p(ρ | L),    (A1)

where p(D_c|ρ) is the likelihood of the program ρ matching D_c, and p(ρ|L) is the prior of ρ under the program space defined by the primitives L. Since finding a globally optimal program is usually intractable, the maximization in Eq. (A1) is approximated by a stochastic search process guided by a neural network, which is trained to approximate the posterior distribution p(ρ|D_c, L). We refer the reader to DreamCoder (Ellis et al., 2021) 1 for more technical details.
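A drastically simplified instance of the Eq. (A1) objective can be written down directly: enumerate a finite program space and pick the candidate maximizing likelihood times prior. The primitive set, the 0/1 likelihood, and the prior values here are toy assumptions, not DreamCoder's actual machinery.

```python
# Toy program induction: programs are single binary primitives over lists.
PRIMS = {
    "concat":  lambda x, y: x + y,
    "rconcat": lambda x, y: y + x,
    "left":    lambda x, y: x,
}

def fit(name, examples):
    """p(D_c|rho): 1 if the program reproduces every example, else 0."""
    f = PRIMS[name]
    return float(all(f(x, y) == out for (x, y), out in examples))

def induce(examples, prior):
    # argmax over the (finite) program space of p(D_c|rho) * p(rho|L)
    return max(PRIMS, key=lambda name: fit(name, examples) * prior[name])

examples = [((["a"], ["b"]), ["b", "a"]), ((["c"], ["d"]), ["d", "c"])]
prior = {"concat": 0.5, "rconcat": 0.3, "left": 0.2}
assert induce(examples, prior) == "rconcat"
```

Real program spaces are unbounded compositions of primitives, which is why DreamCoder replaces this exhaustive argmax with neurally guided stochastic search.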

A.2 LEARNING

Derivation of Eq. (5) Take the derivative of L w.r.t. the model parameters.

Deduction-Abduction Alg. A1 describes the procedure for learning NSR by the proposed deduction-abduction algorithm. Fig. A2 illustrates the one-step abduction over perception, syntax, and semantics in HINT, and Fig. A3 visualizes a concrete example of the deduction-abduction process. The procedures for SCAN and PCFG are similar.
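The core of the search (greedy deduction, then priority-queue revision until execution matches the ground truth) can be sketched on a toy arithmetic example. The probability table and the restriction to perception-only revisions are illustrative assumptions; the full algorithm also abduces syntax and semantics and searches top-down.

```python
import heapq

probs = [                         # per-position p(symbol | image), toy numbers
    {"3": 0.6, "9": 0.4},
    {"+": 0.9, "*": 0.1},
    {"4": 1.0},
]

def execute(s):
    a, op, b = int(s[0]), s[1], int(s[2])
    return a + b if op == "+" else a * b

def abduce(y):
    greedy = [max(p, key=p.get) for p in probs]   # greedy deduction
    q = [(-1.0, greedy)]                          # max-heap via negated priority
    seen = set()
    while q:
        _, s = heapq.heappop(q)
        if execute(s) == y:                       # revision explains the GT result
            return s
        for i, p_i in enumerate(probs):           # abduce perception only
            for sym in p_i:
                s2 = s[:i] + [sym] + s[i + 1:]
                key = tuple(s2)
                if key not in seen:
                    seen.add(key)
                    p2 = 1.0
                    for j, symj in enumerate(s2):
                        p2 *= probs[j][symj]
                    heapq.heappush(q, (-p2, s2))
    return None

assert abduce(7) == ["3", "+", "4"]    # greedy deduction already correct
assert abduce(13) == ["9", "+", "4"]   # "3" revised to "9" to match the GT
```

Popping by probability means the most plausible consistent revision is found first, mirroring how Alg. A1 pushes candidates with score p(A′) and pops the maximum.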

A.3 EXPERIMENTS

Experimental Setup For tasks taking symbols as input (i.e., SCAN and PCFG), the perception module is not required in NSR. For the task taking images as input (HINT), we adopt ResNet-18 as the perception module, pre-trained in an unsupervised manner (Van Gansbeke et al., 2020) on handwritten images from the training set. In the dependency parser, the token embeddings have a dimension of 50, the hidden dimension of the transition classifier is 200, and we use a dropout rate of 0.5. For program induction, we adopt the default setting of DreamCoder (Ellis et al., 2021). For learning NSR, both the ResNet-18 and the dependency parser are trained by the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 10^-4. NSR is trained for 100 epochs on all datasets.

Qualitative Examples Fig. A4 and Fig. A5 show several examples of NSR predictions on SCAN and HINT. Given a handwritten expression, the system first performs a greedy deduction to propose an initial solution, which may generate a wrong result. In abduction, the root node, paired with the ground-truth result, is first pushed to the priority queue. Abduction over perception, syntax, and semantics is then performed on each popped node to generate possible revisions. A top-down search is also applied to propagate the expected value to the node's children. All possible revisions are pushed into the priority queue, and this process repeats until we find the most likely revision of the initial solution.

A.4 COMPOSITIONAL MACHINE TRANSLATION

To explore how well the proposed NSR model applies to real-world tasks, we run a proof-of-concept machine translation experiment using the English-French translation task from Lake & Baroni (2018), which has also been used by previous works (Li et al., 2019; Chen et al., 2020; Kim, 2021) to explore their methods in realistic domains. Compared to synthetic tasks like SCAN and PCFG, this translation task contains more complex and ambiguous rules.

Evaluation We adopt the public data splits from Li et al. (2019). Specifically, the training set contains 10,000 English-French sentence pairs, where the English sentences begin with phrases such as "I am", "you are", and their contractions. The training set also contains 1,000 repetitions of the sentence pair ("I am daxy", "je suis daxiste"), which is the only sentence that introduces the pseudoword "daxy" in the training set. The test set contains 8 combinations of "daxy" with other phrases, e.g., "you are not daxy", which do not appear in the training set. Note that there are 2 different French translations of "you are" that frequently appear in the training set, so both are considered correct in the test set. We compare with the following baselines: Seq2Seq (Lake & Baroni, 2018), Primitive Substitution (Li et al., 2019), and NeSS (Chen et al., 2020).

Results Tab. A1 summarizes the results of the compositional machine translation task. Like previous methods (Primitive Substitution and NeSS), NSR achieves 100% generalization accuracy on this task. This experiment shows that NSR holds promise for real-world applications.
Despite the perfect accuracy in this proof-of-concept experiment, we do anticipate potential challenges in applying NSR to real-world tasks: (1) the noisy and numerous concepts in real-world tasks induce a large space of grounded symbol systems and might slow the training of NSR; (2) the functional programs in NSR are deterministic and thus cannot represent probabilistic semantics in real-world tasks; e.g., in machine translation, there may be multiple ways to translate a single sentence.

To explore how well the individual modules of NSR are learned, we perform an ablation study on HINT to analyze the performance of each module. Specifically, along with the final results, the HINT dataset also provides the symbol sequences and parse trees for evaluation. For neural perception, we report the accuracy of classifying each symbol. For dependency parsing, we report the accuracy of attaching each symbol to its correct parent, given the ground-truth symbol sequence as input. For program induction, we report the accuracy of the final results, given the ground-truth symbol sequence and parse tree. Overall, each module achieves high accuracy, as shown in Tab. A2. For neural perception, most errors come from the two parentheses, "(" and ")", because they are visually similar. For dependency parsing, we analyze the parsing accuracies for different concept groups: digits (100%), operators (95.85%), and parentheses (64.28%). The parsing accuracy for parentheses is much lower than for digits and operators. We believe this is because, as long as digits and operators are correctly attached in the parse tree, where the parentheses are attached does not influence the final results, since parentheses carry no semantic meaning. For program induction, we can manually verify that the induced programs (Fig. 4) have correct semantics; the remaining errors are caused by exceeding the recursion limit when calling the program for multiplication.
The above analysis is also verified by the qualitative examples in Fig. A5 .

