LEARNING ALGEBRAIC REPRESENTATION FOR ABSTRACT SPATIAL-TEMPORAL REASONING

Anonymous authors
Paper under double-blind review

Abstract

Is intelligence realized by connectionist or classicist approaches? While connectionist models have achieved superhuman performance on specific tasks, there is growing evidence that such task-specific superiority is particularly fragile under systematic generalization. This observation lies at the center of the debate (Fodor et al., 1988; Fodor & McLaughlin, 1990) between connectionists and classicists, wherein the latter continually advocate an algebraic treatment in cognitive architectures. In this work, we follow the classicists' call and propose a hybrid approach to improve systematic generalization in reasoning. Specifically, we showcase a prototype with algebraic representations for the abstract spatial-temporal reasoning task of Raven's Progressive Matrices (RPM) and present the ALgebra-Aware Neuro-Semi-Symbolic (ALANS²) learner. The ALANS² learner is motivated by abstract algebra and representation theory. It consists of a neural visual perception frontend and an algebraic abstract reasoning backend: the frontend summarizes the visual information into object-based representations, while the backend transforms them into an algebraic structure and induces the hidden operator on the fly. The induced operator is then executed to predict the answer's representation, and the choice most similar to the prediction is selected as the solution. Extensive experiments show that by incorporating an algebraic treatment, the ALANS² learner outperforms various pure connectionist models in domains requiring systematic generalization. We further show that the learned algebraic representation can be decoded by isomorphism and used to generate an answer.

1. INTRODUCTION

"Thought is in fact a kind of Algebra." - William James (James, 1891)

Imagine you are given two alphabetical sequences, "c, b, a" and "d, c, b", and asked to fill in the missing element in "e, d, ?". In nearly no time will one realize the answer to be c. More surprising about human learning, however, is that, effortlessly and instantaneously, we can "freely generalize" (Marcus, 2001) the solution to any partial consecutive ordered sequence. While believed to be innate in early development for human infants (Marcus et al., 1999), such systematic generalizability has constantly been missing from, and proven particularly challenging for, existing connectionist models (Lake & Baroni, 2018; Bahdanau et al., 2019). In fact, the ability to entertain a given thought together with semantically related contents strongly implies an abstract algebra-like treatment (Fodor et al., 1988); in the literature, it is referred to as the "language of thought" (Fodor, 1975), the "physical symbol system" (Newell, 1980), and the "algebraic mind" (Marcus, 2001). In stark contrast, existing connectionist models tend only to capture statistical correlation (Lake & Baroni, 2018; Kansky et al., 2017; Chollet, 2019), rather than providing any account of a structural inductive bias in which systematic algebra can be carried out to facilitate generalization. This contrast instinctively raises a question: what constitutes such an algebraic inductive bias? We argue that the foundation of the modeling counterpart to the algebraic treatment in early human development (Marcus, 2001; Marcus et al., 1999) lies in algebraic computations set up on mathematical axioms, a form of formalized human intuition and the starting point of modern mathematical reasoning (Heath et al., 1956; Maddy, 1988). Of particular importance to the basic building blocks of algebra is the Peano Axiom (Peano, 1889).
In the Peano Axiom, the essential components of algebra, the algebraic set and the corresponding operators over it, are governed by three statements: (1) the existence of at least one element in the field to study (the "zero" element), (2) a successor function that is recursively applied to all elements and can therefore span the entire field, and (3) the principle of mathematical induction. Building on such a fundamental axiom, we begin to form the notion of an algebraic set and induce the operator along with it to construct an algebraic structure. We hypothesize that such a treatment of algebraic computations set up on fundamental axioms is essential for a model's systematic generalizability, the lack of which will only make a model sub-optimal. To demonstrate the benefits of such an algebraic treatment in systematic generalization, we showcase a prototype for Raven's Progressive Matrices (RPM) (Raven, 1936; Raven & Court, 1998), an exemplar task for abstract spatial-temporal reasoning (Santoro et al., 2018; Zhang et al., 2019a). In this task, an agent is given an incomplete 3×3 matrix in which the first eight panels (the context) are given and the last panel is missing, and is asked to pick from a set of eight choices the answer that best completes the matrix. Humans' capability of solving this abstract reasoning task has commonly been regarded as an indicator of "general intelligence" (Carpenter et al., 1990) and "fluid intelligence" (Spearman, 1923; 1927; Hofstadter, 1995; Jaeggi et al., 2008). Although the task ideally requires abstraction, algebraization, induction, and generalization (Raven, 1936; Raven & Court, 1998; Carpenter et al., 1990), recent endeavors unanimously propose pure connectionist models that attempt to circumvent such intrinsic cognitive requirements (Santoro et al., 2018; Zhang et al., 2019a;b; Wang et al., 2020; Zheng et al., 2019; Hu et al., 2020; Wu et al., 2020).
However, these methods' inefficiency is also evident in systematic generalization; they struggle to extrapolate to domains beyond training, as pointed out in (Santoro et al., 2018; Zhang et al., 2019b) and shown later in this paper. To address the issue, we introduce the ALgebra-Aware Neuro-Semi-Symbolic (ALANS²) learner. At a high level, the ALANS² learner is embedded in a general neuro-symbolic architecture (Yi et al., 2018; Mao et al., 2019; Han et al., 2019; Yi et al., 2020) but has on-the-fly operator learnability and is hence semi-symbolic. Specifically, it consists of a neural visual perception frontend and an algebraic abstract reasoning backend. For each RPM instance, the neural visual perception frontend first slides a window over each panel to obtain an object-based representation (Kansky et al., 2017; Wu et al., 2017) for every object. A belief inference engine later aggregates all object-based representations in each panel to produce a probabilistic belief state. The algebraic abstract reasoning backend then takes the belief states of the eight context panels, treats them as snapshots of an algebraic structure, lifts them into a matrix-based algebraic representation built on the Peano Axiom and representation theory (Humphreys, 2012), and induces the hidden operator in the algebraic structure by solving an inner-level optimization (Colson et al., 2007; Bard, 2013). The algebraic representation of the answer is predicted by executing the induced operator; its corresponding set element is decoded via the isomorphism established by representation theory, and the final answer is selected as the choice most similar to the prediction. The ALANS² learner enjoys several benefits in abstract reasoning with an algebraic treatment: 1.
Unlike previous monolithic models, the ALANS² learner offers a more interpretable account of the entire abstract reasoning process: the neural visual perception frontend extracts object-based representations and produces panel belief states by explicit probabilistic inference, whereas the algebraic abstract reasoning backend induces the hidden operator in the algebraic structure. The corresponding representation of the final answer is obtained by executing the induced operator, and the choice panel with minimum distance is selected. This process closely resembles the top-down bottom-up strategy in human reasoning: humans reason by inducing the hidden relation, executing it to generate a feasible solution in mind, and choosing the most similar available answer (Carpenter et al., 1990). Such a strategy is missing in recent literature (Santoro et al., 2018; Zhang et al., 2019a;b; Wang et al., 2020; Zheng et al., 2019; Hu et al., 2020; Wu et al., 2020). 2. While keeping the semantic interpretability and end-to-end trainability of existing neuro-symbolic frameworks (Yi et al., 2018; Mao et al., 2019; Han et al., 2019; Yi et al., 2020), ALANS² is what we call semi-symbolic in the sense that the symbolic operators can be learned and concluded on the fly without a manual definition for every one of them. Such an inductive ability also enables a greater extent of the desired generalizability. 3. By decoding the predicted representation in the algebraic structure, we can also generate an answer that satisfies the hidden relation in the context.

This work makes three major contributions: (1) We propose the ALANS² learner. Compared to existing monolithic models, the ALANS² learner adopts a neuro-semi-symbolic design, where the problem-solving process is decomposed into neural visual perception and algebraic abstract reasoning.
(2) To demonstrate the efficacy of incorporating an algebraic treatment in abstract spatial-temporal reasoning, we show the superior systematic generalization ability of the proposed ALANS² learner in various extrapolatory RPM domains. (3) We present analyses of both neural visual perception and algebraic abstract reasoning. We also show the generative potential of ALANS².

2. RELATED WORK

Quest for Symbolized Manipulation  The idea of treating thinking as a mental language can be dated back to Augustine (Augustine, 1876; Wittgenstein, 1953). Since the 1970s, this school of thought has undergone a dramatic revival as the quest for symbolized manipulation in cognitive modeling, such as the "language of thought" (Fodor, 1975), the "physical symbol system" (Newell, 1980), and the "algebraic mind" (Marcus, 2001). In this line of study, connectionism's task-specific superiority and inability to generalize beyond training (Kansky et al., 2017; Chollet, 2019; Santoro et al., 2018; Zhang et al., 2019a) have been hypothetically linked to a lack of such symbolized algebraic manipulation (Lake & Baroni, 2018; Chollet, 2019; Marcus, 2020). With evidence that an algebraic treatment adopted in early human development (Marcus et al., 1999) can potentially address the issue (Bahdanau et al., 2019; Mao et al., 2019; Marcus, 2020), classicist (Fodor et al., 1988) approaches to generalizable reasoning, as used in programs (McCarthy, 1960) and the blocks world (Winograd, 1971), have been resurrected. As a hybrid approach to bridge connectionism and classicism, recent developments have led to neuro-symbolic architectures. In particular, Yi et al. (2018) demonstrate a neuro-symbolic prototype for visual question answering, where a perception module and a language parsing module are separately trained, and the predefined logic operators associated with language tokens are chained to process the visual information. Mao et al. (2019) soften the predefined operators to afford end-to-end training with only question answers. Han et al. (2019) and Yi et al. (2020) use the hybrid architecture for metaconcept learning and temporal causal learning, respectively. ALANS² follows the classicists' call but adopts a neuro-semi-symbolic architecture: it is end-to-end trainable, as opposed to Yi et al.
(2018; 2020), and its operators can be learned and concluded on the fly without manual specification (Yi et al., 2018; Mao et al., 2019; Han et al., 2019; Yi et al., 2020).

Abstract Visual Reasoning  Recent works by Santoro et al. (2018) and Zhang et al. (2019a) aroused the community's interest in abstract visual reasoning, where the task of Raven's Progressive Matrices (RPM) is introduced as a measure for intelligent agents. Initially proposed as an intelligence quotient test for humans (Raven, 1936; Raven & Court, 1998), RPM is believed to be strongly correlated with humans' general intelligence (Carpenter et al., 1990) and fluid intelligence (Spearman, 1923; 1927; Hofstadter, 1995; Jaeggi et al., 2008). Early RPM-solving systems employ symbolic representations based on hand-designed features and assume access to the underlying logics (Carpenter et al., 1990; Lovett et al., 2009; 2010; Lovett & Forbus, 2017). Another stream of research on RPM recruits similarity-based metrics to select the most similar answer from the choices (Little et al., 2012; McGreggor & Goel, 2014; McGreggor et al., 2014; Mekik et al., 2018; Shegheva & Goel, 2018). However, their hand-defined visual features are unable to handle uncertainty from imperfect perception, and directly assuming access to the logic operations simplifies the problem. Recently proposed data-driven approaches arise from the availability of large datasets: Santoro et al. (2018) extend a pedagogical RPM generation method (Wang & Su, 2015), whereas Zhang et al. (2019a) use a stochastic image grammar (Zhu et al., 2007) and introduce structural annotations into it, which Hu et al. (2020) further refine to avoid shortcut solutions based on statistics of the candidate panels. Despite the fact that RPM intrinsically requires one to perform abstraction, algebraization, induction, and generalization, existing methods bypass such cognitive requirements using a single feed-forward pass in connectionist models; for instance, Wu et al. (2020) apply a scattering transformation to learn objects, attributes, and relations. In contrast, ALANS² attempts to fulfill the cognitive requirements in a neuro-semi-symbolic framework: the perception frontend abstracts out visual information, and the reasoning backend induces the hidden operator in an algebraic structure.

3. THE ALANS 2 LEARNER

In this section, we introduce the ALANS² learner for the RPM problem. In each RPM instance, an agent is given an incomplete 3×3 panel matrix with the last entry missing, and is asked to induce the operator hidden in the matrix and to choose from eight choice panels the one that follows it. Formally, let the answer variable be denoted as $y$, the context panels as $\{I_{o,i}\}_{i=1}^{8}$, and the choice panels as $\{I_{c,i}\}_{i=1}^{8}$. The problem can then be formulated as estimating $P(y \mid \{I_{o,i}\}_{i=1}^{8}, \{I_{c,i}\}_{i=1}^{8})$. Following the common design (Santoro et al., 2018; Zhang et al., 2019a; Carpenter et al., 1990), there is one operator that governs each panel attribute. Hence, by assuming independence among attributes, we propose to factorize the probability as
$$P(y = n \mid \{I_{o,i}\}_{i=1}^{8}, \{I_{c,i}\}_{i=1}^{8}) \propto \prod_{a} \sum_{T^a} P(y^a = n \mid T^a, \{I_{o,i}\}_{i=1}^{8}, \{I_{c,i}\}_{i=1}^{8}) \, P(T^a \mid \{I_{o,i}\}_{i=1}^{8}), \quad (1)$$
where $y^a$ denotes the answer selection based only on attribute $a$, and $T^a$ the operator on $a$.

Overview  As shown in Fig. 1, the ALANS² learner decomposes the process into perception and reasoning: the neural visual perception frontend extracts a belief state from each of the sixteen panels, whereas the algebraic abstract reasoning backend views an instance as an example of an abstract algebraic structure, transforms belief states into algebraic representations by representation theory, induces the hidden operators, and executes them to predict the representation of the answer. Therefore, in Eq. (1), the operator distribution is modeled by the fitness of an operator, and the answer distribution by the distance between the predicted representation and that of a candidate.
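To make the factorization concrete, here is a minimal numerical sketch of Eq. (1); the function name and array layout are our own illustrative assumptions, not the authors' code:

```python
import numpy as np

def answer_distribution(per_attribute):
    """Factorized answer distribution of Eq. (1).

    per_attribute: for each attribute a, a pair (p_y_given_T, p_T) where
      p_y_given_T has shape [num_operators, 8] (answer prob. per operator) and
      p_T has shape [num_operators] (operator distribution).
    Returns the normalized distribution over the 8 candidate panels.
    """
    p = np.ones(8)
    for p_y_given_T, p_T in per_attribute:
        p *= p_T @ p_y_given_T  # marginalize over operators T^a, then multiply over a
    return p / p.sum()
```

The product over attributes encodes the independence assumption; the inner matrix-vector product carries out the sum over candidate operators.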

3.1. NEURAL VISUAL PERCEPTION

The neural visual perception frontend consists of an object CNN and a belief inference engine. It is responsible for extracting a belief state for each of the sixteen (context and choice) panels.

Object CNN  For each panel, we use a sliding window to traverse the spatial domain of the image and feed each image region into an object CNN. The CNN has four branches, producing for each region its object attribute distributions, including objectiveness (whether the region contains an object), type, size, and color. The distributions of type, size, and color are conditioned on an object's existence.

Belief Inference Engine  The belief inference engine summarizes the panel attribute distributions (over position, number, type, size, and color) by marginalizing out all object attribute distributions (over objectiveness, type, size, and color). As an example, the distribution over the panel attribute of Number can be computed as follows: for $N$ image regions with predicted objectiveness,
$$P(\mathrm{Number} = k) = \sum_{R^o} \prod_{j=1}^{N} P(r_j^o = R_j^o), \quad (2)$$
where $P(r_j^o)$ denotes the $j$th region's estimated objectiveness distribution, and $R^o$ ranges over binary sequences of length $N$ that sum to $k$. All panel attribute distributions compose the belief state of a panel. In the following, we denote the belief state as $b$ and the distribution of an attribute $a$ as $P(b^a)$.
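The marginalization in Eq. (2) sums over exponentially many binary sequences, but it can be computed exactly with a simple dynamic program over regions (the standard Poisson-binomial recursion); the sketch below is our own illustration, not the authors' implementation:

```python
def number_distribution(obj_probs):
    """P(Number = k) from per-region objectiveness probabilities, as in Eq. (2).

    Dynamic programming over regions avoids enumerating all 2^N binary
    sequences: after processing j regions, dist[k] = P(k objects among them).
    """
    dist = [1.0]  # P(0 objects among 0 regions) = 1
    for p in obj_probs:
        new = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            new[k] += q * (1.0 - p)      # this region is empty
            new[k + 1] += q * p          # this region contains an object
        dist = new
    return dist
```

For example, two regions each with objectiveness 0.5 yield the distribution [0.25, 0.5, 0.25] over 0, 1, or 2 objects.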

3.2. ALGEBRAIC ABSTRACT REASONING

Given the belief states of both context and choice panels, the algebraic abstract reasoning backend concerns the induction of the hidden operators and the prediction of the answer representation for each attribute. The fitness of the induced operators is used for estimating the operator distribution, and the difference between the prediction and a choice panel for estimating the answer distribution.

Algebraic Underpinning  Without loss of generality, we assume row-wise operators here. For each attribute, under perfect perception, the first two rows of an RPM instance provide snapshots of an example of a magma (Hausmann & Ore, 1937) constrained to an integer-indexed set, the simplest group-like algebraic structure that is closed under a binary operator. To see this, note that an accurate perception module would see each panel attribute as a deterministic set element. RPM instances with unary operators, such as progression, are therefore magma examples with special binary operators where one operand is constant. Instances with binary operators, such as arithmetic, directly follow the magma properties. Those with ternary operators are instances with unary operators on a three-tuple set defined over rows.

Algebraic Representation  A systematic algebraic view allows us to felicitously recruit ideas from representation theory (Humphreys, 2012) to glean the hidden properties of the abstract structures: it makes abstract algebra amenable by reducing it to linear algebra. In the same spirit, we propose to lift both the set elements and the hidden operators into a learnable matrix space. To encode the set elements, we employ the Peano Axiom (Peano, 1889). According to the Peano Axiom, an integer-indexed set can be constructed from (1) a zero element ($0$), (2) a successor function ($S(\cdot)$), and (3) the principle of mathematical induction, such that the $k$th element is encoded as $S^k(0)$.
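As a toy illustration of this construction (ours, not part of the model), a few lines of Python span an integer-indexed set from a zero element and a successor function alone:

```python
def zero():
    """The distinguished "zero" element of the set."""
    return 0

def successor(n):
    """Successor function S(.): maps an element to the next one."""
    return n + 1

def element(k):
    """The k-th element, S^k(0), obtained by repeated application of S."""
    n = zero()
    for _ in range(k):
        n = successor(n)
    return n
```

The model replaces the integers here with learnable matrices, as described next, but the construction is identical.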
Specifically, we instantiate the zero element as a learnable matrix $M_0$ and the successor function as the matrix-matrix product parameterized by a learnable matrix $M$. In an attribute-specific manner, the representation of attribute $a$ taking the $k$th value is $(M^a)^k M_0^a$. For the operators, we consider them to live in a learnable matrix group of a corresponding dimension, such that the action of an operator on the set can be represented as matrix multiplication. Such algebraic representations establish an isomorphism between the matrix space and the abstract algebraic structure: abstract elements of the algebraic structure have a bijective mapping to/from the matrix space, and inducing the abstract relation can be reduced to solving for a matrix operator. See Fig. 2 for a graphical illustration of the isomorphism.

Operator Induction  Operator induction concerns finding a concrete operator in the abstract
algebraic structure. By the property of closure, we formulate it as an inner-level regularized linear regression problem: a binary operator for attribute $a$ is induced as $T_b^a = \arg\min_T \ell(T)$ with
$$\ell(T) = \sum_i \mathbb{E}\left[ \| M(b_{o,i}^a) \, T \, M(b_{o,i+1}^a) - M(b_{o,i+2}^a) \|_F^2 \right] + \lambda_b^a \|T\|_F^2, \quad (3)$$
where, under visual uncertainty, the expectation is taken with respect to the distributions in the belief states $P(b_{o,i}^a)$ of the context panels in the first two rows, and $M(b_{o,i}^a)$ denotes the corresponding algebraic representation. For unary operators, one operand can be treated as constant and absorbed into $T$. Note that Eq. (3) admits a closed-form solution (see Appendix for details). Therefore, the operator can be learned and adapted for different instances of binary relations and concluded on the fly. Such a design also simplifies recent neuro-symbolic approaches, in which every single symbolic operator needs to be hand-defined (Yi et al., 2018; Mao et al., 2019; Han et al., 2019; Yi et al., 2020). Instead, we only specify an inner-level optimization framework and allow symbolic operators to be quickly induced from the neural observations, while keeping the semantic interpretability of neuro-symbolic methods. We therefore term such a design semi-symbolic. The operator probability in Eq. (1) is then modeled by each operator type's fitness, e.g., for a binary operator,
$$P(T^a = T_b^a \mid \{I_{o,i}\}_{i=1}^{8}) \propto \exp(-\ell_b^a(T_b^a)). \quad (4)$$

Operator Execution  To predict the algebraic representation of the answer, we solve another inner-level optimization similar to Eq. (3), but now treating the representation of the answer as a variable:
$$\widehat{M}_b^a = \arg\min_{M} \mathbb{E}\left[ \| M(b_{o,7}^a) \, T_b^a \, M(b_{o,8}^a) - M \|_F^2 \right], \quad (5)$$
where the expectation is taken with respect to the context panels in the last row. This optimization also admits a closed-form solution (see Appendix for details), which corresponds to the execution of the induced operator in Eq. (3).
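A sketch of how such a closed-form solution can be obtained follows. The paper defers the exact derivation to its Appendix, so this is our reconstruction under the standard identity vec(A T B) = (Bᵀ ⊗ A) vec(T), and it ignores the expectation over belief states (i.e., it works with point estimates of the representations):

```python
import numpy as np

def induce_binary_operator(A_list, B_list, C_list, lam=1e-6):
    """Closed-form ridge solution of min_T sum_i ||A_i T B_i - C_i||_F^2 + lam ||T||_F^2.

    Uses vec(A T B) = kron(B.T, A) @ vec(T) with column-major (Fortran) vec.
    """
    d = A_list[0].shape[0]
    G = np.zeros((d * d, d * d))
    h = np.zeros(d * d)
    for A, B, C in zip(A_list, B_list, C_list):
        K = np.kron(B.T, A)                      # maps vec(T) -> vec(A T B)
        G += K.T @ K
        h += K.T @ C.reshape(-1, order="F")
    vec_T = np.linalg.solve(G + lam * np.eye(d * d), h)
    return vec_T.reshape(d, d, order="F")

def execute_operator(A, T, B):
    """Execution step of Eq. (5): the predicted representation is A T B."""
    return A @ T @ B
```

Induction fits T on the first two rows; execution then applies the same T to the last row to produce the answer's representation.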
The predicted representation is decoded probabilistically into the predicted belief state of the solution,
$$P(\hat{b}^a = k \mid T^a) \propto \exp(-\| \widehat{M}^a - (M^a)^k M_0^a \|_F^2). \quad (6)$$

Answer Selection  Based on Eqs. (1) and (4), estimating the answer distribution now boils down to estimating the conditional answer distribution for each attribute. Here, we propose to model it based on the Jensen-Shannon Divergence (JSD) between the predicted belief state and that of a choice,
$$P(y^a = n \mid T^a, \{I_{o,i}\}_{i=1}^{8}, \{I_{c,i}\}_{i=1}^{8}) \propto \exp(-D_{\mathrm{JSD}}(P(\hat{b}^a \mid T^a) \,\|\, P(b_{c,n}^a))). \quad (7)$$
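A minimal sketch of the JSD-based selection of Eq. (7), with hypothetical function names of our own choosing:

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * (np.log(a + eps) - np.log(b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def select_answer(pred_belief, choice_beliefs):
    """Eq. (7): P(y^a = n) proportional to exp(-JSD(pred || choice_n))."""
    scores = np.array([np.exp(-jsd(pred_belief, c)) for c in choice_beliefs])
    return scores / scores.sum()
```

A choice whose belief state matches the prediction exactly has zero divergence and hence the highest score.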

Discussion

The algebraic abstract reasoning module offers a computational and interpretable counterpart to human-like reasoning in RPM (Carpenter et al., 1990). Specifically, the induction component resembles fluid intelligence, where one quickly induces the hidden operator by observing the context panels. The execution component synthesizes an image by executing the induced operator, and the choice most similar to that image is selected as the answer. We also note that by decoding the predicted representation in Eq. (6), a solution can be generated: by sequentially selecting the most probable operator and the most probable attribute value, a rendering engine can directly render the solution. The reasoning backend also enables end-to-end training: by integrating the belief states from neural perception, the module conducts both induction and execution in a soft manner, such that gradients can be back-propagated and the learner jointly trained.

3.3. LEARNING OBJECTIVE

We train the entire ALANS² learner by minimizing the cross-entropy loss between the estimated answer distribution and the ground-truth selection, i.e.,
$$\min_{\theta, \{M_0^a\}, \{M^a\}} \ell(P(y \mid \{I_{o,i}\}_{i=1}^{8}, \{I_{c,i}\}_{i=1}^{8}), y^\star), \quad (8)$$
where $\ell(\cdot)$ denotes the cross-entropy loss, $y^\star$ the ground-truth selection, $\theta$ the parameters of the object CNN, and $\{M_0^a\}$ and $\{M^a\}$ the zero elements and the successor functions for element encoding, respectively. Notations are simplified by making the dependency on parameters implicit. However, we notice in practice that with only the cross-entropy loss on the ground-truth selection, the ALANS² learner has difficulty converging. Without proper guidance, the object CNN does not produce meaningful object-based representations. Therefore, following the discussion in (Santoro et al., 2018; Zhang et al., 2019a; Wang et al., 2020), we augment training with an auxiliary loss on the distribution of the operator, i.e.,
$$\min_{\theta, \{M_0^a\}, \{M^a\}} \ell(P(y \mid \{I_{o,i}\}_{i=1}^{8}, \{I_{c,i}\}_{i=1}^{8}), y^\star) + \sum_a \lambda_a \, \ell(P(T^a \mid \{I_{o,i}\}_{i=1}^{8}), y_\star^a), \quad (9)$$
where $y_\star^a$ denotes the ground-truth operator selection for attribute $a$, and $\lambda_a$ balances the trade-off.
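The combined objective of Eq. (9) can be sketched as follows; the dictionary-based interface and function names are our own illustrative assumptions:

```python
import math

def cross_entropy(probs, target, eps=1e-12):
    """Cross-entropy between a predicted distribution and a one-hot target."""
    return -math.log(probs[target] + eps)

def total_loss(answer_probs, y_star, operator_probs, operator_targets, lambdas):
    """Eq. (9): answer cross-entropy plus weighted per-attribute operator losses.

    operator_probs:   attribute -> predicted operator distribution
    operator_targets: attribute -> ground-truth operator index
    lambdas:          attribute -> trade-off weight lambda_a
    """
    loss = cross_entropy(answer_probs, y_star)
    for a, probs in operator_probs.items():
        loss += lambdas[a] * cross_entropy(probs, operator_targets[a])
    return loss
```

The auxiliary term gives the perception frontend a denser training signal than the single answer label alone.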

4. EXPERIMENTS

A cognitive architecture with systematic generalization is believed to demonstrate the following three principles (Fodor et al., 1988; Marcus, 2001; 2020): (1) systematicity, (2) productivity, and (3) localism. Systematicity requires an architecture to be able to entertain "semantically related" contents after understanding a given thought. Productivity states that awareness of a constituent implies awareness of a recursive application of the constituent, and localism states the converse. To verify the effectiveness of an algebraic treatment in systematic generalization, we showcase the superiority of the proposed ALANS² learner on the three principles in the abstract spatial-temporal reasoning task of RPM. Specifically, we use the generation methods proposed by Zhang et al. (2019a) and Hu et al. (2020) to generate RPM problems and carefully split training and testing to construct the three regimes. The former generates candidates by perturbing only one attribute of the correct answer, while the latter modifies attribute values in a hierarchical manner to avoid shortcut solutions based on pure statistics. Both methods categorize relations in RPM into three types, following Carpenter et al. (1990): unary (Constant and Progression), binary (Arithmetic), and ternary (Distribution of Three), each of which comes with several instances. Grounding the principles in learning abstract relations in RPM, we fix the configuration to 3×3Grid and generate the following data splits for evaluation (see Appendix for details):
• Systematicity: the training set contains only a subset of instances for each type of relation, while the test set contains all other relation instances.
• Productivity: as the binary relation results from a recursive application of the unary relation, the training set contains only unary relations, whereas the test set contains only binary relations.
• Localism: the training and testing sets of the productivity split are swapped to study localism.
We follow Zhang et al.
(2019a) to generate 10,000 instances for each split and assign 6 folds for training, 2 folds for validation, and 2 folds for testing.

Experimental Setup  We evaluate the systematic generalizability of the proposed ALANS² learner on the above three splits and compare it with other baselines, including ResNet, ResNet+DRT (Zhang et al., 2019a), WReN (Santoro et al., 2018), CoPINet (Zhang et al., 2019b), MXGNet (Wang et al., 2020), LEN (Zheng et al., 2019), HriNet (Hu et al., 2020), and SCL (Wu et al., 2020). We use either official or public implementations that reproduce the original results.

Systematic Generalization  Table 1 shows the performance of various models on systematic generalization, i.e., systematicity, productivity, and localism. Compared to their previously reported results, all pure connectionist models experience a devastating performance drop when it comes to the critical cognitive requirements of systematic generalization, indicating that pure connectionist models fail to perform the abstraction, algebraization, induction, or generalization needed to solve the abstract reasoning task; instead, they seem only to take a shortcut to bypass them. In particular, MXGNet (Wang et al., 2020)'s superiority diminishes under systematic generalization. Despite learning with structural annotations, ResNet+DRT (Zhang et al., 2019a) does not fare better than its base model. The recently proposed HriNet (Hu et al., 2020) slightly improves on ResNet in this aspect, with LEN (Zheng et al., 2019) being only marginally better. WReN (Santoro et al., 2018), on the other hand, shows oscillating performance across the three regimes. Evaluated under systematic generalization, SCL (Wu et al., 2020) and CoPINet (Zhang et al., 2019b) also deviate far from their "superior performance".
These observations suggest that pure connectionist models most likely learn from variation in visual appearance rather than the algebra underlying the problem. Embedded in a neuro-semi-symbolic framework, the proposed ALANS² learner improves on systematic generalization by a large margin. With an algebra-aware design, the model is considerably stable across the different principles of systematic generalization. The algebraic representations learned for relations of either a constituent or a recursive composition naturally support productivity and localism, while the semi-symbolic inner optimization further allows various instances of an operator type to be induced from the algebraic representations and boosts systematicity. The importance of the algebraic representations is made more evident in the ablation study: ALANS²-Ind, with the algebraic representation replaced by independent encodings and the algebraic isomorphism broken, shows inferior performance. The ALANS² learner also enables diagnostic tests of its jointly learned perception and reasoning modules, in contrast to its black-box-like connectionist counterparts.

Analysis into Perception and Reasoning

The neuro-semi-symbolic design allows analyses of both perception and reasoning. To evaluate the neural perception module and the algebraic reasoning module, we extract region-based object attribute annotations from the dataset generation methods (Zhang et al., 2019a; Hu et al., 2020) and categorize all relations into three types, i.e., unary, binary, and ternary. Table 2 shows the perception module's performance on the test sets in the three regimes of systematic generalization. We note that, to achieve the results shown in Table 1, the ALANS 2 learner learns to construct the concept of objectiveness perfectly. The model also achieves fairly accurate predictions on the attributes of type and size. However, on the texture-related concept of color, ALANS 2 fails to develop a reliable notion. Despite that, the general prediction accuracy of the perception module is still surprising, considering that the perception module is jointly learned with ground-truth annotations only on answer selections. The relatively lower accuracy on color could be attributed to its larger value space compared to other attributes. Table 3 lists the reasoning module's performance during testing for the three aspects. Note that on position, the unary operator (shifting) and the binary operator (set arithmetic) do not systematically imply each other; hence, we do not count them as probes into productivity and localism. In general, we notice that the better the perception accuracy on an attribute, the better the performance on reasoning. However, we also note that despite the relatively accurate perception of objectiveness, type, and size, near-perfect reasoning is never guaranteed. This deficiency is due to the perception uncertainty handled by expectation in Eq. (3): even though taking the arg max yields the correct value, marginalizing by expectation unavoidably introduces noise into the reasoning process.
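The noise introduced by marginalizing with expectation can be illustrated with a toy example. The sketch below is our own illustration (the matrices, belief values, and operator are made up, not from the paper): even when the arg max of a belief state picks the correct value, the expected representation blends in mass from incorrect values and perturbs the operator execution.

```python
import numpy as np

# Toy belief state over 3 attribute values, each with a 2x2 matrix
# representation; all numbers here are illustrative only.
reps = np.array([np.eye(2) * (k + 1) for k in range(3)])  # M(v_0), M(v_1), M(v_2)
belief = np.array([0.2, 0.7, 0.1])                        # argmax picks v_1

T = np.array([[0.0, 1.0], [1.0, 0.0]])                    # some induced operator

argmax_exec = reps[np.argmax(belief)] @ T                 # execution with hard beliefs
expect_exec = np.tensordot(belief, reps, axes=1) @ T      # execution with expectation

# The argmax is correct, yet the expectation deviates from M(v_1) T
# unless the belief is one-hot:
noise = np.linalg.norm(expect_exec - argmax_exec)
print(noise > 0)  # True
```

With a one-hot belief the two executions coincide, which is why a correct but uncertain perception frontend still degrades reasoning.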
Therefore, an ideal reasoning module requires the perception frontend to be not only correct but also certain. Computationally, one could sample from the perception module and optimize Eq. (9) using REINFORCE (Williams, 1992); however, the credit assignment problem and the variance of the gradient estimate would further complicate training.

Generative Potential Compared to existing discriminative-only RPM-solving methods, the proposed ALANS 2 learner is unique in its generative potential. As mentioned above, the final panel attribute can be decoded by sequentially selecting the most probable hidden operator and attribute value. When equipped with a rendering engine, a solution can be generated. Here, we use the rendering program released by Zhang et al. (2019a) to demonstrate this generative potential in the proposed ALANS 2 learner. Fig. 3 shows examples where the solutions are generated by ALANS 2. Such a generative ability is a computational counterpart of human reasoning: ALANS 2 selects from the pool of candidates the one most similar to a synthesized image, which resembles humans' combined top-down and bottom-up reasoning.

A INDUCING AND EXECUTING OPERATORS

In the main text, we exemplify the induction and the execution process using a binary operator. Here, we discuss the remaining details of the formulation for all three types of operators, i.e., unary, binary, and ternary.

Unary Operator To induce the unary operator $T^a_u$ for an attribute $a$, we solve the following optimization problem
$$T^a_u = \arg\min_T \ell^a_u(T) = \frac{1}{5}\Big(\mathbb{E}\big[\big\|M(b^a_{o,1})T - M(b^a_{o,2})\big\|_F^2\big] + \mathbb{E}\big[\big\|M(b^a_{o,2})T - M(b^a_{o,3})\big\|_F^2\big] + \mathbb{E}\big[\big\|M(b^a_{o,4})T - M(b^a_{o,5})\big\|_F^2\big] + \mathbb{E}\big[\big\|M(b^a_{o,5})T - M(b^a_{o,6})\big\|_F^2\big] + \mathbb{E}\big[\big\|M(b^a_{o,7})T - M(b^a_{o,8})\big\|_F^2\big]\Big) + \lambda^a_u \|T\|_F^2, \quad (S1)$$
where the indexing follows the row/column-major order. By taking the derivative with respect to $T$ and setting it to 0, we have the following solution,
$$T^a_u = A^{-1}B, \quad (S2)$$
where, assuming independence,
$$A = \mathbb{E}\big[M(b^a_{o,1})^\top M(b^a_{o,1})\big] + \mathbb{E}\big[M(b^a_{o,2})^\top M(b^a_{o,2})\big] + \mathbb{E}\big[M(b^a_{o,4})^\top M(b^a_{o,4})\big] + \mathbb{E}\big[M(b^a_{o,5})^\top M(b^a_{o,5})\big] + \mathbb{E}\big[M(b^a_{o,7})^\top M(b^a_{o,7})\big] + 5\lambda^a_u I \quad (S3)$$
and
$$B = \mathbb{E}\big[M(b^a_{o,1})^\top\big]\mathbb{E}\big[M(b^a_{o,2})\big] + \mathbb{E}\big[M(b^a_{o,2})^\top\big]\mathbb{E}\big[M(b^a_{o,3})\big] + \mathbb{E}\big[M(b^a_{o,4})^\top\big]\mathbb{E}\big[M(b^a_{o,5})\big] + \mathbb{E}\big[M(b^a_{o,5})^\top\big]\mathbb{E}\big[M(b^a_{o,6})\big] + \mathbb{E}\big[M(b^a_{o,7})^\top\big]\mathbb{E}\big[M(b^a_{o,8})\big]. \quad (S4)$$
Note that as long as $\lambda^a_u > 0$, $A$ is a symmetric positive definite matrix and hence invertible. Compared to the binary case, the unary operator can be regarded as a special binary operator where one of the operands is a constant, absorbed into operator learning, and jointly solved. To predict the answer representation, we solve another optimization problem, i.e.,
$$\widehat{M}^a_u = \arg\min_M \ell^a_u(M) = \mathbb{E}\big[\big\|M(b^a_{o,8})T^a_u - M\big\|_F^2\big]. \quad (S5)$$
Taking its derivative and setting it to 0, we have
$$\widehat{M}^a_u = \mathbb{E}\big[M(b^a_{o,8})\big]T^a_u. \quad (S6)$$
Note that this is exactly the execution of the learned operator.
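The closed form above can be sanity-checked numerically. The sketch below is our own illustration (expectations are replaced by point samples, and the variable names are ours): a planted unary operator is recovered via $T = A^{-1}B$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam = 4, 1e-8
T_true = rng.standard_normal((d, d))

# Five source/target pairs mimicking (M(b_1), M(b_2)), (M(b_2), M(b_3)), ...
# in Eq. (S1); expectations are replaced by point samples for illustration.
sources = [rng.standard_normal((d, d)) for _ in range(5)]
targets = [X @ T_true for X in sources]

# Closed-form ridge solution T = A^{-1} B, following Eqs. (S2)-(S4).
A = sum(X.T @ X for X in sources) + 5 * lam * np.eye(d)
B = sum(X.T @ Y for X, Y in zip(sources, targets))
T_hat = np.linalg.solve(A, B)

print(np.allclose(T_hat, T_true, atol=1e-4))  # True: the planted operator is recovered
```

With a tiny regularizer the recovery is near-exact; a larger $\lambda^a_u$ trades fit for conditioning of $A$.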

Binary Operator

The optimization problem for the binary case can be expanded as
$$T^a_b = \arg\min_T \ell^a_b(T) = \frac{1}{2}\Big(\mathbb{E}\big[\big\|M(b^a_{o,1})\,T\,M(b^a_{o,2}) - M(b^a_{o,3})\big\|_F^2\big] + \mathbb{E}\big[\big\|M(b^a_{o,4})\,T\,M(b^a_{o,5}) - M(b^a_{o,6})\big\|_F^2\big]\Big) + \lambda^a_b \|T\|_F^2. \quad (S7)$$
We note that, assuming independence, the solution satisfies
$$\mathbb{E}\big[M(b^a_{o,1})^\top M(b^a_{o,1})\big]\,T\,\mathbb{E}\big[M(b^a_{o,2})M(b^a_{o,2})^\top\big] + \mathbb{E}\big[M(b^a_{o,4})^\top M(b^a_{o,4})\big]\,T\,\mathbb{E}\big[M(b^a_{o,5})M(b^a_{o,5})^\top\big] + 2\lambda^a_b T = \mathbb{E}\big[M(b^a_{o,1})^\top\big]\mathbb{E}\big[M(b^a_{o,3})\big]\mathbb{E}\big[M(b^a_{o,2})^\top\big] + \mathbb{E}\big[M(b^a_{o,4})^\top\big]\mathbb{E}\big[M(b^a_{o,6})\big]\mathbb{E}\big[M(b^a_{o,5})^\top\big]. \quad (S8)$$
This is a linear matrix equation and can be turned into a linear equation by vectorization. Using $\operatorname{vec}(ATB) = (A \otimes B)\operatorname{vec}(T)$ (Lancaster, 1970), where $\otimes$ denotes the Kronecker product, we have
$$\operatorname{vec}(T^a_b) = A^{-1}B, \quad (S9)$$
where
$$A = \mathbb{E}\big[M(b^a_{o,1})^\top M(b^a_{o,1})\big] \otimes \mathbb{E}\big[M(b^a_{o,2})M(b^a_{o,2})^\top\big] + \mathbb{E}\big[M(b^a_{o,4})^\top M(b^a_{o,4})\big] \otimes \mathbb{E}\big[M(b^a_{o,5})M(b^a_{o,5})^\top\big] + 2\lambda^a_b I \quad (S10)$$
and
$$B = \operatorname{vec}\big(\mathbb{E}\big[M(b^a_{o,1})^\top\big]\mathbb{E}\big[M(b^a_{o,3})\big]\mathbb{E}\big[M(b^a_{o,2})^\top\big]\big) + \operatorname{vec}\big(\mathbb{E}\big[M(b^a_{o,4})^\top\big]\mathbb{E}\big[M(b^a_{o,6})\big]\mathbb{E}\big[M(b^a_{o,5})^\top\big]\big). \quad (S11)$$
Note that $A$ is also symmetric positive definite given positive $\lambda^a_b$ and hence invertible. The predicted answer representation is given by
$$\widehat{M}^a_b = \arg\min_M \ell^a_b(M) = \mathbb{E}\big[\big\|M(b^a_{o,7})\,T^a_b\,M(b^a_{o,8}) - M\big\|_F^2\big], \quad (S12)$$
which can be solved by executing the induced binary operator
$$\widehat{M}^a_b = \mathbb{E}\big[M(b^a_{o,7})\big]\,T^a_b\,\mathbb{E}\big[M(b^a_{o,8})\big].$$

Ternary Operator A ternary operation can be regarded as a unary operation on elements defined on rows/columns. Specifically, we propose to construct the algebraic representation of a row/column by concatenating the algebraic representations of its panels, i.e.,
$$M(b^a_{o,i}, b^a_{o,i+1}, b^a_{o,i+2}) = [M(b^a_{o,i}); M(b^a_{o,i+1}); M(b^a_{o,i+2})]. \quad (S13)$$
The ternary operator can then be solved as $T^a_t = \arg\min_T \ell^a_t(T)$ in the same manner as the unary case, with the answer representation obtained by executing $T^a_t$ and slicing it from the result.
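The vectorization step can be checked numerically. The sketch below is our own illustration, not the paper's implementation: it plants a binary operator, builds the Kronecker system using the standard column-major identity $\operatorname{vec}(ATB) = (B^\top \otimes A)\operatorname{vec}(T)$, and recovers the operator. Note that the exact Kronecker ordering depends on the row/column-major convention chosen for $\operatorname{vec}$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, lam = 3, 1e-8
T_true = rng.standard_normal((d, d))

def vec(M):
    # Column-major vectorization, for which vec(A T B) = (B^T kron A) vec(T).
    return M.flatten(order="F")

# Two (left, right) pairs with targets L @ T_true @ R, mimicking the rows
# (b_1, b_2, b_3) and (b_4, b_5, b_6); expectations are replaced by samples.
pairs = [(rng.standard_normal((d, d)), rng.standard_normal((d, d))) for _ in range(2)]
data = [(L, R, L @ T_true @ R) for L, R in pairs]

# Stack the vectorized system K vec(T) = c and solve the ridge-regularized
# normal equations, analogous to Eqs. (S9)-(S11).
K = np.vstack([np.kron(R.T, L) for L, R, _ in data])
c = np.concatenate([vec(C) for _, _, C in data])
t = np.linalg.solve(K.T @ K + lam * np.eye(d * d), K.T @ c)
T_hat = t.reshape((d, d), order="F")

print(np.allclose(T_hat, T_true, atol=1e-4))  # True: the planted operator is recovered
```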
To compute the operator distribution, we model it based on the fitness of each operator type,
$$P(T^a = T^a_u \mid \{I_{o,i}\}_{i=1}^{8}) \propto \exp(-\ell^a_u(T^a_u)), \quad (S18)$$
$$P(T^a = T^a_b \mid \{I_{o,i}\}_{i=1}^{8}) \propto \exp(-\ell^a_b(T^a_b)), \quad (S19)$$
$$P(T^a = T^a_t \mid \{I_{o,i}\}_{i=1}^{8}) \propto \exp(-\ell^a_t(T^a_t)). \quad (S20)$$
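Eqs. (S18) to (S20) amount to a softmax over the negative fitness losses. A minimal sketch (the loss values are made-up numbers for illustration):

```python
import numpy as np

def operator_distribution(losses):
    """Softmax over negative operator-fitness losses, as in Eqs. (S18)-(S20).

    A smaller loss (better fit) yields a larger posterior probability.
    The max-shift keeps the exponentials numerically stable.
    """
    logits = -np.asarray(losses, dtype=float)
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

# Illustrative fitness losses for the unary, binary, and ternary operators.
p = operator_distribution([0.3, 2.1, 4.0])
print(p.argmax())  # 0: the unary operator fits best
```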

B INSTANCES OF OPERATORS

In the original work of Zhang et al. (2019a) and Hu et al. (2020), there are four operators: Constant, Progression, Arithmetic, and Distribute of Three. Progression is parameterized by its step size (±1/±2). Arithmetic includes addition and subtraction. Distribute of Three is implemented as shifting and can be either a left shift or a right one. Note that Constant can be regarded as a special Progression with a step size of 0. In this work, we group the four operators into three types: unary (Constant and Progression), binary (Arithmetic), and ternary (Distribute of Three). To study systematic generalization in abstract relation learning, we use the RPM generation methods proposed in (Zhang et al., 2019a; Hu et al., 2020) and carefully split the data into three regimes, detailed below. For the network architecture, we use a LeNet-like architecture (LeCun et al., 1998) for each branch of the object CNN; see Table S1 for the design. Note that the object CNN consists of four branches, for objectiveness, type, size, and color.



use a relational module (Santoro et al., 2017), Steenbrugge et al. (2018) augment it with a VAE (Kingma & Welling, 2013), Zhang et al. (2019a) assemble a dynamic tree, Hill et al. (2019) arrange the data in a contrasting manner, Zhang et al. (2019b) propose a contrast module, Zheng et al. (2019) formulate it in a student-teacher setting, Wang et al. (2020) build a multiplex graph network, Hu et al. (2020) aggregate features from a hierarchical decomposition, and Wu et al. (

Figure1: An overview of the ALANS 2 learner. For an RPM instance, the neural visual perception module produces the belief states for all panels: an object CNN extracts object attribute distributions for each image region, and a belief inference engine marginalizes them out to obtain panel attribute distributions. For each panel attribute, the algebraic abstract reasoning module transforms the belief states into matrix-based algebraic representations and induces hidden operators by solving inner optimizations. The answer representations are obtained by executing the induced operators, and the choice most similar to the prediction is selected as the solution. An example of the underlying discrete algebra and its correspondence is also shown on the right.

Figure 2: Isomorphism between the abstract algebra and the matrix-based representation. Operator induction reduced to matrices.


• Systematicity: The training set and the test set contain all three types of operators but disjoint instances. Specifically, the training set has Constant, Progression of ±1, addition in Arithmetic, and left shift in Distribute of Three, while the test set has Progression of ±2, subtraction in Arithmetic, and right shift in Distribute of Three.
• Productivity: The training set contains only unary operators and the test set only binary operators. Specifically, the training set has Constant and all instances of Progression, while the test set has all instances of Arithmetic.
• Localism: The training set contains only binary operators and the test set only unary operators. Specifically, the training set has all instances of Arithmetic and the test set Constant and all instances of Progression.

Please see Figs. S1 to S3 for examples in the three splits.

C IMPLEMENTATION DETAILS

C.1 NETWORK ARCHITECTURE

Figure S1: A training example (left) and a test example (right) in the systematicity split. Note that in the training example, the arithmetic relation (in number) is addition and the shifting is always a left shift (in type, size, and color). In the test example, the shifting becomes a right shift (in type), the size progression has a step of 2, and color arithmetic becomes subtraction.

Figure S2: A training example (left) and a test example (right) in the productivity split. Note that in the training example, the constant rule is applied to the number, type, and size, while the progression rule is applied on color. In the testing example, the arithmetic rule is applied on all attributes.

Table 1: Model performance on different aspects of systematic generalization. Performance is measured by accuracy on the test sets. Upper: results on datasets generated by Zhang et al. (2019a). Lower: results on datasets generated by Hu et al. (2020).

Table 2: Perception accuracy of the proposed ALANS 2 learner, measured by whether the module correctly predicts an attribute's value. Left: results on datasets generated by Zhang et al. (2019a). Right: results on datasets generated by Hu et al. (2020).

Table 3: Reasoning accuracy of the proposed ALANS 2 learner, measured by whether the module correctly predicts the type of a relation on an attribute. Left: results on datasets generated by Zhang et al. (2019a). Right: results on datasets generated by Hu et al. (2020).

Table S1: Architecture of the object CNN. The object CNN consists of four branches, for objectiveness, type, size, and color. The parameters for Convolution denote the output channel size, kernel size, and stride, respectively. A BatchNorm layer is parameterized by the number of channels, whereas a MaxPool layer by its stride. An output size specifies a Linear layer's parameter. m equals 2, 5, 6, and 10 for objectiveness, type, size, and color, respectively. For numerical stability, we use LogSoftMax to turn a probability simplex into its log space.

C.2 OTHER HYPERPARAMETERS

For the inner regularized linear regression, we set different regularization coefficients for different attributes but, for the same attribute, keep them the same across all three types of operators: for position, $\lambda = 10^{-4}$; for number, type, and size, $\lambda = 10^{-6}$; for color, $\lambda = 5 \times 10^{-7}$. All of the regularization terms in Eq. (9) in the main text are set to 1, and $\{M^a_0\}$ and $\{M^a\}$ are initialized as $2 \times 2$ square matrices.

For training, we first train the parameters regarding objectiveness for 10 epochs, including the objectiveness branch and the representation matrices on position and number. We then perform 2 rounds of cyclic training on the parameters regarding type, size, and color, each of which receives 10 epochs of updates in a round. Finally, we fine-tune all parameters for another 10 epochs, totaling 80 training epochs. The entire system is optimized using Adam (Kingma & Ba, 2014) with a learning rate of $9.5 \times 10^{-5}$.
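The staged schedule can be written out explicitly. The sketch below (the parameter-group names are our own shorthand) reproduces the 80-epoch count:

```python
# Build the 80-epoch schedule described above: 10 epochs on objectiveness
# parameters, 2 cyclic rounds of 10 epochs over each of type/size/color,
# then 10 epochs of joint fine-tuning. Group names are illustrative.
schedule = []
schedule += ["objectiveness"] * 10
for _ in range(2):                       # 2 cyclic rounds
    for group in ("type", "size", "color"):
        schedule += [group] * 10
schedule += ["all"] * 10                 # joint fine-tuning

print(len(schedule))  # 80 epochs in total
```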

D MARGINALIZATION FOR OTHER ATTRIBUTES

For the attribute of position, we denote its value as $R_o$, a binary vector of length $N$, with each entry corresponding to one of the $N$ windows. The distribution of $R_o$ is then obtained by multiplying the per-window objectiveness distributions, where $P(r^o_j)$ denotes the $j$th region's estimated objectiveness distribution returned by a CNN as in the main text. For the attribute of type, the probability of the panel attribute of type being $k$ is evaluated by marginalizing over the regions' estimated type distributions, where $P(r^t_j)$ denotes the $j$th region's estimated type distribution returned by a CNN. The computation for size and color is exactly the same as for type, except that we use the regions' estimated size and color distributions returned by a CNN.
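Under the independence assumption, the position distribution factorizes over windows, and the resulting number distribution is a Poisson-binomial. The sketch below is our own illustration with made-up per-window probabilities; the paper's exact marginalization may differ in detail.

```python
import numpy as np

# Per-window probabilities that a window contains an object (illustrative).
p_obj = np.array([0.9, 0.2, 0.7, 0.05])
N = len(p_obj)

def position_prob(r, p=p_obj):
    """Probability of an occupancy pattern r (binary vector of length N),
    assuming the windows are independent."""
    r = np.asarray(r)
    return np.prod(np.where(r == 1, p, 1 - p))

def number_distribution(p=p_obj):
    """Distribution over the number of objects: a Poisson-binomial,
    computed by convolving the per-window Bernoulli distributions."""
    dist = np.array([1.0])
    for q in p:
        dist = np.convolve(dist, [1 - q, q])
    return dist  # dist[k] = P(number of objects == k)

num = number_distribution()
print(abs(num.sum() - 1.0) < 1e-9)  # True: a valid distribution over 0..N
```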

E RELATED WORK ON NEURAL THEOREM PROVING

Combining neural architectures with symbolic reasoning has a long history in the field of theorem proving (Garcez et al., 2012), with early works dating back to propositional rules (Shavlik & Towell, 1991; Towell & Shavlik, 1994; Garcez & Zaverucha, 1999). Later works extend propositional rules to first-order inference (Shastri, 1992; Ding, 1995; França et al., 2014; Sourek et al., 2015; Cohen, 2016). More recent works include the Logic Tensor Networks (Serafini & Garcez, 2016) and the NTP model (Rocktäschel & Riedel, 2017). The former grounds first-order logic and supports function terms, while the latter is constructed from Prolog's backward chaining, is related to Komendantskaya (2011) and Hölldobler (1990), and supports function-free terms. DeepProbLog (Manhaeve et al., 2018) further improves on NTP by focusing on tight interactions between a neural component and sub-symbolic representation and on parameter learning for both the neural and the logic components. Evans & Grefenstette (2018) introduce a differentiable rule induction process, though without integrating the neural and symbolic components. Our work is related to this stream of work on neural theorem proving; however, we formulate the relation induction process as continuous optimization rather than logical induction.

F MORE ON NEURAL VISUAL PERCEPTION

• Why not train a CNN to predict the position and number of objects? The CNN is trained to predict the type, size, color, and object existence in a window. The object existence in windows is marginalized to obtain the Number and Position distributions. This is a lightweight method for object detection. Nevertheless, it is also possible to use a Fast R-CNN-like method to predict object positions (which imply number) directly. However, in this way, the framework loses its probabilistic interpretation (the object proposal branch is currently still deterministic), and we cannot perform end-to-end learning.

• How does the CNN predict the presence of an object, its type, size, and color, given that it is not trained to do that? For each window, the CNN outputs 4 softmaxed vectors, corresponding to the probability distributions of object existence, object type, object size, and object color. The spaces for these attributes are pre-defined. The CNN's weights are then jointly trained within the framework. This design follows recent neuro-symbolic methods (Mao et al., 2019; Han et al., 2019) that also rely on implicitly trained representations. In short, we assign semantics to the implicitly trained representations (probability distributions for attributes), perform marginalization and reasoning as if they were ground-truth attribute distributions, and jointly train using only the problem's target label.
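The per-window output can be sketched as four independent log-softmax heads over the pre-defined attribute spaces. In the sketch below, the space sizes 2, 5, 6, and 10 match the m values in Table S1, while the random projections merely stand in for the trained CNN branches (everything else is illustrative):

```python
import numpy as np

SPACES = {"objectiveness": 2, "type": 5, "size": 6, "color": 10}

def log_softmax(z):
    z = z - z.max()                      # max-shift for numerical stability
    return z - np.log(np.exp(z).sum())

def window_head(features, rng=np.random.default_rng(2)):
    """Map one window's feature vector to four log-probability vectors.
    Random projections stand in for the trained CNN branches."""
    out = {}
    for name, m in SPACES.items():
        W = rng.standard_normal((m, features.shape[0]))
        out[name] = log_softmax(W @ features)
    return out

logp = window_head(np.ones(8))
print(sorted(len(v) for v in logp.values()))  # [2, 5, 6, 10]
```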

