LEARNING TASK-GENERAL REPRESENTATIONS WITH GENERATIVE NEURO-SYMBOLIC MODELING

Abstract

People can learn rich, general-purpose conceptual representations from only raw perceptual inputs. Current machine learning approaches fall well short of these human standards, although different modeling traditions often have complementary strengths. Symbolic models can capture the compositional and causal knowledge that enables flexible generalization, but they struggle to learn from raw inputs, relying on strong abstractions and simplifying assumptions. Neural network models can learn directly from raw data, but they struggle to capture compositional and causal structure and typically must retrain to tackle new tasks. We bring together these two traditions to learn generative models of concepts that capture rich compositional and causal structure, while learning from raw data. We develop a generative neuro-symbolic (GNS) model of handwritten character concepts that uses the control flow of a probabilistic program, coupled with symbolic stroke primitives and a symbolic image renderer, to represent the causal and compositional processes by which characters are formed. The distributions of parts (strokes), and correlations between parts, are modeled with neural network subroutines, allowing the model to learn directly from raw data and express nonparametric statistical relationships. We apply our model to the Omniglot challenge of human-level concept learning, using a background set of alphabets to learn an expressive prior distribution over character drawings. In a subsequent evaluation, our GNS model uses probabilistic inference to learn rich conceptual representations from a single training image that generalize to 4 unique tasks, succeeding where previous work has fallen short.

1. INTRODUCTION

Human conceptual knowledge supports many capabilities spanning perception, production and reasoning [37]. A signature of this knowledge is its productivity and generality: the internal models and representations that people develop can be applied flexibly to new tasks with little or no training experience [30]. Another distinctive characteristic of human conceptual knowledge is the way that it interacts with raw signals: people learn new concepts directly from raw, high-dimensional sensory data, and they identify instances of known concepts embedded in similarly complex stimuli. A central challenge is developing machines with these human-like conceptual capabilities. Engineering efforts have embraced two distinct paradigms: symbolic models for capturing structured knowledge, and neural network models for capturing nonparametric statistical relationships. Symbolic models are well-suited for representing the causal and compositional processes behind perceptual observations, providing explanations akin to people's intuitive theories [38]. Quintessential examples include accounts of concept learning as program induction [13, 46, 29, 15, 4, 28]. Symbolic programs provide a language for expressing causal and compositional structure, while probabilistic modeling offers a means of learning programs and expressing additional conceptual knowledge through priors. The Bayesian Program Learning (BPL) framework [29], for example, provides a dictionary of simple sub-part primitives for generating handwritten character concepts, and symbolic relations that specify how to combine sub-parts into parts (strokes) and parts into whole character concepts. These abstractions support inductive reasoning and flexible generalization to a range of different tasks, utilizing a single conceptual representation [29]. Symbolic models offer many useful features, but they come with important limitations.
Foremost, symbolic probabilistic models make simplifying and rigid parametric assumptions, and when those assumptions are wrong, as is common in complex, high-dimensional data, they create bias [11]. The BPL character model, for example, assumes that parts are largely independent a priori, an assumption that is not reflective of real human-drawn characters. As a consequence, characters generated from the raw BPL prior lack the complexity of real characters (Fig. 1, left), even though the posterior samples can appear much more structured. Another limitation of symbolic probabilistic models is that the construction of structured hypothesis spaces requires significant domain knowledge [2]. Humans, meanwhile, build rich internal models directly from raw data, forming hypotheses about the conceptual features and the generative syntax of a domain. As one potential resolution, previous work has demonstrated that the selection of structured hypotheses can itself be attributed to learning in a Bayesian framework [47, 13, 14, 41, 24, 40]. Although more flexible than a priori structural decisions, models of this kind still make many assumptions, and they have not yet tackled the types of raw, high-dimensional stimuli that are distinctive of the neural network approach. The second paradigm, neural network modeling, prioritizes powerful nonparametric statistical learning over structured representations. This modeling tradition emphasizes emergence, the idea that conceptual knowledge arises from interactions of distributed sub-symbolic processes [36, 32]. Neural networks are adept at learning from raw data and capturing complex patterns. However, they can struggle to learn the compositional and causal structure in how concepts are formed [30]; even when this structure is salient in the data, they may have no obvious means of incorporating it. These limitations have been linked to shortcomings in systematic generalization [35, 27] and creative abilities [31].
An illustrative example is the Omniglot challenge: in 4 years of active research, neural network models do not yet explain how people quickly grasp new concepts and use them in a variety of ways, even with relatively simple handwritten characters [31]. Surveying over 10 neural models applied to Omniglot, Lake et al. [31] found that only two attempted both classification and generation tasks, and they were each outperformed by the fully-symbolic, probabilistic BPL. Moreover, neural generative models tended to produce characters with anomalous characteristics, highlighting their shortcomings in modeling causal and compositional structure (see Fig. A13 and [31, Fig. 2a]). In this paper, we introduce a new approach that leverages the strengths of both the symbolic and neural network paradigms by representing concepts as probabilistic programs with neural network subroutines. We describe an instance of this approach developed for the Omniglot challenge [29] of task-general representation learning and discuss how we see our Omniglot model fitting into a broader class of Generative Neuro-Symbolic (GNS) models that seek to capture the data-generation process. As with traditional probabilistic programs, the control flow of a GNS program is an explicit representation of the causal generative process that produces new concepts and new exemplars. Moreover, explicit re-use of parts through repeated calls to procedures such as GeneratePart (Fig. 2) ensures a representation that is compositional, providing an appropriate inductive bias for compositional generalization. Unlike fully-symbolic probabilistic programs, however, the distributions of parts and correlations between parts in GNS are modeled with neural networks. This architectural choice allows the model to learn directly from raw data, capturing nonparametric statistics while requiring only minimal prior knowledge.
We develop a GNS model for the Omniglot challenge of learning flexible, task-general representations of handwritten characters. We report results on 4 Omniglot challenge tasks with a single model: 1) one-shot classification, 2) parsing/segmentation, 3) generating new exemplars, and 4) generating new concepts (without constraints); the 5th and final task of generating new concepts (from type) is left for future work. We also provide log-likelihood evaluations of the generative model. Notably, our goal is not to chase state-of-the-art performance on one task across many datasets (e.g., classification). Instead we build a model that learns deep, task-general knowledge within a single domain and evaluate it on a range of different tasks. This "deep expertise" is arguably just as important as "broad expertise" in characterizing human-level concept learning [37, 31]; machines that seek human-like abilities will need both. Our work here is one proposal for how neurally-grounded approaches can move beyond pattern recognition toward the more flexible model-building abilities needed for deep expertise [30]. In Appendix E, we discuss how to extend GNS to another conceptual domain, providing a proposal for a GNS model of 3D object concepts.

2. RELATED WORK

The Omniglot dataset and challenge have been widely adopted in machine learning, with models such as Matching Nets [49], MAML [7], and ARC [44] applied to just one-shot classification, and others such as DRAW [19], SPIRAL [8], and VHE [21] applied to one or more generative tasks. In their "3-year progress report," Lake et al. [31] reviewed the current progress on Omniglot, finding that although there was considerable progress in one-shot classification, there had been little emphasis placed on developing task-general models to match the flexibility of human learners (Table 1). Moreover, neurally-grounded models that attempt more creative generation tasks were shown to produce characters that either closely mimicked the training examples or that exhibited anomalous variations, making them easy for humans to identify (see Fig. A13 and [31, Fig. 2a]). Our goal is distinct in that we aim to learn a single neuro-symbolic generative model that can perform a variety of unique tasks, and that generates novel yet structured new characters. Neuro-symbolic modeling has become an active area of research, with applications to learning input-output programs [42, 17, 3, 39, 48], question answering [50, 34] and image description [26, 4]. GNS modeling distinguishes itself from prior work through its focus on hybrid generative modeling, combining both structured program execution and neural networks directly in the probabilistic generative process. Neuro-symbolic VQA models [50, 34] are neither generative nor task-general; they are trained discriminatively to answer questions. Other neuro-symbolic systems use neural networks to help perform inference in a fully-symbolic generative model [26, 4], or to parameterize a prior over fully-symbolic hypotheses [22]. In order to capture the dual structural and statistical characteristics of human conceptual representations, we find it important to include neural nets directly in the forward generative model.
As applied to Omniglot, our model bears some resemblance to SPIRAL [8]; however, SPIRAL does not provide a density function, and it has no hierarchical structure, limiting its applications to image reconstruction and unconditional generation. Another class of models on the neuro-symbolic spectrum aims to learn "object representations" with neural networks [5, 18, 25], which add minimal object-like symbols to support systematic reasoning and generalization. Although these models have demonstrated promising results in applications such as scene segmentation and unconditional generation, they have not yet demonstrated the type of rich inductive capabilities that we are after: namely, the ability to learn "deep" conceptual representations from just one or a few examples that support a variety of discriminative and generative tasks. Other works (e.g. [16, 20]) have used autoregressive models like ours with similar stroke primitives to model the causal generative processes of handwriting. We develop a novel architecture for generating handwriting, which represents explicit compositional structure by modeling parts and relations with separate modules and applying intermediate symbolic rendering. Most importantly, these prior models have not made a connection to the image; therefore while they can generate handwriting as symbolic coordinates, they cannot explain how people use their causal knowledge to learn new characters from visual presentations, how they infer the strokes of a character seen on paper, or how they generate a new example of an observed character. By combining a powerful autoregressive model of handwriting with a symbolic image renderer and algorithms for probabilistic inference, we seek to replicate a spectrum of unique human concept learning abilities.

3. GENERATIVE MODEL

Our GNS model leverages the type-token hierarchy of BPL [29], which offers a useful scaffolding for conceptual models (Fig. 2, left). The type-level model P(ψ) defines a prior distribution over character concepts, capturing overarching principles and regularities that tie together characters from different alphabets and providing a procedure to generate new character concepts unconditionally in latent stroke format (Fig. 2, right). A token-level model P(θ|ψ) captures the within-class variability that arises from motor noise and drawing styles, and an image distribution P(I|θ) provides an explicit model of how causal stroke actions translate to image pixels. All parameters of our model are learned from the Omniglot background set of drawings (Appendix A). The full joint distribution over type ψ, token θ^(m) and image I^(m) factors as P(ψ, θ^(m), I^(m)) = P(ψ) P(θ^(m) | ψ) P(I^(m) | θ^(m)). Although sharing a common hierarchy, the implementation details of each level in our GNS model differ from BPL in critical ways. The GNS type prior P(ψ) is a highly expressive generative model that uses an external image canvas, coupled with a symbolic rendering engine and an attentive recurrent neural network, to condition future parts on the previous parts and model sophisticated causal and correlational structure. This structure is essential to generating new character concepts in realistic, human-like ways (Sec. 5). Moreover, whereas the BPL model is provided symbolic relations for strokes such as "attach start" and "attach along," GNS learns implicit relational structure from the data, identifying salient patterns in the co-occurrences of parts and locations. Importantly, the GNS generative model is designed to be differentiable at all levels, yielding log-likelihood gradients that enable powerful new inference algorithms (Sec. 4) and estimates of marginal image likelihood (Sec. 5). Type prior.
The type prior P(ψ) is captured by a neuro-symbolic generative model of character drawings. The model represents a character as a sequence of strokes (parts), with each stroke i decomposed into a starting location y_i ∈ R^2 and a variable-length trajectory x_i ∈ R^{d_i × 2}. Rather than use raw pen trajectories as our stroke format, we use a minimal spline representation of strokes, obtained from raw trajectories by fitting cubic b-splines with a residual threshold. The starting location y_i therefore conveys the first spline control point, and trajectory x_i = {Δ_i1, ..., Δ_id_i} conveys the offsets between subsequent points of a (d_i + 1)-length spline. These offsets are transformed into a sequence of relative points x_i = {x_i1, ..., x_i(d_i+1)}, with x_i1 = 0, specifying locations relative to y_i. The model samples a type one stroke at a time, using an image canvas C as memory to convey the sample state. At each step, a starting location for the next stroke is first sampled from the location model, followed by a trajectory from the stroke model. The stroke is then rendered to the canvas C, and a termination model decides whether to terminate or continue the sample. Each of the three model components is expressed by a neural network, using an LSTM as the stroke model to generate trajectories as in [16]. The details of these neural modules are provided in Appendix A. The type model P(ψ) specifies an auto-regressive density function that can evaluate exact likelihoods of character drawings, and its hyperparameters (the three neural networks) are learned from the Omniglot background set of 30 alphabets using a maximum likelihood objective. A full character type ψ includes the random variables ψ = {κ, y_1:κ, x_1:κ}, where κ ∈ Z+ is the number of strokes. The density function P(ψ) is also fully differentiable w.r.t. the continuous random variables in ψ. Token model. A character token θ^(m) = {y^(m)_1:κ, x^(m)_1:κ, A^(m)} represents a concrete rendition of type ψ, with conditional distribution P(θ^(m) | ψ) = P(A^(m)) ∏_{i=1}^{κ} P(y^(m)_i | y_i) P(x^(m)_i | x_i).
Here, P(y^(m)_i | y_i) represents a simple noise distribution for the location of each stroke, and P(x^(m)_i | x_i) for the stroke trajectory. The first two dimensions of affine warp A^(m) control a global re-scaling of the token drawing, and the second two a global translation of its center of mass. The distributions and pseudocode of our token model are given in Appendix A.
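The type-sampling loop and token-level noise described above can be sketched in a few lines. This is a minimal illustration under loose assumptions, not the actual implementation: the three neural modules are replaced by trivial random stand-ins (the names `location_model`, `stroke_model`, and `terminate_model` are our own), and the "renderer" simply records strokes on the canvas rather than drawing pixels.

```python
import random

def location_model(canvas):
    # Stand-in for the neural location model: sample a start y_i in [0, 1]^2.
    return (random.random(), random.random())

def stroke_model(start, canvas):
    # Stand-in for the LSTM stroke model: sample d_i spline-offset pairs.
    d = random.randint(2, 5)
    return [(random.gauss(0, 0.1), random.gauss(0, 0.1)) for _ in range(d)]

def terminate_model(canvas):
    # Stand-in for the neural termination model: stop with probability 0.3.
    return random.random() < 0.3

def render(canvas, start, traj):
    # Stand-in renderer: record the stroke on the canvas "memory".
    canvas.append((start, traj))

def sample_type(max_strokes=10):
    """Sample a character type psi = {kappa, y_1:k, x_1:k} one stroke at a time."""
    canvas = []                        # image canvas used as sample-state memory
    locations, strokes = [], []
    for _ in range(max_strokes):
        y = location_model(canvas)     # 1. where does the next stroke start?
        x = stroke_model(y, canvas)    # 2. what trajectory does it follow?
        render(canvas, y, x)           # 3. render the stroke to the canvas
        locations.append(y)
        strokes.append(x)
        if terminate_model(canvas):    # 4. terminate or continue?
            break
    return {"kappa": len(strokes), "y": locations, "x": strokes}

def sample_token(psi, loc_sd=0.02, traj_sd=0.02):
    """Sample a token from P(theta|psi): add motor noise to locations/trajectories."""
    noisy_y = [(y0 + random.gauss(0, loc_sd), y1 + random.gauss(0, loc_sd))
               for (y0, y1) in psi["y"]]
    noisy_x = [[(dx + random.gauss(0, traj_sd), dy + random.gauss(0, traj_sd))
                for (dx, dy) in traj] for traj in psi["x"]]
    return {"y": noisy_y, "x": noisy_x}
```

In the real model each stand-in is a trained network conditioned on the rendered canvas, and the token model additionally samples the affine warp A^(m); the control flow, however, is exactly this loop.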

Image model. The image model P(I^(m) | θ^(m)) is based on [29] and is composed of two pieces. First, a differentiable symbolic engine f receives the token θ^(m) and produces an image pixel probability map p_img = f(θ^(m), σ, ε) by evaluating each spline and rendering the stroke trajectories. Here, σ ∈ R+ is a parameter controlling the rendering blur around stroke coordinates, and ε ∈ (0, 1) a parameter controlling pixel noise, each sampled uniformly at random. The result then parameterizes an image distribution P(I^(m) | θ^(m)) = Bernoulli(p_img), which is differentiable w.r.t. θ^(m), σ, and ε.
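Given the rendered probability map, the image likelihood factorizes over independent Bernoulli pixels. A hedged sketch of this term: we assume p_img arrives as a nested list of per-pixel ink probabilities (the output of the renderer f); the function name and the clamping constant are our own, added for numerical stability.

```python
import math

def bernoulli_image_loglik(p_img, image, clamp=1e-6):
    """log P(I | theta): independent Bernoulli pixels parameterized by the
    rendered probability map p_img (same shape as the binary image I)."""
    ll = 0.0
    for p_row, i_row in zip(p_img, image):
        for p, i in zip(p_row, i_row):
            p = min(max(p, clamp), 1.0 - clamp)   # keep log() finite
            ll += i * math.log(p) + (1 - i) * math.log(1.0 - p)
    return ll
```

Because every term is a smooth function of p_img, and p_img is a differentiable function of θ^(m), σ, and ε, this likelihood supplies the gradients used by the inference algorithms of Sec. 4.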

4. PROBABILISTIC INFERENCE

Given an image I of a novel concept, our GNS model aims to infer the latent causal, compositional process for generating new exemplars. We follow the high-level strategy of BPL for constructing a discrete approximation Q(ψ, θ | I) to the desired posterior distribution [29], P(ψ, θ | I) ≈ Q(ψ, θ | I) = Σ_{k=1}^{K} π_k δ(θ − θ_k) δ(ψ − ψ_k). A heuristic search algorithm is used to find K good parses, {ψ, θ}_1:K, that explain the underlying image with high probability. These parses are weighted by their relative posterior probability, normalized such that Σ_k π_k = 1. To classify a novel test image I^(T), we compute an approximation of the Bayesian score log P(I^(T) | I^(c)) for every training example I^(c), using our posterior parses {ψ, θ^(c)}_1:K and corresponding weights π_1:K from I^(c) (Eq. 3). The approximation is formulated as log P(I^(T) | I^(c)) ≈ log ∫ P(I^(T) | θ^(T)) P(θ^(T) | ψ) Q(ψ, θ^(c) | I^(c)) dψ dθ^(c) dθ^(T) ≈ log Σ_{k=1}^{K} π_k max_{θ^(T)} P(I^(T) | θ^(T)) P(θ^(T) | ψ_k), where the maximum over θ^(T) is determined by refitting the token-level parameters θ^(c) to image I^(T) with gradient descent. Following Lake et al. [29], we use a two-way version of the Bayesian score that also considers parses of I^(T) refit to I^(c). The classification rule is therefore c* = arg max_c log P₂(I^(T) | I^(c)) = arg max_c log [P(I^(c) | I^(T)) / P(I^(c))] P(I^(T) | I^(c)), where P(I^(c)) ≈ Σ_k π̃_k is approximated from the unnormalized weights of the I^(c) parses.
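The weighted-parse approximation of the Bayesian score can be illustrated with a short numerical sketch. We assume the per-parse refit terms log[max over θ^(T) of P(I^(T) | θ^(T)) P(θ^(T) | ψ_k)] have already been computed (in the model this requires gradient-based refitting); the helper names below are hypothetical, and the log-sum-exp form is the standard numerically stable way to evaluate the mixture.

```python
import math

def logsumexp(xs):
    # Numerically stable log(sum(exp(x))) over a list of log-values.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def bayesian_score(log_pi, refit_loglik):
    """log P(I_T | I_c) ~= log sum_k pi_k * max_theta P(I_T|theta)P(theta|psi_k).
    log_pi: normalized log weights of the K parses of I_c.
    refit_loglik: per-parse refit log-likelihoods on I_T (assumed given)."""
    return logsumexp([lp + rl for lp, rl in zip(log_pi, refit_loglik)])

def classify(scores_per_class):
    """Classification rule: pick the class with the highest (two-way) score."""
    return max(range(len(scores_per_class)), key=lambda c: scores_per_class[c])
```

For the full two-way rule, the same `bayesian_score` is evaluated in the reverse direction (parses of I^(T) refit to I^(c)) and the two directions are combined as in the classification equation above.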

5. EXPERIMENTS

GNS was evaluated on four concept learning tasks from the Omniglot challenge [31]: one-shot classification, parsing, generating new exemplars, and generating new concepts. All evaluations use novel characters from completely held-out alphabets in the Omniglot evaluation set. As mentioned earlier, our goal is to provide a single model that captures deep knowledge of a domain and performs strongly in a wide range of tasks, rather than besting all models on every task. Our experiments include a mixture of quantitative and qualitative evaluations, depending on the nature of the task. One-shot classification. GNS was compared with alternative models on the one-shot classification task from Lake et al. [29]. The task involves a series of 20-way within-alphabet classification episodes, with each episode proceeding as follows. First, the machine is given one training example from each of 20 novel characters. Next, the machine must classify 20 novel test images, each corresponding to one of the training classes. With 20 episodes total, the task yields 400 trials. Importantly, all character classes in an episode come from the same alphabet as originally proposed [29], requiring finer discriminations than commonly used between-alphabet tests [31].

Table 2: One-shot classification results.

    Model            Error
    GNS              5.7%
    BPL [29]         3.3%
    RCN [12]         7.3%
    VHE [21]         18.7%
    Proto. Net [45]  13.7%
    ARC [44]         1.5%*

*used 4x training classes

As illustrated in Fig. 3a, GNS classifies a test image by choosing the training class with the highest Bayesian score (Eq. 5). A summary of the results is shown in Table 2. GNS was compared with other machine learning models that have been evaluated on the within-alphabet classification task [31]. GNS achieved an overall test error rate of 5.7% across all 20 episodes (N=400). This result is very close to the original BPL model, which achieved 3.3% error with significantly more hand-design. The symbolic relations in BPL's token model provide rigid constraints that are key to its strong classification performance [29]. GNS achieves strong classification performance while emphasizing the nonparametric statistical knowledge needed for creative generation in subsequent tasks. Beyond BPL, our GNS model outperformed all other models that received the same background training. The ARC model [44] achieved an impressive 1.5% error, although it was trained with four-fold class augmentation and many other augmentations, and it can only perform this one task. In Appendix Fig. A10, we show a larger set of classification fits from GNS, including examples of misclassified trials. Parsing. In the Omniglot parsing task, machines must segment a novel character into an ordered set of strokes. These predicted parses can be compared with human ground-truth parses for the same images. The approximate posterior of GNS yields a series of promising parses for a new character image, and to complete the parsing task, we identify the maximum a posteriori parse k* = arg max_k π_k, reporting the corresponding stroke configuration. Fig. 3b shows a visualization of the GNS predicted parses for 27 different raw images drawn from 3 unique character classes, plotted alongside ground-truth human parses (how the images were actually drawn) along with predicted parses from the BPL model. Compared to BPL, GNS parses possess a few unique desirable qualities.
The first character class has an obvious segmentation to the human eye, evidenced by the consistency of human parses in all examples, and the GNS model replicates this consistency across all 9 predicted parses. In contrast, BPL predicts seemingly-unlikely parses for 2 of the examples shown. The second character is more complex, and it was drawn in different styles by different human subjects. The GNS model, which is trained on data from subjects with different styles, captures the uncertainty in this character by predicting a variety of unique parses. BPL, on the other hand, produces a single, uniform segmentation across all 9 examples. In Appendix Fig. A11, we provide a larger set of parses from the GNS model for a diverse range of Omniglot characters. Generating new exemplars. Given just one training image of a novel character concept, GNS produces new exemplars of the concept by sampling from the approximate conditional P(I^(2), θ^(2) | I^(1)) of Eq. 8. In Fig. 4a we show new exemplars produced by GNS for a handful of target images, plotted next to human productions (more examples in Appendix Fig. A12). In the majority of cases, samples from the model demonstrate that it has successfully captured the causal structure and invariance of the target class. In contrast, deep generative models applied to the same task miss meaningful compositional and causal structure, producing new examples that are easily discriminated from human productions [43, 21] (see Appendix Fig. A13). In some cases, such as the third column of Fig. 4a, samples from GNS exhibit sloppy stroke junctions and connections. Compared to BPL, which uses engineered symbolic relations to enforce rigid constraints at stroke junctions, GNS misses some of these structural elements. Nevertheless, new examples from GNS appear strong enough to pass for human in many cases, which we would like to test in future work with visual Turing tests.
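Generating a new exemplar from one image reduces to ancestral sampling from the posterior mixture: choose an inferred parse ψ_k with probability π_k, then run the token and image models forward. A minimal sketch with stand-in callables (all names here are our own; `sample_token` and `render_image` stand for the token and image models of Sec. 3):

```python
import random

def sample_new_exemplar(parses, weights, sample_token, render_image):
    """Sample from the approximate conditional P(I_2, theta_2 | I_1):
    (1) pick an inferred parse psi_k of I_1 with probability pi_k,
    (2) sample a fresh token theta_2 ~ P(theta | psi_k),
    (3) render theta_2 to an image via the image model."""
    k = random.choices(range(len(parses)), weights=weights)[0]
    theta_new = sample_token(parses[k])
    return render_image(theta_new)
```

This is the same posterior approximation used for classification and parsing, re-used generatively, which is what makes the learned representation task-general.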
Published as a conference paper at ICLR 2021 Human GNS Target bottom-up methods (31) to propose a range of candidate parses. The most promising candidates are refined by using continuous optimization Fig. 5 ). All behavioral experiments were run through Amazon's Mechanical Turk, and the experimental procedures are detailed in section S5. one-shot learning task achieved 8.0% errors (33) , still about twice as high as humans or our model. BPL's advantage points to the benefits of modeling the underlying causal process in learning concepts, a strategy different from the particular deep learning approaches examined here. BPL's other key ingredients also make positive contributions, as shown by higher error rates for BPL lesions without learning to learn (token-level only) or compositionality (11.0% errors and 14.0%, respectively). Learning to learn was studied separately at the type and token level by disrupting the learned hyperparameters of the generative model. Compositionality was evaluated by comparing BPL to a matched model that allowed just one spline-based stroke, resembling earlier analysisby-synthesis models for handwritten characters that were similarly limited (34, 35) . The human capacity for one-shot learning is more than just classification. It can include a suite of abilities, such as generating new examples of a concept. We compared the creative outputs produced by humans and machines through "visual Turing tests," where naive human judges tried to identify the machine, given paired examples of human and machine behavior. In our most basic task, judges compared the drawings from nine humans asked to produce a new instance of a concept given one example with nine new examples drawn by BPL (Fig. 5 ). We evaluated each model based on the accuracy of the judges, which we call their identification (ID) level: Ideal model performance is 50% ID level, indicating that they cannot distinguish the model's behavior from humans; worst-case performance is 100%. 
Each judge (N = 147) completed 49 trials with blocked feedback, and judges were analyzed individually and in aggregate. The results are shown in Fig. 6B (new exemplars). Judges had only a 52% ID level on average for discriminating human versus BPL behavior. As a group, this performance was barely better than chance [t(47) = 2.03, P = 0.048], and only 3 of 48 judges had an ID level reliably above chance. Three lesioned models were evaluated by different groups of judges in separate Human or Machine? on November 21, 2020 http://science.sciencemag.org/ oaded from jectories S . Second, the trajectory's precise start location L (m) is sampled from the schematic provided by its relation Ri to previous strokes. Third, global transformations are sampled, including an affine warp A (m) and adaptive noise parameters that ease probabilistic inference (30) . Last, a binary image I (m) is created by a stochastic rendering function, lining the stroke trajectories with grayscale ink and interpreting the pixel values as independent Bernoulli probabilities. Posterior inference requires searching the large combinatorial space of programs that could have generated a raw image I (m) . Our strategy uses fast bottom-up methods (31) to propose a range of candidate parses. The most promising candidates are refined by using continuous optimization when at least one set of parts and relations can successfully explain both the training and the test images, without violating the soft constraints of the learned within-class variability model. Figure 4B compares the model's best-scoring parses with the ground-truth human parses for several characters.

Results

People, BPL, and alternative models were compared side by side on five concept learning tasks that examine different forms of generalization from just one or a few examples (example task Fig. 5 ). All behavioral experiments were run through Amazon's Mechanical Turk, and the experimental procedures are detailed in section S5. acters produced by a typical drawer of that alpha-bet. Performance is shown in Fig. 6A , where chance is 95% errors. As a baseline, the modified Hausdorff distance (32) was computed between centered images, producing 38.8% errors. People were skilled one-shot learners, achieving an average error rate of 4.5% (N = 40). BPL showed a similar error rate of 3.3%, achieving better performance than a deep convolutional network (convnet; 13.5% errors) and the HD model (34.8%)-each adapted from deep learning methods that have performed well on a range of computer vision tasks. A deep Siamese convolutional network optimized for this one-shot learning task achieved 8.0% errors (33) , still about twice as high as humans or our model. BPL's advantage points to the benefits of modeling the underlying causal process in learning concepts, a strategy different from the particular deep learning approaches examined here. BPL's other key ingredients also make positive contributions, as shown by higher error rates for BPL lesions without learning to learn (token-level only) or compositionality (11.0% errors and 14.0%, respectively). Learning to learn was studied separately at the type and token level by disrupting the learned hyperparameters of the generative model. Compositionality was evaluated by comparing BPL to a matched model that allowed just one spline-based stroke, resembling earlier analysisby-synthesis models for handwritten characters that were similarly limited (34, 35) . The human capacity for one-shot learning is more than just classification. It can include a suite of abilities, such as generating new examples of a concept. 
We compared the creative outputs produced by humans and machines through "visual Turing tests," where naive human judges tried to identify the machine, given paired examples of human and machine behavior. In our most basic task, judges compared the drawings from nine humans asked to produce a new instance of a concept given one example with nine new examples drawn by BPL (Fig. 5 ). We evaluated each model based on the accuracy of the judges, which we call their identification (ID) level: Ideal model performance is 50% ID level, indicating that they cannot distinguish the model's behavior from humans; worst-case performance is 100%. Each judge (N = 147) completed 49 trials with blocked feedback, and judges were analyzed individually and in aggregate. The results are shown in Fig. 6B (new exemplars). Judges had only a 52% ID level on average for discriminating human versus BPL behavior. As a group, this performance was barely better than chance [t(47) = 2.03, P = 0.048], and only 3 of 48 judges had an ID level reliably above chance. Three lesioned models were evaluated by different groups of judges in separate Human or Machine? bottom-up methods (31) to propose a range of candidate parses. The most promising candidates are refined by using continuous optimization Fig. 5 ). All behavioral experiments were run through Amazon's Mechanical Turk, and the experimental procedures are detailed in section S5. one-shot learning task achieved 8.0% errors (33) , still about twice as high as humans or our model. BPL's advantage points to the benefits of modeling the underlying causal process in learning concepts, a strategy different from the particular deep learning approaches examined here. BPL's other key ingredients also make positive contributions, as shown by higher error rates for BPL lesions without learning to learn (token-level only) or compositionality (11.0% errors and 14.0%, respectively). 
Character tokens θ^(m) are produced by executing the parts and the relations and modeling how ink flows from the pen to the page. First, motor noise is added to the control points and the scale of the subparts to create token-level stroke trajectories S^(m).
Second, the trajectory's precise start location L^(m) is sampled from the schematic provided by its relation R_i to previous strokes. Third, global transformations are sampled, including an affine warp A^(m) and adaptive noise parameters that ease probabilistic inference (30). Last, a binary image I^(m) is created by a stochastic rendering function, lining the stroke trajectories with grayscale ink and interpreting the pixel values as independent Bernoulli probabilities. Posterior inference requires searching the large combinatorial space of programs that could have generated a raw image I^(m). Our strategy uses fast bottom-up methods (31) to propose a range of candidate parses. The most promising candidates are refined by using continuous optimization and local search, forming a discrete approximation to the posterior distribution P(ψ, θ^(m) | I^(m)) (section S3). Figure 4A shows the set of discovered programs for a training image I^(1) and how they are refit to different test images I^(2) to compute a classification score log P(I^(2) | I^(1)) (the log posterior predictive probability), where higher scores indicate that the test image is more likely to belong to the same class. A high score is achieved when at least one set of parts and relations can successfully explain both the training and the test images, without violating the soft constraints of the learned within-class variability model. Figure 4B compares the model's best-scoring parses with the ground-truth human parses for several characters.
Generating new concepts (unconstrained). In addition to generating new exemplars of a particular concept, GNS can generate new character concepts altogether, unconditioned on training images.
Whereas the BPL model uses a complicated procedure for unconditional generation, involving a preliminary inference step and a supplemental nonparametric model, GNS generates new concepts by sampling directly from the type prior P(ψ). Moreover, the resulting GNS productions capture more of the structure found in real characters than either the raw BPL prior (Fig. 1; Fig. 4b) or the supplemental nonparametric BPL prior [29]. In Fig. 4b we show a grid of 36 new character concepts sampled from our generative model at a reduced temperature setting T = 0.5 (Appendix A.2). The model produces characters in multiple distinct styles, with some having more angular, line-based structure and others relying on complex curves. In Appendix Fig. A14, we show a larger set of characters sampled from GNS, plotted in a topologically-organized grid alongside a corresponding grid of "nearest neighbor" training examples. In many cases, samples from the model have a distinct style and are visually dissimilar from their nearest Omniglot neighbor. Marginal image likelihoods. As a final evaluation, we computed likelihoods of held-out character images by marginalizing over the latent type and token variables of GNS to estimate P(I) = ∫∫ P(ψ, θ, I) dψ dθ. We hypothesized that our causal generative model of character concepts would yield better test likelihoods compared to deep generative models trained directly on image pixels. As detailed in Appendix D, under the minimal assumption that our K posterior parses represent sharply peaked modes of the joint density, we can obtain an approximate lower bound on the marginal P(I) by using Laplace's method to estimate the integral around each mode and summing the resulting integrals. In Table 3, we report average log-likelihood (LL) bounds obtained from GNS for a random subset of 1000 evaluation images, compared against test LL bounds from both the SG [43] and the VHE [21] models.
Our GNS model outperforms both alternatives, achieving the best overall log-likelihood per dimension.

6. DISCUSSION

Published as a conference paper at ICLR 2021

We introduced a new generative neuro-symbolic (GNS) model for learning flexible, task-general representations of character concepts. We demonstrated GNS on the Omniglot challenge, showing that it performs a variety of inductive tasks in ways difficult to distinguish from human behavior. Some evaluations were still qualitative, and future work will further quantify these results using visual Turing tests [29]. Whereas many machine learning algorithms emphasize breadth of data domains, isolating just a single task across datasets, we have focused our efforts in this paper on a single domain, emphasizing depth of the representation learned. Human concept learning is distinguished by having both breadth and depth of applications [37, 31], and ultimately, we would like to capture both of these qualities. We see our character model as belonging to a broader class of generative neuro-symbolic (GNS) models for capturing the data generation process. We have designed our model based on general principles of visual concepts, namely, that concepts are composed of reusable parts and locations, and we describe how it generalizes to 3D object concepts in Appendix E. As in the human mind, machine learning practitioners have far more prior knowledge about some domains than others. Handwritten characters are a domain with strong priors [1, 33, 23], implemented directly in the human mind and body. For concepts like these with more explicit causal knowledge, it is beneficial to include priors about how causal generative factors translate into observations, as endowed in our character model through its symbolic rendering engine. For other types of concepts where these processes are less clear, it may be appropriate to use more generic neural networks that generate concepts and parts directly as raw stimuli, using less symbolic machinery and prior knowledge.
We anticipate that GNS can flexibly model concepts in both types of domains, although further experiments are needed to demonstrate this. Our current token model for character concepts is much too simple, and we acknowledge a few important shortcomings. First, as shown in Appendix Fig. A10, there are a number of scenarios in which the parses from a training character cannot adequately refit to a new example of the same character without a token model that allows for changes to discrete variables. By incorporating this allowance in future work, we hope to capture more knowledge in this domain and further improve performance. Furthermore, although our vision for GNS is to represent both concepts and background knowledge with neuro-symbolic components, the current token-level model uses only simple parametric distributions. In future work, we hope to incorporate token-level models that use neural network subroutines, as in the type-level model presented here.

A.1 TRAINING ON CAUSAL DRAWING DATA

To learn the parameters of P(ψ) and P(θ^(m) | ψ), we fit our models to the human drawing data from the Omniglot background set. In this drawing data, a character is represented as a variable-length sequence of strokes, and each stroke is a variable-length sequence of pen locations {z_1, ..., z_T}, with z_t ∈ R^2 (Fig. A6, left). Before training our model on background drawings, we convert each stroke into a minimal spline representation using least-squares optimization (Fig. A6, right), borrowing the B-spline tools from [29]. The number of spline control points depends on the stroke complexity and is determined by a residual threshold. Furthermore, we removed small strokes using a threshold on the trajectory length. These processing steps help suppress noise and emphasize signal in the drawings. Our generative models are trained to produce character drawings, where each drawing is represented as an ordered set of splines (strokes).
The number of strokes, and the number of spline coordinates per stroke, are allowed to vary in the model.
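The preprocessing above can be sketched as follows. This is a hypothetical illustration: it uses plain polynomial least squares as a stand-in for the B-spline tools of [29], and the function names, thresholds, and `max_ctrl` cap are our own; only the idea of growing the representation until a residual threshold is met, and of dropping short strokes, comes from the text.

```python
import numpy as np

def fit_minimal_curve(points, residual_thresh=0.05, max_ctrl=10):
    """Fit each coordinate of a 2D stroke with the fewest-coefficient
    least-squares polynomial whose max residual falls below a threshold.
    (Polynomials stand in for the B-splines used in the paper.)"""
    points = np.asarray(points, dtype=float)
    u = np.linspace(0.0, 1.0, len(points))      # normalized arc parameter
    for n_ctrl in range(2, max_ctrl + 1):       # coefficients per coordinate
        deg = n_ctrl - 1
        coeffs = [np.polyfit(u, points[:, d], deg) for d in range(2)]
        recon = np.stack([np.polyval(c, u) for c in coeffs], axis=1)
        if np.abs(recon - points).max() < residual_thresh:
            return np.stack(coeffs), n_ctrl     # minimal representation found
    return np.stack(coeffs), max_ctrl

def remove_small_strokes(strokes, min_length=5.0):
    """Drop strokes whose total trajectory length is below a threshold."""
    def length(s):
        s = np.asarray(s, dtype=float)
        return np.linalg.norm(np.diff(s, axis=0), axis=1).sum()
    return [s for s in strokes if length(s) >= min_length]
```

A straight-line stroke is captured with two coefficients per coordinate, while curvier strokes require more, mirroring how the residual threshold adapts the number of spline control points to stroke complexity.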

A.2 TYPE PRIOR

The type prior P(ψ) represents a character as a sequence of strokes, with each stroke decomposed into a starting location y_i ∈ R^2, conveying the first spline control point, and a stroke trajectory x_i = {Δ_1, ..., Δ_N}, conveying deltas between spline control points. It generates character types one stroke at a time, using a symbolic rendering procedure called f_render as an intermediate processing step after forming each stroke. An image canvas C is used as a memory state to convey information about previous strokes. At each step i, the next stroke's starting location and trajectory are sampled with procedure GeneratePart. In this procedure, the current image canvas C is first read by the location model (Fig. 2), a convolutional neural network (CNN) that processes the image and returns a probability distribution for the starting location y_i.

Step 1: Propose a range of candidate parses with fast bottom-up methods. The bottom-up algorithm extracts an undirected skeleton graph from the character image and uses random walks on the graph to propose a range of candidate parses. There are typically about 10-100 proposal parses, depending on character complexity (Fig. A8).

Step 2: Select the stroke order and stroke directions for each parse using exhaustive search with the type prior P(ψ). Random search is used for complex parses with large configuration spaces.

Step 3: Score each of the proposal parses using the type prior P(ψ) and select the top-K parses. We use K = 5 following previous work [29].

Step 4: Separate each parse into type and token {ψ, θ} and optimize the continuous type- and token-level parameters with gradient descent to maximize the full joint density P(ψ, θ, I) of Eq. 1.

Step 5: Compute weights π_1:K for each parse by computing π̃_k = P(ψ_k, θ_k, I) and normalizing: π_k = π̃_k / Σ_{k'=1}^K π̃_{k'}.
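The stroke-by-stroke control flow of GenerateType and GeneratePart described at the start of this section can be sketched as below. All names are hypothetical: the location, stroke, and termination models are passed in as placeholder callables, and `render_stroke` is a toy stand-in for the symbolic renderer f_render (the real renderer draws grayscale ink along splines).

```python
import numpy as np

def render_stroke(canvas, start, deltas):
    """Toy stand-in for f_render: trace the trajectory onto the canvas
    by inking the nearest pixel at each step."""
    canvas = canvas.copy()
    pos = np.array(start, dtype=float)
    H, W = canvas.shape
    for d in [np.zeros(2)] + list(deltas):
        pos = pos + d
        r, c = int(round(pos[1])), int(round(pos[0]))
        if 0 <= r < H and 0 <= c < W:
            canvas[r, c] = 1.0
    return canvas

def generate_type(location_model, stroke_model, termination_model,
                  canvas_shape=(28, 28), max_strokes=10, rng=None):
    """Control flow of GenerateType: sample strokes one at a time,
    updating the image canvas that conditions each subsequent sample."""
    rng = np.random.default_rng(rng)
    canvas = np.zeros(canvas_shape)
    strokes = []
    for i in range(max_strokes):
        y = location_model(canvas, rng)        # start location y_i in R^2
        x = stroke_model(canvas, y, rng)       # trajectory: list of deltas
        canvas = render_stroke(canvas, y, x)   # symbolic renderer updates C
        strokes.append((y, x))
        if termination_model(canvas, rng):     # binary stop indicator v_i
            break
    return strokes, canvas
```

The canvas plays the role of the memory state: each neural component conditions only on the rendered image of the strokes sampled so far.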

C INFERENCE FOR GENERATING NEW EXEMPLARS

When generating new exemplars, we are given a single image I^(1) of a novel class and asked to generate new instances I^(2) (overloading the parenthesis notation from classification). To perform this task with GNS, we first sample from our approximate posterior Q(ψ, θ | I^(1)) to obtain a parse {ψ, θ} (see Eq. 3), and then re-sample the token parameters θ^(2) from our token model P(θ^(2) | ψ). Because images are high-dimensional, mass in the approximate posterior often concentrates on the single best parse. To model the diversity seen in different human parses, we apply a temperature parameter to the log of the unnormalized parse weights, log π̃_k ← log(π̃_k)/T, before normalization, selecting T = 8 for our experiments. With the updated weights π_1:K, our sampling distribution is written as P(I^(2), θ^(2) | I^(1)) ≈ Σ_{k=1}^K π_k P(I^(2) | θ^(2)) P(θ^(2) | ψ_k).
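The temperature reweighting and exemplar-sampling step can be sketched as follows. The helper names are our own, and `token_model` is a placeholder for sampling θ^(2) from P(θ^(2) | ψ_k).

```python
import numpy as np

def temper_weights(log_w, T=8.0):
    """Apply temperature to unnormalized log parse weights and
    renormalize: log pi_k <- log pi_k / T, then softmax."""
    log_w = np.asarray(log_w, dtype=float) / T
    log_w -= log_w.max()              # subtract max for numerical stability
    w = np.exp(log_w)
    return w / w.sum()

def sample_new_exemplar(parses, log_w, token_model, T=8.0, rng=None):
    """Pick a parse under the tempered weights, then resample token-level
    parameters from the token model P(theta^(2) | psi_k)."""
    rng = np.random.default_rng(rng)
    w = temper_weights(log_w, T)
    k = rng.choice(len(parses), p=w)
    return token_model(parses[k], rng)
```

With T = 1 the best parse dominates; raising T toward the paper's T = 8 spreads mass over alternative parses, mimicking the diversity of human parses.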

D MARGINAL IMAGE LIKELIHOODS

Let z = ψ ∪ θ be a stand-in for the joint set of type- and token-level random variables in our GNS generative model. The latent z includes both continuous and discrete variables: the number of strokes κ and the number of control points per stroke d_1:κ are discrete, and all remaining variables are continuous. Decomposing z into its discrete variables z_D ∈ Ω_D and continuous variables z_C ∈ Ω_C, the marginal density for an image I is written as

P(I) = Σ_{z_D ∈ Ω_D} ∫ P(I, z_D, z_C) dz_C.

For any subset Ω̃_D ⊂ Ω_D of the discrete domain, the following inequality holds:

P(I) ≥ Σ_{z_D ∈ Ω̃_D} ∫ P(I, z_D, z_C) dz_C. (10)

Our approximate posterior (Eq. 3) gives us K parses that represent promising modes {z_D, z_C}_1:K of the joint density P(I, z_D, z_C) for an image I. By setting Ω̃_D = {z_D}_1:K to be the set of K unique discrete configurations from our parses, we can compute the lower bound of Eq. 10 by evaluating the integral ∫ P(I, z_D, z_C) dz_C at each of these z_D. At each z_D^k ∈ {z_D}_1:K, the log-density function f(z_C) = log P(I, z_D^k, z_C) has a local maximum at z_C^k, the continuous configuration of the corresponding posterior parse. These maxima were identified by our gradient-based continuous optimizer during parse selection (Appendix B). If we assume that these maxima are sharply peaked, then we can use Laplace's method to estimate the integral ∫ P(I, z_D^k, z_C) dz_C at each z_D^k. Laplace's method uses a Taylor expansion to approximate the integral of e^{f(x)} for a twice-differentiable function f around a maximum x_0 as

∫ e^{f(x)} dx ≈ e^{f(x_0)} (2π)^{d/2} |−H_f(x_0)|^{−1/2},

where x ∈ R^d and H_f(x_0) is the Hessian matrix of f evaluated at x_0. Our log-density function f(z_C) is fully differentiable w.r.t. the continuous parameters z_C, therefore we can compute H(z_C) = ∂²f/∂z_C² with ease.
Our approximate lower bound on P(I) is therefore written as the sum of Laplace approximations at our K parses:

P(I) ≥ Σ_{k=1}^K ∫ P(I, z_D^k, z_C) dz_C ≈ Σ_{k=1}^K P(I, z_D^k, z_C^k) (2π)^{d/2} |−H(z_C^k)|^{−1/2}.
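Laplace's method as used above can be illustrated numerically. This sketch is our own: it estimates the Hessian by finite differences rather than the analytic gradients available in the model, and it verifies the formula on a Gaussian log-density, for which the approximation is exact.

```python
import numpy as np

def numerical_hessian(f, x0, eps=1e-4):
    """Central-difference Hessian of a scalar function f at x0."""
    d = len(x0)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i = np.zeros(d); e_i[i] = eps
            e_j = np.zeros(d); e_j[j] = eps
            H[i, j] = (f(x0 + e_i + e_j) - f(x0 + e_i - e_j)
                       - f(x0 - e_i + e_j) + f(x0 - e_i - e_j)) / (4 * eps**2)
    return H

def laplace_log_integral(f, x0):
    """log of Laplace's approximation to the integral of e^f around a mode:
    log ∫ e^f(x) dx ≈ f(x0) + (d/2) log(2π) - (1/2) log |-H_f(x0)|."""
    d = len(x0)
    H = numerical_hessian(f, x0)
    sign, logdet = np.linalg.slogdet(-H)
    assert sign > 0, "x0 must be a strict local maximum"
    return f(x0) + 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet
```

For f(x) = -x·x/2 in two dimensions, the true integral is 2π, and the approximation recovers log(2π) up to finite-difference error; summing such terms over the K parses yields the reported bound on log P(I).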

E APPLYING GNS TO 3D OBJECT CONCEPTS

The GNS modeling framework is designed to capture inductive biases for concept learning that generalize across different types of visual concepts. We are actively working on applying GNS as a task-general generative model of 3D object concepts, such as chairs and vehicles, again with a neuro-symbolic generative process for parts and relations. Here, we briefly review the path forward for training GNS models of object concepts.

F.1 CLASSIFICATION

In Fig. A10 we show a collection of GNS fits from 7 different classification trials, including 2 trials that were misclassified (a misrepresentative proportion).

F.2 PARSING

In Fig. A11 we show a collection of predicted parses from GNS for 100 different target images.



Aspects of our model were recently published in a non-archival conference proceedings [6]. The previous manuscript presented only a prior distribution that, on its own, performs just one task (generating new concepts); our new developments include a full hierarchical model, a differentiable image renderer/likelihood, and a procedure for approximate probabilistic inference from image data. Together, these ingredients enable GNS to perform 4 unique conceptual tasks. Concept learning experiments can be reproduced using our pre-trained generative model and source code: https://github.com/rfeinman/GNS-Modeling.



Figure 1: Character drawings produced by the BPL model (left), GNS model (middle), and humans (right).

Figure 2: A generative neuro-symbolic (GNS) model of character concepts. The type model GenerateType (P (ψ)) produces character types one stroke at a time, using an image canvas C as memory. At each step, the current canvas C is fed to procedure GeneratePart and a stroke sample is produced. The canvas is first processed by the location model, a CNN-MLP architecture that samples starting location y, and next by the stroke model, a CNN-LSTM architecture that samples trajectory x while attending to the encoded canvas. Finally, a symbolic renderer updates the canvas according to x and y, and a termination model decides whether to terminate the type sample. Unique exemplars are produced from a character type by sampling from the token model conditioned on ψ, adding motor noise to the drawing parameters and performing a random affine transformation.

A character token θ^(m) = {y^(m)_1:κ, x^(m)_1:κ, A^(m)} represents a unique instance of a character concept, where y^(m)_1:κ are the token-level locations, x^(m)_1:κ the token-level parts, and A^(m) ∈ R^4 the parameters of an affine warp transformation. The token distribution factorizes as

P(θ^(m) | ψ) = P(A^(m)) Π_{i=1}^κ P(y^(m)_i | y_i) P(x^(m)_i | x_i).

Fig. 3. A generative model of handwritten characters. (A) New types are generated by choosing primitive actions (color coded) from a library (i), combining these subparts (ii) to make parts (iii), and combining parts with relations to define simple programs (iv). New tokens are generated by running these programs (v), which are then rendered as raw data (vi). (B) Pseudocode for generating new types ψ and new token images I^(m) for m = 1, ..., M. The function f(·, ·) transforms a subpart sequence and start location into a trajectory.

Fig. 4. Inferring motor programs from images. Parts are distinguished by color, with a colored dot indicating the beginning of a stroke and an arrowhead indicating the end. (A) The top row shows the five best programs discovered for an image along with their log-probability scores (Eq. 1). Subpart breaks are shown as black dots. For classification, each program was refit to three new test images (left in image triplets), and the best-fitting parse (top right) is shown with its image reconstruction (bottom right) and classification score (log posterior predictive probability). The correctly matching test item receives a much higher classification score and is also more cleanly reconstructed by the best programs induced from the training item. (B) Nine human drawings of three characters (left) are shown with their ground truth parses (middle) and best model parses (right).


π_k ∝ π̃_k = P(ψ_k, θ_k, I), such that Σ_k π_k = 1. To find the K good parses, search uses fast bottom-up methods to propose many variants of the discrete variables, filtering the most promising options, before optimizing the continuous variables with gradient descent (we use K = 5). See Appendix B for further details about inference, and Appendix C for generating new exemplars of a concept. Inference for one-shot classification. In one-shot classification, models are given a single training image I^(c) from each of c = 1, ..., C classes, and asked to classify test images according to their corresponding training classes. For each test image I^(T), we compute a classification score with respect to each candidate training image and select the class with the highest score.
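The classification rule can be sketched as follows. The names are our own, and `refit_loglik` is a hypothetical placeholder for refitting a stored parse {ψ_k, θ_k} of a training image to a test image and returning the resulting log-likelihood.

```python
import numpy as np

def _logsumexp(a):
    """Numerically stable log-sum-exp."""
    a = np.asarray(a, dtype=float)
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

def classification_score(test_image, parses, log_weights, refit_loglik):
    """Approximate log posterior predictive log P(I_test | I_train):
    refit each stored parse of the training image to the test image and
    combine the refit log-likelihoods under the normalized parse weights."""
    log_w = np.asarray(log_weights, dtype=float)
    log_w = log_w - _logsumexp(log_w)          # normalize in log space
    scores = np.array([refit_loglik(p, test_image) for p in parses])
    return _logsumexp(log_w + scores)

def classify(test_image, train_parses, train_log_weights, refit_loglik):
    """Pick the training class whose parses best explain the test image."""
    scores = {c: classification_score(test_image, train_parses[c],
                                      train_log_weights[c], refit_loglik)
              for c in train_parses}
    return max(scores, key=scores.get)
```

A high score results when at least one parse of the training image also explains the test image well, matching the intuition described for Fig. 4A.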

Fig. 5. Generating new exemplars. Humans and machines were given an image of a novel character (top) and asked to produce new exemplars. The nine-character grids in each pair that were generated by a machine are (by row) 1, 2; 2, 1; 1, 1.


Figure 4: Generation tasks. (a) GNS produced 9 new exemplars for each of 5 target images, plotted here next to human and BPL productions. (b) A grid of 36 new character concepts sampled unconditionally from GNS, shown next to BPL samples.

Figure A5: The GNS hierarchical generative model. The full hierarchical generative model of GNS is depicted in Fig. A5. The joint density for type ψ, token θ^(m), and image I^(m) factors as P(ψ, θ^(m), I^(m)) = P(ψ) P(θ^(m) | ψ) P(I^(m) | θ^(m)). (6) The type ψ parameterizes a motor program for generating character tokens θ^(m), unique exemplars of the concept. Both ψ and θ^(m) are expressed as causal drawing parameters. An image I^(m) is obtained from token θ^(m) by rendering the drawing parameters and sampling binary pixel values.
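The factorization of Eq. 6 and the Bernoulli pixel likelihood can be written down directly. The function names are our own, and the clipping constant is an assumption added for numerical safety.

```python
import numpy as np

def log_joint(log_p_type, log_p_token_given_type, log_p_image_given_token,
              psi, theta, image):
    """Eq. 6 in log space:
    log P(psi, theta, I) = log P(psi) + log P(theta|psi) + log P(I|theta)."""
    return (log_p_type(psi)
            + log_p_token_given_type(theta, psi)
            + log_p_image_given_token(image, theta))

def bernoulli_image_loglik(image, ink_probs, eps=1e-6):
    """log P(I | theta): each pixel of the binary image is an independent
    Bernoulli whose probability comes from the rendered grayscale ink."""
    p = np.clip(ink_probs, eps, 1 - eps)
    return float(np.sum(image * np.log(p) + (1 - image) * np.log(1 - p)))
```

The three log terms are exactly the quantities summed when scoring a parse {ψ, θ} against an image during inference.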

Figure A8: The initial "base" parses proposed for an image with skeleton extraction and random walks.

Fig. A12 shows new exemplars produced by GNS for 12 different target images, and Fig. A13 shows new exemplars produced by two alternative neural models from prior work on Omniglot.

F.4 GENERATING NEW CONCEPTS (UNCONSTRAINED)

In Fig. A14 we show a grid of 100 new character concepts produced by GNS, plotted alongside a corresponding grid of "nearest neighbor" training examples.

Figure A12: Generating new exemplars with GNS. Twelve target images are highlighted in red boxes. For each target image, the GNS model sampled 9 new exemplars, shown in a 3x3 grid under the target.

Figure 5: One-shot same-class samples generated by our model. Cue images were sampled from previously unseen classes.


Figure A13: New exemplars by the Sequential Generative (SG) model [43] and the Variational Homoencoder (VHE) [21]. (a) The SG model shows far too much variability, drawing what is clearly the wrong character in many cases (e.g. right-most column). (b) The VHE character samples are often incomplete, missing important strokes of the target class.

Attempted Omniglot tasks by model. Attempt does not imply successful completion.

Test error on within-alphabet one-shot classification.

Test log-likelihood bounds.

Objects differ from characters in both structural and statistical considerations. To address these differences within GNS, we shift up one level in the generative hierarchy and design a token-level model for generating new exemplars of an individual concept (chairs, cars, etc.) that mirrors our type-level model for characters. The architecture and sampling procedure of GNS for 3D object tokens is given in Fig. A9. As with characters, our object model produces samples one part at a time, using a 3D canvas C in place of the previous 2D canvas. The procedure GeneratePart consists of two neural network components: 1) a discriminatively trained category model p(k | C) that predicts the category label k of the next part (leg, arm, back, etc.) given the current canvas, and 2) a generative instance model p(x | k, C) that is trained with a variational autoencoder objective to sample an instance of the next part x given the current canvas and the predicted part category label. Objects and object parts are represented as 3D voxel grids, and all neural modules, including the category model and the encoder/decoder of the instance model, are parameterized by 3D convolutional neural networks. A function f_update is used to update the current canvas with the most recent part by summing the voxel grids. A GNS model for a particular concept is trained on examples of the 3D voxel grids with semantic part labels.
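The object-token control flow can be sketched in the same style as the character model. The category and instance models are placeholder callables, the names are our own, and only the voxel-summing `f_update` follows the description above.

```python
import numpy as np

def f_update(canvas, part):
    """Update the 3D canvas with the newest part by summing voxel grids."""
    return canvas + part

def generate_object(category_model, instance_model, n_parts,
                    grid=(32, 32, 32), rng=None):
    """Control flow for object tokens: predict the next part's category
    from the canvas, sample an instance of that part, add it to the canvas."""
    rng = np.random.default_rng(rng)
    canvas = np.zeros(grid)
    parts = []
    for _ in range(n_parts):
        k = category_model(canvas, rng)      # part label (leg, back, ...)
        x = instance_model(k, canvas, rng)   # voxel grid for the new part
        canvas = f_update(canvas, x)
        parts.append((k, x))
    return parts, canvas
```

As with characters, the evolving canvas is the only memory: each part is sampled conditioned on the voxel rendering of all parts placed so far.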


The canvas C is processed by the location model to produce a probability distribution for the starting location y_i. The starting location y_i is then passed along with the image canvas C to the stroke model, a Long Short-Term Memory (LSTM) architecture with a CNN-based image attention mechanism. The stroke model samples the next stroke trajectory x_i sequentially, one offset at a time, selectively attending to different parts of the image canvas at each sample step and combining this information with the context of y_i. After GeneratePart returns, the stroke parameters y_i, x_i are rendered to produce an updated canvas C = f_render(y_i, x_i, C). The new canvas is then fed to the termination model, a CNN architecture that samples a binary termination indicator v_i. Both our location model and stroke model follow a technique from [16], who proposed to use neural networks with mixture outputs to model handwriting data. Parameters {π_1:K, µ_1:K, σ_1:K, ρ_1:K} output by our network specify a Gaussian mixture model (GMM) with K components (Fig. 2; colored ellipsoids), where π_k ∈ (0, 1) is the mixture weight of the k-th component, µ_k ∈ R^2 its means, σ_k ∈ R^2_+ its standard deviations, and ρ_k ∈ (−1, 1) its correlation. In our location model, a single GMM describes the distribution p(y_i | C). In our stroke model, the LSTM outputs one GMM at each timestep, describing p(Δ_t | Δ_1:t−1, y_i, C). The termination model CNN has no mixture outputs; it predicts a single Bernoulli probability to sample the binary variable v_i. When sampling from the model at test time, we use a temperature parameter proposed by Ha & Eck [20] (see [20, Eq. 8]) to control the entropy of the mixture density outputs. The token-level noise distributions for P(y^(m)_i | y_i) and P(x^(m)_i | x_i) are each zero-mean Gaussians, with standard deviations fit to the background drawings following the procedure of Lake et al. [29] (see SM 2.3.3). The location model adds noise to the start of each stroke, and the part model adds isotropic noise to the 2D coordinates of each spline control point in a stroke.
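Sampling from a mixture-density output with correlation, plus a temperature adjustment, can be sketched as below. This is our own illustration: we renormalize π_k^(1/T) (equivalent to dividing pre-softmax logits by T) and scale each component's variance by T, in the spirit of [20, Eq. 8]; the names and defaults are assumptions.

```python
import numpy as np

def sample_bivariate_gmm(pi, mu, sigma, rho, T=1.0, rng=None):
    """Sample a 2D offset from a mixture of correlated Gaussians, with a
    temperature T that sharpens (T < 1) or flattens (T > 1) the mixture
    weights and scales each component's spread."""
    rng = np.random.default_rng(rng)
    pi = np.asarray(pi, dtype=float)
    w = np.exp(np.log(pi) / T)                 # tempered mixture weights
    w /= w.sum()
    k = rng.choice(len(pi), p=w)               # pick a component
    sx, sy = np.asarray(sigma[k]) * np.sqrt(T) # temperature-scaled std devs
    r = rho[k]
    cov = np.array([[sx ** 2, r * sx * sy],    # correlated 2D covariance
                    [r * sx * sy, sy ** 2]])
    return rng.multivariate_normal(mu[k], cov)
```

As T approaches 0, sampling collapses onto the most probable component's mean, which is how low-temperature samples (e.g., the T = 0.5 grid in Fig. 4b) become cleaner than raw samples.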
In the affine warp A^(m) ∈ R^4, the first two dimensions control a global re-scaling of the spline coordinates, and the last two control a global translation of the center of mass. The distribution over A^(m) is a multivariate Gaussian whose covariance parameter Σ_A is similarly fit from the background drawings (see SM 2.3.4 in [29]).

B APPROXIMATE POSTERIOR

To obtain parses {ψ, θ}_1:K for our approximate posterior (Eq. 3) given an image I, we follow the high-level strategy of Lake et al. [29], using fast bottom-up search followed by discrete selection and continuous optimization. The algorithm proceeds by the five steps (Steps 1-5) listed above.
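The five-step search can be condensed into a skeleton like the following. Every component is a placeholder callable, and for readability the parse weights are computed in probability space rather than log space (a real implementation would normalize log joint densities with log-sum-exp).

```python
import numpy as np

def approximate_posterior(image, propose_parses, order_search, type_prior,
                          optimize_continuous, joint_density, K=5):
    """Skeleton of the five-step parse search."""
    # Step 1: bottom-up proposals from the image skeleton (random walks).
    candidates = propose_parses(image)
    # Step 2: choose stroke order/directions per parse under the type prior.
    candidates = [order_search(p, type_prior) for p in candidates]
    # Step 3: keep the top-K parses by type-prior score.
    candidates.sort(key=type_prior, reverse=True)
    top = candidates[:K]
    # Step 4: refine continuous type/token parameters by gradient ascent
    # on the joint density P(psi, theta, I).
    refined = [optimize_continuous(p, image) for p in top]
    # Step 5: weight parses by joint density and normalize.
    w = np.array([joint_density(p, image) for p in refined], dtype=float)
    w /= w.sum()
    return refined, w
```

The returned weighted parse set is exactly the discrete approximation to the posterior used downstream for classification, exemplar generation, and the marginal-likelihood bounds.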

