COMPOSITIONAL LAW PARSING WITH LATENT RANDOM FUNCTIONS

Abstract

Human cognition is compositional. We understand a scene by decomposing it into different concepts (e.g., the shape and position of an object) and learning the respective laws of these concepts, which may be either natural (e.g., laws of motion) or man-made (e.g., laws of a game). The automatic parsing of these laws indicates a model's ability to understand the scene, which makes law parsing play a central role in many visual tasks. This paper proposes a deep latent variable model for Compositional LAw Parsing (CLAP), which achieves human-like compositionality through an encoding-decoding architecture that represents concepts in the scene as latent variables. CLAP employs concept-specific latent random functions instantiated with Neural Processes to capture the law of each concept. Our experimental results demonstrate that CLAP outperforms the baseline methods in multiple visual tasks such as intuitive physics, abstract visual reasoning, and scene representation. The law manipulation experiments illustrate CLAP's interpretability by modifying specific latent random functions on samples. For example, CLAP learns the laws of position changing and appearance constancy from moving balls in a scene, making it possible to exchange laws between samples or compose existing laws into novel laws.

1. INTRODUCTION

Compositionality is an important feature of human cognition (Lake et al., 2017). Humans can decompose a scene into individual concepts and learn the respective laws of these concepts, which can be either natural (e.g., laws of motion) or man-made (e.g., laws of a game). When observing a scene of a moving ball, one tends to parse the changing patterns of its appearance and position separately: the appearance stays consistent over time, while the position changes according to the laws of motion. By composing the laws of the ball's appearance and position, one can understand the changing pattern and predict the status of the moving ball. Although compositionality has inspired a number of models in visual understanding, such as representing handwritten characters through hierarchical decomposition (Lake et al., 2011; 2015) and representing a multi-object scene with object-centric representations (Eslami et al., 2016; Kosiorek et al., 2018; Greff et al., 2019; Locatello et al., 2020), automatic parsing of laws in a scene is still a great challenge. For example, to understand the rules in abstract visual reasoning such as the Raven's Progressive Matrices (RPM) test, comprehending attribute-specific representations and the underlying relationships among them is crucial for a model to predict the missing images (Santoro et al., 2018; Steenbrugge et al., 2018; Wu et al., 2020). To understand the laws of motion in intuitive physics, a model needs to grasp the changing patterns of different attributes (e.g., appearance and position) of each object in a scene to predict the future (Agrawal et al., 2016; Kubricht et al., 2017; Ye et al., 2018). However, these methods usually employ neural networks to directly model the changing patterns of the scene in a black-box fashion, and can hardly abstract the laws of individual concepts in an explicit, interpretable, and even manipulable way.
A possible solution to enable the abovementioned ability in a model is to exploit a function that represents the law of a concept in terms of the representation of the concept itself. To represent a law that may depict arbitrary changing patterns, an expressive and flexible random function is required. The Gaussian Process (GP) is a classical family of random functions that achieves diversity of the function space through different kernel functions (Williams & Rasmussen, 2006). Recently proposed random functions (Garnelo et al., 2018b; Eslami et al., 2018; Kumar et al., 2018; Garnelo et al., 2018a; Singh et al., 2019; Kim et al., 2019; Louizos et al., 2019; Lee et al., 2020; Foong et al., 2020) describe function spaces with the powerful nonlinear fitting ability of neural networks. These random functions have been used to capture the changing patterns in a scene, such as mapping timestamps to frames to describe the physical law of moving objects in a video (Kumar et al., 2018; Singh et al., 2019; Fortuin et al., 2020). However, these applications of random functions take images as inputs, so the captured laws account for all pixels instead of the expected individual concepts. In this paper, we propose a deep latent variable model for Compositional LAw Parsing (CLAP). CLAP achieves human-like compositionality through an encoding-decoding architecture (Hamrick et al., 2018) that represents concepts in the scene as latent variables, and further employs concept-specific random functions in the latent space to capture the law on each concept. By plugging in different random functions, CLAP gains the generality and flexibility to handle various law parsing tasks. We introduce CLAP-NP as an example that instantiates the latent random functions with recently proposed Neural Processes (Garnelo et al., 2018b).
Our experimental results demonstrate that the proposed CLAP outperforms the compared baseline methods in multiple visual tasks including intuitive physics, abstract visual reasoning, and scene representation. In addition, the experiments on exchanging latent random functions for a specific concept and on composing latent random functions of given samples to generate new samples demonstrate the interpretability and manipulability of the proposed method for compositional law parsing.

Figure 1: Overview of CLAP: to predict target images, CLAP first encodes the given context images into representations of concepts, such as object appearance, background, and angle of object rotation (azimuth); then parses concept-specific latent random functions and computes the representations of concepts in the target images; finally decodes these concepts to compose the target images.

3. PRELIMINARIES

An important role of a random function is uncovering the underlying function over some given points and predicting function values at novel positions accordingly. In detail, given a set of context points D_S = {(x_s, y_s) | s ∈ S}, we should find the most probable function f such that y_s = f(x_s) for these context points, and then use f(x_t) to predict y_t for target points D_T = {(x_t, y_t) | t ∈ T}. This prediction problem can be regarded as estimating a conditional probability p(y_T | x_T, D_S).

Neural Process (NP) NP (Garnelo et al., 2018b) captures function stochasticity with a Gaussian-distributed latent variable g and lets p(y_T | x_T, D_S) = ∫ ∏_{t∈T} p(y_t | g, x_t) q(g | D_S) dg. In this case, p(y_T | x_T, D_S) consists of a generative process p(y_t | g, x_t) and an inference process q(g | D_S), which can be jointly optimized through variational inference. To estimate p(y_T | x_T, D_S), NP extracts the feature of each context point in D_S and sums them up to acquire the mean μ and deviation σ of g:

μ, σ = T_f( (1/|S|) Σ_{s∈S} T_a(y_s, x_s) ). (1)

Then NP samples the global latent variable g from q(g | D_S) and independently predicts y_t at each target position x_t through the generative process p(y_t | g, x_t) = N(y_t | T_p(g, x_t), σ_y² I). The neural networks T_a, T_f, and T_p increase the model capacity of NP to learn data-specific function spaces.
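For concreteness, the NP prediction pipeline above can be sketched as follows. This is a toy NumPy sketch: the random linear maps standing in for the learned networks T_a, T_f, and T_p, and the dimensions, are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear stand-ins for the learned networks T_a, T_f, T_p.
D_FEAT, D_G = 8, 4
W_a = rng.normal(size=(2, D_FEAT))        # T_a: (y_s, x_s) -> per-point feature
W_f = rng.normal(size=(D_FEAT, 2 * D_G))  # T_f: aggregated feature -> (mu, log sigma)
W_p = rng.normal(size=(D_G + 1, 1))       # T_p: (g, x_t) -> mean of y_t

def np_predict(xs, ys, xt):
    """One NP forward pass: aggregate context, sample g, predict targets."""
    feats = np.stack([np.array([y, x]) @ W_a for x, y in zip(xs, ys)])
    r = feats.mean(axis=0)                    # permutation-invariant aggregation
    out = r @ W_f
    mu, sigma = out[:D_G], np.exp(out[D_G:])  # parameters of q(g | D_S)
    g = mu + sigma * rng.normal(size=D_G)     # sample the global function variable
    # Each target is predicted independently given g (the generative process).
    return np.array([np.concatenate([g, [x]]) @ W_p for x in xt]).ravel()

preds = np_predict([0.0, 0.5], [1.0, 1.2], [0.25, 0.75])
```

Sampling g once and reusing it for all targets is what makes the predictions function-consistent: all targets are drawn from the same underlying function.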

4. MODEL

Figure 1 gives an overview of CLAP. The encoder converts context images into individual concepts in the latent space. Then CLAP parses concept-specific random functions on the concept representations of the context images (context concepts) to predict the concept representations of the target images (target concepts). Finally, a decoder maps the target concepts to target images. The concept-specific latent random functions realize compositional law parsing in CLAP. Below, we introduce the generative process, variational inference, and parameter learning of CLAP. At the end of this section, we propose CLAP-NP, which instantiates the latent random function with the Neural Process (NP).

4.1. LATENT RANDOM FUNCTIONS

Given D_S and D_T, the objective of CLAP is to maximize log p(y_T | y_S, x). In the framework of conditional latent variable models, we introduce a variational posterior q(h | y, x) for the latent variable h and approximate the log-likelihood with the evidence lower bound (ELBO) (Sohn et al., 2015):

log p(y_T | y_S, x) ≥ E_{q(h|y,x)}[log p(y_T | h)] − E_{q(h|y,x)}[log (q(h | y, x) / p(h | y_S, x))] = L.

The latent variable h includes the global latent random functions f and the local information of images z = {z_n}_{n=1}^N, where z_n is the low-dimensional representation of y_n. Moreover, we decompose z_n into independent concepts {z_n^c | c ∈ C} (e.g., an image of a single-object scene may have concepts such as object appearance, illumination, and camera orientation), where C refers to the name set of these concepts. Assuming that these concepts follow their respective changing patterns, we employ independent latent random functions to capture the law of each concept. Denoting the latent random function on concept c as f^c, the specific form of f^c depends on how we model it. As Figure 2b shows, if adopting NP as the latent random function, we have f^c(x_t) = T_p^c(g^c, x_t), where g^c is a Gaussian-distributed latent variable that controls the randomness of f^c. Within the graphical model of CLAP in Figure 2a, the prior and likelihood in L are factorized into

p(h | y_S, x) = ∏_{c∈C} p(f^c | z_S^c, x_S) ∏_{t∈T} p(z_t^c | f^c, z_S^c, x_S, x_t) ∏_{s∈S} p(z_s^c | y_s), p(y_T | h) = ∏_{t∈T} p(y_t | z_t), (2)

and the posterior is factorized into

q(h | y, x) = ∏_{c∈C} q(f^c | z^c, x) ∏_{s∈S} q(z_s^c | y_s) ∏_{t∈T} q(z_t^c | y_t). (3)

Decomposing f into concept-specific laws {f^c | c ∈ C} in the above process is critical for CLAP's compositionality. The compositionality makes it possible to parse several concept-specific laws rather than the entire law on z, which reduces the complexity of law parsing while increasing the interpretability of laws.
The interpretability allows us to manipulate laws, for example, exchanging the law on some concepts and composing the laws of existing samples.

4.2. PARAMETER LEARNING

Based on Equations 2 and 3, we factorize the ELBO as (see Appendix B.3 for the detailed derivation)

L = Σ_{t∈T} E_{q(h|y,x)}[log p(y_t | z_t)] (reconstruction term L_r)
 − Σ_{c∈C} Σ_{t∈T} E_{q(h|y,x)}[log (q(z_t^c | y_t) / p(z_t^c | f^c, z_S^c, x_S, x_t))] (target regularizer L_t)
 − Σ_{c∈C} Σ_{s∈S} E_{q(h|y,x)}[log (q(z_s^c | y_s) / p(z_s^c | y_s))] (context regularizer L_s)
 − Σ_{c∈C} E_{q(h|y,x)}[log (q(f^c | z^c, x) / p(f^c | z_S^c, x_S))] (function regularizer L_f). (4)

Reconstruction Term L_r is the sum of log p(y_t | z_t), which is modeled with a decoder that reconstructs target images from the representations of concepts. CLAP maximizes L_r to connect the latent space with the image space, enabling the decoder to generate high-quality images. Target Regularizer L_t consists of Kullback-Leibler (KL) divergences between the posterior q(z_t^c | y_t) and the prior p(z_t^c | f^c, z_S^c, x_S, x_t) of target concepts. The parameters of q(z_t^c | y_t) are computed directly with the encoder so as to contain more accurate information about the target image, while p(z_t^c | f^c, z_S^c, x_S, x_t) is estimated from the context. Minimizing L_t closes the gap between the posterior and prior to ensure the accuracy of predictions. Context Regularizer L_s consists of KL divergences between the posterior q(z_s^c | y_s) and the prior p(z_s^c | y_s) of context concepts. To avoid a mismatch between the posterior and prior and to reduce model parameters, the posterior and prior are parameterized with the same encoder. In this case, we have p(z_s^c | y_s) = q(z_s^c | y_s) and thus L_s = 0, so we remove L_s from Equation 4. Function Regularizer L_f consists of KL divergences between the posterior q(f^c | z^c, x) and the prior p(f^c | z_S^c, x_S). We use L_f as a measure of function consistency to ensure that the model obtains similar distributions for f^c with any subsets z_S^c ⊆ z^c and x_S ⊆ x as inputs. To this end, CLAP shares the concept-specific function parsers between q(f^c | z^c, x) and p(f^c | z_S^c, x_S).
Accordingly, we need to design the architecture of the function parsers so that they can take subsets of different sizes as input. Correct decomposition of concepts is essential for model performance. However, CLAP is trained without additional supervision to guide the learning of concepts. Although CLAP explicitly defines independent concepts in the priors and posteriors, this is not sufficient to automatically decompose concepts for some complex data. To address the problem, we can introduce extra inductive biases that promote concept decomposition by designing proper network architectures, e.g., using spatial transformation layers to decompose spatial and non-spatial concepts (Skafte & Hauberg, 2019). We can also set hyperparameters to control the regularizers (Higgins et al., 2017) or add a total correlation (TC) regularizer R_TC to help concept learning (Chen et al., 2018). Here we choose to add extra regularizers to the ELBO of CLAP, and the optimization objective finally becomes

argmax_{p,q} L* = argmax_{p,q} L_r − β_t L_t − β_f L_f − β_TC R_TC, (5)

where β_t, β_f, and β_TC are hyperparameters that regulate the importance of the regularizers during training. We compute L* with the Stochastic Gradient Variational Bayes (SGVB) estimator (Kingma & Welling, 2013) and update parameters using gradient-descent optimizers (Kingma & Ba, 2015).
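The weighted objective in Equation 5 can be sketched as a one-line combination of the ELBO terms. The β values below are illustrative defaults, not the paper's tuned settings:

```python
def clap_objective(L_r, L_t, L_f, R_TC, beta_t=1.0, beta_f=1.0, beta_tc=1.0):
    """L* = L_r - beta_t * L_t - beta_f * L_f - beta_TC * R_TC (maximized)."""
    return L_r - beta_t * L_t - beta_f * L_f - beta_tc * R_TC

# Example: down-weighting the target regularizer with beta_t = 0.5.
objective = clap_objective(L_r=-10.0, L_t=2.0, L_f=1.0, R_TC=0.5, beta_t=0.5)
```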

4.3. NP AS LATENT RANDOM FUNCTION

In this subsection, we propose CLAP-NP as an example that instantiates the latent random function with NP. Following NP, we employ a latent variable g^c to control the randomness of the latent random function f^c for each concept in CLAP-NP. Generative Process CLAP-NP first encodes each context image y_s into its concepts:

{μ_{z,s}^c | c ∈ C} = Encoder(y_s), s ∈ S; z_s^c ∼ N(μ_{z,s}^c, σ_z² I), c ∈ C, s ∈ S. (6)

To stabilize the learning of concepts, the encoder outputs only the mean of the Gaussian distribution, with the standard deviation σ_z treated as a hyperparameter. Using the process in Equation 1 for each concept, a concept-specific function parser aggregates and transforms the contextual information into the mean μ_g^c and standard deviation σ_g^c of g^c. As Figure 2b shows, the concept-specific target predictor T_p^c takes g^c ∼ N(μ_g^c, diag((σ_g^c)²)) and x_t as inputs to predict the mean μ_{z,t}^c of z_t^c, again leaving the standard deviation as a hyperparameter. To keep the independence between concepts, CLAP-NP introduces identically structured but independent function parsers and target predictors. Once all of the concepts z_t^c ∼ N(μ_{z,t}^c, σ_z² I) for c ∈ C are generated, we concatenate and decode them into target images:

y_t ∼ N(μ_{y,t}, σ_y² I), μ_{y,t} = Decoder({z_t^c | c ∈ C}), t ∈ T.

To control the noise in sampled images, we introduce a hyperparameter σ_y as the standard deviation. As Figure 2a shows, we can rewrite the prior p(h | y_S, x) of CLAP-NP as

p(h | y_S, x) = ∏_{c∈C} p(g^c | z_S^c, x_S) ∏_{t∈T} p(z_t^c | g^c, x_t) ∏_{s∈S} p(z_s^c | y_s).

Inference and Learning In the variational inference, other than that the randomness of f^c is replaced by g^c, the posterior of CLAP-NP is the same as in Equation 3:

q(h | y, x) = ∏_{c∈C} q(g^c | z^c, x) ∏_{s∈S} q(z_s^c | y_s) ∏_{t∈T} q(z_t^c | y_t).

In the first stage, we compute the means of both z_S and z_T using the encoder of CLAP.
Because the encoder is shared between the prior and posterior, we obtain z_s^c through Equation 6 instead of recalculating z_s^c ∼ q(z_s^c | y_s). Then we compute the distribution parameters of g^c through the shared concept-specific function parsers with z^c and x as input. With the above generative and inference processes, the corresponding subterms in the ELBO of CLAP-NP become

L_t = Σ_{c∈C} Σ_{t∈T} E_{q(h|y,x)}[log (q(z_t^c | y_t) / p(z_t^c | g^c, x_t))], L_f = Σ_{c∈C} E_{q(h|y,x)}[log (q(g^c | z^c, x) / p(g^c | z_S^c, x_S))].
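The CLAP-NP generative process above can be sketched end to end: encode contexts per concept, parse a concept-specific g^c, and predict target concepts independently per concept. This is a toy NumPy sketch; the random linear maps standing in for the encoder, function parsers, and target predictors T_p^c, and the concept names and dimensions, are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
CONCEPTS, D_Z, D_G = ["appearance", "position"], 3, 2
SIGMA_Z = 0.1  # fixed concept std sigma_z (a hyperparameter in CLAP-NP)

# Hypothetical linear stand-ins for the per-concept encoder heads,
# function parsers, and target predictors (independent per concept).
enc = {c: rng.normal(size=(4, D_Z)) for c in CONCEPTS}           # image -> mu_z^c
parser = {c: rng.normal(size=(D_Z + 1, 2 * D_G)) for c in CONCEPTS}
pred = {c: rng.normal(size=(D_G + 1, D_Z)) for c in CONCEPTS}

def clap_np_predict(ys, xs, xt):
    z_target = {}
    for c in CONCEPTS:                                            # concepts stay independent
        mu_z = np.stack([y @ enc[c] for y in ys])                 # encode context images
        z_s = mu_z + SIGMA_Z * rng.normal(size=mu_z.shape)        # sample z_s^c
        feats = np.concatenate([z_s, np.array(xs)[:, None]], 1) @ parser[c]
        r = feats.mean(axis=0)                                    # aggregate context
        mu_g, sig_g = r[:D_G], np.exp(r[D_G:])
        g = mu_g + sig_g * rng.normal(size=D_G)                   # sample g^c once per concept
        z_target[c] = np.stack(
            [np.concatenate([g, [x]]) @ pred[c] for x in xt])     # predict z_t^c
    # A decoder would map the concatenated target concepts back to images.
    return np.concatenate([z_target[c] for c in CONCEPTS], axis=1)

zt = clap_np_predict([rng.normal(size=4) for _ in range(3)], [0., 1., 2.], [3., 4.])
```

Each concept's law is carried entirely by its own g^c, which is what makes the later exchange and composition experiments possible.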

5. EXPERIMENTS

To evaluate the models' ability of compositional law parsing on different types of data, we use three datasets in the experiments: (1) the Bouncing Ball (abbreviated as BoBa) dataset (Lin et al., 2020a) to validate intuitive physics; (2) the Continuous Raven's Progressive Matrix (CRPM) dataset (Shi et al., 2021) to validate abstract visual reasoning; (3) the MPI3D dataset (Gondal et al., 2019) to validate scene representation. We adopt NP (Garnelo et al., 2018b), GP with a deep kernel (Wilson et al., 2016), and GQN (Eslami et al., 2018) as baselines. We consider NP and GP as baselines because they are representative random functions that construct the function space in different ways, and CLAP-NP instantiates the latent random function with NP. GQN is adopted since it models distributions over image-valued functions, which closely matches our experimental settings. NP is not designed to model image-valued functions, so we add an encoder and decoder around NP's aggregator and predictor to help it handle high-dimensional images. For GP, we use a pretrained encoder and decoder to reduce the dimension of raw images. See Appendix D.1 and D.2 for a detailed introduction to the datasets, hyperparameters, and model architectures. For quantitative experiments, we adopt Mean Squared Error (MSE) and Selection Accuracy (SA) as evaluation metrics. Since non-compositional law parsing increases the risk of producing undesired changes on constant concepts and thereby causes pixel deviations in predictions, we adopt MSE as an indicator of prediction accuracy. To measure the concept-level deviations between predictions and ground truths, which we regard as an indicator of semantic error, we introduce SA as a metric based on distances in the representation space. Since the scale of representation-level distances depends on each model's representation space, we instead obtain SA by selecting target images from a set of candidates. For a sample (x, y), we combine the target images y_T with K targets randomly selected from other samples to establish a candidate set. The candidates and the model's prediction ỹ_T ∼ p(y_T | y_S, x) are then encoded into representations to calculate concept-level distances. Finally, the candidate with the smallest distance is chosen, and SA is the selection accuracy across all test samples. In the following experiments, we use η = N_T / N to denote a training or test configuration in which N_T of the N images in a sample are target images to predict.

Figure 5: Predictions on CRPM-DT. We provide two predictions of the same context to illustrate a special case: when a row of images is removed from the given matrix, all predictions with the progressive law of size and the constant law of color are correct. In this special case, only CLAP-NP captures such prediction uncertainty.
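The SA metric described above reduces to a nearest-candidate selection in representation space. A minimal sketch, assuming representations are plain vectors and using Euclidean distance (the function name and interface are illustrative):

```python
import numpy as np

def selection_accuracy(pred_reprs, cand_reprs, true_idx):
    """SA: fraction of samples whose nearest candidate (in representation
    space) to the model's predicted representation is the ground truth.
    pred_reprs: (N, D)      predicted-target representations
    cand_reprs: (N, K+1, D) representations of each sample's candidate set
    true_idx:   (N,)        index of the ground-truth candidate per sample
    """
    d = np.linalg.norm(cand_reprs - pred_reprs[:, None, :], axis=-1)
    return float((d.argmin(axis=1) == np.asarray(true_idx)).mean())
```

In the paper's setting the representations would come from the model's own encoder; here any embedding of candidates and predictions works.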

5.1. INTUITIVE PHYSICS

To evaluate the intuitive physics of models, which is the underlying knowledge humans use to understand the evolution of the physical world (Kubricht et al., 2017), we use the BoBa dataset, where physical laws of collisions and motions are described as functions mapping timestamps to frames. BoBa-1 and BoBa-2 are one-ball and two-ball instances of BoBa. When predicting target frames, models should learn both the constant law of appearance and the physical laws of collisions. We set up two test configurations, η = 4/12 and η = 6/12, for BoBa. The quantitative results in Table 1 and Figure 3 illustrate CLAP-NP's intuitive physics on both BoBa-1 and BoBa-2. Figure 4 visualizes the predictions on BoBa-2, where NP can only generate blurred images while GP and GQN can hardly understand the appearance constancy of balls: balls in the predictions of GP have different colors, and GQN generates non-existent balls and ignores existing ones. In contrast, CLAP-NP can separately capture the appearance constancy and collision laws, which benefits law parsing in scenes with novel appearances and physical laws (Appendix E.7).

5.2. ABSTRACT VISUAL REASONING

Abstract visual reasoning is a manifestation of humans' fluid intelligence (Cattell, 1963), and RPM (Raven & Court, 1998) is a famous nonverbal test widely used to estimate a model's abstract visual reasoning ability. An RPM is a 3 × 3 image matrix, and the changing rule of the entire matrix consists of multiple attribute-specific subrules. In this experiment, we evaluate models on four instances of the CRPM dataset and two test configurations, η = 2/9 and η = 3/9. In Table 1 and Figure 3, CLAP-NP achieves the best results on both the MSE and SA metrics. Figure 5 displays the predictions on CRPM-DT. In the results, NP produces outer triangles with ambiguous edges, GP predicts targets with incorrect laws, and GQN generates multiple similar images. It is worth stressing that the answer to an RPM is not unique when a row is removed from it. We display examples of this situation in the second row of Figure 5, indicating that CLAP-NP can understand progressive and constant laws on different attributes rather than simple context-to-target mappings.

5.3. SCENE REPRESENTATION

In scene representation, we use the MPI3D dataset, where a robotic arm manipulates the altitude and azimuth of an object to produce images following different laws. The laws of scenes can be regarded as functions that project object altitudes or azimuths to RGB images. In this task, the key to predicting correct target images is to determine which law the scene follows by observing the given context images. We train models under η = 1/8∼4/8 and test them with more target or scene images (η = 4/8, 6/8, 10/40, 20/40). The MSE and SA scores in Table 1 and Figure 3 demonstrate that CLAP-NP outperforms NP in all configurations and GQN in η = 6/8, 10/40, 20/40. CLAP-NP has slightly higher SA but worse MSE scores than GQN when η = 4/8. A possible reason is that the network of GQN has good inductive biases and more parameters to generate clearer images and achieve better MSE scores. However, such improvement in pixel-level reconstruction has limited influence on the representation space, and hence on the SA scores of GQN. GQN can hardly generalize the random functions to novel situations with more images, leading to performance loss when η = 6/8, 10/40, 20/40. GP's MSE scores on MPI3D indicate that it is suitable for scenes with more context images (e.g., η = 10/40). With only sparse points, the prediction uncertainty of GP leads to a significant decrease in performance. In contrast, compositionality allows CLAP-NP to parse concept-specific subrules, which improves the model's generalization ability in novel test configurations.

5.4. COMPOSITIONALITY OF LAWS

It is possible to precisely manipulate a law if one knows which latent random functions (LRFs) encode it. We adopt two methods to find the correspondence between laws and LRFs. Figure 6 visualizes the changing patterns obtained by perturbing and decoding concepts, enabling us to understand the meaning of LRFs. Another way is to compute the variance declines between laws and LRFs (Kim & Mnih, 2018). Assume that we have ground-truth laws L and a generator that produces sample batches while fixing an arbitrary law l ∈ L. First, we generate a batch of samples without fixing any law and parse the LRFs of these samples to estimate the variance v̄^c of each concept-specific LRF. Then, fixing the law l, we generate batches {B_l^b}_{b=1}^{N_B} and estimate the in-batch variances {v_l^{c,b} | c ∈ C} of the LRFs for each B_l^b. Finally, the variance decline between the law l and the concept c is

s_{l,c} = (1 / N_B) Σ_{b=1}^{N_B} (v̄^c − v_l^{c,b}) / v̄^c.

In Figure 6, we display the variance declines on BoBa-2, CRPM-T, and CRPM-C, which guide the manipulation of laws in the following experiments. Figure 7 visualizes the law manipulation process, which well illustrates the motivation of our model for compositional law parsing. Latent Random Function Exchange The top of Figure 7 shows how we swap the laws of samples with the aid of compositional law parsing. To exchange the law of appearance between samples, we first refer to the variance declines in Figure 6. For BoBa-2, the laws are represented with 6 LRFs, where the first and last LRFs encode the law of appearance while the others indicate the motion of balls. Thus we infer the LRFs (the global rule latent variables in CLAP-NP) of two samples and swap them on the first and last dimensions. Finally, we regenerate the edited samples using the generative process and further make predictions and interpolations on them.
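The variance-decline statistic s_{l,c} above can be sketched directly. A minimal NumPy sketch, assuming each LRF is summarized by a parameter vector per sample (the function name and interface are illustrative):

```python
import numpy as np

def variance_decline(free_lrfs, fixed_lrf_batches):
    """s_{l,c}: relative drop in the variance of concept c's LRF parameters
    when law l is held fixed across generated batches.
    free_lrfs:         (M, D) LRF parameters parsed from unconstrained samples
    fixed_lrf_batches: list of (B, D) arrays, one per batch with law l fixed
    """
    v_free = np.var(free_lrfs, axis=0).mean()         # \bar{v}^c
    declines = [(v_free - np.var(b, axis=0).mean()) / v_free
                for b in fixed_lrf_batches]           # (\bar{v}^c - v_l^{c,b}) / \bar{v}^c
    return float(np.mean(declines))
```

A decline near 1 means fixing law l collapses the variability of that concept's LRF, i.e., the LRF encodes law l; a decline near 0 means the two are unrelated.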
Latent Random Function Composition For samples of CRPM, we manipulate laws by combining the LRFs from existing samples to generate samples with novel laws. Figure 7 shows the results of law composition on CRPM-T. The first step is parsing the LRFs of the given samples. By querying the variance declines in Figure 6, we find that these LRFs respectively correspond to the laws of size, rotation, and color. Since we have figured out the relations between laws and LRFs, the next step is to pick laws from three different samples and compose them into an unseen combination of laws. Finally, the composed laws are decoded to generate a sample with novel changing patterns. This process is similar to the way a human designs RPMs (i.e., set up separate sub-laws for each attribute and combine them into a complete RPM). From another perspective, law composition provides an option to generate novel samples from existing samples (Lake et al., 2015).

6. CONCLUSION AND LIMITATIONS

Inspired by the compositionality of human cognition, we propose a deep latent variable model for Compositional LAw Parsing (CLAP). CLAP decomposes high-dimensional images into independent visual concepts in the latent space and employs latent random functions to capture the concept-changing laws, through which CLAP achieves the compositionality of laws. To instantiate CLAP, we propose CLAP-NP, which uses NPs as latent random functions. The experimental results demonstrate the benefits of our model in intuitive physics, abstract visual reasoning, and scene representation. Through the experiments on latent random function exchange and composition, we further qualitatively evaluate the interpretability and manipulability of the laws learned by CLAP. Limitations. (1) Complexity of datasets. Because compositional law parsing is an underexplored task, we first validate the effectiveness of CLAP on datasets with relatively clear and simple laws to avoid the influence of unknown confounders in complicated datasets. (2) Setting the number of LRFs. For scenes with multiple complex laws, we can empirically set an appropriate upper bound or directly use a large upper bound on the number of LRFs. However, using too many LRFs increases the model parameters, and the redundant concepts decrease CLAP's computational efficiency. See Appendix F for detailed limitations and future work.

A APPENDIX

In the Supplementary Materials, we (1) provide the details of the ELBO, (2) introduce the datasets, model architectures, hyperparameters, and computational resources adopted in our experiments, and (3) provide additional experimental results. In particular, we provide additional results of editing and manipulating latent random functions, which validate our motivation and contribution.

B DETAILS OF ELBO

B.1 ELBO OF CONDITIONAL LATENT VARIABLE MODELS

log p(y_T | y_S, x) = ∫ q(h | y, x) log p(y_T | y_S, x) dh
= ∫ q(h | y, x) log [p(h, y_T | y_S, x) / p(h | y, x)] dh
= ∫ q(h | y, x) log [p(h, y_T | y_S, x) / q(h | y, x)] dh + ∫ q(h | y, x) log [q(h | y, x) / p(h | y, x)] dh
≥ ∫ q(h | y, x) log [p(h, y_T | y_S, x) / q(h | y, x)] dh
= E_{q(h|y,x)}[log p(y_T | h)] − E_{q(h|y,x)}[log (q(h | y, x) / p(h | y_S, x))] = L.

B.2 PRIOR AND POSTERIOR FACTORIZATION

p(y_T | h) = ∏_{t∈T} p(y_t | z_t),
p(h | y_S, x) = ∏_{c∈C} p(f^c, z_S^c, z_T^c | y_S, x)
= ∏_{c∈C} p(z_T^c | f^c, z_S^c, x_S, x_T) p(f^c | z_S^c, x_S) p(z_S^c | y_S)
= ∏_{c∈C} p(f^c | z_S^c, x_S) ∏_{t∈T} p(z_t^c | f^c, z_S^c, x_S, x_t) ∏_{s∈S} p(z_s^c | y_s),
q(h | y, x) = ∏_{c∈C} q(f^c, z_S^c, z_T^c | y, x)
= ∏_{c∈C} q(f^c | z^c, x) q(z_S^c | y_S) q(z_T^c | y_T)
= ∏_{c∈C} q(f^c | z^c, x) ∏_{s∈S} q(z_s^c | y_s) ∏_{t∈T} q(z_t^c | y_t).

B.3 ELBO OF CLAP

L = E_{q(h|y,x)}[log ∏_{t∈T} p(y_t | z_t)] − E_{q(h|y,x)}[log ( ∏_{c∈C} q(f^c | z^c, x) ∏_{s∈S} q(z_s^c | y_s) ∏_{t∈T} q(z_t^c | y_t) ) / ( ∏_{c∈C} p(f^c | z_S^c, x_S) ∏_{t∈T} p(z_t^c | f^c, z_S^c, x_S, x_t) ∏_{s∈S} p(z_s^c | y_s) )] (14)
= Σ_{t∈T} E_{q(h|y,x)}[log p(y_t | z_t)] − Σ_{c∈C} E_{q(h|y,x)}[log (∏_{t∈T} q(z_t^c | y_t)) / (∏_{t∈T} p(z_t^c | f^c, z_S^c, x_S, x_t))] − Σ_{c∈C} E_{q(h|y,x)}[log (q(f^c | z^c, x) / p(f^c | z_S^c, x_S))] − Σ_{c∈C} E_{q(h|y,x)}[log (∏_{s∈S} q(z_s^c | y_s)) / (∏_{s∈S} p(z_s^c | y_s))]
= Σ_{t∈T} E_{q(h|y,x)}[log p(y_t | z_t)] (reconstruction term L_r)
− Σ_{c∈C} Σ_{t∈T} E_{q(h|y,x)}[log (q(z_t^c | y_t) / p(z_t^c | f^c, z_S^c, x_S, x_t))] (target regularizer L_t)
− Σ_{c∈C} Σ_{s∈S} E_{q(h|y,x)}[log (q(z_s^c | y_s) / p(z_s^c | y_s))] (context regularizer L_s)
− Σ_{c∈C} E_{q(h|y,x)}[log (q(f^c | z^c, x) / p(f^c | z_S^c, x_S))] (function regularizer L_f).

B.3.1 RECONSTRUCTION TERM

L_r = Σ_{t∈T} E_{q(h|y,x)}[log p(y_t | z_t)] = Σ_{t∈T} E_{q(z_t|y_t)}[log p(y_t | z_t)] ≈ Σ_{t∈T} log p(y_t | z̃_t), where z̃_t ∼ q(z_t | y_t). (15)

B.3.2 TARGET REGULARIZER

L_t = Σ_{c∈C} Σ_{t∈T} E_{q(h|y,x)}[log (q(z_t^c | y_t) / p(z_t^c | f^c, z_S^c, x_S, x_t))]
= Σ_{c∈C} Σ_{t∈T} E_{q(z_S^c|y_S)} E_{q(z_T^c|y_T)} E_{q(f^c|z^c,x)}[log (q(z_t^c | y_t) / p(z_t^c | f^c, z_S^c, x_S, x_t))]. (16)

Because of the function consistency in samples, the function posteriors q(f^c | z^c, x) derived from arbitrary samples z̃^c ∼ q(z^c | y) need to be consistent. So we let f^c condition on only the observations x and y to simplify the posterior distribution, that is, we replace q(f^c | z^c, x) with q(f^c | y, x):

L_t ≈ Σ_{c∈C} Σ_{t∈T} E_{q(z_S^c|y_S)} E_{q(z_T^c|y_T)} E_{q(f^c|y,x)}[log (q(z_t^c | y_t) / p(z_t^c | f^c, z_S^c, x_S, x_t))]
= Σ_{c∈C} Σ_{t∈T} E_{q(z_S^c|y_S)} E_{q(f^c|y,x)}[KL(q(z_t^c | y_t) ‖ p(z_t^c | f^c, z_S^c, x_S, x_t))]. (17)

In this way, we convert the computation of the log-likelihoods of q(z_t^c | y_t) and p(z_t^c | f^c, z_S^c, x_S, x_t) into the KL divergences between them. Then a Monte Carlo estimator can be used to approximate L_t: z̃_S^c ∼ q(z_S^c | y_S) is sampled to compute the outer expectation E_{q(z_S^c|y_S)}[·]; for the inner expectation E_{q(f^c|y,x)}[·], because q(f^c | y, x) = ∫ q(f^c | z^c, x) q(z^c | y) dz^c, we can first sample z̃_T^c ∼ q(z_T^c | y_T) and then obtain f̃^c ∼ q(f^c | z̃_S^c, z̃_T^c, x). By means of f̃^c and z̃^c sampled from the posterior, we have L_t ≈ KL(q(z_t^c | y_t) ‖ p(z_t^c | f̃^c, z̃_S^c, x_S, x_t)).
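The Monte Carlo estimates above bottom out in closed-form KL divergences between diagonal Gaussians. A minimal sketch of that computation (the function name and interface are illustrative):

```python
import numpy as np

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL(N(mu_q, diag sigma_q^2) || N(mu_p, diag sigma_p^2)), summed over dims."""
    var_q, var_p = sigma_q ** 2, sigma_p ** 2
    return float(np.sum(np.log(sigma_p / sigma_q)
                        + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p)
                        - 0.5))
```

With the posterior and prior means produced by the encoder and the target predictor, and the fixed standard deviations σ_z, this gives each KL term of L_t directly.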

B.3.3 CONTEXT REGULARIZER

L_s = Σ_{c∈C} Σ_{s∈S} E_{q(h|y,x)}[log (q(z_s^c | y_s) / p(z_s^c | y_s))] = Σ_{c∈C} Σ_{s∈S} E_{q(h|y,x)}[log 1] = 0. (18)

B.3.4 FUNCTION REGULARIZER

L_f = Σ_{c∈C} E_{q(h|y,x)}[log (q(f^c | z^c, x) / p(f^c | z_S^c, x_S))]
= Σ_{c∈C} E_{q(z_S^c|y_S)} E_{q(z_T^c|y_T)} E_{q(f^c|z^c,x)}[log (q(f^c | z^c, x) / p(f^c | z_S^c, x_S))]
= Σ_{c∈C} E_{q(z^c|y)}[KL(q(f^c | z^c, x) ‖ p(f^c | z_S^c, x_S))]. (19)

To estimate L_f in the same way as the target regularizer, z̃^c is sampled from q(z^c | y) and we have L_f ≈ KL(q(f^c | z̃^c, x) ‖ p(f^c | z̃_S^c, x_S)).

B.4 TOTAL CORRELATION

Let I = {y_k}_{k=1}^K denote the set of all sample images in the dataset, and z̃_k ∼ q(z_k | y_k) the latent representations of concepts for the k-th image. To apply the Total Correlation (TC) (Chen et al., 2018) in CLAP, we decompose the aggregated posterior q(z) = Σ_{k=1}^K q(z | y_k) p(y_k) according to the concepts, that is,

R_TC = KL(q(z) ‖ ∏_{c∈C} q(z^c)) = E_{q(z)}[log q(z)] − Σ_{c∈C} E_{q(z)}[log q(z^c)]. (20)

We adopt Minibatch Weighted Sampling (Chen et al., 2018) to approximate R_TC on a minibatch {y_m}_{m=1}^M ⊆ I of the dataset with the corresponding concept representations {z̃_m}_{m=1}^M:

E_{q(z)}[log q(z)] ≈ (1/M) Σ_{i=1}^M [ log Σ_{j=1}^M q(z̃_i | y_j) − log(KM) ],
E_{q(z)}[log q(z^c)] ≈ (1/M) Σ_{i=1}^M [ log Σ_{j=1}^M q(z̃_i^c | y_j) − log(KM) ]. (21)
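The Minibatch Weighted Sampling estimate above can be sketched directly for one of the expectations. A minimal NumPy sketch, assuming diagonal Gaussian posteriors with a shared scalar standard deviation (the function name and interface are illustrative):

```python
import numpy as np

def mws_log_qz(z, mu, sigma, dataset_size):
    """Minibatch Weighted Sampling estimate of E_q(z)[log q(z)].
    z:     (M, D) latents sampled from q(z_m | y_m) for a minibatch
    mu:    (M, D) posterior means; sigma: shared diagonal posterior std
    dataset_size: K, the number of images in the full dataset
    """
    M, D = z.shape
    # Pairwise log-densities log q(z_i | y_j) under diagonal Gaussians.
    diff = z[:, None, :] - mu[None, :, :]
    log_q = (-0.5 * (diff ** 2).sum(-1) / sigma ** 2
             - 0.5 * D * np.log(2.0 * np.pi * sigma ** 2))
    # Stable log-sum-exp over j, then subtract log(K * M).
    m = log_q.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(log_q - m).sum(axis=1))
    return float((lse - np.log(dataset_size * M)).mean())
```

Calling the same routine on per-concept slices z^c gives the E_{q(z)}[log q(z^c)] terms, and their difference estimates R_TC.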

C PRELIMINARIES

Generative Query Network (GQN) GQN (Eslami et al., 2018) regards the mappings from camera poses to scene images as random functions. GQN adopts deterministic neural scene representations r and latent variables z to capture the configuration and stochasticity of scenes in the conditional probability p(y_T | x_T, y_C, x_C) = ∫ p(y_T, z | x_T, r) dz. The representation network T_a extracts a representation for each context point and sums them into the global scene representation r = Σ_{c∈C} T_a(y_c, x_c). The generation network T_p predicts the mean of a target scene image via μ = T_p(r, z, x_t), and target scene images are sampled from N(y_t | μ, σ² I). Gaussian Process (GP) A GP (we only discuss noise-free GPs here) models the probability distribution p(y | x) as the multivariate Gaussian y ∼ N(0, K), where K_ij is computed through the kernel function κ(x_i, x_j). In this case, the probability p(y_T | x_T, y_C, x_C) is a multivariate Gaussian whose parameters have closed-form solutions. Kernel functions are the key to constructing different types of random functions: models can use a basic kernel like the RBF kernel, combine different kernels to develop complex kernels, or adaptively learn the function space for different data through neural networks (Wilson et al., 2016).
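The closed-form GP conditional mentioned above can be written out directly. A minimal noise-free sketch with a 1D RBF kernel (the small jitter added to the context kernel matrix is a numerical-stability assumption, not part of the model):

```python
import numpy as np

def rbf(a, b, length_scale=1.0):
    """RBF kernel matrix between 1D input arrays a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale ** 2)

def gp_predict(xs, ys, xt, jitter=1e-8):
    """Noise-free GP conditional: closed-form mean and covariance of y_T."""
    K_ss = rbf(xs, xs) + jitter * np.eye(len(xs))  # context kernel (stabilized)
    K_ts, K_tt = rbf(xt, xs), rbf(xt, xt)
    alpha = np.linalg.solve(K_ss, ys)
    mean = K_ts @ alpha                            # posterior mean at targets
    cov = K_tt - K_ts @ np.linalg.solve(K_ss, K_ts.T)
    return mean, cov

mean, cov = gp_predict(np.array([0.0, 1.0]), np.array([0.0, 1.0]),
                       np.array([0.0, 0.5]))
```

At a context point the posterior mean reproduces the observed value and the posterior variance collapses to (near) zero, which is the noise-free interpolation property used in the experiments.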

D DATASETS AND EXPERIMENTAL SETUP D.1 DETAILS OF DATASETS

In this paper, three types of datasets are adopted: BoBa, CRPM, and MPI3D. Figure 8 displays two samples for each instance of the datasets.

Bouncing Ball (BoBa) BoBa contains videos of bouncing balls. Depending on the number of balls in the scene, the dataset provides two instances, BoBa-1 and BoBa-2. In the videos of Figures 8a and 8b, the motion of the balls follows the law of physical collisions, while the appearance of the balls (color, size, amount, etc.) is constant over time. The models need to capture the probability space on functions $f : \mathbb{R} \to \mathbb{R}^{3 \times 64 \times 64}$ that map timestamps to video frames. Referring to Table 2, we provide 10,000 videos of bouncing balls for training, 1,000 for validation and hyperparameter selection, and 2,000 to test the intuitive physics of the models. In the training and validation phases, we randomly select 1 to 4 frames from the video as the target images to predict; in the testing phase, we provide two configurations (referred to as Test-4 and Test-6) to evaluate the models' performance when there are 4 or 6 target images in a video.

Continuous Raven's Progressive Matrix (CRPM) The first sample in Figure 8c shows that, within a row of the matrix, the sizes of the triangles increase progressively, the grayscales change progressively from dark to light, and the rotations remain constant. To predict the missing target images in the matrix, the models need to learn the probability space on functions $f : \{-1, 0, 1\}^2 \to \mathbb{R}^{64 \times 64}$ that map 2D grid coordinates to grid images. Table 2 shows that we provide 10,000 image matrices for training, 1,000 for tuning the model hyperparameters, and 2,000 to test the abstract visual reasoning ability of the models. In the training and validation process, we randomly select 1 or 2 images as target images; in the testing process, we use Test-2 and Test-3 to evaluate the models' performance in predicting 2 or 3 target images.

MPI3D The MPI3D dataset (Gondal et al., 2019) contains a series of single-object scenes, each with 40 different scene images.
There are two underlying rules to change the object in a scene: change the altitude of the object (sample 1 in Figure 8g) or change the azimuth of the object (sample 2 in Figure 8g). The other attributes (e.g., object color, camera height, etc.) do not change within the scene images. It is essential for the models to grasp the different changing patterns for target image prediction, and the key is to describe the distribution over functions $f : \mathbb{R} \to \mathbb{R}^{3 \times 64 \times 64}$ that map object altitudes or azimuths to scene images. As Table 2 shows, we provide 16,127 scenes for training, 2,305 for tuning the hyperparameters, and 4,608 for testing. In terms of training and validation, we first randomly select 8 images from the 40 scene images to represent the scene, and then randomly select 1 to 4 of these 8 images as the target images, leaving the remaining images as the context. For testing, we supply four configurations: Test-4 and Test-6 evaluate the models' performance in predicting 4 or 6 target images out of 8 images, while Test-10 and Test-20 evaluate the performance in predicting 10 or 20 target images out of 40 images. One can fetch MPI3D from the repository under the Creative Commons Attribution 4.0 International License.
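The training-time context/target split for MPI3D described above can be sketched as follows. This is our illustration of the sampling procedure; the sampler in the released code may differ in detail:

```python
import random

def sample_mpi3d_split(n_images=40, n_scene=8, max_targets=4, rng=None):
    """Pick n_scene of the n_images scene images to represent the scene,
    then mark 1..max_targets of them as targets and leave the rest as context.
    Returns (context_indices, target_indices)."""
    rng = rng or random.Random()
    scene = rng.sample(range(n_images), n_scene)   # 8 of the 40 scene images
    n_targets = rng.randint(1, max_targets)        # 1 to 4 target images
    return scene[n_targets:], scene[:n_targets]
```

At test time the same routine applies with different arguments, e.g. `max_targets` fixed to 6 for Test-6, or `n_scene=40` with 20 targets for Test-20.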

D.2 MODEL ARCHITECTURE AND HYPERPARAMETERS

CLAP-NP In this subsection, we first describe the architecture of the encoder, decoder, concept-specific function parsers, and concept-specific target predictors in CLAP-NP, and then list the hyperparameters used in the training phase. • Encoder: for all datasets, CLAP-NP uses the same convolutional blocks to downsample high-dimensional images into the representations of concepts. To generate the more complex scene images in MPI3D, CLAP-NP uses a deeper decoder with hidden sizes [512, 256, 128, 64, 32]. We use $T_a$ with hidden sizes [128, 128] and output size 64, $T_f$ with hidden size [64], and $T_p$ with hidden sizes [128, 128]. The hyperparameters for CLAP-NP on the different datasets are shown in Table 4. For all datasets, we set the learning rate to $3 \times 10^{-4}$, the batch size to 512, and $\sigma_y = 0.1$. For MPI3D, we anneal $\beta_t$ and $\beta_f$ over the first 400 epochs. After each training epoch, we use the validation set to compute the evidence lower bound (ELBO) of the current model and save the model with the largest ELBO as the trained model. For all datasets, CLAP-NP uses the Adam (Kingma & Ba, 2015) optimizer to update the parameters.

Neural Process (NP)

To deal with high-dimensional images in NP (Garnelo et al., 2018b), we apply the encoder and decoder of CLAP-NP to convert high-dimensional images into low-dimensional representations. The aggregator $T_a$ extracts context information from the representations and function inputs. Then the function parser $T_f$ converts the context representation into the mean and standard deviation of the global latent variable. Finally, the decoder generates target images from the target function inputs and the global latent variable. NP's hyperparameters on the different datasets are shown in Table 5, where $d_r$ is the representation size of the encoder. For all datasets, NP adopts a learning rate of $1 \times 10^{-4}$, a batch size of 512, and $\sigma_y = 0.1$. The parameters of NP are updated with the Adam optimizer (Kingma & Ba, 2015).

Generative Query Network (GQN) To implement GQN (Eslami et al., 2018), we use a PyTorch implementation from the repository with the following changes to the default configuration to control the computational resources: we set the learning rate to 0.0005 and the batch size to 256, use the pool representation type, set the number of iterations to 4, and share the ConvLSTM core among iterations.

Gaussian Process (GP) We use a pretrained autoencoder to reduce the dimension of images and let the GP capture the changing patterns in the low-dimensional space. We adopt the same encoder and decoder as in CLAP-NP and NP for a fair comparison. The mean function of the GP is set to a constant function. After comparing the RBF kernel, the periodic kernel, and the deep kernel, we chose the deep kernel $\kappa(x_i, x_j) = \sigma^2 \exp\!\left(-\|T_k(x_i) - T_k(x_j)\|^2 / 2\ell^2\right)$ (Wilson et al., 2016) as the kernel function. The neural network $T_k$ is a multilayer perceptron with hidden sizes [1000, 1000, 500, 50], output size 2, and ReLU activation functions, the same as in DKL (Wilson et al., 2016). The hyperparameters of the GP are tuned on each sample through the log-likelihood of the context points with the RMSprop optimizer; we give the learning rate and number of epochs used to adjust the hyperparameters in Table 6, and we use multi-task Gaussian Process prediction (Bonilla et al., 2007) to model the correlations between output dimensions.
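The deep kernel used for the GP baseline admits a direct NumPy sketch. Here `feat` stands in for the learned network $T_k$, which we leave as an arbitrary callable; the function name and defaults are ours:

```python
import numpy as np

def deep_kernel(X, feat, sigma2=1.0, ls=1.0):
    """Deep kernel: sigma2 * exp(-||T_k(x_i) - T_k(x_j)||^2 / (2 * ls^2)),
    computed for every pair of rows of X. `feat` maps (N, d_in) -> (N, d)."""
    F = feat(X)                                        # learned features
    d2 = ((F[:, None, :] - F[None, :, :]) ** 2).sum(axis=-1)
    return sigma2 * np.exp(-d2 / (2.0 * ls ** 2))
```

With `feat` set to the identity this reduces to a plain RBF kernel, which makes the deep kernel a strict generalization of the basic kernel it replaces.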

E.1 INTUITIVE PHYSICS

In Figure 9 we display additional prediction results on BoBa-1 and BoBa-2. For each instance, we provide samples using the configurations Test-4 (left) or Test-6 (right), where CLAP-NP outperforms the other baselines. NP generates blurred target images on both BoBa-1 and BoBa-2. This indicates that NP has difficulty modeling the changing patterns of the bouncing balls. GQN can produce clear images but may generate non-existent balls and lose existing balls. In the first sample of BoBa-1, GQN's predictions deviate significantly from the ground truths in the positions of the balls. CLAP-NP performs well in modeling scene consistency and predicting the motion trajectories.

E.2 ABSTRACT VISUAL REASONING

With the additional experimental results in Figure 10, we can intuitively see that CLAP-NP understands the changing rules better than the baselines. For the second sample of each dataset, a row of images is removed to probe the abstract visual reasoning ability of the models. As the samples in Figure 10 show, when we remove a row from the matrix, the answer is not unique: the predictions are correct as long as the sizes increase progressively and the colors and rotations remain constant. CLAP-NP represents such randomness in the predictions by means of its probabilistic generative process, making it possible to generate different correct answers. In terms of prediction quality, although it generates blurred images, NP shows basic reasoning ability about the progressive rule of the outer triangle size in Figure 10b. GQN generates clear images, but the generated images often deviate from the underlying changing rules of the matrices. Table 7 and Table 11 confirm CLAP-NP's superior abstract visual reasoning performance quantitatively.

E.3 SCENE REPRESENTATION

Figure 12 shows the prediction results with 8 scene images. Both GQN and CLAP-NP generate clear predictions when the number of target images is 4; when the number of target images is increased to 6, the prediction accuracy of GQN declines significantly, while CLAP-NP maintains its accuracy. NP generates ambiguous results in all experiments. Figure 13 shows a more complicated situation: we provide 40 scene images and set 20 of them as target images. In this case, GQN and NP can hardly generate clear foreground objects, while CLAP-NP produces relatively accurate predictions with only a small decrease in generative quality. This experiment tests the generalization of the laws learned by the models, and the results above illustrate that NP can hardly represent scenes with functions, that GQN has difficulty generalizing its scene representation ability to different configurations, and that the laws learned by CLAP-NP generalize to novel scenes thanks to the compositional modeling of laws.

E.4 CONCEPT DECOMPOSITION

Concept decomposition is the foundation for CLAP to understand concept-specific laws. In this experiment, we traverse each concept in the latent space and visualize it through the decoder to illustrate the LRFs. First, we decompose a batch of images into concepts with the encoder to estimate the range of each concept in the latent space. To traverse one concept, we fix the other concepts and linearly interpolate it from its minimum to its maximum value to generate a sequence of interpolation results, which are decoded into images for visualization. Each row of Figure 14 shows the traversal results of one concept. In BoBa-1, CLAP-NP learns LRFs on the concepts of color, horizontal position, and vertical position in an unsupervised manner. This matches the way we understand the motion of balls: the color stays constant over time, while the horizontal and vertical positions conform to the physical laws in their respective directions. For CRPM, CLAP-NP correctly parses images into concepts that correspond to the attribute-specific rules in the matrices. Concept decomposition in real environments is a challenge for models. For MPI3D, CLAP-NP parses the LRFs on the object appearance (Dimensions 1, 5, 6, 7) and other static scene attributes (Dimensions 2, 3, 4).
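The traversal procedure above can be sketched as follows (our illustration; the decoder call that renders each interpolated latent into an image is omitted):

```python
import numpy as np

def traverse_concept(z, dim, z_min, z_max, steps=10):
    """Hold every other concept of latent z fixed and linearly interpolate
    dimension `dim` from z_min to z_max.

    z: (D,) latent vector of one sample; returns (steps, D) latents,
    each of which would be passed through the decoder for visualization."""
    out = np.repeat(z[None, :], steps, axis=0)
    out[:, dim] = np.linspace(z_min, z_max, steps)
    return out
```

Decoding each row of the result yields one row of Figure 14, with `z_min` and `z_max` estimated from the encoded batch as described above.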

E.6 LATENT RANDOM FUNCTION EXCHANGE

Another way to manipulate laws is to exchange the LRFs of some concepts between samples. In BoBa, we swap the law of appearance or object motion between samples; the results are shown in Figure 17. According to the concept decomposition results in Figure 14, we exchange the LRFs between samples on Dimension 1 (Dimensions 2 and 3) to modify the motion of balls (the law of appearance) in BoBa-1, and we exchange the LRFs on Dimensions 1 and 6 (Dimensions 2, 3, 4, and 5) to modify the motion of balls (the law of appearance) in BoBa-2. To exchange laws in MPI3D, as shown in Figure 18, we swap the LRFs between samples on Dimensions 1, 5, 6, and 7 to change the law of object motion. After modifying the laws in example 1 of Figure 18, the changing pattern of the first sample becomes horizontal rotation, and the second sample inherits the law of vertical rotation from the first sample, which illustrates CLAP-NP's ability to generate new samples. The law exchange experiments evidence CLAP-NP's interpretability and manipulability from another perspective.
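A sketch of the exchange operation, assuming each concept occupies one latent dimension as in the decomposition results above (our illustration, not the released code):

```python
import numpy as np

def exchange_lrfs(z_a, z_b, dims):
    """Swap the latent-function variables of the listed concept dimensions
    between two samples, leaving all other concepts untouched.

    z_a, z_b: (D,) latent vectors; dims: list of concept dimensions to swap."""
    z_a, z_b = z_a.copy(), z_b.copy()
    z_a[dims], z_b[dims] = z_b[dims].copy(), z_a[dims].copy()
    return z_a, z_b
```

For example, swapping Dimensions 2 and 3 between two BoBa-1 samples transfers the law of appearance while each ball keeps its own motion.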

E.7 GENERALIZATION ON UNSEEN CONCEPTS

We extend BoBa-2 to generate four datasets with novel laws: Novel-colors (BoBa-2-NC), Novel-shape (BoBa-2-NS), Without-ball-collisions (BoBa-2-WBO), and With-gravity (BoBa-2-WG). We draw the balls in BoBa-2-NC with unseen colors and replace the balls in BoBa-2-NS with squares without changing the laws of motion. For BoBa-2-WBO, we disable collisions between balls and preserve collisions between balls and borders; for BoBa-2-WG, we add vertical gravity to the scenes. After training on BoBa-2, CLAP-NP is tested on the four datasets without retraining to evaluate whether the compositionality of laws improves the model's generalization in scenes with unseen concepts or laws. Figure 19 shows that CLAP-NP predicts the correct object positions on BoBa-2-NC and BoBa-2-NS. CLAP-NP predicts inaccurate colors and shapes because the encoder and decoder are trained on BoBa-2 and the latent space does not encode the unseen colors or shapes. We observe similar results on BoBa-2-WBO and BoBa-2-WG: CLAP-NP learns the correct law of appearance but predicts incorrect ball positions when the scenes contain unseen physical laws. To evaluate the generalization ability in scenes with novel laws more thoroughly, we quantitatively evaluate CLAP-NP and the baseline models with MSE and SA scores on the four datasets. The results in Table 8 indicate that CLAP-NP achieves the best MSE and SA scores in most situations, which illustrates CLAP-NP's generalization ability in scenes with novel concepts or laws.

E.8 NUMBER OF LATENT RANDOM FUNCTIONS

This experiment explores the influence of setting too many or too few LRFs in CLAP-NP. Figure 20 shows the latent traversal results on MPI3D, BoBa-2, and CRPM-DT. If we set too few LRFs, CLAP-NP encodes different laws in one LRF instead of learning compositional laws, which harms CLAP-NP's generation ability (e.g., setting only two LRFs for BoBa-2). Setting too many LRFs has no significant influence on CLAP-NP's performance, because the redundant dimensions simply do not encode information (e.g., dimensions 1, 3, and 6 on CRPM-DT). However, due to the independent modeling of concept-specific LRFs, setting a large number of LRFs reduces the computational speed of CLAP-NP and increases the number of model parameters.

E.9 PREDICTION STRATEGY

CLAP uses the one-shot strategy, which predicts all target images at once. However, the rollout strategy is another choice: it predicts the subsequent target images from the few context images at the beginning. For example, in the test configuration η = 20/25, we first predict the 6th to 10th images from the first five context images, then combine them (the 1st to 10th images) to predict the following five images, and repeat this process until all target images are predicted. Table 9 shows the MSE scores on CRPM-DT, BoBa-2, and MPI3D, where CLAP-NP predicts targets with the rollout and one-shot strategies, respectively. On BoBa-2 and MPI3D, the rollout strategy slightly improves the prediction accuracy. Generally, if a target image is far away from all the context images, its prediction may have high uncertainty. With the rollout strategy, the model only predicts a few target images close to the context at each step, while the one-shot strategy requires the model to predict all the target images at once. Therefore, the rollout strategy tends to have lower predictive uncertainty and higher prediction accuracy.
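The rollout strategy described in E.9 can be sketched as follows. Here `model` is assumed to be any callable mapping a context set and target inputs to predictions; the names are ours:

```python
def rollout_predict(model, context, target_xs, chunk=5):
    """Predict target images chunk by chunk, folding each prediction back
    into the context before predicting the next chunk.

    context: list of (x, y) pairs; target_xs: target function inputs in order."""
    context = list(context)
    preds = []
    for i in range(0, len(target_xs), chunk):
        xs = target_xs[i:i + chunk]
        ys = model(context, xs)          # one-shot prediction on this chunk only
        preds.extend(ys)
        context.extend(zip(xs, ys))      # predictions become new context
    return preds
```

With `chunk=5` and η = 20/25, the loop performs four prediction steps, each conditioned on all images predicted so far, exactly as in the example above; setting `chunk=len(target_xs)` recovers the one-shot strategy.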

E.10 FAILURE CASES

In this experiment, we display some failure cases. For BoBa, most failure cases occur when there are consecutive target images (1st sample of BoBa in Figure 21) or too many target images (2nd sample of BoBa in Figure 21). For CRPM, CLAP can predict target images with diversity when we remove a row from a matrix, but it sometimes generates target images that break the rules. For example, the color of the images remains invariant in the first two rows, but CLAP generates images in the third row with changing grayscales (samples of CRPM in Figure 21). For MPI3D, when the object size changes markedly (1st sample of MPI3D in Figure 21) or the central object is tiny (2nd sample of MPI3D in Figure 21), the predictions can be incorrect or unclear.

F LIMITATIONS

We summarize the limitations of our work in two aspects. (1) Complexity of datasets. Because compositional law parsing is an underexplored task, we first validate the effectiveness of CLAP on datasets with relatively clear and simple rules to avoid the influence of unknown confounders in more complicated datasets. We believe that the compositionality of laws also exists in more complex scenarios (e.g., learning physical laws in realistic scenes) and that some vision tasks may benefit from compositional law parsing. For example, we could perform controllable video generation based on law modification or make more interpretable predictions by analyzing the dominant laws in videos. Discovering such compositional law parsing ability in more complex situations is a valuable topic for future work. (2) Setting the number of LRFs. CLAP places an upper bound on the number of LRFs. For scenes with multiple complex rules, we can empirically set an appropriate upper bound or simply choose a large one. However, a large bound linearly increases the number of model parameters, and the redundant concepts waste computing resources and decrease computational efficiency. Therefore, exploring a mechanism to dynamically adjust the number of functions would be meaningful. When applying CLAP to more complex scenes, we may introduce more inductive biases for better concept decomposition (e.g., a more task-specific encoder or decoder). We design CLAP as an unsupervised model, but it can be extended to take advantage of task-specific annotations. Thus, we can integrate CLAP with additional supervision (e.g., supervising CLAP with the changing factors of scenes) to help concept learning in a specific task.



Code is available at https://github.com/FudanVI/generative-abstract-reasoning/tree/main/clap
https://github.com/rr-learning/disentanglement dataset
https://github.com/iShohei220/torch-gqn



Figure 2: The graphical model of CLAP, where the generative process is indicated with black solid lines and the variational inference with red dotted lines. Panel (a) shows the framework of CLAP, where the latent random function $f_c$ for the concept $c \in C$ can be instantiated by arbitrary random functions; Panel (b) describes how to instantiate the latent random function with NP.

Figure 3: SA scores on BoBa, CRPM, and MPI3D dataset where the blue, orange, purple, and green lines denote scores of CLAP-NP, NP, GP, and GQN.

Figure 4: Prediction results (in red boxes) on BoBa-2 with η = 4/12 (left) and η = 6/12 (right).

Figure 6: The qualitative latent traversal results and quantitative variance declines on BoBa-2, CRPM-T, and CRPM-C. High variance declines indicate significant correlations between laws and LRFs, by which we can determine whether a dimension encodes the law we want to edit.

Figure 7: An illustration of law manipulation. At the top, we exchange the law of appearance by swapping the LRFs of two samples on the first and last dimensions. At the bottom, we compose laws from existing samples of CRPM-T and CRPM-C to generate novel samples.

Figure 8: Examples from different instances of datasets. BoBa includes two instances (a) BoBa-1 and (b) BoBa-2; CRPM includes four instances (c) CRPM-T, (d) CRPM-DT, (e) CRPM-C, and (f) CRPM-DC; (g) we only use one instance of MPI3D.

Figure 9: Target prediction on BoBa. The images with red boxes are predictions from the models.

Figure 10: Target predictions on CRPM. For each sample, we show two predictions of the models to test the understanding of attribute-specific rules.

Figure 11: SA scores on CRPM-T and CRPM-C.

Figure 12: Target prediction with 8 scene images on MPI3D

Figure 13: Target prediction with 40 scene images on MPI3D

Figure 20: Latent traversal results with too many (top) and too few (bottom) LRFs.

Table 9: MSE scores of CLAP-NP with the rollout and one-shot strategy.

MSEs on the BoBa, CRPM, and MPI3D datasets. The training configurations of the datasets are given in brackets, and the test configurations are displayed in the table headers.

The details of BoBa, CRPM, and MPI3D. Row 1: dataset names. Row 2: splits of the datasets, where Test-k denotes that there are k target images in one sample. Row 3: the number of samples in each split. Row 4: the number of images in each sample. Row 5: the number of target images in a sample. Row 6: the size of images, denoted as Channel × Height × Width.

A detailed description of $x_n$ on the datasets.

The hyperparameters of CLAP-NP.

Continuous Raven's Progressive Matrix (CRPM) CRPM consists of 3 × 3 image matrices where the images contain one or two centered triangles or circles. CRPM provides four instances with different image types: CRPM-T, CRPM-DT, CRPM-C, and CRPM-DC. Images in a matrix follow attribute-specific changing rules (e.g., rules in the size, grayscale, and rotation of the triangles).

The ReshapeBlock flattens the output tensor of size 512 × 1 × 1 from the last convolutional layer into a vector of size 512. The output of the encoder is the mean of the |C| concepts.
• Decoder: CLAP-NP uses several deconvolutional layers to upsample the representations of concepts into images, where Channel is the number of image channels and Sigmoid is the activation function used to generate pixel values in (0, 1).
• Function Parser: CLAP-NP provides identical but independent function parsers for the concepts. Each function parser consists of the network $T_s$ to encode the representations of concepts, $T_i$ to encode the inputs of the latent random functions, $T_a$ to aggregate context information, and $T_f$ to estimate the distribution of the global latent variable. The architecture of $T_s$ and $T_i$ is
- Fully Connected, 32, ReLU
- Fully Connected, 16
By concatenating the representations of concepts and the function inputs, we obtain the context points of the latent random function. The architecture of $T_a$ is
- Fully Connected, 256, ReLU
- Fully Connected, 256, ReLU
- Fully Connected, 128
Finally, the architecture of $T_f$ is
- Fully Connected, 256, ReLU
- Fully Connected, $2d_g$
where $d_g$ is the size of the global latent variable $g_c$. The outputs of $T_f$ are the mean and standard deviation of $g_c$.
• Target Predictor: the concept-specific target predictor $T_p$ maps $g_c$ and the encoded function inputs to the target concept.

The hyperparameters of NP.

The learning rate and training epoch of GP.

MSE scores on CRPM-T and CRPM-C.

MSE and SA scores on BoBa-2 with novel concepts or laws where the models are tested with the configuration η = 4/12. SA-k indicates the SA scores with k candidates.

ACKNOWLEDGMENTS

This work was supported in part by the National Natural Science Foundation of China (No.62176060), STCSM projects (No.20511100400, No.22511105000), Shanghai Municipal Science and Technology Major Project (2021SHZDZX0103), and the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning.

