COMPOSITIONAL LAW PARSING WITH LATENT RANDOM FUNCTIONS

Abstract

Compositionality is a hallmark of human cognition. We understand a scene by decomposing it into different concepts (e.g., the shape and position of an object) and learning the respective laws of these concepts, which may be either natural (e.g., laws of motion) or man-made (e.g., laws of a game). Automatically parsing these laws indicates a model's ability to understand the scene, which makes law parsing central to many visual tasks. This paper proposes a deep latent variable model for Compositional LAw Parsing (CLAP), which achieves human-like compositionality through an encoding-decoding architecture that represents concepts in a scene as latent variables. CLAP employs concept-specific latent random functions, instantiated with Neural Processes, to capture the law of each concept. Our experimental results demonstrate that CLAP outperforms baseline methods on multiple visual tasks such as intuitive physics, abstract visual reasoning, and scene representation. Law manipulation experiments, in which specific latent random functions of samples are modified, illustrate CLAP's interpretability. For example, CLAP learns the laws of position change and appearance constancy from balls moving in a scene, making it possible to exchange laws between samples or compose existing laws into novel ones.

1. INTRODUCTION

Compositionality is an important feature of human cognition (Lake et al., 2017). Humans can decompose a scene into individual concepts and learn the respective laws of these concepts, which can be either natural (e.g., laws of motion) or man-made (e.g., laws of a game). When observing a scene of a moving ball, one tends to parse the changing patterns of its appearance and position separately: the appearance stays consistent over time, while the position changes according to the laws of motion. By composing the laws of the ball's appearance and position, one can understand its changing pattern and predict the status of the moving ball. Although compositionality has inspired a number of models in visual understanding, such as representing handwritten characters through hierarchical decomposition (Lake et al., 2011; 2015) and representing a multi-object scene with object-centric representations (Eslami et al., 2016; Kosiorek et al., 2018; Greff et al., 2019; Locatello et al., 2020), automatically parsing the laws in a scene remains a great challenge. For example, to understand the rules in abstract visual reasoning such as the Raven's Progressive Matrices (RPM) test, comprehending attribute-specific representations and the underlying relationships among them is crucial for a model to predict the missing images (Santoro et al., 2018; Steenbrugge et al., 2018; Wu et al., 2020). To understand the laws of motion in intuitive physics, a model needs to grasp the changing patterns of different attributes (e.g., appearance and position) of each object in a scene to predict the future (Agrawal et al., 2016; Kubricht et al., 2017; Ye et al., 2018). However, these methods usually employ neural networks to directly model the changing patterns of a scene in a black-box fashion, and can hardly abstract the laws of individual concepts in an explicit, interpretable, and even manipulable way.
A possible solution to endow a model with the above ability is to use a function to represent the law of a concept in terms of the representation of the concept itself. To represent a law that may depict arbitrary changing patterns, an expressive and flexible random function is required. The Gaussian Process (GP) is a classical family of random functions that achieves diversity of the function space through different kernel functions (Williams & Rasmussen, 2006). Recently proposed random functions (Garnelo et al., 2018b; Eslami et al., 2018; Kumar et al., 2018; Garnelo et al., 2018a; Singh et al., 2019; Kim et al., 2019; Louizos et al., 2019; Lee et al., 2020; Foong et al., 2020) describe function spaces with the powerful nonlinear fitting ability of neural networks. These random functions have been used to capture the changing patterns in a scene, such as mapping timestamps to frames to describe the physical laws of moving objects in a video (Kumar et al., 2018; Singh et al., 2019; Fortuin et al., 2020). However, these applications of random functions take entire images as inputs, so the captured laws account for all pixels rather than the individual concepts of interest. In this paper, we propose a deep latent variable model for Compositional LAw Parsing (CLAP). CLAP achieves human-like compositionality through an encoding-decoding architecture (Hamrick et al., 2018) that represents concepts in a scene as latent variables, and further employs concept-specific random functions in the latent space to capture the law on each concept. By plugging in different random functions, CLAP gains the generality and flexibility to be applicable to various law parsing tasks. We introduce CLAP-NP as an example that instantiates the latent random functions with the recently proposed Neural Process (Garnelo et al., 2018b).
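To make this compositional structure concrete, consider the following toy NumPy sketch (hypothetical names, not the paper's implementation): one concept latent evolves with time while another stays constant, and the scene latent is their composition. In CLAP, each per-concept trajectory would instead be produced by a learned latent random function rather than a hard-coded law.

```python
import numpy as np

def position_law(t):
    # Concept 1: position evolves linearly with time (a toy law of motion).
    return np.array([0.1 * t, 0.05 * t])

def appearance_law(t):
    # Concept 2: appearance is constant over time (appearance constancy).
    return np.array([0.7, 0.2, 0.9])

def scene_latent(t):
    # The scene latent composes the concept-specific latents; in CLAP each
    # concept's trajectory would be sampled from its own latent random
    # function (e.g., a Neural Process) instead of these fixed maps.
    return np.concatenate([position_law(t), appearance_law(t)])

z0, z5 = scene_latent(0.0), scene_latent(5.0)
# Position components change between t=0 and t=5; appearance components do not.
```

Because the laws act on disjoint parts of the latent, one could swap in a different `position_law` (e.g., for another sample) without affecting the appearance components, which is the kind of manipulation the law exchange experiments perform.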
Our experimental results demonstrate that the proposed CLAP outperforms the compared baseline methods on multiple visual tasks including intuitive physics, abstract visual reasoning, and scene representation. In addition, the experiments on exchanging the latent random functions of a specific concept and on composing the latent random functions of given samples to generate new samples demonstrate the interpretability and even manipulability of the proposed method for compositional law parsing.

2. RELATED WORK

Compositional Scene Representation Compositional scene representation models (Yuan et al., 2022a) can understand a scene through object-centric representations (Yuan et al., 2019b; a; 2021; Emami et al., 2021; Yuan et al., 2022b). Some models initialize and update object-centric representations through iterative computational processes like neural expectation maximization (Greff et al., 2017), iterative amortized inference (Greff et al., 2019), and iterative cross-attention (Locatello et al., 2020). Other models adopt non-iterative computational processes to parse disentangled attributes of objects as latent variables (Eslami et al., 2016) or extract representations from evenly divided regions in parallel (Lin et al., 2020b). Recently, many models focus on capturing the layouts of scenes (Jiang & Ahn, 2020) or learning consistent object-centric representations from videos (Kosiorek et al., 2018; Jiang et al., 2019; Lin et al., 2020a). Unlike CLAP, compositional scene representation models learn how objects in a scene are composed but cannot explicitly represent the underlying laws or understand how these laws constitute the changing pattern of the scene.

Random Functions The GP (Williams & Rasmussen, 2006) is a classical family of random functions that regards the outputs of a function as jointly Gaussian random variables. To combine neural networks with random functions, some models encode functions through global representations (Wu et al., 2018; Eslami et al., 2018; Gordon et al., 2019). The NP (Garnelo et al., 2018b) captures function stochasticity with a Gaussian-distributed latent variable. Depending on how stochasticity is modeled, one can construct random functions with different characteristics (Kim et al., 2019; Louizos et al., 2019; Lee et al., 2020; Foong et al., 2020). Other models develop random functions by learning adaptive kernels (Tossou et al., 2019; Patacchiola et al., 2020) or by integrating ODEs or SDEs over latent states (Norcliffe et al., 2021; Li et al., 2020; Hasan et al., 2021). Random functions provide explicit representations for laws but cannot model the compositionality of laws.
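As a concrete illustration of how a kernel induces a space of random functions, the following minimal NumPy sketch (not from the CLAP codebase) draws function samples from a zero-mean GP prior with an RBF kernel:

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel; the lengthscale controls function smoothness.
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def sample_gp_prior(x, n_samples=3, seed=0):
    # Draw function values at inputs x from a zero-mean GP prior.
    rng = np.random.default_rng(seed)
    K = rbf_kernel(x, x) + 1e-6 * np.eye(len(x))  # jitter for numerical stability
    return rng.multivariate_normal(np.zeros(len(x)), K, size=n_samples)

x = np.linspace(0.0, 5.0, 50)
fs = sample_gp_prior(x)  # each row is one function drawn from the GP
```

Changing the kernel (or its lengthscale) changes the character of the sampled functions, which is the sense in which the kernel defines the function family; NP-style models replace this fixed kernel with learned neural encoders.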



Code is available at https://github.com/FudanVI/generative-abstract-reasoning/tree/main/clap

