COMPOSITIONAL LAW PARSING WITH LATENT RANDOM FUNCTIONS

Abstract

Human cognition exhibits compositionality. We understand a scene by decomposing it into different concepts (e.g., the shape and position of an object) and learning the respective laws of these concepts, which may be either natural (e.g., laws of motion) or man-made (e.g., laws of a game). Automatically parsing these laws reflects a model's ability to understand the scene, which gives law parsing a central role in many visual tasks. This paper proposes a deep latent variable model for Compositional LAw Parsing (CLAP), which achieves human-like compositionality through an encoding-decoding architecture that represents the concepts in a scene as latent variables. CLAP employs concept-specific latent random functions, instantiated with Neural Processes, to capture the law of each concept. Our experimental results demonstrate that CLAP outperforms baseline methods in multiple visual tasks such as intuitive physics, abstract visual reasoning, and scene representation. Law manipulation experiments illustrate CLAP's interpretability by modifying specific latent random functions on samples. For example, CLAP learns the laws of position change and appearance constancy from moving balls in a scene, making it possible to exchange laws between samples or to compose existing laws into novel ones.

1. INTRODUCTION

Compositionality is an important feature of human cognition (Lake et al., 2017). Humans can decompose a scene into individual concepts and learn the respective laws of these concepts, which can be either natural (e.g., laws of motion) or man-made (e.g., laws of a game). When observing a scene of a moving ball, one tends to parse the changing patterns of its appearance and position separately: the appearance stays consistent over time, while the position changes according to the laws of motion. By composing the laws of the ball's appearance and position, one can understand its changing pattern and predict the status of the moving ball. Although compositionality has inspired a number of models in visual understanding, such as representing handwritten characters through hierarchical decomposition (Lake et al., 2011; 2015) and representing a multi-object scene with object-centric representations (Eslami et al., 2016; Kosiorek et al., 2018; Greff et al., 2019; Locatello et al., 2020), automatically parsing the laws in a scene remains a major challenge. For example, to understand the rules in abstract visual reasoning such as the Raven's Progressive Matrices (RPM) test, a model must comprehend attribute-specific representations and the underlying relationships among them in order to predict the missing image (Santoro et al., 2018; Steenbrugge et al., 2018; Wu et al., 2020). To understand the laws of motion in intuitive physics, a model needs to grasp the changing patterns of different attributes (e.g., appearance and position) of each object in a scene to predict the future (Agrawal et al., 2016; Kubricht et al., 2017; Ye et al., 2018). However, these methods usually employ neural networks to directly model the changing patterns of the scene in a black-box fashion, and can hardly abstract the laws of individual concepts in an explicit, interpretable, and even manipulable way.
A possible way to endow a model with the above ability is to exploit a function that represents the law of a concept in terms of the representation of the concept itself. To represent a law
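This function view can be sketched concretely. The snippet below is a minimal, Conditional-Neural-Process-style illustration of a law as a learned function: it encodes (time, concept-code) context pairs, aggregates them into a single representation of the law, and decodes the concept's code at a query time. All dimensions, weights, and names here are hypothetical (the weights are random and untrained), so this shows only the shape of the idea, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)


def mlp(x, w1, b1, w2, b2):
    """Two-layer MLP with ReLU, applied along the last axis."""
    h = np.maximum(x @ w1 + b1, 0.0)
    return h @ w2 + b2


# Hypothetical sizes: a concept (e.g., position) has a 2-D latent code,
# and each context pair is (time, code) -> a 3-D encoder input.
d_in, d_hid, d_rep, d_out = 3, 16, 8, 2

# Random weights stand in for trained Neural Process parameters.
enc_w1, enc_b1 = rng.normal(size=(d_in, d_hid)), np.zeros(d_hid)
enc_w2, enc_b2 = rng.normal(size=(d_hid, d_rep)), np.zeros(d_rep)
dec_w1, dec_b1 = rng.normal(size=(1 + d_rep, d_hid)), np.zeros(d_hid)
dec_w2, dec_b2 = rng.normal(size=(d_hid, d_out)), np.zeros(d_out)


def law_function(context_t, context_z, target_t):
    """Represent a concept's law as a function: encode the observed
    (time, code) pairs, mean-aggregate them into a law representation r,
    then decode the concept code at the target time conditioned on r."""
    pairs = np.concatenate([context_t[:, None], context_z], axis=-1)  # (N, 3)
    r = mlp(pairs, enc_w1, enc_b1, enc_w2, enc_b2).mean(axis=0)       # (d_rep,)
    query = np.concatenate([np.atleast_1d(target_t), r])              # (1+d_rep,)
    return mlp(query, dec_w1, dec_b1, dec_w2, dec_b2)                 # (d_out,)


# Three observed frames of a 2-D "position" concept; query frame t = 3.
ts = np.array([0.0, 1.0, 2.0])
zs = rng.normal(size=(3, 2))
z_next = law_function(ts, zs, 3.0)
```

Because the law lives in the aggregated representation `r` rather than in the network weights alone, one could in principle swap the `r` computed from one sample's context into another sample's query, which is the kind of law exchange and composition described in the abstract.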

