NEURAL SYSTEMATIC BINDER

Abstract

The key to high-level cognition is believed to be the ability to systematically manipulate and compose knowledge pieces. While token-like structured knowledge representations are naturally provided in text, it remains elusive how to obtain them for unstructured modalities such as scene images. In this paper, we propose a neural mechanism called Neural Systematic Binder, or SysBinder, for constructing a novel structured representation called the Block-Slot Representation. In the Block-Slot Representation, object-centric representations known as slots are constructed by composing a set of independent factor representations called blocks, to facilitate systematic generalization. SysBinder obtains this structure in an unsupervised way by alternately applying two different binding principles: spatial binding for spatial modularity across the full scene, and factor binding for factor modularity within an object. SysBinder is a simple, deterministic, and general-purpose layer that can be applied as a drop-in module in any neural network and on any modality. In experiments, we find that SysBinder provides significantly better factor disentanglement within the slots than conventional object-centric methods, including, for the first time, in visually complex scene images such as CLEVRTex. Furthermore, we demonstrate factor-level systematicity in controlled scene generation by decoding unseen factor combinations. https://sites.google.com/view/

1. INTRODUCTION

One of the most remarkable traits of human intelligence is the ability to generalize systematically. Humans are good at dealing with out-of-distribution situations because they can understand them as compositions of pre-acquired knowledge pieces and know how to manipulate those pieces (Lake et al., 2017; Fodor & Pylyshyn, 1988). This ability also seems foundational for a broad range of higher-level cognitive functions, such as planning, reasoning, analogy-making, and causal inference. However, realizing it in machines remains a major challenge in modern machine learning (McVee et al., 2005; Bottou, 2014; Schölkopf et al., 2021). The problem is considerably more difficult for unstructured modalities such as visual scenes or speech signals than for modalities such as language. In language, embeddings of words or other forms of tokens can serve as modular knowledge pieces. For unstructured modalities, however, such tokens must first be obtained, e.g., by grouping relevant low-level features, through a process called binding (Greff et al., 2020a). Yet it remains quite elusive what the appropriate structure and granularity of these tokens should be to support systematic generalization, and how to obtain them, particularly in the unsupervised setting where the model must learn this ability only by observing.

In visual scenes, binding has recently been pursued by object-centric learning methods through the spatial binding approach (Locatello et al., 2020; Singh et al., 2022a). Spatial binding aims to divide a scene spatially into smaller areas so that each area contains a meaningful entity such as an object. The information in each area is then grouped and aggregated to produce a representation of an object, i.e., a slot, resulting in a set of slots per scene. These slots can be seen as independent modular knowledge pieces describing the full scene.
However, the main limitation of the current methods based on spatial binding is that a slot is an entangled vector and not a composition of independent factor representations.

* Correspondence to singh.gautam@rutgers.edu and sjn.ahn@gmail.com.
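The distinction between an entangled slot and a block-structured slot can be made concrete with a small sketch. The snippet below is an illustration of the Block-Slot Representation's shape only, not the paper's implementation; the array sizes (`num_slots`, `num_blocks`, `block_dim`) are hypothetical placeholders. It shows how a slot arises as a concatenation of independent factor blocks, and how a single factor can then be manipulated per object by swapping blocks.

```python
import numpy as np

# Hypothetical sizes: a scene described by 4 slots, each composed of
# 8 factor blocks of dimension 16 (these numbers are illustrative only).
num_slots, num_blocks, block_dim = 4, 8, 16

# One block per (slot, factor): blocks[i, j] encodes factor j of object i.
rng = np.random.default_rng(0)
blocks = rng.standard_normal((num_slots, num_blocks, block_dim))

# A block-structured slot is the concatenation of its factor blocks.
slots = blocks.reshape(num_slots, num_blocks * block_dim)

# Factor-level manipulation: swap factor j between objects 0 and 1 by
# exchanging only the corresponding blocks; all other factors are untouched.
j = 2
blocks[[0, 1], j] = blocks[[1, 0], j]
```

In an entangled slot, by contrast, no such per-factor slicing exists: every coordinate may mix information from several factors, so exchanging one factor between objects has no simple vector-level operation.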

