NEURAL SYSTEMATIC BINDER

Abstract

The key to high-level cognition is believed to be the ability to systematically manipulate and compose knowledge pieces. While token-like structured knowledge representations are naturally provided in text, it is elusive how to obtain them for unstructured modalities such as scene images. In this paper, we propose a neural mechanism called Neural Systematic Binder, or SysBinder, for constructing a novel structured representation called Block-Slot Representation. In Block-Slot Representation, object-centric representations known as slots are constructed by composing a set of independent factor representations called blocks, to facilitate systematic generalization. SysBinder obtains this structure in an unsupervised way by alternately applying two different binding principles: spatial binding for spatial modularity across the full scene and factor binding for factor modularity within an object. SysBinder is a simple, deterministic, and general-purpose layer that can be applied as a drop-in module in any arbitrary neural network and on any modality. In experiments, we find that SysBinder provides significantly better factor disentanglement within the slots than conventional object-centric methods, including, for the first time, in visually complex scene images such as CLEVR-Tex. Furthermore, we demonstrate factor-level systematicity in controlled scene generation by decoding unseen factor combinations. Project page: https://sites.google.com/view/

1. INTRODUCTION

One of the most remarkable traits of human intelligence is the ability to generalize systematically. Humans are good at dealing with out-of-distribution situations because they can understand them as compositions of pre-acquired knowledge pieces and know how to manipulate these pieces (Lake et al., 2017; Fodor & Pylyshyn, 1988). This ability also seems foundational for a broad range of higher-level cognitive functions, such as planning, reasoning, analogy-making, and causal inference. However, realizing it in machines remains a major challenge in modern machine learning (McVee et al., 2005; Bottou, 2014; Schölkopf et al., 2021). The problem is considerably more difficult for unstructured modalities such as visual scenes or speech signals than for modalities such as language. In language, we can treat embeddings of words or other forms of tokens as modular knowledge pieces. For unstructured modalities, however, we would first need to obtain such tokens, e.g., by grouping relevant low-level features, through a process called binding (Greff et al., 2020a). Yet it remains quite elusive what the appropriate structure and granularity of these tokens should be to support systematic generalization, and how to obtain them, particularly in the unsupervised setting where the model must learn this ability only by observing. In visual scenes, binding has recently been pursued by object-centric learning methods through the spatial binding approach (Locatello et al., 2020; Singh et al., 2022a). Spatial binding aims to divide a scene spatially into smaller areas so that each area contains a meaningful entity like an object. The information in each area is then grouped and aggregated to produce a representation of an object, i.e., a slot, resulting in a set of slots per scene. These slots can be seen as independent modular knowledge pieces describing the full scene.
However, the main limitation of current methods based on spatial binding is that a slot is an entangled vector rather than a composition of independent modular representations. As such, a systematically novel object maps to an unfamiliar slot vector rather than to a modular combination of familiar factor tokens, such as color, shape, and position. In this paper, we propose the Neural Systematic Binder, or SysBinder for short. SysBinder generalizes the conventional binding process by combining spatial binding and factor binding. While spatial binding provides spatial modularity across the full scene, factor binding provides factor modularity within an object. More specifically, given an input, SysBinder produces a set of vector representations, called slots, via spatial binding. However, unlike in typical object-centric learning, each of these slots is constructed by concatenating a set of independent factor representations, called blocks, obtained via factor binding. SysBinder learns the slots by iterative refinement, alternating between a spatial binding step and a factor binding step. In the spatial binding step, slots compete with each other for input features, yielding an attention area per slot. The information in each area is then grouped and used to refine the blocks. Crucially, to achieve factor binding, each block is refined in a modular fashion by applying an independent RNN and a soft information bottleneck to each block.
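The alternating procedure described above can be illustrated with a minimal NumPy sketch. This is our own simplified illustration, not the authors' implementation: the function name `sysbinder_step` and all shapes are hypothetical choices, and the learned projections, per-block RNN, MLP, and normalization layers of the full model are omitted. Only the two core ideas are shown: slots competing for input features (spatial binding), and each block being refined independently through a dot-product attention bottleneck over a memory of learned prototypes (factor binding).

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sysbinder_step(slots, feats, prototypes):
    """One (simplified) refinement iteration of SysBinder.

    slots:      (N, M*d)  N slots, each a concatenation of M d-dim blocks
    feats:      (L, M*d)  L input features (assumed already projected)
    prototypes: (M, K, d) per-factor memory of K learned prototypes
    """
    N, D = slots.shape
    M, K, d = prototypes.shape
    assert D == M * d

    # --- Spatial binding: slots compete for each input feature ---
    attn = softmax(slots @ feats.T / np.sqrt(D), axis=0)  # (N, L), softmax over slots
    attn = attn / attn.sum(axis=1, keepdims=True)         # normalize weights per slot
    updates = attn @ feats                                # (N, D) aggregated input per slot

    # --- Factor binding: split the update into M blocks, refine each independently ---
    new_blocks = []
    for m in range(M):
        u = updates[:, m * d:(m + 1) * d]                 # (N, d) bottom-up info for block m
        # A per-block RNN/MLP would be applied to u here; omitted in this sketch.
        # Bottleneck: dot-product attention over this block's learned prototype memory,
        # so each block becomes a convex combination of prototypes.
        w = softmax(u @ prototypes[m].T / np.sqrt(d), axis=1)  # (N, K)
        new_blocks.append(w @ prototypes[m])                   # (N, d)
    return np.concatenate(new_blocks, axis=1)             # (N, M*d) refined slots
```

In the full model this step would be iterated several times per input, with the prototypes, projections, and per-block RNNs trained end-to-end by the reconstruction objective.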
We train SysBinder in a fully unsupervised manner by reconstructing the input from the slots using the decoding framework of SLATE (Singh et al., 2022a;b). The contributions of this paper are as follows. First, SysBinder is the first deterministic binding mechanism to produce disentangled factors within a slot. This deterministic nature is remarkable since, conventionally, probabilistic modeling has been considered crucial for the emergence of factors within slots (Higgins et al., 2017; Greff et al., 2019). Second, in probabilistic frameworks the representation of a factor is a single dimension of the slot, whereas in SysBinder it is a multi-dimensional block, providing a more flexible and richer way to represent a factor. Third, like Slot Attention, SysBinder is a simple, deterministic, and general-purpose layer that can be applied as a drop-in module in any neural network and on any modality. In experiments, we show that: i) SysBinder achieves significantly better factor disentanglement within the slots than conventional object-centric methods, including those based on probabilistic frameworks; ii) notably, for the first time in this line of work, we show factor emergence in visually complex scene images such as CLEVR-Tex (Karazija et al., 2021b); iii) using the emergent factor blocks obtained from SysBinder, we demonstrate factor-level systematicity in controlled scene generation by decoding unseen block combinations; iv) lastly, we provide an extensive analysis by evaluating the performance of several key model variants.
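Because a slot is simply a concatenation of blocks, the factor-level manipulation used for controlled scene generation reduces to block-wise slicing: a novel factor combination is composed by swapping one block between two slots before decoding. The following is a minimal sketch under our own assumptions; the helper name `swap_block` and the concrete block layout are hypothetical, not part of the paper.

```python
import numpy as np

def swap_block(slot_a, slot_b, m, d):
    """Return a copy of slot_a whose m-th d-dim block is taken from slot_b.

    E.g., if block m encodes color, this composes slot_a's object
    with slot_b's color -- a potentially unseen factor combination.
    """
    out = slot_a.copy()
    out[m * d:(m + 1) * d] = slot_b[m * d:(m + 1) * d]
    return out
```

The recombined slot can then be passed to the decoder in place of the original to render the edited scene.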



* Correspondence to singh.gautam@rutgers.edu and sjn.ahn@gmail.com.



Figure 1: Overview. Left: We propose a novel binding mechanism, Neural Systematic Binder, that represents an object as a slot constructed by concatenating multi-dimensional factor representations called blocks. Without any supervision, each block learns to represent a specific factor of the object such as color, shape, or position. Right: Neural Systematic Binder works by combining two binding principles: spatial binding and factor binding. In spatial binding, the slots undergo a competition for each input feature followed by iterative refinement, similar to Slot Attention. In factor binding, unlike Slot Attention, for each slot, the bottom-up information from the attended input features is split and routed to M independent block refinement pathways. Importantly, each pathway provides a representation bottleneck by performing dot-product attention on a memory of learned prototypes.

