TOWARDS LEARNING IMPLICIT SYMBOLIC REPRESENTATION FOR VISUAL REASONING

Abstract

Visual reasoning tasks are designed to test a learning algorithm's capability to infer causal relationships, discover object interactions, and understand temporal dynamics, all from visual cues. It is commonly believed that to achieve compositional generalization on visual reasoning, an explicit abstraction of the visual scene must be constructed; for example, object detection can be applied to the visual input to produce representations that are then processed by a neural network or a neuro-symbolic framework. We demonstrate that a simple and general self-supervised approach is able to learn implicit symbolic representations with general-purpose neural networks, enabling the end-to-end learning of visual reasoning directly from raw visual inputs. Our proposed approach "compresses" each frame of a video into a small set of tokens with a transformer network. The self-supervised learning objective is to reconstruct each image based on the compressed temporal context. To minimize the reconstruction loss, the network must learn a compact representation for each image, as well as capture temporal dynamics and object permanence from temporal context. We evaluate the proposed approach on two visual reasoning benchmarks, CATER and ACRE. We observe that self-supervised pretraining is essential for our end-to-end trained neural network to achieve compositional generalization, and our proposed method achieves performance on par with or better than recent neuro-symbolic approaches that often require additional object-level supervision.

1. INTRODUCTION

This paper investigates whether an end-to-end trained neural network is able to solve challenging visual reasoning tasks (Zhang et al., 2021; Girdhar & Ramanan, 2019; Yi et al., 2019) that involve inferring causal relationships, discovering object relations, and capturing temporal dynamics. A prominent approach (Shamsian et al., 2020) to visual reasoning is to construct a structured and interpretable representation from the visual inputs, and then apply symbolic programs (Mao et al., 2019) or neural networks (Ding et al., 2021) to solve the reasoning task. Despite appealing properties, such as interpretability and the ease of injecting expert knowledge into the learning framework, it is practically challenging to determine what types of symbols to use and how to detect them reliably from visual data. In fact, the suitable symbolic representation for a single scene may differ significantly across tasks: the representation for modeling a single human's kinematics (e.g. with body parts and joints) is unlikely to be the same as that for modeling group social behaviors (e.g. where each pedestrian can be viewed as a whole entity). With the success of unified neural frameworks for multi-task learning (Bommasani et al., 2021), it is desirable to have a unified input interface (e.g. raw pixels) and let the neural network learn to dynamically extract suitable representations for different tasks. However, how to learn a distributed representation with a deep neural network that behaves and generalizes similarly to learning methods based on symbolic representations (Zhang et al., 2021) for visual reasoning remains an open problem. The key hypothesis we make in this paper is that a general-purpose neural network, such as a Transformer (Vaswani et al., 2017), can be turned into an implicit symbolic concept learner with self-supervised pre-training.
For reasoning with image and video cues, the concepts are often organized as object-centric, as objects usually serve as the basic units in visual reasoning tasks. Our proposed approach is inspired by the success of self-supervised learning of object detectors with neural networks (Burgess et al., 2019; Locatello et al., 2020; Niemeyer & Geiger, 2021) and of object masks in self-supervised classification networks (Caron et al., 2021). It is also motivated by concept binding in neuroscience (Treisman, 1996; Roskies, 1999; Feldman, 2013) and in machine learning (Greff et al., 2020), where concept binding for raw visual inputs refers to the process of segregating and representing visual scenes as a collection of (distributed) concept representations, which can be composed and utilized to solve downstream recognition and reasoning tasks. The concepts are bound in an object-centric fashion, where attributes (e.g. colors, shapes, sizes) of the same object are associated via dynamic information routing. Unlike explicit symbolic representation, implicit symbolic representation via dynamic information binding in a neural network requires neither a predefined concept vocabulary nor the construction of concept classifiers.
The implicit representation can also be "finetuned" directly on the target tasks, so it does not suffer from the early-commitment or information-loss issues that may arise when visual inputs are converted into symbols and frozen descriptors (e.g. via object detection and classification). Our proposed representation learning framework, the implicit symbolic concept learner (IS-CL), consists of two main components: first, a single image is compressed into a small set of tokens by a neural network. This is achieved with a vision transformer (ViT) network (Dosovitskiy et al., 2020) with multiple "slot" tokens (akin to the [CLS] token in ViT) that attend to the image inputs. Second, the slot tokens are provided as context information via a temporal transformer network for the other images in the same video, where the goal is to perform video reconstruction via the masked autoencoding objective (He et al., 2022) with the temporal context. Despite its simplicity, the reconstruction objective encourages the emergence of two desired properties in the pretrained network: first, to provide context useful for video reconstruction, the image encoder must learn a compact representation of the scene with its slot tokens. Second, to utilize the context cues, the temporal transformer must capture temporal dynamics and object permanence across frames.
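The two-stage pretraining pipeline described above (a slot-token image encoder, followed by a temporal transformer trained with a masked-frame reconstruction loss) can be sketched as follows. This is a minimal PyTorch sketch under assumed hyperparameters; the module names (`SlotImageEncoder`, `TemporalReconstructor`), the slot count, and the toy linear patchifier are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of IS-CL-style pretraining: slot tokens compress each frame,
# and a temporal transformer reconstructs a masked frame from the other frames'
# slot tokens. All sizes and module names are illustrative assumptions.
import torch
import torch.nn as nn

class SlotImageEncoder(nn.Module):
    """Stage 1: compress one frame into a small set of slot tokens."""
    def __init__(self, dim=256, num_slots=4):
        super().__init__()
        self.patch_embed = nn.Linear(32 * 32 * 3, dim)  # toy patchifier
        self.slots = nn.Parameter(torch.randn(1, num_slots, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.num_slots = num_slots

    def forward(self, patches):              # patches: (B, P, 32*32*3)
        x = self.patch_embed(patches)
        slots = self.slots.expand(x.size(0), -1, -1)
        # Slot tokens attend to image patches through the shared encoder.
        out = self.encoder(torch.cat([slots, x], dim=1))
        return out[:, :self.num_slots]        # keep only the slot tokens

class TemporalReconstructor(nn.Module):
    """Stage 2: reconstruct a masked frame from the other frames' slots."""
    def __init__(self, dim=256, num_patches=64):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.decode = nn.Linear(dim, 32 * 32 * 3)
        self.queries = nn.Parameter(torch.randn(1, num_patches, dim))

    def forward(self, context_slots):        # (B, T*num_slots, dim)
        q = self.queries.expand(context_slots.size(0), -1, -1)
        h = self.temporal(torch.cat([q, context_slots], dim=1))
        return self.decode(h[:, :q.size(1)])  # predicted masked-frame patches

B, T, P = 2, 4, 64
video = torch.randn(B, T, P, 32 * 32 * 3)    # video as per-frame patch tensors
enc, dec = SlotImageEncoder(), TemporalReconstructor()
# Mask the last frame; encode the rest into slot tokens as temporal context.
context = torch.cat([enc(video[:, t]) for t in range(T - 1)], dim=1)
recon = dec(context)
loss = nn.functional.mse_loss(recon, video[:, -1])  # masked-frame reconstruction
```

Note the information bottleneck: the reconstructor never sees the masked frame's pixels, only a few slot tokens per context frame, which is what pressures the encoder to pack scene content compactly and the temporal transformer to model dynamics.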



Figure 1: Comparison between a neuro-symbolic approach (e.g. Mao et al. (2019)), a hybrid approach with learned object embeddings (e.g. Ding et al. (2021)), and our proposed approach for visual reasoning. The illustration of each model family flows upwards, where visual inputs are encoded by neural networks (stage 1), and then processed by symbolic programs or another neural network to generate reasoning predictions (stage 2). Compared to (a) and (b), our approach does not require a separate "preprocessing" stage to extract the symbolic representation from visual inputs, and the self-supervised pretrained neural network can be end-to-end "finetuned" to the downstream visual reasoning tasks.

