LANGUAGE-MEDIATED, OBJECT-CENTRIC REPRESENTATION LEARNING

Abstract

We present Language-mediated, Object-centric Representation Learning (LORL), a paradigm for learning disentangled, object-centric scene representations from vision and language. LORL builds on recent advances in unsupervised object segmentation, notably MONet and Slot Attention. While these algorithms learn object-centric representations solely by reconstructing the input image, LORL further enables them to associate the learned representations with concepts, i.e., words for object categories, properties, and spatial relationships, drawn from language input. These object-centric concepts derived from language in turn facilitate the learning of object-centric representations. LORL can be integrated with various language-agnostic unsupervised segmentation algorithms. Experiments show that integrating LORL consistently improves the object segmentation performance of MONet and Slot Attention on two datasets. We also show that concepts learned by LORL, in conjunction with segmentation algorithms such as MONet, aid downstream tasks such as referring expression comprehension.

1. INTRODUCTION

Cognitive studies show that human infants develop object individuation skills from diverse sources of information: spatio-temporal information, object property information, and language (Xu, 1999; 2007; Westermann & Mareschal, 2014). Specifically, young infants develop object-based attention that disentangles the motion and location of objects from their visual appearance features. Later, they can leverage the knowledge acquired through word learning to solve the problem of object individuation: words provide clues about object identity and type. The general picture from cognitive science is that object perception and language co-develop in support of one another (Bloom, 2002). Our long-term goal is to endow machines with similar abilities. In this paper, we focus on how language may support object segmentation.

Many recent works have studied unsupervised object representation learning, though without language. For example, factorized, object-centric scene representations have been used in various prediction (Goel et al., 2018), reasoning (Yi et al., 2018; Mao et al., 2019), and planning tasks (Veerapaneni et al., 2020), but these works have not considered the role of language and how it may help object representation learning.

As a concrete example, consider the input images shown in Fig. 1 and the paired questions. From language, we can learn to associate concepts such as black, pan, and legs with the referred object's visual appearance. Further, language provides cues about how an input scene should be segmented into individual objects: a wrong parsing of the input scene will lead to an incorrect answer to the question. From such failures we can learn that the handle belongs to the frying pan (Fig. 1a) and that the chair has four legs (Fig. 1b).
Motivated by these observations, we propose a computational learning paradigm, Language-mediated, Object-centric Representation Learning (LORL), which associates learned object-centric representations with their visual appearance (masks) in images and with concepts, i.e., words for object properties such as color, shape, and material, as provided in language. Here the language input can be either descriptive sentences or question-answer pairs. LORL requires no annotations of object masks, categories, or properties during learning. In LORL, four modules are jointly trained. The first is an image encoder, learning to encode an image into factorized, object-centric representations. The second is an image decoder, learning to
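The image encoder in LORL can be instantiated with existing segmentation backbones such as Slot Attention. Purely as an illustration of how such an encoder factorizes a scene into per-object vectors, the following is a minimal NumPy sketch of the Slot Attention competition step (Locatello et al., 2020). This is not the paper's implementation: the learned linear maps are replaced by random projections, and the GRU/MLP slot update is replaced by a plain attention-weighted mean.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, num_slots=4, iters=3, seed=0):
    """Simplified Slot Attention: slots iteratively compete for input
    features via a softmax over the slot axis, yielding one vector per
    (putative) object. `inputs` is an [n, d] array of image features,
    e.g. a flattened CNN feature map."""
    rng = np.random.default_rng(seed)
    n, d = inputs.shape
    # Random projections stand in for the learned q/k/v linear maps.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    slots = rng.standard_normal((num_slots, d))  # sampled initial slots
    k, v = inputs @ Wk, inputs @ Wv
    for _ in range(iters):
        q = slots @ Wq
        # Softmax over slots: each feature distributes itself among slots,
        # so slots compete for the explanation of each image location.
        attn = softmax(k @ q.T / np.sqrt(d), axis=1)   # [n, num_slots]
        attn = attn / attn.sum(axis=0, keepdims=True)  # weighted mean weights
        # Real Slot Attention feeds this update through a GRU + MLP;
        # here we simply replace the slots with the weighted mean.
        slots = attn.T @ v                             # [num_slots, d]
    return slots, attn

feats = np.random.default_rng(1).standard_normal((16, 8))
slots, attn = slot_attention(feats)  # slots: [4, 8]; attn: [16, 4]
```

The attention map `attn` plays the role of soft object masks: column j indicates which image locations slot j accounts for, which is what LORL's language signal can then refine.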

