WHEN AND WHY IS PRETRAINING OBJECT-CENTRIC REPRESENTATIONS GOOD FOR REINFORCEMENT LEARNING?

Abstract

Unsupervised object-centric representation (OCR) learning has recently attracted considerable attention as a new paradigm of visual representation learning, owing to its potential as an effective pretraining technique for various downstream tasks in terms of sample efficiency, systematic generalization, and reasoning. Although image-based reinforcement learning (RL) is one of the most important and most frequently mentioned of these downstream tasks, the benefit of OCR in RL has, surprisingly, not been investigated systematically thus far. Instead, most evaluations have focused on rather indirect metrics such as segmentation quality and object property prediction accuracy. In this paper, we empirically investigate the effectiveness of OCR pretraining for image-based reinforcement learning. For systematic evaluation, we introduce a simple object-centric visual RL benchmark and use it to verify a series of hypotheses, answering questions such as "Does OCR pretraining provide better sample efficiency?", "Which types of RL tasks benefit most from OCR pretraining?", and "Can OCR pretraining help with out-of-distribution generalization?". The results suggest that OCR pretraining is particularly effective in tasks where the relationships between objects are important, improving both task performance and sample efficiency compared to single-vector representations. Furthermore, OCR models facilitate generalization to out-of-distribution tasks, such as scenes with a different number of objects or changed object appearances.

1. INTRODUCTION

Motivated by the natural ability of humans to break down complex scenes into their constituent entities and reason about them, there has been a surge of recent research in learning unsupervised object-centric representations (OCRs) (Eslami et al., 2016; Crawford & Pineau, 2019; Kosiorek et al., 2018; Lin et al., 2019; Jiang et al., 2019; Kipf et al., 2019; Veerapaneni et al., 2020; Burgess et al., 2019; Greff et al., 2019; Engelcke et al., 2019; 2021; Locatello et al., 2020; Singh et al., 2021; Kipf et al., 2021; Elsayed et al., 2022; Singh et al., 2022). These approaches learn a structured visual representation of a scene, modeling an image as a composition of objects. By using object-centric representations, downstream tasks can potentially benefit from improved systematic generalization, better sample efficiency, and the ability to reason about the objects in the scene. Since these representations can be obtained from visual inputs without explicit labels, they hold promise as an effective pretraining technique for various downstream tasks, including reinforcement learning (RL). However, most previous works in this line of research have evaluated OCRs only in terms of reconstruction loss, segmentation quality, or property prediction accuracy (Dittadi et al., 2021). While several studies have attempted to apply OCR to RL (Goyal et al., 2019; Zadaianchuk et al., 2020; Watters et al., 2019b; Carvalho et al., 2020), OCR pretraining has not been evaluated systematically and thoroughly on RL tasks. Watters et al. (2019b) evaluate OCR pretraining on a synthetic benchmark, but use a simple search procedure rather than policy learning and consider less complex tasks than our benchmark (e.g., their distractors can be ignored, whereas our tasks require the agent to avoid them). In this study, we investigate when and why OCR pretraining is good for RL.
To do this, we propose a new benchmark covering a variety of object-centric tasks, such as object interaction and relational reasoning. Applying OCR pretraining to this benchmark, we empirically verify a series of hypotheses about decomposed representations that have been discussed previously but not systematically investigated (van Steenkiste et al., 2019; Lake et al., 2017; Greff et al., 2020; Diuk et al., 2008; Kansky et al., 2017; Zambaldi et al., 2018; Mambelli et al., 2022; Goyal et al., 2019; Carvalho et al., 2020; Zadaianchuk et al., 2020). For example, our experiments provide answers to questions such as: "Can decomposed representations improve sample efficiency?", "Can decomposed representations help with out-of-distribution generalization?", and "Can decomposed representations help solve relational reasoning tasks?". Furthermore, we thoroughly investigate important characteristics of applying OCR to RL, such as how the number of objects in the scene affects RL performance, which OCR models work best for RL, and what kind of pooling layer is appropriate for aggregating the object representations. The main contribution of this paper is to provide empirical evidence for the long-standing belief that object-centric representation learning is useful for reinforcement learning. To this end, we make the following specific contributions: (1) we propose a new, simple benchmark to systematically validate OCR pretraining for RL tasks; (2) we evaluate OCR pretraining against various baselines on this benchmark; and (3) we systematically analyze different aspects of OCR pretraining to develop a better understanding of when and why OCR pretraining is good for RL. Lastly, we will release the benchmark and our experiment framework code to the community.
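To make the aggregation question concrete, the following minimal sketch (not the paper's actual implementation; the function name and dimensions are illustrative) shows how a pretrained OCR encoder's slot representations of shape (num_slots, slot_dim) can be pooled into a single fixed-size vector for a downstream policy network. Permutation-invariant pooling such as mean or max keeps the input size fixed even when the number of objects changes, whereas naive concatenation hard-codes the slot count:

```python
import numpy as np


def pool_slots(slots: np.ndarray, method: str = "mean") -> np.ndarray:
    """Aggregate object slot representations (num_slots, slot_dim)
    into a single vector to feed a policy network.

    Mean and max pooling are permutation-invariant and independent of
    the slot count, so the pooled vector keeps the same size when the
    number of objects in the scene changes (relevant for OOD settings).
    """
    if method == "mean":
        return slots.mean(axis=0)
    if method == "max":
        return slots.max(axis=0)
    if method == "concat":
        # Concatenation fixes both the slot count and slot order; it
        # cannot handle a different number of objects at test time.
        return slots.reshape(-1)
    raise ValueError(f"unknown pooling method: {method}")


# Hypothetical OCR encoder outputs: 32-dim slots for 4 objects (training)
# and 6 objects (an out-of-distribution scene with more objects).
rng = np.random.default_rng(0)
slots_train = rng.normal(size=(4, 32))
slots_ood = rng.normal(size=(6, 32))

assert pool_slots(slots_train, "mean").shape == (32,)
assert pool_slots(slots_ood, "mean").shape == (32,)  # size unchanged
assert pool_slots(slots_train, "concat").shape == (128,)
```

A learned, permutation-invariant alternative (e.g., attention pooling with a query vector over the slots) follows the same interface: it maps a variable-size set of slots to one fixed-size vector.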

2. RELATED WORK

Object-Centric Representation Learning. Many recent works have studied the problem of obtaining object-centric representations without supervision (Eslami et al., 2016; Crawford & Pineau, 2019; Kosiorek et al., 2018; Lin et al., 2019; Jiang et al., 2019; Kipf et al., 2019; Veerapaneni et al., 2020; Burgess et al., 2019; Greff et al., 2019; Engelcke et al., 2019; 2021; Locatello et al., 2020; Lin et al., 2020; Singh et al., 2021; Kipf et al., 2021; Elsayed et al., 2022; Singh et al., 2022). These works are motivated by potential benefits to downstream tasks such as better generalization and relational reasoning (Greff et al., 2020; van Steenkiste et al., 2019). There are two main approaches to building slot representations: bounding-box-based methods (Eslami et al., 2016; Crawford & Pineau, 2019; Kosiorek et al., 2018; Lin et al., 2019; Jiang et al., 2019) and segmentation-based methods (Kipf et al., 2019; Veerapaneni et al., 2020; Burgess et al., 2019; Greff et al., 2019; Engelcke et al., 2019; 2021; Locatello et al., 2020; Singh et al., 2021; Kipf et al., 2021; Elsayed et al., 2022; Singh et al., 2022). The bounding-box-based methods infer latent variables for object presence, location, and appearance temporally (Eslami et al., 2016; Kosiorek et al., 2018) or spatially (Crawford & Pineau, 2019; Lin et al., 2019; Jiang et al., 2019). These methods work best for objects of regular shape and size. Segmentation-based methods are more flexible than bounding-box-based methods and have shown good performance on natural scenes and videos (Singh et al., 2021; 2022; Kipf et al., 2021; Elsayed et al., 2022). In this study, we evaluate only the segmentation-based models, as they are more readily applicable to natural tasks.

Object-Centric Representations and Reinforcement Learning. RL is one of the most important and most frequently mentioned downstream tasks where OCR is thought to be helpful. This is because previous work has shown that applying decomposed representations to RL can improve generalization and reasoning and make learning more efficient (Zambaldi et al., 2018; Garnelo et al., 2016; Diuk et al., 2008; Kansky et al., 2017; Stanić et al.; Mambelli et al., 2022; Heravi et al., 2022). However, to our knowledge, no studies have systematically and thoroughly demonstrated these benefits. Goyal et al. (2019) evaluated OCR for RL by learning end-to-end. Through end-to-end learning, OCR learns a task-specific representation, which may be difficult to transfer to other tasks and may not retain the various strengths of unsupervised OCR learning, such as sample efficiency, generalization, and reasoning. Zadaianchuk et al. (2020) investigated OCR pretraining, but applied a bounding-box-based method (Jiang et al., 2019) and proposed and evaluated a new policy only for the limited regime of goal-conditioned RL. Watters et al. (2019b) trained OCR with an exploration policy.

