WHEN AND WHY IS PRETRAINING OBJECT-CENTRIC REPRESENTATIONS GOOD FOR REINFORCEMENT LEARNING?

Abstract

Unsupervised object-centric representation (OCR) learning has recently drawn considerable attention as a new paradigm of visual representation learning. This is because of its potential to serve as an effective pretraining technique for various downstream tasks, offering improvements in sample efficiency, systematic generalization, and reasoning. Although image-based reinforcement learning (RL) is one of the most important and thus most frequently mentioned of these downstream tasks, the benefit of OCR in RL has surprisingly not yet been investigated systematically. Instead, most evaluations have focused on rather indirect metrics such as segmentation quality and object property prediction accuracy. In this paper, we empirically investigate the effectiveness of OCR pretraining for image-based reinforcement learning. For systematic evaluation, we introduce a simple object-centric visual RL benchmark and verify a series of hypotheses answering questions such as "Does OCR pretraining provide better sample efficiency?", "Which types of RL tasks benefit most from OCR pretraining?", and "Can OCR pretraining help with out-of-distribution generalization?". The results suggest that OCR pretraining is particularly effective in tasks where the relationships between objects are important, improving both task performance and sample efficiency compared to single-vector representations. Furthermore, OCR models facilitate generalization to out-of-distribution tasks, such as those that change the number of objects or the appearance of the objects in the scene.

1. INTRODUCTION

Motivated by the natural ability of humans to break down complex scenes into their constituent entities and reason about them, there has been a surge of recent research in learning unsupervised object-centric representations (OCRs) (Eslami et al., 2016; Crawford & Pineau, 2019; Kosiorek et al., 2018; Lin et al., 2019; Jiang et al., 2019; Kipf et al., 2019; Veerapaneni et al., 2020; Burgess et al., 2019; Greff et al., 2019; Engelcke et al., 2019; 2021; Locatello et al., 2020; Singh et al., 2021; Kipf et al., 2021; Elsayed et al., 2022; Singh et al., 2022). These approaches learn a structured visual representation of a scene, modeling an image as a composition of objects. By using object-centric representations, downstream tasks can potentially benefit from improved systematic generalization, better sample efficiency, and the ability to reason about the objects in the scene. Since these representations can be obtained from visual inputs without explicit labels, they hold promise as an effective pretraining technique for various downstream tasks, including reinforcement learning (RL). However, most previous works in this line of research have evaluated OCRs only in terms of reconstruction loss, segmentation quality, or property prediction accuracy (Dittadi et al., 2021). While several studies have attempted to apply OCR to RL (Goyal et al., 2019; Zadaianchuk et al., 2020; Watters et al., 2019b; Carvalho et al., 2020), OCR pretraining has not been evaluated systematically and thoroughly on RL tasks. Watters et al. (2019b) evaluate OCR pretraining on a synthetic benchmark, but they use a simple search procedure rather than policy learning, and their tasks are less complex than those in our benchmark (e.g., their distractors can be ignored, while our tasks require the agent to avoid distractors).

