LEARNING GENERALIZABLE VISUAL REPRESENTATIONS VIA INTERACTIVE GAMEPLAY

Abstract

A growing body of research suggests that embodied gameplay, prevalent not just in human cultures but across a variety of animal species including turtles and ravens, is critical in developing the neural flexibility for creative problem solving, decision making, and socialization. Comparatively little is known regarding the impact of embodied gameplay upon artificial agents. While recent work has produced agents proficient in abstract games, these environments are far removed from the real world and thus these agents can provide little insight into the advantages of embodied play. Hiding games, such as hide-and-seek, played universally, provide a rich ground for studying the impact of embodied gameplay on representation learning in the context of perspective taking, secret keeping, and false-belief understanding. Here we are the first to show that embodied adversarial reinforcement learning agents playing Cache, a variant of hide-and-seek, in a high-fidelity interactive environment, learn generalizable representations of their observations encoding information such as object permanence, free space, and containment. Moving closer to biologically motivated learning strategies, our agents' representations, enhanced by intentionality and memory, are developed through interaction and play. These results serve as a model for studying how facets of vision develop through interaction, provide an experimental framework for assessing what is learned by artificial agents, and demonstrate the value of moving from large, static datasets towards experiential, interactive representation learning.

1. INTRODUCTION

We are interested in studying what facets of their environment artificial agents learn to represent through interaction and gameplay. We study this question within the context of hide-and-seek, for which proficiency requires an ability to navigate an environment and manipulate objects, as well as an understanding of visual relationships, object affordances, and perspective. Inspired by behavior observed in juvenile ravens (Burghardt, 2005), we focus on a variant of hide-and-seek called cache, in which agents hide objects instead of themselves. Advances in deep reinforcement learning have shown that, in abstract games (e.g. Go and Chess) and visually simplistic environments (e.g. Atari and grid-worlds) with limited interaction, artificial agents exhibit surprising emergent behaviours that enable proficient gameplay (Mnih et al., 2015; Silver et al., 2017); indeed, recent work (Chen et al., 2019; Baker et al., 2020) has shown this in the context of hiding games. Our interest, however, is in understanding how agents learn to represent their visual environment through gameplay that requires varied interaction, in a high-fidelity environment grounded in the real world. This requires a fundamental shift away from existing popular environments and a rethinking of how the capabilities of artificial agents are evaluated. Our agents must first be embodied within an environment allowing for diverse interaction and providing rich visual output. For this we leverage AI2-THOR (Kolve et al., 2017), a near photo-realistic, interactive, simulated 3D environment of indoor living spaces, see Fig. 1a. Our agents are parameterized using deep neural networks and trained adversarially using the paradigm of reinforcement learning.

Representation learning within the computer vision community is largely focused on developing static image representations (SIRs) whose quality is measured by their utility in downstream tasks (e.g. classification, depth prediction, etc.) (Zamir et al., 2018).
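The adversarial structure underlying such training can be made concrete with a toy sketch. The following is an illustrative, pure-Python simplification (not the implementation used in this work, and the hiding-spot names are invented for illustration): a hider places an object, a seeker searches for it, and the two receive exactly opposing rewards, so improving one agent pressures the other.

```python
# Toy sketch of the adversarial (zero-sum) reward structure of a hiding
# game. All names and the environment here are illustrative stand-ins,
# not the paper's Cache implementation.
import random

random.seed(0)

HIDING_SPOTS = ["drawer", "cupboard", "under_sofa", "behind_tv"]

def play_episode(hider_policy, seeker_policy, num_guesses=2):
    """One episode: the hider picks a spot, the seeker gets a few guesses."""
    spot = hider_policy(HIDING_SPOTS)
    found = any(seeker_policy(HIDING_SPOTS) == spot for _ in range(num_guesses))
    # Zero-sum assignment: the seeker is rewarded for finding the object,
    # the hider for evading; their rewards always sum to zero.
    seeker_reward = 1.0 if found else -1.0
    return -seeker_reward, seeker_reward

# Uniformly random policies as placeholders for the learned agents.
hider, seeker = random.choice, random.choice
h_reward, s_reward = play_episode(hider, seeker)
assert h_reward + s_reward == 0.0  # rewards are exactly adversarial
```

In the full setting the policies are deep networks acting from pixels, but the same zero-sum coupling drives both agents to improve against each other.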
Our first set of experiments shows that our agents develop low-level visual understanding of individual images, measured by their capacity to perform a collection of standard tasks from the computer vision literature; these tasks include pixel-to-pixel depth (Saxena et al., 2006) and surface normal (Fouhey et al., 2013) prediction from a single image. While SIRs are clearly an important facet of representation learning, they are also definitionally unable to represent an environment as a whole: without the ability to integrate observations through time, a representation can only ever capture a single snapshot of space and time. To represent an environment holistically, we require dynamic image representations (DIRs). Unlike for SIRs, we are unaware of any well-established benchmarks for DIRs. In order to investigate what has been learned by our agents' DIRs, we develop a suite of experiments loosely inspired by experiments performed on infants and young children. These experiments demonstrate our agents' ability to integrate observations through time and understand spatial relationships between objects (Casasola et al., 2003), occlusion (Hespos et al., 2009), object permanence (Piaget, 1954), and seriation (Piaget, 1954) of free space. It is important to stress that this work focuses on studying how play and interaction contribute to representation learning in artificial agents, and not on developing a new, state-of-the-art methodology for representation learning. Nevertheless, to better situate our results in the context of existing work, we provide strong baselines in our experiments; e.g. in our low-level vision experiments we compare against a fully supervised model trained on ImageNet (Deng et al., 2009).
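The evaluation recipe for a learned representation can be sketched in a few lines. The following is a minimal, assumed setup (synthetic data and a random stand-in encoder, not the paper's networks or tasks): the encoder is frozen, and only a lightweight linear readout is fit to a downstream regression target, so any success must come from information already present in the representation.

```python
# Minimal sketch of linear probing: freeze an encoder, fit only a linear
# decoder on top of its features. Data, encoder, and target are synthetic
# stand-ins, not the paper's depth/surface-normal setup.
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: raw "observations" and a frozen encoder with fixed weights.
X = rng.normal(size=(256, 32))           # 256 observations, 32 raw features
W_frozen = rng.normal(size=(32, 16))     # frozen encoder weights (never updated)
encode = lambda x: np.tanh(x @ W_frozen)

# Synthetic downstream target that depends on the encoding plus noise,
# so a linear probe can in principle recover it.
y = encode(X) @ rng.normal(size=16) + 0.1 * rng.normal(size=256)

Z = encode(X)
# Linear probe: least-squares readout on top of the frozen features.
w, *_ = np.linalg.lstsq(Z, y, rcond=None)
residual = np.mean((Z @ w - y) ** 2)
assert residual < np.var(y)  # the probe explains variance in the target
```

For pixel-wise tasks such as depth prediction, the readout would instead be a small decoder network producing a per-pixel output, but the principle is identical: the frozen representation does all the heavy lifting.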
Our results provide compelling evidence that: (a) on a suite of low-level computer vision tasks within AI2-THOR, static representations learned by playing cache perform very competitively with (and often outperform) strong unsupervised and fully supervised methods, (b) these static representations, trained using only synthetic images, obtain non-trivial transfer to downstream tasks using real-world images, (c)



Figure 1: A game of cache within AI2-THOR. a Multiple views of a single agent who, over time, explores and interacts with objects. b-f The five consecutive stages of cache. In c the agent must choose, at a high level, where to hide the object, using its map of the environment to make this choice, while in d the agent must choose where to hide the object from a first-person viewpoint. In e the object being manipulated is a tomato.

