LEARNING GENERALIZABLE VISUAL REPRESENTATIONS VIA INTERACTIVE GAMEPLAY

Abstract

A growing body of research suggests that embodied gameplay, prevalent not just in human cultures but across a variety of animal species including turtles and ravens, is critical in developing the neural flexibility for creative problem solving, decision making, and socialization. Comparatively little is known regarding the impact of embodied gameplay upon artificial agents. While recent work has produced agents proficient in abstract games, these environments are far removed from the real world and thus these agents provide little insight into the advantages of embodied play. Hiding games, such as hide-and-seek, which are played universally, provide rich ground for studying the impact of embodied gameplay on representation learning in the context of perspective taking, secret keeping, and false belief understanding. Here we are the first to show that embodied adversarial reinforcement learning agents playing Cache, a variant of hide-and-seek, in a high-fidelity, interactive environment, learn generalizable representations of their observations encoding information such as object permanence, free space, and containment. Moving closer to biologically motivated learning strategies, our agents' representations, enhanced by intentionality and memory, are developed through interaction and play. These results serve as a model for studying how facets of vision develop through interaction, provide an experimental framework for assessing what is learned by artificial agents, and demonstrate the value of moving from large, static datasets towards experiential, interactive representation learning.

1. INTRODUCTION

We are interested in studying which facets of their environment artificial agents learn to represent through interaction and gameplay. We study this question within the context of hide-and-seek, for which proficiency requires an ability to navigate an environment and manipulate objects, as well as an understanding of visual relationships, object affordances, and perspective. Inspired by behavior observed in juvenile ravens (Burghardt, 2005), we focus on a variant of hide-and-seek called Cache, in which agents hide objects instead of themselves. Advances in deep reinforcement learning have shown that, in abstract games (e.g. Go and Chess) and visually simplistic environments (e.g. Atari and grid-worlds) with limited interaction, artificial agents exhibit surprising emergent behaviours that enable proficient gameplay (Mnih et al., 2015; Silver et al., 2017); indeed, recent work (Chen et al., 2019; Baker et al., 2020) has shown this in the context of hiding games. Our interest, however, is in understanding how agents learn to represent their visual environment, through gameplay that requires varied interaction, in a high-fidelity environment grounded in the real world. This requires a fundamental shift away from existing popular environments and a rethinking of how the capabilities of artificial agents are evaluated. Our agents must first be embodied within an environment allowing for diverse interaction and providing rich visual output. For this we leverage AI2-THOR (Kolve et al., 2017), a near photo-realistic, interactive, simulated 3D environment of indoor living spaces; see Fig. 1a. Our agents are parameterized using deep neural networks and trained adversarially using the paradigms of reinforcement learning.
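To make the adversarial self-play objective concrete, the sketch below implements a deliberately simplified, one-step tabular version of a hiding game: a hider picks a cell in which to cache an object, a seeker picks a cell to search, and both policies are updated with REINFORCE on opposing rewards. This is an illustrative toy only; it is not the architecture, environment, or training procedure used in this work (the actual agents are deep networks acting over many steps in AI2-THOR), and all names and hyperparameters here are our own assumptions.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def train_cache(n_cells=4, episodes=5000, lr=0.05, seed=0):
    """Toy adversarial self-play for a one-step 'cache' game.

    The hider samples a cell from a softmax policy; the seeker does the
    same. The seeker is rewarded for matching the hider's cell, the hider
    for avoiding a match, so the game is zero-sum up to a constant.
    Both tabular policies are updated with REINFORCE using a running
    baseline to reduce variance. Hyperparameters are illustrative.
    """
    rng = np.random.default_rng(seed)
    hider_logits = np.zeros(n_cells)
    seeker_logits = np.zeros(n_cells)
    baseline_h = baseline_s = 0.0
    seeker_rewards = []
    for _ in range(episodes):
        p = softmax(hider_logits)
        q = softmax(seeker_logits)
        cache_cell = rng.choice(n_cells, p=p)
        search_cell = rng.choice(n_cells, p=q)
        r_s = 1.0 if search_cell == cache_cell else 0.0  # seeker finds object
        r_h = 1.0 - r_s                                  # hider evades seeker
        # REINFORCE: grad of log pi(a) wrt softmax logits = onehot(a) - pi
        g_h = np.eye(n_cells)[cache_cell] - p
        g_s = np.eye(n_cells)[search_cell] - q
        hider_logits += lr * (r_h - baseline_h) * g_h
        seeker_logits += lr * (r_s - baseline_s) * g_s
        # Exponential-moving-average baselines for each agent.
        baseline_h += 0.01 * (r_h - baseline_h)
        baseline_s += 0.01 * (r_s - baseline_s)
        seeker_rewards.append(r_s)
    return softmax(hider_logits), softmax(seeker_logits), seeker_rewards
```

Because the rewards are directly opposed, each agent's improvement is the other's adversity: the seeker learns to concentrate search where the hider caches, which in turn pressures the hider to spread its choices, mirroring in miniature the adversarial pressure that drives representation learning in Cache.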

