REINFORCEMENT LEARNING WITH LATENT FLOW

Abstract

Temporal information is essential to learning effective policies with Reinforcement Learning (RL). However, current state-of-the-art RL algorithms either assume that such information is given as part of the state space or, when learning from pixels, use the simple heuristic of frame-stacking to implicitly capture temporal information present in the image observations. This heuristic contrasts with the current paradigm in video classification architectures, which utilize explicit encodings of temporal information through methods such as optical flow and two-stream architectures to achieve state-of-the-art performance. Inspired by leading video classification architectures, we introduce the Flow of Latents for Reinforcement Learning (Flare), a network architecture for RL that explicitly encodes temporal information through latent vector differences. We show that Flare (i) recovers optimal performance in state-based RL without explicit access to the state velocity, using only positional state information, (ii) achieves state-of-the-art performance on pixel-based continuous control tasks within the DeepMind control benchmark suite, (iii) is the most sample efficient model-free pixel-based RL algorithm on challenging environments in the DeepMind control suite such as quadruped walk, hopper hop, finger turn hard, pendulum swing, and walker run, outperforming the prior model-free state-of-the-art by 1.9× and 1.5× on the 500k and 1M step benchmarks, respectively, and (iv) when augmented on top of Rainbow DQN, outperforms or matches the baseline on a diverse set of challenging Atari games at the 50M time step benchmark.

1. INTRODUCTION

Reinforcement learning (RL) (Sutton & Barto, 1998) holds the promise of enabling artificial agents to solve a diverse set of tasks in uncertain and unstructured environments. Recent developments in RL with deep neural networks have led to tremendous advances in autonomous decision making. Notable examples include classical board games (Silver et al., 2016; 2017), video games (Mnih et al., 2015; Berner et al., 2019; Vinyals et al., 2019), and continuous control (Schulman et al., 2017; Lillicrap et al., 2016; Rajeswaran et al., 2018).

A large body of research has focused on the case where an RL agent is equipped with a compact state representation. Such compact state representations are typically available in simulation (Todorov et al., 2012; Tassa et al., 2018) or in laboratories equipped with elaborate motion capture systems (OpenAI et al., 2018; Zhu et al., 2019; Lowrey et al., 2018). However, state representations are seldom available in unstructured real-world settings like the home. For RL agents to be truly autonomous and widely applicable, sample efficiency and the ability to act from raw sensory observations such as pixels are crucial. Motivated by this understanding, we study the problem of efficient and effective deep RL from pixels.

A number of recent works have made progress toward closing the sample-efficiency and performance gap between deep RL from states and from pixels (Laskin et al., 2020b;a; Hafner et al., 2019a; Kostrikov et al., 2020). An important component of this endeavor has been the extraction of high-quality visual features during the RL process. Laskin et al. (2020a) and Stooke et al. (2020) have shown that features learned either explicitly with auxiliary losses (reconstruction or contrastive losses) or implicitly (through data augmentation) are sufficiently informative to recover the agent's pose information. While existing methods can encode positional information from images, little attention has been devoted to extracting temporal information from a stream of images. As a result, existing deep RL methods from pixels struggle to learn effective policies in more challenging continuous control environments that involve partial observability, sparse rewards, or precise manipulation.
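To make the core idea concrete, the following is a minimal PyTorch sketch of encoding temporal information as latent vector differences, as described in the abstract: each frame is embedded independently, consecutive latents are subtracted to form a latent "flow", and the flow is fused with the latents before being passed to the policy. The layer sizes, latent dimension, and fusion scheme here are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class LatentFlowEncoder(nn.Module):
    """Illustrative sketch: per-frame latents plus their temporal differences.

    Architectural details (conv stack, latent_dim, concatenation-based fusion)
    are assumptions for exposition only.
    """

    def __init__(self, latent_dim: int = 50):
        super().__init__()
        # Per-frame convolutional encoder, shared across time steps.
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.LazyLinear(latent_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width)
        b, t, c, h, w = frames.shape
        z = self.fc(self.conv(frames.reshape(b * t, c, h, w)))
        z = z.reshape(b, t, -1)                 # per-frame latents
        flow = z[:, 1:] - z[:, :-1]             # latent differences across time
        # Fuse latents (dropping the first frame, which has no difference) with flow.
        fused = torch.cat([z[:, 1:], flow], dim=-1)
        return fused.reshape(b, -1)             # flat feature vector for the policy


if __name__ == "__main__":
    encoder = LatentFlowEncoder()
    obs = torch.randn(8, 3, 3, 84, 84)          # batch of 8, stack of 3 RGB frames
    features = encoder(obs)
    print(features.shape)                        # (8, (3 - 1) * 2 * 50) = (8, 200)
```

In this sketch the latent differences play the role that optical flow plays in two-stream video architectures: they carry velocity-like information that positional latents alone do not expose to the policy.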

