LEARNING REPRESENTATIONS FOR REINFORCEMENT LEARNING WITH HIERARCHICAL FORWARD MODELS

Abstract

Learning control from pixels is difficult for reinforcement learning (RL) agents because representation learning and policy learning are intertwined. Previous approaches remedy this issue with auxiliary representation learning tasks, but they either ignore the temporal aspect of the problem or consider only single-step transitions, which may miss relevant information if important environmental changes take many steps to manifest. We propose Hierarchical k-Step Latent (HKSL), an auxiliary task that learns representations via a hierarchy of forward models operating at varying magnitudes of step skipping while also learning to communicate between levels in the hierarchy. We evaluate HKSL in a suite of 30 robotic control tasks, with and without distractors, and in a task of our creation. We find that HKSL either converges to higher episodic returns or reaches optimal returns more quickly than several alternative representation learning approaches. Furthermore, we find that HKSL's representations accurately capture task-relevant details across timescales (even in the presence of distractors) and that communication channels between hierarchy levels organize information based on both sides of the communication process; both properties improve sample efficiency.

1. INTRODUCTION

Recently, reinforcement learning (RL) has had significant empirical success in the robotics domain (Kalashnikov et al., 2018; 2021; Lu et al., 2021; Chebotar et al., 2021). However, previous methods often require a dataset of hundreds of thousands or millions of agent-environment interactions to achieve their performance. This level of data collection may not be feasible for the average industry group. Therefore, RL's widespread real-world adoption requires agents to learn a satisfactory control policy in the smallest number of agent-environment interactions possible. Pixel-based state spaces increase the sample-efficiency challenge because the RL algorithm must learn a useful representation and a control policy simultaneously. A recent thread of research has focused on developing auxiliary learning tasks to address this dual-objective learning problem. These approaches aim to learn a compressed representation of the high-dimensional state space upon which agents learn control. Several task types have been proposed, such as image reconstruction (Yarats et al., 2020; Jaderberg et al., 2017), contrastive objectives (Laskin et al., 2020a; Stooke et al., 2021), image augmentation (Yarats et al., 2021; Laskin et al., 2020b), and forward models (Lee et al., 2020a; Zhang et al., 2021; Gelada et al., 2019; Hafner et al., 2020; 2019). Forward models are a natural fit for RL because they exploit the temporal axis by generating representations of the state space that capture information relevant to the environment's transition dynamics. However, previous approaches learn representations by predicting single-step transitions, which may not capture relevant information efficiently if important environmental changes take many steps to manifest.
For example, if we wish to train a soccer-playing agent to score a goal, the pertinent portions of an episode occur at the beginning, when the agent applies a force and direction, and at the end, when the agent sees how close the ball came to the goal. Using multi-step transitions in this situation could lead to more efficient learning, as we would focus more on the long-term consequences and less on the large portion of the trajectory where the ball is simply rolling. In this paper, we introduce Hierarchical k-Step Latent (HKSL)¹, an auxiliary task for RL agents that explicitly captures information in the environment at varying levels of temporal coarseness. HKSL accomplishes this by leveraging a hierarchical latent forward model in which each level of the hierarchy predicts transitions with a different number of steps skipped. Levels that skip more steps should capture a coarser understanding of the environment by focusing on changes that take more steps to manifest, and vice versa for levels that skip fewer steps. HKSL also learns to share information between levels via a communication module that passes information from higher to lower levels. As a result, HKSL learns a set of representations that give the downstream RL algorithm information on both short- and long-term changes in the environment. We evaluate HKSL and various baselines in a suite of 30 DMControl tasks (Tassa et al., 2018; Stone et al., 2021) that contains environments with and without distractors of varying types and intensities. We also evaluate our algorithms in "Falling Pixels", a task of our creation that requires agents to track objects that move at varying speeds. Our goal is to learn a well-performing control policy in as few agent-environment interactions as possible.
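To make the mechanism concrete, the sketch below illustrates one hypothetical way a hierarchy of step-skipping latent forward models with top-down communication might be rolled out. The function names (`forward`, `communicate`) and the rollout structure are our illustrative assumptions, not the authors' implementation: each level rolls its forward model with step size k (coarser levels skip more environment steps), and a communication step conditions the finer level on the coarser level's latest latent.

```python
# Hypothetical sketch of a hierarchical k-step latent rollout; an
# illustration of the idea, not the authors' code.
def rollout_hierarchy(z0, actions, step_sizes, forward, communicate):
    """Roll latent forward models at several temporal coarsenesses.

    z0: initial latent representation
    actions: one action per environment step
    step_sizes: e.g. [3, 1], ordered from coarsest to finest level
    forward(z, a_chunk, level): predicts the latent after len(a_chunk) steps
    communicate(z_coarse, z_fine): merges top-down info into the fine latent
    """
    latents = []      # per-level lists of predicted latents
    top_down = None   # latents from the next-coarser level
    for level, k in enumerate(step_sizes):
        z = z0
        level_latents = [z]
        # Skip k steps at a time: coarser levels make fewer, larger jumps.
        for t in range(0, len(actions) - k + 1, k):
            z = forward(z, actions[t:t + k], level)
            if top_down is not None:
                # Condition this level on the coarser level's information.
                z = communicate(top_down[-1], z)
            level_latents.append(z)
        latents.append(level_latents)
        top_down = level_latents
    return latents
```

With step sizes [3, 1] over six actions, the coarse level makes two predictions while the fine level makes six, so the downstream RL algorithm can be given both a coarse and a fine view of the same trajectory.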
We test our algorithms with and without distractors because real-world RL-controlled robots need to work well in controlled settings (e.g., a laboratory) and uncontrolled settings (e.g., a public street). Moreover, distractors may change at speeds independent of task-relevant information, increasing the challenge of relating agent actions to changes in pixels. Therefore, real-world RL deployments should explicitly learn representations that tie agent actions to long- and short-term changes in the environment. In our DMControl experiments, HKSL reaches an interquartile mean of evaluation returns that is 29% higher than DrQ (Yarats et al., 2021), 74% higher than CURL (Laskin et al., 2020a), 24% higher than PI-SAC (Lee et al., 2020b), and 359% higher than DBC (Zhang et al., 2021). Our experiments in Falling Pixels show that HKSL converges to an interquartile mean of evaluation returns that is 24% higher than DrQ, 35% higher than CURL, 31% higher than PI-SAC, and 44% higher than DBC. We analyze HKSL's hierarchical model and find that its representations capture task-relevant details more accurately earlier in training than our baselines. Additionally, we find that HKSL's communication manager considers both sides of the communication process, thereby giving forward models information that better contextualizes their learning process. Finally, we provide data from all training runs for all benchmarked methods.

2. BACKGROUND

We study an RL formulation wherein an agent learns a control policy within a partially observable Markov decision process (POMDP) (Bellman, 1957; Kaelbling et al., 1998), defined by the tuple (S, O, A, P_s, P_o, R, γ). S is the ground-truth state space, O is a pixel-based observation space, A is the action space, P_s : S × A × S → [0, 1] is the state transition probability function, P_o : S × A × O → [0, 1] is the observation probability function, R : S × A → ℝ is the reward function that maps states and actions to a scalar signal, and γ ∈ [0, 1) is a discount factor. The agent does not directly observe the state s_t ∈ S at step t, but instead receives an observation o_t ∈ O, which we specify as a stack of the last three images. At each step t, the agent samples an action a_t ∈ A with probability given by its control policy conditioned on the current observation, π(a_t | o_t). Given the action, the agent receives a reward r_t = R(s_t, a_t), the POMDP transitions into a next state s_{t+1} ∈ S with probability P_s(s_t, a_t, s_{t+1}), and the next observation (stack of pixels) o_{t+1} ∈ O is sampled with probability P_o(s_{t+1}, a_t, o_{t+1}). Within this POMDP, the agent must learn a control policy that maximizes the expected sum of discounted rewards over the episode's time horizon T: arg max_π E_{a∼π}[ Σ_{t=1}^{T} γ^t r_t ].
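As a concrete illustration, the objective above can be evaluated on a single sampled episode. The helper below is a minimal sketch (ours, not from the paper) that computes Σ_{t=1}^{T} γ^t r_t, with discounting starting at t = 1 as written in the objective.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t for t = 1..T over one episode's rewards.

    Matches the objective as stated, where the first reward is
    discounted by gamma^1 (some implementations use gamma^(t-1)).
    """
    return sum(gamma ** t * r for t, r in enumerate(rewards, start=1))
```

For example, an episode with rewards [1, 1, 1] and γ = 0.5 yields 0.5 + 0.25 + 0.125 = 0.875.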

3. RELATED WORK

Representation learning in RL. A thread of research has focused on developing representation learning methods that aid policy learning for RL agents. In model-free RL, representation learning objectives have been explored as auxiliary tasks in forms such as contrastive objectives (Laskin et al., 2020a; Stooke et al., 2021), image augmentation (Yarats et al., 2021; Laskin et al., 2020b), image reconstruction (Yarats et al., 2020), and information-theoretic objectives (Lee et al.,


¹ Code: https://anonymous.4open.science/r/hksl-0D60/README.md