LEARNING REPRESENTATIONS FOR REINFORCEMENT LEARNING WITH HIERARCHICAL FORWARD MODELS

Abstract

Learning control from pixels is difficult for reinforcement learning (RL) agents because representation learning and policy learning are intertwined. Previous approaches remedy this issue with auxiliary representation learning tasks, but they either do not consider the temporal aspect of the problem or only consider single-step transitions, which may miss relevant information if important environmental changes take many steps to manifest. We propose Hierarchical k-Step Latent (HKSL), an auxiliary task that learns representations via a hierarchy of forward models that operate at varying magnitudes of step skipping while also learning to communicate between levels in the hierarchy. We evaluate HKSL in a suite of 30 robotic control tasks with and without distractors and a task of our creation. We find that HKSL either converges to higher episodic returns or reaches optimal returns more quickly than several alternative representation learning approaches. Furthermore, we find that HKSL's representations capture task-relevant details accurately across timescales (even in the presence of distractors) and that communication channels between hierarchy levels organize information based on both sides of the communication process, both of which improve sample efficiency.

1. INTRODUCTION

Recently, reinforcement learning (RL) has had significant empirical success in the robotics domain (Kalashnikov et al., 2018; 2021; Lu et al., 2021; Chebotar et al., 2021). However, previous methods often require a dataset of hundreds of thousands or millions of agent-environment interactions to achieve their performance. This level of data collection may not be feasible for the average industry group. Therefore, RL's widespread real-world adoption requires agents to learn a satisfactory control policy in the smallest number of agent-environment interactions possible. Pixel-based state spaces increase the sample efficiency challenge because the RL algorithm is required to learn a useful representation and a control policy simultaneously. A recent thread of research has focused on developing auxiliary learning tasks to address this dual-objective learning problem. These approaches aim to learn a compressed representation of the high-dimensional state space upon which agents learn control. Several task types have been proposed such as image reconstruction (Yarats et al., 2020; Jaderberg et al., 2017), contrastive objectives (Laskin et al., 2020a; Stooke et al., 2021), image augmentation (Yarats et al., 2021; Laskin et al., 2020b), and forward models (Lee et al., 2020a; Zhang et al., 2021; Gelada et al., 2019; Hafner et al., 2020; 2019). Forward models are a natural fit for RL because they exploit the temporal axis by generating representations of the state space that capture information relevant to the environment's transition dynamics. However, previous approaches learn representations by predicting single-step transitions, which may not capture relevant information efficiently if important environmental changes take many steps to manifest.
For example, if we wish to train a soccer-playing agent to score a goal, the pertinent portions of an episode occur at the beginning, when the agent applies a force and direction, and at the end when the agent sees how close the ball came to the goal. Using multi-step transitions in this situation could lead to more efficient learning, as we would focus more on the long-term consequences and less on the large portion of the trajectory where the ball is rolling.
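To make the k-step latent prediction idea concrete, the following is a minimal sketch, not the paper's implementation: linear maps stand in for the learned encoder and forward model, and all dimensions and names (`W_enc`, `W_fwd`, `k_step_loss`) are illustrative assumptions. It shows the core training signal of a latent forward model rolled out for k steps, where predicted latents are matched against latents encoded from the true observations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative, not taken from the paper).
obs_dim, latent_dim, act_dim, k = 8, 4, 2, 3

# Linear maps stand in for the neural encoder and latent forward model.
W_enc = rng.normal(size=(latent_dim, obs_dim)) * 0.1
W_fwd = rng.normal(size=(latent_dim, latent_dim + act_dim)) * 0.1

def encode(obs):
    """Map an observation to a latent vector."""
    return W_enc @ obs

def forward(z, a):
    """Predict the next latent from the current latent and action."""
    return W_fwd @ np.concatenate([z, a])

def k_step_loss(obs_seq, act_seq):
    """Mean squared error between latents predicted by rolling the
    forward model k steps and latents encoded from true observations."""
    z = encode(obs_seq[0])
    loss = 0.0
    for t in range(k):
        z = forward(z, act_seq[t])          # predict next latent
        z_target = encode(obs_seq[t + 1])   # latent of the true next obs
        loss += np.mean((z - z_target) ** 2)
    return loss / k

# Dummy trajectory of k+1 observations and k actions.
obs_seq = rng.normal(size=(k + 1, obs_dim))
act_seq = rng.normal(size=(k, act_dim))
print(k_step_loss(obs_seq, act_seq))
```

A hierarchical variant in the spirit of HKSL would run several such models in parallel, each applying `forward` at a different step-skipping interval, with learned connections passing information between levels.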

