LEARNING INVARIANT REPRESENTATIONS FOR REINFORCEMENT LEARNING WITHOUT RECONSTRUCTION

Abstract

We study how representation learning can accelerate reinforcement learning from rich observations, such as images, without relying on either domain knowledge or pixel reconstruction. Our goal is to learn representations that provide for effective downstream control and invariance to task-irrelevant details. Bisimulation metrics quantify behavioral similarity between states in continuous MDPs, and we propose using them to learn robust latent representations that encode only the task-relevant information from observations. Our method trains encoders such that distances in latent space equal bisimulation distances in state space. We demonstrate the effectiveness of our method at disregarding task-irrelevant information using modified visual MuJoCo tasks, where the background is replaced with moving distractors and natural videos, while achieving state-of-the-art performance. We also test a first-person highway driving task where our method learns invariance to clouds, weather, and time of day. Finally, we provide generalization results drawn from properties of bisimulation metrics, and links to causal inference.

1. Introduction

Learning control from images is important for many real-world applications. While deep reinforcement learning (RL) has enjoyed many successes in simulated tasks, learning control from real vision is more complex, especially outdoors, where images reveal detailed scenes of a complex and unstructured world. Furthermore, while many RL algorithms can eventually learn control from real images given unlimited data, data efficiency is often a necessity in real trials, which are expensive and constrained to real time. Prior methods for data-efficient learning of simulated visual tasks typically use representation learning. Representation learning summarizes images by encoding them into smaller vectored representations better suited for RL. For example, sequential autoencoders aim to learn lossless representations of streaming observations (sufficient to reconstruct current observations and predict future observations) from which various RL algorithms can be trained (Hafner et al., 2018; Lee et al., 2019; Yarats et al., 2019). However, such methods are task-agnostic: the models represent all dynamic elements they observe in the world, whether they are relevant to the task or not. We argue such representations can easily "distract" RL algorithms with irrelevant information in the case of real images. The issue of distraction is less evident in popular simulated MuJoCo and Atari tasks, since any change in observation space is likely task-relevant and thus worth representing. By contrast, visual images that autonomous cars observe contain predominantly task-irrelevant information, like cloud shapes and architectural details, as illustrated in Figure 1. Rather than learning control-agnostic representations that focus on accurate reconstruction of clouds and buildings, we would rather achieve a more compressed representation from a lossy encoder, which only retains state information relevant to our task.
If we would like to learn representations that capture only task-relevant elements of the state and are invariant to task-irrelevant information, intuitively we can utilize the reward signal to help determine task-relevance, as shown by Jonschkowski & Brock (2015). As cumulative rewards are our objective, state elements are relevant not only if they influence the current reward, but also if they influence state elements in the future that in turn influence future rewards. This recursive relationship can be distilled into a recursive task-aware notion of state abstraction: an ideal representation is one that is predictive of reward, and also predictive of itself in the future. We propose learning such an invariant representation using the bisimulation metric, where the distance between two observation encodings corresponds to how "behaviourally different" (Ferns & Precup, 2014) the two observations are. Our main contribution is a practical representation learning method based on the bisimulation metric suitable for downstream control, which we call deep bisimulation for control (DBC). We additionally provide theoretical analysis that proves value bounds between the optimal value function of the true MDP and the optimal value function of the MDP constructed by the learned representation. Empirical evaluations demonstrate that our non-reconstructive approach using bisimulation is substantially more robust to task-irrelevant distractors when compared to prior approaches that use reconstruction losses or contrastive losses. Our initial experiments insert natural videos into the background of MuJoCo control tasks as complex distractions. Our second setup is a high-fidelity highway driving task using CARLA (Dosovitskiy et al., 2017), showing that our representations can be trained effectively even on highly realistic images with many distractions, such as trees, clouds, buildings, and shadows. For example videos, see https://sites.google.com/view/deepbisim4control.
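
To make the recursive idea above concrete, the following is a minimal sketch of an encoder objective that pushes the latent distance between two observations toward a bisimulation-style target: the reward difference plus a discounted distance between predicted next-latent distributions. The specific choices here are illustrative assumptions rather than details stated in the text above: the L1 latent distance, the diagonal-Gaussian next-state model, the closed-form 2-Wasserstein distance, and all function names (`l1_distance`, `gaussian_w2`, `bisim_target`, `bisim_loss`) are hypothetical.

```python
import math


def l1_distance(z_i, z_j):
    """L1 distance between two latent vectors (assumed latent metric)."""
    return sum(abs(a - b) for a, b in zip(z_i, z_j))


def gaussian_w2(mu_i, sigma_i, mu_j, sigma_j):
    """Closed-form 2-Wasserstein distance between two diagonal Gaussians.

    mu_* are mean vectors; sigma_* are per-dimension standard deviations.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_i, mu_j))
    std_term = sum((a - b) ** 2 for a, b in zip(sigma_i, sigma_j))
    return math.sqrt(mean_term + std_term)


def bisim_target(r_i, r_j, next_i, next_j, gamma=0.99):
    """Bisimulation-style target: reward gap plus discounted dynamics gap.

    next_i and next_j are (mean, std) pairs describing the predicted
    next-latent distribution for each observation (an assumed model form).
    """
    mu_i, sigma_i = next_i
    mu_j, sigma_j = next_j
    return abs(r_i - r_j) + gamma * gaussian_w2(mu_i, sigma_i, mu_j, sigma_j)


def bisim_loss(z_i, z_j, r_i, r_j, next_i, next_j, gamma=0.99):
    """Squared error between the latent distance and the target.

    In a real training loop the target would be treated as a constant
    (no gradient), and z_i, z_j would come from the encoder being trained.
    """
    target = bisim_target(r_i, r_j, next_i, next_j, gamma)
    return (l1_distance(z_i, z_j) - target) ** 2
```

Under these assumptions, two observations that yield the same reward and the same predicted next-latent distribution have a target of zero, so the loss is minimized by encoding them identically, regardless of how different their pixels are. This is the sense in which such an objective ignores clouds or background video while still separating states that differ in reward or dynamics.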

2. Related Work

Our work builds on the extensive prior research on bisimulation in MDP state aggregation.

Reconstruction-based Representations. Early works on deep reinforcement learning from images (Lange & Riedmiller, 2010; Lange et al., 2012) used a two-step learning process where first an auto-encoder was trained using reconstruction loss to learn a low-dimensional representation, and subsequently a controller was learned using this representation. This allows effective leveraging of large, unlabeled datasets for learning representations for control. In practice, there is no guarantee that the learned representation will capture useful information for the control task, and significant expert knowledge and tricks are often necessary for these approaches to work. In model-based RL, one solution to this problem has been to jointly train the encoder and the dynamics model end-to-end (Watter et al., 2015; Wahlström et al., 2015); this proved effective in learning useful task-oriented representations. Hafner et al. (2018) and Lee et al. (2019) learn latent state models using a reconstruction loss, but these approaches suffer from the difficulty of learning accurate long-term predictions and often still require significant manual tuning. Gelada et al. (2019) also propose a latent dynamics model-based method and connect their approach to bisimulation metrics, using a reconstruction loss in Atari. They show that ℓ2 distance in the DeepMDP representation upper bounds the bisimulation distance, whereas our objective directly learns a representation where distance in latent space is the bisimulation metric. Further, their results rely on the assumption that the learned representation is Lipschitz, whereas we show that, by directly learning a bisimilarity-based representation, we guarantee a representation that generates a Lipschitz MDP. We show experimentally that our non-reconstructive DBC method is substantially more robust to complex distractors.

Contrastive-based Representations.
Contrastive losses are a self-supervised approach to learn useful representations by enforcing similarity constraints between data (van den Oord et al., 2018; Chen et al., 2020). Similarity functions can be provided as domain knowledge in the form of heuristic data augmentation, where we maximize similarity between augmentations of the same data point (Laskin et al., 2020) or nearby image patches (Hénaff et al., 2019), and minimize similarity between different data points. In the absence of this domain knowledge, contrastive representations can be trained by predicting the future (van den Oord et al., 2018). We compare to such an approach in our experiments, and show that DBC is substantially more robust. While contrastive losses do not require reconstruction, they do not inherently have a mechanism to determine downstream task relevance without manual engineering, and when trained only for prediction, they aim to capture all



Figure 1: Robust representations of the visual scene should be insensitive to irrelevant objects (e.g., clouds) or details (e.g., car types), and encode two observations equivalently if their relevant details are equal (e.g., road direction and locations of other cars).

