Measuring Visual Generalization in Continuous Control from Pixels

Abstract

Self-supervised learning and data augmentation have significantly reduced the performance gap between state-based and image-based reinforcement learning agents in continuous control tasks. However, it remains unclear whether current techniques can cope with the variety of visual conditions found in real-world environments. We propose a challenging benchmark that tests agents' visual generalization by adding graphical variety to existing continuous control domains. Our empirical analysis shows that current methods struggle to generalize across a diverse set of visual changes, and we examine the specific factors of variation that make these tasks difficult. We find that data augmentation techniques outperform self-supervised learning approaches, and that stronger image transformations provide better visual generalization.

1. Introduction

Reinforcement learning agents have successfully learned to control complex physical systems from real-time sensor data (Gu et al., 2017; Kalashnikov et al., 2018). However, much of the field's core algorithmic work happens in simulation (Lillicrap et al., 2015; Haarnoja et al., 2018a), where all environmental conditions are known. In the real world, gaining access to precise sensory state information can be expensive or impossible. Camera-based observations are a practical alternative, but they create a representation learning problem in which the information needed for control must be recovered from images of the environment. Significant progress has been made on this challenge using auxiliary loss functions (Yarats et al., 2019; Zhang et al., 2020; Srinivas et al., 2020) and data augmentation (Laskin et al., 2020; Kostrikov et al., 2020); these strategies often match or exceed the performance of state-based approaches in simulated benchmarks.

However, current continuous control environments include very little visual diversity. If existing techniques are to succeed in the real world, they will need to operate under a variety of visual conditions: small differences in lighting, camera position, or the surrounding environment can dramatically alter the raw pixel values presented to an agent without affecting the underlying state. Ideally, agents would learn representations that are invariant to such task-irrelevant visual changes.

In this paper, we investigate the extent to which current methods meet these requirements. We propose a challenging benchmark that measures agents' ability to generalize across a diverse set of visual conditions, including changes in camera position, lighting, color, and scenery, by extending the graphical variety of existing continuous control domains.
Our benchmark provides a platform for examining the visual generalization challenge that image-based control systems may face in the real world while preserving the advantages of simulation-based training. We evaluate several recent approaches and find that while they can adapt to subtle changes in camera position and lighting, they struggle to generalize across the full range of visual conditions and are particularly distracted by changes in texture and scenery. A comparison across multiple control domains shows that data augmentation significantly outperforms other approaches, and that visual generalization benefits from more complex, color-altering image transformations.
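To make the distinction between geometric and color-altering augmentations concrete, the two families can be sketched in plain NumPy. This is an illustrative sketch only: the function names, padding size, and jitter ranges below are our own assumptions, not the exact transformations evaluated in the benchmark.

```python
import numpy as np

def random_crop(img: np.ndarray, rng: np.random.Generator, pad: int = 4) -> np.ndarray:
    """Geometric augmentation: pad the image, then crop back to the
    original size at a random offset (a common choice in pixel-based RL)."""
    h, w, _ = img.shape
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w]

def color_jitter(img: np.ndarray, rng: np.random.Generator,
                 brightness: float = 0.4, contrast: float = 0.4) -> np.ndarray:
    """Color-altering augmentation: randomly perturb brightness and
    contrast of an HWC uint8 image."""
    x = img.astype(np.float32) / 255.0
    # Brightness: scale all pixel values by a random factor.
    x *= 1.0 + rng.uniform(-brightness, brightness)
    # Contrast: push pixel values toward or away from the image mean.
    mean = x.mean()
    x = mean + (x - mean) * (1.0 + rng.uniform(-contrast, contrast))
    return (np.clip(x, 0.0, 1.0) * 255.0).astype(np.uint8)

# Augment a dummy 84x84 RGB observation with both transformations.
rng = np.random.default_rng(0)
obs = rng.integers(0, 256, size=(84, 84, 3), dtype=np.uint8)
aug = color_jitter(random_crop(obs, rng), rng)
```

Under this framing, an agent trained only with `random_crop` never sees the color statistics of its observations change, which is one plausible reason color-altering transformations help more when test-time scenery and textures differ from training.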

