Measuring Visual Generalization in Continuous Control from Pixels

Abstract

Self-supervised learning and data augmentation have significantly reduced the performance gap between state-based and image-based reinforcement learning agents in continuous control tasks. However, it remains unclear whether current techniques can handle the variety of visual conditions required by real-world environments. We propose a challenging benchmark that tests agents' visual generalization by adding graphical variety to existing continuous control domains. Our empirical analysis shows that current methods struggle to generalize across a diverse set of visual changes, and we examine the specific factors of variation that make these tasks difficult. We find that data augmentation techniques outperform self-supervised learning approaches, and that stronger image transformations provide better visual generalization.

1. Introduction

Reinforcement Learning has successfully learned to control complex physical systems when presented with real-time sensor data (Gu et al., 2017; Kalashnikov et al., 2018). However, much of the field's core algorithmic work happens in simulation (Lillicrap et al., 2015; Haarnoja et al., 2018a), where all of the environmental conditions are known. In the real world, gaining access to precise sensory state information can be expensive or impossible. Camera-based observations are a practical solution, but create a representation learning problem in which important control information needs to be recovered from images of the environment. Significant progress has been made towards a solution to this challenge using auxiliary loss functions (Yarats et al., 2019; Zhang et al., 2020; Srinivas et al., 2020) and data augmentation (Laskin et al., 2020; Kostrikov et al., 2020). These strategies often match or exceed the performance of state-based approaches in simulated benchmarks. However, current continuous control environments include very little visual diversity. If existing techniques are going to be successful in the real world, they will need to operate in a variety of visual conditions. Small differences in lighting, camera position, or the surrounding environment can dramatically alter the raw pixel values presented to an agent, without affecting the underlying state. Ideally, agents would learn representations that are invariant to task-irrelevant visual changes. In this paper, we investigate the extent to which current methods meet these requirements. We propose a challenging benchmark to measure agents' ability to generalize across a diverse set of visual conditions, including changes in camera position, lighting, color, and scenery, by extending the graphical variety of existing continuous control domains.
Our benchmark provides a platform for examining the visual generalization challenge that image-based control systems may face in the real world while preserving the advantages of simulation-based training. We evaluate several recent approaches and find that while they can adapt to subtle changes in camera position and lighting, they struggle to generalize across the full range of visual conditions and are particularly distracted by changes in texture and scenery. A comparison across multiple control domains shows that data augmentation significantly outperforms other approaches, and that visual generalization benefits from more complex, color-altering image transformations.

2. Background

Reinforcement Learning: We deal with the Partially Observed Markov Decision Process (POMDP) setting defined by a tuple (S, O, φ, A, R, T, γ). S is the set of states, which are usually low-dimensional representations of all the environment information needed to choose an appropriate action. In contrast, O is a typically higher-dimensional observation space that is visible to the agent. R is the reward function S × A → R, and T is the function representing transition probabilities S × A × S → [0, 1]. φ is the emission function S → O that determines how the observations in O are generated by the true states in S. A is the set of available actions that can be taken at each state; in the continuous control setting we focus on in this paper, A is a bounded subset of R^n, where n is the dimension of the action space. An agent is defined by a stochastic policy π that maps observations to a distribution over actions in A. The goal of Reinforcement Learning (RL) is to find a policy that maximizes the expected discounted sum of rewards over trajectories of experience τ collected from a POMDP M, denoted η_M(π):

η_M(π) = E_{τ∼π} [ Σ_{t=0}^{∞} γ^t R(s_t, a_t) ],

where γ ∈ [0, 1) is a discount factor that determines the agent's emphasis on long-term rewards.

Continuous Control: Many of the challenging benchmarks and applications of RL involve continuous action spaces. These include classic problems like Cartpole, Mountain Car, and Inverted Pendulum (Moore, 1990; Barto et al., 1990). However, recent work in Deep RL has focused on a collection of 3D locomotion tasks in which a robot is rewarded for a specific type of movement in a physics simulator. At each timestep, the state is a vector representing key information about the robot's current position and motion (e.g. the location, rotation, or velocity of joints and limbs). The action space is a continuous subset of R^n, where each element controls some aspect of the robot's movement (e.g. the torque of a joint motor).
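The discounted-return objective above can be sketched in a few lines of Python. In practice, η_M(π) would be estimated by averaging this quantity over many trajectories sampled from the policy; the function name here is ours, for illustration only:

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum of rewards for one trajectory: sum_t gamma^t * r_t.

    `rewards` is the sequence of rewards R(s_t, a_t) observed along a
    sampled trajectory tau; `gamma` is the discount factor in [0, 1).
    """
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total
```

With gamma = 0.5, a trajectory of three unit rewards yields 1 + 0.5 + 0.25 = 1.75, illustrating how the discount shifts emphasis toward near-term rewards.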
There have been several standardized implementations of these environments (Schulman et al., 2015; Brockman et al., 2016). This paper will use modified versions of those provided in the DeepMind Control Suite (DMC) (Tassa et al., 2018), built using the MuJoCo physics simulator (Todorov et al., 2012).

Soft Actor Critic: Soft Actor Critic (SAC) (Haarnoja et al., 2018a) is an off-policy actor-critic method that achieves impressive performance on many continuous control tasks. For each training step, a SAC agent collects experience from the environment by sampling from a high-entropy policy (a ∼ π_θ(o)) parameterized by a neural network with weights θ. It then adds experience to a large buffer D in the form of a tuple (o, a, r, o')[1]. SAC samples a batch of previous experience from the buffer, and updates the actor to encourage higher-value actions, as determined by the critic networks (Q_φ(o, a)):

L_actor = -E_{o∼D} [ min_{i=1,2} Q_{φ,i}(o, ã) - α log π_θ(ã|o) ], ã ∼ π_θ(o) (1)

The critic networks are updated by minimizing the mean squared error of their predictions relative to a bootstrapped estimate of the action-value function:

L_critic = E_{(o,a,r,o')∼D} [ ( Q_{φ,i}(o, a) - ( r + γ( min_{i=1,2} Q_{φ',i}(o', ã') - α log π_θ(ã'|o') ) ) )^2 ], ã' ∼ π_θ(o') (2)

The log π_θ terms are a maximum-entropy component that encourages exploration, weighted by a parameter α, which can either be fixed or learned by gradient descent to approach a target entropy level (Haarnoja et al., 2018b). The min operation makes use of two critic networks that are trained separately, and helps to reduce overestimation bias (Fujimoto et al., 2018). φ' refers to target networks, which are moving averages of the critic networks' weights and help to stabilize learning, a common trick in Deep RL (Lillicrap et al., 2015; Mnih et al., 2015).

Generalization in Reinforcement Learning: Consider a set of POMDPs M = {(S_0, O_0, φ_0, A_0, R_0, T_0, γ_0), ..., (S_n, O_n, φ_n, A_n, R_n, T_n, γ_n)}.
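As a concrete scalar illustration of the SAC updates above, the sketch below computes the bootstrapped critic target and the two losses for a single transition. In a real implementation these are batched tensor operations over samples drawn from the buffer D, with gradients taken through the networks; the function and argument names here are ours, for illustration:

```python
def critic_target(r, gamma, q1_next, q2_next, logp_next, alpha):
    """Bootstrapped target: r + gamma * (min twin-Q at o' - alpha * log pi(a'|o')).

    q1_next and q2_next stand in for the two target critics' values at
    (o', a'), with a' sampled from the current policy.
    """
    return r + gamma * (min(q1_next, q2_next) - alpha * logp_next)

def critic_loss(q_pred, target):
    """Squared error of one critic's prediction against the (fixed) target."""
    return (q_pred - target) ** 2

def actor_loss(q1, q2, logp, alpha):
    """Negated actor objective: maximize min twin-Q plus the entropy bonus."""
    return -(min(q1, q2) - alpha * logp)
```

Note that the entropy term enters both losses with the same sign convention: a more stochastic policy (more negative log-probability) raises the critic target and lowers the actor loss.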
Given access to one or several training tasks M_train ∼ M, we would like to learn a policy that can generalize to the entire



[1] We leave out the terminal boolean d common to most implementations, both for notational simplicity and because many of the DMC tasks reset after a fixed amount of time.

