DOMAIN INVARIANT Q-LEARNING FOR MODEL-FREE ROBUST CONTINUOUS CONTROL UNDER VISUAL DISTRACTIONS

Abstract

End-to-end reinforcement learning on images has shown significant progress in recent years, especially with the regularization of value estimation brought by data augmentation (Yarats et al., 2020). At the same time, domain randomization and representation learning have helped push the limits of these algorithms in visually diverse environments full of distractors and spurious noise, making RL more robust to unrelated visual features. We present DIQL, a method that combines risk-invariant regularization and domain randomization to reduce the out-of-distribution (OOD) generalization gap of temporal-difference learning. In this work, we draw a link by framing domain randomization as a richer extension of data augmentation for RL and advocate its generalized use. Our model-free approach improves baseline performance without the need for additional representation learning objectives and with limited additional computational cost. We show that DIQL outperforms existing methods on complex visuo-motor control environments with high visual perturbation. In particular, our approach achieves state-of-the-art performance on the Distracting Control Suite benchmark, where we evaluate robustness to a number of visual perturbations, as well as OOD generalization and extrapolation capabilities.

1. INTRODUCTION

Data augmentation is used extensively to regularize computer vision models. One cannot imagine reaching state-of-the-art performance on standard benchmarks without a careful combination of image transformations. Yet reinforcement learning lags behind in the use of these techniques. First, this stems from the high variance reinforcement learning suffers during training. This is especially true for off-policy algorithms such as Q-learning (Watkins & Dayan, 1992; Mnih et al., 2013), where noisy Q-values caused by uncertainty induce an overestimation bias that renders training extremely difficult. A number of methods directly tackle the overestimation problem with algorithmic or architectural changes to the value function (Van Hasselt et al., 2016; Wang et al., 2016; Bellemare et al., 2017; Kumar et al., 2021). Regularizing the value estimation with light data augmentation is another successful approach (Yarats et al., 2020; Laskin et al., 2020b), but extensive data augmentation in reinforcement learning means adding even more noise and can lead to difficult or unstable training (Hansen et al., 2021). Second, standard computer vision tasks mostly focus on extracting high-level semantic information from images or videos. Because classifying the content of an image is a substantially high-level task, the class label is resilient to many intense visual transformations of the image (e.g., geometric transformations, color distortion, kernel filtering, information deletion, mixing). Features such as the exact position, relative organization, and textures of entities in the image are usually not predictive of the class label, and data augmentation pipelines take advantage of this. From a causal perspective, data augmentation performs interventions on the "style" variable, which is not linked to the class label in the causal graph.
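The "light data augmentation" referenced above is typically the random-shift (pad-and-crop) transformation of Yarats et al. (2020). A minimal NumPy sketch of that operation, with an illustrative function name and padding size:

```python
import numpy as np

def random_shift(img, pad=4, rng=None):
    """DrQ-style augmentation: replicate-pad the observation, then take a
    random crop back to the original size, shifting content by up to `pad`
    pixels in each spatial direction."""
    rng = rng or np.random.default_rng()
    h, w, c = img.shape
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w]
```

In practice the same shift is applied to every frame of a stacked observation so that relative positions within the stack are preserved.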
For classification tasks, the dimension of the style variable is much larger than in the visuo-motor control tasks where reinforcement learning is involved. Intuitively, we can change many factors of variation in the visual aspect of a particular object without it becoming unrecognizable: we can still recognize a car on the street from very sparse and highly perturbed visual cues. In visuo-motor control, these factors of variation (the dimensionality of the style variable) are fewer in number and less obvious. In particular, geometric deformations or occlusions can destroy information crucial for control, such as the relative distances of objects in the image. Simple data augmentation, in the form of random shifts, proved crucial for boosting RL performance (Yarats et al., 2020; 2021; Laskin et al., 2020b;a; Hansen et al., 2020) and is now used as a baseline in state-of-the-art methods. However, it remains unclear which combination of image transformations is optimal for reinforcement learning (Hansen et al., 2021; Raileanu et al., 2021). A related technique, domain randomization (Tobin et al., 2017), was introduced in robotics to close the gap from simulation to the real world by randomizing dynamics and components in simulation. We argue in this paper that domain randomization is a more general form of data augmentation, better suited to reinforcement learning, that allows for finer control over visual factors of variation by directly changing the hidden state of the system in simulation. Contrary to data augmentation in general, domain randomization acts directly on the causal factors instead of adding uncorrelated noise to the observation, which could destroy useful information. Starting from this observation, we present Domain Invariant Q-Learning (DIQL) for robust visuo-motor control under visual distractions.
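The distinction between the two techniques can be made concrete: domain randomization intervenes on the simulator's hidden "style" factors before rendering, rather than perturbing pixels after the fact. A minimal sketch, where the parameter names (`background_rgb`, `camera_yaw`, `floor_texture`) are purely hypothetical placeholders for whatever visual factors a given simulator exposes:

```python
import random

def randomize_visuals(sim_params, rng=None):
    """Domain randomization: sample new values for the simulator's visual
    factors (the causal 'style' variables) before rendering, leaving the
    task-relevant state untouched."""
    rng = rng or random.Random()
    randomized = dict(sim_params)
    randomized["background_rgb"] = [rng.random() for _ in range(3)]
    randomized["camera_yaw"] = rng.uniform(-10.0, 10.0)  # degrees
    randomized["floor_texture"] = rng.choice(["wood", "tile", "grass"])
    return randomized
```

Because the intervention happens upstream of the renderer, task-relevant geometry (e.g. relative distances between objects) survives intact, which pixel-space augmentation cannot guarantee.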
We show that DIQL is able to efficiently train an agent with visual generalization capabilities without sacrificing convergence speed or asymptotic performance on the original task. In particular, we derive a domain-invariant temporal-difference loss combining domain randomization and risk extrapolation (Figure 1). We show that domain randomization can be integrated into reinforcement learning more tightly than is classically done, improving performance at very low cost. Our main contributions are:
• a novel methodology for robust visuo-motor control based on temporal-difference learning on images using invariance principles,
• empirical results on the Distracting Control Suite benchmark (Stone et al., 2021), with state-of-the-art results on raw training performance and out-of-distribution (OOD) generalization under the hardest setting of dynamic distractions.
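The combination of domain randomization and risk extrapolation can be sketched as a REx-style penalty over per-domain TD losses: average the losses across the randomized domains and penalize their variance, pushing the Q-function toward domain-invariant value estimates. This is a sketch of the invariance principle only, not the paper's exact loss; `beta` is an illustrative hyperparameter name:

```python
import numpy as np

def domain_invariant_td_loss(td_errors_per_domain, beta=1.0):
    """REx-style objective over K visually randomized domains sharing the
    same inner state: mean of per-domain TD losses plus a variance penalty
    that discourages relying on domain-specific visual features."""
    losses = np.array([np.mean(np.square(e)) for e in td_errors_per_domain])
    return losses.mean() + beta * losses.var()
```

When the Q-function is truly invariant, the per-domain losses coincide and the variance term vanishes, recovering the ordinary averaged TD loss.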

2. PRELIMINARIES

Value-based reinforcement learning We define the MDP M = ⟨S, A, P, R, γ⟩, where S is the set of states, A the set of actions, P the transition probability function, R the reward function, and γ a discounting factor for future rewards. We also define the transition containing the state, action, reward, and next state at timestep t as T = (S_t, A_t, r_t, S_{t+1}). Reinforcement learning aims to maximize the expected discounted return E[Σ_t γ^t r_t].
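The objective being maximized is the expected discounted return, i.e. the γ-weighted sum of rewards along a trajectory. A minimal sketch of its evaluation for a single reward sequence:

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return G = sum_t gamma^t * r_t, accumulated backwards so
    each reward is discounted by its distance from the start of the sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, `discounted_return([1.0, 1.0, 1.0], gamma=0.5)` evaluates 1 + 0.5·1 + 0.25·1 = 1.75.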



Figure 1: Domain Invariant Q-Learning. During training, DIQL uses two visually different domains (Interpolation domains) based on the same inner state of the environment to learn a Q-function that is invariant to spurious visual features. DIQL promotes risk extrapolation and prevents a drastic collapse of the Q-function's accuracy in out-of-distribution settings (Extrapolation domains).

