OFF-DYNAMICS REINFORCEMENT LEARNING: TRAINING FOR TRANSFER WITH DOMAIN CLASSIFIERS

Abstract

We propose a simple, practical, and intuitive approach for domain adaptation in reinforcement learning. Our approach stems from the idea that the agent's experience in the source domain should look similar to its experience in the target domain. Building off of a probabilistic view of RL, we achieve this goal by compensating for the difference in dynamics by modifying the reward function. This modified reward function is simple to estimate by learning auxiliary classifiers that distinguish source-domain transitions from target-domain transitions. Intuitively, the agent is penalized for transitions that would indicate that the agent is interacting with the source domain, rather than the target domain. Formally, we prove that applying our method in the source domain is guaranteed to obtain a near-optimal policy for the target domain, provided that the source and target domains satisfy a lightweight assumption. Our approach is applicable to domains with continuous states and actions and does not require learning an explicit model of the dynamics. On discrete and continuous control tasks, we illustrate the mechanics of our approach and demonstrate its scalability to high-dimensional tasks.

1. INTRODUCTION

Reinforcement learning (RL) can automate the acquisition of complex behavioral policies through real-world trial-and-error experimentation. However, many domains where we would like to learn policies are not amenable to such trial-and-error learning, because the errors are too costly: from autonomous driving to flying airplanes to devising medical treatment plans, safety-critical RL problems necessitate some type of transfer learning, where a safer source domain, such as a simulator, is used to train a policy that can then function effectively in a target domain. In this paper, we examine a specific transfer learning scenario that we call domain adaptation, by analogy to domain adaptation problems in computer vision (Csurka, 2017), where the training process in a source domain can be modified so that the resulting policy is effective in a given target domain.

Figure 1: Our method acquires a policy for the target domain by practicing in the source domain using a (learned) modified reward function.

RL algorithms today require a large amount of experience in the target domain. However, for many tasks we may have access to a different but structurally similar source domain. While the source domain has different dynamics than the target domain, experience in the source domain is much cheaper to collect. Transferring policies from one domain to another is nonetheless challenging, because strategies that are effective in the source domain may not be effective in the target domain. For example, aggressive driving may work well on a dry racetrack but fail catastrophically on an icy road.

This paper presents a simple approach for domain adaptation in RL, illustrated in Fig. 1. Our main idea is that the agent's experience in the source domain should look similar to its experience in the target domain. Building off of a probabilistic view of RL, we formally show that we can achieve this goal by compensating for the difference in dynamics by modifying the reward function.
This modified reward function is simple to estimate by learning auxiliary classifiers that distinguish source-domain transitions from target-domain transitions. Because our method learns a classifier rather than a dynamics model, we expect it to handle high-dimensional tasks better than model-based methods, a conjecture supported by experiments on the 111-dimensional Ant task. Unlike prior work based on similar intuition (Koos et al., 2012; Wulfmeier et al., 2017b), a key contribution of our work is a formal guarantee that our method yields a near-optimal policy for the target domain.

The main contribution of this work is an algorithm for domain adaptation to dynamics changes in RL, based on the idea of compensating for differences in dynamics by modifying the reward function. We call this algorithm Domain Adaptation with Rewards from Classifiers, or DARC for short. DARC does not estimate transition probabilities, but rather modifies the reward function using a pair of classifiers. We formally analyze the conditions under which our method produces near-optimal policies for the target domain. On a range of discrete and continuous control tasks, we both illustrate the mechanics of our approach and demonstrate its scalability to higher-dimensional tasks.
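As a concrete illustration of the reward modification described above, the sketch below is our own toy example, not the paper's code: the dynamics, feature choices, and function names are all assumptions made for illustration. It trains two logistic classifiers, one on full transitions (s, a, s') and one on state-action pairs (s, a), for a pair of 1-D domains with different dynamics, and takes the difference of their logits as the reward correction, which estimates log p_target(s'|s, a) - log p_source(s'|s, a).

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logistic(X, y, lr=0.5, steps=3000):
    """Plain gradient-descent logistic regression; returns weights (bias included)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def logit(w, X):
    """Classifier logit, i.e. log p(target|x) - log p(source|x)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return Xb @ w

# Toy 1-D dynamics that differ between the two domains:
#   source: s' = s + a       + noise
#   target: s' = s + 0.5 * a + noise
n, sigma = 2000, 0.2
s = rng.uniform(-1, 1, 2 * n)
a = rng.uniform(-1, 1, 2 * n)
is_target = np.concatenate([np.zeros(n), np.ones(n)])
sp = np.where(is_target == 1, s + 0.5 * a, s + a) + rng.normal(0, sigma, 2 * n)

# Hand-picked features rich enough to represent the true log density ratio here.
def sas_features(s, a, sp):
    return np.stack([s, a, sp, a * sp, a * s, a * a], axis=1)

def sa_features(s, a):
    return np.stack([s, a, a * a], axis=1)

# Classifier 1: does the full transition (s, a, s') come from the target domain?
w_sas = fit_logistic(sas_features(s, a, sp), is_target)
# Classifier 2: does the state-action pair (s, a) come from the target domain?
w_sa = fit_logistic(sa_features(s, a), is_target)

# Reward correction: the difference of the two logits cancels the (s, a)
# marginal term and estimates log p_target(s'|s,a) - log p_source(s'|s,a).
delta_r = logit(w_sas, sas_features(s, a, sp)) - logit(w_sa, sa_features(s, a))
```

In this toy example, transitions drawn from the target dynamics receive a higher correction on average than transitions drawn from the source dynamics, so adding the correction to the task reward penalizes the agent for transitions that reveal it is in the source domain.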

2. RELATED WORK

While our work will focus on domain adaptation applied to RL, we start by reviewing more general ideas in domain adaptation, and defer to Kouw & Loog (2019) for a recent review of the field. Two common approaches to domain adaptation are importance weighting and domain-agnostic features. Importance-weighting methods (e.g., Zadrozny, 2004; Cortes & Mohri, 2014; Lipton et al., 2018) estimate the likelihood ratio of examples under the target domain versus the source domain, and use this ratio to re-weight examples sampled from the source domain. Similar to prior work on importance weighting (Bickel et al., 2007; Sønderby et al., 2016; Mohamed & Lakshminarayanan, 2016; Uehara et al., 2016), our method will use a classifier to estimate a probability ratio. Since we will need to estimate the density ratio of conditional distributions (transition probabilities), we will learn two classifiers. Importantly, we will use the logarithm of the density ratio to modify the reward function instead of weighting samples by the density ratio, which is often numerically unstable (see, e.g., Schulman et al. (2017, §3)) and led to poor performance in our experiments.

Prior methods for applying domain adaptation to RL include approaches based on system identification, domain randomization, and observation adaptation. Perhaps the most established approach, system identification (Ljung, 1999), uses observed data to tune the parameters of a simulator (Feldbaum, 1960; Werbos, 1989; Wittenmark, 1995; Ross & Bagnell, 2012; Tan et al., 2016; Zhu et al., 2017b; Farchy et al., 2013). More recent work has successfully used this strategy to bridge the sim2real gap (Chebotar et al., 2019; Rajeswaran et al., 2016). Closely related is work on online system identification and meta-learning, which directly uses the inferred system parameters to update the policy (Yu et al., 2017; Clavera et al., 2018; Tanaskovic et al., 2013; Sastry & Isidori, 1989).
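The classifier-based density-ratio estimate referenced above can be checked numerically. The snippet below is our own minimal illustration, not code from the paper: with balanced classes, the Bayes-optimal classifier's logit equals the log density ratio, which for samples from N(1, 1) (target) versus N(0, 1) (source) is exactly x - 0.5, so a fitted logistic regression should recover slope ~1 and intercept ~-0.5.

```python
import numpy as np

rng = np.random.default_rng(1)

# Balanced samples from a "source" density N(0, 1) and a "target" density N(1, 1).
n = 10000
x = np.concatenate([rng.normal(0, 1, n), rng.normal(1, 1, n)])
y = np.concatenate([np.zeros(n), np.ones(n)])  # label 1 = target domain

# Logistic regression by gradient descent: predicted logit is w * x + b.
w, b = 0.0, 0.0
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))
    grad = p - y
    w -= 0.5 * (grad * x).mean()
    b -= 0.5 * grad.mean()

# With equal class priors, the optimal logit equals the log density ratio
# log p_target(x) / p_source(x) = x - 0.5 for these two Gaussians,
# so (w, b) should land close to (1.0, -0.5).
print(w, b)
```

This is the sense in which a discriminatively trained classifier, never given either density explicitly, yields the (log) density ratio that the reward modification requires.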
However, system identification approaches typically require either a model of the environment or a manually-specified distribution over potential test-time dynamics, requirements that our method will lift. Another approach, domain randomization, randomly samples the parameters of the source domain and then finds the best policy for this randomized environment (Sadeghi & Levine, 2016; Tobin et al., 2017; Peng et al., 2018; Cutler et al., 2014). While often effective, this method is sensitive to the choice of which parameters are randomized and the distributions from which these simulator parameters are sampled. A third approach, observation adaptation, modifies the observations of the source domain to appear similar to those in the target domain (Fernando et al., 2013; Hoffman et al., 2016; Wulfmeier et al., 2017a). While this approach has been successfully applied to video games (Gamrian & Goldberg, 2018) and robot manipulation (Bousmalis et al., 2018), it ignores the fact that the source and target domains may have differing dynamics.

Finally, our work is similar to prior work on transfer learning (Taylor & Stone, 2009) and meta-learning in RL, but makes less strict assumptions than most prior work. For example, most work on meta-RL (Killian et al., 2017; Duan et al., 2016; Mishra et al., 2017; Rakelly et al., 2019) and

