OFF-DYNAMICS REINFORCEMENT LEARNING: TRAINING FOR TRANSFER WITH DOMAIN CLASSIFIERS

Abstract

We propose a simple, practical, and intuitive approach for domain adaptation in reinforcement learning. Our approach stems from the idea that the agent's experience in the source domain should look similar to its experience in the target domain. Building off of a probabilistic view of RL, we achieve this goal by compensating for the difference in dynamics by modifying the reward function. This modified reward function is simple to estimate by learning auxiliary classifiers that distinguish source-domain transitions from target-domain transitions. Intuitively, the agent is penalized for transitions that would indicate that the agent is interacting with the source domain, rather than the target domain. Formally, we prove that applying our method in the source domain is guaranteed to obtain a near-optimal policy for the target domain, provided that the source and target domains satisfy a lightweight assumption. Our approach is applicable to domains with continuous states and actions and does not require learning an explicit model of the dynamics. On discrete and continuous control tasks, we illustrate the mechanics of our approach and demonstrate its scalability to high-dimensional tasks.
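The reward modification described above can be sketched concretely. The snippet below is a minimal illustration of the log-ratio idea, not the paper's exact formulation: it assumes a single learned classifier outputting the probability that a transition came from the target domain, and the function name `reward_correction` is our own.

```python
import numpy as np

def reward_correction(p_target):
    """Log-ratio correction: log p(target | s, a, s') - log p(source | s, a, s').

    p_target is a learned classifier's probability that a transition
    (s, a, s') came from the target domain. Transitions that look
    source-specific (p_target < 0.5) receive a negative correction,
    penalizing the agent for exploiting dynamics that exist only in
    the source domain.
    """
    return np.log(p_target) - np.log(1.0 - p_target)

# A transition the classifier finds equally likely in both domains
# gets zero correction; a source-specific transition is penalized.
print(reward_correction(0.5))       # 0.0
print(reward_correction(0.1) < 0)   # True
```

Adding this correction to the source-domain reward penalizes the agent exactly when its experience reveals that it is interacting with the source rather than the target dynamics.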

1. INTRODUCTION

Reinforcement learning (RL) can automate the acquisition of complex behavioral policies through real-world trial-and-error experimentation. However, many domains where we would like to learn policies are not amenable to such trial-and-error learning, because the errors are too costly: from autonomous driving to flying airplanes to devising medical treatment plans, safety-critical RL problems necessitate some type of transfer learning, where a safer source domain, such as a simulator, is used to train a policy that can then function effectively in a target domain. In this paper, we examine a specific transfer learning scenario that we call domain adaptation, by analogy to domain adaptation problems in computer vision (Csurka, 2017), where the training process in a source domain can be modified so that the resulting policy is effective in a given target domain.

RL algorithms today require a large amount of experience in the target domain. For many tasks, however, we may have access to a different but structurally similar source domain. While the source domain has different dynamics than the target domain, experience in the source domain is much cheaper to collect. Transferring policies from one domain to another is nonetheless challenging, because strategies that are effective in the source domain may not be effective in the target domain. For example, aggressive driving may work well on a dry racetrack but fail catastrophically on an icy road.



Figure 1: Our method acquires a policy for the target domain by practicing in the source domain using a (learned) modified reward function.

