CROSS-MODAL DOMAIN ADAPTATION FOR REINFORCEMENT LEARNING

Anonymous authors
Paper under double-blind review

Abstract

To avoid the prohibitively expensive reinforcement learning training of agents in the real world, the sim-to-real approach, i.e., training in simulators and adapting to target environments, is a promising direction. However, crafting a high-fidelity simulator can itself be difficult and costly. For example, to simulate vision-based robotics, simulators have to render high-fidelity images, which requires tremendous engineering effort. This work aims at learning a cross-modal mapping between the intrinsic states of the simulator and the high-dimensional observations of the target environments. This cross-modal mapping allows agents trained on a source domain of state inputs to adapt well to a target domain of image inputs. However, learning the cross-modal mapping is ill-posed for previous same-modal domain adaptation methods, since the structural constraints they rely on no longer exist. We propose leveraging the sequential information in trajectories and incorporating the policy to guide the training process. Experiments on MuJoCo environments show that the proposed cross-modal domain adaptation approach enables agents to be deployed directly in the target domain with only a small performance gap, while previous methods designed for same-modal domain adaptation fail on this task.

1. INTRODUCTION

Deep Reinforcement Learning (DRL) for vision-based robotic-control tasks has achieved remarkable success in recent years (Francis et al., 2020; Zhang et al., 2019; Zeng et al., 2018; Riedmiller et al., 2018; Levine et al., 2018). However, current RL algorithms require a substantial number of interactions with the environment, which is costly in both time and money on real robots. An appealing alternative is to train policies in simulators and then transfer them onto real-world systems (Rao et al., 2020; James et al., 2019; Yan et al., 2017). Due to inevitable differences between simulators and the real world, also known as the "reality gap" (Jakobi et al., 1995), applying policies trained in one domain directly to another almost surely fails, especially in visual-input tasks, owing to the poor generalization of RL policies (Cobbe et al., 2019). Domain adaptation is a common way to improve transferability by mapping inputs from the two domains to an aligned distribution. Although distribution alignment is difficult with limited data, many recent works have adopted unsupervised visual domain adaptation (Hoffman et al., 2018; Ganin et al., 2017; Yi et al., 2017; Kim et al., 2017) to learn the mapping function without a ground-truth pairing. These adaptation methods exploit structural constraints (Fu et al., 2019) between two same-modal domains (i.e., learned on simulated images and deployed on real images) to overcome the intrinsic ill-posedness of distribution matching, as shown in Fig. 1(a): mapping an instance in the target domain to anything of a similar probability in the source domain is "reasonable" if we only consider distribution matching.

However, training on simulated images introduces unwanted costs and difficulties, which are ignored in current works. First, a rendering engine requires more human engineering and runs much slower (up to 20× slower according to Xia et al. (2018)) than a pure rigid-body simulator, which adds considerable cost to the overall process. Second, using RL methods to train a policy with image inputs is usually harder than training with state inputs (Kaiser et al., 2020; Tenenbaum, 2018), resulting in a sub-optimal simulation policy. An ideal solution to avoid such problems is to train policies with simulated states and adapt the learned policies to real-world images. However, all

