CROSS-MODAL DOMAIN ADAPTATION FOR REINFORCEMENT LEARNING

Anonymous authors
Paper under double-blind review

Abstract

To avoid the prohibitive cost of training reinforcement learning agents in the real world, the sim-to-real approach, i.e., training in simulators and adapting to target environments, is a promising direction. However, crafting an elaborate simulator can itself be difficult and costly. For example, to simulate vision-based robotics, simulators have to render high-fidelity images, which can cost tremendous effort. This work aims at learning a cross-modal mapping between the intrinsic states of the simulator and the high-dimensional observations of the target environment. This cross-modal mapping allows agents trained on the source domain of state inputs to adapt well to the target domain of image inputs. However, learning the cross-modal mapping can be ill-posed for previous same-modal domain adaptation methods, since the structural constraints they rely on no longer exist. We propose to leverage the sequential information in the trajectories and to incorporate the policy to guide the training process. Experiments on MuJoCo environments show that the proposed cross-modal domain adaptation approach enables agents to be deployed directly in the target domain with only a small performance gap, while previous methods designed for same-modal domain adaptation fail on this task.

1. INTRODUCTION

Deep Reinforcement Learning (DRL) for vision-based robotic-control tasks has achieved remarkable success in recent years (Francis et al., 2020; Zhang et al., 2019; Zeng et al., 2018; Riedmiller et al., 2018; Levine et al., 2018). However, current RL algorithms necessitate a substantial number of interactions with the environment, which is costly in both time and money on real robots. An appealing alternative is to train policies in simulators and then transfer these policies onto real-world systems (Rao et al., 2020; James et al., 2019; Yan et al., 2017). Due to inevitable differences between simulators and the real world, also known as the "reality gap" (Jakobi et al., 1995), applying policies trained in one domain directly to another almost surely fails, especially in visual-input tasks, owing to the poor generalization of RL policies (Cobbe et al., 2019). Domain adaptation is a common way to improve transferability by mapping inputs from the two domains to an aligned distribution. Although distribution alignment is difficult with limited data, many recent works have adopted unsupervised visual domain adaptation (Hoffman et al., 2018; Ganin et al., 2017; Yi et al., 2017; Kim et al., 2017) to learn the mapping function without a ground-truth pairing. These adaptation methods exploit structural constraints (Fu et al., 2019) in two same-modal domains (i.e., learned on simulated images and deployed on real images) to overcome the intrinsic ill-posedness of distribution matching, as shown in Fig. 1(a): mapping an instance in the target domain to anything of a similar probability in the source domain is "reasonable" if we only consider distribution matching.

However, training on simulated images introduces unwanted costs and difficulties, which are ignored in current works. First, a rendering engine needs more human engineering and runs much slower (up to 20x slower according to Xia et al. (2018)) than a pure rigid-body simulator, which adds considerable cost to the overall process. Second, using RL methods to train a policy with image inputs is usually harder than training with state inputs (Kaiser et al., 2020; Tenenbaum, 2018), resulting in a sub-optimal simulation policy. An ideal solution to avoid such problems is to train policies with simulated states and adapt the learned policies to real-world images. However, all the structural constraints based on modality consistency can then no longer be used, and the distribution alignment task of learning a mapping function becomes hard to solve.

In this paper, we propose Cross-mOdal Domain Adaptation with Sequential structure (CODAS), which learns a mapping function from images in the target domain to states in the source domain. With the help of the learned mapping function, policies trained on states in the source domain can be deployed directly in the target domain of images. Specifically, based on the sequential nature of reinforcement learning problems, we formulate the cross-modal domain adaptation problem as a sequential variational inference problem and derive a series of solvable optimization objectives in CODAS. It is worth noting that our work is different from recent works that learn state embeddings from image inputs, which map images to an arbitrary subspace of a low-dimensional vector space.
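As a rough sketch of the sequential variational inference view (the notation below is ours for illustration, not the paper's exact objective), one can treat the unobserved simulator states s_{1:T} as latent variables explaining the observed image trajectory o_{1:T} under actions a_{1:T}, which yields a standard sequential evidence lower bound:

```latex
\log p(o_{1:T} \mid a_{1:T})
\;\ge\;
\mathbb{E}_{q(s_{1:T} \mid o_{1:T}, a_{1:T})}\!\left[\sum_{t=1}^{T} \log p(o_t \mid s_t)\right]
\;-\;
\sum_{t=1}^{T} \mathbb{E}_{q}\!\left[
D_{\mathrm{KL}}\!\big(\, q(s_t \mid o_{\le t}, a_{<t}) \;\big\|\; p(s_t \mid s_{t-1}, a_{t-1}) \,\big)
\right]
```

In such a bound, the reconstruction term ties each inferred state to its image, while the KL term against a transition prior p(s_t | s_{t-1}, a_{t-1}) is what injects the trajectory context that plain distribution matching lacks.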
In CODAS, we embed the image space into a vector space with clear meanings (defined in the state-based simulator), which improves the interpretability of the policy when deployed in the real world. We evaluate our method on 6 MuJoCo (Todorov et al., 2012) environments provided in OpenAI Gym (Brockman et al., 2016), where we treat states as the source domain and rendered images as the target domain, respectively. Experiments are conducted in the scenario where only offline real data are available. Experimental results show that the mapping function learned by our method can help transfer the policy to target-domain images with only a small performance degradation, whereas previous methods that use unaligned Generative Adversarial Networks (GANs) suffer from severe performance degradation on this cross-modal transfer problem. The experiments provide an optimistic result indicating that cross-modal domain adaptation can serve as a low-cost Sim2Real approach.
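To make the shape of such an image-to-state mapping concrete, the following is a minimal forward-pass sketch, assuming a recurrent mapping function; all names, dimensions, and the simple tanh recurrence are our illustrative assumptions, not the paper's architecture. The point it shows is structural: conditioning each predicted state on a running hidden state lets the mapping use trajectory context rather than each frame in isolation.

```python
import numpy as np

# Illustrative sketch (not the paper's model): a recurrent mapping
# G: encoded observations o_1..o_T -> simulator-state estimates s_1..s_T.
rng = np.random.default_rng(0)
OBS_DIM, HID_DIM, STATE_DIM, T = 64, 32, 11, 5  # hypothetical sizes

W_in = rng.normal(scale=0.1, size=(HID_DIM, OBS_DIM + HID_DIM))
W_out = rng.normal(scale=0.1, size=(STATE_DIM, HID_DIM))

def map_trajectory(obs_seq):
    """Map a sequence of observation embeddings to state estimates,
    carrying a hidden state so each prediction sees the trajectory context."""
    h = np.zeros(HID_DIM)
    states = []
    for o in obs_seq:
        h = np.tanh(W_in @ np.concatenate([o, h]))  # recurrent update
        states.append(W_out @ h)                    # predicted simulator state
    return np.stack(states)

obs_seq = rng.normal(size=(T, OBS_DIM))  # stand-in for encoded target-domain images
states = map_trajectory(obs_seq)
assert states.shape == (T, STATE_DIM)
```

In a full pipeline, the output of such a mapping would be fed directly to the state-input policy, with the mapping trained so that its outputs align with the source-domain state distribution along whole trajectories.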

2. RELATED WORK

To the best of our knowledge, this work is the first to address cross-modal domain adaptation in the RL setting. We discuss two research areas closely related to this topic: (1) unsupervised visual domain adaptation in RL and (2) image-input representation learning in RL.

2.1. VISUAL DOMAIN ADAPTATION IN RL

Unsupervised visual domain adaptation aims to map the source domain and the target domain to an aligned distribution without pairing the data. Prior methods fall into two major categories: feature-level adaptation, where domain-invariant features are learned (Gopalan et al., 2011; Caseiro et al., 2015; Long et al., 2015; Ganin et al., 2017), and pixel-level adaptation, where pixels from a source image are used to generate an image that looks like one from the target domain (Bousmalis et al., 2017; Yoo et al., 2016; Taigman et al., 2017; Hoffman et al., 2018). Pixel-level adaptation is challenging when data from the two domains are unpaired. Prior works tackle this problem by using GANs (Goodfellow et al., 2014) conditioned on simulated images to generate



Figure 1: Illustration of mapping functions with and without sequential structure from the target domain (left) to the source domain (right). Shaded regions denote data distributions, where the darker the color, the higher the probability. In Fig. 1(a), both s_t and s_t' are "realistic" source domain instances, but only s_t is correct. Since they have similar probabilities, distribution matching may map o_t to either of them. In RL, the policy may output unreliable actions when taking these incorrectly mapped states as inputs. In Fig. 1(b), a sequential structure can help rule out the wrong mapping based on trajectory contexts.
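The disambiguation argument in Fig. 1(b) can be made concrete with a toy numeric example: two candidate states may be equally plausible under the marginal state distribution, yet only one is consistent with the preceding state and action. Everything below (the linear dynamics, the particular vectors) is a hypothetical illustration, not data from the paper.

```python
import numpy as np

def dynamics(s, a):
    # Hypothetical known simulator dynamics: a simple linear step.
    return s + 0.1 * a

s_prev = np.array([0.5, -0.2])   # previously mapped state s_{t-1}
a_prev = np.array([1.0, 1.0])    # action a_{t-1} taken in the trajectory

s_true = dynamics(s_prev, a_prev)   # the context-consistent next state
s_wrong = np.array([-0.4, 0.3])     # an equally "plausible" marginal sample

def context_error(s, s_prev, a_prev):
    # How far a candidate state is from what the trajectory context predicts.
    return np.linalg.norm(s - dynamics(s_prev, a_prev))

# Marginal matching alone cannot prefer s_true over s_wrong, but checking
# consistency with the trajectory context rules out the wrong candidate.
assert context_error(s_true, s_prev, a_prev) < context_error(s_wrong, s_prev, a_prev)
```

This is the core reason a sequential formulation can resolve the ill-posedness that defeats per-frame distribution matching.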

