CROSS-MODAL DOMAIN ADAPTATION FOR REINFORCEMENT LEARNING

Anonymous authors
Paper under double-blind review

Abstract

To avoid the prohibitive cost of reinforcement learning with agents in the real world, the sim-to-real approach, i.e., training in simulators and adapting to target environments, is a promising direction. However, crafting a high-fidelity simulator can also be difficult and costly. For example, to simulate vision-based robotics, simulators have to render high-fidelity images, which requires tremendous effort. This work aims at learning a cross-modal mapping between the intrinsic states of the simulator and the high-dimensional observations of the target environment. This cross-modal mapping allows agents trained on the source domain of state input to adapt well to the target domain of image input. However, learning the cross-modal mapping is ill-posed for previous same-modal domain adaptation methods, since the structural constraints they rely on no longer exist. We propose to leverage the sequential information in trajectories and to incorporate the policy to guide the training process. Experiments on MuJoCo environments show that the proposed cross-modal domain adaptation approach enables agents to be deployed directly in the target domain with only a small performance gap, while previous methods designed for same-modal domain adaptation fail on this task.

1. INTRODUCTION

Deep Reinforcement Learning (DRL) for vision-based robotic-control tasks has achieved remarkable success in recent years (Francis et al., 2020; Zhang et al., 2019; Zeng et al., 2018; Riedmiller et al., 2018; Levine et al., 2018). However, current RL algorithms require a substantial number of interactions with the environment, which is costly in both time and money on real robots. An appealing alternative is to train policies in simulators and then transfer these policies onto real-world systems (Rao et al., 2020; James et al., 2019; Yan et al., 2017). Due to inevitable differences between simulators and the real world, also known as the "reality gap" (Jakobi et al., 1995), applying policies trained in one domain directly to another almost surely fails, especially in visual-input tasks, due to the poor generalization of RL policies (Cobbe et al., 2019). Domain adaptation is a common way to improve transferability by mapping inputs from two domains to an aligned distribution. Although distribution alignment is difficult with limited data, many recent works have adopted unsupervised visual domain adaptation (Hoffman et al., 2018; Ganin et al., 2017; Yi et al., 2017; Kim et al., 2017) to learn the mapping function without a ground-truth pairing. These adaptation methods exploit structural constraints (Fu et al., 2019) in two same-modal domains (i.e., learned on simulated images and deployed on real images) to overcome the intrinsic ill-posedness of distribution matching, as shown in Fig. 1(a): mapping an instance in the target domain to anything of a similar probability in the source domain is "reasonable" if we only consider distribution matching. However, training on simulated images introduces unwanted costs and difficulties, which are ignored in current works. First, a rendering engine needs more human engineering and runs much slower (up to 20× slower according to Xia et al. (2018)) than a pure rigid-body simulator, which adds considerable cost to the overall process. Second, using RL methods to train a policy with image inputs is usually harder than training with state inputs (Kaiser et al., 2020; Tenenbaum, 2018), resulting in a sub-optimal simulation policy. An ideal solution that avoids these problems is to train policies with simulated states and adapt the learned policies to real-world images. However, without modality consistency, all the structural constraints based on it can no longer be used, and the distribution alignment task of learning a mapping function becomes hard to solve.

Figure 1: Illustration of mapping functions with and without sequential structure, from the target domain (left) to the source domain (right). Shaded regions denote data distributions, where the darker the color, the higher the probability. In Fig. 1(a), both s_t and s'_t are "realistic" source domain instances, but only s_t is correct. Since they are of similar probabilities, distribution matching may map o_t to either of them. In RL, the policy may output unreliable actions when taking these incorrectly mapped states as inputs. In Fig. 1(b), a sequential structure can help rule out the wrong mapping based on trajectory context.

In this paper, we propose Cross-mOdal Domain Adaptation with Sequential structure (CODAS), which learns a mapping function from images in the target domain to states in the source domain. With the help of the learned mapping function, policies trained on states in the source domain can be deployed directly in the target domain of images. Specifically, based on the sequential nature of reinforcement learning problems, we formulate the cross-modal domain adaptation problem as a sequential variational inference problem and derive a series of solvable optimization objectives in CODAS.
It is worth noting that our work differs from recent works that learn state embeddings from image inputs, which map images to an arbitrary subspace of a low-dimensional vector space. In CODAS, we embed the image space into a vector space with clear meanings (defined in the state-based simulator), which improves the interpretability of the policy when deployed in the real world. We evaluate our method on 6 MuJoCo (Todorov et al., 2012) environments provided in OpenAI Gym (Brockman et al., 2016), where we treat states as the source domain and rendered images as the target domain. Experiments are conducted in the scenario where only offline real data are available. Experiment results show that the mapping function learned by our method can transfer the policy to target-domain images with only a small performance degradation, while previous methods that use unaligned Generative Adversarial Networks (GANs) suffer from severe performance degradation on this cross-modal transfer problem. The experiments provide an optimistic result indicating that cross-modal domain adaptation can serve as a low-cost Sim2Real approach.

2. RELATED WORK

To the best of our knowledge, this work is the first to address cross-modal domain adaptation in the RL setting. We discuss two research areas closely related to this topic: (1) unsupervised visual domain adaptation in RL and (2) image-input representation learning in RL.

2.1. VISUAL DOMAIN ADAPTATION IN RL

Unsupervised visual domain adaptation aims to map the source domain and the target domain to an aligned distribution without pairing the data. Prior methods fall into two major categories: feature-level adaptation, where domain-invariant features are learned (Gopalan et al., 2011; Caseiro et al., 2015; Long et al., 2015; Ganin et al., 2017), and pixel-level adaptation, where pixels from a source image are used to generate an image that looks like one from the target domain (Bousmalis et al., 2017; Yoo et al., 2016; Taigman et al., 2017; Hoffman et al., 2018). Pixel-level adaptation is challenging when data from the two domains are unpaired. Prior works tackle this problem by using GANs (Goodfellow et al., 2014) conditioned on simulated images to generate realistic images. Gamrian & Goldberg (2019) transfer policies from Atari games (Bellemare et al., 2015) to modified variants by training a GAN to map images from the target domain to the source domain. GraspGAN (Bousmalis et al., 2018) addresses domain adaptation in robotic grasping by having the GAN reproduce the segmentation mask of the simulated image (including the robot arm, objects, and the bin) as an auxiliary task. RCAN (James et al., 2019) adopts ideas from domain randomization by learning a mapping from randomized simulation images to a canonical simulation, treating the real world as just one of the random simulations. RL-CycleGAN (Rao et al., 2020) unifies the learning of a CycleGAN (Zhu et al., 2017) and an RL policy, claiming better performance by learning the features that are most crucial to the Q-function in RL. Image-to-image domain adaptation can somewhat bypass the ill-posedness of distribution matching (see Fig. 1(a)) because it enjoys an implicit advantage: images from the two domains differ locally, in color, texture, and lighting, but resemble each other globally, while images and states differ in essence.
Some works impose extra structural constraints (e.g., segmentation, geometry) (Fu et al., 2019; Bousmalis et al., 2018), but such tricks also fail in image-to-state domain adaptation. In this work, we force the mapped states to follow transition consistency by using a recurrent structure (see Fig. 1(b)) and to be able to recover the pre-learned policy. We also employ a stochastic mapping function, with the help of a variational encoder, which is more robust to noise in target-domain data.

2.2. REPRESENTATION LEARNING IN RL

Representation learning aims to transform high-dimensional data into lower-dimensional vector representations, which suit RL better. It is widely accepted that learning policies from states (or embeddings) is significantly more sample-efficient than learning from pixels, both empirically (Kaiser et al., 2020; Tenenbaum, 2018; Tassa et al., 2018) and theoretically (Jin et al., 2020). The sequential auto-encoder is a common network structure for learning state representations by minimizing a reconstruction loss. Early works on DRL from images (Ha & Schmidhuber, 2018; Lange et al., 2012; Lange & Riedmiller, 2010) use a two-step learning process in which an auto-encoder is first trained to learn a low-dimensional representation, and a policy or model is subsequently learned on top of this representation. Later works on model-based RL improve representation learning by jointly training the encoder and the dynamics model end-to-end (Watter et al., 2015), which has proved effective in learning useful task-oriented representations. PlaNet (Hafner et al., 2019) learns a hybrid of stochastic and deterministic latent state models using a reconstruction loss. SOLAR (Zhang et al., 2019) combines probabilistic graphical models with a simple network structure to fit locally linear transitions. Some recent works adopt advances in unsupervised representation learning: CURL (Laskin et al., 2020b) utilizes contrastive learning to capture the essential information that distinguishes an image from others, though later works (Laskin et al., 2020a; Kostrikov et al., 2020) point out that data augmentation may play the major role there. Our work utilizes a sequential variational encoder structure to capture sequential information from trajectories. The main difference between our work and representation learning is whether the state space is predefined.
We add extra supervised information to guide the training of the mapping by minimizing the distance between the distributions of the mapped states and the original states, and by enforcing the policy to recover the actions from the mapped states. As a result, we successfully learn states that match the ground-truth simulator states well. The mapped states can be directly fed into the pre-trained policy network.

3. CROSS-MODAL DOMAIN ADAPTATION WITH SEQUENTIAL STRUCTURE

Our work follows a problem setting similar to previous methods that tackle visual domain adaptation problems in RL. We have a policy π pre-trained in the source domain (states) and a dataset pre-collected in the target domain (images). The task is to learn a mapping q_φ from images to states. At deployment, agents interact with the environment through a new policy π'(o) = (π ∘ q_φ)(o), which first maps the image observation to a state and then queries the pre-trained policy.
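The deployment-time composition above can be sketched as follows; `toy_mapping` and `toy_policy` are hypothetical stand-ins for the learned mapping q_φ and the pre-trained policy π.

```python
import numpy as np

def compose_policy(policy, mapping):
    """Return a new policy pi'(o) = (pi o q_phi)(o) that acts on images
    by first mapping them to simulator states."""
    def adapted_policy(observation):
        state = mapping(observation)   # q_phi: image -> state
        return policy(state)           # pi: state -> action
    return adapted_policy

# Toy stand-ins (the real q_phi is a recurrent network, pi a trained policy).
toy_mapping = lambda o: o.mean(axis=(0, 1))   # collapse a fake image to a "state"
toy_policy = lambda s: np.clip(s, -1.0, 1.0)  # bounded action from a state

pi_adapted = compose_policy(toy_policy, toy_mapping)
fake_image = np.full((64, 64, 3), 0.5)
action = pi_adapted(fake_image)               # shape (3,), all entries 0.5
```

At deployment, only `pi_adapted` needs to run; the simulator itself is no longer required.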


Figure 2: Illustration of the generation processes in the real-world domain (a) and the simulation domain (b). All nodes are random variables; shaded nodes are observable. Solid lines denote the generation process and dashed lines denote the inference process. There is an unpaired correspondence between the two state trajectories surrounded by the rounded rectangle. Note that we include the policy π in both generation processes, which corresponds to the edge from state s_t to action a_t.

3.1. DOMAIN ADAPTATION AS VARIATIONAL INFERENCE

The goal of domain adaptation is to find a mapping from the target domain to the source domain, which in our case is from images to states. We first model the generation processes of the target and source domains and the connection between them, as illustrated in Fig. 2. The initial state follows the distribution p(s_1). The transition function p_ϕ(s_t | s_{t-1}, a_{t-1}), modeled as a feed-forward neural network with parameters ϕ, predicts the current state from the previous state and action. The decoder p_θ(o_t | s_t, a_{t-1}, o_{t-1}), modeled as a deconvolution network with parameters θ, reconstructs the current observation from the current state, the previous observation, and the previous action. In practice, we model these distributions as multivariate Gaussians. We allow p_θ(o_t | ·) to depend on o_{t-1} so that some irrelevant patterns in images can be reconstructed in an auto-regressive manner. Such a conditional distribution is feasible in practice since we always have the ground-truth o_{0:t-1} at timestep t in both the training and deployment phases. A sequential generation process has the extra benefit of handling Partially Observable Markov Decision Processes (POMDPs), since a single image cannot in general reveal the full state of the environment. The key point that distinguishes our method from conventional representation learning is the additional constraint that the mapped state trajectories should match the simulation state trajectories (see the rounded rectangle in Fig. 2). Such a constraint implicitly requires that the underlying transition dynamics be the same in the two domains; introducing distributions relaxes this assumption to some extent and increases robustness when there is a small mismatch between the source and target dynamics. The generation process is:

s_1 ∼ p(s_1),   s_t ∼ p(s_t | s_{t-1}, a_{t-1}),   o_t ∼ p(o_t | s_t, a_{t-1}, o_{t-1})   (1)

The overall optimization problem can be formulated as the variational inference problem described in Eq.
2, where we want to learn a posterior distribution to approximate the ground-truth distribution:

min E_{τ_r} [ D_KL [ q_φ(τ_s | τ_r) || p(τ_s | τ_r) ] ]   (2)

where τ denotes a trajectory, the subscripts s and r indicate whether the trajectory is from the source or target domain, p(·) is the ground-truth distribution, q_φ(·) is the mapping function we want to learn, and D_KL is the Kullback-Leibler divergence. Modeling the optimization as trajectory distribution matching naturally handles the stochasticity of environments and policies, as well as possible noise in data collected in the real world. The Evidence Lower Bound (ELBO) of this variational problem can be formulated as:

max E_{τ_r} E_{q_φ(τ_s | τ_r)} [ log p_θ(τ_r | τ_s) ] − D_KL [ q_φ(τ_s | τ_r) || p_π(τ_s) ]   (3)

The derivation of the ELBO follows the common practice of applying Jensen's inequality; a detailed derivation can be found in Appendix A. The first term maximizes the reconstruction probability, enforcing that the mapped states ŝ can recover both the observations o and the actions a. The second term enforces the alignment of the distribution of the mapped trajectories with that of trajectories collected in the simulator. The policy π in the source domain included in the second term does introduce a new assumption: that the target-domain data are collected by a known behavioral policy.

Algorithm 1 Training Procedure of CODAS (main loop)
  Initialize the sequential mapping function f(s_t | s_{t-1}, a_{t-1}, o^r_t).
  for n = 1 to N do
    Sample a batch of target-domain trajectories τ_r = {(o^r_0, a^r_0, ..., o^r_T)_i} from D_r;
    Initialize the RNN with a zero state;
    Infer the corresponding state trajectories τ_s = {(ŝ_1, ..., ŝ_T)_i} via the mapping function f;
    Roll out one step with the oracle simulation dynamics p(s' | s, a) for each state-action pair in τ_s to construct the transition dataset D_ŝ = {(ŝ_i, a_i, s_{i+1})};
    for d = 1 to D do
      Update the discriminator D_ω by maximizing Eq. 13;
    end for
    Update the mapping function (details are in Algorithm 2);
  end for
This assumption is mild and is implicitly or explicitly introduced in previous works (Kim et al., 2020; Gamrian & Goldberg, 2019) .
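As a concrete sketch, the two ELBO terms of Eq. 3 can be estimated from per-trajectory log-probabilities. The single-sample Monte Carlo estimate of the KL term and all numerical values below are our own illustrative simplifications, not the paper's implementation.

```python
import numpy as np

def negative_elbo(recon_logp, q_logp, prior_logp):
    """Single-sample Monte Carlo estimate of the negative ELBO (Eq. 3):
    -E_q[log p_theta(tau_r | tau_s)] + KL[q(tau_s | tau_r) || p_pi(tau_s)].
    Inputs are per-trajectory log-probabilities (hypothetical values here)."""
    reconstruction_term = -np.mean(recon_logp)
    kl_term = np.mean(q_logp - prior_logp)   # E_q[log q - log p_pi]
    return reconstruction_term + kl_term

loss = negative_elbo(np.array([-10.0, -12.0]),   # log p_theta(tau_r | tau_s)
                     np.array([-3.0, -3.5]),     # log q_phi(tau_s | tau_r)
                     np.array([-4.0, -4.5]))     # log p_pi(tau_s)
# reconstruction_term = 11.0, kl_term = 1.0 -> loss = 12.0
```

In practice the KL term is intractable and is replaced with a GAN surrogate, as described in Sec. 3.2.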

3.2. DIFFERENTIABLE OPTIMIZATION OBJECTIVES

The ELBO defined in Eq. 3 contains terms involving distributions over entire trajectories, and is thus impractical to optimize directly. Given the generation process defined in the previous section, we can decompose the joint probability into a product of one-step probabilities; Eq. 4 shows the result of the decomposition. For brevity, we use ŝ, ô to denote the s, o output by the networks, and omit the networks themselves:

max E_{τ_r} [ Σ_{t=1}^{T} E_{q_φ(ŝ_t | o_t, ŝ_{t-1}, a_{t-1})} [ log p_{θ_s}(ô^r_t | o_{t-1}, ŝ_t, a_{t-1}) + log p_{θ_π}(a_{t-1} | ŝ_{t-1}) ] ] − D_KL [ q^π_φ(τ_s | τ_r) || p_π(τ_s) ]   (4)

A direct computation of the D_KL term is intractable. Following the observation that the optimization process of a GAN is equivalent to minimizing a certain distance measure between two distributions (Nowozin et al., 2016), we can use the GAN objective as a surrogate for minimizing D_KL[q_φ(τ_s | τ_r) || p_π(τ_s)]. For implementation simplicity and training stability, we choose the original GAN objective, which is equivalent to minimizing the JS divergence:

ℓ_D(θ, ω) = E_{s ∼ D_s} [ log D_ω(s, a) ] + E_{τ_r ∼ D_r} [ log(1 − D_ω(ŝ, a)) ]   (5)

Here the discriminator takes states, rather than trajectories, as input for better practicality, where "real" samples are simulation states and "fake" samples are mapped states. The latter term is the objective of the generator, which is q_φ in our case; the sequential structure is preserved in the generator. This differs from previous works such as VAE-GAN (Larsen et al., 2016) and Causal InfoGAN (Kurutach et al., 2018), which discriminate against the reconstructed output; instead, we discriminate against the states after the variational encoder. The reconstruction of actions a can be optimized in an end-to-end manner. However, if a differentiable policy π over simulation states is available (the usual case), we can replace p_{θ_π} with π, resulting in the objective in Eq. 6.
A fixed π here can guide q_φ to output states that yield similar actions, an idea related to previous works that reuse a fixed learned model (Schrittwieser et al., 2019):

ℓ_policy = E_{τ_r ∼ D_r} [ Σ_{t=1}^{T} log π(a^r_t | ŝ_t) ]   (6)

Combining all the aforementioned losses, i.e., the reconstruction loss, the policy loss, and the generation loss, the complete optimization objective of the mapping function is:

ℓ_mapping = E_{τ_r ∼ D_r} Σ_{t=1}^{T} [ log p(ô_t | o_t) + λ_D log(1 − D_ω(ŝ_t)) + λ_π log π(a^r_t | ŝ_t) ]   (7)

where λ_D and λ_π are hyper-parameters. Both the decoder and the policy are fixed during the training of the mapping function. The loss function for p_θ is the first term of Eq. 7; likewise, the mapping function q_φ is fixed during the training of the reconstruction function p_θ.
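A minimal sketch of the discriminator objective (Eq. 5) and the combined mapping objective (Eq. 7), assuming scalar per-timestep quantities; the λ values below are illustrative, not the paper's settings.

```python
import numpy as np

def discriminator_objective(d_real, d_fake):
    """GAN objective for D_omega (Eq. 5), to be maximized:
    E[log D(s)] + E[log(1 - D(s_hat))].
    Inputs are discriminator outputs in (0, 1)."""
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

def mapping_objective(recon_logp, d_on_mapped, action_logp,
                      lam_d=0.1, lam_pi=0.1):
    """Per-timestep objective for q_phi (Eq. 7), to be maximized:
    log p(o_hat | o) + lam_d * log(1 - D(s_hat)) + lam_pi * log pi(a | s_hat).
    lam_d and lam_pi are illustrative hyper-parameter values."""
    gen_term = np.log(1.0 - d_on_mapped)   # generator term: fool the discriminator
    return float(np.mean(recon_logp + lam_d * gen_term + lam_pi * action_logp))

d_obj = discriminator_objective(np.array([0.9]), np.array([0.1]))
m_obj = mapping_objective(np.array([-2.0]), np.array([0.5]), np.array([-1.0]))
```

Note that the discriminator maximizes `discriminator_objective` while the mapping function maximizes `mapping_objective`, whose generator term pushes D_ω(ŝ) toward 1.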

3.3. STATE INFERENCE MODEL WITH EMBEDDED DYNAMICS

Since a trajectory in practical RL problems often lasts hundreds or thousands of timesteps, training an RNN on such long-horizon trajectories is difficult. The success of ResNet (He et al., 2016) shows that a residual structure can simplify the learning target by predicting small residuals, yielding a remarkable performance increase when learning ultra-deep neural networks. Adopting a similar idea, we incorporate a residual structure into the RNN to help stabilize training. We have a Dynamics Model (DM) trained independently on transition tuples collected in the simulator. This structure further forces the mapped states to follow the simulation dynamics. The DM's parameters are periodically copied to the embed-DM, which provides an "average" estimate s̄_t of the next state (see Fig. 3); the job of the remaining parts is then simplified to outputting a correction. With the proposed structure, the mapped states follow the simulator's transition dynamics better.
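One step of the residual update can be sketched as follows; `toy_dm` and `toy_corr` are hypothetical stand-ins for the embed-DM and the correction predicted by the RNN cell.

```python
import numpy as np

def residual_state_update(embed_dm, correction_net, s_prev, a_prev, obs_feat):
    """One step of the mapping RNN with an embedded dynamics model:
    s_hat_t = embed_dm(s_prev, a_prev) + delta_s_t, where delta_s_t is a small
    correction predicted from the current observation features. Both modules
    here are toy stand-ins for the learned networks."""
    s_bar = embed_dm(s_prev, a_prev)         # "average" next-state estimate
    delta = correction_net(s_bar, obs_feat)  # residual correction
    return s_bar + delta

# Toy modules: a linear dynamics model and a tiny correction toward obs_feat.
toy_dm = lambda s, a: s + 0.1 * a
toy_corr = lambda s_bar, feat: 0.01 * (feat - s_bar)

s_next = residual_state_update(toy_dm, toy_corr,
                               s_prev=np.array([1.0, 0.0]),
                               a_prev=np.array([1.0, 1.0]),
                               obs_feat=np.array([1.1, 0.1]))
```

Because the embed-DM already predicts most of the next state, the correction the RNN must learn stays small, which is the point of the residual structure.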

(Fig. 3 shows the RNN cell: the encoder maps o_t, together with ŝ_{t-1} and a_{t-1}, to a correction Δs_t, which is added to the embed-DM prediction s̄_t to produce ŝ_t; the DM is periodically copied into the embed-DM.)

The DM is first trained using batches of transition tuples collected in the simulator. During the training of the mapping function, the dynamics model is trained online using D_ŝ = {(ŝ_t, a_t, s'_{t+1} ∼ p(s' | ŝ_t, a_t))} and is periodically copied into the mapping function. That is, we reset the simulator to the mapped state ŝ_t and then roll out one step with the oracle simulation transition to get s'_{t+1}. The optimization objective of the dynamics model is the MSE loss in Eq. 8. Algorithm 2 details the training and updating procedure of the dynamics model and the entire mapping function.

ℓ_dynamics = E_{(s, a, s') ∼ D_s ∪ D_ŝ} [ (t_ϕ(s, a) − s')² ]   (8)
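The MSE objective of Eq. 8 can be sketched as follows, with a toy linear dynamics model standing in for t_ϕ and hand-made transition tuples standing in for D_s ∪ D_ŝ.

```python
import numpy as np

def dynamics_mse(dm, transitions):
    """MSE objective of Eq. 8 for the dynamics model t_phi, averaged over a
    batch of (s, a, s') transition tuples."""
    errors = [np.sum((dm(s, a) - s_next) ** 2) for s, a, s_next in transitions]
    return float(np.mean(errors))

toy_dm = lambda s, a: s + a   # hypothetical stand-in for t_phi
batch = [(np.array([0.0]), np.array([1.0]), np.array([1.0])),
         (np.array([1.0]), np.array([1.0]), np.array([2.5]))]
# first tuple: error 0.0; second: (2.0 - 2.5)^2 = 0.25 -> mean 0.125
mse = dynamics_mse(toy_dm, batch)
```

In CODAS this loss is minimized both offline on simulator data D_s and online on the mapped-state transitions D_ŝ.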

4. EXPERIMENTS

We evaluate our method in 6 MuJoCo environments from OpenAI Gym, namely InvertedPendulum, InvertedDoublePendulum, HalfCheetah, Hopper, Swimmer, and Walker2d. We define the rendered images as the target domain and the original observations as the source domain. The pre-collected dataset in the target domain contains 600 trajectories and is collected by a sub-optimal policy. Please refer to Appendix C for the detailed experiment setting. We modify state-of-the-art methods in same-modal domain adaptation for comparison, namely GAN and CycleGAN. GAN uses the same model structure as the encoding network in CODAS and is trained on pre-collected state and image datasets; the state dataset is of the same size as the image dataset. As described in Sec. D.6, CycleGAN fails in all environments. We also compare with Behavioral Cloning (BC), considering that the target-domain data are collected by a (sub)optimal policy. BC trains a policy in a supervised manner using (o^r_t, a^r_t) ∼ D_r. To mitigate the partial observability issue mentioned in Sec. 3, we stack every 4 consecutive images as the new input to the BC algorithm. All methods are trained until convergence. Implementation details for all methods can be found in Sec. B. Due to some numerical instabilities¹ inside MuJoCo, we cannot obtain the oracle transition function in HalfCheetah and InvertedDoublePendulum; CODAS in these two environments is implemented without the embed-DM. The experiment results in this section answer the following three questions: 1) Does CODAS enable policies to transfer from states to images, and does it outperform state-of-the-art methods? 2) Does every component in CODAS contribute to the overall performance? 3) Is CODAS robust to small mismatches in environment dynamics between the source and target domains?
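The frame stacking used for the BC baseline can be sketched as follows. The padding convention at the start of a trajectory (repeating the first frame) is our assumption; the paper does not specify it.

```python
import numpy as np

def stack_frames(images, k=4):
    """Stack every k consecutive images along the channel axis, as used for
    the BC baseline to mitigate partial observability. The start of the
    trajectory is padded by repeating the first frame (assumed convention)."""
    stacked = []
    for t in range(len(images)):
        frames = [images[max(0, t - i)] for i in reversed(range(k))]
        stacked.append(np.concatenate(frames, axis=-1))
    return np.stack(stacked)

traj = np.zeros((10, 64, 64, 3))   # dummy trajectory of 10 rendered frames
out = stack_frames(traj)           # shape (10, 64, 64, 12)
```

Each stacked input thus carries short-term motion information that a single 64×64×3 frame cannot.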
To focus on the performance of the adaptation process, we use the reward ratio as the metric, defined as r_ratio = r / r*, where r and r* are the cumulative returns of the adapted policy and of the optimal policy trained on states, respectively. The quantitative performance of the optimal policy trained on states and rendered images using PPO is given in Sec. D.2.
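The metric is a simple ratio; the return values below are purely illustrative.

```python
import numpy as np

def reward_ratio(returns_adapted, return_optimal):
    """Evaluation metric r_ratio = r / r*: the mean return of the adapted
    policy divided by the return of the state-based optimal policy."""
    return float(np.mean(returns_adapted) / return_optimal)

ratio = reward_ratio([700.0, 750.0], 1000.0)   # -> 0.725
```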

4.1. PERFORMANCE

Training curves of all methods are shown in Fig. 5, with the x-axis being training iterations and the y-axis being the reward ratio. At each iteration, every method is updated using a batch of 20 trajectories. CODAS reaches an average reward ratio of 70% after adaptation, providing an optimistic result for future applications of cross-modal domain adaptation. BC performs well in simple environments (HalfCheetah and InvertedPendulum) but poorly in the remaining environments, especially those with early termination. GAN performs even worse than BC in most environments, which suggests that distribution matching alone is insufficient for this cross-modal task.

4.2. ABLATION STUDY

A comparison of all ablated variants is shown in Fig. 6; both axes have the same meanings as above. In all environments, mapping functions with the sequential structure yield better deployment performance. In the four environments where the embed-DM is applicable, it helps further improve performance. As we expect, the major purpose of incorporating the embed-DM is to simplify the learning process: if the RNN itself can learn a reasonable mapping, the improvement from the embed-DM is marginal (Walker2d). The learning process is also a little unstable in InvertedDoublePendulum, where the embed-DM is not available.

4.3. ROBUSTNESS TO DYNAMICS MISMATCH

To analyze CODAS's robustness to dynamics mismatch, we modify the dynamics in the source domain. Specifically, we reconfigure the friction coefficient in Hopper. The results can be found in Fig. 11. The results in Hopper with both 110% and 120% friction still demonstrate a robust mapping; moreover, the mismatch in dynamics does not even slow down learning. This suggests that CODAS has the potential to be deployed in more realistic applications where the dynamics of the two domains do not match exactly. We leave a deeper analysis to future work.

5. CONCLUSION

In this work, we propose Cross-Modal Domain Adaptation with Sequential structure (CODAS). CODAS enables a new Sim2Real paradigm: adapting policies trained on simulated states to the real world with image inputs. We believe this setting is valuable in real-world applications of RL, as it exempts us from the tedious work of building and running rendering engines. Previous methods that use GANs fail on this problem, since the global structural resemblance between two same-modal domains no longer exists, while our method succeeds by fully leveraging the sequential structure and the auxiliary information provided by the policy and dynamics underlying the RL problem. We first model the cross-modal domain adaptation problem as a variational inference problem and decompose it into several feasible optimization objectives. To better solve the complex long-horizon sequential mapping problem, we propose a residual network structure. We validate the proposed method by adapting policies trained on states to image inputs in various MuJoCo environments. Our results provide an optimistic view of cross-modal domain adaptation as a low-cost Sim2Real approach.

A DERIVATION OF OPTIMIZATION OBJECTIVES

Let the distribution of trajectories in the simulator be p_π(τ_s), and let τ_r denote real-world trajectories. We want to match q^π_φ(τ_s | τ_r) to the ground-truth distribution p^π_θ(τ_s | τ_r), leading to the optimization objective:

min E_{τ_r} [ D_KL [ q^π_φ(τ_s | τ_r) || p^π_θ(τ_s | τ_r) ] ]

To optimize this objective, we first transform it into the Evidence Lower Bound (ELBO):

D_KL [ q^π_φ(τ_s | τ_r) || p^π_θ(τ_s | τ_r) ]
= E_{q^π_φ(τ_s|τ_r)} [ log ( q^π_φ(τ_s | τ_r) / p^π_θ(τ_s | τ_r) ) ]
= E_{q^π_φ(τ_s|τ_r)} [ log q^π_φ(τ_s | τ_r) − log ( p^π_θ(τ_s, τ_r) / p_π(τ_r) ) ]
= E_{q^π_φ(τ_s|τ_r)} [ log q^π_φ(τ_s | τ_r) − log p^π_θ(τ_r, τ_s) ] + log p_π(τ_r)

The last term log p_π(τ_r) is a constant with respect to the parameters, and thus can be ignored in the optimization.
The minimization can thus be reduced to min E_{τ_r} [ E_{q^π_φ(τ_s|τ_r)} [ log q^π_φ(τ_s | τ_r) − log p^π_θ(τ_r, τ_s) ] ]. Since the sampled trajectories can be considered i.i.d., the inner expectation can be further rewritten as:

E_{q^π_φ(τ_s|τ_r)} [ log q^π_φ(τ_s | τ_r) − log p^π_θ(τ_r, τ_s) ]
= E_{q^π_φ(τ_s|τ_r)} [ log ( q^π_φ(τ_s | τ_r) / ( p^π_θ(τ_r | τ_s) p_π(τ_s) ) ) ]
= − E_{q^π_φ(τ_s|τ_r)} [ log p^π_θ(τ_r | τ_s) ] + D_KL [ q^π_φ(τ_s | τ_r) || p_π(τ_s) ]

Based on the generation process, we can decompose the trajectory-level probabilities into products of per-step terms:

q^π_φ(τ_s | τ_r) = Π_{t=1}^{T} q_φ(s_t | s_{t-1}, o_t, a^r_{t-1}),   o_t, a_{t-1} ∼ τ_r
p^π_θ(τ_r | τ_s) = Π_{t=1}^{T} p_θ(o_t | o_{t-1}, s_t, a_{t-1}),   s_t, a_{t-1} ∼ τ_s   (11)

so that

E_{τ_r} [ E_{q^π_φ(τ_s|τ_r)} [ log p^π_θ(τ_r | τ_s) ] ]
= E_{τ_r} [ Σ_{t=1}^{T} E_{q_φ(ŝ_t | o_t, ŝ_{t-1}, a_{t-1})} [ log p_θ(ô_t | o_{t-1}, ŝ_t, a_{t-1}) ] ]

A direct computation of the D_KL term is intractable. Following the observation that the optimization process of a GAN is equivalent to minimizing a certain distance measure between two distributions (Nowozin et al., 2016), we can use the GAN objective as a surrogate for minimizing D_KL[q_φ(τ_s | τ_r) || p_π(τ_s)], introducing a discriminator D_ω parameterized by ω that maximizes Eq. 13:

ℓ_D(θ, ω) = E_{s ∼ D_s} [ 1 + log ( D_ω(s) / (1 − D_ω(s)) ) ] + E_{τ_r ∼ D_r} [ D_ω(ŝ) / (1 − D_ω(ŝ)) ]   (13)

Since D_ω(ŝ) / (1 − D_ω(ŝ)) can become arbitrarily large, for implementation simplicity and training stability we instead optimize the original GAN objective, formulated as Eq. 14:

ℓ_D(θ, ω) = E_{s ∼ D_s} [ log D_ω(s) ] + E_{ŝ ∼ f_θ(o_r)} [ log(1 − D_ω(ŝ)) ]   (14)

B IMPLEMENTATION DETAILS

Fig. 7 gives an illustration of the overall structure of our method.
Both π_optimal and π_behavioral are policies of the source domain, where π_optimal is the optimal policy trained in the source domain and π_behavioral is a policy in the source domain that mimics the behavior of the dataset-collection policy (in terms of the cumulative returns). In rare cases where a differentiable π_behavioral is not available (e.g., the policy is a decision tree), it can be replaced by a trainable network. The detailed network structure and hyperparameters are listed in Sec. B.1, all of which are the same for every environment. To help balance the influence of the residual Δs_t, we introduce a new hyperparameter λ_res.

D.5 ROBUSTNESS TO DYNAMICS MISMATCH

To test the robustness of CODAS, we manually change the friction of the target environment to create dynamics mismatches. Fig. 11 shows the reward ratio of CODAS in the Hopper environment with different magnitudes of friction. The performance of CODAS remains stable even when the friction change reaches 20%, showing that CODAS is robust to mild dynamics mismatch. It is worth noting that the policy is trained without any robustness-improving technique such as domain randomization.

Two properties of our setting make the original CycleGAN (Sec. D.6) hard to apply:
• the identity loss in the original CycleGAN, used to improve training stability, is not applicable;
• network structures (e.g., UNet) that are specially designed for image generation are not applicable.

D.7 GAN WITH ACTION LOSS

Fig. 13 shows the training curve of GAN with the additional policy loss defined in Eq. 6. Surprisingly, adding a fixed policy network does not help the learning process of the naive GAN. This may be because the gradient from the fixed policy in the initial training stage is useless, or even harmful, to the training of the mapping function. The performance of GAN with the policy loss drops significantly, partly because this loss is large in the modified Swimmer environment (as can also be seen in the BC loss in Fig. 10).



¹ The MuJoCo engine outputs a different s' given the exact same (s, a) as input, due to its inner inaccessible random states.
² https://spinningup.openai.com/en/latest/spinningup/bench.html



Algorithm 1 Training Procedure of CODAS (Input/Output)
Input: Simulator with oracle dynamics p; policy π(a | s) pre-trained on the simulated dynamics p(s' | s, a); target-domain trajectory dataset of images D_r = {(o^r_0, a^r_0, ..., o^r_T)_i}; number of iterations N.
Output: Mapping function f : O → S.

Figure 3: Model structure with embed-DM

Figure 4: A visual illustration of (a) original images (b) reconstructed images (c) re-rendered images by setting the simulator to mapped states.

Figure 5: Training curves of different methods on cross-modal domain adaptation

Figure 7: Illustration of full network structure. Blue parts denote mapping function q φ ; Yellow and green parts denote reconstruction function p θ . Green part is always fixed in the entire training process.

Figure 10: Training curves of Behavior Cloning.

Figure 11: Reward ratio in Hopper with small dynamics mismatches.

D.6 TRAINING CURVES OF CYCLEGAN

We planned to use CycleGAN, one of the state-of-the-art methods in domain adaptation, for comparison. Concretely, we remove the identity loss and use the same network structure as ours (see Visual Encoder/Visual Decoder in Sec. B) for the generator. However, the changes in the loss function and network structure may require a completely new hyperparameter setting compared with image-to-image translation. We failed to obtain a workable CycleGAN in most environments; the best results so far are shown in Fig. 12. The training of the state-to-image GAN in CycleGAN is not stable, which may lead to the overall bad performance. It could be due to the reasons noted at the end of Sec. D.5: the identity loss used to stabilize training and the image-specific generator structures (e.g., UNet) are not applicable in our cross-modal setting.

Figure 12: Training curves of CycleGAN.

Figure 13: Training curves of GAN with action loss.

Algorithm 2 Detailed Training Procedure of the Mapping Function
Input: Simulator with oracle dynamics p; policy π(a | s) pre-trained on the simulated dynamics p(s' | s, a).
# Initialization.
Pre-train the DM t_ϕ with D_s = {(s_i, a_i, s_{i+1})}.
# Update per iteration.
Update the DM t_ϕ by minimizing Eq. 8 using D_ŝ and D_s;
Copy the new parameters ϕ of the DM t_ϕ to the embed-DM in q_φ;
for m = 1 to M do
  Update the mapping function q_φ by minimizing Eq. 7;
  Update the reconstruction function p_θ by minimizing the first term in Eq. 7;
end for


C EXPERIMENT SETTING

We evaluate our method in 6 MuJoCo environments from OpenAI Gym, namely InvertedPendulum, InvertedDoublePendulum, HalfCheetah, Hopper, FixedSwimmer, and Walker2d. FixedSwimmer is based on modifications proposed in previous works (Wang et al., 2019) to avoid sub-optimal policies by changing one sensor position. We treat the simulation observations as the low-dimensional simulation states s and the rendered images as the real images. Images are collected by the "track camera" in HalfCheetah, Hopper, FixedSwimmer, and Walker2d, and by the "default camera" in InvertedPendulum and InvertedDoublePendulum. All images are resized to [64, 64, 3] without any further pre-processing in all environments. Examples of rendered images are shown in Fig. 8. Two policies are independently trained until convergence using PPO (Schulman et al., 2017) for every environment. One of them is regarded as π_target and collects the "real" image dataset; π_target is a stochastic policy, to mimic the real data-collection process. The other is regarded as the pre-trained simulator policy π_source. The image dataset contains 600 episodes, each truncated to a maximum length of 500; the evaluation of all methods is based on this truncated dataset. Table 3 shows the mean and standard deviation of the un-discounted cumulative return of 100 trajectories collected by the optimal policy trained on states using PPO; the maximum episode length is set to 1000. Full training curves of policies trained on the state space are provided in Fig. 9. The performance of policies trained on the state space matches the public benchmarking results.² Fig. 9 also provides training curves of the policy trained on images: we modify the network structures of the actor and critic to adapt PPO to image input, and in all environments the policy performs poorly. The final performance of the image-input policy is measured at 3.0 × 10⁶ timesteps, when the value loss has converged.
As far as we know, there is no public performance benchmark of optimal policies trained on MuJoCo images. Some previous results on the DeepMind Control Suite show better results than ours, partly due to much clearer robots and backgrounds and tuned hyperparameters.

