LEARNING CROSS-DOMAIN CORRESPONDENCE FOR CONTROL WITH DYNAMICS CYCLE-CONSISTENCY

Abstract

At the heart of many robotics problems is the challenge of learning correspondences across domains. For instance, imitation learning requires obtaining a correspondence between humans and robots; sim-to-real requires a correspondence between physics simulators and the real world; transfer learning requires correspondences between different robotics environments. This paper aims to learn correspondences across domains differing in representation (vision vs. internal state), physics parameters (mass and friction), and morphology (number of limbs). Importantly, correspondences are learned using unpaired and randomly collected data from the two domains. We propose dynamics cycles that align dynamic robot behavior across two domains using a cycle-consistency constraint. Once this correspondence is found, we can directly transfer the policy trained in one domain to the other, without any additional fine-tuning in the second domain. We perform experiments across a variety of problem domains, both in simulation and on a real robot. Our framework is able to align uncalibrated monocular video of a real robot arm to dynamic state-action trajectories of a simulated arm without paired data. Video demonstrations of our results are available at: https://sjtuzq.github.io/cycle_dynamics.html.

1. INTRODUCTION

Humans have a remarkable ability to learn motor skills by mimicking the behaviors of agents that look and act very differently from them. For example, developmental psychologists have shown that 18-month-old children are able to infer the intentions of adults and imitate their behaviors (Meltzoff, 1995). Imitation is not easy: children likely need to infer correspondences between their observations and their internal representations, which effectively aligns the two domains. Learning such a cross-domain correspondence is particularly valuable for robotics and control. For example, in imitation learning, if we want robots to imitate the motor skills of humans (or of robots with different morphologies), we need to find the correspondence in both visual observations and morphology dynamics. Similarly, when transferring a policy trained in simulation to a real robot, we again need to align visual inputs and physics parameters across the two environments. To align skills across domains, several prior approaches have proposed learning feature representations that are invariant across the domains (Gupta et al., 2017; Sermanet et al., 2018). Policies or visual representations are trained to be invariant to changes that are irrelevant to the downstream task, while retaining the information useful for cross-domain alignment. However, these methods require paired and aligned trajectories, usually collected by pre-trained policies or human labeling, which is often too expensive for real-world learning problems. Additionally, invariance is a rather strong constraint and may not be universally suitable: different invariances can benefit different downstream tasks, as recently studied in self-supervised visual representation learning (Tian et al., 2020).
Instead of learning invariances, an emerging line of research finds correspondences by learning to translate between two domains with unpaired data (Zhu et al., 2017; Bansal et al., 2018). While this translation technique has shown encouraging results in imitation learning (Smith et al., 2019) and sim-to-real transfer (Hoffman et al., 2017; James et al., 2019), it is limited to finding correspondences in the visual observation space. In real-world applications, however, the physics parameters and morphology dynamics of two domains are often misaligned as well, and learning only passive visual correspondences cannot capture the effects of dynamics. To truly extend correspondence learning to aligning behaviors, we must go beyond the image space and explicitly incorporate dynamics information. In this paper, we take the first steps toward learning correspondences that can align behaviors across a range of domains, including different modalities (vision vs. agent state), different physical parameters (friction and mass), and different morphologies. Importantly, we use unpaired and unaligned data from the two domains to learn the correspondences. Specifically, we propose to find observation correspondences and action correspondences simultaneously using dynamics cycle-consistency. Our dynamics cycles chain observations and actions together across time and across domains; consistency in the cycle indicates consistent translation and prediction results. The input data to our learning algorithm takes the form of 3-element tuples from both domains: current observation, action, and next observation. Figure 1(a) exemplifies our model as a 4-cycle chain containing the observations of one domain (x_t, x_{t+1}) (the real robot in Figure 1(a)) at two time steps, and those of another domain (y_t, y_{t+1}) (the simulation in Figure 1(a)).
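The unpaired training data described above can be made concrete with a small sketch. The toy linear dynamics and the helper name `collect_tuples` below are our own illustrative choices (the paper's domains are e.g. MuJoCo states and real robot images); the point is only the data format: each domain independently contributes (observation, action, next observation) tuples gathered by random exploration, with no alignment between the two datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

def collect_tuples(step_fn, obs_dim, act_dim, n):
    """Roll out a random policy and record (obs_t, action_t, obs_{t+1}) tuples."""
    data = []
    obs = rng.standard_normal(obs_dim)
    for _ in range(n):
        act = rng.uniform(-1.0, 1.0, act_dim)  # randomly collected actions
        nxt = step_fn(obs, act)
        data.append((obs, act, nxt))
        obs = nxt
    return data

# Hypothetical toy dynamics standing in for the two domains.
domain_x = collect_tuples(lambda o, a: o + 0.1 * a, obs_dim=4, act_dim=4, n=1000)
domain_y = collect_tuples(lambda o, a: o + 0.2 * a, obs_dim=3, act_dim=3, n=800)
# The two datasets are unpaired: different lengths, different dimensions,
# and no correspondence labels between them.
```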
To form a cycle, we learn a domain translator G : x_t → y_t that translates images to states, and a predictive forward dynamics model in state space F : y_t × u_t → y_{t+1}, where u_t is the action taken at time t and a_t is the corresponding action in the real robot domain. A forward model in the real robot domain is not required by our framework. The training signal is dynamics cycle-consistency: given observations at time t, the predictions at time t+1 should agree when corresponding actions are taken in the two domains. We explore applications both in simulation and on a real robot. In simulation, we adopt multiple tasks in the MuJoCo (Todorov et al., 2012) physics engine and show that our model can find correspondences and align two domains across different modalities, physical parameters (Figure 1(b)), and morphologies (Figure 1(c)). Given the alignment, we can transfer a reinforcement learning (RL) policy trained in one domain directly to another domain without further optimizing the RL objective. For our real robot experiments, we use the xArm robot (Figure 1(a)). Given only uncalibrated monocular videos of the xArm performing random actions, our method learns correspondences between the real robot and a simulated robot without any paired data. At test time, given a video of the robot arm executing a smooth trajectory, we can generate the same trajectory in simulation.
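The training signal above can be sketched on a toy linear problem. In this sketch (our own construction, not the paper's implementation), the translator G is a learned 2×2 matrix, while the forward model F and the action correspondence are assumed known, so that minimizing the cycle-consistency residual ||F(G(x_t), u_t) − G(x_{t+1})||² by stochastic gradient descent isolates the observation translator; here it recovers the ground-truth rotation relating the two domains.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth correspondence between the domains: a fixed rotation R,
# unknown to the learner. Domain-Y dynamics are y_{t+1} = y_t + u_t.
theta = 0.5
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

def G(x, W):
    # Learned observation translator X -> Y (a linear map in this toy).
    return W @ x

def F(y, u):
    # Forward dynamics model in domain Y (assumed known here).
    return y + u

def cycle_loss(W, x_t, u_t, x_t1):
    # Dynamics cycle-consistency residual: translate x_t, roll it forward
    # with the shared action u_t, and compare against the translated x_{t+1}.
    return np.sum((F(G(x_t, W), u_t) - G(x_t1, W)) ** 2)

# Train W by SGD on the cycle loss using random unpaired-style transitions.
W = 0.1 * rng.standard_normal((2, 2))
lr = 0.05
for _ in range(1000):
    x_t = rng.standard_normal(2)
    u_t = rng.standard_normal(2)
    x_t1 = x_t + R.T @ u_t              # domain-X transition consistent with R
    d = x_t - x_t1
    r = W @ d + u_t                     # equals F(G(x_t), u_t) - G(x_{t+1})
    W -= lr * 2.0 * np.outer(r, d)      # exact gradient of the squared residual

# After training, W has converged to the ground-truth rotation R.
```

The full method additionally learns F and an action translator from data; fixing them here simply keeps the example short while preserving the structure of the objective.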

2. RELATED WORK

Learning invariant representations. To find cross-domain alignments, researchers have proposed learning representations that are invariant to changes unrelated to the downstream task (Tobin et al., 2017; Peng et al., 2018; Gupta et al., 2017; Sermanet et al., 2018; Liu et al., 2017b; Pinto et al., 



Figure 1: We propose to learn observation correspondences (blue arrow) and action correspondences (red arrow) across domains using Dynamics Cycle-Consistency. Our applications include: (a) aligning real robot images with simulation states; (b) aligning actions between environments with different physics parameters (we use different renderings to indicate that the physics differ); (c) aligning both actions and observations simultaneously between agents with different morphologies.

