LEARNING A DOMAIN-AGNOSTIC POLICY THROUGH ADVERSARIAL REPRESENTATION MATCHING FOR CROSS-DOMAIN POLICY TRANSFER

Anonymous

Abstract

The low transferability of learned policies is one of the most critical problems limiting the applicability of learning-based solutions to decision-making tasks. In this paper, we present a way to align latent representations of states and actions between different domains by optimizing an adversarial objective. We train two models, a policy and a domain discriminator, with unpaired trajectories of proxy tasks through behavioral cloning as well as adversarial training. Once the latent representations are aligned between domains, the domain-agnostic part of a policy trained with any method in the source domain can be immediately transferred to the target domain in a zero-shot manner. We empirically show that our simple approach achieves performance comparable to the latest methods in zero-shot cross-domain transfer. We also observe that our method outperforms existing approaches in transfer between domains with different complexities, where those approaches fail catastrophically.

1. INTRODUCTION

Humans have an astonishing ability to learn skills in a highly transferable way. Once we learn the route from home to the station, for example, we can reach the destination using different vehicles (e.g., walking, cycling, or driving) in different environments (e.g., on a map or in the real world), ignoring irrelevant perturbations (e.g., weather, time, or traffic conditions). We find underlying structural similarities between situations, perceive the world, and accumulate knowledge in our own way of abstraction. Such abstract knowledge is readily applicable to various similar situations.

This seems easy for humans but not for autonomous agents. Agents trained with reinforcement learning (RL) or imitation learning (IL) often have difficulty transferring knowledge learned in one situation to another. This is because the learned policies are strongly tied to the representation acquired under a specific training configuration, which does not generalize even to subtle changes in the agent or the environment.

Previous studies have attempted to address this problem with various approaches. Domain randomization (Tobin et al., 2017; Peng et al., 2018; Andrychowicz et al., 2020) aims to learn a policy robust to environmental changes by providing access to multiple training domains, but it cannot handle large domain gaps outside the domain distribution assumed in training, such as drastically different observations or agent morphologies. To overcome such domain discrepancies, Gupta et al. (2017) and Liu et al. (2018) proposed methods to learn domain-invariant state representations for imitation, but these methods require paired, temporally-aligned datasets across domains and, in addition, need expensive RL steps to adapt to the target domain. More recently, Kim et al. (2020) proposed a method that finds cross-domain correspondence of states and actions from unaligned datasets through adversarial training.
This method imposes a strong assumption that there is an exact correspondence of states and actions between different domains and learns it as direct mapping functions. This assumption is problematic when such correspondence is hard to find. For example, if one agent has no legs while another has several, we cannot expect all information on how the agent walks to be translated into the other domain.

In this work, we propose a method that does not require the existence of exact correspondence between domains. Our approach learns domain-invariant representations and a common abstract policy on them that is shared across different domains (Figure 1). Our model consists of two core components: a policy and a domain discriminator. The policy has three blocks: a state encoder, a common policy, and an action decoder. In the first stage, the model is optimized with an imitation objective and an adversarial objective on the learned state representation simultaneously, using an unaligned dataset of proxy tasks. The adversarial training induces the state encoder to generate latent state representations whose domains are indistinguishable to the domain discriminator. Such representations do not contain domain-specific information and thus can work with the common policy. Next, we freeze the parameters of the state encoder and action decoder and update only the common policy in the source domain on the learned feature space to adapt the model to the target task. In this process, as with Kim et al. (2020), we can use any learning algorithm for updating the policy and, moreover, do not require an expensive RL step interacting with the environment. After the update, combined with the fixed encoder and decoder, the learned common policy can be readily used in either domain in a zero-shot manner. We conduct experiments on a challenging maze environment (Fu et al., 2020) with various domain shifts: shifts in observation, action, dynamics, and agent morphology.
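The two-stage procedure described above can be sketched in code. The following is a minimal PyTorch sketch under illustrative assumptions: the paper does not specify architectures, dimensions, loss weights, or optimizers here, so the MLP sizes, the mean-squared-error imitation loss, and the names (`stage1_step`, `stage2_adapt`) are all hypothetical. Stage 1 jointly trains per-domain encoders/decoders and the common policy with behavioral cloning plus a domain-confusion term against a latent-state discriminator; stage 2 freezes the encoders and decoders and retrains only the common policy on source-domain data.

```python
import torch
import torch.nn as nn

S_DIM, A_DIM, Z_DIM = 8, 2, 4  # hypothetical state/action/latent sizes

def mlp(i, o):
    return nn.Sequential(nn.Linear(i, 32), nn.Tanh(), nn.Linear(32, o))

# Per-domain state encoders and action decoders; one shared common policy.
enc = {d: mlp(S_DIM, Z_DIM) for d in ("src", "tgt")}
dec = {d: mlp(Z_DIM, A_DIM) for d in ("src", "tgt")}
common = mlp(Z_DIM, Z_DIM)  # domain-agnostic part, retrained in stage 2
disc = mlp(Z_DIM, 1)        # domain discriminator on latent states

bce = nn.BCEWithLogitsLoss()
pi_params = [p for m in (*enc.values(), *dec.values(), common)
             for p in m.parameters()]
opt_pi = torch.optim.Adam(pi_params, lr=1e-3)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)

def stage1_step(batches):
    """One joint update on proxy-task data: behavioral cloning + domain
    confusion. `batches` holds (domain, states, actions) per domain."""
    # Discriminator: predict which domain each latent state came from.
    d_loss = sum(bce(disc(enc[dom](s).detach()), torch.full((len(s), 1), lbl))
                 for lbl, (dom, s, _) in zip((0.0, 1.0), batches))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Policy: imitate expert actions while fooling the discriminator
    # (flipped labels), so latents become domain-indistinguishable.
    pi_loss = torch.zeros(())
    for lbl, (dom, s, a) in zip((1.0, 0.0), batches):
        z = enc[dom](s)
        pi_loss = pi_loss + ((dec[dom](common(z)) - a) ** 2).mean()
        pi_loss = pi_loss + bce(disc(z), torch.full((len(s), 1), lbl))
    opt_pi.zero_grad(); pi_loss.backward(); opt_pi.step()
    return d_loss.item(), pi_loss.item()

def stage2_adapt(s, a, steps=10):
    """Freeze encoder/decoder; retrain only the common policy on
    source-domain target-task data (plain behavioral cloning here)."""
    opt = torch.optim.Adam(common.parameters(), lr=1e-3)
    for _ in range(steps):
        loss = ((dec["src"](common(enc["src"](s).detach())) - a) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

After stage 2, running `dec["tgt"](common(enc["tgt"](s)))` uses the adapted common policy in the target domain with no further training, which is the zero-shot transfer step.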
Our experiments show that our approach achieves performance comparable to existing methods in most settings. We find that our method is particularly effective in cross-dynamics and cross-robot transfer, where no exact correspondence between domains exists. In summary, our contributions are as follows:

• We provide a novel method for cross-domain transfer with an unaligned dataset. In contrast to the latest method, which learns mappings between domains, our approach acquires a domain-invariant feature space and a common policy on it.

• Our experiments with various domain shifts show that our method achieves comparable performance in transfer within the same agent, and better performance than existing methods in cross-dynamics and cross-robot transfer, by avoiding direct mapping between domains.

2. RELATED WORK

Cross-Domain Policy Transfer between MDPs Transferring a learned policy to a different environment is a long-standing challenge in policy learning. Most previous methods acquire some cross-domain metric to optimize and train a policy for a target task using a standard RL algorithm (Gupta et al., 2017; Liu et al., 2018; 2020; Zakka et al., 2022; Fickinger et al., 2022) or a GAIL (Ho & Ermon, 2016)-based approach (Stadie et al., 2017; Franzmeyer et al., 2022). In particular, Franzmeyer et al. (2022) utilize domain confusion, as we do in this paper. To calculate the pseudo-reward for RL, the distance between temporally-corresponding states (Gupta et al., 2017; Liu et al., 2018) or the distance from the goal (Zakka et al., 2022) in the latent space is often used. Some recent approaches can perform zero-shot cross-domain transfer, as our method does, without interacting with the environment of the target task. Kim et al. (2020) and Raychaudhuri et al. (2021) learn mappings between domains, while Zhang et al. (2020) impose domain confusion on the state representation for domain shifts in observation. Our approach predicts actions without learning cross-domain mappings.
Figure 1: Illustration of a domain-agnostic representation space. Since semantically similar states are close together in the latent space regardless of the original domain, we can transfer knowledge between different domains through the latent space.

