LEARNING A DOMAIN-AGNOSTIC POLICY THROUGH ADVERSARIAL REPRESENTATION MATCHING FOR CROSS-DOMAIN POLICY TRANSFER

Anonymous

Abstract

The low transferability of learned policies is one of the most critical problems limiting the applicability of learning-based solutions to decision-making tasks. In this paper, we present a way to align latent representations of states and actions between different domains by optimizing an adversarial objective. We train two models, a policy and a domain discriminator, on unpaired trajectories of proxy tasks through behavioral cloning and adversarial training. Once the latent representations are aligned across domains, the domain-agnostic part of a policy trained with any method in the source domain can be immediately transferred to the target domain in a zero-shot manner. We empirically show that this simple approach achieves performance comparable to the latest methods in zero-shot cross-domain transfer. We also observe that our method succeeds in transfer between domains of different complexities, where other methods fail catastrophically.
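As a purely illustrative reading of the objective described above, the two losses might be computed as follows: per-domain encoders map states into a shared latent space, a discriminator tries to tell which domain a latent came from, and the encoders are trained to fool it while also fitting a behavioral-cloning term on source data. All network shapes, the single tanh layers, the variable names, and the 0.1 loss weight are our own assumptions for the sketch, not details taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, W, b):
    # single linear layer with tanh, a stand-in for each network
    return np.tanh(x @ W + b)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical dimensions: source states 6-D, target states 4-D,
# shared latent space 3-D, actions 2-D.
Ws = rng.normal(size=(6, 3)); bs = np.zeros(3)   # source-domain encoder
Wt = rng.normal(size=(4, 3)); bt = np.zeros(3)   # target-domain encoder
Wd = rng.normal(size=(3, 1)); bd = np.zeros(1)   # domain discriminator
Wp = rng.normal(size=(3, 2)); bp = np.zeros(2)   # domain-agnostic policy head

# Unpaired proxy-task batches from the two domains (random stand-in data).
src_states = rng.normal(size=(8, 6)); src_actions = rng.normal(size=(8, 2))
tgt_states = rng.normal(size=(8, 4))

# Encode both batches into the shared latent space.
zs = layer(src_states, Ws, bs)
zt = layer(tgt_states, Wt, bt)

# Discriminator loss: binary cross-entropy for classifying the domain
# of each latent (label 1 = source, 0 = target).
ps = sigmoid(zs @ Wd + bd)
pt = sigmoid(zt @ Wd + bd)
d_loss = -np.mean(np.log(ps + 1e-8)) - np.mean(np.log(1.0 - pt + 1e-8))

# Encoder/policy loss: an adversarial term that rewards fooling the
# discriminator (flipped labels), plus behavioral cloning on source data.
adv_loss = -np.mean(np.log(1.0 - ps + 1e-8)) - np.mean(np.log(pt + 1e-8))
bc_loss = np.mean((layer(zs, Wp, bp) - src_actions) ** 2)
policy_loss = bc_loss + 0.1 * adv_loss  # weighting is an arbitrary choice here
```

In an actual training loop the two losses would be minimized alternately (discriminator step, then encoder/policy step), as in standard adversarial training; gradient computation is omitted here for brevity.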

1. INTRODUCTION

Humans have an astonishing ability to learn skills in a highly transferable way. Once we learn the route from home to the station, for example, we can reach the destination using different modes of transport (e.g., walking, cycling, or driving) and in different environments (e.g., on a map or in the real world), ignoring irrelevant perturbations (e.g., weather, time, or traffic conditions). We find underlying structural similarities between situations, perceive the world accordingly, and accumulate knowledge at our own level of abstraction. Such abstract knowledge is readily applicable to a variety of similar situations.

This seems easy for humans but not for autonomous agents. Agents trained with reinforcement learning (RL) or imitation learning (IL) often have difficulty transferring knowledge learned in one situation to another. This is because the learned policies are strongly tied to representations acquired under a specific training configuration, which do not generalize even to subtle changes in the agent or the environment.

Previous studies have attempted to address this problem in various ways. Domain randomization (Tobin et al., 2017; Peng et al., 2018; Andrychowicz et al., 2020) aims to learn a policy robust to environmental changes by training across multiple domains, but it cannot handle large domain gaps outside the domain distribution assumed during training, such as drastically different observations or agent morphologies. To overcome such domain discrepancies, Gupta et al. (2017) and Liu et al. (2018) proposed methods that learn domain-invariant state representations for imitation, but these methods require paired, temporally aligned datasets across domains and, in addition, need expensive RL steps to adapt to the target domain. More recently, Kim et al. (2020) proposed a method that finds cross-domain correspondences of states and actions from unaligned datasets through adversarial training.
This method imposes the strong assumption that there is an exact correspondence of states and actions between the two domains, and it learns that correspondence as direct mapping functions. The assumption becomes problematic when such a correspondence is hard to find. For example, if one agent has no legs while another has several, we cannot expect every detail of how the legged agent walks to be translatable to the other domain.

In this work, we propose a method that does not require the existence of exact correspondence between domains. Our approach learns domain-invariant representations and a common abstract

