DOMAIN-ROBUST VISUAL IMITATION LEARNING WITH MUTUAL INFORMATION CONSTRAINTS

Abstract

Human beings are able to understand objectives and learn by simply observing others perform a task. Imitation learning methods aim to replicate such capabilities; however, they generally depend on access to a full set of optimal states and actions, taken with the agent's actuators and from the agent's point of view. In this paper, we introduce a new algorithm, called Disentangling Generative Adversarial Imitation Learning (DisentanGAIL), with the purpose of bypassing such constraints. Our algorithm enables autonomous agents to learn directly from high-dimensional observations of an expert performing a task, by making use of adversarial learning with a latent representation inside the discriminator network. This latent representation is regularized through mutual information constraints to incentivize learning only features that encode information about the completion levels of the demonstrated task. This yields a shared feature space in which imitation can be performed successfully while disregarding the differences between the expert's and the agent's domains. Empirically, our algorithm is able to efficiently imitate in a diverse range of control problems including balancing, manipulation, and locomotion tasks, while being robust to various domain differences in terms of both environment appearance and agent embodiment.

1. INTRODUCTION

Recent advances demonstrated the strengths of combining reinforcement learning (RL) with powerful function approximators to obtain effective behavior in high-dimensional control tasks (Lillicrap et al., 2015; Schulman et al., 2017; Haarnoja et al., 2018a). However, RL's reliance on a reward function introduces a fundamental limitation, as reward specification and instrumentation can impose a great design burden on potential users aiming to train an agent for a novel problem. An alternative approach for addressing this limitation is to recover a learning signal from expert demonstrations. Most past work in this area focused on the problem setting where demonstrations are provided directly from the agent's point of view and through the agent's actuators, which we refer to as agent-centric imitation. However, applying agent-centric imitation to real-world robot learning would require users to provide a diverse range of kinesthetic or teleoperated demonstrations to a robotic platform, leading to an unnatural user-agent interaction process. In this paper, we focus instead on learning effective policies solely from a set of external, high-dimensional observations of a different expert agent executing a task. We refer to this problem formulation as observational imitation. Solving it requires disentangling the expert's intentions from the observations' context, a challenge that prior research has typically addressed only under additional assumptions about the environment and the expert data (Torabi et al., 2019). We propose a novel algorithm, called Disentangling Generative Adversarial Imitation Learning (DisentanGAIL), to acquire effective agent behavior without such limitations. Our technique is based on the framework of inverse reinforcement learning, yet it enables an agent to learn with access only to observations collected by watching a structurally different expert.
DisentanGAIL utilizes an off-policy learner alongside a novel discriminator with a latent representation bottleneck, regularized to represent a domain-invariant space over the agent's and expert's sets of observations. This is achieved by enforcing two constraints on the estimated mutual information between the latent representation and the origin of collected observations. In particular, the contribution of this work for solving observational imitation is threefold:

• We propose a discriminator making use of novel mutual information constraints, and provide techniques to adaptively and consistently ensure their enforcement.

• We identify the problem of domain information disguising when estimating mutual information and propose structural modifications to our models for its prevention.

• We show that, unlike prior work, our algorithm can scale to high-dimensional tasks while being robust to domain differences in both environment appearance and agent embodiment, by testing on a novel, diverse set of tasks with varying difficulty.
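To give intuition for the quantity being constrained, the following is a minimal numpy sketch (not the paper's actual neural estimator) of the mutual information I(Z; D) between a discretized latent code Z and a binary domain label D indicating the origin of an observation. All names (`empirical_mi`, `leaky_z`, `random_z`) are illustrative, and the histogram estimator stands in for whatever differentiable estimator the full method would employ. A latent code that leaks the domain attains maximal MI, while an uninformative code attains (near) zero, which is the regime the constraints aim to enforce.

```python
import numpy as np

def empirical_mi(z_bins, d):
    """Histogram (plug-in) estimate of I(Z; D) in nats for discrete samples."""
    z_bins, d = np.asarray(z_bins), np.asarray(d)
    mi = 0.0
    for z in np.unique(z_bins):
        for dom in np.unique(d):
            p_joint = np.mean((z_bins == z) & (d == dom))
            if p_joint == 0:
                continue  # 0 * log(0) contributes nothing
            p_z = np.mean(z_bins == z)
            p_d = np.mean(d == dom)
            mi += p_joint * np.log(p_joint / (p_z * p_d))
    return mi

# Domain labels: half expert (0), half agent (1).
d = np.array([0, 1] * 500)

# A latent code that simply copies the domain label leaks maximal domain
# information: I(Z; D) = H(D) = ln 2 nats for balanced labels.
leaky_z = d.copy()
print(round(empirical_mi(leaky_z, d), 3))  # 0.693

# A code independent of the domain carries (near) zero mutual information.
rng = np.random.default_rng(0)
random_z = rng.integers(0, 2, size=d.size)
print(round(empirical_mi(random_z, d), 3))
```

In the actual algorithm such an estimate would have to be differentiable and computed on continuous latent codes, but the constraint logic is the same: keep the domain information carried by the latent representation below a threshold while preserving task-relevant information.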

2. RELATED WORK

Agent-centric imitation has been a long-studied problem setting. Behavior cloning (Pomerleau, 1989; 1991; Ross et al., 2011) was first proposed to approach imitation from a supervised learning perspective. In particular, an agent is trained to maximize the likelihood of executing a set of recorded optimal actions from the states encountered by the expert. Inverse reinforcement learning (IRL) (Ng et al., 2000; Abbeel & Ng, 2004; Ratliff et al., 2006) was more recently proposed as an alternative two-step solution to imitation and has often been shown to be more effective. First, IRL aims to recover a reward function by parameterizing a discriminator, trained to be representative of the objective portrayed by the expert demonstrations. Second, it learns behavior to accomplish this objective using RL. To effectively understand and represent the expert's intentions, modern instantiations of IRL combined the maximum entropy problem formulation (Ziebart et al., 2008) with deep learning (Wulfmeier et al., 2015; Finn et al., 2016b), and proposed a direct connection with adversarial learning (Ho & Ermon, 2016; Finn et al., 2016a; Kostrikov et al., 2018). This allowed for successful imitation in complex control tasks with few expert demonstrations. Related to our algorithm, Peng et al. (2018) implemented a variational bottleneck to limit information flow in the discriminator, tackling the discriminator saturation problem (Arjovsky & Bottou, 2017). Additionally, Zolna et al. (2019) proposed to optimize the discriminator to be maximally uncertain about uninformative sets of data, as a form of regularization to disregard irrelevant features. A different line of research instead considered imitating from observations of an expert performing a task, in problem settings resembling observational imitation.
Earlier methods proposed to use hand-engineered mappings to domain-invariant features (Gioioso et al., 2012; Gupta et al., 2016), while more recent works proposed to learn such mappings, relying on specific prior data obtained under both the agent's and the expert's perspectives. These techniques include using time-aligned demonstrations (Gupta et al., 2017; Sermanet et al., 2018; Liu et al., 2018; Sharma et al., 2019), or multiple tasks in which the agent and the expert have already achieved expertise (Smith et al., 2019; Kim et al., 2019). While effective, these methods for observational imitation make considerable assumptions about the task structure and the available data. They therefore have limited applicability for arbitrary problems, where environment instrumentation and prior knowledge are minimal. The work most related to ours is by Stadie et al. (2017), where a domain-invariant representation is learned by means of a domain confusion loss, which requires two different expert policies for sampling failure and success demonstrations in the expert domain. While this approach was also adopted in recent works (Okumura et al., 2020; Choi et al., 2020), empirically it yielded successful imitation only in low-dimensional control tasks and with agent and expert domains differing solely in appearance. In contrast, our algorithm only requires a limited set of expert demonstrations and allows for successful imitation in both low- and high-dimensional control tasks, with the expert's and agent's domains differing in both environment appearance and agent embodiment.
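The domain confusion idea discussed above can be illustrated with a short numpy sketch (our own simplification, not the implementation of Stadie et al. (2017); the function name `domain_confusion_loss` is ours). A domain classifier produces logits over domains from extracted features, and the feature extractor is trained to make this classifier maximally uncertain, i.e., to push its predicted domain distribution toward uniform by minimizing the cross-entropy against a uniform target:

```python
import numpy as np

def domain_confusion_loss(logits):
    """Cross-entropy between the predicted domain distribution and a uniform target.

    logits: array of shape (batch, n_domains). Minimizing this loss w.r.t. the
    feature extractor encourages features from which the domain of an
    observation cannot be identified.
    """
    logits = np.asarray(logits, dtype=float)
    n_domains = logits.shape[1]
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    uniform = np.full(n_domains, 1.0 / n_domains)
    # Average over the batch, sum over domains.
    return -(uniform * log_probs).mean(axis=0).sum()

# Confident, domain-revealing predictions incur a high confusion loss...
confident = np.array([[10.0, -10.0], [-10.0, 10.0]])
# ...while maximally ambiguous predictions attain the minimum, log(n_domains).
ambiguous = np.zeros((2, 2))
print(domain_confusion_loss(confident) > domain_confusion_loss(ambiguous))  # True
print(round(domain_confusion_loss(ambiguous), 3))  # 0.693
```

Note the contrast with the approach of this paper: the confusion loss requires failure and success demonstrations from two expert policies to avoid degenerate invariant features, whereas the mutual information constraints described in Section 1 operate directly on the discriminator's latent representation.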

3.1. ADVERSARIAL IMITATION LEARNING

In imitation learning, the agent is provided a set of expert demonstrations B_E = {τ_1, τ_2, ..., τ_N}, where each τ = (s_0, a_0, s_1, a_1, ..., s_T) represents a trajectory collected with an expert policy from

