DOMAIN-ROBUST VISUAL IMITATION LEARNING WITH MUTUAL INFORMATION CONSTRAINTS

Abstract

Human beings are able to understand objectives and learn by simply observing others perform a task. Imitation learning methods aim to replicate such capabilities; however, they generally depend on access to a full set of optimal states and actions taken with the agent's actuators and from the agent's point of view. In this paper, we introduce a new algorithm, called Disentangling Generative Adversarial Imitation Learning (DisentanGAIL), with the purpose of bypassing such constraints. Our algorithm enables autonomous agents to learn directly from high dimensional observations of an expert performing a task, by making use of adversarial learning with a latent representation inside the discriminator network. This latent representation is regularized through mutual information constraints to incentivize learning only features that encode information about the completion levels of the task being demonstrated. This makes it possible to obtain a shared feature space for successful imitation while disregarding the differences between the expert's and the agent's domains. Empirically, our algorithm is able to efficiently imitate in a diverse range of control problems including balancing, manipulation and locomotion tasks, while being robust to various domain differences in terms of both environment appearance and agent embodiment.

1. INTRODUCTION

Recent advances have demonstrated the strengths of combining reinforcement learning (RL) with powerful function approximators to obtain effective behavior for high dimensional control tasks (Lillicrap et al., 2015; Schulman et al., 2017; Haarnoja et al., 2018a). However, RL's reliance on a reward function introduces a fundamental limitation, as reward specification and instrumentation can place a great design burden on potential users aiming to train an agent for a novel problem. An alternative approach for addressing this limitation is to recover a learning signal through expert demonstrations. Most of the past work exploring this area focused on the problem setting where demonstrations are provided directly from the agent's point of view and through the agent's actuators, which we refer to as agent-centric imitation. However, applying agent-centric imitation for real-world robot learning would demand users to provide a diverse range of kinesthetic or teleoperated demonstrations to a robotic platform, leading to an unnatural user-agent interaction process. In this paper, we focus instead on learning effective policies solely from a set of external, high dimensional observations of a different expert agent executing a task. We refer to this problem formulation as observational imitation. Solving it requires disentangling the expert's intentions from the observations' context, a challenge that prior research often addressed by relying on additional assumptions about the environment and expert data (Torabi et al., 2019). We propose a novel algorithm, called Disentangling Generative Adversarial Imitation Learning (DisentanGAIL), to acquire effective agent behavior without such limitations. Our technique is based on the framework of inverse reinforcement learning, yet it enables an agent to learn with access only to observations collected by watching a structurally different expert.
DisentanGAIL utilizes an off-policy learner alongside a novel discriminator with a latent representation bottleneck, regularized to represent a domain invariant space over the agent's and expert's sets of observations. This is achieved by enforcing two constraints on the estimated mutual information between the latent representation and the origin of collected observations. In particular, the contribution of this work for solving observational imitation is threefold:

• We propose a discriminator making use of novel mutual information constraints, and provide techniques to adaptively and consistently ensure their enforcement.
• We identify the problem of domain information disguising when estimating mutual information and propose structural modifications to our models for its prevention.
• We show that, unlike prior work, our algorithm can scale to high dimensional tasks while being robust to domain differences in both environment appearance and agent embodiment, by testing on a novel diverse set of tasks with varying difficulty.

2. RELATED WORK

Agent-centric imitation has been a long-studied problem setting. Behavior cloning (Pomerleau, 1989; 1991; Ross et al., 2011) was first proposed to approach imitation from a supervised learning perspective. Particularly, an agent is trained to maximize the likelihood of executing a set of recorded optimal actions from the states encountered by the expert. Inverse reinforcement learning (IRL) (Ng et al., 2000; Abbeel & Ng, 2004; Ratliff et al., 2006) was more recently proposed as an alternative two-step solution to imitation and has been often shown to be more effective. First, IRL aims to recover a reward function by parameterizing a discriminator, trained to be representative of the objective portrayed by the expert demonstrations. Second, it tries to learn behavior to accomplish such objective, utilizing RL. To effectively understand and represent the expert's intentions, modern instantiations of IRL combined the maximum entropy problem formulation (Ziebart et al., 2008) with deep learning (Wulfmeier et al., 2015; Finn et al., 2016b) , and proposed a direct connection with adversarial learning (Ho & Ermon, 2016; Finn et al., 2016a; Kostrikov et al., 2018) . This allowed for successful imitation in complex control tasks with few expert demonstrations. Related to our algorithm, Peng et al. (2018) implemented a variational bottleneck to limit information flow in the discriminator, for tackling the discriminator saturation problem (Arjovsky & Bottou, 2017) . Additionally, Zolna et al. (2019) proposed to optimize the discriminator to be maximally uncertain about uninformative sets of data as a form of regularization to disregard irrelevant features. A different line of research instead considered imitating from observations of an expert performing a task, in problem settings resembling observational imitation. 
Earlier methods proposed to use hand-engineered mappings to domain invariant features (Gioioso et al., 2012; Gupta et al., 2016), while more recent works proposed to learn such mappings, relying on specific prior data obtained under both the agent and expert perspectives. These techniques include using time-aligned demonstrations (Gupta et al., 2017; Sermanet et al., 2018; Liu et al., 2018; Sharma et al., 2019), or multiple tasks where the agent and the expert have already achieved expertise (Smith et al., 2019; Kim et al., 2019). While effective, these methods for observational imitation make considerable assumptions about the task structure and the available data. Therefore, they have limited applicability for arbitrary problems, where environment instrumentation and prior knowledge are minimal. The work most related to ours is by Stadie et al. (2017), where a domain invariant representation is learned through a domain confusion loss that requires two different expert policies for sampling failure and success demonstrations in the expert domain. While this approach was also adopted in recent works (Okumura et al., 2020; Choi et al., 2020), empirically, it yielded successful imitation results only when working in low dimensional control tasks and with the agent and expert domains differing solely in their appearance. On the contrary, our algorithm only requires a limited set of expert demonstrations and allows for successful imitation in both low and high dimensional control tasks, with the expert's and agent's domains differing in both environment appearance and agent embodiment.

3.1. ADVERSARIAL IMITATION LEARNING

In imitation learning, the agent is provided a set of expert demonstrations B_E = {τ_1, τ_2, ..., τ_N}, where each τ = (s_0, a_0, s_1, a_1, ..., s_T) represents a trajectory collected with an expert policy from the agent's point of view. These demonstrations are used to provide the learning signal for the agent to improve its policy π. In Generative Adversarial Imitation Learning (GAIL) (Ho & Ermon, 2016), this learning signal is obtained through a pseudo-reward function R_D, derived from a discriminator network D trained to discern between 'expert' and 'agent' state-action-next state triplets:

arg max_D E_{B_E}[log D(s_i, a_i, s_{i+1})] + E_{p_π(τ)}[log(1 − D(s_i, a_i, s_{i+1}))],   (1)

where p_π(τ) is the distribution of trajectories encountered by the agent, stemming from both the environment's dynamics and its own current policy π. Reinforcement learning methods are then applied for π to adversarially maximize the sum of encountered pseudo-rewards:

arg max_π E_{p_π(τ)}[Σ_{t=0}^{T−1} R_D(s_t, a_t, s_{t+1})],   where   R_D(s_i, a_i, s_{i+1}) = log D(s_i, a_i, s_{i+1}) − log(1 − D(s_i, a_i, s_{i+1})).   (2)

Ho & Ermon (2016) proposed to iteratively execute the adversarial optimization steps described in Eqs. 1 and 2, optimizing π through the Trust Region Policy Optimization algorithm (Schulman et al., 2015).
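As a small illustrative sketch (our own helper, not the paper's implementation), the pseudo-reward in Eq. 2 is simply the logit of the discriminator's output probability, clipped away from 0 and 1 for numerical stability:

```python
import math

def gail_pseudo_reward(d_prob, eps=1e-8):
    """R_D = log D - log(1 - D), given the discriminator's output
    probability for one (s, a, s') transition. Positive when the
    discriminator believes the transition came from the expert."""
    d = min(max(d_prob, eps), 1.0 - eps)
    return math.log(d) - math.log(1.0 - d)
```

At D = 0.5 the pseudo-reward is zero; expert-like transitions (D > 0.5) receive positive rewards and agent-like ones negative rewards.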

3.2. Observational IMITATION LEARNING

In the problem setting of observational imitation, we are concerned with learning without knowledge of the states visited and the actions taken by the expert. We purely rely on observations, o ∈ O, provided in a set of expert demonstrations B_E = {τ_1, τ_2, ..., τ_N} containing visual trajectories τ^i = (o^i_0, o^i_1, ..., o^i_T). Each visual trajectory τ^i is a sequence of observations obtained by watching the expert act in its domain. Here, the discriminator D will be optimized to discern between the expert demonstrations in B_E and 'agent' observations in B_π, a set of visual trajectories collected by watching the agent act according to π. We consider each observation to be an RGB image of the environment at the current time-step, hence containing only partial and highly-entangled information about the true state. We define the expert's and agent's domains as distinct Partially Observable Markov Decision Processes (POMDPs), M_E = (S_E, A_E, O_E, P, p_o, R) and M_A = (S_A, A_A, O_A, P, p_o, R), respectively. The main challenge in observational imitation is that the states s ∈ S_E, actions a ∈ A_E and observations o ∈ O_E in the expert's POMDP do not necessarily match the states s ∈ S_A, actions a ∈ A_A and observations o ∈ O_A in the agent's POMDP.

3.3. MUTUAL INFORMATION

To overcome the differences between the expert's and agent's POMDPs, our algorithm relies on estimating the mutual information between different sets of variables. Mutual information is a statistical measure that represents the level of dependency between two random variables and quantifies how much information each random variable is expected to contain about the other (Kinney & Atwal, 2014). In other words, the mutual information between X and Z measures the difference between the entropy of X and the conditional entropy of X given Z:

I(X, Z) = H(X) − H(X|Z) = H(Z) − H(Z|X).   (3)

Belghazi et al. (2018) showed that the mutual information can be effectively estimated through the Mutual Information Neural Estimator (MINE). This consists of lower-bounding the mutual information through the Donsker-Varadhan dual representation of the KL divergence between the two random variables' distributions (Donsker & Varadhan, 1975), by searching for a parameterized function T_φ:

I(X, Z) ≥ sup_{φ∈Φ} E_{P(X,Z)}[T_φ(x, z)] − log E_{P(X)P(Z)}[e^{T_φ(x, z)}].   (4)
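The Donsker-Varadhan bound in Eq. 4 can be evaluated empirically for any candidate function T: average T over samples from the joint distribution, and subtract the log of the average of e^T over samples with the dependency broken. A minimal sketch (our own, with a hand-picked T in place of a learned statistics network):

```python
import math

def dv_lower_bound(T, joint, marginal):
    """Donsker-Varadhan bound: E_joint[T] - log E_marginal[e^T].
    `joint` holds (x, z) pairs drawn together; `marginal` holds
    pairs with z shuffled to break the dependency."""
    e_joint = sum(T(x, z) for x, z in joint) / len(joint)
    e_marg = sum(math.exp(T(x, z)) for x, z in marginal) / len(marginal)
    return e_joint - math.log(e_marg)
```

For a perfectly dependent pair of uniform binary variables (x = z), the true mutual information is log 2 nats, and a sharp enough T (e.g., a large positive value when x = z, large negative otherwise) makes the bound approach it.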

4. DISENTANGAIL

DisentanGAIL utilizes several algorithmic components to address the problem of observational imitation. Particularly, its discriminator, D, is regularized by the enforcement of two mutual information constraints between the domain origin and a specific latent representation of the collected observations. This enables the pseudo-rewards, R D , to disregard the domain differences between the expert's and agent's domain and provide a meaningful learning signal to improve the agent's policy, π. Together with the visual trajectories in B E and B π , DisentanGAIL also exploits sets of prior data collected in the expert's and agent's domains, denoted by B P.E and B P.π respectively. Such data is obtained by recording observations from the expert's and agent's domains, while neither expert nor agent is attempting to perform the target task, e.g., while they are acting randomly.

4.1. DISCRIMINATOR COMPONENTS

Our algorithm for observational imitation utilizes a convolutional neural network discriminator D_θ, optimized to output the probability that an input sequence of observations occurred as a consequence of expert behavior. As illustrated in Fig. 1, our discriminator can be divided into two distinct sub-models, namely, the preprocessor P_θ1 and the invariant discriminator S_θ2: D_θ = S_θ2 ∘ P_θ1. We define the preprocessor as a parameterized multivariate Gaussian distribution with diagonal covariance, P_θ1 = {μ_θ1, Σ_θ1}, from which a latent representation is sampled for each input observation, z_i ∼ N(μ_θ1(o_i), Σ_θ1(o_i)). The preprocessor's objective is to project each observation into a latent space containing information about the achievement state of the goal, disregarding the irrelevant information about the inherent differences between the expert's and agent's domains. The Gaussian representation ensures that, for any input, the support over the possible latent representations z is infinite; moreover, it allows the model to directly reduce the information in any of the independent dimensions of z by increasing the corresponding variance in Σ_θ1(o_i). Based on these latent representations, the invariant discriminator S_θ2 is tasked to output the discriminator score for the observed behavior. To classify behavior at any time-step, S_θ2 takes as input the concatenated sequence of the latent representations of the four most recent observations, ẑ_t = concat(z_t, z_{t−1}, z_{t−2}, z_{t−3}). Feeding a concatenation of the latent representations over multiple time-steps serves the purpose of facilitating the recovery of information regarding the true unobserved state of the POMDP from the observations. It also allows the discriminator to reason directly about goal-completion progress throughout different consecutive observations.
To understand the necessity of this practice, consider a navigation problem where the agent's task is to reach a target position. Given access only to multiple visual observations showing the location of the agent at different time-steps, the discriminator will be able to assess whether the agent is approaching the target and retrieve higher-order state information about its motion. Both the preprocessor and invariant discriminator are trained end-to-end through the reparameterization trick (Kingma & Welling, 2013), to optimize the GAIL objective J_G to discern behavior from the set of expert demonstrations B_E against behavior from the set of recent agent observations B_π:

arg max_θ J_G(θ, B_E, B_π) = arg max_θ E_{B_E, P_θ1}[log S_θ2(ẑ_i)] + E_{B_π, P_θ1}[log(1 − S_θ2(ẑ_i))].   (6)
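The two-stage structure above can be sketched as follows. This is a minimal illustration with hypothetical function names (`mu`, `sigma` stand in for the learned μ_θ1, Σ_θ1 networks), showing the stochastic Gaussian draw and the four-step concatenation that forms ẑ_t:

```python
import random

def preprocess(mu, sigma, obs):
    """Stochastic preprocessor: z_i ~ N(mu(o_i), diag(sigma(o_i)^2)).
    In an autodiff implementation the draw would use the
    reparameterization trick; here we just show the sampling."""
    return [m + s * random.gauss(0.0, 1.0)
            for m, s in zip(mu(obs), sigma(obs))]

def discriminator_input(mu, sigma, obs_window):
    """Build z_hat_t = concat(z_t, z_{t-1}, z_{t-2}, z_{t-3}) from the
    four most recent observations (most recent first)."""
    assert len(obs_window) == 4
    z_hat = []
    for o in obs_window:
        z_hat.extend(preprocess(mu, sigma, o))
    return z_hat
```

The invariant discriminator S_θ2 would then score `discriminator_input(...)`; increasing a dimension of `sigma` drowns that latent dimension in noise, which is the mechanism the preprocessor uses to discard information.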

4.2. MUTUAL INFORMATION CONSTRAINTS

To obtain an invariant latent space from the preprocessor's output, we propose to enforce two different constraints on the mutual information between the observations' latent representations z_i ∼ P_θ1(o_i) and a corresponding set of domain labels. Each domain label d_i is a binary variable representing whether the associated observation o_i was collected in the expert POMDP, i.e., d_i = 1_{o_i ∈ B_E ∪ B_P.E}. To estimate the mutual information, we make use of the MINE estimator and utilize a statistics network T_φ optimized to maximize the objective in Eq. 4 between the latent representations and the domain labels for the observations in B_E and B_π:

arg max_φ I_φ(z_i, d_i | B_E ∪ B_π) = arg max_φ E_{P(d_i), P(z_i|d_i)}[T_φ(z_i, d_i)] − log E_{P(d_i), P(z_i)}[e^{T_φ(z_i, d_i)}].   (7)

Expert demonstrations constraint. The first mutual information constraint is for the latent representations of the observations from the union of B_E with B_π. We propose to constrain the estimated mutual information of these latent representations with the corresponding domain labels to be less than 1 bit: I_φ(z_i, d_i | B_E ∪ B_π) < 1. We define two kinds of information that the preprocessor P_θ1 can encode into the latent representations to aid the invariant discriminator S_θ2 in discerning transitions o_{t:t−3} from B_E and B_π: (i) domain information, from the visual differences of the two environments (labeled by d_i), or (ii) goal-completion information, from the expected progress shown in the observations towards achieving the goal demonstrated by the expert in B_E (represented by the variable c_i). By constraining the mutual information of the latent representations with the domain labels d_i to be less than 1 bit, we prevent the invariant discriminator from relying exclusively on information inherent to the domain origin to make its classification decision. Therefore, we force it to seek goal-completion information about c_i to fully optimize its objective from Eq. 6.
We empirically evaluate this constraint against tighter constraints in Appendix D.

Prior data constraint. An additional mutual information constraint is for the latent representations of the observations from the union of the prior data sets collected independently in both agent and expert domains, B_P.E and B_P.π. The observations collected in these sets are expected to come from observing both the expert's and agent's domains while neither expert nor agent is attempting to perform the target task. Hence, by assuming that the goal-completion levels observed in these two sets approximately match, we can constrain the mutual information of the corresponding latent representations with the domain labels to be near 0, namely I_φ(z_i, d_i | B_P.E ∪ B_P.π) ≈ 0. This mutual information constraint implicitly optimizes for a mapping which makes the distributions of latent representations generated from the observations in the two prior sets of data equivalent. Hence, it allows for the utilization of great amounts of cheaply collected unsupervised data to provide an additional learning signal regarding the information which should be discarded by the preprocessor.

Comparison with prior efforts. Stadie et al. (2017) proposed to constrain the mutual information between the encoded observations and the domain labels in B_E ∪ B_π to be 0. We argue that enforcing such a constraint would unnecessarily limit the information in the latent representations and impair learning. This constraint assumes that some observable factor determining c_i is independent of the domain labels d_i; otherwise, no information about c_i can be encoded in the latent representations. However, given that in B_E ∪ B_π we have o_i ∈ B_E ⇔ d_i = 1, such an assumption seldom holds, as it requires the distributions of some goal-completion information about c_i present in the observations in B_E and B_π to exactly match. However, unlike this work, the algorithm proposed by Stadie et al.
(2017) does not attempt to enforce such a constraint precisely. Instead, it simply penalizes a measure proportional to the mutual information via a domain confusion loss with a fixed weight coefficient. We further discuss the implications of this practice and provide a toy example where truly enforcing such a constraint would prevent any learning in Appendix A.

4.3. OFF-POLICY LEARNING

To learn effective behavior, we combine our regularized DisentanGAIL discriminator with the off-policy Soft Actor-Critic (SAC) algorithm by Haarnoja et al. (2018b). To optimize a parameterized agent policy, π_ω, SAC maximizes the expected sum of entropy-regularized pseudo-rewards:

arg max_ω J(ω) = arg max_ω E_{p_πω(τ)}[Σ_{t=0}^{T} R_D(o_t, o_{t−1}, o_{t−2}, o_{t−3}) − α log π_ω(a_t | s_t)].   (8)
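The entropy-regularized return in Eq. 8 can be illustrated with a tiny helper (our own sketch, not SAC itself): each step's pseudo-reward is offset by α times the log-probability of the chosen action, so a deterministic, confident policy pays an entropy penalty.

```python
def entropy_regularized_return(pseudo_rewards, log_probs, alpha):
    """J = sum_t [ R_D(o_t, ..., o_{t-3}) - alpha * log pi(a_t | s_t) ]
    for a single trajectory, given per-step pseudo-rewards and the
    policy's log-probabilities of the actions taken."""
    return sum(r - alpha * lp for r, lp in zip(pseudo_rewards, log_probs))
```

Note that negative log-probabilities (high-entropy actions) increase the return, which is how the α term trades off exploitation against exploration.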

5.1. ENFORCING THE MUTUAL INFORMATION CONSTRAINTS

We implement two different techniques to enforce the mutual information constraints proposed in Section 4.2, given a set of observations B with the corresponding domain labels. Both techniques make use of a single hyper-parameter I_max, which represents the upper limit on the estimated information about the domain labels that we allow the latent representations to retain.

Adaptive penalty L_β. The first technique adds a supplementary loss function for the preprocessor P_θ1, penalizing it proportionally to the estimated mutual information in the latent representations. We use an adaptive parameter β to keep the mutual information within the desired range: L_β(θ_1, B) = β I_φ(z_i, d_i|B). We design our updates of β to follow a pattern similar to the adaptive penalty coefficient proposed by Schulman et al. (2017), utilizing I_max and updating:

• β ← β × 1.5, if I_φ(z_i, d_i|B) > I_max;
• β ← β ÷ 1.5, if I_φ(z_i, d_i|B) < I_max ÷ 2.

Dual penalty L_λ. The second technique consists of a different supplementary loss function penalizing the preprocessor P_θ1 proportionally to the violation of the upper limit. In this case, we ensure constraint enforcement through the introduction of a non-negative Lagrange multiplier λ: L_λ(θ_1, B) = λ(I_φ(z_i, d_i|B) − I_max), where λ is updated to maximize L_λ, approximating dual gradient descent (Boyd et al., 2004): λ ← max(0, λ + α(I_φ(z_i, d_i|B) − I_max)). In practice, the dual penalty enforces the mutual information constraints precisely, but the dual variable λ takes more iterations to stabilize than the adaptive parameter β. Our final implementation makes use of the adaptive penalty when enforcing the expert demonstrations constraint, with I_max = 0.99, and the dual penalty when enforcing the prior data constraint, with I_max = 0.001. Hence, our penalized discriminator objective augments the original discriminator objective in Eq. 6 as:

arg max_θ J_G(θ, B_E ∪ B_π) − L_β(θ_1, B_E ∪ B_π) − L_λ(θ_1, B_P.E ∪ B_P.π).   (12)
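The two coefficient-update rules above can be sketched directly (a minimal illustration; the learning-rate argument name `lr` for the dual step size is our own):

```python
def update_beta(beta, mi_estimate, i_max):
    """Adaptive penalty coefficient (cf. Schulman et al., 2017):
    grow by 1.5x when the constraint is violated, shrink by 1.5x
    when the estimate falls below half the limit."""
    if mi_estimate > i_max:
        return beta * 1.5
    if mi_estimate < i_max / 2:
        return beta / 1.5
    return beta

def update_lambda(lam, mi_estimate, i_max, lr):
    """Dual ascent on the Lagrange multiplier, clipped at zero:
    lambda <- max(0, lambda + lr * (I_phi - I_max))."""
    return max(0.0, lam + lr * (mi_estimate - i_max))
```

Between I_max/2 and I_max, β is left untouched, giving the penalty a dead zone that avoids oscillation; λ, in contrast, tracks the constraint violation continuously.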

5.2. DOMAIN INFORMATION DISGUISING

The discriminator D_θ is updated to maximize Eq. 12, given a mutual information estimate from a fixed statistics network T_φ. Hence, if the optimization of T_φ temporarily converges to a sub-optimal local minimum, D_θ could encode domain information into the latent representations z without the statistics network detecting it. We refer to this phenomenon as domain information disguising, and we utilize two further techniques to prevent it.

Double statistics network. The first technique takes inspiration from the work of Van Hasselt et al. (2016) and consists of independently learning two statistics networks, T_φ1 and T_φ2. The mutual information is then estimated by taking the maximum prediction of the two independent models over the same set of observations, Î_φ(z_i, d_i|B) = max(I_φ1(z_i, d_i|B), I_φ2(z_i, d_i|B)). This change makes it impractical for the discriminator to disguise domain information, as the gradient from each of the regularization losses can only contain information about a single statistics network at a time (due to the max operation). Therefore, a statistics network reaching a sub-optimal local minimum has the chance to recover, without affecting the mutual information estimates to a great extent. This practice also has the benefit of providing a better prediction of the current mutual information, counteracting the effects of epistemic uncertainty on the optimization.

Invariant discriminator regularization. The second technique comprises regularizing the invariant discriminator S_θ2 to be approximately 1-Lipschitz. Much of the GAN literature (Arjovsky et al., 2017; Gulrajani et al., 2017; Miyato et al., 2018) showed the effectiveness of this practice when regularizing discriminator networks.
In our specific problem setting, it has the further benefit of restricting the expressivity of the invariant discriminator, preventing it from capturing domain information not captured by our mutual information estimator. To enforce this, we utilize spectral normalization, a regularization technique proposed by Miyato et al. (2018).

DisentanGAIL trains all the models end-to-end, in three main learning steps: (i) discriminator learning, where the discriminator's parameters θ are updated to maximize Eq. 12; (ii) mutual information learning, where the statistics network's parameters φ are updated to maximize Eq. 7; (iii) agent learning, where the learner's parameters ω are updated to maximize Eq. 8 with SAC. We provide further implementation details and a formal summary of the algorithm in Appendix B.
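The double-statistics-network estimate described above reduces to a max over two independently trained estimators; a minimal sketch (our own, with the estimators stubbed as plain callables):

```python
def mi_estimate_double(estimators, latents, labels):
    """I_hat(z, d | B) = max_k I_{phi_k}(z, d | B): take the larger of
    the independent statistics networks' estimates, so a single network
    stuck in a poor local minimum cannot hide domain information
    from the regularizer."""
    return max(est(latents, labels) for est in estimators)
```

In training, only the estimator achieving the max receives gradient through the penalty at each step, which is what lets a lagging network recover independently.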

6. EXPERIMENTS

To evaluate our algorithm, we design six different environment realms, simulated with MuJoCo (Todorov et al., 2012), extending the environments from Brockman et al. (2016): Inverted Pendulum, Reacher, Hopper, Half-Cheetah, 7DOF-Pusher and 7DOF-Striker. We define an environment realm as a set of environments with a shared semantic goal but with significant differences in terms of appearance and agent embodiment. For each of the experiments, we select a source environment and a target environment within one environment realm. We then train an 'expert' agent and collect a set of visual trajectories in the source environment. An 'observer' agent will then use these visual trajectories to perform imitation in the target environment, without access to the reward function. In all our experiments, each epoch corresponds to the 'observer' agent collecting 1000 time-steps of experience in the target environment. We report at each epoch the mean and standard error over five experiments of the maximum expected cumulative reward recorded so far. We obtain the expected cumulative reward by averaging the performance of the 'observer' agent over five trajectories. We scale the cumulative rewards such that 0 represents the performance from random behavior, and 1 represents the performance obtained by the 'expert' agent. We provide a detailed description of the different environment realms in Appendix C.

Can DisentanGAIL efficiently solve the problem of observational imitation with both appearance and embodiment mismatches? We first evaluate the performance of our algorithm on the Inverted Pendulum and Reacher realms. We test eight different combinations of source and target environments for each of these realms. We allow the 'observer' agent to train for a maximum of 20 epochs.
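The reward scaling described above is a simple affine normalization; a one-line sketch (our own helper):

```python
def normalized_score(agent_return, random_return, expert_return):
    """Scale returns so that random behavior scores 0 and the
    'expert' agent scores 1."""
    return (agent_return - random_return) / (expert_return - random_return)
```

Scores above 1 would indicate the observer outperforming the demonstrating expert; scores below 0 indicate worse-than-random behavior.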
To evaluate the effectiveness of our proposed constraints and techniques in solving the problem of observational imitation, we compare the performance of the following algorithms:

• DisentanGAIL: the full proposed algorithm, as described in Section 5.
• DisentanGAIL without prior data (No prior): DisentanGAIL without the prior data constraint.

We present the performance curves in Fig. 2 and a summary of the results in Table 1. Particularly, the full DisentanGAIL algorithm outperforms all other algorithms when considering any domain difference and consistently achieves a performance close to the 'expert' agent. In comparison, TPIL severely under-performs, even when evaluated given five times the amount of experience. Applying the domain confusion loss to DisentanGAIL significantly and consistently deteriorates the performance, validating the effectiveness of our proposed constraints. However, this version of DisentanGAIL still vastly outperforms the original TPIL implementation, underlining the superiority of our proposed model and optimization. DisentanGAIL with no prior data performs well in most experiments, but under-performs when faced with drastic changes in domain appearance, indicating that the utilization of sets of prior data is important when strong visual cues about environment correspondences are missing. We report additional ablation studies for our model in Appendix D.

Can DisentanGAIL scale to more challenging, high dimensional control tasks? We evaluate our algorithm on the four remaining realms, which consist of substantially harder problems, narrowing the evaluation gap with agent-centric imitation algorithms. We refer to the environments in these realms as 'high dimensional' since their state and action spaces are significantly larger than those of the environments explored in prior work making use of the domain confusion loss (Stadie et al., 2017; Okumura et al., 2020; Choi et al., 2020).
Namely, we explore two locomotion realms, Hopper and Half-Cheetah, and two manipulation realms, 7DOF-Pusher and 7DOF-Striker. We present the performance curves in Fig. 3 and a summary of the results in Table 2. Remarkably, DisentanGAIL is able to recover close to the expert's performance in both manipulation realms, with at least the same efficiency as the No latent representation regularization baseline learning in the 'source' environment. In the locomotion realms, the performance appears to converge more slowly to a similar but lower value than the expert's. We hypothesize this is because the main objectives in the locomotion realms are based on the agents continuously executing a particular stream of actions rather than reaching a target state. Thus, the analyzed shifts in agent embodiment, which even modify the dimensionality of the action spaces, strongly increase the discriminator's ambiguity in rewarding the best possible way to solve the tasks. The performance gap of DisentanGAIL with the rest of the baselines is considerably greater than in the previous set of experiments. In particular, TPIL and the No latent representation regularization baseline fail to recover meaningful behavior in any experiment. Similarly, applying the domain confusion loss to DisentanGAIL considerably degrades performance across all problems. Additionally, removing the prior data constraint from DisentanGAIL also appears to degrade the performance. Yet, DisentanGAIL with no prior data still outperforms all other baselines in three environment realms and is able to almost match the full DisentanGAIL performance in the Half-Cheetah realm. These results highlight the complexity of performing observational imitation in high dimensional environments and show the effectiveness of our proposed constraints and optimization.

7. CONCLUSION

We proposed DisentanGAIL, a novel algorithm to effectively solve the problem of observational imitation. Our method makes use of two mutual information constraints for a latent representation inside the discriminator network to encode goal-completion information and discard domain information about the observations. Unlike prior work, our experiments show DisentanGAIL's effectiveness at dealing with various domain differences, both in terms of environment appearance and agent embodiment, and at scaling to more complex high dimensional tasks. We believe our work might have strong implications for future real-world imitation learning, as it could allow users to teach agents new tasks by simply being observed, leading to natural human-robot interactions. To facilitate future efforts, we share the code for our algorithms and environments: https://github.com/Aladoro/domain-robust-visual-il.

A ALTERNATIVE EXPERT DEMONSTRATIONS CONSTRAINT

Previous work (Stadie et al., 2017) proposed to optimize the GAIL objective described in Eq. 6, subject to constraining the mutual information to be 0:

arg max_θ J_G(θ, B_E ∪ B_π)   s.t.   I(z_i, d_i | B_E ∪ B_π) = 0.

In practice, however, the algorithm proposed by Stadie et al. (2017) enforces such a constraint very loosely via a domain confusion loss. This is achieved by introducing a second classifier, C_φ, on top of the preprocessor P_θ1, to predict the domain labels d_i from the latent representations z_i:

J_DCL(φ, θ_1, B_E, B_π) = E_{B_E, P_θ1}[log C_φ(z_i)] + E_{B_π, P_θ1}[log(1 − C_φ(z_i))].

The GAIL objective J_G is then augmented by the domain confusion loss, J_DCL, where the preprocessor is adversarially trained to minimize the information about the domain labels d_i useful for C_φ. This optimization is regulated by a fixed weight coefficient λ:

arg max_θ arg min_φ J_G(θ, B_E ∪ B_π) − λ J_DCL(φ, θ_1, B_E, B_π).

As a consequence, in practice the domain confusion loss acts more as a heuristic to minimize the domain label information contained in the single latent representations, rather than attempting to enforce a precise constraint. Below, we provide a toy example where truly enforcing I(z_i, d_i) = 0 would prevent the preprocessor from encoding any useful information about the observations.

Consider a simple task where the objective is for an agent to reach and remain in a target state. In this setting, we let the agent and expert POMDPs differ in their observation spaces. Specifically, we define the observations collected to be composed of two binary variables, o_i = (x_i, y_i). The value of the first variable x_i is 1 for any observation in the expert POMDP, and it is 0 for any observation in the agent POMDP. The value of the second variable y_i is 1 if the visited state is a target state, and it is 0 otherwise.
Therefore, for any observation o_i, the first binary variable x_i contains domain information about d_i which should be discarded (as x_i = d_i), while the second binary variable y_i contains useful goal-completion information about c_i which should be encoded in z_i. Consider an instance of this problem with a task horizon of 3 and with four visual trajectories in each of B_E and B_π, as described in Table 3. In this example, unlike the visual trajectories collected by the agent in B_π, the expert demonstrations in B_E accomplish the goal of this task, as they show successfully reaching and remaining in a target state. Therefore, the distributions of goal states encountered in the observations from B_E and B_π are different, and consequently, y_i is not statistically independent of x_i. We show this by computing the conditional probabilities:

p(x_i = 1 | y_i = 1) = 4/5 ≠ p(x_i = 1 | y_i = 0) = 2/7 ⇒ p(x_i = 1 | y_i) ≠ p(x_i = 1) = 1/2.

Thus, the observable goal-completion information about c_i, present in y_i, is not statistically independent of the domain labels d_i. For simplicity, consider a deterministic preprocessor encoding the latent representations as z_i = P(o_i) (as proposed by Stadie et al. (2017)). Enforcing I(z_i, d_i) = 0 means that the value of z_i must be independent of d_i, and since d_i = x_i, we must have z_i = P(o_i) = P(x y_i) for all x, i.e., z_i = P_y(y_i) for some function P_y of y_i alone. However, we claim that this also implies I(z_i, c_i) = 0. We can easily show this by contradiction: assume that I(z_i, c_i) > 0; since y_i is the only observable source of information about c_i, then P_y(y_i = 0) ≠ P_y(y_i = 1). Therefore, P_y must be invertible, or in other words, there exists a function P_y^{-1} such that P_y^{-1}(P_y(y_i)) = y_i for all y_i. But then,

p(d_i | z_i) = p(d_i | P_y^{-1}(z_i)) = p(d_i | y_i) = p(x_i | y_i) ≠ p(x_i) = p(d_i),

so z_i is not independent of d_i and we must have I(z_i, d_i) > 0, contradicting the constraint. Hence, I(z_i, c_i) = 0: under the exact constraint, the preprocessor cannot encode any useful goal-completion information.
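The counting argument above can be verified numerically. In the sketch below, B_π is taken from Table 3, while the B_E trajectories are an assumed configuration ({10, 11, 11} repeated four times) chosen to be consistent with the stated conditional probabilities:

```python
from itertools import chain

# Each observation is a pair (x, y): x = 1 in the expert POMDP and 0 in
# the agent POMDP (so x equals the domain label d); y = 1 iff the visited
# state is a target state. B_E below is an assumed configuration
# consistent with the stated conditionals; B_pi is taken from Table 3.
B_E = [[(1, 0), (1, 1), (1, 1)]] * 4   # expert: reaches and stays at goal
B_pi = [[(0, 0), (0, 0), (0, 0)],
        [(0, 0), (0, 1), (0, 0)],
        [(0, 0), (0, 0), (0, 1)],
        [(0, 0), (0, 0), (0, 0)]]      # agent: mostly misses the goal

obs = list(chain(*B_E, *B_pi))         # all 24 observations

def p(pred, given=lambda o: True):
    """Empirical probability of pred(o) among observations where given(o)."""
    pool = [o for o in obs if given(o)]
    return sum(pred(o) for o in pool) / len(pool)

p_x1_given_y1 = p(lambda o: o[0] == 1, lambda o: o[1] == 1)  # 4/5
p_x1_given_y0 = p(lambda o: o[0] == 1, lambda o: o[1] == 0)  # 2/7
p_x1 = p(lambda o: o[0] == 1)                                # 1/2
print(p_x1_given_y1, p_x1_given_y0, p_x1)
```

The three computed probabilities match the conditionals in the text, confirming that y_i and the domain label are statistically dependent in this example.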
B ALGORITHM DETAILS

B.1 PRIOR DATA

The prior data sets utilized to enforce the prior data constraint are collected by executing random behavior in both the 'source' and 'target' environments. To add diversity to the discriminator's learning signal, we also use samples from both prior sets as additional negative examples for J_G. The baselines making use of the domain confusion loss also make use of prior data, analogously to how 'failure' data is used in the original TPIL algorithm (Stadie et al., 2017). On the other hand, the DisentanGAIL with no prior data and No latent representation regularization baselines are evaluated without access to prior data.

B.2 DISCRIMINATOR

Given an observation o_i, to sample the latent representation z_i ∼ N(μ_θ1(o_i), Σ_θ1(o_i)), we flatten the output of the preprocessor and split the resulting K-dimensional vector into two halves. We utilize the first half of the variables to obtain the latent representation's mean, while we apply a Tanh nonlinearity, exponentiate and scale the second half to obtain the latent representation's diagonal covariance:

μ_θ1(o_i) = P_θ1(o_i)_{1:K/2}, Σ_θ1(o_i) = diag(exp(tanh(P_θ1(o_i)_{K/2:K})) / 2).

To obtain a non-stochastic learning signal for the policy, when calculating the pseudo-reward

R_D(o_i, o_{i-1}, o_{i-2}, o_{i-3}) = log(D_θ(o_i, o_{i-1}, o_{i-2}, o_{i-3})) - log(1 - D_θ(o_i, o_{i-1}, o_{i-2}, o_{i-3})),

we set the Gaussian noise to zero, equivalently substituting D_θ(o_i, o_{i-1}, o_{i-2}, o_{i-3}) with:

D_θ^det(o_i, o_{i-1}, o_{i-2}, o_{i-3}) = S_θ2(concat(μ_θ1(o_i), μ_θ1(o_{i-1}), μ_θ1(o_{i-2}), μ_θ1(o_{i-3}))).
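A minimal sketch of this parameterization (our reading of the text; the exact scaling of the covariance is an assumption):

```python
import numpy as np

def split_latent_params(p_out):
    """Split the flattened K-dim preprocessor output into the latent
    Gaussian's mean and diagonal covariance, as described in B.2.
    The covariance transform (exp(tanh(.)) / 2) is our reading of the
    'Tanh, exponentiate and scale' description."""
    K = p_out.shape[-1]
    mu = p_out[..., : K // 2]
    cov_diag = np.exp(np.tanh(p_out[..., K // 2:])) / 2.0
    return mu, cov_diag

def sample_latent(p_out, rng):
    """Sample z ~ N(mu, diag(cov)) with the reparameterization trick."""
    mu, cov_diag = split_latent_params(p_out)
    return mu + np.sqrt(cov_diag) * rng.standard_normal(mu.shape)

def pseudo_reward(d_prob, eps=1e-8):
    """R_D = log D - log(1 - D), computed from the deterministic
    discriminator output (Gaussian noise set to zero, i.e. z = mu)."""
    return np.log(d_prob + eps) - np.log(1.0 - d_prob + eps)
```

For instance, a 4-dimensional preprocessor output `[0.5, -1.0, 0.3, 2.0]` yields mean `[0.5, -1.0]`, while a discriminator output of exactly 0.5 yields a pseudo-reward of 0.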

B.3 TRAINING SPECIFICATIONS

The discriminator loss from Eq. 12 and the MINE estimator objective from Eq. 7 are approximated utilizing batches of observation transitions b, sampled uniformly from the corresponding sets of visual trajectories B. Specifically, for all optimizations, we set the batch size |b| = 128. We utilize a fixed-size agent set of visual trajectories B_π and evict old transitions when reaching full capacity, thus effectively acting as a replay buffer (similarly to Kostrikov et al. (2018)). Throughout all the experiments, we utilize the same 2 hidden-layer fully-connected policy and Q-networks with 256 units and ReLU nonlinearities. We keep other model architectures structurally consistent, with a fully-convolutional preprocessor P_θ1, a fully-connected invariant discriminator S_θ2 and fully-connected statistics networks T_φi. We only vary the depth of the models and the number of filters/units in each layer depending on the environment realm. To avoid having a biased pseudo-reward, which could provide a learning signal to the agent even without any meaningful discriminator (Kostrikov et al., 2018), we modify the environments by removing terminal states. Thus, each collected visual trajectory has a fixed length, equal to the task horizon |τ|. We provide the utilized environment-specific hyper-parameters in Table 4, where we specify the buffer sizes in terms of total/maximum number of observations. We train each model through the Adam optimizer (Kingma & Ba, 2014) with a single learning rate α = 0.001 and momentum parameters β_1 = 0.9, β_2 = 0.999. We alternate the collection of a single episode using the current policy π_ω with repeating (i) discriminator learning and (ii) mutual information learning for as many iterations as the number of time-steps collected.
Then, by averaging the mutual information estimates accumulated from performing (ii), we update the coefficients β and λ. Finally, we also perform (iii) agent learning for as many iterations as the number of time-steps collected. We utilize the same agent set of visual trajectories B_π in all different learning steps. A formal summary of DisentanGAIL is reported below in Algorithm 1, where the full discriminator loss is:

L_D = -J_G(θ, b_E, b_π) + L_β(θ_1, b_E ∪ b_π) + L_λ(θ_1, b_P.E ∪ b_P.π).
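The exact rule for updating β and λ is not spelled out in this excerpt; a common Lagrangian-style sketch, where each coefficient is moved proportionally to the gap between the averaged MI estimate and its target (I_max for β, 0 for λ), would be:

```python
def update_penalty_coefficient(coef, mi_estimate, target, lr=0.01,
                               min_coef=0.0, max_coef=100.0):
    """Dual-ascent style update (an assumption, not the paper's exact
    rule): grow the penalty coefficient when the averaged MI estimate
    exceeds its target, shrink it otherwise, then clamp."""
    coef += lr * (mi_estimate - target)
    return min(max(coef, min_coef), max_coef)

beta = 1.0
for mi in [0.9, 0.9, 0.9]:  # averaged MINE estimates across updates
    # I_max = 0.5 here is an illustrative target value
    beta = update_penalty_coefficient(beta, mi, target=0.5)
# beta grows while the estimated MI stays above I_max
```

Under this scheme, the constraint is enforced softly: β keeps rising until the penalty L_β pushes the estimated mutual information back below I_max.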

C ENVIRONMENTS DESCRIPTION

We evaluate the algorithms on six different environment realms, designed to test the proposed methods on a diverse range of task difficulties and domain differences, extending the environments in OpenAI Gym (Brockman et al., 2016); the individual realms are detailed in the annex below.

Examining the latent representation couplings of Section D.1 (Fig. 4): for the Inverted Pendulum realm, the preprocessor appears to be encoding information about the relative angle between the tip of the links and the moving cart, together with some information about the cart's position. For the Reacher realm, the preprocessor appears to be encoding information about the position of the tip of the reacher, ignoring the orientation of the individual links. For the Hopper and Half-Cheetah realms, the preprocessor appears to be encoding the orientation of the joints movable in both domains, together with the current height and position of the agent with respect to the floor tiles, allowing the discriminator to assess whether the agent is hopping/advancing. For the 7DOF-Pusher and 7DOF-Striker realms, the preprocessor appears to be encoding the location of the item/ball, together with the location and orientation of the agent's hand actuator.

We also show the produced couplings for six different experiments with DisentanGAIL with domain confusion loss in Fig. 5. From the results, it can be inferred that the features encoded under the domain confusion loss are similar to the ones encoded under the mutual information constraints, yet not as consistently interpretable. This is especially evident in the more challenging high dimensional environments. Particularly, for the Hopper and Half-Cheetah realms, the preprocessor does not appear to always encode the agent's relative position to the floor tiles, focusing instead either on the angle of particular joints or on the agent's overall appearance. Additionally, for the 7DOF-Pusher and 7DOF-Striker realms, the preprocessor does not appear to consistently encode the location of the item/ball, focusing instead on less interpretable features of the agent's appearance.
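The coupling procedure itself (Section D.1) amounts to nearest-neighbour matching by L1 distance in latent space; a minimal sketch with illustrative toy latents:

```python
import numpy as np

def match_by_latent_l1(latents_pi, latents_E):
    """For each 'observer' latent, return the index of the expert latent
    with the smallest L1 distance (the coupling of Section D.1)."""
    # pairwise L1 distances: |z_pi[i] - z_E[j]| summed over features
    dists = np.abs(latents_pi[:, None, :] - latents_E[None, :, :]).sum(-1)
    return dists.argmin(axis=1)

# toy check with hand-picked 2-D latents: each agent latent should be
# coupled with the closest expert latent
z_pi = np.array([[0.0, 0.0], [1.0, 1.0]])
z_E = np.array([[0.9, 1.1], [0.1, -0.1]])
matches = match_by_latent_l1(z_pi, z_E)  # -> [1 0]
```

In the actual visualizations, `latents_pi` and `latents_E` would be the preprocessor outputs for the four observations sampled from each set of visual trajectories.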

D.2 EXPERT DEMONSTRATIONS CONSTRAINT

We evaluate the effects of enforcing tighter expert demonstration constraints on DisentanGAIL in the low-dimensional environments. This is achieved by running our algorithm with lower values for I_max, the hyper-parameter regulating the upper limit on the estimated mutual information in the adaptive penalty loss L_β (described in Section 5.1).

First, we analyze the effects that tighter constraints have on the amount of goal-completion information encoded in our latent representations ẑ. Particularly, for different values of I_max, we utilize MINE to estimate the mutual information between ẑ and the default environment rewards r_true: I(ẑ, r_true). We argue that this measure is a good heuristic for the goal-completion information contained in ẑ, since r_true can be effectively used to recover a policy that solves the task in all environments. Specifically, in Fig. 6, we show the average and standard deviation of the mutual information collected throughout different experiments considering domain differences in both appearance and embodiment. This data shows that there is a positive correlation between I_max and I(ẑ, r_true), indicating that tighter mutual information constraints are detrimental. This is explained by the fact that, especially at the beginning of training, c_i and d_i are highly dependent; hence, looser constraints allow greater amounts of information about c_i to be encoded within ẑ. These findings are also consistent with our arguments from Appendix A.

We present the performance curves in Fig. 9 and a summary of the results in Table 8. The obtained results show that background domain differences have a limited effect on DisentanGAIL's final performance. However, they have a more prominent effect on DisentanGAIL's efficiency, making it converge in an increased number of epochs. This is particularly noticeable in the Hopper realm's results.
We hypothesize this is because the new target environments contain a greater amount of domain information that needs to be 'disentangled' from the useful goal-completion information. For example, in the locomotion realms, the preprocessor encodes goal-completion information about the relative position of the agent with respect to the floor tiles in the two environments (as empirically suggested in Section D.1). In the new target environment of the Hopper realm, this information also needs to be disentangled from domain information regarding the tiles' appearance.
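The mutual information quantities used throughout these analyses (e.g. I(ẑ, r_true) in Section D.2) are estimated with MINE. A minimal sketch of the Donsker-Varadhan lower bound at its core, with illustrative statistics-network outputs:

```python
import numpy as np

def mine_lower_bound(T_joint, T_marginal):
    """Donsker-Varadhan bound used by MINE:
    I(X; Y) >= E_joint[T] - log E_marginal[exp(T)],
    where T_joint holds statistics-network outputs on paired samples
    and T_marginal on independently shuffled (product-of-marginals)
    pairs. In practice T is a trained network; here its outputs are
    illustrative constants."""
    return T_joint.mean() - np.log(np.exp(T_marginal).mean())

# a statistics function that separates joint from marginal samples
# yields a positive mutual information estimate
T_joint = np.array([1.0, 1.2, 0.8])
T_marg = np.array([-1.0, -0.8, -1.2])
est = mine_lower_bound(T_joint, T_marg)
```

During training, the statistics network is optimized to maximize this bound, so the estimate tightens toward the true mutual information.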



Figure 1: Simplified discriminator optimization structure: for each time-step the four most recent observations o t:t-3 are processed independently by the preprocessor P θ1 , outputting the corresponding latent representations z t:t-3 . The latent representations are then concatenated and fed jointly into the invariant discriminator S θ2 , and fed individually into the statistics network T φ , outputting respectively the GAIL objective J G , and the penalty loss L β or L λ .

Figure 2: Performance curves for the Inverted Pendulum (Top) and Reacher (Bottom) realms. DisentanGAIL is the only algorithm consistently achieving a performance close to the 'expert' agent's.

DisentanGAIL: 0.973 ± 0.074, 1.021 ± 0.023, 0.941 ± 0.045, 0.954 ± 0.081, 0.885 ± 0.064, 0.894 ± 0.231, 0.860 ± 0.081, 0.918 ± 0.115
DisentanGAIL (No prior): 1.004 ± 0.012, 1.015 ± 0.023, 0.847 ± 0.064, 0.914 ± 0.134, 0.586 ± 0.143, 0.794 ± 0.234, 0.578 ± 0.160, 0.887 ± 0.195
TPIL: 0.251 ± 0.111, 0.812 ± 0.162, 0.185 ± 0.079, 0.218 ± 0.191, 0.278 ± 0.217, 0.309 ± 0.122, 0.235 ± 0.154, 0.254 ± 0.199
TPIL (×5 experience): 0.683 ± 0.158, 1.024 ± 0.025, 0.493 ± 0.195, 0.331 ± 0.279, 0.585 ± 0.256, 0.519 ± 0.281, 0.626 ± 0.282, 0.313 ± 0.266
DisentanGAIL (DCL): 0.894 ± 0.134, 1.024 ± 0.025, 0.867 ± 0.071, 0.889 ± 0.159, 0.550 ± 0.146, 0.826 ± 0.194, 0.523 ± 0.177, 0.786 ± 0.288
No regularization: 0.988 ± 0.042, 1.018 ± 0.038, 0.290 ± 0.187, 0.635 ± 0.230, 0.200 ± 0.176, 0.677 ± 0.178, 0.186 ± 0.136, 0.682 ± 0.182

Figure 3: Performance curves for the Hopper (Top-left), Half-Cheetah (Bottom-left), 7DOF-Pusher (Top-right) and 7DOF-Striker (Bottom-right) environment realms.

• TPIL: The original implementation of the algorithm from Stadie et al. (2017).
• DisentanGAIL with domain confusion loss (DCL): Re-implementation of the domain confusion loss by Stadie et al. (2017) applied to DisentanGAIL, substituting the proposed constraints.
• No latent representation regularization (No regularization): DisentanGAIL model without any loss or constraint to prevent encoding domain information in its latent representations.
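The domain confusion loss used by the DCL baselines (J_DCL, defined in Appendix A) can be sketched as a binary log-likelihood of a domain classifier over expert versus agent latents; the logits below are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def domain_confusion_loss(logits_E, logits_pi, eps=1e-8):
    """J_DCL from Appendix A: the domain classifier C_phi scores expert
    latents (label 1) against agent latents (label 0). C_phi ascends
    this objective while the preprocessor descends it (adversarial
    training), weighted by a fixed coefficient lambda."""
    p_E, p_pi = sigmoid(logits_E), sigmoid(logits_pi)
    return np.log(p_E + eps).mean() + np.log(1.0 - p_pi + eps).mean()

# a perfectly confused classifier (all logits 0, i.e. C_phi = 1/2
# everywhere) attains J_DCL = 2 * log(1/2), its maximin value
j_dcl = domain_confusion_loss(np.zeros(4), np.zeros(4))
```

When the preprocessor succeeds, no classifier can do better than chance, and the loss settles near this 2·log(1/2) value.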

Environment-realm specific architectures and hyper-parameters (see Table 4): each realm uses a fully-convolutional preprocessor built from {N × (3, 3)-conv, Tanh, (2, 2)-MaxPool} blocks, 2 × {100-FC, ReLU} and 2 × {128-FC, Tanh} fully-connected components, a {1-FC, Sigmoid} output layer for the invariant discriminator and {1-FC} output layers for the statistics networks, with a task horizon of 200. The Half-Cheetah realm uses 16/24/32/48-filter convolutional blocks and buffer sizes of 20000/20000/100000 observations; the 7DOF-Pusher and 7DOF-Striker realms use 24/32/40/64-filter convolutional blocks and buffer sizes of 10000/10000/100000 observations.

Algorithm 1 (excerpt): sample batches b_E, b_π, b_P.E, b_P.π from B_E, B_π, B_P.E, B_P.π; estimate I_n = I_φn(z_i, d_i | b_E ∪ b_π) and I_P,n = I_φn(z_i, d_i | b_P.E ∪ b_P.π); update the agent π_ω with SAC, sampling from B_π for |τ| steps (agent learning).

Figure 4: Couplings produced by matching the observations collected in the 'observer' agent's environment with the observations collected in the 'expert' agent's environment, to minimize the L1-distance between the latent representations produced by DisentanGAIL. On the right we show the results between the agent set of visual trajectories B π and the set of expert demonstrations B E , and on the left we show the results between the prior set of agent observations B P.π and the prior set of expert observations B P.E . From top to bottom, we show the results in the Inverted Pendulum, Reacher, Hopper, Half-Cheetah, 7DOF-Pusher and 7DOF-Striker environment realms. We produce the couplings in six sample observational imitation problems considering domain differences in terms of both environment appearance and agent embodiment.

Figure 5: Couplings produced by matching the observations collected in the 'observer' agent's environment with the observations collected in the 'expert' agent's environment, to minimize the L1-distance between the latent representations produced by DisentanGAIL with domain confusion loss. This visualization is analogous to Fig. 4.

Figure 6: Average mutual information between the latent representations ẑ and the default rewards r_true throughout performing observational imitation, for different values of the expert demonstration hyper-parameter I_max. We evaluate this measure for the experiments in the Reacher and Inverted Pendulum environment realms considering domain differences in both appearance and embodiment.

Figure 7: Performance curves from enforcing tighter expert demonstration constraints in the Inverted Pendulum (Top) and Reacher (Bottom) environment realms.

Figure 8: Performance curves for the domain information disguising ablation performed in the Inverted Pendulum (Top) and Reacher (Bottom) environment realms.

Figure 9: Performance curves for the Hopper (Left) and 7DOF-Pusher (Right) environment realms in the alternative target environments.

Results summary for the Inverted Pendulum and Reacher environment realms

Results summary for the 'high dimensional' environment realms.

We use DisentanGAIL to perform observational imitation with the 'source' and 'target' environments differing greatly both in terms of appearance and agent embodiment, as detailed in Appendix C. We compare DisentanGAIL with the previously-introduced baselines. To provide an upper bound on the expected performance of DisentanGAIL, we additionally evaluate the No latent representation regularization baseline with the 'observer' agent learning in the original 'source' environment, i.e., imitating with no domain differences.

Table 3: Example visual trajectories collected in the sample task. Each visual trajectory is of the form {x_0 y_0, x_1 y_1, x_2 y_2}.

B_E: {10, 11, 11} {10, 11, 11} {10, 11, 11} {10, 11, 11}
B_π: {00, 00, 00} {00, 01, 00} {00, 00, 01} {00, 00, 00}

Environment-realm specific hyper-parameters

Results summary for the experiments considering further background domain differences in the 'target' environments

ACKNOWLEDGMENTS

Edoardo Cetin would like to acknowledge the support from the Engineering and Physical Sciences Research Council [EP/R513064/1].

ANNEX

Published as a conference paper at ICLR 2021

• Inverted Pendulum: This environment realm consists of four variations of the original balancing task. The variations explore changing the color of the agent and adding a second link to be balanced on top of the moving cart.
• Reacher: This environment realm consists of four variations of the original 2-D reaching task. The variations explore augmenting the number of joints and changing the observer's camera recording angle.
• Hopper: This environment realm consists of two environments, including the original Hopper environment and an alternative version, in which the agent has an additional joint splitting its thigh link in two, with the new link also appearing in a different color.
• Half-Cheetah: This environment realm consists of two environments, including the original Half-Cheetah environment and an alternative version, in which the agent has immobilized feet joints and the connected links appearing in a different color.
• 7DOF-Pusher/7DOF-Striker: Each of these environment realms consists of two environments, including the original Pusher/Striker environments and an alternative version, in which the agent's model is modified in its appearance and structural configuration, to make it resemble a very simplified human operator.

To collect observations, we render the environments with Mujoco and down-scale the renderings to different dimensions, with the purpose of having efficient representations while still preserving the relevant details about the observations. We provide further environment-specific descriptive details in Table 5.

D.1 LATENT REPRESENTATIONS COUPLING

After performing the reported experiments, we utilize the learnt models to understand which features are encoded into our constrained latent representations z_i. Particularly, we use the output of the trained preprocessors to map observations between the 'expert' agent's and the 'observer' agent's sets of visual trajectories. We achieve this by taking four different observations from B_π and computing their latent representations. Then, we match these observations with the four observations in B_E having the closest latent representations, computed by taking the relative L1-distances. We repeat this process, matching four observations in B_P.π with four observations in B_P.E. We show the produced couplings for six different experiments with DisentanGAIL in Fig. 4. From the results, it can be inferred that the mutual information constraints successfully guide the preprocessor to encode features which are agnostic to the agent's embodiment and the environment's appearance, yet preserve information about the goal-completion levels displayed in the observations.

Results summary for different values of I_max in the Inverted Pendulum realm:

DisentanGAIL, I_max = 0.99: 0.973 ± 0.074, 1.021 ± 0.023, 0.941 ± 0.045, 0.954 ± 0.081, 0.885 ± 0.064, 0.894 ± 0.231, 0.860 ± 0.081, 0.918 ± 0.115
DisentanGAIL, I_max = 0.75: 0.983 ± 0.067, 1.019 ± 0.021, 0.841 ± 0.113, 0.892 ± 0.131, 0.903 ± 0.064, 0.887 ± 0.180, 0.897 ± 0.056, 0.772 ± 0.200
DisentanGAIL, I_max = 0.50: 0.987 ± 0.045, 1.013 ± 0.016, 0.885 ± 0.104, 0.853 ± 0.208, 0.861 ± 0.077, 0.942 ± 0.131, 0.837 ± 0.071, 0.790 ± 0.234
DisentanGAIL, I_max = 0.25: 0.975 ± 0.028, 1.020 ± 0.025, 0.927 ± 0.052, 0.882 ± 0.173, 0.861 ± 0.088, 0.930 ± 0.156, 0.848 ± 0.055, 0.720 ± 0.251
DisentanGAIL, I_max = 0.01: 0.921 ± 0.094, 0.992 ± 0.041, 0.905 ± 0.057, 0.790 ± 0.201, 0.652 ± 0.188, 0.589 ± 0.224, 0.576 ± 0.141, 0.630 ± 0.345

Second, we analyze directly the effects that tighter constraints have on the performance of DisentanGAIL.
The performance curves for different values of I_max are shown in Fig. 7, and a summary of the results is given in Table 6. Overall, DisentanGAIL appears to be quite robust in all tested settings, excluding the extreme I_max = 0.01. In general, however, a lower mutual information upper limit appears to have a negative effect on performance in most experiments, especially when the 'expert' agent's and the 'observer' agent's embodiments differ. This is likely because a tighter constraint does not permit the discriminator to utilize enough information about single observations, thus providing a less informative learning signal to fine-tune the agent's behavior. The effects of varying I_max appear less accentuated in the experiments performed in the Reacher realm. This is likely because the exploratory policy in this environment covers a greater range of states than in the Inverted Pendulum realm, with a more diverse range of goal-completion levels. Thus, encoding features carrying goal-completion information requires carrying less information about the domain labels.

D.3 DOMAIN INFORMATION DISGUISING

We also perform an ablation study to understand the effects of the techniques proposed to counteract domain information disguising, described in Section 5.2. We compare the proposed DisentanGAIL algorithm with alternative versions: (i) with no spectral normalization (No SN); (ii) with no double statistics network (No 2St); (iii) with no domain information disguising prevention (No Prev), making use of neither spectral normalization nor the double statistics network. The performance curves are shown in Fig. 8 and a summary of the results is given in Table 7. Overall, both techniques contribute positively to the final performance. Particularly, the double statistics network appears to have a slightly greater positive effect. This is especially evident in the Inverted Pendulum realm. Additionally, removing spectral normalization from the invariant discriminator's layers makes the agent initially learn slightly faster, indicating that there might be a trade-off between convergence speed and training stability.
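Spectral normalization divides each weight matrix by its largest singular value; a minimal power-iteration sketch (not the exact implementation used in our experiments):

```python
import numpy as np

def spectral_norm(W, n_iters=50):
    """Estimate the largest singular value of W by power iteration:
    the quantity spectral normalization (applied to the invariant
    discriminator's layers in D.3) divides each weight matrix by."""
    rng = np.random.default_rng(0)
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = W @ v
        u /= np.linalg.norm(u) + 1e-12
    return u @ W @ v  # converges to sigma_max(W)

W = np.diag([3.0, 1.0])        # largest singular value is 3
W_sn = W / spectral_norm(W)    # normalized weights have spectral norm ~1
```

Constraining every layer to spectral norm 1 bounds the Lipschitz constant of the network, which is the stabilizing effect exploited by the ablated technique.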

D.4 ROBUSTNESS TO BACKGROUND DIFFERENCES

We examine whether DisentanGAIL's performance is affected by larger visual domain differences concerning solely the background appearance. Most of the tested domain differences in our environment realms involved changing the appearance and morphology of the agents themselves. Hence, we test DisentanGAIL on two alternative target environments in the Hopper and 7DOF-Pusher environment realms, with very distinct backgrounds from the relative source environments. Specifically, the target environment of the Hopper realm has a much darker floor, where the tiles are difficult to discern. Additionally, the target environment in the 7DOF-Pusher realm includes a green table and a white floor, both of which differ greatly in appearance from the grey table and black floor of the source environment. We compare the performance of DisentanGAIL performing imitation in these two alternative target environments with the performance in the original target environments to evaluate its robustness.

