POLICY CONTRASTIVE IMITATION LEARNING

Abstract

Adversarial imitation learning (AIL) is a popular method that has recently achieved much success. However, the performance of AIL is still unsatisfactory on more challenging tasks. We find that one major reason is the low quality of the AIL discriminator's representation. Since the AIL discriminator is trained via binary classification that does not necessarily discriminate the policy from the expert in a meaningful way, the resulting reward might not be meaningful either. We propose a new method called Policy Contrastive Imitation Learning (PCIL) to resolve this issue. PCIL learns a contrastive representation space by anchoring on different policies and generates a smooth cosine-similarity-based reward. Our proposed representation learning objective can be viewed as a stronger version of the AIL objective and provides a more meaningful comparison between the agent and the expert. From a theoretical perspective, we show the validity of our method using the apprenticeship learning framework. Furthermore, our empirical evaluation on the DeepMind Control Suite demonstrates that PCIL achieves state-of-the-art performance. Finally, qualitative results suggest that PCIL builds a smoother and more meaningful representation space for imitation learning.

1. INTRODUCTION

Imitation is one of the fundamental capabilities of an intelligent agent (Hussein et al., 2017). Animals and humans can acquire many skills by mimicking each other (Byrne, 2009). In engineering, imitation learning also enables many robotics applications. One mainstream class of imitation learning algorithms is adversarial imitation learning (AIL) (Ho & Ermon, 2016). AIL casts the imitation task as a distribution matching problem and trains a policy against an adversarial discriminator. AIL has enjoyed great success on many imitation tasks: it achieves superior performance (Ho & Ermon, 2016; Kostrikov et al., 2018), has been experimentally shown to alleviate some of the distributional drift issue, and can work even without expert actions (Torabi et al., 2018b).

However, AIL is hard to train in practice, usually requiring careful tuning of the discriminator's network size and learning rate (Wang et al., 2017; Kim & Park, 2018; Orsini et al., 2021). The fragility of the discriminator (Peng et al., 2018) not only leads to poor performance but also severely limits the applicability of AIL to a broader range of tasks. Numerous techniques have been proposed to improve the performance of AIL, such as regularization and gradient penalties (Fu et al., 2017; Kostrikov et al., 2018; Gulrajani et al., 2017). Some works also propose different distribution metrics (e.g., KL divergence, Wasserstein distance) (Xiao et al., 2019) for distribution matching and show some improvements. Though these methods show encouraging results, we notice that they ignore one crucial aspect of the problem: the representation of AIL's discriminator. Specifically, the discriminator in AIL is usually trained with a binary classification loss that distinguishes expert transitions from agent transitions; this discriminator is then used to define rewards.
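As a concrete, simplified illustration of this pipeline, the sketch below trains a linear discriminator with a binary cross-entropy loss and derives a GAIL-style reward from its output. The linear model, toy data, and hyperparameters here are illustrative assumptions, not the setup of any particular AIL implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def discriminator(x, w):
    """Linear discriminator D(s, a) = sigmoid(w . x): probability that x is expert data."""
    return 1.0 / (1.0 + np.exp(-x @ w))

def bce_loss(d_expert, d_agent, eps=1e-8):
    """AIL's binary classification objective: label expert data 1, agent data 0."""
    return -(np.log(d_expert + eps).mean() + np.log(1.0 - d_agent + eps).mean())

def ail_reward(x, w, eps=1e-8):
    """GAIL-style reward r = -log(1 - D): high where a transition looks expert-like."""
    return -np.log(1.0 - discriminator(x, w) + eps)

# Toy (state, action) features: expert and agent transitions from different clusters.
expert_x = rng.normal(loc=1.0, size=(64, 4))
agent_x = rng.normal(loc=-1.0, size=(64, 4))

# A few manual gradient steps on the BCE loss (illustrative, not a full trainer).
w = np.zeros(4)
for _ in range(100):
    d_e, d_a = discriminator(expert_x, w), discriminator(agent_x, w)
    grad = -((1.0 - d_e)[:, None] * expert_x).mean(axis=0) \
           + (d_a[:, None] * agent_x).mean(axis=0)
    w -= 0.3 * grad

# The trained discriminator assigns higher reward to expert-like transitions.
print(ail_reward(expert_x, w).mean(), ail_reward(agent_x, w).mean())
```

Note that nothing in this objective asks the discriminator's features to be semantically structured; it only needs any separating direction, which motivates the representation concerns discussed next.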
However, since the only goal of the discriminator is to distinguish the expert from the agent, it does not necessarily learn a good, smooth representation space that supports a reasonable comparison between the behaviors of two agents. An ideal representation should provide semantically meaningful signals for comparing the expert policy and the agent policy. In this paper, we propose a new algorithm called Policy Contrastive Imitation Learning (PCIL) to achieve this goal. Instead of training with a binary classification objective, we train the discriminator's representation space with a contrastive learning loss. Our method differs from prior representation learning approaches in AIL in that we perform contrastive learning between different policies. More specifically, we push the expert's representations together and pull the agent policy's representations away from them. We define the imitation learning reward via the cosine similarity between the policy's and the expert's transition embeddings. As shown in Figure 1, the discriminator (a binary classifier) might not have a good representation space: the distance between two expert transitions can be even larger than the distance between an expert transition and an agent transition. This implies that the discriminator may not encode common features of the expert's behavior and may rely on non-robust features to compare behaviors. The lack of proper structure in the representation space of traditional discriminators may yield low-quality AIL rewards. To alleviate this issue, we explicitly constrain the representation space, requiring that the distance between expert transitions be smaller than their distance to the agent's transitions. This is a stronger constraint on the discriminator's representation: it is easy to derive a binary classifier from our learned representation.

Figure 1: Comparison between the representation space of an AIL method and our method (panel labels: Expert, Sub-optimal, Random). Since AIL methods use a binary classification objective to distinguish expert and non-expert transitions, the representation space is only required to separate the two classes into disjoint subspaces; the embedding space thus need not be semantically meaningful, e.g., (Left) the distance between two expert data points may be even larger than the distance between an expert data point and a sub-optimal non-expert data point. (Right) We overcome this limitation with PCIL. Our method enforces the compactness of the expert's representation, which ensures that the learned representation captures common, robust features of the expert's transitions and leads to a more meaningful representation space.
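The push-together/pull-apart scheme and the cosine-similarity reward described above can be sketched as follows. This is a minimal NumPy illustration of an InfoNCE-style contrastive loss over policy embeddings; the batching, positive-pair choice, and temperature are illustrative assumptions and may differ from the paper's actual objective.

```python
import numpy as np

def normalize(z, eps=1e-8):
    """Project embeddings onto the unit sphere (cosine geometry)."""
    return z / (np.linalg.norm(z, axis=-1, keepdims=True) + eps)

def policy_contrastive_loss(z_expert, z_agent, temperature=0.1):
    """InfoNCE-style loss: pull expert embeddings together, push agent embeddings away.

    Each expert embedding is an anchor; another expert embedding (here, the previous
    one in the batch) serves as its positive, and all agent embeddings are negatives.
    """
    ze, za = normalize(z_expert), normalize(z_agent)
    pos = np.sum(ze * np.roll(ze, 1, axis=0), axis=1) / temperature  # (N_e,)
    neg = (ze @ za.T) / temperature                                  # (N_e, N_a)
    logits = np.concatenate([pos[:, None], neg], axis=1)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob_pos = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return -log_prob_pos.mean()

def cosine_reward(z, expert_center):
    """Smooth imitation reward: cosine similarity to the expert centroid embedding."""
    return normalize(z) @ normalize(expert_center)

# Toy embeddings: a compact expert cluster and a separate agent cluster.
rng = np.random.default_rng(0)
z_expert = rng.normal(1.0, 0.1, size=(32, 8))
z_agent = rng.normal(-1.0, 0.1, size=(32, 8))
loss = policy_contrastive_loss(z_expert, z_agent)
rewards = cosine_reward(z_agent, z_expert.mean(axis=0))
```

In this sketch the loss is low exactly when expert embeddings form a compact cluster well separated from the agent's, which is the structural constraint a plain binary classifier is never required to satisfy.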
However, the binary classifier's representation space does not necessarily satisfy our objective. From a theoretical perspective, we show the soundness of PCIL within the apprenticeship learning framework. We also evaluate our method empirically on the DeepMind Control Suite (Tassa et al., 2018). Experimental results show that our method achieves state-of-the-art results. Through ablation studies and qualitative visualization, we find that our method is more effective than prior representation learning methods and provides a better representation space for imitation learning.

In summary, our contributions in this paper are as follows. 1. We point out a new direction for improving AIL methods, i.e., going beyond naive binary classification and leveraging more stable and meaningful representation learning algorithms for imitation. 2. We propose the Policy Contrastive Imitation Learning (PCIL) algorithm, which instantiates such an improvement, and establish its connection to apprenticeship learning from a theoretical perspective. 3. We evaluate our method on the DeepMind Control Suite and achieve state-of-the-art performance. Through ablation studies, we highlight its essential difference from previous contrastive learning methods in AIL.




