UNBIASED LEARNING WITH STATE-CONDITIONED REWARDS IN ADVERSARIAL IMITATION LEARNING

Abstract

Adversarial imitation learning has emerged as a general and scalable framework for automatic reward acquisition. However, we point out that previous methods commonly exploit an occupancy-dependent reward learning formulation, which hinders the reconstruction of the optimal decision as an energy-based model. Despite their theoretical justification, occupancy measures tend to cause issues in practice because of high variance and vulnerability to domain shifts. Another reported problem is the termination bias induced by the rewarding and regularization schemes applied around terminal states. To address these issues, this work presents a novel algorithm called causal adversarial inverse reinforcement learning. Our formulation draws a strong connection between adversarial learning and energy-based reinforcement learning; thus, the architecture is capable of recovering a reward function that induces a multi-modal policy. In experiments, we demonstrate that our approach outperforms prior methods on challenging continuous control tasks, even under significant variation in the environments.

1. INTRODUCTION

Inverse reinforcement learning (IRL) is the problem of recovering the ground-truth reward function from observed behavior (Ng & Russell, 2000). IRL algorithms, followed by appropriate reinforcement learning (RL) algorithms, can optimize a policy through farsighted cumulative value measures in the given system (Sutton & Barto, 2018); hence they can usually achieve more satisfying results than mere supervision. While a few studies have investigated recovering reward functions in continuous spaces (Babes et al., 2011; Levine & Koltun, 2012), IRL algorithms often fail to find the ground-truth reward function in high-dimensional, complex domains (Finn et al., 2016b). The notion of the ground-truth reward requires elaboration, since IRL is an ill-posed problem: there can be numerous reward functions inducing the same optimal policy (Ng et al., 1999; Ng & Russell, 2000). Recently, adversarial imitation learning (AIL) has shown promising results as a reward acquisition method (Ho & Ermon, 2016). One of the distinctive strengths of AIL is its scalability through parameterized non-linear function approximators such as neural networks. The maximum causal entropy principle is widely regarded as the standard solution when the optimal control problem is modeled as probabilistic inference (Ziebart et al., 2010; Haarnoja et al., 2017). In particular, probabilistic modeling with a continuous energy function yields a representation called an energy-based model (EBM). We highlight the following advantages of energy-based IRL:
• It provides a unified framework for learning stochastic policies; most probabilistic models can be viewed as special types of EBMs (LeCun et al., 2006).
• It rationalizes the stochasticity of behavior; this provides robustness in the face of uncertain dynamics (Ziebart et al., 2010) and a natural way of modeling complex multi-modal distributions.
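To make the energy-based view concrete, the following is a brief sketch of the standard maximum-entropy formulation (in the spirit of Ziebart et al., 2010, and Haarnoja et al., 2017), not a derivation specific to our method: the soft-optimal policy is a Boltzmann distribution whose negative energy is the soft Q-function,

```latex
% Soft-optimal policy as an energy-based model (standard max-entropy RL form):
\pi^{*}(a \mid s) \;=\; \exp\!\big( Q_{\mathrm{soft}}(s,a) - V_{\mathrm{soft}}(s) \big),
\qquad
V_{\mathrm{soft}}(s) \;=\; \log \int_{\mathcal{A}} \exp\!\big( Q_{\mathrm{soft}}(s,a') \big)\, \mathrm{d}a' .
```

Under this view, multi-modal behavior corresponds to an energy landscape with several modes over actions, which a reward function expressible as such an energy can induce directly.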
AIL reward functions seem to be exceptions to these arguments: the AIL framework produces a distinct type of reward that is ever-changing and intended for discriminating joint densities. We argue that these characteristics hinder a proper information projection onto the optimal decision. This work points out that two kinds of biases remain in AIL. Established AIL algorithms are typically formalized in terms of cumulative state-action densities called occupancy measures. We claim that these accumulated measures contain biases unrelated to modeling purposeful behavior, and that the formulation is vulnerable to distributional shifts of the MDP. Empirically, they work as dominant

