UNBIASED LEARNING WITH STATE-CONDITIONED REWARDS IN ADVERSARIAL IMITATION LEARNING

Abstract

Adversarial imitation learning has emerged as a general and scalable framework for automatic reward acquisition. However, we point out that previous methods commonly exploit an occupancy-dependent reward learning formulation, which hinders the reconstruction of the optimal decision as an energy-based model. Despite their theoretical justification, occupancy measures tend to cause issues in practice because of their high variance and vulnerability to domain shifts. Another reported problem is the termination bias induced by the reward and regularization schemes around terminal states. To address these issues, this work presents a novel algorithm called causal adversarial inverse reinforcement learning. Our formulation draws a strong connection between adversarial learning and energy-based reinforcement learning; thus, the architecture is capable of recovering a reward function that induces a multi-modal policy. In experiments, we demonstrate that our approach outperforms prior methods on challenging continuous control tasks, even under significant variation in the environments.

1. INTRODUCTION

Inverse reinforcement learning (IRL) is the problem of recovering the ground-truth reward function from observed behavior (Ng & Russell, 2000). IRL algorithms, followed by appropriate reinforcement learning (RL) algorithms, can optimize a policy through farsighted cumulative value measures in the given system (Sutton & Barto, 2018); hence they can usually achieve more satisfying results than mere supervision. While a few studies have investigated recovering reward functions in continuous spaces (Babes et al., 2011; Levine & Koltun, 2012), IRL algorithms often fail to find the ground-truth reward function in high-dimensional, complex domains (Finn et al., 2016b). The notion of the ground-truth reward requires elaboration since IRL is an ill-posed problem; there can be numerous reward functions inducing the same optimal policy (Ng et al., 1999; Ng & Russell, 2000). Recently, adversarial imitation learning (AIL) has shown promising results as a reward acquisition method (Ho & Ermon, 2016). One of the distinctive strengths of AIL is its scalability through parameterized non-linear function approximators such as neural networks. The maximum causal entropy principle is widely regarded as the solution when the optimal control problem is modeled as probabilistic inference (Ziebart et al., 2010; Haarnoja et al., 2017). In particular, probabilistic modeling using a continuous energy function forms a representation called an energy-based model (EBM). We highlight the following advantages of energy-based IRL:
• It provides a unified framework for learning stochastic policies; most probabilistic models can be viewed as special types of EBMs (LeCun et al., 2006).
• It rationalizes the stochasticity of behavior; this provides robustness in the face of uncertain dynamics (Ziebart et al., 2010) and a natural way of modeling complex multi-modal distributions.
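To make the energy-based view concrete, the standard maximum causal entropy solution (following the soft value definitions in Ziebart et al., 2010 and Haarnoja et al., 2017, cited above) can be sketched as:

```latex
% Soft Bellman backup and the induced energy-based policy
% (standard forms from the maximum-entropy RL literature).
\begin{align}
  Q_{\mathrm{soft}}(s_t, a_t) &= r(s_t, a_t)
      + \gamma \, \mathbb{E}_{s_{t+1} \sim P}\!\left[ V_{\mathrm{soft}}(s_{t+1}) \right], \\
  V_{\mathrm{soft}}(s_t) &= \log \int_{\mathcal{A}} \exp\!\left( Q_{\mathrm{soft}}(s_t, a') \right) \mathrm{d}a', \\
  \pi^{*}(a_t \mid s_t) &= \exp\!\left( Q_{\mathrm{soft}}(s_t, a_t) - V_{\mathrm{soft}}(s_t) \right).
\end{align}
```

Here the soft Q-function plays the role of a (negative) energy, so a multi-modal Q-landscape directly yields a multi-modal policy.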
AIL reward functions seem to be exceptions to these arguments: the AIL framework produces distinct types of rewards that are ever-changing and are intended for discriminating joint densities. We argue that these characteristics hinder proper information projection onto the optimal decision. This work points out that two kinds of biases remain in AIL. Established AIL algorithms are typically formalized with cumulative densities called occupancy measures. We claim that these accumulated measures contain biases unrelated to modeling purposeful behavior, and that the formulation is vulnerable to distributional shifts of an MDP. Empirically, they act as a dominant source of noise in training because of the formulation's innate high variance. The other bias is an implicit survival or early-termination bias caused by reward formulations that lack consideration for the terminal states of finite episodes. Such unnormalized rewards often provoke sub-optimal behaviors in which the agent learns to maliciously exploit time-aware strategies. This paper proposes an adversarial IRL method called causal adversarial inverse reinforcement learning (CAIRL). We primarily associate the reward acquisition method with approaches for energy-based RL and IRL; the CAIRL reward function can induce complex probabilistic behaviors with multiple modalities. We then show that learning with a dual discriminator architecture provides stepwise, state-conditioned rewards. To handle the biases induced by finite horizons, the model postulates that the reward function satisfies a Bellman equation that includes "self-looping" terminal states. As a result, it learns a reward function satisfying the properties of EBMs.
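A one-line derivation illustrates why self-looping terminal states matter for the survival/termination bias. If an absorbing terminal state s_A transitions only to itself with a per-step reward r_A, its value is a geometric series:

```latex
% Value of a self-looping (absorbing) terminal state s_A
% with fixed per-step reward r_A:
\begin{align}
  V(s_A) = r_A + \gamma V(s_A)
  \quad\Longrightarrow\quad
  V(s_A) = \frac{r_A}{1 - \gamma}.
\end{align}
```

A reward formulation that leaves r_A implicitly fixed (e.g., at zero) therefore hard-codes whether surviving or terminating early looks attractive to the agent, regardless of the demonstrated behavior.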
Noteworthy contributions of this work are 1) a model-free, energy-based IRL algorithm that is effective in high-dimensional environments, 2) a dual discriminator architecture for recovering a robust state-conditioned reward function, 3) an effective approach for handling terminal states, and 4) meaningful experiments and comparison studies with state-of-the-art algorithms across various settings.

2. RELATED WORKS

Imitation learning is a fundamental approach for modeling intelligent behavior from an expert at specific tasks (Pomerleau, 1991; Zhang et al., 2018). In the standard framework called behavioral cloning, learning from demonstrations is treated as supervised learning on a trajectory dataset. On the other hand, IRL aims to recover the reward function of the underlying system, which characterizes the expert. From this perspective, training a policy with an IRL reward function is a branch of imitation learning, specialized in sequential decision-making problems through recovering a concise representation of the task (Ng & Russell, 2000; Abbeel & Ng, 2004). For modeling stochastic expert policies, Boltzmann distributions appeared in early IRL research, such as Bayesian IRL, natural gradient IRL, and maximum likelihood IRL (Ramachandran & Amir, 2007; Neu & Szepesvári, 2012; Babes et al., 2011). Notably, maximum entropy IRL (Ziebart et al., 2008) is explicitly formulated based on the principle of maximum entropy. The framework has also been derived from causal entropy; the derived algorithm can model the purposeful distribution of an optimal policy as a reward function (Ziebart et al., 2010). Our work draws significant inspiration from these prior works and aims to restore the perspective of probabilistic causality. Recently, AIL methods (Ho & Ermon, 2016; Fu et al., 2017; Ghasemipour et al., 2020) have shown great success on continuous control benchmarks. Each of them provides a unique divergence minimization scheme through its architecture. In particular, our work shares major components with AIRL. It has been argued that the algorithm does not recover the energy of the expert policy (Liu et al., 2020). We stress that our work introduces the concepts essential to correctly drawing an energy-based representation of the expert policy.
The discriminator design is based on the rich energy-based interpretation of GANs (Zhao et al., 2016; Azadi et al., 2018; Che et al., 2020) and numerous studies with multiple discriminators (Chongxuan et al., 2017; Gan et al., 2017; Choi et al., 2018). The issues of finite-horizon tasks were initially raised in RL during the discussion of time limits in MDP benchmarks (Pardo et al., 2017; Tucker et al., 2018). It turned out that time limits, or even the mere existence of terminal states, significantly affect the value learning procedure of RL compared to infinite-horizon MDPs. IRL suffers from an identical problem: reward learning on finite episodes is unstable for tasks outside of well-designed benchmarks. Kostrikov et al. (2018) suggested explicitly adding a self-repeating absorbing state (Sutton & Barto, 2018) after the terminal state; consequently, AIL discriminators can evaluate termination frequencies.
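The absorbing-state idea of Kostrikov et al. (2018) can be sketched as a post-processing step on replay transitions. The code below is an illustrative sketch, not their implementation: the function names, the extra indicator dimension encoding, and the dummy action are our own assumptions. Each state is padded with a flag (0 for ordinary states, 1 for the absorbing state), and every terminal transition is rerouted into an absorbing state that loops onto itself.

```python
import numpy as np

def augment(state, is_absorbing=False):
    # Append an indicator dimension: 0.0 for ordinary states,
    # 1.0 for the absorbing state.
    flag = 1.0 if is_absorbing else 0.0
    return np.concatenate([state, [flag]])

def with_absorbing_states(transitions, state_dim, dummy_action=0):
    """Post-process (s, a, s_next, done) tuples so that every terminal
    transition leads into an explicit absorbing state that self-loops.

    After this rewrite no transition is marked terminal, so a discriminator
    (or value learner) sees termination as ordinary transitions into the
    absorbing state and can evaluate termination frequencies.
    """
    absorbing = augment(np.zeros(state_dim), is_absorbing=True)
    out = []
    for s, a, s_next, done in transitions:
        if not done:
            out.append((augment(s), a, augment(s_next), False))
        else:
            # Terminal state transitions into the absorbing state...
            out.append((augment(s), a, absorbing, False))
            # ...which then self-loops forever, represented by one
            # transition taken with a fixed dummy action.
            out.append((absorbing, dummy_action, absorbing, False))
    return out
```

In practice this single self-loop transition can be replayed repeatedly (or weighted by 1/(1-γ)) to stand in for the infinite tail of the episode.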

3. BACKGROUND

Markov Decision Process (MDP). We define an MDP as a tuple M = (S, A, P, r, p_0, γ), where S and A denote the state and action spaces, and γ is the discount factor. The transition distribution P, the deterministic state-action reward function r, and the initial state distribution p_0 are unknown. Let τ_π and τ_E be finite sequences of states and actions (s_0, a_0, . . . , a_{T-1}, s_T) obtained by a policy π and the expert policy π_E, respectively. The term ρ_π denotes the occupancy measure of π, i.e., the discounted distribution of state-action pairs visited under π: ρ_π(s, a) = π(a | s) Σ_{t=0}^∞ γ^t P(s_t = s | π).

