LEARNING EFFICIENT PLANNING-BASED REWARDS FOR IMITATION LEARNING

Anonymous

Abstract

Imitation learning from limited demonstrations is challenging. Most inverse reinforcement learning (IRL) methods are unable to perform as well as the demonstrator, especially in high-dimensional environments, e.g., the Atari domain. To address this challenge, we propose a novel reward learning method that combines a differentiable planning module with dynamics modeling. Our method learns useful planning computations together with a meaningful reward function that focuses on the region an agent reaches by executing an action. Such a planning-based reward function leads to policies with better generalization ability. Empirical results with multiple network architectures and reward instances show that our method outperforms state-of-the-art IRL methods on multiple Atari games and continuous control tasks. On average, our method achieves 1,139.1% of the demonstration performance.

1. INTRODUCTION

Imitation learning (IL) offers an alternative to reinforcement learning (RL) for training an agent: it mimics the demonstrations of an expert and avoids manually designed reward functions. Behavioral cloning (BC) (Pomerleau, 1991) is the simplest form of imitation learning, which learns a policy using supervised learning. A more advanced family of methods, inverse reinforcement learning (IRL) (Ng & Russell, 2000; Abbeel & Ng, 2004), seeks to recover a reward function from the demonstrations and train an RL agent on the recovered reward function. The maximum entropy variant of IRL aims to find a reward function that makes the demonstrations appear near-optimal under the principle of maximum entropy (Ziebart et al., 2008; 2010; Boularias et al., 2011; Finn et al., 2016).

However, most state-of-the-art IRL methods fail to match the performance of the demonstrations in high-dimensional environments with limited demonstration data, e.g., a one-life demonstration in the Atari domain (Yu et al., 2020). This is because the main goal of these IRL approaches is merely to recover a reward function that justifies the demonstrations. Rewards recovered from limited demonstration data are vulnerable to overfitting, and optimizing them from an arbitrary initial policy results in inferior performance. Recently, Yu et al. (2020) proposed generative intrinsic reward driven imitation learning (GIRIL) for imitation learning with limited demonstration data, which outperforms both the expert and IRL methods on several Atari games. GIRIL uses the prediction error of a dynamics model as a curiosity signal to design a surrogate reward that pushes states away from the demonstration and avoids overfitting; however, this curiosity signal also yields rewards of ambiguous quality across the environment.

In this paper, we focus on addressing the two key issues of previous methods when learning with limited demonstration data: 1) the overfitting problem, and 2) the ambiguous quality of the reward function.
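To make the curiosity-style surrogate reward concrete, the sketch below computes a reward as the prediction error of a learned forward dynamics model, the signal that GIRIL-style methods build on. This is an illustration under our own assumptions, not the authors' implementation; the `forward_model` argument and the toy identity model are hypothetical.

```python
import numpy as np

def prediction_error_reward(forward_model, state, action, next_state):
    """Curiosity-style surrogate reward: the error of a learned forward
    dynamics model at predicting the observed next state."""
    predicted = forward_model(state, action)
    return float(np.mean((predicted - next_state) ** 2))

# Toy forward model that ignores the action and predicts no change,
# so any transition that actually changes the state gets a high reward.
identity_model = lambda s, a: s

s, s_next = np.zeros(4), np.ones(4)
r = prediction_error_reward(identity_model, s, 0, s_next)  # mean squared error = 1.0
```

Transitions the model predicts well earn little reward, which is the source of the ambiguity noted above: the reward reflects model error rather than task progress.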
To address these issues, we propose to learn a straightforward surrogate reward function by learning to plan from the demonstration data, which is more reasonable than the previous intrinsic reward (i.e., the prediction error between states). Differentiable planning modules (DPMs) are potentially useful for this goal, since they learn to map observations to a planning computation for a task and generate action predictions based on the resulting plan (Tamar et al., 2016; Nardelli et al., 2019; Zhang et al., 2020). The value iteration network (VIN) (Tamar et al., 2016) is a representative example, which expresses value iteration as a convolutional neural network (CNN). It learns meaningful reward and value maps along with a useful planning computation, which leads to policies that generalize well to new tasks. However, because it cannot efficiently summarize complicated transition dynamics, VIN fails to scale up to the Atari domain.

To address this challenge, we propose a novel method called variational planning-embedded reward learning (vPERL), which is composed of two submodules: a planning-embedded action back-tracing module and a transition dynamics module. We jointly optimize the two submodules with a variational objective based on the conditional variational autoencoder (VAE) (Sohn et al., 2015), which greatly improves generalization ability. This is critical for obtaining a straightforward and smooth reward function and value function from limited demonstration data. As shown in Figure 1, vPERL learns meaningful reward and value maps that attend to the region the agent reaches by executing an action, indicating a meaningful planning computation. In contrast, directly applying VIN to the Atari domain with supervised learning (Tamar et al., 2016) only learns reward and value maps that attend to no specific region, which is usually of little use.
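The VIN idea of expressing value iteration as repeated local, convolution-like updates can be sketched minimally on a 2-D reward map. A real VIN learns the transition kernels end-to-end from data; this illustrative sketch instead fixes them to the four grid moves, so it is a hand-crafted stand-in for the learned planning computation, not the VIN (or vPERL) architecture itself.

```python
import numpy as np

def vin_style_value_iteration(reward_map, gamma=0.9, n_iters=30):
    """VIN-style planning on a 2-D grid: each iteration applies a local
    (convolution-like) update per action, then a channel-wise max.
    A real VIN learns the per-action kernels; here they are fixed
    shifts for the four moves (up, down, left, right)."""
    H, W = reward_map.shape
    V = np.zeros((H, W))
    for _ in range(n_iters):
        Q = []
        for dy, dx in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            shifted = np.full((H, W), V.min())  # off-grid moves bootstrap from the worst value
            ys = slice(max(0, -dy), H - max(0, dy))
            xs = slice(max(0, -dx), W - max(0, dx))
            ys_src = slice(max(0, dy), H - max(0, -dy))
            xs_src = slice(max(0, dx), W - max(0, -dx))
            shifted[ys, xs] = V[ys_src, xs_src]
            Q.append(reward_map + gamma * shifted)  # Q_a = R + gamma * T_a V
        V = np.max(np.stack(Q), axis=0)             # V = max_a Q_a
    return V

# Toy 5x5 grid with a single goal reward in the corner.
R = np.zeros((5, 5)); R[4, 4] = 1.0
V = vin_style_value_iteration(R)  # value decays smoothly with distance from the goal
```

After a few iterations, the value map peaks at the rewarded cell and decays with distance, which is the kind of spatially meaningful value map Figure 1 visualizes.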
Empirical results show that our method outperforms state-of-the-art IRL methods on multiple Atari games and continuous control tasks. Remarkably, our method achieves performance that is up to 58 times that of the demonstration. Moreover, averaged over eight Atari games, our method attains 1,139.1% of the demonstration performance.

2. BACKGROUND AND RELATED LITERATURE

A Markov decision process (MDP) (Bellman, 1966) is a standard model for sequential decision making and planning. An MDP M is defined by a tuple (S, A, T, R, γ), where S is the set of states, A is the set of actions, T : S × A × S → R+ is the environment transition distribution, R : S × A → R is the reward function, and γ ∈ (0, 1) is the discount factor (Puterman, 2014). The expected discounted return, or value, of a policy π is V^π(s) = E_τ[ Σ_{t=0}^∞ γ^t R(s_t, a_t) | s_0 = s ], where τ = (s_0, a_0, s_1, a_1, ...) denotes a trajectory in which the actions are selected according to π: s_0 ∼ T_0(s_0), a_t ∼ π(a_t | s_t), and s_{t+1} ∼ T(s_{t+1} | s_t, a_t). The goal in an MDP is to find the optimal policy π* that enables the agent to obtain high long-term rewards. In contrast to curiosity-based surrogate rewards such as GIRIL's (Yu et al., 2020), our vPERL learns an efficient planning-based reward that is more straightforward and informative; we include GIRIL as a competitive baseline in our experiments.

Differentiable planning modules perform end-to-end learning of a planning computation, which leads to policies that generalize to new tasks. Value iteration (VI) (Bellman, 1957) is a well-known method for computing the optimal value V* and optimal policy π*:

V_{n+1}(s) = max_a Q_n(s, a), where Q_n(s, a) = R(s, a) + γ Σ_{s'} T(s' | s, a) V_n(s').
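The value iteration update above can be sketched in tabular form as follows; the two-state transition tensor T and reward matrix R are illustrative, not taken from the paper.

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-8):
    """Tabular value iteration.
    T: (S, A, S) transition probabilities T(s'|s,a); R: (S, A) rewards."""
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * T @ V      # Q_n(s,a) = R(s,a) + gamma * sum_s' T(s'|s,a) V_n(s')
        V_next = Q.max(axis=1)     # V_{n+1}(s) = max_a Q_n(s,a)
        if np.max(np.abs(V_next - V)) < tol:
            return V_next, Q.argmax(axis=1)
        V = V_next

# Two-state chain: action 1 moves toward state 1, which pays reward 1.
T = np.array([[[1., 0.], [0., 1.]],    # from state 0: a0 stays, a1 goes to state 1
              [[1., 0.], [0., 1.]]])   # from state 1: a0 goes to state 0, a1 stays
R = np.array([[0., 0.], [1., 1.]])
V, pi = value_iteration(T, R)          # the optimal policy chooses a1 in both states
```

VIN embeds exactly this max-over-Q recursion inside a CNN, with the sum over s' realized by learned convolution kernels.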



Figure 1: Visualization of (a) the state, (b) vPERL reward and value maps, and (c) VIN reward and value maps on the Battle Zone game (first row) and the Breakout game (second row).

Generative adversarial imitation learning (GAIL) (Ho & Ermon, 2016) extends IRL by integrating an adversarial training technique for distribution matching (Goodfellow et al., 2014). GAIL performs well in low-dimensional applications, e.g., MuJoCo, but it does not scale well to high-dimensional scenarios such as Atari games (Brown et al., 2019a). Variational adversarial imitation learning (VAIL) (Peng et al., 2019) improves on GAIL by compressing the information via a variational information bottleneck. GAIL and VAIL inherit the problems of adversarial training, such as an unstable training process, and are vulnerable to overfitting when learning with limited demonstration data. We include both methods as comparisons to vPERL in our experiments.

Generative intrinsic reward driven imitation learning (GIRIL) (Yu et al., 2020) leverages a generative model to learn generative intrinsic rewards for better exploration. Although GIRIL outperforms previous IRL methods on several Atari games, its reward map is ambiguous and less informative, which results in inconsistent performance improvements across environments.
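For illustration, the surrogate reward that GAIL-style methods derive from a learned discriminator D(s, a) can be sketched as below; the linear "discriminator" and its weights are hypothetical stand-ins for a trained network, and one common reward form r(s, a) = -log(1 - D(s, a)) is used.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gail_reward(disc_logit):
    """GAIL-style surrogate reward from a discriminator logit d(s, a):
    r(s, a) = -log(1 - D(s, a)), high when the discriminator believes
    the (s, a) pair came from the expert."""
    d = sigmoid(disc_logit)
    return -np.log(1.0 - d + 1e-8)   # small epsilon guards against log(0)

# Linear 'discriminator' on a toy state-action feature vector.
w = np.array([1.0, -0.5])            # illustrative learned weights
features = np.array([2.0, 1.0])      # illustrative phi(s, a)
r = gail_reward(w @ features)        # higher logit -> higher reward
```

Because this reward depends entirely on the discriminator, it inherits the training instability noted above: as the discriminator shifts during adversarial training, so does the reward the policy is optimizing.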

