LEARNING EFFICIENT PLANNING-BASED REWARDS FOR IMITATION LEARNING

Anonymous

Abstract

Imitation learning from limited demonstrations is challenging. Most inverse reinforcement learning (IRL) methods are unable to perform as well as the demonstrator, especially in high-dimensional environments such as the Atari domain. To address this challenge, we propose a novel reward learning method that combines a differentiable planning module with dynamics modeling. Our method learns useful planning computations with a meaningful reward function that focuses on the region an agent reaches by executing an action. Such a planning-based reward function leads to policies with better generalization ability. Empirical results with multiple network architectures and reward instances show that our method outperforms state-of-the-art IRL methods on multiple Atari games and continuous control tasks, achieving, on average, 1,139.1% of the demonstration performance.

1. INTRODUCTION

Imitation learning (IL) offers an alternative to reinforcement learning (RL) for training an agent: the agent mimics the demonstrations of an expert, which avoids manually designed reward functions. Behavioral cloning (BC) (Pomerleau, 1991) is the simplest form of imitation learning, learning a policy by supervised learning. More advanced methods based on inverse reinforcement learning (IRL) (Ng & Russell, 2000; Abbeel & Ng, 2004) seek to recover a reward function from the demonstrations and train an RL agent on the recovered reward function. In the maximum-entropy variant of IRL, the aim is to find a reward function that makes the demonstrations appear near-optimal under the principle of maximum entropy (Ziebart et al., 2008; 2010; Boularias et al., 2011; Finn et al., 2016). However, most state-of-the-art IRL methods fail to match the performance of the demonstrations in high-dimensional environments with limited demonstration data, e.g., a one-life demonstration in the Atari domain (Yu et al., 2020). This is because the main goal of these IRL approaches is to recover a reward function that justifies only the demonstrations. Rewards recovered from limited demonstration data are vulnerable to overfitting, and optimizing them from an arbitrary initial policy results in inferior performance. Recently, Yu et al. (2020) proposed generative intrinsic reward learning (GIRIL) for imitation learning with limited demonstration data; this method outperforms the expert and IRL methods in several Atari games. Although GIRIL uses the prediction error as curiosity to design a surrogate reward that pushes states away from the demonstration and avoids overfitting, this curiosity also makes the quality of the rewards across the environment ambiguous. In this paper, we address the two key issues of previous methods when learning with limited demonstration data: 1) overfitting, and 2) the ambiguous quality of the reward function.
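The prediction-error style of surrogate reward described above can be illustrated with a minimal sketch: fit a forward dynamics model on demonstration transitions and reward states by the model's prediction error. The toy linear dynamics and the helper `intrinsic_reward` below are illustrative assumptions, not the actual GIRIL model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy demonstration transitions drawn from assumed linear dynamics s' = s + a.
states = rng.normal(size=(256, 4))
actions = rng.normal(size=(256, 4))
next_states = states + actions

# Fit a linear forward model f([s, a]) -> s' on the demonstrations by least squares.
X = np.hstack([states, actions])
W, *_ = np.linalg.lstsq(X, next_states, rcond=None)

def intrinsic_reward(s, a, s_next):
    # Curiosity-style surrogate reward: squared prediction error of the
    # learned forward model on the observed transition.
    pred = np.hstack([s, a]) @ W
    return float(np.sum((pred - s_next) ** 2))

# A demonstrated transition yields near-zero reward; an off-demonstration
# transition (here, dynamics reversed) yields a larger one.
r_demo = intrinsic_reward(states[0], actions[0], next_states[0])
r_novel = intrinsic_reward(states[0], actions[0], states[0] - actions[0])
```

This also exhibits the ambiguity the paragraph points out: the reward depends only on how unfamiliar a transition is to the model, not on how useful it is for the task.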
To address these issues, we propose to learn a straightforward surrogate reward function by learning to plan from the demonstration data, which is more reasonable than the previous intrinsic reward function (i.e., the prediction error between states). Differentiable planning modules (DPMs) are potentially useful for this goal, since they learn to map an observation to a planning computation for a task and generate action predictions based on the resulting plan (Tamar et al., 2016; Nardelli et al., 2019; Zhang et al., 2020). The value iteration network (VIN) (Tamar et al., 2016) is a representative example, which expresses value iteration as a convolutional neural network (CNN). Meaningful reward and value maps are learned along with the useful planning computation, which leads to policies that generalize well to new tasks. However, because it cannot efficiently summarize complicated transition dynamics, VIN fails to scale up to the Atari domain.
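The planning computation that VIN embeds is ordinary value iteration; a minimal tabular sketch may clarify what the network unrolls. The 3-state chain MDP below is a hypothetical example, not one from the paper; VIN performs the same Bellman-backup recurrence with a convolution (for the expectation over next states) followed by a channel-wise max.

```python
import numpy as np

def value_iteration(R, P, gamma=0.9, iters=100):
    """Tabular value iteration: V(s) <- max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')].

    R: (S, A) reward array; P: (A, S, S) transition matrices.
    """
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(iters):
        # Q[s, a] = R[s, a] + gamma * E_{s'}[V(s')]
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        V = Q.max(axis=1)  # VIN's max-pooling over action channels
    return V

# Hypothetical 3-state chain: action 0 steps left, action 1 steps right,
# reward 1 in the rightmost state.
S, A = 3, 2
P = np.zeros((A, S, S))
for s in range(S):
    P[0, s, max(s - 1, 0)] = 1.0       # left
    P[1, s, min(s + 1, S - 1)] = 1.0   # right
R = np.zeros((S, A))
R[2, :] = 1.0
V = value_iteration(R, P)  # approx [8.1, 9.0, 10.0]
```

In VIN, `R` and `P` are not given but are produced by learned convolutional layers, which is precisely why summarizing complicated transition dynamics (as in Atari) becomes the bottleneck.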

