LEARNING EFFICIENT PLANNING-BASED REWARDS FOR IMITATION LEARNING

Anonymous

Abstract

Imitation learning from limited demonstrations is challenging. Most inverse reinforcement learning (IRL) methods are unable to perform as well as the demonstrator, especially in high-dimensional environments, e.g., the Atari domain. To address this challenge, we propose a novel reward learning method that combines a differentiable planning module with dynamics modeling. Our method learns useful planning computations with a meaningful reward function that focuses on the region resulting from an agent executing an action. Such a planning-based reward function leads to policies with better generalization ability. Empirical results with multiple network architectures and reward instances show that our method can outperform state-of-the-art IRL methods on multiple Atari games and continuous control tasks. Our method achieves performance that is, on average, 1,139.1% of the demonstration.

1. INTRODUCTION

Imitation learning (IL) offers an alternative to reinforcement learning (RL) for training an agent: it mimics the demonstrations of an expert and avoids manually designed reward functions. Behavioral cloning (BC) (Pomerleau, 1991) is the simplest form of imitation learning, learning a policy with supervised learning. More advanced methods based on inverse reinforcement learning (IRL) (Ng & Russell, 2000; Abbeel & Ng, 2004) seek to recover a reward function from the demonstrations and train an RL agent on the recovered reward function. In the maximum entropy variant of IRL, the aim is to find a reward function that makes the demonstrations appear near-optimal under the principle of maximum entropy (Ziebart et al., 2008; 2010; Boularias et al., 2011; Finn et al., 2016).

However, most state-of-the-art IRL methods fail to match the performance of the demonstrations in high-dimensional environments with limited demonstration data, e.g., a one-life demonstration in the Atari domain (Yu et al., 2020). This is because the main goal of these IRL approaches is to recover a reward function that justifies the demonstrations only. Rewards recovered from limited demonstration data are vulnerable to overfitting, and optimizing them from an arbitrary initial policy results in inferior performance. Recently, Yu et al. (2020) proposed generative intrinsic reward driven imitation learning (GIRIL) for imitation learning with limited demonstration data, which outperforms the expert and IRL methods in several Atari games. Although GIRIL uses the prediction error as a curiosity signal to design a surrogate reward that pushes states away from the demonstration and avoids overfitting, this curiosity also yields rewards of ambiguous quality in the environment. In this paper, we focus on addressing these two key issues of previous methods when learning with limited demonstration data: 1) the overfitting problem, and 2) the ambiguous quality of the reward function.
To address these issues, we propose to learn a more direct surrogate reward function by learning to plan from the demonstration data, which is more reasonable than the previous intrinsic reward function (i.e., the prediction error between states). Differentiable planning modules (DPMs) are potentially useful for this goal, since they learn to map an observation to a planning computation for a task and generate action predictions based on the resulting plan (Tamar et al., 2016; Nardelli et al., 2019; Zhang et al., 2020). The value iteration network (VIN) (Tamar et al., 2016) is the representative example, which expresses value iteration as a convolutional neural network (CNN). Meaningful reward and value maps have been learned along with a useful planning computation, leading to policies that generalize well to new tasks. However, because it summarizes complicated transition dynamics inefficiently, VIN fails to scale up to the Atari domain.

To address this challenge, we propose a novel method called variational planning-embedded reward learning (vPERL), which is composed of two submodules: a planning-embedded action back-tracing module and a transition dynamics module. We leverage a variational objective based on the conditional variational autoencoder (VAE) (Sohn et al., 2015) to jointly optimize the two submodules, which greatly improves the generalization ability. This is critical for obtaining a direct and smooth reward function and value function with limited demonstration data. As shown in Figure 1, vPERL learns meaningful reward and value maps that attend to the region resulting from the agent executing an action, which indicates a meaningful planning computation. In contrast, directly applying VIN in the Atari domain with supervised learning (Tamar et al., 2016) only learns reward and value maps that attend to no specific region, which is usually of no avail.
Empirical results show that our method outperforms state-of-the-art IRL methods on multiple Atari games and continuous control tasks. Remarkably, our method achieves performance that is up to 58 times that of the demonstration. Moreover, the average performance of our method over eight Atari games is 1,139.1% of the demonstration.

2. BACKGROUND AND RELATED LITERATURE

Markov Decision Process (MDP) (Bellman, 1966) is a standard model for sequential decision making and planning. An MDP M is defined by a tuple (S, A, T, R, γ), where S is the set of states, A is the set of actions, T : S × A × S → R+ is the environment transition distribution, R : S → R is the reward function, and γ ∈ (0, 1) is the discount factor (Puterman, 2014). The expected discounted return, or value, of the policy π is given by V^π(s) = E_τ[Σ_{t=0}^∞ γ^t R(s_t, a_t) | s_0 = s], where τ = (s_0, a_0, s_1, a_1, …) denotes the trajectory, in which the actions are selected according to π: s_0 ∼ T_0(s_0), a_t ∼ π(a_t|s_t), and s_{t+1} ∼ T(s_{t+1}|s_t, a_t). The goal in an MDP is to find the optimal policy π* that enables the agent to obtain high long-term rewards.

Generative Adversarial Imitation Learning (GAIL) (Ho & Ermon, 2016) extends IRL by integrating adversarial training for distribution matching (Goodfellow et al., 2014). GAIL performs well in low-dimensional applications, e.g., MuJoCo, but does not scale well to high-dimensional scenarios such as Atari games (Brown et al., 2019a). Variational adversarial imitation learning (VAIL) (Peng et al., 2019) improves on GAIL by compressing the information via a variational information bottleneck. GAIL and VAIL inherit the problems of adversarial training, such as instability of the training process, and are vulnerable to overfitting when learning with limited demonstration data. We include both methods as comparisons to vPERL in our experiments.

Generative Intrinsic Reward driven Imitation Learning (GIRIL) (Yu et al., 2020) leverages a generative model to learn generative intrinsic rewards for better exploration. Though GIRIL outperforms previous IRL methods on several Atari games, its reward map is ambiguous and less informative, which results in inconsistent performance improvements across environments.
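As a quick numerical illustration of the discounted return in the definition of V^π(s) above, a minimal sketch (the helper name is ours, not from the paper):

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return sum_t gamma^t * r_t along one trajectory,
    matching the definition of V^pi(s) above."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A constant reward of 1 for three steps with gamma = 0.5:
# 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```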
In contrast, our vPERL learns an efficient planning-based reward that is more direct and informative. We include GIRIL as a competitive baseline in our experiments.

Differentiable planning modules perform end-to-end learning of a planning computation, which leads to policies that generalize to new tasks. Value iteration (VI) (Bellman, 1957) is a well-known method for calculating the optimal value V* and optimal policy π*: V_{n+1}(s) = max_a Q_n(s, a), where Q_n(s, a) = R(s, a) + γ Σ_{s'} T(s'|s, a) V_n(s') denotes the Q value in the nth iteration. The value function V_n in VI converges as n → ∞ to V*, from which the optimal policy may be derived as π*(s) = arg max_a Q_∞(s, a). The value iteration network (VIN) (Tamar et al., 2016) embeds the VI process in a recurrent convolutional network and generalizes well in conventional navigation domains. VIN assumes there is some unknown embedded MDP M̄ whose optimal plan contains useful information about the optimal plan in the original MDP M, and connects the two MDPs with a parametric reward function R̄ = f_R(s). Nardelli et al. (2019) propose value propagation networks, which generalize VIN for better sample complexity by employing value propagation (VProp). Recently, universal value iteration networks (UVIN) extended VIN to spatially variant MDPs (Zhang et al., 2020). Although VIN can be extended to irregular spatial graphs by applying a graph convolutional operator (Niu et al., 2018), most VIN variants still focus on solving conventional navigation problems (Zhang et al., 2020). In this paper, we extend the differentiable planning module to learn an efficient reward function for imitation learning on limited demonstration data. We focus on leveraging the learned reward function for imitation learning, while previous work on VIN focuses more on the value function.
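The tabular VI recursion above can be sketched in a few lines of numpy (the toy two-state MDP is our own; VIN replaces these tabular operations with learned convolutions):

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, n_iters=100):
    """Tabular VI: V_{n+1}(s) = max_a [ R(s,a) + gamma * sum_{s'} T(s'|s,a) V_n(s') ].

    T: (S, A, S) transition probabilities, R: (S, A) rewards.
    """
    S, A, _ = T.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        Q = R + gamma * np.einsum('sap,p->sa', T, V)  # Q_n(s, a)
        V = Q.max(axis=1)                             # V_{n+1}(s)
    return V, Q.argmax(axis=1)  # optimal value and greedy policy

# Two-state chain: action 1 moves to state 1, which pays reward 1.
T = np.zeros((2, 2, 2))
T[0, 0, 0] = T[1, 0, 0] = 1.0   # action 0: go to state 0
T[0, 1, 1] = T[1, 1, 1] = 1.0   # action 1: go to state 1
R = np.array([[0.0, 0.0], [1.0, 1.0]])  # reward 1 in state 1
V, pi = value_iteration(T, R, gamma=0.5)
print(pi)  # the greedy policy always chooses action 1
```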
Therefore, our work is complementary to the research on VIN and its variants. Note that any differentiable planning module can be embedded in our method. As a simple example, we utilize the basic VIN as a backbone to build our reward learning module.

3. VARIATIONAL PLANNING-EMBEDDED REWARD LEARNING

In this section, we introduce our solution, variational planning-embedded reward learning (vPERL). As illustrated in Figure 2 , our reward learning module is composed of two submodules to accomplish planning-embedded action back-tracing and explicit forward transition dynamics modeling. 

3.1. ACTION BACK-TRACING AND FORWARD DYNAMICS MODELLING IN VPERL

Planning-embedded action back-tracing. Instead of directly applying VIN for policy learning (Tamar et al., 2016), we build our first submodule q_φ(a_t|s_t, s_{t+1}) for action back-tracing. As illustrated in the top section of Figure 2, we first obtain the reward map R̄ = f_R(s_t, s_{t+1}) on an embedded MDP M̄, where f_R is a convolutional layer. A VI module f_VI takes in the reward map R̄ and effectively performs K steps of VI by recurrently applying a convolutional layer Q̄ K times (Tamar et al., 2016). The Q̄ layer is then max-pooled to obtain the next-iteration value V̄. The right-directed circular arrow in light blue denotes the direction of the convolutions. Then, we obtain the action from the intermediate optimal value V̄* by an action mapping function: a_t = f_a(V̄*). On these terms, we build our planning-embedded action back-tracing submodule, formally q_φ(a_t|s_t, s_{t+1}) = f_a(f_VI(f_R(s_t, s_{t+1}))). Since a convolutional kernel is incapable of summarizing the transition dynamics of a complex environment, directly training this submodule is still insufficient for learning an efficient reward function and planning computation in an environment like the Atari domain.

Explicit transition dynamics modeling via inverse VI. To address this, we build a second submodule p_θ(s_{t+1}|a_t, s_t) for explicit transition dynamics modeling. We base this submodule on an inverse VI module, a neural network architecture that mimics the inverse of the VI process. The implementation of the inverse VI module is straightforward. We first map the action to the intermediate optimal value in another embedded MDP M̄' by a value mapping function: V̄' = f_V(s_t, a_t). Then, we apply the inverse VI module to obtain the reward map R̄'. The inverse VI module f_VI' takes in the intermediate value V̄' and recurrently applies a deconvolutional layer Q̄' K times to obtain the reward map R̄'.
The left-directed circular arrow in purple denotes the direction of the deconvolutions. To accomplish the transition, we map the obtained R̄' to the future state by s_{t+1} = f_s(R̄'). The transition model is therefore p_θ(s_{t+1}|a_t, s_t) = f_s(f_VI'(f_V(s_t, a_t))), which is a differentiable submodule and can be trained simultaneously with the action back-tracing submodule.

Variational solution to vPERL. A variational autoencoder (VAE) (Kingma & Welling, 2013) is an autoencoder whose training is regularized to avoid overfitting and to ensure that the latent space has good properties that enable a generative process. To avoid the learned planning-based reward overfitting to the demonstration, we optimize both submodules with a unified variational solution that follows the formulation of the conditional VAE (Sohn et al., 2015). The conditional VAE is a conditional generative model for structured output prediction using Gaussian latent variables, composed of a conditional encoder, a decoder, and a prior. Accordingly, we regard the action back-tracing module q_φ(z|s_t, s_{t+1}) as the encoder, p_θ(s_{t+1}|z, s_t) as the decoder, and p_θ(z|s_t) as the prior. Our vPERL module maximizes the following objective:

L(s_t, s_{t+1}; θ, φ) = E_{q_φ(z|s_t, s_{t+1})}[log p_θ(s_{t+1}|z, s_t)] − KL(q_φ(z|s_t, s_{t+1}) ‖ p_θ(z|s_t)) − α KL(q_φ(â_t|s_t, s_{t+1}) ‖ π_E(a_t|s_t)),   (1)

where z is the latent variable, π_E(a_t|s_t) is the expert policy distribution, â_t = Softmax(z) is the transformed latent variable, and α is a positive scaling weight. The first two terms on the RHS of Eq. (1) denote the evidence lower bound (ELBO) of the conditional VAE (Sohn et al., 2015). These two terms are critical for our reward learning module to perform planning-based action back-tracing and transition modeling. Additionally, we integrate the third term on the RHS of Eq. (1) to further boost the action back-tracing.
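The terms of Eq. (1) can be sketched numerically, assuming a diagonal-Gaussian latent (so the prior KL is in closed form) and a discrete action distribution for the back-tracing term; the function names and the toy values are ours:

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over dimensions."""
    return 0.5 * np.sum(logvar_p - logvar_q
                        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
                        - 1.0)

def categorical_kl(q, p, eps=1e-8):
    """KL( q || p ) between two discrete action distributions."""
    return np.sum(q * (np.log(q + eps) - np.log(p + eps)))

def vperl_loss(recon_logprob, mu_q, logvar_q, mu_p, logvar_p,
               a_hat, pi_expert, alpha=100.0):
    """Negative of the Eq. (1) objective: reconstruction term, latent
    prior KL, and the action back-tracing KL weighted by alpha."""
    elbo = recon_logprob - gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
    return -(elbo - alpha * categorical_kl(a_hat, pi_expert))

# If the encoder matches the prior and the back-traced action matches
# the expert policy, only the reconstruction term remains.
loss = vperl_loss(-1.0, np.zeros(4), np.zeros(4), np.zeros(4), np.zeros(4),
                  np.array([0.25] * 4), np.array([0.25] * 4))
print(loss)  # 1.0
```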
The third term minimizes the KL divergence between the expert policy distribution π_E(a_t|s_t) and the action distribution q_φ(â_t|s_t, s_{t+1}), where â_t = Softmax(z) is transformed from the latent variable z. In this way, we train the forward state transition and the action back-tracing simultaneously.

Algorithm 1 Imitation learning via variational planning-embedded reward learning (vPERL).
1: Input: Expert demonstration data D = {(s_i, a_i)}_{i=1}^N.
2: Initialize policy π and the dual planning networks.
3: for e = 1, …, E do
4:    Sample a batch of demonstration D̂ ∼ D.
5:    Train the vPERL module on D̂ to converge.
6: end for
7: for i = 1, …, MAXITER do
8:    Update the policy via any policy gradient method, e.g., PPO, on the learned surrogate reward r_t.
9: end for
10: Output: Policy π.

Note that the full objective in Eq. (1) is still a variational lower bound of the marginal likelihood log p_θ(s_{t+1}|s_t). Accordingly, it is reasonable to maximize it as the objective of our reward learning module. By optimizing this objective, we improve both the forward state transition and the action back-tracing; as a result, our reward learning module efficiently models the transition dynamics of the environment. During training, we use the latent variable z as the intermediate action. After training, we calculate the surrogate rewards from the learned reward map. As shown in Figure 1, our method learns a meaningful reward map that highlights the region resulting from the agent executing an action. To leverage this information, we calculate two types of rewards that both correspond to the highlighted informative region, r_t = R_Max = max R̄ and r_t = R_Mean = mean R̄, which use the maximum and mean value of the reward map R̄, respectively.

Algorithm 1 summarizes the full training procedure of imitation learning via vPERL. The process begins by training a vPERL module for E epochs (steps 3-6). In each training epoch, we sample a mini-batch of demonstration data D̂ with a mini-batch size of B and maximize the objective in Eq. (1). Then, in steps 7-9, we update the policy π via any policy gradient method, e.g., PPO (Schulman et al., 2017), to optimize the policy with the learned surrogate reward function r_t.
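The control flow of Algorithm 1, together with the two surrogate-reward choices R_Max and R_Mean, can be sketched as follows; `train_vperl_epoch` and `ppo_update` are placeholders of our own for the actual gradient updates:

```python
import numpy as np

def surrogate_reward(R_map, mode="mean"):
    """Surrogate reward from the learned reward map: R_Max or R_Mean."""
    return float(R_map.max() if mode == "max" else R_map.mean())

def imitation_learning(demos, train_vperl_epoch, ppo_update, E=50, max_iter=100):
    """Skeleton of Algorithm 1: fit the reward module, then optimize the policy.

    Only the two-stage control flow is mirrored here; the callbacks stand
    in for the real vPERL and PPO gradient steps.
    """
    for _ in range(E):                    # steps 3-6: reward learning
        batch = demos                     # sample a mini-batch in practice
        train_vperl_epoch(batch)
    for _ in range(max_iter):             # steps 7-9: policy optimization
        ppo_update(reward_fn=lambda R_map: surrogate_reward(R_map, "mean"))

R = np.array([[0.0, 0.2], [0.1, 0.5]])
print(surrogate_reward(R, "max"), surrogate_reward(R, "mean"))  # 0.5 0.2
```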

4. EXPERIMENTS

4.1. ATARI GAMES

We first evaluate our proposed vPERL on one-life demonstration data for eight Atari games within OpenAI Gym (Brockman et al., 2016). To enable a fair comparison, we evaluate our method and all baselines under the same standard setup, where we train an agent to play Atari games without access to the true reward function (Ibarz et al., 2018; Brown et al., 2019a). The games and demonstration details are provided in Table 1. A one-life demonstration contains only the states and actions performed by an expert player until they lose a life in the game for the first time (Yu et al., 2020). In contrast, a full-episode demonstration contains the states and actions until the expert player loses all available lives. The one-life demonstration data is therefore much more limited than a full-episode demonstration. We define three levels of performance: 1) basic demonstration-level, gameplay up to one life lost ("one-life"); 2) expert-level, gameplay up to all lives lost ("one full-episode"); and 3) beyond expert, "better-than-expert" performance. Our ultimate goal is to train an imitation agent that achieves better-than-expert performance from the demonstration data recorded up to the moment of losing the first life in the game.

Demonstrations. To generate one-life demonstrations, we trained a PPO (Schulman et al., 2017) agent with the ground-truth reward for 10 million simulation steps, using the PPO implementation with the default hyper-parameters in the repository (Kostrikov, 2018). As Table 1 shows, the one-life demonstrations are all much shorter than the full-episode demonstrations, which makes for extremely limited training data.

Experimental Setup. Our first step was to train a reward learning module for each game on the one-life demonstration. We set K = 10 in vPERL for all of the Atari games.
By default, we use a neural network architecture that keeps the reward and value maps the same size as the state, i.e., 84 × 84. We achieve this by using a convolutional kernel of size 3 for each convolutional layer and applying padding; we call the corresponding method 'vPERL-Large'. Additionally, to enable faster learning, we implement our method with another architecture that reduces the reward and value maps to 18 × 18; we call this method 'vPERL-Small'. Both vPERL-Large and vPERL-Small can learn a meaningful reward map as well as a useful planning computation. Training was conducted with the Adam optimizer (Kingma & Ba, 2015) at a learning rate of 3e-5 and a mini-batch size of 32 for 50,000 epochs. In each training epoch, we sampled a mini-batch of data every four states. To evaluate the quality of the learned reward, we trained a policy to maximize the inferred reward function via PPO. We set α = 100 for training our reward learning module. We trained PPO on the learned reward function for 50 million simulation steps to obtain our final policy, with a learning rate of 2.5e-4, a discount factor of 0.99, a clipping threshold of 0.1, an entropy coefficient of 0.01, a value function coefficient of 0.5, and a GAE parameter of 0.95 (Schulman et al., 2016). We compare the imitation performance of our vPERL agent against VIN, GIRIL (Yu et al., 2020), and two state-of-the-art inverse reinforcement learning methods, GAIL (Ho & Ermon, 2016) and VAIL (Peng et al., 2019). More details of the setup are outlined in Appendix F.2.

Figure 3: Performance improvement of vPERL and baselines.
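The recurrent-convolution view of the VI module on an 18 × 18 map (the vPERL-Small resolution) can be sketched in numpy: the reward map and the current value map are convolved into one Q̄ channel per abstract action, and a max over channels yields the next value map. The random weights here are stand-ins for the learned f_R and Q̄ parameters, which are trained end to end in the actual method:

```python
import numpy as np

def conv2d_same(x, w):
    """Naive single-channel 2D convolution with zero padding ('same' size)."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, p)
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * w)
    return out

def vi_module(reward_map, K=10, n_actions=4, kernel=3, seed=0):
    """K recurrent VI steps: each step convolves the reward map and the
    current value map into one Q channel per abstract action, then takes
    the max over channels (the 'max-pool' step in the text)."""
    rng = np.random.default_rng(seed)
    w_r = 0.1 * rng.standard_normal((n_actions, kernel, kernel))
    w_v = 0.1 * rng.standard_normal((n_actions, kernel, kernel))
    V = np.zeros_like(reward_map)
    for _ in range(K):
        Q = np.stack([conv2d_same(reward_map, w_r[a]) + conv2d_same(V, w_v[a])
                      for a in range(n_actions)])
        V = Q.max(axis=0)
    return V

reward_map = np.zeros((18, 18))
reward_map[9, 9] = 1.0            # a single informative region
V_star = vi_module(reward_map)
print(V_star.shape)               # the value map keeps the 18 x 18 resolution
```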

Results

In Figure 3, we report performance normalized so that the demonstration performance equals 1. Figure 3 shows that vPERL usually achieves performance close to or better than that of the expert demonstrator. The most impressive case is the Centipede game, where vPERL achieves performance around 60 times higher than the demonstration; GIRIL achieves the second-best performance on Centipede, beating the demonstration by around 30 times. On the Qbert game, vPERL beats all other baselines by a large margin, achieving performance that is more than 15 times that of the demonstration.

A detailed quantitative comparison of IL algorithms is listed in Table 2. We evaluated four variants of vPERL, combining two network architectures (Large and Small) with two surrogate rewards (R_Max and R_Mean). Both vPERL-Large and vPERL-Small learn meaningful reward and value maps as well as useful planning computations in the Atari domain; in Appendices D and E, we visualize the learned reward and value maps of vPERL-Large and vPERL-Small, respectively. With such meaningful rewards, the four variants of vPERL outperform the expert demonstrator in six out of eight Atari games. Remarkably, vPERL-Small with R_Mean achieves an average performance that is 1,139.1% of the demonstration and 278.5% of the expert over the eight Atari games. Figure 3 shows the bar plot of normalized performance of the four vPERL variants against the other baselines. Table 2 shows that VIN is far from achieving demonstration-level performance, since it is unable to learn a useful planning computation or meaningful reward and value maps in the Atari domain. GAIL fails to achieve good imitation performance. VAIL manages to exceed the expert performance in one game, Breakout. GIRIL performs better than the previous methods, outperforming the expert demonstrator in three games. Overall, our vPERL agent outperforms the expert demonstrator by a large margin in six out of eight games.
Figure 4 shows the qualitative comparison of our method (vPERL-Small-R_Mean), GIRIL, VAIL, GAIL, VIN, and the average performance of the expert and demonstration. Additionally, in Appendix C.1, our method consistently outperforms the expert and the other baselines on two more Atari games, Krull and Time Pilot.

The contributions of each component of vPERL. In this subsection, we study the contribution of each component of our method: the action back-tracing submodule, the transition modeling submodule, and the variational objective in Eq. (1). Specifically, we directly train the action back-tracing and transition modeling submodules with supervised learning, using the mean of the reward map and the prediction error of the next state as the reward for the former and latter submodules, respectively. To study the contribution of the variational objective, we introduce another baseline, PERL, which trains both submodules as an autoencoder. Table 3 shows a quantitative comparison between the performance of vPERL and its components. The results show that individually training each component is of no avail. PERL outperforms the demonstration in one game, Centipede, which indicates the potential advantage of using both submodules. However, PERL fails in the other seven games, while vPERL outperforms the demonstration in all eight games and outperforms the expert in six. The large performance gap between PERL and vPERL indicates that the variational objective in Eq. (1) is important for learning efficient rewards. To further investigate why our method works well, we add another baseline, supervised PERL, which forces the encoding of PERL to be close to the true action. Supervised PERL fails in all of the games. Comparing with vPERL, we attribute the critical contribution to the use of the ELBO of the conditional VAE, or more specifically, the term KL(q_φ(z|s_t, s_{t+1}) ‖ p_θ(z|s_t)) in Eq. (1).
This term helps vPERL work well and outperform previous methods for two reasons: 1) the generative training of the VAE serves as a good regularizer that alleviates the overfitting problem; and 2) this regularization enables vPERL to learn a smooth value function and reward function, which consistently provide direct and informative rewards for the states encountered in the environment.

Empirical evidence. 1) Better generalization ability: the empirical results in Table 2 and Figure 4 show that VIN, GAIL, and VAIL are vulnerable to overfitting; they are usually of no avail and rarely reach demonstration-level performance. In contrast, our vPERL has better generalization ability and consistently achieves performance that is either close to or better than the expert. 2) Direct and informative reward: Figure 5 shows the state and the reward maps of vPERL and GIRIL in three Atari games. The reward map of GIRIL can be close to zero (in Battle Zone), close to the state (in Q*bert), or only occasionally informative (in Centipede), i.e., it is ambiguous and less informative. In contrast, the reward map of our vPERL is more direct and consistently attends to informative regions of the state in all of the games. Our method successfully addresses the two key issues and therefore outperforms previous methods.

4.2. CONTINUOUS CONTROL TASKS

We also evaluated our method on continuous control tasks, where the state space is low-dimensional and the action space is continuous. The continuous control tasks are from the PyBullet environment.

Demonstrations. To generate demonstrations, we trained a Proximal Policy Optimization (PPO) agent with the ground-truth reward for 1 million simulation steps, using the PPO implementation in the repository (Kostrikov, 2018) with the default hyper-parameters for continuous control tasks. In each task, we used one demonstration with a fixed length of 1,000 for evaluation. The details of the experimental setup can be found in Appendix F.1.

4.3. ABLATION STUDIES

Ablation study on the number of demonstrations.

5. CONCLUSION

This paper presents a simple but efficient reward learning method called variational planning-embedded reward learning (vPERL). By simultaneously training a planning-embedded action back-tracing module and a transition dynamics module in a unified generative solution, we obtain a reward function that is direct and informative and generalizes better than previous methods. Analysis and empirical evidence support the critical contribution of the ELBO regularization term for learning efficient planning-based rewards from extremely limited demonstrations. Empirical results show that our method outperforms state-of-the-art imitation learning methods on multiple Atari games and continuous control tasks by a large margin. Extensive ablation studies show that our method is not very sensitive to the number of demonstrations, the optimality of the demonstrations, or the choice of the hyperparameter K. We leave the extension of our method to more complex continuous control tasks as future work. Another interesting direction for future investigation is applying vPERL to hard-exploration tasks with extremely sparse rewards.

A QUANTITATIVE RESULTS OF CONTINUOUS CONTROL TASKS

Table 4 shows the detailed quantitative comparison of the demonstration and the imitation methods. The results shown in the table are the mean performance over three random seeds. Table 4: Average return of vPERL, GIRIL, VIN, and the state-of-the-art inverse reinforcement learning algorithms GAIL (Ho & Ermon, 2016) and VAIL (Peng et al., 2019) with one demonstration on continuous control tasks. The results shown are the mean performance over 3 random seeds, with the best imitation performance in bold.

We also evaluated our method with different numbers of full-episode demonstrations on both Atari games and continuous control tasks.
Table 5 and Table 6 show the detailed quantitative comparison of imitation learning methods across different numbers of full-episode demonstrations on the Centipede and Qbert games. The comparisons on two continuous control tasks, Inverted Pendulum and Inverted Double Pendulum, are shown in Table 7 and Table 8. The results show that our method vPERL achieves the highest performance across different numbers of full-episode demonstrations, with GIRIL usually second best. GAIL achieves better performance as the number of demonstrations increases in both continuous control tasks.

Table 9 and Table 10 show the average return of vPERL-Small-R_Mean with demonstrations of different optimality on Atari games and control tasks, respectively. In these experiments, we trained a PPO agent with the ground-truth reward for 10 million (10M) simulation steps as the expert for Atari games, and for 1 million (1M) steps for continuous control tasks. To study the effect of demonstration optimality, we additionally trained PPO agents with fewer simulation steps: 1M and 5M steps for Atari games, and 0.1M and 0.5M steps for continuous control tasks. With these additional PPO agents, we generated demonstrations at 10% and 50% of the expert's training budget for both Atari games and continuous control tasks. The results show that, with demonstrations of different optimality, our method outperforms the expert by a large margin in Atari games and reaches demonstration-level performance in continuous control tasks. Table 9: Average return of vPERL-Small-R_Mean with one-life demonstration data from expert policies trained for different numbers of simulation steps (N_E = 1 million, 5 million, and 10 million). The results shown are the mean performance over five random seeds, with better-than-expert performance in bold.

For consistency, we set K = 10 for all experiments on Atari games and continuous control tasks in Section 4.
To study the effect of the hyperparameter K, we evaluate our method on two Atari games and two continuous control tasks with two additional values of K (K = 5 and K = 15). Table 11 and Table 12 show the average return of vPERL-Small-R_Mean for the different choices of K on Atari games and continuous control tasks, respectively. With all three choices of K, our method consistently outperforms the expert on the Atari games and reaches the best (demonstration-level) performance on the continuous control tasks. This indicates that our method is not very sensitive to the choice of the hyperparameter K.

C ADDITIONAL EVALUATION RESULTS

C.1 ADDITIONAL ATARI GAMES

Table 13 shows the average return of vPERL-Small with R_Mean and the other baselines on two additional Atari games, Krull and Time Pilot. The results show that our method outperforms the expert and the other baselines by a large margin on both games. Table 13: Average return of vPERL-Small with R_Mean, GIRIL (Yu et al., 2020), VIN, and the state-of-the-art IRL algorithms GAIL (Ho & Ermon, 2016) and VAIL (Peng et al., 2019) with one-life demonstration data on additional Atari games. The results shown are the mean performance over five random seeds, with better-than-expert performance in bold.

To avoid the effects of the scores and lives shown in the states of Atari games, we also evaluate our method on the "No-score Demo.", which is obtained by masking the game score and number of remaining lives in the demonstrations (Brown et al., 2019a). The results show that our method achieves better performance on the "No-score Demo." than on the "Standard Demo.", which indicates the negative effect of the game scores and numbers of remaining lives in the states of the demonstrations. From Figure 1 and the additional reward visualizations in Appendices D and E, we observe that our method learns to attend to the meaningful region of a state and to ignore the game score and number of remaining lives automatically. Masking the game score and number of remaining lives in the demonstration further eases the burden of learning efficient planning computations and planning-based rewards for Atari games. In summary, our method learns to outperform the expert without explicit access to the true rewards, and does not rely on the game scores and numbers of remaining lives in the states of the demonstrations; moreover, its performance can be improved by masking out the game scores and numbers of remaining lives in the demonstrations.

Our first step was again to train a reward learning module for each continuous control task on one demonstration.
To build the reward learning module for continuous tasks, we used a simple VIN and an inverse VIN as the model bases of the action back-tracing and transition modeling submodules, respectively. In the simple VIN model, we used a 1D convolutional layer with a kernel size of 2 and a stride of 1 to implement the function f_R, the reward map R, and the Q-value map Q. For action back-tracing, the final value map of the VIN was fully connected to a hidden layer of size 32. Conversely, we used a 1D deconvolutional layer to implement the inverse VIN model. We kept the size of the feature maps unchanged across all layers in both the VIN and the inverse VIN, and set K = 10 for both in all tasks. The dimension of the latent variable z is set to the action dimension of each task. Additionally, we used a two-layer feed-forward neural network with tanh activations as the policy architecture, with 100 hidden units for all tasks. To extend our method to continuous control tasks, we made a minor modification to the training objective. In Atari games, we used the KL divergence in Eq. (1) to measure the distance between the expert policy distribution and the action distribution. In continuous control tasks, we instead treated the latent variable z directly as the back-traced action and used the mean squared error between the back-traced action and the true action in the demonstration. We set the scaling weight α in Eq. (1) to 1.0 for all tasks. Training was conducted with the Adam optimizer (Kingma & Ba, 2015) at a learning rate of 3e-5 and a mini-batch size of 32 for 50,000 epochs. In each training epoch, we sampled a mini-batch of data every 20 states. To evaluate the quality of the learned reward, we used the trained reward learning module to produce rewards and trained a policy via PPO to maximize the inferred reward function for 5 million simulation steps, yielding our final policy.
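As a concrete illustration of the K-step planning recursion inside the simple VIN, the following NumPy sketch runs K = 10 iterations of Q = conv([R; V]) followed by V = max_a Q, using length-preserving kernel-size-2, stride-1 convolutions as described above. The random kernel weights and their scale stand in for the learned parameters of f_Q and are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_same(x, w):
    """Length-preserving 1D convolution with a kernel of size 2 and
    stride 1 (the input is left-padded by one zero)."""
    xp = np.concatenate(([0.0], x))
    return xp[:-1] * w[0] + xp[1:] * w[1]

def vin_1d(reward_map, K=10, n_actions=4):
    """K steps of the VIN planning recursion: Q_a = conv_a([R; V]),
    V = max_a Q_a. The random kernels stand in for learned f_Q weights."""
    w_r = rng.standard_normal((n_actions, 2)) * 0.1  # per-action kernels on R
    w_v = rng.standard_normal((n_actions, 2)) * 0.1  # per-action kernels on V
    v = np.zeros_like(reward_map)
    for _ in range(K):
        q = np.stack([conv1d_same(reward_map, w_r[a]) + conv1d_same(v, w_v[a])
                      for a in range(n_actions)])
        v = q.max(axis=0)  # value map: max over action channels
    return v

value_map = vin_1d(rng.standard_normal(16))
```

The back-traced action would then be read off from the 32-unit fully connected head applied to this final value map, as described above.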
PPO was trained with a learning rate of 3e-4, a clipping threshold of 0.1, an entropy coefficient of 0.0, a value function coefficient of 0.5, and a GAE parameter of 0.95 (Schulman et al., 2016). For a fair comparison, we used the same VIN as the model base for all baselines. The reward functions of GAIL and VAIL were chosen according to the original papers (Ho & Ermon, 2016; Peng et al., 2019). The information constraint I_c in VAIL was set to 0.5 for all tasks. To enable fast training, we trained all imitation methods with 16 parallel processes.
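For reference, the PPO settings above can be collected into a single configuration; the field names are illustrative and not tied to any particular PPO implementation:

```python
# PPO hyper-parameters used for policy optimization (field names illustrative)
ppo_config = dict(
    lr=3e-4,                    # Adam learning rate
    clip_param=0.1,             # PPO clipping threshold
    entropy_coef=0.0,           # entropy bonus weight
    value_loss_coef=0.5,        # value function coefficient
    gae_lambda=0.95,            # GAE parameter
    num_processes=16,           # parallel environment workers
    total_env_steps=5_000_000,  # simulation steps on the learned reward
)
```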

F.2 ADDITIONAL DETAILS OF GAIL AND VAIL

The discriminator for both GAIL and VAIL takes in a state (a stack of four frames) and an action, represented as a 2D one-hot tensor of shape (|A| × 84 × 84), where |A| is the number of valid discrete actions in each environment (Brown et al., 2019b). The discriminator outputs a binary classification value, and -log(D(s, a)) is used as the reward. VAIL was implemented following the repository of Karnewar (2018); its discriminator has an additional convolutional layer (with a kernel size of 4) as the final convolutional layer to encode the latent variable. We used the default information constraint of 0.2 (Karnewar, 2018). PPO with the same hyper-parameters was used to optimize the policy network for all methods. For both GAIL and VAIL, the discriminator was trained with the Adam optimizer at a learning rate of 0.001 and updated at every policy step.
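The action encoding and reward described above can be sketched as follows; the clipping constant guarding the logarithm is an added numerical safeguard, not a detail from the original setup:

```python
import numpy as np

def one_hot_action_plane(a, n_actions, h=84, w=84):
    """Broadcast a discrete action into a (|A|, 84, 84) one-hot tensor so
    it can be stacked with the frame channels fed to the discriminator."""
    plane = np.zeros((n_actions, h, w), dtype=np.float32)
    plane[a] = 1.0
    return plane

def gail_reward(d_sa):
    """Discriminator output D(s, a) in (0, 1) -> imitation reward
    -log D(s, a), as used for both GAIL and VAIL above."""
    return -np.log(np.clip(d_sa, 1e-8, 1.0))

plane = one_hot_action_plane(2, n_actions=6)
rewards = gail_reward(np.array([0.5, 1.0]))
```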



https://pybullet.org/




Figure 1: Visualization of state, reward and value maps of vPERL and VIN on Battle Zone game (the first row) and Breakout game (the second row).

Figure 2: Illustration of the proposed vPERL model.

Figure 4: Average return vs. the number of simulation steps on Atari games. The solid lines show the mean performance over five random seeds. The shaded area represents the standard deviation from the mean. The blue dotted line denotes the average return of the expert. The area above the blue dotted line indicates performance beyond the expert.

Figure 5: Visualization of state, vPERL reward map and GIRIL reward map for Atari games.

Figure 6: Average return vs. number of simulation steps on continuous control tasks.

Results. Figure 6 shows that our method vPERL achieves the best imitation performance on both continuous control tasks, i.e., Inverted Pendulum and Inverted Double Pendulum. Although GIRIL achieves performance close to the demonstration, the efficient planning-based reward function enables vPERL to perform even better. The other baselines are unable to reach demonstration-level performance by learning from only one demonstration. Quantitative results are shown in Appendix A.

Figure 7 shows the average return versus the number of full-episode demonstrations on both Atari games and continuous control tasks. The results show that our method vPERL achieves the highest performance across different numbers of full-episode demonstrations. GIRIL is usually second best, and GAIL can achieve good performance with more demonstrations in continuous control tasks. Quantitative results are shown in Appendix B.1.

Figure 7: Average return vs. number of demonstrations on Atari games and continuous control tasks.

Figure 8: Average return vs. number of expert simulation steps (N_E) on Atari games and continuous control tasks.

Ablation study on the hyper-parameter K. Figure 9 shows the average return for different choices of K on Atari games and continuous control tasks. The results show that our method is not very sensitive to the choice of K. Quantitative results are shown in Appendix B.3.

Figure 9: Average return vs. choices of K on Atari games and continuous control tasks.

VISUALIZATION OF REWARD AND VALUE IMAGES OF VPERL-LARGE AND VIN

In this section, we visualize the reward maps and value maps learned by vPERL-Large and VIN on Atari games. Here, both vPERL and VIN are based on the large-size VIN architecture. The size of the reward map is 84 × 84. The figures show that the reward and value maps learned by vPERL are more meaningful than those learned by VIN.

Figure 10: Visualization of state, reward map and value map on Battle Zone game. (a) The state, (b) the reward map and value map of vPERL, and (c) the reward map and value map of VIN.

Figure 11: Visualization of state, reward map and value map on Centipede game. (a) The state, (b) the reward map and value map of vPERL, and (c) the reward map and value map of VIN.

Figure 12: Visualization of state, reward map and value map on Seaquest game. (a) The state, (b) the reward map and value map of vPERL, and (c) the reward map and value map of VIN.


Figure 13: Visualization of state, reward map and value map on Qbert game. (a) The state, (b) the reward map and value map of vPERL, and (c) the reward map and value map of VIN.

Figure 14: Visualization of state, reward map and value map on Breakout game. (a) The state, (b) the reward map and value map of vPERL, and (c) the reward map and value map of VIN.

Figure 15: Visualization of state, reward map and value map on Space Invaders game. (a) The state, (b) the reward map and value map of vPERL, and (c) the reward map and value map of VIN.

Figure 16: Visualization of state, reward map and value map on Kung-Fu Master game. (a) The state, (b) the reward map and value map of vPERL, and (c) the reward map and value map of VIN.

Figure 17: Visualization of state, reward map and value map on Battle Zone game. (a) The state, (b) the reward map and value map of vPERL, and (c) the reward map and value map of VIN.

Figure 18: Visualization of state, reward map and value map on Centipede game. (a) The state, (b) the reward map and value map of vPERL, and (c) the reward map and value map of VIN.

Figure 19: Visualization of state, reward map and value map on Seaquest game. (a) The state, (b) the reward map and value map of vPERL, and (c) the reward map and value map of VIN.


Figure 20: Visualization of state, reward map and value map on Qbert game. (a) The state, (b) the reward map and value map of vPERL, and (c) the reward map and value map of VIN.

Figure 21: Visualization of state, reward map and value map on Breakout game. (a) The state, (b) the reward map and value map of vPERL, and (c) the reward map and value map of VIN.

Figure 22: Visualization of state, reward map and value map on Space Invaders game. (a) The state, (b) the reward map and value map of vPERL, and (c) the reward map and value map of VIN.

Figure 23: Visualization of state, reward map and value map on Beam Rider game. (a) The state, (b) the reward map and value map of vPERL, and (c) the reward map and value map of VIN.

Statistics of Atari environments.



Average return of vPERL-Small-R Mean and its components (i.e., Action Back-tracing, Transition Modeling, and PERL) with one-life demonstration data on eight Atari games. The results shown are the mean performance over five random seeds, with better-than-expert performance in bold.

Parameter analysis of vPERL versus other baselines with different numbers of full-episode demonstrations on the Centipede game. The results shown are the mean performance over 5 random seeds, with the best performance in bold.

Parameter analysis of vPERL versus other baselines with different numbers of full-episode demonstrations on the Qbert game. The results shown are the mean performance over 5 random seeds, with the best performance in bold.

Parameter analysis of vPERL versus other baselines with different numbers of full-episode demonstrations on the Inverted Pendulum task. The results shown are the mean performance over 5 random seeds, with the best performance in bold.

Parameter analysis of vPERL versus other baselines with different numbers of full-episode demonstrations on the InvertedDoublePendulum task. The results shown are the mean performance over 5 random seeds, with the best performance in bold.

Average return of vPERL-R Mean with one demonstration of an expert policy trained for different numbers of simulation steps (N_E = 0.1 million, 0.5 million, and 1 million). The results shown are the mean performance over five random seeds, with demonstration-level performance in bold.

Average return of vPERL-Small-R Mean with different choices of K on one-life demonstration data. The results shown are the mean performance over five random seeds, with better-than-expert performance in bold.

Average return of vPERL-R Mean with different choices of K on one-life demonstration data. The results shown are the mean performance over five random seeds, with demonstration-level performance in bold.

Table 14 compares the average return of vPERL-Small-R Mean with the "Standard Demo." and the "No-score Demo." on the Q*bert and Krull games.

