VISUAL IMITATION LEARNING WITH PATCH REWARDS

Abstract

Visual imitation learning enables reinforcement learning agents to learn behaviors from expert visual demonstrations such as videos or image sequences, without explicit, well-defined rewards. Previous research either adopts supervised learning techniques or induces simple and coarse scalar rewards from pixels, neglecting the dense information contained in image demonstrations. In this work, we propose to measure the expertise of various local regions of image samples, called patches, and recover multi-dimensional patch rewards accordingly. The patch reward is a more precise rewarding characterization that serves as both a fine-grained expertise measurement and a visual explainability tool. Specifically, we present Adversarial Imitation Learning with Patch Rewards (PatchAIL), which employs a patch-based discriminator to measure the expertise of different local parts of given images and provide patch rewards. The patch-based knowledge is also used to regularize the aggregated reward and stabilize training. We evaluate our method on DeepMind Control Suite and Atari tasks. The experimental results demonstrate that PatchAIL outperforms baseline methods and provides valuable interpretations of visual demonstrations. Our code is available at https://github.com/sail-sg/PatchAIL.

1. INTRODUCTION

Reinforcement learning (RL) has achieved encouraging success in various domains, e.g., games (Silver et al., 2017; Yang et al., 2022; Vinyals et al., 2019), robotics (Gu et al., 2017), and autonomous driving (Pomerleau, 1991; Zhou et al., 2020), but it relies heavily on well-designed reward functions. Since hand-crafting reward functions is non-trivial, imitation learning (IL) offers a data-driven way to learn behaviors and recover informative rewards from expert demonstrations without access to any explicit reward (Arora & Doshi, 2021; Hussein et al., 2017). Recently, visual imitation learning (VIL) (Pathak et al., 2018; Rafailov et al., 2021), which aims to learn from high-dimensional visual demonstrations such as image sequences or videos, has attracted increasing attention. Compared with previous IL works tackling low-dimensional inputs, i.e., proprioceptive features such as positions, velocities, and accelerations (Ho & Ermon, 2016; Fu et al., 2017; Liu et al., 2021), VIL is a more general problem in the physical world, e.g., learning to cook by watching videos. This is very easy for humans but remains a critical challenge for AI agents. Previous approaches apply supervised learning (i.e., behavior cloning (Pomerleau, 1991; Pathak et al., 2018)) or design reward functions (Reddy et al., 2019; Brantley et al., 2019) to solve VIL problems. Among them, supervised methods often require many expert demonstrations and suffer from compounding errors (Ross et al., 2011), while the rest tend to design rewards that are too naive to provide appropriate expertise measurements for efficient learning. Some recent works (Rafailov et al., 2021; Cohen et al., 2021; Haldar et al., 2022) build directly on pixel-based RL methods, i.e., using an encoder, and learn rewards in the latent space. These methods inspect the whole image to derive the rewards, which tends to yield only coarse expertise measurements.
Additionally, they focus on behavior learning and pay little attention to the explainability of the learned behaviors. In this work, we argue that the scalar-form reward used in previous works conveys only "sparse" information about the full image and hardly measures local expertise, which limits learning efficiency. We therefore propose to achieve more efficient VIL by inferring dense and informative reward signals from images, which also brings better explainability of the learned behaviors.

To this end, we develop an intuitive and principled approach for VIL, termed Adversarial Imitation Learning with Patch rewards (PatchAIL), which produces patch rewards for agents to learn from by comparing local regions of agent image samples with visual demonstrations, and thereby provides interpretability of the learned behaviors. It also utilizes a patch regularization mechanism that exploits the patch information to stabilize training. We visualize the patch rewards induced by PatchAIL in Fig. 1. From the illustration, we can easily interpret where and why agents are rewarded or penalized in these control tasks compared with expert demonstrations, without any task prior. These examples show the explainability of our method and intuitively explain why the algorithm works well.

We conduct experiments on a wide range of DeepMind Control Suite (DMC) (Tassa et al., 2018) tasks along with several Atari tasks, and find that PatchAIL significantly outperforms all baseline methods, verifying its merit in enhancing the learning efficiency of VIL. Moreover, we conduct ablations on the pivotal choices of PatchAIL, showing that key options such as the number of patches and the aggregation function strongly impact learning performance. To verify its explainability, we further compare the spatial attention maps of PatchAIL against those of baseline methods, showing that PatchAIL better focuses on the correct parts of the robot bodies.
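The core mechanism of a patch-based discriminator can be illustrated with a short sketch: a fully-convolutional network (PatchGAN-style) maps stacked frames to a grid of patch logits, each scoring one local receptive field, which are then converted to per-patch rewards and aggregated. This is a minimal illustration under assumed layer sizes and a mean aggregation with the GAIL-style reward -log(1 - D); it is not the paper's exact architecture or aggregation choice.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Fully-convolutional discriminator: maps stacked frames to a grid of
    patch logits, one per local receptive field (PatchGAN-style sketch)."""
    def __init__(self, in_channels: int = 9):  # e.g., 3 stacked RGB frames
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, kernel_size=4, stride=1, padding=1),  # 1 logit per patch
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, 1, H', W') grid of patch logits
        return self.net(obs)

def patch_reward(disc: PatchDiscriminator, obs: torch.Tensor) -> torch.Tensor:
    """Aggregate patch logits into one scalar reward per sample.
    Illustrative choice: mean over patches of -log(1 - D(s))."""
    d = torch.sigmoid(disc(obs))               # per-patch "expert" probability
    per_patch = -torch.log(1.0 - d + 1e-8)     # per-patch reward, always > 0
    return per_patch.mean(dim=(1, 2, 3))       # (B,)

disc = PatchDiscriminator()
obs = torch.randn(2, 9, 84, 84)  # batch of 2, three stacked 84x84 RGB frames
r = patch_reward(disc, obs)      # tensor of shape (2,)
```

Replacing the mean with other aggregators (e.g., log-sum-exp or min) trades off how strongly a single low-expertise patch penalizes the whole frame, which is the kind of design choice ablated in the paper.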
We hope that our results provide insights into exploiting the rich information in visual demonstrations to benefit VIL, and reveal the potential of patch-level rewards.

2. PRELIMINARIES

Markov decision process. RL problems are typically modeled as a γ-discounted infinite-horizon Markov decision process (MDP) M = ⟨S, A, P, ρ0, r, γ⟩, where S denotes the set of states, A the set of actions, P the transition dynamics, ρ0 the initial state distribution, r the reward function, and γ the discount factor.
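Given these definitions, the standard RL objective (stated here for completeness, using the symbols above) is to find a policy $\pi$ maximizing the expected discounted return:

```latex
J(\pi) = \mathbb{E}_{s_0 \sim \rho_0,\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t)}
\left[ \sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t) \right].
```

In adversarial imitation learning (Ho & Ermon, 2016), $r$ is not given; instead, with the convention that a discriminator $D$ outputs the probability that a sample comes from the expert policy $\pi_E$, $D$ is trained by

```latex
\max_{D}\; \mathbb{E}_{(s,a) \sim \pi_E}\big[\log D(s,a)\big]
+ \mathbb{E}_{(s,a) \sim \pi}\big[\log \big(1 - D(s,a)\big)\big],
```

and the policy is trained with a surrogate reward such as $r(s,a) = -\log(1 - D(s,a))$.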



† The work was done during an internship at Sea AI Lab, Singapore. Corresponding author.



Figure 1: Visualization of patch rewards by the proposed PatchAIL on seven DeepMind Control Suite tasks. The patch rewards are computed over several stacked frames and are mapped back onto each pixel, weighted by the spatial attention map. We use the final model trained with around 200M frames. Strikingly, the patch rewards align with the movement of key elements (e.g., a moving foot or an inclined body) in both expert and random samples, indicating where the high rewards (red areas) or low ones (blue areas) come from, in contrast to the reward on the background. With the patch rewards recovered from the expert's visual demonstrations, we can easily reason about why and where agents are rewarded or penalized in these tasks compared with expert demonstrations, without any task prior. Figure best viewed in color. A corresponding video is provided on the demo page https://sites.google.com/view/patchail/.
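One plausible way to produce such a pixel-level heatmap from a coarse patch-reward grid is nearest-neighbour upsampling followed by elementwise weighting with the attention map; this is an illustrative sketch, not necessarily the exact rendering procedure used for the figure.

```python
import numpy as np

def patch_rewards_to_heatmap(patch_rewards: np.ndarray,
                             attention: np.ndarray) -> np.ndarray:
    """Upsample a (ph, pw) grid of patch rewards to the (H, W) pixel
    resolution via nearest-neighbour block replication, then weight each
    pixel by a spatial attention map."""
    ph, pw = patch_rewards.shape
    H, W = attention.shape
    assert H % ph == 0 and W % pw == 0, "image size must be a multiple of the patch grid"
    scale_h, scale_w = H // ph, W // pw
    # np.kron replicates each patch reward into a (scale_h, scale_w) block
    upsampled = np.kron(patch_rewards, np.ones((scale_h, scale_w)))
    return upsampled * attention

# Toy example: a 2x2 patch-reward grid and uniform attention over a 4x4 image.
rewards = np.array([[1.0, -1.0],
                    [0.5,  0.0]])
attn = np.full((4, 4), 0.5)
heatmap = patch_rewards_to_heatmap(rewards, attn)
# Each reward is spread over its 2x2 pixel block and scaled by the attention.
```

The resulting heatmap can then be overlaid on the observation, with positive values rendered in red and negative ones in blue, matching the visualization style described in the caption.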

