VISUAL IMITATION LEARNING WITH PATCH REWARDS

Abstract

Visual imitation learning enables reinforcement learning agents to learn to behave from expert visual demonstrations such as videos or image sequences, without explicit, well-defined rewards. Previous research either adopted supervised learning techniques or induced simple, coarse scalar rewards from pixels, neglecting the dense information contained in the image demonstrations. In this work, we propose to measure the expertise of various local regions of image samples, called patches, and recover multi-dimensional patch rewards accordingly. The patch reward is a more precise rewarding characterization that serves as a fine-grained expertise measurement and visual explainability tool. Specifically, we present Adversarial Imitation Learning with Patch Rewards (PatchAIL), which employs a patch-based discriminator to measure the expertise of different local parts of given images and provide patch rewards. The patch-based knowledge is also used to regularize the aggregated reward and stabilize the training. We evaluate our method on DeepMind Control Suite and Atari tasks. The experimental results demonstrate that PatchAIL outperforms baseline methods and provides valuable interpretations for visual demonstrations. Our code is available at https://github.com/sail-sg/PatchAIL.

1. INTRODUCTION

Reinforcement learning (RL) has achieved encouraging success in various domains, e.g., games (Silver et al., 2017; Yang et al., 2022; Vinyals et al., 2019), robotics (Gu et al., 2017), and autonomous driving (Pomerleau, 1991; Zhou et al., 2020), but it heavily relies on well-designed reward functions. Since hand-crafting reward functions is non-trivial, imitation learning (IL) offers a data-driven way to learn behaviors and recover informative rewards from expert demonstrations without access to any explicit reward (Arora & Doshi, 2021; Hussein et al., 2017). Recently, visual imitation learning (VIL) (Pathak et al., 2018; Rafailov et al., 2021), which aims to learn from high-dimensional visual demonstrations such as image sequences or videos, has attracted increasing attention. Compared with previous IL works that tackle low-dimensional inputs, i.e., proprioceptive features like positions, velocities, and accelerations (Ho & Ermon, 2016; Fu et al., 2017; Liu et al., 2021), VIL is a more general problem in the physical world, such as learning to cook by watching videos. This is easy for humans, but it remains a critical challenge for AI agents. Previous approaches apply supervised learning (i.e., behavior cloning (Pomerleau, 1991; Pathak et al., 2018)) or design reward functions (Reddy et al., 2019; Brantley et al., 2019) to solve VIL problems. Among them, supervised methods often require many expert demonstrations and suffer from compounding errors (Ross et al., 2011), while the latter tend to design rewards that are too naive to provide appropriate expertise measurements for efficient learning. Some recent works (Rafailov et al., 2021; Cohen et al., 2021; Haldar et al., 2022) choose to build directly on pixel-based RL methods, i.e., using an encoder, and learn rewards from the latent space. These methods inspect the whole image to derive the rewards, which tends to yield only coarse expertise measurements.
Additionally, they focus on behavior learning and pay little attention to the explainability of the learned behaviors. In this work, we argue that the scalar-form reward used in previous works conveys only "sparse" information about the full image and hardly measures local expertise, which limits the learning efficiency.

† The work was done during an internship at Sea AI Lab, Singapore. Corresponding author.
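To make the contrast concrete, the following is a minimal sketch of the patch-reward idea: instead of one scalar reward per image, each local region receives its own score, and the scalar RL reward is obtained by aggregating the grid. This is an illustrative toy, not the paper's implementation; the fixed linear scorer stands in for the learned patch-based (PatchGAN-style convolutional) discriminator, and the `log D` shape follows the usual adversarial imitation reward.

```python
import numpy as np

def patch_rewards(image, patch_size=8, weights=None):
    """Split a grayscale `image` (H, W) into non-overlapping patches and
    score each one, returning an (H//patch_size, W//patch_size) grid of
    per-patch rewards. `weights` stands in for learned discriminator
    parameters (hypothetical here; the paper learns them adversarially).
    """
    h, w = image.shape
    ph, pw = h // patch_size, w // patch_size
    if weights is None:
        # Random fixed "discriminator" weights for illustration only.
        weights = np.random.default_rng(0).normal(size=patch_size * patch_size)
    rewards = np.empty((ph, pw))
    for i in range(ph):
        for j in range(pw):
            patch = image[i * patch_size:(i + 1) * patch_size,
                          j * patch_size:(j + 1) * patch_size]
            # Per-patch logit -> log D(patch), a common adversarial IL reward.
            logit = patch.ravel() @ weights
            rewards[i, j] = np.log(1.0 / (1.0 + np.exp(-logit)))
    return rewards

obs = np.random.default_rng(1).random((64, 64))
r = patch_rewards(obs)       # 8x8 grid of local expertise scores
scalar_reward = r.mean()     # one simple aggregation into an RL reward
```

The grid `r` is what a whole-image discriminator collapses away: it localizes which regions look expert-like, which is the basis for both the regularized aggregate reward and the visual interpretations described in the abstract.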

