BC-IRL: LEARNING GENERALIZABLE REWARD FUNCTIONS FROM DEMONSTRATIONS

Abstract

How well do reward functions learned with inverse reinforcement learning (IRL) generalize? We illustrate that state-of-the-art IRL algorithms, which maximize a maximum-entropy objective, learn rewards that overfit to the demonstrations. Such rewards fail to provide meaningful signals for states not covered by the demonstrations, a major drawback when the reward is used to train policies in new situations. We introduce BC-IRL, a new inverse reinforcement learning method that learns reward functions that generalize better than those of maximum-entropy IRL approaches. In contrast to the MaxEnt framework, which learns to maximize rewards around demonstrations, BC-IRL updates reward parameters such that the policy trained with the new reward better matches the expert demonstrations. We show that BC-IRL learns rewards that generalize better on an illustrative simple task and two continuous robotic control tasks, achieving over twice the success rate of baselines in challenging generalization settings.
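The core idea above, updating reward parameters so that the policy *trained under that reward* matches the demonstrations, can be illustrated with a deliberately simplified 1D sketch. This is a hypothetical toy, not the paper's actual algorithm (which trains policies with reinforcement learning and differentiates through the policy update): here the reward is r_theta(s) = -(s - theta)^2, so the "trained" policy lands exactly at the reward's maximizer s = theta, and the outer behavioral-cloning loss can be differentiated analytically.

```python
# Toy 1D sketch of the BC-IRL bi-level idea (illustrative assumption, not
# the paper's PPO-based method). Reward: r_theta(s) = -(s - theta)^2.
# Inner loop: the policy trained on this reward converges to argmax_s
# r_theta(s) = theta. Outer loop: update theta to minimize a behavioral-
# cloning loss between the trained policy's state and the expert's state.

expert_state = 2.0   # demonstrated goal state
theta = -1.0         # initial reward parameter
lr = 0.1             # outer-loop learning rate

for _ in range(200):
    # Inner loop (closed form here): policy trained under the current
    # reward ends up at the reward's maximizer.
    policy_state = theta
    # Outer loop: BC loss and its analytic gradient w.r.t. theta.
    bc_loss = (policy_state - expert_state) ** 2
    grad_theta = 2.0 * (policy_state - expert_state)
    theta -= lr * grad_theta

print(round(theta, 3))  # -> 2.0: the reward parameter converges to the expert state
```

Note that theta is driven toward the expert state only because the outer gradient flows *through* the trained policy; a MaxEnt-style objective would instead raise the reward directly on demonstrated states.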




Figure 1: A visualization of learned rewards on a task where a 2D agent must navigate to the goal at the center. Four trajectories are provided as demonstrations, and the demonstrated states are visualized as points (Figure 1a). Rewards learned via Maximum Entropy IRL are shown in Figure 1b and via BC-IRL in Figure 1c. Lighter colors represent larger predicted rewards. The MaxEnt objective overfits to the demonstrations, giving high rewards only close to the expert states, which prevents the reward from providing meaningful learning signals in new states.

1 INTRODUCTION

Reinforcement learning has demonstrated success on a broad range of tasks, from navigation (Wijmans et al., 2019) and locomotion (Kumar et al., 2021; Iscen et al., 2018) to manipulation (Kalashnikov et al., 2018). However, this success depends on specifying an accurate and informative reward signal to guide the agent towards solving the task. For instance, imagine designing a reward function for a robot window-cleaning task. The reward should tell the robot how to grasp the cleaning rag, how to use the rag to clean the window, and to wipe hard enough to remove dirt but not hard enough to break the window. Manually shaping such reward functions is difficult, non-intuitive, and time-consuming. Furthermore, the need for an expert to design a reward function for every new skill limits the ability of agents to autonomously acquire new skills. Inverse reinforcement learning (IRL) (Abbeel & Ng, 2004; Ziebart et al., 2008; Osa et al., 2018) is one way of addressing this challenge: it learns reward functions from demonstrations and then uses the learned rewards to train policies via reinforcement learning. Compared to direct imitation learning, which learns policies from demonstrations directly, the potential benefits of IRL are at least two-fold: first, IRL does not suffer from the compounding-error problem often observed with policies learned directly from demonstrations (Ross et al., 2011; Barde et al., 2020); and second, a reward function could be a more abstract and parsimonious description of

