BC-IRL: LEARNING GENERALIZABLE REWARD FUNCTIONS FROM DEMONSTRATIONS

Abstract

How well do reward functions learned with inverse reinforcement learning (IRL) generalize? We illustrate that state-of-the-art IRL algorithms, which maximize a maximum-entropy objective, learn rewards that overfit to the demonstrations. Such rewards struggle to provide meaningful signals for states not covered by the demonstrations, a major detriment when using the reward to learn policies in new situations. We introduce BC-IRL, a new inverse reinforcement learning method that learns reward functions that generalize better than those of maximum-entropy IRL approaches. In contrast to the MaxEnt framework, which learns to maximize rewards around demonstrations, BC-IRL updates reward parameters such that the policy trained with the new reward matches the expert demonstrations better. We show that BC-IRL learns rewards that generalize better on an illustrative simple task and two continuous robotic control tasks, achieving over twice the success rate of baselines in challenging generalization settings.

1. INTRODUCTION

Reinforcement learning has demonstrated success on a broad range of tasks, from navigation (Wijmans et al., 2019) and locomotion (Kumar et al., 2021; Iscen et al., 2018) to manipulation (Kalashnikov et al., 2018). However, this success depends on specifying an accurate and informative reward signal to guide the agent towards solving the task. For instance, imagine designing a reward function for a robot window-cleaning task. The reward should tell the robot how to grasp the cleaning rag, how to use the rag to clean the window, and to wipe hard enough to remove dirt but not hard enough to break the window. Manually shaping such reward functions is difficult, non-intuitive, and time-consuming. Furthermore, the need for an expert to design a reward function for every new skill limits the ability of agents to autonomously acquire new skills. Inverse reinforcement learning (IRL) (Abbeel & Ng, 2004; Ziebart et al., 2008; Osa et al., 2018) addresses the challenge of acquiring rewards by learning reward functions from demonstrations and then using the learned rewards to train policies via reinforcement learning. Compared to direct imitation learning, which learns policies from demonstrations directly, the potential benefits of IRL are at least two-fold: first, IRL does not suffer from the compounding-error problem often observed with policies learned directly from demonstrations (Ross et al., 2011; Barde et al., 2020); and second, a reward function can be a more abstract and parsimonious description of the observed task that generalizes better to unseen task settings (Ng et al., 2000; Osa et al., 2018). This second benefit is appealing, as it allows the agent to use a learned reward function to train policies not only for the demonstrated task setting (e.g., specific start-goal configurations in a reaching task) but also for unseen settings (e.g., unseen start-goal configurations), autonomously and without additional expert supervision. However, thus far the generalization properties of reward functions learned via IRL are poorly understood.
Here, we study the generalization of learned reward functions and find that prior IRL methods fail to learn generalizable rewards and instead overfit to the demonstrations. Figure 1 demonstrates this on a task where a point-mass agent must navigate in a 2D space to a goal location at the center. An important characteristic of a reward for this task is that an agent, located anywhere in the state space, should receive increasing rewards as it gets closer to the goal. However, the reward learned with the maximum entropy (MaxEnt) objective (Figure 1b) fails to capture distance to the goal. Instead, the MaxEnt objective leads to rewards that separate non-expert from expert behavior by maximizing reward values along the expert demonstrations. While useful for imitating the expert, this objective prevents MaxEnt IRL algorithms from learning to assign meaningful rewards to other parts of the state space, limiting the generalization of the reward function. As a remedy to this reward generalization challenge, we propose a new IRL framework called Behavioral Cloning Inverse Reinforcement Learning (BC-IRL). In contrast to the MaxEnt framework, which learns to maximize rewards around demonstrations, BC-IRL updates reward parameters such that the policy trained with the new reward better matches the expert demonstrations. This is akin to model-agnostic meta-learning (Finn et al., 2017) and loss learning (Bechtle et al., 2021), where model or loss-function parameters are learned such that the downstream task performs well when utilizing the meta-learned parameters. By using gradient-based bi-level optimization (Grefenstette et al., 2019), BC-IRL can optimize the behavioral cloning loss to learn the reward, rather than a separation objective such as maximum entropy.
Importantly, to learn the reward, BC-IRL differentiates through the reinforcement learning policy optimization, which incorporates exploration and requires the reward to provide meaningful signal throughout the state space to guide the policy toward the expert. We find that BC-IRL learns more generalizable rewards (Figure 1c) and achieves over twice the success rate of baseline IRL methods in challenging generalization settings. Our contributions are as follows: 1) the general BC-IRL framework for learning more generalizable rewards from demonstrations, and a specific BC-IRL-PPO variant that uses PPO as the RL algorithm; 2) a quantitative and qualitative analysis of reward functions learned with BC-IRL and maximum-entropy IRL variants on a simple task that permits easy analysis; 3) an evaluation of our novel BC-IRL algorithm against state-of-the-art IRL and IL methods on two continuous control tasks, where our method learns rewards that transfer better to novel task settings.
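To make the mechanics concrete, here is a minimal 1-D sketch of the BC-IRL idea. This is our illustrative toy, not the paper's BC-IRL-PPO: the policy is a deterministic scalar action, the inner RL update is a single gradient-ascent step on the reward, and the outer gradient is taken analytically through that inner step.

```python
def bc_irl_toy(expert_action=2.0, inner_lr=0.4, outer_lr=0.1, steps=200):
    """Illustrative 1-D BC-IRL loop (a toy sketch, not the paper's method).

    Policy: deterministic scalar action a = theta.
    Reward: R_psi(a) = -(a - psi)**2 with learnable parameter psi.
    Inner step (RL): one gradient-ascent step on the reward,
        theta' = theta + inner_lr * dR/dtheta = theta - 2*inner_lr*(theta - psi).
    Outer step (BC-IRL): minimize the behavioral cloning loss of the
    UPDATED policy, L_BC = (theta' - expert_action)**2, differentiating
    through the inner step using dtheta'/dpsi = 2*inner_lr.
    """
    psi, theta = 0.0, 0.0
    for _ in range(steps):
        theta_new = theta - 2.0 * inner_lr * (theta - psi)        # inner RL step
        grad_psi = 2.0 * (theta_new - expert_action) * (2.0 * inner_lr)
        psi -= outer_lr * grad_psi                                # outer reward step
        theta = theta_new                                         # policy keeps training
    return psi, theta
```

Running `bc_irl_toy()` drives both the reward parameter and the policy toward the expert action: the reward is shaped precisely so that RL on it reproduces the expert, rather than so that expert states merely score higher than non-expert states.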

2. BACKGROUND AND RELATED WORK

We begin by reviewing inverse reinforcement learning through the lens of bi-level optimization. We assume access to a rewardless Markov decision process (MDP) defined by the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, \rho_0, \gamma, H)$ for state space $\mathcal{S}$, action space $\mathcal{A}$, transition distribution $\mathcal{P}(s'|s,a)$, initial state distribution $\rho_0$, discount factor $\gamma$, and episode horizon $H$. We also have access to a set of expert demonstration trajectories $\mathcal{D}^e = \{\tau^e_i\}_{i=1}^N$, where each trajectory is a sequence of state-action tuples. IRL learns a parameterized reward function $R_\psi(\tau_i)$ that assigns a scalar reward to a trajectory. Given the reward, a policy $\pi_\theta(a|s)$ is learned that maps states to a distribution over actions. The goal of IRL is to produce a reward $R_\psi$ such that a policy trained to maximize the sum of (discounted) rewards under this reward function matches the behavior of the expert. This is captured by the bi-level optimization problem in Eq. (1), where $\mathcal{L}_{IRL}(R_\psi; \pi_\theta)$ denotes the IRL loss and measures the performance of the learned reward $R_\psi$ and policy $\pi_\theta$, and $g(R_\psi, \theta)$ is the reinforcement learning objective used to optimize the policy parameters $\theta$. Algorithms for this bi-level optimization consist of an outer loop (1a) that optimizes the reward and an inner loop (1b) that optimizes the policy given the current reward.
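The alternating structure of this bi-level optimization can be sketched in a few lines of Python. This is a schematic with hypothetical callables (`inner_update`, `outer_update`), not any particular algorithm's implementation:

```python
def irl_bilevel(reward, policy, demos, outer_update, inner_update,
                outer_steps=10, inner_steps=5):
    """Generic bi-level IRL loop: `inner_update` performs one RL step on
    the current reward (inner obj. 1b); `outer_update` adjusts the reward
    given the current policy and the demonstrations (outer obj. 1a).
    Both callables are placeholders for a concrete algorithm."""
    for _ in range(outer_steps):
        for _ in range(inner_steps):
            policy = inner_update(policy, reward)      # optimize policy given reward
        reward = outer_update(reward, policy, demos)   # optimize reward given policy
    return reward, policy
```

Concrete IRL methods differ in how they fill in the two callables; in particular, MaxEnt IRL and BC-IRL use different outer-loop reward updates around the same alternating structure.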



Figure 1: A visualization of learned rewards on a task where a 2D agent must navigate to the goal at the center.

Figure 1a: Four trajectories are provided as demonstrations and the demonstrated states are visualized as points. Rewards learned via maximum entropy IRL are shown in Figure 1b and via BC-IRL in Figure 1c. Lighter colors represent larger predicted rewards. The MaxEnt objective overfits to the demonstrations, giving high rewards only close to the expert states, preventing the reward from providing meaningful learning signals in new states.

$$\min_{\psi} \; \mathcal{L}_{IRL}(R_\psi; \pi_\theta) \qquad \text{(outer obj.)} \tag{1a}$$
$$\text{s.t.} \quad \theta \in \operatorname*{argmax}_{\theta} \; g(R_\psi, \theta) \qquad \text{(inner obj.)} \tag{1b}$$
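For intuition, a common instantiation of the outer objective (1a) is the maximum-entropy gradient for a linear reward R_psi(s) = psi . phi(s): the update raises the reward on expert features and lowers it on features visited by the current policy. The linear reward form and the function name here are our illustrative choices, not the paper's:

```python
import numpy as np

def maxent_reward_step(psi, expert_feats, policy_feats, lr=0.1):
    """One outer-loop update (1a) under the MaxEnt objective for a linear
    reward R_psi(s) = psi . phi(s). The gradient is the gap between expert
    and policy feature expectations, so the reward rises on expert states
    and falls on states the current policy visits -- the 'separation'
    behavior discussed above. Feature arrays: (num_samples, feat_dim)."""
    grad = expert_feats.mean(axis=0) - policy_feats.mean(axis=0)
    return psi + lr * grad
```

Iterating this step until the policy's feature expectations match the expert's is what concentrates high reward on demonstrated states, which is exactly the overfitting behavior visualized in Figure 1b.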

Most recent prior work (Fu et al., 2017; Ni et al., 2020; Finn et al., 2016c) developed IRL algorithms that optimize the maximum entropy objective (Ziebart et al., 2008) (Figure

