LS-IQ: IMPLICIT REWARD REGULARIZATION FOR INVERSE REINFORCEMENT LEARNING

Abstract

Recent methods for imitation learning directly learn a Q-function using an implicit reward formulation rather than an explicit reward function. However, these methods generally require implicit reward regularization for stability and often mistreat absorbing states. Previous works show that a squared-norm regularization on the implicit reward function is effective, but do not provide a theoretical analysis of the resulting properties of the algorithms. In this work, we show that using this regularizer under a mixture distribution of the policy and the expert provides a particularly illuminating perspective: the original objective can be understood as squared Bellman error minimization, and the corresponding optimization problem minimizes a bounded χ²-divergence between the expert and the mixture distribution. This perspective allows us to address instabilities and properly treat absorbing states. We show that our method, Least Squares Inverse Q-Learning (LS-IQ), outperforms state-of-the-art algorithms, particularly in environments with absorbing states. Finally, we propose to use an inverse dynamics model to learn from observations only. Using this approach, we retain performance in settings where no expert actions are available.¹

1. INTRODUCTION

Inverse Reinforcement Learning (IRL) techniques have been developed to robustly extract behaviors from expert demonstrations and to solve the problems of classical Imitation Learning (IL) methods (Ng et al., 1999; Ziebart et al., 2008). Among recent methods for IRL, the Adversarial Imitation Learning (AIL) approach (Ho & Ermon, 2016; Fu et al., 2018; Peng et al., 2021), which casts the optimization over rewards and policies into an adversarial setting, has proven particularly successful. These methods, inspired by Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), alternate between learning a discriminator and improving the agent's policy w.r.t. a reward function computed from the discriminator's output. These explicit reward methods require many interactions with the environment, as they learn both a reward and a value function. Recently, implicit reward methods (Kostrikov et al., 2020; Arenz & Neumann, 2020; Garg et al., 2021) have been proposed. These methods directly learn the Q-function, significantly accelerating policy optimization. Among the implicit reward approaches, Inverse soft Q-Learning (IQ-Learn) is the current state of the art. This method modifies the distribution matching objective by including a reward regularizer on the expert distribution, which results in a minimization of the χ²-divergence between the policy and the expert distribution. However, whereas its derivation only considers regularization on the expert distribution, its practical implementation on continuous control tasks has shown that regularizing the reward on both the expert and the policy distribution achieves significantly better performance.

The contribution of this paper is twofold. First, we show that, when using this regularizer, the resulting objective minimizes the χ²-divergence between the expert and a mixture distribution of the expert and the policy. We then investigate the effects of regularizing w.r.t. the mixture distribution on the theoretical properties of IQ-Learn. We show that this divergence is bounded, which translates into bounds on the reward and the Q-function, significantly improving learning stability. Indeed, the resulting objective corresponds to least-squares Bellman error minimization and is closely related to Soft Q-Imitation Learning (SQIL) (Reddy et al., 2020). Second, we formulate Least Squares Inverse Q-Learning (LS-IQ), a novel IRL algorithm. Following the theoretical insights from the analysis of the χ² regularizer, we tackle several sources of instability in IQ-Learn: the arbitrary scale of the Q-function, exploding Q-function targets, and reward bias (Kostrikov et al., 2019), i.e., implicitly assigning a null reward to absorbing states. We derive the LS-IQ algorithm by exploiting structural properties of the Q-function and heuristics based on expert optimality. This results in increased performance on many tasks and, in general, more stable learning and lower variance in the Q-function estimates. Finally, we extend implicit reward methods to the IL-from-observations setting by training an Inverse-Dynamics Model (IDM) to predict the expert actions, which are no longer assumed to be available. Even in this challenging setting, our approach retains performance similar to the setting where expert actions are known.

Related Work. The vast majority of IRL and IL methods build upon the Maximum Entropy (MaxEnt) IRL framework (Ziebart, 2010). In particular, Ho & Ermon (2016) introduce Generative Adversarial Imitation Learning (GAIL), which applies GANs to the IL problem. While the original method minimizes the Jensen-Shannon divergence to the expert distribution, the approach has been extended to general f-divergences (Ghasemipour et al., 2019), building on the work of Nowozin et al. (2016). Among the f-divergences, the Pearson χ²-divergence improves training stability both for GANs (Mao et al., 2017) and for AIL (Peng et al., 2021).
Kostrikov et al. (2019) introduce a replay buffer for off-policy updates of the policy and the discriminator. The authors also point out the problem of reward bias, which is common in many imitation learning methods: AIL methods implicitly assign a null reward to absorbing states, leading to survival or termination biases, depending on the chosen divergence. Kostrikov et al. (2020) improve on this work by introducing recent advances from offline policy evaluation (Nachum et al., 2019). Their method, ValueDice, uses an inverse Bellman operator, which expresses the reward function in terms of the Q-function, to minimize the reverse Kullback-Leibler (KL) divergence to the expert distribution. Arenz & Neumann (2020) derive a non-adversarial formulation based on trust-region updates on the policy. Their method, O-NAIL, uses a standard Soft Actor-Critic (SAC) (Haarnoja et al., 2018) update for policy improvement. O-NAIL can be understood as an instance of the more general IQ-Learn algorithm (Garg et al., 2021), which can optimize different divergences depending on an implicit reward regularizer. Garg et al. (2021) also show that their algorithm achieves better performance using the χ²-divergence instead of the reverse KL. Reddy et al. (2020) propose a method, SQIL, that uses SAC and assigns fixed binary rewards to the expert and the policy. Swamy et al. (2021) provide a unifying perspective on many of the methods mentioned above, explicitly showing that GAIL, ValueDice, MaxEnt-IRL, and SQIL can be viewed as moment matching algorithms. Lastly, Sikchi et al. (2023) propose a ranking loss for AIL, which trains a reward function using a least-squares objective with ranked targets.
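The inverse Bellman operator underlying these implicit reward methods can be illustrated with a small tabular sketch. Instead of learning a reward and then a Q-function, the reward is recovered from Q as r(s, a) = Q(s, a) − γ E_{s'}[V(s')], where V is the soft (log-sum-exp) state value. The sketch below is a minimal illustration under assumed tabular dynamics, not the implementation of any of the cited papers; all array shapes and names are hypothetical.

```python
import numpy as np

def soft_value(Q):
    """Soft state value V(s) = log sum_a exp(Q(s, a)), computed stably."""
    m = Q.max(axis=1, keepdims=True)  # subtract max for numerical stability
    return (m + np.log(np.exp(Q - m).sum(axis=1, keepdims=True))).squeeze(1)

def implicit_reward(Q, P, gamma):
    """Inverse soft Bellman operator: r(s, a) = Q(s, a) - gamma * E_{s'}[V(s')].

    Q: (S, A) Q-values, P: (S, A, S) transition kernel, gamma: discount factor.
    """
    V = soft_value(Q)        # (S,) soft values of all states
    EV_next = P @ V          # (S, A) expected next-state value under P
    return Q - gamma * EV_next

# Illustrative random MDP with 4 states and 2 actions.
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)        # normalize into a valid kernel
Q = rng.standard_normal((S, A))
r = implicit_reward(Q, P, gamma)

# Sanity check: the forward soft Bellman operator applied to r recovers Q.
assert np.allclose(r + gamma * (P @ soft_value(Q)), Q)
```

Because the operator is a bijection between rewards and Q-functions, any objective over rewards can be rewritten as an objective over Q-functions alone, which is what lets these methods skip learning an explicit reward.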

2. PRELIMINARIES

Notation. A Markov Decision Process (MDP) is a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma, \mu_0)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_+$ is the transition kernel, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, $\gamma$ is the discount factor, and $\mu_0 : \mathcal{S} \to \mathbb{R}_+$ is the initial state distribution. At each step, the agent observes a state $s \in \mathcal{S}$ from the environment, samples an action $a \in \mathcal{A}$ using the policy $\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}_+$, and transitions with probability $\mathcal{P}(s'|s,a)$ into the next state $s' \in \mathcal{S}$, where it receives the reward $r(s,a)$. We define the occupancy measure $\rho_\pi(s,a) = \pi(a|s) \sum_{t=0}^{\infty} \gamma^t \mu^\pi_t(s)$, where $\mu^\pi_t(s') = \int_{\mathcal{S}} \int_{\mathcal{A}} \mu^\pi_{t-1}(s)\, \pi(a|s)\, \mathcal{P}(s'|s,a)\, da\, ds$ is the state distribution for $t > 0$, with $\mu^\pi_0(s) = \mu_0(s)$. The occupancy measure allows us to write the expected reward under policy $\pi$ as $\mathbb{E}_{\rho_\pi}[r(s,a)] \triangleq \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\big]$, where $s_0 \sim \mu_0$, $a_t \sim \pi(\cdot|s_t)$, and $s_{t+1} \sim \mathcal{P}(\cdot|s_t, a_t)$. Furthermore, $\mathbb{R}^{\mathcal{S} \times \mathcal{A}} = \{x : \mathcal{S} \times \mathcal{A} \to \mathbb{R}\}$ denotes the set of functions on the state-action space, and $\bar{\mathbb{R}}$ denotes the extended real numbers $\mathbb{R} \cup \{+\infty\}$. We refer to the soft value functions as $\tilde{V}(s)$ and $\tilde{Q}(s,a)$, while we use $V(s)$ and $Q(s,a)$ to denote the value functions without entropy bonus.

Inverse Reinforcement Learning as an Occupancy Matching Problem. Given a set of demonstrations consisting of states and actions sampled from an expert policy $\pi_E$, IRL aims at finding a reward function $r(s,a)$ from a family of reward functions $\mathcal{R} = \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$ assigning high reward to



¹ The code is available at https://github.com/robfiras/ls-iq

