LS-IQ: IMPLICIT REWARD REGULARIZATION FOR INVERSE REINFORCEMENT LEARNING

Abstract

Recent methods for imitation learning directly learn a Q-function using an implicit reward formulation rather than an explicit reward function. However, these methods generally require implicit reward regularization to improve stability and often mistreat absorbing states. Previous works show that a squared norm regularization on the implicit reward function is effective, but do not provide a theoretical analysis of the resulting properties of the algorithms. In this work, we show that using this regularizer under a mixture distribution of the policy and the expert provides a particularly illuminating perspective: the original objective can be understood as squared Bellman error minimization, and the corresponding optimization problem minimizes a bounded χ²-divergence between the expert and the mixture distribution. This perspective allows us to address instabilities and properly treat absorbing states. We show that our method, Least Squares Inverse Q-Learning (LS-IQ), outperforms state-of-the-art algorithms, particularly in environments with absorbing states. Finally, we propose to use an inverse dynamics model to learn from observations only. Using this approach, we retain performance in settings where no expert actions are available.¹

1. INTRODUCTION

Inverse Reinforcement Learning (IRL) techniques have been developed to robustly extract behaviors from expert demonstrations and to overcome the problems of classical Imitation Learning (IL) methods (Ng et al., 1999; Ziebart et al., 2008). Among recent methods for IRL, the Adversarial Imitation Learning (AIL) approach (Ho & Ermon, 2016; Fu et al., 2018; Peng et al., 2021), which casts the optimization over rewards and policies into an adversarial setting, has proven particularly successful. These methods, inspired by Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), alternate between learning a discriminator and improving the agent's policy w.r.t. a reward function computed from the discriminator's output. These explicit reward methods require many interactions with the environment, as they learn both a reward and a value function.

Recently, implicit reward methods (Kostrikov et al., 2020; Arenz & Neumann, 2020; Garg et al., 2021) have been proposed. These methods learn the Q-function directly, significantly accelerating policy optimization. Among the implicit reward approaches, Inverse soft Q-Learning (IQ-Learn) is the current state of the art. This method modifies the distribution-matching objective by including a reward regularizer on the expert distribution, which results in the minimization of the χ²-divergence between the policy and the expert distribution. However, whereas its derivation only considers regularization on the expert distribution, practical implementations on continuous control tasks have shown that regularizing the reward on both the expert and the policy distribution achieves significantly better performance.

The contribution of this paper is twofold: First, we show that, when using this regularizer, the resulting objective minimizes the χ²-divergence between the expert and a mixture distribution of the expert and the policy. Second, we investigate the effects of regularizing w.r.t. the mixture distribution on the theoretical properties of IQ-Learn. We show that this divergence is bounded, which translates



¹ The code is available at https://github.com/robfiras/ls-iq

