INSTABILITY IN GENERATIVE ADVERSARIAL IMITATION LEARNING WITH DETERMINISTIC POLICY

Abstract

Deterministic policies are widely applied in generative adversarial imitation learning (GAIL). When adopting these policies, some GAIL variants modify the reward function to avoid training instability, yet the mechanism behind this instability remains largely unknown. In this paper, we theoretically attribute the instability to exploding gradients arising in the policy-update process. Our contributions are twofold: 1) By approximating the deterministic policy with a multivariate Gaussian policy of small covariance, we establish and prove a probabilistic lower bound on the exploding gradients, which characterizes the instability universally and shows that a stochastic policy never suffers from this pathology. 2) We further prove that the modified reward function of adversarial inverse reinforcement learning (AIRL) can relieve the exploding gradients. Experiments support our analysis.
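The exploding-gradient mechanism sketched above can be seen directly from the Gaussian score function: for a multivariate Gaussian policy with mean mu(s) and isotropic covariance sigma^2 I, the policy-gradient score with respect to the mean is (a - mu) / sigma^2, which grows without bound as sigma shrinks toward the deterministic limit. A minimal numerical illustration (the deviation 0.01 and the sigma schedule below are arbitrary choices for this sketch, not values from the paper):

```python
import numpy as np

def score_norm(a, mu, sigma):
    # Norm of grad_mu log N(a; mu, sigma^2 I) = (a - mu) / sigma^2.
    return np.linalg.norm((a - mu) / sigma**2)

mu = np.zeros(2)        # deterministic action mu(s)
a = mu + 0.01           # a fixed small deviation of the sampled action from mu
for sigma in [1.0, 0.1, 0.01]:
    print(sigma, score_norm(a, mu, sigma))
```

The score norm scales as 1/sigma^2, so halving sigma quadruples the gradient magnitude for the same action deviation, which is the 1/sigma^2 blow-up that a genuinely stochastic (fixed, non-vanishing covariance) policy avoids.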

1. INTRODUCTION

Imitation learning (IL) trains a policy directly from expert demonstrations without reward signals (Ng et al., 2000; Syed & Schapire, 2007; Ho & Ermon, 2016). It has been broadly studied under the twin umbrellas of behavioral cloning (BC) (Pomerleau, 1991) and inverse reinforcement learning (IRL) (Ziebart et al., 2008). Generative adversarial imitation learning (GAIL) (Ho & Ermon, 2016), built on trust region policy optimization (TRPO) (Schulman et al., 2015) for policy training, brings the idea of generative adversarial networks (GANs) (Goodfellow et al., 2014) into maximum entropy IRL. The discriminator in GAIL aims to distinguish whether a state-action pair comes from the expert demonstrations or is generated by the agent, while the learned policy generates interaction data to confuse the discriminator. GAIL is promising for many real-world scenarios where designing reward functions to learn optimal control policies requires significant effort. It has achieved remarkable results in physical-world tasks, e.g., robot manipulation (Jabri, 2021), mobile robot navigation (Tai et al., 2018), commodity search (Shi et al., 2019), endovascular catheterization (Chi et al., 2020), etc. The policy in GAIL can be effectively trained by reinforcement learning (RL) methods (Sutton & Barto, 2018; Puterman, 2014), which fall into stochastic policy algorithms and deterministic policy algorithms; we denote GAIL combined with these two classes as ST-GAIL and DE-GAIL, respectively, in our paper. For ST-GAIL, one can refer to proximal policy optimization (PPO)-GAIL (Chen et al., 2020), the natural policy gradient (NPG)-GAIL (Guan et al., 2021), and the two-stage stochastic gradient (TSSG) method (Zhou et al., 2022). These algorithms have shown that GAIL can ensure global convergence in high-dimensional environments, in contrast to traditional IRL methods (Ng et al., 2000; Ziebart et al., 2008; Boularias et al., 2011).
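The adversarial game described above can be sketched with a toy logistic discriminator on synthetic state-action pairs. This is a minimal illustration, not the paper's setup: the 2-D Gaussian clusters standing in for expert and agent data, the learning rate, and the iteration count are all assumptions of this sketch. The discriminator D is trained to output 1 on expert pairs and 0 on agent pairs, and a GAIL-style surrogate reward -log(1 - D(s, a)) then scores agent behavior by how expert-like it appears:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical (s, a) pairs: expert data clustered at +1, agent data at -1.
expert = rng.normal(loc=1.0, scale=0.3, size=(256, 2))
policy = rng.normal(loc=-1.0, scale=0.3, size=(256, 2))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Logistic discriminator D(s, a) = sigmoid(w . [s, a] + b).
w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(200):
    d_exp = sigmoid(expert @ w + b)   # should approach 1
    d_pol = sigmoid(policy @ w + b)   # should approach 0
    # Gradient ascent on E_expert[log D] + E_policy[log(1 - D)].
    grad_w = expert.T @ (1 - d_exp) / len(expert) - policy.T @ d_pol / len(policy)
    grad_b = np.mean(1 - d_exp) - np.mean(d_pol)
    w += lr * grad_w
    b += lr * grad_b

# GAIL-style surrogate reward r(s, a) = -log(1 - D(s, a)):
# expert-like pairs should earn a higher reward than agent-like pairs.
reward_expert_like = -np.log(1 - sigmoid(expert @ w + b)).mean()
reward_policy_like = -np.log(1 - sigmoid(policy @ w + b)).mean()
print(reward_expert_like, reward_policy_like)
```

In full GAIL the policy is then updated by an RL algorithm (e.g., TRPO) on this surrogate reward, and the two steps alternate; the sketch only shows the discriminator side of the game.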
Unfortunately, ST-GAIL methods suffer from low sample efficiency and require long training to learn a good policy (Zuo et al., 2020). In comparison, related works (Kostrikov et al., 2019; Zuo et al., 2020) imply that deterministic policies are capable of enhancing sample efficiency when training GAIL variants. Kostrikov et al. (2019) proposed the discriminator-actor-critic (DAC) algorithm, which derives the reward function from the GAIL discriminator and trains the policy with the twin delayed deep deterministic policy gradient (TD3) (Fujimoto et al., 2018). It reduces policy-environment interaction sample complexity by an average factor of 10. Deterministic generative adversarial imitation learning (DGAIL) (Zuo

