INSTABILITY IN GENERATIVE ADVERSARIAL IMITATION LEARNING WITH DETERMINISTIC POLICY

Abstract

Deterministic policies are widely used in generative adversarial imitation learning (GAIL). When adopting such policies, some GAIL variants modify the reward function to avoid training instability, yet the mechanism behind this instability remains largely unknown. In this paper, we theoretically capture the instability as exploding gradients in the policy update. Our novelties lie in: 1) by approximating the deterministic policy with a multivariate Gaussian policy of vanishing covariance, we establish and prove a probabilistic lower bound for exploding gradients, which describes the instability universally, and we show that a stochastic policy never suffers from this pathology; 2) we also prove that the modified reward function of adversarial inverse reinforcement learning (AIRL) relieves exploding gradients. Experiments support our analysis.

1. INTRODUCTION

Imitation learning (IL) trains a policy directly from expert demonstrations without reward signals (Ng et al., 2000; Syed & Schapire, 2007; Ho & Ermon, 2016). It has been broadly studied under the twin umbrellas of behavioral cloning (BC) (Pomerleau, 1991) and inverse reinforcement learning (IRL) (Ziebart et al., 2008). Generative adversarial imitation learning (GAIL) (Ho & Ermon, 2016), built on trust region policy optimization (TRPO) (Schulman et al., 2015) for policy training, brings the idea of generative adversarial networks (GANs) (Goodfellow et al., 2014) into maximum entropy IRL. The discriminator in GAIL aims to distinguish whether a state-action pair comes from the expert demonstration or is generated by the agent; meanwhile, the learned policy generates interaction data to confuse the discriminator. GAIL is promising for many real-world scenarios where designing reward functions for learning optimal control policies requires significant effort. It has made remarkable achievements in physical-world tasks, e.g., robot manipulation (Jabri, 2021), mobile robot navigation (Tai et al., 2018), commodity search (Shi et al., 2019), endovascular catheterization (Chi et al., 2020), etc. Policy learning in GAIL can be effectively accomplished by reinforcement learning (RL) methods (Sutton & Barto, 2018; Puterman, 2014), which divide into stochastic policy algorithms and deterministic policy algorithms; we denote GAIL with these two classes of algorithms as ST-GAIL and DE-GAIL, respectively. For ST-GAIL, one can refer to proximal policy optimization (PPO)-GAIL (Chen et al., 2020), the natural policy gradient (NPG)-GAIL (Guan et al., 2021) and the two-stage stochastic gradient (TSSG) (Zhou et al., 2022). These algorithms have shown that GAIL can ensure global convergence in high-dimensional environments, in contrast to traditional IRL methods (Ng et al., 2000; Ziebart et al., 2008; Boularias et al., 2011).
Unfortunately, ST-GAIL methods have low sample efficiency and take a long time to train the learned policy well (Zuo et al., 2020). In comparison, related works (Kostrikov et al., 2019; Zuo et al., 2020) imply that deterministic policies are capable of enhancing sample efficiency when training GAIL variants. Kostrikov et al. (2019) proposed the discriminator-actor-critic (DAC) algorithm, which defines the reward function with the GAIL discriminator for a policy trained by the twin delayed deep deterministic policy gradient (TD3) (Fujimoto et al., 2018); it reduces policy-environment interaction sample complexity by an average factor of 10. Deterministic generative adversarial imitation learning (DGAIL) (Zuo et al., 2020) utilizes a modified deep deterministic policy gradient (DDPG) (Lillicrap et al., 2015) to update the policy, and achieves a faster learning speed than ST-GAIL. These DE-GAIL methods not only effectively improve sample efficiency, but also mitigate the instability caused by the reward function -log(1 - D(s, a)), referred to as the positive logarithmic reward function (PLR). This classical logarithmic reward function is one of the primary shapes and is frequently used in GAIL. For PLR, Kostrikov et al. (2019) discussed the reward bias via a simple toy demo; Zuo et al. (2020) exhibited its instability through contrast experiments, and improved the stability of DE-GAIL by introducing the idea of learning from demonstrations (LfD) into the generator. However, the instability caused by PLR remains largely unexplained. We investigate the performance of DDPG-GAIL and TD3-GAIL, i.e., GAIL with TRPO replaced by the RL algorithms DDPG and TD3 while keeping PLR unchanged, with 10^6 expert demonstrations, as displayed in Fig. 1. Extreme instability emerges under 11 random seeds: the learning under some seeds reaches expert levels (valid), while under others the policy hardly learns anything (invalid). These two corner cases cannot be well captured by the bias analysis introduced through the toy demo in Kostrikov et al. (2019). In this paper, we additionally incorporate a different deterministic policy algorithm, the softmax deep double deterministic policy gradients (SD3) (Pan et al., 2020), into GAIL with PLR.
SD3-GAIL exhibits the same instability as DDPG-GAIL and TD3-GAIL in our experiments, implying that the instability is not a special case in DE-GAIL. Moreover, the instability of DE-GAIL mainly stems from the invalid cases that appear in the experiments. We then prove that exploding phenomena exist in the absolute gradients of the policy loss, describing the invalidity theoretically and universally. Further, we conclude that the discriminator will possibly degenerate to 0 or 1. Meanwhile, we give a probabilistic lower bound for exploding gradients with respect to the mismatching between the learned policy and the expert demonstration, i.e., a significant discrepancy between the state-action distribution of the learned policy and that of the expert. Finally, we disclose that the outlier interval under the modified reward function of adversarial inverse reinforcement learning (AIRL) (Fu et al., 2018) is smaller than that under PLR in GAIL; this modified reward function thus shows superior stability. Our contributions can be summarized as follows:
• We establish and prove a probabilistic exploding-gradients theorem that theoretically describes the instability in DE-GAIL. In contrast, our analysis shows that stochastic policies can avoid exploding gradients.
• Comparing with the consistent trend of ST-GAIL, we point out that the instability is caused by deterministic policies rather than by GANs.
• The reward function in AIRL is shown to reduce the probability of exploding gradients.

2. RELATED WORK

In large and high-dimensional environments, Ho & Ermon (2016) proposed GAIL, built on TRPO (Schulman et al., 2015). It achieves significant performance in imitating complex expert policies (Ghasemipour et al., 2019; Xu et al., 2020; Ke et al., 2020; Chen et al., 2021). To accelerate the GAIL learning process, a natural idea is to use deterministic policy gradients. The sample-efficient adversarial mimic (SAM) (Blondé & Kalousis, 2019) method integrates DDPG (Lillicrap et al., 2015) into GAIL, adding a penalty on the gradient of the discriminator. Zuo et al. (2020) proposed deterministic generative adversarial imitation learning (DGAIL), which combines a modified DDPG with LfD to train the generator under the guidance of the discriminator; the reward function in DGAIL is set as D(s, a). TD3 (Fujimoto et al., 2018) and off-policy training of the discriminator are adopted in DAC (Kostrikov et al., 2019) to reduce policy-environment interaction sample complexity by an average factor of 10; the revised reward function in DAC is log(D(s, a)) - log(1 - D(s, a)). Notably, these works achieve gratifying results benefiting from modifications to reward functions. When GAIL is implemented with PLR directly, Kostrikov et al. (2019) pointed out the reward bias via a special toy demo, and DGAIL exhibited the instability merely through experimental results. In contrast, we explain this phenomenon through a universal theory.

3. PRELIMINARY

In this section, we introduce the Markov decision process, reproducing kernel Hilbert spaces, and the generative adversarial imitation learning setup.

3.1. MARKOV DECISION PROCESS

A discounted Markov decision process (MDP) is characterized by a 5-tuple $(\mathcal{S}, \mathcal{A}, r, p_M, \gamma)$ in the standard RL setting. $\mathcal{S}$ and $\mathcal{A}$ denote the finite state space and action space, respectively. $r(s,a) : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function for performing action $a \in \mathcal{A}$ in state $s \in \mathcal{S}$. $p_M(s'|s,a) : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ denotes the transition distribution and $\gamma$ is the discount factor. A policy $\pi(a|s)$ specifies an action distribution conditioned on state $s$. The objective of RL is to maximize the expected reward-to-go $\eta(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0, a_0\right]$. Induced by a policy $\pi$, we define the discounted stationary state distribution as $d^\pi(s) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s; \pi)$. Here $\Pr(s_t = s; \pi)$ denotes the probability of reaching state $s$ at time $t$, which is given by $\int \prod_{u=0}^{t-1} p_M(s_{u+1} \mid s_u, a_u)\, \pi(a_u \mid s_u) \Pr(s_0)\, ds\, da$, where $ds = ds_0 \ldots ds_{t-1}$ and $da = da_0 \ldots da_{t-1}$ imply integration over the previous states and actions. Similarly, the discounted stationary state-action distribution is defined as $\rho^\pi(s,a) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s, a_t = a; \pi)$, which measures the overall "frequency" of visiting a state-action pair under the policy $\pi$. The relationship between $\rho^\pi(s,a)$ and $d^\pi(s)$ is
$$\rho^\pi(s,a) = \pi(a|s)\, d^\pi(s). \tag{1}$$
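The discounted stationary state distribution can be checked numerically. The sketch below (a 3-state transition kernel and initial distribution that are assumed toy values, not from the paper) compares the truncated series definition of $d^\pi$ with the equivalent linear-system solution $d = (1-\gamma)\mu_0 + \gamma P_\pi^\top d$.

```python
import numpy as np

# Toy 3-state chain: evaluate d_pi(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s; pi)
# for a fixed policy. P_pi[s, s'] is the state-to-state kernel under pi and
# mu0 is the initial state distribution; both are assumed toy values.
gamma = 0.9
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.2, 0.2, 0.6]])
mu0 = np.array([1.0, 0.0, 0.0])

# Truncated series evaluation of d_pi (gamma^2000 is negligible).
d = np.zeros(3)
p_t = mu0.copy()
for t in range(2000):
    d += (1 - gamma) * gamma**t * p_t
    p_t = p_t @ P_pi          # one-step state distribution update

# d_pi also solves the linear system d = (1 - gamma) mu0 + gamma P_pi^T d.
d_closed = np.linalg.solve(np.eye(3) - gamma * P_pi.T, (1 - gamma) * mu0)

print(d, d_closed)  # the two evaluations agree, and d sums to 1
```

The series and the fixed-point solution coincide because $d^\pi$ is a geometric mixture of the state distributions at each time step.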

3.2. REPRODUCING KERNEL HILBERT SPACE

A vector-valued reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ is a Hilbert space of functions $h : \mathcal{S} \to \mathbb{R}^p$ such that, for any $c \in \mathbb{R}^p$ and $x \in \mathcal{S}$, the linear functional mapping $h \in \mathcal{H}$ to $(c, h(x))$ is continuous (Micchelli & Pontil, 2005). For all $x, y \in \mathcal{S}$, $\kappa(x, y)$ is a symmetric, positive definite matrix-valued kernel, so that $(\kappa_x c)(y) = \kappa(x,y)c \in \mathcal{H}$. It has the reproducing property $\langle h, \kappa_x c \rangle_{\mathcal{H}} = h(x)^\top c$. Here $(\cdot,\cdot)$ and $\langle \cdot,\cdot \rangle_{\mathcal{H}}$ denote the inner product in $\mathbb{R}^p$ and in $\mathcal{H}$, respectively. We denote $\mathcal{H} = \mathcal{H}_\kappa$.
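The reproducing property is concrete for kernel expansions of the form used later for the policy mean, $h(\cdot) = \sum_i \kappa(s_i, \cdot) a_i$. A minimal scalar sketch ($p = 1$; the Gaussian kernel and its bandwidth are assumed choices, not from the paper):

```python
import numpy as np

# Scalar Gaussian kernel; bandwidth is an assumed toy choice.
def kappa(x, y, bw=0.7):
    return np.exp(-(x - y) ** 2 / (2 * bw ** 2))

# h(.) = sum_i c_i kappa(x_i, .) is an element of the RKHS H_kappa.
xs = np.array([-1.0, 0.3, 2.0])
cs = np.array([0.5, -1.2, 0.8])

def h(x):
    return sum(c * kappa(xi, x) for c, xi in zip(cs, xs))

# Reproducing property: <h, kappa_x>_H = sum_i c_i <kappa_{x_i}, kappa_x>_H
#                                      = sum_i c_i kappa(x_i, x) = h(x).
x = 0.9
inner = sum(c * kappa(xi, x) for c, xi in zip(cs, xs))
print(inner, h(x))  # identical: evaluating h at x IS the inner product with kappa_x
```

The point is that function evaluation in $\mathcal{H}_\kappa$ reduces to kernel sums, which is what makes the RKHS parameterization of $h$ tractable.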

3.3. GENERATIVE ADVERSARIAL IMITATION LEARNING

GAIL combines IRL with GANs, treating an RL method as the generator. GAIL exploits the discriminator $D(s,a)$ to measure the difference between the state-action distributions induced by the learned policy $\pi$ and the expert policy $\pi_E$, thereby providing the reward for the agent. Moreover, the policy and discriminator can be approximated in an RKHS (Ormoneit & Sen, 2002). The optimization problem in GAIL is
$$\min_\pi \max_{D \in (0,1)^{\mathcal{S}\times\mathcal{A}}} \mathbb{E}_{(s,a)\sim\rho^{\pi_E}}[\log(D(s,a))] + \mathbb{E}_{(s,a)\sim\rho^\pi}[\log(1-D(s,a))], \tag{2}$$
where the policy $\pi$ mimics the expert policy via the reward function $r(s,a) = -\log(1-D(s,a))$. The discriminator attains its optimum at
$$D^*(s,a) = \frac{\rho^{\pi_E}(s,a)}{\rho^{\pi_E}(s,a) + \rho^\pi(s,a)}, \tag{3}$$
in which case the optimization objective of the learned policy is formalized as minimizing the state-action distribution discrepancy between the imitated policy and the expert policy under the Jensen-Shannon (JS) divergence:
$$\min_\pi D_{\mathrm{JS}}(\rho^\pi(s,a), \rho^{\pi_E}(s,a)) := \frac{1}{2} D_{\mathrm{KL}}\!\left(\rho^\pi, \frac{\rho^\pi + \rho^{\pi_E}}{2}\right) + \frac{1}{2} D_{\mathrm{KL}}\!\left(\rho^{\pi_E}, \frac{\rho^\pi + \rho^{\pi_E}}{2}\right). \tag{4}$$
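The reduction from the saddle-point problem to the JS objective can be verified numerically: plugging the optimal discriminator into the GAIL value gives $2\, D_{\mathrm{JS}}(\rho^{\pi_E}, \rho^\pi) - 2\log 2$. The two discrete occupancy measures below are assumed toy values.

```python
import numpy as np

# Assumed toy state-action occupancy measures over 3 pairs.
p = np.array([0.5, 0.3, 0.2])   # expert occupancy rho_{pi_E}
q = np.array([0.2, 0.2, 0.6])   # learner occupancy rho_pi

# Optimal discriminator D*(x) = p(x) / (p(x) + q(x)).
D_star = p / (p + q)

# GAIL/GAN value at D*: E_p[log D*] + E_q[log(1 - D*)].
value = np.sum(p * np.log(D_star)) + np.sum(q * np.log(1 - D_star))

# Jensen-Shannon divergence between p and q.
m = (p + q) / 2
kl = lambda a, b: np.sum(a * np.log(a / b))
js = 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(value, 2 * js - 2 * np.log(2))  # equal up to float error
```

So once the discriminator is optimal, the only way the policy can increase its reward is to move $\rho^\pi$ toward $\rho^{\pi_E}$ in JS divergence, which is exactly Eq. (4).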

4. THE PRINCIPLE OF INSTABILITY IN GAIL WITH DETERMINISTIC POLICIES

Inspired by the instability of DE-GAIL in MuJoCo environments (Subsect. 4.1), we theoretically impute the invalidity to an exploding phenomenon in the absolute gradients of the policy loss in Subsect. 4.2. Additionally, the probabilistic lower bound for exploding gradients is provided. In contrast, ST-GAIL does not suffer from such pathology. Finally, we show that the reward function in AIRL relieves the exploding gradients in Subsect. 4.3. Fig. 2 shows the architecture of our analysis.

Figure 2: An illustration of our proposed theorem of exploding gradients in DE-GAIL. Due to exploding gradients, the invalid case in experiments corresponds to diverging from the optimal discriminator or degenerating to 0 or 1.

4.1. ILLNESS REWARD EMPIRICAL EVALUATIONS IN MUJOCO ENVIRONMENTS

We replicate the experimental setup of Zhou et al. (2022). Expert trajectories are created by a SAC agent in Hopper-v2, HalfCheetah-v2 and Walker2d-v2, respectively. The expert demonstration data comprise 10^6 samples obtained with a 0.01 standard deviation. The mean return of the demonstration data in each environment is 3433, 9890 and 3509, respectively. When training GAIL, we use two-layer networks to approximate the kernel function (see Arora et al. (2019)). The reward function is set as r(s, a) = -log(1 - D(s, a)) (PLR). First, the well-performed TSSG is revisited as the baseline for subsequent comparison. Then we explore DDPG and TD3, which also cause GAIL to fail, as in Kostrikov et al. (2019). A brief description of the training procedure is laid out in Appendix A.1 (Alg. 1). The best evaluation results of DDPG-GAIL and TD3-GAIL are shown in Fig. 3. Compared to TSSG, we observe that DDPG-GAIL and TD3-GAIL not only obtain a lower mean return but also suffer from markedly higher variance. We suspect this is owing to the inaccurate Q-value estimates of DDPG and TD3 (Pan et al., 2020).

Figure 4: The performances of the four algorithms under 11 different seeds in Walker2d-v2.

We next examine experimentally whether better value estimation improves GAIL. SD3 (Pan et al., 2020), a deterministic policy algorithm with a smaller absolute bias between true values and value estimates than TD3, is incorporated into the framework of GAIL, as shown in Alg. 1. Its best performances are presented in Fig. 3. We summarize the results in two aspects:
• SD3-GAIL improves the average return compared to DDPG-GAIL and TD3-GAIL, and even achieves a higher upper limit than TSSG. This is possibly attributed to the accuracy of Q-value estimation in SD3.
• Despite the enhanced average return of SD3-GAIL compared to DDPG-GAIL and TD3-GAIL, its variance remains high.
To observe the high variance of SD3-GAIL, DDPG-GAIL and TD3-GAIL more closely, we plot the performance of the four algorithms under 11 random seeds in Walker2d-v2 in Fig. 4. The three deterministic policy algorithms reveal extreme instability under the 11 random seeds. Specifically:
- The performances under some seeds reach expert levels, such as SD3-GAIL, TD3-GAIL and DDPG-GAIL with seed 0.
- Some seeds learn almost nothing, such as SD3-GAIL, TD3-GAIL and DDPG-GAIL with seed 7.
- SD3-GAIL with seed 2 and seed 9 learns only part of the expert policy.
Since TSSG maintains a similar trend under different random seeds, the instability is not caused by GANs themselves as depicted by Arjovsky & Bottou (2017). These empirical evaluations manifest that DE-GAIL algorithms suffer from pathological training dynamics. We attempt to explain this phenomenon with the reward-bias argument introduced by Kostrikov et al. (2019). In their specific toy demo, repeated trajectories may be generated by a stochastic learned policy, which brings about a higher return than the expert under PLR r(s, a) = -log(1 - D(s, a)), the strictly positive reward function. This is known as the reward bias. Conversely, if the learned policy is deterministic, the trajectories contain no repetition, so the return of the deterministic learned policy technically cannot exceed the expert return; in other words, the reward bias vanishes. This cannot explain why DE-GAIL is inferior to ST-GAIL in our experiments. An illustrative example of the invalid case is shown in Appendix A.2. In the sequel, we explain such instability from the perspective of invalidity in DE-GAIL.

4.2. EXPLODING GRADIENTS IN DE-GAIL

Inspired by Lever & Stafford (2015) and Paternain et al. (2020), who employ a multivariate Gaussian policy to approximate a deterministic policy, we define the learned policy $\pi_h$ as
$$\pi_h(a|s) = \frac{1}{\sqrt{\det(2\pi\Sigma)}} \exp\left(-\frac{(a-h(s))^\top \Sigma^{-1} (a-h(s))}{2}\right),$$
parameterized by a deterministic function $h \in \mathcal{H}$, $h : \mathcal{S} \to \mathcal{A}$, and a covariance matrix $\Sigma$. The function $h(\cdot)$ is an element of an RKHS $\mathcal{H}_\kappa$, $h(\cdot) = \sum_i \kappa(s_i, \cdot) a_i \in \mathcal{H}_\kappa$, where $s_i \in \mathcal{S}$ and $a_i \in \mathcal{A}$. Note that $\pi_h(a|s)$ can be regarded as an approximation of the Dirac impulse as the covariance approaches zero, i.e.,
$$\lim_{\Sigma \to 0} \pi_h(a|s) = \delta(a - h(s)). \tag{5}$$
Eq. (5) means that when the covariance $\Sigma \to 0$, the stochastic policy $\pi_h(a|s)$ approaches the deterministic policy $h(s)$. Replacing $\pi$ with $\pi_h$ in Eq. (2), Eq. (3) and Eq. (4) respectively, the optimization problem of GAIL under $\pi_h$ is $\min_{\pi_h} \max_D \mathbb{E}_{(s,a)\sim\rho^{\pi_E}}[\log(D(s,a))] + \mathbb{E}_{(s,a)\sim\rho^{\pi_h}}[\log(1-D(s,a))]$, the optimal discriminator is
$$D^*(s,a) = \frac{\rho^{\pi_E}(s,a)}{\rho^{\pi_E}(s,a) + \rho^{\pi_h}(s,a)}, \tag{6}$$
and the policy optimization objective is $\min_{\pi_h} D_{\mathrm{JS}}(\rho^{\pi_h}(s,a), \rho^{\pi_E}(s,a))$. Before proceeding with our main result, we need a crucial definition.

Definition 1 (Mismatched and Matched State-action Pair) The state-action pair $(s_t, h(s_t))$ induced by the learned policy mismatches the expert demonstration $(s_t, a_t)$ if $\|h(s_t) - a_t\|_2 \geq C\|\Sigma\|_2$ for any $C > 0$. Otherwise, $(s_t, h(s_t))$ matches the expert. We utilize the event $\Xi = \{(s_t, h(s_t)) : \|h(s_t) - a_t\|_2 \geq C\|\Sigma\|_2 \text{ for any } C > 0\}$ to characterize the mismatching. A descriptive example of mismatching and matching is shown in Appendix A.3. Now we present the following theorems on the probability of exploding gradients in DE-GAIL.

Theorem 1 Let $\pi_h(\cdot|s)$ be the Gaussian stochastic policy with mean $h(s)$ and covariance $\Sigma$. When the discriminator is set to be the optimal $D^*(s,a)$ in Eq. (6), the gradient estimator of the policy loss with respect to the policy's parameter $h$ satisfies $\|\widehat{\nabla}_h D_{\mathrm{JS}}(\rho^{\pi_h}, \rho^{\pi_E})\|_2 \to \infty$ with a probability of $\Pr(\|\Sigma^{-1}(a_t - h(s_t))\|_2 \geq C \text{ for any } C > 0)$ as $\Sigma \to 0$, where
$$\widehat{\nabla}_h D_{\mathrm{JS}}(\rho^{\pi_h}, \rho^{\pi_E}) = \frac{d^{\pi_h}(s_t)\nabla_h \pi_h(a_t|s_t)}{2 d^{\pi_E}(s_t)\pi_E(a_t|s_t)} \log \frac{2 d^{\pi_h}(s_t)\pi_h(a_t|s_t)}{d^{\pi_h}(s_t)\pi_h(a_t|s_t) + d^{\pi_E}(s_t)\pi_E(a_t|s_t)},$$
and $\nabla_h \pi_h(a|s) = \pi_h(a|s)\kappa(s,\cdot)\Sigma^{-1}(a - h(s))$. Proof. See Appendix A.4. □

Remark 1 When $\Sigma \to 0$, in other words, when the policy is deterministic, we have $\Pr(\|\Sigma^{-1}(a_t - h(s_t))\|_2 \geq C \text{ for any } C > 0) \geq \Pr(\|a_t - h(s_t)\|_2 \geq C\|\Sigma\|_2 \text{ for any } C > 0) = \Pr(\Xi)$. The probability of mismatching $\Pr(\Xi)$ is nontrivial since $\Sigma \to 0$.

Theorem 1 implies that when the discriminator is set to be optimal, DE-GAIL suffers from exploding gradients with the probabilistic lower bound $\Pr(\Xi)$. In contrast, for a Gaussian stochastic policy (fixed $\Sigma$), $\|\widehat{\nabla}_h D_{\mathrm{JS}}(\rho^{\pi_h}, \rho^{\pi_E})\|_2$ is bounded, following the proof strategy of Theorem 1. Thus, when the discriminator is set to be optimal, a Gaussian stochastic policy in GAIL will not suffer from exploding gradients. Analogous conclusions can be drawn for non-Gaussian stochastic policies. Theorem 1 reveals that the policy loss possibly suffers from exploding gradients when the discriminator is optimal; subsequently, we present a more universal result for a regular discriminator, defined as
$$\widetilde{D}(s_t, a_t) = \frac{(1+\epsilon)\rho^{\pi_E}(s_t, a_t)}{(1+\epsilon)\rho^{\pi_E}(s_t, a_t) + (1-\epsilon)\rho^{\pi_h}(s_t, a_t)}, \tag{7}$$
where $\epsilon \in (-1, 1)$. Note that:
• $\widetilde{D}(s_t, a_t)$ is monotonically increasing from 0 to 1 as $\epsilon$ ranges over $(-1, 1)$.
• "Regular" means that $\widetilde{D}(s_t, a_t)$ ranges in $(0, 1)$, stemming from Eq. (2).
• $\widetilde{D}(s_t, a_t)$ reaches the optimum when $\epsilon = 0$.
We next state the exploding-gradients result for $\widetilde{D}(s_t, a_t)$.

Theorem 2 (Main Result) Let $\pi_h(\cdot|s)$ be the Gaussian stochastic policy with mean $h(s)$ and covariance $\Sigma$. When the discriminator is set to be the regular $\widetilde{D}(s,a)$ in Eq. (7), i.e., $\widetilde{D}(s,a) \in (0,1)$, the gradient estimator of the policy loss with respect to the policy's parameter $h$ satisfies
$$\left\|\widehat{\nabla}_h\left(\mathbb{E}_{(s,a)\sim\mathcal{D}_E}[\log(\widetilde{D}(s,a))] + \mathbb{E}_{(s,a)\sim\mathcal{D}_I}[\log(1-\widetilde{D}(s,a))]\right)\right\|_2 \to \infty$$
with a probability of $\Pr(\|\Sigma^{-1}(a_t - h(s_t))\|_2 \geq C \text{ for any } C > 0)$ as $\Sigma \to 0$, where $\mathcal{D}_E$ and $\mathcal{D}_I$ denote the expert demonstration and the replay buffer of $\pi_h$ respectively,
$$\begin{aligned} &\widehat{\nabla}_h\left(\mathbb{E}_{(s,a)\sim\mathcal{D}_E}[\log(\widetilde{D}(s,a))] + \mathbb{E}_{(s,a)\sim\mathcal{D}_I}[\log(1-\widetilde{D}(s,a))]\right) \\ &= \frac{d^{\pi_h}(s_t)\nabla_h \pi_h(a_t|s_t)}{d^{\pi_E}(s_t)\pi_E(a_t|s_t)} \log \frac{(1-\epsilon) d^{\pi_h}(s_t)\pi_h(a_t|s_t)}{(1+\epsilon) d^{\pi_E}(s_t)\pi_E(a_t|s_t) + (1-\epsilon) d^{\pi_h}(s_t)\pi_h(a_t|s_t)} + \frac{2\epsilon\, d^{\pi_h}(s_t)\nabla_h \pi_h(a_t|s_t)}{(1+\epsilon)\rho^{\pi_E}(s_t, a_t) + (1-\epsilon)\rho^{\pi_h}(s_t, a_t)}, \end{aligned}$$
and $\nabla_h \pi_h(a|s) = \pi_h(a|s)\kappa(s,\cdot)\Sigma^{-1}(a - h(s))$. Proof. See Appendix A.5. □

Analogous to Theorem 1, Theorem 2 implies:
• When the discriminator ranges in $(0,1)$, DE-GAIL is also at risk of exploding gradients.
• When the policy loss suffers from exploding gradients over many runs, the discriminator in DE-GAIL degenerates to 0 or 1.
In contrast, ST-GAIL does not suffer from exploding gradients when the discriminator ranges in $(0,1)$. Experimental results in Fig. 5 and Fig. 6 support our theoretical analysis. Seeds 5, 7 and 10 exhibit exploding gradients (left plot in Fig. 6) and degenerating discriminator behavior (first row in Fig. 5), consistent with the invalid cases ($r \to 0$) in Fig. 4(a). Note that the gradients of TSSG and of the valid cases in SD3-GAIL are of the same order of magnitude, far smaller than those of the invalid cases in SD3-GAIL. We further examine this reduction of instability from a theoretical standpoint, i.e., whether CR-DE-GAIL relieves the exploding-gradient probability relative to PLR-DE-GAIL. Theorem 1 points out that a mismatched state-action pair results in exploding gradients.
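The divergence event in the theorems, $\|\Sigma^{-1}(a_t - h(s_t))\|_2 \geq C$, and the Dirac limit of Eq. (5) can both be illustrated numerically. Below is a minimal sketch with an isotropic covariance $\Sigma = \sigma^2 I$; the values of $h(s)$ and the mismatched action are assumed toy choices.

```python
import numpy as np

# As sigma -> 0, the Gaussian policy density concentrates at h(s) (Dirac limit,
# Eq. (5)), the density at a fixed mismatched action vanishes, and the score
# factor Sigma^{-1}(a - h(s)) entering grad_h pi_h blows up.
def gaussian_density(a, mean, sigma):
    d = len(a)
    diff = a - mean
    return np.exp(-diff @ diff / (2 * sigma ** 2)) / np.sqrt((2 * np.pi * sigma ** 2) ** d)

h_s = np.array([0.0, 0.0])      # deterministic action h(s) (assumed toy value)
a_mis = np.array([0.5, -0.5])   # a fixed expert action that h(s) mismatches

dens_mean, dens_mis, score = [], [], []
for sigma in [1.0, 0.1, 0.01]:
    dens_mean.append(gaussian_density(h_s, h_s, sigma))       # grows without bound
    dens_mis.append(gaussian_density(a_mis, h_s, sigma))      # shrinks to 0
    score.append(np.linalg.norm((a_mis - h_s) / sigma ** 2))  # grows without bound

print(dens_mean, dens_mis, score)
```

For any fixed threshold $C$, the score norm eventually exceeds $C$ as $\sigma \to 0$, which is exactly the event whose probability lower-bounds the exploding gradients.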
What truly pertains to the sequel is the behavior of the discriminator, owing to its unified range in the interval $[0,1]$. We now characterize the mismatching from the viewpoint of the discriminator.

Proposition 1 When the discriminator is set to be the optimal $D^*(s,a)$ in Eq. (6), we have $D^*(s_t, a_t) \approx 1 \Leftrightarrow h(s_t)$ mismatches $a_t$. Proof. See Appendix A.6. □

Proposition 1 indicates that exploding gradients depend on the distance between the discriminator's value and 1, or equivalently on the magnitude of $r_i(s_t, a_t)$ for $i = 1, 2$ as the reward approaches infinity. By the monotonicity of both $r_1(s_t, a_t)$ and $r_2(s_t, a_t)$, we obtain
$$r_1(s_t, a_t) \approx \infty \quad \text{and} \quad r_2(s_t, a_t) \approx \infty \tag{8}$$
as $D^*(s_t, a_t) \approx 1$. Naturally, to prevent exploding gradients, we impose the constraints $r_i(s_t, a_t) \leq C$, $i = 1, 2$, for some appropriate constant $C$. Conversely, the outliers of the discriminator are characterized by $r_i(s_t, a_t) > C$ for $i = 1, 2$, defined as follows.

Definition 2 When the discriminator is set to be the optimal $D^*(s,a)$ in Eq. (6), the outliers of the discriminator are defined as the interval $[\alpha, 1]$ such that $r_1(s_t, a_t) \geq C$. Similarly, under the same upper bound $C$, the outliers of the discriminator are defined as $[\beta, 1]$ for $r_2(s_t, a_t)$.

The training process suffers from exploding gradients when the discriminator comes to rest in $[\alpha, 1]$. The remission of exploding gradients in CR-DE-GAIL is presented in the following proposition.

Proposition 2 When the discriminator is set to be the optimal $D^*(s,a)$ in Eq. (6), we have $\beta \geq \alpha$. Proof. See Appendix A.7. □

Proposition 2 reveals that the discriminator in CR-DE-GAIL exhibits a smaller interval of outliers than that in PLR-DE-GAIL, decreasing the probability of gradient explosion. Our conclusion is consistent with the claim in Kostrikov et al. (2019) (Fig. 5).
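The outlier thresholds of Definition 2 follow in closed form from the two reward shapes: $r_1(D) = -\log(1-D) \geq C$ gives $\alpha = 1 - e^{-C}$, and $r_2(D) = \log(D) - \log(1-D) \geq C$ gives $\beta = e^C/(1+e^C)$. The sketch below checks $\beta \geq \alpha$ numerically (the threshold values $C$ are arbitrary choices).

```python
import numpy as np

# PLR: r1(D) = -log(1 - D) >= C        <=>  D in [alpha, 1], alpha = 1 - e^{-C}
# CR:  r2(D) = log D - log(1 - D) >= C <=>  D in [beta, 1],  beta = e^C / (1 + e^C)
for C in [0.5, 2.0, 5.0]:
    alpha = 1 - np.exp(-C)
    beta = np.exp(C) / (1 + np.exp(C))
    print(C, alpha, beta, beta >= alpha)  # Proposition 2: beta >= alpha
```

Algebraically, with $u = e^{-C} \in (0,1)$, $\beta - \alpha = \frac{1}{1+u} - (1-u) = \frac{u^2}{1+u} > 0$, so the CR outlier interval is strictly contained in the PLR one for every $C > 0$.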

5. CONCLUSION

In this paper, we have explored the principle of instability in DE-GAIL. We first experimentally show the extreme instability of DE-GAIL algorithms compared to ST-GAIL. Subsequently, we prove that the gradient of the deterministic policy loss with respect to the policy suffers from explosion with some probability, thereby leading to training failure. In comparison, we present the compatibility between stochastic policies and GAIL. Finally, the modified reward function is shown to remedy the exploding gradients. Reducing the probability of exploding gradients remains under consideration: by introducing the idea of reward clipping into SD3-GAIL, we have discovered some interesting phenomena; for a preliminary verification, please refer to Appendix A.8. Specifically, clipping the rewards that lead to discriminator outliers improves the stability of DE-GAIL. Further work that improves sample efficiency while keeping the exploding-gradient probability low is left for the future.

A.2 THE INVALIDITY IN DE-GAIL

The instability of DE-GAIL is mainly imputed to the invalid case, i.e., the reward function approaching zero. To illustrate this invalidity, consider a specific MDP $\mathcal{M}$: $\mathcal{S}$ contains $(g+1)$ states $s_0, s_1, \ldots, s_g$, where $s_g$ is the terminal state. The action $a_{i \to j} \in \mathcal{A}$ is such that $s_j \sim p(\cdot|s_i, a_{i \to j})$ with $|i - j| \leq g/4$ for $i \neq j$. The expert demonstration is shown in Fig. 7. For each state $s_t$ ($0 \leq t < 3g/4$), the action $a_{t \to t+1}$ occurs 3 times while the others occur at most twice in the expert demonstration. Note that
$$D^*(s,a) = \frac{\rho^{\pi_E}(s,a)}{\rho^{\pi_E}(s,a) + \rho^\pi(s,a)} = \frac{1}{1 + \frac{\rho^\pi(s,a)}{\rho^{\pi_E}(s,a)}},$$
so $D^*(s,a)$ is monotonically increasing with respect to $\rho^{\pi_E}(s,a)$. Thus, a more frequent expert state-action pair attains a higher discriminator value, and thereby a higher PLR. Meanwhile, policy training in RL aims to maximize the expected reward-to-go. Therefore, the most frequent action $a_{t \to t+1}$ is the best choice for a deterministic learned policy in each $s_t$. The detailed trajectory is shown in Fig. 8.

Figure 8: The trajectory of the deterministic learned policy, from $s_0$ to the terminal state $s_g$ sequentially.

We now derive the rewards of the state-action pairs under the deterministic policy.

Proposition 3 In the MDP $\mathcal{M}$, when the discriminator is set to be optimal, the PLR of $(s_t, a_{t \to t+1})$ satisfies $r(s_t, a_{t \to t+1}) = -\log(1 - D(s_t, a_{t \to t+1})) \to 0$ as $|\mathcal{S}| \to \infty$, where
$$D(s_t, a_{t \to t+1}) = \begin{cases} \dfrac{48g}{11g^2 + 72g - 16}, & 0 \leq t < \dfrac{3g}{4}, \\[4pt] \dfrac{16g}{11g^2 + 40g - 16}, & t = \dfrac{3g}{4}, \\[4pt] 0, & \text{otherwise.} \end{cases}$$

Proof. The length of the expert trajectory is
$$N = 2\left(\frac{g}{4} + \left(\frac{g}{4}+1\right) + \cdots + \left(\frac{g}{4} + \frac{g}{4} - 1\right)\right) + 2\left(\frac{g}{2}+1\right)\frac{g}{2} - 1 + \frac{3g}{4} = \frac{11g^2 + 24g - 16}{16}.$$
Here the first term $2(g/4 + (g/4+1) + \cdots + (g/4 + g/4 - 1))$ comes from executing the actions $a_{i \to j}, a_{j \to i}$ in states $s_0, s_1, \ldots, s_{g/4-1}$; the second term $2(g/2+1)g/2 - 1$ is derived by executing the actions $a_{i \to j}, a_{j \to i}$ in states $s_{g/4}, s_{g/4+1}, \ldots, s_{3g/4}$; and the last term $3g/4$ is obtained by executing the actions $a_{i \to i+1}$. Denote the expert policy and the deterministic policy as $\pi_E$ and $h$, respectively. For $(s_t, a_{t \to t+1})$, $0 \leq t < 3g/4$, in the trajectory of the deterministic policy, the optimal discriminator is (Kostrikov et al., 2019):
$$D(s_t, a_{t \to t+1}) = \frac{\rho^{\pi_E}(s_t, a_{t \to t+1})}{\rho^{\pi_E}(s_t, a_{t \to t+1}) + \rho^{h}(s_t, a_{t \to t+1})} = \frac{\frac{3}{N}}{\frac{3}{N} + \frac{1}{g}} = \frac{48g}{11g^2 + 72g - 16}.$$
For $t = 3g/4$, we have
$$D(s_{3g/4}, a_{3g/4 \to 3g/4+1}) = \frac{16g}{11g^2 + 40g - 16}.$$
For any $t > 3g/4$, the value $D(s_t, a_{t \to t+1})$ is zero since the pair never appears in the expert demonstrations. To sum up, when the discriminator achieves its optimum, $r(s_t, a_{t \to t+1}) = -\log(1 - D(s_t, a_{t \to t+1})) \to 0$ as $|\mathcal{S}| \to \infty$.

□
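The counting argument in the proof above is easy to verify programmatically. The sketch below checks the closed forms for the trajectory length $N$ and the discriminator value at a small $g$ ($g = 8$ is an assumed example; $g$ must be divisible by 4).

```python
from fractions import Fraction

# Expert trajectory length in the toy MDP M, term by term as in the proof.
def trajectory_length(g):
    first = 2 * sum(g // 4 + k for k in range(g // 4))  # back-and-forth in s_0..s_{g/4-1}
    second = 2 * (g // 2 + 1) * (g // 2) - 1            # back-and-forth in s_{g/4}..s_{3g/4}
    third = 3 * g // 4                                  # forward actions a_{t -> t+1}
    return first + second + third

g = 8
N = trajectory_length(g)
assert N == (11 * g * g + 24 * g - 16) // 16            # closed form for N

# Optimal discriminator at (s_t, a_{t->t+1}), 0 <= t < 3g/4:
# rho_E = 3/N (the action occurs 3 times in the expert data), rho_h = 1/g.
D = Fraction(3, N) / (Fraction(3, N) + Fraction(1, g))
assert D == Fraction(48 * g, 11 * g * g + 72 * g - 16)  # closed form for D
print(g, N, D)
```

As $g$ (and hence $|\mathcal{S}|$) grows, $D = O(1/g) \to 0$, so the PLR $-\log(1-D) \to 0$, which is the invalid case.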

Remark 2 Note that under the deterministic policy, the PLR of the best action approaches 0. As a result, we have $r \to 0$ for all state-action pairs. In contrast, for a stochastic policy reaching the expert level, the lengths of the expert trajectory and of the stochastic policy's trajectory have the same order $O(|\mathcal{S}|^2)$, where $|\mathcal{S}|$ denotes the size of $\mathcal{S}$. Following the proof strategy, the PLR then does not decline to zero. Proposition 3 thus illustrates the invalid case, implying the instability of DE-GAIL.

A.3 A DESCRIPTIVE EXAMPLE OF MISMATCHING AND MATCHING

The probability density contour map of the expert action demonstration in state $s_t$ is shown in Fig. 9.

Figure 9: The mismatched case (left) and the matched case (right) in a descriptive example with a two-dimensional action space. Without loss of generality, the threshold of matching is set as 0.035. The trajectories used to train the learned policies are shown as red curves.

Matching means that $h(s_t)$ lies in the neighborhood of $a_t$, which has a high probability density in the expert demonstration.

A.4 PROOF OF THEOREM 1

Theorem 1 Let $\pi_h(\cdot|s)$ be the Gaussian stochastic policy with mean $h(s)$ and covariance $\Sigma$. When the discriminator is set to be the optimal $D^*(s,a)$ in Eq. (6), the gradient estimator of the policy loss with respect to the policy's parameter $h$ satisfies $\|\widehat{\nabla}_h D_{\mathrm{JS}}(\rho^{\pi_h}, \rho^{\pi_E})\|_2 \to \infty$ with a probability of $\Pr(\|\Sigma^{-1}(a_t - h(s_t))\|_2 \geq C \text{ for any } C > 0)$ as $\Sigma \to 0$, where
$$\widehat{\nabla}_h D_{\mathrm{JS}}(\rho^{\pi_h}, \rho^{\pi_E}) = \frac{d^{\pi_h}(s_t)\nabla_h \pi_h(a_t|s_t)}{2 d^{\pi_E}(s_t)\pi_E(a_t|s_t)} \log \frac{2 d^{\pi_h}(s_t)\pi_h(a_t|s_t)}{d^{\pi_h}(s_t)\pi_h(a_t|s_t) + d^{\pi_E}(s_t)\pi_E(a_t|s_t)},$$
and $\nabla_h \pi_h(a|s) = \pi_h(a|s)\kappa(s,\cdot)\Sigma^{-1}(a - h(s))$.

Proof. Through importance sampling, which transfers the learned state-action distribution to the expert demonstration distribution, the JS divergence can be rewritten from the definition in Eq. (4) as
$$\begin{aligned} D_{\mathrm{JS}}(\rho^{\pi_h}, \rho^{\pi_E}) &= \frac{1}{2} D_{\mathrm{KL}}\left(\rho^{\pi_h}, \frac{\rho^{\pi_h} + \rho^{\pi_E}}{2}\right) + \frac{1}{2} D_{\mathrm{KL}}\left(\rho^{\pi_E}, \frac{\rho^{\pi_h} + \rho^{\pi_E}}{2}\right) \\ &= \frac{1}{2}\mathbb{E}_{(s,a)\sim\mathcal{D}_I}\left[\log \frac{2\rho^{\pi_h}(s,a)}{\rho^{\pi_h}(s,a) + \rho^{\pi_E}(s,a)}\right] + \frac{1}{2}\mathbb{E}_{(s,a)\sim\mathcal{D}_E}\left[\log \frac{2\rho^{\pi_E}(s,a)}{\rho^{\pi_h}(s,a) + \rho^{\pi_E}(s,a)}\right] \\ &= \frac{1}{2}\mathbb{E}_{(s,a)\sim\mathcal{D}_E}\left[\frac{\rho^{\pi_h}(s,a)}{\rho^{\pi_E}(s,a)} \log \frac{2\rho^{\pi_h}(s,a)}{\rho^{\pi_h}(s,a) + \rho^{\pi_E}(s,a)} + \log \frac{2\rho^{\pi_E}(s,a)}{\rho^{\pi_h}(s,a) + \rho^{\pi_E}(s,a)}\right], \end{aligned} \tag{9}$$
where $\mathcal{D}_E$ and $\mathcal{D}_I$ denote the expert demonstration and the replay buffer of $\pi_h$ respectively. Then we can approximate the gradient of Eq. (9) with respect to $h$ by
$$\begin{aligned} \widehat{\nabla}_h D_{\mathrm{JS}}(\rho^{\pi_h}, \rho^{\pi_E}) &\overset{(i)}{=} \frac{1}{2}\nabla_h\left[\frac{\rho^{\pi_h}(s_t,a_t)}{\rho^{\pi_E}(s_t,a_t)}\log\frac{2\rho^{\pi_h}(s_t,a_t)}{\rho^{\pi_h}(s_t,a_t)+\rho^{\pi_E}(s_t,a_t)} + \log\frac{2\rho^{\pi_E}(s_t,a_t)}{\rho^{\pi_h}(s_t,a_t)+\rho^{\pi_E}(s_t,a_t)}\right] \\ &\overset{(ii)}{=} \frac{1}{2}\Bigg[\frac{d^{\pi_h}(s_t)\nabla_h\pi_h(a_t|s_t)}{d^{\pi_E}(s_t)\pi_E(a_t|s_t)}\log\frac{2 d^{\pi_h}(s_t)\pi_h(a_t|s_t)}{d^{\pi_h}(s_t)\pi_h(a_t|s_t)+d^{\pi_E}(s_t)\pi_E(a_t|s_t)} \\ &\qquad + \frac{\rho^{\pi_h}(s_t,a_t)}{\rho^{\pi_E}(s_t,a_t)}\cdot\frac{\rho^{\pi_h}(s_t,a_t)+\rho^{\pi_E}(s_t,a_t)}{2\rho^{\pi_h}(s_t,a_t)}\cdot\frac{2 d^{\pi_h}(s_t)\nabla_h\pi_h(a_t|s_t)\left(\rho^{\pi_h}(s_t,a_t)+\rho^{\pi_E}(s_t,a_t)\right) - 2\rho^{\pi_h}(s_t,a_t)\, d^{\pi_h}(s_t)\nabla_h\pi_h(a_t|s_t)}{\left(\rho^{\pi_h}(s_t,a_t)+\rho^{\pi_E}(s_t,a_t)\right)^2} \\ &\qquad - \frac{\rho^{\pi_h}(s_t,a_t)+\rho^{\pi_E}(s_t,a_t)}{2\rho^{\pi_E}(s_t,a_t)}\cdot\frac{2\rho^{\pi_E}(s_t,a_t)\, d^{\pi_h}(s_t)\nabla_h\pi_h(a_t|s_t)}{\left(\rho^{\pi_h}(s_t,a_t)+\rho^{\pi_E}(s_t,a_t)\right)^2}\Bigg] \\ &\overset{(iii)}{=} \frac{1}{2}\left[\frac{d^{\pi_h}(s_t)\nabla_h\pi_h(a_t|s_t)}{d^{\pi_E}(s_t)\pi_E(a_t|s_t)}\log\frac{2 d^{\pi_h}(s_t)\pi_h(a_t|s_t)}{d^{\pi_h}(s_t)\pi_h(a_t|s_t)+d^{\pi_E}(s_t)\pi_E(a_t|s_t)} + \frac{d^{\pi_h}(s_t)\nabla_h\pi_h(a_t|s_t)}{\rho^{\pi_h}(s_t,a_t)+\rho^{\pi_E}(s_t,a_t)} - \frac{d^{\pi_h}(s_t)\nabla_h\pi_h(a_t|s_t)}{\rho^{\pi_h}(s_t,a_t)+\rho^{\pi_E}(s_t,a_t)}\right] \\ &\overset{(iv)}{=} \frac{d^{\pi_h}(s_t)\nabla_h\pi_h(a_t|s_t)}{2 d^{\pi_E}(s_t)\pi_E(a_t|s_t)}\log\frac{2 d^{\pi_h}(s_t)\pi_h(a_t|s_t)}{d^{\pi_h}(s_t)\pi_h(a_t|s_t)+d^{\pi_E}(s_t)\pi_E(a_t|s_t)}, \end{aligned} \tag{10}$$
where (ii) comes from Eq. (1). By the fact that $\nabla_h \pi_h(a|s) = \pi_h(a|s)\nabla_h \log\pi_h(a|s) = \pi_h(a|s)\kappa(s,\cdot)\Sigma^{-1}(a-h(s))$, Eq. (10) yields
$$\|\widehat{\nabla}_h D_{\mathrm{JS}}(\rho^{\pi_h}, \rho^{\pi_E})\|_2 = \left\|\frac{d^{\pi_h}(s_t)\pi_h(a_t|s_t)\kappa(s_t,\cdot)\Sigma^{-1}(a_t - h(s_t))}{2 d^{\pi_E}(s_t)\pi_E(a_t|s_t)}\log\frac{2 d^{\pi_h}(s_t)\pi_h(a_t|s_t)}{d^{\pi_h}(s_t)\pi_h(a_t|s_t)+d^{\pi_E}(s_t)\pi_E(a_t|s_t)}\right\|_2. \tag{11}$$
It follows that $\|\widehat{\nabla}_h D_{\mathrm{JS}}(\rho^{\pi_h}, \rho^{\pi_E})\|_2 \to \infty$ with a probability of $\Pr(\|\Sigma^{-1}(a_t - h(s_t))\|_2 \geq C \text{ for any } C > 0)$ as $\Sigma \to 0$. □
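As a sanity check on the proof, the gradient-norm expression derived above can be evaluated numerically in one dimension, where the state distributions cancel and the kernel factor is taken as 1. The action $a_t = h(s) + 0.5\,\sigma$ is mismatched in the sense of Definition 1 for small $\sigma$ (since $0.5\,\sigma \geq C\sigma^2$ eventually); all numeric values are assumed toy choices.

```python
import numpy as np

def normal_pdf(x, mean, std):
    return np.exp(-(x - mean) ** 2 / (2 * std ** 2)) / np.sqrt(2 * np.pi * std ** 2)

h_s = 0.0      # deterministic policy output h(s) (assumed)
std_E = 0.3    # fixed expert policy spread (assumed)

def grad_magnitude(sigma):
    # Evaluate the 1-D analogue of the gradient-norm expression:
    # pi_h(a) * |Sigma^{-1}(a - h(s))| / (2 pi_E(a)) * |log(2 pi_h / (pi_h + pi_E))|.
    a_t = h_s + 0.5 * sigma
    pi_h = normal_pdf(a_t, h_s, sigma)
    pi_E = normal_pdf(a_t, h_s, std_E)
    score = abs(a_t - h_s) / sigma ** 2
    log_term = abs(np.log(2 * pi_h / (pi_h + pi_E)))
    return pi_h * score / (2 * pi_E) * log_term

for sigma in [0.5, 0.1, 0.02]:
    print(sigma, grad_magnitude(sigma))  # grows without bound as sigma -> 0
```

Here $\pi_h(a_t) \sim 1/\sigma$ and the score factor $\sim 1/\sigma$, while the log term stays bounded, so the product diverges like $1/\sigma^2$, matching the theorem's conclusion for mismatched pairs.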

A.5 PROOF OF THEOREM 2

Theorem 2 (Main Result) Let $\pi_h(\cdot|s)$ be the Gaussian stochastic policy with mean $h(s)$ and covariance $\Sigma$. When the discriminator is set to be the regular $\widetilde{D}(s,a)$ in Eq. (7), i.e., $\widetilde{D}(s,a) \in (0,1)$, the gradient estimator of the policy loss with respect to the policy's parameter $h$ satisfies
$$\left\|\widehat{\nabla}_h\left(\mathbb{E}_{(s,a)\sim\mathcal{D}_E}[\log(\widetilde{D}(s,a))] + \mathbb{E}_{(s,a)\sim\mathcal{D}_I}[\log(1-\widetilde{D}(s,a))]\right)\right\|_2 \to \infty$$
with a probability of $\Pr(\|\Sigma^{-1}(a_t - h(s_t))\|_2 \geq C \text{ for any } C > 0)$ as $\Sigma \to 0$, where $\mathcal{D}_E$ and $\mathcal{D}_I$ denote the expert demonstration and the replay buffer of $\pi_h$ respectively,
$$\begin{aligned} &\widehat{\nabla}_h\left(\mathbb{E}_{(s,a)\sim\mathcal{D}_E}[\log(\widetilde{D}(s,a))] + \mathbb{E}_{(s,a)\sim\mathcal{D}_I}[\log(1-\widetilde{D}(s,a))]\right) \\ &= \frac{d^{\pi_h}(s_t)\nabla_h\pi_h(a_t|s_t)}{d^{\pi_E}(s_t)\pi_E(a_t|s_t)}\log\frac{(1-\epsilon) d^{\pi_h}(s_t)\pi_h(a_t|s_t)}{(1+\epsilon) d^{\pi_E}(s_t)\pi_E(a_t|s_t) + (1-\epsilon) d^{\pi_h}(s_t)\pi_h(a_t|s_t)} + \frac{2\epsilon\, d^{\pi_h}(s_t)\nabla_h\pi_h(a_t|s_t)}{(1+\epsilon)\rho^{\pi_E}(s_t,a_t) + (1-\epsilon)\rho^{\pi_h}(s_t,a_t)}, \end{aligned}$$
and $\nabla_h\pi_h(a|s) = \pi_h(a|s)\kappa(s,\cdot)\Sigma^{-1}(a - h(s))$.

Proof. Following the proof strategy of Theorem 1, the learned state-action distribution can be transferred to the expert demonstration distribution by importance sampling. Thus, when the discriminator is the regular $\widetilde{D}(s,a)$, the policy objective from the optimization problem in Eq. (2) can be written as
$$\mathbb{E}_{(s,a)\sim\mathcal{D}_E}\left[\log(\widetilde{D}(s,a)) + \frac{\rho^{\pi_h}(s,a)}{\rho^{\pi_E}(s,a)}\log(1-\widetilde{D}(s,a))\right]. \tag{12}$$
The gradient of Eq. (12) can then be approximated by the expression displayed in the statement of the theorem, which we denote by Eq. (13). Plugging Eq. (11) into Eq. (13), when $\|\Sigma^{-1}(a_t - h(s_t))\|_2 \geq C$ for any $C > 0$, the gradient norm diverges as $\Sigma \to 0$. □

A.6 PROOF OF PROPOSITION 1

Proposition 1 When the discriminator is set to be the optimal $D^*(s,a)$ in Eq. (6), we have $D^*(s_t, a_t) \approx 1 \Leftrightarrow h(s_t)$ mismatches $a_t$.

Proof. The optimal discriminator of $(s_t, a_t)$ is
$$D^*(s_t, a_t) = \frac{\rho^{\pi_E}(s_t, a_t)}{\rho^{\pi_E}(s_t, a_t) + \rho^{\pi_h}(s_t, a_t)}.$$
The necessary and sufficient condition of $D^*(s_t, a_t) \approx 1$ is therefore $\rho^{\pi_h}(s_t, a_t) \approx 0$, i.e., $(s_t, h(s_t))$ mismatches $(s_t, a_t)$. □



discussed the reward bias in a simple toy demo; Zuo et al. (2020) exhibited its instability through comparative experiments and improved the stability of DE-GAIL by introducing the idea of learning from demonstrations (LfD) into the generator.

Figure 1: Learning curves of DDPG-GAIL and TD3-GAIL with different random seeds in Walker2d-v2.

Figure 3: Comparison of SD3-GAIL, TD3-GAIL, DDPG-GAIL and TSSG in three different environments. Solid lines and dashed lines correspond to the average performance of the four algorithms and expert demonstrations, respectively.

Figure 5: The discriminators of SD3-GAIL and TSSG in Walker2d-v2 under 11 seeds.

Figure 6: The absolute gradients of SD3-GAIL and TSSG policy networks in Walker2d-v2.

RELIEVING EXPLODING GRADIENTS WITH REWARD MODIFICATION

Kostrikov et al. (2019) introduced the reward function $r_2(s_t,a_t)=\log(D(s_t,a_t))-\log(1-D(s_t,a_t))$ of AIRL into GAIL, and illustrated a reduction in instability through comparative experiments against GAIL with the PLR $r_1(s_t,a_t)=-\log(1-D(s_t,a_t))$. The reward function of AIRL is defined as the combination reward function (CR) in Wang & Li (2021). For convenience, DE-GAIL with PLR and CR are called PLR-DE-GAIL and CR-DE-GAIL, respectively.
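The two reward shapes can be compared directly. The sketch below simply tabulates $r_1$ and $r_2$ at a few discriminator outputs, making no claims beyond the formulas themselves:

```python
import math

def r_plr(d):
    # PLR: r1(s,a) = -log(1 - D(s,a)), nonnegative for all D in (0,1)
    return -math.log(1.0 - d)

def r_cr(d):
    # CR (AIRL): r2(s,a) = log D(s,a) - log(1 - D(s,a)), antisymmetric about D = 0.5
    return math.log(d) - math.log(1.0 - d)

for d in (0.1, 0.5, 0.9):
    print(f"D={d}: r1={r_plr(d):+.3f}, r2={r_cr(d):+.3f}")
```

Unlike $r_1$, which is bounded below by 0, $r_2$ actively penalizes pairs the discriminator rejects ($D<0.5$), one intuition for why CR-DE-GAIL behaves differently from PLR-DE-GAIL.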

Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, Anind K Dey, et al. Maximum entropy inverse reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 8, pp. 1433–1438, 2008.

Guoyu Zuo, Kexin Chen, Jiahao Lu, and Xiangsheng Huang. Deterministic generative adversarial imitation learning. Neurocomputing, 388:60–69, 2020.

A APPENDIX

A.1 DE-GAIL ALGORITHM

Algorithm 1 GAIL with deterministic policy algorithms
1: Input: expert demonstrations; initialize the learned policy $\pi_\theta$ and the discriminator network $D$; choose a deterministic policy RL algorithm, such as SD3, TD3, or DDPG.
2: for iteration 0, 1, 2, $\cdots$ do
3: Update $D$ by maximizing $\mathbb{E}_{(s,a)\sim\rho_{\pi_E}}[\log D(s,a)]+\mathbb{E}_{(s,a)\sim\rho_{\pi}}[\log(1-D(s,a))]$.

4: Update $\pi_\theta$ with the chosen deterministic policy algorithm under the reward
$$
r(s,a)=-\log\bigl(1-D(s,a)\bigr).
$$
5: end for
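Algorithm 1's alternating loop can be sketched end-to-end on a toy problem. Everything below (the 1-D "state-action" samples, the logistic discriminator, the learning rate) is an illustrative assumption, not the paper's experimental setup: the loop performs gradient ascent on the discriminator objective of step 3, and the last line forms the PLR signal that a policy step would consume.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D "state-action" features: expert centered at +1, agent at -1.
expert = rng.normal(1.0, 0.3, size=(256, 1))
agent = rng.normal(-1.0, 0.3, size=(256, 1))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
w, b = 0.0, 0.0  # logistic discriminator D(x) = sigmoid(w*x + b)

for _ in range(200):  # step 3: maximize E_E[log D] + E_I[log(1 - D)]
    dE, dI = sigmoid(w * expert + b), sigmoid(w * agent + b)
    grad_w = np.mean((1 - dE) * expert) - np.mean(dI * agent)
    grad_b = np.mean(1 - dE) - np.mean(dI)
    w, b = w + 0.5 * grad_w, b + 0.5 * grad_b

# PLR signal r(s,a) = -log(1 - D(s,a)) that step 4 would hand to the policy
reward = -np.log(1.0 - sigmoid(w * agent + b))
print("D(expert-like):", float(sigmoid(w * 1.0 + b)))
print("D(agent-like): ", float(sigmoid(w * (-1.0) + b)))
print("mean PLR reward on agent batch:", float(reward.mean()))
```

In the full algorithm this reward would be fed to SD3/TD3/DDPG for the policy update; here the agent batch is held fixed to isolate the discriminator step.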

Figure 7: Expert demonstration. Each state transfers to the neighboring g/4 states and back to itself from left to right. Left: The schematic of the expert demonstration, where the red arrow is the last action taken in each state. Right: The pseudocode of the expert demonstration.

When the discriminator is the regular $\tilde D(s,a)$ of Eq. (7), importance sampling transfers the replay-buffer expectation onto the expert distribution, so the policy objective reads

$$
\begin{aligned}
\mathbb{E}_{(s,a)\sim\mathcal{D}_E}\bigl[\log\tilde D(s,a)\bigr]+\mathbb{E}_{(s,a)\sim\mathcal{D}_I}\bigl[\log\bigl(1-\tilde D(s,a)\bigr)\bigr]
&=\mathbb{E}_{(s,a)\sim\mathcal{D}_E}\!\left[\log\frac{\rho_{\pi_E}(s,a)(1+\epsilon)}{\rho_{\pi_E}(s,a)(1+\epsilon)+\rho_{\pi_h}(s,a)(1-\epsilon)}\right]+\mathbb{E}_{(s,a)\sim\mathcal{D}_I}\!\left[\log\frac{\rho_{\pi_h}(s,a)(1-\epsilon)}{\rho_{\pi_E}(s,a)(1+\epsilon)+\rho_{\pi_h}(s,a)(1-\epsilon)}\right]\\
&=\mathbb{E}_{(s,a)\sim\mathcal{D}_E}\!\left[\log\frac{\rho_{\pi_E}(s,a)(1+\epsilon)}{\rho_{\pi_E}(s,a)(1+\epsilon)+\rho_{\pi_h}(s,a)(1-\epsilon)}+\frac{\rho_{\pi_h}(s,a)}{\rho_{\pi_E}(s,a)}\log\frac{\rho_{\pi_h}(s,a)(1-\epsilon)}{\rho_{\pi_E}(s,a)(1+\epsilon)+\rho_{\pi_h}(s,a)(1-\epsilon)}\right].
\end{aligned}
\tag{12}
$$


The gradient of Eq. (12) with respect to $h$, evaluated at a sample $(s_t,a_t)$, is

$$
\begin{aligned}
&\nabla_h\Bigl(\mathbb{E}_{(s,a)\sim\mathcal{D}_E}\bigl[\log\tilde D(s,a)\bigr]+\mathbb{E}_{(s,a)\sim\mathcal{D}_I}\bigl[\log\bigl(1-\tilde D(s,a)\bigr)\bigr]\Bigr)\\
&=\nabla_h\!\left[\log\frac{\rho_{\pi_E}(s_t,a_t)(1+\epsilon)}{\rho_{\pi_E}(s_t,a_t)(1+\epsilon)+\rho_{\pi_h}(s_t,a_t)(1-\epsilon)}+\frac{\rho_{\pi_h}(s_t,a_t)}{\rho_{\pi_E}(s_t,a_t)}\log\frac{\rho_{\pi_h}(s_t,a_t)(1-\epsilon)}{\rho_{\pi_E}(s_t,a_t)(1+\epsilon)+\rho_{\pi_h}(s_t,a_t)(1-\epsilon)}\right]\\
&=-\frac{\rho_{\pi_E}(s_t,a_t)(1+\epsilon)+\rho_{\pi_h}(s_t,a_t)(1-\epsilon)}{\rho_{\pi_E}(s_t,a_t)(1+\epsilon)}\cdot\frac{\rho_{\pi_E}(s_t,a_t)\,d_{\pi_h}(s_t)\nabla_h\pi_h(a_t|s_t)(1+\epsilon)(1-\epsilon)}{\bigl(\rho_{\pi_E}(s_t,a_t)(1+\epsilon)+\rho_{\pi_h}(s_t,a_t)(1-\epsilon)\bigr)^2}\\
&\quad+\frac{d_{\pi_h}(s_t)\nabla_h\pi_h(a_t|s_t)}{d_{\pi_E}(s_t)\pi_E(a_t|s_t)}\log\frac{(1-\epsilon)d_{\pi_h}(s_t)\pi_h(a_t|s_t)}{(1+\epsilon)d_{\pi_E}(s_t)\pi_E(a_t|s_t)+(1-\epsilon)d_{\pi_h}(s_t)\pi_h(a_t|s_t)}\\
&\quad+\frac{\rho_{\pi_h}(s_t,a_t)}{\rho_{\pi_E}(s_t,a_t)}\cdot\frac{\rho_{\pi_E}(s_t,a_t)(1+\epsilon)+\rho_{\pi_h}(s_t,a_t)(1-\epsilon)}{\rho_{\pi_h}(s_t,a_t)(1-\epsilon)}\cdot\frac{(1-\epsilon)\,\rho_{\pi_E}(s_t,a_t)(1+\epsilon)\,d_{\pi_h}(s_t)\nabla_h\pi_h(a_t|s_t)}{\bigl(\rho_{\pi_E}(s_t,a_t)(1+\epsilon)+\rho_{\pi_h}(s_t,a_t)(1-\epsilon)\bigr)^2}\\
&=\frac{d_{\pi_h}(s_t)\nabla_h\pi_h(a_t|s_t)}{d_{\pi_E}(s_t)\pi_E(a_t|s_t)}\log\frac{(1-\epsilon)d_{\pi_h}(s_t)\pi_h(a_t|s_t)}{(1+\epsilon)d_{\pi_E}(s_t)\pi_E(a_t|s_t)+(1-\epsilon)d_{\pi_h}(s_t)\pi_h(a_t|s_t)}-\frac{(1-\epsilon)\,d_{\pi_h}(s_t)\nabla_h\pi_h(a_t|s_t)}{\rho_{\pi_E}(s_t,a_t)(1+\epsilon)+\rho_{\pi_h}(s_t,a_t)(1-\epsilon)}+\frac{(1+\epsilon)\,d_{\pi_h}(s_t)\nabla_h\pi_h(a_t|s_t)}{\rho_{\pi_E}(s_t,a_t)(1+\epsilon)+\rho_{\pi_h}(s_t,a_t)(1-\epsilon)}\\
&=\frac{d_{\pi_h}(s_t)\nabla_h\pi_h(a_t|s_t)}{d_{\pi_E}(s_t)\pi_E(a_t|s_t)}\log\frac{(1-\epsilon)d_{\pi_h}(s_t)\pi_h(a_t|s_t)}{(1+\epsilon)d_{\pi_E}(s_t)\pi_E(a_t|s_t)+(1-\epsilon)d_{\pi_h}(s_t)\pi_h(a_t|s_t)}+\frac{2\epsilon\,d_{\pi_h}(s_t)\nabla_h\pi_h(a_t|s_t)}{\rho_{\pi_E}(s_t,a_t)(1+\epsilon)+\rho_{\pi_h}(s_t,a_t)(1-\epsilon)}.
\end{aligned}
\tag{13}
$$
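The algebra of Eq. (13) can be sanity-checked pointwise by treating the densities as scalars: fix the expert mass $E$, let the learned mass $H(\theta)$ depend on a parameter, and compare the closed form against a central finite difference. $H(\theta)=e^{\theta}$ and the constants below are arbitrary choices for the check, not quantities from the paper:

```python
import math

eps, E = 0.1, 0.7               # clipping level and (fixed) expert mass
H = lambda th: math.exp(th)     # hypothetical learned mass; note H'(theta) = H(theta)
Z = lambda th: E * (1 + eps) + H(th) * (1 - eps)

def f(th):
    # pointwise scalar analogue of the objective in Eq. (12)
    return math.log(E * (1 + eps) / Z(th)) + (H(th) / E) * math.log(H(th) * (1 - eps) / Z(th))

def grad(th):
    # closed form from the last line of Eq. (13), with H'(theta) = H(theta)
    dH = H(th)
    return (dH / E) * math.log(H(th) * (1 - eps) / Z(th)) + 2 * eps * dH / Z(th)

th, d = 0.3, 1e-6
fd = (f(th + d) - f(th - d)) / (2 * d)  # central finite difference
print(fd, grad(th))
```

Close agreement here only validates the scalar calculus, not the occupancy-measure bookkeeping of the full proof.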

A.7 PROOF OF PROPOSITION 2

Proposition 2 When the discriminator is set to be the optimal $D^*(s,a)$ in Eq. (6), we have $\beta\ge\alpha$.

Clipping the reward shows its superiority in the stability of DE-GAIL, but at the expense of lower sample efficiency.

Figure 10: Comparison of SD3-GAIL and SD3-GAIL with clipped reward in three different environments.
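The clipped-reward variant of SD3-GAIL compared in Figure 10 is not spelled out in this excerpt; a minimal sketch of one natural construction (the cap r_max = 10.0 is a hypothetical choice) caps the PLR signal so it stays finite as $D\to 1$:

```python
import numpy as np

def plr_reward(d):
    # PLR signal r(s,a) = -log(1 - D(s,a)); diverges as D -> 1
    return -np.log(1.0 - d)

def clipped_reward(d, r_max=10.0):
    # Hypothetical clipping: cap the PLR reward at r_max
    return np.clip(plr_reward(d), 0.0, r_max)

d = np.array([0.5, 0.999, 0.999999])
print(plr_reward(d))      # grows without bound as D -> 1
print(clipped_reward(d))  # capped at r_max
```

The cap removes the blow-up but also flattens the reward exactly where the discriminator is most confident, consistent with trading stability for sample efficiency.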

