TRULY DETERMINISTIC POLICY OPTIMIZATION

Abstract

In this paper, we present a policy gradient method that avoids exploratory noise injection and performs policy search over the deterministic landscape. By avoiding noise injection, all sources of estimation variance can be eliminated in systems with deterministic dynamics (up to the initial state distribution). Since deterministic policy regularization is impossible using traditional non-metric measures such as the KL divergence, we derive a Wasserstein-based quadratic model for our purposes. We state conditions on the system model under which it is possible to establish a monotonic policy improvement guarantee, propose a surrogate function for policy gradient estimation, and show that it is possible to compute exact advantage estimates if both the state transition model and the policy are deterministic. Finally, we describe two novel robotic control environments, one with non-local rewards in the frequency domain and the other with a long horizon (8000 time-steps), for which our policy gradient method (TDPO) significantly outperforms existing methods (PPO, TRPO, DDPG, and TD3).



Policy Gradient (PG) methods can be broadly characterized by three defining elements: the policy gradient estimator, the regularization measures, and the exploration profile. For gradient estimation, episodic (Williams, 1992), importance-sampling-based (Schulman et al., 2015a), and deterministic (Silver et al., 2014) gradients are some of the most common estimation oracles. As for regularization measures, either a Euclidean distance within the parameter space (Williams, 1992; Silver et al., 2014; Lillicrap et al., 2015) or dimensionally consistent non-metric measures (Schulman et al., 2015a; Kakade & Langford, 2002; Schulman et al., 2017; Kakade, 2002; Wu et al., 2017) have frequently been adopted. Common exploration profiles include Gaussian noise (Schulman et al., 2015a) and stochastic processes (Lillicrap et al., 2015). These elements form the basis of many model-free, stochastic policy optimization methods capable of learning high-dimensional policy parameters. Both stochastic and deterministic policy search can be useful in applications. A stochastic policy has the effect of smoothing or filtering the policy landscape, which is desirable for optimization. Searching through stochastic policies has enabled the effective control of challenging environments under a general framework (Schulman et al., 2015a; 2017). The same method could either learn robotic movements or play basic games (1) with minimal domain-specific knowledge, (2) regardless of function approximation classes, and (3) with less human intervention (ignoring reward engineering and hyper-parameter tuning) (Duan et al., 2016). Using stochasticity for exploration, although it imposes approximations and variance, has provided a robust way to actively search for higher rewards. Despite many successes, there are practical environments which remain challenging for current policy gradient methods.
For example, non-local rewards (e.g., those defined in the frequency domain), long time horizons, and naturally resonant environments all occur in realistic robotic systems (Kuo & Golnaraghi, 2002; Meirovitch, 1975; Preumont & Seto, 2008) but can present issues for policy gradient search. To tackle challenging environments such as these, this paper considers policy gradient methods based on deterministic policies and deterministic gradient estimates, which could offer advantages by allowing the estimation of global reward gradients on long horizons without the need to inject noise into the system for exploration. To facilitate a dimensionally consistent and low-variance deterministic policy search, a compatible policy gradient estimator and a metric measure for regularization should be employed. For gradient estimation we focus on Vine estimators (Schulman et al., 2015a), which can be easily applied to deterministic policies. As a metric measure we use the Wasserstein distance, which can measure meaningful distances between deterministic policy functions that have non-overlapping supports (in contrast to the Kullback-Leibler (KL) divergence and the Total Variation (TV) distance). The Wasserstein metric has seen substantial recent application in a variety of machine-learning domains, such as the stable learning of generative adversarial models (Arjovsky et al., 2017). Theoretically, this metric has been studied in the context of Lipschitz-continuous Markov decision processes in reinforcement learning (Hinderer, 2005; Ferns et al., 2012). Pirotta et al. (2015) defined a policy gradient method using the Wasserstein distance by relying on Lipschitz continuity assumptions with respect to the policy gradient itself. Furthermore, for Lipschitz-continuous Markov decision processes, Asadi et al. (2018) and Rachelson & Lagoudakis (2010) used the Wasserstein distance to derive model-based value-iteration and policy-iteration methods, respectively.
On a more practical note, Pacchiano et al. (2019) utilized Wasserstein regularization for behavior-guided stochastic policy optimization. Moreover, Abdullah et al. (2019) proposed another robust stochastic policy gradient formulation. Estimating the Wasserstein distance for general distributions is more complicated than estimating typical KL divergences (Villani, 2008); this difficulty underscores the contributions of Abdullah et al. (2019) and Pacchiano et al. (2019). However, for our deterministic observation-conditional policies, closed-form computation of Wasserstein distances is possible without any approximation. Existing deterministic policy gradient methods (e.g., DDPG and TD3) use deterministic policies (Silver et al., 2014; Lillicrap et al., 2015; Fujimoto et al., 2018), meaning that they learn a deterministic policy function from states to actions. However, such methods still use stochastic search (i.e., they add stochastic noise to their deterministic actions to force exploration during policy search). In contrast, we will be interested in a method which not only uses deterministic policies, but also uses deterministic search (i.e., without constant stochastic noise injection). We call this new method Truly Deterministic Policy Optimization (TDPO); as we will show in numerical examples, it can have lower estimation variance and better scalability to long horizons. Scalability to long horizons is one of the most challenging aspects for policy gradient methods that use stochastic search. This issue is sometimes referred to as the curse of horizon in reinforcement learning (Liu et al., 2018). General worst-case analyses suggest that the sample complexity of reinforcement learning is exponential with respect to the horizon length (Kakade et al., 2003; Kearns et al., 2000; 2002). Deriving polynomial lower bounds for the sample complexity of reinforcement learning methods is still an open problem (Jiang & Agarwal, 2018).
Lower-bounding the sample complexity of reinforcement learning for long horizons under different settings and simplifying assumptions has been a topic of theoretical research (Dann & Brunskill, 2015; Wang et al., 2020). Some recent work has examined the scalability of importance sampling gradient estimators to long horizons in terms of both theoretical and practical estimator variances (Liu et al., 2018; Kallus & Uehara, 2019; 2020). All in all, long horizons are challenging for all reinforcement learning methods, especially those suffering from excessive estimation variance due to the use of stochastic policies for exploration, and our truly deterministic method may have advantages in this respect. In this paper we focus on continuous-domain robotic environments with the capability of resetting to previously visited states. The main contributions of this work are: (1) we introduce a Deterministic Vine (DeVine) policy gradient estimator which avoids constant exploratory noise injection; (2) we derive a novel deterministically-compatible surrogate function and provide monotonic payoff improvement guarantees; (3) we show how to use the DeVine policy gradient estimator with the Wasserstein-based surrogate in a practical algorithm (TDPO: Truly Deterministic Policy Optimization); (4) we illustrate the robustness of the TDPO policy search process in robotic control environments with non-local rewards, long horizons, and/or resonant frequencies.

1. BACKGROUND

MDP preliminaries. An infinite-horizon discounted Markov decision process (MDP) is specified by $(S, A, P, R, \mu, \gamma)$, where $S$ is the state space, $A$ is the action space, $P : S \times A \to \Delta(S)$ is the transition dynamics, $R : S \times A \to [0, R_{\max}]$ is the reward function, $\gamma \in [0, 1)$ is the discount factor, and $\mu(s)$ is the initial state distribution of interest (where $\Delta(F)$ denotes the set of all probability distributions over $F$, otherwise known as the credal set of $F$). The transition dynamics $P$ is defined as an operator which produces a distribution over the state space for the next state, $s' \sim P(s, a)$. The transition dynamics can be easily generalized to take distributions of states or actions as input (i.e., by having $P$ defined as $P : \Delta(S) \times A \to \Delta(S)$ or $P : S \times \Delta(A) \to \Delta(S)$). We may abuse the notation and replace $\delta_s$ and $\delta_a$ by $s$ and $a$, where $\delta_s$ and $\delta_a$ are the deterministic distributions concentrated at the state $s$ and action $a$, respectively. A policy $\pi : S \to \Delta(A)$ specifies a distribution over actions for each state, and induces trajectories from a given starting state $s$ as follows: $s_1 = s$, $a_1 \sim \pi(s_1)$, $r_1 = R(s_1, a_1)$, $s_2 \sim P(s_1, a_1)$, $a_2 \sim \pi(s_2)$, etc. We will denote trajectories as state-action tuples $\tau = (s_1, a_1, s_2, a_2, \ldots)$. One can generalize the dynamics (1) by using a policy instead of an action distribution, $\mathcal{P}(\mu_s, \pi) := \mathbb{E}_{s \sim \mu_s}[\mathbb{E}_{a \sim \pi(s)}[P(s, a)]]$, and (2) by introducing the $t$-step transition dynamics recursively as $\mathcal{P}^t(\mu_s, \pi) := \mathcal{P}(\mathcal{P}^{t-1}(\mu_s, \pi), \pi)$ with $\mathcal{P}^0(\mu_s, \pi) := \mu_s$, where $\mu_s$ is a distribution over $S$. The visitation frequency can then be defined as $\rho^\pi_\mu := (1-\gamma) \sum_{t=1}^{\infty} \gamma^{t-1} \mathcal{P}^{t-1}(\mu, \pi)$, and the value function as $V^\pi(s) := \mathbb{E}[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \mid s_1 = s; \pi]$. Similarly, one can define $Q^\pi(s, a)$ by conditioning on the first action. The advantage function can then be defined as their difference (i.e., $A^\pi(s, a) := Q^\pi(s, a) - V^\pi(s)$).
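As a concrete illustration of these definitions, the sketch below approximates $V^\pi(s)$ for a toy deterministic MDP by a truncated discounted rollout. The dynamics, reward, and policy (`step`, `reward`, `policy`) are hypothetical stand-ins invented for illustration, not part of the paper.

```python
import numpy as np

def rollout_value(step, reward, policy, s0, gamma=0.99, horizon=500):
    """Finite-horizon approximation of V^pi(s0) = sum_t gamma^(t-1) r_t
    for deterministic `step`, `reward`, and `policy` callables."""
    s, v = s0, 0.0
    for t in range(horizon):
        a = policy(s)
        v += (gamma ** t) * reward(s, a)
        s = step(s, a)
    return v

# Toy 1-D deterministic MDP: the state decays toward 0 and the reward
# penalizes distance from the origin.
step = lambda s, a: 0.9 * s + 0.1 * a
reward = lambda s, a: -abs(s)
policy = lambda s: -s  # push back toward the origin
v = rollout_value(step, reward, policy, s0=1.0)
```

Since the closed-loop state is $s_t = 0.8^t$, the rollout sums $-\sum_t (0.99 \cdot 0.8)^t \approx -4.81$, and a longer horizon changes it negligibly.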
Generally, one can define the advantage/value of one policy with respect to another using $A^\pi(s, \pi') := \mathbb{E}[Q^\pi(s, a) - V^\pi(s) \mid a \sim \pi'(\cdot|s)]$ and $Q^\pi(s, \pi') := \mathbb{E}[Q^\pi(s, a) \mid a \sim \pi'(\cdot|s)]$. Finally, the payoff of a policy, $\eta_\pi := \mathbb{E}[V^\pi(s); s \sim \mu]$, is the average value over the initial state distribution of the MDP. Probabilistic and mathematical notations. Sometimes we refer to $\int f(x) g(x)\, dx$ integrals as Hilbert-space inner products $\langle f, g \rangle_x$. Assuming that $\zeta$ and $\nu$ are two probabilistic densities, the Kullback-Leibler (KL) divergence is $D_{KL}(\zeta \| \nu) := \langle \zeta(x), \log(\zeta(x)/\nu(x)) \rangle_x$, the Total-Variation (TV) distance is $\mathrm{TV}(\zeta, \nu) := \frac{1}{2} \langle |\zeta(x) - \nu(x)|, 1 \rangle_x$, and the Wasserstein distance is $W(\zeta, \nu) := \inf_{\gamma \in \Gamma(\zeta, \nu)} \langle \|x - y\|, \gamma(x, y) \rangle_{x,y}$, where $\Gamma(\zeta, \nu)$ is the set of couplings for $\zeta$ and $\nu$. We define $\mathrm{Lip}(f(x, y); x) := \sup_x \|\nabla_x f(x, y)\|_2$ and assume the existence of the $\mathrm{Lip}(Q^\pi(s, a); a)$ and $\|\mathrm{Lip}(\nabla_s Q^\pi(s, a); a)\|_2$ constants. Under this notation, the Kantorovich-Rubinstein (KR) duality states that the bound $|\langle \zeta(x) - \nu(x), f(x) \rangle_x| \le W(\zeta, \nu) \cdot \mathrm{Lip}(f; x)$ is tight over all $f$. For brevity, we may abuse the notation and denote $\sup_s W(\pi_1(\cdot|s), \pi_2(\cdot|s))$ by $W(\pi_1, \pi_2)$ (and similarly for other measures). For parameterized policies, we define $\nabla_\pi f(\pi) := \nabla_\theta f(\pi)$ where $\pi$ is parameterized by the vector $\theta$. Table 1 of the appendix summarizes all these mathematical definitions. Policy gradient preliminaries. The advantage decomposition lemma provides insight into the relationship between payoff improvements and advantages (Kakade & Langford, 2002): $\eta_{\pi_2} - \eta_{\pi_1} = \frac{1}{1-\gamma} \cdot \mathbb{E}_{s \sim \rho^{\pi_2}_\mu}[A^{\pi_1}(s, \pi_2)]$. We will denote the current and the candidate next policy as $\pi_1$ and $\pi_2$, respectively. Taking derivatives of both sides with respect to $\pi_2$ at $\pi_1$ yields $\nabla_{\pi_2} \eta_{\pi_2} = \frac{1}{1-\gamma} \big[ \langle \nabla_{\pi_2} \rho^{\pi_2}_\mu(\cdot), A^{\pi_1}(\cdot, \pi_1) \rangle + \langle \rho^{\pi_1}_\mu(\cdot), \nabla_{\pi_2} A^{\pi_1}(\cdot, \pi_2) \rangle \big]$. Since $\pi_1$ does not have any advantage over itself (i.e., $A^{\pi_1}(\cdot, \pi_1) = 0$), the first term is zero.
Thus, the Policy Gradient (PG) theorem is derived as $\nabla_{\pi_2} \eta_{\pi_2} \big|_{\pi_2 = \pi_1} = \frac{1}{1-\gamma} \cdot \mathbb{E}_{s \sim \rho^{\pi_1}_\mu}[\nabla_{\pi_2} A^{\pi_1}(s, \pi_2)] \big|_{\pi_2 = \pi_1}$. For policy iteration with function approximation, we assume $\pi_2$ and $\pi_1$ to be parameterized by $\theta_2$ and $\theta_1$, respectively. One can view the PG theorem as a Taylor expansion of the payoff at $\theta_1$. Conservative Policy Iteration (CPI) (Kakade & Langford, 2002) was one of the early dimensionally consistent methods, with a surrogate of the form $L_{\pi_1}(\pi_2) = \eta_{\pi_1} + \frac{1}{1-\gamma} \cdot \mathbb{E}_{s \sim \rho^{\pi_1}_\mu}[A^{\pi_1}(s, \pi_2)] - \frac{C}{2} \mathrm{TV}^2(\pi_1, \pi_2)$. The $C$ coefficient guarantees non-decreasing payoffs. However, CPI is limited to linear function approximation classes due to the update rule $\pi_{\mathrm{new}} \leftarrow (1-\alpha)\pi_{\mathrm{old}} + \alpha \pi'$. This led to the design of the Trust Region Policy Optimization (TRPO) algorithm (Schulman et al., 2015a). TRPO exchanged the bounded squared TV distance for the KL divergence by lower-bounding the latter using the Pinsker inequality. This made TRPO closer to the Natural Policy Gradient algorithm (Kakade, 2002), and for Gaussian policies the modified terms have similar Taylor expansions within small trust regions. Confined trust regions are a stable way of making large updates while avoiding pessimistic coefficients. For gradient estimates, TRPO used Importance Sampling (IS) with a baseline shift: $\nabla_{\theta_2} \mathbb{E}_{s \sim \rho^{\pi_1}_\mu}[A^{\pi_1}(s, \pi_2)] \big|_{\theta_2 = \theta_1} = \nabla_{\theta_2} \mathbb{E}_{s \sim \rho^{\pi_1}_\mu,\, a \sim \pi_1(\cdot|s)}\big[Q^{\pi_1}(s, a) \tfrac{\pi_2(a|s)}{\pi_1(a|s)}\big] \big|_{\theta_2 = \theta_1}$. While empirical $\mathbb{E}[A^{\pi_1}(s, \pi_2)]$ and $\mathbb{E}[Q^{\pi_1}(s, \pi_2)]$ estimates yield identical variances in principle, this importance sampling estimator imposes larger variances. Later, Proximal Policy Optimization (PPO) (Schulman et al., 2017) proposed utilizing the Generalized Advantage Estimation (GAE) method (Schulman et al., 2015b) for variance reduction and incorporated first-order smoothing such as ADAM (Kingma & Ba, 2014). GAE employed Temporal-Difference (TD) learning (Bhatnagar et al., 2009) for variance reduction.
Although TD-learning was not theoretically guaranteed to converge and could add bias, it improved the estimation accuracy. As an alternative to importance sampling, deterministic policy gradient estimators were also utilized in an actor-critic fashion. Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015) generalized deterministic gradients by employing Approximate Dynamic Programming (ADP) (Mnih et al., 2015) for variance reduction. Twin Delayed Deterministic Policy Gradient (TD3) (Fujimoto et al., 2018) improved DDPG's approximation to build an even better policy optimization method. Although both methods used deterministic policies, they still performed stochastic search by adding stochastic noise to the deterministic policies to force exploration. Other lines of stochastic policy optimization were later proposed. Wu et al. (2017) used a Kronecker-factored approximation for curvatures. Haarnoja et al. (2018) proposed a maximum entropy actor-critic method for stochastic policy optimization.
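The contrast among the measures defined above, for the deterministic (Dirac) policies considered in this paper, can be made concrete. The snippet below uses the closed form of the Wasserstein distance between two point masses; the action vectors are invented for illustration.

```python
import numpy as np

def wasserstein_deterministic(a1, a2):
    """For Dirac distributions delta_{a1} and delta_{a2}, the only
    coupling transports all mass from a1 to a2, so the Wasserstein
    distance has the closed form ||a1 - a2||."""
    return np.linalg.norm(np.asarray(a1) - np.asarray(a2))

# Two deterministic policies disagreeing at some state s:
a1, a2 = np.array([0.5, -0.2]), np.array([0.7, 0.1])
w = wasserstein_deterministic(a1, a2)
# By contrast, KL(delta_{a1} || delta_{a2}) is infinite whenever
# a1 != a2 (non-overlapping supports), and the TV distance saturates
# at its maximum of 1, so neither yields a usable regularizer here.
```

This is exactly the property exploited in the next section: the Wasserstein distance degrades gracefully as the two deterministic policies approach each other.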

2. MONOTONIC POLICY IMPROVEMENT GUARANTEE

We use the Wasserstein metric because it allows the effective measurement of distances between probability distributions or functions with non-overlapping supports, such as deterministic policies, unlike the KL divergence or the TV distance, which are either unbounded or maximal in this case. The smoothness of the physical transition model is what enables the use of the Wasserstein distance to regularize deterministic policies. Therefore, we make the following two assumptions about the transition model: $W(\mathcal{P}(\mu, \pi_1), \mathcal{P}(\mu, \pi_2)) \le L_\pi \cdot W(\pi_1, \pi_2)$ and $W(\mathcal{P}(\mu_1, \pi), \mathcal{P}(\mu_2, \pi)) \le L_\mu \cdot W(\mu_1, \mu_2)$. Also, we make the dynamics stability assumption $\sup_t \sum_{k=1}^{t} L^{(k-1)}_{\mu, \pi_1, \pi_2} \prod_{i=k+1}^{t-1} L^{(i)}_{\mu, \pi_1, \pi_2} < \infty$, with the definitions of the new constants and further discussion of the implications deferred to Section A.5 of the appendix, where we also discuss Assumptions 5 and 6 and the existence of $\mathrm{Lip}(Q^\pi(s, a); a)$. The advantage decomposition lemma can be rewritten as $\eta_{\pi_2} = \eta_{\pi_1} + \frac{1}{1-\gamma} \cdot \mathbb{E}_{s \sim \rho^{\pi_1}_\mu}[A^{\pi_1}(s, \pi_2)] + \frac{1}{1-\gamma} \cdot \langle \rho^{\pi_2}_\mu - \rho^{\pi_1}_\mu, A^{\pi_1}(\cdot, \pi_2) \rangle_s$. The $\langle \rho^{\pi_2}_\mu - \rho^{\pi_1}_\mu, A^{\pi_1}(\cdot, \pi_2) \rangle_s$ term has zero gradient at $\pi_2 = \pi_1$, which qualifies it to be loosely called "the second-order term". The theory behind lower-bounding this second-order term, with all derivations, is left to the appendix. Next, we present the theoretical bottom line.
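A minimal numerical sketch of the first assumption, under a hypothetical linear deterministic dynamics model (the matrices `A` and `B` below are invented for illustration): for deterministic dynamics and policies, the ratio of next-state distance to action distance empirically lower-bounds $L_\pi$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical deterministic dynamics s' = A s + B a (linear for clarity).
A = np.array([[0.9, 0.1], [0.0, 0.95]])
B = np.array([[0.0], [0.05]])
f = lambda s, a: A @ s + B @ a

def probe_L_pi(n=1000):
    """Empirical lower bound on L_pi: the worst observed ratio of
    next-state distance to action distance over random probes."""
    worst = 0.0
    for _ in range(n):
        s = rng.normal(size=2)
        a1, a2 = rng.normal(size=1), rng.normal(size=1)
        num = np.linalg.norm(f(s, a1) - f(s, a2))
        den = np.linalg.norm(a1 - a2)
        worst = max(worst, num / den)
    return worst

L_pi_hat = probe_L_pi()  # for this linear model the ratio is exactly ||B||_2
```

For nonlinear simulators the same probing idea gives only a lower estimate, which is one reason the paper treats $C_1$ and $C_2$ (which depend on $L_\pi$ and $L_\mu$) as tunable constants in practice.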

2.1. THE MONOTONIC PAYOFF IMPROVEMENT GUARANTEE

Combining the results of Inequality (30) and Theorems A.5 and A.4 of the appendix leads us to define the regularization terms and coefficients
$$C_1 := \sup_s 2 \cdot \mathrm{Lip}(Q^{\pi_1}(s, a); a) \cdot \frac{\gamma L_\pi}{(1-\gamma)(1-\gamma L_\mu)}, \qquad C_2 := \sup_s \|\mathrm{Lip}(\nabla_s Q^{\pi_1}(s, a); a)\|_2 \cdot \frac{\gamma L_\pi}{(1-\gamma)(1-\gamma L_\mu)},$$
$$L_{WG}(\pi_1, \pi_2; s) := W(\pi_2(a|s), \pi_1(a|s)) \times \Big\| \nabla_{s'} W\Big(\frac{\pi_2(a|s') + \pi_1(a|s)}{2}, \frac{\pi_2(a|s) + \pi_1(a|s')}{2}\Big)\Big|_{s'=s} \Big\|_2.$$
This gives us the novel lower bound for payoff improvement:
$$L^{\sup}_{\pi_1}(\pi_2) = \eta_{\pi_1} + \frac{1}{1-\gamma} \mathbb{E}_{s \sim \rho^{\pi_1}_\mu}[A^{\pi_1}(s, \pi_2)] - C_1 \cdot \sup_s L_{WG}(\pi_1, \pi_2; s) - C_2 \cdot \sup_s W(\pi_2(a|s), \pi_1(a|s))^2.$$
We have the inequalities $\eta_{\pi_2} \ge L^{\sup}_{\pi_1}(\pi_2)$ and $L^{\sup}_{\pi_1}(\pi_1) = \eta_{\pi_1}$. This facilitates the application of Theorem 2.1 as an instance of Minorization-Maximization algorithms (Hunter & Lange, 2004).

Theorem 2.1. Successive maximization of $L^{\sup}$ yields non-decreasing policy payoffs.

Proof. With $\pi_2 = \arg\max_\pi L^{\sup}_{\pi_1}(\pi)$, we have $L^{\sup}_{\pi_1}(\pi_2) \ge L^{\sup}_{\pi_1}(\pi_1)$. Thus, $\eta_{\pi_2} \ge L^{\sup}_{\pi_1}(\pi_2)$ and $\eta_{\pi_1} = L^{\sup}_{\pi_1}(\pi_1)$ imply $\eta_{\pi_2} - \eta_{\pi_1} \ge L^{\sup}_{\pi_1}(\pi_2) - L^{\sup}_{\pi_1}(\pi_1) \ge 0$.

3. SURROGATE OPTIMIZATION AND PRACTICAL ALGORITHM

Successive optimization of $L^{\sup}_{\pi_1}(\pi_2)$ generates non-decreasing payoffs. However, it is impractical due to the large number of constraints and the statistical estimation of suprema. To mitigate this, we take an approach similar to TRPO and optimize the surrogate
$$\tilde{L}_{\pi_1}(\pi_2) = \eta_{\pi_1} + \frac{1}{1-\gamma} \cdot \mathbb{E}_{s \sim \rho^{\pi_1}_\mu}[A^{\pi_1}(s, \pi_2)] - C_1 \cdot \mathbb{E}_{s \sim \rho^{\pi_1}_\mu}[L_{WG}(\pi_1, \pi_2; s)] - C_2 \cdot \mathbb{E}_{s \sim \rho^{\pi_1}_\mu}[W(\pi_2(a|s), \pi_1(a|s))^2].$$
Although first-order stochastic optimization methods can be applied directly to this surrogate, second-order methods can be more efficient. Since $L_{WG}(\pi_1, \pi_2; s)$ is the geometric mean of two functions of quadratic order, it is also of quadratic order. However, $L_{WG}(\pi_1, \pi_2; s)$ may not be twice continuously differentiable.
To address this, we lower-bound $L_{WG}(\pi_1, \pi_2; s)$ further using the AM-GM inequality and replace it with
$$L_{G^2}(\pi_1, \pi_2; s) := \Big\| \nabla_{s'} W\Big(\frac{\pi_2(a|s') + \pi_1(a|s)}{2}, \frac{\pi_2(a|s) + \pi_1(a|s')}{2}\Big)\Big|_{s'=s} \Big\|_2^2 \qquad (12)$$
to form a quadratic regularization term with a well-defined Hessian matrix (see Section A.8 of the appendix for detailed derivations). This induces our final surrogate, which is used for second-order optimization:
$$L_{\pi_1}(\pi_2) = \eta_{\pi_1} + \frac{1}{1-\gamma} \cdot \mathbb{E}_{s \sim \rho^{\pi_1}_\mu}[A^{\pi_1}(s, \pi_2)] - C_1 \cdot \mathbb{E}_{s \sim \rho^{\pi_1}_\mu}[L_{G^2}(\pi_1, \pi_2; s)] - C_2 \cdot \mathbb{E}_{s \sim \rho^{\pi_1}_\mu}[W(\pi_2(a|s), \pi_1(a|s))^2].$$
The coefficients $C_1$ and $C_2$ are dynamics-dependent. For simplicity, we used constant coefficients and a trust region. This yields Truly Deterministic Policy Optimization (TDPO), as given in Algorithm 1. See the appendix for practical notes on the choice of $C_1$ and $C_2$. Alternatively, one could adopt processes similar to Schulman et al. (2015a), where the IS-based advantage estimator used a line search for proper step-size selection, or the adaptive penalty coefficient setting of Schulman et al. (2017). We plan to consider such approaches in future work.
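For deterministic policies, both regularizers can be evaluated in closed form, as emphasized in Section 3.1. The sketch below computes a stand-in for the surrogate's penalty terms, taking the sensitivity term as the squared distance between policy state-Jacobians (following the paper's later interpretation of $L_{G^2}$); the linear policies and the unit constants are hypothetical.

```python
import numpy as np

def jacobian(policy, s, eps=1e-5):
    """Finite-difference Jacobian of a deterministic policy at state s."""
    s = np.asarray(s, dtype=float)
    a0 = np.asarray(policy(s))
    J = np.zeros((a0.size, s.size))
    for i in range(s.size):
        ds = np.zeros_like(s)
        ds[i] = eps
        J[:, i] = (np.asarray(policy(s + ds)) - a0) / eps
    return J

def surrogate_penalty(pi1, pi2, s, C1, C2):
    """Quadratic penalty of the final surrogate for deterministic
    policies: C2 * ||pi2(s) - pi1(s)||^2 plus C1 times the sensitivity
    term, approximated here by the squared Frobenius distance between
    the two policies' state-Jacobians."""
    w2 = np.sum((pi2(s) - pi1(s)) ** 2)
    lg2 = np.sum((jacobian(pi2, s) - jacobian(pi1, s)) ** 2)
    return C1 * lg2 + C2 * w2

# Hypothetical linear policies pi(s) = K s:
K1, K2 = np.array([[1.0, 0.0]]), np.array([[1.1, -0.2]])
pen = surrogate_penalty(lambda s: K1 @ s, lambda s: K2 @ s,
                        s=np.array([0.5, -1.0]), C1=1.0, C2=1.0)
```

No density ratios or support overlaps enter this computation, which is the singularity-free property the surrogate relies on.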

3.1. ON THE INTERPRETATION OF THE SURROGATE FUNCTION

For deterministic policies, the squared Wasserstein distance $W(\pi_2(a|s), \pi_1(a|s))^2$ degenerates to the squared Euclidean distance over the action space. Any policy defines a sensitivity matrix at a given state $s$, which is the Jacobian matrix of the policy output with respect to $s$. The policy sensitivity term $L_{G^2}(\pi_1, \pi_2; s)$ is essentially the squared Euclidean distance over the action-to-observation Jacobian matrix elements. In other words, our surrogate prefers to step in directions where the action-to-observation sensitivity is preserved across updates. Although our surrogate uses a metric distance instead of the traditional non-metric measures for regularization, we do not consider this replacement alone a major contribution. The squared Wasserstein distance and the KL divergence of two identically-scaled Gaussian distributions are the same up to a constant (i.e., $D_{KL}(\mathcal{N}(m_1, \sigma) \| \mathcal{N}(m_2, \sigma)) = W(\mathcal{N}(m_1, \sigma), \mathcal{N}(m_2, \sigma))^2 / 2\sigma^2$). On the other hand, our surrogate's compatibility with deterministic policies makes it a valuable asset for our policy gradient algorithm; both $W(\pi_2(a|s), \pi_1(a|s))^2$ and $L_{G^2}(\pi_1, \pi_2; s)$ can be evaluated numerically for two deterministic policies $\pi_1$ and $\pi_2$ without resorting to any approximations to overcome singularities.

Algorithm 1 Truly Deterministic Policy Optimization (TDPO)
1: for $k = 0, 1, 2, \ldots$ do
2: Collect trajectories and construct the advantage estimator oracle $A_{\pi_k}$.
3: Compute the policy gradient $g$ at $\theta_k$: $g \leftarrow \nabla_\theta A_{\pi_k}(\pi')|_{\pi' = \pi_k}$.
4: Construct a surrogate Hessian-vector product oracle $v \mapsto H \cdot v$ such that for $\theta = \theta_k + \delta\theta$,
$$\mathbb{E}_{s \sim \rho^{\pi_k}_\mu}[W(\pi'(a|s), \pi_k(a|s))^2] + \frac{C_1}{C_2} \mathbb{E}_{s \sim \rho^{\pi_k}_\mu}[L_{G^2}(\pi', \pi_k; s)] = \frac{1}{2} \delta\theta^T H \delta\theta + \text{h.o.t.},$$
where h.o.t. denotes higher-order terms in $\delta\theta$.
5: Find the optimal update direction $\delta\theta^* = H^{-1} g$ using the Conjugate Gradient algorithm.
6: Determine the best step size $\alpha^*$ within the trust region:
$$\alpha^* = \arg\max_\alpha \; g^T (\alpha \delta\theta^*) - \frac{C_2}{2} (\alpha \delta\theta^*)^T H (\alpha \delta\theta^*) \quad \text{s.t.} \quad \frac{1}{2} (\alpha^* \delta\theta^*)^T H (\alpha^* \delta\theta^*) \le \delta^2_{\max}.$$
7: Update the policy parameters: $\theta_{k+1} \leftarrow \theta_k + \alpha^* \delta\theta^*$.
8: end for
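Steps 5 and 6 can be sketched as follows. This is a simplified illustration assuming a known quadratic model: the conjugate gradient solver uses only Hessian-vector products (as the algorithm requires), and the step is clipped to the trust-region boundary rather than obtained from the exact argmax above.

```python
import numpy as np

def conjugate_gradient(hvp, g, iters=50, tol=1e-10):
    """Solve H x = g using only Hessian-vector products (step 5)."""
    x = np.zeros_like(g)
    r = g.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def trust_region_step(hvp, g, delta_max):
    """Simplified step 6: scale the CG direction so that
    0.5 * (a * dtheta)^T H (a * dtheta) <= delta_max^2."""
    dtheta = conjugate_gradient(hvp, g)
    quad = dtheta @ hvp(dtheta)
    return dtheta * min(1.0, delta_max * np.sqrt(2.0 / quad))

# Toy quadratic with an explicit Hessian (stand-in for the surrogate's H):
H = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([1.0, -0.5])
step = trust_region_step(lambda v: H @ v, g, delta_max=10.0)
```

With a large `delta_max` the step reduces to the unconstrained Newton direction $H^{-1} g$; shrinking `delta_max` scales it back toward the boundary of the quadratic trust region.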

4. MODEL-FREE ESTIMATION OF POLICY GRADIENT

The DeVine advantage estimator is formally defined in Algorithm 2. Unlike DDPG and TD3, the DeVine estimator allows our method to perform deterministic search by not constantly injecting noise into actions for exploration. Essentially, DeVine rolls out a trajectory and computes the value of each state. Since the transition dynamics and the policy are deterministic, these values are exact. It then picks a perturbation state $s_t$ according to the visitation frequencies using importance sampling. A state-reset to $s_t$ is made, a $\sigma$-perturbed action is applied for a single time-step, followed by the $\pi_1$ policy. This exactly produces $Q^{\pi_1}(s_t, a_t + \sigma)$. Then, $A^{\pi_1}(s_t, a_t + \sigma)$ can be computed by subtracting the value baseline. Finally, $A^{\pi_1}(s_t, a_t) = 0$ and $A^{\pi_1}(s_t, a_t + \sigma)$ define a two-point linear model of $A^{\pi_1}(s_t, a)$ with respect to the action. Parallelization can be used to include as many states of the first roll-out in the estimator as desired. The parameter $\sigma$ acts both as an exploration parameter and as a finite difference to establish derivatives. While $\sigma \to 0$ can produce exact gradients, larger $\sigma$ can build stabler interpolations. Under deterministic dynamics and policies, if the DeVine oracle samples each dimension at each time-step exactly once, then in the limit of small $\sigma$ it can produce exact advantages, as stated in Theorem 4.1, whose proof is deferred to the appendix.

Theorem 4.1. Assume a finite-horizon MDP with both deterministic transition dynamics $P$ and initial distribution $\mu$, with maximal horizon length $H$. Define $K = H \cdot \dim(A)$, a uniform $\nu$, and $q(s; \sigma) = \pi_1(s) + \sigma e_j$ in Algorithm 2, with $e_j$ being the $j$-th basis element of $A$. If the $(j, t_k)$ pairs are sampled to exactly cover $\{1, \ldots, \dim(A)\} \times \{1, \ldots, H\}$, then we have $\lim_{\sigma \to 0} \nabla_{\pi_2} A_{\pi_1}(\pi_2) \big|_{\pi_2 = \pi_1} = \nabla_{\pi_2} \eta_{\pi_2} \big|_{\pi_2 = \pi_1}$.

Algorithm 2 Deterministic Vine (DeVine) Policy Advantage Estimator
Require: The number of parallel workers $K$.
Require: A policy $\pi$, an exploration policy $q$, a discrete time-step distribution $\nu(t)$, an initial state distribution $\mu(s)$, and the discount factor $\gamma$.
1: Sample an initial state $s_0$ from $\mu$, and then roll out a trajectory $\tau = (s_0, a_0, s_1, a_1, \cdots)$ using $\pi$.
2: for $k = 1, 2, \cdots, K$ do
3: Sample the integer $t = t_k$ from $\nu$.
4: Compute the value $V^{\pi_1}(s_t) = \sum_{i=t}^{\infty} \gamma^{i-t} R(s_i, a_i)$.
5: Reset the initial state to $s_t$, sample the first action $a'_t$ according to $q(\cdot|s_t)$, and use $\pi$ for the rest of the trajectory. This creates $\tau' = (s_t, a'_t, s'_{t+1}, a'_{t+1}, \cdots)$.
6: Compute the value $Q^{\pi_1}(s_t, a'_t) = \sum_{i=t}^{\infty} \gamma^{i-t} R(s'_i, a'_i)$.
7: Compute the advantage $A^{\pi_1}(s_t, a'_t) = Q^{\pi_1}(s_t, a'_t) - V^{\pi_1}(s_t)$.
8: end for
9: Define $A_{\pi_1}(\pi_2) := \frac{1}{K} \sum_{k=1}^{K} \dim(A) \cdot \frac{\gamma^{t_k}}{\nu(t_k)} \cdot \frac{(\pi_2(s_{t_k}) - a_{t_k})^T (a'_{t_k} - a_{t_k})}{(a'_{t_k} - a_{t_k})^T (a'_{t_k} - a_{t_k})} \cdot A^{\pi_1}(s_{t_k}, a'_{t_k})$.
10: Return $A_{\pi_1}(\pi_2)$ and $\nabla_{\pi_2} A_{\pi_1}(\pi_2)$ as unbiased estimators of $\mathbb{E}_{s \sim \rho^{\pi_1}_\mu}[A^{\pi_1}(s, \pi_2)]$ and the PG, respectively.

Theorem 4.1 provides a guarantee for recovering the exact policy gradient if the initial state distribution is deterministic and all time-steps of the trajectory are used to branch vine trajectories. Although this theorem sets the stage for computing a fully deterministic gradient, stochastic approximation can be used in Algorithm 2 by randomly sampling a small set of states for advantage estimation. In other words, Theorem 4.1 would use $\nu$ to deterministically sample all trajectory states, whereas this is not a practical requirement for Algorithm 2, and the gradients remain unbiased if a random set of vine branches is used. The DeVine estimator can be advantageous in at least two scenarios. First, in the case of rewards that cannot be decomposed into summations of immediate rewards. For example, overshoot penalizations or frequency-based rewards, as used in robotic systems, are non-local. DeVine can be robust to non-local rewards, as it is insensitive to whether the rewards are applied immediately or after a long period.
Second, DeVine can be an appropriate choice for systems that are sensitive to the injection of noise, such as high-bandwidth robots with natural resonant frequencies. In such cases, using white (or colored) noise for exploration can excite these resonant frequencies and cause instability, making learning difficult. DeVine avoids the need for constant noise injection.
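A minimal sketch of one DeVine branch on a toy deterministic environment (the dynamics, reward, and policy below are invented for illustration): values are exact rollout sums, and a single σ-perturbed action yields a finite-difference slope of the advantage with respect to the action.

```python
import numpy as np

# Toy deterministic, resettable environment (illustrative stand-in).
step = lambda s, a: 0.5 * s + a
reward = lambda s, a: -(s ** 2) - 0.1 * (a ** 2)
gamma, H = 0.9, 40

def value(policy, s):
    """Exact V^pi(s): deterministic dynamics + policy => no variance."""
    v = 0.0
    for t in range(H):
        a = policy(s)
        v += gamma ** t * reward(s, a)
        s = step(s, a)
    return v

def q_value(policy, s, a0):
    """Exact Q^pi(s, a0): apply the perturbed first action once, then pi."""
    v = reward(s, a0)
    return v + gamma * value(policy, step(s, a0))

def devine_slope(policy, s, sigma=1e-4):
    """One DeVine branch: a sigma-perturbed action at s gives a
    two-point linear model of A^pi(s, .) along that direction."""
    a = policy(s)
    adv = q_value(policy, s, a + sigma) - value(policy, s)
    return adv / sigma  # finite-difference advantage slope

slope = devine_slope(lambda s: -0.4 * s, s=1.0)
```

Note that the unperturbed branch has zero advantage by construction, which is what anchors the two-point linear model in Algorithm 2.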

5. EXPERIMENTS

The next two subsections present challenging robotic control tasks involving frequency-based non-local rewards, long horizons, and sensitivity to resonant frequencies. See Section A.11 of the appendix for a comparison on traditional gym environments; TDPO performs similarly or slightly worse there, as those environments seem to be well-suited to stochastic exploration.

5.1. AN ENVIRONMENT WITH NON-LOCAL REWARDS 1

The first environment that we consider is a simple pendulum. The transition function is standard: the states are joint angle and joint velocity, and the action is joint torque. The reward function is non-standard: rather than defining a local reward in the time domain, the reward encourages the pendulum to oscillate with a desired power spectrum. In particular, we compute this non-local reward by taking the Fourier transform of the joint-angle signal over the entire trajectory and penalizing differences between the resulting power spectrum and a desired power spectrum. We apply this non-local reward at the last time step of the trajectory. Implementation details and similar results for more pendulum variants are left to the appendix.

Figure 1 shows training curves (payoff versus sample count, up to 500M samples) for TDPO (our method) as compared to TRPO, PPO, DDPG, and TD3. These results were averaged over 25 experiments in which the desired oscillation frequency was 1.7 Hz (different from the pendulum's natural frequency of 0.5 Hz), the desired oscillation amplitude was 0.28 rad, and the desired offset was 0.52 rad. Figure 1 also shows trajectories obtained by the best agents from each method. TDPO (our method) was able to learn high-reward behavior and to achieve the desired frequency, amplitude, and offset. TRPO was able to learn the correct offset but did not produce any oscillatory behavior. TD3 also learned the correct offset, but could not produce desirable oscillations. PPO and DDPG failed to learn any desired behavior.
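A sketch of how such a non-local reward might be computed, assuming a discrete Fourier transform of the sampled joint-angle signal; the sampling rate, duration, and quadratic penalty form are illustrative, not the paper's exact implementation.

```python
import numpy as np

def spectral_reward(theta, target_power):
    """Non-local reward of the kind described above: penalize the
    squared difference between the power spectrum of the joint-angle
    signal and a desired power spectrum. Applied once, at the last
    time step of the trajectory."""
    power = np.abs(np.fft.rfft(theta)) ** 2 / len(theta)
    return -np.sum((power - target_power) ** 2)

# Hypothetical target: a 1.7 Hz oscillation with 0.52 rad offset and
# 0.28 rad amplitude, sampled for 10 s at 100 Hz.
t = np.arange(0, 10.0, 0.01)
desired = 0.52 + 0.28 * np.sin(2 * np.pi * 1.7 * t)
target_power = np.abs(np.fft.rfft(desired)) ** 2 / len(desired)

r_match = spectral_reward(desired, target_power)  # perfect match scores 0
r_off = spectral_reward(0.52 + 0.28 * np.sin(2 * np.pi * 0.5 * t),
                        target_power)             # wrong frequency: negative
```

Because this reward depends on the whole trajectory at once, it cannot be decomposed into immediate per-step rewards, which is precisely the setting DeVine is designed to handle.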

5.2. AN ENVIRONMENT WITH LONG HORIZON AND RESONANT FREQUENCIES 2

The second environment that we consider is a single leg from a quadruped robot (Park et al., 2017). This leg has two joints, a "hip" and a "knee," about which it is possible to exert torques. The hip is attached to a slider that confines motion to a vertical line above flat ground. We assume the leg is dropped from some height above the ground, and the task is to recover from this drop and to stand upright at rest after impact. States given to the agent are the angle and velocity of each joint (slider position and velocity are hidden), and actions are the joint torques. The reward function penalizes difference from an upright posture, slipping or chattering at the contact between the foot and the ground, non-zero joint velocities, and steady-state joint torque deviations. We use MuJoCo for simulation (Todorov et al., 2012), with high-fidelity models of ground compliance, motor nonlinearity, and joint friction. The control loop rate is 4000 Hz and the rollout length is 2 s, resulting in a horizon of 8000 steps. Implementation details are left to the appendix.

Figure 2: Results for the leg environment with a long horizon and resonant frequencies due to ground compliance. Upper panel: training curves with empirical discounted payoffs. Lower panel: partial trajectories, restricted to times shortly before and after impact with the ground. Note the oscillations at about 100 Hz that appear just after the impact at 0.2 s; these oscillations are evidence of a resonant frequency.

Figure 2 shows training curves for TDPO (our method) as compared to TRPO, PPO, DDPG, and TD3. These results were averaged over 75 experiments. A discount factor of γ = 0.99975 was chosen for all methods, so that (1 - γ)^(-1) is half the trajectory length. Similarly, the GAE factors for PPO and TRPO were scaled up to 0.99875 and 0.9995, respectively, in proportion to the trajectory length. Figure 2 also shows trajectories obtained by the best agents from each method.
TDPO (our method) was able to learn high-reward behavior. TRPO, PPO, DDPG, and TD3 were not. We hypothesize that the reason for this difference in performance is that TDPO better handles the combination of two challenges presented by the leg environment-an unusually long time horizon (8000 steps) and the existence of a resonant frequency that results from compliance between the foot and the ground (note the oscillations at a frequency of about 100 Hz that appear in the trajectories after impact). Both high-speed control loops and resonance due to ground compliance are common features of real-world legged robots to which TDPO seems to be more resilient.
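The discount-factor choice above can be verified directly: setting the effective horizon $(1-\gamma)^{-1}$ to half the 8000-step trajectory recovers the value used in the experiments.

```python
# Effective-horizon rule of thumb: (1 - gamma)^(-1) = H / 2.
H = 8000
gamma = 1.0 - 1.0 / (H / 2)            # = 0.99975, as used in Section 5.2
effective_horizon = 1.0 / (1.0 - gamma)  # = 4000 steps
```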

6. DISCUSSION

This article proposed a deterministic policy gradient method (TDPO: Truly Deterministic Policy Optimization) based on the use of a Deterministic Vine (DeVine) gradient estimator and the Wasserstein metric. We proved monotonic payoff improvement guarantees for our method and proposed a novel surrogate for policy optimization. We showed numerical evidence of superior performance on an environment with non-local rewards defined in the frequency domain and on a realistic long-horizon resonant environment. The method enables policy gradient applications that customize the frequency-response characteristics of agents. Future work includes adding a regularization coefficient adaptation process for the C_1 and C_2 parameters of the TDPO algorithm.

A APPENDIX A.1 TABLES OF NOTATION

The mathematical definitions and notations used in the paper are re-introduced and summarized in two tables: Table 1 describes the mathematical functions and operators used throughout the paper, and Table 2 describes the notation needed to define the Markov Decision Process (MDP). Each table consists of two columns: one showing or defining the notation, and the other giving the name by which the notation is called in the paper.

Table 1: The mathematical functions and operators used throughout the paper.

• Value function: $V^{\pi}(s) := \frac{1}{1-\gamma} \mathbb{E}_{s_t \sim \rho^{\pi}_{\mu},\, a_t \sim \pi(s_t)}[R(s_t, a_t)] = \mathbb{E}\left[ \sum_{t=1}^{\infty} \gamma^{t-1} R(s_t, a_t) \,\middle|\, s_1 = s,\ a_t \sim \pi(s_t),\ s_{t+1} \sim P(s_t, a_t) \right]$.
• Q-value function: $Q^{\pi}(s, a) := R(s, a) + \gamma \cdot \mathbb{E}_{s' \sim P(s,a)}[V^{\pi}(s')]$.
• Advantage function: $A^{\pi}(s, a) := Q^{\pi}(s, a) - V^{\pi}(s)$.
• Advantage function: $A^{\pi}(s, \pi') := \mathbb{E}_{a \sim \pi'(s)}[A^{\pi}(s, a)]$.
• Arbitrary functions: $f$ and $g$ denote arbitrary functions in what follows.
• Arbitrary distributions: $\nu$ and $\zeta$ denote arbitrary probability distributions in what follows.
• Hilbert inner product: $\langle f, g \rangle_x := \int f(x) g(x)\, dx$.
• Kullback–Leibler (KL) divergence: $D_{KL}(\zeta \| \nu) := \langle \zeta(x), \log(\zeta(x)/\nu(x)) \rangle_x = \int \zeta(x) \log(\zeta(x)/\nu(x))\, dx$.
• Total Variation (TV) distance: $TV(\zeta, \nu) := \frac{1}{2} \langle |\zeta(x) - \nu(x)|, 1 \rangle_x = \frac{1}{2} \int |\zeta(x) - \nu(x)|\, dx$.
• Coupling set: $\Gamma(\zeta, \nu)$ is the set of couplings of $\zeta$ and $\nu$.
• Wasserstein distance: $W(\zeta, \nu) := \inf_{\gamma \in \Gamma(\zeta, \nu)} \langle \|x - y\|, \gamma(x, y) \rangle_{x,y}$.
• Policy Wasserstein distance: $W(\pi_1, \pi_2) := \sup_{s \in \mathcal{S}} W(\pi_1(\cdot|s), \pi_2(\cdot|s))$, where $a \sim \pi(s)$ and $a \sim \pi(\cdot|s)$ are used interchangeably.
• Lipschitz constant: $\mathrm{Lip}(f(x, y); x) := \sup_x \|\nabla_x f(x, y)\|_2$.
• Rubinstein–Kantorovich (RK) duality: $|\langle \zeta(x) - \nu(x), f(x) \rangle_x| \le W(\zeta, \nu) \cdot \mathrm{Lip}(f; x)$.
• $\pi_{\det} : \mathcal{S} \to \mathcal{A}$: for a deterministic policy $\pi_{\det}$, the unique action $a$ suggested by the policy at state $s$ is denoted $\pi_{\det}(s)$; that is, $a = \pi_{\det}(s)$.
• $\Pi$: the set of all policies (i.e., $\forall \pi : \pi \in \Pi$).
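The Wasserstein distance in Table 1 is defined through an infimum over couplings, but in one dimension it reduces to matching order statistics, which makes a quick empirical check easy. Below is a minimal sketch; the function name `wasserstein_1d` is ours, not from the paper:

```python
import numpy as np

def wasserstein_1d(xs, ys):
    """W1 distance between two equal-size empirical 1-D distributions.

    For sorted samples, the optimal coupling matches the i-th smallest
    point of one sample to the i-th smallest of the other, so W1 is the
    mean absolute difference of the order statistics.
    """
    xs, ys = np.sort(xs), np.sort(ys)
    return np.mean(np.abs(xs - ys))

# Two point masses at 0 and at 1: W1 is the distance between them.
print(wasserstein_1d(np.zeros(4), np.ones(4)))  # -> 1.0
```

Note that, unlike the KL divergence, this quantity is well defined (and finite) even when the two empirical supports do not overlap, which is exactly the property the paper exploits for deterministic policies.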

Table 2: The MDP notation used throughout the paper.

• $P$: in general, $P$ denotes the transition dynamics model of the MDP; its input argument types vary throughout the text, as clarified below.
• $P : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$: given a particular state $s$ and action $a$, $P(s, a)$ is the next-state distribution of the transition dynamics (i.e., $s' \sim P(s, a)$, where $s'$ denotes the next state after applying $s, a$ to the transition $P$).
• $\mathcal{P} : \Delta(\mathcal{S}) \times \mathcal{A} \to \Delta(\mathcal{S})$ and $\mathcal{P} : \Delta(\mathcal{S}) \times \Pi \to \Delta(\mathcal{S})$: generalizations of the transition dynamics that accept a state distribution (and a policy) as input. Given an arbitrary state distribution $\nu_s$ and a policy $\pi$, $\mathcal{P}(\nu_s, \pi)$ is the next-state distribution when the state is sampled from $\nu_s$ and the action is sampled from the $\pi(s)$ distribution; that is, $\mathcal{P}(\nu_s, \pi) := \mathbb{E}_{s \sim \nu_s,\, a \sim \pi(s)}[P(s, a)]$.
• $\mathcal{P}^t : \Delta(\mathcal{S}) \times \Pi \to \Delta(\mathcal{S})$: the $t$-step generalization of the transition dynamics. Given an arbitrary state distribution $\nu_s$, a policy $\pi$, and a non-negative integer $t$, $\mathcal{P}^t$ is defined recursively by $\mathcal{P}^0(\nu_s, \pi) := \nu_s$ and $\mathcal{P}^t(\nu_s, \pi) := \mathcal{P}(\mathcal{P}^{t-1}(\nu_s, \pi), \pi)$.
• $\rho^{\pi}_{\mu}$: the discounted visitation frequency, a distribution over $\mathcal{S}$, defined as $\rho^{\pi}_{\mu} := (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \mathcal{P}^t(\mu, \pi)$.

Theorem A.4. Assuming (5), (6), and $\gamma L_\mu < 1$, we have the inequality
$$W(\rho^{\pi_1}_{\mu}, \rho^{\pi_2}_{\mu}) \le \frac{\gamma L_\pi}{1 - \gamma L_\mu} \cdot W(\pi_1, \pi_2).$$

Proof. Using Lemma A.3 and the definition of $\rho^{\pi}_{\mu}$, we can write
$$W(\rho^{\pi_1}_{\mu}, \rho^{\pi_2}_{\mu}) \le (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \cdot W(\mathcal{P}^t(\mu, \pi_1), \mathcal{P}^t(\mu, \pi_2)). \qquad (26)$$
Using Lemma A.2, we can take another step and relax Inequality (26) into
$$W(\rho^{\pi_1}_{\mu}, \rho^{\pi_2}_{\mu}) \le \frac{L_\pi (1-\gamma) W(\pi_1, \pi_2)}{L_\mu - 1} \sum_{t=0}^{\infty} \left( (\gamma L_\mu)^t - \gamma^t \right). \qquad (27)$$
Due to the $\gamma L_\mu < 1$ assumption, the right-hand summation in (27) is convergent. This leads us to
$$W(\rho^{\pi_1}_{\mu}, \rho^{\pi_2}_{\mu}) \le \frac{L_\pi (1-\gamma) W(\pi_1, \pi_2)}{L_\mu - 1} \left( \frac{1}{1 - \gamma L_\mu} - \frac{1}{1 - \gamma} \right). \qquad (28)$$
Inequality (28) can then be simplified to give the result.
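The $\mathcal{P}^t$ recursion and the discounted visitation frequency $\rho^{\pi}_{\mu}$ can be illustrated on a small tabular example. The sketch below assumes a hypothetical three-state chain under a fixed policy; truncating the infinite sum at a large horizon yields a distribution summing to $1 - \gamma^{\text{horizon}} \approx 1$:

```python
import numpy as np

# Hypothetical 3-state chain under a fixed policy: row-stochastic matrix
# P[s, s'] = probability of moving from s to s' when following pi.
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.9, 0.1],
              [0.0, 0.0, 1.0]])
mu = np.array([1.0, 0.0, 0.0])  # deterministic initial distribution
gamma = 0.9

def visitation(mu, P, gamma, horizon=500):
    """rho ~= (1 - gamma) * sum_t gamma^t P^t(mu, pi), truncated at `horizon`."""
    rho, nu = np.zeros_like(mu), mu.copy()
    for t in range(horizon):
        rho += (1 - gamma) * gamma**t * nu
        nu = nu @ P  # one application of the transition dynamics: P^{t+1} = P(P^t, pi)
    return rho

rho = visitation(mu, P, gamma)
print(rho, rho.sum())  # a valid distribution, summing to ~1
```

Because the tail of the series is weighted by $\gamma^t$, early states dominate $\rho^{\pi}_{\mu}$ even though the chain is eventually absorbed in the last state.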

A.4 STEPS TO BOUND THE SECOND-ORDER TERM

The RK duality yields the following bound:
$$\left| \langle \rho^{\pi_2}_{\mu} - \rho^{\pi_1}_{\mu}, A^{\pi_1}(\cdot, \pi_2) \rangle_s \right| \le W(\rho^{\pi_1}_{\mu}, \rho^{\pi_2}_{\mu}) \cdot \sup_s \|\nabla_s A^{\pi_1}(s, \pi_2)\|_2. \qquad (29)$$
To facilitate further application of the RK duality, any advantage can be rewritten as the inner product
$$A^{\pi_1}(s, \pi_2) = \langle \pi_2(a|s) - \pi_1(a|s), Q^{\pi_1}(s, a) \rangle_a.$$
Taking derivatives of both sides with respect to the state variable and applying the triangle inequality produces the bound
$$\sup_s \|\nabla_s A^{\pi_1}(s, \pi_2)\|_2 \le \sup_s \left\| \langle \nabla_s (\pi_2(a|s) - \pi_1(a|s)), Q^{\pi_1}(s, a) \rangle_a \right\|_2 + \sup_s \left\| \langle \pi_2(a|s) - \pi_1(a|s), \nabla_s Q^{\pi_1}(s, a) \rangle_a \right\|_2. \qquad (30)$$
The second term of the RHS in (30) is compatible with the RK duality. However, the form of the first term does not warrant an easy application of RK. For this, we introduce Theorem A.5.

Theorem A.5. Assuming the existence of $\mathrm{Lip}(Q^{\pi_1}(s, a); a)$, we have the bound
$$\left\| \langle \nabla_s (\pi_2(a|s) - \pi_1(a|s)), Q^{\pi_1}(s, a) \rangle_a \right\|_2 \le 2 \cdot \mathrm{Lip}(Q^{\pi_1}(s, a); a) \cdot \left\| \nabla_{s'} W\!\left( \frac{\pi_2(a|s') + \pi_1(a|s)}{2}, \frac{\pi_2(a|s) + \pi_1(a|s')}{2} \right)\Big|_{s'=s} \right\|_2. \qquad (31)$$

Proof. By definition, we have
$$\left\| \langle \nabla_s (\pi_2(a|s) - \pi_1(a|s)), Q^{\pi_1}(s, a) \rangle_a \right\|_2^2 = \sum_{j=1}^{\dim(\mathcal{S})} \left( \frac{\partial}{\partial s^{(j)}} \langle \pi_2(a|s) - \pi_1(a|s), Q^{\pi_1}(s, a) \rangle_a \right)^2. \qquad (32)$$
For better insight, we write the derivative using finite differences:
$$\frac{\partial}{\partial s^{(j)}} \langle \pi_2(a|s) - \pi_1(a|s), Q^{\pi_1}(s, a) \rangle_a = \lim_{\delta s \to 0} \frac{1}{\delta s} \Big[ \langle \pi_2(a|s + \delta s \cdot e_j) - \pi_1(a|s + \delta s \cdot e_j), Q^{\pi_1}(s, a) \rangle_a - \langle \pi_2(a|s) - \pi_1(a|s), Q^{\pi_1}(s, a) \rangle_a \Big]. \qquad (33)$$
We can rearrange the finite-difference terms to get
$$\frac{\partial}{\partial s^{(j)}} \langle \pi_2(a|s) - \pi_1(a|s), Q^{\pi_1}(s, a) \rangle_a = \lim_{\delta s \to 0} \frac{1}{\delta s} \Big[ \langle \pi_2(a|s + \delta s \cdot e_j) + \pi_1(a|s), Q^{\pi_1}(s, a) \rangle_a - \langle \pi_2(a|s) + \pi_1(a|s + \delta s \cdot e_j), Q^{\pi_1}(s, a) \rangle_a \Big]. \qquad (34)$$
Equivalently, we can divide and multiply the inner products by a factor of 2, to make the inner-product arguments resemble mixture distributions:
$$\frac{\partial}{\partial s^{(j)}} \langle \pi_2(a|s) - \pi_1(a|s), Q^{\pi_1}(s, a) \rangle_a = \lim_{\delta s \to 0} \frac{2}{\delta s} \left[ \left\langle \frac{\pi_2(a|s + \delta s \cdot e_j) + \pi_1(a|s)}{2}, Q^{\pi_1}(s, a) \right\rangle_a - \left\langle \frac{\pi_2(a|s) + \pi_1(a|s + \delta s \cdot e_j)}{2}, Q^{\pi_1}(s, a) \right\rangle_a \right]. \qquad (35)$$
The RK duality can now be used to bound this difference as
$$\left| \frac{\partial}{\partial s^{(j)}} \langle \pi_2(a|s) - \pi_1(a|s), Q^{\pi_1}(s, a) \rangle_a \right| \le \lim_{\delta s \to 0} \frac{2}{\delta s} W\!\left( \frac{\pi_2(a|s + \delta s \cdot e_j) + \pi_1(a|s)}{2}, \frac{\pi_2(a|s) + \pi_1(a|s + \delta s \cdot e_j)}{2} \right) \cdot \mathrm{Lip}(Q^{\pi_1}(s, a); a), \qquad (36)$$
which can be simplified as
$$\left| \frac{\partial}{\partial s^{(j)}} \langle \pi_2(a|s) - \pi_1(a|s), Q^{\pi_1}(s, a) \rangle_a \right| \le 2 \cdot \mathrm{Lip}(Q^{\pi_1}(s, a); a) \cdot \frac{\partial}{\partial s'^{(j)}} W\!\left( \frac{\pi_2(a|s') + \pi_1(a|s)}{2}, \frac{\pi_2(a|s) + \pi_1(a|s')}{2} \right)\Big|_{s'=s}. \qquad (37)$$
Combining Inequality (37) and Equation (32), we obtain the bound in the theorem.

A.5 THE LIPSCHITZ CONTINUITY AND TRANSITION STABILITY ASSUMPTIONS

There are three key groups of assumptions made in the derivation of our policy improvement lower bound. The first is the existence of the $Q^\pi$-function Lipschitz constants. The second is Lipschitz continuity of the transition dynamics. The third is stability of the transition dynamics. We discuss the meaning and necessity of these assumptions in the same order.

A.5.1 ON THE EXISTENCE OF THE $\mathrm{Lip}(Q^\pi, a)$ CONSTANT

The $\mathrm{Lip}(Q^\pi, a)$ constant may be undefined when either the reward function or the transition dynamics are discontinuous. Examples of environments with undefined $\mathrm{Lip}(Q^\pi, a)$ constants include those with grazing contacts, which produce discontinuous transition dynamics. In practice, even for environments that do not satisfy the Lipschitz continuity assumptions, there are mitigating factors: practical $Q^\pi$ functions are reasonably well bounded within a small trust-region neighborhood, and since we use non-vanishing exploration scales and trust regions, a bounded interpolation slope can still model the Q-function variation effectively. We also note that a slightly stronger version of this assumption is frequently used in the context of Lipschitz MDPs (Pirotta et al., 2015; Rachelson & Lagoudakis, 2010; Asadi et al., 2018). In practice, we have not found this to be a substantial limitation.

A.5.2 THE TRANSITION DYNAMICS LIPSCHITZ CONTINUITY ASSUMPTION

Assumptions 5 and 6 of the main paper essentially represent Lipschitz continuity of the transition dynamics with respect to actions and states, respectively. If the transition dynamics and the policy are deterministic, these assumptions are exactly the classical Lipschitz continuity assumptions; Assumptions 5 and 6 merely generalize them in a distributional sense. The necessity of these assumptions is a consequence of using metric measures for bounding errors. Traditional non-metric bounds force the use of full-support stochastic policies in which all actions have non-zero probability (e.g., for the KL divergence of two policies to be defined, TRPO must operate on full-support policies such as Gaussian policies). In those analyses, since all policies share the same support, the next-state distribution automatically becomes smooth and Lipschitz continuous with respect to the policy measure, even if the transition dynamics were not originally smooth in the input actions. Metric measures, however, are also defined for policies with non-overlapping support. To provide closeness bounds on the future state visitations of two similar policies with non-overlapping support, it becomes necessary to assume that close-enough actions or states yield close-enough next states. This is a common assumption in the framework of Lipschitz MDPs (see Section 2.2 of Rachelson & Lagoudakis (2010), Section 3 of Asadi et al. (2018), and Assumption 1 of Pirotta et al. (2015)).

A.5.3 THE TRANSITION DYNAMICS STABILITY ASSUMPTION

Before relaxing the $\gamma L_\mu < 1$ assumption, we make a few definitions and restate the previous lemmas and theorems under them. We define $L_{\mu_1,\mu_2,\pi}$ to be the infimum non-negative value for which the equation
$$W(\mathcal{P}(\mu_1, \pi), \mathcal{P}(\mu_2, \pi)) = L_{\mu_1,\mu_2,\pi} \cdot W(\mu_1, \mu_2)$$
holds. Similarly, $\tilde{L}_{\mu,\pi_1,\pi_2}$ is defined as the infimum non-negative value for which the equation
$$W(\mathcal{P}(\mu, \pi_1), \mathcal{P}(\mu, \pi_2)) = \tilde{L}_{\mu,\pi_1,\pi_2} \cdot W(\pi_1(\cdot|\mu), \pi_2(\cdot|\mu))$$
holds. For notational brevity, we denote $L_{\mathcal{P}^t(\mu,\pi_1),\mathcal{P}^t(\mu,\pi_2),\pi_2}$ and $\tilde{L}_{\mathcal{P}^t(\mu,\pi_1),\pi_1,\pi_2}$ by $L^{(t)}_{\mu,\pi_1,\pi_2}$ and $\tilde{L}^{(t)}_{\mu,\pi_1,\pi_2}$, respectively. Under these definitions, Lemma A.1 becomes
$$W(\mathcal{P}(\mu_1, \pi_1), \mathcal{P}(\mu_2, \pi_2)) \le L_{\mu_1,\mu_2,\pi_2} \cdot W(\mu_1, \mu_2) + \tilde{L}_{\mu_1,\pi_1,\pi_2} \cdot W(\pi_1, \pi_2).$$
Applying this lemma recursively over time-points gives
$$W(\mathcal{P}(\mathcal{P}^t(\mu, \pi_1), \pi_1), \mathcal{P}(\mathcal{P}^t(\mu, \pi_2), \pi_2)) \le \tilde{L}_{\mathcal{P}^t(\mu,\pi_1),\pi_1,\pi_2} \cdot W(\pi_1, \pi_2) + L_{\mathcal{P}^t(\mu,\pi_1),\mathcal{P}^t(\mu,\pi_2),\pi_2} \cdot W(\mathcal{P}^t(\mu, \pi_1), \mathcal{P}^t(\mu, \pi_2)),$$
which simplifies notationally to
$$W(\mathcal{P}^{t}(\mu, \pi_1), \mathcal{P}^{t}(\mu, \pi_2)) \le \tilde{L}^{(t-1)}_{\mu,\pi_1,\pi_2} \cdot W(\pi_1, \pi_2) + L^{(t-1)}_{\mu,\pi_1,\pi_2} \cdot W(\mathcal{P}^{t-1}(\mu, \pi_1), \mathcal{P}^{t-1}(\mu, \pi_2)). \qquad (40)$$
These modifications lead Lemma A.2 to be updated accordingly into
$$W(\mathcal{P}^t(\mu, \pi_1), \mathcal{P}^t(\mu, \pi_2)) \le C^{(t)}_{L;\mu,\pi_1,\pi_2} \cdot W(\pi_1, \pi_2), \qquad \text{where} \qquad C^{(t)}_{L;\mu,\pi_1,\pi_2} := \sum_{k=1}^{t} \tilde{L}^{(t-k)}_{\mu,\pi_1,\pi_2} \prod_{i=1}^{k-1} L^{(t-i)}_{\mu,\pi_1,\pi_2}.$$
By a simple change of variables, we have the equivalent definition
$$C^{(t)}_{L;\mu,\pi_1,\pi_2} = \sum_{k=1}^{t} \tilde{L}^{(k-1)}_{\mu,\pi_1,\pi_2} \prod_{i=k}^{t-1} L^{(i)}_{\mu,\pi_1,\pi_2}.$$
We now replace the $\gamma L_\mu < 1$ assumption with the following assumption.

Assumption (Transition Dynamics Stability).
$$\sup_{t \ge 0} \; C^{(t)}_{L;\mu,\pi_1,\pi_2} = \sup_{t \ge 0} \; \sum_{k=1}^{t} \tilde{L}^{(k-1)}_{\mu,\pi_1,\pi_2} \prod_{i=k}^{t-1} L^{(i)}_{\mu,\pi_1,\pi_2} < \infty.$$

To help understand which $\{L^{(t)}_{\mu,\pi_1,\pi_2}\}_{t \ge 0}$ and $\{\tilde{L}^{(t)}_{\mu,\pi_1,\pi_2}\}_{t \ge 0}$ sequences can satisfy this assumption, we provide some examples:
• The sequences $\forall t : L^{(t)}_{\mu,\pi_1,\pi_2} = c_1 > 1$ and $\forall t : \tilde{L}^{(t)}_{\mu,\pi_1,\pi_2} = c_2 > 0$ violate the dynamics stability assumption.
• The sequences $\forall t : L^{(t)}_{\mu,\pi_1,\pi_2} \le 1$ and $\tilde{L}^{(t)}_{\mu,\pi_1,\pi_2} = O(1/t^2)$ satisfy the dynamics stability assumption.
• $\sup_t L^{(t)}_{\mu,\pi_1,\pi_2} < 1$ (together with bounded $\tilde{L}^{(t)}_{\mu,\pi_1,\pi_2}$, which is implied by Assumption 5) guarantees the dynamics stability assumption.
• $\forall t \ge t_0 : L^{(t)}_{\mu,\pi_1,\pi_2} < 1$ guarantees the dynamics stability assumption no matter (1) how large $t_0$ is (as long as it is finite), or (2) how large the members of the finite set $\{L^{(t)}_{\mu,\pi_1,\pi_2} \mid t < t_0\}$ are.

If the dynamics stability assumption holds with a constant $C_L$, one can define a constant $\tilde{L}_\mu$ such that $C_L = L_\pi \sum_{t=0}^{\infty} (\gamma \tilde{L}_\mu)^t$. We can then replace every $L_\mu$ instance in the rest of the proof with the corresponding $\tilde{L}_\mu$ constant, and the results remain unchanged in form. The $L^{(t)}_{\mu,\pi_1,\pi_2}$ and $\tilde{L}^{(t)}_{\mu,\pi_1,\pi_2}$ constants can be thought of as tighter versions of $L_\mu$ and $L_\pi$, but with dependence on $\pi_1$, $\pi_2$, $\mu$, and the time-point of application. Having $\gamma L_\mu < 1$ is a sufficient but not necessary condition for the dynamics stability assumption to hold. Roughly speaking, $L_\mu$ is an expansion rate for the state-distribution distance: it says how much a divergence in the state distribution can expand after a single application of the transition dynamics. Effective expansion rates larger than one throughout an infinite-horizon trajectory are a sign of system instability; a small change in the initial state distribution could cause the observations to diverge exponentially. While controlling unstable systems is an important and practical challenge, none of the existing reinforcement learning methods is capable of learning effective policies on such environments.
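The example sequences above can be checked numerically by evaluating the constant $C^{(t)}_{L;\mu,\pi_1,\pi_2}$ for growing $t$. The sketch below uses our reading of the definition, $C^{(t)} = \sum_{k=1}^{t} \tilde{L}^{(k-1)} \prod_{i=k}^{t-1} L^{(i)}$; the function names are ours:

```python
import numpy as np

def C_L(t, L, L_tilde):
    """C^(t) = sum_{k=1}^t  L_tilde(k-1) * prod_{i=k}^{t-1} L(i)."""
    total = 0.0
    for k in range(1, t + 1):
        total += L_tilde(k - 1) * np.prod([L(i) for i in range(k, t)])
    return total

# L(t) = 1.1 > 1 with a constant L_tilde: C^(t) grows without bound.
grow = [C_L(t, lambda i: 1.1, lambda i: 1.0) for t in (10, 40)]
# L(t) <= 1 with L_tilde(t) = O(1/t^2): C^(t) stays bounded (by pi^2/6 here).
stay = [C_L(t, lambda i: 1.0, lambda i: 1.0 / (i + 1) ** 2) for t in (10, 40)]
print(grow, stay)
```

The first pair grows geometrically with $t$ (an unstable expansion rate), while the second saturates, matching the violating and satisfying examples listed above.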
Roughly speaking, the dynamics stability assumption guarantees that the expansion rates cannot be consistently larger than one for infinitely many time steps.

A.6 CHOICE OF $C_1$ AND $C_2$

Since the TDPO algorithm operates using the metric Wasserstein distance, considering how normalizing actions and rewards affects the corresponding optimization objective builds insight into how to set these coefficients properly. Suppose we use the same dynamics, but the new actions are scaled up by a factor of $\beta$ and the rewards are scaled up by a factor of $\alpha$: $a_{\text{new}} = \beta \cdot a_{\text{old}}$ and $r_{\text{new}} = \alpha \cdot r_{\text{old}}$. If the policy function-approximation class remains the same, the policy gradient is scaled by a factor of $\alpha/\beta$ (i.e., $\frac{\partial \eta_{\text{new}}}{\partial a_{\text{new}}} = \frac{\alpha}{\beta} \cdot \frac{\partial \eta_{\text{old}}}{\partial a_{\text{old}}}$). Therefore, one can easily show that the corresponding new regularization coefficient and trust-region size are obtained by
$$C^{\text{new}} = \frac{\alpha}{\beta^2} \cdot C^{\text{old}}$$
and
$$\delta^{\text{new}}_{\max} = \beta \cdot \delta^{\text{old}}_{\max}.$$
We used equal regularization coefficients (i.e., $C_1 = C_2 = C$), and the process for choosing them can be summarized as follows: (1) define $C = 3600 \cdot \alpha \cdot \beta^{-2}$, $\delta_{\max} = \beta/600$, and $\sigma_q = \beta/60$ (where $\sigma_q$ is the action-disturbance parameter used by DeVine); (2) using prior knowledge, or by trial and error, determine appropriate action and reward normalization coefficients. The reward normalization coefficient $\alpha$ was chosen to be approximately the average per-step discounted reward difference between a null policy and an optimal policy. We used a reward scaling value of $\alpha = 5$, and action scaling values of $\beta = 5$ for the non-locally rewarded pendulum and $\beta = 1.5$ for the long-horizon legged robot. Both environments had a per-step discounted reward of approximately $-5$ for a null policy and non-positive rewards, justifying the choice of $\alpha$.

A.7 PROOF OF THEOREM 4.1

We restate Theorem 4.1 below for reference and now prove it. Theorem 4.1.
Assume a finite-horizon MDP with deterministic transition dynamics $P$ and deterministic initial distribution $\mu$, with maximal horizon length $H$. Define $K = H \cdot \dim(\mathcal{A})$, a uniform $\nu$, and $q(s; \sigma) = \pi_1(s) + \sigma e_j$ in Algorithm 2, with $e_j$ being the $j$-th basis element of $\mathcal{A}$. If the $(j, t_k)$ pairs are sampled to exactly cover $\{1, \ldots, \dim(\mathcal{A})\} \times \{1, \ldots, H\}$, then we have
$$\lim_{\sigma \to 0} \nabla_{\pi_2} \hat{A}^{\pi_1}(\pi_2)\big|_{\pi_2=\pi_1} = \nabla_{\pi_2} \eta^{\pi_2}\big|_{\pi_2=\pi_1}.$$

Proof. According to the advantage decomposition lemma, we have
$$\nabla_{\pi_2} \eta^{\pi_2}\big|_{\pi_2=\pi_1} = \frac{1}{1-\gamma} \mathbb{E}_{s \sim \rho^{\pi_1}_{\mu}}\left[ \nabla_{\pi_2} A^{\pi_1}(s, \pi_2) \right]\big|_{\pi_2=\pi_1}. \qquad (49)$$
Due to the fact that the transition dynamics, the policies $\pi_1$ and $\pi_2$, and the initial state distribution are all deterministic, we can simplify Equation (49) to
$$\nabla_{\pi_2} \eta^{\pi_2}\big|_{\pi_2=\pi_1} = \sum_{t=0}^{H-1} \gamma^t \cdot \nabla_{\pi_2} A^{\pi_1}(s_t, \pi_2)\big|_{\pi_2=\pi_1}, \qquad (50)$$
where $s_t$ is the state after applying the policy $\pi_1$ for $t$ time-steps. We can use the chain rule to write
$$\nabla_{\pi_2} A^{\pi_1}(s_t, \pi_2)\big|_{\pi_2=\pi_1} = \nabla_{\pi_2} A^{\pi_1}(s_t, a_t)\big|_{a_t=\pi_2(s_t),\, \pi_2=\pi_1} = \sum_{j=1}^{\dim(\mathcal{A})} \nabla_{\pi_2} a_t^{(j)} \cdot \frac{\partial}{\partial a_t^{(j)}} A^{\pi_1}(s_t, a_t)\big|_{a_t=\pi_2(s_t),\, \pi_2=\pi_1}. \qquad (51)$$
To recap, Equations (49), (50), and (51) can be summarized as
$$\nabla_{\pi_2} \eta^{\pi_2}\big|_{\pi_2=\pi_1} = \sum_{t=0}^{H-1} \gamma^t \sum_{j=1}^{\dim(\mathcal{A})} \nabla_{\pi_2} a_t^{(j)} \cdot \frac{\partial}{\partial a_t^{(j)}} A^{\pi_1}(s_t, a_t)\big|_{a_t=\pi_2(s_t),\, \pi_2=\pi_1}. \qquad (52)$$
Under the assumption that the $(j, t)$ pairs are sampled to exactly cover $\{1, \ldots, \dim(\mathcal{A})\} \times \{1, \ldots, H\}$, we can simplify the DeVine oracle to
$$\hat{A}^{\pi_1}(\pi_2) = \frac{1}{K} \sum_{t=0}^{H-1} \sum_{j=1}^{\dim(\mathcal{A})} \frac{\dim(\mathcal{A}) \cdot \gamma^t}{\nu(t)} \cdot \frac{(\pi_2(s_t) - \pi_1(s_t))^T (q(s_t; j, \sigma) - \pi_1(s_t))}{(q(s_t; j, \sigma) - \pi_1(s_t))^T (q(s_t; j, \sigma) - \pi_1(s_t))} \cdot A^{\pi_1}(s_t, q(s_t; j, \sigma)). \qquad (53)$$
From the $q$ definition, we have $q(s_t; j, \sigma) - \pi_1(s_t) = \sigma e_j$ and $(q(s_t; j, \sigma) - \pi_1(s_t))^T (q(s_t; j, \sigma) - \pi_1(s_t)) = \sigma^2$. Since $\nu$ is uniform (i.e., $\nu(t) = 1/H$) and $K = H \cdot \dim(\mathcal{A})$, we can take the policy gradient of Equation (53) and simplify it into
$$\nabla_{\pi_2} \hat{A}^{\pi_1}(\pi_2)\big|_{\pi_2=\pi_1} = \sum_{t=0}^{H-1} \sum_{j=1}^{\dim(\mathcal{A})} \gamma^t \cdot \nabla_{\pi_2} (\pi_2(s_t) - \pi_1(s_t))^T e_j \cdot \frac{A^{\pi_1}(s_t, q(s_t; j, \sigma))}{\sigma}. \qquad (54)$$
Since $A^{\pi_1}(s_t, \pi_1(s_t)) = 0$, we can write
$$\lim_{\sigma \to 0} \frac{A^{\pi_1}(s_t, q(s_t; j, \sigma))}{\sigma} = \lim_{\sigma \to 0} \frac{A^{\pi_1}(s_t, \pi_1(s_t) + \sigma e_j) - A^{\pi_1}(s_t, \pi_1(s_t))}{\sigma} = \frac{\partial}{\partial a_t^{(j)}} A^{\pi_1}(s_t, a_t)\big|_{a_t=\pi_1(s_t)}. \qquad (55)$$
Also, by the definition of the gradient, we can write
$$\nabla_{\pi_2} (\pi_2(s_t) - \pi_1(s_t))^T e_j = \nabla_{\pi_2} a_t^{(j)}. \qquad (56)$$
Combining Equations (55) and (56), and applying them to Equation (54), yields
$$\lim_{\sigma \to 0} \nabla_{\pi_2} \hat{A}^{\pi_1}(\pi_2)\big|_{\pi_2=\pi_1} = \sum_{t=0}^{H-1} \sum_{j=1}^{\dim(\mathcal{A})} \gamma^t \cdot \nabla_{\pi_2} a_t^{(j)} \cdot \frac{\partial}{\partial a_t^{(j)}} A^{\pi_1}(s_t, a_t)\big|_{a_t=\pi_2(s_t),\, \pi_2=\pi_1}. \qquad (57)$$
Finally, the theorem is obtained by comparing Equations (57) and (52).

A.8 QUADRATIC MODELING OF POLICY SENSITIVITY REGULARIZATION

First, we build insight into the nature of the term
$$L_{WG}(\pi_1, \pi_2; s) = W(\pi_2(a|s), \pi_1(a|s)) \times \left\| \nabla_{s'} W\!\left( \frac{\pi_2(a|s') + \pi_1(a|s)}{2}, \frac{\pi_2(a|s) + \pi_1(a|s')}{2} \right)\Big|_{s'=s} \right\|_2. \qquad (58)$$
It is fairly obvious that
$$W(\pi_2(a|s), \pi_1(a|s))\big|_{\pi_2=\pi_1} = 0.$$
If $\pi_2 = \pi_1$, then the two distributions $\frac{\pi_2(a|s') + \pi_1(a|s)}{2}$ and $\frac{\pi_2(a|s) + \pi_1(a|s')}{2}$ are identical no matter what $s'$ is. In other words,
$$\pi_1 = \pi_2 \;\Rightarrow\; \forall s' : \; W\!\left( \frac{\pi_2(a|s') + \pi_1(a|s)}{2}, \frac{\pi_2(a|s) + \pi_1(a|s')}{2} \right) = 0. \qquad (60)$$
This means that
$$\left\| \nabla_{s'} W\!\left( \frac{\pi_2(a|s') + \pi_1(a|s)}{2}, \frac{\pi_2(a|s) + \pi_1(a|s')}{2} \right)\Big|_{s'=s} \right\|_2 \Bigg|_{\pi_2=\pi_1} = 0.$$
The Taylor expansion of the squared Wasserstein distance can be written as
$$W(\pi_2(a|s), \pi_1(a|s))^2\big|_{\theta_2 = \theta_1 + \delta\theta} = \frac{1}{2} \delta\theta^T H_2 \delta\theta + \text{h.o.t.}$$
Considering (60), and similarly to the previous point, one can write the Taylor expansion
$$\left\| \nabla_{s'} W\!\left( \frac{\pi_2(a|s') + \pi_1(a|s)}{2}, \frac{\pi_2(a|s) + \pi_1(a|s')}{2} \right)\Big|_{s'=s} \right\|_2^2 \Bigg|_{\theta_2 = \theta_1 + \delta\theta} = \delta\theta^T H_1 \delta\theta + \text{h.o.t.}$$
According to the above, $L_{WG}$ is the geometric mean of two functions of quadratic order. Although this makes $L_{WG}$ itself of quadratic order (i.e., $\lim_{\delta\theta \to 0} \frac{L_{WG}(\alpha \delta\theta)}{L_{WG}(\delta\theta)} = \alpha^2$ holds for any constant $\alpha$), it does not guarantee that $L_{WG}$ is twice continuously differentiable w.r.t. the policy parameters, and $L_{WG}$ may not have a well-defined Hessian matrix (e.g., $f(x_1, x_2) = |x_1 x_2|$ is of quadratic order, yet is not twice differentiable).
To avoid this issue, we compromise on the local model. Using the AM–GM inequality, for any arbitrary positive $\alpha$, one can bound the $L_{WG}$ term by two quadratic terms:
$$L_{WG}(\pi_1, \pi_2; s) \le \frac{1}{2} \left[ \frac{1}{\alpha^2} W(\pi_2(a|s), \pi_1(a|s))^2 + \alpha^2 \left\| \nabla_{s'} W\!\left( \frac{\pi_2(a|s') + \pi_1(a|s)}{2}, \frac{\pi_2(a|s) + \pi_1(a|s')}{2} \right)\Big|_{s'=s} \right\|_2^2 \right].$$
Therefore, by defining
$$L_{G^2}(\pi_1, \pi_2; s) := \left\| \nabla_{s'} W\!\left( \frac{\pi_2(a|s') + \pi_1(a|s)}{2}, \frac{\pi_2(a|s) + \pi_1(a|s')}{2} \right)\Big|_{s'=s} \right\|_2^2,$$
$\tilde{C}_1 := \frac{C_1 \alpha^2}{2}$, and $\tilde{C}_2 := C_2 + \frac{C_1}{2\alpha^2}$, the new surrogate takes the twice-differentiable form
$$\tilde{L}^{\pi_1}(\pi_2) = \frac{1}{1-\gamma} \cdot \mathbb{E}_{s \sim \rho^{\pi_1}_{\mu}}\left[ A^{\pi_1}(s, \pi_2) \right] - \tilde{C}_1 \cdot \mathbb{E}_{s \sim \rho^{\pi_1}_{\mu}}\left[ L_{G^2}(\pi_1, \pi_2; s) \right] - \tilde{C}_2 \cdot \mathbb{E}_{s \sim \rho^{\pi_1}_{\mu}}\left[ W(\pi_2(a|s), \pi_1(a|s))^2 \right].$$
$\tilde{C}_1$ and $\tilde{C}_2$ are the corresponding regularization coefficients for the surrogate defined in (13). Because of the arbitrary $\alpha$ used in the bound, no constraint couples the $\tilde{C}_1$ and $\tilde{C}_2$ coefficients; they can therefore be chosen independently of each other.
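The normalization recipe of Section A.6 ($C = 3600 \cdot \alpha \cdot \beta^{-2}$, $\delta_{\max} = \beta/600$, $\sigma_q = \beta/60$) is mechanical enough to express as a small helper. The sketch below is illustrative and the function name is ours:

```python
# Hypothetical helper implementing the normalization recipe from A.6:
# given a reward scale alpha and an action scale beta, produce the TDPO
# regularization coefficient and the trust-region/exploration sizes.
def tdpo_coefficients(alpha, beta):
    C = 3600.0 * alpha / beta**2       # equal coefficients: C1 = C2 = C
    delta_max = beta / 600.0           # trust-region size
    sigma_q = beta / 60.0              # DeVine action-disturbance scale
    return C, delta_max, sigma_q

# Pendulum settings from the paper: alpha = 5, beta = 5.
print(tdpo_coefficients(5.0, 5.0))  # -> (720.0, 0.008333..., 0.08333...)
```

Note that the recipe automatically respects the scaling laws above: doubling $\beta$ quarters $C$ (matching $C^{\text{new}} = \alpha\beta^{-2} C^{\text{old}}$ behavior) and doubles $\delta_{\max}$.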

A.9 IMPLEMENTATION DETAILS FOR THE ENVIRONMENT WITH NON-LOCAL REWARDS

We used the stable-baselines implementation (Hill et al., 2018) , which has the same structure as the original OpenAI baselines (Dhariwal et al., 2017) implementation. We used the "ppo1" variant since no hardware acceleration was necessary for automatic differentiation and MPI parallelization was practically efficient. TDPO, TRPO, and PPO used the same function approximation architecture with two hidden layers, 64 units in each layer, and the tanh activation. TRPO, PPO, DDPG, and TD3 used their default hyper-parameter settings. We used Xavier initialization (Glorot & Bengio, 2010) for TDPO, and multiplied the outputs of the network by a factor of 0.001 so that the initial actions were small and nearly zero. We also confirmed that TDPO does not decrease in performance when using an identical network initialization to PPO/TRPO. TD3's baseline implementation was amended to support MPI parallelization just like TRPO, PPO, and DDPG. To produce the results for DDPG and TD3, we used hyperparameter optimization both with and without the tanh final activation function that is common for DDPG and TD3 (this causes the difference in initial payoff in the figures). However, under no conditions were DDPG and TD3 able to solve these environments effectively, suggesting that the deterministic search used by TDPO is operating in a qualitatively different way than the stochastic policy optimization used by DDPG and TD3. Note that we made a thorough attempt to compare DDPG and TD3 fairly, including trying different initializations, different final layer scalings/activations, different network architectures, and performing hyperparameter optimization. Mini-batch selection was unnecessary for TDPO since optimization for samples generated by DeVine was fully tractable. The confidence intervals in all figures were generated using 1000 samples of the statistics of interest. For designing the environment, we used Dhariwal et al. 
(2017)'s pendulum dynamics and relaxed the torque thresholds to be as large as 40 N·m. The environment also had the same episode length of 200 time-steps. We used the reward function described by the following equations:
$$R(s_t, a_t) = C_R \cdot R(\tau) \cdot \mathbb{1}\{t = 200\}$$
$$R(\tau) = R_{\text{Freq}}(\tau) + R_{\text{Offset}}(\tau) + R_{\text{Amp}}(\tau)$$
$$R_{\text{Freq}}(\tau) = 0.1 \cdot \sum_{f=f_{\min}}^{f_{\max}} \Theta^{+}_{\text{std}}(f)^2 - 1$$
$$R_{\text{Offset}}(\tau) = -\left| \frac{\Theta(f=0)}{200} - \theta^{\text{Target}}_{\text{Offset}} \right| = -\left| \frac{1}{200} \sum_{t=0}^{199} \theta_t - \theta^{\text{Target}}_{\text{Offset}} \right|$$
$$R_{\text{Amp}}(\tau) = h_{\text{piecewise}}\!\left( \frac{\Theta_{AC}}{\theta^{\text{Target}}_{\text{Amp}}} - 1 \right)$$
where
• $\theta$ is the pendulum angle signal in the time domain.
• $\Theta$ is the magnitude of the Fourier transform of $\theta$.
• $\Theta^{+}$ is the restriction of $\Theta$ to the positive frequency components.
• $\Theta_{AC}$ is the normalized oscillatory spectrum of $\Theta$: $\Theta_{AC} = \frac{\Theta^{+T} \Theta^{+}}{200}$.
• $h_{\text{piecewise}}$ is a piece-wise linear error penalization function: $h_{\text{piecewise}}(x) = -x \cdot \mathbb{1}\{x \ge 0\} + 10^{-4} x \cdot \mathbb{1}\{x < 0\}$.
• $\Theta^{+}_{\text{std}}$ is the standardized positive-amplitudes vector: $\Theta^{+}_{\text{std}} = \frac{\Theta^{+}}{\sqrt{\Theta^{+T} \Theta^{+} + 10^{-6}}}$.
• $C_R = 1.3 \times 10^4$ is a reward normalization coefficient, chosen to yield approximately the same payoff as a null policy would yield in the typical pendulum environment of Dhariwal et al. (2017).
• $\theta^{\text{Target}}_{\text{Offset}}$ is the target offset, $\theta^{\text{Target}}_{\text{Amp}}$ is the target amplitude, and $[f_{\min}, f_{\max}]$ is the target frequency range of the environment.

All methods used 48 parallel workers. The machines used Xeon E5-2690-v3 processors and 256 GB of memory. Each experiment was repeated 25 times for each method, and each run was given 6 hours or 500 million samples to finish.
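To make the frequency-domain reward concrete, the following is a simplified, hypothetical sketch of an $R_{\text{Freq}}$-style term: it standardizes the positive-frequency FFT magnitudes of the angle signal and rewards spectral energy concentrated in a target band. The constants and function names are ours, and the normalization is simplified relative to the paper's exact definition:

```python
import numpy as np

def standardized_spectrum(theta):
    """Magnitudes of the positive-frequency FFT components, normalized
    to (approximately) unit Euclidean norm, in the spirit of Theta+_std."""
    Theta_pos = np.abs(np.fft.rfft(theta))
    return Theta_pos / np.sqrt(Theta_pos @ Theta_pos + 1e-6)

def freq_reward(theta, f_lo, f_hi):
    """Simplified R_Freq-style term: maximized when all spectral energy
    falls inside the target band [f_lo, f_hi) of FFT bin indices."""
    s = standardized_spectrum(theta)
    return 0.1 * np.sum(s[f_lo:f_hi] ** 2) - 1.0

t = np.arange(200)
in_band = np.sin(2 * np.pi * 5 * t / 200)  # pure oscillation in FFT bin 5
print(freq_reward(in_band, 4, 7))          # near the maximum, 0.1 - 1 = -0.9
print(freq_reward(in_band, 20, 30))        # out of band: near -1.0
```

This illustrates why the reward is "non-local": it depends on the whole trajectory through its Fourier transform, so it cannot be decomposed into per-time-step terms.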

A.10 IMPLEMENTATION DETAILS FOR THE ENVIRONMENT WITH LONG HORIZON AND RESONANT FREQUENCIES

For the robotic leg, we used exactly the same algorithms with the same parameters as described in Section A.9 above. We used the reward function described by the following equations:
$$R = R_{\text{posture}} + R_{\text{velocity}} + R_{\text{foot offset}} + R_{\text{foot height}} + R_{\text{ground force}} + R_{\text{knee height}} + R_{\text{on-air torques}} \qquad (71)$$
with
$$R_{\text{posture}} = -1 \times \left( \left| \theta_{\text{knee}} + \tfrac{\pi}{2} \right| + \left| \theta_{\text{hip}} + \tfrac{\pi}{4} \right| \right)$$
$$R_{\text{velocity}} = -0.08 \times \left( |\omega_{\text{knee}}| + |\omega_{\text{hip}}| \right)$$
$$R_{\text{foot offset}} = -10 \times |x_{\text{foot}}| \cdot \mathbb{1}\{z_{\text{knee}} < 0.2\}$$
$$R_{\text{ground force}} = -1 \times |f_z - mg| \cdot \mathbb{1}\{f_z < mg\} \cdot \mathbb{1}_{\text{touchdown}}$$
$$R_{\text{foot height}} = -1 \times |z_{\text{foot}}| \cdot \mathbb{1}_{\text{touchdown}}$$
$$R_{\text{knee height}} = -15 \times \left| z_{\text{knee}} - z^{\text{target}}_{\text{knee}} \right| \cdot \mathbb{1}_{\text{touchdown}}$$
$$R_{\text{on-air torques}} = -10^{-4} \times (\tau^2_{\text{knee}} + \tau^2_{\text{hip}}) \cdot (1 - \mathbb{1}_{\text{touchdown}})$$
where
• $\theta_{\text{knee}}$ and $\theta_{\text{hip}}$ are the knee and hip angles in radians, respectively.
• $\omega_{\text{knee}}$ and $\omega_{\text{hip}}$ are the knee and hip angular velocities in radians per second, respectively.
• $x_{\text{foot}}$ and $z_{\text{foot}}$ are the horizontal and vertical foot offsets in meters from the desired standing point on the ground, respectively.
• $x_{\text{knee}}$ and $z_{\text{knee}}$ are the horizontal and vertical knee offsets in meters from the desired standing point on the ground, respectively.
• $f_z$ is the vertical ground reaction force on the robot in Newtons.
• $m$ is the robot mass in kilograms (i.e., $m = 0.76$ kg).
• $g$ is the gravitational acceleration in meters per second squared.
• $\mathbb{1}_{\text{touchdown}}$ is the indicator of whether the robot has ever touched the ground.
• $z^{\text{target}}_{\text{knee}}$ is a target knee height of 0.1 m.
• $\tau_{\text{knee}}$ and $\tau_{\text{hip}}$ are the knee and hip torques in Newton-meters, respectively.

All methods used 72 full trajectories between each policy update, and each run was given 16 hours of wall time, which corresponded to almost 500 million samples. This experiment was repeated 75 times for each method. The empirical means of the discounted payoff values are reported without any performance or seed filtering. The same hardware as in the non-local rewards experiments (i.e., Xeon E5-2690-v3 processors and 256 GB of memory) was used.
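The per-step reward terms above can be sketched as a single function. The snippet below is our illustrative reading of the definitions (in particular, it assumes absolute-value posture penalties); the state is a hypothetical dictionary rather than the MuJoCo observation:

```python
import numpy as np

def leg_reward(state):
    """Sketch of the shaped reward in A.10; `state` is a hypothetical dict
    with angles (rad), velocities (rad/s), offsets (m), force (N), etc."""
    m, g = 0.76, 9.81
    td = float(state["touched_down"])  # 1 once the foot has ever touched ground
    r = -1.0 * (abs(state["th_knee"] + np.pi / 2) + abs(state["th_hip"] + np.pi / 4))
    r += -0.08 * (abs(state["w_knee"]) + abs(state["w_hip"]))
    r += -10.0 * abs(state["x_foot"]) * (state["z_knee"] < 0.2)
    r += -1.0 * abs(state["f_z"] - m * g) * (state["f_z"] < m * g) * td
    r += -1.0 * abs(state["z_foot"]) * td
    r += -15.0 * abs(state["z_knee"] - 0.1) * td
    r += -1e-4 * (state["tau_knee"] ** 2 + state["tau_hip"] ** 2) * (1 - td)
    return r

# A perfect upright stance after touchdown incurs zero penalty.
ideal = dict(th_knee=-np.pi/2, th_hip=-np.pi/4, w_knee=0.0, w_hip=0.0,
             x_foot=0.0, z_foot=0.0, z_knee=0.1, f_z=0.76*9.81,
             tau_knee=0.0, tau_hip=0.0, touched_down=1.0)
print(leg_reward(ideal))  # -> 0.0
```

Every term is non-positive, consistent with the claim in A.6 that both environments have non-positive rewards.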

A.11 GYM SUITE BENCHMARKS

While it is clear that our deterministic policy gradient performs well on the new control environments we consider, one may naturally wonder about its performance on existing RL control benchmarks. We ran our method on a suite of Gym environments and include four representative examples in Figure 3. Broadly speaking, our method (TDPO) performs slightly worse on average than the others, but occasionally performs much better, as seen in the Swimmer-v3 environment. We speculate that these Gym environments are reasonably robust to injected noise; this may mean that stochastic policy gradients can more rapidly and efficiently explore the policy space, or that other algorithmic enhancements are needed for fully deterministic policy gradients in these cases. The experiments granted each method 72 parallel MPI workers for about 144 million steps (i.e., 2 million sequential steps), and the returns were averaged over 100 different seeds for each method. Since the computational cost of running both DDPG and TD3 was high, we only included TD3, as it was shown to outperform DDPG in earlier benchmarks.

A.12 RUNNING TIME COMPARISON

Figure 4 depicts a comparison of each method's running time per million steps. These plots show the combination of the simulation (i.e., environment sampling) and the optimization (i.e., computing the policy gradient and running the conjugate gradient solver) time. Our method (TDPO) is generally faster than the other algorithms. This is mainly due to the computational efficiency of the DeVine gradient estimator, which summarizes two full trajectories in a single state-action-advantage tuple and can thus significantly reduce the optimization time.
That being said, these relative comparisons could vary to a large extent (1) under different processor architectures, (2) with more (or less) efficient implementations, or (3) when running environments whose simulation time constitutes a significantly larger (or smaller) portion of the total running time.

A.13 OTHER SWINGING PENDULUM VARIANTS

Multiple variants of the pendulum with non-local rewards were used, each with different frequency targets and the same reward structure. Table 3 summarizes the target characteristics of each variant. The main variant was shown in the paper. Figures 5, 6, 7, 8, 9, 10, 11, and 12 show similar results for the second through ninth variants. To highlight our method's ability to solve all of these variants efficiently, Figure 13 shows the performance of our method (TDPO) on all variants. Overall, we found that TRPO, PPO, DDPG, and TD3 occasionally found the correct offset; they excited either the natural or the maximum (not the desired) frequency of the pendulum, but they were unable to drive the desired frequency and amplitude. TDPO achieved the desired oscillations (and thus high rewards) in all variants.

A.14 NOTES ON HOW TO IMPLEMENT TDPO

In short, our method (TDPO) is structured in the same way TRPO is; both TDPO and TRPO use policy gradient estimation and a conjugate-gradient solver utilizing Hessian-vector-product machinery. On the other hand, there are some algorithmic differences that distinguish TDPO from TRPO. First, TRPO uses line-search heuristics to adaptively find the update scale; no such heuristics are applied in TDPO. Second, TDPO uses the DeVine advantage estimator, which requires storing and reloading pseudo-random generator states. Finally, the Hessian-vector-product machinery used in TDPO computes Wasserstein-vector products, which is slightly different from that used in TRPO.
The hyper-parameter settings, and notes on how to choose them, were discussed in Sections A.9, A.6, and A.10. Next, we describe how to implement TDPO, focusing on the subtle differences between TDPO and TRPO. As for the state-reset capability, our algorithm does not require access to a reset function for arbitrary states. Instead, we only require the ability to restart from the prior trajectory's initial state. Many environments, including the Gym environments, instantiate their own pseudo-random generators and use only that generator for all randomized operations. This facilitates a straightforward implementation of the DeVine oracle: in such environments, implementing arbitrary state-reset functionality is unnecessary, and reloading the pseudo-random generator to its configuration prior to the trajectory suffices. In other words, the DeVine oracle can store the configuration of the pseudo-random generator before asking for a trajectory reset, and then start sampling. Once the main trajectory is finished, the pseudo-random generator can be reloaded, thus producing the same initial state upon a reset request. Other time-step states can then be recovered by applying the same preceding action sequence. To optimize the quadratic surrogate, the conjugate gradient solver was used. Implementing the conjugate gradient algorithm is fairly straightforward, and it is already included in many common automatic differentiation libraries. The conjugate gradient solver is well suited to situations where (1) the Hessian matrix is too large to be stored efficiently in memory, and (2) the Hessian matrix includes many nearly identical eigenvalues. Both of these conditions apply to TDPO, as well as to TRPO. Instead of requiring the full Hessian matrix to be stored, the conjugate gradient solver only requires Hessian-vector-product machinery $v \mapsto Hv$, which must be implemented specifically for TDPO.
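The pseudo-random-generator bookkeeping described above can be sketched with a toy environment whose only randomness comes from its own generator. Saving and restoring the generator state reproduces the same initial state on the next reset; the environment class here is hypothetical:

```python
import numpy as np

class ToyEnv:
    """Hypothetical environment whose only randomness is its own RNG,
    as is the case for typical Gym environments."""
    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
    def reset(self):
        self.state = self.rng.normal(size=2)  # random initial state
        return self.state.copy()

env = ToyEnv()
saved = env.rng.bit_generator.state   # store the generator state pre-reset
s_main = env.reset()                  # main trajectory's initial state

env.rng.bit_generator.state = saved   # reload before the vine rollout
s_vine = env.reset()                  # reproduces the same initial state
print(np.array_equal(s_main, s_vine))
```

From this shared initial state, replaying the same preceding action sequence recovers any intermediate time-step state without requiring an arbitrary-state reset API.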
Our surrogate function can be viewed as $L(\delta\theta) = g^T \delta\theta + \frac{\tilde{C}_2}{2} \delta\theta^T H \delta\theta$, where the Hessian matrix can be defined as
$$H = H_2 + \frac{\tilde{C}_1}{\tilde{C}_2} H_1, \qquad H_1 := \nabla^2_{\theta}\, \mathbb{E}_{s \sim \rho^{\pi_k}_{\mu}}\left[ L_{G^2}(\pi', \pi_k; s) \right], \qquad H_2 := \nabla^2_{\theta}\, \mathbb{E}_{s \sim \rho^{\pi_k}_{\mu}}\left[ W(\pi'(a|s), \pi_k(a|s))^2 \right].$$
To construct Hessian-vector-product machinery $v \mapsto Hv$, one can design an automatic-differentiation procedure that returns the Hessian-vector product. Many automatic-differentiation packages already include functionality that provides Hessian-vector-product machinery for a given scalar loss function without computing the Hessian matrix. This can be used to implement the Hessian-vector product in a straightforward manner: one only needs to provide the scalar quadratic terms of our surrogate, and the Hessian-vector-product machinery is obtained in return. On the other hand, this may not be the most computationally efficient approach, as our problem exhibits a more specific structure. Alternatively, one can implement a more elaborate, specifically designed Hessian-vector-product machinery by following these three steps:
• Compute the Wasserstein-vector product $v \mapsto H_2 v$ according to Algorithm 3.
• Compute the sensitivity-vector product $v \mapsto H_1 v$ according to Algorithm 4.
• Return the weighted sum of $H_1 v$ and $H_2 v$ as the final Hessian-vector product $Hv$.
One may also need to add conjugate gradient damping to the conjugate gradient solver (i.e., return $\beta v + Hv$ for some small $\beta$, as opposed to returning $Hv$ exactly), as is also done in the TRPO method. This may be important when the number of policy parameters is much larger than the sample size. Setting $\beta = 0$ may yield poor numerical stability if $H$ has small eigenvalues, while setting $\beta$ large causes the conjugate gradient optimizer to mimic the gradient descent optimizer by making updates in the same direction as the gradient. The optimal conjugate gradient damping may depend on the problem and on other hyper-parameters such as the sample size.
However, β can easily be picked as a small value that ensures numerical stability.

Once the conjugate gradient solver has returned the optimal update direction H^{-1} g, it must be scaled down by a factor of C_2 (i.e., δθ* = H^{-1} g / C_2). If δθ* satisfies the trust region criterion (i.e., (1/2) δθ*^T H δθ* ≤ δ²_max), then one can make the parameter update θ_new = θ_old + δθ* and proceed to the next iteration. Otherwise, the proposed update δθ* must be scaled down further by a factor α chosen such that the trust region condition holds with equality (i.e., (1/2) (α δθ*)^T H (α δθ*) = δ²_max) before making the update θ_new = θ_old + α δθ*.

Algorithm 3 Wasserstein-Vector-Product Machinery
Require: Current policy π' with parameters θ.
Require: A vector v with the same dimensions as θ.
Require: An observation s.
1: Compute the action for the observation s, with |A| elements:

a_{|A|×1} := [π'^{(1)}(s), ..., π'^{(|A|)}(s)]^T.

This vector should be capable of propagating gradients back to the policy parameters when used in automatic-differentiation software.
2: Define t to be a constant vector with the same shape as a. It can be populated with any values, such as all ones.
3: Define the scalar ã := a^T t.
4: Using back-propagation, find the gradient ∇_θ ã = Σ_{i=1}^{|A|} t_i ∇_θ a_i = Σ_{i=1}^{|A|} t_i [∂a_i/∂θ_1, ..., ∂a_i/∂θ_{|Θ|}].
5: Compute the dot product ⟨∇_θ ã, v⟩ = (Σ_{i=1}^{|A|} t_i ∂a_i/∂θ_1) v_1 + ... + (Σ_{i=1}^{|A|} t_i ∂a_i/∂θ_{|Θ|}) v_{|Θ|}.
6: Using automatic differentiation, take the gradient with respect to the t vector:

ã_{θ,v} := ∇_t ⟨∇_θ ã, v⟩ = [Σ_{j=1}^{|Θ|} (∂a_1/∂θ_j) v_j, ..., Σ_{j=1}^{|Θ|} (∂a_{|A|}/∂θ_j) v_j]^T = (∂a/∂θ) v.

7: Compute the dot product ⟨ã_{θ,v}, a⟩.
8: Using back-propagation, take the gradient with respect to θ, and return it as the Wasserstein-vector product.

Algorithm 4 Sensitivity-Vector-Product Machinery
Require: Current policy π' with parameters θ.
Require: A vector v with the same dimensions as θ.
Require: An observation s.
1: Compute the action-to-observation Jacobian matrix

J_{|A|×|S|} := [∂π'^{(i)}(s) / ∂s^{(j)}]_{i=1,...,|A|; j=1,...,|S|}. (79)

This can be done either with finite differences in the observation,

∂π'^{(i)}(s) / ∂s^{(j)} ≈ (π'^{(i)}(s + ds · e_j) - π'^{(i)}(s)) / ds, (80)

(which may be slightly numerically inaccurate), or with automatic differentiation. In either case, this matrix should be a parameter tensor capable of propagating gradients back to the parameters when used in automatic-differentiation software.
2: Define J̄ to be the vectorized (i.e., reshaped into a column) J matrix, with |AS| = |A| × |S| rows and one column.
3: Define t to be a constant vector with the same shape as J̄. It can be populated with any values, such as all ones.
4: Define the scalar J̃_t := J̄^T t.
5: Using back-propagation, find the gradient ∇_θ J̃_t = Σ_{i=1}^{|A|} Σ_{j=1}^{|S|} t_{i,j} ∇_θ J_{i,j} = Σ_{i,j} t_{i,j} [∂J_{i,j}/∂θ_1, ..., ∂J_{i,j}/∂θ_{|Θ|}].
6: Compute the dot product ⟨∇_θ J̃_t, v⟩ = (Σ_{i,j} t_{i,j} ∂J_{i,j}/∂θ_1) v_1 + ... + (Σ_{i,j} t_{i,j} ∂J_{i,j}/∂θ_{|Θ|}) v_{|Θ|}. (82)
7: Using automatic differentiation, take the gradient with respect to the t vector:

(∇_θ J) v := ∇_t ⟨∇_θ J̃_t, v⟩ = [Σ_{k=1}^{|Θ|} (∂J_{1,1}/∂θ_k) v_k, ..., Σ_{k=1}^{|Θ|} (∂J_{|A|,|S|}/∂θ_k) v_k]^T.

8: Reshape (∇_θ J) v into a column vector and name it J̃_{θ,v}.
9: Compute the dot product ⟨J̃_{θ,v}, J̄⟩.
10: Using back-propagation, take the gradient with respect to θ, and return it as the gain-vector product.
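The step-size logic described before Algorithm 3 (scale the conjugate gradient solution by 1/C_2, then shrink by α whenever the trust region is violated) can be sketched in a few lines. This is a numpy illustration on a toy quadratic, not the TDPO code; in the real method the direction comes from the matrix-free solver rather than `np.linalg.solve`:

```python
import numpy as np

def trust_region_step(H, g, c2, delta_max):
    """Scale the solution of H x = g by 1/C_2, then shrink it so that
    (1/2) * dtheta^T H dtheta <= delta_max^2 holds."""
    dtheta = np.linalg.solve(H, g) / c2        # in TDPO this comes from CG
    radius = 0.5 * dtheta @ H @ dtheta
    if radius > delta_max ** 2:
        # Pick alpha so that (1/2)(alpha*d)^T H (alpha*d) = delta_max^2.
        alpha = delta_max / np.sqrt(radius)
        dtheta = alpha * dtheta
    return dtheta

H = np.array([[4.0, 1.0], [1.0, 3.0]])
g = np.array([2.0, 1.0])
step = trust_region_step(H, g, c2=1.0, delta_max=0.1)
```

The quadratic form (1/2) δθ^T H δθ is cheap to evaluate here because H δθ is exactly the Hessian-vector product that the solver already provides.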



Non-local rewards are reward functions of the entire trajectory whose payoffs cannot be decomposed into a sum of terms η = Σ_t f_t(s_t, a_t), where each f_t depends only on nearby states and actions. An example of a non-local reward is one that depends on the Fourier transform of the complete trajectory signal. Resonant frequencies are a concept from control theory: in the frequency domain, signals of certain frequencies are excited more than others when applied to a system. This is captured by the frequency-domain transfer function of the system, which may have a peak of magnitude greater than one. The resonant frequency is the frequency at which the frequency-domain transfer function has the highest amplitude. Common examples of systems with a resonant frequency include the undamped pendulum, which oscillates at its natural frequency, and RLC circuits, which have characteristic frequencies at which they are most excitable. See Chapter 8 of Kuo & Golnaraghi (2002) for more information.
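To make the notion concrete, a non-local reward of this kind can be sketched as follows. This is a hypothetical numpy example, not the paper's exact reward function: it scores a trajectory by how close its dominant frequency is to a target.

```python
import numpy as np

def frequency_reward(trajectory, target_freq, dt=0.01):
    """Score a 1-D trajectory by the distance between its dominant
    oscillation frequency and a target frequency.

    The reward depends on the FFT of the whole signal, so it cannot be
    written as a sum of per-time-step terms f_t(s_t, a_t)."""
    signal = trajectory - np.mean(trajectory)      # remove the DC offset
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=dt)
    dominant = freqs[np.argmax(spectrum)]
    return -abs(dominant - target_freq)

# A pure 2 Hz sinusoid scores better against a 2 Hz target than a 5 Hz one.
t = np.arange(0, 4.0, 0.01)
good = frequency_reward(np.sin(2 * np.pi * 2.0 * t), target_freq=2.0)
bad = frequency_reward(np.sin(2 * np.pi * 5.0 * t), target_freq=2.0)
```

Because the score is a function of the spectrum of the full signal, no per-time-step credit assignment is available to the learner, which is what makes such rewards hard for standard policy gradient methods.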



Figure 1: Results for the simple pendulum with non-local rewards. Upper panel: training curves with empirical discounted payoffs. Lower panels: trajectories in both the time domain and frequency domain, showing target values of oscillation frequency, amplitude, and offset.

γ: The discount factor of the MDP.
R : S × A → R: The reward function of the MDP.
µ: The initial state distribution of the MDP over the state space.
∆: ∆(F) is the set of all probability distributions over the arbitrary set F (otherwise known as the credal set of F).
π: In general, π denotes the policy of the MDP; however, the output argument type can vary in the text. See the next lines.
π : S → ∆(A): Given a state s ∈ S, π(s) and π(·|s) denote the action distribution suggested by the policy π.

P : ∆(S) × A → ∆(S): A generalization of the transition dynamics that accepts state distributions as input; in other words, P(ν_s, a) := E_{s∼ν_s}[P(s, a)].
P : S × ∆(A) → ∆(S): A generalization of the transition dynamics that accepts action distributions as input; in other words, P(s, ν_a) := E_{a∼ν_a}[P(s, a)].
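For a finite MDP, both generalizations reduce to simple averaging over the input distribution. A small numpy sketch with assumed toy dynamics (3 states, 2 actions; the transition tensor is made up for illustration):

```python
import numpy as np

# Toy finite MDP: P[s, a] is a distribution over next states.
P = np.array([
    [[0.9, 0.1, 0.0], [0.2, 0.8, 0.0]],
    [[0.0, 0.5, 0.5], [0.3, 0.3, 0.4]],
    [[0.1, 0.0, 0.9], [0.0, 0.6, 0.4]],
])

def P_state_dist(nu_s, a):
    """P(nu_s, a) := E_{s ~ nu_s}[P(s, a)], dynamics lifted to state distributions."""
    return np.einsum('s,st->t', nu_s, P[:, a, :])

def P_action_dist(s, nu_a):
    """P(s, nu_a) := E_{a ~ nu_a}[P(s, a)], dynamics lifted to action distributions."""
    return np.einsum('a,at->t', nu_a, P[s, :, :])

nu_s = np.array([0.5, 0.25, 0.25])
out = P_state_dist(nu_s, a=0)
```

The output is again a valid distribution over next states, which is exactly why the notation treats P interchangeably on points and on distributions.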

Figure 3: Results for the gym suite benchmarks.

Figure 4: Training time comparison across environments. The lower the bar, the faster the method. The vertical axis shows the time in seconds needed to consume one million state-action pairs during training. Each environment is shown in a separate subplot.

Figure 5: Results for the second variant of the simple pendulum with non-local rewards. Upper panel: training curves with empirical discounted payoffs. Lower panels: trajectories in both the time domain and frequency domain, showing target values of oscillation frequency, amplitude, and offset.

Figure 9: Results for the sixth variant of the simple pendulum with non-local rewards. Upper panel: training curves with empirical discounted payoffs. Lower panels: trajectories in both the time domain and frequency domain, showing target values of oscillation frequency, amplitude, and offset.

Figure 11: Results for the eighth variant of the simple pendulum with non-local rewards. Upper panel: training curves with empirical discounted payoffs. Lower panels: trajectories in both the time domain and frequency domain, showing target values of oscillation frequency, amplitude, and offset.

Table 2 of the appendix summarizes all MDP notation. The value function of π is defined as

V^π(s) := E[ Σ_{t=0}^∞ γ^t R(s_t, a_t) | s_0 = s, a_t ∼ π(·|s_t), s_{t+1} ∼ P(s_t, a_t) ].
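Under deterministic dynamics and a deterministic policy (the setting of this paper), the empirical value of a single rollout is just the discounted sum of its rewards. A minimal numpy sketch with hypothetical per-step rewards:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t along a single (deterministic) rollout."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(discounts @ rewards)

# Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25.
ret = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```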

Require: The squared Wasserstein regularization coefficient C_2, the secondary regularization coefficient ratio C_1/C_2, and a trust region radius δ_max.
Require: Initial policy π_0.

Table 2: The mathematical notations used throughout the paper.

The target oscillation characteristics used to define different pendulum swinging environments.


Appendix

A.2 BOUNDING W(P_t(µ, π_1), P_t(µ, π_2))

To review, the dynamical smoothness assumptions were

W(P(µ, π_1), P(µ, π_2)) ≤ L_π · W(π_1, π_2), (5)
W(P(µ_1, π), P(µ_2, π)) ≤ L_µ · W(µ_1, µ_2). (6)

The following lemma states that these two assumptions are equivalent to a more concise assumption. This will be used to bound the t-step visitation distance and prove Lemma A.2.

Lemma A.1. Assumptions (5) and (6) are equivalent to having

W(P(µ_1, π_1), P(µ_2, π_2)) ≤ L_π · W(π_1, π_2) + L_µ · W(µ_1, µ_2). (17)

Proof. To prove the (5), (6) ⇒ (17) direction, the triangle inequality for the Wasserstein distance gives

W(P(µ_1, π_1), P(µ_2, π_2)) ≤ W(P(µ_1, π_1), P(µ_1, π_2)) + W(P(µ_1, π_2), P(µ_2, π_2)), (18)

and using (5), (6), and (18) then implies (17). The other direction is trivial.

Lemma A.2. Under Assumptions (5) and (6) we have the bound

W(P_t(µ, π_1), P_t(µ, π_2)) ≤ L_π · (Σ_{i=0}^{t-1} L_µ^i) · W(π_1, π_2), (20)

where P_t(µ, π) denotes the state distribution after running the MDP for t time-steps with the initial state distribution µ and policy π.

Proof. For t = 1, the lemma is equivalent to Assumption (5). This paves the way for the lemma to be proved using induction. The hypothesis is

W(P_t(µ, π_1), P_t(µ, π_2)) ≤ L_π · (Σ_{i=0}^{t-1} L_µ^i) · W(π_1, π_2), (21)

and for the induction step we write

P_{t+1}(µ, π) = P(P_t(µ, π), π). (22)

Using Assumption (17), which according to Lemma A.1 is equivalent to Assumptions (5) and (6), we can combine (21) and (22) into

W(P_{t+1}(µ, π_1), P_{t+1}(µ, π_2)) ≤ L_π · W(π_1, π_2) + L_µ · W(P_t(µ, π_1), P_t(µ, π_2)).

Thus, by applying the induction hypothesis (21), we have

W(P_{t+1}(µ, π_1), P_{t+1}(µ, π_2)) ≤ L_π · (1 + L_µ · Σ_{i=0}^{t-1} L_µ^i) · W(π_1, π_2),

which can be simplified into the lemma statement (i.e., Inequality (20)).

Lemma A.2 suggests making the γL_µ < 1 assumption and paves the way for Theorem A.4. The γL_µ < 1 assumption is overly restrictive and unnecessary, but it makes the rest of the proof easier to follow. This assumption can be relaxed by a general transition-dynamics stability assumption, which is discussed in more detail in Section A.5.3, where an equivalent γL̃_µ < 1 assumption is introduced to replace γL_µ < 1.

Next, we introduce Lemma A.3, which will be used in the proof of Theorem A.4.

Lemma A.3. The Wasserstein distance between linear combinations of distributions, with mixture weights α_i ≥ 0 and Σ_i α_i = 1, can be bounded as

W(Σ_i α_i µ_i, Σ_i α_i ν_i) ≤ Σ_i α_i · W(µ_i, ν_i).

Proof. Substituting the corresponding mixture of couplings (i.e., Σ_i α_i ξ_i, where ξ_i is the optimal coupling between µ_i and ν_i) in the Wasserstein definition yields the result.
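For deterministic scalar dynamics the Wasserstein distances reduce to absolute differences, so the geometric bound of Lemma A.2 can be checked numerically. Below is a hedged sketch under assumed linear dynamics s' = L_µ · s + π(s) with two constant policies, which makes the policy map 1-Lipschitz (L_π = 1); the constants are arbitrary illustrative choices:

```python
# Numerically check the bound of Lemma A.2:
#   W(P_t(mu, pi1), P_t(mu, pi2)) <= L_pi * sum_{i<t} L_mu^i * W(pi1, pi2)
# for deterministic scalar dynamics, where all distances are |.| of reals.
L_mu, L_pi = 0.8, 1.0          # Lipschitz constants of the assumed dynamics
u1, u2 = 0.3, -0.2             # two constant deterministic policies
s1 = s2 = 0.0                  # shared initial state, so W at t = 0 is zero

for t in range(1, 30):
    s1 = L_mu * s1 + u1        # s' = L_mu * s + pi(s)
    s2 = L_mu * s2 + u2
    dist = abs(s1 - s2)        # Wasserstein distance between point masses
    bound = L_pi * sum(L_mu ** i for i in range(t)) * abs(u1 - u2)
    assert dist <= bound + 1e-9
```

For this additive-control example the bound is in fact tight (the recursion attains it), which illustrates that the geometric sum in (20) cannot be improved without further assumptions.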

