TRULY DETERMINISTIC POLICY OPTIMIZATION

Abstract

In this paper, we present a policy gradient method that avoids exploratory noise injection and performs policy search over the deterministic landscape. By avoiding noise injection, all sources of estimation variance can be eliminated in systems with deterministic dynamics (up to the initial state distribution). Since deterministic policy regularization is impossible using traditional non-metric measures such as the KL divergence, we derive a Wasserstein-based quadratic model for our purposes. We state conditions on the system model under which it is possible to establish a monotonic policy improvement guarantee, propose a surrogate function for policy gradient estimation, and show that it is possible to compute exact advantage estimates if both the state transition model and the policy are deterministic. Finally, we describe two novel robotic control environments, one with non-local rewards in the frequency domain and the other with a long horizon (8000 time-steps), for which our policy gradient method (TDPO) significantly outperforms existing methods (PPO, TRPO, DDPG, and TD3).



Policy Gradient (PG) methods can be broadly characterized by three defining elements: the policy gradient estimator, the regularization measure, and the exploration profile. For gradient estimation, episodic (Williams, 1992), importance-sampling-based (Schulman et al., 2015a), and deterministic (Silver et al., 2014) gradients are some of the most common estimation oracles. As for regularization measures, either a Euclidean distance within the parameter space (Williams, 1992; Silver et al., 2014; Lillicrap et al., 2015) or dimensionally consistent non-metric measures (Schulman et al., 2015a; Kakade & Langford, 2002; Schulman et al., 2017; Kakade, 2002; Wu et al., 2017) have been frequently adopted. Common exploration profiles include Gaussian noise (Schulman et al., 2015a) and stochastic processes (Lillicrap et al., 2015). These elements form the basis of many model-free, stochastic policy optimization methods that can successfully learn high-dimensional policy parameters.

Both stochastic and deterministic policy search can be useful in applications. A stochastic policy has the effect of smoothing or filtering the policy landscape, which is desirable for optimization. Searching through stochastic policies has enabled the effective control of challenging environments under a general framework (Schulman et al., 2015a; 2017). The same method can either learn robotic movements or play basic games (1) with minimal domain-specific knowledge, (2) regardless of the function approximation class, and (3) with little human intervention (aside from reward engineering and hyper-parameter tuning) (Duan et al., 2016). Using stochasticity for exploration, although it introduces approximations and variance, has provided a robust way to actively search for higher rewards. Despite many successes, there are practical environments which remain challenging for current policy gradient methods.
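The smoothing effect of stochastic policies mentioned above can be illustrated with a small numerical sketch (a toy example, not from this paper; the landscape, noise scale, and sample size below are arbitrary choices): averaging a non-smooth reward over Gaussian parameter noise yields a gradually varying surrogate landscape, even across a discontinuity.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(theta):
    # A toy non-smooth policy landscape: a step plus a narrow spike.
    return np.where(theta > 0.0, 1.0, 0.0) + np.exp(-((theta - 0.5) ** 2) / 1e-4)

def smoothed_reward(theta, sigma=0.3, n=20000):
    # Expected reward under Gaussian exploration noise on the parameter:
    # J_sigma(theta) = E_eps[ r(theta + eps) ],  eps ~ N(0, sigma^2).
    eps = rng.normal(0.0, sigma, size=n)
    return reward(theta + eps).mean()

# The smoothed landscape varies gradually even across the discontinuity at 0,
# which makes gradient-based search far better behaved than on r itself.
vals = [smoothed_reward(t) for t in (-0.2, 0.0, 0.2)]
assert vals[0] < vals[1] < vals[2]
```

The same smoothing is also the source of the estimation variance discussed later: the surrogate expectation must be estimated from noisy rollouts.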
For example, non-local rewards (e.g., those defined in the frequency domain), long time horizons, and naturally resonant environments all occur in realistic robotic systems (Kuo & Golnaraghi, 2002; Meirovitch, 1975; Preumont & Seto, 2008) but can present issues for policy gradient search. To tackle challenging environments such as these, this paper considers policy gradient methods based on deterministic policies and deterministic gradient estimates, which could offer advantages by allowing the estimation of global reward gradients over long horizons without the need to inject noise into the system for exploration. To facilitate a dimensionally consistent and low-variance deterministic policy search, a compatible policy gradient estimator and a metric regularization measure should be employed. For gradient estimation we focus on Vine estimators (Schulman et al., 2015a), which can be easily applied to deterministic policies. As a metric measure we use the Wasserstein distance, which can measure meaningful distances between deterministic policy functions that have non-overlapping supports (in contrast to the Kullback-Leibler (KL) divergence and the Total Variation (TV) distance). The Wasserstein metric has seen substantial recent application in a variety of machine-learning domains, such as the stable training of generative adversarial models (Arjovsky et al., 2017). Theoretically, this metric has been studied in the context of Lipschitz-continuous Markov decision processes in reinforcement learning (Hinderer, 2005; Ferns et al., 2012). Pirotta et al. (2015) defined a policy gradient method using the Wasserstein distance by relying on Lipschitz continuity assumptions with respect to the policy gradient itself. Furthermore, for Lipschitz-continuous Markov decision processes, Asadi et al. (2018) and Rachelson & Lagoudakis (2010) used the Wasserstein distance to derive model-based value-iteration and policy-iteration methods, respectively. On a more practical note, Pacchiano et al. (2019) utilized Wasserstein regularization for behavior-guided stochastic policy optimization, and Abdullah et al. (2019) proposed a robust stochastic policy gradient formulation. Estimating the Wasserstein distance for general distributions is more complicated than typical KL divergences (Villani, 2008), a fact that underlines the contributions of Abdullah et al. (2019) and Pacchiano et al. (2019). However, for our deterministic observation-conditional policies, closed-form computation of Wasserstein distances is possible without any approximation.
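Why the Wasserstein distance is the appropriate metric for deterministic policies can be made concrete with a toy one-dimensional sketch (the actions a1 and a2 below are arbitrary values, not from the paper): two deterministic policies at the same state are Dirac action distributions with disjoint supports, so the KL divergence is infinite and the TV distance saturates at 1 regardless of how close the actions are, while the 1-Wasserstein distance equals the action gap and is available in closed form.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two deterministic policies mapping the same state to different actions
# can be viewed as Dirac distributions delta_{a1} and delta_{a2}.
a1, a2 = 0.30, 0.55

# The 1-Wasserstein distance between Diracs is simply |a1 - a2| -- finite,
# and computable in closed form with no sampling or approximation.
w = wasserstein_distance([a1], [a2])
assert np.isclose(w, abs(a1 - a2))

# By contrast, KL(delta_{a1} || delta_{a2}) is infinite (disjoint supports),
# and the total-variation distance is 1 whenever a1 != a2, so neither
# measure distinguishes nearby deterministic policies from distant ones.
tv = 0.0 if a1 == a2 else 1.0
assert tv == 1.0
```

This is the sense in which only a metric on the action space, not a density-ratio-based measure, can regularize a deterministic policy update.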
Existing deterministic policy gradient methods (e.g., DDPG and TD3) use deterministic policies (Silver et al., 2014; Lillicrap et al., 2015; Fujimoto et al., 2018), meaning that they learn a deterministic policy function from states to actions. However, such methods still use stochastic search (i.e., they add stochastic noise to their deterministic actions to force exploration during policy search). In contrast, we will be interested in a method which not only uses deterministic policies, but also uses deterministic search (i.e., without constant stochastic noise injection). We call this new method Truly Deterministic Policy Optimization (TDPO); as we will show in numerical examples, it may have lower estimation variance and better scalability to long horizons. Scalability to long horizons is one of the most challenging aspects for policy gradient methods that use stochastic search. This issue is sometimes referred to as the curse of horizon in reinforcement learning (Liu et al., 2018). General worst-case analyses suggest that the sample complexity of reinforcement learning is exponential with respect to the horizon length (Kakade et al., 2003; Kearns et al., 2000; 2002). Deriving polynomial lower-bounds for the sample complexity of reinforcement learning methods is still an open problem (Jiang & Agarwal, 2018). Lower-bounding the sample complexity of reinforcement learning for long horizons under different settings and simplifying assumptions has been a topic of theoretical research (Dann & Brunskill, 2015; Wang et al., 2020). Some recent work has examined the scalability of importance sampling gradient estimators to long horizons in terms of both theoretical and practical estimator variances (Liu et al., 2018; Kallus & Uehara, 2019; 2020).
All in all, long horizons are challenging for all reinforcement learning methods, especially those suffering from excessive estimation variance due to the use of stochastic policies for exploration, and our truly deterministic method may have advantages in this respect. In this paper we focus on continuous-domain robotic environments with the capability to reset to previously visited states. The main contributions of this work are: (1) we introduce a Deterministic Vine (DeVine) policy gradient estimator which avoids constant exploratory noise injection; (2) we derive a novel deterministically-compatible surrogate function and provide monotonic payoff improvement guarantees; (3) we show how to use the DeVine policy gradient estimator with the Wasserstein-based surrogate in a practical algorithm (TDPO: Truly Deterministic Policy Optimization); (4) we illustrate the robustness of the TDPO policy search process in robotic control environments with non-local rewards, long horizons, and/or resonant frequencies.
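The variance claim above can be illustrated with a minimal sketch (a toy example, not the paper's DeVine estimator; the scalar linear system, feedback gain, and horizon below are arbitrary choices): with deterministic dynamics and a deterministic policy, every rollout from the same initial state returns the identical payoff, so the return estimate has zero variance, whereas injecting exploration noise makes the very same estimate stochastic.

```python
import numpy as np

def rollout(K, noise_std=0.0, T=200, gamma=0.99, seed=None):
    # Deterministic scalar linear system x' = 0.9*x + u with quadratic cost,
    # linear feedback policy u = -K*x, and optional Gaussian action noise.
    rng = np.random.default_rng(seed)
    x, ret = 1.0, 0.0
    for t in range(T):
        u = -K * x + noise_std * rng.standard_normal()
        ret += (gamma ** t) * -(x * x + 0.1 * u * u)
        x = 0.9 * x + u
    return ret

# Deterministic policy + deterministic dynamics: identical return every time,
# regardless of the random seed -- zero estimation variance.
det = [rollout(0.5, seed=s) for s in range(5)]
assert max(det) == min(det)

# Adding exploration noise (as in DDPG/TD3-style search) makes the
# return estimate a random variable.
noisy = [rollout(0.5, noise_std=0.1, seed=s) for s in range(5)]
assert max(noisy) > min(noisy)
```

In this simple setting the only remaining randomness would come from the initial state distribution, matching the "up to the initial state distribution" caveat in the abstract.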

1. BACKGROUND

MDP preliminaries. An infinite-horizon discounted Markov decision process (MDP) is specified by (S, A, P, R, µ, γ), where S is the state space, A is the action space, P : S × A → ∆(S) is the transition dynamics, R : S × A → [0, R_max] is the reward function, γ ∈ [0, 1) is the discount factor, and µ(s) is the initial state distribution of interest (where ∆(F) denotes the set of all probability distributions over F, otherwise known as the credal set of F). The transition dynamics P is an operator that produces a distribution over the state space for the next state, s′ ∼ P(s, a). The transition dynamics can be easily generalized to take distributions of states or




