TRULY DETERMINISTIC POLICY OPTIMIZATION

Abstract

In this paper, we present a policy gradient method that avoids exploratory noise injection and performs policy search over the deterministic landscape. By avoiding noise injection, all sources of estimation variance can be eliminated in systems with deterministic dynamics (up to the initial state distribution). Since deterministic policy regularization is impossible with traditional non-metric measures such as the KL divergence, we derive a Wasserstein-based quadratic model for our purposes. We state conditions on the system model under which a monotonic policy improvement guarantee can be established, propose a surrogate function for policy gradient estimation, and show that exact advantage estimates can be computed when both the state transition model and the policy are deterministic. Finally, we describe two novel robotic control environments, one with non-local rewards in the frequency domain and the other with a long horizon (8000 time steps), for which our policy gradient method (TDPO) significantly outperforms existing methods (PPO, TRPO, DDPG, and TD3).



Introduction

Policy Gradient (PG) methods can be broadly characterized by three defining elements: the policy gradient estimator, the regularization measure, and the exploration profile. For gradient estimation, episodic (Williams, 1992), importance-sampling-based (Schulman et al., 2015a), and deterministic (Silver et al., 2014) gradients are among the most common estimation oracles. For regularization, either a Euclidean distance in the parameter space (Williams, 1992; Silver et al., 2014; Lillicrap et al., 2015) or dimensionally consistent non-metric measures (Schulman et al., 2015a; Kakade & Langford, 2002; Schulman et al., 2017; Kakade, 2002; Wu et al., 2017) have frequently been adopted. Common exploration profiles include Gaussian noise (Schulman et al., 2015a) and stochastic processes (Lillicrap et al., 2015). These elements form the basis of many model-free, stochastic policy optimization methods capable of successfully learning high-dimensional policy parameters.

Both stochastic and deterministic policy search can be useful in applications. A stochastic policy has the effect of smoothing or filtering the policy landscape, which is desirable for optimization. Searching through stochastic policies has enabled effective control of challenging environments under a general framework (Schulman et al., 2015a; 2017): the same method can either learn robotic movements or play basic games (1) with minimal domain-specific knowledge, (2) regardless of the function approximation class, and (3) with little human intervention (aside from reward engineering and hyper-parameter tuning) (Duan et al., 2016). Using stochasticity for exploration, although it introduces approximation error and variance, has provided a robust way to actively search for higher rewards. Despite these successes, there are practical environments that remain challenging for current policy gradient methods.
For example, non-local rewards (e.g., those defined in the frequency domain), long time horizons, and naturally resonant dynamics all occur in realistic robotic systems (Kuo & Golnaraghi, 2002; Meirovitch, 1975; Preumont & Seto, 2008) but can present difficulties for policy gradient search. To tackle challenging environments such as these, this paper considers policy gradient methods based on deterministic policies and deterministic gradient estimates, which could offer advantages by allowing the estimation of global reward gradients over long horizons without the need to inject noise into the system for exploration. To facilitate a dimensionally consistent, low-variance deterministic policy search, a compatible policy gradient estimator and a metric regularization measure should be employed. For gradient estimation we focus on Vine estimators (Schulman et al., 2015a), which can easily be applied to deterministic policies. As a metric measure we use the Wasserstein distance, which can measure meaningful distances between deterministic policy functions that have non-
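The contrast between the KL divergence and the Wasserstein distance for deterministic policies can be illustrated numerically. A deterministic policy induces a point-mass (Dirac) action distribution at each state, so two nearby deterministic policies have disjoint supports: the KL divergence between them is infinite (undefined), while the Wasserstein-1 distance shrinks smoothly with the action gap. The sketch below is a hypothetical illustration, not the paper's algorithm; the linear policies and the state value are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def policy_a(state):
    # Hypothetical deterministic policy (linear, for illustration only).
    return 0.50 * state

def policy_b(state):
    # A nearby deterministic policy: slightly perturbed gain.
    return 0.51 * state

state = np.array([2.0])                 # arbitrary example state
a, b = policy_a(state), policy_b(state)

# Each policy's action distribution at this state is a single point mass,
# represented here as a one-sample empirical distribution.
w1 = wasserstein_distance(a, b)         # approximately |1.00 - 1.02| = 0.02

# The KL divergence between these two point masses involves log(p/0) terms
# because their supports do not overlap, so it is infinite: it cannot rank
# policy_b as "close to" policy_a, whereas W1 degrades gracefully.
print(f"W1 distance between the two deterministic policies: {w1:.4f}")
```

As the perturbed gain approaches 0.50, the Wasserstein distance goes to zero continuously, which is what makes it usable as a trust-region-style regularizer over deterministic policies.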

