IMPROVING DEEP POLICY GRADIENTS WITH VALUE FUNCTION SEARCH

Abstract

Deep Policy Gradient (PG) algorithms employ value networks to drive the learning of parameterized policies and reduce the variance of the gradient estimates. However, value function approximation gets stuck in local optima and struggles to fit the actual return, limiting the variance reduction efficacy and leading policies to sub-optimal performance. This paper focuses on improving value approximation and analyzing the effects on Deep PG primitives such as value prediction, variance reduction, and correlation of gradient estimates with the true gradient. To this end, we introduce a Value Function Search that employs a population of perturbed value networks to search for a better approximation. Our framework does not require additional environment interactions, gradient computations, or ensembles, providing a computationally inexpensive approach to enhance the supervised learning task on which value networks train. Crucially, we show that improving Deep PG primitives results in improved sample efficiency and policies with higher returns using common continuous control benchmark domains.

1. INTRODUCTION

Deep Policy Gradient (PG) methods have achieved impressive results in numerous control tasks (Haarnoja et al., 2018). However, these methods deviate from the underlying theoretical framework in order to compute gradients tractably; the promising performance of Deep PG algorithms is thus not yet backed by a rigorous analysis that motivates such results. Ilyas et al. (2020) investigated the phenomena arising in practical implementations by taking a closer look at key PG primitives (e.g., gradient estimates, value predictions). Interestingly, the learned value networks used for predictions (critics) poorly fit the actual return. As a result, the local optima where critics get stuck limit their efficacy in the gradient estimates, driving policies (actors) toward sub-optimal performance. Despite the lack of investigations into understanding Deep PG's results, several approaches have been proposed to improve these methods. Ensemble learning (Lee et al., 2021; He et al., 2022), for example, combines the predictions of multiple learning actors (or critics) to address overestimation and foster diversity. These methods generate diverse solutions that improve exploration and stabilize training, leading to higher returns. However, the number of models used at training and inference time poses significant challenges, which we discuss in Section 5. Nonetheless, popular Deep Reinforcement Learning (RL) algorithms (e.g., TD3 (Fujimoto et al., 2018)) use two value networks, leveraging the benefits of ensemble approaches while limiting their complexity. To address the issues of ensembles, gradient-free methods have been recently proposed (Khadka & Tumer, 2018; Marchesini & Farinelli, 2020; Sigaud, 2022). The idea is to complement Deep PG algorithms with a search mechanism that uses a population of perturbed policies to improve exploration and to find policy parameters with higher payoffs.
In contrast to ensemble methods that employ multiple actors and critics, gradient-free population searches typically focus on the actors, disregarding the value network component of Deep PG. Section 5 discusses the limitations of policy search methods in detail. However, in a PG context, critics have a pivotal role in driving the policy learning process, as poor value predictions lead to sub-optimal performance and higher variance in gradient estimates (Sutton & Barto, 2018). For this reason, we propose Value Function Search (VFS), a gradient-free population-based approach for critics, depicted in Figure 1 and detailed in the following. We show the effectiveness of VFS on different Deep PG baselines: (i) Proximal Policy Optimization (PPO) (Schulman et al., 2017), (ii) Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2016), and (iii) TD3, on a range of continuous control benchmark tasks (Brockman et al., 2016; Todorov et al., 2012). We compare against such baselines, an ensemble Deep RL method, SUNRISE (Lee et al., 2021), and the policy search method Supe-RL (Marchesini et al., 2021a). In addition, we analyze key Deep PG primitives (i.e., value prediction errors, gradient estimates, and their variance) to explain the performance improvement of VFS-based algorithms. Our evaluation confirms that VFS leads to better gradient estimates with lower variance, which significantly improve sample efficiency and lead to policies with higher returns.
Our analysis highlights a fundamental issue with current state-of-the-art Deep PG methods, opening the door for future research.

2. BACKGROUND

PG methods parameterize a policy π_θ with a parameter vector θ (typically the weights of a Deep Neural Network (DNN) in a Deep PG context), where π_θ(a_t|s_t) models the probability of taking action a_t in a state s_t at step t in the environment. These approaches aim to learn θ by following the gradient of an objective η_θ over such parameters (Sutton & Barto, 2018). Formally, the primitive gradient estimate on which modern Deep PG algorithms build has the following form (Sutton et al., 1999):

$$\nabla \eta_\theta = \mathbb{E}_{(s_t, a_t) \in \tau \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a_t|s_t)\, Q^{\pi_\theta}(s_t, a_t)\big] \quad (1)$$

where (s_t, a_t) ∈ τ ∼ π_θ are the states and actions that form the trajectories sampled from the distribution induced by π_θ, and Q^{π_θ}(s_t, a_t) is the expected return after taking a_t in s_t. However, Equation 1 suffers from high variance, and different baselines have been used to mitigate the issue. In particular, given a discount value γ ∈ [0, 1), the state-value function V^{π_θ}(s) is an ideal baseline as it leaves the expectation unchanged while reducing variance (Williams, 1992):

$$V^{\pi_\theta}(s) = \mathbb{E}_{\pi_\theta}\Big[G_t := \textstyle\sum_{t}^{\infty} \gamma^t r_{t+1} \,\Big|\, s_t = s\Big] \quad (2)$$

where G_t is the sum of future discounted rewards (return). Despite building on the theoretical framework of PG, Deep PG algorithms rely on several assumptions and approximations to design feasible updates for the policy parameters. In more detail, these methods typically build on surrogate objective functions that are easier to optimize. A leading example is TRPO (Schulman et al., 2015), which ensures that a surrogate objective updates the policy locally by imposing a trust region. Formally, TRPO imposes a constraint on the Kullback-Leibler (KL) divergence between successive policies π_θ, π_θ′, resulting in the following optimization problem:

$$\max_{\theta'} \ \mathbb{E}_{\pi_\theta}\Big[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\, A^{\pi_\theta}(s_t, a_t)\Big] \quad \text{s.t.} \quad D_{KL}\big(\pi_\theta(\cdot|s) \,\|\, \pi_{\theta'}(\cdot|s)\big) \le \delta$$

where A^{π_θ} is the advantage function under the current policy π_θ and δ bounds the KL divergence between the successive policies.
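As a concrete illustration of Equations 1 and 2 (a minimal NumPy sketch with toy values, not part of our method), the discounted return G_t can be computed backwards over a sampled trajectory, and subtracting a baseline V(s_t) yields the advantage terms that reduce the variance of the gradient estimate:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """G_t: sum of future discounted rewards, computed backwards over a trajectory."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

# Toy trajectory: three steps with rewards 1, 0, 1 and gamma = 0.5.
rewards = [1.0, 0.0, 1.0]
G = discounted_returns(rewards, gamma=0.5)
# G = [1 + 0.5*0 + 0.25*1, 0 + 0.5*1, 1] = [1.25, 0.5, 1.0]

# A baseline V(s_t) (here, hypothetical critic predictions) turns returns
# into advantages, reducing the variance of the estimate in Equation 1
# while leaving its expectation unchanged.
V = np.array([1.0, 0.6, 0.9])
advantages = G - V
```

The quality of the baseline V is precisely what determines how much variance reduction is achieved, which is why poor critic fits degrade the gradient estimates.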



Figure 1: Overview of VFS.

Such issues are further exacerbated in state-of-the-art Deep PG methods, as they struggle to learn good value function estimates (Ilyas et al., 2020). For this reason, we propose a novel gradient-free population-based approach for critics called Value Function Search (VFS), depicted in Figure 1. We aim to improve Deep PG algorithms by enhancing value networks to achieve (i) a better fit of the actual return, (ii) a higher correlation of the gradient estimates with the (approximate) true gradient, and (iii) reduced variance. In detail, given a Deep PG agent characterized by actor and critic networks, VFS periodically instantiates a population of perturbed critics using a two-scale perturbation noise designed to improve value predictions. Small-scale perturbations explore local value predictions that only slightly modify those of the original critic (similarly to gradient-based perturbations (Lehman et al., 2018; Martin H. & de Lope, 2009; Marchesini & Amato, 2022)). Big-scale perturbations search for parameters that allow escaping from the local optima where value networks get stuck. In contrast to previous search methods, evaluating perturbed value networks requires computing standard value error measures using samples from the agent's buffer. Hence, the Deep PG agent uses the parameters with the lowest error until the next periodic search. Crucially, VFS's critic-based design addresses the issues of prior methods as it does not require a simulator, hand-designed environment interactions, or weighted optimizations. Moreover, our population's goal is to find the weights that minimize the same objective as the Deep PG critic.
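The search step can be sketched as follows. This is a simplified illustration: the critic is a toy linear model, and the population size, noise scales, and function names are illustrative assumptions rather than our actual configuration. Since the unperturbed critic remains in the candidate set, the selected parameters never have a higher value error than the current ones.

```python
import numpy as np

def value_error(params, states, returns):
    """MSE between a (toy, linear) critic's predictions and sampled returns."""
    preds = states @ params
    return float(np.mean((preds - returns) ** 2))

def vfs_search(params, states, returns, pop_size=10,
               small_std=0.01, big_std=0.5, rng=None):
    """Two-scale perturbation search over critic parameters.

    Half of the population applies small-scale noise (local refinement of the
    current predictions); the other half applies big-scale noise (escaping
    local optima). The unperturbed critic stays in the candidate set, and the
    lowest-error candidate is returned.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    candidates = [params]
    for i in range(pop_size):
        std = small_std if i < pop_size // 2 else big_std
        candidates.append(params + rng.normal(0.0, std, size=params.shape))
    errors = [value_error(c, states, returns) for c in candidates]
    best = int(np.argmin(errors))
    return candidates[best], errors[best]

# Toy buffer: states and their (noisy) observed returns.
rng = np.random.default_rng(1)
states = rng.normal(size=(64, 4))
true_w = np.array([0.5, -1.0, 2.0, 0.1])
returns = states @ true_w + rng.normal(0.0, 0.1, size=64)

w0 = np.zeros(4)  # current (poor) critic parameters
w_best, err_best = vfs_search(w0, states, returns)
```

Evaluating candidates only requires forward passes over buffer samples, which is why the search adds no environment interactions or gradient computations.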

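To make the surrogate objective and its trust-region constraint concrete, the following toy sketch (discrete actions with illustrative probabilities and advantages; not TRPO's actual conjugate-gradient optimization) computes the probability ratio, the surrogate value, and the KL divergence between two successive policies:

```python
import numpy as np

def kl(p, q):
    """KL divergence D_KL(p || q) between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

# Two successive policies over 3 actions in a single state (toy numbers).
pi_old = np.array([0.5, 0.3, 0.2])    # pi_theta (current policy)
pi_new = np.array([0.45, 0.35, 0.2])  # pi_theta' (candidate update)

# Advantage estimates for each action under the current policy (assumed).
A = np.array([1.0, -0.5, 0.2])

# Surrogate objective: ratio-weighted advantage in expectation under pi_old,
# which equals sum_a pi_new(a) * A(a) in this single-state example.
ratio = pi_new / pi_old
surrogate = float(np.sum(pi_old * ratio * A))

# Trust-region check: the update is acceptable only if the KL divergence
# between successive policies stays below the bound delta.
delta = 0.01
within_trust_region = kl(pi_old, pi_new) <= delta
```

Because the expectation is taken under the current policy, the surrogate can be estimated from already-collected trajectories, and the KL bound keeps the update local.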

