IMPROVING DEEP POLICY GRADIENTS WITH VALUE FUNCTION SEARCH

Abstract

Deep Policy Gradient (PG) algorithms employ value networks to drive the learning of parameterized policies and reduce the variance of the gradient estimates. However, value function approximation gets stuck in local optima and struggles to fit the actual return, limiting the efficacy of the variance reduction and driving policies toward sub-optimal performance. This paper focuses on improving value approximation and analyzing its effects on Deep PG primitives such as value prediction, variance reduction, and the correlation of gradient estimates with the true gradient. To this end, we introduce a Value Function Search that employs a population of perturbed value networks to search for a better approximation. Our framework does not require additional environment interactions, gradient computations, or ensembles, providing a computationally inexpensive approach to enhance the supervised learning task on which value networks are trained. Crucially, we show that improving Deep PG primitives results in improved sample efficiency and policies with higher returns on common continuous control benchmark domains.

1. INTRODUCTION

Deep Policy Gradient (PG) methods have achieved impressive results in numerous control tasks (Haarnoja et al., 2018). However, these methods deviate from the underlying theoretical framework in order to compute gradients tractably. Hence, the promising performance of Deep PG algorithms lacks a rigorous analysis motivating such results. Ilyas et al. (2020) investigated the phenomena arising in practical implementations by taking a closer look at key PG primitives (e.g., gradient estimates, value predictions). Interestingly, the learned value networks used for predictions (critics) poorly fit the actual return. As a result, the local optima where critics get stuck limit their efficacy in the gradient estimates, driving policies (actors) toward sub-optimal performance. Despite the limited understanding of Deep PG's results, several approaches have been proposed to improve these methods. Ensemble learning (Lee et al., 2021; He et al., 2022), for example, combines the predictions of multiple learning actors (or critics) to address overestimation and foster diversity. These methods generate diverse solutions that improve exploration and stabilize training, leading to higher returns. However, the number of models used at training and inference time poses significant challenges, which we discuss in Section 5. Nonetheless, popular Deep Reinforcement Learning (RL) algorithms (e.g., TD3 (Fujimoto et al., 2018)) use two value networks, leveraging the benefits of ensemble approaches while limiting their complexity. To address the issues of ensembles, gradient-free methods have recently been proposed (Khadka & Tumer, 2018; Marchesini & Farinelli, 2020; Sigaud, 2022). The idea is to complement Deep PG algorithms with a search mechanism that uses a population of perturbed policies to improve exploration and to find policy parameters with higher payoffs.
In contrast to ensemble methods that employ multiple actors and critics, gradient-free population searches typically focus on the actors, disregarding the value-network component of Deep PG. Section 5 discusses the limitations of policy search methods in detail. However, in a PG context, critics play a pivotal role in driving the policy learning process, as poor value predictions lead to sub-optimal performance and higher variance in gradient estimates (Sutton & Barto, 2018).
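To make the population-search idea concrete, the following is a minimal sketch of a value function search over perturbed critic parameters. It is an illustration only, not the paper's implementation: the critic is reduced to a linear model, and the function names (`value_function_search`) and hyperparameters (`pop_size`, `sigma`) are our own assumptions. Candidates are scored by their return-fitting error on a batch the Deep PG algorithm has already collected, so the search needs no extra environment interactions or gradient computations.

```python
import numpy as np

def value_function_search(w, states, returns, pop_size=8, sigma=0.05, rng=None):
    """Search for critic parameters that better fit the observed returns.

    Illustrative sketch: `w` are the current critic parameters of a linear
    value model V(s) = s @ w; `states` and `returns` form a batch already
    collected by the underlying Deep PG algorithm.
    """
    rng = rng or np.random.default_rng(0)
    # Return-fitting error of a candidate parameter vector (MSE vs. returns).
    fit = lambda p: float(np.mean((states @ p - returns) ** 2))
    best_w, best_err = w, fit(w)
    for _ in range(pop_size):
        # Gaussian perturbation of the current critic parameters.
        candidate = w + sigma * rng.standard_normal(w.shape)
        err = fit(candidate)
        if err < best_err:  # keep a candidate only if it fits the return better
            best_w, best_err = candidate, err
    return best_w  # never worse than the unperturbed critic on this batch
```

Because the unperturbed parameters are kept as the incumbent, the search can only improve (or preserve) the return fit on the evaluation batch, mirroring the hedge that the framework is a cheap add-on to the critic's supervised learning task rather than a replacement for it.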

