DEEP REINFORCEMENT LEARNING ON ADAPTIVE PAIRWISE CRITIC AND ASYMPTOTIC ACTOR

Abstract

Maximum entropy deep reinforcement learning has shown great potential on a range of challenging continuous-control tasks. The maximum entropy term encourages policy exploration; however, it introduces a tradeoff between efficiency and stability, especially on large-scale tasks with high state and action dimensionality. The temperature hyperparameter of the maximum entropy term is often constrained to keep training stable, at the cost of slower and poorer convergence. In addition, the function approximation errors inherent in actor-critic learning are known to induce estimation errors and suboptimal policies. In this paper, we propose an algorithm that combines adaptive pairwise critics with adaptive asymptotic maximum entropy. Specifically, we add a trainable state-dependent weight factor to build an adaptive pairwise target Q-value, which serves as the surrogate policy objective. We then adopt a state-dependent adaptive temperature to smooth entropy-driven policy exploration, which yields an asymptotic maximum entropy. The adaptive pairwise critics effectively improve value estimation, preventing both overestimation and underestimation errors. Meanwhile, the adaptive asymptotic entropy adapts to the tradeoff between efficiency and stability, providing more exploration and flexibility. We evaluate our method on a set of Gym tasks, and the results show that the proposed algorithm outperforms several baselines on continuous control.

1. INTRODUCTION

The task of deep reinforcement learning (DRL) is to learn good policies by optimizing a discounted cumulative reward through function approximation. In DRL, the maximization over noisy Q-value estimates at every update tends to prefer inaccurate value approximations that outweigh the true value Thrun & Schwartz (1993), i.e., overestimation. This error further accumulates and propagates via the bootstrapping of temporal-difference learning Sutton & Barto (2018), which estimates the value function using the value estimate of a subsequent state. When function approximation is unavoidably adopted in the actor-critic setting on continuous control, these estimation errors are exaggerated and may cause suboptimal policies, divergence, and instability. To some extent, inaccurate estimation is unavoidable in DRL because value-involved DRL fundamentally uses random variables as target values. On the one hand, these stochastic target values introduce estimation biases. On the other hand, even an unbiased estimate with high variance can still lead to future overestimation in local regions of the state space, which in turn can negatively affect the global policy Fujimoto et al. (2018). Therefore, diminishing the value variance without partiality can be an effective means to reduce estimation errors, whether overestimation or underestimation. Taking the twin delayed deep deterministic policy gradient (TD3) Fujimoto et al. (2018) as an example, always selecting the lower value from a pair of critics induces an underestimation bias even though it lowers variance. Several recent works deal with errors such as the bootstrapping error caused by out-of-distribution (OOD) actions Kumar et al. (2019; 2020), and the extrapolation error induced by the mismatch between the distribution of buffer-sampled data and the true state-action visitation of the current policy Fujimoto et al. (2019). This paper makes the following contributions.
First, we propose the concept of adaptive pairwise critics, which connects a pair of critics through a trainable state-dependent weight factor to combat estimation errors. Second, we propose an adaptive temperature, also state-dependent, so that the agent can explore freely under loose restrictions on the choice of the temperature hyperparameter. Based on this adaptive temperature, we construct an asymptotic maximum entropy term to optimize the policy. The asymptotic maximum entropy is combined with the adaptive pairwise critics to form the target Q-value as well as the surrogate policy objective. Third, we present a novel algorithm that tackles estimation errors while pursuing effective and stable exploration. Finally, experimental evaluations compare the proposed algorithm with several baselines in terms of sample complexity and stability.
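The two ingredients above can be sketched in a few lines. This is a minimal illustration, not the paper's exact formulation: we assume the pairwise target is a convex combination between the pessimistic minimum and the optimistic maximum of the twin critics, with the state-dependent weight beta(s) and temperature alpha(s) produced elsewhere (e.g., by sigmoid-squashed network heads) and passed in as scalars here.

```python
def pairwise_target(q1, q2, beta):
    """Adaptive pairwise target: a convex combination of the two critic
    estimates, sliding between the pessimistic min (beta -> 1) and the
    optimistic max (beta -> 0). beta is assumed state-dependent and trainable."""
    lo, hi = min(q1, q2), max(q1, q2)
    return beta * lo + (1.0 - beta) * hi

def soft_td_target(reward, gamma, q1_next, q2_next, beta_next,
                   alpha_next, log_pi_next):
    """One-step soft TD target that plugs the pairwise critic value into the
    entropy-augmented backup, with a state-dependent temperature alpha(s')
    weighting the entropy bonus -log pi(a'|s')."""
    q_pair = pairwise_target(q1_next, q2_next, beta_next)
    return reward + gamma * (q_pair - alpha_next * log_pi_next)
```

Note that the sketch recovers two standard corner cases: with `beta_next = 1` it reduces to the clipped double-Q target of TD3/SAC, and with `alpha_next = 0` the entropy bonus vanishes.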

2. RELATED WORK

In reinforcement learning, the agent needs to interact with the environment to collect enough knowledge for training. Without sufficient exploration, the collected data may be inadequate for learning an optimal value. Therefore, reinforcement learning has to deal with the tradeoff between exploration and exploitation. There are several ways to enhance exploration in deep reinforcement learning (DRL). One is the off-policy approach, which takes full advantage of past experience from a replay buffer instead of on-policy data Mnih et al. (2015). Another adopts policy exploration to stimulate the agent's motivation for a better balance between exploration and exploitation Mnih et al. (2016); Haarnoja et al. (2018a). Among these, soft actor-critic (SAC) Haarnoja et al. (2018a) is empirically shown to be sensitive to the temperature hyperparameter. To provide flexibility in the choice of optimal temperature, SAC-v2 Haarnoja et al. (2018b) makes the first step toward automatically tuning the temperature hyperparameter by formulating a constrained optimization problem for the average entropy of the policy. The dual of the constrained optimization adds an update procedure for the dual variable that determines the temperature. However, the convexity assumption required for theoretical convergence does not hold for neural networks, and the extra hyperparameter introduced by the transformation remains undetermined and needs more trials to generalize. Meta-SAC Wang & Ni (2020) uses meta-gradients along with a novel meta objective to automatically tune the entropy temperature in SAC. It distinguishes meta-parameters from learnable parameters and hyperparameters, and uses some initial states to train the meta temperature. However, due
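The SAC-v2 temperature update described above amounts to gradient descent on J(alpha) = E[-alpha (log pi(a|s) + H_target)]. A minimal sketch, with a manual gradient step standing in for an optimizer and an illustrative batch of log-probabilities (optimizing log alpha keeps the temperature positive):

```python
import math

def alpha_loss(log_alpha, log_pi_batch, target_entropy):
    """SAC-v2-style temperature objective:
    J(alpha) = E[-alpha * (log pi(a|s) + target_entropy)]."""
    alpha = math.exp(log_alpha)
    return -alpha * sum(lp + target_entropy for lp in log_pi_batch) / len(log_pi_batch)

def alpha_grad_step(log_alpha, log_pi_batch, target_entropy, lr=1e-3):
    """One manual gradient-descent step on J w.r.t. log_alpha, using
    dJ/d(log_alpha) = -alpha * mean(log_pi + target_entropy).
    When policy entropy (~ -mean(log_pi)) falls below the target, the
    gradient is negative and the temperature rises, and vice versa."""
    alpha = math.exp(log_alpha)
    grad = -alpha * sum(lp + target_entropy for lp in log_pi_batch) / len(log_pi_batch)
    return log_alpha - lr * grad
```

This makes concrete why the scheme adds an extra update procedure: each training iteration takes one such step on the dual variable in addition to the actor and critic updates.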



The authors in Wu et al. (2019) address the distribution errors by an extra value penalty or policy regularization. Overestimation is another induced error, originally found in the Q-learning algorithm by Watkins (1989) and demonstrated in the deep Q-network (DQN) Mnih et al. (2015) on discrete control. In recent years, overestimation has been reported in function approximation of actor-critic methods on continuous control Fujimoto et al. (2018); Duan et al. (2021a). Although several algorithms have been created to address overestimation errors Fujimoto et al. (2018; 2019); Kumar et al. (2019); Wu et al. (2019); Duan et al. (2021a), the accuracy of function approximation is not handled flexibly, since underestimation errors usually accompany the correction to overestimation.
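Both biases are easy to reproduce numerically. In the toy simulation below (the action count, noise scale, and trial budget are illustrative, not taken from any cited experiment), every action has true value 1.0, yet taking the max over one critic's noisy estimates lands above 1.0, while a TD3-style clipped double-Q target lands below it:

```python
import random

random.seed(0)
TRUE_Q = [1.0, 1.0, 1.0]   # three actions, all equally good: true max = 1.0
SIGMA, TRIALS = 0.5, 20_000

def noisy_critic():
    """One unbiased but noisy estimate of every action's Q-value."""
    return [q + random.gauss(0.0, SIGMA) for q in TRUE_Q]

single_max, clipped = 0.0, 0.0
for _ in range(TRIALS):
    q1, q2 = noisy_critic(), noisy_critic()
    # Q-learning-style target: max over one critic's noisy estimates
    single_max += max(q1)
    # TD3-style clipped double-Q: act greedily w.r.t. q1, then take the
    # smaller of the two critics' estimates for that action
    a = max(range(len(q1)), key=lambda i: q1[i])
    clipped += min(q1[a], q2[a])
single_max /= TRIALS
clipped /= TRIALS
# single_max overshoots the true value of 1.0 (overestimation), while
# clipped undershoots it (underestimation): correcting one bias buys the other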

); Haarnoja et al. (2018a). Among them, soft actor-critic (SAC) Haarnoja et al. (2018a) achieves good performance on a set of continuous control tasks by adopting stochastic policies and maximum entropy. Stochastic policies generalize the policy improvement and introduce uncertainty into action decisions over deterministic counterparts Heess et al. (2015), and augmenting the reward return with an entropy maximization term encourages exploration, thus improving robustness and stability Ziebart et al. (2008); Ziebart (2010). In recent years, many works have been proposed on top of SAC. The improvement of SAC can be realized by changing the rule of experience replay, for example, Wang & Ross (2019) samples more aggressively from recent experience while ordering the updates to ensure that updates from old data do not overwrite updates from new data, and Martin et al. (2021) relabels successful episodes as expert demonstrations for the agent to match. The distributional soft actor-critic (DSAC) Duan et al. (2021b); Ren et al. (2020); Ma et al. (2020); Duan et al. (2021c) combines the distributional return function within the maximum entropy to improve the estimation accuracy of the Q-value. It claims to prevent gradient explosion by truncating the difference between target and current return distributions, however, its assumptions of Gaussian distributions for random returns will induce more complexity and may not fit with the real distributions. Akimov (2019); Hou et al. (2020) reparameterize the reward representation and the policy, respectively, using a neural network transformation composed of multivariate factorization, and Ward et al. (2019) constructs normalizing flow policies before applying the squashing function to improving exploration within the SAC framework.

