DEEP REINFORCEMENT LEARNING ON ADAPTIVE PAIRWISE CRITIC AND ASYMPTOTIC ACTOR

Abstract

Maximum entropy deep reinforcement learning has shown great potential on a range of challenging continuous-control tasks. The maximum entropy term encourages policy exploration, but it introduces a tradeoff between efficiency and stability, especially on large-scale tasks with high state and action dimensionality. In practice, the temperature hyperparameter of the maximum entropy term is often constrained to keep training stable, at the cost of slower and poorer convergence. Moreover, the function approximation errors inherent in actor-critic learning are known to induce estimation errors and suboptimal policies. In this paper, we propose an algorithm that combines adaptive pairwise critics with adaptive asymptotic maximum entropy. Specifically, we add a trainable state-dependent weight factor to build an adaptive pairwise target Q-value that serves as the surrogate policy objective. We then adopt a state-dependent adaptive temperature to smooth entropy-driven exploration, which yields an asymptotic maximum entropy. The adaptive pairwise critics effectively improve value estimation, preventing both overestimation and underestimation errors, while the adaptive asymptotic entropy adapts to the tradeoff between efficiency and stability, providing more exploration and flexibility. We evaluate our method on a set of Gym tasks, and the results show that the proposed algorithm outperforms several baselines on continuous control.

1. INTRODUCTION

The task of deep reinforcement learning (DRL) is to learn good policies by optimizing a discounted cumulative reward through function approximation. In DRL, the maximization over all noisy Q-value estimates at every update tends to favor inaccurate value approximations that exceed the true value Thrun & Schwartz (1993), i.e., overestimation. This error further accumulates and propagates via the bootstrapping of temporal-difference learning Sutton & Barto (2018), which estimates the value function using the value estimate of a subsequent state. When function approximation is unavoidably adopted in the actor-critic setting on continuous control, the estimation errors are exaggerated and may cause suboptimal policies, divergence, and instability. To some extent, inaccurate estimation is unavoidable in DRL because value-based methods fundamentally use random variables as target values. On the one hand, these stochastic target values introduce estimation biases. On the other hand, even an unbiased estimate with high variance can still lead to overestimation in local regions of the state space, which in turn can negatively affect the global policy Fujimoto et al. (2018). Therefore, diminishing the value variance without partiality can be an effective means to reduce estimation errors, whether overestimation or underestimation. Taking the twin delayed deep deterministic policy gradient (TD3) algorithm Fujimoto et al. (2018) as an example, always selecting the lower value from a pair of critics induces an underestimation bias, although it is beneficial for reducing variance. Several recent works address errors such as the bootstrapping error caused by out-of-distribution (OOD) actions Kumar et al. (2019; 2020), and the extrapolation error induced by the mismatch between the distribution of buffer-sampled data and the true state-action visitation of the current policy Fujimoto et al. (2019). The authors in Wu et al. (2019) address distribution errors with an extra value penalty or policy regularization.
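The opposing biases discussed above can be seen in a small simulation (an illustrative sketch, not the paper's method): when a true Q-value is corrupted by independent zero-mean noise, each noisy critic is unbiased on its own, but taking the max over noisy estimates (as in plain Q-learning targets) biases the target upward, while taking the min of a critic pair (as in TD3's clipped double Q-learning) biases it downward.

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = 0.0   # assumed true value of some fixed state-action pair
n = 100_000    # number of simulated noisy estimates

# Two independent critics, each an unbiased but noisy estimate of true_q.
q1 = true_q + rng.normal(0.0, 1.0, n)
q2 = true_q + rng.normal(0.0, 1.0, n)

# Max over noisy estimates overshoots the true value (overestimation);
# min over the critic pair undershoots it (underestimation).
overestimate = np.maximum(q1, q2).mean()
underestimate = np.minimum(q1, q2).mean()
print(f"mean max(Q1, Q2) = {overestimate:+.3f}")   # positive bias
print(f"mean min(Q1, Q2) = {underestimate:+.3f}")  # negative bias
```

For two independent standard-normal errors, the expected max is 1/sqrt(pi), roughly 0.56, and the expected min is its negative, so both biases are substantial even though each critic is individually unbiased. This motivates weighting the two critics adaptively rather than always taking the minimum.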

