ASYMQ: ASYMMETRIC Q-LOSS TO MITIGATE OVER-ESTIMATION BIAS IN OFF-POLICY REINFORCEMENT LEARNING

Abstract

It is well known that off-policy deep reinforcement learning algorithms suffer from overestimation bias in value function approximation. Existing methods to reduce this bias often rely on multiple value function estimators and consequently incur larger time and memory costs. In this work, we propose a new class of policy evaluation algorithms, dubbed AsymQ, that use asymmetric loss functions to train the Q-value network. Departing from symmetric loss functions of the temporal-difference (TD) error, such as the mean squared error (MSE) and the Huber loss, we adopt asymmetric loss functions that impose a higher penalty on overestimation errors. We present one such AsymQ loss, called Softmax MSE (SMSE), that can be implemented with minimal modifications to standard policy evaluation. Empirically, we show that the SMSE loss helps reduce estimation bias and subsequently improves policy performance when combined with standard reinforcement learning algorithms. With SMSE, even the Deep Deterministic Policy Gradient (DDPG) algorithm achieves performance comparable to that of state-of-the-art methods such as Twin Delayed DDPG (TD3) and Soft Actor-Critic (SAC) on challenging environments in the OpenAI Gym MuJoCo benchmark. We further demonstrate that the proposed SMSE loss can also boost the performance of deep Q-learning (DQN) on Atari games with discrete action spaces.

1. INTRODUCTION

Learning an accurate value function in Deep Reinforcement Learning (DRL) is crucial; the value function of a policy is not only important for policy improvement (Sutton & Barto, 2018) but also useful for many downstream applications such as risk-aware planning (Kochenderfer, 2015) and goal-based reachability planning (Nasiriany et al., 2019). However, most off-policy DRL algorithms suffer from estimation bias in policy evaluation, and removing this bias has been a long-standing challenge. In this work, we revisit the estimation bias problem in off-policy DRL from a new perspective and propose a lightweight modification of the standard policy evaluation algorithm to mitigate estimation bias.

Value estimation bias in DRL

Thrun & Schwartz (1993) first showed that maximization over a noisy value estimate consistently induces overestimation bias in Q-learning, where the learned value function overestimates the value of the learned policy, i.e., its prediction is higher than the ground-truth value of the policy. Several methods have been proposed to reduce estimation bias in policy evaluation and policy improvement. Hasselt (2010) and Van Hasselt et al. (2016) propose double Q-learning, which trains two independent estimators to suppress overestimation. In the continuous state-action setting, Fujimoto et al. (2018) show that overestimation also arises in the popular Deep Deterministic Policy Gradient (DDPG) algorithm (Lillicrap et al., 2015) and propose Twin Delayed Deep Deterministic Policy Gradient (TD3), which fits the Q-value function using the minimum of the estimates from two critic networks. However, such approaches, which combine multiple value function estimators with a minimum operator, may instead succumb to underestimation bias (Ciosek et al., 2019; Pan et al., 2020), where the value predicted by the learned function is lower than the true performance of the policy. A number of follow-up works (Lyu et al., 2022; Wu et al., 2020; Wei et al., 2022; Kuznetsov et al., 2020; Chen et al., 2021; Liang et al., 2022) further study estimation bias, typically at the cost of maintaining additional value estimators.
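To make the idea of an asymmetric TD loss concrete, the sketch below shows one simple way to penalize overestimation more heavily than underestimation in a critic update. This is a minimal illustration of the general principle rather than the paper's exact SMSE formulation; the step-weighting scheme and the hyperparameter overestimation_weight are assumptions introduced purely for exposition.

import torch

def asymmetric_td_loss(q_pred, q_target, overestimation_weight=2.0):
    # TD error with the convention that positive values mean the critic
    # overestimates its bootstrapped target.
    td_error = q_pred - q_target.detach()
    # Penalize positive (overestimation) errors more heavily than negative ones.
    weight = torch.where(
        td_error > 0,
        torch.full_like(td_error, overestimation_weight),
        torch.ones_like(td_error),
    )
    return (weight * td_error.pow(2)).mean()

In a standard DDPG- or DQN-style critic update, q_target would be the usual bootstrapped target r + gamma * (1 - done) * Q'(s', a'), and the asymmetric loss simply replaces the MSE or Huber loss on the TD error, which is why such a modification leaves the rest of the policy evaluation procedure unchanged.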

