ASYMQ: ASYMMETRIC Q-LOSS TO MITIGATE OVER-ESTIMATION BIAS IN OFF-POLICY REINFORCEMENT LEARNING

Abstract

It is well known that off-policy deep reinforcement learning algorithms suffer from overestimation bias in value-function approximation. Existing methods to reduce overestimation bias often rely on multiple value-function estimators and consequently incur larger time and memory costs. In this work, we propose a new class of policy evaluation algorithms, dubbed AsymQ, that use asymmetric loss functions to train the Q-value network. Departing from symmetric loss functions of the temporal difference (TD) error, such as the mean squared error (MSE) and Huber loss, we adopt asymmetric loss functions that impose a higher penalty on overestimation error. We present one such AsymQ loss, Softmax MSE (SMSE), which can be implemented with minimal modifications to standard policy evaluation. Empirically, we show that the SMSE loss reduces estimation bias and subsequently improves policy performance when combined with standard reinforcement learning algorithms. With SMSE, even the Deep Deterministic Policy Gradient (DDPG) algorithm can achieve performance comparable to that of state-of-the-art methods such as Twin-Delayed DDPG (TD3) and Soft Actor-Critic (SAC) on challenging environments in the OpenAI Gym MuJoCo benchmark. We additionally demonstrate that the proposed SMSE loss can boost the performance of deep Q-learning (DQN) on Atari games with discrete action spaces.

1. INTRODUCTION

Learning an accurate value function in Deep Reinforcement Learning (DRL) is crucial; the value function of a policy is not only important for policy improvement (Sutton & Barto, 2018) but also useful for many downstream applications such as risk-aware planning (Kochenderfer, 2015) and goal-based reachability planning (Nasiriany et al., 2019). However, most off-policy DRL algorithms suffer from estimation bias in policy evaluation, and removing this bias has been a long-standing challenge. In this work, we revisit the estimation bias problem in off-policy DRL from a new perspective and propose a lightweight modification of the standard policy evaluation algorithm to mitigate it. Thrun & Schwartz (1993) first showed that maximization over a noisy value estimate consistently induces overestimation bias in Q-learning, where the learned value function overestimates the learned policy, i.e., the prediction of the learned value function is higher than the ground-truth value of the policy. Several methods have been proposed to reduce estimation bias in policy evaluation and policy improvement. Hasselt (2010); Van Hasselt et al. (2016) propose double Q-learning, which trains two independent estimators to suppress overestimation. In the continuous state-action setting, Fujimoto et al. (2018) show the existence of overestimation in the popular Deep Deterministic Policy Gradient (DDPG) algorithm (Lillicrap et al., 2015) and propose Twin Delayed Deep Deterministic Policy Gradient (TD3), which fits the Q-value function using the minimum estimate from two critic networks. However, such approaches, which combine multiple value-function estimators with a minimum operator, may succumb to underestimation bias (Ciosek et al., 2019; Pan et al., 2020), where the value prediction of the learned function is lower than the real policy performance. Lyu et al. (2022); Wu et al. (2020); Wei et al. (2022); Kuznetsov et al. (2020); Chen et al. (2021); Liang et al. (2022); Lee et al. (2021) propose new critic update schemes to reduce the estimation bias. However, all of these methods require multiple actors or an ensemble of critics and usually involve other convoluted tricks that incur additional computation and memory overheads. In this work, we explore an efficient approach that reduces value-estimation bias without incurring extra computational cost.

Loss function in policy evaluation

Policy evaluation in DRL is typically based on the Bellman update equation (Bellman, 1966), in which the discrepancy between the predicted and target values is minimized. The target value combines the immediate reward with the value-function prediction at future states. Policy evaluation based on Bellman temporal difference (TD) learning thus has two main components: the fitting target, which acts as the supervised signal for training the network, and the loss function, which serves as the objective metric. Most existing work on reducing estimation bias focuses on the first component and typically tries to construct a more robust target value, for example via a lagged target network (Mnih et al., 2015) or an ensemble of value networks (Fujimoto et al., 2018). The choice of loss function used for value-function fitting has received much less attention. In practice, most RL algorithms use the symmetric mean squared error (MSE) or Huber loss (Patterson et al., 2022) to fit value functions. Once the MSE loss reaches zero for every state, the prediction of the value network matches the policy's performance exactly. In this work, we propose a novel approach that controls estimation bias in DRL by modifying the loss function. We discover that the optimization landscape used to train value networks, governed by the loss function, plays a crucial role in policy evaluation: a proper choice of value-fitting loss can effectively control estimation bias. In particular, we show that asymmetric loss functions can be used for policy evaluation to reduce estimation bias, and propose a class of algorithms called AsymQ. We specifically evaluate one family of AsymQ loss functions, called Softmax MSE (SMSE), for learning the Q-value function in both continuous and discrete action spaces, and we validate the benefit of other AsymQ losses as well.
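As a concrete illustration, a softmax-weighted MSE of this kind can be sketched as a batch-level reweighting of squared TD errors, where a softmax over the TD errors (scaled by a temperature τ) upweights overestimated samples. The exact functional form below is an illustrative assumption for exposition, not a verbatim definition from this paper:

```python
import numpy as np

def smse_loss(q_pred, q_target, tau=1.0):
    """Softmax-weighted MSE over a batch of TD errors (illustrative sketch).

    A positive TD error (prediction above target) indicates overestimation;
    the softmax assigns such samples a larger weight, so overestimation is
    penalized more heavily than underestimation of the same magnitude.
    """
    td_error = np.asarray(q_pred) - np.asarray(q_target)
    logits = td_error / tau
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    # In a deep-RL implementation the weights would be treated as constants
    # (detached from the computation graph) so gradients flow only through
    # the squared-error term, recovering a reweighted MSE gradient.
    return float((weights * td_error ** 2).sum())
```

Note that as τ grows large the weights become uniform and the loss reduces to the ordinary (symmetric) mean squared error, so the temperature smoothly interpolates between asymmetric and symmetric fitting.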
We find that asymmetric loss functions can inject an inductive bias into the learning process and thereby control the estimation bias present in DRL policy evaluation updates. We highlight the intuition behind our approach in Fig. 1: asymmetry assigns different weights to overestimated and underestimated states, which helps alleviate bias in policy evaluation. Notably, compared with existing methods (Hasselt, 2010; Fujimoto et al., 2018), our approach needs only one critic and one actor, with negligible computational and memory overhead. We summarize our contributions as follows: (1) We show that asymmetric loss functions can be used to learn the value function, and introduce a simple asymmetric loss for policy evaluation, Softmax MSE (SMSE), parameterized by a temperature, that can easily be combined with existing RL algorithms. (2) We further propose an auto-tuning algorithm for the temperature to reduce the parameter-tuning burden of our approach. (3) We show that SMSE can significantly reduce estimation bias and improve the performance of popular baseline algorithms such as DDPG on MuJoCo environments and DQN on Atari games. To the best of our knowledge, this is the first algorithm to achieve such competitive performance without other tricks such as multiple critic networks or crafted exploration methods that often incur extra computational and memory costs.

2. PRELIMINARIES: VALUE-BASED DEEP REINFORCEMENT LEARNING

The interaction of an RL agent with its environment can be formalized as a Markov Decision Process (MDP), represented as a tuple (S, A, p, r, γ), where S and A denote the sets of states and actions respectively, p the transition probabilities, r(s, a) the reward function, and γ the discount factor. The goal of an RL agent is to learn a policy π_ϕ(a_t|s_t) that maximizes the expected return J(π_ϕ) = E[Σ_{t=0}^{∞} γ^t r(s_t, a_t) | π_ϕ]. In RL, the value function is usually learned via temporal difference (TD) learning, an update scheme based on the Bellman equation (Sutton & Barto, 2018; Bellman, 1966). The Bellman equation expresses the value of a state-action pair (s, a) in terms of the value of subsequent state-action pairs (s′, a′):

Q^π(s, a) = r(s, a) + γ E_{s′,a′}[Q^π(s′, a′)],  a′ ∼ π(·|s′).
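The one-step TD quantities implied by this Bellman equation can be written as small helpers; this is a generic sketch (the terminal-state handling via a `done` flag is a standard convention, not taken from the text):

```python
def td_target(reward, q_next, gamma=0.99, done=False):
    # Bellman backup: y = r(s, a) + γ E[Q^π(s', a')], with a' ~ π(·|s');
    # at terminal states the bootstrap term is dropped.
    return reward + gamma * q_next * (1.0 - float(done))

def td_error(q_pred, reward, q_next, gamma=0.99, done=False):
    # TD error δ = Q(s, a) - y; policy evaluation applies a loss
    # (e.g. MSE, Huber, or an asymmetric loss) to δ.
    return q_pred - td_target(reward, q_next, gamma, done)
```

With this sign convention, a positive TD error means the current prediction exceeds the Bellman target, i.e., the sample is overestimated.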




