WEIGHTED BELLMAN BACKUPS FOR IMPROVED SIGNAL-TO-NOISE IN Q-UPDATES

Abstract

Off-policy deep reinforcement learning (RL) has been successful in a range of challenging domains. However, standard off-policy RL algorithms can suffer from a low signal-to-noise ratio and even instability in Q-learning because target values are derived from current Q-estimates, which are often noisy. To mitigate this issue, we propose ensemble-based weighted Bellman backups, which re-weight target Q-values based on uncertainty estimates from a Q-ensemble. We empirically observe that the proposed method stabilizes and improves learning on both continuous and discrete control benchmarks. We also specifically investigate the signal-to-noise aspect by studying environments with noisy rewards, and find that weighted Bellman backups significantly outperform standard Bellman backups. Furthermore, since our weighted Bellman backups rely on maintaining an ensemble, we investigate how they interact with UCB exploration. By enforcing diversity between agents through bootstrapping, we show that these ideas are largely orthogonal and can be fruitfully integrated, together further improving the performance of existing off-policy RL algorithms, such as Soft Actor-Critic and Rainbow DQN, for both continuous and discrete control tasks on both low-dimensional and high-dimensional environments.

1. INTRODUCTION

Model-free reinforcement learning (RL), with high-capacity function approximators such as deep neural networks (DNNs), has been used to solve a variety of sequential decision-making problems, including board games (Silver et al., 2017; 2018), video games (Mnih et al., 2015; Vinyals et al., 2019), and robotic manipulation (Kalashnikov et al., 2018). However, it is well established that these successes are highly sample-inefficient (Kaiser et al., 2020). Recently, a lot of progress has been made toward more sample-efficient model-free RL through improvements in off-policy learning, in both discrete and continuous domains (Fujimoto et al., 2018; Haarnoja et al., 2018; Hessel et al., 2018; Amos et al., 2020). However, standard off-policy RL algorithms can suffer from instability in Q-learning due to error propagation in the Bellman backup, i.e., errors induced in the target value can lead to an increase in overall error in the Q-function (Kumar et al., 2019; 2020).

One way to address the error-propagation issue is to use ensemble methods, which combine multiple models of the value function (Hasselt, 2010; Van Hasselt et al., 2016; Fujimoto et al., 2018). For discrete control tasks, double Q-learning (Hasselt, 2010; Van Hasselt et al., 2016) addressed value overestimation by maintaining two independent estimators of the action values, and this idea was later extended to continuous control in TD3 (Fujimoto et al., 2018). While most prior work has improved stability by taking the minimum over Q-functions, this also needlessly loses signal. We propose an alternative that utilizes ensembles to estimate uncertainty and thereby provide more stable backups.

In this paper, we propose ensemble-based weighted Bellman backups that can be applied to most modern off-policy RL algorithms, such as Q-learning and actor-critic methods. Our main idea is to re-weight sample transitions based on uncertainty estimates from a Q-ensemble.
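The re-weighting idea can be sketched as follows: each transition's Bellman error is down-weighted when the ensemble disagrees about the target value. This is a minimal numpy sketch; the sigmoid weighting form, the temperature parameter, and the 0.5 floor are one plausible instantiation, not necessarily the exact scheme used in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ensemble_weight(q_targets, temperature=10.0):
    """Confidence weight per transition from an ensemble of target Q-values.

    q_targets: array of shape (ensemble_size, batch_size), each row holding
    one ensemble member's target Q prediction for the batch. High
    disagreement (large std across members) means an unreliable target, so
    the weight decays toward a floor of 0.5; zero disagreement gives
    weight 1.0. The temperature controls how fast the weight decays
    (a hypothetical hyperparameter here).
    """
    std = q_targets.std(axis=0)              # per-transition uncertainty
    return sigmoid(-std * temperature) + 0.5  # weight in (0.5, 1.0]

def weighted_td_loss(q_pred, td_target, weights):
    """Weighted mean-squared Bellman error over a batch."""
    return np.mean(weights * (q_pred - td_target) ** 2)
```

In training, each ensemble member would minimize `weighted_td_loss` with weights computed from the shared ensemble targets, so transitions with noisy targets contribute less to the update.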
Because prediction errors can be characterized by uncertainty estimates from ensembles (i.e., the variance of predictions), as shown in Figure 1 (b), we find that the proposed method significantly improves the signal-to-noise ratio of the Q-updates and stabilizes the learning process. Finally, we present a unified framework, coined SUNRISE, that combines our weighted Bellman backups with an inference method that selects actions using the highest upper-confidence bounds (UCB) for efficient exploration (Chen et al., 2017).
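The UCB action selection reuses the same ensemble: at inference time, the agent picks the candidate action whose mean Q-value plus an uncertainty bonus is highest. A minimal sketch, where the trade-off coefficient `lam` and its default are assumptions:

```python
import numpy as np

def ucb_action(q_values, lam=1.0):
    """Select the action with the highest upper-confidence bound.

    q_values: array of shape (ensemble_size, num_actions), each row holding
    one ensemble member's Q-value for every candidate action. The bonus
    lam * std rewards actions the ensemble is uncertain about, encouraging
    exploration; lam (a hypothetical hyperparameter) trades this off
    against the mean-value exploitation term.
    """
    mean = q_values.mean(axis=0)
    std = q_values.std(axis=0)
    return int(np.argmax(mean + lam * std))
```

For example, with two members predicting `[[1.0, 0.0], [1.0, 2.0]]`, both actions have mean 1.0, but the second has nonzero disagreement, so it receives the exploration bonus and is selected.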

