WEIGHTED BELLMAN BACKUPS FOR IMPROVED SIGNAL-TO-NOISE IN Q-UPDATES

Abstract

Off-policy deep reinforcement learning (RL) has been successful in a range of challenging domains. However, standard off-policy RL algorithms can suffer from a low signal-to-noise ratio and even instability in Q-learning because target values are derived from current Q-estimates, which are often noisy. To mitigate this issue, we propose ensemble-based weighted Bellman backups, which re-weight target Q-values based on uncertainty estimates from a Q-ensemble. We empirically observe that the proposed method stabilizes and improves learning on both continuous and discrete control benchmarks. We also investigate the signal-to-noise aspect specifically by studying environments with noisy rewards, and find that weighted Bellman backups significantly outperform standard Bellman backups. Furthermore, since our weighted Bellman backups rely on maintaining an ensemble, we investigate how they interact with UCB exploration. By enforcing diversity between agents using bootstrapping, we show that these ideas are largely orthogonal and can be fruitfully integrated, together further improving the performance of existing off-policy RL algorithms, such as Soft Actor-Critic and Rainbow DQN, on both continuous and discrete control tasks in both low-dimensional and high-dimensional environments.

1. INTRODUCTION

Model-free reinforcement learning (RL) with high-capacity function approximators, such as deep neural networks (DNNs), has been used to solve a variety of sequential decision-making problems, including board games (Silver et al., 2017; 2018), video games (Mnih et al., 2015; Vinyals et al., 2019), and robotic manipulation (Kalashnikov et al., 2018). However, it is well established that these successes come at the cost of very low sample efficiency (Kaiser et al., 2020). Recently, substantial progress has been made toward more sample-efficient model-free RL through improvements in off-policy learning, in both discrete and continuous domains (Fujimoto et al., 2018; Haarnoja et al., 2018; Hessel et al., 2018; Amos et al., 2020).

However, standard off-policy RL algorithms can suffer from instability in Q-learning due to error propagation in the Bellman backup: errors induced in the target value can increase the overall error in the Q-function (Kumar et al., 2019; 2020). One way to address this issue is to use ensemble methods, which combine multiple models of the value function (Hasselt, 2010; Van Hasselt et al., 2016; Fujimoto et al., 2018). For discrete control tasks, double Q-learning (Hasselt, 2010; Van Hasselt et al., 2016) addressed value overestimation by maintaining two independent estimators of the action values; this idea was later extended to continuous control tasks in TD3 (Fujimoto et al., 2018). While most prior work has improved stability by taking the minimum over Q-functions, this needlessly discards signal, and we propose an alternative that uses ensembles to estimate uncertainty and provide more stable backups.

In this paper, we propose ensemble-based weighted Bellman backups that can be applied to most modern off-policy RL algorithms, such as Q-learning and actor-critic methods. Our main idea is to reweight sampled transitions based on uncertainty estimates from a Q-ensemble.
Because prediction errors can be characterized by uncertainty estimates from ensembles (i.e., the variance of predictions), as shown in Figure 1(b), we find that the proposed method significantly improves the signal-to-noise ratio of the Q-updates and stabilizes the learning process. Finally, we present a unified framework, coined SUNRISE, that combines our weighted Bellman backups with an inference method that selects actions with the highest upper-confidence bounds (UCB) for efficient exploration (Chen et al., 2017). We find that these different ideas are largely complementary and can be fruitfully integrated (see Figure 1(a)). We demonstrate the effectiveness of the proposed method using Soft Actor-Critic (SAC; Haarnoja et al., 2018) on continuous control benchmarks (specifically, OpenAI Gym (Brockman et al., 2016) and DeepMind Control Suite (Tassa et al., 2018)) and Rainbow DQN (Hessel et al., 2018) on discrete control benchmarks (specifically, Atari games (Bellemare et al., 2013)). In our experiments, SUNRISE consistently improves the performance of existing off-policy RL methods. Furthermore, we find that the proposed weighted Bellman backups yield improvements in environments with noisy rewards, which have a low signal-to-noise ratio.
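The core mechanism can be sketched as follows: compute Bellman targets from the mean of the target Q-ensemble, and down-weight the TD error of transitions on which the ensemble disagrees. This is a minimal NumPy sketch; the specific weight form sigmoid(-std * T) + 0.5, bounded in [0.5, 1], and the temperature T are illustrative assumptions, not the exact formulation from the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def weighted_bellman_targets(target_q_ensemble, rewards, dones,
                             gamma=0.99, temperature=10.0):
    """Compute Bellman targets and per-transition confidence weights.

    target_q_ensemble: array of shape (N, B) -- target Q-estimates from
        N ensemble members for each of B next-state/action pairs.
    Returns (targets, weights): mean targets of shape (B,) and weights
    in [0.5, 1] that down-weight transitions with high ensemble disagreement.
    """
    q_std = target_q_ensemble.std(axis=0)    # ensemble disagreement (uncertainty)
    next_v = target_q_ensemble.mean(axis=0)  # mean target estimate
    targets = rewards + gamma * (1.0 - dones) * next_v
    # High std -> low confidence -> weight near 0.5; low std -> weight near 1.
    weights = sigmoid(-q_std * temperature) + 0.5
    return targets, weights

def weighted_td_loss(q_pred, targets, weights):
    """Confidence-weighted mean-squared TD error."""
    return np.mean(weights * (q_pred - targets) ** 2)
```

In practice each ensemble member would be a neural network trained on its own bootstrapped minibatch; the weights simply rescale each transition's TD error before averaging.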

2. RELATED WORK

Off-policy RL algorithms. Recently, various off-policy RL algorithms have provided large gains in sample efficiency by reusing past experiences (Fujimoto et al., 2018; Haarnoja et al., 2018; Hessel et al., 2018). Rainbow DQN (Hessel et al., 2018) achieved state-of-the-art performance on the Atari games (Bellemare et al., 2013) by combining several techniques, such as double Q-learning (Van Hasselt et al., 2016) and distributional DQN (Bellemare et al., 2017). For continuous control tasks, SAC (Haarnoja et al., 2018) achieved state-of-the-art sample efficiency by incorporating the maximum entropy framework. Our ensemble method brings orthogonal benefits and is complementary to and compatible with these existing state-of-the-art algorithms.

Stabilizing Q-learning. It has been empirically observed that instability in Q-learning can be caused by applying the Bellman backup on the learned value function (Hasselt, 2010; Van Hasselt et al., 2016; Fujimoto et al., 2018; Song et al., 2019; Kim et al., 2019; Kumar et al., 2019; 2020). Following the principle of double Q-learning (Hasselt, 2010; Van Hasselt et al., 2016), the twin-Q trick (Fujimoto et al., 2018) was proposed to handle the overestimation of value functions in continuous control tasks. Song et al. (2019) and Kim et al. (2019) proposed replacing the max operator with softmax and mellowmax, respectively, to reduce overestimation error. Recently, Kumar et al. (2020) handled the error-propagation issue by reweighting the Bellman backup based on cumulative Bellman errors. Our method differs in that it uses ensembles to estimate uncertainty and thereby provides more stable, higher-signal-to-noise backups.

Ensemble methods in RL. Ensemble methods have been studied for different purposes in RL (Wiering & Van Hasselt, 2008; Osband et al., 2016a; Anschel et al., 2017; Agarwal et al., 2020; Lan et al., 2020). Chua et al. (2018) showed that modeling errors in model-based RL can be reduced using an ensemble of dynamics models, and Kurutach et al. (2018) accelerated policy learning by generating imagined experiences from an ensemble of dynamics models. For efficient exploration, Osband et al. (2016a) and Chen et al. (2017) also leveraged ensembles of Q-functions. However, most prior work has studied these axes of improvement in isolation, whereas we propose a unified framework that addresses multiple issues in off-policy RL algorithms.

Exploration in RL. To balance exploration and exploitation, several methods have been proposed, including maximum entropy frameworks (Ziebart, 2010; Haarnoja et al., 2018), exploration bonus rewards (Bellemare et al., 2016; Houthooft et al., 2016; Pathak et al., 2017; Choi et al., 2019), and randomization (Osband et al., 2016a;b). Despite the success of these exploration methods, a potential drawback is that agents can focus on irrelevant aspects of the environment because the methods do not depend on the rewards. To address this issue, Chen et al. (2017) proposed an exploration strategy that considers both the best estimates (i.e., mean) and the uncertainty (i.e., variance) of the Q-functions for discrete control tasks. We extend this strategy to continuous control tasks and show that it can be combined with other techniques.
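For a discrete action space, the UCB-style selection rule of Chen et al. (2017), which trades off the ensemble's best estimate against its uncertainty, can be sketched as below; the exploration coefficient lam and the NumPy representation of the ensemble are illustrative assumptions.

```python
import numpy as np

def ucb_action(q_ensemble, lam=1.0):
    """Select the action maximizing mean(Q) + lam * std(Q).

    q_ensemble: array of shape (N, A) -- Q-values for A actions from
        N ensemble members at the current state.
    lam: exploration coefficient; lam = 0 recovers greedy selection
        over the ensemble mean.
    """
    q_mean = q_ensemble.mean(axis=0)  # best estimate per action
    q_std = q_ensemble.std(axis=0)    # uncertainty bonus per action
    return int(np.argmax(q_mean + lam * q_std))
```

The same mean-plus-variance criterion can be used to rank candidate actions sampled from a continuous policy, which is how the strategy extends beyond discrete control.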

3. BACKGROUND

Reinforcement learning. We consider a standard RL framework in which an agent interacts with an environment in discrete time. Formally, at each timestep t, the agent receives a state s_t from the environment and chooses an action a_t based on its policy π. The environment returns a reward r_t and the agent transitions to the next state s_{t+1}. The return $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ is the total accumulated reward from timestep t with discount factor γ ∈ [0, 1). The goal of RL is to maximize the expected return.
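For a finite trajectory, the discounted return above satisfies the backward recursion R_t = r_t + γ R_{t+1}, which yields a simple sketch for computing returns from a recorded reward sequence:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = sum_k gamma^k * r_{t+k} for a finite reward
    sequence via the backward recursion R_t = r_t + gamma * R_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```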

