PREVENTING VALUE FUNCTION COLLAPSE IN ENSEMBLE Q-LEARNING BY MAXIMIZING REPRESENTATION DIVERSITY

Anonymous

Abstract

The first deep RL algorithm, DQN, was limited by the overestimation bias of the learned Q-function. Subsequent algorithms proposed techniques to reduce this bias without fully eliminating it. Recently, the Maxmin and Ensemble Q-learning algorithms have used the differing estimates provided by ensembles of learners to reduce the bias. Unfortunately, these learners can converge to the same point in the parametric or representation space, at which point the algorithm falls back to the classic single-network DQN. In this paper, we describe a regularization technique that maximizes diversity in the representation space for these algorithms. We propose and compare five regularization functions inspired by economics theory and consensus optimization. We show that the resulting approach significantly outperforms the Maxmin and Ensemble Q-learning algorithms as well as non-ensemble baselines.

1. INTRODUCTION

Q-learning (Watkins, 1989) and its deep-learning-based successors, inaugurated by DQN (Mnih et al., 2015), are model-free, value-function-based reinforcement learning algorithms. Their popularity stems from an intuitive, easy-to-implement update rule derived from the Bellman equation. At each time step, the agent updates its Q-value towards the expectation of the current reward plus the value of the maximal action in the next state. This state-action value represents the maximum sum of rewards the agent believes it can obtain from the current state by taking the current action. Unfortunately, Thrun & Schwartz (1993) and van Hasselt (2010) have shown that this simple rule suffers from overestimation bias: due to the maximization operator in the update rule, positive and negative errors do not cancel each other out; instead, positive errors accumulate. The overestimation bias is particularly problematic under function approximation and has been shown to contribute to learning sub-optimal policies (Thrun & Schwartz, 1993; Szita & Lőrincz, 2008; Strehl et al., 2009). A possible solution is to introduce an underestimation bias in the estimation of the Q-value. Double Q-learning (van Hasselt, 2010) maintains two independent state-action value estimators (Q-functions): the target for one estimator is computed by adding the observed reward to the state-action value assigned by the other estimator. Double DQN (van Hasselt et al., 2016) applied this idea using neural networks and was shown to outperform DQN. More recent actor-critic deep RL algorithms such as TD3 (Fujimoto et al., 2018) and SAC (Haarnoja et al., 2018) also use two Q-function estimators (in combination with other techniques). Other approaches, such as EnsembleDQN (Anschel et al., 2017) and MaxminDQN (Lan et al., 2020), maintain ensembles of Q-functions to estimate an unbiased Q-function.
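The two targets discussed above can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation; the function names and the array-of-next-state-values representation are our own assumptions.

```python
import numpy as np

def q_learning_target(q_next, reward, gamma):
    """Standard Q-learning target: max over the agent's own estimates of
    the next-state action values. The max operator is the source of the
    overestimation bias, since positive estimation errors accumulate."""
    return reward + gamma * np.max(q_next)

def double_q_target(q_a_next, q_b_next, reward, gamma):
    """Double Q-learning target for estimator A: A selects the greedy next
    action, while the independent estimator B evaluates it, decoupling
    action selection from value estimation."""
    a_star = int(np.argmax(q_a_next))   # selection by estimator A
    return reward + gamma * q_b_next[a_star]  # evaluation by estimator B
```

Because estimator B's errors are independent of A's greedy choice, the double target is not systematically inflated by the max operator.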
EnsembleDQN computes its target by adding the observed reward to the maximal state-action value of the average of the Q-functions in the ensemble. MaxminDQN instead builds a proxy Q-function by taking, for each action, the minimum Q-value across all Q-functions in the ensemble, and uses the maximal state-action value of this proxy as the target. Both EnsembleDQN and MaxminDQN have been shown to perform better than Double DQN. The primary insight of this paper is that the performance of ensemble-based methods is contingent on maintaining sufficient diversity in the representation space between the Q-functions in the ensemble. If the Q-functions in the ensemble converge to a common representation (we will show that this is the case in many scenarios), the performance of these approaches degrades significantly.
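The two ensemble targets differ only in how the ensemble is reduced before the max. A minimal sketch, assuming the ensemble's next-state values are stacked in an array of shape (n_estimators, n_actions); the function names are illustrative, not from the original papers:

```python
import numpy as np

def ensemble_dqn_target(q_values, reward, gamma):
    """EnsembleDQN: average the ensemble per action, then take the max."""
    q_avg = q_values.mean(axis=0)        # average Q-function over the ensemble
    return reward + gamma * np.max(q_avg)

def maxmin_dqn_target(q_values, reward, gamma):
    """MaxminDQN: per-action minimum across the ensemble forms a proxy
    Q-function; the max of this pessimistic proxy is the target."""
    q_min = q_values.min(axis=0)         # element-wise minimum proxy
    return reward + gamma * np.max(q_min)
```

Since the per-action minimum is bounded above by the per-action average, the Maxmin target is never larger than the Ensemble target, which is how it counteracts overestimation. Note that both reductions collapse to the ordinary DQN target when all ensemble members produce identical values, which is the failure mode this paper addresses.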

