PREVENTING VALUE FUNCTION COLLAPSE IN ENSEMBLE Q-LEARNING BY MAXIMIZING REPRESENTATION DIVERSITY

Anonymous

Abstract

The first deep RL algorithm, DQN, was limited by the overestimation bias of the learned Q-function. Subsequent algorithms proposed techniques to reduce this problem, without fully eliminating it. Recently, the Maxmin and Ensemble Q-learning algorithms used the different estimates provided by ensembles of learners to reduce the bias. Unfortunately, these learners can converge to the same point in the parametric or representation space, falling back to the classic single-network DQN. In this paper, we describe a regularization technique to maximize diversity in the representation space in these algorithms. We propose and compare five regularization functions inspired by economics theory and consensus optimization. We show that the resulting approach significantly outperforms the Maxmin and Ensemble Q-learning algorithms as well as non-ensemble baselines.

1. INTRODUCTION

Q-learning (Watkins, 1989) and its deep learning based successors, inaugurated by DQN (Mnih et al., 2015), are model-free, value function based reinforcement learning algorithms. Their popularity stems from their intuitive, easy-to-implement update rule derived from the Bellman equation. At each time step, the agent updates its Q-value towards the expectation of the current reward plus the value corresponding to the maximal action in the next state. This state-action value represents the maximum sum of rewards the agent believes it could obtain from the current state by taking the current action. Unfortunately, Thrun & Schwartz (1993) and van Hasselt (2010) have shown that this simple rule suffers from overestimation bias: due to the maximization operator in the update rule, positive and negative errors do not cancel each other out; instead, positive errors accumulate. The overestimation bias is particularly problematic under function approximation and has been shown to contribute to learning sub-optimal policies (Thrun & Schwartz, 1993; Szita & Lőrincz, 2008; Strehl et al., 2009).

A possible solution is to introduce an underestimation bias in the estimation of the Q-value. Double Q-learning (van Hasselt, 2010) maintains two independent state-action value estimators (Q-functions). The target for one estimator is calculated by adding the observed reward and the maximal state-action value from the other estimator. Double DQN (van Hasselt et al., 2016) applied this idea using neural networks, and was shown to provide better performance than DQN. More recent actor-critic type deep RL algorithms such as TD3 (Fujimoto et al., 2018) and SAC (Haarnoja et al., 2018) also use two Q-function estimators (in combination with other techniques). Other approaches, such as EnsembleDQN (Anschel et al., 2017) and MaxminDQN (Lan et al., 2020), maintain ensembles of Q-functions to estimate an unbiased Q-function. EnsembleDQN estimates the state-action values by adding the current observed reward and the maximal state-action value from the average of the Q-functions in the ensemble. MaxminDQN creates a proxy Q-function by selecting, for each action, the minimum Q-value across all the Q-functions, and uses the maximal state-action value from this proxy Q-function as the target. Both EnsembleDQN and MaxminDQN have been shown to perform better than Double DQN.

The primary insight of this paper is that the performance of ensemble based methods is contingent on maintaining sufficient diversity in the representation space between the Q-functions in the ensembles. If the Q-functions in the ensembles converge to a common representation (we will show that this is the case in many scenarios), the performance of these approaches significantly degrades. In this paper we propose to use cross-learner regularizers to prevent the collapse of the representation space in ensemble-based Q-learning methods. Intuitively, these regularizers capture an inductive bias towards more diverse representations. We investigate five different regularizers. The mathematical formulation of four of the regularizers corresponds to inequality measures borrowed from economics theory. While in economics high inequality is seen as a negative, here we use these metrics to encourage inequality between the representations. The fifth regularizer is inspired by consensus optimization.
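To make the two ensemble target constructions concrete, the following is a minimal PyTorch sketch, assuming each ensemble member in q_nets maps a batch of next states to per-action Q-values; the function and variable names are illustrative and are not taken from the original EnsembleDQN or MaxminDQN implementations.

```python
import torch

def ensemble_dqn_target(q_nets, rewards, next_states, gamma):
    # EnsembleDQN: average the Q-values over ensemble members, then take the
    # maximal action value of the averaged Q-function.
    with torch.no_grad():
        q_values = torch.stack([q(next_states) for q in q_nets])  # (N, batch, actions)
        mean_q = q_values.mean(dim=0)                              # (batch, actions)
        return rewards + gamma * mean_q.max(dim=1).values

def maxmin_dqn_target(q_nets, rewards, next_states, gamma):
    # MaxminDQN: build a proxy Q-function from the per-action minimum across the
    # ensemble, then take the maximal action value of that proxy.
    with torch.no_grad():
        q_values = torch.stack([q(next_states) for q in q_nets])  # (N, batch, actions)
        min_q = q_values.min(dim=0).values                        # (batch, actions)
        return rewards + gamma * min_q.max(dim=1).values
```

Both targets reduce to the standard DQN target when all ensemble members produce identical Q-values, which is why representation collapse removes the benefit of the ensemble.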
There is a separate line of reinforcement learning literature in which ensembles are used to address other issues, such as exploration and error propagation (Chen et al., 2017; Chua et al., 2018; Kurutach et al., 2018; Lee et al., 2020; Osband et al., 2016), but we limit our discussion to algorithms addressing the overestimation bias problem. To summarize, our contributions are the following:
1. We show that high representation similarity between neural network based Q-functions leads to a decline in performance in ensemble based Q-learning methods.
2. To mitigate this, we propose five regularizers, based on inequality measures from economics theory and on consensus optimization, that maximize representation diversity between the Q-functions in ensemble based Q-learning methods (an illustrative sketch of this idea follows this list).
3. We show that applying the proposed regularizers to the MaxminDQN and EnsembleDQN methods can lead to significant improvements in performance over a variety of benchmarks.
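As an illustration of the cross-learner regularization idea, the sketch below computes a simple diversity bonus, the mean pairwise distance between the learners' penultimate-layer representations, and subtracts it from the TD loss so that optimization pushes the representations apart. This is a stand-in for the concept rather than one of the five regularizers evaluated in the paper, and the names (representation_diversity_bonus, beta) are hypothetical.

```python
import torch

def representation_diversity_bonus(features):
    # features: list of (batch, d) penultimate-layer activations, one per learner
    # (assumes at least two learners). Returns the mean pairwise L2 distance
    # between learners' representations; larger means more diverse.
    pairwise = []
    n = len(features)
    for i in range(n):
        for j in range(i + 1, n):
            pairwise.append(torch.norm(features[i] - features[j], dim=1).mean())
    return torch.stack(pairwise).mean()

# Illustrative use: subtract the weighted bonus from the TD loss so that gradient
# descent increases representation diversity; beta is a hypothetical weight.
# loss = td_loss - beta * representation_diversity_bonus([f_1, ..., f_N])
```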

2. BACKGROUND

Reinforcement learning considers an agent interacting with a Markov Decision Process (MDP) defined as a five element tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ gives the state-action transition probabilities, $r : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the reward mapping and $\gamma \in [0, 1]$ is the discount factor. At each time step $t$ the agent observes the state of the environment $s_t \in \mathcal{S}$ and selects an action $a_t \in \mathcal{A}$. The action triggers a transition to a new state $s_{t+1} \in \mathcal{S}$ according to the transition probabilities $P$, while the agent receives a scalar reward $R_t = r(s_t, a_t, s_{t+1})$. The goal of the agent is to learn a policy $\pi$ that maximizes the expectation of the discounted sum of future rewards. One way to implicitly learn the policy $\pi$ is the Q-learning algorithm, which estimates the expected sum of rewards obtained from state $s_t$ by taking action $a_t$ by solving the Bellman equation
$$Q^*(s_t, a_t) = \mathbb{E}\left[ R_t + \gamma \max_{a' \in \mathcal{A}} Q^*(s_{t+1}, a') \right].$$
The implicit policy $\pi$ can be extracted by acting greedily with respect to the optimal Q-function: $\arg\max_{a \in \mathcal{A}} Q^*(s, a)$. One possible way to estimate the optimal Q-value is by iteratively updating it for sampled states $s_t$ and actions $a_t$ using
$$Q^*(s_t, a_t) \leftarrow Q^*(s_t, a_t) + \alpha \left( Y_t - Q^*(s_t, a_t) \right), \quad \text{where } Y_t = R_t + \gamma \max_{a' \in \mathcal{A}} Q^*(s_{t+1}, a'),$$
$\alpha$ is the step size and $Y_t$ is called the target value.

While this algorithm was initially studied in the context of a tabular representation of $Q$ for discrete states and actions, in many practical applications the Q-value is approximated by a learned function. Since the emergence of deep learning, the preferred approximation technique is based on a deep neural network. DQN (Mnih et al., 2015) demonstrated super-human performance in Atari games, but required a very large number of training iterations. From this baseline, subsequent algorithms improved both the learning speed and the achievable performance, with one of the main means for this being techniques to reduce the overestimation bias of the Q-function. EnsembleDQN (Anschel et al., 2017) uses an ensemble of $N$ neural networks to estimate state-action values and uses their average to reduce both overestimation bias and estimation variance. Formally,

