PREVENTING VALUE FUNCTION COLLAPSE IN ENSEMBLE Q-LEARNING BY MAXIMIZING REPRESENTATION DIVERSITY

Anonymous

Abstract

The first deep RL algorithm, DQN, was limited by the overestimation bias of the learned Q-function. Subsequent algorithms proposed techniques to reduce this problem, without fully eliminating it. Recently, the Maxmin and Ensemble Q-learning algorithms used the different estimates provided by ensembles of learners to reduce the bias. Unfortunately, these learners can converge to the same point in the parametric or representation space, falling back to the classic single neural network DQN. In this paper, we describe a regularization technique to maximize diversity in the representation space in these algorithms. We propose and compare five regularization functions inspired from economics theory and consensus optimization. We show that the resulting approach significantly outperforms the Maxmin and Ensemble Q-learning algorithms as well as non-ensemble baselines.

1. INTRODUCTION

Q-learning (Watkins, 1989) and its deep learning based successors inaugurated by DQN (Mnih et al., 2015) are model-free, value function based reinforcement learning algorithms. Their popularity stems from their intuitive, easy-to-implement update rule derived from the Bellman equation. At each time step, the agent updates its Q-value towards the expectation of the current reward plus the value corresponding to the maximal action in the next state. This state-action value represents the maximum sum of rewards the agent believes it could obtain from the current state by taking the current action. Unfortunately, Thrun & Schwartz (1993) and van Hasselt (2010) have shown that this simple rule suffers from overestimation bias: due to the maximization operator in the update rule, positive and negative errors do not cancel each other out; instead, positive errors accumulate. The overestimation bias is particularly problematic under function approximation and has been shown to contribute to learning sub-optimal policies (Thrun & Schwartz, 1993; Szita & Lőrincz, 2008; Strehl et al., 2009). A possible solution is to introduce underestimation bias in the estimation of the Q-value. Double Q-learning (van Hasselt, 2010) maintains two independent state-action value estimators (Q-functions). The state-action value of the first estimator is calculated by adding the observed reward and the maximal state-action value from the other estimator. Double DQN (Hado van Hasselt et al., 2016) applied this idea using neural networks, and was shown to provide better performance than DQN. More recent actor-critic type deep RL algorithms such as TD3 (Fujimoto et al., 2018) and SAC (Haarnoja et al., 2018) also use two Q-function estimators (in combination with other techniques). Other approaches such as EnsembleDQN (Anschel et al., 2017) and MaxminDQN (Lan et al., 2020) maintain ensembles of Q-functions to estimate an unbiased Q-function.
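The effect of the max operator can be reproduced with a small simulation: when several actions have equal true value and the estimates are noisy, taking the max yields a positive bias, while the double-estimator idea behind Double Q-learning removes most of it. This is an illustrative sketch, not code from any of the cited papers; the action count and noise scale are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five actions with identical true value 0; estimates carry zero-mean noise.
n_actions, n_trials = 5, 10_000
noise_a = rng.normal(size=(n_trials, n_actions))
noise_b = rng.normal(size=(n_trials, n_actions))

# Single estimator: the max over noisy estimates accumulates positive errors.
single_bias = noise_a.max(axis=1).mean()

# Double estimator: select the argmax with one estimator, evaluate with the other,
# so the selection noise and the evaluation noise are independent.
best = noise_a.argmax(axis=1)
double_bias = noise_b[np.arange(n_trials), best].mean()

print(single_bias)   # clearly positive (overestimation)
print(double_bias)   # close to zero
```

The single estimator overestimates even though every action is worth exactly zero, which is the phenomenon the underestimation-biased methods above are designed to counteract.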
EnsembleDQN estimates the state-action values by adding the currently observed reward and the maximal state-action value from the average of the Q-functions in the ensemble. MaxminDQN creates a proxy Q-function by selecting the minimum Q-value for each action from all the Q-functions, and uses the maximal state-action value from the proxy Q-function to estimate an unbiased Q-function. Both EnsembleDQN and MaxminDQN have been shown to perform better than Double DQN. The primary insight of this paper is that the performance of ensemble based methods is contingent on maintaining sufficient diversity in the representation space between the Q-functions in the ensembles. If the Q-functions in the ensembles converge to a common representation (we will show that this is the case in many scenarios), the performance of these approaches significantly degrades. In this paper we propose to use cross-learner regularizers to prevent the collapse of the representation space in ensemble-based Q-learning methods. Intuitively, these regularizers capture an inductive bias towards more diverse representations. We have investigated five different regularizers. The mathematical formulation of four of the regularizers corresponds to inequality measures borrowed from economics theory. While in economics high inequality is seen as a negative, in this case we use the metrics to encourage inequality between the representations. The fifth regularizer is inspired from consensus optimization. There is a separate line of reinforcement learning literature where ensembles are used to address several different issues (Chen et al., 2017; Chua et al., 2018; Kurutach et al., 2018; Lee et al., 2020; Osband et al., 2016) such as exploration and error propagation, but we limit our solution to algorithms addressing the overestimation bias problem only. To summarize, our contributions are the following:

1. We show that high representation similarity between neural network based Q-functions leads to a decline in performance in ensemble based Q-learning methods.

2. To mitigate this, we propose five regularizers based on inequality measures from economics theory and consensus optimization that maximize representation diversity between Q-functions in ensemble based Q-learning methods.

3. We show that applying the proposed regularizers to the MaxminDQN and EnsembleDQN methods can lead to significant improvement in performance over a variety of benchmarks.

2. BACKGROUND

Reinforcement learning considers an agent interacting with a Markov Decision Process (MDP) defined as a five element tuple (S, A, P, r, γ), where S is the state space, A is the action space, P : S × A × S → [0, 1] are the state-action transition probabilities, r : S × A × S → R is the reward mapping and γ ∈ [0, 1] is the discount factor. At each time step t the agent observes the state of the environment s_t ∈ S and selects an action a_t ∈ A. The action triggers a transition to a new state s_{t+1} ∈ S according to the transition probabilities P, while the agent receives a scalar reward R_t = r(s_t, a_t, s_{t+1}). The goal of the agent is to learn a policy π that maximizes the expectation of the discounted sum of future rewards. One way to implicitly learn the policy π is the Q-learning algorithm, which estimates the expected sum of rewards of state s_t if we take action a_t by solving the Bellman equation

Q*(s_t, a_t) = E[R_t + γ max_{a' ∈ A} Q*(s_{t+1}, a')]

The implicit policy π can be extracted by acting greedily with respect to the optimal Q-function: arg max_{a ∈ A} Q*(s, a). One possible way to estimate the optimal Q-value is by iteratively updating it for sampled states s_t and actions a_t using

Q*(s_t, a_t) ← Q*(s_t, a_t) + α (Y_t − Q*(s_t, a_t)),   where   Y_t = R_t + γ max_{a' ∈ A} Q*(s_{t+1}, a')

where α is the step size and Y_t is called the target value. While this algorithm was initially studied in the context of a tabular representation of Q for discrete states and actions, in many practical applications the Q-value is approximated by a learned function. Since the emergence of deep learning, the preferred approximation technique is based on a deep neural network. DQN (Mnih et al., 2015) demonstrated super-human performance in Atari games, but required a very large number of training iterations.
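The tabular update rule above can be sketched in a few lines of NumPy; the toy MDP sizes and the step-size value are arbitrary assumptions made for illustration.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: Q(s,a) <- Q(s,a) + alpha * (Y - Q(s,a))."""
    y = r + gamma * Q[s_next].max()   # target value Y_t
    Q[s, a] += alpha * (y - Q[s, a])
    return Q

Q = np.zeros((3, 2))                  # toy MDP: 3 states, 2 actions
Q = q_update(Q, s=0, a=1, r=1.0, s_next=2)
# Q[0, 1] moves a fraction alpha of the way toward the target
```

With all Q-values initialized to zero, a single update with reward 1.0 moves Q[0, 1] to alpha × 1.0 = 0.1, while every other entry stays untouched.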
From this baseline, subsequent algorithms improved both the learning speed and the achievable performance, one of the main means being techniques to reduce the overestimation bias of the Q-function. EnsembleDQN (Anschel et al., 2017) uses an ensemble of N neural networks to estimate state-action values and uses their average to reduce both overestimation bias and estimation variance. Formally, the target value for EnsembleDQN is calculated using

Q^E(·) = (1/N) Σ_{i=1}^{N} Q_i(·),   Y^E_t = R_t + γ max_{a' ∈ A} Q^E(s_{t+1}, a')   (1)

More recently, MaxminDQN (Lan et al., 2020) addresses the overestimation bias using order statistics, using the ensemble size N as a hyperparameter to tune between underestimation and overestimation bias. The target value for MaxminDQN is calculated using

Q^M(·, ·) = min_{i=1,...,N} Q_i(·, ·),   Y^M_t = R_t + γ max_{a' ∈ A} Q^M(s_{t+1}, a')   (2)
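The two target computations differ only in how the ensemble is reduced before taking the max over actions. A minimal sketch (the `q_next` values are made up for illustration; this is not the authors' implementation):

```python
import numpy as np

GAMMA = 0.99

def ensemble_target(q_next, r, gamma=GAMMA):
    """Y^E: average the N Q-functions, then take the best action value."""
    return r + gamma * q_next.mean(axis=0).max()

def maxmin_target(q_next, r, gamma=GAMMA):
    """Y^M: elementwise min over the N Q-functions, then the best action value."""
    return r + gamma * q_next.min(axis=0).max()

# Two networks, two actions: rows are networks, columns are actions.
q_next = np.array([[1.0, 3.0],
                   [2.0, 0.0]])
y_e = ensemble_target(q_next, r=1.0)   # mean = [1.5, 1.5] -> 1 + 0.99 * 1.5
y_m = maxmin_target(q_next, r=1.0)     # min  = [1.0, 0.0] -> 1 + 0.99 * 1.0
```

Because the elementwise min is never above the mean, the maxmin target is never larger than the ensemble target for the same Q-values, which is how MaxminDQN trades overestimation for underestimation as N grows.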

3. RELATED WORK

Techniques to Address Overestimation Bias in RL: Addressing overestimation bias is a long-standing research topic not only in reinforcement learning but also in other fields such as economics and statistics. It is commonly known as the max-operator bias in statistics (D'Eramo et al., 2017) and as the winner's curse in economics (Thaler, 2012; Smith & Winkler, 2006). To address this, (van Hasselt, 2010) proposed Double Q-learning, subsequently adapted to neural network based function approximators as Double DQN (Hado van Hasselt et al., 2016). Alternatively, (Zhang et al., 2017; Lv et al., 2019) proposed weighted estimators of Double Q-learning and (Lee et al., 2013) introduced a bias correction term. Other approaches to address the overestimation are based on averaging and ensembling. Techniques include averaging Q-values from the previous N versions of the Q-network (Anschel et al., 2017), taking linear combinations of the min and max over a pool of Q-values (Kumar et al., 2019), or using a random mixture from the pool (Agarwal et al., 2019). Regularization in Reinforcement Learning: Regularization in reinforcement learning has been used to perform effective exploration and to learn generalized policies. For instance, (Grau-Moya et al., 2019) uses mutual-information regularization to optimize a prior action distribution for better performance and exploration, (Cheng et al., 2019) regularizes the policy π(a|s) using a control prior, and (Galashov et al., 2019) uses temporal difference error regularization to reduce variance in Generalized Advantage Estimation (Schulman et al., 2016). Generalization in reinforcement learning refers to the performance of the policy in environments different from the training environment. For example, (Farebrother et al., 2018) studied the effect of the L2 norm on generalization in DQN, (Tobin et al., 2017) studied generalization between simulations vs.
the real world, (Pattanaik et al., 2018) studied parameter variations, and (Zhang et al., 2018) studied the effect of different random seeds in environment generation. Representation Similarity: Measuring the similarity between the representations learned by different neural networks is an active area of research. For instance, (Raghu et al., 2017) used Canonical Correlation Analysis (CCA) to measure representation similarity. CCA finds two basis matrices such that, when the original matrices are projected onto these bases, the correlation is maximized. (Raghu et al., 2017; Mroueh et al., 2015) used truncated singular value decomposition on the activations to make the measure robust to perturbations. Other work such as (Li et al., 2015) and (Wang et al., 2018) studied the correlation between neurons in neural networks.

4. MAXIMIZING REPRESENTATION DIVERSITY IN ENSEMBLE-BASED DEEP Q-LEARNING

The work described in this paper is based on the conjecture that while ensemble-based deep Q-learning approaches aim to reduce the overestimation bias, this only works to the degree that the neural networks in the ensemble use diverse representations. If during training these networks collapse to closely related representations, the learning performance decreases. From this idea, we propose to use regularization techniques to maximize representation diversity between the networks of the ensemble.

4.1. REPRESENTATION SIMILARITY MEASURE

Let X ∈ R^{n×p1} denote a matrix of activations of p1 neurons for n examples and Y ∈ R^{n×p2} denote a matrix of activations of p2 neurons for the same n examples. Furthermore, we consider K_ij = k(x_i, x_j) and L_ij = l(y_i, y_j) where k and l are two kernels. Centered Kernel Alignment (CKA) (Kornblith et al., 2019; Cortes et al., 2012; Cristianini et al., 2002) is a method for comparing representations of neural networks and identifying correspondences between layers, not only in the same network but also across different neural network architectures. CKA is a normalized form of the Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., 2005). Formally, CKA is defined as:

CKA(K, L) = HSIC(K, L) / √(HSIC(K, K) · HSIC(L, L))

HSIC is a test statistic for determining whether two sets of variables are independent. The empirical estimator of HSIC is defined as:

HSIC(K, L) = 1/(n − 1)² tr(KHLH)

where H = I_n − (1/n) 11^T is the centering matrix.
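A minimal NumPy sketch of CKA with linear kernels, following the definitions above (the activation shapes are arbitrary; this is not the original authors' code):

```python
import numpy as np

def linear_cka(X, Y):
    """CKA with linear kernels: K = X X^T, L = Y Y^T, H the centering matrix."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # H = I_n - (1/n) 1 1^T
    K, L = X @ X.T, Y @ Y.T                      # linear Gram matrices

    def hsic(A, B):
        # Empirical HSIC estimator: tr(A H B H) / (n - 1)^2
        return np.trace(A @ H @ B @ H) / (n - 1) ** 2

    return hsic(K, L) / np.sqrt(hsic(K, K) * hsic(L, L))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 16))   # activations: 50 examples, 16 neurons
```

CKA is invariant to isotropic scaling, so `linear_cka(X, 2 * X)` equals 1 just like `linear_cka(X, X)`; unrelated random representations score well below 1.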

4.2. CORRELATION BETWEEN PERFORMANCE AND REPRESENTATION SIMILARITY

The work in this paper starts from the conjecture that high representation similarity between neural networks in an ensemble-based Q-learning technique correlates with poor performance. To empirically verify this hypothesis, we trained a MaxminDQN agent with two neural networks on the Catcher environment (Qingfeng, 2019) for about 3000 episodes (5 × 10^6 training steps) and calculated the CKA similarity with a linear kernel every 500 episodes. The training graph along with the CKA similarity heatmaps are shown in Figure 1. Notably, at episode 500 (heatmap A) and episode 2000 (heatmap C), the representation similarity between the neural networks is low but the average return is relatively high. In contrast, at episode 1000 (heatmap B) and episode 3000 (heatmap D) the representation similarity is highest but the average return is lowest. Additionally, in Appendix A.1, we performed a regression experiment to demonstrate that two neural networks trained on the same data can learn almost identical representations despite having different architectures, learning rates and batch sizes. This experiment also demonstrates that the belief that random initialization of neural networks enforces diversity is a misconception.

4.3. REGULARIZATION FOR MAXIMIZING REPRESENTATION DIVERSITY

In order to maximize the representation diversity, we propose to regularize the training algorithm with an additional criterion that favors diversity in the representation space. In the following, N is the number of neural networks in the ensemble, ℓ_i is the L2 norm of the i-th neural network's parameters, μ_ℓ is the mean of all the L2 norms and ℓ = (ℓ_1, ..., ℓ_N) is the list of all the L2 norms. The first four metrics we consider are based on inequality measures from economic theory. While in economics inequality is usually considered something to be avoided, in our case we aim to increase inequality (and thus, representation diversity). The Atkinson Index (Atkinson et al., 1970) measures income inequality and is useful in identifying the end of the distribution that contributes the most towards the observed inequality. Formally, it is defined as

A = 1 − (1/μ_ℓ) [ (1/N) Σ_{i=1}^{N} ℓ_i^{1−at} ]^{1/(1−at)},   for 0 ≤ at, at ≠ 1
A = 1 − (1/μ_ℓ) ( Π_{i=1}^{N} ℓ_i )^{1/N},   for at = 1   (3)

where at is the inequality aversion parameter used to tune the sensitivity of the measured change. When at = 0, the index is more sensitive to changes at the upper end of the distribution, while the index becomes more sensitive to changes at the lower end of the distribution as at approaches 1. The Gini coefficient (Allison, 1978) is a statistical measure of the wealth distribution or income inequality among a population, defined as half of the relative mean absolute difference:

G = Σ_{i=1}^{N} Σ_{j=1}^{N} |ℓ_i − ℓ_j| / (2 N² μ_ℓ)   (4)

The Gini coefficient is more sensitive to deviations around the middle of the distribution than at the upper or lower parts of the distribution. The Theil index (Johnston, 1969) measures redundancy, lack of diversity, isolation, segregation and income inequality among a population.
Using the Theil index is identical to measuring redundancy in information theory, defined as the maximum possible entropy of the data minus the observed entropy:

T = (1/N) Σ_{i=1}^{N} (ℓ_i / μ_ℓ) ln(ℓ_i / μ_ℓ)   (5)

The variance of logarithms (Ok & Foster, 1997) is a widely used measure of dispersion with natural links to wage distribution models. Formally, it is defined as:

V_L(ℓ) = (1/N) Σ_{i=1}^{N} [ln ℓ_i − ln g(ℓ)]²   (6)

where g(ℓ) is the geometric mean of ℓ, defined as (Π_{i=1}^{N} ℓ_i)^{1/N}. The final regularization method we use is inspired from consensus optimization. In a consensus method (Boyd et al., 2011), a number of models are independently optimized with their own task-specific parameters, and the tasks communicate via a penalty that encourages all the individual solutions to converge around a common value. Formally, it is defined as

M = ‖μ_ℓ − ℓ_i‖²   (7)

We will refer to this regularizer as MeanVector throughout this paper.
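All five regularizers operate on the list of per-network parameter norms. A NumPy sketch of the five measures with the same symbols as Equations (3)-(7) (the function names and test values are our own, not from the paper):

```python
import numpy as np

def atkinson(l, at=0.5):
    """Atkinson index (Eq. 3); at is the inequality aversion parameter."""
    l = np.asarray(l, dtype=float)
    mean = l.mean()
    if np.isclose(at, 1.0):
        geo = np.exp(np.log(l).mean())            # geometric mean of the norms
        return 1.0 - geo / mean
    return 1.0 - (np.mean(l ** (1.0 - at))) ** (1.0 / (1.0 - at)) / mean

def gini(l):
    """Gini coefficient (Eq. 4): half the relative mean absolute difference."""
    l = np.asarray(l, dtype=float)
    return np.abs(l[:, None] - l[None, :]).sum() / (2 * len(l) ** 2 * l.mean())

def theil(l):
    """Theil index (Eq. 5): mean of r * ln(r) with r the relative norm."""
    l = np.asarray(l, dtype=float)
    r = l / l.mean()
    return np.mean(r * np.log(r))

def variance_of_logs(l):
    """Variance of logarithms (Eq. 6); ln g(l) equals the mean of the logs."""
    logs = np.log(np.asarray(l, dtype=float))
    return np.mean((logs - logs.mean()) ** 2)

def mean_vector(l, i):
    """MeanVector penalty (Eq. 7) for network i."""
    l = np.asarray(l, dtype=float)
    return (l.mean() - l[i]) ** 2

equal = [2.0, 2.0, 2.0]     # identical norms: every measure reports zero
spread = [1.0, 2.0, 4.0]    # diverse norms: the inequality measures are positive
```

All four inequality measures vanish when the norms coincide and grow as the norms spread out, which is exactly the quantity the training loss below pushes upward.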

4.4. TRAINING ALGORITHM

Using the regularization functions defined above, we can develop diversity-regularized variants of the MaxminDQN and EnsembleDQN algorithms. The training technique is identical to the algorithms described in (Lan et al., 2020) and (Anschel et al., 2017), with a regularization term added to the loss of the Q-functions. The loss term for the i-th Q-function with parameters ψ_i is:

L(ψ_i) = E_{s,a,r,s'} [ (Q_i^{ψ_i}(s, a) − Y)² ] − λ I(ℓ_i, ℓ),

where Y is the target value calculated using either Equation (1) or Equation (2) depending on the algorithm, I is the regularizer of choice from the list above and λ is the regularization weight. Notice that the regularization term appears with a negative sign, as the regularizers are essentially inequality metrics that we want to maximize. For completeness, the algorithms are shown in Appendix B.
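The regularized loss combines a standard TD error with the negatively weighted diversity term. A schematic NumPy version (the shapes, sample values, and the `regularizer` callback are illustrative assumptions; a real implementation would compute this inside the deep learning framework so gradients flow through the parameter norms):

```python
import numpy as np

def regularized_loss(q_pred, target, norms, i, regularizer, lam=1e-7):
    """Mean squared TD error for network i minus the weighted diversity bonus."""
    td_loss = np.mean((q_pred - target) ** 2)
    return td_loss - lam * regularizer(norms, i)

# Illustrative choice of regularizer: the MeanVector penalty of Eq. (7).
def mean_vector(norms, i):
    norms = np.asarray(norms, dtype=float)
    return (norms.mean() - norms[i]) ** 2

loss = regularized_loss(q_pred=np.array([1.0, 2.0]),
                        target=np.array([1.5, 1.5]),
                        norms=[10.0, 12.0], i=0,
                        regularizer=mean_vector)
# td_loss = 0.25; the diversity term subtracts lam * 1.0 from it
```

Because the regularizer is subtracted, gradient descent on this loss simultaneously reduces the TD error and increases the inequality between the networks' parameter norms.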

5. EXPERIMENTS

5.1. TRAINING CURVES

We chose three environments from PyGames (Qingfeng, 2019) and MinAtar (Young & Tian, 2019): Catcher, Pixelcopter and Asterix. These environments were used by the authors of the MaxminDQN (Lan et al., 2020) paper. We reused all the hyperparameter settings from (Lan et al., 2020) except the number of neural networks, which we limited to four, and trained each solution for five fixed seeds. For the regularization weight λ, we chose the best value from {10^-5, 10^-6, 10^-7, 10^-8}. The baselines were also fine-tuned. The complete list of training parameters can be found in Appendix E. Figure 2 shows the training curves for the three environments. To avoid crowding the figures, for each environment and baseline algorithm (MaxminDQN and EnsembleDQN) we only plotted the regularized version which performed best. We also show as baselines the original MaxminDQN and EnsembleDQN, as well as the DQN and DDQN algorithms. For the Catcher environment, both Gini-MaxminDQN and VOL-EnsembleDQN quickly reached the optimal performance and stabilized after 2 × 10^6 training steps, while the baseline MaxminDQN reached its maximum performance after 3.5 × 10^6 training steps but degraded afterwards. Similarly, the baseline EnsembleDQN reached its maximum performance after 4 × 10^6 training steps, with the performance fluctuating with continued training. For the PixelCopter environment, VOL-MaxminDQN and Theil-EnsembleDQN were slower in the initial part of the learning than some of the other approaches, but over time they achieved at least double the return of the other approaches. Similarly, for the Asterix environment, Atkinson-MaxminDQN and Theil-EnsembleDQN lagged in training for about 1 × 10^6 training steps, but after that they achieved at least 50% higher return than the baselines. Full results together with CKA similarity heatmaps are shown in Appendix C.

5.2. T-SNE VISUALIZATIONS

To visualize the impact of the regularization, Figure 3 shows t-SNE (van der Maaten & Hinton, 2008) visualizations of the activations of the last layer of the trained networks. Figure 3a shows the networks trained for the Catcher environment, while Figure 3b shows the corresponding networks for the PixelCopter environment.

5.3. STATISTICAL ANALYSIS

What is the impact of the regularization on performance? Similar to the approach taken by (Liu et al., 2020), we performed a z-score test to rigorously evaluate the improvement of regularization over the baseline solutions. The z-score, also known as the "standard score", is the signed fractional number of standard deviations by which the value of a data point is above the mean value. A regularizer's z-score roughly measures its relative performance among the others. For each algorithm, environment and neural network setting, we calculated the z-score for each regularization method and the baseline by treating all the results as a population. For example, to find out which EnsembleDQN with two neural networks is best for the Catcher environment, we took the average reward of 10 episodes for each experiment ((5 + 1) × 5 seeds) and treated it as a population. Finally, we averaged the z-scores to generate the final result presented in Table 1. In terms of improved performance, all the regularizers achieved significant improvement over the baselines for all three environments. The z-scores for the four neural network experiments are shown in Appendix C.4. Is the improvement statistically significant? We collected the z-scores from the previous section and performed Welch's t-test against the corresponding z-scores produced by the baseline. The resulting p-values are presented in Table 2. From the results, we observe that the improvement introduced by regularization is statistically significant (p < 0.05) in almost all the cases. To test the limits of the regularizers, we initialized each layer of each neural network with the same fixed seed. This initialization enforces maximum representation similarity and is considered the worst case scenario for ensemble based learning methods. We performed this experiment on all three environments and used the same seeds and hyperparameters that were used for the main experiments.
The training curves are shown in Figure 4. Notably, the results from the baseline MaxminDQN and EnsembleDQN on both the Catcher and PixelCopter environments are similar to the main results. For the Catcher environment, both Gini-MaxminDQN and Theil-EnsembleDQN were slow in learning for about 2 × 10^6 training steps, but both solutions were able to achieve the optimal performance by the end of training. Similarly, for the PixelCopter environment, VOL-MaxminDQN was slow in learning until 1.5 × 10^6 training steps but was able to outperform the baseline results and achieve optimal performance. The complete training plots for these experiments are shown in Appendix D.
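The z-score normalization and Welch's t statistic used in Section 5.3 are standard; a minimal NumPy sketch (the sample values are made up, and the p-value lookup from the t distribution is omitted for brevity):

```python
import numpy as np

def z_scores(x):
    """Standard scores: signed number of standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    se2 = a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(se2)

scores = z_scores([1.0, 2.0, 3.0])            # symmetric around zero
t = welch_t([2.0, 3.0, 4.0], [1.0, 2.0, 3.0])  # positive: first sample is better
```

In practice one would obtain the p-value from the t distribution with the Welch-Satterthwaite degrees of freedom, e.g. via `scipy.stats.ttest_ind(a, b, equal_var=False)`.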

7. CONCLUSION

In this paper we showed that high representation similarity between the Q-functions in ensemble based Q-learning algorithms such as MaxminDQN and EnsembleDQN leads to a decline in learning performance. To mitigate this, we proposed a regularization approach using five different metrics to maximize the diversity in the representation space of the Q-functions. Experiments have shown that our solution outperforms baseline MaxminDQN and EnsembleDQN in standard training settings as well as in scenarios where the parameters of the neural layers were initialized using one fixed seed.

A SUPPLEMENTARY MATERIAL

A.1 MOTIVATING EXAMPLE TO DEMONSTRATE SIMILARITY BETWEEN NEURAL NETWORKS

We performed a regression experiment in which we learned a sine wave function using two different three-layer fully connected neural networks, with 64 and 32 neurons in each hidden layer respectively and ReLU activations. The neural networks were initialized using different seeds and were trained using different batch sizes (512, 128) and learning rates (1e-4, 1e-3). Figure 5a shows the learned functions, while Figure 5b shows their CKA similarity heatmap before and after training. The odd numbered layers represent pre-ReLU activations while the even numbered layers represent post-ReLU activations. Before training, the CKA similarity between the two neural networks from layer 4 onward is relatively low, with the outputs 0% similar; after training, the networks have learned highly similar representations, with the outputs 98% similar. This example shows that neural networks can learn similar representations even when trained on different batches. This observation is important because in MaxminDQN and EnsembleDQN training, each neural network is trained on a separate batch from the replay buffer but still learns similar representations (see Figure 8).

B ALGORITHMS

        Y^M ← r_D + γ max_{a' ∈ A} Q_min(s'_D, a')
        Generate the list of L2 norms: ℓ = (‖ψ_1‖_2, ..., ‖ψ_N‖_2)
        Update Q_i by minimizing E_{s_D, a_D, r_D, s'_D} [ (Q_i^{ψ_i}(s_D, a_D) − Y^M)² ] − λ I(ℓ_i, ℓ)
    end
    s ← s'
end

Algorithm 2: Regularized EnsembleDQN. The differences between the baseline EnsembleDQN and the regularized EnsembleDQN are highlighted.

Initialize N Q-functions {Q_1, ..., Q_N} parameterized by {ψ_1, ..., ψ_N}
Initialize empty replay buffer D
Observe initial state s
while Agent is interacting with the Environment do
    Q_ens(s, a) ← (1/N) Σ_{i=1}^{N} Q_i(s, a)
    Choose action a by ε-greedy based on Q_ens
    Take action a, observe r, s'
    Store transition (s, a, r, s') in D
    Select a subset S from {1, ..., N} (e.g., randomly select one i to update)
    for i ∈ S do
        Sample a random mini-batch of transitions (s_D, a_D, r_D, s'_D) from D
        Get update target: Y^E ← r_D + γ max_{a' ∈ A} Q_ens(s'_D, a')
        Generate the list of L2 norms: ℓ = (‖ψ_1‖_2, ..., ‖ψ_N‖_2)
        Update Q_i by minimizing E_{s_D, a_D, r_D, s'_D} [ (Q_i^{ψ_i}(s_D, a_D) − Y^E)² ] − λ I(ℓ_i, ℓ)
    end
    s ← s'
end

For the baseline experiments, the output layer has more than 96% similarity in almost all the scenarios, while the regularized versions have around 90% similarity in the output layer. This 10% difference provides enough variance in the Q-values to prevent the ensemble based Q-learning methods from converging to the standard DQN.

[Figure panel labels: Baseline MaxminDQN, Baseline EnsembleDQN; Theil-MaxminDQN, Theil-EnsembleDQN]



D IDENTICAL LAYERS EXPERIMENT



Figure 1: The training graph and CKA similarity heatmaps of a MaxminDQN agent with 2 neural networks. The letters on the plot show the times when the CKA similarities were calculated. Heatmaps A and C have relatively low CKA similarity and relatively higher average return compared to heatmaps B and D, which have extremely high similarity across all the layers.

Figure 2: Training curves and 95% confidence interval (shaded area) for the best augmented variants for MaxminDQN and EnsembleDQN together with baseline algorithms.

Figure 3: Clustering last layer activations from Catcher and PixelCopter after processing them with t-SNE to map them in 2D. The regularized variants have visible clusters while the baseline MaxminDQN and EnsembleDQN activations are mixed together with no visible pattern.

Figure 4: Training plots representing the best results from each solution for the Catcher, PixelCopter and Asterix environments when the layers of the neural networks were initialized with one fixed seed.

CKA similarity heatmap between different layers of the two neural networks used for the regression experiment.

Figure 5: Left: Fitting a sine function using two different neural network architectures. The upper function was approximated using 64 neurons in each hidden layer while the lower function used 32 neurons in each hidden layer. Right: The CKA similarity heatmap between different layers of both neural networks before and after training. The right diagonal (bottom left to top right) measures the representation similarity of the corresponding layers of the two neural networks. The trained networks have learned similar representations and their outputs are 98% similar.

Figure 6: All MaxminDQN Results. Top to Bottom: Atkinson, Gini, MeanVector, Theil, Variance of Logarithms

Figure 8: Heatmaps representing the CKA similarity of 2 neural network experiments.

Figure 9 represents the t-SNE visualizations of the baseline and regularized solutions trained with four neural networks on the PixelCopter environment. This visualization is consistent with the visualizations shown in Section 5.2, where the baseline activations are cluttered without any pattern while the Theil-MaxminDQN and Theil-EnsembleDQN activations have visible clusters.

Figure 9: Clustering last layer activations from PixelCopter after processing them with t-SNE to map them in 2D.

Figure 10: All EnsembleDQN Results. Top to Bottom: Atkinson, Gini, MeanVector, Theil, Variance of Logarithms

In Figure 3b, the networks were trained for the PixelCopter environment. The upper row of the figure shows the original, unregularized models, while the lower row shows a regularized version. For all combinations, we find that the activations from the original MaxminDQN and EnsembleDQN versions do not show any obvious pattern, while the regularized ones show distinct clusters. An additional benefit of t-SNE visualizations over CKA similarity heatmaps is that the heatmaps are useful for showing the representation similarity between two neural networks, but they become difficult to interpret as the number of neural networks increases. More t-SNE visualizations for the four neural network experiments are shown in Appendix C.3.

Table 1: Averaged z-scores for each regularization method.

For completeness, the regularized MaxminDQN and EnsembleDQN algorithms are given below.

Algorithm 1: Regularized MaxminDQN. The differences between the baseline MaxminDQN and the regularized MaxminDQN are highlighted.

Initialize N Q-functions {Q_1, ..., Q_N} parameterized by {ψ_1, ..., ψ_N}
Initialize empty replay buffer D
Observe initial state s
while Agent is interacting with the Environment do
    Q_min(s, a) ← min_{k ∈ {1,...,N}} Q_k(s, a), ∀a ∈ A
    Choose action a by ε-greedy based on Q_min
    Take action a, observe r, s'
    Store transition (s, a, r, s') in D
    Select a subset S from {1, ..., N} (e.g., randomly select one i to update)
    for i ∈ S do
        Sample a random mini-batch of transitions (s_D, a_D, r_D, s'_D) from D
        Get update target:

Z-SCORE TABLE FOR FOUR NEURAL NETWORK EXPERIMENTS

Averaged z-scores for each regularization method with four neural networks.

Table 2: P-values from Welch's t-test comparing the z-scores of regularization and baseline.


[1e-3, 1e-4, 3e-5] and limited the number of ensembles to four. The complete list of hyperparameters for each environment is shown in Table 5. The values in bold represent the values used for the reported results. For the identical layer experiment, no hyperparameter tuning was performed and we reused the hyperparameters from the main results. In terms of the number of experiments, we ran 190 experiments per environment: (5 regularizers × 5 seeds × 3 ensemble settings × 2 algorithms) + 40 baseline experiments, totaling 570 runs for all environments after hyperparameter tuning. The same number of experiments was performed for the identical layer experiment, which sums up to 1140 runs, where each run took 11 hours of compute time on average.

F PLOTTING THE GINI INEQUALITY

We measured the L2 norm inequality of the baseline MaxminDQN and EnsembleDQN along with their regularized versions. We trained the baseline MaxminDQN and EnsembleDQN with two neural networks, along with their Gini index versions with a regularization weight of 10^-8, on the PixelCopter environment with a fixed seed. Figure 12

