MAXMIN-NOVELTY: MAXIMIZING NOVELTY VIA MINIMIZING THE STATE-ACTION VALUES IN DEEP REINFORCEMENT LEARNING

Abstract

Reinforcement learning research has accelerated rapidly since deep neural networks were first used as function approximators to learn policies that make sequential decisions in MDPs with high-dimensional state representations. While several barriers have been broken in deep reinforcement learning research (e.g. learning from high-dimensional states, learning purely via self-play), several others still stand. In particular, the question of how to explore in high-dimensional complex MDPs is an understudied open problem. To address this, we propose an exploration technique based on maximization of novelty via minimization of the state-action value function (MaxMin Novelty). Our method is theoretically well motivated and comes with zero additional computational cost, while leading to significant sample efficiency gains in deep reinforcement learning training. We conduct extensive experiments in the Arcade Learning Environment with high-dimensional state representation MDPs. We show that our technique improves the human normalized median scores on the Arcade Learning Environment by 248% in the low-data regime.

1. INTRODUCTION

The use of deep neural networks as function approximators enabled learning functioning policies in high-dimensional state representation MDPs (Mnih et al., 2015). Following this initial work, the current line of work trains deep reinforcement learning policies to solve highly complex problems from game solving (Hasselt et al., 2016; Schrittwieser et al., 2020) to self-driving vehicles (Lan et al., 2020). Yet unsolved problems still restrict the current capabilities of deep neural policies. One of the main open problems in deep reinforcement learning research is exploration in high-dimensional state representation MDPs. While prior work extensively studied the exploration problem in bandits and tabular reinforcement learning, and proposed various algorithms and techniques that are optimal in the tabular or bandit setting (Kearns & Singh, 2002; Brafman & Tennenholtz, 2002; Lu & Roy, 2019; Wang et al., 2020; Karnin et al., 2013; Wagenmaker et al., 2022), exploration in deep reinforcement learning remains a challenging open problem. Despite their provable optimality in the tabular or bandit setting, these exploration techniques generally rely strongly on the assumptions of tabular reinforcement learning, and in particular on the ability to record tables of statistical estimates for every state-action pair. Thus, in high-dimensional complex MDPs, for which deep neural networks are used as function approximators, the efficiency and the optimality of exploration methods proposed for tabular settings do not transfer well. This is primarily due to the increase in the dimensionality and the complexity of the MDP. Hence, deep reinforcement learning research still prefers naive and simple exploration techniques (e.g. $\epsilon$-greedy) over the optimal tabular techniques (Mnih et al., 2015; Hasselt et al., 2016; Wang et al., 2016; Anschel et al., 2017; Bellemare et al., 2017; Lan et al., 2020).

Sample efficiency in deep neural policies is still one of the main challenges restricting research progress in reinforcement learning. The number of samples required to learn and adapt continuously is one of the main limiting factors preventing current state-of-the-art deep reinforcement learning algorithms from being deployed in many diverse settings, and, most importantly, one of the main challenges that must be addressed on the way to building general artificial intelligence. In our paper we seek answers to the following questions:

• Can we explore a high-dimensional state representation MDP more efficiently with zero additional computational cost?
• Is there a natural theoretical motivation that can be used to design a zero-cost exploration strategy while achieving high sample efficiency?

To answer these questions, we focus on exploration in deep reinforcement learning and make the following contributions:

• We propose a novel exploration technique based on minimizing the state-action value function to increase the information gain from each particular experience acquired in the MDP.
• We conduct an extensive study in the Arcade Learning Environment 100K benchmark with state-of-the-art algorithms and demonstrate that our proposed method achieves significant performance improvement.
• We show the efficacy of our proposed MaxMin Novelty method in terms of sample efficiency. Our method, based on maximizing novelty via minimizing the state-action value function, reaches approximately the same performance level as model-based deep reinforcement learning algorithms, without building and learning any model of the environment.

2.1. DEEP REINFORCEMENT LEARNING

The reinforcement learning problem is formalized as a Markov Decision Process (MDP) $\mathcal{M} = \langle S, A, r, \gamma, \rho_0, \mathcal{T} \rangle$ that contains a continuous set of states $s \in S$, a set of discrete actions $a \in A$, a probability transition function $\mathcal{T}(s, a, s')$ on $S \times A \times S$, a discount factor $\gamma$, and a reward function $r(s, a) : S \times A \to \mathbb{R}$, with initial state distribution $\rho_0$. A policy $\pi(s, a) : S \to \mathcal{P}(A)$ in an MDP is a mapping between states and actions that assigns a probability distribution over actions to each state $s \in S$. The main goal in reinforcement learning is to learn an optimal policy $\pi$ that maximizes the expected cumulative discounted rewards

$$R = \mathbb{E}_{a_t \sim \pi(s_t, \cdot)} \sum_t \gamma^t r(s_t, a_t).$$

In Q-learning the learned policy is parameterized by a state-action value function $Q : S \times A \to \mathbb{R}$, which represents the value of taking action $a$ in state $s$. The optimal state-action value function is learnt via the iterative Bellman update

$$Q(s_t, a_t) = r(s_t, a_t) + \gamma \sum_{s_{t+1}} \mathcal{T}(s_t, a_t, s_{t+1}) V(s_{t+1}),$$

where $V(s_{t+1}) = \max_a Q(s_{t+1}, a)$. Let $a^*(s) = \arg\max_a Q(s, a)$ be the action maximizing the state-action value function in state $s$. Once the Q-function is learnt, the policy is determined by taking action $a^*(s)$. In deep reinforcement learning, the state space or the action space is large enough that it is not possible to learn and store the state-action values in tabular form. Thus, the Q-function is approximated via deep neural networks:

$$\theta_{t+1} = \theta_t + \alpha \left( r(s_t, a_t) + \gamma Q(s_{t+1}, \arg\max_a Q(s_{t+1}, a; \theta_t); \theta_t) - Q(s_t, a_t; \theta_t) \right) \nabla_{\theta_t} Q(s_t, a_t; \theta_t).$$

In deep double-Q learning, two Q-networks are used to decouple the Q-network deciding which action to take from the Q-network evaluating the action taken:

$$\theta_{t+1} = \theta_t + \alpha \left( r(s_t, a_t) + \gamma Q(s_{t+1}, \arg\max_a Q(s_{t+1}, a; \theta_t); \hat{\theta}_t) - Q(s_t, a_t; \theta_t) \right) \nabla_{\theta_t} Q(s_t, a_t; \theta_t).$$

Current deep reinforcement learning algorithms use $\epsilon$-greedy exploration during training (Mnih et al., 2015; Hasselt et al., 2016; Wang et al., 2016; Hamrick et al., 2020; Flennerhag et al., 2022). In particular, the $\epsilon$-greedy algorithm takes an action $a_k \sim \mathcal{U}(A)$ with probability $\epsilon$ in a given state $s$, i.e. $\pi(s, a_k) = \frac{\epsilon}{|A|}$, and takes the action $a^* = \arg\max_a Q(s, a)$ with probability $1 - \epsilon$, i.e. $\pi(s, a^*) = 1 - \epsilon + \frac{\epsilon}{|A|}$.
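The $\epsilon$-greedy probabilities and the double-Q target described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names and the list-based representation of Q-values (`q_values`, `q_online_next`, `q_target_next`) are assumptions made purely for illustration.

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Sample a uniform random action with probability epsilon, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities implied by epsilon-greedy: epsilon/|A| for every
    action, plus an extra 1 - epsilon for the greedy action."""
    num_actions = len(q_values)
    greedy = max(range(num_actions), key=lambda a: q_values[a])
    probs = [epsilon / num_actions] * num_actions
    probs[greedy] += 1.0 - epsilon
    return probs


def double_q_target(reward, gamma, q_online_next, q_target_next):
    """Double-Q target: the online weights select the action in s_{t+1},
    the second set of weights evaluates it."""
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[a_star]
```

For example, with three actions and $\epsilon = 0.2$, the greedy action receives probability $1 - 0.2 + 0.2/3$ and each other action receives $0.2/3$.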

2.2. EXPLORATION IN REINFORCEMENT LEARNING

In the tabular MDP setting, there has been extensive theoretical work proving optimal regret bounds using the principle of optimism in the face of uncertainty. One prominent class of algorithms in this setting adds a bonus to value estimates based on the Upper Confidence Bound (UCB) approach (Auer et al., 2008). In fact, the recent work of Azar et al. (2017) achieves minimax optimal regret using a carefully designed variant of the UCB approach. Furthermore, the UCB approach to exploration continues to be an active area of research for deriving new algorithms with provable regret bounds in reinforcement learning (Zanette & Brunskill, 2019; Jin et al., 2020). Finally, the action with the highest value for the upper end of its confidence interval is selected. In this sense the UCB algorithm is optimistic, as it chooses an action based on the highest plausible estimate of its value given the previously observed data. Note also that as the state-action pair $(s, a)$ is visited more frequently, the corresponding confidence interval becomes smaller, eventually converging to the final estimated value.

A second general class of theoretically justified algorithms for exploration is based on randomized value functions, where specifically tuned randomness is added to value estimates in order to encourage exploration. Notable examples of algorithms in this category include Thompson sampling (Osband et al., 2013; Agrawal & Jia, 2017), based on sampling from a posterior distribution on actions given past observations, and randomized least-squares value iteration (RLSVI) (Osband et al., 2016), based on using tuned Gaussian noise to sample a randomized value function.

Despite the strong theoretical performance of the aforementioned approaches, there are significant difficulties in effectively extending them to the setting of deep reinforcement learning. The primary obstacle is that these methods utilize count-based uncertainty estimates (e.g. the state-action visit counts $N_t(s, a)$), which are generally not immediately available in deep reinforcement learning, where the state space is modeled as a continuous high-dimensional vector space (e.g. in deep reinforcement learning from pixels). Instead, incorporating count-based methods into deep reinforcement learning requires significant complexity, including training additional deep neural networks to estimate counts or other uncertainty metrics. As a result, many state-of-the-art deep reinforcement learning algorithms use simple, randomized exploration methods such as the $\epsilon$-greedy approach of sampling a uniformly random action with probability $\epsilon$ (Mnih et al., 2015; Hasselt et al., 2016; Wang et al., 2016; Hamrick et al., 2020; Flennerhag et al., 2022), or the injection of random noise via NoisyNetworks (Hessel et al., 2018).

3. MAXIMIZING NOVELTY

In deep reinforcement learning the state-action value function is initialized with random weights (Mnih et al., 2015; 2016; Hasselt et al., 2016; Wang et al., 2016; Schaul et al., 2016; Oh et al., 2020; Schrittwieser et al., 2020; Hubert et al., 2021). Thus, in the early phase of training the Q-function behaves more like a random function than an accurate representation of the optimal state-action values. In particular, early in training the Q-function will, on average, assign approximately similar values to states that are similar, and will have little correlation with the immediate rewards. We first formalize this intuition in the following definitions.

Definition 3.1 ($\eta$-uninformed Q). Let $\eta > 0$. A Q-function parameterized by weights $\theta \sim \Theta$ is $\eta$-uninformed if for any state $s \in S$ with $a_{\min} = \arg\min_a Q_\theta(s, a)$ we have

$$\left| \mathbb{E}_{\theta \sim \Theta}[r(s, a_{\min})] - \mathbb{E}_{a \sim \mathcal{U}(A)}[r(s, a)] \right| < \eta.$$

Definition 3.2 ($\delta$-smooth Q). Let $\delta > 0$. A Q-function parameterized by weights $\theta \sim \Theta$ is $\delta$-smooth if for any state $s \in S$ and action $\hat{a} \in A$ with $s' \sim \mathcal{T}(s, \hat{a}, \cdot)$ we have

$$\left| \mathbb{E}_{s' \sim \mathcal{T}(s, \hat{a}, \cdot),\, \theta \sim \Theta}\left[\max_a Q_\theta(s, a)\right] - \mathbb{E}_{s' \sim \mathcal{T}(s, \hat{a}, \cdot),\, \theta \sim \Theta}\left[\max_a Q_\theta(s', a)\right] \right| < \delta,$$

where the expectation is over both the random initialization of the Q-function weights and the random transition to state $s' \sim \mathcal{T}(s, \hat{a}, \cdot)$.

Definition 3.3 (Disadvantage Gap). For a state-action value function $Q_\theta$ the disadvantage gap in a state $s \in S$ is given by

$$D(s) = \mathbb{E}_{a \sim \mathcal{U}(A),\, \theta \sim \Theta}\left[Q_\theta(s, a) - Q_\theta(s, a_{\min})\right],$$

where $a_{\min} = \arg\min_a Q_\theta(s, a)$.

The following proposition captures the intuition that when the Q-function on average assigns similar maximum values to consecutive states, choosing the action minimizing the state-action value function will achieve an above-average temporal difference loss.

Proposition 3.1. Let $\eta, \delta > 0$ and suppose that $Q_\theta(s, a)$ is $\eta$-uninformed and $\delta$-smooth.
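Let $s_t \in S$ be a state, and let $a_{\min}$ be the action minimizing the state-action value in the given state $s_t$, $a_{\min} = \arg\min_a Q_\theta(s_t, a)$. Let $s^{\min}_{t+1} \sim \mathcal{T}(s_t, a_{\min}, \cdot)$. Then for an action $a_t \sim \mathcal{U}(A)$ with $s_{t+1} \sim \mathcal{T}(s_t, a_t, \cdot)$ we have

$$\mathbb{E}_{s^{\min}_{t+1} \sim \mathcal{T}(s_t, a_{\min}, \cdot),\, \theta \sim \Theta}\left[r(s_t, a_{\min}) + \gamma \max_a Q_\theta(s^{\min}_{t+1}, a) - Q_\theta(s_t, a_{\min})\right] > \mathbb{E}_{a_t \sim \mathcal{U}(A),\, s_{t+1} \sim \mathcal{T}(s_t, a_t, \cdot),\, \theta \sim \Theta}\left[r(s_t, a_t) + \gamma \max_a Q_\theta(s_{t+1}, a) - Q_\theta(s_t, a_t)\right] + D(s) - 2\delta - \eta.$$

Proof. Since $Q_\theta(s, a)$ is $\delta$-smooth we have

$$\begin{aligned}
\mathbb{E}_{s^{\min}_{t+1} \sim \mathcal{T}(s_t, a_{\min}, \cdot),\, \theta \sim \Theta}\left[\gamma \max_a Q_\theta(s^{\min}_{t+1}, a) - Q_\theta(s_t, a_{\min})\right]
&> \gamma \mathbb{E}_{\theta \sim \Theta}\left[\max_a Q_\theta(s_t, a)\right] - \delta - \mathbb{E}_{\theta \sim \Theta}\left[Q_\theta(s_t, a_{\min})\right] \\
&> \gamma \mathbb{E}_{s_{t+1} \sim \mathcal{T}(s_t, a_t, \cdot),\, \theta \sim \Theta}\left[\max_a Q_\theta(s_{t+1}, a)\right] - 2\delta - \mathbb{E}_{\theta \sim \Theta}\left[Q_\theta(s_t, a_{\min})\right] \\
&\geq \mathbb{E}_{a_t \sim \mathcal{U}(A),\, s_{t+1} \sim \mathcal{T}(s_t, a_t, \cdot),\, \theta \sim \Theta}\left[\gamma \max_a Q_\theta(s_{t+1}, a) - Q_\theta(s_t, a_t)\right] + D(s) - 2\delta,
\end{aligned}$$

where the last line follows from Definition 3.3. Further, because $Q_\theta(s, a)$ is $\eta$-uninformed,

$$\mathbb{E}_{\theta \sim \Theta}[r(s_t, a_{\min})] > \mathbb{E}_{a_t \sim \mathcal{U}(A)}[r(s_t, a_t)] - \eta.$$

Combining with the previous inequality completes the proof.

In words, the proposition shows that the temporal difference loss achieved by the minimum-value action is above-average by an amount approximately equal to the disadvantage gap. The above argument can be extended to the case where action selection and evaluation in the temporal difference loss are computed with two different sets of weights $\theta$ and $\hat{\theta}$, as in double Q-learning.

Definition 3.4 ($\delta$-smoothness for Double-Q). Let $\delta > 0$. A pair of Q-functions parameterized by weights $\theta \sim \Theta$ and $\hat{\theta} \sim \Theta$ are $\delta$-smooth if for any state $s \in S$ and action $\hat{a} \in A$ with $s' \sim \mathcal{T}(s, \hat{a}, \cdot)$ we have

$$\left| \mathbb{E}_{s' \sim \mathcal{T}(s, \hat{a}, \cdot),\, \theta \sim \Theta,\, \hat{\theta} \sim \Theta}\left[Q_{\hat{\theta}}(s, \arg\max_a Q_\theta(s, a))\right] - \mathbb{E}_{s' \sim \mathcal{T}(s, \hat{a}, \cdot),\, \theta \sim \Theta,\, \hat{\theta} \sim \Theta}\left[Q_{\hat{\theta}}(s', \arg\max_a Q_\theta(s', a))\right] \right| < \delta,$$

where the expectation is over both the random initialization of the Q-function weights $\theta$ and $\hat{\theta}$, and the random transition to state $s' \sim \mathcal{T}(s, \hat{a}, \cdot)$.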
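With this definition we can then prove that choosing the minimum-valued action will lead to a temporal difference loss that is above-average by approximately $D(s)$.

Proposition 3.2. Let $\eta, \delta > 0$ and suppose that $Q_\theta$ and $Q_{\hat{\theta}}$ are $\eta$-uninformed and $\delta$-smooth. Let $s_t \in S$ be a state, and let $a_{\min} = \arg\min_a Q_\theta(s_t, a)$. Let $s^{\min}_{t+1} \sim \mathcal{T}(s_t, a_{\min}, \cdot)$. Then for an action $a_t \sim \mathcal{U}(A)$ with $s_{t+1} \sim \mathcal{T}(s_t, a_t, \cdot)$ we have

$$\mathbb{E}_{s^{\min}_{t+1} \sim \mathcal{T}(s_t, a_{\min}, \cdot),\, \theta \sim \Theta,\, \hat{\theta} \sim \Theta}\left[r(s_t, a_{\min}) + \gamma Q_{\hat{\theta}}(s^{\min}_{t+1}, \arg\max_a Q_\theta(s^{\min}_{t+1}, a)) - Q_\theta(s_t, a_{\min})\right] > \mathbb{E}_{a_t \sim \mathcal{U}(A),\, s_{t+1} \sim \mathcal{T}(s_t, a_t, \cdot),\, \theta \sim \Theta,\, \hat{\theta} \sim \Theta}\left[r(s_t, a_t) + \gamma Q_{\hat{\theta}}(s_{t+1}, \arg\max_a Q_\theta(s_{t+1}, a)) - Q_\theta(s_t, a_t)\right] + D(s) - 2\delta - \eta.$$

Proof. Since $Q_\theta$ and $Q_{\hat{\theta}}$ are $\delta$-smooth we have

$$\begin{aligned}
\mathbb{E}_{s^{\min}_{t+1} \sim \mathcal{T}(s_t, a_{\min}, \cdot),\, \theta \sim \Theta,\, \hat{\theta} \sim \Theta}&\left[\gamma Q_{\hat{\theta}}(s^{\min}_{t+1}, \arg\max_a Q_\theta(s^{\min}_{t+1}, a)) - Q_\theta(s_t, a_{\min})\right] \\
&> \mathbb{E}_{\theta \sim \Theta,\, \hat{\theta} \sim \Theta}\left[\gamma Q_{\hat{\theta}}(s_t, \arg\max_a Q_\theta(s_t, a)) - Q_\theta(s_t, a_{\min})\right] - \delta \\
&> \mathbb{E}_{s_{t+1} \sim \mathcal{T}(s_t, a_t, \cdot),\, \theta \sim \Theta,\, \hat{\theta} \sim \Theta}\left[\gamma Q_{\hat{\theta}}(s_{t+1}, \arg\max_a Q_\theta(s_{t+1}, a)) - Q_\theta(s_t, a_{\min})\right] - 2\delta \\
&\geq \mathbb{E}_{a_t \sim \mathcal{U}(A),\, s_{t+1} \sim \mathcal{T}(s_t, a_t, \cdot),\, \theta \sim \Theta,\, \hat{\theta} \sim \Theta}\left[\gamma Q_{\hat{\theta}}(s_{t+1}, \arg\max_a Q_\theta(s_{t+1}, a)) - Q_\theta(s_t, a_t)\right] + D(s) - 2\delta,
\end{aligned}$$

where the last line follows from Definition 3.3. Further, because $Q_\theta$ and $Q_{\hat{\theta}}$ are $\eta$-uninformed,

$$\mathbb{E}_{\theta \sim \Theta,\, \hat{\theta} \sim \Theta}[r(s_t, a_{\min})] > \mathbb{E}_{a_t \sim \mathcal{U}(A)}[r(s_t, a_t)] - \eta.$$

Combining with the previous inequality completes the proof.

At first, the results in Propositions 3.1 and 3.2 might appear counterintuitive. Being $\delta$-smooth and $\eta$-uninformed seem like properties of a random function. Thus, taking the minimum Q-value action should be approximately equivalent to taking a uniform random action. However, Propositions 3.1 and 3.2 show that the temporal difference loss achieved by taking the minimum action is larger than that of a random action by an amount approximately equal to the disadvantage gap $D(s)$. In order to reconcile these two statements it is useful at this point to look at the limiting case of the Q-function at initialization.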
In particular, the following proposition shows that, at initialization, the distribution of the minimum-value action in a given state is uniform by itself, but is constant once we condition on the weights $\theta$.

Proposition 3.3. Let $\theta$ be the random initial weights for the Q-function. For any state $s \in S$ let $a_{\min}(s) = \arg\min_{a' \in A} Q_\theta(s, a')$. Then for any $a \in A$

$$\mathbb{P}_{\theta \sim \Theta}\left[\arg\min_{a' \in A} Q_\theta(s, a') = a\right] = \frac{1}{|A|},$$

i.e. the distribution $\mathbb{P}_{\theta \sim \Theta}[a_{\min}(s)]$ is uniform. Simultaneously, the conditional distribution $\mathbb{P}_{\theta \sim \Theta}[a_{\min}(s) \mid \theta]$ is constant.

Proof. Since $Q_\theta(s, \cdot)$ is a random function (given the random choice of $\theta$), each action $a \in A$ is equally likely to be assigned the minimum Q-value in state $s$. Thus, $\mathbb{P}_{\theta \sim \Theta}[\arg\min_{a' \in A} Q_\theta(s, a') = a] = \frac{1}{|A|}$. However, given the value of $\theta$, the value of $a_{\min}(s)$ is uniquely determined because $a_{\min}(s) = \arg\min_{a' \in A} Q_\theta(s, a')$. Therefore, the distribution of $a_{\min}(s)$ conditional on $\theta$ is constant.

This implies that, in states whose Q-values have not changed much from initialization, taking the minimum action is almost equivalent to taking a random action. However, while the action chosen early on in training is almost uniformly random when only considering the current state, it is at the same time completely determined by the current value of the weights $\theta$. The temporal difference loss is also determined by the weights $\theta$. Thus, while the marginal distribution on actions taken is uniform, the temporal difference loss when taking the minimum action is quite different from the case where an independently random action is chosen. Algorithm 1 summarizes our proposed exploration method MaxMin Novelty, based on minimizing the state-action value function as described in detail in Section 3. Note that populating the experience replay buffer and learning happen simultaneously at different rates.

[Algorithm 1, final steps: $\mathcal{TD} = r(s_t, a^*) + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a^*)$; end for; return $\nabla \mathcal{L}(\mathcal{TD})$]
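Proposition 3.3 and the disadvantage gap of Definition 3.3 can be checked numerically in an idealized setting. The sketch below is an illustration under stated assumptions, not part of the paper's experiments: the Q-values of a single state are drawn i.i.d. from a Gaussian, a seed stands in for the random weights $\theta$, and the names `sample_q_row`, `argmin_action`, and `disadvantage_gap` are hypothetical.

```python
import random
from collections import Counter

NUM_ACTIONS = 4
NUM_INITIALIZATIONS = 20000


def sample_q_row(seed, num_actions=NUM_ACTIONS):
    """Q_theta(s, .) for one state; the seed plays the role of the weights theta.
    ASSUMPTION: i.i.d. Gaussian values model a randomly initialized Q-function."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.1) for _ in range(num_actions)]


def argmin_action(q_row):
    """Index of the minimum-value action a_min."""
    return min(range(len(q_row)), key=lambda a: q_row[a])


# Marginal over theta: how often each action is the minimum-value action.
counts = Counter(argmin_action(sample_q_row(seed)) for seed in range(NUM_INITIALIZATIONS))
frequencies = [counts[a] / NUM_INITIALIZATIONS for a in range(NUM_ACTIONS)]

# Conditional on theta (a fixed seed): the minimum-value action never changes.
actions_given_theta = {argmin_action(sample_q_row(7)) for _ in range(100)}


def disadvantage_gap(num_samples=NUM_INITIALIZATIONS):
    """Monte Carlo estimate of D(s) = E[Q(s, a) - Q(s, a_min)] with a ~ U(A)."""
    total = 0.0
    for seed in range(num_samples):
        row = sample_q_row(seed)
        total += sum(row) / len(row) - min(row)
    return total / num_samples
```

Under these assumptions, `frequencies` should be close to $1/|A|$ for every action, `actions_given_theta` should contain a single action, and `disadvantage_gap()` should be strictly positive, since the average Q-value over actions is never below the minimum.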
As a motivating example we consider the chain MDP, which consists of a chain of $n$ states $s \in S = \{1, 2, \dots, n\}$, each with two actions. Each state $i$ has one action that transitions the agent up the chain by one step to state $i + 1$, and one action which resets the agent to state 1 at the beginning of the chain. All transitions have reward zero, except for the last transition returning the agent to the beginning from the $n$th state. Thus, when started from the first state in the chain, the agent must learn a policy that takes $n - 1$ consecutive steps up the chain, and then the one final step to reset and get the reward.
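A minimal sketch of the chain MDP described above may help make the dynamics concrete. The class name `ChainMDP`, the action encoding, and the unit reward are assumptions for illustration; in particular, the text does not say what the "up" action does in state $n$, so this sketch keeps the agent in state $n$.

```python
class ChainMDP:
    """Chain of n states (1..n) with two actions per state."""

    UP, RESET = 0, 1  # hypothetical action encoding

    def __init__(self, n=10):
        self.n = n
        self.state = 1

    def reset(self):
        """Start a new episode in the first state."""
        self.state = 1
        return self.state

    def step(self, action):
        """Apply an action and return (next_state, reward)."""
        if action == self.UP:
            # ASSUMPTION: moving up from the last state leaves the agent there.
            next_state = min(self.state + 1, self.n)
            reward = 0.0
        else:
            # Only the reset taken from the n-th state is rewarded.
            # ASSUMPTION: the reward magnitude is 1.
            reward = 1.0 if self.state == self.n else 0.0
            next_state = 1
        self.state = next_state
        return next_state, reward


def rollout_optimal(env, steps=100):
    """Follow the optimal policy: go up until state n, then reset."""
    env.reset()
    total = 0.0
    for _ in range(steps):
        action = env.RESET if env.state == env.n else env.UP
        _, reward = env.step(action)
        total += reward
    return total
```

With $n = 10$ the optimal policy needs nine steps up and one reset per reward, so `rollout_optimal(ChainMDP(10), 100)` collects a total reward of 10 under the unit-reward assumption.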

4. MOTIVATING EXAMPLE

For the chain MDP, we compare standard approaches to exploration in tabular Q-learning with our method MaxMin Novelty, based on minimization of the state-action values. In particular, we compare MaxMin Novelty with both the $\epsilon$-greedy action selection method and the upper confidence bound (UCB) method. In more detail, in the UCB method the number of training steps $t$ and the number of times $N_t(s, a)$ that each action $a$ has been taken in state $s$ by step $t$ are recorded. The action $a \in A$ is then selected as follows:

$$a^{\text{UCB}} = \arg\max_{a \in A} \; Q(s, a) + 2\sqrt{\frac{\log t}{N_t(s, a)}}.$$

In a given state $s$, if $N_t(s, a) = 0$ for any action $a$, then an action is sampled uniformly at random from the set of actions $a'$ with $N_t(s, a') = 0$. For the experiments reported in our paper the length of the chain is set to $n = 10$, and $\epsilon = 0.2$. The Q-function is initialized by independently sampling each state-action value from a normal distribution with $\mu = 0$ and $\sigma = 0.1$. In each iteration we train the agent using Q-learning for 100 steps, and then evaluate the reward obtained by the argmax policy using the current Q-function for 100 steps. Note that the maximum achievable reward in 100 steps is 10. The learning curves in Figure 1 demonstrate that our method converges more quickly to the optimal policy than either of the standard approaches.
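The UCB baseline described above can be sketched as follows. This is an illustrative reading rather than the authors' code: the exact form of the exploration bonus, $2\sqrt{\log t / N_t(s, a)}$, is an assumption about the flattened formula in the source, and the function name `ucb_action` and its list-based arguments are hypothetical.

```python
import math
import random


def ucb_action(q_row, counts_row, t, rng=random):
    """Select an action in one state from Q-values and visit counts.

    q_row[a] is Q(s, a); counts_row[a] is N_t(s, a); t is the training step.
    """
    # Untried actions are sampled uniformly at random first.
    untried = [a for a, n in enumerate(counts_row) if n == 0]
    if untried:
        return rng.choice(untried)
    # ASSUMPTION: optimistic bonus of the form 2 * sqrt(log t / N_t(s, a)).
    scores = [
        q + 2.0 * math.sqrt(math.log(t) / n)
        for q, n in zip(q_row, counts_row)
    ]
    return max(range(len(scores)), key=lambda a: scores[a])
```

The bonus shrinks as an action is visited more often, so a rarely tried action can be selected over one with a slightly higher value estimate, while a large enough value gap still dominates.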

5. LARGE SCALE EXPERIMENTAL RESULTS

The experiments are conducted in the Arcade Learning Environment (ALE) (Bellemare et al., 2013). For completeness we also report several results with 200 million frame training (i.e. 50 million environment interactions). In particular, Figure 2 shows the learning curves for our proposed algorithm MaxMin Novelty and the original version of the DDQN algorithm with $\epsilon$-greedy training (Hasselt et al., 2016). In the large data regime we observe that while in some MDPs our proposed method MaxMin Novelty, based on exploring with novelty maximization via minimizing the state-action values, converges faster, in other MDPs MaxMin Novelty simply converges to a better policy. More concretely, while the learning curves of the StarGunner, FishingDerby, Boxing, Enduro, Hero, and IceHockey games in Figure 2 demonstrate the faster convergence rate of our proposed algorithm MaxMin Novelty, the learning curves of the BankHeist, JamesBond, KungFuMaster, Amidar, Gravitar and Tennis games demonstrate that our exploration technique not only increases the sample efficiency in deep reinforcement learning, but also results in learning a policy that is closer to optimal than the policy learned with the original method used in the DDQN algorithm. Additionally, we compare our proposed MaxMin Novelty algorithm with NoisyNetworks as described in Section 2.2. Table 1 further demonstrates that the MaxMin Novelty algorithm achieves significantly better performance than NoisyNetworks. Furthermore, note that NoisyNetworks adds layers to the Q-network to increase exploration. This increases the number of parameters in the training process and thus introduces additional cost to increase exploration. Table 1 reports results of human normalized median scores, 20th percentile, and 80th percentile for the Arcade Learning Environment 100K benchmark.
Thus, Table 1 demonstrates that our proposed MaxMin-Novelty algorithm improves on the performance of the canonical algorithm $\epsilon$-greedy by 248% and NoisyNetworks by 204%.
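The human normalized scores referred to above are not defined in this section. A common convention in the Arcade Learning Environment literature, assumed here rather than taken from the text, rescales each game's raw score so that a random policy maps to 0 and the human reference maps to 1. The sketch below uses hypothetical raw scores that are not results from the paper.

```python
from statistics import median


def human_normalized_score(agent_score, random_score, human_score):
    """ASSUMPTION: the conventional normalization (agent - random) / (human - random)."""
    return (agent_score - random_score) / (human_score - random_score)


# Hypothetical (agent, random, human) raw scores for three made-up games.
example_games = {
    "game_a": (55.0, 10.0, 100.0),
    "game_b": (30.0, 0.0, 120.0),
    "game_c": (8.0, 2.0, 10.0),
}

normalized = {
    name: human_normalized_score(agent, rand, human)
    for name, (agent, rand, human) in example_games.items()
}
median_normalized = median(normalized.values())
```

Reporting the median across games, as the tables do, limits the influence of a few games with very large normalized scores.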

6. INVESTIGATING THE TEMPORAL DIFFERENCE LOSS

The original justification for exploring with the minimum Q-value action is that taking this action tends to result in transitions with higher temporal difference loss. The theoretical analysis from Proposition 3.1 indicates that, when the Q-function is $\delta$-smooth and $\eta$-uninformed, taking the minimum-value action results in an increase in the temporal difference loss proportional to the disadvantage gap. In particular, Proposition 3.1 states that the temporal difference loss achieved when taking the minimum Q-value action in state $s$ exceeds the average loss over a uniform random action by $D(s) - 2\delta - \eta$. In order to evaluate how well the theoretical prediction matches reality, in this section we provide empirical measurements of the temporal difference loss in our experiments. To measure the change in the loss when taking the minimum action versus the average action, we compare the temporal difference loss obtained by MaxMin Novelty exploration with that obtained by $\epsilon$-greedy exploration. In more detail, during training, for each batch $\Lambda$ of transitions of the form $(s_t, a_t, s_{t+1})$ we record the temporal difference loss

$$\mathcal{TD} = \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \Lambda}\, \mathcal{TD}(s_t, a_t, s_{t+1}) = \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \Lambda}\left[r(s_t, a_t) + \gamma \max_a Q_\theta(s_{t+1}, a) - Q_\theta(s_t, a_t)\right].$$

The results reported in Figure 3 and Figure 5 further confirm the theoretical predictions made via Definition 3.2 and Proposition 3.1. In addition to the results for individual games reported in Figure 3, we compute a normalized measure of the gain in temporal difference achieved when using MaxMin Novelty exploration and plot the median across games. We define the normalized $\mathcal{TD}$ gain to be

$$\text{Normalized } \mathcal{TD} \text{ Gain} = 1 + \frac{\mathcal{TD}_{\text{method}} - \mathcal{TD}_{\epsilon\text{-greedy}}}{|\mathcal{TD}_{\epsilon\text{-greedy}}|},$$

where $\mathcal{TD}_{\text{method}}$ and $\mathcal{TD}_{\epsilon\text{-greedy}}$ are the temporal difference for any given exploration method and for $\epsilon$-greedy, respectively.
The leftmost and middle plots of Figure 5 report the median across all games of the normalized $\mathcal{TD}$ gain for MaxMin Novelty and NoisyNetworks in the Arcade Learning Environment 100K benchmark. Note that, consistent with the predictions of Proposition 3.1, the median normalized temporal difference gain for MaxMin Novelty is up to 25 percent larger than that of $\epsilon$-greedy. The results for NoisyNetworks demonstrate that alternate exploration methods lack this positive bias relative to the uniform random action. The fact that, as demonstrated in Table 1, MaxMin Novelty significantly outperforms NoisyNetworks in the low-data regime is further evidence of the advantage that the positive bias in temporal difference confers. The rightmost plot of Figure 5 reports $\mathcal{TD}$ for the motivating example of the chain MDP. As in the large-scale experiments, prior to convergence MaxMin Novelty exhibits a notably larger temporal difference loss relative to the other exploration methods.
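The batch temporal difference and the normalized temporal difference gain used in this section can be computed as in the following sketch. The function names and the tuple layout of a transition, `(reward, q_next_row, q_sa)`, are assumptions chosen for illustration.

```python
def td_error(reward, gamma, q_next_row, q_sa):
    """r(s_t, a_t) + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a_t) for one transition."""
    return reward + gamma * max(q_next_row) - q_sa


def batch_td(batch, gamma):
    """Mean temporal difference over a batch of (reward, q_next_row, q_sa) tuples."""
    errors = [td_error(r, gamma, q_next, q_sa) for r, q_next, q_sa in batch]
    return sum(errors) / len(errors)


def normalized_td_gain(td_method, td_epsilon_greedy):
    """1 + (TD_method - TD_eps_greedy) / |TD_eps_greedy|."""
    return 1.0 + (td_method - td_epsilon_greedy) / abs(td_epsilon_greedy)
```

By construction the gain equals 1 when a method matches $\epsilon$-greedy, and a value of 1.25 corresponds to a temporal difference that is 25 percent larger than that of $\epsilon$-greedy.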

7. CONCLUSION

In our study we focus on the following questions in deep reinforcement learning: (i) Is it possible to increase sample efficiency in deep reinforcement learning in a computationally efficient way with conceptually simple choices? (ii) What is the theoretical motivation of our proposed perspective, simply minimizing the state-action value function in early training, that results in one of the most computationally efficient ways to explore in deep reinforcement learning? (iii) How would the theoretically motivated simple idea transfer to large scale experiments in high-dimensional state representation MDPs? To answer these questions we propose a novel, theoretically motivated method with zero additional computational cost, based on following actions that minimize the state-action value function to explore in deep reinforcement learning. We demonstrate theoretically that our method MaxMin Novelty, based on minimization of the state-action value, results in higher temporal difference loss, and thus creates novel transitions in exploration with more unique experience collection. Following the theoretical motivation, we first show in a toy example in the chain MDP setup that our proposed method MaxMin Novelty achieves higher sample efficiency. Then, we expand this intuition and conduct large scale experiments in the Arcade Learning Environment, and demonstrate that our proposed method MaxMin Novelty increases the performance on the Arcade Learning Environment 100K benchmark by 248%.

A.2 ARCADE LEARNING ENVIRONMENT RESULTS

Table 3 reports the average scores for human, random, our proposed algorithm MaxMin-Novelty, the canonical algorithm $\epsilon$-greedy, and NoisyNetworks for all of the games in the Arcade Learning Environment 100K benchmark. Scores are reported as the mean over 5 random seeds. The highest score amongst the three algorithms is marked with bold font. We also report human scores and random scores to provide complete information on the learning curves reported in the main body of the paper.

In early training (i.e. up to $4 \times 10^4$ environment interactions), in half of the games we observe a steeper increase in the performance. This is again a result of the MaxMin Novelty algorithm targeting higher temporal difference bias. In particular, in Amidar, CrazyClimber, Hero, JamesBond, Kangaroo, RoadRunner, PrivateEye, Seaquest, UpNDown, Freeway, Breakout and Asterix the gradient of the performance curve for MaxMin Novelty is higher than that of the canonical algorithm $\epsilon$-greedy. This again supports Proposition 3.6 in the main body of the paper. In particular, early in training the Q-function is $\eta$-uninformed and $\delta$-smooth, as described in Definitions 3.1 and 3.2 in the main body of the paper. Thus, the steep increase in early training matches the predictions of Proposition 3.6, where a positive bias in temporal difference yields faster learning. Also note that in 6 games simple double-Q learning already outperforms Rainbow in the low data regime. Note that Rainbow has several additional components, such as a dueling network, multi-step return, distributional reinforcement learning that introduces new parameters as large as the number of bins used in the algorithm, and NoisyNetworks.
Thus, the fact that MaxMin Novelty with simple double-Q learning achieves a higher score in these games than an algorithm that combines all these various techniques is further evidence suggesting that, as a future research direction, NoisyNetworks could be replaced with the MaxMin Novelty algorithm in Rainbow to obtain better performance. Also note that MaxMin Novelty does not introduce any additional parameters as NoisyNetworks does; more precisely, NoisyNetworks doubles the number of parameters used in the Q-network. Hence, the fact that MaxMin Novelty achieves higher performance, as also reported in the main body of the paper, without any additional computational cost further demonstrates the benefits of using MaxMin Novelty in a more diversified portfolio of algorithms as a zero-cost exploration technique.



Boxing, Breakout, ChopperCommand, DemonAttack, Gopher, Pong.

QRDQN is one of the baseline distributional reinforcement learning algorithms.



Figure 1: Exploring the chain MDP with the Upper Confidence Bound (UCB) method, $\epsilon$-greedy and our proposed method MaxMin Novelty.

Figure 2: The learning curves of StarGunner, FishingDerby, Boxing, Enduro, Bowling, IceHockey, BankHeist, JamesBond, KungFuMaster, Amidar, Gravitar and Tennis with our proposed method MaxMin Novelty and the $\epsilon$-greedy algorithm in the Arcade Learning Environment with 200 million frame training.

Figure 3: Temporal difference loss for our proposed algorithm MaxMin-Novelty and the canonical $\epsilon$-greedy algorithm in the Arcade Learning Environment 100K benchmark. Dashed lines report the temporal difference loss for the $\epsilon$-greedy algorithm and solid lines report the temporal difference loss for the MaxMin-Novelty algorithm. Colors indicate games.

Figure 5: Left and Middle: Median normalized temporal difference $\mathcal{TD}$ gain across all games in the Arcade Learning Environment 100K benchmark for MaxMin Novelty and NoisyNetworks. Right: Temporal difference loss $\mathcal{TD}$ when exploring the chain MDP with the Upper Confidence Bound (UCB) method, $\epsilon$-greedy and our proposed algorithm MaxMin Novelty.

Figure 6: Learning curves in the Arcade Learning Environment 100K benchmark with our proposed algorithm MaxMin Novelty and the canonical algorithm $\epsilon$-greedy.

The basic idea of UCB algorithms is to explore by adding an optimistic bonus to the state-action values, based on an estimate of the uncertainty in the current value function. The basic UCB approach (Sutton & Barto, 2018) is to use visit-count statistics $N_t(s, a)$, representing the number of times action $a$ has been taken in state $s$ by time step $t$, in order to estimate the variance of the current state-action values. The variance estimate is then used to construct a confidence interval around the current value estimate, usually given by some multiple of the standard deviation c

Algorithm 1: MaxMin Novelty
Input: MDP $\mathcal{M}$ with $\gamma \in (0, 1]$, $s \in S$, $a \in A$ with $Q(s, a)$, experience replay buffer $B$, exploration parameter $\epsilon$, number of training steps $N$.
Populating Experience Replay Buffer:
for $s_t$ in $e$ do
    Sample $\kappa \sim \mathcal{U}(0, 1)$
    if $\kappa < \epsilon$ then
        $a_{\min} = \arg\min_a Q(s_t, a)$
        $B \leftarrow (r(s_t, a_{\min}), s_t, s^{\min}_{t+1}, a_{\min})$
    else
        $a^* = \arg\max_a Q(s_t, a)$
        $B \leftarrow (r(s_t, a^*), s_t, s_{t+1}, a^*)$
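The buffer-population phase of Algorithm 1 can be sketched in a few lines for a tabular Q-function. This is a minimal illustration, not the authors' implementation: the names `maxmin_novelty_action` and `populate_buffer`, the dictionary-based `q_table`, and the `env_step(state, action) -> (next_state, reward)` interface are all hypothetical.

```python
import random


def maxmin_novelty_action(q_values, epsilon, rng=random):
    """With probability epsilon take argmin_a Q(s, a), otherwise argmax_a Q(s, a)."""
    actions = range(len(q_values))
    if rng.random() < epsilon:
        return min(actions, key=lambda a: q_values[a])
    return max(actions, key=lambda a: q_values[a])


def populate_buffer(env_step, q_table, start_state, num_steps, epsilon, rng=random):
    """Collect transitions (reward, state, next_state, action) as in Algorithm 1.

    ASSUMPTION: env_step(state, action) returns (next_state, reward) and
    q_table[state] is a list of Q-values indexed by action.
    """
    buffer = []
    state = start_state
    for _ in range(num_steps):
        action = maxmin_novelty_action(q_table[state], epsilon, rng)
        next_state, reward = env_step(state, action)
        buffer.append((reward, state, next_state, action))
        state = next_state
    return buffer
```

The only difference from $\epsilon$-greedy in this sketch is the exploratory branch, which replaces the uniform random draw with an argmin over Q-values that are already computed for the argmax, consistent with the claim that the method adds no extra computation.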

Human normalized scores median and 80th percentile over all games in the Arcade Learning Environment (ALE) 100K benchmark for the MaxMin Novelty algorithm and the canonical exploration algorithm $\epsilon$-greedy.

Human normalized scores median and 20th percentile across all of the games in the Arcade Learning Environment 100K benchmark for MaxMin-Novelty, $\epsilon$-greedy and NoisyNetworks.

Average returns for human, random, our proposed algorithm MaxMin-Novelty, the canonical algorithm $\epsilon$-greedy and NoisyNetworks across all of the games in the Arcade Learning Environment 100K benchmark. Scores are averaged over 5 random seeds.

Figure 6 reports learning curves in the Arcade Learning Environment 100K benchmark with our proposed algorithm MaxMin Novelty and the canonical algorithm $\epsilon$-greedy.


Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning (ICML), pp. 1995-2003, 2016.

Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pp. 7304-7312. PMLR, 2019.

A APPENDIX

A.1 HYPERPARAMETER AND ARCHITECTURE DETAILS

For reproducibility and completeness, in Table 2 we report the hyperparameter details for our proposed algorithm MaxMin-Novelty, the canonical algorithm $\epsilon$-greedy, and NoisyNetworks. For all of the algorithms the hyperparameters and the architectures are identical. Note that the architecture parameters are also identical for the 200 million frame training. Note that we did not tune the hyperparameters reported below. To increase transparency in research we kept the hyperparameters exactly the same as in prior studies. We ran our experiments with a JAX implementation (Bradbury et al., 2018). We used Haiku (Hennigan et al., 2020) for the neural network library, Optax (Hessel et al., 2020) for the optimization library, and RLax (Babuschkin et al., 2020) for the reinforcement learning library.

In this section we provide more results on the motivating example of the chain-MDP. In particular, while the main body of the paper provides results with the baseline chain-MDP, in this section we provide more results with the modified chain-MDP. In detail, the modified chain-MDP refers to the chain-MDP with an increased number of actions, to obtain more fine-grained observations of the effects of the exploration techniques. Hence, the modified chain-MDP consists of $n$ states $s \in S = \{1, 2, \dots, n\}$, each with four actions. In the modified chain-MDP each state $i$ has one action that transitions the agent up the chain by one step to state $i + 1$, one action that transitions the agent to state two, one action that transitions the agent to state three, and one action which resets the agent to state one at the beginning of the chain. Figure 7 reports results for MaxMin Novelty, canonical $\epsilon$-greedy, and the UCB method with varying $\epsilon \in [0.15, 0.25]$ with a step size of 0.025. The results reported in Figure 7 once more demonstrate that MaxMin Novelty performs significantly better than prior exploration techniques.

et al. (2019). Furthermore, note that Rainbow contains a dueling architecture, multi-step return, and noisy networks on top of distributional reinforcement learning. Thus, the fact that QRDQN, a baseline distributional reinforcement learning algorithm, can achieve a human normalized median score that is already substantially higher than data-efficient Rainbow once more demonstrates the substantial sample efficiency gained by the MaxMin Novelty algorithm. Furthermore, the significantly higher performance gain obtained by MaxMin Novelty over all tasks of the Arcade Learning Environment 100K benchmark, as reported via the human normalized scores in Figure 8, once more demonstrates that MaxMin Novelty increases the performance of the baseline algorithms beyond the performance of much more complicated algorithms.

Figure 8 (panels: Median, 80th Percentile): Human normalized median and human normalized 80th percentile scores of QRDQN with MaxMin Novelty and $\epsilon$-greedy in the Arcade Learning Environment 100K benchmark.

