MAXMIN-NOVELTY: MAXIMIZING NOVELTY VIA MINIMIZING THE STATE-ACTION VALUES IN DEEP REINFORCEMENT LEARNING

Abstract

Reinforcement learning research has accelerated rapidly since deep neural networks were first installed as function approximators to learn policies that make sequential decisions in high-dimensional state representation MDPs. While several consecutive barriers have been broken in deep reinforcement learning research (i.e. learning from high-dimensional states, learning purely via self-play), several others still stand. In particular, the question of how to explore in high-dimensional complex MDPs remains an understudied and ongoing open problem. To address this, in our paper we propose a unique exploration technique based on maximization of novelty via minimization of the state-action value function (MaxMin Novelty). Our method is theoretically well motivated, and comes with zero additional computational cost while leading to significant sample-efficiency gains in deep reinforcement learning training. We conduct extensive experiments in the Arcade Learning Environment with high-dimensional state representation MDPs, and show that our technique improves the human normalized median scores of the Arcade Learning Environment by 248% in the low-data regime.

1. INTRODUCTION

Utilization of deep neural networks as function approximators enabled learning functioning policies in high-dimensional state representation MDPs (Mnih et al., 2015). Following this initial work, the current line of work trains deep reinforcement learning policies to solve highly complex problems from game solving (Hasselt et al., 2016; Schrittwieser et al., 2020) to self-driving vehicles (Lan et al., 2020). Yet there are still remaining unsolved problems restricting the current capabilities of deep neural policies. One of the main intrinsic open problems in deep reinforcement learning research is exploration in high-dimensional state representation MDPs. While prior work extensively studied the exploration problem in bandits and tabular reinforcement learning, and proposed various algorithms and techniques optimal in the tabular form or the bandit setting (Kearns & Singh, 2002; Brafman & Tennenholtz, 2002; Lu & Roy, 2019; Wang et al., 2020; Karnin et al., 2013; Wagenmaker et al., 2022), exploration in deep reinforcement learning remains an open and challenging problem. Despite the provable optimality of these exploration techniques in the tabular or bandit setting, they generally rely strongly on the assumptions of tabular reinforcement learning, and in particular on the ability to record tables of statistical estimates for every state-action pair. Thus, in high-dimensional complex MDPs, for which deep neural networks are used as function approximators, the efficiency and the optimality of exploration methods proposed for tabular settings do not transfer well to deep reinforcement learning exploration. This is primarily due to the increase in the MDP dimensions and in the complexity. Hence, in deep reinforcement learning research, naive and simple exploration techniques (e.g. ε-greedy) are still preferred over the optimal tabular techniques (Mnih et al., 2015; Hasselt et al., 2016; Wang et al., 2016; Anschel et al., 2017; Bellemare et al., 2017; Lan et al., 2020).

The limited ability of current agents to explore efficiently, learn and adapt continuously is one of the main limiting factors preventing current state-of-the-art deep reinforcement learning algorithms from being deployed in many diverse settings, and most importantly one of the main challenges that needs to be dealt with on the way to building general artificial intelligence. In our paper we aim to seek answers for the following questions:

• Can we explore a high-dimensional state representation MDP more efficiently with zero additional computational cost?

• Is there a natural theoretical motivation that can be used to design a zero-cost exploration strategy while achieving high sample efficiency?

To be able to answer these questions, in our paper we focus on exploration in deep reinforcement learning and make the following contributions:

• We propose a novel exploration technique based on minimizing the state-action value function to increase the information gain from each particular experience acquired in the MDP.

• We conduct an extensive study in the Arcade Learning Environment 100K benchmark with state-of-the-art algorithms and demonstrate that our proposed method achieves a significant performance improvement.

• We show the efficacy of our proposed MaxMin Novelty method in terms of sample efficiency. Our method based on maximizing novelty via minimizing the state-action value function reaches approximately the same performance level as model-based deep reinforcement learning algorithms, without building and learning any model of the environment.
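To make the contrast concrete, the following is a minimal sketch of standard ε-greedy action selection next to a hypothetical MaxMin-style exploratory rule in which, on exploration steps, the agent takes the action with the lowest Q-value rather than a uniformly random one. This is only an illustration of the stated idea of "maximizing novelty via minimizing the state-action value function"; the function names and the exact selection rule are our assumptions, not the paper's specification.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Standard epsilon-greedy: uniform random action with probability
    epsilon, otherwise the greedy (argmax-Q) action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def maxmin_novelty_action(q_values, epsilon, rng):
    """Hypothetical sketch: on exploration steps, pick the action with the
    LOWEST estimated Q-value (interpreted here as the least-reinforced,
    hence most novel, choice) instead of a uniform random action."""
    if rng.random() < epsilon:
        return int(np.argmin(q_values))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q = np.array([0.1, 0.9, -0.3, 0.4])
print(epsilon_greedy(q, 0.0, rng))         # epsilon = 0: always greedy -> action 1
print(maxmin_novelty_action(q, 1.0, rng))  # epsilon = 1: always explore -> action 2
```

Note that this exploratory rule has the same per-step cost as ε-greedy: a single argmin over the already-computed Q-values, i.e. zero additional computational cost.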

2.1. DEEP REINFORCEMENT LEARNING

The reinforcement learning problem is formalized as a Markov Decision Process (MDP) M = (S, A, P, r, γ, ρ_0) that contains a continuous set of states s ∈ S, a set of discrete actions a ∈ A, a transition probability function P(s, a, s') on S × A × S, a discount factor γ, a reward function r(s, a) : S × A → R, and an initial state distribution ρ_0. A policy π(s, a) : S → P(A) in an MDP is a mapping from states to actions assigning a probability distribution over actions to each state s ∈ S. The main goal in reinforcement learning is to learn an optimal policy π that maximizes the expected cumulative discounted rewards

R = E_{a_t ∼ π(s_t, ·)} [ Σ_t γ^t r(s_t, a_t) ].

In Q-learning the learned policy is parameterized by a state-action value function Q : S × A → R, which represents the value of taking action a in state s. The optimal state-action value function is learnt via the iterative Bellman update

Q(s_t, a_t) = r(s_t, a_t) + γ Σ_{s_{t+1}} P(s_t, a_t, s_{t+1}) V(s_{t+1}),

where V(s_{t+1}) = max_a Q(s_{t+1}, a). Let a*(s) = argmax_a Q(s, a) be the action maximizing the state-action value function in state s. Once the Q-function is learnt, the policy is determined by taking the action a*(s) = argmax_a Q(s, a). In deep reinforcement learning, the state space or the action space is large enough that it is not possible to learn and store the state-action values in a tabular form. Thus, the Q-function is approximated via deep neural networks with parameters θ, updated as

θ_{t+1} = θ_t + α ( r(s_t, a_t) + γ Q(s_{t+1}, argmax_a Q(s_{t+1}, a; θ_t); θ_t) − Q(s_t, a_t; θ_t) ) ∇_{θ_t} Q(s_t, a_t; θ_t).

In deep double-Q learning, two Q-networks are used to decouple the Q-network deciding which action to take from the Q-network evaluating the action taken.
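The target computation and the update step above can be sketched in a tabular toy setting. The double-Q target uses one value table to select argmax_a and a second table to evaluate the selected action; the update then moves the current estimate toward that bootstrapped target. The table shapes and numeric values below are illustrative assumptions, not from the paper.

```python
import numpy as np

GAMMA, ALPHA = 0.99, 0.1

# Toy Q-tables over 3 states x 2 actions standing in for the two networks.
q_online = np.zeros((3, 2))
q_target = np.zeros((3, 2))
q_online[1] = [0.5, 1.0]   # illustrative values at the next state s'
q_target[1] = [0.4, 0.8]

def double_q_target(r, s_next, q_online, q_target, gamma):
    """Double-Q target: the online table SELECTS the action (argmax),
    the target table EVALUATES it -- decoupling selection from evaluation."""
    a_star = int(np.argmax(q_online[s_next]))
    return r + gamma * q_target[s_next, a_star]

def td_update(q, s, a, target, alpha):
    """One step toward the bootstrapped target, the tabular analogue of
    the gradient update on theta above."""
    q[s, a] += alpha * (target - q[s, a])
    return q[s, a]

y = double_q_target(r=1.0, s_next=1, q_online=q_online, q_target=q_target, gamma=GAMMA)
print(round(y, 4))                                  # 1.0 + 0.99 * 0.8 = 1.792
print(round(td_update(q_online, 0, 0, y, ALPHA), 4))  # 0 + 0.1 * 1.792 = 0.1792
```

Note that the online table picks action 1 (value 1.0) but the target table supplies its own, lower estimate (0.8) for that action, which is exactly how double-Q learning dampens the overestimation of the single-network max operator.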

