A LARGE SCALE SAMPLE COMPLEXITY ANALYSIS OF NEURAL POLICIES IN THE LOW-DATA REGIME

Anonymous authors
Paper under double-blind review

Abstract

Reinforcement learning algorithm development has accelerated steadily since the initial study that enabled sequential decision making from high-dimensional observations. Deep reinforcement learning research has recently achieved breakthroughs ranging from learning without the presence of rewards to learning functioning policies without even knowing the rules of the game. In our paper we focus on the trends currently used in deep reinforcement learning algorithm development in the low-data regime. We theoretically show that the performance profiles of the algorithms developed for the high-data regime do not transfer to the low-data regime in the same order. We conduct extensive experiments in the Arcade Learning Environment, and our results demonstrate that the baseline algorithms perform significantly better in the low-data regime than the set of algorithms that were initially designed for and compared in the large-data regime.

1. INTRODUCTION

Reinforcement learning research accelerated sharply upon the proposal of the initial study approximating the state-action value function via deep neural networks (Mnih et al., 2015). Following this initial study, several highly successful deep reinforcement learning algorithms have been proposed (Hasselt et al., 2016b; Wang et al., 2016; Hessel et al., 2018; 2021), ranging from different architectural ideas to estimators targeting overestimation, all of which were designed and tested in the high-data regime (i.e., training on two hundred million frames). An alternative recent line of research with an extensive amount of publications focused on pushing the performance bounds of deep reinforcement learning policies in the low-data regime (Yarats et al., 2021; Ye et al., 2021; Kaiser et al., 2020; van Hasselt et al., 2019; Kielak, 2019) (i.e., training with one hundred thousand environment interactions). Several distinct ideas in current reinforcement learning research, from model-based reinforcement learning to increasing sample efficiency with observation regularization, gained momentum based on policy performance comparisons demonstrated in the Arcade Learning Environment 100K benchmark. However, we demonstrate that there is a significant overlooked assumption being made in this line of research without being explicitly discussed. This implicit assumption, commonly shared amongst a large collection of low-data regime studies, carries significant importance because these studies shape future research directions with incorrect reasoning, thereby influencing the overall research effort invested in particular research ideas for several years to follow.
Thus, in our paper we target this implicit assumption and aim to answer the following questions:

• How can we theoretically explain the relationship between the asymptotic sample complexity and the low-data regime sample complexity in deep reinforcement learning?
• How do the performance profiles of deep reinforcement learning algorithms designed for the high-data regime transfer to the low-data regime?
• Can we expect the performance rank of algorithms to hold under variations in the number of samples used in policy training?

Hence, to answer the questions raised above, in our paper we focus on sample complexity in deep reinforcement learning and make the following contributions:

• We provide a theoretical foundation for the non-transferability of the performance profiles of deep reinforcement learning algorithms designed for the high-data regime to the low-data regime.
• We theoretically demonstrate that the performance profile has a non-monotonic relationship between the asymptotic sample complexity region and the low-data sample complexity region. Furthermore, we prove that the sample complexity of distributional reinforcement learning is higher than the sample complexity of the baseline deep Q-network algorithm.
• We conduct large-scale extensive experiments for a variety of deep reinforcement learning baseline algorithms in both the low-data regime and the high-data regime of the Arcade Learning Environment benchmark.
• We highlight that recent algorithms proposed and evaluated in the Arcade Learning Environment 100K benchmark are significantly affected by the implicit assumption on the relationship between performance profiles and sample complexity.

2. BACKGROUND AND PRELIMINARIES

2.1 DEEP REINFORCEMENT LEARNING

The reinforcement learning problem is formalized as a Markov Decision Process (MDP) represented as a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma, \rho_0 \rangle$ where $\mathcal{S}$ represents the state space, $\mathcal{A}$ represents the set of actions, $\mathcal{P}$ represents the transition probability distribution on $\mathcal{S} \times \mathcal{A} \times \mathcal{S}$, $\mathcal{R} : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ represents the reward function, and $\gamma \in (0, 1]$ represents the discount factor. The aim in reinforcement learning is to learn an optimal policy $\pi : \mathcal{S} \rightarrow \Delta(\mathcal{A})$ that maps state observations to a distribution over actions and maximizes the expected cumulative discounted reward

$$R = \mathbb{E}_{a_t \sim \pi(s_t, \cdot)} \sum_t \gamma^t \mathcal{R}(s_t, a_t, s_{t+1}). \quad (1)$$

This objective is achieved by constructing a state-action value function that learns, for each state-action pair, the expected cumulative discounted reward that will be obtained if action $a \in \mathcal{A}$ is executed in state $s \in \mathcal{S}$:

$$Q(s_t, a_t) = \mathcal{R}(s_t, a_t, s_{t+1}) + \gamma \sum_{s_{t+1}} \mathcal{P}(s_{t+1} \mid s_t, a_t) V(s_{t+1}). \quad (2)$$

In settings where the state space and/or action space is large enough that the state-action value function $Q(s, a)$ cannot be held in tabular form, a function approximator is used. Thus, in deep reinforcement learning the Q-function is approximated via a deep neural network with parameters $\theta$, updated by the temporal-difference rule

$$\theta_{t+1} = \theta_t + \alpha \big( \mathcal{R}(s_t, a_t, s_{t+1}) + \gamma Q(s_{t+1}, \arg\max_a Q(s_{t+1}, a; \theta_t); \theta_t) - Q(s_t, a_t; \theta_t) \big) \nabla_{\theta_t} Q(s_t, a_t; \theta_t). \quad (3)$$

Dueling Architecture: At the end of the convolutional layers of a given deep Q-network, the dueling architecture outputs two streams of fully connected layers estimating the state value $V(s)$ and the advantage $A(s, a) = Q(s, a) - \max_{a'} Q(s, a')$ of each action in a given state $s$. In particular, the last layer of the dueling architecture contains the forward mapping

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \Big( A(s, a; \theta, \alpha) - \max_{a' \in \mathcal{A}} A(s, a'; \theta, \alpha) \Big), \quad (4)$$

where $\theta$ represents the parameters of the convolutional layers, and $\alpha$ and $\beta$ represent the parameters of the fully connected layers outputting the advantage and state-value estimates respectively.
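The dueling combination in the forward mapping above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the helper name `dueling_combine` and its arguments are ours, and in practice the two streams would be the outputs of the fully connected heads parametrized by $\alpha$ and $\beta$.

```python
import numpy as np

def dueling_combine(value, advantages):
    """Combine the state-value and advantage streams of a dueling head
    into Q-values: Q(s, a) = V(s) + A(s, a) - max_a' A(s, a').

    value:      array of shape (batch,)            -- V(s; theta, beta)
    advantages: array of shape (batch, n_actions)  -- A(s, a; theta, alpha)
    """
    # Subtracting the per-state maximum advantage makes the decomposition
    # identifiable: the greedy action then satisfies Q(s, a*) = V(s).
    return value[:, None] + advantages - advantages.max(axis=1, keepdims=True)

# Tiny example: one state, three actions.
v = np.array([2.0])
adv = np.array([[0.5, 1.5, -0.5]])
q = dueling_combine(v, adv)
# The greedy action (index 1) has Q(s, a*) = V(s) = 2.0.
```

Subtracting the maximum (rather than, say, the mean used in some later implementations) follows Equation (4) exactly and anchors the greedy action's Q-value to the state-value estimate.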
Distributional Reinforcement Learning: The baseline distributional reinforcement learning algorithm C51 was proposed by Bellemare et al. (2017). The value distribution $Z_\theta$ is supported on $N$ fixed atoms $z_i = v_{\min} + i \Delta z$ with $\Delta z = \frac{v_{\max} - v_{\min}}{N - 1}$, and the projected Bellman update for the $i$-th atom is computed as

$$(\Phi \hat{\mathcal{T}} Z_\theta(s_t, a_t))_i = \sum_{j=0}^{N-1} \left[ 1 - \frac{\left| [\hat{\mathcal{T}} z_j]_{v_{\min}}^{v_{\max}} - z_i \right|}{\Delta z} \right]_0^1 p_j\Big(s_{t+1}, \arg\max_{a \in \mathcal{A}} \mathbb{E} Z_\theta(s_{t+1}, a)\Big), \quad (5)$$

where $[\cdot]_{v_{\min}}^{v_{\max}}$ denotes clipping to $[v_{\min}, v_{\max}]$ and $p_j$ is the probability mass that the parametrized distribution assigns to atom $z_j$.
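A minimal NumPy sketch of this categorical projection is given below, assuming the greedy next-state probabilities have already been selected; the function name `project_distribution` and its argument names are illustrative, and a complete C51 agent would additionally train the network with a cross-entropy loss against this target.

```python
import numpy as np

def project_distribution(rewards, next_probs, gamma, v_min, v_max, n_atoms):
    """Categorical projection of the Bellman target T z_j = r + gamma * z_j
    onto the fixed support {z_i}: each target atom's probability mass is
    split between its two nearest support atoms, as in the projected update.

    rewards:    (batch,)          immediate rewards
    next_probs: (batch, n_atoms)  probabilities p_j for the greedy next action
    """
    z = np.linspace(v_min, v_max, n_atoms)       # fixed support z_0 .. z_{N-1}
    dz = (v_max - v_min) / (n_atoms - 1)
    # Apply the Bellman operator atom-wise and clip into [v_min, v_max].
    tz = np.clip(rewards[:, None] + gamma * z[None, :], v_min, v_max)
    # [1 - |Tz_j - z_i| / dz] clipped to [0, 1] gives the projection weights.
    w = np.clip(1.0 - np.abs(tz[:, None, :] - z[None, :, None]) / dz, 0.0, 1.0)
    return (w * next_probs[:, None, :]).sum(axis=-1)   # shape (batch, n_atoms)

# Sanity check: with r = 0 and gamma = 1 every target atom lands exactly on
# the support, so the projection leaves the distribution unchanged.
p_next = np.array([[0.2, 0.5, 0.3]])
proj = project_distribution(np.array([0.0]), p_next,
                            gamma=1.0, v_min=0.0, v_max=2.0, n_atoms=3)
```

Note that because the projection weights for each target atom sum to one on the clipped support, the output remains a valid probability distribution.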

