A LARGE SCALE SAMPLE COMPLEXITY ANALYSIS OF NEURAL POLICIES IN THE LOW-DATA REGIME

Anonymous authors
Paper under double-blind review

Abstract

Reinforcement learning algorithm development has accelerated sharply since the initial study that enabled sequential decision making from high-dimensional observations. Recent breakthroughs in deep reinforcement learning range from learning without the presence of rewards to learning functioning policies without even knowing the rules of the game. In this paper we focus on the trends currently used in deep reinforcement learning algorithm development in the low-data regime. We show theoretically that the performance profiles of algorithms developed for the high-data regime do not transfer to the low-data regime in the same order. We conduct extensive experiments in the Arcade Learning Environment, and our results demonstrate that the baseline algorithms perform significantly better in the low-data regime than the set of algorithms that were initially designed for and compared in the high-data regime.

1. INTRODUCTION

Reinforcement learning research accelerated rapidly after the initial study on approximating the state-action value function via deep neural networks (Mnih et al., 2015). Following this study, several highly successful deep reinforcement learning algorithms have been proposed (Hasselt et al., 2016b; Wang et al., 2016; Hessel et al., 2018; 2021), ranging from new architectural ideas to estimators targeting overestimation, all of which were designed and tested in the high-data regime (i.e. two hundred million frames of training). An alternative recent line of research with an extensive amount of publications has focused on pushing the performance bounds of deep reinforcement learning policies in the low-data regime (Yarats et al., 2021; Ye et al., 2021; Kaiser et al., 2020; van Hasselt et al., 2019; Kielak, 2019) (i.e. training with one hundred thousand environment interactions). Several unique ideas in current reinforcement learning research, from model-based reinforcement learning to increasing sample efficiency with observation regularization, gained traction based on policy performance comparisons demonstrated in the Arcade Learning Environment 100K benchmark. However, we demonstrate that this line of research makes a significant, overlooked assumption that has never been explicitly discussed. This implicit assumption, shared across a large collection of low-data regime studies, carries significant importance: these studies shape future research directions with incorrect reasoning, thereby influencing the overall research effort invested in particular ideas for years to come.
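As a toy illustration of the implicit assumption in question, consider how a performance ranking established at one sample budget can fail to transfer to another. The sketch below computes a Spearman rank correlation between the scores of four unnamed algorithms at two budgets; all scores here are hypothetical numbers chosen for illustration, not results from any benchmark.

```python
def rank(values):
    """Return ranks (1 = best score) for a list of scores without ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(x, y):
    """Spearman rank correlation between two score lists without ties."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical normalized scores for four algorithms evaluated at
# two sample budgets (high-data vs. low-data). Illustrative only.
high_data = [1.20, 0.95, 0.80, 0.60]   # ranking at 200M frames
low_data  = [0.10, 0.25, 0.30, 0.35]   # ranking at 100K interactions

print(spearman(high_data, low_data))   # -1.0: complete rank reversal
```

A correlation of 1.0 would mean the high-data ranking transfers exactly to the low-data regime; the constructed example above yields -1.0, a complete reversal, which is precisely the failure mode the questions below probe.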
Thus, in our paper we target this implicit assumption and aim to answer the following questions:

• How can we theoretically explain the relationship between asymptotic sample complexity and low-data regime sample complexity in deep reinforcement learning?
• How do the performance profiles of deep reinforcement learning algorithms designed for the high-data regime transfer to the low-data regime?
• Can we expect the performance rank of algorithms to hold under variations in the number of samples used in policy training?

To answer these questions, we focus on sample complexity in deep reinforcement learning and make the following contributions:

