MEASURING PROGRESS IN DEEP REINFORCEMENT LEARNING SAMPLE EFFICIENCY Anonymous

Abstract

Sampled environment transitions are a critical input to deep reinforcement learning (DRL) algorithms. Current DRL benchmarks often allow for the cheap and easy generation of large numbers of samples, such that perceived progress in DRL does not necessarily correspond to improved sample efficiency. As simulating real world processes is often prohibitively hard and collecting real world experience is costly, sample efficiency is an important indicator for economically relevant applications of DRL. We investigate progress in sample efficiency on Atari games and continuous control tasks by comparing the number of samples that a variety of algorithms need to reach a given performance level according to training curves in the corresponding publications. We find exponential progress in sample efficiency, with estimated doubling times of around 10 to 18 months on Atari, 5 to 24 months on state-based continuous control and around 4 to 9 months on pixel-based continuous control, depending on the specific task and performance level.
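The doubling-time estimate described above can be illustrated with a short sketch: fit a least-squares line to the base-2 logarithm of the samples needed to reach a fixed score threshold as a function of publication date, and read the doubling time off the slope. The data points below are placeholder values for illustration only, not the measurements reported in this paper.

```python
import math

def doubling_time_months(points):
    """Estimate the sample-efficiency doubling time in months.

    points: list of (months_since_first_publication, samples_to_threshold).
    Fits log2(samples) linearly against time; since sample efficiency
    doubles when the required sample count halves, the doubling time
    is -1 / slope of the fitted line.
    """
    n = len(points)
    xs = [t for t, _ in points]
    ys = [math.log2(s) for _, s in points]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -1.0 / slope

# Placeholder history: required frames halve every 12 months.
history = [(0, 200e6), (12, 100e6), (24, 50e6), (36, 25e6)]
print(doubling_time_months(history))  # ~12 months per doubling
```

In practice the fit would be run separately per task and per performance threshold, which is why the abstract reports a range of doubling times rather than a single number.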

1. INTRODUCTION

Recent successes of deep reinforcement learning (DRL) in Go (Silver et al., 2016; 2017; 2018) and complex real-time strategy games (Berner et al., 2019; Vinyals et al., 2019) indicate the vast potential for automating complex economically relevant tasks like assembling goods in non-standardized environments or physically assisting humans using DRL. However, sample efficiency is likely to remain an important bottleneck for the economic feasibility of real world applications of DRL. This is because samples of state transitions in the environment caused by an agent's action are essential for training DRL agents, but automated systems interacting with the real world are often fragile, slow or costly, which makes DRL training in the real world expensive both in terms of money and time (Dulac-Arnold et al., 2019). As the most widely used benchmarks in DRL still consist of computer games and simulations where samples can be obtained risk-free, fast and cheaply¹, progress in DRL does not necessarily translate to future real world applications unless it corresponds to progress in sample efficiency or in simulation accuracy and transfer learning. This makes information about progress in sample efficiency an important input to researchers studying topics like the future of employment (Frey & Osborne, 2017), the potential for malicious uses of DRL (Brundage et al., 2018) and other potentially transformative impacts of AI systems in the future (Gruetzemacher & Whittlestone, 2019).

2.1. SAMPLE EFFICIENCY

While research on progress in the field of AI used to have a strong focus on benchmark performance metrics (Eckersley & Nasser, 2017), there have been calls to pay more attention to the importance of other metrics: Martinez-Plumed et al. (2018) enumerate previously neglected dimensions of AI progress and explore the relationship between computing resources and final performance in RL. The wider implications of sample efficiency, more specifically, are explored by Tucker et al. (2020). While their scope is broader than DRL, most of their points do apply to reinforcement learning as a



¹ Espeholt et al. (2019) reach up to 2.4 M FPS and 1 B frames per $25 on DM Lab (Beattie et al., 2016).

