THE GUIDE AND THE EXPLORER: SMART AGENTS FOR RESOURCE-LIMITED ITERATED BATCH REINFORCEMENT LEARNING

Abstract

Iterated (a.k.a. growing) batch reinforcement learning (RL) is a growing subfield fueled by the demand from systems engineers for intelligent control solutions that they can apply within their technical and organizational constraints. Model-based RL (MBRL) suits this scenario well due to its sample efficiency and modularity. Recent MBRL techniques combine efficient neural system models with classical planning (such as model predictive control; MPC). In this paper we add two components to this classical setup. The first is a Dyna-style policy learned on the system model using model-free techniques; we call it the guide since it guides the planner. The second component is the explorer, a strategy to expand the limited knowledge of the guide during planning. Through a rigorous ablation study we show that the combination of these two ingredients is crucial for optimal performance and better data efficiency. We apply this approach with an off-policy guide and a heating explorer, improving the state of the art on benchmark systems with both discrete and continuous action spaces.

1. INTRODUCTION

John is a telecommunication engineer. His day job is to operate a mobile antenna. He has about forty knobs to turn, in principle every five minutes, based on about a hundred external and internal system observables. His goal is to keep some performance indicators within operational limits while optimizing others. In the evenings, John dreams about using reinforcement learning (RL) to help him with his job. He knows that he cannot put an untrusted model-free agent in charge of the antenna (failures are very costly), but he manages to convince his boss to run live tests a couple of days every month. John's case is arguably on the R&D table of many engineering companies today. AI adoption is slow, partly because these companies have little experience with AI, but partly also because the algorithms we develop fail to address the constraints and operational requirements of these systems. What are the common attributes of these systems?

• They are physical, not getting faster with time, and produce tiny amounts of data compared to what model-free RL (MFRL) algorithms require for training.
• System access is limited to a small number of relatively short live tests, each producing logs that can be used to evaluate the current policy and can be fed into the training of the next.
• They are relatively low-dimensional, and system observables were designed to support human control decisions, so there is no need to filter them or to learn representations (except when the engineer uses complex inputs such as images, e.g., a driver).
• Rewards are non-sparse: performance indicators arrive continually. Delays are possible but usually not long.

The RL setup that fits this scenario is neither pure offline (batch RL; Levine et al. (2020)), since interacting with the system is possible during multiple live tests, nor pure online, since the policy can only be deployed a limited number of times on the system (Fig. 1).
After each deployment on the real system, the policy is updated offline with access to all the data collected during the preceding live tests. Additionally, model-based RL works well on low-dimensional systems with dense rewards, and the system model (data-driven simulator, digital twin) is itself an object of interest because it can ease the adoption of data-driven algorithms by systems engineers. Given a robust system model, simple model predictive control (MPC) agents using random shooting (RS; Richards (2005); Rao (2010)) or the cross-entropy method (CEM; de Boer et al. (2004)) have been shown to perform remarkably well on many benchmark systems (Nagabandi et al., 2018; Chua et al., 2018; Wang et al., 2019; Hafner et al., 2019; Kégl et al., 2021) and in real-life domains such as robotics (Yang et al., 2020). However, these methods can be time-consuming to run at decision time on the real system, especially when using large models and with the search budgets required for complex action spaces. On the other hand, successfully implementing the seemingly elegant Dyna-style approach (Sutton, 1991; Kurutach et al., 2018; Clavera et al., 2018; Luo et al., 2019), in which fast reactive model-free agents are learned on the system model and applied to the real system, remains challenging, especially on systems that require planning with a long horizon. Our main findings are that i) the Dyna-style approach can still be an excellent choice when combined with decision-time planning or, looking at it from the opposite direction, ii) the required decision-time planner can be made resource-efficient by guiding it with the Dyna-style policy and optionally bootstrapping with its associated value function: this allows an efficient exploration of the action search space (given the limited resource budget), where fewer and shorter rollouts are needed to find the best action to play.
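To illustrate the kind of decision-time planner discussed above, the following is a minimal sketch of MPC with random shooting on a learned system model. The `model` and `reward_fn` callables are hypothetical stand-ins for the paper's learned dynamics model and reward, not its actual components.

```python
import random

def random_shooting_mpc(model, reward_fn, state, action_dim,
                        n_candidates=200, horizon=10, seed=None):
    """Return the first action of the best sampled action sequence.

    `model(state, action) -> next_state` stands in for a learned dynamics
    model; `reward_fn(state, action) -> float` stands in for the reward.
    Actions are assumed to live in [-1, 1]^action_dim.
    """
    rng = random.Random(seed)
    best_return, best_first = float("-inf"), None
    for _ in range(n_candidates):
        # Sample a random open-loop action sequence.
        seq = [[rng.uniform(-1.0, 1.0) for _ in range(action_dim)]
               for _ in range(horizon)]
        s, total = state, 0.0
        for a in seq:
            total += reward_fn(s, a)
            s = model(s, a)            # roll the model forward
        if total > best_return:
            best_return, best_first = total, seq[0]
    return best_first                   # receding horizon: replan each step
```

The search budget (`n_candidates` × `horizon` model calls per decision) is exactly the resource cost that guiding the planner is meant to reduce.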
We also innovate on the experimental framework (metrics, statistically rigorous measurements), so we can profit from the modularity of the Dyna-style approach by tuning its ingredients (model, MFRL guide policy, exploration, planning, bootstrapping; explained in Section 3.2) independently. This modular approach makes engineering easier (as opposed to monolithic approaches like AlphaZero (Silver et al., 2017)), which is an important aspect if we want to hand the methodology to non-expert systems engineers.

1.1. SUMMARY OF CONTRIBUTIONS

• A conceptual framework with interchangeable algorithmic bricks for iterated batch reinforcement learning, suitable for bringing intelligent control into slow, physical, low-dimensional engineering systems and the organizational constraints surrounding them.
• A case study indicating that a Dyna-style approach and resource-limited planning can mutually improve each other.
• An ablation study that helped us find the combination of a neural model, a bootstrapping off-policy algorithm as guide, and a heating explorer, which brings significant improvement over vanilla agents (MPC, pure Dyna without planning) on systems with both discrete and continuous action spaces.
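To make the guide-and-explorer combination concrete, the sketch below shows one plausible way a planner's candidate actions could be proposed around a Dyna-style guide policy, with exploration noise that "heats up" across candidates. This is an illustrative sketch under our own assumptions (the `guide` callable, the linear heating schedule), not the paper's exact explorer.

```python
import random

def guided_candidates(guide, state, n_candidates, temp_max=1.0, seed=None):
    """Propose candidate actions centered on a guide policy's suggestion.

    `guide(state) -> action` is a hypothetical stand-in for the Dyna-style
    guide policy. The noise scale grows linearly across candidates, so
    early proposals stay close to the guide and later ones explore further
    from its limited knowledge. Actions are clipped to [-1, 1].
    """
    rng = random.Random(seed)
    base = guide(state)
    candidates = [list(base)]                        # candidate 0: pure guide
    for i in range(1, n_candidates):
        temp = temp_max * i / (n_candidates - 1)     # linear heating schedule
        noisy = [a + rng.gauss(0.0, temp) for a in base]
        candidates.append([max(-1.0, min(1.0, a)) for a in noisy])
    return candidates
```

A planner would then evaluate these candidates with model rollouts (as in random-shooting MPC), so a good guide lets it find strong actions with far fewer and shorter rollouts.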



Figure 1: Iterated (a.k.a. growing) batch RL. The policy is updated offline between scheduled live tests, during which the latest learned policy is deployed on the real system to collect data that further improves it at the next offline update.

