THE GUIDE AND THE EXPLORER: SMART AGENTS FOR RESOURCE-LIMITED ITERATED BATCH REINFORCEMENT LEARNING

Abstract

Iterated (a.k.a. growing) batch reinforcement learning (RL) is a growing subfield fueled by the demand from systems engineers for intelligent control solutions that they can apply within their technical and organizational constraints. Model-based RL (MBRL) suits this scenario well thanks to its sample efficiency and modularity. Recent MBRL techniques combine efficient neural system models with classical planning (such as model predictive control; MPC). In this paper we add two components to this classical setup. The first is a Dyna-style policy learned on the system model using model-free techniques; we call it the guide since it guides the planner. The second component is the explorer, a strategy to expand the limited knowledge of the guide during planning. Through a rigorous ablation study we show that the combination of these two ingredients is crucial for optimal performance and better data efficiency. We apply this approach with an off-policy guide and a heating explorer to improve the state of the art on benchmark systems covering both discrete and continuous action spaces.

1. INTRODUCTION

John is a telecommunication engineer. His day job is to operate a mobile antenna. He has about forty knobs to turn, in principle every five minutes, based on about a hundred external and internal system observables. His goal is to keep some performance indicators within operational limits while optimizing some others. In the evenings John dreams about using reinforcement learning (RL) to help him with his job. He knows that he cannot put an untrusted model-free agent on the antenna control (failures are very costly), but he manages to convince his boss to run live tests a couple of days every month. John's case is arguably on the R&D table of a lot of engineering companies today. AI adoption is slow, partly because these companies have little experience with AI, but partly also because the algorithms we develop fail to address the constraints and operational requirements of these systems. What are the common attributes of these systems?
• They are physical, not getting faster with time, producing tiny data compared to what model-free RL (MFRL) algorithms require for training.
• System access is limited to a small number of relatively short live tests, each producing logs that can be used to evaluate the current policy and can be fed into the training of the next.
• They are relatively small-dimensional, and system observables were designed to support human control decisions, so there is no need to filter them or to learn representations (with the exception of when the engineer uses complex images, e.g., a driver).
• Rewards are non-sparse: performance indicators come continually. Delays are possible but usually not long.
The RL setup that fits this scenario is neither pure offline (batch RL; Levine et al. (2020)), since interacting with the system is possible during multiple live tests, nor pure online, since the policy can only be deployed a limited number of times on the system (Fig 1).
[Figure 1: The policy is updated offline between scheduled live tests, where the latest learned policy can be deployed on the real system to collect data and further improve itself at the next offline update.]
After each deployment on the real system, the policy is updated offline with access to all the data collected during the previous deployments, each update benefiting from a larger and more diverse data set. This setup also assumes that the policy cannot be updated online while it is being deployed on the system. We refer to it as iterated batch RL (also called growing batch (Lange et al., 2012) or semi-batch (Singh et al., 1995; Matsushima et al., 2021) in the literature). Furthermore, we are interested in model-based RL (Deisenroth & Rasmussen, 2011; Chua et al., 2018; Moerland et al., 2021). With limited access to the real system, a model of the system transitions can be used to simulate trajectories, either at decision time to search for the best action (decision-time planning, e.g., model predictive control (MPC)) or when learning the policy (background planning, e.g., Dyna-style algorithms that learn model-free agents with the model), which makes model-based RL sample efficient (Chua et al., 2018; Wang et al., 2019). Additionally, model-based RL works well on small-dimensional systems with dense rewards, and the system model (data-driven simulator, digital twin) itself is an object of interest because it can ease the adoption of data-driven algorithms by systems engineers. Given a robust system model, simple model predictive control (MPC) agents using random shooting (RS; Richards (2005); Rao (2010)) or the cross-entropy method (CEM; de Boer et al. (2004)) have been shown to perform remarkably well on many benchmark systems (Nagabandi et al., 2018; Chua et al., 2018; Wang et al., 2019; Hafner et al., 2019; Kégl et al., 2021) and real-life domains such as robotics (Yang et al., 2020).
However, these methods can be time-consuming to run at decision time on the real system, especially when using large models and with the search budgets required for complex action spaces. On the other hand, successfully implementing the seemingly elegant Dyna-style approach (Sutton, 1991; Kurutach et al., 2018; Clavera et al., 2018; Luo et al., 2019), where we learn fast reactive model-free agents on the system model and apply them on the real system, remains challenging, especially on systems that require planning with a long horizon. Our main findings are that i) the Dyna-style approach can still be an excellent choice when combined with decision-time planning or, looking at it from the opposite direction, ii) the required decision-time planner can be made resource efficient by guiding it with the Dyna-style policy and optionally bootstrapping with its associated value function: this allows an efficient exploration of the action search space (given the limited resource budget) where fewer and shorter rollouts are needed to find the best action to play. We also innovate on the experimental framework (metrics, statistically rigorous measurements), so we can profit from the modularity of the Dyna-style approach by tuning ingredients (model, the MFRL guide policy, exploration, planning, bootstrapping; explained in Section 3.2) independently. This modular approach makes engineering easier (as opposed to monolithic approaches like AlphaZero (Silver et al., 2017)), which is an important aspect if we want to give the methodology to non-expert systems engineers.

1.1. SUMMARY OF CONTRIBUTIONS

• A conceptual framework with interchangeable algorithmic bricks for iterated batch reinforcement learning, suitable to bring intelligent control into slow, physical, low-dimensional engineering systems and the organizational constraints surrounding them.
• A case study indicating that a Dyna-style approach and resource-limited planning can mutually improve each other when combined.
• An ablation study that helped us find the combination of a neural model, a bootstrapping off-policy algorithm guide, and a heating explorer, which brings significant improvement over vanilla agents (MPC, pure Dyna without planning) on both discrete- and continuous-action systems.

2. RELATED WORK

The MBRL subfield has seen a proliferation of powerful methods, but most of them do not match the specific requirements of our scenario: they solve problems that are irrelevant here (like representation learning or sparse rewards) while missing others (limited and costly system access; time constraints for action search; data taking and experimentation through campaigns and live tests; safety) (Hamrick, 2019). The Dyna framework developed by Sutton (1991) consists of training an agent from both real experience and simulations from a system model learned from the real data. Its efficient use of system access makes it a natural candidate for iterated batch RL. The well-known limitation of this approach is the agent overfitting the imperfect system model (Grill et al., 2020). A first solution is to use short rollouts on the model to reduce error accumulation, as done in Model-Based Policy Optimization (MBPO; Janner et al. (2019)). Another solution is to rely on ensembling techniques for the model, as done in ME-TRPO (Kurutach et al., 2018) and MB-MPO (Clavera et al., 2018). The idea of using a guide and a value function when planning is not novel (Silver et al., 2017; Schrittwieser et al., 2020; Wang & Ba, 2020; Argenson & Dulac-Arnold, 2021; Sikchi et al., 2021). We were greatly inspired by these elements in our objective of building smarter agents, as they can make the search more efficient and lead to better performance. POPLIN-A (Wang & Ba, 2020) relies on behavior cloning (using only real experience, unlike our Dyna-style approach that mainly uses the model), but their decision-time planner is similar to our approach. During planning, they add Gaussian noise to the actions recommended by a deterministic policy network and update the noise distribution using a CEM strategy. In a similar way, our GUIDE&EXPLORE strategy also adds a carefully controlled amount of noise to the recommended actions. Our results highlight the importance of well-calibrated exploration.
Additionally, our planner does not require specifying the amount of noise beforehand. Argenson & Dulac-Arnold (2021); Lowrey et al. (2019); Sikchi et al. (2021) found that bootstrapping with a value estimate improves the performance of simple guided MPC strategies (see also Bhardwaj et al. (2021)). The popular AlphaZero (Silver et al., 2017) and MuZero (Schrittwieser et al., 2020) algorithms also rely on a guide and a value function for their Monte Carlo tree search (MCTS). The principal issue of MuZero (Schrittwieser et al., 2020) in our micro-data iterated batch RL context is that it does not control the number of system access steps: it needs to sample a lot from the real environment to establish the targets for the value function. In these two algorithms the guide is updated from the results obtained during the search that it guided, a procedure known as dual policy iteration (Anthony et al., 2017; Sun et al., 2018). Furthermore, most of the computation to grow the search tree is done sequentially, which results in a slower planner compared to the naturally parallelized implementation of our agent. We prefer experimenting with Dyna-style approaches first to leverage popular MFRL algorithms and defer the study of dual policy iteration to future work. Our results show that decision-time planning is an important ingredient, a claim already made by a prior study (2020) on where additional computation is best spent between policy update and decision-time planning. In our case we are, however, less concerned by the computational resources required to update the policy, as it is done offline, but rather by the time spent at decision time while interacting with the real system.

3.1. THE FORMAL SETUP

Let T_T = ⟨(s_1, a_1), . . . , (s_T, a_T)⟩ be a system trace consisting of T steps of observable-action pairs (s_t, a_t): given an observable s_t of the system state at time t, an action a_t was taken, leading to a new system state observed as s_{t+1}. The observable vector s_t = (s_t^1, . . . , s_t^{d_s}) contains d_s numerical or categorical variables, measured on the system at time t. The action vector a_t contains d_a numerical or categorical action variables, typically set by a control function a_t = π(s_t) of the current observable s_t (or by a stochastic policy a_t ∼ π(s_t); we will also use the notation π : s_t ; a_t). The performance of the policy is measured by the reward r_t, which is a function of s_t and a_t. Given a trace T_T and a reward r_t obtained at each step t, we define the mean reward as R(T_T) = (1/T) ∑_{t=1}^{T} r_t. The transition p : (s_t, a_t) ; s_{t+1} can be deterministic (a function) or probabilistic (generative). The transition may either be the real system p = p_real or a system model p = p̂. When the model p̂ is probabilistic, besides the point prediction E_p̂[s_{t+1} | (s_t, a_t)], it also provides information on the uncertainty of the prediction and/or models the randomness of the system (Deisenroth & Rasmussen, 2011; Chua et al., 2018). Finally, in the description of the algorithms we index a trace T = ⟨(s_1, a_1), . . . , (s_T, a_T)⟩ as follows: for t ∈ {1, . . . , T}, T_s[t] = s_t and T_a[t] = a_t.
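To make the notation concrete, here is a minimal Python sketch of a trace and its mean reward. The scalar states, actions, and toy reward function are illustrative placeholders of ours, not the paper's environments.

```python
from typing import Callable, List, Tuple

# a trace T_T is a list of (s_t, a_t) pairs; states/actions are scalars here
Trace = List[Tuple[float, float]]

def mean_reward(trace: Trace, reward: Callable[[float, float], float]) -> float:
    """R(T_T) = (1/T) * sum_{t=1}^{T} r(s_t, a_t)."""
    return sum(reward(s, a) for s, a in trace) / len(trace)

# toy reward, for illustration only
def r(s: float, a: float) -> float:
    return s - 0.1 * a ** 2

trace = [(1.0, 0.0), (2.0, 1.0), (3.0, -1.0)]
print(mean_reward(trace, r))  # (1.0 + 1.9 + 2.9) / 3 ≈ 1.9333
```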

3.2. A NOTE ON TERMINOLOGY

By model we will consistently refer to the learned transition or system model p̂ (never to any policy). Rollout is the procedure of obtaining a trace T from an initial state s_1 by alternating a model or real system p and a policy π (Fig 2). We decided to rename what Silver et al. (2017) call the prior policy to guide since prior clashes with Bayesian terminology (as, e.g., Grill et al. (2020); Hamrick et al. (2021) also note), and guide expresses well that the role of this policy is to guide the search at decision time. Sometimes the guide is also called the reactive policy (Sun et al., 2018) since it is typically an explicit function or conditional distribution ξ : s ; a that can be executed or drawn from rapidly. We will call the (often implicit) policy π : s ; a resulting from the guided plan/search the actor (sometimes also called the non-reactive policy since it takes time to simulate from the model before each action). Planning generally refers to the use of a model p̂ to generate imaginary plans, and in that sense, planning is part of training the guide. However, in the rest of the paper we will use the term planning to refer to the guided search procedure that results in acting on the real system (this is sometimes called decision-time planning). We will explicitly use the term background planning to refer to the planning used at training time, as is done in Sutton & Barto (2018) and Hamrick et al. (2021).

3.3. EXPERIMENTAL SETUP: THE ITERATED BATCH MBRL

For rigorously studying and comparing algorithms and algorithmic ingredients, we need to fix not only the simulation environment but also the experimental setup. We parameterize the iterated batch RL loop (the pseudocode in Fig 2 is the formal definition) by four parameters:
• the number of episodes N,
• the number of system access steps T per episode,
• the planning horizon L, and
• the number of generated rollouts n at each planning step.
N and T are usually set by hard organizational constraints (number N and length T of live tests) that are part of the experimental setup. Our main goal is to measure the performance of our algorithms at a given (and challengingly small) number of system access steps N × T for a given planning budget (n and L) determined by the (physical) time between two steps and the available computational resources. In benchmark studies, we argue that fixing N, T, n, and L is important for making the problem well defined (taking some of the usual algorithmic choices out of the input of the optimizer), affording meaningful comparison across papers and steady progress of algorithms. As in all benchmark designs, the goal is to make the problem challenging but not unsolvable. That said, we are aware that these choices may change the task and the research priorities implicitly but significantly (for example, a longer horizon L will be more challenging for the model but may make the planning easier although more expensive), so it would make sense to carefully design several settings (quadruples (N, T, n, L)) on the same environment.
ROLLOUT(π, p, s_1, T):
1  T ← {}
2  for t ← 1 to T:                ▷ for T steps
3      a_t ; π(s_t)               ▷ draw action from policy
4      T ← T ∪ (s_t, a_t)         ▷ update trace
5      s_{t+1} ; p(s_t, a_t)      ▷ draw next state
6  return T

ITERATEDMBRL(p_real, S_0, π^(0), N, T, L, n):
1  s_1 ; S_0                                                  ▷ draw initial state
2  T^(1) ← ROLLOUT(π^(0), p_real, s_1, T)                     ▷ random trace
3  for τ ← 1 to N:                                            ▷ for N episodes
4      p̂^(τ) ← LEARN(∪_{τ'=1}^{τ} T^(τ'))                     ▷ learn system model
5      π^(τ) ← ACTOR(π^(0), π^(τ-1), p̂^(τ), ∪_{τ'=1}^{τ} T^(τ'), L, n)
6      s_1 ; S_0                                              ▷ draw initial state
7      T^(τ+1) ← ROLLOUT(π^(τ), p_real, s_1, T)               ▷ new trace
8  return ∪_{τ=1}^{N} T^(τ)

Figure 2: The iterated batch MBRL loop. p_real : (s_t, a_t) ; s_{t+1} is the real system (so Line 7 is what dominates the cost) and p : (s_t, a_t) ; s_{t+1} is the transition in ROLLOUT that can be either the real system p_real or the system model p̂. S_0 is the distribution of the initial state of the real system. π^(0) : s_t ; a_t is an initial (typically random) policy and in ROLLOUT π : s_t ; a_t is any policy. N is the number of episodes; T is the length of the episodes; L is the planning horizon; and n is the number of planning trajectories used by the actor policies π^(τ). τ = 1, . . . , N is the episode index whereas t = 1, . . . , T is the system (or model) access step index. LEARN is a supervised learning (probabilistic or deterministic time-series forecasting) algorithm applied to the collected traces, and ACTOR is a wrapper of the various techniques that we experiment with in this paper (Fig 4). An ACTOR typically updates π^(τ-1) using the model p̂^(τ) in a background-planning loop, but it can also access the initial policy π^(0) and the trace ∪_{τ'=1}^{τ} T^(τ') collected on p_real up to episode τ.
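The pseudocode of Fig 2 can be transcribed into a short Python sketch. Everything below (scalar states, the stand-in `learn_model` and `make_actor` hooks) is our illustrative assumption, not the paper's implementation.

```python
def rollout(policy, transition, s1, T):
    """ROLLOUT: alternate policy and transition for T steps; return the trace."""
    trace, s = [], s1
    for _ in range(T):
        a = policy(s)
        trace.append((s, a))
        s = transition(s, a)  # real system or learned model
    return trace

def iterated_mbrl(p_real, draw_s0, pi0, make_actor, learn_model, N, T, L, n):
    """ITERATEDMBRL: N episodes of T costly real-system steps each."""
    data = rollout(pi0, p_real, draw_s0(), T)        # initial random trace
    pi = pi0
    for _ in range(N):
        model = learn_model(data)                    # LEARN on all traces so far
        pi = make_actor(pi0, pi, model, data, L, n)  # ACTOR update (e.g. guide&explore)
        data += rollout(pi, p_real, draw_s0(), T)    # new trace on the real system
    return data
```

Every call to `rollout` on `p_real` consumes T system access steps, so the total budget is (N + 1) × T, matching the loop structure above.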
Our main operational cost is the system access step, so we are looking for any-time algorithms that achieve the best possible performance at any episode τ for a given decision-time planning budget (n and L). Hence, in the MBRL iteration (Fig 2), we use the same traces T^(τ), rolled out in each iteration (Line 7), to i) update the model p̂ and the actor policy (Lines 4 and 5) and ii) measure the performance of the techniques (Section 3.5).

HEATINGEXPLORE(π^(0), ξ, n)[s]:
1  for i ← 1 to n:
2      ρ_i(a|s) =  ξ(a|s)^{1/T_i} / ∑_{a'} ξ(a'|s)^{1/T_i}   if a is discrete
                   N(E{ξ(·|s)}, T_i)                          if a is continuous
3  return [ρ_i]_{i=1}^{n}
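For a discrete action space, the temperature reweighting on Line 2 of HEATINGEXPLORE can be sketched as follows; the guide distribution and the temperature ladder are illustrative choices of ours.

```python
def heat(xi_probs, temperature):
    """Reweight a discrete guide distribution: xi(a|s)^(1/T), renormalized."""
    w = [p ** (1.0 / temperature) for p in xi_probs]
    z = sum(w)
    return [x / z for x in w]

def heating_explore(xi_probs, temperatures):
    """One explorer policy per rollout, each with its own temperature T_i."""
    return [heat(xi_probs, t) for t in temperatures]

guide = [0.7, 0.2, 0.1]    # xi(.|s) from the guide, for one state s
ladder = [0.25, 1.0, 4.0]  # cold -> sharper than the guide, hot -> flatter
explorers = heating_explore(guide, ladder)
```

A low temperature concentrates the mass on the guide's preferred action, a high one flattens the distribution towards uniform, so the n rollouts cover a spectrum from exploitation to exploration.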

3.4. MODEL-BASED ACTOR POLICIES: GUIDE AND EXPLORE

Our main contribution is a Dyna-style GUIDE&EXPLORE strategy (Fig 4). This strategy consists of learning a guide policy ξ for the decision-time planner (TRAJOPT in Fig 5), partly because the goal of π is not only to exploit the traces T = ∪_{τ=1}^{N} T^(τ) collected so far and the model p̂ = LEARN(T), but also to collect data to self-improve p̂ and ξ/π in the next episode τ. This second reason is particular to our iterated batch setup: contrary to pure batch RL, exploration here is crucial. When the guide ξ is probabilistic, we explore implicitly because of the sampling step (Line 3 in ROLLOUT, Fig 2), and partly also because of the draw from the imperfect and possibly stochastic model (Line 5). Nevertheless, we found that it helps if we control exploration explicitly. To show this, we experiment with a HEATINGEXPLORE strategy, which consists of modulating the temperature of the guide distribution ξ(a|s) (Fig 4). The novelty of our approach is that, instead of constant randomness in the exploration, we use a set of temperatures [T_i]_{i=1}^{n} to further diversify the search and let the planner explore promising regions far from the distribution of trajectories where the guide may have falsely converged. Finally, similarly to Lowrey et al. (2019); Argenson & Dulac-Arnold (2021), we found that bootstrapping the planning with the learned value function at the end of each rollout trace (BOOTSTRAP in Fig 5) can be helpful for optimizing the performance with a short horizon.

RSACTOR(π^(0), π_prev, p̂, T, L, n):
1  return TRAJOPT([π^(0)]_{i=1}^{n}, p̂, L, n)       ▷ planning

GUIDE&EXPLOREACTOR(π^(0), π_prev, p̂, T, L, n):
1  ξ ← MFRL(π_prev, p̂, T)                           ▷ model-free guide policy
2  [ρ_i]_{i=1}^{n} ← HEATINGEXPLORE(π^(0), ξ, n)     ▷ explorer policies
3  return TRAJOPT([ρ_i]_{i=1}^{n}, p̂, L, n)          ▷ planning

Figure 4: Model-based ACTORs (policies executed on the real system). RSACTOR is a classical random shooting planner that uses the random policy π^(0) for all rollouts. GUIDE&EXPLOREACTOR first learns a Dyna-style guide policy ξ on the transition p̂ (more precisely, updates the previous guide contained in π_prev). It can also use the traces T collected on the real system. It then "decorates" the guide by (possibly n different) exploration strategies, and runs these reactive guide&explore policies [ρ_i]_{i=1}^{n} in the TRAJOPT planner in Fig 5.

TOTALREWARD(T):
1  return T × R(T)                     ▷ total reward (a.k.a. return)

BOOTSTRAP(V, α)(T):
1  return T × R(T) + α V(T_s[L])       ▷ total reward + value of last state

TRAJOPT([ρ_i]_{i=1}^{n}, p̂, L, n)[s]:
1  for i ← 1 to n:
2      T^(i) ← ROLLOUT(ρ_i, p̂, s, L)   ▷ ith rollout trace
3      V^(i) ← VALUE(T^(i))            ▷ value of T^(i)
4  i* ← argmax_i V^(i)                 ▷ index of the best trace
5  return a*_1 = T^(i*)_a[1]           ▷ first action of the best trace

Figure 5: VALUE estimates on rollout traces and TRAJOPT: trajectory optimization using random shooting with a set of policies. TOTALREWARD and BOOTSTRAP are two ways to evaluate the value of a rollout trace. The latter adds the value of the last state to the total reward, according to a value estimate V : s → R_+, weighted by a hyperparameter α. They are called in Line 3 of TRAJOPT, which is a random shooting planner that accepts n different shooting policies [ρ_i]_{i=1}^{n} for the n rollouts used in the search. As usual, it returns the first action a*_1 = T^(i*)_a[1] of the best trace T^(i*) = ⟨(s*_1, a*_1), . . . , (s*_T, a*_T)⟩. Its parameters are the shooting policies [ρ_i]_{i=1}^{n}, the transition p̂, and the number n and length L of rollouts, but to properly define it, we also need the state s that we plan from, so we use a double argument list ()[ ]. We recall here that for a trace T = ⟨(s_1, a_1), . . . , (s_T, a_T)⟩ and for t ∈ {1, . . . , T}, T_s[t] = s_t and T_a[t] = a_t.
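The control flow of TRAJOPT with optional BOOTSTRAP scoring can be transcribed into a minimal Python sketch; the toy model, reward, and value function used in the usage note are our stand-ins.

```python
def trajopt(policies, model, reward, s, L, value=None, alpha=1.0):
    """Random shooting with a set of shooting policies; returns the first
    action of the best imagined rollout (a sketch of TRAJOPT)."""
    best_score, best_a1 = float("-inf"), None
    for rho in policies:
        s_t, score, a1 = s, 0.0, None
        for t in range(L):
            a = rho(s_t)
            if t == 0:
                a1 = a                   # remember the first action of the rollout
            score += reward(s_t, a)      # accumulate the total reward
            s_t = model(s_t, a)          # imagined step on the learned model
        if value is not None:
            score += alpha * value(s_t)  # BOOTSTRAP: add value of the last state
        if score > best_score:
            best_score, best_a1 = score, a1
    return best_a1
```

For example, with `model = lambda s, a: s + a`, `reward = lambda s, a: s`, and two constant policies pushing up and down, the planner returns the first action of the higher-scoring rollout.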

3.5. METRICS

We use two rigorously defined and measured metrics (Kégl et al., 2021) to assess the performance of the different algorithmic combinations. MAR measures the asymptotic performance after the learning has converged, and MRCP measures the convergence pace. Both can be averaged over seeds, and MAR is also an average over episodes, so we can detect statistically significant differences even when they are tiny, providing proper support for experimental development.
MEAN ASYMPTOTIC REWARD (MAR). Our measure of asymptotic performance, the mean asymptotic reward, is the mean reward MR(τ) = R(T_T^(τ)) averaged over the second half of the episodes (after convergence; we set N in such a way that the algorithms converge after less than N/2 episodes): MAR = (2/N) ∑_{τ=N/2+1}^{N} MR(τ).
MEAN REWARD CONVERGENCE PACE (MRCP(r)). To assess the speed of convergence, we define the mean reward convergence pace MRCP(r) as the number of steps needed to achieve mean reward r, smoothed over a window of size 5: MRCP(r) = T × argmin_τ { (1/5) ∑_{τ'=τ-2}^{τ+2} MR(τ') > r }. The unit of MRCP(r) is system access steps, not episodes, first to make it invariant to episode length, and second because in micro-data RL the unit of cost is a system access step. For Acrobot, we use r = 1.8 in our experiments, which is roughly 70% of the best achievable mean reward.
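Both metrics can be computed from the per-episode mean-reward series MR(1), . . . , MR(N). The sketch below uses a synthetic series and our own 0-based indexing conventions, so it approximates rather than reproduces the paper's exact definitions.

```python
def mar(mr):
    """Mean asymptotic reward: average of MR over the second half of episodes."""
    n = len(mr)
    return sum(mr[n // 2:]) / (n - n // 2)

def mrcp(mr, r, T, window=5):
    """Mean reward convergence pace: system access steps until the smoothed
    mean reward first exceeds r (None if it never does)."""
    h = window // 2
    for tau in range(h, len(mr) - h):
        if sum(mr[tau - h: tau + h + 1]) / window > r:
            return T * tau
    return None

mr = [0.0] * 5 + [2.0] * 5   # synthetic per-episode mean rewards
print(mar(mr))               # 2.0
print(mrcp(mr, 1.8, T=200))  # 1400
```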

4.1. ACROBOT

Acrobot is an underactuated double pendulum with four observables s_t = [θ_1, θ_2, θ̇_1, θ̇_2], which are usually augmented to six by taking the sine and cosine of the angles (Brockman et al., 2016); θ_1 is the angle to the vertical axis of the upper link; θ_2 is the angle of the lower link relative to the upper link, both being clipped to [-π, π]; θ̇_1 and θ̇_2 are the corresponding angular velocities. For the starting position s_1 of each episode, all four state variables are sampled uniformly from an approximately hanging and stationary position, s_1^j ∈ [-0.1, 0.1]. The action is a discrete torque on the lower link, a ∈ {-1, 0, 1}. The reward is the height of the tip of the lower link over the hanging position, r(s) = 2 - cos θ_1 - cos(θ_1 + θ_2) ∈ [0, 4]. Acrobot is a small but relatively difficult and fascinating system, so it is an ideal benchmark for continuous-reward engineering systems. Similarly to Kégl et al. (2021), we set the number of episodes to N = 100, the number of steps per episode to T = 200, the number of planning rollouts to n = 100, and the horizon to L = 10. With these settings, we can identify four distinctively different regimes (see the attached videos): i) the random uniform policy π^(0) achieves MAR ≈ 0.1-0.2 (Acrobot keeps approximately hanging), ii) reasonable models with random shooting or pure Dyna-style controllers achieve MAR ≈ 1.4-1.6 (Acrobot gains energy but moves its limb quite uncontrollably), iii) random shooting with n = 100, L = 10 and good models such as PETS (Chua et al., 2018; Wang et al., 2019) or DARMDN (Kégl et al., 2021) keeps the limb up and manages to have its tip above the horizontal on average, MAR ≈ 2.0-2.1 (previous state of the art), and iv) in our experiments we could achieve a quasi-perfect policy (Acrobot moves up like a gymnast and stays balanced at the top), MAR ≈ 2.7-2.8, using random shooting with n = 100K, L = 20 on the real system, giving us a target and a possibly large margin of improvement.
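The Acrobot reward described above is the height of the tip of the lower link over the hanging position, which is straightforward to verify at the two extreme configurations:

```python
import math

def acrobot_reward(theta1, theta2):
    """r(s) = 2 - cos(theta1) - cos(theta1 + theta2), in [0, 4]."""
    return 2.0 - math.cos(theta1) - math.cos(theta1 + theta2)

print(acrobot_reward(0.0, 0.0))      # hanging position: 0.0
print(acrobot_reward(math.pi, 0.0))  # both links straight up: 4.0
```

The MAR regimes listed above (≈ 0.1-0.2 hanging, ≈ 2.0-2.1 tip above the horizontal, ≈ 2.7-2.8 balanced at the top) map directly onto this scale.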
Acrobot is an ideal benchmark for making our point for the two following reasons. First, it turned out to be a quite difficult system for pure model-free baselines and associated Dyna-style algorithms (we achieve a MAR ≈ 2.1 with a DQN, see Table 1 , and Wang et al. (2019) report a MAR ≈ 1.6 -1.7 for other Dyna-style algorithms). Second, decision-time planning can achieve a quasi-perfect policy (MAR ≈ 2.7 -2.8) but doing so while being data efficient and with a limited planning budget appears to be challenging.

4.2. CARTPOLE SWING-UP

Cartpole swing-up from the DeepMind Control Suite (Tunyasuvunakool et al., 2020) is an underactuated pendulum attached by a frictionless pivot to a cart that moves freely along a horizontal line. Observations include the cart position, the cosine and sine of the pendulum angle, and their time derivatives, s_t = [x, ẋ, cos θ, sin θ, θ̇]. The cart is initialized at a position x and a velocity close to 0, and an angle close to θ = π (hanging position). The goal is to swing up the pendulum and stabilize it upright by applying a continuous horizontal force a_t ∈ [-1, 1] to the cart at each timestep t. The reward in [0, 1] is the product of four reward components: one depending on the height of the pendulum (in [1/2, 1]), one on the cart position (in [1/2, 1]), one on its velocity (in [1/2, 1]), and one on the amplitude of the force (in [4/5, 1]). The maximum reward is obtained when the pendulum is centered (x = 0) and upright, with no velocity and an applied force of 0. This task has been widely used in the literature as a standard benchmark for nonlinear control and complex action spaces due to its potential generalization to different domains (Boubaker, 2012; Nagendra et al., 2017). We set the number of episodes to N = 35, the number of steps per episode to T = 1000, the number of planning rollouts to n = 500, and the horizon to L = 20. A mean reward of 0.8 corresponds to a pole that succeeds at standing upright and remaining stable. We chose the Cartpole system because vanilla MPC agents (RS or CEM) require a long planning horizon (at least L = 100) to succeed at swinging up the pendulum and stabilizing it upright. It also illustrates how our approach extends to the continuous action setting, which increases the complexity of the optimization search space and requires sensitive controllers.

4.3. MODELS, GUIDES, AND ACTORS

Following Kégl et al. (2021), we tried different system models (Fig 2/Line 4) from the family of Deep (Autoregressive) Mixture Density Networks (D(A)RMDN) and selected the ones giving the best results on the Acrobot and Cartpole swing-up systems. In principle, any MFRL technique providing a value function (for bootstrapping) and a policy can be used as a guide ξ when applying ITERATEDMBRL (Fig 4/Line 1). We argue though that an off-policy algorithm is better suited here since it can leverage all the (off-policy) data coming from interaction with the real system. In particular, those traces are generated using planning and represent a stronger learning signal for the agent (planning during training vs. at test time; Hamrick et al. (2021)). Thus, we experimented with Deep Q-Networks (DQN; Mnih et al. (2015)) for the discrete-action Acrobot system and Soft Actor-Critic (SAC; Haarnoja et al. (2018)) for the continuous-action Cartpole system. As baselines, we start from the simple model-free guide (DQN or SAC), which is trained with data generated by the model and interacts with the system without planning. We refer to these agents as MBPO(DQN) and MBPO(SAC). Adding (n, L) to the name of the agent means that we use the agent to guide a planning of n rollouts with horizon L. Appendix A contains detailed information on the various algorithmic choices.

4.4. RESULTS

Table 1 and Fig 6 compare the results obtained with GUIDE&EXPLORE, a vanilla RSACTOR agent using the same budget as the one we consider for GUIDE&EXPLORE, the pure model-free guides trained on the real system, and their Dyna-style version (no planning, MBPO). We see that the GUIDE&EXPLORE algorithm gives the best performance. On Acrobot it almost matches the costly RSACTOR(n = 100K, L = 20) that we include as a target, even though it would not be officially accepted in our benchmark since we restrict n to 100 and L to 10. On Cartpole it reaches the performance reported in Lee et al. (2020) and approaches that of Springenberg et al. (2020), which are, to the best of our knowledge, the state of the art on Cartpole. We note that Springenberg et al. (2020) report the median performance whereas we report the mean performance, as Lee et al. (2020) do. We ran an ablation study on Acrobot showing that all ingredients add to the performance. Although MBPO(DQN) performs reasonably well and is comparable to RSACTOR(n = 100, L = 10), it fails to achieve the performance of RSACTOR(n = 100K, L = 20). Fig 6 also shows that adding a heating explorer to the guide significantly improves the performance of RSACTOR(n = 100, L = 10) and MBPO(DQN). We found that allowing the planner to choose the right amount of exploration (the right temperature) is a robust and safe approach (see Appendix B and Fig 8 for more results).

Table 1: Agent evaluation results. MAR is the mean asymptotic reward showing the asymptotic performance of the agent, and MRCP(1.8) is the mean reward convergence pace showing the sample efficiency (the number of system access steps required to achieve a mean reward of 1.8). The model-free agents are trained on the real system and the corresponding MAR shows the asymptotic performance obtained after convergence. ↓ and ↑ mean lower and higher the better, respectively.

5. CONCLUSION

In this paper we show that an offline Dyna-style approach can be successfully applied on benchmark systems where previous Dyna-style algorithms were failing. Our empirical results exhibit the importance of guiding the decision-time planning with the correct amount of exploration to achieve the best performance under a planning budget constraint. More precisely, our decision-time planner explores a varied range of trajectories around the guide's prior distribution and bootstraps with a value function estimate to further improve the performance. This combination leads to state-of-the-art performance while respecting reasonable resource constraints. Future work includes modelling the uncertainties of the value estimates so as to use them for better exploration.

The best performance was obtained with a logistic schedule (with a linear end). The exact values will be provided in the code. For Acrobot and the DQN guide we also experimented with a multi-ε exploration strategy based on EPSGREEDYEXPLORE (Fig 7), where we use one ε value for each of the n = 100 rollouts: {0.001, 0.01, 0.02, . . . , 0.99}. Refer to Appendix B for experiments with a suboptimal DQN guide using this exploration strategy.

EPSGREEDYEXPLORE(π^(0), ξ, n)[s]:
1  for i ← 1 to n:
2      ρ_i(a|s) =  argmax_a ξ(a|s)   with probability 1 - ε_i
                   π^(0)(a)          with probability ε_i
3  return [ρ_i]_{i=1}^{n}

Figure 7: The EPSGREEDYEXPLORE strategy.
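The multi-ε explorer can be sketched in Python as follows; the helper names and the example ε ladder are ours, not the paper's exact values.

```python
import random

def eps_greedy_policy(greedy, pi0, eps, rng=random):
    """Follow the greedy guide w.p. 1 - eps, the random policy pi0 w.p. eps."""
    def rho(s):
        return pi0(s) if rng.random() < eps else greedy(s)
    return rho

def eps_greedy_explore(greedy, pi0, epsilons):
    """One explorer policy per rollout, each with its own eps_i."""
    return [eps_greedy_policy(greedy, pi0, e) for e in epsilons]

# e.g. one eps per rollout, spanning (almost) no exploration to full exploration
ladder = [0.001, 0.01, 0.1, 0.4, 0.99]
```

With ε_i = 0 the rollout follows the guide greedily; with ε_i = 1 it reduces to the random shooting policy π^(0), so the ladder interpolates between the two extremes discussed in Appendix B.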

B IMPORTANCE OF THE EXPLORATION: STUDY WITH A SUBOPTIMAL DQN GUIDE ON ACROBOT

We ran an ablation study with a suboptimal DQN guide on Acrobot (achieving an asymptotic performance of 1.6 on the real system) and a multi-ε greedy explorer (EPSGREEDYEXPLORE) to demonstrate the importance of the explorer. EPSGREEDYEXPLORE makes it easy to control and interpret the degree of exploration through the ε parameter. We consider the following agents. We start from the simple DQN guide, which interacts with the system without planning. Adding (n, L) to the name of the agent means that we use the agent to guide a planning of n rollouts with horizon L. It is important to note here that planning without exploration using the greedy guide is, in our case, equivalent to no planning since both the model p and the guides ξ are deterministic. DQN-EPSGREEDYEXPLORE refers to the additional use of the associated exploration strategy (Fig 7). When a fixed ε is used for the exploration strategy, we add it as an explicit parameter, e.g., EPSGREEDYEXPLORE(ε). No parameter means that a different ε or temperature is used for each of the n rollouts. Setting ε to 0 corresponds to no exploration and is equivalent to using the guide greedily without planning (n = 1 and L = 1), as our model is used deterministically when sampled from. Setting ε to 1 corresponds to full exploration and is equivalent to the purely random RSACTOR(n = 100, L = 10). Table 6 reports the results obtained by DQN alone and DQN with planning and fixed ε values (DQN(n = 100, L = 10)-EPSGREEDYEXPLORE(ε) for ε ∈ {0.0001, 0.01, 0.05, 0.1, 0.2, 0.4, 0.8, 0.99, 0.9999}). The closer ε is to 0, the closer the performance is to DQN, and the closer ε is to 1, the closer the performance is to RSACTOR(n = 100, L = 10). With a well-chosen ε between these two extremes, say ε = 0.4, we obtain a better performance than either extreme. We can thus claim that planning with a correct amount of exploration is required.
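The decision-time planning described above (n explorer-driven rollouts of horizon L on a deterministic model, scored with a value-function bootstrap as mentioned in the conclusion) can be sketched as follows. The names and the exact scoring rule are our assumptions for illustration, not taken from the released code.

```python
import numpy as np

def plan_first_action(model, reward_fn, value_fn, policies, s0, horizon, gamma=0.95):
    """Decision-time planning sketch: roll out each explorer policy on the
    learned model, score each rollout by its discounted return plus a
    bootstrapped value estimate at the horizon, and return the first action
    of the best rollout. `model(s, a) -> s'` is assumed deterministic,
    matching the ablation setting in the text."""
    best_score, best_action = -np.inf, None
    for policy in policies:
        s, score, discount = s0, 0.0, 1.0
        first_action = None
        for _ in range(horizon):
            a = policy(s)
            if first_action is None:
                first_action = a        # remember the action actually executed
            s = model(s, a)             # deterministic one-step model
            score += discount * reward_fn(s)
            discount *= gamma
        score += discount * value_fn(s)  # bootstrap beyond the planning horizon
        if score > best_score:
            best_score, best_action = score, first_action
    return best_action

# Toy usage on a scalar "system": the always-act policy wins over the idle one.
chosen = plan_first_action(lambda s, a: s + a, lambda s: s, lambda s: 0.0,
                           [lambda s: 0, lambda s: 1], 0, 2)
```

With ε = 0 every explorer policy collapses onto the greedy guide, so all n rollouts coincide, which is why that setting is equivalent to no planning.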
Our EPSGREEDYEXPLORE exploration strategy, used with multiple ε values, allows for the automatic and dynamic selection of the right amount of exploration. A fixed ε value may indeed not give the best performance, since different values may be required at different epochs or at different steps of an episode. We illustrate this by plotting the selected ε versus the episode step for three different epochs (Fig 8). Even though the guide is suboptimal, the exploration scheme lets the agent benefit from the guide where it is good and discard it where it is bad.



We use the mean reward (as opposed to the total reward, a.k.a return), since it is invariant to episode length and its unit is more meaningful to systems engineers.

First a practical note: the pseudocode in Fig 2 and the subroutines in Figs 3-5 contain our formal definition. They are ordered top-down, but can also be read in reverse order according to the reader's preference.

We chose this dense reward rather than the sparse variable-episode-length version r(s) = I{2 − cos θ1 − cos(θ1 + θ2) > 3} (Sutton, 1996) since it corresponds better to the continuous aspect of engineering systems.
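To make the contrast concrete, here is a sketch of the two Acrobot reward variants. The dense form r(s) = 2 − cos θ1 − cos(θ1 + θ2) is our reading of the 0-to-4 reward range reported for Acrobot in Fig 6a; it is an assumption for illustration rather than a quote of the paper's exact reward code.

```python
import math

def dense_reward(theta1, theta2):
    # Height-based dense reward in [0, 4]: 0 when hanging, 4 when standing up
    # (consistent with the Fig 6a range; assumed form, not the released code).
    return 2 - math.cos(theta1) - math.cos(theta1 + theta2)

def sparse_reward(theta1, theta2):
    # Sparse variant in the style of Sutton (1996): reward only once the
    # tip is raised high enough, here expressed with the same height term.
    return float(2 - math.cos(theta1) - math.cos(theta1 + theta2) > 3)
```

The dense signal gives a gradient at every step (hanging scores 0, standing scores 4), while the sparse one is silent until the threshold is crossed, which is the property that makes it a poor match for continuously monitored engineering systems.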



Figure 1: Iterated (a.k.a growing) batch RL. The policy is updated offline between scheduled live tests where the latest learned policy can be deployed on the real system to collect data and further improve itself at the next offline update.

Hamrick et al. (2021) and Springenberg et al. (2020) among others. Hamrick et al. (2021) use MuZero to run their ablation study while we prefer using an explicit model for practical reasons explained in the introduction. Springenberg et al. (

Figure 3: Heating exploration strategy. HEATINGEXPLORE heats the guide action distribution ξ(a|s) to n different temperatures. The temperatures [T_i]_{i=1}^{n} are hyperparameters.

Fig 5) using a model-free RL technique on the model p and on the traces collected on the real system T . It is known that the guide ξ, executed as an actor π = ξ on the real system, does not usually give the best performance (we also confirm it in Section 4), partly because ξ overfits the model (Fig 5 in Kurutach et al. (2018); Grill et al. (

for the continuous action Cartpole swing-up task. For SAC on Cartpole, following Janner et al. (2019), short rollouts starting from real observations are performed on the model to sample transitions which are then placed in an experience replay buffer, along with the real transitions observed during the rollouts (Fig 2/Line 7). The SAC is then updated by sampling batches from this buffer. Details on the implementation and hyperparameters of the DQN and SAC agents are given in Appendix A. For actors (Fig 2/Line 5, Fig 4)
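The short-rollout scheme described above (Janner et al., 2019) can be sketched as follows. This is a minimal illustration of the branching idea, with names (`short_model_rollouts`, `reward_fn`) that are ours, not the paper's code.

```python
from collections import deque

def short_model_rollouts(model, reward_fn, policy, real_states, rollout_len, buffer):
    """MBPO-style branching, sketched: start short rollouts from observed
    real states, step the learned model, and push the imagined transitions
    into the replay buffer later sampled to update the SAC agent."""
    for s in real_states:
        for _ in range(rollout_len):
            a = policy(s)
            s_next = model(s, a)  # one step of the learned system model
            buffer.append((s, a, reward_fn(s_next), s_next))
            s = s_next
    return buffer

# Toy usage on a scalar "system": 3 imagined transitions branched from state 0.
buf = short_model_rollouts(lambda s, a: s + a, lambda s: s, lambda s: 1,
                           [0], 3, deque(maxlen=100_000))
```

Keeping the rollouts short limits the compounding of model error, while branching from real states keeps the imagined data close to the visited distribution.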

Figure 6: Learning curves obtained with different agents. Mean reward curves are averaged across at least four seeds. Areas with lighter colors show the 90% confidence intervals and dashed lines represent the score of the best converged model-free algorithms. (a) Mean reward is between 0 (hanging) and 4 (standing up). Episode length is T = 200, number of epochs is N = 100 with one episode per epoch. (b) Mean reward is between 0 and 1. Episode length is T = 1000, number of epochs is N = 35 with one episode per epoch.

Figure 7: Exploration strategies. EPSGREEDYEXPLORE replaces the guide's action by a random action π(0) with different probabilities. The probabilities [ε_i]_{i=1}^{n} are hyperparameters.

C RSACTOR PERFORMANCE ON THE REAL ACROBOT AND CARTPOLE SYSTEMS

C.1 ACROBOT

We present the results one can obtain on the real system with an RSACTOR and different values of the planning horizon L and the number of generated rollouts n in Fig 9.

Figure 9: Performance obtained with RSACTOR on the real Acrobot system for different planning horizons L and numbers of generated rollouts n. The plot shows the mean rewards obtained for several randomly initialized episodes of 200 steps. The error bars give the associated 90% confidence intervals. Note that since Acrobot has a discrete action space with three actions, the total number of different rollouts for L = 10 is 3^10 = 59,049. The performance shown for L = 10 and n = 100,000 thus only requires n = 59,049 rollouts.

Figure 10: Performance obtained with RSACTOR and CEM on the real Cartpole system for different planning horizons L and number of generated rollouts n. The plot shows the mean rewards obtained for several randomly initialized episodes of 1000 steps. The error bars give the associated 90% confidence intervals.

REPRODUCIBILITY STATEMENT

In order to ensure reproducibility we will release the code at <URL hidden for review> once the paper has been accepted. We also provide details on the hyperparameter optimization of the agents and models, as well as the best hyperparameters found, in the Appendix.

A IMPLEMENTATION DETAILS

A.1 CODE AND DEPENDENCIES

Our code will be made publicly available after publication to ease the reproducibility of all our results. We use PyTorch (Paszke et al., 2019) to build and train the neural network system models and policies. To run the ITERATEDMBRL experiments we use the rl_simulator (https://github.com/ramp-kits/rl_simulator) Python library developed by Kégl et al. (2021), which relies on OpenAI Gym (Brockman et al., 2016) for the Acrobot dynamics and dm_control (Tunyasuvunakool et al., 2020) for the Cartpole swing-up task. For the DQN and SAC agents we rely on the Stable-Baselines3 implementations (Raffin et al., 2019).

A.2 MODELS AND AGENTS

It is known that carefully tuning the hyperparameters of deep reinforcement learning algorithms is crucial for success and fair comparisons (Henderson et al., 2018; Zhang et al., 2021). To reduce the computational cost and consider a reasonable search space, the models and the agents were optimized independently.

For the Acrobot system models we use the same hyperparameters as the ones used in Kégl et al. (2021). Please refer to Appendix D in Kégl et al. (2021) for a complete description of the hyperparameter search and the selected hyperparameters. We decided to use DMDN(1)_det trained on the first 2000 and last 3000 collected samples as it led to a similar performance with a limited training time. For Cartpole we use a DARMDN model with one hidden layer of 128 neurons trained on the full dataset. These models are trained by minimizing the negative log-likelihood. The 'det' suffix means that the model is sampled from deterministically, returning the mean of the predicted distribution. The reader can also refer to Kégl et al. (2021) for a complete description of these models.

For the DQN we optimized its hyperparameters with a random search of 1000 trials and parallel training on 10 copies of the real system for 10 million steps (Table 2). We then selected the DQN with the best mean reward (Table 3). When training DQN on the system model, we iteratively update it with 100,000 steps at each episode using the most recent system model. We do not use short rollouts as this was not necessary. When bootstrapping with the value function we used a discount factor of 0.95 as it led to the best performance.

For the SAC agent we performed a similar hyperparameter optimization with a random search of 1000 trials and parallel training on 10 copies of the real system for 1 million steps (Table 4). The best SAC parameters are given in Table 5.
When training SAC on the system model, we train it from scratch at each episode for 250,000 steps and perform short rollouts of length 100.

• For environments with discrete actions, we learn a Q function and first normalize the Q values by their maximum value, Q̄(s, a) = Q(s, a)/max_{a′} Q(s, a′), before applying a softmax at temperature T_i: ρ_i(a|s) ∝ exp(Q̄(s, a)/T_i).
• For environments with continuous actions, we learn a Gaussian policy ξ(·|s) = N(μ(s), σ(s)) and define the explorer policies as ρ_i(·|s) = N(μ(s), T_i σ(s)), where [T_i]_{i=1}^{n} is an increasing sequence of temperatures.

A large temperature gives a uniform distribution, whereas a low temperature corresponds to taking argmax_a Q(s_t, a) or μ(s). Different shapes of temperature sequences were tried (linear, logarithmic, exponential, polynomial, logistic), and the best

Table 6: Importance of the explorer. MAR is the mean asymptotic reward showing the asymptotic performance of the agent, and MRCP(1.8) is the mean reward convergence pace showing the sample-efficiency as the number of system access steps required to achieve a mean reward of 1.8. ↓ and ↑ mean lower and higher is better, respectively. The ± values are 90% Gaussian confidence intervals.

For the considered planning horizons, a larger number of generated rollouts leads to a better performance. We also observed in our simulations that for the Acrobot to stay balanced, it was necessary (although not always sufficient) to have a reward larger than 2.6. We see from Fig 9 that this can be achieved with a simple agent such as RSACTOR, but at the price of a very large number of generated rollouts. The goal is therefore to design a smarter agent that can come as close as possible to this performance with a limited budget. For the CEM agent we run 5 iterations with the given n and L, so the total budget is 5 × n × L.

A mean reward of 0.8 corresponds to a pole that succeeds at standing upright and being stable. We see that achieving such a performance requires a CEM agent and a planning horizon larger than L = 100.
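The heating explorer for both action types can be sketched as follows. The discrete case normalizes Q by its maximum and softmaxes at each temperature; the continuous case widens the Gaussian's scale by the temperature, which is one plausible parameterization consistent with "low T → μ(s), high T → uniform" (the released code may differ). Function names are ours.

```python
import numpy as np

def heating_explore_discrete(q_values, temperatures):
    """HEATINGEXPLORE sketch, discrete actions: normalize Q by its max,
    then softmax at each temperature T_i. Low T ~ argmax, high T ~ uniform."""
    q = np.asarray(q_values, dtype=float)
    q_norm = q / np.max(q)           # Qbar(s, a) = Q(s, a) / max_a' Q(s, a')
    policies = []
    for T in temperatures:
        logits = q_norm / T
        logits -= logits.max()       # shift for numerical stability
        p = np.exp(logits)
        policies.append(p / p.sum())
    return policies

def heating_explore_gaussian(mu, sigma, temperatures):
    """Continuous actions: widen the guide's Gaussian N(mu, sigma) by each
    temperature; low T collapses onto mu, high T approaches uniform."""
    return [(mu, T * sigma) for T in temperatures]

# Usage: n explorer policies from one guide, one per temperature.
ps = heating_explore_discrete([1.0, 2.0], [0.01, 100.0])
g = heating_explore_gaussian(0.0, 1.0, [0.5, 2.0])
```

The planner then runs one rollout per heated policy, letting it pick the temperature (i.e., the amount of exploration) that scores best.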

