MODEL-BASED OFFLINE PLANNING

Published as a conference paper at ICLR 2021

Abstract

Offline learning is a key part of making reinforcement learning (RL) usable in real systems. Offline RL considers scenarios where data from a system's operation exists, but no direct access to the system is available when learning a policy. Recent work on training RL policies from offline data has shown results both with model-free policies learned directly from the data and with planning on top of learnt models of the data. Model-free policies tend to be more performant, but are more opaque, harder to command externally, and harder to integrate into larger systems. We propose an offline learner that generates a model that can be used to control the system directly through planning. This allows us to obtain easily controllable policies directly from data, without ever interacting with the system. We show the performance of our algorithm, Model-Based Offline Planning (MBOP), on a series of robotics-inspired tasks, and demonstrate its ability to leverage planning to respect environmental constraints. We are able to find near-optimal policies for certain simulated systems from as little as 50 seconds of real-time system interaction, and to create zero-shot goal-conditioned policies on a series of environments.

1. INTRODUCTION

Learnt policies for robotic and industrial systems have the potential both to increase existing systems' efficiency & robustness, and to open possibilities for systems previously considered too complex to control. Learnt policies also afford the possibility for non-experts to program controllers for systems that would currently require weeks of specialized work. Currently, however, most approaches for learning controllers require significant interactive time with a system to converge to a performant policy. This is often either undesirable or impossible due to operating cost, safety issues, or system availability. Fortunately, many systems are designed to log sufficient data about their state and control choices to create a dataset of operator commands and resulting system states. In these cases, controllers could be learned offline, using algorithms that produce a good controller from these logs alone, without ever interacting with the system. In this paper we propose such an algorithm, which we call Model-Based Offline Planning (MBOP), able to learn policies directly from logs of a semi-performant controller without interacting with the corresponding environment. It leverages these logs to generate a policy more performant than the one that generated them, and the resulting policy can subsequently be goal-conditioned or constrained dynamically during system operation. Learning from logs of a system is often called 'Offline Reinforcement Learning' (Wu et al., 2019; Peng et al., 2019; Fujimoto et al., 2019; Wang et al., 2020), and both model-free (Wu et al., 2019; Wang et al., 2020; Fujimoto et al., 2019; Peng et al., 2019) and model-based (Yu et al., 2020; Kidambi et al., 2020) approaches have been proposed to learn policies in this setting. Current model-based approaches, MOPO (Yu et al., 2020) and MoREL (Kidambi et al., 2020), learn a model to train a model-free policy in a Dyna-like (Sutton & Barto, 2018) manner.
Our proposed approach, MBOP, is a model-based approach that leverages Model-Predictive Control (MPC) (Rault et al., 1978) and extends the MPPI (Williams et al., 2017b) trajectory optimizer to provide a goal- or reward-conditioned policy using real-time planning. It combines three main elements: a learnt world model, a learnt behavior-cloning policy, and a learnt fixed-horizon value function. MBOP's key advantages are its data-efficiency and adaptability. MBOP is able to learn policies that perform better than the demonstration data from as little as 100 seconds of simulated system time (equivalent to 5000 steps). A single trained MBOP policy can be conditioned with a reward function, a goal state, or state-based constraints, all of which can be non-stationary, allowing for easy control by a human operator or a hierarchical system. Given these two key advantages, we believe it to be a good candidate for real-world use in control systems with offline data. We contextualize MBOP relative to existing work in Section 2, and describe MBOP in Section 3. In Section 4.2, we demonstrate MBOP's performance on standard offline-RL benchmark tasks, and in Section 4.3 we demonstrate MBOP's zero-shot adaptation to varying task goals and constraints. In Section 4.4 we perform an ablation analysis and consider the combined contributions of MBOP's various elements.

2. RELATED WORKS

Model-based approaches with neural networks have shown promising results in recent years. Guided Policy Search (Levine & Koltun, 2013) leverages differential dynamic programming as a trajectory optimizer on locally linear models, and caches the resulting piece-wise policy in a neural network. Williams et al. (2017b) show that a simple model-based controller can quickly learn to drive a vehicle on a dirt track; the BADGR robot (Kahn et al., 2020) uses Model-Predictive Path Integral (MPPI) control (Williams et al., 2017a) with a learned model to navigate to novel locations; Yang et al. (2020) show good results learning legged-locomotion policies using MPC with learned models; and Ebert et al. (2018) demonstrate flexible robot-arm controllers leveraging learned models with image-based goals. Silver et al. (2016) have shown the power of additional explicit planning in various board games, including Go. More recently, planning-based algorithms such as PlaNet (Hafner et al., 2019b) have shown strong results in pixel-based continuous-control tasks by leveraging latent variational RNNs. Simpler approaches such as PDDM (Nagabandi et al., 2020) or PETS (Chua et al., 2018) have shown good results using full state information, both in simulation and on real robots. MBOP is strongly influenced by PDDM (Nagabandi et al., 2020) (itself an extension of PETS (Chua et al., 2018)), in particular in its use of ensembles and how they are leveraged during planning. PDDM was not designed for offline use; MBOP adds a value-function composition as well as a policy prior during planning to increase data efficiency and strengthen the set of priors for offline learning. It leverages the same trajectory re-weighting approach used in PDDM and takes advantage of its beta-mixture with the previous trajectory buffer T. Both MoREL (Kidambi et al., 2020) and MOPO (Yu et al., 2020) leverage model-based approaches for offline learning.
This is similar to approaches used in MBPO (Janner et al., 2019) and DREAMER (Hafner et al., 2019a), both of which leverage a learnt model to learn a model-free controller. MoREL and MOPO, however, due to their offline nature, train their model-free learner on a surrogate MDP which penalizes underlying model uncertainty. They do not use the models for direct planning on the problem, which makes the final policy task-specific. MOPO's authors demonstrate the ability to alter the reward function and re-train a new policy according to this reward, but cannot leverage the final policy to dynamically adapt to an arbitrary goal or constrained objective. Matsushima et al. (2020) use a model-based policy for deployment-efficient RL. Their use case is a mix between offline and online RL, in which they consider a limited number of deployments. They share a similarity with MBOP in the sense that they also use a behavior-cloning policy π_β to guide trajectories in a learned ensemble model, but they perform policy-improvement steps on a parametrized policy initialized from π_β using a behavior-regularized objective function. Similarly to MoREL and MOPO, their approach learns a parameterized policy for acting in the real system. The use of a value function to extend the planning horizon of a planning-based policy was previously proposed by Lowrey et al. (2018) with the POLO algorithm. POLO uses a ground-truth model (e.g. a physics simulator) with MPPI/MPC for trajectory optimization, and additionally learns an approximate value function through interaction with the environment which is then appended to optimized trajectories to improve return estimation. MBOP uses a similar idea, but with an entirely approximate & learned model, a fixed-horizon value function to avoid bootstrapping, and separate heads of the ensemble during trajectory optimization.
BC-trained policies as sampling priors have previously been explored by POPLIN (Wang & Ba, 2019). POPLIN does not use value bootstrapping, and re-samples an ensemble head at each timestep during rollouts, which likely produces less consistent variations in simulated plans. It shows strong results relative to a series of model-based and model-free approaches, but does not manage to perform well on the Gym Walker environment. Additionally, POPLIN is overall much less data-efficient than MBOP and does not demonstrate performance in the offline setting. Task-time adaptation using model-based approaches has also been considered in the model-based literature. Lu et al. (2019) look at mixing model-free and model-based approaches using notions of uncertainty to allow for adaptive controllers for non-stationary problems. Rajeswaran et al. (2020) use a game-theoretic framework to describe two adaptive learners that are both more sample-efficient than common MBRL algorithms and more robust to non-stationary goals and system dynamics. MBOP is able to perform zero-shot adaptation to non-stationary goals and constraints, but does not provide a mechanism for dealing with non-stationary dynamics. If brought into the online setting, techniques from these algorithms, such as concentrating on recent data, could however be leveraged to allow for this. Previous approaches all consider various elements present in MBOP, but none consider the full combination of a BC prior on the trajectory optimizer with a value-function initialization, especially in the fully offline setting. Along with this high-level design, many implementation details, such as consistent ensemble sampling during rollouts or averaging returns over ensemble heads, appear in our experience to be important for a stable controller.

3. MODEL-BASED OFFLINE PLANNING

Our proposed algorithm, MBOP (Model-Based Offline Planning), is a model-based RL algorithm able to produce performant policies entirely from logs of a less-performant policy, without ever interacting with the actual environment. MBOP learns a world model and leverages a particle-based trajectory optimizer and model-predictive control (MPC) to produce a control action conditioned on the current state. It can be seen as an extension of PDDM (Nagabandi et al., 2020), with a behavior-cloned policy used as a prior on action sampling, and a fixed-horizon value function used to extend the planning horizon. In the following sections, we introduce the Markov Decision Process (MDP) formalism, briefly explain planning-based approaches, discuss offline learning, and then introduce the elements of MBOP before describing the algorithm in full.

3.1. MARKOV DECISION PROCESS

Let us model our tasks as a Markov Decision Process (MDP), defined as a tuple (S, A, p, r, γ), where an agent in a state s_t ∈ S takes an action a_t ∈ A at timestep t, arrives in a new state s_{t+1} with probability p(s_{t+1} | s_t, a_t), and receives a reward r(s_t, a_t, s_{t+1}). The cumulative reward over a full episode is called the return R, and can be truncated to a specific horizon H as R_H. Reinforcement learning and control generally aim to provide an optimal policy function π : S → A which maps a state s_t to the action leading to the highest long-term return: π*(s_t) = argmax_{a ∈ A} Σ_{k=t}^∞ γ^{k−t} r(s_k, π*(s_k)), where γ is a time-wise discounting factor that we fix to γ = 1, and therefore only consider finite-horizon returns.
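Spelled out in display form, under the paper's γ = 1, finite-horizon convention (this is a restatement of the definitions above, with H the planning horizon):

```latex
% Finite-horizon return over horizon H, with \gamma = 1:
R_H = \sum_{t=1}^{H} r(s_t, a_t, s_{t+1}),
\qquad
\pi^*(s_t) = \arg\max_{a_t \in A}\;
\mathbb{E}\!\left[\sum_{k=t}^{t+H-1} r\big(s_k, \pi^*(s_k), s_{k+1}\big)\right].
```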

3.2. PLANNING WITH LEARNED MODELS

A large body of contemporary work with MDPs involves Reinforcement Learning (RL) (Sutton & Barto, 2018) with model-free policies (Mnih et al., 2015; Lillicrap et al., 2015; Schulman et al., 2017; Abdolmaleki et al., 2018). These approaches learn some form of policy network which provides its approximation of the best action a_t for a given state s_t, often as a single forward pass of the network. MBOP and other model-based approaches (Deisenroth & Rasmussen, 2011; Chua et al., 2018; Williams et al., 2017b; Hafner et al., 2019b; Lowrey et al., 2018; Nagabandi et al., 2020) are very different: they learn an approximate model of their environment and then use a planning algorithm to find a high-return trajectory through this model, which is then applied to the environment¹. This is interesting because the final policy can be more easily adapted to new tasks, be made to respect constraints, or offer some level of explainability. When bringing learned controllers to industrial systems, many of these aspects are highly desirable, even at the expense of raw performance.

3.3. OFFLINE LEARNING

Most previous work in both reinforcement learning and planning with learned models has assumed repeated interactions with the target environment. This assumption allows the system to gather additional data along trajectories that are more likely and, more importantly, to obtain counterfactuals able to contradict prediction errors in the learned policy, which is fundamental to policy improvement. In the case of offline learning, we consider that the environment is not available during the learning phase; rather, we are given a dataset D of interactions with the environment, representing a series of timestep tuples (s_t, a_t, r_t, s_{t+1}). The goal is to provide a performant policy π given this particular dataset D. Existing RL algorithms do not easily port over to the offline learning setup, for a varied set of reasons well covered in Levine et al. (2020). In our work, we use the real environment to benchmark the performance of the produced policy. It is important to point out that there is often nevertheless a need to evaluate the performance of a given policy π without access to the final system, which is the concern of Off-Policy Evaluation (OPE) (Precup, 2000; Nachum et al., 2019) and Offline Hyperparameter Selection (OHS) (Paine et al., 2020); both are outside the scope of our contribution.

3.4. LEARNING DYNAMICS, ACTION PRIORS, AND VALUES

MBOP uses three parameterized function approximators for its planning algorithm:

1. f_m : S × A → S × R, a single-timestep model of environment dynamics and reward, such that (r_t, ŝ_{t+1}) = f_m(s_t, a_t). This is the model used by the planning algorithm to roll out potential action trajectories. We use f_m(s_t, a_t)_s to denote the state prediction and f_m(s_t, a_t)_r for the reward prediction.

2. f_b : S × A → A, a behavior-cloned policy network which produces a_t = f_b(s_t, a_{t−1}), and is used by the planning algorithm as a prior to guide trajectory sampling.

3. f_R : S × A → R, a truncated value function, which provides the expected return over a fixed horizon H of taking a specific action a in a state s, as R̂_H = f_R(s_t, a_{t−1}).

Each one is a bootstrap ensemble (Lakshminarayanan et al., 2017) of K feed-forward neural networks; thus f_m is composed of f^i_m ∀i ∈ [1, K], where each f^i_m is trained with a different weight initialization but from the same dataset D. This approach has been shown empirically to stabilize planning (Nagabandi et al., 2020; Chua et al., 2018). Each ensemble member network is optimized in a standard supervised manner to minimize the L2 loss on the predicted values in the dataset D.
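As a concrete illustration of the bootstrap-ensemble idea, the sketch below trains K members on the same synthetic dataset from different random initializations, minimizing the L2 loss. Linear maps stand in for the paper's feed-forward networks, and all names, sizes, and hyperparameters here are ours, not the authors':

```python
import numpy as np

rng = np.random.default_rng(0)
S_DIM, A_DIM, K = 3, 2, 4            # state dim, action dim, ensemble size
IN, OUT = S_DIM + A_DIM, S_DIM + 1   # f_m maps (s, a) -> (s', r)

# Synthetic "logged" dataset D: linear dynamics plus observation noise.
W_true = rng.normal(size=(IN, OUT))
X = rng.normal(size=(2000, IN))                       # concatenated (s_t, a_t)
Y = X @ W_true + 0.01 * rng.normal(size=(2000, OUT))  # targets (s_{t+1}, r_t)

def train_member(seed, lr=0.05, steps=800):
    """One ensemble member: minimize the L2 loss on D by gradient descent,
    starting from a member-specific random initialization."""
    W = np.random.default_rng(seed).normal(scale=0.1, size=(IN, OUT))
    n = len(X)
    for _ in range(steps):
        W -= lr * (2.0 / n) * X.T @ (X @ W - Y)
    return W

f_m = [train_member(seed) for seed in range(K)]  # ensemble f_m^1 .. f_m^K

def predict(sa):
    """Per-member predictions: planning uses one member's state prediction
    ("_s") but averages the reward prediction ("_r") over all members."""
    outs = np.stack([sa @ W for W in f_m])       # shape (K, OUT)
    states, rewards = outs[:, :S_DIM], outs[:, S_DIM]
    return states, rewards.mean()

states, r_mean = predict(rng.normal(size=IN))
```

Because each member starts from a different initialization, their predictions disagree slightly out of distribution, which is exactly the signal the planner exploits for stability.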

3.5. MBOP-PO L I C Y

MBOP uses Model-Predictive Control (Rault et al., 1978) to provide actions for each new state as a t = π(s t ). MPC works by running a fixed-horizon planning algorithm at every timestep, which returns a trajectory T of length H. MPC selects the first action from this trajectory and returns it as a t . This fixed-horizon planning algorithm is effectively a black box to MPC, although in our case we have the MPC loop carry around a global trajectory buffer T . A high-level view of the policy loop using MPC is provided in Algorithm 1. The MBOP-Policy loop is straightforward, and only needs to keep around T at each timestep. MPC is well-known to be a surprisingly simple yet effective method for planning-based control. Finding a good trajectory is however more complicated, as we will see in the next section.
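The MPC loop described above can be sketched as follows. The 1-D toy environment, the random-shooting `trajopt` stand-in, and all constants are illustrative; the real algorithm replaces `trajopt` with MBOP-Trajopt and `model_step` with the learned ensemble:

```python
import numpy as np

H = 8                          # planning horizon
rng = np.random.default_rng(0)

def model_step(s, a):
    # Stand-in for the learned model f_m: a 1-D integrator whose
    # reward encourages staying near the origin.
    s_next = s + a
    return s_next, -abs(s_next)

def trajopt(T_prev, s):
    """Stand-in trajectory optimizer: perturb the carried-over plan T_prev
    with noise, roll each candidate through the model, keep the best plan."""
    best_T, best_R = T_prev, -np.inf
    for _ in range(64):
        T = T_prev + rng.normal(scale=0.2, size=H)
        s_sim, R = s, 0.0
        for a in T:
            s_sim, r = model_step(s_sim, a)
            R += r
        if R > best_R:
            best_T, best_R = T, R
    return best_T

# MBOP-Policy loop: re-plan at every timestep, execute only the first
# planned action, and carry the trajectory buffer T to the next step.
s, T = 3.0, np.zeros(H)
for t in range(20):
    T = trajopt(T, s)
    a = T[0]                 # MPC: select the first action of the plan
    s, _ = model_step(s, a)  # apply it to the (here: simulated) system
```

Carrying `T` across steps is what amortizes the optimization: each call starts from last step's plan rather than from scratch.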

3.6. MBOP-TR A J O P T

MBOP-Trajopt extends ideas used by PDDM (Nagabandi et al., 2020) by adding a policy prior (provided by f_b) and value prediction (provided by f_R). The full algorithm is described in Algorithm 2.

Algorithm 1 High-Level MBOP-Policy
1: Let D be a dataset of E episodes
2: Train f_m, f_b, f_R on D
3: Initialize planned trajectory: T^0 = [0_0, ..., 0_{H−1}]
4: for t = 1..∞ do
5:   Observe s_t
6:   T^t = MBOP-Trajopt(T^{t−1}, s_t, f_m, f_b, f_R)   ▷ Update planned trajectory T^t
7:   Apply the first action of T^t to the environment
8: end for

Algorithm 2 MBOP-Trajopt
1: procedure MBOP-Trajopt(T, s_1, f_m, f_b, f_R)
2:   for n = 1..N do
3:     R = 0
4:     Select ensemble member l ∈ [1, K] for trajectory n
5:     for t = 1..H do
6:       ε ∼ N(0, σ²)
7:       a_t = f^l_b(s_t, a_{t−1}) + ε   ▷ Sample current action using BC policy
8:       A_{n,t} = (1 − β) a_t + β T_{min(t, H−1)}   ▷ Beta-mixture with previous trajectory T
9:       s_{t+1} = f^l_m(s_t, A_{n,t})_s   ▷ Sample next state from environment model
10:      R = R + (1/K) Σ_{i=1}^K f^i_m(s_t, A_{n,t})_r   ▷ Take average reward over all ensemble members
11:    end for
12:    R_n = R + (1/K) Σ_{i=1}^K f^i_R(s_{H+1}, A_{n,H})   ▷ Append predicted return and store
13:  end for
14:  T_t = Σ_{n=1}^N e^{κ R_n} A_{n,t+1} / Σ_{n=1}^N e^{κ R_n}, ∀t ∈ [0, H − 1]   ▷ Generate return-weighted average trajectory
15:  return T
16: end procedure

In essence, MBOP-Trajopt is an iterative guided-shooting trajectory optimizer with refinement. It rolls out N trajectories of length H using f_m as an environment model. As f_m is actually an ensemble with K members, we denote the l-th ensemble member as f^l_m. Line 4 of Alg. 2 allows the n-th trajectory to always use the same l-th ensemble member for both the BC-policy and model steps; this use of consistent ensemble members for trajectory rollouts is inspired by PDDM. We point out that f_m models both state transitions and reward, and so we denote the state component as f_m(s_t, a_t)_s and the reward component as f_m(s_t, a_t)_r. The policy prior f^l_b is used to sample an action, which is then averaged with the corresponding action from the previous trajectory generated by MBOP-Trajopt. By maintaining T from one MPC step to the next, we maintain a trajectory prior that allows us to amortize trajectory optimization over time.
The β parameter can be interpreted as a form of learning rate defining how quickly the current optimal trajectory should change with new rollout information (Wagener et al., 2019). We did not find any empirical advantage to the time-correlated noise used in Nagabandi et al. (2020), instead opting for i.i.d. noise. As opposed to the BC-policy and environment-model steps, the reward is calculated by averaging over all ensemble members to obtain the expected return R_n for trajectory n. At the end of a trajectory, we append the predicted return for the final state and action by averaging over all members of f_R. The decision to average returns rather than use separate ensemble heads was also inspired by the approach used in Nagabandi et al. (2020). Once we have a set of trajectories and their associated returns, we generate an average action for timestep t by re-weighting the actions of each trajectory according to their exponentiated return, as in Nagabandi et al. (2020) and Williams et al. (2017b). Section 4 demonstrates how the combination of these elements makes our planning algorithm capable of generating improved trajectories relative to the behavior trajectories in D, especially in low-data regimes. In higher-data regimes, variants of MBOP without the BC prior can also be used for goal- & constraint-based control. Further work will consider the addition of goal-conditioned f_b and f_R to allow for more data-efficient goal- and constraint-based control.
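The return-weighted averaging at the end of MBOP-Trajopt is a softmax average over candidate action sequences. A minimal numerical sketch follows; the `reweight` helper and the toy returns are ours, and we subtract the maximum return before exponentiating for numerical stability, which leaves the weights unchanged:

```python
import numpy as np

def reweight(A, R, kappa):
    """Return-weighted average of N candidate action trajectories.

    A: (N, H, action_dim) candidate action sequences A_{n,t}
    R: (N,) predicted returns R_n
    kappa: temperature; larger values concentrate mass on the best rollout.
    """
    w = np.exp(kappa * (R - R.max()))  # subtract max for numerical stability
    w /= w.sum()
    return np.einsum('n,nhd->hd', w, A)

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 4, 2))       # N=5 rollouts, H=4 steps, 2-D actions
R = np.array([1.0, 3.0, 0.5, 2.0, -1.0])

T_soft = reweight(A, R, kappa=1.0)    # blended plan across all rollouts
T_hard = reweight(A, R, kappa=100.0)  # concentrates on the best rollout
```

At κ = 0 the update degenerates to a plain mean over rollouts; as κ grows it approaches greedy selection of the single best trajectory.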

4. EXPERIMENTAL RESULTS

We look at two operating scenarios to demonstrate MBOP's performance and flexibility. First we consider the standard offline setting, where the evaluation environment and task are identical to the behavior policy's; we show that MBOP is able to perform well with very little data. We then look at MBOP's ability to provide controllers that transfer naturally to novel tasks with the same system dynamics. We use both goal-conditioned tasks (which ignore the original reward function) and constrained tasks (which require optimising for the original reward under some state constraint) to demonstrate MBOP's transfer abilities. Accompanying videos are available here: https://youtu.be/nxGGHdZOFts.

4.1. METHODOLOGY

We use standard datasets from the RL Unplugged (RLU) (Gulcehre et al., 2020) and D4RL (Fu et al., 2020) papers. For both RLU and D4RL, policies are trained from offline datasets and then evaluated on the corresponding environment. For datasets with high variance in performance, we discard episodes that are below a certain return threshold when training f_b and f_R. This is only done on the Quadruped and Walker tasks from RLU, and only provides a slight performance boost; performance on unfiltered data for these two tasks can be found in the Appendix Sec. 5.6. The unfiltered data is always used for training f_m. We perform a grid search to find optimal parameters for each dataset, but for most tasks these parameters are largely uniform. The full set of parameters for each experiment can be found in the Appendix Sec. 5.2. For experiments on RLU, we generated additional smaller datasets to increase the difficulty of the problem. On all plots we also report the performance of the behavior policy used to generate the data (computed directly from the episode returns in the datasets), labeled as the DATA policy. All non-standard datasets will be made publicly available.

Table 1: Results for MBOP on D4RL tasks compared to MOPO (Yu et al., 2020) and MBPO (Janner et al., 2019), with values taken from the MOPO paper (Yu et al., 2020). As in Fu et al. (2020), we normalize the scores according to a converged SAC policy, reported in their appendix. Scores are averaged over 5 random seeds, with 20 episode runs per seed; ± is one standard deviation and represents variance due to seed and episode. We include the performance of behavior cloning (BC) from the batch data for comparison, using our BC prior as the BC baseline, and set performance to 0.0 when it is negative. We bold the highest mean.
For RLU, the datasets are generated using a 70%-performant MPO (Abdolmaleki et al., 2018) policy on the original task, and smaller versions of the datasets are fixed sets of randomly sampled contiguous episodes (Dulac-Arnold et al., 2020; Gulcehre et al., 2020). D4RL has 4 behavior policies, ranging from random behavior to expert demonstrations, which are fully described in Fu et al. (2020). On all datasets, training is performed on 90% of the data and 10% is used for validation.
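The dataset handling above (return-threshold filtering for f_b and f_R, unfiltered data for f_m, and a 90/10 train/validation split) might look like the following sketch. The episode structure, the threshold value, and all names are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy log: each episode is a dict holding its transitions and total return.
episodes = [{"ret": float(rng.normal(500, 150)),
             "steps": [(rng.normal(size=3), rng.normal(size=2))
                       for _ in range(10)]}
            for _ in range(100)]

RETURN_THRESHOLD = 400.0  # illustrative cutoff, not the paper's value

# Filtered data for the BC policy f_b and value function f_R ...
bc_eps = [e for e in episodes if e["ret"] >= RETURN_THRESHOLD]
# ... but the dynamics model f_m always trains on the unfiltered data.
model_eps = episodes

def split(eps, frac=0.9):
    """90% train / 10% validation split over episodes."""
    eps = list(eps)
    n_train = int(frac * len(eps))
    return eps[:n_train], eps[n_train:]

train_eps, valid_eps = split(model_eps)
```

Filtering only the BC and value targets keeps the behavior prior focused on good trajectories while the dynamics model still sees the full support of the data.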

4.2. PERFORMANCE ON RL-UNPLUGGED & D4RL

For experiments on RLU we consider the unperturbed RWRL Cartpole-Swingup, Walker and Quadruped tasks (Tassa et al., 2018; Dulac-Arnold et al., 2020). For D4RL we consider the HalfCheetah, Hopper, Walker2d and Adroit tasks (Brockman et al., 2016; Rajeswaran et al., 2017). Results for the RLU tasks as well as Adroit are presented in Figure 1. On the remaining D4RL tasks, results are compared to those presented by MOPO (Yu et al., 2020) in Table 1 for four different data regimes (medium, medium-expert, medium-replay, random). For all experiments we report MBOP performance as well as the performance of a behavior cloning (BC) policy. The BC policy is simply the policy prior f_b, with the control action taken as the average ensemble output; we use this baseline to demonstrate the advantages brought by planning beyond simple cloning. For the RLU datasets (Fig. 1), we observe that MBOP is able to find a near-optimal policy on most dataset sizes in Cartpole and Quadruped with as little as 5000 steps, which corresponds to 5 episodes, or approximately 50 seconds of system time on Cartpole and 100 seconds on Quadruped. On the Walker datasets MBOP requires 23 episodes (approx. 10 minutes) before it finds a reasonable policy, and with sufficient data converges to a score of 900, which is near-optimal. On most tasks, MBOP is able to generate a policy significantly better than both the behavior data and the BC prior. For the Adroit task, we show that MBOP is able to outperform the behavior policy after training on a dataset of 50k data points generated by an expert policy (Fig. 1d). For the other D4RL datasets, we compare to the performance of MOPO (Yu et al., 2020). We show that on the medium and medium-expert data regimes MBOP outperforms MOPO, sometimes significantly. However, on higher-variance datasets such as random and medium-replay, MBOP is not as performant.
This is likely due to the reliance on policy-conditioned priors, which we hope to render more flexible in future work (for instance using multi-modal stochastic models). There are nevertheless many tasks where a human operator is running a system in a relatively consistent yet sub-optimal manner, and one may want to either replicate or improve upon the operator's control policy. In such scenarios, MBOP would likely be able to not only replicate but improve upon the operator's control strategy. Fig. 2a illustrates a sequence of frames from the RLU Cartpole task with constrained and unconstrained MBOP controllers. In the constrained case MBOP prevents the cart from crossing the middle of the rail (dotted red line) and contains it to one side. Fig. 2b displays cart trajectories for constrained and unconstrained versions of the same controller; MBOP maintains a performant policy (return above 750) while respecting these constraints. Fig. 2c displays goal-conditioned performance on the RLU Quadruped: we ignore the original reward function and optimize directly for trajectories that maximize a particular velocity vector. Although influence from f_b and f_R biases the controller to maintain the forward direction, we can still exert significant goal-directed influence on the policy.

4.3. ZERO-SHOT TASK ADAPTATION

One of the main advantages of using planning-based methods in the offline scenario is that they are easy to adapt to new objective functions: in the case of MBOP, novel objectives different from those optimized by the behavior policy that generated the offline data. We can take these new objectives into account by computing a secondary objective return R'_n = Σ_t f_obj(s_t), where f_obj is a user-provided function that computes a scalar objective reward for a given state. We then adapt the trajectory update rule to take the secondary objective into account:

T_t = Σ_{n=1}^N e^{κ R_n + κ_obj R'_n} A_{n,t} / Σ_{n=1}^N e^{κ R_n + κ_obj R'_n}, ∀t ∈ [1, H].

To demonstrate this, we run MBOP on two types of modified objectives: goal-conditioned control and constrained control. In goal-conditioned control, we ignore the original reward function (κ = 0), define a new goal (such as a velocity vector), and optimize trajectories relative to that goal. In constrained operation, we add a state-based constraint which we penalize during planning while maintaining the original objective, and find a reasonable combination of κ and κ_obj. We define three tasks: position-constrained Cartpole, where we penalize the cart's position to encourage it to stay either on the right or left side of the track; heading-conditioned Quadruped, where we provide a target heading to the policy (Forward, Backward, Right & Left); and height-constrained Walker, where we penalize the policy for bringing the torso height above a certain threshold. Results on Cartpole & Quadruped are presented in Figure 2. We show that MBOP successfully integrates constraints that were not initially in the dataset and is able to perform well on objectives that differ from the objective of the behavior policy. Walker performs similarly, obtaining nearly 80% constraint satisfaction while maintaining a reward of 730. More analysis is available in the Appendix Sec. 5.5.
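This combined update can be sketched numerically. The candidate trajectories, returns, and the penalty standing in for f_obj are all made up for illustration, and we stabilize the softmax by subtracting the maximum logit (which leaves the weights unchanged):

```python
import numpy as np

rng = np.random.default_rng(1)
N, H = 6, 5
A = rng.normal(size=(N, H))   # N candidate 1-D action sequences A_{n,t}
R = rng.normal(size=N)        # task returns R_n from the learned model

# Illustrative secondary objective R'_n: penalize plans whose mean action
# strays from zero, standing in for sum_t f_obj(s_t) on predicted states.
R_obj = -np.abs(A.mean(axis=1))

def combined_update(A, R, R_obj, kappa, kappa_obj):
    """Trajectory update re-weighted by task return and secondary objective."""
    logits = kappa * R + kappa_obj * R_obj
    w = np.exp(logits - logits.max())  # stabilized softmax weights
    w /= w.sum()
    return (w[:, None] * A).sum(axis=0)

T_task = combined_update(A, R, R_obj, kappa=1.0, kappa_obj=0.0)  # original task only
T_goal = combined_update(A, R, R_obj, kappa=0.0, kappa_obj=5.0)  # goal-conditioned, κ = 0
```

Setting κ = 0 recovers pure goal-conditioning, while a nonzero pair (κ, κ_obj) trades off the original return against the constraint penalty.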

4.4. ALGORITHMIC INVESTIGATIONS

Ablations To better understand the benefits of MBOP's various elements, we consider three ablations: MBOP-NOPP, which replaces f_b with a Gaussian prior; MBOP-NOVF, which removes f_R's estimated returns; and PDDM, which removes both, thus recovering the PDDM controller. We show the performance of MBOP and these three ablations on the Walker dataset in Fig. 3a; a full set of ablations is available in the Appendix, Figures 4 & 5. Overall we see that the full combination of BC prior, value function, and environment model is important for optimal performance. We also see that the PDDM approach generally performs below either of the MBOP-NOPP and MBOP-NOVF ablations. Finally, we note that the BC prior used alone can perform well on certain environments, but on others it stagnates at the behavior policy's performance.

Execution Speed A frequent concern with planning-based methods is that their slower response time prohibits practical use. We calculate the average control frequency of MBOP on the RLU Walker task using a single Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz core and an Nvidia 1080TI, and find that MBOP can operate at frequencies ranging from 106 Hz for H = 4 down to 40 Hz for H = 40, with BC operating at 362 Hz. Additional values are presented in Appendix Sec. 5.4.

Hyperparameter Stability

We perform a grid sweep over the κ (trajectory re-weighting) and H (planning horizon) parameters on the three RLU environments and visualize the effects on return in Fig. 3b. We observe that MBOP maintains consistent performance over wide ranges of hyperparameter values, only really degrading near extreme values. Additional analysis is presented in the Appendix's Section 5.5.

5. CONCLUSION

Planning-based methods provide significantly more flexibility for external systems to interact with the learned controller. Bringing them into the offline data regime opens the door to their use on more real-world systems for which online training is not an option. MBOP provides an easy-to-implement, data-efficient, stable, and flexible algorithm for policy generation. It is easy to implement because the learning components are simple supervised learners; it is data-efficient thanks to its use of multiple complementary estimators; and it is flexible due to its use of on-line planning, which allows it to dynamically react to changing goals, costs, and environmental constraints. We show that MBOP can perform competitively in various data regimes, and can provide easily adaptable policies for more complex goal-conditioned or constrained tasks, even if the original data does not provide prior experience. Although MBOP's performance degrades when offline data is multi-modal or downright random, we believe there are a large number of scenarios where the current operating policy (be it human or automated) is reasonably consistent, but could benefit from being automated and improved upon; in these scenarios we believe that MBOP could be readily applicable. Future work intends to improve performance by investigating the use of goal-conditioned policy priors and value estimates, as well as looking at effective ways to perform offline model selection and evaluation. We sincerely hope that MBOP can be useful as an out-of-the-box algorithm for learning stable and configurable control policies for real systems.

5.4. EXECUTION SPEED

Table 9: MBOP maximum control frequencies (steps/second), including simulator time, on a Tesla P100 using a single core of a Xeon 2200 MHz-equivalent processor. Execution speeds on the RLU Walker task are presented in Table 9. We see that we can easily achieve control frequencies above 10 Hz, but cannot currently attain 100 Hz with longer horizons.
For lower-level control policies for which high-frequency control is important, we would suggest distilling the controller into a task-specific policy, similar to MoREL (Kidambi et al., 2020) or MOPO (Yu et al., 2020).
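Such a distillation step can be as simple as behavior cloning from the planner: query the slow planner over a set of states and regress a fast reactive policy onto its actions. A hedged sketch, where the linear least-squares fit and the `planner` callable are illustrative stand-ins, not the MoREL/MOPO training procedure:

```python
import numpy as np

def distill_policy(planner, states):
    """Distill a slow planning-based controller into a fast policy by
    behavior cloning (sketch): query the planner at each state, then
    fit actions ~= state @ W by least squares."""
    X = np.stack(states)                        # (N, state_dim)
    Y = np.stack([planner(s) for s in states])  # (N, act_dim)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return lambda s: s @ W                      # cheap per-step evaluation
```

In practice one would use the same network architectures as the paper's supervised learners rather than a linear map; the point is only that the distilled policy removes the per-step planning cost.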

5.5. MBOP PARAMETERS

All parameters were set as follows, except for the D4RL Walker task where we use 15 ensemble networks.
• # FC Layers : 2



This approach is often called Model-Based Reinforcement Learning (MBRL) in the literature, but we chose to talk more generally about planning with learned models as the presence of a reward is not fundamentally necessary and the notion of reinforcement is much less present.




Figure 1: Performance of MBOP on various RLU and D4RL datasets. For each of the above tasks we have sub-sampled subsets of the original dataset to obtain the desired number of data points. The subsets are the same throughout the paper. The box plots describe the interquartile range of the dataset, with the whiskers extending out to the full distribution and outliers plotted individually, following standard Seaborn conventions.

Visualized trajectories for constrained Cartpole.

Figure 2: The above figures describe performance of MBOP on constrained & goal-conditioned tasks.

Figure 3: (a) MBOP ablations' performance on the RLU Walker dataset. We observe that MBOP is consistently more performant than its ablations. (b) MBOP sensitivity to Kappa (κ) and Horizon (H).

Figure 5: MBOP performance on D4RL tasks.

Figure 7: MBOP sensitivity to Beta & Horizon on RLU datasets. Panels: (a) sensitivity to Beta parameter on RLU / Quadruped; (b) sensitivity to Beta parameter on RLU / Walker; sensitivity to Horizon parameter on RLU / Walker; (f) sensitivity to Horizon parameter on RLU / Cartpole.

T_t ← MBOP-Trajopt(T_{t-1}, s_t, f_m, f_b, f_R): update the planned trajectory T_t, starting at T_0.
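This algorithm line is the core of a receding-horizon controller: each control step warm-starts the trajectory optimizer with the previous plan, executes only the first planned action, then re-plans from the resulting state. A minimal sketch, with `trajopt` and `step_env` as hypothetical stand-ins for MBOP-Trajopt and the real system:

```python
def control_loop(s0, step_env, trajopt, T0, n_steps):
    """Receding-horizon loop: warm-start each trajopt call with the
    previous plan T, execute only the first action, then re-plan."""
    T, s = T0, s0
    for _ in range(n_steps):
        T = trajopt(T, s)      # T_t = Trajopt(T_{t-1}, s_t, ...)
        s = step_env(s, T[0])  # execute first planned action on the system
    return s
```

Warm-starting with T_{t-1} is what lets short per-step optimization budgets remain effective, since consecutive plans differ only slightly.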

Results for MBOP on D4RL tasks compared to MOPO.

D4RL Door Performance

D4RL HalfCheetah Performance

MBOP performance on RLU / Quadruped with various filtering thresholds for top episodes

5.1. MBOP PERTINENCE TO ROBOTICS

MBOP provides a general model-based approach for offline learning. We have considered only physics-bound tasks in this paper, as the underlying methods (MPC, MPPI) are known to work well on real systems (Nagabandi et al., 2020; Williams et al., 2015; Kahn et al., 2020). Although this paper does not implement MBOP on actual robots, this is upcoming work, and we believe that by having shown MBOP's performance over 6 different environments (cartpole, walker, quadruped, Adroit, halfcheetah, hopper) involving under-actuated control, locomotion, and manipulation, MBOP's potential applicability to real systems is promising. More specifically, we believe MBOP provides several key contributions of particular interest to the robotics community:
• Ability to learn entirely offline without a simulator.
• Ability to constrain policy operation.
• Ability to completely rephrase the policy's goal according to an arbitrary cost function.
These aspects make MBOP a unique contribution that potentially opens a series of interesting research questions around zero-shot adaptation, leveraging behavior priors, using sub-optimal models, leveraging uncertainty, and more generally exploring the additional control opportunities provided by model-based methods that are much more difficult with model-free learnt controllers. As mentioned above, it is our intent to quickly try out MBOP on various robotic systems. If results are available by the time of CoRL 2020 they will be presented as well.
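Constraining policy operation or rephrasing the goal amounts to editing the scoring function the planner applies to model rollouts, with no retraining. A hedged sketch of one such scoring function (the penalty form and the `height_of`, `h_max`, `penalty` names are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def scored_return(rewards, states, height_of, h_max, penalty=100.0):
    """Score one model rollout as predicted return minus a large penalty
    for every rolled-out state violating a constraint (here: a torso
    height that must stay below h_max, as in the constrained Walker)."""
    ret = float(np.sum(rewards))
    violations = sum(1 for s in states if height_of(s) > h_max)
    return ret - penalty * violations
```

The planner then simply prefers trajectories that keep predicted states inside the feasible region; swapping in a distance-to-goal cost instead of the logged reward gives the zero-shot goal-conditioned variant.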

5.2. PERFORMANCE OF MBOP ABLATIONS AND ASSOCIATED HYPERPARAMETERS

We present mean evaluation performance and associated hyperparameters for runs of MBOP and its ablations in a set of tables. For RLU:
• # Ensemble Networks : 3
• Learning Rate : 0.001
• Batch Size : 512
• # Epochs : 40
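At rollout time, the predictions of the K independently-initialized ensemble networks (K = 3 for RLU above) can be combined by simple averaging. A minimal sketch, where each `model` is any callable mapping an input to a prediction (how MBOP combines ensemble members per step is simplified here):

```python
import numpy as np

def ensemble_predict(models, x):
    """Average predictions over an ensemble of independently trained
    models; disagreement between members can also serve as a crude
    uncertainty signal."""
    preds = np.stack([m(x) for m in models])  # (K, ...) stacked outputs
    return preds.mean(axis=0)
```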

5.3. CONTINUED ANALYSIS OF CONSTRAINED TASKS

We can see the height-constrained Walker performance in Figure 6a. MBOP is able to satisfy the height constraint for 80% of the episode while maintaining reasonable performance. Across the various ablations, we have found that MBOP is better able to maintain base-task performance for similar constraint-satisfaction rates. (a) This figure describes the performance of MBOP on RLU Walker when constrained to stay below a height threshold. We see that MBOP is able to increase the constraint-satisfaction rate compared to the behavior policy while maintaining similar episode returns.

5.4. HYPERPARAMETER STABILITY

Figure 7 shows the sensitivity of MBOP and associated ablations to the Beta and Horizon parameters. Figure 8 shows the effects of Sigma on MBOP and ablations on the RLU datasets. Figure 6b shows sensitivity to Horizon and Kappa jointly.

5.6. IMPACT OF FILTERING POOR EPISODES

As mentioned earlier in the paper, for RLU / Quadruped and RLU / Walker we exclude the episodes with the lowest returns before training the behavior-cloning and value-function models. In this section we report performance on these environments with various filtering thresholds. For each of these two environments, and each of the dataset sizes, we keep a subset of the initial dataset by filtering on the top episodes. We experiment with filters varying from the top-1% to the top-100% (i.e. the entire raw dataset).
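The filtering step itself is straightforward: rank episodes by their return and keep only the top fraction before fitting the behavior-cloning and value models. A sketch under those assumptions (function and argument names are our own):

```python
def filter_top_episodes(episodes, returns, top_frac):
    """Keep the top `top_frac` fraction of episodes by return, as used
    before training the behavior-cloning and value-function models
    (the paper sweeps top_frac from 0.01 to 1.0)."""
    k = max(1, int(round(top_frac * len(episodes))))
    order = sorted(range(len(episodes)), key=lambda i: returns[i], reverse=True)
    return [episodes[i] for i in order[:k]]
```

With top_frac = 1.0 this reduces to training on the entire raw dataset, which is the baseline the sweep compares against.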

