FINE-TUNING OFFLINE REINFORCEMENT LEARNING WITH MODEL-BASED POLICY OPTIMIZATION

Abstract

In offline reinforcement learning (RL), we attempt to learn a control policy from a fixed dataset of environment interactions. This setting has the potential benefit of allowing us to learn effective policies without needing to collect additional interactive data, which can be expensive or dangerous in real-world systems. However, traditional off-policy RL methods tend to perform poorly in this setting due to the distributional shift between the fixed dataset and the learned policy. In particular, they tend to extrapolate optimistically and overestimate the action-values outside of the dataset distribution. Recently, two major avenues have been explored to address this issue. First, behavior-regularized methods penalize actions that deviate from the demonstrated action distribution. Second, uncertainty-aware model-based (MB) methods discourage state-action pairs where the dynamics are uncertain. In this work, we propose an algorithmic framework that consists of two stages. In the first stage, we train a policy using behavior-regularized model-free RL on the offline dataset. In the second stage, we fine-tune the policy using our novel Model-Based Behavior-Regularized Policy Optimization (MB2PO) algorithm. We demonstrate that for certain tasks and dataset distributions our conservative model-based fine-tuning can greatly increase performance and allow the agent to generalize and outperform the demonstrated behavior. We evaluate our method on a variety of the Gym-MuJoCo tasks in the D4RL benchmark and demonstrate that it is competitive with, and in some cases superior to, the state of the art on most of the evaluated tasks.

1. INTRODUCTION

Deep reinforcement learning has recently been able to achieve impressive results in a variety of video games (Badia et al., 2020) and board games (Schrittwieser et al., 2020). However, it has had limited success in complicated real-world tasks. In contrast, deep supervised learning algorithms have been achieving extraordinary success in scaling to difficult real-world datasets and tasks, especially in computer vision (Deng et al., 2009) and NLP (Rajpurkar et al., 2016). The success of supervised learning algorithms can be attributed to the combination of deep neural networks and methods that can effectively scale with large corpora of varied data. The previous successes of deep RL (Levine, 2016; Schrittwieser et al., 2020) seem to indicate that reinforcement learning can potentially scale with large active data exploration to solve specific tasks. However, collecting such large datasets online seems infeasible in many real-world applications such as automated driving or robot-assisted surgery, due to the difficulty and inherent risks in collecting online exploratory data with an imperfect agent. Existing off-policy RL algorithms can potentially leverage large, previously collected datasets, but they often struggle to learn effective policies without collecting their own online exploratory data (Agarwal et al., 2020). These failures are often attributed to the Q-function poorly extrapolating to out-of-distribution actions, which leads to overly optimistic agents that largely overestimate the values of unseen actions. Because we train Q-functions using bootstrapping, these errors often compound and lead to divergent Q-functions and unstable policy learning (Kumar et al., 2019). Recently, a variety of offline RL approaches have attempted to address these issues. Broadly, we group these approaches into two main categories based on how they address the extrapolation issue.
The first set of approaches (Wu et al., 2019; Kumar et al., 2019) relies on behavior-regularization to limit the learned policy's divergence from the perceived behavioral policy that collected the data. These approaches discourage the agent from considering out-of-distribution actions in order to avoid erroneous extrapolation. While these methods can often be effective when given some amount of expert demonstrations, they are often too conservative and rarely outperform the best demonstrated behavior. The second set of approaches (Yu et al., 2020; Kidambi et al., 2020) leverages uncertainty-aware MB RL to learn a policy that is discouraged from taking state-action transitions where the learned model has low confidence. Thus, these methods allow a certain degree of extrapolation where the models are confident. Because these methods tend to be less restrictive, they can generalize better than behavior-regularization methods and sometimes outperform the behavioral dataset. However, this flexibility also seems to make it harder for these methods to recover the expert policy when it is present in the dataset, and reduces their effectiveness when trained on a narrow distribution. In this work, we develop an algorithmic framework that combines ideas from behavior-regularization and uncertainty-aware model-based learning. Specifically, we first train a policy using behavior-regularized model-free RL. Then, we fine-tune our results with our novel algorithm Model-Based Behavior-Regularized Policy Optimization (MB2PO). We find that our approach is able to combine the strengths of these approaches and achieve competitive or superior results on most of the Gym-MuJoCo (Todorov et al., 2012) tasks in the D4RL (Fu et al., 2020) benchmark.
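The uncertainty-aware model-based approaches mentioned above typically subtract an uncertainty penalty from the model-predicted reward, so that the policy is steered away from transitions where the learned dynamics are unreliable. The sketch below illustrates this idea under one common assumption: uncertainty is estimated as the disagreement among an ensemble of dynamics models (the function name, penalty form, and coefficient `lam` are illustrative, not the exact formulation used by any cited method).

```python
import numpy as np

def penalized_reward(reward, next_state_preds, lam=1.0):
    """Penalize a model-predicted reward by ensemble disagreement.

    reward: scalar reward predicted for (s, a)
    next_state_preds: array of shape (ensemble_size, state_dim), each
        dynamics model's predicted next state for the same (s, a)
    lam: penalty coefficient trading off return against model confidence
    """
    # Uncertainty proxy u(s, a): largest deviation of any ensemble member
    # from the ensemble mean prediction.
    mean_pred = next_state_preds.mean(axis=0)
    u = np.max(np.linalg.norm(next_state_preds - mean_pred, axis=1))
    return reward - lam * u

# Where the ensemble agrees, the penalty is small...
preds_agree = np.array([[1.0, 2.0], [1.0, 2.0], [1.01, 2.0]])
# ...and where it disagrees, the effective reward drops sharply.
preds_disagree = np.array([[1.0, 2.0], [3.0, 0.0], [-1.0, 4.0]])

r_conf = penalized_reward(1.0, preds_agree)
r_unc = penalized_reward(1.0, preds_disagree)
```

Training on these penalized rewards makes the policy prefer regions of state-action space where the model ensemble is confident, which is the mechanism that permits limited extrapolation beyond the dataset.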

2. RELATED WORK

While there exist many off-policy RL methods that can learn to solve a large variety of complex control tasks and can scale with large amounts of online data collection, these methods often perform quite poorly when run completely offline without any online data collection. Recently, there have been several methods that made progress in improving the capabilities of offline RL. For a general overview of the field of offline RL, we refer the reader to Levine et al. (2020) . Here we will discuss some recent works that are particularly relevant to our approach.

2.1. IMPROVING OFF-POLICY Q-LEARNING

Many of the recent advances in both discrete and continuous action off-policy deep RL can be attributed to improvements in stabilizing off-policy Q-learning and reducing overestimation due to erroneous extrapolation. Some notable methods include target networks (Mnih et al., 2013), double Q-learning (DDQN) (van Hasselt et al., 2015), distributional RL (Bellemare et al., 2017; Dabney et al., 2017), and variance reduction through invertible transforms (Pohlen et al., 2018). For continuous control, Fujimoto et al. (2018) introduced a conservative method that uses the minimum estimate of an ensemble of Q-networks as the target, which is often referred to as clipped double-Q-learning. Agarwal et al. (2020) demonstrated that Quantile Regression DDQN (Dabney et al., 2017) and other ensemble methods can be effective in certain discrete action offline RL problems. However, Agarwal et al. (2020) showed that when used naively, these methods do not perform well on complex continuous control tasks. In our work, we incorporate the mentioned advances in off-policy Q-learning into our approach to stabilize performance and prevent potential divergence. Additionally, the offline RL algorithm Conservative Q-learning (CQL) (Kumar et al., 2020) addresses Q-learning's overestimation issue on offline data directly by including a constraint term that discourages the agent from valuing an out-of-distribution action more than the demonstrated actions. In our method, instead of constraining the Q-values, we use a combination of behavior-regularized model-free RL and uncertainty-aware model-based RL to discourage erroneous extrapolation.
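The clipped double-Q-learning target described above can be sketched in a few lines: each Bellman target takes the minimum over an ensemble of Q-estimates at the next state-action, yielding a pessimistic value that counteracts overestimation (the function below is a simplified illustration with NumPy arrays standing in for Q-network outputs, not the exact implementation of Fujimoto et al. (2018)).

```python
import numpy as np

def clipped_double_q_target(rewards, dones, q_next_values, gamma=0.99):
    """Bellman target using the minimum over an ensemble of Q-estimates.

    rewards: array of shape (batch_size,)
    dones: array of shape (batch_size,), 1.0 for terminal transitions
    q_next_values: array of shape (ensemble_size, batch_size), each
        Q-network's estimate of Q(s', a') at the next state-action
    """
    # Minimum over the ensemble: a pessimistic target that counteracts
    # overestimation from erroneous extrapolation.
    min_q = np.min(q_next_values, axis=0)
    return rewards + gamma * (1.0 - dones) * min_q

rewards = np.array([1.0, 0.5])
dones = np.array([0.0, 1.0])
q_next = np.array([[10.0, 4.0],   # Q-network 1's next-state estimates
                   [ 8.0, 6.0]])  # Q-network 2's next-state estimates
targets = clipped_double_q_target(rewards, dones, q_next)
# First transition bootstraps from min(10, 8) = 8; the second is
# terminal, so its target is just the reward.
```

With two networks this reduces to the familiar `min(Q1, Q2)` target; larger ensembles make the target more conservative at the cost of extra computation.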

2.2. BEHAVIOR-REGULARIZED MODEL-FREE RL

A variety of recent offline RL approaches have incorporated constraints or penalties on the learned policy's divergence from the empirical behavioral policy. In particular, recent works have used both the KL divergence (Wu et al., 2019) and the maximum mean discrepancy (MMD) (Kumar et al., 2019).
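As an illustration of the MMD penalty, the sketch below computes a (biased, V-statistic) estimate of the squared MMD between a batch of policy actions and a batch of dataset actions using a Gaussian kernel; this quantity grows as the policy drifts from the demonstrated action distribution and can be added to the policy loss as a regularizer. The kernel choice, bandwidth `sigma`, and estimator form are illustrative assumptions, not the exact configuration of Kumar et al. (2019).

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # Pairwise Gaussian (RBF) kernel between two sets of action samples,
    # shapes (n, action_dim) and (m, action_dim) -> (n, m).
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd_squared(policy_actions, data_actions, sigma=1.0):
    """Biased estimate of the squared MMD between two action batches."""
    k_pp = gaussian_kernel(policy_actions, policy_actions, sigma).mean()
    k_dd = gaussian_kernel(data_actions, data_actions, sigma).mean()
    k_pd = gaussian_kernel(policy_actions, data_actions, sigma).mean()
    return k_pp + k_dd - 2.0 * k_pd

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=(64, 2))   # dataset actions
near = rng.normal(0.0, 1.0, size=(64, 2))   # policy close to the data
far = rng.normal(3.0, 1.0, size=(64, 2))    # policy far from the data
```

Here `mmd_squared(near, data)` stays close to zero while `mmd_squared(far, data)` is large, so penalizing this quantity keeps the learned policy near the support of the dataset without requiring an explicit density model of the behavioral policy.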

