ADDRESSING DISTRIBUTION SHIFT IN OFFLINE-TO-ONLINE REINFORCEMENT LEARNING

Abstract

Recent progress in offline reinforcement learning (RL) has made it possible to train strong RL agents from previously collected, static datasets. However, depending on the quality of the trained agents and the application being considered, it is often desirable to improve such offline RL agents with further online interaction. As it turns out, fine-tuning offline RL agents is a non-trivial challenge, due to distribution shift: the agent encounters out-of-distribution samples during online interaction, which may cause bootstrapping error in Q-learning and instability during fine-tuning. To address this issue, we present a simple yet effective framework that incorporates a balanced replay scheme and an ensemble distillation scheme. First, we propose to keep separate offline and online replay buffers, and to carefully balance the number of samples drawn from each buffer during updates. By utilizing samples from a wider distribution, i.e., both online and offline samples, we stabilize Q-learning. Next, we present an ensemble distillation scheme, where we train an ensemble of independent actor-critic agents, then distill the policies into a single policy. In turn, we improve the policy using the Q-ensemble during fine-tuning, which makes the policy updates more robust to error in each individual Q-function. We demonstrate the superiority of our method on MuJoCo datasets from the recently proposed D4RL benchmark suite.

1. INTRODUCTION

Offline reinforcement learning (RL), the task of training a sequential decision-making agent from a static offline dataset, holds the promise of a data-driven approach to reinforcement learning, bypassing the laborious and often dangerous process of sample collection (Levine et al., 2020). Accordingly, various offline RL methods have been developed, some of which are capable of training agents more performant than the behavior policy (Fujimoto et al., 2019; Kumar et al., 2019; Wu et al., 2019; Siegel et al., 2020; Agarwal et al., 2020; Kidambi et al., 2020; Yu et al., 2020; Kumar et al., 2020). However, agents trained via offline RL methods may be suboptimal, for (a) the dataset they were trained on may only contain suboptimal data; and (b) the environment in which they are deployed may differ from the environment in which the dataset was generated. This necessitates an online fine-tuning procedure, where the agent improves by interacting with the environment and gathering additional information.

Fine-tuning an offline RL agent, however, poses certain challenges. For example, Nair et al. (2020) pointed out that offline RL algorithms based on modeling the behavior policy (Fujimoto et al., 2019; Kumar et al., 2019; Wu et al., 2019; Siegel et al., 2020) are not amenable to fine-tuning. This is because such methods require sampling actions from the modeled behavior policy when updating the agent, and fine-tuning such a generative model online for reliable sample generation is a challenging task. On the other hand, a more recent state-of-the-art offline RL algorithm, conservative Q-learning (CQL; Kumar et al., 2020), does not require explicit behavior modeling, and one might expect CQL to be amenable to fine-tuning.
However, we observe that fine-tuning a CQL agent is a non-trivial task, due to the so-called distribution shift problem: the agent encounters out-of-distribution samples, and in turn loses its good initial policy from offline RL training. This can be attributed to bootstrapping error, i.e., the error introduced when the Q-function is updated with an inaccurate target value evaluated at unfamiliar states and actions. Such initial training instability is a severe limitation, given that the appeal of offline RL lies in safe deployment at test time, and losing such safety guarantees directly conflicts with the goal of offline RL.

Contribution. In this paper, we first demonstrate that fine-tuning a CQL agent may lead to unstable training due to distribution shift (see Section 3 for more details). To handle this issue, we introduce a simple yet effective framework that incorporates a balanced replay scheme and an ensemble distillation scheme. Specifically, we propose to maintain two separate replay buffers, for offline and online samples, respectively. We then modulate the sampling ratio between the two, in order to balance the effects of (a) widening the data distribution the agent sees (offline data) and (b) exploiting the environment feedback (online data). Furthermore, we propose an ensemble distillation scheme: first, we learn an ensemble of independent CQL agents, then distill the multiple policies into a single policy. During fine-tuning, we improve the policy using the mean of the Q-functions, so that policy updates are more robust to error in each individual Q-function. In our experiments, we demonstrate the strength of our method on MuJoCo (Todorov et al., 2012) datasets from the D4RL (Fu et al., 2020) benchmark suite. Our goal is twofold: (a) strong initial performance, maintained throughout the initial training phase, and (b) better sample-efficiency.
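As a minimal illustration of the balanced replay idea, each minibatch can be drawn partly from the online buffer and partly from the offline buffer. The sketch below is simplified: the fixed `online_ratio` knob and all names are illustrative assumptions, not an exact specification of the scheme.

```python
import random

def sample_balanced(offline_buffer, online_buffer, batch_size, online_ratio):
    """Draw one minibatch mixing offline and online transitions.

    online_ratio in [0, 1] sets the fraction of the batch drawn from the
    online buffer; the remainder comes from the offline buffer.
    (Illustrative sketch: a fixed ratio is one simple way to modulate
    the balance between the two buffers.)
    """
    n_online = min(int(round(batch_size * online_ratio)), len(online_buffer))
    n_offline = batch_size - n_online
    batch = random.sample(online_buffer, n_online) + \
            random.sample(offline_buffer, n_offline)
    random.shuffle(batch)  # avoid ordering effects within the minibatch
    return batch
```

Raising `online_ratio` emphasizes environment feedback; lowering it keeps the Q-function anchored to the wider offline distribution.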
For evaluation, we measure the final performance and sample-efficiency of RL agents throughout the fine-tuning procedure. We demonstrate that our method achieves stable training while significantly outperforming all baseline methods considered, including BCQ (Fujimoto et al., 2019) and AWAC (Nair et al., 2020), in terms of both final performance and sample-efficiency.
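The ensemble component of the contribution, improving the policy against the mean of several Q-functions, can be sketched per-sample, assuming a SAC-style actor objective (introduced in Section 2); the function name and signature here are illustrative:

```python
def ensemble_actor_loss(q_values_per_critic, log_pi, alpha):
    """Per-sample actor objective using the mean over an ensemble of
    Q-estimates: alpha * log pi(a|s) - mean_i Q_i(s, a).
    Averaging makes the policy update more robust to error in any
    single Q-function. (Sketch; assumes a SAC-style entropy term.)"""
    q_mean = sum(q_values_per_critic) / len(q_values_per_critic)
    return alpha * log_pi - q_mean
```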

2. BACKGROUND

Reinforcement learning. We consider the standard RL framework, where an agent interacts with the environment so as to maximize the expected total return. More formally, at each timestep $t$, the agent observes a state $s_t$ and performs an action $a_t$ according to its policy $\pi$. The environment rewards the agent with $r_t$, then transitions to the next state $s_{t+1}$. The agent's objective is to maximize the expected return $\mathbb{E}_\pi\big[\sum_{k=0}^{\infty} \gamma^k r_k\big]$, where $\gamma \in [0, 1)$ is the discount factor. In this work, we mainly consider off-policy RL algorithms, a class of algorithms that can, in principle, train an agent with samples generated by any behavior policy. These algorithms are well-suited for fine-tuning a pretrained RL agent, for they can leverage both offline and online samples. Offline RL algorithms are off-policy RL algorithms that only utilize static datasets for training. Here, we introduce an off-policy RL algorithm and an offline RL algorithm we build on in this work.

Soft Actor-Critic. SAC (Haarnoja et al., 2018) is an off-policy actor-critic algorithm that learns a soft Q-function $Q_\theta(s, a)$ parameterized by $\theta$ and a stochastic policy $\pi_\phi$ modeled as a Gaussian with parameters $\phi$. To update the parameters $\theta$ and $\phi$, SAC alternates between soft policy evaluation and soft policy improvement. During soft policy evaluation, the soft Q-function parameter $\theta$ is updated to minimize the soft Bellman residual:
$$\mathcal{L}^{\text{SAC}}_{\text{critic}}(\theta) = \mathbb{E}_{\tau_t \sim \mathcal{B}}\big[\mathcal{L}^{\text{SAC}}_{Q}(\tau_t, \theta)\big], \quad \mathcal{L}^{\text{SAC}}_{Q}(\tau_t, \theta) = \Big(Q_\theta(s_t, a_t) - \big(r_t + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi_\phi}\big[Q_{\bar\theta}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\phi(a_{t+1}|s_{t+1})\big]\big)\Big)^2,$$
where $\tau_t = (s_t, a_t, r_t, s_{t+1})$ is a transition, $\mathcal{B}$ is the replay buffer, $\bar\theta$ is the moving average of the soft Q-function parameter $\theta$, and $\alpha$ is the temperature parameter. During soft policy improvement, the policy parameter $\phi$ is updated to minimize the following objective:
$$\mathcal{L}^{\text{SAC}}_{\text{actor}}(\phi) = \mathbb{E}_{s_t \sim \mathcal{B}}\big[\mathcal{L}^{\text{SAC}}_{\pi}(s_t, \phi)\big], \quad \mathcal{L}^{\text{SAC}}_{\pi}(s_t, \phi) = \mathbb{E}_{a_t \sim \pi_\phi}\big[\alpha \log \pi_\phi(a_t|s_t) - Q_\theta(s_t, a_t)\big].$$

Conservative Q-Learning.
CQL (Kumar et al., 2020) is an offline RL algorithm that learns a lower bound of the Q-function $Q_\theta(s, a)$, in order to prevent extrapolation error, i.e., value overestimation caused by bootstrapping from out-of-distribution actions. To this end, CQL($\mathcal{H}$), a variant of CQL, imposes a regularizer that minimizes the expected Q-value at unseen actions and maximizes the expected Q-value at seen actions, i.e., it minimizes the following objective:
$$\mathcal{L}^{\text{CQL}}_{\text{critic}}(\theta) = \tfrac{1}{2}\, \mathcal{L}^{\text{CQL}}_{Q}(\theta) + \alpha_{\text{CQL}}\, \mathcal{L}^{\text{CQL}}_{\text{reg}}(\theta), \quad \mathcal{L}^{\text{CQL}}_{Q}(\theta) = \mathbb{E}_{\tau_t \sim \mathcal{D}}\Big[\Big(Q_\theta(s_t, a_t) - \big(r_t + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi_\phi}\big[Q_{\bar\theta}(s_{t+1}, a_{t+1})\big]\big)\Big)^2\Big],$$
where $\mathcal{D}$ is the offline dataset and $\alpha_{\text{CQL}}$ is the regularization weight.
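To make the preceding objectives concrete, the following sketch computes the SAC soft Bellman residual and the CQL($\mathcal{H}$) regularizer per sample, in plain Python. It uses single samples in place of expectations and a finite action set for the regularizer (in continuous control, CQL approximates the log-sum-exp by sampling actions); all names are illustrative.

```python
import math

def soft_bellman_residual(q, q_target_next, log_pi_next, reward, gamma, alpha):
    """Squared soft Bellman residual (SAC critic loss) for one transition.

    q             : Q_theta(s_t, a_t)
    q_target_next : Q_theta_bar(s_{t+1}, a_{t+1}), with a_{t+1} ~ pi_phi
    log_pi_next   : log pi_phi(a_{t+1} | s_{t+1})
    """
    target = reward + gamma * (q_target_next - alpha * log_pi_next)
    return (q - target) ** 2

def cql_h_regularizer(q_all_actions, q_data_action):
    """Per-state CQL(H) regularizer, sketched for a finite action set:
    log-sum-exp of Q over all actions (pushed down) minus the Q-value
    of the dataset action (pushed up). Max-shifted for numerical
    stability."""
    m = max(q_all_actions)
    lse = m + math.log(sum(math.exp(q - m) for q in q_all_actions))
    return lse - q_data_action
```

Note that the CQL critic target omits the entropy term present in the SAC residual, matching the objective above, and that the regularizer grows when any unseen action's Q-value is inflated, which is exactly the overestimation it penalizes.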