FINE-TUNING OFFLINE POLICIES WITH OPTIMISTIC ACTION SELECTION

Abstract

Offline reinforcement learning algorithms can train performant policies for hard tasks using previously-collected datasets. However, the quality of the offline dataset often limits the performance that can be achieved. We consider the problem of improving offline policies through online fine-tuning. Offline RL requires a pessimistic training objective to mitigate distributional shift between the trained policy and the offline behavior policy, which makes the trained policy averse to picking novel actions. In contrast, online RL requires exploration, or optimism. Thus, fine-tuning with the offline training objective is not ideal. Additionally, loosening the fine-tuning objective to allow for more exploration can potentially destroy the behaviors learned in the offline phase, because of the sudden and significant change in the optimization objective. To mitigate this challenge, we propose a method that maintains the same training objective throughout both the offline and online phases, while encouraging exploration during online fine-tuning. We accomplish this by changing the action-selection method to be more optimistic with respect to the Q-function. By taking actions in the environment with higher expected Q-values, our method is able to explore and improve behaviors more efficiently, obtaining 56% more returns on average than the alternative approaches on several locomotion, navigation, and manipulation tasks.

Offline meta reinforcement learning: Offline meta RL is related to online fine-tuning, since it deals with efficient online adaptation using an offline dataset. However, the meta-RL setting assumes access to a multi-task offline dataset and trains an agent to quickly adapt to new tasks (Mitchell et al., 2021; Dorfman et al., 2021; Pong et al., 2022). As opposed to offline meta RL, our work (and offline-to-online fine-tuning in general) does not assume access to multi-task data, and deals with adaptation and improvement in the context of the same offline task.

1. INTRODUCTION

Offline reinforcement learning (RL) algorithms show strong performance on many tasks even when the available data contains sub-optimal behaviors, such as random exploration or another RL agent's behavior. However, the quality of the offline data can limit the performance of an offline RL agent, and in these cases, further online fine-tuning can help improve the policy using additional environment interactions. A key challenge of online fine-tuning is that the training mechanisms designed for the offline setting are not effective at improving the policy online: offline learning requires a conservative objective to mitigate distributional shift between the learned policy and the behavior policy. However, during the online fine-tuning phase, the conservative objective causes the agent to keep executing the same behaviors, preventing the exploration needed to further improve the policy. In this paper, we aim to mitigate these challenges and develop a more effective method for offline-to-online RL.

This challenge of misaligned priorities between the offline and online training phases could in principle be tackled by switching training objectives when the online phase starts. For example, a policy trained with an offline RL algorithm such as CQL (Kumar et al., 2020) could be fine-tuned using an off-policy online algorithm designed for efficient exploration, such as SAC (Haarnoja et al., 2018). However, significantly changing the training objective between the offline and online phases can cause instability in performance, potentially destroying the behaviors learned during the offline phase (Nair et al., 2020). It can also exacerbate distributional shift between the offline data and the learned policy (Lee et al., 2022), since the more exploratory training objective pushes the policy to deviate even further from the offline behavior policy, resulting in unsatisfactory fine-tuning performance.
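To make the notion of a conservative objective concrete, below is a minimal sketch of a CQL-style regularizer for a discrete action set; it would be added to the standard TD loss, penalizing large Q-values across all actions while raising the Q-value of the action observed in the data. The function name and interface are illustrative assumptions, not taken from any particular codebase.

```python
import numpy as np

def cql_penalty(q_all_actions, q_data_action, alpha=1.0):
    """CQL-style conservatism sketch (after Kumar et al., 2020):
    push down Q-values over all actions via a log-sum-exp term,
    while pushing up the Q-value of the action seen in the dataset."""
    logsumexp = np.log(np.sum(np.exp(q_all_actions)))
    return alpha * (logsumexp - q_data_action)
```

Minimizing this penalty alongside the TD loss keeps the learned Q-function (and hence the policy) close to the data distribution, which is precisely the conservatism that hinders exploration once the online phase begins.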
The problem of online fine-tuning from offline training presents an impasse: on the one hand, online fine-tuning requires efficient exploration, which is not feasible under the conservative offline training objective; on the other hand, changing the objective might destroy the behaviors learned during offline training. A key insight of our method is that we can collect optimistic data without changing the training objective. To collect such exploratory data, we use the knowledge embedded in the Q-function to direct exploration, i.e., we select actions that are estimated to be better than those proposed by the policy. Concretely, during online fine-tuning, we execute actions with high estimated Q-values, thus providing exploratory data that can be used to improve the policy, while applying the same offline objective to this new data.

The key contribution of this paper is an exploration technique that allows for stable and efficient offline-to-online fine-tuning. Our approach is simple, as it only modifies the action-selection process, and can in principle be implemented on top of any offline learning algorithm that does not explicitly penalize the Q-values of out-of-distribution state-action pairs. We show results built on top of LAPO (Chen et al., 2022) and IQL (Kostrikov et al., 2021) on several locomotion, navigation, and manipulation tasks, and find that agents using this action-selection mechanism obtain on average 56% more returns than the next best prior method.
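As a rough illustration of this kind of action-selection mechanism, the sketch below draws several candidate actions from the policy and executes the one the Q-function scores highest. The `sample_policy_action` and `q_value` callables are hypothetical placeholders standing in for a trained policy and Q-function, not the paper's actual implementation.

```python
import numpy as np

def optimistic_action(state, sample_policy_action, q_value, num_candidates=32):
    """Optimistic action selection sketch: among candidate actions drawn
    from the policy, return the one with the highest estimated Q-value."""
    candidates = [sample_policy_action(state) for _ in range(num_candidates)]
    scores = [q_value(state, action) for action in candidates]
    return candidates[int(np.argmax(scores))]
```

Only the online data-collection step would swap in `optimistic_action`; the training objective itself, applied to the resulting transitions, stays exactly as in the offline phase.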

2. RELATED WORK

Exploration mechanisms: One simple technique for exploration is to apply noise or randomness to the agent's behavior (Mnih et al., 2015; Lillicrap et al., 2015; Schulman et al., 2015). Other approaches explore in a more targeted way through novelty seeking (Houthooft et al., 2016; Bellemare et al., 2016; Pathak et al., 2017; Ostrovski et al., 2017; Fu et al., 2017a; Burda et al., 2018a;b) and entropy regularization (Pong et al., 2019; Eysenbach et al., 2018; Florensa et al., 2017; Gregor et al., 2016). These prior exploration methods are designed for online RL and are ill-suited for offline-to-online fine-tuning on their own, as unconstrained exploration can negatively impact offline training. Our exploration mechanism instead modifies the action-selection mechanism to choose actions more optimistically with respect to a Q-function pre-trained with offline RL.

Offline to online fine-tuning: The problem of leveraging offline data during online RL has been widely studied, especially when the offline data corresponds to demonstrations. Prior work has proposed a variety of approaches for this setting, ranging from online inverse RL and imitation learning (Lu et al., 2022; Ziebart et al., 2008; Finn et al., 2016; Fu et al., 2017b; Ho & Ermon, 2016; Kostrikov et al., 2018) to using the demonstrations to initialize the policy or replay buffer (Peters & Schaal, 2008; Vecerik et al., 2017; Hester et al., 2018; Rajeswaran et al., 2018; Zhu et al., 2018a; Gupta et al., 2019; Zhu et al., 2018b; Kober & Peters, 2009). We consider a more general case, where the offline data may be of low or mixed quality. The simplest approach for this general offline-to-online setting is to fine-tune online using the same offline objective (Kumar et al., 2020; Kostrikov et al., 2021; Nair et al., 2020; Lyu et al., 2022); however, this approach may be too conservative during the online phase, resulting in too little exploration and plateauing performance. Instead of maintaining the same offline training objective during the online phase, Wu et al. (2022) propose to gradually make the training objective less conservative. As we empirically show, this approach does not significantly improve exploration for the considered baselines. Lee et al. (2022) instead target the over-conservatism of using an offline training objective during the online phase with a prioritized replay buffer that prefers more on-policy samples. This approach could in principle be combined with our method for enhanced exploration, but we leave this direction for future work. Unlike these past works, we develop a simple and sample-efficient fine-tuning approach that effectively trades off conservatism in the offline phase against optimism in the online phase, without sacrificing stability.

