FINE-TUNING OFFLINE POLICIES WITH OPTIMISTIC ACTION SELECTION

Abstract

Offline reinforcement learning algorithms can train performant policies for hard tasks using previously-collected datasets. However, the quality of the offline dataset often limits the attainable performance. We consider the problem of improving offline policies through online fine-tuning. Offline RL requires a pessimistic training objective to mitigate distributional shift between the trained policy and the offline behavior policy, which makes the trained policy averse to picking novel actions. In contrast, online RL requires exploration, or optimism. Fine-tuning offline policies with the same pessimistic objective is therefore not ideal; yet loosening the objective to allow more exploration can destroy the behaviors learned in the offline phase because of the sudden and significant change in the optimization objective. To mitigate this challenge, we propose a method that maintains the same training objective throughout the offline and online phases while still encouraging exploration during online fine-tuning. We accomplish this by making the action-selection method more optimistic with respect to the Q-function. By taking actions with higher expected Q-values in the environment, our method explores and improves behaviors more efficiently, obtaining 56% higher returns on average than alternative approaches on several locomotion, navigation, and manipulation tasks.
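The optimistic action-selection idea from the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the helper names `policy_sample` and `q_value`, and the fixed candidate count, are assumptions introduced here (the actual method may, for instance, use an upper-confidence estimate over an ensemble of Q-functions).

```python
def optimistic_action(policy_sample, q_value, state, n_candidates=10):
    """Pick an action optimistically with respect to the Q-function.

    Samples several candidate actions from the (conservatively trained)
    policy, then acts greedily among them using the Q-function, so the
    training objective itself is left unchanged.
    """
    candidates = [policy_sample(state) for _ in range(n_candidates)]
    # Among the sampled candidates, choose the one the critic rates highest.
    return max(candidates, key=lambda a: q_value(state, a))
```

Because optimism is injected only at action-selection time, the offline training objective can be kept identical during fine-tuning, which is the point of the approach described above.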

1. INTRODUCTION

Offline reinforcement learning (RL) algorithms show strong performance on a range of tasks even when the available data contains sub-optimal behaviors, such as random exploration or the experience of another RL agent. However, the quality of the offline data can limit the performance of an offline RL agent, and in these cases further online fine-tuning can improve the policy using additional environment interactions. A key challenge in online fine-tuning is that the training mechanisms designed for the offline setting are not effective at improving the policy online: offline learning requires a conservative objective to mitigate distributional shift between the learned policy and the behavior policy, but during online fine-tuning this conservative objective causes the agent to keep repeating the same behaviors, preventing the exploration needed to further improve the policy. In this paper, we aim to mitigate these challenges and develop a more effective method for offline-to-online RL.
This challenge of misaligned priorities between the offline and online training phases could in principle be tackled by switching training objectives when the online phase starts. For example, a policy trained with an offline RL algorithm such as CQL (Kumar et al., 2020) could be fine-tuned using an off-policy online algorithm designed for efficient exploration, such as SAC (Haarnoja et al., 2018). However, significantly changing the training objective between the offline and online phases can destabilize performance, potentially destroying the behaviors learned during the offline phase (Nair et al., 2020). It can also exacerbate the distributional shift between the offline data and the learned policy (Lee et al., 2022), since the more exploratory objective pushes the policy even further from the offline behavior policy, resulting in unsatisfactory fine-tuning performance.
The problem of online fine-tuning from offline training presents an impasse: on the one hand, online fine-tuning requires efficient exploration, which is not feasible using the conservative offline training

