PSEUDOMETRIC GUIDED ONLINE QUERY AND UPDATE FOR OFFLINE REINFORCEMENT LEARNING

Abstract

Offline Reinforcement Learning (RL) extracts effective policies from historical data without the need to interact with the environment. However, the learned policy often suffers from large generalization errors in the online environment due to distributional shift. While existing work mostly focuses on learning a generalizable policy, we propose to adapt the learned policy to the online environment with a limited number of queries. The goals are to query reasonable actions with limited chances and to modify the policy efficiently. Our insight is to unify these two goals via a proper pseudometric. Intuitively, the metric can compare online and offline states to infer optimal query actions. Additionally, efficient policy updates require good knowledge of the similarity between query results and historical data. Therefore, we propose a unified framework, denoted Pseudometric Guided Offline-to-Online RL (PGO²). Specifically, in deep Q-learning, PGO² couples the Q-network with a Siamese network through a structural design, which guarantees simultaneous Q-network updating and pseudometric learning, promoting Q-network fine-tuning. In the inference phase, PGO² solves convex optimization problems to identify optimal query actions. We also show that PGO² training converges to the so-called bisimulation metric with strong theoretical guarantees. Finally, we demonstrate the superiority of PGO² on diverse datasets.

1. INTRODUCTION

Offline Reinforcement Learning (RL) leverages large historical datasets to learn a policy that seeks optimal sequential decision-making without any cost of interacting with the environment (Lange et al., 2012; Levine et al., 2020). This promising feature significantly promotes real-world RL applications, especially when explorative actions are costly. Thus, advances in offline RL have been made in robotic control (Lee et al., 2022), healthcare (Tang & Wiens, 2021), dialogue systems (Jaques et al., 2020), recommendation (Xiao & Wang, 2021), and E-commerce (Zhang et al., 2021), among others. Despite these achievements, the policy learned by offline RL may still suffer large extrapolation errors in online implementations (Dadashi et al., 2021; Fujimoto et al., 2019; Lee et al., 2022). The central reason is the data distributional shift between the offline and online environments. To address this issue, most solutions improve the algorithmic model, e.g., by adding constraints to the learning procedure (Fujimoto et al., 2019; Wang et al., 2020), designing ensemble models (Agarwal et al., 2020; Lee et al., 2022; An et al., 2021), or providing better value function estimation (Kumar et al., 2020; Dadashi et al., 2021; Rezaeifar et al., 2022). While these methods provide appealing results, further improvements can be made when the agent can obtain online data. A classic setting is off-policy RL (Levine et al., 2020; Munos et al., 2016), where a data buffer continuously memorizes online data to update the policy. In many cases, however, it is unacceptable to collect large numbers of online samples based on the offline policy or random exploration, because the poor generalization of the offline policy may cause high costs in online implementations, let alone random exploration. Therefore, we propose to actively produce query actions to the environment for online states.
Subsequently, the environment can evaluate the query and provide feedback (i.e., rewards), which facilitates adapting the offline policy. The general goal is to utilize limited queries to achieve efficient policy adaptation. More specifically, one needs to (1) produce proper query actions and (2) accurately modify the offline policy according to limited query results. Basically, the state distributional shift challenges
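As a toy illustration of goal (1), the following sketch selects a query action for an online state by nearest-neighbor search over offline states under a learned pseudometric. This is only an intuition-building example, not the PGO² algorithm itself; `embed` stands in for a hypothetical learned embedding in which Euclidean distance approximates the pseudometric.

```python
import numpy as np

def pseudometric_query(online_state, offline_states, offline_actions, embed):
    """Pick a query action for an online state (illustrative sketch).

    embed: hypothetical learned embedding such that Euclidean distance
    between embeddings approximates the pseudometric d(s, s').
    """
    z = embed(online_state)
    zs = np.stack([embed(s) for s in offline_states])
    # Pseudometric distance from the online state to every offline state.
    dists = np.linalg.norm(zs - z, axis=1)
    # Query the action recorded for the closest offline state.
    return offline_actions[int(np.argmin(dists))]

# Toy usage with an identity embedding on 2-D states.
offline_states = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
offline_actions = ["left", "right"]
action = pseudometric_query(np.array([0.9, 1.1]), offline_states,
                            offline_actions, embed=lambda s: s)
```

In the actual framework, the embedding would be trained jointly with the Q-network (via the Siamese structure) rather than fixed, and the query action would be obtained by solving a convex optimization rather than a discrete nearest-neighbor lookup.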

