PSEUDOMETRIC GUIDED ONLINE QUERY AND UPDATE FOR OFFLINE REINFORCEMENT LEARNING

Abstract

Offline Reinforcement Learning (RL) extracts effective policies from historical data without interacting with the environment. However, the learned policy often suffers from large generalization errors in the online environment due to distributional shift. While existing work mostly focuses on learning a generalizable policy, we propose to adapt the learned policy to the online environment with limited queries. The goals are to query reasonable actions with limited chances and to modify the policy efficiently. Our insight is to unify these two goals via a proper pseudometric. Intuitively, the metric can compare online and offline states to infer optimal query actions. Additionally, efficient policy updates require good knowledge of the similarity between query results and historical data. Therefore, we propose a unified framework, denoted Pseudometric Guided Offline-to-Online RL (PGO²). Specifically, in deep Q-learning, PGO² shares a structural design between the Q-network and a Siamese network, which guarantees simultaneous Q-network updating and pseudometric learning, promoting Q-network fine-tuning. In the inference phase, PGO² solves convex optimization problems to identify optimal query actions. We also show that PGO² training converges to the so-called bisimulation metric with strong theoretical guarantees. Finally, we demonstrate the superiority of PGO² on diverse datasets.

1. INTRODUCTION

Offline Reinforcement Learning (RL) leverages large-scale historical data to learn a policy that seeks optimal sequential decisions without the cost of interacting with the environment (Lange et al., 2012; Levine et al., 2020). This promising feature significantly promotes real-world RL applications, especially when explorative actions are costly. Thus, advances in offline RL have been made in robotic control (Lee et al., 2022), healthcare (Tang & Wiens, 2021), dialogue systems (Jaques et al., 2020), recommendation (Xiao & Wang, 2021), and E-commerce (Zhang et al., 2021). Despite these achievements, the learned policy in offline RL may still suffer from large extrapolation errors in online implementations (Dadashi et al., 2021; Fujimoto et al., 2019; Lee et al., 2022). The central reason is the distributional shift between the offline and online environments. To address this issue, most solutions improve the algorithmic model, e.g., by adding constraints to the learning procedure (Fujimoto et al., 2019; Wang et al., 2020), designing ensemble models (Agarwal et al., 2020; Lee et al., 2022; An et al., 2021), or providing better value function estimation (Kumar et al., 2020; Dadashi et al., 2021; Rezaeifar et al., 2022). While these methods provide appealing results, further improvements are possible when the agent can obtain online data. A classic setting is off-policy RL (Levine et al., 2020; Munos et al., 2016), where a data buffer continuously memorizes online data to update the policy. In many cases, however, it is unacceptable to collect large online samples with the offline policy, let alone random explorations, because the poor generalization of the offline policy may incur high online costs. Therefore, we propose to actively produce query actions for online states.
Subsequently, the environment can evaluate the query and provide feedback (i.e., rewards), which facilitates adapting the offline policy. The general goal is to utilize limited queries to achieve efficient policy adaptation. More specifically, one needs to (1) produce proper query actions and (2) accurately modify the offline policy according to limited query results. The state distributional shift challenges goal (1), while limited query results challenge goal (2): in deep Q-learning, for example, it is hard to adjust the huge parameter set of the Q-network with few samples. To address these issues, one principled approach is to find a proper similarity measure for state-action pairs in a Markov Decision Process (MDP). Then, for goal (1), we can infer query actions based on similar states and actions in the historical dataset. For goal (2), the differences between each online query and all historical samples provide rich information that can be intelligently employed to update the Q-network. In general, we need a good pseudometric (Dadashi et al., 2021) to link offline datasets with online queries. In particular, a well-defined pseudometric should guarantee that similar states have similar expected rewards. In an MDP, this notion is known as the bisimulation metric (Ferns & Precup, 2014; Dadashi et al., 2021). To approximate the bisimulation metric, Dadashi et al. (2021) developed a pseudometric learning model that takes in the historical data and outputs a pseudometric, a modified bisimulation metric on the state-action space. Although this model benefits goal (1), goal (2) remains unreachable because pseudometric learning and Q-network updating are decoupled. Therefore, we propose PGO²: Pseudometric Guided Offline-to-Online RL, a unified framework to learn the pseudometric, update the Q-network, and infer query actions.
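For reference, the classical bisimulation metric of Ferns & Precup (2014) is the fixed point of a recursion of roughly the following form (the notation here is our own shorthand; the exact state-action pseudometric used in this line of work differs in details):

```latex
d(s_1, s_2) \;=\; \max_{a \in \mathcal{A}} \Big( \big| r(s_1, a) - r(s_2, a) \big| \;+\; \gamma \, \mathcal{W}_1(d)\big( P(\cdot \mid s_1, a),\, P(\cdot \mid s_2, a) \big) \Big),
```

where $\mathcal{W}_1(d)$ denotes the 1-Wasserstein distance under ground metric $d$. This formalizes the requirement above: states at small distance $d$ receive close immediate rewards and transition to distributions over states that are themselves close under $d$.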
Specifically, PGO² employs Siamese networks to learn the similarity of state-action pairs (Bromley et al., 1993; Dadashi et al., 2021). More importantly, we restrict the Q-network and the Siamese network to share parameters. Consequently, learning the pseudometric between each query result and all historical data updates the parameters of the Q-network. This updating scheme extracts sufficient supervision from limited query results, enabling the offline policy to quickly adapt to the online environment. Moreover, the Q-network in PGO² is instantiated as a Partially Input Convex Neural Network (PICNN) (Amos et al., 2017a), so query actions can be inferred optimally by solving convex optimization problems. On the theoretical side, we prove convergence to the bisimulation metric and global optimality of the query process. Finally, we observe significant improvements of PGO² on multiple RL tasks under the offline-to-online setting.
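To make the shared-parameter idea concrete, the following minimal numpy sketch (layer sizes, names, and the Euclidean embedding distance are our illustrative assumptions, not the paper's exact architecture) shows a trunk encoder shared by a Q-head and the two Siamese branches, so that a pseudometric-learning gradient step necessarily moves the Q-function as well:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared encoder parameters (hypothetical sizes: 6-dim state, 2-dim action).
W_enc = rng.normal(size=(8, 16)) * 0.1   # maps concat(state, action) -> embedding
w_q = rng.normal(size=16) * 0.1          # Q-head on top of the shared embedding

def embed(s, a):
    """Shared trunk used by BOTH the Q-head and the Siamese branches."""
    x = np.concatenate([s, a])
    return np.tanh(x @ W_enc)

def q_value(s, a):
    """Scalar Q-value read off the shared embedding."""
    return embed(s, a) @ w_q

def pseudometric(s1, a1, s2, a2):
    """Siamese distance in embedding space. Any loss on this distance
    backpropagates into W_enc, and hence reshapes q_value too."""
    return np.linalg.norm(embed(s1, a1) - embed(s2, a2))
```

Because `W_enc` sits under both heads, fitting the pseudometric between one online query result and many offline samples yields many gradient signals for the Q-network, which is the mechanism behind the efficient update from few queries.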

2. RELATED WORK

Offline RL. In the Introduction, we reviewed models that optimize sequential decisions using offline data. In addition, some works view offline RL as a supervised sequence-generation problem. Janner et al. (2021) employ a trajectory Transformer to predict sequences of states, actions, and rewards; Chen et al. (2021) also build such sequences with a masked Transformer. Finally, Schweighofer et al. (2021) and Sinha et al. (2022) study how historical data impact the learned policy in offline RL. However, these studies do not provide a principled approach for online queries and updates.

Online Implementations to Improve Offline RL. We categorize existing studies into the following groups. The first group requires the historical data to be informative enough to abstract prior knowledge, e.g., skills (Pertsch et al., 2020), primitive behaviors (Ajay et al., 2020), and behavioral priors (Singh et al., 2020), for online implementations. However, high-quality offline data may not be available. The second group admits imperfect data and employs online data to update the learned policy. Nair et al. (2020) propose to restrict the Kullback-Leibler (KL) divergence of the policy for both offline and online learning, which requires a certain amount of online data. Lee et al. (2022) weigh the offline and online data by a density ratio so that the offline Q-network can be fine-tuned; again, enough online samples are required to estimate the density ratio accurately. In contrast, we show how a generalized pseudometric can guide the online action query and policy update, even with limited query opportunities.

State-action Similarity Metric for MDPs. The bisimulation metric uses rewards to determine the similarity of two states and/or actions (Ferns & Precup, 2014).
Another similarity notion is the MDP homomorphism, which considers both the reward and the transition probability (Ravindran & Barto, 2003). Many studies approximate these metrics (Castro, 2020; Dadashi et al., 2021; van der Pol et al., 2020). Our pseudometric learning follows Dadashi et al. (2021) in approximating the bisimulation metric on the state-action space with theoretical guarantees. However, our unique structural design with convex optimization infers optimal queries and conducts sufficient policy updates.
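To illustrate why an input-convex parameterization makes query inference a tractable optimization, the sketch below (sizes, the action box constraint, and the numerical-gradient solver are our assumptions; Amos et al. (2017) give the general PICNN construction) builds a scalar network that is convex in the action by keeping the weights applied to hidden activations non-negative, then recovers an action by projected gradient descent. With Q(s, a) = -f(s, a), minimizing the convex f maximizes Q:

```python
import numpy as np

rng = np.random.default_rng(1)

S, A, H = 4, 2, 8  # illustrative state, action, and hidden sizes

Ws0 = rng.normal(size=(H, S)) * 0.5
Wa0 = rng.normal(size=(H, A)) * 0.5
Wz1 = np.abs(rng.normal(size=(1, H)))   # non-negative: preserves convexity in a
Wa1 = rng.normal(size=(1, A)) * 0.5

def f(s, a):
    """PICNN-style scalar output, convex in `a` for fixed `s`:
    ReLU is convex and nondecreasing, and Wz1 >= 0 keeps the composition convex."""
    z1 = np.maximum(Ws0 @ s + Wa0 @ a, 0.0)
    return float(Wz1 @ z1 + Wa1 @ a)

def query_action(s, steps=200, lr=0.1, box=1.0):
    """Projected gradient descent on the convex objective f(s, .), i.e.,
    approximate maximization of Q(s, a) = -f(s, a) over an action box."""
    a = np.zeros(A)
    for _ in range(steps):
        g = np.zeros(A)                        # central-difference gradient (illustration)
        for i in range(A):
            e = np.zeros(A); e[i] = 1e-5
            g[i] = (f(s, a + e) - f(s, a - e)) / 2e-5
        a = np.clip(a - lr * g, -box, box)     # project back onto the box
    return a
```

Since the objective is convex in the action, any first-order method with projection converges to the global optimum of the query problem; a practical implementation would use exact gradients from an autodiff framework rather than finite differences.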

