IN-SAMPLE ACTOR CRITIC FOR OFFLINE REINFORCEMENT LEARNING

Abstract

Offline reinforcement learning suffers from the out-of-distribution issue and extrapolation error. Most methods penalize out-of-distribution state-action pairs or regularize the trained policy toward the behavior policy, but cannot guarantee the elimination of extrapolation error. We propose In-sample Actor Critic (IAC), which uses sampling-importance resampling to perform in-sample policy evaluation. IAC evaluates the trained policy using only the target Q-values of actions in the dataset, thus avoiding extrapolation error. The proposed method performs unbiased policy evaluation and, in many cases, has lower variance than importance sampling. Empirical results show that IAC achieves performance competitive with state-of-the-art methods on the Gym-MuJoCo locomotion domains and the much more challenging AntMaze domains.

1. INTRODUCTION

Reinforcement learning (RL) aims to solve sequential decision problems and has received extensive attention in recent years (Mnih et al., 2015). However, practical applications of RL face several challenges, such as risky attempts during exploration and a time-consuming data collection phase. Offline RL can tackle these issues without interacting with the environment: it avoids unsafe exploration and can tap into existing large-scale datasets (Gulcehre et al., 2020). However, offline RL suffers from the out-of-distribution (OOD) issue and extrapolation error (Fujimoto et al., 2019).

Numerous works have been proposed to overcome these issues. One branch of popular methods penalizes OOD state-action pairs or regularizes the trained policy toward the behavior policy (Fujimoto & Gu, 2021; Kumar et al., 2020). These methods must control the degree of regularization to balance pessimism against generalization, and are thus sensitive to the regularization level (Fujimoto & Gu, 2021). In addition, OOD constraints cannot guarantee the avoidance of extrapolation error (Kostrikov et al., 2022). Another branch eliminates extrapolation error entirely (Brandfonbrener et al., 2021; Kostrikov et al., 2022). These methods conduct in-sample learning by querying only the Q-values of actions in the dataset when forming the Bellman target. However, OneStep RL (Brandfonbrener et al., 2021) estimates the behavior policy's Q-value via SARSA (Sutton & Barto, 2018) and improves the policy by only a single step based on that Q-value function, which limits its potential to discover the optimal policy hidden in the dataset. IQL (Kostrikov et al., 2022) relies on expectile regression to perform implicit value iteration. It can be regarded as in-support Q-learning as the expectile approaches 1, but suffers from instability in that regime, so a suboptimal solution is obtained with a smaller expectile.
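To make the expectile-regression idea behind IQL concrete, the following is a minimal NumPy sketch, not the paper's implementation: the asymmetric L2 loss |τ − 1(u < 0)| u², together with a hypothetical gradient-descent helper `expectile` that finds the τ-expectile of a scalar sample. With τ = 0.5 the loss reduces to ordinary least squares (the mean); as τ → 1 positive residuals are weighted more heavily and the estimate moves toward the maximum, which mirrors how IQL approaches in-support Q-learning at large expectiles.

```python
import numpy as np

def expectile_loss(u, tau):
    """Asymmetric L2 loss: |tau - 1(u < 0)| * u**2.

    Negative residuals are down-weighted by (1 - tau), positive
    residuals up-weighted by tau, so minimizing the expected loss
    yields the tau-expectile rather than the mean.
    """
    weight = np.where(u < 0, 1.0 - tau, tau)
    return weight * u ** 2

def expectile(x, tau, iters=100, lr=0.5):
    """Find the tau-expectile of sample x by gradient descent on the loss.

    Illustrative scalar solver only; a critic would minimize the same
    loss over network parameters instead.
    """
    m = float(np.mean(x))  # tau = 0.5 solution is the mean
    for _ in range(iters):
        u = x - m
        # d/dm of mean(weight * u^2), with u = x - m
        grad = np.mean(np.where(u < 0, 1.0 - tau, tau) * (-2.0) * u)
        m -= lr * grad
    return m
```

As τ grows, the estimate interpolates from the mean toward the sample maximum, e.g. on `[0, 1, 2, 3, 4]` the 0.5-expectile is 2.0 while the 0.9-expectile rises above 3.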
Besides, both lines of work adapt the trained policy to the fixed dataset's distribution. This raises a question: can we introduce the concept of in-sample learning into iterative policy iteration, a commonly used paradigm for solving RL? General policy iteration cannot be updated in an in-sample style, since the trained policy will inevitably produce actions outside the dataset (out-of-sample) and provide overestimated Q-targets for policy evaluation. To enable in-sample learning, we first consider sampling the target action from the dataset and reweighting

