IN-SAMPLE ACTOR CRITIC FOR OFFLINE REINFORCEMENT LEARNING

Abstract

Offline reinforcement learning suffers from the out-of-distribution issue and extrapolation error. Most methods penalize out-of-distribution state-action pairs or regularize the trained policy towards the behavior policy, but cannot guarantee the elimination of extrapolation error. We propose In-sample Actor Critic (IAC), which utilizes sampling-importance resampling to execute in-sample policy evaluation. IAC uses only the target Q-values of actions in the dataset to evaluate the trained policy, thus avoiding extrapolation error. The proposed method performs unbiased policy evaluation and has lower variance than importance sampling in many cases. Empirical results show that IAC attains competitive performance compared to state-of-the-art methods on the Gym-MuJoCo locomotion domains and the much more challenging AntMaze domains.

1. INTRODUCTION

Reinforcement learning (RL) aims to solve sequential decision problems and has received extensive attention in recent years (Mnih et al., 2015). However, practical applications of RL face several challenges, such as risky attempts during exploration and a time-consuming data collection phase. Offline RL can tackle these issues without interacting with the environment: it avoids unsafe exploration and can tap into existing large-scale datasets (Gulcehre et al., 2020). However, offline RL suffers from the out-of-distribution (OOD) issue and extrapolation error (Fujimoto et al., 2019). Numerous works have been proposed to overcome these issues. One branch of popular methods penalizes OOD state-action pairs or regularizes the trained policy towards the behavior policy (Fujimoto & Gu, 2021; Kumar et al., 2020). These methods have to control the degree of regularization to balance pessimism against generalization, and are thus sensitive to the regularization level (Fujimoto & Gu, 2021). In addition, OOD constraints cannot guarantee the avoidance of extrapolation error (Kostrikov et al., 2022). Another branch chooses to eliminate extrapolation error entirely (Brandfonbrener et al., 2021; Kostrikov et al., 2022). These methods conduct in-sample learning by querying only the Q-values of actions in the dataset when formulating the Bellman target. However, OneStep RL (Brandfonbrener et al., 2021) estimates the behavior policy's Q-value via SARSA (Sutton & Barto, 2018) and performs only a single step of policy improvement on top of that Q-value function, which limits its potential to discover the optimal policy hidden in the dataset. IQL (Kostrikov et al., 2022) relies on expectile regression to perform implicit value iteration. It can be regarded as in-support Q-learning as the expectile approaches 1, but suffers from instability in that regime, so a suboptimal solution is obtained by using a smaller expectile.
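The distinction between in-sample and standard Bellman targets can be illustrated with a toy tabular sketch (hypothetical setup, not the paper's implementation): the in-sample target queries the Q-function only at the next action actually recorded in the dataset, SARSA-style, while the standard target maximizes over all actions, including unseen ones whose values may be arbitrarily overestimated.

```python
import numpy as np

# Toy tabular illustration of in-sample vs. standard Bellman targets.
rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
Q = rng.normal(size=(n_states, n_actions))  # current Q-table (arbitrary values)

# One transition (s, a, r, s', a') drawn from the dataset.
s, a, r, s_next, a_next = 0, 1, 1.0, 2, 0
gamma = 0.99

# Standard target: max over *all* actions at s', including out-of-sample ones
# whose Q-values may be overestimated (extrapolation error).
target_ood = r + gamma * Q[s_next].max()

# In-sample target: queries only the action a' recorded in the dataset.
target_in_sample = r + gamma * Q[s_next, a_next]

# The in-sample target can never exceed the max-based one.
assert target_in_sample <= target_ood
```

This is only meant to show where extrapolation error enters the target; the methods discussed above differ in how they trade off this conservatism against multi-step policy improvement.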
Moreover, both lines of work adapt the trained policy to the fixed dataset's distribution. A natural question then arises: can we introduce the concept of in-sample learning into iterative policy iteration, a commonly used paradigm for solving RL? General policy iteration cannot be updated in an in-sample style, since the trained policy will inevitably produce actions outside the dataset (out-of-sample) and thereby provide overestimated Q-targets for policy evaluation. To enable in-sample learning, we first consider sampling the target action from the dataset and reweighting the temporal-difference gradient via importance sampling. However, importance sampling is known to suffer from high variance (Precup et al., 2001), which would impair training. In this paper, we propose In-sample Actor Critic (IAC), which performs iterative policy iteration while following the principle of in-sample learning to eliminate extrapolation error. We resort to sampling-importance resampling (Rubin, 1988) to reduce variance and execute in-sample policy evaluation, which formulates the gradient as if it were sampled from the trained policy. To this end, we use a SumTree to sample according to the importance resampling weights. For policy improvement, we tap into advantage-weighted regression (Peng et al., 2019) to control the deviation from the behavior policy. The proposed method performs unbiased policy evaluation and has lower variance than importance sampling in many cases. We point out that, unlike previous methods, IAC dynamically adapts the dataset's distribution to match the trained policy during learning. We evaluate IAC on the D4RL benchmark (Fu et al., 2020), including the Gym-MuJoCo locomotion domains and the much more challenging AntMaze domains. The empirical results show the effectiveness of IAC.
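The core mechanism can be sketched in a simplified discrete setting (the names pi_probs and mu_probs are illustrative, not the paper's API): rather than weighting each TD gradient by the ratio pi/mu as in plain importance sampling, sampling-importance resampling redraws transitions with probability proportional to that ratio, so resampled minibatches look as if their actions were drawn from the trained policy.

```python
import numpy as np

# Minimal sketch of sampling-importance resampling (SIR) for in-sample
# policy evaluation over a discrete action space. Assumed setup, not the
# paper's implementation.
rng = np.random.default_rng(1)
n = 10_000

mu_probs = np.array([0.7, 0.2, 0.1])   # behavior policy over 3 actions
pi_probs = np.array([0.2, 0.3, 0.5])   # trained policy
actions = rng.choice(3, size=n, p=mu_probs)  # "dataset" actions ~ mu

# Importance ratios pi(a|s)/mu(a|s); in practice these are stored in a
# SumTree so that proportional sampling costs O(log n) per draw.
ratios = pi_probs[actions] / mu_probs[actions]
resample_p = ratios / ratios.sum()
idx = rng.choice(n, size=n, p=resample_p, replace=True)

# Empirical action distribution after resampling approximately matches pi:
# mu(a) * (pi(a)/mu(a)) is proportional to pi(a).
emp = np.bincount(actions[idx], minlength=3) / n
assert np.allclose(emp, pi_probs, atol=0.03)
```

The resampled batch is then used for ordinary (uniformly weighted) TD updates, which is what lets the gradient be treated as if its actions came from the trained policy.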

2. RELATED WORK

Offline RL. Offline RL, previously termed batch RL (Ernst et al., 2005; Riedmiller, 2005), learns a policy from a static dataset. It has received attention recently due to the extensive use of deep function approximators and the availability of large-scale datasets (Fujimoto et al., 2019; Ghasemipour et al., 2021). However, it suffers from extrapolation error due to OOD actions. Some works penalize the Q-values of OOD actions (Kumar et al., 2020; An et al., 2021). Other methods force the trained policy to stay close to the behavior policy via KL divergence (Wu et al., 2019), behavior cloning (Fujimoto & Gu, 2021), or Maximum Mean Discrepancy (MMD) (Kumar et al., 2019). These methods cannot eliminate extrapolation error and require a regularization hyperparameter to control the constraint level, balancing pessimism against generalization. Another branch refers only to the Q-values of in-sample actions when formulating the Bellman target, never querying the values of actions not contained in the dataset (Brandfonbrener et al., 2021; Kostrikov et al., 2022), and can thereby avoid extrapolation error. OneStep RL (Brandfonbrener et al., 2021) evaluates the behavior policy's Q-value function and conducts only one step of policy improvement without off-policy evaluation; however, it performs worse than multi-step counterparts when a large dataset with good coverage is available. IQL (Kostrikov et al., 2022) draws on expectile regression to approximate an upper expectile of the value distribution and executes multi-step dynamic programming updates. As the expectile approaches 1, it resembles in-support Q-learning in theory but suffers from instability in practice, so a suboptimal solution is obtained with a smaller expectile. Our proposed method opens up an avenue for in-sample iterative policy iteration: it avoids querying unseen actions and is unbiased for policy evaluation.
In practice, it modifies the sampling distribution to allow better computational efficiency. OptiDICE (Lee et al., 2021) also does not refer to out-of-sample actions; however, it involves complex min-max optimization and requires a normalization constraint to stabilize learning.

Importance sampling. Importance sampling has a long history of application in RL (Precup, 2000) owing to its unbiasedness and consistency (Kahn & Marshall, 1953). It suffers from high variance, especially for long-horizon tasks and high-dimensional spaces (Levine et al., 2020). Weighted importance sampling (Mahmood et al., 2014; Munos et al., 2016) and truncated importance sampling (Espeholt et al., 2018) have been developed to reduce variance. Recently, marginalized importance sampling has been proposed to mitigate the high variance of multiplied importance ratios in off-policy evaluation (Nachum et al., 2019; Liu et al., 2018). Sampling-importance resampling is an alternative strategy that samples data from the dataset according to the importance ratio (Rubin, 1988; Smith & Gelfand, 1992; Gordon et al., 1993). It has been applied in Sequential Monte Carlo sampling (Skare et al., 2003) and off-policy evaluation (Schlegel et al., 2019). To the best of our knowledge, our work is the first to draw on sampling-importance resampling to solve the extrapolation error problem in offline RL.
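The estimator variants above can be contrasted in a minimal sketch (the bandit-style setup and variable names are illustrative assumptions): ordinary importance sampling is unbiased but inflated by large ratios, the weighted (self-normalized) variant trades a small bias for lower variance, and truncation clips large ratios at the cost of a downward bias.

```python
import numpy as np

# Hedged sketch of three off-policy estimators of E_pi[f(a)] from samples
# drawn under a behavior policy mu (single-step, discrete-action setting).
rng = np.random.default_rng(2)
mu_p = np.array([0.8, 0.15, 0.05])   # behavior policy
pi_p = np.array([0.1, 0.2, 0.7])     # target policy
f = np.array([1.0, 2.0, 3.0])        # per-action payoff
true_value = float(pi_p @ f)         # exact E_pi[f(a)]

a = rng.choice(3, size=20_000, p=mu_p)
w = pi_p[a] / mu_p[a]                # importance ratios, up to 0.7/0.05 = 14

ordinary = np.mean(w * f[a])                     # unbiased, high variance
weighted = np.sum(w * f[a]) / np.sum(w)          # self-normalized, lower variance
truncated = np.mean(np.minimum(w, 2.0) * f[a])   # clipped ratios, biased down

assert abs(ordinary - true_value) < 0.5
assert truncated <= ordinary  # clipping can only shrink positive terms
```

Multi-step trajectories multiply such ratios across time steps, which is exactly the variance blow-up that marginalized and resampling-based approaches are designed to avoid.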

