OFFLINE RL WITH NO OOD ACTIONS: IN-SAMPLE LEARNING VIA IMPLICIT VALUE REGULARIZATION

Abstract

Most offline reinforcement learning (RL) methods suffer from a trade-off between improving the policy to surpass the behavior policy and constraining the policy to limit its deviation from the behavior policy, since computing Q-values using out-of-distribution (OOD) actions suffers from errors caused by distributional shift. The recently proposed in-sample learning paradigm (i.e., IQL), which learns the value function via expectile regression using only data samples, shows great promise because it learns an optimal policy without querying the value function of any unseen actions. However, it remains unclear how this type of method handles the distributional shift in learning the value function. In this work, we make a key finding that the in-sample learning paradigm arises under the Implicit Value Regularization (IVR) framework. This gives a deeper understanding of why the in-sample learning paradigm works, i.e., it applies implicit value regularization to the policy. Based on the IVR framework, we further propose two practical algorithms, Sparse Q-learning (SQL) and Exponential Q-learning (EQL), which adopt the same value regularization used in existing works, but in a complete in-sample manner. Compared with IQL, we find that our algorithms introduce sparsity in learning the value function, making them more robust in noisy data regimes. We also verify the effectiveness of SQL and EQL on D4RL benchmark datasets and show the benefits of in-sample learning by comparing them with CQL in small data regimes. Code is available at https://github.com/ryanxhr/IVR.

* The core difference between in-sample learning and out-of-sample learning is that in-sample learning uses only dataset actions to learn the value function, while out-of-sample learning uses actions produced by the policy.

1. INTRODUCTION

Reinforcement learning (RL) is an increasingly important technology for developing highly capable AI systems; it has achieved great success in game-playing domains (Mnih et al., 2013; Silver et al., 2017). However, the fundamental online learning paradigm in RL is also one of the biggest obstacles to RL's widespread adoption, as interacting with the environment can be costly and dangerous in real-world settings. Offline RL, also known as batch RL, aims at solving the above-mentioned problem by learning effective policies solely from offline data, without any additional online interaction. It is a promising area for bringing RL into real-world domains, such as robotics (Kalashnikov et al., 2021), healthcare (Tang & Wiens, 2021) and industrial control (Zhan et al., 2022). In such scenarios, arbitrary exploration with untrained policies is costly or dangerous, but sufficient prior data is available. While most off-policy RL algorithms are applicable in the offline setting by filling the replay buffer with offline data, improving the policy beyond the level of the behavior policy entails querying the Q-function about values of actions produced by the policy, which are often not seen in the dataset. Those out-of-distribution actions can be viewed as adversarial examples of the Q-function, which cause extrapolation errors of the Q-function (Kumar et al., 2020). To alleviate this issue, prior model-free offline RL methods typically add pessimism to the learning objective, in order to be pessimistic about the distributional shift. Pessimism can be achieved by policy constraint, which constrains the policy to be close to the behavior policy (Kumar et al., 2019; Wu et al., 2019; Nair et al., 2020; Fujimoto & Gu, 2021), or by value regularization, which directly modifies the Q-function to be pessimistic (Kumar et al., 2020; Kostrikov et al., 2021a; An et al., 2021; Bai et al., 2021).
Nevertheless, this imposes a trade-off between accurate value estimation (more regularization) and maximum policy performance (less regularization). In this work, we find that the trade-off inherent in out-of-sample learning can be alleviated by performing implicit value regularization, which bypasses querying the value function of any unseen actions and allows learning an optimal policy in a fully in-sample manner*. More specifically, we propose the Implicit Value Regularization (IVR) framework, in which a general form of behavior regularizer is added to the policy learning objective. Because of the regularization, the optimal policy in the IVR framework has a closed-form solution, which can be expressed by imposing a weight on the behavior policy. The weight can be computed from a state-value function and an action-value function, where the state-value function serves as a normalization term that makes the optimal policy integrate to 1. It is usually intractable to find a closed form of the state-value function; however, through a subtle mathematical transformation we show that learning it is equivalent to solving a convex optimization problem. In this manner, both value functions can be learned from dataset samples alone.

Note that the recently proposed method IQL (Kostrikov et al., 2021b), although derived from a different view (i.e., approximating an upper expectile of dataset actions given a state), closely matches the learning paradigm of our framework. Furthermore, our IVR framework explains why learning the state-value function is important in IQL and gives a deeper understanding of how IQL handles distributional shift: it performs implicit value regularization, with the hyperparameter τ controlling the strength. This also explains a disturbing issue of IQL, namely that the role of τ does not match perfectly between theory and practice: in theory, τ should be close to 1 to obtain an optimal policy, while in practice a larger τ may give a worse result.
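To make the in-sample value-learning idea concrete, the toy sketch below contrasts IQL's expectile objective with a sparse objective of the kind a chi-square-style regularizer induces, both fit to the Q-values of dataset actions at a single state. The loss forms, function names, and constants here are illustrative assumptions for intuition only; the exact objectives are derived later in the paper.

```python
# Toy, single-state illustration: learn a scalar V from the Q-values of
# dataset actions only (in-sample).  Both objectives below are hedged
# sketches, not the paper's final losses.
import numpy as np

def expectile_loss(v, q_samples, tau=0.7):
    # IQL-style asymmetric squared error; as tau -> 1 the minimizer
    # approaches the maximum over in-sample Q-values.
    diff = q_samples - v
    weight = np.where(diff > 0, tau, 1 - tau)
    return np.mean(weight * diff ** 2)

def sparse_loss(v, q_samples, alpha=1.0):
    # Sparse (chi-square-style) objective: actions whose Q - v falls
    # below -2*alpha are clipped to zero and drop out of the gradient.
    residual = 1 + (q_samples - v) / (2 * alpha)
    return np.mean(np.maximum(residual, 0) ** 2 + v / (2 * alpha))

# Q-values of dataset actions at one state, with one "bad" action.
q = np.array([1.0, 1.2, 0.9, -5.0])
grid = np.linspace(-6, 3, 2001)
v_expectile = grid[np.argmin([expectile_loss(v, q) for v in grid])]
v_sparse = grid[np.argmin([sparse_loss(v, q) for v in grid])]
print(v_expectile, v_sparse)  # the sparse objective ignores the -5.0 outlier
```

The expectile minimizer is dragged down by the outlier action, while the sparse objective clips it away entirely, giving an intuition for why sparsity helps on noisy datasets.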
Based on the IVR framework, we further propose two practical algorithms. We find that the value regularization terms used in CQL (Kumar et al., 2020) and AWR (Peng et al., 2019) are two valid choices in our framework; applying them within our framework, however, yields two complete in-sample learning algorithms. The resulting algorithms bear similarities to IQL, but ours introduce sparsity in learning the state-value function, which is missing in IQL. The sparsity term filters out bad actions whose Q-values fall below a threshold, which is beneficial when the quality of the offline dataset is inferior. We verify the effectiveness of SQL on the widely used D4RL benchmark datasets and demonstrate state-of-the-art performance, especially on suboptimal datasets where value learning is necessary (e.g., AntMaze and Kitchen). We also show the benefits of sparsity in our algorithms by comparing with IQL in noisy data regimes, and the robustness of in-sample learning by comparing with CQL in small data regimes.

To summarize, the contributions of this paper are as follows:
• We propose a general implicit value regularization framework, where different behavior regularizers can be included, all leading to a complete in-sample learning paradigm.
• Based on the proposed framework, we design two effective offline RL algorithms: Sparse Q-learning (SQL) and Exponential Q-learning (EQL), which obtain SOTA results on benchmark datasets and show robustness in both noisy and small data regimes.
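As a toy illustration of the sparsity discussed above, the snippet below sketches policy extraction as weighted behavior cloning under two weights: a sparse weight that hard-filters actions whose advantage falls below a threshold, and an exponential (AWR-like) weight that is always positive. The function names, weight forms, and the value of alpha are illustrative assumptions, not the paper's exact objectives.

```python
# Hedged sketch: per-action weights for weighted behavior cloning.
import numpy as np

def sparse_weight(adv, alpha=1.0):
    # Becomes exactly zero once Q(s,a) - V(s) <= -2*alpha, so bad dataset
    # actions are filtered out of the cloning loss entirely.
    return np.maximum(1 + adv / (2 * alpha), 0)

def exp_weight(adv, alpha=1.0):
    # Exponential weight: strictly positive, never fully filters an action.
    return np.exp(adv / alpha)

adv = np.array([1.0, 0.0, -1.0, -4.0])   # toy advantages Q - V
print(sparse_weight(adv))  # last (worst) action gets weight 0
print(exp_weight(adv))     # all weights remain strictly positive
```

The hard zero in the sparse weight is the mechanism that discards low-quality actions, while the exponential weight only down-weights them.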

2. RELATED WORK

To tackle the distributional shift problem, most model-free offline RL methods augment existing off-policy methods (e.g., Q-learning or actor-critic) with a behavior regularization term. Behavior regularization can appear explicitly as divergence penalties (Wu et al., 2019; Kumar et al., 2019; Fujimoto & Gu, 2021), implicitly through weighted behavior cloning (Wang et al., 2020; Nair et al., 2020), or more directly through careful parameterization of the policy (Fujimoto et al., 2018; Zhou et al., 2020). Another way to apply behavior regularization is via modification of the critic learning objective to incorporate some form of regularization, to encourage staying near the behavioral distribution and being pessimistic about unknown state-action pairs (Nachum et al., 2019; Kumar et al., 2020; Kostrikov et al., 2021a; Xu et al., 2022c). There are also several works incorporating

