ACQL: AN ADAPTIVE CONSERVATIVE Q-LEARNING FRAMEWORK FOR OFFLINE REINFORCEMENT LEARNING

Abstract

Offline Reinforcement Learning (RL), which relies only on static datasets without additional interaction with the environment, provides an appealing approach to learning a safe and effective control policy. Most existing offline RL methods do not account for relative data quality and only crudely constrain the distribution gap between the learned policy and the behavior policy. Moreover, these algorithms cannot adaptively control the conservative level at a fine-grained scale, e.g., for each state-action pair, leading to performance drops, especially on highly diverse datasets. In this paper, we propose an Adaptive Conservative Q-Learning (ACQL) framework that enables more flexible control over the conservative level of the Q-function for offline RL. Specifically, we present two adaptive weight functions that shape the Q-values of collected and out-of-distribution data. We then discuss the conditions under which the conservative level of the learned Q-function changes and define its monotonicity with respect to data quality and similarity. Motivated by this theoretical analysis, we propose a novel algorithm within the ACQL framework that instantiates the adaptive weight functions as neural networks. To learn proper adaptive weight functions, we design surrogate losses that incorporate the conditions for adjusting the conservative level, along with a contrastive loss that maintains the monotonicity of the adaptive weight functions. We evaluate ACQL on the commonly used D4RL benchmark and conduct extensive ablation studies, demonstrating its effectiveness and state-of-the-art performance compared to existing offline RL baselines.

1. INTRODUCTION

With the help of deep learning, Reinforcement Learning (RL) has achieved remarkable results on a variety of previously intractable problems, such as playing video games (Silver et al., 2016), controlling robots (Kalashnikov et al., 2018; Akkaya et al., 2019), and driving autonomous cars (Yu et al., 2020a; Zhao et al., 2022). However, the prerequisite that the agent interact with the environment makes the learning process costly and unsafe in many real-world scenarios. Recently, offline RL (Lange et al., 2012; Prudencio et al., 2022) has been proposed as a promising alternative that relaxes this requirement. In offline RL, the agent learns a control policy directly from a static dataset that was previously collected by an unknown behavior policy, enabling it to achieve comparable or even better performance without additional interaction with the environment.

Unfortunately, because it strips away the interaction available in online RL, offline RL is very challenging due to the distribution shift between the behavior policy and the learned policy. This shift often leads to overestimated values for out-of-distribution (OOD) actions (Kumar et al., 2019; Levine et al., 2020) and thus misleads the policy into choosing these erroneously estimated actions. To alleviate the distribution-shift problem, recent methods (Kumar et al., 2019; Jaques et al., 2019; Wu et al., 2019; Siegel et al., 2020) constrain the learned policy toward the behavior policy in different ways, such as limiting the action space (Fujimoto et al., 2019), using KL divergence (Wu et al., 2019), or using Maximum Mean Discrepancy (MMD) (Kumar et al., 2019). Besides directly constraining the policy, other methods (Kumar et al., 2020; Yu et al., 2021a;b; Ma et al., 2021) instead learn a conservative Q-function, which constrains the policy implicitly and thereby alleviates the overestimation problem.
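To make the conservative Q-function idea concrete, the sketch below shows a CQL-style conservatism gap, i.e., pushing down Q-values of OOD actions while pushing up Q-values of dataset actions, augmented with per-sample adaptive weights in the spirit of ACQL. This is an illustrative assumption about the objective's general shape, not the paper's exact loss; the function name `acql_penalty` and the weight arrays are hypothetical.

```python
import numpy as np

def acql_penalty(q_ood, q_data, w_ood, w_data):
    """CQL-style conservatism gap with per-sample adaptive weights.

    q_ood:  Q-values of actions sampled outside the dataset (pushed down)
    q_data: Q-values of dataset state-action pairs (pushed up)
    w_ood, w_data: adaptive weights; setting both to 1 recovers the
    standard (uniform) CQL gap term.  In ACQL these weights would be
    produced by learned neural networks; here they are plain arrays.
    """
    return float(np.mean(w_ood * q_ood) - np.mean(w_data * q_data))

q_ood = np.array([2.0, 3.0])   # toy Q-values for two OOD actions
q_data = np.array([1.0, 1.0])  # toy Q-values for two dataset actions

# Uniform weights: the ordinary CQL gap.
uniform = acql_penalty(q_ood, q_data, np.ones(2), np.ones(2))

# Down-weighting a presumed low-quality dataset transition (second entry)
# reduces how strongly its Q-value is pushed up, i.e., the agent stays
# more conservative on data it trusts less.
adaptive = acql_penalty(q_ood, q_data, np.ones(2), np.array([1.0, 0.2]))
```

The point of the adaptive weights is visible even in this toy example: lowering `w_data` for a particular transition increases the penalty associated with raising that transition's Q-value, so conservatism can vary per state-action pair rather than being a single global knob.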

